Analyzing Results

This guide covers all aspects of analyzing your benchmark results, from launching the interactive dashboard to programmatically parsing raw data.

Table of Contents

  • Overview
  • Understanding Output Structure
  • Interactive Dashboard
  • Command-Line Analysis
  • Metrics Deep Dive
  • Comparing Experiments
  • Exporting Data
  • Troubleshooting Analysis
  • Quick Reference

Overview

The analysis toolkit provides two primary ways to analyze benchmark results:

| Tool | Use Case | Access |
|------|----------|--------|
| Interactive Dashboard | Visual exploration, comparing runs, real-time filtering | make dashboard |
| Programmatic Analysis | Automation, custom analysis, CI/CD integration | Python API or raw JSON |

When to Use Each Tool

  • Dashboard: Ideal for exploring results, comparing configurations, identifying trends, and presenting findings to stakeholders

  • Command-Line/Python: Best for automated analysis pipelines, custom metrics, and integration with other tools


Understanding Output Structure

Directory Layout

Each benchmark run creates a directory under logs/ (or your configured output directory):
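
A representative layout, reconstructed from the file patterns listed under Key File Locations at the end of this guide (the run directory name 3667_... and the ISL/OSL values are illustrative):

```
logs/
└── 3667_.../                              # one directory per benchmark run (name is illustrative)
    ├── 3667.json                          # run metadata ({jobid}.json)
    ├── <profiler>_isl_1024_osl_1024/      # benchmark results for one ISL/OSL combination
    │   ├── concurrency_1.json
    │   ├── concurrency_8.json
    │   └── ...
    ├── *_prefill_*.err, *_decode_*.err    # worker logs with runtime metrics
    ├── *_config.json                      # per-node configuration snapshots
    └── .cache/
        └── *.parquet                      # cached parsed data used by the dashboard
```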

Directory Naming Convention

The run directory name encodes key information:

Metadata File ({jobid}.json)

The JSON metadata file is the source of truth for run configuration:
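
A minimal Python sketch for inspecting it; the field names read below (gpu_type, isl, osl, tags) are assumptions based on what the dashboard filters and tags on, not a documented schema:

```python
import json
from pathlib import Path

# Pick one run's {jobid}.json (path pattern from Key File Locations below).
meta_path = next(Path("logs").glob("*/[0-9]*.json"))
meta = json.loads(meta_path.read_text())

# Field names are illustrative assumptions, not the actual schema.
print(meta.get("gpu_type"))              # hardware used for the run
print(meta.get("isl"), meta.get("osl"))  # input/output sequence lengths
print(meta.get("tags", []))              # tags added in the dashboard persist here
```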

Benchmark Result Files

Each concurrency level produces a JSON file in the profiler results directory:
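
For example, listing them from the shell (the 3667_ run directory prefix is illustrative; the results directory follows the {profiler}_isl_{isl}_osl_{osl} pattern from Key File Locations):

```bash
ls logs/3667_*/*_isl_*_osl_*/concurrency_*.json
```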

Log Files

Worker log files contain runtime metrics:

  • .err files: Application logs with batch-level metrics

  • .out files: Standard output (often empty or contains startup info)

Example log line format:

Configuration Snapshots

*_config.json files capture the complete node configuration at runtime:

  • GPU information (count, type, memory, driver version)

  • Server arguments (TP/DP/PP size, attention backend, KV cache settings)

  • Environment variables (NCCL, CUDA, SGLANG settings)

  • Command-line arguments actually passed


Interactive Dashboard

Launching the Dashboard
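
From the repository root:

```bash
make dashboard
# or run the Streamlit app directly (the entry point referenced in Troubleshooting):
uv run streamlit run analysis/dashboard/app.py
```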

The dashboard opens at http://localhost:8501 by default.

Dashboard Configuration

On the left sidebar, you will see:

  1. Logs Directory Path: Set the path to your outputs directory (defaults to outputs/)

Run Selection

The sidebar provides powerful filtering options:

GPU Type Filter

Filter runs by GPU hardware (e.g., H100, A100, L40S). Useful when comparing across different hardware generations.

Topology Filter

Filter by worker configuration:

  • Disaggregated: 1P/4D, 2P/8D, etc.

  • Aggregated: 4A, 8A, etc.

ISL/OSL Filter

Filter by input/output sequence length combinations (e.g., 1024/1024, 2048/512).

Container Filter

Filter by container image version to compare software updates.

Tags Filter

Filter by custom tags you have assigned to runs. Tags help organize experiments:

  • baseline - Control runs

  • optimized - Runs with optimizations

  • production - Production-ready configurations

Dashboard Tabs

The dashboard has five main tabs:

1. Pareto Graph Tab

Purpose: Visualize the efficiency trade-off between throughput per GPU and throughput per user.

What You See:

  • X-axis: Output TPS/User - Token generation rate experienced by each user (1000/TPOT)

  • Y-axis: Output TPS/GPU or Total TPS/GPU - GPU utilization efficiency

Key Features:

  • Y-axis toggle: Switch between Output TPS/GPU (decode tokens only) and Total TPS/GPU (input + output)

  • TPS/User cutoff line: Add a vertical line to mark your target throughput requirement

  • Pareto Frontier: Highlight the efficient frontier where no other configuration is strictly better

Interpreting the Graph:

  • Points up and to the right are better (higher efficiency AND higher per-user throughput)

  • Points on the Pareto frontier represent optimal trade-offs

  • Use the cutoff line to identify configurations meeting your latency requirements

Metric Calculations:

$$
\text{Output TPS/GPU} = \frac{\text{Total Output Throughput (tokens/s)}}{\text{Total Number of GPUs}}
$$

$$
\text{Output TPS/User} = \frac{1000}{\text{Mean TPOT (ms)}}
$$

Data Export: Click "Download Data as CSV" to export all data points.

2. Latency Analysis Tab

Purpose: Analyze latency metrics across concurrency levels.

Graphs Displayed:

  1. TTFT (Time to First Token): Time from request submission to first output token

    • Critical for perceived responsiveness

    • Should remain stable under load

  2. TPOT (Time Per Output Token): Average time between consecutive output tokens

    • Determines streaming speed

    • Lower TPOT = faster generation

  3. ITL (Inter-Token Latency): Similar to TPOT but may include queueing delays

    • Useful for diagnosing scheduling issues

Summary Statistics: Table showing min/max values for each metric across selected runs.

3. Node Metrics Tab

Purpose: Deep dive into runtime behavior of individual workers.

Aggregation Modes:

  • Individual nodes: See every worker separately

  • Group by DP rank: Average metrics across tensor parallel workers within each data parallel group

  • Aggregate all nodes: Single averaged line per run

Prefill Node Metrics:

  • Input Throughput: Tokens/s being processed in prefill

  • Inflight Requests: Requests sent to decode workers awaiting completion

  • KV Cache Utilization: Memory pressure indicator

  • Queued Requests: Backpressure indicator

Decode Node Metrics:

  • Running Requests: Active generation requests

  • Generation Throughput: Output tokens/s

  • KV Cache Utilization: Memory pressure

  • Queued Requests: Decode capacity indicator

Disaggregation Metrics (Stacked or Separate views):

  • Prealloc Queue: Requests waiting for memory allocation

  • Transfer Queue: Requests waiting for KV cache transfer

  • Running: Requests actively generating

4. Rate Match Tab

Purpose: Verify prefill/decode capacity balance.

Interpretation:

  • Lines should align: System is balanced

  • Decode consistently below prefill: Need more decode nodes

  • Decode above prefill: Prefill is the bottleneck, decode underutilized

Toggle: Convert from tokens/s to requests/s using ISL/OSL for clearer comparison.
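
A sketch of that conversion, assuming each request contributes ISL prefill tokens and OSL decode tokens (illustrative, not the dashboard's exact implementation):

```python
def to_requests_per_second(prefill_tps: float, decode_tps: float,
                           isl: int, osl: int) -> tuple[float, float]:
    """Convert token throughput to request throughput for rate matching.

    Assumption: each request contributes `isl` prefill tokens and `osl`
    decode tokens, so requests/s = tokens/s divided by tokens per request.
    """
    prefill_rps = prefill_tps / isl
    decode_rps = decode_tps / osl
    return prefill_rps, decode_rps

# If prefill handles 80,000 tok/s at ISL=1024 and decode 10,000 tok/s at OSL=128,
# both sides serve ~78 requests/s, i.e. the system is roughly rate-matched.
print(to_requests_per_second(80_000, 10_000, 1024, 128))
```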

Note: This tab only applies to disaggregated runs (prefill/decode split). Aggregated runs are skipped.

5. Configuration Tab

Purpose: Review the exact configuration of each run.

Information Displayed:

  • Overview: Node count, GPU type, ISL/OSL, profiler type

  • Topology: Physical node assignments, service distribution

  • Node Config: Command-line arguments for each worker

  • Environment: Environment variables by category (NCCL, SGLANG, CUDA, etc.)

Managing Tags

Tags help organize and filter your experiments:

  1. Adding Tags: Expand a run in the sidebar Tags section, type a tag name, click "Add"

  2. Removing Tags: Click the "x" button next to any existing tag

  3. Filtering by Tags: Use the Tags filter in the Filters section

Tags are stored in the run's {jobid}.json file and persist across sessions.


Command-Line Analysis

Accessing Raw JSON Results

Browse directly to the profiler results:
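
For example (the 3667_ run directory prefix and file names are illustrative):

```bash
ls logs/3667_*/
cat logs/3667_*/*_isl_1024_osl_1024/concurrency_8.json
```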

Using jq for Analysis

Extract specific metrics across concurrency levels:
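
A sketch assuming each concurrency_*.json carries top-level throughput and latency fields; the key names used here (output_throughput, mean_ttft_ms) are assumptions, so run jq keys on one file first to see the real schema:

```bash
# Key names are assumptions - inspect a file with `jq keys` before relying on them.
for f in logs/3667_*/*_isl_*_osl_*/concurrency_*.json; do
  jq -r '[input_filename, .output_throughput, .mean_ttft_ms] | @tsv' "$f"
done
```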

Python API

For programmatic analysis, use the RunLoader class:
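
A minimal usage sketch; the import path and method names below are hypothetical placeholders, since the RunLoader interface is not reproduced here - consult the class itself for the real API:

```python
# Hypothetical sketch - import path and method names are placeholders, not the real API.
from analysis.loader import RunLoader   # assumption: the actual module path may differ

loader = RunLoader("logs/")       # point the loader at your output directory
runs = loader.load_runs()         # parse run metadata and benchmark results
df = runs.to_dataframe()          # flatten to one row per (run, concurrency level)
print(df.head())
```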

Pandas Analysis Examples
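
If you prefer to skip the loader entirely, the raw concurrency_*.json files can be pulled into a DataFrame directly; the metric key names below are assumptions and should be adjusted to your profiler's output:

```python
import json
import re
from pathlib import Path

import pandas as pd

rows = []
for path in Path("logs").glob("*/*_isl_*_osl_*/concurrency_*.json"):
    data = json.loads(path.read_text())
    concurrency = int(re.search(r"concurrency_(\d+)", path.name).group(1))
    rows.append({
        "run": path.parts[1],                     # run directory name
        "concurrency": concurrency,
        # Metric keys below are assumptions - adjust to your profiler's schema.
        "output_tps": data.get("output_throughput"),
        "mean_ttft_ms": data.get("mean_ttft_ms"),
        "mean_tpot_ms": data.get("mean_tpot_ms"),
    })

df = pd.DataFrame(rows).sort_values(["run", "concurrency"])

# Example: per-user throughput as defined in the Pareto tab (1000 / mean TPOT).
df["output_tps_per_user"] = 1000 / df["mean_tpot_ms"]
print(df)
```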


Metrics Deep Dive

Throughput Metrics

| Metric | Description | Unit |
|--------|-------------|------|
| Output TPS | Total output tokens generated per second across all users | tokens/s |
| Total TPS | Total tokens processed (input + output) per second | tokens/s |
| Request Throughput | Number of requests completed per second | requests/s |
| Request Goodput | Successful requests per second (excludes errors) | requests/s |
| Output TPS/GPU | Output TPS divided by total GPU count | tokens/s/GPU |
| Output TPS/User | Per-user generation rate (1000/TPOT) | tokens/s |

Latency Metrics

| Metric | Description | What It Tells You |
|--------|-------------|-------------------|
| TTFT | Time to First Token | User-perceived responsiveness |
| TPOT | Time Per Output Token | Streaming speed during generation |
| ITL | Inter-Token Latency | Token spacing (similar to TPOT) |
| E2EL | End-to-End Latency | Total request duration |

Understanding Percentiles

  • Mean: Average across all requests (sensitive to outliers)

  • Median (p50): Middle value (50% of requests faster, 50% slower)

  • p90: 90% of requests complete faster than this

  • p99: 99% of requests complete faster than this (tail latency)

  • Standard Deviation: Spread around the mean

Best Practices:

  • Use p99 for SLA commitments

  • Use median for typical user experience

  • Large gap between median and p99 indicates scheduling issues or resource contention
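
For example, with pandas over a handful of made-up per-request TTFT samples:

```python
import pandas as pd

# Hypothetical per-request TTFT samples in milliseconds.
ttft_ms = pd.Series([210, 230, 250, 260, 270, 290, 310, 420, 650, 1200])

print("mean:", ttft_ms.mean())
print("p50 :", ttft_ms.quantile(0.50))
print("p90 :", ttft_ms.quantile(0.90))
print("p99 :", ttft_ms.quantile(0.99))
print("std :", ttft_ms.std())
# A p99 far above the median (here ~1150 ms vs ~280 ms) points at tail latency
# from queueing or resource contention rather than a uniformly slow system.
```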

What "Good" Metrics Look Like

These are general guidelines; actual targets depend on your use case:

| Metric | Good | Acceptable | Concerning |
|--------|------|------------|------------|
| TTFT (p99) | < 500ms | 500-1000ms | > 1000ms |
| TPOT (mean) | < 30ms | 30-50ms | > 50ms |
| Output TPS/GPU | > 200 | 100-200 | < 100 |
| KV Cache Utilization | 40-80% | 20-90% | > 95% or < 10% |
| Queue Depth | 0-10 | 10-50 | > 50 (growing) |

Note: These vary significantly by:

  • Model size (larger models = slower)

  • Hardware (H100 vs A100 vs L40S)

  • Sequence lengths (longer = slower)

  • Batch sizes and concurrency


Comparing Experiments

Using Tags for Organization

Establish a tagging convention for your team:

A/B Comparison Patterns

Compare two configurations:

  1. Run both configurations with identical:

    • ISL/OSL settings

    • Concurrency levels

    • Hardware (if possible)

  2. Tag runs appropriately (e.g., configA, configB)

  3. In dashboard:

    • Filter to show only your tagged runs

    • Select both runs for side-by-side comparison

    • Use Pareto graph to see efficiency differences

Identify regressions:

Filtering by Parameters

In the dashboard sidebar:

  1. Use Topology filter to compare same worker ratios

  2. Use ISL/OSL filter to compare same workload profiles

  3. Use Container filter to compare software versions

  4. Use GPU Type filter to compare hardware


Exporting Data

CSV Export

From the dashboard Pareto tab, click "Download Data as CSV" to export:

  • All selected runs

  • All concurrency levels

  • All computed metrics (TPS, TPS/GPU, TPS/User, latencies)

JSON Export

Raw JSON is already available in the logs directory. To consolidate:
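
One way to do that is a short Python script that globs every result file into a single JSON document (the output filename is arbitrary):

```python
import json
from pathlib import Path

consolidated = []
for path in Path("logs").glob("*/*_isl_*_osl_*/concurrency_*.json"):
    record = json.loads(path.read_text())
    record["_source_file"] = str(path)   # keep provenance for later filtering
    consolidated.append(record)

with open("all_results.json", "w") as f:
    json.dump(consolidated, f, indent=2)
```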

Parquet Export (Cached Data)

The analysis system automatically caches parsed data as Parquet files:
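
They live under each run's .cache/ directory (see Key File Locations) and can be read back with pandas; the column schema is whatever the dashboard's parser produced, so inspect it before relying on specific names:

```python
from pathlib import Path

import pandas as pd

# Read whichever Parquet cache files exist; print their columns to see the schema.
for cache_file in Path("logs").glob("*/.cache/*.parquet"):
    df = pd.read_parquet(cache_file)
    print(cache_file, df.columns.tolist())
```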

Integration with Other Tools

Grafana/InfluxDB:

Jupyter Notebooks:


Troubleshooting Analysis

Dashboard Won't Load

Symptoms: Dashboard shows spinner indefinitely or errors on startup

Solutions:

  1. Check logs directory exists: ls -la logs/

  2. Verify at least one run has {jobid}.json: ls logs/*/*.json

  3. Check for Python errors: uv run streamlit run analysis/dashboard/app.py 2>&1

  4. Clear Streamlit cache: rm -rf ~/.streamlit/cache

Missing Runs in Dashboard

Symptoms: Some runs don't appear in the run selector

Causes and Solutions:

  1. No metadata file: Each run must have {jobid}.json

  2. No benchmark results: Runs without profiler output are skipped

  3. Profiling jobs: torch-profiler type runs are intentionally skipped

  4. Cache invalidation: Force reload by clicking "Sync Now" or restarting dashboard

Incomplete Run Warning

Symptoms: Dashboard shows "Job X is incomplete - Missing concurrencies: [128, 256]"

Causes:

  • Benchmark timed out before completing all concurrency levels

  • Job was cancelled mid-run

  • Profiler crashed at higher concurrencies

Solutions:

  1. Check SLURM logs for timeout or OOM errors

  2. Re-run with longer timeout

  3. Reduce max concurrency for resource-constrained setups

No Node Metrics Found

Symptoms: Node Metrics tab shows "No log files found"

Causes:

  • Log files don't match expected pattern

  • Logs were not captured (stderr redirect issue)

Solutions:

  1. Verify log file naming: ls logs/3667_*/*_prefill_*.err

  2. Check file contents: head logs/3667_*/*_prefill_*.err

  3. Verify log format matches expected patterns (see Log Files section)

Slow Dashboard Loading

Symptoms: Dashboard takes a long time to load or refresh

Causes:

  • Many runs to parse

  • Cache invalidation

  • Large log files

Solutions:

  1. Parquet caching speeds up subsequent loads automatically

  2. Delete old runs you no longer need

  3. Use filters to reduce the number of selected runs

  4. Only increase _cache_version in components.py when the parser has actually changed, since bumping it invalidates the cached data and forces a full re-parse

Incorrect Metrics

Symptoms: Metrics don't match expected values or show as "N/A"

Causes:

  • Benchmark output format changed

  • Incomplete benchmark run

  • Parse error in result files

Solutions (a sketch of each command follows the list):

  1. Verify raw JSON is valid:

  2. Check for required fields:

  3. Clear parquet cache and reload:
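
A sketch of those three checks from the shell; the 3667_ run directory prefix is illustrative, and the "required fields" depend on your profiler's output format:

```bash
# 1. Verify raw JSON is valid (jq exits non-zero on parse errors).
jq empty logs/3667_*/*_isl_*_osl_*/concurrency_*.json

# 2. Check for required fields - list the top-level keys of one result file.
jq 'keys' logs/3667_*/*_isl_*_osl_*/concurrency_1.json

# 3. Clear the Parquet cache and reload the dashboard.
rm -rf logs/3667_*/.cache
```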


Quick Reference

Launch Dashboard
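
Start the Streamlit dashboard:

```bash
make dashboard
```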

Python API Quick Start
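
A condensed version of the pandas approach shown under Command-Line Analysis (the resulting columns mirror whatever keys your profiler writes):

```python
import json
from pathlib import Path

import pandas as pd

# One row per concurrency result file across all runs.
rows = [
    {"file": str(p), **json.loads(p.read_text())}
    for p in Path("logs").glob("*/*_isl_*_osl_*/concurrency_*.json")
]
df = pd.DataFrame(rows)
print(df.head())
```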

Key File Locations

  • Run metadata: logs/{run_dir}/{jobid}.json

  • Benchmark results: logs/{run_dir}/{profiler}_isl_{isl}_osl_{osl}/concurrency_*.json

  • Worker logs: logs/{run_dir}/*_{prefill|decode}_*.err

  • Node configs: logs/{run_dir}/*_config.json

  • Cache files: logs/{run_dir}/.cache/*.parquet
