Analyzing Results
This guide covers all aspects of analyzing your benchmark results, from launching the interactive dashboard to programmatically parsing raw data.
Overview
The analysis toolkit provides two primary ways to analyze benchmark results:
Interactive Dashboard: visual exploration, comparing runs, real-time filtering. Launch with make dashboard.
Programmatic Analysis: automation, custom analysis, CI/CD integration. Access via the Python API or raw JSON.
When to Use Each Tool
Dashboard: Ideal for exploring results, comparing configurations, identifying trends, and presenting findings to stakeholders
Command-Line/Python: Best for automated analysis pipelines, custom metrics, and integration with other tools
Understanding Output Structure
Directory Layout
Each benchmark run creates a directory under logs/ (or your configured output directory):
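The exact contents depend on your profiler and topology. As an illustrative sketch (the job ID, profiler name, and worker log names here are hypothetical, based on the file patterns listed under Key File Locations):

```
logs/
└── 3667_<run_dir_suffix>/                  # one directory per run
    ├── 3667.json                           # run metadata ({jobid}.json)
    ├── sglang_isl_1024_osl_1024/           # profiler results directory
    │   ├── concurrency_1.json
    │   ├── concurrency_8.json
    │   └── concurrency_64.json
    ├── node001_prefill_0.err               # worker logs (and matching .out files)
    ├── node002_decode_0.err
    ├── node001_config.json                 # configuration snapshot
    └── .cache/
        └── *.parquet                       # cached parsed data
```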
Directory Naming Convention
The run directory name encodes key information:
Metadata File ({jobid}.json)
The JSON metadata file is the source of truth for run configuration:
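The exact schema depends on the toolkit version; a hypothetical example containing the kinds of fields the dashboard filters on (GPU type, topology, ISL/OSL, container, tags) might look like:

```json
{
  "jobid": "3667",
  "gpu_type": "H100",
  "topology": "1P/4D",
  "isl": 1024,
  "osl": 1024,
  "container": "example-image:latest",
  "profiler": "sglang",
  "tags": ["baseline"]
}
```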
Benchmark Result Files
Each concurrency level produces a JSON file in the profiler results directory:
Log Files
Worker log files contain runtime metrics:
.err files: Application logs with batch-level metrics
.out files: Standard output (often empty or contains startup info)
Example log line format:
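The exact format depends on the serving framework and version. For SGLang workers, batch-level lines typically resemble the following (a representative sketch, not copied from a specific run):

```
[2025-01-15 10:23:45] Decode batch. #running-req: 34, #token: 17408, token usage: 0.04, gen throughput (token/s): 1342.51, #queue-req: 0
```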
Configuration Snapshots
*_config.json files capture the complete node configuration at runtime:
GPU information (count, type, memory, driver version)
Server arguments (TP/DP/PP size, attention backend, KV cache settings)
Environment variables (NCCL, CUDA, SGLANG settings)
Command-line arguments actually passed
Interactive Dashboard
Launching the Dashboard
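Using the commands referenced elsewhere in this guide:

```bash
make dashboard
# or invoke Streamlit directly:
uv run streamlit run analysis/dashboard/app.py
```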
The dashboard opens at http://localhost:8501 by default.
Dashboard Configuration
In the left sidebar, you will see:
Logs Directory Path: Set the path to your outputs directory (defaults to outputs/)
Run Selection
The sidebar provides powerful filtering options:
GPU Type Filter
Filter runs by GPU hardware (e.g., H100, A100, L40S). Useful when comparing across different hardware generations.
Topology Filter
Filter by worker configuration:
Disaggregated: 1P/4D, 2P/8D, etc.
Aggregated: 4A, 8A, etc.
ISL/OSL Filter
Filter by input/output sequence length combinations (e.g., 1024/1024, 2048/512).
Container Filter
Filter by container image version to compare software updates.
Tags Filter
Filter by custom tags you have assigned to runs. Tags help organize experiments:
baseline - Control runs
optimized - Runs with optimizations
production - Production-ready configurations
Dashboard Tabs
The dashboard has five main tabs:
1. Pareto Graph Tab
Purpose: Visualize the efficiency trade-off between throughput per GPU and throughput per user.
What You See:
X-axis: Output TPS/User - Token generation rate experienced by each user (1000/TPOT)
Y-axis: Output TPS/GPU or Total TPS/GPU - GPU utilization efficiency
Key Features:
Y-axis toggle: Switch between Output TPS/GPU (decode tokens only) and Total TPS/GPU (input + output)
TPS/User cutoff line: Add a vertical line to mark your target throughput requirement
Pareto Frontier: Highlight the efficient frontier where no other configuration is strictly better
Interpreting the Graph:
Points up and to the right are better (higher efficiency AND higher per-user throughput)
Points on the Pareto frontier represent optimal trade-offs
Use the cutoff line to identify configurations meeting your latency requirements
Metric Calculations:
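The plotted values are derived from the benchmark results roughly as follows (see Metrics Deep Dive below for the metric definitions):

```
Output TPS/GPU  = output tokens/s (all users)   / total GPU count
Total TPS/GPU   = (input + output tokens/s)     / total GPU count
Output TPS/User = 1000 / TPOT (ms)
```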
Data Export: Click "Download Data as CSV" to export all data points.
2. Latency Analysis Tab
Purpose: Analyze latency metrics across concurrency levels.
Graphs Displayed:
TTFT (Time to First Token): Time from request submission to first output token
Critical for perceived responsiveness
Should remain stable under load
TPOT (Time Per Output Token): Average time between consecutive output tokens
Determines streaming speed
Lower TPOT = faster generation
ITL (Inter-Token Latency): Similar to TPOT but may include queueing delays
Useful for diagnosing scheduling issues
Summary Statistics: Table showing min/max values for each metric across selected runs.
3. Node Metrics Tab
Purpose: Deep dive into runtime behavior of individual workers.
Aggregation Modes:
Individual nodes: See every worker separately
Group by DP rank: Average metrics across tensor parallel workers within each data parallel group
Aggregate all nodes: Single averaged line per run
Prefill Node Metrics:
Input Throughput: Tokens/s being processed in prefill
Inflight Requests: Requests sent to decode workers awaiting completion
KV Cache Utilization: Memory pressure indicator
Queued Requests: Backpressure indicator
Decode Node Metrics:
Running Requests: Active generation requests
Generation Throughput: Output tokens/s
KV Cache Utilization: Memory pressure
Queued Requests: Decode capacity indicator
Disaggregation Metrics (Stacked or Separate views):
Prealloc Queue: Requests waiting for memory allocation
Transfer Queue: Requests waiting for KV cache transfer
Running: Requests actively generating
4. Rate Match Tab
Purpose: Verify prefill/decode capacity balance.
Interpretation:
Lines should align: System is balanced
Decode consistently below prefill: Need more decode nodes
Decode above prefill: Prefill is the bottleneck, decode underutilized
Toggle: Convert from tokens/s to requests/s using ISL/OSL for clearer comparison.
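As a sketch of that conversion, where ISL and OSL are the configured input and output sequence lengths:

```
prefill requests/s ≈ prefill input tokens/s  / ISL
decode  requests/s ≈ decode output tokens/s  / OSL
```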
Note: This tab only applies to disaggregated runs (prefill/decode split). Aggregated runs are skipped.
5. Configuration Tab
Purpose: Review the exact configuration of each run.
Information Displayed:
Overview: Node count, GPU type, ISL/OSL, profiler type
Topology: Physical node assignments, service distribution
Node Config: Command-line arguments for each worker
Environment: Environment variables by category (NCCL, SGLANG, CUDA, etc.)
Managing Tags
Tags help organize and filter your experiments:
Adding Tags: Expand a run in the sidebar Tags section, type a tag name, click "Add"
Removing Tags: Click the "x" button next to any existing tag
Filtering by Tags: Use the Tags filter in the Filters section
Tags are stored in the run's {jobid}.json file and persist across sessions.
Command-Line Analysis
Accessing Raw JSON Results
Browse directly to the profiler results:
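For example, following the layout described under Key File Locations (the job ID and ISL/OSL values here are illustrative):

```bash
ls logs/3667_*/sglang_isl_1024_osl_1024/
head logs/3667_*/sglang_isl_1024_osl_1024/concurrency_64.json
```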
Using jq for Analysis
Extract specific metrics across concurrency levels:
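A hedged example; the field names (output_throughput, mean_ttft_ms) are illustrative and should be checked against your profiler's actual output schema:

```bash
for f in logs/3667_*/sglang_isl_1024_osl_1024/concurrency_*.json; do
  echo "$f: $(jq -r '[.output_throughput, .mean_ttft_ms] | @csv' "$f")"
done
```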
Python API
For programmatic analysis, use the RunLoader class:
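The constructor and method names below are assumptions for illustration only; check the analysis package for the real RunLoader interface:

```python
# Minimal sketch -- the import path, constructor, and methods are hypothetical.
from analysis.dashboard.loader import RunLoader  # hypothetical import path

loader = RunLoader("logs/")        # assumed to take the logs directory
runs = loader.load_runs()          # assumed to return one object per run

for run in runs:
    # Assumed attributes mirroring the metadata fields the dashboard filters on
    print(run.jobid, run.gpu_type, run.topology)
```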
Pandas Analysis Examples
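For example, working from the CSV exported by the Pareto tab (the column names are illustrative; match them to your export's header):

```python
import pandas as pd

# CSV downloaded from the dashboard's Pareto tab
df = pd.read_csv("pareto_export.csv")

# Hypothetical column names -- adjust to the actual export
best_per_run = (
    df.sort_values("output_tps_per_gpu", ascending=False)
      .groupby("run")
      .first()[["concurrency", "output_tps_per_gpu", "output_tps_per_user"]]
)
print(best_per_run)
```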
Metrics Deep Dive
Throughput Metrics
Output TPS: Total output tokens generated per second across all users (tokens/s)
Total TPS: Total tokens processed (input + output) per second (tokens/s)
Request Throughput: Number of requests completed per second (requests/s)
Request Goodput: Successful requests per second, excluding errors (requests/s)
Output TPS/GPU: Output TPS divided by total GPU count (tokens/s/GPU)
Output TPS/User: Per-user generation rate, 1000/TPOT (tokens/s)
Latency Metrics
TTFT (Time to First Token): User-perceived responsiveness
TPOT (Time Per Output Token): Streaming speed during generation
ITL (Inter-Token Latency): Token spacing (similar to TPOT)
E2EL (End-to-End Latency): Total request duration
Understanding Percentiles
Mean: Average across all requests (sensitive to outliers)
Median (p50): Middle value (50% of requests faster, 50% slower)
p90: 90% of requests complete faster than this
p99: 99% of requests complete faster than this (tail latency)
Standard Deviation: Spread around the mean
Best Practices:
Use p99 for SLA commitments
Use median for typical user experience
Large gap between median and p99 indicates scheduling issues or resource contention
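For example, computing these statistics from per-request TTFT samples (the values here are made up):

```python
import pandas as pd

ttft_ms = pd.Series([120, 135, 140, 150, 155, 160, 170, 400, 950])  # made-up samples
print(ttft_ms.mean(), ttft_ms.median(), ttft_ms.quantile(0.90), ttft_ms.quantile(0.99))
# A p99 far above the median, as here, is the kind of gap that points to
# scheduling issues or resource contention.
```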
What "Good" Metrics Look Like
These are general guidelines; actual targets depend on your use case:
TTFT (p99): good < 500 ms, acceptable 500-1000 ms, concerning > 1000 ms
TPOT (mean): good < 30 ms, acceptable 30-50 ms, concerning > 50 ms
Output TPS/GPU: good > 200, acceptable 100-200, concerning < 100
KV Cache Utilization: good 40-80%, acceptable 20-90%, concerning > 95% or < 10%
Queue Depth: good 0-10, acceptable 10-50, concerning > 50 (growing)
Note: These vary significantly by:
Model size (larger models = slower)
Hardware (H100 vs A100 vs L40S)
Sequence lengths (longer = slower)
Batch sizes and concurrency
Comparing Experiments
Using Tags for Organization
Establish a tagging convention for your team:
A/B Comparison Patterns
Compare two configurations:
Run both configurations with identical:
ISL/OSL settings
Concurrency levels
Hardware (if possible)
Tag runs appropriately (e.g., configA, configB)
In the dashboard:
Filter to show only your tagged runs
Select both runs for side-by-side comparison
Use Pareto graph to see efficiency differences
Identify regressions:
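One way to check for regressions programmatically, using the dashboard's CSV export (tag and column names are illustrative):

```python
import pandas as pd

df = pd.read_csv("pareto_export.csv")   # exported from the Pareto tab

# Compare the two tagged configurations at matching concurrency levels
a = df[df["tag"] == "configA"].set_index("concurrency")["output_tps_per_gpu"]
b = df[df["tag"] == "configB"].set_index("concurrency")["output_tps_per_gpu"]

delta_pct = ((b - a) / a * 100).round(1)
print(delta_pct)  # negative values mean configB regressed relative to configA
```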
Filtering by Parameters
In the dashboard sidebar:
Use Topology filter to compare same worker ratios
Use ISL/OSL filter to compare same workload profiles
Use Container filter to compare software versions
Use GPU Type filter to compare hardware
Exporting Data
CSV Export
From the dashboard Pareto tab, click "Download Data as CSV" to export:
All selected runs
All concurrency levels
All computed metrics (TPS, TPS/GPU, TPS/User, latencies)
JSON Export
Raw JSON is already available in the logs directory. To consolidate:
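For example, jq can merge the per-concurrency files into a single array (adjust the glob to your run):

```bash
jq -s '.' logs/3667_*/sglang_isl_1024_osl_1024/concurrency_*.json > consolidated_results.json
```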
Parquet Export (Cached Data)
The analysis system automatically caches parsed data as Parquet files:
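The cache lives under each run directory (see Key File Locations) and can be read directly with pandas; for example:

```python
import glob
import pandas as pd

# Cached, pre-parsed data written by the analysis system (schema depends on the parser version)
for path in glob.glob("logs/*/.cache/*.parquet"):
    df = pd.read_parquet(path)
    print(path, df.shape)
```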
Integration with Other Tools
Grafana/InfluxDB:
Jupyter Notebooks:
Troubleshooting Analysis
Dashboard Won't Load
Symptoms: Dashboard shows spinner indefinitely or errors on startup
Solutions:
Check that the logs directory exists: ls -la logs/
Verify at least one run has {jobid}.json: ls logs/*/*.json
Check for Python errors: uv run streamlit run analysis/dashboard/app.py 2>&1
Clear the Streamlit cache: rm -rf ~/.streamlit/cache
Missing Runs in Dashboard
Symptoms: Some runs don't appear in the run selector
Causes and Solutions:
No metadata file: Each run must have {jobid}.json
No benchmark results: Runs without profiler output are skipped
Profiling jobs: torch-profiler type runs are intentionally skipped
Cache invalidation: Force a reload by clicking "Sync Now" or restarting the dashboard
Incomplete Run Warning
Symptoms: Dashboard shows "Job X is incomplete - Missing concurrencies: [128, 256]"
Causes:
Benchmark timed out before completing all concurrency levels
Job was cancelled mid-run
Profiler crashed at higher concurrencies
Solutions:
Check SLURM logs for timeout or OOM errors
Re-run with longer timeout
Reduce max concurrency for resource-constrained setups
No Node Metrics Found
Symptoms: Node Metrics tab shows "No log files found"
Causes:
Log files don't match expected pattern
Logs were not captured (stderr redirect issue)
Solutions:
Verify log file naming: ls logs/3667_*/*_prefill_*.err
Check file contents: head logs/3667_*/*_prefill_*.err
Verify the log format matches the expected patterns (see the Log Files section)
Slow Dashboard Loading
Symptoms: Dashboard takes a long time to load or refresh
Causes:
Many runs to parse
Cache invalidation
Large log files
Solutions:
Parquet caching speeds up subsequent loads automatically
Delete old runs you no longer need
Use filters to reduce the number of selected runs
Increase _cache_version in components.py only when the parser changes
Incorrect Metrics
Symptoms: Metrics don't match expected values or show as "N/A"
Causes:
Benchmark output format changed
Incomplete benchmark run
Parse error in result files
Solutions:
Verify raw JSON is valid:
Check for required fields:
Clear parquet cache and reload:
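Example commands for these checks (globs and field names are illustrative):

```bash
# Verify raw JSON is valid (jq exits non-zero on parse errors)
jq empty logs/3667_*/sglang_isl_1024_osl_1024/concurrency_*.json

# Check for required fields
jq 'keys' logs/3667_*/sglang_isl_1024_osl_1024/concurrency_64.json

# Clear the Parquet cache, then restart the dashboard or click "Sync Now"
rm -rf logs/*/.cache/
```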
Quick Reference
Launch Dashboard
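Using the make target from the Overview:

```bash
make dashboard
```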
Python API Quick Start
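A minimal sketch mirroring the hypothetical RunLoader usage shown earlier (check the real API in the analysis package):

```python
from analysis.dashboard.loader import RunLoader  # hypothetical import path

runs = RunLoader("logs/").load_runs()            # assumed interface
```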
Key File Locations
Run metadata: logs/{run_dir}/{jobid}.json
Benchmark results: logs/{run_dir}/{profiler}_isl_{isl}_osl_{osl}/concurrency_*.json
Worker logs: logs/{run_dir}/*_{prefill|decode}_*.err
Node configs: logs/{run_dir}/*_config.json
Cache files: logs/{run_dir}/.cache/*.parquet