# Profiling
srtctl supports two profiling backends for performance analysis: Torch Profiler and NVIDIA Nsight Systems (nsys).
## Quick Start

Add a `profiling` section to your job YAML:
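A minimal sketch of such a section, assuming the key names listed in the Parameters table below (`type`, `isl`, `osl`, `concurrency`, and per-phase `start_step`/`stop_step`); check your srtctl version for the exact schema:

```yaml
profiling:
  type: torch        # none | torch | nsys
  isl: 1024          # input sequence length
  osl: 128           # output sequence length
  concurrency: 8     # concurrent requests (batch size)
  prefill:
    start_step: 0
    stop_step: 50
  decode:
    start_step: 0
    stop_step: 50
```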
## Profiling Modes

| Mode | Description | Output |
| --- | --- | --- |
| `none` | Default. No profiling; uses `dynamo.sglang` for serving | - |
| `torch` | PyTorch Profiler. Good for Python-level and CUDA kernel analysis | `/logs/profiles/{mode}/` (Chrome trace format) |
| `nsys` | NVIDIA Nsight Systems. Low-overhead GPU profiling | `/logs/profiles/{mode}_{rank}.nsys-rep` |
## Configuration Options

### Top-level `profiling` section

Traffic generator parameters (`isl`, `osl`, `concurrency`) are shared across all phases. Per-phase `start_step`/`stop_step` values allow different profiling windows for prefill vs. decode workers.
### Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| `isl` | Input sequence length for profiling requests | Required |
| `osl` | Output sequence length for profiling requests | Required |
| `concurrency` | Number of concurrent requests (batch size) | Required |
| `prefill.start_step` | Step number to begin prefill profiling | `0` |
| `prefill.stop_step` | Step number to end prefill profiling | `50` |
| `decode.start_step` | Step number to begin decode profiling | `0` |
| `decode.stop_step` | Step number to end decode profiling | `50` |
| `aggregated.start_step` | Step number to begin aggregated profiling | `0` |
| `aggregated.stop_step` | Step number to end aggregated profiling | `50` |
## Constraints

Profiling has specific requirements:

- **Single worker only**: Profiling requires exactly 1 prefill worker and 1 decode worker (or 1 aggregated worker)
- **No benchmarking**: Profiling and benchmarking are mutually exclusive
- **Automatic config dump disabled**: When profiling is enabled, `enable_config_dump` is automatically set to `false`
## How It Works

### Normal Mode (`type: none`)

- Uses the `dynamo.sglang` module for serving
- Standard disaggregated inference path

### Profiling Mode (`type: torch` or `type: nsys`)

- Uses the `sglang.launch_server` module instead
- The `--disaggregation-mode` flag is automatically skipped (it is not supported by `launch_server`)
- The profiling script (`/scripts/profiling/profile.sh`) runs on leader nodes
- Requests are sent via `sglang.bench_serving` to generate the profiling workload
### nsys-specific behavior

When using `nsys`, workers are wrapped with:
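The exact wrapper command is not reproduced here; as an illustrative sketch, an `nsys` invocation around a worker typically looks like the following (the flags and output path are assumptions, not srtctl's exact command):

```bash
# Illustrative sketch only; srtctl's actual wrapper flags may differ.
# --capture-range=cudaProfilerApi starts/stops capture when the worker
# calls the CUDA profiler API (tied to start_step/stop_step).
nsys profile \
  --trace=cuda,nvtx,osrt \
  --capture-range=cudaProfilerApi \
  --capture-range-end=stop \
  -o /logs/profiles/decode_0 \
  python -m sglang.launch_server <worker args>
```

This produces a per-rank `.nsys-rep` file matching the output path shown in the Profiling Modes table.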
## Example Configurations
### Torch Profiler (Recommended for Python analysis)
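A sketch of a torch-mode job section, using the key names from the Parameters table (values are examples only):

```yaml
profiling:
  type: torch
  isl: 2048
  osl: 256
  concurrency: 16
  decode:
    start_step: 10   # skip warm-up steps
    stop_step: 40    # 30-step window keeps the Chrome trace manageable
```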
### Nsight Systems (Recommended for GPU kernel analysis)
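A sketch of an nsys-mode job section, again assuming the Parameters-table key names (values are examples only):

```yaml
profiling:
  type: nsys
  isl: 2048
  osl: 256
  concurrency: 16
  prefill:
    start_step: 0
    stop_step: 30
  decode:
    start_step: 0
    stop_step: 30
```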
## Output Files

After profiling completes, find the results in the job's log directory:
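An illustrative layout, assuming the output paths from the Profiling Modes table (`{mode}` and `{rank}` are filled in per worker):

```
logs/<job>/profiles/
├── {mode}/                  # torch: Chrome trace files
└── {mode}_{rank}.nsys-rep   # nsys: one report per rank, e.g. decode_0.nsys-rep
```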
## Viewing Results

**Torch Profiler traces:**

- Open in Chrome: `chrome://tracing`
- Or use TensorBoard: `tensorboard --logdir=logs/.../profiles/`

**Nsight Systems reports:**

- Open with the NVIDIA Nsight Systems GUI
- Or use the CLI: `nsys stats logs/.../profiles/decode_0.nsys-rep`
## Troubleshooting

### "Profiling mode requires single worker only"

Reduce your worker counts to 1:
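For example (the worker-count key names below are illustrative; use whatever fields your job YAML already defines):

```yaml
prefill_workers: 1
decode_workers: 1
```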
"Cannot enable profiling with benchmark type"
Set benchmark to manual:
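For example, assuming the benchmark type lives under a top-level `benchmark` section (adjust to your schema):

```yaml
benchmark:
  type: manual
```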
### Empty profile output

Ensure `isl`, `osl`, and `concurrency` are set; they are required to generate the profiling workload.
### Profile too short/long

Adjust `start_step` and `stop_step` to capture the desired range. A typical profiling run uses 30-100 steps.
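For example, to widen the decode-phase capture window (keys as in the Parameters table):

```yaml
decode:
  start_step: 20
  stop_step: 100   # 80-step window
```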