# Profiling

srtctl supports two profiling backends for performance analysis: Torch Profiler and NVIDIA Nsight Systems (nsys).

## Quick Start

Add a profiling section to your job YAML:
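For example, a minimal `profiling` section might look like this (values are illustrative; the exact YAML nesting is an assumption based on the parameter names documented below):

```yaml
profiling:
  type: torch      # or: nsys
  isl: 1024        # input sequence length
  osl: 128         # output sequence length
  concurrency: 8   # concurrent requests (batch size)
  decode:
    start_step: 10
    stop_step: 50
```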

## Profiling Modes

| Mode | Description | Output |
|------|-------------|--------|
| `none` | Default. No profiling; uses `dynamo.sglang` for serving | - |
| `torch` | PyTorch Profiler. Good for Python-level and CUDA kernel analysis | `/logs/profiles/{mode}/` (Chrome trace format) |
| `nsys` | NVIDIA Nsight Systems. Low-overhead GPU profiling | `/logs/profiles/{mode}_{rank}.nsys-rep` |

## Configuration Options

### Top-level `profiling` section

Traffic generator parameters (`isl`, `osl`, `concurrency`) are shared across all phases. Per-phase `start_step`/`stop_step` settings allow different profiling windows for prefill and decode workers.

### Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `isl` | Input sequence length for profiling requests | Required |
| `osl` | Output sequence length for profiling requests | Required |
| `concurrency` | Number of concurrent requests (batch size) | Required |
| `prefill.start_step` | Step number at which to begin prefill profiling | 0 |
| `prefill.stop_step` | Step number at which to end prefill profiling | 50 |
| `decode.start_step` | Step number at which to begin decode profiling | 0 |
| `decode.stop_step` | Step number at which to end decode profiling | 50 |
| `aggregated.start_step` | Step number at which to begin aggregated profiling | 0 |
| `aggregated.stop_step` | Step number at which to end aggregated profiling | 50 |

## Constraints

Profiling has specific requirements:

1. **Single worker only**: Profiling requires exactly 1 prefill worker and 1 decode worker (or 1 aggregated worker).
2. **No benchmarking**: Profiling and benchmarking are mutually exclusive.
3. **Config dump disabled**: When profiling is enabled, `enable_config_dump` is automatically set to `false`.

## How It Works

### Normal Mode (`type: none`)

- Uses the `dynamo.sglang` module for serving
- Standard disaggregated inference path

### Profiling Mode (`type: torch` or `nsys`)

- Uses the `sglang.launch_server` module instead
- The `--disaggregation-mode` flag is automatically skipped (not supported by `launch_server`)
- The profiling script (`/scripts/profiling/profile.sh`) runs on leader nodes
- Requests are sent via `sglang.bench_serving` to generate the profiling workload
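As a rough sketch, the workload-generation step resembles a `sglang.bench_serving` invocation like the one below. The exact flags used by `profile.sh` may differ; this is illustrative only, with `isl`/`osl`/`concurrency` mapped to the random-dataset options:

```
python -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --random-input-len 1024 \    # isl
  --random-output-len 128 \    # osl
  --max-concurrency 8          # concurrency
```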

### nsys-specific behavior

When using `nsys`, workers are wrapped with:
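A typical wrapper looks something like the following. The flags shown are common Nsight Systems usage, not necessarily the exact command srtctl emits:

```
nsys profile \
  --output /logs/profiles/decode_0 \   # matches the {mode}_{rank} output naming
  --trace cuda,nvtx \
  --capture-range cudaProfilerApi \    # start/stop driven by the profiled process
  --capture-range-end stop \
  python -m sglang.launch_server ...
```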

## Example Configurations
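For instance, a torch profile of the decode phase and an nsys profile of an aggregated worker might be configured as follows (values are illustrative; field names follow the Parameters table above, and the YAML nesting is an assumption):

```yaml
# Torch profiler: capture decode steps 10-60
profiling:
  type: torch
  isl: 2048
  osl: 256
  concurrency: 16
  decode:
    start_step: 10
    stop_step: 60
```

```yaml
# Nsight Systems: profile a single aggregated worker
profiling:
  type: nsys
  isl: 1024
  osl: 128
  concurrency: 8
  aggregated:
    start_step: 0
    stop_step: 50
```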

## Output Files

After profiling completes, find results in the job's log directory:
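Based on the output paths in the Profiling Modes table, the layout looks roughly like this (directory names are illustrative):

```
logs/.../profiles/
├── {mode}/                  # torch: Chrome trace files
└── {mode}_{rank}.nsys-rep   # nsys: one report per rank
```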

## Viewing Results

**Torch Profiler traces:**

- Open in Chrome: `chrome://tracing`
- Or use TensorBoard: `tensorboard --logdir=logs/.../profiles/`

**Nsight Systems reports:**

- Open with the NVIDIA Nsight Systems GUI
- Or via CLI: `nsys stats logs/.../profiles/decode_0.nsys-rep`

## Troubleshooting

### "Profiling mode requires single worker only"

Reduce your worker counts to 1:
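For example, assuming your job YAML sets worker counts with fields like these (field names are illustrative, not confirmed by this page):

```yaml
prefill_workers: 1
decode_workers: 1
```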

### "Cannot enable profiling with benchmark type"

Set the benchmark type to `manual`:
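For example, assuming a top-level `benchmark` section with a `type` field (nesting is an assumption):

```yaml
benchmark:
  type: manual
```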

### Empty profile output

Ensure `isl`, `osl`, and `concurrency` are set; they're required for the profiling workload.

### Profile too short/long

Adjust `start_step` and `stop_step` to capture the desired range. A typical profiling run uses 30-100 steps.
