Profiling

srtctl supports two profiling backends for performance analysis: Torch Profiler and NVIDIA Nsight Systems (nsys).

Quick Start

Add a profiling section to your job YAML:
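A minimal sketch of such a section for a disaggregated job using the Torch backend. It assumes that the dotted parameter names listed under Configuration Options (prefill.start_step and so on) correspond to nested YAML keys, and that the backend is chosen with a type key, as the "type: torch or nsys" wording under How It Works suggests. Check the exact nesting against your srtctl version.

```yaml
# Sketch only: key nesting is assumed from the parameter names on this page.
profiling:
  type: torch          # one of: none, torch, nsys
  prefill:
    start_step: 0      # step at which prefill profiling begins
    stop_step: 50      # step at which prefill profiling ends
  decode:
    start_step: 0      # step at which decode profiling begins
    stop_step: 50      # step at which decode profiling ends
```

Omitting the section, or setting type to none, should leave the job on the default non-profiling path.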

Profiling Modes

| Mode | Description | Output |
|---|---|---|
| none | Default. No profiling; uses dynamo.sglang for serving | - |
| torch | PyTorch Profiler. Good for Python-level and CUDA kernel analysis | /logs/profiles/{mode}/ (Chrome trace format) |
| nsys | NVIDIA Nsight Systems. Low-overhead GPU profiling | /logs/profiles/{mode}/ (*.nsys-rep) |

Configuration Options

Top-level profiling section

Parameters

| Parameter | Description | Default |
|---|---|---|
| prefill.start_step | Step number to begin prefill profiling | 0 |
| prefill.stop_step | Step number to end prefill profiling | 50 |
| decode.start_step | Step number to begin decode profiling | 0 |
| decode.stop_step | Step number to end decode profiling | 50 |
| aggregated.start_step | Step number to begin aggregated profiling | 0 |
| aggregated.stop_step | Step number to end aggregated profiling | 50 |

Constraints

Profiling has specific requirements:

  1. Disaggregated mode: When profiling disaggregated workers, both profiling.prefill and profiling.decode must be set.

  2. Aggregated mode: When profiling aggregated workers, profiling.aggregated must be set (and profiling.prefill/profiling.decode must not be set).
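To make the second constraint concrete, here is a sketch of an aggregated job's profiling section. It assumes the dotted parameter names in the table above map to nested YAML keys and that the backend is selected with a type key; neither detail is spelled out on this page, so treat the layout as illustrative.

```yaml
# Aggregated workers: set profiling.aggregated only.
# Adding profiling.prefill or profiling.decode here would violate constraint 2.
profiling:
  type: torch
  aggregated:
    start_step: 0
    stop_step: 50
```

A disaggregated job is the mirror image: it sets both prefill and decode blocks and leaves aggregated out.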

How It Works

Normal Mode (type: none)

  • Uses dynamo.sglang module for serving

  • Standard disaggregated inference path

Profiling Mode (type: torch or nsys)

  • Uses sglang.launch_server module instead

  • The --disaggregation-mode flag is automatically skipped (not supported by launch_server)

  • The profiling script (/scripts/profiling/profile.sh) runs on leader nodes

  • The script sends requests via sglang.bench_serving to generate the profiling workload

nsys-specific behavior

When using nsys, workers are wrapped with:

Example Configurations
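One possible configuration, sketched under the assumption that the dotted parameter names map to nested YAML keys and that type selects the backend: an nsys run on disaggregated workers that skips the first few steps and captures a 50-step window. The step values are illustrative, not recommendations from srtctl.

```yaml
# Hypothetical example: nsys profiling of disaggregated workers.
profiling:
  type: nsys
  prefill:
    start_step: 10     # skip the first steps before capturing
    stop_step: 60      # 50-step capture window
  decode:
    start_step: 10
    stop_step: 60
```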

Output Files

After profiling completes, find results in the job's log directory:

Torch profiler traces example:

Nsight Systems (nsys) reports example:

Viewing Results

Torch Profiler traces:

  • Open in Chrome: chrome://tracing

  • Or use TensorBoard: tensorboard --logdir=logs/.../profiles/

Nsight Systems reports:

  • Open with NVIDIA Nsight Systems GUI

  • Or CLI: nsys stats logs/.../profiles/decode/<name>.nsys-rep

Troubleshooting

Validation errors about profiling sections

  • Disaggregated mode requires both profiling.prefill and profiling.decode to be set.

  • Aggregated mode requires profiling.aggregated to be set (and profiling.prefill/profiling.decode must not be set).

Empty profile output

Ensure the benchmark workload is generating requests during the profiling window, that is, between the configured start_step and stop_step.

Profile too short/long

Adjust start_step and stop_step to capture the desired range. A typical profiling run uses 30-100 steps.
