# Configuration Reference

Complete reference for job configuration YAML files.
## Overview
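Each top-level key in a job configuration file corresponds to one of the sections documented below. A minimal illustrative sketch (all aliases and values are examples, not defaults):

```yaml
name: my-benchmark

model:
  path: deepseek-r1        # alias from srtslurm.yaml, or an absolute path
  container: sglang        # container alias, or a .sqsh path
  precision: fp8

resources:
  gpu_type: gb200
  gpus_per_node: 4
  prefill_nodes: 2
  decode_nodes: 2
  prefill_workers: 2
  decode_workers: 1

benchmark:
  type: manual
```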
## name

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Job name, used for identification and log prefixes |
## model

Model and container configuration.

| Field | Type | Required | Description |
|---|---|---|---|
| `path` | string | Yes | Model path alias (from `srtslurm.yaml`) or absolute path |
| `container` | string | Yes | Container alias (from `srtslurm.yaml`) or `.sqsh` path |
| `precision` | string | Yes | Model precision (informational: `fp4`, `fp8`, `fp16`, `bf16`) |
## resources

GPU allocation and worker topology.
### Disaggregated Mode (prefill + decode)
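A sketch of a disaggregated allocation (values illustrative):

```yaml
resources:
  gpu_type: gb200
  gpus_per_node: 4
  prefill_nodes: 2       # 2 nodes x 4 GPUs for prefill
  decode_nodes: 2
  prefill_workers: 2     # -> 4 GPUs per prefill worker
  decode_workers: 1      # -> 8 GPUs for the decode worker
```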
### Aggregated Mode (single worker type)
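And the aggregated counterpart (values illustrative):

```yaml
resources:
  gpu_type: h100
  gpus_per_node: 8
  agg_nodes: 1
  agg_workers: 2         # -> 4 GPUs per aggregated worker
```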
| Field | Type | Default | Description |
|---|---|---|---|
| `gpu_type` | string | - | GPU type: `"gb200"`, `"gb300"`, or `"h100"` |
| `gpus_per_node` | int | 4 | GPUs per node |
| `prefill_nodes` | int | null | Nodes dedicated to prefill |
| `decode_nodes` | int | null | Nodes dedicated to decode |
| `prefill_workers` | int | null | Number of prefill workers |
| `decode_workers` | int | null | Number of decode workers |
| `agg_nodes` | int | null | Nodes for aggregated mode |
| `agg_workers` | int | null | Number of aggregated workers |
Notes:

- Set `decode_nodes: 0` to have decode workers share nodes with prefill workers.
- Use either disaggregated mode (`prefill_nodes`/`decode_nodes`) or aggregated mode (`agg_nodes`), not both.
- GPUs per worker are computed automatically as `(nodes * gpus_per_node) / workers`. For example, `prefill_nodes: 2` with `gpus_per_node: 4` and `prefill_workers: 2` gives 4 GPUs per prefill worker.
### Computed Properties

The `ResourceConfig` provides several computed properties:

- `is_disaggregated`: True if using prefill/decode mode
- `total_nodes`: Total nodes allocated (prefill + decode, or agg)
- `num_prefill`, `num_decode`, `num_agg`: Worker counts for each role
- `gpus_per_prefill`, `gpus_per_decode`, `gpus_per_agg`: GPUs allocated per worker
- `prefill_gpus`, `decode_gpus`: Total GPUs for each role
## slurm

SLURM job settings.

| Field | Type | Default | Description |
|---|---|---|---|
| `time_limit` | string | from `srtslurm.yaml` | Job time limit (`HH:MM:SS`) |
| `account` | string | from `srtslurm.yaml` | SLURM account |
| `partition` | string | from `srtslurm.yaml` | SLURM partition |
## frontend

Frontend/router configuration.

| Field | Type | Default | Description |
|---|---|---|---|
| `type` | str | `dynamo` | Frontend type: `"dynamo"` or `"sglang"` |
| `enable_multiple_frontends` | bool | true | Scale with nginx + multiple routers |
| `num_additional_frontends` | int | 9 | Additional routers beyond the master |
| `args` | dict | null | CLI args for the frontend |
| `env` | dict | null | Env vars for frontend processes |
See SGLang Router for detailed architecture.
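An illustrative frontend block using the fields above (the env var name is hypothetical):

```yaml
frontend:
  type: dynamo
  enable_multiple_frontends: true
  num_additional_frontends: 9
  env:
    EXAMPLE_VAR: "1"    # hypothetical variable, for illustration only
```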
## backend

Worker configuration and SGLang settings.

| Field | Type | Default | Description |
|---|---|---|---|
| `type` | string | `sglang` | Backend type (currently only `"sglang"`) |
| `gpu_type` | string | null | GPU type override |
| `prefill_environment` | dict | `{}` | Environment variables for prefill workers |
| `decode_environment` | dict | `{}` | Environment variables for decode workers |
| `aggregated_environment` | dict | `{}` | Environment variables for aggregated workers |
| `sglang_config` | object | null | SGLang CLI configuration per mode |
| `kv_events_config` | bool/dict | null | KV events configuration |
### sglang_config

Per-mode SGLang server configuration. Any SGLang CLI flag can be specified, in either kebab-case or snake_case:
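A sketch of the shape, assuming the per-mode keys are `prefill`, `decode`, and `aggregated` (mirroring the environment and profiling sections); flag values are illustrative:

```yaml
backend:
  sglang_config:
    prefill:                        # assumed per-mode key
      tensor-parallel-size: 4
      mem-fraction-static: 0.85
      disaggregation-mode: prefill
    decode:                         # assumed per-mode key
      tensor-parallel-size: 8
      disaggregation-mode: decode
```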
Common flags:

| Flag | Type | Description |
|---|---|---|
| `tensor-parallel-size` | int | Tensor parallelism degree |
| `data-parallel-size` | int | Data parallelism degree |
| `expert-parallel-size` | int | Expert parallelism (MoE models) |
| `mem-fraction-static` | float | GPU memory fraction (0.0-1.0) |
| `kv-cache-dtype` | string | KV cache precision (`fp8_e4m3`, etc.) |
| `context-length` | int | Max context length |
| `chunked-prefill-size` | int | Chunked prefill batch size |
| `enable-dp-attention` | bool | Enable DP attention |
| `disaggregation-mode` | string | `"prefill"` or `"decode"` |
| `disaggregation-transfer-backend` | string | Transfer backend (`"nixl"` or other) |
| `served-model-name` | string | Model name for the API |
| `grpc-mode` | bool | Enable gRPC mode |
### kv_events_config

Note: KV events is a Dynamo frontend feature for KV-aware routing. It allows workers to publish cache/scheduling information over ZMQ for the Dynamo router to make intelligent routing decisions.

Enables `--kv-events-config` for workers, with auto-allocated ZMQ ports. Each worker leader gets a globally unique port starting at 5550:
| Worker | Port |
|---|---|
| `prefill_0` | 5550 |
| `prefill_1` | 5551 |
| `decode_0` | 5552 |
| `decode_1` | 5553 |
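The field type is `bool/dict`, so a dict form also exists; a minimal sketch using the assumed boolean form:

```yaml
backend:
  kv_events_config: true   # workers get --kv-events-config with ZMQ ports 5550, 5551, ...
```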
## benchmark

Benchmark configuration. The `type` field determines which benchmark runner is used and what additional fields are available.
### Available Benchmark Types

| Type | Description |
|---|---|
| `manual` | No benchmark (default); manual testing mode |
| `sa-bench` | Throughput/latency serving benchmark |
| `mmlu` | MMLU accuracy evaluation |
| `gpqa` | GPQA (graduate-level science QA) evaluation |
| `longbenchv2` | Long-context evaluation benchmark |
| `router` | Router performance with prefix caching |
| `mooncake-router` | KV-aware routing with Mooncake trace |
| `profiling` | Profiling benchmark (auto-selected) |
### manual

No benchmark is run. Use this mode for manual testing and debugging.
### sa-bench (Serving Accuracy)

Throughput and latency benchmark at various concurrency levels.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `isl` | int | Yes | - | Input sequence length |
| `osl` | int | Yes | - | Output sequence length |
| `concurrencies` | list/string | Yes | - | Concurrency levels (list or `"NxM"` format) |
| `req_rate` | string/int | No | `"inf"` | Request rate |
Concurrencies format: either a list (`[128, 256, 512]`) or an `x`-separated string (`"128x256x512"`).
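An illustrative sa-bench block:

```yaml
benchmark:
  type: sa-bench
  isl: 1024
  osl: 256
  concurrencies: "128x256x512"   # equivalent to [128, 256, 512]
  req_rate: "inf"
```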
### mmlu

MMLU accuracy evaluation using `sglang.test.run_eval`.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `num_examples` | int | No | 200 | Number of examples to run |
| `max_tokens` | int | No | 2048 | Max tokens per response |
| `repeat` | int | No | 8 | Number of repeats |
| `num_threads` | int | No | 512 | Concurrent threads |
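For example, spelling out the documented defaults:

```yaml
benchmark:
  type: mmlu
  num_examples: 200
  max_tokens: 2048
  repeat: 8
  num_threads: 512
```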
### gpqa

Graduate-level science QA evaluation using `sglang.test.run_eval`.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `num_examples` | int | No | 198 | Number of examples to run |
| `max_tokens` | int | No | 32768 | Max tokens per response |
| `repeat` | int | No | 8 | Number of repeats |
| `num_threads` | int | No | 128 | Concurrent threads |
### longbenchv2

Long-context evaluation benchmark.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `max_context_length` | int | No | 128000 | Max context length |
| `num_threads` | int | No | 16 | Concurrent threads |
| `max_tokens` | int | No | 16384 | Max tokens |
| `num_examples` | int | No | all | Number of examples |
| `categories` | list[str] | No | all | Task categories to run |
### router

Router performance benchmark with prefix caching. Requires `frontend.type: sglang`.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `isl` | int | No | 14000 | Input sequence length |
| `osl` | int | No | 200 | Output sequence length |
| `num_requests` | int | No | 200 | Number of requests |
| `concurrency` | int | No | 20 | Concurrency level |
| `prefix_ratios` | list/string | No | `"0.1 0.3 0.5 0.7 0.9"` | Prefix ratios to test |
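For example, with the documented defaults (remember that `frontend.type: sglang` is required):

```yaml
benchmark:
  type: router
  isl: 14000
  osl: 200
  num_requests: 200
  concurrency: 20
  prefix_ratios: "0.1 0.3 0.5 0.7 0.9"
```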
### mooncake-router

KV-aware routing benchmark using the Mooncake conversation trace.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `mooncake_workload` | string | No | `"conversation"` | Trace type (see options below) |
| `ttft_threshold_ms` | int | No | 2000 | Goodput TTFT threshold in ms |
| `itl_threshold_ms` | int | No | 25 | Goodput ITL threshold in ms |
Workload options: `"mooncake"`, `"conversation"`, `"synthetic"`, `"toolagent"`

Dataset characteristics (conversation trace):

- 12,031 requests over ~59 minutes (3.4 req/s)
- Average input: 12,035 tokens; average output: 343 tokens
- 36.64% cache efficiency potential
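For example, with the documented defaults:

```yaml
benchmark:
  type: mooncake-router
  mooncake_workload: conversation
  ttft_threshold_ms: 2000
  itl_threshold_ms: 25
```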
### profiling

Auto-selected when `profiling.type` is `"torch"` or `"nsys"`. Configuration lives in the `profiling` section, not here.
## dynamo

Dynamo installation configuration.

| Field | Type | Default | Description |
|---|---|---|---|
| `version` | string | `"0.7.0"` | PyPI version |
| `hash` | string | null | Git commit hash (source install) |
| `top_of_tree` | bool | false | Install from the main branch |
Notes:

- Only one of `version`, `hash`, or `top_of_tree` should be specified.
- `hash` and `top_of_tree` are mutually exclusive.
- When `hash` or `top_of_tree` is set, `version` is automatically cleared.
- Source installs (`hash` or `top_of_tree`) clone the repo and build with maturin.
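Each of the following YAML documents is valid on its own; per the notes above, the three options cannot be combined (the commit hash shown is illustrative):

```yaml
# PyPI install (the default)
dynamo:
  version: "0.7.0"
---
# Source install from a specific commit
dynamo:
  hash: "abc1234def"
---
# Source install from main
dynamo:
  top_of_tree: true
```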
## profiling

Profiling configuration for nsys or the torch profiler.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `type` | string | No | `"none"` | Profiling type: `"none"`, `"nsys"`, `"torch"` |
| `isl` | int | When enabled | null | Input sequence length for profiling |
| `osl` | int | When enabled | null | Output sequence length for profiling |
| `concurrency` | int | When enabled | null | Batch size / concurrency |
| `prefill` | object | Disaggregated | null | Prefill phase config |
| `decode` | object | Disaggregated | null | Decode phase config |
| `aggregated` | object | Aggregated | null | Aggregated phase config |
### ProfilingPhaseConfig

Each phase config has:

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `start_step` | int | No | null | Step to start profiling |
| `stop_step` | int | No | null | Step to stop profiling |
### Profiling Modes

- `nsys`: NVIDIA Nsight Systems profiling. Wraps the worker command with `nsys profile`.
- `torch`: PyTorch profiler. Sets the `SGLANG_TORCH_PROFILER_DIR` environment variable.
### Validation Rules

- When profiling is enabled (`type != "none"`), `isl`, `osl`, and `concurrency` are required.
- Disaggregated mode requires both `prefill` and `decode` phase configs.
- Aggregated mode requires the `aggregated` phase config.
- Profiling requires exactly 1 worker per role (1 prefill + 1 decode, or 1 aggregated).
### Example: Torch Profiling (Disaggregated)
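A sketch using the fields from this section (values illustrative):

```yaml
profiling:
  type: torch
  isl: 1024
  osl: 128
  concurrency: 16
  prefill:
    start_step: 5
    stop_step: 10
  decode:
    start_step: 5
    stop_step: 10

resources:
  prefill_nodes: 1
  decode_nodes: 1
  prefill_workers: 1    # profiling requires exactly 1 worker per role
  decode_workers: 1
```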
### Example: Nsys Profiling (Aggregated)
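The aggregated counterpart (values illustrative):

```yaml
profiling:
  type: nsys
  isl: 1024
  osl: 128
  concurrency: 8
  aggregated:
    start_step: 5
    stop_step: 10

resources:
  agg_nodes: 1
  agg_workers: 1
```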
## output

Output configuration with formattable paths.

| Field | Type | Default | Description |
|---|---|---|---|
| `log_dir` | FormattablePath | `"./outputs/{job_id}/logs"` | Directory for log files |

`log_dir` supports FormattablePath templating. See the FormattablePath Template System section below.
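For example:

```yaml
output:
  log_dir: "./outputs/{job_id}/logs"   # the documented default template
```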
## health_check

Health check configuration for worker readiness.

| Field | Type | Default | Description |
|---|---|---|---|
| `max_attempts` | int | 180 | Maximum health check attempts (180 = 30 minutes) |
| `interval_seconds` | int | 10 | Seconds between health check attempts |
Notes:

- The default of 180 attempts at 10-second intervals gives a total wait time of 30 minutes.
- Large models (e.g., 70B+ parameters) may require the full 30 minutes to load.
- Reduce `max_attempts` for smaller models or faster testing.
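For example, a tighter budget for a small model:

```yaml
health_check:
  max_attempts: 60        # 60 x 10 s = 10 minutes
  interval_seconds: 10
```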
## sweep

Parameter sweep configuration for running multiple benchmark variations.

| Field | Type | Default | Description |
|---|---|---|---|
| `mode` | string | `"zip"` | Sweep mode: `"zip"` or `"grid"` |
| `parameters` | dict | `{}` | Mapping of parameter name to list of values |
### Sweep Modes

**zip**: Pairs up parameters at matching indices. Parameters must have equal lengths.

Example: `isl=[512, 1024], osl=[128, 256]` produces 2 combinations:

- `{isl: 512, osl: 128}`
- `{isl: 1024, osl: 256}`

**grid**: Cartesian product of all parameter values.

Example: `isl=[512, 1024], osl=[128, 256]` produces 4 combinations:

- `{isl: 512, osl: 128}`
- `{isl: 512, osl: 256}`
- `{isl: 1024, osl: 128}`
- `{isl: 1024, osl: 256}`
### Using Sweep Parameters

Reference sweep parameters in your config using `{placeholder}` syntax:
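A minimal sketch (parameter names are user-chosen; values illustrative):

```yaml
sweep:
  mode: zip
  parameters:
    isl: [512, 1024]
    osl: [128, 256]

benchmark:
  type: sa-bench
  isl: "{isl}"          # filled in per sweep combination
  osl: "{osl}"
  concurrencies: [64]
```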
## FormattablePath Template System

FormattablePath is a templating system for paths that supports runtime placeholders and environment variable expansion.

### How It Works

FormattablePath ensures that configuration values containing placeholders are always explicitly formatted before use, preventing accidental use of unformatted templates.
### Available Placeholders

| Placeholder | Type | Description | Example |
|---|---|---|---|
| `{job_id}` | string | SLURM job ID | `"12345"` |
| `{run_name}` | string | Job name + job ID | `"my-benchmark_12345"` |
| `{head_node_ip}` | string | IP address of the head node | `"10.0.0.1"` |
| `{log_dir}` | string | Resolved log directory path | `"/home/user/outputs/12345/logs"` |
| `{model_path}` | string | Resolved model path | `"/models/deepseek-r1"` |
| `{container_image}` | string | Resolved container image path | `"/containers/sglang.sqsh"` |
| `{gpus_per_node}` | int | GPUs per node | `8` |
### Environment Variable Expansion

FormattablePath also expands environment variables using `$VAR` or `${VAR}` syntax:
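For example (the directory name is illustrative):

```yaml
output:
  log_dir: "${HOME}/srtctl-runs/{job_id}/logs"   # ${HOME} expands; {job_id} is filled at format time
```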
Common environment variables:

- `$HOME` - User home directory
- `$USER` - Username
- `$SLURM_JOB_ID` - SLURM job ID (also available as `{job_id}`)
### Extra Placeholders

Some contexts support additional placeholders:

| Placeholder | Context | Description |
|---|---|---|
| `{nginx_url}` | Frontend config | Nginx URL for load balancing |
| `{frontend_url}` | Frontend config | Frontend/router URL |
| `{index}` | Worker config | Worker index |
| `{host}` | Worker config | Worker host |
| `{port}` | Worker config | Worker port |
### Examples
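A few illustrative templates using the placeholders above:

```yaml
output:
  log_dir: "./outputs/{run_name}"        # -> ./outputs/my-benchmark_12345
  # env vars and placeholders can be combined:
  # log_dir: "$HOME/outputs/{job_id}/logs"
```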
## container_mounts

Custom container mount mappings with FormattablePath support.

| Key | Value | Meaning |
|---|---|---|
| FormattablePath | FormattablePath | Host path -> container mount path |

Both keys and values support FormattablePath templating with placeholders and environment variables.
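For example (host paths illustrative):

```yaml
container_mounts:
  "$HOME/datasets": /data                          # static host path
  "/shared/checkpoints/{run_name}": /checkpoints   # templated host path
```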
### Default Mounts

The following mounts are always added automatically:

| Mount | Container Path | Description |
|---|---|---|
| Model path | `/model` | Resolved model directory |
| Log directory | `/logs` | Log output directory |
| `configs/` directory | `/configs` | NATS, etcd binaries |
| Benchmark scripts | `/srtctl-benchmarks` | Bundled benchmark scripts |
## environment

Global environment variables for all worker processes.

| Key | Value | Description |
|---|---|---|
| string | string | Environment variable `name=value` |

Note: For per-worker-mode environment variables, use `backend.prefill_environment`, `backend.decode_environment`, or `backend.aggregated_environment`.
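For example (variable choices illustrative):

```yaml
environment:
  NCCL_DEBUG: INFO       # applies to all worker processes
  HF_HUB_OFFLINE: "1"
```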
## extra_mount

Additional container mounts as a list of mount specifications.

| Format | Description |
|---|---|
| `host_path:container_path` | Read-write mount |
| `host_path:container_path:ro` | Read-only mount |

Note: Unlike `container_mounts`, `extra_mount` uses a simple string format, not FormattablePath. Environment variables are still expanded.
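For example (paths illustrative):

```yaml
extra_mount:
  - /data/datasets:/datasets      # read-write
  - $HOME/tools:/tools:ro         # read-only; $HOME is still expanded
```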
## sbatch_directives

Additional SLURM sbatch directives.

| Directive | Example Value | Description |
|---|---|---|
| `mail-user` | `"user@example.com"` | Email for notifications |
| `mail-type` | `"END,FAIL"` | When to send email (BEGIN, END, FAIL) |
| `comment` | `"My job description"` | Job comment for tracking |
| `reservation` | `"my-reservation"` | Use a specific reservation |
| `constraint` | `"volta"` | Node feature constraint |
| `exclusive` | `""` | Exclusive node access (flag) |
| `gres` | `"gpu:8"` | Generic resource specification |
| `dependency` | `"afterok:12345"` | Job dependency |
| `qos` | `"high"` | Quality of service |

Format: Each directive becomes `#SBATCH --{key}={value}`, or `#SBATCH --{key}` if the value is empty.
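For example:

```yaml
sbatch_directives:
  mail-user: "user@example.com"   # -> #SBATCH --mail-user=user@example.com
  mail-type: "END,FAIL"
  exclusive: ""                   # -> #SBATCH --exclusive
```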
## srun_options

Additional srun options for worker processes.

| Option | Example Value | Description |
|---|---|---|
| `cpu-bind` | `"none"` | CPU binding mode (none, cores, sockets) |
| `mpi` | `"pmix"` | MPI implementation |
| `overlap` | `""` | Allow step overlap (flag) |
| `ntasks-per-node` | `"1"` | Tasks per node |
| `gpus-per-task` | `"1"` | GPUs per task |
| `mem` | `"0"` | Memory per node |

Format: Each option becomes `--{key}={value}`, or `--{key}` if the value is empty.
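For example:

```yaml
srun_options:
  cpu-bind: "none"    # -> --cpu-bind=none
  overlap: ""         # -> --overlap
```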
## setup_script

Run a custom script before dynamo installation and worker startup.

| Field | Type | Default | Description |
|---|---|---|---|
| `setup_script` | string | null | Script filename (must be in `configs/`) |
Notes:

- The script must be located in the `configs/` directory.
- The script runs inside the container, before dynamo installation.
- Useful for installing custom SGLang versions, additional dependencies, or patches.

Example setup script (`configs/install-sglang-main.sh`):
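An illustrative sketch of such a script; the install command and branch are assumptions, not the project's actual script:

```bash
#!/bin/bash
# configs/install-sglang-main.sh - illustrative sketch.
# Runs inside the container before dynamo installation.
set -euo pipefail

# Assumption: override the container's SGLang with the main branch.
pip install --upgrade "git+https://github.com/sgl-project/sglang.git@main#subdirectory=python"
```

It would be referenced from the job config as `setup_script: install-sglang-main.sh`.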
## enable_config_dump

Enable dumping the worker configuration to JSON for debugging.

| Field | Type | Default | Description |
|---|---|---|---|
| `enable_config_dump` | bool | true | Dump config JSON for debugging |

When enabled, worker startup commands include `--dump-config-to`, which writes the resolved configuration to a JSON file.
## Complete Examples
### Disaggregated Mode with Dynamo
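An illustrative sketch assembling the sections documented above (all aliases and values are examples):

```yaml
name: dsr1-disagg

model:
  path: deepseek-r1
  container: sglang
  precision: fp8

resources:
  gpu_type: gb200
  gpus_per_node: 4
  prefill_nodes: 2
  decode_nodes: 2
  prefill_workers: 2
  decode_workers: 1

frontend:
  type: dynamo
  enable_multiple_frontends: true

backend:
  type: sglang
  kv_events_config: true
  sglang_config:
    prefill:
      disaggregation-mode: prefill
    decode:
      disaggregation-mode: decode

benchmark:
  type: sa-bench
  isl: 1024
  osl: 256
  concurrencies: [128, 256, 512]
```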
### Aggregated Mode with SGLang Router
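Again an illustrative sketch, not a tested configuration:

```yaml
name: llama-agg-router

model:
  path: llama-70b
  container: sglang
  precision: bf16

resources:
  gpu_type: h100
  gpus_per_node: 8
  agg_nodes: 2
  agg_workers: 4

frontend:
  type: sglang          # required for the router benchmark

benchmark:
  type: router
  isl: 14000
  osl: 200
  concurrency: 20
```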
### Profiling Example
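A compact sketch (see also the per-mode examples in the profiling section; values illustrative):

```yaml
name: profile-run

model:
  path: deepseek-r1
  container: sglang
  precision: fp8

resources:
  gpu_type: gb200
  gpus_per_node: 4
  agg_nodes: 1
  agg_workers: 1        # profiling requires exactly one worker per role

profiling:
  type: torch
  isl: 1024
  osl: 128
  concurrency: 16
  aggregated:
    start_step: 5
    stop_step: 10
```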
### Parameter Sweep Example
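An illustrative grid sweep over the sa-bench sequence lengths:

```yaml
name: sweep-run

sweep:
  mode: grid
  parameters:
    isl: [512, 1024]
    osl: [128, 256]     # grid mode -> 4 combinations

benchmark:
  type: sa-bench
  isl: "{isl}"
  osl: "{osl}"
  concurrencies: [64]
```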
### Custom Mounts and Setup
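An illustrative sketch combining `setup_script`, `container_mounts`, `extra_mount`, and `environment` (paths and variables are examples):

```yaml
name: custom-setup

setup_script: install-sglang-main.sh   # must live in configs/

container_mounts:
  "$HOME/datasets": /data

extra_mount:
  - /shared/tools:/tools:ro

environment:
  HF_HUB_OFFLINE: "1"
```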