# Configuration Reference

Complete reference for job configuration YAML files.
## Overview
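Each top-level key in a job configuration file corresponds to one of the sections documented below. A minimal illustrative sketch (all aliases and values are examples, not defaults):

```yaml
name: my-benchmark

model:
  path: deepseek-r1        # alias from srtslurm.yaml, or an absolute path
  container: sglang        # container alias, or a .sqsh path
  precision: fp8

resources:
  gpu_type: gb200
  gpus_per_node: 4
  prefill_nodes: 2
  decode_nodes: 2
  prefill_workers: 2
  decode_workers: 1

benchmark:
  type: manual
```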
## name

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Job name, used for identification and log prefixes |
## model

Model and container configuration.

| Field | Type | Required | Description |
|---|---|---|---|
| `path` | string | Yes | Model path alias (from `srtslurm.yaml`) or absolute path |
| `container` | string | Yes | Container alias (from `srtslurm.yaml`) or `.sqsh` path |
| `precision` | string | Yes | Model precision (informational: `fp4`, `fp8`, `fp16`, `bf16`) |
## resources

GPU allocation and worker topology.
### Disaggregated Mode (prefill + decode)
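A sketch of a disaggregated allocation (values illustrative):

```yaml
resources:
  gpu_type: gb200
  gpus_per_node: 4
  prefill_nodes: 2       # 2 nodes x 4 GPUs for prefill
  decode_nodes: 2
  prefill_workers: 2     # -> 4 GPUs per prefill worker
  decode_workers: 1      # -> 8 GPUs for the decode worker
```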
### Aggregated Mode (single worker type)
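And the aggregated counterpart (values illustrative):

```yaml
resources:
  gpu_type: h100
  gpus_per_node: 8
  agg_nodes: 1
  agg_workers: 2         # -> 4 GPUs per aggregated worker
```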
| Field | Type | Default | Description |
|---|---|---|---|
| `gpu_type` | string | - | GPU type: `"gb200"`, `"gb300"`, or `"h100"` |
| `gpus_per_node` | int | 4 | GPUs per node |
| `prefill_nodes` | int | null | Nodes dedicated to prefill |
| `decode_nodes` | int | null | Nodes dedicated to decode |
| `prefill_workers` | int | null | Number of prefill workers |
| `decode_workers` | int | null | Number of decode workers |
| `agg_nodes` | int | null | Nodes for aggregated mode |
| `agg_workers` | int | null | Number of aggregated workers |
Notes:

- Set `decode_nodes: 0` to have decode workers share nodes with prefill workers.
- Use either disaggregated mode (`prefill_nodes`/`decode_nodes`) or aggregated mode (`agg_nodes`), not both.
- GPUs per worker are computed automatically as `(nodes * gpus_per_node) / workers`. For example, `prefill_nodes: 2` with `gpus_per_node: 4` and `prefill_workers: 2` gives 4 GPUs per prefill worker.
### Computed Properties

The `ResourceConfig` provides several computed properties:

- `is_disaggregated`: True if using prefill/decode mode
- `total_nodes`: Total nodes allocated (prefill + decode, or agg)
- `num_prefill`, `num_decode`, `num_agg`: Worker counts for each role
- `gpus_per_prefill`, `gpus_per_decode`, `gpus_per_agg`: GPUs allocated per worker
- `prefill_gpus`, `decode_gpus`: Total GPUs for each role
## slurm

SLURM job settings.

| Field | Type | Default | Description |
|---|---|---|---|
| `time_limit` | string | from `srtslurm.yaml` | Job time limit (`HH:MM:SS`) |
| `account` | string | from `srtslurm.yaml` | SLURM account |
| `partition` | string | from `srtslurm.yaml` | SLURM partition |
## frontend

Frontend/router configuration.

| Field | Type | Default | Description |
|---|---|---|---|
| `type` | str | `dynamo` | Frontend type: `"dynamo"` or `"sglang"` |
| `enable_multiple_frontends` | bool | true | Scale with nginx + multiple routers |
| `num_additional_frontends` | int | 9 | Additional routers beyond the master |
| `args` | dict | null | CLI args for the frontend |
| `env` | dict | null | Env vars for frontend processes |
See SGLang Router for detailed architecture.
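An illustrative frontend block using the fields above (the env var name is hypothetical):

```yaml
frontend:
  type: dynamo
  enable_multiple_frontends: true
  num_additional_frontends: 9
  env:
    EXAMPLE_VAR: "1"    # hypothetical variable, for illustration only
```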
## backend

Worker configuration and SGLang settings.

| Field | Type | Default | Description |
|---|---|---|---|
| `type` | string | `sglang` | Backend type (currently only `"sglang"`) |
| `gpu_type` | string | null | GPU type override |
| `prefill_environment` | dict | `{}` | Environment variables for prefill workers |
| `decode_environment` | dict | `{}` | Environment variables for decode workers |
| `aggregated_environment` | dict | `{}` | Environment variables for aggregated workers |
| `sglang_config` | object | null | SGLang CLI configuration per mode |
| `kv_events_config` | bool/dict | null | KV events configuration |
### sglang_config

Per-mode SGLang server configuration. Any SGLang CLI flag can be specified, in either kebab-case or snake_case:
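A sketch of the shape, assuming the per-mode keys are `prefill`, `decode`, and `aggregated` (mirroring the environment and profiling sections); flag values are illustrative:

```yaml
backend:
  sglang_config:
    prefill:                        # assumed per-mode key
      tensor-parallel-size: 4
      mem-fraction-static: 0.85
      disaggregation-mode: prefill
    decode:                         # assumed per-mode key
      tensor-parallel-size: 8
      disaggregation-mode: decode
```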
Common flags:

| Flag | Type | Description |
|---|---|---|
| `tensor-parallel-size` | int | Tensor parallelism degree |
| `data-parallel-size` | int | Data parallelism degree |
| `expert-parallel-size` | int | Expert parallelism (MoE models) |
| `mem-fraction-static` | float | GPU memory fraction (0.0-1.0) |
| `kv-cache-dtype` | string | KV cache precision (`fp8_e4m3`, etc.) |
| `context-length` | int | Max context length |
| `chunked-prefill-size` | int | Chunked prefill batch size |
| `enable-dp-attention` | bool | Enable DP attention |
| `disaggregation-mode` | string | `"prefill"` or `"decode"` |
| `disaggregation-transfer-backend` | string | Transfer backend (`"nixl"` or other) |
| `served-model-name` | string | Model name for the API |
| `grpc-mode` | bool | Enable gRPC mode |
### kv_events_config

Note: KV events is a Dynamo frontend feature for KV-aware routing. It allows workers to publish cache/scheduling information over ZMQ for the Dynamo router to make intelligent routing decisions.

Enables `--kv-events-config` for workers, with auto-allocated ZMQ ports. Each worker leader gets a globally unique port starting at 5550:
| Worker | Port |
|---|---|
| `prefill_0` | 5550 |
| `prefill_1` | 5551 |
| `decode_0` | 5552 |
| `decode_1` | 5553 |
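The field type is `bool/dict`, so a dict form also exists; a minimal sketch using the assumed boolean form:

```yaml
backend:
  kv_events_config: true   # workers get --kv-events-config with ZMQ ports 5550, 5551, ...
```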
## benchmark

Benchmark configuration. The `type` field determines which benchmark runner is used and what additional fields are available.
### Available Benchmark Types

| Type | Description |
|---|---|
| `manual` | No benchmark (default); manual testing mode |
| `sa-bench` | Throughput/latency serving benchmark |
| `mmlu` | MMLU accuracy evaluation |
| `gpqa` | GPQA (graduate-level science QA) evaluation |
| `longbenchv2` | Long-context evaluation benchmark |
| `router` | Router performance with prefix caching |
| `mooncake-router` | KV-aware routing with Mooncake trace |
| `profiling` | Profiling benchmark (auto-selected) |
### manual

No benchmark is run. Use this mode for manual testing and debugging.
### sa-bench (Serving Accuracy)

Throughput and latency benchmark at various concurrency levels.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `isl` | int | Yes | - | Input sequence length |
| `osl` | int | Yes | - | Output sequence length |
| `concurrencies` | list/string | Yes | - | Concurrency levels (list or `"NxM"` format) |
| `req_rate` | string/int | No | `"inf"` | Request rate |
Concurrencies format: either a list (`[128, 256, 512]`) or an `x`-separated string (`"128x256x512"`).
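An illustrative sa-bench block:

```yaml
benchmark:
  type: sa-bench
  isl: 1024
  osl: 256
  concurrencies: "128x256x512"   # equivalent to [128, 256, 512]
  req_rate: "inf"
```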
### mmlu

MMLU accuracy evaluation using `sglang.test.run_eval`.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `num_examples` | int | No | 200 | Number of examples to run |
| `max_tokens` | int | No | 2048 | Max tokens per response |
| `repeat` | int | No | 8 | Number of repeats |
| `num_threads` | int | No | 512 | Concurrent threads |
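For example, spelling out the documented defaults:

```yaml
benchmark:
  type: mmlu
  num_examples: 200
  max_tokens: 2048
  repeat: 8
  num_threads: 512
```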
### gpqa

Graduate-level science QA evaluation using `sglang.test.run_eval`.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `num_examples` | int | No | 198 | Number of examples to run |
| `max_tokens` | int | No | 32768 | Max tokens per response |
| `repeat` | int | No | 8 | Number of repeats |
| `num_threads` | int | No | 128 | Concurrent threads |
### longbenchv2

Long-context evaluation benchmark.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `max_context_length` | int | No | 128000 | Max context length |
| `num_threads` | int | No | 16 | Concurrent threads |
| `max_tokens` | int | No | 16384 | Max tokens |
| `num_examples` | int | No | all | Number of examples |
| `categories` | list[str] | No | all | Task categories to run |
### router

Router performance benchmark with prefix caching. Requires `frontend.type: sglang`.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `isl` | int | No | 14000 | Input sequence length |
| `osl` | int | No | 200 | Output sequence length |
| `num_requests` | int | No | 200 | Number of requests |
| `concurrency` | int | No | 20 | Concurrency level |
| `prefix_ratios` | list/string | No | `"0.1 0.3 0.5 0.7 0.9"` | Prefix ratios to test |
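For example, with the documented defaults (remember that `frontend.type: sglang` is required):

```yaml
benchmark:
  type: router
  isl: 14000
  osl: 200
  num_requests: 200
  concurrency: 20
  prefix_ratios: "0.1 0.3 0.5 0.7 0.9"
```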
### mooncake-router

KV-aware routing benchmark using the Mooncake conversation trace.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `mooncake_workload` | string | No | `"conversation"` | Trace type (see options below) |
| `ttft_threshold_ms` | int | No | 2000 | Goodput TTFT threshold in ms |
| `itl_threshold_ms` | int | No | 25 | Goodput ITL threshold in ms |
Workload options: `"mooncake"`, `"conversation"`, `"synthetic"`, `"toolagent"`

Dataset characteristics (conversation trace):

- 12,031 requests over ~59 minutes (3.4 req/s)
- Average input: 12,035 tokens; average output: 343 tokens
- 36.64% cache efficiency potential
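For example, with the documented defaults:

```yaml
benchmark:
  type: mooncake-router
  mooncake_workload: conversation
  ttft_threshold_ms: 2000
  itl_threshold_ms: 25
```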
### profiling

Auto-selected when `profiling.type` is `"torch"` or `"nsys"`. Configuration lives in the `profiling` section, not here.
## dynamo

Dynamo installation configuration.

| Field | Type | Default | Description |
|---|---|---|---|
| `version` | string | `"0.7.0"` | PyPI version |
| `hash` | string | null | Git commit hash (source install) |
| `top_of_tree` | bool | false | Install from the main branch |
Notes:

- Only one of `version`, `hash`, or `top_of_tree` should be specified.
- `hash` and `top_of_tree` are mutually exclusive.
- When `hash` or `top_of_tree` is set, `version` is automatically cleared.
- Source installs (`hash` or `top_of_tree`) clone the repo and build with maturin.
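Each of the following YAML documents is valid on its own; per the notes above, the three options cannot be combined (the commit hash shown is illustrative):

```yaml
# PyPI install (the default)
dynamo:
  version: "0.7.0"
---
# Source install from a specific commit
dynamo:
  hash: "abc1234def"
---
# Source install from main
dynamo:
  top_of_tree: true
```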
## profiling

Profiling configuration for nsys or the torch profiler.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `type` | string | No | `"none"` | Profiling type: `"none"`, `"nsys"`, `"torch"` |
| `isl` | int | When enabled | null | Input sequence length for profiling |
| `osl` | int | When enabled | null | Output sequence length for profiling |
| `concurrency` | int | When enabled | null | Batch size / concurrency |
| `prefill` | object | Disaggregated | null | Prefill phase config |
| `decode` | object | Disaggregated | null | Decode phase config |
| `aggregated` | object | Aggregated | null | Aggregated phase config |
### ProfilingPhaseConfig

Each phase config has:

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `start_step` | int | No | null | Step to start profiling |
| `stop_step` | int | No | null | Step to stop profiling |
### Profiling Modes

- `nsys`: NVIDIA Nsight Systems profiling. Wraps the worker command with `nsys profile`.
- `torch`: PyTorch profiler. Sets the `SGLANG_TORCH_PROFILER_DIR` environment variable.
### Validation Rules

- When profiling is enabled (`type != "none"`), `isl`, `osl`, and `concurrency` are required.
- Disaggregated mode requires both `prefill` and `decode` phase configs.
- Aggregated mode requires the `aggregated` phase config.
- Profiling requires exactly 1 worker per role (1 prefill + 1 decode, or 1 aggregated).
### Example: Torch Profiling (Disaggregated)
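A sketch using the fields from this section (values illustrative):

```yaml
profiling:
  type: torch
  isl: 1024
  osl: 128
  concurrency: 16
  prefill:
    start_step: 5
    stop_step: 10
  decode:
    start_step: 5
    stop_step: 10

resources:
  prefill_nodes: 1
  decode_nodes: 1
  prefill_workers: 1    # profiling requires exactly 1 worker per role
  decode_workers: 1
```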
### Example: Nsys Profiling (Aggregated)
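The aggregated counterpart (values illustrative):

```yaml
profiling:
  type: nsys
  isl: 1024
  osl: 128
  concurrency: 8
  aggregated:
    start_step: 5
    stop_step: 10

resources:
  agg_nodes: 1
  agg_workers: 1
```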
## output

Output configuration with formattable paths.

| Field | Type | Default | Description |
|---|---|---|---|
| `log_dir` | FormattablePath | `"./outputs/{job_id}/logs"` | Directory for log files |

`log_dir` supports FormattablePath templating. See the FormattablePath Template System section below.
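For example:

```yaml
output:
  log_dir: "./outputs/{job_id}/logs"   # the documented default template
```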
## health_check

Health check configuration for worker readiness.

| Field | Type | Default | Description |
|---|---|---|---|
| `max_attempts` | int | 180 | Maximum health check attempts (180 = 30 minutes) |
| `interval_seconds` | int | 10 | Seconds between health check attempts |
Notes:

- The default of 180 attempts at 10-second intervals gives a total wait time of 30 minutes.
- Large models (e.g., 70B+ parameters) may require the full 30 minutes to load.
- Reduce `max_attempts` for smaller models or faster testing.
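For example, a tighter budget for a small model:

```yaml
health_check:
  max_attempts: 60        # 60 x 10 s = 10 minutes
  interval_seconds: 10
```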
## sweep

Parameter sweep configuration for running multiple benchmark variations.

| Field | Type | Default | Description |
|---|---|---|---|
| `mode` | string | `"zip"` | Sweep mode: `"zip"` or `"grid"` |
| `parameters` | dict | `{}` | Mapping of parameter name to list of values |
### Sweep Modes

**zip**: Pairs up parameters at matching indices. Parameters must have equal lengths.

Example: `isl=[512, 1024], osl=[128, 256]` produces 2 combinations:

- `{isl: 512, osl: 128}`
- `{isl: 1024, osl: 256}`

**grid**: Cartesian product of all parameter values.

Example: `isl=[512, 1024], osl=[128, 256]` produces 4 combinations:

- `{isl: 512, osl: 128}`
- `{isl: 512, osl: 256}`
- `{isl: 1024, osl: 128}`
- `{isl: 1024, osl: 256}`
### Using Sweep Parameters

Reference sweep parameters in your config using `{placeholder}` syntax:
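A minimal sketch (parameter names are user-chosen; values illustrative):

```yaml
sweep:
  mode: zip
  parameters:
    isl: [512, 1024]
    osl: [128, 256]

benchmark:
  type: sa-bench
  isl: "{isl}"          # filled in per sweep combination
  osl: "{osl}"
  concurrencies: [64]
```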
## FormattablePath Template System

FormattablePath is a templating system for paths that supports runtime placeholders and environment variable expansion.

### How It Works

FormattablePath ensures that configuration values containing placeholders are always explicitly formatted before use, preventing accidental use of unformatted templates.
### Available Placeholders

| Placeholder | Type | Description | Example |
|---|---|---|---|
| `{job_id}` | string | SLURM job ID | `"12345"` |
| `{run_name}` | string | Job name + job ID | `"my-benchmark_12345"` |
| `{head_node_ip}` | string | IP address of the head node | `"10.0.0.1"` |
| `{log_dir}` | string | Resolved log directory path | `"/home/user/outputs/12345/logs"` |
| `{model_path}` | string | Resolved model path | `"/models/deepseek-r1"` |
| `{container_image}` | string | Resolved container image path | `"/containers/sglang.sqsh"` |
| `{gpus_per_node}` | int | GPUs per node | `8` |
### Environment Variable Expansion

FormattablePath also expands environment variables using `$VAR` or `${VAR}` syntax:
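For example (the directory name is illustrative):

```yaml
output:
  log_dir: "${HOME}/srtctl-runs/{job_id}/logs"   # ${HOME} expands; {job_id} is filled at format time
```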
Common environment variables:

- `$HOME` - User home directory
- `$USER` - Username
- `$SLURM_JOB_ID` - SLURM job ID (also available as `{job_id}`)
### Extra Placeholders

Some contexts support additional placeholders:

| Placeholder | Context | Description |
|---|---|---|
| `{nginx_url}` | Frontend config | Nginx URL for load balancing |
| `{frontend_url}` | Frontend config | Frontend/router URL |
| `{index}` | Worker config | Worker index |
| `{host}` | Worker config | Worker host |
| `{port}` | Worker config | Worker port |
### Examples
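A few illustrative templates using the placeholders above:

```yaml
output:
  log_dir: "./outputs/{run_name}"        # -> ./outputs/my-benchmark_12345
  # env vars and placeholders can be combined:
  # log_dir: "$HOME/outputs/{job_id}/logs"
```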
## container_mounts

Custom container mount mappings with FormattablePath support.

| Key | Value | Meaning |
|---|---|---|
| FormattablePath | FormattablePath | Host path -> container mount path |

Both keys and values support FormattablePath templating with placeholders and environment variables.
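For example (host paths illustrative):

```yaml
container_mounts:
  "$HOME/datasets": /data                          # static host path
  "/shared/checkpoints/{run_name}": /checkpoints   # templated host path
```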
### Default Mounts

The following mounts are always added automatically:

| Mount | Container Path | Description |
|---|---|---|
| Model path | `/model` | Resolved model directory |
| Log directory | `/logs` | Log output directory |
| `configs/` directory | `/configs` | NATS, etcd binaries |
| Benchmark scripts | `/srtctl-benchmarks` | Bundled benchmark scripts |
## environment

Global environment variables for all worker processes.

| Key | Value | Description |
|---|---|---|
| string | string | Environment variable `name=value` |

Note: For per-worker-mode environment variables, use `backend.prefill_environment`, `backend.decode_environment`, or `backend.aggregated_environment`.
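For example (variable choices illustrative):

```yaml
environment:
  NCCL_DEBUG: INFO       # applies to all worker processes
  HF_HUB_OFFLINE: "1"
```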
## extra_mount

Additional container mounts as a list of mount specifications.

| Format | Description |
|---|---|
| `host_path:container_path` | Read-write mount |
| `host_path:container_path:ro` | Read-only mount |

Note: Unlike `container_mounts`, `extra_mount` uses a simple string format, not FormattablePath. Environment variables are still expanded.
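For example (paths illustrative):

```yaml
extra_mount:
  - /data/datasets:/datasets      # read-write
  - $HOME/tools:/tools:ro         # read-only; $HOME is still expanded
```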
## sbatch_directives

Additional SLURM sbatch directives.

| Directive | Example Value | Description |
|---|---|---|
| `mail-user` | `"user@example.com"` | Email for notifications |
| `mail-type` | `"END,FAIL"` | When to send email (BEGIN, END, FAIL) |
| `comment` | `"My job description"` | Job comment for tracking |
| `reservation` | `"my-reservation"` | Use a specific reservation |
| `constraint` | `"volta"` | Node feature constraint |
| `exclusive` | `""` | Exclusive node access (flag) |
| `gres` | `"gpu:8"` | Generic resource specification |
| `dependency` | `"afterok:12345"` | Job dependency |
| `qos` | `"high"` | Quality of service |

Format: Each directive becomes `#SBATCH --{key}={value}`, or `#SBATCH --{key}` if the value is empty.
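For example:

```yaml
sbatch_directives:
  mail-user: "user@example.com"   # -> #SBATCH --mail-user=user@example.com
  mail-type: "END,FAIL"
  exclusive: ""                   # -> #SBATCH --exclusive
```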
## srun_options

Additional srun options for worker processes.

| Option | Example Value | Description |
|---|---|---|
| `cpu-bind` | `"none"` | CPU binding mode (none, cores, sockets) |
| `mpi` | `"pmix"` | MPI implementation |
| `overlap` | `""` | Allow step overlap (flag) |
| `ntasks-per-node` | `"1"` | Tasks per node |
| `gpus-per-task` | `"1"` | GPUs per task |
| `mem` | `"0"` | Memory per node |

Format: Each option becomes `--{key}={value}`, or `--{key}` if the value is empty.
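For example:

```yaml
srun_options:
  cpu-bind: "none"    # -> --cpu-bind=none
  overlap: ""         # -> --overlap
```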
## setup_script

Run a custom script before dynamo installation and worker startup.

| Field | Type | Default | Description |
|---|---|---|---|
| `setup_script` | string | null | Script filename (must be in `configs/`) |
Notes:

- The script must be located in the `configs/` directory.
- The script runs inside the container, before dynamo installation.
- Useful for installing custom SGLang versions, additional dependencies, or patches.

Example setup script (`configs/install-sglang-main.sh`):
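An illustrative sketch of such a script; the install command and branch are assumptions, not the project's actual script:

```bash
#!/bin/bash
# configs/install-sglang-main.sh - illustrative sketch.
# Runs inside the container before dynamo installation.
set -euo pipefail

# Assumption: override the container's SGLang with the main branch.
pip install --upgrade "git+https://github.com/sgl-project/sglang.git@main#subdirectory=python"
```

It would be referenced from the job config as `setup_script: install-sglang-main.sh`.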
## enable_config_dump

Enable dumping the worker configuration to JSON for debugging.

| Field | Type | Default | Description |
|---|---|---|---|
| `enable_config_dump` | bool | true | Dump config JSON for debugging |

When enabled, worker startup commands include `--dump-config-to`, which writes the resolved configuration to a JSON file.
## Complete Examples
### Disaggregated Mode with Dynamo
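An illustrative sketch assembling the sections documented above (all aliases and values are examples):

```yaml
name: dsr1-disagg

model:
  path: deepseek-r1
  container: sglang
  precision: fp8

resources:
  gpu_type: gb200
  gpus_per_node: 4
  prefill_nodes: 2
  decode_nodes: 2
  prefill_workers: 2
  decode_workers: 1

frontend:
  type: dynamo
  enable_multiple_frontends: true

backend:
  type: sglang
  kv_events_config: true
  sglang_config:
    prefill:
      disaggregation-mode: prefill
    decode:
      disaggregation-mode: decode

benchmark:
  type: sa-bench
  isl: 1024
  osl: 256
  concurrencies: [128, 256, 512]
```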
### Aggregated Mode with SGLang Router
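Again an illustrative sketch, not a tested configuration:

```yaml
name: llama-agg-router

model:
  path: llama-70b
  container: sglang
  precision: bf16

resources:
  gpu_type: h100
  gpus_per_node: 8
  agg_nodes: 2
  agg_workers: 4

frontend:
  type: sglang          # required for the router benchmark

benchmark:
  type: router
  isl: 14000
  osl: 200
  concurrency: 20
```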
### Profiling Example
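A compact sketch (see also the per-mode examples in the profiling section; values illustrative):

```yaml
name: profile-run

model:
  path: deepseek-r1
  container: sglang
  precision: fp8

resources:
  gpu_type: gb200
  gpus_per_node: 4
  agg_nodes: 1
  agg_workers: 1        # profiling requires exactly one worker per role

profiling:
  type: torch
  isl: 1024
  osl: 128
  concurrency: 16
  aggregated:
    start_step: 5
    stop_step: 10
```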
### Parameter Sweep Example
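An illustrative grid sweep over the sa-bench sequence lengths:

```yaml
name: sweep-run

sweep:
  mode: grid
  parameters:
    isl: [512, 1024]
    osl: [128, 256]     # grid mode -> 4 combinations

benchmark:
  type: sa-bench
  isl: "{isl}"
  osl: "{osl}"
  concurrencies: [64]
```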
### Custom Mounts and Setup
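An illustrative sketch combining `setup_script`, `container_mounts`, `extra_mount`, and `environment` (paths and variables are examples):

```yaml
name: custom-setup

setup_script: install-sglang-main.sh   # must live in configs/

container_mounts:
  "$HOME/datasets": /data

extra_mount:
  - /shared/tools:/tools:ro

environment:
  HF_HUB_OFFLINE: "1"
```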