Introduction

srtctl is a command-line tool for running distributed LLM inference benchmarks on SLURM clusters. It replaces complex shell scripts and 50+ CLI flags with clean, declarative YAML configuration files.

Why srtctl?

Running large language models across multiple GPUs and nodes requires orchestrating many moving parts: SLURM job scripts, container mounts, SGLang configuration, worker coordination, and benchmark execution. Traditionally, this meant maintaining brittle bash scripts with hardcoded parameters.

srtctl solves this by providing:

Declarative configuration - Define your entire job in a single YAML file
Validation - Catch configuration errors before submitting to SLURM
Reproducibility - Every job saves its full configuration for later reference
Parameter sweeps - Run grid searches across configurations with a single command
Profiling support - Built-in torch/nsys profiling modes
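
To make the parameter-sweep idea concrete, here is a minimal sketch of how a grid search expands into individual job configurations. It is not srtctl's sweep implementation or its YAML syntax: the function name `expand_grid` and the keys `model`, `concurrency`, and `tp_size` are hypothetical, chosen only to illustrate the Cartesian-product expansion behind a grid search.

```python
from itertools import product


def expand_grid(base: dict, sweep: dict) -> list[dict]:
    """Return one config per combination of the swept values."""
    keys = list(sweep)
    configs = []
    for values in product(*(sweep[k] for k in keys)):
        config = dict(base)               # copy so each job gets its own config
        config.update(zip(keys, values))  # overlay this combination's values
        configs.append(config)
    return configs


if __name__ == "__main__":
    # Hypothetical base config and sweep axes, for illustration only.
    base = {"model": "example-model"}
    sweep = {"concurrency": [1, 8, 64], "tp_size": [4, 8]}
    for cfg in expand_grid(base, sweep):
        print(cfg)
```

Each swept key multiplies the number of jobs, so three concurrency values and two `tp_size` values yield six configurations in this sketch.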

How It Works

When you run srtctl apply -f config.yaml, the tool:

Validates your configuration against the schema
Resolves any aliases from your cluster config (srtslurm.yaml)
Generates a SLURM batch script and SGLang configuration files
Submits to SLURM
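
The steps above can be sketched in Python. This is an illustrative sketch under stated assumptions, not srtctl's actual code: the config keys (`name`, `container`, `nodes`), the alias table, the container path, and the generated `#SBATCH` lines are all hypothetical. Only the order of the steps comes from the description above.

```python
import shlex

# Hypothetical alias table, standing in for what srtslurm.yaml might define.
CLUSTER_ALIASES = {"containers": {"latest": "/shared/images/sglang-latest.sqsh"}}

# Assumed schema for this example only; the real schema is not shown here.
REQUIRED_KEYS = {"name", "container", "nodes"}


def validate(config: dict) -> None:
    """Step 1: reject configs with missing keys or bad values."""
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing required keys: {sorted(missing)}")
    if not isinstance(config["nodes"], int) or config["nodes"] < 1:
        raise ValueError("nodes must be a positive integer")


def resolve_aliases(config: dict, aliases: dict) -> dict:
    """Step 2: swap a container alias for the full path it maps to."""
    resolved = dict(config)
    containers = aliases["containers"]
    resolved["container"] = containers.get(config["container"], config["container"])
    return resolved


def render_sbatch(config: dict) -> str:
    """Step 3: produce the text of a SLURM batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={shlex.quote(config['name'])}",
        f"#SBATCH --nodes={config['nodes']}",
        f"echo launching workers from {shlex.quote(config['container'])}",
    ]
    return "\n".join(lines) + "\n"


def plan(config: dict) -> str:
    """Run steps 1-3; step 4 would hand the script to sbatch."""
    validate(config)
    return render_sbatch(resolve_aliases(config, CLUSTER_ALIASES))


if __name__ == "__main__":
    print(plan({"name": "demo", "container": "latest", "nodes": 2}))
```

Validating before any script is rendered is what lets configuration errors surface before a job ever enters the SLURM queue.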

Once SLURM allocates the nodes, workers launch inside containers, discover each other through etcd and NATS, and begin serving. If you've configured a benchmark, it runs automatically against the serving endpoint and saves its results to the log directory.

Commands

Command                                            Description
srtctl apply -f <config>                           Submit job(s) to SLURM
srtctl apply -f <config> --setup-script <script>   Submit with custom setup script
srtctl apply -f <config> --tags tag1,tag2          Submit with tags for filtering
srtctl dry-run -f <config>                         Validate and preview without submitting
srtctl validate -f <config>                        Alias for dry-run
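
If you drive srtctl from a script, a small helper can assemble these command lines. The sketch below uses only the subcommands and flags listed in the table; the helper names `build_apply_command` and `build_dry_run_command` and the file names in the usage example are hypothetical. It only builds the argument list and does not invoke srtctl.

```python
import shlex
from typing import Optional, Sequence


def build_apply_command(
    config_path: str,
    setup_script: Optional[str] = None,
    tags: Sequence[str] = (),
) -> list[str]:
    """Build the argv for `srtctl apply`, adding optional flags only when set."""
    cmd = ["srtctl", "apply", "-f", config_path]
    if setup_script:
        cmd += ["--setup-script", setup_script]
    if tags:
        cmd += ["--tags", ",".join(tags)]  # tags are passed comma-separated
    return cmd


def build_dry_run_command(config_path: str) -> list[str]:
    """Build the argv for `srtctl dry-run`, which validates without submitting."""
    return ["srtctl", "dry-run", "-f", config_path]


if __name__ == "__main__":
    # Hypothetical file and tag names, for illustration only.
    print(shlex.join(build_dry_run_command("config.yaml")))
    print(shlex.join(build_apply_command("config.yaml", tags=["tag1", "tag2"])))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where srtctl is installed; running the dry-run command first mirrors the validate-then-submit workflow.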

Next Steps

Installation - Set up srtctl and submit your first job
Monitoring - Understanding job logs and debugging
Parameter Sweeps - Run grid searches across configurations
Profiling - Performance analysis with torch/nsys
Analyzing Results - Dashboard and visualization
SGLang Router - Alternative to Dynamo for prefill/decode (PD) disaggregation


