Installation
Prerequisites
Access to a SLURM cluster with GPU nodes
Python 3.10+
Container runtime (enroot/pyxis) configured on the cluster
Model weights accessible from compute nodes
SGLang container image (.sqsh format)
Clone and Install
Gather your cluster user and target partition
These commands might not work on all clusters, because SLURM accounting setups vary. If they fail, you can use an AI assistant to work out the equivalent commands for your cluster.
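As one possible starting point, the sketch below uses the standard SLURM client tools to list the accounts associated with your user and the partitions on the cluster. It assumes `sacctmgr` and `sinfo` are on your `PATH` (typically true on a login node). The function names are illustrative and are not part of srtctl.

```bash
# List SLURM accounts associated with the current user (requires sacctmgr).
list_my_accounts() {
  if ! command -v sacctmgr >/dev/null 2>&1; then
    echo "sacctmgr not found; run this on a cluster login node" >&2
    return 1
  fi
  # -n: no header, -P: parsable output, one account per line
  sacctmgr -n -P show associations user="$(id -un)" format=account | sort -u
}

# List partition names (requires sinfo). The default partition is marked with '*'.
list_partitions() {
  if ! command -v sinfo >/dev/null 2>&1; then
    echo "sinfo not found; run this on a cluster login node" >&2
    return 1
  fi
  # -h: no header, %P: partition name
  sinfo -h -o "%P" | sort -u
}

# Both calls are allowed to fail so the snippet is safe to paste anywhere.
list_my_accounts || true
list_partitions || true
```

If your site restricts `sacctmgr`, the output may be empty even though you have a valid account; in that case your cluster documentation is the better source.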
Run Setup
If you are deploying onto Grace-based systems (GH200, GB200, etc.), use the aarch64 architecture. Otherwise, use x86_64.
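If you are unsure which value applies, a small helper like the following maps the output of `uname -m` to the two architecture labels used in this guide. This is a sketch: the function name `setup_arch` is hypothetical, and how the value is passed to the setup step is not shown here. If your login and compute nodes differ in architecture, it is probably the compute nodes that matter, since that is presumably where the downloaded binaries run (an assumption, not something this guide states).

```bash
# Map a machine name (as reported by `uname -m`) to the architecture
# label used in this guide: aarch64 for Grace systems, x86_64 for x86 hosts.
setup_arch() {
  machine="${1:-$(uname -m)}"
  case "$machine" in
    aarch64|arm64) echo "aarch64" ;;
    x86_64|amd64)  echo "x86_64" ;;
    *) echo "unrecognized architecture: $machine" >&2; return 1 ;;
  esac
}

# With no argument, report the architecture of the current machine.
setup_arch || true
```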
The setup will:
Download NATS/ETCD binaries for your architecture
Prompt you for cluster settings:
SLURM account (default: restricted)
SLURM partition (default: batch)
GPUs per node (default: 4)
Time limit (default: 4:00:00)
Create srtslurm.yaml with your settings
Auto-detect and set srtctl_root path
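The defaults offered by the prompts may not match your cluster. One way to look up sensible values for GPUs per node and the time limit is to ask `sinfo` for each partition's generic resources and maximum job time. The sketch assumes `sinfo` is available; the function name is illustrative.

```bash
# Show partition, generic resources (GPUs), and max job time per partition.
show_partition_limits() {
  if ! command -v sinfo >/dev/null 2>&1; then
    echo "sinfo not found; run this on a cluster login node" >&2
    return 1
  fi
  # %P: partition, %G: generic resources (e.g. gpu:4), %l: max job time limit
  sinfo -o "%P %G %l"
}

show_partition_limits || true
```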
Configure srtslurm.yaml
After setup, edit srtslurm.yaml to add model paths, containers, and cluster-specific settings:
Adding Model Paths
The model_paths section maps short aliases to full filesystem paths:
Models must be accessible from all compute nodes (typically on a shared filesystem like Lustre or GPFS).
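Before submitting a job, it can help to confirm that a model path is actually visible where the workers will run. The helper below is a minimal sketch that assumes the model is stored as a directory; `check_model_path` and the placeholder path are hypothetical.

```bash
# Report whether a model directory exists and is readable on this node.
check_model_path() {
  path="$1"
  if [ -d "$path" ] && [ -r "$path" ]; then
    echo "ok: $path"
  else
    echo "missing or unreadable: $path" >&2
    return 1
  fi
}

# Placeholder path; substitute one of the full paths from model_paths.
check_model_path "/path/to/model" || true
```

Running this on a login node only proves the path exists there. To test from a compute node, the same check can be launched through `srun` with your account and partition, for example `srun --account=<account> --partition=<partition> --nodes=1 ls <model path>`.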
Adding Containers
The containers section maps version aliases to .sqsh container images:
To create a container image from Docker:
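With enroot, the `import` subcommand converts a Docker image into a squashfs file. The sketch below only prints the command so that you can review it first; the image reference and output filename are placeholders, not values taken from this guide, and `enroot_import_cmd` is a hypothetical helper.

```bash
# Build the enroot command that converts a Docker image into a .sqsh file.
# Usage: enroot_import_cmd <image[:tag]> <output.sqsh>
enroot_import_cmd() {
  image="$1"
  output="$2"
  printf 'enroot import --output %s docker://%s\n' "$output" "$image"
}

# Placeholder image and filename; substitute the SGLang image you need.
enroot_import_cmd "example-org/example-image:tag" "example-image.sqsh"
```

Once the printed command looks right, run it on a machine where enroot is installed. For registries other than Docker Hub, enroot expects the registry before a `#`, as in `docker://<registry>#<image>:<tag>`.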
Complete srtslurm.yaml Reference
Here's a complete example of all available options:
Create a Job Config
Create configs/my-job.yaml:
See Configuration Reference for all available options.
Submit the Job
Output:
Submit with Tags
You can tag runs for easier filtering in the dashboard:
Tags are saved in the job metadata and can be used to filter runs in analysis.
See Monitoring for how to monitor your job and understand the detailed log structure.
Custom Setup Scripts
You can run custom initialization scripts on worker nodes before starting SGLang workers. This is useful for:
Setting up custom environment variables
Installing additional dependencies
Checking out custom code
Creating a Setup Script
Create your setup script in the configs/ directory:
Make it executable:
Submit with the --setup-script flag:
The script will be executed on each worker node (prefill, decode, or aggregated) before installing Dynamo from PyPI and starting the SGLang workers. The script must be located in the configs/ directory, which is mounted into containers at /configs/.
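Because the script has to live under configs/ and be executable, a quick pre-flight check before submitting can catch the two most likely mistakes. This is a sketch of such a check, not something srtctl provides; `check_setup_script` and the example filename are hypothetical.

```bash
# Verify that a setup script is under configs/ and is executable.
check_setup_script() {
  script="$1"
  case "$script" in
    configs/*|*/configs/*) ;;
    *) echo "error: $script is not under configs/" >&2; return 1 ;;
  esac
  [ -f "$script" ] || { echo "error: $script not found" >&2; return 1; }
  [ -x "$script" ] || { echo "error: $script is not executable (run chmod +x)" >&2; return 1; }
  echo "ok: $script"
}

# Hypothetical script name; allowed to fail so the snippet is safe to paste.
check_setup_script "configs/my-setup.sh" || true
```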
Note: Setup scripts only run when you explicitly specify --setup-script. No default setup script will run if this flag is omitted.
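A setup script covering the three uses described in this section might look like the sketch below. Every variable name in it (`MY_EXPERIMENT_FLAG`, `EXTRA_PIP_PACKAGES`, `CUSTOM_CODE_DIR`, `CUSTOM_CODE_REF`) is hypothetical, and the install and checkout steps are skipped unless you set the corresponding variables. Whether exported variables persist into the SGLang worker process depends on how the script is invoked, which this guide does not specify.

```bash
#!/bin/bash
# configs/my-setup.sh (hypothetical name): runs on each worker node
# before the SGLang workers start.
set -eu

# 1. Custom environment variables (hypothetical name).
export MY_EXPERIMENT_FLAG=1

# 2. Additional dependencies: EXTRA_PIP_PACKAGES is a hypothetical,
#    space-separated list; nothing is installed when it is unset.
if [ -n "${EXTRA_PIP_PACKAGES:-}" ]; then
  # Intentionally unquoted so each package becomes its own argument.
  python3 -m pip install $EXTRA_PIP_PACKAGES
fi

# 3. Custom code: check out a ref in an existing clone, if one is configured.
if [ -n "${CUSTOM_CODE_DIR:-}" ] && [ -d "$CUSTOM_CODE_DIR/.git" ]; then
  git -C "$CUSTOM_CODE_DIR" checkout "${CUSTOM_CODE_REF:-main}"
fi

echo "setup finished on $(uname -n)"
```

Keeping each step behind a guard means the same script can be reused across jobs without editing it, at the cost of having to set the variables somewhere the worker environment can see them.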