Installation


Prerequisites

  • Access to a SLURM cluster with GPU nodes

  • Python 3.10+

  • Container runtime (enroot/pyxis) configured on the cluster

  • Model weights accessible from compute nodes

  • SGLang container image (.sqsh format)
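A few of these prerequisites can be checked quickly from a login node. The sketch below is illustrative and not part of the project's tooling: the helper name py_at_least_310 is hypothetical, and finding srun and enroot on the PATH does not prove that pyxis is configured or that model weights are reachable from compute nodes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if a version string such as "3.11.4" is >= 3.10.
py_at_least_310() {
  local major rest minor
  major="${1%%.*}"     # text before the first dot
  rest="${1#*.}"       # text after the first dot
  minor="${rest%%.*}"  # second version component
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 10 ]; }
}

# Report which of the expected tools are on PATH.
for tool in python3 srun enroot; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
  fi
done

# Check the interpreter version only if python3 is available.
if command -v python3 >/dev/null 2>&1; then
  py_version="$(python3 -c 'import platform; print(platform.python_version())')"
  if py_at_least_310 "$py_version"; then
    echo "Python $py_version satisfies the 3.10+ requirement"
  else
    echo "Python $py_version is older than 3.10"
  fi
fi
```

A tool reported as missing on the login node may still be available on compute nodes, so treat the output as a hint, not a verdict.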

Clone and Install

Gather your cluster user and target partition

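The exact commands for looking up your SLURM account and partition vary between clusters. If the usual ones fail on yours, check your site documentation or ask your cluster administrator; an AI assistant can also help you work out the right commands.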
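On many SLURM installations, account associations and partitions can be listed with sacctmgr and sinfo. The following is a minimal sketch under that assumption. It is not necessarily the command set this guide originally showed, and sites that restrict sacctmgr will need a different approach. The variable name cluster_user is illustrative.

```shell
#!/usr/bin/env bash
# Resolve the current user name, falling back to id if USER is unset.
cluster_user="${USER:-$(id -un)}"
echo "cluster user: $cluster_user"

# List the SLURM accounts (and any partition restrictions) tied to this user.
# -n drops the header, -P prints pipe-separated fields.
if command -v sacctmgr >/dev/null 2>&1; then
  sacctmgr -nP show associations user="$cluster_user" format=account,partition
else
  echo "sacctmgr not found; ask your administrator for your account name" >&2
fi

# Summarize the partitions visible to you, with node counts and time limits.
if command -v sinfo >/dev/null 2>&1; then
  sinfo --summarize
else
  echo "sinfo not found; this does not look like a SLURM login node" >&2
fi
```

The account and partition names printed here are the values the setup step asks for.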

Run Setup

If you are deploying to NVIDIA Grace systems (GH200, GB200, etc.), use the aarch64 architecture. Otherwise, use x86_64.
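To see which architecture a node reports, the output of uname -m can be mapped onto the two names above. This is a small sketch; the function name normalize_arch is hypothetical. Login nodes do not always match compute nodes, so the check is most useful when run on a compute node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a `uname -m` string to x86_64 or aarch64.
normalize_arch() {
  case "$1" in
    x86_64|amd64)  echo "x86_64" ;;
    aarch64|arm64) echo "aarch64" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Check the node this script runs on.
if arch="$(normalize_arch "$(uname -m)")"; then
  echo "this node is $arch"
else
  echo "this node matches neither setup architecture"
fi

# To check a compute node instead of the login node, something like:
#   srun -N1 --partition=<partition> uname -m
```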

The setup will:

  1. Download NATS/ETCD binaries for your architecture

  2. Prompt you for cluster settings:

    • SLURM account (default: restricted)

    • SLURM partition (default: batch)

    • GPUs per node (default: 4)

    • Time limit (default: 4:00:00)

  3. Create srtslurm.yaml with your settings

  4. Auto-detect and set srtctl_root path

Configure srtslurm.yaml

After setup, edit srtslurm.yaml to add model paths, containers, and cluster-specific settings:

Adding Model Paths

The model_paths section maps short aliases to full filesystem paths:

Models must be accessible from all compute nodes (typically on a shared filesystem like Lustre or GPFS).
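Before adding a path, it can help to confirm that the directory exists, is readable, and is not empty. The sketch below assumes nothing about the project's own validation. The function check_model_dir and the MODEL_DIR placeholder are hypothetical, and a check that passes on a login node only becomes meaningful once it is repeated on a compute node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the directory exists, is readable, and has content.
check_model_dir() {
  if [ -d "$1" ] && [ -r "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]; then
    echo "ok: $1"
  else
    echo "not usable: $1" >&2
    return 1
  fi
}

# Placeholder path; replace it with the directory you plan to list under model_paths.
MODEL_DIR="${MODEL_DIR:-/path/to/your/model}"
check_model_dir "$MODEL_DIR" || echo "fix the path before adding it to srtslurm.yaml"

# To repeat the check from a compute node, something like:
#   srun -N1 --partition=<partition> --account=<account> ls "$MODEL_DIR"
```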

Adding Containers

The containers section maps version aliases to .sqsh container images:

To create a container image from Docker:
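With enroot, a Docker image is imported into a .sqsh file through a docker:// URI of the form docker://REGISTRY#IMAGE:TAG. The sketch below only assembles and prints the command unless RUN_IMPORT=1 is set. The registry, image name, tag, and the helper enroot_docker_uri are placeholders chosen for illustration, not the project's official image.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build an enroot URI of the form docker://REGISTRY#IMAGE:TAG.
enroot_docker_uri() {
  printf 'docker://%s#%s:%s\n' "$1" "$2" "$3"
}

# Placeholder values; substitute the registry and image you actually use.
REGISTRY="registry.example.com"
IMAGE="my-team/sglang"
TAG="latest"
OUTPUT="sglang-${TAG}.sqsh"

uri="$(enroot_docker_uri "$REGISTRY" "$IMAGE" "$TAG")"
echo "enroot import --output $OUTPUT $uri"

# Run the import only when explicitly requested and enroot is installed.
if [ "${RUN_IMPORT:-0}" = "1" ] && command -v enroot >/dev/null 2>&1; then
  enroot import --output "$OUTPUT" "$uri"
fi
```

For images on Docker Hub, the registry part can be omitted, giving a URI such as docker://IMAGE:TAG. The resulting .sqsh file is what the containers section should point to.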

Complete srtslurm.yaml Reference

Here's a complete example of all available options:

Create a Job Config

Create configs/my-job.yaml:

See Configuration Reference for all available options.

Submit the Job

Output:

Submit with Tags

You can tag runs for easier filtering in the dashboard:

Tags are saved in the job metadata and can be used to filter runs in analysis.

See Monitoring for how to monitor your job and understand the detailed log structure.

Custom Setup Scripts

You can run custom initialization scripts on worker nodes before starting SGLang workers. This is useful for:

  • Setting up custom environment variables

  • Installing additional dependencies

  • Checking out custom code

Creating a Setup Script

  1. Create your setup script in the configs/ directory:

  2. Make it executable:

  3. Submit with the --setup-script flag:

The script runs on each worker node (prefill, decode, or aggregated) before Dynamo is installed from PyPI and the SGLang workers start. It must be located in the configs/ directory, which is mounted into containers at /configs/.

Note: Setup scripts only run when you explicitly specify --setup-script. No default setup script will run if this flag is omitted.
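The first two steps can be sketched as follows. This is a hedged example: the script name custom-setup.sh and its contents are hypothetical, and a temporary directory stands in for the repository root so the snippet can be run anywhere. The submit step is left out because it depends on your job config; the only detail taken from this guide is that the script is passed with --setup-script.

```shell
#!/usr/bin/env bash
# A temporary directory stands in for the repository root in this sketch.
workdir="$(mktemp -d)"
mkdir -p "$workdir/configs"

# Step 1: create the setup script inside configs/.
cat > "$workdir/configs/custom-setup.sh" <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
echo "custom setup running on $(uname -n)"
# Hypothetical examples of what a real script might do:
#   pip install --no-cache-dir <extra-package>
#   git -C /path/to/checkout checkout <branch>
EOF

# Step 2: make it executable.
chmod +x "$workdir/configs/custom-setup.sh"

# Local dry run to catch syntax errors before submitting a job.
setup_output="$("$workdir/configs/custom-setup.sh")"
echo "$setup_output"
```

Running the script once locally is a cheap way to catch shell errors that would otherwise only surface after the job has been scheduled.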
