Monitoring

Checking Job Status

# List your running jobs
squeue -u $USER

# Detailed job info
scontrol show job <job_id>

# Cancel a job
scancel <job_id>
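
If you want to block until a job leaves the queue, the squeue call above can be wrapped in a small polling loop. This is a minimal sketch, not part of srtctl: the function name wait_for_job and the default 30-second interval are assumptions chosen for illustration. It relies only on the standard squeue options -h (no header), -j (job ID), and -o %T (job state).

```shell
# Poll squeue until the job is no longer listed.
# wait_for_job is a hypothetical helper name, not an srtctl command.
wait_for_job() {
  job_id=$1
  interval=${2:-30}   # seconds between polls (assumed default)
  # squeue prints nothing, or fails with "Invalid job id", once the job is gone
  while state=$(squeue -h -j "$job_id" -o %T 2>/dev/null) && [ -n "$state" ]; do
    echo "job $job_id: $state"
    sleep "$interval"
  done
  echo "job $job_id is no longer in the queue"
}
```

For example, `wait_for_job <job_id> 60` prints the job state once a minute and returns when the job has finished or been cancelled.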

Log Directory

After submission, srtctl prints the path of the directory where the job's logs are stored.

The directory name follows the pattern: {job_id}_{prefill}P_{decode}D_{timestamp}
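
Because the directory name encodes the job ID and the prefill/decode counts, a script can recover them from the path. The following is a rough sketch using POSIX parameter expansion. The function name parse_log_dir is hypothetical, and the timestamp format is not specified here, so it is treated as an opaque string.

```shell
# Split {job_id}_{prefill}P_{decode}D_{timestamp} into its fields.
# parse_log_dir is a hypothetical helper, not an srtctl command.
parse_log_dir() {
  name=${1%/}        # drop a trailing slash, if any
  name=${name##*/}   # keep only the last path component
  case $name in
    *_*P_*D_*) ;;    # looks like the documented pattern
    *) echo "unrecognized log directory name: $name" >&2; return 1 ;;
  esac
  job_id=${name%%_*}       # text before the first underscore
  rest=${name#*_}
  prefill=${rest%%P_*}     # text before the first "P_"
  rest=${rest#*P_}
  decode=${rest%%D_*}      # text before the first "D_"
  timestamp=${rest#*D_}    # everything after "D_", format not assumed
  echo "job_id=$job_id prefill=$prefill decode=$decode timestamp=$timestamp"
}
```

The directory name used to try this out can be anything that fits the pattern; the function only relies on the `_`, `P_`, and `D_` separators.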

Log Structure

Key Files

log.out

The main orchestration log. It shows node assignments, worker launches, and the frontend URL.

benchmark.out

Shows benchmark progress and results.

Worker Logs ({node}_prefill_w0.err, {node}_decode_w0.err)

SGLang worker logs showing model loading, memory allocation, and runtime info. Check these for debugging CUDA errors, OOM issues, or NCCL failures.
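
A quick way to triage worker logs is to grep all of them at once for the failure classes mentioned above. This is only a sketch: the function name scan_worker_logs is hypothetical, and the search patterns are assumed guesses at typical CUDA, OOM, and NCCL messages, not strings guaranteed to appear in SGLang output. Adjust them to match what your logs actually contain.

```shell
# Search prefill and decode worker logs for common failure signatures.
# scan_worker_logs and the patterns below are assumptions for illustration.
scan_worker_logs() {
  log_dir=$1
  # -H: print file name, -n: line number, -i: ignore case, -E: extended regex
  grep -HniE 'CUDA error|out of memory|NCCL (WARN|error)' \
    "$log_dir"/*_prefill_w*.err "$log_dir"/*_decode_w*.err 2>/dev/null
}
```

Each match is printed as file:line:text, so you can jump straight to the worker and line that failed. The function returns a non-zero status when nothing matches.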

config.yaml

The fully resolved configuration showing exactly what ran, with all aliases expanded and defaults applied.

Common Commands
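
As one example of a day-to-day command, the sketch below prints the last few lines of the two top-level logs in a job's log directory. The function name tail_key_logs and the default of 20 lines are assumptions. Only the file names log.out and benchmark.out come from this page.

```shell
# Show the tail of log.out and benchmark.out for one log directory.
# tail_key_logs is a hypothetical helper, not an srtctl command.
tail_key_logs() {
  log_dir=$1
  lines=${2:-20}   # number of lines to show per file (assumed default)
  for f in log.out benchmark.out; do
    if [ -f "$log_dir/$f" ]; then
      echo "==> $f <=="
      tail -n "$lines" "$log_dir/$f"
    fi
  done
}
```

Files that do not exist yet, such as benchmark.out before the benchmark starts, are skipped silently.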

Connecting to Running Jobs

The log.out file includes the commands for connecting to running nodes.
