Monitoring
Checking Job Status
# List your running jobs
squeue -u $USER
# Detailed job info
scontrol show job <job_id>
# Cancel a job
scancel <job_id>
Log Directory
After submission, srtctl prints the directory where the job's logs are stored.
The directory name follows the pattern: {job_id}_{prefill}P_{decode}D_{timestamp}
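Because the directory name encodes the job ID and the prefill/decode worker counts, it can be split back into its fields with plain shell parameter expansion. The sketch below assumes the pattern exactly as stated above; the function name `parse_log_dir` and the example directory name are hypothetical, and the timestamp format shown is only an illustration.

```shell
# Hypothetical helper: split a log directory name of the form
# {job_id}_{prefill}P_{decode}D_{timestamp} into its four fields.
parse_log_dir() {
    local name rest job_id prefill decode timestamp
    name=$(basename "$1")
    job_id=${name%%_*}      # text before the first underscore
    rest=${name#*_}         # drop "{job_id}_"
    prefill=${rest%%P_*}    # text before the first "P_"
    rest=${rest#*P_}        # drop "{prefill}P_"
    decode=${rest%%D_*}     # text before the first "D_"
    timestamp=${rest#*D_}   # everything after "{decode}D_"
    printf 'job_id=%s prefill=%s decode=%s timestamp=%s\n' \
        "$job_id" "$prefill" "$decode" "$timestamp"
}

# Example with a made-up directory name:
parse_log_dir "logs/12345_1P_2D_20250101_120000"
```

The timestamp is taken as "everything after `D_`", so it may itself contain underscores without breaking the split.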
Log Structure
Key Files
log.out
The main orchestration log. It shows node assignments, worker launches, and the frontend URL.
benchmark.out
Shows benchmark progress and results.
Worker Logs ({node}_prefill_w0.err, {node}_decode_w0.err)
SGLang worker logs showing model loading, memory allocation, and runtime info. Check these for debugging CUDA errors, OOM issues, or NCCL failures.
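When a job has many workers, searching all worker logs at once is usually faster than opening them one by one. The following is a minimal sketch: the function name `scan_worker_logs` is hypothetical, and the default search pattern is an assumption about how these failures are typically worded, not a list taken from SGLang itself. Adjust it to the messages you actually see.

```shell
# Hypothetical helper: search every worker .err file in a log directory
# for lines that look like CUDA, out-of-memory, or NCCL problems.
# Usage: scan_worker_logs <log_dir> [extended-regex]
scan_worker_logs() {
    local dir="$1"
    # Assumed default pattern; matched case-insensitively.
    local pattern="${2:-CUDA error|out of memory|NCCL}"
    # -H prints the file name, -n the line number of each match.
    grep -H -n -i -E "$pattern" "$dir"/*_w*.err 2>/dev/null
}
```

Each match is printed as `file:line:text`, so the output shows which node and which worker (prefill or decode) reported the problem.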
config.yaml
The fully resolved configuration showing exactly what ran, with all aliases expanded and defaults applied.
Common Commands
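For a quick look at a job, it is often enough to print the end of the two top-level logs described under Key Files. The sketch below assumes only those file names (`log.out` and `benchmark.out`); the function name `show_log_tails` and its arguments are hypothetical.

```shell
# Hypothetical helper: print the last N lines (default 20) of log.out
# and benchmark.out in a log directory, skipping files that do not exist.
# Usage: show_log_tails <log_dir> [lines]
show_log_tails() {
    local dir="$1"
    local lines="${2:-20}"
    local f
    for f in log.out benchmark.out; do
        if [ -f "$dir/$f" ]; then
            printf '==> %s <==\n' "$f"
            tail -n "$lines" "$dir/$f"
        fi
    done
}
```

To follow a log live while the job is running, `tail -f <log_dir>/log.out` works the same way on a single file.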
Connecting to Running Jobs
The log.out file includes commands for connecting to the nodes of a running job.