Accuracy Benchmarks
In srt-slurm, users can run different accuracy benchmarks by setting the benchmark section in the config YAML file. Supported benchmarks include mmlu, gpqa, and longbenchv2.
Note: The context-length argument in the config YAML needs to be larger than the max_tokens argument of the accuracy benchmark.
MMLU
For the MMLU dataset, modify the benchmark section of the YAML file to select mmlu.
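A minimal sketch of what that section might look like; the exact keys (type, num_examples) are illustrative assumptions rather than the authoritative srt-slurm schema:

```yaml
# Illustrative only -- adjust field names to match your srt-slurm version.
benchmark:
  type: mmlu          # select the MMLU accuracy benchmark
  num_examples: 200   # optional: limit examples for a faster run
  max_tokens: 2048    # keep this smaller than context-length in sglang_config
```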
Then launch the script as usual.
After the benchmark finishes, benchmark.out will contain the accuracy results.
GPQA
For the GPQA dataset, modify the benchmark section of the YAML file to select gpqa.
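A hedged sketch along the same lines (again, the exact keys are assumptions, not the official schema):

```yaml
# Illustrative only -- field names are assumptions about the config schema.
benchmark:
  type: gpqa
  max_tokens: 8192    # context-length in sglang_config must be larger than this
```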
The context-length argument here should be set to a value larger than max_tokens.
LongBench-V2
LongBench-V2 is a long-context evaluation benchmark that tests model performance on extended context tasks. It's particularly useful for validating models with large context windows (128K+ tokens).
Configuration
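A sketch of what a LongBench-V2 benchmark section might look like; the parameter names come from the table below, while the surrounding structure and the type key are assumptions:

```yaml
# Illustrative only -- parameter names follow the table below; the nesting
# and the "type" key are assumptions about the srt-slurm config schema.
benchmark:
  type: longbenchv2
  max_context_length: 128000   # should not exceed the model's trained context window
  num_threads: 16              # concurrent evaluation threads
  max_tokens: 16384            # must be less than context-length in sglang_config
```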
Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_context_length | int | 128000 | Maximum context length for evaluation. Should not exceed the model's trained context window. |
| num_threads | int | 16 | Number of concurrent threads for parallel evaluation. Increase for faster throughput on high-capacity endpoints. |
| max_tokens | int | 16384 | Maximum tokens for model output. Must be less than context-length in sglang_config. |
| num_examples | int | all | Limit the number of examples to evaluate. Useful for quick validation runs. |
| categories | list | all | Specific task categories to run. Omit to run all categories. |
Available Categories
LongBench-V2 includes the following task categories:
single_doc_qa: Single document question answering
multi_doc_qa: Multi-document question answering
summarization: Long document summarization
few_shot_learning: Few-shot learning with long context
code_completion: Long-context code completion
synthetic: Synthetic long-context tasks (needle-in-haystack, etc.)
Example: Full Evaluation
Run a complete LongBench-V2 evaluation with all categories.
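A hedged sketch of such a configuration (structure assumed as in the Configuration section above); omitting num_examples and categories runs every example in every category:

```yaml
# Illustrative full run -- all categories, all examples (structure assumed).
benchmark:
  type: longbenchv2
  max_context_length: 128000
  num_threads: 16
  max_tokens: 16384
  # num_examples and categories are omitted, so everything is evaluated
```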
Example: Quick Validation
Run a quick subset for validation.
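A hedged sketch of a reduced run that limits both the number of examples and the categories (the values here are illustrative):

```yaml
# Illustrative quick validation -- small subset, two categories only.
benchmark:
  type: longbenchv2
  max_context_length: 128000
  num_threads: 8
  max_tokens: 16384
  num_examples: 50        # evaluate only a small subset
  categories:
    - single_doc_qa
    - summarization
```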
Output
After completion, results are saved to the logs directory. The output includes per-category scores and aggregate metrics.
Important Notes
Context Length: Ensure context-length in your sglang_config exceeds max_tokens for the benchmark
Memory: Long-context evaluation requires significant GPU memory. Use appropriate mem-fraction-static settings
Throughput: Increase num_threads for faster evaluation, but monitor for OOM errors
Categories: Running specific categories is useful for targeted validation (e.g., just testing summarization capabilities)