Accuracy Benchmarks

In srt-slurm, users can run different accuracy benchmarks by configuring the benchmark section in the config yaml file. Supported benchmarks include mmlu, gpqa, and longbenchv2.

Note: The context-length argument in the config yaml needs to be larger than the max_tokens argument of the accuracy benchmark.

MMLU

For the MMLU dataset, the benchmark section of the yaml file can be modified as follows:
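
A minimal sketch of what this might look like, assuming the benchmark is selected with a name key (this document does not show the exact selector) and that max_tokens is the output budget referenced in the note above:

```yaml
# Sketch only: the "name" selector and the chosen max_tokens value are
# assumptions, not taken from this document.
benchmark:
  name: mmlu
  max_tokens: 2048   # must stay below context-length in sglang_config
```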

Then launch the script as usual:
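
The exact entry point depends on your srt-slurm deployment and is not shown here; a hypothetical invocation might look like:

```bash
# Hypothetical command: substitute the launch script and config path you
# normally use; neither name below comes from this document.
sbatch launch.sh --config config.yaml
```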

After the benchmark finishes, benchmark.out will contain the accuracy results.

GPQA

For the GPQA dataset, the benchmark section of the yaml file can be modified as follows:
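
As a hedged sketch (the name selector and the exact nesting are assumptions; only max_tokens and context-length are named in this document):

```yaml
# Sketch only: keys other than max_tokens and context-length are assumptions.
benchmark:
  name: gpqa
  max_tokens: 4096
sglang_config:
  context-length: 8192   # must be larger than max_tokens above
```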

The context-length argument here should be set to a value larger than max_tokens.

LongBench-V2

LongBench-V2 is a long-context evaluation benchmark that tests model performance on extended context tasks. It's particularly useful for validating models with large context windows (128K+ tokens).

Configuration
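
A hedged example configuration, using the parameters documented in the table below (the name selector key and the specific values are assumptions):

```yaml
# Sketch only: parameter names come from the table below; the "name" selector
# and the chosen values are assumptions.
benchmark:
  name: longbenchv2
  max_context_length: 128000
  num_threads: 16
  max_tokens: 16384
  num_examples: 100        # omit to evaluate all examples
  categories:              # omit to run all categories
    - single_doc_qa
    - multi_doc_qa
```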

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_context_length | int | 128000 | Maximum context length for evaluation. Should not exceed model's trained context window. |
| num_threads | int | 16 | Number of concurrent threads for parallel evaluation. Increase for faster throughput on high-capacity endpoints. |
| max_tokens | int | 16384 | Maximum tokens for model output. Must be less than context-length in sglang_config. |
| num_examples | int | all | Limit the number of examples to evaluate. Useful for quick validation runs. |
| categories | list | all | Specific task categories to run. Omit to run all categories. |

Available Categories

LongBench-V2 includes the following task categories:

  • single_doc_qa: Single document question answering

  • multi_doc_qa: Multi-document question answering

  • summarization: Long document summarization

  • few_shot_learning: Few-shot learning with long context

  • code_completion: Long-context code completion

  • synthetic: Synthetic long-context tasks (needle-in-haystack, etc.)

Example: Full Evaluation

Run complete LongBench-V2 evaluation with all categories:
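
A hedged sketch (the name selector is an assumption; omitting num_examples and categories runs all examples and all categories, per the parameter table):

```yaml
# Sketch only: full run over all categories and all examples.
benchmark:
  name: longbenchv2
  max_context_length: 128000
  num_threads: 16
  max_tokens: 16384
```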

Example: Quick Validation

Run a quick subset for validation:
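
A hedged sketch limiting the example count and the categories (the name selector and the specific values are assumptions):

```yaml
# Sketch only: small example budget and a single category for a fast sanity check.
benchmark:
  name: longbenchv2
  max_context_length: 128000
  num_threads: 8
  max_tokens: 16384
  num_examples: 20
  categories:
    - single_doc_qa
```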

Output

After completion, results are saved to the logs directory.

The output includes per-category scores and aggregate metrics:
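
The exact file names and fields are not reproduced here; as a purely illustrative sketch of the kind of summary to expect (field names are assumptions, not the actual output format):

```yaml
# Hypothetical shape only; placeholders, not real results.
overall_score: <aggregate accuracy>
per_category:
  single_doc_qa: <score>
  multi_doc_qa: <score>
  summarization: <score>
```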

Important Notes

  1. Context Length: Ensure context-length in your sglang_config exceeds max_tokens for the benchmark

  2. Memory: Long-context evaluation requires significant GPU memory. Use appropriate mem-fraction-static settings (see the sketch after this list)

  3. Throughput: Increase num_threads for faster evaluation, but monitor for OOM errors

  4. Categories: Running specific categories is useful for targeted validation (e.g., just testing summarization capabilities)
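
A hedged sketch of the sglang_config settings these notes refer to (only context-length and mem-fraction-static are named in this document; the values and surrounding structure are assumptions):

```yaml
# Illustrative values only; tune to your model and GPU memory budget.
sglang_config:
  context-length: 163840      # must exceed the benchmark's max_tokens
  mem-fraction-static: 0.85   # leave KV-cache headroom for long contexts
```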
