The EleutherAI LM Evaluation Harness is the standard tool for benchmarking language models. One command runs MMLU, HellaSwag, GSM8K, or any of 60+ benchmarks with hundreds of subtask variants. It supports local HuggingFace models, vLLM, and any OpenAI-compatible API. This guide covers everything from first install to building custom benchmarks.
What the Eval Harness Does
The LM Evaluation Harness solves a specific problem: running the same benchmark on different models with identical prompts, scoring, and reporting. Before it existed, every research group implemented their own evaluation scripts. Results were not comparable across papers because prompt formats, few-shot examples, and scoring details differed.
The harness standardizes all of this. Every task defines its prompt template, few-shot format, output parsing, and metrics in a single YAML file. When two teams report HellaSwag accuracy using lm-eval, those numbers are directly comparable. This is why Hugging Face chose it as the backend for the Open LLM Leaderboard.
🔬Reproducible by design
YAML configs plus the commit hash fully specify every evaluation. Share the config, get identical results.
🔌Multiple backends
Load models via HuggingFace, vLLM, GGUF, or evaluate against any OpenAI-compatible API endpoint.
📝Extensible tasks
Add custom benchmarks with a YAML file. No Python subclassing needed for standard task types.
Who uses it
The harness is used internally by NVIDIA, Cohere, BigScience, BigCode, Nous Research, MosaicML, and dozens of other organizations. It has been cited in hundreds of research papers and is the evaluation standard for the Open LLM Leaderboard.
Installation
The base package installs with pip. Since v0.4.7, the base install no longer pulls in torch or transformers, which keeps it lightweight if you only need API-based evaluation.
Install options
# Base install (lightweight, good for API evaluation)
pip install lm-eval
# With HuggingFace model support
pip install "lm-eval[hf]"
# With vLLM for fast local inference
pip install "lm-eval[vllm]"
# Full install with all backends
pip install "lm-eval[all]"
# Development install from source (latest features)
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e ".[dev]"After installation, verify with lm-eval --help. To see all available tasks, run lm-eval --tasks list. The output is long. Pipe it to grep to find specific benchmarks.
List available tasks
# See all available tasks
lm-eval --tasks list
# Find MMLU-related tasks
lm-eval --tasks list | grep mmlu
# Find coding-related tasks
lm-eval --tasks list | grep -i "human\|code\|mbpp"
Running Your First Benchmark
The minimal command needs three things: a model type, model arguments, and one or more tasks. Here is GPT-2 on HellaSwag. GPT-2 is small enough to run on a CPU, making it a good test that your installation works.
Minimal benchmark: GPT-2 on HellaSwag
lm-eval \
--model hf \
--model_args pretrained=gpt2,dtype=float32 \
--tasks hellaswag \
--output_path results/gpt2/
For larger models, use dtype=bfloat16 and device_map=auto to distribute across GPUs. The --batch_size auto flag automatically finds the largest batch size that fits in memory.
Benchmark a real model: Llama 3.1 8B on multiple tasks
lm-eval \
--model hf \
--model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16,device_map=auto,trust_remote_code=true \
--tasks mmlu,hellaswag,arc_challenge,gsm8k \
--num_fewshot 5 \
--batch_size auto \
--output_path results/llama-3.1-8b/ \
--log_samples
The --log_samples flag
Adding --log_samples saves every input prompt and model output to a JSONL file alongside the results JSON. This is essential for debugging. When a model scores unexpectedly low, the per-sample logs let you see exactly which questions it missed and what it predicted.
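To triage failures programmatically, read the samples file directly. Here is a minimal sketch, assuming a GSM8K run saved under results/llama-3.1-8b/; the exact filename pattern and per-sample field names vary by task and harness version, so inspect one line of your own file first.
Inspect per-sample logs (Python)
import json
from pathlib import Path
# Assumed layout: samples_<task>_<timestamp>.jsonl somewhere under the output dir
samples_file = sorted(Path("results/llama-3.1-8b/").rglob("samples_gsm8k_*.jsonl"))[-1]
misses = []
with open(samples_file) as f:
    for line in f:
        sample = json.loads(line)
        # Generative tasks typically record the scored metric per sample
        if sample.get("exact_match") == 0.0:
            misses.append(sample)
print(f"{len(misses)} missed questions")
for s in misses[:3]:
    print("DOC:", str(s.get("doc"))[:200])
    print("PREDICTED:", s.get("filtered_resps"))
    print("TARGET:", s.get("target"))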
For faster evaluation on large models, vLLM gives significant speedups through continuous batching and PagedAttention. The syntax is nearly identical.
vLLM backend for faster evaluation
lm-eval \
--model vllm \
--model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=auto,gpu_memory_utilization=0.8,tensor_parallel_size=2 \
--tasks mmlu,hellaswag,gsm8k \
--num_fewshot 5 \
--batch_size auto \
--output_path results/llama-3.1-8b-vllm/
Key Benchmarks Explained
The harness ships with dozens of benchmarks. Not all are equally useful for every use case. Here are the ones that matter most and what they actually test.
| Benchmark | What It Tests | Format | Typical Use |
|---|---|---|---|
| MMLU | Broad knowledge across 57 subjects | 4-way multiple choice | General model capability |
| MMLU-Pro | Harder MMLU with 10 answer choices, reasoning-focused | 10-way multiple choice | Open LLM Leaderboard v2 |
| HellaSwag | Commonsense reasoning about physical situations | 4-way completion | Language understanding baseline |
| ARC-Challenge | Grade-school science questions (hard subset) | 4-way multiple choice | Reasoning ability |
| GSM8K | Grade-school math word problems | Generative (numeric answer) | Math reasoning |
| GPQA | PhD-level science questions | 4-way multiple choice | Expert knowledge |
| TruthfulQA | Questions designed to elicit common misconceptions | Multiple choice + generative | Factual accuracy |
| BBH | 23 hard tasks from BIG-Bench | Mixed formats | Challenging reasoning |
| IFEval | Instruction following with verifiable constraints | Generative with rubric | Instruction compliance |
Open LLM Leaderboard v2 Tasks
The Open LLM Leaderboard v2 uses a specific set of six benchmarks. You can run all of them in one shot with the leaderboard task group.
Run the full Open LLM Leaderboard v2 suite
# Run all 6 leaderboard tasks at once
lm-eval \
--model hf \
--model_args pretrained=your-model,dtype=bfloat16,device_map=auto \
--tasks leaderboard \
--batch_size auto \
--output_path results/leaderboard/
# The "leaderboard" group includes:
# - leaderboard_bbh (BIG-Bench Hard)
# - leaderboard_gpqa (PhD-level science)
# - leaderboard_ifeval (instruction following)
# - leaderboard_math_hard (math reasoning)
# - leaderboard_mmlu_pro (broad knowledge, 10-way)
# - leaderboard_musr (multistep soft reasoning)
Multiple Choice vs Generative Tasks
This distinction matters when choosing your model backend. Multiple-choice tasks like MMLU and HellaSwag use loglikelihood scoring: the model scores each candidate answer, and the one with the highest log probability wins. Generative tasks like GSM8K ask the model to produce free-form text, then extract and score the answer.
Loglikelihood tasks require access to token probabilities. The HuggingFace and vLLM backends provide this. OpenAI-compatible API endpoints typically do not. If you are evaluating via an API, you are limited to generative tasks unless the API exposes logprobs.
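To make the loglikelihood mechanism concrete, here is a simplified sketch of multiple-choice scoring using plain transformers. This is illustrative rather than the harness's actual implementation, and it glosses over tokenizer boundary subtleties that the harness handles carefully.
Loglikelihood scoring, simplified (Python)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
context = "Question: What is the capital of France?\nAnswer:"
choices = [" Paris", " London", " Berlin", " Madrid"]
def continuation_logprob(context, continuation):
    # Score only the continuation tokens, conditioned on the context
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(dim=-1)
    # The token at position i is predicted by the logits at position i - 1
    return sum(
        logprobs[0, i - 1, full_ids[0, i]].item()
        for i in range(ctx_len, full_ids.shape[1])
    )
scores = [continuation_logprob(context, c) for c in choices]
print(choices[max(range(len(choices)), key=lambda i: scores[i])])
The harness does the same thing batched and cached, plus the length correction behind acc_norm described later.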
Evaluating via OpenAI-Compatible APIs
You do not need to download model weights to benchmark them. If your model is served behind an OpenAI-compatible API (vLLM, TGI, Ollama, or any commercial endpoint), the local-chat-completions model type connects to it directly.
Evaluate a model served via API
# Start your model server (example with vLLM)
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000
# In another terminal, run the evaluation
lm-eval \
--model local-chat-completions \
--model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v1/chat/completions,num_concurrent=32,max_retries=3,tokenized_requests=False \
--tasks gsm8k,ifeval \
--apply_chat_template \
--output_path results/api-eval/
API evaluation limitations
The local-chat-completions model type does not support loglikelihood-based tasks. Tasks like HellaSwag (in its default scoring mode) and MMLU (when using log-probability ranking) require token-level logprobs that most chat APIs do not expose. Stick to generative tasks like GSM8K, IFEval, and BBH when evaluating through APIs, or use local-completions if your endpoint supports the completions API with logprobs.
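One quick way to find out what your endpoint supports: request logprobs through the legacy completions route and check whether they come back. A sketch assuming a local vLLM server on port 8000; parameter support varies by server, so treat a failure here as inconclusive.
Probe an endpoint for logprob support (Python)
import requests
resp = requests.post(
    "http://localhost:8000/v1/completions",  # assumed local server
    json={
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "prompt": "The capital of France is",
        "max_tokens": 1,
        "logprobs": 1,  # completions-style logprobs parameter
        "echo": True,   # some servers also return prompt logprobs
    },
    timeout=30,
)
choice = resp.json()["choices"][0]
print("logprobs returned:", choice.get("logprobs") is not None)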
For evaluating Ollama models locally:
Evaluate an Ollama model
# Make sure Ollama is running with your model
ollama run llama3.1:8b
# Evaluate via the Ollama API (OpenAI-compatible)
lm-eval \
--model local-chat-completions \
--model_args model=llama3.1:8b,base_url=http://localhost:11434/v1/chat/completions,num_concurrent=1,tokenized_requests=False \
--tasks gsm8k \
--output_path results/ollama-llama3.1/
Interpreting Results
After a run completes, the harness prints a summary table and saves a JSON file. Understanding the output format is essential for drawing correct conclusions.
Example output table
| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|--------------|------:|------|-----:|----------|-----:|---|-----:|
|hellaswag | 1|none | 10|acc |0.6153|± |0.0049|
| | |none | 10|acc_norm |0.8012|± |0.0040|
|mmlu | 1|none | 5|acc |0.6234|± |0.0038|
|arc_challenge | 1|none | 25|acc |0.5427|± |0.0146|
| | |none | 25|acc_norm |0.5768|± |0.0144|
|gsm8k | 3|none | 5|exact_match|0.4551|± |0.0137|
Key Fields
acc is raw accuracy: what fraction of questions the model answered correctly. acc_norm is length-normalized accuracy, which adjusts for the fact that longer answer choices accumulate lower (more negative) total log probabilities simply because they contain more tokens. For multiple-choice tasks, acc_norm is usually the more meaningful metric. Stderr is standard error, giving you a confidence interval on the score. n-shot tells you how many few-shot examples were in the prompt.
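The normalization behind acc_norm is straightforward: divide each choice's total log probability by its length (byte length in the standard implementation) before picking the winner. A toy sketch with made-up numbers:
acc vs acc_norm selection (Python sketch)
# Toy totals: they get more negative as choices get longer
choices = [" a short answer", " a much longer and more detailed answer"]
logprobs = [-12.3, -18.9]
# acc: pick the highest raw total log probability
acc_pick = max(range(len(choices)), key=lambda i: logprobs[i])
# acc_norm: divide by byte length before picking
norm_pick = max(
    range(len(choices)),
    key=lambda i: logprobs[i] / len(choices[i].encode("utf-8")),
)
print(acc_pick, norm_pick)  # 0 1 -- the two metrics can disagree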
The JSON output file contains the full results with additional detail:
JSON results structure
{
  "results": {
    "hellaswag": {
      "acc,none": 0.6153,
      "acc_stderr,none": 0.0049,
      "acc_norm,none": 0.8012,
      "acc_norm_stderr,none": 0.0040,
      "alias": "hellaswag"
    },
    "mmlu": {
      "acc,none": 0.6234,
      "acc_stderr,none": 0.0038,
      "alias": "mmlu"
    }
  },
  "versions": {
    "hellaswag": 1,
    "mmlu": 1
  },
  "config": {
    "model": "hf",
    "model_args": "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16",
    "batch_size": "auto",
    "num_fewshot": 5
  }
}
Always report task versions
Task versions are reported in the versions field. When publishing results, always include the version numbers. A task version bump means the prompt format or scoring changed, so results from different versions are not directly comparable.
Comparing Models
The harness does not have a built-in comparison tool, but the consistent JSON output format makes comparison straightforward. Run the same tasks with the same num_fewshot on each model, then compare the scores.
Evaluate multiple models for comparison
# Model A
lm-eval \
--model hf \
--model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16,device_map=auto \
--tasks mmlu,hellaswag,gsm8k,arc_challenge \
--num_fewshot 5 \
--batch_size auto \
--output_path results/llama-3.1-8b/
# Model B
lm-eval \
--model hf \
--model_args pretrained=mistralai/Mistral-7B-Instruct-v0.3,dtype=bfloat16,device_map=auto \
--tasks mmlu,hellaswag,gsm8k,arc_challenge \
--num_fewshot 5 \
--batch_size auto \
--output_path results/mistral-7b/
# Model C (via API)
lm-eval \
--model local-chat-completions \
--model_args model=gpt-4o-mini,base_url=https://api.openai.com/v1/chat/completions,num_concurrent=16,tokenized_requests=False \
--tasks gsm8k \
--output_path results/gpt-4o-mini/
For systematic comparison, write a small script that reads the JSON files and builds a table:
Compare results across models (Python)
import json
from pathlib import Path
models = {
    "Llama 3.1 8B": "results/llama-3.1-8b",
    "Mistral 7B": "results/mistral-7b",
}
tasks = ["mmlu", "hellaswag", "gsm8k", "arc_challenge"]
print(f"{'Model':<20} " + " ".join(f"{t:<15}" for t in tasks))
print("-" * 80)
for model_name, result_dir in models.items():
    # Find the newest results JSON (filenames are timestamped; recent
    # versions nest them in a per-model subdirectory, hence rglob)
    result_files = sorted(Path(result_dir).rglob("results_*.json"))
    if not result_files:
        continue
    with open(result_files[-1]) as f:
        data = json.load(f)
    scores = []
    for task in tasks:
        result = data.get("results", {}).get(task, {})
        # Prefer acc_norm for multiple-choice, exact_match for generative
        score = result.get(
            "acc_norm,none",
            result.get("acc,none", result.get("exact_match,none", "N/A")),
        )
        scores.append(f"{score:<15.4f}" if isinstance(score, float) else f"{score:<15}")
    print(f"{model_name:<20} " + " ".join(scores))
Statistical significance
The stderr values tell you whether a score difference is meaningful. If Model A scores 0.62 with stderr 0.004 and Model B scores 0.63 with stderr 0.004 on the same benchmark, the standard error of the difference is √(0.004² + 0.004²) ≈ 0.006, so the 0.01 gap is less than two standard errors and not statistically significant at the usual 95% level. For reliable comparisons, you need either a large gap or a large number of evaluation examples to shrink the stderr.
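You can script this check with the reported stderr values. A minimal sketch using the numbers above; it treats the two runs as independent, which is conservative when both models answered the same questions:
Significance check from stderr (Python)
import math
score_a, stderr_a = 0.62, 0.004  # Model A
score_b, stderr_b = 0.63, 0.004  # Model B
# Standard error of the difference between two independent scores
stderr_diff = math.sqrt(stderr_a**2 + stderr_b**2)
z = (score_b - score_a) / stderr_diff
print(f"z = {z:.2f}")  # ~1.77; need |z| >= 1.96 for 95% significance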
Custom Task Creation
The harness is not limited to built-in benchmarks. You can add any evaluation task by writing a YAML configuration file. This is how most new benchmarks get integrated, and how you build internal evaluations for domain-specific use cases.
YAML Task Structure
A task config specifies where the data comes from, how to format prompts, what the target answer is, and which metric to compute. Prompts use Jinja2 templates with access to all fields in your dataset.
Custom multiple-choice task (YAML)
task: my_domain_mcq
group: custom_benchmarks
dataset_path: json
dataset_kwargs:
  data_files: "path/to/evaluation_data.jsonl"
output_type: multiple_choice
validation_split: train
doc_to_text: "{{question}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_target: "{{answer}}"
doc_to_choice:
  - "A"
  - "B"
  - "C"
  - "D"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
num_fewshot: 5
metadata:
  version: 1.0
Custom generative task (YAML)
task: my_qa_benchmark
group: custom_benchmarks
dataset_path: json
dataset_kwargs:
  data_files: "path/to/qa_data.jsonl"
output_type: generate_until
validation_split: test
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
generation_kwargs:
  until:
    - "\n"
    - "Question:"
  max_gen_toks: 256
  temperature: 0.0
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
num_fewshot: 3
metadata:
  version: 1.0
Using Template Inheritance
Tasks can inherit from other YAML configs using the include directive. Change only what differs. This is how GSM8K-CoT-Self-Consistency is built on top of the base GSM8K-CoT config.
Inherit from an existing task
# my_mmlu_variant.yaml
include: mmlu
task: my_mmlu_variant
doc_to_text: "You are an expert. {{question}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nThe correct answer is:"
metadata:
  version: 1.0
Run a custom task
# Point to the directory containing your YAML files
lm-eval \
--model hf \
--model_args pretrained=your-model,dtype=bfloat16,device_map=auto \
--tasks my_domain_mcq \
--include_path ./custom_tasks/ \
--output_path results/custom/
Dataset format
Your JSONL data file should have one JSON object per line. Each object needs the fields referenced in your Jinja2 templates. For the multiple-choice example above, each line would look like: {"question": "What is X?", "choices": ["A opt", "B opt", "C opt", "D opt"], "answer": "B"}
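Generating a file in that shape takes a few lines. A sketch with one hypothetical row:
Write an evaluation JSONL (Python)
import json
# Hypothetical rows matching the multiple-choice task above
rows = [
    {
        "question": "Which HTTP method is idempotent?",
        "choices": ["POST", "PUT", "PATCH", "CONNECT"],
        "answer": "B",
    },
]
with open("path/to/evaluation_data.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")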
Evaluating Coding Models
Code generation evaluation has specific requirements that general language benchmarks do not cover. The model needs to produce executable code, and the evaluation needs to run that code against test cases.
lm-eval for Code Tasks
The harness includes some code-related tasks, but for thorough code evaluation, the BigCode Evaluation Harness is purpose-built. It supports HumanEval (164 Python problems), HumanEval+ (with additional test cases), MBPP, MBPP+, APPS, and DS-1000, all with sandboxed code execution.
| Feature | lm-eval | BigCode Evaluation Harness |
|---|---|---|
| HumanEval | Basic support | Full pass@k with execution |
| Multi-language | Limited | HumanEvalPack: 6 languages, MultiPL-E: 18 languages |
| FIM (fill-in-middle) | Not supported | Supported for completion and insertion |
| pass@k metrics | Not standard | pass@1, pass@10, pass@100 with proper estimation |
| Sandboxed execution | No | Yes, runs generated code in isolation |
| General benchmarks | 60+ tasks: MMLU, HellaSwag, GSM8K, etc. | Code-only |
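The "proper estimation" in the table refers to the unbiased pass@k estimator from the Codex paper (Chen et al., 2021): generate n samples per problem, count the c that pass, and estimate the probability that at least one of k randomly drawn samples passes. A minimal version:
Unbiased pass@k estimator (Python)
import math
def pass_at_k(n, c, k):
    # n = samples generated, c = samples that pass, k = budget
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed stably as a running product
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))
# Example: 200 samples per problem, 37 pass the unit tests
print(f"pass@1  = {pass_at_k(200, 37, 1):.3f}")   # 0.185
print(f"pass@10 = {pass_at_k(200, 37, 10):.3f}")
This is why the n_samples values in the commands below sit well above the largest k you plan to report.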
BigCode Evaluation Harness for code models
# Install from source
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .
# Evaluate on HumanEval with pass@1
accelerate launch main.py \
--model codellama/CodeLlama-7b-hf \
--tasks humaneval \
--allow_code_execution \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 10
# Evaluate on multiple code benchmarks
accelerate launch main.py \
--model codellama/CodeLlama-7b-hf \
--tasks humaneval,mbpp \
--allow_code_execution \
--do_sample True \
--temperature 0.2 \
--n_samples 200 \
--batch_size 50
For a combined evaluation strategy, run general capability benchmarks (MMLU, HellaSwag, GSM8K) through lm-eval and code-specific benchmarks (HumanEval, MBPP) through the BigCode harness. This gives you both breadth and depth.
CLI Reference
The most important flags, organized by how often you will use them.
| Flag | Description | Example |
|---|---|---|
| --model | Model backend type | hf, vllm, local-chat-completions |
| --model_args | Comma-separated key=value model config | pretrained=gpt2,dtype=bfloat16 |
| --tasks | Comma-separated task names or groups | mmlu,hellaswag,leaderboard |
| --num_fewshot | Number of few-shot examples in prompt | 5 |
| --batch_size | Batch size; use auto for automatic detection | auto, 16, auto:4 |
| --output_path | Directory for JSON results and sample logs | results/my-model/ |
| --log_samples | Save per-example inputs and outputs | (flag, no value) |
| --limit | Cap number of examples per task (for testing) | 100, 0.1 |
| --device | Target device for model | cuda, cuda:0, cpu, mps |
| --include_path | Directory with custom task YAML files | ./custom_tasks/ |
| --gen_kwargs | Generation parameters for generative tasks | temperature=0.0,top_p=0.95 |
| --apply_chat_template | Format prompts using model's chat template | (flag, no value) |
Common command patterns
# Quick test: run 100 examples to verify setup
lm-eval --model hf --model_args pretrained=gpt2 --tasks hellaswag --limit 100
# Full evaluation with vLLM on multi-GPU
lm-eval --model vllm \
--model_args pretrained=meta-llama/Meta-Llama-3.1-70B-Instruct,tensor_parallel_size=4,dtype=auto \
--tasks leaderboard \
--batch_size auto \
--output_path results/llama-70b/ \
--log_samples
# Evaluate with custom generation settings
lm-eval --model hf \
--model_args pretrained=your-model,dtype=bfloat16,device_map=auto \
--tasks gsm8k \
--gen_kwargs temperature=0.0,top_p=1.0 \
--output_path results/
Frequently Asked Questions
What is the LLM Eval Harness?
The LM Evaluation Harness is an open-source framework by EleutherAI for benchmarking language models on 60+ academic tasks with hundreds of subtask variants. It powers Hugging Face's Open LLM Leaderboard and is the most widely used LLM evaluation tool in the research community. Install it with pip install lm-eval, then run lm-eval --model hf --model_args pretrained=your-model --tasks mmlu,hellaswag.
How do I install lm-eval-harness?
Run pip install lm-eval for the base package. For vLLM support, use pip install 'lm-eval[vllm]'. For the latest development version, clone the repo and install with pip install -e '.[dev]'. Since v0.4.7, the base install no longer requires torch or transformers, keeping it lightweight when evaluating API-served models.
Can I evaluate models served via an API?
Yes. Use --model local-chat-completions with --model_args base_url=http://your-server/v1/chat/completions,model=your-model-name. This works with any OpenAI-compatible API. The limitation is that chat APIs typically do not expose logprobs, so you are restricted to generative tasks (GSM8K, IFEval, BBH) rather than loglikelihood-based tasks (default HellaSwag, MMLU).
What benchmarks are available?
The harness includes 60+ major benchmarks with hundreds of subtask variants. The most commonly used: MMLU (broad knowledge), HellaSwag (commonsense), ARC (science reasoning), GSM8K and MATH (math), GPQA (PhD-level science), TruthfulQA (factual accuracy), WinoGrande (coreference), BBH (hard reasoning), IFEval (instruction following). Run lm-eval --tasks list to see all available tasks.
How do I create a custom benchmark?
Write a YAML config file specifying your dataset path, Jinja2 prompt template (doc_to_text), target answer (doc_to_target), and metrics. Place it in a directory and reference it with --include_path. For most standard task types (multiple choice, generative Q&A), no Python code is needed. For custom scoring logic, subclass the Task class.
How do I compare results across models?
Run identical --tasks and --num_fewshot configurations on each model, saving results with --output_path. Each run produces a JSON file with per-task scores and standard errors. Compare the acc or acc_norm values. Check that score differences exceed the reported stderr before concluding one model is better.
What is the difference between lm-eval and the BigCode Evaluation Harness?
lm-eval covers general language benchmarks: reasoning, knowledge, math, instruction following. The BigCode Evaluation Harness is purpose-built for code generation with HumanEval, MBPP, APPS, and DS-1000. BigCode supports pass@k metrics, multi-language evaluation (18 languages via MultiPL-E), fill-in-middle mode, and sandboxed code execution. Use both for comprehensive coding model evaluation.
Evaluation Picks the Model. Morph Runs It.
The eval harness tells you which model wins your benchmarks. Morph's fast-apply infrastructure works with whichever model comes out on top, applying LLM-generated code edits at 10,500 tok/s with 97.3% accuracy. Benchmark, choose, deploy.