LLM Eval Harness: How to Benchmark Any Model on 200+ Tasks in One Command

A practical guide to the EleutherAI LM Evaluation Harness: installation, running MMLU, HellaSwag, and the leaderboard suite, creating custom tasks with YAML, interpreting JSON results, comparing models, evaluating models served behind OpenAI-compatible APIs, and benchmarking coding models.

April 5, 2026

The EleutherAI LM Evaluation Harness is the standard tool for benchmarking language models. One command runs MMLU, HellaSwag, GSM8K, or any of 60+ benchmarks with hundreds of subtask variants. It supports local HuggingFace models, vLLM, and any OpenAI-compatible API. This guide covers everything from first install to building custom benchmarks.

At a glance: 60+ academic benchmarks with hundreds of subtask variants · 9.1k GitHub stars and 2.4k forks · v0.4.7 latest stable release · one command from install to benchmark results.

What the Eval Harness Does

The LM Evaluation Harness solves a specific problem: running the same benchmark on different models with identical prompts, scoring, and reporting. Before it existed, every research group implemented their own evaluation scripts. Results were not comparable across papers because prompt formats, few-shot examples, and scoring details differed.

The harness standardizes all of this. Every task defines its prompt template, few-shot format, output parsing, and metrics in a single YAML file. When two teams report HellaSwag accuracy using lm-eval, those numbers are directly comparable. This is why Hugging Face chose it as the backend for the Open LLM Leaderboard.

🔬 Reproducible by design

YAML configs plus the commit hash fully specify every evaluation. Share the config, get identical results.

🔌 Multiple backends

Load models via HuggingFace, vLLM, GGUF, or evaluate against any OpenAI-compatible API endpoint.

📝 Extensible tasks

Add custom benchmarks with a YAML file. No Python subclassing needed for standard task types.

Who uses it

The harness is used internally by NVIDIA, Cohere, BigScience, BigCode, Nous Research, Mosaic ML, and dozens of other organizations. It has been cited in hundreds of research papers and is the evaluation standard for the Open LLM Leaderboard.

Installation

The base package installs with pip. Since v0.4.7, the base install no longer pulls in torch or transformers, which keeps it lightweight if you only need API-based evaluation.

Install options

# Base install (lightweight, good for API evaluation)
pip install lm-eval

# With HuggingFace model support
pip install "lm-eval[hf]"

# With vLLM for fast local inference
pip install "lm-eval[vllm]"

# Full install with all backends
pip install "lm-eval[all]"

# Development install from source (latest features)
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e ".[dev]"

After installation, verify with lm-eval --help. To see all available tasks, run lm-eval --tasks list. The output is long. Pipe it to grep to find specific benchmarks.

List available tasks

# See all available tasks
lm-eval --tasks list

# Find MMLU-related tasks
lm-eval --tasks list | grep mmlu

# Find coding-related tasks
lm-eval --tasks list | grep -i "human\|code\|mbpp"

Running Your First Benchmark

The minimal command needs three things: a model type, model arguments, and one or more tasks. Here is GPT-2 on HellaSwag. GPT-2 is small enough to run on a CPU, making it a good test that your installation works.

Minimal benchmark: GPT-2 on HellaSwag

lm-eval \
  --model hf \
  --model_args pretrained=gpt2,dtype=float32 \
  --tasks hellaswag \
  --output_path results/gpt2/

For larger models, use dtype=bfloat16 and device_map=auto to distribute across GPUs. The --batch_size auto flag automatically finds the largest batch size that fits in memory.

Benchmark a real model: Llama 3.1 8B on multiple tasks

lm-eval \
  --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16,device_map=auto,trust_remote_code=true \
  --tasks mmlu,hellaswag,arc_challenge,gsm8k \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path results/llama-3.1-8b/ \
  --log_samples

The --log_samples flag

Adding --log_samples saves every input prompt and model output to a JSONL file alongside the results JSON. This is essential for debugging. When a model scores unexpectedly low, the per-sample logs let you see exactly which questions it missed and what it predicted.
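
To triage a low score quickly, a short script can scan the per-sample JSONL for misses. The file name pattern and field names below (doc, target, filtered_resps, exact_match) are assumptions based on typical lm-eval sample logs; check them against your own output.

Inspect per-sample logs (Python)

import json
from pathlib import Path

# Assumed layout: --log_samples writes samples_<task>_<timestamp>.jsonl under --output_path,
# possibly nested in a per-model subdirectory. Adjust the pattern to match your run.
samples_file = sorted(Path("results/llama-3.1-8b/").rglob("samples_gsm8k*.jsonl"))[-1]

misses = []
with open(samples_file) as f:
    for line in f:
        sample = json.loads(line)
        # Field names are assumptions; common keys are "doc", "target",
        # "filtered_resps", and the per-sample metric (here "exact_match").
        if sample.get("exact_match", 1) == 0:
            misses.append(sample)

print(f"{len(misses)} missed examples")
for s in misses[:5]:
    print("Question: ", s["doc"].get("question", ""))
    print("Expected: ", s["target"])
    print("Predicted:", s["filtered_resps"])
    print("---")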

For faster evaluation on large models, vLLM gives significant speedups through continuous batching and PagedAttention. The syntax is nearly identical.

vLLM backend for faster evaluation

lm-eval \
  --model vllm \
  --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=auto,gpu_memory_utilization=0.8,tensor_parallel_size=2 \
  --tasks mmlu,hellaswag,gsm8k \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path results/llama-3.1-8b-vllm/

Key Benchmarks Explained

The harness ships with dozens of benchmarks. Not all are equally useful for every use case. Here are the ones that matter most and what they actually test.

| Benchmark | What It Tests | Format | Typical Use |
|---|---|---|---|
| MMLU | Broad knowledge across 57 subjects | 4-way multiple choice | General model capability |
| MMLU-Pro | Harder MMLU with 10 answer choices, reasoning-focused | 10-way multiple choice | Open LLM Leaderboard v2 |
| HellaSwag | Commonsense reasoning about physical situations | 4-way completion | Language understanding baseline |
| ARC-Challenge | Grade-school science questions (hard subset) | 4-way multiple choice | Reasoning ability |
| GSM8K | Grade-school math word problems | Generative (numeric answer) | Math reasoning |
| GPQA | PhD-level science questions | 4-way multiple choice | Expert knowledge |
| TruthfulQA | Questions designed to elicit common misconceptions | Multiple choice + generative | Factual accuracy |
| BBH | 23 hard tasks from BIG-Bench | Mixed formats | Challenging reasoning |
| IFEval | Instruction following with verifiable constraints | Generative with rubric | Instruction compliance |

Open LLM Leaderboard v2 Tasks

The Open LLM Leaderboard v2 uses a specific set of six benchmarks. You can run all of them in one shot with the leaderboard task group.

Run the full Open LLM Leaderboard v2 suite

# Run all 6 leaderboard tasks at once
lm-eval \
  --model hf \
  --model_args pretrained=your-model,dtype=bfloat16,device_map=auto \
  --tasks leaderboard \
  --batch_size auto \
  --output_path results/leaderboard/

# The "leaderboard" group includes:
# - leaderboard_bbh (BIG-Bench Hard)
# - leaderboard_gpqa (PhD-level science)
# - leaderboard_ifeval (instruction following)
# - leaderboard_math_hard (math reasoning)
# - leaderboard_mmlu_pro (broad knowledge, 10-way)
# - leaderboard_musr (multistep soft reasoning)

Multiple Choice vs Generative Tasks

This distinction matters when choosing your model backend. Multiple-choice tasks like MMLU and HellaSwag use loglikelihood scoring: the model scores each candidate answer, and the one with the highest log probability wins. Generative tasks like GSM8K ask the model to produce free-form text, then extract and score the answer.

Loglikelihood tasks require access to token probabilities. The HuggingFace and vLLM backends provide this. OpenAI-compatible API endpoints typically do not. If you are evaluating via an API, you are limited to generative tasks unless the API exposes logprobs.
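
To make the loglikelihood mechanism concrete, here is a minimal sketch of multiple-choice scoring with a HuggingFace model. This is not the harness's internal code; the model and question are placeholders, and acc_norm additionally normalizes these scores by answer length.

Sketch: loglikelihood scoring for multiple choice (Python)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Question: What color is a clear daytime sky?\nAnswer:"
choices = [" red", " blue", " green", " yellow"]

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to the continuation tokens."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    cont_start = prompt_ids.shape[1]
    cont_ids = full_ids[0, cont_start:]
    token_logps = log_probs[cont_start - 1:].gather(1, cont_ids.unsqueeze(1))
    return token_logps.sum().item()

scores = [continuation_logprob(prompt, c) for c in choices]
print("Predicted:", choices[scores.index(max(scores))].strip())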

Evaluating via OpenAI-Compatible APIs

You do not need to download model weights to benchmark them. If your model is served behind an OpenAI-compatible API (vLLM, TGI, Ollama, or any commercial endpoint), the local-chat-completions model type connects to it directly.

Evaluate a model served via API

# Start your model server (example with vLLM)
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000

# In another terminal, run the evaluation
lm-eval \
  --model local-chat-completions \
  --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v1/chat/completions,num_concurrent=32,max_retries=3,tokenized_requests=False \
  --tasks gsm8k,ifeval \
  --apply_chat_template \
  --output_path results/api-eval/

API evaluation limitations

The local-chat-completions model type does not support loglikelihood-based tasks. Tasks like HellaSwag (in its default scoring mode) and MMLU (when using log-probability ranking) require token-level logprobs that most chat APIs do not expose. Stick to generative tasks like GSM8K, IFEval, and BBH when evaluating through APIs, or use local-completions if your endpoint supports the completions API with logprobs.
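
If you are not sure whether your endpoint exposes logprobs, a quick probe of the completions endpoint (not chat) settles it. The URL and model name below are placeholders for an OpenAI-compatible server such as vLLM.

Check whether an endpoint returns logprobs (Python)

import requests

# Placeholder endpoint and model; point these at your own server.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "prompt": "The capital of France is",
        "max_tokens": 1,
        "logprobs": 5,  # ask for top-5 token logprobs per position
    },
    timeout=30,
)
choice = resp.json()["choices"][0]
if choice.get("logprobs"):
    print("logprobs exposed: local-completions and loglikelihood tasks are an option")
else:
    print("no logprobs: stick to generative tasks via local-chat-completions")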

For evaluating Ollama models locally:

Evaluate an Ollama model

# Make sure Ollama is running with your model
ollama run llama3.1:8b

# Evaluate via the Ollama API (OpenAI-compatible)
lm-eval \
  --model local-chat-completions \
  --model_args model=llama3.1:8b,base_url=http://localhost:11434/v1/chat/completions,num_concurrent=1,tokenized_requests=False \
  --tasks gsm8k \
  --output_path results/ollama-llama3.1/

Interpreting Results

After a run completes, the harness prints a summary table and saves a JSON file. Understanding the output format is essential for drawing correct conclusions.

Example output table

|    Tasks     |Version|Filter|n-shot| Metric   |Value |   |Stderr|
|--------------|------:|------|-----:|----------|-----:|---|-----:|
|hellaswag     |      1|none  |    10|acc       |0.6153|±  |0.0049|
|              |       |none  |    10|acc_norm  |0.8012|±  |0.0040|
|mmlu          |      1|none  |     5|acc       |0.6234|±  |0.0038|
|arc_challenge |      1|none  |    25|acc       |0.5427|±  |0.0146|
|              |       |none  |    25|acc_norm  |0.5768|±  |0.0144|
|gsm8k         |      3|none  |     5|exact_match|0.4551|±  |0.0137|

Key Fields

acc is raw accuracy: what fraction of questions the model answered correctly. acc_norm is length-normalized accuracy, which adjusts for the fact that longer answer choices get higher raw log probabilities. For multiple-choice tasks, acc_norm is usually the more meaningful metric. Stderr is standard error, giving you a confidence interval on the score. n-shot tells you how many few-shot examples were in the prompt.

The JSON output file contains the full results with additional detail:

JSON results structure

{
  "results": {
    "hellaswag": {
      "acc,none": 0.6153,
      "acc_stderr,none": 0.0049,
      "acc_norm,none": 0.8012,
      "acc_norm_stderr,none": 0.0040,
      "alias": "hellaswag"
    },
    "mmlu": {
      "acc,none": 0.6234,
      "acc_stderr,none": 0.0038,
      "alias": "mmlu"
    }
  },
  "versions": {
    "hellaswag": 1,
    "mmlu": 1
  },
  "config": {
    "model": "hf",
    "model_args": "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16",
    "batch_size": "auto",
    "num_fewshot": 5
  }
}

Always report task versions

Task versions are reported in the versions field. When publishing results, always include the version numbers. A task version bump means the prompt format or scoring changed, so results from different versions are not directly comparable.
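
One low-effort habit is to pull the versions straight out of the results JSON whenever you write up numbers. The path below is a placeholder.

Extract task versions from a results file (Python)

import json

# Placeholder path; point at your own results_*.json file.
with open("results/llama-3.1-8b/results.json") as f:
    data = json.load(f)

# Pair every task with its version so readers know which variant produced the score.
for task, version in sorted(data["versions"].items()):
    print(f"{task}: version {version}")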

Comparing Models

The harness does not have a built-in comparison tool, but the consistent JSON output format makes comparison straightforward. Run the same tasks with the same num_fewshot on each model, then compare the scores.

Evaluate multiple models for comparison

# Model A
lm-eval \
  --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16,device_map=auto \
  --tasks mmlu,hellaswag,gsm8k,arc_challenge \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path results/llama-3.1-8b/

# Model B
lm-eval \
  --model hf \
  --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.3,dtype=bfloat16,device_map=auto \
  --tasks mmlu,hellaswag,gsm8k,arc_challenge \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path results/mistral-7b/

# Model C (via API)
lm-eval \
  --model local-chat-completions \
  --model_args model=gpt-4o-mini,base_url=https://api.openai.com/v1/chat/completions,num_concurrent=16,tokenized_requests=False \
  --tasks gsm8k \
  --output_path results/gpt-4o-mini/

For systematic comparison, write a small script that reads the JSON files and builds a table:

Compare results across models (Python)

import json
from pathlib import Path

models = {
    "Llama 3.1 8B": "results/llama-3.1-8b",
    "Mistral 7B": "results/mistral-7b",
}

tasks = ["mmlu", "hellaswag", "gsm8k", "arc_challenge"]

print(f"{'Model':<20} " + " ".join(f"{t:<15}" for t in tasks))
print("-" * 80)

for model_name, result_dir in models.items():
    # Find the newest results JSON (timestamped filenames sort chronologically).
    # Recent lm-eval versions may nest results in a per-model subdirectory, so search recursively.
    result_files = sorted(Path(result_dir).rglob("results_*.json"))
    if not result_files:
        continue
    with open(result_files[-1]) as f:
        data = json.load(f)

    scores = []
    for task in tasks:
        result = data.get("results", {}).get(task, {})
        # Prefer acc_norm for multiple-choice, exact_match for generative
        score = result.get("acc_norm,none", result.get("acc,none", result.get("exact_match,none", "N/A")))
        scores.append(f"{score:<15.4f}" if isinstance(score, float) else f"{score:<15}")

    print(f"{model_name:<20} " + " ".join(scores))

Statistical significance

The stderr values tell you whether a score difference is meaningful. If Model A scores 0.620 with stderr 0.004 and Model B scores 0.625 with stderr 0.004 on the same benchmark, the 0.005 gap is smaller than the combined standard error (about 0.006), so the difference is not statistically significant. For reliable comparisons, you need either a large gap or a large number of evaluation examples to shrink the stderr.
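
A rough rule of thumb, sketched below, is to compare the gap to the combined standard error of the two scores and treat anything under about two combined standard errors as noise. The file paths are placeholders, and this treats the two runs as independent; paired per-question comparisons would be more sensitive.

Check whether a score gap beats the noise (Python)

import json
import math

def load_metric(path, task, metric="acc,none"):
    """Return (score, stderr) for one task from an lm-eval results JSON."""
    with open(path) as f:
        results = json.load(f)["results"][task]
    return results[metric], results[metric.replace(",", "_stderr,")]

# Placeholder paths; point at your own results files.
a, a_se = load_metric("results/llama-3.1-8b/results.json", "mmlu")
b, b_se = load_metric("results/mistral-7b/results.json", "mmlu")

gap = abs(a - b)
combined_se = math.sqrt(a_se ** 2 + b_se ** 2)
print(f"gap = {gap:.4f}, two combined stderr = {2 * combined_se:.4f}")
print("likely meaningful" if gap > 2 * combined_se else "within noise")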

Custom Task Creation

The harness is not limited to built-in benchmarks. You can add any evaluation task by writing a YAML configuration file. This is how most new benchmarks get integrated, and how you build internal evaluations for domain-specific use cases.

YAML Task Structure

A task config specifies where the data comes from, how to format prompts, what the target answer is, and which metric to compute. Prompts use Jinja2 templates with access to all fields in your dataset.

Custom multiple-choice task (YAML)

task: my_domain_mcq
group: custom_benchmarks
dataset_path: json
dataset_kwargs:
  data_files: "path/to/evaluation_data.jsonl"
output_type: multiple_choice
validation_split: train
doc_to_text: "{{question}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_target: "{{answer}}"
doc_to_choice:
  - "A"
  - "B"
  - "C"
  - "D"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
num_fewshot: 5
metadata:
  version: 1.0

Custom generative task (YAML)

task: my_qa_benchmark
group: custom_benchmarks
dataset_path: json
dataset_kwargs:
  data_files: "path/to/qa_data.jsonl"
output_type: generate_until
validation_split: test
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
generation_kwargs:
  until:
    - "\n"
    - "Question:"
  max_gen_toks: 256
  temperature: 0.0
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
num_fewshot: 3
metadata:
  version: 1.0

Using Template Inheritance

Tasks can inherit from other YAML configs using the include directive. Change only what differs. This is how GSM8K-CoT-Self-Consistency is built on top of the base GSM8K-CoT config.

Inherit from an existing task

# my_mmlu_variant.yaml
include: mmlu
task: my_mmlu_variant
doc_to_text: "You are an expert. {{question}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nThe correct answer is:"
metadata:
  version: 1.0

Run a custom task

# Point to the directory containing your YAML files
lm-eval \
  --model hf \
  --model_args pretrained=your-model,dtype=bfloat16,device_map=auto \
  --tasks my_domain_mcq \
  --include_path ./custom_tasks/ \
  --output_path results/custom/

Dataset format

Your JSONL data file should have one JSON object per line. Each object needs the fields referenced in your Jinja2 templates. For the multiple-choice example above, each line would look like: {"question": "What is X?", "choices": ["A opt", "B opt", "C opt", "D opt"], "answer": "B"}
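
As a concrete starting point, a few lines of Python write a file in that shape. The questions are placeholders; swap in your own evaluation data.

Write a JSONL dataset for a custom task (Python)

import json

# Hypothetical examples matching the multiple-choice YAML above.
examples = [
    {
        "question": "Which HTTP status code means Not Found?",
        "choices": ["200", "301", "404", "500"],
        "answer": "C",
    },
    {
        "question": "Which command lists all available lm-eval tasks?",
        "choices": ["lm-eval --tasks list", "lm-eval --list", "lm-eval tasks", "lm-eval -t"],
        "answer": "A",
    },
]

with open("path/to/evaluation_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")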

Evaluating Coding Models

Code generation evaluation has specific requirements that general language benchmarks do not cover. The model needs to produce executable code, and the evaluation needs to run that code against test cases.

lm-eval for Code Tasks

The harness includes some code-related tasks, but for thorough code evaluation, the BigCode Evaluation Harness is purpose-built. It supports HumanEval (164 Python problems), HumanEval+ (with additional test cases), MBPP, MBPP+, APPS, and DS-1000, all with sandboxed code execution.

| Feature | lm-eval | BigCode Evaluation Harness |
|---|---|---|
| HumanEval | Basic support | Full pass@k with execution |
| Multi-language | Limited | HumanEvalPack: 6 languages, MultiPL-E: 18 languages |
| FIM (fill-in-middle) | Not supported | Supported for completion and insertion |
| pass@k metrics | Not standard | pass@1, pass@10, pass@100 with proper estimation |
| Sandboxed execution | No | Yes, runs generated code in isolation |
| General benchmarks | 60+ tasks: MMLU, HellaSwag, GSM8K, etc. | Code-only |

BigCode Evaluation Harness for code models

# Install from source (the commands below run main.py from the repo)
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .

# Evaluate on HumanEval with pass@1
accelerate launch main.py \
  --model meta-llama/CodeLlama-7b-hf \
  --tasks humaneval \
  --allow_code_execution \
  --do_sample True \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10

# Evaluate on multiple code benchmarks
accelerate launch main.py \
  --model meta-llama/CodeLlama-7b-hf \
  --tasks humaneval,mbpp \
  --allow_code_execution \
  --do_sample True \
  --temperature 0.2 \
  --n_samples 200 \
  --batch_size 50

For a combined evaluation strategy, run general capability benchmarks (MMLU, HellaSwag, GSM8K) through lm-eval and code-specific benchmarks (HumanEval, MBPP) through the BigCode harness. This gives you both breadth and depth.

CLI Reference

The most important flags, organized by how often you will use them.

| Flag | Description | Example |
|---|---|---|
| --model | Model backend type | hf, vllm, local-chat-completions |
| --model_args | Comma-separated key=value model config | pretrained=gpt2,dtype=bfloat16 |
| --tasks | Comma-separated task names or groups | mmlu,hellaswag,leaderboard |
| --num_fewshot | Number of few-shot examples in prompt | 5 |
| --batch_size | Batch size; use auto for automatic detection | auto, 16, auto:4 |
| --output_path | Directory for JSON results and sample logs | results/my-model/ |
| --log_samples | Save per-example inputs and outputs | (flag, no value) |
| --limit | Cap number of examples per task (for testing) | 100, 0.1 |
| --device | Target device for model | cuda, cuda:0, cpu, mps |
| --include_path | Directory with custom task YAML files | ./custom_tasks/ |
| --gen_kwargs | Generation parameters for generative tasks | temperature=0.0,top_p=0.95 |
| --apply_chat_template | Format prompts using model's chat template | (flag, no value) |

Common command patterns

# Quick test: run 100 examples to verify setup
lm-eval --model hf --model_args pretrained=gpt2 --tasks hellaswag --limit 100

# Full evaluation with vLLM on multi-GPU
lm-eval --model vllm \
  --model_args pretrained=meta-llama/Meta-Llama-3.1-70B-Instruct,tensor_parallel_size=4,dtype=auto \
  --tasks leaderboard \
  --batch_size auto \
  --output_path results/llama-70b/ \
  --log_samples

# Evaluate with custom generation settings
lm-eval --model hf \
  --model_args pretrained=your-model,dtype=bfloat16,device_map=auto \
  --tasks gsm8k \
  --gen_kwargs temperature=0.0,top_p=1.0 \
  --output_path results/

Frequently Asked Questions

What is the LLM Eval Harness?

The LM Evaluation Harness is an open-source framework by EleutherAI for benchmarking language models on 60+ academic tasks with hundreds of subtask variants. It powers Hugging Face's Open LLM Leaderboard and is the most widely used LLM evaluation tool in the research community. Install it with pip install lm-eval, then run lm-eval --model hf --model_args pretrained=your-model --tasks mmlu,hellaswag.

How do I install lm-eval-harness?

Run pip install lm-eval for the base package. For vLLM support, use pip install 'lm-eval[vllm]'. For the latest development version, clone the repo and install with pip install -e '.[dev]'. Since v0.4.7, the base install no longer requires torch or transformers, keeping it lightweight when evaluating API-served models.

Can I evaluate models served via an API?

Yes. Use --model local-chat-completions with --model_args base_url=http://your-server/v1/chat/completions,model=your-model-name. This works with any OpenAI-compatible API. The limitation is that chat APIs typically do not expose logprobs, so you are restricted to generative tasks (GSM8K, IFEval, BBH) rather than loglikelihood-based tasks (default HellaSwag, MMLU).

What benchmarks are available?

The harness includes 60+ major benchmarks with hundreds of subtask variants. The most commonly used: MMLU (broad knowledge), HellaSwag (commonsense), ARC (science reasoning), GSM8K and MATH (math), GPQA (PhD-level science), TruthfulQA (factual accuracy), WinoGrande (coreference), BBH (hard reasoning), IFEval (instruction following). Run lm-eval --tasks list to see all available tasks.

How do I create a custom benchmark?

Write a YAML config file specifying your dataset path, Jinja2 prompt template (doc_to_text), target answer (doc_to_target), and metrics. Place it in a directory and reference it with --include_path. For most standard task types (multiple choice, generative Q&A), no Python code is needed. For custom scoring logic, subclass the Task class.

How do I compare results across models?

Run identical --tasks and --num_fewshot configurations on each model, saving results with --output_path. Each run produces a JSON file with per-task scores and standard errors. Compare the acc or acc_norm values. Check that score differences exceed the reported stderr before concluding one model is better.

What is the difference between lm-eval and the BigCode Evaluation Harness?

lm-eval covers general language benchmarks: reasoning, knowledge, math, instruction following. The BigCode Evaluation Harness is purpose-built for code generation with HumanEval, MBPP, APPS, and DS-1000. BigCode supports pass@k metrics, multi-language evaluation (18 languages via MultiPL-E), fill-in-middle mode, and sandboxed code execution. Use both for comprehensive coding model evaluation.

Evaluation Picks the Model. Morph Runs It.

The eval harness tells you which model wins your benchmarks. Morph's fast-apply infrastructure works with whichever model comes out on top, applying LLM-generated code edits at 10,500 tok/s with 97.3% accuracy. Benchmark, choose, deploy.