Llama 3.1 8B: Specs, Benchmarks, Context Window, and How to Run It

Llama 3.1 8B is Meta's 8-billion-parameter open-weight model with a 128K token context window, released July 23, 2024 under the Llama 3.1 Community License. The Instruct variant scores 69.4 on MMLU and 72.6 on HumanEval pass@1. It runs on a single consumer GPU through Ollama or vLLM, and it works as the cheap tier in a model-routing setup like Morph's.

Parameters	8B
Context window (tokens)	128K
MMLU (Instruct)	69.4
HumanEval pass@1 (Instruct)	72.6

What Is Llama 3.1 8B

Llama 3.1 8B is the smallest model in Meta's Llama 3.1 family. The family also ships a 70B model and a 405B model. The 8B has 8 billion parameters, a 128K token context window, and was released on July 23, 2024 under the Llama 3.1 Community License, per Meta's model card.

Meta pretrained the model on more than 15 trillion tokens, with a knowledge cutoff of December 2023. It supports eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. The architecture is an optimized transformer with Grouped-Query Attention (GQA), which shares each key/value head across a group of query heads and so shrinks the KV cache that inference has to hold at long context lengths.
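
A back-of-the-envelope KV-cache estimate shows why GQA matters at long context. The sketch below assumes the shape published in the model's Hugging Face config (32 layers, 32 query heads, 8 key/value heads, head dimension 128), a 16-bit cache, and a full window of 128 × 1024 tokens. None of those figures are stated in this article, and the type and function names are illustrative.

```typescript
// Rough KV-cache sizing for a decoder-only transformer.
// Shape values are assumptions taken from the published Llama 3.1 8B config,
// not from this article: 32 layers, 32 query heads, 8 KV heads, head dim 128.
interface AttentionShape {
  layers: number;
  kvHeads: number;
  headDim: number;
  bytesPerValue: number; // 2 for FP16/BF16
}

// Bytes of cache stored per token: one key vector and one value vector
// per KV head, per layer.
function kvBytesPerToken(s: AttentionShape): number {
  return 2 * s.layers * s.kvHeads * s.headDim * s.bytesPerValue;
}

function kvCacheGiB(s: AttentionShape, tokens: number): number {
  return (kvBytesPerToken(s) * tokens) / (1024 * 1024 * 1024);
}

const gqa: AttentionShape = { layers: 32, kvHeads: 8, headDim: 128, bytesPerValue: 2 };
// Hypothetical baseline: the same model with one KV head per query head (no GQA).
const mha: AttentionShape = { layers: 32, kvHeads: 32, headDim: 128, bytesPerValue: 2 };

const fullWindow = 128 * 1024; // assumed full window of 131,072 tokens

console.log(kvBytesPerToken(gqa)); // 131072 bytes (128 KiB) per token
console.log(kvCacheGiB(gqa, fullWindow)); // 16
console.log(kvCacheGiB(mha, fullWindow)); // 64
```

Under these assumptions, one full-window sequence needs about 16 GiB of cache with GQA and four times that without it.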

It exists in two forms. The base model (Llama-3.1-8B) is a next-token predictor meant for fine-tuning. The instruction-tuned model (Llama-3.1-8B-Instruct) follows instructions and holds conversations. When people say "Llama 3.1 8B" in the context of chat or agents, they almost always mean the Instruct variant.

Specs at a Glance

Every number below comes from Meta's Llama-3.1-8B-Instruct model card on Hugging Face.

Attribute	Value
Parameters	8 billion (8B)
Context window	128K tokens
Developer	Meta
Release date	July 23, 2024
License	Llama 3.1 Community License
Training tokens	15T+
Knowledge cutoff	December 2023
Architecture	Transformer with Grouped-Query Attention (GQA)
Languages	English, German, French, Italian, Portuguese, Hindi, Spanish, Thai
Variants	Base, Instruct

Context Window

Llama 3.1 8B has a 128K token context window. That holds for both the base and Instruct variants, and the 70B and 405B siblings share the same window.

This was a large jump from the previous generation. Llama 3 8B (released April 2024) had only an 8K context window. Llama 3.1 raised that to 128K, a 16x increase, which is what made the model practical for long documents, multi-file code context, and longer agent histories.

What 128K tokens fits

128K tokens is roughly 300 pages of text, or a medium codebase's worth of source files. It is enough for most retrieval-augmented and agent workloads. Frontier and newer open models now push to 200K, 256K, and 1M, but 128K covers the majority of real use. For a deeper comparison, see LLM Context Window.
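
The page estimate can be reproduced with two rules of thumb. In this sketch, the words-per-token and words-per-page ratios and the 4-characters-per-token heuristic are assumptions for English prose, not measurements of the Llama tokenizer, and the function names are hypothetical.

```typescript
// Back-of-the-envelope conversion from tokens to pages of English prose.
// Both ratios are assumed rules of thumb, not tokenizer measurements.
const WORDS_PER_TOKEN = 0.75; // assumption: about 4 tokens for every 3 words
const WORDS_PER_PAGE = 320; // assumption: a fairly dense printed page

function tokensToPages(tokens: number): number {
  return Math.round((tokens * WORDS_PER_TOKEN) / WORDS_PER_PAGE);
}

// Rough pre-flight check: does a prompt plus the reply budget fit in the window?
// Uses the common ~4 characters per token heuristic; count with the real
// tokenizer before relying on the answer.
function roughlyFits(promptChars: number, replyTokens: number, windowTokens: number = 128000): boolean {
  const estimatedPromptTokens = Math.ceil(promptChars / 4);
  return estimatedPromptTokens + replyTokens <= windowTokens;
}

console.log(tokensToPages(128000)); // 300
console.log(roughlyFits(400000, 2000)); // true  (about 100K prompt tokens + 2K reply)
console.log(roughlyFits(600000, 2000)); // false (about 150K prompt tokens + 2K reply)
```

Source code usually tokenizes less efficiently than prose, so a codebase tends to fill the window faster than the character heuristic suggests.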

License and Cost

Llama 3.1 8B is free to download and use under the Llama 3.1 Community License, and commercial use is allowed. The weights are open. You can run them on your own hardware, fine-tune them, and ship them in a product without paying Meta.

There is one restriction on scale worth stating precisely. If your product, or your affiliates' products, had more than 700 million monthly active users in the calendar month before the Llama 3.1 release date, you must request a separate license from Meta, which Meta may grant at its discretion. This clause targets the largest platforms. For nearly every developer and company, it does not apply, and the model is effectively free. The license also carries attribution and acceptable-use terms, so read it in full before shipping.

Beyond the license, the practical cost of 8B is low because it is small. It fits on a single consumer GPU. Hosted providers price it at a fraction of frontier models, often under $0.20 per million tokens, versus $15 per million for the most expensive frontier tiers. The exact price depends on the provider; check theirs before quoting it.
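
To make the price gap concrete, here is a small cost calculation. The two prices are the illustrative figures from the paragraph above, the 500-million-token monthly workload is hypothetical, and real bills usually price input and output tokens separately.

```typescript
// Cost of a token volume at a flat per-million-token price.
// Prices are the article's illustrative figures, not quotes from a provider.
const SMALL_MODEL_USD_PER_MTOK = 0.2;
const FRONTIER_USD_PER_MTOK = 15;

function costUsd(tokens: number, usdPerMillionTokens: number): number {
  return (tokens / 1000000) * usdPerMillionTokens;
}

// Hypothetical workload: 500 million tokens per month.
const monthlyTokens = 500000000;

const smallCost = costUsd(monthlyTokens, SMALL_MODEL_USD_PER_MTOK); // about $100
const frontierCost = costUsd(monthlyTokens, FRONTIER_USD_PER_MTOK); // $7,500

console.log(smallCost.toFixed(2), frontierCost.toFixed(2));
console.log((frontierCost / smallCost).toFixed(0) + "x price gap"); // 75x price gap
```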

Instruct vs Base

The two variants serve different purposes. Choosing the wrong one is a common early mistake.

Llama 3.1 8B (base)

A raw next-token predictor pretrained on 15T+ tokens. It does not reliably follow instructions or stop at the right point. Use it as a starting point for your own fine-tuning, not for direct chat. Scores 66.7 on MMLU.

Llama 3.1 8B Instruct

The base model after instruction tuning. It follows instructions, holds multi-turn conversations, and uses tools. This is what you want for chat, agents, and applications. Scores 69.4 on MMLU.

If you are building a chatbot, an agent, or an extraction pipeline, use Instruct. If you have a domain dataset and want full control over how the model is tuned, fine-tune the base. The instruction tuning adds about 2.7 points of MMLU on top of the base, and far more in usability.
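
One concrete difference is the prompt format. The Instruct model is tuned on conversations wrapped in special header and end-of-turn tokens, while the base model simply continues whatever text it is given. The sketch below is a simplified rendering of the Llama 3 family chat format, with hypothetical function and type names. In practice the tokenizer's built-in chat template or the serving layer builds this string for you, and the official template adds details omitted here.

```typescript
// Simplified rendering of the Llama 3 family chat format, for illustration only.
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

function renderLlama3Prompt(messages: ChatMessage[]): string {
  let rendered = "<|begin_of_text|>";
  for (const m of messages) {
    // Each turn: a role header, a blank line, the text, then an end-of-turn token.
    rendered += "<|start_header_id|>" + m.role + "<|end_header_id|>\n\n" + m.content + "<|eot_id|>";
  }
  // Leave an open assistant header so the model writes the next turn
  // and closes it with <|eot_id|>.
  return rendered + "<|start_header_id|>assistant<|end_header_id|>\n\n";
}

const renderedPrompt = renderLlama3Prompt([
  { role: "system", content: "You are a concise assistant." },
  { role: "user", content: "Summarize GQA in one sentence." },
]);
console.log(renderedPrompt);
```

Sending plain text without this structure to the Instruct model, or sending this structure to the base model, is a common cause of rambling or never-ending output.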

Benchmarks

These scores are from Meta's Llama-3.1-8B-Instruct model card. They describe the Instruct variant unless noted.

Benchmark	Score	Measures
MMLU	69.4	General knowledge across 57 subjects
HumanEval (pass@1)	72.6	Python code generation
MBPP++ (pass@1)	72.8	Python programming problems
GSM-8K (CoT)	84.5	Grade-school math word problems
MATH (CoT)	51.9	Competition math
MMLU (base model)	66.7	General knowledge, pre-instruction-tuning
ARC-Challenge (base)	79.7	Science reasoning, base model
CommonSenseQA (base)	75.0	Commonsense reasoning, base model

On code and grade-school math the 8B is strong for its size: 72.6 on HumanEval and 84.5 on GSM-8K are usable production numbers. On competition math (51.9 on MATH) it is weaker. Hard, multi-step reasoning is where the parameter gap shows and an 8B model trails a frontier model.
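
For readers unfamiliar with pass@1: it is the share of problems where a generated solution passes the unit tests on the first attempt. The sketch below implements the unbiased pass@k estimator commonly used for HumanEval-style benchmarks. This article does not say how Meta's evaluation harness samples, so treat it as an explanation of the metric rather than a reproduction of Meta's setup; the function names are hypothetical.

```typescript
// Unbiased pass@k estimator for one problem with n samples, c of which pass:
//   pass@k = 1 - C(n - c, k) / C(n, k)
// With k = 1 this reduces to c / n.
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1; // every size-k subset contains a passing sample
  let failAll = 1;
  // C(n - c, k) / C(n, k) as a running product, avoiding large factorials.
  for (let i = 0; i < k; i++) {
    failAll *= (n - c - i) / (n - i);
  }
  return 1 - failAll;
}

// A benchmark score is the mean of the per-problem estimates.
function meanPassAtK(results: Array<{ n: number; c: number }>, k: number): number {
  const total = results.reduce((sum, r) => sum + passAtK(r.n, r.c, k), 0);
  return total / results.length;
}

// Hypothetical run: four problems, one sample each, three of them pass.
const run = [{ n: 1, c: 1 }, { n: 1, c: 1 }, { n: 1, c: 1 }, { n: 1, c: 0 }];
console.log(meanPassAtK(run, 1)); // 0.75
```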

Read these as a small-model baseline, not a frontier comparison

An 8B model scoring 69.4 on MMLU is impressive for its size, but it is not a frontier score. The 405B sibling scores 87.3 on the same benchmark. The point of the 8B is cost and speed at acceptable quality, not topping a leaderboard. Match it to tasks where 69.4 MMLU is enough.

Llama 3.1 8B vs 70B vs 405B

The three Llama 3.1 models share an architecture and a 128K context window. They differ in scale, and that difference is the whole decision: how much capability you need versus how much you want to pay and how fast you want it.

Attribute	8B	70B	405B
Parameters	8B	70B	405B
Context window	128K	128K	128K
MMLU (Instruct)	69.4	83.6	87.3
HumanEval (Instruct)	72.6	80.5	89.0
Hardware	1 consumer GPU	Multi-GPU	Multi-node cluster
Relative cost	Lowest	Medium	Highest
Best for	High-volume simple tasks	Balanced quality/cost	Hard reasoning, frontier parity

The 8B is the cheap tier: simple tasks, high volume, low latency, single GPU. The 405B is the frontier tier: Meta positioned it as the first openly available model to rival closed frontier models, and it scores 87.3 on MMLU and 89.0 on HumanEval. The 70B sits between them, at 83.6 on MMLU and 80.5 on HumanEval, for teams that want better quality than 8B without 405B's infrastructure cost. The 70B and 405B figures are the Instruct scores listed alongside the 8B in Meta's model card.

How to Run It

There are three common paths: Ollama for the fastest local start, vLLM for production serving with an OpenAI-compatible API, and a hosted API if you do not want to manage GPUs.

Ollama (local, fastest start)

Ollama pulls a quantized build that fits in roughly 8GB of RAM or VRAM. One command and you are talking to the model.

Run Llama 3.1 8B with Ollama

# Install Ollama from https://ollama.com, then:
ollama run llama3.1:8b

# Or pull the model and call it from code:
ollama pull llama3.1:8b

# Ollama exposes an OpenAI-compatible endpoint on localhost:11434
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs"}]
  }'

vLLM (production serving, OpenAI-compatible)

vLLM serves the full model with an OpenAI-compatible API and high throughput. The 16-bit weights alone take about 16GB of VRAM, and the KV cache needs memory on top of that, so budget for a 24GB or larger GPU; 4-bit quantization brings the weights under 8GB.

Serve Llama 3.1 8B Instruct with vLLM

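pip install vllm

# The weights are gated on Hugging Face: accept the license on the model page
# and authenticate (for example with `huggingface-cli login`) before the first run.

# Start an OpenAI-compatible server on port 8000.
# Lower --max-model-len (for example to 32768) if the GPU cannot hold
# the KV cache for the full window.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 128000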

# Call it like any OpenAI endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain GQA in one sentence"}]
  }'
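
Before picking a --max-model-len value, it helps to estimate how much context a given GPU can hold. The sketch below adds weight memory to KV-cache memory. The per-token cache size assumes the published model config (32 layers, 8 KV heads, head dimension 128, 16-bit cache), which this article does not state, and the estimate ignores activations and framework overhead, so treat the results as optimistic upper bounds on context length.

```typescript
// Rough VRAM budget: weights plus KV cache, nothing else.
const PARAMS = 8e9; // 8 billion parameters, per the spec table
// Assumption from the published model config: 2 (K and V) x 32 layers
// x 8 KV heads x 128 head dim x 2 bytes = 128 KiB of cache per token.
const KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2;
const GB = 1e9;

function weightsGB(bitsPerParam: number): number {
  return (PARAMS * bitsPerParam) / 8 / GB;
}

// Largest context (in tokens) whose KV cache fits next to the weights.
function maxContextTokens(vramGB: number, bitsPerParam: number): number {
  const freeBytes = (vramGB - weightsGB(bitsPerParam)) * GB;
  return Math.max(0, Math.floor(freeBytes / KV_BYTES_PER_TOKEN));
}

console.log(weightsGB(16)); // 16 (16-bit weights)
console.log(weightsGB(4)); // 4  (4-bit weights, before quantization metadata)
console.log(maxContextTokens(24, 16)); // 61035 tokens on a 24 GB card
console.log(maxContextTokens(80, 16)); // 488281, so the full 128K window fits
```

By this estimate a 24 GB card serving 16-bit weights cannot hold one full-window sequence, which is why a lower context limit or quantized weights are the usual choice on consumer hardware.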

Hosted, OpenAI-compatible API

If you do not want to run GPUs, call an 8B-class model through a hosted OpenAI-compatible API. The same client code works whether you point it at a local vLLM server, a hosted provider, or Morph's router at api.morphllm.com/v1, which is OpenAI-compatible and can send easy turns to small models for you.

OpenAI-compatible client (works against any of the above)

import OpenAI from "openai";

// Point baseURL at vLLM, a hosted provider, or api.morphllm.com/v1.
// The model name and baseURL are the only things that change.
const client = new OpenAI({
  apiKey: process.env.API_KEY,
  baseURL: "https://api.morphllm.com/v1",
});

const res = await client.chat.completions.create({
  model: "auto", // let the router pick a tier, or name a specific model
  messages: [
    { role: "user", content: "Rename the variable userData to userRecord" },
  ],
});

console.log(res.choices[0].message.content);
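
One way to keep that switch in a single place is a small table of backends. The sketch below uses only the base URLs and model names that appear in this article's examples; the type names, the table, and the chatRequest helper are hypothetical, and it builds the request without sending it.

```typescript
// One table of OpenAI-compatible backends, keyed by a short name.
type Backend = "ollama" | "vllm" | "morph";

interface BackendConfig {
  baseURL: string;
  model: string;
  needsApiKey: boolean; // local servers accept a placeholder key by default
}

const BACKENDS: Record<Backend, BackendConfig> = {
  ollama: { baseURL: "http://localhost:11434/v1", model: "llama3.1:8b", needsApiKey: false },
  vllm: { baseURL: "http://localhost:8000/v1", model: "meta-llama/Llama-3.1-8B-Instruct", needsApiKey: false },
  morph: { baseURL: "https://api.morphllm.com/v1", model: "auto", needsApiKey: true },
};

// Build the URL and JSON body for a chat completion against the chosen backend.
function chatRequest(backend: Backend, userMessage: string): { url: string; body: string } {
  const cfg = BACKENDS[backend];
  return {
    url: cfg.baseURL + "/chat/completions",
    body: JSON.stringify({
      model: cfg.model,
      messages: [{ role: "user", content: userMessage }],
    }),
  };
}

console.log(chatRequest("vllm", "Explain GQA in one sentence").url);
// http://localhost:8000/v1/chat/completions
```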

The Cheap Tier Argument

Most turns in a coding session or chat application are simple. Renaming a variable. Formatting JSON. Classifying intent. Extracting fields from text. These do not need frontier reasoning, and an 8B model typically produces output as good as a frontier model's, at a fraction of the cost and latency.

That is the case for Llama 3.1 8B as a cheap tier: it is good enough on the common, simple work, and small enough to run cheaply at high volume. The downside is real and worth stating: on hard, multi-step reasoning it trails larger models clearly. Sending a hard architectural refactor to an 8B model produces a worse answer. The fix is not to pick one model for everything; it is to route.

A model router classifies each prompt's difficulty and sends it to the right tier. Easy and cheap turns go to a small 8B-class model. Hard turns go to a frontier model. Morph's router classifies difficulty in ~430ms at about $0.001 per classification and cuts 40-70% of API cost while keeping quality on the hard turns. Morph serves this through one OpenAI-compatible API at api.morphllm.com.
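
The routing idea can be sketched in a few lines. The keyword heuristic below is a toy stand-in for a real difficulty classifier and is not how Morph's router works. All names are hypothetical, and the prices are the illustrative per-million-token figures from the License and Cost section.

```typescript
// Toy difficulty router: a keyword and length heuristic picks the tier.
type Tier = "small" | "frontier";

const HARD_HINTS = ["refactor", "architecture", "design", "prove", "debug"];

function classify(promptText: string): Tier {
  const text = promptText.toLowerCase();
  const looksHard = HARD_HINTS.some((hint) => text.indexOf(hint) !== -1);
  // Long prompts are also treated as hard in this sketch.
  return looksHard || promptText.length > 2000 ? "frontier" : "small";
}

// Blended price per million tokens when a share of traffic goes to the small tier.
function blendedUsdPerMTok(smallShare: number, smallPrice: number, frontierPrice: number): number {
  return smallShare * smallPrice + (1 - smallShare) * frontierPrice;
}

console.log(classify("Rename the variable userData to userRecord")); // small
console.log(classify("Refactor the auth architecture to support multi-tenancy")); // frontier

// If 60% of tokens go to a $0.20 tier instead of a $15 tier:
const blended = blendedUsdPerMTok(0.6, 0.2, 15);
console.log(blended.toFixed(2)); // 6.12, roughly 59% below the all-frontier price
```

The savings figure here is arithmetic under an assumed 60/40 split and excludes the cost of classification itself. Actual savings depend on the real mix of easy and hard turns.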

See LLM Router for how the classification works, and LLM Cost Optimization for the full cost math on routing cheap tiers like 8B against frontier models.

Frequently Asked Questions

How many parameters does Llama 3.1 8B have?

Llama 3.1 8B has 8 billion parameters. It is the smallest model in Meta's Llama 3.1 family, which also includes 70B and 405B models. All three share the same 128K token context window and were released on July 23, 2024.

What is the context window of Llama 3.1 8B?

128K tokens, for both the base and Instruct variants, matching the 70B and 405B siblings. The earlier Llama 3 8B had only an 8K window, so Llama 3.1 was a 16x increase.

Is Llama 3.1 8B free to use?

Yes, it is free to download and use under the Llama 3.1 Community License, including commercial use. The main restriction: if your product or your affiliates' products had more than 700 million monthly active users in the calendar month before the Llama 3.1 release, you must request a separate license from Meta. For nearly all developers and companies, it is free.

How do you run Llama 3.1 8B locally?

The fastest path is Ollama: ollama run llama3.1:8b pulls a quantized build that fits in about 8GB of RAM or VRAM. For production serving with an OpenAI-compatible API, use vLLM: vllm serve meta-llama/Llama-3.1-8B-Instruct. The 16-bit weights take about 16GB of VRAM before the KV cache is counted; 4-bit quantization brings the weights under 8GB.

What is the difference between Llama 3.1 8B and Instruct?

The base Llama 3.1 8B is a next-token predictor for fine-tuning. Llama 3.1 8B Instruct is the same model after instruction tuning, so it follows instructions and holds conversations out of the box. Use Instruct for chat, agents, and applications. Base scores 66.7 on MMLU; Instruct scores 69.4.

How good is Llama 3.1 8B?

The Instruct variant scores 69.4 on MMLU, 72.6 on HumanEval pass@1, 72.8 on MBPP++, 84.5 on GSM-8K, and 51.9 on MATH, per Meta's model card. Strong for an 8B model on code and grade-school math. It trails the 405B sibling (87.3 MMLU, 89.0 HumanEval) on hard reasoning.

When should I use Llama 3.1 8B instead of a larger model?

Use 8B for high-volume, latency-sensitive, cost-sensitive work where most tasks are simple: formatting, boilerplate, simple edits, classification, extraction. Reserve a frontier model for hard reasoning. A model router automates the split, cutting 40-70% of cost while keeping quality on hard turns.
