What Is AI Inference? How Models Turn a Prompt Into Tokens

What AI Inference Is

A trained model is a frozen set of weights. Inference is the act of running new input through those weights to get an output. When you send a prompt and a model replies, that is inference. When a vision model labels an object in a frame, that is inference. When a classifier flags a transaction as fraud, that is inference. The model is not learning during any of this; it is applying what it already learned.

The economics are lopsided. A frontier model is trained once, at enormous cost, and then serves inference for every request for the rest of its life. A model that took weeks and millions of dollars to train will answer billions of inference requests, and the sum of that serving cost dwarfs the training bill. This is why most production AI engineering effort goes into inference efficiency rather than training efficiency.

Once: a model is trained one time
Billions: inference requests it then serves
~90%+: share of lifetime compute spent on inference, not training

Training vs Inference

The two are easy to confuse because both involve running data through a neural network. The difference is direction and frequency.

Training vs Inference

Aspect	Training	Inference
What it does	Adjusts the model's weights	Reads the weights to predict
How often	Once (plus occasional fine-tunes)	Every request, forever
Data	A fixed, curated dataset	New input never seen before
Direction	Forward and backward passes	Forward pass only
Cost shape	Huge one-time capital cost	Ongoing per-request operating cost
Optimized for	Final accuracy	Latency, throughput, cost per token

Fine-tuning blurs the line: it is a smaller training run on top of an existing model. But once a fine-tuned model is deployed, it is back to pure inference. The same split applies to small custom models. Training a lightweight classifier takes minutes; running it on live traffic is inference that has to be fast and cheap, the same constraint covered in real-time model monitoring.
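The split in the table can be sketched with a tiny logistic classifier. This is a minimal illustration, not a production recipe: the data points, learning rate, and function names are all invented for the example. The point is only that `train_step` runs a forward and a backward pass and changes the weights, while `predict` runs a forward pass and leaves them untouched.

```python
import math

def predict(weights, bias, features):
    """Inference: a forward pass that only reads the weights."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def train_step(weights, bias, features, label, lr=0.5):
    """Training: a forward pass, then a backward pass that returns updated weights."""
    p = predict(weights, bias, features)   # forward pass
    error = p - label                      # gradient of the log loss w.r.t. z
    new_weights = [w - lr * error * x for w, x in zip(weights, features)]
    new_bias = bias - lr * error
    return new_weights, new_bias

# Hypothetical toy dataset: (features, label)
data = [([0.0, 0.1], 0), ([0.2, 0.0], 0), ([0.9, 1.0], 1), ([1.0, 0.8], 1)]

weights, bias = [0.0, 0.0], 0.0
for _ in range(200):                       # training: repeated passes over a fixed dataset
    for features, label in data:
        weights, bias = train_step(weights, bias, features, label)

frozen = (list(weights), bias)             # snapshot of the weights at "deployment"
score = predict(weights, bias, [0.95, 0.9])  # inference on an input not in the dataset
```

After the loop, every call to `predict` is pure inference: the weights still equal the `frozen` snapshot no matter how many inputs are scored.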

How LLM Inference Works: Prefill and Decode

For a language model, inference is two phases with opposite performance profiles. Understanding the split is the difference between reading a benchmark correctly and being fooled by it.

Phase 1: Prefill (reading the prompt)

The model ingests your entire prompt at once, in parallel, and builds a key-value (KV) cache of attention state. Because it processes all input tokens together, prefill is compute-bound and fast per token, even for long prompts. Its total time still scales with prompt length, and that time sets the time to first token: the longer the prompt, the longer until the first output appears.

Phase 2: Decode (writing the answer)

The model generates output one token at a time. Each new token depends on every token before it, so decode is strictly sequential. Each step also has to read the model's weights and the KV cache from memory to produce a single token, which makes decode memory-bandwidth-bound, not compute-bound. This is the slow phase. Generation speed in tokens per second is a decode measurement, and it is why output costs more than input.
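The two phases can be sketched with a toy loop. Nothing here is a real transformer: `embed` and `next_token` are hypothetical stand-ins for the key/value computation and for attention, chosen only so the control flow is visible. Prefill builds the cache from the whole prompt in one pass; decode reads the full cache and appends one entry per generated token.

```python
def embed(token_id):
    """Hypothetical stand-in for computing one token's (key, value) pair."""
    return (token_id % 7, token_id % 5)

def prefill(prompt_ids):
    """Process the whole prompt in one pass and build the KV cache.
    A real model does this as a single batched, parallel computation."""
    return [embed(t) for t in prompt_ids]

def next_token(kv_cache):
    """Toy 'attention': the next token depends on every cached entry."""
    return sum(k * v for k, v in kv_cache) % 50

def decode(kv_cache, max_new_tokens):
    """Generate one token per step. Each step reads the full cache,
    emits one token, and appends that token's entry to the cache."""
    out = []
    for _ in range(max_new_tokens):
        tok = next_token(kv_cache)
        out.append(tok)
        kv_cache.append(embed(tok))
    return out

prompt = [11, 42, 7, 19]
cache = prefill(prompt)
generated = decode(cache, max_new_tokens=3)
```

The cache is purely an optimization: recomputing it from scratch over the prompt plus the tokens generated so far yields the same next token, just with more work per step.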

Why the split matters

A model can have fast prefill and slow decode, or the reverse. A coding assistant that reads a huge file (long prefill) but emits a three-line edit (short decode) has a completely different latency profile than a chatbot that gets a short prompt and writes three paragraphs. One "tokens per second" headline number hides this. Always ask whether a quoted speed is prefill, decode, or an average.
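A short calculation shows how a single blended figure hides the split. The prefill and decode rates and the token counts below are made-up round numbers, not measurements of any model; the same hypothetical server is used for both workloads.

```python
def phase_times(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Seconds spent in prefill and in decode for one request."""
    return prompt_tokens / prefill_tps, output_tokens / decode_tps

def blended_tps(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """The 'headline' number: all tokens divided by all time."""
    prefill_s, decode_s = phase_times(prompt_tokens, output_tokens, prefill_tps, decode_tps)
    return (prompt_tokens + output_tokens) / (prefill_s + decode_s)

# Hypothetical rates for one server (assumptions, not benchmarks)
PREFILL_TPS = 5000.0
DECODE_TPS = 50.0

# Coding assistant: huge file in, tiny edit out -> 4.0 s prefill + 2.0 s decode
code_edit = blended_tps(20_000, 100, PREFILL_TPS, DECODE_TPS)   # 3350.0 tok/s

# Chatbot: short prompt in, long answer out -> 0.01 s prefill + 8.0 s decode
chat_reply = blended_tps(50, 400, PREFILL_TPS, DECODE_TPS)      # about 56.2 tok/s
```

Same hardware, same model, and the blended number differs by roughly 60x purely because of the input/output mix.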

The Two Numbers That Matter: TTFT and Throughput

Two metrics describe almost everything a user feels:

Inference Latency Metrics

Metric	What it measures	Driven by
TTFT (time to first token)	Delay before the response starts streaming	Prefill — grows with prompt length
Throughput (tokens/sec)	How fast the rest of the answer streams	Decode — model size, batching, hardware
Total latency	Wall-clock time to a complete answer	TTFT + (output tokens ÷ throughput)

For a chat UI, low TTFT is what makes a model feel responsive, because the user sees text appear quickly even if the full answer takes seconds. For a batch job processing thousands of documents, TTFT is irrelevant and raw throughput is everything. The right model and provider depend on which of these your workload actually cares about. Throughput is also the lever an inference optimization effort moves most.
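Both metrics can be derived from the arrival times of streamed tokens. The sketch below assumes you already have those timestamps (how you collect them depends on your client library); the function name and the trace are hypothetical. Throughput is measured over the streaming portion only, so that total latency equals TTFT plus the remaining tokens divided by throughput, matching the table.

```python
def stream_metrics(request_sent_at, token_times):
    """Compute TTFT, decode throughput, and total latency.

    request_sent_at: timestamp (seconds) when the request was sent.
    token_times: arrival timestamps (seconds) of each output token, in order.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent_at
    total = token_times[-1] - request_sent_at
    streaming_time = token_times[-1] - token_times[0]
    tokens_after_first = len(token_times) - 1
    # Throughput is undefined for a single-token response.
    throughput = tokens_after_first / streaming_time if streaming_time > 0 else None
    return {"ttft_s": ttft, "throughput_tps": throughput, "total_s": total}

# Hypothetical trace: request at t=0, first token at 0.4 s, then one token every 20 ms.
arrivals = [0.4 + 0.02 * i for i in range(101)]
metrics = stream_metrics(0.0, arrivals)
# ttft_s is 0.4, throughput_tps is about 50, total_s is about 2.4 (0.4 + 100 / 50)
```

For a chat UI you would watch `ttft_s`; for a batch job you would mostly watch `throughput_tps` aggregated across requests.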

Why Inference Is a Throughput Game

Decode is memory-bandwidth-bound, which means a single request rarely keeps a modern GPU busy. The GPU spends most of its time waiting on memory, not computing. The fix is batching: serve many users' requests through the model at the same time, so each pass of the weights does useful work for dozens of requests at once.
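A back-of-envelope bound makes the memory bottleneck concrete. If every decode step has to read all of the weights once, a single request cannot generate faster than memory bandwidth divided by the size of the weights. The sketch assumes a dense model, ignores KV-cache reads, and uses invented hardware numbers, so treat the outputs as an upper bound under those assumptions rather than a prediction.

```python
def decode_ceiling_tps(params_billion, bytes_per_param, mem_bandwidth_gb_s):
    """Upper bound on single-request decode speed for a dense model.

    Assumes each generated token requires reading every weight once and
    ignores KV-cache traffic, so real speeds will be lower.
    """
    weight_gb = params_billion * bytes_per_param   # 1e9 params x bytes per param = GB
    return mem_bandwidth_gb_s / weight_gb

# Hypothetical 8B-parameter model on a GPU with 2,000 GB/s of memory bandwidth
fp16_ceiling = decode_ceiling_tps(8, 2.0, 2000)   # 125.0 tok/s
fp8_ceiling = decode_ceiling_tps(8, 1.0, 2000)    # 250.0 tok/s
```

Note that the bound does not mention compute at all. That is the sense in which one request leaves the GPU's arithmetic units mostly idle, and why sharing each read of the weights across a batch is the natural fix.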

This is why hosted inference is cheaper than running the same model yourself for low traffic. A provider fills every batch with requests from many customers, pushing GPU utilization high enough that the per-token cost drops. Run that GPU for one user and most of its capacity is idle. Continuous batching, where new requests join a batch already in flight rather than waiting for the next slot, is the technique that makes this practical at scale.
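The benefit of continuous batching shows up even in a toy scheduler. The simulation below is a simplified sketch: every request is assumed to be queued at time zero, one "step" generates one token for every in-flight request, and the request lengths are invented. Static batching holds a batch until its longest request finishes; continuous batching refills a slot as soon as it frees up.

```python
from collections import deque

def static_batching_steps(lengths, slots):
    """Each batch of `slots` requests runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching_steps(lengths, slots):
    """Freed slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    active = []                                   # remaining tokens per in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:      # new requests join the in-flight batch
            active.append(queue.popleft())
        steps += 1                                # one decode step for the whole batch
        active = [r - 1 for r in active if r > 1] # drop requests that just finished
    return steps

# Hypothetical output lengths (tokens) for eight queued requests
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
static_steps = static_batching_steps(lengths, slots=4)          # 8 + 8 = 16
continuous_steps = continuous_batching_steps(lengths, slots=4)  # 10
```

With static batching, three of the four slots sit idle for six steps in each batch while the long request finishes. Continuous batching uses those idle slots for waiting requests, so the same work completes in fewer steps.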

The counterintuitive part

Adding more concurrent users to a lightly loaded inference server can leave each user's speed almost unchanged while the cost per token falls, because the GPU was underutilized to begin with. Throughput and cost-efficiency rise together until the batch saturates the GPU's compute or memory. This is the opposite of most systems, where more load means slower responses.
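A simple model reproduces this effect. Assume each decode step costs the larger of two things: a fixed time to read the weights, or a compute time that grows with batch size. All constants below (step times, GPU price) are invented for illustration, and the model ignores KV-cache memory limits.

```python
def batch_profile(batch, weight_read_ms=10.0, compute_ms_per_seq=0.1, gpu_usd_per_hour=2.0):
    """Return (per-user tok/s, aggregate tok/s, USD per million tokens).

    Hypothetical model: a decode step takes max(weight read time,
    batch * per-sequence compute time). Defaults are illustrative only.
    """
    step_ms = max(weight_read_ms, batch * compute_ms_per_seq)
    per_user_tps = 1000.0 / step_ms
    aggregate_tps = batch * per_user_tps
    usd_per_million = gpu_usd_per_hour / (aggregate_tps * 3600) * 1_000_000
    return per_user_tps, aggregate_tps, usd_per_million

solo = batch_profile(1)       # 100 tok/s per user, about $5.56 per million tokens
shared = batch_profile(64)    # still 100 tok/s per user, about $0.087 per million tokens
overfull = batch_profile(200) # compute-bound: per-user speed drops to 50 tok/s
```

Under these assumptions, going from 1 to 64 users costs each user nothing and cuts the cost per token by 64x. Only past the point where compute time exceeds the weight read time (batch 100 here) does per-user speed start to fall.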

What Inference Costs

Inference is billed per token, split into input (prefill) and output (decode), with output priced higher because decode is the expensive phase. Prices span close to two orders of magnitude depending on model size and provider efficiency.

Representative output prices, June 2026

Tier	Example	Output $/M tokens
Frontier proprietary	Claude Opus 4.8	$25
Mid proprietary	Claude Sonnet 4.6, GPT-5.4	$15
Efficient open weights	MiniMax M2.7	~$1.20
Cheapest open weights	DeepSeek V4 Flash	~$0.28

The roughly 90x gap between the top and bottom of that table ($25 versus about $0.28) is mostly model size, not provider margin. A 27B model costs far less to run per token than a frontier MoE, and for many tasks the quality difference does not justify the price difference. Matching task difficulty to model tier is the entire idea behind an LLM router: Morph's Router classifies each prompt by difficulty and domain in about 180ms and returns the cheapest model that can handle it. The per-model economics are broken down in the best AI model for coding.
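The arithmetic, and the basic idea of picking the cheapest tier that can handle a task, can be sketched in a few lines. Prices come from the table above. The difficulty levels assigned to each tier and the `cheapest_capable` helper are hypothetical, invented for illustration; this is not how Morph's Router works internally.

```python
# Output prices in USD per million tokens, from the table above
OUTPUT_USD_PER_M = {
    "Frontier proprietary": 25.00,
    "Mid proprietary": 15.00,
    "Efficient open weights": 1.20,
    "Cheapest open weights": 0.28,
}

# Hypothetical: highest task difficulty (1-4) each tier is trusted with
MAX_DIFFICULTY = {
    "Frontier proprietary": 4,
    "Mid proprietary": 3,
    "Efficient open weights": 2,
    "Cheapest open weights": 1,
}

def output_cost_usd(tier, output_tokens):
    """Cost of generating `output_tokens` tokens on a given tier."""
    return OUTPUT_USD_PER_M[tier] * output_tokens / 1_000_000

def cheapest_capable(difficulty):
    """Cheapest tier whose (hypothetical) ceiling covers the task difficulty."""
    candidates = [t for t, lvl in MAX_DIFFICULTY.items() if lvl >= difficulty]
    return min(candidates, key=OUTPUT_USD_PER_M.get)

spread = OUTPUT_USD_PER_M["Frontier proprietary"] / OUTPUT_USD_PER_M["Cheapest open weights"]
easy_tier = cheapest_capable(1)   # "Cheapest open weights"
hard_tier = cheapest_capable(4)   # "Frontier proprietary"
```

The hard part of a real router is estimating the difficulty of a prompt accurately and quickly; once that estimate exists, the selection step is as simple as shown here.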

How Providers Make Inference Fast

Beyond batching, a handful of techniques cut latency and cost without retraining the model:

Inference acceleration techniques

Technique	What it does	Wins
Speculative decoding	A small draft model proposes several tokens; the large model verifies them in one pass	Higher decode throughput
Quantization (FP8, INT4)	Runs weights at lower precision to cut memory bandwidth	Faster decode, lower memory
Prompt / KV caching	Reuses the cached attention state for repeated prompt prefixes	Skips prefill for the cached prefix: lower TTFT and input cost on cache hits
Continuous batching	New requests join an in-flight batch instead of waiting	Higher GPU utilization
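To make the quantization row concrete, here is a minimal sketch of symmetric INT4 quantization with a single scale for the whole tensor. This is the simplest possible scheme, chosen for readability; production systems often use finer-grained scales and more careful rounding. The weight values are invented.

```python
def quantize_int4(weights):
    """Symmetric per-tensor quantization to signed 4-bit integers in [-8, 7]."""
    # Fall back to a scale of 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / 7.0 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floating-point weights from the 4-bit codes."""
    return [v * scale for v in q]

# Hypothetical weights
weights = [0.70, -0.33, 0.12, -0.02, 0.0]
q, scale = quantize_int4(weights)        # q == [7, -3, 1, 0, 0], scale is about 0.1
restored = dequantize(q, scale)          # about [0.7, -0.3, 0.1, 0.0, 0.0]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now needs 4 bits instead of 16, so a decode step moves about a quarter of the bytes. The price is the rounding error, bounded here by half of `scale` per weight.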

These compound. Morph's fast-apply model pairs speculative decoding tuned to code with a small specialized model to reach roughly 10,500 tokens per second on code edits, while the compactor model uses the same family of tricks to compress long contexts at around 33,000 tokens per second. The headline number on any serving page is the product of which of these a provider has actually shipped.
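Speculative decoding is easiest to see in its greedy form. The sketch below uses two hypothetical deterministic functions in place of real models: `target_next` stands in for the large model, and `draft_next` for a small model that usually agrees with it. It is not Morph's implementation, and systems that sample rather than decode greedily use a probabilistic acceptance rule instead of the exact-match check shown here.

```python
def target_next(ctx):
    """Hypothetical 'large model': deterministic next token from the context."""
    return (sum(ctx) * 31 + len(ctx)) % 10

def draft_next(ctx):
    """Hypothetical 'small model': agrees with the target except when the
    context length is a multiple of 4."""
    tok = target_next(ctx)
    return tok if len(ctx) % 4 else (tok + 1) % 10

def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out[len(prompt):]

def speculative_decode(prompt, n_new, k=4):
    """Draft proposes k tokens; the target verifies them in one pass."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(out)
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks up to k+1 positions. A real model scores them
        #    all in ONE forward pass; here we loop and count it as one pass.
        target_passes += 1
        for i in range(k + 1):
            expected = target_next(out)
            out.append(expected)                  # always keep the target's token
            if i == k or proposal[i] != expected:
                break                             # first mismatch ends this round
    return out[len(prompt):len(prompt) + n_new], target_passes

tokens, passes = speculative_decode([1, 2, 3], n_new=12)
baseline = greedy_decode([1, 2, 3], n_new=12)
```

The output matches plain greedy decoding token for token, but it takes 4 target passes instead of 12. The speedup depends entirely on how often the draft agrees with the target, which is why a draft tuned to a narrow domain such as code edits can pay off.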

10,500 tok/s: morph-v3-fast, speculative decoding for code
~33k tok/s: morph-compactor context compression
~2 orders of magnitude: cost-per-token spread across model tiers

Where Inference Runs

Three places, with a clear trade-off:

Hosted API

A provider runs the GPUs and bills per token. Highest utilization through shared batching, lowest cost for variable traffic, nothing to operate. The default for most teams.

Self-hosted

You run the model on your own or rented GPUs. Worth it for data that cannot leave your network or for steady, high-volume traffic that keeps the GPUs full. You own the batching and ops problem.

On-device / edge

A small model runs locally on a phone, laptop, or sensor. Zero network latency and full privacy, bounded by the hardware's memory. Common for classifiers and small language models.

Open weights make all three possible with the same model: hit a hosted endpoint today, move to self-hosting when volume justifies it, without changing the model. That portability is a large part of why open-source models have closed the gap with proprietary ones.
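"When volume justifies it" can be turned into a rough break-even calculation. The sketch below compares a flat hourly GPU cost with per-token billing. The GPU price is invented, the hosted price borrows the ~$1.20 per million tokens figure from the pricing table, and the model deliberately ignores operations cost and whether one GPU can actually sustain the required throughput.

```python
def breakeven_tokens_per_hour(gpu_usd_per_hour, hosted_usd_per_million):
    """Hourly token volume above which a dedicated GPU is cheaper than per-token billing."""
    return gpu_usd_per_hour / hosted_usd_per_million * 1_000_000

def cheaper_option(tokens_per_hour, gpu_usd_per_hour, hosted_usd_per_million):
    """Compare the hourly bill for each option at a given sustained volume."""
    hosted_usd = tokens_per_hour / 1_000_000 * hosted_usd_per_million
    return "self-hosted" if gpu_usd_per_hour < hosted_usd else "hosted"

# Hypothetical: a rented GPU at $2.40/hour vs a hosted price of $1.20 per million tokens
threshold = breakeven_tokens_per_hour(2.40, 1.20)   # 2,000,000 tokens per hour
low_volume = cheaper_option(500_000, 2.40, 1.20)    # "hosted"
high_volume = cheaper_option(5_000_000, 2.40, 1.20) # "self-hosted"
```

Two million tokens per hour is about 556 tokens per second, sustained around the clock. Traffic that is bursty or idle overnight pushes the real break-even point considerably higher.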

Frequently Asked Questions

What is AI inference?

AI inference is the stage where a trained model takes input it has never seen and produces an output: a chat reply, a code completion, a classification, or a generated image. Training teaches the model patterns from a fixed dataset; inference applies those patterns to new data in production. Inference runs billions of times and accounts for most of the real-world compute cost of an AI system.

What is the difference between training and inference?

Training is a one-time process that adjusts a model's weights by showing it a large labeled or curated dataset; it is compute-intensive and can cost millions of dollars. Inference is the repeated, ongoing process of running the finished model on new inputs to get predictions. Training writes the weights; inference reads them. A model is trained once and then serves inference for its entire deployment life.

What are the two phases of LLM inference?

Prefill and decode. In prefill, the model processes your entire prompt at once and builds a key-value cache; this phase is compute-bound and parallel, so a long prompt is read quickly. In decode, the model generates the response one token at a time, each token depending on all the tokens before it; this phase is memory-bandwidth-bound and sequential, which is why output is slower than input and why generation speed is measured in tokens per second.

What is TTFT in AI inference?

TTFT, or time to first token, is the delay between sending a request and receiving the first output token. It is dominated by the prefill phase, so it grows with prompt length. TTFT is the latency a user feels before a response starts streaming. The second key metric is throughput, measured in tokens per second, which determines how fast the rest of the response streams once it starts.

Why is AI inference expensive?

Inference runs on GPUs that cost dollars per hour, and unlike training it never stops: every user request is a fresh inference. Decode is memory-bandwidth-bound and sequential, so a GPU often cannot be fully utilized by a single request. Providers recover efficiency by batching many users' requests together, and by techniques like speculative decoding, quantization, and prompt caching. The price you pay per million tokens reflects how well a provider packs those GPUs.

How do providers make inference faster?

The main levers are batching (serving many requests on one GPU to raise utilization), speculative decoding (a small draft model proposes several tokens that the large model verifies in one pass), quantization (running weights at lower precision like FP8 to cut memory bandwidth), prompt caching (reusing the key-value cache for repeated prefixes), and KV-cache management. Morph's fast-apply model, for example, uses speculative decoding tuned to code to reach roughly 10,500 tokens per second.

The fastest endpoints are private deployments

Morph's top speeds come from dedicated deployments, not shared public endpoints: speculators trained on your traffic, caching tuned to your workload, and volume discounts over public per-token rates. Over 100 billion tokens per day run this way.

Inference, Made Fast

Morph serves open-source models and specialized apply and compaction models on a single OpenAI-compatible API, tuned for the metric your workload cares about. WarpGrep adds semantic codebase search, free for 100k requests.
