What Is AI Inference? How Models Turn a Prompt Into Tokens

AI inference is the step where a trained model produces an answer for input it has never seen. This guide explains the two phases (prefill and decode), the two numbers that matter (TTFT and throughput), what inference costs, and how providers make it fast.

June 9, 2026
What AI Inference Is

A trained model is a frozen set of weights. Inference is the act of running new input through those weights to get an output. When you send a prompt and a model replies, that is inference. When a vision model labels an object in a frame, that is inference. When a classifier flags a transaction as fraud, that is inference. The model is not learning during any of this; it is applying what it already learned.

The economics are lopsided. A frontier model is trained once, at enormous cost, and then serves inference for every request for the rest of its life. A model that took weeks and millions of dollars to train will answer billions of inference requests, and the sum of that serving cost dwarfs the training bill. This is why inference efficiency, not training, is where most production AI engineering effort goes.

Once: a model is trained one time
Billions: inference requests it then serves
~90%+: share of lifetime compute spent on inference, not training

Training vs Inference

The two are easy to confuse because both involve running data through a neural network. The difference is direction and frequency.

| | Training | Inference |
| --- | --- | --- |
| What it does | Adjusts the model's weights | Reads the weights to predict |
| How often | Once (plus occasional fine-tunes) | Every request, forever |
| Data | A fixed, curated dataset | New input never seen before |
| Direction | Forward and backward passes | Forward pass only |
| Cost shape | Huge one-time capital cost | Ongoing per-request operating cost |
| Optimized for | Final accuracy | Latency, throughput, cost per token |

Fine-tuning blurs the line: it is a smaller training run on top of an existing model. But once a fine-tuned model is deployed, it is back to pure inference. The same split applies to small custom models. Training a lightweight classifier takes minutes; running it on live traffic is inference that has to be fast and cheap, the same constraint covered in real-time model monitoring.
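
The "writes the weights" versus "reads the weights" split can be shown with a deliberately tiny model. The sketch below is an illustration only: the one-weight linear model, the dataset, and the learning rate are all made up for the example and do not correspond to any real training setup.

```python
# Toy one-weight linear model y = w * x, used only to contrast the two modes.

def predict(w: float, x: float) -> float:
    """Inference: a forward pass that reads the weight and never changes it."""
    return w * x


def train_step(w: float, x: float, y_true: float, lr: float) -> float:
    """Training: a forward pass, then a backward pass that updates the weight."""
    y_pred = w * x                      # forward pass
    grad = 2 * (y_pred - y_true) * x    # backward pass: d(squared error)/dw
    return w - lr * grad                # weight update


# Training: repeated passes over a fixed dataset (the true relation is y = 3x).
dataset = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = 0.0
for _ in range(50):
    for x, y in dataset:
        w = train_step(w, x, y, lr=0.05)

# Inference: the weight is frozen; new input, forward pass only.
frozen_w = w
answer = predict(frozen_w, 10.0)  # an input that was not in the dataset
```

Only `train_step` ever returns a new weight; `predict` can be called any number of times without the model changing, which is the property the table describes.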

How LLM Inference Works: Prefill and Decode

For a language model, inference is two phases with opposite performance profiles. Understanding the split is the difference between reading a benchmark correctly and being fooled by it.

Phase 1: Prefill (reading the prompt)

The model ingests your entire prompt at once, in parallel, and builds a key-value (KV) cache of attention state. Because it processes all input tokens together, prefill is compute-bound and fast per token, even for long prompts. Its total time still grows with prompt length, and that time determines time to first token (TTFT): the longer the prompt, the longer until the first output appears.

Phase 2: Decode (writing the answer)

The model generates output one token at a time. Each new token depends on every token before it, so decode is strictly sequential and memory-bandwidth-bound, not compute-bound. This is the slow phase. Generation speed in tokens per second is a decode measurement, and it is why output costs more than input.

Why the split matters

A model can have fast prefill and slow decode, or the reverse. A coding assistant that reads a huge file (long prefill) but emits a three-line edit (short decode) has a completely different latency profile than a chatbot that gets a short prompt and writes three paragraphs. One "tokens per second" headline number hides this. Always ask whether a quoted speed is prefill, decode, or an average.
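
To make the two phases concrete, here is a minimal pure-Python sketch of single-head attention with a KV cache. Everything in it is a toy assumption: `embed` stands in for a real model's learned projections, `pick_token` stands in for sampling, and prefill computes attention only for the last prompt position. The structural point it shows is real, though: prefill builds the cache from the whole prompt in one call, and each decode step appends one key/value pair and attends over the entire cache.

```python
import math


def embed(token_id, shift=0):
    """Hypothetical 2-d projection standing in for a real model's Q/K/V weights."""
    return [math.sin(token_id + shift), math.cos(token_id + shift)]


def attend(query, keys, values):
    """Softmax attention of one query over every cached key/value pair."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


def pick_token(context):
    """Toy stand-in for sampling: map the context vector to an id in [0, 10)."""
    return int(abs(context[0]) * 1000) % 10


def prefill(prompt_ids):
    """Phase 1: process the whole prompt, build the KV cache, emit the first token."""
    cache = {"keys": [embed(t) for t in prompt_ids],
             "values": [embed(t, shift=1) for t in prompt_ids]}
    first = pick_token(attend(embed(prompt_ids[-1]), cache["keys"], cache["values"]))
    return cache, first


def decode_step(cache, last_id):
    """Phase 2: append one K/V pair, then attend over the entire cache."""
    cache["keys"].append(embed(last_id))
    cache["values"].append(embed(last_id, shift=1))
    return pick_token(attend(embed(last_id), cache["keys"], cache["values"]))


def generate(prompt_ids, max_new_tokens):
    """One prefill, then sequential decode steps (assumes max_new_tokens >= 1)."""
    cache, token = prefill(prompt_ids)
    output = [token]
    while len(output) < max_new_tokens:
        token = decode_step(cache, token)
        output.append(token)
    return output, cache


def generate_without_cache(prompt_ids, max_new_tokens):
    """Same result, but re-reads the full sequence for every new token."""
    output = []
    while len(output) < max_new_tokens:
        _, token = prefill(list(prompt_ids) + output)
        output.append(token)
    return output
```

`generate` and `generate_without_cache` produce the same tokens; the cache only removes the repeated work of rebuilding keys and values for tokens the model has already seen.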

The Two Numbers That Matter: TTFT and Throughput

Two metrics describe almost everything a user feels:

| Metric | What it measures | Driven by |
| --- | --- | --- |
| TTFT (time to first token) | Delay before the response starts streaming | Prefill (grows with prompt length) |
| Throughput (tokens/sec) | How fast the rest of the answer streams | Decode (model size, batching, hardware) |
| Total latency | Wall-clock time to a complete answer | TTFT + (output tokens ÷ throughput) |
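
The total-latency formula is simple enough to compute directly. The sketch below applies it to the two workload shapes described earlier; the TTFT, token counts, and decode speed are hypothetical numbers chosen for illustration, not measurements of any model.

```python
def total_latency_s(ttft_s: float, output_tokens: int, decode_tokens_per_s: float) -> float:
    """Total latency = TTFT + (output tokens / decode throughput)."""
    if decode_tokens_per_s <= 0:
        raise ValueError("decode_tokens_per_s must be positive")
    return ttft_s + output_tokens / decode_tokens_per_s


# Hypothetical coding assistant: long prompt (slow TTFT), short edit.
coding = total_latency_s(ttft_s=2.0, output_tokens=60, decode_tokens_per_s=100.0)

# Hypothetical chatbot: short prompt (fast TTFT), long answer.
chat = total_latency_s(ttft_s=0.2, output_tokens=400, decode_tokens_per_s=100.0)

print(f"coding assistant: {coding:.1f} s, chatbot: {chat:.1f} s")
```

Both workloads share the same decode speed here, yet one is dominated by TTFT and the other by decode time, which is why a single tokens-per-second figure cannot describe both.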

For a chat UI, low TTFT is what makes a model feel responsive, because the user sees text appear quickly even if the full answer takes seconds. For a batch job processing thousands of documents, TTFT is irrelevant and raw throughput is everything. The right model and provider depend on which of these your workload actually cares about. Throughput is also the lever an inference optimization effort moves most.
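
Both metrics can be measured from the client side of any streaming response by timestamping the tokens as they arrive. The following is a minimal sketch: `fake_stream` is a hypothetical stand-in for a real streaming API, and its delays are invented so the example runs without a network.

```python
import time
from typing import Iterable, Iterator


def fake_stream(n_tokens: int, first_delay_s: float, per_token_s: float) -> Iterator[str]:
    """Hypothetical token stream: one prefill delay, then evenly spaced tokens."""
    time.sleep(first_delay_s)          # stands in for prefill
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)    # stands in for one decode step
        yield f"tok{i}"


def measure(stream: Iterable[str]) -> dict:
    """Return TTFT and decode throughput observed while consuming a stream."""
    start = time.perf_counter()
    first_at = None
    count = 0
    for _ in stream:
        if first_at is None:
            first_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first_at is None:
        raise ValueError("stream produced no tokens")
    decode_time = end - first_at
    # Throughput counts only tokens after the first, so prefill time is excluded.
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else float("nan")
    return {"ttft_s": first_at - start, "decode_tokens_per_s": tps, "tokens": count}


stats = measure(fake_stream(n_tokens=5, first_delay_s=0.05, per_token_s=0.01))
```

Separating the first token from the rest is the design choice that matters: dividing total tokens by total wall-clock time would blend prefill and decode into the kind of averaged number the previous section warns about.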

Why Inference Is a Throughput Game

Decode is memory-bandwidth-bound, which means a single request rarely keeps a modern GPU busy. The GPU spends most of its time waiting on memory, not computing. The fix is batching: serve many users' requests through the model at the same time, so each pass of the weights does useful work for dozens of requests at once.

This is why hosted inference is cheaper than running the same model yourself for low traffic. A provider fills every batch with requests from many customers, pushing GPU utilization high enough that the per-token cost drops. Run that GPU for one user and most of its capacity is idle. Continuous batching, where new requests join a batch already in flight rather than waiting for the next slot, is the technique that makes this practical at scale.
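
The difference between waiting for the next slot and joining a batch in flight can be simulated in a few lines. This is a toy scheduler, not any serving framework's implementation: it assumes every request is queued at step 0, that each decode step emits one token per active request, and that a step takes the same time regardless of how full the batch is.

```python
from collections import deque


def simulate(requests, max_batch, continuous):
    """Count decode steps needed to finish every request.

    requests: output lengths in tokens, in arrival order.
    continuous: if True, free slots are refilled before every step;
                if False, new requests wait until the whole batch drains.
    """
    queue = deque(requests)
    active = []   # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        if continuous or not active:
            while queue and len(active) < max_batch:
                active.append(queue.popleft())
        steps += 1                                   # one token per active request
        active = [r - 1 for r in active if r > 1]    # drop requests that just finished
    return steps


# One long request and four short ones, with room for two requests per batch.
workload = [8, 2, 2, 2, 2]
static_steps = simulate(workload, max_batch=2, continuous=False)
continuous_steps = simulate(workload, max_batch=2, continuous=True)
```

With static batching the short requests sit behind the long one; with continuous batching they slot in beside it as earlier short requests finish, so the same work completes in fewer steps.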

The counterintuitive part

Adding more concurrent users to an inference server can leave each user's speed almost unchanged while the cost per token falls, because the GPU was underutilized to begin with. Throughput and cost-efficiency rise together until the batch is full. This is the opposite of most systems, where more load means slower responses.
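
A back-of-the-envelope model shows why this happens. The sketch assumes a decode step takes whichever is longer: streaming the weights from memory once (shared by the whole batch) or doing the per-request compute. The 20 ms weight read, 0.5 ms of compute per request, and $2 per GPU-hour are hypothetical figures, and the model ignores KV-cache reads, which grow with batch size in practice.

```python
def decode_step_s(batch, weight_read_s=0.020, compute_per_request_s=0.0005):
    """One decode step is bounded by weight streaming or compute, whichever is slower."""
    return max(weight_read_s, batch * compute_per_request_s)


def per_user_tokens_per_s(batch):
    """Each user gets one token per decode step."""
    return 1.0 / decode_step_s(batch)


def cost_per_million_tokens(batch, gpu_usd_per_hour=2.0):
    """GPU cost spread over every token the batch produces."""
    tokens_per_s = batch / decode_step_s(batch)
    return gpu_usd_per_hour / 3600.0 / tokens_per_s * 1_000_000


for batch in (1, 8, 32, 64):
    print(batch,
          round(per_user_tokens_per_s(batch), 1),
          round(cost_per_million_tokens(batch), 2))
```

Under these assumptions, per-user speed is flat while the step is memory-bound and cost per token falls in proportion to batch size; once compute becomes the bottleneck (above 40 requests with these numbers), each additional user starts to slow everyone down.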

What Inference Costs

Inference is billed per token, split into input (prefill) and output (decode), with output priced higher because decode is the expensive phase. Prices span nearly two orders of magnitude depending on model size and provider efficiency.

| Tier | Example | Output $/M tokens |
| --- | --- | --- |
| Frontier proprietary | Claude Opus 4.8 | $25 |
| Mid proprietary | Claude Sonnet 4.6, GPT-5.4 | $15 |
| Efficient open weights | MiniMax M2.7 | ~$1.20 |
| Cheapest open weights | DeepSeek V4 Flash | ~$0.28 |

The 90x gap between the top and bottom of that table is mostly model size, not provider margin. A 27B model costs far less to run per token than a frontier MoE, and for many tasks the quality difference does not justify the price difference. Matching task difficulty to model tier is the entire idea behind an LLM router, and the per-model economics are broken down in the best AI model for coding.
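
The arithmetic behind the gap is easy to reproduce from the output prices in the table. The sketch below uses those prices as given (two of them are approximate in the source), ignores input-token charges because the table does not list them, and prices a hypothetical workload of 10,000 requests at 500 output tokens each.

```python
# Output prices in USD per million tokens, copied from the table above.
OUTPUT_PRICE_PER_M = {
    "Claude Opus 4.8": 25.00,
    "Claude Sonnet 4.6 / GPT-5.4": 15.00,
    "MiniMax M2.7": 1.20,        # approximate in the source
    "DeepSeek V4 Flash": 0.28,   # approximate in the source
}


def output_cost_usd(model: str, output_tokens: int) -> float:
    """Output-side cost only; input tokens are billed separately."""
    return OUTPUT_PRICE_PER_M[model] * output_tokens / 1_000_000


spread = max(OUTPUT_PRICE_PER_M.values()) / min(OUTPUT_PRICE_PER_M.values())

# Hypothetical workload: 10,000 requests x 500 output tokens = 5M tokens.
workload_tokens = 10_000 * 500
costs = {model: output_cost_usd(model, workload_tokens) for model in OUTPUT_PRICE_PER_M}

print(f"spread: {spread:.0f}x")
for model, cost in costs.items():
    print(f"{model}: ${cost:.2f}")
```

For this workload the top tier costs $125 in output tokens and the bottom tier $1.40, which is the difference a router is trying to capture on tasks that do not need the larger model.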

How Providers Make Inference Fast

Beyond batching, a handful of techniques cut latency and cost without retraining the model:

| Technique | What it does | Wins |
| --- | --- | --- |
| Speculative decoding | A small draft model proposes several tokens; the large model verifies them in one pass | Higher decode throughput |
| Quantization (FP8, INT4) | Runs weights at lower precision to cut memory bandwidth | Faster decode, lower memory |
| Prompt / KV caching | Reuses the cached attention state for repeated prompt prefixes | Near-zero cost on cache hits |
| Continuous batching | New requests join an in-flight batch instead of waiting | Higher GPU utilization |

These compound. Morph's fast-apply model pairs n-gram speculative decoding with a small specialized model to reach roughly 10,500 tokens per second on code edits, while the compactor model uses the same family of tricks to compress long contexts at around 33,000 tokens per second. The headline number on any serving page is the product of which of these a provider has actually shipped.
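
As an illustration of why n-gram drafting suits code edits, here is a toy sketch of greedy speculative decoding where the draft comes from looking up the last few tokens earlier in the context. It is not Morph's implementation: `target_next` is a hypothetical stand-in for the large model that simply emits a known edited file, and verification of a draft, which a real system does in one batched forward pass, is simulated by calling `target_next` per position and counting the group as one pass.

```python
def ngram_draft(tokens, n=2, k=4):
    """Propose up to k tokens by copying what followed the most recent
    earlier occurrence of the last n tokens."""
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            return tokens[start + n:start + n + k]
    return []


def speculative_generate(prompt, target_next, max_new, n=2, k=4):
    """Greedy speculative decoding: accept draft tokens while they match the target."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < max_new:
        draft = ngram_draft(tokens, n, k)
        passes += 1                       # one (simulated) target pass per loop
        accepted = []
        for guess in draft:
            if guess != target_next(tokens + accepted):
                break                     # first mismatch ends acceptance
            accepted.append(guess)
        # The same pass also yields the target's own next token.
        accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[len(prompt):len(prompt) + max_new], passes


# Toy code edit: the output repeats the input except for one token.
original = "def add ( a , b ) : return a + b".split()
edited = "def add ( a , b ) : return a - b".split()
prompt = original + ["<edit>"]


def target_next(tokens):
    """Hypothetical target model: deterministically emits the edited file."""
    position = len(tokens) - len(prompt)
    return edited[position] if position < len(edited) else "<eos>"


output, passes = speculative_generate(prompt, target_next, max_new=len(edited))
```

The output is identical to what plain token-by-token decoding would produce, because every accepted token is checked against the target, but it takes fewer target passes than tokens. The saving comes from the spans where the edit copies the original unchanged.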

10,500 tok/s: morph-v3-fast with n-gram speculative decoding
~33k tok/s: morph-compactor context compression
~2 orders of magnitude: cost-per-token spread across model tiers

Where Inference Runs

Three places, with a clear trade-off:

Hosted API

A provider runs the GPUs and bills per token. Highest utilization through shared batching, lowest cost for variable traffic, nothing to operate. The default for most teams.

Self-hosted

You run the model on your own or rented GPUs. Worth it for data that cannot leave your network or for steady, high-volume traffic that keeps the GPUs full. You own the batching and ops problem.

On-device / edge

A small model runs locally on a phone, laptop, or sensor. Zero network latency and full privacy, bounded by the hardware's memory. Common for classifiers and small language models.

Open weights make all three possible with the same model: hit a hosted endpoint today, move to self-hosting when volume justifies it, without changing the model. That portability is a large part of why open-source models have closed the gap with proprietary ones.

Frequently Asked Questions

What is AI inference?

AI inference is the stage where a trained model takes input it has never seen and produces an output: a chat reply, a code completion, a classification, or a generated image. Training teaches the model patterns from a fixed dataset; inference applies those patterns to new data in production. Inference runs billions of times and accounts for most of the real-world compute cost of an AI system.

What is the difference between training and inference?

Training is a one-time process that adjusts a model's weights by showing it a large labeled or curated dataset; it is compute-intensive and can cost millions of dollars. Inference is the repeated, ongoing process of running the finished model on new inputs to get predictions. Training writes the weights; inference reads them. A model is trained once and then serves inference for its entire deployment life.

What are the two phases of LLM inference?

Prefill and decode. In prefill, the model processes your entire prompt at once and builds a key-value cache; this phase is compute-bound and parallel, so a long prompt is read quickly. In decode, the model generates the response one token at a time, each token depending on all the tokens before it; this phase is memory-bandwidth-bound and sequential, which is why output is slower than input and why generation speed is measured in tokens per second.

What is TTFT in AI inference?

TTFT, or time to first token, is the delay between sending a request and receiving the first output token. It is dominated by the prefill phase, so it grows with prompt length. TTFT is the latency a user feels before a response starts streaming. The second key metric is throughput, measured in tokens per second, which determines how fast the rest of the response streams once it starts.

Why is AI inference expensive?

Inference runs on GPUs that cost dollars per hour, and unlike training it never stops: every user request is a fresh inference. Decode is memory-bandwidth-bound and sequential, so a GPU often cannot be fully utilized by a single request. Providers recover efficiency by batching many users' requests together, and by techniques like speculative decoding, quantization, and prompt caching. The price you pay per million tokens reflects how well a provider packs those GPUs.

How do providers make inference faster?

The main levers are batching (serving many requests on one GPU to raise utilization), speculative decoding (a small draft model proposes several tokens that the large model verifies in one pass), quantization (running weights at lower precision like FP8 to cut memory bandwidth), prompt caching (reusing the key-value cache for repeated prefixes), and KV-cache management. Morph's fast-apply model, for example, uses n-gram speculative decoding to reach roughly 10,500 tokens per second.

Inference, Made Fast

Morph serves open-source models and specialized apply and compaction models on a single OpenAI-compatible API, tuned for the metric your workload cares about. WarpGrep adds semantic codebase search, free for 100k requests.