Continuous Batching: How It Works and Why It Multiplies LLM Throughput

Continuous batching is an LLM serving technique that decides the batch size at every decode step instead of fixing it for the life of the batch. When a sequence finishes, it is evicted and a waiting request takes its slot on the next iteration, so the GPU never idles waiting for the slowest request. Anyscale measured up to 23x throughput over static batching with vLLM, and Morph runs it in production for morph-v3-fast at ~10,500 tok/s.

Figure	What it measures
up to 23x	vLLM throughput vs static batching (OPT-13B, A100)
36.9x	Orca throughput vs FasterTransformer (GPT-3 175B)
under 4%	KV cache memory waste with PagedAttention
~10,500 tok/s	morph-v3-fast on Morph's fleet

What Continuous Batching Is

Continuous batching is a system-level LLM serving optimization in which the batch size is determined per iteration rather than remaining constant throughout generation. Anyscale, whose 2023 benchmark popularized the term, also calls it dynamic batching or batching with iteration-level scheduling.

The contrast is with static batching, where the size of the batch stays constant until the inference is complete. Under static batching the whole batch must wait for the longest-running sequence to finish before any of it is released, which underutilizes the GPU. Continuous batching removes that wait.

The core mechanism: once a sequence in a batch has completed generation, a new sequence is inserted in its place at the next iteration. This eliminates the idle GPU cycles that static batching spends waiting for the slowest sequence and waiting for the batch to drain before admitting new work.

Why the name varies

The same technique appears under several names depending on the framework. Anyscale and vLLM say continuous batching. NVIDIA TensorRT-LLM says in-flight batching. The research literature, following Orca, says iteration-level scheduling. They describe the same idea: re-decide batch membership every decode step.

Why Static Batching Wastes the GPU

Static batching has two failure modes, both rooted in treating the batch as a fixed unit from start to finish.

The first is padding waste from length variance. LLM generations finish at wildly different lengths: one request emits a 3-token answer, another writes a 2,000-token function. Under static batching the short sequence keeps occupying its slot, contributing nothing, until the longest sequence in the batch finishes. The GPU computes over dead slots.
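
To put a number on the dead slots, here is a minimal sketch that computes what fraction of slot-steps in a static batch produce a useful token. The function name and the example lengths are hypothetical, chosen only to mirror the 3-token and 2,000-token case above.

```typescript
// Fraction of slot-steps that emit a useful token when a static batch
// runs until its longest sequence finishes.
function staticBatchUtilization(outputLengths: number[]): number {
  if (outputLengths.length === 0) return 0;
  // The whole batch runs for as many decode steps as its longest sequence.
  const steps = Math.max(...outputLengths);
  if (steps === 0) return 0;
  const usefulTokens = outputLengths.reduce((sum, len) => sum + len, 0);
  return usefulTokens / (steps * outputLengths.length);
}

// Hypothetical batch: three short answers and one 2,000-token function.
const exampleUtilization = staticBatchUtilization([3, 40, 120, 2000]);
console.log(exampleUtilization.toFixed(3)); // "0.270"
```

In this made-up batch, roughly 73% of the slot-steps are spent on sequences that have already finished.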

The second is head-of-line blocking. With request-level scheduling, requests that finish earlier than others in a batch cannot return to the client, and newly arrived requests have to wait until the current batch completely finishes. This is the problem Orca's iteration-level scheduling solves. A request that arrives a millisecond after the batch starts sits in the queue for the entire batch duration.
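
A rough way to see the size of that wait is to count decode steps. The sketch below assumes a full batch, a single newly arrived request with nothing queued ahead of it, and that a sequence of length n frees its slot after step n. Both function names are hypothetical.

```typescript
// Steps a new request waits under static batching: until the longest
// sequence in the current batch has finished.
function staticQueueWait(arrivalStep: number, outputLengths: number[]): number {
  if (outputLengths.length === 0) return 0;
  return Math.max(0, Math.max(...outputLengths) - arrivalStep);
}

// Steps a new request waits under iteration-level scheduling: until the
// first sequence in the batch finishes and frees a slot.
function continuousQueueWait(arrivalStep: number, outputLengths: number[]): number {
  if (outputLengths.length === 0) return 0;
  return Math.max(0, Math.min(...outputLengths) - arrivalStep);
}

// A request arrives one step after this hypothetical batch starts.
console.log(staticQueueWait(1, [3, 40, 120, 2000]));     // 1999
console.log(continuousQueueWait(1, [3, 40, 120, 2000])); // 2
```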

Both problems compound under real traffic, where request arrival is continuous and sequence lengths are unpredictable. The GPU is the most expensive resource in the system, and static batching leaves it computing over padding and finished sequences while live requests wait in a queue.

How Continuous Batching Works

The unit of scheduling changes from the request to the iteration. Orca's iteration-level scheduling schedules execution at the granularity of iteration instead of request: the scheduler invokes the execution engine to run only a single iteration of the model on the current batch, then regains control to re-decide what runs next.

At each decode step, the runtime does three things. It runs one forward pass over every sequence currently in the batch, producing one new token per sequence. It checks which sequences emitted their end-of-sequence token (or reached their maximum length) and evicts them, returning their output to the client immediately. It admits waiting requests into the freed slots so the next forward pass runs at full width.
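
The three steps can be sketched as a small simulation. This is a toy model, not any framework's scheduler: the `SimRequest` shape, the `remaining` counter standing in for "tokens until end-of-sequence", and the fixed slot count are all assumptions made for illustration, and prefill and memory limits are ignored.

```typescript
interface SimRequest {
  id: string;
  remaining: number; // tokens left to generate; stands in for "until EOS"
}

interface SimFinished {
  id: string;
  finishedAtStep: number;
}

// Simulates iteration-level scheduling over a fixed number of batch slots.
function runContinuousBatching(queue: SimRequest[], maxSlots: number): SimFinished[] {
  if (maxSlots < 1) throw new Error("maxSlots must be at least 1");
  const waiting = queue.map((r) => ({ ...r })); // copy so the caller's data is untouched
  let running = waiting.splice(0, maxSlots);    // initial fill of the batch
  const finished: SimFinished[] = [];
  let step = 0;

  while (running.length > 0) {
    step += 1;
    // 1. One forward pass: every running sequence produces one token.
    for (const seq of running) seq.remaining -= 1;
    // 2. Evict finished sequences and return them right away.
    const stillRunning: SimRequest[] = [];
    for (const seq of running) {
      if (seq.remaining <= 0) finished.push({ id: seq.id, finishedAtStep: step });
      else stillRunning.push(seq);
    }
    // 3. Admit waiting requests into the freed slots for the next pass.
    while (stillRunning.length < maxSlots && waiting.length > 0) {
      stillRunning.push(waiting.shift()!);
    }
    running = stillRunning;
  }
  return finished;
}

const finishedOrder = runContinuousBatching(
  [
    { id: "a", remaining: 4 },
    { id: "b", remaining: 1 },
    { id: "c", remaining: 1 },
    { id: "d", remaining: 1 },
  ],
  2
);
console.log(finishedOrder.map((f) => `${f.id}@${f.finishedAtStep}`).join(" ")); // "b@1 c@2 d@3 a@4"
```

With two slots, the same four requests under static batching would run as batches [a, b] and [c, d], so c and d could not finish before step 5. Here they finish at steps 2 and 3 while a is still running.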

With in-flight batching, the server runtime immediately evicts finished sequences from the batch and begins executing new requests while other requests are still in flight, which greatly increases overall GPU utilization in real-world use cases. Newly arrived requests join at the next iteration, and newly completed requests return at the next iteration, so queue wait time drops and there is no need to pad requests to a common length.

Selective batching

Mixing sequences at different positions in their generation inside one batch is not trivial, because the attention operation depends on each sequence's own KV history. Orca introduced selective batching to handle this: it applies batching only to a selected set of operations. It batches the non-attention operations (the large matrix multiplies that dominate compute) while processing each request individually for the attention operation, since handling attention per-request has only a small impact on efficiency.
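
The split between batched and per-request operations can be illustrated with a toy decode step. This is a simplified sketch, not Orca's implementation: it uses a single attention head, plain arrays instead of GPU tensors, one shared query projection as the only "non-attention" operation, and hypothetical names throughout.

```typescript
type Matrix = number[][];

// Plain matrix multiply: (n x k) times (k x m).
function matmul(a: Matrix, b: Matrix): Matrix {
  return a.map((row) =>
    b[0].map((_, j) => row.reduce((sum, v, k) => sum + v * b[k][j], 0))
  );
}

// Single-head softmax attention of one query over one sequence's own KV history.
function attend(query: number[], keys: Matrix, values: Matrix): number[] {
  const scale = Math.sqrt(query.length);
  const scores = keys.map((key) => key.reduce((s, v, i) => s + v * query[i], 0) / scale);
  const maxScore = Math.max(...scores);
  const weights = scores.map((s) => Math.exp(s - maxScore));
  const total = weights.reduce((a, b) => a + b, 0);
  return values[0].map((_, j) =>
    weights.reduce((s, w, t) => s + (w / total) * values[t][j], 0)
  );
}

interface SequenceState {
  hidden: number[]; // hidden vector of the sequence's newest token
  keys: Matrix;     // cached keys, one row per past token (assumed non-empty)
  values: Matrix;   // cached values, one row per past token
}

function selectiveBatchStep(batch: SequenceState[], queryWeights: Matrix): Matrix {
  // Batched: one matmul over the stacked hidden vectors of all sequences.
  const queries = matmul(batch.map((seq) => seq.hidden), queryWeights);
  // Not batched: KV histories differ in length, so attention runs per sequence.
  return batch.map((seq, i) => attend(queries[i], seq.keys, seq.values));
}
```

The projection sees one stacked matrix regardless of how far along each sequence is, while the attention call handles a 1-token history and a 3-token history side by side without padding either.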

Static vs Continuous Batching

The two approaches differ in when the batch composition is decided and what happens to finished and waiting sequences.

Dimension	Static batching	Continuous batching
When batch size is set	Once, at batch start, fixed until done	Every decode iteration
Finished sequences	Hold their slot until the longest finishes	Evicted immediately, returned to client
New requests	Wait for the whole batch to drain	Admitted on the next iteration
GPU utilization	Low: computes over padding and dead slots	High: batch stays full of live work
Throughput under load	Bottlenecked by the slowest sequence	Up to 23x higher (vLLM, OPT-13B, A100)
Single-request latency	Same	Same
Scheduling granularity	Per request	Per iteration (iteration-level)

The throughput numbers come from Anyscale's benchmark on Meta's OPT-13B running on a single NVIDIA A100 with 40 GB of GPU memory. vLLM, using continuous batching plus continuous-batching-specific memory optimizations, reached up to 23x throughput over static batching. The same benchmark showed 8x over static batching from continuous batching alone, on both Ray Serve and Hugging Face text-generation-inference, and 4x from FasterTransformer's optimized model implementation, which still batches statically.

The Orca Origin and vLLM Popularization

The technique originated with Orca, "Orca: A Distributed Serving System for Transformer-Based Generative Models," presented at OSDI 2022 by Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. It implemented iteration-level scheduling and was, to Anyscale's knowledge, the first to tackle this problem.

Orca reported a 36.9x throughput improvement at the same level of latency over NVIDIA FasterTransformer, evaluated on a GPT-3 175B model. The gain came from two ideas working together: iteration-level scheduling to keep the batch full of live work, and selective batching to make a batch of mixed-position sequences executable.

vLLM brought continuous batching to the open-source mainstream in 2023. vLLM builds upon Orca's continuous batching design by taking full control of dynamic memory allocations through PagedAttention, which allocates memory in fixed-size pages just-in-time instead of ahead-of-time. This limits memory wastage to under 4% and enables higher batch sizes and throughput than Orca's original allocation scheme.

Relationship to KV Cache and PagedAttention

Continuous batching raises throughput only if you can actually fit many concurrent sequences in memory. The constraint is the KV cache. Each sequence stores the key and value vectors of every token it has processed so they are not recomputed each decode step, and that cache grows with every token generated. vLLM reports the KV cache takes up to 1.7 GB for a single sequence in LLaMA-13B.
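
The 1.7 GB figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes the published LLaMA-13B shape (40 layers, hidden size 5,120, standard multi-head attention), fp16 storage, and a 2,048-token sequence. Those numbers are assumptions brought in for illustration, not figures from the benchmark above.

```typescript
interface ModelShape {
  layers: number;
  hiddenSize: number;
  bytesPerValue: number; // 2 for fp16
}

// Each token stores one key vector and one value vector per layer.
function kvBytesPerToken(m: ModelShape): number {
  return 2 * m.layers * m.hiddenSize * m.bytesPerValue;
}

// Assumed LLaMA-13B configuration in fp16.
const llama13b: ModelShape = { layers: 40, hiddenSize: 5120, bytesPerValue: 2 };

const perTokenBytes = kvBytesPerToken(llama13b); // 819,200 bytes, about 800 KB
const perSequenceBytes = perTokenBytes * 2048;   // 1,677,721,600 bytes, about 1.7 GB
console.log((perSequenceBytes / 1e9).toFixed(2) + " GB"); // "1.68 GB"
```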

Because each concurrent sequence in a continuous batch holds its own growing KV cache, the number of sequences you can batch is bounded by KV cache memory, not just by compute. Naive pre-allocation makes this worse: due to internal and external fragmentation and over-reservation, earlier serving systems wasted 60% to 80% of KV cache memory, with only 20.4% to 38.2% of it used to store actual token states.

PagedAttention is what makes continuous batching memory-efficient. With PagedAttention, memory waste only happens in the last block of a sequence, resulting in near-optimal memory usage with a waste of under 4%. Freeing that memory lets you batch more sequences at once, which is exactly what continuous batching needs to fill the GPU.
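
A small calculation shows why waste confined to the last block is so much smaller than waste from reserving the maximum length up front. The function names are hypothetical, and the block size of 16 tokens and the 500-token sequence are assumptions for illustration.

```typescript
// Up-front reservation: slots for the maximum length are held, used or not.
function preallocatedWaste(actualTokens: number, maxTokens: number): number {
  return (maxTokens - actualTokens) / maxTokens;
}

// Paged allocation: blocks are added on demand, so only the last block
// of a sequence can be partly empty.
function pagedWaste(actualTokens: number, blockSize: number): number {
  const blocks = Math.ceil(actualTokens / blockSize);
  const reservedSlots = blocks * blockSize;
  return reservedSlots === 0 ? 0 : (reservedSlots - actualTokens) / reservedSlots;
}

// A 500-token sequence with a 2,048-token maximum and 16-token blocks.
console.log(preallocatedWaste(500, 2048)); // 0.755859375
console.log(pagedWaste(500, 16));          // 0.0234375
```

The per-sequence figure depends on length: a very short sequence can still leave most of its single last block empty, so the under-4% number is best read as an aggregate over a realistic workload.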

The two techniques are complementary, not competing. Continuous batching decides which sequences run each iteration; PagedAttention decides how their KV caches are laid out in memory. vLLM with PagedAttention improves serving throughput by 2-4x over FasterTransformer and Orca, with gains more pronounced for longer sequences and larger models, and processes 2.2x more requests at the same time than Orca. See LLM context windows for how the KV cache scales with sequence length.

In-Flight Batching and Other Names

The same mechanism ships under different names across the major serving stacks. The behavior is identical: re-decide batch membership every iteration.

Framework	Name used	How it is implemented
vLLM	Continuous batching	Core speed feature, paired with PagedAttention
TensorRT-LLM	In-flight batching	Batch Manager component admits and returns requests each iteration
Hugging Face TGI	Continuous batching	Rust router buffers requests, schedulers and block allocators feed the model server
SGLang	Continuous batching	Core feature with paged attention, RadixAttention, zero-overhead CPU scheduler
Orca (research)	Iteration-level scheduling	Original 2022 design with selective batching

NVIDIA TensorRT-LLM supports in-flight batching of requests (also known in the community as continuous batching or iteration-level batching) via its Batch Manager. In-flight batching allows for the inclusion of newly arrived requests and the return of newly completed requests at each iteration of the token generation loop, which reduces wait times in queues, eliminates the need for padding requests, and allows higher GPU utilization.

Hugging Face TGI implements continuous batching in a Rust router. The router receives client requests, buffers them, and uses queues, schedulers, and block allocators to produce batched requests for the model server. The prefill and decode loop then streams tokens while new requests are added to the running batch. SGLang lists continuous batching and paged attention among its core features, alongside RadixAttention for prefix caching, a zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, and chunked prefill.

Tradeoffs: Latency and Memory

Continuous batching is a throughput optimization, not a free lunch. The honest picture includes two costs.

It does not speed up a lone request

A single request sent to an idle server finishes at the same speed under static or continuous batching. The gains appear only under concurrent load, where keeping the batch full of live work raises tokens served per GPU-second. If your traffic is one request at a time, continuous batching does nothing for you.

Memory pressure bounds the batch

Every concurrent sequence holds a growing KV cache (up to 1.7 GB for one LLaMA-13B sequence). Admitting more sequences raises throughput until you run out of KV cache memory, at which point the scheduler must preempt or queue. This is why PagedAttention matters: less waste means more concurrent sequences.
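
As a rough guide to where that ceiling sits, the sketch below estimates how many sequences fit in the memory left after the weights are loaded. Every input is an assumption for illustration: a 40 GB GPU, a 0.90 memory fraction, 26 GB of fp16 weights for a 13B model, and 819,200 bytes of KV cache per token (40 layers x 5,120 hidden x 2 tensors x 2 bytes). It ignores activation memory and fragmentation, so real limits will be lower.

```typescript
interface MemoryBudget {
  gpuBytes: number;
  memoryFraction: number;    // share of GPU memory the server may use
  weightBytes: number;
  kvBytesPerToken: number;
  tokensPerSequence: number; // prompt plus generated tokens held in cache
}

// Upper bound on concurrent sequences given the KV cache budget.
function maxConcurrentSequences(b: MemoryBudget): number {
  const kvBudget = b.gpuBytes * b.memoryFraction - b.weightBytes;
  if (kvBudget <= 0) return 0;
  return Math.floor(kvBudget / (b.kvBytesPerToken * b.tokensPerSequence));
}

const assumedBudget = {
  gpuBytes: 40e9,
  memoryFraction: 0.9,
  weightBytes: 26e9,
  kvBytesPerToken: 819200,
};

console.log(maxConcurrentSequences({ ...assumedBudget, tokensPerSequence: 2048 })); // 5
console.log(maxConcurrentSequences({ ...assumedBudget, tokensPerSequence: 512 }));  // 23
```

Shorter sequences and less wasted cache both raise the number of sequences the scheduler can keep in flight.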

There is also a per-token latency consideration for in-flight sequences. When the scheduler admits new requests into freed slots, their prompts have to be processed (prefill) in the same forward passes that decode the sequences already running, which can slightly slow per-token generation for those sequences. This is a small cost against the much larger queue-wait latency it removes, but it is real, and latency-sensitive deployments tune how aggressively new requests are admitted.

Throughput up, tail latency to watch

Continuous batching maximizes average throughput, but admitting many sequences can lengthen the tail of per-token latency for individual in-flight requests under heavy load. Serving stacks expose knobs (max concurrent sequences, token budget per batch) to trade throughput against worst-case per-request latency. Tune them against your actual traffic, not the benchmark.
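
How those two knobs interact can be sketched as an admission check run once per iteration. This is a hypothetical policy written for illustration, not the scheduler of vLLM, SGLang, or any other stack. It assumes FIFO order, one decode token per running sequence, and that a newly admitted request's whole prompt counts against the token budget in the step where it is admitted.

```typescript
interface PendingRequest {
  id: string;
  promptTokens: number;
}

interface AdmissionLimits {
  maxSeqs: number;     // cap on sequences in the running batch
  tokenBudget: number; // cap on tokens processed in one forward pass
}

// Decide which waiting requests join the batch this iteration.
function admitRequests(
  waiting: PendingRequest[],
  runningCount: number,
  limits: AdmissionLimits
): PendingRequest[] {
  const admitted: PendingRequest[] = [];
  let tokensThisStep = runningCount; // each running sequence decodes one token
  for (const req of waiting) {
    if (runningCount + admitted.length >= limits.maxSeqs) break;
    // Stop at the first request that does not fit, to preserve FIFO order.
    if (tokensThisStep + req.promptTokens > limits.tokenBudget) break;
    admitted.push(req);
    tokensThisStep += req.promptTokens;
  }
  return admitted;
}

const waitingQueue: PendingRequest[] = [
  { id: "a", promptTokens: 50 },
  { id: "b", promptTokens: 60 },
  { id: "c", promptTokens: 10 },
];
const admittedNow = admitRequests(waitingQueue, 2, { maxSeqs: 4, tokenBudget: 100 });
console.log(admittedNow.map((r) => r.id)); // [ "a" ]
```

Lowering the token budget protects running sequences from long prompts at the cost of a longer queue, and raising it does the opposite. A real scheduler also needs a rule for prompts larger than the budget, which this sketch leaves out.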

Configuring It in Practice

In vLLM and SGLang, continuous batching is on by default. The lever you tune is the maximum number of concurrent sequences, which directly controls how full the continuous batch can get against your KV cache budget.

vLLM: cap concurrent sequences in the continuous batch

# Continuous batching is enabled by default in vLLM.
# --max-num-seqs caps how many sequences run concurrently
# in the continuous batch. Higher = more throughput, until
# you run out of KV cache memory.

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.90

# --max-num-seqs: upper bound on sequences in the running batch
# --gpu-memory-utilization: fraction of GPU memory vLLM may use in total
#   (weights, activations, KV cache); what is left of that budget after
#   the weights are loaded goes to the KV cache, which bounds batch width

Because batching happens on the server, behind an OpenAI-compatible API, no client change is needed to benefit from continuous batching. You send standard chat completion requests and the server schedules them into the running batch automatically:

Client side: standard requests, server batches them continuously

import OpenAI from "openai";

// Point at any continuous-batching server (vLLM, SGLang, Morph).
const client = new OpenAI({
  baseURL: "https://api.morphllm.com/v1",
  apiKey: process.env.MORPH_API_KEY,
});

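// Placeholder prompts for illustration; replace with your own workload.
const prompts: string[] = [
  "Rename the variable x to count in this function.",
  "Add a null check before the loop.",
];

// Fire many concurrent requests. The server admits each one into
// the running batch at the next decode iteration. No client-side
// batching logic is required. (Top-level await needs an ES module.)
const results = await Promise.all(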
  prompts.map((prompt) =>
    client.chat.completions.create({
      model: "morph-v3-fast",
      messages: [{ role: "user", content: prompt }],
    })
  )
);

On Morph's production fleet, continuous batching runs on SGLang. For morph-v3-fast, the fast-apply model, it is combined with ngram speculative decoding (k=64) to reach ~10,500 tok/s. Continuous batching keeps the GPU saturated across concurrent code-edit requests; speculative decoding multiplies the tokens accepted per forward pass within each sequence. The two stack: one fills the batch width, the other advances each sequence by more than one token per step. See LLM inference optimization and AI inference for how these techniques combine.

Frequently Asked Questions

What is continuous batching?

Continuous batching is a system-level LLM serving optimization in which the batch size is decided per decode iteration rather than staying fixed for the lifetime of the batch. When a sequence finishes generating, it is evicted from the batch and a waiting request is inserted in its place on the next iteration, eliminating idle GPU cycles spent waiting for the slowest sequence. It is also called dynamic batching or iteration-level batching.

What is the difference between static and continuous batching?

In static batching the batch size stays constant until every sequence in the batch finishes, so the GPU waits for the longest-running sequence and newly arrived requests wait for the whole batch to drain. In continuous batching the scheduler re-decides batch membership every iteration: finished sequences leave immediately and waiting ones join immediately. Anyscale measured up to 23x throughput for continuous batching over static batching with vLLM on OPT-13B on a single A100.

Does continuous batching add latency?

Not for a single request on an idle server, which finishes at the same speed either way. Continuous batching is a throughput optimization that helps under concurrent load by keeping the GPU saturated. It can slightly raise per-token latency for an in-flight sequence when the scheduler admits competing new requests, but it removes the much larger queue-wait latency where a new request would otherwise wait for an entire static batch to finish.

What is in-flight batching in TensorRT-LLM?

In-flight batching is NVIDIA TensorRT-LLM's name for continuous batching. Its Batch Manager component includes newly arrived requests and returns newly completed requests at each iteration of the token generation loop, which reduces queue wait times, eliminates the need to pad requests, and raises GPU utilization. The community also calls the same technique continuous batching or iteration-level batching.

Does vLLM use continuous batching?

Yes. vLLM lists continuous batching of incoming requests as a core speed feature, paired with PagedAttention for efficient management of attention key and value memory. vLLM builds on Orca's continuous batching design by taking full control of dynamic KV cache allocation through PagedAttention, which allocates memory in fixed-size pages just-in-time and limits memory waste to under 4%.

How much throughput does continuous batching add?

It depends on the workload, the variance in sequence lengths, and the memory optimizations paired with it. Anyscale measured up to 23x throughput over static batching with vLLM (continuous batching plus PagedAttention) on OPT-13B on a single A100, and 8x from continuous batching alone on Ray Serve and Hugging Face TGI. For comparison, FasterTransformer's optimized model implementation, which still batches statically, reached 4x in the same benchmark. The original Orca paper reported a 36.9x throughput improvement at the same latency over FasterTransformer on a GPT-3 175B model.

What is the relationship between continuous batching and the KV cache?

Each concurrent sequence in a continuous batch holds its own KV cache, which grows with every token generated. The number of sequences you can batch is bounded by available KV cache memory, not just by compute. PagedAttention makes continuous batching practical by allocating the KV cache in fixed-size pages just-in-time, cutting memory waste from the 60-80% seen in earlier systems to under 4% and freeing room for larger batches.

Continuous Batching, Running in Production

Morph serves morph-v3-fast at ~10,500 tok/s by combining continuous batching on SGLang with ngram speculative decoding (k=64). One OpenAI-compatible API at api.morphllm.com. Continuous batching fills the GPU across concurrent edits; speculative decoding deepens each sequence per forward pass.
