vLLM is 100,000+ lines of C++, CUDA, and Python. It is the backbone of most production LLM serving. nano-vLLM reimplements the core ideas in 1,200 lines of pure Python and Triton, built by a DeepSeek engineer whose name is on the V3 and R1 papers. You can read the entire codebase in an afternoon and walk away understanding PagedAttention, continuous batching, and KV cache management at a level that transfers directly to production systems.
Why nano-vLLM Exists
Understanding LLM inference requires understanding five systems that interact simultaneously: KV cache memory management, request scheduling, attention computation, distributed execution, and GPU kernel optimization. In production vLLM, each of these is spread across dozens of files with thread safety, error handling, and edge cases layered on top. The algorithmic core disappears into the engineering.
Xingkai Yu built nano-vLLM as a personal project to expose those core algorithms. Yu is a DeepSeek engineer from Nanjing University whose contributions include DeepSeek-V3 (103K GitHub stars) and DeepSeek-R1 (92K stars). He also built cupytorch, a minimal PyTorch reimplementation using CuPy. The pattern is consistent: strip a complex system to its algorithmic skeleton so others can learn from it.
The project has 12,700+ GitHub stars and a growing ecosystem. MinivLLM extends it with self-contained paged attention and flash attention implementations. flex-nano-vllm adds FlexAttention for Gemma 2 inference. Multiple blog series walk through its internals line by line. The codebase is becoming a shared reference for how LLM inference works.
PagedAttention: Virtual Memory for KV Cache
Every token in the context window needs its key and value vectors stored for future attention computation. For a 70B model with 200K context, this KV cache alone consumes 40 to 80 GB of GPU memory. The question is not whether to cache. It is how to allocate the cache without wasting most of it.
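The arithmetic is worth making concrete. A quick back-of-the-envelope sketch, assuming a Llama-70B-class configuration (80 layers, 8 KV heads under GQA, head dimension 128, fp16):

# KV cache sizing sketch for a 70B-class model (assumed config)
layers, kv_heads, head_dim, bytes_per_elem = 80, 8, 128, 2
context_tokens = 200_000

# Per token: keys + values, across every layer and KV head
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 327,680 bytes
total_gb = per_token * context_tokens / 1e9                    # ~65.5 GB

print(f"{per_token} bytes/token, {total_gb:.1f} GB at 200K context")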
Before PagedAttention, inference engines pre-allocated contiguous memory blocks for each request based on the maximum possible sequence length. A request that might generate up to 2,048 tokens got a 2,048-slot contiguous allocation at submission time, even if it only used 200 tokens. Across hundreds of concurrent requests, 60 to 80 percent of KV cache memory sat allocated but empty.
The operating system analogy
PagedAttention borrows directly from how operating systems manage RAM. Physical memory is divided into fixed-size pages. Each process gets a page table mapping virtual addresses to physical locations. Pages can be scattered anywhere in physical memory while the process sees a contiguous address space. PagedAttention does the same thing: fixed-size KV cache blocks, a block table per request, and physical blocks scattered across GPU memory. Copy-on-write for shared prefixes uses reference counting, exactly like OS page sharing.
nano-vLLM's BlockManager implements this in roughly 100 lines. The core data structures are a free list of available block IDs, a per-request block table (list of physical block IDs), and a hash-to-block-id mapping for prefix caching. When a new request arrives, the block manager allocates blocks from the free list. When a request completes, blocks return to the free list. When GPU memory runs low, the scheduler can preempt requests and reclaim their blocks.
PagedAttention Block Allocation (nano-vLLM simplified)
# Block Manager core logic
# Each block holds 256 tokens of KV cache
from math import ceil

class BlockManager:
    def __init__(self, num_blocks, block_size=256):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # free list
        self.hash_to_block = {}  # prefix cache lookup

    def allocate(self, sequence):
        """Allocate blocks for a new sequence."""
        num_needed = ceil(len(sequence.tokens) / self.block_size)
        if len(self.free_blocks) < num_needed:
            return False  # trigger preemption
        blocks = [self.free_blocks.pop() for _ in range(num_needed)]
        sequence.block_table = blocks
        return True

    def free(self, sequence):
        """Return blocks to the free list."""
        for block_id in sequence.block_table:
            self.free_blocks.append(block_id)

# A 700-token request uses 3 blocks:
# Block 0: tokens 0-255 (physical block 47)
# Block 1: tokens 256-511 (physical block 12)
# Block 2: tokens 512-699 (physical block 83)
# Blocks are non-contiguous in GPU memory. The block
# table [47, 12, 83] maps logical to physical.

The key insight is that the attention kernel reads KV data through block table indirection. Instead of assuming KV vectors are at contiguous memory addresses, it looks up each block's physical location. This one level of indirection eliminates memory fragmentation entirely. In vLLM, this kernel is a custom CUDA implementation. In nano-vLLM, it is a Triton kernel that achieves the same result in readable Python-like syntax.
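To make the indirection concrete, here is a minimal gather in plain PyTorch rather than Triton. kv_cache follows the [2, num_layers, num_blocks, block_size, num_kv_heads, head_dim] layout described later in this article, and the function name is illustrative, not nano-vLLM's API:

import torch

def gather_keys(kv_cache, block_table, seq_len, layer, block_size=256):
    """Collect one sequence's keys from scattered physical blocks."""
    parts = []
    for logical_idx, physical_id in enumerate(block_table):
        # How many of this sequence's tokens live in this block?
        start = logical_idx * block_size
        count = min(block_size, seq_len - start)
        if count <= 0:
            break
        # kv_cache[0] = keys; index the physical block, not a contiguous range
        parts.append(kv_cache[0, layer, physical_id, :count])
    return torch.cat(parts)  # [seq_len, num_kv_heads, head_dim]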
Continuous Batching and Scheduling
Static batching wastes GPU cycles. If a batch of 32 requests includes one that generates 500 tokens and another that generates 10, the GPU processes 490 empty slots after the short request finishes. Continuous batching fixes this by treating the batch as a living set: requests enter when resources are available and exit the moment they are done.
nano-vLLM's scheduler maintains two queues. The waiting queue holds submitted requests that have not started processing. The running set holds requests actively generating tokens. At each decode step, the scheduler:
Check Completions
Scan running requests for stop conditions: EOS token, max length reached, or stop string match. Completed requests exit the running set.
Free Resources
Return completed requests' KV cache blocks to the free list via the BlockManager. GPU memory is immediately available.
Promote Waiting
Pull requests from the waiting queue into the running set if the BlockManager can allocate blocks for them. Respect memory limits.
Handle Preemption
If memory is exhausted and waiting requests have higher priority, evict running requests back to waiting and free their blocks.
The scheduler distinguishes between prefill and decode phases. Prefill requests process all input tokens in parallel (compute-bound). Decode requests generate one token per step (memory-bandwidth-bound). The scheduler groups them appropriately so prefill requests get bulk GPU compute while decode requests share bandwidth-bound resources efficiently.
Scheduler Step Loop (nano-vLLM)
# Simplified from engine/scheduler.py
class Scheduler:
    def __init__(self, block_manager):
        self.waiting = []  # requests not yet started
        self.running = []  # requests actively decoding
        self.block_manager = block_manager

    def schedule(self):
        # 1. Check for completed sequences
        finished = []
        for seq in self.running:
            if seq.is_finished():
                self.block_manager.free(seq)
                finished.append(seq)
        for seq in finished:
            self.running.remove(seq)

        # 2. Promote waiting requests if resources available
        while self.waiting:
            seq = self.waiting[0]
            if self.block_manager.allocate(seq):
                self.waiting.pop(0)
                self.running.append(seq)
            else:
                break  # no memory, stop promoting

        # 3. Build batch metadata for ModelRunner
        return SchedulerOutput(
            prefill_seqs=[s for s in self.running if s.is_prefill],
            decode_seqs=[s for s in self.running if not s.is_prefill],
        )

This is about 150 lines in the actual implementation, including preemption logic and batch metadata construction. The vLLM scheduler adds priority queues, fairness policies, chunked prefill, and prefix-aware scheduling. The core loop is identical.
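The preemption step that the simplified listing omits can be sketched in a few lines. This is one plausible policy (evict the most recently admitted sequence and recompute it later), not a copy of nano-vLLM's exact logic, and reset_for_recompute is a hypothetical helper:

def preempt_one(self):
    """Evict one running sequence to reclaim its KV cache blocks."""
    if not self.running:
        return False
    victim = self.running.pop()        # most recently admitted sequence
    self.block_manager.free(victim)    # its blocks return to the free list
    victim.reset_for_recompute()       # hypothetical: rerun prefill later
    self.waiting.insert(0, victim)     # front of the queue for fairness
    return True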
KV Cache Architecture
The physical KV cache in nano-vLLM is a single pre-allocated GPU tensor with shape [2, num_layers, num_blocks, block_size, num_kv_heads, head_dim]. The leading dimension of 2 separates keys from values. This layout keeps all data for a given block contiguous in memory, which aligns with how GPU memory controllers fetch data in coalesced cache-line reads.
Writing to the cache uses a Triton kernel, store_kvcache_kernel. Each kernel instance handles one token. The slot_mapping tensor tells each instance exactly where to write. There is no Python-level indexing on the hot path. Reading from the cache during attention uses flash_attn_varlen_func for prefill (variable-length sequences in a batch) and flash_attn_with_kvcache for decode (single new token per sequence, reading from the paged cache).
KV Cache Layout
# Physical KV cache tensor
import torch

kv_cache = torch.zeros(
    2,             # 0 = keys, 1 = values
    num_layers,    # e.g., 32 for Qwen3-0.6B
    num_blocks,    # total blocks available (e.g., 1024)
    block_size,    # tokens per block (default: 256)
    num_kv_heads,  # e.g., 8 with GQA
    head_dim,      # e.g., 64
    dtype=torch.float16,
    device="cuda",
)

# Writing: Triton kernel maps each token to its slot
# slot_mapping[i] = (block_id * block_size) + offset_in_block
# Kernel writes key[i] to kv_cache[0, layer, block, offset, head, :]
# Kernel writes val[i] to kv_cache[1, layer, block, offset, head, :]

# Reading: FlashAttention uses block_tables to gather
# the correct KV blocks for each sequence's attention
# No data copy — just indirection through the block table

The control plane versus data plane separation is deliberate. The BlockManager runs on CPU, manipulating only metadata: block IDs, reference counts, hash mappings. It never touches GPU memory. The ModelRunner runs on GPU, reading and writing KV cache data through slot mappings provided by the BlockManager. This separation means scheduling decisions are fast (CPU operations on small data structures) and never block GPU execution.
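The slot mapping itself is plain CPU-side arithmetic over the block table, following the formula in the comments above. A minimal sketch (the function name is illustrative):

def build_slot_mapping(block_table, num_tokens, block_size=256):
    """CPU-side: flat physical slot index for each of a sequence's tokens."""
    slots = []
    for pos in range(num_tokens):
        physical_block = block_table[pos // block_size]
        offset = pos % block_size
        slots.append(physical_block * block_size + offset)
    return slots

# block_table [47, 12, 83], 700 tokens:
# token 0   -> slot 47*256 + 0   = 12032
# token 256 -> slot 12*256 + 0   = 3072
# token 699 -> slot 83*256 + 187 = 21435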
Prefix Caching
Many inference workloads share prompt prefixes. Every request to a chatbot includes the same system prompt. Every RAG query includes the same retrieval template. Without prefix caching, the engine re-runs prefill for the shared prefix on every request.
nano-vLLM's prefix caching hashes each block's token content using xxhash. When a new request arrives, the block manager checks whether any of its blocks match existing cached blocks by hash. Matching blocks skip allocation and prefill entirely. The new request's block table points to the existing physical blocks. Reference counting ensures blocks are not freed while any request references them.
Prefix Caching with Content-Addressable Blocks
# Prefix caching in BlockManager
from math import ceil
import numpy as np
import xxhash

def allocate_with_prefix_cache(self, sequence):
    """Allocate blocks, reusing cached prefix blocks."""
    block_table = []
    num_blocks_needed = ceil(len(sequence.tokens) / self.block_size)
    for i in range(num_blocks_needed):
        # Hash this block's token content (token IDs exceed 255,
        # so serialize via numpy rather than bytes())
        block_tokens = sequence.tokens[i * self.block_size : (i + 1) * self.block_size]
        token_hash = xxhash.xxh64(np.array(block_tokens).tobytes()).hexdigest()
        if token_hash in self.hash_to_block:
            # Reuse existing block — skip prefill for these tokens
            block_id = self.hash_to_block[token_hash]
            self.ref_counts[block_id] += 1
        else:
            # Allocate new block
            block_id = self.free_blocks.pop()
            self.hash_to_block[token_hash] = block_id
            self.ref_counts[block_id] = 1
        block_table.append(block_id)
    sequence.block_table = block_table
# 100 chatbot requests with a 500-token system prompt:
# Without prefix caching: 100 × prefill(500 tokens)
# With prefix caching: 1 × prefill(500 tokens) + 99 × reuseThe savings compound. A 500-token system prompt shared across 100 concurrent requests means 99 requests skip 500 tokens of prefill computation. For RAG workloads with 2,000-token retrieval contexts, the savings are proportionally larger. This is the same mechanism that SGLang calls "automatic prefix caching" and that vLLM implements with a more sophisticated eviction policy.
The Four-Layer Architecture
nano-vLLM organizes into four layers, each with a single responsibility. This mirrors vLLM's architecture at a structural level while compressing the implementation.
| Layer | Component | Responsibility | Lines (approx) |
|---|---|---|---|
| User Interface | LLM class | Accept prompts and SamplingParams, return generated text. The only public API. | ~50 |
| Inference Engine | LLMEngine | Orchestrate request lifecycle: tokenize, schedule, execute, decode. Spawn worker processes for tensor parallel ranks. | ~200 |
| Model Execution | ModelRunner + Sampler | Execute forward passes on GPU. Manage CUDA graphs. Handle inter-rank communication via SharedMemory and NCCL. Sample next tokens. | ~400 |
| Memory Management | BlockManager + Sequence | Allocate and free KV cache blocks. Track per-request state. Implement prefix caching. All CPU-side metadata. | ~200 |
The file structure maps directly to these layers. engine/llm_engine.py is the orchestrator. engine/model_runner.py handles GPU execution. engine/scheduler.py manages batching. engine/block_manager.py controls memory. models/qwen3.py is the transformer implementation. layers/ contains attention, linear, and sampling primitives. utils/context.py propagates execution metadata via thread-local storage.
Tensor Parallelism
When a model does not fit on one GPU, tensor parallelism splits weight matrices across multiple GPUs. Each GPU holds a slice of each layer's attention projections and MLP layers. The forward pass runs in parallel across GPUs, with NCCL all-reduce operations synchronizing intermediate results.
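As a sketch of the sharding itself (not nano-vLLM's exact layers/linear.py API, just the core pattern), a row-parallel linear layer under torch.distributed looks roughly like this: each rank multiplies its activation shard by its weight slice, and an all-reduce sums the partial products.

import torch
import torch.distributed as dist

class RowParallelLinear(torch.nn.Module):
    """Each rank holds a [out_features, in_features // world_size] slice."""
    def __init__(self, in_features, out_features):
        super().__init__()
        shard = in_features // dist.get_world_size()
        self.weight = torch.nn.Parameter(torch.empty(out_features, shard))

    def forward(self, x_shard):
        # x_shard: this rank's slice of the input activations
        partial = torch.nn.functional.linear(x_shard, self.weight)
        dist.all_reduce(partial)  # NCCL sum -> every rank gets the full result
        return partial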
nano-vLLM implements this with a leader-worker pattern. Rank 0 (the main process) runs the LLMEngine, Scheduler, and BlockManager. It writes commands to a shared memory buffer. Worker processes (ranks 1 through N-1) block on an event, wake when a command arrives, execute their GPU slice, and signal completion.
Tensor Parallel Execution Pattern
# Leader-worker coordination (simplified)
import pickle

# Rank 0 (leader): runs scheduler + block manager
def leader_step(self, batch):
    # Write command to shared memory
    self.shared_mem.write(pickle.dumps(("forward", batch)))
    self.event.set()  # wake workers
    # Execute own GPU slice
    output = self.model_runner.forward(batch)
    # Wait for workers to finish
    self.barrier.wait()
    return output

# Ranks 1..N-1 (workers): GPU execution only
def worker_loop(self):
    while True:
        self.event.wait()  # block until command
        cmd, args = pickle.loads(self.shared_mem.read())
        if cmd == "forward":
            self.model_runner.forward(args)
        self.barrier.wait()  # signal completion

# Weight sharding: for a 4-GPU setup with hidden_dim=4096,
# each GPU holds attention Q,K,V projections of size [4096, 1024]
# instead of [4096, 4096]. NCCL all-reduce merges results.

This is basic compared to vLLM's full distributed execution, which includes pipeline parallelism (splitting layers across GPUs), expert parallelism for MoE models, and Ray-based cluster management. But the core coordination pattern, where one process makes scheduling decisions and broadcasts work to GPU executors, is the same pattern at every scale.
CUDA Graphs and Torch Compile
The decode loop generates one token per sequence per step. Each step launches dozens of small CUDA kernels through Python. The Python dispatch overhead, typically 5 to 20 microseconds per kernel launch, becomes significant when the actual GPU work per kernel takes a comparable amount of time.
CUDA graphs solve this by capturing a sequence of kernel launches once, then replaying the entire sequence as a single GPU operation. nano-vLLM captures graphs for batch sizes from 1 to 512 at startup. During inference, the decode step replays the appropriate graph with new input tensors. Python overhead drops to near zero.
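PyTorch exposes this directly through torch.cuda.CUDAGraph. A minimal capture-and-replay sketch, with model, batch_size, and hidden as illustrative placeholders (real code also warms up on a side stream before capture and keeps one graph per batch size):

import torch

# Static buffers: a captured graph replays fixed memory addresses,
# so new inputs must be copied into the same tensors every step
static_input = torch.zeros(batch_size, hidden, device="cuda")

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):          # capture the decode forward pass once
    static_output = model(static_input)

static_input.copy_(next_batch)         # per-step: refresh inputs in place
graph.replay()                         # relaunch all captured kernels at once
tokens = static_output.argmax(dim=-1)  # greedy sampling, for illustration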
Torch compile (torch.compile()) provides complementary optimization by fusing multiple Python-level operations into single GPU kernels. RMSNorm, small linear layers, and activation functions are prime fusion targets. nano-vLLM supports both optimizations, controlled by the enforce_eager flag (set to True to disable both for debugging).
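RMSNorm is the canonical fusion example: eager PyTorch launches separate kernels for the square, mean, rsqrt, and multiplies, while torch.compile can fuse the chain. A minimal sketch:

import torch

@torch.compile  # fuses the elementwise chain into fewer GPU kernels
def rms_norm(x, weight, eps=1e-6):
    variance = x.pow(2).mean(-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight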
When to disable CUDA graphs
Set enforce_eager=True when debugging model behavior, profiling individual kernel performance, or running on hardware that does not support CUDA graph capture. CUDA graphs require fixed tensor shapes, so they only work for the pre-captured batch sizes. Requests arriving at uncaptured batch sizes fall back to eager execution automatically.
Performance: nano-vLLM vs vLLM
The benchmark that matters for nano-vLLM is offline batch inference, which is the only workload it is designed for. Online serving with concurrent requests, streaming, and dynamic load is outside its scope.
| Metric | vLLM | nano-vLLM | Delta |
|---|---|---|---|
| Output tokens generated | 133,966 | 133,966 | Identical |
| Total time | 98.37s | 93.41s | nano-vLLM 5.0s faster |
| Throughput | 1,361.84 tok/s | 1,434.13 tok/s | +5.3% nano-vLLM |
| Codebase size | 100,000+ lines | ~1,200 lines | 83x smaller |
| Languages | C++, CUDA, Python | Python, Triton | No compiled extensions |
The throughput advantage is likely an artifact of reduced overhead. nano-vLLM has less code executing per step: no online serving machinery, no plugin system, no telemetry. For offline inference where the workload is known upfront, that overhead adds up. For production serving with thousands of concurrent requests, vLLM's additional machinery pays for itself through better scheduling and resource utilization.
When to Use nano-vLLM vs Full vLLM
| Scenario | Recommendation | Why |
|---|---|---|
| Learning how LLM inference works | nano-vLLM | 1,200 lines. Trace the full path from prompt to token in one sitting. Every concept is exposed without engineering noise. |
| Prototyping a new attention mechanism | nano-vLLM | Pure Python + Triton. Modify attention.py, test immediately. No C++ recompilation cycle. |
| Offline batch inference on small models | nano-vLLM | Comparable or faster throughput. Simpler setup. One pip install. |
| Production API serving | vLLM / TensorRT-LLM / SGLang | Online serving, request queuing, streaming, load balancing, health checks, telemetry. None of this exists in nano-vLLM. |
| High-concurrency serving (100+ users) | vLLM | Advanced scheduling, pipeline parallelism, expert parallelism, speculative decoding. Performance under load requires features nano-vLLM omits. |
| Quantized model deployment | vLLM / TensorRT-LLM | nano-vLLM has no quantization support. Production quantization (FP8, INT4 AWQ/GPTQ) requires vLLM or TensorRT-LLM. |
| Edge / laptop inference | nano-vLLM | Minimal dependencies. Runs on consumer GPUs. The RTX 4070 Laptop benchmark is the proof point. |
| Teaching a university course on LLM systems | nano-vLLM | Each component maps to a lecture. PagedAttention, scheduling, KV cache, tensor parallelism, CUDA graphs. Students can modify and benchmark each in isolation. |
What nano-vLLM Leaves Out (and Why That Matters)
The omissions define what makes vLLM a production system versus an educational tool. Each missing feature represents a real production requirement.
Online Serving (HTTP API)
vLLM exposes an OpenAI-compatible API server with request queuing, streaming responses, health checks, and graceful shutdown. nano-vLLM is offline only: you pass all prompts upfront and get all results back.
Speculative Decoding
A small draft model generates candidate tokens verified by the target model. Yields 2-3x decode speedup. Requires careful draft model selection, verification logic, and token acceptance/rejection. Not trivial to add.
Pipeline Parallelism
Splits model layers across GPUs (vs tensor parallelism which splits within layers). Needed for models too large for tensor parallelism alone. vLLM supports both simultaneously.
Quantization (FP8, INT4)
Weight compression that halves or quarters memory usage with minimal quality loss. Production deployments almost always use FP8 on H100/B200 or INT4 AWQ/GPTQ on smaller GPUs.
Multi-Modal Support
Vision-language models like LLaVA and Qwen-VL require image preprocessing, cross-attention between modalities, and different memory management for image features.
Production Error Handling
Request timeouts, OOM recovery, graceful degradation, request cancellation, logging, metrics export. The 98,800 lines between nano-vLLM and vLLM are mostly this.
Understanding these omissions is part of the educational value. When you see that nano-vLLM's scheduler is 150 lines and vLLM's is thousands, you can ask: what does the extra code do? The answer is always some combination of edge cases, error handling, fairness, and features. The algorithm at the center is the same.
Frequently Asked Questions
What is nano-vLLM?
A 1,200-line Python reimplementation of vLLM's core inference architecture. It implements PagedAttention, continuous batching, KV cache management, tensor parallelism, prefix caching, and CUDA graph optimization using pure Python and Triton. Built by Xingkai Yu, a DeepSeek engineer. Not an official DeepSeek project.
How do I install nano-vLLM?
Installation and Quick Start
# Install
pip install git+https://github.com/GeeeekExplorer/nano-vllm.git

# Download a model
huggingface-cli download Qwen/Qwen3-0.6B --local-dir ./Qwen3-0.6B

# Run inference (Python)
from nanovllm import LLM, SamplingParams

llm = LLM("./Qwen3-0.6B", enforce_eager=True, tensor_parallel_size=1)
params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Explain PagedAttention in one paragraph."], params)
print(outputs[0]["text"])

Is nano-vLLM faster than vLLM?
For offline batch inference on small models, yes. On Qwen3-0.6B with 256 sequences on an RTX 4070 Laptop, nano-vLLM achieves 1,434 tok/s versus vLLM's 1,362 tok/s. For production serving with concurrent requests, no. vLLM's advanced scheduling, speculative decoding, and optimized kernels outperform at scale.
Who created nano-vLLM?
Xingkai Yu (GitHub: GeeeekExplorer), a DeepSeek engineer from Nanjing University. His name appears on the DeepSeek-V3 (103K stars) and DeepSeek-R1 (92K stars) technical reports. He also built cupytorch, a minimal PyTorch reimplementation using CuPy. nano-vLLM is a personal project, not an official DeepSeek release.
What models does nano-vLLM support?
The current implementation supports Qwen3 models. The transformer implementation in models/qwen3.py is specific to the Qwen3 architecture. Adding new model support requires implementing the model file, similar to adding a model in Hugging Face Transformers. The architecture is general enough that any standard decoder-only transformer can be added.
Can I contribute to nano-vLLM?
Yes. The repository is MIT-licensed on GitHub. The small codebase makes contributions accessible. Adding support for new model architectures, implementing speculative decoding, or adding quantization support are all feasible contributions that would benefit the community while being substantial enough to learn from.
How does this relate to nanoVLM (Hugging Face)?
Different projects. nanoVLM by Hugging Face is a minimal training repository for vision-language models. nano-vLLM by GeeeekExplorer is a minimal inference engine that reimplements vLLM. One trains VLMs, the other serves LLMs. The name similarity is coincidental.
Understanding Inference Internals Helps. Not Having to Manage Them Helps More.
nano-vLLM teaches you how PagedAttention, continuous batching, and KV cache management work. Morph handles all of that at the infrastructure layer so your coding agents can focus on generating code. Deterministic fast apply at 10,500+ tokens per second, with inference optimization built in.