DeepSeek NSA: Why O(N²) Attention Breaks at Scale and How Native Sparse Attention Fixes It

Full attention over 200K tokens requires ~327 trillion FLOPs per layer for the QK multiply alone, which makes it economically impractical without architectural changes. DeepSeek's Native Sparse Attention (NSA) uses three hardware-aligned components to cut that to a fraction without sacrificing quality.

March 10, 2026 · 5 min read

Full attention over 200,000 tokens requires 327 trillion floating-point operations per layer for the QK multiply alone. At 80 TFLOP/s sustained, that is over 4 seconds per layer; across 30 layers, prefilling the prompt takes more than two minutes before the first token appears. Native Sparse Attention is how DeepSeek makes 200K context economically viable: not by approximating attention, but by restructuring which tokens attend to which.

The Quadratic Scaling Problem

Attention computes softmax(Q·Kᵀ/√d)·V. The Q·Kᵀ multiply alone has complexity O(N²·d) in FLOPs, where N is sequence length and d is the key dimension. At N=200,000 and d=4,096, the arithmetic works out to:

Full Attention FLOPs at 200K Context

# QK^T multiply
FLOPs_QK = 2 × N × N × d
         = 2 × 200,000 × 200,000 × 4,096
         = 327,680,000,000,000  # 327 trillion FLOPs

# AV multiply (same cost)
FLOPs_AV = 2 × N × N × d
         = 327,680,000,000,000

# Total per layer
FLOPs_total = 655 trillion FLOPs per layer

# Attention matrix memory (float16)
Attn_matrix_bytes = N × N × 2
                  = 200,000 × 200,000 × 2
                  = 80,000,000,000  # 80 GB per layer

# H100 SXM HBM capacity: 80 GB total
# One attention layer fills the entire GPU.

The 80 GB figure is per layer, per head group. With grouped-query attention you share KV across heads, which helps, but the N² problem remains. At 200K context, the attention matrix alone exceeds H100 HBM capacity before any weights or activations are loaded.

Chinchilla scaling laws optimize for the tokens-to-parameters ratio. They say nothing about context length. Context length scaling is a separate, worse problem: doubling sequence length quadruples attention compute while leaving feedforward and projection compute unchanged. Above 32K tokens, attention begins to dominate total compute. Above 128K tokens, it is almost everything.
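A back-of-envelope model makes the crossover concrete. The sketch below is my own arithmetic, not from the NSA paper, assuming d_model = 4096, a 4x FFN expansion, dense projections, and the standard 2-FLOPs-per-multiply-accumulate count; it compares attention-score FLOPs (O(N²)) against the linear-layer FLOPs that stay O(N):

```python
# Rough per-layer FLOP model (assumptions: d_model = 4096, 4x FFN expansion).
def attn_score_flops(n, d=4096):
    # QK^T and AV each cost 2*n*n*d FLOPs
    return 2 * (2 * n * n * d)

def linear_flops(n, d=4096):
    # Q/K/V/O projections (8*n*d^2) plus FFN at 4x expansion (16*n*d^2)
    return 8 * n * d**2 + 16 * n * d**2

for n in (8_192, 32_768, 131_072):
    a, l = attn_score_flops(n), linear_flops(n)
    print(f"N={n:>7}: attention share = {a / (a + l):.0%}")
```

Under these assumptions the attention share simplifies to n / (n + 6·d_model): roughly a quarter of compute at 8K, a majority at 32K, and the large majority at 128K, consistent with the dominance claim above.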

327T · FLOPs for the QK multiply per layer at 200K context
80 GB · attention matrix memory per layer at 200K tokens
O(N²) · scaling in both FLOPs and memory
11.6x · NSA decoding speedup at 64K context

What Failed Before (and Why)

Four major sparse attention approaches preceded NSA. Each solved part of the problem and missed something critical.

| Method | Mechanism | Theoretical Complexity | Why It Failed |
| --- | --- | --- | --- |
| Longformer (2020) | Sliding window + fixed global tokens | O(N × window) + O(N × global_tokens) | Global tokens are fixed by position or document structure. Tasks requiring dynamic long-range routing (multi-hop reasoning, arbitrary cross-document retrieval) degrade because the model cannot learn which tokens need global access. |
| BigBird (2020) | Random blocks + local window + global tokens | O(N) theoretical | Random block access patterns generate cache misses at every step. GPUs expect sequential, coalesced memory access; random sparse access drops utilization to 20-30% of peak. BigBird was faster than full attention on paper and slower on GPU in practice. |
| Reformer (2020) | LSH hashing to find approximate nearest-neighbor tokens | O(N log N) | Hash collisions mean relevant tokens get missed silently. The hashing step adds latency proportional to N, bucketing does not parallelize efficiently across batch dimensions, and training stability suffers from stochastic routing. |
| Perceiver IO | Cross-attention to fixed-size latent array | O(N × latents) | Compresses the sequence to fixed latents regardless of content, losing fine-grained positional information. Cannot extend gracefully to arbitrary context lengths: the bottleneck is always latent count, not sequence length. |

The failure pattern is consistent. All four methods treat GPU efficiency as an afterthought. They optimize theoretical complexity, deploy on hardware built for dense matrix operations, and wonder why the wall-clock speedup does not match the FLOP reduction. BigBird is the clearest case: O(N) complexity, 30% GPU utilization, slower than FlashAttention on any sequence an H100 can handle.

FlashAttention: Memory Efficient, Not Compute Sparse

FlashAttention is exact attention, not sparse attention. This distinction matters because FlashAttention is often discussed alongside sparse methods as if it solves the same problem. It does not.


FlashAttention uses tiling and on-the-fly recomputation to avoid materializing the N×N attention matrix in HBM. Memory goes from O(N²) to O(N). FLOPs remain O(N²). It eliminates the HBM bandwidth bottleneck, not the compute bottleneck. FlashAttention makes 100K context possible. It does not make 1M context possible.
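The contrast between the two bottlenecks is easy to check numerically. A quick sketch, using my own arithmetic with the same N = 200,000 and d = 4,096 as above:

```python
N, d = 200_000, 4_096

attn_matrix_gb = N * N * 2 / 1e9      # materialized fp16 scores: O(N^2) memory
streamed_kv_gb = N * d * 2 * 2 / 1e9  # K and V rows streamed through SRAM: O(N)
flops = 2 * 2 * N * N * d             # QK^T + AV: compute stays O(N^2)

print(f"{attn_matrix_gb:.0f} GB -> {streamed_kv_gb:.1f} GB, "
      f"{flops / 1e12:.0f}T FLOPs either way")
```

Tiling collapses the 80 GB materialized matrix to a few gigabytes of streamed KV rows, while the 655 trillion FLOPs per layer are untouched.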

FlashAttention 2 achieves 50-73% of theoretical peak TFLOP/s on A100 — close to GEMM efficiency. The gains come from better parallelism across thread blocks, reduced non-matmul FLOPs, and smarter warp scheduling. These are kernel optimization improvements, not algorithmic complexity changes.

FlashAttention 3 added asynchronous execution and FP8 support for H100, pushing utilization higher. Still O(N²). At 200K tokens, FlashAttention 3 on H100 still requires 327 trillion FLOPs per layer. It just executes those FLOPs efficiently.

NSA runs on top of FlashAttention-style kernel techniques but applies them to a sparse operation. The tiling and SRAM usage patterns are similar. The difference is NSA attends to roughly 8,000 tokens (compressed, selected, and window combined) instead of all 200,000.

DeepSeek NSA: The Three-Component Architecture

NSA employs three attention pathways in parallel, each operating at a different resolution over the key-value sequence. The outputs are combined at each query token.

Compressed Token Attention

The KV sequence is divided into non-overlapping blocks of g=32 tokens. A learned MLP aggregates each block into a single compressed token — one key vector and one value vector per block. Attention over N/32 compressed tokens captures coarse global context at O(N/32) cost. The compression function maps the full token block content into the compressed representation, preserving aggregate semantic information while discarding local variation.

Selected Token Attention

The compressed attention scores (probability weights over the N/32 compression blocks) serve as importance signals for each query token. Because selection operates on coarser 64-token blocks, scores from adjacent compression blocks are aggregated before ranking. The top-16 blocks by importance score are selected, plus a fixed first block (the attention sink fix). These 16 selected blocks contain 64 full-resolution tokens each, giving 1,024 selected tokens, which the query then attends to at full resolution. Using compressed scores to route fine attention is the key architectural insight: cheap coarse attention guides expensive fine attention.
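As an illustration of the routing step, here is a minimal pure-Python sketch. The function name and toy scores are mine, not from the NSA codebase; the real selection runs inside the Triton kernels:

```python
def select_blocks(block_scores, top_k=16):
    # block_scores: one importance score per KV block for a given query token
    ranked = sorted(range(len(block_scores)),
                    key=lambda i: block_scores[i], reverse=True)
    selected = set(ranked[:top_k])
    selected.add(0)              # attention-sink fix: block 0 always active
    return sorted(selected)

scores = [0.01] * 100
scores[42], scores[7] = 0.9, 0.5       # two high-importance blocks
print(select_blocks(scores, top_k=2))  # [0, 7, 42]
```

Each selected index maps to a contiguous 64-token span of the KV cache, which is what keeps the subsequent fine-grained attention hardware-friendly.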

Sliding Window Attention

The most recent w=512 tokens receive local attention. This serves two purposes: (1) local positional continuity — tokens depend heavily on their neighbors and sliding window captures this cheaply; (2) attention sink handling — early tokens (the first block in the sequence) accumulate disproportionate attention weight in full attention. NSA preserves this by always activating the first block in the selection step and covering recent tokens with the window.

The three pathways attend to different token sets. Compressed attends to the full sequence via aggregated blocks. Selected attends to 1,024 high-importance tokens at full resolution. Sliding window attends to the 512 most recent tokens. The outputs for each query token are weighted and summed.
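Schematically, the combination step is a gated sum over the three pathway outputs. The sketch below is a toy of my own: the paper derives gate values from a learned MLP over the query, while here they are fixed scalars:

```python
def combine(outputs, gates):
    # outputs: pathway name -> output vector for one query token
    # gates: pathway name -> gate weight (toy fixed values here)
    dim = len(next(iter(outputs.values())))
    return [sum(gates[p] * outputs[p][i] for p in outputs) for i in range(dim)]

o = {"compressed": [1.0, 0.0], "selected": [0.0, 1.0], "window": [2.0, 2.0]}
g = {"compressed": 0.5, "selected": 0.3, "window": 0.2}
print(combine(o, g))  # weighted sum of the three pathway outputs
```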

NSA Attention Coverage at Decoding (64K context)

Full attention:   attend to all 65,536 tokens

NSA breakdown:
  Compressed:     65,536 / 32 = 2,048 compressed tokens
  Selected:       16 blocks × 64 tokens = 1,024 tokens
  Sliding window: 512 tokens
  Fixed first:    1 block × 64 tokens (always activated)

Total loaded at decoding:
  Compressed KV:  2,048 vectors
  Selected KV:    1,024 vectors
  Window KV:      512 vectors
  Total:          ~3,584 unique tokens attended

Decoding speedup:
  65,536 / 3,584 ≈ 18× fewer memory loads
  Actual speedup: 11.6× (overhead from routing step)

Each pathway uses independent key and value projections. This is not incidental — it is a deliberate design choice to prevent shortcut learning. If the compressed and selected pathways shared projections, the model could learn to make compression trivially easy at the expense of quality in the selected pathway. Independent projections force each pathway to specialize.

End-to-end training matters

NSA is trained from scratch with the sparse attention architecture in place. The block selection routing is differentiable through the compressed attention scores. Wrong selections hurt the pretraining loss directly, so the model learns to compress and select well. Prior sparse methods applied sparsity post-hoc to full-attention models, producing selection mechanisms that were never optimized for the sparse structure.

The Hardware Efficiency Insight

Arithmetic intensity measures FLOPs per byte of memory access. Dense matrix multiplies have high arithmetic intensity — you load data once, compute many operations on it, write results once. GPU tensor cores are designed for this pattern. The H100 SXM delivers 989 TFLOP/s on BF16 matrix multiply because the data-to-compute ratio is favorable.

Standard sparse attention breaks this pattern. Scatter-gather access to arbitrary token positions generates cache misses at every step. The GPU spends cycles waiting for data rather than computing. Arithmetic intensity drops. You lose 70-80% of peak FLOP/s.

NSA avoids this by making all access patterns block-contiguous. The compressed attention step operates on sequential compressed token blocks. The selected attention step accesses contiguous 64-token blocks from the KV cache. The sliding window accesses the last 512 contiguous tokens. No scatter, no random access, no cache misses. The Triton kernels can use SRAM efficiently because the access pattern is predictable.
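The difference is visible even in a pure-Python sketch of the gather (illustrative only; on a GPU these slices become coalesced HBM loads inside the kernel):

```python
BLOCK = 64  # selection block size from the breakdown above

def gather_blocks(kv_cache, block_ids):
    # Each selected block is one contiguous slice; on a GPU this means
    # coalesced loads, versus one cache miss per token for random indices.
    return [kv_cache[b * BLOCK:(b + 1) * BLOCK] for b in block_ids]

kv = list(range(65_536))            # stand-in for a 64K-token KV cache
picked = gather_blocks(kv, [0, 7, 42])
print(len(picked), len(picked[0]))  # 3 64
```

Selecting 16 blocks means 16 contiguous reads, not 1,024 scattered ones: same token count, very different memory behavior.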

| Property | Random Sparse Attention | NSA Block-Sparse |
| --- | --- | --- |
| Memory access pattern | Scattered, random indices | Contiguous blocks, predictable strides |
| Cache behavior | Cache misses per token | Coalesced loads, full cache lines used |
| Tensor core utilization | 20-30% (data starvation) | Near-optimal (data feeds compute) |
| Arithmetic intensity | Low (many bytes per FLOP) | High (matches GEMM ratio) |
| Implementation | Requires custom CUDA scatter kernels | Triton blocked kernels on standard shapes |
| Practical speedup (64K) | Often slower than FlashAttention | 9x forward, 11.6x decoding vs full attention |

The paper describes this as "arithmetic intensity-balanced algorithm design." Every component of NSA is designed so the compute-to-memory ratio stays in the range where H800 and H100 tensor cores can saturate. It is not just about doing fewer FLOPs — it is about doing fewer FLOPs in a pattern the hardware can execute efficiently.

What This Means for Context Scaling

Full attention compute at decoding scales as O(N) per token generated (because each new token attends to all N prior tokens). NSA decoding attends to a fixed number of blocks regardless of N beyond a threshold. In practice, as N grows beyond the sum of selected blocks and sliding window, the attended token count stays roughly constant.

Attention Compute Scaling: Full vs. NSA

# Full attention: attended tokens grows linearly with context
Full attention (16K):  16,384 tokens attended per new token
Full attention (64K):  65,536 tokens attended per new token  (4× cost)
Full attention (200K): 204,800 tokens attended per new token (12.5× cost vs 16K)

# NSA: attended token count is approximately constant after warmup
NSA (16K):   1,024 (selected) + 512 (window) + 2,048 compressed = ~3,584
NSA (64K):   1,024 (selected) + 512 (window) + 2,048 compressed = ~3,584
NSA (200K):  1,024 (selected) + 512 (window) + 6,400 compressed = ~7,936

# Decoding cost ratio at 200K:
Full attention: 200,000 tokens × d FLOPs
NSA:            7,936 tokens × d FLOPs
Ratio:          ~25× cheaper at 200K

The compressed attention pathway does scale with N — it attends to N/32 compressed tokens, so at 200K context it processes 6,400 compressed vectors versus 2,048 at 64K. But compressed attention is cheap because vectors are already aggregated. The expensive fine-grained selected attention stays at 1,024 tokens regardless of total sequence length.

Combined with MoE sparsity at inference, the cost reduction compounds. A MoE model with 40B active parameters out of 744B total already saves on feedforward compute. NSA saves on attention compute. The two are multiplicatively independent. At 200K context with MoE and NSA, you are running roughly 3-5% of the total theoretical FLOPs of a dense full-attention model at the same parameter count.
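A quick sanity check of that range, using the article's own figures (~25x attention reduction from NSA at 200K, 40B of 744B parameters active, and attention taking most of a dense model's FLOPs at that length). The function is my back-of-envelope model, not a measured result:

```python
def combined_fraction(attn_share, nsa_reduction=25, moe_active_frac=40 / 744):
    # attn_share: fraction of a dense model's FLOPs spent in attention at 200K
    ffn_share = 1 - attn_share
    return attn_share / nsa_reduction + ffn_share * moe_active_frac

for share in (0.7, 0.8, 0.9):
    print(f"attention share {share:.0%} -> "
          f"{combined_fraction(share):.1%} of dense FLOPs")
```

Across plausible attention shares the result lands near 4% of the dense full-attention baseline, consistent with the 3-5% figure above.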

Benchmark Results: NSA vs. Full Attention

NSA was evaluated on a 27B total parameter / 3B active parameter MoE model pretrained for 1 trillion tokens. The baseline is the same architecture with full attention trained identically.

| Task | NSA Score | Full Attention Score | Delta |
| --- | --- | --- | --- |
| MMLU (general knowledge) | 0.565 | 0.567 | -0.002 (negligible) |
| GSM8K (math reasoning) | 0.520 | 0.486 | +0.034, NSA wins |
| DROP (reading comprehension) | 0.545 | 0.503 | +0.042, NSA wins |
| LongBench avg (long-context) | 0.469 | 0.437 | +0.032, NSA wins |
| LCC (code understanding) | 0.232 | 0.163 | +0.069, NSA wins |
| HPQ (multi-hop QA) | 0.437 | 0.350 | +0.087, NSA wins |
| AIME24 8K (CoT reasoning) | 0.121 | 0.046 | +0.075, NSA wins |
| NIAH 64K (needle retrieval) | Perfect | Perfect | Tied |

The reasoning and long-context improvements suggest that NSA's trained selection mechanism learns to focus attention more precisely than full attention does. On tasks requiring reasoning distributed across a long sequence, a handful of clean high-attention-weight blocks may be more informative than the diffuse attention weights full attention produces.

Speedup at 64K sequences: 9.0x forward pass, 6.0x backward pass, 11.6x decoding. Speedup scales with context length: at 200K the gains are larger, at 32K smaller. The crossover point where NSA becomes faster than full attention is around 16K-32K tokens.

Where NSA Still Fails

When token selection routes wrong

If the task requires attending to a small, isolated section of a 200K document where the compressed block representation fails to distinguish it from neighboring irrelevant content, NSA will miss it. Compressed attention averages the information within each 32-token block. A critical fact buried in a surrounding paragraph of boilerplate will have its block-level representation dominated by the boilerplate, not the fact. The selection step will rank this block low. The fine-grained attention step will never see it.

This failure mode is most likely in adversarial retrieval tasks, legal document analysis with isolated critical clauses, and code understanding where a rare function definition at token 50K determines the meaning of code at token 180K. For these tasks, full attention or hybrid approaches remain better.

The quality improvements NSA shows on benchmarks are from the in-distribution training data. Needle-in-haystack tests at 64K show perfect retrieval — but those tests use salient, recognizable needles. Real-world tasks often have subtle, low-salience relevance signals that compressed block representations struggle to preserve.
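The dilution effect is easy to demonstrate with a toy compressor. The sketch below substitutes simple mean pooling for the learned MLP (my simplification, not the paper's compressor) to show how one distinctive token nearly vanishes from a 32-token block representation:

```python
def block_repr(token_vecs):
    # stand-in for the learned compressor: plain mean over the block
    dim = len(token_vecs[0])
    return [sum(v[i] for v in token_vecs) / len(token_vecs) for i in range(dim)]

boilerplate = [[1.0, 0.0]] * 31   # 31 near-identical filler tokens
needle = [0.0, 1.0]               # one critical, distinctive token
rep = block_repr(boilerplate + [needle])
print(rep)  # [0.96875, 0.03125]: the needle dimension is diluted to 1/32
```

A learned compressor can do better than a mean, but it still has finite capacity per block; a low-salience fact surrounded by dominant filler is exactly where it fails, and the selection step never recovers what the compressor drops.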

Output length is a separate constraint. Long-context input does not imply long output. NSA reduces the cost of attending over a long KV cache during decoding, but each generated token still requires an autoregressive forward pass. Generation throughput is bounded by model size and hardware, not attention architecture.

Frequently Asked Questions

What is DeepSeek NSA?

NSA (Native Sparse Attention) is a trainable sparse attention mechanism from DeepSeek that uses three hardware-aligned components to reduce attention FLOPs at long context. The three components are compressed token attention (coarse global), selected token attention (fine-grained routing to high-importance blocks), and sliding window attention (local continuity). At 64K context, NSA achieves 9.0x forward speedup and 11.6x decoding speedup while matching or exceeding full attention on benchmarks.

How does the compressed attention route to selected attention?

The compressed attention step produces attention probability weights over N/32 block-level compressed tokens. These weights serve directly as importance scores for each 32-token block in the original sequence. The top-16 highest-scoring blocks are selected for fine-grained attention. Because the compressed scores are computed first and are cheap (O(N/32)), they route the expensive fine attention efficiently. The selection is differentiable through the softmax, so it is trained end-to-end.

Why is block-sparse attention faster than random sparse attention on GPU?

GPU memory bandwidth is optimized for coalesced sequential access. When a kernel reads a contiguous 64-token block from HBM, it loads a full cache line and uses every byte. When it reads 64 randomly scattered tokens, it generates 64 separate cache misses, loading far more data than needed. Block-sparse patterns match GPU memory architecture. Random sparse patterns do not. This gap explains why BigBird achieves O(N) theoretical complexity but is slower than O(N²) FlashAttention at any context length a modern GPU can handle.

Does NSA replace FlashAttention?

NSA uses FlashAttention-style kernel techniques internally — tiling, SRAM reuse, on-the-fly computation. FlashAttention is an implementation strategy; NSA is an algorithmic change. NSA with FlashAttention-style kernels gives you hardware-efficient sparse attention. They are complementary, not competing.

What model is NSA deployed in?

The NSA paper evaluated a 27B total / 3B active MoE model. DeepSeek-V3.2-Exp uses a fine-grained sparse attention variant based on NSA principles. The architecture enables long-context operation at reduced compute cost, combined with MoE sparsity for feedforward layers.

At what context length does NSA outperform FlashAttention in wall-clock time?

NSA reaches parity with FlashAttention around 16K-32K tokens and outperforms it increasingly as context grows. Below 16K tokens, the overhead from routing (compressed attention computation, block selection) exceeds the savings from attending to fewer tokens. Above 32K, the savings dominate. At 64K, NSA is 9x faster for training forward passes.

NSA Makes 200K Context Viable. Morph Applies the Edits at 10,500 tok/s.

Models running NSA at 200K context see and understand full codebases in a single context window. When those models generate code changes, Morph applies the edits deterministically at 10,500+ tokens per second. The fast apply layer that keeps your agent pipeline from bottlenecking on output.