The NVIDIA H200 has 141GB of HBM3e memory at 4.8TB/s of bandwidth. It shares the H100's Hopper compute (3,958 FP8 TFLOPS with sparsity) but carries nearly 1.8x the VRAM and 1.4x the bandwidth. For LLM inference, which is memory-bandwidth-bound, that memory is what lets bigger models fit, context run longer, and fewer GPUs serve a model. Morph serves its inference fleet on modern NVIDIA GPUs where VRAM and bandwidth set the ceiling on model size.
H200 Key Specs
The NVIDIA H200 is built on the Hopper architecture, the same compute architecture as the H100. Per NVIDIA's H200 spec page, the SXM form factor delivers 3,958 TFLOPS of FP8 Tensor Core performance with sparsity and 1,979 TFLOPS each of FP16 and BFLOAT16 with sparsity. INT8 is likewise 3,958 TOPS with sparsity.
The headline numbers are memory. The H200 incorporates 141GB of HBM3e running at 4.8TB/s of bandwidth. NVIDIA describes this as nearly 1.8x more GPU memory capacity and 1.4x higher memory bandwidth than the H100. Nothing about the compute changed; the memory system did.
NVIDIA ships two H200 form factors. The H200 SXM delivers 3,958 FP8 TFLOPS and 1,979 FP16 TFLOPS (with sparsity). The H200 NVL, a PCIe dual-slot variant, delivers 3,341 FP8 TFLOPS and 1,671 FP16 TFLOPS (with sparsity). Both carry the 141GB HBM3e memory.
Compute is shared, memory is the upgrade
The H200 SXM and H100 SXM have identical Tensor Core compute: both 3,958 FP8 TFLOPS and 1,979 FP16/BF16 TFLOPS with sparsity. The H200's only advantage over the H100 is memory: 141GB versus 80GB of capacity, and 4.8TB/s versus 3.35TB/s of bandwidth.
Why Memory Dominates LLM Inference
Autoregressive text generation is memory-bandwidth-bound, not compute-bound. On each decode step, the GPU loads the model layer weights and the cached keys and values into the compute cores, performs a relatively small amount of arithmetic, and emits one token. The bottleneck is the data movement, not the math.
This single fact reorders how you think about GPU selection for serving. A card with more bandwidth produces more tokens per second on the same model, because it feeds the cores faster. A card with more VRAM holds a larger model, a longer context window, or a bigger batch, because all of those live in memory.
Two things compete for VRAM during serving: the model weights, which are fixed in size, and the KV cache, which grows with every token of context and every concurrent request. The weights set the floor; the KV cache is what consumes whatever capacity is left. At long context and high batch size, the KV cache can dominate.
For a Llama 2 7B model in float16 with a 10,000-token context, the KV cache alone needs roughly 5GB, about one-third of the model's half-precision weight storage, per Hugging Face. That number scales with batch size and context length, which is why a 141GB H200 can serve far more concurrent long-context requests than an 80GB H100 of the same model.
H200 vs H100: The Difference Is Memory
The H100 SXM has 80GB of HBM3 at 3.35TB/s. The H200 has 141GB of HBM3e at 4.8TB/s. Their Tensor Core compute is identical: both 3,958 FP8 TFLOPS and 1,979 FP16/BF16 TFLOPS with sparsity. So the question of which to use comes down to whether your workload is memory-constrained.
| Spec | H100 SXM | H200 SXM |
|---|---|---|
| Architecture | Hopper | Hopper |
| Memory capacity | 80GB HBM3 | 141GB HBM3e |
| Memory bandwidth | 3.35TB/s | 4.8TB/s |
| FP8 Tensor Core (sparsity) | 3,958 TFLOPS | 3,958 TFLOPS |
| FP16/BF16 (sparsity) | 1,979 TFLOPS | 1,979 TFLOPS |
| Llama 2 70B inference | 1x baseline | 1.9x faster (NVIDIA) |
NVIDIA claims the H200 is 1.9x faster than the H100 on Llama 2 70B inference and 1.6x faster on GPT-3 175B inference. Those gains do not come from faster arithmetic. They come from the wider, larger memory system feeding the same Hopper Tensor Cores. A memory-bound workload speeds up when you widen the memory pipe.
The tradeoff: on a small model that fits in 80GB and is not bandwidth-starved at your batch size, the H200 and H100 perform similarly while the H200 costs more. The H200 earns its premium specifically when memory is the wall, which is the common case for 70B-and-larger models and long-context serving.
A10 vs H100 vs H200
The A10 sits at the opposite end of the curve from the Hopper data center cards. It is an Ampere-architecture GPU with 24GB of GDDR6 at 600GB/s and a 150W single-slot form factor, built for cost-sensitive inference of smaller models. It does not support FP8 in hardware.
| Spec | A10 | H100 SXM | H200 SXM |
|---|---|---|---|
| Architecture | Ampere | Hopper | Hopper |
| VRAM | 24GB GDDR6 | 80GB HBM3 | 141GB HBM3e |
| Memory bandwidth | 600GB/s | 3.35TB/s | 4.8TB/s |
| FP8 in hardware | No | Yes | Yes |
| TDP | 150W | 700W | 700W |
| Best for | 7B-class models, budget serving | 70B models, frontier compute | 70B+ models, long context, fewer GPUs |
The bandwidth gap is the story. The A10's 600GB/s is roughly one-eighth of the H200's 4.8TB/s. Since decoding is bandwidth-bound, that ratio sets a hard ceiling on how many tokens per second the A10 produces on any given model. It is a fine card for a 7B model at moderate traffic and the wrong card for anything that needs to hold tens of gigabytes of weights.
A10 tradeoff
The A10's 24GB cannot hold a 70B model in any common precision, and its lack of hardware FP8 means it cannot use the precision that makes large models fit on Hopper cards. Pick the A10 when the model is small and cost per GPU matters more than peak throughput. Pick a Hopper card when memory is the constraint.
How Many Parameters Fit
The rule of thumb for fitting a model on a GPU is two terms: weight storage plus KV cache. Weight storage is parameters multiplied by bytes per weight. KV cache is the per-token memory multiplied by context length and batch size, and it grows as generation proceeds.
Bytes per weight depends on precision. FP16 and BF16 use 2 bytes per parameter. FP8 uses 1 byte per parameter, halving weight storage. So a 70B model needs roughly 140GB in FP16 but only about 70GB in FP8. That is the difference between not fitting on a single H200 and fitting with room to spare.
Rough VRAM budget for a single GPU
// Weight storage = parameters * bytes_per_weight
// FP16/BF16 -> 2 bytes/param
// FP8 -> 1 byte/param
// 70B model
const params_70B = 70e9
const weights_fp16 = params_70B * 2 / 1e9 // ~140 GB -> does NOT fit on one H200
const weights_fp8 = params_70B * 1 / 1e9 // ~70 GB -> fits on a 141 GB H200
// Remaining capacity is for the KV cache (grows with context * batch)
const h200_capacity = 141 // GB
const kv_headroom = h200_capacity - weights_fp8 // ~71 GB for KV cache + overhead
// Reference: Llama 2 7B in float16 at 10,000-token context
// needs ~5 GB of KV cache (per Hugging Face) -- scales with context and batch.Working the numbers: a 70B model in FP8 needs about 70GB for weights, leaving roughly 71GB of the H200's 141GB for KV cache and overhead. That is enough for substantial context and batch on a single card. The same 70B model in FP16 needs about 140GB of weights alone, which leaves almost no room for KV cache and pushes you to multiple GPUs.
A 120B model in FP8 needs roughly 120GB of weights, which fits on a single 141GB H200 with limited context headroom, or runs with more room across two cards. On an 80GB H100, the same 120B model in FP8 does not fit on one card at all. This is the practical meaning of the H200's extra 61GB: it moves the single-GPU ceiling from the 70B class toward the 120B class.
FP8 and Why It Matters
FP8 is an 8-bit floating-point format with two encodings: E4M3 (4-bit exponent, 3-bit mantissa) and E5M2 (5-bit exponent, 2-bit mantissa). NVIDIA recommends E4M3 for weights and activations and E5M2 for gradients. The H100 and H200 support both in hardware through fourth-generation Tensor Cores and the Transformer Engine.
The payoff is twofold. NVIDIA states FP8 halves data storage requirements versus FP16/BF16 and doubles throughput. Halving storage is what lets a 70B model occupy 70GB instead of 140GB. Doubling throughput is why the H100 SXM lists 3,958 FP8 TFLOPS against 1,979 FP16 TFLOPS, an exact 2x.
FP8 is not free. The E4M3 format has a small dynamic range, so naive per-tensor quantization can introduce outlier errors. Production systems mitigate this: in vLLM, FP8 W8A8 quantization yields a 2x reduction in model memory and up to 1.6x higher throughput with minimal accuracy impact, storing a higher-precision scaling factor alongside each quantized tensor. DeepSeek-V3 validated FP8 training on a 671B model with relative loss error below 0.25% versus BF16.
The A10 has no hardware FP8 path. That is the structural reason it cannot play in the large-model serving space: it is locked into 2-byte weights, so its 24GB holds at most a 12B-class model in FP16, before any KV cache. FP8 is the lever that the Hopper cards have and the A10 does not.
The A10 at the Budget End
The A10 is the right answer for a specific, common workload: serving a small model at moderate concurrency where cost per GPU dominates the decision. Its 24GB of GDDR6 holds a 7B model in FP16 with room for KV cache, or a quantized 13B model. Its 150W single-slot design fits dense, power-constrained server deployments.
What it gives up is everything that depends on memory. The 600GB/s bandwidth caps token throughput well below any Hopper card. The 24GB capacity rules out 70B-class weights entirely. The absence of FP8 means it cannot use the precision that makes large models fit. For frontier-scale serving, the A10 is not in the comparison; for a 7B chatbot or an embedding model under steady load, it is the economical choice.
A10: 24GB, 600GB/s, 150W
Budget inference for 7B-class models at moderate traffic. Single-slot, low power. No hardware FP8. Wrong card for anything 70B or larger.
H100: 80GB, 3.35TB/s
Hopper compute with FP8. Holds a 70B model in FP8 with room for KV cache. The workhorse for frontier-model serving before the H200.
H200: 141GB, 4.8TB/s
Same Hopper compute as H100, 1.8x the VRAM and 1.4x the bandwidth. Pushes the single-GPU ceiling toward the 120B class and serves long context with fewer cards.
What This Means for Serving Models
Morph builds inference infrastructure for AI coding agents and runs its fleet on modern NVIDIA GPUs, including GH200 and H-class nodes. The same physics described here governs that fleet: VRAM and bandwidth set the maximum model size and context that a node can serve, and FP8 is the precision that makes large models fit.
That is why the served models on Morph's production fleet span a range of sizes mapped to node types. Larger models like glm51-754b and qwen35-397b (~120 tok/s) and minimax27-230b (~140 tok/s) need the memory capacity that Hopper-class cards provide; the fast-apply model morph-v3-fast reaches ~10,500 tok/s with ngram speculative decoding (k=64), a technique that, like FP8, attacks the memory-bandwidth bottleneck rather than the compute one.
If you are choosing a GPU to serve a model, the decision reduces to two questions. Does the model plus its KV cache fit in the VRAM at your target context and batch size? And is the bandwidth high enough to hit your tokens-per-second target? The H200's 141GB at 4.8TB/s answers both for the 70B-to-120B range on a single card; the A10's 24GB at 600GB/s answers both only for small models.
For more on how serving systems extract throughput from a fixed memory budget, see LLM Inference Optimization and AI Inference.
Frequently Asked Questions
How much VRAM does the NVIDIA H200 have?
The NVIDIA H200 has 141GB of HBM3e memory, per NVIDIA's H200 spec page. That is nearly 1.8x the 80GB of HBM3 on the H100 SXM. The extra capacity lets the H200 hold larger models, longer context windows, and bigger batches on a single GPU.
What is the difference between the H200 and the H100?
They use the same Hopper compute architecture and have identical Tensor Core throughput (both 3,958 FP8 TFLOPS and 1,979 FP16/BF16 TFLOPS with sparsity). The H200's advantage is purely memory: 141GB at 4.8TB/s versus the H100 SXM's 80GB at 3.35TB/s. NVIDIA reports the H200 is 1.9x faster on Llama 2 70B inference because that workload is memory-bound.
What is the H200 memory bandwidth?
The H200 delivers 4.8TB/s of memory bandwidth from its HBM3e stacks, per NVIDIA. That is about 1.4x the 3.35TB/s of the H100 SXM. Because LLM decoding is memory-bandwidth-bound, higher bandwidth translates directly into more tokens per second on the same model.
Can the H200 run a 70B or 120B model?
Yes. A 70B model in FP8 needs roughly 70GB for weights, leaving about 71GB of the H200's 141GB for KV cache and overhead, so it fits comfortably on one card. A 120B model in FP8 needs roughly 120GB of weights, which fits on a single H200 with limited context headroom, or runs with more room across two cards. The exact fit depends on precision, context length, and batch size.
What is the NVIDIA A10 good for?
The A10 is an Ampere-architecture GPU with 24GB of GDDR6 at 600GB/s and a 150W single-slot form factor. It is a budget inference card for smaller models, such as a 7B model in 16-bit or a quantized 13B model, serving moderate traffic. It cannot hold a 70B model and does not support FP8 in hardware.
Does the H200 support FP8?
Yes. The H200 uses Hopper's fourth-generation Tensor Cores and Transformer Engine, which support FP8 in both E4M3 and E5M2 encodings. NVIDIA states FP8 halves data storage versus FP16/BF16 and doubles throughput. FP8 lets a 141GB H200 hold a model that would overflow it in 16-bit precision.
Why does memory bandwidth matter more than compute for LLM inference?
Autoregressive text generation is memory-bandwidth-bound. Each decode step must load the model weights and the KV cache from memory into the compute cores, and that data movement, not the arithmetic, is the bottleneck. A GPU with higher bandwidth feeds the cores faster and produces more tokens per second, which is why the H200's 4.8TB/s outperforms the H100's 3.35TB/s on the same model.
Related Resources
Run Inference Where VRAM and Bandwidth Are Solved For You
Morph serves its fleet on modern NVIDIA GPUs, including GH200 and H-class nodes, so you hit large models and long context through one OpenAI-compatible API at api.morphllm.com without sizing GPUs yourself. Fast Apply runs at ~10,500 tok/s.
