FP8 quantization stores model weights and activations in 8 bits instead of 16, halving memory and, on NVIDIA Hopper, Ada, and Blackwell Tensor Cores, doubling matrix-multiply throughput versus FP16. Two formats split the space: E4M3 (4-bit exponent, max value 448) for weights and activations, and E5M2 (5-bit exponent, max value 57,344) for gradients. Morph serves FP8 checkpoints in production to roughly halve memory footprint.
What FP8 Is
FP8 quantization represents model weights, activations, or gradients in an 8-bit floating-point binary format. It was proposed in the paper "FP8 Formats for Deep Learning" (Micikevicius et al., arXiv:2209.05433) as the natural progression beyond 16-bit formats for accelerating deep learning training and inference.
Like any floating-point number, an FP8 value splits its bits into a sign, an exponent, and a mantissa. The sign always takes 1 bit. The remaining 7 bits divide between exponent (dynamic range) and mantissa (precision). FP8 defines two ways to make that split, and the choice changes what the format is good for.
The motivation is mechanical, not abstract. Text generation in an LLM is memory-bandwidth-bound: the bottleneck is loading weights into the compute cores, not the arithmetic itself. Halving the bytes per weight halves the data that must move, which is why FP8 can lift throughput even when raw FLOPS are not the limit.
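A back-of-envelope sketch of that argument, assuming a hypothetical 30B-parameter dense model and the H100 SXM's 3.35TB/s memory bandwidth. If every weight must be read once per generated token, the bytes moved set a floor on per-token latency. The function name and model size are illustrative, and the estimate ignores the KV cache, activations, and batching.

```python
def min_ms_per_token(num_params: float, bytes_per_weight: int,
                     bandwidth_bytes_per_s: float) -> float:
    """Lower bound on decode latency if all weights are read once per token."""
    return num_params * bytes_per_weight / bandwidth_bytes_per_s * 1e3


NUM_PARAMS = 30e9         # hypothetical model size (assumption)
H100_BANDWIDTH = 3.35e12  # bytes per second, H100 SXM HBM3

fp16_ms = min_ms_per_token(NUM_PARAMS, 2, H100_BANDWIDTH)
fp8_ms = min_ms_per_token(NUM_PARAMS, 1, H100_BANDWIDTH)

print(f"FP16 floor: {fp16_ms:.1f} ms/token")
print(f"FP8 floor:  {fp8_ms:.1f} ms/token")
```

The floor scales linearly with bytes per weight, so the FP8 figure is exactly half the FP16 one regardless of the model size you plug in.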
E4M3 vs E5M2: Two Formats, Two Jobs
The two FP8 encodings make opposite bets with the same 7 non-sign bits. E4M3 gives 4 bits to the exponent and 3 to the mantissa, buying precision at the cost of range. E5M2 gives 5 bits to the exponent and 2 to the mantissa, buying range at the cost of precision.
E4M3 has a maximum normal value of 448 (1.75 x 2^8), an exponent bias of 7, and roughly 17 to 18 binades of dynamic range. To stretch that range, E4M3 does not represent infinities and reserves a single mantissa bit pattern (all ones, under an all-ones exponent) for NaN. It is the format for weights and activations, where the values cluster and precision pays off.
E5M2 has a maximum normal value of 57,344 (1.75 x 2^15), an exponent bias of 15, and roughly 32 binades of dynamic range. It follows IEEE-754 conventions for special values. The wider range suits gradients during training, where backpropagation produces values spanning many orders of magnitude.
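The numbers above can be checked by decoding every bit pattern. The sketch below is a plain-Python decoder written for illustration (the helper names `decode_fp8` and `summarize` are not from any library); it assumes the layouts described here: bias 7 with no infinities for E4M3, bias 15 with IEEE-754 style specials for E5M2.

```python
import math


def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int,
               ieee_specials: bool) -> float:
    """Decode one FP8 bit pattern (0..255) into a Python float."""
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_all_ones = (1 << exp_bits) - 1
    man_all_ones = (1 << man_bits) - 1

    if exp == exp_all_ones:
        if ieee_specials:
            # E5M2: IEEE-754 style infinities and NaNs.
            return sign * math.inf if man == 0 else math.nan
        if man == man_all_ones:
            # E4M3: only the all-ones mantissa is NaN; there are no infinities.
            return math.nan
    if exp == 0:
        # Subnormal: no implicit leading 1.
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def summarize(name, exp_bits, man_bits, bias, ieee_specials):
    values = [decode_fp8(b, exp_bits, man_bits, bias, ieee_specials)
              for b in range(256)]
    finite = [v for v in values if math.isfinite(v)]
    largest = max(finite)
    smallest = min(v for v in finite if v > 0)  # smallest positive subnormal
    binades = math.log2(largest / smallest)
    print(f"{name}: max={largest}, min_positive={smallest}, binades={binades:.1f}")
    return largest, smallest, binades


e4m3 = summarize("E4M3", 4, 3, 7, ieee_specials=False)
e5m2 = summarize("E5M2", 5, 2, 15, ieee_specials=True)
```

Measured from the smallest subnormal to the largest finite value, the spans come out at about 17.8 binades for E4M3 and 31.8 for E5M2, which is where the "17 to 18" and "roughly 32" figures come from.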
E4M3 (4-bit exponent, 3-bit mantissa)
Max normal value 448, exponent bias 7, ~18 binades of range. More mantissa means finer precision. Recommended for weights and activations. Drops infinities and uses one NaN pattern to widen range.
E5M2 (5-bit exponent, 2-bit mantissa)
Max normal value 57,344, exponent bias 15, ~32 binades of range. More exponent means wider dynamic range, coarser precision. Recommended for gradients. Follows IEEE-754 special-value conventions.
Why the split matters
Weights and activations in a trained model occupy a relatively narrow band of magnitudes, so spending bits on mantissa (E4M3) recovers more useful precision. Gradients during training span a far wider range, so spending bits on exponent (E5M2) avoids underflow and overflow. The same 8 bits, allocated for the job each tensor actually does.
FP16 vs FP8-E4M3 vs FP8-E5M2 vs INT8
FP8 is one point in a spectrum of reduced-precision formats. The table below compares it against FP16 (the baseline it replaces) and INT8 (the integer alternative). Bits, dynamic range, typical use, and hardware support all differ.
| Format | Bits | Range / max value | Typical use | Native hardware |
|---|---|---|---|---|
| FP16 / BF16 | 16 | FP16 max 65,504; BF16 max ~3.4e38 | Baseline training and inference precision | Ampere, Hopper, Ada, Blackwell |
| FP8-E4M3 | 8 | Max 448, ~18 binades | Weights and activations (inference) | Hopper, Ada, Blackwell |
| FP8-E5M2 | 8 | Max 57,344, ~32 binades | Gradients (training) | Hopper, Ada, Blackwell |
| INT8 | 8 | -128 to 127, uniform spacing | Weight/activation quant, broad support | Ampere, Hopper, Ada, Blackwell |
The decisive difference between FP8 and INT8 is shape. FP8 is floating point, so it has an exponent and represents magnitudes with consistent relative precision, which matches the heavy-tailed distribution of LLM activations. INT8 spaces its 256 values uniformly, so it needs careful per-channel calibration to keep outliers from dominating the scale. The cost is hardware reach: INT8 runs on older Ampere Tensor Cores, while native FP8 requires Hopper, Ada, or newer.
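A small sketch of that difference in shape, using made-up values and one shared absmax scale per format. The helpers `round_to_e4m3`, `fake_quant_e4m3`, and `fake_quant_int8` are illustrative, not a library API, and the E4M3 rounding is a simplified round-to-nearest that assumes the layout described earlier (bias 7, max 448).

```python
import math

E4M3_MAX = 448.0


def round_to_e4m3(x: float) -> float:
    """Round a float to the nearest E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    _, e = math.frexp(mag)                 # mag = m * 2**e with m in [0.5, 1)
    quantum = 2.0 ** (max(e - 1, -6) - 3)  # spacing: 3 mantissa bits, min exponent -6
    return math.copysign(min(round(mag / quantum) * quantum, E4M3_MAX), x)


def fake_quant_e4m3(x: float, scale: float) -> float:
    return round_to_e4m3(x / scale) * scale


def fake_quant_int8(x: float, scale: float) -> float:
    q = max(-127, min(127, round(x / scale)))
    return q * scale


values = [0.02, 0.5, 3.0, 100.0]  # illustrative tensor with one large value
amax = max(abs(v) for v in values)

fp8_err = [abs(fake_quant_e4m3(v, amax / E4M3_MAX) - v) / abs(v) for v in values]
int8_err = [abs(fake_quant_int8(v, amax / 127) - v) / abs(v) for v in values]

for v, f, i in zip(values, fp8_err, int8_err):
    print(f"value={v:>7}: FP8 rel. error={f:.3f}  INT8 rel. error={i:.3f}")
```

With the scale set by the largest value, INT8 rounds the smallest value to zero, while E4M3 keeps every value within a few percent because its step size shrinks with magnitude.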
Which GPUs Support FP8
FP8 is a hardware format, not just a software trick. Running it natively requires Tensor Cores that accept FP8 inputs. NVIDIA introduced these with the Hopper architecture and carried them forward.
| GPU | Architecture | FP8 Tensor Core | Memory |
|---|---|---|---|
| H100 SXM | Hopper | 3,958 TFLOPS (with sparsity) | 80GB HBM3, 3.35TB/s |
| H200 SXM | Hopper | 3,958 TFLOPS (with sparsity) | 141GB HBM3e, 4.8TB/s |
| L40S | Ada Lovelace | 1,466 TFLOPS (with sparsity) | 48GB GDDR6 |
| Blackwell | Blackwell | FP8 plus FP4 (2nd-gen TE) | HBM3e |
| A10 / A100 | Ampere | None (FP16/INT8 only) | 24GB / 80GB |
Hopper (H100, H200) added FP8 input formats E4M3 and E5M2 through fourth-generation Tensor Cores and the Transformer Engine, which recasts between FP8 and higher precision automatically. The H100 and H200 share identical Tensor Core compute (both 3,958 FP8 TFLOPS with sparsity); the H200's advantage is purely memory, 141GB HBM3e at 4.8TB/s versus the H100's 80GB at 3.35TB/s.
Ada Lovelace GPUs such as the L40S also support FP8 in hardware via fourth-generation Tensor Cores and a Transformer Engine that recasts between FP8 and FP16. Blackwell adds a second-generation Transformer Engine with micro-tensor scaling that enables 4-bit floating point (FP4) in addition to FP8, which NVIDIA says doubles the model size a given amount of memory can support while maintaining accuracy.
Ampere is not a native FP8 target
Ampere GPUs (A10, A100) have no FP8 Tensor Cores. vLLM officially supports FP8 W8A8 only on Hopper and Ada; Ampere is limited to weight-only FP8 (W8A16) via software Marlin kernels, which gives the memory saving but not the throughput doubling. If your fleet is Ampere, INT8 or weight-only quantization is the practical path, not native FP8.
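One way to encode that rule in a deployment script is to gate on CUDA compute capability. The sketch below is a rule of thumb, not an official support matrix: it assumes FP8 Tensor Cores start at compute capability 8.9 (Ada), with Hopper at 9.0 and Ampere at 8.0 or 8.6. The function name is hypothetical.

```python
def supports_native_fp8(compute_capability: tuple) -> bool:
    """True if the (major, minor) compute capability has FP8 Tensor Cores.

    Assumption: FP8 starts at 8.9 (Ada) and is present on 9.0 (Hopper) and newer.
    """
    return tuple(compute_capability) >= (8, 9)


examples = {
    "A100": (8, 0),  # Ampere
    "A10": (8, 6),   # Ampere
    "L40S": (8, 9),  # Ada Lovelace
    "H100": (9, 0),  # Hopper
}

for name, cc in examples.items():
    path = "native FP8 (W8A8)" if supports_native_fp8(cc) else "INT8 or weight-only"
    print(f"{name} (sm_{cc[0]}{cc[1]}): {path}")
```

On a machine with PyTorch installed, the tuple for the current device can be read with `torch.cuda.get_device_capability()` and passed straight in.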
Memory and Throughput Gains vs FP16
There are two headline gains, and they compound. NVIDIA states FP8 halves data storage and doubles throughput compared to FP16 or BF16. The H100 SXM figures confirm the second claim exactly: 3,958 FP8 TFLOPS against 1,979 FP16/BF16 TFLOPS (both with sparsity), a clean 2x.
For inference, vLLM measures the practical result directly: FP8 W8A8 quantization (8-bit weights, 8-bit activations) yields a 2x reduction in model memory requirements and up to a 1.6x improvement in throughput with minimal impact on accuracy. The throughput gain (1.6x) lands below the raw compute ratio (2x) because real serving is bounded by more than the matrix multiply: attention, the KV cache, and scheduling all share the budget.
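An Amdahl's-law style sketch makes the gap between 2x and 1.6x concrete. This is a simplified model, not how vLLM measured its figure: it assumes only the matrix-multiply share of serving time gets the 2x speedup and everything else is unchanged. The 75% share used below is an illustrative assumption.

```python
def serving_speedup(matmul_fraction: float, matmul_speedup: float = 2.0) -> float:
    """End-to-end speedup when only a fraction of the time is accelerated."""
    return 1.0 / ((1.0 - matmul_fraction) + matmul_fraction / matmul_speedup)


for fraction in (0.5, 0.75, 0.9, 1.0):
    print(f"matmul share {fraction:.0%}: {serving_speedup(fraction):.2f}x end to end")
```

Under this model, a 1.6x end-to-end gain corresponds to about three quarters of serving time being spent in FP8-accelerated matmuls; the full 2x only appears when nothing else is on the critical path.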
The memory saving has a second-order benefit. Smaller weights free GPU memory for a larger KV cache and bigger batch sizes, which is where continuous batching and serving throughput actually live. A model that fits in half the memory can serve more concurrent sequences on the same card. See LLM inference optimization for how these techniques stack.
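A rough sketch of that headroom, assuming a hypothetical 30B-parameter model on an 80GB H100. It counts weights only and ignores activations, the CUDA context, and fragmentation, so treat the numbers as an upper bound on what is left for the KV cache.

```python
def kv_cache_headroom_gb(gpu_memory_gb: float, params_billion: float,
                         bytes_per_weight: int) -> float:
    """GPU memory left after loading the weights (weights-only estimate)."""
    weights_gb = params_billion * bytes_per_weight  # 1e9 params x bytes = GB
    return gpu_memory_gb - weights_gb


GPU_GB = 80.0     # H100 SXM
PARAMS_B = 30.0   # hypothetical model size (assumption)

fp16_headroom = kv_cache_headroom_gb(GPU_GB, PARAMS_B, 2)
fp8_headroom = kv_cache_headroom_gb(GPU_GB, PARAMS_B, 1)

print(f"FP16 weights leave {fp16_headroom:.0f} GB for KV cache")
print(f"FP8 weights leave  {fp8_headroom:.0f} GB for KV cache")
```

In this example the weights shrink by 2x but the free memory grows by 2.5x, which is why the serving benefit can exceed what the raw memory ratio suggests.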
The Accuracy Tradeoff
Each FP8 format has only 256 bit patterns, and a few of those are reserved for NaN (and, in E5M2, infinities). Compressing a 16-bit tensor into that space loses information. The question is whether the loss is small enough to ignore, and the evidence says it can be, with care.
DeepSeek-V3, a 671B-parameter model (37B activated per token), designed an FP8 mixed-precision training framework and was the first to validate FP8 training at extreme scale. Its reported relative loss error stayed consistently below 0.25% versus a BF16 baseline. That result is from training, the harder case; inference quantization of an already-trained model is more forgiving still.
The failure mode is outliers. FP8 introduces issues with outliers in activations, weights, and gradients that can cause inaccuracies. A single large activation value forces the per-tensor scale wide, which crushes the precision available to every other value in the tensor. This is the real tradeoff, and it is why naive per-tensor FP8 sometimes degrades and well-calibrated FP8 does not.
State the downside plainly
FP8 is not free accuracy. A coarse single per-tensor scale on a model with large activation outliers can lose meaningful quality. The 0.25% loss-error figure from DeepSeek-V3 came from fine-grained quantization with high-precision accumulation, not from a one-line cast. If you quantize without calibration and then evaluate, expect to see a quality gap that calibration has to close.
Calibration and Scaling
Because E4M3 tops out at 448, real tensors must be scaled into the representable range before quantizing. vLLM stores a higher-precision (typically FP32) scaling factor alongside each quantized tensor and applies it at compute time. This is the minimum machinery FP8 requires.
vLLM's default FP8 path uses per-tensor (scalar) scaling: every Linear module weight except the final lm_head is quantized to FP8 E4M3 with a single per-tensor scale, and activations get a dynamic per-tensor scale. Finer-grained per-channel scaling is the next step up in accuracy and is under active development.
DeepSeek-V3 went further for training stability. Its framework ran most compute-dense GEMM operations in FP8 (accumulating to BF16 or FP32) while keeping sensitive operators (embeddings, the output head, MoE gating, normalization, attention) in higher precision, and used fine-grained quantization rather than one coarse per-tensor scale. The lesson generalizes: FP8 everywhere is wrong; FP8 where it is safe, higher precision where it is not.
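The effect of scale granularity can be sketched in plain Python. The example below compares one per-tensor scale against one scale per small block on a made-up tensor with a single outlier. It is an illustration of the idea, not vLLM's or DeepSeek-V3's implementation: the helper names, the values, and the block size of 4 are all assumptions chosen to keep the example small.

```python
import math

E4M3_MAX = 448.0


def round_to_e4m3(x: float) -> float:
    """Round a float to the nearest E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    _, e = math.frexp(mag)                 # mag = m * 2**e with m in [0.5, 1)
    quantum = 2.0 ** (max(e - 1, -6) - 3)  # spacing: 3 mantissa bits, min exponent -6
    return math.copysign(min(round(mag / quantum) * quantum, E4M3_MAX), x)


def fake_quant(block):
    """Quantize and dequantize a list of floats with one shared scale."""
    amax = max(abs(v) for v in block)
    scale = amax / E4M3_MAX if amax > 0 else 1.0  # kept in higher precision
    return [round_to_e4m3(v / scale) * scale for v in block]


def mean_rel_error(original, restored):
    return sum(abs(r - o) / abs(o) for o, r in zip(original, restored)) / len(original)


# Illustrative tensor: small values plus one large outlier (made-up numbers).
tensor = [0.001, 0.003, 0.002, 0.004, 0.0015, 0.0025, 0.0035, 1000.0]
BLOCK = 4  # hypothetical block size

per_tensor = fake_quant(tensor)
per_block = []
for i in range(0, len(tensor), BLOCK):
    per_block.extend(fake_quant(tensor[i:i + BLOCK]))

err_tensor = mean_rel_error(tensor, per_tensor)
err_block = mean_rel_error(tensor, per_block)
print(f"per-tensor scale: mean relative error {err_tensor:.3f}")
print(f"per-block scale:  mean relative error {err_block:.3f}")
```

With a single scale, the outlier pushes the small values into E4M3's subnormal range and several are flushed to zero. With per-block scales, the damage is confined to the block that contains the outlier, and the other block round-trips almost exactly.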
Serving an FP8 checkpoint (vLLM)
```python
from vllm import LLM

# Load a model with FP8 W8A8 quantization.
# Weights and activations are stored and computed in 8 bits;
# a per-tensor FP32 scale is stored alongside each quantized tensor.
llm = LLM(
    model="my-org/my-model-fp8",
    quantization="fp8",  # E4M3 weights, dynamic per-tensor activation scale
)

# Result on Hopper / Ada Tensor Cores:
#   ~2x less model memory than FP16
#   up to 1.6x higher throughput
#   lm_head stays in higher precision (not quantized)
out = llm.generate("Explain FP8 quantization in one sentence.")
print(out[0].outputs[0].text)
```
FP8 in Production at Morph
Morph builds inference infrastructure for AI coding agents and serves FP8 checkpoints across its production fleet. FP8 is the operating regime, not an experiment: halving the weight footprint relative to FP16 is what lets a model fit on a card with enough memory left for the KV cache and batch sizes that serving throughput depends on.
Morph's Tab next-action model is a Nemotron-H LoRA served in FP8. The production model-serving fleet (glm51-754b, qwen35-397b, minimax27-230b, and others) runs FP8 weights for the same reason: more concurrent sequences per GPU, lower memory pressure, higher tokens-per-second at the same hardware cost.
FP8 is one layer in a stack. It pairs with ngram speculative decoding (the morph-v3-fast apply model runs at ~10,500 tok/s), continuous batching, and prefix caching. Quantization shrinks the bytes moved per token; speculative decoding cuts the number of forward passes; together they multiply. See AI inference for the full picture.
| Property | FP16 weights | FP8 weights |
|---|---|---|
| Bytes per weight | 2 | 1 |
| Model memory | Baseline | ~2x smaller (vLLM W8A8) |
| Tensor Core throughput | 1,979 TFLOPS (H100) | 3,958 TFLOPS (H100) |
| Serving throughput | Baseline | Up to 1.6x (vLLM W8A8) |
| Accuracy impact | None (baseline) | Small with calibration (<0.25% loss error, DeepSeek-V3 training) |
Frequently Asked Questions
What is FP8 quantization?
FP8 quantization represents model weights, activations, or gradients in an 8-bit floating-point format instead of 16-bit. Each value has a sign bit plus an exponent and mantissa split. Versus FP16, FP8 halves storage and, on NVIDIA Hopper, Ada, and Blackwell Tensor Cores, doubles matrix-multiply throughput.
What is the difference between FP8 E4M3 and E5M2?
E4M3 uses a 4-bit exponent and 3-bit mantissa, max normal value 448, exponent bias 7. More precision, less range, so it is used for weights and activations. E5M2 uses a 5-bit exponent and 2-bit mantissa, max normal value 57,344, exponent bias 15. Wider range, coarser precision, so it is used for gradients during training.
Which GPUs support FP8 in hardware?
NVIDIA Hopper (H100, H200) supports FP8 via fourth-generation Tensor Cores and the Transformer Engine, accepting both E4M3 and E5M2. Ada Lovelace (L40S) supports FP8 in hardware. Blackwell adds a second-generation Transformer Engine extending below FP8 to FP4. Ampere lacks native FP8 Tensor Cores.
How much memory and speed does FP8 save versus FP16?
FP8 halves storage because each value uses 8 bits instead of 16. NVIDIA states FP8 doubles Tensor Core throughput: the H100 SXM delivers 3,958 FP8 TFLOPS against 1,979 FP16 TFLOPS. For inference, vLLM measures FP8 W8A8 at a 2x model-memory reduction and up to 1.6x throughput.
Does FP8 hurt model accuracy?
The impact is small with calibration. DeepSeek-V3 (671B parameters) trained in FP8 mixed precision and reported relative loss error consistently below 0.25% versus BF16. The risk is outliers: a single large value stretches the scale and leaves less precision for every other value in the tensor. Production FP8 mitigates this with a per-tensor scaling factor and, in harder cases, fine-grained per-channel scaling.
What is the difference between FP8 and INT8?
FP8 is floating point with an exponent, so it represents a wide range of magnitudes with consistent relative precision, matching the heavy-tailed distribution of LLM activations. INT8 is fixed-point integer with uniform spacing, usually needing careful per-channel calibration. FP8 runs natively on Hopper, Ada, and Blackwell; INT8 has broader hardware support including Ampere.
What is FP8 W8A8 quantization?
W8A8 means both weights (W) and activations (A) are 8-bit. In vLLM, FP8 W8A8 quantizes all Linear module weights except the final lm_head to FP8 E4M3 with a per-tensor scale, and applies a dynamic per-tensor scale to activations. It yields a 2x model-memory reduction and up to 1.6x throughput on Hopper and Ada.
Serve FP8 Models Without Building the Stack
Morph runs FP8 checkpoints in production: half the memory of FP16, higher throughput on Hopper and Ada Tensor Cores, calibrated for accuracy. Fast Apply, Model Router, WarpGrep, and Compactor are all available behind one OpenAI-compatible API at api.morphllm.com.
