Defeating Nondeterminism in LLM Inference: From Theory to Deterministic Code Edits

Why LLMs produce different outputs for the same prompt, what Thinking Machines proved about batch invariance, and how specialized models achieve deterministic code edits in production.

March 27, 2026 · 5 min read

Run the same prompt through Claude twice. You get two different outputs. For chat, that is fine. For code edits applied to production files, it is a bug.

[Figure: LLM nondeterminism — the same input producing different outputs, converging to a single deterministic result]
80 unique outputs from 1,000 identical runs · 100% match rate with batch-invariant kernels · 10,500 tok/s Fast Apply throughput · ~34% overhead for full determinism

Why LLMs Are Nondeterministic

LLM nondeterminism has been a known problem since the first transformer APIs shipped. The conventional explanation blamed floating-point non-associativity: since (a + b) + c does not always equal a + (b + c) at the bit level, and GPUs execute reductions in parallel with unpredictable thread ordering, the same computation could produce slightly different results each time.

In September 2025, Thinking Machines Lab (founded by Mira Murati, OpenAI's former CTO) published the first rigorous analysis overturning this explanation. Their key finding: in a typical LLM forward pass, there is usually not a single atomic add present. Most neural network operations use deterministic parallel reduction strategies. Running the same forward pass multiple times with identical inputs yields bitwise identical output, even on highly parallel hardware.

The real culprit is batch variance. The same prompt processed in a batch of 1 versus a batch of 32 produces different logits, because reduction kernels split their work differently depending on batch size. Since production servers process variable numbers of concurrent requests, batch size fluctuates nondeterministically, and so do outputs.

The key insight

LLM nondeterminism is not a GPU concurrency problem. It is a system architecture problem. The forward pass is deterministic for a given batch configuration. What changes between runs is the batch configuration itself.
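The effect is easy to reproduce in miniature. The sketch below is a toy illustration, not a real GPU kernel: it sums the same vector with two different reduction split widths, standing in for a kernel that splits its work differently at different batch sizes. The values are chosen so that floating-point rounding differs between the two strategies (`chunked_sum` is a hypothetical helper, not part of any library).

```python
def chunked_sum(values, chunk_size):
    """Sum `values` by reducing fixed-size chunks first, then combining
    the partial sums. Different chunk sizes change the rounding order."""
    partials = [
        sum(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return sum(partials)

# Values chosen so the small 1.0 terms are absorbed by the huge 1e16
# terms in one split strategy but survive in the other.
v = [1e16, 1.0, -1e16, 1.0]

print(chunked_sum(v, chunk_size=4))  # one sequential reduction
print(chunked_sum(v, chunk_size=2))  # two partial sums, then combine
```

The two calls compute the same mathematical sum but return different floats. In a production server, the split width is chosen by the kernel based on batch size, which is exactly why the same prompt yields different logits at different load levels.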

Sources of Nondeterminism

Six distinct mechanisms contribute to nondeterministic LLM outputs. Understanding which ones matter for your use case determines which mitigations are worth the cost.

Batch size variance

Reduction kernels (RMSNorm, matmul, attention) produce different numerical results depending on how many sequences are processed together. The dominant source in production.

Floating-point non-associativity

(a + b) + c != a + (b + c) at the bit level. Matters in operations using atomic adds, but these are rare in standard LLM forward passes.

Dynamic batching

Serving frameworks (vLLM, SGLang, TGI) group incoming requests into variable-size batches based on arrival timing. Same prompt, different batch, different output.

GPU kernel scheduling

Thread block execution order can vary across runs for operations using non-deterministic algorithms. Less impactful than batch variance but still measurable.

Quantization noise

Different quantization schemes (GPTQ, AWQ, FP8) introduce different rounding patterns. The same model at different quantization levels produces different outputs.

Silent model updates

API providers update model weights, infrastructure, or serving configurations without notice. The system_fingerprint field exists to detect this, but changes are frequent.

The relative impact is not evenly distributed. Thinking Machines' experiments showed that batch variance alone accounts for the vast majority of observed nondeterminism. Fixing only this one source, by making three kernels batch-invariant, was sufficient to achieve 100% bitwise reproducibility across 1,000 runs.

Temperature Zero Is Not Enough

The most common advice for deterministic LLM outputs is "set temperature to 0." This is necessary but wildly insufficient.

Temperature controls sampling: the process of selecting a token from the model's probability distribution. At temperature 0, the highest-probability token is always selected (greedy decoding). This eliminates one source of randomness. It does not address the fact that the probability distribution itself changes between runs due to batch variance.
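A schematic decoder step makes the distinction concrete. This is a sketch, not any provider's actual implementation: at temperature 0 the function reduces to a pure argmax, while at higher temperatures it samples from the temperature-scaled softmax. Note that temperature 0 fixes the choice *given the logits*; it does nothing about the logits themselves shifting between runs.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Schematic decoding step: temperature 0 means greedy argmax;
    otherwise sample from the temperature-scaled softmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in scaled]
    return rng.choices(range(len(logits)), weights=weights)[0]

logits = [1.0, 3.0, 2.0]
print(sample_token(logits, 0, random.Random(0)))    # always index 1
print(sample_token(logits, 1.0, random.Random(0)))  # varies with the rng
```

If batch variance nudges `logits` to `[1.0, 3.0000001, 2.0]` on the next run, the greedy path still returns index 1, but when two logits are nearly tied, the same nudge can flip the argmax.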

The seed parameter problem

OpenAI introduced a seed parameter in late 2023 specifically for reproducibility. Their own documentation describes the result as "mostly deterministic." The seed controls the random number generator used in sampling but cannot control forward-pass numerics, infrastructure routing, or batch composition.

Anthropic's Claude API does not expose a seed parameter at all. Google's Gemini supports a seed but with similar caveats. The honest position across all major providers: API-level determinism is best-effort, not guaranteed.

| Provider | Temperature 0 | Seed Parameter | Guaranteed Determinism |
| --- | --- | --- | --- |
| OpenAI | Yes | Yes (best-effort) | No |
| Anthropic | Yes | No | No |
| Google | Yes | Yes (best-effort) | No |
| vLLM (standard) | Yes | Yes | No |
| vLLM + batch-invariant kernels | Yes | Yes | Yes (with ~34% overhead) |

What temperature 0 actually eliminates

Temperature 0 removes sampling randomness: the nondeterminism from choosing among high-probability tokens. It does not remove numerical nondeterminism in the forward pass, which is the dominant source of output variation in production serving.

Batch Invariance: The Real Fix

Thinking Machines identified exactly three kernel types that need batch-invariant replacements: RMSNorm, matrix multiplication, and attention. These are the only operations in a standard transformer forward pass where the computation for one sequence depends on how many other sequences are in the batch.

Their open-source library provides drop-in replacements that guarantee identical output regardless of batch size. The implementation forces each reduction to use the same splitting strategy regardless of how many sequences are present.
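The idea can be sketched in a few lines of Python. This is a toy analogy to the real CUDA kernels, not their implementation: the "standard" reduction picks its split width from the batch size, while the batch-invariant version fixes the split width once (`batch_varying_sum` and `batch_invariant_sum` are hypothetical names for illustration).

```python
def batch_varying_sum(row, batch_size):
    """Toy stand-in for a standard kernel: the reduction split width
    depends on how many sequences share the batch."""
    chunk = max(1, len(row) // batch_size)
    partials = [sum(row[i:i + chunk]) for i in range(0, len(row), chunk)]
    return sum(partials)

def batch_invariant_sum(row, batch_size):
    """Batch-invariant version: one fixed split width, chosen once,
    regardless of batch size. Slower splits may be forced at large
    batches, which is where the overhead comes from."""
    chunk = 4
    partials = [sum(row[i:i + chunk]) for i in range(0, len(row), chunk)]
    return sum(partials)

# Same row, processed "alone" (batch of 1) vs "with 7 neighbors" (batch of 8).
row = [1e16, 1.0, -1e16, 1.0] * 4
print(batch_varying_sum(row, 1), batch_varying_sum(row, 8))      # can disagree
print(batch_invariant_sum(row, 1), batch_invariant_sum(row, 8))  # always agree
```

The varying version returns different floats for the same row at the two batch sizes; the invariant version returns identical results, at the cost of giving up the batch-size-tuned split that would be fastest.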

Empirical validation

The experiment was simple and conclusive. Run the prompt "Tell me about Richard Feynman" 1,000 times at temperature 0 through Qwen3-8B on vLLM:

80 unique completions with standard kernels · 1 unique completion with batch-invariant kernels

With standard kernels, the most common completion appeared only 78 out of 1,000 times. With batch-invariant kernels, all 1,000 were identical.

Performance cost

The initial implementation from Thinking Machines added approximately 60% overhead. SGLang's follow-up work reduced this to roughly 34% using optimized FlashInfer and FlashAttention 3 backends. For applications where reproducibility matters more than raw throughput, this is a reasonable trade-off. For high-volume production serving, it is often too expensive.

Why Code Is Different from Text

Nondeterminism is tolerable in most text generation. If Claude describes a concept slightly differently on two runs, both descriptions are usually correct. Code has no such tolerance.

Token-level fragility

A variable name shifting from 'result' to 'results' breaks every downstream reference. An indentation change alters Python program structure. A missing semicolon fails compilation.

Compositional edits

Coding agents apply sequences of edits. If edit #3 of 10 is nondeterministic, edits 4-10 may be applied to the wrong file state. The error compounds through the chain.

Debugging impossibility

When the same agent session produces different file states on different runs, you cannot reproduce the exact edit sequence that led to a bug. Debugging becomes guesswork.

CI/CD breakage

If a code generation step in your pipeline produces different output on each run, tests become flaky. Builds that pass locally fail in CI, or vice versa, with no code change.

This is not a theoretical concern. Teams using coding agents in production consistently report cases where re-running the same agentic workflow produces subtly different edits, some of which introduce bugs that were not present in previous runs. The standard workaround is retrying and diffing, which does not scale.

Deterministic Code Edits with Specialized Models

The batch-invariance approach solves determinism at the inference level but costs 34-60% in throughput. There is a cheaper path for code edits specifically: train a model that learns a deterministic mapping from (file content + edit instruction) to output file.

This is the approach behind Morph Fast Apply. Instead of making a general-purpose model deterministic through kernel modifications, Fast Apply is trained on the narrow task of applying code edits. The task is constrained enough that the model learns a near-deterministic function rather than a broad probability distribution over possible continuations.

Why specialization enables determinism

General-purpose models are trained to produce diverse, creative outputs. The probability distribution over next tokens is intentionally wide. Even at temperature 0, tiny numerical perturbations can flip the argmax between two nearly-equal probabilities.

A model trained specifically for code edits has a much sharper distribution. Given "add a null check before line 42," there are very few correct ways to write that edit. The model's confidence in the correct output is high enough that numerical noise does not affect the argmax. The right token wins by a large margin, every time.
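A toy argmax experiment shows why distribution sharpness matters. This sketch uses made-up logit values, not measurements from any real model: a near-tied pair of logits (the "wide" distribution) flips under a tiny perturbation, while a clear winner (the "sharp" distribution) does not.

```python
def argmax(logits):
    return max(range(len(logits)), key=lambda i: logits[i])

# Wide distribution (general-purpose model): two near-equal candidates.
wide = [4.000001, 4.000000, 1.0]
# Sharp distribution (specialized edit model): one clear winner.
sharp = [9.0, 2.0, 1.0]

noise = 1e-5  # stand-in for a batch-dependent numerical perturbation

flipped = argmax(wide) != argmax([wide[0], wide[1] + noise, wide[2]])
stable = argmax(sharp) == argmax([sharp[0], sharp[1] + noise, sharp[2]])
print(flipped, stable)  # the tie flips; the clear winner does not
```

The same perturbation is applied in both cases; only the margin between the top two candidates determines whether the output token changes.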

10,500 tok/s throughput · 0% kernel-level overhead · 99.7% edit accuracy (SWE-bench) · stable: same input, same output

The two-model architecture

The practical pattern for deterministic agentic coding separates reasoning from editing. The reasoning model (Claude, GPT-4, Gemini) decides what to change. It is allowed to be nondeterministic because reasoning benefits from exploration. Fast Apply handles how to apply the change. This step must be deterministic because it writes to the file system.

This separation means nondeterminism in the reasoning layer does not infect the edit layer. Even if Claude suggests a slightly different approach on two runs, Fast Apply applies each suggestion identically.

Two-model architecture

# Step 1: Reasoning model decides what to change (nondeterministic, that's OK)
reasoning_response = claude.complete("Add error handling to the parse function")

# Step 2: Fast Apply applies the edit deterministically
edited_file = morph.fast_apply(
    original_file=source_code,
    edit_instruction=reasoning_response,
    model="morph-v3-fast"
)
# Same original_file + same edit_instruction = same edited_file. Every time.

Benchmarking Determinism

Measuring LLM determinism requires running the same input many times and quantifying output variance. Four metrics capture different aspects of the problem:

Exact match rate

Percentage of runs producing bitwise identical output. The strictest metric. Thinking Machines used this to demonstrate 100% reproducibility with batch-invariant kernels.

Token-level agreement

Average fraction of tokens matching across runs. Captures partial determinism: two outputs that differ by one token score 99%+ but fail exact match.

Semantic equivalence rate

Percentage of runs producing functionally equivalent output. For code, this means the outputs compile and pass the same tests, even if formatting differs.

Edit stability

For code edit models specifically: whether the same edit instruction produces the same file diff across runs. The metric that matters for production coding agents.

Testing methodology

A minimal determinism test: run the same prompt N times (N >= 100) at temperature 0, hash each output, count unique hashes. If unique hashes equal 1, the system is deterministic for that input. Repeat across diverse inputs to build confidence. For code edit models, compare file diffs rather than raw outputs to account for whitespace normalization.
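The methodology above can be written as a short harness. This is a sketch with hypothetical names (`determinism_report`, and a `generate` callable standing in for your serving endpoint), not a published tool: it runs the same prompt N times, hashes each output, and reports the number of unique outputs alongside the exact-match rate.

```python
import hashlib

def determinism_report(generate, prompt, n=100):
    """Run `generate` n times on the same prompt. Return the number of
    unique outputs and the exact-match rate (share of the modal output).
    A fully deterministic system returns (1, 1.0)."""
    counts = {}
    for _ in range(n):
        digest = hashlib.sha256(generate(prompt).encode()).hexdigest()
        counts[digest] = counts.get(digest, 0) + 1
    exact_match_rate = max(counts.values()) / n
    return len(counts), exact_match_rate

# Demo with a deterministic stand-in "model"; a real test would call
# your serving endpoint here and should also normalize whitespace in
# diffs before hashing, per the methodology above.
unique, rate = determinism_report(lambda p: p.upper(), "tell me about feynman")
print(unique, rate)
```

Against a standard vLLM deployment this harness would reproduce the 80-unique-completions result; against batch-invariant kernels it should report (1, 1.0).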

Practical threshold

For most production code edit workflows, semantic equivalence rate above 99% is sufficient. Bitwise determinism is ideal but not always necessary if the edit produces functionally identical code. Fast Apply targets bitwise stability for the same input pair, achieving it without the kernel-level overhead that general-purpose determinism requires.

Frequently Asked Questions

Why are LLM outputs nondeterministic even at temperature 0?

Temperature 0 eliminates sampling randomness but not numerical nondeterminism in the forward pass. The primary source is batch variance: reduction kernels (RMSNorm, matmul, attention) produce different numerical results depending on batch size. Since server load fluctuates, batch size changes unpredictably, causing different outputs for identical prompts.

What did Thinking Machines prove about LLM nondeterminism?

In September 2025, Thinking Machines Lab demonstrated that the primary source of LLM inference nondeterminism is batch variance, not floating-point non-associativity from concurrent GPU threads. Running the same prompt 1,000 times at temperature 0 produced 80 unique completions. After replacing three reduction kernels with batch-invariant implementations, all 1,000 were bitwise identical.

Does the OpenAI seed parameter guarantee deterministic outputs?

No. OpenAI describes their API as "mostly deterministic" when using the seed parameter. The seed controls sampling randomness but does not address numerical nondeterminism in the forward pass caused by batch variance, GPU scheduling, and infrastructure changes.

Why does code generation need deterministic outputs more than text generation?

In natural language, slightly different wording usually preserves meaning. In code, a single token difference can change program semantics: a shifted variable name breaks references, an indentation change alters Python structure, a missing semicolon fails compilation. When coding agents apply sequences of edits, nondeterminism in one step compounds through all subsequent steps.

What are batch-invariant kernels?

Batch-invariant kernels are replacement implementations of reduction operations (RMSNorm, matrix multiplication, attention) that produce identical numerical results regardless of batch size. Thinking Machines released an open-source library of these kernels. The trade-off is approximately 34-60% slower inference.

How does Morph Fast Apply achieve deterministic code edits?

Fast Apply is trained specifically for code edit application rather than general text generation. The task is constrained enough that the model learns a near-deterministic function: given the same file content and edit instruction, it produces the same output. This avoids the 34-60% overhead of batch-invariant kernels because determinism comes from the model's learned behavior, not modified infrastructure.

What is the performance cost of deterministic LLM inference?

Batch-invariant kernels add approximately 34-60% overhead depending on implementation. SGLang achieved the lower end using FlashInfer and FlashAttention 3 backends. For specialized models like Fast Apply that are trained for deterministic tasks, there is no runtime overhead because determinism is a property of the model itself, not an infrastructure modification.

Deterministic Code Edits at 10,500 tok/s

Fast Apply is a model trained for one job: applying code edits to files. Same input, same output. No kernel modifications, no throughput penalty.