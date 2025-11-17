A research article by Horace He and Thinking Machines Lab (founded by former OpenAI CTO Mira Murati) addresses a long-standing issue in large language models (LLMs). Even with greedy decoding (setting temperature to 0 so that no randomness is intentionally introduced) and fixed seeds, the same prompt often produces different outputs across runs or servers.

The Key Problem: Nondeterminism in Supposedly Deterministic LLM Inference

Users expect identical inputs → identical outputs when randomness is disabled.

In practice, rerunning the same prompt yields varying token sequences, breaking reproducibility.

This undermines scientific experiments, debugging, safety evaluations, reinforcement learning (RL), and enterprise trust.

A Common Misconception is Rejected by the Paper

The widely accepted “concurrency + floating-point hypothesis” runs as follows. Floating-point addition is non-associative: (a + b) + c ≠ a + (b + c) due to rounding.

Combined with unpredictable GPU thread scheduling (racing cores finish in different orders), this causes tiny numerical differences that cascade into different token choices.
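The non-associativity at the heart of this hypothesis is easy to check. The snippet below is a minimal illustration in plain Python (64-bit floats), not taken from the paper; the variable names are chosen here for illustration.

```python
# Floating-point addition is not associative: grouping changes the rounding.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # rounds after a + b, then again after adding c
right = a + (b + c)  # rounds after b + c, then again after adding a

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```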

The paper argues this is mostly wrong for modern LLM forward passes.

Most transformer operations use deterministic reduction trees (fixed-order reductions), not atomic operations or unordered adds.

GPU atomics are avoided in forward passes. True nondeterminism from races is rare.
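The distinction between a fixed-order reduction and an unordered, atomics-style accumulation can be sketched in plain Python. This is a toy model under stated assumptions: `tree_sum` and `unordered_sum` are hypothetical helpers written for this illustration, a random shuffle stands in for unpredictable thread scheduling, and the input values are deliberately extreme so that the effect is visible.

```python
import random

def tree_sum(values):
    """Pairwise (tree) reduction that always combines elements in the same order."""
    vals = list(values)
    if not vals:
        return 0.0
    while len(vals) > 1:
        paired = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:  # carry an unpaired last element to the next level
            paired.append(vals[-1])
        vals = paired
    return vals[0]

def unordered_sum(values, rng):
    """Accumulate in whatever order the elements 'arrive' (simulated by a shuffle)."""
    order = list(values)
    rng.shuffle(order)
    total = 0.0
    for v in order:
        total += v
    return total

values = [1e100, 1.0, -1e100, 1.0]
rng = random.Random(0)

tree_results = {tree_sum(values) for _ in range(200)}
unordered_results = {unordered_sum(values, rng) for _ in range(200)}

print(len(tree_results))       # a single value: the order never changes
print(len(unordered_results))  # more than one value: the order varies per run
```

A fixed tree is deterministic from run to run for a given input shape, which is why the paper looks elsewhere for the source of the variation.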

The True Root Cause: Lack of Batch Invariance in Inference Kernels

Production LLM servers use dynamic batching: multiple unrelated requests are grouped into variable-sized batches to maximize throughput.

Your single prompt is batched with whatever other users’ requests happen to be active at that moment → batch size, padding, and position within the batch change unpredictably.

Standard kernels (in FlashAttention, cuBLAS, Triton, etc.) for key operations are batch-sensitive:

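LayerNorm / RMSNorm

Statistics (mean/variance) are computed per token over the feature dimension, not over the batch, but a kernel may split that reduction across cores differently when the batch is small, which changes the summation order.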

MatMul / GEMM

Reduction order or block sizing adapts to batch shape for performance.

Softmax / Attention

Row-wise reductions are subject to similar shape-dependent optimizations.

Even tiny floating-point differences from different batch shapes propagate through layers and flip argmax in greedy decoding → completely different outputs.
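To make the batch-sensitivity concrete, here is a toy model in plain Python. It is a sketch under stated assumptions, not code from any real kernel: `NUM_CORES`, `reduce_row`, and `batch_sensitive_sum` are hypothetical names, and the heuristic (split each row across idle cores when the batch is small) only mimics the kind of shape-dependent choice a tuned kernel might make. The input values are exaggerated so the difference is visible in 64-bit floats.

```python
NUM_CORES = 2  # hypothetical number of parallel workers

def reduce_row(row, num_splits):
    """Sum a row as num_splits contiguous chunks, then add the partial sums in order."""
    chunk = -(-len(row) // num_splits)  # ceiling division
    partials = []
    for start in range(0, len(row), chunk):
        acc = 0.0
        for x in row[start:start + chunk]:
            acc += x
        partials.append(acc)
    total = 0.0
    for p in partials:
        total += p
    return total

def batch_sensitive_sum(batch):
    """Pick the split count from the batch size, as a throughput heuristic might."""
    # Small batch: spread each row over the idle cores. Large batch: one row per core.
    num_splits = max(1, NUM_CORES // len(batch))
    return [reduce_row(row, num_splits) for row in batch]

row = [1e100, 1.0, -1e100, 1.0]
filler = [0.0, 0.0, 0.0, 0.0]  # an unrelated request sharing the batch

alone = batch_sensitive_sum([row])[0]            # batch of 1 -> 2 splits
crowded = batch_sensitive_sum([row, filler])[0]  # batch of 2 -> 1 split

print(alone)    # 0.0
print(crowded)  # 1.0
```

The row itself never changes; only the presence of another request in the batch does. In a real model the discrepancy is a few units in the last place rather than a whole unit, but it is enough to flip an argmax when two logits are close.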

This is not specific to GPUs: inference served from CPUs or TPUs is affected as well.

Experimental Evidence

Using Qwen-3-8B in vLLM.

1,000 identical prompts yield dozens of unique outputs under normal conditions.

Isolating variables shows nondeterminism vanishes with fixed batch size=1 but reappears with dynamic batching.
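An experiment of this shape can be sketched as a small harness that sends the same prompt repeatedly and tallies the distinct completions. The code below is an assumption-laden outline, not the paper's script: `count_unique_completions` is a hypothetical helper, and `fake_generate` is a stand-in for a real temperature-0 request to an inference server.

```python
from collections import Counter
from itertools import cycle

def count_unique_completions(generate, prompt, n_runs=1000):
    """Call generate(prompt) n_runs times and tally each distinct completion."""
    return Counter(generate(prompt) for _ in range(n_runs))

# Stand-in for a real endpoint: returns a slightly different completion
# every third call, imitating batch-dependent divergence.
_variants = cycle(["answer A", "answer A", "answer B"])

def fake_generate(prompt):
    return next(_variants)

tally = count_unique_completions(fake_generate, "same prompt", n_runs=9)
print(len(tally))            # 2 distinct completions
print(tally.most_common(1))  # [('answer A', 6)]
```

With a deterministic server the tally has exactly one key; any larger count measures how much nondeterminism remains.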

What is the Solution? Batch-Invariant Kernels

Design and replace kernels so numerical results are identical regardless of batch size, padding, or position.

Force fixed reduction strategies, such as always using a canonical tree order.

Mask out padded elements properly without affecting statistics.

Avoid shape-dependent optimizations.
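The principle behind these points can be sketched with a toy reduction whose chunking never depends on the batch. This is an illustration in plain Python, not the library's implementation: `FIXED_CHUNK`, `reduce_row_fixed`, `batch_invariant_sum`, and `is_batch_invariant` are hypothetical names, and padding masks are left out for brevity.

```python
FIXED_CHUNK = 2  # hypothetical chunk size, chosen once and never tuned per batch

def reduce_row_fixed(row, chunk=FIXED_CHUNK):
    """Sum a row in fixed-size chunks, then add the partial sums in order."""
    partials = []
    for start in range(0, len(row), chunk):
        acc = 0.0
        for x in row[start:start + chunk]:
            acc += x
        partials.append(acc)
    total = 0.0
    for p in partials:
        total += p
    return total

def batch_invariant_sum(batch):
    """Every row is reduced the same way, whatever else is in the batch."""
    return [reduce_row_fixed(row) for row in batch]

def is_batch_invariant(kernel, row, max_batch=8):
    """Check that a row's result is bitwise equal alone and alongside filler rows."""
    filler = [0.0] * len(row)
    alone = kernel([row])[0]
    return all(
        kernel([row] + [filler] * extra)[0] == alone
        for extra in range(1, max_batch)
    )

row = [1e100, 1.0, -1e100, 1.0]
print(is_batch_invariant(batch_invariant_sum, row))  # True
```

The trade-off is that a fixed strategy cannot use idle cores when the batch is small, which is one reason batch-invariant kernels tend to be slower than shape-tuned ones.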

Implementation

Open-source PyTorch library: batch-invariant-ops (GitHub: thinking-machines-lab/batch-invariant-ops).

Drop-in replacements via torch.Library for RMSNorm, MatMul, Softmax, Attention, etc.

Demonstrated integration with vLLM in deterministic mode.

With these kernels, 1,000 runs of the same prompt produced 100% bitwise-identical outputs, even under dynamic batching.

The performance cost is a modest slowdown (10–40% depending on op/hardware). This is acceptable for research and safety-critical use, and can be mitigated with CUDA graphs.

Broader Insights and Implications

Reproducibility is not a luxury; it is essential for science, RLHF/RLAIF (cleaner reward signals), auditing, and aligning training vs. inference behavior.

The current “defeatist” attitude (“LLMs are probabilistic anyway”) hides fixable engineering flaws.

The paper calls for future inference engines to prioritize determinism alongside speed (e.g., by standardizing batch-invariant ops).

It has influenced follow-up work (e.g., LMSYS SGLang adopted these kernels for fully deterministic high-throughput inference).

It has the potential to make LLM evaluations more trustworthy and to reduce “noise” in benchmarks that was previously misattributed to model uncertainty.

In short, the paper reframes LLM nondeterminism as an engineering bug (batch-sensitive kernels) rather than an inevitable hardware limitation, and provides a practical, open-source fix that achieves bitwise reproducibility without giving up batching. This is a major step toward reliable, scientific-grade AI systems. The full post is available at: https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/