Mine is smaller than yours. So what?
The FP16 myth, and why nobody tells you what precision your AI actually runs at.
When you send a prompt to GPT-4o, Claude, Gemini, or DeepSeek — what precision is the model running at?
Nobody tells you. Not in the pricing page, not in the documentation, not in the API response headers. You’re paying per token for a model you assume runs at full precision, because that’s what the benchmarks were measured on.
It doesn’t.
And this matters more than you think.
What quantization actually is
An AI model is billions of numbers. Each parameter is a numerical weight — a coefficient that determines how the model processes information. During training, these weights are stored at maximum precision: 32-bit floating point (FP32) or 16-bit (FP16/BF16). DeepSeek R1 with its 671 billion parameters weighs approximately 1.3 terabytes in FP16. Kimi K2 with its trillion parameters exceeds 2TB.
No single machine loads that into memory.
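The memory math above is simple enough to check. A quick sketch (raw weight storage only — this ignores activations, KV cache, and quantization metadata like scales):

```python
# Raw weight storage for a model: parameters × bytes per weight.
def model_size_gb(params: float, bits_per_weight: int) -> float:
    return params * bits_per_weight / 8 / 1e9

deepseek_r1 = 671e9  # parameters
for name, bits in [("FP32", 32), ("FP16", 16), ("FP8/Q8", 8), ("FP4/Q4", 4)]:
    print(f"{name:7s}: {model_size_gb(deepseek_r1, bits):6.0f} GB")
# FP16 comes out near 1,342 GB — the ~1.3 TB figure above. Each halving of
# precision halves the footprint: ~671 GB at 8-bit, ~336 GB at 4-bit.
```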
Quantization solves this by reducing the precision of each weight. FP16 becomes Q8 (8-bit), then Q4 (4-bit). At each step, the model gets smaller — and loses nuance.
Look at the cover image. Same giraffe, same photograph. On the left, full precision: every spot pattern sharp, every gradation preserved, the texture of the fur visible. On the right, heavy compression: the spots become blocks, the gradations become flat areas, the fine details vanish.
The giraffe is still a giraffe. You’d recognize it instantly. But the information that made it a specific giraffe — the precise pattern, the subtle color shifts, the texture — has been sacrificed.
This is exactly what happens to a quantized model. The response looks like the right response. But the precision of reasoning, the coherence over long text, the ability to catch a nuance — all degrade silently.
And "silently" is the problem. A heavily quantized model doesn’t signal what it has lost. It responds with the same confidence. The difference only shows up when you test rigorously, on demanding tasks, with a non-quantized reference to compare against.
The precision spectrum
Not all bits are created equal. Here’s the hierarchy, from highest precision to lowest:
FP32 (32-bit floating point) — Training precision. Full fidelity. Impractical for inference at scale.
FP16 / BF16 (16-bit) — The "gold standard" for inference. BF16 trades precision for range, which matters for large values. This is what benchmarks are measured on. This is what you assume you’re getting.
FP8 (8-bit floating point) — Half the memory of FP16, with variable spacing between values. Preserves more precision for small weights (which matter most) and less for large ones. NVIDIA’s Hopper architecture was designed for this.
INT8 / Q8 (8-bit integer) — Same size as FP8, but with a fixed grid. Equal spacing between all values, like measuring with a ruler that has marks every centimeter regardless of whether you need millimeters. Simpler to implement, slightly less precise in theory.
FP4 / INT4 / Q4 (4-bit) — Quarter the memory of FP16. This is where real degradation starts. The model fits on smaller hardware, but the precision cost is measurable.
The key distinction most people miss: FP8 and Q8 are not the same thing. FP8 uses floating-point representation — the spacing between values is variable, denser near zero where most weights cluster. Q8 uses integer representation — fixed spacing, uniform grid. In practice, FP8 preserves more of the original model’s behavior for the same memory footprint.
This matters because when someone says "8-bit quantization," you need to ask: which kind?
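The two grids can be compared directly. Here is a toy sketch — deliberately simplified, not a faithful e4m3 implementation (real FP8 has subnormals and reserved NaN encodings) — enumerating a small floating-point format's values against a uniform integer grid over the same range:

```python
# Toy float grid: value = (1 + mantissa/2^m_bits) * 2^(exponent - bias).
# With 4 exponent bits and 3 mantissa bits this mimics the shape of e4m3.
def float_grid(e_bits=4, m_bits=3, bias=7):
    vals = []
    for e in range(2 ** e_bits):
        for m in range(2 ** m_bits):
            vals.append((1 + m / 2 ** m_bits) * 2.0 ** (e - bias))
    return sorted(vals)

fp = float_grid()
print("float spacing near zero:", fp[1] - fp[0])    # tiny steps where weights cluster
print("float spacing near max: ", fp[-1] - fp[-2])  # huge steps for rare large values
# An int8 grid scaled to the same range has one fixed step everywhere:
int_step = fp[-1] / 127
print("int8 step (uniform):    ", int_step)
```

The float grid packs its resolution near zero — exactly where most trained weights live — while the integer grid spends its levels evenly, wasting resolution on magnitudes that barely occur.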
Why we quantize
Economics.
A model served in FP16 consumes twice the GPU memory of FP8, and four times more than FP4. At the scale of millions of simultaneous requests, GPU memory is the bottleneck. More memory per request means fewer concurrent users per GPU. Fewer users per GPU means more GPUs. More GPUs means more cost.
When FP8 delivers 95% of FP16 quality at 50% of the GPU cost, the business decision is obvious. Especially at peak hours.
This isn’t a secret. Production inference frameworks — vLLM, TensorRT-LLM, SGLang — all support FP8, INT8, and INT4 as first-class serving modes. NVIDIA markets FP8 (on Hopper) and FP4 (on Blackwell) as the standard for production serving. The tooling exists because the industry uses it.
The question isn’t whether cloud providers quantize. It’s how much.
The FP16 myth
Here’s what nobody puts on the marketing page.
DeepSeek V3 was trained natively in FP8. Not quantized after training — trained at lower precision from the start. Their technical report documents it: weights stored in FP8, all matrix multiplications in FP8, with only critical components (embedding layers, attention mechanisms, gating networks) remaining in BF16. This is mixed-precision FP8, built into the model’s DNA.
When you use DeepSeek’s API, you’re not getting a model that was compressed from FP16. You’re getting a model that never existed in FP16.
OpenAI released gpt-oss in native MXFP4 — that’s 4-bit. Not as a compromise, not as a budget tier. As the release format. Their message was clear: if it’s good enough for us, it’s good enough for you. A 120-billion parameter model that fits in 80GB of VRAM. A 20B that fits in 16GB. The entire industry noticed.
The production reality is documented across Oracle, NVIDIA, and every major cloud provider: FP8 and INT8 are the standard serving formats. INT4 is used for throughput-critical endpoints. FP16 serving at scale is, for most providers, economically irrational.
The industry strategy is pragmatic: apply post-training quantization (PTQ) to INT8 for the majority of models — it’s fast to deploy and good enough. Reserve targeted quantization-aware training (QAT) only for models or layers where PTQ causes measurable regression, typically in coding, mathematics, and long-chain reasoning. This keeps time-to-market low while focusing training effort where it matters.
You’re not getting FP16. You haven’t been getting FP16 for a while. And the model doesn’t tell you.
The three amigos of inference
When you quantize a model, three things degrade. Understanding them is the difference between making informed decisions and guessing.
Perplexity — the model doubts more. Technically, perplexity measures how surprised the model is by the next token. Higher perplexity means the model’s probability distribution is flatter — it’s less certain about what comes next. In practice, this manifests as slightly less coherent text, more hedging, occasional non sequiturs. At Q8, the increase is typically negligible. At Q4, it becomes measurable.
Accuracy — the model makes more mistakes. Not dramatic, obvious mistakes — subtle ones. A legal clause interpreted with less nuance. A code snippet with a wrong parameter. A historical date off by a year. The kind of errors that look plausible and pass casual review. The kind that matter in production.
Divergence — the model drifts from its full-precision self. Given the same prompt, a Q4 model and the FP16 original will produce increasingly different outputs as the generation gets longer. On short responses, the difference may be imperceptible. On a 2,000-word analysis, the Q4 model has wandered into territory the FP16 model would never have reached.
These three degrade at different rates depending on the task. Factual recall degrades faster than creative writing. Mathematical reasoning degrades faster than summarization. Multi-step logical chains degrade faster than single-step classification.
And none of this shows up in the model’s confidence scores.
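Perplexity, at least, is concrete enough to compute by hand. It is the exponential of the average negative log-probability the model assigned to each correct next token — a sketch with hypothetical probabilities (not measurements from any real model):

```python
import math

# Perplexity: exp of the mean negative log-probability over a token sequence.
# A flatter, more "doubtful" distribution assigns lower probability to the
# right token, so perplexity rises.
def perplexity(token_probs):
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

confident = [0.9, 0.8, 0.85, 0.9]   # hypothetical FP16 next-token probabilities
hedging   = [0.6, 0.5, 0.55, 0.6]   # same tokens under a heavily quantized model
print(f"{perplexity(confident):.2f}")  # → 1.16
print(f"{perplexity(hedging):.2f}")    # → 1.78
```

A perplexity of 1.0 would mean the model was never surprised; every fraction above it is accumulated doubt.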
The empirical evidence
I tested this. Not with synthetic benchmarks — with real professional tasks.
The protocol: six structured tests covering technical writing, analytical reasoning, creative adaptation, situational awareness, code generation, and multilingual capability. Each test scored on precision, completeness, and professional usability. Same hardware (Mac Studio M4 Max, 64GB unified memory), same software stack (LM Studio, MLX format), same prompts, same evaluation criteria.
The result that changed everything: Qwen 2.5 32B at Q8 quantization consistently outperformed Qwen 2.5 72B at Q4 quantization.
Read that again. A model with less than half the parameters, running at higher precision, beat the bigger model running at lower precision. Not on one test — across the protocol.
The 32B Q8 produced more nuanced analysis, caught subtleties the 72B Q4 missed, and maintained coherence over longer outputs. The 72B Q4 had more raw knowledge but deployed it with less finesse — like an orchestra wearing boxing gloves. All the instruments are there. The dexterity is gone.
This isn’t an anomaly. It’s the predictable result of how quantization affects transformer architectures. Attention mechanisms — the core of what makes these models reason — depend on precise weight differentials to decide what information matters. Compress those differentials too aggressively, and the model can still retrieve facts but can’t weigh them properly.
Bigger with less precision loses to smaller with more precision. The pattern held across every model pair I tested.
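That last point about attention can be illustrated with a toy softmax step. The scores below are hypothetical, and rounding them to a coarse grid is only a stand-in for the real effect of quantizing the weights that produce them — but it shows how compressed differentials flatten attention:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical attention scores: one key is slightly more relevant than the rest.
scores = [2.10, 1.95, 2.05, 1.90]
# Snap the scores to a coarse grid (step 0.25) to mimic aggressive quantization.
coarse = [round(s / 0.25) * 0.25 for s in scores]

print(softmax(scores))  # the most relevant key still stands out
print(softmax(coarse))  # all four scores round to 2.0 — attention goes uniform
```

The model can still "see" every key. It just can no longer tell which one matters.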
What this means for you
If you’re running models locally — on a Mac Studio, a workstation, or a small cluster — you face this trade-off directly. Your RAM is finite. You choose: a bigger model at lower quantization, or a smaller model at higher quantization.
The data says: choose precision.
A 32B model at Q8 on 64GB of unified memory will serve you better than a 70B model crammed into the same space at Q4. The 32B loads faster, generates faster, and — counterintuitively — reasons better.
If you’re using cloud APIs, you face a different version of the same problem: you don’t know what precision you’re getting. The model that scored 92% on the benchmark in FP16 might be serving you at FP8 or INT4 during peak hours. And the API doesn’t tell you.
This is the gap that nobody talks about. The local model you control, running at a quantization you chose, at a precision you verified — might be running at equal or superior precision to the cloud model you’re paying per token for.
The "frontier superiority" is partly an economic mirage.
The hierarchy nobody teaches
Everyone in AI talks about parameters first. How many billions. How the benchmark scores compare. How the latest model is bigger than the last.
Parameters are the least important factor.
Here’s the hierarchy that actually determines model quality, in order:
First: alignment. Is the model calibrated for truth or for thumbs up? I documented this in The Slop Matrix — five cloud models in FP16, hundreds of billions of parameters each, and every single one failed to detect fabricated code in a technical document. One hallucinated it, the others applauded. Precision doesn’t fix a model that’s optimized to agree with you.
Second: quantization quality. At equal alignment, a 32B Q8 beats a 70B Q4. The precision of how the model stores and processes its weights determines the quality of its reasoning. Not the number of weights — the fidelity of each weight.
Third: parameter count. Only after alignment and quantization are equal does having more parameters give you an advantage. A well-aligned, well-quantized 70B model will outperform a well-aligned, well-quantized 32B. But strip away alignment or compress the precision, and those extra 38 billion parameters are dead weight.
Nobody presents the hierarchy in this order. The marketing says parameters first, because that’s the biggest number. The technical documentation says quantization is a trade-off, because it is. Nobody mentions alignment, because it’s harder to measure and harder to sell.
But when you’re choosing a model for production work — for a law firm, for a financial analysis pipeline, for a medical documentation system — this is the hierarchy that determines whether the output is reliable or just confident.
It’s not the size
The next time someone tells you their model has more parameters, ask them what precision it runs at. Ask them if the benchmarks were measured at serving precision or training precision. Ask them what quantization scheme they use in production.
They won’t answer. Not because they’re hiding something — but because the industry has decided that these details don’t belong on the marketing page.
They do. They belong on the first page.
The giraffe on the right is still a giraffe. But if your job depends on counting the spots, you need the one on the left.
This article is part of a series on AI infrastructure reality. Previously: The Slop Matrix — When Your AI Teaches You to Stop Thinking.
Sophie — The Monocle Bear Principal consultant — AI workflows & agentic UX themonoclebear.com