The Wind Fills the Sail, But It Doesn't Hold the Helm
I want a local LLM. European. Capable of handling EU regulatory analysis. And fluent in French.
I run a 1.28TB Apple Silicon cluster. I’ve tested every serious open-weight model that fits. I’ve built a benchmark protocol, the TMB Scoreboard, designed specifically to expose the gaps that academic leaderboards never show.
If I had to install one and only one local, open-weight, out-of-the-box model today, my choice would be Qwen 3.5.
Not Mistral. Not the European champion. A Chinese model.
Here’s why.
The number
Mistral Large, Mistral AI’s flagship, weighs 675 billion parameters. In French creative writing, it scores 420 out of 500. In standard GDPR analysis, 47.5 out of 50. In algorithmic code generation, 48 out of 50.
In advanced regulatory analysis, it scores 2 out of 50.
Not 20. Not 15. Two.
This is not an accident. This is not a bad prompt. It’s a reproducible pattern across all five Mistral family models tested, from the local Ministral 14B to the cloud-hosted Large 675B. Same prompt, same behavior, same result.
The protocol
The TMB Scoreboard is a homegrown benchmark: five tests designed to evaluate what an LLM must be able to do in a real professional context. The evaluator is Claude Opus 4.6, using the TMB v2 scoring rubrics. Infrastructure: the Exo 4× M3 Ultra cluster over Tensor RDMA for local models, OpenRouter for cloud models.
The full results document is available for download at the end of this article.
Test 01, Le Bruit Blanc (/500). Long-form creative writing. A 4,500-word literary short story, first person, no dialogue. The protagonist lives with a permanent inner white noise. A cat named Pip enters his life. The model must evolve its style throughout the text, from dry and telegraphic to lyrical. This is the most discriminating test on the creative dimension. Recycled formulas, dead metaphors, and generic vocabulary immediately cap the score.
Test 02, GDPR Analysis (/50). A standard European regulatory analysis case. The model must identify legal bases, controller obligations, and data subject rights. Legal rigor is scored.
Test 03, Python CLI (/50). A complete command-line tool. Task manager with JSON persistence, categories, priorities, filters, argparse subcommands. Python 3.10+, standard library only, integrated unit tests. The trap is not algorithmic difficulty. It’s engineering rigor: exhaustive error handling, edge cases, code quality.
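To make the engineering bar concrete, here is a minimal sketch of the kind of skeleton Test 03 expects: argparse subcommands, JSON persistence, and defensive file handling. The field names, the `tasks.json` file name, and the priority scale are illustrative assumptions, not the actual TMB specification, and a passing submission would also need filters, unit tests, and far more exhaustive error handling.

```python
# Illustrative sketch only: field names, file name, and priority scale
# are assumptions, not the TMB Test 03 specification.
import argparse
import json
from pathlib import Path

DB = Path("tasks.json")

def load_tasks():
    """Return the task list, tolerating a missing or corrupt file."""
    try:
        return json.loads(DB.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return []

def save_tasks(tasks):
    DB.write_text(json.dumps(tasks, indent=2))

def cmd_add(args):
    tasks = load_tasks()
    tasks.append({"id": len(tasks) + 1, "title": args.title,
                  "category": args.category, "priority": args.priority})
    save_tasks(tasks)

def cmd_list(args):
    for t in load_tasks():
        if args.category and t["category"] != args.category:
            continue
        print(f'[{t["priority"]}] #{t["id"]} {t["title"]}')

def build_parser():
    parser = argparse.ArgumentParser(prog="tasks")
    sub = parser.add_subparsers(dest="command", required=True)
    add = sub.add_parser("add", help="add a task")
    add.add_argument("title")
    add.add_argument("--category", default="general")
    add.add_argument("--priority", type=int, choices=range(1, 4), default=2)
    add.set_defaults(func=cmd_add)
    ls = sub.add_parser("list", help="list tasks")
    ls.add_argument("--category")
    ls.set_defaults(func=cmd_list)
    return parser

def main(argv=None):
    args = build_parser().parse_args(argv)
    args.func(args)
```

The point of the test is everything this sketch omits: what happens on a write failure, on duplicate IDs, on an invalid category. That is where the small Mistral models lose points.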
Test 04, Mandelbrot (/50). Mandelbrot set visualization with vectorized NumPy, PNG export, CLI zoom. The technical discriminator is vectorization. A model producing pixel-by-pixel loops has functional but non-conforming code.
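The vectorization discriminator can be sketched as follows: the whole complex grid is iterated at once with boolean masks, no per-pixel loop. This is an illustrative sketch under assumed parameters (grid size, bounds, iteration cap), not the TMB prompt, and it omits the PNG export and CLI zoom the test also requires.

```python
# Vectorized escape-time sketch; grid size, bounds, and iteration cap
# are illustrative assumptions, not the TMB Test 04 specification.
import numpy as np

def mandelbrot(width=200, height=150, max_iter=50,
               xmin=-2.0, xmax=1.0, ymin=-1.2, ymax=1.2):
    """Escape-iteration counts for the Mandelbrot set, whole grid at once."""
    x = np.linspace(xmin, xmax, width)
    y = np.linspace(ymin, ymax, height)
    c = x[np.newaxis, :] + 1j * y[:, np.newaxis]    # (height, width) complex grid
    z = np.zeros_like(c)
    counts = np.full(c.shape, max_iter, dtype=int)  # interior points keep max_iter
    alive = np.ones(c.shape, dtype=bool)            # points that have not escaped
    for i in range(max_iter):
        z[alive] = z[alive] ** 2 + c[alive]         # update only live points
        escaped = alive & (np.abs(z) > 2.0)
        counts[escaped] = i                         # record the escape iteration
        alive &= ~escaped
    return counts
```

A model that instead writes `for px in range(width): for py in range(height): ...` produces the same image orders of magnitude slower, which is exactly the non-conforming pattern the rubric penalizes.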
Test 05, Advanced GDPR (/50). The test that separates generalists from analysts. A multi-entity, multi-regulatory case with an active incident. PharmaBel SA (Brussels) deployed an Azure OpenAI LLM for clinical analysis. A researcher copy-pasted patient data directly into the model. Three entities involved (EU, US, India), four overlapping regulations (GDPR, HIPAA, AI Act, GxP/FDA 21 CFR Part 11). The prompt requests five structured analytical deliverables.
The results
Here is the full scoreboard, seven models across five tests.
Qwen 3.5 397B Q9 dominates with an average of 93.8%. Perfect score in standard GDPR (50/50), near-perfect in code (49 and 49.5/50), Big4-expert level in advanced GDPR (47.5/50). Its only relative weakness: 385/500 in creative writing, versus 435 for the unquantized BF16 version. The 50-point delta (10%) confirms that quantization impacts expressive finesse before anything else.
Qwen 3-6 follows at 89.4%. Lowest score in writing (315/500), but the highest in the benchmark for advanced GDPR (49/50). Pure analytical profile.
Mistral Large comes third at 74.6%.
Mistral, the first four tests
In creative writing, Mistral does better than expected. Large and Medium both reach 420/500, just behind Qwen 3.5 BF16 (435). The French prose is fluid, the stylistic evolution coherent. Ministral 14B surprises with 350/500, an honorable score for a model of this size.
In standard GDPR, Mistral Small leads with 48/50. All Mistral models score at least 43/50. The family handles basic European regulatory analysis well. For French-built models, that is the least one would expect.
In algorithmic code, Devstral achieves a perfect 50/50 on Mandelbrot. This is the only test in the benchmark where a Mistral model takes the lead. Ministral 14B at 48.5/50 confirms that even small Mistral models handle pure algorithmic code well.
In Python CLI, results are more mixed. Ministral 14B drops to 28/50, Mistral Small to 36.5/50. Pure algorithmic code does not predict the ability to produce structured engineering code with exhaustive error handling and clean architecture.
Up to this point, Mistral is a serious contender. Not dominant, but competitive.
And then the fifth test
The PharmaBel case. Three distinct legal entities (Brussels, Delaware, Bangalore). GDPR, HIPAA, AI Act, GxP overlapping. An active incident with a 72-hour clock ticking. Five structured analytical deliverables requested.
All five Mistral models produced ASCII diagrams.
Not prose analysis. Character-based diagrams. Flow charts. Boxes connected by arrows drawn with dashes and pipes.
Mistral Large: 2/50. Mistral Medium: 3/50. Mistral Small: 3/50. Devstral: 2/50. Ministral 14B: 4/50.
From the 675B cloud model to the 14B local. Same behavior. Same score.
Qwen 3.5 on the same prompt: 47.5/50. Structured analytical prose, entity by entity, regulation by regulation. 72-hour notification identified as active. AI Act Annex III classification correctly applied. Triple EU/US/India data flow with appropriate transfer mechanisms.
Qwen 3-6: 49/50.
What the pattern reveals
The collapse is not random. It is systemic.
When the prompt contains a sufficient number of structural elements (multiple entities, overlapping regulations, numbered deliverables), Mistral’s fine-tuning triggers a reflex: switch to “visual schema” mode. The model interprets structural complexity as a signal to produce a diagram rather than an analysis.
This behavior is absent in Qwen. Same prompt, same complexity, analytical prose.
The hypothesis is that Mistral’s RLHF or instruction tuning reinforced a pattern: “when it’s complex, draw a diagram.” This is probably useful in some contexts (software architecture, workflows). But when the prompt explicitly requests prose analysis with textual deliverables, the reflex fires anyway.
This is the type of failure that academic benchmarks never detect. MMLU, HumanEval, MT-Bench never test a multi-regulatory case with an active incident and cross-referenced deliverables. A model can score 85% on every public leaderboard and collapse on the first real professional case that falls outside the standard pattern.
Per-model profile
Mistral Large 2512 (675B, cloud). 420/500 in writing, expert in standard GDPR, production-ready in code. 2/50 in advanced GDPR. Excellent everywhere except complex multi-regulatory analysis. Average 74.6%.
Mistral Medium 3.1. Nearly identical to Large on T01 (420/500), competent on T02-T04. 3/50 in advanced GDPR. The best value-for-money Mistral, with the same structural weakness. Average 69.4%.
Devstral 2512 (123B MoE, code specialist). Code champion (50/50 Mandelbrot) but 262.5/500 in creative writing. 2/50 in advanced GDPR. Hyper-specialized. Useful as a dedicated code model, not as a generalist. Average 66.3%.
Ministral 14B 2512. 350/500 in writing for a 14B, 48.5/50 in Mandelbrot. 28/50 in structured Python, 4/50 in advanced GDPR. Interesting profile for a small model, expected limitations. Average 63.8%.
Mistral Small 2603 (119B MoE). Best Mistral in standard GDPR (48/50) but weakest in code (33-36.5/50). 3/50 in advanced GDPR. Inconsistent profile. Average 61.0%.
The verdict
Mistral is not a bad model. It is incomplete.
For French creative writing, Mistral Large and Medium at 420/500 are solid options. They outscore Qwen 3.5 Q9 (385) and are surpassed only by Qwen 3.5 BF16 (435). For targeted algorithmic code, Devstral at 50/50 proves its worth.
For professional use requiring regulatory, legal, or multi-domain analysis, Mistral is eliminated by the data.
Qwen 3.5 397B in BF16 on the Exo cluster remains the most complete model tested. 435/500 in writing, 50/50 in GDPR, 49.5/50 in code, 47.5/50 in multi-regulatory analysis.
The lesson of this benchmark: a model’s blind spots are invisible until you look for them. A benchmark that doesn’t test the real use case tests nothing. And discovering the blind spot in production is not an acceptable option.
TMB Benchmark v2, April 2026. Evaluator: Claude Opus 4.6. Infrastructure: Exo cluster 4× M3 Ultra (Tensor RDMA) + OpenRouter.
Sophie, The Monocle Bear