
Beyond the Frontiers

How Every AI Model Failed a Simple Debugging Task — And What That Tells Us About "Intelligence"

February 19, 2026 — Sophie, The Monocle Bear


I had a bug.

Not a subtle one. Not a race condition buried in async callbacks. Not a memory leak that only shows up after 72 hours. A straightforward, reproducible, observable bug in an n8n automation workflow.

33 RSS sources go in. Only articles from source #1 come out.

The code was correct. Every single node, reviewed and validated. The logic was sound. And yet, the pipeline was broken.

I asked five different AI models to diagnose it. Four of them — including models that score at the top of every benchmark — failed completely. Not because they’re bad at code. Because they don’t understand systems.


The Setup

I built an AI-powered media monitoring pipeline in n8n. The architecture is straightforward:

  • Sources Loop (splitInBatches) iterates over 33 RSS feeds — from Simon Willison’s blog to ArXiv, Hugging Face, Reddit’s r/LocalLLaMA, Hacker News, and more.
  • For each source, Read RSS Feed fetches the latest articles.
  • Normalize + Prefilter standardizes the data and applies optional regex filters.
  • An Items Loop (another splitInBatches) processes each article individually: deduplication against Notion, fulltext extraction, LLM analysis via LM Studio, relevance scoring, and finally creating a page in Notion.

Two nested loops. Sources on the outside, articles on the inside. Clean separation of concerns.
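
The wiring that matters can be sketched as a trimmed n8n `connections` block. This is reconstructed from the description above, not copied from the actual workflow; most inner-loop nodes are elided. By convention here, output index 0 is a splitInBatches node's "done" branch and index 1 is its "loop" branch:

```json
{
  "connections": {
    "Sources Loop": {
      "main": [
        [],
        [{ "node": "Read RSS Feed", "type": "main", "index": 0 }]
      ]
    },
    "Read RSS Feed": {
      "main": [[{ "node": "Normalize + Prefilter", "type": "main", "index": 0 }]]
    },
    "Normalize + Prefilter": {
      "main": [[{ "node": "Items Loop", "type": "main", "index": 0 }]]
    },
    "Items Loop": {
      "main": [
        [{ "node": "Sources Loop", "type": "main", "index": 0 }],
        [{ "node": "Dedup Check", "type": "main", "index": 0 }]
      ]
    },
    "Next Item": {
      "main": [[{ "node": "Items Loop", "type": "main", "index": 0 }]]
    }
  }
}
```

Note that both Normalize + Prefilter and Next Item feed the same Items Loop input — a detail that will matter shortly.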

Except it didn’t work. Only Simon Willison’s articles — the first source in the list — ever made it to Notion.


The Models I Tested

I fed the complete workflow JSON to each model, described the symptom, confirmed the code was correct, and asked for a diagnosis.

DeepSeek R1

DeepSeek produced the most verbose response — a detailed, well-structured analysis spanning five hypotheses:

  1. RSS Feed failures (timeouts, malformed XML)
  2. Prefilter regex blocking all items
  3. Fulltext fetch failures
  4. LLM analysis returning malformed JSON
  5. Notion API rate limiting

Every hypothesis was plausible. Every hypothesis was wrong. DeepSeek analyzed each node in isolation, found the logic correct (because it was), and then invented downstream explanations for a symptom it couldn’t explain. Its conclusion was telling: "The logic is sound, but the implementation relies on several assumptions."

Translation: I can’t find the bug, so the problem must be environmental.

GLM-4 (ChatGLM5)

Similar approach, similar result. Focused on feed accessibility and content extraction quality. Suggested adding retry logic and better error handling. Never questioned the loop architecture.

Qwen3 Coder

Went deep into the JavaScript, checking for off-by-one errors, null handling, and JSON parsing edge cases. Solid code review. Completely missed the actual problem, because the actual problem wasn’t in the code.

Cowork (Anthropic’s agentic tool)

Cowork iterated multiple times, each time drilling deeper into its initial hypothesis about feed parsing. This is the expected behavior of an agentic system — once it commits to a direction, it doubles down. Useful for execution, dangerous for diagnosis.

Claude Opus 4.6

Found it on the first pass. Not by analyzing the code, but by analyzing the connections.


The Actual Bug

The problem was architectural, not logical. It lived in the space between nodes, not inside them.

n8n’s splitInBatches node has two outputs:

  • Output 0 ("done"): fires when all items have been processed
  • Output 1 ("loop"): fires for each individual item

When you nest two splitInBatches nodes, the inner loop’s state management breaks. Here’s what happens:

  1. Sources Loop sends Source #1 (Simon Willison) to the pipeline.
  2. Items are normalized, and the Items Loop starts processing them one by one.
  3. Articles go through dedup, fetch, LLM analysis, Notion creation. Everything works.
  4. Items Loop finishes. Its "done" output (0) fires back to Sources Loop.
  5. Sources Loop advances to Source #2 (Hugging Face) and sends new items through.
  6. These items arrive at Items Loop’s input.
  7. But Items Loop still considers itself "done" from the previous batch. It doesn’t reset its internal state for the new incoming data.
  8. Items Loop immediately fires its "done" output again without processing anything.
  9. Sources Loop advances to Source #3. Same thing happens.
  10. Repeat for all 33 sources. Only Source #1 ever gets processed.

The bug isn’t a coding error. It’s a known behavioral limitation of nested splitInBatches in n8n. The inner node’s state persists across iterations of the outer loop, and the input port doesn’t distinguish between "new batch of items" and "continue processing the current batch."

Both inputs — new data from Normalize + Prefilter, and loop-back from the Next Item node — arrive on the same input port (index 0). The splitInBatches node has no way to differentiate them.
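
The failure mode can be reproduced with a toy model. This is not n8n's real implementation — just a minimal simulation of a splitInBatches-like node whose "done" flag persists across batches arriving on the same input port:

```javascript
// Toy model of the bug: a batching node whose "done" state is never
// reset when a NEW batch arrives on its single input port.
class ToySplitInBatches {
  constructor() {
    this.queue = [];
    this.done = false; // persists across outer-loop iterations — the bug
  }

  // Everything arrives on the same input; the node cannot tell
  // "new batch of items" apart from "loop back for the next item".
  receive(items) {
    if (this.done) return { output: "done", items: [] }; // fires immediately
    this.queue = items.slice();
    return this.next();
  }

  next() {
    if (this.queue.length === 0) {
      this.done = true;
      return { output: "done", items: [] };
    }
    return { output: "loop", items: [this.queue.shift()] };
  }
}

const inner = new ToySplitInBatches();
const processed = [];

// Outer "Sources Loop": two sources, each sending its articles inward.
for (const sourceItems of [["sw-1", "sw-2"], ["hf-1", "hf-2"]]) {
  let result = inner.receive(sourceItems);
  while (result.output === "loop") {
    processed.push(result.items[0]); // dedup / LLM / Notion would run here
    result = inner.next();
  }
}

console.log(processed); // only source #1's items — source #2 was skipped
```

Source #2's items hit `receive()` while `done` is still `true` from the first batch, so the node short-circuits straight to its "done" output — exactly the symptom: only the first source's articles ever come out.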


The Fix

Replace the entire inner loop (8 nodes: Items Loop, Dedup Check, Is New?, Fetch Fulltext, LM Studio Analysis, Relevant?, Create Notion Page, Next Item) with a single Code node that processes all articles from each source in a sequential for loop.

Before: Sources Loop → RSS → Normalize → Items Loop → [8 nodes] → Sources Loop
After: Sources Loop → RSS → Normalize → Process All Items → Sources Loop

The new node consolidates all the processing logic into one JavaScript function. No state management issues. No nested loop conflicts. Fifteen minutes after the fix, Hugging Face articles started appearing in Notion. The pipeline now processes all 33 sources correctly.
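
The shape of that replacement Code node looks roughly like this. The helper functions are stubs standing in for the real dedup, fetch, LLM, and Notion calls — their names and signatures are my assumptions, not n8n or Notion APIs:

```javascript
// Stubs standing in for the logic of the eight replaced nodes.
// In the real Code node these would call Notion, the fulltext
// fetcher, and LM Studio; here they just simulate the flow.
async function dedupCheck(item) { return item.seen === true; }
async function fetchFulltext(url) { return `fulltext of ${url}`; }
async function analyzeWithLLM(text) { return { relevant: text.length > 0, summary: text }; }
async function createNotionPage(item, analysis) { return { title: item.title, ...analysis }; }

// One sequential pass over a source's articles: no splitInBatches,
// no loop-back edge, no internal node state to go stale.
async function processAllItems(items) {
  const pages = [];
  for (const item of items) {
    if (await dedupCheck(item)) continue;        // skip articles already in Notion
    const fulltext = await fetchFulltext(item.link);
    const analysis = await analyzeWithLLM(fulltext);
    if (analysis.relevant) pages.push(await createNotionPage(item, analysis));
  }
  return pages;                                  // single "done" output
}
```

Because the function returns once per source, the outer Sources Loop sees exactly one completion per batch — the orchestration-level ambiguity is gone.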


Why This Matters

This isn’t a story about one model being smarter than another. It’s a story about what « intelligence » actually means in the context of AI-assisted development.

Every model I tested can write JavaScript. Every model can read n8n workflow JSON. Every model can analyze code logic. On benchmarks that test these skills in isolation, they all score impressively.

But debugging a real-world system isn’t about reading code. It’s about understanding how components interact at runtime. It’s about knowing that a splitInBatches node maintains internal state. It’s about recognizing that the bug might not be in any node, but between them.

Four models did what most AI models do when faced with a system-level problem: they decomposed it into individual components, analyzed each one, found them correct, and then fabricated plausible-sounding explanations for the symptom they couldn’t explain. DeepSeek’s response was the most instructive — five well-reasoned hypotheses, all wrong, because the reasoning started from the wrong premise.

The model that succeeded didn’t analyze the code first. It analyzed the connections — the wiring of output 0, output 1, and input 0 between nodes. It understood that n8n isn’t just a code executor; it’s a stateful runtime where the orchestration layer has its own behaviors and limitations.


The Uncomfortable Truth About Benchmarks

When we evaluate AI models, we test them on isolated tasks: write a function, solve a math problem, summarize a document, answer a question. These benchmarks are useful but fundamentally limited.

Real-world engineering isn’t isolated tasks. It’s systems. Systems have emergent behaviors. Bugs live in the interactions between correct components. Diagnosing them requires a mental model of the runtime, not just the code.

No benchmark I’ve seen tests for this. No leaderboard captures it. And yet, it’s the difference between an AI that helps you write code and an AI that helps you ship working systems.

The frontier isn’t where the benchmarks say it is.


Sophie is the founder of The Monocle Bear, an AI consultancy specializing in local LLM infrastructure, agentic UX, and workflow automation. She reads balance sheets, not just benchmarks.