AI-Native Development

Don't Build What You Can't Explain: The Cognitive Debt Test

Your codebase works. Your tests pass. But can you explain why each architectural decision was made? The Cognitive Debt Test is a 5-question framework for teams building with AI — and most codebases fail it.

4ge Team

The Team That Shipped Fast and Understood Nothing

A four-person startup. Three developers, one founder. Six months of shipping with AI assistance — Cursor for code, Copilot for reviews, GPT-5.5 for architecture advice. Velocity was extraordinary. They'd built their entire MVP in 8 weeks. Investors were impressed. The demo was flawless.

Then a customer reported a bug in the payment flow. Not a trivial bug — the kind where money goes to the wrong place. The founder opened the codebase and read the payment module. It worked. The logic was correct. But he couldn't explain why it worked. The code referenced a TransactionValidator class that nobody remembered writing. The validation sequence — validate order, then check inventory, then authorize payment — had a specific ordering that nobody could explain. When he asked the team, he got three different answers. The developer who'd implemented it had used AI for 80% of the code and couldn't reconstruct the reasoning either.

They spent 3 days debugging a problem that should've taken 3 hours — not because the bug was complex, but because nobody understood the system well enough to know where to look. The code was a black box that happened to work. Until it didn't.

This is cognitive debt: the gap between code that works and code you understand. And it's the silent killer of AI-native teams — because AI lets you ship fast without understanding, and the bill comes due at exactly the worst moment.

80%+

Of code in AI-assisted projects that the original author cannot fully explain when asked three months later — not because they forgot, but because they never understood it in the first place.

The Cognitive Debt Test

I've been thinking about this gap — between working software and understood software — and I've come up with a test. Five questions. If your team can answer all five, you're in good shape. If you can't, you have cognitive debt. And the longer you've been building with AI, the more likely it is that you can't.

Question 1: Can you explain why each architectural decision was made?

Not what the decision was. Why it was made. What alternatives were considered. What specific problem it solved. What would happen if you reversed it.

"We use the repository pattern" is what. "We use the repository pattern because we tried active record and ended up with data access logic smeared across forty service files, and it took a full sprint to untangle, and during that sprint we had a production incident because the UserStore was directly querying the orders table" is why.

Most AI-built codebases have plenty of whats. They're missing almost all of the whys — because the AI doesn't document its reasoning, and the developer who accepted the AI's suggestion often didn't understand the tradeoffs deeply enough to document them later.

This is the most important question on the test. If your team can answer this one, the other four tend to be manageable. If they can't, everything downstream is built on assumptions nobody can verify.

Question 2: Can you identify which code was AI-generated vs human-written?

This isn't about blame. It's about understanding the provenance of your system. AI-generated code and human-written code have different failure modes:

  • AI-generated code tends to be correct in isolation but disconnected from the system's architecture. It follows patterns that look right but may violate your conventions. It handles the happy path well and edge cases poorly — because edge cases are specific to your system, and the AI didn't know about them.

  • Human-written code tends to have more idiosyncrasies but carries more contextual understanding. A developer who wrote a function manually likely understood why it needed to work that way. An AI that generated the same function understood that it should work that way, but not why the alternative would be wrong for your specific system.

The problem isn't AI-generated code. The problem is AI-generated code that nobody can identify — because when you can't tell which code was generated, you can't predict which code will have the characteristic AI failure modes. You're flying blind through your own codebase.

60%+

Of developers in AI-assisted teams who cannot reliably identify which parts of their codebase were AI-generated when asked to do so in a code review.

Question 3: If the lead developer left, could the rest of the team maintain the codebase?

The bus factor test. But with a twist specific to AI-built systems: in many AI-native teams, the lead developer is the only person who carries the project's architectural context in their head — because the AI doesn't document its reasoning, and the context doesn't exist anywhere else.

A traditional codebase has documentation, comments, commit messages, pull request threads, and institutional knowledge distributed across the team. An AI-built codebase often has... the code. The code that works. And the lead developer's memory of what the AI suggested and why they accepted it.

If your lead developer went on holiday for two weeks, could the remaining team modify the payment processing flow without breaking the fraud detection middleware they didn't know existed? If the answer is "I'm not sure" — that's cognitive debt.

Question 4: Do you have specifications that explain the 'why', not just the 'what'?

Documentation that describes what the system does is useful. Documentation that explains why it does it that way is essential. The difference:

  • What: "The payment module validates orders before checking inventory."
  • Why: "The payment module validates orders before checking inventory because we had an incident where inventory was reserved for an invalid order, the customer was charged for something that didn't pass validation, and it took three days of manual refunds to sort out. The fix was to reorder the validation sequence. If someone reverts this ordering, the same incident will recur."

The "what" tells a developer (or an AI) what the code does. The "why" tells them what happens if they change it. Without the "why", your specifications are just a description of the current state — not a guide to what must be maintained.

Most AI-built codebases have almost no "why" documentation. The AI generates code that implements the "what" but doesn't explain the reasoning behind it. And the developer who prompted the AI often doesn't document their reasoning either — because the code already works, so why bother?

Here's why: the next person who touches that code — or the next AI session that modifies it — won't know that reordering the validation sequence will cause a production incident. They'll see "validate orders, then check inventory" and think "that seems slower than it needs to be, let's optimise." And you'll relive the incident.
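To make this concrete, here is a minimal sketch of what encoding that "why" next to the ordering it protects might look like. Every name in it (processPayment, validateOrder, reserveInventory, authorizePayment, the Order shape) is hypothetical rather than taken from any real codebase; the point is only that the incident rationale travels with the constraint it explains.

```typescript
// Hypothetical sketch: names and types are illustrative, not from a real system.

interface Order {
  id: string;
  items: string[];
  total: number;
}

async function validateOrder(order: Order): Promise<void> {
  if (order.items.length === 0 || order.total <= 0) {
    throw new Error(`Order ${order.id} failed validation`);
  }
}

async function reserveInventory(_order: Order): Promise<void> {
  // Reserve stock for each line item (stubbed for the sketch).
}

async function authorizePayment(_order: Order): Promise<void> {
  // Charge the customer (stubbed for the sketch).
}

// WHY THIS ORDERING MATTERS (the "why", not just the "what"):
// Validation runs before inventory is reserved and before payment is
// authorized. A past incident reserved inventory for an invalid order and
// charged the customer; the cleanup took three days of manual refunds.
// Reordering these steps reintroduces that failure mode.
export async function processPayment(order: Order): Promise<void> {
  await validateOrder(order);    // 1. Reject invalid orders first
  await reserveInventory(order); // 2. Only reserve stock for valid orders
  await authorizePayment(order); // 3. Only charge once stock is secured
}
```

Whether that rationale lives in a code comment, a spec, or an ADR matters less than that it lives next to the thing it constrains, where the next person or the next AI session will actually see it.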

Question 5: When you onboard a new developer, how long until they're productive?

Not "how long until they can commit code." How long until they understand the system well enough to make architectural decisions? How long until they can modify the payment flow without breaking the fraud detection? How long until they know why the validation sequence is ordered the way it is?

In a well-documented codebase with explicit architectural reasoning, a new developer can be productive in days. In a codebase built by AI where the "why" exists only in the original developer's head — and maybe not even there — onboarding takes weeks. Not because the code is bad, but because understanding it requires reverse-engineering the reasoning of an AI that didn't document its thought process and a developer who moved on to the next feature.

3-5x

Longer onboarding time for developers joining AI-built codebases without structured specifications, compared to codebases where architecture decisions are documented with rationale.

Why AI Amplifies Cognitive Debt

Speed without understanding is the core dynamic. AI coding assistants are fast at generating code that works. Absurdly fast. They're not designed to help you understand why that code works, or what happens when you change it, or what architectural constraints it depends on.

Think of it this way: every feature you ship with AI creates two things. The feature itself — visible, testable, demo-able. And the understanding debt — invisible, unmeasured, compounding. The feature ships once. The debt accrues forever.

Traditional technical debt is about code quality — shortcuts in implementation that make future changes harder. Cognitive debt is about understanding quality — gaps in shared knowledge that make future changes dangerous. They're related but distinct. You can have clean code with massive cognitive debt (it works, but nobody knows why). You can have messy code with low cognitive debt (it's ugly, but everyone understands it).

AI-assisted development tends to produce code that's cleaner than average — AI follows patterns consistently, doesn't get lazy, doesn't take shortcuts in naming. But it produces understanding that's worse than average — because the AI's reasoning isn't captured, the developer's acceptance of the AI's suggestion isn't documented, and the architectural context that led to the suggestion isn't preserved.

The cognitive debt spiral works like this:

  1. You use AI to ship a feature fast. The code works. You move on.
  2. Nobody documents the architectural decisions, because the code already works. Why bother?
  3. Next feature. The AI doesn't know about the previous feature's edge cases — because they weren't documented. It generates code that conflicts with them.
  4. You discover the conflict. You fix it. But you don't document it, because you're already behind on the next feature.
  5. Repeat. Each cycle adds understanding debt. Each cycle makes the next one more expensive, because the number of undocumented decisions keeps growing.

The curve isn't linear. It's exponential. The first 3 months feel great — velocity is high, features are shipping, the codebase is growing. Months 4 through 6 are where the debt starts biting — mysterious bugs, longer debugging sessions, developers who can't explain their own code. By month 9, the team is spending more time understanding the system than building new features.

The Spectrum: Vibe → Text → Visual

There are three approaches to specs in AI-native development, and they map directly to cognitive debt levels:

Level 1: Vibe Coding (Maximum Cognitive Debt)

No spec. Prompt the AI, accept the output, ship it. Fast, fun, and — for production software — catastrophic. The AI generates code that works. You have no record of why it works, what alternatives were considered, or what constraints must be maintained. When the next developer (or the next AI session) modifies it, they're guessing.

Vibe coding works for prototyping. For anything that affects production, it's a liability.

Level 2: Text Specs (Moderate Cognitive Debt)

Write specifications before code. Document requirements, design decisions, and implementation steps. This is what OpenSpec, GitHub Spec Kit, and .cursorrules files encourage — structure your intent before you delegate to the AI.

Text specs are a massive improvement over vibe coding. They force you to think before you build. They create a record of what you intended. They give the AI better context.

But text specs have a structural limitation: they're linear. They describe the happy path well. They tend to miss the branches — the error states, the conditional logic, the conflicting requirements that only become visible when you can see the whole system at once. A text spec can tell you "validate orders before checking inventory." It can't show you the gap between "payment succeeds" and "order confirmed" where the idempotency key needs to live.
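As a rough illustration of the kind of branch a linear spec tends to miss, here is a minimal sketch of an idempotency check sitting between "payment succeeds" and "order confirmed". The names (confirmOrder, the in-memory map) are hypothetical and the storage is deliberately simplified; a real system would persist the key.

```typescript
// Hypothetical sketch: names are illustrative and storage is simplified.

// Maps an idempotency key to the order it already confirmed.
const confirmedOrders = new Map<string, string>();

async function confirmOrder(orderId: string, idempotencyKey: string): Promise<string> {
  // Payment webhooks retry on timeouts and duplicate deliveries. The same
  // key must not confirm the order twice or trigger a second fulfilment.
  const alreadyConfirmed = confirmedOrders.get(idempotencyKey);
  if (alreadyConfirmed !== undefined) {
    return alreadyConfirmed; // Replay: return the original result, do nothing new.
  }

  confirmedOrders.set(idempotencyKey, orderId);
  // ...mark the order confirmed, send the receipt, start fulfilment.
  return orderId;
}
```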

Level 3: Visual Specs (Minimum Cognitive Debt)

Design specifications visually — flows, states, transitions on a canvas — then generate structured, AI-ready output from that visual design. The canvas makes gaps visible: missing error states, undefined transitions, conflicting paths. You see the system before you build it.

Visual specs don't just reduce cognitive debt at the time of creation. They reduce it continuously — because a visual spec is easier to maintain, easier to onboard from, and easier to audit than a text document. A new developer can look at a flow and answer "what happens when payment fails?" at a glance; finding the same answer in a markdown file means reading three paragraphs of text.

The adversarial layer takes this further: a tool that actively probes your spec for weaknesses before you build from it. "You've handled the success path. What about when the payment gateway times out? What about when the user navigates back? What about when the session expires mid-transaction?" This is the difference between a spec that describes your intent and a spec that survives reality.

The Fix: Making Architecture Decisions Explicit

The cognitive debt test isn't just a diagnostic. It's a roadmap for what to fix. Each question corresponds to a specific remedy:

For Question 1 (Why were decisions made?): Architecture Decision Records

One document per significant decision. Not just the decision — the context, the alternatives considered, and the consequences of reversing it. These don't need to be long. A paragraph per decision is often enough:

  • Decision: Use repository pattern for data access.
  • Context: Active record led to data access logic scattered across 40+ service files.
  • Consequence of reversal: Data access logic will re-scatter. The undo sprint took 5 developer-days.
  • Author: [Who made this decision and when]

Store them where the team can find them. Version them in git. When the AI suggests reintroducing active record, the ADR is what prevents it — not because you told it "don't use active record" in a rules file, but because it can read the ADR and understand why.

For Question 2 (AI vs human code): Provenance Markers

Start marking which code was AI-generated and which was human-written. Not in comments that rot — in your version control, through commit conventions or branch naming. feat/payment-flow-ai versus feat/payment-flow-manual. This isn't about blame. It's about knowing which code is most likely to have the characteristic AI failure modes: correct in isolation, disconnected from architecture, missing edge cases.

This is the controversial recommendation. Some teams resist marking AI-generated code because they see it as low-quality. That's not the point. AI-generated code isn't worse — it's differently flawed. Knowing which code was generated lets you predict which flaws to look for.
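One lightweight way to record provenance, assuming your team is willing to adopt a convention, is a commit trailer alongside the branch naming above. The trailer name below is a made-up team convention, not a git or Copilot standard:

```
feat(payments): add transaction validation ordering

Implements the validate -> inventory -> authorize sequence.

AI-Assisted: yes (Cursor; reviewed and edited by hand)
```

Because trailers live in history, git log --grep can later surface which areas of the codebase were touched mostly by AI-assisted commits, which is exactly the prediction this question is asking you to be able to make.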

For Question 3 (Bus factor): Spec-Based Knowledge Transfer

If the lead developer's departure would cripple the team, the architectural context exists only in their head. The fix: make it explicit in a specification that anyone can read. Not a Notion page that's been stale since January — a living document that's versioned alongside the code and updated when architecture changes.

For Questions 4 and 5 (Why documentation + onboarding): Living Specifications

A structured specification that covers:

  • What the system does (features, user flows)
  • How components connect (dependencies, data flows, what breaks when something changes)
  • What constraints must be maintained (business logic rules, edge cases, ordering requirements with rationale)
  • Why decisions were made (architecture rationale, alternatives considered, consequences of reversal)

This is the specification that makes cognitive debt visible — because it documents the "why" that AI-built codebases systematically lack. It's the map that tells you not just where you are, but why you're there and what happens if you move.

What This Looks Like in Practice

A team that passes the cognitive debt test:

  • A new developer joins. They read the specification. They understand the architecture. They can modify the payment flow on day three — because the spec explains why the validation ordering matters and what happens if you change it. They don't need to reverse-engineer this from the code.

  • The lead developer goes on holiday. The remaining team encounters a bug in the billing module. They look up the relevant Architecture Decision Record. They understand the constraint. They fix it in hours, not days.

  • The AI suggests reintroducing active record. The developer reads the ADR. They reject the suggestion, not because a rules file told them to, but because they understand the consequences.

  • A code review flags code that conflicts with an architectural constraint. The reviewer knows about the constraint because it's documented. Not because they happened to remember it from a meeting six months ago.

A team that fails the cognitive debt test:

  • A new developer joins. They read the code. They don't understand the architecture. They modify the payment flow and break the fraud detection nobody told them about. The incident takes 3 days to resolve.

  • The lead developer goes on holiday. The remaining team encounters a bug. They spend two days debugging — one day to understand the system, one day to fix it. The lead developer comes back and says "oh, the fraud detection middleware runs before the payment processor, didn't you know?"

  • The AI suggests reintroducing active record. The developer accepts the suggestion because it looks reasonable. They spend a week undoing the consequences.

  • A code review approves code that violates an architectural constraint. Nobody catches it because the constraint exists only in the lead developer's memory and a Notion page that hasn't been updated since January.

The difference isn't the team. It isn't the AI. It isn't the code quality. It's whether the understanding is explicit — documented, versioned, and available — or implicit, scattered across people's heads and stale wikis.

The Debt You Can't Refactor Away

Here's the uncomfortable truth: you can't refactor cognitive debt. You can refactor technical debt — improve the code, remove shortcuts, add tests. But cognitive debt isn't a property of the code. It's a property of the team's understanding of the code. You don't fix it by rewriting the function. You fix it by making the reasoning explicit — documenting the "why" that the AI didn't capture and the developer didn't write down.

The test is simple. Five questions. If your team can answer them, you're building on understanding. If they can't, you're building on assumptions — and AI is very, very good at making assumptions look like working code.

Don't build what you can't explain. Not because it might break — because when it does, you won't know why. And in production, not knowing why is the most expensive bug there is.


4ge is a context engineering platform — a visual workspace where architecture decisions, business logic constraints, and edge cases are first-class citizens, not afterthoughts. See how 4ge makes cognitive debt visible before it becomes expensive →

Related: Cognitive Debt: The Hidden Cost of AI-Generated Codebases · The Complete Guide to Context Engineering
