The Question You Can't Answer About Your Own Product
A founder I know walked into a Series A meeting with a working product, growing revenue, and a codebase he couldn't explain. The investor asked a simple question: "Why does your payment module validate orders before checking inventory?" And the founder — who'd built the product, demo'd it a hundred times, lived inside it every day — couldn't answer.
He'd used AI for roughly 80% of the code. Cursor for the initial build. Copilot for the integrations. Claude for the architecture advice that shaped the payment flow. The code worked. The tests passed. Revenue was growing. But why the validation ran in that specific order — before inventory, not after — was something he'd never actually decided. The AI had suggested it, it looked reasonable, he'd accepted it, and moved on.
When he asked his team afterwards, he got three different answers. One developer said "it's a business rule — validate before you reserve." Another said "I think it was for performance — validation is faster than inventory lookup." The third said "I have no idea, it was like that when I joined." None of them were wrong about what they believed. None of them were right about why it actually worked that way. And the real answer — that the AI had generated it based on training data patterns, not based on any decision their team had made — was the most unsettling answer of all.
The investor passed. Not because the code was bad. Because of the silence.
Much of the code in AI-assisted projects is code the original author cannot fully explain when asked three months later — not because they forgot, but because they never understood it in the first place.
The Speed-Understanding Gap
Here's the thing nobody puts on the pitch deck: AI can build faster than humans can understand. That's not a criticism of AI or of humans — it's a structural fact about how generative AI works.
When you write code yourself, you build a mental model as you go. Every keystroke is a decision — even the ones you'd rather not think about. You chose that data structure for a reason. You named that function for a reason. You put the validation there, not three lines lower, for a reason. The reason might be "I was tired and it was 11pm," but it's your reason, and when something breaks at that exact line six months later, you'll remember — or at least be able to reconstruct — why you made that choice.
When AI generates the same code, it makes the same choices. But the "reasons" are statistical, not intentional. The AI placed validation before inventory lookup because, in its training data, that ordering is more common. Not because your team discussed it. Not because there was a specific incident that made it necessary. Because the tokens lined up that way.
The result: code that works, built at a speed that would have been impossible two years ago, and a growing gap between what your system does and what anyone on your team can explain about why it does it.
The cognitive debt article introduced this concept: the gap between what your system does and what your team actually understands about what it does. The Anthropic study found AI-assisted developers scored 17 points lower on comprehension tests than their non-AI peers. Not because they were less skilled — because the AI did the understanding for them, and the understanding didn't transfer.
This article is about what happens when that gap gets wide enough that it starts making decisions for you.
Three Levels of Understanding
There are three levels of understanding a codebase. Most AI-built codebases only achieve the first one.
Level 1: What It Does
Can you describe what the system does? The user-facing features, the data flows, the inputs and outputs. "Users can register, browse products, add to cart, and checkout using Stripe." Level 1 understanding is functional — it describes the system's behaviour.
Most developers can achieve Level 1 for most of their codebase, even AI-generated code. You can read the code. Run the tests. Trace the paths from HTTP request to database response. You know what happens.
But "what it does" isn't enough. Knowing that the payment module validates orders before checking inventory is Level 1 understanding. It tells you the behaviour. It doesn't tell you anything about why, or what happens when the behaviour changes.
Level 2: How It Works
Can you explain the mechanism? Not just "orders are validated before inventory is checked" but "the OrderValidator class runs a synchronous check against the orders table for matching line items, then the InventoryService reserves stock using an optimistic lock with a 30-second TTL, and if the reservation fails it triggers a StockUnavailableEvent that the CartService listens for to restore the cart state."
Level 2 is the implementation level. You can trace the code path. You can explain which functions call which other functions. You can identify the failure points — the optimistic lock, the event listener, the 30-second TTL.
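To make the distinction concrete, here is a minimal TypeScript sketch of that path. The class and event names (OrderValidator, InventoryService, StockUnavailableEvent) come from the example above; the signatures, the placeholder reservation logic, and the event wiring are assumptions for illustration, not the real implementation.

```typescript
// A minimal sketch of the path described above. Names come from the example;
// internals are stand-ins, not a real payment module.

interface LineItem { sku: string; quantity: number }
interface Order { id: string; items: LineItem[] }

class OrderValidator {
  // Stand-in for the synchronous check against the orders table.
  validate(order: Order): boolean {
    return order.items.length > 0 && order.items.every((i) => i.quantity > 0);
  }
}

class StockUnavailableEvent {
  constructor(public readonly orderId: string) {}
}

class InventoryService {
  // Stand-in for the optimistic-lock reservation with a 30-second TTL.
  reserve(order: Order): boolean {
    return order.items.every((i) => i.quantity <= 100); // placeholder stock rule
  }
}

// Level 2 understanding means being able to trace this path end to end:
function checkout(order: Order, emit: (e: StockUnavailableEvent) => void): boolean {
  if (!new OrderValidator().validate(order)) return false; // validation runs first
  if (!new InventoryService().reserve(order)) {
    // CartService listens for this event and restores the cart state.
    emit(new StockUnavailableEvent(order.id));
    return false;
  }
  return true;
}
```

Everything in that sketch is recoverable by reading the code. What isn't recoverable is why validation runs first. That question belongs to the next level.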
Most developers achieve Level 2 for the parts of the codebase they've worked on recently. For AI-generated code, Level 2 requires reading the code carefully — because the AI didn't explain its choices while generating them. You have to reconstruct the reasoning by reading the output.
The problem: Level 2 degrades fast. You might understand how the payment module works today. In three months, after you've built six other features and the payment module has been modified by three other AI sessions, you won't. The mental model expires.
Level 3: Why It's That Way
Can you explain the architectural rationale? Not what the system does (Level 1) or how the code implements it (Level 2) — but why the design is the way it is. Why validation before inventory, specifically. What happens if you reverse the ordering. What the team considered and rejected. What incident or insight led to this specific design.
"We validate before reserving inventory because we had an incident where inventory was reserved for an invalid order, the customer was charged for something that didn't pass validation, and it took three days of manual refunds to sort out."
That's Level 3. It's the most valuable level — and it's the one that's almost entirely missing from AI-built codebases. Because the AI doesn't make decisions with rationale. It makes choices with probability. And the developer who accepted the AI's suggestion often didn't document the reasoning, because they didn't deeply understand the tradeoffs at the time — or because the reasoning was "the AI suggested it and it worked, so I moved on."
AI-assisted developers scored 17 points lower on comprehension tests than their non-AI peers (50% vs 67%). The steepest declines were in debugging — the skill that depends most on understanding *why* the code is structured the way it is.
What You Lose When Understanding Depreciates
Understanding isn't a static asset. It depreciates. And when it depreciates past a certain point, you lose capabilities that don't come back without significant investment.
You lose the ability to maintain
Maintenance isn't just fixing bugs. It's making small changes to existing code without breaking things. And making small changes safely requires understanding the system well enough to predict side effects.
When you lack Level 3 understanding, maintenance becomes archaeology. You don't modify the code — you excavate it. Read the function, trace its dependencies, figure out what other functions call it, reconstruct the design intent, and then make your change hoping you didn't miss anything. It's like editing a novel written by someone else, in a language you only sort of speak.
I've watched developers spend a full day making a change that should have taken an hour — not because the change was complex, but because understanding the surrounding code well enough to make it safely took most of the day. That's the maintenance tax. And it compounds: every change made without full understanding adds more code nobody fully understands.
You lose the ability to onboard
New developer joins the team. You show them the codebase. They can read the code (Level 1). They can trace the execution paths (Level 2, with effort). But when they ask "why is the payment module structured this way?" — you can't answer. The answer isn't in the code. It isn't in a document. It exists only in the fading memory of whoever was in the room when the AI generated it.
The cognitive debt test asks: when you onboard a new developer, how long until they're productive? In a codebase where Level 3 understanding is documented, onboarding takes maybe 2-3 days. In a codebase where that understanding exists only in someone's head — or doesn't exist at all — onboarding takes weeks. The new developer learns by breaking things and fixing them. That's the most expensive learning method there is.
You lose the ability to debug effectively
Production incident. The payment module is declining valid cards. You open the code. You trace the execution path. You find the function where the card is declined. You understand what it does (Level 1). You understand how it works (Level 2, after some reading). But you don't understand why it's checking for isRetryAllowed before attempting a second charge — because that check exists because of a specific incident three months ago that you didn't know about, involving a payment provider that was double-charging customers when retries happened too quickly.
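Had the reasoning been captured where the guard lives, the code would explain itself. Here is a hypothetical reconstruction in TypeScript: the isRetryAllowed name is from the story above, while the cooldown value and the surrounding structure are guesses, shown only to illustrate where the Level 3 comment belongs.

```typescript
// Hypothetical reconstruction. The isRetryAllowed name comes from the story
// above; the cooldown value and the surrounding structure are assumptions.

const RETRY_COOLDOWN_MS = 5_000;

function isRetryAllowed(firstAttemptAt: number, now = Date.now()): boolean {
  // WHY (Level 3): the payment provider double-charged customers when a retry
  // fired too soon after a failed attempt. See the incident write-up.
  return now - firstAttemptAt >= RETRY_COOLDOWN_MS;
}

function chargeWithRetry(charge: () => boolean): boolean {
  const firstAttemptAt = Date.now();
  if (charge()) return true;
  // Without the comment above, this guard reads as arbitrary caution.
  if (!isRetryAllowed(firstAttemptAt)) return false;
  return charge();
}
```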
Level 3 understanding is what turns a 2-day debugging session into a 3-hour one. When you know why the code is the way it is, you can find the root cause quickly. When you don't, you reconstruct the reasoning from scratch — and you'll get it wrong, because the reasoning exists outside the code, in the invisible history of decisions and incidents that shaped it.
You lose the ability to iterate
Product asks for a change to the payment flow. Simple: "add support for a new payment provider." In a codebase where you have Level 3 understanding, you can estimate the work accurately because you know what depends on what and what constraints must be maintained. In a codebase where you lack Level 3 understanding, every estimate is a guess — because you don't know what the new provider will break until you try it.
This is why AI-native teams find their velocity declining over time despite using the same tools. The AI generates code just as fast. But the time spent understanding what to generate, and where, and how to make it fit without breaking things — that time increases with every feature nobody fully understands. CodeRabbit's 2026 analysis found review times up 91% and incidents per pull request rising 23.5% for teams using AI coding assistants. The throughput went up. The comprehension didn't come along for the ride.
The Codebase Archaeology Problem
Here's what debugging an AI-built codebase actually feels like. You're not a developer. You're an archaeologist.
You find an artifact — a function, a class, a module. It's well-constructed. Clean lines, consistent materials. Whoever made it was skilled. But you don't know who made it, or when, or why. You don't know what problem they were solving. You don't know what alternatives they considered and rejected. You don't know what environmental factors shaped the design. You can infer some of this from the artifact itself — patterns that suggest a particular approach, naming conventions that hint at a particular philosophy. But the reasoning is lost. The builder is gone. The artifact remains.
This is the experience of maintaining a codebase where Level 3 understanding was never documented. You reverse-engineer intentions from implementation. You guess at reasons from patterns. Sometimes you're right. Often you're wrong — and the wrongness only becomes apparent when your "fix" introduces a new bug because you didn't understand the constraint that shaped the original design.
The archaeological metaphor is instructive because it highlights something important: archaeologists don't blame the builders for not leaving documentation. They accept that information has been lost and work with what they have. But your codebase isn't ancient ruins. The builders are still around. The context is still recoverable — if you capture it at the time of creation, not months later when the reasoning has faded from everyone's memory. Which it does. Fast.
Making Understanding Explicit
The fix isn't reading more code. Reading code gives you Levels 1 and 2 — what it does and how it works. It doesn't give you Level 3. Level 3 understanding comes from documenting decisions at the time they're made, when the reasoning is fresh and the alternatives are still visible in your mind.
Architecture Decision Records
One document per significant decision. Not just what was decided — the context that led to it, the alternatives considered, and the consequences of reversing it. Short, specific, useful:
Decision: Validate orders before checking inventory.
Context: Incident on 2026-03-15 — inventory was reserved for an invalid order. Customer charged for unvalidated order. Three days of manual refunds.
Alternative considered: Validate after inventory reservation (better UX, slower failure feedback).
Consequence of reversal: Same incident recurrence expected. Estimated impact: $2K/day in manual refunds during peak.
Author: Sarah Chen, Sprint 14.
That ADR takes maybe 2 minutes to write. And it eliminates the 3-day debugging session that happens when someone doesn't understand why the validation ordering matters. 2 minutes now, or 3 days later. That's the math.
Living Specifications
A structured document that describes what your system does, how components connect, what constraints must be maintained, and why. Not a 50-page PRD — a compressed, task-level specification that captures Level 3 understanding in a form the whole team (and your AI assistant) can reference.
The spec is versioned alongside your code. Updated when architecture changes. When a PR changes the system, it should also update the spec. This makes the spec a living document — not a historical artifact that describes what you built six months ago and has been wrong ever since.
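One way to enforce that habit is to fail CI when source changes land without a spec update. A minimal sketch, assuming a hypothetical layout where the spec lives at docs/spec.md, source lives under src/, and there's a main branch to diff against; the script and paths are illustrative, not a prescribed tool.

```typescript
// check-spec.ts: fail the build when src/ changes but the spec does not.
// Paths and branch name are assumptions; adapt to your repo layout.
import { execSync } from "node:child_process";

const changed = execSync("git diff --name-only origin/main...HEAD", {
  encoding: "utf8",
})
  .split("\n")
  .filter(Boolean);

const touchesSource = changed.some((file) => file.startsWith("src/"));
const touchesSpec = changed.includes("docs/spec.md");

if (touchesSource && !touchesSpec) {
  console.error(
    "Source changed but docs/spec.md did not. Update the spec or explain why in the PR."
  );
  process.exit(1);
}
```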
Visual Blueprints
Some understanding is spatial — it lives in the relationships between components, the flows between states, the paths through your system. Text documents describe these relationships. Visual blueprints show them.
When you can see the user flow from checkout to payment to order confirmation — with the branching error states, the retry logic, the failure paths — the gaps in your understanding become visible. You can literally see where the spec doesn't cover what happens when the payment gateway is down. That's spatial understanding, and it's what a visual canvas gives you that a Markdown file can't.
The Fix Isn't Reading More Code
This is the trap I see most teams fall into: they discover the understanding gap and try to close it by reading code. Code reviews. Documentation sprints. "Let's schedule an architecture walkthrough." Feels productive. It isn't.
Reading code gives you Levels 1 and 2. It tells you what the system does and how it works. It cannot reconstruct Level 3 — because Level 3 isn't in the code. The reason the validation runs before the inventory check isn't encoded in the function. It's in the incident that made it necessary, the discussion that led to the decision, and the tradeoffs that were considered. Those exist in your team's memory — which degrades — or they don't exist at all.
The fix is documenting decisions before generating code, not reading code after the fact to reconstruct decisions that were never documented. When you spec before you code, the Level 3 understanding gets captured at the point of creation — when the reasoning is fresh, the alternatives are visible, and the context is complete. When you code first and document later (or never), you're doing archaeology on your own decisions.
This is the core argument for context engineering: the variable isn't model quality or prompt phrasing. It's the quality of the input — and input that includes Level 3 understanding produces meaningfully better AI output than input that only includes Levels 1 and 2.
The AI that receives a spec that says "validate orders before checking inventory because of the March 15 incident" generates code that respects that constraint and explains it in comments. The AI that receives "add payment processing" generates code that happens to put validation somewhere — and you won't know where until you read the output, and you won't know why until something breaks.
The Founder's Uncomfortable Question
The founder in the Series A meeting couldn't answer the investor's question because he'd never answered it for himself. He'd accepted the AI's output without understanding the decision it embedded. And when someone with fiduciary responsibility asked him to justify that decision, the silence was deafening.
But here's the thing: that investor was doing what every stakeholder eventually does — asking why. Not what. Not how. Why. Why this approach? Why this ordering? Why this pattern? These are the questions that separate a product you can maintain and scale from a product that works today and mystifies you tomorrow.
If you're building with AI — and at this point, most of us are — the question isn't whether you'll hit the understanding gap. You will. The question is whether you'll have the answer when someone asks.
Most AI-built codebases don't. The code works. The tests pass. The demo is flawless. And if you ask anyone on the team to explain why the payment module is structured the way it is, you'll get three different answers and a lot of silence.
That silence isn't a knowledge gap. It's a decision gap. And the only way to close it is to document the decision when you make it — before the AI generates the code, before the reasoning fades, before the question comes from someone whose answer actually matters.
4ge is a context engineering platform — a visual workspace where architectural decisions, business logic constraints, and edge cases are first-class citizens, documented at the point of creation. See how 4ge makes understanding explicit before it depreciates →
Related: Cognitive Debt: The Hidden Cost of AI-Generated Codebases · Don't Build What You Can't Explain: The Cognitive Debt Test · The Complete Guide to Context Engineering