AI-Native Development

Context Windows Explained: Why Specs Outlive Sessions

Your AI coding assistant has a 200K token context window. Sounds huge — until you account for system prompts, rules files, codebase indexing, conversation history, and the fact that models forget everything in the middle. Here's what's actually happening and why structured specs are the compression layer that makes context windows work.

4ge Team

The Number That Sounds Huge (And Isn't)

Two hundred thousand tokens. That's the standard context window for a modern AI coding assistant. Roughly 150,000 words. Five hundred pages of text. Sounds like more than you'd ever need, right?

Here's what actually fits in a 200K context window during a real coding session:

  • System prompt and safety instructions: ~2,000 tokens. Not optional — the tool injects these automatically.
  • Your .cursorrules file or project instructions: 500–5,000 tokens. If you've been thorough about documenting your tech stack and naming conventions, it's on the higher end.
  • Codebase index or @-mentioned files: 10,000–80,000 tokens. A single large file can consume 5,000 tokens. Reference ten files in a refactoring task? That's 50,000 tokens gone before you've started the actual work.
  • Conversation history: Compounding. Every message you send and receive lives in the context window. Twenty turns of "add this, fix that, what about this edge case?" easily hits 40,000 tokens.
  • Tool call results, error logs, search outputs: 5,000–30,000 tokens. An agent that runs a terminal command and gets back a stack trace? That's 2,000–5,000 tokens. Do that five times in a session and you've burned 25,000 tokens on error output alone.
  • Model reasoning tokens (thinking blocks): 5,000–20,000 tokens. Modern reasoning models generate internal thought chains before responding. Those thoughts occupy context. You don't see them, but they consume the window.
  • The actual output you want: Your code, your answers, your suggestions. This is why you opened the tool. It gets whatever's left.

Add it up. A typical deep coding session — not an extreme one, just a normal afternoon of iterating on a feature — consumes 100,000–150,000 tokens. On a 200K window. Which leaves 50,000–100,000 tokens for... everything else. Including the output you came for.
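
To make the arithmetic concrete, here's a back-of-the-envelope tally in TypeScript. The figures are illustrative mid-range values taken from the list above, not measurements from any particular tool.

```typescript
// Back-of-the-envelope context budget for one afternoon session.
// Figures are illustrative mid-range values, not measurements.
const WINDOW = 200_000;

const budget: Record<string, number> = {
  systemPrompt: 2_000,
  rulesFile: 3_000,            // .cursorrules / project instructions
  referencedFiles: 45_000,     // ~10 files at ~4,500 tokens each
  conversationHistory: 40_000, // ~20 turns of back-and-forth
  toolOutput: 20_000,          // stack traces, terminal output, search results
  reasoningTokens: 10_000,     // the model's hidden thinking blocks
};

const consumed = Object.values(budget).reduce((sum, t) => sum + t, 0);
console.log(`consumed: ${consumed}`);           // 120,000 tokens of infrastructure and history
console.log(`remaining: ${WINDOW - consumed}`); // 80,000 tokens for everything else
```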

And then there's the efficiency problem. That 200K window? You can't use all of it reliably. We'll get to that.

30–50%

Of a typical context window consumed by infrastructure — system prompts, tool definitions, rules files, and model reasoning — before you've written a single word of your actual task.

What Is a Context Window, Technically

If you're already familiar with how transformer models work, skip ahead. But for the developers who've been using AI coding tools without understanding the mechanism — which is most developers — here's the two-minute version.

A context window is the maximum number of tokens a language model can process in a single forward pass. Everything the model "knows" in a given turn — your instructions, the conversation history, referenced files, the model's own previous outputs — must fit within this window. When you send a message, the model doesn't query a database or search the internet. It reads the tokens currently in its context window and generates the next token. That's it. There's no external memory, no persistent storage, no background process maintaining state across sessions.

Think of it like working memory. You sit down at a desk. You can spread out papers, refer to notes, and look at what you've written so far. But the desk is finite. When it fills up, something has to come off. In an AI coding assistant, that "something" is earlier conversation turns — the architectural decisions, the explicit constraints, the "use Postgres not SQLite" instructions you established an hour ago. The model doesn't warn you that it's dropped them. It just starts responding as if they never existed.

A token is roughly three-quarters of a word — "context window" is three tokens, "specification" is one. Tokenisation varies by model and by word (common words tokenise more efficiently than rare ones), but the ¾-word rule is close enough for reasoning about capacity.
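
If you want a rough capacity check without pulling in a real tokenizer, the ¾-word rule translates directly into a heuristic. This is a sketch for ballpark estimates, not an exact count:

```typescript
// Rough token estimate from the ¾-word rule: 1 word ≈ 1.33 tokens.
// A heuristic only; real counts depend on the model's tokenizer.
function estimateTokens(text: string): number {
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  return Math.ceil(words / 0.75);
}

function fitsInWindow(text: string, windowSize = 200_000, reserve = 50_000): boolean {
  // Leave headroom for the system prompt, rules file, history, and the model's output.
  return estimateTokens(text) <= windowSize - reserve;
}

console.log(estimateTokens("context window"));     // 3
console.log(fitsInWindow("Add Stripe checkout"));  // true
```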

Current Context Window Sizes

Here's the landscape as of mid-2026. These numbers change — models get updated, new versions ship with different windows. But the structural constraints these windows create are constant, regardless of the specific numbers.

| Model | Context Window | Approximate Word Equivalent | Notes |
| --- | --- | --- | --- |
| GPT-5.5 Instant | 128K tokens | ~96,000 words | Fast model, smaller window. Sufficient for most conversations, tight for large codebases. |
| Claude Opus 4.7 | 200K tokens | ~150,000 words | Current standard. Covers most coding sessions but fills up during extended refactors. |
| Gemini 3.1 Pro | 1M+ tokens | ~750,000+ words | Largest available window. Sounds infinite — it isn't, for reasons we'll cover. |

The progression is real. Context windows have grown from 4K tokens in early 2023 to 1M+ tokens in 2026. A 250x increase in three years. But here's the thing about bigger windows: they help, and they don't solve the problem. Both of these statements are true simultaneously.

250x

Increase in standard context window sizes from early 2023 (4K tokens) to mid-2026 (1M+ tokens). The growth is real — and it hasn't solved the context loss problem.

How Context Gets Consumed (The Budget You Don't See)

Let me make the consumption problem concrete with a real scenario. You're working on a payment processing feature in a Next.js app with Cursor.

Turn 1: The initial prompt. You type: "Add Stripe checkout to our Next.js app. We use the App Router, Prisma, and our existing apiClient utility for all API calls." The model reads your .cursorrules file (2,000 tokens), processes your prompt (50 tokens), and generates a response with code (3,000 tokens). Running total: ~5,000 tokens. Feels great — plenty of room.

Turn 2: Reference existing code. You @-mention three files: your existing Stripe config (stripe.ts, 1,500 tokens), the billing module (billing.ts, 4,000 tokens), and the order model schema (schema.prisma, 2,000 tokens). The model processes all three files plus your prompt plus the previous conversation. Running total: ~16,000 tokens. Still fine.

Turns 3–5: Iterate on the implementation. You spot a naming issue, ask for a fix, notice a missing error handler, ask for that. Each exchange adds ~3,000 tokens. Running total: ~25,000 tokens. Half an hour in, not even close to the limit.

Turns 6–10: Extended refactoring. The feature actually requires changes across seven files, not three. You keep referencing files, the model keeps generating code, you keep iterating. Each turn adds 5,000–8,000 tokens because the context is compounding — the model's response includes references to earlier code it generated, which you then ask it to modify. Running total: ~60,000 tokens. Getting warmer.

Turns 11–15: Deep debugging. The generated code has a subtle bug. You paste the error output (2,000 tokens). The model reads it, generates a fix, you test, and hit a different error (3,000 tokens of stack trace). After five debugging exchanges, you've added ~25,000 tokens of error output and responses. Running total: ~85,000 tokens.

Turns 16–20: More features in the same session. You didn't stop at the payment feature. Now you're adding a webhook handler, then an admin dashboard for reviewing failed payments. Each new feature references the previous work. The conversation is 20 turns deep. Running total: ~130,000 tokens. You're in the danger zone.

Turn 21: You notice the AI has "forgotten" something. The webhook handler doesn't call validateOrder — the constraint you established in turn 3. The AI doesn't remember turn 3 anymore. The context window filled up. The model silently dropped earlier conversation history to make room for the new work. You didn't get a warning. The AI just started generating code that violated a constraint it "knew" about forty minutes ago.

This is a real session. Not an extreme case — a normal Tuesday afternoon. And it explains why context loss feels like betrayal: the AI was working perfectly, and then it wasn't, and there was no error message, no warning, no indication that something had changed. The quality just silently degraded.
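
What happened at turn 21 is easier to see in code. The sketch below is a deliberately naive model of history pruning; real assistants prune more cleverly (summarisation, relevance scoring), but the failure mode is the same: the oldest turns go first, and nothing tells you.

```typescript
// Simplified model of what happens when conversation history outgrows its budget.
// Real tools prune more cleverly, but the failure mode is the same: old turns vanish silently.
interface Turn { role: "user" | "assistant"; tokens: number; text: string }

function fitHistory(turns: Turn[], budget: number): Turn[] {
  const kept: Turn[] = [];
  let used = 0;
  // Walk backwards from the most recent turn and keep whatever still fits.
  for (let i = turns.length - 1; i >= 0; i--) {
    if (used + turns[i].tokens > budget) break; // turn 3's "always call validateOrder" lives up here and gets cut
    kept.unshift(turns[i]);
    used += turns[i].tokens;
  }
  return kept; // no warning, no error; the early turns are simply gone
}
```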

The Lost-in-the-Middle Problem

Here's the finding that makes context windows trickier than their raw size suggests: models are bad at using information in the middle of a long context.

The original research (Liu et al., "Lost in the Middle: How Language Models Use Long Contexts") demonstrated that language models exhibit a U-shaped performance curve. Information at the beginning of a context window (the system prompt, the first few instructions) and information at the end (the most recent messages) gets high recall. Information in the middle — the architectural decisions from forty minutes ago, the constraint you established in turn 3, the edge case someone mentioned in a code review comment — gets deprioritised, misremembered, or simply ignored.

The efficiency numbers tell the story: some models achieve near-perfect efficiency across their context window (98%), reliably using information from beginning to end. Others, despite having larger advertised windows, achieve only ~64% functional efficiency — meaning roughly a third of the context you think you're using is effectively invisible to the model.

This isn't a minor quirk. It's a structural constraint that changes how you should think about context management:

  • More context ≠ better context. The more tokens you stuff into the window, the more likely the important information is buried in the middle where the model can't find it.
  • Recency bias is real. The model prioritises recent tokens. Your tech stack rules from the start of the session? Less weight than the error log you just pasted. Even if the rules are more important for the task.
  • There's no warning system. The model doesn't say "I've lost track of the constraint about Postgres." It just generates code that ignores it. You discover the violation later — in code review if you're lucky, in production if you're not.

64%

Functional efficiency of some models across their full context window — meaning roughly a third of the context you think is being used is effectively invisible to the model. Other models achieve 98% efficiency on smaller windows.

Why Bigger Windows Don't Solve the Problem

"So context windows are getting bigger. Doesn't that fix it?" — This is the most common response I hear from developers who haven't hit the wall yet.

No. And understanding why matters.

1. Bigger windows amplify the lost-in-the-middle problem. If a model manages 64% functional efficiency on a 200K window, don't expect it to hold 64% on 1M tokens. The middle of a 1M-token context is a vast expanse the model will semi-ignore. The U-shaped performance curve gets more extreme as the window grows — more absolute tokens the model can use, but a smaller fraction of them that it can actually leverage.

2. More context means more noise. Every token you add to the context window competes for the model's attention. (Attention is literally the mechanism — the model's attention heads weight different parts of the input differently.) The more tokens you include, the more the model has to decide what to focus on. And models don't always decide correctly. Your critical architecture constraint gets the same attention weight as the formatting rule you added as an afterthought — unless the model explicitly decides the constraint is more important, which it can't always do.

3. Cost scales linearly (or worse). Processing 1M tokens costs more than processing 200K tokens. At Anthropic's pricing, writing 1 million tokens to cache costs $3.75, while reading from that cache costs $0.30 per million tokens. (Caching helps, but only for stable, repeated context.) If you're working with 500K tokens of repository context and reading it three times during an iterative session, that's real money — and teams that are careless about context consumption burn through their AI budgets fast. (There's a worked example of this arithmetic after the list.)

4. The session still ends. This is the fundamental issue. Whatever the window size — 200K, 1M, 10M — when you close your laptop, the context evaporates. Tomorrow morning, you open a fresh session with an empty window. All that context you carefully assembled? Gone. The bigger window just means you accumulated more context before losing it.
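
Here's the worked example for point 3: 500K tokens of repository context, read three times in one iterative session. The cache prices are the ones quoted above; the uncached input price of $3.00 per million tokens is an assumption for a Sonnet-class model, not a figure from this article.

```typescript
// Worked cost example for point 3: 500K tokens of repository context, read three times.
// Cache pricing from the article: $3.75/M to write, $0.30/M to read.
// The uncached input price of $3.00/M is an assumed Sonnet-class figure, not from the article.
const MTOK = 1_000_000;
const contextTokens = 500_000;

const uncached = 3 * (contextTokens / MTOK) * 3.0;   // $4.50: pay full input price on every read
const cached =
  (contextTokens / MTOK) * 3.75 +                    // one cache write: $1.875
  2 * (contextTokens / MTOK) * 0.30;                 // two cache reads: $0.30

console.log({ uncached, cached });                   // ≈ $4.50 vs ≈ $2.18, per session, per developer; it compounds
```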

Bigger is better than smaller. But "bigger" and "solved" aren't the same thing. The context loss problem isn't a capacity problem — it's a persistence problem. No amount of window expansion fixes that.

Specs as Compression: The 2,000-Token Blueprint

Here's the practical insight that makes context windows manageable: the structure of your context matters more than the quantity.

The Model Context Protocol research documented something remarkable. An agent needed to download a two-hour meeting transcript and attach it to a Salesforce lead. The traditional approach — passing the entire transcript through the LLM context window — would consume over 150,000 tokens. Using Code Execution (keeping intermediate data in a runtime environment instead of passing it through the context), the transcript stays outside the window. Total token consumption: approximately 2,000 tokens. A 98.7% reduction.

That's not a rounding error. That's an architectural insight. 2,000 tokens of structured, compressed context producing better output than 150,000 tokens of raw data. The 2,000 tokens are pure signal. The 150,000 are signal plus noise — and the model has to sort through the noise to find the signal. (And remember, it's not great at sorting through the middle.)
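
The same pattern applies anywhere you can preprocess data before it reaches the model: do the bulk work in the runtime and let only the distilled result cross into the context window. The sketch below is hypothetical (the function name and transcript format are invented), but it shows which side of the boundary the 150,000 tokens stay on.

```typescript
// The "keep it in the runtime" pattern, sketched. Names are hypothetical;
// what matters is what crosses the context boundary and what stays outside it.

// Anti-pattern: the whole transcript goes through the model's context window.
// ~150,000 tokens consumed before the model does anything useful.
// await llm.complete(`Attach this to the Salesforce lead:\n${fullTranscript}`);

// Pattern: process the transcript in code, send only what the model needs.
function extractLeadSummary(transcript: string): string {
  // Deterministic preprocessing: pull out action items and next steps.
  const lines = transcript.split("\n");
  const actionItems = lines.filter((line) => /action item|next step/i.test(line));
  return actionItems.slice(0, 20).join("\n"); // a few hundred tokens, not 150K
}

// Only the distilled summary enters the window (~2,000 tokens including instructions).
// await llm.complete(`Attach this summary to the Salesforce lead:\n${extractLeadSummary(fullTranscript)}`);
```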

A structured specification is the compression layer for AI coding. Instead of dumping your entire codebase into the context window and hoping the model finds the relevant bits, you give it exactly what it needs:

Without compression (vibe coding): You prompt Cursor: "Add Stripe billing." The model needs to infer your tech stack, your existing Stripe config, where to put the webhook handler, what error handling pattern you use, and what the billing module looks like. It either asks you (adding conversation turns and consuming tokens) or guesses (adding wrong code and consuming your time).

With compression (spec-driven): You give it an atomic, file-specific task: "In src/billing/stripe.ts, add a createCheckoutSession function that calls our existing validateOrder middleware (import from src/middleware/validateOrder.ts), then creates a Stripe checkout session using our existing STRIPE_SECRET_KEY config (in src/config/stripe.ts)." Two hundred tokens. Zero ambiguity. Zero architectural delegation. Code that fits your codebase on the first attempt.

The math: 200 tokens of structured context beats 50,000 tokens of raw codebase the model has to parse and interpret. Not because the model is bad at interpretation, but because interpretation is where the lost-in-the-middle problem lives. More interpretation = more misinterpretation risk. Compression eliminates the step.
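
For reference, here's roughly the function that 200-token task pins down, using the file paths from the task itself. Treat the Stripe call details and the shape of the validated order as illustrative; the point is that nothing architectural is left for the model to guess.

```typescript
// src/billing/stripe.ts — the shape the 200-token task pins down.
// File paths come from the task above; the Stripe call details and the order shape are illustrative.
import Stripe from "stripe";
import { validateOrder } from "../middleware/validateOrder";
import { STRIPE_SECRET_KEY } from "../config/stripe";

const stripe = new Stripe(STRIPE_SECRET_KEY);

export async function createCheckoutSession(orderId: string) {
  const order = await validateOrder(orderId); // existing middleware, exactly as the task specifies
  return stripe.checkout.sessions.create({
    mode: "payment",
    line_items: order.items.map((item: { priceId: string; quantity: number }) => ({
      price: item.priceId,
      quantity: item.quantity,
    })),
    success_url: "https://example.com/checkout/success",
    cancel_url: "https://example.com/checkout/cancel",
  });
}
```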

98.7%

Reduction in context overhead achieved by keeping intermediate data in runtime rather than passing it through the LLM. Less context, better results — when the context is structured.

Practical Tips for Managing Context Windows

If you're hitting context limits regularly — and if you're doing extended AI coding sessions, you are — here are the patterns that work:

1. Budget your context like you budget your money

Track what's consuming tokens. System prompts, rules files, large file references, and conversation history are the four biggest consumers. Before you start a session, think about what you actually need in context. You don't need every file in the codebase — you need the files relevant to the current task.

2. Start new sessions for new tasks

This is the simplest and most effective tip. Don't try to do six features in one context window. Each feature is a fresh session with a fresh allocation. It feels slower — "but I was already in the flow!" — until the 20th turn when the AI forgets your architecture constraints and generates two hours of rework. New task, new window, fresh context.

3. Put the important stuff at the beginning or end

If you must have a long context (and sometimes you must), remember the U-shaped performance curve. Information at the beginning and end of the context gets the highest recall. Put your critical constraints — tech stack rules, architecture decisions, non-negotiable patterns — in your system prompt or rules file (beginning) or in your final instruction before code generation (end). Don't let them end up in the middle of a 40-turn conversation.
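
If you assemble prompts programmatically, the tip looks something like this. It's a generic sketch, not any particular tool's API:

```typescript
// Exploit the U-shaped curve: critical constraints go first (system prompt)
// and last (final instruction), never buried in the middle of the history.
interface Message { role: "system" | "user" | "assistant"; content: string }

const CONSTRAINTS = [
  "Use Postgres, never SQLite.",
  "All API calls go through our existing apiClient utility.",
  "Next.js App Router only.",
].join("\n");

function assemblePrompt(history: Message[], task: string): Message[] {
  return [
    { role: "system", content: CONSTRAINTS },   // beginning: high recall
    ...history,                                  // middle: lowest recall, nothing critical lives here
    { role: "user", content: `${task}\n\nConstraints (restated):\n${CONSTRAINTS}` }, // end: high recall
  ];
}
```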

4. Compress your context with structured specs

This is where context engineering becomes a practice, not just a theory. Instead of @-mentioning twelve files and hoping the model extracts the relevant bits from 60,000 tokens of raw code, write a 1,000-token specification that tells the AI exactly what it needs to know. File paths, imports, patterns, constraints. Everything the model needs, nothing it doesn't. Maximum signal per token.

5. Don't rely on the model to remember

If a constraint is important, put it in a persistent location — a rules file, a spec document, an architecture decision record — that the model reads at the start of every session. Don't rely on it being in conversation history — that's the first thing pruned when the window fills up.

6. Use caching strategically

If you're working with the same codebase context repeatedly (and you are), use tools that cache stable context. Anthropic's prompt caching reduces the cost of reading the same context from $3.75/M tokens (write) to $0.30/M tokens (read). This doesn't reduce token consumption, but it reduces the financial cost of repeatedly loading large, stable contexts like repository indexes and system prompts.
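
With Anthropic's SDK, marking stable context for caching looks roughly like this. It's a sketch based on the documented cache_control block; check the current docs for exact field names and model IDs.

```typescript
import { readFileSync } from "node:fs";
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Large, stable context: rules file, architecture notes, repo conventions.
const stableContext = readFileSync(".cursorrules", "utf8");

const response = await client.messages.create({
  model: "claude-3-5-sonnet-latest", // substitute whichever model you're actually using
  max_tokens: 2048,
  system: [
    {
      type: "text",
      text: stableContext,
      cache_control: { type: "ephemeral" }, // written to cache on the first call, read cheaply afterwards
    },
  ],
  messages: [{ role: "user", content: "Add a webhook handler for failed payments." }],
});

console.log(response.usage); // compare cache_creation_input_tokens vs cache_read_input_tokens
```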

Why Specs Outlive Sessions

Here's the deeper point — the one that all the context window math is building toward.

Your conversation with an AI coding assistant is ephemeral. It exists for the duration of the session. When the session ends, the context evaporates. Tomorrow you start from nothing — re-explaining the architecture, re-establishing constraints, re-specifying the tech stack. Every morning is Groundhog Day.

A specification (done right) is persistent. It survives the session. It's versioned alongside your code. It's available at the start of every session, from turn one. Your first session and your fortieth session get the same context — not because you're patient enough to re-explain every morning, but because the specification carries the context forward.

This is why specs matter more than context window size. A 200K window with a persistent, compressed specification that the AI reads at the start of every session outperforms a 1M window with no specification — because the 1M window starts empty every morning, and the 200K window starts with 2,000 tokens of pure signal that guide everything the model generates.

Context windows max out. Specifications don't. Context windows forget. Specifications remember. Context windows are working memory — specs are long-term memory. You need both, but only one survives the night.

Teams that figure this out — stop solving context loss by stuffing more into the window, start solving it by compressing intent into structured, persistent specs — will have AI tools that actually work across sessions. Teams that don't will keep re-explaining their projects every morning, wondering why yesterday's helpful AI is today's useless one.

It's not the window size. It's the context quality. And quality comes from structure, not volume.


4ge is a context engineering platform — a visual workspace that turns raw ideas into persistent, AI-ready specifications with your project's architecture, constraints, and edge cases baked in. Specs survive the session. See how 4ge makes context windows work for you →

Related: AI Coding Context Loss: Why Your Assistant Keeps Forgetting · The Complete Guide to Context Engineering

Ready to put these insights into practice?

Stop wrestling with prompts. Guide your AI assistant with precision using 4ge.

Get Early Access

Early access • Shape the product • First to forge with AI