What Is a Context Window?

A Guide for Data and AI Practitioners

What is a context window?

A context window is the maximum amount of text (measured in tokens) that a large language model can hold and reference while generating a response. It includes everything the model can “see” during a single interaction: the user’s prompt, system instructions, conversation history, retrieved documents, and the model’s own output.

Every interaction with a large language model is bounded by a single constraint: how much information the model can hold at once. For a chatbot answering a quick question, that constraint barely matters. For an AI agent executing a 30-step workflow across your data infrastructure, it’s the difference between reliable automation and confident hallucination.

That constraint is the context window. And as AI moves from experimentation into production, understanding how context windows work (and what it takes to fill them reliably) has become a core skill for data and AI practitioners.

Think of a context window as an LLM’s working memory. But the analogy has limits. Humans forget gradually; an LLM doesn’t degrade gracefully when its context window fills up. Information is either in the window or it isn’t. Once the window is full, earlier content gets truncated or dropped entirely—with no warning.

Here’s what actually occupies space inside a context window:

  • The user’s prompt or question: what was asked
  • System instructions: Hidden prompts that shape the model’s behavior (anywhere from a few dozen to several thousand tokens)
  • Conversation history: Every prior exchange in the session
  • Retrieved documents: Content pulled in via retrieval-augmented generation (RAG) or tool calls
  • Tool outputs: Results from API calls, database queries, or other agent actions
  • The model’s own response: which also counts against the limit

A context window is not the same as long-term memory. The model doesn’t retain information between sessions unless an external system provides it. It’s not training data either; the context window holds what the model can reference right now, not what it learned during training. And it’s not fine-tuning, which permanently adjusts a model’s behavior. The context window is temporary, session-scoped, and finite.

How do tokens work?

Context windows are measured in tokens, not words. A token is a chunk of text that might represent a full word, part of a word, or a single character — depending on the model’s tokenizer.

In English, one token corresponds to roughly three-quarters of a word. A 100,000-token context window can hold approximately 75,000 words, or about 250 pages of text. But this ratio isn’t fixed. Different models tokenize text differently, and some languages tokenize less efficiently than others.

Research from the Association for Computational Linguistics has found that tokenizers trained primarily on English can produce dramatically different token counts across languages. In some cases, the same sentence in a non-Latin-script language requires several times as many tokens as its English equivalent, despite having fewer characters.

A few practical rules of thumb:

  • One English word ≈ 1.3 tokens
  • One page of text ≈ 400–500 tokens
  • A 200,000-token window ≈ 150,000 words ≈ roughly 500 pages
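These rules of thumb are enough for a back-of-the-envelope budget check. The sketch below (plain Python, not a real tokenizer — exact counts require your model provider’s tokenizer) applies the ~1.3 tokens-per-word ratio:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~1.3 tokens-per-English-word rule of thumb."""
    return round(len(text.split()) * 1.3)

def fits_in_window(text: str, window_tokens: int, reserve_for_output: int = 4096) -> bool:
    """Check whether the input plus a reserved output budget fits in the window."""
    return estimate_tokens(text) + reserve_for_output <= window_tokens

prompt = "Summarize the attached quarterly revenue report for the finance team."
print(estimate_tokens(prompt))
print(fits_in_window(prompt, window_tokens=8_192))
```

Estimates like this drift for code, URLs, and non-English text, so treat them as a planning tool, not an enforcement mechanism.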

Token count directly affects three things practitioners care about: cost (most API providers charge per token), latency (more tokens means slower processing), and effective capacity (every system prompt, tool output, and piece of conversation history eats into the budget before the model even starts generating).

Context window sizes across major models

Context windows have grown dramatically. The original GPT-3.5 that launched ChatGPT in late 2022 had a 4,096-token limit. Today’s leading AI models support hundreds of thousands, or even millions, of tokens.

Model                        | Context window          | Approximate word equivalent
GPT-5.4 (OpenAI)             | Up to 1,050,000 tokens  | ~788,000 words
Claude Opus 4.6 (Anthropic)  | Up to 1,000,000 tokens  | ~750,000 words
Gemini 2.5 Pro (Google)      | Up to 1,000,000 tokens  | ~750,000 words
Llama 4 Scout (Meta)         | Up to 10,000,000 tokens | ~7,500,000 words
Mistral Large 3              | 256,000 tokens          | ~192,000 words

Larger context windows bring real benefits. They allow models to maintain coherence across longer conversations, produce more accurate responses by referencing more information, and reduce hallucinations by grounding outputs in more source material.

But those gains come with trade-offs rooted in how transformer-based models actually work. LLMs use a self-attention mechanism that calculates the relationship between every token and every other token in the context. This is what allows the model to understand which words relate to which, but it also means that compute requirements scale quadratically with context length. Double the input tokens, and the model needs roughly four times the processing power. In practice, that translates to higher API costs and progressively slower inference as the context window fills up.
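The quadratic relationship is easy to see in miniature. This toy function (an illustration, not a real attention implementation) counts the pairwise score computations self-attention performs:

```python
def attention_score_ops(n_tokens: int) -> int:
    """Self-attention compares every token with every other token,
    so the score matrix has n * n entries to compute."""
    return n_tokens * n_tokens

for n in (1_000, 2_000, 4_000):
    print(f"{n:>6} tokens -> {attention_score_ops(n):>12,} score computations")
```

Doubling the token count quadruples the score computations, which is the source of the cost and latency growth described above (real systems mitigate this with optimized attention kernels, but the underlying scaling remains).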

Greater context length, in other words, doesn’t automatically mean better performance. More context introduces more computation, higher cost, and (counterintuitively) more opportunities for the model to get confused by irrelevant information.

The question for practitioners isn’t “which model has the biggest window?” It’s “what needs to be in the window for this task, and how do I keep everything else out?”

Why do context windows matter for AI agents?

In a simple chatbot interaction, context usage is predictable. The user asks a question, the model responds, and the exchange is complete. The context window determines how much information the model can work with at each step—and in a one-shot exchange, that’s rarely a problem.

When an AI agent executes a multi-step task (querying databases, calling APIs, reasoning over intermediate results), each step adds to the context. A 50-step workflow where each LLM call consumes 20,000 tokens means 1,000,000 tokens of total context over the course of the task. If the agent needs to reference early results in later steps, context accumulates within the window, not just across calls.

This is where context windows shift from an abstract technical specification to a production constraint with real consequences. When the window fills up mid-workflow, the agent doesn’t stop and tell you. It continues with whatever’s left. And whatever’s left may not include the information it needs to make the right decision.
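The arithmetic above can be sketched as a simulation (hypothetical numbers; real per-step token counts vary by model and tool):

```python
def first_overflow_step(window: int, step_tokens: int, n_steps: int):
    """Return the first step at which accumulated context exceeds the window,
    or None if the workflow fits."""
    context = 0
    for step in range(1, n_steps + 1):
        context += step_tokens
        if context > window:
            return step
    return None

# A 50-step workflow at 20,000 tokens per step against a hypothetical
# 200,000-token window overflows at step 11 -- long before the workflow
# finishes, early context starts getting dropped.
print(first_overflow_step(window=200_000, step_tokens=20_000, n_steps=50))
```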

What happens when context limits fail?

Context failures are uniquely dangerous because they’re invisible. There’s no error message when a context window quietly drops earlier information. Logs show successful calls. The agent keeps running. The problem only surfaces in outcomes, and sometimes not even then.

Silent degradation

An agent completes a complex task, but with critical information missing from earlier steps. It booked the flight, but forgot the dietary restriction mentioned at the start. It generated the report, but dropped the revenue figures from the first query. The task looks complete. The results look plausible. The details are wrong.

The “lost in the middle” problem

Even when information is technically within the context window, models don’t pay equal attention to all of it. A widely cited 2023 Stanford study found that LLMs perform best on information at the beginning and end of their context, with significant performance degradation for information positioned in the middle. A million-token window doesn’t give you a million tokens of perfect recall. It gives you strong recall at the edges and a fuzzy middle.

Cascading failures

When early tool results get dropped from context, later steps build decisions on incomplete data. The agent doesn’t know what it’s missing. Each subsequent step compounds the error, and the final output can look sophisticated and well-reasoned while being built on a foundation of gaps. These failures are especially hard to catch because the agent’s reasoning chain appears internally consistent; it just started from the wrong premises.

Inconsistent behavior

The same workflow might succeed perfectly with short inputs during testing but break unpredictably with longer inputs in production. A customer pastes in 5,000 words of email history instead of the 500-word test case, and the agent’s behavior shifts. The logic hasn’t changed. The code hasn’t changed. The only difference is context volume, and your test cases never caught it.

How practitioners manage context window limits

Context windows are design constraints, not problems to wish away. The teams building the most reliable AI agents treat context as a resource to be intentionally managed, not a limit to max out.

Understand your token budget

Start by accounting for every source of context in your workflow. System prompts alone can consume anywhere from a few dozen to several thousand tokens. Tool outputs from a single database query might return thousands more. Conversation history in a multi-turn interaction adds tokens with every exchange. Build your token budget for the edge cases you’ll hit in production, not the clean paths that work in testing.

Design workflows with context in mind

Break long workflows into stages with summarization points. Instead of carrying every detail from a 30-step process, compress intermediate results at logical breakpoints. When a database query returns 100 rows, decide whether all 100 need to be in context or whether a summary of the top five is sufficient.

Prioritize what stays in the window. Always preserve the original user intent: no matter how long the workflow runs, the agent needs to know what was asked. Keep recent tool results in full detail. Summarize or drop older intermediate outputs that aren’t relevant to remaining steps.
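That prioritization policy can be sketched in a few lines (a simplified illustration — the truncation below stands in for a real summarizer, and all names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class ContextItem:
    role: str      # "intent", "tool_result", or "summary"
    text: str
    tokens: int

def prune(items: list[ContextItem], budget: int, keep_recent: int = 3) -> list[ContextItem]:
    """Keep the original user intent and the most recent tool results in full;
    replace older results with compressed summaries, dropping the oldest
    summaries first if the total still exceeds the budget."""
    intent = [i for i in items if i.role == "intent"]
    rest = [i for i in items if i.role != "intent"]
    kept = rest[-keep_recent:]            # newest tool results stay verbatim
    summaries = [ContextItem("summary", i.text[:80], max(1, i.tokens // 10))
                 for i in rest[:-keep_recent]]   # crude stand-in for an LLM summarizer
    pruned = intent + summaries + kept
    while sum(i.tokens for i in pruned) > budget and summaries:
        summaries.pop(0)                  # oldest summaries are dropped first
        pruned = intent + summaries + kept
    return pruned
```

The key property is that user intent is never evicted: it is the one item the agent must carry from the first step to the last.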

Choose the right retrieval strategy

Not every task benefits from a massive context window. Sometimes smarter retrieval is more effective and cheaper.

  • Long context windows work best when the agent needs to reference specific details across many steps, or when relationships between distant pieces of information matter (like comparing sections of a long contract).
  • RAG (retrieval-augmented generation) works best when you’re searching a large knowledge base for specific facts and only a small portion of available data is relevant to each decision.
  • Rolling context or sliding windows work best for long-running conversations where the agent needs coherence but not complete history.

The right approach depends on how information flows through your specific workflow. Defaulting to the largest available context window is often the most expensive and least effective option.
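A rolling context can be as simple as a token-bounded queue of conversation turns (a sketch using the rough tokens-per-word estimate; a production system would typically summarize evicted turns rather than drop them outright):

```python
from collections import deque

class RollingContext:
    """Keeps the most recent turns whose combined token estimate fits the budget."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.turns = deque()      # (text, estimated_tokens) pairs, oldest first
        self.total = 0

    def add(self, text: str) -> None:
        tokens = round(len(text.split()) * 1.3)   # rule-of-thumb estimate
        self.turns.append((text, tokens))
        self.total += tokens
        while self.total > self.max_tokens and len(self.turns) > 1:
            _, evicted = self.turns.popleft()      # oldest turn falls out
            self.total -= evicted

    def render(self) -> str:
        return "\n".join(text for text, _ in self.turns)
```

This gives the agent coherence over recent exchanges at a fixed, predictable cost per request, at the price of complete history.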

Monitor and measure

You wouldn’t deploy a web service without monitoring memory usage. The same discipline applies to context.

Track token usage at each workflow step, not just per request. Monitor how close you regularly get to context limits. If you’re consistently at 80% capacity, one edge case will push you over. Watch for performance degradation as context grows. And break down cost per workflow execution to find which patterns are consuming tokens disproportionately.
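In code, per-step tracking looks like any other resource monitor (a sketch; the 80% threshold and the step names are illustrative):

```python
class ContextMonitor:
    """Tracks token usage per workflow step and flags approaching limits."""

    def __init__(self, window_tokens: int, warn_ratio: float = 0.8):
        self.window = window_tokens
        self.warn_at = int(window_tokens * warn_ratio)
        self.used = 0
        self.per_step = []        # (step_name, tokens) for cost breakdowns

    def record(self, step_name: str, tokens: int) -> None:
        self.used += tokens
        self.per_step.append((step_name, tokens))
        if self.used >= self.warn_at:
            pct = 100 * self.used / self.window
            print(f"WARN {step_name}: context at {pct:.0f}% of window")

monitor = ContextMonitor(window_tokens=128_000)
monitor.record("system_prompt", 2_000)
monitor.record("db_query_result", 60_000)
monitor.record("tool_call_2", 45_000)   # crosses the 80% threshold -> warning
```

Logging `per_step` alongside the warning makes it straightforward to see which workflow stages consume tokens disproportionately.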

The upstream problem: What goes into the context window matters most

Every strategy above (compression, summarization, retrieval, monitoring) assumes that the information entering the context window is accurate, current, governed, and relevant. In practice, at enterprise scale, that assumption breaks down fast:

  • Data comes from dozens of systems
  • Business terms mean different things to different teams
  • A metric the finance team deprecated three months ago is still floating around in a dashboard the agent queries
  • An agent retrieves a customer revenue figure without knowing that the finance team’s definition of “customer” differs from the product team’s
  • An agent joins data from two systems without understanding that one uses fiscal quarters and the other uses calendar quarters

These aren’t context window problems; they’re context quality problems. And they’re the hardest ones to solve.

An agent can have a two-million-token context window and still produce wrong results if the information it retrieves is stale, ambiguous, or unauthorized. A perfectly engineered prompt can’t compensate for unreliable source data. And in enterprise environments with hundreds of agents running simultaneously, each team building its own retrieval pipeline with its own assumptions, the inconsistencies compound.

The disconnect between perception and reality is striking. According to DataHub’s 2026 State of Context Management Report, 88% of organizations claim to have fully operational context platforms, yet 61% frequently delay AI initiatives due to a lack of trusted data. The confidence is there. The infrastructure isn’t. This is where context management enters—not as a product feature, but as an emerging discipline.

What is context management?

Context management is the organization-wide capability to reliably deliver the most relevant data to AI context windows, enabling the governed, enterprise-scale deployment of agents. It sits above the individual tools and techniques of context engineering (prompt design, retrieval strategy, memory management) and asks the systemic question: how does your organization ensure that every agent, regardless of which team built it, works with trustworthy context?

A context window is the mechanism. Context management is the discipline that governs what goes into it.

Metadata designed for human consumption differs fundamentally from context engineered for AI agents.

Understanding context windows is the prerequisite to understanding why context management matters. As AI moves from single-agent prototypes to production-scale systems, the limiting factor won’t be the size of the window—it will be the reliability of what’s inside it.

FAQs

What’s the difference between a context window and a token limit?

A context window is the total amount of text (measured in tokens) an LLM can process at once, including both input and output. A token limit often refers specifically to the maximum output length the model can generate. For example, a model might have a 128,000-token context window but cap its output at 4,096 or 16,384 tokens. The context window is the total capacity; the output limit is a subset of it.

How many tokens is 1,000 words?

In English, 1,000 words translates to roughly 1,300 tokens. The exact number varies depending on the model’s tokenizer, the complexity of the vocabulary, and whether the text includes code, URLs, or non-Latin characters.

Can you increase a model’s context window?

Not directly through configuration. A model’s context window is determined during training and architecture design. However, techniques like retrieval-augmented generation (RAG), context compression, and sliding window approaches let you work effectively with more information than the window technically allows, by selecting and prioritizing what goes in.

How do I choose the right context window size?

Start with the smallest window that handles your use case, not the largest one available. A larger window costs more per request, runs slower, and introduces more opportunity for the model to lose focus on relevant information. Map your token budget (system prompt, user input, tool outputs, retrieved documents, conversation history) and build for your production edge cases. If you’re consistently using less than 20% of available context, you’re likely overpaying. If you’re regularly hitting 80%+, you’re one edge case from failures.

What is the “lost in the middle” problem?

Research shows that LLMs perform best on information positioned at the beginning and end of their context window. Information buried in the middle is more likely to be underweighted or ignored, even when it’s technically within the context limit. This means a large context window doesn’t guarantee uniform attention, and model performance can actually degrade as context grows longer.

What’s the difference between a context window and long-term memory?

A context window is temporary and session-scoped. Once a conversation or task ends, the context is gone. Long-term memory (where a model recalls information across sessions) requires external systems (databases, memory stores, or retrieval infrastructure) to persist and deliver relevant information back into the context window for future interactions.

How does RAG relate to context windows?

RAG (retrieval-augmented generation) is a technique for pulling relevant information from external sources into the context window at query time. Instead of fitting everything into the window upfront, RAG retrieves only what’s needed for a given task. This makes it possible to work with knowledge bases far larger than any context window could hold. But the reliability of RAG depends entirely on the quality and governance of the information being retrieved.

What is context management?

Context management is the organizational capability to reliably deliver accurate, governed, and relevant information to AI context windows at scale. While a context window defines how much an LLM can process, context management addresses the upstream challenge: ensuring that what goes into the window is trustworthy across many agents, teams, and applications. It’s the discipline that transforms context windows from a technical specification into a governed enterprise capability.