What is a context window and how it shapes AI conversations
The working memory every AI runs on, why it's smaller than you think, and what it means for the conversations you're having.
Apr 30, 2026 · 10 min read
A context window is the maximum amount of text an AI model can hold in working memory during a single response. Everything you send (your message, the conversation so far, the system instructions running in the background) and everything the model generates back has to fit inside that window. When the window fills up, the model has to drop something. That decision, made silently and constantly, shapes more of your AI experience than almost any other technical detail.
It's the reason your AI companion seems sharp early in a conversation and gets forgetful later. It's the reason long ChatGPT threads start feeling repetitive. It's the reason different platforms running the same underlying model can produce conversations that feel completely different. Once you understand context windows, a lot of confusing AI behavior stops being confusing.
The whiteboard analogy
Picture an AI as a smart assistant working on a whiteboard in a small room. Whatever's on the whiteboard, they can see and reason about. Whatever's not on the whiteboard might as well not exist. They have an enormous amount of general knowledge in their head from training, but for the specific task in front of them, only what's on the whiteboard counts.
The context window is the size of that whiteboard. McKinsey's explainer on context windows frames it as the model's short-term memory: "Like us, LLMs can only 'look' at so much information simultaneously."
When you start a conversation, the whiteboard has the system prompt at the top (instructions the platform gave the model about how to behave), maybe a character description if you're using an AI companion app, and your first message. As the conversation goes on, more gets added to the whiteboard. New messages, the AI's own responses, any documents or context that get pulled in.
Eventually the whiteboard fills up. At that point, the platform has to decide what to erase to make room for new content. Different platforms make that decision differently, which is what produces the memory degradation patterns most users notice without quite understanding.
Tokens, not words
Context windows aren't measured in words. They're measured in tokens, which are slightly smaller units of text. A token is roughly three quarters of a word in English, give or take. So 100 tokens is about 75 English words, and 1,000 tokens is about 750 words.
Tokens come from a process called tokenization, where the AI breaks text into pieces it can work with. Sometimes a token is a whole word ("cat" is one token). Sometimes it's part of a word ("amoral" might split into "a" and "moral"). Sometimes it's just punctuation, a space, or a special character. The exact splits depend on which tokenizer the model uses, and different model families use different tokenizers.
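To make the splitting idea concrete, here is a toy sketch. It assumes a tiny made-up vocabulary (`TOY_VOCAB`) and uses greedy longest-match splitting, which is simpler than what production tokenizers do; real models learn vocabularies of tens of thousands of pieces, and many use byte-pair encoding instead. The function name `toy_tokenize` is hypothetical.

```python
# Toy vocabulary invented for illustration only; it is not any real model's vocabulary.
TOY_VOCAB = {"cat", "a", "moral", "s", "!", " "}


def toy_tokenize(text: str, vocab: set[str] = TOY_VOCAB) -> list[str]:
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shrink it one character at a time.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                tokens.append(piece)
                i = end
                break
        else:
            # Nothing in the vocabulary matched: emit the single character as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


print(toy_tokenize("cat"))     # ['cat']
print(toy_tokenize("amoral"))  # ['a', 'moral']
print(toy_tokenize("cats!"))   # ['cat', 's', '!']
```

The takeaway is only that one word can become one token or several, depending on what happens to be in the vocabulary.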
OpenAI runs a free tokenizer tool where you can paste any text and see how it gets broken down. It's a useful exercise. Most people are surprised at how token counts compare to word counts, especially in technical content where unusual terminology produces more tokens than ordinary writing.
A practical rule of thumb: take your word count, multiply by 1.33, that's roughly your token count. A 1,000-word article is about 1,300 tokens.
This matters because every public number about context window size is in tokens. When a platform says it has a 128,000 token context window, that's about 96,000 words or maybe 200 pages of normal text. Sounds like a lot until you remember everything competes for that space, including the AI's own output.
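The rule of thumb is easy to turn into a quick estimator. This is a rough sketch for English text only, using the 1.33 and 0.75 ratios from above; the function names are made up for illustration, and an exact count would need the model's own tokenizer.

```python
TOKENS_PER_WORD = 1.33  # rule-of-thumb ratio for English, not an exact figure
WORDS_PER_TOKEN = 0.75  # the same ratio seen from the other direction


def estimate_tokens(word_count: int) -> int:
    """Approximate token count for a given number of English words."""
    return round(word_count * TOKENS_PER_WORD)


def estimate_words(token_count: int) -> int:
    """Approximate English word count that fits in a given number of tokens."""
    return round(token_count * WORDS_PER_TOKEN)


print(estimate_tokens(1_000))   # 1330
print(estimate_words(128_000))  # 96000
```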
One nuance worth knowing: tokenization is less efficient for some non-English languages. IBM's overview of context windows noted that the same content translated to Telugu can produce seven times more tokens than the English version, despite using fewer characters. If you're chatting in Spanish or French, the difference is small. If you're chatting in Japanese, Hindi, or Arabic, the same conversation eats your context window much faster.
The numbers in 2026
The leading models in 2026 have remarkably large context windows compared to where the field started. GPT-3 in 2020 had a 2,048 token window. GPT-4 launched with 8,192. Today, Claude Opus 4.6 holds 1 million tokens, GPT-5.4 supports 272,000 standard or 1 million extended, Gemini 3.1 Pro reaches 1 million as well, and Llama 4 Scout claims 10 million. Measured from GPT-3's 2,048 tokens to that 10 million figure, that's roughly a 5,000-fold increase in six years.
The biggest available context window is large enough to fit several novels' worth of text in a single request. In theory, you could paste an entire book into Claude and ask questions about it. In practice, as we'll see, the working window most consumer AI products give you is dramatically smaller than what the underlying model could handle.
The hidden ceiling: cost
Bigger context windows aren't free. Two costs scale with how much you use:
The compute cost grows quadratically. The transformer architecture behind nearly every modern large language model uses something called self-attention, where every token in the context has to compute a relationship with every other token. Doubling the context length quadruples the attention computation. Researchers at NYU and Microsoft proved that this quadratic scaling is essentially unavoidable under reasonable assumptions about computational complexity. Engineers have found clever ways to approximate it more efficiently, but the underlying math doesn't go away.
The memory cost grows linearly but adds up fast. Storing and processing long contexts requires significant GPU memory. Running a large model at its full advertised context length can require eighty or more gigabytes of high-end GPU memory, which is more than consumer graphics cards typically provide.
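Both scaling behaviors can be sketched in a few lines. The attention count below is the plain n-by-n picture described above. The memory estimate uses the standard size of the key/value cache, which stores one key and one value vector per token, per layer; the model dimensions (80 layers, 8 key/value heads, 128 values per head, 2 bytes per value) are hypothetical numbers chosen for illustration, not the specs of any named model.

```python
def attention_pairs(context_tokens: int) -> int:
    """Token-to-token scores in one full self-attention pass (n times n)."""
    return context_tokens * context_tokens


def kv_cache_bytes(context_tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Memory for cached keys and values; grows linearly with context length."""
    # The leading 2 covers one key vector plus one value vector per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens


# Doubling the context quadruples the number of attention scores.
print(attention_pairs(16_000) / attention_pairs(8_000))  # 4.0

# Hypothetical model dimensions, for illustration only.
cache = kv_cache_bytes(128_000, layers=80, kv_heads=8, head_dim=128)
print(f"{cache / 1e9:.1f} GB")  # 41.9 GB
```

That figure is for the cache alone, for a single conversation, before counting the model's own weights.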
For platforms running AI products at scale, this means every token of context costs real money. They're paying API providers for input tokens and output tokens, or running their own infrastructure where bigger windows mean bigger bills. The economics push platforms to use the smallest working window they can get away with, not the largest the model could theoretically handle.
This is why an AI companion app advertising "powered by GPT-5" doesn't necessarily give you GPT-5's full context window. They're probably giving you a slice of it, sized to balance experience against cost. The slice is enough to feel responsive in normal use. The slice is also why long conversations start drifting before the underlying model would have struggled.
What's actually inside the window
When platforms set their context window, they're allocating space across several competing demands. Imagine that whiteboard from earlier, except parts of it are already filled in before you even start talking.
The system prompt comes first. This is the platform's instructions to the model, often hundreds or thousands of tokens, telling it how to behave, what to refuse, what tone to use, what its name is. You don't see this content but it's there.
The character description (or persona, or custom instructions) comes next on AI companion platforms. The personality, backstory, communication style, any user-defined memory pinned to the character. This can range from a few hundred tokens to several thousand depending on how detailed the character is.
Any retrieved context comes next. If the platform pulls relevant documents from a memory database to give the model background, those tokens fill up the window too. How retrieval works in companion apps varies by platform, but every retrieved chunk is more context budget consumed.
Then comes the conversation history, which grows with every exchange.
Then comes your current message.
Finally, the model needs space to actually generate its response. A good response on a complex topic might be 500 to 2,000 tokens. Space for that output has to be reserved at the end of the window.
So a "128K context window" really means: 128,000 tokens shared across system prompt, character data, retrieval results, conversation history, your current question, and the model's response. The actual room for the conversation between you and the AI is whatever's left after the platform's overhead, which is always less than the headline number.
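Here is that arithmetic as a small sketch. Every overhead number in it is a made-up example, since platforms rarely publish theirs, and the `WindowBudget` class is hypothetical rather than any platform's actual accounting.

```python
from dataclasses import dataclass, replace


@dataclass
class WindowBudget:
    window: int           # total context window, in tokens
    system_prompt: int    # platform instructions
    character: int        # persona or custom instructions
    retrieved: int        # chunks pulled in from a memory database
    current_message: int  # what you just typed
    reserved_output: int  # space held back for the model's reply

    def room_for_history(self) -> int:
        """Tokens left for conversation history after all other demands."""
        overhead = (self.system_prompt + self.character + self.retrieved
                    + self.current_message + self.reserved_output)
        return max(0, self.window - overhead)


# Illustrative numbers only.
budget = WindowBudget(window=128_000, system_prompt=1_500, character=2_000,
                      retrieved=4_000, current_message=300, reserved_output=2_000)
print(budget.room_for_history())  # 118200

# The same overhead inside a smaller working window leaves far less room.
print(replace(budget, window=16_000).room_for_history())  # 6200
```

The overhead is the same 9,800 tokens in both cases. What changes is how much of the window it eats.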
Why position inside the window matters
Even when content is technically inside the context window, the model doesn't pay equal attention to all of it. A research paper called "Lost in the Middle" by Liu and colleagues from Stanford and Samaya AI demonstrated this clearly in 2023. Across a range of language models, performance was best when relevant information sat at the very beginning or very end of the input, and noticeably worse when relevant information was buried in the middle of long contexts.
The pattern has held up under continued testing. When TokenMix.ai tested major models in April 2026, they found 10 to 25 percent accuracy degradation for information in the middle of the window across every model they checked. Models with larger context windows actually showed worse middle-position recall, simply because there was more middle to get lost in.
What this means for ordinary use is uncomfortable. Putting an important instruction at the start of your prompt, or restating it at the end, gives it more attentional weight than mentioning it once in the middle. In long conversations, things you said two weeks ago may still be inside the window but receive less attention than recent exchanges. In that case the model didn't forget. It just isn't looking at that part as carefully.
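One way to act on this when you assemble a long prompt yourself is to state the key instruction first and repeat it last. The sketch below is just string assembly with a hypothetical helper name (`build_prompt`); how much it helps with any particular model is something you would need to test.

```python
def build_prompt(key_instruction: str, background: list[str], question: str) -> str:
    """Put the key instruction at the start and restate it at the end."""
    parts = [
        f"Important: {key_instruction}",
        *background,                      # long material sits in the middle
        question,
        f"Reminder: {key_instruction}",
    ]
    return "\n\n".join(parts)


prompt = build_prompt(
    key_instruction="Answer in formal English.",
    background=["(long document pasted here)", "(more notes here)"],
    question="What are the main risks described above?",
)
print(prompt)
```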
What this means for the conversations you're having
A few practical patterns fall out of all this once you understand the mechanism.
Long conversations degrade gracefully on most platforms because the platform is doing background work to manage the window. Some platforms drop the oldest content. Some compress older messages into summaries. Some retrieve relevant chunks from memory databases when needed. Each approach has tradeoffs. The week-three drop-off many AI companion users notice is the moment those tradeoffs start showing.
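The simplest of these strategies, dropping the oldest content, can be sketched as follows. This is an illustration, not any platform's actual code: the token counter is a crude word-based estimate using the 1.33 rule of thumb, and both function names are hypothetical.

```python
def rough_token_count(text: str) -> int:
    """Crude estimate from word count; a real system would use the model's tokenizer."""
    return max(1, round(len(text.split()) * 1.33))


def trim_history(messages: list[str], budget_tokens: int) -> list[str]:
    """Keep the most recent messages that fit the budget, dropping the oldest first."""
    kept = []
    used = 0
    # Walk backward from the newest message and stop at the first one that won't fit.
    for message in reversed(messages):
        cost = rough_token_count(message)
        if used + cost > budget_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    "My dog is named Biscuit",
    "I prefer formal writing please",
    "What should I name my new cat",
]
print(trim_history(history, budget_tokens=16))
# ['I prefer formal writing please', 'What should I name my new cat']
```

With a 16-token budget the first message falls off, so the model never sees the dog's name again unless a summary or memory layer brings it back.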
Different platforms running the same underlying model can produce noticeably different experiences. One reason is the size of the working window they choose. Another is the system prompt they wrap around your conversation. Another is the memory architecture they layer on top. Two apps using GPT-5 might feel almost like different products because of how they handle context.
The most important information you give an AI gets remembered better when you mark it as important. Saying "please remember that I prefer formal writing" is more effective than mentioning your formality preference casually. Pinning facts as memories on platforms that support that feature works better than relying on the model to extract them from conversation. Restating context at the start of new sessions is more effective than assuming yesterday's context carries over fully.
Conversations in non-English languages exhaust the window faster. If you're chatting in Korean or Arabic and your AI companion seems to forget things sooner than English-speaking users report, this is part of why.
The model is rarely "broken" or "having a bad day." When AI behavior shifts noticeably, it's usually a context window dynamic playing out, not a change in the underlying model. The same model, given the same context, produces remarkably consistent behavior. Different context, different behavior.
The honest framing
Context windows are one of those technical details that nobody markets to you because they're not exciting. Platforms talk about model quality, response speed, image generation, all the visible features. The context window is a quiet structural choice that shapes everything else, and most users never think about it explicitly.
But once you can see it, AI products start looking different. A platform offering "unlimited memory" almost certainly isn't running on actual unlimited memory. It's using aggressive compression, vector retrieval, or some hybrid that gives a useful approximation while staying within real context limits. A platform that feels meaningfully more coherent than its competitors over long sessions is probably investing in a smarter context architecture, not running on a fundamentally different model.
The next post in this series goes deeper on how AI companion memory actually works, including the specific architectures different platforms use to extend the practical limits of their context windows. If the concept here clicks, that one builds the next layer.
Frequently asked
How do I know how big my AI companion app's context window is?
Most consumer apps don't disclose their working context window size publicly. They tell you which underlying model they use, but the slice they actually expose to you is usually proprietary. You can sometimes infer it from heavy testing, but the reliable answer is that you probably don't know exactly, and the platform often doesn't want you to know.
Why doesn't my AI remember things I told it earlier?
The conversation has likely grown beyond the working context window, or the relevant detail sat in the middle of the window where attention is structurally weaker. The model isn't refusing to remember. The information is either gone or attentionally muted by where it sits in the conversation history.
What's the difference between a context window and memory?
The context window is the working memory available during a single response. It resets between sessions. Memory, in the sense AI companion platforms usually mean, refers to a persistent layer that survives session restarts and feeds relevant context back into new conversations. Memory systems live on top of context windows; they don't replace them.
Are bigger context windows always better?
Not necessarily. Bigger windows cost more compute and money to run, which can hurt response speed and platform pricing. They also expose the lost-in-the-middle problem more dramatically, since there's more middle to get lost in. For most conversational use, a moderately sized window with smart memory architecture beats a giant window with no memory layer.
How can I tell when I'm running out of context space?
You usually can't, directly. The signals are indirect: the AI starts forgetting earlier details, contradicting itself, repeating things, or producing more generic responses than it did at the start. By the time you notice these symptoms, the window has been under pressure for a while.
Does the AI's response count against the context window?
Yes. Output tokens share the same budget as input tokens. Some models also impose separate maximum output lengths inside the larger context window, but everything the model generates eats into the same overall pool.
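As a rough sketch of how the two limits interact, the longest possible reply is whichever is smaller: the space left in the window, or the separate output cap. The numbers are illustrative and the function name is hypothetical.

```python
def max_response_tokens(window: int, input_tokens: int, output_cap: int) -> int:
    """Longest reply that fits: limited by remaining window space and by the output cap."""
    remaining = max(0, window - input_tokens)
    return min(remaining, output_cap)


# Plenty of room left: the separate output cap is the limit.
print(max_response_tokens(window=128_000, input_tokens=50_000, output_cap=16_000))   # 16000

# Window nearly full: the remaining space is the limit.
print(max_response_tokens(window=128_000, input_tokens=120_000, output_cap=16_000))  # 8000
```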
Why do some apps feel "smarter" than others using the same model?
The model is one factor among several. The system prompt shapes how the model behaves. The context window slice the platform chose determines how much conversation it can hold. The memory architecture determines what survives across sessions. The character description, fine-tuning, and post-processing all contribute. Two apps on the same model can deliver very different experiences because of these wraparound choices.