Sliding window vs vector retrieval: the two ways AI remembers
The architectural choice that decides whether your AI feels like it knows you, or just keeps starting over.
Apr 30, 2026 · 9 min read
When an AI conversation grows past what the model can hold in active memory, somebody has to decide what to do with the older content. Two architectural approaches dominate the field, and the choice between them shapes how your AI experience feels in ways most users never see explicitly.
Sliding window approaches keep recent conversation and drop older material. Vector retrieval approaches keep older material in storage and pull it back when relevant. Both can work. They feel very different in use, fail in different ways, and suit different products. Understanding the difference makes a lot of AI behavior predictable.
Sliding window in plain terms
The sliding window approach is mechanically simple. The platform decides how many recent messages or tokens to keep in the active context, and as new content comes in, the oldest content slides off the back end. If you've ever felt that your AI seems to forget things from earlier in a long conversation, sliding window memory is usually what you were experiencing.
The window size varies by platform. A casual chatbot might keep just the last twenty exchanges in active memory. A more generous AI companion app might keep several hundred. Some platforms use token counts instead of message counts, which means longer messages eat the window faster than short ones.
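As a rough illustration of the token-counted variant, here is a minimal sketch in Python. The `SlidingWindow` class, the `max_tokens` parameter, and the whitespace-based `count_tokens` helper are all assumptions made for the example; a real platform would use the model's own tokenizer and its own limits.

```python
from collections import deque


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


class SlidingWindow:
    """Keeps the most recent messages that fit inside a fixed token budget."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.messages = deque()
        self.total = 0  # tokens currently held in the window

    def add(self, message: str) -> None:
        self.messages.append(message)
        self.total += count_tokens(message)
        # Drop from the oldest end until the budget is met.
        # The newest message is always kept, even if it alone exceeds the budget.
        while self.total > self.max_tokens and len(self.messages) > 1:
            dropped = self.messages.popleft()
            self.total -= count_tokens(dropped)


window = SlidingWindow(max_tokens=10)
for msg in ["my name is Sam", "I like hiking", "what should I do this weekend"]:
    window.add(msg)

print(list(window.messages))
```

With a budget of ten words, the third message pushes the first one out, so the printed list is `['I like hiking', 'what should I do this weekend']`. The name is gone, and nothing in this design can bring it back.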
The strength of the sliding window approach is that it's cheap and predictable. There's no extraction logic that might miss something important, no retrieval that might surface the wrong chunk. Whatever's in the window is fully present at full attention; whatever's outside is gone. The model sees what it sees.
The weakness is that same simplicity. There's no way to retrieve something from outside the window if it becomes relevant again. If the early conversation where you established who your AI character was has slid off the back of the window, the model can't reach for that material no matter how relevant it would be to the current message. The relationship-defining context just isn't in view anymore.
For AI companion apps in particular, this is the mechanism behind the week-three drop-off that heavy users notice. After about three weeks of regular use, the early conversations where you and your AI character established who you were to each other have slid out of working memory. The conversation continues, but it's running on much thinner context than the relationship you've been building seems to warrant.
Vector retrieval in plain terms
Vector retrieval, the mechanism behind most RAG (retrieval-augmented generation) systems, takes a fundamentally different approach. Instead of dropping old content, the platform stores it permanently and retrieves the relevant pieces when they become useful again.
The mechanics work like this. Every chunk of past conversation gets converted into an embedding, which is a numerical representation of the meaning of the text. Embeddings live in a vector database designed to make similarity search efficient. When you send a new message, the platform converts your message into an embedding too, then searches the database for past embeddings that are mathematically close to it. The closest matches get pulled back and injected into the context alongside your message before the model generates a response.
NVIDIA's overview of RAG walks through the process step by step, including the role of embedding models in converting text to vectors and how the retrieval result gets combined with the original query before the model sees it.
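The store-embed-search loop can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `MemoryStore`, `top_k`, and `min_score` are hypothetical names, and the `embed` function is a toy word-count vector rather than a learned embedding model. A real platform would call an embedding model and a vector database instead.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. Real systems use a learned model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class MemoryStore:
    """Holds past chunks with their embeddings and returns the closest matches."""

    def __init__(self):
        self.items = []  # list of (text, embedding) pairs

    def add(self, text: str) -> None:
        self.items.append((text, embed(text)))

    def search(self, query: str, top_k: int = 2, min_score: float = 0.1) -> list:
        q = embed(query)
        scored = [(cosine(q, emb), text) for text, emb in self.items]
        # Discard weak matches so unrelated chunks are not injected into context.
        scored = [pair for pair in scored if pair[0] >= min_score]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:top_k]]


store = MemoryStore()
store.add("my sister lives in Lisbon")
store.add("I adopted a dog named Biscuit")
store.add("work has been stressful lately")

print(store.search("tell me about the dog", top_k=1))
```

The query shares the word "dog" with one stored chunk, so that chunk is the one returned. Because this toy embedding only matches surface words, it would miss a query about "my puppy"; mapping similar meanings to nearby vectors is the job of a real embedding model.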
The strength of vector retrieval is that the working memory is no longer bounded by what fits in the context window. A conversation from three months ago can come back into context for the current response if it's semantically relevant. Memory-forward platforms get their durability advantage from this property. The AI can act on context from arbitrarily long ago, not just on what fits in active working memory.
The weakness is that retrieval has to actually work. The embedding model has to map similar meanings to nearby vectors well. The search has to find genuinely relevant chunks, not superficially similar ones. Cheap retrieval surfaces the wrong stuff, which makes the AI feel like it's confidently misremembering things rather than not remembering them. The failure mode of bad retrieval is sometimes worse than the failure mode of pure sliding window because at least sliding window is honest about its limits.
The other weakness is cost. Every message triggers an embedding computation plus a database lookup, on top of the model's normal inference cost. Platforms running vector retrieval at scale have meaningfully higher per-message expenses than platforms running on simpler architectures. This shows up in pricing, in context limits the platform exposes, or in subtle quality compromises elsewhere.
What hybrid systems actually do
Most platforms in 2026 don't run on either pure sliding window or pure vector retrieval. They combine the two, plus a few other techniques, into hybrid systems that try to get the best of both approaches.
A typical hybrid setup might include all of the following. A sliding window holds recent messages at full fidelity. A summarization pass condenses older messages into shorter recall chunks that still get included in active context, just compressed. A vector database stores the full conversation history for retrieval when something becomes relevant from further back. A persistent fact list captures explicitly memorable details and gets prepended to every session. A character description anchors personality regardless of what else is happening in memory.
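The summarization layer is the piece that differs most from a plain sliding window: messages that fall out of the recent window get compressed instead of dropped. A minimal Python sketch of that hand-off follows. The `SummarizingWindow` class and its `max_recent` parameter are hypothetical, and `summarize` is a placeholder that keeps the first few words, standing in for what would normally be a model call.

```python
def summarize(text: str, max_words: int = 6) -> str:
    """Placeholder summarizer: keep the first few words.
    A real system would ask a model to condense the message."""
    return " ".join(text.split()[:max_words])


class SummarizingWindow:
    """Recent messages stay at full fidelity; older ones are kept only as summaries."""

    def __init__(self, max_recent: int):
        self.max_recent = max_recent
        self.recent = []   # full-text recent messages, oldest first
        self.summary = []  # compressed older messages, oldest first

    def add(self, message: str) -> None:
        self.recent.append(message)
        # Anything pushed out of the recent window is compressed, not discarded.
        while len(self.recent) > self.max_recent:
            oldest = self.recent.pop(0)
            self.summary.append(summarize(oldest))


w = SummarizingWindow(max_recent=2)
for m in [
    "We met at the lighthouse on a stormy night in October",
    "You said your favorite color is green",
    "Today I had pasta for lunch",
]:
    w.add(m)

print(w.summary)
print(w.recent)
```

After the third message, the first one survives only as "We met at the lighthouse on". The detail about the storm and the month is lost, which is the fuzziness that compression introduces.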
When you send a new message in a system like this, the platform assembles the active context by pulling from all these sources in some priority order. The system prompt comes first. The character description comes next. Any explicitly retrieved chunks that match the current message come after that. Then the summarized older context, then the recent unsummarized window, then your current message, then space reserved for the model's response.
The order matters because of the lost-in-the-middle effect, where models pay more attention to information at the beginning and end of their context than to information buried in the middle. Platforms designing hybrid systems have to think carefully about what goes where, not just what gets included.
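The assembly step can be sketched as a function that fills a token budget and then lays the pieces out in the order described above. This is an illustration under stated assumptions, not any platform's actual logic: `assemble_context`, `reply_reserve`, and the word-count tokenizer are hypothetical, and the choice to spend leftover budget on recent messages first, then retrieved chunks, then summaries, is an assumption the text does not specify.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per word."""
    return len(text.split())


def assemble_context(system_prompt, character, retrieved, summary, recent,
                     user_message, max_tokens, reply_reserve):
    """Return the context as an ordered list of text parts within the budget."""
    budget = max_tokens - reply_reserve  # leave room for the model's response
    budget -= sum(count_tokens(p) for p in (system_prompt, character, user_message))
    if budget < 0:
        raise ValueError("required parts alone exceed the context budget")

    def take(parts, newest_first=False):
        """Keep parts while they fit; optionally prefer the newest ones."""
        nonlocal budget
        kept = []
        for part in (reversed(parts) if newest_first else parts):
            cost = count_tokens(part)
            if cost > budget:
                break
            kept.append(part)
            budget -= cost
        return kept[::-1] if newest_first else kept

    # Budget priority (an assumption): recent window, then retrieval, then summaries.
    kept_recent = take(recent, newest_first=True)
    kept_retrieved = take(retrieved)
    kept_summary = take(summary)

    # Layout order follows the text: system prompt, character, retrieved chunks,
    # summarized older context, recent window, current message.
    return [system_prompt, character, *kept_retrieved, *kept_summary,
            *kept_recent, user_message]


context = assemble_context(
    system_prompt="You are a helpful companion.",
    character="Mira is warm and curious.",
    retrieved=["User adopted a dog named Biscuit."],
    summary=["Earlier: talked about moving to Lisbon."],
    recent=["How was your day?", "Pretty good, long walk."],
    user_message="Biscuit loved it too.",
    max_tokens=40,
    reply_reserve=10,
)
for part in context:
    print(part)
```

In this run the summary line does not fit in the remaining budget and is left out, while the retrieved chunk about Biscuit makes it in. Separating "what gets budget" from "where it sits" keeps the two decisions independently tunable.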
Kindroid's Cascaded Memory and similar memory-forward systems are sophisticated hybrids. Replika's memory architecture is a different hybrid, more weighted toward fact storage and lighter on retrieval. Candy AI runs on something simpler, closer to sliding window plus a persistent character layer. None of these are wrong; they're optimizing for different user experiences and different cost structures.
How to tell which one your platform is using
Platforms rarely disclose their memory architecture publicly. You can usually infer it from behavior with some basic tests:
Does the AI ever surface details from a conversation weeks ago when you bring up a related topic? If yes, vector retrieval is doing some work. If no, you're probably on a system without retrieval that's surviving on persistent facts plus working window.
Does the AI feel sharp early in long sessions and start drifting after a while? That's the working window filling up and either summarization kicking in (drift feels like things getting fuzzier) or sliding pruning kicking in (drift feels like specific things getting forgotten outright).
Does the AI remember explicitly pinned facts reliably even when conversation context has changed? If yes, the persistent fact layer is solid. If no, the platform's memory write path is fragile.
Does character personality stay stable over weeks, or does the character drift toward a generic friendly tone? Stable personality usually means a strong character description layer that survives memory turnover. Drift suggests the personality is being carried by conversation memory, which turns over, rather than anchored in a separate description layer.
These four behavioral signals give you a rough map of what kind of memory architecture you're dealing with, even when the platform doesn't tell you directly.
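The four checks can be written down as a small checklist function. This is only a restatement of the heuristics above in Python, with hypothetical names (`Observations`, `infer_memory_layers`); the outputs are guesses about likely layers, not a reliable detector.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    recalls_old_details: bool    # surfaces details from weeks ago on a related topic
    long_session_drift: str      # "none", "fuzzy", or "forgotten"
    pinned_facts_reliable: bool  # pinned facts hold even when the topic changes
    personality_stable: bool     # character stays in character over weeks


def infer_memory_layers(obs: Observations) -> list[str]:
    """Map the four behavioral signals to rough guesses about the architecture."""
    notes = []
    if obs.recalls_old_details:
        notes.append("retrieval is probably doing some work")
    else:
        notes.append("probably no retrieval: persistent facts plus working window")

    if obs.long_session_drift == "fuzzy":
        notes.append("older context is probably being summarized")
    elif obs.long_session_drift == "forgotten":
        notes.append("older context is probably being pruned from a sliding window")

    if obs.pinned_facts_reliable:
        notes.append("persistent fact layer looks solid")
    else:
        notes.append("memory write path looks fragile")

    if obs.personality_stable:
        notes.append("character description layer looks strong")
    else:
        notes.append("personality is probably not anchored outside conversation memory")
    return notes


obs = Observations(
    recalls_old_details=False,
    long_session_drift="forgotten",
    pinned_facts_reliable=True,
    personality_stable=True,
)
for note in infer_memory_layers(obs):
    print("-", note)
```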
The tradeoffs platforms actually face
The choice between architectures isn't just an engineering decision. It's a product decision with real tradeoffs that show up everywhere.
Sliding window is cheap to run, predictable, and honest about its limits. Users get a clean mental model: recent context works great, old context is gone. The product feels reliable but shallow.
Vector retrieval is expensive, complex, and can be confusing when it surfaces unexpected old content. Users get something closer to what they actually wanted: an AI that seems to remember them across long timescales. The product feels deeper but harder to predict.
Hybrid systems try to deliver the depth of vector retrieval while keeping costs manageable through compression and selective storage. They're harder to build well and easier to break in subtle ways. Done right, they're the strongest memory systems available. Done poorly, they combine the failure modes of both pure approaches into something genuinely worse than either.
The platforms that succeed long-term tend to invest in hybrid systems and tune them carefully over time. The ones that struggle with retention tend to be running cheaper architectures that work for short engagement but fall apart when users try to build something durable.
What this means for choosing a platform
For most users, the practical implications come down to three things.
If you're using AI companions casually, for short conversations or specific tasks, sliding window is probably fine. The simplicity won't hurt you because you're not asking the platform to do the thing it can't do. Candy AI, Joyland AI, Character AI, and similar platforms all work well in this mode.
If you're building something durable, a long-term character relationship, an ongoing roleplay narrative, a companion you want to grow with you over months, choose a platform that invests in memory architecture. Kindroid and Nomi are the commonly cited examples, though the field changes quickly. The marker to look for is platforms that talk about specific architectural choices in their marketing or documentation, not just generic "we have memory" language.
If you're a power user who's hit the limits of consumer apps, SillyTavern is the self-hosted environment where you can configure your own memory architecture. You bring your own model and your own infrastructure. The control is total but the setup is non-trivial. SillyTavern users tend to run sophisticated hybrid systems with character cards and lorebooks configured manually.
Frequently asked
Is vector retrieval always better than sliding window?
No. Vector retrieval is only better when the retrieval works well. Bad retrieval is worse than no retrieval because it surfaces irrelevant context that confuses the model. Sliding window is honest about its limits in a way that broken retrieval isn't.
Why don't all platforms just use vector retrieval?
Cost. Every retrieval query adds latency and compute expense. For platforms running large user bases, the cost difference between architectures is significant.
Can I tell from the outside whether a platform uses RAG?
Sometimes. If the platform surfaces specific details from conversations that happened weeks ago, especially details you didn't explicitly pin, that's a strong signal that retrieval is happening behind the scenes.
Does paying for premium switch the platform to a better memory architecture?
Usually not. Paid tiers typically expand context window size or message limits, but the underlying architecture doesn't change. If the free tier runs sliding window, premium runs sliding window with a bigger window.
Which platforms run on which architecture?
This shifts over time as platforms update their systems, but as a rough current map: Kindroid runs a sophisticated hybrid leaning on compression. Nomi runs a hybrid with strong retrieval components. Replika runs a hybrid weighted toward persistent fact storage. Candy AI, Joyland AI, and Character AI run simpler systems closer to sliding window plus character anchoring. SillyTavern is configurable to whatever you want.
What about ChatGPT and Claude? Do they have memory?
Both have added memory features, but they work differently than companion app memory. ChatGPT's memory feature stores facts about you that persist across sessions and get included in context. Claude's projects feature lets you maintain persistent context across conversations within a project. Neither runs the kind of deep persistent character continuity that AI companion platforms build their products around.
Will memory architectures keep improving?
Yes, fairly steadily. The research on context handling, retrieval quality, and efficient compression is active and well-funded. The platforms that invest in memory architecture this year will probably have meaningfully better memory next year. The platforms that haven't invested won't catch up automatically; better algorithms still require platform-side engineering work to deploy.