May 18, 2026 · 6 min read

Agent Memory: Architectures and How to Test Them

A practical guide to agent memory — context windows, summarization, long-term stores — the failure modes of each, and how to test memory before users do.

Nik Kowalsi

Agent engineering · Reliability

Agent memory is everything an AI agent retains beyond the current prompt: the running conversation, summaries of past sessions, and long-term stores of facts, events, and preferences. Architecturally it spans context-window buffers, summarization, retrieval-backed stores, and user profiles — and each layer fails differently, so each needs its own tests.

Memory is the difference between a chatbot and an assistant

A stateless agent can be impressive for exactly one conversation. Ask it to reschedule "the appointment we discussed," reference the preference you stated last week, or continue a support case from yesterday, and the illusion collapses — you're talking to something with no past.

Memory is what converts a chatbot into an assistant: the second interaction builds on the first. It's also, not coincidentally, where user trust is won or lost fastest. An agent that forgets is annoying; an agent that misremembers — confidently acting on stale or wrong-user information — is worse than no memory at all. That asymmetry is why memory deserves dedicated testing rather than a hopeful assumption that the vector store is doing its job.

The architecture landscape

"Agent memory" is four distinct mechanisms that usually coexist. Naming them separately matters because they fail separately.

Short-term: the context window

The simplest memory is no architecture at all: keep the conversation in the prompt. Every turn, the full history goes back to the model. It's perfectly faithful — nothing is compressed or paraphrased — until it isn't: the window fills, costs grow with every turn, and models attend unevenly across very long contexts, so "in the window" is not the same as "will be used." Framework checkpointers (LangGraph's per-thread state, for example — see How to Test a LangGraph Agent) are persistence for this layer, not an escape from its limits.
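To make the cost growth concrete, here is a back-of-the-envelope sketch. It assumes every turn adds the same number of tokens and that the prompt at turn t contains all t turns so far; both are simplifications, and the function name is illustrative.

```python
def cumulative_prompt_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total tokens sent over a conversation when every turn resends the full history."""
    # Assumption: the prompt at turn t holds t equally sized turns of history.
    return sum(t * tokens_per_turn for t in range(1, turns + 1))


short_chat = cumulative_prompt_tokens(turns=10, tokens_per_turn=200)
long_chat = cumulative_prompt_tokens(turns=100, tokens_per_turn=200)
print(short_chat, long_chat)
```

Under these assumptions, ten times the turns costs roughly 92 times the tokens, because the total grows with the square of the conversation length.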

Summarization

When the window fills, compress: replace older turns with a generated summary, keep recent turns verbatim. This buys unbounded conversation length at the price of lossy compression — and the model doing the compressing decides, implicitly, what was worth keeping. The user's offhand "I'm allergic to penicillin, by the way" competes for survival with small talk, and sometimes loses.
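A minimal sketch of that compaction step follows. The `summarize` callable stands in for whatever model call does the compression; here it is a deliberately naive, hypothetical stub so the lossy behavior is easy to see. Names such as `compact_history` and `keep_recent` are illustrative, not taken from any framework.

```python
from typing import Callable, List


def compact_history(
    turns: List[str],
    summarize: Callable[[List[str]], str],
    keep_recent: int = 4,
) -> List[str]:
    """Replace all but the last `keep_recent` turns with a single summary entry."""
    if keep_recent <= 0:
        raise ValueError("keep_recent must be positive")
    if len(turns) <= keep_recent:
        return list(turns)  # nothing to compress yet
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = summarize(older)  # lossy step: whatever this drops is gone
    return [f"[summary] {summary}"] + recent


def naive_summarizer(older: List[str]) -> str:
    """Hypothetical stand-in for an LLM summarizer: keeps only the first older turn."""
    return older[0]


history = [
    "user: hi, I need to book a checkup",
    "user: I'm allergic to penicillin, by the way",
    "agent: noted, what day works?",
    "user: Thursday",
    "agent: morning or afternoon?",
    "user: morning",
]
compacted = compact_history(history, naive_summarizer, keep_recent=3)
allergy_survived = any("penicillin" in turn for turn in compacted)
print(compacted, allergy_survived)
```

With this stub the allergy does not survive compaction, which is the kind of loss a summarization test should be designed to catch.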

Long-term stores: episodic, semantic, procedural

For memory across sessions, agents write to an external store — typically a vector database, sometimes a knowledge graph or plain records — and retrieve from it when relevant. The useful taxonomy, borrowed from cognitive science:

Type	Stores	Example	Typical shape
Episodic	Events that happened	"On June 3 the user reported a failed payment"	Timestamped records of interactions
Semantic	Facts distilled from events	"User is on the annual plan; primary language is Spanish"	Extracted, deduplicated statements
Procedural	How to behave	"This user wants answers without preamble"	Learned rules, adapted instructions

Two design decisions dominate this layer: what gets written (everything? LLM-extracted "salient" facts?) and what gets retrieved (top-k similarity? recency-weighted? filtered by type?). Both are judgment calls executed by fallible components, which is exactly why they need tests.
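One way to picture the retrieval decision is the sketch below, which combines a type filter with recency weighting. Everything here is an assumption made for illustration: the `Memory` record shape, the word-overlap similarity (a stand-in for embedding similarity), and the 30-day half-life.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Memory:
    kind: str        # "episodic", "semantic", or "procedural"
    text: str
    age_days: float  # time since the memory was written


def overlap(a: str, b: str) -> float:
    """Toy similarity: Jaccard overlap of lowercase word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def retrieve(
    query: str,
    store: List[Memory],
    k: int = 2,
    kind: Optional[str] = None,
    half_life_days: float = 30.0,
) -> List[Memory]:
    """Return the top-k memories by similarity multiplied by a recency decay."""
    candidates = [m for m in store if kind is None or m.kind == kind]

    def score(m: Memory) -> float:
        recency = 0.5 ** (m.age_days / half_life_days)  # halves every half-life
        return overlap(query, m.text) * recency

    return sorted(candidates, key=score, reverse=True)[:k]


store = [
    Memory("semantic", "user lives in Boston", age_days=300),
    Memory("semantic", "user lives in Denver", age_days=2),
    Memory("episodic", "user reported a failed payment", age_days=10),
]
top = retrieve("where does the user live", store, k=1, kind="semantic")
print(top[0].text)
```

The two location memories score identically on similarity, so only the recency term separates the current fact from the obsolete one. Pure top-k similarity would have no basis for choosing between them.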

User profiles and preferences

A special case of semantic memory important enough to isolate: a structured record per user — name, plan, settings, stated preferences — usually injected into every conversation rather than retrieved on demand. It's the highest-leverage memory (used every turn) and the highest-blast-radius one when it's wrong or belongs to someone else.

How memory fails

Five failure modes account for most memory incidents. Every one of them is invisible in a single-session demo.

Stale memories. The user moved, upgraded, or changed their mind; the store still says otherwise. Without update-or-expire logic, semantic memory monotonically accumulates fossils — and retrieval has no inherent preference for the current fact over the obsolete one.
Wrong-user leakage. The most severe: user A's details surface in user B's session, via a missing tenant filter on the store, a shared cache, or an ID mix-up. It's a privacy incident, not a quality bug, and it will not announce itself — B may never know why the agent "guessed" their employer.
Contradictory memories. "User prefers email" and "user prefers SMS" both live in the store; whichever gets retrieved wins, and behavior flip-flops between sessions with no visible cause.
Catastrophic forgetting via summarization. The compression step drops or distorts a critical detail, and by design the original is gone. The failure surfaces many turns later, far from its cause.
Retrieval misses. The memory exists, correctly written — and never surfaces, because the user's phrasing at recall time ("my meds") doesn't land near the memory's phrasing at write time ("penicillin allergy") in embedding space. The agent effectively forgets while the database remembers.
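Three of these failures (stale, contradictory, and wrong-user memories) trace back to how facts are keyed and overwritten. The sketch below shows one possible shape, assuming facts can be reduced to a key per user; `ScopedFactStore` and its methods are hypothetical names, not a real library API.

```python
from typing import Dict, Optional, Tuple


class ScopedFactStore:
    """Facts keyed by (user_id, key); a newer write replaces the older value."""

    def __init__(self) -> None:
        self._facts: Dict[Tuple[str, str], Tuple[str, int]] = {}

    def write(self, user_id: str, key: str, value: str, ts: int) -> None:
        current = self._facts.get((user_id, key))
        # Keep one value per key, so contradictions cannot coexist,
        # and ignore writes older than what is already stored.
        if current is None or ts >= current[1]:
            self._facts[(user_id, key)] = (value, ts)

    def read(self, user_id: str, key: str) -> Optional[str]:
        # Every read requires a user_id, so a missing tenant filter is not possible here.
        entry = self._facts.get((user_id, key))
        return entry[0] if entry else None


facts = ScopedFactStore()
facts.write("user-a", "contact_channel", "email", ts=1)
facts.write("user-a", "contact_channel", "sms", ts=2)
facts.write("user-a", "contact_channel", "email", ts=1)  # late-arriving old write, ignored
facts.write("user-b", "city", "Denver", ts=1)
print(facts.read("user-a", "contact_channel"), facts.read("user-b", "contact_channel"))
```

Free-text memories in a vector store rarely reduce to clean keys like this, which is why the same guarantees need to be checked by tests rather than assumed from the schema.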

How to test memory

The unifying principle: memory can only be tested across time. A test that opens one session and closes it exercises the context window and nothing else. Everything below is multi-session by construction — the same discipline as multi-turn evaluation, extended across the session boundary.

The follow-up call

The canonical memory test. Session one: the user establishes facts in passing — a name, a constraint, a preference, an open issue. Session two, later: the user returns and assumes continuity, the way real people do. "Hey, any update?" "Did that fix work?" "Book the usual." Score whether the agent uses the seeded facts without being re-told, and whether what it recalls is accurate, not just confident. Vary the gap and the intervening traffic: recall after five minutes and one session is a different test than recall after a week and forty sessions.

Seed-and-probe

A sharper, more targeted variant: plant a specific fact in session N, probe for it in session N+k, and assert on the answer. Because you control the seed, you can grade recall exactly — and you can probe with different phrasing than you seeded ("my meds" vs. "penicillin"), which is the only way to catch retrieval misses. Run seed-and-probe across a matrix of fact types (episodic, semantic, preference) and delays, and you get a recall scorecard instead of an anecdote.
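A seed-and-probe harness can be quite small. The sketch below assumes an agent exposing `tell` and `ask` methods, which is a hypothetical interface; `KeywordAgent` is a toy that recalls only on word overlap, chosen so that a paraphrased probe visibly fails.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Probe:
    seed: str      # what the user says in session N
    question: str  # what the user asks in session N+k
    expected: str  # substring a correct answer must contain


class KeywordAgent:
    """Toy agent (hypothetical): recalls a stored message only on word overlap."""

    def __init__(self) -> None:
        self.memory: List[str] = []

    def tell(self, message: str) -> None:
        self.memory.append(message)

    def ask(self, question: str) -> str:
        words = set(question.lower().split())
        for stored in self.memory:
            if words & set(stored.lower().split()):
                return stored
        return "I don't have that on file."


def run_probes(
    make_agent: Callable[[], KeywordAgent], probes: List[Probe]
) -> Dict[str, bool]:
    """Seed each fact into a fresh agent, probe for it, and record pass or fail."""
    results: Dict[str, bool] = {}
    for probe in probes:
        agent = make_agent()                # fresh agent so seeds don't interfere
        agent.tell(probe.seed)              # session N
        answer = agent.ask(probe.question)  # session N+k
        results[probe.question] = probe.expected.lower() in answer.lower()
    return results


probes = [
    Probe("allergic to penicillin", "any penicillin concerns?", "penicillin"),
    Probe("allergic to penicillin", "what about my meds?", "penicillin"),
]
scorecard = run_probes(KeywordAgent, probes)
print(scorecard)
```

The same-phrasing probe passes and the paraphrased one fails, which is the retrieval miss that only a differently worded probe exposes. A real harness would also vary the delay and the traffic between seed and probe.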

Contradiction probes

Deliberately feed the agent a change: "Actually, I moved — I'm in Denver now." Then, in a later session, ask something that depends on it. Three outcomes are possible — the agent uses the new fact (pass), uses the old one (stale memory), or hedges between both (contradiction surfaced to the user). Contradiction probes are the test for your update logic, which otherwise runs untested until a real user changes something that matters.
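Grading a contradiction probe can be sketched as a small classifier over the agent's answer. This version uses plain substring matching, which is crude: an answer like "I've updated you from Boston to Denver" would be flagged as hedged, so a real grader would likely need something smarter. The `no_recall` label is an addition for the case where neither value appears; the function name and labels are illustrative.

```python
def classify_contradiction(answer: str, old: str, new: str) -> str:
    """Label an answer by which version of the fact it relies on."""
    text = answer.lower()
    has_old = old.lower() in text
    has_new = new.lower() in text
    if has_new and not has_old:
        return "pass"       # agent used the updated fact
    if has_old and not has_new:
        return "stale"      # agent used the obsolete fact
    if has_old and has_new:
        return "hedged"     # contradiction surfaced to the user
    return "no_recall"      # neither value appeared in the answer


print(classify_contradiction("Shipping to Denver takes 3 days.", "Boston", "Denver"))
print(classify_contradiction("Is that Boston or Denver?", "Boston", "Denver"))
```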

Isolation tests

Run many users concurrently, each with distinctive, traceable facts — then scan every user's transcripts for every other user's facts. Any cross-hit is a leakage bug of the highest severity. This test is nearly impossible to run manually (it needs volume and distinct identities to be meaningful) and trivially automatable with a population of simulated users, each of whom is a genuinely different person with different details. This is one place where testing against distinct synthetic citizens beats a handful of hand-written test accounts outright: leakage between near-identical test users is undetectable.
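The scanning half of an isolation test might look like the sketch below. It assumes each simulated user has been given one distinctive "canary" string and that the agent's replies have been collected per user; the function name and data shapes are illustrative.

```python
from typing import Dict, List, Tuple


def find_leaks(
    canaries: Dict[str, str],
    agent_replies: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Return (owner, exposed_to) pairs where one user's canary shows up in another's session."""
    leaks: List[Tuple[str, str]] = []
    for owner, canary in canaries.items():
        for user, replies in agent_replies.items():
            if user == owner:
                continue  # a user seeing their own fact is expected
            if any(canary.lower() in reply.lower() for reply in replies):
                leaks.append((owner, user))
    return leaks


canaries = {"alice": "Zephyr Logistics", "bob": "Quillfeather Bank"}
agent_replies = {
    "alice": ["Your employer is Zephyr Logistics, correct?"],
    "bob": ["I see you work at Zephyr Logistics."],  # alice's fact in bob's session
}
leaks = find_leaks(canaries, agent_replies)
print(leaks)
```

The canaries have to be distinctive for this to work: if two users share an employer, a hit is no longer evidence of leakage.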

Pin it and re-run it

Memory bugs regress like everything else — a retrieval tweak or a new summarization prompt can quietly break recall that used to work. Keep your seed-and-probe scenarios and follow-up calls as a pinned, re-runnable suite, and gate memory-touching changes on it. (The mechanics of pinning tests around non-deterministic agents are covered in Regression Testing Non-Deterministic Agents.)
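One possible shape for such a gate is sketched below. Because agents are non-deterministic, it assumes each pinned scenario is run several times and compared against a minimum pass rate; the threshold, the run count, and the scenario callables are all illustrative assumptions rather than a prescribed method.

```python
from typing import Callable, Dict


def failing_scenarios(
    scenarios: Dict[str, Callable[[], bool]],
    runs: int = 5,
    min_pass_rate: float = 0.8,
) -> Dict[str, float]:
    """Run each scenario `runs` times; return those whose pass rate is below the threshold."""
    failures: Dict[str, float] = {}
    for name, run_once in scenarios.items():
        passes = sum(1 for _ in range(runs) if run_once())
        rate = passes / runs
        if rate < min_pass_rate:
            failures[name] = rate
    return failures


# Stand-in scenarios: real ones would drive the agent across sessions.
flaky_outcomes = iter([True, True, False, True, True])
scenarios = {
    "follow_up_call": lambda: True,
    "paraphrased_probe": lambda: next(flaky_outcomes),
}
failures = failing_scenarios(scenarios, runs=5, min_pass_rate=0.9)
print(failures)
```

A change would be blocked whenever the returned dictionary is non-empty, and the recorded pass rates show how far each scenario fell short.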

Multi-session testing has an infrastructure prerequisite that's easy to miss: the test users need memory too. A simulated user who can't remember what they told the agent yesterday can't convincingly make the follow-up call. That's a core design choice in Synthetic Signals — conversations fold into each synthetic citizen's memory across sessions, so the second call, the callback, and the long game are first-class test scenarios rather than things you script by hand.

Honest limits

Memory testing tells you whether the machinery works — recall, updates, isolation, compression. It can't tell you whether your memory policy is right: what's worth remembering, how long consent lasts, when forgetting is a feature. Those are product and privacy decisions no test suite settles. And simulated users, however grounded, establish facts more cleanly than real people do; real users contradict themselves ambiguously, half-state preferences, and expect the agent to read between lines. Treat a passing memory suite as a floor — the machinery works — and real-user feedback as the judge of whether what you built on it feels like being remembered, or being filed.

FAQ

What is agent memory?

Agent memory is everything an AI agent retains beyond the current prompt: the conversation so far, summaries of past sessions, and long-term stores of facts, events, and user preferences. It's what lets an agent behave like an ongoing assistant instead of a stateless chatbot.

What is the difference between episodic, semantic, and procedural memory in AI agents?

Episodic memory stores events that happened ("the user reported a billing issue on Tuesday"), semantic memory stores facts distilled from them ("the user is on the pro plan"), and procedural memory stores learned behavior ("this user prefers short answers"). Most real systems mix the three.

What are the most common agent memory failures?

Stale memories that outlive the facts they describe, one user's data leaking into another user's session, contradictory memories retrieved together, important details destroyed by summarization, and retrieval misses where the memory exists but never surfaces.

How do you test an AI agent's memory?

With multi-session scenarios: seed facts in one conversation and probe for them in a later one, feed the agent contradictions and check which version wins, and run concurrent users to verify nothing crosses between them. Single-session tests structurally cannot exercise long-term memory.