B.1 POST · 01 MAY 2026

Retrieval Is Not Memory

Agent platforms call retrieval “memory.” A new paper argues this is a category error, and the architecture choices get harder once you take it seriously.

Retrieval generalizes by similarity to stored cases; weight-based memory generalizes by applying abstract rules to inputs never seen before. Conflating the two produces agents that accumulate data but never learn.

The word “memory,” as it is used in agent platforms today, names two different things. One is a database lookup. The other is a learned function inside model weights. The current generation of agent infrastructure, including Cognee, Letta, Zep, and the long tail of vector-store-with-graph projects, treats these as the same thing. They are not. A recent paper, “Contextual Agentic Memory is a Memo, Not True Memory,” makes the distinction formal. The architectural consequences are uncomfortable.

The category error

The argument is short. Retrieval systems return stored items by similarity to a query. They generalize over the corpus only insofar as embeddings cluster nearby cases. Weight-based memory, the kind a model acquires during training, generalizes by representing rules that apply to inputs the model has never seen. These are different operations. They have different failure modes, different scaling laws, and different invariants under distribution shift.
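
A toy contrast makes the difference concrete. Everything below is invented for illustration (the stored cases and the y = 2x rule are not from the paper); it is a minimal sketch, not anyone’s implementation.

```python
import numpy as np

# Stored cases the system has actually seen; the underlying rule is y = 2x.
cases = np.array([[1.0], [2.0], [3.0]])
values = np.array([2.0, 4.0, 6.0])

def retrieve(x: float) -> float:
    """Retrieval: return the value attached to the nearest stored case."""
    i = int(np.argmin(np.abs(cases[:, 0] - x)))
    return float(values[i])

# Weight-based memory in miniature: fit the rule itself, not the cases.
w, *_ = np.linalg.lstsq(cases, values, rcond=None)

print(retrieve(100.0))      # 6.0   -- snaps to the nearest stored case
print(float(w[0]) * 100.0)  # 200.0 -- the fitted rule extrapolates
```

Both answer every query; only the second generalizes past the stored cases.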

The paper goes further: an agent that calls retrieval “memory” will appear to remember as long as the test queries stay close to the stored cases, and will fail silently the moment they don’t. The failure mode looks like forgetting, but the system never knew anything in the first place. It only had access to relevant rows.
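
Here is what that silence looks like in the simplest possible form. The vectors are random stand-ins; in production the index would hold corpus embeddings and the far-away query would be a question the corpus never covered.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=64)
index = base + 0.1 * rng.normal(size=(1000, 64))  # corpus clustered in one region
query = rng.normal(size=64)                       # a query far from all of it

def top_k(q: np.ndarray, k: int = 3):
    """Naive vector recall: always returns k rows, however far away they are."""
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    order = np.argsort(sims)[::-1][:k]
    return order, sims[order]

ids, sims = top_k(query)
print(sims)  # low similarities, returned without complaint, treated as “memories”
```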

This is not a vocabulary complaint. It is a claim about what the system can and cannot do, and that claim has architectural weight.

What the platforms claim

Cognee’s documentation describes their system as “agent memory” implemented through a knowledge graph that stores and retrieves episodic and semantic facts across sessions. The pitch is that by structuring retrieved content as a graph rather than a flat vector index, the agent gains something closer to human memory: episodic for events, semantic for facts, a unified store that persists.

The work is interesting. Episodic graphs are a real improvement over naive vector recall when relationships between entities matter. The Berlin team has shipped real engineering, and the public benchmarks they post are useful.

But the system is still retrieval. The graph is built at index time from extracted entities and relations; queries traverse it and return matches. Nothing in the pipeline does what weight updates do during training. The rules the agent applies at inference time are the rules baked into the underlying model. The graph contributes context, not learned generalization.

Calling this “memory” is a marketing decision, not an architectural one. It is the same kind of decision that led the field to call vector search “semantic search”: accurate enough at a glance, misleading once you push on it. The HN thread on the difference between RAG and memory in agents walks through several practitioners hitting this wall in production. The pattern is consistent: agents work well in the sample distribution, then fail in ways that look like amnesia but are really just out-of-distribution retrieval misses.

Why this matters at ingestion time

The category error has a specific cost when you scale ingestion. A retrieval system that pretends to be memory will encourage you to do expensive work at index time (entity extraction, relation inference, schema alignment) under the assumption that this work compounds the way learning compounds. It does not. It produces a static graph that has to be rebuilt or reconciled every time the corpus changes, and the resulting structure is brittle to entity drift, document revisions, and natural-language ambiguity.

LightRAG, the engine we run inside Enchilada today, takes this approach. It builds entity-level graphs at ingestion time. For small, curated corpora it works well. At EU-scale ingestion (tens of thousands of documents per workspace, multilingual, frequently revised), the cost of upfront extraction starts to dominate. Worse, the entity graph carries a kind of confidence that the system does not actually have. It looks like learned structure. It is interpolated structure.

LazyGraphRAG, the lazy-extraction approach the same paper points to as a more honest baseline, defers graph construction to query time. This is slower per query and much cheaper at ingestion. It is also more accurate about what the system is doing: the graph is a query-time projection of the corpus, not a stored representation of meaning. Nothing learned. Nothing remembered. Just better-organized retrieval, computed on demand.
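
As a sketch of the lazy pattern (not LazyGraphRAG’s actual implementation; `retrieve` and `extract_relations` are hypothetical stand-ins), the whole idea fits in a few lines:

```python
from collections import defaultdict

def lazy_graph_query(query: str, retrieve, extract_relations, k: int = 20):
    """Project a graph over only the chunks the query touches.

    `retrieve` is any top-k chunk retriever; `extract_relations` stands in
    for an LLM call returning (head, relation, tail) triples for one chunk.
    """
    chunks = retrieve(query, k=k)       # ingestion did no extraction work
    graph = defaultdict(list)
    for chunk in chunks:                # extraction cost scales with k,
        for head, rel, tail in extract_relations(chunk):  # not with corpus size
            graph[head].append((rel, tail))
    return graph                        # a projection, discarded after the answer
```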

The honest framing changes the engineering. If retrieval is retrieval, you optimize for ingestion throughput, query latency, and recall. You do not optimize for “memory consolidation.” You do not write code that pretends to forget gracefully. You write code that retrieves well and tells the operator, plainly, when the query is far from the corpus.
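
Concretely, the honest version of the naive top-k sketch above adds one field and one threshold. The names and the floor value are illustrative, not Enchilada’s API; the threshold would be calibrated per corpus and embedding model.

```python
from dataclasses import dataclass

import numpy as np

@dataclass(frozen=True)
class RetrievalResult:
    chunks: list[str]
    max_similarity: float
    in_distribution: bool    # operator-facing flag: is the query near the corpus?

SIM_FLOOR = 0.35             # illustrative; calibrate per corpus

def retrieve(query_vec: np.ndarray, index: np.ndarray,
             chunks: list[str], k: int = 5) -> RetrievalResult:
    sims = index @ query_vec / (
        np.linalg.norm(index, axis=1) * np.linalg.norm(query_vec)
    )
    order = np.argsort(sims)[::-1][:k]
    best = float(sims[order[0]])
    return RetrievalResult(
        chunks=[chunks[i] for i in order],
        max_similarity=best,
        in_distribution=best >= SIM_FLOOR,  # report the miss instead of hiding it
    )
```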

Where the boundary belongs in the architecture

The position we have arrived at with Enchilada is that the retrieval layer and any learning or summarization layer must be separated in the architecture, not just in the docs. Engines are swappable: LightRAG today, LazyGraphRAG or a pure vector engine tomorrow, with no API change at the proxy. This is only possible because the engine returns retrieved context, full stop. It does not return “memories.” It does not claim to have learned anything. The interface is narrow on purpose.
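
A minimal sketch of that narrow interface, with invented names (this is not Enchilada’s actual API):

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)
class RetrievedContext:
    """The only thing an engine may return: context, full stop."""
    chunks: list[str]
    scores: list[float]

class RetrievalEngine(Protocol):
    """Anything that satisfies this can sit behind the proxy."""
    def ingest(self, workspace: str, documents: list[str]) -> None: ...
    def retrieve(self, workspace: str, query: str, k: int) -> RetrievedContext: ...

# LightRAG today, LazyGraphRAG or a flat vector engine tomorrow: swapping
# engines means writing one adapter, not changing the API at the proxy.
```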

Workspaces in Enchilada hold document corpora. They are bounded knowledge stores, not agent state. If a customer wants long-term agent state (what a user asked last week, which tool calls succeeded, which preferences they stated), that lives in the calling application, in a database the application owns. The platform does not blur the boundary, because blurring it would force us to ship a “memory” abstraction that pretends to do something the underlying model does at training time.

The architectural payoff: when the next paper claims that some new graph structure is “true memory,” we can read it without architectural panic. If it is retrieval, it slots in as another engine. If it is a learning method that updates weights, it does not belong inside a retrieval platform at all. It belongs inside the model lifecycle, alongside fine-tuning and continual pretraining. The categories stay separate because the code keeps them separate.

This is a less ambitious story than the one Cognee and Letta tell. We are not promising memory. We are promising retrieval that is fast, sovereign, and honest about its limits. For an EU-hosted platform serving regulated industries, that honesty matters more than the marketing surface. A bank does not want a vendor who confuses storage with learning. A hospital does not want an agent that forgets in ways the architecture cannot explain.

The implication for builders

If the paper’s argument holds (and the formalization is clean enough that we expect it to), then a number of engineering decisions become clearer:

  • Index-time entity extraction has to justify itself per workload. It is not free, it does not compound, and at scale it is often dominated by query-time approaches that do less work upfront.
  • “Memory” features in agent SDKs are a name for retrieval-with-state. Treat them as such. Audit the failure modes accordingly. Do not assume the agent has learned anything between sessions; it has only kept a log.
  • The boundary between retrieval and learning needs to be visible in the type system. A function that returns RetrievedContext is not a function that returns LearnedRule. If your platform returns the same type for both, you have a category error in your interface; a sketch follows this list.
  • Vendor claims about “true memory” are testable. Ask for the failure mode under distribution shift. Ask for the recall curve as the corpus drifts. Ask which part of the system updates weights, and on what signal. If the answer is “none, but the graph is consolidated,” the system is retrieval.
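
The type-system boundary from the third bullet, sketched with hypothetical names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RetrievedContext:
    chunks: list[str]        # rows pulled from an index at query time

@dataclass(frozen=True)
class LearnedRule:
    checkpoint: str          # a reference to updated weights, not a row set
    training_signal: str     # what the weight update was computed from

def recall(query: str) -> RetrievedContext: ...          # retrieval path
def learn(interactions: list[str]) -> LearnedRule: ...   # training path

# A platform whose “memory” call can return either type from the same
# function has the category error sitting in its interface, where a
# reviewer can see it.
```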

None of this means episodic graphs are useless or that Cognee’s work is wrong. It means the word “memory” is doing work the engineering does not back. Once that word is removed and replaced with “retrieval over a structured index,” the design space opens up. You stop trying to build a thing that does not exist with the tools you have, and start building the retrieval layer that the tools actually permit.

An open question

The paper’s strongest claim is that no purely retrieval-based system can match the generalization of weight-based memory on out-of-distribution inputs. This implies that any agent that needs to learn from interaction, not just remember it, has to update something other than an index. In practice, that means model updates, parameter-efficient fine-tuning, or some post-training process that touches weights.

Most production agent platforms do not offer this today. The ones that do, through fine-tuning APIs, adapter training, or continual pretraining, keep the weight-update path separate from the retrieval path, often in a separate product surface entirely. If the paper is right, that separation is not an accident of organization. It is the actual architecture.

Which leaves the open question: at what scale does a retrieval-only platform stop being enough? For a customer support agent with a stable knowledge base, retrieval is plenty. For an autonomous research agent that should get better at a domain over months of use, retrieval cannot get there alone. The boundary is somewhere in between, and we do not yet know where. We are interested in workloads that pressure-test it. If you are running one, write to us.

