Graph Structure Is Not an Audit Trail
A knowledge graph makes GraphRAG answers better, not auditable. XGRAG shows what auditability requires, and why it has to live in the architecture outside the model.
A research team published XGRAG because their enterprise clients could not explain a GraphRAG answer to a compliance auditor. The graph existed. The retrieval happened. The answer came out. But which nodes drove it, through which paths, weighted how: that was inside the model, invisible. This is the gap the market has been papering over: a knowledge graph in your retrieval pipeline makes answers better. It does not make them auditable.
Why graph structure is not enough
The conflation is understandable. A knowledge graph is visible. You can query it, inspect it, draw it. Compared to a flat vector index, a graph feels transparent: the entities and relationships are there, named, in a structure you can traverse. When Microsoft GraphRAG arrived, it came with exactly this framing: grounding retrieval in a graph makes the system more explainable, because the graph gives a window into what the system knows.
The problem is that the graph is the input, not the reasoning. Once the retrieved subgraph enters the context window, the LLM decides what to do with it. That decision is not logged. It is not inspectable. It is a forward pass through transformer weights, and the output cannot be traced back to individual graph components without significant additional work, work that does not happen during inference and cannot be recovered after the fact.
While the graph structure makes retrieval more transparent, the LLM’s subsequent reasoning process remains a “black-box.” The central question “How the model synthesizes information from various graph components to arrive at its final answer?” is left unanswered by current tools.
XGRAG’s contribution is to answer that question using counterfactual perturbation. Remove a node from the subgraph. Re-run the query. Measure how much the output changes semantically. Do this across all components in the retrieved subgraph and you get a per-component attribution score. It is expensive and post-hoc, but it is causally grounded in a way that inspecting graph structure alone is not.
What the market claims, and where it breaks
Microsoft GraphRAG grounds its explainability claim in graph structure. Look at the actual pipeline: documents pass through LLM-based entity extraction, then Leiden community detection, then LLM-based community summarization, then a final LLM answer pass. Four LLM inference steps before a single answer reaches the user. The XGRAG authors measured what this means for any downstream attribution layer: approximately 610,000 tokens per query. They ruled Microsoft GraphRAG out of their evaluation as computationally infeasible. The architecture makes itself unauditable not through any single bad decision, but through the accumulation of opaque steps that no post-hoc attribution method can practically reconstruct at production scale.
Cognee makes a cleaner version of the same claim. Their positioning is “explainable AI knowledge,” a graph-based memory system designed to make agent reasoning transparent. The engineering is real: Cognee builds entity graphs at ingestion time and traverses them at query time. But the retrieval mechanism is not an exposed, queryable audit log. You can inspect the graph. You cannot produce a per-query trace that shows which specific graph components caused which specific tokens in the output. The structural gap is identical to GraphRAG: the graph is visible before the model; the reasoning is invisible inside it.
Zep and Mem0 sit in the agent memory category rather than document GraphRAG, but the argument applies there too. Both store structured facts about user interactions and preferences. Both carry an implicit claim that structured storage equals transparency. It does not. Knowing what is stored is not the same as knowing which stored item drove which answer. Storage transparency and inference transparency are different properties. A system can have complete visibility into its memory store and zero visibility into how that store influenced a specific output.
The paper addresses Chain-of-Thought and ReAct directly. These approaches “expose intermediate steps, but are often heuristic and lack causal grounding.” This is the right characterisation. An agent that emits a “Thought: I will query the knowledge graph for X” step in its trace is producing narrative explainability: a description of what it says it is doing. That description may be accurate. It may also be a hallucinated account of a step that never ran as described. An agent that actually ran a logged query against a graph database produces a different kind of record: there is a stored artifact of what ran, what parameters it used, and what came back. One can be fabricated by the model at inference time; the other cannot.
What auditability actually requires
XGRAG formalises the requirement: for each retrieved graph component, you need to know its counterfactual contribution to the output. This is stronger than knowing what the subgraph contained. It is a measure of what would have been different in the output if that component had not been retrieved.
The perturbation method is sound. Remove a node or relation from the retrieved subgraph, re-run the generation, measure the semantic distance between the original and perturbed outputs. High distance means high causal influence. Low distance means the component was present in context but did not affect the answer materially. Run this over all components and you have a ranked attribution map: these nodes drove this answer; these did not.
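A minimal sketch of that loop, assuming hypothetical generate() and embed() callables standing in for the generation model and an embedding model. This is not XGRAG’s implementation, only the shape of the method it describes.

```python
import numpy as np

def semantic_distance(a: str, b: str, embed) -> float:
    """Cosine distance between two generations (1 - cosine similarity)."""
    va, vb = embed(a), embed(b)
    return 1.0 - float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

def attribute(query: str, subgraph: list, generate, embed) -> list[tuple]:
    """Counterfactual attribution: re-generate with each component removed,
    and score each component by how much the output shifts without it."""
    baseline = generate(query, subgraph)
    scores = []
    for component in subgraph:
        perturbed = [c for c in subgraph if c is not component]  # drop one node or relation
        answer = generate(query, perturbed)                      # one extra pass per component
        scores.append((component, semantic_distance(baseline, answer, embed)))
    # High distance = high causal influence on the original answer.
    return sorted(scores, key=lambda s: s[1], reverse=True)
```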
This is expensive. A full XGRAG attribution run over a multi-hop query requires a separate generation pass for every perturbed component, on top of the original. But it surfaces something important about what auditability requires. What XGRAG approximates via perturbation is something that explicit query logging provides directly, at no additional inference cost.
If retrieval is an explicit, parameterised operation, a Cypher query against a Neo4j workspace for instance, then the audit trail is a natural side effect of the architecture. The query text, the bound parameters, the returned nodes and edges, the workspace identifier, the timestamp: all of these exist as first-class logged operations before the result enters the context window. There is no perturbation required, because the causal record already exists. It was written at retrieval time, not reconstructed after the fact.
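A sketch of that pattern using the official neo4j Python driver. The log record fields mirror the ones listed above; the audit_log list and the example query and entity name are stand-ins, not Enchilada’s implementation or schema.

```python
from datetime import datetime, timezone
from neo4j import GraphDatabase

audit_log = []  # stand-in for an append-only audit store

def logged_retrieval(driver, workspace: str, cypher: str, params: dict) -> list[dict]:
    """Run a parameterised Cypher query and write the audit record
    before the results ever reach an LLM context window."""
    with driver.session(database=workspace) as session:
        records = [r.data() for r in session.run(cypher, params)]
    audit_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "workspace": workspace,
        "query": cypher,
        "parameters": params,
        "returned": records,  # the exact subgraph the model will see
    })
    return records

# Usage: the audit record exists as a side effect of retrieval itself.
driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))
context = logged_retrieval(
    driver,
    workspace="fin-de-prod-03",
    cypher="MATCH (a:Entity {name: $name})-[r]->(b) RETURN a, type(r) AS rel, b",
    params={"name": "Acme Holdings"},
)
```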
The operation as the unit of accountability
This is the architectural position: move the auditability boundary from the model to the operation.
Enchilada’s retrieval layer works this way. Every retrieval against a Neo4j workspace runs as an explicit Cypher query. The query is logged: full text, bound parameters, returned subgraph, workspace ID, timestamp. Before the LLM sees any graph context, there is already a complete record of which nodes and edges were retrieved, from which isolated tenant workspace, under which query conditions. The LLM reasoning that follows is still a forward pass. It is, in that narrow sense, still a black box. But the retrieval that precedes it is not.
For regulated industries in the EU (finance, healthcare, legal), this distinction is material. An auditor asking how a system arrived at an answer does not need the transformer attention weights. They need a defensible record of what information the system had access to, when it had it, from which authorised source, and under which query conditions. A logged Cypher query provides that. An inspectable knowledge graph does not.
The distinction matters specifically for EU AI Act compliance. You cannot satisfy an audit requirement by pointing at a knowledge graph and calling it transparent. You need to show that the specific retrieval that informed this specific output was logged, is reproducible, and draws from a workspace with documented access controls and data lineage. “The graph structure is visible” is not an audit trail. “This Cypher query ran at 14:32:07 UTC, against workspace fin-de-prod-03, and returned nodes N1, N2, N7 with the following edge weights” is one.
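One way to read the reproducibility requirement concretely: given a stored record like the one above, the same query with the same bound parameters can be re-run against the same workspace and the result compared with what was logged. A sketch, reusing the hypothetical record format from the earlier example.

```python
def verify_audit_record(driver, record: dict) -> bool:
    """Re-run a logged retrieval and check that the returned subgraph
    still matches what the original answer was generated from."""
    with driver.session(database=record["workspace"]) as session:
        current = [r.data() for r in session.run(record["query"], record["parameters"])]
    return current == record["returned"]
```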
XGRAG is built on LightRAG, the same retrieval engine that runs inside Enchilada. This matters for what comes next. LightRAG’s graph architecture is well-suited to the kind of entity-level attribution XGRAG performs. The per-component attribution method XGRAG introduces is a natural extension of the query-level logging the infrastructure already produces. Component-level attribution is not a current Enchilada capability. It is a direction the architecture does not obstruct, because the operation log is already there.
The engineering implication
The market has been using “graph” as a proxy for “explainable.” XGRAG shows this does not hold. Graph structure improves retrieval quality. It does not produce an audit trail. For any system operating in a domain where auditability matters (and in the EU, that covers most regulated workloads), the question is not whether the retrieval pipeline uses a graph. The question is whether every retrieval operation is logged as a first-class event with enough fidelity to reconstruct what happened.
That is an infrastructure question, not a model question. The graph is a data structure. The audit trail is an operational requirement. Building one does not give you the other. The way to get both is to put the auditability in the operation layer, before the context window, not after.