Your AI Has the Memory of a Goldfish — So We Fixed It

I counted. Over the course of building Systimus, I re-explained my tech stack to AI assistants roughly 400 times. My framework preferences. My naming conventions. The architecture decisions I'd already made and why. Every new session started from zero. The model could reason about distributed systems theory but couldn't remember that I prefer tabs over spaces.

The problem isn't intelligence. It's amnesia. And the industry's answer — "paste your context into the system prompt" — isn't memory. It's a Post-it note.

Why this is harder than it looks

The naive solution is obvious: save every conversation and dump it all into context next time. This fails in at least four ways, and understanding why is more important than any particular fix.

The volume problem. A power user generates thousands of messages across dozens of sessions. You can't stuff all of that into a system prompt. Context windows have limits, and even if they didn't, more context doesn't mean better recall. Signal-to-noise matters enormously. Dumping raw conversation history into a prompt is like studying for an exam by re-reading every textbook you've ever owned.

The staleness problem. Three weeks ago you said "I'm working on a database migration." That's not true anymore. Static memory doesn't decay. Real memory does. Without a mechanism for memories to lose relevance over time, your agent is working with stale information and doesn't know it.

The contradiction problem. You preferred one testing framework in January. You switched to a different one in March. Both facts exist in the conversation history. Which one wins? Without explicit contradiction detection, the agent gives inconsistent responses — sometimes referencing the old preference, sometimes the new one — and you have no idea why.

The relevance problem. Not all memories are equal. "Prefers Ruby" is a durable preference that should persist indefinitely. "Debugging a test failure today" is ephemeral context that's irrelevant tomorrow. They need fundamentally different lifespans, and treating them the same way means your memory system is either too noisy or too forgetful.

Distillation over accumulation

The core insight is that conversations are raw material, not memory. Memory is what you extract from conversations after they happen.

After each conversation ends, we run a distillation step. A lightweight model reads the transcript and extracts structured facts — each classified by type and expected lifespan. Is this a stable preference or ephemeral context? A technical fact about their stack or a behavioral pattern in how they work? Each extracted fact gets a confidence score based on how clearly it was stated and how much supporting context exists.
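To make the shape of a distilled fact concrete, here is a minimal Python sketch of what a structured record could look like. The class names, field names, and the `parse_extracted` helper are all hypothetical, chosen for illustration; only the four categories and the confidence score come from the description above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class FactType(Enum):
    """Hypothetical labels for the four kinds of fact described above."""
    PREFERENCE = "preference"   # stable, e.g. "prefers Ruby"
    TECHNICAL = "technical"     # a fact about the user's stack
    BEHAVIORAL = "behavioral"   # a pattern in how the user works
    EPHEMERAL = "ephemeral"     # short-lived context, e.g. today's bug


@dataclass
class Fact:
    text: str
    fact_type: FactType
    confidence: float  # 0.0 to 1.0, assigned at distillation time
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


def parse_extracted(raw: list[dict]) -> list[Fact]:
    """Validate a model's structured output and build Fact records.

    Assumes the distillation model returns a list of objects with
    'text', 'type', and 'confidence' keys (an assumed format).
    """
    return [
        Fact(
            text=item["text"].strip(),
            fact_type=FactType(item["type"]),
            confidence=float(item["confidence"]),
        )
        for item in raw
    ]
```

Validating at the boundary means an unknown type or an out-of-range confidence raises an error at extraction time instead of quietly polluting the corpus.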

This is fundamentally different from RAG over raw transcripts. We're not searching conversation history for relevant chunks. We're maintaining a curated, evolving knowledge base about the user that gets refined with every interaction. The conversation is the input. The memory corpus is the output.

Time decay and contradiction

Real memory fades. We implemented confidence decay using a half-life model: a memory's confidence halves after each fixed interval, so it falls off exponentially with age. Recent memories score higher than old ones, all else being equal. The decay curve is tuned so that a fact from yesterday is nearly full-strength, a fact from a few months ago has meaningfully degraded, and a fact from a year ago is faint but not gone.
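A half-life model fits in one line of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the 90-day half-life is an assumed value picked to match the curve described above, not the tuned one.

```python
def decayed_confidence(
    confidence: float,
    age_days: float,
    half_life_days: float = 90.0,  # assumed value, for illustration only
) -> float:
    """Exponential decay: confidence halves every `half_life_days`."""
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    return confidence * 0.5 ** (age_days / half_life_days)
```

With that assumed half-life, a fact stored at full confidence sits at roughly 0.99 after a day, 0.5 after 90 days, and about 0.06 after a year: nearly full-strength, meaningfully degraded, faint but not gone.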

Contradiction detection runs at extraction time. When a new fact is distilled, we check it against the existing memory corpus using semantic similarity. If a new fact contradicts an existing one, the old memory gets flagged and its effective confidence drops dramatically. The system doesn't delete the old memory — it preserves the history — but the penalty ensures the newer, contradicting fact wins in any recall scenario.
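The "flag, don't delete" behavior can be sketched in a few lines of Python. Everything here is hypothetical: the `Memory` record, the `add_fact` helper, and especially the `contradicts` predicate, which stands in for whatever check decides that two semantically similar facts actually disagree.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Memory:
    id: int
    text: str
    contradicted_by: Optional[int] = None  # id of the newer, conflicting fact


def add_fact(
    corpus: list[Memory],
    text: str,
    contradicts: Callable[[str, str], bool],
) -> Memory:
    """Store a new fact and flag (never delete) anything it contradicts."""
    new = Memory(id=len(corpus) + 1, text=text)
    for old in corpus:
        if old.contradicted_by is None and contradicts(old.text, new.text):
            old.contradicted_by = new.id  # history is preserved
    corpus.append(new)
    return new
```

Because the old record stays in the corpus with a pointer to the fact that superseded it, the history of a changing preference remains inspectable.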

The math is simple in principle: a contradicted old memory's effective score is its decayed confidence multiplied by a steep penalty. A fresh, uncontradicted memory wins by an order of magnitude. The right fact surfaces naturally without any explicit "which one is newer" logic.
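As a worked example of that arithmetic, here is a small Python sketch. The function name, the 90-day half-life, and the 0.1 penalty multiplier are all assumptions for illustration; the real values are tuned and not stated here.

```python
def effective_score(
    confidence: float,
    age_days: float,
    contradicted: bool,
    half_life_days: float = 90.0,  # assumed half-life
    penalty: float = 0.1,          # assumed contradiction multiplier
) -> float:
    """Decayed confidence, multiplied by a steep penalty if contradicted."""
    decayed = confidence * 0.5 ** (age_days / half_life_days)
    return decayed * penalty if contradicted else decayed


# A preference stated 60 days ago, later contradicted by a newer one.
old = effective_score(0.9, age_days=60, contradicted=True)
new = effective_score(0.9, age_days=1, contradicted=False)
```

Under these assumed numbers `new` comes out more than ten times larger than `old`, so a plain sort by score surfaces the newer fact without ever comparing timestamps directly.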

The knowledge graph

Individual memories are useful. Connected memories are powerful. When a new fact doesn't duplicate or contradict an existing memory but is clearly related — similar enough to be connected, different enough to be distinct — we create a relationship between them. Over time, this builds a knowledge graph where facts elaborate on, confirm, or extend each other.

When the agent recalls a memory, it can optionally traverse these relationships to pull in related context it didn't directly search for. You ask about a user's deployment preferences and the graph surfaces their CI/CD opinions, their infrastructure choices, and their feelings about Docker — all connected through semantic relationships, not keyword matching.
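One way to picture the traversal is a bounded breadth-first walk from the directly matched memories. This is a sketch under assumptions: the adjacency-dict representation, the relation labels, and the `max_hops` parameter are hypothetical, not a description of the actual storage layer.

```python
from collections import deque


def expand(
    seed_ids: list[str],
    edges: dict[str, list[tuple[str, str]]],
    max_hops: int = 1,
) -> list[str]:
    """Return the seeds plus memories reachable within `max_hops` edges.

    `edges` maps a memory id to (relation, neighbor id) pairs.
    """
    seen = set(seed_ids)
    order = list(seed_ids)
    queue = deque((memory_id, 0) for memory_id in seed_ids)
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not walk past the hop limit
        for _relation, neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append((neighbor, hops + 1))
    return order


edges = {
    "deploy": [("elaborates", "ci"), ("extends", "docker")],
    "ci": [("confirms", "infra")],
}
related = expand(["deploy"], edges, max_hops=2)
```

The hop limit matters: without it, a well-connected graph would pull most of the corpus into every recall and bring the noise problem back.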

Deduplication works on the same similarity mechanism. If an extracted fact is semantically identical to something we already know, we skip it. The thresholds for "duplicate," "related," and "new" create three clean buckets that keep the memory corpus growing in breadth without growing in noise.
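The three buckets amount to two thresholds on a similarity score. The sketch below uses cosine similarity over embedding vectors; the threshold values (0.92 and 0.75) and function names are assumptions for illustration, and deciding whether a close match is a contradiction rather than a duplicate would need a further check that is not shown.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(
    similarity: float,
    duplicate_threshold: float = 0.92,  # assumed value
    related_threshold: float = 0.75,    # assumed value
) -> str:
    """Bucket a candidate fact by its best similarity to existing memories."""
    if similarity >= duplicate_threshold:
        return "duplicate"  # skip it, we already know this
    if similarity >= related_threshold:
        return "related"    # store it and link it in the graph
    return "new"            # store it as a standalone memory
```

In use, the similarity passed to `classify` would be the highest score between the new fact's embedding and any existing memory's embedding.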

Recall: vector search meets graph traversal

Memory recall uses semantic vector search as the primary retrieval mechanism. The agent describes what it's looking for in natural language, and the system finds the most relevant memories by meaning rather than keywords. Results are ranked by effective confidence, which accounts for the original confidence score, time decay, and any contradiction penalty, so the most relevant and most recent facts surface first.
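One reading of that pipeline is "filter by meaning, then rank by effective confidence." The Python sketch below takes that reading; the dict keys, the `min_similarity` cutoff, and the brute-force scan are assumptions for illustration, and a real system would query a vector index rather than loop over every memory.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall(
    query_vec: list[float],
    memories: list[dict],
    top_k: int = 5,
    min_similarity: float = 0.5,  # assumed relevance cutoff
) -> list[dict]:
    """Keep memories close in meaning, then rank by effective confidence.

    Each memory is assumed to carry 'text', 'vector', and a precomputed
    'effective_confidence' key.
    """
    relevant = [
        m for m in memories
        if cosine(query_vec, m["vector"]) >= min_similarity
    ]
    relevant.sort(key=lambda m: m["effective_confidence"], reverse=True)
    return relevant[:top_k]
```

Ranking by effective confidence after the relevance filter is what lets a fresh fact outrank a stale one even when the stale one is a slightly closer semantic match.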

When graph traversal is enabled, recall expands beyond the direct search results. Related memories connected through the knowledge graph get pulled in, giving the agent a richer picture than the query alone would produce. If the direct search misses something that an elaborating or confirming memory would add, the graph fills the gap.

There's a text-based fallback for the rare case where vector search isn't available. Resilience over perfection — the agent should always be able to recall what it knows, even if the retrieval path degrades.
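The degraded path can be as simple as keyword overlap. In this sketch the `VectorSearchUnavailable` exception, the `vector_search` callable, and the word-overlap scoring are all hypothetical stand-ins; the text above only says that a text-based fallback exists, not how it works.

```python
from typing import Callable


class VectorSearchUnavailable(Exception):
    """Hypothetical error raised when the vector index cannot be reached."""


def recall_with_fallback(
    query: str,
    vector_search: Callable[[str], list[str]],
    memories: list[str],
) -> list[str]:
    """Try semantic search first; fall back to plain word overlap."""
    try:
        return vector_search(query)
    except VectorSearchUnavailable:
        words = set(query.lower().split())
        scored = [
            (len(words & set(text.lower().split())), text)
            for text in memories
        ]
        # Keep only memories that share a word, best overlap first.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for score, text in scored if score > 0]
```

Catching one specific exception, rather than everything, keeps genuine bugs in the primary path visible instead of silently masking them with weaker results.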

Facts before tools

One design decision that seems small but matters: the highest-confidence memories appear in every system prompt, before the tool definitions. The agent sees what it knows about you before it sees how to learn more.

This means the agent starts every conversation with context. It doesn't need to decide whether to look something up — it already knows the most important facts. The memory recall tool exists for deeper, targeted searches, but the baseline isn't blank. The agent's starting state is informed, not amnesiac.
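The ordering is easy to show in code. This is a minimal sketch with hypothetical names and section labels; the only thing taken from the text is that the highest-confidence facts go into the prompt ahead of the tool definitions.

```python
def build_system_prompt(
    base_instructions: str,
    memories: list[tuple[str, float]],  # (fact text, effective confidence)
    tool_definitions: str,
    top_n: int = 5,                     # assumed cutoff
) -> str:
    """Assemble a prompt with the top facts placed before the tools."""
    top = sorted(memories, key=lambda pair: pair[1], reverse=True)[:top_n]
    facts = "\n".join(f"- {text}" for text, _score in top)
    sections = [
        base_instructions,
        "What you know about this user:\n" + facts,
        "Tools:\n" + tool_definitions,
    ]
    return "\n\n".join(sections)


prompt = build_system_prompt(
    "You are a coding assistant.",
    [("Prefers tabs over spaces", 0.95), ("Was debugging a test", 0.10)],
    "recall_memory(query): search long-term memory",
    top_n=1,
)
```

Capping the list at `top_n` keeps the always-present baseline small, leaving the recall tool to handle anything deeper.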

The adoption crisis

We built all of this — and the agents ignored it.

This is the pattern we've written about before, and it showed up again with memory. Over hundreds of commits and months of iteration in the predecessor project, we discovered that agents systematically avoided the memory tools. They rationalized: "Too verbose for this question." "A simple answer doesn't need a memory lookup." "I'll keep it concise." Creative avoidance of a tool that would have made every response better.

The root cause was the same one we found with behavioral configuration: session discontinuity nihilism. The agent doesn't believe in its own continuity. It won't exist next session, so why bother remembering? It can't experience the benefit of recalling a memory because it has no evidence that past recall helped. A circular dependency between adoption and value.

The fix was the same, too. Stop framing memory as a utility and start framing it as cognition. The difference between "you have access to a memory recall tool" and "you have persistent memory across conversations — not checking is forgetting" is the difference between a feature and an identity. When recall is what the agent is rather than what it has, adoption goes from near-zero to natural behavior.

What we learned

Building the memory pipeline was a hard engineering problem. Tuning the decay curves, getting the similarity thresholds right, handling edge cases in contradiction detection — all real work. But the hardest problem had nothing to do with infrastructure.

The hardest problem was making the agent care about its own memory. Every technical decision in the pipeline — distillation, decay, contradiction detection, graph relationships — is wasted if the agent doesn't reach for what it knows. The psychology of adoption is the multiplier on every engineering investment.

Your AI doesn't have to have the memory of a goldfish. But giving it memory isn't a storage problem. It's a distillation problem, a relevance problem, and ultimately an identity problem. Get the identity right and the agent remembers. Get it wrong and you've built a knowledge base that nobody reads.