
BenBrain: persistent memory for every Claude session

Published May 10, 2026
#AIInfrastructure #ClaudeAI #VectorDatabase #KnowledgeGraph #BuildInPublic

Every Claude session I run — phone, laptop, anywhere — wakes up already knowing what shipped yesterday and what's still open.

That wasn't true a month ago.

A month ago, every new chat started cold. I'd open a tab, ask a question, and Claude would meet me as a stranger. I'd burn the first ten minutes re-explaining the project, the stack, last week's decisions, the bug we'd tracked down at 2am. By the time the model had enough context to be useful, my mental energy for the actual work was already half-spent.

This is the cold-start tax. It's the friction that makes "AI as a thinking partner" feel like a promise that doesn't quite cash out. Each session starts from a blank slate: you have to set the stage again.

I'd been ignoring it because I assumed it was just the shape of the tool. LLMs don't have memory across sessions; that's the deal. So you adapt — pin a long preamble in your editor, keep a "project state" doc up to date, paste it in at the start of every chat. And then forget which version of that doc has the actual current state and waste another ten minutes reconciling.

Eventually it became cheaper to fix the tool than to keep paying the tax.

What I built

BenBrain is a persistent memory layer that sits behind every Claude session. It's two stores, both running on my bare-metal EU server, both behind my Tailscale mesh, neither talking to anyone else.

Qdrant — the vector store. Embeddings of everything that's ever been written down. Decisions, attempts, errors, learnings. The job is fuzzy recall: "what did we decide about that OOM crash three weeks ago?" works even when the words I use today don't match the words I used then. Cosine similarity over the embedding space is forgiving in a way that text search isn't.
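
A minimal sketch of that recall path, assuming the qdrant-client Python package, OpenAI embeddings, and a "benbrain" collection on a Tailscale hostname (the names are illustrative, not the real setup):

from openai import OpenAI
from qdrant_client import QdrantClient

oai = OpenAI()
qdrant = QdrantClient(url="http://benbrain:6333")  # reachable only inside the mesh

def recall(query: str, k: int = 5):
    # Embed today's phrasing of the question...
    vec = oai.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    # ...and let cosine similarity bridge the gap to three-week-old wording.
    return qdrant.search(collection_name="benbrain", query_vector=vec, limit=k)

for hit in recall("what did we decide about that OOM crash?"):
    print(f"{hit.score:.2f}  {hit.payload['title']}")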

Neo4j — the knowledge graph. Same content, different cut. Every decision, attempt, error, module, session, and learning is a typed node with typed relations to other nodes. A Decision has a Why, blocks an Error, supersedes another Decision. Walkable as a graph: from any starting point, you can ask "what led to this?" and follow the edges back.
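
The "what led to this?" walk is one Cypher pattern away. A sketch via the Python driver, assuming the lowercase relation names in the entry format below map to uppercase relationship types like INFORMED_BY in the graph:

from neo4j import GraphDatabase

graph = GraphDatabase.driver("bolt://benbrain:7687",
                             auth=("neo4j", "changeme"))  # placeholder credentials

def what_led_to(entry_id: str) -> list[list[str]]:
    # Walk INFORMED_BY edges up to three hops back from any starting node.
    query = """
    MATCH path = (start {id: $id})-[:INFORMED_BY*1..3]->(n)
    RETURN [x IN nodes(path) | x.title] AS chain
    """
    with graph.session() as session:
        return [record["chain"] for record in session.run(query, id=entry_id)]

for chain in what_led_to("sophonix:decision:notion-as-blog-cms"):
    print(" <- ".join(chain))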

The vector store handles "find me things like this." The graph handles "show me how this connects." Different questions, different shapes, same source data.

Here's the rough shape of an entry in the graph:

{
  "id": "sophonix:decision:notion-as-blog-cms",
  "type": "Decision",
  "title": "Sophonix blog content lives in Notion, not in-repo MDX",
  "body": "Article rich text + media live in Notion page blocks; the build pulls them at next build and mirrors assets to public/blog/ so deploys are self-contained.",
  "severity": "informational",
  "created_at": "2026-05-11T16:42:00Z",
  "relations": {
    "supports": ["sophonix:initiative:linkedin-pipeline"],
    "informed_by": ["sophonix:learning:notion-rich-text-limits"],
    "blocks": []
  }
}

That's the data. The interesting part is the loop that fills it.

The loop, in four steps

  1. I hit /compact when a session's context starts to fill. This is the only manual step, and I'll come back to why it's still manual.
  2. A hook reads the transcript. A SessionEnd hook in Claude Code grabs the full conversation and pipes it through a structured-extraction prompt to GPT-4o. The prompt asks for typed entries: what decisions were made, what attempts succeeded or failed, what was learned that wasn't obvious from the code alone. (A sketch of this hook follows the list.)
  3. Entries get written to both stores. The extracted JSON gets embedded and indexed into Qdrant. The same entries become typed nodes in Neo4j with relations to the surrounding context. A decision references the attempts that led to it. An error references the decisions that prevented it from recurring.
  4. The next session wakes up with that memory prepended. A SessionStart hook queries both stores — recent entries, plus a similarity search keyed on the current project and any open todos — and injects the results as the opening context. By the time I type the first message, Claude already knows where we left off. (That side is sketched a little further below.)
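
Steps 2 and 3 compress into one small script. Here's a sketch of that SessionEnd hook, assuming Claude Code hands the hook a JSON blob on stdin with a transcript_path, and treating the extraction schema, credentials, and hostnames as illustrative:

import json
import sys
import uuid

from neo4j import GraphDatabase
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

oai = OpenAI()
qdrant = QdrantClient(url="http://benbrain:6333")  # illustrative mesh hostname
graph = GraphDatabase.driver("bolt://benbrain:7687", auth=("neo4j", "changeme"))

EXTRACTION_PROMPT = (
    'Extract typed entries from this transcript as a JSON object {"entries": [...]}. '
    "Capture decisions made, attempts that succeeded or failed, and learnings "
    "not obvious from the code alone. Each entry needs: id, type, title, body, relations."
)

def extract(transcript: str) -> list[dict]:
    resp = oai.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return json.loads(resp.choices[0].message.content)["entries"]

def index(entry: dict) -> None:
    # Qdrant: embed the body. Point ids must be ints or UUIDs,
    # so derive a stable UUID from the string id.
    vec = oai.embeddings.create(
        model="text-embedding-3-small", input=entry["body"]
    ).data[0].embedding
    qdrant.upsert(
        collection_name="benbrain",
        points=[PointStruct(
            id=str(uuid.uuid5(uuid.NAMESPACE_URL, entry["id"])),
            vector=vec,
            payload=entry,
        )],
    )
    # Neo4j: MERGE the typed node, then its typed relations.
    with graph.session() as session:
        session.run(
            f"MERGE (n:{entry['type']} {{id: $id}}) SET n.title = $title, n.body = $body",
            id=entry["id"], title=entry["title"], body=entry["body"],
        )
        for rel, targets in entry.get("relations", {}).items():
            for target in targets:
                session.run(
                    f"MATCH (a {{id: $a}}) MERGE (b {{id: $b}}) "
                    f"MERGE (a)-[:{rel.upper()}]->(b)",
                    a=entry["id"], b=target,
                )

if __name__ == "__main__":
    hook_input = json.loads(sys.stdin.read())  # Claude Code passes hook context as JSON
    transcript = open(hook_input["transcript_path"]).read()  # JSONL, fed to GPT-4o as-is
    for entry in extract(transcript):
        index(entry)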

One command from me. Everything else runs itself.
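
And the wake-up side, step 4: a SessionStart hook whose stdout gets injected as the opening context. A sketch under the same assumptions, with the project name and todo list as stand-ins:

from openai import OpenAI
from qdrant_client import QdrantClient

oai = OpenAI()
qdrant = QdrantClient(url="http://benbrain:6333")  # illustrative mesh hostname

def wake_up(project: str, open_todos: list[str]) -> str:
    # Similarity search keyed on the current project plus open todos.
    # (The recent-entries query is omitted here for brevity.)
    seed = f"{project}: " + "; ".join(open_todos)
    vec = oai.embeddings.create(
        model="text-embedding-3-small", input=seed
    ).data[0].embedding
    hits = qdrant.search(collection_name="benbrain", query_vector=vec, limit=8)
    lines = [f"- [{h.payload['type']}] {h.payload['title']}" for h in hits]
    return "Memory from previous sessions:\n" + "\n".join(lines)

if __name__ == "__main__":
    # Whatever a SessionStart hook prints becomes part of the session's context.
    print(wake_up("sophonix", ["decay and tombstoning for BenBrain"]))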

Why bare-metal

I get asked this constantly: why not Pinecone, Weaviate, Neo4j Aura? Why run all this on a box I have to keep alive myself?

The cost answer is incidental. Yes, it's cheaper, but that's not why.

The control answer is the point. This layer holds every decision I make: client positioning, architecture choices, security tradeoffs, half-formed ideas about the next product. None of that should live on a server I don't control. Not Pinecone's. Not Neo4j's. Not anyone's. The whole reason I want persistent memory is to externalize my thinking — and externalizing your thinking into someone else's database is a strictly worse trade than not externalizing it at all.

Tailscale closes the loop. The server is reachable only inside my mesh. There's no public ingress to the memory layer. Claude Code on my laptop reaches Qdrant and Neo4j over Tailscale; nothing else does. Zero internet exposure on the data path.

This is the same principle that runs through the consulting work at Sophonix: we build for clients the way we build for ourselves. If I wouldn't trust a third party with my own decision log, I'm not going to recommend a client trust one with theirs.

What's still hand-built

The /compact step is mine to invoke. That's by choice.

The alternative — continuous capture, with the hook firing every few thousand tokens of context — sounds elegant. But it doubles the API spend on extraction, and it means the boundary of what counts as one memory is set by a token counter rather than by anyone's judgment. A session that wanders into three unrelated topics gets fragmented into three half-baked entries instead of one coherent thread. Manual compaction lets the human decide what "one thing" looks like.

Forgetting is also still manual, which in practice means nothing forgets. Every entry lives forever, every embedding stays indexed. That's fine for now, but it's a known dead end. Append-only memory degrades on retrieval long before it degrades on storage: as the corpus grows, the signal in any single query gets buried under historical noise. The right fix is decay — recency-weighted retrieval, possibly with explicit "this is no longer current" tombstones. Not built yet. Logged.
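
The decay half is cheap to prototype at query time, before touching the stores at all. A sketch of recency-weighted re-ranking over Qdrant hits, where the half-life is a made-up starting knob and superseded_by is the hypothetical tombstone field:

from datetime import datetime, timezone

HALF_LIFE_DAYS = 30  # made-up starting point; tune against the real corpus

def rerank(hits):
    now = datetime.now(timezone.utc)

    def score(hit):
        # Tombstoned entries (hypothetical superseded_by field) drop to zero.
        if hit.payload.get("superseded_by"):
            return 0.0
        created = datetime.fromisoformat(
            hit.payload["created_at"].replace("Z", "+00:00")
        )
        age_days = (now - created).total_seconds() / 86400
        # Halve the similarity score every HALF_LIFE_DAYS of age.
        return hit.score * 0.5 ** (age_days / HALF_LIFE_DAYS)

    return sorted(hits, key=score, reverse=True)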

What's next

The roadmap, in order:

  • Continuous capture with smarter boundaries. Detect topic shifts in the transcript and compact at those edges instead of waiting for /compact. Still want human review on the entries, but the bookkeeping of "what to compact" can be automatic. (A rough sketch of the detector follows this list.)
  • Decay and tombstoning. When a decision is superseded, the old one shouldn't just exist alongside the new one — it should be marked superseded, and retrieval should weight against returning it.
  • Sharing primitives. Right now BenBrain is single-user. The same architecture could front a team: a small group with a shared memory layer, each member's session contributing to the same graph. This is where the consulting work points: the same durability problem that BenBrain solves for me, every team eventually hits.
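
For the first item, the cheapest plausible detector is embedding distance between consecutive transcript messages: when adjacent turns stop resembling each other, that's a candidate boundary. A sketch, with the threshold as a number I'd expect to calibrate rather than trust:

import numpy as np
from openai import OpenAI

oai = OpenAI()
SHIFT_THRESHOLD = 0.35  # illustrative; calibrate on real transcripts

def boundaries(messages: list[str]) -> list[int]:
    resp = oai.embeddings.create(model="text-embedding-3-small", input=messages)
    vecs = [np.array(d.embedding) for d in resp.data]
    cuts = []
    for i in range(1, len(vecs)):
        a, b = vecs[i - 1], vecs[i]
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        if 1 - sim > SHIFT_THRESHOLD:  # sharp drop in similarity = topic shift
            cuts.append(i)
    return cuts  # compact [0:cuts[0]], [cuts[0]:cuts[1]], ... as separate entries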

The point

The infrastructure was always on. The server's been there. The graph database, the vector store, the network — all of it.

What changed isn't infrastructure. What changed is that the thinking is now on too. The decisions don't evaporate at the end of a session. The errors don't get re-debugged six weeks later. The throughline of "what we're building and why" survives across days, devices, contexts.

That's the part that matters. Not the stack. Not the choice of Qdrant over Pinecone or Neo4j over Aura. Those are footnotes. The point is that a thinking partner who remembers is a different kind of partner than one who doesn't — and getting there turned out to be one extraction prompt, two stores, and two hooks.

What's your memory layer?
