Version History
v1.0 — SQLite backend, embedding search, basic CRUD. A memory that could store and find.
v1.1 — 5-layer stack (PostgreSQL, Neo4j, Redis, MinIO, ONNX). Graph reasoning, cross-encoder reranking, promotion pipeline. A memory that could think.
v1.2 — Auto-discovery pipeline. Observes tool use, detects behavioral patterns, creates confidence-scored memories autonomously. A memory that learns.
The Gap
Every major AI assistant (ChatGPT, Claude, Gemini) handles memory the same way: a flat list of facts stapled to every conversation. No search. No relationships. No lifecycle. The entire memory is dumped into the context window every turn, regardless of relevance.
This approach doesn't scale. At a few hundred memories, you're burning half your context on irrelevant facts. There's no way to ask "what connects this project to that decision?" because there are no connections. Just a pile of sticky notes.
The tooling gap is even wider for AI coding assistants. They start every session from zero. Yesterday's architectural decisions, past debugging sessions, operator corrections. All gone. The assistant repeats mistakes it was already corrected on.
And even with explicit memory, the operator has to do all the work. Every useful pattern, every learned correction, every workflow preference — manually stored, or lost.
What I Built
A dedicated memory infrastructure layer with five specialized backends, each handling a different type of recall:
| Layer | Technology | Purpose |
|---|---|---|
| Structured + Semantic | PostgreSQL + pgvector | Long-term storage with HNSW vector indexing for sub-linear similarity search |
| Graph | Neo4j | Entity extraction and relationship traversal: "what's connected to what" |
| Working Memory | Redis | Ephemeral session state, search caching, multi-agent shared context with TTL |
| Artifact Storage | MinIO | S3-compatible object storage for large files linked to memories |
| Intelligence | ONNX models (local) | Embedding generation + cross-encoder reranking, fully offline |
The system exposes 12 tools via the Model Context Protocol, making it plug-and-play for any MCP-compatible AI assistant. No custom integration, no vendor lock-in.
Retrieval Architecture
The search pipeline is where this diverges most from flat-memory systems. Instead of dumping everything into context, it retrieves only what's relevant:
Three search modes fuse together:
- Semantic: the query is embedded into a 384-dimension vector and compared against HNSW-indexed memory embeddings via cosine distance
- Keyword: BM25 ranking against PostgreSQL's `tsvector` full-text index
- Hybrid (default): reciprocal rank fusion merges both result sets, then a cross-encoder reranker scores the top candidates for precision
Every search is scoped. A query scoped to project:vaultkeeper won't bleed into memories from other projects.
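The fusion step above can be sketched in a few lines. This is an illustrative TypeScript sketch, not the production code: the function name and the `k = 60` constant (the common default from the original RRF formulation) are assumptions.

```typescript
// Minimal reciprocal rank fusion sketch. Each input list is ordered
// best-first; an item's fused score is the sum of 1 / (k + rank) over
// every list it appears in, so items ranked well in both lists win.
function reciprocalRankFusion(resultLists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of resultLists) {
    list.forEach((id, rank) => {
      // rank is 0-based, so add 1 to make it 1-based before fusing.
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// "b" is near the top of both lists, so it outranks everything else.
const fused = reciprocalRankFusion([
  ["a", "b", "c"], // semantic results
  ["b", "d", "a"], // keyword results
]);
```

Note that an item appearing in only one list still gets a score, so keyword-only and semantic-only hits both survive fusion; the reranker then sorts out precision among the survivors.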
Graph Reasoning
Vector search answers "what's similar to this?" The graph layer answers a fundamentally different question: "what's connected to this?"
When a memory is stored, an entity extraction pipeline identifies people, projects, technologies, and concepts, then writes them as nodes and edges in Neo4j. Over time, a knowledge graph emerges organically from the memories themselves:
A query like "How does Redis connect to the VK platform?" traverses from the Redis entity node to find three paths: the session token strategy, the cache invalidation pattern, and a shared dependency with email-triage. Vector search would only find documents about Redis. The graph finds documents connected through Redis.
- "What decisions were made about authentication?" → traverse from the `JWT` entity → 2 connected memories across 1 project
- "What infrastructure do these three projects share?" → fan-out from project nodes → intersection at `Redis` and `PostgreSQL`
- "Show me everything connected to the email triage system" → single entity traversal → technologies, decisions, related projects
This is the kind of reasoning that flat key-value memory fundamentally cannot do.
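To make the contrast concrete, here is a toy in-memory version of the traversal. It is a sketch only: the real system runs this against Neo4j, and the node names below are illustrative stand-ins for extracted entities.

```typescript
// Toy undirected adjacency-list graph standing in for the Neo4j layer.
type Graph = Map<string, Set<string>>;

function addEdge(g: Graph, a: string, b: string): void {
  if (!g.has(a)) g.set(a, new Set());
  if (!g.has(b)) g.set(b, new Set());
  g.get(a)!.add(b);
  g.get(b)!.add(a);
}

// Everything reachable from `start` within `maxHops` edges — the graph
// analogue of "what's connected to this?", which similarity search
// cannot answer.
function connected(g: Graph, start: string, maxHops: number): Set<string> {
  const seen = new Set([start]);
  let frontier = [start];
  for (let hop = 0; hop < maxHops; hop++) {
    const next: string[] = [];
    for (const node of frontier) {
      for (const neighbor of g.get(node) ?? []) {
        if (!seen.has(neighbor)) {
          seen.add(neighbor);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  seen.delete(start); // report only the connections, not the start node
  return seen;
}

const g: Graph = new Map();
addEdge(g, "Redis", "session-token-strategy");
addEdge(g, "session-token-strategy", "vaultkeeper");
addEdge(g, "Redis", "email-triage");

// Two hops out from Redis reaches vaultkeeper, which shares no text
// with "Redis" at all — exactly the case vector search misses.
const reachable = connected(g, "Redis", 2);
```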
Memory Lifecycle
Memories aren't static. An automated promotion pipeline manages the lifecycle:
- Decay: relevance score drops exponentially based on access recency (30-day half-life), with type-weighted adjustments. Frequently accessed memories stay hot
- Pinning: critical memories (operator identity, foundational corrections) are marked permanent and exempt from decay
- Consolidation: near-duplicate memories (>0.95 cosine similarity) are detected and flagged for merge or auto-archive
- Auto-archive: memories below the relevance threshold are soft-archived, never hard-deleted
The pipeline is driven by a scoring function that considers recency, access frequency, memory type, and content quality, then runs as a scheduled background process.
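The decay component of that scoring function reduces to a standard exponential half-life. A minimal sketch, with the caveat that the type weights shown are invented placeholders, not the production values:

```typescript
// Exponential decay with a 30-day half-life, per the lifecycle rules
// above. TYPE_WEIGHT values are illustrative assumptions only.
const TYPE_WEIGHT: Record<string, number> = {
  correction: 1.2, // assumed: corrections are worth keeping hot longer
  observation: 0.8,
  note: 1.0,
};

function relevance(
  baseScore: number,
  daysSinceAccess: number,
  type: string,
  halfLifeDays = 30,
): number {
  // Score halves every `halfLifeDays` since last access.
  const decay = Math.pow(0.5, daysSinceAccess / halfLifeDays);
  return baseScore * decay * (TYPE_WEIGHT[type] ?? 1.0);
}

// After exactly one half-life, an untouched note keeps half its score.
const afterOneHalfLife = relevance(1.0, 30, "note");
```

Because the decay is driven by access recency rather than creation time, every retrieval effectively resets the clock, which is what keeps frequently used memories hot.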
Auto-Discovery: Memory That Learns
Everything above is explicit memory — the operator stores it, the system recalls it. Version 1.2 adds an observation layer that watches how the AI assistant is actually used and discovers patterns autonomously.
The pipeline has four zero-cost detectors — no API calls, no LLM inference:
- Repeated tool+input: same command run 3+ times across sessions → "operator frequently runs `npm test` before committing"
- Error-then-fix: failed tool call followed by successful retry → "when Jest fails with ENOMEM, increase `--maxWorkers`"
- Hot files: same file edited 3+ times per session → "`src/config.ts` is a frequent edit target"
- Repeated commands: normalized shell commands recurring → "operator uses `docker compose up -d` to start the dev stack"
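The first detector is the easiest to sketch: pure counting, no model inference. This is an illustrative simplification; the real pipeline normalizes shell commands before keying, and the `Observation` shape here is assumed.

```typescript
// "Repeated tool+input" detector sketch: count normalized (tool, input)
// pairs and emit a pattern candidate exactly once, when the count
// reaches the threshold (3, per the text). Zero API cost by design.
interface Observation {
  tool: string;
  input: string;
}

function detectRepeats(
  observations: Observation[],
  threshold = 3,
): string[] {
  const counts = new Map<string, number>();
  const patterns: string[] = [];
  for (const obs of observations) {
    const key = `${obs.tool}:${obs.input.trim()}`;
    const n = (counts.get(key) ?? 0) + 1;
    counts.set(key, n);
    if (n === threshold) patterns.push(key); // fire once, at the threshold
  }
  return patterns;
}

// Three "npm test" runs cross the threshold; one config edit does not.
const found = detectRepeats([
  { tool: "shell", input: "npm test" },
  { tool: "shell", input: "npm test" },
  { tool: "editor", input: "src/config.ts" },
  { tool: "shell", input: "npm test" },
]);
```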
Detected patterns become memories with a confidence score. The scoring is deliberately conservative:
- Self-limiting reinforcement: each repeated observation adds `+0.05 * (1 - current)` — high-confidence patterns get smaller bumps, preventing echo chambers
- Confidence ceiling at 0.85: auto-discovered memories never auto-promote past this. The operator must explicitly confirm for higher trust
- Read-time decay: confidence decays with a 60-day half-life if the pattern stops appearing, with a freeze for dormant projects
- Bounded search influence: discovered memories can reduce their search score by at most 50%, ensuring they still surface but don't dominate explicit knowledge
Three-tier semantic deduplication prevents the system from creating near-duplicate memories: similarity >= 0.95 reinforces the existing memory, 0.85-0.95 reinforces and logs new evidence, below 0.85 creates a new entry.
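The reinforcement rule and the dedup tiers fit in a few lines. The constants (`0.05` step, `0.85` ceiling, `0.95`/`0.85` similarity boundaries) are the ones quoted above; the function names are illustrative.

```typescript
// Self-limiting reinforcement: the bump shrinks as confidence grows,
// and the ceiling caps autonomous promotion at 0.85.
function reinforce(current: number): number {
  return Math.min(0.85, current + 0.05 * (1 - current));
}

// Three-tier dedup routing on cosine similarity against the nearest
// existing memory.
type DedupAction = "reinforce" | "reinforce-with-evidence" | "create-new";

function dedupRoute(similarity: number): DedupAction {
  if (similarity >= 0.95) return "reinforce";
  if (similarity >= 0.85) return "reinforce-with-evidence";
  return "create-new";
}

// A low-confidence pattern gets a big bump; a high one barely moves.
const low = reinforce(0.2);   // 0.2 + 0.05 * 0.8 = 0.24
const high = reinforce(0.84); // 0.848, just under the ceiling
```

The asymptotic shape is the point: repeated evidence converges toward the ceiling instead of racing past it, so the system cannot talk itself into certainty.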
Security is multi-layered: client-side secret scrubbing before network transit, server-side scrubbing as defense-in-depth, content denylist blocking dangerous command patterns, and API key authentication on all observation endpoints.
Design Principles
Zero external API dependencies. Embedding generation (all-MiniLM-L6-v2) and reranking (ms-marco-MiniLM) run locally via ONNX. No token costs, no network latency, no third-party data sharing. A memory system that depends on an API to think isn't a memory system. It's a feature that disappears when the API changes.
Interface-driven architecture. Every backend implements a shared TypeScript interface (IMemoryStore, IVectorSearch). PostgreSQL can be swapped for SQLite. The in-memory embedding index can be swapped for pgvector. Clients don't know or care which backend is active.
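A hedged sketch of what that contract might look like: only the names `IMemoryStore` and `IVectorSearch` come from the text, and the method signatures and `Memory` shape here are illustrative assumptions.

```typescript
interface Memory {
  id: string;
  content: string;
}

// Hypothetical slice of the shared storage contract.
interface IMemoryStore {
  save(memory: Memory): Promise<void>;
  get(id: string): Promise<Memory | undefined>;
}

// Any backend (PostgreSQL, SQLite, a test double) can sit behind the
// same contract; callers never learn which one is active.
class InMemoryStore implements IMemoryStore {
  private readonly items = new Map<string, Memory>();
  async save(memory: Memory): Promise<void> {
    this.items.set(memory.id, memory);
  }
  async get(id: string): Promise<Memory | undefined> {
    return this.items.get(id);
  }
}
```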
Eval-gated deployment. A built-in retrieval evaluation harness measures Precision@K, Recall@K, MRR, and NDCG against a seeded query set. No changes ship without confirming search quality holds. This is how production search systems are maintained: not vibes, measurements.
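Two of those metrics are compact enough to sketch directly; this is a generic illustration of the standard definitions, not the harness's code.

```typescript
// Precision@K: what fraction of the top-K results are known-relevant
// for a seeded query.
function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / k;
}

// Mean reciprocal rank across queries: average of 1 / (rank of the
// first relevant result), 1-based; queries with no hit contribute 0.
function mrr(rankedPerQuery: string[][], relevantPerQuery: Set<string>[]): number {
  let total = 0;
  rankedPerQuery.forEach((ranked, i) => {
    const firstHit = ranked.findIndex((id) => relevantPerQuery[i].has(id));
    if (firstHit >= 0) total += 1 / (firstHit + 1);
  });
  return total / rankedPerQuery.length;
}

// One of the top two results is relevant → precision@2 = 0.5.
const p = precisionAtK(["a", "b", "c"], new Set(["a", "c"]), 2);
```

Gating deploys on these numbers means a reranker or embedding change that silently degrades retrieval fails the gate instead of shipping.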
Graceful degradation. Neo4j, Redis, and MinIO each have an enabled flag. The system runs on PostgreSQL alone, or SQLite alone, or with any combination of optional backends. Every layer is additive.
Results
The system runs 24/7 as a Windows service, accessible across a WireGuard VPN from any device on the network. It's been the primary memory backend for all AI-assisted development work since deployment.
- Sub-50ms hybrid search with cross-encoder reranking, faster than a filesystem glob
- 13 MCP tools available to any compatible AI assistant without configuration
- Full offline operation: embedding, search, reranking, and pattern detection run without internet
- Automated lifecycle: the memory store stays relevant without manual curation
- Autonomous learning: the system observes, detects, and remembers without being told
- 149-test regression suite: every phase validated against the eval harness baseline
v1.0 was a database that could remember. v1.1 was a system that could reason about what it remembers. v1.2 is a system that decides what's worth remembering in the first place.
