AI Memory System

A 5-layer distributed memory architecture with auto-discovery — gives AI coding assistants persistent recall, relationship reasoning, and the ability to learn behavioral patterns from observation.

2025 · Live
mnemos
[09:12:01] mnemos v1.2 online
[09:12:01] ✓ PostgreSQL + pgvector (HNSW)
[09:12:02] ✓ Neo4j graph (891 entities)
[09:12:02] ✓ Redis cache connected
[09:12:03] ✓ Embedding model loaded (384-dim)
[09:12:03] ✓ Observation ingestion enabled
[09:12:03] ✓ Discovery scheduler started
[09:12:22] SEARCH "auth token rotation"
→ hybrid + rerank │ 23 candidates
→ #1 (0.94) "JWT refresh token strategy"
→ #2 (0.87) "Redis session management"
[09:13:00] DISCOVER project:mnemos
→ 142 observations │ 4 heuristics
→ Pattern: "npm test" repeated 11x (0.78)
→ Pattern: hot file src/config.ts (0.62)
→ 2 created │ 1 reinforced │ ceiling 0.85
── 247 memories │ 891 entities │ 13 tools ──

Results

5+1
Memory Layers
PostgreSQL, Neo4j, Redis, MinIO, ONNX + auto-discovery
13
MCP Tools
Plug-and-play for any AI assistant
<50ms
Search Latency
Hybrid semantic + keyword with reranking
0
External API Calls
All inference runs locally on-device

Version History

v1.0 — SQLite backend, embedding search, basic CRUD. A memory that could store and find.

v1.1 — 5-layer stack (PostgreSQL, Neo4j, Redis, MinIO, ONNX). Graph reasoning, cross-encoder reranking, promotion pipeline. A memory that could think.

v1.2 — Auto-discovery pipeline. Observes tool use, detects behavioral patterns, creates confidence-scored memories autonomously. A memory that learns.

The Gap

Every major AI assistant (ChatGPT, Claude, Gemini) handles memory the same way: a flat list of facts stapled to every conversation. No search. No relationships. No lifecycle. The entire memory is dumped into the context window every turn, regardless of relevance.

This approach doesn't scale. At a few hundred memories, you're burning half your context on irrelevant facts. There's no way to ask "what connects this project to that decision?" because there are no connections. Just a pile of sticky notes.

The tooling gap is even wider for AI coding assistants. They start every session from zero. Yesterday's architectural decisions, past debugging sessions, operator corrections. All gone. The assistant repeats mistakes it was already corrected on.

And even with explicit memory, the operator has to do all the work. Every useful pattern, every learned correction, every workflow preference — manually stored, or lost.

What I Built

A dedicated memory infrastructure layer with five specialized backends, each handling a different type of recall:

Layer                 | Technology            | Purpose
Structured + Semantic | PostgreSQL + pgvector | Long-term storage with HNSW vector indexing for sub-linear similarity search
Graph                 | Neo4j                 | Entity extraction and relationship traversal: "what's connected to what"
Working Memory        | Redis                 | Ephemeral session state, search caching, multi-agent shared context with TTL
Artifact Storage      | MinIO                 | S3-compatible object storage for large files linked to memories
Intelligence          | ONNX models (local)   | Embedding generation + cross-encoder reranking, fully offline

The system exposes 13 tools via the Model Context Protocol, making it plug-and-play for any MCP-compatible AI assistant. No custom integration, no vendor lock-in.

Retrieval Architecture

The search pipeline is where this diverges most from flat-memory systems. Instead of dumping everything into context, it retrieves only what's relevant:

Three search modes fuse together:

  • Semantic: the query is embedded into a 384-dimensional vector and compared against HNSW-indexed memory embeddings via cosine distance
  • Keyword: BM25 ranking against PostgreSQL's tsvector full-text index
  • Hybrid (default): reciprocal rank fusion merges both result sets, then a cross-encoder reranker scores the top candidates for precision

Every search is scoped. A query scoped to project:vaultkeeper won't bleed into memories from other projects.
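The fusion step above can be sketched in a few lines. This is a minimal illustration of standard reciprocal rank fusion, not the project's actual implementation; the `SearchHit` type and the conventional constant k = 60 are assumptions.

```typescript
// Hypothetical sketch of reciprocal rank fusion (RRF).
// Each document scores sum(1 / (k + rank)) across the result lists it appears in.
interface SearchHit { id: string; }

function rrfFuse(lists: SearchHit[][], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((hit, i) => {
      // rank is 1-based: the top hit in a list contributes 1 / (k + 1)
      scores.set(hit.id, (scores.get(hit.id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

A document ranked highly by both semantic and keyword search accumulates score from both lists, which is why it outranks a document that only one mode found; the cross-encoder then rescores the fused top candidates.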

Graph Reasoning

Vector search answers "what's similar to this?" The graph layer answers a fundamentally different question: "what's connected to this?"

When a memory is stored, an entity extraction pipeline identifies people, projects, technologies, and concepts, then writes them as nodes and edges in Neo4j. Over time, a knowledge graph emerges organically from the memories themselves:

A query like "How does Redis connect to the VK platform?" traverses from the Redis entity node to find three paths: the session token strategy, the cache invalidation pattern, and a shared dependency with email-triage. Vector search would only find documents about Redis. The graph finds documents connected through Redis.

  • "What decisions were made about authentication?" → traverse from JWT entity → 2 connected memories across 1 project
  • "What infrastructure do these three projects share?" → fan-out from project nodes → intersection at Redis and PostgreSQL
  • "Show me everything connected to the email triage system" → single entity traversal → technologies, decisions, related projects

This is the kind of reasoning that flat key-value memory fundamentally cannot do.
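The traversal idea can be illustrated with a toy in-memory graph. The real system runs this as a Neo4j query; the breadth-first walk and the node names below are purely illustrative.

```typescript
// Illustrative sketch of "what's connected to Redis within 2 hops?"
// on a tiny undirected entity graph. Node names are hypothetical.
type Edge = [string, string];

function connected(edges: Edge[], start: string, maxHops = 2): Set<string> {
  // Build an undirected adjacency list
  const adj = new Map<string, string[]>();
  const addEdge = (a: string, b: string) => {
    if (!adj.has(a)) adj.set(a, []);
    adj.get(a)!.push(b);
  };
  for (const [a, b] of edges) { addEdge(a, b); addEdge(b, a); }

  // Breadth-first expansion, bounded by hop count
  const seen = new Set<string>([start]);
  let frontier = [start];
  for (let hop = 0; hop < maxHops; hop++) {
    const next: string[] = [];
    for (const node of frontier)
      for (const nb of adj.get(node) ?? [])
        if (!seen.has(nb)) { seen.add(nb); next.push(nb); }
    frontier = next;
  }
  seen.delete(start); // the start entity itself is not a result
  return seen;
}
```

Two hops is what lets the query reach entities that share no document with Redis at all, only a path through one.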

Memory Lifecycle

Memories aren't static. An automated promotion pipeline manages the lifecycle:

  • Decay: relevance score drops exponentially based on access recency (30-day half-life), with type-weighted adjustments. Frequently accessed memories stay hot
  • Pinning: critical memories (operator identity, foundational corrections) are marked permanent and exempt from decay
  • Consolidation: near-duplicate memories (>0.95 cosine similarity) are detected and flagged for merge or auto-archive
  • Auto-archive: memories below the relevance threshold are soft-archived, never hard-deleted

The pipeline is driven by a scoring function that considers recency, access frequency, memory type, and content quality, then runs as a scheduled background process.
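The decay term of that scoring function can be sketched as follows. The 30-day half-life and the decay/pinning behavior come from the description above; the specific type weights and names are assumptions for illustration.

```typescript
// Hedged sketch of the relevance decay term: exponential drop with a
// 30-day half-life, adjusted by memory type. Weights are illustrative.
const HALF_LIFE_DAYS = 30;
const TYPE_WEIGHT: Record<string, number> = {
  correction: 1.2,   // hypothetical: corrections decay slower
  note: 1.0,
  observation: 0.8,  // hypothetical: raw observations decay faster
};

function relevance(daysSinceAccess: number, type: string, pinned = false): number {
  if (pinned) return 1; // pinned memories are exempt from decay
  const decay = Math.pow(0.5, daysSinceAccess / HALF_LIFE_DAYS);
  return decay * (TYPE_WEIGHT[type] ?? 1.0);
}
```

After 30 days without access a plain note sits at half its original relevance; a frequently accessed memory keeps resetting the clock and stays hot.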

Auto-Discovery: Memory That Learns

Everything above is explicit memory — the operator stores it, the system recalls it. Version 1.2 adds an observation layer that watches how the AI assistant is actually used and discovers patterns autonomously.

The pipeline has four zero-cost detectors — no API calls, no LLM inference:

  • Repeated tool+input: same command run 3+ times across sessions → "operator frequently runs npm test before committing"
  • Error-then-fix: failed tool call followed by successful retry → "when Jest fails with ENOMEM, increase --maxWorkers"
  • Hot files: same file edited 3+ times per session → "src/config.ts is a frequent edit target"
  • Repeated commands: normalized shell commands recurring → "operator uses docker compose up -d to start the dev stack"
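The first detector is simple enough to sketch directly: count identical (tool, normalized input) pairs and flag anything seen three or more times. Field names and the normalization step are illustrative, not the project's actual schema.

```typescript
// Minimal sketch of the "repeated tool+input" detector.
// An observation is flagged once the same pair recurs 3+ times.
interface Observation { tool: string; input: string; }

function detectRepeats(obs: Observation[], threshold = 3): string[] {
  const counts = new Map<string, number>();
  for (const o of obs) {
    // Hypothetical normalization: trim and lowercase the input
    const key = `${o.tool}:${o.input.trim().toLowerCase()}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .filter(([, n]) => n >= threshold)
    .map(([key]) => key);
}
```

Because the detectors are plain counting and string matching, the whole observation pipeline stays zero-cost: no API calls, no model inference.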

Detected patterns become memories with a confidence score. The scoring is deliberately conservative:

  • Self-limiting reinforcement: each repeated observation adds +0.05 * (1 - current) — high-confidence patterns get smaller bumps, preventing echo chambers
  • Confidence ceiling at 0.85: auto-discovered memories never auto-promote past this. The operator must explicitly confirm for higher trust
  • Read-time decay: confidence decays with a 60-day half-life if the pattern stops appearing, with a freeze for dormant projects
  • Bounded search influence: discovered memories can reduce their search score by at most 50%, ensuring they still surface but don't dominate explicit knowledge
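The reinforcement and ceiling rules above are concrete enough to write down; this sketch uses exactly the +0.05 * (1 - current) bump, the 0.85 ceiling, and the 60-day half-life from the text, while the function shapes are my own framing.

```typescript
// Sketch of self-limiting confidence reinforcement with a hard ceiling.
const CEILING = 0.85;

function reinforce(confidence: number): number {
  // High-confidence patterns get smaller bumps; auto-discovery never
  // promotes past the ceiling without explicit operator confirmation.
  return Math.min(CEILING, confidence + 0.05 * (1 - confidence));
}

function decayedConfidence(confidence: number, daysSinceSeen: number): number {
  // Read-time decay with a 60-day half-life once the pattern stops appearing
  return confidence * Math.pow(0.5, daysSinceSeen / 60);
}
```

The (1 - current) factor is what prevents echo chambers: the same observation repeated a hundred times converges toward the ceiling instead of blowing past it.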

Three-tier semantic deduplication prevents the system from creating near-duplicate memories: similarity >= 0.95 reinforces the existing memory, 0.85-0.95 reinforces and logs new evidence, below 0.85 creates a new entry.
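The three tiers map onto a single threshold check; the thresholds here are the ones stated above, and only the action names are invented for the sketch.

```typescript
// The three-tier dedup decision on cosine similarity to the nearest
// existing memory. Action names are illustrative.
type DedupAction = "reinforce" | "reinforce_with_evidence" | "create";

function dedupAction(similarity: number): DedupAction {
  if (similarity >= 0.95) return "reinforce";              // near-identical: bump the existing memory
  if (similarity >= 0.85) return "reinforce_with_evidence"; // close: bump and log the new evidence
  return "create";                                          // genuinely new pattern
}
```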

Security is multi-layered: client-side secret scrubbing before network transit, server-side scrubbing as defense-in-depth, content denylist blocking dangerous command patterns, and API key authentication on all observation endpoints.
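Client-side scrubbing might look something like the following. The patterns are generic illustrations of common secret shapes, not the project's actual denylist, and a real scrubber would carry far more of them.

```typescript
// Hypothetical sketch of client-side secret scrubbing: redact common
// token shapes before an observation ever leaves the machine.
const SECRET_PATTERNS: RegExp[] = [
  /(?:api[_-]?key|token|secret|password)\s*[=:]\s*\S+/gi, // key=value assignments
  /\bsk-[A-Za-z0-9]{20,}\b/g,                             // bearer-style key prefixes
];

function scrub(text: string): string {
  return SECRET_PATTERNS.reduce((t, re) => t.replace(re, "[REDACTED]"), text);
}
```

Running the same scrubber server-side is the defense-in-depth layer: a client that skips scrubbing still cannot persist a secret.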

Design Principles

Zero external API dependencies. Embedding generation (all-MiniLM-L6-v2) and reranking (ms-marco-MiniLM) run locally via ONNX. No token costs, no network latency, no third-party data sharing. A memory system that depends on an API to think isn't a memory system. It's a feature that disappears when the API changes.

Interface-driven architecture. Every backend implements a shared TypeScript interface (IMemoryStore, IVectorSearch). PostgreSQL can be swapped for SQLite. The in-memory embedding index can be swapped for pgvector. Clients don't know or care which backend is active.
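The `IMemoryStore` name comes from the project; the methods and the in-memory backend below are a minimal sketch of the swappability idea, not the real interface.

```typescript
// Sketch of interface-driven backends: any store satisfying the contract
// can be swapped in without callers noticing. Methods are illustrative.
interface IMemoryStore {
  save(id: string, content: string): Promise<void>;
  get(id: string): Promise<string | undefined>;
}

// A trivial backend: PostgreSQL or SQLite would implement the same contract.
class InMemoryStore implements IMemoryStore {
  private data = new Map<string, string>();
  async save(id: string, content: string): Promise<void> { this.data.set(id, content); }
  async get(id: string): Promise<string | undefined> { return this.data.get(id); }
}
```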

Eval-gated deployment. A built-in retrieval evaluation harness measures Precision@K, Recall@K, MRR, and NDCG against a seeded query set. No changes ship without confirming search quality holds. This is how production search systems are maintained: not vibes, measurements.
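One of the metrics named above, mean reciprocal rank, is compact enough to sketch. This is the textbook definition, assuming one relevant document per seeded query.

```typescript
// Mean reciprocal rank over a seeded query set: for each query, take
// 1 / (rank of the relevant doc), or 0 if it never surfaced, then average.
function mrr(results: string[][], relevant: string[]): number {
  const reciprocalRanks = results.map((ranked, i) => {
    const pos = ranked.indexOf(relevant[i]);
    return pos === -1 ? 0 : 1 / (pos + 1);
  });
  return reciprocalRanks.reduce((a, b) => a + b, 0) / reciprocalRanks.length;
}
```

Gating deployment on metrics like this means a change that silently demotes the right answers from rank 1 to rank 3 fails the gate even though the answers still "appear".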

Graceful degradation. Neo4j, Redis, and MinIO each have an enabled flag. The system runs on PostgreSQL alone, or SQLite alone, or with any combination of optional backends. Every layer is additive.

Results

The system runs 24/7 as a Windows service, accessible across a WireGuard VPN from any device on the network. It's been the primary memory backend for all AI-assisted development work since deployment.

  • Sub-50ms hybrid search with cross-encoder reranking, faster than a filesystem glob
  • 13 MCP tools available to any compatible AI assistant without configuration
  • Full offline operation: embedding, search, reranking, and pattern detection run without internet
  • Automated lifecycle: the memory store stays relevant without manual curation
  • Autonomous learning: the system observes, detects, and remembers without being told
  • 149-test regression suite: every phase validated against the eval harness baseline

v1.0 was a database that could remember. v1.1 was a system that could reason about what it remembers. v1.2 is a system that decides what's worth remembering in the first place.