TheDataCo/engram

Engram — parametric agent memory in LoRA shards

Working name (TBD). An engram is the physical trace a memory leaves in neural tissue — which is literally what this does: encode facts into network weights.

Engram stores an agent's long-term memory inside small LoRA adapters instead of (only) a vector database. Facts are extracted into atomic triples, sharded by topic, baked into one LoRA adapter per shard on a tiny base model, and recalled by routing a query to the right shard(s). Multi-hop questions are answered by recursive retrieval — recall → ask a sub-question → re-route → recall → compose — not by asking the weights to reason.
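
As a rough illustration of the write-side data, the sketch below shows one way an atomic triple and its training QA pairs could be represented. The `Triple` class, the `TEMPLATES` tuple, and `qa_pairs` are hypothetical names for this example, not the repository's API, and the fixed templates stand in for whatever generator the real pipeline uses. The one property it demonstrates is that every answer restates the full fact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One atomic fact: subject, relation, object (hypothetical structure)."""
    subject: str
    relation: str
    obj: str

    def as_sentence(self) -> str:
        return f"{self.subject} {self.relation} {self.obj}."


# Hypothetical question templates; a real pipeline would generate far more
# diverse phrasings (the README's recipe uses about 20 pairs per fact).
TEMPLATES = (
    "Complete the fact: {s} {r} ...",
    "What do you know about {s} and '{r}'?",
    "Which entity satisfies: {s} {r} ?",
)


def qa_pairs(triple: Triple) -> list[tuple[str, str]]:
    """Build (question, answer) pairs whose answer restates the whole fact."""
    answer = triple.as_sentence()
    return [
        (tpl.format(s=triple.subject, r=triple.relation), answer)
        for tpl in TEMPLATES
    ]


fact = Triple("Marie Curie", "was born in", "Warsaw")
pairs = qa_pairs(fact)
for question, answer in pairs:
    print(question, "->", answer)
```

Answering with the full sentence rather than the bare object ("Warsaw") keeps subject, relation, and object bound together in every training target.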

TL;DR result

On HippoRAG's 2WikiMultiHopQA (150-question subset, same reader), a 360M-parameter parametric memory comes within 2 F1 points of vanilla RAG on multi-hop QA:

method                                  F1
Engram, single-pass                     31
Engram + recursion                      41
vanilla RAG (same reader)               43
MeMo (14B memory model, MuSiQue ref)    48

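A 360M model reaches near-parity with RAG on multi-hop QA, where the comparable parametric-memory paper (MeMo) used a 14B memory model (~40× larger). The MeMo figure is a MuSiQue reference number, not a same-benchmark comparison. Recursion accounts for the +10 F1 gain (31 → 41).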

Honest caveat (see Limitations): this parity depends on high-fidelity QA (~20 pairs/fact) and does not yet scale. At 300 questions Engram regresses to 29 F1 against 42 for vanilla RAG, and the bottleneck is routing (more shards → the router misses), not recall. This is a research result, not a production system.


Why parametric memory?

Traditional RAG retrieves text chunks into the context window on every query. Engram bakes facts into weights, so atomic recall needs zero retrieved context. That should be cheaper at scale, and the recall is a learned, precise binding rather than a fuzzy chunk match. The index does routing and relationship reasoning; the LoRA is a compressed recall cache.

How it works

 WRITE:  text ──▶ extract atomic triples ──▶ generate diverse QA ──▶ cluster (k-means)
                ──▶ train one LoRA adapter per shard (eval-gated)

 READ:   query ──▶ route to top-k shards (per-fact embedding index)
                ──▶ recall from each adapter (in parallel, never stacked)
                ──▶ reconstruct answer
         multi-hop ──▶ recursive: recall ──▶ "need another fact?" ──▶ re-route ──▶ recall ──▶ compose

See docs/ARCHITECTURE.md for the full design (tiered memory hierarchy, sharding-as-LSM-tree, mixed-adapter serving, economics).
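
The routing step in the READ path can be sketched as follows. This is a minimal illustration, not the repository's router: `cosine` and `route` are hypothetical helpers, the three-dimensional vectors are toy stand-ins for real embeddings, and scoring a shard by its best-matching fact (max-pooling) is an assumption about how a per-fact index might be aggregated to shard level.

```python
import math
from collections import defaultdict


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(query_vec, fact_index, k=2):
    """Return the k shard ids whose best-matching fact is closest to the query.

    fact_index is a list of (shard_id, fact_embedding) entries, one per fact.
    """
    best = defaultdict(lambda: -1.0)
    for shard_id, fact_vec in fact_index:
        best[shard_id] = max(best[shard_id], cosine(query_vec, fact_vec))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [shard_id for shard_id, _ in ranked[:k]]


# Toy per-fact index: two facts in "geography", one each in "films" and "people".
fact_index = [
    ("geography", [1.0, 0.0, 0.0]),
    ("geography", [0.9, 0.1, 0.0]),
    ("films", [0.0, 1.0, 0.0]),
    ("people", [0.0, 0.2, 1.0]),
]

top_shards = route([0.1, 0.9, 0.1], fact_index, k=2)
print(top_shards)
```

Each selected shard's adapter would then be queried independently, which matches the "in parallel, never stacked" constraint in the diagram.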

Key findings

  • Recursion, not latent composition. A LoRA can't compose two separately-stored facts internally (the "two-hop curse"), and stacking adapters destroys them. Externalize each hop as an atomic recall and let an executive chain them → multi-hop works.
  • Merging is dead for facts. TIES/DARE/linear merging of fact-adapters → 0–3.5% recall (destructive interference on sharp bindings). Route, don't merge.
  • Clustering helps twice — better routing and less intra-adapter interference.
  • The recall recipe is QA-pipeline-bound: small coherent shards + binding-preserving QA (answers restate the full fact) + ~20 pairs/fact. Allocation > binding > LR > N; rank ~irrelevant at small scale.
  • At scale, routing is the bottleneck, not recall fidelity (evidence: better QA didn't recover scale parity).
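
The recursion finding can be illustrated with a small control loop. Everything here is a stand-in: `recursive_recall` and the `toy_*` functions are hypothetical, the dictionaries replace trained adapters, the rule-based `toy_next` replaces the LLM executive, and the default depth of 3 is an assumption. The sketch only shows the shape of the chain: each hop is an atomic recall, and composition happens outside the weights.

```python
from typing import Callable, Optional


def recursive_recall(
    question: str,
    route: Callable[[str], list[str]],
    recall: Callable[[str, str], Optional[str]],
    next_question: Callable[[str, list[str]], Optional[str]],
    max_depth: int = 3,
) -> list[str]:
    """Chain atomic recalls until the executive needs no further fact."""
    facts: list[str] = []
    for _ in range(max_depth):
        sub_q = next_question(question, facts)  # executive picks the next hop
        if sub_q is None:
            break
        for shard in route(sub_q):              # re-route on every hop
            fact = recall(shard, sub_q)         # one adapter, one atomic fact
            if fact:
                facts.append(fact)
    return facts


# Toy stand-ins for trained adapters: shard id -> {sub-question: stored fact}.
SHARDS = {
    "films": {"Who directed Film X?": "Film X was directed by Jane Doe."},
    "people": {"Where was Jane Doe born?": "Jane Doe was born in Oslo."},
}


def toy_route(sub_q: str) -> list[str]:
    return ["films"] if "directed" in sub_q else ["people"]


def toy_recall(shard: str, sub_q: str) -> Optional[str]:
    return SHARDS[shard].get(sub_q)


def toy_next(question: str, facts: list[str]) -> Optional[str]:
    if not facts:
        return "Who directed Film X?"
    if len(facts) == 1:
        director = facts[0].rsplit("by ", 1)[1].rstrip(".")
        return f"Where was {director} born?"
    return None  # enough facts to compose the answer


facts = recursive_recall(
    "Where was the director of Film X born?", toy_route, toy_recall, toy_next
)
print(facts)
```

The second hop's question is built from the first hop's recalled fact, which is the step a single adapter cannot perform internally according to the first finding above.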

Clean results report: docs/RESULTS.md. Full raw log: docs/FINDINGS.md.

Repo layout

engram/            core library
  extract.py       text/sessions -> atomic memory triples (cloud extractor)
  qa.py            diverse, binding-preserving QA generation
  shards.py        coherent, size-bounded sharding + routing
  sessions.py      Claude Code session-transcript parser
  common.py        config + OpenRouter client
benchmarks/        the HippoRAG benchmark pipeline (Modal)
  load_data.py     download HippoRAG datasets into a Modal volume
  prep.py          subset -> extract facts + QA  (Stage A, local)
  qa_structured.py / select_augment.py   QA-pipeline variants
  train.py         embed -> shard -> train per-shard adapters (Stage B, GPU; checkpointed)
  reload.py        reload saved adapters -> recall / recursion (no retrain)
  score.py         reconstruct answers + EM/F1 vs gold + vanilla-RAG baseline (Stage C)
  fidelity.py      per-shard recall sweep (facts/shard x batch)
docs/
  ARCHITECTURE.md  system design
  RESULTS.md       clean results report
  FINDINGS.md      full experiment log / report
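
For readers unfamiliar with the EM/F1 metrics that score.py reports, here is a minimal sketch of the token-level scoring commonly used for extractive QA (SQuAD-style normalization). It is an assumption that score.py follows this convention; the function names below are illustrative and the repository's implementation may normalize differently.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Oslo.", "oslo"))
print(f1("The city of Oslo", "Oslo"))
```

F1 gives partial credit when a reconstructed answer contains the gold span plus extra words, which matters for a reader that restates full facts.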

Quickstart

Requires Python 3.11, a Modal account (GPU), and an OpenRouter API key.

pip install -r requirements.txt
echo "OPENROUTER_API_KEY=sk-..." > .env

# 1. load the benchmark datasets into a Modal volume
modal run benchmarks/load_data.py

# 2. extract facts + QA for a subset (local; reuses a passage cache)
python benchmarks/prep.py --dataset 2wiki --n 150

# 3. (recipe) augment QA to ~20 pairs/fact
python benchmarks/qa_structured.py --name 2wiki_n150 --out facts_qa20.json   # or qa.py

# 4. train one adapter per shard (GPU; saves adapters, resumable)
modal run benchmarks/train.py --name 2wiki_n150 --facts-file facts_qa20.json --train-only

# 5. recall + recursion on the saved adapters (no retrain)
modal run benchmarks/reload.py --name 2wiki_n150 --trained-dir trained_qa20_t40 --mode recurse

# 6. score (Engram vs vanilla RAG, per-type)
python benchmarks/score.py --recalled data/bench/2wiki_n150/reload_recurse_k3_d3.json --reader deepseek

Models (via OpenRouter): extractor/reasoner deepseek/deepseek-v4-flash; reader deepseek-v4-flash or qwen/qwen3-32b. Memory model: HuggingFaceTB/SmolLM2-360M (base).

Limitations

This is research-stage, and the limitations are first-class findings, not footnotes:

  1. Parity depends on ~20-QA fidelity. At 8-QA (cheap), recursion can't compose on noisier recall and Engram regresses.
  2. It doesn't scale yet. 150-Q parity (41≈43) breaks at 300-Q (Engram 29 vs RAG 42). Better QA fidelity did not recover it, so the scale bottleneck is routing (more shards → the embedding router misses the right shard). That is the next thing to fix (entity-index / oracle routing / stronger embeddings).
  3. QA generation is the throughput bottleneck for batch benchmarking (amortized per-fact at ingest in a real deployment).
  4. Evaluated only on 2Wiki so far; MuSiQue / full-1000 not yet run.

Roadmap

  • Routing ablation (oracle gold-shard vs entity-index vs embedding) — isolate the scale bottleneck (cheap, reload-only)
  • Entity→shard index for exact multi-hop addressability
  • Scale to full-1000 2Wiki; MuSiQue head-to-head vs MeMo (matched reader)
  • Online write path (streaming ingest, LSM-style compaction, eval-gated eviction)
  • Distill the cloud extractor to a local model
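
One simple way to quantify the routing bottleneck targeted by the first roadmap item is a shard hit rate: the fraction of questions whose gold shard appears in the router's top-k. The sketch below is a hypothetical measurement helper with made-up shard ids, not part of the repository. An oracle router scores 1.0 by construction, so the gap between the embedding router and 1.0 bounds what better routing could recover.

```python
def shard_hit_rate(routed: list[list[str]], gold: list[str], k: int) -> float:
    """Fraction of questions whose gold shard is among the top-k routed shards.

    routed[i] is the router's ranked shard list for question i;
    gold[i] is the shard that actually holds the needed fact.
    """
    if not gold:
        return 0.0
    hits = sum(1 for ranked, target in zip(routed, gold) if target in ranked[:k])
    return hits / len(gold)


# Toy rankings for four questions (shard ids are illustrative only).
routed = [
    ["s1", "s4", "s2"],
    ["s3", "s1", "s7"],
    ["s5", "s6", "s2"],
    ["s2", "s9", "s8"],
]
gold = ["s1", "s7", "s8", "s2"]

for k in (1, 3):
    print(f"hit@{k} = {shard_hit_rate(routed, gold, k):.2f}")
```

Tracking this number as the shard count grows would show whether the 150-Q to 300-Q regression coincides with falling hit rate.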

Related

  • MeMo — parametric memory with a 14B memory model (the bar we compare to)
  • HippoRAG / HippoRAG2 — graph-based RAG; benchmark + baseline
  • Parametric Memory Law; LoRA-as-knowledge; O-LoRA / InfLoRA (orthogonal subspaces)

License

MIT — see LICENSE.


Built as a solo research project. Results are honest and reproducible; the approach is promising but not production-ready. Contributions and replication welcome.
