Working name (TBD). An engram is the physical trace a memory leaves in neural tissue, and this project does the analogous thing: it encodes facts into network weights.
Engram stores an agent's long-term memory inside small LoRA adapters instead of (only) a vector database. Facts are extracted into atomic triples, sharded by topic, baked into one LoRA adapter per shard on a tiny base model, and recalled by routing a query to the right shard(s). Multi-hop questions are answered by recursive retrieval — recall → ask a sub-question → re-route → recall → compose — not by asking the weights to reason.
On HippoRAG's 2WikiMultiHopQA (150-question subset, same reader), a 360M parametric memory comes within 2 F1 points of vanilla RAG on multi-hop QA:
| method | F1 |
|---|---|
| Engram, single-pass | 31 |
| Engram + recursion | 41 |
| vanilla RAG (same reader) | 43 |
| MeMo (14B memory model, MuSiQue ref) | 48 |
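A 360M model reaches near-parity with RAG on multi-hop (41 vs 43 F1), where the comparable parametric-memory paper (MeMo) used a 14B memory model (~40× larger). The MeMo row is a MuSiQue reference figure, so it is not a like-for-like comparison. Recursion is the +10 unlock (31 → 41 over single-pass).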
Honest caveat (see Limitations): this parity depends on high-fidelity QA (~20 pairs/fact) and does not yet scale — at 300 questions it regresses (29 vs 42), and the bottleneck is routing (more shards → the router misses), not recall. This is a research result, not a production system.
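For readers unfamiliar with the metric, here is a minimal sketch of token-overlap F1 and exact match as commonly computed for extractive QA. The README does not spell out how benchmarks/score.py normalizes answers, so the normalization below (lowercasing, stripping punctuation and English articles) is an assumption, and the function names are illustrative rather than the project's API.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized token sequences are identical."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("Paris France", "Paris"))
print(exact_match("the Paris.", "paris"))
```

A benchmark-level score would then be the mean of `token_f1` over all questions, scaled to 0–100.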
Traditional RAG retrieves text chunks into the context window on every query. Engram bakes facts into weights, so atomic recall needs zero retrieved context — cheaper at scale, and the recall is a learned, precise binding rather than a fuzzy chunk match. The index does routing and relationship reasoning; the LoRA is a compressed recall cache.
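As a rough illustration of the routing role the index plays, the sketch below ranks shards by their best-matching fact embedding and returns the top k. It is an assumption-laden toy: the `route` function, the `fact_index` layout, and the 2-d vectors are hypothetical, and a real index would use embeddings from a sentence encoder rather than hand-written numbers.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(query_vec, fact_index, k=3):
    """Return up to k shard ids, ranked by each shard's best-matching fact.

    fact_index: list of (fact_embedding, shard_id) pairs, one per stored fact.
    """
    best = {}
    for fact_vec, shard_id in fact_index:
        score = cosine(query_vec, fact_vec)
        if shard_id not in best or score > best[shard_id]:
            best[shard_id] = score
    ranked = sorted(best, key=best.get, reverse=True)
    return ranked[:k]


# Toy 2-d embeddings (hypothetical values, for illustration only).
fact_index = [
    ([1.0, 0.0], "shard_people"),
    ([0.9, 0.1], "shard_people"),
    ([0.0, 1.0], "shard_places"),
    ([0.7, 0.7], "shard_films"),
]
print(route([1.0, 0.2], fact_index, k=2))
```

Scoring a shard by its single best fact, rather than by a shard centroid, is one way to read "per-fact embedding index"; other aggregation choices are possible.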
WRITE:  text ──▶ extract atomic triples ──▶ generate diverse QA ──▶ cluster (k-means)
             ──▶ train one LoRA adapter per shard (eval-gated)
READ:   query ──▶ route to top-k shards (per-fact embedding index)
              ──▶ recall from each adapter (in parallel, never stacked)
              ──▶ reconstruct answer
        multi-hop ──▶ recursive: recall ──▶ "need another fact?" ──▶ re-route ──▶ recall ──▶ compose
See docs/ARCHITECTURE.md for the full design (tiered memory hierarchy, sharding-as-LSM-tree, mixed-adapter serving, economics).
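The multi-hop read path can be sketched as a small loop in which every hop is a separate atomic recall and an external "executive" decides whether another fact is needed. This is a minimal sketch under stated assumptions, not the project's implementation: `recursive_recall`, its callable parameters, and the toy stand-ins (dictionary lookups in place of LoRA adapters and an LLM executive) are all hypothetical.

```python
from typing import Callable, Optional


def recursive_recall(
    question: str,
    route: Callable[[str], list[str]],
    recall: Callable[[str, str], Optional[str]],
    next_question: Callable[[str, list[str]], Optional[str]],
    max_depth: int = 3,
) -> list[str]:
    """Collect facts hop by hop; each hop is a separate atomic recall."""
    facts: list[str] = []
    query: Optional[str] = question
    for _ in range(max_depth):
        if query is None:
            break
        # Query each routed shard independently (adapters are never stacked).
        for shard_id in route(query):
            fact = recall(shard_id, query)
            if fact and fact not in facts:
                facts.append(fact)
        # The executive decides whether another fact is needed.
        query = next_question(question, facts)
    return facts


# Toy stand-ins (hypothetical): keyword lookups replace adapters and the LLM.
SHARDS = {
    "films": ("Inception", "Inception was directed by Christopher Nolan."),
    "people": ("Christopher Nolan", "Christopher Nolan was born in London."),
}


def toy_route(query: str) -> list[str]:
    return [sid for sid, (key, _) in SHARDS.items() if key in query]


def toy_recall(shard_id: str, query: str) -> Optional[str]:
    key, fact = SHARDS[shard_id]
    return fact if key in query else None


def toy_next_question(question: str, facts: list[str]) -> Optional[str]:
    if any("born in" in f for f in facts):
        return None  # enough facts to compose the answer
    if any("directed by Christopher Nolan" in f for f in facts):
        return "Where was Christopher Nolan born?"
    return None


facts = recursive_recall(
    "Where was the director of Inception born?",
    toy_route, toy_recall, toy_next_question,
)
print(facts)
```

The final compose step is left out here: in the pipeline described above, a reader model would turn the collected facts into the answer.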
- Recursion, not latent composition. A LoRA can't compose two separately-stored facts internally (the "two-hop curse"), and stacking adapters destroys them. Externalize each hop as an atomic recall and let an executive chain them → multi-hop works.
- Merging is dead for facts. TIES/DARE/linear merging of fact-adapters → 0–3.5% recall (destructive interference on sharp bindings). Route, don't merge.
- Clustering helps twice — better routing and less intra-adapter interference.
- The recall recipe is QA-pipeline-bound: small coherent shards + binding-preserving QA (answers restate the full fact) + ~20 pairs/fact. Allocation > binding > LR > N; rank ~irrelevant at small scale.
- At scale, routing is the bottleneck, not recall fidelity (evidence: better QA didn't recover parity at 300 questions).
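To make "binding-preserving QA" concrete, the sketch below turns one atomic triple into several question/answer pairs in which every answer restates the full fact. It is only an illustration: the `Triple` class and the fixed templates are hypothetical, and the actual pipeline is described as generating diverse phrasings (~20 pairs/fact) rather than filling three templates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str

    def statement(self) -> str:
        """The full fact as one sentence."""
        return f"{self.subject} {self.relation} {self.obj}."


# Hypothetical question templates, standing in for LLM-generated phrasings.
TEMPLATES = [
    "How are {subject} and {obj} related?",
    "{subject} {relation} whom or what?",
    "What {relation} {obj}?",
]


def binding_preserving_qa(triple: Triple) -> list[dict[str, str]]:
    """Every answer restates the whole fact, so subject, relation and object stay bound."""
    answer = triple.statement()
    return [
        {
            "question": template.format(
                subject=triple.subject, relation=triple.relation, obj=triple.obj
            ),
            "answer": answer,
        }
        for template in TEMPLATES
    ]


pairs = binding_preserving_qa(
    Triple("Inception", "was directed by", "Christopher Nolan")
)
for pair in pairs:
    print(pair["question"], "->", pair["answer"])
```

The contrast is with short-answer QA (answer: "Christopher Nolan"), where the adapter is trained to emit only the object and never sees the binding restated.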
Clean results report: docs/RESULTS.md. Full raw log: docs/FINDINGS.md.
engram/                 core library
    extract.py          text/sessions -> atomic memory triples (cloud extractor)
    qa.py               diverse, binding-preserving QA generation
    shards.py           coherent, size-bounded sharding + routing
    sessions.py         Claude Code session-transcript parser
    common.py           config + OpenRouter client
benchmarks/             the HippoRAG benchmark pipeline (Modal)
    load_data.py        download HippoRAG datasets into a Modal volume
    prep.py             subset -> extract facts + QA (Stage A, local)
    qa_structured.py / select_augment.py    QA-pipeline variants
    train.py            embed -> shard -> train per-shard adapters (Stage B, GPU; checkpointed)
    reload.py           reload saved adapters -> recall / recursion (no retrain)
    score.py            reconstruct answers + EM/F1 vs gold + vanilla-RAG baseline (Stage C)
    fidelity.py         per-shard recall sweep (facts/shard x batch)
docs/
    ARCHITECTURE.md     system design
    FINDINGS.md         full experiment log / report
Requires Python 3.11, a Modal account (GPU), and an OpenRouter API key.
pip install -r requirements.txt
echo "OPENROUTER_API_KEY=sk-..." > .env
# 1. load the benchmark datasets into a Modal volume
modal run benchmarks/load_data.py
# 2. extract facts + QA for a subset (local; reuses a passage cache)
python benchmarks/prep.py --dataset 2wiki --n 150
# 3. (recipe) augment QA to ~20 pairs/fact
python benchmarks/qa_structured.py --name 2wiki_n150 --out facts_qa20.json # or qa.py
# 4. train one adapter per shard (GPU; saves adapters, resumable)
modal run benchmarks/train.py --name 2wiki_n150 --facts-file facts_qa20.json --train-only
# 5. recall + recursion on the saved adapters (no retrain)
modal run benchmarks/reload.py --name 2wiki_n150 --trained-dir trained_qa20_t40 --mode recurse
# 6. score (Engram vs vanilla RAG, per-type)
python benchmarks/score.py --recalled data/bench/2wiki_n150/reload_recurse_k3_d3.json --reader deepseek

Models (via OpenRouter): extractor/reasoner deepseek/deepseek-v4-flash; reader deepseek-v4-flash or qwen/qwen3-32b. Memory model: HuggingFaceTB/SmolLM2-360M (base).
This is research-stage, and the limitations are first-class findings, not footnotes:
- Parity depends on ~20-QA fidelity. At 8-QA (cheap), recursion can't compose on noisier recall and Engram regresses.
- It doesn't scale yet. 150-Q parity (41≈43) breaks at 300-Q (29 vs 42). Better QA fidelity did not recover it → the scale bottleneck is routing (more shards → the embedding router misses the right shard), which is the next thing to fix (entity-index / oracle routing / stronger embeddings).
- QA generation is the throughput bottleneck for batch benchmarking (amortized per-fact at ingest in a real deployment).
- Evaluated only on 2Wiki so far; MuSiQue / full-1000 not yet run.
- Routing ablation (oracle gold-shard vs entity-index vs embedding) — isolate the scale bottleneck (cheap, reload-only)
- Entity→shard index for exact multi-hop addressability
- Scale to full-1000 2Wiki; MuSiQue head-to-head vs MeMo (matched reader)
- Online write path (streaming ingest, LSM-style compaction, eval-gated eviction)
- Distill the cloud extractor to a local model
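One way the routing ablation could be quantified is a shard hit rate at k: the fraction of recall queries whose gold shard appears in the router's top-k. The sketch below is an assumption about how such a measurement might look, not an existing script in the repo; `hit_rate_at_k` and the sample data are hypothetical, and an oracle router scores 1.0 by construction.

```python
def hit_rate_at_k(routed, gold, k):
    """Fraction of queries whose gold shard is among the router's top-k shards.

    routed: list of ranked shard-id lists, one per query.
    gold:   list of the shard id that actually holds the needed fact.
    """
    if len(routed) != len(gold):
        raise ValueError("routed and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for shards, g in zip(routed, gold) if g in shards[:k])
    return hits / len(gold)


# Hypothetical router outputs for four recall queries.
routed = [
    ["s1", "s4", "s2"],
    ["s3", "s1", "s9"],
    ["s7", "s8", "s2"],
    ["s5", "s6", "s1"],
]
gold = ["s1", "s9", "s3", "s6"]

for k in (1, 3):
    print(f"hit@{k} = {hit_rate_at_k(routed, gold, k):.2f}")
```

Comparing this number across routers (embedding vs entity-index) at 150 and 300 questions would show whether misses grow with shard count, independently of adapter recall.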
- MeMo — parametric memory with a 14B memory model (the bar we compare to)
- HippoRAG / HippoRAG2 — graph-based RAG; benchmark + baseline
- Parametric Memory Law; LoRA-as-knowledge; O-LoRA / InfLoRA (orthogonal subspaces)
MIT — see LICENSE.
Built as a solo research project. Results are honest and reproducible; the approach is promising but not production-ready. Contributions and replication welcome.