Engram — parametric agent memory in LoRA shards

Working name (TBD). An engram is the physical trace a memory leaves in neural tissue — which is literally what this does: encode facts into network weights.

Engram stores an agent's long-term memory inside small LoRA adapters instead of (only) a vector database. Facts are extracted into atomic triples, sharded by topic, baked into one LoRA adapter per shard on a tiny base model, and recalled by routing a query to the right shard(s). Multi-hop questions are answered by recursive retrieval — recall → ask a sub-question → re-route → recall → compose — not by asking the weights to reason.
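To make "atomic triples" and binding-preserving QA concrete, here is a minimal sketch of the data shapes involved. The `Triple` class, the `binding_preserving_qa` helper, and the question templates are hypothetical illustrations, not the code in engram/extract.py or engram/qa.py; the real pipeline generates diverse QA with a cloud model, and templates stand in for that here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One atomic fact: subject, relation, object (hypothetical shape)."""
    subject: str
    relation: str
    obj: str

    def as_sentence(self) -> str:
        return f"{self.subject} {self.relation} {self.obj}."


def binding_preserving_qa(t: Triple) -> list[tuple[str, str]]:
    """Return (question, answer) pairs whose answers restate the full fact.

    Template questions are a stand-in for LLM-generated diverse questions.
    """
    full = t.as_sentence()
    return [
        (f"What is known about {t.subject}?", full),
        (f"{t.subject} {t.relation} what?", full),
        (f"Which entity {t.relation} {t.obj}?", full),
    ]


fact = Triple("Marie Curie", "was born in", "Warsaw")
pairs = binding_preserving_qa(fact)
for question, answer in pairs:
    print(question, "->", answer)
```

Every answer repeats subject, relation, and object together, which is the property the recall recipe below calls binding-preserving.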

TL;DR result

On HippoRAG's 2WikiMultiHopQA (150-question subset, same reader), a 360M-parameter parametric memory roughly matches vanilla RAG on multi-hop QA (41 vs 43 F1):

method	F1
Engram, single-pass	31
Engram + recursion	41
vanilla RAG (same reader)	43
MeMo (14B memory model, MuSiQue ref)	48

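A 360M model reaches near-parity with RAG on multi-hop QA, where the comparable parametric-memory paper (MeMo) used a 14B memory model (~40× larger). The MeMo figure is a MuSiQue reference number, not a same-benchmark comparison. Recursion accounts for the gain: +10 F1 over single-pass (31 → 41).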

Honest caveat (see Limitations): this parity depends on high-fidelity QA (~20 pairs/fact) and does not yet scale — at 300 questions it regresses (29 vs 42), and the bottleneck is routing (more shards → the router misses), not recall. This is a research result, not a production system.

Why parametric memory?

Traditional RAG retrieves text chunks into the context window on every query. Engram bakes facts into weights, so atomic recall needs zero retrieved context — cheaper at scale, and the recall is a learned, precise binding rather than a fuzzy chunk match. The index does routing and relationship reasoning; the LoRA is a compressed recall cache.

How it works

 WRITE:  text ──▶ extract atomic triples ──▶ generate diverse QA ──▶ cluster (k-means)
                ──▶ train one LoRA adapter per shard (eval-gated)

 READ:   query ──▶ route to top-k shards (per-fact embedding index)
                ──▶ recall from each adapter (in parallel, never stacked)
                ──▶ reconstruct answer
         multi-hop ──▶ recursive: recall ──▶ "need another fact?" ──▶ re-route ──▶ recall ──▶ compose

See docs/ARCHITECTURE.md for the full design (tiered memory hierarchy, sharding-as-LSM-tree, mixed-adapter serving, economics).
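The READ path's first step, routing a query to its top-k shards through a per-fact embedding index, might look roughly like the sketch below. The `route` function, the toy 3-dimensional vectors, and the choice to score a shard by its best-matching fact are assumptions for illustration; the actual router in engram/shards.py may differ.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def route(query_vec, fact_index, k=2):
    """Return the k shard ids whose best-matching fact is closest to the query.

    fact_index is a list of (fact_embedding, shard_id) pairs, one per fact.
    Scoring a shard by its single best fact is an assumption of this sketch.
    """
    best: dict[str, float] = {}
    for vec, shard in fact_index:
        score = cosine(query_vec, vec)
        if score > best.get(shard, float("-inf")):
            best[shard] = score
    return sorted(best, key=best.get, reverse=True)[:k]


# Toy index: two facts in shard "geo", one each in "film" and "music".
fact_index = [
    ([1.0, 0.0, 0.0], "geo"),
    ([0.9, 0.1, 0.0], "geo"),
    ([0.0, 1.0, 0.0], "film"),
    ([0.0, 0.0, 1.0], "music"),
]
top_shards = route([0.8, 0.3, 0.0], fact_index, k=2)
print(top_shards)
```

Each returned shard id would then select one adapter to query; per the diagram, adapters are queried in parallel and never stacked.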

Key findings

Recursion, not latent composition. A LoRA can't compose two separately-stored facts internally (the "two-hop curse"), and stacking adapters destroys them. Externalize each hop as an atomic recall and let an executive chain them → multi-hop works.
Merging is dead for facts. TIES/DARE/linear merging of fact-adapters → 0–3.5% recall (destructive interference on sharp bindings). Route, don't merge.
Clustering helps twice — better routing and less intra-adapter interference.
The recall recipe is QA-pipeline-bound: small coherent shards + binding-preserving QA (answers restate the full fact) + ~20 pairs/fact. In order of impact: allocation > binding > learning rate > N; LoRA rank is nearly irrelevant at small scale.
At scale, routing is the bottleneck, not recall fidelity (evidence: better QA did not recover parity at 300 questions).
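The first finding, externalizing each hop as an atomic recall and letting an executive chain them, can be sketched as a small loop. Everything here is a stand-in: `recursive_answer`, the dictionary-backed `recall`, the hard-coded `next_question` executive, and the example facts are hypothetical. In the real system, recall goes through routing plus a LoRA adapter, and the executive is an LLM reasoner.

```python
from typing import Callable, Optional


def recursive_answer(
    question: str,
    recall: Callable[[str], Optional[str]],
    next_question: Callable[[str, list[str]], Optional[str]],
    compose: Callable[[str, list[str]], str],
    max_depth: int = 3,
) -> str:
    """Chain atomic recalls: ask a sub-question, recall one fact, repeat."""
    facts: list[str] = []
    for _ in range(max_depth):
        sub_q = next_question(question, facts)  # executive: need another fact?
        if sub_q is None:
            break
        fact = recall(sub_q)  # one hop: route + atomic recall
        if fact is None:
            break
        facts.append(fact)
    return compose(question, facts)


# Toy memory standing in for routed adapter recall.
MEMORY = {
    "Who directed Film X?": "Film X was directed by Jane Doe.",
    "Where was Jane Doe born?": "Jane Doe was born in Oslo.",
}


def toy_executive(question: str, facts: list[str]) -> Optional[str]:
    # Hard-coded plan for the example; a real executive would be a model call.
    plan = ["Who directed Film X?", "Where was Jane Doe born?"]
    return plan[len(facts)] if len(facts) < len(plan) else None


result = recursive_answer(
    "Where was the director of Film X born?",
    recall=MEMORY.get,
    next_question=toy_executive,
    compose=lambda q, facts: " ".join(facts),
)
print(result)
```

The weights never compose two facts internally: each hop is a separate single-fact lookup, and only the executive sees both.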

Clean results report: docs/RESULTS.md. Full raw log: docs/FINDINGS.md.

Repo layout

engram/            core library
  extract.py       text/sessions -> atomic memory triples (cloud extractor)
  qa.py            diverse, binding-preserving QA generation
  shards.py        coherent, size-bounded sharding + routing
  sessions.py      Claude Code session-transcript parser
  common.py        config + OpenRouter client
benchmarks/        the HippoRAG benchmark pipeline (Modal)
  load_data.py     download HippoRAG datasets into a Modal volume
  prep.py          subset -> extract facts + QA  (Stage A, local)
  qa_structured.py / select_augment.py   QA-pipeline variants
  train.py         embed -> shard -> train per-shard adapters (Stage B, GPU; checkpointed)
  reload.py        reload saved adapters -> recall / recursion (no retrain)
  score.py         reconstruct answers + EM/F1 vs gold + vanilla-RAG baseline (Stage C)
  fidelity.py      per-shard recall sweep (facts/shard x batch)
docs/
  ARCHITECTURE.md  system design
  RESULTS.md       clean results report
  FINDINGS.md      full experiment log / report

Quickstart

Requires Python 3.11, a Modal account (GPU), and an OpenRouter API key.

pip install -r requirements.txt
echo "OPENROUTER_API_KEY=sk-..." > .env

# 1. load the benchmark datasets into a Modal volume
modal run benchmarks/load_data.py

# 2. extract facts + QA for a subset (local; reuses a passage cache)
python benchmarks/prep.py --dataset 2wiki --n 150

# 3. (recipe) augment QA to ~20 pairs/fact
python benchmarks/qa_structured.py --name 2wiki_n150 --out facts_qa20.json   # or qa.py

# 4. train one adapter per shard (GPU; saves adapters, resumable)
modal run benchmarks/train.py --name 2wiki_n150 --facts-file facts_qa20.json --train-only

# 5. recall + recursion on the saved adapters (no retrain)
modal run benchmarks/reload.py --name 2wiki_n150 --trained-dir trained_qa20_t40 --mode recurse

# 6. score (Engram vs vanilla RAG, per-type)
python benchmarks/score.py --recalled data/bench/2wiki_n150/reload_recurse_k3_d3.json --reader deepseek

Models (via OpenRouter): extractor/reasoner deepseek/deepseek-v4-flash; reader deepseek-v4-flash or qwen/qwen3-32b. Memory model: HuggingFaceTB/SmolLM2-360M (base).
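Step 6 reports EM/F1 against gold answers. For readers unfamiliar with these metrics, the sketch below shows the token-level definitions commonly used in SQuAD-style QA evaluation. It is an assumption that benchmarks/score.py uses this normalization; the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the Warsaw.", "Warsaw"))
print(f1("The city of Warsaw", "Warsaw"))
```

Under this definition a verbose but correct answer loses precision, so how the reader phrases its reconstruction affects F1 even when the recalled fact is right.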

Limitations

This is research-stage, and the limitations are first-class findings, not footnotes:

Parity depends on ~20-QA fidelity. At 8-QA (cheap), recursion can't compose on noisier recall and Engram regresses.
It doesn't scale yet. 150-Q parity (41≈43) breaks at 300-Q (29 vs 42). Better QA fidelity did not recover it → the scale bottleneck is routing (more shards → the embedding router misses the right shard), which is the next thing to fix (entity-index / oracle routing / stronger embeddings).
QA generation is the throughput bottleneck for batch benchmarking (amortized per-fact at ingest in a real deployment).
Evaluated only on 2Wiki so far; MuSiQue / full-1000 not yet run.
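One way to check the routing-bottleneck claim directly is to measure how often the router's top-k shards contain every shard a question needs. The sketch below is a hypothetical metric, not part of the repo: `shard_hit_rate` and the toy shard ids are assumptions, and it presumes gold shard labels are available for each question.

```python
def shard_hit_rate(routed: list[list[str]], gold: list[set[str]]) -> float:
    """Fraction of questions whose required shards all appear in the routed top-k.

    routed[i] is the router's top-k shard ids for question i;
    gold[i] is the set of shards that hold the facts question i needs.
    """
    if not gold:
        return 0.0
    hits = sum(g <= set(r) for r, g in zip(routed, gold))
    return hits / len(gold)


routed = [["a", "b"], ["c", "d"], ["a", "c"]]
gold = [{"a"}, {"c", "e"}, {"c"}]
rate = shard_hit_rate(routed, gold)
print(f"{rate:.2f}")
```

If this rate drops as the shard count grows while per-shard recall stays flat, that would be consistent with routing, not recall, limiting scale.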

Roadmap

Routing ablation (oracle gold-shard vs entity-index vs embedding) — isolate the scale bottleneck (cheap, reload-only)
Entity→shard index for exact multi-hop addressability
Scale to full-1000 2Wiki; MuSiQue head-to-head vs MeMo (matched reader)
Online write path (streaming ingest, LSM-style compaction, eval-gated eviction)
Distill the cloud extractor to a local model

License

MIT — see LICENSE.

Built as a solo research project. Results are honest and reproducible; the approach is promising but not production-ready. Contributions and replication welcome.
