eval: Arika memory-system adapter for HaluMem (Extraction + QA + lenient judge) by johnz1019 · Pull Request #2 · lay2dev/HaluMem

johnz1019 · 2026-06-19T05:30:09Z

Scoring-side half of running Arika as a memory system on a HaluMem-shaped benchmark. Pairs with the Swift replay/integration in lay2dev/arika-app#82.

What this adds

arika artifact → results assembler (eval_arika.py): turns the Swift replay's artifact.jsonl into HaluMem *_eval_results.jsonl for three frames — arika (event/topic projection, retrieval-only QA via a neutral LLM), arika-e2e (Arika's own answer), arika-points (atomic-fact extraction baseline).
Frame registration + zero-count guards (evaluation.py): register the arika frames and wrap the aggregations in _safe_ratio so the dropped Update task, whose counts are all zero, does not cause a division by zero.
Bare-JSON judge tolerance (llms.py): accept judge output without a json fence.
Arika-native dataset shape (evaluation.py): QA evidence may be a list of plain strings (not only {memory_content} dicts); memory_type accuracy seeds unseen types on demand (Arika uses Event Memory / Topic Memory); MAX_WORKERS env throttles the judge pool to share a rate-limited proxy with a concurrent replay.
Lenient (core-fact) QA judge (eval_tools.py, JUDGE_MODE=lenient): diagnostic-only variant that does not penalize extra, non-contradictory detail — quantifies how much strict "hallucination" is pure over-answering vs genuine error. Default scoring is unchanged (official strict rubric).

Companion PR

Swift replay (EvalRunner --mode halumem), dataset adapter, resilient checkpointed replay, and the concise answer style live in lay2dev/arika-app#82 — https://github.com/lay2dev/arika-app/pull/82

Scope

Extraction + QA. Update task deferred. Default scores remain comparable to the official HaluMem rubric; the lenient judge is opt-in diagnostics.

🤖 Generated with Claude Code

gpt-5.5 (and others) emit unfenced JSON; the strict ```json``` regex discarded valid judgments, marking every integrity/QA score invalid (recall=0, qa=0 artifacts). Fall back to parsing the whole content, then the first {...} object.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
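The fallback order in this commit message (fenced block, then whole content, then first `{...}`) could look roughly like the sketch below. The function name `parse_judge_json` and the `result` key in the usage lines are hypothetical; the actual code in llms.py is not shown in this PR page.

```python
import json
import re
from typing import Any, Optional

# Matches a ```json ... ``` fenced block anywhere in the judge output.
_FENCE_RE = re.compile(r"```json\s*(.*?)\s*```", re.DOTALL)


def parse_judge_json(content: str) -> Optional[Any]:
    """Try fenced JSON, then the whole content, then the first {...} object."""
    # 1. Fenced block (the only form the strict regex accepted).
    fenced = _FENCE_RE.search(content)
    if fenced:
        try:
            return json.loads(fenced.group(1))
        except json.JSONDecodeError:
            pass

    # 2. The whole content as bare JSON.
    try:
        return json.loads(content.strip())
    except json.JSONDecodeError:
        pass

    # 3. The first JSON object, starting at the first "{".
    start = content.find("{")
    if start != -1:
        try:
            obj, _end = json.JSONDecoder().raw_decode(content, start)
            return obj
        except json.JSONDecodeError:
            pass

    return None  # caller marks the judgment invalid


bare = parse_judge_json('{"result": "Correct"}')
wrapped = parse_judge_json('Verdict: {"result": "Hallucination"} (end)')
```

Step 3 uses `raw_decode` so that trailing prose after the object is ignored; it does not skip past a malformed first brace to look for a later object.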

- QA evidence can be a list of plain strings (Arika-native set) as well as the HaluMem-Medium list of {memory_content} dicts; handle both.
- memory_type accuracy seeds unseen types on demand (Arika uses "Event Memory" / "Topic Memory"), instead of raising KeyError on a fixed type set.
- max_workers reads the MAX_WORKERS env var so the LLM judge can be throttled to share a rate-limited proxy with a concurrent replay.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
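The three changes in this commit might be sketched as follows. The helper names (`evidence_texts`, `new_type_counts`, `judge_pool_size`), the counter layout, and the fallback pool size of 8 are assumptions; only `memory_content`, the two memory type names, and `MAX_WORKERS` come from the PR text.

```python
import os
from collections import defaultdict
from typing import Dict, List, Union

Evidence = Union[str, Dict[str, str]]


def evidence_texts(evidence: List[Evidence]) -> List[str]:
    """Accept both plain strings and {"memory_content": ...} dicts."""
    texts = []
    for item in evidence:
        if isinstance(item, str):
            texts.append(item)                    # Arika-native shape
        else:
            texts.append(item["memory_content"])  # HaluMem-Medium shape
    return texts


def new_type_counts():
    """Per-type counters that seed unseen memory types on first access."""
    return defaultdict(lambda: {"correct": 0, "total": 0})


def judge_pool_size(default: int = 8) -> int:
    """Read MAX_WORKERS from the environment; the default is an assumption."""
    return max(1, int(os.environ.get("MAX_WORKERS", default)))


counts = new_type_counts()
counts["Event Memory"]["total"] += 1  # no KeyError for a type not listed up front
```

A run that shares the proxy with a replay would then be started with something like `MAX_WORKERS=2` in the environment.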

Diagnostic-only judge that keeps the official strict rubric by default but, under JUDGE_MODE=lenient, scores an answer Correct when its core facts match the reference even if it volunteers extra non-contradictory detail (only contradictions, or asserting a definite fact when the reference is unknown, count as Hallucination). Used to quantify how much strict "hallucination" is pure over-answering vs genuine error. Does not change the default scores.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
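One way to turn the strict and lenient verdicts into the over-answering figure mentioned above is to count how many strict Hallucination verdicts flip to Correct under the lenient judge. This is a hypothetical post-processing sketch, not part of the PR: the function name and the assumption that both runs yield one verdict string per question, in the same order, are illustrative.

```python
from typing import List


def over_answering_share(strict: List[str], lenient: List[str]) -> float:
    """Fraction of strict Hallucination verdicts that lenient scoring marks Correct."""
    if len(strict) != len(lenient):
        raise ValueError("strict and lenient verdict lists must align")

    # Indices the strict rubric flagged as hallucinated.
    flagged = [i for i, verdict in enumerate(strict) if verdict == "Hallucination"]
    if not flagged:
        return 0.0

    # Of those, how many are Correct once extra detail is not penalized.
    flipped = sum(1 for i in flagged if lenient[i] == "Correct")
    return flipped / len(flagged)


strict_verdicts = ["Correct", "Hallucination", "Hallucination", "Hallucination"]
lenient_verdicts = ["Correct", "Correct", "Hallucination", "Correct"]
share = over_answering_share(strict_verdicts, lenient_verdicts)  # 2 of 3 flipped
```

The remaining share (verdicts that stay Hallucination under both modes) approximates the genuine-error portion.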

johnz1019 and others added 6 commits June 16, 2026 22:48

eval: add HaluMem-medium smoke-subset helper

4f73fdc

eval: add arika artifact->results assembler

79fff29

eval: register arika frames + guard zero-count divisions

cc4b0a3

johnz1019 wants to merge 6 commits into main from arika-eval-integration
