eval: Arika memory-system adapter for HaluMem (Extraction + QA + lenient judge)#2

Open
johnz1019 wants to merge 6 commits into main from arika-eval-integration

@johnz1019

Collaborator

Scoring-side half of running Arika as a memory system on a HaluMem-shaped benchmark. Pairs with the Swift replay/integration in lay2dev/arika-app#82.

What this adds

  • arika artifact → results assembler (eval_arika.py): turns the Swift replay's artifact.jsonl into HaluMem *_eval_results.jsonl for three frames — arika (event/topic projection, retrieval-only QA via a neutral LLM), arika-e2e (Arika's own answer), arika-points (atomic-fact extraction baseline).
  • Frame registration + zero-count guards (evaluation.py): register the arika frames and guard the aggregations with _safe_ratio so the zero counts of the dropped Update task do not cause a division by zero.
  • Bare-JSON judge tolerance (llms.py): accept judge output without a json fence.
  • Arika-native dataset shape (evaluation.py): QA evidence may be a list of plain strings (not only {memory_content} dicts); memory_type accuracy seeds unseen types on demand (Arika uses Event Memory / Topic Memory); MAX_WORKERS env throttles the judge pool to share a rate-limited proxy with a concurrent replay.
  • Lenient (core-fact) QA judge (eval_tools.py, JUDGE_MODE=lenient): diagnostic-only variant that does not penalize extra, non-contradictory detail — quantifies how much strict "hallucination" is pure over-answering vs genuine error. Default scoring is unchanged (official strict rubric).

Companion PR

Swift replay (EvalRunner --mode halumem), dataset adapter, resilient checkpointed replay, and the concise answer style live in lay2dev/arika-app#82: https://github.com/lay2dev/arika-app/pull/82

Scope

Extraction + QA. Update task deferred. Default scores remain comparable to the official HaluMem rubric; the lenient judge is opt-in diagnostics.

🤖 Generated with Claude Code

johnz1019 and others added 6 commits June 16, 2026 22:48
gpt-5.5 (and others) emit unfenced JSON; the strict ```json``` regex
discarded valid judgments, marking every integrity/QA score invalid
(recall=0, qa=0 artifacts). Fall back to parsing the whole content, then the first {...} object.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
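
The fallback order described in this commit (fenced block, then whole content, then the first {...} object) could look roughly like the sketch below. It is an illustration under assumptions, not the code in llms.py: the function name parse_judge_json and the choice to return None on failure are hypothetical.

```python
import json
import re
from typing import Any, Optional

# Build the fence marker programmatically so this example can itself live in a fenced block.
_FENCE = "`" * 3
_FENCED_JSON_RE = re.compile(_FENCE + r"json\s*(.*?)\s*" + _FENCE, re.DOTALL)


def parse_judge_json(content: str) -> Optional[Any]:
    """Parse judge output, tolerating a missing json fence (hypothetical helper)."""
    # 1. Preferred path: a fenced json block.
    fenced = _FENCED_JSON_RE.search(content)
    if fenced:
        try:
            return json.loads(fenced.group(1))
        except json.JSONDecodeError:
            pass
    # 2. Fallback: the whole response is bare JSON.
    try:
        return json.loads(content.strip())
    except json.JSONDecodeError:
        pass
    # 3. Last resort: decode the first object that starts at the first "{".
    start = content.find("{")
    if start != -1:
        try:
            obj, _end = json.JSONDecoder().raw_decode(content, start)
            return obj
        except json.JSONDecodeError:
            pass
    return None  # caller decides how to score an unparseable judgment
```

Using raw_decode for the last step avoids a hand-written brace matcher: it stops at the end of the first complete JSON value and ignores any trailing prose.
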
- QA evidence can be a list of plain strings (Arika-native set) as well as the
  HaluMem-Medium list of {memory_content} dicts — handle both.
- memory_type accuracy seeds unseen types on demand (Arika uses "Event Memory"
  / "Topic Memory"), instead of KeyError on a fixed type set.
- max_workers reads MAX_WORKERS env so the LLM judge can be throttled to share
  a rate-limited proxy with a concurrent replay.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
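
The three changes in this commit can be sketched as below. This is a hedged illustration, not the code in evaluation.py: the helper names (evidence_texts, new_type_counters, resolve_max_workers), the counter layout, and the default of 8 workers are all assumptions. Only the memory_content key, the MAX_WORKERS variable, and the two Arika type names come from this PR.

```python
import os
from collections import defaultdict
from typing import Dict, List, Union

Evidence = Union[str, Dict[str, str]]


def evidence_texts(evidence: List[Evidence]) -> List[str]:
    """Accept both plain strings (Arika-native) and {memory_content} dicts (HaluMem-Medium)."""
    texts = []
    for item in evidence:
        if isinstance(item, str):
            texts.append(item)
        else:
            texts.append(item["memory_content"])
    return texts


def new_type_counters() -> Dict[str, Dict[str, int]]:
    """Per-type counters that seed an unseen memory_type on first access (no KeyError)."""
    return defaultdict(lambda: {"correct": 0, "total": 0})


def resolve_max_workers(default: int = 8) -> int:
    """Read MAX_WORKERS from the environment; fall back to `default` if unset or invalid."""
    raw = os.environ.get("MAX_WORKERS", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


counters = new_type_counters()
counters["Event Memory"]["total"] += 1  # type not pre-registered, still works
```

The value from resolve_max_workers would typically be passed as max_workers to a concurrent.futures.ThreadPoolExecutor, which is one way to cap concurrent judge calls against a shared proxy.
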
Diagnostic-only judge that keeps the official strict rubric by default but,
under JUDGE_MODE=lenient, scores an answer Correct when its core facts match
the reference even if it volunteers extra non-contradictory detail (only
contradictions, or asserting a definite fact when the reference is unknown,
count as Hallucination). Used to quantify how much strict "hallucination" is
pure over-answering vs genuine error. Does not change the default scores.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
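
As a rough picture of the opt-in switch, the rubric selection might look like the sketch below. It assumes the mode is read from the JUDGE_MODE environment variable and that anything other than lenient falls back to strict. The function name and both rubric strings are hypothetical: the strict string is a placeholder rather than the official HaluMem prompt, and the lenient string only paraphrases this commit message.

```python
import os
from typing import Mapping, Optional

# Placeholder only: the official strict rubric is not reproduced here.
STRICT_RUBRIC = "<official HaluMem strict QA rubric>"

# Paraphrase of the commit message, not the actual prompt in eval_tools.py.
LENIENT_RUBRIC = (
    "Score Correct when the answer's core facts match the reference, even if it "
    "adds extra non-contradictory detail. Score Hallucination only for "
    "contradictions, or for asserting a definite fact when the reference is unknown."
)


def select_judge_rubric(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the lenient rubric only when JUDGE_MODE=lenient; strict otherwise."""
    env = os.environ if env is None else env
    mode = env.get("JUDGE_MODE", "strict").strip().lower()
    return LENIENT_RUBRIC if mode == "lenient" else STRICT_RUBRIC
```

Treating every unrecognized value as strict means a typo in JUDGE_MODE cannot silently change the default, comparable scores.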