eval: Arika memory-system adapter for HaluMem (Extraction + QA + lenient judge) #2
Open
johnz1019 wants to merge 6 commits into
gpt-5.5 (and others) emit unfenced JSON; the strict ```json``` regex
discarded valid judgments, marking every integrity/QA score invalid
(recall=0, qa=0 artifacts). Fall back to parsing the whole content, then the first {...} span.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
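The fallback order described in this commit (fenced block, then whole content, then a brace-delimited span) can be sketched as follows. This is a minimal illustration, not the code in `llms.py`: the function name `parse_judge_json` is hypothetical, and taking the span from the first `{` to the last `}` is an assumption about what "first {...}" means.

```python
import json
import re

# Build the fence marker programmatically so this listing stays easy to embed.
FENCE = "`" * 3
_FENCED_JSON = re.compile(FENCE + r"json\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_judge_json(content: str):
    """Parse judge output, tolerating a missing json fence.

    Hypothetical helper: tries the fenced block, then the whole content,
    then the span from the first '{' to the last '}'.
    Returns the parsed value, or None if no candidate parses.
    """
    candidates = []

    fenced = _FENCED_JSON.search(content)
    if fenced:
        candidates.append(fenced.group(1))

    # Unfenced output: the whole reply may already be valid JSON.
    candidates.append(content.strip())

    # Last resort: JSON embedded in surrounding prose.
    start, end = content.find("{"), content.rfind("}")
    if start != -1 and end > start:
        candidates.append(content[start:end + 1])

    for text in candidates:
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            continue
    return None
```

Returning `None` instead of raising lets the caller decide whether an unparseable judgment is marked invalid or retried.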
- QA evidence can be a list of plain strings (Arika-native set) as well as the
HaluMem-Medium list of {memory_content} dicts — handle both.
- memory_type accuracy seeds unseen types on demand (Arika uses "Event Memory"
/ "Topic Memory"), instead of KeyError on a fixed type set.
- max_workers reads MAX_WORKERS env so the LLM judge can be throttled to share
a rate-limited proxy with a concurrent replay.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
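The three robustness fixes in this commit could look roughly like the sketch below. All function names and the default pool size of 8 are hypothetical and chosen for illustration; only the `memory_content` key, the `MAX_WORKERS` variable, and the memory-type names come from the commit message.

```python
import os
from collections import defaultdict


def evidence_texts(evidence):
    """Normalize QA evidence to a list of strings.

    Accepts plain strings (Arika-native set) and {memory_content} dicts
    (HaluMem-Medium); anything else is skipped.
    """
    texts = []
    for item in evidence or []:
        if isinstance(item, str):
            texts.append(item)
        elif isinstance(item, dict) and "memory_content" in item:
            texts.append(item["memory_content"])
    return texts


def memory_type_accuracy(records):
    """Per-type accuracy from (memory_type, is_correct) pairs.

    The defaultdict seeds a counter the first time a type is seen, so an
    unexpected type such as "Event Memory" cannot raise KeyError.
    """
    counts = defaultdict(lambda: {"correct": 0, "total": 0})
    for memory_type, is_correct in records:
        counts[memory_type]["total"] += 1
        counts[memory_type]["correct"] += int(bool(is_correct))
    return {t: c["correct"] / c["total"] for t, c in counts.items()}


def judge_pool_size(default=8):
    """Read MAX_WORKERS from the environment (default value is an assumption)."""
    try:
        return max(1, int(os.environ.get("MAX_WORKERS", "")))
    except ValueError:
        return default
```

Running with `MAX_WORKERS=2` would then cap the judge pool at two workers while a replay shares the same proxy.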
Diagnostic-only judge that keeps the official strict rubric by default but, under JUDGE_MODE=lenient, scores an answer Correct when its core facts match the reference even if it volunteers extra non-contradictory detail (only contradictions, or asserting a definite fact when the reference is unknown, count as Hallucination). Used to quantify how much strict "hallucination" is pure over-answering vs genuine error. Does not change the default scores.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
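The lenient decision rule can be written down as a small truth table. In the PR the judgment itself is made by an LLM; the sketch below only illustrates the rule as stated in the commit message. The function names, the boolean inputs, and the `"Other"` label for answers that are neither Correct nor Hallucination are assumptions, since the commit does not say how those cases are labelled.

```python
import os


def judge_mode():
    """Return "lenient" only when JUDGE_MODE=lenient; strict is the default."""
    mode = os.environ.get("JUDGE_MODE", "strict").strip().lower()
    return "lenient" if mode == "lenient" else "strict"


def lenient_label(core_facts_match, contradicts_reference,
                  reference_unknown, asserts_definite_fact):
    """Hypothetical encoding of the lenient rubric described above."""
    # Only contradictions, or a definite claim where the reference is
    # unknown, count as Hallucination.
    if contradicts_reference or (reference_unknown and asserts_definite_fact):
        return "Hallucination"
    # Extra non-contradictory detail is not penalized.
    if core_facts_match:
        return "Correct"
    # Placeholder: the commit message does not name this category.
    return "Other"
```

Comparing these labels against the strict ones on the same answers gives the share of strict hallucinations that are pure over-answering.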
Scoring-side half of running Arika as a memory system on a HaluMem-shaped benchmark. Pairs with the Swift replay/integration in lay2dev/arika-app#82.
What this adds
- `eval_arika.py`: turns the Swift replay's `artifact.jsonl` into HaluMem `*_eval_results.jsonl` for three frames: `arika` (event/topic projection, retrieval-only QA via a neutral LLM), `arika-e2e` (Arika's own answer), `arika-points` (atomic-fact extraction baseline).
- `evaluation.py`: register the arika frames and `_safe_ratio`-guard aggregations so the dropped Update task doesn't divide by zero.
- `llms.py`: accept judge output without a `json` fence.
- `evaluation.py`: QA `evidence` may be a list of plain strings (not only `{memory_content}` dicts); `memory_type` accuracy seeds unseen types on demand (Arika uses `Event Memory` / `Topic Memory`); `MAX_WORKERS` env throttles the judge pool to share a rate-limited proxy with a concurrent replay.
- `eval_tools.py`, `JUDGE_MODE=lenient`: diagnostic-only variant that does not penalize extra, non-contradictory detail. It quantifies how much strict "hallucination" is pure over-answering vs genuine error. Default scoring is unchanged (official strict rubric).
Companion PR
Swift replay (`EvalRunner --mode halumem`), dataset adapter, resilient checkpointed replay, and the concise answer style live in lay2dev/arika-app#82: https://github.com/lay2dev/arika-app/pull/82
Scope
Extraction + QA. Update task deferred. Default scores remain comparable to the official HaluMem rubric; the lenient judge is opt-in diagnostics.
🤖 Generated with Claude Code