feat(threshold): raise default private_date post-filter to 0.85#46

Open
lBroth wants to merge 1 commit into main from feat/date-threshold-default

lBroth commented May 30, 2026

Summary

The GLiNER model is over-eager on bare calendar dates that lack identifying context: a system-prompt line like "Today's date is 2026-05-21" or a footer like "© 2024 Acme Inc." gets tagged private_date and erodes the token budget on every roundtrip without protecting anyone.

This adds DEFAULT_CATEGORY_THRESHOLDS with private_date: 0.85 and merges it into the post-filter on a key-by-key basis — user-supplied categoryThresholds still win for the labels they spell out. Spans above 0.85 (DOB, birth-date, explicit personal-date PII) keep firing; the long tail of generic dates drops.
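A minimal sketch of the key-by-key merge and the post-filter described above. Only `DEFAULT_CATEGORY_THRESHOLDS`, `categoryThresholds`, and the `private_date: 0.85` value come from this PR. The `Span` shape, the `mergeThresholds` and `postFilter` helpers, the `private_email` label, the example scores, and the `>=` comparison are assumptions for illustration; the actual code in src/nullpii.ts may differ.

```typescript
// Hypothetical shapes; the real types in src/nullpii.ts may differ.
type CategoryThresholds = Record<string, number>;

interface Span {
  label: string;
  text: string;
  score: number;
}

// Built-in per-label default named in this PR.
const DEFAULT_CATEGORY_THRESHOLDS: CategoryThresholds = { private_date: 0.85 };

// Key-by-key merge: the later spread wins, so user keys override built-ins
// while labels the user did not spell out keep their default.
function mergeThresholds(user: CategoryThresholds = {}): CategoryThresholds {
  return { ...DEFAULT_CATEGORY_THRESHOLDS, ...user };
}

// Post-filter: keep a span only if its score reaches its label's threshold.
// Labels without an entry fall back to the global decode threshold (0.5).
// The PR does not say whether the comparison is >= or >; >= is assumed here.
function postFilter(
  spans: Span[],
  user?: CategoryThresholds,
  globalThreshold = 0.5,
): Span[] {
  const thresholds = mergeThresholds(user);
  return spans.filter(
    (s) => s.score >= (thresholds[s.label] ?? globalThreshold),
  );
}

// Illustrative spans with made-up scores.
const spans: Span[] = [
  { label: "private_date", text: "2026-05-21", score: 0.62 },
  { label: "private_date", text: "born 1984-03-02", score: 0.93 },
  { label: "private_email", text: "a@example.com", score: 0.7 },
];

// Built-in default only: the weak date is dropped, the strong date is kept.
const withDefaults = postFilter(spans);

// A user override for private_date wins over the built-in 0.85.
const withOverride = postFilter(spans, { private_date: 0.6 });
```

In this sketch `withDefaults` keeps the strong date and the email span, while `withOverride` keeps all three, because the user-supplied 0.6 replaces the built-in 0.85 for that label only.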

Files

  • src/defaults.ts: DEFAULT_CATEGORY_THRESHOLDS = { private_date: 0.85 }
  • src/nullpii.ts: merge built-in + user thresholds; user keys win
  • test/nullpii.test.ts: 2 integration tests (weak date dropped / strong date kept; user override still works)

Test plan

  • npm test — 273 passing (was 271)
  • npm run typecheck / lint / build — clean
  • Bench validation: run packages/eval on a date-heavy corpus to confirm no recall regression on DOB / birth-date / explicit dated PII (the model emits these well above 0.9 in eval-loop history). Deferred until a bench run is convenient.

Rationale

The choice of 0.85 is conservative: it sits between the global decode threshold (0.5) and the high-precision recognizer floor (0.95), closer to the latter. If the bench shows a recall regression on legitimate dated PII, the right knob is the private_date-specific threshold, not the global cut.

🤖 Generated with Claude Code

The GLiNER model is over-eager on bare calendar dates without
identifying context — a system-prompt line like
"Today's date is 2026-05-21" or a copyright footer like "© 2024
Acme Inc." gets tagged as `private_date` and erodes the token
budget on every roundtrip without protecting anyone.

Add `DEFAULT_CATEGORY_THRESHOLDS` with `private_date: 0.85` and merge
it into the post-filter on a key-by-key basis — user-supplied
`categoryThresholds` still win for the labels they spell out. Spans
above 0.85 (DOB, birth-date, explicit personal-date PII) keep firing;
the long tail of generic dates drops.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>