Fix NormalizeText.remove_accents — use NFKD (#45) #108

Merged
matthewhorridge merged 1 commit into main from fix/normalize-accents-nfkd on May 13, 2026

Contributor

@matthewhorridge matthewhorridge commented May 13, 2026

Summary

Issue #45 framed this as a docs/configurability question, but on investigation the existing behaviour was a silent bug: remove_accents was using NFKC (a composing normalization form), which left pre-composed accented characters like 'é' (U+00E9) as a single code point. The subsequent unicodedata.combining() filter only catches separate combining marks, so the accent was never removed: "café" came back as "café".

Quick repro:

>>> NormalizeText(Normalization.ACCENT).transform("café")  # before
'café'
>>> NormalizeText(Normalization.ACCENT).transform("café")  # after
'cafe'
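
The mechanism behind the repro can be checked with the standard library alone. The sketch below is not the project's code: `strip_marks` is a hypothetical stand-in for the normalize-then-filter step described in the summary, with the normalization form passed in so the two forms can be compared side by side.

```python
import unicodedata


def strip_marks(text: str, form: str) -> str:
    """Hypothetical stand-in: normalize to `form`, then drop combining marks."""
    normalized = unicodedata.normalize(form, text)
    return "".join(ch for ch in normalized if not unicodedata.combining(ch))


precomposed = "caf\u00e9"  # 'é' stored as the single code point U+00E9

# NFKC composes, so no separate combining mark exists for the filter to drop.
nfkc_codepoints = [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFKC", precomposed)]
# NFKD decomposes 'é' into 'e' followed by U+0301 (combining acute accent).
nfkd_codepoints = [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFKD", precomposed)]

print(nfkc_codepoints)  # ends with U+00E9
print(nfkd_codepoints)  # ends with U+0065, U+0301
print(strip_marks(precomposed, "NFKC") == precomposed)  # True: accent survives
print(strip_marks(precomposed, "NFKD"))  # cafe
```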

The fix

Switch the normalization form to NFKD (compatibility decomposition):

  • Both NFD and NFKD split 'é' into 'e' + U+0301 (combining acute), which the existing filter then drops correctly.
  • NFKD rather than NFD because the compatibility decomposition also folds presentational variants useful for harmonization: the ligature 'ﬁ' (U+FB01) → 'fi', superscript digits → plain digits, etc. This is documented in the docstring so callers who need to preserve those can normalize input to NFC beforehand.
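
To make the NFD versus NFKD trade-off concrete, here is a minimal sketch using only `unicodedata`. The helper name `remove_accents_sketch` and its `form` parameter are assumptions for illustration; the real method lives on `NormalizeText` and its signature is not shown in this PR description.

```python
import unicodedata


def remove_accents_sketch(text: str, form: str = "NFKD") -> str:
    """Hypothetical helper: decompose with `form`, then drop combining marks."""
    decomposed = unicodedata.normalize(form, text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


samples = [
    "caf\u00e9",  # pre-composed 'é'
    "\ufb01le",   # starts with the ligature U+FB01
    "x\u00b2",    # ends with superscript two, U+00B2
]

for sample in samples:
    nfd_result = remove_accents_sketch(sample, "NFD")
    nfkd_result = remove_accents_sketch(sample, "NFKD")
    # Both forms strip the accent; only NFKD folds the ligature and superscript.
    print(f"{sample!r}: NFD -> {nfd_result!r}, NFKD -> {nfkd_result!r}")
```

Under NFD the ligature and the superscript pass through untouched, because they only have compatibility decompositions; that difference is what the choice of NFKD buys.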

Tests

Five new tests in test_primitives_serialization.py:

  • Common Latin accents (café, résumé, naïve, piñata, über, Ångström) on pre-composed input
  • Same logic on pre-decomposed input (verifies NFKD is idempotent)
  • Unaccented text is unchanged
  • Compatibility folding (ﬁle → file, x² → x2), pinning the NFKD choice
  • Serialization roundtrip with the remove_accents value

Total: 167/167 tests pass (was 162).
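
For readers without the repository at hand, the first four test categories can be sketched roughly as follows. This is an illustration, not the contents of test_primitives_serialization.py: `remove_accents_sketch` and the test names are hypothetical, and the serialization roundtrip is left out because it depends on project APIs not shown here.

```python
import unicodedata


def remove_accents_sketch(text: str) -> str:
    """Hypothetical stand-in for the fixed transform (NFKD + combining filter)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Pre-composed inputs paired with the expected accent-free output.
CASES = {
    "caf\u00e9": "cafe",
    "r\u00e9sum\u00e9": "resume",
    "na\u00efve": "naive",
    "pi\u00f1ata": "pinata",
    "\u00fcber": "uber",
    "\u00c5ngstr\u00f6m": "Angstrom",
}


def test_precomposed_accents():
    for raw, expected in CASES.items():
        assert remove_accents_sketch(raw) == expected


def test_predecomposed_input_gives_same_result():
    # Feeding already-decomposed text must not change the outcome.
    for raw, expected in CASES.items():
        decomposed = unicodedata.normalize("NFD", raw)
        assert remove_accents_sketch(decomposed) == expected


def test_unaccented_text_unchanged():
    assert remove_accents_sketch("plain text 123") == "plain text 123"


def test_compatibility_folding():
    # These two assertions would fail under NFD, which pins the NFKD choice.
    assert remove_accents_sketch("\ufb01le") == "file"
    assert remove_accents_sketch("x\u00b2") == "x2"
```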

Test plan

  • pytest green
  • Manual repro confirms the original bug is fixed (café → cafe)

🤖 Generated with Claude Code

`remove_accents` was using NFKC normalization, which is a composing form:
a pre-composed character like 'é' (U+00E9) stayed as a single code point
with no separate combining mark for the subsequent filter to drop. As a
result, accent removal was a no-op for any normal pre-composed input —
"café" came back as "café".

Switch to NFKD (compatibility decomposition) so accented characters are
split into base + combining mark, and the existing combining-character
filter actually strips the diacritic. NFKD over NFD also folds
presentational variants useful in harmonization (the ligature 'ﬁ'
becomes 'fi', superscript '²' becomes '2'); document this trade-off in
the docstring.

Resolves #45.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@matthewhorridge matthewhorridge merged commit da171f9 into main May 13, 2026
1 check passed