Fix NormalizeText.remove_accents — use NFKD (#45)#108
Merged
`remove_accents` was using NFKC normalization, which is a composing form: a pre-composed character like 'é' (U+00E9) stayed as a single code point, with no separate combining mark for the subsequent filter to drop. As a result, accent removal was a no-op for any normal pre-composed input: "café" came back as "café".

Switch to NFKD (compatibility decomposition) so accented characters are split into base + combining mark, and the existing combining-character filter actually strips the diacritic. Choosing NFKD over NFD also folds presentational variants, which is useful in harmonization (the ligature 'ﬁ' becomes 'fi', superscript '²' becomes '2'); document this trade-off in the docstring.

Resolves #45.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
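The NFD versus NFKD trade-off described in the commit message can be illustrated with the standard library alone. This is only a sketch: `strip_marks` is a hypothetical helper written for this comparison, not the project's code.

```python
import unicodedata


def strip_marks(text: str, form: str) -> str:
    """Normalize `text` with the given form, then drop combining marks.

    Hypothetical helper for illustration; not taken from the PR.
    """
    normalized = unicodedata.normalize(form, text)
    return "".join(ch for ch in normalized if not unicodedata.combining(ch))


# 'ﬁle x² café' written with escapes so the code points are unambiguous:
# U+FB01 (ligature fi), U+00B2 (superscript two), U+00E9 (pre-composed e-acute).
sample = "\ufb01le x\u00b2 caf\u00e9"

nfd_result = strip_marks(sample, "NFD")    # canonical decomposition only
nfkd_result = strip_marks(sample, "NFKD")  # canonical + compatibility decomposition

print(nfd_result)   # accent removed, ligature and superscript kept
print(nfkd_result)  # file x2 cafe
```

Both forms remove the accent; only NFKD also rewrites the ligature and the superscript, which is the behaviour the docstring is meant to call out.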
Summary
Issue #45 framed this as a docs/configurability question, but on investigation the existing behaviour was a silent bug:
`remove_accents` was using NFKC (a composing normalization form), which left pre-composed accented characters like 'é' (U+00E9) as a single code point. The subsequent `unicodedata.combining()` filter only catches separate combining marks, so the accent was never removed: "café" came back as "café".

Quick repro:
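A minimal sketch of such a repro, assuming the old implementation amounted to NFKC normalization followed by a `unicodedata.combining()` filter. The function name `remove_accents_nfkc` is hypothetical and stands in for the pre-fix method.

```python
import unicodedata


def remove_accents_nfkc(text: str) -> str:
    """Assumed shape of the pre-fix behaviour (hypothetical reconstruction)."""
    composed = unicodedata.normalize("NFKC", text)
    # combining() returns 0 for pre-composed characters such as U+00E9,
    # so nothing is filtered out here.
    return "".join(ch for ch in composed if not unicodedata.combining(ch))


word = "caf\u00e9"  # 'café' with a pre-composed U+00E9
result = remove_accents_nfkc(word)

print(result == word)                             # True: the accent survives
print(len(unicodedata.normalize("NFKC", word)))   # 4 code points, no separate mark
print(len(unicodedata.normalize("NFKD", word)))   # 5 code points: 'e' + U+0301
```

Input that arrives already decomposed (`"cafe\u0301"`) fares no better under this sketch, because NFKC recomposes it before the filter runs.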
The fix
Switch the normalization form to NFKD (compatibility decomposition):
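The change might look roughly like the following. This is a sketch under the assumption that the method normalizes and then filters combining characters; it is not the PR's actual diff, and the standalone function form is chosen only to keep the example self-contained.

```python
import unicodedata


def remove_accents(text: str) -> str:
    """Strip diacritics by decomposing first, then dropping combining marks.

    Note: NFKD also folds compatibility characters (e.g. the ligature
    U+FB01 becomes 'fi', superscript U+00B2 becomes '2').
    """
    decomposed = unicodedata.normalize("NFKD", text)  # was "NFKC" in the assumed old code
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(remove_accents("caf\u00e9"))            # cafe
print(remove_accents("\u00c5ngstr\u00f6m"))   # Angstrom
```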
Tests
Five new tests in `test_primitives_serialization.py`:
- Accent stripping (`café`, `résumé`, `naïve`, `piñata`, `über`, `Ångström`) on pre-composed input
- Compatibility folding (`ﬁle` → `file`, `x²` → `x2`), pinning the NFKD choice
- … `remove_accents` value

Total: 167/167 tests pass (was 162).
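Tests along these lines could pin the behaviour described above. The test names and the stand-in `remove_accents` function are hypothetical; the actual tests in `test_primitives_serialization.py` are not shown on this page. Plain `assert` statements are used so the functions can be collected by pytest without further imports.

```python
import unicodedata


def remove_accents(text: str) -> str:
    """Stand-in for the method under test (hypothetical, not the project's code)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Inputs use escapes so each accented letter is guaranteed to be pre-composed.
PRECOMPOSED_CASES = {
    "caf\u00e9": "cafe",                # café
    "r\u00e9sum\u00e9": "resume",       # résumé
    "na\u00efve": "naive",              # naïve
    "pi\u00f1ata": "pinata",            # piñata
    "\u00fcber": "uber",                # über
    "\u00c5ngstr\u00f6m": "Angstrom",   # Ångström
}

# Compatibility characters that NFKD folds but NFD would leave untouched.
COMPATIBILITY_CASES = {
    "\ufb01le": "file",  # ligature fi
    "x\u00b2": "x2",     # superscript two
}


def test_strips_accents_on_precomposed_input():
    for raw, expected in PRECOMPOSED_CASES.items():
        assert remove_accents(raw) == expected


def test_folds_compatibility_characters():
    for raw, expected in COMPATIBILITY_CASES.items():
        assert remove_accents(raw) == expected
```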
Test plan
- `pytest` green
- … (`café` → `cafe`)

🤖 Generated with Claude Code