Fix NormalizeText.remove_accents — use NFKD (#45) by matthewhorridge · Pull Request #108 · bmir-radx/harmonization-framework

matthewhorridge · 2026-05-13T16:07:41Z

Summary

Issue #45 framed this as a docs/configurability question, but on investigation the existing behaviour was a silent bug: remove_accents was using NFKC (a composing normalization form), which left pre-composed accented characters like 'é' (U+00E9) as a single code point. The subsequent unicodedata.combining() filter only catches separate combining marks, so the accent was never removed: "café" came back as "café".

Quick repro:

>>> NormalizeText(Normalization.ACCENT).transform("café")  # before
'café'
>>> NormalizeText(Normalization.ACCENT).transform("café")  # after
'cafe'
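
The mechanics behind the repro can be checked with the standard library alone. The following is a minimal sketch, not the project's code: `code_points` and `strip_combining` are hypothetical helpers written for illustration, and only `unicodedata.normalize` and `unicodedata.combining` are real APIs.

```python
import unicodedata


def code_points(form, text):
    """Return the code points of `text` after normalizing it to `form`."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize(form, text)]


def strip_combining(form, text):
    """Normalize to `form`, then drop every character that is a combining mark."""
    normalized = unicodedata.normalize(form, text)
    return "".join(ch for ch in normalized if not unicodedata.combining(ch))


# "\u00e9" is the pre-composed 'é'; escapes keep the input unambiguous.
print(code_points("NFKC", "\u00e9"))         # ['U+00E9'] (still one code point)
print(code_points("NFKD", "\u00e9"))         # ['U+0065', 'U+0301'] (base + combining acute)

print(strip_combining("NFKC", "caf\u00e9"))  # accent survives: nothing to filter
print(strip_combining("NFKD", "caf\u00e9"))  # cafe
```

Under NFKC the filter never sees a combining mark, because `unicodedata.combining("\u00e9")` is 0; only the decomposed form exposes U+0301 for removal.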

The fix

Switch the normalization form to NFKD (compatibility decomposition):

Both NFD and NFKD split 'é' into 'e' + U+0301 (combining acute), which the existing filter then drops correctly.
NFKD over NFD because the compatibility decomposition also folds presentational variants useful for harmonization: the ligature 'ﬁ' → 'fi', superscript digits → plain digits, etc. Documented in the docstring so callers who need to preserve those can normalize input to NFC beforehand.
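
The difference between the two decomposing forms can be sketched as follows. This is an illustrative stand-in under stated assumptions, not the actual `NormalizeText.remove_accents` implementation: the function name `remove_accents_sketch` and its `form` parameter are hypothetical, with NFD accepted only for comparison.

```python
import unicodedata


def remove_accents_sketch(text, form="NFKD"):
    """Hypothetical stand-in: decompose with `form`, then drop combining marks."""
    decomposed = unicodedata.normalize(form, text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Both forms strip the diacritic from pre-composed input.
print(remove_accents_sketch("caf\u00e9", "NFD"))   # cafe
print(remove_accents_sketch("caf\u00e9", "NFKD"))  # cafe

# Only the compatibility form folds presentational variants.
print(remove_accents_sketch("\ufb01le", "NFKD"))   # file (ligature U+FB01 becomes 'f' + 'i')
print(remove_accents_sketch("x\u00b2", "NFKD"))    # x2   (superscript two becomes '2')
print(remove_accents_sketch("\ufb01le", "NFD") == "\ufb01le")  # True: NFD keeps the ligature
```

The compatibility folding is lossy by design, which is why the choice of NFKD is worth documenting for callers.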

Tests

Five new tests in test_primitives_serialization.py:

Common Latin accents (café, résumé, naïve, piñata, über, Ångström) on pre-composed input
Same logic on pre-decomposed input (verifies NFKD is idempotent)
Unaccented text is unchanged
Compatibility folding (ﬁle → file, x² → x2) — pinning the NFKD choice
Serialization roundtrip with the remove_accents value
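
As a rough illustration of what the first four checks exercise, here is a sketch written against a hypothetical stand-in function rather than the real `NormalizeText` class. The test names, the `CASES` table, and `remove_accents_sketch` are assumptions and do not reproduce the contents of test_primitives_serialization.py; the serialization roundtrip is omitted because it depends on the framework's own API.

```python
import unicodedata


def remove_accents_sketch(text):
    """Hypothetical stand-in for the fixed behaviour: NFKD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Accented input written with escapes, mapped to the expected accent-free output.
CASES = {
    "caf\u00e9": "cafe",
    "r\u00e9sum\u00e9": "resume",
    "na\u00efve": "naive",
    "pi\u00f1ata": "pinata",
    "\u00fcber": "uber",
    "\u00c5ngstr\u00f6m": "Angstrom",
}


def test_precomposed_input():
    for raw, expected in CASES.items():
        assert remove_accents_sketch(unicodedata.normalize("NFC", raw)) == expected


def test_predecomposed_input():
    # Input that is already decomposed must give the same result.
    for raw, expected in CASES.items():
        assert remove_accents_sketch(unicodedata.normalize("NFD", raw)) == expected


def test_unaccented_text_is_unchanged():
    assert remove_accents_sketch("plain ASCII text 123") == "plain ASCII text 123"


def test_compatibility_folding():
    assert remove_accents_sketch("\ufb01le") == "file"
    assert remove_accents_sketch("x\u00b2") == "x2"
```

Functions named `test_*` are collected by pytest automatically; they can also be called directly, since they rely only on plain `assert` statements.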

Total: 167/167 tests pass (was 162).

Test plan

pytest green
Manual repro of the original bug fixed (café → cafe)

🤖 Generated with Claude Code

`remove_accents` was using NFKC normalization, which is a composing form: a pre-composed character like 'é' (U+00E9) stayed as a single code point with no separate combining mark for the subsequent filter to drop. As a result, accent removal was a no-op for any normal pre-composed input: "café" came back as "café".

Switch to NFKD (compatibility decomposition) so accented characters are split into base + combining mark, and the existing combining-character filter actually strips the diacritic. NFKD over NFD also folds presentational variants useful in harmonization (the ligature 'ﬁ' becomes 'fi', superscript '²' becomes '2'); document this trade-off in the docstring.

Resolves #45.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

matthewhorridge merged commit da171f9 into main May 13, 2026
1 check passed

matthewhorridge merged 1 commit into main from fix/normalize-accents-nfkd
