fix(config): preserve unicode characters when writing yaml config #1966
bearomorphism wants to merge 2 commits into commitizen-tools:master
Codecov Report
✅ All modified and coverable lines are covered by tests.

| Coverage Diff | master | #1966 | +/- |
|---|---|---|---|
| Coverage | 98.23% | 98.23% | |
| Files | 61 | 61 | |
| Lines | 2779 | 2779 | |
| Hits | 2730 | 2730 | |
| Misses | 49 | 49 | |
`yaml.dump()` defaults to ASCII-only output, which causes `cz bump` (and `cz init`) to rewrite emoji and other non-ASCII characters in `.cz.yaml` as `\Uxxxx` escape sequences. Pass `allow_unicode=True` so the original characters round-trip.

Closes commitizen-tools#1164

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
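A minimal sketch of the behaviour this commit message describes, assuming PyYAML is installed. The `config` dict is an illustrative stand-in for a `.cz.yaml` payload, not commitizen's actual data:

```python
import yaml

# Illustrative config; the emoji mirrors the bump_message case from the issue.
config = {"commitizen": {"bump_message": "🚀 chore(release): bump"}}

# Default behaviour: non-ASCII characters are escaped in the emitted text.
escaped = yaml.dump(config)

# With allow_unicode=True the emoji is emitted as a literal character.
literal = yaml.dump(config, allow_unicode=True)

print(escaped)
print(literal)
```

Both strings parse back to the same data; only the on-disk representation differs.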
Force-pushed from 855b8dd to b404143.
Updated per @bearomorphism's review: You're right, the original […] Replaced it with a more honest test that uses […] Both […]
Pull request overview
This PR fixes YAML config rewrites so non-ASCII characters (e.g., emojis in `bump_message`) are preserved as literal Unicode instead of being escaped to `\Uxxxxxxxx` sequences when `cz bump` updates `.cz.yaml`/`cz.yaml`.
Changes:
- Pass `allow_unicode=True` to `yaml.dump` in `YAMLConfig.init_empty_config_content` and `YAMLConfig.set_key`.
- Add a regression test ensuring `set_key` does not introduce `\U...` escapes for emojis.
- Add a spy-based test asserting `init_empty_config_content` forwards `allow_unicode=True` to `yaml.dump`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `commitizen/config/yaml_config.py` | Enables Unicode emission in YAML serialization by adding `allow_unicode=True` to `yaml.dump` call sites. |
| `tests/test_conf.py` | Adds regression + forwarding tests to prevent reintroducing YAML Unicode escaping. |
* rename `json_file` -> `yaml_file` context-manager var
* force UTF-8 for YAML writes (YAML 1.2 spec mandates UTF-8/16/32) to avoid `UnicodeEncodeError` when `self._settings.encoding` is non-UTF-8
* tighten regression test to reject any `\Uxxxxxxxx` escape (case-insensitive)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
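The encoding point in this commit can be illustrated with a small sketch, assuming PyYAML is installed. The `write_yaml` helper and the `latin-1` choice are hypothetical and only stand in for a non-UTF-8 `encoding` setting; this is not commitizen's code:

```python
import os
import tempfile

import yaml

data = {"commitizen": {"bump_message": "🚀 bump"}}


def write_yaml(path: str, encoding: str) -> None:
    """Hypothetical helper: dump `data` through a text stream opened with `encoding`."""
    with open(path, "w", encoding=encoding) as yaml_file:
        yaml.dump(data, yaml_file, allow_unicode=True)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cz.yaml")

    # A latin-1 stream cannot encode the literal emoji, so the write fails.
    try:
        write_yaml(path, "latin-1")
        latin1_failed = False
    except UnicodeEncodeError:
        latin1_failed = True

    # UTF-8 can represent every codepoint, so the same dump succeeds.
    write_yaml(path, "utf-8")
    with open(path, encoding="utf-8") as yaml_file:
        round_tripped = yaml.safe_load(yaml_file)

print(latin1_failed, round_tripped)
```

Once `allow_unicode=True` emits literal characters, the stream encoding has to be able to represent them, which is why the write path is pinned to UTF-8.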
Description
Closes #1164.
Why
PyYAML's `yaml.dump` defaults to ASCII-only output: any non-ASCII codepoint is escaped using Python's `\Uxxxx` notation. This means a `.cz.yaml` containing a literal emoji (for example `🚀` in `bump_message`) is silently mangled every time `cz bump` rewrites the config, replacing the readable character with `\U0001F680`.

Reported by @syepes on commitizen 3.27.0 / Python 3.12 / macOS (#1164), with before-and-after screenshots showing the escape introduced by `cz bump --increment PATCH`. A triage note from the open-issues audit (2026-05-09) confirmed the bug still reproduces on master (v4.15.1): after `cz bump`, the `bump_message` key reads `"\U0001F680 chore(release): …"` instead of `"🚀 chore(release): …"`.

The root cause is that both `yaml.dump` call sites in `commitizen/config/yaml_config.py`, `init_empty_config_content` (line 33) and `set_key` (line 66), omit `allow_unicode=True`. PyYAML documents that this flag causes non-ASCII codepoints to be written as literal UTF-8 bytes rather than escape sequences; the output remains valid YAML and valid UTF-8.

What changed
| File | Change |
|---|---|
| `commitizen/config/yaml_config.py` | Add `allow_unicode=True` to `yaml.dump` in `init_empty_config_content` (line 33) and in `set_key` (line 66) |
| `tests/test_conf.py` | Add `test_set_key_preserves_unicode` (regression: emoji survives a `set_key` round-trip) and `test_init_empty_config_content_passes_allow_unicode` (spy: keyword argument is forwarded to `yaml.dump`) |

How it works
- `set_key` (`commitizen/config/yaml_config.py:58–68`): reads the config with `yaml.load`, mutates the target key in the parsed dict, then re-serialises with `yaml.dump`. Without `allow_unicode=True`, any non-ASCII character already present in the file is escaped on write-back, producing the `\U0001F680` symptom. Adding the flag instructs PyYAML to emit those characters as literal UTF-8 bytes.
- `init_empty_config_content` (`commitizen/config/yaml_config.py:29–33`): writes `{"commitizen": {}}` in append mode during `cz init`. Its payload is currently ASCII-only, so `allow_unicode=True` has no observable effect today. It is added for consistency and to guard against future default content that might include non-ASCII characters.
- `test_set_key_preserves_unicode` writes a YAML file containing `🚀`, calls `set_key("version", "0.1.1")`, and asserts both that `🚀` survives and that `\U0001F680` does not appear. This is the direct regression test for #1164 (cz bump - Overwrites my cz.yaml emoji).
- `test_init_empty_config_content_passes_allow_unicode` uses `mocker.spy(yaml, "dump")` to assert the keyword is forwarded rather than testing round-trip behaviour that does not currently occur (see the review note in Additional Context).
- Why not a custom `Dumper`? PyYAML's `allow_unicode` flag is the documented, single-argument way to opt out of ASCII escaping. A custom `Dumper` subclass would be substantially more complex and carries a maintenance burden with no additional benefit for this use case.

Backward compatibility
- Existing `\Uxxxx` escape sequences are normalised on the next `cz bump`: `yaml.load` decodes the escape to the Unicode character on read, and the subsequent `yaml.dump` re-emits it as a literal codepoint. No data is lost and no manual migration is needed.
- Existing `TestYamlConfig` tests in `tests/test_conf.py` pass without modification.
- […] `YAMLConfig`.

Checklist
Was generative AI tooling used to co-author this PR?
Generated-by: Claude following the guidelines
Code Changes
- Run `uv run poe all` locally to ensure this change passes linter check and tests

Expected Behavior
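| Scenario | Expected result |
|---|---|
| `.cz.yaml` has `bump_message: "🚀 chore: bump …"` and `cz bump` is run | `🚀` is preserved verbatim; `\U0001F680` does not appear |
| `.cz.yaml` already contains an escaped `\U0001F680` and `cz bump` is run | The escape is rewritten as a literal `🚀`; the file is healed in place |
| `cz init` creates a new `.cz.yaml` | The file contains `---\ncommitizen: {}\n`, identical to the pre-fix behaviour |

Steps to Test This Pull Request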
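One way to exercise the first two scenarios outside the test suite is sketched below, assuming PyYAML is installed. The `rewrite_version` helper is hypothetical: it imitates the load, mutate, dump cycle described for `set_key` and is not commitizen's implementation.

```python
import os
import tempfile

import yaml


def rewrite_version(path: str, version: str) -> str:
    """Hypothetical stand-in for set_key: load, change one key, dump back."""
    with open(path, encoding="utf-8") as f:
        parsed = yaml.safe_load(f)
    parsed["commitizen"]["version"] = version
    with open(path, "w", encoding="utf-8") as f:
        yaml.dump(parsed, f, explicit_start=True, allow_unicode=True)
    with open(path, encoding="utf-8") as f:
        return f.read()


# One file with a literal emoji, one with the escape left by an earlier bump.
literal_cfg = 'commitizen:\n  version: "0.1.0"\n  bump_message: "🚀 chore: bump"\n'
escaped_cfg = 'commitizen:\n  version: "0.1.0"\n  bump_message: "\\U0001F680 chore: bump"\n'

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, content in (("literal", literal_cfg), ("escaped", escaped_cfg)):
        path = os.path.join(tmp, f"{name}.cz.yaml")
        with open(path, "w", encoding="utf-8") as f:
            f.write(content)
        results[name] = rewrite_version(path, "0.1.1")

for name, text in results.items():
    print(name, repr(text))
```

Dropping `allow_unicode=True` from the helper should bring the escape back in both outputs, which is the pre-fix behaviour.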
Additional Context
This fix was identified during the open-issues audit tracked in #1964. A triage note (@bearomorphism, 2026-05-09) confirmed the bug reproduces on master (v4.15.1) and pinpointed `commitizen/config/yaml_config.py:66` as the root cause, with a symmetric fix needed at line 33 for consistency.

Review note (test update): the initial draft included `test_init_empty_config_content_uses_allow_unicode`, which pre-seeded a YAML file with `🚀` and asserted the emoji survived `init_empty_config_content()`. This was misleading: because `init_empty_config_content` opens in append mode and writes only `{"commitizen": {}}`, the assertion would pass regardless of whether `allow_unicode=True` was present. The test has been replaced with a `mocker.spy`-based assertion that directly verifies the keyword argument is forwarded, without pretending to test behaviour that doesn't currently happen (see PR comment).
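The difference between the two test styles can be sketched as follows, assuming PyYAML is installed. The sketch swaps in the standard library's `unittest.mock.patch.object(..., wraps=...)` for pytest-mock's `mocker.spy`, and the `init_empty_config_content` function here is a hypothetical stand-in, not the method in `yaml_config.py`:

```python
import os
import tempfile
from unittest import mock

import yaml


def init_empty_config_content(path: str) -> None:
    """Hypothetical stand-in: append an empty commitizen section to `path`."""
    with open(path, "a", encoding="utf-8") as yaml_file:
        yaml.dump({"commitizen": {}}, yaml_file, explicit_start=True, allow_unicode=True)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, ".cz.yaml")

    # Pre-seed an emoji, as the discarded test did.
    with open(path, "w", encoding="utf-8") as f:
        f.write("# 🚀\n")

    # wraps= keeps the real yaml.dump behaviour while recording the call.
    with mock.patch.object(yaml, "dump", wraps=yaml.dump) as dump_spy:
        init_empty_config_content(path)

    with open(path, encoding="utf-8") as f:
        content = f.read()

# The spy sees the keyword directly, independent of what the payload contains.
forwarded = dump_spy.call_args.kwargs.get("allow_unicode")
print(forwarded, repr(content))
```

Because the file is opened in append mode, the pre-seeded emoji is still present whether or not the flag is passed, so only the spy assertion can detect a regression.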