fix(reports): ensure CSV attachments use UTF-8 BOM when encoding is utf-8-sig#37245
fix(reports): ensure CSV attachments use UTF-8 BOM when encoding is utf-8-sig#37245yayunn wants to merge 2 commits into
Conversation
|
CodeAnt AI is reviewing your PR. Thanks for using CodeAnt! 🎉We're free for open-source projects. if you're enjoying it, help us grow by sharing. Share on X · |
Code Review Agent Run #61ca6cActionable Suggestions - 0Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
|
CodeAnt AI finished reviewing your PR. |
✅ Deploy Preview for superset-docs-preview ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
There was a problem hiding this comment.
Pull request overview
This PR fixes a bug where CSV attachments generated by the Reports feature are missing the UTF-8 Byte Order Mark (BOM), even when CSV_EXPORT.encoding is configured as utf-8-sig. This causes non-ASCII characters (particularly CJK characters) to appear garbled in Excel and some email clients.
Changes:
- Adds a new
_ensure_utf8_bom()method to detect and prepend UTF-8 BOM when needed - Integrates BOM handling into the
_get_csv_data()method in the report execution pipeline - Ensures consistency between SQL Lab CSV exports and Report CSV exports
| def _ensure_utf8_bom(self, csv_data: bytes, encoding: str) -> bytes: | ||
| """ | ||
| Ensure CSV bytes contain UTF-8 BOM when encoding is utf-8-sig. | ||
| Avoid double BOM. | ||
| """ | ||
| if not csv_data: | ||
| return csv_data | ||
|
|
||
| enc = (encoding or "").lower().replace("_", "-") | ||
| if enc in ("utf-8-sig", "utf8-sig"): | ||
| if not csv_data.startswith(b"\xef\xbb\xbf"): | ||
| return b"\xef\xbb\xbf" + csv_data | ||
| return csv_data |
There was a problem hiding this comment.
The new method _ensure_utf8_bom lacks test coverage. Given that this addresses a specific encoding issue that could lead to data corruption (garbled characters in Excel), it would be valuable to add a test that verifies:
- When CSV_EXPORT.encoding is "utf-8-sig", the returned CSV bytes start with the UTF-8 BOM (b"\xef\xbb\xbf")
- When a BOM already exists, it's not duplicated
- When the encoding is not utf-8-sig, no BOM is added
Consider adding a test in tests/integration_tests/reports/commands_tests.py similar to test_email_chart_report_schedule_with_csv but with assertions on the BOM presence.
| return csv_data | ||
|
|
||
| enc = (encoding or "").lower().replace("_", "-") | ||
| if enc in ("utf-8-sig", "utf8-sig"): |
There was a problem hiding this comment.
The encoding normalization at line 447 replaces underscores with hyphens, converting both "utf_8_sig" and "utf-8-sig" to "utf-8-sig". Therefore, checking for "utf8-sig" (without any separator) in line 448 is unnecessary after this normalization.
Consider simplifying to just check for "utf-8-sig":
enc = (encoding or "").lower().replace("_", "-")
if enc == "utf-8-sig":Alternatively, if you want to handle "utf8sig" (no separator), you could check for both "utf-8-sig" and "utf8sig", but the current check for "utf8-sig" (with hyphen but no separators elsewhere) won't match any real input after normalization.
| if enc in ("utf-8-sig", "utf8-sig"): | |
| if enc == "utf-8-sig": |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #37245 +/- ##
==========================================
- Coverage 64.28% 56.18% -8.11%
==========================================
Files 2659 2659
Lines 144304 144314 +10
Branches 33260 33263 +3
==========================================
- Hits 92773 81087 -11686
- Misses 49902 62504 +12602
+ Partials 1629 723 -906
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
Thanks @yayunn, let's see if we can get this across the finish line. CI is currently red: the Python-Integration suites (test-mysql/test-postgres/test-sqlite) are failing, likely an existing report/CSV test now seeing the BOM. I'm updating the branch now, but if we can get those green, and a small unit test asserting the BOM is prepended only for utf-8-sig, we'll take a fresh look. |
Code Review Agent Run #0dbd2cActionable Suggestions - 0Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
Summary
This PR fixes an issue where CSV attachments generated by Reports are missing the UTF-8 BOM, even when
CSV_EXPORT.encodingis set toutf-8-sig.This causes non-ASCII characters (e.g. CJK) to appear garbled in Excel and some email clients.
Problem
Solution
CSV_EXPORT.encodingisutf-8-sigScreenshots / Logs
Before:
e5 94 ae ...(no BOM)After:
ef bb bf e5 94 ae ...(correct UTF-8 BOM)ADDITIONAL INFORMATION