Unfold continuation lines before UTF-8 decode #159
Fix UnicodeDecodeError when manifest continuation folding splits UTF-8 multibyte sequences across line boundaries (e.g. "Bou\xc3\r\n \xa9").

- process MANIFEST.MF as bytes and join continuation lines before decode
- simplify _normalize_manifest to line split/strip only
- add regression tests
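For illustration, the bytes-first approach described above could look roughly like the following. This is a minimal sketch, not the code in this PR: `unfold_manifest` and `_CONTINUATION` are hypothetical names, and it assumes a continuation is a line break (CRLF, LF, or CR) followed by exactly one space.

```python
import re

# Hypothetical helper, not the PR's actual implementation.
# Assumption: a continuation is a line break followed by a single space.
_CONTINUATION = re.compile(rb"(?:\r\n|\n|\r) ")


def unfold_manifest(raw: bytes) -> str:
    """Join continuation lines at the byte level, then decode as UTF-8."""
    # Removing the line break and the leading space first puts the two
    # halves of a split multibyte sequence back next to each other.
    return _CONTINUATION.sub(b"", raw).decode("utf-8")


raw = b"Name: Bou\xc3\r\n \xa9\r\n"
text = unfold_manifest(raw)
```

Decoding each physical line separately would fail here, because `b"Name: Bou\xc3"` ends in the middle of a two-byte character.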
Not sure how helpful this is, but be sure to check out https://docs.oracle.com/en/java/javase/25/docs/specs/jar/jar.html#notes-on-manifest-and-signature-files if you haven't already.
Characters must be complete prior to the newline, as detailed in https://docs.oracle.com/en/java/javase/25/docs/specs/jar/jar.html#manifest-specification

A looser interpretation similar to yours failed for the JDK team: https://bugs.java.com/bugdatabase/JDK-8202525. When the JDK team found that their own tooling was generating manifests that were non-compliant with the specification (for some versions prior to Java 9), they rewrote their tooling output.

Generally, I advocate for failures to be early and notable. More to the point, maven-resolver-util fixed the issue only a few point releases later (fixed in 1.9.23) and is now well into the 2.x release. If you don't see an issue in your other tooling, it might be because it's pre-JDK-8202525, or because that MANIFEST.MF line is being dropped, as @dmlloyd pointed out (malformed manifest lines are dropped).

I'd say this needs fixing, but not here. You need to upgrade your maven-resolver-util to 1.9.23 or later, and if issues reappear, file an Apache bug, as it means they're not using Java tooling (or are implementing it with bugs).
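As a rough illustration of the "fail early" position in this comment, a strict reader could flag any physical line that is not valid UTF-8 on its own instead of silently repairing it. This is only a sketch under that assumption; `find_split_characters` is a hypothetical name and is not part of this project or of any Java tooling.

```python
def find_split_characters(raw: bytes) -> list[int]:
    """Return 1-based numbers of physical lines that are not valid UTF-8 alone."""
    bad = []
    # bytes.splitlines() splits on \n, \r, and \r\n.
    for number, line in enumerate(raw.splitlines(), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            # The line either ends mid-character or starts with a stray
            # continuation byte left over from the previous line.
            bad.append(number)
    return bad
```

For the manifest fragment from the PR description, both the folded line and its continuation would be reported, since each holds only half of the character.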
I ran into a manifest where a UTF-8 multibyte sequence was split across multiple lines, which made the parser crash.
How to reproduce:
Changes introduced: