This repository was archived by the owner on Dec 9, 2018. It is now read-only.
Fix #445.
--space-as-offset now checks for a Unicode space after the text is decoded, instead of checking for ASCII SPACE before decoding.
This change should also increase the opportunities for converting spaces to offsets.
However, for PDFs with bad Unicode support, this may still drop characters, though I haven't found an example yet.
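To make the difference concrete, here is a minimal sketch of the two checks, not the actual patch. The function names and the dict standing in for a font's code-to-Unicode mapping are hypothetical, and treating only U+0020 as the "unicode space" is an assumption.

```python
# Illustrative sketch only; these names are hypothetical, not pdf2htmlEX code.
ASCII_SPACE = 0x20


def is_space_raw(char_code):
    """Pre-change check: compare the undecoded character code to ASCII SPACE."""
    return char_code == ASCII_SPACE


def is_space_decoded(char_code, to_unicode):
    """Post-change check: decode first, then compare against U+0020.

    `to_unicode` is a dict standing in for the font's code-to-Unicode mapping.
    Codes with no entry are never treated as spaces.
    """
    return to_unicode.get(char_code) == " "


# A font with a custom encoding: code 3 is the space glyph, code 0x20 is "A".
custom = {3: " ", 0x20: "A"}
print(is_space_raw(3), is_space_decoded(3, custom))        # False True
print(is_space_raw(0x20), is_space_decoded(0x20, custom))  # True False
```

If the mapping wrongly sends a visible glyph to U+0020, the decoded check would treat that glyph as a space too, which is the dropped-character risk mentioned above.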
--space-as-offset is not guaranteed to work if either the ToUnicode mapping or the font encoding is corrupted. In fact, I had a few test cases before where the font encoding is OK yet ToUnicode is missing or corrupted. In my experience, there are more issues in the ToUnicode mappings, especially for old PDF files.
It seems that old PDF generators/converters were not able to handle this well -- after all, this has nothing to do with printing. And ToUnicode is indeed optional in the standard.
I'm not sure if this is a good solution. Or possibly we can take the --to-unicode parameter into consideration, i.e. whether we trust the mapping.
If ToUnicode is missing, can we just ignore --space-as-offset 1 for that font automatically? We can also add --space-as-offset 2 to force it on even if ToUnicode is missing. However, it seems impossible to detect whether ToUnicode or the font encoding is corrupted.
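The per-font policy proposed in the last comment could look roughly like the sketch below. This is only an illustration of the suggestion: the function name and the `has_to_unicode` flag are hypothetical, and the value 2 is the proposal from this thread, not an option confirmed to exist.

```python
# Hypothetical sketch of the proposed policy; not pdf2htmlEX code.
def should_convert_spaces(space_as_offset, has_to_unicode):
    """Decide, for one font, whether spaces may be turned into offsets.

    space_as_offset: 0 = off, 1 = on only when the font has a ToUnicode map,
                     2 = (proposed) forced on even without ToUnicode.
    has_to_unicode:  whether the font carries a ToUnicode mapping at all.
    """
    if space_as_offset <= 0:
        return False
    if space_as_offset >= 2:
        return True
    return has_to_unicode


for mode in (0, 1, 2):
    for present in (True, False):
        print(mode, present, should_convert_spaces(mode, present))
```

This only handles a missing ToUnicode map. As noted above, a mapping that is present but corrupted cannot be told apart from a good one, so mode 1 would still trust it.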