fix(text): avoid splitting surrogate pairs when truncating text #1160
Open
greymoth-jp wants to merge 1 commit into
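This is an automatic vacation reply from QQ Mail.
Hello, I have received your email and will read it carefully. Wishing you good health and a happy family!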
What
`truncateSingleLine` in `parseText.ts` cuts a line with `textLine.substr(0, subLength)`, where `subLength` is a UTF-16 code unit count from `estimateLength` (or from the proportional estimate on later iterations). `estimateLength` walks the string one code unit at a time (`text.charCodeAt(i)`), so `subLength` can land between the two halves of a surrogate pair. The slice then keeps an orphaned lead surrogate and corrupts the character.

This affects real text. Characters outside the BMP are encoded as surrogate pairs in UTF-16, including CJK Extension B ideographs such as U+20BB7 (𠮷, which appears in Japanese names) and emoji.
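For reviewers who want to see the failure in isolation, here is a minimal sketch that is independent of the library code. `splitsSurrogatePair` is a hypothetical helper written for this illustration, not a function from the PR, and the sketch uses `substring` rather than the `substr` call quoted above; both give the same slice for a start index of 0.

```typescript
// U+20BB7 is stored in UTF-16 as the surrogate pair D842 DFB7.
const text = "\u{20BB7}\u{20BB7}"; // two characters, four code units

// Hypothetical helper (not from the PR): true when a cut at `index`
// would fall between a lead surrogate and its trail surrogate.
function splitsSurrogatePair(s: string, index: number): boolean {
  if (index <= 0 || index >= s.length) return false;
  const prev = s.charCodeAt(index - 1);
  const next = s.charCodeAt(index);
  const prevIsLead = prev >= 0xd800 && prev <= 0xdbff;
  const nextIsTrail = next >= 0xdc00 && next <= 0xdfff;
  return prevIsLead && nextIsTrail;
}

// Cutting at code unit 3 keeps the first pair plus half of the second.
const cut = text.substring(0, 3);

console.log(text.length);                    // 4
console.log(splitsSurrogatePair(text, 3));   // true
console.log(cut.charCodeAt(2).toString(16)); // d842 (orphaned lead surrogate)
```

The last code unit of `cut` is a lead surrogate with no trail, which is what renders as a broken character.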
Repro
With a font where one fullwidth glyph is 2 units wide:
`measureCharWidth` counts each surrogate half as a full wide char, so `estimateLength` returns 3, an odd index inside the second pair. The whole-string measure `measureWidth` correctly treats the pair as a single width-2 glyph, so once the orphaned result fits the content width the loop stops and the broken character is kept.
Fix
Before slicing, if `subLength` would cut right after a lead surrogate, step it back by one to the pair boundary. ASCII and BMP text are unaffected, and the loop still appends the ellipsis as before.

I added a unit test that stubs `measureText` (so no canvas is needed) and checks that the surrogate pair is preserved, that text which fits is returned unchanged, and that ASCII truncation is unchanged. The full `test/ut` suite passes.
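The step-back described under Fix can be sketched roughly as follows. This is an illustration of the idea and not the code in the diff: `isLeadSurrogate` and `safeCutIndex` are hypothetical names, and the sketch assumes that checking the code unit just before the cut is sufficient.

```typescript
// Hypothetical helper: true for UTF-16 lead (high) surrogates.
function isLeadSurrogate(code: number): boolean {
  return code >= 0xd800 && code <= 0xdbff;
}

// Hypothetical helper: if a cut at `index` would land right after a lead
// surrogate, move it back by one so the whole pair is dropped instead.
function safeCutIndex(s: string, index: number): number {
  if (index > 0 && index < s.length && isLeadSurrogate(s.charCodeAt(index - 1))) {
    return index - 1;
  }
  return index;
}

const wide = "\u{20BB7}\u{20BB7}"; // two surrogate pairs, four code units

// An estimate of 3 lands inside the second pair; the adjusted cut is 2.
const adjusted = safeCutIndex(wide, 3);
const kept = wide.substring(0, adjusted);

console.log(adjusted);              // 2
console.log(kept === "\u{20BB7}");  // true
console.log(safeCutIndex("abcdef", 3)); // 3 (ASCII is unaffected)
```

Stepping back rather than forward keeps the result no wider than the estimate, so the truncated text cannot overflow because of the adjustment.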