Skip to content

fix: unicode is handled properly in strdist#280

Open
letFunny wants to merge 6 commits into
canonical:mainfrom
letFunny:bug-fix-distance-unicode
Open

fix: unicode is handled properly in strdist#280
letFunny wants to merge 6 commits into
canonical:mainfrom
letFunny:bug-fix-distance-unicode

Conversation

@letFunny
Copy link
Copy Markdown
Collaborator

@letFunny letFunny commented Apr 2, 2026

  • Have you signed the CLA?

While doing more performance work I noticed a couple of bugs in our implementation of distance. I will fix each one of them in a different PR so that the perf work can land. I have added a TODO for the next bug. I don't want to fix both as part of the same PR as it will be much harder to reason about the code, the fix for unicode in itself is pretty easy.

This PR fixes a common bug when handling strings in Go. The for loop:

for bi, br := range b {
    ...
    lst[bi+1] = cost
}

Is iterating over a string and using bi which is the byte offset to store things in lst instead of the current rune count. In the case where the characters are ASCII it works because each rune is 1 byte but in the case of runes that take more than 1 byte this fails (see the added test case).

Comment thread internal/strdist/strdist.go Outdated
}
_ = stop
if cut != 0 && stop {
if cut != 0 && len(b) > 0 && stop {
Copy link
Copy Markdown
Collaborator Author

@letFunny letFunny Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not strictly related to unicode but it made the test fail when the second string was empty so I fixed it as well.

For context, stop is meant to represent that the minimum edit cost is greater than the threshold. However, if the second string is empty, stop will not be set to false by the inner loop, which is a bug (see test case).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put into words why this is the right fix here? Note that this is disabling the cut logic altogether when b is empty.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your intuition was right this is only a partial fix. I added several more test cases that were failing previously. I also changed the logic to finish as soon as cut is reached. The issue was the following, stop is computed by iterating over b[i] and seeing if:

  • cost of swapping it for a[i]
  • cost of inserting b[i]
  • cost of removing a[i]
  • 0 if a[i] == b[i]

any of these was less than cut. The problem is that to enter the loop b could not be empty. The calculation above the for loop correctly accounted for this case in lst[0], that is comparing a[:j] to b[:0]. The problem is that we were not using the cost here to compute min.

You were the one devising the algorithm so I hope I got a good understanding, let me know if we need to discuss it in more depth in person.

@letFunny letFunny added the Bug An undesired feature ;-) label Apr 2, 2026
Copy link
Copy Markdown
Collaborator

@upils upils left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Would you mind reworking the PR description to clarify what is actually fixed? I found it a little tricky to spot.

Copy link
Copy Markdown
Contributor

@niemeyer niemeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! A question about the added logic.

Comment thread internal/strdist/strdist.go Outdated
}
_ = stop
if cut != 0 && stop {
if cut != 0 && len(b) > 0 && stop {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put into words why this is the right fix here? Note that this is disabling the cut logic altogether when b is empty.

Comment thread internal/strdist/strdist_test.go
Copy link
Copy Markdown
Contributor

@niemeyer niemeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👆

@letFunny letFunny requested a review from upils April 28, 2026 08:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug An undesired feature ;-)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants