fix(parser): prevent duplicate TLD match for numerical subdomain URLs (#425) #438
Open
fuleinist wants to merge 1 commit into
…gregjacobs#425)

When a scheme-less URL with a numerical subdomain was autolinked (e.g. "123.example.com"), the parser emitted two UrlMatch objects: the full "123.example.com" link followed by a stray "example.com" link. The output HTML therefore contained the substring "example.com" twice for a single input.

Root cause: the digit-started TLD state machine (TldUrlStateMachine) ran in parallel with a PhoneNumberDigit state machine and an IPv4 state machine. At the first alphabetic character ("e" in "example.com"), statePhoneNumberSeparator captured and invoked stateNoMatch, which then added a fresh TldUrlStateMachine starting at the same character index. Both machines proceeded to accept "example.com" as a valid TLD match, producing the duplicate.

Fix: in stateNoMatch, skip the TLD-machine creation branch when an existing TLD URL state machine is already running past its first character, because that machine will consume the current character itself. Scheme and IPv4 machines are intentionally not checked; they have their own guards (hasSchemeUrlMachine and the digit-prefixed IPv4 state) that already prevent the duplicate-creation race in their respective code paths.

Adds a regression test in tests/autolinker-url-tld.spec.ts covering "123.example.com", "456.example.com" mid-sentence, and a multi-label "999.subdomain.example.com".
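The race and the guard can be illustrated with a toy model. This is a minimal sketch, not Autolinker's actual parser: `Machine`, `scan`, `noMatch`, and `guardEnabled` are hypothetical names, the IPv4 machine is left out, and `isValidTldMatch` here is a simplified stand-in that only requires an all-letter final label.

```typescript
// Toy model only: names and structure are illustrative assumptions,
// not Autolinker's real internals.
type Machine = { kind: "tld" | "phone"; start: number };

const isDigit = (c: string): boolean => c >= "0" && c <= "9";
const isAlpha = (c: string): boolean => c >= "a" && c <= "z";
const isLabelChar = (c: string): boolean => isDigit(c) || isAlpha(c);

// Simplified stand-in for TLD validation: dot-separated labels
// whose final label consists of letters only.
const isValidTldMatch = (s: string): boolean =>
  /^[a-z0-9]+(\.[a-z0-9]+)*\.[a-z]+$/.test(s);

function scan(text: string, guardEnabled: boolean): string[] {
  const matches: string[] = [];
  let machines: Machine[] = [];

  for (let i = 0; i <= text.length; i++) {
    const c = text.charAt(i); // "" once i reaches the end of the input
    const next: Machine[] = [];

    // Stand-in for stateNoMatch: may start new machines at index i.
    const noMatch = (): void => {
      const tldRunning = machines.some(m => m.kind === "tld" && m.start < i);
      // The guard: an older TLD machine will consume text[i] itself.
      if (isLabelChar(c) && !(guardEnabled && tldRunning)) {
        next.push({ kind: "tld", start: i });
      }
      if (isDigit(c)) next.push({ kind: "phone", start: i });
    };

    if (machines.length === 0) noMatch();

    for (const m of machines) {
      if (m.kind === "tld") {
        if (isLabelChar(c) || c === ".") {
          next.push(m); // keep consuming
        } else {
          const candidate = text.slice(m.start, i);
          if (isValidTldMatch(candidate)) matches.push(candidate);
        }
      } else if (isDigit(c) || c === ".") {
        next.push(m); // phone machine keeps consuming digits and separators
      } else {
        noMatch(); // phone machine gives up and hands the character over
      }
    }
    machines = next;
  }
  return matches;
}

console.log(scan("123.example.com", false)); // [ '123.example.com', 'example.com' ]
console.log(scan("123.example.com", true));  // [ '123.example.com' ]
```

With the guard disabled, the phone machine's failure at "e" spawns a second TLD machine while the first is still alive, and both emit a match at the end of input. With the guard enabled, only the machine that started at "1" survives.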
Fixes #425. Scheme-less URLs with a numerical subdomain (e.g. `123.example.com`) were producing two `UrlMatch` objects: the full URL and a stray `example.com` match appended after it.

Root cause
When the parser sees a digit, `stateNoMatch` starts three URL/phone state machines in parallel:

- `PhoneNumberStateMachine` (in `PhoneNumberDigit` state)
- `IpV4UrlStateMachine` (in `IpV4Digit` state)
- `TldUrlStateMachine` (in `DomainLabelChar` state)

For input `123.example.com`:

- At `.`, the phone machine transitions to `PhoneNumberSeparator`; the IPv4 machine transitions to `IpV4Dot`; the TLD machine transitions to `DomainDot` and (with no accept state yet) is still alive.
- At the `e` of `example.com`:
  - The phone machine (on this alphabetic character) calls `captureMatchIfValidAndRemove`, then `stateNoMatch`.
  - The IPv4 machine is removed via `context.removeMachine`.
  - The TLD machine moves from `DomainDot` to `DomainLabelChar` and sets `acceptStateReached = true`.

But `stateNoMatch` then unconditionally added another `TldUrlStateMachine` starting at `e`. From that point on both TLD machines processed `example.com`, both reached `acceptStateReached`, and both passed `isValidTldMatch`, so the matches array received `123.example.com` and `example.com`.

Fix
`stateNoMatch` now skips the TLD-creation branch when an existing `TldUrlStateMachine` is already running past its first character. That machine will consume the current character itself, so spawning another one would just duplicate it.

`hasSchemeUrlMachine` (scheme URL) and the digit-prefixed IPv4 `stateNoMatch` paths already have their own guards against duplicate creation, so they're intentionally not folded into the same check.

Verification
- `tests/autolinker-url-tld.spec.ts` exercises `123.example.com`, `456.example.com` mid-sentence, and `999.subdomain.example.com`.
- Unit tests (`pnpm run test:unit`): 101,053 passing, including the existing `abc.123` non-match case (an unknown TLD at the end) and the `localhost.local001/test` non-match case, so the fix does not over-link.
- `autolinker.link('123.example.com')` now produces a single anchor instead of an anchor followed by a duplicate `example.com` substring.

Diff