
fix(parser): prevent duplicate TLD match for numerical subdomain URLs (#425) #438

Open
fuleinist wants to merge 1 commit into gregjacobs:master from fuleinist:fix-numerical-subdomain-duplicate

@fuleinist

Fixes #425 — scheme-less URLs with a numerical subdomain (e.g. 123.example.com) were producing two UrlMatch objects: the full URL and a stray example.com match appended after it.

Root cause

When the parser sees a digit, stateNoMatch starts three URL/phone state machines in parallel:

  1. PhoneNumberStateMachine (in PhoneNumberDigit state)
  2. IpV4UrlStateMachine (in IpV4Digit state)
  3. TldUrlStateMachine (in DomainLabelChar state)

For input 123.example.com:

  • At ., the phone machine transitions to PhoneNumberSeparator; the IPv4 machine transitions to IpV4Dot; the TLD machine transitions to DomainDot and (with no accept state yet) is still alive.
  • At the first alpha character e of example.com:
    • The phone machine (in Separator) hits a non-digit non-( character → calls captureMatchIfValidAndRemove, then stateNoMatch.
    • The IPv4 machine sees an alpha character → context.removeMachine.
    • The TLD machine transitions DomainDot → DomainLabelChar and sets acceptStateReached = true.

But stateNoMatch then unconditionally added another TldUrlStateMachine starting at e. From that point on both TLD machines processed example.com, both reached acceptStateReached, and both passed isValidTldMatch, so the matches array received 123.example.com and example.com.
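
The race described above can be reproduced in a toy model. This is only a sketch under simplifying assumptions, not the code in src/parser/parse-matches.ts: the type `ToyMachine` and the function `toyParseUnguarded` are hypothetical names, the IPv4 machine and `isValidTldMatch` are left out, and the input is assumed to be a single token without whitespace.

```typescript
// Toy model, not the real parser: every name below is hypothetical, and
// TLD validation (isValidTldMatch) and the IPv4 machine are left out.
type ToyMachine =
  | { kind: 'phone'; startIdx: number; afterSeparator: boolean }
  | { kind: 'tld'; startIdx: number; afterDot: boolean; accepted: boolean };

const isDigit = (ch: string): boolean => /[0-9]/.test(ch);
const isAlpha = (ch: string): boolean => /[a-z]/i.test(ch);

// Assumes `text` is a single whitespace-free token, so every accepted
// machine's match runs to the end of the input.
function toyParseUnguarded(text: string): string[] {
  let machines: ToyMachine[] = [];

  // Stand-in for the pre-fix stateNoMatch: always spawns a TLD machine.
  const stateNoMatch = (idx: number): void => {
    const ch = text.charAt(idx);
    if (isDigit(ch)) {
      machines.push({ kind: 'phone', startIdx: idx, afterSeparator: false });
    }
    if (isDigit(ch) || isAlpha(ch)) {
      machines.push({ kind: 'tld', startIdx: idx, afterDot: false, accepted: false });
    }
  };

  for (let idx = 0; idx < text.length; idx++) {
    const ch = text.charAt(idx);
    if (machines.length === 0) {
      stateNoMatch(idx);
      continue;
    }
    // Iterate over a snapshot so machines spawned at this index do not
    // also consume the character they were started on.
    for (const m of [...machines]) {
      if (m.kind === 'phone') {
        if (isDigit(ch)) {
          m.afterSeparator = false;
        } else if (ch === '.' && !m.afterSeparator) {
          m.afterSeparator = true;
        } else {
          const wasAfterSeparator = m.afterSeparator;
          machines = machines.filter(other => other !== m);
          // Mirrors the path described above: failing right after a
          // separator falls back into stateNoMatch at the same index.
          if (wasAfterSeparator) stateNoMatch(idx);
        }
      } else if (ch === '.') {
        m.afterDot = true;
      } else {
        // A letter directly after a dot puts the TLD machine in its accept state.
        if (m.afterDot && isAlpha(ch)) m.accepted = true;
        m.afterDot = false;
      }
    }
  }

  const matches: string[] = [];
  for (const m of machines) {
    if (m.kind === 'tld' && m.accepted) matches.push(text.slice(m.startIdx));
  }
  return matches;
}

// Logs both '123.example.com' and the stray 'example.com'.
console.log(toyParseUnguarded('123.example.com'));
```

In this model the second TLD machine is spawned at index 4 (the `e`), at the same step where the original machine reaches its accept state, which is why both survive to the end of the input.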

Fix

stateNoMatch now skips the TLD-creation branch when an existing TldUrlStateMachine is already running past its first character — that machine will consume the current character itself, so spawning another one would just duplicate it.

hasSchemeUrlMachine (scheme URL) and the digit-prefixed IPv4 stateNoMatch paths already have their own guards against duplicate creation, so they're intentionally not folded into the same check.
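
As a rough illustration of the guard, the check could look like the predicate below. This is a sketch, not the code in the diff: `RunningMachine`, its `type` and `startIdx` fields, and `shouldSpawnTldMachine` are assumed names, and "running past its first character" is modelled as the machine's start index being strictly less than the current character index.

```typescript
// Hypothetical shapes for illustration; the real state-machine types in
// src/parser/parse-matches.ts carry more fields.
interface RunningMachine {
  type: 'url-tld' | 'url-scheme' | 'url-ipv4' | 'phone';
  startIdx: number; // index of the first character this machine consumed
}

// Returns true when stateNoMatch should start a new TLD machine at charIdx.
// Only TLD machines are inspected; scheme and IPv4 machines are ignored on
// purpose, matching the description above.
function shouldSpawnTldMachine(
  machines: readonly RunningMachine[],
  charIdx: number
): boolean {
  const tldAlreadyRunning = machines.some(
    m => m.type === 'url-tld' && m.startIdx < charIdx
  );
  return !tldAlreadyRunning;
}

// For '123.example.com' at the 'e' (index 4), the TLD machine started at
// index 0 is still running, so no second machine is spawned.
const running: RunningMachine[] = [{ type: 'url-tld', startIdx: 0 }];
console.log(shouldSpawnTldMachine(running, 4)); // false
console.log(shouldSpawnTldMachine([], 4));      // true
```

Comparing start indices rather than checking for any TLD machine keeps a machine that was created at the current index from suppressing itself.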

Verification

  • New regression test tests/autolinker-url-tld.spec.ts exercises 123.example.com, 456.example.com mid-sentence, and 999.subdomain.example.com.
  • Full unit test suite (pnpm run test:unit): 101,053 passing — including the existing abc.123 non-match case (an unknown TLD at the end) and localhost.local001/test non-match case, so the fix does not over-link.
  • Manually verified the output of autolinker.link('123.example.com') now produces a single anchor instead of an anchor followed by a duplicate example.com substring.

Diff

src/parser/parse-matches.ts      | +34 / -2
tests/autolinker-url-tld.spec.ts | +21 / -0

…gregjacobs#425)

When a scheme-less URL with a numerical subdomain was autolinked
(e.g. "123.example.com"), the parser emitted two UrlMatch objects:
the full "123.example.com" link followed by a stray "example.com"
link. The output HTML therefore contained the substring "example.com"
twice for a single input.

Root cause: the digit-started TLD state machine (TldUrlStateMachine)
ran in parallel with a PhoneNumberDigit state machine and an IPv4
state machine. At the first alphabetic character ("e" in
"example.com"), statePhoneNumberSeparator captured and invoked
stateNoMatch, which then added a fresh TldUrlStateMachine starting at
the same character index. Both machines proceeded to accept
"example.com" as a valid TLD match, producing the duplicate.

Fix: in stateNoMatch, skip the TLD-machine creation branch when an
existing TLD URL state machine is already running past its first
character — that machine will consume the current character itself.
Scheme and IPv4 machines are intentionally not checked; they have
their own guards (hasSchemeUrlMachine and the digit-prefixed IPv4
state) that already prevent the duplicate-creation race in their
respective code paths.

Adds a regression test in tests/autolinker-url-tld.spec.ts covering
"123.example.com", "456.example.com" mid-sentence, and a
multi-label "999.subdomain.example.com".

Development

Successfully merging this pull request may close these issues.

Linking numerical subdomain without protocol is broken
