
[codex] Speed up MySQL lexing and parsing #375

Closed

adamziel wants to merge 4 commits into trunk from explore/lexing-parsing-10x

Conversation


@adamziel adamziel commented Apr 27, 2026

What changed

This draft explores faster MySQL lexing and parsing while keeping the parser compact.

  • Speeds up WP_MySQL_Lexer::remaining_tokens() by avoiding repeated public method calls during bulk tokenization.
  • Caches the SQL payload length in the lexer and uses cheaper character checks on hot paths.
  • Speeds up MySQL token construction by avoiding an extra parent constructor call per token.
  • Replaces the parser grammar's limited iterative lookahead table with full FIRST-set computation.
  • Adds per-token branch candidate tables so the parser tries only grammar branches that can start with the current token (see the sketch after this list).
  • Caches parser grammar tables on the parser instance during a parse.
  • Caches token IDs alongside parser tokens to avoid repeated object-property reads in terminal/lookahead checks.
  • Adds a rule-name lookup map to avoid repeated linear array_search() calls.
  • Reuses computed branch dispatch tables for identical grammar loads, drops retained intermediate FIRST sets, and stores single-candidate dispatch entries as integers to keep default test memory under control.
  • Avoids reading the next token for MySQL's SELECT ... INTO negative-lookahead check unless the current rule is selectStatement.
  • Removes parser failed-match memoization after branch dispatch made it net overhead in the benchmark suite.
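
To make the branch-dispatch idea concrete, here is a minimal sketch. All names are hypothetical, not the identifiers used in this branch: each rule's branches are indexed by the token IDs in their FIRST sets, and single-candidate entries are stored as bare integers to keep the tables small, as noted above.

```php
<?php
// Hypothetical sketch; not the actual code from this PR.
// $branches_first_sets: branch index => token IDs that can start that branch.
function build_branch_dispatch( array $branches_first_sets ): array {
	$dispatch = array();
	foreach ( $branches_first_sets as $branch_index => $token_ids ) {
		foreach ( $token_ids as $token_id ) {
			if ( ! isset( $dispatch[ $token_id ] ) ) {
				// Single candidate: store a bare int to keep memory low.
				$dispatch[ $token_id ] = $branch_index;
			} elseif ( is_int( $dispatch[ $token_id ] ) ) {
				// Second candidate: promote the entry to an array.
				$dispatch[ $token_id ] = array( $dispatch[ $token_id ], $branch_index );
			} else {
				$dispatch[ $token_id ][] = $branch_index;
			}
		}
	}
	return $dispatch;
}

// At parse time, the parser tries only the candidate branches for the
// current token instead of rejecting every branch in turn:
//   $candidates = $dispatch[ $token_id ] ?? array();
//   foreach ( (array) $candidates as $branch_index ) { /* try branch */ }
```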

Why

The dynamic recursive-descent parser spends a lot of time repeatedly rejecting grammar branches that cannot match the current token. The lexer also paid avoidable overhead on the common remaining_tokens() path used before parsing.

This keeps the current architecture and grammar file format intact while moving more branch-selection work to grammar initialization.
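
As an illustration of moving that work to initialization, here is a sketch of a fixed-point FIRST-set computation. The grammar encoding (negative values for token IDs, non-negative values for rule references) is an assumption for the example, not the actual grammar file format, and nullable rules are ignored for brevity.

```php
<?php
// Hypothetical sketch: computes, for each rule, the set of token IDs a
// match can start with, iterating until no set grows.
function compute_first_sets( array $rules ): array {
	// $rules: rule ID => branches; each branch is a list of symbols where
	// negative values are token IDs and non-negative values are rule IDs
	// (an assumed encoding, for illustration only).
	$first = array_fill_keys( array_keys( $rules ), array() );
	do {
		$changed = false;
		foreach ( $rules as $rule => $branches ) {
			foreach ( $branches as $branch ) {
				$head = $branch[0];
				// Only the first symbol is considered; nullable rules
				// are ignored to keep the sketch short.
				$tokens = $head < 0 ? array( $head => true ) : $first[ $head ];
				foreach ( $tokens as $token_id => $unused ) {
					if ( ! isset( $first[ $rule ][ $token_id ] ) ) {
						$first[ $rule ][ $token_id ] = true;
						$changed                    = true;
					}
				}
			}
		}
	} while ( $changed );
	return $first;
}
```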

Performance

Original trunk baseline captured before this branch:

  • Lexer: 69,578 queries in 4.824s @ 14.4k QPS
  • Parser including lexing: 69,577 queries in 21.275s @ 3.27k QPS

Fresh local run on this branch:

  • Lexer: 69,578 queries in 1.76580s @ 39.4k QPS
  • Parser including lexing: 69,577 queries in 9.31625s @ 7.47k QPS

Reviewer run from the adversarial loop:

  • Lexer: 69,578 queries in 1.71405s @ 40.6k QPS
  • Parser including lexing: 69,577 queries in 9.70479s @ 7.17k QPS

This is roughly:

  • 2.8x faster lexer time.
  • 2.28x faster end-to-end parser time.

It does not reach 10x. The independent reviewer concluded that further large gains likely require a generated/specialized parser or larger rearchitecture.

Parser size constraint

The current compact parser footprint remains well under the requested 200 KB cap:

  • src/parser/*.php plus src/mysql/mysql-grammar.php: 92,090 bytes total.
  • After the follow-up commits on this branch: 93,804 bytes total.

Validation

  • git diff --check
  • php -l on modified lexer/parser files
  • composer run test -- --filter 'WP_MySQL_(Lexer|Server_Suite_(Lexer|Parser))'
    • 141 tests
    • 1,420,987 assertions
  • composer run test
    • 667 tests
    • 1,427,673 assertions
    • 2 skipped, 2 incomplete
  • php packages/mysql-on-sqlite/tests/tools/run-lexer-benchmark.php
  • php packages/mysql-on-sqlite/tests/tools/run-parser-benchmark.php

Follow-up exploration

The next phase should investigate whether a compact specialized parser can preserve the 200 KB cap while reducing dynamic recursive-descent overhead further. Promising directions:

  • generate compact predictive dispatch tables rather than expanding PHP parser code;
  • specialize the high-volume statement families used by WordPress while falling back to the generic parser (a rough sketch follows this list);
  • keep any generated artifacts small enough that src/parser/*.php plus grammar metadata stays below 200 KB.
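
A rough sketch of the second direction, routing hot statement families to small specialized parsers with a generic fallback. Everything here is a stand-in: the token IDs, the function, and the dispatch shape are assumptions, not existing code.

```php
<?php
// Hypothetical sketch only; these constants are stand-ins for the real
// lexer token IDs, and no specialized parsers exist yet.
const SELECT_TOKEN = 1; // stand-in for the lexer's SELECT token ID
const INSERT_TOKEN = 2; // stand-in for the lexer's INSERT token ID

function choose_parser( int $first_token_id ): string {
	switch ( $first_token_id ) {
		case SELECT_TOKEN:
		case INSERT_TOKEN:
			// Hot statement families get a small hand-written parser.
			return 'specialized';
		default:
			// Everything else falls back to the generic
			// grammar-driven recursive-descent parser.
			return 'generic';
	}
}
```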

@adamziel
Collaborator Author

Closing per request; scrapping this parser performance experiment.

@adamziel adamziel closed this Apr 27, 2026
JanJakes added a commit that referenced this pull request Apr 28, 2026
Apply lexer optimisations from PR #375:

- Cache `strlen($sql)` once in `$sql_length` instead of recomputing on each
  EOF check.
- Replace `strspn($byte, MASK) > 0` with direct byte comparisons
  (`$byte >= '0' && $byte <= '9'`, `false !== strpos(MASK, $byte)`,
  unrolled whitespace check).
- Use `strpos($sql, '*/', $pos)` instead of a manual scan loop in
  `read_comment_content()`.
- In `read_quoted_text()`, use `strpos()` to find the next quote, eliminating
  the separate end-of-input check that follows the `strcspn()` scan.
- Inline `next_token()` + `get_token()` in `remaining_tokens()` so the hot
  loop builds tokens directly.

Co-authored-by: Adam Zieliński <adam@adamziel.com>

Adapted from #375
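
A before/after sketch of the character-class change described in the commit message above. The variable names follow the commit message, but the mask, input, and surrounding lexer code are simplified stand-ins.

```php
<?php
// Simplified illustration of the strspn() -> direct-comparison change.
$sql        = 'SELECT 1 /* note */';
$sql_length = strlen( $sql ); // cached once instead of recomputed per check
$pos        = 7;
$byte       = $sql[ $pos ];

// Before: one strspn() call per character test.
$is_digit = strspn( $byte, '0123456789' ) > 0;

// After: a direct byte comparison with no function-call overhead.
$is_digit = ( $byte >= '0' && $byte <= '9' );

// Likewise, strpos() replaces a manual byte-by-byte scan loop when
// searching for the end of a block comment:
$end = strpos( $sql, '*/', $pos );
$pos = ( false === $end ) ? $sql_length : $end + 2;
```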
JanJakes added a commit that referenced this pull request Apr 28, 2026
Token construction is on the lexer hot path; bypassing the
`WP_Parser_Token::__construct()` indirection and assigning the four
properties directly removes one method call per token.

Requires `$input` on `WP_Parser_Token` to be `protected` instead of
`private` so the subclass can write to it.

Co-authored-by: Adam Zieliński <adam@adamziel.com>

Adapted from #375
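
A sketch of that constructor-bypass pattern. Only `$input` is named in the commit message; the other three property names are assumptions, and the `_Sketch` suffix marks these as illustrative classes, not the real ones.

```php
<?php
// Hypothetical sketch of the pattern described in the commit message.
class WP_Parser_Token_Sketch {
	public $id;
	public $start;
	public $length;
	protected $input; // protected (was private) so subclasses can write it.

	public function __construct( $id, $start, $length, $input ) {
		$this->id     = $id;
		$this->start  = $start;
		$this->length = $length;
		$this->input  = $input;
	}
}

class WP_MySQL_Token_Sketch extends WP_Parser_Token_Sketch {
	public function __construct( $id, $start, $length, $input ) {
		// Assign directly instead of calling parent::__construct(),
		// saving one method call per token on the lexer hot path.
		$this->id     = $id;
		$this->start  = $start;
		$this->length = $length;
		$this->input  = $input;
	}
}
```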