
[codex] Speed up MySQL lexing and parsing #375

Closed

adamziel wants to merge 4 commits into trunk from explore/lexing-parsing-10x

Conversation


@adamziel adamziel commented Apr 27, 2026

What changed

This draft explores faster MySQL lexing and parsing while keeping the parser compact.

  • Speeds up WP_MySQL_Lexer::remaining_tokens() by avoiding repeated public method calls during bulk tokenization.
  • Caches the SQL payload length in the lexer and uses cheaper character checks on hot paths.
  • Speeds up MySQL token construction by avoiding an extra parent constructor call per token.
  • Replaces the parser grammar's limited iterative lookahead table with full FIRST-set computation.
  • Adds per-token branch candidate tables so the parser tries only grammar branches that can start with the current token (see the sketch after this list).
  • Caches parser grammar tables on the parser instance during a parse.
  • Caches token IDs alongside parser tokens to avoid repeated object-property reads in terminal/lookahead checks.
  • Adds a rule-name lookup map to avoid repeated linear array_search() calls.
  • Reuses computed branch dispatch tables for identical grammar loads, drops retained intermediate FIRST sets, and stores single-candidate dispatch entries as integers to keep default test memory under control.
  • Avoids reading the next token for MySQL's SELECT ... INTO negative-lookahead check unless the current rule is selectStatement.
  • Removes parser failed-match memoization after branch dispatch made it net overhead in the benchmark suite.
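
To make the branch-dispatch idea concrete, here is a minimal sketch. All names are hypothetical, not the identifiers used in this branch: each rule's branches are indexed by the token IDs in their FIRST sets, and single-candidate entries are stored as bare integers to keep the tables small, as noted above.

```php
<?php
// Hypothetical sketch; not the actual code from this PR.
// $branches_first_sets: branch index => token IDs that can start that branch.
function build_branch_dispatch( array $branches_first_sets ): array {
	$dispatch = array();
	foreach ( $branches_first_sets as $branch_index => $token_ids ) {
		foreach ( $token_ids as $token_id ) {
			if ( ! isset( $dispatch[ $token_id ] ) ) {
				// Single candidate: store a bare int to keep memory low.
				$dispatch[ $token_id ] = $branch_index;
			} elseif ( is_int( $dispatch[ $token_id ] ) ) {
				// Second candidate: promote the entry to an array.
				$dispatch[ $token_id ] = array( $dispatch[ $token_id ], $branch_index );
			} else {
				$dispatch[ $token_id ][] = $branch_index;
			}
		}
	}
	return $dispatch;
}

// At parse time, the parser tries only the candidate branches for the
// current token instead of rejecting every branch in turn:
//   $candidates = $dispatch[ $token_id ] ?? array();
//   foreach ( (array) $candidates as $branch_index ) { /* try branch */ }
```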

Why

The dynamic recursive-descent parser spends a lot of time repeatedly rejecting grammar branches that cannot match the current token. The lexer also paid avoidable overhead on the common remaining_tokens() path used before parsing.

This keeps the current architecture and grammar file format intact while moving more branch-selection work to grammar initialization.
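
As an illustration of moving that work to initialization, here is a sketch of a fixed-point FIRST-set computation. The grammar encoding (negative values for token IDs, non-negative values for rule references) is an assumption for the example, not the actual grammar file format, and nullable rules are ignored for brevity.

```php
<?php
// Hypothetical sketch: computes, for each rule, the set of token IDs a
// match can start with, iterating until no set grows.
function compute_first_sets( array $rules ): array {
	// $rules: rule ID => branches; each branch is a list of symbols where
	// negative values are token IDs and non-negative values are rule IDs
	// (an assumed encoding, for illustration only).
	$first = array_fill_keys( array_keys( $rules ), array() );
	do {
		$changed = false;
		foreach ( $rules as $rule => $branches ) {
			foreach ( $branches as $branch ) {
				$head = $branch[0];
				// Only the first symbol is considered; nullable rules
				// are ignored to keep the sketch short.
				$tokens = $head < 0 ? array( $head => true ) : $first[ $head ];
				foreach ( $tokens as $token_id => $unused ) {
					if ( ! isset( $first[ $rule ][ $token_id ] ) ) {
						$first[ $rule ][ $token_id ] = true;
						$changed                    = true;
					}
				}
			}
		}
	} while ( $changed );
	return $first;
}
```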

Performance

Original trunk baseline captured before this branch:

  • Lexer: 69,578 queries in 4.824s @ 14.4k QPS
  • Parser including lexing: 69,577 queries in 21.275s @ 3.27k QPS

Fresh local run on this branch:

  • Lexer: 69,578 queries in 1.76580s @ 39.4k QPS
  • Parser including lexing: 69,577 queries in 9.31625s @ 7.47k QPS

Reviewer run from the adversarial loop:

  • Lexer: 69,578 queries in 1.71405s @ 40.6k QPS
  • Parser including lexing: 69,577 queries in 9.70479s @ 7.17k QPS

This is roughly:

  • 2.8x faster lexer time.
  • 2.28x faster end-to-end parser time.

It does not reach 10x. The independent reviewer concluded that further large gains likely require a generated/specialized parser or larger rearchitecture.

Parser size constraint

The current compact parser footprint remains well under the requested 200 KB cap:

  • src/parser/*.php plus src/mysql/mysql-grammar.php: 92,090 bytes total.
  • After the follow-up commits on this branch: 93,804 bytes total.

Validation

  • git diff --check
  • php -l on modified lexer/parser files
  • composer run test -- --filter 'WP_MySQL_(Lexer|Server_Suite_(Lexer|Parser))'
    • 141 tests
    • 1,420,987 assertions
  • composer run test
    • 667 tests
    • 1,427,673 assertions
    • 2 skipped, 2 incomplete
  • php packages/mysql-on-sqlite/tests/tools/run-lexer-benchmark.php
  • php packages/mysql-on-sqlite/tests/tools/run-parser-benchmark.php

Follow-up exploration

The next phase should investigate whether a compact specialized parser can preserve the 200 KB cap while reducing dynamic recursive-descent overhead further. Promising directions:

  • generate compact predictive dispatch tables rather than expanding PHP parser code;
  • specialize the high-volume statement families used by WordPress while falling back to the generic parser (a rough sketch follows this list);
  • keep any generated artifacts small enough that src/parser/*.php plus grammar metadata stays below 200 KB.
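
A rough sketch of the second direction, routing hot statement families to small specialized parsers with a generic fallback. Everything here is a stand-in: the token IDs, the function, and the dispatch shape are assumptions, not existing code.

```php
<?php
// Hypothetical sketch only; these constants are stand-ins for the real
// lexer token IDs, and no specialized parsers exist yet.
const SELECT_TOKEN = 1; // stand-in for the lexer's SELECT token ID
const INSERT_TOKEN = 2; // stand-in for the lexer's INSERT token ID

function choose_parser( int $first_token_id ): string {
	switch ( $first_token_id ) {
		case SELECT_TOKEN:
		case INSERT_TOKEN:
			// Hot statement families get a small hand-written parser.
			return 'specialized';
		default:
			// Everything else falls back to the generic
			// grammar-driven recursive-descent parser.
			return 'generic';
	}
}
```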

@adamziel
Collaborator Author

Closing per request; scrapping this parser performance experiment.

@adamziel adamziel closed this Apr 27, 2026
JanJakes added a commit that referenced this pull request Apr 28, 2026
Apply lexer optimisations from PR #375:

- Cache `strlen($sql)` once in `$sql_length` instead of recomputing on each
  EOF check.
- Replace `strspn($byte, MASK) > 0` with direct byte comparisons
  (`$byte >= '0' && $byte <= '9'`, `false !== strpos(MASK, $byte)`,
  unrolled whitespace check).
- Use `strpos($sql, '*/', $pos)` instead of a manual scan loop in
  `read_comment_content()`.
- In `read_quoted_text()`, use `strpos()` to find the next quote, eliminating
  the separate end-of-input check that follows the `strcspn()` scan.
- Inline `next_token()` + `get_token()` in `remaining_tokens()` so the hot
  loop builds tokens directly.

Co-authored-by: Adam Zieliński <adam@adamziel.com>

Adapted from #375
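
A before/after sketch of the character-class change described in the commit message above. The variable names follow the commit message, but the mask, input, and surrounding lexer code are simplified stand-ins.

```php
<?php
// Simplified illustration of the strspn() -> direct-comparison change.
$sql        = 'SELECT 1 /* note */';
$sql_length = strlen( $sql ); // cached once instead of recomputed per check
$pos        = 7;
$byte       = $sql[ $pos ];

// Before: one strspn() call per character test.
$is_digit = strspn( $byte, '0123456789' ) > 0;

// After: a direct byte comparison with no function-call overhead.
$is_digit = ( $byte >= '0' && $byte <= '9' );

// Likewise, strpos() replaces a manual byte-by-byte scan loop when
// searching for the end of a block comment:
$end = strpos( $sql, '*/', $pos );
$pos = ( false === $end ) ? $sql_length : $end + 2;
```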
JanJakes added a commit that referenced this pull request Apr 28, 2026
Token construction is on the lexer hot path; bypassing the
`WP_Parser_Token::__construct()` indirection and assigning the four
properties directly removes one method call per token.

Requires `$input` on `WP_Parser_Token` to be `protected` instead of
`private` so the subclass can write to it.

Co-authored-by: Adam Zieliński <adam@adamziel.com>

Adapted from #375
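
A sketch of that constructor-bypass pattern. Only `$input` is named in the commit message; the other three property names are assumptions, and the `_Sketch` suffix marks these as illustrative classes, not the real ones.

```php
<?php
// Hypothetical sketch of the pattern described in the commit message.
class WP_Parser_Token_Sketch {
	public $id;
	public $start;
	public $length;
	protected $input; // protected (was private) so subclasses can write it.

	public function __construct( $id, $start, $length, $input ) {
		$this->id     = $id;
		$this->start  = $start;
		$this->length = $length;
		$this->input  = $input;
	}
}

class WP_MySQL_Token_Sketch extends WP_Parser_Token_Sketch {
	public function __construct( $id, $start, $length, $input ) {
		// Assign directly instead of calling parent::__construct(),
		// saving one method call per token on the lexer hot path.
		$this->id     = $id;
		$this->start  = $start;
		$this->length = $length;
		$this->input  = $input;
	}
}
```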