Skip to content

Re-enable Joop Leo right-recursion optimization (fixes #397)#1602

Open
erezsh wants to merge 5 commits into
masterfrom
fix_joop_leo
Open

Re-enable Joop Leo right-recursion optimization (fixes #397)#1602
erezsh wants to merge 5 commits into
masterfrom
fix_joop_leo

Conversation

@erezsh

@erezsh erezsh commented Jun 2, 2026

Copy link
Copy Markdown
Member

Leo was disabled when predict_and_complete was refactored into a method, breaking is_quasi_complete's access to start_symbol. This caused incorrect behavior with self-referential start symbols and multiple start symbols.

  • Restore TransitiveItem in earley_common.py
  • Add start_symbol parameter to predict_and_complete so is_quasi_complete can correctly guard against self-referential cycles
  • Move held_completions update before the Leo/regular branch so empty items populate it even when Leo fires (fixes nullable symbol parsing)
  • Guard create_leo_transitives with NULLABLE check: Leo must not shortcut nullable right-recursive symbols, as that bypasses the intermediate SPPF nodes where priority/disambiguation lives
  • Fix load_paths for multi-hop chains: use reduction.rule.origin instead of next_titem.s (which carries the outer rule's origin, not the intermediate symbol), and merge paths sharing the same intermediate symbol into one canonical SymbolNode to preserve SPPF sharing
  • Pass start_symbol through in xearley.py
  • Add tests for right-recursive Leo and multi-start-symbol correctness

Leo was disabled when predict_and_complete was refactored into a method,
breaking is_quasi_complete's access to start_symbol. This caused incorrect
behavior with self-referential start symbols and multiple start symbols.

- Restore TransitiveItem in earley_common.py
- Add start_symbol parameter to predict_and_complete so is_quasi_complete
  can correctly guard against self-referential cycles
- Move held_completions update before the Leo/regular branch so empty
  items populate it even when Leo fires (fixes nullable symbol parsing)
- Guard create_leo_transitives with NULLABLE check: Leo must not shortcut
  nullable right-recursive symbols, as that bypasses the intermediate SPPF
  nodes where priority/disambiguation lives
- Fix load_paths for multi-hop chains: use reduction.rule.origin instead
  of next_titem.s (which carries the outer rule's origin, not the
  intermediate symbol), and merge paths sharing the same intermediate
  symbol into one canonical SymbolNode to preserve SPPF sharing
- Pass start_symbol through in xearley.py
- Add tests for right-recursive Leo and multi-start-symbol correctness
@codecov

codecov Bot commented Jun 2, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 96.36364% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.46%. Comparing base (eb015d1) to head (583bac0).
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
lark/parsers/earley_common.py 80.00% 3 Missing ⚠️
lark/parsers/earley.py 98.11% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1602      +/-   ##
==========================================
+ Coverage   90.06%   90.46%   +0.40%     
==========================================
  Files          52       52              
  Lines        8089     8169      +80     
==========================================
+ Hits         7285     7390     +105     
+ Misses        804      779      -25     
Flag Coverage Δ
unittests 90.46% <96.36%> (+0.40%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@erezsh

erezsh commented Jun 2, 2026

Copy link
Copy Markdown
Member Author

@night199uk Does this seem right? It was vibe-coded, so I don't trust it much. But it does pass the tests.

Comment thread lark/parsers/earley.py
if quasi.expect not in self.NULLABLE:
return False
if quasi.rule.origin == start_symbol and quasi.expect == start_symbol:
return False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This statement isn't covered by the test cases. So, either we are missing a test case that exercises this branch, or start_symbol isn't actually needed here.

(There are a couple other lines in the Joop Leo code that aren't covered, but this one is the most interesting)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though one of the new test cases executes this branch, all the tests pass if this branch is removed, so I'm still not convinced it is necessary.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I guess I'll have to go a bit deeper into it myself, so I don't keep generating slop :)

Comment thread lark/parsers/earley.py
quasi = quasi.advance()
return True

def create_leo_transitives(origin, start):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original implementation of this function contains code comments that would probably be helpful to retain.

Comment thread lark/parsers/earley.py Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we are touching the transitives code: it would be less confusing for item.rule.origin to be used here instead of item.s. The existing code is correct given that item.s == item.rule.origin when item.is_complete, but unexpected given the preceding if item.rule.origin in transitives[item.start] condition.

Comment thread lark/parsers/earley.py
items.append(new_item)
###R Regular Earley completer
else:
# Empty has 0 length. If we complete an empty symbol in a particular

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: None of the tests actually fail if this is not moved up

Comment thread lark/parsers/earley_forest.py Outdated
# When a node already IS the intermediate symbol (from a direct Leo completion),
# use it as the canonical node so all other paths to the same symbol merge into it,
# preserving SPPF sharing and rule-order-based disambiguation.
canonical_vn = {}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function contains the largest amount of new code and, unsurprisingly, is also the most questionable.

After playing around with it a while, I believe the logic is valid, but the implementation definitely reeks of LLM-generation.

I came up with a version that effectively does the same thing, but more concisely and in-line with the patterns of the existing earley code:

node_cache = {(node.s, node.start, node.end): node for (_, node) in self.paths}
for transitive, node in self.paths:
    next_node = node
    if transitive.next_titem is not None:
        label = (transitive.next_titem.reduction.rule.origin, transitive.next_titem.reduction.start, self.end)
        is_node_the_next_node = label == (node.s, node.start, node.end)
        if not is_node_the_next_node:
            next_node = node_cache[label] if label in node_cache else node_cache.setdefault(label, type(self)(*label))
            next_node.add_path(transitive.next_titem, node)

    self.add_family(transitive.reduction.rule.origin, transitive.reduction.rule, transitive.reduction.start, transitive.reduction.node, next_node)
self.paths_loaded = True

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does look better, thanks! I updated the code accordingly.

erezsh and others added 4 commits June 7, 2026 15:27
As suggested by @chanicpanic, with small updates.

Collapses the two-pass canonical_vn dance into one loop keyed on the full
(s, start, end) SPPF identity. Paths whose node matches the expected
intermediate label are reused as the canonical node; later paths to the
same label merge via add_path and skip the redundant add_family.
1) Add Joop Leo regression tests for chain-extension corner cases

Two new tests under the Joop Leo section in TestFullEarley:
- Chain stops at non-nullable suffix: is_quasi_complete returns False when
  the outer rule has a non-nullable symbol after the recursive position.
- Chain extends through nullable suffix: is_quasi_complete traverses
  trailing nullable non-terminals via the NULLABLE check + advance loop.

These cover the previously-untested NULLABLE-suffix branch of
is_quasi_complete and the break-at-non-nullable-suffix path of
create_leo_transitives.

2) Add Joop Leo test for nullable-start self-reference guard

Covers the is_quasi_complete branch that returns False when the chain
encounters a state in start_symbol's rule expecting start_symbol itself
as a nullable non-terminal. Without this guard, the Leo shortcut would
treat the start rule as quasi-complete and produce an incorrect chain.
- is_quasi_complete's complete-item fast-path: caller in
  create_leo_transitives only passes originators with expect != None.
- root_transitive's else branch: every titem is registered at
  transitives[col][previous], so the lookup always succeeds (the root
  titem self-resolves through it).
- Leo + terminal scan branch: is_quasi_complete requires the post-
  recursion suffix to be all-nullable, so new_item.expect is never a
  terminal.

The first two are Python-level defensive code; the third is residue
from a more paper-faithful structure that the current titem
representation makes redundant. Cycle detection (visited) and the
empty-FIRST guard are retained as cheap protection against unusual
grammars.
- Move held_completions update back into the regular completer branch.
  Empty (nullable) items can never trigger Leo (create_leo_transitives
  breaks at the NULLABLE check, so transitives never contains nullable
  origins as keys), so the unconditional placement was unnecessary.
  Reviewer's observation that no test failed without the move is correct.
- Use item.rule.origin instead of item.s when looking up the transitive,
  matching the preceding `in transitives[item.start]` check. Both are
  equivalent for complete items, but item.rule.origin is less surprising.
- Restore the original walk-backwards/walk-forwards comments in
  create_leo_transitives so the two phases are immediately legible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erezsh

erezsh commented Jun 7, 2026

Copy link
Copy Markdown
Member Author

Updated the PR based on the review. (simplified, improved coverage)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants