Re-enable Joop Leo right-recursion optimization (fixes #397)#1602
Re-enable Joop Leo right-recursion optimization (fixes #397)#1602erezsh wants to merge 5 commits into
Conversation
Leo was disabled when predict_and_complete was refactored into a method, breaking is_quasi_complete's access to start_symbol. This caused incorrect behavior with self-referential start symbols and multiple start symbols. - Restore TransitiveItem in earley_common.py - Add start_symbol parameter to predict_and_complete so is_quasi_complete can correctly guard against self-referential cycles - Move held_completions update before the Leo/regular branch so empty items populate it even when Leo fires (fixes nullable symbol parsing) - Guard create_leo_transitives with NULLABLE check: Leo must not shortcut nullable right-recursive symbols, as that bypasses the intermediate SPPF nodes where priority/disambiguation lives - Fix load_paths for multi-hop chains: use reduction.rule.origin instead of next_titem.s (which carries the outer rule's origin, not the intermediate symbol), and merge paths sharing the same intermediate symbol into one canonical SymbolNode to preserve SPPF sharing - Pass start_symbol through in xearley.py - Add tests for right-recursive Leo and multi-start-symbol correctness
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1602 +/- ##
==========================================
+ Coverage 90.06% 90.46% +0.40%
==========================================
Files 52 52
Lines 8089 8169 +80
==========================================
+ Hits 7285 7390 +105
+ Misses 804 779 -25
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
@night199uk Does this seem right? It was vibe-coded, so I don't trust it much. But it does pass the tests. |
| if quasi.expect not in self.NULLABLE: | ||
| return False | ||
| if quasi.rule.origin == start_symbol and quasi.expect == start_symbol: | ||
| return False |
There was a problem hiding this comment.
This statement isn't covered by the test cases. So, either we are missing a test case that exercises this branch, or start_symbol isn't actually needed here.
(There are a couple other lines in the Joop Leo code that aren't covered, but this one is the most interesting)
There was a problem hiding this comment.
Even though one of the new test cases executes this branch, all the tests pass if this branch is removed, so I'm still not convinced it is necessary.
There was a problem hiding this comment.
Thanks. I guess I'll have to go a bit deeper into it myself, so I don't keep generating slop :)
| quasi = quasi.advance() | ||
| return True | ||
|
|
||
| def create_leo_transitives(origin, start): |
There was a problem hiding this comment.
The original implementation of this function contains code comments that would probably be helpful to retain.
There was a problem hiding this comment.
While we are touching the transitives code: it would be less confusing for item.rule.origin to be used here instead of item.s. The existing code is correct given that item.s == item.rule.origin when item.is_complete, but unexpected given the preceding if item.rule.origin in transitives[item.start] condition.
| items.append(new_item) | ||
| ###R Regular Earley completer | ||
| else: | ||
| # Empty has 0 length. If we complete an empty symbol in a particular |
There was a problem hiding this comment.
Note: None of the tests actually fail if this is not moved up
| # When a node already IS the intermediate symbol (from a direct Leo completion), | ||
| # use it as the canonical node so all other paths to the same symbol merge into it, | ||
| # preserving SPPF sharing and rule-order-based disambiguation. | ||
| canonical_vn = {} |
There was a problem hiding this comment.
This function contains the largest amount of new code and, unsurprisingly, is also the most questionable.
After playing around with it a while, I believe the logic is valid, but the implementation definitely reeks of LLM-generation.
I came up with a version that effectively does the same thing, but more concisely and in-line with the patterns of the existing earley code:
node_cache = {(node.s, node.start, node.end): node for (_, node) in self.paths}
for transitive, node in self.paths:
next_node = node
if transitive.next_titem is not None:
label = (transitive.next_titem.reduction.rule.origin, transitive.next_titem.reduction.start, self.end)
is_node_the_next_node = label == (node.s, node.start, node.end)
if not is_node_the_next_node:
next_node = node_cache[label] if label in node_cache else node_cache.setdefault(label, type(self)(*label))
next_node.add_path(transitive.next_titem, node)
self.add_family(transitive.reduction.rule.origin, transitive.reduction.rule, transitive.reduction.start, transitive.reduction.node, next_node)
self.paths_loaded = TrueThere was a problem hiding this comment.
That does look better, thanks! I updated the code accordingly.
As suggested by @chanicpanic, with small updates. Collapses the two-pass canonical_vn dance into one loop keyed on the full (s, start, end) SPPF identity. Paths whose node matches the expected intermediate label are reused as the canonical node; later paths to the same label merge via add_path and skip the redundant add_family.
1) Add Joop Leo regression tests for chain-extension corner cases Two new tests under the Joop Leo section in TestFullEarley: - Chain stops at non-nullable suffix: is_quasi_complete returns False when the outer rule has a non-nullable symbol after the recursive position. - Chain extends through nullable suffix: is_quasi_complete traverses trailing nullable non-terminals via the NULLABLE check + advance loop. These cover the previously-untested NULLABLE-suffix branch of is_quasi_complete and the break-at-non-nullable-suffix path of create_leo_transitives. 2) Add Joop Leo test for nullable-start self-reference guard Covers the is_quasi_complete branch that returns False when the chain encounters a state in start_symbol's rule expecting start_symbol itself as a nullable non-terminal. Without this guard, the Leo shortcut would treat the start rule as quasi-complete and produce an incorrect chain.
- is_quasi_complete's complete-item fast-path: caller in create_leo_transitives only passes originators with expect != None. - root_transitive's else branch: every titem is registered at transitives[col][previous], so the lookup always succeeds (the root titem self-resolves through it). - Leo + terminal scan branch: is_quasi_complete requires the post- recursion suffix to be all-nullable, so new_item.expect is never a terminal. The first two are Python-level defensive code; the third is residue from a more paper-faithful structure that the current titem representation makes redundant. Cycle detection (visited) and the empty-FIRST guard are retained as cheap protection against unusual grammars.
- Move held_completions update back into the regular completer branch. Empty (nullable) items can never trigger Leo (create_leo_transitives breaks at the NULLABLE check, so transitives never contains nullable origins as keys), so the unconditional placement was unnecessary. Reviewer's observation that no test failed without the move is correct. - Use item.rule.origin instead of item.s when looking up the transitive, matching the preceding `in transitives[item.start]` check. Both are equivalent for complete items, but item.rule.origin is less surprising. - Restore the original walk-backwards/walk-forwards comments in create_leo_transitives so the two phases are immediately legible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Updated the PR based on the review. (simplified, improved coverage) |
Leo was disabled when predict_and_complete was refactored into a method, breaking is_quasi_complete's access to start_symbol. This caused incorrect behavior with self-referential start symbols and multiple start symbols.