Skip to content

Reject cyclic grammars instead of looping forever (#1585)#1607

Open
sarathfrancis90 wants to merge 1 commit into
lark-parser:masterfrom
sarathfrancis90:reject-cyclic-grammar
Open

Reject cyclic grammars instead of looping forever (#1585)#1607
sarathfrancis90 wants to merge 1 commit into
lark-parser:masterfrom
sarathfrancis90:reject-cyclic-grammar

Conversation

@sarathfrancis90

Copy link
Copy Markdown
Contributor

Fixes #1585.

As we discussed, a rule that can reach itself without consuming input — e.g. a: a b* where b* may be empty, which expands to a unit production a: a — has no base case, so LALR and CYK build a degenerate table and then spin forever at parse time. Earley tolerates it.

This catches it at construction instead: I build the "derives to a single symbol with everything else nullable" relation over the rules and look for a cycle; if a non-terminal can reach itself, a GrammarError is raised naming the rule. As you pointed out, this isn't specific to start — it fires for any rule (your x: "a" | x x* example included), and also for indirect cycles like a: b / b: a.

I scoped the check to the LALR and CYK analyzers — the ones that actually hang — and left Earley untouched, since it currently returns a parse for these grammars and I didn't want to change its behaviour without your call. If you'd rather prevent these grammars from compiling everywhere (your "even if they pass grammar detection" point), it's a small move to run the check in GrammarAnalyzer itself so Earley rejects them too — happy to do that here.

Left recursion, optional repetition (x*), and nullable chains are not cyclic and still build fine. Full suite passes (1107) and mypy is clean; added a regression test in test_grammar.py.

@erezsh

erezsh commented Jun 9, 2026

Copy link
Copy Markdown
Member

Thanks, I will take a look when I have the time.

Quick note: calculate_sets() is relatively expensive, so ideally we'd only call it once. Maybe this check is supposed to move to grammar_analysis? But I'm only speaking from memory.

A rule that can derive itself without consuming any input -- e.g. `a: a b*`
where `b*` may be empty, which expands to a unit production `a: a` -- has no
base case to terminate on. The LALR and CYK parsers build a degenerate table
for such a grammar and then loop forever at parse time; Earley tolerates it.

Detect this at construction time: a grammar is cyclic when some non-terminal
A derives A directly or through other symbols that are all nullable. Build the
"derives in one step to a single non-nullable symbol" relation over the rules
and look for a cycle; if one exists, raise a GrammarError naming the rule.

The check runs for the LALR and CYK analyzers (the ones that hang). Earley is
left as-is since it handles these grammars without looping.

Fixes lark-parser#1585
@sarathfrancis90 sarathfrancis90 force-pushed the reject-cyclic-grammar branch from aa2f436 to 77e5042 Compare June 9, 2026 14:51
@sarathfrancis90

Copy link
Copy Markdown
Contributor Author

Good call to check — I tried moving it into GrammarAnalyzer (which does compute NULLABLE once), but it also runs for Earley, and that breaks test_many_cycles plus a few other ambiguity tests: Earley deliberately handles cyclic grammars like start: a? | start start and resolves them, so we don't want to reject those. That's why the check sits on the LALR and CYK analyzers — the two that actually loop.

On the cost: LALR already reuses the NULLABLE that GrammarAnalyzer computes, so there's no extra calculate_sets() there. The only fresh call is in the CYK parser, which doesn't go through GrammarAnalyzer at all and so has nothing to reuse (and CYK is the rarely-used path). I added a short note in the code about why Earley is left out.

No rush on the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Infinite loop in lalr and cyk

2 participants