Goal/regex slice3c by cukas · Pull Request #433 · KERNlang/kern

cukas · 2026-06-14T20:24:37Z

What

Why

How

Checklist

…II classes, \Z/\A anchors, re.ASCII)

Replace the verbatim/flag-subset regexLit lowering with the certified emission-normalization on BOTH targets in lockstep, so a KERN regex literal matches byte-identically by construction on the certified core.

Transform (matches the .agon-goals/regex-slice1-oracle reference exactly):
- normalizeRegexClasses (SHARED, both targets): \d→[0-9], \w→[A-Za-z0-9_], \s→[ \t\n\r\f\v]. On TS \d/\w are match-no-ops (already ASCII) but \s narrows JS \s to drop Unicode whitespace (NBSP).
- lowerRegexAnchorsPython (PYTHON-ONLY, non-/m path): $→\Z, ^→\A so Python anchors match JS input-end/start (Python $ differs at a trailing newline). On the /m path anchors are kept + re.MULTILINE (line-based, like JS /m).
- re.ASCII injected on EVERY Python flag expression (load-bearing for \b and ASCII class semantics; JS without /u is already ASCII).

Shared module: new packages/core/src/codegen/regex-normalize.ts, exported from core's barrel and imported by the TS emitter (codegen-expression regexLit) and the Python emitter (codegen-body-python pyRegexPattern/pyRegexFlags). One shared class transform makes byte-identity true by construction. The crude string-replace transform is the deliberate, certified Slice-1 contract (parity-safe because identical on both targets); a tokenizing normalizer is a later hardening.

Parity-completeness finding: the route/portable Python path (packages/python/src/core/expr/index.ts lowerStringArgMethods) SKIPS regex args (!args[0].startsWith('/')) and never emits a regexLit through re.compile; it only handles string-literal separators. Confirmed it does NOT bypass pyRegexPattern, so no normalizer wiring is needed there. The .test/.match/.replace method paths reuse pyRegexPattern/pyRegexFlags and inherit normalization.

Discriminating emitter tests (regex-emission-slice1-python.test.ts) assert the exact TS pattern AND Python pattern+flags for each oracle killer row; verified each FAILS the named wrong-impls (naive_passthrough, no_anchor, no_re_ascii, raw_anchor_re_m). Updated native-handlers-python.test.ts to the new contract (it encoded the pre-Slice-1 emission). Oracle check.py: GREEN (parity 0/16, all 4 wrong-impls caught).

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)
Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>
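To illustrate the Python-side behavior this commit normalizes away, here is a minimal sketch. `normalize_regex_classes` is a hypothetical Python stand-in for the shared TypeScript `normalizeRegexClasses`, not the project's code; it only mirrors the crude string replace the message describes.

```python
import re

# Hypothetical stand-in for the shared class transform described above.
# Like the Slice-1 contract it is a plain string replace, not a tokenizer.
CLASS_MAP = {
    r"\d": "[0-9]",
    r"\w": "[A-Za-z0-9_]",
    r"\s": r"[ \t\n\r\f\v]",
}

def normalize_regex_classes(pattern: str) -> str:
    for shorthand, explicit in CLASS_MAP.items():
        pattern = pattern.replace(shorthand, explicit)
    return pattern

NBSP = "\u00a0"

# Python's default \s is Unicode-aware and matches NBSP ...
unicode_ws = re.search(r"\s", NBSP) is not None
# ... while the explicit ASCII class does not.
ascii_ws = re.search(normalize_regex_classes(r"\s"), NBSP) is not None

# Python `$` also matches just before a trailing newline; `\Z` does not.
dollar_hits = re.search(r"a$", "a\n", re.ASCII) is not None
z_hits = re.search(r"a\Z", "a\n", re.ASCII) is not None
```

Because the replace is purely textual, a shorthand already inside a bracket class would be rewritten too, which is the kind of case the message defers to a later tokenizing normalizer.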

… class-expansion, Set(B) fail-close)

Closes the non-ASCII /i gap Slice 1 left: a non-ASCII Set(A) letter under /i (e.g. /é/i) was emitted raw, so on Python (re.IGNORECASE | re.ASCII) it MISSED its fold partner É while node /é/i matched, a real cross-engine divergence.

- New frozen data module packages/core/src/codegen/regex-fold-table.ts (1073 fold classes / 2170 member chars, 11 fail-close chars), generated ONCE from the vendored probe seed by packages/core/scripts/gen-regex-fold-table.mjs and committed as pure data. NOT wired into pnpm build: node U16.0 and python U15.0.0 disagree on 42 fold classes, so a host-regenerated table would diverge; the table is frozen on purpose. Full Set(A) = the seed's 1050 size-2 pairs + the 23 size-3/4 classes the probe's compare_sizeN_expansion.py proved recovered byte-identically.
- expandRegexIFold(pattern, flags) in the shared regex-normalize.ts: under /i, class-expand each Set(A) letter into its explicit fold class (é → [Éé]) matched by pure codepoint membership (host-DB-independent); a Set(A) letter already inside a [...] set merges as bare members ([xé] → [xÉé], not nested); fail-close (thrown compile error, identical message on both targets) on the Set(B) length-changing residue (ß, ligatures, titlecase). The accept/expand/reject decision is purely lexical (scan vs the frozen table, never a host fold). The codepoint scan leaves a clean seam for a later astral slice.
- KEEP re.IGNORECASE | re.ASCII (do NOT drop /i): the ASCII letters in a mixed /aé/i keep folding, and re.ASCII is the load-bearing invariant that suppresses any Python re-fold of the explicit non-ASCII class members, which is empirically why KEEP-i is parity-safe.
- Wiring: TS regexLit and Python pyRegexPattern both call expandRegexIFold on the same class-normalized pattern (order: classes → fold-expand → python anchors). No regexLit emit path bypasses the shared normalizer.

Member order inside an emitted class is codepoint-ascending (match-irrelevant; deterministic).

New discriminating test packages/python/tests/regex-emission-slice-i-python.test.ts asserts exact TS+Python emission for the killer rows and the fail-close throw; each row fails a plausibly-wrong impl (verified by reverting). Staged oracle .agon-goals/regex-slice-i/oracle/check.py is GREEN (parity 0/16, certified==expect 16/16, all 8 wrong-impls caught). Slice 1 test unchanged (no regression). Conformance has no regex-/i fixtures, so the whole-corpus run is deferred to CI.

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)
Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>
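The class-expansion idea can be sketched as follows. This is an illustration under stated assumptions, not the real `expandRegexIFold`: `FOLD_CLASSES` and `FAIL_CLOSE` are hypothetical one-entry excerpts of the frozen table, the class tracking is deliberately naive, and ranges and backreferences are not handled.

```python
import re

# Hypothetical excerpt; the commit's frozen table holds 1073 fold classes.
FOLD_CLASSES = {"é": "Éé", "É": "Éé"}  # members in codepoint-ascending order
# Hypothetical excerpt of the Set(B) fail-close residue.
FAIL_CLOSE = {"ß"}

def expand_i_fold(pattern: str, flags: str) -> str:
    if "i" not in flags:
        return pattern
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])  # copy escapes verbatim
            i += 2
            continue
        if ch in FAIL_CLOSE:
            raise ValueError(f"regex /i fail-close on {ch!r}")
        if ch == "[" and not in_class:
            in_class = True
        elif ch == "]" and in_class:
            in_class = False
        if ch in FOLD_CLASSES:
            members = FOLD_CLASSES[ch]
            # inside a class merge bare members, otherwise wrap in a new class
            out.append(members if in_class else f"[{members}]")
        else:
            out.append(ch)
        i += 1
    return "".join(out)

FLAGS = re.IGNORECASE | re.ASCII
raw_hit = re.search("é", "É", FLAGS) is not None
expanded_hit = re.search(expand_i_fold("é", "i"), "É", FLAGS) is not None
```

The two probes at the end show the gap the commit closes: under `re.IGNORECASE | re.ASCII` the raw letter misses its uppercase partner, while the explicit class matches it.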

…oint under /i

Two empirically-verified (node v22.22.0 / python3 3.12.7) parity holes that the original Slice-/i oracle missed silently diverge, so the portable contract (identical on the certified subset, fail-close elsewhere) requires fail-closing them:

HOLE 1: non-ASCII backreference under /i. /(é)\1/i matches "Éé" on JS (JS /i folds the backreference's referent too), but the emitted `([Éé])\1` under re.IGNORECASE | re.ASCII does NOT fold the \1 referent → MISS on Python. Fix: a conservative lexical predicate in expandRegexIFold: any backreference token (\1–\9 numeric, or \k<name>) combined with ANY non-ASCII Set(A) letter present under /i → fail-close. ASCII-only backrefs (/(a)\1/i) and backrefs with no non-ASCII Set(A) letter still emit normally.

HOLE 2: Set(A) letter as a [...] range endpoint under /i. /[a-é]/i would expand to [a-Éé], silently changing the range a-é (U+0061..U+00E9) to a-É (U+0061..U+00C9) + literal é, dropping U+00CA..U+00E8 → divergence vs JS. Fix: detect a Set(A) letter adjacent to an unescaped range `-` (X-é or é-X) inside a class and fail-close instead of corrupting the range. A plain class MEMBER (/[xé]/i → [xÉé]) is NOT a range endpoint and still expands.

Both fail-closes throw a target-symmetric message (the TS and Python emitters share expandRegexIFold + regexIFoldFailMessage, now reason-discriminated), so the refusal is byte-identical across targets.

Oracle (.agon-goals/regex-slice-i): added 4 rows (2 fail-close killers + silent_backref/silent_range wrong-impls + 2 positive controls); check.py GREEN (20/20 parity, all 10 wrong-impls caught). In-repo test adds 7 rows incl. revert-check (each new fail-close test fails the pre-hardening code).

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)
Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>
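Both holes can be reproduced on the Python side with the standard `re` module alone. The snippet below is only a probe of the engine behavior the message describes, not part of the emitter; the probe character U+00D1 is chosen here for illustration.

```python
import re

FLAGS = re.IGNORECASE | re.ASCII

# HOLE 1: the explicit class accepts either case, but under re.ASCII the
# backreference does not case-fold non-ASCII letters.
same_case = re.fullmatch(r"([Éé])\1", "éé", FLAGS) is not None
mixed_case = re.fullmatch(r"([Éé])\1", "Éé", FLAGS) is not None

# HOLE 2: expanding a range endpoint rewrites the range itself.
# U+00D1 lies inside a-é (U+0061..U+00E9) but outside a-É (U+0061..U+00C9).
probe = "\u00d1"
in_original = re.fullmatch("[a-é]", probe) is not None
in_expanded = re.fullmatch("[a-Éé]", probe) is not None
```

Since neither case can be repaired by a purely lexical rewrite of the pattern, refusing to compile is the only option that keeps both targets in agreement.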

… to classDepth 0

Replaces the fragile per-`-` escape-adjacency heuristic in expandRegexIFold with whole-class SIMPLE/COMPLEX classification (scanCharClass + isComplexClassBody). A Set(A) letter inside a [...] class now expands ONLY when the class is SIMPLE (no backslash escape, no range `-`), else fail-closes (reason 'complexClass', was 'rangeEndpoint').

This closes the /[\\-é]/i SILENT-DIVERGENCE (old code misread the escaped-backslash chain as an escaped hyphen and expanded é, corrupting the real \..é range) and the /[[-é]-z]/i over-expand, and corrects the in-class backref false-positive by gating sawBackref to classDepth 0. Conservative (over-reject safe): /[\1é]/i now fail-closes. scanCharClass honors the literal-]-first member rule ([]], [^]]) so the class close index is correct.

Oracle (.agon-goals/regex-slice-i) + slice-i tests updated with the full red-team table; oracle GREEN (0/30 node!=py, every wrong-impl incl. silent_class caught).

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)
Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>
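A rough sketch of the two helpers named above, written in Python for illustration. The function names mirror `scanCharClass` and `isComplexClassBody`, but the bodies are assumptions reconstructed from the description; in particular, treating any hyphen as a range marker is a conservative simplification made here.

```python
def scan_char_class(pattern: str, open_idx: int) -> int:
    """Return the index of the `]` that closes the class opened at open_idx."""
    i = open_idx + 1
    if i < len(pattern) and pattern[i] == "^":
        i += 1  # skip the negation marker
    if i < len(pattern) and pattern[i] == "]":
        i += 1  # literal-]-first rule: a leading `]` is a member, not the close
    while i < len(pattern):
        if pattern[i] == "\\":
            i += 2  # an escaped character can never close the class
            continue
        if pattern[i] == "]":
            return i
        i += 1
    raise ValueError("unterminated character class")

def is_complex_class_body(body: str) -> bool:
    """COMPLEX means the body holds a backslash escape or a hyphen."""
    return "\\" in body or "-" in body

def class_body(pattern: str, open_idx: int) -> str:
    return pattern[open_idx + 1:scan_char_class(pattern, open_idx)]
```

Under this classification `[xé]` is SIMPLE and may expand, while `[a-é]` and `[\\-é]` are COMPLEX and would fail-close.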

…crash)

The Slice-1 lowerRegexAnchorsPython blindly replaceAll'd every ^/$, so a ^/$ inside a [...] class or escaped (\^/\$) was rewritten too: /[^a]/ -> [\Aa] and /[a$]/ -> [a\Z] both raise re.error: bad escape at re.compile (CRASH), and /a\^b/ -> a\Ab silently corrupted an escaped literal. Negated classes are extremely common, so this blocked the regex core.

Fix: a single escape-aware forward pass lowers ^->\A / $->\Z ONLY for a TRUE anchor (classDepth 0, unescaped). An in-class or escaped ^/$ is left verbatim. Reuses the literal-]-first-aware scanCharClass and the same escape/classDepth bookkeeping as expandRegexIFold. The /m path and the TS side are unchanged. Only lowerRegexAnchorsPython changes.

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)
Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>
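The escape-aware forward pass might look like the sketch below. It is a Python illustration of the described algorithm, not the project's TypeScript `lowerRegexAnchorsPython`; the class bookkeeping is inlined here rather than shared with a `scanCharClass` helper.

```python
import re

def lower_regex_anchors_python(pattern: str) -> str:
    out = []
    in_class = False
    i = 0
    n = len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\" and i + 1 < n:
            out.append(pattern[i:i + 2])  # escaped char is copied verbatim
            i += 2
            continue
        if in_class:
            if ch == "]":
                in_class = False
            out.append(ch)  # ^ and $ inside a class are plain members
        elif ch == "[":
            in_class = True
            out.append(ch)
            # copy a leading ^ and a literal-first ] without closing the class
            if i + 1 < n and pattern[i + 1] == "^":
                out.append("^")
                i += 1
            if i + 1 < n and pattern[i + 1] == "]":
                out.append("]")
                i += 1
        elif ch == "^":
            out.append("\\A")  # true anchor: input start
        elif ch == "$":
            out.append("\\Z")  # true anchor: input end
        else:
            out.append(ch)
        i += 1
    return "".join(out)

def compiles(pattern: str) -> bool:
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False
```

`compiles("[\\Aa]")` reproduces the crash from the old blind replace, while the output of the forward pass for `[^a]` compiles cleanly.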

… shape, .matchAll, .split, fail-close .test(/g)/.exec)

Certifies the regex METHOD result/iteration shapes are portable across the TS and Python targets, where JS RegExp methods and Python re genuinely differ in SHAPE/COUNT (not pattern). Strictly additive on the Slice-1//i pattern paths.

IN-CORE (byte/shape-identical both targets):
- .match(s) no /g: canonical {full,groups,index,named}|null shape on BOTH targets (D2, the load-bearing fix). Python today emitted a bare re.search Match OBJECT while JS .match returns an array-with-groups, a real divergence. TS adapts the native RegExpMatchArray inline; Python builds the dict via a _kern_regex_match helper (group(0)/groups()/start()/groupdict()), null-safe.
- .match(s) /g: [m.group(0) for m in finditer] or None: full matches only, NEVER re.findall (which returns tuples when >1 group). Promoted from fail-close.
- .matchAll(s) (/g): re.finditer shaped to [{full,groups,index},…], incl. empties.
- .replace count locked: no /g -> count=1 (FIRST), /g//.replaceAll -> count=0 (ALL).
- .split non-zero-width, no limit: re.split (capture-group inclusion is portable, D1).

FAIL-CLOSE (symmetric, byte-identical message both targets, lexical predicate):
- .test(/g) (stateful lastIndex), .exec (redirect to .matchAll, D4), .matchAll//.replaceAll without /g (JS TypeError), .split zero-width-capable or with a limit arg.

The .split zero-width gate uses a SYNTACTIC zero-width-capable predicate (isZeroWidthCapableRegex, in core, shared by both emitters; no host-engine probe, version-independent per the frozen-fold-table lesson). Red-teamed against node str.split vs python3 re.split over a 60-pattern battery: every diverging pattern fail-closes (0 leaks), every always-non-empty pattern stays in-core (0 over-reject), incl. adversarial \b/lookaround/x*a/(ab)*c/(foo)|(bar)? cases. Fail-close messages live in core (regex-normalize.ts) as single-source constants imported by both targets, so the refusal is observably symmetric.

Oracle .agon-goals/regex-slice3/oracle/check.py: GREEN (parity 0/17, all 5 wrong-impls caught). New discriminating test regex-method-slice3-python.test.ts (27 cases) asserts exact TS+Python emission for every killer row; revert-checked (.match-shape -> bare re.search fails 3 rows; .replace count=1 -> 0 fails 1 row). native-handlers test updated for the intentional .match(/g) in-core promotion (D2).

agon nero verdict: FLAWED@35%, driven by missing codebase context (critic guessed Python Match-object semantics and the JS .match(/g) spec); the oracle empirically refutes its load-bearing challenges (1/2/5). Challenges 3+4 (zero-width predicate undecidability/normalization) were the real risks and are addressed by the conservative SYNTACTIC predicate + 60-pattern red-team.

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)
Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>
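The canonical result shapes can be sketched on the Python side as below. `kern_regex_match` and `kern_regex_match_global` are hypothetical names modeled on the `_kern_regex_match` helper mentioned above; the actual emitted helper may differ, for example in whether `groups` is a list or a tuple.

```python
import re

def kern_regex_match(pattern: str, subject: str, flags: int = re.ASCII):
    """Non-/g .match: canonical {full, groups, index, named} dict or None."""
    m = re.search(pattern, subject, flags)
    if m is None:
        return None
    return {
        "full": m.group(0),
        "groups": list(m.groups()),  # unmatched groups come back as None
        "index": m.start(),
        "named": m.groupdict(),
    }

def kern_regex_match_global(pattern: str, subject: str, flags: int = re.ASCII):
    """/g .match: full matches only, or None when nothing matches."""
    hits = [m.group(0) for m in re.finditer(pattern, subject, flags)]
    return hits or None

# re.findall switches to tuples as soon as the pattern has more than one group,
# which is why the message rules it out for the /g shape.
findall_shape = re.findall("(a)(b)", "abab")

# .replace count: 1 replaces the first match only, 0 replaces all of them.
first_only = re.sub("a", "-", "aaa", count=1)
replace_all = re.sub("a", "-", "aaa", count=0)
```

Building the dict eagerly keeps the Python result a plain value, comparable field by field with the adapted JS match array.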

…match named-group + let-bound parity, ctx threading

Hardens Milestone C Slice 3 (regex match-set) against a 6-engine review that found 2 blocking escape-handling bugs in the zero-width `.split` predicate plus 2 cross-target parity gaps. All decisions re-red-teamed against node v22 str.split vs python3.12 re.split.

FIX 1 (BLOCKING): escape-robust isZeroWidthCapableRegex:
- Multi-char escapes (\xHH, \uHHHH, \u{..}, \cX, octal \0..) are now consumed as SINGLE atoms via scanZwEscape, so a following quantifier attaches to the right base. The old 1-char scan mis-attributed `*` (e.g. \x41* read as \x 4 1*), LEAKING zero-width-capable patterns in-core where the engines diverge.
- Any backreference (\1-\9, \k<name>) fail-closes conservatively (no group-nullability analysis): a nullable backref is zero-width-capable and a non-existent group makes re.split throw.
- Broadened to fail-close on every .split-UNSAFE escape via a portable-escape allowlist: escapes python re rejects but JS accepts (\cX, \u{..}, \p, identity-escape letters) and meaning-divergent ones (\A \Z \a). Re-red-teamed over a 79-pattern x 11-input battery: ZERO leaks (over-rejections of valid non-nullable backrefs are spec-sanctioned).

FIX 2 (parity): .match named-group undefined->null:
- The TS adapter normalized positional groups but copied named groups verbatim, leaving an unmatched optional named group `undefined` while Python's groupdict() returns None. Now maps each named value undefined->null so {full,groups,index,named} is shape-identical across targets.

FIX 3 (parity): let-bound regex divergence:
- Approach A (thread a regex-binding table into the TS body emitter + resolve the ident at the body level). Approach B (Python also emits raw) was REJECTED: Python's str has no .match, so 'both raw' would CRASH Python, not achieve parity. TS now resolves `let re = /…/; s.match(re)` to the bound literal and lowers through the SAME canonical adapter/fail-close as a direct literal, proven byte-identical to the direct-literal emission, with let-bound .split/.test(/g) fail-closing identically on both targets.

FIX 4 (parity): Math.match(/a/g) fail-close:
- On TS, applyStdlibLoweringTS already runs before the regex lowering and rejects `Math.match` (no code change needed). On Python the regex lowering ran FIRST and mis-claimed the namespace as the subject (broken …finditer("a", Math, …)); lowerRegexCallPython now defers stdlib-namespace receivers to applyStdlibLoweringPython, so both targets reject Math.match identically.

Oracle (.agon-goals/regex-slice3): run_js canonMatchObj mirrors the named-group normalization; run_py is_zero_width_capable mirrors the escape-robust predicate (backref + python-rejected-escape + empty-match); 5 new killer/failclose rows (nullable backref, \x41*, \u0041*, \cA*, \0*). check.py GREEN (0/22 parity violations, every wrong-impl caught).

Revert-check vs 7bd5b2d: the 4 multi-char/backref killers (\x41* \u0041* \cA* (a?)\1) LEAKED under the old predicate (returned in-core) and now fail-close; the old .match adapter produced named.b===undefined where Python had None.

Targeted (no heavy suites): pnpm build green; regex-method-slice3 + regex-emission-slice1 + regex-emission-slice-i + core native-handlers all pass; biome clean.

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)
Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>
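FIX 1 hinges on consuming a multi-character escape as one atom before looking for a quantifier. The sketch below illustrates that step only; `scan_zw_escape` is a hypothetical Python rendering of `scanZwEscape`, octal escapes are left out for brevity, and `star_follows_first_atom` is a toy check rather than the full zero-width predicate.

```python
import re

HEX = "0123456789abcdefABCDEF"

def _is_hex(text: str, width: int) -> bool:
    return len(text) == width and all(c in HEX for c in text)

def scan_zw_escape(pattern: str, i: int) -> int:
    """Return the index just past the escape whose backslash sits at i."""
    kind = pattern[i + 1:i + 2]
    if kind == "x" and _is_hex(pattern[i + 2:i + 4], 2):
        return i + 4                      # \xHH
    if kind == "u":
        if pattern[i + 2:i + 3] == "{":
            close = pattern.find("}", i + 3)
            if close != -1:
                return close + 1          # \u{...}
        elif _is_hex(pattern[i + 2:i + 6], 4):
            return i + 6                  # \uHHHH
    letter = pattern[i + 2:i + 3]
    if kind == "c" and letter.isascii() and letter.isalpha():
        return i + 3                      # \cX
    return min(i + 2, len(pattern))       # ordinary two-character escape

def star_follows_first_atom(pattern: str) -> bool:
    """Toy check: is the first atom directly quantified by `*`?"""
    end = scan_zw_escape(pattern, 0) if pattern.startswith("\\") else 1
    return pattern[end:end + 1] == "*"

# Why zero-width-capable splits are refused: Python 3.7+ splits on empty
# matches, whereas node's "axb".split(/x*/) yields ["a", "b"].
python_split = re.split("x*", "axb")
```

With a one-character scan, `\x41*` would be read as `\x`, `4`, `1*`, hiding the fact that the whole escape is starred; consuming four characters first makes the `*` visible on the correct atom.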

…drop fragile resolve-to-literal); string methods unchanged

The Slice-3b FIX 3 resolved a let-bound regex ident to its cached literal and lowered it canonically (TS via node-clone substitution, Python via resolveRegexExpr following the ident). That emitted a STALE pattern after a reassignment (let re=/a/; re=/b/; s.match(re) wrongly lowered /a/) and risked fail-closing common string methods.

Replace the RESOLUTION with DETECTION, symmetric across both targets:
- Keep the per-scope regex-binding TABLE as a DETECTOR only (distinguishes a let-bound REGEX from a let-bound STRING; removing it would wrongly fail-close s.match(stringVar)).
- New shared, single-source detector regexMethodRegexArgIdent + message REGEX_NONLITERAL_FAILCLOSE in regex-normalize.ts (core barrel re-export; Python imports from @kernlang/core), matching the existing slice-3 fail-close pattern.
- TS (body-ts.ts): restore emitValueTS to emitExpression(node); a recursive walk throws REGEX_NONLITERAL_FAILCLOSE when a regex method's regex position is an ident KNOWN to be regex-bound. No substitution, no node-cloning.
- Python (codegen-body-python.ts): resolveRegexExpr now resolves ONLY direct literals; lowerRegexCallPython throws the same shared message for a known regex-bound ident in the regex position.
- Reassign-invalidation (both targets): rebindRegexOnReassign updates the owning scope on assign: to a literal keeps it regex (still fail-closed), to anything else UNMARKS it. RHS is emitted before the rebind so re=s.match(re) is checked against the pre-reassign table.

Net contract (symmetric): direct regex literal = portable Slice-3 lowering; a variable KNOWN to hold a regex = symmetric FAIL-CLOSE both targets; a string/unknown variable = unchanged plain host method. No silent divergence, no stale-pattern wrong-lowering. FIX 1/2/4 untouched.

Tests: rewrote the slice3 let-bound describe block to the new contract (fail-close, string-stays-plain regression row, reassign-keeps-regex, reassign-to-nonregex-unmarks, direct-literal-unchanged, nested-position); updated native-handlers-python let-bound .test row. Oracle: added naive_bound_resolve wrong-impl (resolves to STALE literal) + bound_match fixture; check.py GREEN, every wrong-impl caught.

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)
Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>
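The detection-only contract can be modeled with a small table, sketched here for a single scope. `RegexBindings`, its method names, and the placeholder message text are all hypothetical; only the constant name `REGEX_NONLITERAL_FAILCLOSE` and the mark, keep, and unmark rules come from the message above.

```python
# Placeholder text: the real message lives in regex-normalize.ts.
REGEX_NONLITERAL_FAILCLOSE = "regex method called with a regex-bound variable"

class RegexBindings:
    """Single-scope detector: remembers WHICH names hold a regex, never the pattern."""

    def __init__(self) -> None:
        self._regex_names: set[str] = set()

    def declare(self, name: str, rhs_is_regex_literal: bool) -> None:
        self.rebind(name, rhs_is_regex_literal)

    def rebind(self, name: str, rhs_is_regex_literal: bool) -> None:
        # Reassign to a literal keeps the mark; anything else removes it.
        if rhs_is_regex_literal:
            self._regex_names.add(name)
        else:
            self._regex_names.discard(name)

    def check_regex_arg(self, ident: str) -> str:
        if ident in self._regex_names:
            raise ValueError(REGEX_NONLITERAL_FAILCLOSE)
        return "plain-host-method"  # string or unknown binding stays untouched

def is_fail_closed(table: RegexBindings, ident: str) -> bool:
    try:
        table.check_regex_arg(ident)
        return False
    except ValueError:
        return True
```

Because the table stores no pattern text, a reassignment cannot leave a stale literal behind; the worst outcome is a conservative refusal.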

…ied closures (TS/Python parity); exec arity guard

FIX A (cross-target divergence): the TS bound-regex fail-close walk (assertNoBoundRegexMethodTS in body-ts.ts) did not inspect lambda.bodyBlock. A bound-regex method inside a block-bodied arrow (let re=/.../; arr.map(x => { return s.match(re); })) emitted RAW s.match(re) on TS while the Python emitter re-parses block-closure expressions and FAIL-CLOSED the same construct, a silent divergence. The TS walk now descends into bodyBlock via the shared parseClosureBlockAst closure-AST path (reused, no new text scanner) and re-parses each call via the same parseExpr the Python lowerer uses, applying the SAME regexMethodRegexArgIdent + lookupRegexBinding detector. Both targets now fail-close symmetrically; a string-/unknown-bound ident stays plain on both; a direct literal still lowers canonically. Closure-param shadowing of an outer regex name is conservatively flagged on both (Python's lookupRegexBinding ignores shadowedSymbols too); over-reject is safe and symmetric.

FIX B (arity nit): RegExp.prototype.exec takes exactly 1 arg. regexMethodRegexArgIdent (regex-normalize.ts) now arity-guards exec the same way it guards test, so the shared detector's arity conditions mirror the lowering shapes exactly on BOTH targets. A non-canonical 2-arg rg.exec(s, 5) is no longer mis-detected as a bound-regex method and stays a plain host call.

Tests: regex-method-slice3-python.test.ts gains block-body rows (regex-bound -> fail-close both; string-bound -> plain both; direct literal -> canonical both) and an exec-arity row. Oracle check.py GREEN (the runtime result-shape harness is orthogonal to these compile-time codegen properties). Targeted green: regex-method-slice3, regex-emission-slice1, regex-emission-slice-i, core native-handlers.

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)
Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>

…pdate Slice-1 \s migrate test expectation

FIX 1: migrate-native-handlers stale \s assertion. Slice 1 normalizes \s to the ASCII whitespace class [ \t\n\r\f\v], so the migrated `value.replace(/\s+/g, " ")` handler now emits `value.replace(/[ \t\n\r\f\v]+/g, " ")`. Update the test's expected TS to match the intended Slice-1 emission (test-only; no production change).

FIX 2: exec must fail-close at ANY arity. re.Pattern in Python has NO .exec method, so a 2-arg rg.exec(s, 5) (silently ignored by JS) leaked to a plain host call and CRASHED Python at runtime while TS ran. Remove the exec arity condition in regexMethodRegexArgIdent so a bound-regex .exec is detected and fail-closes (redirect-to-matchAll) regardless of arg count. test keeps its 1-arg guard (it has the portable re.search analog). Verified rg.exec(s), rg.exec(s, 5), rg.exec() all fail-close symmetrically on both targets.

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)
Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>
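The arity rule after FIX 2 can be sketched as a small predicate. `is_bound_regex_receiver_call` is a hypothetical simplification of the receiver side of `regexMethodRegexArgIdent`; it covers only `exec` and `test`, the two methods discussed in this commit.

```python
import re

ANY_ARITY_METHODS = {"exec"}   # no Python analog, so refuse at every arity
ONE_ARG_METHODS = {"test"}     # portable via re.search, only in its 1-arg shape

def is_bound_regex_receiver_call(method: str, receiver: str,
                                 args: list[str], regex_names: set[str]) -> bool:
    """True when the call must fail-close because the receiver is regex-bound."""
    if receiver not in regex_names:
        return False  # string or unknown receiver stays a plain host call
    if method in ANY_ARITY_METHODS:
        return True
    if method in ONE_ARG_METHODS:
        return len(args) == 1
    return False

# The underlying reason: a compiled Python pattern exposes search, not exec.
pattern_has_exec = hasattr(re.compile("a"), "exec")
pattern_has_search = hasattr(re.compile("a"), "search")
```

Dropping the arity condition for `exec` trades a few over-rejections for the guarantee that no `.exec` call on a known regex reaches the Python runtime.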

Slice 3's block-bodied bound-regex fail-close added `import ts from 'typescript'` to body-ts.ts to walk the closure block for call expressions, making body-ts.js a 6th static `typescript` importer in the core barrel graph and tripping browser-spine-import-graph.test.ts ("static typescript edges are exactly the sanctioned set", expected 5, got 6).

Relocate the `ts.forEachChild` call-collection walk into closure-eligibility.ts (the Node-only module that already owns the TS AST + `parseClosureBlockAst`) as `collectClosureBlockCallTexts(raw): string[]`. body-ts.ts now consumes plain call source texts and imports no `typescript`; the AST walk stays quarantined where the pin already sanctions it. Behavior-preserving: same pre-order walk, same getText()/parseExpr/regexMethodRegexArgIdent detector, same first-match REGEX_NONLITERAL_FAILCLOSE throw.

Verified: browser-spine test back to the 5-element pin (8/8 green); all 41 slice-3 regex tests green incl. the block-bodied bound-regex fail-close / string-binding-stays-plain / literal-inside-block cases.

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)
Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>

KERN-Agon added 11 commits June 14, 2026 13:29

cukas wants to merge 11 commits into main from goal/regex-slice3c

cukas commented Jun 14, 2026

2 participants
