
Goal/regex slice3c#433

Open
cukas wants to merge 11 commits into main from goal/regex-slice3c

Conversation

@cukas cukas commented Jun 14, 2026

Contributor

What

Why

How

Checklist

  • tsc -b passes
  • pnpm test passes
  • pnpm test:kern passes
  • pnpm lint passes
  • kern review packages/ --recursive checked

KERN-Agon added 11 commits June 14, 2026 13:29
…II classes, \Z/\A anchors, re.ASCII)

Replace the verbatim/flag-subset regexLit lowering with the certified emission-
normalization on BOTH targets in lockstep, so a KERN regex literal matches
byte-identically by construction on the certified core.

Transform (matches the .agon-goals/regex-slice1-oracle reference exactly):
  - normalizeRegexClasses (SHARED, both targets): \d→[0-9], \w→[A-Za-z0-9_],
    \s→[ \t\n\r\f\v]. On TS \d/\w are match-no-ops (already ASCII) but
    \s narrows JS \s to drop Unicode whitespace (NBSP).
  - lowerRegexAnchorsPython (PYTHON-ONLY, non-/m path): $→\Z, ^→\A so Python
    anchors match JS input-end/start (Python $ differs at a trailing newline).
    On the /m path anchors are kept + re.MULTILINE (line-based, like JS /m).
  - re.ASCII injected on EVERY Python flag expression (load-bearing for \b and
    ASCII class semantics; JS without /u is already ASCII).
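
The three transforms can be illustrated in Python, the target whose engine semantics they compensate for. This is a minimal sketch rather than the emitter itself: `normalize_regex_classes` is a hypothetical stand-in for normalizeRegexClasses that uses the same crude string replace the commit describes, and the sample patterns are illustrative.

```python
import re

# Hypothetical stand-in for normalizeRegexClasses (crude string replace, as in Slice 1).
CLASS_MAP = {
    r"\d": "[0-9]",
    r"\w": "[A-Za-z0-9_]",
    r"\s": r"[ \t\n\r\f\v]",
}


def normalize_regex_classes(pattern: str) -> str:
    for shorthand, explicit in CLASS_MAP.items():
        pattern = pattern.replace(shorthand, explicit)
    return pattern


# \s in Python's default (Unicode) mode matches NBSP; the explicit ASCII class does not.
nbsp = "\u00a0"
unicode_ws_hit = re.search(r"\s", nbsp) is not None
ascii_ws_hit = re.search(normalize_regex_classes(r"\s"), nbsp) is not None

# Python's $ also matches just before a trailing newline; \Z matches only at the very end.
dollar_hit = re.search(r"a$", "a\n") is not None
z_hit = re.search(r"a\Z", "a\n") is not None

# Without re.ASCII, "a" and "é" are both word characters, so \b sees no boundary between them.
boundary_unicode = re.search(r"\bé", "aé") is not None
boundary_ascii = re.search(r"\bé", "aé", re.ASCII) is not None
```

Each pair of results differs, which is the divergence the class expansion, the \Z/\A lowering, and the re.ASCII flag are each meant to remove.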

Shared module: new packages/core/src/codegen/regex-normalize.ts, exported from
core's barrel and imported by the TS emitter (codegen-expression regexLit) and
the Python emitter (codegen-body-python pyRegexPattern/pyRegexFlags). One shared
class transform makes byte-identity true by construction. The crude string-
replace transform is the deliberate, certified Slice-1 contract (parity-safe
because identical on both targets); a tokenizing normalizer is a later hardening.

Parity-completeness finding: the route/portable Python path
(packages/python/src/core/expr/index.ts lowerStringArgMethods) SKIPS regex args
(!args[0].startsWith('/')) and never emits a regexLit through re.compile — it
only handles string-literal separators. Confirmed it does NOT bypass
pyRegexPattern, so no normalizer wiring is needed there. The .test/.match/.replace
method paths reuse pyRegexPattern/pyRegexFlags and inherit normalization.

Discriminating emitter tests (regex-emission-slice1-python.test.ts) assert the
exact TS pattern AND Python pattern+flags for each oracle killer row; verified
each FAILS the named wrong-impls (naive_passthrough, no_anchor, no_re_ascii,
raw_anchor_re_m). Updated native-handlers-python.test.ts to the new contract
(it encoded the pre-Slice-1 emission). Oracle check.py: GREEN (parity 0/16, all
4 wrong-impls caught).

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)

Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>
… class-expansion, Set(B) fail-close)

Closes the non-ASCII /i gap Slice 1 left: a non-ASCII Set(A) letter under /i (e.g. /é/i) was emitted raw, so on Python (re.IGNORECASE | re.ASCII) it MISSED its fold partner É while node /é/i matched — a real cross-engine divergence.

- New frozen data module packages/core/src/codegen/regex-fold-table.ts (1073 fold classes / 2170 member chars, 11 fail-close chars), generated ONCE from the vendored probe seed by packages/core/scripts/gen-regex-fold-table.mjs and committed as pure data. NOT wired into pnpm build: node U16.0 and python U15.0.0 disagree on 42 fold classes, so a host-regenerated table would diverge — the table is frozen on purpose. Full Set(A) = the seed's 1050 size-2 pairs + the 23 size-3/4 classes the probe's compare_sizeN_expansion.py proved recovered byte-identically.
- expandRegexIFold(pattern, flags) in the shared regex-normalize.ts: under /i, class-expand each Set(A) letter into its explicit fold class (é → [Éé]) matched by pure codepoint membership (host-DB-independent); a Set(A) letter already inside a [...] set merges as bare members ([xé] → [xÉé], not nested); fail-close (thrown compile error, identical message on both targets) on the Set(B) length-changing residue (ß, ligatures, titlecase). The accept/expand/reject decision is purely lexical (scan vs the frozen table, never a host fold). The codepoint scan leaves a clean seam for a later astral slice.
- KEEP re.IGNORECASE | re.ASCII (do NOT drop /i): the ASCII letters in a mixed /aé/i keep folding, and re.ASCII is the load-bearing invariant that suppresses any Python re-fold of the explicit non-ASCII class members — empirically why KEEP-i is parity-safe.
- Wiring: TS regexLit and Python pyRegexPattern both call expandRegexIFold on the same class-normalized pattern (order: classes → fold-expand → python anchors). No regexLit emit path bypasses the shared normalizer.
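
A rough Python sketch of the expansion described above, assuming a two-entry fold table and a one-entry fail-close set in place of the frozen 1073-class data module. `expand_i_fold`, `FOLD_CLASSES`, and `FAIL_CLOSE` are hypothetical names; the real expandRegexIFold lives in the shared TypeScript module and covers more cases.

```python
import re

# Illustrative stand-ins for the frozen table; members are codepoint-ascending.
FOLD_CLASSES = {"é": "Éé", "É": "Éé"}
FAIL_CLOSE = {"ß"}  # length-changing residue, refused rather than expanded


def expand_i_fold(pattern: str, flags: str) -> str:
    if "i" not in flags:
        return pattern
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])  # keep escapes verbatim
            i += 2
            continue
        if ch in FAIL_CLOSE:
            raise ValueError(f"regex /i fold not portable for {ch!r}")
        if ch == "[" and not in_class:
            in_class = True
            out.append(ch)
        elif ch == "]" and in_class:
            in_class = False
            out.append(ch)
        elif ch in FOLD_CLASSES:
            members = FOLD_CLASSES[ch]
            # Inside a set, merge as bare members; outside, wrap in a new class.
            out.append(members if in_class else "[" + members + "]")
        else:
            out.append(ch)
        i += 1
    return "".join(out)


FLAGS = re.IGNORECASE | re.ASCII
raw_hit = re.search("é", "É", FLAGS) is not None
expanded_hit = re.search(expand_i_fold("é", "i"), "É", FLAGS) is not None
```

Here `raw_hit` is false and `expanded_hit` is true: with re.ASCII set, Python's IGNORECASE folds ASCII letters only, so the explicit class is what recovers the É/é pairing.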

Member order inside an emitted class is codepoint-ascending (match-irrelevant; deterministic). New discriminating test packages/python/tests/regex-emission-slice-i-python.test.ts asserts exact TS+Python emission for the killer rows and the fail-close throw; each row fails a plausibly-wrong impl (verified by reverting). Staged oracle .agon-goals/regex-slice-i/oracle/check.py is GREEN (parity 0/16, certified==expect 16/16, all 8 wrong-impls caught). Slice 1 test unchanged (no regression). Conformance has no regex-/i fixtures, so the whole-corpus run is deferred to CI.

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)

Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>
…oint under /i

Two empirically verified (node v22.22.0 / python3 3.12.7) parity holes that the original
Slice-/i oracle missed diverge silently, so the portable contract (identical on the
certified subset, fail-close elsewhere) requires fail-closing them:

HOLE 1 — non-ASCII backreference under /i. /(é)\1/i matches "Éé" on JS (JS /i folds
the backreference's referent too), but the emitted `([Éé])\1` under re.IGNORECASE|
re.ASCII does NOT fold the \1 referent → MISS on Python. Fix: a conservative lexical
predicate in expandRegexIFold — any backreference token (\1–\9 numeric, or \k<name>)
combined with ANY non-ASCII Set(A) letter present under /i → fail-close. ASCII-only
backrefs (/(a)\1/i) and backrefs with no non-ASCII Set(A) letter still emit normally.

HOLE 2 — Set(A) letter as a [...] range endpoint under /i. /[a-é]/i would expand to
[a-Éé], silently changing the range a-é (U+0061..U+00E9) to a-É (U+0061..U+00C9) +
literal é, dropping U+00CA..U+00E8 → divergence vs JS. Fix: detect a Set(A) letter
adjacent to an unescaped range `-` (X-é or é-X) inside a class and fail-close instead
of corrupting the range. A plain class MEMBER (/[xé]/i → [xÉé]) is NOT a range endpoint
and still expands.
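
Both holes can be reproduced on the Python side with the standard `re` module. The snippet below is only a demonstration of the engine behaviour the two paragraphs describe; the variable names are illustrative and the JS side is not exercised here.

```python
import re

FLAGS = re.IGNORECASE | re.ASCII

# HOLE 1: the explicit class accepts either case, but under re.ASCII the
# backreference comparison folds ASCII letters only, so É followed by é misses.
backref_hit = re.search(r"([Éé])\1", "Éé", FLAGS) is not None
same_case_hit = re.search(r"([Éé])\1", "éé", FLAGS) is not None

# HOLE 2: Ð (U+00D0) lies inside a-é (U+0061..U+00E9) but outside
# a-É (U+0061..U+00C9) plus a literal é.
original_range_hit = re.search("[a-é]", "Ð", FLAGS) is not None
expanded_range_hit = re.search("[a-Éé]", "Ð", FLAGS) is not None
```

The mixed-case backreference misses while the same-case one matches, and the naively expanded range loses a character the original range contained, which is why both shapes are refused instead of emitted.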

Both fail-closes throw a target-symmetric message (the TS and Python emitters share
expandRegexIFold + regexIFoldFailMessage, now reason-discriminated), so the refusal is
byte-identical across targets. Oracle (.agon-goals/regex-slice-i): added 4 rows
(2 fail-close killers + silent_backref/silent_range wrong-impls + 2 positive controls);
check.py GREEN (20/20 parity, all 10 wrong-impls caught). In-repo test adds 7 rows incl.
revert-check (each new fail-close test fails the pre-hardening code).

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)

Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>
… to classDepth 0

Replaces the fragile per-`-` escape-adjacency heuristic in expandRegexIFold with whole-class SIMPLE/COMPLEX classification (scanCharClass + isComplexClassBody).

- A Set(A) letter inside a [...] class now expands ONLY when the class is SIMPLE (no backslash escape, no range `-`), else fail-closes (reason 'complexClass', was 'rangeEndpoint').
- Closes the /[\\-é]/i SILENT-DIVERGENCE (old code misread the escaped-backslash chain as an escaped hyphen and expanded é, corrupting the real \..é range) and the /[[-é]-z]/i over-expand.
- Corrects the in-class backref false-positive by gating sawBackref to classDepth 0. Conservative (over-reject safe): /[\1é]/i now fail-closes.
- scanCharClass honors the literal-]-first member rule ([]], [^]]) so the class close index is correct.

Oracle (.agon-goals/regex-slice-i) + slice-i tests updated with the full red-team table; oracle GREEN (0/30 node!=py, every wrong-impl incl. silent_class caught).
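
The whole-class classification can be sketched as follows. `scan_char_class` and `is_complex_class_body` are hypothetical Python counterparts of scanCharClass and isComplexClassBody; the sketch assumes that ANY hyphen in the body counts as a range (an over-reject, in keeping with the conservative stance above).

```python
def scan_char_class(pattern: str, start: int) -> int:
    """Return the index of the ']' closing the class that opens at pattern[start]."""
    i = start + 1
    if i < len(pattern) and pattern[i] == "^":
        i += 1  # negation marker
    if i < len(pattern) and pattern[i] == "]":
        i += 1  # a ']' in first position is a literal member
    while i < len(pattern):
        if pattern[i] == "\\":
            i += 2  # skip the escaped character
            continue
        if pattern[i] == "]":
            return i
        i += 1
    return -1  # unterminated class


def is_complex_class_body(body: str) -> bool:
    # Assumption: any backslash or hyphen makes the class COMPLEX.
    return "\\" in body or "-" in body


def class_is_simple(pattern: str, start: int) -> bool:
    close = scan_char_class(pattern, start)
    return close != -1 and not is_complex_class_body(pattern[start + 1:close])
```

Under this sketch `[xé]` is SIMPLE and would expand, while `[a-é]` and `[\\-é]` are COMPLEX and would fail-close.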

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)

Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>
…crash)

The Slice-1 lowerRegexAnchorsPython blindly replaceAll'd every ^/$, so a ^/$ inside a [...] class or escaped (\^/\$) was rewritten too: /[^a]/ -> [\Aa] and /[a$]/ -> [a\Z] both raise re.error: bad escape at re.compile (CRASH), and /a\^b/ -> a\Ab silently corrupted an escaped literal. Negated classes are extremely common, so this blocked the regex core.

Fix: a single escape-aware forward pass lowers ^->\A / $->\Z ONLY for a TRUE anchor (classDepth 0, unescaped). An in-class or escaped ^/$ is left verbatim. Reuses the literal-]-first-aware scanCharClass and the same escape/classDepth bookkeeping as expandRegexIFold. The /m path and the TS side are unchanged. Only lowerRegexAnchorsPython changes.
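
A minimal sketch of such an escape-aware forward pass, written in Python for illustration. `lower_anchors` is a hypothetical name, and the sketch tracks a single in-class flag rather than the shared classDepth bookkeeping the commit refers to.

```python
import re


def lower_anchors(pattern: str) -> str:
    out = []
    in_class = False
    i = 0
    n = len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\" and i + 1 < n:
            out.append(pattern[i:i + 2])  # escaped char (e.g. \^ or \$) stays verbatim
            i += 2
            continue
        if in_class:
            if ch == "]":
                in_class = False
            out.append(ch)
        elif ch == "[":
            in_class = True
            out.append(ch)
            if i + 1 < n and pattern[i + 1] == "^":
                out.append("^")  # negation marker, not an anchor
                i += 1
            if i + 1 < n and pattern[i + 1] == "]":
                out.append("]")  # literal ']' as first member
                i += 1
        elif ch == "^":
            out.append("\\A")
        elif ch == "$":
            out.append("\\Z")
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def compiles(pattern: str) -> bool:
    try:
        re.compile(pattern, re.ASCII)
        return True
    except re.error:
        return False
```

With this pass `[^a]$` becomes `[^a]\Z` and compiles, whereas the blind replacement's `[\Aa]` is rejected by re.compile.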

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)

Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>
… shape, .matchAll, .split, fail-close .test(/g)/.exec)

Certifies the regex METHOD result/iteration shapes are portable across the TS
and Python targets, where JS RegExp methods and Python re genuinely differ in
SHAPE/COUNT (not pattern). Strictly additive on the Slice-1//i pattern paths.

IN-CORE (byte/shape-identical both targets):
- .match(s) no /g: canonical {full,groups,index,named}|null shape on BOTH targets
  (D2, the load-bearing fix). Python previously emitted a bare re.search Match OBJECT
  while JS .match returns an array with groups: a real divergence. TS adapts the
  native RegExpMatchArray inline; Python builds the dict via a _kern_regex_match
  helper (group(0)/groups()/start()/groupdict()), null-safe.
- .match(s) /g: [m.group(0) for m in finditer] or None — full matches only, NEVER
  re.findall (which returns tuples when >1 group). Promoted from fail-close.
- .matchAll(s) (/g): re.finditer shaped to [{full,groups,index},…], incl. empties.
- .replace count locked: no /g -> count=1 (FIRST), /g//.replaceAll -> count=0 (ALL).
- .split non-zero-width, no limit: re.split (capture-group inclusion is portable, D1).
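
The Python side of the canonical shapes might look roughly like this. The function names are hypothetical (the commit only names a `_kern_regex_match` helper and the four accessors it uses), so treat this as a sketch of the shape rather than the emitted code.

```python
import re


def kern_regex_match(pattern: str, flags: int, subject: str):
    """.match without /g: canonical {full, groups, index, named} dict, or None."""
    m = re.search(pattern, subject, flags)
    if m is None:
        return None
    return {
        "full": m.group(0),
        "groups": list(m.groups()),
        "index": m.start(),
        "named": m.groupdict(),
    }


def kern_regex_match_global(pattern: str, flags: int, subject: str):
    """.match with /g: full matches only, or None when nothing matched."""
    full = [m.group(0) for m in re.finditer(pattern, subject, flags)]
    return full or None


# re.findall switches to tuples once the pattern has more than one group,
# which is why the /g path is built on finditer instead.
findall_shape = re.findall(r"(a)(b)", "abab")
finditer_shape = kern_regex_match_global(r"(a)(b)", re.ASCII, "abab")

# .replace count: count=1 replaces the first match, count=0 replaces all.
first_only = re.sub("a", "x", "aaa", count=1)
replace_all = re.sub("a", "x", "aaa", count=0)
```

For example, `kern_regex_match(r"(?P<y>[0-9]{4})-([0-9]{2})", re.ASCII, "on 2026-06")` yields a dict whose `full` is `"2026-06"` and whose `index` is 3.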

FAIL-CLOSE (symmetric, byte-identical message both targets, lexical predicate):
- .test(/g) (stateful lastIndex), .exec (redirect to .matchAll, D4),
  .matchAll//.replaceAll without /g (JS TypeError), .split zero-width-capable or
  with a limit arg.

The .split zero-width gate uses a SYNTACTIC zero-width-capable predicate
(isZeroWidthCapableRegex, in core, shared by both emitters — no host-engine probe,
version-independent per the frozen-fold-table lesson). Red-teamed against node
str.split vs python3 re.split over a 60-pattern battery: every diverging pattern
fail-closes (0 leaks), every always-non-empty pattern stays in-core (0 over-reject),
incl. adversarial \b/lookaround/x*a/(ab)*c/(foo)|(bar)? cases.
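
The divergence the gate protects against is easy to see from the Python side alone. The JavaScript results in the comments are stated from the language's documented split behaviour, not executed here, and the variable names are illustrative.

```python
import re

# Zero-width-capable pattern: Python (3.7+) splits on every empty match,
# including at the start and end of the input.
py_zero_width = re.split(r"x*", "foo")
# JS: "foo".split(/x*/) gives ["f", "o", "o"], with no leading/trailing "".

# Always-non-empty pattern: both engines agree, including capture-group inclusion.
py_plain = re.split(r",", "a,b,c")
py_captured = re.split(r"(,)", "a,b")
# JS: "a,b,c".split(/,/) gives ["a", "b", "c"]; "a,b".split(/(,)/) gives ["a", ",", "b"].
```

Only the first case differs in shape, so a predicate that refuses every pattern able to match the empty string keeps .split inside the portable subset.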

Fail-close messages live in core (regex-normalize.ts) as single-source constants
imported by both targets, so the refusal is observably symmetric.

Oracle .agon-goals/regex-slice3/oracle/check.py: GREEN (parity 0/17, all 5
wrong-impls caught). New discriminating test regex-method-slice3-python.test.ts
(27 cases) asserts exact TS+Python emission for every killer row; revert-checked
(.match-shape -> bare re.search fails 3 rows; .replace count=1 -> 0 fails 1 row).
native-handlers test updated for the intentional .match(/g) in-core promotion (D2).

agon nero verdict: FLAWED@35% — driven by missing codebase context (critic
guessed Python Match-object semantics and the JS .match(/g) spec); the oracle
empirically refutes its load-bearing challenges (1/2/5). Challenges 3+4 (zero-width
predicate undecidability/normalization) were the real risks and are addressed by the
conservative SYNTACTIC predicate + 60-pattern red-team.

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)

Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>
…match named-group + let-bound parity, ctx threading

Hardens Milestone C Slice 3 (regex match-set) against a 6-engine review that found 2 blocking escape-handling bugs in the zero-width `.split` predicate plus 2 cross-target parity gaps. All decisions re-red-teamed against node v22 str.split vs python3.12 re.split.

FIX 1 (BLOCKING) — escape-robust isZeroWidthCapableRegex:
- Multi-char escapes (\xHH, \uHHHH, \u{..}, \cX, octal \0..) are now consumed as SINGLE atoms via scanZwEscape, so a following quantifier attaches to the right base. The old 1-char scan mis-attributed `*` (e.g. \x41* read as \x 4 1*), LEAKING zero-width-capable patterns in-core where the engines diverge.
- Any backreference (\1-\9, \k<name>) fail-closes conservatively (no group-nullability analysis): a nullable backref is zero-width-capable and a non-existent group makes re.split throw.
- Broadened to fail-close on every .split-UNSAFE escape via a portable-escape allowlist: escapes python re rejects but JS accepts (\cX, \u{..}, \p, identity-escape letters) and meaning-divergent ones (\A \Z \a). Re-red-teamed over a 79-pattern x 11-input battery: ZERO leaks (over-rejections of valid non-nullable backrefs are spec-sanctioned).
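
Consuming a multi-character escape as one atom can be sketched like this. `scan_escape` and `escape_is_quantified` are hypothetical helpers loosely modelled on the scanZwEscape description; the escape grammar below is an assumption covering only the forms the commit lists, and it does not distinguish octal escapes from numeric backreferences.

```python
import re

# One alternative per escape form, longest forms first; the final "." is the
# single-character fallback (\d, \., and so on).
_ESCAPE = re.compile(
    r"\\(?:x[0-9A-Fa-f]{2}|u\{[0-9A-Fa-f]+\}|u[0-9A-Fa-f]{4}"
    r"|c[A-Za-z]|[0-7]{1,3}|k<[^>]*>|.)",
    re.DOTALL,
)


def scan_escape(pattern: str, i: int) -> int:
    """Return the index just past the escape that starts at pattern[i] == '\\'."""
    m = _ESCAPE.match(pattern, i)
    return m.end() if m else i + 1


def escape_is_quantified(pattern: str, i: int) -> bool:
    """True when a *, ? or { quantifier directly follows the whole escape atom."""
    end = scan_escape(pattern, i)
    return end < len(pattern) and pattern[end] in "*?{"
```

With the whole escape consumed first, the `*` in `\x41*` is attributed to the atom `\x41` instead of the trailing `1`, which is the mis-attribution the old one-character scan made.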

FIX 2 (parity) — .match named-group undefined->null:
- The TS adapter normalized positional groups but copied named groups verbatim, leaving an unmatched optional named group `undefined` while Python's groupdict() returns None. Now maps each named value undefined->null so {full,groups,index,named} is shape-identical across targets.

FIX 3 (parity) — let-bound regex divergence:
- Approach A was taken: thread a regex-binding table into the TS body emitter and resolve the ident at the body level. Approach B (Python also emits raw) was REJECTED: Python's str has no .match, so 'both raw' would CRASH Python, not achieve parity. TS now resolves `let re = /…/; s.match(re)` to the bound literal and lowers through the SAME canonical adapter/fail-close as a direct literal, proven byte-identical to the direct-literal emission, with let-bound .split/.test(/g) fail-closing identically on both targets.

FIX 4 (parity) — Math.match(/a/g) fail-close:
- On TS, applyStdlibLoweringTS already runs before the regex lowering and rejects `Math.match` (no code change needed). On Python the regex lowering ran FIRST and mis-claimed the namespace as the subject (broken …finditer("a", Math, …)); lowerRegexCallPython now defers stdlib-namespace receivers to applyStdlibLoweringPython, so both targets reject Math.match identically.

Oracle (.agon-goals/regex-slice3): run_js canonMatchObj mirrors the named-group normalization; run_py is_zero_width_capable mirrors the escape-robust predicate (backref + python-rejected-escape + empty-match); 5 new killer/failclose rows (nullable backref, \x41*, \u0041*, \cA*, \0*). check.py GREEN (0/22 parity violations, every wrong-impl caught).

Revert-check vs 7bd5b2d: the 4 multi-char/backref killers (\x41* \u0041* \cA* (a?)\1) LEAKED under the old predicate (returned in-core) and now fail-close; the old .match adapter produced named.b===undefined where Python had None.

Targeted (no heavy suites): pnpm build green; regex-method-slice3 + regex-emission-slice1 + regex-emission-slice-i + core native-handlers all pass; biome clean.

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)

Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>
…drop fragile resolve-to-literal); string methods unchanged

The Slice-3b FIX 3 resolved a let-bound regex ident to its cached literal and
lowered it canonically (TS via node-clone substitution, Python via resolveRegexExpr
following the ident). That emitted a STALE pattern after a reassignment
(let re=/a/; re=/b/; s.match(re) wrongly lowered /a/) and risked fail-closing common
string methods.

Replace the RESOLUTION with DETECTION, symmetric across both targets:
- Keep the per-scope regex-binding TABLE as a DETECTOR only (distinguishes a
  let-bound REGEX from a let-bound STRING — removing it would wrongly fail-close
  s.match(stringVar)).
- New shared, single-source detector regexMethodRegexArgIdent + message
  REGEX_NONLITERAL_FAILCLOSE in regex-normalize.ts (core barrel re-export; Python
  imports from @kernlang/core), matching the existing slice-3 fail-close pattern.
- TS (body-ts.ts): restore emitValueTS to emitExpression(node); a recursive walk
  throws REGEX_NONLITERAL_FAILCLOSE when a regex method's regex position is an
  ident KNOWN to be regex-bound. No substitution, no node-cloning.
- Python (codegen-body-python.ts): resolveRegexExpr now resolves ONLY direct
  literals; lowerRegexCallPython throws the same shared message for a known
  regex-bound ident in the regex position.
- Reassign-invalidation (both targets): rebindRegexOnReassign updates the owning
  scope on assign — to a literal keeps it regex (still fail-closed), to anything
  else UNMARKS it. RHS is emitted before the rebind so re=s.match(re) is checked
  against the pre-reassign table.
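
The detect-only contract can be modelled in a few lines. This is a deliberately simplified, single-scope sketch: `RegexBindings` and `classify_regex_arg` are hypothetical names, a regex literal is recognised by a leading `/`, and the real table is per-scope and lives in the emitters.

```python
class RegexBindings:
    """Tracks which identifiers are KNOWN to hold a regex (detector only)."""

    def __init__(self) -> None:
        self._regex_bound = set()

    def bind(self, name: str, rhs: str) -> None:
        # Used for both `let` and reassignment: a literal RHS marks the name,
        # anything else unmarks it.
        if rhs.startswith("/"):
            self._regex_bound.add(name)
        else:
            self._regex_bound.discard(name)

    def is_regex(self, name: str) -> bool:
        return name in self._regex_bound


def classify_regex_arg(arg: str, bindings: RegexBindings) -> str:
    if arg.startswith("/"):
        return "lower"       # direct literal: portable Slice-3 lowering
    if bindings.is_regex(arg):
        return "fail-close"  # known regex-bound ident: refuse on both targets
    return "plain"           # string or unknown ident: plain host method


table = RegexBindings()
table.bind("re", "/a/")
after_let = classify_regex_arg("re", table)
table.bind("re", "/b/")
after_regex_reassign = classify_regex_arg("re", table)
table.bind("re", '"x"')
after_string_reassign = classify_regex_arg("re", table)
```

Because nothing is ever resolved back to a literal, a reassignment cannot produce a stale pattern; the worst case is a refusal.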

Net contract (symmetric): direct regex literal = portable Slice-3 lowering; a
variable KNOWN to hold a regex = symmetric FAIL-CLOSE both targets; a string/unknown
variable = unchanged plain host method. No silent divergence, no stale-pattern
wrong-lowering. FIX 1/2/4 untouched.

Tests: rewrote the slice3 let-bound describe block to the new contract (fail-close,
string-stays-plain regression row, reassign-keeps-regex, reassign-to-nonregex-unmarks,
direct-literal-unchanged, nested-position); updated native-handlers-python let-bound
.test row. Oracle: added naive_bound_resolve wrong-impl (resolves to STALE literal)
+ bound_match fixture — check.py GREEN, every wrong-impl caught.

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)

Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>
…ied closures (TS/Python parity); exec arity guard

FIX A (cross-target divergence): the TS bound-regex fail-close walk (assertNoBoundRegexMethodTS in body-ts.ts) did not inspect lambda.bodyBlock. A bound-regex method inside a block-bodied arrow (let re=/.../; arr.map(x => { return s.match(re); })) emitted RAW s.match(re) on TS while the Python emitter re-parses block-closure expressions and FAIL-CLOSED the same construct — a silent divergence. The TS walk now descends into bodyBlock via the shared parseClosureBlockAst closure-AST path (reused, no new text scanner) and re-parses each call via the same parseExpr the Python lowerer uses, applying the SAME regexMethodRegexArgIdent + lookupRegexBinding detector. Both targets now fail-close symmetrically; a string-/unknown-bound ident stays plain on both; a direct literal still lowers canonically. Closure-param shadowing of an outer regex name is conservatively flagged on both (Python's lookupRegexBinding ignores shadowedSymbols too) — over-reject is safe and symmetric.

FIX B (arity nit): RegExp.prototype.exec takes exactly 1 arg. regexMethodRegexArgIdent (regex-normalize.ts) now arity-guards exec the same way it guards test, so the shared detector's arity conditions mirror the lowering shapes exactly on BOTH targets. A non-canonical 2-arg rg.exec(s, 5) is no longer mis-detected as a bound-regex method and stays a plain host call.

Tests: regex-method-slice3-python.test.ts gains block-body rows (regex-bound -> fail-close both; string-bound -> plain both; direct literal -> canonical both) and an exec-arity row. Oracle check.py GREEN (the runtime result-shape harness is orthogonal to these compile-time codegen properties). Targeted green: regex-method-slice3, regex-emission-slice1, regex-emission-slice-i, core native-handlers.

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)

Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>
…pdate Slice-1 \s migrate test expectation

FIX 1 — migrate-native-handlers stale \s assertion: Slice 1 normalizes \s to the ASCII whitespace class [ \t\n\r\f\v], so the migrated `value.replace(/\s+/g, " ")` handler now emits `value.replace(/[ \t\n\r\f\v]+/g, " ")`. Update the test's expected TS to match the intended Slice-1 emission (test-only; no production change).

FIX 2 — exec must fail-close at ANY arity: re.Pattern in Python has NO .exec method, so a 2-arg rg.exec(s, 5) (silently ignored by JS) leaked to a plain host call and CRASHED Python at runtime while TS ran. Remove the exec arity condition in regexMethodRegexArgIdent so a bound-regex .exec is detected and fail-closes (redirect-to-matchAll) regardless of arg count. test keeps its 1-arg guard (it has the portable re.search analog). Verified rg.exec(s), rg.exec(s, 5), rg.exec() all fail-close symmetrically on both targets.
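
The arity rule and the underlying reason can be sketched together. `is_bound_regex_method` is a hypothetical reduction of the arity conditions in regexMethodRegexArgIdent to the two methods discussed here; the `hasattr` checks use the real `re` module.

```python
import re


def is_bound_regex_method(method: str, argc: int) -> bool:
    if method == "exec":
        return True       # no portable analog: detected at ANY arity
    if method == "test":
        return argc == 1  # canonical 1-arg shape only
    return False


compiled = re.compile("a")
# A compiled Python pattern has search/match/finditer but nothing named exec,
# so a leaked rg.exec(...) call would raise AttributeError at runtime.
has_exec = hasattr(compiled, "exec")
has_search = hasattr(compiled, "search")
```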

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)

Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>
Slice 3's block-bodied bound-regex fail-close added `import ts from 'typescript'`
to body-ts.ts to walk the closure block for call expressions, making body-ts.js
a 6th static `typescript` importer in the core barrel graph — tripping
browser-spine-import-graph.test.ts ("static typescript edges are exactly the
sanctioned set", expected 5, got 6).

Relocate the `ts.forEachChild` call-collection walk into closure-eligibility.ts
(the Node-only module that already owns the TS AST + `parseClosureBlockAst`) as
`collectClosureBlockCallTexts(raw): string[]`. body-ts.ts now consumes plain call
source texts and imports no `typescript` — the AST walk stays quarantined where
the pin already sanctions it. Behavior-preserving: same pre-order walk, same
getText()/parseExpr/regexMethodRegexArgIdent detector, same first-match
REGEX_NONLITERAL_FAILCLOSE throw.

Verified: browser-spine test back to the 5-element pin (8/8 green); all 41
slice-3 regex tests green incl. the block-bodied bound-regex fail-close /
string-binding-stays-plain / literal-inside-block cases.

⚔️ Forged by [Agon](https://github.com/KERNlang/agon)

Co-Authored-By: agon (KERN) <292465531+KERN-Agon@users.noreply.github.com>

2 participants