
Add extract_regex and validate_pattern primitives (#100, #101) #107

Merged
matthewhorridge merged 1 commit into main from feat/regex-primitives
May 13, 2026

Conversation


matthewhorridge commented May 13, 2026

Summary

Adds two regex-based primitives — extract_regex and validate_pattern — paired in one PR because they share infrastructure (flag whitelist, null-handling decorator, factory wiring) and the same test patterns. Closes #100 and #101.

extract_regex (#100)

Pulls a capture group out of a string. Supports positional or named groups, optional regex flags from a whitelisted set (IGNORECASE, MULTILINE, DOTALL), and the standard strict/default fallback pattern.

{
  "operation": "extract_regex",
  "expression": "MRN[:\\s]+([A-Z0-9-]+)",
  "group": 1,
  "strict": true
}

ExtractRegex(r"MRN[:\s]+([A-Z0-9-]+)").transform("Patient MRN: A12-99")
# => "A12-99"
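
For readers who want to see the group and strict/default behaviour without the library, here is a minimal sketch built only on the standard `re` module. `extract_group` is a hypothetical helper, not the `ExtractRegex` implementation, and it assumes the primitive searches anywhere in the input, which is what the MRN example above implies.

```python
import re


def extract_group(pattern, value, group=1, strict=True, default=None):
    """Hypothetical stand-in for the extraction behaviour described above.

    Assumption: the pattern is searched anywhere in the input, since the
    MRN example matches in the middle of the string.
    """
    match = re.search(pattern, value)
    if match is None:
        if strict:
            # Strict mode: a missing match is an error.
            raise ValueError(f"no match for {pattern!r} in {value!r}")
        # Non-strict mode: fall back to the caller-supplied default.
        return default
    # `group` may be a positional index (0 = whole match) or a group name.
    return match.group(group)


text = "Patient MRN: A12-99"
print(extract_group(r"MRN[:\s]+([A-Z0-9-]+)", text))                      # A12-99
print(extract_group(r"MRN[:\s]+(?P<mrn>[A-Z0-9-]+)", text, group="mrn"))  # A12-99
print(extract_group(r"MRN[:\s]+([A-Z0-9-]+)", text, group=0))             # MRN: A12-99
print(extract_group(r"MRN[:\s]+([A-Z0-9-]+)", "no id", strict=False, default="UNKNOWN"))  # UNKNOWN
```
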

validate_pattern (#101)

Asserts that a value matches a regex. On success it returns the original value; on failure it raises (strict) or returns the default (non-strict). A configurable mode picks the anchoring semantics:

  • match — anchored at start (default)
  • fullmatch — anchored both ends
  • search — match anywhere
{
  "operation": "validate_pattern",
  "expression": "^[A-Z]{2}\\d{6}$",
  "mode": "fullmatch",
  "strict": true
}
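
The three modes map directly onto the standard `re` methods of the same names. The sketch below uses only the standard library to show how the anchoring differs; `matching_modes` is a hypothetical helper for illustration, and the pattern is left unanchored on purpose so that the mode alone decides the outcome.

```python
import re

# Unanchored on purpose: the chosen mode supplies the anchoring.
PATTERN = re.compile(r"[A-Z]{2}\d{6}")

MODES = {
    "match": PATTERN.match,          # must match at the start of the value
    "fullmatch": PATTERN.fullmatch,  # must match the entire value
    "search": PATTERN.search,        # may match anywhere in the value
}


def matching_modes(value):
    """Return the names of the modes under which `value` passes."""
    return [name for name, fn in MODES.items() if fn(value) is not None]


print(matching_modes("AB123456"))       # ['match', 'fullmatch', 'search']
print(matching_modes("AB123456-copy"))  # ['match', 'search']
print(matching_modes("id AB123456"))    # ['search']
```
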

Design notes

  • Patterns compile eagerly at construction time so a malformed regex fails fast rather than per-row.
  • Flags are a whitelisted set (rejects unknown names like VERBOSE) so the serialized form stays portable and reviewable.
  • Both primitives wear @handle_null + @support_iterable, consistent with the other string primitives — nulls pass through unchanged, lists fan out per element.
  • validate_pattern deliberately reuses extract_regex's flag resolver to keep the two primitives in lockstep on the supported flag vocabulary.
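
A rough sketch of how the first two design notes could fit together, using only the standard `re` module. The names `ALLOWED_FLAGS`, `resolve_flags` and `CompiledPattern` are hypothetical and do not come from this PR; the error types are also assumptions.

```python
import re

# Hypothetical whitelist; the PR names these three flags as supported.
ALLOWED_FLAGS = {
    "IGNORECASE": re.IGNORECASE,
    "MULTILINE": re.MULTILINE,
    "DOTALL": re.DOTALL,
}


def resolve_flags(names):
    """Combine whitelisted flag names into one value; reject anything else."""
    combined = 0
    for name in names:
        if name not in ALLOWED_FLAGS:
            raise ValueError(f"unsupported regex flag: {name!r}")
        combined |= ALLOWED_FLAGS[name]
    return combined


class CompiledPattern:
    """Hypothetical shared base: compiles once, at construction time."""

    def __init__(self, expression, flags=()):
        self.expression = expression
        self.flags = tuple(flags)
        # A malformed expression raises re.error here, before any row is seen.
        self._regex = re.compile(expression, resolve_flags(self.flags))

    def search(self, value):
        return self._regex.search(value)


ok = CompiledPattern(r"mrn[:\s]+(\w+)", flags=["IGNORECASE"])
print(ok.search("MRN: A12").group(1))  # A12

for bad_args in [(r"(unclosed", ()), (r"\w+", ("VERBOSE",))]:
    try:
        CompiledPattern(*bad_args)
    except (re.error, ValueError) as exc:
        print(type(exc).__name__, exc)
```

Keeping the flags as a tuple of names, rather than as a combined integer, is one way to keep the serialized form readable; whether the PR stores them this way is not stated.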

Tests

31 new tests in tests/test_regex_primitives.py covering every bullet in the issues' test plans: basic / named-group / explicit-index extraction, full-match group 0, strict + non-strict failure paths, invalid group, iterable input, flags, invalid pattern, unknown flag, all three validate modes, serialization roundtrip with and without flags, null pass-through, and a HarmonizationRule chain that goes through serialization on both primitives.

Total: 162 tests pass (was 131).

Test plan

  • pytest — 162/162 pass
  • Spec items from each issue's V1 behaviour and test-plan sections are covered by new tests

🤖 Generated with Claude Code

Two regex-based primitives, paired in one change because they share
infrastructure (flag whitelist, null-handling decorator, factory wiring):

- `extract_regex` pulls a capture group out of a string. Supports
  positional or named groups, optional regex flags from a whitelist
  (IGNORECASE/MULTILINE/DOTALL), and the standard strict/default
  fallback pattern.
- `validate_pattern` asserts that a value matches a regex and returns
  it unchanged on success. Configurable mode (match/fullmatch/search)
  picks the anchoring semantics, with the same strict/default surface.

Both compile their patterns eagerly so a malformed regex fails at
construction time rather than per-row, and both go through @handle_null
so missing values pass through untouched.

Resolves #100, #101.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@matthewhorridge matthewhorridge merged commit 5448d75 into main May 13, 2026
1 check passed

Development

Successfully merging this pull request may close these issues.

Add extract_regex primitive for regex capture extraction
