Replace RuleRegistry with RuleSet and unify rule model (#102) #103

Merged: matthewhorridge merged 1 commit into main from feat/ruleset-refactor on May 12, 2026

@matthewhorridge commented May 12, 2026

Summary

Refactor of the rule model and registry to prepare for multi-source harmonization (#90). Closes #102.

  • HarmonizationRule: source: str becomes sources: List[str]. Single-source rules are the one-element case. transform() unwraps the singleton so existing primitives still receive a scalar; multi-source rules pass the full list through.
  • RuleSet replaces RuleRegistry: flat list of rules keyed on target. One rule per target (warns and replaces on collision). JSON schema becomes a flat array. For migration, RuleSet.load() and HarmonizationRule.from_serialization() still accept the legacy nested {source: {target: rule}} / scalar "source": "x" formats.
  • harmonize_dataset runs every rule in the RuleSet (the harmonization_pairs argument is gone). Source columns are read row-wise via df[rule.sources].apply(..., axis=1).
  • CLI --targets becomes a load-time filter on the RuleSet. --on-missing checks every entry of rule.sources; a multi-source rule is skipped if any source is missing.
  • RPC API: mode=pairs/harmonization_pairs removed (per the additional design decision in the issue thread). The RPC now always runs every rule in the rules file. To run a subset, construct a rules file containing only the desired targets. TS client types updated to match.
  • utils.transformations.replay() rewritten to build a per-dataset RuleSet.
  • Demo rules files migrated to the flat array schema. Rule, RuleStore, and RuleRegistry aliases removed.
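
For reviewers who want the shape of the new model at a glance, here is a minimal sketch of the behaviour described above. It is not the code in this PR: the classes are simplified stand-ins that reuse the names `HarmonizationRule` and `RuleSet`, the `primitive` field and the `add()` method are assumptions made for illustration, and the column names are invented.

```python
import warnings
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterator, List


@dataclass
class HarmonizationRule:
    """Illustrative stand-in, not the project's actual class."""

    sources: List[str]
    target: str
    primitive: Callable[[Any], Any]  # hypothetical field: the transformation to apply

    def transform(self, values: List[Any]) -> Any:
        # Single-source rules unwrap the singleton so the primitive still
        # receives a scalar; multi-source rules pass the full list through.
        if len(self.sources) == 1:
            return self.primitive(values[0])
        return self.primitive(values)


class RuleSet:
    """Flat collection of rules keyed on target (one rule per target)."""

    def __init__(self) -> None:
        self._by_target: Dict[str, HarmonizationRule] = {}

    def add(self, rule: HarmonizationRule) -> None:  # hypothetical method name
        # On a target collision, warn and replace the existing rule.
        if rule.target in self._by_target:
            warnings.warn(f"Replacing existing rule for target {rule.target!r}")
        self._by_target[rule.target] = rule

    def __iter__(self) -> Iterator[HarmonizationRule]:
        return iter(self._by_target.values())

    def __len__(self) -> int:
        return len(self._by_target)


rules = RuleSet()
rules.add(HarmonizationRule(["age_months"], "age_years", lambda m: m / 12))
rules.add(HarmonizationRule(["weight_kg", "height_m"], "bmi", lambda v: v[0] / v[1] ** 2))

# Apply every rule to one record; each rule reads only its own sources.
row = {"age_months": 24, "weight_kg": 50.0, "height_m": 2.0}
harmonized = {r.target: r.transform([row[s] for s in r.sources]) for r in rules}
print(harmonized)
```

Keying on target is what makes "one rule per target" cheap to enforce: a second rule for `bmi` would trigger the warning and overwrite the first.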

Breaking changes

  • JSON rule files now use a flat array (legacy nested schema still loads).
  • harmonize_dataset(... harmonization_pairs=...) signature changed; harmonization_pairs is gone.
  • harmonize RPC drops mode and pairs.
  • Rule, RuleStore, RuleRegistry (the class) symbols are removed; import RuleSet instead.
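
Although the legacy nested schema still loads, anyone maintaining rule files by hand may want to convert them once. The sketch below shows one way such a conversion could look. `flatten_legacy_rules` is a hypothetical helper, not part of this PR, and the `operation` and `by` fields in the example are invented. Only the `source`, `sources`, and `target` keys come from the description above.

```python
import json
from typing import Any, Dict, List


def flatten_legacy_rules(data: Any) -> List[Dict[str, Any]]:
    """Convert legacy rule data to a flat array of rule objects.

    Accepts either the nested {source: {target: rule}} mapping or a list of
    rules, and normalises a scalar "source" to a one-element "sources" list.
    """
    if isinstance(data, dict):
        # Legacy nested form: source and target are carried by the keys.
        entries = [
            {**rule, "source": source, "target": target}
            for source, targets in data.items()
            for target, rule in targets.items()
        ]
    else:
        # Already a flat array: copy each rule so the input is not mutated.
        entries = [dict(rule) for rule in data]

    flat = []
    for entry in entries:
        if "sources" in entry:
            entry.pop("source", None)
        else:
            entry["sources"] = [entry.pop("source")]
        flat.append(entry)
    return flat


legacy = json.loads('{"age_months": {"age_years": {"operation": "divide", "by": 12}}}')
flat = flatten_legacy_rules(legacy)
print(json.dumps(flat, indent=2))
```

Running the helper on an already-flat file leaves it unchanged, so it should be safe to apply to a mixed set of rule files.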

Notes

  • demo/integration.ipynb still imports an even older RuleStore/rule_store that didn't exist on main; it was already stale and is left untouched.

Test plan

  • pytest — all 75 tests pass, including new coverage for legacy schema compat and multi-source transform
  • demo/harmonize_example/run_example.py runs end-to-end against the migrated flat-array rules file
  • Reviewer to confirm RPC breaking change is acceptable (was also discussed in the issue thread)

🤖 Generated with Claude Code

Single-source rules become the one-element case of multi-source: a rule
declares `sources: List[str]` and a single `target`. The old nested
`{source: {target: rule}}` registry is replaced with a flat `RuleSet`
keyed on target, and `harmonize_dataset` now runs every rule in the set
rather than taking a separate `harmonization_pairs` list.
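
As a rough illustration of "run every rule in the set", the sketch below applies each rule to every row and skips a rule when any of its sources is absent. It uses plain dicts instead of a DataFrame to stay dependency-free. `harmonize_rows` and the `(sources, target, fn)` tuple shape are assumptions for this example and are not the signature of `harmonize_dataset`.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical minimal rule shape: (sources, target, function over source values).
RuleSpec = Tuple[List[str], str, Callable[[List[Any]], Any]]


def harmonize_rows(
    rows: List[Dict[str, Any]], rules: List[RuleSpec]
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Apply every rule to every row; skip rules with any missing source."""
    # Assumes all rows share the same columns, as in a tabular dataset.
    columns = set(rows[0]) if rows else set()
    out: List[Dict[str, Any]] = [{} for _ in rows]
    skipped: List[str] = []

    for sources, target, fn in rules:
        # A multi-source rule is skipped if any one of its sources is missing.
        if any(s not in columns for s in sources):
            skipped.append(target)
            continue
        for row, result in zip(rows, out):
            result[target] = fn([row[s] for s in sources])
    return out, skipped


rows = [
    {"first": "Ada", "last": "Lovelace"},
    {"first": "Alan", "last": "Turing"},
]
rules: List[RuleSpec] = [
    (["first", "last"], "full_name", lambda v: " ".join(v)),
    (["last", "dob"], "label", lambda v: f"{v[0]} ({v[1]})"),  # "dob" is absent
]

out, skipped = harmonize_rows(rows, rules)
print(out)
print(skipped)
```

There is no pair list to pass in: the set of rules alone determines which targets are produced.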

Resolves #102.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@matthewhorridge merged commit a7a0e9c into main on May 12, 2026
1 check passed

Development

Successfully merging this pull request may close these issues.

Refactor: unify rule model and replace RuleRegistry with RuleSet
