Replace RuleRegistry with RuleSet and unify rule model (#102) #103
Merged
Single-source rules become the one-element case of multi-source: a rule
declares `sources: List[str]` and a single `target`. The old nested
`{source: {target: rule}}` registry is replaced with a flat `RuleSet`
keyed on target, and `harmonize_dataset` now runs every rule in the set
rather than taking a separate `harmonization_pairs` list.
Resolves #102.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
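
A minimal sketch of the unified model described in the commit message, for readers who want to see the shape at a glance. The class bodies, the `func` field, the `add` method, and `harmonize_row` are hypothetical stand-ins written for illustration; the real `HarmonizationRule`, `RuleSet`, and `harmonize_dataset` in this PR operate on DataFrames and may differ in every detail.

```python
import warnings
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterator, List


@dataclass
class HarmonizationRule:
    """Hypothetical stand-in: one target fed by one or more source columns."""
    sources: List[str]
    target: str
    func: Callable[[Any], Any]  # assumption: the real primitives are not plain callables

    def transform(self, values: List[Any]) -> Any:
        # Single-source rules unwrap the singleton so the primitive sees a scalar;
        # multi-source rules pass the full list through.
        arg = values[0] if len(self.sources) == 1 else values
        return self.func(arg)


class RuleSet:
    """Hypothetical stand-in: flat collection with at most one rule per target."""

    def __init__(self) -> None:
        self._by_target: Dict[str, HarmonizationRule] = {}

    def add(self, rule: HarmonizationRule) -> None:
        # Warn and replace when a second rule claims the same target.
        if rule.target in self._by_target:
            warnings.warn(f"replacing existing rule for target {rule.target!r}")
        self._by_target[rule.target] = rule

    def __iter__(self) -> Iterator[HarmonizationRule]:
        return iter(self._by_target.values())


def harmonize_row(row: Dict[str, Any], rules: RuleSet) -> Dict[str, Any]:
    """Run every rule in the set against one record (a plain dict in this sketch)."""
    return {r.target: r.transform([row[s] for s in r.sources]) for r in rules}


rules = RuleSet()
rules.add(HarmonizationRule(["age_months"], "age_years", lambda m: m / 12))
rules.add(HarmonizationRule(["weight_kg", "height_m"], "bmi", lambda v: v[0] / v[1] ** 2))
result = harmonize_row({"age_months": 24, "weight_kg": 50.0, "height_m": 2.0}, rules)
```

The single-source rule receives a scalar (`24`) while the two-source rule receives a list, which is the one behavioral distinction the unified model keeps between the two cases.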
Summary
Refactor of the rule model and registry to prepare for multi-source harmonization (#90). Closes #102.
- `HarmonizationRule`: `source: str` becomes `sources: List[str]`. Single-source rules are the one-element case. `transform()` unwraps the singleton so existing primitives still receive a scalar; multi-source rules pass the full list through.
- `RuleSet` replaces `RuleRegistry`: flat list of rules keyed on target. One rule per target (warns and replaces on collision). JSON schema becomes a flat array. For migration, `RuleSet.load()` and `HarmonizationRule.from_serialization()` still accept the legacy nested `{source: {target: rule}}` / scalar `"source": "x"` formats.
- `harmonize_dataset` runs every rule in the `RuleSet` (no more `harmonization_pairs`). Source columns are read via `df[rule.sources].apply(axis=1)`.
- `--targets` becomes a load-time filter on the `RuleSet`.
- `--on-missing` checks every entry of `rule.sources`; a multi-source rule is skipped if any source is missing.
- `mode=pairs` / `harmonization_pairs` removed (per the additional design decision in the issue thread): the RPC always runs every rule in the rules file. To run a subset, construct a rules file containing only the desired targets. TS client types updated to match.
- `utils.transformations.replay()` rewritten to build a per-dataset `RuleSet`.
- `Rule`, `RuleStore`, and `RuleRegistry` aliases removed.

Breaking changes
- `harmonize_dataset(... harmonization_pairs=...)` signature changed; `harmonization_pairs` is gone.
- `harmonize` RPC drops `mode` and `pairs`.
- `Rule`, `RuleStore`, `RuleRegistry` (the class) symbols are removed; import `RuleSet` instead.

Notes
- `demo/integration.ipynb` still imports an even older `RuleStore` / `rule_store` that didn't exist on `main`; it was already stale and is left untouched.

Test plan
- `pytest`: all 75 tests pass, including new coverage for legacy schema compat and multi-source transform
- `demo/harmonize_example/run_example.py` runs end-to-end against the migrated flat-array rules file

🤖 Generated with Claude Code
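
For reviewers curious about what the legacy schema compatibility covered by the tests amounts to, here is a rough sketch of one way to fold the two legacy shapes (nested `{source: {target: rule}}` and scalar `"source": "x"`) into the flat array. The function names `normalize_rule` and `flatten_rules`, and the `op` / `by` fields in the sample data, are hypothetical; this is not the code behind `RuleSet.load()` or `HarmonizationRule.from_serialization()`.

```python
from typing import Any, Dict, List, Union

RuleDict = Dict[str, Any]


def normalize_rule(entry: RuleDict) -> RuleDict:
    """Rewrite a legacy scalar "source" key as a one-element "sources" list."""
    out = dict(entry)
    if "sources" not in out and "source" in out:
        out["sources"] = [out.pop("source")]
    return out


def flatten_rules(doc: Union[List[RuleDict], Dict[str, Dict[str, RuleDict]]]) -> List[RuleDict]:
    """Accept either the flat array or the legacy nested mapping; return a flat array."""
    if isinstance(doc, list):
        return [normalize_rule(entry) for entry in doc]
    flat: List[RuleDict] = []
    for source, by_target in doc.items():
        for target, body in by_target.items():
            entry = dict(body)
            # In the nested form, source and target live in the keys, not the body.
            entry.setdefault("target", target)
            if "sources" not in entry:
                entry.setdefault("source", source)
            flat.append(normalize_rule(entry))
    return flat


# Sample data with made-up rule fields ("op", "by") purely for illustration.
legacy_nested = {"age_months": {"age_years": {"op": "divide", "by": 12}}}
legacy_scalar = [{"source": "x", "target": "y"}]
from_nested = flatten_rules(legacy_nested)
from_scalar = flatten_rules(legacy_scalar)
```

Both inputs come out as flat lists whose entries carry a `sources` list and a `target`, so downstream code only has to handle one shape.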