Multi-source to single-target harmonization with MapEach primitive (#90) #104
Merged
The structural refactor in #102 left rules able to declare `sources: List[str]`, which makes multi-source harmonization possible in principle. This change makes it real: a new `MapEach` primitive lets a per-element transform run against each source value before a list-consuming step like `Reduce`, so the canonical one-hot → enum pipeline works as a plain sequence of existing primitives. The deserialization match statement is moved to a `primitives.factory` helper so `MapEach` (which holds nested operations) can reuse it without importing back into `HarmonizationRule`.

Resolves #90.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
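A minimal sketch of how a `MapEach` step could compose with `Reduce`, and how a shared factory function lets nested operations be deserialized recursively. The class bodies, the `apply`/`to_dict` method names, and `run_chain` are assumptions made for illustration; only the primitive names, the payload shape, and `deserialize_operation` come from this PR. An `if` chain stands in here for the match statement mentioned above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Cast:
    """Hypothetical stand-in for the cast primitive."""
    source: str
    target: str

    def apply(self, value: Any) -> Any:
        converters = {"integer": int, "text": str}
        return converters[self.target](value)

    def to_dict(self) -> Dict[str, Any]:
        return {"operation": "cast", "source": self.source, "target": self.target}


@dataclass
class Reduce:
    """Hypothetical stand-in for the list-consuming reduce primitive."""
    reduction: str

    def apply(self, values: List[Any]) -> Any:
        if self.reduction == "sum":
            return sum(values)
        raise ValueError(f"unsupported reduction in this sketch: {self.reduction!r}")

    def to_dict(self) -> Dict[str, Any]:
        return {"operation": "reduce", "reduction": self.reduction}


@dataclass
class MapEach:
    """Applies a nested chain of operations to each element of a list."""
    operations: List[Any] = field(default_factory=list)

    def apply(self, values: Any) -> List[Any]:
        if not isinstance(values, list):
            raise TypeError("MapEach expects a list input")
        result = []
        for value in values:
            for op in self.operations:
                value = op.apply(value)
            result.append(value)
        return result

    def to_dict(self) -> Dict[str, Any]:
        return {
            "operation": "map_each",
            "operations": [op.to_dict() for op in self.operations],
        }


def deserialize_operation(payload: Dict[str, Any]) -> Any:
    """Factory shared by the rule and by MapEach, so neither imports the other."""
    kind = payload["operation"]
    if kind == "cast":
        return Cast(payload["source"], payload["target"])
    if kind == "reduce":
        return Reduce(payload["reduction"])
    if kind == "map_each":
        # Nested operations are rebuilt through the same factory.
        return MapEach([deserialize_operation(p) for p in payload["operations"]])
    raise ValueError(f"unknown operation: {kind!r}")


def run_chain(operations: List[Any], value: Any) -> Any:
    """Feed the output of each operation into the next one."""
    for op in operations:
        value = op.apply(value)
    return value


payload = [
    {"operation": "map_each",
     "operations": [{"operation": "cast", "source": "text", "target": "integer"}]},
    {"operation": "reduce", "reduction": "sum"},
]
chain = [deserialize_operation(p) for p in payload]
total = run_chain(chain, ["1", "0", "1"])  # string CSV values cast to int, then summed
print(total)  # 2
```

Because the factory lives apart from both the rule class and `MapEach`, the recursion for nested operations needs no import back into the rule module, which is the circular-import concern the commit message describes.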
Summary
Builds on #102 to deliver multi-source → single-target harmonization end-to-end. The rule model already accepts `sources: List[str]` and `harmonize_dataset` already reads multi-column inputs; this PR adds the missing piece (a per-element transform), verifies the full pipeline against real data, and documents the pattern in the demo.

- `MapEach` primitive: applies a nested chain of operations to each element of a list. Composes naturally with `Reduce` so a multi-source rule can do per-source casts/normalisations before a list-consuming step. Serialisation uses a nested `operations` array.
- `primitives/factory.py`: shared `deserialize_operation()` helper. Lets `MapEach` deserialise its children without importing `HarmonizationRule` (avoids a circular import). `HarmonizationRule.from_serialization` collapses to a one-line list comprehension.
- Multi-source tests (`tests/test_multi_source.py`, 6 tests): rule serialisation roundtrip with multi-source + `MapEach`; `harmonize_dataset` with one-hot → enum and with `MapEach` + `Reduce(sum)`; CLI execution of a multi-source rule; `--on-missing skip` and `--on-missing error` semantics when any source column is absent.
- `MapEach` unit tests (4 added to `tests/test_primitives_serialization.py`): serialisation roundtrip; canonical `MapEach` + `Reduce(one-hot)` pipeline; rejects non-list input; empty-operations identity.
- Demo: adds one-hot flag columns (`flag_baseline`/`flag_followup`/`flag_screening`) to `demo/harmonize_example/input.csv` plus a multi-source rule producing a `visit_phase` column. The demo now yields `visit_type_label` (derived from the enum code) and `visit_phase` (derived from the one-hot flags) with matching values per row.

Canonical multi-source pipeline

    {
      "sources": ["flag_baseline", "flag_followup", "flag_screening"],
      "target": "visit_phase",
      "operations": [
        {"operation": "reduce", "reduction": "one-hot"},
        {"operation": "cast", "source": "integer", "target": "text"},
        {"operation": "enum_to_enum", "mapping": {"0": "baseline", "1": "follow_up", "2": "screening"}}
      ]
    }

`MapEach` is used when each source value needs its own transform first, for example casting string CSV values to int before `Reduce(sum)`:

    {
      "operations": [
        {"operation": "map_each", "operations": [{"operation": "cast", "source": "text", "target": "integer"}]},
        {"operation": "reduce", "reduction": "sum"}
      ]
    }

Backwards compatibility
Purely additive — no breaking changes. Existing single-source rules and serialised payloads are unaffected.
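To make the canonical one-hot → enum rule from the summary concrete, the sketch below traces one row through its three steps by hand. It assumes the `one-hot` reduction returns the zero-based index of the single set flag, which is inferred from the `"0"`/`"1"`/`"2"` mapping keys rather than confirmed by the PR; the function names are hypothetical and are not the library's API.

```python
from typing import Dict, List

SOURCES = ["flag_baseline", "flag_followup", "flag_screening"]
MAPPING = {"0": "baseline", "1": "follow_up", "2": "screening"}


def reduce_one_hot(flags: List[int]) -> int:
    """Assumed meaning of the "one-hot" reduction: index of the single set flag."""
    hot = [i for i, flag in enumerate(flags) if flag == 1]
    if len(hot) != 1:
        raise ValueError(f"expected exactly one set flag, got {flags!r}")
    return hot[0]


def cast_integer_to_text(value: int) -> str:
    """Mirrors the cast step: integer -> text."""
    return str(value)


def enum_to_enum(value: str, mapping: Dict[str, str]) -> str:
    """Mirrors the enum_to_enum step: look the code up in the mapping."""
    return mapping[value]


def visit_phase(row: Dict[str, int]) -> str:
    """Collect the source columns in rule order, then apply the three steps."""
    flags = [row[name] for name in SOURCES]
    return enum_to_enum(cast_integer_to_text(reduce_one_hot(flags)), MAPPING)


row = {"flag_baseline": 0, "flag_followup": 1, "flag_screening": 0}
print(visit_phase(row))  # follow_up
```

The order of `sources` matters under this assumption, since the index produced by the reduction is what the mapping keys refer to.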
Test plan
- `pytest`: all 85 tests pass (10 new: 6 multi-source + 4 `MapEach`)
- `demo/harmonize_example/run_example.py` runs end-to-end; `visit_phase` and `visit_type_label` agree row-for-row
- `MapEach` serialisation shape (`operation: "map_each"` with nested `operations: [...]`)

🤖 Generated with Claude Code