Multi-source to single-target harmonization with MapEach primitive (#90) #104

Merged
matthewhorridge merged 1 commit into main from feat/multi-source-harmonization
May 13, 2026

@matthewhorridge
Contributor

Summary

Builds on #102 to deliver multi-source → single-target harmonization end-to-end. The rule model already accepts sources: List[str] and harmonize_dataset already reads multi-column inputs — this PR adds the missing piece (a per-element transform), verifies the full pipeline against real data, and documents the pattern in the demo.

  • New MapEach primitive: applies a nested chain of operations to each element of a list. Composes naturally with Reduce so a multi-source rule can do per-source casts/normalisations before a list-consuming step. Serialisation uses a nested operations array.
  • New primitives/factory.py: shared deserialize_operation() helper. Lets MapEach deserialise its children without importing HarmonizationRule (avoids a circular import). HarmonizationRule.from_serialization collapses to a one-line list comp.
  • Multi-source tests (tests/test_multi_source.py, 6 tests): rule serialisation roundtrip with multi-source + MapEach; harmonize_dataset with one-hot → enum and with MapEach + Reduce(sum); CLI execution of a multi-source rule; --on-missing skip and --on-missing error semantics when any source column is absent.
  • MapEach unit tests (4 added to tests/test_primitives_serialization.py): serialisation roundtrip; canonical MapEach + Reduce(one-hot) pipeline; rejects non-list input; empty-operations identity.
  • Demo: added one-hot flag columns (flag_baseline / flag_followup / flag_screening) to demo/harmonize_example/input.csv plus a multi-source rule producing a visit_phase column. The demo now yields visit_type_label (derived from the enum code) and visit_phase (derived from the one-hot flags) with matching values per row.

Canonical multi-source pipeline

{
  "sources": ["flag_baseline", "flag_followup", "flag_screening"],
  "target": "visit_phase",
  "operations": [
    {"operation": "reduce", "reduction": "one-hot"},
    {"operation": "cast", "source": "integer", "target": "text"},
    {"operation": "enum_to_enum", "mapping": {"0": "baseline", "1": "follow_up", "2": "screening"}}
  ]
}
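
To make the data flow through this rule concrete, here is a minimal standalone sketch in plain Python. It does not use the library's primitives: the function names are hypothetical, the flags are assumed to arrive as integers, and the assumption that `one-hot` reduces a list of flags to the index of the single set flag is inferred from the `"0"`/`"1"`/`"2"` keys of the mapping rather than confirmed by the implementation.

```python
from typing import Dict, List

# Mapping copied from the rule above; keys are the stringified flag index.
MAPPING: Dict[str, str] = {"0": "baseline", "1": "follow_up", "2": "screening"}


def reduce_one_hot(flags: List[int]) -> int:
    """Assumed semantics of Reduce(one-hot): index of the single set flag."""
    hot = [index for index, flag in enumerate(flags) if flag == 1]
    if len(hot) != 1:
        raise ValueError(f"expected exactly one set flag, got {flags}")
    return hot[0]


def cast_integer_to_text(value: int) -> str:
    """Stand-in for the cast step (integer to text)."""
    return str(value)


def enum_to_enum(value: str, mapping: Dict[str, str]) -> str:
    """Stand-in for the enum_to_enum step: look the code up in the mapping."""
    return mapping[value]


def visit_phase(flags: List[int]) -> str:
    """Apply the three steps in rule order to one row's flag values."""
    return enum_to_enum(cast_integer_to_text(reduce_one_hot(flags)), MAPPING)


# Flags are ordered as in "sources": baseline, followup, screening.
print(visit_phase([0, 1, 0]))  # follow_up
```

The cast step exists only because the mapping keys are strings while the reduced value is an integer; the rule stays a flat sequence of steps, each consuming the previous step's output.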

MapEach is used when each source value needs its own transform first — for example casting string CSV values to int before Reduce(sum):

{
  "operations": [
    {"operation": "map_each", "operations": [{"operation": "cast", "source": "text", "target": "integer"}]},
    {"operation": "reduce", "reduction": "sum"}
  ]
}
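
The per-element behaviour can be sketched as a small higher-order function. This is an illustration of the idea, not the library's `MapEach` class: `map_each` and `run_pipeline` are hypothetical names, operations are modelled as plain callables, and raising `TypeError` for non-list input is an assumption (the PR states only that non-list input is rejected).

```python
from typing import Any, Callable, List

Operation = Callable[[Any], Any]


def map_each(operations: List[Operation]) -> Operation:
    """Build a step that runs the nested chain on every element of a list."""

    def apply(values: Any) -> List[Any]:
        if not isinstance(values, list):
            # Assumed error type; the real primitive may raise something else.
            raise TypeError(f"map_each expects a list, got {type(values).__name__}")
        results = []
        for value in values:
            for operation in operations:
                value = operation(value)
            results.append(value)
        return results

    return apply


def run_pipeline(value: Any, operations: List[Operation]) -> Any:
    """Feed the value through each top-level step in order."""
    for operation in operations:
        value = operation(value)
    return value


# Cast each CSV string to int, then reduce the list with sum.
total = run_pipeline(["1", "0", "2"], [map_each([int]), sum])
print(total)  # 3
```

With an empty nested chain the step returns the elements unchanged, which matches the "empty-operations identity" case listed in the unit tests.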

Backwards compatibility

Purely additive — no breaking changes. Existing single-source rules and serialised payloads are unaffected.

Test plan

  • pytest — all 85 tests pass (10 new: 6 multi-source + 4 MapEach)
  • demo/harmonize_example/run_example.py runs end-to-end; visit_phase and visit_type_label agree row-for-row
  • Reviewer to sanity-check the MapEach serialisation shape (operation: "map_each" with nested operations: [...])

🤖 Generated with Claude Code

The structural refactor in #102 left rules able to declare `sources:
List[str]`, which makes multi-source harmonization possible in principle.
This change makes it real: a new `MapEach` primitive lets a per-element
transform run against each source value before a list-consuming step
like `Reduce`, so the canonical one-hot → enum pipeline works as a plain
sequence of existing primitives.

The deserialization match statement is moved to a `primitives.factory`
helper so MapEach (which holds nested operations) can reuse it without
importing back into HarmonizationRule.

Resolves #90.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@matthewhorridge matthewhorridge merged commit 4d45308 into main May 13, 2026
1 check passed