Handle None/NaN/pd.NA in primitives (#25) by matthewhorridge · Pull Request #105 · bmir-radx/harmonization-framework

matthewhorridge · 2026-05-13T15:24:48Z

Summary

Fixes the long-standing crash when primitives encounter missing values. Closes #25.

Before:

```python
Scale(2.0).transform(None) # TypeError: unsupported operand type
df["weight_lbs"].apply(Scale(2.0).transform) # crashes mid-apply on any NaN
```

After: nulls pass through untouched.

```python
Scale(2.0).transform(None) # None
Scale(2.0).transform(float("nan")) # nan
Scale(2.0).transform(pd.NA) # <NA>
```

Design

`isnull(value)` helper in `primitives/base.py` — recognises `None`, float `NaN`, and `pd.NA`.
`@handle_null` decorator — for scalar primitives whose transform expects a non-null value. The decorated method is never called for a null input; the null is returned as-is. Pairs with `@support_iterable` (outer-to-inner): a list flows through `support_iterable` first, then each element through `handle_null` before reaching the real transform. Decorator order is documented in the docstring.
Applied to 12 scalar primitives: `Bin`, `Cast`, `ConvertDate`, `ConvertUnits`, `FormatNumber`, `NormalizeText`, `Offset`, `Round`, `Scale`, `Substitute`, `Threshold`, `Truncate`.
Not applied to `EnumToEnum` or `NormalizeBoolean` — they already model missing values via `strict`/`default` parameters and their semantics shouldn't change.
`Reduce` and `MapEach` reject null elements with a positional error message. The rationale is documented in their docstrings: a null inside a multi-source list usually means a source column is missing for that row, and silently dropping it is exactly the kind of quiet data corruption #25 warned about. The CLI's existing `--on-missing` policy continues to handle the whole-column case.
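
For readers who have not opened the diff, the decorator stacking described above can be sketched roughly as follows. This is an illustration written for this description, not the code in `primitives/base.py`: the bodies of `isnull`, `handle_null`, and `support_iterable` are assumptions, the `Scale` class is a stripped-down stand-in, and pandas is treated as optional so the sketch runs without it.

```python
import functools
import math

try:  # pandas is optional in this sketch; the real helper may import it directly
    import pandas as pd
    _PD_NA = pd.NA
except ImportError:
    _PD_NA = None


def isnull(value):
    """Return True for None, float NaN, and pd.NA (when pandas is installed)."""
    if value is None:
        return True
    if _PD_NA is not None and value is _PD_NA:
        return True
    return isinstance(value, float) and math.isnan(value)


def handle_null(transform):
    """Return null inputs as-is; call the wrapped transform otherwise."""
    @functools.wraps(transform)
    def wrapper(self, value):
        if isnull(value):
            return value
        return transform(self, value)
    return wrapper


def support_iterable(transform):
    """Apply the wrapped transform element-wise to lists and tuples."""
    @functools.wraps(transform)
    def wrapper(self, value):
        if isinstance(value, (list, tuple)):
            return [transform(self, v) for v in value]
        return transform(self, value)
    return wrapper


class Scale:
    """Stand-in for the real primitive: multiplies a value by a factor."""

    def __init__(self, factor):
        self.factor = factor

    @support_iterable  # outer: unpacks the list
    @handle_null       # inner: sees one element at a time
    def transform(self, value):
        return value * self.factor
```

With this ordering, `Scale(2.0).transform([1, None, 2.5])` returns `[2.0, None, 5.0]`: the list is unpacked first, and the null element never reaches the multiplication.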

Tests

`tests/test_null_handling.py` (46 new tests, all green):

`isnull` semantics (recognised forms, rejected non-nulls)
`@handle_null` only invokes the underlying transform for non-null values
Every decorated primitive passes `None`, `float('nan')`, and `pd.NA` through unchanged (parameterised)
`Scale` on a list containing nulls preserves the nulls in the output list
`Reduce` and `MapEach` raise `ValueError` with the offending index for null elements
End-to-end: `harmonize_dataset` on a pandas DataFrame containing `NaN` and `None` runs without crashing and yields nulls at the corresponding output positions
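
The fail-loudly behaviour that the `Reduce`/`MapEach` tests exercise might look something like the sketch below. The names `reject_nulls` and `reduce_sum` are hypothetical and exist only for this illustration, the error wording is invented, and the null check is simplified to `None` and float `NaN` so the example has no pandas dependency.

```python
import math


def isnull(value):
    """Simplified null check for this sketch: None or float NaN only."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def reject_nulls(values, primitive_name):
    """Raise ValueError naming the first null position in a list input."""
    for index, value in enumerate(values):
        if isnull(value):
            raise ValueError(
                f"{primitive_name}: null element at index {index}; "
                "list-consuming primitives do not accept missing values"
            )


def reduce_sum(values):
    """Hypothetical stand-in for a Reduce primitive that sums its inputs."""
    reject_nulls(values, "Reduce")
    return sum(values)
```

Calling `reduce_sum([1.0, None, 3.0])` raises a `ValueError` that mentions index 1. The alternative of skipping the null would quietly return `4.0`, which is the silent partial result this PR chooses to avoid.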

131 tests pass overall (was 85).

Backwards compatibility

Purely additive — non-null inputs behave exactly as before. The behaviour of `EnumToEnum` and `NormalizeBoolean` (which were already null-aware) is unchanged.

Test plan

`pytest` — 131/131 pass
`demo/harmonize_example/run_example.py` still runs end-to-end
Reviewer to confirm the "fail loudly on null in Reduce/MapEach" choice; the alternative would be to silently skip nulls

🤖 Generated with Claude Code

Most scalar primitives previously crashed when applied to a pandas column with blank cells: `Scale(2.0).transform(None)` raised TypeError, and the same operation on a DataFrame column containing NaN failed mid-apply. Real CSV data always has missing values, so this was a sharp edge for every user.

Introduce an `isnull()` helper and a `@handle_null` decorator. Apply the decorator to the twelve scalar primitives that legitimately operate on non-null values (Bin, Cast, ConvertDate, ConvertUnits, FormatNumber, NormalizeText, Offset, Round, Scale, Substitute, Threshold, Truncate); they now pass `None`, `float('nan')`, and `pd.NA` through unchanged. EnumToEnum and NormalizeBoolean keep their existing strict/default semantics since those already model missing values.

List-consuming primitives (Reduce, MapEach) intentionally reject null elements with a positional error message rather than silently dropping them, so that a multi-source rule producing a partial input surfaces as a data-quality issue.

Resolves #25.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

matthewhorridge merged commit a202e36 into main May 13, 2026
1 check passed

matthewhorridge merged 1 commit into `main` from `feat/null-handling`
