Handle None/NaN/pd.NA in primitives (#25) #105

Merged: matthewhorridge merged 1 commit into main from feat/null-handling on May 13, 2026

@matthewhorridge

Summary

Fixes the long-standing crash when primitives encounter missing values. Closes #25.

Before:

```python
Scale(2.0).transform(None) # TypeError: unsupported operand type
df["weight_lbs"].apply(Scale(2.0).transform) # crashes mid-apply on any NaN
```

After: nulls pass through untouched.

```python
Scale(2.0).transform(None) # None
Scale(2.0).transform(float("nan")) # nan
Scale(2.0).transform(pd.NA) # <NA>
```

Design

  • `isnull(value)` helper in `primitives/base.py` — recognises `None`, float `NaN`, and `pd.NA`.
  • `@handle_null` decorator — for scalar primitives whose transform expects a non-null value. The decorated method is never called for a null input; the null is returned as-is. Pairs with `@support_iterable` (outer-to-inner): a list flows through `support_iterable` first, then each element through `handle_null` before reaching the real transform. Decorator order is documented in the docstring.
  • Applied to 12 scalar primitives: `Bin`, `Cast`, `ConvertDate`, `ConvertUnits`, `FormatNumber`, `NormalizeText`, `Offset`, `Round`, `Scale`, `Substitute`, `Threshold`, `Truncate`.
  • Not applied to `EnumToEnum` or `NormalizeBoolean` — they already model missing values via `strict`/`default` parameters and their semantics shouldn't change.
  • `Reduce` and `MapEach` reject null elements with a positional error message. The rationale is documented in their docstrings: a null inside a multi-source list usually means a source column is missing for that row, and silently dropping it is exactly the kind of quiet data corruption that #25 ("Primitives crash on None/NaN values - need null handling") warned about. The CLI's existing `--on-missing` policy continues to handle the whole-column case.
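
For readers who have not opened the diff, the helper and the decorator stacking described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `primitives/base.py`: the bodies of `isnull`, `handle_null`, and `support_iterable`, the `transform(self, value)` signature, and the toy `Scale` class are all guesses at the shape, and pandas is treated as optional so the sketch runs without it.

```python
import math
from functools import wraps

try:
    import pandas as pd
    _PD_NA = pd.NA
except ImportError:
    # Assumption for this sketch only: without pandas, nothing can match pd.NA.
    _PD_NA = object()


def isnull(value):
    """Return True for None, float NaN, and pd.NA (when pandas is installed)."""
    if value is None or value is _PD_NA:
        return True
    return isinstance(value, float) and math.isnan(value)


def handle_null(method):
    """Return a null input as-is; call the wrapped transform only for non-nulls."""
    @wraps(method)
    def wrapper(self, value):
        if isnull(value):
            return value
        return method(self, value)
    return wrapper


def support_iterable(method):
    """Apply the wrapped method element by element when given a list or tuple."""
    @wraps(method)
    def wrapper(self, value):
        if isinstance(value, (list, tuple)):
            return [method(self, item) for item in value]
        return method(self, value)
    return wrapper


class Scale:
    """Toy stand-in for the real primitive, used only to show decorator order."""

    def __init__(self, factor):
        self.factor = factor

    @support_iterable  # outer: splits a list into elements
    @handle_null       # inner: short-circuits each null element
    def transform(self, value):
        return value * self.factor


print(Scale(2.0).transform(None))          # None
print(Scale(2.0).transform([1, None, 3]))  # [2.0, None, 6.0]
```

With the order reversed, `handle_null` would receive the whole list, which is never null, and the null elements would reach the real transform; that is presumably why the docstring pins the order.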

Tests

`tests/test_null_handling.py` (46 new tests, all green):

  • `isnull` semantics (recognised forms, rejected non-nulls)
  • `@handle_null` only invokes the underlying transform for non-null values
  • Every decorated primitive passes `None`, `float('nan')`, and `pd.NA` through unchanged (parameterised)
  • `Scale` on a list containing nulls preserves the nulls in the output list
  • `Reduce` and `MapEach` raise `ValueError` with the offending index for null elements
  • End-to-end: `harmonize_dataset` on a pandas DataFrame containing `NaN` and `None` runs without crashing and yields nulls at the corresponding output positions
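
The "offending index" behaviour tested for `Reduce` and `MapEach` could look something like the sketch below. The names `reject_nulls` and `SumReduce` are hypothetical and exist only for illustration, the error wording is invented, and the null check is simplified to `None` and float NaN so the example has no pandas dependency.

```python
import math


def isnull(value):
    """Simplified null check for this sketch: None or float NaN (pd.NA omitted)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def reject_nulls(values, primitive_name):
    """Raise ValueError naming the position of the first null element."""
    for index, item in enumerate(values):
        if isnull(item):
            raise ValueError(
                f"{primitive_name}: null element at index {index}; "
                "a source column may be missing for this row"
            )


class SumReduce:
    """Hypothetical list-consuming primitive standing in for Reduce."""

    def transform(self, values):
        # Fail loudly instead of silently skipping nulls.
        reject_nulls(values, type(self).__name__)
        return sum(values)


print(SumReduce().transform([1, 2, 3]))  # 6

try:
    SumReduce().transform([1.0, None, 3.0])
except ValueError as exc:
    print(exc)
```

Silently skipping the null would instead return `4.0` for the second call, a plausible-looking value computed from a partial row, which is the outcome the positional error is meant to prevent.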

131 tests pass overall (was 85).

Backwards compatibility

Purely additive — non-null inputs behave exactly as before. The behaviour of `EnumToEnum` and `NormalizeBoolean` (which were already null-aware) is unchanged.

Test plan

  • `pytest` — 131/131 pass
  • `demo/harmonize_example/run_example.py` still runs end-to-end
  • Reviewer to confirm the "fail loudly on null in Reduce/MapEach" choice; the alternative would be to silently skip nulls

🤖 Generated with Claude Code

Most scalar primitives previously crashed when applied to a pandas
column with blank cells: `Scale(2.0).transform(None)` raised TypeError,
and the same operation on a DataFrame column containing NaN failed mid-
apply. Real CSV data always has missing values, so this was a sharp edge
for every user.

Introduce an `isnull()` helper and a `@handle_null` decorator. Apply the
decorator to the twelve scalar primitives that legitimately operate on
non-null values (Bin, Cast, ConvertDate, ConvertUnits, FormatNumber,
NormalizeText, Offset, Round, Scale, Substitute, Threshold, Truncate);
they now pass `None`, `float('nan')`, and `pd.NA` through unchanged.
EnumToEnum and NormalizeBoolean keep their existing strict/default
semantics since those already model missing values.

List-consuming primitives (Reduce, MapEach) intentionally reject null
elements with a positional error message rather than silently dropping
them, so that a multi-source rule producing a partial input surfaces as
a data-quality issue.

Resolves #25.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@matthewhorridge merged commit a202e36 into main on May 13, 2026
1 check passed