Accept pandas Series/ExtensionArray for Data; lift pandas<3 cap #1469

Draft
rly wants to merge 2 commits into dev from fix/pandas-3-compat

rly commented May 4, 2026

Summary

  • Accept pd.Series and pandas.api.extensions.ExtensionArray (incl. StringArray/ArrowStringArray) as data in Data and its subclasses, normalizing to numpy at the Data.__init__/Data.extend boundary so every subclass (VectorData, VectorIndex, ScratchData, ElementIdentifiers, …) picks up the fix without per-class changes.
  • Reject pd.NA/NaN and pandas nullable numeric/boolean dtypes (IntegerArray, BooleanArray, FloatingArray) with informative TypeErrors — the former would crash at HDF5 vlen-string write time, the latter would silently widen on .to_numpy().
  • Lift the pandas<3 cap from pyproject.toml.
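
The silent widening mentioned in the second bullet can be illustrated with a small sketch. It uses only public pandas calls; the variable names are illustrative, and the exact dtype produced by the default conversion is an assumption that varies by pandas version (object in older releases, float64 in newer ones), so the example only relies on it no longer being int64.

```python
import numpy as np
import pandas as pd

# A nullable integer Series with one missing value.
nullable = pd.Series([1, 2, None], dtype="Int64")

# Default conversion: the result dtype depends on the pandas version
# (object or float64), but it is not int64 once a value is missing.
default = nullable.to_numpy()

# Explicit alternatives, where the caller chooses the target dtype.
as_float = nullable.to_numpy(dtype="float64", na_value=np.nan)
as_int = pd.Series([1, 2, 3], dtype="Int64").astype("int64").to_numpy()
```

Requiring the explicit form means the dtype that ends up on disk is one the caller picked rather than a side effect of the conversion.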

Fixes #1384.

Why

Pandas 3.0 makes a dedicated string dtype, PyArrow-backed when PyArrow is installed, the default for all DataFrame string columns. With PyArrow present, df['col'].values is now ArrowStringArray, so VectorData(name=..., data=df['col'].values) (and any other typical user pattern that hands HDMF a string column) now fails docval type validation. Centralizing the fix at the Data construction boundary means VectorData, add_unit, add_electrode, from_dataframe, etc. all keep working with no further changes.
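
A minimal sketch of the failure mode, assuming only public pandas and numpy calls. It opts in to the string dtype explicitly so it behaves the same whether or not that dtype is the default in the installed pandas version; it does not call HDMF, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

# Request the dedicated string dtype explicitly instead of relying on the
# version-dependent default.
df = pd.DataFrame({"col": pd.array(["a", "b", "c"], dtype="string")})

values = df["col"].values  # an ExtensionArray, not an np.ndarray
is_extension = isinstance(values, ExtensionArray)
is_ndarray = isinstance(values, np.ndarray)

# A validator that only accepts ndarray/list/tuple would reject `values`.
# Converting explicitly yields a plain object-dtype ndarray of str.
as_numpy = values.to_numpy(dtype=object)
```

On an HDMF release without this change, passing as_numpy instead of values is the kind of manual conversion that the PR aims to make unnecessary.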

Behavior

  • ArrowStringArray, StringArray, pd.Series (any backing dtype), pd.Categorical → converted to np.ndarray silently.
  • pandas input containing pd.NA or NaN → TypeError pointing at the missing-values cause and asking the user to fill with a sentinel.
  • IntegerArray/BooleanArray/FloatingArray → TypeError asking the user to cast explicitly (.astype('int64').to_numpy() or .to_numpy(dtype=...)), since defaulting .to_numpy() would silently change the dtype.
  • Non-pandas inputs are pass-through; no behavior change for existing callers.
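
The rules above can be sketched as a single function. This is a hypothetical stand-in, not the PR's actual implementation: the name coerce_pandas_sketch, the error messages, and the order of the checks are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

# Nullable numeric/boolean array types that the description says to reject.
_NULLABLE = (
    pd.arrays.IntegerArray,
    pd.arrays.BooleanArray,
    pd.arrays.FloatingArray,
)


def coerce_pandas_sketch(data):
    """Hypothetical sketch of the coercion rules listed above."""
    if not isinstance(data, (pd.Series, ExtensionArray)):
        return data  # non-pandas input passes through unchanged

    # Series.array exposes the backing ExtensionArray for any dtype.
    arr = data.array if isinstance(data, pd.Series) else data

    if isinstance(arr, _NULLABLE):
        raise TypeError(
            "pandas nullable numeric/boolean data must be cast explicitly, "
            "e.g. with .to_numpy(dtype=...)"
        )
    if arr.isna().any():
        raise TypeError(
            "pandas data contains missing values; fill them with a sentinel"
        )
    return np.asarray(arr)
```

For example, coerce_pandas_sketch(pd.Series([1, 2, 3])) returns a numpy array, while a plain list is returned as the same object.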

Verification

Test plan

  • Unit tests added for coerce_pandas_data covering StringArray, ArrowStringArray, plain numeric Series, Categorical, NA-bearing inputs, and nullable int/bool.
  • End-to-end test through VectorData for both Series and df.values paths.
  • Manual HDF5 roundtrip with pandas 3.0.
  • CI passes on Python 3.10–3.13 with pandas 1.4 (lower bound), pandas 2.x, and pandas 3.x.

🤖 Generated with Claude Code

…pat)

Pandas 3.0 makes PyArrow-backed strings the default for DataFrame string
columns, so df['col'].values is now ArrowStringArray and constructing
VectorData(data=...) fails type validation. Add pd.Series and
pandas.api.extensions.ExtensionArray to the array_data macro and coerce
to numpy at the Data construction boundary so every Data subclass picks
up the fix without per-class changes.

Reject pd.NA/NaN with an informative TypeError (HDF5 vlen-string writes
already crash on these) and reject IntegerArray/BooleanArray/FloatingArray
to avoid silent dtype widening on .to_numpy(). Lift the pandas<3 cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov Bot commented May 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.20%. Comparing base (0d61982) to head (2788270).

Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1469      +/-   ##
==========================================
+ Coverage   93.18%   93.20%   +0.01%     
==========================================
  Files          41       41              
  Lines       10176    10195      +19     
  Branches     2103     2108       +5     
==========================================
+ Hits         9483     9502      +19     
  Misses        415      415              
  Partials      278      278              

rly marked this pull request as draft May 4, 2026 19:29

Development

Successfully merging this pull request may close these issues.

Pandas 3.0 String Type Compatibility Breaking HDMF Data Ingestion
