Accept pandas Series/ExtensionArray for Data; lift pandas<3 cap #1469
Draft
rly wants to merge 2 commits into
…pat)
Pandas 3.0 makes PyArrow-backed strings the default for DataFrame string columns, so df['col'].values is now ArrowStringArray and constructing VectorData(data=...) fails type validation. Add pd.Series and pandas.api.extensions.ExtensionArray to the array_data macro and coerce to numpy at the Data construction boundary so every Data subclass picks up the fix without per-class changes. Reject pd.NA/NaN with an informative TypeError (HDF5 vlen-string writes already crash on these) and reject IntegerArray/BooleanArray/FloatingArray to avoid silent dtype widening on .to_numpy(). Lift the pandas<3 cap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
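A quick way to see which case applies on a given installation is to inspect what df['col'].values returns. This is only an illustrative check, not code from this PR; the column name and sample strings are made up, and the printed type depends on the installed pandas version.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

# Hypothetical one-column frame; the values are placeholders.
df = pd.DataFrame({"col": ["a", "b", "c"]})
values = df["col"].values

# Depending on the pandas version, this is either a numpy ndarray
# (object dtype) or a pandas ExtensionArray such as ArrowStringArray.
print(pd.__version__, type(values).__name__)
is_extension = isinstance(values, ExtensionArray)

# np.asarray yields an ndarray in both cases (no missing values here).
as_numpy = np.asarray(values)
```

If is_extension is True, code that validates strictly against np.ndarray will reject values, which is the failure mode the commit message describes.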
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff             @@
##              dev    #1469      +/-   ##
==========================================
+ Coverage   93.18%   93.20%   +0.01%
==========================================
  Files          41       41
  Lines       10176    10195      +19
  Branches     2103     2108       +5
==========================================
+ Hits         9483     9502      +19
  Misses        415      415
  Partials      278      278
Summary
- Accept pd.Series and pandas.api.extensions.ExtensionArray (incl. StringArray/ArrowStringArray) as data in Data and its subclasses, normalizing to numpy at the Data.__init__/Data.extend boundary so every subclass (VectorData, VectorIndex, ScratchData, ElementIdentifiers, …) picks up the fix without per-class changes.
- Reject pd.NA/NaN and pandas nullable numeric/boolean dtypes (IntegerArray, BooleanArray, FloatingArray) with informative TypeErrors: the former would crash at HDF5 vlen-string write time, the latter would silently widen on .to_numpy().
- Lift the pandas<3 cap from pyproject.toml.
Fixes #1384.
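The accept/reject rules above could be sketched roughly as follows. This is not the implementation in this PR: the function name coerce_pandas_data_sketch, the error messages, and the choice to pass plain numeric Series through without a missing-value check are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

# Nullable dtypes whose default .to_numpy() may not preserve the dtype.
_NULLABLE = (
    pd.arrays.IntegerArray,
    pd.arrays.BooleanArray,
    pd.arrays.FloatingArray,
)


def coerce_pandas_data_sketch(data):
    """Hypothetical helper: return a numpy array for pandas inputs."""
    if isinstance(data, pd.Series):
        # Assumption: numpy-backed numeric Series pass through unchanged.
        if isinstance(data.dtype, np.dtype) and data.dtype.kind in "biufc":
            return data.to_numpy()
        data = data.array  # the ExtensionArray backing the Series
    if not isinstance(data, ExtensionArray):
        return data  # lists, ndarrays, etc. are left untouched
    if isinstance(data, _NULLABLE):
        raise TypeError(
            "Nullable numeric/boolean data must be cast explicitly, "
            "e.g. with .to_numpy(dtype=...)."
        )
    if pd.isna(data).any():
        raise TypeError(
            "Data contains missing values (pd.NA/NaN); "
            "fill them with a sentinel first."
        )
    return data.to_numpy()
```

For example, coerce_pandas_data_sketch(pd.Series(["a", "b"], dtype="string")) returns a numpy array, while a Series of dtype "Int64" raises TypeError. Doing the conversion once, at the point where Data receives its input, is what lets subclasses inherit the behavior without their own checks.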
Why
Pandas 3.0 makes PyArrow-backed strings the default for all DataFrame string columns.
df['col'].values is now ArrowStringArray, so VectorData(name=..., data=df['col'].values) (and any other typical user pattern that hands HDMF a string column) now fails docval type validation. Centralizing the fix at the Data construction boundary means VectorData, add_unit, add_electrode, from_dataframe, etc. all keep working with no further changes.
Behavior
- ArrowStringArray, StringArray, pd.Series (any backing dtype), pd.Categorical → converted to np.ndarray silently.
- pd.NA or NaN → TypeError pointing at the missing-values cause and asking the user to fill with a sentinel.
- IntegerArray/BooleanArray/FloatingArray → TypeError asking the user to cast explicitly (.astype('int64').to_numpy() or .to_numpy(dtype=...)), since defaulting .to_numpy() would silently change the dtype.
Verification
- DynamicTable.from_dataframe(df=...) with pandas 3.0.2 default string columns works end-to-end.
Test plan
- coerce_pandas_data, covering StringArray, ArrowStringArray, plain numeric Series, Categorical, NA-bearing inputs, and nullable int/bool.
- VectorData, for both Series and df.values paths.
🤖 Generated with Claude Code
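The user-side fixes that the error messages point to (filling missing values with a sentinel, casting nullable dtypes explicitly) might look like the sketch below. The variable names, the sample values, and the empty-string sentinel are illustrative choices, not something this PR prescribes.

```python
import numpy as np
import pandas as pd

# Strings with a missing entry: fill with a sentinel before converting.
names = pd.Series(["a", None, "c"], dtype="string")
names_np = names.fillna("").to_numpy(dtype=object)

# Nullable integers without missing values: cast explicitly to int64.
ids = pd.Series([1, 2, 3], dtype="Int64")
ids_np = ids.astype("int64").to_numpy()
ids_np_alt = ids.array.to_numpy(dtype="int64")

# Nullable integers with a missing value: the default conversion does
# not produce int64 (the exact dtype depends on the pandas version),
# so state the target dtype and the missing-value marker explicitly.
scores = pd.array([1, None, 3], dtype="Int64")
default_np = scores.to_numpy()
explicit_np = scores.to_numpy(dtype="float64", na_value=np.nan)
```

Making the cast explicit keeps the on-disk dtype a deliberate decision rather than a side effect of the pandas version in use.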