Support pandas 3.0 #1745

Open
filippsatverily wants to merge 15 commits into cdisc-org:main from filippsatverily:filipps/pandas_3_upgrade
@filippsatverily commented May 28, 2026

Contributor

Summary

Upgrades cdisc-rules-engine to support pandas 3.0. Stacked on top of #1713 (relax dependency constraints) — please merge that first.

  • Drop the pandas <3.0 upper bound and add pytz (no longer a pandas transitive dep)
  • Replace applymap() with map() (removed in pandas 3.0)
  • Replace inplace=True mutation patterns (pandas 3.0 Copy-on-Write)
  • Handle pandas 3.0 default StringDtype in comparison operators
  • Handle extension arrays in DaskDataset.__setitem__
  • Remove unsupported dd.DataFrame type annotation in parquet_reader
  • Remove method= and downcast= kwargs removed in pandas 3.0
  • Replace Dask GroupBy .apply(set) path in Distinct operation
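A few of the migrations listed above can be sketched in isolation. This is a minimal illustration, not code from the PR: the sample frame and column names are hypothetical, and the `ffill()` line assumes the removed `method=` kwarg was being passed to `fillna`, which the summary does not state.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({"USUBJID": [" 001 ", "002"], "AESEV": ["mild", None]})

# DataFrame.map() replaces applymap() (map() exists since pandas 2.1).
# Fall back to applymap() so the sketch also runs on older versions.
elementwise = df.map if hasattr(df, "map") else df.applymap
stripped = elementwise(lambda v: v.strip() if isinstance(v, str) else v)

# Instead of fillna(..., inplace=True) on a selected column, which under
# Copy-on-Write does not reach the parent frame, assign the result back.
filled = stripped.copy()
filled["AESEV"] = filled["AESEV"].fillna("UNKNOWN")

# Assumed case: fillna(method="ffill") becomes the dedicated ffill() method.
forward_filled = pd.Series([1.0, None, 3.0]).ffill()
```

Reassigning the result instead of mutating in place behaves the same with or without Copy-on-Write, so the pattern works on both sides of the upgrade.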

Moves dependency constraints to pyproject.toml.
Makes requirements.txt a lockfile.
Fixes an incompatibility caused by click 8.3.0, which passes the default value as-is.
Fixes an incompatibility caused by pyreadstat 1.2.9, which changed original_variable_type from 'NULL' to None.
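One way to tolerate both pyreadstat behaviors is to treat the legacy 'NULL' string and None as the same "missing" marker. The helper name and the sample mapping below are hypothetical and only illustrate the idea; the PR's actual fix may look different.

```python
from typing import Optional


def has_original_type(value: Optional[str]) -> bool:
    """Return True when a real type string is present.

    Both the legacy 'NULL' string and None count as missing.
    """
    return value is not None and value != "NULL"


# Hypothetical stand-in for per-variable metadata read from a file.
metadata = {"AGE": "NULL", "SEX": None, "RFSTDTC": "DATETIME"}
typed = [name for name, t in metadata.items() if has_original_type(t)]
```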
Works around a behavior change in jsonpath-ng 1.8.0 where Child.str gets wrapped in parentheses.
Fixes tokenization errors when using dask 2024.8.1+. Starting with this version, dask enforces that tokens remain stable across pickle round-trips (dask/dask#11320). Capturing self in a lambda fails this check because instance objects can have non-deterministic pickle representations. Since calculate_variable_value_length is already a static method, replacing self with the class name is enough to remove the capture.
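The difference between the two lambda forms can be seen without dask by inspecting what each lambda closes over. This sketch uses a hypothetical class name; only `calculate_variable_value_length` comes from the description above, and its body here is a placeholder. It does not run dask's tokenizer; it only shows whether the instance ends up captured.

```python
class DatasetOperations:  # hypothetical class name
    @staticmethod
    def calculate_variable_value_length(value) -> int:
        # Placeholder body; the real method may differ.
        return len(str(value))

    def make_capturing(self):
        # The lambda closes over `self`, so the instance travels with it.
        return lambda v: self.calculate_variable_value_length(v)

    def make_non_capturing(self):
        # The class name is resolved at call time; no instance is captured.
        return lambda v: DatasetOperations.calculate_variable_value_length(v)


ops = DatasetOperations()
capturing = ops.make_capturing()
non_capturing = ops.make_non_capturing()

# Free variables show what each lambda closes over.
captures_self = "self" in capturing.__code__.co_freevars
avoids_self = "self" not in non_capturing.__code__.co_freevars
```

Both lambdas return the same result for the same input; only the second can be serialized without dragging the instance along.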
Dask 2025.4.0 optimizes multiple DataFrames together, which exposes division mismatches and causes dask to raise an error. This change removes a source of repartitioning, preserving the divisions when assigning a pandas Series to a dask DataFrame.
Fixes a unit test to support pandas 2.2.0+. That pandas release fixes a sorting bug (pandas-dev/pandas#54611), so this commit changes the expected results accordingly.

@RamilCDISC

Collaborator

@filippsatverily Could you please resolve the conflicts on the branch so we can move towards validation and merge?

Comment thread requirements.txt
Pympler==1.1
pyreadstat==1.2.7
python-dotenv==1.0.0
pytz==2026.2

Collaborator


@filippsatverily was this dependency declaration intentional?

@SFJohnson24

Collaborator

@filippsatverily we have merged your other PR with some tweaks; we are now using a pyproject.toml for dependency installation. I suspect this PR will need to be reworked given those changes, so I have moved it back to In-Progress. I am happy to re-review once this is ready.

4 participants