Fix metadata export when config rows are non-consecutive #8270

Open
KirtiRamchandani wants to merge 1 commit into huggingface:main from KirtiRamchandani:fix/metadata-groupby-non-consecutive-configs

@KirtiRamchandani

Fixes #8269

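itertools.groupby() only groups consecutive rows that share a key. On unsorted exported parquet rows, earlier shards were therefore dropped when the same config reappeared later in the list. The fix sorts the rows by (config, split) before grouping.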
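The behaviour can be reproduced outside the library with a minimal sketch. This is not the code changed in this PR: the row fields (`config`, `split`, `url`), the file names, and the helper `collect_urls` are hypothetical, and the sketch assumes the shards are lost because a later group with the same key overwrites the earlier one in a dict.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical exported-parquet rows; field names and URLs are illustrative only.
rows = [
    {"config": "a", "split": "train", "url": "a/train/0000.parquet"},
    {"config": "b", "split": "train", "url": "b/train/0000.parquet"},
    {"config": "a", "split": "train", "url": "a/train/0001.parquet"},
]


def collect_urls(rows, sort_first):
    """Group shard URLs by (config, split), optionally sorting first."""
    key = itemgetter("config", "split")
    if sort_first:
        # sorted() is stable, so shards keep their relative order within a key.
        rows = sorted(rows, key=key)
    urls = {}
    for group_key, group in groupby(rows, key=key):
        # Assumed failure mode: a repeated key replaces the earlier entry.
        urls[group_key] = [row["url"] for row in group]
    return urls


unsorted_result = collect_urls(rows, sort_first=False)
sorted_result = collect_urls(rows, sort_first=True)
```

Without sorting, config `a` ends up with only its last shard, because `groupby` emits two separate groups for it. With sorting, both shards land in a single group.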

Regression test: test_non_consecutive_config_rows_are_merged_in_metadata_configs_from_exported_parquet_files

rootdir: /tmp/datasets
configfile: pyproject.toml
plugins: datadir-1.8.0, xdist-3.8.0, anyio-4.13.0, hypothesis-6.155.2
collecting ... collected 1 item

tests/test_metadata_util.py::test_non_consecutive_config_rows_are_merged_in_metadata_configs_from_exported_parquet_files PASSED

============================== 1 passed in 0.39s ===============================

Sort exported parquet rows before itertools.groupby so repeated config
names separated by other configs merge all shard URLs.

Development

Successfully merging this pull request may close these issues.

MetadataConfigs drops parquet shards when exported config rows are non-consecutive
