Fix fixed size binary #8281

Open
Alex-PLACET wants to merge 2 commits into huggingface:main from Alex-PLACET:fix/fixed_size_binary
@Alex-PLACET

The original issue is a crash in _arrow_to_datasets_dtype caused by missing support for fixed_size_binary:

Exception:    SplitsNotFoundError
Message:      The split names could not be parsed from the dataset config.
Traceback:    Traceback (most recent call last):
                File "/usr/local/lib/python3.12/site-packages/datasets/inspect.py", line 286, in get_dataset_config_info
                  for split_generator in builder._split_generators(
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
                File "/usr/local/lib/python3.12/site-packages/datasets/packaged_modules/parquet/parquet.py", line 118, in _split_generators
                  self.info.features = datasets.Features.from_arrow_schema(pq.read_schema(f))
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                File "/usr/local/lib/python3.12/site-packages/datasets/features/features.py", line 1858, in from_arrow_schema
                  else generate_from_arrow_type(field.type)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                File "/usr/local/lib/python3.12/site-packages/datasets/features/features.py", line 1518, in generate_from_arrow_type
                  return Value(dtype=_arrow_to_datasets_dtype(pa_type))
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                File "/usr/local/lib/python3.12/site-packages/datasets/features/features.py", line 121, in _arrow_to_datasets_dtype
                  raise ValueError(f"Arrow type {arrow_type} does not have a datasets dtype equivalent.")
              ValueError: Arrow type fixed_size_binary[3] does not have a datasets dtype equivalent.
              
              The above exception was the direct cause of the following exception:
              
              Traceback (most recent call last):
                File "/src/services/worker/src/worker/job_runners/config/split_names.py", line 65, in compute_split_names_from_streaming_response
                  for split in get_dataset_split_names(
                               ^^^^^^^^^^^^^^^^^^^^^^^^
                File "/usr/local/lib/python3.12/site-packages/datasets/inspect.py", line 340, in get_dataset_split_names
                  info = get_dataset_config_info(
                         ^^^^^^^^^^^^^^^^^^^^^^^^
                File "/usr/local/lib/python3.12/site-packages/datasets/inspect.py", line 291, in get_dataset_config_info
                  raise SplitsNotFoundError("The split names could not be parsed from the dataset config.") from err
              datasets.inspect.SplitsNotFoundError: The split names could not be parsed from the dataset config.

This pull request adds support for fixed-size binary.

…nversion

- Map fixed_size_binary to "fixed_size_binary[n]" and list to "list[<type>]" in _arrow_to_datasets_dtype
- Parse fixed_size_binary[...] in string_to_arrow and return pa.binary(byte_width)
- Add test coverage for fixed_size_binary roundtrip
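
For illustration, the string convention described in the bullets above can be sketched in plain Python. This is not the code from the PR: the helper names `fixed_size_binary_dtype` and `parse_fixed_size_binary_dtype` are hypothetical, and the sketch only assumes the `fixed_size_binary[n]` spelling shown in the traceback. In the library itself, the parsed width would presumably be handed to `pa.binary(byte_width)`, as the commit message states.

```python
import re

# Hypothetical helpers for illustration only; they are not part of `datasets`.
# They mirror the "fixed_size_binary[n]" dtype string named in the commit message.
_FIXED_SIZE_BINARY_RE = re.compile(r"^fixed_size_binary\[(\d+)\]$")


def fixed_size_binary_dtype(byte_width: int) -> str:
    """Format a byte width as a dtype string, e.g. 3 -> 'fixed_size_binary[3]'."""
    if byte_width < 0:
        raise ValueError(f"byte_width must be non-negative, got {byte_width}")
    return f"fixed_size_binary[{byte_width}]"


def parse_fixed_size_binary_dtype(dtype: str) -> int:
    """Extract the byte width from a 'fixed_size_binary[n]' dtype string."""
    match = _FIXED_SIZE_BINARY_RE.match(dtype)
    if match is None:
        raise ValueError(f"Not a fixed_size_binary dtype string: {dtype!r}")
    return int(match.group(1))


# Round trip: the width survives formatting and parsing.
dtype = fixed_size_binary_dtype(3)
width = parse_fixed_size_binary_dtype(dtype)
print(dtype, width)
```

Keeping the width inside the dtype string is what makes the round trip lossless: a plain "binary" dtype would drop the fixed width when the features are converted back to an Arrow schema.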
@Alex-PLACET Alex-PLACET marked this pull request as ready for review June 23, 2026 12:45
@Alex-PLACET

Author

Subset of #8258
Ping @lhoestq