Add the supports of missing types in conversion between arrow and dtype types #8258

Open
Alex-PLACET wants to merge 2 commits into huggingface:main from Alex-PLACET:support_of_fixed_size_array

@Alex-PLACET

The original issue is a crash in _arrow_to_datasets_dtype caused by missing support for fixed_size_binary:

Exception:    SplitsNotFoundError
Message:      The split names could not be parsed from the dataset config.
Traceback:    Traceback (most recent call last):
                File "/usr/local/lib/python3.12/site-packages/datasets/inspect.py", line 286, in get_dataset_config_info
                  for split_generator in builder._split_generators(
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
                File "/usr/local/lib/python3.12/site-packages/datasets/packaged_modules/parquet/parquet.py", line 118, in _split_generators
                  self.info.features = datasets.Features.from_arrow_schema(pq.read_schema(f))
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                File "/usr/local/lib/python3.12/site-packages/datasets/features/features.py", line 1858, in from_arrow_schema
                  else generate_from_arrow_type(field.type)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                File "/usr/local/lib/python3.12/site-packages/datasets/features/features.py", line 1518, in generate_from_arrow_type
                  return Value(dtype=_arrow_to_datasets_dtype(pa_type))
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                File "/usr/local/lib/python3.12/site-packages/datasets/features/features.py", line 121, in _arrow_to_datasets_dtype
                  raise ValueError(f"Arrow type {arrow_type} does not have a datasets dtype equivalent.")
              ValueError: Arrow type fixed_size_binary[3] does not have a datasets dtype equivalent.
              
              The above exception was the direct cause of the following exception:
              
              Traceback (most recent call last):
                File "/src/services/worker/src/worker/job_runners/config/split_names.py", line 65, in compute_split_names_from_streaming_response
                  for split in get_dataset_split_names(
                               ^^^^^^^^^^^^^^^^^^^^^^^^
                File "/usr/local/lib/python3.12/site-packages/datasets/inspect.py", line 340, in get_dataset_split_names
                  info = get_dataset_config_info(
                         ^^^^^^^^^^^^^^^^^^^^^^^^
                File "/usr/local/lib/python3.12/site-packages/datasets/inspect.py", line 291, in get_dataset_config_info
                  raise SplitsNotFoundError("The split names could not be parsed from the dataset config.") from err
              datasets.inspect.SplitsNotFoundError: The split names could not be parsed from the dataset config.

In this pull request I add support for all the missing types.

…nversion

- Map fixed_size_binary to "fixed_size_binary[n]" and list to "list[<type>]" in _arrow_to_datasets_dtype
- Parse fixed_size_binary[...] in string_to_arrow and return pa.binary(byte_width)
- Add test coverage for fixed_size_binary roundtrip
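
The string side of the roundtrip described in these notes can be sketched as follows. This is a minimal illustration, not the code from the PR: the helper names `fixed_size_binary_dtype`, `parse_fixed_size_binary` and `to_arrow_type` are hypothetical, and only `pa.binary(byte_width)` is taken from the commit notes. pyarrow is imported lazily so the string helpers run without it.

```python
import re

# Hypothetical helpers for illustration; the real logic lives in
# _arrow_to_datasets_dtype and string_to_arrow in datasets/features/features.py.
_FIXED_SIZE_BINARY_RE = re.compile(r"^fixed_size_binary\[(\d+)\]$")


def fixed_size_binary_dtype(byte_width: int) -> str:
    """Build the dtype string for a fixed-size binary column of the given width."""
    if byte_width < 0:
        raise ValueError(f"byte_width must be non-negative, got {byte_width}")
    return f"fixed_size_binary[{byte_width}]"


def parse_fixed_size_binary(dtype: str):
    """Return the byte width encoded in the dtype string, or None if it does not match."""
    match = _FIXED_SIZE_BINARY_RE.match(dtype)
    return int(match.group(1)) if match else None


def to_arrow_type(dtype: str):
    """Convert the dtype string to an Arrow type (requires pyarrow)."""
    import pyarrow as pa  # imported lazily; only needed for this last step

    byte_width = parse_fixed_size_binary(dtype)
    if byte_width is None:
        raise ValueError(f"Not a fixed_size_binary dtype: {dtype!r}")
    # pa.binary with a length argument yields a fixed-size binary type.
    return pa.binary(byte_width)


if __name__ == "__main__":
    dtype = fixed_size_binary_dtype(3)
    print(dtype, parse_fixed_size_binary(dtype))
```

With pyarrow installed, `to_arrow_type("fixed_size_binary[3]")` would be expected to give back the Arrow type named in the traceback above.
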
@Alex-PLACET

Author

@lhoestq Hi Quentin. A gentle ping to run the workflows

@lhoestq

lhoestq commented Jun 19, 2026

Member

Hi! Can you include only the change for fixed_size_binary? The other types need to be discussed in an issue first.
