
Fix data file patterns starting with "./" for Hub datasets (#7727) #8260

Open
discobot wants to merge 1 commit into huggingface:main from discobot:fix/7727-dot-segment-patterns

Fixes #7727.

data_files patterns starting with ./ raise FileNotFoundError when the dataset is loaded from the Hub, even though the same patterns work locally.

The asymmetry comes from resolve_pattern(): it joins relative patterns onto the base path with xjoin (a plain posixpath.join), which keeps the . segments, so the glob pattern ends up as hf://datasets/<repo>/./plain_text/*.parquet. HfFileSystem.glob (like fsspec's generic glob) treats . as a literal directory name and matches nothing, while LocalFileSystem normalizes the dot segments away. That is why local loading works and Hub loading doesn't.
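The join behaviour described above can be reproduced with the standard library alone. The snippet below is only an illustration: the repo id `user/repo`, the file names, and the `toy_glob` helper are hypothetical, and `toy_glob` is a stand-in that compares path segments literally. It is not how HfFileSystem.glob or fsspec is actually implemented.

```python
import posixpath
from fnmatch import fnmatchcase

# Hypothetical base path and repo contents, for illustration only.
base_path = "hf://datasets/user/repo"
repo_files = [
    base_path + "/plain_text/train-00000.parquet",
    base_path + "/plain_text/validation-00000.parquet",
]


def toy_glob(pattern, paths):
    """Toy matcher: compares segment by segment, so "." is just a name."""
    pat_parts = pattern.split("/")
    matches = []
    for path in paths:
        parts = path.split("/")
        if len(parts) == len(pat_parts) and all(
            fnmatchcase(seg, pat) for seg, pat in zip(parts, pat_parts)
        ):
            matches.append(path)
    return matches


# posixpath.join does no normalization, so the "." segment survives.
joined_dot = posixpath.join(base_path, "./plain_text/*.parquet")
joined_plain = posixpath.join(base_path, "plain_text/*.parquet")

print(joined_dot)    # hf://datasets/user/repo/./plain_text/*.parquet
print(joined_plain)  # hf://datasets/user/repo/plain_text/*.parquet

print(toy_glob(joined_dot, repo_files))          # []
print(len(toy_glob(joined_plain, repo_files)))   # 2
```

A filesystem that normalizes paths before matching would treat both joined patterns the same, which is consistent with the local case working.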

This PR strips . segments from relative patterns in resolve_pattern() before joining, so ./data/* resolves the same as data/* on every filesystem. The stripping also removes any leading slashes it uncovers (e.g. in .//data/*), so a relative pattern can never turn absolute and resolve outside the base path. .. segments are left untouched, as before.
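As a rough sketch of the stripping described above (not the code in this PR's diff), the helper below drops `.` segments from a relative pattern and then removes any leading slashes that become exposed. The name `strip_dot_segments` is hypothetical, and leaving absolute and protocol-prefixed patterns untouched, as well as stripping `.` segments in the middle of a pattern, are assumptions of this sketch.

```python
import posixpath


def strip_dot_segments(pattern: str) -> str:
    """Hypothetical helper: drop "." segments from a relative pattern."""
    # Assumption: absolute patterns and patterns with a protocol are left alone.
    if pattern.startswith("/") or "://" in pattern:
        return pattern
    kept = [seg for seg in pattern.split("/") if seg != "."]
    # Removing a leading "." from ".//data/*" uncovers "/data/*";
    # strip the slashes so the pattern stays relative.
    return "/".join(kept).lstrip("/")


base_path = "hf://datasets/user/repo"  # hypothetical repo id

print(strip_dot_segments("./data/*"))    # data/*
print(strip_dot_segments(".//data/*"))   # data/*
print(strip_dot_segments("../data/*"))   # ../data/*  (".." is untouched)

# Why the leading slashes matter: posixpath.join discards the base path
# when the second argument is absolute.
print(posixpath.join(base_path, "/data/*"))                          # /data/*
print(posixpath.join(base_path, strip_dot_segments(".//data/*")))    # hf://datasets/user/repo/data/*
```

Stripping the slashes after removing the dots, rather than before, is what keeps a pattern such as `.//data/*` from escaping the base path.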

Added tests for the local, mock-fs and mock-dataset-repository cases (they fail without the fix, no network needed), and checked manually that resolve_pattern("./plain_text/*.parquet", "hf://datasets/rajpurkar/squad") now resolves the two parquet files instead of raising.

…ce#7727)

Relative patterns like "./data/*.parquet" (e.g. from data_files in YAML configs) were joined with the base path without removing the "." segments. Local loading still worked because LocalFileSystem normalizes paths, but remote filesystems like HfFileSystem treat "." as a literal directory name when globbing, so loading the same dataset from the Hub failed with FileNotFoundError. Now resolve_pattern() strips "." segments from relative patterns before joining them with the base path. Added tests for the local, mock-fs and dataset-repository cases.

Development

Successfully merging this pull request may close these issues.

config paths that start with ./ are not valid as hf:// accessed repos, but are valid when accessed locally
