Share fsspec filesystem across threads when resolving data files #8286

Open
Ijtihed wants to merge 1 commit into huggingface:main from Ijtihed:fix/share-filesystem-across-threads

Conversation

Ijtihed commented Jun 26, 2026

What does this PR do?

Fixes #8149.

_get_origin_metadata resolves each data file in a ThreadPoolExecutor, and every worker builds its own filesystem via url_to_fs because fsspec caches filesystem instances per thread. For cloud backends such as s3:// or gcs://, this re-resolves credentials and opens a new connection pool in each thread.

This PR builds the filesystem once per protocol in the main thread and passes it to the workers, which just strip the protocol off the path instead of rebuilding the filesystem. hf:// paths and chained :: URLs keep the same behavior as before.
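
For reviewers, here is a rough sketch of the pattern described above. It is not the actual diff. The names resolve_with_shared_fs, _protocol_of and fs_factory are hypothetical and exist only for illustration. The sketch assumes the filesystem instance can safely be shared across threads, and it does not handle hf:// paths or chained :: URLs, which this PR leaves on the existing code path.

```python
from concurrent.futures import ThreadPoolExecutor


def _default_fs_factory(protocol, **storage_options):
    # Imported lazily so the sketch can also be exercised with a stub factory.
    import fsspec

    return fsspec.filesystem(protocol, **storage_options)


def _protocol_of(url):
    # "s3://bucket/key" -> "s3"; a bare local path -> "file".
    # Simplification: chained "::" URLs are not handled here.
    return url.split("://", 1)[0] if "://" in url else "file"


def resolve_with_shared_fs(urls, fs_factory=_default_fs_factory,
                           max_workers=8, **storage_options):
    # Hypothetical helper: build one filesystem per protocol in the
    # calling thread, before any worker starts.
    shared = {}
    for url in urls:
        protocol = _protocol_of(url)
        if protocol not in shared:
            shared[protocol] = fs_factory(protocol, **storage_options)

    def _info(url):
        fs = shared[_protocol_of(url)]
        # Workers only strip the protocol; they never construct a filesystem.
        return fs.info(fs._strip_protocol(url))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(_info, urls))
```

With the default factory, credentials and the connection pool are set up once per protocol rather than once per worker thread, which is the saving this PR is after.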

Development

Successfully merging this pull request may close these issues.

[Optimization] Prevent per-thread instantiation of Cloud Storage FileSystem during Data loading initialization