Add WebDataset reader #124
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a47f33f163
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "Codex (@codex) review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".
Code Review
This pull request introduces the read_webdataset function and WebDatasetReader class, enabling the processing of WebDataset tar archives where members are grouped by sample keys. The implementation includes support for sequential streaming, automatic JSON parsing, and metadata column configuration, accompanied by updated documentation and comprehensive unit tests. Review feedback suggests optimizing the JSON member check by utilizing the pre-processed field name and considering edge cases for archive members that lack file extensions.
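The grouping behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's `WebDatasetReader`: the helper names (`split_member_name`, `iter_samples`, `make_demo_shard`), the `__key__` field, and the choice to store extension-less members under an empty field name are all assumptions made for the example. It assumes the common WebDataset convention that members of one sample are stored consecutively and share the part of the basename before the first dot.

```python
import io
import json
import tarfile
from typing import Any, BinaryIO, Dict, Iterator, Tuple


def split_member_name(name: str) -> Tuple[str, str]:
    """Split 'dir/0001.meta.json' into ('dir/0001', 'meta.json')."""
    directory, _, base = name.rpartition("/")
    stem, _, field = base.partition(".")
    key = f"{directory}/{stem}" if directory else stem
    return key, field  # field is '' when the member has no extension


def iter_samples(fileobj: BinaryIO, parse_json: bool = True) -> Iterator[Dict[str, Any]]:
    """Yield one dict per sample from a tar shard, reading sequentially."""
    current_key = None
    sample: Dict[str, Any] = {}
    # "r|*" opens the archive as a forward-only stream and detects
    # compression, so plain .tar and gzip-compressed shards both work.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if not member.isfile():
                continue
            payload = archive.extractfile(member).read()
            key, field = split_member_name(member.name)
            if key != current_key:
                if sample:
                    yield sample
                current_key = key
                sample = {"__key__": key}  # hypothetical metadata field name
            if parse_json and (field == "json" or field.endswith(".json")):
                sample[field] = json.loads(payload)
            else:
                sample[field] = payload  # every other payload stays as bytes
    if sample:
        yield sample


def make_demo_shard() -> io.BytesIO:
    """Build a tiny in-memory shard, including one extension-less member."""
    members = [
        ("0001.json", b'{"label": 1}'),
        ("0001.jpg", b"\xff\xd8"),
        ("0002.json", b'{"label": 2}'),
        ("0002.txt", b"hello"),
        ("README", b"no extension"),
    ]
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in members:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for row in iter_samples(make_demo_shard()):
        print(row)
```

Deriving the field name once and reusing it for the JSON check mirrors the review suggestion about using the pre-processed field name. The `README` member shows the extension-less edge case: in this sketch it becomes its own sample with an empty field name, which a real reader might prefer to reject or rename.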
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4fd151ec8d
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3ae4b2130e
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aea4e596d9
827532d to e0e2a60
e0e2a60 to e294057
Summary
- Add read_webdataset(...) for .tar, .tar.gz, and .tgz shards
- Parse .json members by default and keep other payloads as bytes
Validation
- uv run pytest tests/readers/test_webdataset_reader.py tests/readers/test_multi_source_readers.py
- uv run ruff check src/refiner/pipeline/sources/readers/webdataset.py src/refiner/pipeline/pipeline.py tests/readers/test_webdataset_reader.py docs/reading-and-writing.md
- uv run ty check src/refiner/pipeline/sources/readers/webdataset.py src/refiner/pipeline/pipeline.py tests/readers/test_webdataset_reader.py