Add WebDataset reader #124

Open
Guilherme Penedo (guipenedo) wants to merge 7 commits into main from codex/webdataset-reader

@guipenedo

Collaborator

Summary

  • add read_webdataset(...) for .tar, .tar.gz, and .tgz shards
  • stream archives sequentially and emit one row per adjacent WebDataset sample
  • parse .json members by default and keep other payloads as bytes
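
The behavior summarized above can be illustrated with a minimal standard-library sketch. This is not the code from this PR: `iter_samples`, `split_member_name`, and the `__key__` column are hypothetical names chosen for illustration, and the key/extension split follows the usual WebDataset convention (everything before the first dot of the basename is the sample key).

```python
import io
import json
import tarfile
from typing import Any, BinaryIO, Dict, Iterator, Tuple


def split_member_name(name: str) -> Tuple[str, str]:
    """Split 'dir/0001.seg.png' into key 'dir/0001' and field 'seg.png'."""
    dirname, _, basename = name.rpartition("/")
    stem, dot, ext = basename.partition(".")
    key = f"{dirname}/{stem}" if dirname else stem
    return key, ext.lower() if dot else ""


def iter_samples(fileobj: BinaryIO, parse_json: bool = True) -> Iterator[Dict[str, Any]]:
    """Yield one row per run of adjacent members that share a sample key.

    Hypothetical helper, not the reader added by this PR.
    """
    row: Dict[str, Any] = {}
    # "r|*" reads the archive as a forward-only stream and detects
    # gzip compression transparently, so .tar, .tar.gz and .tgz all work.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            key, field = split_member_name(member.name)
            if row and row["__key__"] != key:
                yield row  # the key changed, so the previous sample is complete
                row = {}
            if not row:
                row = {"__key__": key}
            extracted = archive.extractfile(member)
            payload: Any = extracted.read() if extracted is not None else b""
            if parse_json and field == "json":
                payload = json.loads(payload)  # .json members become Python objects
            row[field] = payload  # every other payload stays as raw bytes
    if row:
        yield row  # flush the final sample
```

Because samples are grouped only by adjacency, members of one sample must be stored next to each other in the shard; a key that reappears later in the archive would start a new row in this sketch.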

Validation

  • uv run pytest tests/readers/test_webdataset_reader.py tests/readers/test_multi_source_readers.py
  • uv run ruff check src/refiner/pipeline/sources/readers/webdataset.py src/refiner/pipeline/pipeline.py tests/readers/test_webdataset_reader.py docs/reading-and-writing.md
  • uv run ty check src/refiner/pipeline/sources/readers/webdataset.py src/refiner/pipeline/pipeline.py tests/readers/test_webdataset_reader.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a47f33f163

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "Codex (@codex) review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".

Comment thread src/refiner/pipeline/sources/readers/webdataset.py Outdated

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces the read_webdataset function and WebDatasetReader class, enabling the processing of WebDataset tar archives where members are grouped by sample keys. The implementation includes support for sequential streaming, automatic JSON parsing, and metadata column configuration, accompanied by updated documentation and comprehensive unit tests. Review feedback suggests optimizing the JSON member check by utilizing the pre-processed field name and considering edge cases for archive members that lack file extensions.
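
The individual review threads are not shown here, so the following is only one possible reading of that feedback, sketched with hypothetical helpers (`field_name` and `decode_member` are not names from the PR). The idea is to normalize the field name once, make the JSON decision from that normalized value, and let members without an extension fall through as raw bytes.

```python
import json
from typing import Any


def field_name(member_name: str) -> str:
    """Return the lowercased extension part of a member name, or '' if none."""
    basename = member_name.rsplit("/", 1)[-1]
    _, dot, ext = basename.partition(".")
    return ext.lower() if dot else ""


def decode_member(field: str, payload: bytes, parse_json: bool = True) -> Any:
    """Decode a payload based on the already-normalized field name."""
    # Compare against the pre-processed field rather than re-inspecting the
    # raw member name, so case differences are handled in one place.
    if parse_json and (field == "json" or field.endswith(".json")):
        return json.loads(payload)
    # Extensionless members (field == "") and all other types stay as bytes.
    return payload
```

Whether compound fields such as `meta.json` should be parsed is a design choice; this sketch assumes they should.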

Comment thread src/refiner/pipeline/sources/readers/webdataset.py Outdated
Comment thread src/refiner/pipeline/sources/readers/webdataset.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4fd151ec8d

Comment thread src/refiner/pipeline/sources/readers/webdataset.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3ae4b2130e

Comment thread src/refiner/pipeline/sources/readers/webdataset.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aea4e596d9

Comment thread src/refiner/pipeline/sources/readers/webdataset.py