
Increase the throughput of the validate_duplicate_files #2296

Open

tomighita wants to merge 8 commits into apache:main from dbt-labs:tomighita/increase-duplicate-check-throughput
Conversation


@tomighita tomighita commented Mar 30, 2026

Which issue does this PR close?

What changes are included in this PR?

Increase the throughput of the validate_duplicate_files by starting all requests and polling rather than sequentially fetching each file.

Are these changes tested?

No need to add extra tests since the functionality should be equivalent and existing tests should capture this behaviour

@tomighita tomighita marked this pull request as ready for review March 30, 2026 12:22
.entries()
.iter()
.map(|entry| entry.load_manifest(file_io))
.collect();
Collaborator

@CTTY CTTY Mar 30, 2026


Should we buffer_unordered here? This is an IO operation and too many requests may overwhelm the storage backend

.try_buffer_unordered(32) should make most object stores happy
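To illustrate the idea behind this suggestion (bounding how many loads are in flight at once), here is a stdlib-only sketch. It is not the PR's actual code, which uses the futures crate's buffer_unordered on async manifest loads; `bounded_run` and the simulated jobs are hypothetical stand-ins built from threads and a channel:

```rust
use std::sync::mpsc;
use std::thread;

/// Run `jobs` with at most `limit` in flight at once, collecting results in
/// completion order -- the same shape as `.try_buffer_unordered(32)`, but
/// built from threads and a channel instead of futures.
fn bounded_run<T: Send + 'static>(
    jobs: Vec<Box<dyn FnOnce() -> T + Send>>,
    limit: usize,
) -> Vec<T> {
    let (tx, rx) = mpsc::channel();
    let total = jobs.len();
    let mut pending = jobs.into_iter();
    let mut in_flight = 0;
    let mut results = Vec::with_capacity(total);

    loop {
        // Top up to the concurrency limit before waiting.
        while in_flight < limit {
            match pending.next() {
                Some(job) => {
                    let tx = tx.clone();
                    thread::spawn(move || {
                        let _ = tx.send(job());
                    });
                    in_flight += 1;
                }
                None => break,
            }
        }
        if in_flight == 0 {
            break; // nothing running and nothing left to start
        }
        // Wait for one completion before starting more work.
        results.push(rx.recv().expect("worker thread panicked"));
        in_flight -= 1;
    }
    results
}

fn main() {
    // Simulate 100 "manifest loads"; at most 8 are ever in flight.
    let jobs: Vec<Box<dyn FnOnce() -> u64 + Send>> = (0..100u64)
        .map(|i| -> Box<dyn FnOnce() -> u64 + Send> { Box::new(move || i * 2) })
        .collect();
    let mut results = bounded_run(jobs, 8);
    results.sort(); // completion order is nondeterministic
    assert_eq!(results.len(), 100);
    assert_eq!(results[99], 198);
    println!("ok");
}
```

The key property, shared with buffer_unordered, is that a new request starts only after a previous one finishes once the limit is reached, so the storage backend never sees more than `limit` concurrent requests.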

@tomighita
Author

Thanks for your suggestion @CTTY! I have also incorporated @liurenjie1024's feedback from Slack, but I am a bit concerned about allocating threads without being explicit. For instance, in other places we explicitly set the thread count [ref].

Any thoughts?

@tomighita tomighita requested a review from CTTY April 1, 2026 10:28
Contributor

github-actions Bot commented May 2, 2026

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label May 2, 2026
@tomighita
Author

Can anyone take a look at this, so we can get it merged?

let file_io = self.table.file_io().clone();
spawn(async move { entry.load_manifest(&file_io).await })
})
.buffer_unordered(32)
Contributor


Should 32 be a shared constant someplace?

Collaborator


+1 on making this a constant

Author


Good point. Moved. If you don't love the name, feel free to suggest a new one 😅

@github-actions github-actions Bot removed the stale label May 5, 2026
@tomighita tomighita requested a review from emkornfield May 5, 2026 07:15
Contributor

@emkornfield emkornfield left a comment


One minor question: I don't know if the constant should be more centralized (e.g. do we want to limit all file operations to the same parallelism?), but it could potentially be refactored later.

use crate::{Error, ErrorKind, TableRequirement, TableUpdate};

const META_ROOT_PATH: &str = "metadata";
const NUM_THREADS_VALIDATE_DUPLICATE_FILES: usize = 32;
Contributor


Maybe add a doc comment explaining why it is 32.

Author


Done!
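For readers following along, a doc comment in the spirit of this exchange might look like the sketch below. The wording is hypothetical, not the text that actually landed in the PR:

```rust
/// Maximum number of manifest-load requests `validate_duplicate_files`
/// keeps in flight at once. Unbounded fan-out can overwhelm the storage
/// backend; 32 is a conservative bound that most object stores handle
/// comfortably (see the review discussion on this PR).
const NUM_THREADS_VALIDATE_DUPLICATE_FILES: usize = 32;

fn main() {
    assert_eq!(NUM_THREADS_VALIDATE_DUPLICATE_FILES, 32);
}
```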

@tomighita
Author

tomighita commented May 5, 2026

Ty for the review! 🥳

One minor question: I don't know if the constant should be more centralized (e.g. do we want to limit all file operations to the same parallelism?), but it could potentially be refactored later.

@emkornfield I like the idea of centralising, but I'm afraid it doesn't translate well to other operations here. I would be in favour of refactoring this later and reusing it where needed.



Development

Successfully merging this pull request may close these issues.

Improve throughput of validate_duplicate_files
