Increase the throughput of the validate_duplicate_files #2296

tomighita wants to merge 8 commits into apache:main from validate_duplicate_files

Conversation
```rust
    .entries()
    .iter()
    .map(|entry| entry.load_manifest(file_io))
    .collect();
```
Should we `buffer_unordered` here? This is an IO operation, and too many requests may overwhelm the storage backend.
`.try_buffer_unordered(32)` should make most object stores happy.
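As a rough sketch of what that could look like (illustrative only: it assumes `load_manifest` is an async method returning `Result<Manifest>`, and it uses `buffer_unordered` plus `try_collect`, which has the same effect here as `try_buffer_unordered`; the function name is made up):

```rust
use futures::stream::{self, StreamExt, TryStreamExt};
use iceberg::io::FileIO;
use iceberg::spec::{Manifest, ManifestFile};

// Illustrative sketch: turn the sequential `.map(...).collect()` into a
// stream of futures and cap how many manifest loads are in flight at once.
async fn load_manifests(
    entries: &[ManifestFile],
    file_io: &FileIO,
) -> iceberg::Result<Vec<Manifest>> {
    stream::iter(entries.iter().map(|entry| entry.load_manifest(file_io)))
        // At most 32 loads in flight, so the object store is not overwhelmed.
        .buffer_unordered(32)
        .try_collect()
        .await
}
```

In this sketch `try_collect` short-circuits on the first failed load, so a single unreadable manifest still fails validation early.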
Thanks for your suggestion @CTTY! I have also incorporated @liurenjie1024's feedback from Slack, but I am a bit concerned about allocating threads without being explicit. For instance, in other places we explicitly set the thread count [ref]. Any thoughts?
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
Can anyone take a look at this, so we can get it merged?
```rust
            let file_io = self.table.file_io().clone();
            spawn(async move { entry.load_manifest(&file_io).await })
        })
        .buffer_unordered(32)
```
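For context, a self-contained sketch of the pipeline this fragment is part of (illustrative only: it uses `tokio::spawn` rather than the crate's `spawn` helper, and the function name and signature are made up):

```rust
use futures::stream::{self, StreamExt};
use iceberg::io::FileIO;
use iceberg::spec::{Manifest, ManifestFile};

// Illustrative sketch: spawn one task per manifest entry and keep at most
// 32 of them in flight at a time via `buffer_unordered`.
async fn load_manifests_concurrently(
    entries: Vec<ManifestFile>,
    file_io: FileIO,
) -> iceberg::Result<Vec<Manifest>> {
    let joined: Vec<_> = stream::iter(entries)
        .map(|entry| {
            // Each spawned task owns its own clone of the FileIO handle.
            let file_io = file_io.clone();
            tokio::spawn(async move { entry.load_manifest(&file_io).await })
        })
        .buffer_unordered(32)
        .collect()
        .await;

    joined
        .into_iter()
        // For the sketch, treat a panicked task as a programming error.
        .map(|result| result.expect("manifest load task panicked"))
        .collect()
}
```

Because the `map` stage is lazy, `buffer_unordered(32)` also caps how many tasks are spawned at once, not just how many are awaited.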
Should 32 be a shared constant someplace?
Good point. Moved. If you don't love the name, feel free to suggest a new one 😅
emkornfield left a comment
One minor question: I don't know if the constant should be more centralized (e.g., do we want to limit all file operations to the same parallelism?), but that could potentially be refactored later.
```rust
use crate::{Error, ErrorKind, TableRequirement, TableUpdate};

const META_ROOT_PATH: &str = "metadata";
const NUM_THREADS_VALIDATE_DUPLICATE_FILES: usize = 32;
```
Maybe add a doc comment explaining why it is 32.
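For example, something along these lines could work (the wording is only a suggestion, borrowing the rationale from the earlier comment about not overwhelming object stores):

```rust
/// Maximum number of manifest files loaded concurrently in
/// `validate_duplicate_files`. 32 is a conservative fan-out that hides
/// per-request latency without overwhelming typical object stores.
const NUM_THREADS_VALIDATE_DUPLICATE_FILES: usize = 32;
```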
Ty for the review! 🥳
@emkornfield I like the idea of centralising it, but I am afraid it does not translate well to other operations here. I would be in favour of refactoring this later and re-using it where needed, if needed.
Which issue does this PR close?
What changes are included in this PR?
Increase the throughput of `validate_duplicate_files` by starting all requests up front and polling them, rather than fetching each file sequentially.

Are these changes tested?
No need to add extra tests, since the functionality should be equivalent and the existing tests should capture this behaviour.