feat: validate_deleted_data_files
#1938
Merged: Fokko merged 28 commits into apache:main from jayceslesar:feat/validate-deleted-data-files on May 20, 2025
Changes from 26 commits
beff92b feat: validation history (jayceslesar)
c369720 format (jayceslesar)
41bb8a4 almost a working test (jayceslesar)
763e9f4 allow content_filter in snapshot.manifests (jayceslesar)
f200beb simplify order of arguments to validation_history (jayceslesar)
7f6bf9d simplify return in snapshot.manifests (jayceslesar)
c63cc55 tests passing (jayceslesar)
f2f3a88 correct ancestors_between (jayceslesar)
74d5569 fix to/from logic and allow optional `to_snapshot` arg in `validation… (jayceslesar)
167f9e4 remove a level of nesting with smarter clause (jayceslesar)
efe50b4 fix bad accessor (jayceslesar)
0793713 fix docstring (jayceslesar)
89120aa [wip] feat: `validate_deleted_data_files` (jayceslesar)
a39abd2 first dummy test case (jayceslesar)
cf5b061 first working tests (needs cleanup) (jayceslesar)
645e8df Merge branch 'main' into feat/validate-deleted-data-files (jayceslesar)
caf7e36 bring back all code from silly merge (jayceslesar)
a52c422 last tweaks (jayceslesar)
d3df4a7 Merge branch 'main' into feat/validate-deleted-data-files (jayceslesar)
9eca29f fix order of args in deleted_data_files (jayceslesar)
fc83d42 make `deleted_data_files` private (jayceslesar)
ee3959a Update pyiceberg/table/update/validate.py (jayceslesar)
a51c2da ise inclusive metrics evaluator and add rows cannot match check (jayceslesar)
ec86695 show what snapshot IDs conflict (jayceslesar)
0a6b781 maybe correct partition_spec impl? (jayceslesar)
a96da01 fix ordering to match (jayceslesar)
03b2913 make private and add test (jayceslesar)
aae22e6 Merge branch 'main' into feat/validate-deleted-data-files (jayceslesar)

    @@ -14,11 +14,17 @@
    # KIND, either express or implied. See the License for the
    # specific language governing permissions and limitations
    # under the License.
    from typing import Iterator, Optional

    from pyiceberg.exceptions import ValidationException
    from pyiceberg.manifest import ManifestContent, ManifestFile
    from pyiceberg.expressions import BooleanExpression
    from pyiceberg.expressions.visitors import ROWS_CANNOT_MATCH, _InclusiveMetricsEvaluator
    from pyiceberg.manifest import ManifestContent, ManifestEntry, ManifestEntryStatus, ManifestFile
    from pyiceberg.table import Table
    from pyiceberg.table.snapshots import Operation, Snapshot, ancestors_between
    from pyiceberg.typedef import Record

    VALIDATE_DATA_FILES_EXIST_OPERATIONS = {Operation.OVERWRITE, Operation.REPLACE, Operation.DELETE}


    def validation_history(

    @@ -69,3 +75,78 @@ def validation_history(
            raise ValidationException("No matching snapshot found.")

        return manifests_files, snapshots


    def _deleted_data_files(
        table: Table,
        starting_snapshot: Snapshot,
        data_filter: Optional[BooleanExpression],
        partition_set: Optional[dict[int, set[Record]]],
        parent_snapshot: Optional[Snapshot],
    ) -> Iterator[ManifestEntry]:
        """Find deleted data files matching a filter since a starting snapshot.

        Args:
            table: Table to validate
            starting_snapshot: Snapshot current at the start of the operation
            data_filter: Expression used to find deleted data files
            partition_set: dict of {spec_id: set[partition]} to filter on
            parent_snapshot: Ending snapshot on the branch being validated

        Returns:
            List of conflicting manifest-entries
        """
        # if there is no current table state, no files have been deleted
        if parent_snapshot is None:
            return

        manifests, snapshot_ids = validation_history(
            table,
            parent_snapshot,
            starting_snapshot,
            VALIDATE_DATA_FILES_EXIST_OPERATIONS,
            ManifestContent.DATA,
        )

        if data_filter is not None:
            evaluator = _InclusiveMetricsEvaluator(table.schema(), data_filter).eval

        for manifest in manifests:
            for entry in manifest.fetch_manifest_entry(table.io, discard_deleted=False):
                if entry.snapshot_id not in snapshot_ids:
                    continue

                if entry.status != ManifestEntryStatus.DELETED:
                    continue

                if data_filter is not None and evaluator(entry.data_file) is ROWS_CANNOT_MATCH:
                    continue

                if partition_set is not None:
                    spec_id = entry.data_file.spec_id
                    partition = entry.data_file.partition
                    if spec_id not in partition_set or partition not in partition_set[spec_id]:
                        continue

                yield entry


    def validate_deleted_data_files(
        table: Table,
        starting_snapshot: Snapshot,
        data_filter: Optional[BooleanExpression],
        parent_snapshot: Snapshot,
    ) -> None:
        """Validate that no files matching a filter have been deleted from the table since a starting snapshot.

        Args:
            table: Table to validate
            starting_snapshot: Snapshot current at the start of the operation
            data_filter: Expression used to find deleted data files
            parent_snapshot: Ending snapshot on the branch being validated

        """
        conflicting_entries = _deleted_data_files(table, starting_snapshot, data_filter, None, parent_snapshot)

Contributor
Nit: style-wise, I think it is more elegant to switch over to keyword arguments:
Suggested change
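
The text of the suggested change is not reproduced on this page, so the following is only an illustration of the keyword-argument style the comment refers to, not the reviewer's actual suggestion. It uses a hypothetical stand-in, `_deleted_data_files_stub`, which copies the parameter order of `_deleted_data_files` but takes plain strings so that it runs without pyiceberg.

```python
from typing import Optional


# Hypothetical stand-in: same parameter names and order as `_deleted_data_files`,
# but with strings in place of the real Table and Snapshot objects.
def _deleted_data_files_stub(
    table: str,
    starting_snapshot: str,
    data_filter: Optional[str],
    partition_set: Optional[dict],
    parent_snapshot: Optional[str],
) -> dict:
    # Echo the arguments back so the two call styles can be compared.
    return {
        "table": table,
        "starting_snapshot": starting_snapshot,
        "data_filter": data_filter,
        "partition_set": partition_set,
        "parent_snapshot": parent_snapshot,
    }


# Positional call: the reader has to know that the fourth argument is `partition_set`.
positional = _deleted_data_files_stub("tbl", "snap-1", None, None, "snap-2")

# Keyword call: each value is labeled, and argument order no longer matters.
keyword = _deleted_data_files_stub(
    table="tbl",
    starting_snapshot="snap-1",
    data_filter=None,
    partition_set=None,
    parent_snapshot="snap-2",
)
```

Both calls bind the same values. The keyword form mainly guards against swapping `starting_snapshot` and `parent_snapshot`, which have the same type.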

        if any(conflicting_entries):
            conflicting_snapshots = {entry.snapshot_id for entry in conflicting_entries}
            raise ValidationException(f"Deleted data files were found matching the filter for snapshots {conflicting_snapshots}!")
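
One thing worth checking in the lines above: `_deleted_data_files` returns a generator, and in Python `any()` consumes items from an iterator until it finds a truthy one. A set built from the same generator afterwards only sees what is left. The sketch below shows that general behavior with a hypothetical generator of integer snapshot IDs. Whether it changes the reported set in practice depends on how the real entries behave, which this page does not show.

```python
from typing import Iterator


def conflicting_snapshot_ids() -> Iterator[int]:
    """Hypothetical stand-in for a generator of conflicting snapshot IDs."""
    yield 101
    yield 102


entries = conflicting_snapshot_ids()
if any(entries):  # any() pulls items until the first truthy one (101 here)
    remaining = set(entries)  # built only from what any() left behind

# Materializing the generator once keeps every ID available for the message.
all_ids = set(conflicting_snapshot_ids())
```

Here `remaining` no longer contains `101`, while `all_ids` has both IDs. Collecting the entries into a list or set first and then testing whether it is empty avoids the difference.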
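
The order of checks in `_deleted_data_files` can be tried out without a table. This is a minimal sketch under stated assumptions: `Status`, `Entry` and `deleted_entries` are hypothetical stand-ins rather than pyiceberg classes, partitions are modeled as tuples, and the metrics-evaluator step is left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Iterator, Optional


class Status(Enum):  # hypothetical stand-in for ManifestEntryStatus
    EXISTING = 0
    ADDED = 1
    DELETED = 2


@dataclass(frozen=True)
class Entry:  # hypothetical stand-in for ManifestEntry
    snapshot_id: int
    status: Status
    spec_id: int
    partition: tuple


def deleted_entries(
    entries: Iterable[Entry],
    snapshot_ids: set[int],
    partition_set: Optional[dict[int, set[tuple]]] = None,
) -> Iterator[Entry]:
    """Yield deleted entries from the given snapshots, optionally limited to partitions."""
    for entry in entries:
        if entry.snapshot_id not in snapshot_ids:
            continue  # written by a snapshot outside the validated range
        if entry.status != Status.DELETED:
            continue  # only deletions count as conflicts
        if partition_set is not None and entry.partition not in partition_set.get(entry.spec_id, set()):
            continue  # this partition (or its spec) is not being validated
        yield entry


entries = [
    Entry(1, Status.DELETED, 0, ("2025-05-01",)),
    Entry(1, Status.ADDED, 0, ("2025-05-01",)),
    Entry(2, Status.DELETED, 0, ("2025-05-02",)),
    Entry(9, Status.DELETED, 0, ("2025-05-01",)),
]
matches = list(deleted_entries(entries, {1, 2}, {0: {("2025-05-01",)}}))
```

Only the first entry survives all three checks. The second is not a deletion, the third is in a partition that was not requested, and the fourth belongs to a snapshot outside the set.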