- Add dedicated subsections for append, overwrite, delete, dynamic
partition overwrite, upsert, and Transaction API
- Document overwrite_filter with worked example and note on accepted
expression types
- Document branch parameter on append and overwrite
- Add case_sensitive note to overwrite and delete
- Add merge-on-read warning to delete (falls back to copy-on-write)
- Add Transaction API section showing atomic data + metadata changes
- Fix transaction rollback language: catalog commit is aborted but
object-storage files are not automatically cleaned up
- Remove redundant top-level Snapshot properties section
Reading and writing is done using [Apache Arrow](https://arrow.apache.org/). Arrow is an in-memory columnar format for fast data interchange and in-memory analytics. Let's consider the following Arrow Table:
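```python
import pyarrow as pa

# Illustrative sample data; any Arrow table matching the target schema works
df = pa.Table.from_pylist(
    [
        {"city": "Amsterdam", "lat": 52.371807, "long": 4.896029},
        {"city": "San Francisco", "lat": 37.773972, "long": -122.431297},
        {"city": "Drachten", "lat": 53.11254, "long": 6.0989},
        {"city": "Paris", "lat": 48.864716, "long": 2.349014},
    ],
)
```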
PyIceberg supports several write modes: [append](#append), [overwrite](#overwrite), [delete](#delete), [dynamic partition overwrite](#dynamic-partition-overwrite), and [upsert](#upsert). All writes use [Apache Arrow](https://arrow.apache.org/) as the in-memory format. Writes can be issued directly on the `Table` object or grouped together using the [Transaction API](#transaction-api).
The nested lists indicate the different Arrow buffers. Each write produces a [Parquet file](https://parquet.apache.org/) in which each [row group](https://parquet.apache.org/docs/concepts/) translates into an Arrow buffer. If the table is large, PyIceberg also allows streaming the buffers using the Arrow [RecordBatchReader](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html), avoiding pulling everything into memory right away:
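```python
# A sketch: df.to_reader() streams the table batch by batch, and the reader
# is handed to the write instead of a fully materialized table
batch_reader = df.to_reader(max_chunksize=10_000)
tbl.append(batch_reader)
```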
To avoid type inconsistencies, convert the Iceberg table schema to Arrow before writing:
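```python
# Schema.as_arrow() gives the pyarrow.Schema equivalent of the Iceberg schema;
# casting up front prevents mismatches from Arrow's inferred types
df = df.cast(tbl.schema().as_arrow())
```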
### Overwrite
`overwrite` replaces data in the table with new data. When called without an `overwrite_filter`, it behaves like a full table replacement: existing data is deleted and the new data is written. On an empty table, `overwrite` and `append` produce the same result.
```python
tbl.overwrite(df)
```
#### Partial overwrite with `overwrite_filter`
Pass an `overwrite_filter` to delete only the rows that match the predicate before appending the new data. This is useful for replacing a specific subset of rows.
For example, to replace the record for `Paris` with a record for `New York`:
```python
from pyiceberg.expressions import EqualTo

df_new = pa.Table.from_pylist(
    [{"city": "New York", "lat": 40.7128, "long": 74.0060}]
)

tbl.overwrite(df_new, overwrite_filter=EqualTo("city", "Paris"))
```
The `overwrite_filter` accepts both expression objects (e.g., `EqualTo`, `GreaterThan`) and SQL-style string predicates (e.g., `"city == 'Paris'"`). Matching is case-sensitive by default; pass `case_sensitive=False` to change this.
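For example, the same partial overwrite expressed with a string predicate (a sketch, equivalent to the `EqualTo` version above):

```python
# With case_sensitive=False, "City" resolves to the "city" field
tbl.overwrite(df_new, overwrite_filter="City == 'Paris'", case_sensitive=False)
```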
Optionally, you can also set snapshot properties or target a branch:
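```python
# Values are illustrative; snapshot properties are free-form string key/values
tbl.overwrite(
    df_new,
    overwrite_filter=EqualTo("city", "Paris"),
    snapshot_properties={"author": "data-team"},
    branch="audit-branch",  # write to this branch instead of main
)
```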
### Delete

Use `delete` to remove rows matching a predicate without writing new data. PyIceberg uses Iceberg metadata to prune which Parquet files need to be opened, so only relevant files are read. The filter is case-sensitive by default; pass `case_sensitive=False` to change this.
<!-- prettier-ignore-start -->
!!! warning "Merge-on-read not yet supported"
    If the table property `write.delete.mode` is set to `merge-on-read`, PyIceberg will fall back to copy-on-write and emit a warning. All deletes currently rewrite Parquet files.
<!-- prettier-ignore-end -->
```python
tbl.delete(delete_filter="city == 'Paris'")
```
When the predicate matches all rows in a Parquet file (e.g., `tbl.delete(delete_filter="city == 'Groningen'")`), PyIceberg drops the entire file without scanning its contents.
### Dynamic Partition Overwrite
For partitioned tables, `dynamic_partition_overwrite` replaces only the partitions present in the provided Arrow table. The partitions to overwrite are detected automatically — you do not need to specify them explicitly.
First, create a partitioned table:
```python
from pyiceberg.schema import Schema
from pyiceberg.types import DoubleType, NestedField, StringType
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import IdentityTransform

# Identity-partition on the "city" field
schema = Schema(
    NestedField(1, "city", StringType(), required=False),
    NestedField(2, "lat", DoubleType(), required=False),
    NestedField(3, "long", DoubleType(), required=False),
)
spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=IdentityTransform(), name="city")
)
# The catalog object and table name are assumed from earlier examples
tbl = catalog.create_table("default.cities", schema=schema, partition_spec=spec)
```
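With the table created, a sketch of the overwrite itself (the seed and replacement rows are assumptions):

```python
tbl.append(df)  # seed the table, e.g. with the sample city data from above

# Only the "Paris" partition is rewritten; every other partition is untouched
df_paris = pa.Table.from_pylist(
    [{"city": "Paris", "lat": 48.864716, "long": 2.349014}]
)
tbl.dynamic_partition_overwrite(df_paris)
```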
### Transaction API

All write operations can also be issued as part of a transaction, which lets you combine multiple mutations (schema changes, property updates, and data writes) into a single atomic commit.
```python
with tbl.transaction() as txn:
    txn.append(df)
```
You can combine multiple write operations in one transaction:
```python
with tbl.transaction() as txn:
    txn.delete("city == 'Paris'")
    txn.append(pa.Table.from_pylist([{"city": "New York", "lat": 40.7128, "long": 74.0060}]))
```
You can also mix data writes with metadata changes in the same transaction:
```python
from pyiceberg.types import LongType

with tbl.transaction() as txn:
    txn.append(df)
    with txn.update_schema() as update_schema:
        update_schema.add_column("population", LongType())
    txn.set_properties(owner="data-team")
```
If an exception is raised inside the `with` block, no snapshot is committed to the catalog. Note that data files already written to object storage are not automatically cleaned up in that case.
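A sketch of the failure path (the raised error is arbitrary):

```python
try:
    with tbl.transaction() as txn:
        txn.append(df)
        raise RuntimeError("boom")  # simulate a failure before commit
except RuntimeError:
    pass

# No new snapshot exists in the catalog, but the Parquet files written by
# txn.append may remain in object storage until cleaned up separately.
```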
### Upsert
PyIceberg supports upsert operations, meaning it can merge an Arrow table into an Iceberg table. Rows are considered the same based on the [identifier field](https://iceberg.apache.org/spec/?column-projection#identifier-field-ids). If a row is already in the table, it is updated; if a row cannot be found, it is inserted.
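Upsert matches rows on the table's identifier fields, so one must be set. A minimal sketch, assuming `city` uniquely identifies a row:

```python
# Identifier fields must be required (non-null) columns
with tbl.update_schema() as update:
    update.set_identifier_fields("city")
```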
```python
# df is assumed to hold one changed row, one brand-new row, and the
# unchanged Paris row from earlier examples
upd = tbl.upsert(df)

assert upd.rows_updated == 1
assert upd.rows_inserted == 1
# Paris was already up-to-date; PyIceberg skips it silently
```
PyIceberg automatically detects which rows need to be updated, which need to be inserted, and which can simply be ignored.