Write Iceberg table schema into manifest header #801

jvansanten wants to merge 10 commits into duckdb:main
Conversation
Commit messages (truncated): "…rather than manifest schema to "schema"", "…in a mix of Iceberg and Avro schemas", "Likely implements the intent of 9c8b1fd"
Tishj left a comment
I believe we have all this logic already, in IcebergCreateTableRequest::PopulateSchema
Can we please use that instead of duplicating all this logic?
Thanks, that's a much better idea; I had missed where this happens. Will do.
It looks fine to me, but I would really like a test that fails:

```python
def test_duckdb_drop_table_nested_types(self, spark_con):
    from datetime import datetime, timezone

    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    spark_con.sql(
        """
        DELETE FROM nested_types_2;
        """
    )
    spark_con.sql(f"""
        CALL system.expire_snapshots(
            table => 'default.nested_types_2',
            older_than => TIMESTAMP '{ts}'
        )
    """)
```

I tried DROP, I tried DELETE + refresh and DELETE + expire snapshots; none of them failed.
I agree that it's really nice to have tests to show that the change does something, and that it doesn't regress. The symptom I observed is that iceberg-rest-fixture silently fails to delete the data files for tables created by duckdb when they are dropped with purge enabled, since the schema in the current header raises an exception inside of the manifest reader.
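One way to make that symptom concrete is to look for leftover data files under the warehouse directory after the drop. A minimal sketch; the warehouse layout, namespace, and table name here are purely hypothetical:

```python
from pathlib import Path


def data_files_remaining(warehouse: str, table: str) -> list[str]:
    """List parquet data files still present under a (hypothetical)
    <warehouse>/default/<table> directory tree.

    An empty list would indicate the purge actually removed the files;
    a non-empty list reproduces the "silently fails to delete" symptom.
    """
    table_dir = Path(warehouse) / "default" / table
    if not table_dir.exists():
        return []
    return sorted(str(p) for p in table_dir.rglob("*.parquet"))
```

A test could then assert `data_files_remaining(...) == []` after `DROP TABLE` against the REST fixture catalog.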
This also bothers me, but I figured the manifest schema was there for a reason, and the thing that 2b97d7f seems to have been intended to write exists in spark-created manifests as "iceberg.schema". I don't know the history behind that change, though.
As far as I know we have tests that read duckdb-created tables with Spark, see …
If you search for …
Okay, I will try to think harder about how to reproduce the failures I observed.
I haven't been able to find a way to make Spark choke on the malformed schema, so it would seem that the failure I saw is a fairly rare path. That would go some way towards explaining why it went unnoticed, but does raise the question of what the schema is supposed to be used for. For want of a better idea, I switched on purge in the fixture catalog config and added a test that reproduces the failure I saw (dropping a duckdb-written table from duckdb, with purge enabled). It fails on main but passes on this PR. Interestingly, Spark's …
Force-pushed b06a141 to aeab59f
These catalogs do not purge synchronously on drop
Force-pushed aeab59f to 6bcc258
Force-pushed 6bcc258 to 200a020
Tests now pass for all catalogs. The drop test is disabled for all but the REST fixture catalog, as all others seem to purge asynchronously, making it difficult to verify whether they have actually done so. |
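For catalogs that purge asynchronously, one workable (if imperfect) alternative to disabling the test is to poll with a deadline rather than assert immediately. A generic sketch, not part of this PR, with all names hypothetical:

```python
import time


def wait_until(check, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll check() until it returns True or the deadline passes.

    Returns False on timeout, so a caller can skip (or fail with a clear
    message) instead of racing an asynchronous purge.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False
```

A drop test could then do `assert wait_until(lambda: data_files_gone())`, at the cost of a slow worst case; whether that trade-off is acceptable in CI is a judgment call.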
DuckDB tries to write the Iceberg schema of the manifest itself at the key where org.apache.iceberg.ManifestReader expects to find the table schema, while mixing Avro and Iceberg schema syntax.
This PR writes the Iceberg table schema to the key "schema" in the manifest header as required by the spec, and the Iceberg manifest schema to "iceberg.schema" as observed in Spark-created manifests.
Fixes #799.
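Concretely, the manifest header's Avro key/value metadata then looks roughly like the sketch below. The field names and ids are illustrative; only the placement of the two keys is the point:

```python
import json

# Illustrative Iceberg *table* schema (ids, names, and types are made up).
table_schema = {
    "type": "struct",
    "schema-id": 0,
    "fields": [
        {"id": 1, "name": "id", "required": True, "type": "long"},
        {"id": 2, "name": "payload", "required": False, "type": "string"},
    ],
}

# Avro file metadata for the manifest header after this change: the
# spec-mandated "schema" key carries the table schema as JSON, while the
# manifest's own Iceberg schema would go under "iceberg.schema", as
# observed in Spark-created manifests.
manifest_metadata = {
    "schema": json.dumps(table_schema),
    "schema-id": "0",
    "format-version": "2",
    # "iceberg.schema": "<JSON of the manifest's own Iceberg schema>",
}

# A reader expecting the table schema at "schema" can now parse it:
parsed = json.loads(manifest_metadata["schema"])
```

This mirrors what the PR description states: table schema at `"schema"`, manifest schema at `"iceberg.schema"`.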