
Write Iceberg table schema into manifest header#801

Open
jvansanten wants to merge 10 commits into duckdb:main from jvansanten:manifest-schema-header

Conversation

@jvansanten
Contributor

DuckDB currently writes the Iceberg schema of the manifest itself at the header key where org.apache.iceberg.ManifestReader expects to find the table schema, mixing Avro and Iceberg schema syntax in the process.

This PR writes the Iceberg table schema to the key "schema" in the manifest header as required by the spec, and the Iceberg manifest schema to "iceberg.schema" as observed in Spark-created manifests.

Fixes #799.


Likely implements the intent of 9c8b1fd
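To illustrate the distinction the PR draws, here is a minimal sketch of the key-value metadata an Iceberg manifest's Avro header is expected to carry. The key names "schema", "schema-id", "format-version", and "content" come from the Iceberg table spec; the table schema itself is a hypothetical one-column example, not taken from this repository.

```python
import json

# Hypothetical example: the Iceberg *table* schema that belongs under
# the "schema" key of a manifest's Avro file metadata. An Iceberg
# schema is a JSON struct, not an Avro record.
table_schema = {
    "type": "struct",
    "schema-id": 0,
    "fields": [
        {"id": 1, "name": "id", "required": True, "type": "long"},
    ],
}

# Sketch of the manifest header metadata (all values are strings in
# Avro file metadata). Spark-written manifests additionally carry the
# manifest's own Iceberg schema under "iceberg.schema".
manifest_header = {
    "schema": json.dumps(table_schema),  # Iceberg table schema, per spec
    "schema-id": "0",
    "format-version": "2",
    "content": "data",
}

# A reader like org.apache.iceberg.ManifestReader parses "schema" as an
# Iceberg struct schema:
assert json.loads(manifest_header["schema"])["type"] == "struct"
```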
Review comment on src/metadata/iceberg_column_definition.cpp (outdated)
Member

@Tishj Tishj left a comment


I believe we have all this logic already, in IcebergCreateTableRequest::PopulateSchema
Can we please use that instead of duplicating all this logic?

@jvansanten
Contributor Author

jvansanten commented Mar 17, 2026

I believe we have all this logic already, in IcebergCreateTableRequest::PopulateSchema. Can we please use that instead of duplicating all this logic?

Thanks, that's a much better idea; I had missed where this happens. Will do.

@jvansanten jvansanten requested a review from Tishj March 17, 2026 10:42
@Tishj
Member

Tishj commented Mar 18, 2026

It looks fine to me, but I would really like a test that fails on main and now passes with this.
I couldn't manage to make it fail on main:

    def test_duckdb_drop_table_nested_types(self, spark_con):
        from datetime import datetime, timezone

        ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

        spark_con.sql(
            """
            DELETE FROM nested_types_2;
            """
        )
        spark_con.sql(f"""
        CALL system.expire_snapshots(
        table => 'default.nested_types_2',
        older_than => TIMESTAMP '{ts}'
        )
        """)

I tried DROP, I tried DELETE + refresh, and I tried DELETE + expire snapshots; none of them failed.
I'm also hesitant about adding iceberg.schema, as the spec doesn't mention it anywhere.

@jvansanten
Contributor Author

I tried DROP, I tried DELETE + refresh and DELETE + expire snapshots, none of them failed

I agree that it's really nice to have tests to show that the change does something, and that it doesn't regress.

The symptom I observed is that iceberg-rest-fixture silently fails to delete the data files of tables created by DuckDB when PURGE_REQUESTED is true. From the perspective of the catalog, though, the drop succeeds; the data files are simply orphaned in storage. I don't know how to test this in the context of sqllogictest, however.

Since the current header raises an exception inside of org.apache.iceberg.ManifestReader, I think it's likely that attempts to scan the table with Spark would fail as well. One option could be to add a new set of tests that read duckdb-written tables back with Spark. Unless, of course, CatalogUtils is the only thing that actually uses ManifestReader.

I'm also hesitant about adding iceberg.schema, as the spec doesn't mention this anywhere

This also bothers me, but I figured the manifest schema was there for a reason, and the thing 2b97d7f seems to have been intended to write exists in Spark-created manifests as "iceberg.schema". I don't know the history behind that change, though.
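One cheap check a Spark-read test could make on a duckdb-written manifest header: the value under "schema" must parse as an Iceberg struct schema, not an Avro record schema; the two are easy to tell apart by their top-level "type" field. This is a hypothetical sketch with illustrative schema strings, not code from the PR, and the helper name is made up.

```python
import json

# Illustrative examples of the two schema dialects that were being
# mixed: an Iceberg table schema is a JSON "struct", while an Avro
# schema for manifest entries is a JSON "record".
iceberg_table_schema = '{"type": "struct", "schema-id": 0, "fields": []}'
avro_manifest_schema = '{"type": "record", "name": "manifest_entry", "fields": []}'

def looks_like_iceberg_schema(raw: str) -> bool:
    """True if the JSON under the manifest's "schema" key is an
    Iceberg struct schema rather than an Avro record schema."""
    return json.loads(raw).get("type") == "struct"

assert looks_like_iceberg_schema(iceberg_table_schema)
assert not looks_like_iceberg_schema(avro_manifest_schema)
```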

@Tishj
Member

Tishj commented Mar 18, 2026

As far as I know we have tests that read duckdb-created tables with Spark, see test_spark_read.py

@Tmonster
Member

If you search for default.lower_upper_bounds_test, you can see a duckdb test writes it, and later, both pyiceberg and spark read it back. It's not very well documented, but we have been running these kinds of tests for a while now

@jvansanten
Contributor Author

Okay, I will try to think harder about how to reproduce the failures I observed.

@jvansanten
Contributor Author

jvansanten commented Mar 19, 2026

I haven't been able to find a way to make Spark choke on the malformed schema, so it would seem that the failure I saw is a fairly rare path. That would go some way towards explaining why it went unnoticed, but does raise the question of what the schema is supposed to be used for.

For want of a better idea, I switched on purge in the fixture catalog config and added a test that reproduces the failure I saw (dropping a duckdb-written table from duckdb, with purge enabled). It fails on main but passes on this PR.

Interestingly, Spark's DROP TABLE PURGE on a duckdb-written table succeeds on main; it does not seem to rely on the catalog to purge files.

@jvansanten jvansanten force-pushed the manifest-schema-header branch 2 times, most recently from b06a141 to aeab59f on April 17, 2026 13:03
These catalogs do not purge synchronously on drop
@jvansanten jvansanten force-pushed the manifest-schema-header branch from aeab59f to 6bcc258 on April 17, 2026 13:15
@jvansanten jvansanten force-pushed the manifest-schema-header branch from 6bcc258 to 200a020 on April 20, 2026 07:27
@jvansanten
Contributor Author

Tests now pass for all catalogs. The drop test is disabled for all but the REST fixture catalog, as all others seem to purge asynchronously, making it difficult to verify whether they have actually done so.
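The synchronous-purge check the new test relies on can be sketched as follows: after DROP TABLE with purge against the REST fixture catalog, the table's data directory should hold no data files. The paths, the directory layout, and the drop step itself are placeholders here, not the PR's actual test code.

```python
import pathlib
import tempfile

def orphaned_data_files(table_dir: pathlib.Path) -> list:
    """Return any data files the catalog failed to purge on drop."""
    return sorted(table_dir.rglob("*.parquet"))

with tempfile.TemporaryDirectory() as d:
    # Placeholder warehouse layout for a table named like the one in
    # the thread; a real test would point at the fixture's storage.
    table_dir = pathlib.Path(d) / "default" / "nested_types_2" / "data"
    table_dir.mkdir(parents=True)
    # ...DROP TABLE with purge would run here against the catalog...
    leftovers = orphaned_data_files(table_dir)

# A failed purge (the bug on main) would leave parquet files behind;
# with this PR the directory should be empty.
assert leftovers == []
```

Catalogs that purge asynchronously make this assertion racy, which is why the thread disables the test for everything except the REST fixture.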


Development

Successfully merging this pull request may close these issues:

Malformed schema in manifest header (#799)
