Cluster Databricks tables to match Snowflake's pre-clustered sample data by gedejong · Pull Request #28 · get-select/snowflake-databricks-benchmark

gedejong · 2026-05-19T12:17:10Z

Summary

This PR closes an asymmetry in the benchmark that, as written, makes every read-side scenario compare a clustered Snowflake table to an unclustered Databricks table rather than comparing query engines.

The Snowflake side reads SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000 (see snowflake/queries/adapted_queries/*.sql), which ships pre-clustered:

LINEITEM clustered on l_shipdate
ORDERS clustered on o_orderdate

…with Snowflake's Automatic Clustering maintaining the layout for free.

The Databricks side currently loads raw Parquet via COPY INTO ... FILEFORMAT = PARQUET into plain Delta: no CLUSTER BY, no TBLPROPERTIES, no post-load OPTIMIZE, no Predictive Optimization. databricks/validate_tpch_data.py previously enforced this state, with the comment "should have NONE for fair comparison".

That framing is the issue: "no clustering on Databricks" is not symmetric to Snowflake's pre-clustered sample data. It is symmetric to disabling Automatic Clustering on Snowflake and re-loading the sample dataset unclustered. The DML result published in the writeup is the cleanest example: the post attributes the 59% Snowflake win to "elite query pruning that treated 6B rows like 6M", which is exactly what clustering buys. The DML targets June 1995 and Snowflake's LINEITEM is clustered on l_shipdate, so pruning is essentially free. The Databricks copy had no layout to prune against.
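To make the pruning argument concrete, here is a toy sketch of min/max file skipping. It is not how either engine is implemented; the file size, the synthetic date range, and the function name `files_scanned` are all assumptions chosen for illustration. It only shows why a date filter touches few files when rows are laid out by date and nearly all files when they are not.

```python
import random
from datetime import date, timedelta

ROWS_PER_FILE = 500  # toy file size, not a real Delta or Snowflake setting


def files_scanned(values, lo, hi, rows_per_file=ROWS_PER_FILE):
    """Count files whose [min, max] range overlaps the filter [lo, hi]."""
    scanned = 0
    for i in range(0, len(values), rows_per_file):
        chunk = values[i:i + rows_per_file]
        # A file can be skipped only if its min/max range misses the filter.
        if min(chunk) <= hi and max(chunk) >= lo:
            scanned += 1
    return scanned


# Synthetic ship dates: 10 rows per day for about seven years from 1992-01-01.
start = date(1992, 1, 1)
rows = [start + timedelta(days=d) for d in range(7 * 365) for _ in range(10)]

clustered = sorted(rows)               # layout ordered by ship date
unclustered = rows[:]
random.Random(0).shuffle(unclustered)  # arbitrary load order

lo, hi = date(1995, 6, 1), date(1995, 6, 30)  # the "June 1995" filter
total_files = -(-len(rows) // ROWS_PER_FILE)  # ceiling division
clustered_hits = files_scanned(clustered, lo, hi)
unclustered_hits = files_scanned(unclustered, lo, hi)

print(f"files total:            {total_files}")
print(f"scanned when clustered: {clustered_hits}")
print(f"scanned when shuffled:  {unclustered_hits}")
```

In the shuffled layout almost every file spans the full date range, so min/max statistics cannot rule any of them out.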

Changes

databricks/apply_table_optimization.py (new): Applies CLUSTER BY + OPTIMIZE to the six large tables. Cluster keys match Snowflake's documented keys on LINEITEM (l_shipdate) and ORDERS (o_orderdate); join keys on customer, supplier, part, partsupp. nation/region skipped (25/5 rows).
databricks/load_customer_table.py: Applies clustering when the loader runs so the table doesn't drift back to unclustered on reload.
databricks/validate_tpch_data.py: Flipped. Previously failed validation if any table had clustering; now reports each table's clusteringColumns from DESCRIBE DETAIL and warns when a large table is unclustered. The misleading docstring and comment were updated to explain the rationale.
databricks/README.md: New "Table Clustering (Required for Fair Comparison)" section documenting why and the setup step.
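As a rough illustration of the CLUSTER BY + OPTIMIZE step listed above, the sketch below only builds the SQL statements; it does not connect to Databricks and is not the contents of apply_table_optimization.py. The LINEITEM and ORDERS keys come from this description. The keys for the other four tables, the `CLUSTER_KEYS` name, and the schema name in the usage line are assumptions made for the example.

```python
# Hypothetical key map. l_shipdate and o_orderdate are stated in the PR;
# the remaining keys are ASSUMED TPC-H join keys, not confirmed by the PR.
CLUSTER_KEYS = {
    "lineitem": ["l_shipdate"],
    "orders": ["o_orderdate"],
    "customer": ["c_custkey"],
    "supplier": ["s_suppkey"],
    "part": ["p_partkey"],
    "partsupp": ["ps_partkey", "ps_suppkey"],
}


def clustering_statements(schema, keys=CLUSTER_KEYS):
    """Return ALTER TABLE ... CLUSTER BY and OPTIMIZE statements per table."""
    statements = []
    for table, cols in keys.items():
        qualified = f"{schema}.{table}"
        statements.append(
            f"ALTER TABLE {qualified} CLUSTER BY ({', '.join(cols)})"
        )
        # OPTIMIZE rewrites data files so the new layout takes effect.
        statements.append(f"OPTIMIZE {qualified}")
    return statements


if __name__ == "__main__":
    # "benchmark.tpch_sf1000" is a placeholder schema name.
    for statement in clustering_statements("benchmark.tpch_sf1000"):
        print(statement)
```

Keeping the keys in one explicit map is what makes the layout "deterministic and reviewable": a reviewer can compare the keys against Snowflake's by reading a single dictionary.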

Alternative considered

CLUSTER BY AUTO (liquid clustering with ML-selected keys) + Predictive Optimization would more closely reflect "what a modern Databricks customer gets out-of-the-box on UC managed tables in 2025/2026." I went with explicit keys for two reasons: (1) deterministic and reviewable, (2) the keys on LINEITEM and ORDERS directly mirror Snowflake's, making the comparison apples-to-apples rather than "two different auto-tuners argue with each other." Happy to switch to AUTO if that's preferred.

Test plan

Run uv run databricks/apply_table_optimization.py against the existing benchmark schema
Run uv run databricks/validate_tpch_data.py and confirm all six large tables report clustering keys
Re-run the sequential, concurrent, and DML scenarios and compare to the previously-published numbers
Confirm CTAS numbers are essentially unchanged (clustering shouldn't materially affect bulk writes)
uv run pytest tests/ -v — all 40 existing tests still pass on this branch
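The "confirm all six large tables report clustering keys" check could look roughly like the sketch below. It assumes the clusteringColumns value from DESCRIBE DETAIL has already been fetched into a plain dict; the function name `clustering_report` and the input shape are hypothetical and do not reflect the actual code in validate_tpch_data.py.

```python
LARGE_TABLES = ("lineitem", "orders", "customer", "supplier", "part", "partsupp")


def clustering_report(details):
    """Summarise clustering per large table.

    `details` maps table name -> clusteringColumns list as read from
    DESCRIBE DETAIL (assumed already collected by the caller).
    Returns (report_lines, unclustered_tables).
    """
    lines, unclustered = [], []
    for table in LARGE_TABLES:
        cols = details.get(table) or []
        if cols:
            lines.append(f"{table}: clustered on {', '.join(cols)}")
        else:
            # Warn rather than fail: the table is usable, but read-side
            # comparisons against it are not like-for-like.
            lines.append(f"{table}: NOT clustered")
            unclustered.append(table)
    return lines, unclustered


if __name__ == "__main__":
    sample = {"lineitem": ["l_shipdate"], "orders": ["o_orderdate"]}
    report, missing = clustering_report(sample)
    print("\n".join(report))
    if missing:
        print("WARNING: unclustered large tables:", ", ".join(missing))
```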

🤖 Generated with Claude Code

…d sample data

The Snowflake side queries SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000, which ships pre-clustered (LINEITEM on l_shipdate, ORDERS on o_orderdate) with Automatic Clustering maintaining the layout for free. Leaving the Databricks copy as plain Delta with no clustering means every read-side scenario was comparing a clustered table to an unclustered table, not two query engines.

- Add databricks/apply_table_optimization.py: applies CLUSTER BY + OPTIMIZE to the six large tables. Cluster keys match Snowflake's documented keys on LINEITEM and ORDERS; join keys on the rest.
- Update databricks/load_customer_table.py to apply clustering on load.
- Flip databricks/validate_tpch_data.py: it previously failed validation if any table had clustering ("should have NONE for fair comparison"). It now reports clusteringColumns from DESCRIBE DETAIL and warns when a large table is unclustered.
- Document the requirement and the new setup step in databricks/README.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gedejong mentioned this pull request May 19, 2026

Methodology: Databricks tables are unclustered while Snowflake sample data is pre-clustered #29

Open

gedejong wants to merge 1 commit into get-select:main from gedejong:fix/databricks-clustering-asymmetry