Cluster Databricks tables to match Snowflake's pre-clustered sample data #28
Open
gedejong wants to merge 1 commit into
…d sample data
The Snowflake side queries SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000, which ships
pre-clustered (LINEITEM on l_shipdate, ORDERS on o_orderdate) with
Automatic Clustering maintaining the layout for free. Leaving the
Databricks copy as plain Delta with no clustering means every read-side
scenario was comparing a clustered table to an unclustered table, not
two query engines.
- Add databricks/apply_table_optimization.py: applies CLUSTER BY +
OPTIMIZE to the six large tables. Cluster keys match Snowflake's
documented keys on LINEITEM and ORDERS; join keys on the rest.
- Update databricks/load_customer_table.py to apply clustering on load.
- Flip databricks/validate_tpch_data.py: it previously failed validation
if any table had clustering ("should have NONE for fair comparison").
It now reports clusteringColumns from DESCRIBE DETAIL and warns when a
large table is unclustered.
- Document the requirement and the new setup step in databricks/README.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
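The clustering step described in this commit message can be sketched roughly as below. This is a minimal illustration, not the contents of `databricks/apply_table_optimization.py`: the function name `optimization_statements` is hypothetical, the `benchmark` schema name is taken from the test plan, and the join-key columns for `customer`, `supplier`, `part`, and `partsupp` are assumptions, since the PR does not name them. Only the `LINEITEM` and `ORDERS` keys come from the PR itself.

```python
# Hypothetical sketch: build the SQL a clustering script might issue.
# It only generates statement strings; running them against a workspace
# is left to whatever SQL client the repository already uses.

CLUSTER_KEYS = {
    # Keys stated in the PR (mirroring Snowflake's sample data).
    "lineitem": ["l_shipdate"],
    "orders": ["o_orderdate"],
    # ASSUMPTION: the PR only says "join keys"; these columns are a guess.
    "customer": ["c_custkey"],
    "supplier": ["s_suppkey"],
    "part": ["p_partkey"],
    "partsupp": ["ps_partkey", "ps_suppkey"],
}


def optimization_statements(schema, cluster_keys=CLUSTER_KEYS):
    """Return ALTER TABLE ... CLUSTER BY and OPTIMIZE statements per table."""
    statements = []
    for table, columns in cluster_keys.items():
        qualified = f"{schema}.{table}"
        column_list = ", ".join(columns)
        # Declare the clustering keys on the existing Delta table.
        statements.append(f"ALTER TABLE {qualified} CLUSTER BY ({column_list})")
        # Ask Databricks to rewrite data according to the declared layout.
        statements.append(f"OPTIMIZE {qualified}")
    return statements


if __name__ == "__main__":
    for statement in optimization_statements("benchmark"):
        print(statement)
```

`nation` and `region` are left out of the mapping, matching the PR's decision to skip them. Whether a plain `OPTIMIZE` reclusters all previously loaded data may depend on the Databricks runtime in use; that is not verified here.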
Summary
This PR closes an asymmetry in the benchmark that, as written, makes every read-side scenario compare a clustered Snowflake table to an unclustered Databricks table rather than comparing query engines.
The Snowflake side reads `SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000` (see `snowflake/queries/adapted_queries/*.sql`), which ships pre-clustered:
- `LINEITEM` clustered on `l_shipdate`
- `ORDERS` clustered on `o_orderdate`
…with Snowflake's Automatic Clustering maintaining the layout for free.
The Databricks side currently loads raw Parquet via `COPY INTO ... FILEFORMAT = PARQUET` into plain Delta: no `CLUSTER BY`, no `TBLPROPERTIES`, no post-load `OPTIMIZE`, no Predictive Optimization. And `databricks/validate_tpch_data.py` previously enforced this with the comment "should have NONE for fair comparison".
That framing is the issue: "no clustering on Databricks" is not symmetric to Snowflake's pre-clustered sample data; it is symmetric to disabling Automatic Clustering on Snowflake and re-loading the sample dataset unclustered. The DML result published in the writeup is the cleanest example: the post attributes the 59% Snowflake win to "elite query pruning that treated 6B rows like 6M", which is exactly what clustering buys. The DML targets June 1995 and Snowflake's `LINEITEM` is clustered on `l_shipdate`, so pruning is essentially free. The Databricks copy had no layout to prune against.
Changes
- `databricks/apply_table_optimization.py` (new): Applies `CLUSTER BY` + `OPTIMIZE` to the six large tables. Cluster keys match Snowflake's documented keys on `LINEITEM` (`l_shipdate`) and `ORDERS` (`o_orderdate`); join keys on `customer`, `supplier`, `part`, `partsupp`. `nation`/`region` skipped (25 and 5 rows respectively).
- `databricks/load_customer_table.py`: Applies clustering when the loader runs so the table doesn't drift back to unclustered on reload.
- `databricks/validate_tpch_data.py`: Flipped. Previously failed validation if any table had clustering; now reports each table's `clusteringColumns` from `DESCRIBE DETAIL` and warns when a large table is unclustered. The misleading docstring and comment were updated to explain the rationale.
- `databricks/README.md`: New "Table Clustering (Required for Fair Comparison)" section documenting why and the setup step.
Alternative considered
`CLUSTER BY AUTO` (liquid clustering with ML-selected keys) + Predictive Optimization would more closely reflect "what a modern Databricks customer gets out-of-the-box on UC managed tables in 2025/2026." I went with explicit keys for two reasons: (1) deterministic and reviewable, (2) the keys on `LINEITEM` and `ORDERS` directly mirror Snowflake's, making the comparison apples-to-apples rather than "two different auto-tuners argue with each other." Happy to switch to `AUTO` if that's preferred.
Test plan
- `uv run databricks/apply_table_optimization.py` against the existing `benchmark` schema
- `uv run databricks/validate_tpch_data.py` and confirm all six large tables report clustering keys
- `uv run pytest tests/ -v`: all 40 existing tests still pass on this branch
🤖 Generated with Claude Code
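The validation behaviour described above (report each table's `clusteringColumns`, warn when a large table has none) could be checked with something like the following. This is a hedged sketch, not the code in `databricks/validate_tpch_data.py`: `clustering_report` and `LARGE_TABLES` are hypothetical names, and the input dict stands in for values that would be read from `DESCRIBE DETAIL` for each table.

```python
# Hypothetical sketch of the "report and warn" check.
# `details` maps a table name to the clusteringColumns list that
# DESCRIBE DETAIL returned for it (an empty list means unclustered).

LARGE_TABLES = ("lineitem", "orders", "customer", "supplier", "part", "partsupp")


def clustering_report(details, large_tables=LARGE_TABLES):
    """Return (report_lines, warnings) for the given clustering details."""
    report_lines = []
    warnings = []
    for table in sorted(details):
        columns = details[table] or []
        report_lines.append(f"{table}: clusteringColumns={columns}")
        # Only large tables are expected to carry clustering keys;
        # small tables such as nation/region are reported but never flagged.
        if table in large_tables and not columns:
            warnings.append(f"WARNING: {table} is unclustered")
    return report_lines, warnings


if __name__ == "__main__":
    sample = {"lineitem": ["l_shipdate"], "orders": [], "nation": []}
    lines, warns = clustering_report(sample)
    print("\n".join(lines + warns))
```

Keeping the check as a pure function over already-fetched metadata makes it testable without a Databricks connection; how the real script structures this is not shown in the PR.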