
Cluster Databricks tables to match Snowflake's pre-clustered sample data#28

Open
gedejong wants to merge 1 commit into
get-select:mainfrom
gedejong:fix/databricks-clustering-asymmetry


@gedejong

Summary

This PR closes an asymmetry in the benchmark that, as written, makes every read-side scenario compare a clustered Snowflake table to an unclustered Databricks table rather than comparing query engines.

The Snowflake side reads SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000 (see snowflake/queries/adapted_queries/*.sql), which ships pre-clustered:

  • LINEITEM clustered on l_shipdate
  • ORDERS clustered on o_orderdate

…with Snowflake's Automatic Clustering maintaining the layout for free.

The Databricks side currently loads raw Parquet via COPY INTO ... FILEFORMAT = PARQUET into plain Delta tables, with no CLUSTER BY, no TBLPROPERTIES, no post-load OPTIMIZE, and no Predictive Optimization. databricks/validate_tpch_data.py previously enforced this, with the comment "should have NONE for fair comparison".

That framing is the issue: "no clustering on Databricks" is not symmetric to Snowflake's pre-clustered sample data. It is symmetric to disabling Automatic Clustering on Snowflake and re-loading the sample dataset unclustered. The DML result published in the writeup is the cleanest example: the post attributes the 59% Snowflake win to "elite query pruning that treated 6B rows like 6M", which is exactly what clustering buys. The DML targets June 1995 and Snowflake's LINEITEM is clustered on l_shipdate, so pruning is essentially free. The Databricks copy had no layout to prune against.
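
To make the pruning argument concrete, here is a toy model of min/max file pruning. It is only a sketch under stated assumptions: the row counts, file sizes, and date range are synthetic, and it does not reproduce Snowflake's micro-partition pruning or Delta's data skipping. It only shows why a date predicate touches few files when rows are laid out by date and nearly all files when they are not.

```python
import random
from datetime import date, timedelta


def build_files(ship_dates, rows_per_file):
    """Split rows into files in arrival order; keep each file's (min, max) date."""
    files = []
    for i in range(0, len(ship_dates), rows_per_file):
        chunk = ship_dates[i:i + rows_per_file]
        files.append((min(chunk), max(chunk)))
    return files


def files_scanned(files, lo, hi):
    """Count files whose [min, max] range overlaps the predicate range [lo, hi]."""
    return sum(1 for fmin, fmax in files if fmin <= hi and fmax >= lo)


# Synthetic data (assumption): 100,000 rows with ship dates spread
# uniformly over 2,500 days starting 1992-01-01.
rng = random.Random(0)
start = date(1992, 1, 1)
rows = [start + timedelta(days=rng.randrange(2500)) for _ in range(100_000)]

unclustered = build_files(rows, 1_000)         # rows in arbitrary load order
clustered = build_files(sorted(rows), 1_000)   # rows laid out by ship date

june_lo, june_hi = date(1995, 6, 1), date(1995, 6, 30)
unclustered_hits = files_scanned(unclustered, june_lo, june_hi)
clustered_hits = files_scanned(clustered, june_lo, june_hi)

print(f"unclustered: {unclustered_hits} of {len(unclustered)} files scanned")
print(f"clustered:   {clustered_hits} of {len(clustered)} files scanned")
```

In this model every unclustered file spans nearly the whole date range, so the June 1995 predicate cannot exclude any of them, while the clustered layout confines the matching rows to a handful of files.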

Changes

  • databricks/apply_table_optimization.py (new): Applies CLUSTER BY + OPTIMIZE to the six large tables. Cluster keys match Snowflake's documented keys on LINEITEM (l_shipdate) and ORDERS (o_orderdate); join keys are used on customer, supplier, part, and partsupp. nation and region are skipped (25 and 5 rows).
  • databricks/load_customer_table.py: Applies clustering when the loader runs so the table doesn't drift back to unclustered on reload.
  • databricks/validate_tpch_data.py: Flipped. Previously failed validation if any table had clustering; now reports each table's clusteringColumns from DESCRIBE DETAIL and warns when a large table is unclustered. The misleading docstring and comment now explain the rationale.
  • databricks/README.md: New "Table Clustering (Required for Fair Comparison)" section documenting why and the setup step.
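
For reviewers who want a feel for the optimization step without opening the diff, the following is a minimal sketch of how the statements could be generated. It is not the contents of apply_table_optimization.py. Only the l_shipdate and o_orderdate keys come from this description; the join-key columns for the other four tables, the `clustering_statements` helper, and the schema name in the usage line are assumptions for illustration.

```python
# Table -> cluster key columns. Only lineitem and orders are stated in the PR;
# the join-key columns for the remaining tables are assumed for illustration.
CLUSTER_KEYS = {
    "lineitem": ["l_shipdate"],
    "orders": ["o_orderdate"],
    "customer": ["c_custkey"],
    "supplier": ["s_suppkey"],
    "part": ["p_partkey"],
    "partsupp": ["ps_partkey", "ps_suppkey"],
}


def clustering_statements(schema, cluster_keys=CLUSTER_KEYS):
    """Return the SQL statements that set cluster keys and then rewrite the layout."""
    statements = []
    for table, columns in cluster_keys.items():
        qualified = f"{schema}.{table}"
        # Declare the liquid clustering keys on the existing Delta table.
        statements.append(
            f"ALTER TABLE {qualified} CLUSTER BY ({', '.join(columns)})"
        )
        # Trigger clustering of the data that is already loaded.
        statements.append(f"OPTIMIZE {qualified}")
    return statements


if __name__ == "__main__":
    # "benchmark.tpch_sf1000" is a hypothetical schema name.
    for statement in clustering_statements("benchmark.tpch_sf1000"):
        print(statement)
```

Generating the statements separately from executing them keeps the keys reviewable in one place. Whether a plain OPTIMIZE reclusters every existing file or a full rewrite option is needed may depend on the Databricks runtime version, so treat that step as something to confirm against the documentation.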

Alternative considered

CLUSTER BY AUTO (liquid clustering with ML-selected keys) + Predictive Optimization would more closely reflect "what a modern Databricks customer gets out-of-the-box on UC managed tables in 2025/2026." I went with explicit keys for two reasons: (1) deterministic and reviewable, (2) the keys on LINEITEM and ORDERS directly mirror Snowflake's, making the comparison apples-to-apples rather than "two different auto-tuners argue with each other." Happy to switch to AUTO if that's preferred.

Test plan

  • Run uv run databricks/apply_table_optimization.py against the existing benchmark schema
  • Run uv run databricks/validate_tpch_data.py and confirm all six large tables report clustering keys
  • Re-run the sequential, concurrent, and DML scenarios and compare to the previously-published numbers
  • Confirm CTAS numbers are essentially unchanged (clustering shouldn't materially affect bulk writes)
  • uv run pytest tests/ -v — all 40 existing tests still pass on this branch
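
The validation check in the second step can be pictured roughly as follows. This is a sketch, not the code in validate_tpch_data.py: it assumes the DESCRIBE DETAIL result for each table has already been fetched into a dict, and the `clustering_report` helper and the warning wording are hypothetical. The clusteringColumns field name is the one referenced above.

```python
LARGE_TABLES = ("lineitem", "orders", "customer", "supplier", "part", "partsupp")


def clustering_report(details, large_tables=LARGE_TABLES):
    """Summarize clustering per table and collect warnings for unclustered ones.

    `details` maps a table name to its DESCRIBE DETAIL row as a dict
    (assumed shape; fetching the rows is out of scope for this sketch).
    """
    report = {}
    warnings = []
    for table in large_tables:
        row = details.get(table, {})
        # A missing or null clusteringColumns value is treated as unclustered.
        columns = list(row.get("clusteringColumns") or [])
        report[table] = columns
        if not columns:
            warnings.append(f"{table}: large table has no clustering columns")
    return report, warnings


# Example input with one unclustered table and one table missing entirely.
sample = {
    "lineitem": {"clusteringColumns": ["l_shipdate"]},
    "orders": {"clusteringColumns": ["o_orderdate"]},
    "customer": {"clusteringColumns": []},
    "supplier": {"clusteringColumns": ["s_suppkey"]},
    "part": {"clusteringColumns": ["p_partkey"]},
}
report, warnings = clustering_report(sample)
for line in warnings:
    print("WARNING:", line)
```

Warning rather than failing matches the behavior described for the flipped validator: an unclustered large table is reported, and the decision to block on it is left to whoever runs the benchmark.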

🤖 Generated with Claude Code

…d sample data

The Snowflake side queries SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000, which ships
pre-clustered (LINEITEM on l_shipdate, ORDERS on o_orderdate) with
Automatic Clustering maintaining the layout for free. Leaving the
Databricks copy as plain Delta with no clustering means every read-side
scenario was comparing a clustered table to an unclustered table, not
two query engines.

- Add databricks/apply_table_optimization.py: applies CLUSTER BY +
  OPTIMIZE to the six large tables. Cluster keys match Snowflake's
  documented keys on LINEITEM and ORDERS; join keys on the rest.
- Update databricks/load_customer_table.py to apply clustering on load.
- Flip databricks/validate_tpch_data.py: it previously failed validation
  if any table had clustering ("should have NONE for fair comparison").
  It now reports clusteringColumns from DESCRIBE DETAIL and warns when a
  large table is unclustered.
- Document the requirement and the new setup step in databricks/README.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>