[SPARK-57562][SQL] Add benchmarks for the TIME data type by shrirangmhalgi · Pull Request #56809 · apache/spark

shrirangmhalgi · 2026-06-26T11:56:02Z

What changes were proposed in this pull request?

Add TIME data type benchmarks:

New TimeBenchmark covering current_time, make_time, to_time, hour/minute/second extraction, time_trunc, time_diff, TIME +/- interval, and LocalTime conversion/collection
ExtractBenchmark: add TIME cases (HOUR, MINUTE, SECOND)
JsonBenchmark: add TIME write/read cases
CSVBenchmark: add TIME write/read cases
HashBenchmark: add TIME column to normal schema

Add TimeBenchmark covering TIME data type functions: current_time, make_time, to_time, hour / minute / second extraction, time_trunc, time_diff, TIME + / - interval, and LocalTime conversion / collection.

Why are the changes needed?

The SPIP (SPARK-51162, Q6) flagged performance regressions as the main risk. There was no TIME-specific benchmark until now.

Does this PR introduce any user-facing change?

No. Benchmark only.

How was this patch tested?

All five benchmarks run successfully. Results generated via GitHub Actions Benchmarks workflow on Linux CI (AMD EPYC 7763 / OpenJDK 17.0.19).

Was this patch authored or co-authored using generative AI tooling?

Co-Authored using Claude Opus 4.6

MaxGekk

1 blocking, 2 non-blocking, 0 nits.
Clean, well-modeled benchmark mirroring DateTimeBenchmark. One blocking process issue: the committed results must come from the standard CI hardware, not a local Mac.

Design / architecture (1)

*-results.txt: results were regenerated on Apple M3 Pro / macOS instead of the standard CI hardware (Linux / Intel Xeon Platinum 8370C, used by the existing committed results). This rewrites every number in the three modified files, not just the new TIME cases (ExtractBenchmark +104/-86, CSVBenchmark +56/-51, JsonBenchmark +82/-77), so the numbers are no longer comparable against the rest of the repo and the unrelated churn obscures the PR's actual contribution. Please regenerate all four results files via the "Benchmarks" GitHub Actions workflow and drop the local Mac-generated output.

Correctness (1)

CSVBenchmark.scala:197 (same pattern at JsonBenchmark.scala:414): times generator (t % 86400000000L) * 1000000L — see inline.

Suggestions (1)

ExtractBenchmark.scala:72: the case other message "Valid column types are 'timestamp' and 'date'" is now stale — it omits the newly-added 'time' (and the pre-existing 'interval'). Only reachable on a programming error, so optional, but worth updating while you're here.

MaxGekk · 2026-06-27T06:42:25Z

+
+      def times = {
+        spark.range(0, rowsNum, 1, 1).mapPartitions { iter =>
+          iter.map(t => LocalTime.ofNanoOfDay((t % 86400000000L) * 1000000L))


The (t % 86400000000L) * 1000000L formula is inconsistent and latently unsafe:

The modulus is micros/day (86_400_000_000) but the scale factor is *1e6, so the two don't compose into nanos-of-day.

Since t < rowsNum (1e7) is far below the modulus, the % 86400000000L is a dead no-op, and the generated times only span the first ~2.8h of the day rather than the full day.

It would overflow LocalTime.ofNanoOfDay (max 86399999999999) if rowsNum is ever raised above ~8.64e7.

It doesn't fail at the current rowsNum, but consider aligning with TimeBenchmark's full-day wrap (nanos % 86400000000000L):

Suggested change

iter.map(t => LocalTime.ofNanoOfDay((t % 86400000000L) * 1000000L))

iter.map(t => LocalTime.ofNanoOfDay(t % 86400000000000L))

MaxGekk · 2026-06-27T06:42:25Z

+
+      def times = {
+        spark.range(0, rowsNum, 1, 1).mapPartitions { iter =>
+          iter.map(t => LocalTime.ofNanoOfDay((t % 86400000000L) * 1000000L))


Same generator issue as CSVBenchmark.scala:197: (t % 86400000000L) * 1000000L has an inconsistent modulus/scale (micros/day mod, *1e6 scale), the % 86400000000L is a dead no-op at this rowsNum, the times don't span a full day, and it would overflow ofNanoOfDay if rowsNum exceeded ~8.64e7. Suggest the same full-day wrap as TimeBenchmark:

Suggested change

iter.map(t => LocalTime.ofNanoOfDay((t % 86400000000L) * 1000000L))

iter.map(t => LocalTime.ofNanoOfDay(t % 86400000000000L))

### What changes were proposed in this pull request? Add TimeBenchmark covering TIME data type functions: current_time, make_time, to_time, hour/minute/second extraction, time_trunc, time_diff, TIME +/- interval, and LocalTime conversion/collection. ### Why are the changes needed? The SPIP (SPARK-51162, Q6) flagged performance regressions as the main risk. There was no TIME-specific benchmark until now. ### Does this PR introduce _any_ user-facing change? No. Benchmark only. ### How was this patch tested? Benchmark runs successfully and results file is generated. ### Was this patch authored or co-authored using generative AI tooling? Co-Authored using Claude Opus 4.6

shrirangmhalgi · 2026-06-27T09:19:14Z

Thanks @MaxGekk for the review! All addressed:

Fixed time generator to t % 86400000000000L (full nanos/day wrap) in CSV and JSON
Updated ExtractBenchmark error message to include all valid types
Reverted local results; regenerating via GitHub Actions Benchmarks workflow on Linux CI

MaxGekk

2 addressed, 1 remaining, 0 new.
Thanks for the quick turnaround — the generator fix, the ExtractBenchmark message, and the reverted results churn all look good, and the new TIME feature-gating (spark.sql.timeType.enabled) is correct.

Design / architecture — remaining (1)

TimeBenchmark-results.txt is still generated on Apple M3 Pro / macOS. The three modified results files are reverted and HashBenchmark-results.txt is correctly on Linux CI now, but TimeBenchmark-results.txt still carries Mac numbers — and the reverted CSV/Json/Extract results no longer include the new TIME cases their .scala files add. Please finish the GitHub Actions "Benchmarks" regeneration so all of them come from the standard Linux hardware rather than a local Mac.

…e error message - Fix LocalTime generator: use t % 86400000000000L (nanos/day) instead of inconsistent (t % 86400000000L) * 1000000L in CSV and JSON benchmarks - Revert ExtractBenchmark/CSV/JSON/Hash results to master (drop Mac output; will regenerate via GitHub Actions Benchmarks workflow) - Update ExtractBenchmark error message to list all valid types

…nchmark (JDK 17, Scala 2.13, split 1 of 1)

…tBenchmark (JDK 17, Scala 2.13, split 1 of 1)

….JsonBenchmark (JDK 17, Scala 2.13, split 1 of 1)

…CSVBenchmark (JDK 17, Scala 2.13, split 1 of 1)

shrirangmhalgi · 2026-06-27T14:11:43Z

@MaxGekk. I regenerated all four results files (TimeBenchmark, ExtractBenchmark, JsonBenchmark, CSVBenchmark) via the GitHub Actions Benchmarks workflow on Linux CI. PTAL.

MaxGekk

1 addressed, 0 remaining, 1 new.
The round-2 provenance blocker is resolved — all four sql/core results files are now regenerated on the standard Linux CI runner (17.0.19 / Linux 6.17 azure) and include the TIME cases. Thanks! One new inconsistency this round:

Design / architecture — new (1)

HashBenchmark.scala:128: the .scala adds a time column but HashBenchmark-results.txt was reverted to master (no time column, old hardware) — see inline.

MaxGekk · 2026-06-27T18:30:57Z

      .add("binary", BinaryType)
      .add("date", DateType)
      .add("timestamp", TimestampType)
+      .add("time", TimeType())


This adds a time column to the normal hash schema, but HashBenchmark-results.txt was reverted to upstream/master in this commit — it still has the old hardware header (17.0.18 / Linux 6.14) and no time column. Round 2 (727e46b) had this file correctly regenerated on CI, so the benchmark runs fine; this commit just dropped the regenerated results while keeping the schema change.

Your note mentions regenerating "all four" results files (the sql/core ones); HashBenchmark lives in sql/catalyst and is a fifth modified benchmark whose results were missed. As committed, the benchmark code and its results file are out of sync. Please regenerate HashBenchmark-results.txt on the CI runner too (so it reflects the new time column), or drop this HashBenchmark schema change if it's out of scope for the PR.

shrirangmhalgi · 2026-06-28T14:08:32Z

Thank you @MaxGekk. Good catch - the revert dropped the earlier CI-generated HashBenchmark-results.txt along with the other reverted files.

Retriggered HashBenchmark via CI.
Updated the PR description to mention HashBenchmark

…la 2.13, split 1 of 1)

shrirangmhalgi changed the title ~~Add benchmarks for the TIME data type~~ [SPARK-57562][SQL] Add benchmarks for the TIME data type Jun 26, 2026

shrirangmhalgi marked this pull request as ready for review June 27, 2026 04:47

MaxGekk reviewed Jun 27, 2026

View reviewed changes

shrirangmhalgi added 2 commits June 27, 2026 01:45

Add TIME cases to ExtractBenchmark, JsonBenchmark, and CSVBenchmark

b5a2054

shrirangmhalgi force-pushed the SPARK-57562-time-benchmarks branch from c620316 to ad5c3c5 Compare June 27, 2026 08:46

shrirangmhalgi force-pushed the SPARK-57562-time-benchmarks branch from ad5c3c5 to b6ae6d2 Compare June 27, 2026 09:39

MaxGekk requested changes Jun 27, 2026

View reviewed changes

shrirangmhalgi force-pushed the SPARK-57562-time-benchmarks branch from 727e46b to 098bfd2 Compare June 27, 2026 11:53

shrirangmhalgi added 4 commits June 27, 2026 13:06

Benchmark results for org.apache.spark.sql.execution.benchmark.TimeBe…

da3ac84

…nchmark (JDK 17, Scala 2.13, split 1 of 1)

Benchmark results for org.apache.spark.sql.execution.benchmark.Extrac…

f0100c4

…tBenchmark (JDK 17, Scala 2.13, split 1 of 1)

Benchmark results for org.apache.spark.sql.execution.datasources.json…

15a0289

….JsonBenchmark (JDK 17, Scala 2.13, split 1 of 1)

Benchmark results for org.apache.spark.sql.execution.datasources.csv.…

96896fd

…CSVBenchmark (JDK 17, Scala 2.13, split 1 of 1)

MaxGekk reviewed Jun 27, 2026

View reviewed changes

Benchmark results for org.apache.spark.sql.HashBenchmark (JDK 17, Sca…

be71268

…la 2.13, split 1 of 1)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-57562][SQL] Add benchmarks for the TIME data type#56809

[SPARK-57562][SQL] Add benchmarks for the TIME data type#56809
shrirangmhalgi wants to merge 8 commits into
apache:masterfrom
shrirangmhalgi:SPARK-57562-time-benchmarks

shrirangmhalgi commented Jun 26, 2026 •

edited

Loading

Uh oh!

MaxGekk left a comment

Uh oh!

MaxGekk Jun 27, 2026

Uh oh!

MaxGekk Jun 27, 2026

Uh oh!

shrirangmhalgi commented Jun 27, 2026

Uh oh!

MaxGekk left a comment

Uh oh!

shrirangmhalgi commented Jun 27, 2026

Uh oh!

MaxGekk left a comment

Uh oh!

MaxGekk Jun 27, 2026

Uh oh!

shrirangmhalgi commented Jun 28, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	iter.map(t => LocalTime.ofNanoOfDay((t % 86400000000L) * 1000000L))
	iter.map(t => LocalTime.ofNanoOfDay(t % 86400000000000L))

Uh oh!

Conversation

shrirangmhalgi commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

MaxGekk left a comment

Choose a reason for hiding this comment

Design / architecture (1)

Correctness (1)

Suggestions (1)

Uh oh!

MaxGekk Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

MaxGekk Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

shrirangmhalgi commented Jun 27, 2026

Uh oh!

MaxGekk left a comment

Choose a reason for hiding this comment

Design / architecture — remaining (1)

Uh oh!

shrirangmhalgi commented Jun 27, 2026

Uh oh!

MaxGekk left a comment

Choose a reason for hiding this comment

Design / architecture — new (1)

Uh oh!

MaxGekk Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

shrirangmhalgi commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

shrirangmhalgi commented Jun 26, 2026 •

edited

Loading

shrirangmhalgi commented Jun 28, 2026 •

edited

Loading