[SPARK-57415][SQL] Parquet vectorized reader performance improvements (umbrella)

JIRA: https://issues.apache.org/jira/browse/SPARK-57415

## Overview

This is an umbrella issue tracking a series of performance improvements to the Parquet vectorized reader in Spark SQL. The changes target allocation reduction, bulk-read optimizations, and JIT-friendly code patterns across multiple encoding paths.

All PRs are independent and can be reviewed/merged in any order. Together they yield significant throughput gains (1.2x to 9x depending on the encoding and data shape) for Parquet reads with no user-facing behavioral changes.

## Summary

| # | JIRA | PR | Status | Focus | Key Speedup |
|---|---|---|---|---|---|
| [1](#user-content-pr-1) | [SPARK-56892](https://issues.apache.org/jira/browse/SPARK-56892) | [#55919](https://github.com/apache/spark/pull/55919) | Merged | DELTA_BINARY_PACKED bulk reads + widening | up to 8.2x |
| [2](#user-content-pr-2) | [SPARK-56893](https://issues.apache.org/jira/browse/SPARK-56893) | [#55920](https://github.com/apache/spark/pull/55920) | Merged | Dictionary decode hasNull fast path | 1.22-1.62x |
| [3](#user-content-pr-3) | [SPARK-56894](https://issues.apache.org/jira/browse/SPARK-56894) | [#55921](https://github.com/apache/spark/pull/55921) | Merged | BYTE_STREAM_SPLIT vectorized reader + widening + FLBA | 1.5-4.5x |
| [4](#user-content-pr-4) | [SPARK-56895](https://issues.apache.org/jira/browse/SPARK-56895) | [#55922](https://github.com/apache/spark/pull/55922) | Merged | RLE PACKED batch ByteBuffer slice | 1.4-2.4x |
| [5](#user-content-pr-5) | [SPARK-56896](https://issues.apache.org/jira/browse/SPARK-56896) | [#55923](https://github.com/apache/spark/pull/55923) | Merged | Timestamp/date updater bulk reads | 2.1-3.3x |
| [6](#user-content-pr-6) | [SPARK-56897](https://issues.apache.org/jira/browse/SPARK-56897) | [#55924](https://github.com/apache/spark/pull/55924) | Open | DELTA_BYTE_ARRAY allocation reduction | 1.78-2.24x |
| [7](#user-content-pr-7) | [SPARK-56907](https://issues.apache.org/jira/browse/SPARK-56907) | [#55932](https://github.com/apache/spark/pull/55932) | Open | DELTA_LENGTH_BYTE_ARRAY allocation reduction | 1.14-2.00x |
| [8](#user-content-pr-8) | [SPARK-57420](https://issues.apache.org/jira/browse/SPARK-57420) | [#56479](https://github.com/apache/spark/pull/56479) | Merged | Benchmark workflow: skip TPC-DS gen + early CPU check | ~5-10 min saved per run |

## Pull Requests

<a name="pr-1"></a>

### 1. DELTA_BINARY_PACKED bulk read optimization

**PR:** [#55919](https://github.com/apache/spark/pull/55919) ([SPARK-56892](https://issues.apache.org/jira/browse/SPARK-56892))

Replaces per-element lambda dispatch in `readIntegers`/`readLongs` with bulk paths that compute prefix sums in-place and write via `putInts`/`putLongs`. Also eliminates 3 allocations per value in `readUnsignedLongs` by replacing `BigInteger(Long.toUnsignedString(v))` with a zero-allocation `byte[]` loop encoder. Adds `readIntegersAsLongs` and `readIntegersAsDoubles` widening overrides that skip the int narrowing step entirely.

Benchmarks on AMD EPYC 7763 (JDK 17/21/25):

| Type | Speedup |
|---|---|
| INT32 reads | 1.1-1.6x |
| INT32 skip | 1.3-1.8x |
| INT64 reads | 1.8-3.7x |
| INT64 skip | 2.3-4.0x |
| readIntegersAsLongs (INT32 -> Long) | 2.4-2.7x |
| readIntegersAsDoubles (INT32 -> Double) | 2.1-2.4x |
| readUnsignedLongs | 7.3-8.2x |

---

<a name="pr-2"></a>

### 2. Dictionary decoding hasNull fast path + per-class updater overrides

**PR:** [#55920](https://github.com/apache/spark/pull/55920) ([SPARK-56893](https://issues.apache.org/jira/browse/SPARK-56893))

Adds a `hasNull()` fast path that skips per-element null checks when the column has no nulls (common case). Per-class `decodeDictionaryIds` overrides give C2 monomorphic call sites, enabling full inlining of type-specific decode expressions.

Benchmarks on AMD EPYC 9V74 (baseline vs optimized on same CPU):

| Scenario | JDK 17 | JDK 21 | JDK 25 |
|---|---|---|---|
| No nulls | 1.21-1.22x | **1.56-1.62x** | 1.24-1.25x |
| 10% nulls | ~1.0x | 1.24-1.29x | ~1.0x |
| 50% nulls | ~1.0x | 1.25-1.26x | ~1.0x |

JDK 21 benefits dramatically across all null fractions due to monomorphic devirtualization. JDK 17/25 benefit primarily in the no-nulls fast path.

---

<a name="pr-3"></a>

### 3. Vectorized BYTE_STREAM_SPLIT reader

**PR:** [#55921](https://github.com/apache/spark/pull/55921) ([SPARK-56894](https://issues.apache.org/jira/browse/SPARK-56894))

Adds a new `VectorizedByteStreamSplitValuesReader` that decodes BSS-encoded pages (FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY) using batch byte-gathering instead of falling back to parquet-mr per-value reads. Includes widening overrides, FLBA batch allocation reduction, and FixedLenByteArrayUpdater routing through the batch path.

| Type | JDK 17 | JDK 21 | JDK 25 |
|---|---|---|---|
| INT32 | 4.5x | 3.8x | 4.2x |
| INT64 | 2.8x | 2.0x | 1.8x |
| FLOAT | 4.3x | 3.6x | 4.3x |
| DOUBLE | 2.7x | 2.0x | 1.8x |
| readIntegersAsLongs | 3.8x | 3.0x | 3.5x |
| readFloatsAsDoubles | 3.8x | 3.2x | 3.9x |
| FLBA(12) readBinary | 1.7x | 1.6x | 1.5x |

---

<a name="pr-4"></a>

### 4. Batch ByteBuffer slice in RLE PACKED decode

**PR:** [#55922](https://github.com/apache/spark/pull/55922) ([SPARK-56895](https://issues.apache.org/jira/browse/SPARK-56895))

Replaces per-group `in.slice(bitWidth)` (one `ByteBuffer` allocation per 8 values) with a single bulk slice for the entire PACKED run. Eliminates ~128K short-lived ByteBuffer allocations per 1M-value page.

| bitWidth | Speedup (readIntegers) | Speedup (skipIntegers) |
|---|---|---|
| 4 | 2.0x | 2.1x |
| 8 | 2.0x | 2.4x |
| 12 | 1.6x | 1.6x |
| 20 | 1.4x | 1.4x |

---

<a name="pr-5"></a>

### 5. Bulk read paths for timestamp/date Parquet vector updaters

**PR:** [#55923](https://github.com/apache/spark/pull/55923) ([SPARK-56896](https://issues.apache.org/jira/browse/SPARK-56896))

Replaces per-element `readValue` loops with two-pass bulk read + in-place conversion for four updaters (`LongAsMicrosUpdater`, `LongAsNanosUpdater`, `LongAsMicrosRebaseUpdater`, `DateToTimestampNTZWithRebaseUpdater`). Avoids per-element virtual dispatch through `VectorizedValuesReader`. Note: `DateToTimestampNTZUpdater` (CORRECTED mode) was already optimized via [SPARK-56804](https://issues.apache.org/jira/browse/SPARK-56804).

| Updater | Speedup (JDK 17/21/25) |
|---|---|
| LongAsMicrosUpdater | 2.1x / 2.9x / 3.3x |
| LongAsNanosUpdater | (new benchmark; ~2.6x in local runs) |
| LongAsMicrosRebaseUpdater | (new benchmark; ~2.1x in local runs) |
| DateToTimestampNTZWithRebaseUpdater | (new benchmark; ~2.0x in local runs) |

---

<a name="pr-6"></a>

### 6. Reduce per-value allocations in DELTA_BYTE_ARRAY decoder

**PR:** [#55924](https://github.com/apache/spark/pull/55924) ([SPARK-56897](https://issues.apache.org/jira/browse/SPARK-56897))

Replaces `ByteBuffer`-based state tracking with a reusable `byte[]` buffer, eliminating 2 ByteBuffer allocations per decoded value (~8K objects per 4096-value page). Also rewrites `skipBinary` to avoid column vector reset/swap overhead.

**skipBinary** (primary improvement — Best Time in ms, lower is better):

| Case | JDK 17 Base | JDK 17 PR | Speedup | JDK 21 Base | JDK 21 PR | Speedup | JDK 25 Base | JDK 25 PR | Speedup |
|---|---|---|---|---|---|---|---|---|---|
| no overlap | 32 | 18 | **1.78x** | 34 | 17 | **2.00x** | 32 | 18 | **1.78x** |
| half overlap | 39 | 18 | **2.17x** | 38 | 17 | **2.24x** | 38 | 18 | **2.11x** |
| full overlap | 40 | 19 | **2.11x** | 38 | 17 | **2.24x** | 38 | 18 | **2.11x** |
| half overlap, len=64 | 39 | 19 | **2.05x** | 38 | 18 | **2.11x** | 38 | 18 | **2.11x** |

**readBinary** (modest improvement from allocation reduction):

| Case | JDK 17 Base | JDK 17 PR | Speedup | JDK 21 Base | JDK 21 PR | Speedup | JDK 25 Base | JDK 25 PR | Speedup |
|---|---|---|---|---|---|---|---|---|---|
| no overlap, len=16 | 27 | 26 | **1.04x** | 27 | 25 | **1.08x** | 27 | 25 | **1.08x** |
| half overlap, len=16 | 33 | 26 | **1.27x** | 33 | 26 | **1.27x** | 33 | 25 | **1.32x** |
| full overlap, len=16 | 34 | 26 | **1.31x** | 33 | 24 | **1.38x** | 34 | 24 | **1.42x** |
| half overlap, len=64 | 35 | 27 | **1.30x** | 33 | 26 | **1.27x** | 34 | 26 | **1.31x** |

**readBinary(len) single-value:**

| JDK 17 Base | JDK 17 PR | Speedup | JDK 21 Base | JDK 21 PR | Speedup | JDK 25 Base | JDK 25 PR | Speedup |
|---|---|---|---|---|---|---|---|---|
| 62 | 52 | **1.19x** | 63 | 52 | **1.21x** | 64 | 54 | **1.19x** |

---

<a name="pr-7"></a>

### 7. Reduce per-value allocation in DELTA_LENGTH_BYTE_ARRAY decoder

**PR:** [#55932](https://github.com/apache/spark/pull/55932) ([SPARK-56907](https://issues.apache.org/jira/browse/SPARK-56907))

Replaces per-value `in.slice(length)` with a single bulk slice for the entire batch. Replaces per-value skip loop with a single bulk skip.

**readBinary (Best Time in ms, lower is better):**

| Case | JDK 17 Base | JDK 17 PR | Speedup | JDK 21 Base | JDK 21 PR | Speedup | JDK 25 Base | JDK 25 PR | Speedup |
|---|---|---|---|---|---|---|---|---|---|
| payloadLen=8 | 17 | 14 | **1.21x** | 16 | 12 | **1.33x** | 16 | 14 | **1.14x** |
| payloadLen=32 | 17 | 15 | **1.13x** | 16 | 12 | **1.33x** | 16 | 15 | **1.07x** |
| payloadLen=128 | 19 | 18 | **1.06x** | 18 | 15 | **1.20x** | 19 | 20 | 0.95x |
| payloadLen=512 | 38 | 42 | 0.90x | 39 | 39 | 1.00x | 42 | 40 | **1.05x** |

**skipBinary (Best Time in ms, lower is better):**

| Case | JDK 17 Base | JDK 17 PR | Speedup | JDK 21 Base | JDK 21 PR | Speedup | JDK 25 Base | JDK 25 PR | Speedup |
|---|---|---|---|---|---|---|---|---|---|
| payloadLen=8 | 6 | 3 | **2.00x** | 6 | 3 | **2.00x** | 6 | 3 | **2.00x** |
| payloadLen=32 | 6 | 3 | **2.00x** | 6 | 3 | **2.00x** | 6 | 3 | **2.00x** |
| payloadLen=128 | 6 | 3 | **2.00x** | 6 | 3 | **2.00x** | 6 | 3 | **2.00x** |
| payloadLen=512 | 6 | 3 | **2.00x** | 6 | 3 | **2.00x** | 6 | 3 | **2.00x** |

`readBinary` speedup is largest for small payloads where allocation cost dominates. `skipBinary` shows a uniform **2x improvement** across all payload sizes and JDK versions.

---

<a name="pr-8"></a>

### 8. Benchmark workflow: skip TPC-DS generation + early CPU check

**PR:** [#56479](https://github.com/apache/spark/pull/56479) ([SPARK-57420](https://issues.apache.org/jira/browse/SPARK-57420))

Two improvements to the GHA benchmark workflow used to produce results for this umbrella:

1. **Skip TPC-DS data generation for non-TPCDS benchmarks.** Changes `contains(inputs.class, '*')` to `inputs.class == '*'` so wildcard patterns like `*VectorizedDeltaReaderBenchmark` no longer trigger the expensive TPC-DS generation job (~5-10 min saved per run).

2. **Add early CPU model check step** that runs immediately after checkout, before compilation. Prints the CPU as a `::notice::` annotation for live visibility in the Actions UI, and optionally fails fast if the runner CPU does not match the `expected-cpu` input parameter (saves 20-30 min of wasted runs on wrong hardware).

---

## Common Themes

- **Allocation reduction**: Replace per-value `ByteBuffer.slice()` / `ByteBuffer.wrap()` with bulk reads into reusable buffers
- **Bulk vectorized reads**: Replace per-element virtual dispatch with single batch calls backed by `System.arraycopy`
- **JIT-friendly patterns**: Per-class method overrides for monomorphic call sites; avoiding megamorphic profile pollution from shared helpers

## Benchmarking

All benchmarks run via GHA workflow on AMD EPYC 7763 with OpenJDK 17/21/25 (both baseline and PR on the same CPU model). Results committed to each PR branch.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-57415][SQL] Parquet vectorized reader performance improvements (umbrella) #56011

Overview

Summary

Pull Requests

1. DELTA_BINARY_PACKED bulk read optimization

2. Dictionary decoding hasNull fast path + per-class updater overrides

3. Vectorized BYTE_STREAM_SPLIT reader

4. Batch ByteBuffer slice in RLE PACKED decode

5. Bulk read paths for timestamp/date Parquet vector updaters

6. Reduce per-value allocations in DELTA_BYTE_ARRAY decoder

7. Reduce per-value allocation in DELTA_LENGTH_BYTE_ARRAY decoder

8. Benchmark workflow: skip TPC-DS generation + early CPU check

Common Themes

Benchmarking

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

#	JIRA	PR	Status	Focus	Key Speedup
1	SPARK-56892	#55919	Merged	DELTA_BINARY_PACKED bulk reads + widening	up to 8.2x
2	SPARK-56893	#55920	Merged	Dictionary decode hasNull fast path	1.22-1.62x
3	SPARK-56894	#55921	Merged	BYTE_STREAM_SPLIT vectorized reader + widening + FLBA	1.5-4.5x
4	SPARK-56895	#55922	Merged	RLE PACKED batch ByteBuffer slice	1.4-2.4x
5	SPARK-56896	#55923	Merged	Timestamp/date updater bulk reads	2.1-3.3x
6	SPARK-56897	#55924	Open	DELTA_BYTE_ARRAY allocation reduction	1.78-2.24x
7	SPARK-56907	#55932	Open	DELTA_LENGTH_BYTE_ARRAY allocation reduction	1.14-2.00x
8	SPARK-57420	#56479	Merged	Benchmark workflow: skip TPC-DS gen + early CPU check	~5-10 min saved per run

Type	Speedup
INT32 reads	1.1-1.6x
INT32 skip	1.3-1.8x
INT64 reads	1.8-3.7x
INT64 skip	2.3-4.0x
readIntegersAsLongs (INT32 -> Long)	2.4-2.7x
readIntegersAsDoubles (INT32 -> Double)	2.1-2.4x
readUnsignedLongs	7.3-8.2x

Scenario	JDK 17	JDK 21	JDK 25
No nulls	1.21-1.22x	1.56-1.62x	1.24-1.25x
10% nulls	~1.0x	1.24-1.29x	~1.0x
50% nulls	~1.0x	1.25-1.26x	~1.0x

Type	JDK 17	JDK 21	JDK 25
INT32	4.5x	3.8x	4.2x
INT64	2.8x	2.0x	1.8x
FLOAT	4.3x	3.6x	4.3x
DOUBLE	2.7x	2.0x	1.8x
readIntegersAsLongs	3.8x	3.0x	3.5x
readFloatsAsDoubles	3.8x	3.2x	3.9x
FLBA(12) readBinary	1.7x	1.6x	1.5x

Updater	Speedup (JDK 17/21/25)
LongAsMicrosUpdater	2.1x / 2.9x / 3.3x
LongAsNanosUpdater	(new benchmark; ~2.6x in local runs)
LongAsMicrosRebaseUpdater	(new benchmark; ~2.1x in local runs)
DateToTimestampNTZWithRebaseUpdater	(new benchmark; ~2.0x in local runs)

Case	JDK 17 Base	JDK 17 PR	Speedup	JDK 21 Base	JDK 21 PR	Speedup	JDK 25 Base	JDK 25 PR	Speedup
no overlap	32	18	1.78x	34	17	2.00x	32	18	1.78x
half overlap	39	18	2.17x	38	17	2.24x	38	18	2.11x
full overlap	40	19	2.11x	38	17	2.24x	38	18	2.11x
half overlap, len=64	39	19	2.05x	38	18	2.11x	38	18	2.11x

Case	JDK 17 Base	JDK 17 PR	Speedup	JDK 21 Base	JDK 21 PR	Speedup	JDK 25 Base	JDK 25 PR	Speedup
no overlap, len=16	27	26	1.04x	27	25	1.08x	27	25	1.08x
half overlap, len=16	33	26	1.27x	33	26	1.27x	33	25	1.32x
full overlap, len=16	34	26	1.31x	33	24	1.38x	34	24	1.42x
half overlap, len=64	35	27	1.30x	33	26	1.27x	34	26	1.31x

Case	JDK 17 Base	JDK 17 PR	Speedup	JDK 21 Base	JDK 21 PR	Speedup	JDK 25 Base	JDK 25 PR	Speedup
payloadLen=8	17	14	1.21x	16	12	1.33x	16	14	1.14x
payloadLen=32	17	15	1.13x	16	12	1.33x	16	15	1.07x
payloadLen=128	19	18	1.06x	18	15	1.20x	19	20	0.95x
payloadLen=512	38	42	0.90x	39	39	1.00x	42	40	1.05x

Uh oh!

[SPARK-57415][SQL] Parquet vectorized reader performance improvements (umbrella) #56011

Description

Overview

Summary

Pull Requests

1. DELTA_BINARY_PACKED bulk read optimization

2. Dictionary decoding hasNull fast path + per-class updater overrides

3. Vectorized BYTE_STREAM_SPLIT reader

4. Batch ByteBuffer slice in RLE PACKED decode

5. Bulk read paths for timestamp/date Parquet vector updaters

6. Reduce per-value allocations in DELTA_BYTE_ARRAY decoder

7. Reduce per-value allocation in DELTA_LENGTH_BYTE_ARRAY decoder

8. Benchmark workflow: skip TPC-DS generation + early CPU check

Common Themes

Benchmarking

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions