MDEV-40145 Vectorize VEC_DISTANCE_EUCLIDEAN/COSINE brute-force distance by tonuonu · Pull Request #5273 · MariaDB/server

tonuonu · 2026-06-24T12:09:02Z

The Jira issue number for this PR is: MDEV-40145

Description

calc_distance_euclidean() and calc_distance_cosine() in sql/item_vectorfunc.cc compute the float32 distance used by VEC_DISTANCE_EUCLIDEAN(), VEC_DISTANCE_COSINE() and VEC_DISTANCE() on the brute-force path — when there is no vector index on the column, or the function is evaluated directly (full-scan ORDER BY VEC_DISTANCE(...) LIMIT k, re-ranking, ad-hoc distance).

Both functions accumulate into a single double. The server is built with strict floating-point semantics (no -ffast-math), so the compiler must preserve the summation order; the single accumulator becomes a serial dependency chain it cannot reorder or vectorize — even though it already vectorizes the surrounding subtract, the float→double widening and the multiply. The loop is latency-bound on the scalar fadd chain.

This PR uses several independent partial-sum accumulators, combined after the loop (euclidean: 8; cosine: 4 each for dot / |a|² / |b|²). That breaks the dependency chain, so the compiler emits a vectorized reduction on every architecture — portable C, no intrinsics, no #ifdef. A scalar tail handles the remainder.
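A minimal sketch of the idea described above, not the code from the patch: `euclidean_serial` and `euclidean_8acc` are hypothetical names, and the exact loop shape in sql/item_vectorfunc.cc may differ. The first function has a single accumulator, so each add must wait for the previous one. The second keeps 8 independent partial sums, which a compiler is free to map onto SIMD lanes without reordering any individual chain.

```cpp
#include <cmath>
#include <cstddef>

// Baseline (illustrative): one accumulator forms a serial dependency chain.
double euclidean_serial(const float *a, const float *b, std::size_t n)
{
  double sum= 0.0;
  for (std::size_t i= 0; i < n; i++)
  {
    double d= static_cast<double>(a[i] - b[i]);  // subtract in float, widen
    sum+= d * d;                                 // every add depends on the last
  }
  return std::sqrt(sum);
}

// Sketch of the multi-accumulator form: 8 independent partial sums,
// combined after the main loop, plus a scalar tail for the remainder.
double euclidean_8acc(const float *a, const float *b, std::size_t n)
{
  double acc[8]= {0, 0, 0, 0, 0, 0, 0, 0};
  std::size_t i= 0;
  for (; i + 8 <= n; i+= 8)
    for (std::size_t j= 0; j < 8; j++)
    {
      double d= static_cast<double>(a[i + j] - b[i + j]);
      acc[j]+= d * d;                            // chain j is independent of chain k
    }
  // Combine the partial sums pairwise.
  double sum= ((acc[0] + acc[1]) + (acc[2] + acc[3])) +
              ((acc[4] + acc[5]) + (acc[6] + acc[7]));
  // Scalar tail: the last n % 8 elements.
  for (; i < n; i++)
  {
    double d= static_cast<double>(a[i] - b[i]);
    sum+= d * d;
  }
  return std::sqrt(sum);
}
```

Both functions add the same terms; only the grouping of the additions differs, which is why the results can differ in the last bits.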

The indexed (HNSW) search path is unaffected — it uses the separate int16 quantized dot_product in sql/vector_mhnsw.cc, which is already SIMD-optimized.

Benchmark

Real server build, brute-force query (no vector index), 20000 rows × 512-dim, warm buffer pool, server-side query time (median of 40):

metric	before	after	speedup
VEC_DISTANCE_EUCLIDEAN	7.37 ms	3.98 ms	1.85×
VEC_DISTANCE_COSINE	9.05 ms	6.94 ms	1.30×

Measured on Apple M4. Cross-platform kernel A/B (standalone, real -O2 flags), euclidean: ~2.0× on AMD Zen (AVX2), ~1.8× on Intel Xeon (AVX-512). The win is portable across architectures.

Correctness

The summation order changes, so results differ from the previous code only in the last ~1 ULP; nearest-neighbour rankings are unaffected.

All main.vector* tests pass unchanged (no .result edits needed).
Verified across every dimension 1..2050 (covers all loop-tail remainders): max relative error 1.3e-14 (euclidean), 9e-9 (cosine).
Top-k result sets identical to the previous implementation on the benchmark data.
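The dimension sweep mentioned above could look roughly like the following. This is a hedged sketch, not the author's actual test harness: the kernel names, the small LCG used for input data, and `max_relative_error` are all assumptions made for illustration. It compares a single-accumulator reference against an 8-accumulator kernel for every dimension from 1 to `max_dim`, so every possible tail length is exercised.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference: single accumulator, strict left-to-right summation.
static double sumsq_serial(const float *a, const float *b, std::size_t n)
{
  double sum= 0.0;
  for (std::size_t i= 0; i < n; i++)
  {
    double d= static_cast<double>(a[i] - b[i]);
    sum+= d * d;
  }
  return sum;
}

// Hypothetical patched kernel: 8 partial sums plus a scalar tail.
static double sumsq_8acc(const float *a, const float *b, std::size_t n)
{
  double acc[8]= {0, 0, 0, 0, 0, 0, 0, 0};
  std::size_t i= 0;
  for (; i + 8 <= n; i+= 8)
    for (std::size_t j= 0; j < 8; j++)
    {
      double d= static_cast<double>(a[i + j] - b[i + j]);
      acc[j]+= d * d;
    }
  double sum= ((acc[0] + acc[1]) + (acc[2] + acc[3])) +
              ((acc[4] + acc[5]) + (acc[6] + acc[7]));
  for (; i < n; i++)
  {
    double d= static_cast<double>(a[i] - b[i]);
    sum+= d * d;
  }
  return sum;
}

// Deterministic pseudo-random float in [-1, 1) from a small LCG
// (illustrative input data only, not the benchmark data set).
static float next_float(std::uint32_t &state)
{
  state= state * 1664525u + 1013904223u;
  return static_cast<float>(state >> 8) / 8388608.0f - 1.0f;
}

// Largest relative difference between the two kernels over dimensions 1..max_dim.
double max_relative_error(std::size_t max_dim)
{
  std::uint32_t state= 12345u;
  double worst= 0.0;
  for (std::size_t n= 1; n <= max_dim; n++)
  {
    std::vector<float> a(n), b(n);
    for (std::size_t i= 0; i < n; i++)
    {
      a[i]= next_float(state);
      b[i]= next_float(state);
    }
    double ref= std::sqrt(sumsq_serial(a.data(), b.data(), n));
    double got= std::sqrt(sumsq_8acc(a.data(), b.data(), n));
    if (ref > 0.0)
      worst= std::max(worst, std::fabs(got - ref) / ref);
  }
  return worst;
}
```

Calling `max_relative_error(2050)` mirrors the range quoted in the PR. The value it returns depends on the input data and the compiler, so it is not expected to reproduce the figures reported above exactly.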

Basing the PR against the correct MariaDB version

This is a performance improvement; the PR is based against main.

PR quality check

I checked the CODING_STANDARDS.md file and my contribution conforms to it.
I am the author of this contribution and license it to MariaDB under the BSD-new license and the GPLv2, per the MariaDB contribution terms.

CLAassistant · 2026-06-24T12:09:30Z

All committers have signed the CLA.

gemini-code-assist

Code Review

This pull request optimizes the Euclidean and Cosine distance calculation functions in sql/item_vectorfunc.cc by unrolling loops and utilizing multiple independent accumulators to enable compiler vectorization. The review feedback recommends formatting adjustments to comply with MariaDB's coding standards, such as adding proper spacing around operators and commas, and simplifying pointer arithmetic.


calc_distance_euclidean()/calc_distance_cosine() (used by VEC_DISTANCE_*() for non-indexed, brute-force search) accumulated into a single double. With strict floating-point order (no -ffast-math) the compiler must preserve the summation order, so the reduction is a serial dependency chain it cannot vectorize -- even though it already vectorizes the surrounding subtract/widen/square. Use several independent accumulators so the compiler emits a vectorized reduction on every architecture, with no intrinsics.

The summation order changes, so results differ from the previous code only in the last ~1 ULP; neighbour rankings are unaffected and all main.vector* tests pass unchanged.

Measured on a real server build (Apple M4, 20000 x 512-dim, no index):

VEC_DISTANCE_EUCLIDEAN 7.37ms -> 3.98ms (1.85x)
VEC_DISTANCE_COSINE 9.05ms -> 6.94ms (1.30x)

Standalone kernel A/B confirms cross-platform: ~2.0x AVX2 (Zen), ~1.8x AVX-512 (Xeon). Correctness checked across all dimensions 1..2050 (max relative error 1.3e-14 euclidean, 9e-9 cosine).

Copilot

Pull request overview

Optimizes the brute-force evaluation path for VEC_DISTANCE_EUCLIDEAN(), VEC_DISTANCE_COSINE(), and VEC_DISTANCE() by restructuring the float32→double accumulation loops to enable compiler auto-vectorization under strict FP semantics.

Changes:

Refactors Euclidean distance to use 8 independent partial-sum accumulators with a scalar tail.
Refactors Cosine distance to use 4-way partial sums for dot product and both norms, with a scalar tail.
Adds an explanatory comment describing the motivation (breaking the scalar fadd dependency chain) and the expected tiny numerical differences due to changed summation order.
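The cosine change summarized above can be sketched as follows. This is an illustration under stated assumptions, not the patched function: `cosine_4acc` is a hypothetical name, and the final formula `1 - dot / sqrt(|a|² · |b|²)` is assumed here as the usual definition of cosine distance rather than taken from the source file.

```cpp
#include <cmath>
#include <cstddef>

// Sketch: 4-way partial sums for the dot product and both squared norms,
// followed by a scalar tail. Zero-length and zero-norm inputs are not
// handled here (they would yield NaN).
double cosine_4acc(const float *a, const float *b, std::size_t n)
{
  double dot[4]= {0, 0, 0, 0};
  double na[4]= {0, 0, 0, 0};
  double nb[4]= {0, 0, 0, 0};
  std::size_t i= 0;
  for (; i + 4 <= n; i+= 4)
    for (std::size_t j= 0; j < 4; j++)
    {
      double x= a[i + j], y= b[i + j];   // widen float to double
      dot[j]+= x * y;
      na[j]+= x * x;
      nb[j]+= y * y;
    }
  // Combine each group of partial sums pairwise.
  double d= (dot[0] + dot[1]) + (dot[2] + dot[3]);
  double sa= (na[0] + na[1]) + (na[2] + na[3]);
  double sb= (nb[0] + nb[1]) + (nb[2] + nb[3]);
  // Scalar tail: the last n % 4 elements.
  for (; i < n; i++)
  {
    double x= a[i], y= b[i];
    d+= x * y;
    sa+= x * x;
    sb+= y * y;
  }
  return 1.0 - d / std::sqrt(sa * sb);
}
```

Three reductions at 4 lanes each already keep 12 independent chains in flight, which is one plausible reason for choosing 4 rather than 8 accumulators per sum here.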


vuvova · 2026-06-24T12:55:50Z

Hi, Tõnu!

Does this PR fix a real problem you (or somebody you know) have faced yourself? Or is it a hypothetical issue?

I'd expect the speed to come from the index anyway, and it vectorizes distance calculations already. And I wouldn't think that the speed of these functions matters much.

But I can be wrong, of course. So, I'm looking for proof that this speedup matters.

tonuonu · 2026-06-24T13:58:53Z

Hello Sergei,

we haven't talked since ~2000, I think? Good to hear from you!

No, I haven't hit a problem in this exact code — I've just been doing some extreme-optimization work elsewhere and I'm warmed up to spotting these patterns. So I A/B tested it (same binary, forced-scalar vs patched).

You're right that the real speed is in the index, and that the distance math is already vectorized — the compiler auto-vectorizes the elementwise part of calc_distance_euclidean/cosine (subtract, float→double widen, multiply). The one thing it can't touch is the reduction: without -ffast-math it must keep FP summation order, so the sum += stays a scalar serial chain. Splitting it into a few independent accumulators lets the sum vectorize too. (If these were already fully vectorized, the patch would be a no-op — the speedup is exactly that recovered.)

It only affects the non-indexed path: VEC_DISTANCE_*() with no vector index — full-scan ORDER BY … LIMIT k, re-ranking, or distance as a plain expression. Indexed search goes through the int16 dot_product and is untouched.

20000×512, no index, warm pool:

euclidean: 7.37 → 3.98 ms (1.85×)
cosine: 9.05 → 6.94 ms (1.30×)

(~2× on Zen/AVX2, ~1.8× on Xeon/AVX-512 at the kernel level.)

So honestly it only matters if someone runs VEC_DISTANCE without an index — which I don't do myself. If you don't think that path is worth the extra code, I'm fine to drop it.

gkodinov

Thank you for your contribution! This is a preliminary review.

LGTM. Please keep working with Serg on the final review.

gemini-code-assist Bot reviewed Jun 24, 2026


tonuonu force-pushed the perf-vec-distance-simd-reduction branch from 665fd79 to 990235e Compare June 24, 2026 12:19

vuvova requested a review from Copilot June 24, 2026 12:51

Copilot started reviewing on behalf of vuvova June 24, 2026 12:51

Copilot AI reviewed Jun 24, 2026


gkodinov added the "External Contribution" label (all PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements) Jun 25, 2026

gkodinov self-assigned this Jun 25, 2026

gkodinov approved these changes Jun 25, 2026


gkodinov assigned vuvova and unassigned gkodinov Jun 25, 2026

tonuonu wants to merge 1 commit into MariaDB:main from tonuonu:perf-vec-distance-simd-reduction
