MDEV-40145 Vectorize VEC_DISTANCE_EUCLIDEAN/COSINE brute-force distance #5273

Open
tonuonu wants to merge 1 commit into MariaDB:main from tonuonu:perf-vec-distance-simd-reduction

@tonuonu

tonuonu commented Jun 24, 2026

The Jira issue number for this PR is: MDEV-40145

Description

calc_distance_euclidean() and calc_distance_cosine() in sql/item_vectorfunc.cc compute the float32 distance used by VEC_DISTANCE_EUCLIDEAN(), VEC_DISTANCE_COSINE() and VEC_DISTANCE() on the brute-force path — when there is no vector index on the column, or the function is evaluated directly (full-scan ORDER BY VEC_DISTANCE(...) LIMIT k, re-ranking, ad-hoc distance).

Both functions accumulate into a single double. The server is built with strict floating-point semantics (no -ffast-math), so the compiler must preserve the summation order; the single accumulator becomes a serial dependency chain it cannot reorder or vectorize — even though it already vectorizes the surrounding subtract, the float→double widening and the multiply. The loop is latency-bound on the scalar fadd chain.

This PR uses several independent partial-sum accumulators, combined after the loop (euclidean: 8; cosine: 4 each for dot / |a|² / |b|²). That breaks the dependency chain, so the compiler emits a vectorized reduction on every architecture — portable C, no intrinsics, no #ifdef. A scalar tail handles the remainder.
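
A minimal sketch of the 8-accumulator idea for the Euclidean case follows. It is not the patch itself: the function name, the inner-loop form, and the pairwise combine order are assumptions made for illustration, and whether a given compiler vectorizes it depends on the compiler and flags.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Illustrative name; not the function in sql/item_vectorfunc.cc.
double euclidean_8acc(const float *a, const float *b, std::size_t n)
{
  assert((a != nullptr && b != nullptr) || n == 0);
  constexpr std::size_t LANES = 8;
  double acc[LANES] = {};                 // independent partial sums
  std::size_t i = 0;
  for (; i + LANES <= n; i += LANES)
    for (std::size_t j = 0; j < LANES; j++)
    {
      float d = a[i + j] - b[i + j];      // subtract in float32
      acc[j] += double(d) * double(d);    // widen, square, accumulate per lane
    }
  // Combine the partial sums after the loop (pairwise order is an assumption).
  double sum = ((acc[0] + acc[1]) + (acc[2] + acc[3])) +
               ((acc[4] + acc[5]) + (acc[6] + acc[7]));
  for (; i < n; i++)                      // scalar tail for the remainder
  {
    float d = a[i] - b[i];
    sum += double(d) * double(d);
  }
  return std::sqrt(sum);
}
```

Each acc[j] only depends on its own previous value, so there are eight short dependency chains instead of one long one. The result can differ from a single-accumulator loop in the last bits because the additions are grouped differently.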

The indexed (HNSW) search path is unaffected — it uses the separate int16 quantized dot_product in sql/vector_mhnsw.cc, which is already SIMD-optimized.

Benchmark

Real server build, brute-force query (no vector index), 20000 rows × 512-dim, warm buffer pool, server-side query time (median of 40):

| Function | Before | After | Speedup |
| --- | --- | --- | --- |
| VEC_DISTANCE_EUCLIDEAN | 7.37 ms | 3.98 ms | 1.85× |
| VEC_DISTANCE_COSINE | 9.05 ms | 6.94 ms | 1.30× |

Measured on Apple M4. Cross-platform kernel A/B (standalone, real -O2 flags), euclidean: ~2.0× on AMD Zen (AVX2), ~1.8× on Intel Xeon (AVX-512). The win is portable across architectures.
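
A standalone kernel A/B of the kind mentioned above could be timed with a small harness like the following. This is a sketch under assumptions: the helper names median and time_runs are hypothetical, and it is not the harness used for the numbers in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Median of the samples (the upper median when the count is even).
double median(std::vector<double> samples)
{
  assert(!samples.empty());
  auto mid = samples.begin() + samples.size() / 2;
  std::nth_element(samples.begin(), mid, samples.end());
  return *mid;
}

// Call `kernel` `runs` times and return each run's wall time in milliseconds.
template <typename Kernel>
std::vector<double> time_runs(Kernel kernel, std::size_t runs)
{
  using clock = std::chrono::steady_clock;
  std::vector<double> ms;
  ms.reserve(runs);
  for (std::size_t r = 0; r < runs; r++)
  {
    auto t0 = clock::now();
    kernel();
    auto t1 = clock::now();
    ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
  }
  return ms;
}
```

The kernel passed in should store its result somewhere the optimizer cannot discard, for example a volatile variable, otherwise the measured loop may be removed entirely. Comparing median(time_runs(old_kernel, 40)) against median(time_runs(new_kernel, 40)) then gives a before/after figure for one machine.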

Correctness

The summation order changes, so results differ from the previous code only in the last ~1 ULP; nearest-neighbour rankings are unaffected.

  • All main.vector* tests pass unchanged (no .result edits needed).
  • Verified across every dimension 1..2050 (covers all loop-tail remainders): max relative error 1.3e-14 (euclidean), 9e-9 (cosine).
  • Top-k result sets identical to the previous implementation on the benchmark data.
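
A dimension sweep like the one in the list above can be sketched as follows. The helper name max_relative_error, the fixed seed, and the value range are assumptions for illustration; the actual verification used for this PR is not shown in the thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Largest relative difference between a reference kernel and an alternative
// kernel over every dimension 1..max_dim, using pseudo-random float32 data.
// Both kernels are called as kernel(const float *a, const float *b, size_t n).
template <typename Ref, typename Alt>
double max_relative_error(Ref ref, Alt alt, std::size_t max_dim)
{
  assert(max_dim >= 1);
  std::mt19937 rng(12345);                                  // arbitrary fixed seed
  std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
  double worst = 0.0;
  for (std::size_t n = 1; n <= max_dim; n++)
  {
    std::vector<float> a(n), b(n);
    for (std::size_t i = 0; i < n; i++)
    {
      a[i] = dist(rng);
      b[i] = dist(rng);
    }
    double r = ref(a.data(), b.data(), n);
    double x = alt(a.data(), b.data(), n);
    double err = std::fabs(x - r);
    if (r != 0.0)
      err /= std::fabs(r);                                  // relative to the reference
    worst = std::max(worst, err);
  }
  return worst;
}
```

Sweeping every dimension rather than a few round sizes matters because each remainder modulo the accumulator count exercises a different tail length.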

Basing the PR against the correct MariaDB version

  • This is a performance improvement; the PR is based against main.

PR quality check

  • I checked the CODING_STANDARDS.md file and my contribution conforms to it.
  • I am the author of this contribution and license it to MariaDB under the BSD-new license and the GPLv2, per the MariaDB contribution terms.

@CLAassistant

CLAassistant commented Jun 24, 2026

CLA assistant check
All committers have signed the CLA.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request optimizes the Euclidean and Cosine distance calculation functions in sql/item_vectorfunc.cc by unrolling loops and utilizing multiple independent accumulators to enable compiler vectorization. The review feedback recommends formatting adjustments to comply with MariaDB's coding standards, such as adding proper spacing around operators and commas, and simplifying pointer arithmetic.

Comment thread sql/item_vectorfunc.cc Outdated
Comment thread sql/item_vectorfunc.cc Outdated
calc_distance_euclidean()/calc_distance_cosine() (used by VEC_DISTANCE_*()
for non-indexed, brute-force search) accumulated into a single double.
With strict floating-point order (no -ffast-math) the compiler must
preserve the summation order, so the reduction is a serial dependency
chain it cannot vectorize -- even though it already vectorizes the
surrounding subtract/widen/square.

Use several independent accumulators so the compiler emits a vectorized
reduction on every architecture, with no intrinsics. The summation order
changes, so results differ from the previous code only in the last ~1 ULP;
neighbour rankings are unaffected and all main.vector* tests pass unchanged.

Measured on a real server build (Apple M4, 20000 x 512-dim, no index):
  VEC_DISTANCE_EUCLIDEAN  7.37ms -> 3.98ms  (1.85x)
  VEC_DISTANCE_COSINE     9.05ms -> 6.94ms  (1.30x)
Standalone kernel A/B confirms cross-platform: ~2.0x AVX2 (Zen),
~1.8x AVX-512 (Xeon). Correctness checked across all dimensions 1..2050
(max relative error 1.3e-14 euclidean, 9e-9 cosine).
@tonuonu tonuonu force-pushed the perf-vec-distance-simd-reduction branch from 665fd79 to 990235e Compare June 24, 2026 12:19
@vuvova vuvova requested a review from Copilot June 24, 2026 12:51

Copilot AI left a comment

Pull request overview

Optimizes the brute-force evaluation path for VEC_DISTANCE_EUCLIDEAN(), VEC_DISTANCE_COSINE(), and VEC_DISTANCE() by restructuring the float32→double accumulation loops to enable compiler auto-vectorization under strict FP semantics.

Changes:

  • Refactors Euclidean distance to use 8 independent partial-sum accumulators with a scalar tail.
  • Refactors Cosine distance to use 4-way partial sums for dot product and both norms, with a scalar tail.
  • Adds an explanatory comment describing the motivation (breaking the scalar fadd dependency chain) and the expected tiny numerical differences due to changed summation order.
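
The 4-way cosine variant summarized above might look roughly like this. It is a sketch, not the patch: the function name is hypothetical, and the final formula 1 - dot / sqrt(|a|² · |b|²) is an assumption about how the cosine distance is defined.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Illustrative name; not the function in sql/item_vectorfunc.cc.
double cosine_4acc(const float *a, const float *b, std::size_t n)
{
  assert((a != nullptr && b != nullptr) || n == 0);
  constexpr std::size_t LANES = 4;
  double dot[LANES] = {}, na[LANES] = {}, nb[LANES] = {};   // 4 lanes per quantity
  std::size_t i = 0;
  for (; i + LANES <= n; i += LANES)
    for (std::size_t j = 0; j < LANES; j++)
    {
      double x = a[i + j], y = b[i + j];   // widen float32 to double
      dot[j] += x * y;                     // dot product
      na[j] += x * x;                      // |a|^2
      nb[j] += y * y;                      // |b|^2
    }
  double d = (dot[0] + dot[1]) + (dot[2] + dot[3]);
  double a2 = (na[0] + na[1]) + (na[2] + na[3]);
  double b2 = (nb[0] + nb[1]) + (nb[2] + nb[3]);
  for (; i < n; i++)                       // scalar tail for the remainder
  {
    double x = a[i], y = b[i];
    d += x * y;
    a2 += x * x;
    b2 += y * y;
  }
  return 1.0 - d / std::sqrt(a2 * b2);     // NaN if either vector is all zeros
}
```

Three quantities are accumulated per element, so four lanes each already give twelve independent chains. That is one plausible reason for using fewer lanes here than in the Euclidean loop, though the thread does not state the reason.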

@vuvova

vuvova commented Jun 24, 2026

Member

Hi, Tõnu!

Does this PR fix a real problem you (or somebody you know) have faced yourself? Or is it a hypothetical issue?

I'd expect the speed to come from the index anyway, and it vectorizes distance calculations already. And I wouldn't think that the speed of these functions matters much.

But I can be wrong, of course. So, I'm looking for proof that this speedup matters.

@tonuonu

tonuonu commented Jun 24, 2026

Author

Hello Sergei,

We haven't talked since ~2000, I think? Good to hear from you!

No, I haven't hit a problem in this exact code — I've just been doing some extreme-optimization work elsewhere and I'm warmed up to spotting these patterns. So I A/B tested it (same binary, forced-scalar vs patched).

You're right that the real speed is in the index, and that the distance math is already vectorized: the compiler auto-vectorizes the elementwise part of calc_distance_euclidean/cosine (subtract, float→double widen, multiply). The one thing it can't touch is the reduction: without -ffast-math it must keep FP summation order, so the sum += stays a scalar serial chain. Splitting it into a few independent accumulators lets the sum vectorize too. (If these were already fully vectorized, the patch would be a no-op; the speedup is exactly what vectorizing the reduction recovers.)

It only affects the non-indexed path: VEC_DISTANCE_*() with no vector index — full-scan ORDER BY … LIMIT k, re-ranking, or distance as a plain expression. Indexed search goes through the int16 dot_product and is untouched.

20000×512, no index, warm pool:

  • euclidean: 7.37 → 3.98 ms (1.85×)
  • cosine: 9.05 → 6.94 ms (1.30×)

(~2× on Zen/AVX2, ~1.8× on Xeon/AVX-512 at the kernel level.)

So honestly it only matters if someone runs VEC_DISTANCE without an index — which I don't do myself. If you don't think that path is worth the extra code, I'm fine to drop it.

@gkodinov gkodinov added the External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. label Jun 25, 2026
@gkodinov gkodinov self-assigned this Jun 25, 2026

@gkodinov gkodinov left a comment

Member

Thank you for your contribution! This is a preliminary review.

LGTM. Please keep working with Serg on the final review.

@gkodinov gkodinov assigned vuvova and unassigned gkodinov Jun 25, 2026

Labels

External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements.

5 participants