MDEV-40145 Vectorize VEC_DISTANCE_EUCLIDEAN/COSINE brute-force distance#5273
tonuonu wants to merge 1 commit into
Code Review
This pull request optimizes the Euclidean and Cosine distance calculation functions in sql/item_vectorfunc.cc by unrolling loops and utilizing multiple independent accumulators to enable compiler vectorization. The review feedback recommends formatting adjustments to comply with MariaDB's coding standards, such as adding proper spacing around operators and commas, and simplifying pointer arithmetic.
calc_distance_euclidean()/calc_distance_cosine() (used by VEC_DISTANCE_*() for non-indexed, brute-force search) accumulated into a single double. With strict floating-point order (no -ffast-math) the compiler must preserve the summation order, so the reduction is a serial dependency chain it cannot vectorize -- even though it already vectorizes the surrounding subtract/widen/square.
Use several independent accumulators so the compiler emits a vectorized reduction on every architecture, with no intrinsics.
The summation order changes, so results differ from the previous code only in the last ~1 ULP; neighbour rankings are unaffected and all main.vector* tests pass unchanged.
Measured on a real server build (Apple M4, 20000 x 512-dim, no index):
  VEC_DISTANCE_EUCLIDEAN  7.37ms -> 3.98ms  (1.85x)
  VEC_DISTANCE_COSINE     9.05ms -> 6.94ms  (1.30x)
Standalone kernel A/B confirms cross-platform: ~2.0x AVX2 (Zen), ~1.8x AVX-512 (Xeon).
Correctness checked across all dimensions 1..2050 (max relative error 1.3e-14 euclidean, 9e-9 cosine).
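As a rough illustration of the technique the commit message describes, here is a minimal sketch of a Euclidean distance with eight independent partial sums and a scalar tail. The function name `euclidean_multi_acc` and its signature are hypothetical and are not taken from sql/item_vectorfunc.cc; whether a given compiler actually vectorizes the loop depends on the compiler and flags, so treat this as a sketch rather than the patch itself.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper for illustration only; not the function from the PR.
double euclidean_multi_acc(const float *a, const float *b, std::size_t n)
{
  // Eight independent partial sums. No accumulator depends on another,
  // so there is no single serial chain of additions through the loop.
  double acc[8]= {0, 0, 0, 0, 0, 0, 0, 0};
  std::size_t i= 0;
  for (; i + 8 <= n; i+= 8)
  {
    for (std::size_t k= 0; k < 8; k++)
    {
      float d= a[i + k] - b[i + k];   // subtract in float
      acc[k]+= (double) d * d;        // widen to double, square, accumulate
    }
  }
  // Scalar tail for the remaining n % 8 elements.
  double tail= 0;
  for (; i < n; i++)
  {
    float d= a[i] - b[i];
    tail+= (double) d * d;
  }
  // Combine the partial sums once, after the loop.
  double sum= ((acc[0] + acc[1]) + (acc[2] + acc[3])) +
              ((acc[4] + acc[5]) + (acc[6] + acc[7])) + tail;
  return std::sqrt(sum);
}
```

Each `acc[k]` is still summed in a fixed, well-defined order, so no -ffast-math style relaxation is required; only the grouping of the additions differs from a single-accumulator loop.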
665fd79 to 990235e
Pull request overview
Optimizes the brute-force evaluation path for VEC_DISTANCE_EUCLIDEAN(), VEC_DISTANCE_COSINE(), and VEC_DISTANCE() by restructuring the float32→double accumulation loops to enable compiler auto-vectorization under strict FP semantics.
Changes:
- Refactors Euclidean distance to use 8 independent partial-sum accumulators with a scalar tail.
- Refactors Cosine distance to use 4-way partial sums for dot product and both norms, with a scalar tail.
- Adds an explanatory comment describing the motivation (breaking the scalar fadd dependency chain) and the expected tiny numerical differences due to changed summation order.
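The cosine change in the list above could look roughly like the following sketch: four partial sums each for the dot product and the two squared norms, plus a scalar tail. The name `cosine_distance_multi_acc` is hypothetical, and the sketch assumes the usual definition of cosine distance as 1 minus the cosine similarity; it is not copied from the PR.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper for illustration only; not the function from the PR.
// Assumes cosine distance = 1 - dot(a,b) / (|a| * |b|).
double cosine_distance_multi_acc(const float *a, const float *b,
                                 std::size_t n)
{
  // Four independent partial sums per quantity.
  double dot[4]= {0, 0, 0, 0};
  double na[4]= {0, 0, 0, 0};   // partial sums of |a|^2
  double nb[4]= {0, 0, 0, 0};   // partial sums of |b|^2
  std::size_t i= 0;
  for (; i + 4 <= n; i+= 4)
  {
    for (std::size_t k= 0; k < 4; k++)
    {
      double x= a[i + k], y= b[i + k];   // widen float to double
      dot[k]+= x * y;
      na[k]+= x * x;
      nb[k]+= y * y;
    }
  }
  // Combine the partial sums after the loop.
  double d= (dot[0] + dot[1]) + (dot[2] + dot[3]);
  double sa= (na[0] + na[1]) + (na[2] + na[3]);
  double sb= (nb[0] + nb[1]) + (nb[2] + nb[3]);
  // Scalar tail for the remaining n % 4 elements.
  for (; i < n; i++)
  {
    double x= a[i], y= b[i];
    d+= x * y;
    sa+= x * x;
    sb+= y * y;
  }
  // Note: a zero-length or all-zero vector divides by zero and yields NaN.
  return 1 - d / std::sqrt(sa * sb);
}
```

With three quantities to track, four lanes each already gives twelve live accumulators, which is one plausible reason to use fewer lanes here than in the Euclidean case; the source does not state the author's reasoning.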
Hi, Tõnu! Does this PR fix a real problem you (or somebody you know) have faced yourself? Or is it a hypothetical issue? I'd expect the speed to come from the index anyway, and it vectorizes distance calculations already. And I wouldn't think that the speed of these functions matters much. But I can be wrong, of course. So I'm looking for proof that this speedup matters.
Hello Sergei, we haven't talked since ~2000, I think? Good to hear from you! No, I haven't hit a problem in this exact code; I've just been doing some extreme-optimization work elsewhere and I'm warmed up to spotting these patterns. So I A/B tested it (same binary, forced-scalar vs patched).
You're right that the real speed is in the index, and that the distance math is already vectorized: the compiler auto-vectorizes the elementwise part of
It only affects the non-indexed path: 20000×512, no index, warm pool:
(~2× on Zen/AVX2, ~1.8× on Xeon/AVX-512 at the kernel level.)
So honestly it only matters if someone runs
gkodinov left a comment
Thank you for your contribution! This is a preliminary review.
LGTM. Please keep working with Serg on the final review.
The Jira issue number for this PR is: MDEV-40145
Description
calc_distance_euclidean() and calc_distance_cosine() in sql/item_vectorfunc.cc compute the float32 distance used by VEC_DISTANCE_EUCLIDEAN(), VEC_DISTANCE_COSINE() and VEC_DISTANCE() on the brute-force path: when there is no vector index on the column, or the function is evaluated directly (full-scan ORDER BY VEC_DISTANCE(...) LIMIT k, re-ranking, ad-hoc distance).
Both functions accumulate into a single double. The server is built with strict floating-point semantics (no -ffast-math), so the compiler must preserve the summation order; the single accumulator becomes a serial dependency chain it cannot reorder or vectorize, even though it already vectorizes the surrounding subtract, the float→double widening and the multiply. The loop is latency-bound on the scalar fadd chain.
This PR uses several independent partial-sum accumulators, combined after the loop (euclidean: 8; cosine: 4 each for dot / |a|² / |b|²). That breaks the dependency chain, so the compiler emits a vectorized reduction on every architecture: portable C, no intrinsics, no #ifdef. A scalar tail handles the remainder.
The indexed (HNSW) search path is unaffected: it uses the separate int16 quantized dot_product in sql/vector_mhnsw.cc, which is already SIMD-optimized.
Benchmark
Real server build, brute-force query (no vector index), 20000 rows × 512-dim, warm buffer pool, server-side query time (median of 40):
Measured on Apple M4. Cross-platform kernel A/B (standalone, real -O2 flags), euclidean: ~2.0× on AMD Zen (AVX2), ~1.8× on Intel Xeon (AVX-512). The win is portable across architectures.
Correctness
The summation order changes, so results differ from the previous code only in the last ~1 ULP; nearest-neighbour rankings are unaffected.
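To make the effect of the changed summation order concrete, the sketch below puts a single-accumulator sum of squared differences next to an 8-accumulator variant, with a helper that reports their relative difference. All three function names are hypothetical and only illustrate the kind of comparison described; this is not the PR's test code and it makes no claim about the exact error figures quoted in this PR.

```cpp
#include <cmath>
#include <cstddef>

// Baseline: one accumulator, additions happen strictly in index order.
double sum_sq_diff_single(const float *a, const float *b, std::size_t n)
{
  double sum= 0;
  for (std::size_t i= 0; i < n; i++)
  {
    float d= a[i] - b[i];
    sum+= (double) d * d;
  }
  return sum;
}

// Variant: 8 partial sums combined after the loop, plus a scalar tail.
double sum_sq_diff_multi(const float *a, const float *b, std::size_t n)
{
  double acc[8]= {0, 0, 0, 0, 0, 0, 0, 0};
  std::size_t i= 0;
  for (; i + 8 <= n; i+= 8)
    for (std::size_t k= 0; k < 8; k++)
    {
      float d= a[i + k] - b[i + k];
      acc[k]+= (double) d * d;
    }
  double sum= 0;
  for (std::size_t k= 0; k < 8; k++)
    sum+= acc[k];
  for (; i < n; i++)
  {
    float d= a[i] - b[i];
    sum+= (double) d * d;
  }
  return sum;
}

// Relative difference between two results (0 when both are 0).
double relative_difference(double x, double y)
{
  double scale= std::fmax(std::fabs(x), std::fabs(y));
  return scale == 0 ? 0 : std::fabs(x - y) / scale;
}
```

Because every term is non-negative, both orderings accumulate only rounding error proportional to the number of additions, which is why the two results are expected to agree to within a few units in the last place of a double.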
main.vector* tests pass unchanged (no .result edits needed).
Basing the PR against the correct MariaDB version
main.
PR quality check