Skip to content

Commit 8291e06

Browse files
authored
fix(profiler): make MySQL median deterministic (#27815)
* fix(profiler): make MySQL median deterministic The MySQL MedianFn returned non-deterministic values across runs on identical data. Two bugs: 1. ROW_NUMBER() OVER () lacked a window ORDER BY, so row numbers were assigned in implementation-defined storage order, unrelated to the sorted column position the median needs. 2. The (SELECT @counter := COUNT(*) FROM tbl) t_count cross-join relied on user-variable side-effect ordering, which MySQL explicitly leaves undefined for expressions involving user variables. Replaced with ROW_NUMBER() OVER (ORDER BY {col}) + COUNT(*) OVER () AS total_count, matching the pattern Doris and SQLite dialects in this same file already use. Both correlated (dimension_col) and non-correlated branches updated symmetrically. Transitive impact: firstQuartile, thirdQuartile, and interQuartileRange all reuse MedianFn via PercentilMixin and become deterministic on MySQL as a side effect. Bug present since #10962 (2023-04-11). The original PR noted "Tested only external to OM" — no in-tree integration test against actual median values, so the 6 existing unit tests (which assert SQL strings) all passed against the broken impl. Verified locally: 10/10 sequential runs returned median=680 for [600,650,680,720,750] post-fix; 3/3 returned mixed 680/650/650 pre-fix. * test(profiler): add MySQL median integration test (regression sentinel) Mirrors the existing test_median_mariadb.py shape — testcontainers spins up a real MySQL 8.0 container, seeds 10 rows across 2 categories, then asserts MedianFn returns the correct percentile-discrete value across all six combinations (p=0.25/0.50/0.75 × non-correlated/dimension_col). Two extra regression sentinels guarding against the pre-fix bugs: - test_compiled_sql_uses_window_order_by — asserts ROW_NUMBER() OVER (ORDER BY ...) is in the generated SQL and the broken `OVER ()` pattern is absent. - test_compiled_sql_avoids_user_variable_counter — asserts @counter is absent and COUNT(*) OVER () is present. Plus a 10x determinism check (test_median_non_correlated_deterministic_ across_runs) that would have flagged the original bug from #10962 had it existed at the time. The MySQL median is percentile-discrete (picks a row at ROUND(p*N)) whereas MariaDB's PERCENTILE_CONT interpolates — same seed data produces different expected values across the two dialects, both documented inline in the test. Wait strategy uses LogMessageWaitStrategy("ready for connections") .with_startup_timeout(120) — testcontainers' default regex expects the message twice (which only MariaDB emits) and times out at 10s before MySQL 8 finishes initializing. * style(profiler): apply ruff format to MySQL median test Single-line set comprehension per `make py_format` (CI checkstyle).
1 parent 70fc7cb commit 8291e06

2 files changed

Lines changed: 211 additions & 15 deletions

File tree

ingestion/src/metadata/profiler/orm/functions/median.py

Lines changed: 12 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -192,33 +192,30 @@ def _(elements, compiler, **kwargs): # pylint: disable=unused-argument
192192
FROM (
193193
SELECT
194194
{col},
195-
ROW_NUMBER() OVER () AS row_num
196-
FROM
197-
`{table}` AS median_inner,
198-
(SELECT @counter := COUNT(*)
199-
FROM `{table}` AS median_count
200-
WHERE median_count.{dimension_col} = `{table}`.{dimension_col}) t_count
195+
ROW_NUMBER() OVER (ORDER BY {col}) AS row_num,
196+
COUNT(*) OVER () AS total_count
197+
FROM `{table}` AS median_inner
201198
WHERE median_inner.{dimension_col} = `{table}`.{dimension_col}
202-
ORDER BY {col}
203199
) temp
204-
WHERE temp.row_num = ROUND({percentile} * @counter)
200+
WHERE temp.row_num = ROUND({percentile} * temp.total_count)
205201
)
206202
""".format(col=col, table=table, percentile=percentile, dimension_col=dimension_col) # noqa: UP032
207203
else: # noqa: RET505
208-
# NON-CORRELATED MODE: Original behavior (profiler)
204+
# NON-CORRELATED MODE: window-function-based count to avoid
205+
# user-variable side-effect ordering (MySQL doesn't guarantee
206+
# when `SELECT @v := COUNT(*)` inside a derived table is
207+
# evaluated relative to the outer WHERE).
209208
return """
210209
(SELECT
211210
{col}
212211
FROM (
213212
SELECT
214213
{col},
215-
ROW_NUMBER() OVER () AS row_num
216-
FROM
217-
`{table}`,
218-
(SELECT @counter := COUNT(*) FROM `{table}`) t_count
219-
ORDER BY {col}
214+
ROW_NUMBER() OVER (ORDER BY {col}) AS row_num,
215+
COUNT(*) OVER () AS total_count
216+
FROM `{table}`
220217
) temp
221-
WHERE temp.row_num = ROUND({percentile} * @counter)
218+
WHERE temp.row_num = ROUND({percentile} * temp.total_count)
222219
)
223220
""".format(col=col, table=table, percentile=percentile) # noqa: UP032
224221

Lines changed: 199 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,199 @@
1+
# Copyright 2026 Collate
2+
# Licensed under the Collate Community License, Version 1.0 (the "License");
3+
# you may not use this file except in compliance with the License.
4+
# You may obtain a copy of the License at
5+
# https://github.com/open-metadata/OpenMetadata/blob/main/ingestion/LICENSE
6+
# Unless required by applicable law or agreed to in writing, software
7+
# distributed under the License is distributed on an "AS IS" BASIS,
8+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
9+
# See the License for the specific language governing permissions and
10+
# limitations under the License.
11+
12+
"""
13+
Integration tests for MySQL median/percentile functions against a real MySQL container.
14+
15+
Validates that MedianFn (MySQL dialect compile) returns deterministic, correct
16+
percentile-discrete values for both non-correlated (whole-table) and correlated
17+
(dimension_col) modes. Regression sentinel for the pre-2026-04-29 bug where
18+
ROW_NUMBER() OVER () lacked a window ORDER BY and the @counter user-variable
19+
side-effect ordering was undefined — both producing non-deterministic results.
20+
21+
Note on expected values: MySQL's MedianFn uses percentile-discrete via
22+
ROUND(percentile * COUNT(*)), picking a single existing row at the sorted
23+
position. MariaDB's PERCENTILE_CONT interpolates. Same seed data → different
24+
expected values across the two dialects.
25+
"""
26+
27+
import pytest
28+
from sqlalchemy import Column, Float, Integer, String, column, create_engine, text
29+
from sqlalchemy.orm import DeclarativeBase, Session
30+
31+
from metadata.profiler.orm.functions.median import MedianFn
32+
33+
try:
34+
from testcontainers.core.wait_strategies import LogMessageWaitStrategy
35+
from testcontainers.mysql import MySqlContainer
36+
except ImportError:
37+
pytest.skip("testcontainers not installed", allow_module_level=True)
38+
39+
40+
class Base(DeclarativeBase):
41+
pass
42+
43+
44+
class MedianTestData(Base):
45+
__tablename__ = "test_data"
46+
id = Column(Integer, primary_key=True)
47+
value = Column(Float, nullable=False)
48+
category = Column(String(50), nullable=False)
49+
50+
51+
# Test data: 10 rows, 2 categories of 5.
52+
# Whole table sorted: [10, 20, 30, 40, 50, 100, 200, 300, 400, 500]
53+
# p=0.50 → ROUND(5.00)=5 → 5th = 50
54+
# p=0.25 → ROUND(2.50)=3 → 3rd = 30
55+
# p=0.75 → ROUND(7.50)=8 → 8th = 300
56+
# Per-dimension (5 rows each):
57+
# cat "a" [10, 20, 30, 40, 50]: p=0.50 → 3rd = 30, p=0.25 → 1st = 10, p=0.75 → 4th = 40
58+
# cat "b" [100,200,300,400,500]: p=0.50 → 3rd = 300, p=0.25 → 1st = 100, p=0.75 → 4th = 400
59+
TEST_ROWS = [
60+
(1, 10.0, "a"),
61+
(2, 20.0, "a"),
62+
(3, 30.0, "a"),
63+
(4, 40.0, "a"),
64+
(5, 50.0, "a"),
65+
(6, 100.0, "b"),
66+
(7, 200.0, "b"),
67+
(8, 300.0, "b"),
68+
(9, 400.0, "b"),
69+
(10, 500.0, "b"),
70+
]
71+
72+
73+
def _compile_median_fn(session, col_name, table_name, percentile, dimension_col=None):
74+
"""Compile a MedianFn to SQL string using the session's MySQL dialect."""
75+
args = (column(col_name), table_name, percentile)
76+
if dimension_col is not None:
77+
args = args + (dimension_col,)
78+
fn = MedianFn(*args)
79+
return fn.compile(
80+
dialect=session.get_bind().dialect,
81+
compile_kwargs={"literal_binds": True},
82+
)
83+
84+
85+
@pytest.fixture(scope="module")
86+
def mysql_engine():
87+
# MySQL 8 cold-start commonly exceeds the 10s default; wait up to 120s
88+
# for the single "ready for connections" log line from the main server
89+
# (the testcontainers default regex expects two occurrences which only
90+
# MariaDB emits — MySQL emits one).
91+
container = MySqlContainer(image="mysql:8.0", dbname="test_db").waiting_for(
92+
LogMessageWaitStrategy("ready for connections").with_startup_timeout(120)
93+
)
94+
with container as container:
95+
url = container.get_connection_url()
96+
if url.startswith("mysql://"):
97+
url = "mysql+pymysql://" + url[len("mysql://") :]
98+
engine = create_engine(url)
99+
with engine.connect() as conn:
100+
conn.execute(
101+
text(
102+
"CREATE TABLE test_data ("
103+
"id INTEGER PRIMARY KEY, "
104+
"value DOUBLE NOT NULL, "
105+
"category VARCHAR(50) NOT NULL)"
106+
)
107+
)
108+
values = ", ".join(f"({row[0]}, {row[1]}, '{row[2]}')" for row in TEST_ROWS)
109+
conn.execute(text(f"INSERT INTO test_data (id, value, category) VALUES {values}"))
110+
conn.commit()
111+
yield engine
112+
engine.dispose()
113+
114+
115+
@pytest.fixture(scope="module")
116+
def session(mysql_engine):
117+
with Session(mysql_engine) as session:
118+
yield session
119+
120+
121+
class TestMySQLMedianFn:
122+
def test_median_non_correlated(self, session):
123+
"""p=0.5 over 10-row table picks the 5th sorted element (50)."""
124+
compiled = _compile_median_fn(session, "value", "test_data", 0.50)
125+
result = session.execute(text(f"SELECT {compiled} AS median_val FROM test_data LIMIT 1")).scalar()
126+
assert result == pytest.approx(50.0)
127+
128+
def test_first_quartile_non_correlated(self, session):
129+
"""p=0.25 over 10-row table picks the 3rd sorted element (30)."""
130+
compiled = _compile_median_fn(session, "value", "test_data", 0.25)
131+
result = session.execute(text(f"SELECT {compiled} AS q1_val FROM test_data LIMIT 1")).scalar()
132+
assert result == pytest.approx(30.0)
133+
134+
def test_third_quartile_non_correlated(self, session):
135+
"""p=0.75 over 10-row table picks the 8th sorted element (300)."""
136+
compiled = _compile_median_fn(session, "value", "test_data", 0.75)
137+
result = session.execute(text(f"SELECT {compiled} AS q3_val FROM test_data LIMIT 1")).scalar()
138+
assert result == pytest.approx(300.0)
139+
140+
def test_median_with_dimension_col(self, session):
141+
"""Per-group median: 3rd element of each 5-row group."""
142+
compiled = _compile_median_fn(session, "value", "test_data", 0.50, "category")
143+
results = session.execute(
144+
text(f"SELECT DISTINCT category, {compiled} AS median_val FROM test_data ORDER BY category")
145+
).fetchall()
146+
medians = {row[0]: row[1] for row in results}
147+
assert medians["a"] == pytest.approx(30.0)
148+
assert medians["b"] == pytest.approx(300.0)
149+
150+
def test_first_quartile_with_dimension_col(self, session):
151+
"""Per-group Q1: 1st element of each 5-row group."""
152+
compiled = _compile_median_fn(session, "value", "test_data", 0.25, "category")
153+
results = session.execute(
154+
text(f"SELECT DISTINCT category, {compiled} AS q1_val FROM test_data ORDER BY category")
155+
).fetchall()
156+
q1s = {row[0]: row[1] for row in results}
157+
assert q1s["a"] == pytest.approx(10.0)
158+
assert q1s["b"] == pytest.approx(100.0)
159+
160+
def test_third_quartile_with_dimension_col(self, session):
161+
"""Per-group Q3: 4th element of each 5-row group."""
162+
compiled = _compile_median_fn(session, "value", "test_data", 0.75, "category")
163+
results = session.execute(
164+
text(f"SELECT DISTINCT category, {compiled} AS q3_val FROM test_data ORDER BY category")
165+
).fetchall()
166+
q3s = {row[0]: row[1] for row in results}
167+
assert q3s["a"] == pytest.approx(40.0)
168+
assert q3s["b"] == pytest.approx(400.0)
169+
170+
def test_compiled_sql_uses_window_order_by(self, session):
171+
"""Regression sentinel: ROW_NUMBER() OVER must include ORDER BY {col}.
172+
173+
Without the ORDER BY in the window spec, row numbers are assigned in
174+
implementation-defined storage order, making the percentile pick
175+
non-deterministic. This was the original bug from #10962.
176+
"""
177+
compiled = str(_compile_median_fn(session, "value", "test_data", 0.50))
178+
assert "ROW_NUMBER() OVER (ORDER BY" in compiled
179+
assert "ROW_NUMBER() OVER ()" not in compiled
180+
181+
def test_compiled_sql_avoids_user_variable_counter(self, session):
182+
"""Regression sentinel: must not use @counter user-variable.
183+
184+
MySQL leaves the evaluation order of expressions involving user
185+
variables undefined. The original impl used `(SELECT @counter := COUNT(*)
186+
FROM tbl) t_count` cross-joined with the data table, then read @counter
187+
in the outer WHERE — making the result depend on optimizer choices.
188+
"""
189+
compiled = str(_compile_median_fn(session, "value", "test_data", 0.50))
190+
assert "@counter" not in compiled
191+
assert "COUNT(*) OVER ()" in compiled
192+
193+
def test_median_non_correlated_deterministic_across_runs(self, session):
194+
"""Same query 10x must return the same value — pre-fix this flipped."""
195+
compiled = _compile_median_fn(session, "value", "test_data", 0.50)
196+
results = {
197+
session.execute(text(f"SELECT {compiled} AS median_val FROM test_data LIMIT 1")).scalar() for _ in range(10)
198+
}
199+
assert len(results) == 1, f"non-deterministic: got {results}"

0 commit comments

Comments
 (0)