sqlmodel(dm): materialize generated columns for unique expression indexes#12697
sqlmodel(dm): materialize generated columns for unique expression indexes#12697joechenrh wants to merge 51 commits into
Conversation
|
Skipping CI for Draft Pull Request. |
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
856af31 to
739c846
Compare
739c846 to
6bb0cc4
Compare
…lity A functional/expression UNIQUE index is backed by a hidden virtual generated column that is never present in the binlog row image. DM builds the row-value slice from non-hidden columns only, so the hidden column's offset equals len(values); getCausalityString then indexed the row-value slice by that offset and panicked the whole worker with 'index out of range' (issue pingcap#12696). Rather than dropping the index, compute its value so it still takes part in conflict detection: - WhereHandle now exposes CausalityIdxs (all UNIQUE indexes, incl. those backed by a hidden column) for causality, while UniqueIdxs keeps only WHERE-usable indexes (a hidden column is not addressable in a WHERE clause). - getCausalityString materializes hidden/virtual generated columns by evaluating their generated expression (the same machinery the expression filter uses), so an expression index produces a correct causality key. Two rows that collide on the expression value share the key and are serialized. If a value cannot be computed, the index is skipped instead of panicking. The generated expressions depend only on the schema, so they are built once and cached on the per-table WhereHandle (sync.Once, like foreignKeyInitOnce); only Eval runs per row. A DDL drops the cached DownstreamTableInfo (Tracker.RemoveDownstreamSchema) and rebuilds the WhereHandle, so the cache is invalidated together with CausalityIdxs. ref pingcap#12696 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: joechenrh <joechenrh@gmail.com>
6bb0cc4 to
7730fb5
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files
Flags with carried forward coverage won't be shown. Click here to find out more. @@ Coverage Diff @@
## master #12697 +/- ##
================================================
+ Coverage 53.6344% 57.2430% +3.6085%
================================================
Files 1038 524 -514
Lines 146348 69591 -76757
================================================
- Hits 78493 39836 -38657
+ Misses 61948 26858 -35090
+ Partials 5907 2897 -3010 🚀 New features to boost your workflow:
|
ingress-bot
left a comment
There was a problem hiding this comment.
This review was generated by AI and should be verified by a human reviewer.
Manual follow-up is recommended before merge.
Summary
- Total findings: 12
- Inline comments: 11
- Summary-only findings (no inline anchor): 0
Findings (highest risk first)
⚠️ [Major] (1)
- Materialized expression-index values break equality check (decimal false positive, bit/binary panic) (pkg/sqlmodel/causality.go:301, pkg/sqlmodel/reduce.go:109, pkg/sqlmodel/reduce.go:151)
🟡 [Minor] (9)
- sql_mode now feeds expression-index evaluation but no test covers sql_mode/collation-sensitive expressions (dm/syncer/syncer.go:450, pkg/sqlmodel/where_handle.go:90, pkg/sqlmodel/causality_test.go:172, pkg/sqlmodel/reduce_test.go:351)
- Local variable
pkAndUksis a misnomer after reassignment tocausalityIdxs(pkg/sqlmodel/causality.go:211) - Hidden generated columns are re-materialized redundantly per row for expression-index tables (pkg/sqlmodel/causality.go:219, pkg/sqlmodel/causality.go:254, pkg/sqlmodel/reduce.go:141)
- Per-row Warn logging on materialization failure can flood logs on the replication hot path (pkg/sqlmodel/causality.go:289, pkg/sqlmodel/causality.go:271)
datumValueconverts[]bytetostringwithout a comment explaining the value-equality invariant it protects (pkg/sqlmodel/causality.go:301, pkg/sqlmodel/reduce.go:115)- Parallel inline policies for handling expression-index materialization failure diverge in granularity across two call sites (pkg/sqlmodel/reduce.go:144, pkg/sqlmodel/causality.go:228)
- Comment on UniqueIdxs contradicts itself: 'every UNIQUE index' vs. 'hidden-column indexes excluded' (pkg/sqlmodel/where_handle.go:39)
- WhereHandle.UniqueIdxs field name misrepresents its contents after the expression-index exclusion (pkg/sqlmodel/where_handle.go:41, pkg/sqlmodel/reduce.go:136)
getOrBuildExprsaccepts a session context parameter that is ignored on every call after the first (pkg/sqlmodel/where_handle.go:69, pkg/sqlmodel/causality.go:264)
🧹 [Nit] (2)
- SQL mode extraction in
Syncer.Initlacks a comment explaining its connection to expression index evaluation (dm/syncer/syncer.go:450) - TODO in generatedColumnExprContext lacks a ticket reference and a concrete removal condition (pkg/sqlmodel/where_handle.go:93)
|
/test pull-syncdiff-integration-test |
ingress-bot
left a comment
There was a problem hiding this comment.
This review was generated by AI and should be verified by a human reviewer.
Manual follow-up is recommended before merge.
Summary
- Total findings: 7
- Inline comments: 6
- Summary-only findings (no inline anchor): 1
Findings (highest risk first)
🟡 [Minor] (4)
- Stale 'PK/UK' comment in
getCausalityStringfallback path was not updated with the rest of the function (pkg/sqlmodel/causality.go:245) - End-to-end expression-index coverage silently no-ops on pre-8.0.13 upstream (dm/tests/all_mode/data/db1.increment.sql:32, dm/tests/all_mode/run.sh)
fillVirtualGeneratedValuesdoc comment does not describe what theboolreturn signals (pkg/sqlmodel/causality.go:255)- WhereHandle's
causalityIdxs ⊇ UniqueIdxsinvariant is undocumented and maintained only by append order in GetWhereHandle (pkg/sqlmodel/where_handle.go:131, pkg/sqlmodel/reduce.go:136)
ℹ️ [Info] (1)
- Generated-column materialization allocates a full row plus datum slice and MutRow per call with no reuse (pkg/sqlmodel/causality.go:277, pkg/sqlmodel/causality.go:282)
🧹 [Nit] (2)
- Test helper
corruptHiddenGeneratedExpris defined incausality_test.gobut called fromreduce_test.go(pkg/sqlmodel/reduce_test.go:172, pkg/sqlmodel/causality_test.go:302) - TODO in
generatedColumnExprContextlacks an issue reference and removal condition (pkg/sqlmodel/where_handle.go:92)
Unanchored findings
🟡 [Minor] (1)
- Stale 'PK/UK' comment in
getCausalityStringfallback path was not updated with the rest of the function- Request: Update the fallback comment at lines 245–246 to match the new terminology and account for the expression-index skip path: e.g., '// No causality index produced a non-null key: all indexes were NULL, or expression-index values could not be materialized. Fall back to the full row.'
| id int primary key, | ||
| name varchar(64) | ||
| ); | ||
| /*!80013 alter table t_expr_unique add unique key uk_lower_name ((lower(name))) */; |
There was a problem hiding this comment.
maybe add another test where the UK is inside the table definition
There was a problem hiding this comment.
Added in 4ce9dcf (work for MySQL >= 8.0.13, TiDB)
| // Use causality indexes only when both row images are fully materialized. | ||
| preValues, preOK = r.fillVirtualGeneratedValues(r.preValues) | ||
| postValues, postOK = r.fillVirtualGeneratedValues(r.postValues) | ||
| if preOK && postOK { |
There was a problem hiding this comment.
will one of the oks be false happen? in that case some indices inside causalityIdxs is not considered
There was a problem hiding this comment.
Yes, acutually it's the same problem as the below comment. I think that if such a situation occurs, DML will also likely fail to execute (like some incompatibility between upstream and downstream).
| for _, col := range c.columns { | ||
| e, err := expression.ParseSimpleExprWithTableInfo(c.exprCtx, col.GeneratedExprString, c.sourceTableInfo) | ||
| if err != nil { | ||
| log.Warn("cannot build generated column expression, its index will be skipped for causality", |
There was a problem hiding this comment.
we cannot safely ignore this error for non-trivial expression based UK, maybe expose it
There was a problem hiding this comment.
I agree that ignoring the error might not be safe. But exposing this requires big changes with related APIs.
Besides, since the table schema has already been tracked by DM, the chance of failing to build the generated expression should be low. If expression building or evaluation does fail, it likely indicates a schema/session mismatch, and the corresponding downstream DML may also fail. So I prefer to keep the current behavior in this PR.
There was a problem hiding this comment.
maybe add a comment to explain this
There was a problem hiding this comment.
Done in b282753 . Added a comment on the expression build failure path to document the current degraded behavior: the error cannot be surfaced through the existing causality scheduling API, so we skip the hidden-column index even though it can reduce causality. The comment also records why this is expected to be rare after DM has tracked the DDL.
|
/check-issue-triage-complete |
D3Hunter
left a comment
There was a problem hiding this comment.
Summary
- Total findings: 5
- Inline comments: 4
- Summary-only findings (no inline anchor): 1
Findings (highest risk first)
⚠️ [Major] (3)
- Expression-only unique tables can still build invalid WHERE predicates (
Fallback WHERE construction inwhereColumnsAndValuesinpkg/sqlmodel/row_change.go``) - Schema-level WhereHandle now owns session-scoped expression state (
pkg/sqlmodel/where_handle.go:68) - Expression-index causality fails open when materialization fails (
pkg/sqlmodel/where_handle.go:76, pkg/sqlmodel/causality.go:265)
🟡 [Minor] (1)
- Expression-index integration coverage depends on MySQL version comments (
dm/tests/all_mode/data/db1.increment.sql:44)
🧹 [Nit] (1)
- TODO omits tracking and removal condition (
pkg/sqlmodel/where_handle.go:98)
Unanchored findings
⚠️ [Major] (1)
- Expression-only unique tables can still build invalid WHERE predicates
- Scope:
Fallback WHERE construction inwhereColumnsAndValuesinpkg/sqlmodel/row_change.go`` - Request: Can we filter hidden generated columns out of the fallback WHERE column list, or otherwise make the fallback use the same visible-column projection as
ColumnCount, and add a case for UPDATE/DELETE on a table whose only unique key is a functional index and has no PK or ordinary unique key?
- Scope:
| } | ||
| } | ||
|
|
||
| // getOrBuildExprs uses tiSessionCtx only on the first cache build. |
There was a problem hiding this comment.
⚠️ [Major] Schema-level WhereHandle now owns session-scoped expression state
Why
WhereHandle is cached as part of DownstreamTableInfo, but the new generatedColumnExprCache builds and stores ExprContext/expression values from the first tiSessionCtx that reaches getOrBuildExprs. That makes causality and safe-mode identity checks depend on hidden first-call state instead of an explicit row/session-owned materializer.
Scope
pkg/sqlmodel/where_handle.go:68
Risk if unchanged
A cached handle can be reused across row changes and DML worker goroutines with evaluator state tied to an earlier session context, so future session-sensitive expression indexes or caller paths can silently use stale SQL mode/charset/eval state or share mutable evaluator state concurrently.
Evidence
GetWhereHandle is documented as cacheable and stores WhereHandle in DownstreamTableInfo (dm/pkg/schema/tracker.go:85, dm/pkg/schema/tracker.go:502). The new cache keeps exprCtx/exprs behind sync.Once using the first tiSessionCtx (pkg/sqlmodel/where_handle.go:46, pkg/sqlmodel/where_handle.go:68). RowChanges are assigned that cached handle (dm/syncer/dml.go:169), and DML workers run multiple goroutines that can call IsPrimaryOrUniqueKeyUpdated through the same handle (dm/syncer/dml_worker.go:146, dm/syncer/dml_worker.go:403).
Change request
Maybe move this session-dependent materializer out of the schema-level WhereHandle, or keep only session-independent parsed metadata there and build/key the EvalContext by explicit session settings at the RowChange call site. If the shared cache is intentional, please add a short why-comment and a focused test/guard for first-caller session context and concurrent worker use.
There was a problem hiding this comment.
Done in 225b331 . I kept the cache on the per-table WhereHandle intentionally.
The main reason is to avoid rebuilding/parsing generated-column expressions on every row, which would put expression-index tables on a hot causality path.
For the session-state concern: the handle is rebuilt when DDL invalidates the downstream schema cache, and within one Syncer lifetime DM builds one s.sessCtx from the downstream apply SQL mode/timezone and reuses it for all RowChanges. Runtime changes to global SQL mode/timezone are not reflected in existing DM sessions anyway. I added a short comment documenting that lifecycle assumption.
| c.exprCtx = generatedColumnExprContext(tiSessionCtx) | ||
| exprs := make(map[int]expression.Expression) | ||
| for _, col := range c.columns { | ||
| e, err := expression.ParseSimpleExprWithTableInfo(c.exprCtx, col.GeneratedExprString, c.sourceTableInfo) |
There was a problem hiding this comment.
⚠️ [Major] Expression-index causality fails open when materialization fails
Why
When generated-column expression parsing or evaluation fails, the new path logs and skips the hidden-column index instead of returning an actionable error or falling back to a conservative conflict key.
Scope
pkg/sqlmodel/where_handle.go:76, pkg/sqlmodel/causality.go:265
Risk if unchanged
Rows with different primary keys but the same unique expression-index value can be scheduled concurrently after a schema/session mismatch or unsupported expression, which can still produce transient downstream unique-key conflicts during upgrade or compatibility-sensitive replication.
Evidence
getOrBuildExprs returns ok=false after a parse error and comments that the index will be skipped for causality; fillVirtualGeneratedValues propagates fullValuesOK=false, and getCausalityString then skips only hidden-column indexes while retaining other keys such as the PK. The fallback test asserts only the PK key remains for an expression-index table, so this is an intentional silent degradation rather than fail-closed behavior.
Change request
Can we fail closed for this path, for example by using a conservative whole-row/table-level causality key for tables with unmaterializable hidden expression indexes, or by surfacing an actionable error that tells operators the expression index cannot be used safely?
There was a problem hiding this comment.
I agree that ignoring this error is not ideal, but I prefer to keep the existing degraded behavior in this PR.
Exposing it as an error would require changing the current causality APIs and the scheduler call path. Also, by this point DM has already tracked the DDL and built the hidden generated-column metadata from the downstream schema. So a generated-expression build failure should be rare, and if it does happen it likely indicates a schema/session mismatch that may also make the corresponding downstream DML fail.
I added a comment in the build-failure path earlier to document this existing behavior and the correctness tradeoff: b282753 .
| id int primary key, | ||
| name varchar(64) | ||
| ); | ||
| /*!80013 alter table t_expr_unique add unique key uk_lower_name ((lower(name))) */; |
There was a problem hiding this comment.
🟡 [Minor] Expression-index integration coverage depends on MySQL version comments
Why
The added all_mode fixture wraps the functional-index DDL and one full table scenario in /*!80013 ... */, so older upstream versions can ignore the exact statements intended to exercise hidden generated columns.
Scope
dm/tests/all_mode/data/db1.increment.sql:44
Risk if unchanged
The integration case may pass without validating the new expression-index causality behavior in environments where the version comment is ignored, reducing deterministic evidence for the compatibility-sensitive path.
Evidence
t_expr_unique is created unconditionally, but the unique functional index is guarded by /*!80013 ... */; t_expr_unique_inline and all its rows are also guarded. If the source does not execute 8.0.13 version comments, the test either exercises ordinary primary-key replaces only or skips the inline table entirely.
Change request
Please add a deterministic guard or assertion for this path, such as checking the source supports functional indexes before running the case, splitting it into a MySQL-8.0.13+ case, or adding a validation that the functional index actually exists before relying on the integration result.
There was a problem hiding this comment.
Done in 225b331 . I added a guard after loading the incremental SQL.
If the upstream does not execute the /*!80013 */ functional-index statements, the test prints an explicit skip message. If the case is enabled, it asserts that both expression-index fixtures have uk_lower_name in information_schema.statistics, so the integration result no longer silently relies on a skipped version-comment path.
| func generatedColumnExprContext(tiSessionCtx sessionctx.Context) *exprstatic.ExprContext { | ||
| vars := tiSessionCtx.GetSessionVars() | ||
| charset, collation := vars.GetCharsetInfo() | ||
| // TODO(joechenrh): Carry downstream charset/collation for collation-sensitive expression indexes. |
There was a problem hiding this comment.
🧹 [Nit] TODO omits tracking and removal condition
Scope
pkg/sqlmodel/where_handle.go:98
Change request
Please add an issue/reference and the condition for removing this TODO, or convert it into a short limitation comment if there is no planned follow-up.
There was a problem hiding this comment.
I kept this as a TODO intentionally. The missing piece is carrying downstream charset/collation into this expression context; once DM passes those session settings here, this TODO can be removed.
For this PR, SQL mode/timezone are already handled by the fixed downstream apply session context.
|
@joechenrh: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/retest |
What problem does this PR solve?
Issue Number: close #12696
A
dm-workerpanics withindex out of rangeand crash-loops when replicating a table that has an expression-based (functional) UNIQUE index. A single affected row takes down the worker process; after restart, the task replays the same binlog row and panics again, so it does not self-recover.Skipping UNIQUE expression indexes would avoid the panic, but it is not correctness-preserving. For example, two
REPLACEstatements that both hitUNIQUE KEY ((lower(name)))must keep their upstream order. If(20, 'ALICE')is replicated before(10, 'Alice'), the final downstream row can become(10, 'Alice')instead of the upstream final row(20, 'ALICE').What is changed and how it works?
This PR keeps expression UNIQUE indexes in DM causality while keeping hidden columns out of WHERE generation.
UniqueIdxsremains the visible-column index set used for WHERE clauses.ColumnInfo.Offset, so hidden columns interleaved by DDL are handled.Check List
Tests
go test ./pkg/sqlmodelgo test ./pkg/sqlmodel -run 'TestCausalityKeysInterleavedHiddenColumn|TestPrimaryOrUniqueKeyUpdatedInterleavedHiddenColumn' -count=1go test ./dm/syncer -run '^$'go test -race ./pkg/sqlmodel -run 'TestCausalityKeysExpressionIndexNoRace|TestPrimaryOrUniqueKeyUpdatedWithExpressionIndex|TestPrimaryOrUniqueKeyUpdatedExpressionIndexMaterializeFailure|TestCausalityKeysExpressionIndexMaterializeFailure'dm/tests/all_mode, covering stored generated unique indexes,ALTER TABLE ... ADD UNIQUE KEY ((expr)), and inline functional unique indexes inCREATE TABLE(make dm_integration_test CASE=all_modenot run locally)Questions
Will it cause performance regression or break compatibility?
No compatibility break. The WHERE-index selection path still excludes hidden columns. Generated expression evaluation is only attempted for tables that have a hidden-column UNIQUE index; expressions are parsed once per table
WhereHandleand evaluated per row only when needed.Do you need to update user documentation, design documentation or monitoring documentation?
No.
Release note