
syncer(dm): prevent checkpoint flush after BeginTx error #12644

Open
GMHDBJD wants to merge 1 commit into pingcap:master from GMHDBJD:tiflow1/fix-12626-checkpoint-begin-error

@GMHDBJD GMHDBJD commented May 19, 2026

Contributor

What problem does this PR solve?

Issue Number: close #12626

What is changed and how it works?

This PR fixes a DM correctness issue where checkpoint flush could still advance after a downstream transaction BeginTx failure.

Changes:

  • Introduce a shared downstream execution error predicate that includes ErrDBExecuteFailedBegin together with existing ErrDBExecuteFailed and ErrDBUnExpect.
  • Use the predicate in sync, async, and checkpoint flush worker guards so checkpoint flush is skipped after downstream begin failures.
  • Treat begin failures as downstream execution errors when flushing the safe-mode exit point on task exit.
  • Avoid running key-not-found diagnostics after ExecuteSQL already returned an execution error.
  • Add a regression test for ErrDBExecuteFailedBegin(sql.ErrConnDone) to ensure checkpoint flush is skipped.

Check List

Tests

  • Unit test
go test ./dm/syncer -run TestCheckpointFlushWorkerSkipsCheckpointOnBeginError -count=1
go test ./dm/syncer -run 'Test(CheckpointFlushWorkerSkipsCheckpointOnBeginError|JudgeKeyNotFound|EnableSafeModeInitializationPhase)$' -count=1

Also ran:

make fmt
git diff --check

Questions

Will it cause performance regression or break compatibility?

No. The change only broadens existing checkpoint-blocking error classification to include downstream transaction begin failures and suppresses a misleading diagnostic after failed execution.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

Fix a DM correctness issue that could flush checkpoints after downstream transaction begin failures.

@ti-chi-bot

ti-chi-bot Bot commented May 19, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gmhdbjd for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-triage-completed release-note Denotes a PR that will be considered when it comes time to generate release notes. area/dm Issues or PRs related to DM. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 19, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request centralizes downstream execution error handling by introducing the isDownstreamExecutionError helper function, which now includes terror.ErrDBExecuteFailedBegin in its checks. This ensures that checkpoint flushes are correctly skipped when a downstream transaction fails to begin, preventing potential data inconsistency. The changes also include a new unit test to verify this behavior and a safety check in dml_worker.go to ensure key-not-found logic only executes when no other errors are present. I have no feedback to provide as there were no review comments to assess.

@ti-chi-bot

ti-chi-bot Bot commented May 19, 2026

Contributor

@GMHDBJD: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-verify | ad6182a | link | true | /test pull-verify |
| pull-dm-integration-test | ad6182a | link | true | /test pull-dm-integration-test |
| pull-dm-integration-test-next-gen | ad6182a | link | false | /test pull-dm-integration-test-next-gen |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ingress-bot

🔍 Starting code review for this PR...

@ingress-bot ingress-bot left a comment

This review was generated by AI and should be verified by a human reviewer.
Manual follow-up is recommended before merge.

Summary

  • Total findings: 4
  • Inline comments: 2
  • Summary-only findings (no inline anchor): 1
Findings (highest risk first)

🟡 [Minor] (3)

  1. New test hand-rolls full CheckPoint mock instead of reusing the package's interface-embedding pattern (dm/syncer/checkpoint_flush_worker_test.go:70, dm/syncer/status_test.go:72, dm/syncer/safe_mode_test.go:30)
  2. Resume-safety predicate is only partially covered: no negative case and three of four call sites untested (dm/syncer/execution_error.go:19, dm/syncer/checkpoint_flush_worker_test.go:60, dm/syncer/syncer.go:1977)
  3. isDownstreamExecutionError name over-promises generality and lacks intent comment for its safety-critical allowlist (dm/syncer/execution_error.go:19, dm/syncer/syncer.go:1294, dm/syncer/checkpoint_flush_worker.go:79)

ℹ️ [Info] (1)

  1. Stale TODO comment claims any error blocks checkpoint flush, contradicting the selective gate (dm/syncer/syncer.go:1290, dm/syncer/syncer.go:1326, dm/syncer/checkpoint_flush_worker.go:75)

Unanchored findings

ℹ️ [Info] (1)

  1. Stale TODO comment claims any error blocks checkpoint flush, contradicting the selective gate
    • Request: Reword the comment to say checkpoint flush is skipped only for specific downstream execution failures (the isDownstreamExecutionError set), not for every error, so the stated limitation matches actual behavior.

require.Zero(t, cp.flushCount, "checkpoint flush must be skipped after downstream BeginTx failure; flushing here can persist a checkpoint past non-durable DML")
}

type checkpointFlushWorkerTestCheckpoint struct {

🟡 [Minor] New test hand-rolls full CheckPoint mock instead of reusing the package's interface-embedding pattern

Impact
Adds about 100 lines of boilerplate and a third divergent CheckPoint mock in the same package.
This one must be manually edited every time the CheckPoint interface grows a method, while the embedded-interface mocks do not, creating ongoing maintenance drift and compile-break risk for unrelated changes.

Scope

  • dm/syncer/checkpoint_flush_worker_test.go:70 checkpointFlushWorkerTestCheckpoint
  • dm/syncer/status_test.go:72 mockCheckpoint
  • dm/syncer/safe_mode_test.go:30 mockCheckpointForSafeMode

Evidence
mockCheckpoint and mockCheckpointForSafeMode embed the CheckPoint interface and override only the methods they exercise. checkpointFlushWorkerTestCheckpoint instead implements every CheckPoint method explicitly, yet the test's worker returns early via isDownstreamExecutionError so FlushPointsExcept (and every other method) is never actually called.

Change request
Follow the existing pattern: declare type checkpointFlushWorkerTestCheckpoint struct { CheckPoint; flushCount int } and keep only the FlushPointsExcept override that increments flushCount, dropping the ~25 unused method stubs.
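
A minimal sketch of the embedding pattern requested here, under stated assumptions: the `CheckPoint` interface below is a trimmed stand-in with invented signatures (the real interface in dm/syncer has many more methods and different parameters), so only the shape of the mock carries over.

```go
package main

import "fmt"

// CheckPoint is a trimmed stand-in for the syncer's CheckPoint interface.
// The method set and signatures here are assumptions for illustration.
type CheckPoint interface {
	FlushPointsExcept(exceptTables []string) error
	Rollback()
	Close()
}

// checkpointFlushWorkerTestCheckpoint embeds the interface and overrides only
// the method the test observes. Calling any other method panics on the nil
// embedded value, which flags an unexpected call during the test.
type checkpointFlushWorkerTestCheckpoint struct {
	CheckPoint
	flushCount int
}

// FlushPointsExcept records that a flush was attempted.
func (c *checkpointFlushWorkerTestCheckpoint) FlushPointsExcept(exceptTables []string) error {
	c.flushCount++
	return nil
}

func main() {
	cp := &checkpointFlushWorkerTestCheckpoint{}
	var _ CheckPoint = cp // still satisfies the full interface

	_ = cp.FlushPointsExcept(nil)
	fmt.Println(cp.flushCount) // prints 1
}
```

Because the struct inherits the interface's method set, adding a method to `CheckPoint` later does not break this mock at compile time.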


import "github.com/pingcap/tiflow/dm/pkg/terror"

func isDownstreamExecutionError(err error) bool {

2 findings on this line:

🟡 [Minor] Resume-safety predicate is only partially covered: no negative case and three of four call sites untested

Impact
The new test asserts checkpoint flush is skipped for one error class (ErrDBExecuteFailedBegin) at one of four call sites, and never asserts that a non-downstream error still flushes.
A degenerate predicate that returned true for every non-nil error (e.g. wrongly matching context.Canceled) would still pass this test, yet would silently stall checkpoint progress; conversely, dropping ErrDBExecuteFailed/ErrDBUnExpect or breaking the inverted safe-mode-exit branch at syncer.go:1977 would reintroduce the checkpoint-advances-past-uncommitted-DML data loss on resume without failing CI.

Scope

  • dm/syncer/execution_error.go:19 isDownstreamExecutionError
  • dm/syncer/checkpoint_flush_worker_test.go:60 TestCheckpointFlushWorkerSkipsCheckpointOnBeginError
  • dm/syncer/syncer.go:1977 Syncer.Run

Evidence
isDownstreamExecutionError guards four resume-critical branches with inverted meaning: it skips checkpoint flush in checkpointFlushWorker.Run, flushCheckPoints, and flushCheckPointsAsync, but triggers FlushSafeModeExitPoint at syncer.go:1977. The single test only exercises the worker skip path with ErrDBExecuteFailedBegin; ErrDBExecuteFailed, ErrDBUnExpect, the negative (non-matching) case, and the safe-mode-exit branch are unverified.

Change request
Add a focused unit test for isDownstreamExecutionError covering all three positive error classes plus at least one negative (e.g. a generic/context.Canceled error returning false), so the predicate contract that all four sites depend on is locked down independent of any one call site.


🟡 [Minor] isDownstreamExecutionError name over-promises generality and lacks intent comment for its safety-critical allowlist

Impact
The name reads as "any error from executing against downstream", but the function is a curated allowlist of exactly three terror codes that gate whether a checkpoint may advance past possibly non-durable DML.
A maintainer trusting the name could add or drop an error type, or assume other downstream DB failures are covered, and silently change checkpoint-durability behavior across all four call sites.

Scope

  • dm/syncer/execution_error.go:19 isDownstreamExecutionError
  • dm/syncer/syncer.go:1294 flushCheckPoints
  • dm/syncer/checkpoint_flush_worker.go:79 checkpointFlushWorker.Run

Evidence
isDownstreamExecutionError matches only ErrDBExecuteFailed, ErrDBExecuteFailedBegin, and ErrDBUnExpect; any other downstream DB error returns false. The helper is now the single source of truth for the "skip checkpoint flush" decision but carries no doc comment explaining why these specific codes (and notably the newly added ErrDBExecuteFailedBegin) are the durability-relevant set.

Change request
Add a doc comment on isDownstreamExecutionError stating the invariant it encodes (these errors mean downstream DML may not be durable, so the checkpoint must not advance) and why ErrDBExecuteFailedBegin belongs here. Consider a more intent-revealing name (for example one that signals it is a checkpoint-skip predicate rather than a generic "any downstream execution error" check).

Labels

area/dm Issues or PRs related to DM. do-not-merge/needs-triage-completed release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

[DM] checkpoint may be flushed after downstream BeginTx failure and skip unapplied DML on resume
