etcd(ticdc): update etcd cluster member correctly (#12646)#12649
etcd(ticdc): update etcd cluster member correctly (#12646)#12649ti-chi-bot wants to merge 6 commits into
Conversation
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
This cherry pick PR is for a release branch and has not yet been approved by triage owners. To merge this cherry pick:
DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Code Review
This pull request updates the etcd client by making the health check function private, removing the AutoSyncInterval configuration, and refining the healthyChecker to properly close and remove stale or offline clients. Feedback identifies a risk where an empty endpoint list could lead to the deletion of all healthy clients, suggesting an early return to preserve state. Additionally, it was noted that the timeout logic should use an else-if structure to prevent performing operations on clients that have already been closed and deleted.
| } | ||
|
|
||
| func (checker *healthyChecker) update(eps []string) { | ||
| updateEps := make(map[string]struct{}, len(eps)) |
There was a problem hiding this comment.
If eps is empty (which occurs if syncUrls fails to list members), this function will proceed to iterate over all existing clients in the checker and delete them because they won't be present in the empty updateEps map. This can lead to a complete loss of the healthy checker state due to transient etcd errors. Since an etcd cluster used by TiCDC should always have members, it is safer to return early if no endpoints are provided to preserve the current state.
if len(eps) == 0 {
return
}
updateEps := make(map[string]struct{}, len(eps))| if time.Since(lastHealthy) > etcdServerOfflineTimeout { | ||
| log.Info("some etcd server maybe offline", zap.String("endpoint", ep)) | ||
| client.Close() | ||
| checker.Delete(ep) | ||
| } | ||
| if time.Since(lastHealthy) > etcdServerDisconnectedTimeout { | ||
| // try to reset client endpoint to trigger reconnect | ||
| client.(*healthyClient).Client.SetEndpoints([]string{}...) | ||
| client.(*healthyClient).Client.SetEndpoints(ep) | ||
| client.Client.SetEndpoints([]string{}...) | ||
| client.Client.SetEndpoints(ep) | ||
| } |
There was a problem hiding this comment.
The second if condition for etcdServerDisconnectedTimeout (1m) will always be true if the first if for etcdServerOfflineTimeout (30m) is met. This results in calling SetEndpoints on a client that has already been closed and deleted from the checker. Using an else if ensures that these operations are mutually exclusive and avoids unnecessary operations on a closed client.
if time.Since(lastHealthy) > etcdServerOfflineTimeout {
log.Info("some etcd server maybe offline", zap.String("endpoint", ep))
client.Close()
checker.Delete(ep)
} else if time.Since(lastHealthy) > etcdServerDisconnectedTimeout {
// try to reset client endpoint to trigger reconnect
client.Client.SetEndpoints([]string{}...)
client.Client.SetEndpoints(ep)
}
This is an automated cherry-pick of #12646
What problem does this PR solve?
Issue Number: close #12368
What is changed and how it works?
When updating etcd client URLs, remove the client that is not a member of the etcd cluster.
Check List
Tests
Print only once
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note