etcd(ticdc): update etcd cluster member correctly (#12646)#12651
etcd(ticdc): update etcd cluster member correctly (#12646)#12651ti-chi-bot wants to merge 6 commits into
Conversation
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
This cherry pick PR is for a release branch and has not yet been approved by triage owners. To merge this cherry pick:
DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Code Review
This pull request updates the etcd client by adjusting imports, making the health check function private, and improving the healthyChecker logic to explicitly close and remove stale clients. Feedback highlights two critical issues: the potential for an empty endpoint list to clear all cached clients and a logic error that could lead to a panic by accessing a client after it has been closed.
| } | ||
|
|
||
| func (checker *healthyChecker) update(eps []string) { | ||
| updateEps := make(map[string]struct{}, len(eps)) |
There was a problem hiding this comment.
If syncUrls fails (e.g., due to a transient network issue), it returns an empty slice. In this scenario, the update function will iterate over all existing clients in the healthyChecker and delete them because they are not present in the empty updateEps map. This wipes out all cached healthy clients and causes the main client to lose all its endpoints temporarily. It is safer to skip the update if the provided endpoint list is empty, as an etcd cluster should always have at least one voting member.
if len(eps) == 0 {
return
}
updateEps := make(map[string]struct{}, len(eps))| if time.Since(lastHealthy) > etcdServerOfflineTimeout { | ||
| log.Info("some etcd server maybe offline", zap.String("endpoint", ep)) | ||
| client.Close() | ||
| checker.Delete(ep) | ||
| } |
There was a problem hiding this comment.
When a client is identified as offline (exceeding etcdServerOfflineTimeout), it is closed and deleted from the checker. However, the code continues to the next if block which checks for etcdServerDisconnectedTimeout. Since the offline timeout is significantly larger than the disconnected timeout, this second block will also execute, calling SetEndpoints on the already closed client. This can lead to panics or unexpected behavior in the gRPC client. You should skip the disconnected check if the client was just deleted.
if time.Since(lastHealthy) > etcdServerOfflineTimeout {
log.Info("some etcd server maybe offline", zap.String("endpoint", ep))
client.Close()
checker.Delete(ep)
continue
}
This is an automated cherry-pick of #12646
What problem does this PR solve?
Issue Number: close #12368
What is changed and how it works?
When updating etcd client URLs, remove the client that is not a member of the etcd cluster.
Check List
Tests
Print only once
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note