Add native CDC support#13287
Conversation
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
tclinkenbeard-oai
left a comment
There was a problem hiding this comment.
What is it trying to do?
This adds native CDC support: durable named CDC streams over registered key ranges, CDC pseudo-tag routing in the commit path, CDC proxies that peek/filter/buffer/serve/ack/pop TLog data, shared-tag safe-pop semantics, recovery/failover support, and simulation coverage. Bindings are intentionally deferred.
Is it correct?
By inspection, yes. I traced the registration/removal metadata path, commit-path CDC tag injection, clear-range clipping, proxy buffering and delivery frontiers, durable acknowledgement reconciliation, active/retired tag popping, recovery retention, proxy recruitment, and primary/remote/satellite TLog propagation.
I also checked the earlier AI-review findings against the current head. The previously reported high/medium issues appear fixed or explicitly documented as intended behavior. I did not find a concrete correctness regression.
Are there bugs?
No confirmed bugs found.
Are there omissions?
No blocking omissions. The test coverage appears broad by inspection: lifecycle behavior, clear clipping, proxy replacement, memory bounds, durable acknowledgement scanning, retired-tag cleanup/recovery, disabled restart/drain, and satellite configurations.
I did not run builds, tests, simulations, or other validation commands.
Are there better ways of doing things?
The remaining notable limitation is scalability: registration, safe-pop reconstruction, and assignment publication still scan global CDC metadata in single transactions. The design already identifies this as a scaling boundary. A future follow-up could maintain per-tag aggregate safe-pop state and an indexed assignment structure so these paths become incremental.
Should this CL be LGTMd?
Yes, subject to the pending public CI completing successfully.
Current public CI has no failures: clang-format and Test Boost CONFIG Mode on Windows pass; the default, clang, clang-arm, clang-ide, macOS, macOS_m1, and cluster-test builders are still pending.
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
|
@gxglass I mostly added new test code since the last review, but re-requesting review now while I run another filtered 100k correctness test ensemble, because the PR has grown more |
Result of foundationdb-pr-clang-arm on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
|
Here's the current review from local AI. This diff: |
Result of foundationdb-pr-macos on macOS 14.x
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr-clang-arm on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS 14.x
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
gxglass
left a comment
There was a problem hiding this comment.
Partial review comments follow. This round was essentially just a close second reading of the design doc with a couple Ctrl-F searches to discover knob values for clarity.
This is obviously a very good design doc. It describes the system from all important angles at the right level of detail to convince a reader that the choices are well motivated and can be pulled off. The design itself seems sound and fits in well with how FDB works.
Comments below reflect a few clarity nits and future-work-documentation suggestions, and a question or two. None of this gates LGTM or points to risks/gaps/etc. It's basically just doc enhancement related.
| `fdbclient/CDCProxyInterface.h`. | ||
|
|
||
| ```cpp | ||
| Future<CDCStreamId> registerNativeCdcStreamClient(Database cx, Key name, KeyRange keys); |
There was a problem hiding this comment.
It helps here and below if readers know that CDCStreamId is a uint64_t typedef (I had to look it up)
| per-stream in-memory buffer. | ||
|
|
||
| All raw peek windows and stream buffers owned by one CDC proxy share a | ||
| `CDC_PROXY_BUFFER_BYTES` budget. A replicated log read may retain one separately |
There was a problem hiding this comment.
mention production default 1GB
| ## Acknowledgement and tag popping | ||
|
|
||
| Acknowledgement is per stream, while TLog popping is per CDC tag. This | ||
| distinction is the core retention rule. |
There was a problem hiding this comment.
Nice
also: it's fifos all the way down
| retention decision. | ||
|
|
||
| Acknowledgement notifications are level-triggered and coalesced for at least | ||
| `CDC_PROXY_POP_MIN_INTERVAL`. A notification received during a durable-state |
There was a problem hiding this comment.
mention 100ms default for clarity
| 3. For a new name, it validates the feature knob. | ||
| 4. It allocates a new monotonically increasing `CDCStreamId`. | ||
| 5. It selects a CDC tag using current active stream counts. The allocator uses | ||
| the least populated tag among `NATIVE_CDC_TAG_COUNT` tags, choosing the |
There was a problem hiding this comment.
give production default (256) here
| differs. | ||
|
|
||
| `NATIVE_CDC_TAG_COUNT` controls the bounded tag pool used for new stream | ||
| allocation and must be between 1 and 65,536 inclusive. Invalid values reject |
There was a problem hiding this comment.
Prefer to discuss considerations about how a good default can be chosen. (Also it's totally unclear why somebody would want 64K tags.) I'm guessing the default is chosen by something like the following: it's roughly 10x more than the number of proxies we anticipate, so that load balancing of tags to processes ends up with proxies being roughly evenly balanced even if moderate (small mulitple) per-tag variance in write traffic exists.
| proxy throughput, buffer memory, lag, or number of active readers. | ||
| * Assignment mutations use one coalescing change key that wakes a full durable | ||
| ownership rescan. This is appropriate for low-rate control-plane changes but | ||
| should be sharded if registration and removal throughput becomes material. |
There was a problem hiding this comment.
API users should perhaps be aware of this -- maybe mention expected registrations/sec in the API example section much earlier
| response to load. A future implementation can use versioned tag history to | ||
| make such changes without losing the ability to read earlier tagged data. | ||
| * The native interface does not yet provide external binding support, | ||
| administrative tooling, or a higher-level consumer checkpoint abstraction. |
There was a problem hiding this comment.
Is there a command to force unregister a stream? If not maybe a short section on "Needed for production" stuff (which I think this would be?)
| abandoned consumer, but it would silently violate the stated retention | ||
| contract for a slow active stream. The initial design therefore requires an | ||
| acknowledgement or explicit removal before releasing required history and | ||
| treats administrative expiration policy as future work. |
There was a problem hiding this comment.
Somehow I'm guessing this is going to need to be invented before long, but of course it's fine to pipeline the work.
Scenario: CDC downstream/consumer system is having trouble, AND FDB performance under tlog spilling is borderline overloaded for a cluster in question. What do we want to happen? I'm guessing some manner of "turn off CDC and recompute downstream application state from a full scan of the database" once the downstream system gets generally unwedged. At least some of the time this would probably be desired. That said, maybe I'm wrong about TLOG performance / spilling being considered something to generally avoid. My impression is we try pretty hard to avoid that locally.
| acknowledgement lag, safe-pop distance, CDC-attributed TLog retention, and proxy | ||
| buffer pressure. The oldest required stream ID identifies the first consumer | ||
| to investigate. Because the initial implementation has no automatic stream | ||
| expiration, operators must repair the consumer, explicitly discard its |
There was a problem hiding this comment.
Mentioned above but some table of anticipated tooling to guide future development work may be helpful here.
gxglass
left a comment
There was a problem hiding this comment.
Review through fdbclient/.
The natural diff ordering here represents a reasonable review order: design/, fdbclient/, fdbserver/, fdbserver/workloads, tests/, so I'll just go in that order.
NOTE: a couple comments below say "uint64_t" when they mean to say "uint16_t".
| #include "fdbclient/CommitTransaction.h" | ||
| #include "flow/FileIdentifier.h" | ||
| #include "fdbrpc/fdbrpc.h" | ||
|
|
There was a problem hiding this comment.
minor: suggest a link to design/cdc.md re: this protocol & its messages
| Optional<Value> decodeTagLocalityListKey(KeyRef const&); | ||
| int8_t decodeTagLocalityListValue(ValueRef const&); | ||
|
|
||
| // Native CDC stream routing and lifecycle metadata persisted in the transaction state store. |
There was a problem hiding this comment.
this stuff is well described in the design doc so a comment link here too could be useful
|
|
||
| bool validNativeCdcTagCount(int tagCount) { | ||
| return tagCount > 0 && | ||
| static_cast<uint64_t>(tagCount) <= static_cast<uint64_t>(std::numeric_limits<uint16_t>::max()) + 1; |
There was a problem hiding this comment.
suggest a typedef for the uint64_t because it's used below and drift here would be bad
| class NativeCdcIdentifierAllocator { | ||
| bool sawStream = false; | ||
| CDCStreamId maxStreamId = 0; | ||
| std::unordered_map<uint16_t, uint32_t> tagStreamCounts; |
There was a problem hiding this comment.
suggest uint64_t ==> CDCTagId_t or something. also increases clarity a little
| } | ||
|
|
||
| void signalNativeCdcProxyAssignmentChange(Transaction* tr) { | ||
| // Assignment updates are low-rate control-plane operations. A single |
There was a problem hiding this comment.
could put a metric to count this to be able to confirm it's actually low rate
| const Key nameKey = cdcStreamNameKeyFor(name); | ||
| Optional<Value> currentId = co_await tr.get(nameKey); | ||
| if (!nativeCdcNameMatchesStream(currentId, streamId)) { | ||
| CODE_PROBE(currentId.present(), "Native CDC preserves a replacement stream during removal retry"); |
There was a problem hiding this comment.
I don't know how rare this is expected to be but logging something here may be useful
Result of foundationdb-pr-clang-arm on Linux RHEL 9
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr-macos on macOS 14.x
macOS stat flags, not linux compatiblelock_mtime() { stat -f %m $LOCK || echo 0 }
|
This PR extends FDB to support natively streaming mutations to tracked key ranges. Clients register CDC streams and these are tracked via CDC proxies, which peek from the log system and pop once acknowledgements are received from clients. New pseudo-tags are added to mutations by commit proxies in order to support this. Overlapping CDC streams are supported, and a mutation can't be safely popped until all CDC streams have sent corresponding acknowledgements. This PR does not add support for the CDC interface in the bindings, that is planned for a future PR.
Full documentation of the feature is added in
design/cdc.md.100k/100k filtered
*NativeCDC*correctness tests passed.