Add more production observability metrics#13403
Conversation
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr-clang-arm on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS 14.x
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
tclinkenbeard-oai
left a comment
There was a problem hiding this comment.
Generated by Codex.
What is it trying to do?
This adds low-overhead production observability in three places: commit-proxy counters for commit-batch flush reasons, TLog histograms for serialized DiskQueue write and commit sizes, and RecoveryState on periodic MasterRecoveryMetrics.
Is it correct?
From inspection, yes. The commit-batcher helper preserves the existing send/reset behavior and records one reason at each flush site. The TLog path measures the exact serialized buffer passed to IDiskQueue::push, accumulates it between commits, and routes all four push sites through the helper while preserving the existing queue-size threshold accounting. RecoveryState is initialized before the trace callback can run and is updated by the existing recovery state machine.
There are no new wire or persisted-format changes; the output changes are additive metrics. I did not run build or test validation. On the current head, clang-format and Windows Boost CONFIG are green; FoundationDB builders and cluster tests are still pending, with no visible failures.
Are there bugs?
I did not find any correctness bugs. I specifically checked compatibility boundaries, hot-path cost, counter units/populations, async lifetimes, removed invariants, and whether the changed metric paths are fully covered. The removed trigger metric is not hiding a live producer: triggerCommit has no true-setting site in the current public source.
Are there omissions?
No blocking omissions. There is no targeted test for the new metric values or flush-reason classification, so the remaining risk is observability semantics rather than data-path correctness.
Are there better ways of doing things?
No material alternative. Keeping TLog accounting in one helper avoids duplicating the same size and threshold logic at four push sites, and the explicit flush-reason enum keeps the counter mapping clear.
Should this CL be LGTMd?
I would wait for the current FoundationDB builders and cluster tests to finish. Subject to that, yes, this is LGTM from code inspection: I inspected all four changed files, traced the TLog push sites and recovery-state writes, and found no blocking correctness issue.
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr-clang-arm on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS 14.x
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
This PR adds low-overhead observability for production FDB latency and recovery investigations.
ProxyMetricsandProxyDetailedMetrics.RecoveryStateto periodicMasterRecoveryMetrics.These metrics help correlate slow
ToTlogpaths with TLog payload sizes, distinguish why commit batches flush, and add recovery phase context without introducing per-request tracing.The selected metrics were motivated by gaps identified in real cluster investigations.