Add more production observability metrics by tclinkenbeard-oai · Pull Request #13403 · apple/foundationdb

tclinkenbeard-oai · 2026-06-28T09:08:10Z

This PR adds low-overhead observability for production FDB latency and recovery investigations.

Adds commit batch flush reason counters to ProxyMetrics and ProxyDetailedMetrics.
Adds TLog histograms for serialized disk queue write and commit sizes.
Adds RecoveryState to periodic MasterRecoveryMetrics.

These metrics help correlate slow ToTlog paths with TLog payload sizes, distinguish why commit batches flush, and add recovery phase context without introducing per-request tracing.

The selected metrics were motivated by gaps identified in real cluster investigations.

foundationdb-ci · 2026-06-28T09:29:05Z

Result of foundationdb-pr-clang-ide on Linux RHEL 9

Commit ID: 3bd2129
Duration 0:20:45
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-28T09:39:57Z

Result of foundationdb-pr-macos-m1 on macOS 14.x

Commit ID: 3bd2129
Duration 0:31:36
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-28T09:52:53Z

Result of foundationdb-pr-clang-arm on Linux RHEL 9

Commit ID: 3bd2129
Duration 0:44:31
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-28T09:55:30Z

Result of foundationdb-pr-macos on macOS 14.x

Commit ID: 3bd2129
Duration 0:47:09
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-28T09:58:05Z

Result of foundationdb-pr-clang on Linux RHEL 9

Commit ID: 3bd2129
Duration 0:49:42
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

tclinkenbeard-oai

Generated by Codex.

What is it trying to do?

This adds low-overhead production observability in three places: commit-proxy counters for commit-batch flush reasons, TLog histograms for serialized DiskQueue write and commit sizes, and RecoveryState on periodic MasterRecoveryMetrics.

Is it correct?

From inspection, yes. The commit-batcher helper preserves the existing send/reset behavior and records one reason at each flush site. The TLog path measures the exact serialized buffer passed to IDiskQueue::push, accumulates it between commits, and routes all four push sites through the helper while preserving the existing queue-size threshold accounting. RecoveryState is initialized before the trace callback can run and is updated by the existing recovery state machine.

There are no new wire or persisted-format changes; the output changes are additive metrics. I did not run build or test validation. On the current head, clang-format and Windows Boost CONFIG are green; FoundationDB builders and cluster tests are still pending, with no visible failures.

Are there bugs?

I did not find any correctness bugs. I specifically checked compatibility boundaries, hot-path cost, counter units/populations, async lifetimes, removed invariants, and whether the changed metric paths are fully covered. The removed trigger metric is not hiding a live producer: triggerCommit has no true-setting site in the current public source.

Are there omissions?

No blocking omissions. There is no targeted test for the new metric values or flush-reason classification, so the remaining risk is observability semantics rather than data-path correctness.

Are there better ways of doing things?

No material alternative. Keeping TLog accounting in one helper avoids duplicating the same size and threshold logic at four push sites, and the explicit flush-reason enum keeps the counter mapping clear.

Should this CL be LGTMd?

I would wait for the current FoundationDB builders and cluster tests to finish. Subject to that, yes, this is LGTM from code inspection: I inspected all four changed files, traced the TLog push sites and recovery-state writes, and found no blocking correctness issue.

foundationdb-ci · 2026-06-28T10:05:26Z

Result of foundationdb-pr on Linux RHEL 9

Commit ID: 3bd2129
Duration 0:57:03
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-28T10:09:56Z

Result of foundationdb-pr-clang-ide on Linux RHEL 9

Commit ID: d4f56a6
Duration 0:20:28
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-28T10:10:35Z

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

Commit ID: 3bd2129
Duration 1:02:14
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci · 2026-06-28T10:20:46Z

Result of foundationdb-pr-macos-m1 on macOS 14.x

Commit ID: d4f56a6
Duration 0:31:18
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-28T10:34:20Z

Result of foundationdb-pr-clang-arm on Linux RHEL 9

Commit ID: d4f56a6
Duration 0:44:55
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-28T10:36:56Z

Result of foundationdb-pr-macos on macOS 14.x

Commit ID: d4f56a6
Duration 0:47:27
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-28T10:44:22Z

Result of foundationdb-pr-clang on Linux RHEL 9

Commit ID: d4f56a6
Duration 0:54:54
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-28T10:48:33Z

Result of foundationdb-pr on Linux RHEL 9

Commit ID: d4f56a6
Duration 0:59:04
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-28T10:52:27Z

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

Commit ID: d4f56a6
Duration 1:03:01
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

Add FDB production observability metrics

3bd2129

Remove dead commit batch trigger metric

d4f56a6

tclinkenbeard-oai commented Jun 28, 2026

View reviewed changes

Uh oh!

Conversation

tclinkenbeard-oai commented Jun 28, 2026

Uh oh!

foundationdb-ci commented Jun 28, 2026

Result of foundationdb-pr-clang-ide on Linux RHEL 9

Uh oh!

foundationdb-ci commented Jun 28, 2026

Result of foundationdb-pr-macos-m1 on macOS 14.x

Uh oh!

foundationdb-ci commented Jun 28, 2026

Result of foundationdb-pr-clang-arm on Linux RHEL 9

Uh oh!

foundationdb-ci commented Jun 28, 2026

Result of foundationdb-pr-macos on macOS 14.x

Uh oh!

foundationdb-ci commented Jun 28, 2026

Result of foundationdb-pr-clang on Linux RHEL 9

Uh oh!

tclinkenbeard-oai left a comment

Choose a reason for hiding this comment

What is it trying to do?

Is it correct?

Are there bugs?

Are there omissions?

Are there better ways of doing things?

Should this CL be LGTMd?

Uh oh!

foundationdb-ci commented Jun 28, 2026

Result of foundationdb-pr on Linux RHEL 9

Uh oh!

foundationdb-ci commented Jun 28, 2026

Result of foundationdb-pr-clang-ide on Linux RHEL 9

Uh oh!

foundationdb-ci commented Jun 28, 2026

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

Uh oh!

foundationdb-ci commented Jun 28, 2026

Result of foundationdb-pr-macos-m1 on macOS 14.x

Uh oh!

foundationdb-ci commented Jun 28, 2026

Result of foundationdb-pr-clang-arm on Linux RHEL 9

Uh oh!

foundationdb-ci commented Jun 28, 2026

Result of foundationdb-pr-macos on macOS 14.x

Uh oh!

foundationdb-ci commented Jun 28, 2026

Result of foundationdb-pr-clang on Linux RHEL 9

Uh oh!

foundationdb-ci commented Jun 28, 2026

Result of foundationdb-pr on Linux RHEL 9

Uh oh!

foundationdb-ci commented Jun 28, 2026

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants