Skip to content

Add more production observability metrics#13403

Open
tclinkenbeard-oai wants to merge 2 commits into
apple:mainfrom
tclinkenbeard-oai:dev/tclinkenbeard/fdb-production-observability-metrics
Open

Add more production observability metrics#13403
tclinkenbeard-oai wants to merge 2 commits into
apple:mainfrom
tclinkenbeard-oai:dev/tclinkenbeard/fdb-production-observability-metrics

Conversation

@tclinkenbeard-oai

Copy link
Copy Markdown
Collaborator

This PR adds low-overhead observability for production FDB latency and recovery investigations.

  • Adds commit batch flush reason counters to ProxyMetrics and ProxyDetailedMetrics.
  • Adds TLog histograms for serialized disk queue write and commit sizes.
  • Adds RecoveryState to periodic MasterRecoveryMetrics.

These metrics help correlate slow ToTlog paths with TLog payload sizes, distinguish why commit batches flush, and add recovery phase context without introducing per-request tracing.

The selected metrics were motivated by gaps identified in real cluster investigations.

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 3bd2129
  • Duration 0:20:45
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS 14.x

  • Commit ID: 3bd2129
  • Duration 0:31:36
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux RHEL 9

  • Commit ID: 3bd2129
  • Duration 0:44:31
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos on macOS 14.x

  • Commit ID: 3bd2129
  • Duration 0:47:09
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 3bd2129
  • Duration 0:49:42
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@tclinkenbeard-oai tclinkenbeard-oai left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generated by Codex.

What is it trying to do?

This adds low-overhead production observability in three places: commit-proxy counters for commit-batch flush reasons, TLog histograms for serialized DiskQueue write and commit sizes, and RecoveryState on periodic MasterRecoveryMetrics.

Is it correct?

From inspection, yes. The commit-batcher helper preserves the existing send/reset behavior and records one reason at each flush site. The TLog path measures the exact serialized buffer passed to IDiskQueue::push, accumulates it between commits, and routes all four push sites through the helper while preserving the existing queue-size threshold accounting. RecoveryState is initialized before the trace callback can run and is updated by the existing recovery state machine.

There are no new wire or persisted-format changes; the output changes are additive metrics. I did not run build or test validation. On the current head, clang-format and Windows Boost CONFIG are green; FoundationDB builders and cluster tests are still pending, with no visible failures.

Are there bugs?

I did not find any correctness bugs. I specifically checked compatibility boundaries, hot-path cost, counter units/populations, async lifetimes, removed invariants, and whether the changed metric paths are fully covered. The removed trigger metric is not hiding a live producer: triggerCommit has no true-setting site in the current public source.

Are there omissions?

No blocking omissions. There is no targeted test for the new metric values or flush-reason classification, so the remaining risk is observability semantics rather than data-path correctness.

Are there better ways of doing things?

No material alternative. Keeping TLog accounting in one helper avoids duplicating the same size and threshold logic at four push sites, and the explicit flush-reason enum keeps the counter mapping clear.

Should this CL be LGTMd?

I would wait for the current FoundationDB builders and cluster tests to finish. Subject to that, yes, this is LGTM from code inspection: I inspected all four changed files, traced the TLog push sites and recovery-state writes, and found no blocking correctness issue.

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 3bd2129
  • Duration 0:57:03
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: d4f56a6
  • Duration 0:20:28
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 3bd2129
  • Duration 1:02:14
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS 14.x

  • Commit ID: d4f56a6
  • Duration 0:31:18
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux RHEL 9

  • Commit ID: d4f56a6
  • Duration 0:44:55
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos on macOS 14.x

  • Commit ID: d4f56a6
  • Duration 0:47:27
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: d4f56a6
  • Duration 0:54:54
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: d4f56a6
  • Duration 0:59:04
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: d4f56a6
  • Duration 1:03:01
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants