
docs: update hyperdrive observability metrics documentation #31729

Draft
ivoryibu wants to merge 1 commit into production from docs/hyperdrive-graphql-metrics


@ivoryibu

Collaborator

Clarifies how Hyperdrive connection pool metrics are reported, to reduce confusion from customers who see "Open Connections" plateau below their configured limit and assume the product is throttling them. This is a recurring escalation path where the pooling behavior is healthy but the per-machine metric presentation looks broken.

  • Add a "How Hyperdrive distributes your connection pool" subsection to Pool size metrics explaining that the pool is sharded across 5 machines per Cloudflare location, so the configured maximum is divided across them (for example, a limit of 100 gives each machine 20). This explains why max.currentPoolSize plateaus at the per-machine slice rather than the full limit even when healthy.
  • Note that pools start empty and reuse connections, so Hyperdrive opens few origin connections — addressing customers who expect to see connection counts climb toward their full limit.
  • Add a note clarifying that waitingClients is a point-in-time count, not a duration. A brief wait (typically milliseconds) registers as a spike, which customers misread as clients being stuck.
  • State that pool size metrics are currently per-machine and not aggregated, and tell customers to sum per-location values to estimate total usage. Phrased as current behavior to leave room for a future aggregate ("All locations") view.
  • Add context to the dashboard section noting the Pool connections chart shows per-machine values.
  • Add a GraphQL API note that hyperdrivePoolSizesAdaptiveGroups reports per-machine values grouped by coloCode, and that summing across coloCode groups is the closest available estimate of total usage.
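To illustrate the `waitingClients` point above: because the metric is a point-in-time count, a single non-zero sample says nothing about how long any client waited. A minimal sketch, assuming you have already collected a series of `waitingClients` samples, of how to separate a brief spike from sustained contention. The function name, the sample lists, and the three-sample threshold are hypothetical and not part of Hyperdrive.

```python
from typing import Sequence


def has_sustained_wait(samples: Sequence[int], min_consecutive: int = 3) -> bool:
    """Return True if waitingClients stayed above zero for at least
    min_consecutive samples in a row (hypothetical threshold)."""
    run = 0
    for count in samples:
        # Extend the current run on a non-zero sample, otherwise reset it.
        run = run + 1 if count > 0 else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical sample series, one value per reporting interval.
brief_spike = [0, 0, 4, 0, 0, 0]   # one momentary wait
sustained = [0, 2, 3, 5, 4, 1]     # clients waiting across many intervals

print(has_sustained_wait(brief_spike))  # False
print(has_sustained_wait(sustained))    # True
```

A single spike like the first series is the pattern the note describes as healthy; only the second would suggest contention worth investigating.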

Source: recent connection-limit escalations and the near-term documentation actions in the pool metrics PRD. No data pipeline or schema changes are involved.

@cloudflare-docs-bot

cloudflare-docs-bot Bot commented Jun 26, 2026

Contributor

Review

⚠️ 1 warning found in commit 1b543f5.

Code Review

This code review is in beta and may not always be helpful — use your judgment.

Warnings (1)
| File | Issue |
| --- | --- |
| hyperdrive/observability/metrics.mdx line 62 | Inconsistent aggregation guidance: Line 62 tells readers to estimate total connection usage by summing per-location values. Lines 50-54 and 89 state that pool metrics are reported per machine, with five machines per Cloudflare location, and that each returned value reflects a single machine rather than a location's full pool. If the query returns one value per coloCode, summing those values would undercount total usage by a factor of five. Fix: Clarify whether readers should sum all per-machine rows from the GraphQL response, or adjust the guidance to multiply by the number of machines per location when estimating total usage. |
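
The factor-of-five gap can be sketched with a small calculation. This assumes the query returns exactly one per-machine value per `coloCode` and that every machine in a location holds a similar number of connections; both are assumptions for illustration, and the sample values are hypothetical.

```python
MACHINES_PER_LOCATION = 5  # five machines per location, per the PR description


def naive_total(per_location_values):
    """Sum one per-machine value per location, as line 62 currently suggests."""
    return sum(per_location_values)


def scaled_total(per_location_values, machines=MACHINES_PER_LOCATION):
    """Scale each per-machine value by the machine count before summing.
    Assumes all machines in a location hold a similar number of connections."""
    return sum(value * machines for value in per_location_values)


# Hypothetical per-machine readings for two locations.
readings = [4, 6]

print(naive_total(readings))   # 10
print(scaled_total(readings))  # 50
```

If the response instead contains one row per machine, no scaling is needed and summing every row is the better estimate.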

Style Guide Review

✅ No style-guide issues found.

Commands

Only codeowners can run commands. Post a comment with the command to trigger it.

| Command | Description |
| --- | --- |
| /review | Runs a review now. Incremental if a prior review exists, full if not. |
| /full-review | Re-reviews the entire PR diff from scratch, ignoring incremental history. Useful after a rebase, when you want a fresh review, or if the bot gets out of sync and reports issues that no longer exist. |
| /ignore-review-limit | Permanently lifts the 2-review automatic limit for this PR. Future pushes will trigger reviews as normal. |

@github-actions github-actions Bot added the product:hyperdrive (Hyperdrive: https://developers.cloudflare.com/hyperdrive/) and size/s labels Jun 26, 2026
@github-actions

Contributor

This pull request requires reviews from CODEOWNERS as it changes files that match the following patterns:

| Pattern | Owners |
| --- | --- |
| /src/content/docs/hyperdrive/ | @elithrar, @rita3ko, @irvinebroque, @vy-ton, @ivoryibu, @thomasgauvin, @sejoker, @oxyjun, @knickish, @cloudflare/product-owners |

@ask-bonk ask-bonk Bot added the documentation (Documentation edits) and content:edit (Request for content edits) labels Jun 26, 2026
@ivoryibu ivoryibu marked this pull request as draft June 26, 2026 16:16

@ask-bonk ask-bonk Bot left a comment

Contributor

Review Summary

This is a clear, well-targeted docs change that directly addresses a recurring customer escalation path. The new How Hyperdrive distributes your connection pool subsection and the clarifications in the Dashboard and GraphQL sections do a good job of explaining why metrics can appear lower than expected.

What changed

  • Added an H4 subsection explaining per-machine pool sharding (five machines per Cloudflare location).
  • Clarified that pools start empty and reuse connections, so open connection counts remain low rather than climbing toward the full limit.
  • Added a :::note explaining that waitingClients is a point-in-time count, not a measure of wait duration.
  • Updated the Dashboard and GraphQL sections to state that metrics are per-machine and not aggregated.

Suggestions

I left a few non-blocking inline comments with suggestions to tighten wording and avoid potential confusion. Nothing that would block merging.

Overall, this looks good to merge after considering the inline suggestions.

Connection contention appears as a spike in waiting clients, or when open connections consistently approach the pool size maximum. If your open connections regularly approach this limit, consider [increasing your Hyperdrive connection limit](/hyperdrive/platform/limits/#request-a-limit-increase).
#### How Hyperdrive distributes your connection pool

Hyperdrive shards your connection pool across five machines within each Cloudflare location. Your configured maximum is divided evenly across these machines. For example, a connection limit of 100 gives each machine a maximum of 20 connections (100 / 5).

Contributor

Suggestion: "divided evenly" implies exact equal distribution for every possible limit. For limits not divisible by 5 (for example, 97), this may not be literally true. Consider softening to "divided across" or adding an "approximately" qualifier.
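
A minimal sketch of why the wording matters. The five-machine count and the 100 / 5 example come from the docs text; how Hyperdrive actually handles a remainder is not stated in this PR, so the "give the first machines one extra" rule below is purely a hypothetical placeholder.

```python
def per_machine_slices(limit: int, machines: int = 5) -> list[int]:
    """Split a connection limit across machines.

    The remainder handling (one extra connection for the first few
    machines) is an assumption for illustration only.
    """
    base, remainder = divmod(limit, machines)
    return [base + 1 if i < remainder else base for i in range(machines)]


print(per_machine_slices(100))  # [20, 20, 20, 20, 20]
print(per_machine_slices(97))   # [20, 20, 19, 19, 19]
```

With a limit of 97 no split is exactly even, which is why "divided across" or "approximately" reads more safely than "divided evenly".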


Connection contention appears as a sustained spike in waiting clients, or when per-machine open connections consistently approach the per-machine slice of your pool size maximum. If your open connections regularly approach this limit, consider [increasing your Hyperdrive connection limit](/hyperdrive/platform/limits/#request-a-limit-increase).

To estimate your total connection usage across a configuration, sum the per-location values returned by the [GraphQL Analytics API](#query-via-the-graphql-api).

Contributor

Suggestion: This says to "sum the per-location values," but the GraphQL section clarifies that each returned value is per-machine. This could be misread as summing one value per location. Consider rewording to something like: "To estimate your total connection usage across a configuration, sum all values returned by the GraphQL Analytics API."


Hyperdrive's GraphQL datasets require an `accountTag` filter with your Cloudflare account ID. Hyperdrive exposes the `hyperdriveQueriesAdaptiveGroups` and `hyperdrivePoolSizesAdaptiveGroups` datasets.

The `hyperdrivePoolSizesAdaptiveGroups` dataset reports per-machine values grouped by the `coloCode` dimension, rather than a single aggregated total. To approximate your total connection usage across all locations, sum the values across each `coloCode` group. This remains an estimate, because each value reflects a single machine rather than a location's full pool. Refer to [How Hyperdrive distributes your connection pool](#how-hyperdrive-distributes-your-connection-pool) for details.

Contributor

Suggestion: The instruction to "sum the values across each coloCode group" is slightly ambiguous about whether multiple entries per coloCode need to be summed. Consider making it explicit, for example: "To approximate your total connection usage across all locations, sum the values for every entry returned. Each entry reflects a single machine, so you can also group by coloCode first if you need a per-location estimate."
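
The two readings can be sketched as follows, assuming each entry returned by `hyperdrivePoolSizesAdaptiveGroups` has already been flattened into a dict holding its `coloCode` and `currentPoolSize`. The row shape, the location codes, and the values are hypothetical; the real response nests these fields differently.

```python
from collections import defaultdict

# Hypothetical flattened rows: one entry per machine, possibly several per coloCode.
rows = [
    {"coloCode": "SJC", "currentPoolSize": 4},
    {"coloCode": "SJC", "currentPoolSize": 3},
    {"coloCode": "LHR", "currentPoolSize": 5},
]


def total_usage(rows):
    """Sum every returned entry for an estimate across all locations."""
    return sum(row["currentPoolSize"] for row in rows)


def usage_by_colo(rows):
    """Group entries by coloCode first for a per-location estimate."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["coloCode"]] += row["currentPoolSize"]
    return dict(totals)


print(total_usage(rows))    # 12
print(usage_by_colo(rows))  # {'SJC': 7, 'LHR': 5}
```

Either way the result stays an estimate, since it only covers the machines that appear in the response.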

@ask-bonk

ask-bonk Bot commented Jun 26, 2026

Contributor

PR #31729 Review Complete

Labels applied: documentation, product:hyperdrive, content:edit

Summary

This PR updates src/content/docs/hyperdrive/observability/metrics.mdx to clarify how Hyperdrive reports connection pool metrics. It adds a new "How Hyperdrive distributes your connection pool" subsection explaining that pools are sharded across five machines per Cloudflare location, which causes per-machine metrics to plateau below the configured maximum. It also adds a note that waitingClients is a point-in-time count (not a duration), and updates the Dashboard and GraphQL sections to state that reported values are per-machine and not aggregated.

What I found

  • Title & description: The title follows commit conventions and the description is thorough and accurately reflects the diff.
  • Frontmatter & structure: Valid frontmatter, correct heading hierarchy (H2 → H3 → H4), no skipped levels.
  • Links: All internal links use root-relative paths with trailing slashes.
  • MDX syntax: No unescaped special characters; components are properly imported.
  • Style: Generally clean. One minor point: "five" is correctly spelled out per the style guide (numbers 0–9).

Issues flagged (non-blocking)

I posted three inline review comments with suggestions to tighten wording:

  1. Line 50: "divided evenly" may not hold for limits not divisible by 5 — suggest softening the language.
  2. Line 62: "sum the per-location values" is ambiguous because the API returns per-machine values — suggest clarifying that users should sum all returned values.
  3. Line 89: The instruction to "sum the values across each coloCode group" is slightly ambiguous about multiple entries per coloCode — suggest making it explicit.

Overall, this is a good change that directly addresses a real support escalation pattern. I left the review as Comment rather than Approve or Request changes so the author can decide whether to accept the wording suggestions. No commits were pushed and no new PRs were opened.



Labels

content:edit, documentation, product:hyperdrive, size/s


8 participants