
feat(metrics): add lean_gossip_mesh_peers gauge#343

Merged
pablodeymo merged 10 commits into main from feat/lean-gossip-mesh-peers-metric
May 7, 2026

Conversation

@MegaRedHand
Collaborator

@MegaRedHand MegaRedHand commented May 4, 2026

🗒️ Description / Motivation

Implements the lean_gossip_mesh_peers metric introduced in leanMetrics PR #35.

This is a Gauge that exposes the number of peers currently in our gossipsub mesh, broken down by client label. It complements the existing lean_connected_peers gauge: connected peers are nodes we have an active libp2p connection to, while mesh peers are the smaller subset gossipsub has selected as full-message neighbours.

The client label uses the same <name>_<N>|unknown resolution as lean_connected_peers (handled by the existing resolve() helper in crates/net/p2p/src/metrics.rs), so PR #35's label-format clarification needed no code changes — only the new gauge.

What Changed

  • crates/net/p2p/src/metrics.rs
    • Register LEAN_GOSSIP_MESH_PEERS: IntGaugeVec with the client label.
    • Add update_gossip_mesh_peers(peers): groups peers by resolved client name (acquiring the NODE_NAME_REGISTRY read lock once for the batch). To avoid the scrape-window gap that an IntGaugeVec::reset() + repopulate would leave between writes, the function instead enumerates currently-registered children via Collector::collect() and calls set(0) for any client labels no longer in the mesh, then writes the new counts. Each set() is a single registry write, so a concurrent Prometheus scrape always observes a coherent label set.
  • crates/common/metrics/src/lib.rs
    • Re-export prometheus::core::Collector so consumers can enumerate existing label children.
  • crates/net/p2p/src/swarm_adapter.rs
    • Add a tokio::time::interval (10 s, MissedTickBehavior::Skip) arm to the swarm_loop tokio::select!. On each tick it reads swarm.behaviour().gossipsub.all_mesh_peers() and pushes counts into the gauge.
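The zero-then-set strategy described in the metrics.rs bullet can be sketched with plain maps standing in for the Prometheus types — `gauge` below is a HashMap stand-in for the IntGaugeVec's child series, not the real registry, and the function name is illustrative:

```rust
use std::collections::HashMap;

/// Sketch of the scrape-stable update: `gauge` stands in for the
/// IntGaugeVec's child series, `resolved` for the freshly grouped
/// mesh-peer counts per client label.
fn update_mesh_gauge(gauge: &mut HashMap<String, i64>, resolved: &HashMap<String, i64>) {
    // Zero out labels that were published before but have since left the
    // mesh, instead of reset()-ing the whole vector (which would leave an
    // empty gauge visible to a concurrent scrape between writes).
    for (label, value) in gauge.iter_mut() {
        if !resolved.contains_key(label) {
            *value = 0; // one write per departed label; the series stays present
        }
    }
    // Write the new counts; each write maps to a single set() call.
    for (label, count) in resolved {
        gauge.insert(label.clone(), *count);
    }
}
```

At every intermediate point the label set is complete: a departed client reads 0 rather than vanishing, which is what keeps alerting rules on continuous time series.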

Correctness / Behavior Guarantees

  • No state changes on the network/consensus path. The new code only reads gossipsub.all_mesh_peers() (a BTreeSet iteration) and writes to a Prometheus registry; it cannot affect block propagation, fork choice, or attestation handling.
  • Scrape-stable updates. Departed client labels are written as set(0) rather than removed via reset(), so a Prometheus scrape between writes always sees a complete (non-empty) label set. Alerting rules like lean_gossip_mesh_peers > 0 see continuous time series instead of vanishing labels.
  • No starvation of swarm/cmd events. The new arm shares the tokio::select! with the swarm event and command receiver, with MissedTickBehavior::Skip to avoid backlogging metric refreshes if the swarm loop is busy. Polling cost is one short iterator pass per 10 s over the mesh peer set (typically <50 peers).
  • Polling rationale. Mesh membership changes inside gossipsub heartbeats (700 ms) and is not surfaced as a BehaviourEvent, so an event-driven update is not possible without forking the gossipsub behaviour. 10 s polling bounds scrape staleness while keeping overhead negligible.

Tests Added / Run

The change is operational (Prometheus gauge update): no new unit tests were added because there's no observable behaviour outside the metrics endpoint and the gauge update is exercised by every devnet run.

Verification commands:

  • cargo check -p ethlambda-p2p — clean
  • make fmt — clean
  • make lint (clippy with -D warnings) — clean
  • cargo test --workspace --release — all passing

Devnet verification (4 ethlambda nodes, local image):

  • All 4 nodes published lean_gossip_mesh_peers{client="ethlambda_N"} = 1 for each peer once the mesh formed.
  • After docker stop ethlambda_3, surviving nodes' lean_gossip_mesh_peers{client="ethlambda_3"} dropped to 0 within ~6 s instead of disappearing — confirming scrape-stable behavior.

Related Issues / PRs

Tracks the number of peers currently in the gossipsub mesh, broken down
by client name (resolved via the existing peer-id name registry). Defined
in leanMetrics PR #35.

The gauge is refreshed on a 1s interval inside the swarm adapter loop
because mesh membership is held in the gossipsub behaviour and changes
implicitly during heartbeats (700ms) without dedicated graft/prune
events. Reset-then-repopulate ensures clients that drop out of the mesh
do not leave stale label counts.
@github-actions

github-actions Bot commented May 4, 2026

🤖 Claude Code Review

Here is my review of PR #343:


PR #343: feat(metrics): add lean_gossip_mesh_peers gauge

Overview

Adds a lean_gossip_mesh_peers IntGaugeVec metric (labelled by client) that exposes the number of gossipsub mesh peers. Implemented by polling all_mesh_peers() every 1 second inside the existing swarm_loop tokio::select!.

The change is small (43 lines), non-invasive, and carries zero risk to the consensus/networking path. Code follows project conventions throughout.


Code Quality

metrics.rs — update_gossip_mesh_peers (lines 186–196)

  • Correct approach: accumulate counts into a HashMap, reset() the gauge, then repopulate. Follows the same labelling strategy as lean_connected_peers.
  • Idiomatic nit: .or_insert(0) can be .or_default() for i64.
  • resolve() acquires a RwLock read guard per peer. At 1 Hz over a mesh of <50 peers this is immaterial, but the accumulation loop could take the lock once if this ever became a concern. Not worth changing now.
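The "take the lock once for the batch" variant the reviewer alludes to can be sketched with a std RwLock — `NODE_NAMES` and `count_by_client` are hypothetical stand-ins for the real NODE_NAME_REGISTRY and resolution helper, kept std-only for illustration:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Hypothetical stand-in for NODE_NAME_REGISTRY: peer id -> node name.
static NODE_NAMES: RwLock<Option<HashMap<u64, String>>> = RwLock::new(None);

/// Resolve a whole batch of peers under a single read-lock acquisition,
/// rather than re-acquiring the lock once per peer inside resolve().
fn count_by_client(peers: &[u64]) -> HashMap<String, i64> {
    let guard = NODE_NAMES.read().unwrap(); // one lock for the whole batch
    let mut counts: HashMap<String, i64> = HashMap::new();
    for peer in peers {
        let name = guard
            .as_ref()
            .and_then(|m| m.get(peer).cloned())
            .unwrap_or_else(|| "unknown".to_string());
        *counts.entry(name).or_default() += 1;
    }
    counts
}
```

At 1 Hz over <50 peers the per-peer locking is immaterial either way, as the review notes; the batched form mainly avoids read-lock churn against the connection-event path.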

swarm_adapter.rs — swarm_loop (lines 117–133)

  • MissedTickBehavior::Skip is the right choice: missed ticks are silently dropped rather than queued, so a busy swarm event burst won't pile up deferred metric refreshes.
  • The first tick() fires immediately at startup (tokio's interval contract). The mesh is typically empty then, so reset() runs on an empty gauge — harmless.
  • select! branch ordering is correct; the tick arm will not starve the swarm/cmd arms, and vice versa.

Static declaration placement

LEAN_GOSSIP_MESH_PEERS is placed between the disconnection events counter and the gossip size histograms (line 71). The file has an implicit grouping: peer connection statics → gossip size histograms. The new gauge sits more naturally with the peer gauges (LEAN_CONNECTED_PEERS, line 44). Minor, but worth moving for readability.


Specific Issues

Point 1 — reset() / repopulate is observable (PR description is inaccurate)

The description states: "the brief window between reset and repopulate is not observable to scrapers." This is not accurate. Prometheus scrapes the registry by locking each metric independently; a scrape that hits between reset() and the set() loop will see zero counts for all client labels. For a mesh-peer gauge the transient zero is benign, but the correctness claim should not be cited as a guarantee.

If zero-flash is ever unacceptable, the fix is to swap the order: set new values first, then remove labels that are no longer present. That requires tracking the previous label set, which is more complex and not warranted here.

Point 2 — Comment inaccuracy in swarm_adapter.rs (line 16-17)

/// Slightly slower than the gossipsub heartbeat (700ms) so the gauge
/// reflects post-heartbeat mesh state with minimal polling overhead.

1000 ms is ~43% slower than 700 ms — "slightly" is misleading. The real rationale is that 1 s gives one full heartbeat cycle of margin and is a clean human-readable interval. Suggest rewording: "One full heartbeat cycle (700 ms) after each mesh churn event; 1 s bounds staleness while keeping overhead negligible."

Point 3 — or_insert(0) vs or_default() (line 190)

*counts.entry(name).or_insert(0) += 1;

Prefer .or_default() for zero-initialized numeric types. Idiomatic Rust.
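The two forms are interchangeable here because `i64::default()` is 0; a minimal counting loop (the `tally` helper is illustrative, not from the PR) shows the idiomatic spelling:

```rust
use std::collections::HashMap;

/// Count occurrences of each name; `.or_default()` starts absent
/// entries at i64::default() == 0, equivalent to `.or_insert(0)`.
fn tally(names: &[&'static str]) -> HashMap<&'static str, i64> {
    let mut counts: HashMap<&'static str, i64> = HashMap::new();
    for &name in names {
        *counts.entry(name).or_default() += 1;
    }
    counts
}
```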


What's Not Covered (intentional, not a bug)

all_mesh_peers() returns the union of peers across all topics. A peer in the block mesh and the attestation mesh appears once in the count. The metric therefore reports distinct mesh peers, not (peer, topic) pairs. The description is consistent with this, so no label for topic is expected. Just noting it explicitly in case a per-topic breakdown is ever requested.


Verdict

LGTM with minor suggestions. Points 1 and 2 are documentation-only fixes; Point 3 is a one-character nit. None block merging. The implementation is correct, safe, and consistent with project conventions.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@greptile-apps
Contributor

greptile-apps Bot commented May 4, 2026

Greptile Summary

Introduces the lean_gossip_mesh_peers IntGaugeVec (labelled by client) to complement the existing lean_connected_peers gauge, and polls gossipsub.all_mesh_peers() on a timer inside the swarm loop.

  • crates/net/p2p/src/metrics.rs: Registers the new gauge and adds update_gossip_mesh_peers(). Stale labels are handled via a collect-then-selective-zero approach (iterating existing metric families and setting absent clients to 0), which correctly avoids the scrape-observation gap that a full reset() would create. A single NODE_NAME_REGISTRY read lock covers the entire peer batch, resolving the per-peer locking concern raised in the previous review.
  • crates/net/p2p/src/swarm_adapter.rs: Adds a tokio::time::interval arm to the swarm_loop select!. MissedTickBehavior::Skip is set correctly. The declared interval in the PR description (1 s) does not match the constant in code (10 s).

Confidence Score: 5/5

Safe to merge. The change only reads gossipsub state and writes to a Prometheus registry; it does not touch any network, consensus, or state-machine paths.

All changes are confined to metric instrumentation. The collect-then-selective-zero approach correctly avoids the scrape-observation gap discussed in the prior review round. The single remaining discrepancy (interval value in the PR description vs. in the code) is a documentation inconsistency, not a runtime defect.

No files require special attention.

Important Files Changed

Filename Overview
crates/common/metrics/src/lib.rs Adds re-export of prometheus::core::Collector needed by the new collect() call in metrics.rs.
crates/net/p2p/src/metrics.rs Adds LEAN_GOSSIP_MESH_PEERS gauge and update_gossip_mesh_peers(). Uses a collect-then-zero approach to avoid the scrape-window gap that reset() would create; single registry lock acquisition for the full peer batch.
crates/net/p2p/src/swarm_adapter.rs Adds a 10 s interval tick to the swarm select loop, calling update_gossip_mesh_peers on each tick. MissedTickBehavior::Skip is correctly set. PR description claims 1 s interval but constant is 10 s.

Sequence Diagram

sequenceDiagram
    participant SL as swarm_loop (tokio task)
    participant GS as gossipsub behaviour
    participant MF as update_gossip_mesh_peers
    participant NR as NODE_NAME_REGISTRY (RwLock)
    participant PG as LEAN_GOSSIP_MESH_PEERS (IntGaugeVec)
    participant PS as Prometheus scraper

    loop every 10 s (MissedTickBehavior::Skip)
        SL->>GS: all_mesh_peers()
        GS-->>SL: "Iterator<&PeerId>"
        SL->>MF: update_gossip_mesh_peers(peers)
        MF->>NR: read() lock
        NR-->>MF: registry snapshot
        MF->>MF: group peers into counts HashMap
        MF->>NR: release read lock
        MF->>PG: collect() snapshot existing label series
        loop for each existing label NOT in counts
            MF->>PG: with_label_values([client]).set(0)
        end
        loop for each (client, count) in counts
            MF->>PG: with_label_values([client]).set(count)
        end
    end

    PS->>PG: scrape /metrics
    PG-->>PS: "lean_gossip_mesh_peers{client=...}"
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
crates/net/p2p/src/swarm_adapter.rs:14-15
The PR description consistently says the polling interval is **1 s** ("Add a `tokio::time::interval` (1 s, …)", "1 s polling bounds scrape staleness"), but the constant is `Duration::from_secs(10)`. If 10 s is intentional, the description should be updated to match so scrape-staleness reasoning stays accurate.

```suggestion
/// Interval between gossipsub mesh peer metric refreshes.
const MESH_METRIC_REFRESH_INTERVAL: Duration = Duration::from_secs(1);
```


@github-actions

github-actions Bot commented May 4, 2026

🤖 Codex Code Review

Findings

  1. Low: metrics.rs:192 does IntGaugeVec::reset() and then rebuilds the series label-by-label. That update is not atomic relative to Prometheus scraping, so a scrape can land between reset() and the last set() and observe an empty or partially rebuilt lean_gossip_mesh_peers snapshot. If you want scrape-stable values, keep the previously seen client labels and set(0) for labels that disappeared instead of resetting the whole vector each tick.

Notes

  • I did not see consensus-path, security, or memory-safety issues in the touched code. This is read-only metrics plumbing and does not affect fork choice, attestation processing, justification/finalization, state transition, XMSS, or SSZ logic.
  • Assuming gossipsub.all_mesh_peers() at swarm_adapter.rs:130 returns a deduplicated peer set as described in the PR, the 1s polling approach looks reasonable and low-overhead.
  • I could not rerun cargo check or cargo test in this sandbox because the git dependency leanMultisig is not cached locally and outbound network is disabled.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@MegaRedHand MegaRedHand marked this pull request as draft May 4, 2026 22:55
Replace IntGaugeVec::reset() + repopulate with set(0) on departed
labels. The reset() variant cleared all child series in one registry
write, but the subsequent set() calls were separate writes — a
Prometheus scrape that landed in between observed an empty gauge,
which would falsely trigger any "no mesh peers" alert.

Existing children are enumerated via Collector::collect() so we don't
have to track the previous label set ourselves. Also acquire the
NODE_NAME_REGISTRY read lock once for the whole batch instead of per
peer.
@MegaRedHand MegaRedHand marked this pull request as ready for review May 7, 2026 16:29
@github-actions

github-actions Bot commented May 7, 2026

🤖 Claude Code Review

Here is the code review for PR #343:


Review: feat(metrics): add lean_gossip_mesh_peers gauge

Overview

This PR adds a lean_gossip_mesh_peers{client="..."} IntGaugeVec metric that tracks the number of peers currently in the gossipsub mesh, broken down by client name. A 10-second tokio::time::interval arm in swarm_loop polls gossipsub.all_mesh_peers() and feeds counts into the gauge. Departed client labels are written as set(0) rather than dropped, to avoid vanishing time series in Prometheus.

The change is narrowly scoped (metrics only), passes clippy with -D warnings, and the devnet verification is thorough. Overall quality is high.


Correctness

Scrape-atomicity claim is slightly overstated.

The PR description says:

Each set() is a single registry write, so a concurrent Prometheus scrape always observes a coherent label set.

The zero-out pass and the repopulate pass are two sequential loops, not a single atomic operation. A Prometheus scrape landing between them would observe some labels zeroed and others not yet updated — e.g., a mesh of 3 peers from two clients could briefly report {client="A"} 0, {client="B"} 0 before the second loop finishes. For a scrape-interval metric this is harmless and self-corrects within milliseconds, but the guarantee is "eventual coherence per scrape cycle" rather than true atomicity.

metrics.rs:189–201

label.value() borrow vs. HashMap<&'static str, i64>

counts.contains_key(value) works fine (via Borrow<str>), and the borrow of value is dropped before family goes out of scope. No lifetime issue in practice, but the asymmetry (&'static str keys, &str query) is subtle enough to trip a future reader — a short comment would help.

metrics.rs:192–199
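The `Borrow<str>` point above can be made concrete with a tiny std-only sketch — `has_label` is a hypothetical helper, not code from the PR:

```rust
use std::collections::HashMap;

/// Query a map keyed by &'static str with a plain &str: this works
/// because &'static str implements Borrow<str>, so contains_key
/// accepts any string slice regardless of its lifetime.
fn has_label(counts: &HashMap<&'static str, i64>, label: &str) -> bool {
    counts.contains_key(label)
}
```

This is exactly the asymmetry the review flags: the keys are `&'static str`, the query (`label.value()` in the real code) is a shorter-lived `&str`, and the lookup compiles only via the `Borrow` impl — worth a comment at the call site.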


Design Notes

Collector re-export in ethlambda-metrics.

// crates/common/metrics/src/lib.rs
pub use prometheus::{core::Collector, /* … */};

prometheus::core::Collector is a low-level trait for enumerating metric children — not the usual surface the convenience re-exports crate exposes. Its only current consumer is update_gossip_mesh_peers. Consider importing it locally in crates/net/p2p/src/metrics.rs instead:

use prometheus::core::Collector as _;  // only for .collect()

This keeps the common metrics crate focused on high-level types.

First tick fires at t=0.

tokio::time::interval delivers its first tick immediately. At node startup no peers exist yet, so update_gossip_mesh_peers runs once over an empty iterator with no observable effect. It's benign, but if it ever matters (e.g., a future caller that distinguishes "never polled" from "polled, zero peers"), interval_at(Instant::now() + MESH_METRIC_REFRESH_INTERVAL, …) makes intent explicit.

swarm_adapter.rs:112


Minor Style / Efficiency

  • counts: HashMap<&'static str, i64> — the value type i64 is correct for IntGaugeVec::set(), but using u64 until the final set() call would make it self-documenting that counts are non-negative. Negligible in practice.
  • HashMap::new() could be HashMap::with_capacity(mesh_size_hint), but with mesh sizes under 50 this is noise.

What Looks Good

  • MissedTickBehavior::Skip is the right choice — avoids piling up metric refreshes if the swarm loop is busy.
  • Holding the NODE_NAME_REGISTRY read lock only for the peer-name resolution batch (then releasing before the Prometheus writes) is correct and avoids any contention with the connection-event path.
  • The polling rationale (gossipsub mesh events not surfaced as BehaviourEvent, 700 ms heartbeat) is accurate and well-documented.
  • LEAN_GOSSIP_MESH_PEERS registration follows the existing LazyLock pattern exactly.
  • The set(0) strategy for departed labels is the right call for Prometheus alerting continuity.

Summary

No blocking issues. Two actionable items worth addressing:

  1. Drop the Collector re-export from ethlambda-metrics in favour of a local use prometheus::core::Collector as _ in p2p/metrics.rs.
  2. Soften the "always coherent" language in the comment to "scrape-stable (set(0) rather than reset)" — the current wording implies transactional atomicity the implementation doesn't provide.

Everything else is style-level or informational.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@github-actions

github-actions Bot commented May 7, 2026

🤖 Codex Code Review

Findings

  1. Medium: lean_gossip_mesh_peers{client=...} is not actually keyed by client implementation. The updater in crates/net/p2p/src/metrics.rs resolves each peer through NODE_NAME_REGISTRY, but that registry is built from ValidatorConfigEntry.name and is explicitly described as “labeling metrics with node names” in bin/ethlambda/src/main.rs. The in-repo examples use names like ethlambda_0 in bin/ethlambda/src/main.rs, while the peer metrics are documented as client=ethlambda,grandine,... in docs/metrics.md. So this new metric will emit per-node or unknown labels, not per-client labels, which breaks aggregation and increases label cardinality. Either derive the label from real client metadata (for example Identify/agent info) or rename/document the label as node.

No other correctness, security, or consensus-path issues stood out in this diff; it only touches metrics plumbing. I couldn’t run cargo check in this sandbox because Cargo needs write access under the read-only home cache.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

MegaRedHand and others added 3 commits May 7, 2026 16:20
Seed the counts map with 0 for previously-published labels so the final
set() loop covers both current and departed clients in one pass, rather
than walking the existing children twice.
@pablodeymo pablodeymo merged commit 8dafa26 into main May 7, 2026
2 checks passed
@pablodeymo pablodeymo deleted the feat/lean-gossip-mesh-peers-metric branch May 7, 2026 20:05