
feat(metrics): add lean_gossip_mesh_peers gauge#343

Merged
pablodeymo merged 10 commits into main from feat/lean-gossip-mesh-peers-metric
May 7, 2026

Conversation

@MegaRedHand
Collaborator

@MegaRedHand MegaRedHand commented May 4, 2026

🗒️ Description / Motivation

Implements the lean_gossip_mesh_peers metric introduced in leanMetrics PR #35.

This is a Gauge that exposes the number of peers currently in our gossipsub mesh, broken down by client label. It complements the existing lean_connected_peers gauge: connected peers are nodes we have an active libp2p connection to, while mesh peers are the smaller subset gossipsub has selected as full-message neighbours.

The client label uses the same <name>_<N>|unknown resolution as lean_connected_peers (handled by the existing resolve() helper in crates/net/p2p/src/metrics.rs), so PR #35's label-format clarification needed no code changes — only the new gauge.

What Changed

  • crates/net/p2p/src/metrics.rs
    • Register LEAN_GOSSIP_MESH_PEERS: IntGaugeVec with the client label.
    • Add update_gossip_mesh_peers(peers): groups peers by resolved client name (acquiring the NODE_NAME_REGISTRY read lock once for the batch). To avoid the scrape-window gap that an IntGaugeVec::reset() + repopulate would leave between writes, the function instead enumerates currently-registered children via Collector::collect() and calls set(0) for any client labels no longer in the mesh, then writes the new counts. Each set() is a single registry write, so a concurrent Prometheus scrape always observes a coherent label set.
  • crates/common/metrics/src/lib.rs
    • Re-export prometheus::core::Collector so consumers can enumerate existing label children.
  • crates/net/p2p/src/swarm_adapter.rs
    • Add a tokio::time::interval (10 s, MissedTickBehavior::Skip) arm to the swarm_loop tokio::select!. On each tick it reads swarm.behaviour().gossipsub.all_mesh_peers() and pushes counts into the gauge.
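The zero-then-set strategy described in the metrics.rs bullet can be sketched with plain maps standing in for the Prometheus types — `gauge` below is a HashMap stand-in for the IntGaugeVec's child series, not the real registry, and the function name is illustrative:

```rust
use std::collections::HashMap;

/// Sketch of the scrape-stable update: `gauge` stands in for the
/// IntGaugeVec's child series, `resolved` for the freshly grouped
/// mesh-peer counts per client label.
fn update_mesh_gauge(gauge: &mut HashMap<String, i64>, resolved: &HashMap<String, i64>) {
    // Zero out labels that were published before but have since left the
    // mesh, instead of reset()-ing the whole vector (which would leave an
    // empty gauge visible to a concurrent scrape between writes).
    for (label, value) in gauge.iter_mut() {
        if !resolved.contains_key(label) {
            *value = 0; // one write per departed label; the series stays present
        }
    }
    // Write the new counts; each write maps to a single set() call.
    for (label, count) in resolved {
        gauge.insert(label.clone(), *count);
    }
}
```

At every intermediate point the label set is complete: a departed client reads 0 rather than vanishing, which is what keeps alerting rules on continuous time series.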

Correctness / Behavior Guarantees

  • No state changes on the network/consensus path. The new code only reads gossipsub.all_mesh_peers() (a BTreeSet iteration) and writes to a Prometheus registry; it cannot affect block propagation, fork choice, or attestation handling.
  • Scrape-stable updates. Departed client labels are written as set(0) rather than removed via reset(), so a Prometheus scrape between writes always sees a complete (non-empty) label set. Alerting rules like lean_gossip_mesh_peers > 0 see continuous time series instead of vanishing labels.
  • No starvation of swarm/cmd events. The new arm shares the tokio::select! with the swarm event and command receiver, with MissedTickBehavior::Skip to avoid backlogging metric refreshes if the swarm loop is busy. Polling cost is one short iterator pass per 10 s over the mesh peer set (typically <50 peers).
  • Polling rationale. Mesh membership changes inside gossipsub heartbeats (700 ms) and is not surfaced as a BehaviourEvent, so an event-driven update is not possible without forking the gossipsub behaviour. 10 s polling bounds scrape staleness while keeping overhead negligible.

Tests Added / Run

The change is operational (Prometheus gauge update): no new unit tests were added because there's no observable behaviour outside the metrics endpoint and the gauge update is exercised by every devnet run.

Verification commands:

  • cargo check -p ethlambda-p2p — clean
  • make fmt — clean
  • make lint (clippy with -D warnings) — clean
  • cargo test --workspace --release — all passing

Devnet verification (4 ethlambda nodes, local image):

  • All 4 nodes published lean_gossip_mesh_peers{client="ethlambda_N"} = 1 for each peer once the mesh formed.
  • After docker stop ethlambda_3, surviving nodes' lean_gossip_mesh_peers{client="ethlambda_3"} dropped to 0 within ~6 s instead of disappearing — confirming scrape-stable behavior.

Related Issues / PRs

Tracks the number of peers currently in the gossipsub mesh, broken down
by client name (resolved via the existing peer-id name registry). Defined
in leanMetrics PR #35.

The gauge is refreshed on a 1s interval inside the swarm adapter loop
because mesh membership is held in the gossipsub behaviour and changes
implicitly during heartbeats (700ms) without dedicated graft/prune
events. Reset-then-repopulate ensures clients that drop out of the mesh
do not leave stale label counts.
@github-actions

github-actions Bot commented May 4, 2026

🤖 Claude Code Review

Here is my review of PR #343:


PR #343: feat(metrics): add lean_gossip_mesh_peers gauge

Overview

Adds a lean_gossip_mesh_peers IntGaugeVec metric (labelled by client) that exposes the number of gossipsub mesh peers. Implemented by polling all_mesh_peers() every 1 second inside the existing swarm_loop tokio::select!.

The change is small (43 lines), non-invasive, and carries zero risk to the consensus/networking path. Code follows project conventions throughout.


Code Quality

metrics.rs — update_gossip_mesh_peers (lines 186–196)

  • Correct approach: accumulate counts into a HashMap, reset() the gauge, then repopulate. Follows the same labelling strategy as lean_connected_peers.
  • Idiomatic nit: .or_insert(0) can be .or_default() for i64.
  • resolve() acquires a RwLock read guard per peer. At 1 Hz over a mesh of <50 peers this is immaterial, but the accumulation loop could take the lock once if this ever became a concern. Not worth changing now.
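The "take the lock once for the batch" variant the reviewer alludes to can be sketched with a std RwLock — `NODE_NAMES` and `count_by_client` are hypothetical stand-ins for the real NODE_NAME_REGISTRY and resolution helper, kept std-only for illustration:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Hypothetical stand-in for NODE_NAME_REGISTRY: peer id -> node name.
static NODE_NAMES: RwLock<Option<HashMap<u64, String>>> = RwLock::new(None);

/// Resolve a whole batch of peers under a single read-lock acquisition,
/// rather than re-acquiring the lock once per peer inside resolve().
fn count_by_client(peers: &[u64]) -> HashMap<String, i64> {
    let guard = NODE_NAMES.read().unwrap(); // one lock for the whole batch
    let mut counts: HashMap<String, i64> = HashMap::new();
    for peer in peers {
        let name = guard
            .as_ref()
            .and_then(|m| m.get(peer).cloned())
            .unwrap_or_else(|| "unknown".to_string());
        *counts.entry(name).or_default() += 1;
    }
    counts
}
```

At 1 Hz over <50 peers the per-peer locking is immaterial either way, as the review notes; the batched form mainly avoids read-lock churn against the connection-event path.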

swarm_adapter.rs — swarm_loop (lines 117–133)

  • MissedTickBehavior::Skip is the right choice: missed ticks are silently dropped rather than queued, so a busy swarm event burst won't pile up deferred metric refreshes.
  • The first tick() fires immediately at startup (tokio's interval contract). The mesh is typically empty then, so reset() runs on an empty gauge — harmless.
  • select! branch ordering is correct; the tick arm will not starve the swarm/cmd arms, and vice versa.

Static declaration placement

LEAN_GOSSIP_MESH_PEERS is placed between the disconnection events counter and the gossip size histograms (line 71). The file has an implicit grouping: peer connection statics → gossip size histograms. The new gauge sits more naturally with the peer gauges (LEAN_CONNECTED_PEERS, line 44). Minor, but worth moving for readability.


Specific Issues

Point 1 — reset() / repopulate is observable (PR description is inaccurate)

The description states: "the brief window between reset and repopulate is not observable to scrapers." This is not accurate. Prometheus scrapes the registry by locking each metric independently; a scrape that hits between reset() and the set() loop will see zero counts for all client labels. For a mesh-peer gauge the transient zero is benign, but the correctness claim should not be cited as a guarantee.

If zero-flash is ever unacceptable, the fix is to swap the order: set new values first, then remove labels that are no longer present. That requires tracking the previous label set, which is more complex and not warranted here.

Point 2 — Comment inaccuracy in swarm_adapter.rs (line 16-17)

/// Slightly slower than the gossipsub heartbeat (700ms) so the gauge
/// reflects post-heartbeat mesh state with minimal polling overhead.

1000 ms is ~43% slower than 700 ms — "slightly" is misleading. The real rationale is that 1 s gives one full heartbeat cycle of margin and is a clean human-readable interval. Suggest rewording: "One full heartbeat cycle (700 ms) after each mesh churn event; 1 s bounds staleness while keeping overhead negligible."

Point 3 — or_insert(0) vs or_default() (line 190)

*counts.entry(name).or_insert(0) += 1;

Prefer .or_default() for zero-initialized numeric types. Idiomatic Rust.
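The two forms are interchangeable here because `i64::default()` is 0; a minimal counting loop (the `tally` helper is illustrative, not from the PR) shows the idiomatic spelling:

```rust
use std::collections::HashMap;

/// Count occurrences of each name; `.or_default()` starts absent
/// entries at i64::default() == 0, equivalent to `.or_insert(0)`.
fn tally(names: &[&'static str]) -> HashMap<&'static str, i64> {
    let mut counts: HashMap<&'static str, i64> = HashMap::new();
    for &name in names {
        *counts.entry(name).or_default() += 1;
    }
    counts
}
```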


What's Not Covered (intentional, not a bug)

all_mesh_peers() returns the union of peers across all topics. A peer in the block mesh and the attestation mesh appears once in the count. The metric therefore reports distinct mesh peers, not (peer, topic) pairs. The description is consistent with this, so no label for topic is expected. Just noting it explicitly in case a per-topic breakdown is ever requested.


Verdict

LGTM with minor suggestions. Points 1 and 2 are documentation-only fixes; Point 3 is a one-character nit. None block merging. The implementation is correct, safe, and consistent with project conventions.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@greptile-apps
Contributor

greptile-apps Bot commented May 4, 2026

Greptile Summary

Introduces the lean_gossip_mesh_peers IntGaugeVec (labelled by client) to complement the existing lean_connected_peers gauge, and polls gossipsub.all_mesh_peers() on a timer inside the swarm loop.

  • crates/net/p2p/src/metrics.rs: Registers the new gauge and adds update_gossip_mesh_peers(). Stale labels are handled via a collect-then-selective-zero approach (iterating existing metric families and setting absent clients to 0), which correctly avoids the scrape-observation gap that a full reset() would create. A single NODE_NAME_REGISTRY read lock covers the entire peer batch, resolving the per-peer locking concern raised in the previous review.
  • crates/net/p2p/src/swarm_adapter.rs: Adds a tokio::time::interval arm to the swarm_loop select!. MissedTickBehavior::Skip is set correctly. The declared interval in the PR description (1 s) does not match the constant in code (10 s).

Confidence Score: 5/5

Safe to merge. The change only reads gossipsub state and writes to a Prometheus registry; it does not touch any network, consensus, or state-machine paths.

All changes are confined to metric instrumentation. The collect-then-selective-zero approach correctly avoids the scrape-observation gap discussed in the prior review round. The single remaining discrepancy (interval value in the PR description vs. in the code) is a documentation inconsistency, not a runtime defect.

No files require special attention.

Important Files Changed

Filename Overview
crates/common/metrics/src/lib.rs Adds re-export of prometheus::core::Collector needed by the new collect() call in metrics.rs.
crates/net/p2p/src/metrics.rs Adds LEAN_GOSSIP_MESH_PEERS gauge and update_gossip_mesh_peers(). Uses a collect-then-zero approach to avoid the scrape-window gap that reset() would create; single registry lock acquisition for the full peer batch.
crates/net/p2p/src/swarm_adapter.rs Adds a 10 s interval tick to the swarm select loop, calling update_gossip_mesh_peers on each tick. MissedTickBehavior::Skip is correctly set. PR description claims 1 s interval but constant is 10 s.

Sequence Diagram

sequenceDiagram
    participant SL as swarm_loop (tokio task)
    participant GS as gossipsub behaviour
    participant MF as update_gossip_mesh_peers
    participant NR as NODE_NAME_REGISTRY (RwLock)
    participant PG as LEAN_GOSSIP_MESH_PEERS (IntGaugeVec)
    participant PS as Prometheus scraper

    loop every 10 s (MissedTickBehavior::Skip)
        SL->>GS: all_mesh_peers()
        GS-->>SL: "Iterator<&PeerId>"
        SL->>MF: update_gossip_mesh_peers(peers)
        MF->>NR: read() lock
        NR-->>MF: registry snapshot
        MF->>MF: group peers into counts HashMap
        MF->>NR: release read lock
        MF->>PG: collect() snapshot existing label series
        loop for each existing label NOT in counts
            MF->>PG: with_label_values([client]).set(0)
        end
        loop for each (client, count) in counts
            MF->>PG: with_label_values([client]).set(count)
        end
    end

    PS->>PG: scrape /metrics
    PG-->>PS: "lean_gossip_mesh_peers{client=...}"
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
crates/net/p2p/src/swarm_adapter.rs:14-15
The PR description consistently says the polling interval is **1 s** ("Add a `tokio::time::interval` (1 s, …)", "1 s polling bounds scrape staleness"), but the constant is `Duration::from_secs(10)`. If 10 s is intentional, the description should be updated to match so scrape-staleness reasoning stays accurate.

```suggestion
/// Interval between gossipsub mesh peer metric refreshes.
const MESH_METRIC_REFRESH_INTERVAL: Duration = Duration::from_secs(1);
```


@github-actions

github-actions Bot commented May 4, 2026

🤖 Codex Code Review

Findings

  1. Low: metrics.rs:192 does IntGaugeVec::reset() and then rebuilds the series label-by-label. That update is not atomic relative to Prometheus scraping, so a scrape can land between reset() and the last set() and observe an empty or partially rebuilt lean_gossip_mesh_peers snapshot. If you want scrape-stable values, keep the previously seen client labels and set(0) for labels that disappeared instead of resetting the whole vector each tick.

Notes

  • I did not see consensus-path, security, or memory-safety issues in the touched code. This is read-only metrics plumbing and does not affect fork choice, attestation processing, justification/finalization, state transition, XMSS, or SSZ logic.
  • Assuming gossipsub.all_mesh_peers() at swarm_adapter.rs:130 returns a deduplicated peer set as described in the PR, the 1s polling approach looks reasonable and low-overhead.
  • I could not rerun cargo check or cargo test in this sandbox because the git dependency leanMultisig is not cached locally and outbound network is disabled.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@MegaRedHand MegaRedHand marked this pull request as draft May 4, 2026 22:55
Replace IntGaugeVec::reset() + repopulate with set(0) on departed
labels. The reset() variant cleared all child series in one registry
write, but the subsequent set() calls were separate writes — a
Prometheus scrape that landed in between observed an empty gauge,
which would falsely trigger any "no mesh peers" alert.

Existing children are enumerated via Collector::collect() so we don't
have to track the previous label set ourselves. Also acquire the
NODE_NAME_REGISTRY read lock once for the whole batch instead of per
peer.
@MegaRedHand MegaRedHand marked this pull request as ready for review May 7, 2026 16:29
@github-actions

github-actions Bot commented May 7, 2026

🤖 Claude Code Review

Here is the code review for PR #343:


Review: feat(metrics): add lean_gossip_mesh_peers gauge

Overview

This PR adds a lean_gossip_mesh_peers{client="..."} IntGaugeVec metric that tracks the number of peers currently in the gossipsub mesh, broken down by client name. A 10-second tokio::time::interval arm in swarm_loop polls gossipsub.all_mesh_peers() and feeds counts into the gauge. Departed client labels are written as set(0) rather than dropped, to avoid vanishing time series in Prometheus.

The change is narrowly scoped (metrics only), passes clippy with -D warnings, and the devnet verification is thorough. Overall quality is high.


Correctness

Scrape-atomicity claim is slightly overstated.

The PR description says:

Each set() is a single registry write, so a concurrent Prometheus scrape always observes a coherent label set.

The zero-out pass and the repopulate pass are two sequential loops, not a single atomic operation. A Prometheus scrape landing between them would observe some labels zeroed and others not yet updated — e.g., a mesh of 3 peers from two clients could briefly report {client="A"} 0, {client="B"} 0 before the second loop finishes. For a scrape-interval metric this is harmless and self-corrects within milliseconds, but the guarantee is "eventual coherence per scrape cycle" rather than true atomicity.

metrics.rs:189–201

label.value() borrow vs. HashMap<&'static str, i64>

counts.contains_key(value) works fine (via Borrow<str>), and the borrow of value is dropped before family goes out of scope. No lifetime issue in practice, but the asymmetry (&'static str keys, &str query) is subtle enough to trip a future reader — a short comment would help.

metrics.rs:192–199
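The `Borrow<str>` point above can be made concrete with a tiny std-only sketch — `has_label` is a hypothetical helper, not code from the PR:

```rust
use std::collections::HashMap;

/// Query a map keyed by &'static str with a plain &str: this works
/// because &'static str implements Borrow<str>, so contains_key
/// accepts any string slice regardless of its lifetime.
fn has_label(counts: &HashMap<&'static str, i64>, label: &str) -> bool {
    counts.contains_key(label)
}
```

This is exactly the asymmetry the review flags: the keys are `&'static str`, the query (`label.value()` in the real code) is a shorter-lived `&str`, and the lookup compiles only via the `Borrow` impl — worth a comment at the call site.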


Design Notes

Collector re-export in ethlambda-metrics.

// crates/common/metrics/src/lib.rs
pub use prometheus::{core::Collector, /* … */};

prometheus::core::Collector is a low-level trait for enumerating metric children — not the usual surface the convenience re-exports crate exposes. Its only current consumer is update_gossip_mesh_peers. Consider importing it locally in crates/net/p2p/src/metrics.rs instead:

use prometheus::core::Collector as _;  // only for .collect()

This keeps the common metrics crate focused on high-level types.

First tick fires at t=0.

tokio::time::interval delivers its first tick immediately. At node startup no peers exist yet, so update_gossip_mesh_peers runs once over an empty iterator with no observable effect. It's benign, but if it ever matters (e.g., a future caller that distinguishes "never polled" from "polled, zero peers"), interval_at(Instant::now() + MESH_METRIC_REFRESH_INTERVAL, …) makes intent explicit.

swarm_adapter.rs:112


Minor Style / Efficiency

  • counts: HashMap<&'static str, i64> — the value type i64 is correct for IntGaugeVec::set(), but using u64 until the final set() call would make it self-documenting that counts are non-negative. Negligible in practice.
  • HashMap::new() could be HashMap::with_capacity(mesh_size_hint), but with mesh sizes under 50 this is noise.

What Looks Good

  • MissedTickBehavior::Skip is the right choice — avoids piling up metric refreshes if the swarm loop is busy.
  • Holding the NODE_NAME_REGISTRY read lock only for the peer-name resolution batch (then releasing before the Prometheus writes) is correct and avoids any contention with the connection-event path.
  • The polling rationale (gossipsub mesh events not surfaced as BehaviourEvent, 700 ms heartbeat) is accurate and well-documented.
  • LEAN_GOSSIP_MESH_PEERS registration follows the existing LazyLock pattern exactly.
  • The set(0) strategy for departed labels is the right call for Prometheus alerting continuity.

Summary

No blocking issues. Two actionable items worth addressing:

  1. Drop the Collector re-export from ethlambda-metrics in favour of a local use prometheus::core::Collector as _ in p2p/metrics.rs.
  2. Soften the "always coherent" language in the comment to "scrape-stable (set(0) rather than reset)" — the current wording implies transactional atomicity the implementation doesn't provide.

Everything else is style-level or informational.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@github-actions

github-actions Bot commented May 7, 2026

🤖 Codex Code Review

Findings

  1. Medium: lean_gossip_mesh_peers{client=...} is not actually keyed by client implementation. The updater in crates/net/p2p/src/metrics.rs resolves each peer through NODE_NAME_REGISTRY, but that registry is built from ValidatorConfigEntry.name and is explicitly described as “labeling metrics with node names” in bin/ethlambda/src/main.rs. The in-repo examples use names like ethlambda_0 in bin/ethlambda/src/main.rs, while the peer metrics are documented as client=ethlambda,grandine,... in docs/metrics.md. So this new metric will emit per-node or unknown labels, not per-client labels, which breaks aggregation and increases label cardinality. Either derive the label from real client metadata (for example Identify/agent info) or rename/document the label as node.

No other correctness, security, or consensus-path issues stood out in this diff; it only touches metrics plumbing. I couldn’t run cargo check in this sandbox because Cargo needs write access under the read-only home cache.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

MegaRedHand and others added 3 commits May 7, 2026 16:20
Seed the counts map with 0 for previously-published labels so the final
set() loop covers both current and departed clients in one pass, rather
than walking the existing children twice.
@pablodeymo pablodeymo merged commit 8dafa26 into main May 7, 2026
2 checks passed
@pablodeymo pablodeymo deleted the feat/lean-gossip-mesh-peers-metric branch May 7, 2026 20:05