feat(metrics): add lean_gossip_mesh_peers gauge (#343)
Conversation
Tracks the number of peers currently in the gossipsub mesh, broken down by client name (resolved via the existing peer-id name registry). Defined in leanMetrics PR #35. The gauge is refreshed on a 10 s interval inside the swarm adapter loop because mesh membership is held in the gossipsub behaviour and changes implicitly during heartbeats (700 ms) without dedicated graft/prune events. Client labels that drop out of the mesh are set to 0 rather than removed, so they do not leave stale counts or vanishing time series.
🤖 Claude Code Review
Greptile Summary

Introduces the `lean_gossip_mesh_peers` gauge and a periodic refresh of it in the swarm adapter loop.

Confidence Score: 5/5. Safe to merge. The change only reads gossipsub state and writes to a Prometheus registry; it does not touch any network, consensus, or state-machine paths. All changes are confined to metric instrumentation. The collect-then-selective-zero approach correctly avoids the scrape-observation gap discussed in the prior review round. The single remaining discrepancy (interval value in the PR description vs. in the code) is a documentation inconsistency, not a runtime defect. No files require special attention.
| Filename | Overview |
|---|---|
| crates/common/metrics/src/lib.rs | Adds re-export of prometheus::core::Collector needed by the new collect() call in metrics.rs. |
| crates/net/p2p/src/metrics.rs | Adds LEAN_GOSSIP_MESH_PEERS gauge and update_gossip_mesh_peers(). Uses a collect-then-zero approach to avoid the scrape-window gap that reset() would create; single registry lock acquisition for the full peer batch. |
| crates/net/p2p/src/swarm_adapter.rs | Adds a 10 s interval tick to the swarm select loop, calling update_gossip_mesh_peers on each tick. MissedTickBehavior::Skip is correctly set. PR description claims 1 s interval but constant is 10 s. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant SL as swarm_loop (tokio task)
    participant GS as gossipsub behaviour
    participant MF as update_gossip_mesh_peers
    participant NR as NODE_NAME_REGISTRY (RwLock)
    participant PG as LEAN_GOSSIP_MESH_PEERS (IntGaugeVec)
    participant PS as Prometheus scraper
    loop every 10 s (MissedTickBehavior::Skip)
        SL->>GS: all_mesh_peers()
        GS-->>SL: "Iterator<&PeerId>"
        SL->>MF: update_gossip_mesh_peers(peers)
        MF->>NR: read() lock
        NR-->>MF: registry snapshot
        MF->>MF: group peers into counts HashMap
        MF->>NR: release read lock
        MF->>PG: collect() snapshot existing label series
        loop for each existing label NOT in counts
            MF->>PG: with_label_values([client]).set(0)
        end
        loop for each (client, count) in counts
            MF->>PG: with_label_values([client]).set(count)
        end
    end
    PS->>PG: scrape /metrics
    PG-->>PS: "lean_gossip_mesh_peers{client=...}"
```
Prompt To Fix All With AI
Fix the following code review issue, proposing a concise fix.
---
### Issue 1 of 1
crates/net/p2p/src/swarm_adapter.rs:14-15
The PR description consistently says the polling interval is **1 s** ("Add a `tokio::time::interval` (1 s, …)", "1 s polling bounds scrape staleness"), but the constant is `Duration::from_secs(10)`. If 10 s is intentional, the description should be updated to match so scrape-staleness reasoning stays accurate.
```suggestion
/// Interval between gossipsub mesh peer metric refreshes.
const MESH_METRIC_REFRESH_INTERVAL: Duration = Duration::from_secs(1);
```
Reviews (2): Last reviewed commit: "Merge branch 'main' into feat/lean-gossi..."
🤖 Codex Code Review

Findings

Replace `IntGaugeVec::reset()` + repopulate with `set(0)` on departed labels. The `reset()` variant cleared all child series in one registry write, but the subsequent `set()` calls were separate writes; a Prometheus scrape that landed in between observed an empty gauge, which would falsely trigger any "no mesh peers" alert. Existing children are enumerated via `Collector::collect()` so we don't have to track the previous label set ourselves. Also acquire the `NODE_NAME_REGISTRY` read lock once for the whole batch instead of per peer.

Notes

Automated review by OpenAI Codex · gpt-5.4 · custom prompt
🤖 Codex Code Review

Findings

Seed the counts map with 0 for previously-published labels so the final `set()` loop covers both current and departed clients in one pass, rather than walking the existing children twice.

Notes

No other correctness, security, or consensus-path issues stood out in this diff; it only touches metrics plumbing. I couldn't run …

Automated review by OpenAI Codex · gpt-5.4 · custom prompt
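The single-pass seeding idea can be shown with a small stdlib-only sketch. The helper name `merged_counts` is hypothetical; the real code would seed from the gauge's collected children rather than a plain slice.

```rust
use std::collections::HashMap;

// Merge previously published labels (seeded at 0) with the current mesh
// counts, so one set() loop covers both current and departed clients.
fn merged_counts(existing_labels: &[&str], current: &[(&str, i64)]) -> HashMap<String, i64> {
    let mut counts: HashMap<String, i64> = existing_labels
        .iter()
        .map(|label| (label.to_string(), 0))
        .collect();
    for (client, n) in current {
        // Current counts overwrite the seeded zeros.
        counts.insert(client.to_string(), *n);
    }
    counts
}
```

A departed client stays in the map at 0, so the subsequent write loop needs no second pass over the existing children.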
🗒️ Description / Motivation
Implements the `lean_gossip_mesh_peers` metric introduced in leanMetrics PR #35. This is a `Gauge` that exposes the number of peers currently in our gossipsub mesh, broken down by `client` label. It complements the existing `lean_connected_peers` gauge: connected peers are nodes we have an active libp2p connection to, while mesh peers are the smaller subset gossipsub has selected as full-message neighbours.

The `client` label uses the same `<name>_<N>|unknown` resolution as `lean_connected_peers` (handled by the existing `resolve()` helper in `crates/net/p2p/src/metrics.rs`), so PR #35's label-format clarification needed no code changes, only the new gauge.

What Changed
`crates/net/p2p/src/metrics.rs`
- Adds `LEAN_GOSSIP_MESH_PEERS: IntGaugeVec` with the `client` label.
- `update_gossip_mesh_peers(peers)` groups peers by resolved client name, acquiring the `NODE_NAME_REGISTRY` read lock once for the batch. To avoid the scrape-window gap that an `IntGaugeVec::reset()` + repopulate would leave between writes, the function instead enumerates currently-registered children via `Collector::collect()` and calls `set(0)` for any client labels no longer in the mesh, then writes the new counts. Each `set()` is a single registry write, so a concurrent Prometheus scrape always observes a coherent label set.

`crates/common/metrics/src/lib.rs`
- Re-exports `prometheus::core::Collector` so consumers can enumerate existing label children.

`crates/net/p2p/src/swarm_adapter.rs`
- Adds a `tokio::time::interval` (10 s, `MissedTickBehavior::Skip`) arm to the `swarm_loop` `tokio::select!`. On each tick it reads `swarm.behaviour().gossipsub.all_mesh_peers()` and pushes counts into the gauge.

Correctness / Behavior Guarantees

- The refresh only reads `gossipsub.all_mesh_peers()` (a `BTreeSet` iteration) and writes to a Prometheus registry; it cannot affect block propagation, fork choice, or attestation handling.
- Departed client labels are `set(0)` rather than removed via `reset()`, so a Prometheus scrape between writes always sees a complete (non-empty) label set. Alerting rules like `lean_gossip_mesh_peers > 0` see continuous time series instead of vanishing labels.
- The tick shares the `tokio::select!` with the swarm event and command receiver, with `MissedTickBehavior::Skip` to avoid backlogging metric refreshes if the swarm loop is busy. Polling cost is one short iterator pass per 10 s over the mesh peer set (typically <50 peers).
- gossipsub does not expose graft/prune events in its `BehaviourEvent`, so an event-driven update is not possible without forking the gossipsub behaviour. 10 s polling bounds scrape staleness while keeping overhead negligible.

Tests Added / Run
The change is operational (Prometheus gauge update): no new unit tests were added because there's no observable behaviour outside the metrics endpoint and the gauge update is exercised by every devnet run.
Verification commands:
- `cargo check -p ethlambda-p2p`: clean
- `make fmt`: clean
- `make lint` (clippy with `-D warnings`): clean
- `cargo test --workspace --release`: all passing

Devnet verification (4 ethlambda nodes, local image):
- `lean_gossip_mesh_peers{client="ethlambda_N"} = 1` for each peer once the mesh formed.
- After `docker stop ethlambda_3`, the surviving nodes' `lean_gossip_mesh_peers{client="ethlambda_3"}` dropped to `0` within ~6 s instead of disappearing, confirming scrape-stable behavior.

Related Issues / PRs
- `lean_gossip_mesh_peers` metric from "Add gossip mesh peers metric", leanEthereum/leanMetrics#35