Skip to content

Harden networking: auth, memory-safety, DoS, and thread-safety fixes#101

Merged
igorls merged 3 commits into
mainfrom
security/networking-hardening
Jun 6, 2026
Merged

Harden networking: auth, memory-safety, DoS, and thread-safety fixes#101
igorls merged 3 commits into
mainfrom
security/networking-hardening

Conversation

@igorls

@igorls igorls commented Jun 6, 2026

Copy link
Copy Markdown
Owner

Hardening pass over the networking stack (src/net, src/nat, src/discovery, src/protocol, src/wireguard) addressing confirmed Critical/High issues plus two low-risk Mediums from an adversarial review. Every fix has a regression test. Green on zig build test, Debug, ReleaseFast (shipping config), ReleaseSafe, and zig fmt.

What's hardened

Authentication & spoofing

  • Org control messages (revoke / alias / vouch) now require a verified Ed25519 signature and a trusted org before they take effect.
  • RX cryptokey routing: a decrypted packet's inner source IP must belong to the sending peer (IPv4 + IPv6) before it reaches the TUN.
  • Gossiped dead/leave/suspect about a third party now only marks it suspected locally and must be confirmed by our own failure detector before eviction; gossip can no longer clear local suspicion or resurrect a peer.
  • DNS bootstrap: randomized per-query transaction ID, connected socket (source filtering), and transaction-ID/question validation.

Memory safety (the shipped ReleaseFast build runs with safety checks off)

  • Bound the base64 token decode against its output buffer before decoding (fixes a stack OOB write on an over-long mg:// token).
  • Return expired-peer keys via a caller-owned buffer instead of a slice into a reclaimed stack frame.

Resource exhaustion / DoS

  • Cap the membership table and reclaim non-alive entries so unauthenticated gossip can't grow it without bound.
  • Rate-limit inbound WireGuard handshake initiations (per-source + global token bucket) before the expensive X25519.

Thread safety

  • RwLock on the WgDevice peer table (exclusive for mutation/handshake, shared for the data plane) so peer removal can't zero key material mid-encrypt.
  • RwLock on the membership table — single writer takes it exclusively, external readers (data plane, FFI) share it.

Service policy

  • The service filter now classifies IPv6 and fails closed on anything unclassifiable under a default-deny policy (previously IPv6 bypassed filtering).

Tests

9 new regression tests covering each fix (membership cap/reclaim, expired-peer buffer, org-signature verification, gossip non-eviction, base64 bound, DNS response validation, IPv6 filter fail-closed, handshake rate limiter, RX source-IP enforcement).

Notes for review

  • The thread-safety locks are covered by reasoning + unit tests; a multi-node / TSan soak under adversarial churn is recommended before relying on them in production.
  • A full WireGuard cookie/MAC2 reply (to fully mitigate source-spoofed handshake floods) is a sensible follow-up; this PR adds the token-bucket bound.

🤖 Generated with Claude Code

Addresses the confirmed Critical/High issues (plus two low-risk Mediums)
from an adversarial review of the networking stack. Each fix ships with a
regression test; green on `zig build test`, Debug, ReleaseFast, ReleaseSafe.

Authentication / spoofing
- Verify the Ed25519 signature on org control messages (revoke/alias/vouch)
  and require a trusted org before acting on them (swim.zig, codec.zig).
- Enforce RX cryptokey routing: a decrypted packet's inner source IP must
  belong to the sending peer, for IPv4 and IPv6 (device.zig).
- Gossiped dead/leave/suspect about a third party now only marks it
  suspected locally and is confirmed by our own failure detector before any
  eviction; gossip can no longer clear local suspicion or resurrect a peer
  (swim.zig).
- DNS bootstrap: random per-query transaction ID, connected socket for
  source filtering, and transaction-ID/question validation (dns.zig).

Memory safety (safety checks are off in the shipped ReleaseFast build)
- Bound the base64 token decode against the output buffer before decoding,
  fixing a stack out-of-bounds write on an over-long mg:// token
  (coordinated_punch.zig).
- Return expired-peer keys via a caller-owned buffer instead of a slice into
  a reclaimed stack frame (membership.zig).

Resource exhaustion / DoS
- Cap the membership table and reclaim non-alive entries so unauthenticated
  gossip cannot grow it without bound (membership.zig).
- Rate-limit inbound WireGuard handshake initiations (per-source + global
  token bucket) before the X25519 (device.zig).

Thread safety
- Add an RwLock to the WgDevice peer table (exclusive for mutation/handshake,
  shared for the data plane) so peer removal cannot zero key material while a
  worker is mid-encrypt (device.zig, main.zig).
- Add an RwLock to the membership table; the single writer takes it
  exclusively for mutations, external readers (data plane, FFI) share it
  (membership.zig, main.zig, meshguard_ffi.zig).

Service policy
- Classify IPv6 packets in the service filter and fail closed on anything
  unclassifiable under a default-deny policy (policy.zig).
Copilot AI review requested due to automatic review settings June 6, 2026 12:38
The H4 rate-limit change added a src_ip parameter to WgDevice.handleInitiation;
wg_interop.zig (built only on Linux) still used the old 1-arg call, breaking the
ubuntu CI build. macOS doesn't compile this binary, so local builds missed it.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 70e3460cb8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/wireguard/device.zig
Comment on lines +577 to +578
if (!innerSourceAllowed(peer.mesh_ip, peer.mesh_ip6, out[0..len])) {
return error.SourceSpoofed;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Allow framed tunnel payloads through decryptTransport

When the FFI tunnel API is used, meshguard_tunnel_send deliberately frames arbitrary payloads with a leading 0x00 header before encrypting them, and the receiver routes those packets through handleWgPacketdev.decryptTransport. This new unconditional inner-IP check makes decryptTransport return SourceSpoofed for any non-IPv4/IPv6 plaintext, so every framed tunnel message is dropped after decryption rather than being delivered to the tunnel inbox. The source-IP enforcement needs to be limited to TUN IP packets or provide a separate decrypt path for the framed tunnel transport.

Useful? React with 👍 / 👎.

Comment thread src/wireguard/device.zig
Comment on lines +476 to +477
if (!self.hs_limiter.allow(src_ip, nowNs())) {
return error.HandshakeRateLimited;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid charging auto-registered handshakes twice

In the FFI responder path, the first packet from a membership-known peer that is not yet in WgDevice is handled by calling handleInitiation, catching UnknownPeer, auto-registering, and then immediately retrying handleInitiation. Since the limiter token is consumed here before the UnknownPeer lookup, that single legitimate first handshake burns one token on the failed probe and another on the retry; if the source has only one token left, the retry is rate-limited and no response is sent. Consider not consuming the bucket for UnknownPeer probes or making the auto-register retry reuse the admission decision.

Useful? React with 👍 / 👎.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Hardening pass across Meshguard’s networking stack to close multiple security and reliability gaps in the WireGuard/userspace data plane, SWIM membership handling, DNS bootstrap, and service-policy enforcement, with accompanying regression tests and documentation updates.

Changes:

  • Add WireGuard peer-table/thread-safety protections (device RwLock) plus a handshake-init token-bucket limiter and RX inner-source enforcement.
  • Harden discovery/membership against spoofing and resource exhaustion (org control signature verification, gossip “dead” handling, membership cap + reclaim, safer expireSuspected API).
  • Fix service policy IPv6 bypass and DNS bootstrap response spoofing weaknesses (random TXID, connected UDP, response validation).

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/wireguard/device.zig Adds handshake rate limiter, peer-table RwLock, and RX inner-source enforcement plus tests.
src/services/policy.zig Adds IPv6-capable packet classification and a fail-closed allowPacket API plus tests.
src/protocol/codec.zig Adds canonical signed-payload builders for org control message verification.
src/net/dns.zig Randomizes DNS TXID, connects UDP socket, validates response matches query, adds tests.
src/nat/coordinated_punch.zig Bounds base64 decode output to prevent OOB writes; adds regression tests.
src/meshguard_ffi.zig Adds membership read locks for several exported FFI entry points; updates WG initiation call signature.
src/main.zig Uses new service-filter API and adds WG device shared locks in some dataplane paths.
src/discovery/swim.zig Verifies org control signatures and changes gossip “dead/leave/suspect” handling; updates expire API usage; adds tests.
src/discovery/membership.zig Adds membership cap + reclaim, introduces RwLock, fixes expireSuspected use-after-return, adds tests.
SECURITY.md Documents the new security properties (RX routing, signed org control, confirmed failure detection, rate limiting).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/wireguard/device.zig
Comment on lines +272 to +274
// Per-source bucket (hash → slot, reset on collision with a different IP).
const h = (((@as(usize, ip[0]) *% 131 +% ip[1]) *% 131 +% ip[2]) *% 131 +% ip[3]);
const b = &self.buckets[h % SLOTS];
Comment thread src/wireguard/device.zig
Comment on lines 463 to +466
/// Handle an incoming handshake initiation (we are responder).
/// O(1) lookup via static key table instead of O(N) peer iteration.
pub fn handleInitiation(self: *WgDevice, msg: *const noise.HandshakeInitiation) !struct { response: noise.HandshakeResponse, slot: usize } {
// Step 0: Verify MAC1 BEFORE expensive DH (anti-DoS gate)
pub fn handleInitiation(self: *WgDevice, msg: *const noise.HandshakeInitiation, src_ip: [4]u8) !struct { response: noise.HandshakeResponse, slot: usize } {
// Step 0: Verify MAC1 BEFORE expensive DH (cheap, no lock).
Comment thread src/net/dns.zig
Comment on lines +482 to +484
if (!std.ascii.eqlIgnoreCase(data[HEADER_LEN .. HEADER_LEN + q_end], q_buf[0..q_end])) return false;
if (readU16(data, HEADER_LEN + q_end) != qtype) return false; // QTYPE echoed
return true;
Comment on lines 118 to +124
const existing = self.peers.get(peer.pubkey);
if (existing) |e| {
// Only update if the incoming info is newer (higher Lamport timestamp)
if (peer.lamport <= e.lamport) return;
try self.peers.put(peer.pubkey, peer);
return;
}
Comment thread src/main.zig
Comment on lines 2687 to 2692
if (wg_dev.decryptTransport(pkt, &out_buf)) |dec| {
// Service filter: check port access before TUN write
if (lib.services.Policy.parseTransportHeader(out_buf[0..dec.len])) |ti| {
const peer = wg_dev.peers[dec.slot] orelse continue;
const org_pk = if (membership.peers.getPtr(peer.identity_key)) |mp| mp.org_pubkey else null;
if (!service_filter.check(peer.identity_key, org_pk, ti.proto, ti.dst_port)) continue;
// Service filter: check access before TUN write (IPv4 + IPv6, M5).
if (wg_dev.peers[dec.slot]) |peer| {
const org_pk = orgPubkeyLocked(membership, peer.identity_key);
if (!service_filter.allowPacket(peer.identity_key, org_pk, out_buf[0..dec.len])) continue;
}
Comment thread src/main.zig
Comment on lines 2731 to 2736
if (wg_dev.decryptTransport(pkt, &decrypt_storage[n_decrypted.*])) |result| {
// Check service filter before buffering
const PolicyMod = lib.services.Policy;
if (PolicyMod.parseTransportHeader(decrypt_storage[n_decrypted.*][0..result.len])) |ti| {
if (wg_dev.peers[result.slot]) |peer| {
const org_pk = if (swim.membership.peers.getPtr(peer.identity_key)) |mp| mp.org_pubkey else null;
if (!service_filter.check(peer.identity_key, org_pk, ti.proto, ti.dst_port)) return;
}
// Check service filter before buffering (IPv4 + IPv6, M5).
if (wg_dev.peers[result.slot]) |peer| {
const org_pk = orgPubkeyLocked(swim.membership, peer.identity_key);
if (!service_filter.allowPacket(peer.identity_key, org_pk, decrypt_storage[n_decrypted.*][0..result.len])) return;
}
Code-review follow-ups on the hardening PR:

- P1 (regression): the H1 RX source-IP check rejected non-IP plaintext, which
  dropped every FFI framed tunnel message (framed with a leading-zero length
  header → version nibble 0). innerSourceAllowed now only constrains IPv4/IPv6
  packets (the IP-routed ones); non-IP framed payloads pass to the inbox.
- H6 race: the daemon re-read wg_dev.peers[slot] after decryptTransport had
  released the device lock — a data race vs removePeer that could also skip the
  service filter. decryptTransport now returns the peer identity_key (captured
  under the lock); all four daemon filter blocks and the FFI use it, so peers[]
  is never re-read post-decrypt.
- P2: the FFI auto-register path charged the handshake rate limiter twice for one
  legitimate handshake (initial UnknownPeer probe + retry). Added
  handleInitiationAdmitted for the retry, which skips re-admission.
- DNS: responseMatchesQuery now also validates the echoed QCLASS.
- membership.upsert: free the previous name allocation on the update path
  (consistent with remove/deinit) to avoid a leak.
- device.zig: make the handshake-limiter hash operands explicitly usize; give
  the initiation handlers a shared named return type (InitiationResult).

Verified: zig build test, Debug, ReleaseFast, ReleaseSafe, and x86_64-linux
cross-compile (Debug + ReleaseFast).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab85b27106

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/main.zig
Comment on lines +2304 to +2305
wg_dev.lock.lockSharedUncancelable(zio());
defer wg_dev.lock.unlockShared(zio());

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Release the device lock before the next batch

In this loop, defer is scoped to cryptoWorkerPipeline, not to the current iteration, so after the first popped batch each crypto worker keeps a shared wg_dev.lock held until the thread exits (including across continue and the next crypto_q.pop()). Any later exclusive operation on the same device lock, such as initiateHandshake, handleResponse, or removePeer, can then block indefinitely once data-plane traffic has passed through the worker, preventing rekeys/handshakes and peer cleanup in userspace mode.

Useful? React with 👍 / 👎.

@igorls igorls merged commit 56d9d8d into main Jun 6, 2026
4 checks passed
@igorls igorls deleted the security/networking-hardening branch June 6, 2026 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants