Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
87 changes: 87 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,93 @@ and this file MUST be updated together whenever `__version__` changes.

---

## [0.8.0-dev10] — Webhook & API security hardening (security)

A full security review of the inbound webhook surface (requested before
opening a public firewall port for the Meraki receiver) found nine
issues spanning authentication, exposure, DoS, and replay. This release
fixes all of them. The theme is **fail closed**: receivers reject
requests they cannot authenticate, the public ingress exposes only the
receiver/health paths, and request bodies are bounded everywhere.

### New configuration (core secret / env)

All optional; secure defaults. Set via the `netcortex/core` secret or
the matching `NETCORTEX_*` env var.

| Key | Default | Purpose |
| --- | --- | --- |
| `api_secret` | `""` | When set, `/`, `/api/*`, `/metrics`, docs, and the telemetry SSE monitor require `Authorization: Bearer <api_secret>`. |
| `webhook_allow_unsigned` | `false` | Master fail-open switch. `false` = reject webhooks for tenants with no configured secret. Enable only to bootstrap a new receiver. |
| `webhook_max_body_bytes` | `1048576` (1 MiB) | Hard cap on webhook/telemetry request bodies (413 over cap). |
| `webhook_replay_window_seconds` | `300` | Reject webhooks whose trusted timestamp (e.g. Meraki `sentAt`) is outside this window. `0` disables. |
| `telemetry_secret` | `""` | Shared `X-Telemetry-Token` required on the HTTP telemetry-push ingest. |
| `cors_allow_origins` | `[]` | Explicit browser-origin allow-list. We never ship `*`. |

### Findings fixed

- **F1 — HMAC/token auth failed open when no secret was configured.**
`meraki.py` and `catalyst_center.py` now **fail closed** (503) when no
signing secret is provisioned for a tenant, unless
`webhook_allow_unsigned=true`. Previously they accepted the request
with only a warning.
- **F2 — Entire API + topology exposed unauthenticated on the public
port.** Two layers: (a) a new app-level `_api_auth` middleware gates
every non-receiver path behind `api_secret` when set; (b) the Helm
ingress now defaults to `exposeApi=false`, publishing only
`/webhooks`, `/ingest`, and `/health` — the UI/API/metrics/MCP stay
cluster-internal unless `exposeApi=true` is opted into.
- **F3 — No request body-size limit (memory-exhaustion DoS).** New
in-handler `Content-Length` + materialized-body guards (413) plus an
ingress `proxy-body-size: 1m` annotation.
- **F4 — Unauthenticated amplification via background adapter sync.**
Webhook-driven syncs now funnel through a coalescing scheduler
(`sync_coalesce.py`): single-flight + trailing coalesce + min-interval
+ global concurrency cap. The generic catch-all endpoint is now gated
behind the administrative `api_secret`.
- **F5 — Non-constant-time token comparison (Catalyst Center).** Now
uses `hmac.compare_digest`.
- **F6 — Unauthenticated telemetry ingest + SSE stream.** Ingest
requires `X-Telemetry-Token` (`telemetry_secret`); the SSE monitor is
gated behind `api_secret`. Both are body-size capped.
- **F7 — Nexus Dashboard API key never verified.** New dedicated
`nexus_dashboard.py` handler verifies `X-ND-API-Key` in constant time
and fails closed.
- **F8 — Broad CORS + info leak via `/api` and `/metrics`.** CORS no
longer ships `*` (explicit allow-list via `NETCORTEX_CORS_ALLOW_ORIGINS`,
default none); `/metrics` and `/api/*` are covered by `_api_auth`.
- **F9 — No webhook replay protection.** Meraki deliveries are rejected
(403) when `sentAt` falls outside `webhook_replay_window_seconds`.

### Shared hardening module

New `netcortex/webhooks/auth.py` centralizes body-size enforcement,
fail-closed gating, constant-time comparison, replay checks, and the
admin/telemetry token gates — so every current and future receiver
inherits the same controls in one place (the same consolidation that
would have prevented the dev9 per-vendor regression).

### Operational notes

- **The status UI is no longer public by default.** After deploy, reach
it via `kubectl port-forward svc/<release>-web 8000:8000`, or set
`ingress.exposeApi=true` **and** an `api_secret`.
- Store webhook secrets at `netcortex/webhooks/<vendor>/<instance>`
(`shared_secret`, or `api_key` for Nexus Dashboard) before the
receiver will accept traffic.

### Tests

- `tests/webhooks/test_auth_helpers.py` — the shared auth/guard helpers.
- `tests/webhooks/test_other_webhook_routes.py` — Catalyst Center,
Nexus Dashboard, generic, and telemetry route auth.
- `tests/webhooks/test_sync_coalesce.py` — coalescing scheduler.
- `tests/test_api_auth.py` — API-auth middleware + path classification.
- Updated `tests/webhooks/test_meraki_webhook_route.py` for fail-closed,
replay (403), and body-size (413) behavior.

---

## [0.8.0-dev9] — Fix webhook HMAC verify silently failing open (security)

**Pre-existing bug caught by the dev8 deploy.** While verifying the
Expand Down
30 changes: 29 additions & 1 deletion deploy/helm/templates/ingress.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -5,8 +5,17 @@ metadata:
name: {{ include "netcortex.fullname" . }}
labels:
{{- include "netcortex.labels" . | nindent 4 }}
{{- with .Values.ingress.annotations }}
annotations:
# Body-size cap (F3, 0.8.0-dev10): outermost defense against
# memory-exhaustion via large unauthenticated POST bodies. Keep aligned
# with the in-app webhook_max_body_bytes guard (default 1 MiB).
nginx.ingress.kubernetes.io/proxy-body-size: {{ .Values.ingress.proxyBodySize | quote }}
# Per-source request-rate cap (defense-in-depth against webhook floods).
{{- if .Values.ingress.rateLimit.enabled }}
nginx.ingress.kubernetes.io/limit-rps: {{ .Values.ingress.rateLimit.rps | quote }}
nginx.ingress.kubernetes.io/limit-connections: {{ .Values.ingress.rateLimit.connections | quote }}
{{- end }}
{{- with .Values.ingress.annotations }}
{{- toYaml . | nindent 4 }}
{{- end }}
spec:
Expand All @@ -23,11 +32,30 @@ spec:
- host: {{ .Values.ingress.hostname }}
http:
paths:
{{- if .Values.ingress.exposeApi }}
# exposeApi=true: the whole app (status UI, /api, /metrics, /mcp)
# is reachable on this host. ONLY do this on a trusted network or
# with api_secret set so /api and /metrics require a bearer token.
- path: /
pathType: Prefix
backend:
service:
name: {{ include "netcortex.fullname" . }}-web
port:
number: {{ .Values.service.port }}
{{- else }}
# Default (F2, 0.8.0-dev10): expose ONLY the public receiver and
# health surface. The status UI, topology/inventory API, metrics,
# and MCP endpoint stay cluster-internal (reach them via
# `kubectl port-forward` or a separate internal-only ingress).
{{- range .Values.ingress.publicPaths }}
- path: {{ . }}
pathType: Prefix
backend:
service:
name: {{ include "netcortex.fullname" $ }}-web
port:
number: {{ $.Values.service.port }}
{{- end }}
{{- end }}
{{- end }}
28 changes: 28 additions & 0 deletions deploy/helm/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -220,6 +220,34 @@ ingress:
# annotations:
# route.openshift.io/termination: edge

# ---------------------------------------------------------------------------
# Public exposure surface (0.8.0-dev10 security hardening)
# ---------------------------------------------------------------------------
# exposeApi=false (default, SECURE): the public ingress serves ONLY the
# webhook/telemetry receiver and health paths below. The status UI, the
# topology/inventory API (/api/*), /metrics, and the MCP endpoint stay
# cluster-internal. Reach them with `kubectl port-forward svc/<name>-web
# 8000:8000` or a separate internal-only ingress.
#
# exposeApi=true: serve the WHOLE app on this host. Only do this on a
# trusted network, or set `api_secret` in the core secret so /api,
# /metrics, /mcp, and the UI require `Authorization: Bearer <api_secret>`.
exposeApi: false
publicPaths:
- /webhooks
- /ingest
- /health

# Hard cap on request body size at the ingress (F3). Mirror the in-app
# webhook_max_body_bytes (default 1 MiB).
proxyBodySize: "1m"

# Per-source rate limiting (defense-in-depth against webhook floods).
rateLimit:
enabled: true
rps: 20 # requests/second per client IP
connections: 10 # concurrent connections per client IP

# -----------------------------------------------------------------------------
# Service Account
# -----------------------------------------------------------------------------
Expand Down
36 changes: 35 additions & 1 deletion docs/secrets.md
Original file line number Diff line number Diff line change
Expand Up @@ -108,10 +108,44 @@ The **instance ID** used throughout NetCortex (MCP tools, sync status, logs, Net
"ssh_timeout": 30,
"netconf_port": 830,
"restconf_port": 443,
"status_refresh_interval": 30
"status_refresh_interval": 30,

"api_secret": "random-32-byte-hex-secret",
"webhook_allow_unsigned": false,
"webhook_max_body_bytes": 1048576,
"webhook_replay_window_seconds": 300,
"telemetry_secret": "random-32-byte-hex-secret",
"cors_allow_origins": []
}
```

### HTTP API & webhook security keys (0.8.0-dev10)

All optional with secure defaults. Each also has a `NETCORTEX_*` env
equivalent for bootstrap before the backend is reachable.

| Key | Default | Effect |
| --- | --- | --- |
| `api_secret` | `""` | When set, every non-receiver HTTP path (`/`, `/api/*`, `/metrics`, docs, telemetry SSE) requires `Authorization: Bearer <api_secret>`. Leave empty only when the API is reached solely cluster-internally (the default ingress keeps it private). |
| `webhook_allow_unsigned` | `false` | Master fail-open switch. `false` = a webhook for a tenant with **no** configured secret is rejected (503). Set `true` only to bootstrap a brand-new receiver before its secret is stored, then turn it back off. |
| `webhook_max_body_bytes` | `1048576` | Hard body-size cap (413 over cap). Keep aligned with the ingress `proxy-body-size`. |
| `webhook_replay_window_seconds` | `300` | Reject webhooks whose trusted timestamp (e.g. Meraki `sentAt`) is outside ±this window. `0` disables. |
| `telemetry_secret` | `""` | Shared token required in `X-Telemetry-Token` on `POST /ingest/telemetry/{device}`. Fails closed when unset (unless `webhook_allow_unsigned`). |
| `cors_allow_origins` | `[]` | Browser-origin allow-list. Never `*`. The bundled UI is same-origin so it needs none. |

#### Per-vendor webhook secrets

Stored at `netcortex/webhooks/<vendor>/<instance_name>`:

```json
// netcortex/webhooks/meraki/<instance> → {"shared_secret": "..."}
// netcortex/webhooks/catalyst_center/<instance> → {"shared_secret": "..."}
// netcortex/webhooks/nexus_dashboard/<instance> → {"api_key": "..."}
```

A receiver **rejects all traffic (503)** until its secret is provisioned
(fail closed), unless `webhook_allow_unsigned=true`.

### `netcortex/devices/site/building-a` example

```json
Expand Down
2 changes: 1 addition & 1 deletion netcortex/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,4 +22,4 @@
``CHANGELOG.md`` MUST be kept in sync whenever ``__version__`` changes.
"""

__version__ = "0.8.0-dev9"
__version__ = "0.8.0-dev10"
102 changes: 102 additions & 0 deletions netcortex/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -117,6 +117,36 @@ class Settings:
mcp_transport: str
mcp_secret: str

# HTTP API / webhook security (0.8.0-dev10)
#
# api_secret: when non-empty, every non-public HTTP route (the status
# UI, /api/*, /metrics, the telemetry SSE monitor) requires
# `Authorization: Bearer <api_secret>`. Empty disables app-level API
# auth — in that mode the *ingress* is the control: the shipped chart
# only exposes the webhook/health receiver paths publicly, so /api
# stays cluster-internal. Set this whenever the API is reachable from
# an untrusted network.
# webhook_allow_unsigned: master fail-open switch. Default False =
# fail closed: a webhook for a tenant with no configured secret is
# rejected (503) instead of silently accepted. Set True only to
# bootstrap a brand-new receiver before its secret is provisioned.
# webhook_max_body_bytes: hard cap on webhook/telemetry request bodies.
# Requests larger than this are rejected (413) before parsing.
# webhook_replay_window_seconds: when a webhook payload carries a
# trusted timestamp (e.g. Meraki `sentAt`), reject it if older than
# this many seconds. 0 disables the freshness check.
# telemetry_secret: shared token required (header `X-Telemetry-Token`)
# on the HTTP telemetry-push ingest endpoint. Same fail-closed rules
# as webhook secrets.
# cors_allow_origins: explicit allow-list of browser origins. Empty =
# no cross-origin access (same-origin only) — we never ship `*`.
api_secret: str
webhook_allow_unsigned: bool
webhook_max_body_bytes: int
webhook_replay_window_seconds: int
telemetry_secret: str
cors_allow_origins: list[str]

# Sync engine
sync_backend: str
sync_conflict_policy: str
Expand Down Expand Up @@ -187,6 +217,21 @@ def __init__(self, bootstrap: BootstrapSettings) -> None:
self.redis_url = os.environ.get("REDIS_URL", "redis://redis:6379/0")
self.mcp_transport = "http"
self.mcp_secret = ""

# HTTP API / webhook security (0.8.0-dev10). Env vars provide the
# bootstrap defaults; the core secret can override during hydrate().
self.api_secret = os.environ.get("NETCORTEX_API_SECRET", "")
self.webhook_allow_unsigned = _env_bool(
"NETCORTEX_WEBHOOK_ALLOW_UNSIGNED", default=False
)
self.webhook_max_body_bytes = _env_int(
"NETCORTEX_WEBHOOK_MAX_BODY_BYTES", default=1_048_576 # 1 MiB
)
self.webhook_replay_window_seconds = _env_int(
"NETCORTEX_WEBHOOK_REPLAY_WINDOW_SECONDS", default=300
)
self.telemetry_secret = os.environ.get("NETCORTEX_TELEMETRY_SECRET", "")
self.cors_allow_origins = _env_csv("NETCORTEX_CORS_ALLOW_ORIGINS")
# Secure-by-default. Override with NETBOX_VERIFY_SSL=0 or
# core-secret `netbox_verify_ssl=false` for self-signed labs.
_verify_env = os.environ.get("NETBOX_VERIFY_SSL")
Expand Down Expand Up @@ -278,6 +323,40 @@ async def hydrate(self) -> None:
self.redis_url = core.get("redis_url") or self.redis_url
self.mcp_transport = core.get("mcp_transport", self.mcp_transport)
self.mcp_secret = core.get("mcp_secret", self.mcp_secret)

# HTTP API / webhook security — core secret overrides env defaults.
self.api_secret = core.get("api_secret", self.api_secret)
raw_allow_unsigned = core.get("webhook_allow_unsigned", self.webhook_allow_unsigned)
if isinstance(raw_allow_unsigned, str):
self.webhook_allow_unsigned = raw_allow_unsigned.strip().lower() in {
"1", "true", "yes", "on",
}
else:
self.webhook_allow_unsigned = bool(raw_allow_unsigned)
self.webhook_max_body_bytes = int(
core.get("webhook_max_body_bytes", self.webhook_max_body_bytes)
)
self.webhook_replay_window_seconds = int(
core.get("webhook_replay_window_seconds", self.webhook_replay_window_seconds)
)
self.telemetry_secret = core.get("telemetry_secret", self.telemetry_secret)
raw_cors = core.get("cors_allow_origins", None)
if raw_cors is not None:
if isinstance(raw_cors, str):
self.cors_allow_origins = [
o.strip() for o in raw_cors.split(",") if o.strip()
]
elif isinstance(raw_cors, (list, tuple)):
self.cors_allow_origins = [str(o).strip() for o in raw_cors if str(o).strip()]
# Loud warning if the operator left fail-open enabled — this should
# never be true in a production deployment.
if self.webhook_allow_unsigned:
log.warning(
"settings.webhook_allow_unsigned_enabled",
hint="Webhooks for tenants without a configured secret will be "
"ACCEPTED. Set webhook_allow_unsigned=false once secrets "
"are provisioned.",
)
self.sync_backend = core.get("sync_backend", self.sync_backend)
self.sync_conflict_policy = core.get("sync_conflict_policy", self.sync_conflict_policy)

Expand Down Expand Up @@ -342,6 +421,29 @@ def _require(d: dict[str, Any], key: str, path: str) -> Any:
return d[key]


def _env_bool(name: str, *, default: bool) -> bool:
raw = os.environ.get(name)
if raw is None:
return default
return raw.strip().lower() in {"1", "true", "yes", "on"}


def _env_int(name: str, *, default: int) -> int:
raw = os.environ.get(name)
if raw is None or not raw.strip():
return default
try:
return int(raw)
except ValueError:
log.warning("settings.env_int_invalid", name=name, value=raw, fallback=default)
return default


def _env_csv(name: str) -> list[str]:
raw = os.environ.get(name, "")
return [item.strip() for item in raw.split(",") if item.strip()]


# ---------------------------------------------------------------------------
# Singleton — populated at startup via Settings.create()
# ---------------------------------------------------------------------------
Expand Down
Loading
Loading