From 94ba90abf100a680bc067c030fd6ecfbe1ad38f2 Mon Sep 17 00:00:00 2001
From: Sergio Arroutbi
Date: Thu, 23 Apr 2026 20:40:58 +0200
Subject: [PATCH] Move spec files to keylime-webtool/spec repository

The SRS and SDD now live in their own dedicated repository at
https://github.com/keylime-webtool/spec. Replace the spec files with a
README redirecting to the new location and update references in
CLAUDE.md and README.md.

Co-Authored-By: Claude Opus 4.6
Signed-off-by: Sergio Arroutbi
---
 CLAUDE.md                           |   11 +-
 README.md                           |    7 +-
 spec/README.md                      |   13 +
 spec/SDD-Keylime-Monitoring-Tool.md | 1178 --------
 spec/SRS-Keylime-Monitoring-Tool.md | 3918 ---------------------------
 5 files changed, 18 insertions(+), 5109 deletions(-)
 create mode 100644 spec/README.md
 delete mode 100644 spec/SDD-Keylime-Monitoring-Tool.md
 delete mode 100644 spec/SRS-Keylime-Monitoring-Tool.md

diff --git a/CLAUDE.md b/CLAUDE.md
index f0360f6..7e95680 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -4,7 +4,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 ## Repository Purpose
 
-Documentation repository for the Keylime Monitoring Dashboard project. Contains LaTeX/Beamer presentations, the formal SRS (Software Requirements Specification), and the SDD (Software Design Description). Licensed under CC BY-SA 4.0.
+Documentation repository for the Keylime Monitoring Dashboard project. Contains LaTeX/Beamer presentations. The formal SRS and SDD have moved to [keylime-webtool/spec](https://github.com/keylime-webtool/spec). Licensed under CC BY-SA 4.0.
 
 ## Build Commands
 
@@ -44,9 +44,7 @@ git config core.hooksPath .githooks
 
 ## Repository Structure
 
-- **`spec/`** — Formal specifications:
-  - `SRS-Keylime-Monitoring-Tool.md` — 70 FRs, 23 NFRs, 29 SRs with Gherkin acceptance criteria and implementation refinements (Section 7)
-  - `SDD-Keylime-Monitoring-Tool.md` — IEEE 1016-2009 design description covering architecture, data models, API contracts, and state machines
+- **`spec/`** — Redirect to [keylime-webtool/spec](https://github.com/keylime-webtool/spec) where the SRS and SDD now live
 - **`slides/`** — LaTeX/Beamer presentations, each in a date-prefixed directory with numbered `.tex` files (`000-` is the main document that `\include`s the rest)
 - **`slides/beamerthemeRedHat.sty`** — Shared Red Hat Beamer theme (symlinked from each presentation directory)
 - **`slides/fonts/`** — Red Hat Display font family (shared, symlinked)
@@ -61,7 +59,4 @@ git config core.hooksPath .githooks
 
 ## Spec Conventions
 
-- Requirements use tags: `FR-NNN` (functional), `NFR-NNN` (non-functional), `SR-NNN` (security)
-- RFC 2119 keywords (MUST, SHALL, SHOULD, MAY) indicate requirement levels
-- Acceptance criteria use Gherkin `Given/When/Then` syntax
-- Section 7 of the SRS tracks implementation refinements (data models, API contracts, enumerations) that emerged during development
+The SRS and SDD now live in [keylime-webtool/spec](https://github.com/keylime-webtool/spec). Refer to that repository's CLAUDE.md for spec conventions.
diff --git a/README.md b/README.md
index 7812074..aba043e 100644
--- a/README.md
+++ b/README.md
@@ -11,11 +11,8 @@ Documentation and presentation slides for the Keylime Web Tool project.
 
 ## Specifications
 
-The `spec/` directory contains the formal requirements and review documents for the project:
-
-| Document | Description |
-|----------|-------------|
-| `spec/SRS-Keylime-Monitoring-Tool.md` | Software Requirements Specification (SRS) covering 70 functional, 23 non-functional, and 29 security requirements with Gherkin acceptance criteria. Includes implementation refinements (Section 7) tracking data models, API contracts, and enumerations that emerged during development. |
+The formal specifications (SRS and SDD) have moved to their own repository:
+**<https://github.com/keylime-webtool/spec>**
 
 ## Building
 
diff --git a/spec/README.md b/spec/README.md
new file mode 100644
index 0000000..d6a5377
--- /dev/null
+++ b/spec/README.md
@@ -0,0 +1,13 @@
+# Specifications
+
+The formal specifications for the Keylime Monitoring Dashboard have moved to
+their own repository:
+
+**<https://github.com/keylime-webtool/spec>**
+
+That repository contains:
+
+- **SRS** (Software Requirements Specification) -- 70 FRs, 23 NFRs, 29 SRs
+  with Gherkin acceptance criteria
+- **SDD** (Software Design Description) -- IEEE 1016-2009 design description
+  covering architecture, data models, API contracts, and state machines
diff --git a/spec/SDD-Keylime-Monitoring-Tool.md b/spec/SDD-Keylime-Monitoring-Tool.md
deleted file mode 100644
index 2bfd6c6..0000000
--- a/spec/SDD-Keylime-Monitoring-Tool.md
+++ /dev/null
@@ -1,1178 +0,0 @@
-# Software Design Description: Keylime Monitoring Dashboard
-
-* **Companion SRS:** `spec/SRS-Keylime-Monitoring-Tool.md`
-* **Standard:** IEEE 1016-2009 (ISO/IEC/IEEE 42010 compatible)
-* **Initial Date:** 2026-04-15
-* **Methodology:** Spec-Driven Development (SDD)
-
----
-
-## 1. Introduction
-
-### 1.1 Purpose
-
-This Software Design Description (SDD) documents the *how* of the Keylime Monitoring Dashboard.
It complements the SRS (which documents the *what*) by describing the architectural decisions, component decomposition, data models, API contracts, interaction patterns, state machines, algorithms, and deployment topology that realize the SRS requirements.
-
-### 1.2 Scope
-
-The SDD covers the full system: a React.js + TypeScript single-page application frontend, a Rust (Axum) asynchronous backend, and their integration with Keylime Verifier/Registrar APIs, TimescaleDB, and Redis.
-
-### 1.3 Definitions
-
-| Term | Definition |
-|------|-----------|
-| SPA | Single Page Application |
-| mTLS | Mutual Transport Layer Security |
-| OIDC | OpenID Connect |
-| RBAC | Role-Based Access Control |
-| KPI | Key Performance Indicator |
-| TPM | Trusted Platform Module |
-| PCR | Platform Configuration Register |
-| IMA | Integrity Measurement Architecture |
-| SRS | Software Requirements Specification |
-| SDD | Software Design Description |
-
-### 1.4 Design Stakeholders and Concerns
-
-| Stakeholder | Concern |
-|-------------|---------|
-| Backend Developer | Component decomposition, API contracts, data flow, error handling |
-| Frontend Developer | UI component hierarchy, state management, API integration |
-| Security Architect | Trust boundaries, authentication flow, mTLS topology, RBAC enforcement |
-| DevOps Engineer | Deployment topology, configuration, container images, health checks |
-| QA Engineer | Testable interfaces, state machines, deterministic algorithms |
-
-### 1.5 SRS Cross-Reference
-
-This SDD traces design elements to the SRS via `(FR-nnn)`, `(NFR-nnn)`, and `(SR-nnn)` references. Every SRS requirement with an implementation MUST be traceable to at least one design element in this document.
-
----
-
-## 2. Design Viewpoints
-
-This SDD uses the following IEEE 1016 viewpoints:
-
-| Viewpoint | Section | Purpose |
-|-----------|---------|---------|
-| Context | 3.1 | System boundaries and external interfaces |
-| Composition | 3.2 | Major components and their responsibilities |
-| Logical | 3.3 | Data models, type hierarchies, domain entities |
-| Interface | 3.4 | API contracts, response formats, WebSocket protocol |
-| Interaction | 3.5 | Component communication, data flow sequences |
-| State Dynamics | 3.6 | State machines, lifecycle transitions |
-| Algorithm | 3.7 | Key algorithms and computation strategies |
-| Resource | 3.8 | Deployment topology, infrastructure, configuration |
-
----
-
-## 3. Design Views
-
-### 3.1 Context View
-
-#### 3.1.1 System Boundary
-
-```text
-                          +---------------------------+
-                          |     Identity Provider     |
-                          |        (OIDC/SAML)        |
-                          +------------+--------------+
-                                       |
-                                       | OIDC flow (SR-001)
-                                       v
-+----------------+    TLS 1.3    +-----+-------+     mTLS     +-----------------+
-| Browser        | <-----------> |   Backend   | <----------> | Keylime         |
-| (React SPA)    |    (SR-008)   |  (Rust/Axum)|   (SR-004)   | Verifier API v2 |
-+----------------+               +------+-------+             +-----------------+
-        |                               |
-        | WebSocket (NFR-021)           | mTLS (SR-009)
-        |                               v
-        +-----------------------------+ +-----------------+
-                                      | | Keylime         |
-                                      | | Registrar API   |
-                +----------+          | +-----------------+
-                |TimescaleDB| <-------+
-                +----------+          |
-                +----------+          |
-                | Redis    | <--------+
-                +----------+
-```
-
-#### 3.1.2 External Interfaces
-
-| Interface | Protocol | Authentication | SRS Trace |
-|-----------|----------|---------------|-----------|
-| Keylime Verifier API | HTTPS (mTLS) | Client certificate | SR-004, NFR-002 |
-| Keylime Registrar API | HTTPS (mTLS) | Client certificate | SR-004, NFR-002 |
-| Identity Provider | HTTPS | OIDC Authorization Code | SR-001 |
-| Browser | HTTPS (TLS 1.3) | JWT Bearer token | SR-008, SR-010 |
-| WebSocket | WSS | JWT query parameter | NFR-021 |
-| TimescaleDB | TCP | Connection string | NFR-005 |
-| Redis | TCP | Connection string | NFR-019 |
-| SIEM (Syslog/Splunk/ECS) | TCP/HTTPS | API token | FR-063 |
-
-### 3.2 Composition View
-
-#### 3.2.1 Backend Components
-
-```text
-keylime-webtool-backend/src/
-+-- main.rs                  Application entry point, server bootstrap, config loading
-+-- lib.rs                   Crate-level exports
-+-- state.rs                 Shared application state (AppState)
-+-- config.rs                Hierarchical configuration loading
-+-- settings_store.rs        TOML config file persistence (FR-075)
-+-- error.rs                 Centralized error handling (AppError)
-+-- api/
-|   +-- routes.rs            Route hierarchy and nesting
-|   +-- response.rs          Standard API response envelope
-|   +-- ws.rs                WebSocket endpoint handler
-|   +-- handlers/
-|       +-- kpis.rs          Fleet KPI aggregation (FR-001)
-|       +-- agents.rs        Agent CRUD and detail (FR-012..FR-023)
-|       +-- attestations.rs  Attestation analytics (FR-024..FR-032)
-|       +-- policies.rs      Policy management (FR-033..FR-039)
-|       +-- certificates.rs  Certificate lifecycle (FR-050..FR-053)
-|       +-- alerts.rs        Alert lifecycle (FR-047..FR-049)
-|       +-- audit.rs         Audit log queries (FR-042..FR-044)
-|       +-- compliance.rs    Compliance reporting (FR-059..FR-060)
-|       +-- integrations.rs  Backend connectivity with probes (FR-057..FR-058, FR-077)
-|       +-- performance.rs   System metrics (FR-064..FR-068)
-|       +-- settings.rs      Runtime URL and mTLS config (FR-072..FR-074)
-|       +-- auth.rs          Authentication endpoints (SR-001)
-+-- keylime/
-|   +-- client.rs            Keylime API client with circuit breaker and probes
-|   +-- models.rs            Keylime API response types (#[serde(default)])
-+-- auth/
-|   +-- jwt.rs               JWT encode/decode (SR-010)
-|   +-- oidc.rs              OIDC client configuration (SR-001)
-|   +-- rbac.rs              Role and permission model (SR-003)
-|   +-- session.rs           Server-side session revocation (SR-011)
-+-- audit/
-|   +-- logger.rs            Hash-chained audit entries (FR-061, SR-015)
-+-- storage/
-|   +-- db.rs                TimescaleDB connection pool
-|   +-- cache.rs             Redis cache with tiered TTLs (NFR-019)
-+-- models/
-    +-- agent.rs             Agent domain model and state enum
-    +-- attestation.rs       Attestation, pipeline, failure models
-    +-- policy.rs            Policy and approval workflow models
-    +-- certificate.rs       Certificate and expiry models
-    +-- alert.rs             Alert, severity, notification models
-    +-- alert_store.rs       In-memory alert store with lifecycle
-    +-- kpi.rs               KPI and summary aggregation types
-```
-
-**Trace:** Implementation -- `keylime-webtool-backend/src/`
-
-#### 3.2.2 Frontend Components
-
-```text
-keylime-webtool-frontend/src/
-+-- main.tsx                     React DOM root, strict mode
-+-- App.tsx                      QueryClient + RouterProvider, auto-refresh wiring
-+-- router.tsx                   Route definitions with Layout wrapper
-+-- index.css                    Global styles, CSS variables, dark/light theming
-+-- api/
-|   +-- client.ts                Axios instance, interceptors, Vite proxy, backend URL config
-|   +-- agents.ts                Agent API methods
-|   +-- attestations.ts          Attestation API methods
-|   +-- policies.ts              Policy API methods
-|   +-- certificates.ts          Certificate API methods
-|   +-- alerts.ts                Alert API methods
-|   +-- audit.ts                 Audit log API methods
-|   +-- performance.ts           Performance API methods
-|   +-- compliance.ts            Compliance API methods
-|   +-- settings.ts              Settings API methods (FR-072, FR-073)
-+-- components/
-|   +-- Layout/
-|   |   +-- Layout.tsx           Two-column layout, collapsible sidebar (FR-076)
-|   |   +-- Layout.css           Sidebar slide transition styles
-|   |   +-- Sidebar.tsx          Navigation with 10 modules, integration health indicator (FR-081)
-|   |   +-- TopBar.tsx           Hamburger toggle, search, theme toggle, user menu
-|   +-- common/
-|       +-- DataTable.tsx        Generic sortable, selectable table
-|       +-- KpiCard.tsx          Metric card with variant styling, optional drill-down link (FR-084)
-|       +-- StatusBadge.tsx      Color-coded status label
-|       +-- AgentStateChart.tsx  Recharts pie chart for agent states
-+-- hooks/
-|   +-- useAuth.ts               Authentication hook (wraps authStore)
-|   +-- useWebSocket.ts          WebSocket hook with reconnection
-+-- store/
-|   +-- authStore.ts             Zustand auth state (SR-003)
-|   +-- visualizationStore.ts    Zustand UI preferences (FR-008), auto-refresh settings
-+-- types/
-|   +-- index.ts                 Barrel exports, shared types (IntegrationService.endpoint)
-|   +-- agent.ts                 Agent domain types (includes start, saved states)
-|   +-- attestation.ts           Attestation domain types
-|   +-- policy.ts                Policy domain types
-|   +-- certificate.ts           Certificate domain types
-|   +-- alert.ts                 Alert domain types
-|   +-- audit.ts                 Audit domain types
-+-- pages/
-    +-- Dashboard/               Fleet overview with KPIs, charts, fallback KPIs (FR-001)
-    +-- Agents/                  Agent list with error state, detail with timeline (FR-012..FR-023)
-    +-- Attestations/            Attestation analytics (FR-024..FR-032)
-    +-- Policies/                Policy management with linked agent counts (FR-033..FR-039)
-    +-- Certificates/            Certificate lifecycle (FR-050..FR-053)
-    +-- Alerts/                  Alert dashboard (FR-047..FR-049)
-    +-- Performance/             System metrics (FR-064..FR-068)
-    +-- AuditLog/                Audit trail with hash verification (FR-042)
-    +-- Integrations/            Backend + core services status with 1s polling (FR-057, FR-077)
-    +-- Settings/                Connection URLs, mTLS certs, mock/prod toggle (FR-072..FR-074)
-    +-- Login/                   Authentication entry point (SR-001)
-```
-
-**Trace:** Implementation -- `keylime-webtool-frontend/src/`
-
-#### 3.2.3 Technology Stack
-
-| Layer | Technology | Version | SRS Trace |
-|-------|-----------|---------|-----------|
-| Frontend Framework | React | 18.3.1 | NFR-004 |
-| Frontend Language | TypeScript | 5.6 | NFR-004 |
-| Frontend Build | Vite | 6.0 | NFR-004 |
-| Frontend Routing | React Router | 6.26 | NFR-004 |
-| Frontend State | Zustand | 5.0 | — |
-| Frontend Data Fetching | TanStack React Query | 5.60 | — |
-| Frontend HTTP Client | Axios | 1.7 | — |
-| Frontend Charts | Recharts | 2.13 | — |
-| Backend Framework | Axum | 0.8 | NFR-005 |
-| Backend Language | Rust | 1.75+ | SR-023 |
-| Backend Runtime | Tokio | 1.x | NFR-005 |
-| Backend TLS | rustls | 0.23 | SR-004 |
-| Backend Database | sqlx (PostgreSQL) | 0.8 | — |
-| Backend Cache | redis | 0.27 | NFR-019 |
-| Backend Auth | openidconnect + jsonwebtoken | 4 / 9 | SR-001, SR-010 |
-| Backend HTTP Client | reqwest (rustls-tls) | 0.12 | SR-004 |
-| Backend Observability | tracing + OpenTelemetry | 0.1 / 0.27 | — |
-| Database | TimescaleDB (PostgreSQL) | — | — |
-| Cache | Redis | — | NFR-019 |
-
-### 3.3 Logical View
-
-#### 3.3.1 Backend Application State
-
-```rust
-pub struct AppState {
-    keylime_client: Arc<RwLock<Arc<KeylimeClient>>>, // Hot-swappable Keylime API client (FR-072)
-    pub alert_store: Arc<AlertStore>,                 // In-memory alert management
-    pub settings_store: Arc<SettingsStore>,           // TOML config persistence (FR-075)
-}
-```
-
-The `KeylimeClient` is wrapped in `Arc<RwLock<Arc<KeylimeClient>>>` to support hot-swapping at runtime: the outer `Arc` is shared across handlers, `RwLock` allows atomic replacement, and the inner `Arc` lets in-flight requests complete with the old client while new requests use the updated one. Handlers access the client via a `keylime()` accessor method. `SettingsStore` persists configuration changes to a TOML file on disk. `AppState` is cloned into every Axum handler via `State<AppState>` extractor.
-
-**Trace:** Implementation -- `keylime-webtool-backend/src/state.rs`
-
-#### 3.3.2 Agent Data Model
-
-**Full Agent Entity** (detail view, merges Verifier + Registrar data):
-
-| Field | Type | Source | SRS Trace |
-|-------|------|--------|-----------|
-| `id` | UUID | Verifier `agent_id` | FR-012 |
-| `ip` | string | Verifier | FR-012 |
-| `hostname` | string? | Derived | FR-004 |
-| `state` | AgentState | Verifier `operational_state` | FR-069 |
-| `attestation_mode` | Pull \| Push | Verifier `accept_attestations` | FR-054, FR-055 |
-| `verifier_id` | string | Derived | FR-064 |
-| `registration_date` | datetime | Registrar | FR-012 |
-| `last_attestation` | datetime? | Verifier | FR-012 |
-| `consecutive_failures` | integer | Verifier | FR-012 |
-| `total_attestations` | integer | Verifier `attestation_count` | FR-029 |
-| `boot_time` | datetime? | Verifier | FR-018 |
-| `hash_algorithm` | string | Verifier `hash_alg` | FR-021 |
-| `encryption_algorithm` | string | Verifier `enc_alg` | FR-018 |
-| `signing_algorithm` | string | Verifier `sign_alg` | FR-018 |
-| `ima_pcrs` | integer[] | Verifier | FR-021 |
-| `ima_policy_id` | string? | Verifier `ima_policy` | FR-033 |
-| `mb_policy_id` | string? | Verifier `mb_policy` | FR-036 |
-| `tpm_policy` | string? | Verifier | FR-018 |
-| `regcount` | integer | Registrar | SR-025 |
-
-**Agent Summary** (list view projection):
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `id` | UUID | Agent identifier |
-| `ip` | string | IP address |
-| `state` | AgentState | Current operational state |
-| `attestation_mode` | Pull \| Push | API version mode |
-| `last_attestation` | datetime? | Most recent attestation |
-| `assigned_policy` | string? | IMA policy name |
-| `mb_policy` | string? | Measured boot policy name |
-| `failure_count` | integer | Consecutive failure count |
-
-**Trace:** Implementation -- `keylime-webtool-backend/src/models/agent.rs`
-
-#### 3.3.3 Agent State Enumeration
-
-**Pull-Mode States** (Keylime v2 API, `operational_state` integer):
-
-> **Note:** States `Start` (1) and `Saved` (2) are transient operational states recognized by both the backend and frontend. The backend uses `#[serde(default)]` on Keylime API model fields to tolerate missing or renamed fields across Keylime versions without breaking deserialization.
-
-| State | Value | Description |
-|-------|-------|-------------|
-| `Registered` | 0 | Agent registered, not yet attesting |
-| `Start` | 1 | Attestation process initiated |
-| `Saved` | 2 | Quote saved for processing |
-| `GetQuote` | 3 | Actively attesting (healthy) |
-| `Retry` | 4 | Retrying after transient failure |
-| `ProvideV` | 5 | Providing V key to agent |
-| `Failed` | 7 | Attestation failed |
-| `Terminated` | 8 | Agent terminated by operator |
-| `InvalidQuote` | 9 | TPM quote validation failed |
-| `TenantFailed` | 10 | Tenant-initiated failure |
-
-**Push-Mode States** (Keylime v3 API, derived from `accept_attestations` flag):
-
-| State | Value | Description |
-|-------|-------|-------------|
-| `Pass` | 100 | Attestation passing |
-| `Fail` | 101 | Attestation failing |
-| `Pending` | 102 | Awaiting first attestation |
-| `Timeout` | 103 | Agent stopped responding (e.g., agent crash) |
-
-**Trace:** Implementation -- `keylime-webtool-backend/src/models/agent.rs`, SRS FR-069
-
-#### 3.3.4 Alert Data Model
-
-| Field | Type | Description | SRS Trace |
-|-------|------|-------------|-----------|
-| `id` | UUID | Alert identifier | FR-047 |
-| `type` | AlertType | Category (see table below) | FR-025 |
-| `severity` | AlertSeverity | `critical` \| `warning` \| `info` | FR-025 |
-| `description` | string | Human-readable description | FR-047 |
-| `affected_agents` | string[] | List of affected agent IDs | FR-047 |
-| `state` | AlertState | Lifecycle state (see 3.6.2) | FR-047 |
-| `created_timestamp` | datetime | Alert creation time | FR-047 |
-| `acknowledged_timestamp` | datetime? | When acknowledged | FR-047 |
-| `assigned_to` | string? | Investigator email | FR-047 |
-| `investigation_notes` | string? | Investigation details | FR-047 |
-| `root_cause` | string? | Identified root cause | FR-027 |
-| `resolution` | string? | Resolution description | FR-047 |
-| `auto_resolved` | boolean | Auto-resolved flag | FR-049 |
-| `escalation_count` | integer | Escalation counter | FR-048 |
-| `sla_window` | string? | SLA timeout window | FR-048 |
-| `source` | string | Alert source system | FR-047 |
-| `external_ticket_id` | string? | External ticket reference | FR-062 |
-
-**Alert Types:**
-
-| Type | Identifier | Description |
-|------|-----------|-------------|
-| Attestation Failure | `attestation_failure` | Quote verification or policy check failure |
-| Certificate Expiry | `cert_expiry` | Certificate approaching or past expiry |
-| Policy Violation | `policy_violation` | IMA or measured boot policy violation |
-| PCR Change | `pcr_change` | Unexpected PCR value change detected |
-| Service Down | `service_down` | Backend service unreachable |
-| Rate Limit | `rate_limit` | Rate limiting threshold exceeded |
-| Clock Skew | `clock_skew` | Time drift between agent and verifier |
-
-**Trace:** Implementation -- `keylime-webtool-backend/src/models/alert.rs`
-
-#### 3.3.5 Certificate Data Model
-
-| Field | Type | Description | SRS Trace |
-|-------|------|-------------|-----------|
-| `id` | UUID | Deterministic (UUID v5 from agent_id + cert_type) | FR-050 |
-| `cert_type` | CertificateType | `EK` \| `AK` \| `IAK` \| `I_DEV_ID` \| `M_TLS` \| `SERVER` | FR-050 |
-| `subject_dn` | string | X.509 Subject DN | FR-052 |
-| `issuer_dn` | string | X.509 Issuer DN | FR-052 |
-| `serial_number` | string | Certificate serial | FR-052 |
-| `not_before` | datetime | Validity start | FR-051 |
-| `not_after` | datetime | Validity end | FR-051 |
-| `public_key_algorithm` | string | Key algorithm (e.g., RSA) | FR-052 |
-| `public_key_size` | integer | Key size in bits | FR-052 |
-| `signature_algorithm` | string | Signature algorithm | FR-052 |
-| `sans` | string[] | Subject Alternative Names | FR-052 |
-| `key_usage` | string[] | Key usage extensions | FR-052 |
-| `status` | CertificateStatus | `valid` \| `expiring_soon` \| `critical` \| `expired` | FR-051 |
-| `associated_entity` | string | Agent ID or hostname | FR-050 |
-| `chain_valid` | boolean? | Chain validation result | FR-052 |
-
-**Certificate Expiry Derivation:** Since Keylime does not expose certificate metadata directly, the backend reconstructs certificate records from Registrar agent data:
-
-| Certificate | Source | Expiry Logic |
-|-------------|--------|-------------|
-| EK | Registrar `ek_tpm` | Fixed 10-year validity from registration date |
-| AK | Registrar `aik_tpm` | Agents with `regcount > 2`: 25-day validity; others: 2-year validity |
-
-**Expiry Status Thresholds:**
-
-| Status | Condition |
-|--------|-----------|
-| `expired` | `not_after < now` |
-| `expiring_soon` | `not_after < now + 30 days` |
-| `valid` | `not_after >= now + 30 days` |
-
-**Trace:** Implementation -- `keylime-webtool-backend/src/models/certificate.rs`, `keylime-webtool-backend/src/api/handlers/certificates.rs`
-
-#### 3.3.6 Attestation and Pipeline Models
-
-**Attestation Result:**
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `id` | UUID | Result identifier |
-| `agent_id` | UUID | Agent that was attested |
-| `result` | `pass` \| `fail` | Attestation outcome |
-| `failure_type` | FailureType? | Category of failure |
-| `failure_detail` | string? | Detailed failure description |
-| `latency_ms` | integer | Attestation duration |
-| `timestamp` | datetime | When attestation occurred |
-| `verification_stages` | PipelineStage[] | Per-stage results |
-
-**Verification Pipeline Stages** (FR-030):
-
-| Order | Stage | Identifier | Description |
-|-------|-------|-----------|-------------|
-| 1 | Receive Quote | `ReceiveQuote` | Receive TPM quote and measurement logs |
-| 2 | Validate TPM Quote | `ValidateTpmQuote` | Validate quote signature and nonce |
-| 3 | Check PCR Values | `CheckPcrValues` | Replay and verify PCR bank values |
-| 4 | Verify IMA Log | `VerifyImaLog` | Verify IMA entries against allowlist |
-| 5 | Verify Measured Boot | `VerifyMeasuredBoot` | Verify UEFI event log against policy |
-
-**Stage Status Values:**
-
-| Status | Description |
-|--------|-------------|
-| `Pass` | Stage completed successfully |
-| `Fail` | Stage failed (failure point) |
-| `NotReached` | Not executed (prior stage failed) |
-
-Each stage includes `duration_ms` (null if not reached).
-
-**Failure Correlation Types** (FR-026):
-
-| Type | Identifier | Description |
-|------|-----------|-------------|
-| Temporal | `Temporal` | Failures in the same time window |
-| Causal | `Causal` | Failures sharing the same root cause |
-| Topological | `Topological` | Failures grouped by subnet or verifier |
-| Policy-linked | `PolicyLinked` | Failures matching a recent policy update |
-
-**Trace:** Implementation -- `keylime-webtool-backend/src/models/attestation.rs`
-
-#### 3.3.7 Policy Data Model
-
-| Field | Type | Description | SRS Trace |
-|-------|------|-------------|-----------|
-| `id` | string | Policy identifier | FR-033 |
-| `name` | string | Policy name | FR-033 |
-| `kind` | `ima` \| `measured_boot` | Policy type discriminator | FR-033 |
-| `version` | integer | Current version number | FR-035 |
-| `checksum` | string | Content hash | FR-033 |
-| `entry_count` | integer | Number of policy entries | FR-033 |
-| `assigned_agents` | integer | Count of assigned agents | FR-037 |
-| `created_at` | datetime | Creation timestamp | FR-033 |
-| `updated_at` | datetime | Last update timestamp | FR-033 |
-| `updated_by` | string | Last editor identity | FR-043 |
-| `content` | string? | Policy content (detail only) | FR-034 |
-
-**Approval Workflow States:**
-
-| State | Description |
-|-------|-------------|
-| `Draft` | Policy created or modified, not yet submitted |
-| `PendingApproval` | Submitted for two-person review (FR-039) |
-| `Approved` | Approved by a different user than drafter (SR-018) |
-| `Rejected` | Approval denied |
-| `Expired` | Approval window expired without action |
-
-**Trace:** Implementation -- `keylime-webtool-backend/src/models/policy.rs`
-
-#### 3.3.8 Audit Entry Model
-
-| Field | Type | Description | SRS Trace |
-|-------|------|-------------|-----------|
-| `id` | integer | Sequential entry ID | FR-042 |
-| `timestamp` | datetime | Event timestamp | FR-042 |
-| `severity` | `critical` \| `warning` \| `info` | Event severity | FR-042 |
-| `actor` | string | User or service identity | FR-043 |
-| `action` | string | Operation type | FR-043 |
-| `resource` | string | Affected resource | FR-042 |
-| `source_ip` | string | Client IP address | FR-042 |
-| `user_agent` | string? | Browser user agent | FR-042 |
-| `result` | `success` \| `failure` | Operation outcome | FR-042 |
-| `previous_hash` | string | SHA-256 of previous entry | FR-061 |
-| `entry_hash` | string | SHA-256 of this entry | FR-061 |
-
-**Trace:** Implementation -- `keylime-webtool-backend/src/audit/logger.rs`
-
-#### 3.3.9 Notification Model
-
-| Field | Type | Description | SRS Trace |
-|-------|------|-------------|-----------|
-| `id` | UUID | Notification identifier | FR-010 |
-| `alert_id` | UUID | Associated alert | FR-010 |
-| `channel` | NotificationChannel | Delivery channel | FR-046 |
-| `status` | DeliveryStatus | Current delivery state | FR-010 |
-| `retry_count` | integer | Retry attempts | FR-010 |
-| `sent_at` | datetime? | Successful delivery time | FR-010 |
-
-**Notification Channels:** `Email`, `Slack`, `Webhook`, `ZeroMq`
-
-**Delivery Statuses:** `Pending`, `Sent`, `Failed`, `Retrying`
-
-**Trace:** Implementation -- `keylime-webtool-backend/src/models/alert.rs`
-
-#### 3.3.10 Frontend Type System
-
-The frontend mirrors backend models as TypeScript interfaces in `src/types/`. Key design decisions:
-
-* **Discriminated unions** for polymorphic types (`AlertType`, `FailureType`, `PolicyKind`)
-* **Literal string types** for enumerations (e.g., `'critical' | 'warning' | 'info'`)
-* **Generic `PaginatedResponse<T>`** for all list endpoints
-* **Optional fields** (`?`) for nullable backend values
-
-**Trace:** Implementation -- `keylime-webtool-frontend/src/types/`
-
-### 3.4 Interface View
-
-#### 3.4.1 Standard API Response Envelope
-
-All backend REST API responses use a standard JSON envelope:
-
-```json
-{
-  "success": true,
-  "data": { "..." },
-  "error": null,
-  "timestamp": "2026-04-15T12:00:00Z",
-  "request_id": "uuid-v4"
-}
-```
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `success` | boolean | Whether the request succeeded |
-| `data` | T \| null | Response payload (null on error) |
-| `error` | object \| null | Error object (null on success) |
-| `timestamp` | string (ISO 8601) | Server-side response timestamp |
-| `request_id` | string (UUID v4) | Unique request identifier for tracing |
-
-**Error Envelope:** On error, `success: false`, `data: null`, and `error` contains:
-
-* `code`: Machine-readable error code (e.g., `NOT_FOUND`, `UNAUTHORIZED`)
-* `message`: Human-readable description
-
-**HTTP Status Mapping:** 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found, 409 Conflict, 429 Too Many Requests, 502 Bad Gateway (Keylime API error), 503 Service Unavailable.
- -**Frontend Unwrapping:** The Axios response interceptor (`src/api/client.ts`) automatically unwraps the envelope, returning `data` directly to callers and redirecting to `/login` on 401. - -**Trace:** Implementation -- `keylime-webtool-backend/src/api/response.rs`, `keylime-webtool-frontend/src/api/client.ts` - -#### 3.4.2 Paginated Response Format - -All list endpoints use a standard pagination wrapper inside `data`: - -```json -{ - "items": [ "..." ], - "page": 1, - "page_size": 25, - "total_items": 250, - "total_pages": 10 -} -``` - -| Field | Type | Description | -|-------|------|-------------| -| `items` | T[] | Result items for the current page | -| `page` | integer | Current page number (1-based) | -| `page_size` | integer | Items per page | -| `total_items` | integer | Total matching items | -| `total_pages` | integer | Total pages | - -**Trace:** Implementation -- `keylime-webtool-backend/src/api/response.rs`, `keylime-webtool-frontend/src/types/index.ts` - -#### 3.4.3 REST API Route Hierarchy - -```text -/api -+-- /auth -| +-- POST /login Initiate OIDC login (SR-001) -| +-- POST /callback Auth code exchange (SR-001, SR-010) -| +-- POST /refresh JWT refresh rotation (SR-010) -| +-- POST /logout Session revocation (SR-011) -+-- /kpis -| +-- GET / Fleet KPI aggregation (FR-001) -+-- /agents -| +-- GET / List agents, paginated + filtered (FR-012, FR-013, FR-014) -| +-- GET /search Global search by UUID/IP (FR-004) -| +-- POST /bulk Bulk operations (FR-016) -| +-- GET /:id Agent detail (FR-018) -| +-- POST /:id/actions/:act Agent actions (FR-019) -| +-- GET /:id/timeline Attestation timeline (FR-020) -| +-- GET /:id/ima-log IMA log entries (FR-020) -| +-- GET /:id/boot-log Boot log entries (FR-020) -| +-- GET /:id/certificates Agent certificates (FR-020) -| +-- GET /:id/raw Raw data — combined by default (FR-020) -| +-- GET /:id/raw/backend Backend-computed agent summary (FR-020) -| +-- GET /:id/raw/registrar Raw Registrar API JSON (FR-020) -| +-- GET 
/:id/raw/verifier Raw Verifier API JSON (FR-020) -+-- /attestations -| +-- GET / List attestation results (FR-024) -| +-- GET /summary Attestation KPIs (FR-024) -| +-- GET /timeline Hourly timeline (FR-024) -| +-- GET /failures Failure categorization (FR-025) -| +-- GET /incidents Correlated incidents (FR-026) -| +-- GET /incidents/:id Incident detail (FR-027) -| +-- POST /incidents/:id/rollback Policy rollback (FR-028) -| +-- GET /pipeline/:agent_id Verification pipeline (FR-030) -| +-- GET /push-mode Push-mode analytics (FR-029) -| +-- GET /pull-mode Pull-mode monitoring (FR-054) -| +-- GET /state-machine State distribution (FR-069) -+-- /policies -| +-- GET / List policies (FR-033) -| +-- POST / Create policy (FR-034) -| +-- GET /assignment-matrix Policy assignments (FR-037) -| +-- POST /changes/:id/approve Two-person approval (FR-039) -| +-- GET /:id Policy detail (FR-034) -| +-- PUT /:id Update policy (FR-034) -| +-- DELETE /:id Delete policy (FR-034) -| +-- GET /:id/versions Version history (FR-035) -| +-- GET /:id/diff Version diff (FR-035) -| +-- POST /:id/rollback/:ver Rollback to version (FR-035) -| +-- POST /:id/impact Impact analysis (FR-038) -+-- /certificates -| +-- GET / List certificates (FR-050) -| +-- GET /expiry Expiry summary (FR-051) -| +-- GET /:id Certificate detail (FR-052) -| +-- POST /:id/renew Renew certificate (FR-053) -+-- /alerts -| +-- GET / List alerts, filtered (FR-047) -| +-- GET /summary Alert KPI counts (FR-047) -| +-- PUT /thresholds Configure thresholds (FR-011) -| +-- GET /notifications In-app notifications (FR-009) -| +-- GET /:id Alert detail (FR-047) -| +-- POST /:id/acknowledge Acknowledge (FR-047) -| +-- POST /:id/investigate Start investigation (FR-047) -| +-- POST /:id/resolve Resolve (FR-047) -| +-- POST /:id/dismiss Dismiss (FR-047) -| +-- POST /:id/escalate Escalate (FR-048) -+-- /audit-log -| +-- GET / Query audit events (FR-042) -| +-- GET /verify Hash chain verification (FR-061) -| +-- GET /export Export audit log 
(FR-042) -+-- /compliance -| +-- GET /frameworks List frameworks (FR-059) -| +-- GET /reports/:framework Framework report (FR-059) -| +-- POST /reports/:fw/export Export report (FR-060) -+-- /integrations -| +-- GET /status Backend connectivity with probes (FR-057, FR-077) -| +-- GET /durable Durable backends (FR-058) -| +-- GET /revocation-channels Revocation channels (FR-046) -| +-- GET /siem SIEM status (FR-063) -+-- /performance -| +-- GET /verifiers Verifier metrics (FR-064) -| +-- GET /database DB pool metrics (FR-065) -| +-- GET /api-response-times API latency (FR-066) -| +-- GET /config Live config + drift (FR-067) -| +-- GET /capacity Capacity projections (FR-068) -+-- /settings - +-- GET /keylime Get Verifier/Registrar URLs (FR-072) - +-- PUT /keylime Update Verifier/Registrar URLs (FR-072) - +-- GET /certificates Get mTLS certificate config (FR-073) - +-- PUT /certificates Update mTLS certificate config (FR-073) - -/ws -+-- GET /events WebSocket real-time updates (NFR-021) -``` - -**Trace:** Implementation -- `keylime-webtool-backend/src/api/routes.rs` - -#### 3.4.4 WebSocket Endpoint and Subscription Model - -| Property | Value | -|----------|-------| -| Endpoint | `/ws/events` | -| Authentication | JWT access token as query parameter `?token=` | -| Channels | `kpis`, `agents`, `alerts`, `policies` | -| Reconnection | Exponential backoff: `2^n * 1000ms`, capped at 30 seconds | - -**Frontend Hook:** `useWebSocket({ channel, onMessage, enabled })` manages connection lifecycle, automatic reconnection, and JSON message parsing. 
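The reconnection schedule in the table above can be pinned down with a small worked sketch. The frontend hook itself is TypeScript; Rust is used here only to make the arithmetic explicit. The function name `reconnect_delay_ms` and the choice of a 0-based attempt counter `n` are assumptions for illustration and are not taken from `useWebSocket.ts`.

```rust
/// Delay before reconnect attempt `attempt` (assumed 0-based), in milliseconds:
/// 2^n * 1000 ms, capped at 30 seconds.
fn reconnect_delay_ms(attempt: u32) -> u64 {
    const BASE_MS: u64 = 1_000;
    const CAP_MS: u64 = 30_000;
    // 2^5 * 1000 ms already exceeds the cap, so the exponent is clamped
    // before shifting to rule out overflow for very large attempt counts.
    let exponent = attempt.min(5);
    (BASE_MS << exponent).min(CAP_MS)
}
```

Under these assumptions the first attempts wait 1 s, 2 s, 4 s, 8 s, and 16 s, and every later attempt waits the 30 s cap.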
- -**Trace:** Implementation -- `keylime-webtool-backend/src/api/ws.rs`, `keylime-webtool-frontend/src/hooks/useWebSocket.ts` - -#### 3.4.5 Frontend Routing - -| Path | Component | SRS Trace | -|------|-----------|-----------| -| `/login` | Login | SR-001 | -| `/` | Dashboard | FR-001 | -| `/agents` | AgentList | FR-012 | -| `/agents/:id` | AgentDetail | FR-018 | -| `/attestations` | Attestations | FR-024 | -| `/policies` | Policies | FR-033 | -| `/certificates` | Certificates | FR-050 | -| `/alerts` | Alerts | FR-047 | -| `/performance` | Performance | FR-064 | -| `/audit` | AuditLog | FR-042 | -| `/integrations` | Integrations | FR-057 | -| `/settings` | Settings | FR-002, FR-006, FR-008 | - -All routes except `/login` are wrapped in the `Layout` component (Sidebar + TopBar). - -**Trace:** Implementation -- `keylime-webtool-frontend/src/router.tsx` - -### 3.5 Interaction View - -#### 3.5.1 Data Flow: Browser to Keylime - -```text -Browser (SPA) - | - | HTTPS (TLS 1.3) + JWT Bearer - v -Backend (Axum) - | - +-- JWT validation (SR-010) - +-- RBAC permission check (SR-003) - +-- Cache lookup (Redis, NFR-019) - | | - | +-- Cache HIT: return cached data - | +-- Cache MISS: continue to Keylime - | - +-- Circuit breaker check (NFR-017) - | | - | +-- OPEN: return 503 + cached/stale data - | +-- CLOSED/HALF-OPEN: proceed - | - +-- mTLS request to Keylime API (SR-004) - | - +-- Transform response to domain model - +-- Write to cache (with TTL) - +-- Audit log entry (FR-061) - +-- Return API envelope to browser -``` - -#### 3.5.2 Data Flow: Frontend State Management - -```text -User Action (click, navigate, search) - | - v -Page Component - | - +-- TanStack React Query (useQuery / useMutation) - | | - | +-- staleTime: 30s, retry: 1 - | +-- Automatic cache invalidation - | v - +-- API Module (src/api/*.ts) - | | - | v - +-- Axios Client (src/api/client.ts) - | | - | +-- Request interceptor: inject Bearer token - | +-- Response interceptor: unwrap envelope, handle 401 - | v - 
+-- Backend API - | - v -Re-render with new data -``` - -#### 3.5.3 Authentication Flow - -```text -1. User navigates to /login -2. Login page presents OIDC login (or demo login) -3. Backend redirects to IdP authorization endpoint -4. User authenticates with IdP (+ MFA for Admin, SR-002) -5. IdP redirects to /api/auth/callback with authorization code -6. Backend exchanges code for ID token + access token -7. Backend maps OIDC claims to internal Role (Viewer/Operator/Admin) -8. Backend creates short-lived JWT (15 min, SR-010) with Claims: - { sub, role, iat, exp, session_id, tenant_id } -9. Frontend stores JWT in sessionStorage -10. All subsequent requests include Authorization: Bearer -11. JWT refresh rotation extends session without re-authentication -12. Logout revokes session server-side (SR-011) -``` - -**Trace:** SRS SR-001, SR-002, SR-010, SR-011 - -### 3.6 State Dynamics View - -#### 3.6.1 Agent State Machine - -See Section 3.3.3 for state enumeration. The agent state machine is owned by Keylime (not the dashboard). 
The dashboard observes and visualizes state transitions but does not drive them, except for operator actions (FR-019): - -| Action | Effect on Keylime | -|--------|------------------| -| Reactivate | `PUT /v2/agents/{id}` -- resets agent to attesting | -| Stop | Pauses attestation monitoring | -| Delete | `DELETE /v2/agents/{id}` -- removes from verifier | - -#### 3.6.2 Alert Lifecycle State Machine - -```text -New --> Acknowledged --> UnderInvestigation --> Resolved - | | | - | | +---------------> Dismissed - | +---------------------------------> Dismissed - +-------------------------------------------> Dismissed -``` - -**Transition Rules:** - -| Action | Valid Source States | Target State | Side Effects | -|--------|-------------------|--------------|--------------| -| Acknowledge | `New` | `Acknowledged` | Sets `acknowledged_timestamp` | -| Investigate | `New`, `Acknowledged` | `UnderInvestigation` | Sets `acknowledged_timestamp` if null; optionally sets `assigned_to` | -| Resolve | Any non-terminal | `Resolved` | Optionally sets `resolution` reason | -| Dismiss | Any non-terminal | `Dismissed` | -- | -| Escalate | Any non-terminal | *(unchanged)* | Increments `escalation_count` | - -**Terminal States:** `Resolved` and `Dismissed` reject all further transitions. - -**Summary Computation:** The `critical`, `warnings`, and `info` counters returned by `GET /api/alerts/summary` count **all** alerts of their respective severity regardless of lifecycle state (including `Resolved` and `Dismissed`), matching the totals shown in the Alert Center list. The Dashboard "Urgent Alerts" KPI sums only **active** (non-terminal) `critical + warnings` to represent alerts currently needing attention; its subtitle displays the per-severity breakdown (e.g., "2 critical, 2 warnings"). All three severity counters are returned by `GET /api/alerts/summary`. 
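The transition rules above can be read as a pure function from (state, action) to an optional target state. The following is a minimal sketch under that reading, not the contents of `alert_store.rs`: the type and function names are hypothetical, and side effects such as `acknowledged_timestamp`, `assigned_to`, and `escalation_count` are deliberately left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AlertState {
    New,
    Acknowledged,
    UnderInvestigation,
    Resolved,
    Dismissed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AlertAction {
    Acknowledge,
    Investigate,
    Resolve,
    Dismiss,
    Escalate,
}

/// Terminal states reject all further transitions.
fn is_terminal(state: AlertState) -> bool {
    matches!(state, AlertState::Resolved | AlertState::Dismissed)
}

/// Returns the target state, or `None` when the transition is rejected.
fn next_state(state: AlertState, action: AlertAction) -> Option<AlertState> {
    use AlertAction::*;
    use AlertState::*;
    if is_terminal(state) {
        return None;
    }
    match (action, state) {
        (Acknowledge, New) => Some(Acknowledged),
        (Investigate, New | Acknowledged) => Some(UnderInvestigation),
        (Resolve, _) => Some(Resolved),
        (Dismiss, _) => Some(Dismissed),
        // Escalation leaves the lifecycle state unchanged.
        (Escalate, current) => Some(current),
        // Any other combination is not listed in the transition table.
        _ => None,
    }
}
```

Keeping the state change separate from the side effects makes the table easy to test exhaustively; a real store would apply the timestamp and counter updates around a call like this.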
- -**Trace:** Implementation -- `keylime-webtool-backend/src/models/alert_store.rs` - -#### 3.6.3 Policy Approval Lifecycle - -```text -Draft --> PendingApproval --> Approved --> Applied - | - +----------> Rejected - | - +----------> Expired (time-limited window) -``` - -**Constraint (SR-018):** The approver MUST NOT be the same identity as the drafter. - -**Trace:** SRS FR-039, SR-017, SR-018 - -### 3.7 Algorithm View - -#### 3.7.1 Attestation Timeline Distribution - -The attestation timeline (FR-024) distributes event counts across hourly buckets using a deterministic variation algorithm. Since the backend does not yet persist attestation history, current agent states provide baseline totals. - -**Algorithm:** - -1. Count total successful and failed agents from Verifier state. - * Pull-mode: `GET_QUOTE` = success; `FAILED`, `INVALID_QUOTE`, `TENANT_FAILED` = failure. - * Push-mode: `accept_attestations` flag determines pass/fail; absence of attestation submissions beyond the expected interval determines timeout. -2. Compute the number of hourly buckets from the requested time range. -3. Generate per-bucket weights: - `weight(i) = (1 + 0.5 * sin(i * 0.9)) * jitter(i)` - where `jitter(i) = 0.7 + ((i * 2654435761) >> 16 mod 100) / 100 * 0.6`, which lies in the half-open range [0.7, 1.3). -4. Normalize the weights so the bucket values sum to the total count. -5. Correct the integer rounding error by distributing the remainder to the buckets with the largest fractional parts, so the counts sum exactly to the total. - -**Supported Time Ranges:** `1h`, `6h`, `24h`, `7d`, `30d`.
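Steps 3 to 5 of the algorithm above can be sketched as follows. This is an illustration under stated assumptions, not the code in `attestations.rs`: the jitter expression is read as `((i * 2654435761) >> 16) mod 100` with wrapping 32-bit multiplication, and the names `weight` and `distribute` are hypothetical.

```rust
/// Per-bucket weight from step 3.
/// Assumption: the hash uses wrapping u32 arithmetic and the shift is
/// applied before the modulo.
fn weight(i: u32) -> f64 {
    let hash = (i.wrapping_mul(2_654_435_761) >> 16) % 100;
    let jitter = 0.7 + f64::from(hash) / 100.0 * 0.6;
    (1.0 + 0.5 * (f64::from(i) * 0.9).sin()) * jitter
}

/// Splits `total` events across `buckets` so the counts sum exactly to `total`.
fn distribute(total: u64, buckets: usize) -> Vec<u64> {
    if buckets == 0 {
        return Vec::new();
    }
    let weights: Vec<f64> = (0..buckets as u32).map(weight).collect();
    let weight_sum: f64 = weights.iter().sum();

    // Step 4: ideal fractional share of the total for each bucket.
    let shares: Vec<f64> = weights
        .iter()
        .map(|w| w / weight_sum * total as f64)
        .collect();
    let mut counts: Vec<u64> = shares.iter().map(|s| s.floor() as u64).collect();

    // Step 5: hand the leftover events to the buckets with the largest
    // fractional parts (largest-remainder correction).
    let assigned: u64 = counts.iter().sum();
    let mut order: Vec<usize> = (0..buckets).collect();
    order.sort_by(|&a, &b| {
        let frac_a = shares[a] - shares[a].floor();
        let frac_b = shares[b] - shares[b].floor();
        frac_b.total_cmp(&frac_a)
    });
    let remainder = total.saturating_sub(assigned) as usize;
    for k in 0..remainder {
        counts[order[k % buckets]] += 1;
    }
    counts
}
```

Because the weights depend only on the bucket index, the same time range always produces the same chart shape, which is the deterministic property the section relies on. A call such as `distribute(1000, 24)` yields 24 counts whose sum is 1000.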
- -**Trace:** Implementation -- `keylime-webtool-backend/src/api/handlers/attestations.rs` - -#### 3.7.2 Dashboard KPI Fallback Computation - -The frontend derives attestation KPIs from agent state data when no attestation history endpoint is available (FR-001): - -| KPI | Computation | -|-----|------------| -| Total Agents | `paginated_response.total_items` or `agents.length` | -| Failed Attestations | Count of agents in `failed`, `invalid_quote`, `tenant_failed` (pull) or `fail`, `timeout` (push) state | -| Success Rate | `((total - failed) / total) * 100` | -| Urgent Alerts | From `GET /api/alerts/summary` -> `critical + warnings` count; subtitle shows per-severity breakdown (e.g., "2 critical, 2 warnings") | -| Alert Center: Critical | From `GET /api/alerts/summary` -> `critical` (all states) | -| Alert Center: Warnings | From `GET /api/alerts/summary` -> `warnings` (all states) | -| Alert Center: Info | From `GET /api/alerts/summary` -> `info` (all states) | - -**Rationale:** Ensures the dashboard displays meaningful data before TimescaleDB attestation history persistence is implemented. - -**Trace:** Implementation -- `keylime-webtool-frontend/src/pages/Dashboard/Dashboard.tsx` - -#### 3.7.3 Circuit Breaker Pattern - -The Keylime API client implements a circuit breaker (NFR-017) to prevent cascading failures: - -```text -CLOSED --[failure_count >= threshold]--> OPEN -OPEN --[reset_timeout elapsed]------> HALF_OPEN -HALF_OPEN --[request succeeds]--------> CLOSED -HALF_OPEN --[request fails]-----------> OPEN -``` - -| Parameter | Default | Description | -|-----------|---------|-------------| -| `failure_threshold` | 5 | Consecutive failures to open circuit | -| `reset_timeout` | 60s | Time before attempting recovery | - -**Health Probes (FR-077):** The `KeylimeClient` exposes `probe_verifier()` and `probe_registrar()` methods that perform lightweight HTTP status-code-only checks, bypassing the circuit breaker. 
These are used by the integrations health endpoint so connectivity status always reflects real reachability, even when the circuit breaker is open due to prior failures or response deserialization mismatches across Keylime versions. - -**Trace:** Implementation -- `keylime-webtool-backend/src/keylime/client.rs` - -#### 3.7.4 Audit Log Hash Chain - -Each audit entry (FR-061, SR-015) is linked to its predecessor via SHA-256: - -```text -entry_hash = SHA-256(id + timestamp + severity + actor + action + - resource + source_ip + result + previous_hash) -``` - -* The first entry uses a well-known genesis hash. -* `verify_chain()` replays the chain and detects tampering or gaps. -* Optional RFC 3161 timestamp anchoring and Rekor integration for immutable storage. - -**Trace:** Implementation -- `keylime-webtool-backend/src/audit/logger.rs` - -### 3.8 Resource View - -#### 3.8.1 Backend Configuration Hierarchy - -```rust -AppConfig -+-- server -| +-- host: String // Default: "0.0.0.0" -| +-- port: u16 // Default: 8080 -| +-- tls_cert: Option -| +-- tls_key: Option -+-- keylime -| +-- verifier_url: String // Default: "http://localhost:3000" (runtime-configurable, FR-072) -| +-- registrar_url: String // Default: "http://localhost:3001" (runtime-configurable, FR-072) -| +-- mtls: Option // Runtime-configurable (FR-073) -| | +-- cert: PathBuf -| | +-- key: String // HSM/Vault URI (SR-005, SR-006) or file path -| | +-- ca_cert: PathBuf -| +-- timeout_secs: u64 // Default: 30 -| +-- circuit_breaker -| +-- failure_threshold: u32 // Default: 5 -| +-- reset_timeout_secs: u64 // Default: 60 -+-- database -| +-- url: String // PostgreSQL connection string -| +-- pool_size: u32 // Default: 20 -| +-- connect_timeout_secs: u64 // Default: 5 -+-- cache -| +-- redis_url: String -| +-- ttl_agent_list_secs: u64 // Default: 10 (NFR-019) -| +-- ttl_agent_detail_secs: u64 // Default: 30 (NFR-019) -| +-- ttl_policies_secs: u64 // Default: 60 (NFR-019) -| +-- ttl_certs_secs: u64 // Default: 300 
(NFR-019) -+-- auth -| +-- oidc -| | +-- issuer: String -| | +-- client_id: String -| | +-- client_secret: String -| | +-- redirect_uri: String -| +-- jwt_secret: String -| +-- session_timeout_secs: u64 // Default: 900 (SR-010) -| +-- mfa_required_for_admin: bool // Default: true (SR-002) -+-- audit -| +-- log_retention_days: u32 // Default: 365 (SR-026) -| +-- hash_algorithm: String // Default: "sha256" -| +-- rfc3161_timestamp_url: Option -| +-- rekor_url: Option -+-- integrations - +-- siem - | +-- syslog_endpoint: Option - | +-- splunk_hec_endpoint: Option - | +-- splunk_token: Option - | +-- prometheus_enabled: bool // Default: true - +-- slack_webhook_url: Option - +-- email: Option -``` - -**Configuration Persistence (FR-075):** At startup, configuration is loaded with priority: persisted TOML file > environment variables > compiled defaults. The TOML file path is resolved as: `KEYLIME_WEBTOOL_CONFIG` env var > `~/.config/keylime-webtool/settings.toml` > no persistence. File writes are atomic (temp file + rename) and run on `spawn_blocking` to avoid blocking the async runtime. Write failures log a warning but never fail the API request. - -**Trace:** Implementation -- `keylime-webtool-backend/src/config.rs`, `keylime-webtool-backend/src/settings_store.rs` - -#### 3.8.2 Frontend Configuration - -| Variable | Description | Default | -|----------|-------------|---------| -| `VITE_API_BASE_URL` | Backend API base URL | `""` (same origin) | -| `VITE_WS_URL` | WebSocket server URL | `ws://${window.location.host}/ws` | - -The Backend URL is also runtime-configurable via the Settings page (FR-072). The `Axios` client reads the configured backend URL from localStorage and uses the Vite dev server proxy to avoid CORS failures during development. 
- -**Dev Server Proxy** (Vite): - -* `/api/*` -> `http://localhost:8080` (API backend, avoids CORS in dev mode) -* `/ws/*` -> `ws://localhost:8080` (WebSocket) - -**Auto-Refresh Wiring:** The `visualizationStore` `autoRefresh` and `refreshInterval` settings are connected to `QueryClient.setDefaultOptions({ queries: { refetchInterval } })` in `App.tsx`, so all TanStack React Query hooks automatically poll at the configured interval when auto-refresh is enabled. When auto-refresh is disabled, `refetchInterval` is set to `false`, which stops all automatic polling. The refresh interval configuration control in the Settings page is also disabled (non-interactive) while auto-refresh is off, since the interval value has no effect. - -**Trace:** Implementation -- `keylime-webtool-frontend/vite.config.ts`, `keylime-webtool-frontend/src/App.tsx` - -#### 3.8.3 Cache TTL Strategy (NFR-019) - -| Namespace | TTL | Rationale | -|-----------|-----|-----------| -| Agent List | 10s | Fleet view requires near-real-time state | -| Agent Detail | 30s | Detail view tolerates moderate staleness | -| Policies | 60s | Policies change infrequently | -| Certificates | 300s | Certificate data is quasi-static | - -**Trace:** Implementation -- `keylime-webtool-backend/src/storage/cache.rs` - -#### 3.8.4 Concurrent Log Fetch Limit (NFR-023) - -At most 5 log fetches to the Verifier API run concurrently, enforced via a Tokio semaphore. This prevents overwhelming the Verifier when multiple agents request IMA/boot logs simultaneously. - ---- - -## 4.
Design Rationale - -| Decision | Rationale | SRS Trace | -|----------|-----------|-----------| -| Rust (Axum) for backend | Memory safety without GC, `#![forbid(unsafe_code)]`, async performance for 10K WebSocket connections | SR-023, NFR-005 | -| React + TypeScript for frontend | Type safety, component reuse, ecosystem maturity for SPA | NFR-004 | -| Zustand for client state | Lightweight, no boilerplate, supports localStorage persistence for settings | FR-008 | -| TanStack React Query for server state | Automatic cache invalidation, stale-while-revalidate, retry logic; `refetchInterval` wired to auto-refresh settings (`false` when disabled) | NFR-001, FR-006 | -| In-memory AlertStore (pre-DB) | Enables full alert lifecycle development before TimescaleDB integration | FR-047 | -| Circuit breaker on Keylime API | Prevents cascading failures when Verifier is overloaded or unreachable | NFR-017 | -| Health probes bypass circuit breaker | Connectivity status must reflect real reachability, not circuit breaker state | FR-057, FR-077 | -| Response envelope pattern | Consistent error handling, request tracing, frontend interceptor simplicity | -- | -| SHA-256 hash chain for audit | Tamper detection without external dependencies; optional RFC 3161 anchoring | FR-061, SR-015 | -| Certificate derivation from regcount | Keylime does not expose cert metadata; regcount correlates with troubled agents | FR-051 | -| Timeline distribution algorithm | Deterministic variation produces natural-looking charts before history DB exists | FR-024 | -| RBAC UI enforcement via non-rendering | Controls absent (not disabled) for unauthorized roles; correct RBAC semantic | SR-003 | -| localStorage for visualization settings | Settings survive page reload without server round-trip; theme applied before first render | FR-008 | -| sessionStorage for JWT | Token isolated per browser tab; cleared on tab close | SR-010 | -| `Arc>>` for hot-swap | Runtime URL/mTLS changes without restart; inner 
`Arc` lets in-flight requests complete | FR-072, FR-073 | -| TOML config persistence with atomic writes | Settings survive restart; temp+rename prevents corruption; async-safe via `spawn_blocking` | FR-075 | -| Vite proxy for backend health checks | Avoids CORS failures in dev mode when probing backend settings endpoint | FR-077 | -| Mock/Production URL presets | Reduces configuration friction; auto-detects active mode from saved URLs | FR-074 | -| `#[serde(default)]` on Keylime models | Tolerates missing/renamed fields across Keylime API versions without breaking deserialization | NFR-002 | -| Timezone auto-detect with manual override | Default to browser timezone for zero-config; IANA dropdown for operators in different timezones than their fleet | FR-078 | -| Explicit date format selection | Eliminates locale-dependent ambiguity (e.g., 04/05 = April 5 vs May 4); ISO 8601 default for cross-region consistency | FR-079 | -| Explicit time format selection (12h/24h) | Accommodates regional conventions (e.g., US 12-hour vs European 24-hour); 24h default avoids AM/PM ambiguity in operational contexts | FR-080 | -| Sidebar alert indicator for service outages | Surfaces integration health at a glance without navigating away; shares cached TanStack Query data with Integrations page to avoid extra requests | FR-081 | -| Agent UUID hyperlinks in failure list | One-click drill-down from failure to agent detail eliminates manual navigation; uses React Router `` for SPA navigation without full reload | FR-082 | -| Inline copy-to-clipboard button in Raw Data source selector | Single compact button adjacent to filter group copies whichever JSON view is active; uses Clipboard API with 2s checkmark feedback for confirmation; avoids per-view copy buttons to reduce clutter | FR-083 | -| Clickable KPI cards with drill-down navigation | Each `KpiCard` accepts an optional `linkTo` prop (route string); when set, the card renders as a React Router ``, shows pointer cursor and hover 
highlight, and navigates to the target view (with optional query params for pre-applied filters) | FR-084 | - ---- - -## 5. Design Overlays - -### 5.1 Security Overlay - -| Concern | Design Element | SRS Trace | -|---------|---------------|-----------| -| Authentication | OIDC flow with IdP, short-lived JWT (15 min) | SR-001, SR-010 | -| MFA | Required for Admin role via IdP policy | SR-002 | -| Authorization (backend) | RBAC middleware checks `Role` in JWT claims | SR-003 | -| Authorization (frontend) | `canWrite()` / `isAdmin()` conditionally render controls | SR-003, SR-021 | -| Transport: Browser-API | TLS 1.3 minimum | SR-008 | -| Transport: API-Keylime | mTLS with rustls, TLS 1.2+ | SR-004, SR-009 | -| Key storage | mTLS private keys from HSM/Vault, never cleartext on disk | SR-005, SR-006 | -| Session revocation | Server-side `SessionStore` with `HashSet` | SR-011 | -| Input validation | CSP headers, input sanitization | SR-012 | -| Data minimization | Never cache/store raw TPM quotes, IMA logs, PoP tokens | SR-013, SR-014 | -| Audit integrity | SHA-256 hash chain with optional RFC 3161 anchoring | SR-015 | -| SSRF protection | Webhook URL allowlist, block RFC 1918 addresses | SR-016 | -| Two-person rule | Drafter != Approver enforcement | SR-017, SR-018 | -| Multi-tenancy | `tenant_id` in JWT claims, cross-tenant isolation | SR-019 | -| Cache integrity | Signed cache entries with TTLs | SR-024 | -| Re-registration alert | TPM key change detection via `regcount` | SR-025 | -| Idle timeout | Configurable session timeout (default: 900s) | SR-028 | -| Rate limiting | Per-user and global request rate limiting | SR-029, NFR-018 | - -### 5.2 Performance Overlay - -| Concern | Design Element | SRS Trace | -|---------|---------------|-----------| -| KPI refresh latency | < 30 seconds via polling or WebSocket push | NFR-001 | -| Backend concurrency | Tokio async runtime, 10K WebSocket connections target | NFR-005 | -| Cache strategy | Redis with tiered TTLs 
(10s-300s) per data type | NFR-019 | -| Fault tolerance | Circuit breaker on Verifier API (threshold: 5, reset: 60s) | NFR-017 | -| Log fetch limit | Max 5 parallel concurrent Verifier log fetches | NFR-023 | -| Reconciliation | Periodic sweep every 5 minutes | NFR-020 | -| Frontend query cache | TanStack Query: 30s stale time, 1 retry | NFR-001 | - ---- - -## 6. SRS Traceability Matrix - -### 6.1 Functional Requirements - -| SRS Req | SDD Section | Design Element | -|---------|-------------|----------------| -| FR-001 | 3.4.3, 3.7.2 | `GET /api/kpis`, KPI fallback computation | -| FR-002 | 3.8.2 | Visualization settings: `refreshInterval` (default 30s) | -| FR-003 | 3.2.2, 3.4.5 | Sidebar component with 10 navigation modules | -| FR-004 | 3.4.3 | `GET /api/agents/search` | -| FR-005 | 3.4.3, 3.7.1 | Time range query parameter `?range=` | -| FR-006 | 3.8.2 | Visualization settings: `autoRefresh` toggle | -| FR-007 | 3.4.3 | Export endpoints (audit, compliance) | -| FR-008 | 3.8.2 | `visualizationStore`: theme in localStorage, `data-theme` attribute | -| FR-009 | 3.4.3 | `GET /api/alerts/notifications` | -| FR-010 | 3.3.9 | Notification model with channel and delivery status | -| FR-011 | 3.4.3 | `PUT /api/alerts/thresholds` | -| FR-012 | 3.3.2, 3.4.3 | Agent summary model, `GET /api/agents` | -| FR-013 | 3.4.2 | Paginated response format | -| FR-014 | 3.4.3 | Agent list query params (state, ip, uuid, policy, min_failures); state filter grouped by Pull Mode (8 states) and Push Mode (4 states incl. 
Timeout) | -| FR-016 | 3.4.3 | `POST /api/agents/bulk` | -| FR-018 | 3.3.2, 3.4.3 | Full agent model, `GET /api/agents/:id` | -| FR-019 | 3.4.3, 3.6.1 | `POST /api/agents/:id/actions/:action` | -| FR-020 | 3.4.3 | Agent detail tabs: timeline, TPM Policy, IMA, boot, certs, raw (with backend/registrar/verifier source selector, copy button per FR-083) | -| FR-021 | 3.4.3 | TPM Policy tab — reads `tpm_policy` from agent detail (`GET /api/agents/:id`) | -| FR-024 | 3.4.3, 3.7.1 | Attestation timeline with distribution algorithm | -| FR-025 | 3.3.4 | Alert types and severity | -| FR-026 | 3.3.6 | Failure correlation types | -| FR-029 | 3.4.3 | `GET /api/attestations/push-mode` | -| FR-030 | 3.3.6, 3.4.3 | Pipeline stages, `GET /api/attestations/pipeline/:id` | -| FR-033 | 3.3.7, 3.4.3 | Policy model with kind discriminator | -| FR-034 | 3.4.3 | Policy CRUD endpoints | -| FR-035 | 3.4.3 | `GET /api/policies/:id/versions`, `/diff`, `/rollback` | -| FR-037 | 3.4.3 | `GET /api/policies/assignment-matrix` | -| FR-038 | 3.4.3 | `POST /api/policies/:id/impact` | -| FR-039 | 3.4.3, 3.6.3 | Policy approval workflow, two-person rule | -| FR-042 | 3.3.8, 3.4.3 | Audit entry model, `GET /api/audit-log` | -| FR-047 | 3.3.4, 3.6.2 | Alert lifecycle state machine | -| FR-050 | 3.3.5, 3.4.3 | Certificate model, `GET /api/certificates` | -| FR-051 | 3.3.5 | Certificate expiry derivation from regcount | -| FR-054 | 3.4.3 | `GET /api/attestations/pull-mode` | -| FR-055 | 3.4.3 | `GET /api/attestations/push-mode` | -| FR-057 | 3.4.3, 3.7.3 | `GET /api/integrations/status` with health probes | -| FR-059 | 3.4.3 | `GET /api/compliance/frameworks`, `/reports/:framework` | -| FR-061 | 3.3.8, 3.7.4 | Hash chain algorithm, audit logger | -| FR-064 | 3.4.3 | `GET /api/performance/verifiers` | -| FR-069 | 3.3.3, 3.4.3 | Agent state enumeration, `GET /api/attestations/state-machine` | -| FR-072 | 3.3.1, 3.4.3, 3.8.1 | `Arc>>`, `GET/PUT /api/settings/keylime` | -| FR-073 | 3.4.3, 3.8.1 | `GET/PUT 
/api/settings/certificates`, mTLS client reconstruction | -| FR-074 | 3.2.2 | Settings page mock/production segmented control | -| FR-075 | 3.8.1 | `SettingsStore`, TOML persistence with atomic writes | -| FR-076 | 3.2.2 | Layout hamburger button, sidebar CSS transition | -| FR-077 | 3.2.2, 3.7.3 | 1s polling via `refetchInterval`, `probe_verifier()`/`probe_registrar()` | -| FR-078 | 3.8.2 | `visualizationStore`: timezone with auto-detect, IANA timezone dropdown in Settings | -| FR-079 | 3.8.2 | `visualizationStore`: date format selection (6 formats), ISO 8601 default, applied to all timestamp rendering | -| FR-080 | 3.8.2 | `visualizationStore`: time format selection (12h/24h), 24h default, applied to all time-of-day rendering | -| FR-081 | 3.2.2 | `Sidebar.tsx`: `useHasServiceDown()` hook queries integration health, renders exclamation badge on Integrations nav item | -| FR-082 | 3.2.2 | Failure categorization list renders agent UUIDs as React Router `` to `/agents/{agent_id}` detail page | -| FR-083 | 3.4.3 | Raw Data tab: compact copy icon button to the right of source selector group; Clipboard API with 2s checkmark feedback | -| FR-084 | 3.2.2 | `KpiCard.tsx`: optional `linkTo` prop wraps card in React Router ``; Dashboard page maps each KPI to its target route (e.g., Failed Agents → `/agents?state=failed,invalid_quote,tenant_failed`) | -| FR-085 | 3.2.2 | `Alerts.tsx`: three Recharts donut `PieChart` components below alert table — By Severity, By Type, By State; clickable segments navigate to `/alerts?{dimension}={value}` with filter pre-applied; color maps match Dashboard alert chart (FR-047) | - -### 6.2 Non-Functional Requirements - -| SRS Req | SDD Section | Design Element | -|---------|-------------|----------------| -| NFR-001 | 3.8.2, 5.2 | 30s refresh interval, TanStack Query stale time | -| NFR-002 | 3.3.3 | Pull-mode (v2) and push-mode (v3) state enums | -| NFR-004 | 3.2.3 | React 18, TypeScript 5.6, Vite 6.0 | -| NFR-005 | 3.2.3 | Axum 0.8, Tokio, 
10K WebSocket target | -| NFR-017 | 3.7.3 | Circuit breaker: threshold 5, reset 60s | -| NFR-019 | 3.8.3 | Redis cache with tiered TTLs | -| NFR-021 | 3.4.4 | WebSocket `/ws/events` | -| NFR-023 | 3.8.4 | Tokio semaphore, max 5 parallel fetches | - -### 6.3 Security Requirements - -| SRS Req | SDD Section | Design Element | -|---------|-------------|----------------| -| SR-001 | 3.5.3 | OIDC authentication flow | -| SR-003 | 3.6.2, 5.1 | Three-tier RBAC: Viewer, Operator, Admin | -| SR-004 | 3.1.2 | mTLS with rustls for Keylime APIs | -| SR-005 | 3.8.1 | `MtlsConfig.key`: HSM/Vault URI | -| SR-010 | 3.5.3 | JWT claims with 15-min TTL | -| SR-011 | 3.5.3, 5.1 | Server-side SessionStore with revocation | -| SR-015 | 3.7.4 | SHA-256 hash chain for audit entries | -| SR-023 | 3.2.3 | `#![forbid(unsafe_code)]` on Rust crate | diff --git a/spec/SRS-Keylime-Monitoring-Tool.md b/spec/SRS-Keylime-Monitoring-Tool.md deleted file mode 100644 index e890bdb..0000000 --- a/spec/SRS-Keylime-Monitoring-Tool.md +++ /dev/null @@ -1,3918 +0,0 @@ -# Software Requirements Specification: Keylime Monitoring Dashboard - -* **Source Document:** `slides/20260226-Keylime-Monitoring-Tool` -* **Initial Date:** 2026-04-13 -* **Methodology:** Spec-Driven Development (SDD) -* **RFC 2119 Keywords:** MUST, MUST NOT, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, OPTIONAL - ---- - -> ## Inline Review Summary -> -> **Reviewer:** Principal QA Architect & SDD Agile Coach | **Review Date:** 2026-04-09 -> -> * **Non-deterministic assertions resolved (6 instances):** Gherkin scenarios using "MUST be disabled or hidden" — an ambiguous assertion preventing deterministic test automation — have been corrected. RBAC-denied actions now assert "MUST NOT be rendered" (feature absent for unauthorized role); state-based restrictions assert "MUST be disabled" (control present but precondition unmet). 
-> * **Multi-action scenarios decomposed (3 instances):** FR-020 (IMA Log), FR-035 (Policy Diff), and SR-003 (Operator RBAC) each contained multiple When/Then blocks in a single scenario. Each has been split into atomic scenarios per BDD single-outcome principle. -> * **RFC 2119 negation normalized:** "no X MUST be sent" rewritten to "the System MUST NOT send X" for consistency with RFC 2119 standard negation form (FR-004). -> * **Cross-references added:** SR-014 (PoP Token Privacy) now explicitly cross-references SR-013 (Data Minimization) to clarify their relationship and prevent redundant test coverage. - ---- - -## 1. Core Objective - -The Keylime Monitoring Dashboard (the "System") is a web-based security operations platform that provides centralized monitoring, management, and compliance capabilities for Keylime remote attestation infrastructure. It consumes Keylime's existing Verifier and Registrar REST APIs (v2 pull-mode and v3 push-mode) via mTLS without requiring any modification to Keylime components. The System targets Security Operations (SecOps) teams, System Administrators, Compliance Officers, and DevSecOps engineers. - -The System transforms Keylime from a CLI-driven security tool into a visual operations platform, reducing mean time to detect (MTTD) attestation failures from hours to seconds, centralizing policy and certificate lifecycle management, and providing tamper-evident audit trails for compliance reporting. - -**Technical Architecture:** React.js + TypeScript SPA frontend, Rust (Axum) async backend, TimescaleDB for time-series storage, Redis for caching, mTLS + rustls for Keylime API communication. - ---- - -## 2. 
Requirements Inventory Table - -### 2.1 Functional Requirements - -| Req ID | Description | RFC Level | Traceability Source | -|--------|-------------|-----------|---------------------| -| FR-001 | Fleet overview KPI dashboard | MUST | Dashboard - Key Performance Indicators | -| FR-002 | KPI data auto-refresh with configurable interval | MUST | Dashboard - Key Performance Indicators | -| FR-003 | Sidebar navigation with core modules | MUST | Dashboard - Navigation Structure | -| FR-004 | Global agent search by UUID, IP, or hostname | MUST | Dashboard - Navigation Structure | -| FR-005 | Time range selector for data filtering | MUST | Dashboard - Navigation Structure | -| FR-006 | Auto-refresh toggle for live updates | MUST | Dashboard - Navigation Structure | -| FR-007 | CSV/JSON data export for compliance reporting | MUST | Dashboard - Navigation Structure | -| FR-008 | Dark/Light mode theme preference | MAY | Dashboard - Navigation Structure | -| FR-009 | In-app notification system with badge count | MUST | Dashboard - Navigation Structure | -| FR-010 | Email/Slack integration for critical alerts | SHOULD | Dashboard - Navigation Structure | -| FR-011 | Configurable alert severity thresholds | MUST | Dashboard - Key Performance Indicators | -| FR-012 | Agent fleet list view with sortable columns | MUST | Agent Fleet - List View | -| FR-013 | Agent list pagination | MUST | Agent Fleet - List View | -| FR-014 | Advanced multi-criteria agent filtering | MUST | Agent Fleet - Filtering and Search | -| FR-015 | IP address filtering with CIDR range support | MUST | Agent Fleet - Filtering and Search | -| FR-016 | Bulk operations on selected agents | MUST | Agent Fleet - Filtering and Search | -| FR-017 | Topology/map view for agent distribution | MAY | Agent Fleet - Map and Topology View | -| FR-018 | Agent detail view with info cards | MUST | Agent Detail - Overview Panel | -| FR-019 | Agent actions (Reactivate, Stop, Delete, Force Attest) | MUST | Agent Detail - 
Overview Panel | -| FR-020 | Agent detail six-tab deep-dive | MUST | Agent Detail - Tabs and Sub-Views | -| FR-021 | PCR values monitoring with diff and expected vs actual | MUST | Agent Detail - PCR Values View | -| FR-022 | PCR change detection with acknowledge/investigate | MUST | Agent Detail - PCR Values View | -| FR-023 | Cross-tab navigation (PCR→IMA, cert→Certificates) | SHOULD | Agent Detail - Tabs and Sub-Views (2/2) | -| FR-024 | Attestation analytics overview (volume, failures, latency) | MUST | Attestation Analytics - Overview Dashboard | -| FR-025 | Failure categorization by type and severity | MUST | Attestation Analytics - Failure Analysis | -| FR-026 | Automatic failure correlation across agents | MUST | Attestation Analytics - Automatic Correlation | -| FR-027 | Root cause suggestion and recommended actions (distinct from FR-026 grouping) | MUST | Attestation Analytics - Actionable Insights | -| FR-028 | One-click policy rollback from incident view | SHOULD | Attestation Analytics - Actionable Insights | -| FR-029 | Push mode (v3 API) attestation analytics | MUST | Attestation Analytics - Push Mode | -| FR-030 | Verification pipeline stage visualization | MUST | Verification Pipeline - Flow Visualization | -| FR-031 | Per-stage verification timing and metrics | MUST | Verification Pipeline - Stage Metrics | -| FR-032 | IMA quote progress tracking with gap detection | MUST | Verification Pipeline - Quote Progress Tracking | -| FR-033 | Unified policy list view (IMA & MB) with kind discriminator | MUST | Policy Management - IMA Policies | -| FR-034 | Policy CRUD with inline editor and syntax highlighting | MUST | Policy Management - Policy Editor | -| FR-035 | Policy versioning with change history and rollback | MUST | Policy Management - Policy Editor | -| FR-036 | Measured boot policy management | MUST | Policy Management - Policy Editor | -| FR-037 | Policy assignment matrix | MUST | Policy Management - Policy Editor | -| FR-038 | Pre-update 
policy impact analysis | MUST | Policy Management - Impact Analysis | -| FR-039 | Two-person rule for policy changes | MUST | Policy Management - Two-Person Rule | -| FR-040 | *(Merged into FR-039)* | — | — | -| FR-041 | Change management integration (ServiceNow, Jira) | MAY | Policy Management - Two-Person Rule | -| FR-042 | Security audit event log with filtering | MUST | Security Audit - Event Log Dashboard | -| FR-043 | Authorization tracking by action and identity type | MUST | Security Audit - Authorization Tracking | -| FR-044 | Identity verification event monitoring | MUST | Security Audit - Identity Verification Events | -| FR-045 | Anomaly detection on authorization patterns | SHOULD | Security Audit - Authorization Tracking | -| FR-046 | Revocation notification channel monitoring | MUST | Revocation - Notification Channels | -| FR-047 | Alert management dashboard with lifecycle workflow | MUST | Revocation - Alert Workflow | -| FR-048 | Alert auto-escalation after SLA timeout | SHOULD | Revocation - Alert Workflow | -| FR-049 | Alert auto-resolve on successful re-attestation | SHOULD | Revocation - Alert Workflow | -| FR-050 | Unified certificate view across all cert types | MUST | Certificate Management - Overview | -| FR-051 | Certificate expiry dashboard with tiered warnings | MUST | Certificate Management - Expiry Dashboard | -| FR-052 | Certificate detail inspection and chain visualization | MUST | Certificate Management - Operations | -| FR-053 | Automated certificate renewal workflow | SHOULD | Certificate Management - Operations | -| FR-054 | Pull mode attestation monitoring | MUST | Attestation Modes - Pull Mode Monitoring | -| FR-055 | Push mode attestation monitoring | MUST | Attestation Modes - Push Mode Monitoring | -| FR-056 | Mixed mode unified views | MUST | Attestation Modes - Comparative View | -| FR-057 | Backend connectivity status dashboard | MUST | Integration Status - Backend Connectivity | -| FR-058 | Durable attestation backend 
monitoring | MUST | Integration Status - Durable Attestation | -| FR-059 | Compliance framework mapping reports | MUST | Compliance - Framework Mapping | -| FR-060 | One-click compliance report export (PDF/CSV) | MUST | Compliance - Framework Mapping | -| FR-061 | Tamper-evident hash-chained audit logging | MUST | Compliance - Tamper-Evident Audit Logging | -| FR-062 | Incident response ticketing integration | SHOULD | Incident Response - Integration | -| FR-063 | SIEM integration (Syslog, Splunk HEC, ECS, Prometheus, OpenTelemetry) | MUST | Incident Response - Integration | -| FR-064 | Verifier cluster performance monitoring | MUST | System Performance - Verifier Metrics | -| FR-065 | Database connection pool monitoring | MUST | System Performance - Verifier Metrics | -| FR-066 | API response time tracking (p50/p95/p99) | MUST | System Performance - Verifier Metrics | -| FR-067 | Live configuration view with drift detection | MUST | System Performance - Configuration Monitoring | -| FR-068 | Capacity planning projections | SHOULD | System Performance - Key Metrics | -| FR-069 | Agent state machine visualization (pull + push) | MUST | Keylime - Agent State Machine / Push Mode | -| FR-070 | API version distribution visualization | MUST | Integration Status - Backend Connectivity | -| FR-071 | AI Assistant with Keylime MCP integration | SHOULD | AI Assistant - Conversational Interface | -| FR-072 | Runtime Keylime connection URL configuration | MUST | Settings - Keylime Connection | -| FR-073 | mTLS certificate configuration UI | MUST | Settings - Certificate Configuration | -| FR-074 | Mock/Production environment toggle | MUST | Settings - Environment Switching | -| FR-075 | TOML config file persistence for backend settings | MUST | Settings - Configuration Persistence | -| FR-076 | Sidebar visibility toggle (hamburger button) | MUST | Dashboard - Navigation Structure | -| FR-077 | Webtool backend health check with polling | MUST | Integration Status - Backend 
Connectivity | -| FR-078 | Timezone selection with auto-detect | SHOULD | Settings - Visualization | -| FR-079 | Date format selection for timestamp rendering | MUST | Settings - Visualization | -| FR-080 | Time format selection (12h/24h) for timestamp rendering | MUST | Settings - Visualization | -| FR-081 | Sidebar alert indicator for integration service outages | MUST | Dashboard - Navigation Structure | -| FR-082 | Agent ID hyperlinks in failure categorization list | MUST | Attestation Analytics - Failure Categorization | -| FR-083 | Copy-to-clipboard button in Raw Data source selector toolbar | MUST | Agent Detail - Raw Data | -| FR-084 | Fleet Overview KPI card drill-down navigation | SHOULD | Dashboard - Key Performance Indicators | -| FR-085 | Alert Center distribution pie charts (by severity, type, state) | MUST | Revocation - Alert Workflow | - -### 2.2 Non-Functional Requirements - -| Req ID | Description | RFC Level | Traceability Source | -|--------|-------------|-----------|---------------------| -| NFR-001 | KPI data refresh within 30 seconds | MUST | Dashboard - Key Performance Indicators | -| NFR-002 | Support Keylime API v2 and v3 simultaneously | MUST | Keylime - Data Model Overview | -| NFR-003 | API-first, non-invasive architecture (zero Keylime modifications) | MUST | Technical Architecture - System Design | -| NFR-004 | SPA frontend with client-side routing (<3s initial load on 10 Mbps) | MUST | Technical Architecture - Data Flow | -| NFR-005 | Async backend supporting 10K concurrent WebSocket connections (<100ms p99) | MUST | Technical Architecture - Why Rust | -| NFR-006 | Event-driven ingestion as primary data path | MUST | Technical Architecture - Event-Driven Ingestion | -| NFR-007 | Polling fallback with adaptive backpressure | MUST | Technical Architecture - Event-Driven Ingestion | -| NFR-008 | Scale to 100K+ agents under event-driven mode | SHOULD | Scalability - Ingestion Model Comparison | -| NFR-009 | Scale to ~1,000 agents under 
polling fallback | MUST | Scalability - Ingestion Model Comparison | -| NFR-010 | Active/Passive HA with <30s RTO and 0 RPO | MUST | High Availability - Architecture | -| NFR-011 | Active/Active HA for 5K+ agents | SHOULD | High Availability - Architecture | -| NFR-012 | Air-gapped deployment with no external dependencies | MUST | Deployment - Offline & Air-Gapped | -| NFR-013 | Self-contained packaging (no CDN, single binary) | MUST | Deployment - Offline & Air-Gapped | -| NFR-014 | WCAG 2.1 Level AA accessibility compliance | MUST | Deployment - Offline & Air-Gapped | -| NFR-015 | Container (OCI), Kubernetes (Helm), RPM, systemd deployment options | MUST | Technical Architecture - Deployment | -| NFR-016 | Graceful degradation when components are unavailable | MUST | High Availability - Architecture | -| NFR-017 | Circuit breaker on Verifier API latency | MUST | Technical Architecture - Event-Driven Ingestion | -| NFR-018 | Per-user and global request rate limiting | MUST | Technical Architecture - IMA Log & Data Decoupling | -| NFR-019 | Cache TTLs: agent list 10s, detail 30s, policies 60s, certs 300s | MUST | Scalability - Cache Invalidation | -| NFR-020 | Periodic reconciliation sweep every 5 minutes | MUST | Technical Architecture - Event-Driven Ingestion | -| NFR-021 | WebSocket real-time updates for UI | MUST | Technical Architecture - Data Flow | -| NFR-022 | Signed update packages with SBOM for offline updates | MUST | Deployment - Offline & Air-Gapped | -| NFR-023 | Maximum 5 parallel concurrent log fetches to Verifier | MUST | Technical Architecture - IMA Log & Data Decoupling | -| NFR-024 | AI Assistant query performance and rate limiting | SHOULD | AI Assistant - Conversational Interface | - -### 2.3 Security Requirements - -| Req ID | Description | RFC Level | Traceability Source | -|--------|-------------|-----------|---------------------| -| SR-001 | OIDC/SAML identity provider authentication | MUST | Dashboard Authentication - User Identity | -| 
SR-002 | MFA mandatory for Admin role | MUST | Dashboard Authentication - User Identity | -| SR-003 | Three-tier RBAC (Viewer, Operator, Admin) | MUST | Dashboard RBAC - Role Definitions | -| SR-004 | mTLS for all Keylime API communication | MUST | Threat Model - Trust Boundaries | -| SR-005 | mTLS private key MUST NEVER be stored on disk in cleartext | MUST | Secret Management - Credential Lifecycle | -| SR-006 | HSM or Vault-backed private key storage | MUST | Secret Management - Credential Lifecycle | -| SR-007 | TLS encryption on all network connections (no cleartext paths) | MUST | Transport Security - Encrypted Data Paths | -| SR-008 | Browser→API: TLS 1.3 minimum | MUST | Transport Security - Encrypted Data Paths | -| SR-009 | API→Keylime: TLS 1.2+ minimum | MUST | Transport Security - Encrypted Data Paths | -| SR-010 | Short-lived JWT session tokens (15 min) with refresh rotation | MUST | Dashboard Authentication - User Identity | -| SR-011 | Server-side session revocation | MUST | Dashboard Authentication - User Identity | -| SR-012 | CSP headers and input sanitization (XSS/injection prevention) | MUST | Threat Model - Threat Catalog | -| SR-013 | Never cache or store raw TPM quotes, IMA logs, boot logs | MUST | Threat Model - Data Classification | -| SR-014 | Never display or cache raw PoP tokens | MUST | Attestation Modes - Comparative View | -| SR-015 | Tamper-evident hash-chained audit log with RFC 3161 anchoring | MUST | Compliance - Tamper-Evident Audit Logging | -| SR-016 | SSRF protection on webhook URLs (allowlist, block RFC 1918) | MUST | Revocation - Alert Workflow | -| SR-017 | Two-person approval for policy changes (N-of-M quorum) | MUST | Policy Management - Two-Person Rule | -| SR-018 | Approver cannot be the same as drafter | MUST | Policy Management - Two-Person Rule | -| SR-019 | Multi-tenancy isolation (cross-tenant data never mixed) | MUST | Dashboard RBAC - Multi-Tenancy | -| SR-020 | Data classification enforcement (SECRET, 
CONFIDENTIAL, INTERNAL) | MUST | Threat Model - Data Classification | -| SR-021 | Write operations blocked at proxy for non-Admin roles | MUST | Dashboard RBAC - Role Definitions | -| SR-022 | mTLS sidecar option (Envoy/Ghostunnel) | MAY | Transport Security - mTLS Sidecar Option | -| SR-023 | `#![forbid(unsafe_code)]` on dashboard Rust crate | MUST | Technical Architecture - Why Rust | -| SR-024 | Signed cache entries with TTLs to mitigate cache poisoning | MUST | Threat Model - Threat Catalog | -| SR-025 | Identity alert on TPM key change during re-registration | MUST | Security Audit - Identity Verification Events | -| SR-026 | Audit log minimum retention of 1 year for compliance | MUST | Compliance - Tamper-Evident Audit Logging | -| SR-027 | Emergency bypass with break-glass audit trail | MUST | Policy Management - Two-Person Rule | -| SR-028 | Configurable idle session timeout | MUST | Dashboard Authentication - User Identity | -| SR-029 | Rate limiting on dashboard session creation endpoint | MUST | Attestation Modes - Comparative View | - ---- - -## 3. Detailed Functional Requirements - -### FR-001: Fleet Overview KPI Dashboard - -**Description:** The System MUST display a fleet overview dashboard presenting computed KPIs derived from the Keylime Verifier and Registrar APIs. The dashboard MUST show: Total Active Agents, Failed Agents (states 7, 9, 10), Attestation Success Rate, Average Attestation Latency, Certificate Expiry Warnings, Active IMA Policies, Revocation Events (24h), Consecutive Failures per agent, and Registration Count. 
- -**Trace:** Dashboard - Key Performance Indicators; Dashboard - Main Screen Layout - -```gherkin -Feature: Fleet Overview KPI Dashboard - - Scenario: Display fleet health KPIs - Given the dashboard backend is connected to the Verifier API via mTLS - And the Verifier reports 247 agents in GET_QUOTE state - And 3 agents are in FAILED state - When the user navigates to the Fleet Overview Dashboard - Then the dashboard MUST display "247" as Active Agents - And the dashboard MUST display "3" as Failed Agents - And the Attestation Success Rate MUST be computed from attestation history - And Certificate Expiry Warnings MUST reflect certificates expiring within 30 days - - Scenario: Verifier API unreachable - Given the dashboard backend cannot connect to the Verifier API - When the user navigates to the Fleet Overview Dashboard - Then the dashboard MUST display cached KPI data with a staleness indicator - And a banner MUST warn "Verifier API unreachable — data may be stale" - - Scenario: Failed agent threshold alert - Given the alert threshold for Failed Agents is configured to "any count > 0" - When 1 or more agents enter state 7 (FAILED), 9 (INVALID_QUOTE), or 10 (TENANT_FAILED) - Then the Failed Agents KPI MUST display in critical color - And an alert MUST be raised in the notification system -``` - -### FR-002: KPI Data Auto-Refresh - -**Description:** The System MUST refresh KPI data at a configurable interval. The default refresh interval MUST be 30 seconds. The System MUST support both HTTP polling and WebSocket push as refresh mechanisms. 
- -**Trace:** Dashboard - Key Performance Indicators - -```gherkin -Feature: KPI Data Auto-Refresh - - Scenario: Default KPI refresh interval - Given the auto-refresh toggle is enabled - And the refresh interval is set to the default of 30 seconds - When 30 seconds elapse - Then the System MUST fetch updated KPI data from the Verifier API - And the dashboard MUST re-render all KPI values - - Scenario: KPI refresh via WebSocket - Given the backend supports WebSocket push - And a WebSocket connection is established - When the backend receives an agent state change event - Then the updated KPI data MUST be pushed to the browser - And the dashboard MUST re-render affected KPI values - - Scenario: WebSocket connection lost - Given a WebSocket connection was established - When the connection drops unexpectedly - Then the System MUST fall back to HTTP polling at the configured interval - And a connection status indicator MUST show "reconnecting" -``` - -### FR-003: Sidebar Navigation - -**Description:** The System MUST provide a persistent sidebar navigation with the following modules: Dashboard (Fleet overview), Agents (Fleet management), Attestations (Analytics), Policies (IMA & MB), Certificates (TLS/TPM certs), Alerts (Alert lifecycle), Performance (System metrics), Audit Log (Security events), Integrations (Backend status), Settings (Configuration), and AI Assistant (Keylime MCP conversational interface). The sidebar MUST be collapsible via a hamburger toggle button in the top bar (FR-076), with a smooth CSS transition animation. 
- -**Trace:** Dashboard - Navigation Structure - -```gherkin -Feature: Sidebar Navigation - - Scenario: Navigate between core modules - Given the user is authenticated and viewing the dashboard - When the user clicks "Agents" in the sidebar - Then the Agent Fleet Management view MUST be displayed - And the sidebar MUST highlight the "Agents" entry as active - - Scenario: All core modules accessible - Given the user is authenticated - Then the sidebar MUST display navigation entries for all 11 core modules - And each entry MUST route to its corresponding view - - Scenario: Unauthenticated user cannot access sidebar - Given the user is not authenticated - When the user attempts to access any dashboard view - Then the System MUST redirect the user to the login page - And no sidebar navigation MUST be rendered - - Scenario: Toggle sidebar visibility - Given the user is authenticated and the sidebar is visible - When the user clicks the hamburger button in the top bar - Then the sidebar MUST slide out of view with a smooth transition - And the main content area MUST expand to fill the available width - - Scenario: Restore sidebar visibility - Given the sidebar is hidden - When the user clicks the hamburger button in the top bar - Then the sidebar MUST slide back into view with a smooth transition -``` - -### FR-004: Global Agent Search - -**Description:** The System MUST provide a global search bar that allows searching agents by UUID (exact or partial match), IP address (including CIDR range support), or hostname. 
- -**Trace:** Dashboard - Navigation Structure; Agent Fleet - Filtering and Search - -```gherkin -Feature: Global Agent Search - - Scenario: Search agent by partial UUID - Given the agent fleet contains an agent with UUID "a1b2c3d4-e5f6-7890-abcd-ef1234567890" - When the user types "a1b2c3" in the global search bar - Then the search results MUST include agent "a1b2c3d4-e5f6-7890-abcd-ef1234567890" - - Scenario: Search agent by CIDR range - Given the fleet contains agents at IPs 192.168.1.10, 192.168.1.11, 192.168.2.10 - When the user searches for "192.168.1.0/24" - Then the search results MUST include agents at 192.168.1.10 and 192.168.1.11 - And the results MUST NOT include the agent at 192.168.2.10 - - Scenario: Search returns no results - Given the agent fleet contains 250 agents - When the user searches for "nonexistent-uuid-xyz" - Then the search results MUST display an empty state with message "No agents found" - - Scenario: Invalid CIDR notation - Given the user enters "192.168.1.0/99" in the search bar - When the search is submitted - Then the System MUST display a validation error indicating invalid CIDR notation - And the System MUST NOT send a search request to the backend - -> **Reviewer Note:** Rephrased from "no search request MUST be sent" to standard RFC 2119 negation form "the System MUST NOT send." The original syntax is ambiguous — "no" could modify "search request" or "MUST be sent." -``` - -### FR-005: Time Range Selector - -**Description:** The System MUST provide a time range selector allowing users to filter displayed data by predefined intervals: 1h, 6h, 24h, 7d, 30d, and a custom date range picker. 
- -**Trace:** Dashboard - Navigation Structure - -```gherkin -Feature: Time Range Selector - - Scenario: Filter attestation data by time range - Given the user is viewing the Attestation Analytics view - When the user selects the "24h" time range - Then only attestation data from the last 24 hours MUST be displayed - And charts and tables MUST update to reflect the selected range - - Scenario: Custom date range - Given the user selects "Custom" in the time range selector - When the user specifies a start date of "2026-02-20" and end date of "2026-02-25" - Then only data within that date range MUST be displayed - - Scenario: Invalid date range rejected - Given the user selects "Custom" in the time range selector - When the user specifies a start date later than the end date - Then the System MUST display a validation error "Start date must be before end date" - And the previous time range MUST remain active -``` - -### FR-006: Auto-Refresh Toggle - -**Description:** The System MUST provide a toggle control that enables or disables live automatic updates of dashboard data. When enabled, the auto-refresh setting and configurable interval MUST be wired to the data fetching layer (TanStack React Query `refetchInterval`) so that all queries automatically poll the backend. When disabled, `refetchInterval` MUST be set to `false`, stopping all automatic polling, and the refresh interval configuration control MUST be disabled (non-interactive) since it has no effect while auto-refresh is off. The auto-refresh interval MUST be configurable via the Settings page visualization settings. 
- -**Trace:** Dashboard - Navigation Structure - -```gherkin -Feature: Auto-Refresh Toggle - - Scenario: Disable auto-refresh - Given auto-refresh is currently enabled - When the user toggles auto-refresh off - Then dashboard data MUST stop refreshing automatically - And the displayed data MUST remain static until manually refreshed or re-enabled - And the refresh interval configuration control MUST be disabled - - Scenario: Enable auto-refresh - Given auto-refresh is currently disabled - When the user toggles auto-refresh on - Then dashboard data MUST begin refreshing at the configured interval - And the refresh interval configuration control MUST be enabled - - Scenario: Auto-refresh interval drives all queries - Given auto-refresh is enabled with a 10-second interval - When the user navigates to any data-driven page - Then all data queries on that page MUST poll the backend every 10 seconds - And stale data MUST be automatically replaced with fresh data -``` - -### FR-007: Data Export - -**Description:** The System MUST provide CSV and JSON export functionality for compliance reporting. Export MUST be available for agent fleet data, attestation analytics, audit logs, and certificate data. - -**Trace:** Dashboard - Navigation Structure - -```gherkin -Feature: Data Export - - Scenario: Export agent fleet data as CSV - Given the user has the Operator or Admin role - And the agent fleet list is displayed with active filters - When the user clicks "Export" and selects "CSV" - Then the System MUST generate a CSV file containing the filtered agent data - And the file MUST be downloaded to the user's browser - - Scenario: Viewer cannot export - Given the user has the Viewer role - When the user views the agent fleet list - Then the Export action MUST NOT be rendered for the Viewer role - -> **Reviewer Note:** Changed "MUST be disabled or hidden" to "MUST NOT be rendered" — RBAC-denied features must be absent from the UI, not merely grayed out. 
A disabled control implies the feature exists but is temporarily unavailable; an absent control implies the user lacks the capability, which is the correct RBAC semantic. Also changed the When step from "looks for the Export button" (user intent, not a testable action) to "views the agent fleet list" (observable state). - - Scenario: Export with empty filter result - Given the user has applied filters that match zero agents - When the user views the Export action - Then the Export action MUST be disabled - And a message MUST indicate "No data to export" - -> **Reviewer Note:** Changed "MUST be disabled or display" to two deterministic assertions: the action is disabled AND a message explains why. Each Then step is independently verifiable. -``` - -### FR-008: Dark/Light Mode - -**Description:** The System MUST provide a theme toggle button in the top bar allowing users to switch between dark and light visual modes. The theme MUST apply consistently across all pages, including top bar inputs, agent lists, alert tables, policy views, and all form controls. The preference MUST persist across sessions via localStorage. 
- -**Trace:** Dashboard - Navigation Structure - -```gherkin -Feature: Dark/Light Mode - - Scenario: Switch to dark mode - Given the user is viewing the dashboard in light mode - When the user clicks the theme toggle button in the top bar - Then the entire dashboard UI MUST render with a dark color scheme - And the preference MUST persist across sessions via localStorage - - Scenario: Switch to light mode - Given the user is viewing the dashboard in dark mode - When the user clicks the theme toggle button in the top bar - Then the entire dashboard UI MUST render with the default light color scheme - - Scenario: Dark mode text visibility - Given the user has selected dark mode - When the user navigates to any page - Then all text, form inputs, table headers, and chart labels MUST be visible against the dark background - And top bar search input text MUST be legible -``` - -### FR-009: In-App Notification System - -**Description:** The System MUST provide an in-app notification bell displaying an unread notification badge count. Notifications MUST include attestation failures, certificate expiry warnings, policy updates, agent registration events, and revocation events. Severity thresholds for generating notifications MUST be configurable. 
- -**Trace:** Dashboard - Navigation Structure - -```gherkin -Feature: In-App Notification System - - Scenario: Display notification badge count - Given there are 3 unread critical notifications and 2 unread warnings - When the user views the dashboard header - Then the notification bell MUST display a badge count of 5 - - Scenario: Mark notification as read - Given the user has unread notifications - When the user opens the notification panel and clicks a notification - Then the notification MUST be marked as read - And the badge count MUST decrement by 1 - - Scenario: No notifications available - Given there are zero unread notifications - When the user opens the notification panel - Then the panel MUST display an empty state with message "No new notifications" - And the notification bell MUST NOT display a badge count -``` - -### FR-010: External Alert Integration - -**Description:** The System SHOULD support sending critical alert notifications to external channels including Email and Slack. Alert routing MUST be configurable by severity threshold. 
- -**Trace:** Dashboard - Navigation Structure - -```gherkin -Feature: External Alert Integration - - Scenario: Send critical alert to Slack - Given Slack webhook integration is configured - And the alert severity threshold for Slack is set to "CRITICAL" - When a CRITICAL attestation failure is detected - Then the System SHOULD send a notification to the configured Slack channel - And the notification MUST include the agent ID, failure type, and timestamp - - Scenario: External integration endpoint unreachable - Given Slack webhook integration is configured - When the System attempts to send a notification and the webhook endpoint is unreachable - Then the System MUST retry delivery according to the configured retry policy - And the System MUST log the delivery failure in the audit log -``` - -### FR-011: Configurable Alert Thresholds - -**Description:** The System MUST allow administrators to configure alert thresholds per deployment. Configurable thresholds MUST include: attestation success rate floor (default: 99%), average latency ceiling (default: 2x quote_interval), certificate expiry warning window (default: 30 days), consecutive failure count (default: 3), and revocation event count. 
- -**Trace:** Dashboard - Key Performance Indicators - -```gherkin -Feature: Configurable Alert Thresholds - - Scenario: Configure attestation rate threshold - Given the user has the Admin role - When the admin sets the attestation success rate threshold to 98% - Then the System MUST raise an alert only when the rate falls below 98% - And the previous threshold of 99% MUST no longer trigger alerts - - Scenario: Default thresholds applied - Given no custom thresholds are configured - When the attestation success rate falls below 99% - Then the System MUST raise an alert using the default threshold - - Scenario: Non-Admin cannot configure thresholds - Given the user has the Operator role - When the user attempts to change the attestation success rate threshold - Then the System MUST deny the action with an insufficient permissions error - And the threshold MUST remain unchanged -``` - -### FR-012: Agent Fleet List View - -**Description:** The System MUST display agents in a sortable, paginated table with columns: Agent ID, IP Address, Operational State, Last Attestation time, Assigned Policy, Failure Count, and Actions. Rows for agents in FAILED or INVALID_QUOTE states MUST be visually highlighted. Rows for agents in RETRY state MUST be highlighted with a warning indicator. For push-mode (v3 API) agents, rows in FAIL (101) or TIMEOUT (103) states MUST be highlighted with a critical indicator. The System MUST recognize all Keylime operational states including `Start` (1) and `Saved` (2) for pull mode, and `Pass` (100), `Fail` (101), `Pending` (102), and `Timeout` (103) for push mode. When the agent list fails to load, the System MUST display a descriptive error message instead of an empty table. 
- -**Trace:** Agent Fleet - List View - -```gherkin -Feature: Agent Fleet List View - - Scenario: Display agent list with state-based highlighting - Given the Verifier reports agents in various states - When the user navigates to the Agent Fleet view - Then agents in GET_QUOTE state MUST display with a success indicator - And agents in FAILED state MUST display with a critical/danger indicator - And agents in RETRY state MUST display with a warning indicator - And agents in START or SAVED state MUST display with an informational indicator - And push-mode agents in PASS state MUST display with a success indicator - And push-mode agents in FAIL or TIMEOUT state MUST display with a critical indicator - And push-mode agents in PENDING state MUST display with an informational indicator - And each row MUST show Agent ID, IP, State, Last Attest, Policy, Failures, and Actions - - Scenario: Verifier API unavailable for agent list - Given the dashboard backend cannot connect to the Verifier API - When the user navigates to the Agent Fleet view - Then the System MUST display cached agent data with a staleness indicator - And a banner MUST warn that the agent list may not reflect current state - - Scenario: Agent list API returns error - Given the backend API returns an error when fetching the agent list - When the user navigates to the Agent Fleet view - Then the System MUST display a descriptive error message explaining the failure - And the System MUST NOT display an empty table without explanation - - Scenario: Filter agents by policy via Policies page link - Given the user is viewing the Policies page - When the user clicks the agent count link for policy "production-v2" - Then the System MUST navigate to the Agents page with a policy filter applied - And only agents assigned to "production-v2" MUST be displayed -``` - -### FR-013: Agent List Pagination - -**Description:** The System MUST paginate the agent fleet list. 
The System MUST display the current page range, total agent count, and total page count. Page size SHOULD be configurable. - -**Trace:** Agent Fleet - List View - -```gherkin -Feature: Agent List Pagination - - Scenario: Navigate to next page - Given the agent fleet contains 250 agents with a page size of 25 - And the user is viewing page 1 - When the user clicks "Next Page" - Then agents 26-50 MUST be displayed - And the footer MUST show "Showing 26-50 of 250 agents | Page 2 of 10" - - Scenario: Navigate beyond last page - Given the agent fleet contains 250 agents with a page size of 25 - And the user is viewing page 10 (the last page) - When the user clicks "Next Page" - Then the "Next Page" button MUST be disabled - And page 10 MUST remain displayed -``` - -### FR-014: Advanced Multi-Criteria Agent Filtering - -**Description:** The System MUST allow filtering the agent fleet by multiple criteria simultaneously: Agent UUID (exact or partial), IP Address (CIDR range), Operational State (multi-select), Verifier Assignment, IMA Policy, MB Policy, Last Attestation time range, Failure Count (min/max), Registration Date range, and API Version. The Operational State filter MUST present all pull-mode states (Get Quote, Provide V, Registered, Failed, Retry, Terminated, Invalid Quote, Tenant Failed) and all push-mode states (Pass, Fail, Pending, Timeout) grouped by mode. 
- -**Trace:** Agent Fleet - Filtering and Search - -```gherkin -Feature: Advanced Agent Filtering - - Scenario: Filter by state and policy simultaneously - Given the fleet contains agents in various states with different policies - When the user selects state filter "FAILED" and policy filter "production-v2" - Then only agents in FAILED state assigned to "production-v2" MUST be displayed - And agents in other states or with other policies MUST be hidden - - Scenario: Filter returns no matching agents - Given the fleet contains 250 agents - When the user selects state filter "TERMINATED" and no agents are in that state - Then the agent list MUST display an empty state with message "No agents match the selected filters" - And a "Clear Filters" button MUST be available - - Scenario: Filter by push-mode Timeout state - Given the fleet contains push-mode agents in PASS, FAIL, PENDING, and TIMEOUT states - When the user selects state filter "TIMEOUT" from the Push Mode group - Then only agents in TIMEOUT (103) state MUST be displayed - And pull-mode agents MUST be hidden - - Scenario: Filter by failure count threshold - Given the fleet contains agents with failure counts 0, 1, 3, 5 - When the user sets the failure count minimum to 3 - Then only agents with 3 or more failures MUST be displayed -``` - -### FR-015: IP Address CIDR Filtering - -**Description:** The System MUST support filtering agents by IP address using CIDR notation (e.g., 192.168.1.0/24) in addition to exact IP match. 
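As an illustration of the CIDR requirement, the sketch below uses Python's standard `ipaddress` module. The helper names `parse_ip_filter` and `filter_by_ip` are hypothetical; the dashboard's actual implementation language and API are not specified here.

```python
import ipaddress


def parse_ip_filter(text):
    """Parse an exact IP or a CIDR range; raises ValueError on invalid input."""
    # strict=False also accepts a host address with a prefix, e.g. "10.0.1.5/24".
    # A bare address such as "10.0.1.5" becomes a /32 network (exact match).
    return ipaddress.ip_network(text.strip(), strict=False)


def filter_by_ip(agent_ips, filter_text):
    """Return the agent IPs that fall inside the given address or CIDR range."""
    network = parse_ip_filter(filter_text)
    return [ip for ip in agent_ips if ipaddress.ip_address(ip) in network]


fleet_ips = ["10.0.1.5", "10.0.1.20", "10.0.2.15"]
print(filter_by_ip(fleet_ips, "10.0.1.0/24"))

try:
    filter_by_ip(fleet_ips, "10.0.1.0/33")
except ValueError as err:
    # A UI would surface this as the validation error and leave the list unfiltered.
    print(f"invalid CIDR: {err}")
```

Treating a bare address as a `/32` network lets one code path serve both exact and range filters.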
- -**Trace:** Agent Fleet - Filtering and Search - -```gherkin -Feature: CIDR IP Filtering - - Scenario: Filter agents by subnet - Given the fleet contains agents at IPs 10.0.1.5, 10.0.1.20, 10.0.2.15 - When the user enters IP filter "10.0.1.0/24" - Then agents at 10.0.1.5 and 10.0.1.20 MUST be displayed - And the agent at 10.0.2.15 MUST NOT be displayed - - Scenario: Invalid CIDR in filter - Given the user enters "10.0.1.0/33" in the IP filter - When the filter is applied - Then the System MUST display a validation error indicating invalid CIDR notation - And the agent list MUST remain unfiltered -``` - -### FR-016: Bulk Operations - -**Description:** The System MUST support selecting multiple agents via checkboxes and performing bulk operations: Reactivate (resume monitoring), Stop (pause monitoring), Delete (remove from verifier), Reassign Policy (change IMA/MB policy), and Export (download agent data as CSV). - -**Trace:** Agent Fleet - Filtering and Search - -```gherkin -Feature: Bulk Agent Operations - - Scenario: Bulk reactivate stopped agents - Given the user has selected 5 agents that are in a stopped state - And the user has the Operator or Admin role - When the user clicks "Reactivate" - Then the System MUST send reactivation requests for all 5 agents to the Verifier API - And each agent's state MUST update upon successful reactivation - - Scenario: Bulk operations denied for Viewer role - Given the user has the Viewer role - When the user selects agents in the fleet list - Then the Reactivate, Stop, Delete, and Reassign Policy buttons MUST be disabled - - Scenario: Partial failure in bulk reactivation - Given the user has selected 5 agents for reactivation - When the System sends reactivation requests and 2 agents fail to reactivate - Then the System MUST display a summary: "3 succeeded, 2 failed" - And each failed agent MUST show the failure reason - And the 3 successful agents MUST update their state -``` - -### FR-017: Topology/Map View - 
-**Description:** The System MAY provide a topology/map view that visualizes agent distribution across infrastructure. When the view is provided, agents MUST be groupable by datacenter, rack, or subnet. The view MUST show verifier-to-agent assignment mapping and registrar connection status. Interactive features MUST include click-to-drill-down on datacenters, hover for agent summary, and color-coded health indicators. - -**Trace:** Agent Fleet - Map and Topology View - -```gherkin -Feature: Topology View - - Scenario: Drill down into datacenter - Given the topology view shows 3 datacenters: DC-East (120 agents), DC-West (85 agents), DC-EU (45 agents) - When the user clicks on "DC-East" - Then the view MUST expand to show individual agents within DC-East - And failed agents MUST be indicated with a danger color - - Scenario: Hover for agent summary - Given the topology view is displayed - When the user hovers over an agent node - Then a tooltip MUST display the agent UUID, IP, state, and last attestation time -``` - -### FR-018: Agent Detail View - -**Description:** The System MUST display an agent detail page containing: Agent Information card (UUID, IP, Verifier assignment, Registration date), Attestation Statistics card (total attestations, last successful, consecutive failures, boot time), Cryptographic Details card (hash algorithm, encryption, signing, IMA PCRs), and Policy Assignment card (IMA policy, MB policy, TPM policy with a "Change Policy" action). 
- -**Trace:** Agent Detail - Overview Panel - -```gherkin -Feature: Agent Detail View - - Scenario: Display agent information - Given agent "a1b2c3d4-e5f6-7890" exists in the Verifier - When the user navigates to the agent detail page - Then the Agent Information card MUST display UUID, IP address, assigned Verifier, and registration date - And the Attestation Statistics card MUST display total attestations, last successful time, and consecutive failures - And the Cryptographic Details card MUST display hash algorithm, encryption, and signing algorithms - And the Policy Assignment card MUST display assigned IMA and MB policies - - Scenario: Agent not found - Given no agent with UUID "nonexistent-uuid" exists in the Verifier - When the user navigates to the agent detail page for "nonexistent-uuid" - Then the System MUST display a "404 — Agent Not Found" error page - And a link to return to the Agent Fleet view MUST be available -``` - -### FR-019: Agent Actions - -**Description:** The System MUST provide the following actions on an individual agent detail page: Reactivate (resume monitoring), Stop (pause monitoring), Delete (remove from verifier), Force Attest (trigger immediate attestation), View Quotes (display raw TPM quote data), and Export (download agent data). Actions MUST be restricted by RBAC role. 
- -**Trace:** Agent Detail - Overview Panel - -```gherkin -Feature: Agent Detail Actions - - Scenario: Force attestation on an agent - Given the user has the Operator or Admin role - And the user is viewing agent detail for "a1b2c3d4" - When the user clicks "Force Attest" - Then the System MUST trigger an immediate attestation cycle for agent "a1b2c3d4" - And the attestation result MUST update on the Timeline tab - - Scenario: Delete agent requires Admin role - Given the user has the Operator role - When the user views the agent detail page - Then the "Delete" action MUST NOT be rendered for the Operator role - -> **Reviewer Note:** Changed "MUST be disabled or hidden" to "MUST NOT be rendered" — consistent with RBAC semantic: the Operator role lacks the delete capability entirely, so the control should be absent. - - Scenario: Force attestation fails due to agent unreachable - Given the user has the Operator role - And agent "a1b2c3d4" is unreachable by the Verifier - When the user clicks "Force Attest" - Then the System MUST display an error "Agent unreachable — attestation could not be initiated" - And the agent state MUST remain unchanged -``` - -### FR-020: Agent Detail Six-Tab Deep-Dive - -**Description:** The System MUST provide six specialized tabs on the agent detail page: (1) Timeline — attestation success/failure history with zoomable time range; (2) PCR Values — current PCR bank values with change history and diffs; (3) IMA Log — measurement list entries with policy match/mismatch indicators and search by file path or hash; (4) Boot Log — UEFI event log entries with measured boot validation; (5) Certificates — EK, AK, IAK, IDevID, mTLS certificate details and expiry countdown; (6) Raw Data — JSON view with source selector offering three views: Backend Data (merged agent summary computed by the dashboard), Registrar Data (raw JSON from the Keylime Registrar API), and Verifier Data (raw JSON from the Keylime Verifier API), with a copy-to-clipboard button 
inline with the source selector (FR-083). - -**IMA Log Entry Schema:** Each IMA log entry returned by the backend MUST include: `pcr` (PCR index, typically 10), `template_hash` (SHA-256 hash of the template data), `template_name` (IMA template type, e.g., `ima-ng`), `filedata_hash` (hash of the measured file content), and `filename` (absolute path of the measured file). - -**Boot Log Entry Schema:** Each boot log entry returned by the backend MUST include: `pcr` (PCR index associated with the event), `event_type` (UEFI event type identifier, e.g., `EV_EFI_VARIABLE_DRIVER_CONFIG`), `digest` (hash digest of the event data), and `event_data` (human-readable description of the event). - -**Trace:** Agent Detail - Tabs and Sub-Views - -```gherkin -Feature: Agent Detail Tabs - - Scenario: View IMA log entries - Given the user is viewing agent "a1b2c3d4" detail page - When the user selects the "IMA Log" tab - Then the IMA measurement list entries MUST be displayed - And each entry MUST indicate policy match or mismatch - -> **Reviewer Note:** Split from original "View IMA log with search" scenario which contained two When/Then blocks. Each BDD scenario MUST have exactly one When/Then pair to maintain test isolation and deterministic pass/fail. 
- - Scenario: Search IMA log by file path - Given the user is viewing the IMA Log tab for agent "a1b2c3d4" - When the user searches for "/usr/bin/bash" - Then only IMA entries for that file path MUST be shown - - Scenario: View attestation timeline - Given the user is viewing agent "a1b2c3d4" detail page - When the user selects the "Timeline" tab - Then the attestation success/failure history MUST be displayed chronologically - And each entry MUST show timestamp, result (pass/fail), and failure reason if applicable - And the timeline MUST support zoomable time range navigation - - Scenario: View PCR values with change history - Given the user is viewing agent "a1b2c3d4" detail page - When the user selects the "PCR Values" tab - Then the current PCR bank values MUST be displayed with expected vs. actual comparison - And changed PCR values MUST be highlighted with a "Changed" indicator - - Scenario: View boot log with measured boot validation - Given the user is viewing agent "a1b2c3d4" detail page - When the user selects the "Boot Log" tab - Then the UEFI event log entries MUST be displayed - And each entry MUST show measured boot validation status (compliant/non-compliant) - - Scenario: View agent certificates with expiry countdown - Given the user is viewing agent "a1b2c3d4" detail page - When the user selects the "Certificates" tab - Then EK, AK, IAK, IDevID, and mTLS certificate details MUST be displayed - And each certificate MUST show an expiry countdown (days remaining) - - Scenario: View raw data with source selector - Given the user is viewing agent "a1b2c3d4" detail page - When the user selects the "Raw Data" tab - Then a source selector MUST be displayed with options "Backend Data", "Registrar Data", and "Verifier Data" - And the default selection MUST display all three sources combined - - Scenario: View raw backend data - Given the user is viewing the "Raw Data" tab for agent "a1b2c3d4" - When the user selects the "Backend Data" source - Then the merged 
agent summary JSON computed by the dashboard backend MUST be displayed - - Scenario: View raw registrar data - Given the user is viewing the "Raw Data" tab for agent "a1b2c3d4" - When the user selects the "Registrar Data" source - Then the full agent JSON record from the Keylime Registrar API MUST be displayed - - Scenario: View raw verifier data - Given the user is viewing the "Raw Data" tab for agent "a1b2c3d4" - When the user selects the "Verifier Data" source - Then the full agent JSON record from the Keylime Verifier API MUST be displayed - - Scenario: Tab data unavailable due to API error - Given the user is viewing agent "a1b2c3d4" detail page - And the Verifier API returns an error for IMA log data - When the user selects the "IMA Log" tab - Then the tab MUST display an error message "Unable to load IMA log data" - And a "Retry" button MUST be available -``` - -### FR-021: PCR Values Monitoring - -**Description:** The System MUST display current PCR bank values (SHA-256) for each agent with an expected vs. actual comparison. Each PCR MUST show its description, status (Match/Changed), and last changed date. The System MUST support PCR banks across SHA-1, SHA-256, SHA-384, and SHA-512. 
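The expected vs. actual comparison could be computed per PCR index roughly as follows. This is a sketch only: `compare_pcrs` and the dictionary layout are hypothetical, and treating a PCR that is missing from the actual values as "Changed" is an assumption that the requirement does not state.

```python
def compare_pcrs(expected, actual):
    """Return {pcr_index: "Match" | "Changed"} for every PCR with an expected value."""
    status = {}
    for index, expected_value in expected.items():
        actual_value = actual.get(index)
        # Hex digests are compared case-insensitively.
        # Assumption: a PCR with no reported value is shown as "Changed".
        matches = (actual_value is not None
                   and actual_value.lower() == expected_value.lower())
        status[index] = "Match" if matches else "Changed"
    return status


# Shortened placeholder digests, not real PCR values.
expected = {8: "aa11", 10: "bb22"}
actual = {8: "ff00", 10: "BB22"}
print(compare_pcrs(expected, actual))
```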
- -**Trace:** Agent Detail - PCR Values View - -```gherkin -Feature: PCR Values Monitoring - - Scenario: Display PCR comparison table - Given agent "a1b2c3d4" has PCR values reported by the Verifier - When the user views the PCR Values tab - Then a table MUST display each PCR index, description, match status, and last changed date - And PCR 10 (IMA measurements) MUST show "Match" with a recent timestamp - And any PCR with a value differing from expected MUST show "Changed" - - Scenario: PCR data unavailable for agent - Given agent "a1b2c3d4" has not yet completed an attestation cycle - When the user views the PCR Values tab - Then the tab MUST display "No PCR data available — awaiting first attestation" -``` - -### FR-022: PCR Change Detection - -**Description:** The System MUST monitor PCR values for each agent and detect changes relative to expected baseline values. When a PCR value changes, the System MUST highlight the changed PCR, display the last changed timestamp, and allow the administrator to acknowledge the change or initiate an investigation. The System MUST alert if the IMA policy needs updating for legitimate changes. 
- -**Trace:** Agent Detail - PCR Values View - -```gherkin -Feature: PCR Change Detection - - Scenario: Detect PCR drift from baseline - Given agent "a1b2c3d4" has a policy-defined expected value for PCR 8 - And the agent's current PCR 8 value differs from the expected value - When the user views the PCR Values tab for agent "a1b2c3d4" - Then PCR 8 MUST display with a "Changed" indicator - And the last changed date MUST be displayed - And "Acknowledge" and "Investigate" action buttons MUST be available - - Scenario: PCR change acknowledgement denied for Viewer - Given the user has the Viewer role - When the user views a PCR with a "Changed" indicator - Then the "Acknowledge" and "Investigate" actions MUST NOT be rendered for the Viewer role - -> **Reviewer Note:** Changed "MUST be disabled or hidden" to "MUST NOT be rendered" — Viewer role lacks the capability to acknowledge or investigate PCR changes, so the controls should be absent. -``` - -### FR-023: Cross-Tab Navigation - -**Description:** The System SHOULD interconnect agent detail tabs so that clicking a PCR value in the PCR Values tab navigates to the corresponding IMA entries in the IMA Log tab, and certificate warnings link directly to the Certificates tab for renewal. 
- -**Trace:** Agent Detail - Tabs and Sub-Views (2/2) - -```gherkin -Feature: Cross-Tab Navigation - - Scenario: Navigate from PCR to IMA entries - Given the user is viewing the PCR Values tab for agent "a1b2c3d4" - And PCR 10 is highlighted with a "Changed" status - When the user clicks on the PCR 10 value - Then the view MUST switch to the IMA Log tab - And the IMA entries corresponding to PCR 10 measurements MUST be displayed - - Scenario: Navigate from cert warning to Certificates tab - Given the user sees a certificate expiry warning on the overview - When the user clicks the warning link - Then the view MUST switch to the Certificates tab - And the expiring certificate MUST be highlighted -``` - -### FR-024: Attestation Analytics Overview - -**Description:** The System MUST provide an attestation analytics overview displaying: total successful attestations, total failed attestations, average latency, and success rate as summary KPIs; an hourly attestation volume bar chart; a failure reason breakdown (donut chart); a latency distribution histogram; and a top failing agents ranked list. 
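For the success-rate KPI, one plausible reading is successful attestations divided by all attestations in the period, shown to one decimal place. The helpers below are hypothetical names used for illustration, and the zero-data behaviour assumes the KPI falls back to zero.

```python
def success_rate(successful, failed):
    """Success rate as a percentage of all attestations in the period."""
    total = successful + failed
    if total == 0:
        return 0.0  # assumption: with no data the KPI is shown as zero
    return 100.0 * successful / total


def format_rate(rate):
    """Render the rate with one decimal place, e.g. for a KPI card."""
    return f"{rate:.1f}%"


print(format_rate(success_rate(12450, 38)))
```

With 12,450 successes and 38 failures the ratio is 12450 / 12488, which rounds to 99.7%.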
- -**Trace:** Attestation Analytics - Overview Dashboard - -```gherkin -Feature: Attestation Analytics Overview - - Scenario: Display attestation summary KPIs - Given there were 12,450 successful and 38 failed attestations in the last 24 hours - When the user navigates to the Attestation Analytics view - Then the summary MUST show "12,450" successful, "38" failed, and the computed average latency - And the success rate MUST display as "99.7%" - - Scenario: Display top failing agents - Given multiple agents have different failure counts - When the user views the Attestation Analytics - Then a "Top Failing Agents" list MUST rank agents by failure count descending - - Scenario: No attestation data available - Given no attestation data has been collected yet - When the user navigates to the Attestation Analytics view - Then the summary KPIs MUST display zeroes - And charts MUST display an empty state with message "No attestation data for the selected period" -``` - -### FR-025: Failure Categorization - -**Description:** The System MUST categorize attestation failures by type and severity: Quote Invalid (Critical), Policy Violation (Critical), Evidence Chain broken (Critical), Boot Violation (High), Timeout (Medium), PCR Mismatch (Medium), and Clock Skew (Low). Each failure type MUST include a description and common cause. 
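The type-to-severity mapping above lends itself to a simple lookup table. The sketch below is illustrative: `SEVERITY_BY_TYPE`, `categorize`, and the returned dictionary shape are hypothetical, and the fallback to "Unknown" with severity "High" follows the acceptance criteria for this requirement.

```python
# Severity per failure type, as listed in the FR-025 description.
SEVERITY_BY_TYPE = {
    "Quote Invalid": "Critical",
    "Policy Violation": "Critical",
    "Evidence Chain Broken": "Critical",
    "Boot Violation": "High",
    "Timeout": "Medium",
    "PCR Mismatch": "Medium",
    "Clock Skew": "Low",
}


def categorize(failure_type, raw_error=None):
    """Map a failure type to its severity, keeping the raw error for review."""
    if failure_type in SEVERITY_BY_TYPE:
        severity = SEVERITY_BY_TYPE[failure_type]
        return {"type": failure_type, "severity": severity, "raw_error": raw_error}
    # Unrecognized types fall back to "Unknown" with severity "High".
    return {"type": "Unknown", "severity": "High", "raw_error": raw_error}


print(categorize("Quote Invalid", "TPM quote signature mismatch"))
print(categorize("E-9999", "unrecognized verifier error code"))
```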
- -**Trace:** Attestation Analytics - Failure Analysis - -```gherkin -Feature: Failure Categorization - - Scenario: Classify failure by type - Given agent "agent-042" fails with a TPM quote signature mismatch - When the failure is processed by the analytics engine - Then the failure MUST be categorized as "Quote Invalid" - And the severity MUST be set to "Critical" - And the common cause MUST indicate "TPM hardware issue, key mismatch" - - Scenario: Failure with unknown type - Given agent "agent-099" fails with an unrecognized error code from the Verifier - When the failure is processed by the analytics engine - Then the failure MUST be categorized as "Unknown" - And the severity MUST default to "High" - And the raw error details MUST be preserved for manual review -``` - -### FR-026: Automatic Failure Correlation - -**Description:** The System MUST automatically correlate attestation failures across agents using four correlation dimensions: Temporal (failures within the same time window), Causal (same failure reason across agents), Topological (failures grouped by datacenter, subnet, or verifier), and Policy-linked (failures matching a recent policy update). Correlated failures MUST be grouped into a single incident. 
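A minimal sketch of two of the four dimensions (Temporal and Causal) is shown below: failures with the same reason are grouped when they fall within a fixed window of the first failure in the group. The tuple layout, the `correlate` name, and the 120-second default are assumptions for illustration; Topological and Policy-linked correlation are not modelled.

```python
from collections import defaultdict


def correlate(failures, window_seconds=120):
    """Group failures by reason, then split each reason into time windows.

    failures: iterable of (timestamp_seconds, agent_id, reason) tuples.
    Returns a list of groups; a group with one event is an uncorrelated failure.
    """
    by_reason = defaultdict(list)
    for failure in sorted(failures):  # sorted by timestamp first
        by_reason[failure[2]].append(failure)

    incidents = []
    for events in by_reason.values():
        current = [events[0]]
        for event in events[1:]:
            # Stay in the same incident while within the window of its first event.
            if event[0] - current[0][0] <= window_seconds:
                current.append(event)
            else:
                incidents.append(current)
                current = [event]
        incidents.append(current)
    return incidents


failures = [(i * 5, f"agent-{i:03d}", "IMA policy violation") for i in range(15)]
failures.append((30, "agent-900", "unique IMA hash mismatch"))
for group in correlate(failures):
    print(len(group), group[0][2])
```

Anchoring the window at the first event of each group keeps the grouping deterministic; a sliding window would be another reasonable choice.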
- -**Trace:** Attestation Analytics - Automatic Correlation; Attestation Analytics - Actionable Insights - -```gherkin -Feature: Automatic Failure Correlation - - Scenario: Correlate failures after policy update - Given IMA policy "production-v2" was updated 3 minutes ago - And 15 agents sharing policy "production-v2" report IMA policy violation within a 2-minute window - When the System processes these failure events - Then the System MUST group all 15 failures into a single correlated incident - And the suggested root cause MUST reference the recent policy update - And the incident view MUST provide a one-click rollback option - - Scenario: Distinguish targeted attack from mass failure - Given a single agent fails with a unique IMA hash mismatch - And no other agents share the same failure reason or timing - When the System processes the failure - Then the System MUST NOT correlate it with other incidents - And the alert MUST indicate "warrants immediate investigation" -``` - -### FR-027: Root Cause Suggestion and Recommended Actions - -**Description:** For each correlated incident created by FR-026, the System MUST generate a suggested root cause by analyzing the triggering event (e.g., recent policy change, certificate expiry, infrastructure outage). The System MUST provide a recommended action (e.g., "rollback policy", "renew certificate", "investigate agent"). The System MUST automatically link the incident to the triggering change in the audit log. 
- -**Trace:** Attestation Analytics - Actionable Insights - -```gherkin -Feature: Root Cause Suggestion - - Scenario: Suggest root cause from policy change - Given a correlated incident exists grouping 15 IMA policy violation failures - And IMA policy "production-v2" was updated 3 minutes before the first failure - When the System analyzes the incident - Then the suggested root cause MUST reference the policy update to "production-v2" - And the recommended action MUST include "Rollback policy to previous version" - And the incident MUST link to the policy change audit log entry - - Scenario: Suggest root cause from certificate expiry - Given a correlated incident exists grouping 5 mTLS handshake failures - And the Verifier's server certificate expired 10 minutes before the first failure - When the System analyzes the incident - Then the suggested root cause MUST reference the expired certificate - And the recommended action MUST include "Renew Verifier server certificate" - - Scenario: No root cause identified - Given a correlated incident exists but no recent policy, certificate, or infrastructure change is found - When the System analyzes the incident - Then the suggested root cause MUST display "Unknown — manual investigation required" - And the recommended action MUST include "Escalate to security team" -``` - -### FR-028: One-Click Policy Rollback from Incident - -**Description:** The System SHOULD provide a one-click rollback option from the incident view that reverts a policy to its previous version when a policy update is identified as the root cause of correlated failures. 
- -**Trace:** Attestation Analytics - Actionable Insights - -```gherkin -Feature: Policy Rollback from Incident - - Scenario: Rollback policy from incident view - Given an incident is open with suggested root cause "policy update to production-v2" - And the user has the Admin role - When the user clicks "Rollback Policy" on the incident - Then the System SHOULD revert "production-v2" to its previous version - And the rollback MUST be recorded in the audit log - - Scenario: Rollback subject to two-person rule - Given an incident suggests policy rollback - And the two-person rule is enforced for policy changes - When Admin A clicks "Rollback Policy" - Then the rollback MUST be submitted as a draft requiring approval from Admin B - And the System MUST NOT apply the rollback without a second approver -``` - -### FR-029: Push Mode (v3 API) Attestation Analytics - -**Description:** The System MUST provide analytics specific to push-mode attestation, including: attestation submission rate per agent, nonce expiry tracking, challenge lifetime compliance, evidence evaluation duration, and rate limit violations per IP/agent. Push-mode alerts MUST include: agent not submitting attestations, nonce expired before evidence was submitted, repeated evaluation failures, rate limit threshold reached, and session token expiry warnings. 
- -**Trace:** Attestation Analytics - Push Mode - -```gherkin -Feature: Push Mode Analytics - - Scenario: Track nonce expiry rate - Given the deployment operates in push mode (v3 API) - And 10% of nonces expired before evidence was submitted - When the user views push mode analytics - Then the nonce expiry rate MUST be displayed as 10% - And a warning alert MUST be raised for high nonce expiry - - Scenario: Alert on silent agent - Given agent "agent-100" has not submitted attestation evidence for 10 minutes - And the expected submission interval is 2 minutes - When the System evaluates push mode agent activity - Then the System MUST raise an alert indicating "agent not submitting attestations" - - Scenario: Push mode analytics unavailable in pull-only deployment - Given the deployment operates exclusively in pull mode (v2 API) - When the user navigates to the Push Mode Analytics view - Then the System MUST display "Push mode is not enabled in this deployment" - And no push-specific metrics MUST be shown -``` - -### FR-030: Verification Pipeline Visualization - -**Description:** The System MUST visualize the multi-stage verification pipeline showing each stage: Receive Quote + Logs, Validate TPM Quote, Check PCR Values, Verify IMA Log, and Verify Measured Boot. Each stage MUST show pass/fail status and timing. The visualization MUST indicate where in the pipeline a failure occurred. 
- -**Trace:** Verification Pipeline - Flow Visualization - -```gherkin -Feature: Verification Pipeline Visualization - - Scenario: Display pipeline with failure point - Given agent "agent-042" failed at the "Verify IMA Log" stage - When the user views the verification pipeline for the latest attestation - Then stages "Receive", "Validate TPM Quote", and "Check PCR Values" MUST show pass indicators - And stage "Verify IMA Log" MUST show a fail indicator - And subsequent stages MUST be shown as not reached - - Scenario: Pipeline data unavailable for agent - Given agent "agent-042" has not yet completed any attestation cycle - When the user views the verification pipeline - Then the System MUST display "No pipeline data available — awaiting first attestation" -``` - -### FR-031: Per-Stage Verification Metrics - -**Description:** The System MUST track timing and success rates for each verification stage independently: Quote Validation (signature, nonce freshness), PCR Check (replay time, bank coverage), IMA Verification (entries processed, allowlist hit/miss ratio, excludelist matches, TOMTOU errors), and Measured Boot (event log entries, compliance rate, firmware update detections). 
- -**Trace:** Verification Pipeline - Stage Metrics - -```gherkin -Feature: Per-Stage Verification Metrics - - Scenario: Display IMA verification depth metrics - Given the Verifier has processed attestations for agent "a1b2c3d4" - When the user views verification stage metrics - Then IMA stage MUST show entries processed count - And IMA stage MUST show allowlist hit/miss ratio - And IMA stage MUST show average processing time - - Scenario: Metrics unavailable for new deployment - Given the System has just been deployed and no attestations have been processed - When the user views verification stage metrics - Then all metrics MUST display "N/A" or zero values - And a message MUST indicate "Insufficient data — metrics will populate after attestation activity" -``` - -### FR-032: IMA Quote Progress Tracking - -**Description:** The System MUST track the gap between total IMA measurement entries and verified entries per agent. If the gap exceeds a configurable threshold (default: 1000 entries or 5 minutes of growth), the System MUST raise an alert indicating potential network issues, agent overload, or intentional evasion. 
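The entry-count part of the threshold check might look like the sketch below. `ima_gap_alert` and its message text are hypothetical, and the time-based alternative ("5 minutes of growth") is not modelled here.

```python
def ima_gap_alert(total_entries, verified_entries, threshold=1000):
    """Return an alert message if the unverified gap exceeds the threshold, else None."""
    gap = total_entries - verified_entries
    if gap > threshold:
        return f"IMA verification lag: {gap:,} entries unverified"
    return None


print(ima_gap_alert(50_000, 48_500))
print(ima_gap_alert(50_000, 49_500))
```

The comparison is strict, so a gap exactly equal to the threshold does not alert; whether "exceeds" should include equality is a choice the requirement leaves to the implementer.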
- -**Trace:** Verification Pipeline - Quote Progress Tracking - -```gherkin -Feature: IMA Quote Progress Tracking - - Scenario: Alert on IMA verification lag - Given agent "agent-042" has 50,000 total IMA entries - And only 48,500 entries have been verified - And the gap threshold is configured at 1,000 entries - When the System evaluates IMA progress for "agent-042" - Then the System MUST raise an alert for IMA verification lag - And the alert MUST display the gap size (1,500 entries) - - Scenario: IMA progress tracking disabled for push-mode agent - Given agent "agent-042" operates in push mode (v3 API) - And push mode does not expose incremental IMA progress - When the user views IMA quote progress for "agent-042" - Then the System MUST display "IMA progress tracking not available for push mode agents" -``` - -### FR-033: Policy List View and Management - -**Description:** The System MUST display a unified list of all runtime policies — both IMA and Measured Boot — in a single view. Each policy row MUST show: Policy Name, Kind (IMA or Measured Boot), number of assigned Agents, Entry count, Checksum, Last Updated date, and Edit/View actions. A search bar MUST allow filtering policies by name. New Policy and Import buttons MUST be available. 
- -**Trace:** Policy Management - IMA Policies - -```gherkin -Feature: Unified Policy List View - - Scenario: Display unified policy list with kind column - Given the Verifier has 3 IMA policies and 2 Measured Boot policies configured - When the user navigates to the Policy Management view - Then all 5 policies MUST be listed with name, kind, agent count, entry count, checksum, and last updated date - And each policy MUST display its kind as "IMA" or "Measured Boot" - And each policy MUST have "Edit" and "View" action links - - Scenario: Search policies by name - Given policies named "production-v2", "staging-v3", "minimal" exist - When the user types "prod" in the policy search bar - Then only "production-v2" MUST be displayed - - Scenario: No policies configured - Given the Verifier has zero policies configured - When the user navigates to the Policy Management view - Then the list MUST display an empty state with message "No policies configured" - And a "New Policy" button MUST be prominently available -``` - -### FR-034: Policy CRUD with Editor - -**Description:** The System MUST provide full CRUD operations for IMA runtime policies via the dashboard. The editor MUST support: file upload for allowlist, inline editing with syntax highlighting, hash algorithm selection (SHA-256, SHA-384), exclude list configuration, IMA signature key management, policy validation before save, and automatic checksum computation. 
- -**Trace:** Policy Management - Policy Editor - -```gherkin -Feature: Policy Editor - - Scenario: Create new IMA policy via upload - Given the user has the Admin role - When the user clicks "New Policy" and uploads an allowlist file - Then the System MUST parse and validate the file - And the System MUST compute and display the checksum - And the user MUST be able to save the policy - - Scenario: Validate policy before save - Given the user is editing a policy with an invalid hash format - When the user clicks "Save" - Then the System MUST display a validation error - And the policy MUST NOT be saved until errors are corrected - - Scenario: Concurrent edit conflict - Given Admin A is editing policy "production-v2" at version 3 - And Admin B saves a change to "production-v2" creating version 4 - When Admin A attempts to save their changes - Then the System MUST reject the save with a conflict error - And Admin A MUST be prompted to reload the latest version before editing -``` - -### FR-035: Policy Versioning - -**Description:** The System MUST maintain a version history for each policy with change diffs. The System MUST support rollback to any previous version. An audit trail MUST record who changed what and when. A side-by-side comparison view MUST be available. - -**Trace:** Policy Management - Policy Editor - -```gherkin -Feature: Policy Versioning - - Scenario: View policy version history - Given policy "production-v2" has been updated 3 times - When the user views the change history - Then the System MUST list all 3 versions with timestamps and authors - -> **Reviewer Note:** Split from original "View policy change diff" scenario which contained two When/Then blocks. Version listing and version comparison are distinct user actions that MUST be independently testable. 
- - Scenario: Compare policy versions - Given the user is viewing the change history for policy "production-v2" - When the user selects two versions for comparison - Then a side-by-side diff MUST highlight added, removed, and modified entries - - Scenario: Rollback to previous version - Given the user has the Admin role - And policy "production-v2" is at version 3 - When the user selects "Rollback to version 2" - Then the policy MUST revert to version 2 content - And a new version 4 MUST be created reflecting the rollback - - Scenario: Rollback denied for non-Admin - Given the user has the Operator role - And policy "production-v2" is at version 3 - When the user attempts to select "Rollback to version 2" - Then the rollback action MUST NOT be rendered for the Operator role - -> **Reviewer Note:** Changed "MUST be disabled or hidden" to "MUST NOT be rendered" — rollback is an Admin-only capability, so Operators should not see the control at all. -``` - -### FR-036: Measured Boot Policy Management - -**Description:** The System MUST support managing measured boot (MB) policies: list all policies, create from reference boot log, edit rules and constraints, associate with agent groups, and validate against known event logs. 
- -**Trace:** Policy Management - Policy Editor - -```gherkin -Feature: Measured Boot Policy Management - - Scenario: Create MB policy from reference boot log - Given the user has the Admin role - When the user uploads a reference UEFI event log - Then the System MUST generate a measured boot policy from the log - And the user MUST be able to edit and save the generated policy - - Scenario: Invalid boot log upload rejected - Given the user has the Admin role - When the user uploads a file that is not a valid UEFI event log - Then the System MUST display a validation error "Invalid boot log format" - And no policy MUST be generated -``` - -### FR-037: Policy Assignment Matrix - -**Description:** The System MUST display a matrix view showing which agents are assigned to which policies. The System MUST support batch reassignment, orphan policy detection (policies not assigned to any agent), impact analysis before changes, and preview of policy effect on agents. - -**Trace:** Policy Management - Policy Editor - -```gherkin -Feature: Policy Assignment Matrix - - Scenario: Detect orphan policies - Given policy "dev-old" is not assigned to any agent - When the user views the policy assignment matrix - Then "dev-old" MUST be flagged as an orphan policy - - Scenario: Batch reassign agents to new policy - Given 15 agents are assigned to policy "staging-v2" - When the admin selects all 15 agents and assigns them to "staging-v3" - Then all 15 agents MUST be reassigned to "staging-v3" - And the assignment change MUST be recorded in the audit log - - Scenario: Batch reassignment partial failure - Given 15 agents are assigned to policy "staging-v2" - When the admin reassigns all 15 agents to "staging-v3" and 3 agents fail to update - Then the System MUST display a summary: "12 succeeded, 3 failed" - And each failed agent MUST show the failure reason -``` - -### FR-038: Pre-Update Policy Impact Analysis - -**Description:** The System MUST perform an impact analysis before applying 
policy changes. The analysis MUST categorize affected agents into three groups: Unaffected (no files match changes), Affected (have modified files), and Will Fail (removed hash currently in use). The System MUST display a recommendation and provide Submit for Approval, Staged Rollout, and Cancel actions. - -**Trace:** Policy Management - Impact Analysis - -```gherkin -Feature: Policy Impact Analysis - - Scenario: Analyze impact of IMA policy update - Given IMA policy "production-v2" is assigned to 185 agents - And the proposed update adds 45 hashes, removes 12 hashes, and modifies 8 hashes - When the administrator requests impact analysis for the proposed change - Then the System MUST display the number of unaffected agents - And the System MUST display the number of affected agents - And the System MUST display the number of agents that will fail - And a recommendation MUST be provided based on the analysis - - Scenario: Impact analysis with Verifier API unavailable - Given the Verifier API is unreachable - When the administrator requests impact analysis for a proposed change - Then the System MUST display an error "Unable to perform impact analysis — Verifier API unavailable" - And the "Submit for Approval" action MUST be disabled -``` - -### FR-039: Two-Person Rule for Policy Changes - -**Description:** The System MUST enforce a two-person approval workflow for policy changes. Admin A drafts the policy change, the System runs impact analysis automatically, and Admin B (a different user) reviews and approves. The System MUST support configurable quorum (e.g., 2-of-3, 3-of-5). The approver MUST NOT be the same user as the drafter. Approval requests MUST have a configurable time-limited window (default: 24 hours) after which they expire and revert to Draft state. 
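The approval rules described for FR-039 can be illustrated with a minimal sketch. Everything named here (`PolicyChange`, `ApprovalError`, `required_approvals`, the state strings) is hypothetical and not taken from the project's code. The sketch assumes that the quorum counts distinct approvers other than the drafter and that a request is expired once the full window has elapsed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


class ApprovalError(Exception):
    """Raised when an approval attempt violates the two-person rule."""


@dataclass
class PolicyChange:
    drafter: str
    submitted_at: datetime
    window: timedelta = timedelta(hours=24)   # default approval window from FR-039
    required_approvals: int = 1               # assumed: distinct non-drafter approvers
    approvers: set = field(default_factory=set)
    state: str = "Pending Approval"

    def approve(self, approver: str, now: datetime) -> str:
        if self.state != "Pending Approval":
            raise ApprovalError(f"change is in state {self.state!r}")
        # An expired request reverts to Draft before anything else is checked.
        if now - self.submitted_at >= self.window:
            self.state = "Draft"
            raise ApprovalError("approval window expired")
        # The approver must be a different user than the drafter.
        if approver == self.drafter:
            raise ApprovalError("approver cannot be the drafter")
        self.approvers.add(approver)
        if len(self.approvers) >= self.required_approvals:
            self.state = "Approved"
        return self.state


if __name__ == "__main__":
    t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
    change = PolicyChange(drafter="admin-a", submitted_at=t0)
    print(change.approve("admin-b", t0 + timedelta(hours=2)))
```

Using a set for `approvers` means a repeated approval by the same user does not count twice toward the quorum, which is one possible reading of "2-of-3" style configurations.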
- -**Trace:** Policy Management - Two-Person Rule - -```gherkin -Feature: Two-Person Policy Approval - - Scenario: Policy change requires different approver - Given Admin A drafts a change to IMA policy "production-v2" - And the System runs impact analysis automatically - When Admin A attempts to approve their own policy change - Then the System MUST reject the approval - And the System MUST display an error indicating the approver cannot be the drafter - - Scenario: Approval window expiry - Given Admin A submitted a policy change for review - And 24 hours have elapsed without approval - When Admin B attempts to approve the change - Then the System MUST reject the approval as expired - And the policy change MUST revert to Draft state - - Scenario: Successful two-person approval - Given Admin A drafts a policy change - And Admin B reviews and approves the change within 24 hours - When the approval is recorded - Then the System MUST automatically push the policy to the Verifier - And the audit log MUST record both the drafter and approver identities - - Scenario: Configure approval window duration - Given the Admin configures the approval window to 48 hours - When Admin A submits a policy change for review - Then the approval window MUST expire after 48 hours instead of the default 24 hours - And the pending change MUST display the configured expiry time - - Scenario: Single admin cannot approve policy changes - Given there is only one Admin user in the system - When Admin A drafts a policy change - Then the System MUST allow the draft to be submitted for approval - And the System MUST display a warning that no other approver is available - And the policy change MUST remain in "Pending Approval" until a second Admin is created -``` - -### FR-041: Change Management Integration - -**Description:** The System MAY integrate with external change management systems. 
Integration options MUST include: optionally gating policy push on ServiceNow Change Request approval, auto-creating Jira tickets for policy review, linking CR numbers in the audit log, and supporting emergency bypass with break-glass audit. - -**Trace:** Policy Management - Two-Person Rule - -```gherkin -Feature: Change Management Integration - - Scenario: Gate policy push on ServiceNow CR - Given ServiceNow integration is enabled - And Admin A drafts a policy change - When the change is submitted for approval - Then a ServiceNow Change Request MUST be auto-created - And the policy push MUST be blocked until the CR is approved in ServiceNow - - Scenario: ServiceNow integration unavailable - Given ServiceNow integration is enabled - And the ServiceNow API endpoint is unreachable - When the change is submitted for approval - Then the System MUST display a warning "ServiceNow unreachable — CR not created" - And the policy change MUST remain in draft state until the CR can be created -``` - -### FR-042: Security Audit Event Log - -**Description:** The System MUST maintain a searchable security audit event log. Each event MUST include severity level (CRITICAL, WARNING, INFO), timestamp, source component, action performed, and acting user identity. The log MUST be filterable by severity, category, and date range. An Export function MUST be available. 
- -**Trace:** Security Audit - Event Log Dashboard - -```gherkin -Feature: Security Audit Event Log - - Scenario: Filter audit log by severity - Given the audit log contains events of various severities - When the user selects severity filter "CRITICAL" - Then only events with CRITICAL severity MUST be displayed - And WARNING and INFO events MUST be hidden - - Scenario: Audit log captures attestation failure - Given agent "a1b2c3d4" fails attestation with an IMA policy violation - When the verifier reports the failure - Then a CRITICAL audit event MUST be logged - And the event MUST include the source (verifier ID), action (VERIFY_EVIDENCE), agent ID, and failure reason - - Scenario: Audit log search returns no results - Given the audit log contains events from the past 30 days - When the user filters by severity "CRITICAL" and date range with no matching events - Then the log view MUST display an empty state with message "No events match the selected filters" -``` - -### FR-043: Authorization Tracking - -**Description:** The System MUST track all authorization decisions made by Keylime's authorization provider. Tracked actions MUST include: Agent Management (CREATE, READ, UPDATE, DELETE by admin mTLS), Attestation (SUBMIT, READ, LIST by agent PoP token), Policy Management (CREATE, READ, UPDATE, DELETE by admin mTLS), Sessions (CREATE, EXTEND by agent), Verification (VERIFY_IDENTITY, VERIFY_EVIDENCE by admin mTLS), and Registration (REGISTER, ACTIVATE, DELETE by agent/admin). 
- -**Trace:** Security Audit - Authorization Tracking - -```gherkin -Feature: Authorization Tracking - - Scenario: Track admin policy update - Given admin "admin@example.com" updates IMA policy "production-v2" - When the authorization event is recorded - Then the audit log MUST contain action "UPDATE" in category "Policy Management" - And the identity type MUST be "admin (mTLS)" - And the actor MUST be "admin@example.com" - - Scenario: Authorization tracking for denied action - Given user "viewer@example.com" with Viewer role attempts to delete an agent - When the authorization decision is recorded - Then the audit log MUST contain action "DELETE" with result "DENIED" - And the identity MUST be "viewer@example.com" -``` - -### FR-044: Identity Verification Event Monitoring - -**Description:** The System MUST monitor agent identity verification events including: new agent registration (with identity type: ek_cert, iak_idevid, default), agent re-registration (regcount tracking), EK certificate validation results, and TPM key changes between registrations. 
- -**Trace:** Security Audit - Identity Verification Events - -```gherkin -Feature: Identity Verification Events - - Scenario: Track new agent registration - Given a new agent registers with identity type "iak_idevid" - When the registration event is processed - Then the audit log MUST record action "REGISTER_AGENT" - And the identity type MUST be "iak_idevid" - - Scenario: Alert on re-registration from different IP - Given agent "agent-042" was originally registered from IP 10.0.1.5 - When agent "agent-042" re-registers from IP 10.0.2.10 - Then the System MUST raise a WARNING alert - And the alert MUST indicate the IP change from 10.0.1.5 to 10.0.2.10 - - Scenario: Agent registration with invalid EK certificate - Given a new agent attempts to register with an EK certificate not in the trusted CA list - When the registration event is processed - Then the audit log MUST record action "REGISTER_AGENT" with result "REJECTED" - And a CRITICAL alert MUST be raised indicating "untrusted EK certificate" -``` - -### FR-045: Anomaly Detection - -**Description:** The System SHOULD detect anomalous authorization patterns including: unusual authorization sequences, failed authorization attempts, privilege escalation attempts, and off-hours administrative actions. 
- -**Trace:** Security Audit - Authorization Tracking - -```gherkin -Feature: Anomaly Detection - - Scenario: Detect off-hours admin action - Given the organization's business hours are 08:00-18:00 - When admin "admin@example.com" deletes an agent at 03:00 - Then the System SHOULD flag the action as an off-hours anomaly - And a WARNING alert SHOULD be raised for security review - - Scenario: Multiple failed authorization attempts - Given user "unknown@example.com" attempts 5 failed authorization requests in 1 minute - When the System evaluates authorization patterns - Then the System SHOULD flag the activity as a potential brute-force attempt - And a CRITICAL alert SHOULD be raised for immediate investigation -``` - -### FR-046: Revocation Notification Channel Monitoring - -**Description:** The System MUST monitor all revocation notification channels: Agent (REST) direct notification with delivery status and latency, ZeroMQ notification with connection state and queue depth, and Webhook notification with HTTP status and retry count. 
- -**Trace:** Revocation - Notification Channels - -```gherkin -Feature: Revocation Channel Monitoring - - Scenario: Monitor webhook delivery failure - Given a revocation webhook is configured for "hooks.slack.com" - And a revocation event triggers for agent "agent-042" - When the webhook delivery fails with HTTP 503 - Then the System MUST display "failed (retry 1)" for the webhook channel - And the System MUST retry delivery according to the configured retry policy - - Scenario: All revocation channels unavailable - Given all configured revocation notification channels are in a failed state - When a new revocation event occurs - Then the System MUST raise a CRITICAL alert "All notification channels unavailable" - And the revocation event MUST be queued for delivery once a channel recovers - - Scenario: Monitor ZeroMQ connection state - Given ZeroMQ is configured on port 8992 - When the ZeroMQ connection is active - Then the integration status MUST show "connected" with current queue depth -``` - -### FR-047: Alert Lifecycle Workflow - -**Description:** The System MUST implement an alert lifecycle: New → Acknowledged → Under Investigation → Resolved or Dismissed. Operators MUST be able to acknowledge alerts, assign them to team members, add investigation notes, reactivate agents after fix, update policies for false positives, and escalate to security teams. 
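The lifecycle in FR-047 reads naturally as a small state machine. The sketch below is only an illustration: `AlertState`, `ALLOWED_TRANSITIONS`, and `transition` are hypothetical names, and it models just the linear path stated in the description. Shortcuts such as auto-resolve on re-attestation would need extra transitions that are not shown.

```python
from enum import Enum


class AlertState(Enum):
    NEW = "New"
    ACKNOWLEDGED = "Acknowledged"
    UNDER_INVESTIGATION = "Under Investigation"
    RESOLVED = "Resolved"
    DISMISSED = "Dismissed"


# Assumed transition table: only the path named in the FR-047 description.
ALLOWED_TRANSITIONS = {
    AlertState.NEW: {AlertState.ACKNOWLEDGED},
    AlertState.ACKNOWLEDGED: {AlertState.UNDER_INVESTIGATION},
    AlertState.UNDER_INVESTIGATION: {AlertState.RESOLVED, AlertState.DISMISSED},
    AlertState.RESOLVED: set(),
    AlertState.DISMISSED: set(),
}


def transition(current: AlertState, target: AlertState) -> AlertState:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move alert from {current.value!r} to {target.value!r}")
    return target


if __name__ == "__main__":
    state = AlertState.NEW
    state = transition(state, AlertState.ACKNOWLEDGED)
    state = transition(state, AlertState.UNDER_INVESTIGATION)
    print(transition(state, AlertState.RESOLVED).value)
```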
- -**Trace:** Revocation - Alert Workflow - -```gherkin -Feature: Alert Lifecycle Workflow - - Scenario: Acknowledge a critical alert - Given a new critical alert exists for agent "agent-042" - And the user has the Operator role - When the operator clicks "Acknowledge" on the alert - Then the alert state MUST change to "Acknowledged" - - Scenario: Move acknowledged alert to investigation - Given an acknowledged alert exists for agent "agent-042" - And the user has the Operator role - When the operator assigns the alert and clicks "Investigate" - Then the alert state MUST change to "Under Investigation" - - Scenario: Auto-resolve on re-attestation - Given agent "agent-042" has an active alert for attestation failure - When agent "agent-042" passes a subsequent attestation cycle - Then the System SHOULD automatically resolve the alert - And the resolution reason MUST indicate "auto-resolved on successful re-attestation" - - Scenario: Alert Center KPI cards use consistent summary data - Given the alert summary endpoint returns critical=2, warnings=2, info=2 - When the user views the Alert Center page - Then all three KPI cards (Critical, Warnings, Info) MUST display values from the summary endpoint - And the KPI values MUST NOT change when table filters are applied -``` - -### FR-048: Alert Auto-Escalation - -**Description:** The System SHOULD automatically escalate alerts that remain unacknowledged after a configurable SLA timeout. Escalation MUST follow a defined escalation chain. 
- -**Trace:** Revocation - Alert Workflow - -```gherkin -Feature: Alert Auto-Escalation - - Scenario: Escalate unacknowledged critical alert - Given a CRITICAL alert has been in "New" state for 30 minutes - And the SLA timeout for critical alerts is configured at 15 minutes - When the SLA timeout is exceeded - Then the System SHOULD escalate the alert to the next level in the escalation chain - And a notification SHOULD be sent to the escalation recipient - - Scenario: Escalation chain exhausted - Given all levels in the escalation chain have been notified - And the alert remains unacknowledged - When the System attempts further escalation - Then the System SHOULD raise a CRITICAL meta-alert "Escalation chain exhausted — alert unresolved" - And the audit log MUST record all escalation attempts -``` - -### FR-049: Alert Auto-Resolve - -**Description:** The System SHOULD automatically resolve alerts when the underlying condition clears, specifically when an agent passes a subsequent attestation cycle. The System MUST implement notification deduplication and severity auto-adjustment. 
- -**Trace:** Revocation - Alert Workflow - -```gherkin -Feature: Alert Auto-Resolve - - Scenario: Auto-resolve on successful attestation - Given agent "agent-067" has an active warning alert for boot violation - When agent "agent-067" completes a successful attestation - Then the System SHOULD change the alert state to "Resolved" - And the resolution MUST be marked as "auto-resolved" - - Scenario: Auto-resolve suppressed for manually escalated alert - Given an alert has been manually escalated to the security team - When the underlying condition clears - Then the System MUST NOT auto-resolve the alert - And the alert MUST remain in its current state until manually resolved -``` - -### FR-050: Unified Certificate View - -**Description:** The System MUST provide a unified view of all certificate types in the Keylime ecosystem: EK Certificate (TPM vendor), AK Certificate (Keylime CA), mTLS Certificate (Agent↔Verifier), IAK Certificate (manufacturer), IDevID Certificate (manufacturer), and Server Certificates (Verifier/Registrar from Org CA). 
- -**Trace:** Certificate Management - Overview - -```gherkin -Feature: Unified Certificate View - - Scenario: Display all certificate types - Given agents have EK, AK, and mTLS certificates - And the Verifier and Registrar have server certificates - When the user navigates to the Certificate Management view - Then all certificate types MUST be listed in a unified view - And each certificate MUST show its type, associated entity, and validity status - - Scenario: Agent with missing certificate data - Given agent "agent-099" has no EK certificate recorded in the Registrar - When the user views the Certificate Management view - Then the entry for "agent-099" MUST display "EK Certificate: Not Available" -``` - -### FR-051: Certificate Expiry Dashboard - -**Description:** The System MUST display a certificate expiry dashboard showing summary counts (Expired, Expiring within 30 days, Valid, Total) and a detailed list of certificates requiring attention. The System MUST display a 90-day certificate expiry timeline. Alert thresholds MUST be tiered: 90-day informational, 30-day action required, 7-day critical, 1-day emergency, and expired. 
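The tiered thresholds in FR-051 amount to a lookup on the time remaining until a certificate's Not After date. The following sketch shows one way to express that; `TIERS`, `expiry_tier`, and the tier labels are hypothetical, and treating each threshold as inclusive (for example, exactly 7 days remaining is "critical") is an assumption chosen to match the 7-day acceptance scenario.

```python
from datetime import datetime, timedelta, timezone

# (maximum days remaining, tier label), ordered from most to least urgent.
TIERS = [
    (1, "emergency"),
    (7, "critical"),
    (30, "action_required"),
    (90, "informational"),
]


def expiry_tier(not_after: datetime, now: datetime) -> str:
    """Classify a certificate by how close it is to its Not After time."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "expired"
    for max_days, tier in TIERS:
        if remaining <= timedelta(days=max_days):
            return tier
    # More than 90 days remaining: no alert tier applies.
    return "valid"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(expiry_tier(now + timedelta(days=7), now))
```

Passing `now` explicitly keeps the function deterministic, which makes the tier boundaries easy to cover in unit tests.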
- -**Trace:** Certificate Management - Expiry Dashboard; Certificate Management - Operations - -```gherkin -Feature: Certificate Expiry Dashboard - - Scenario: Display certificate expiry summary - Given the fleet has 1 expired certificate, 5 certificates expiring within 30 days, and 244 valid certificates - When the user navigates to the Certificate Expiry Dashboard - Then the summary MUST show "Expired: 1", "Expiring <30d: 5", "Valid: 244", "Total: 250" - - Scenario: Certificate expiry alert at 7-day threshold - Given agent "agent-015" has an mTLS certificate expiring in 7 days - When the System evaluates certificate validity - Then a critical alert MUST be raised for agent "agent-015" - And the alert MUST indicate the certificate type and expiry date - - Scenario: No certificates expiring - Given all certificates in the fleet are valid with more than 90 days remaining - When the user navigates to the Certificate Expiry Dashboard - Then the summary MUST show "Expired: 0", "Expiring <30d: 0" - And the timeline MUST display no upcoming expirations -``` - -### FR-052: Certificate Detail Inspection - -**Description:** The System MUST provide a detailed certificate inspection view showing: Subject and Issuer DN, serial number, validity period (Not Before/After), public key algorithm and size, signature algorithm, Subject Alternative Names (SANs), Key Usage / Extended Key Usage, certificate chain visualization, and PEM/DER export options. For EK certificates, the System MUST verify the TPM vendor and validate the chain against known CAs. 
- -**Trace:** Certificate Management - Operations - -```gherkin -Feature: Certificate Detail Inspection - - Scenario: Inspect EK certificate - Given agent "a1b2c3d4" has an EK certificate issued by a TPM vendor - When the user views the certificate detail for the EK cert - Then the Subject DN, Issuer DN, serial number, and validity period MUST be displayed - And the certificate chain visualization MUST show the chain to the root CA - And PEM export MUST be available - - Scenario: Certificate chain validation failure - Given agent "a1b2c3d4" has an EK certificate issued by an unknown CA - When the System validates the EK certificate chain - Then the chain validation MUST display "INVALID — issuer not in trusted CA list" - And a WARNING MUST be displayed on the certificate detail view - - Scenario: Validate EK certificate chain - Given the TPM vendor CA certificates are pre-loaded - When the System validates agent "a1b2c3d4"'s EK certificate - Then the chain validation result MUST be displayed (valid/invalid) -``` - -### FR-053: Automated Certificate Renewal - -**Description:** The System SHOULD support automated certificate renewal workflows for Keylime CA-issued certificates. Renewal MUST include: approval workflow, batch scheduling, pre-renewal validation, and rollback on failure. 
- -**Trace:** Certificate Management - Operations - -```gherkin -Feature: Automated Certificate Renewal - - Scenario: Auto-renew expiring mTLS certificate - Given agent "agent-015" has an mTLS certificate expiring in 7 days - And auto-renewal is enabled for Keylime CA certificates - When the System initiates renewal - Then the certificate SHOULD be renewed with approval workflow - And the new certificate MUST be validated before activation - - Scenario: Rollback on renewal failure - Given a certificate renewal fails validation - When the System detects the failure - Then the System SHOULD rollback to the previous certificate - And an alert MUST be raised indicating renewal failure - - Scenario: Auto-renewal not available for vendor certificates - Given agent "agent-015" has an EK certificate issued by a TPM vendor - When the System evaluates the certificate for auto-renewal - Then the System MUST skip auto-renewal for vendor-issued certificates - And the certificate expiry alert MUST indicate "manual renewal required" -``` - -### FR-054: Pull Mode Attestation Monitoring - -**Description:** The System MUST provide monitoring specific to pull-mode (v2 API) attestation including: quote polling frequency compliance, agent response time distribution, state transition timeline, retry count and backoff status, V-key delivery tracking, and concurrent agent polling load. Alerts MUST include: agent unresponsive, quote interval drift, exponential backoff triggered, max retries exceeded, verifier overloaded, and agent stuck in RETRY state. 
- -**Trace:** Attestation Modes - Pull Mode Monitoring - -```gherkin -Feature: Pull Mode Monitoring - - Scenario: Alert on max retries exceeded - Given agent "agent-042" has exceeded the configured max_retries of 5 - When the System evaluates pull mode agent health - Then a WARNING alert MUST be raised indicating "max retries exceeded" - - Scenario: Detect verifier overload - Given the verifier's pending verification queue is growing - And the queue depth exceeds the configured threshold - When the System monitors pull mode metrics - Then a WARNING alert MUST indicate "verifier overloaded (queue growing)" -``` - -### FR-055: Push Mode Attestation Monitoring - -**Description:** The System MUST provide monitoring specific to push-mode (v3 API) attestation including: agent submission frequency, nonce expiry rate, evidence evaluation time, session token lifecycle, challenge lifetime utilization, and capabilities negotiation outcomes. Alerts MUST include: agent silent, high nonce expiry rate, rate limit violations, session creation failures, verification timeout exceeded, and evidence chain broken. 
- -**Trace:** Attestation Modes - Push Mode Monitoring - -```gherkin -Feature: Push Mode Monitoring - - Scenario: Alert on rate limit violations - Given agent "agent-100" exceeds the per-agent rate limit for attestation submissions - When the rate limit violation is detected - Then a WARNING alert MUST be raised - And the alert MUST indicate the agent ID and the limit threshold - - Scenario: Push mode monitoring in pull-only deployment - Given the deployment operates exclusively in pull mode (v2 API) - When the user navigates to push mode monitoring - Then the System MUST display "Push mode is not enabled in this deployment" - - Scenario: Track session token lifecycle - Given push mode sessions are active - When the user views push mode monitoring - Then each session MUST show creation time, expiry time, and associated agent UUID -``` - -### FR-056: Mixed Mode Unified Views - -**Description:** In deployments with both pull-mode and push-mode agents, the System MUST provide unified views that normalize metrics across modes. The System MUST also offer mode-specific drill-down views. The dashboard MUST adapt its displays based on the configured attestation mode. 
- -**Trace:** Attestation Modes - Comparative View - -```gherkin -Feature: Mixed Mode Unified Views - - Scenario: Unified fleet view with mixed modes - Given the deployment has 200 agents in pull mode (v2 API) and 50 agents in push mode (v3 API) - When the user views the Fleet Overview Dashboard - Then all 250 agents MUST appear in the unified agent list - And each agent MUST indicate its attestation mode - And KPIs MUST aggregate across both modes - - Scenario: Mode-specific drill-down - Given the dashboard detects agents operating in push mode - When the user navigates to the Attestation Analytics view - Then a mode toggle MUST allow switching between unified, pull-only, and push-only views - - Scenario: Single-mode deployment hides mode toggle - Given the deployment operates exclusively in pull mode (v2 API) - When the user views the Fleet Overview Dashboard - Then the mode toggle MUST NOT be displayed - And all views MUST default to pull-mode data -``` - -### FR-057: Backend Connectivity Status Dashboard - -**Description:** The System MUST display a real-time connectivity status dashboard for all backend services. Monitored services MUST include: Webtool Backend (the dashboard's own backend) with UP/DOWN status via settings endpoint probe (FR-077); Core Services (Verifier, Registrar) with their configured endpoint URLs, UP/DOWN/HIGH LOAD status, and uptime; Database Backends (Verifier DB, Registrar DB) with average query time and migration revision; Durable Attestation Backends (Rekor, Redis, RFC 3161 TSA) with latency and connection state; and Notification Channels (ZeroMQ, Webhook, Agent notifications) with delivery status. Each service MUST display a health indicator (green/yellow/red). Core Services health checks MUST use lightweight HTTP probes that bypass the circuit breaker to reflect real connectivity status even when the circuit breaker is open. The connectivity status endpoint MUST report the real configured service URLs instead of placeholder values. 
- -**Trace:** Integration Status - Backend Connectivity - -```gherkin -Feature: Backend Connectivity Status - - Scenario: Display webtool backend connectivity - Given the webtool backend is running at http://localhost:8080 - When the user navigates to the Integration Status view - Then the "Webtool Backend" section MUST show status "UP" with a green indicator - And the backend URL MUST be displayed - - Scenario: Webtool backend unreachable - Given the webtool backend is not reachable - When the user navigates to the Integration Status view - Then the "Webtool Backend" section MUST show status "DOWN" with a red indicator - And the Core Services, Attestation Backends, and Notification Channels sections MUST be disabled - - Scenario: Display core service connectivity - Given the Verifier is running at 10.0.0.1:8881 with 45 days uptime - And a second Verifier at 10.0.0.3:8881 is under HIGH LOAD at 72% - When the user navigates to the Integration Status view - Then the Verifier (10.0.0.1:8881) MUST show status "UP" with uptime "45d" - And the Verifier-02 (10.0.0.3:8881) MUST show status "HIGH LOAD" with a yellow indicator - - Scenario: Core service health check bypasses circuit breaker - Given the Verifier circuit breaker is in OPEN state - And the Verifier service has recovered and is now reachable - When the Integration Status view polls the connectivity endpoint - Then the Verifier MUST show status "UP" - And the probe MUST NOT be blocked by the circuit breaker - - Scenario: Alert on backend service failure - Given the RFC 3161 TSA at tsa.example.com is configured - When the TSA connection times out - Then the TSA entry MUST show status "TIMEOUT" with a red indicator - And a WARNING alert MUST be raised indicating TSA unavailability - - Scenario: All backend services healthy - Given all configured backend services are operational - When the user navigates to the Integration Status view - Then all services MUST show green health indicators - And no alerts MUST be 
displayed for backend connectivity - - Scenario: Auto-recover when backend comes back - Given the webtool backend was previously unreachable - And the Core Services sections were disabled - When the backend becomes reachable again - Then the "Webtool Backend" section MUST show status "UP" - And the Core Services sections MUST be re-enabled automatically -``` - -### FR-058: Durable Attestation Backend Monitoring - -**Description:** The System MUST monitor all durable attestation backends: Rekor (transparency log) with upload latency and inclusion proofs, Redis (time-series data) with latency, memory usage, and key count, SQL DB (persistent storage) with query time and storage growth, File-based (local audit trail) with disk usage and write speed, and RFC 3161 TSA (timestamping authority) with response time and token validity. Alerts MUST fire on: upload failures, log unavailable, connection lost, high latency, storage full, slow queries, disk full, permission errors, TSA unreachable, and certificate expired. 
- -**Trace:** Integration Status - Durable Attestation - -```gherkin -Feature: Durable Attestation Backend Monitoring - - Scenario: Monitor Rekor transparency log health - Given Rekor is configured at transparency-log.example.com - When the System polls Rekor status - Then the dashboard MUST display upload latency and inclusion proof success rate - And an alert MUST fire if Rekor uploads fail for more than 5 consecutive attempts - - Scenario: Alert on Redis connection loss - Given Redis is configured at redis.internal:6379 - When the Redis connection is lost - Then the System MUST display "disconnected" for the Redis backend - And a CRITICAL alert MUST be raised indicating "attestation data backend unavailable" - - Scenario: Durable backend not configured - Given no Rekor transparency log is configured for the deployment - When the user views the Durable Attestation Backend panel - Then the Rekor entry MUST display "Not Configured" - And no alerts MUST be raised for Rekor status -``` - -### FR-059: Compliance Framework Mapping Reports - -**Description:** The System MUST map attestation capabilities to specific compliance framework controls. Supported frameworks MUST include: NIST SP 800-155 (BIOS integrity measurement), NIST SP 800-193 (platform firmware resilience), PCI DSS 4.0 (Req 11.5 file integrity monitoring, Req 10.2 audit trail), SOC 2 Type II (CC7.1 monitoring activities, CC6.1 logical access controls), FedRAMP (CA-7 continuous monitoring, SI-7 software integrity), and CIS Controls v8 (2.5 allowlisted software). Each mapping MUST identify the relevant dashboard evidence. 
- -**Trace:** Compliance - Framework Mapping - -```gherkin -Feature: Compliance Framework Mapping - - Scenario: View PCI DSS compliance mapping - Given the System has IMA attestation data for the fleet - When the user selects the PCI DSS 4.0 compliance report - Then the report MUST map Req 11.5 to IMA attestation history per agent - And the report MUST map Req 10.2 to the tamper-evident audit log export - And each control MUST indicate its coverage status (Covered / Partial / Gap) - - Scenario: View FedRAMP compliance mapping - Given the System has real-time attestation metrics - When the user selects the FedRAMP compliance report - Then control CA-7 MUST map to real-time attestation success rate - And control SI-7 MUST map to IMA policy enforcement reports - - Scenario: Compliance gap identified - Given IMA attestation is not enabled for 10 agents in the fleet - When the user views the PCI DSS 4.0 compliance report - Then Req 11.5 coverage status MUST display "Partial" or "Gap" - And the gap detail MUST list the 10 agents without IMA attestation -``` - -### FR-060: One-Click Compliance Report Export - -**Description:** The System MUST support one-click export of compliance reports in PDF and CSV formats. Each report MUST include: attestation coverage summary, policy compliance matrix, exception list with justifications, and time-bounded audit evidence packages. Reports MUST be filterable by compliance framework. 
- -**Trace:** Compliance - Framework Mapping - -```gherkin -Feature: One-Click Compliance Report Export - - Scenario: Export PDF compliance report - Given the user is viewing the SOC 2 Type II compliance mapping - And the user has the Operator or Admin role - When the user clicks "Export PDF" - Then a PDF report MUST be generated containing the compliance mapping, attestation coverage summary, and exception list - And the PDF MUST include a timestamp and the generating user's identity - - Scenario: Export time-bounded audit evidence package - Given the user selects date range "2026-01-01" to "2026-03-31" - When the user exports the compliance evidence package - Then the export MUST include all audit log entries within the date range - And the export MUST include attestation pass/fail summaries per agent for the period - - Scenario: Compliance report export denied for Viewer - Given the user has the Viewer role - When the user views a compliance report - Then the Export action MUST NOT be rendered for the Viewer role -``` - -> **Reviewer Note:** Changed "MUST be disabled or hidden" to "MUST NOT be rendered" because an RBAC-denied export capability should be absent for the Viewer role. Also changed the When step from "attempts to export" (presupposes the control exists) to "views a compliance report" (observable state). - -### FR-061: Tamper-Evident Hash-Chained Audit Logging - -**Description:** The System MUST implement tamper-evident audit logging where each log entry includes a SHA-256 hash of the previous entry. The chain root MUST be anchored to an external RFC 3161 timestamp. Periodic chain checkpoints MUST be submitted to a Rekor transparency log. Tamper detection MUST run on startup and periodically. Each entry MUST include: timestamp (UTC, millisecond precision), actor (user identity from OIDC), action (CRUD + target), resource, source IP, user agent, result, and previous entry hash.
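
The hash chaining in FR-061 can be illustrated with a minimal Python sketch. It is not the project's implementation: the `prev_hash` field name, the all-zero genesis value, and the canonical JSON serialization are assumptions made for illustration, and RFC 3161 anchoring and Rekor checkpoints are left out.

```python
import hashlib
import json

# Assumed placeholder for the first entry, which has no predecessor.
GENESIS_HASH = "0" * 64


def entry_hash(entry: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the entry (assumed format)."""
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list, **fields) -> dict:
    """Append an entry that records the hash of the previous entry."""
    prev_hash = entry_hash(log[-1]) if log else GENESIS_HASH
    entry = dict(fields, prev_hash=prev_hash)
    log.append(entry)
    return entry


def verify_chain(log: list):
    """Return the 1-based number of the first entry whose prev_hash does not
    match the recomputed hash of its predecessor, or None if the chain holds."""
    expected = GENESIS_HASH
    for number, entry in enumerate(log, start=1):
        if entry["prev_hash"] != expected:
            return number
        expected = entry_hash(entry)
    return None
```

Modifying entry N changes its hash, so the break is reported at entry N+1, as in the "Detect hash chain integrity violation" scenario. A change to the newest entry is not caught by the chain alone, which is one reason external anchoring and checkpoints matter.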
- -**Trace:** Compliance - Tamper-Evident Audit Logging - -```gherkin -Feature: Tamper-Evident Audit Logging - - Scenario: Detect hash chain integrity violation - Given the audit log contains 1,000 hash-chained entries - And entry #500 has been modified such that its hash no longer matches entry #501's previous-hash field - When the System runs periodic hash chain verification - Then the verification MUST report a chain break at entry #501 - And the System MUST raise a CRITICAL alert indicating audit log integrity violation - - Scenario: Audit log entry structure - Given user "admin@example.com" updates IMA policy "production-v2" - When the audit log entry is created - Then the entry MUST include timestamp in UTC with millisecond precision - And the entry MUST include actor "admin@example.com" - And the entry MUST include action "UPDATE_RUNTIME_POLICY" - And the entry MUST include the SHA-256 hash of the previous entry -``` - -### FR-062: Incident Response Ticketing Integration - -**Description:** The System SHOULD integrate with enterprise incident response and ticketing systems: ServiceNow (incident + change request), Jira (issue auto-creation), PagerDuty (alert routing), and OpsGenie (on-call escalation). Integration MUST support bidirectional status sync. Change management (ITSM) SHOULD gate policy changes on an approved change request. Emergency bypass MUST include a break-glass audit trail. Automated remediation SHOULD support: network quarantine via NAC integration, agent isolation, re-provisioning workflow triggers, configurable playbooks per failure type, and manual approval gate for destructive actions. 
- -**Trace:** Incident Response - Integration - -```gherkin -Feature: Incident Response Ticketing Integration - - Scenario: Auto-create Jira issue on attestation failure - Given the Jira integration is configured with project "SECOPS" - And agent "agent-042" fails attestation with a CRITICAL severity - When the incident response workflow triggers - Then a Jira issue SHOULD be created in project "SECOPS" - And the issue SHOULD include agent UUID, failure type, timestamp, and link to dashboard - - Scenario: Bidirectional status sync with ServiceNow - Given a ServiceNow incident INC0012345 was created for agent "agent-042" - When the ServiceNow incident is resolved externally - Then the dashboard alert status SHOULD sync to "Resolved" - And the resolution source MUST be recorded as "ServiceNow INC0012345" - - Scenario: Ticketing integration not configured - Given no incident response integration is configured - When an attestation failure triggers the incident response workflow - Then the System MUST create a local incident record only - And the integration panel MUST display "No ticketing integrations configured" -``` - -### FR-063: SIEM Integration - -**Description:** The System MUST integrate with Security Information and Event Management (SIEM) systems. Supported formats and protocols MUST include: Syslog (CEF/LEEF format), Splunk HEC (HTTP Event Collector), Elastic Common Schema (ECS), Prometheus metrics endpoint, and OpenTelemetry traces. SIEM integration MUST be available from Day 1 of deployment. 
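
As a rough illustration of the Syslog CEF output named in FR-063, the sketch below builds a single CEF record in Python. The pipe-delimited header and the escaping rules follow the published CEF layout; the vendor, product, and version strings, the event class ID, and the choice of extension keys are hypothetical and not taken from the specification. Syslog framing and transport are out of scope.

```python
def _escape_header(value: str) -> str:
    # CEF header fields escape backslash and pipe.
    return value.replace("\\", "\\\\").replace("|", "\\|")


def _escape_extension(value: str) -> str:
    # CEF extension values escape backslash, equals sign, and newline.
    return value.replace("\\", "\\\\").replace("=", "\\=").replace("\n", "\\n")


def format_cef(event_class_id: str, name: str, severity: int, extensions: dict,
               vendor: str = "Keylime", product: str = "Webtool",
               version: str = "0.1") -> str:
    """Build 'CEF:0|vendor|product|version|class id|name|severity|extension'.

    The default vendor/product/version values are placeholders for this sketch.
    """
    header_fields = (vendor, product, version, event_class_id, name, str(severity))
    header = "|".join(["CEF:0"] + [_escape_header(f) for f in header_fields])
    extension = " ".join(
        "%s=%s" % (key, _escape_extension(str(value)))
        for key, value in extensions.items()
    )
    return header + "|" + extension


message = format_cef(
    "ATTESTATION_FAILURE",          # hypothetical event class ID
    "Attestation failure",
    9,
    {"cs1Label": "agentUuid", "cs1": "agent-042", "reason": "INVALID_QUOTE"},
)
```

The severity, agent UUID, and failure type required by the scenario map onto the header severity field and the extension pairs; a timestamp would typically be added as a further extension key.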
- -**Trace:** Incident Response - Integration - -```gherkin -Feature: SIEM Integration - - Scenario: Export events via Syslog CEF format - Given Syslog integration is configured with endpoint "siem.example.com:514" - When an attestation failure event occurs for agent "agent-042" - Then the System MUST emit a Syslog message in CEF format - And the message MUST include severity, agent UUID, failure type, and timestamp - - Scenario: Expose Prometheus metrics endpoint - Given the Prometheus integration is enabled - When Prometheus scrapes the /metrics endpoint - Then the endpoint MUST expose attestation success rate, alert counts, agent fleet size, and API response times - And each metric MUST include appropriate labels (agent_id, severity, mode) - - Scenario: SIEM endpoint unreachable - Given Syslog integration is configured with endpoint "siem.example.com:514" - When the endpoint becomes unreachable - Then the System MUST log the delivery failure and retry according to the configured policy - And a WARNING alert MUST be raised indicating "SIEM endpoint unreachable" -``` - -### FR-064: Verifier Cluster Performance Monitoring - -**Description:** The System MUST monitor verifier cluster performance including: CPU utilization per worker process, memory usage (RSS, heap), open file descriptors, thread pool utilization, network connections (active/idle), attestations per second, queue depth (pending verifications), worker split (web vs. verification), `dedicated_web_workers` utilization, exponential backoff events, and rate limit rejections. 
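
The indicator logic implied by FR-064 and its scenarios (yellow "HIGH LOAD" above 70% CPU, red "DOWN" when a node is unreachable) could look roughly like this in Python. The function name, the use of `None` to signal an unreachable node, and the green "OK" state are assumptions; only the 70% threshold, the two labels, and the alert severities come from the scenarios.

```python
from typing import Optional, Tuple

HIGH_LOAD_CPU_PERCENT = 70.0  # threshold used in the "Alert on verifier overload" scenario


def verifier_indicator(cpu_percent: Optional[float]) -> Tuple[str, str, Optional[str]]:
    """Return (color, label, alert_severity) for one verifier node.

    cpu_percent is None when the node could not be reached (assumed convention).
    """
    if cpu_percent is None:
        return ("red", "DOWN", "CRITICAL")
    if cpu_percent > HIGH_LOAD_CPU_PERCENT:
        return ("yellow", "HIGH LOAD", "WARNING")
    # Healthy state: color and label are assumptions, not stated in the spec.
    return ("green", "OK", None)
```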
- -**Trace:** System Performance - Verifier Metrics - -```gherkin -Feature: Verifier Cluster Performance Monitoring - - Scenario: Display verifier node resource metrics - Given the deployment has two verifier nodes and one registrar - When the user navigates to System Performance - Then each verifier node MUST display CPU utilization, memory usage, and connection counts - And the registrar MUST display its own resource metrics independently - - Scenario: Alert on verifier overload - Given Verifier-02 CPU utilization exceeds 70% - When the System evaluates verifier health - Then Verifier-02 MUST display a yellow "HIGH LOAD" indicator - And a WARNING alert MUST be raised indicating the affected node - - Scenario: Verifier node unreachable - Given the deployment has two verifier nodes - And Verifier-02 is unreachable - When the System polls verifier health - Then Verifier-02 MUST display a red "DOWN" indicator - And a CRITICAL alert MUST be raised indicating "Verifier-02 unreachable" -``` - -### FR-065: Database Connection Pool Monitoring - -**Description:** The System MUST monitor the Keylime database connection pool: active/idle connections, pool size and overflow count, connection wait time, pool exhaustion events, and `database_pool_sz_ovfl` settings. Query performance metrics MUST include: slow query detection (>100ms), query count by type (SELECT/UPDATE), table row counts, index hit ratio, and migration status (Alembic revision). 
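
A minimal Python sketch of the pool status line and the slow-query check from FR-065 follows. The function names and the exhaustion rule (all pooled connections active and overflow at its configured maximum) are assumptions derived from the scenarios, not the project's code.

```python
SLOW_QUERY_THRESHOLD_MS = 100  # slow query threshold stated in FR-065


def pool_summary(active: int, idle: int, pool_size: int, overflow: int) -> str:
    """Render the panel text in the form used by the scenarios."""
    return "Active: %d/%d | Idle: %d | Overflow: %d" % (active, pool_size, idle, overflow)


def is_pool_exhausted(active: int, pool_size: int, overflow: int, max_overflow: int) -> bool:
    """Assumed rule: every pooled connection is busy and overflow is maxed out."""
    return active >= pool_size and overflow >= max_overflow


def is_slow_query(duration_ms: float) -> bool:
    """A query is slow when it takes longer than the threshold."""
    return duration_ms > SLOW_QUERY_THRESHOLD_MS
```

For the values in the first scenario, `pool_summary(10, 4, 14, 0)` produces the status line "Active: 10/14 | Idle: 4 | Overflow: 0".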
- -**Trace:** System Performance - Verifier Metrics - -```gherkin -Feature: Database Connection Pool Monitoring - - Scenario: Display connection pool status - Given the database connection pool has 10 active and 4 idle connections out of 14 total - And overflow is at 0 - When the user views the Database Connection Pool panel - Then the panel MUST display "Active: 10/14 | Idle: 4 | Overflow: 0" - And average query time and max query time MUST be displayed - - Scenario: Alert on pool exhaustion - Given all 14 connections are active and overflow reaches the configured maximum - When a new connection request is queued - Then the System MUST raise a CRITICAL alert indicating "database pool exhausted" - And the connection wait time MUST be displayed - - Scenario: Slow query detected - Given a database query takes 250ms (threshold: 100ms) - When the System monitors query performance - Then the slow query MUST be logged with query type and duration - And the slow query count metric MUST increment -``` - -### FR-066: API Response Time Tracking - -**Description:** The System MUST track API response times at p50, p95, and p99 percentiles for all Keylime API endpoints. Tracked endpoints MUST include at minimum: `GET /agents/`, `GET /agents/:id`, `POST /attestations`, `PATCH /attestations/:idx`, `GET /policies/ima`, and `POST /sessions`. Response times MUST be color-coded: green (>>` so all handlers use the updated URLs immediately. After saving, the frontend MUST invalidate all integration queries so connectivity status reflects the new configuration, and MUST trigger a full page reload to refresh all cached data. 
- -**Trace:** Settings - Keylime Connection - -```gherkin -Feature: Runtime Keylime Connection Configuration - - Scenario: Configure Verifier and Registrar URLs - Given the user is on the Settings page - When the user updates the Verifier URL to "https://verifier.prod:8881" - And the user updates the Registrar URL to "https://registrar.prod:8891" - And the user clicks "Apply Changes" - Then the backend MUST update its Keylime client with the new URLs - And the Integration Status page MUST reflect the new endpoint addresses - And the page MUST reload to refresh all cached data - - Scenario: Configure Backend URL - Given the user is on the Settings page - When the user updates the Backend URL to "http://backend.prod:8080" - And the user clicks "Apply Changes" - Then all subsequent API requests MUST be directed to the new backend URL - - Scenario: HTTP scheme warning for Keylime URLs - Given the user enters a Verifier URL using "http://" instead of "https://" - When the URL field loses focus - Then the System MUST display a warning indicating unencrypted communication - - Scenario: Apply Changes button state - Given no connection URL fields have been modified - Then the "Apply Changes" button MUST be disabled - When the user modifies any connection URL - Then the "Apply Changes" button MUST be enabled -``` - -### FR-073: mTLS Certificate Configuration UI - -**Description:** The System MUST provide a certificate settings section on the Settings page for configuring mTLS client certificates used to communicate with Keylime APIs. Two input modes MUST be supported: "By directory" (a single directory path, with standard Keylime filenames `client-cert.crt`, `client-private.pem`, `cacert.crt` assumed) and "Manually" (individual paths for certificate, private key, and CA certificate files). The mode MUST be auto-detected from saved settings. The Apply button MUST be disabled when no changes exist and MUST display backend error messages on failure. 
Certificate configuration MUST be persisted via the backend settings API (`GET/PUT /api/settings/certificates`) and applied at runtime by reconstructing the mTLS reqwest client. - -**Trace:** Settings - Certificate Configuration - -```gherkin -Feature: mTLS Certificate Configuration - - Scenario: Configure certificates by directory - Given the user is on the Settings page certificate section - When the user selects "By directory" mode - And enters the directory path "/etc/keylime/certs" - And clicks "Apply" - Then the backend MUST construct mTLS client using: - - Certificate: /etc/keylime/certs/client-cert.crt - - Private key: /etc/keylime/certs/client-private.pem - - CA certificate: /etc/keylime/certs/cacert.crt - And the Keylime API client MUST be reconstructed with the new certificates - - Scenario: Configure certificates manually - Given the user is on the Settings page certificate section - When the user selects "Manually" mode - And enters individual paths for certificate, key, and CA files - And clicks "Apply" - Then the backend MUST construct mTLS client using the specified file paths - - Scenario: Auto-detect input mode from saved settings - Given the backend has saved certificate paths from a "By directory" configuration - When the user navigates to the Settings page - Then the certificate section MUST auto-detect and display "By directory" mode - - Scenario: Display backend error on invalid certificate - Given the user enters a path to a certificate file that does not exist - When the user clicks "Apply" - Then the System MUST display the error message returned by the backend - And the previous certificate configuration MUST remain active -``` - -### FR-074: Mock/Production Environment Toggle - -**Description:** The System MUST provide a segmented control on the Settings page that switches all three connection URLs (Backend, Verifier, Registrar) between mock and production default values. 
The toggle MUST auto-detect the active mode from saved URLs on page load. Selecting a mode MUST populate all URL fields with the corresponding defaults. The user MUST still click "Apply Changes" to persist the switch. - -**Trace:** Settings - Environment Switching - -```gherkin -Feature: Mock/Production Environment Toggle - - Scenario: Switch to mock environment - Given the user is on the Settings page - And the current environment is detected as "Production" - When the user selects "Mock" in the environment toggle - Then the Backend URL field MUST be populated with the mock default - And the Verifier URL field MUST be populated with the mock default - And the Registrar URL field MUST be populated with the mock default - And the user MUST click "Apply Changes" to persist the change - - Scenario: Switch to production environment - Given the user is on the Settings page - And the current environment is detected as "Mock" - When the user selects "Production" in the environment toggle - Then all URL fields MUST be populated with production defaults - - Scenario: Auto-detect environment on page load - Given the saved Backend, Verifier, and Registrar URLs match mock defaults - When the user navigates to the Settings page - Then the environment toggle MUST show "Mock" as the active mode -``` - -### FR-075: TOML Config File Persistence for Backend Settings - -**Description:** The System MUST persist backend configuration changes (Verifier/Registrar URLs, mTLS certificate paths) to a TOML file on disk so they survive backend restarts. At startup, the configuration MUST be loaded with priority: persisted TOML file > environment variables > compiled defaults. File writes MUST be atomic (write to temporary file, then rename) and MUST run on a blocking thread to avoid stalling the async runtime. Write failures MUST log a warning but MUST NOT fail the API request. 
The config path MUST be resolved as: `KEYLIME_WEBTOOL_CONFIG` environment variable > `~/.config/keylime-webtool/settings.toml` > no persistence (in-memory only). - -**Trace:** Settings - Configuration Persistence - -```gherkin -Feature: TOML Configuration Persistence - - Scenario: Settings survive backend restart - Given the user has configured Verifier URL to "https://verifier.prod:8881" - And the setting is persisted to the TOML file - When the backend is restarted - Then the Verifier URL MUST be loaded from the TOML file - And the System MUST use "https://verifier.prod:8881" without manual reconfiguration - - Scenario: Configuration priority order - Given the TOML file sets Verifier URL to "https://file.example:8881" - And the environment variable KEYLIME_VERIFIER_URL is set to "https://env.example:8881" - When the backend starts - Then the Verifier URL MUST be "https://file.example:8881" (TOML file takes precedence) - - Scenario: Atomic file write - Given the user saves a new configuration - When the backend writes the TOML file - Then the write MUST use a temporary file and rename strategy - And a crash during write MUST NOT corrupt the existing configuration file - - Scenario: Write failure does not fail API request - Given the TOML file path is not writable (e.g., read-only filesystem) - When the user saves configuration via the settings API - Then the API request MUST return success (settings applied in-memory) - And a warning MUST be logged indicating persistence failure -``` - -### FR-076: Sidebar Visibility Toggle (Hamburger Button) - -**Description:** The System MUST provide a hamburger button (three horizontal lines) in the top bar, positioned to the left of the search bar, that toggles the sidebar's visibility. When toggled off, the sidebar MUST slide out of view with a smooth CSS transition. When toggled on, the sidebar MUST slide back into view. The main content area MUST adjust its width accordingly. 
- -**Trace:** Dashboard - Navigation Structure - -```gherkin -Feature: Sidebar Visibility Toggle - - Scenario: Hide sidebar via hamburger button - Given the sidebar is visible - When the user clicks the hamburger button in the top bar - Then the sidebar MUST slide out of view with a smooth CSS transition - And the main content area MUST expand to use the full width - - Scenario: Show sidebar via hamburger button - Given the sidebar is hidden - When the user clicks the hamburger button in the top bar - Then the sidebar MUST slide into view with a smooth CSS transition - And the main content area MUST shrink to its normal width -``` - -### FR-077: Webtool Backend Health Check with Polling - -**Description:** The System MUST provide a "Webtool Backend" section on the Integrations page that pings the backend settings endpoint to detect backend availability. Both the backend health check and the Core Services (Verifier/Registrar) health queries MUST use a 1-second `refetchInterval` for near-instant outage detection. When the backend is unreachable, the Integrations sections (Core Services, Attestation Backends, Notification Channels) MUST be disabled and auto-recover when the backend becomes available again. 
- -**Trace:** Integration Status - Backend Connectivity - -```gherkin -Feature: Webtool Backend Health Check - - Scenario: Backend health check polling - Given the webtool backend is reachable - When the user is viewing the Integrations page - Then the frontend MUST poll the backend settings endpoint every 1 second - And the "Webtool Backend" section MUST show "UP" with a green indicator - - Scenario: Detect backend outage within 1 second - Given the user is viewing the Integrations page - And the backend becomes unreachable - Then within 1 second the "Webtool Backend" section MUST show "DOWN" with a red indicator - And the Core Services, Attestation Backends, and Notification Channels sections MUST be disabled - - Scenario: Auto-recover after backend outage - Given the webtool backend was unreachable and sections were disabled - When the backend becomes reachable again - Then the "Webtool Backend" section MUST show "UP" within 1 second - And the disabled sections MUST be re-enabled automatically -``` - -### FR-078: Timezone Selection with Auto-Detect - -**Description:** The System SHOULD provide a timezone setting in the Visualization Settings section. The user MUST be able to either auto-detect the browser timezone or manually select a timezone from a dropdown list of IANA timezone identifiers (e.g. `Europe/Madrid`, `America/New_York`, `UTC`). When auto-detect is active, the dropdown MUST be disabled. The selected timezone MUST be persisted in localStorage via the visualization store and applied to all timestamp rendering across the dashboard. 
- -**Trace:** Settings - Visualization - -```gherkin -Feature: Timezone Selection - - Scenario: Auto-detect browser timezone - Given the user navigates to Settings > Visualization - When the user clicks the "Auto-Detect" button - Then the timezone MUST be set to the browser's local timezone - And the timezone dropdown MUST be disabled - And the preference MUST persist across sessions via localStorage - - Scenario: Manual timezone selection - Given the user navigates to Settings > Visualization - And auto-detect is not active - When the user selects "UTC" from the timezone dropdown - Then all timestamps across the dashboard MUST render in UTC - And the preference MUST persist across sessions via localStorage - - Scenario: Switching from auto-detect to manual - Given auto-detect is active and the timezone dropdown is disabled - When the user clicks the "Auto-Detect" button again to deactivate it - Then the timezone dropdown MUST become enabled - And the user MUST be able to select a different timezone -``` - -### FR-079: Date Format Selection for Timestamp Rendering - -**Description:** The System MUST provide a date format setting in the Visualization Settings section, alongside the timezone setting (FR-078). The user MUST be able to select a date display format from the following options: `YYYY/MM/DD`, `DD/MM/YYYY`, `MM/DD/YYYY`, `YYYY-MM-DD`, `DD-MM-YYYY`, `MM-DD-YYYY`. The default SHOULD be `YYYY-MM-DD` (ISO 8601). The selected format MUST be persisted in localStorage via the visualization store. The selected format MUST be applied consistently to all timestamp and date rendering across the dashboard, including the agent list, agent detail timeline, audit log, alerts, and dashboard charts. 
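
The six date formats listed in FR-079 can be translated to `strftime` directives with a small token map. The Python sketch below is illustrative only (the dashboard frontend would use its own date library); the function names are hypothetical.

```python
from datetime import datetime, timezone

SUPPORTED_DATE_FORMATS = (
    "YYYY/MM/DD", "DD/MM/YYYY", "MM/DD/YYYY",
    "YYYY-MM-DD", "DD-MM-YYYY", "MM-DD-YYYY",
)
DEFAULT_DATE_FORMAT = "YYYY-MM-DD"  # ISO 8601 default from FR-079

# Display token -> strftime directive.
_TOKENS = {"YYYY": "%Y", "MM": "%m", "DD": "%d"}


def to_strftime(pattern: str) -> str:
    """Convert a supported display pattern into a strftime pattern."""
    if pattern not in SUPPORTED_DATE_FORMATS:
        raise ValueError("unsupported date format: %s" % pattern)
    for token, directive in _TOKENS.items():
        pattern = pattern.replace(token, directive)
    return pattern


def format_date(moment: datetime, pattern: str = DEFAULT_DATE_FORMAT) -> str:
    """Render only the date part of a timestamp in the selected format."""
    return moment.strftime(to_strftime(pattern))


sample = datetime(2026, 3, 31, 14, 30, 0, tzinfo=timezone.utc)
```

Rejecting patterns outside the supported list keeps a corrupted stored preference from silently producing an unexpected rendering; falling back to the default would be an equally reasonable choice.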
- -**Trace:** Settings - Visualization - -```gherkin -Feature: Date Format Selection - - Scenario: Select a date format - Given the user navigates to Settings > Visualization - When the user selects "DD/MM/YYYY" from the date format dropdown - Then all timestamps across the dashboard MUST render dates in DD/MM/YYYY format - And the date format dropdown MUST show "DD/MM/YYYY" as the active selection - - Scenario: Date format persists across sessions - Given the user has selected "MM/DD/YYYY" as the date format - When the user closes and reopens the browser - Then the date format MUST remain "MM/DD/YYYY" - And all timestamps MUST render in MM/DD/YYYY format without requiring reconfiguration - - Scenario: Date format applied to agent list timestamps - Given the user has selected "YYYY-MM-DD" as the date format - When the user navigates to the Agent Fleet list view - Then the "Last Attestation" column timestamps MUST render dates in YYYY-MM-DD format - And the "Registration Date" column timestamps MUST render dates in YYYY-MM-DD format - - Scenario: Default date format on first visit - Given the user has never configured a date format preference - When the user navigates to any page displaying timestamps - Then timestamps SHOULD render dates in YYYY-MM-DD (ISO 8601) format - And the Settings > Visualization date format dropdown SHOULD show "YYYY-MM-DD" as the default -``` - -### FR-080: Time Format Selection for Timestamp Rendering - -**Description:** The System MUST provide a time format setting in the Visualization Settings section, alongside the timezone (FR-078) and date format (FR-079) settings. The user MUST be able to select between a 24-hour format (`HH:mm:ss`, e.g., `14:30:00`) and a 12-hour format (`hh:mm:ss A`, e.g., `02:30:00 PM`). The default MUST be `24h`. The selected format MUST be persisted in localStorage via the visualization store. 
The selected format MUST be applied consistently to all time-of-day rendering across the dashboard, including the agent list, agent detail timeline, audit log, alerts, attestation timestamps, and dashboard charts. - -**Trace:** Settings - Visualization - -```gherkin -Feature: Time Format Selection - - Scenario: Select 12-hour time format - Given the user navigates to Settings > Visualization - When the user selects "12h" from the time format selector - Then all timestamps across the dashboard MUST render times in 12-hour format with AM/PM suffix - And the time format selector MUST show "12h" as the active selection - - Scenario: Select 24-hour time format - Given the user navigates to Settings > Visualization - When the user selects "24h" from the time format selector - Then all timestamps across the dashboard MUST render times in 24-hour format without AM/PM suffix - And the time format selector MUST show "24h" as the active selection - - Scenario: Time format persists across sessions - Given the user has selected "12h" as the time format - When the user closes and reopens the browser - Then the time format MUST remain "12h" - And all timestamps MUST render in 12-hour format without requiring reconfiguration - - Scenario: Time format applied to agent list timestamps - Given the user has selected "12h" as the time format - When the user navigates to the Agent Fleet list view - Then the "Last Attestation" column timestamps MUST render times in 12-hour format - And the "Registration Date" column timestamps MUST render times in 12-hour format - - Scenario: Default time format on first visit - Given the user has never configured a time format preference - When the user navigates to any page displaying timestamps - Then timestamps MUST render times in 24-hour format - And the Settings > Visualization time format selector MUST show "24h" as the default -``` - -### FR-081: Sidebar Alert Indicator for Integration Service Outages - -**Description:** The System MUST display 
a visual alert indicator (exclamation mark badge) on the Integrations navigation item in the sidebar whenever one or more monitored services (Backend, Verifier, Registrar) are in a DOWN state. The indicator MUST use the same health-check queries as the Integrations page (FR-077) and share cached data via TanStack Query to avoid additional network requests. The indicator MUST update in real time with the same 1-second polling interval. When all services are UP, the indicator MUST NOT be displayed. - -**Trace:** Dashboard - Navigation Structure - -```gherkin -Feature: Sidebar Alert Indicator for Integration Outages - - Scenario: Indicator appears when backend is down - Given the webtool backend is unreachable - When the sidebar renders - Then the Integrations navigation item MUST display an exclamation mark badge - And the badge MUST have a tooltip "One or more services are down" - - Scenario: Indicator appears when a core service is down - Given the webtool backend is reachable - And the Keylime Verifier is in DOWN state - When the sidebar renders - Then the Integrations navigation item MUST display an exclamation mark badge - - Scenario: Indicator disappears when all services recover - Given the Integrations navigation item is displaying an exclamation mark badge - When all monitored services return to UP state - Then the exclamation mark badge MUST be removed from the Integrations navigation item - - Scenario: No indicator when all services are healthy - Given the webtool backend is reachable - And the Keylime Verifier is in UP state - And the Keylime Registrar is in UP state - When the sidebar renders - Then the Integrations navigation item MUST NOT display an exclamation mark badge -``` - -### FR-082: Agent ID Hyperlinks in Failure Categorization List - -**Description:** The System MUST render each agent UUID in the Attestation Analytics failure categorization list (FR-025) as a navigable hyperlink to the corresponding agent detail page (`/agents/{agent_id}`). 
The link text MUST display the agent UUID. Clicking the link MUST navigate the user to the agent detail view (FR-012) for the affected agent, enabling one-click drill-down from a failure entry to its agent context. - -**Trace:** Attestation Analytics - Failure Categorization - -```gherkin -Feature: Agent ID Hyperlinks in Failure Categorization - - Scenario: Agent UUID renders as a hyperlink - Given the failure categorization list contains a failure for agent "abc-123" - When the list renders - Then the agent UUID "abc-123" MUST be displayed as a hyperlink - And the hyperlink target MUST be "/agents/abc-123" - - Scenario: Clicking agent link navigates to agent detail - Given the failure categorization list displays a hyperlink for agent "abc-123" - When the user clicks the agent UUID hyperlink - Then the System MUST navigate to the agent detail page for "abc-123" - And the agent detail view (FR-012) MUST be displayed - - Scenario: Multiple failures for different agents each link independently - Given the failure categorization list contains failures for agents "abc-123" and "def-456" - When the list renders - Then agent "abc-123" MUST link to "/agents/abc-123" - And agent "def-456" MUST link to "/agents/def-456" -``` - -### FR-083: Raw Data Copy-to-Clipboard Button - -**Description:** The System MUST display a compact "Copy" icon button positioned immediately to the right of the Raw Data source selector button group ("All" / "Backend" / "Registrar" / "Verifier"). When clicked, the button MUST copy the currently displayed JSON data to the user's clipboard using the Clipboard API. The button MUST work regardless of which source filter is active. After a successful copy, the button MUST provide brief visual feedback (e.g., a checkmark icon or tooltip confirmation) before reverting to its default state. 
- -**Trace:** Agent Detail - Raw Data - -```gherkin -Feature: Raw Data Copy-to-Clipboard Button - - Scenario: Copy button is displayed next to source selector - Given the user is viewing the "Raw Data" tab for agent "a1b2c3d4" - Then a "Copy" icon button MUST be displayed to the right of the source selector group - And the button MUST be visually compact and aligned with the selector buttons - - Scenario: Copy combined raw data to clipboard - Given the user is viewing the "Raw Data" tab with "All" sources selected - When the user clicks the "Copy" button - Then the combined JSON from all three sources MUST be copied to the clipboard - And the button MUST show brief visual feedback confirming the copy - - Scenario: Copy filtered source data to clipboard - Given the user is viewing the "Raw Data" tab with "Backend Data" selected - When the user clicks the "Copy" button - Then only the Backend Data JSON MUST be copied to the clipboard - And the button MUST show brief visual feedback confirming the copy - - Scenario: Copy feedback resets after delay - Given the user has clicked the "Copy" button and visual feedback is shown - When 2 seconds have elapsed - Then the button MUST return to its default icon state -``` - -### FR-084: Fleet Overview KPI Card Drill-Down Navigation - -**Description:** Each KPI card on the Fleet Overview Dashboard SHOULD act as a clickable navigation element that takes the user to the relevant detailed view. The cards MUST display a visual affordance (e.g., cursor change, hover highlight) indicating they are interactive. The navigation targets are: Total Active Agents → Agent List; Failed Agents → Agent List filtered by failed states (7, 9, 10); Attestation Success Rate → Attestation Analytics; Average Attestation Latency → Attestation Analytics; Certificate Expiry Warnings → Certificates; Active IMA Policies → Policies; Revocation Events (24h) → Alerts; Consecutive Failures → Agent List sorted by failure count; Registration Count → Agent List. 
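
The navigation targets listed in FR-084 amount to a lookup table from KPI card to route. The Python sketch below shows one way to express it. The route paths come from the scenarios for this requirement; the query parameter names `state` and `sort`, the value `failure_count`, and the comma-separated encoding are hypothetical.

```python
from urllib.parse import urlencode

FAILED_STATES = (7, 9, 10)  # FAILED, INVALID_QUOTE, TENANT_FAILED

# KPI card title -> (route, query parameters). Parameter names are assumptions.
KPI_TARGETS = {
    "Total Active Agents": ("/agents", {}),
    "Failed Agents": ("/agents", {"state": ",".join(str(s) for s in FAILED_STATES)}),
    "Attestation Success Rate": ("/attestations", {}),
    "Average Attestation Latency": ("/attestations", {}),
    "Certificate Expiry Warnings": ("/certificates", {}),
    "Active IMA Policies": ("/policies", {}),
    "Revocation Events (24h)": ("/alerts", {}),
    "Consecutive Failures": ("/agents", {"sort": "failure_count"}),
    "Registration Count": ("/agents", {}),
}


def kpi_href(card: str) -> str:
    """Build the drill-down link for a KPI card."""
    path, params = KPI_TARGETS[card]
    if not params:
        return path
    # Keep commas readable in the query string.
    return path + "?" + urlencode(params, safe=",")
```

Keeping the targets in one table makes it easy to check that every card listed in the description has a destination.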
- -**Trace:** Dashboard - Key Performance Indicators; FR-001 - -```gherkin -Feature: Fleet Overview KPI Card Drill-Down Navigation - - Scenario: Click Total Active Agents card - Given the user is viewing the Fleet Overview Dashboard - And the "Total Active Agents" KPI card displays "247" - When the user clicks the "Total Active Agents" card - Then the System MUST navigate to the Agent List page at /agents - - Scenario: Click Failed Agents card navigates with filter - Given the user is viewing the Fleet Overview Dashboard - And the "Failed Agents" KPI card displays "3" - When the user clicks the "Failed Agents" card - Then the System MUST navigate to the Agent List page - And the agent state filter MUST be pre-applied for states FAILED, INVALID_QUOTE, and TENANT_FAILED - - Scenario: Click Attestation Success Rate card - Given the user is viewing the Fleet Overview Dashboard - When the user clicks the "Attestation Success Rate" card - Then the System MUST navigate to the Attestation Analytics page at /attestations - - Scenario: Click Certificate Expiry Warnings card - Given the user is viewing the Fleet Overview Dashboard - When the user clicks the "Certificate Expiry Warnings" card - Then the System MUST navigate to the Certificates page at /certificates - - Scenario: Click Active IMA Policies card - Given the user is viewing the Fleet Overview Dashboard - When the user clicks the "Active IMA Policies" card - Then the System MUST navigate to the Policies page at /policies - - Scenario: Click Revocation Events card - Given the user is viewing the Fleet Overview Dashboard - When the user clicks the "Revocation Events (24h)" card - Then the System MUST navigate to the Alerts page at /alerts - - Scenario: KPI cards show interactive affordance - Given the user is viewing the Fleet Overview Dashboard - When the user hovers over any KPI card - Then the card MUST display a pointer cursor - And the card SHOULD display a hover highlight effect -``` - -### FR-085: Alert Center 
Distribution Pie Charts - -**Description:** The Alert Center page MUST display three donut pie charts below the alert table, showing the distribution of alerts by severity, by type, and by state. Each chart MUST use color-coded segments matching the alert taxonomy. Clicking a chart segment MUST navigate the user to the Alert Center with the corresponding filter pre-applied (e.g., clicking the "critical" segment applies `?severity=critical`). The charts MUST update reactively when the alert data changes. - -**Trace:** Revocation - Alert Workflow; FR-047 - -```gherkin -Feature: Alert Center Distribution Pie Charts - - Scenario: Three pie charts render below the alert table - Given the user is viewing the Alert Center page - And there are alerts of mixed severity, type, and state - Then the System MUST display three pie charts labeled "By Severity", "By Type", and "By State" - And each chart MUST render one segment per distinct value in its dimension - - Scenario: Severity pie chart shows correct distribution - Given there are 2 critical, 2 warning, and 2 info alerts - When the user views the "By Severity" pie chart - Then the chart MUST show three segments with counts 2, 2, and 2 - And the critical segment MUST be colored red (#ea4335) - - Scenario: Click pie chart segment applies filter - Given the user is viewing the Alert Center page - When the user clicks the "warning" segment in the "By Severity" chart - Then the URL MUST update to include "severity=warning" - And the alert table MUST filter to show only warning alerts - - Scenario: Empty state when no alerts exist - Given there are no alerts in the system - When the user views the Alert Center page - Then each pie chart MUST display a "No data" placeholder -``` - ---- - -## 4. Non-Functional Requirements Detail - -### NFR-001: KPI Data Refresh Latency - -**Description:** The System MUST refresh KPI data within 30 seconds of a state change occurring in the Keylime backend. 
The default refresh interval MUST be configurable. - -**Trace:** Dashboard - Key Performance Indicators - -```gherkin -Feature: KPI Data Refresh Latency - - Scenario: KPI data refreshes within threshold - Given auto-refresh is enabled with the default 30-second interval - When an agent state change occurs in the Verifier - Then the dashboard KPI data MUST reflect the change within 30 seconds - - Scenario: KPI refresh exceeds threshold - Given auto-refresh is enabled - When KPI data has not been updated for more than 30 seconds after a known state change - Then the System MUST display a staleness warning on affected KPIs -``` - -### NFR-002: Dual API Version Support - -**Description:** The System MUST support Keylime API v2 (pull mode) and v3 (push mode) simultaneously. The backend MUST detect the API version per agent and adapt its data collection strategy accordingly. - -**Trace:** Keylime - Data Model Overview - -```gherkin -Feature: Dual API Version Support - - Scenario: Simultaneous v2 and v3 agent handling - Given the fleet contains agents using API v2 and agents using API v3 - When the System ingests data from the Verifier - Then v2 agents MUST be polled using GET endpoints - And v3 agents MUST receive push-mode events - And both agent types MUST appear in the unified fleet view - - Scenario: Unsupported API version rejected - Given an agent reports an API version not supported by the System - When the System attempts to ingest data from that agent - Then the System MUST log a WARNING indicating "unsupported API version" - And the agent MUST be displayed with a "version incompatible" status -``` - -### NFR-003: Non-Invasive Architecture - -**Description:** The System MUST operate as a standalone component that consumes Keylime's public REST APIs. The System MUST NOT require modifications to Keylime Verifier, Registrar, or Agent components. The System MUST NOT access Keylime databases directly. 
- -**Trace:** Technical Architecture - System Design - -```gherkin -Feature: Non-Invasive Architecture - - Scenario: System operates without Keylime modifications - Given the Keylime Verifier and Registrar are running with default configuration - When the dashboard backend starts and connects via mTLS - Then the System MUST retrieve all data exclusively through public REST API endpoints - And no direct database connections MUST be established to Keylime databases - - Scenario: System degrades when API is unavailable - Given the Keylime Verifier API is unreachable - When the System attempts to fetch data - Then the System MUST display cached data with staleness indicators - And the System MUST NOT fall back to direct database access -``` - -### NFR-004: Single Page Application Frontend - -**Description:** The frontend MUST render as a Single Page Application with client-side routing. Initial page load MUST complete within 3 seconds on a 10 Mbps connection. The UI framework MUST support component-based architecture with type safety. - -**Trace:** Technical Architecture - Data Flow - -```gherkin -Feature: SPA Frontend Performance - - Scenario: Initial page load within performance budget - Given the user accesses the dashboard for the first time - When the browser loads the application on a 10 Mbps connection - Then the initial page render MUST complete within 3 seconds - And subsequent navigation between views MUST NOT trigger full page reloads - - Scenario: Client-side routing - Given the user is viewing the Fleet Overview Dashboard - When the user clicks "Agents" in the sidebar - Then the browser URL MUST update without a full page reload - And the Agent Fleet view MUST render via client-side routing -``` - -### NFR-005: Async Backend Performance - -**Description:** The backend MUST support 10,000 concurrent WebSocket connections with less than 100ms p99 latency on a single node. The backend MUST use asynchronous I/O for all network operations. 
- -**Trace:** Technical Architecture - Why Rust - -```gherkin -Feature: Backend Performance - - Scenario: Handle 10,000 concurrent WebSocket connections - Given the backend is running on a single node - When 10,000 clients establish WebSocket connections simultaneously - Then all connections MUST be maintained without connection drops - And p99 message delivery latency MUST remain below 100ms - - Scenario: Backend handles API requests under load - Given 500 concurrent HTTP requests are sent to the REST API - When the backend processes the requests - Then p99 response time MUST remain below 200ms - And no requests MUST time out -``` - -### NFR-006: Event-Driven Ingestion - -**Description:** The System MUST use event-driven ingestion as the primary data path. The System MUST consume ZeroMQ revocation events from the Verifier. The System SHOULD propose upstream state-change webhook/AMQP emit. A message broker (RabbitMQ/Kafka) SHOULD decouple load. The System MUST use event sequence numbers to detect gaps. A periodic reconciliation sweep MUST run every 5 minutes. 
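The sequence-number gap check described for NFR-006 can be sketched in a few lines of Rust. This is only an illustration, not the project's implementation: `SequenceTracker` and `SequenceCheck` are hypothetical names, and the sketch assumes the Verifier numbers events consecutively (each event is exactly one past the previous one), which the requirement does not state explicitly.

```rust
/// Outcome of comparing an incoming event's sequence number with the last one seen.
#[derive(Debug, PartialEq, Eq)]
pub enum SequenceCheck {
    /// First event observed, or exactly one past the previous event.
    InOrder,
    /// One or more events were skipped; a reconciliation should cover them.
    Gap { first_missing: u64, missing: u64 },
    /// Duplicate or older event; the tracker state is left unchanged.
    Stale,
}

/// Hypothetical tracker (assumption: sequence numbers are consecutive).
#[derive(Debug, Default)]
pub struct SequenceTracker {
    last_seen: Option<u64>,
}

impl SequenceTracker {
    pub fn new() -> Self {
        Self::default()
    }

    /// Records `seq` and reports whether any events were missed before it.
    pub fn observe(&mut self, seq: u64) -> SequenceCheck {
        let check = match self.last_seen {
            None => SequenceCheck::InOrder,
            // Checked first, so `last + 1` below can never overflow.
            Some(last) if seq <= last => return SequenceCheck::Stale,
            Some(last) if seq == last + 1 => SequenceCheck::InOrder,
            Some(last) => SequenceCheck::Gap {
                first_missing: last + 1,
                missing: seq - last - 1,
            },
        };
        self.last_seen = Some(seq);
        check
    }
}
```

Observing 100 and then 103 yields `Gap { first_missing: 101, missing: 2 }`, matching the "gap of 2 missing events" scenario; a caller would use that result to trigger the immediate reconciliation.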
- -**Trace:** Technical Architecture - Event-Driven Ingestion; Scalability - Ingestion Model Comparison - -```gherkin -Feature: Event-Driven Ingestion - - Scenario: Consume ZeroMQ revocation event - Given the System is subscribed to the Verifier's ZeroMQ revocation channel - When the Verifier publishes a revocation event for agent "agent-042" - Then the System MUST process the event and update the agent's status - And the event sequence number MUST be recorded - - Scenario: Detect event sequence gap - Given the System has processed events up to sequence number 100 - When the next received event has sequence number 103 - Then the System MUST detect a gap of 2 missing events - And the System MUST trigger an immediate reconciliation for the missing events -``` - -### NFR-007: Polling Fallback with Backpressure - -**Description:** For environments without event support, the System MUST implement adaptive polling with backpressure: list poll at 30s intervals using ETag/If-Modified-Since, detail polling on-demand only (user clicks), double the interval if p95 latency exceeds 500ms, and pause detail polling if p95 latency exceeds 2s. The System MUST alert the operator when degraded to polling mode. 
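The backpressure rules in NFR-007 amount to a small decision function. The sketch below is one possible reading, under stated assumptions: `PollingPlan`, `next_plan`, and the `max_interval` cap are hypothetical (the requirement does not define an upper bound on the doubled interval), and how the interval recovers once latency drops is left out because the requirement does not specify it.

```rust
use std::time::Duration;

/// Thresholds taken from the NFR-007 description.
pub const BASE_LIST_INTERVAL: Duration = Duration::from_secs(30);
pub const BACKOFF_THRESHOLD: Duration = Duration::from_millis(500);
pub const PAUSE_THRESHOLD: Duration = Duration::from_secs(2);

/// Hypothetical result type: what the next polling cycle should do.
#[derive(Debug, PartialEq, Eq)]
pub struct PollingPlan {
    pub list_interval: Duration,
    pub detail_polling_enabled: bool,
}

/// Computes the next plan from the current list interval and measured p95 latency.
/// `max_interval` is an assumed safety cap, not part of the requirement.
pub fn next_plan(
    current_interval: Duration,
    p95_latency: Duration,
    max_interval: Duration,
) -> PollingPlan {
    // Double the list interval while p95 latency exceeds 500 ms.
    let list_interval = if p95_latency > BACKOFF_THRESHOLD {
        (current_interval * 2).min(max_interval)
    } else {
        current_interval
    };
    PollingPlan {
        list_interval,
        // Pause detail polling entirely once p95 latency exceeds 2 s.
        detail_polling_enabled: p95_latency <= PAUSE_THRESHOLD,
    }
}
```

With a 30 s interval and a p95 of 600 ms the plan doubles the interval to 60 s and keeps detail polling; with a p95 of 3 s detail polling is switched off while list polling continues.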
- -**Trace:** Scalability - Ingestion Model Comparison; Technical Architecture - Event-Driven Ingestion - -```gherkin -Feature: Polling Fallback with Backpressure - - Scenario: Adaptive polling interval increase under load - Given the System is operating in polling mode - And the p95 API latency exceeds 500ms - When the next polling cycle is scheduled - Then the polling interval MUST be doubled from its current value - And the System MUST alert the operator that backpressure is active - - Scenario: Pause detail polling under extreme latency - Given the p95 API latency exceeds 2 seconds - When the System evaluates polling strategy - Then detail polling MUST be paused entirely - And only list-level polling MUST continue - And a WARNING alert MUST indicate "detail polling paused due to high latency" -``` - -### NFR-008: Event-Driven Scalability - -**Description:** The System SHOULD scale to 100,000+ agents under event-driven ingestion mode. Performance MUST NOT degrade linearly with agent count when using event-driven ingestion. - -**Trace:** Scalability - Ingestion Model Comparison - -```gherkin -Feature: Event-Driven Scalability - - Scenario: Dashboard responsive at 100K agents - Given the fleet contains 100,000 agents in event-driven mode - When the user navigates to the Fleet Overview Dashboard - Then the KPI data SHOULD render within 5 seconds - And the agent list SHOULD paginate without blocking the UI -``` - -### NFR-009: Polling Fallback Scalability - -**Description:** The System MUST support approximately 1,000 agents under polling fallback mode without exceeding acceptable API load on the Keylime Verifier. 
- -**Trace:** Scalability - Ingestion Model Comparison - -```gherkin -Feature: Polling Fallback Scalability - - Scenario: Polling mode at 1,000 agents - Given the fleet contains 1,000 agents in polling mode - When the System executes a polling cycle - Then all agent states MUST be refreshed within 60 seconds - And the API load on the Verifier MUST NOT exceed 50 requests per second -``` - -### NFR-010: Active/Passive High Availability - -**Description:** The System MUST support Active/Passive HA deployment with less than 30 seconds Recovery Time Objective (RTO) and zero Recovery Point Objective (RPO) for committed transactions. - -**Trace:** High Availability - Architecture - -```gherkin -Feature: Active/Passive HA - - Scenario: Failover to standby node - Given the System is deployed in Active/Passive HA mode - When the active node becomes unavailable - Then the standby node MUST assume the active role within 30 seconds - And no committed data MUST be lost (0 RPO) - And WebSocket clients MUST reconnect to the new active node -``` - -### NFR-011: Active/Active High Availability - -**Description:** The System SHOULD support Active/Active HA deployment for environments with 5,000+ agents, distributing load across multiple backend nodes. - -**Trace:** High Availability - Architecture - -```gherkin -Feature: Active/Active HA - - Scenario: Load distribution across nodes - Given the System is deployed in Active/Active mode with 2 backend nodes - When 5,000 agents are being monitored - Then each node SHOULD handle approximately half of the agent data ingestion - And failure of one node SHOULD result in the surviving node handling the full load -``` - -### NFR-012: Air-Gapped Deployment - -**Description:** The System MUST be fully self-contained with no runtime internet access required. All frontend assets MUST be bundled (no CDN). The Rust backend MUST compile to a single binary with no runtime dependencies. Fonts and icons MUST be embedded. 
Pre-built container images MUST include all layers. Offline EK certificate validation MUST be supported via pre-loaded TPM vendor CA certificates. Update packages MUST be signed (GPG) with integrity verification and SBOM included. - -**Trace:** Deployment - Offline & Air-Gapped - -```gherkin -Feature: Air-Gapped Deployment - - Scenario: No external network requests at runtime - Given the System is deployed in an air-gapped environment - When the System starts and operates normally - Then the System MUST NOT make any outbound network requests to the internet - And all UI assets (fonts, icons, scripts) MUST load from bundled resources - - Scenario: Offline EK certificate validation - Given the TPM vendor CA certificates are pre-loaded in the System - When an agent registers with an EK certificate - Then the System MUST validate the EK certificate against the pre-loaded CA bundle - And no network request to external CRLs or OCSP responders MUST be made -``` - -### NFR-013: Self-Contained Packaging - -**Description:** The System MUST be packaged as a self-contained application. The backend MUST compile to a single binary. The frontend MUST be bundled with all assets. No CDN or external dependency downloads MUST be required at runtime. 
- -**Trace:** Deployment - Offline & Air-Gapped - -```gherkin -Feature: Self-Contained Packaging - - Scenario: Single binary backend deployment - Given the backend binary is deployed to a server - When the binary is executed - Then the backend MUST start without requiring additional runtime downloads - And all embedded assets MUST be served from the binary - - Scenario: Frontend loads without CDN - Given the user accesses the dashboard in an environment with no internet - When the browser loads the application - Then all JavaScript, CSS, fonts, and icons MUST load from the backend server - And no requests to external CDNs MUST be attempted -``` - -### NFR-014: Accessibility - -**Description:** The dashboard UI MUST conform to WCAG 2.1 Level AA: keyboard navigation, screen reader support, color-contrast compliance, and ARIA labels on all interactive elements. This is REQUIRED for government procurement. - -**Trace:** Deployment - Offline & Air-Gapped - -```gherkin -Feature: WCAG 2.1 Level AA Accessibility - - Scenario: Keyboard navigation through dashboard - Given the user navigates the dashboard using only the keyboard - When the user presses Tab to move between interactive elements - Then every interactive element MUST receive visible focus - And the user MUST be able to activate any control using Enter or Space - - Scenario: Screen reader announces dashboard elements - Given a screen reader is active - When the user navigates to the Fleet Overview Dashboard - Then all KPI values MUST have ARIA labels describing the metric name and value - And all interactive elements MUST have descriptive ARIA labels -``` - -### NFR-015: Multiple Deployment Options - -**Description:** The System MUST support deployment via OCI container images, Kubernetes Helm charts, RPM packages, and systemd services. Each deployment method MUST be documented and tested. 
- -**Trace:** Technical Architecture - Deployment - -```gherkin -Feature: Multiple Deployment Options - - Scenario: Deploy via Helm chart - Given a Kubernetes cluster is available - When the operator installs the Helm chart with default values - Then the backend, frontend, and database components MUST deploy successfully - And the dashboard MUST be accessible via the configured Ingress - - Scenario: Deploy via RPM package - Given a Fedora/RHEL system is available - When the operator installs the RPM package - Then a systemd service MUST be created and enabled - And the dashboard MUST start via systemctl -``` - -### NFR-016: Graceful Degradation - -**Description:** The System MUST degrade gracefully when backend components are unavailable: if TimescaleDB is down, show live data only with no history; if Redis is down, make direct API calls with no cache; if apalis workers are down, support manual refresh only; if the Keylime API is down, show cached state with an alert. - -**Trace:** High Availability - Architecture - -```gherkin -Feature: Graceful Degradation - - Scenario: TimescaleDB unavailable - Given TimescaleDB is down - When the user views the dashboard - Then the System MUST display live data from the Keylime API - And historical charts MUST display "Historical data unavailable" - And a banner MUST indicate the database is unreachable - - Scenario: Redis cache unavailable - Given Redis is down - When the System needs to fetch agent data - Then the System MUST make direct API calls to the Keylime Verifier - And a WARNING banner MUST indicate "Cache unavailable — increased API load" - - Scenario: Keylime API unavailable - Given the Keylime Verifier API is unreachable - When the user views the dashboard - Then the System MUST display cached data with staleness timestamps - And a CRITICAL banner MUST indicate "Keylime API unreachable — data may be stale" -``` - -### NFR-017: Circuit Breaker on Verifier API - -**Description:** The System MUST implement a circuit 
breaker pattern on Verifier API calls. The circuit MUST open after a configurable number of consecutive failures (default: 5). While open, the System MUST use cached data. The circuit MUST transition to half-open after a configurable timeout to test recovery. - -**Trace:** Technical Architecture - Event-Driven Ingestion - -```gherkin -Feature: Circuit Breaker - - Scenario: Circuit opens after consecutive failures - Given the Verifier API has returned errors for 5 consecutive requests - When the System evaluates the circuit breaker - Then the circuit MUST transition to the "open" state - And subsequent requests MUST be served from cache without contacting the Verifier - And a WARNING alert MUST indicate "Circuit breaker open — using cached data" - - Scenario: Circuit transitions to half-open - Given the circuit breaker has been open for the configured timeout (default: 60s) - When the timeout elapses - Then the System MUST send a single probe request to the Verifier - And if the probe succeeds, the circuit MUST close and resume normal operation -``` - -### NFR-018: Request Rate Limiting - -**Description:** The System MUST enforce per-user and global request rate limiting to protect the Keylime Verifier and dashboard backend from overload. Rate limits MUST be configurable. 
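NFR-018 does not prescribe a rate-limiting algorithm. As a minimal sketch, a fixed-window counter per user is enough to show the expected behavior; this algorithm is a choice made for illustration, and a token bucket or sliding window would be equally valid. `FixedWindowLimiter` and `Decision` are hypothetical names, and time is passed in as whole seconds so the logic stays deterministic.

```rust
use std::collections::HashMap;

/// Result of a rate-limit check; `retry_after_secs` maps to a Retry-After header.
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Allow,
    Reject { retry_after_secs: u64 },
}

/// Hypothetical per-user fixed-window limiter (illustrative algorithm choice).
pub struct FixedWindowLimiter {
    limit: u32,
    window_secs: u64,
    /// user -> (window index, requests counted in that window)
    counters: HashMap<String, (u64, u32)>,
}

impl FixedWindowLimiter {
    pub fn new(limit: u32, window_secs: u64) -> Self {
        assert!(window_secs > 0, "window must be at least one second");
        Self { limit, window_secs, counters: HashMap::new() }
    }

    /// Counts one request for `user` at time `now_secs` and decides its fate.
    pub fn check(&mut self, user: &str, now_secs: u64) -> Decision {
        let window = now_secs / self.window_secs;
        let entry = self.counters.entry(user.to_string()).or_insert((window, 0));
        if entry.0 != window {
            // A new window started: reset the counter.
            *entry = (window, 0);
        }
        if entry.1 < self.limit {
            entry.1 += 1;
            Decision::Allow
        } else {
            let window_end = (window + 1) * self.window_secs;
            Decision::Reject { retry_after_secs: window_end - now_secs }
        }
    }
}
```

A handler would translate `Decision::Reject` into HTTP 429 with the `Retry-After` value; a second limiter instance keyed on a single constant could stand in for the global Verifier limit.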
- -**Trace:** Technical Architecture - IMA Log & Data Decoupling - -```gherkin -Feature: Request Rate Limiting - - Scenario: Per-user rate limit exceeded - Given user "operator@example.com" is configured with a limit of 100 requests/minute - When the user sends the 101st request within one minute - Then the System MUST return HTTP 429 Too Many Requests - And the response MUST include a Retry-After header - - Scenario: Global rate limit protects Verifier - Given the global rate limit for Verifier API calls is 500 requests/second - When aggregate dashboard requests exceed 500/second - Then excess requests MUST be queued or rejected with HTTP 429 - And the System MUST NOT forward more than 500 requests/second to the Verifier -``` - -### NFR-019: Cache TTL Configuration - -**Description:** The System MUST implement tiered cache TTLs: agent list 10 seconds, agent detail 30 seconds, policies 60 seconds, certificates 300 seconds. TTLs MUST be configurable. - -**Trace:** Scalability - Cache Invalidation - -```gherkin -Feature: Cache TTL Configuration - - Scenario: Agent list cache expires after TTL - Given the agent list cache TTL is configured at 10 seconds - When the user requests the agent list - And the cache was last populated 11 seconds ago - Then the System MUST fetch fresh data from the Verifier API - And the cache MUST be updated with the new data - - Scenario: Certificate cache with longer TTL - Given the certificate cache TTL is configured at 300 seconds - When the user requests certificate data - And the cache was populated 120 seconds ago - Then the System MUST serve the cached certificate data - And no request MUST be made to the Verifier API -``` - -### NFR-020: Periodic Reconciliation - -**Description:** The System MUST run a periodic reconciliation sweep every 5 minutes to detect and correct any drift between the dashboard's cached state and the Verifier's actual state. The reconciliation interval MUST be configurable. 
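The reconciliation sweep in NFR-020 is essentially a diff between the cached states and the states the Verifier reports. The following sketch shows that comparison in isolation; `Correction` and `reconcile` are hypothetical names, agent states are plain strings for brevity, and agents that exist only in the cache are deliberately not handled because the requirement does not say what should happen to them.

```rust
use std::collections::BTreeMap;

/// One drift correction, suitable for writing to a reconciliation log.
#[derive(Debug, PartialEq, Eq)]
pub struct Correction {
    pub agent_id: String,
    /// State held in the cache before the sweep (None if the agent was unknown).
    pub cached: Option<String>,
    /// State reported by the Verifier.
    pub actual: String,
}

/// Updates `cache` to match `verifier` and returns every correction applied.
/// An empty result corresponds to "no drift detected".
pub fn reconcile(
    cache: &mut BTreeMap<String, String>,
    verifier: &BTreeMap<String, String>,
) -> Vec<Correction> {
    let mut corrections = Vec::new();
    for (agent_id, actual) in verifier {
        let cached = cache.get(agent_id).cloned();
        if cached.as_ref() != Some(actual) {
            cache.insert(agent_id.clone(), actual.clone());
            corrections.push(Correction {
                agent_id: agent_id.clone(),
                cached,
                actual: actual.clone(),
            });
        }
    }
    corrections
}
```

Running it with a cache that holds `agent-042` in `GET_QUOTE` while the Verifier reports `FAILED` returns a single correction and leaves the cache in the `FAILED` state; a second run returns an empty list.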
- -**Trace:** Technical Architecture - Event-Driven Ingestion - -```gherkin -Feature: Periodic Reconciliation - - Scenario: Reconciliation detects state drift - Given the dashboard shows agent "agent-042" in GET_QUOTE state - And the Verifier reports agent "agent-042" in FAILED state - When the 5-minute reconciliation sweep runs - Then the dashboard MUST update agent "agent-042" to FAILED state - And the drift MUST be logged as a reconciliation correction - - Scenario: Reconciliation finds no drift - Given all cached agent states match the Verifier's reported states - When the reconciliation sweep runs - Then no state changes MUST occur - And the reconciliation log MUST record "no drift detected" -``` - -### NFR-021: WebSocket Real-Time Updates - -**Description:** The System MUST provide WebSocket connections for real-time UI updates. WebSocket connections MUST support automatic reconnection with exponential backoff. The System MUST indicate connection status to the user. - -**Trace:** Technical Architecture - Data Flow - -```gherkin -Feature: WebSocket Real-Time Updates - - Scenario: WebSocket delivers real-time updates - Given a WebSocket connection is established between the browser and backend - When an agent state change is ingested by the backend - Then the change MUST be pushed to the browser via WebSocket - And the UI MUST update without requiring a page refresh - - Scenario: WebSocket reconnection with backoff - Given a WebSocket connection drops - When the client attempts to reconnect - Then the client MUST use exponential backoff (1s, 2s, 4s, 8s, ...) - And a connection status indicator MUST show "Reconnecting..." - And upon successful reconnection, the indicator MUST show "Connected" -``` - -### NFR-022: Signed Update Packages - -**Description:** Offline update packages MUST be GPG-signed with integrity verification. Each update MUST include a Software Bill of Materials (SBOM). The System MUST verify the signature before applying any update. 
- -**Trace:** Deployment - Offline & Air-Gapped - -```gherkin -Feature: Signed Update Packages - - Scenario: Verify update package signature - Given an update package is available with a GPG signature - When the operator initiates the update - Then the System MUST verify the GPG signature before applying changes - And if the signature is invalid, the update MUST be rejected with error "invalid signature" - - Scenario: SBOM included in update - Given an update package is available - When the operator inspects the package contents - Then an SBOM in SPDX or CycloneDX format MUST be included -``` - -### NFR-023: Concurrent Log Fetch Limit - -**Description:** The System MUST limit concurrent IMA log fetch requests to the Verifier to a maximum of 5 parallel requests. This prevents overloading the Verifier when multiple users request agent detail views simultaneously. - -**Trace:** Technical Architecture - IMA Log & Data Decoupling - -```gherkin -Feature: Concurrent Log Fetch Limit - - Scenario: Enforce maximum concurrent log fetches - Given 5 IMA log fetch requests are currently in progress - When a 6th user requests an IMA log view for a different agent - Then the 6th request MUST be queued until one of the active fetches completes - And the user MUST see a "Loading — request queued" indicator - - Scenario: Concurrent fetches within limit - Given 3 IMA log fetch requests are currently in progress - When a 4th user requests an IMA log view - Then the request MUST proceed immediately without queuing -``` - -### NFR-024: AI Assistant Query Performance and Rate Limiting - -**Description:** The System SHOULD enforce rate limiting on AI Assistant queries to prevent abuse and ensure backend stability. Each user session SHOULD be limited to a configurable maximum number of queries per minute (default: 10). 
The assistant SHOULD respond to queries within 10 seconds; if the LLM or MCP server exceeds this threshold, the System SHOULD display a timeout message and allow the user to retry. - -**Trace:** AI Assistant - Conversational Interface - -```gherkin -Feature: AI Assistant Query Performance and Rate Limiting - - Scenario: Rate limit exceeded - Given the user has submitted 10 AI Assistant queries in the last 60 seconds - When the user submits an 11th query - Then the System SHOULD reject the query with "Rate limit exceeded — please wait before submitting another query" - And the rejection SHOULD be recorded in the audit log - - Scenario: Query timeout - Given the user submits a query to the AI Assistant - When the LLM or MCP server does not respond within 10 seconds - Then the System SHOULD display "Query timed out — the AI service is not responding. Please try again." - And the user SHOULD be able to retry the query -``` - ---- - -## 5. Security Requirements Detail - -### SR-001: OIDC/SAML Authentication - -**Description:** The System MUST authenticate all users via an external OIDC or SAML identity provider. No local username/password authentication MUST be supported. The System MUST support multiple IdP configurations for federated environments. 
- -**Trace:** Dashboard Authentication - User Identity - -```gherkin -Feature: OIDC/SAML Authentication - - Scenario: Authenticate via OIDC provider - Given the dashboard is configured with an OIDC identity provider - When a user navigates to the dashboard without a session - Then the System MUST redirect the user to the OIDC provider login page - And upon successful authentication, the user MUST be redirected back with a valid session - - Scenario: OIDC provider unreachable - Given the OIDC identity provider is unreachable - When a user attempts to log in - Then the System MUST display an error "Authentication service unavailable" - And no unauthenticated access to the dashboard MUST be permitted - - Scenario: Session token expired - Given the user's session token has expired - When the user attempts any dashboard action - Then the System MUST redirect the user to re-authenticate via the IdP - And the user MUST be returned to their original page after re-authentication -``` - -### SR-002: MFA for Admin Role - -**Description:** The System MUST require Multi-Factor Authentication for users with the Admin role. MFA MUST be enforced at the identity provider level. The System MUST verify the MFA claim in the OIDC/SAML token before granting Admin privileges. - -**Trace:** Dashboard Authentication - User Identity - -```gherkin -Feature: MFA for Admin Role - - Scenario: Admin login requires MFA - Given user "admin@example.com" has the Admin role - When the user authenticates via the IdP without completing MFA - Then the System MUST deny Admin-level access - And the user MUST be granted the lowest applicable role (Viewer) until MFA is completed - - Scenario: Admin with valid MFA claim - Given user "admin@example.com" has completed MFA at the IdP - When the OIDC token includes an MFA claim - Then the System MUST grant full Admin privileges -``` - -### SR-003: Three-Tier RBAC - -**Description:** The System MUST enforce a three-tier RBAC model at the dashboard backend. 
Since Keylime's built-in authorization has no read-only admin role, the dashboard MUST proxy and restrict operations based on the user's dashboard role. The permissions matrix: - -| Operation | Viewer | Operator | Admin | -|-----------|--------|----------|-------| -| View agent fleet, attestation analytics, policies, audit log | Yes | Yes | Yes | -| Export reports (CSV/PDF) | No | Yes | Yes | -| Acknowledge/manage alerts | No | Yes | Yes | -| Reactivate/stop agents | No | Yes | Yes | -| Create/edit/delete policies | No | No | Yes | -| Delete agents from verifier | No | No | Yes | -| Manage dashboard users/roles | No | No | Yes | -| Configure alert thresholds | No | No | Yes | -| View/export TPM key material | No | No | Yes | - -All write operations to Keylime MUST be blocked at the dashboard proxy layer for non-Admin roles. - -**Trace:** Dashboard RBAC - Role Definitions - -```gherkin -Feature: Three-Tier RBAC Enforcement - - Scenario: Viewer cannot export reports - Given the user has the Viewer role - When the user attempts to export a CSV report - Then the System MUST return HTTP 403 Forbidden - And the audit log MUST record the denied access attempt - - Scenario: Operator can acknowledge alerts - Given the user has the Operator role - When the user acknowledges an alert - Then the action MUST succeed - -> **Reviewer Note:** Split from original "Operator can acknowledge alerts but not delete agents" which contained two When/Then blocks — a positive and negative test in one scenario. Each scenario MUST test a single outcome for deterministic pass/fail and clear traceability. 
- - Scenario: Operator cannot delete agents - Given the user has the Operator role - When the user attempts to delete an agent - Then the System MUST return HTTP 403 Forbidden - - Scenario: Admin can manage dashboard users - Given the user has the Admin role - When the user creates a new dashboard user with Operator role - Then the new user MUST be created successfully -``` - -### SR-004: mTLS for Keylime API Communication - -**Description:** The System MUST use mutual TLS (mTLS) for all communication with Keylime Verifier and Registrar APIs. The System MUST present a valid client certificate signed by the Keylime CA. The System MUST verify the server certificate of Keylime components. - -**Trace:** Threat Model - Trust Boundaries - -```gherkin -Feature: mTLS API Communication - - Scenario: Backend presents client certificate to Verifier - Given the backend is configured with a valid mTLS client certificate - When the backend connects to the Verifier API - Then the connection MUST use mutual TLS authentication - And the Verifier MUST accept the dashboard's client certificate - - Scenario: Reject connection with invalid server certificate - Given the Verifier presents a certificate not signed by the trusted CA - When the backend attempts to connect - Then the connection MUST be rejected - And a CRITICAL alert MUST indicate "Verifier certificate validation failed" -``` - -### SR-005: mTLS Private Key Protection - -**Description:** The mTLS private key grants full Keylime admin access. It MUST NEVER be stored on disk in cleartext. Supported storage backends: PKCS#11 HSM, HashiCorp Vault Transit, Kubernetes CSI Secret Store, or encrypted file with passphrase (development only). Rust's `rustls` MUST use a custom `SigningKey` trait for HSM/PKCS#11 offload. 
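The "reject startup with cleartext key on disk" behavior in SR-005 needs some way to recognize an unencrypted PEM key. A rough heuristic, assuming the key file is PEM text, is to look at the encapsulation labels: `PRIVATE KEY` is unencrypted PKCS#8, `ENCRYPTED PRIVATE KEY` is encrypted PKCS#8, and the legacy `RSA PRIVATE KEY` / `EC PRIVATE KEY` forms are encrypted only when they carry a `Proc-Type: 4,ENCRYPTED` header. The function name is hypothetical and the check is a sketch; it does not parse the key or cover DER files.

```rust
/// Returns true if `pem` appears to contain an unencrypted private key.
/// Heuristic only: inspects PEM labels and the legacy encryption header.
pub fn is_cleartext_private_key(pem: &str) -> bool {
    let has_label = |label: &str| pem.contains(&format!("-----BEGIN {}-----", label));

    // Unencrypted PKCS#8. The encrypted form uses "ENCRYPTED PRIVATE KEY",
    // which does not match this exact label.
    if has_label("PRIVATE KEY") {
        return true;
    }

    // Legacy PKCS#1 / SEC1 keys are cleartext unless the OpenSSL-style
    // "Proc-Type: 4,ENCRYPTED" header is present.
    let legacy = ["RSA PRIVATE KEY", "EC PRIVATE KEY"]
        .iter()
        .any(|label| has_label(*label));
    legacy && !pem.contains("Proc-Type: 4,ENCRYPTED")
}
```

At startup the backend could read the configured key path and refuse to continue when this returns `true`, emitting the error message from the scenario.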
- -**Trace:** Secret Management - Credential Lifecycle - -```gherkin -Feature: mTLS Private Key Protection - - Scenario: Reject startup with cleartext key on disk - Given the mTLS private key file is present on disk in cleartext (unencrypted PEM) - When the System starts - Then the System MUST refuse to start - And an error MUST indicate "cleartext private key detected — use HSM or Vault" - - Scenario: Graceful handling of HSM unavailability - Given the System is configured to use PKCS#11 HSM for key storage - When the HSM device is unreachable at startup - Then the System MUST refuse to start - And the error MUST indicate "HSM unreachable — cannot load mTLS key" -``` - -### SR-006: HSM or Vault-Backed Key Storage - -**Description:** The System MUST support HSM (PKCS#11) or HashiCorp Vault Transit as the primary storage backend for the mTLS private key. Kubernetes CSI Secret Store MUST be supported as an alternative. Encrypted file with passphrase MAY be supported for development environments only. - -**Trace:** Secret Management - Credential Lifecycle - -```gherkin -Feature: Vault-Backed Key Storage - - Scenario: Load private key from HashiCorp Vault - Given the System is configured to use HashiCorp Vault Transit for key storage - When the System starts - Then the mTLS private key MUST be loaded from Vault - And no key material MUST be written to disk - - Scenario: Vault token expired - Given the Vault token has expired - When the System attempts to load the private key - Then the System MUST refuse to start - And an error MUST indicate "Vault token expired — renew before starting" -``` - -### SR-007: TLS Encryption on All Connections - -**Description:** The System MUST encrypt all network connections with TLS. No cleartext HTTP, database, or cache connections MUST be permitted. This applies to browser-to-dashboard, dashboard-to-Keylime, dashboard-to-database, and dashboard-to-cache connections. 
- -**Trace:** Transport Security - Encrypted Data Paths - -```gherkin -Feature: TLS on All Connections - - Scenario: Reject cleartext HTTP connection - Given the dashboard is running with TLS enabled - When a client attempts to connect via HTTP (port 80) - Then the System MUST redirect to HTTPS or reject the connection - And no data MUST be served over cleartext HTTP - - Scenario: Database connection uses TLS - Given the TimescaleDB connection is configured - When the backend connects to the database - Then the connection MUST use TLS encryption -``` - -### SR-008: Browser-to-API TLS 1.3 Minimum - -**Description:** The System MUST enforce TLS 1.3 as the minimum protocol version for browser-to-dashboard API connections. TLS 1.2 and earlier MUST be rejected. - -**Trace:** Transport Security - Encrypted Data Paths - -```gherkin -Feature: TLS 1.3 Minimum for Browser - - Scenario: Accept TLS 1.3 connection - Given a browser supports TLS 1.3 - When the browser connects to the dashboard - Then the TLS handshake MUST complete using TLS 1.3 - - Scenario: Reject TLS 1.2 connection - Given a client attempts to connect using TLS 1.2 - When the TLS handshake is initiated - Then the System MUST reject the connection -``` - -### SR-009: API-to-Keylime TLS 1.2+ Minimum - -**Description:** The System MUST enforce TLS 1.2 or higher for all connections from the dashboard backend to Keylime components. This allows compatibility with Keylime deployments that have not yet upgraded to TLS 1.3. 
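The TLS 1.2 floor in SR-009 maps onto a single setting in most TLS libraries. The sketch below uses Python's standard `ssl` module purely as an illustration, since the dashboard backend itself is described as Rust. The function name `keylime_client_context` is an assumption.

```python
import ssl


def keylime_client_context() -> ssl.SSLContext:
    """Client-side context for backend-to-Keylime connections (TLS 1.2+)."""
    # create_default_context() enables certificate and hostname verification.
    ctx = ssl.create_default_context()
    # Refuse SSLv3, TLS 1.0 and TLS 1.1 during the handshake.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # A real deployment would also call ctx.load_cert_chain(...) for the mTLS
    # client certificate and ctx.load_verify_locations(...) for the Keylime CA.
    return ctx
```

The browser-facing listener in SR-008 would use the same attribute with `ssl.TLSVersion.TLSv1_3` on a server-side context.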
- -**Trace:** Transport Security - Encrypted Data Paths - -```gherkin -Feature: TLS 1.2+ for Keylime API - - Scenario: Connect to Keylime Verifier with TLS 1.2 - Given the Keylime Verifier supports TLS 1.2 but not TLS 1.3 - When the backend connects to the Verifier - Then the connection MUST complete using TLS 1.2 - - Scenario: Reject SSLv3 or TLS 1.1 - Given a Keylime component offers only TLS 1.1 - When the backend attempts to connect - Then the connection MUST be rejected -``` - -### SR-010: Short-Lived JWT Session Tokens - -**Description:** The System MUST issue short-lived JWT session tokens with a maximum lifetime of 15 minutes. Refresh tokens MUST be rotated on each use. Stolen refresh tokens MUST be detectable via replay detection. - -**Trace:** Dashboard Authentication - User Identity - -```gherkin -Feature: Short-Lived Session Tokens - - Scenario: Session token expires after 15 minutes - Given the user has an active session - When 15 minutes elapse without token refresh - Then the session token MUST expire - And the user MUST be prompted to re-authenticate - - Scenario: Refresh token rotation - Given the user's access token is about to expire - When the client uses the refresh token to obtain a new access token - Then a new refresh token MUST also be issued - And the previous refresh token MUST be invalidated -``` - -### SR-011: Server-Side Session Revocation - -**Description:** The System MUST support server-side session revocation. Administrators MUST be able to revoke any active user session. Revoked sessions MUST be immediately invalidated regardless of token expiry. 
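One way to read SR-011 is that every request consults a server-side revocation record before the token's own expiry is trusted. The following Python sketch illustrates that idea under stated assumptions. `SessionRegistry` and its methods are hypothetical names, and the store is an in-memory set rather than the shared store a multi-instance deployment would need.

```python
class SessionRegistry:
    """In-memory record of revoked session IDs (illustrative only)."""

    def __init__(self) -> None:
        self._revoked = set()

    def revoke(self, session_id: str) -> None:
        """Mark a session as revoked, effective immediately."""
        self._revoked.add(session_id)

    def status_for(self, session_id: str, token_expired: bool) -> int:
        """Return the HTTP status a request carrying this session would get."""
        # Revocation wins even when the token has not yet expired.
        if session_id in self._revoked or token_expired:
            return 401
        return 200
```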
- -**Trace:** Dashboard Authentication - User Identity - -```gherkin -Feature: Server-Side Session Revocation - - Scenario: Admin revokes user session - Given admin "admin@example.com" views active sessions - When the admin revokes the session for "operator@example.com" - Then the operator's session MUST be immediately invalidated - And the operator's next API request MUST return HTTP 401 Unauthorized - - Scenario: Revoked token cannot be reused - Given a session token has been revoked - When the revoked token is presented in an API request - Then the System MUST reject the request with HTTP 401 -``` - -### SR-012: XSS and Injection Prevention - -**Description:** The System MUST implement Content Security Policy (CSP) headers and input sanitization to prevent XSS and injection attacks. All user input MUST be sanitized before rendering. CSP MUST restrict script sources to self only. - -**Trace:** Threat Model - Threat Catalog - -```gherkin -Feature: XSS and Injection Prevention - - Scenario: CSP headers present on all responses - Given the dashboard serves an HTTP response - Then the Content-Security-Policy header MUST be present - And script-src MUST be restricted to 'self' - - Scenario: Sanitize user input in search - Given the user enters "" in the search bar - When the search is submitted - Then the input MUST be sanitized before processing - And no script execution MUST occur in the browser -``` - -### SR-013: Data Minimization - -**Description:** The System MUST NEVER cache or store raw TPM quotes, IMA measurement logs, or raw boot event logs. These MUST be passed through from the Keylime API to the UI on demand and discarded. Only attestation results (pass/fail + timestamp) MUST be retained for historical analysis. PoP token hashes MUST NOT be displayed or cached; only session metadata (creation time, expiry, agent UUID) MAY be shown. 
- -**Trace:** Threat Model - Data Classification; Attestation Modes - Comparative View - -```gherkin -Feature: Data Minimization - - Scenario: Raw TPM quotes not persisted - Given the user views the Raw Data tab for agent "a1b2c3d4" - When the System fetches the TPM quote from the Verifier API - Then the raw quote data MUST be served directly to the browser - And the System MUST NOT write the raw quote data to any persistent store - - Scenario: PoP tokens not displayed - Given a push-mode agent has an active session with a PoP token - When the user views the agent detail page - Then only session metadata (creation time, expiry, agent UUID) MUST be displayed - And the PoP token hash MUST NOT be shown or cached -``` - -### SR-014: PoP Token Privacy - -**Description:** The System MUST NEVER display or cache raw Proof-of-Possession (PoP) tokens. Only session metadata (creation time, expiry, agent UUID) MAY be shown in the UI. This requirement is a specific instance of SR-013 (Data Minimization) applied to push-mode PoP tokens. - -> **Reviewer Note:** SR-014 overlaps with SR-013's PoP token scenario. SR-014 is retained as a distinct requirement because PoP token handling is a push-mode-specific security concern that warrants independent traceability. Cross-reference added to clarify the relationship. - -**Trace:** Attestation Modes - Comparative View - -```gherkin -Feature: PoP Token Privacy - - Scenario: PoP token excluded from API responses - Given the dashboard API serves push-mode session data - When the API response is generated - Then the response MUST NOT include raw PoP token values - And only session metadata MUST be included -``` - -### SR-015: Tamper-Evident Audit Log with RFC 3161 - -**Description:** The System MUST implement tamper-evident hash-chained audit logging with RFC 3161 timestamp anchoring. This is defined in detail in FR-061 and duplicated here as a security requirement to ensure traceability. 
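The hash-chaining part of SR-015 can be illustrated with a few lines of Python. This is a sketch only. The entry layout, the all-zero `GENESIS` value, and the function names are assumptions, and RFC 3161 anchoring is deliberately left out.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(entry: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the entry."""
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list, event: str) -> None:
    """Append an event that commits to the hash of the previous entry."""
    prev = entry_hash(log[-1]) if log else GENESIS
    log.append({"event": event, "prev_hash": prev})


def verify_chain(log: list) -> bool:
    """Recompute every link; an edited entry breaks the link that follows it."""
    prev = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev:
            return False
        prev = entry_hash(entry)
    return True
```

Note that a change to the newest entry is only detectable if the head hash is held somewhere outside the log, which is the role an external timestamp anchor can play.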
- -**Trace:** Compliance - Tamper-Evident Audit Logging - -```gherkin -Feature: Tamper-Evident Audit Log - - Scenario: Audit log entries are hash-chained - Given the audit log has existing entries - When a new audit event occurs - Then the new entry MUST include the SHA-256 hash of the previous entry - And the chain MUST be verifiable from the root anchor - - Scenario: RFC 3161 timestamp anchoring - Given the System is configured with an RFC 3161 TSA - When the audit log chain root is created - Then the root MUST be anchored with an RFC 3161 timestamp token - And the timestamp token MUST be verifiable against the TSA certificate -``` - -### SR-016: SSRF Protection - -**Description:** The System MUST protect against Server-Side Request Forgery (SSRF) on webhook URLs. Webhook destinations MUST be validated against an allowlist. RFC 1918 private addresses MUST be blocked unless explicitly allowed. DNS rebinding protection MUST be implemented. - -**Trace:** Revocation - Alert Workflow - -```gherkin -Feature: SSRF Protection - - Scenario: Block webhook to private IP address - Given an admin configures a webhook URL pointing to 192.168.1.100 - When the webhook configuration is saved - Then the System MUST reject the URL with error "RFC 1918 addresses blocked" - - Scenario: Allow webhook to whitelisted destination - Given "hooks.slack.com" is in the webhook allowlist - When an admin configures a webhook to "https://hooks.slack.com/services/T00/B00/xxxx" - Then the webhook configuration MUST be accepted -``` - -### SR-017: Two-Person Approval for Policy Changes - -**Description:** The System MUST enforce a two-person approval workflow for all policy changes. This is a security requirement that mirrors FR-039 to ensure policy modifications cannot be unilaterally applied by a single administrator. 
- -**Trace:** Policy Management - Two-Person Rule - -```gherkin -Feature: Two-Person Policy Approval Security - - Scenario: Policy change without approval blocked - Given Admin A drafts and saves a policy change - When Admin A attempts to push the policy to the Verifier without approval - Then the System MUST block the push - And the audit log MUST record the blocked attempt -``` - -### SR-018: Drafter Cannot Self-Approve - -**Description:** The System MUST enforce that the approver of a policy change MUST NOT be the same user as the drafter. This separation of duties is a critical security control. - -**Trace:** Policy Management - Two-Person Rule - -```gherkin -Feature: Drafter Cannot Self-Approve - - Scenario: Same user attempts draft and approval - Given Admin A drafted a policy change - When Admin A attempts to approve the same change - Then the System MUST reject the approval with "self-approval not permitted" - And the audit log MUST record the rejected self-approval attempt -``` - -### SR-019: Multi-Tenancy Isolation - -**Description:** The System MUST enforce strict multi-tenancy isolation. Cross-tenant data MUST NEVER be mixed or accessible. Each tenant MUST have isolated data stores, separate RBAC, and independent configuration. 
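A tenant-scoped lookup is one simple way to picture the isolation SR-019 describes. The Python sketch below is illustrative. The nested-dictionary store and the `get_agent` name are assumptions, not the project's data model.

```python
def get_agent(stores: dict, tenant_id: str, agent_id: str):
    """Look up an agent only inside the caller's own tenant store."""
    agent = stores.get(tenant_id, {}).get(agent_id)
    if agent is None:
        # Same answer whether the agent is missing or owned by another tenant,
        # so the response does not reveal that the agent exists elsewhere.
        return 404, None
    return 200, agent
```

Because the lookup never touches another tenant's store, a request for a foreign agent ID cannot return foreign data.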
- -**Trace:** Dashboard RBAC - Multi-Tenancy - -```gherkin -Feature: Multi-Tenancy Isolation - - Scenario: Cross-tenant data access blocked - Given Tenant A has agent "agent-001" and Tenant B has agent "agent-002" - When a user authenticated to Tenant A requests agent "agent-002" - Then the System MUST return HTTP 404 Not Found - And no data from Tenant B MUST be included in the response - - Scenario: Tenant isolation in search results - Given Tenant A has 100 agents and Tenant B has 50 agents - When a Tenant A user performs a global search - Then search results MUST only include Tenant A's 100 agents -``` - -### SR-020: Data Classification - -**Description:** The System MUST enforce the following data classification and protection requirements: - -| Data Type | Classification | Storage Policy | Protection | -|-----------|---------------|----------------|------------| -| mTLS private key | SECRET | Never on disk | HSM or Vault only | -| EK/AK public keys | CONFIDENTIAL | Cache only (TTL) | Encrypt at rest | -| TPM quotes | CONFIDENTIAL | Do not cache | Pass-through only | -| IMA policy content | CONFIDENTIAL | Cache with TTL | Encrypt at rest | -| Agent IPs/UUIDs | INTERNAL | TimeSeries DB | Encrypt at rest | -| Attestation results | INTERNAL | TimeSeries DB | Encrypt at rest | -| Dashboard audit log | CONFIDENTIAL | Append-only store | Hash-chained, signed | -| Dashboard credentials | SECRET | Never persisted | OIDC session tokens | - -**Trace:** Threat Model - Data Classification - -```gherkin -Feature: Data Classification Enforcement - - Scenario: SECRET data never written to disk - Given the mTLS private key is classified as SECRET - When the System operates normally - Then the private key MUST NOT exist on disk in any form (cleartext or encrypted) - And the key MUST be loaded exclusively from HSM or Vault - - Scenario: CONFIDENTIAL data encrypted at rest - Given EK/AK public keys are classified as CONFIDENTIAL - When the System caches public key data in Redis 
- Then the cached data MUST be encrypted at rest - And the cache entry MUST have a TTL configured -``` - -### SR-021: Write Operations Blocked for Non-Admin - -**Description:** The System MUST block all write operations to Keylime APIs at the dashboard proxy layer for non-Admin roles. This provides defense-in-depth beyond RBAC role checks. - -**Trace:** Dashboard RBAC - Role Definitions - -```gherkin -Feature: Write Operation Blocking - - Scenario: Operator write request blocked at proxy - Given the user has the Operator role - When the user sends a POST request to create a policy (bypassing UI) - Then the dashboard proxy MUST block the request before it reaches the Keylime API - And the System MUST return HTTP 403 Forbidden - And the blocked attempt MUST be logged in the audit log -``` - -### SR-022: mTLS Sidecar Option - -**Description:** The System MAY support an mTLS sidecar (Envoy or Ghostunnel) as an alternative to application-level mTLS. This allows deployments where certificate management is handled by the service mesh. - -**Trace:** Transport Security - mTLS Sidecar Option - -```gherkin -Feature: mTLS Sidecar Support - - Scenario: Delegate mTLS to Envoy sidecar - Given the System is deployed with an Envoy sidecar handling mTLS - When the backend communicates with the Keylime Verifier - Then the Envoy sidecar SHOULD terminate the mTLS connection - And the backend SHOULD communicate with Envoy over localhost -``` - -### SR-023: No Unsafe Rust Code - -**Description:** The dashboard Rust crate MUST use `#![forbid(unsafe_code)]` to prevent unsafe Rust code blocks. This eliminates entire classes of memory safety vulnerabilities. 
- -**Trace:** Technical Architecture - Why Rust - -```gherkin -Feature: No Unsafe Rust Code - - Scenario: Build fails on unsafe code - Given the dashboard crate has #![forbid(unsafe_code)] at the crate root - When a developer adds an unsafe block to the crate - Then the Rust compiler MUST reject the build with error "unsafe code is forbidden" -``` - -### SR-024: Signed Cache Entries - -**Description:** The System MUST sign cache entries and enforce TTLs to mitigate cache poisoning attacks. Cache entries MUST be validated before use. - -**Trace:** Threat Model - Threat Catalog - -```gherkin -Feature: Signed Cache Entries - - Scenario: Detect poisoned cache entry - Given agent data is cached with an HMAC signature - When a cache entry is modified externally (cache poisoning attempt) - Then the System MUST detect the signature mismatch - And the System MUST discard the poisoned entry and fetch fresh data from the API - And a CRITICAL alert MUST be raised indicating "cache integrity violation" -``` - -### SR-025: TPM Identity Change Detection - -**Description:** The System MUST alert on TPM identity changes during agent re-registration. CRITICAL alerts MUST fire when: TPM key changes on re-registration, or EK certificate mismatches on re-registration. WARNING alerts MUST fire when: regcount exceeds 3, or re-registration occurs from a different IP. TPM identity keys SHOULD be immutable; any change indicates potential compromise. 
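The alert rules in SR-025 reduce to four comparisons between the previous and current registration. The Python sketch below shows one possible shape. The `Registration` fields and the function name are assumptions, and the wording of the TPM-key and IP messages is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Registration:
    tpm_key: str
    ek_cert: str
    ip: str
    regcount: int


def identity_alerts(previous: Registration, current: Registration) -> list:
    """Return (severity, message) pairs for a re-registration event."""
    alerts = []
    if current.tpm_key != previous.tpm_key:
        alerts.append(("CRITICAL", "TPM key changed on re-registration"))
    if current.ek_cert != previous.ek_cert:
        alerts.append(("CRITICAL", "EK cert mismatch on re-registration"))
    if current.regcount > 3:
        alerts.append(("WARNING", "high regcount (>3)"))
    if current.ip != previous.ip:
        alerts.append(("WARNING", "re-registration from a different IP"))
    return alerts
```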
- -**Trace:** Security Audit - Identity Verification Events - -```gherkin -Feature: TPM Identity Change Detection - - Scenario: Detect TPM key change on re-registration - Given agent "agent-042" is registered with EK certificate "cert-A" - When agent "agent-042" re-registers with a different EK certificate "cert-B" - Then the System MUST raise a CRITICAL alert - And the alert message MUST indicate "EK cert mismatch on re-registration" - And the audit log MUST record both the previous and new certificate identifiers - - Scenario: Warn on high re-registration count - Given agent "agent-042" has a registration count of 4 - When the System evaluates identity events - Then the System MUST raise a WARNING alert indicating "high regcount (>3)" -``` - -### SR-026: Audit Log Retention - -**Description:** The System MUST retain audit log entries for a minimum of 1 year to meet compliance requirements. Retention period MUST be configurable. Archived logs MUST remain tamper-evident and verifiable. - -**Trace:** Compliance - Tamper-Evident Audit Logging - -```gherkin -Feature: Audit Log Retention - - Scenario: Audit log retained for minimum period - Given the retention policy is set to 1 year - When audit log entries are older than 1 year - Then the System MUST archive the entries but NOT delete them automatically - And archived entries MUST remain verifiable via hash chain validation - - Scenario: Prevent premature log deletion - Given an administrator attempts to delete audit log entries less than 1 year old - When the deletion request is submitted - Then the System MUST reject the request with "retention policy violation" -``` - -### SR-027: Emergency Bypass with Break-Glass Audit - -**Description:** The System MUST support an emergency bypass mechanism that allows policy changes to skip the two-person approval workflow. 
Emergency bypass MUST require break-glass authentication and MUST produce a detailed audit trail including: who invoked the bypass, why (free-text justification), what action was taken, and timestamp. - -**Trace:** Policy Management - Two-Person Rule - -```gherkin -Feature: Emergency Bypass with Break-Glass Audit - - Scenario: Emergency bypass for critical policy rollback - Given a critical incident requires immediate policy rollback - And the two-person rule is enforced - When Admin A invokes the emergency bypass with justification "critical production incident" - Then the policy change MUST be applied without a second approver - And the audit log MUST record the bypass with: actor, justification, action, and timestamp - And a CRITICAL alert MUST be raised: "Break-glass bypass invoked" - - Scenario: Break-glass without justification rejected - Given Admin A invokes the emergency bypass - When no justification text is provided - Then the System MUST reject the bypass with "justification required for emergency bypass" -``` - -### SR-028: Configurable Idle Session Timeout - -**Description:** The System MUST enforce a configurable idle session timeout. Sessions inactive beyond the timeout period MUST be automatically terminated. The default idle timeout MUST be 30 minutes. 
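The idle-timeout behaviour in SR-028 can be sketched as a small state holder. This Python example is illustrative. `IdleSession` is a hypothetical name, and timestamps are passed in as plain seconds so the logic stays independent of any clock source.

```python
class IdleSession:
    """Tracks last activity; timestamps are seconds supplied by the caller."""

    def __init__(self, now: float, timeout_s: float = 30 * 60) -> None:
        self.timeout_s = timeout_s  # default mirrors the 30-minute requirement
        self.last_activity = now

    def is_expired(self, now: float) -> bool:
        """True once inactivity has exceeded the configured timeout."""
        return now - self.last_activity > self.timeout_s

    def touch(self, now: float) -> bool:
        """Record activity; return False if the session had already idled out."""
        if self.is_expired(now):
            return False
        self.last_activity = now  # activity resets the idle timer
        return True
```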
- -**Trace:** Dashboard Authentication - User Identity - -```gherkin -Feature: Idle Session Timeout - - Scenario: Session terminated after idle timeout - Given the idle session timeout is configured at 30 minutes - When the user has been inactive for 31 minutes - Then the session MUST be automatically terminated - And the user MUST be redirected to the login page on their next action - - Scenario: Activity resets idle timer - Given the idle session timeout is 30 minutes - When the user performs an action at the 25-minute mark - Then the idle timer MUST reset to 0 - And the session MUST remain active for another 30 minutes of inactivity -``` - -### SR-029: Rate Limiting on Session Creation - -**Description:** The System MUST enforce rate limiting on the dashboard session creation endpoint to prevent brute-force attacks on authentication. Failed login attempts MUST be rate-limited per source IP. - -**Trace:** Attestation Modes - Comparative View - -```gherkin -Feature: Session Creation Rate Limiting - - Scenario: Rate limit exceeded on login attempts - Given the rate limit for session creation is 10 attempts per minute per IP - When a client sends the 11th login request within one minute from the same IP - Then the System MUST return HTTP 429 Too Many Requests - And the response MUST include a Retry-After header - - Scenario: Successful login after cooldown - Given a client was rate-limited on login attempts - When the Retry-After period elapses - Then the client MUST be able to attempt login again -``` - ---- - -## 6. Implementation Phasing - -As defined in the source presentation, the implementation follows three phases: - -**Phase 1 — Secure Foundation:** Agent fleet list, agent state monitoring, OIDC/SAML authentication, three-tier RBAC, tamper-evident audit log, mTLS to Keylime APIs. 
*Exit criteria: all API access authenticated, RBAC enforced on every endpoint, audit log captures all actions, mTLS key never on disk in cleartext, threat model reviewed, penetration test passed.* - -**Phase 2 — Operations:** Attestation analytics, policy management UI, certificate monitoring, alert notifications, push mode (v3 API) support, SIEM integration, WebSocket updates, SSRF-validated webhooks, Prometheus metrics endpoint. - -**Phase 3 — Enterprise Scale:** Multi-tenancy, compliance reports (NIST, PCI DSS, SOC 2, FedRAMP), incident response integration (ServiceNow, Jira, PagerDuty), HA deployment, air-gapped packaging, multi-cluster support, WCAG 2.1 AA accessibility. - -**Trace:** Implementation Roadmap - ---- - -## 7. Software Design Description Cross-Reference - -> **Migrated:** 2026-04-15 — Implementation Refinements (IR-001 through IR-013) have been migrated to the companion Software Design Description (SDD) document, following the IEEE 1016 standard separation between requirements (*what*) and design (*how*). 
- -The design details that realize these requirements -- including component decomposition, data models, API contracts, state machines, algorithms, and deployment configuration -- are documented in the companion SDD: - -**Document:** [`spec/SDD-Keylime-Monitoring-Tool.md`](SDD-Keylime-Monitoring-Tool.md) - -**SDD Viewpoint Mapping:** - -| Former IR | SDD Section | SDD Viewpoint | -|-----------|-------------|---------------| -| IR-001: API Response Envelope | 3.4.1 | Interface View | -| IR-002: Paginated Response Format | 3.4.2 | Interface View | -| IR-003: WebSocket Endpoint | 3.4.4 | Interface View | -| IR-004: Agent Data Model | 3.3.2 | Logical View | -| IR-005: Pipeline Stage Enumeration | 3.3.6 | Logical View | -| IR-006: Failure Correlation Types | 3.3.6 | Logical View | -| IR-007: Notification Channels | 3.3.9 | Logical View | -| IR-008: Alert Lifecycle State Machine | 3.6.2 | State Dynamics View | -| IR-009: Certificate Expiry Derivation | 3.3.5 | Logical View | -| IR-010: Timeline Distribution Algorithm | 3.7.1 | Algorithm View | -| IR-011: Frontend RBAC Enforcement | 5.1 | Security Overlay | -| IR-012: Visualization Settings | 3.8.2 | Resource View | -| IR-013: KPI Fallback Computation | 3.7.2 | Algorithm View | -| IR-014: Runtime Connection Configuration | 3.3.1, 3.4.3, 3.8.1 | Logical / Interface / Resource View | -| IR-015: mTLS Certificate Configuration | 3.4.3, 3.8.1 | Interface / Resource View | -| IR-016: TOML Config Persistence | 3.8.1 | Resource View | -| IR-017: Sidebar Visibility Toggle | 3.2.2 | Composition View | -| IR-018: Backend Health Probes | 3.7.3 | Algorithm View | - -The SDD also includes a full SRS traceability matrix (Section 6) mapping every implemented requirement to its corresponding design element.