diff --git a/benchmark/.DS_Store b/benchmark/.DS_Store
deleted file mode 100644
index 639d19a..0000000
Binary files a/benchmark/.DS_Store and /dev/null differ
diff --git a/benchmark/agentfield-254/EVALUATION.md b/benchmark/agentfield-254/EVALUATION.md
deleted file mode 100644
index 30737eb..0000000
--- a/benchmark/agentfield-254/EVALUATION.md
+++ /dev/null
@@ -1,90 +0,0 @@
-# PR-AF Architecture Progression & Evaluation
-## Target: AgentField PR #254 (Config Storage Migration)
-
-**Evaluation Date**: 2026-03-11
-**Systems Compared**: PR-AF (Current Version, Kimi k2.5) vs. Claude Code (Single-agent baseline)
-**Goal**: To document the architectural improvements made to PR-AF and demonstrate how composite multi-agent reasoning out-performs a single-agent baseline like Claude Code in depth, precision, and systemic insight.
-
----
-
-## 1. Executive Summary
-
-This document evaluates the current version of **PR-AF (Pull Request Agent Field)** against a standard single-agent approach (**Claude Code**). The target is AgentField PR #254, a complex 28-file migration from local JSON config to a SQLite-backed storage model.
-
-The core finding is that **multi-agent composite reasoning (PR-AF) discovers critical systemic vulnerabilities and compound attack chains that a single agent (Claude Code) cannot perceive.**
-
-While Claude Code successfully catches surface-level mechanical errors (missing parameters, unused variables) in seconds for ~$0.50, PR-AF acts as a deep architectural auditor. Through its progression of architectural improvements—culminating in a Hybrid Evidence Grounding layer and Parallel Compound Analysis—PR-AF achieved a **0% false positive rate** while synthesizing a multi-vector authentication bypass chain that would result in complete system compromise.
-
-### High-Level Comparison (Current Version vs. CC)
-
-| Metric | Claude Code | PR-AF (Current Version) |
-|---|---|---|
-| **Architecture** | Single-agent, fast context window | 8-Phase Multi-Agent DAG |
-| **Duration** | ~5-10 minutes | ~45-50 minutes |
-| **Cost** | ~$0.50 - $2.00 | ~$0 (opencode / OSS models) |
-| **Surface Bugs Caught**| Yes (e.g., interface mismatches) | Yes |
-| **Systemic Flaws** | Missed | **Found** (inconsistent protection) |
-| **Compound Risks** | Missed | **Found** (coordinated config injection) |
-| **False Positive Rate**| High (relies on assumptions) | **0%** (via Evidence Verifier) |
-
----
-
-## 2. The PR-AF Architectural Journey
-
-To understand why the Current Version performs so well, we must trace the improvements made to the PR-AF pipeline. We ran 4 successive iterations of the pipeline against the exact same PR to measure the impact of each architectural upgrade.
-
-### Run 1: The Baseline (Sonnet 4.6)
-* **Architecture:** Basic Intake → Anatomy → Review Dimensions (no deep context) → Cross-Ref Scoring → Synthesis.
-* **Result:** 20 findings, 3 critical, ~35 minutes.
-* **Flaw:** High false positive rate (~10%). The agents relied on the diff text and guessed how it interacted with the wider repo, leading to hallucinated claims about error handling.
-
-### Run 2: Enriched Context (Kimi k2.5)
-* **Improvement:** Replaced static prompts with **Investigative Prompts**. The harness was explicitly instructed to browse the repository (`cwd=repo_path`), read imports, and verify function signatures before writing findings.
-* **Result:** 25 findings, 8 critical, ~40 minutes.
-* **Flaw:** Signal rate improved to 88%, but false positives still existed (4%). Agents were *told* to investigate, but LLMs are lazy—they often relied on assumptions instead of actually grepping the repo.
-
-### Run 3: Hybrid Evidence Grounding Layer
-* **Improvement:** Introduced the **HUNT → PROVE** adversarial tension. We added a programmatic extraction layer (using fast Python AST parsing) to pull exact caller snippets, import contexts, and cross-references. We fed this raw data into an **Evidence Verifier** harness, forcing it to falsify claims that lacked concrete proof.
-* **Result:** 25 findings, 7 critical, ~43 minutes.
-* **Impact:** **False Positive Rate dropped to 0%.** The verifier correctly dropped assumptions that couldn't be backed up by the extracted code snippets.
-
-### Run 4: Current Version (Compound Analysis & Dedup)
-* **Improvement:** The original `cross_ref` phase was a naive scoring multiplier that wasted 34% of the pipeline time (16 minutes) without changing any finding rankings. We replaced it with **Parallel Compound Analysis**. The system groups related findings into clusters (by file, import, caller, or tag) and spawns parallel investigators to see if the combination of minor bugs creates a major exploit. A final `compound_dedup_phase` collapses duplicate insights.
-* **Result:** 17 findings, 13 critical. Cross-ref time reduced from 16m → 5m.
-* **Impact:** Discovered **3 genuinely novel, critical insights** (see Section 3) that no individual reviewer agent found.
-
----
-
-## 3. The Power of Compound Analysis
-
-The most significant differentiator between PR-AF and Claude Code is the **Phase 5.5: Compound Analysis**.
-
-In PR #254, individual reviewers found several isolated issues in `config_db.go`:
-1. `AdminToken` can be overridden from the database.
-2. `APIKey` lacks protection from database merge.
-3. `WebhookSecret` is merged blindly from the database.
-
-A single agent (Claude Code) sees these as three separate, medium-severity bugs ("Hey, you forgot to protect this field").
-
-The **PR-AF Compound Analyzer** was handed this cluster of findings along with their evidence. It recognized the systemic pattern and synthesized a **first-class critical finding**:
-
-> **Complete System Compromise via Coordinated DB Config Injection**
-> *Severity: Critical | Score: 1.104*
-> The combination of multiple unprotected security-sensitive fields in the DB config merge logic creates a complete authentication and authorization bypass chain. An attacker with database write access can simultaneously inject malicious values for: (1) DID Authorization tokens, (2) API Keys, and (3) Webhook secrets. This is not an isolated missing validation, but a systemic control gap where the protection pattern applied to the `Storage` config was neglected across all authentication vectors.
-
-Claude Code cannot make this leap because it lacks the architectural design to group, step back, and re-evaluate findings in relation to one another.
-
----
-
-## 4. PR-AF Current Version vs. Claude Code (CC)
-
-### Depth vs. Speed
-* **Claude Code** is exceptional for the "inner loop" of development. If an engineer forgets a parameter or misnames a variable, CC finds it in seconds and fixes it inline.
-* **PR-AF** is designed for the "outer loop" (the CI/CD gate). It takes 45 minutes because it performs exhaustive, multi-dimensional analysis (Semantic, Mechanical, Systemic), programmatic evidence extraction, and adversarial challenges.
-
-### Precision (False Positives)
-* **Claude Code** relies on its context window. If a referenced function isn't in the window, it guesses based on naming conventions. This creates false positives that human reviewers have to dismiss.
-* **PR-AF** uses an **Evidence Grounding Layer**. If a semantic reviewer claims a bug exists, the extraction engine pulls the exact AST node, and the Verifier tests the claim. In our benchmarks, PR-AF's current version achieved a 0% false positive rate on PR #254.
-
-### The Verdict
-Our multi-reasoner architecture proves that **intelligence is in the composition, not just the model**.
By structuring the workflow into parallel hunters, programmatic evidence extraction, adversarial verification, and compound synthesis, PR-AF transforms an average LLM into a senior architectural auditor. diff --git a/benchmark/agentfield-254/pr-af-result-kimi-compound.json b/benchmark/agentfield-254/pr-af-result-kimi-compound.json deleted file mode 100644 index 2eb6376..0000000 --- a/benchmark/agentfield-254/pr-af-result-kimi-compound.json +++ /dev/null @@ -1,1036 +0,0 @@ -{ - "findings": [ - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The MockStorageProvider in execute_test.go has the old method signatures for SetConfig and GetConfig that don't match the updated StorageProvider interface. The methods are missing the `updatedBy` parameter in SetConfig and return `interface{}` instead of `*storage.ConfigEntry` for GetConfig. Additionally, the mock is missing the new required methods ListConfigs and DeleteConfig.", - "confidence": 1, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "mock-storage-provider-interface-compliance", - "dimension_name": "MockStorageProvider Interface Compliance", - "evidence": "Step 1: The StorageProvider interface at control-plane/internal/storage/storage.go:133-136 defines:\n - SetConfig(ctx context.Context, key string, value string, updatedBy string) error\n - GetConfig(ctx context.Context, key string) (*ConfigEntry, error)\n - ListConfigs(ctx context.Context) ([]*ConfigEntry, error)\n - DeleteConfig(ctx context.Context, key string) error\n\nStep 2: MockStorageProvider at lines 173-178 has old signatures:\n - SetConfig(ctx context.Context, key string, value interface{}) error\n - GetConfig(ctx context.Context, key string) (interface{}, error)\n - Missing: ListConfigs method\n - Missing: DeleteConfig method\n\nStep 3: This causes compilation failure when running `go build ./...` because the mock doesn't implement the interface.", - "file_path": "control-plane/internal/handlers/execute_test.go", - "id": 
"f_009", - "line_end": 178, - "line_start": 173, - "score": 1.2, - "severity": "critical", - "suggestion": "Update MockStorageProvider in execute_test.go to match the new interface:\n1. Change SetConfig signature to: SetConfig(ctx context.Context, key string, value string, updatedBy string) error\n2. Change GetConfig signature to: GetConfig(ctx context.Context, key string) (*storage.ConfigEntry, error)\n3. Add ListConfigs(ctx context.Context) ([]*storage.ConfigEntry, error) method\n4. Add DeleteConfig(ctx context.Context, key string) error method\n5. Add import for \"github.com/Agent-Field/agentfield/control-plane/internal/storage\" to access ConfigEntry type", - "tags": [ - "compilation-error", - "interface-mismatch", - "mock-fix" - ], - "title": "MockStorageProvider SetConfig and GetConfig have outdated signatures" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The MockStorageProvider.GetConfig method references `*storage.ConfigEntry` but the storage package is not imported in execute_test.go. 
This will cause a **compile-time error**: `undefined: storage`.\n\nWhile the GetConfig signature is correct (`*storage.ConfigEntry, error`), the lack of import makes the type reference invalid.", - "confidence": 0.98, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "mock-getconfig-type-compliance", - "dimension_name": "MockStorageProvider GetConfig Type Compliance", - "evidence": "Step 1: execute_test.go lines 176-178 define `func (m *MockStorageProvider) GetConfig(ctx context.Context, key string) (*storage.ConfigEntry, error)`\nStep 2: File imports (lines 1-20) show no storage package import - only types, gin, and testify packages\nStep 3: `storage.ConfigEntry` is undefined without the import\nStep 4: This causes compilation failure: `undefined: storage in storage.ConfigEntry`", - "file_path": "control-plane/internal/handlers/execute_test.go", - "id": "f_007", - "line_end": 178, - "line_start": 176, - "score": 1.176, - "severity": "critical", - "suggestion": "Add the storage package import to the import block:\n```go\nimport (\n \"bytes\"\n \"context\"\n \"encoding/json\"\n \"net/http\"\n \"net/http/httptest\"\n \"testing\"\n \"time\"\n\n \"github.com/Agent-Field/agentfield/control-plane/internal/storage\" // ADD THIS LINE\n \"github.com/Agent-Field/agentfield/control-plane/pkg/types\"\n\n \"github.com/gin-gonic/gin\"\n \"github.com/stretchr/testify/assert\"\n \"github.com/stretchr/testify/mock\"\n)\n```", - "tags": [ - "missing-import", - "undefined-type", - "compilation-error", - "mock" - ], - "title": "Missing storage import causes undefined type error" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `MockStorageProvider` in `config_test.go` has **outdated method signatures** that do not match the updated `StorageProvider` interface defined in `storage.go`. This will cause **compilation failures** when running tests.\n\n**Issues found:**\n\n1. 
**SetConfig signature mismatch** (line 289-292):\n - **Interface expects:** `SetConfig(ctx context.Context, key string, value string, updatedBy string) error`\n - **Mock has:** `SetConfig(ctx context.Context, key string, value interface{}) error`\n - **Missing:** The `updatedBy` parameter (4th parameter)\n - **Wrong type:** `value` should be `string`, not `interface{}`\n\n2. **GetConfig signature mismatch** (line 294-297):\n - **Interface expects:** `GetConfig(ctx context.Context, key string) (*ConfigEntry, error)`\n - **Mock has:** `GetConfig(ctx context.Context, key string) (interface{}, error)`\n - **Wrong return type:** Should return `*ConfigEntry`, not `interface{}`\n\n3. **Missing ListConfigs method**:\n - **Interface requires:** `ListConfigs(ctx context.Context) ([]*ConfigEntry, error)`\n - **Mock is missing this method entirely**\n\n4. **Missing DeleteConfig method**:\n - **Interface requires:** `DeleteConfig(ctx context.Context, key string) error`\n - **Mock is missing this method entirely**", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "mock-storage-provider-interface-compliance", - "dimension_name": "MockStorageProvider Interface Compliance", - "evidence": "Step 1: The StorageProvider interface in storage.go:133-136 defines:\n- SetConfig(ctx context.Context, key string, value string, updatedBy string) error\n- GetConfig(ctx context.Context, key string) (*ConfigEntry, error)\n- ListConfigs(ctx context.Context) ([]*ConfigEntry, error)\n- DeleteConfig(ctx context.Context, key string) error\n\nStep 2: The MockStorageProvider in config_test.go:289-292 has:\n- SetConfig(ctx context.Context, key string, value interface{}) error (WRONG: missing updatedBy, value type)\n\nStep 3: The MockStorageProvider in config_test.go:294-297 has:\n- GetConfig(ctx context.Context, key string) (interface{}, error) (WRONG: return type)\n\nStep 4: The MockStorageProvider is MISSING:\n- ListConfigs method\n- DeleteConfig method\n\nStep 5: 
This causes the MockStorageProvider to NOT implement the StorageProvider interface, resulting in compilation errors like:\n'*MockStorageProvider does not implement storage.StorageProvider (missing ListConfigs method)'\n'*MockStorageProvider does not implement storage.StorageProvider (wrong type for SetConfig method)'", - "file_path": "control-plane/internal/handlers/ui/config_test.go", - "id": "f_001", - "line_end": 297, - "line_start": 289, - "score": 1.14, - "severity": "critical", - "suggestion": "Update the MockStorageProvider to match the interface:\n\n1. Update SetConfig (lines 289-292):\n```go\nfunc (m *MockStorageProvider) SetConfig(ctx context.Context, key string, value string, updatedBy string) error {\n args := m.Called(ctx, key, value, updatedBy)\n return args.Error(0)\n}\n```\n\n2. Update GetConfig (lines 294-297):\n```go\nfunc (m *MockStorageProvider) GetConfig(ctx context.Context, key string) (*storage.ConfigEntry, error) {\n args := m.Called(ctx, key)\n if args.Get(0) == nil {\n return nil, args.Error(1)\n }\n return args.Get(0).(*storage.ConfigEntry), args.Error(1)\n}\n```\n\n3. Add ListConfigs method after line 297:\n```go\nfunc (m *MockStorageProvider) ListConfigs(ctx context.Context) ([]*storage.ConfigEntry, error) {\n args := m.Called(ctx)\n if args.Get(0) == nil {\n return nil, args.Error(1)\n }\n return args.Get(0).([]*storage.ConfigEntry), args.Error(1)\n}\n```\n\n4. 
Add DeleteConfig method after that:\n```go\nfunc (m *MockStorageProvider) DeleteConfig(ctx context.Context, key string) error {\n args := m.Called(ctx, key)\n return args.Error(0)\n}\n```", - "tags": [ - "compilation-error", - "interface-mismatch", - "mock-update-required", - "go-build-failure" - ], - "title": "MockStorageProvider has outdated SetConfig and GetConfig signatures causing compilation failure" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `mergeDBConfig` function merges `Features.DID` as an entire struct when `dbCfg.Features.DID.Method != \"\"`. This is dangerous because `DIDConfig` contains security-sensitive authorization tokens (`AdminToken` and `InternalToken`).\n\n**The vulnerability:** If an attacker with database write access sets `features.did.method` to any non-empty value in the DB-stored config, the entire `DIDConfig` struct from the DB overwrites the file/env config, including:\n- `AdminToken`: Used for admin operations like tag approval and policy management\n- `InternalToken`: Used for internal authentication when forwarding execution requests to agents\n\n**Attack scenario:**\n1. Attacker gains DB write access\n2. Attacker inserts a malicious config via `PUT /api/v1/configs/agentfield.yaml` with `features.did.method: key` and `features.did.authorization.admin_token: attacker-controlled-token`\n3. On next server start or config reload, the attacker's token replaces the legitimate admin token\n4. 
Attacker can now authenticate as admin using their token\n\n**Expected behavior:** Similar to how `Storage` is preserved (lines 33, 45), security-sensitive tokens should be explicitly protected from DB override.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "security-field-protection", - "dimension_name": "Security-Sensitive Field Protection in DB Config Merge", - "evidence": "Step 1: config_db.go:87-89 checks `if dbCfg.Features.DID.Method != \"\"` and assigns entire `dbCfg.Features.DID` to `target.Features.DID`. Step 2: config.go:99-135 shows DIDConfig contains AuthorizationConfig with AdminToken (line 125) and InternalToken (line 129). Step 3: When DID struct is assigned, ALL fields including Authorization are overwritten. Step 4: This allows DB-stored tokens to replace file/env tokens, enabling privilege escalation.", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_002", - "line_end": 89, - "line_start": 86, - "score": 1.14, - "severity": "critical", - "suggestion": "Change the DID merge logic to preserve `Authorization.AdminToken` and `Authorization.InternalToken` from the original config. Only merge non-sensitive fields like `Method`, `KeyAlgorithm`, etc. 
For example:\n\n```go\n// Save sensitive tokens before merge\nsavedAdminToken := target.Features.DID.Authorization.AdminToken\nsavedInternalToken := target.Features.DID.Authorization.InternalToken\n\nif dbCfg.Features.DID.Method != \"\" {\n target.Features.DID = dbCfg.Features.DID\n // Restore security-sensitive fields\n target.Features.DID.Authorization.AdminToken = savedAdminToken\n target.Features.DID.Authorization.InternalToken = savedInternalToken\n}\n```", - "tags": [ - "security", - "privilege-escalation", - "configuration", - "authorization" - ], - "title": "DID Authorization tokens (AdminToken/InternalToken) can be overridden from DB config" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The MockStorageProvider.GetConfig method in config_test.go returns `(interface{}, error)` but the StorageProvider interface defines it as `(*ConfigEntry, error)`. This is a type mismatch that will cause a **compile-time error** - the mock no longer implements the interface.\n\nThe mock must be updated to:\n1. Return `(*storage.ConfigEntry, error)` instead of `(interface{}, error)`\n2. 
Return `args.Get(0).(*storage.ConfigEntry)` with proper nil checking like other mock methods in the file", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "mock-getconfig-type-compliance", - "dimension_name": "MockStorageProvider GetConfig Type Compliance", - "evidence": "Step 1: StorageProvider interface in storage.go:134 defines `GetConfig(ctx context.Context, key string) (*ConfigEntry, error)`\nStep 2: MockStorageProvider in config_test.go:294-297 implements `GetConfig(ctx context.Context, key string) (interface{}, error)`\nStep 3: The return type mismatch means MockStorageProvider no longer satisfies the StorageProvider interface\nStep 4: Any test using this mock will fail to compile with: `MockStorageProvider does not implement StorageProvider (wrong type for GetConfig method)`", - "file_path": "control-plane/internal/handlers/ui/config_test.go", - "id": "f_006", - "line_end": 297, - "line_start": 294, - "score": 1.14, - "severity": "critical", - "suggestion": "Update the GetConfig method signature from:\n```go\nfunc (m *MockStorageProvider) GetConfig(ctx context.Context, key string) (interface{}, error) {\n args := m.Called(ctx, key)\n return args.Get(0), args.Error(1)\n}\n```\n\nTo:\n```go\nfunc (m *MockStorageProvider) GetConfig(ctx context.Context, key string) (*storage.ConfigEntry, error) {\n args := m.Called(ctx, key)\n if args.Get(0) == nil {\n return nil, args.Error(1)\n }\n return args.Get(0).(*storage.ConfigEntry), args.Error(1)\n}\n```", - "tags": [ - "type-mismatch", - "interface-compliance", - "compilation-error", - "mock" - ], - "title": "Mock GetConfig returns wrong type - interface{} instead of *storage.ConfigEntry" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The MockStorageProvider.SetConfig method has signature `(ctx context.Context, key string, value interface{})` but the StorageProvider interface defines it as `(ctx context.Context, key string, value string, updatedBy string)`. 
This is another interface compliance issue that will cause compilation errors.\n\nThe mock is also missing the `updatedBy` parameter entirely, and uses `interface{}` for value instead of `string`.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "mock-getconfig-type-compliance", - "dimension_name": "MockStorageProvider GetConfig Type Compliance", - "evidence": "Step 1: StorageProvider interface in storage.go:133 defines `SetConfig(ctx context.Context, key string, value string, updatedBy string) error`\nStep 2: MockStorageProvider in config_test.go:289-292 implements `SetConfig(ctx context.Context, key string, value interface{}) error`\nStep 3: Missing `updatedBy string` parameter and wrong `value` type (interface{} vs string)\nStep 4: Interface mismatch will cause: `MockStorageProvider does not implement StorageProvider (wrong type for SetConfig method)`", - "file_path": "control-plane/internal/handlers/ui/config_test.go", - "id": "f_008", - "line_end": 292, - "line_start": 289, - "score": 1.14, - "severity": "critical", - "suggestion": "Update SetConfig signature from:\n```go\nfunc (m *MockStorageProvider) SetConfig(ctx context.Context, key string, value interface{}) error {\n args := m.Called(ctx, key, value)\n return args.Error(0)\n}\n```\n\nTo:\n```go\nfunc (m *MockStorageProvider) SetConfig(ctx context.Context, key string, value string, updatedBy string) error {\n args := m.Called(ctx, key, value, updatedBy)\n return args.Error(0)\n}\n```", - "tags": [ - "type-mismatch", - "interface-compliance", - "compilation-error", - "mock", - "missing-parameter" - ], - "title": "Mock SetConfig has wrong signature - missing updatedBy parameter" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The configReloadFn() method returns a function that calls overlayDBConfig(s.config, s.storage) which directly modifies the shared s.config struct. 
This creates a data race because the returned function is called asynchronously (likely from a signal handler or watcher) while dozens of goroutines concurrently read from s.config fields without any synchronization mechanism.\n\nThe AgentFieldServer struct includes a configMu mutex field (line 82) that was intended to protect these operations, but it is never locked in configReloadFn(). This means concurrent reads during a config reload can observe partially updated or inconsistent configuration values, leading to undefined behavior.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "semantic-001", - "dimension_name": "Data Race in Config Reload", - "evidence": "Line 82: configMu field exists in struct but is unused\nLine 440-441: Direct modification of s.config without lock\nOverlayDBConfig modifies s.config fields via mergeDBConfig()", - "file_path": "control-plane/internal/server/server.go", - "id": "f_010", - "line_end": 442, - "line_start": 433, - "score": 1.14, - "severity": "critical", - "suggestion": "Acquire the configMu lock before modifying s.config in the returned function:\n\nfunc (s *AgentFieldServer) configReloadFn() handlers.ConfigReloadFunc {\n if src := os.Getenv(\"AGENTFIELD_CONFIG_SOURCE\"); src != \"db\" {\n return nil\n }\n return func() error {\n s.configMu.Lock()\n defer s.configMu.Unlock()\n return overlayDBConfig(s.config, s.storage)\n }\n}\n\nAdditionally, all read access to s.config fields throughout the codebase should also acquire at least a read lock (RLock) to prevent data races during concurrent reads.", - "tags": [ - "data-race", - "concurrency", - "mutex", - "config-reload", - "critical" - ], - "title": "Data Race: Config Reload Function Modifies Shared Config Without Synchronization" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The mergeDBConfig function has a systemic security control gap where comments claim protection for security-sensitive fields, but the actual 
implementation only explicitly preserves Storage config (lines 33, 45). This creates multiple authentication bypass vectors through a shared vulnerable code pattern.\n\n**The compound risk:** An attacker with database write access can override ALL critical authentication/authorization tokens by inserting malicious YAML into the database config:\n\n1. **API Authentication Bypass** (lines 94-97): Comment claims 'never override API key from DB for security' but code only merges CORS settings. The API.Auth.APIKey can be overridden from DB, allowing attacker to authenticate with their own key.\n\n2. **Admin Privilege Escalation** (lines 87-89): Features.DID is merged entirely when Method != '', which includes Authorization.AdminToken. Attacker can set their own admin token to gain administrative access to tag approval and policy management routes.\n\n3. **Agent Impersonation** (lines 87-89): Same DID merge includes Authorization.InternalToken, which is sent as Authorization: Bearer header when control plane forwards execution requests to agents. Attacker can impersonate the control plane to agents with RequireOriginAuth enabled.\n\n4. **Approval System Compromise** (lines 82-84): AgentField.Approval config including WebhookSecret is entirely merged from DB. Attacker can manipulate approval workflows and potentially bypass approval requirements.\n\n**Why this is worse than individual findings:** The shared merge pattern suggests a developer misunderstanding of the actual protection scope. Only Storage is explicitly preserved (bootstrap problem), while other security-sensitive fields have only comments claiming protection. This indicates a systemic control gap where the security model is inconsistent and incomplete. Fixing one field won't address the underlying architectural issue.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "compound", - "dimension_name": "Compound Analysis", - "evidence": "Evidence from code review:\\n1. 
Line 33, 45: Only Storage config is explicitly saved and restored (correct protection for bootstrap problem)\\n2. Line 82-84: AgentField.Approval (including WebhookSecret) is entirely merged from DB without protection\\n3. Line 87-89: Features.DID (including Authorization.AdminToken and InternalToken) is entirely merged when Method != ''\\n4. Line 94-97: Comment claims API key protection but only CORS is handled, not Auth\\n5. Line 90-92: Comment claims Connector token protection but no enforcement code exists\\n6. config.go line 207-212: AuthConfig contains APIKey string field\\n7. config.go line 112-135: AuthorizationConfig contains AdminToken (line 125) and InternalToken (line 129)\\n8. config.go line 46: ApprovalConfig contains WebhookSecret\\n\\nAttack scenario: INSERT INTO config (key, value) VALUES ('agentfield.yaml', 'api:\\n auth:\\n api_key: attacker-controlled-key\\nfeatures:\\n did:\\n method: key\\n authorization:\\n admin_token: attacker-admin-token\\n internal_token: attacker-internal-token\\nagentfield:\\n approval:\\n webhook_secret: attacker-webhook-secret')", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_012", - "line_end": 103, - "line_start": 52, - "score": 1.14, - "severity": "critical", - "suggestion": "Implement a comprehensive security-sensitive field protection system:\\n1. Create an explicit whitelist approach for DB-configurable fields instead of selective merging\\n2. Add a security audit comment block at the top of mergeDBConfig listing ALL protected fields\\n3. Implement a struct tag system (e.g., `dbconfig:\"protected\"`) to mark fields that should never come from DB\\n4. Add validation tests that verify no security-sensitive fields can be set from DB config\\n5. Consider encrypting security-sensitive config values in the database\\n6. 
Log all config changes from DB with before/after values for security-sensitive fields", - "tags": [ - "security", - "authentication-bypass", - "configuration", - "database", - "systemic-vulnerability", - "privilege-escalation", - "defense-in-depth" - ], - "title": "Systemic configuration merge vulnerability enables multiple authentication bypass vectors" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The database configuration overlay mechanism (`overlayDBConfig`) contains a systemic security control gap where security-sensitive tokens are not protected from DB-based override, despite comments claiming protection exists. This compound issue creates a complete authentication bypass vulnerability.\n\n**The compound vulnerability:**\n\n1. **Pattern of False Security Claims**: Lines 90-92 and 94 contain comments stating that connector tokens and API keys are intentionally NOT merged from DB, but these protections are NOT actually implemented in code. This creates a dangerous false sense of security.\n\n2. **Multiple Critical Token Override**: An attacker with DB write access can override ALL of these tokens simultaneously:\n - `API.Auth.APIKey` (controls all API access) - line 209 in config.go\n - `AgentField.Approval.WebhookSecret` (controls webhook verification) - line 47 in config.go\n - `Features.DID.Authorization.AdminToken` (controls admin operations) - line 125 in config.go\n - `Features.DID.Authorization.InternalToken` (controls agent authentication) - line 129 in config.go\n - `Features.Connector.Token` (commented as protected but not enforced) - line 89 in config.go\n\n3. **Inconsistent Protection Logic**: While `Storage` is properly protected with save/restore pattern (lines 33, 45), equally or more sensitive fields like APIKey and WebhookSecret are NOT protected using the same pattern, despite being security-critical.\n\n4. 
**Hot-reload Amplification**: The `/api/v1/configs/reload` endpoint (config_storage.go:114-128) allows immediate application of malicious config changes without server restart, enabling rapid exploitation.\n\n5. **Zero Validation**: The SetConfig storage method (local.go:5129-5161) accepts arbitrary YAML content without validating or rejecting sensitive field modifications.\n\n**Complete Attack Chain:**\n1. Attacker gains DB write access OR compromises an account with `config_management` capability\n2. Attacker uploads malicious config YAML with attacker-controlled tokens via `PUT /api/v1/configs/agentfield.yaml`\n3. Attacker triggers config reload via `POST /api/v1/configs/reload`\n4. Server immediately loads attacker's tokens from DB, replacing legitimate file/env-configured tokens\n5. Attacker can now authenticate with their own API key, forge webhook approvals, perform admin operations with their admin token, and authenticate to agents with their internal token\n\n**Risk Escalation:** This is worse than individual findings because it allows COMPLETE SYSTEM COMPROMISE through a single config write operation, bypassing all authentication layers simultaneously.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "compound", - "dimension_name": "Compound Analysis", - "evidence": "Evidence of the compound control gap:\n\n1. **False security claims in comments** (config_db.go:90-97):\n Line 90-92: 'NOTE: Connector config (token, capabilities) is intentionally NOT merged from DB.'\n Line 94: 'API settings (but never override API key from DB for security)'\n Yet NO code enforces these protections - only CORS is merged conditionally at lines 95-97.\n\n2. **Missing protection for APIKey** (config_db.go:94-97):\n The comment says API key should never be overridden from DB, but the only code that runs is CORS merge. API.Auth.APIKey is never preserved or restored.\n\n3. 
**Dangerous struct-level merge for Approval** (config_db.go:82-84):\n ```go\n if dbCfg.AgentField.Approval.WebhookSecret != \"\" || dbCfg.AgentField.Approval.DefaultExpiryHours != 0 {\n target.AgentField.Approval = dbCfg.AgentField.Approval\n }\n ```\n This merges the ENTIRE Approval struct including WebhookSecret when either field is non-empty.\n\n4. **Dangerous struct-level merge for DID** (config_db.go:86-89):\n ```go\n if dbCfg.Features.DID.Method != \"\" {\n target.Features.DID = dbCfg.Features.DID\n }\n ```\n This merges the ENTIRE DIDConfig struct including Authorization.AdminToken and Authorization.InternalToken.\n\n5. **Proper protection only for Storage** (config_db.go:33,45):\n Line 33: `savedStorage := cfg.Storage`\n Line 45: `cfg.Storage = savedStorage`\n This shows the pattern that SHOULD be used for other sensitive fields but is NOT.\n\n6. **Config structs showing sensitive fields** (config.go):\n - Line 47: `WebhookSecret string` in ApprovalConfig\n - Line 125: `AdminToken string` in AuthorizationConfig \n - Line 129: `InternalToken string` in AuthorizationConfig\n - Line 209: `APIKey string` in AuthConfig\n\n7. **No validation in SetConfig** (local.go:5129-5161):\n Raw YAML stored directly to DB without checking for sensitive field modifications.", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_013", - "line_end": 103, - "line_start": 19, - "score": 1.14, - "severity": "critical", - "suggestion": "Implement consistent security field protection across ALL sensitive configuration values:\n\n1. **Immediate Fix - Add protection for all security-sensitive tokens** (config_db.go):\n```go\nfunc overlayDBConfig(cfg *config.Config, store storage.StorageProvider) error {\n // ... 
existing code ...\n \n // Preserve ALL security-sensitive tokens from file/env config\n savedStorage := cfg.Storage\n savedAPIKey := cfg.API.Auth.APIKey\n savedWebhookSecret := cfg.AgentField.Approval.WebhookSecret\n savedAdminToken := cfg.Features.DID.Authorization.AdminToken\n savedInternalToken := cfg.Features.DID.Authorization.InternalToken\n savedConnectorToken := cfg.Features.Connector.Token\n \n // Parse and merge DB config\n var dbCfg config.Config\n if err := yaml.Unmarshal([]byte(entry.Value), &dbCfg); err != nil {\n return fmt.Errorf(\"failed to parse database config YAML: %w\", err)\n }\n mergeDBConfig(cfg, &dbCfg)\n \n // Restore all security-sensitive values (never overridden from DB)\n cfg.Storage = savedStorage\n cfg.API.Auth.APIKey = savedAPIKey\n cfg.AgentField.Approval.WebhookSecret = savedWebhookSecret\n cfg.Features.DID.Authorization.AdminToken = savedAdminToken\n cfg.Features.DID.Authorization.InternalToken = savedInternalToken\n cfg.Features.Connector.Token = savedConnectorToken\n \n // ... rest of function ...\n}\n```\n\n2. **Medium-term - Add field-level merge for DID and Approval** instead of struct-level merge to avoid accidentally merging sensitive sub-fields.\n\n3. **Long-term - Add config validation middleware** that rejects DB config updates containing modifications to security-sensitive fields, returning a 400 error with explanation.", - "tags": [ - "security", - "authentication-bypass", - "configuration", - "api-key", - "token-override", - "systemic-control-gap" - ], - "title": "Systemic DB Config Security Control Gap - Multiple Critical Tokens Unprotected" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The combination of multiple unprotected security-sensitive fields in the DB config merge logic creates a complete authentication and authorization bypass chain. 
An attacker with database write access can simultaneously inject malicious values for: (1) DID Authorization tokens (AdminToken/InternalToken) via the full-DID-struct merge at lines 87-89, (2) WebhookSecret via the full-Approval-struct merge at lines 82-84, (3) API.Auth.APIKey which is parsed by yaml.Unmarshal at line 37 but never explicitly restored, and (4) Connector.Token/Capabilities which are claimed to be protected by comment at lines 90-92 but have no actual code enforcement. This allows an attacker to: authenticate with their own API key, escalate privileges using their own AdminToken, forge approval callbacks with their own WebhookSecret, and gain unauthorized connector access with their own token. The compound effect is TOTAL SYSTEM COMPROMISE - the attacker controls all authentication, authorization, and validation mechanisms simultaneously, making this significantly more severe than any individual vulnerability.", - "confidence": 0.92, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "compound", - "dimension_name": "Compound Analysis", - "evidence": "Step 1: yaml.Unmarshal at line 37 parses ALL fields from DB-stored YAML including api.auth.api_key, features.did.authorization.admin_token, features.did.authorization.internal_token, agentfield.approval.webhook_secret, and features.connector.token. Step 2: Lines 87-89 merge entire DID struct when Method != '', overwriting Authorization.AdminToken and Authorization.InternalToken. Step 3: Lines 82-84 merge entire Approval struct when WebhookSecret != '', allowing secret replacement. Step 4: Lines 90-92 claim connector config is protected but NO code enforcement exists (unlike lines 33,45 which save/restore Storage). Step 5: Lines 94-97 only merge CORS, leaving API.Auth vulnerable to DB override. 
Step 6: The save/restore pattern at lines 33,45 proves the correct protection approach exists but is inconsistently applied.", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_014", - "line_end": 97, - "line_start": 82, - "score": 1.104, - "severity": "critical", - "suggestion": "Apply the same save/restore pattern used for Storage (lines 33,45) to ALL security-sensitive fields before calling mergeDBConfig. Specifically: (1) Save cfg.API.Auth before line 42 and restore after, (2) Save cfg.Features.DID.Authorization before line 42 and restore after, (3) Save cfg.AgentField.Approval.WebhookSecret before line 42 and restore after, (4) Save cfg.Features.Connector before line 42 and restore after. Alternatively, implement a whitelist approach where ONLY explicitly allowed non-sensitive fields can be merged from DB config.", - "tags": [ - "security", - "authentication-bypass", - "authorization-bypass", - "privilege-escalation", - "configuration-injection", - "compound-risk" - ], - "title": "Complete System Compromise via Coordinated DB Config Injection" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `mergeDBConfig` function implements an INCONSISTENT security protection pattern that creates a systemic control gap enabling total authentication bypass. While Storage config is properly protected (saved at line 33, restored at line 45), FOUR other critical security-sensitive fields are left completely unprotected:\n\n1. **API.Auth.APIKey** (lines 94-97): Comment claims 'never override API key from DB for security' but code only merges CORS settings. The APIKey parsed from DB YAML remains in dbCfg struct with no explicit clearing.\n\n2. **AgentField.Approval.WebhookSecret** (lines 82-84): Entire Approval struct is merged when WebhookSecret or DefaultExpiryHours is set in DB, overwriting file/env HMAC-SHA256 secret used for webhook verification.\n\n3. 
**Features.DID.Authorization.AdminToken/InternalToken** (lines 87-89): Entire DID struct is merged when Method is non-empty, overwriting admin and internal authentication tokens used for privileged operations and agent authentication.\n\n4. **Features.Connector.Token/Capabilities** (lines 90-92): Comment claims connector config is 'intentionally NOT merged from DB' but NO CODE ENFORCES THIS. Parsed DB values persist in dbCfg struct.\n\n**COMPOUND IMPACT - Total System Compromise:**\nAn attacker with database write access can override ALL authentication mechanisms simultaneously:\n- Set `api.auth.api_key` \u2192 Gain unauthorized API access\n- Set `agentfield.approval.webhook_secret` \u2192 Forge webhook callbacks for unauthorized approvals\n- Set `features.did.method` + `features.did.authorization.admin_token` \u2192 Perform admin operations and bypass agent authentication\n- Set `features.connector.token` \u2192 Compromise connector service integration\n\nThis is NOT four separate vulnerabilities - it is ONE SYSTEMIC CONTROL GAP where a security protection pattern exists but is inconsistently applied. The existence of proper Storage protection proves the developers understand the risk, but the same protection was omitted for other equally critical credentials.", - "confidence": 0.92, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "compound", - "dimension_name": "Compound Analysis", - "evidence": "1. **Storage protection pattern (CORRECT)**: config_db.go:33 saves `cfg.Storage` before merge, line 45 restores it after. This proves the security model exists. 2. **APIKey protection FAILURE**: config_db.go:94 comment says 'never override API key from DB' but lines 95-97 only merge CORS. No explicit clearing of dbCfg.API.Auth.APIKey. 3. **WebhookSecret override**: config_db.go:82-84 assigns entire `target.AgentField.Approval = dbCfg.AgentField.Approval` when WebhookSecret is non-empty, overwriting the file/env secret. 4. 
**DID Authorization tokens override**: config_db.go:87-89 assigns entire `target.Features.DID = dbCfg.Features.DID` when Method is non-empty. config.go:125,129 show DIDConfig.Authorization contains AdminToken and InternalToken. 5. **Connector protection COMMENT-ONLY**: config_db.go:90-92 comment claims protection but no code saves/restores `cfg.Features.Connector` like Storage. 6. **Attack vector**: All sensitive values are parsed from DB YAML at config_db.go:37 via `yaml.Unmarshal`.", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_016", - "line_end": 97, - "line_start": 32, - "score": 1.104, - "severity": "critical", - "suggestion": "Implement CONSISTENT protection for ALL security-sensitive fields. Create a systematic approach:\n\n1. **Immediate fix**: Add save/restore pattern for all sensitive fields:\n```go\n// At line 32-33, add:\nsavedAPIKey := cfg.API.Auth.APIKey\nsavedApproval := cfg.AgentField.Approval\nsavedDIDAuth := cfg.Features.DID.Authorization\nsavedConnector := cfg.Features.Connector\n\n// At line 44-45, add:\ncfg.API.Auth.APIKey = savedAPIKey\ncfg.AgentField.Approval = savedApproval\ncfg.Features.DID.Authorization = savedDIDAuth\ncfg.Features.Connector = savedConnector\n```\n\n2. **Better fix**: Refactor mergeDBConfig to use field-by-field merging for sensitive structs instead of whole-struct assignment. Only merge non-sensitive fields individually.\n\n3. 
**Best fix**: Add a comprehensive test that verifies NO sensitive credentials can be overridden from DB config by attempting to inject malicious values for all security-sensitive fields.", - "tags": [ - "security", - "authentication-bypass", - "configuration-management", - "systemic-vulnerability", - "db-config-override", - "total-compromise", - "inconsistent-protection" - ], - "title": "Systemic DB Config Security Control Gap Enables Total Authentication Bypass" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The codebase demonstrates a systemic control gap where the correct pattern for protecting security-sensitive configuration fields exists but is inconsistently applied. The save/restore pattern at lines 33,45 correctly protects Storage config from DB override (addressing the bootstrap problem), but this same pattern is NOT applied to other equally sensitive fields: API.Auth (controlling API authentication), Features.DID.Authorization (controlling admin/internal tokens), AgentField.Approval (controlling webhook secrets), and Features.Connector (controlling service tokens). This pattern inconsistency indicates a missing security control in the development process - the Storage protection was implemented as a one-off fix rather than establishing a comprehensive security rule. The presence of comments at lines 90-92 and 94 claiming protection exists (without code enforcement) further suggests confusion about what is actually protected. This systemic gap means future security-sensitive fields are likely to be similarly vulnerable.", - "confidence": 0.88, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "compound", - "dimension_name": "Compound Analysis", - "evidence": "Step 1: Lines 33,45 show the correct save/restore pattern: `savedStorage := cfg.Storage` before merge and `cfg.Storage = savedStorage` after merge. Step 2: Lines 87-89, 82-84 show entire struct assignment for DID and Approval without field-level protection. 
Step 3: Lines 94-97 show comment claiming API key protection but only CORS is actually protected. Step 4: Lines 90-92 show comment claiming connector protection but NO corresponding code. Step 5: The pattern inconsistency spans 4 different security-sensitive fields across lines 82-97, indicating a missing systematic approach.", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_015", - "line_end": 45, - "line_start": 32, - "score": 1.056, - "severity": "critical", - "suggestion": "Establish a comprehensive security policy for DB config merging: (1) Create an explicit allowlist of fields that CAN be merged from DB, default-deny all others, (2) Document the save/restore pattern requirement in code comments and developer documentation, (3) Add unit tests that verify each security-sensitive field cannot be overridden from DB config, (4) Consider creating a helper function `preserveSecurityFields(cfg *Config) (restore func())` that automatically saves and returns a restore function for all sensitive fields, ensuring consistency.", - "tags": [ - "security", - "systemic-control-gap", - "configuration-security", - "defense-in-depth", - "pattern-consistency" - ], - "title": "Systemic Control Gap: Inconsistent Application of Security-Sensitive Field Protection" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `mergeDBConfig` function only merges `API.CORS` settings (lines 94-97) but completely ignores `API.Auth.APIKey`. This means the API authentication key is left vulnerable to being set/overridden from DB config through struct assignment elsewhere or future code changes.\n\n**The vulnerability:** While the current code doesn't explicitly merge `API.Auth`, the struct can still receive values from DB config parsing. The YAML unmarshaling at line 37 populates `dbCfg` with ALL values from DB-stored YAML, including `api.auth.api_key`. 
Since there's no explicit preservation of `API.Auth.APIKey` like there is for `Storage` (lines 33, 45), this sensitive credential could be overridden.\n\n**Security impact:**\n- `API.Auth.APIKey` controls access to the entire AgentField API\n- If an attacker can set this via DB config, they can authenticate to the API with their own key\n- This bypasses any file/env-based API key configuration\n\n**The comment at line 94** says \"API settings (but never override API key from DB for security)\" but this protection is NOT actually implemented in the code.", - "confidence": 0.85, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "security-field-protection", - "dimension_name": "Security-Sensitive Field Protection in DB Config Merge", - "evidence": "Step 1: config_db.go:94-97 shows only CORS is merged, comment says API key should not be overridden but no code enforces this. Step 2: config.go:207-212 shows AuthConfig contains APIKey (line 209). Step 3: yaml.Unmarshal at config_db.go:37 parses ALL fields from DB YAML including api.auth.api_key. Step 4: Since mergeDBConfig doesn't explicitly handle API.Auth fields, the dbCfg value could persist if the field exists in DB YAML.", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_003", - "line_end": 97, - "line_start": 94, - "score": 1.02, - "severity": "critical", - "suggestion": "Add explicit protection for `API.Auth.APIKey` similar to how `Storage` is protected. 
Before calling `mergeDBConfig`, save the API key and restore it after:\n\n```go\n// At line 32-33, add:\nsavedAPIKey := cfg.API.Auth.APIKey\n\n// At line 44-45, add:\ncfg.API.Auth.APIKey = savedAPIKey\n```\n\nAlternatively, explicitly set it in mergeDBConfig if it was preserved elsewhere.", - "tags": [ - "security", - "api-key", - "authentication", - "configuration" - ], - "title": "API.Auth.APIKey can be overridden from DB config - no protection implemented" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The SetConfig handler at lines 67-101 accepts raw YAML/text body and stores it directly in the database without any validation that it parses as valid YAML or conforms to the expected config schema.\n\n**Why this is a problem:**\n1. Invalid YAML can be stored via `PUT /api/v1/configs/agentfield.yaml`\n2. On next server startup with `AGENTFIELD_CONFIG_SOURCE=db`, `overlayDBConfig` calls `yaml.Unmarshal` which fails\n3. The error is only logged as a warning (server.go:110), so startup continues with potentially partial/inconsistent config\n4. 
This creates a broken state that's hard to recover from - operators must manually delete the invalid config via API or DB edit\n\n**Attack scenario:** A malicious actor or buggy client could store malformed YAML, breaking config reloads until manual intervention.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "yaml-validation-gap", - "dimension_name": "YAML Validation Gap in SetConfig Handler", - "evidence": "Step 1: HTTP PUT /api/v1/configs/agentfield.yaml -> SetConfig handler (config_storage.go:67)\nStep 2: Handler reads body with io.ReadAll (line 70), stores directly via storage.SetConfig (line 85)\nStep 3: No validation performed - body stored as raw string\nStep 4: On server restart with AGENTFIELD_CONFIG_SOURCE=db, overlayDBConfig (config_db.go:19) reads entry\nStep 5: yaml.Unmarshal (config_db.go:37) attempts to parse stored value\nStep 6: If stored value is invalid YAML (e.g., 'invalid: [unclosed'), unmarshal fails\nStep 7: Error returned at config_db.go:38, logged as warning at server.go:110\nStep 8: Server continues startup with partial/inconsistent configuration", - "file_path": "control-plane/internal/handlers/config_storage.go", - "id": "f_000", - "line_end": 101, - "line_start": 67, - "score": 0.798, - "severity": "important", - "suggestion": "Add YAML validation before storing in SetConfig. Parse the body with `yaml.Unmarshal` into a temporary config struct to verify it's valid YAML and conforms to the schema. Return 400 Bad Request with details if validation fails. 
Additionally, consider adding a dedicated `/configs/validate` endpoint for dry-run validation before apply.", - "tags": [ - "yaml", - "validation", - "config", - "data-integrity" - ], - "title": "SetConfig handler stores invalid YAML without validation" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `AgentField.Approval` struct is merged entirely from DB config when `WebhookSecret` or `DefaultExpiryHours` is non-zero (lines 82-84). This includes `WebhookSecret`, which is a security-sensitive HMAC-SHA256 secret used for verifying webhook callbacks.\n\n**The vulnerability:**\n- `WebhookSecret` is used to authenticate incoming webhooks (config.go:47)\n- If an attacker can set this via DB config, they can forge webhook callbacks\n- This could allow unauthorized approval actions or other webhook-triggered operations\n\n**Current behavior:**\n- Lines 82-84 merge the entire `Approval` struct if either field is set in DB\n- This overwrites the file/env `WebhookSecret` with DB value\n- No preservation of the original secret like `Storage` has", - "confidence": 0.85, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "security-field-protection", - "dimension_name": "Security-Sensitive Field Protection in DB Config Merge", - "evidence": "Step 1: config_db.go:82-84 merges entire Approval struct if WebhookSecret or DefaultExpiryHours is non-empty. Step 2: config.go:46-49 shows ApprovalConfig contains WebhookSecret (line 47) described as 'HMAC-SHA256 secret for verifying webhook callbacks'. Step 3: Entire struct assignment overwrites all fields including the secret.", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_005", - "line_end": 84, - "line_start": 82, - "score": 0.714, - "severity": "important", - "suggestion": "Add explicit protection for `AgentField.Approval.WebhookSecret` by saving it before merge and restoring after, similar to Storage protection. 
Or merge only non-sensitive fields individually instead of assigning the entire struct.", - "tags": [ - "security", - "webhook", - "secret", - "configuration" - ], - "title": "Approval.WebhookSecret can be overridden from DB config" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "Lines 90-92 contain a comment stating \"Connector config (token, capabilities) is intentionally NOT merged from DB. These are security-sensitive and must come from file/env config\". However, this is only a comment - there is NO actual code enforcement of this protection.\n\n**The issue:**\n1. The comment suggests connector token and capabilities are protected like storage config\n2. However, unlike lines 33 and 45 which explicitly save/restore `cfg.Storage`, there is NO corresponding save/restore for `cfg.Features.Connector`\n3. If DB config contains `features.connector.token` or `features.connector.capabilities`, these values WILL be parsed into `dbCfg` at line 37\n4. While the current `mergeDBConfig` doesn't explicitly merge Connector fields, future modifications could inadvertently enable this\n\n**Recommendation:** Either implement the protection (like Storage) or remove the misleading comment.", - "confidence": 0.8, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "security-field-protection", - "dimension_name": "Security-Sensitive Field Protection in DB Config Merge", - "evidence": "Step 1: config_db.go:90-92 comment claims connector config is NOT merged for security. Step 2: config_db.go:33,45 shows Storage is saved before merge and restored after - the pattern for security-sensitive fields. Step 3: No corresponding save/restore exists for cfg.Features.Connector. 
Step 4: config.go:87-91 shows ConnectorConfig contains Token (line 89) - a security-sensitive field.", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_004", - "line_end": 92, - "line_start": 90, - "score": 0.672, - "severity": "important", - "suggestion": "Add explicit protection for Connector config similar to Storage:\n\n```go\n// At line 32-33, add:\nsavedConnector := cfg.Features.Connector\n\n// At line 44-45, add:\ncfg.Features.Connector = savedConnector\n```\n\nOr if the comment is incorrect, update it to reflect actual behavior.", - "tags": [ - "security", - "connector", - "token", - "documentation", - "configuration" - ], - "title": "Comment claims connector token/capabilities are excluded but no enforcement in code" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "Both MockStorageProvider implementations (execute_test.go and ui/config_test.go) have been verified to correctly implement the updated StorageProvider interface for configuration storage methods.\n\nThe mock implementations match the interface definition at storage.go:133-136:\n- SetConfig: signature with value string and updatedBy string parameters \u2713\n- GetConfig: returns (*storage.ConfigEntry, error) \u2713\n- ListConfigs: returns ([]*storage.ConfigEntry, error) \u2713\n- DeleteConfig: signature with key string parameter \u2713", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "mock-compliance-001", - "dimension_name": "MockStorageProvider Interface Compliance in execute_test.go", - "evidence": "execute_test.go lines 174-185: All four config methods implemented with correct signatures matching storage.go:133-136\nui/config_test.go lines 289-313: All four config methods implemented with correct signatures", - "file_path": "control-plane/internal/handlers/execute_test.go", - "id": "f_011", - "line_end": 185, - "line_start": 174, - "score": 0.342, - "severity": "info", - "suggestion": "No changes required. 
The mock implementations are already compliant with the updated interface.", - "tags": [ - "mock", - "interface-compliance", - "config-storage", - "tests" - ], - "title": "MockStorageProvider Correctly Implements Updated StorageProvider Interface" - } - ], - "metadata": { - "agent_invocations": 20, - "anatomy": { - "blast_radius": [], - "clusters": [ - { - "description": "", - "files": [ - "control-plane/config/agentfield.yaml" - ], - "id": "cluster_0", - "name": "control-plane/config", - "primary_language": "yaml" - }, - { - "description": "", - "files": [ - "control-plane/internal/handlers/config_storage.go" - ], - "id": "cluster_1", - "name": "control-plane/internal/handlers", - "primary_language": "go" - }, - { - "description": "", - "files": [ - "control-plane/internal/server/config_db.go", - "control-plane/internal/server/server.go", - "control-plane/internal/server/server_routes_test.go" - ], - "id": "cluster_2", - "name": "control-plane/internal/server", - "primary_language": "go" - }, - { - "description": "", - "files": [ - "control-plane/internal/storage/local.go", - "control-plane/internal/storage/migrations.go", - "control-plane/internal/storage/models.go", - "control-plane/internal/storage/storage.go" - ], - "id": "cluster_3", - "name": "control-plane/internal/storage", - "primary_language": "go" - }, - { - "description": "", - "files": [ - "control-plane/migrations/028_create_config_storage.sql" - ], - "id": "cluster_4", - "name": "control-plane/migrations", - "primary_language": "sql" - } - ], - "context_notes": "This PR is part of a multi-PR feature involving: 1) This control plane PR (config storage backend), 2) Connector PR (config_management capability), 3) hax-sdk PR (config editor UI). The feature enables SaaS-style remote configuration management where a central connector can push config to multiple control plane instances. 
The bootstrap safety mechanism (preserving storage section) is critical because the DB connection parameters cannot come from the DB itself.", - "dependency_graph": {}, - "files": [ - { - "hunks": [ - { - "content": " enabled: true\n observability_config:\n enabled: false\n+ config_management:\n+ enabled: true\n+ read_only: false", - "header": "@@ -146,3 +146,6 @@ features:", - "new_count": 6, - "new_start": 146, - "old_count": 3, - "old_start": 146 - } - ], - "language": "yaml", - "lines_added": 3, - "lines_removed": 0, - "path": "control-plane/config/agentfield.yaml", - "status": "modified" - }, - { - "hunks": [ - { - "content": "+package handlers\n+\n+import (\n+\t\"io\"\n+\t\"net/http\"\n+\n+\t\"github.com/Agent-Field/agentfield/control-plane/internal/storage\"\n+\t\"github.com/gin-gonic/gin\"\n+)\n+\n+// maxConfigBodySize is the maximum allowed size for a config body (1 MB).\n+// Prevents DoS via unbounded request body reads.\n+const maxConfigBodySize = 1 << 20 // 1 MB\n+\n+// ConfigReloadFunc is called to reload configuration from the database.\n+type ConfigReloadFunc func() error\n+\n+// ConfigStorageHandlers provides HTTP handlers for database-backed configuration.\n+type ConfigStorageHandlers struct {\n+\tstorage storage.StorageProvider\n+\treloadFn ConfigReloadFunc\n+}\n+\n+// NewConfigStorageHandlers creates a new ConfigStorageHandlers instance.\n+func NewConfigStorageHandlers(store storage.StorageProvider, reloadFn ConfigReloadFunc) *ConfigStorageHandlers {\n+\treturn &ConfigStorageHandlers{storage: store, reloadFn: reloadFn}\n+}\n+\n+// RegisterRoutes registers config storage routes on the given router group.\n+func (h *ConfigStorageHandlers) RegisterRoutes(group *gin.RouterGroup) {\n+\tgroup.GET(\"/configs\", h.ListConfigs)\n+\tgroup.GET(\"/configs/:key\", h.GetConfig)\n+\tgroup.PUT(\"/configs/:key\", h.SetConfig)\n+\tgroup.DELETE(\"/configs/:key\", h.DeleteConfig)\n+\tgroup.POST(\"/configs/reload\", h.ReloadConfig)\n+}\n+\n+// ListConfigs returns all 
stored configuration entries.\n+func (h *ConfigStorageHandlers) ListConfigs(c *gin.Context) {\n+\tentries, err := h.storage.ListConfigs(c.Request.Context())\n+\tif err != nil {\n+\t\tc.JSON(http.StatusInternalServerError, gin.H{\"error\": err.Error()})\n+\t\treturn\n+\t}\n+\tif entries == nil {\n+\t\tentries = []*storage.ConfigEntry{}\n+\t}\n+\tc.JSON(http.StatusOK, gin.H{\n+\t\t\"configs\": entries,\n+\t\t\"total\": len(entries),\n+\t})\n+}\n+\n+// GetConfig returns a specific configuration entry by key.\n+func (h *ConfigStorageHandlers) GetConfig(c *gin.Context) {\n+\tkey := c.Param(\"key\")\n+\tentry, err := h.storage.GetConfig(c.Request.Context(), key)\n+\tif err != nil {\n+\t\tc.JSON(http.StatusInternalServerError, gin.H{\"error\": err.Error()})\n+\t\treturn\n+\t}\n+\tif entry == nil {\n+\t\tc.JSON(http.StatusNotFound, gin.H{\"error\": \"config not found\", \"key\": key})\n+\t\treturn\n+\t}\n+\tc.JSON(http.StatusOK, entry)\n+}\n+\n+// SetConfig creates or updates a configuration entry.\n+// Accepts raw YAML/text body as the config value.\n+func (h *ConfigStorageHandlers) SetConfig(c *gin.Context) {\n+\tkey := c.Param(\"key\")\n+\n+\tbody, err := io.ReadAll(io.LimitReader(c.Request.Body, maxConfigBodySize+1))\n+\tif err != nil {\n+\t\tc.JSON(http.StatusBadRequest, gin.H{\"error\": \"failed to read request body\"})\n+\t\treturn\n+\t}\n+\tif len(body) == 0 {\n+\t\tc.JSON(http.StatusBadRequest, gin.H{\"error\": \"request body is empty\"})\n+\t\treturn\n+\t}\n+\tif len(body) > maxConfigBodySize {\n+\t\tc.JSON(http.StatusRequestEntityTooLarge, gin.H{\n+\t\t\t\"error\": \"config body exceeds maximum size\",\n+\t\t\t\"max\": maxConfigBodySize,\n+\t\t})\n+\t\treturn\n+\t}\n+\n+\tupdatedBy := c.GetHeader(\"X-Updated-By\")\n+\tif updatedBy == \"\" {\n+\t\tupdatedBy = \"api\"\n+\t}\n+\n+\tif err := h.storage.SetConfig(c.Request.Context(), key, string(body), updatedBy); err != nil {\n+\t\tc.JSON(http.StatusInternalServerError, gin.H{\"error\": 
err.Error()})\n+\t\treturn\n+\t}\n+\n+\t// Return the saved entry\n+\tentry, err := h.storage.GetConfig(c.Request.Context(), key)\n+\tif err != nil {\n+\t\tc.JSON(http.StatusInternalServerError, gin.H{\"error\": err.Error()})\n+\t\treturn\n+\t}\n+\n+\tc.JSON(http.StatusOK, gin.H{\n+\t\t\"message\": \"config saved\",\n+\t\t\"config\": entry,\n+\t})\n+}\n+\n+// DeleteConfig removes a configuration entry by key.\n+func (h *ConfigStorageHandlers) DeleteConfig(c *gin.Context) {\n+\tkey := c.Param(\"key\")\n+\tif err := h.storage.DeleteConfig(c.Request.Context(), key); err != nil {\n+\t\tc.JSON(http.StatusNotFound, gin.H{\"error\": err.Error()})\n+\t\treturn\n+\t}\n+\tc.JSON(http.StatusOK, gin.H{\"message\": \"config deleted\", \"key\": key})\n+}\n+\n+// ReloadConfig triggers a hot-reload of configuration from the database.\n+func (h *ConfigStorageHandlers) ReloadConfig(c *gin.Context) {\n+\tif h.reloadFn == nil {\n+\t\tc.JSON(http.StatusServiceUnavailable, gin.H{\n+\t\t\t\"error\": \"config reload not available (AGENTFIELD_CONFIG_SOURCE != db)\",\n+\t\t})\n+\t\treturn\n+\t}\n+\tif err := h.reloadFn(); err != nil {\n+\t\tc.JSON(http.StatusInternalServerError, gin.H{\n+\t\t\t\"error\": \"config reload failed\",\n+\t\t\t\"details\": err.Error(),\n+\t\t})\n+\t\treturn\n+\t}\n+\tc.JSON(http.StatusOK, gin.H{\"message\": \"config reloaded from database\"})\n+}", - "header": "@@ -0,0 +1,140 @@", - "new_count": 140, - "new_start": 1, - "old_count": 0, - "old_start": 0 - } - ], - "language": "go", - "lines_added": 140, - "lines_removed": 0, - "path": "control-plane/internal/handlers/config_storage.go", - "status": "added" - }, - { - "hunks": [ - { - "content": "+package server\n+\n+import (\n+\t\"context\"\n+\t\"fmt\"\n+\t\"time\"\n+\n+\t\"github.com/Agent-Field/agentfield/control-plane/internal/config\"\n+\t\"github.com/Agent-Field/agentfield/control-plane/internal/storage\"\n+\t\"gopkg.in/yaml.v3\"\n+)\n+\n+const dbConfigKey = \"agentfield.yaml\"\n+\n+// overlayDBConfig loads 
config from the database and merges it into the\n+// existing config. The storage section is preserved from the original config\n+// to avoid the bootstrap problem (DB connection settings can't come from DB).\n+// Precedence: env vars > DB config > file config > defaults.\n+func overlayDBConfig(cfg *config.Config, store storage.StorageProvider) error {\n+\tctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)\n+\tdefer cancel()\n+\n+\tentry, err := store.GetConfig(ctx, dbConfigKey)\n+\tif err != nil {\n+\t\treturn fmt.Errorf(\"failed to read config from database: %w\", err)\n+\t}\n+\tif entry == nil {\n+\t\tfmt.Println(\"[config] No database config found (key: agentfield.yaml), using file/env config only.\")\n+\t\treturn nil\n+\t}\n+\n+\t// Preserve the storage config \u2014 it must always come from file/env (bootstrap)\n+\tsavedStorage := cfg.Storage\n+\n+\t// Parse the DB-stored YAML into a config struct\n+\tvar dbCfg config.Config\n+\tif err := yaml.Unmarshal([]byte(entry.Value), &dbCfg); err != nil {\n+\t\treturn fmt.Errorf(\"failed to parse database config YAML: %w\", err)\n+\t}\n+\n+\t// Overlay non-zero DB values onto the existing config\n+\tmergeDBConfig(cfg, &dbCfg)\n+\n+\t// Restore storage config (never overridden from DB)\n+\tcfg.Storage = savedStorage\n+\n+\tfmt.Printf(\"[config] Loaded config from database (key: %s, version: %d, updated: %s)\\n\",\n+\t\tentry.Key, entry.Version, entry.UpdatedAt.Format(time.RFC3339))\n+\treturn nil\n+}\n+\n+// mergeDBConfig selectively merges DB config values into the target config.\n+// Only non-zero/non-empty values from the DB config are applied.\n+func mergeDBConfig(target, dbCfg *config.Config) {\n+\t// AgentField settings\n+\tif dbCfg.AgentField.Port != 0 {\n+\t\ttarget.AgentField.Port = dbCfg.AgentField.Port\n+\t}\n+\tif dbCfg.AgentField.NodeHealth.CheckInterval != 0 {\n+\t\ttarget.AgentField.NodeHealth = dbCfg.AgentField.NodeHealth\n+\t}\n+\t// Merge execution cleanup field-by-field to avoid 
zeroing out unset fields\n+\tif dbCfg.AgentField.ExecutionCleanup.RetentionPeriod != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.RetentionPeriod = dbCfg.AgentField.ExecutionCleanup.RetentionPeriod\n+\t}\n+\tif dbCfg.AgentField.ExecutionCleanup.CleanupInterval != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.CleanupInterval = dbCfg.AgentField.ExecutionCleanup.CleanupInterval\n+\t}\n+\tif dbCfg.AgentField.ExecutionCleanup.BatchSize != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.BatchSize = dbCfg.AgentField.ExecutionCleanup.BatchSize\n+\t}\n+\tif dbCfg.AgentField.ExecutionCleanup.PreserveRecentDuration != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.PreserveRecentDuration = dbCfg.AgentField.ExecutionCleanup.PreserveRecentDuration\n+\t}\n+\tif dbCfg.AgentField.ExecutionCleanup.StaleExecutionTimeout != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.StaleExecutionTimeout = dbCfg.AgentField.ExecutionCleanup.StaleExecutionTimeout\n+\t}\n+\t// Enabled is a bool \u2014 only override if cleanup config is present in DB at all\n+\tif dbCfg.AgentField.ExecutionCleanup.RetentionPeriod != 0 || dbCfg.AgentField.ExecutionCleanup.CleanupInterval != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.Enabled = dbCfg.AgentField.ExecutionCleanup.Enabled\n+\t}\n+\tif dbCfg.AgentField.Approval.WebhookSecret != \"\" || dbCfg.AgentField.Approval.DefaultExpiryHours != 0 {\n+\t\ttarget.AgentField.Approval = dbCfg.AgentField.Approval\n+\t}\n+\n+\t// Features\n+\tif dbCfg.Features.DID.Method != \"\" {\n+\t\ttarget.Features.DID = dbCfg.Features.DID\n+\t}\n+\t// NOTE: Connector config (token, capabilities) is intentionally NOT merged\n+\t// from DB. 
These are security-sensitive and must come from file/env config,\n+\t// similar to how storage config is protected from the bootstrap problem.\n+\n+\t// API settings (but never override API key from DB for security)\n+\tif len(dbCfg.API.CORS.AllowedOrigins) > 0 {\n+\t\ttarget.API.CORS = dbCfg.API.CORS\n+\t}\n+\n+\t// UI settings\n+\tif dbCfg.UI.Mode != \"\" {\n+\t\ttarget.UI = dbCfg.UI\n+\t}\n+}", - "header": "@@ -0,0 +1,103 @@", - "new_count": 103, - "new_start": 1, - "old_count": 0, - "old_start": 0 - } - ], - "language": "go", - "lines_added": 103, - "lines_removed": 0, - "path": "control-plane/internal/server/config_db.go", - "status": "added" - }, - { - "hunks": [ - { - "content": " \t\"path/filepath\"\n \t\"strconv\"\n \t\"strings\"\n+\t\"sync\"\n \t\"time\"\n \n \t\"github.com/Agent-Field/agentfield/control-plane/internal/config\"", - "header": "@@ -13,6 +13,7 @@ import (", - "new_count": 7, - "new_start": 13, - "old_count": 6, - "old_start": 13 - }, - { - "content": " \tadminGRPCPort int\n \twebhookDispatcher services.WebhookDispatcher\n \tobservabilityForwarder services.ObservabilityForwarder\n+\tconfigMu sync.RWMutex\n }\n \n // NewAgentFieldServer creates a new instance of the AgentFieldServer.", - "header": "@@ -79,6 +80,7 @@ type AgentFieldServer struct {", - "new_count": 7, - "new_start": 80, - "old_count": 6, - "old_start": 79 - }, - { - "content": " \t\treturn nil, err\n \t}\n \n+\t// Overlay database-stored config if AGENTFIELD_CONFIG_SOURCE=db\n+\tif src := os.Getenv(\"AGENTFIELD_CONFIG_SOURCE\"); src == \"db\" {\n+\t\tif err := overlayDBConfig(cfg, storageProvider); err != nil {\n+\t\t\tfmt.Printf(\"Warning: failed to load config from database: %v\\n\", err)\n+\t\t}\n+\t}\n+\n \tRouter := gin.Default()\n \n \t// Sync installed.yaml to database for package visibility", - "header": "@@ -104,6 +106,13 @@ func NewAgentFieldServer(cfg *config.Config) (*AgentFieldServer, error) {", - "new_count": 13, - "new_start": 106, - "old_count": 6, - "old_start": 
104 - }, - { - "content": " \t}, nil\n }\n \n+// configReloadFn returns a function that reloads config from the database,\n+// or nil if AGENTFIELD_CONFIG_SOURCE is not set to \"db\".\n+// The returned function acquires configMu to prevent data races with\n+// concurrent readers of s.config.\n+func (s *AgentFieldServer) configReloadFn() handlers.ConfigReloadFunc {\n+\tif src := os.Getenv(\"AGENTFIELD_CONFIG_SOURCE\"); src != \"db\" {\n+\t\treturn nil\n+\t}\n+\treturn func() error {\n+\t\ts.configMu.Lock()\n+\t\tdefer s.configMu.Unlock()\n+\t\treturn overlayDBConfig(s.config, s.storage)\n+\t}\n+}\n+\n // Start initializes and starts the AgentFieldServer.\n func (s *AgentFieldServer) Start() error {\n \t// Setup routes", - "header": "@@ -423,6 +432,21 @@ func NewAgentFieldServer(cfg *config.Config) (*AgentFieldServer, error) {", - "new_count": 21, - "new_start": 432, - "old_count": 6, - "old_start": 423 - }, - { - "content": " \t\t\tlogger.Logger.Info().Msg(\"\ud83d\udccb Authorization admin routes registered\")\n \t\t}\n \n+\t\t// Config storage routes (admin-authenticated)\n+\t\t{\n+\t\t\tconfigHandlers := handlers.NewConfigStorageHandlers(s.storage, s.configReloadFn())\n+\t\t\tconfigHandlers.RegisterRoutes(agentAPI)\n+\t\t\tlogger.Logger.Info().Msg(\"Config storage routes registered\")\n+\t\t}\n+\n \t\t// Connector routes (authenticated with separate connector token)\n \t\tif s.config.Features.Connector.Enabled && s.config.Features.Connector.Token != \"\" {\n \t\t\tconnectorGroup := agentAPI.Group(\"/connector\")", - "header": "@@ -1529,6 +1553,13 @@ func (s *AgentFieldServer) setupRoutes() {", - "new_count": 13, - "new_start": 1553, - "old_count": 6, - "old_start": 1529 - }, - { - "content": " \t\t\t)\n \t\t\tconnectorHandlers.RegisterRoutes(connectorGroup)\n \n+\t\t\t// Config management routes for connector\n+\t\t\tconfigGroup := connectorGroup.Group(\"\")\n+\t\t\tconfigGroup.Use(middleware.ConnectorCapabilityCheck(\"config_management\", 
s.config.Features.Connector.Capabilities))\n+\t\t\t{\n+\t\t\t\tconfigHandlers := handlers.NewConfigStorageHandlers(s.storage, s.configReloadFn())\n+\t\t\t\tconfigHandlers.RegisterRoutes(configGroup)\n+\t\t\t}\n+\n \t\t\tlogger.Logger.Info().Msg(\"\ud83d\udd0c Connector routes registered\")\n \t\t}\n \t}", - "header": "@@ -1544,6 +1575,14 @@ func (s *AgentFieldServer) setupRoutes() {", - "new_count": 14, - "new_start": 1575, - "old_count": 6, - "old_start": 1544 - } - ], - "language": "go", - "lines_added": 39, - "lines_removed": 0, - "path": "control-plane/internal/server/server.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": " }\n \n // Configuration\n-func (s *stubStorage) SetConfig(ctx context.Context, key string, value interface{}) error { return nil }\n-func (s *stubStorage) GetConfig(ctx context.Context, key string) (interface{}, error) {\n+func (s *stubStorage) SetConfig(ctx context.Context, key string, value string, updatedBy string) error {\n+\treturn nil\n+}\n+func (s *stubStorage) GetConfig(ctx context.Context, key string) (*storage.ConfigEntry, error) {\n+\treturn nil, nil\n+}\n+func (s *stubStorage) ListConfigs(ctx context.Context) ([]*storage.ConfigEntry, error) {\n \treturn nil, nil\n }\n+func (s *stubStorage) DeleteConfig(ctx context.Context, key string) error { return nil }\n \n // Reasoner Performance and History\n func (s *stubStorage) GetReasonerPerformanceMetrics(ctx context.Context, reasonerID string) (*types.ReasonerPerformanceMetrics, error) {", - "header": "@@ -230,10 +230,16 @@ func (s *stubStorage) ListAgentGroups(ctx context.Context, teamID string) ([]typ", - "new_count": 16, - "new_start": 230, - "old_count": 10, - "old_start": 230 - } - ], - "language": "go", - "lines_added": 8, - "lines_removed": 2, - "path": "control-plane/internal/server/server_routes_test.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": " \treturn nil\n }\n \n-// SetConfig stores a configuration key-value pair in SQLite.\n-func 
(ls *LocalStorage) SetConfig(ctx context.Context, key string, value interface{}) error {\n-\t// Fast-fail if context is already cancelled\n+// SetConfig upserts a configuration entry in the database.\n+// On conflict (duplicate key), it increments the version and updates the value.\n+func (ls *LocalStorage) SetConfig(ctx context.Context, key string, value string, updatedBy string) error {\n \tif err := ctx.Err(); err != nil {\n \t\treturn err\n \t}\n \n-\t// TODO: Implement configuration storage in SQLite\n-\treturn fmt.Errorf(\"SetConfig not yet implemented for LocalStorage\")\n+\tdb := ls.requireSQLDB()\n+\tnow := time.Now().UTC()\n+\n+\tif ls.mode == \"postgres\" {\n+\t\t_, err := db.ExecContext(ctx, `\n+\t\t\tINSERT INTO config_storage (key, value, version, created_by, updated_by, created_at, updated_at)\n+\t\t\tVALUES ($1, $2, 1, $3, $3, $4, $4)\n+\t\t\tON CONFLICT (key) DO UPDATE SET\n+\t\t\t\tvalue = EXCLUDED.value,\n+\t\t\t\tversion = config_storage.version + 1,\n+\t\t\t\tupdated_by = EXCLUDED.updated_by,\n+\t\t\t\tupdated_at = EXCLUDED.updated_at`,\n+\t\t\tkey, value, updatedBy, now)\n+\t\treturn err\n+\t}\n+\n+\t// SQLite\n+\t_, err := db.ExecContext(ctx, `\n+\t\tINSERT INTO config_storage (key, value, version, created_by, updated_by, created_at, updated_at)\n+\t\tVALUES (?, ?, 1, ?, ?, ?, ?)\n+\t\tON CONFLICT (key) DO UPDATE SET\n+\t\t\tvalue = excluded.value,\n+\t\t\tversion = config_storage.version + 1,\n+\t\t\tupdated_by = excluded.updated_by,\n+\t\t\tupdated_at = excluded.updated_at`,\n+\t\tkey, value, updatedBy, updatedBy, now, now)\n+\treturn err\n }\n \n-// GetConfig retrieves a configuration value from SQLite by key.\n-func (ls *LocalStorage) GetConfig(ctx context.Context, key string) (interface{}, error) {\n-\t// Fast-fail if context is already cancelled\n+// GetConfig retrieves a configuration entry by key.\n+func (ls *LocalStorage) GetConfig(ctx context.Context, key string) (*ConfigEntry, error) {\n+\tif err := ctx.Err(); err != nil 
{\n+\t\treturn nil, err\n+\t}\n+\n+\tdb := ls.requireSQLDB()\n+\tvar entry ConfigEntry\n+\n+\tvar placeholder string\n+\tif ls.mode == \"postgres\" {\n+\t\tplaceholder = \"$1\"\n+\t} else {\n+\t\tplaceholder = \"?\"\n+\t}\n+\n+\trow := db.QueryRowContext(ctx,\n+\t\tfmt.Sprintf(`SELECT key, value, version, COALESCE(created_by, ''), COALESCE(updated_by, ''), created_at, updated_at\n+\t\tFROM config_storage WHERE key = %s`, placeholder), key)\n+\n+\terr := row.Scan(&entry.Key, &entry.Value, &entry.Version,\n+\t\t&entry.CreatedBy, &entry.UpdatedBy, &entry.CreatedAt, &entry.UpdatedAt)\n+\tif err != nil {\n+\t\tif errors.Is(err, sql.ErrNoRows) {\n+\t\t\treturn nil, nil\n+\t\t}\n+\t\treturn nil, fmt.Errorf(\"failed to get config %q: %w\", key, err)\n+\t}\n+\treturn &entry, nil\n+}\n+\n+// ListConfigs returns all stored configuration entries.\n+func (ls *LocalStorage) ListConfigs(ctx context.Context) ([]*ConfigEntry, error) {\n \tif err := ctx.Err(); err != nil {\n \t\treturn nil, err\n \t}\n \n-\t// TODO: Implement configuration retrieval from SQLite\n-\treturn nil, fmt.Errorf(\"GetConfig not yet implemented for LocalStorage\")\n+\tdb := ls.requireSQLDB()\n+\trows, err := db.QueryContext(ctx,\n+\t\t`SELECT key, value, version, COALESCE(created_by, ''), COALESCE(updated_by, ''), created_at, updated_at\n+\t\tFROM config_storage ORDER BY key`)\n+\tif err != nil {\n+\t\treturn nil, fmt.Errorf(\"failed to list configs: %w\", err)\n+\t}\n+\tdefer rows.Close()\n+\n+\tvar entries []*ConfigEntry\n+\tfor rows.Next() {\n+\t\tvar entry ConfigEntry\n+\t\tif err := rows.Scan(&entry.Key, &entry.Value, &entry.Version,\n+\t\t\t&entry.CreatedBy, &entry.UpdatedBy, &entry.CreatedAt, &entry.UpdatedAt); err != nil {\n+\t\t\treturn nil, fmt.Errorf(\"failed to scan config row: %w\", err)\n+\t\t}\n+\t\tentries = append(entries, &entry)\n+\t}\n+\treturn entries, rows.Err()\n+}\n+\n+// DeleteConfig removes a configuration entry by key.\n+func (ls *LocalStorage) DeleteConfig(ctx context.Context, key 
string) error {\n+\tif err := ctx.Err(); err != nil {\n+\t\treturn err\n+\t}\n+\n+\tdb := ls.requireSQLDB()\n+\tvar placeholder string\n+\tif ls.mode == \"postgres\" {\n+\t\tplaceholder = \"$1\"\n+\t} else {\n+\t\tplaceholder = \"?\"\n+\t}\n+\n+\tresult, err := db.ExecContext(ctx,\n+\t\tfmt.Sprintf(`DELETE FROM config_storage WHERE key = %s`, placeholder), key)\n+\tif err != nil {\n+\t\treturn fmt.Errorf(\"failed to delete config %q: %w\", key, err)\n+\t}\n+\trows, _ := result.RowsAffected()\n+\tif rows == 0 {\n+\t\treturn fmt.Errorf(\"config %q not found\", key)\n+\t}\n+\treturn nil\n }\n \n // SubscribeToMemoryChanges implements the StorageProvider SubscribeToMemoryChanges method using local pub/sub.", - "header": "@@ -5124,26 +5124,124 @@ func (ls *LocalStorage) UpdateAgentTrafficWeight(ctx context.Context, id string,", - "new_count": 124, - "new_start": 5124, - "old_count": 26, - "old_start": 5124 - } - ], - "language": "go", - "lines_added": 108, - "lines_removed": 10, - "path": "control-plane/internal/storage/local.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": " \t\t&DIDDocumentModel{},\n \t\t&AccessPolicyModel{},\n \t\t&AgentTagVCModel{},\n+\t\t&ConfigStorageModel{},\n \t}\n \n \tif err := gormDB.WithContext(ctx).AutoMigrate(models...); err != nil {", - "header": "@@ -233,6 +233,7 @@ func (ls *LocalStorage) autoMigrateSchema(ctx context.Context) error {", - "new_count": 7, - "new_start": 233, - "old_count": 6, - "old_start": 233 - } - ], - "language": "go", - "lines_added": 1, - "lines_removed": 0, - "path": "control-plane/internal/storage/migrations.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": " }\n \n func (AgentTagVCModel) TableName() string { return \"agent_tag_vcs\" }\n+\n+// ConfigStorageModel stores configuration files in the database.\n+// Each record represents a named configuration (e.g. 
\"agentfield.yaml\")\n+// with versioning for audit trail.\n+type ConfigStorageModel struct {\n+\tID int64 `gorm:\"column:id;primaryKey;autoIncrement\"`\n+\tKey string `gorm:\"column:key;not null;uniqueIndex\"`\n+\tValue string `gorm:\"column:value;type:text;not null\"`\n+\tVersion int `gorm:\"column:version;not null;default:1\"`\n+\tCreatedBy *string `gorm:\"column:created_by\"`\n+\tUpdatedBy *string `gorm:\"column:updated_by\"`\n+\tCreatedAt time.Time `gorm:\"column:created_at;autoCreateTime\"`\n+\tUpdatedAt time.Time `gorm:\"column:updated_at;autoUpdateTime\"`\n+}\n+\n+func (ConfigStorageModel) TableName() string { return \"config_storage\" }", - "header": "@@ -472,3 +472,19 @@ type AgentTagVCModel struct {", - "new_count": 19, - "new_start": 472, - "old_count": 3, - "old_start": 472 - } - ], - "language": "go", - "lines_added": 16, - "lines_removed": 0, - "path": "control-plane/internal/storage/models.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": " \tActiveExecutions int\n }\n \n+// ConfigEntry represents a database-stored configuration file.\n+type ConfigEntry struct {\n+\tKey string `json:\"key\"`\n+\tValue string `json:\"value\"`\n+\tVersion int `json:\"version\"`\n+\tCreatedBy string `json:\"created_by,omitempty\"`\n+\tUpdatedBy string `json:\"updated_by,omitempty\"`\n+\tCreatedAt time.Time `json:\"created_at\"`\n+\tUpdatedAt time.Time `json:\"updated_at\"`\n+}\n+\n // StorageProvider is the interface for the primary data storage backend.\n type StorageProvider interface {\n \t// Lifecycle", - "header": "@@ -26,6 +26,17 @@ type RunSummaryAggregation struct {", - "new_count": 17, - "new_start": 26, - "old_count": 6, - "old_start": 26 - }, - { - "content": " \tUpdateAgentVersion(ctx context.Context, id string, version string) error\n \tUpdateAgentTrafficWeight(ctx context.Context, id string, version string, weight int) error\n \n-\t// Configuration\n-\tSetConfig(ctx context.Context, key string, value interface{}) error\n-\tGetConfig(ctx 
context.Context, key string) (interface{}, error)\n+\t// Configuration Storage (database-backed config files)\n+\tSetConfig(ctx context.Context, key string, value string, updatedBy string) error\n+\tGetConfig(ctx context.Context, key string) (*ConfigEntry, error)\n+\tListConfigs(ctx context.Context) ([]*ConfigEntry, error)\n+\tDeleteConfig(ctx context.Context, key string) error\n \n \t// Reasoner Performance and History\n \tGetReasonerPerformanceMetrics(ctx context.Context, reasonerID string) (*types.ReasonerPerformanceMetrics, error)", - "header": "@@ -118,9 +129,11 @@ type StorageProvider interface {", - "new_count": 11, - "new_start": 129, - "old_count": 9, - "old_start": 118 - } - ], - "language": "go", - "lines_added": 16, - "lines_removed": 3, - "path": "control-plane/internal/storage/storage.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": "+-- +goose Up\n+-- +goose StatementBegin\n+CREATE TABLE IF NOT EXISTS config_storage (\n+ id BIGSERIAL PRIMARY KEY,\n+ key TEXT NOT NULL UNIQUE,\n+ value TEXT NOT NULL,\n+ version INTEGER NOT NULL DEFAULT 1,\n+ created_by TEXT,\n+ updated_by TEXT,\n+ created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),\n+ updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()\n+);\n+\n+CREATE INDEX IF NOT EXISTS idx_config_storage_key ON config_storage(key);\n+-- +goose StatementEnd\n+\n+-- +goose Down\n+-- +goose StatementBegin\n+DROP INDEX IF EXISTS idx_config_storage_key;\n+DROP TABLE IF EXISTS config_storage;\n+-- +goose StatementEnd", - "header": "@@ -0,0 +1,21 @@", - "new_count": 21, - "new_start": 1, - "old_count": 0, - "old_start": 0 - } - ], - "language": "sql", - "lines_added": 21, - "lines_removed": 0, - "path": "control-plane/migrations/028_create_config_storage.sql", - "status": "added" - } - ], - "intent_gaps": [ - "**API Key Override Not Documented**: The PR description states precedence is 'env vars > DB config > file config > defaults' but doesn't explicitly document that API.Auth.APIKey from 
DB would override file config. This could be surprising behavior for operators.", - "**No Config Validation Endpoint**: The PR mentions storing config via API but doesn't provide a way to validate config before storing it. Users can store invalid YAML that breaks server startup on next reload.", - "**Missing Audit Logging**: While the DB stores `created_by` and `updated_by`, there's no comprehensive audit log of config changes with diffs. The PR mentions 'versioning for audit trail' in the model but the actual audit trail features aren't implemented.", - "**No Config Diff/Compare**: The PR enables storing multiple versions but doesn't provide API endpoints to compare versions or view historical values.", - "**Connector Config Scope Ambiguity**: The PR mentions 'connector-scoped config routes' but it's unclear if these routes allow the connector to manage its own config section only, or any config. The capability is named `config_management` but the scope isn't clearly defined." - ], - "pr_narrative": "This PR implements database-backed configuration storage for the AgentField control plane, enabling remote configuration management via API and connector integration.\n\n**Core Changes:**\n\n1. **Database Schema (migration 028)**: Adds `config_storage` table with fields: id, key (unique), value (text), version, created_by, updated_by, timestamps. Supports both PostgreSQL and SQLite via Goose migration.\n\n2. **Storage Layer (local.go:5129-5245)**: Implements CRUD operations on LocalStorage:\n - `SetConfig`: Upsert with version increment (SQLite uses `?` placeholders, PostgreSQL uses `$1`)\n - `GetConfig`: Returns ConfigEntry with COALESCE for null handling\n - `ListConfigs`: Ordered by key\n - `DeleteConfig`: Returns error if key not found\n\n3. **GORM Model (models.go:476-490)**: Adds `ConfigStorageModel` with auto-migration support via migrations.go:236.\n\n4. 
**HTTP Handlers (config_storage.go)**: Full CRUD API under `/api/v1/configs`:\n - GET /configs - List all\n - GET /configs/:key - Get specific\n - PUT /configs/:key - Create/update (raw body = value, X-Updated-By header)\n - DELETE /configs/:key - Remove\n - POST /configs/reload - Trigger hot-reload (only if AGENTFIELD_CONFIG_SOURCE=db)\n\n5. **Config Loading (config_db.go)**: Implements `overlayDBConfig` called during server initialization (server.go:107-112) when `AGENTFIELD_CONFIG_SOURCE=db`:\n - Reads config from DB key `agentfield.yaml`\n - Parses YAML into config struct\n - Merges field-by-field (only non-zero values)\n - **CRITICAL**: Preserves `cfg.Storage` from file/env (bootstrap safety - can't get DB connection from DB)\n - Also excludes connector token/capabilities from DB merge (security-sensitive)\n\n6. **Connector Integration (server.go:1573-1578)**: Adds connector-scoped config routes gated by `config_management` capability check middleware.\n\n7. **Default Config (agentfield.yaml)**: Adds `config_management` capability to connector capabilities (lines 149-151).\n\n**Flow:**\n1. Server starts, creates storage provider\n2. If `AGENTFIELD_CONFIG_SOURCE=db`, calls `overlayDBConfig(cfg, storage)`\n3. Storage section preserved from file/env, rest merged from DB\n4. Server initializes with merged config\n5. API endpoints allow runtime config CRUD\n6. POST /configs/reload triggers re-merge without restart (if env var set)", - "risk_surfaces": [ - "**Bootstrap Safety Gap (config_db.go:33-45)**: The storage section is preserved, but other security-sensitive configs (API.Auth.APIKey, Features.DID.Authorization.AdminToken, Features.DID.Authorization.InternalToken) are NOT explicitly excluded from DB overlay. 
If these are set in DB config, they could override file/env values, creating a security risk where DB-stored credentials take precedence.", - "**Config Reload Race Condition (server.go:435-442, config_storage.go:114-128)**: The `configReloadFn()` closure captures `s.config` pointer and `s.storage`. When called, it re-runs `overlayDBConfig` which modifies the config struct in-place. If other goroutines are reading config values during reload, they may see inconsistent/partial state. No mutex protects the config struct.", - "**Version Increment Race (local.go:5129-5161)**: `SetConfig` uses version increment logic (`version = version + 1`) but doesn't use atomic operations or row-level locking. Concurrent updates to the same key could result in lost updates or version collisions, especially under high load.", - "**YAML Validation Gap (config_storage.go:67-78)**: The `SetConfig` handler accepts raw YAML/text without any validation that it parses as valid YAML or that it conforms to the expected config schema. Invalid YAML stored in DB will cause `overlayDBConfig` to fail on next reload, potentially preventing server startup.", - "**Merge Logic Maintenance Burden (config_db.go:54-103)**: The `mergeDBConfig` function manually merges each field. When new config fields are added to the `config.Config` struct, developers must remember to add corresponding merge logic here. Missing fields will silently not be overlayable from DB, creating confusion.", - "**Connector Capability Bypass Risk (server.go:1574)**: The connector config routes use `middleware.ConnectorCapabilityCheck(\"config_management\", ...)`. If the capability check middleware has bugs or is bypassed, the connector could modify config without proper authorization. The middleware implementation should be reviewed.", - "**Test Coverage Gap (server_routes_test.go)**: The test file adds stub implementations but doesn't add actual tests for the new config storage routes. 
The `config_management` capability is added to test config but no tests verify the routes work correctly.", - "**Migration Ordering (migrations/028_create_config_storage.sql)**: Migration 028 creates the config_storage table. If this migration fails or is skipped, the server will fail at runtime when trying to use config storage. The error handling in `overlayDBConfig` logs a warning but continues startup, which could mask issues." - ], - "stats": { - "files_added": 3, - "files_modified": 7, - "files_removed": 0, - "files_renamed": 0, - "test_files_changed": 1, - "test_to_code_ratio": 0.1111111111111111, - "total_additions": 455, - "total_deletions": 15, - "total_files": 10 - }, - "unrelated_changes": [] - }, - "budget": { - "budget_exhausted": true, - "cost_breakdown": { - "adversary": 0, - "anatomy": 0, - "coverage": 0, - "cross_ref": 0, - "intake": 0, - "meta_selectors": 0, - "output": 0, - "review": 0, - "synthesis": 0 - }, - "max_cost_usd": 3, - "max_duration_seconds": 2700, - "total_cost_usd": 0 - }, - "intake": { - "ai_generated": 0.6666666666666666, - "areas_touched": [ - "database", - "api", - "tests", - "config" - ], - "complexity": "complex", - "languages": [ - "go", - "sql", - "yaml" - ], - "pr_summary": "## Summary\n- Add `config_storage` table (GORM model + Goose migration 028) for storing configuration files in the database\n- Implement `SetConfig`/`GetConfig`/`ListConfigs`/`DeleteConfig` on the `StorageProvider` interface (works on both SQLite and PostgreSQL)\n- Add `AGENTFIELD_CONFIG_SOURCE=db` environment variable to load config from the database at startup (overlays on top of file config, preserving storage section for bootstrap)\n- Add CRUD API endpoints at `GET/PUT/DELETE /api/v1/configs/:key`\n- Add connector-scoped config routes gated by `config_management` capability\n- Add `config_management` capability to default `agentfield.yaml`\n\n## How It Works\n1. **Store config in DB**: `PUT /api/v1/configs/agentfield.yaml` with YAML body\n2. 
**Load from DB at startup**: Set `AGENTFIELD_CONFIG_SOURCE=db` \u2192 server reads config from DB after storage init\n3. **Remote management**: SaaS \u2192 connector \u2192 `config_management` capability \u2192 CP config API\n4. **Precedence**: env vars > DB config > file config > defaults\n5. **Bootstrap safety**: The `storage` section is never overridden from DB (DB connection can't come from DB)\n\n## Related PRs\n- Connector: Agent-Field/connector (config_management capability)\n- hax-sdk: Agent-Field/hax-sdk (config editor UI)\n\n## Test plan\n- [x] `go build ./...` passes\n- [x] Server tests pass\n- [x] Storage test failure is pre-existing (FTS5 not available)\n- [ ] Manual test: create config via API, verify it loads on restart with `AGENTFIELD_CONFIG_SOURCE=db`\n- [ ] Manual test: verify connector flow end-to-end\n\n\ud83e\udd16 Generated with [Claude Code](https://claude.com/claude-code)", - "pr_type": "feature", - "review_depth": "standard", - "risk_signals": [ - "modifies data model or schema-affecting code", - "changes API surface or request/response behavior", - "includes configuration changes", - "test behavior updated" - ] - }, - "phases_completed": [ - "intake", - "anatomy", - "meta_selectors", - "review", - "adversary", - "cross_ref", - "coverage", - "synthesis", - "output" - ], - "plan": { - "ai_adjusted": false, - "cross_ref_hints": [], - "dimensions": [ - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 120, - "max_reference_follows": 5 - }, - "context_files": [ - "control-plane/internal/server/config_db.go", - "control-plane/internal/handlers/config_storage.go" - ], - "id": "semantic_semantic-001", - "name": "Config Reload Race Condition", - "priority": 10, - "review_prompt": "CRITICAL: The configReloadFn function in control-plane/internal/server/server.go:439-441 modifies s.config in-place via overlayDBConfig but does NOT use the configMu mutex defined in the server struct (line 82). 
This creates a data race when concurrent goroutines read config values during reload.\n\nINVESTIGATION STEPS:\n1. Verify that configMu is defined but unused in configReloadFn\n2. Check all places where s.config is accessed (search for s.config. throughout server.go)\n3. Identify which goroutines might read config during runtime (health checks, cleanup services, etc.)\n4. Determine if overlayDBConfig modifies the config struct atomically or field-by-field\n\nVERIFICATION:\n- The race condition exists if any goroutine reads s.config fields while reload is in progress\n- This is a SEMANTIC bug because it can cause inconsistent config state, not a style issue\n- Suggest fix: Add s.configMu.Lock() at start of returned function and defer s.configMu.Unlock()", - "target_files": [ - "control-plane/internal/server/server.go" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "control-plane/internal/storage/storage.go" - ], - "id": "mechanical_mech-001", - "name": "MockStorageProvider interface compliance in config_test.go", - "priority": 10, - "review_prompt": "The StorageProvider interface in control-plane/internal/storage/storage.go was updated with new method signatures:\n\n1. SetConfig changed from:\n SetConfig(ctx context.Context, key string, value interface{}) error\n to:\n SetConfig(ctx context.Context, key string, value string, updatedBy string) error\n\n2. GetConfig changed from:\n GetConfig(ctx context.Context, key string) (interface{}, error)\n to:\n GetConfig(ctx context.Context, key string) (*ConfigEntry, error)\n\n3. Two new required methods were added:\n ListConfigs(ctx context.Context) ([]*ConfigEntry, error)\n DeleteConfig(ctx context.Context, key string) error\n\nThe MockStorageProvider in control-plane/internal/handlers/ui/config_test.go (lines 289-297) still has the OLD signatures. This will cause compilation failures.\n\nVerify the fix:\n1. 
Check lines 289-297 in config_test.go - both SetConfig and GetConfig need signature updates\n2. Add the missing ListConfigs method\n3. Add the missing DeleteConfig method\n4. Run 'go build ./...' in control-plane to confirm compilation succeeds\n\nThe updated interface definition is at control-plane/internal/storage/storage.go:133-136", - "target_files": [ - "control-plane/internal/handlers/ui/config_test.go" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "control-plane/internal/storage/storage.go" - ], - "id": "mechanical_mech-002", - "name": "MockStorageProvider interface compliance in execute_test.go", - "priority": 10, - "review_prompt": "The StorageProvider interface in control-plane/internal/storage/storage.go was updated with new method signatures:\n\n1. SetConfig changed from:\n SetConfig(ctx context.Context, key string, value interface{}) error\n to:\n SetConfig(ctx context.Context, key string, value string, updatedBy string) error\n\n2. GetConfig changed from:\n GetConfig(ctx context.Context, key string) (interface{}, error)\n to:\n GetConfig(ctx context.Context, key string) (*ConfigEntry, error)\n\n3. Two new required methods were added:\n ListConfigs(ctx context.Context) ([]*ConfigEntry, error)\n DeleteConfig(ctx context.Context, key string) error\n\nThe MockStorageProvider in control-plane/internal/handlers/execute_test.go (lines 173-178) still has the OLD signatures. This will cause compilation failures.\n\nVerify the fix:\n1. Check lines 173-178 in execute_test.go - both SetConfig and GetConfig need signature updates\n2. Add the missing ListConfigs method\n3. Add the missing DeleteConfig method\n4. Run 'go build ./...' 
in control-plane to confirm compilation succeeds\n\nThe updated interface definition is at control-plane/internal/storage/storage.go:133-136", - "target_files": [ - "control-plane/internal/handlers/execute_test.go" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 120, - "max_reference_follows": 5 - }, - "context_files": [ - "control-plane/internal/config/config.go", - "control-plane/internal/handlers/config_storage.go" - ], - "id": "semantic_semantic-002", - "name": "Security-Sensitive Config Override from DB", - "priority": 9, - "review_prompt": "CRITICAL: The mergeDBConfig function in control-plane/internal/server/config_db.go:54-103 merges config from DB but only explicitly protects the Storage section (lines 32-33, 44-45). However, other security-sensitive fields like API.Auth.APIKey, Features.DID.Authorization.AdminToken, and Features.DID.Authorization.InternalToken are NOT protected and can be overridden from DB config.\n\nINVESTIGATION STEPS:\n1. Review the mergeDBConfig function to identify which fields are merged vs protected\n2. Check the config.Config struct in control-plane/internal/config/config.go for security-sensitive fields\n3. Verify that API.Auth.APIKey, AdminToken, InternalToken, and Connector.Token are NOT in the merge logic\n4. 
Check if the comment on lines 90-92 about connector config is actually enforced in code\n\nVERIFICATION:\n- This is a SEMANTIC security issue: DB-stored credentials could override file/env values\n- An attacker with DB write access could escalate privileges by setting AdminToken in DB config\n- The PR description claims 'connector token/capabilities' are excluded but verify this is actually implemented\n- Suggest fix: Add explicit protection for all security-sensitive tokens/keys similar to Storage", - "target_files": [ - "control-plane/internal/server/config_db.go" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 120, - "max_reference_follows": 5 - }, - "context_files": [ - "control-plane/internal/server/config_db.go", - "control-plane/internal/server/server.go" - ], - "id": "semantic_semantic-003", - "name": "Invalid YAML Config Storage and Reload Failure", - "priority": 8, - "review_prompt": "HIGH: The SetConfig handler in control-plane/internal/handlers/config_storage.go:67-101 accepts raw YAML/text body without any validation that it parses as valid YAML or conforms to the expected config schema. Invalid YAML stored in DB will cause overlayDBConfig (config_db.go:37) to fail on next reload, potentially preventing server startup or causing runtime errors.\n\nINVESTIGATION STEPS:\n1. Review SetConfig handler to confirm no YAML validation is performed before storing\n2. Check overlayDBConfig to see how it handles YAML unmarshal errors\n3. Verify that invalid YAML in DB causes server startup failure or just a warning\n4. 
Check if there's any way to recover from invalid YAML in DB (delete via API, manual DB edit)\n\nVERIFICATION:\n- This is a SEMANTIC issue: storing invalid data can break server functionality\n- The current code at config_db.go:37-39 returns error if YAML unmarshal fails\n- At server.go:109-111, this error only prints a warning but doesn't prevent startup\n- However, the server continues with potentially partial/inconsistent config\n- Suggest fix: Add YAML validation in SetConfig before storing, or implement config validation endpoint", - "target_files": [ - "control-plane/internal/handlers/config_storage.go" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "control-plane/internal/storage/storage.go" - ], - "id": "mechanical_mech-003", - "name": "ConfigEntry type import in test mocks", - "priority": 8, - "review_prompt": "The updated GetConfig method now returns *ConfigEntry instead of interface{}. The ConfigEntry type is defined in control-plane/internal/storage/storage.go (lines 29-38).\n\nWhen updating the MockStorageProvider implementations in:\n- control-plane/internal/handlers/ui/config_test.go\n- control-plane/internal/handlers/execute_test.go\n\nEnsure that:\n1. The storage package is properly imported (it should already be imported as the mocks implement StorageProvider)\n2. The GetConfig method returns (*storage.ConfigEntry, error) not (*ConfigEntry, error) - verify the import alias\n3. 
Any test code that calls GetConfig and expects interface{} will need to be updated to handle *ConfigEntry\n\nCheck for any test assertions that might break due to the type change from interface{} to *ConfigEntry.", - "target_files": [ - "control-plane/internal/handlers/ui/config_test.go", - "control-plane/internal/handlers/execute_test.go" - ] - } - ], - "total_budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - } - } - }, - "pr_url": "https://github.com/Agent-Field/agentfield/pull/254", - "review": { - "body": "## \ud83d\udd34 PR-AF Review \u2014 **Needs Major Rework**\n\n*Automated multi-agent code review \u00b7 [PR-AF](https://github.com/Agent-Field/agentfield) built with [AgentField](https://github.com/Agent-Field/agentfield)*\n\n> **17 findings** \u00b7 \ud83d\udd34 13 critical \u00b7 \ud83d\udfe0 3 important \u00b7 \ud83d\udd35 0 suggestions \u00b7 \u26aa 0 nitpicks\n\n
\nPR Overview\n\n## Summary\n- Add `config_storage` table (GORM model + Goose migration 028) for storing configuration files in the database\n- Implement `SetConfig`/`GetConfig`/`ListConfigs`/`DeleteConfig` on the `StorageProvider` interface (works on both SQLite and PostgreSQL)\n- Add `AGENTFIELD_CONFIG_SOURCE=db` environment variable to load config from the database at startup (overlays on top of file config, preserving storage section for bootstrap)\n- Add CRUD API endpoints at `GET/PUT/DELETE /api/v1/configs/:key`\n- Add connector-scoped config routes gated by `config_management` capability\n- Add `config_management` capability to default `agentfield.yaml`\n\n## How It Works\n1. **Store config in DB**: `PUT /api/v1/configs/agentfield.yaml` with YAML body\n2. **Load from DB at startup**: Set `AGENTFIELD_CONFIG_SOURCE=db` \u2192 server reads config from DB after storage init\n3. **Remote management**: SaaS \u2192 connector \u2192 `config_management` capability \u2192 CP config API\n4. **Precedence**: env vars > DB config > file config > defaults\n5. **Bootstrap safety**: The `storage` section is never overridden from DB (DB connection can't come from DB)\n\n## Related PRs\n- Connector: Agent-Field/connector (config_management capability)\n- hax-sdk: Agent-Field/hax-sdk (config editor UI)\n\n## Test plan\n- [x] `go build ./...` passes\n- [x] Server tests pass\n- [x] Storage test failure is pre-existing (FTS5 not available)\n- [ ] Manual test: create config via API, verify it loads on restart with `AGENTFIELD_CONFIG_SOURCE=db`\n- [ ] Manual test: verify connector flow end-to-end\n\n\ud83e\udd16 Generated with [Claude Code](https://claude.com/claude-code)\n\n
\n\n### Key Findings\n\n**16 issue(s) should be addressed before merge:**\n\n- \ud83d\udd34 **MockStorageProvider SetConfig and GetConfig have outdated signatures** (`control-plane/internal/handlers/execute_test.go:173`) \u2014 The MockStorageProvider in execute_test.go has the old method signatures for SetConfig and GetConfig that don't match the updated StorageProvider interface.\n- \ud83d\udd34 **Missing storage import causes undefined type error** (`control-plane/internal/handlers/execute_test.go:176`) \u2014 The MockStorageProvider.GetConfig method references `*storage.ConfigEntry` but the storage package is not imported in execute_test.go.\n- \ud83d\udd34 **MockStorageProvider has outdated SetConfig and GetConfig signatures causing compilation failure** (`control-plane/internal/handlers/ui/config_test.go:289`) \u2014 The `MockStorageProvider` in `config_test.go` has **outdated method signatures** that do not match the updated `StorageProvider` interface defined in `storage.go`.\n- \ud83d\udd34 **DID Authorization tokens (AdminToken/InternalToken) can be overridden from DB config** (`control-plane/internal/server/config_db.go:86`) \u2014 The `mergeDBConfig` function merges `Features.DID` as an entire struct when `dbCfg.Features.DID.Method != \"\"`.\n- \ud83d\udd34 **Mock GetConfig returns wrong type - interface{} instead of *storage.ConfigEntry** (`control-plane/internal/handlers/ui/config_test.go:294`) \u2014 The MockStorageProvider.GetConfig method in config_test.go returns `(interface{}, error)` but the StorageProvider interface defines it as `(*ConfigEntry, error)`.\n- \ud83d\udd34 **Mock SetConfig has wrong signature - missing updatedBy parameter** (`control-plane/internal/handlers/ui/config_test.go:289`) \u2014 The MockStorageProvider.SetConfig method has signature `(ctx context.Context, key string, value interface{})` but the StorageProvider interface defines it as `(ctx context.Context, key string, value s\u2026\n- \ud83d\udd34 **Data Race: Config 
Reload Function Modifies Shared Config Without Synchronization** (`control-plane/internal/server/server.go:433`) \u2014 The configReloadFn() method returns a function that calls overlayDBConfig(s.config, s.storage) which directly modifies the shared s.config struct.\n- \ud83d\udd34 **Systemic configuration merge vulnerability enables multiple authentication bypass vectors** (`control-plane/internal/server/config_db.go:52`) \u2014 The mergeDBConfig function has a systemic security control gap where comments claim protection for security-sensitive fields, but the actual implementation only explicitly preserves Storage config (li\u2026\n- \u2026 and 8 more (see All Findings by Severity)\n\n**Files with findings:** `control-plane/internal/handlers/config_storage.go`, `control-plane/internal/handlers/execute_test.go`, `control-plane/internal/handlers/ui/config_test.go`, `control-plane/internal/server/config_db.go`, `control-plane/internal/server/server.go`\n\n
\nAll Findings by Severity\n\n#### \ud83d\udd34 Critical (13)\n\n- **MockStorageProvider SetConfig and GetConfig have outdated signatures** `control-plane/internal/handlers/execute_test.go:173`\n- **Missing storage import causes undefined type error** `control-plane/internal/handlers/execute_test.go:176`\n- **MockStorageProvider has outdated SetConfig and GetConfig signatures causing compilation failure** `control-plane/internal/handlers/ui/config_test.go:289`\n- **DID Authorization tokens (AdminToken/InternalToken) can be overridden from DB config** `control-plane/internal/server/config_db.go:86`\n- **Mock GetConfig returns wrong type - interface{} instead of *storage.ConfigEntry** `control-plane/internal/handlers/ui/config_test.go:294`\n- **Mock SetConfig has wrong signature - missing updatedBy parameter** `control-plane/internal/handlers/ui/config_test.go:289`\n- **Data Race: Config Reload Function Modifies Shared Config Without Synchronization** `control-plane/internal/server/server.go:433`\n- **Systemic configuration merge vulnerability enables multiple authentication bypass vectors** `control-plane/internal/server/config_db.go:52`\n- **Systemic DB Config Security Control Gap - Multiple Critical Tokens Unprotected** `control-plane/internal/server/config_db.go:19`\n- **Complete System Compromise via Coordinated DB Config Injection** `control-plane/internal/server/config_db.go:82`\n- **Systemic DB Config Security Control Gap Enables Total Authentication Bypass** `control-plane/internal/server/config_db.go:32`\n- **Systemic Control Gap: Inconsistent Application of Security-Sensitive Field Protection** `control-plane/internal/server/config_db.go:32`\n- **API.Auth.APIKey can be overridden from DB config - no protection implemented** `control-plane/internal/server/config_db.go:94`\n\n#### \ud83d\udfe0 Important (3)\n\n- **SetConfig handler stores invalid YAML without validation** `control-plane/internal/handlers/config_storage.go:67`\n- **Approval.WebhookSecret can 
be overridden from DB config** `control-plane/internal/server/config_db.go:82`\n- **Comment claims connector token/capabilities are excluded but no enforcement in code** `control-plane/internal/server/config_db.go:90`\n\n
\n\n
\nReview Process Details\n\n**Dimensions Analyzed (6):**\n\n- **Config Reload Race Condition** \u2014 1 file(s)\n- **MockStorageProvider interface compliance in config_test.go** \u2014 1 file(s)\n- **MockStorageProvider interface compliance in execute_test.go** \u2014 1 file(s)\n- **Security-Sensitive Config Override from DB** \u2014 1 file(s)\n- **Invalid YAML Config Storage and Reload Failure** \u2014 1 file(s)\n- **ConfigEntry type import in test mocks** \u2014 2 file(s)\n\n**Meta-Dimension Lenses (3):**\n\n- **Semantic** \u2014 4 dimension(s), 85% coverage confidence\n- **Mechanical** \u2014 3 dimension(s), 95% coverage confidence\n- **Systemic** \u2014 5 dimension(s), 85% coverage confidence\n\n**Cross-Reference & Adversary Analysis:**\n\n- **7** compound finding(s) synthesized\n\n
\n\n
\nPipeline Stats\n\n| Metric | Value |\n|--------|-------|\n| Duration | 2867.2s |\n| Agent invocations | 20 |\n| Coverage iterations | 0 |\n| Estimated cost | N/A (provider does not report cost) |\n| Budget exhausted | Yes (timeout: 2867s > 2700s limit) |\n| PR type | feature |\n| Complexity | complex |\n\n
\n\nReview ID: `rev_2947062915e9`", - "comments": [ - { - "body": "\ud83d\udd34 **[CRITICAL] DID Authorization tokens (AdminToken/InternalToken) can be overridden from DB config**\n\nThe `mergeDBConfig` function merges `Features.DID` as an entire struct when `dbCfg.Features.DID.Method != \"\"`. This is dangerous because `DIDConfig` contains security-sensitive authorization tokens (`AdminToken` and `InternalToken`).\n\n**The vulnerability:** If an attacker with database write access sets `features.did.method` to any non-empty value in the DB-stored config, the entire `DIDConfig` struct from the DB overwrites the file/env config, including:\n- `AdminToken`: Used for admin operations like tag approval and policy management\n- `InternalToken`: Used for internal authentication when forwarding execution requests to agents\n\n**Attack scenario:**\n1. Attacker gains DB write access\n2. Attacker inserts a malicious config via `PUT /api/v1/configs/agentfield.yaml` with `features.did.method: key` and `features.did.authorization.admin_token: attacker-controlled-token`\n3. On next server start or config reload, the attacker's token replaces the legitimate admin token\n4. Attacker can now authenticate as admin using their token\n\n**Expected behavior:** Similar to how `Storage` is preserved (lines 33, 45), security-sensitive tokens should be explicitly protected from DB override.\n\n---\n\n> Step 1: config_db.go:87-89 checks `if dbCfg.Features.DID.Method != \"\"` and assigns entire `dbCfg.Features.DID` to `target.Features.DID`. Step 2: config.go:99-135 shows DIDConfig contains AuthorizationConfig with AdminToken (line 125) and InternalToken (line 129). Step 3: When DID struct is assigned, ALL fields including Authorization are overwritten. 
Step 4: This allows DB-stored tokens to replace file/env tokens, enabling privilege escalation.\n\n**\ud83d\udca1 Suggested Fix**\n\nChange the DID merge logic to preserve `Authorization.AdminToken` and `Authorization.InternalToken` from the original config. Only merge non-sensitive fields like `Method`, `KeyAlgorithm`, etc. For example:\n\n```go\n// Save sensitive tokens before merge\nsavedAdminToken := target.Features.DID.Authorization.AdminToken\nsavedInternalToken := target.Features.DID.Authorization.InternalToken\n\nif dbCfg.Features.DID.Method != \"\" {\n target.Features.DID = dbCfg.Features.DID\n // Restore security-sensitive fields\n target.Features.DID.Authorization.AdminToken = savedAdminToken\n target.Features.DID.Authorization.InternalToken = savedInternalToken\n}\n```\n\n---\n*`Security-Sensitive Field Protection in DB Config Merge` \u00b7 confidence 95%*", - "line": 86, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd34 **[CRITICAL] Data Race: Config Reload Function Modifies Shared Config Without Synchronization**\n\nThe configReloadFn() method returns a function that calls overlayDBConfig(s.config, s.storage) which directly modifies the shared s.config struct. This creates a data race because the returned function is called asynchronously (likely from a signal handler or watcher) while dozens of goroutines concurrently read from s.config fields without any synchronization mechanism.\n\nThe AgentFieldServer struct includes a configMu mutex field (line 82) that was intended to protect these operations, but it is never locked in configReloadFn(). 
This means concurrent reads during a config reload can observe partially updated or inconsistent configuration values, leading to undefined behavior.\n\n---\n\n> Line 82: configMu field exists in struct but is unused\n> Line 440-441: Direct modification of s.config without lock\n> OverlayDBConfig modifies s.config fields via mergeDBConfig()\n\n**\ud83d\udca1 Suggested Fix**\n\nAcquire the configMu lock before modifying s.config in the returned function:\n\nfunc (s *AgentFieldServer) configReloadFn() handlers.ConfigReloadFunc {\n if src := os.Getenv(\"AGENTFIELD_CONFIG_SOURCE\"); src != \"db\" {\n return nil\n }\n return func() error {\n s.configMu.Lock()\n defer s.configMu.Unlock()\n return overlayDBConfig(s.config, s.storage)\n }\n}\n\nAdditionally, all read access to s.config fields throughout the codebase should also acquire at least a read lock (RLock) to prevent data races during concurrent reads.\n\n---\n*`Data Race in Config Reload` \u00b7 confidence 95%*", - "line": 433, - "path": "control-plane/internal/server/server.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd34 **[CRITICAL] Systemic configuration merge vulnerability enables multiple authentication bypass vectors**\n\nThe mergeDBConfig function has a systemic security control gap where comments claim protection for security-sensitive fields, but the actual implementation only explicitly preserves Storage config (lines 33, 45). This creates multiple authentication bypass vectors through a shared vulnerable code pattern.\n\n**The compound risk:** An attacker with database write access can override ALL critical authentication/authorization tokens by inserting malicious YAML into the database config:\n\n1. **API Authentication Bypass** (lines 94-97): Comment claims 'never override API key from DB for security' but code only merges CORS settings. The API.Auth.APIKey can be overridden from DB, allowing attacker to authenticate with their own key.\n\n2. 
**Admin Privilege Escalation** (lines 87-89): Features.DID is merged entirely when Method != '', which includes Authorization.AdminToken. Attacker can set their own admin token to gain administrative access to tag approval and policy management routes.\n\n3. **Agent Impersonation** (lines 87-89): Same DID merge includes Authorization.InternalToken, which is sent as Authorization: Bearer header when control plane forwards execution requests to agents. Attacker can impersonate the control plane to agents with RequireOriginAuth enabled.\n\n4. **Approval System Compromise** (lines 82-84): AgentField.Approval config including WebhookSecret is entirely merged from DB. Attacker can manipulate approval workflows and potentially bypass approval requirements.\n\n**Why this is worse than individual findings:** The shared merge pattern suggests a developer misunderstanding of the actual protection scope. Only Storage is explicitly preserved (bootstrap problem), while other security-sensitive fields have only comments claiming protection. This indicates a systemic control gap where the security model is inconsistent and incomplete. Fixing one field won't address the underlying architectural issue.\n\n---\n\n> Evidence from code review:\\n1. Line 33, 45: Only Storage config is explicitly saved and restored (correct protection for bootstrap problem)\\n2. Line 82-84: AgentField.Approval (including WebhookSecret) is entirely merged from DB without protection\\n3. Line 87-89: Features.DID (including Authorization.AdminToken and InternalToken) is entirely merged when Method != ''\\n4. Line 94-97: Comment claims API key protection but only CORS is handled, not Auth\\n5. Line 90-92: Comment claims Connector token protection but no enforcement code exists\\n6. config.go line 207-212: AuthConfig contains APIKey string field\\n7. config.go line 112-135: AuthorizationConfig contains AdminToken (line 125) and InternalToken (line 129)\\n8. 
config.go line 46: ApprovalConfig contains WebhookSecret\\n\\nAttack scenario: INSERT INTO config (key, value) VALUES ('agentfield.yaml', 'api:\\n auth:\\n api_key: attacker-controlled-key\\nfeatures:\\n did:\\n method: key\\n authorization:\\n admin_token: attacker-admin-token\\n internal_token: attacker-internal-token\\nagentfield:\\n approval:\\n webhook_secret: attacker-webhook-secret')\n\n**\ud83d\udca1 Suggested Fix**\n\nImplement a comprehensive security-sensitive field protection system:\\n1. Create an explicit whitelist approach for DB-configurable fields instead of selective merging\\n2. Add a security audit comment block at the top of mergeDBConfig listing ALL protected fields\\n3. Implement a struct tag system (e.g., `dbconfig:\"protected\"`) to mark fields that should never come from DB\\n4. Add validation tests that verify no security-sensitive fields can be set from DB config\\n5. Consider encrypting security-sensitive config values in the database\\n6. Log all config changes from DB with before/after values for security-sensitive fields\n\n---\n*`Compound Analysis` \u00b7 confidence 95%*", - "line": 52, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd34 **[CRITICAL] Systemic DB Config Security Control Gap - Multiple Critical Tokens Unprotected**\n\nThe database configuration overlay mechanism (`overlayDBConfig`) contains a systemic security control gap where security-sensitive tokens are not protected from DB-based override, despite comments claiming protection exists. This compound issue creates a complete authentication bypass vulnerability.\n\n**The compound vulnerability:**\n\n1. **Pattern of False Security Claims**: Lines 90-92 and 94 contain comments stating that connector tokens and API keys are intentionally NOT merged from DB, but these protections are NOT actually implemented in code. This creates a dangerous false sense of security.\n\n2. 
**Multiple Critical Token Override**: An attacker with DB write access can override ALL of these tokens simultaneously:\n - `API.Auth.APIKey` (controls all API access) - line 209 in config.go\n - `AgentField.Approval.WebhookSecret` (controls webhook verification) - line 47 in config.go\n - `Features.DID.Authorization.AdminToken` (controls admin operations) - line 125 in config.go\n - `Features.DID.Authorization.InternalToken` (controls agent authentication) - line 129 in config.go\n - `Features.Connector.Token` (commented as protected but not enforced) - line 89 in config.go\n\n3. **Inconsistent Protection Logic**: While `Storage` is properly protected with save/restore pattern (lines 33, 45), equally or more sensitive fields like APIKey and WebhookSecret are NOT protected using the same pattern, despite being security-critical.\n\n4. **Hot-reload Amplification**: The `/api/v1/configs/reload` endpoint (config_storage.go:114-128) allows immediate application of malicious config changes without server restart, enabling rapid exploitation.\n\n5. **Zero Validation**: The SetConfig storage method (local.go:5129-5161) accepts arbitrary YAML content without validating or rejecting sensitive field modifications.\n\n**Complete Attack Chain:**\n1. Attacker gains DB write access OR compromises an account with `config_management` capability\n2. Attacker uploads malicious config YAML with attacker-controlled tokens via `PUT /api/v1/configs/agentfield.yaml`\n3. Attacker triggers config reload via `POST /api/v1/configs/reload`\n4. Server immediately loads attacker's tokens from DB, replacing legitimate file/env-configured tokens\n5. 
Attacker can now authenticate with their own API key, forge webhook approvals, perform admin operations with their admin token, and authenticate to agents with their internal token\n\n**Risk Escalation:** This is worse than individual findings because it allows COMPLETE SYSTEM COMPROMISE through a single config write operation, bypassing all authentication layers simultaneously.\n\n---\n\n> Evidence of the compound control gap:\n> \n> 1. **False security claims in comments** (config_db.go:90-97):\n> Line 90-92: 'NOTE: Connector config (token, capabilities) is intentionally NOT merged from DB.'\n> Line 94: 'API settings (but never override API key from DB for security)'\n> Yet NO code enforces these protections - only CORS is merged conditionally at lines 95-97.\n> \n> 2. **Missing protection for APIKey** (config_db.go:94-97):\n> The comment says API key should never be overridden from DB, but the only code that runs is CORS merge. API.Auth.APIKey is never preserved or restored.\n> \n> 3. **Dangerous struct-level merge for Approval** (config_db.go:82-84):\n> ```go\n> if dbCfg.AgentField.Approval.WebhookSecret != \"\" || dbCfg.AgentField.Approval.DefaultExpiryHours != 0 {\n> target.AgentField.Approval = dbCfg.AgentField.Approval\n> }\n> ```\n> This merges the ENTIRE Approval struct including WebhookSecret when either field is non-empty.\n> \n> 4. **Dangerous struct-level merge for DID** (config_db.go:86-89):\n> ```go\n> if dbCfg.Features.DID.Method != \"\" {\n> target.Features.DID = dbCfg.Features.DID\n> }\n> ```\n> This merges the ENTIRE DIDConfig struct including Authorization.AdminToken and Authorization.InternalToken.\n> \n> 5. **Proper protection only for Storage** (config_db.go:33,45):\n> Line 33: `savedStorage := cfg.Storage`\n> Line 45: `cfg.Storage = savedStorage`\n> This shows the pattern that SHOULD be used for other sensitive fields but is NOT.\n> \n> 6. 
**Config structs showing sensitive fields** (config.go):\n> - Line 47: `WebhookSecret string` in ApprovalConfig\n> - Line 125: `AdminToken string` in AuthorizationConfig \n> - Line 129: `InternalToken string` in AuthorizationConfig\n> - Line 209: `APIKey string` in AuthConfig\n> \n> 7. **No validation in SetConfig** (local.go:5129-5161):\n> Raw YAML stored directly to DB without checking for sensitive field modifications.\n\n**\ud83d\udca1 Suggested Fix**\n\nImplement consistent security field protection across ALL sensitive configuration values:\n\n1. **Immediate Fix - Add protection for all security-sensitive tokens** (config_db.go):\n```go\nfunc overlayDBConfig(cfg *config.Config, store storage.StorageProvider) error {\n // ... existing code ...\n \n // Preserve ALL security-sensitive tokens from file/env config\n savedStorage := cfg.Storage\n savedAPIKey := cfg.API.Auth.APIKey\n savedWebhookSecret := cfg.AgentField.Approval.WebhookSecret\n savedAdminToken := cfg.Features.DID.Authorization.AdminToken\n savedInternalToken := cfg.Features.DID.Authorization.InternalToken\n savedConnectorToken := cfg.Features.Connector.Token\n \n // Parse and merge DB config\n var dbCfg config.Config\n if err := yaml.Unmarshal([]byte(entry.Value), &dbCfg); err != nil {\n return fmt.Errorf(\"failed to parse database config YAML: %w\", err)\n }\n mergeDBConfig(cfg, &dbCfg)\n \n // Restore all security-sensitive values (never overridden from DB)\n cfg.Storage = savedStorage\n cfg.API.Auth.APIKey = savedAPIKey\n cfg.AgentField.Approval.WebhookSecret = savedWebhookSecret\n cfg.Features.DID.Authorization.AdminToken = savedAdminToken\n cfg.Features.DID.Authorization.InternalToken = savedInternalToken\n cfg.Features.Connector.Token = savedConnectorToken\n \n // ... rest of function ...\n}\n```\n\n2. **Medium-term - Add field-level merge for DID and Approval** instead of struct-level merge to avoid accidentally merging sensitive sub-fields.\n\n3. 
**Long-term - Add config validation middleware** that rejects DB config updates containing modifications to security-sensitive fields, returning a 400 error with explanation.\n\n---\n*`Compound Analysis` \u00b7 confidence 95%*", - "line": 19, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd34 **[CRITICAL] Complete System Compromise via Coordinated DB Config Injection**\n\nThe combination of multiple unprotected security-sensitive fields in the DB config merge logic creates a complete authentication and authorization bypass chain. An attacker with database write access can simultaneously inject malicious values for: (1) DID Authorization tokens (AdminToken/InternalToken) via the full-DID-struct merge at lines 87-89, (2) WebhookSecret via the full-Approval-struct merge at lines 82-84, (3) API.Auth.APIKey which is parsed by yaml.Unmarshal at line 37 but never explicitly restored, and (4) Connector.Token/Capabilities which are claimed to be protected by comment at lines 90-92 but have no actual code enforcement. This allows an attacker to: authenticate with their own API key, escalate privileges using their own AdminToken, forge approval callbacks with their own WebhookSecret, and gain unauthorized connector access with their own token. The compound effect is TOTAL SYSTEM COMPROMISE - the attacker controls all authentication, authorization, and validation mechanisms simultaneously, making this significantly more severe than any individual vulnerability.\n\n---\n\n> Step 1: yaml.Unmarshal at line 37 parses ALL fields from DB-stored YAML including api.auth.api_key, features.did.authorization.admin_token, features.did.authorization.internal_token, agentfield.approval.webhook_secret, and features.connector.token. Step 2: Lines 87-89 merge entire DID struct when Method != '', overwriting Authorization.AdminToken and Authorization.InternalToken. 
Step 3: Lines 82-84 merge entire Approval struct when WebhookSecret != '', allowing secret replacement. Step 4: Lines 90-92 claim connector config is protected but NO code enforcement exists (unlike lines 33,45 which save/restore Storage). Step 5: Lines 94-97 only merge CORS, leaving API.Auth vulnerable to DB override. Step 6: The save/restore pattern at lines 33,45 proves the correct protection approach exists but is inconsistently applied.\n\n**\ud83d\udca1 Suggested Fix**\n\nApply the same save/restore pattern used for Storage (lines 33,45) to ALL security-sensitive fields before calling mergeDBConfig. Specifically: (1) Save cfg.API.Auth before line 42 and restore after, (2) Save cfg.Features.DID.Authorization before line 42 and restore after, (3) Save cfg.AgentField.Approval.WebhookSecret before line 42 and restore after, (4) Save cfg.Features.Connector before line 42 and restore after. Alternatively, implement a whitelist approach where ONLY explicitly allowed non-sensitive fields can be merged from DB config.\n\n---\n*`Compound Analysis` \u00b7 confidence 92%*", - "line": 82, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd34 **[CRITICAL] Systemic DB Config Security Control Gap Enables Total Authentication Bypass**\n\nThe `mergeDBConfig` function implements an INCONSISTENT security protection pattern that creates a systemic control gap enabling total authentication bypass. While Storage config is properly protected (saved at line 33, restored at line 45), FOUR other critical security-sensitive fields are left completely unprotected:\n\n1. **API.Auth.APIKey** (lines 94-97): Comment claims 'never override API key from DB for security' but code only merges CORS settings. The APIKey parsed from DB YAML remains in dbCfg struct with no explicit clearing.\n\n2. 
**AgentField.Approval.WebhookSecret** (lines 82-84): Entire Approval struct is merged when WebhookSecret or DefaultExpiryHours is set in DB, overwriting file/env HMAC-SHA256 secret used for webhook verification.\n\n3. **Features.DID.Authorization.AdminToken/InternalToken** (lines 87-89): Entire DID struct is merged when Method is non-empty, overwriting admin and internal authentication tokens used for privileged operations and agent authentication.\n\n4. **Features.Connector.Token/Capabilities** (lines 90-92): Comment claims connector config is 'intentionally NOT merged from DB' but NO CODE ENFORCES THIS. Parsed DB values persist in dbCfg struct.\n\n**COMPOUND IMPACT - Total System Compromise:**\nAn attacker with database write access can override ALL authentication mechanisms simultaneously:\n- Set `api.auth.api_key` \u2192 Gain unauthorized API access\n- Set `agentfield.approval.webhook_secret` \u2192 Forge webhook callbacks for unauthorized approvals\n- Set `features.did.method` + `features.did.authorization.admin_token` \u2192 Perform admin operations and bypass agent authentication\n- Set `features.connector.token` \u2192 Compromise connector service integration\n\nThis is NOT four separate vulnerabilities - it is ONE SYSTEMIC CONTROL GAP where a security protection pattern exists but is inconsistently applied. The existence of proper Storage protection proves the developers understand the risk, but the same protection was omitted for other equally critical credentials.\n\n---\n\n> 1. **Storage protection pattern (CORRECT)**: config_db.go:33 saves `cfg.Storage` before merge, line 45 restores it after. This proves the security model exists. 2. **APIKey protection FAILURE**: config_db.go:94 comment says 'never override API key from DB' but lines 95-97 only merge CORS. No explicit clearing of dbCfg.API.Auth.APIKey. 3. 
**WebhookSecret override**: config_db.go:82-84 assigns entire `target.AgentField.Approval = dbCfg.AgentField.Approval` when WebhookSecret is non-empty, overwriting the file/env secret. 4. **DID Authorization tokens override**: config_db.go:87-89 assigns entire `target.Features.DID = dbCfg.Features.DID` when Method is non-empty. config.go:125,129 show DIDConfig.Authorization contains AdminToken and InternalToken. 5. **Connector protection COMMENT-ONLY**: config_db.go:90-92 comment claims protection but no code saves/restores `cfg.Features.Connector` like Storage. 6. **Attack vector**: All sensitive values are parsed from DB YAML at config_db.go:37 via `yaml.Unmarshal`.\n\n**\ud83d\udca1 Suggested Fix**\n\nImplement CONSISTENT protection for ALL security-sensitive fields. Create a systematic approach:\n\n1. **Immediate fix**: Add save/restore pattern for all sensitive fields:\n```go\n// At line 32-33, add:\nsavedAPIKey := cfg.API.Auth.APIKey\nsavedApproval := cfg.AgentField.Approval\nsavedDIDAuth := cfg.Features.DID.Authorization\nsavedConnector := cfg.Features.Connector\n\n// At line 44-45, add:\ncfg.API.Auth.APIKey = savedAPIKey\ncfg.AgentField.Approval = savedApproval\ncfg.Features.DID.Authorization = savedDIDAuth\ncfg.Features.Connector = savedConnector\n```\n\n2. **Better fix**: Refactor mergeDBConfig to use field-by-field merging for sensitive structs instead of whole-struct assignment. Only merge non-sensitive fields individually.\n\n3. 
**Best fix**: Add a comprehensive test that verifies NO sensitive credentials can be overridden from DB config by attempting to inject malicious values for all security-sensitive fields.\n\n---\n*`Compound Analysis` \u00b7 confidence 92%*", - "line": 32, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd34 **[CRITICAL] Systemic Control Gap: Inconsistent Application of Security-Sensitive Field Protection**\n\nThe codebase demonstrates a systemic control gap where the correct pattern for protecting security-sensitive configuration fields exists but is inconsistently applied. The save/restore pattern at lines 33,45 correctly protects Storage config from DB override (addressing the bootstrap problem), but this same pattern is NOT applied to other equally sensitive fields: API.Auth (controlling API authentication), Features.DID.Authorization (controlling admin/internal tokens), AgentField.Approval (controlling webhook secrets), and Features.Connector (controlling service tokens). This pattern inconsistency indicates a missing security control in the development process - the Storage protection was implemented as a one-off fix rather than establishing a comprehensive security rule. The presence of comments at lines 90-92 and 94 claiming protection exists (without code enforcement) further suggests confusion about what is actually protected. This systemic gap means future security-sensitive fields are likely to be similarly vulnerable.\n\n---\n\n> Step 1: Lines 33,45 show the correct save/restore pattern: `savedStorage := cfg.Storage` before merge and `cfg.Storage = savedStorage` after merge. Step 2: Lines 87-89, 82-84 show entire struct assignment for DID and Approval without field-level protection. Step 3: Lines 94-97 show comment claiming API key protection but only CORS is actually protected. Step 4: Lines 90-92 show comment claiming connector protection but NO corresponding code. 
Step 5: The pattern inconsistency spans 4 different security-sensitive fields across lines 82-97, indicating a missing systematic approach.\n\n**\ud83d\udca1 Suggested Fix**\n\nEstablish a comprehensive security policy for DB config merging: (1) Create an explicit allowlist of fields that CAN be merged from DB, default-deny all others, (2) Document the save/restore pattern requirement in code comments and developer documentation, (3) Add unit tests that verify each security-sensitive field cannot be overridden from DB config, (4) Consider creating a helper function `preserveSecurityFields(cfg *Config) (restore func())` that automatically saves and returns a restore function for all sensitive fields, ensuring consistency.\n\n---\n*`Compound Analysis` \u00b7 confidence 88%*", - "line": 32, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd34 **[CRITICAL] API.Auth.APIKey can be overridden from DB config - no protection implemented**\n\nThe `mergeDBConfig` function only merges `API.CORS` settings (lines 94-97) but completely ignores `API.Auth.APIKey`. This means the API authentication key is left vulnerable to being set/overridden from DB config through struct assignment elsewhere or future code changes.\n\n**The vulnerability:** While the current code doesn't explicitly merge `API.Auth`, the struct can still receive values from DB config parsing. The YAML unmarshaling at line 37 populates `dbCfg` with ALL values from DB-stored YAML, including `api.auth.api_key`. 
Since there's no explicit preservation of `API.Auth.APIKey` like there is for `Storage` (lines 33, 45), this sensitive credential could be overridden.\n\n**Security impact:**\n- `API.Auth.APIKey` controls access to the entire AgentField API\n- If an attacker can set this via DB config, they can authenticate to the API with their own key\n- This bypasses any file/env-based API key configuration\n\n**The comment at line 94** says \"API settings (but never override API key from DB for security)\" but this protection is NOT actually implemented in the code.\n\n---\n\n> Step 1: config_db.go:94-97 shows only CORS is merged, comment says API key should not be overridden but no code enforces this. Step 2: config.go:207-212 shows AuthConfig contains APIKey (line 209). Step 3: yaml.Unmarshal at config_db.go:37 parses ALL fields from DB YAML including api.auth.api_key. Step 4: Since mergeDBConfig doesn't explicitly handle API.Auth fields, the dbCfg value could persist if the field exists in DB YAML.\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd explicit protection for `API.Auth.APIKey` similar to how `Storage` is protected. Before calling `mergeDBConfig`, save the API key and restore it after:\n\n```go\n// At line 32-33, add:\nsavedAPIKey := cfg.API.Auth.APIKey\n\n// At line 44-45, add:\ncfg.API.Auth.APIKey = savedAPIKey\n```\n\nAlternatively, explicitly set it in mergeDBConfig if it was preserved elsewhere.\n\n---\n*`Security-Sensitive Field Protection in DB Config Merge` \u00b7 confidence 85%*", - "line": 94, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] SetConfig handler stores invalid YAML without validation**\n\nThe SetConfig handler at lines 67-101 accepts raw YAML/text body and stores it directly in the database without any validation that it parses as valid YAML or conforms to the expected config schema.\n\n**Why this is a problem:**\n1. 
Invalid YAML can be stored via `PUT /api/v1/configs/agentfield.yaml`\n2. On next server startup with `AGENTFIELD_CONFIG_SOURCE=db`, `overlayDBConfig` calls `yaml.Unmarshal` which fails\n3. The error is only logged as a warning (server.go:110), so startup continues with potentially partial/inconsistent config\n4. This creates a broken state that's hard to recover from - operators must manually delete the invalid config via API or DB edit\n\n**Attack scenario:** A malicious actor or buggy client could store malformed YAML, breaking config reloads until manual intervention.\n\n---\n\n> Step 1: HTTP PUT /api/v1/configs/agentfield.yaml -> SetConfig handler (config_storage.go:67)\n> Step 2: Handler reads body with io.ReadAll (line 70), stores directly via storage.SetConfig (line 85)\n> Step 3: No validation performed - body stored as raw string\n> Step 4: On server restart with AGENTFIELD_CONFIG_SOURCE=db, overlayDBConfig (config_db.go:19) reads entry\n> Step 5: yaml.Unmarshal (config_db.go:37) attempts to parse stored value\n> Step 6: If stored value is invalid YAML (e.g., 'invalid: [unclosed'), unmarshal fails\n> Step 7: Error returned at config_db.go:38, logged as warning at server.go:110\n> Step 8: Server continues startup with partial/inconsistent configuration\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd YAML validation before storing in SetConfig. Parse the body with `yaml.Unmarshal` into a temporary config struct to verify it's valid YAML and conforms to the schema. Return 400 Bad Request with details if validation fails. 
Additionally, consider adding a dedicated `/configs/validate` endpoint for dry-run validation before apply.\n\n---\n*`YAML Validation Gap in SetConfig Handler` \u00b7 confidence 95%*", - "line": 67, - "path": "control-plane/internal/handlers/config_storage.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Approval.WebhookSecret can be overridden from DB config**\n\nThe `AgentField.Approval` struct is merged entirely from DB config when `WebhookSecret` or `DefaultExpiryHours` is non-zero (lines 82-84). This includes `WebhookSecret`, which is a security-sensitive HMAC-SHA256 secret used for verifying webhook callbacks.\n\n**The vulnerability:**\n- `WebhookSecret` is used to authenticate incoming webhooks (config.go:47)\n- If an attacker can set this via DB config, they can forge webhook callbacks\n- This could allow unauthorized approval actions or other webhook-triggered operations\n\n**Current behavior:**\n- Lines 82-84 merge the entire `Approval` struct if either field is set in DB\n- This overwrites the file/env `WebhookSecret` with DB value\n- No preservation of the original secret like `Storage` has\n\n---\n\n> Step 1: config_db.go:82-84 merges entire Approval struct if WebhookSecret or DefaultExpiryHours is non-empty. Step 2: config.go:46-49 shows ApprovalConfig contains WebhookSecret (line 47) described as 'HMAC-SHA256 secret for verifying webhook callbacks'. Step 3: Entire struct assignment overwrites all fields including the secret.\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd explicit protection for `AgentField.Approval.WebhookSecret` by saving it before merge and restoring after, similar to Storage protection. 
Or merge only non-sensitive fields individually instead of assigning the entire struct.\n\n---\n*`Security-Sensitive Field Protection in DB Config Merge` \u00b7 confidence 85%*", - "line": 82, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Comment claims connector token/capabilities are excluded but no enforcement in code**\n\nLines 90-92 contain a comment stating \"Connector config (token, capabilities) is intentionally NOT merged from DB. These are security-sensitive and must come from file/env config\". However, this is only a comment - there is NO actual code enforcement of this protection.\n\n**The issue:**\n1. The comment suggests connector token and capabilities are protected like storage config\n2. However, unlike lines 33 and 45 which explicitly save/restore `cfg.Storage`, there is NO corresponding save/restore for `cfg.Features.Connector`\n3. If DB config contains `features.connector.token` or `features.connector.capabilities`, these values WILL be parsed into `dbCfg` at line 37\n4. While the current `mergeDBConfig` doesn't explicitly merge Connector fields, future modifications could inadvertently enable this\n\n**Recommendation:** Either implement the protection (like Storage) or remove the misleading comment.\n\n---\n\n> Step 1: config_db.go:90-92 comment claims connector config is NOT merged for security. Step 2: config_db.go:33,45 shows Storage is saved before merge and restored after - the pattern for security-sensitive fields. Step 3: No corresponding save/restore exists for cfg.Features.Connector. 
Step 4: config.go:87-91 shows ConnectorConfig contains Token (line 89) - a security-sensitive field.\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd explicit protection for Connector config similar to Storage:\n\n```go\n// At line 32-33, add:\nsavedConnector := cfg.Features.Connector\n\n// At line 44-45, add:\ncfg.Features.Connector = savedConnector\n```\n\nOr if the comment is incorrect, update it to reflect actual behavior.\n\n---\n*`Security-Sensitive Field Protection in DB Config Merge` \u00b7 confidence 80%*", - "line": 90, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - } - ], - "event": "REQUEST_CHANGES" - }, - "review_id": "rev_2947062915e9", - "summary": { - "adversary_challenged": 0, - "adversary_confirmed": 0, - "ai_generated_confidence": 0.6666666666666666, - "budget_exhausted": true, - "by_severity": { - "critical": 13, - "important": 3, - "info": 1 - }, - "cost_usd": 0, - "coverage_iterations": 0, - "cross_ref_interactions": 7, - "dimensions_run": 6, - "duration_seconds": 2867.247, - "total_findings": 17 - } -} \ No newline at end of file diff --git a/benchmark/agentfield-254/pr-af-result-kimi-enriched.json b/benchmark/agentfield-254/pr-af-result-kimi-enriched.json deleted file mode 100644 index 1456409..0000000 --- a/benchmark/agentfield-254/pr-af-result-kimi-enriched.json +++ /dev/null @@ -1,1267 +0,0 @@ -{ - "findings": [ - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "The `mergeDBConfig` function claims to merge DB config values but **entire sections of the Config struct are not merged at all**, effectively ignoring user settings stored in the database.\n\n**Missing sections:**\n1. **`AgentField.ExecutionQueue`** (lines 72-78 in config.go): All webhook timeout, retry, and backoff settings are ignored from DB config\n2. **`API.Auth`** (lines 207-212 in config.go): SkipPaths configuration cannot be set from DB\n3. 
**Most `Features.DID` fields**: Only `Method` is merged; `Enabled`, `KeyAlgorithm`, `DerivationMethod`, `KeyRotationDays`, `VCRequirements`, `Keystore`, and `Authorization` are all ignored\n4. **Most `API.CORS` fields**: Only `AllowedOrigins` is merged; `AllowedMethods`, `AllowedHeaders`, `ExposedHeaders`, `AllowCredentials` are ignored\n5. **Most `NodeHealth` fields**: Only `CheckInterval` is merged; `CheckTimeout`, `ConsecutiveFailures`, `RecoveryDebounce`, `HeartbeatStaleThreshold` are ignored\n\nThis means users who store config in the database expecting to control webhook timeouts, DID authorization policies, CORS settings, or health check parameters will have their settings silently ignored, leading to **configuration drift** between what's stored in DB and what's actually applied.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "merge-logic-completeness", - "dimension_name": "Merge Logic Completeness and Correctness", - "evidence": "Step 1: Config struct at config.go:17-23 shows 5 top-level sections\nStep 2: mergeDBConfig only handles partial subsets:\n - AgentField: Port, partial NodeHealth (only CheckInterval), ExecutionCleanup, Approval, MISSING ExecutionQueue\n - Features: Only DID.Method, intentionally skips Connector\n - API: Only CORS.AllowedOrigins, MISSING Auth entirely\n - UI: Fully merged\n - Storage: Explicitly preserved (correct)\nStep 3: User stores config with ExecutionQueue.WebhookTimeout=30s in DB\nStep 4: mergeDBConfig has no logic for ExecutionQueue - value is silently ignored\nStep 5: Server uses default timeout, user configuration is discarded", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_008", - "line_end": 103, - "line_start": 54, - "score": 1.482, - "severity": "critical", - "suggestion": "Add explicit merge logic for all config fields. For struct fields, either:\n1. Merge field-by-field like ExecutionCleanup, or\n2. 
Check a sentinel field to determine if the struct was intentionally set\n\nAt minimum, add merge logic for:\n- `AgentField.ExecutionQueue` (all fields)\n- `API.Auth.SkipPaths` (check slice length)\n- All `Features.DID` sub-fields\n- All `API.CORS` fields\n- All `NodeHealth` fields", - "tags": [ - "config", - "merge", - "missing-fields", - "data-loss" - ], - "title": "Multiple Config Sections Completely Missing from Merge Logic" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `MockStorageProvider` in `config_test.go` implements the old `SetConfig` and `GetConfig` method signatures that were changed in this PR. The interface was updated from:\n\n**Old signatures:**\n- `SetConfig(ctx context.Context, key string, value interface{}) error`\n- `GetConfig(ctx context.Context, key string) (interface{}, error)`\n\n**New signatures:**\n- `SetConfig(ctx context.Context, key string, value string, updatedBy string) error`\n- `GetConfig(ctx context.Context, key string) (*storage.ConfigEntry, error)`\n- `ListConfigs(ctx context.Context) ([]*storage.ConfigEntry, error)`\n- `DeleteConfig(ctx context.Context, key string) error`\n\nThe mock implementation on lines 289-297 still uses the old signatures, meaning this struct no longer satisfies the `StorageProvider` interface. 
This will cause a **compilation error** when running tests.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "storage-interface-verify", - "dimension_name": "StorageProvider Interface Verification", - "evidence": "Step 1: Interface definition at storage/storage.go:132-136 defines:\n- `SetConfig(ctx context.Context, key string, value string, updatedBy string) error`\n- `GetConfig(ctx context.Context, key string) (*ConfigEntry, error)`\n- `ListConfigs(ctx context.Context) ([]*ConfigEntry, error)`\n- `DeleteConfig(ctx context.Context, key string) error`\n\nStep 2: MockStorageProvider at handlers/ui/config_test.go:289-297 implements:\n- `SetConfig(ctx context.Context, key string, value interface{}) error` (missing updatedBy param, wrong value type)\n- `GetConfig(ctx context.Context, key string) (interface{}, error)` (wrong return type)\n- Missing: `ListConfigs` and `DeleteConfig` methods entirely\n\nStep 3: Go's type system requires interface satisfaction - any code using MockStorageProvider as StorageProvider will fail to compile with 'does not implement' errors.", - "file_path": "control-plane/internal/handlers/ui/config_test.go", - "id": "f_000", - "line_end": 297, - "line_start": 289, - "score": 1.14, - "severity": "critical", - "suggestion": "Update the MockStorageProvider to implement the new interface signatures:\n\n```go\nfunc (m *MockStorageProvider) SetConfig(ctx context.Context, key string, value string, updatedBy string) error {\n args := m.Called(ctx, key, value, updatedBy)\n return args.Error(0)\n}\n\nfunc (m *MockStorageProvider) GetConfig(ctx context.Context, key string) (*storage.ConfigEntry, error) {\n args := m.Called(ctx, key)\n if args.Get(0) == nil {\n return nil, args.Error(1)\n }\n return args.Get(0).(*storage.ConfigEntry), args.Error(1)\n}\n\nfunc (m *MockStorageProvider) ListConfigs(ctx context.Context) ([]*storage.ConfigEntry, error) {\n args := m.Called(ctx)\n if args.Get(0) == nil {\n return nil, 
args.Error(1)\n }\n return args.Get(0).([]*storage.ConfigEntry), args.Error(1)\n}\n\nfunc (m *MockStorageProvider) DeleteConfig(ctx context.Context, key string) error {\n args := m.Called(ctx, key)\n return args.Error(0)\n}\n```", - "tags": [ - "compilation-error", - "interface-mismatch", - "tests", - "mock" - ], - "title": "MockStorageProvider has outdated SetConfig/GetConfig signatures - will cause compilation failure" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `MockStorageProvider` in `execute_test.go` implements the old `SetConfig` and `GetConfig` method signatures that were changed in this PR. The interface was updated from:\n\n**Old signatures:**\n- `SetConfig(ctx context.Context, key string, value interface{}) error`\n- `GetConfig(ctx context.Context, key string) (interface{}, error)`\n\n**New signatures:**\n- `SetConfig(ctx context.Context, key string, value string, updatedBy string) error`\n- `GetConfig(ctx context.Context, key string) (*storage.ConfigEntry, error)`\n- `ListConfigs(ctx context.Context) ([]*storage.ConfigEntry, error)`\n- `DeleteConfig(ctx context.Context, key string) error`\n\nThe mock implementation on lines 173-178 still uses the old signatures, meaning this struct no longer satisfies the `StorageProvider` interface. 
This will cause a **compilation error** when running tests.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "storage-interface-verify", - "dimension_name": "StorageProvider Interface Verification", - "evidence": "Step 1: Interface definition at storage/storage.go:132-136 defines:\n- `SetConfig(ctx context.Context, key string, value string, updatedBy string) error`\n- `GetConfig(ctx context.Context, key string) (*ConfigEntry, error)`\n- `ListConfigs(ctx context.Context) ([]*ConfigEntry, error)`\n- `DeleteConfig(ctx context.Context, key string) error`\n\nStep 2: MockStorageProvider at handlers/execute_test.go:173-178 implements:\n- `SetConfig(ctx context.Context, key string, value interface{}) error` (missing updatedBy param, wrong value type)\n- `GetConfig(ctx context.Context, key string) (interface{}, error)` (wrong return type)\n- Missing: `ListConfigs` and `DeleteConfig` methods entirely\n\nStep 3: Go's type system requires interface satisfaction - any code using MockStorageProvider as StorageProvider will fail to compile with 'does not implement' errors.", - "file_path": "control-plane/internal/handlers/execute_test.go", - "id": "f_001", - "line_end": 178, - "line_start": 173, - "score": 1.14, - "severity": "critical", - "suggestion": "Update the MockStorageProvider to implement the new interface signatures:\n\n```go\nfunc (m *MockStorageProvider) SetConfig(ctx context.Context, key string, value string, updatedBy string) error {\n return nil\n}\n\nfunc (m *MockStorageProvider) GetConfig(ctx context.Context, key string) (*storage.ConfigEntry, error) {\n return nil, nil\n}\n\nfunc (m *MockStorageProvider) ListConfigs(ctx context.Context) ([]*storage.ConfigEntry, error) {\n return nil, nil\n}\n\nfunc (m *MockStorageProvider) DeleteConfig(ctx context.Context, key string) error {\n return nil\n}\n```", - "tags": [ - "compilation-error", - "interface-mismatch", - "tests", - "mock" - ], - "title": "MockStorageProvider has outdated 
SetConfig/GetConfig signatures - will cause compilation failure" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The configReloadFn() function accesses and modifies s.config without any mutex protection, yet multiple goroutines throughout server.go read from s.config concurrently.\n\nThe PR description claims configMu.Lock() is acquired during reload (lines 435-442), but NO SUCH MUTEX EXISTS in the codebase. The function directly calls overlayDBConfig(s.config, s.storage) which mutates the config struct in-place via mergeDBConfig().\n\nThis creates a data race:\n- HTTP request handlers read s.config.AgentField.Port, s.config.API.CORS, s.config.Features.DID.Enabled, etc.\n- The reload goroutine (triggered by API call) writes to these same fields\n- No synchronization primitive protects these concurrent accesses\n\nAffected readers include:\n- Route setup code (lines 834-838, 882-893, 913, 919-927, 971)\n- Execute handlers (lines 1246-1247, 1251)\n- Admin routes (lines 1531-1533)\n- DID middleware (lines 890, 1204, 1232)\n- UI routes (lines 1586, 1619)\n\nThis is a critical data race that can cause crashes, memory corruption, or inconsistent config state.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "concurrency-safety-config-reload", - "dimension_name": "Concurrency Safety of Dynamic Config Reload", - "evidence": "Step 1: configReloadFn() at server.go:435-442 returns a closure that calls overlayDBConfig(s.config, s.storage)\nStep 2: overlayDBConfig at config_db.go:19-50 calls mergeDBConfig(cfg, &dbCfg) at line 42\nStep 3: mergeDBConfig at config_db.go:54-103 writes directly to target fields like target.AgentField.Port = dbCfg.AgentField.Port (line 57), target.AgentField.NodeHealth = dbCfg.AgentField.NodeHealth (line 60), etc.\nStep 4: Concurrent goroutines in server.go read s.config fields without any mutex (e.g., line 502: s.config.AgentField.Port, line 834: s.config.API.CORS.AllowedOrigins)\nStep 5: 
No configMu or similar mutex exists in the codebase - verified by grep search\nResult: Unsynchronized concurrent read/write on shared config struct = data race", - "file_path": "control-plane/internal/server/server.go", - "id": "f_021", - "line_end": 442, - "line_start": 435, - "score": 1.14, - "severity": "critical", - "suggestion": "Add a sync.RWMutex field (configMu) to AgentFieldServer struct. Acquire Lock() in configReloadFn() before calling overlayDBConfig, and acquire RLock() in all HTTP handlers that read config. Alternatively, use atomic pointer swap: store config as atomic.Pointer[Config] and swap the entire struct atomically on reload, eliminating need for RLock in readers.", - "tags": [ - "data-race", - "concurrency", - "config", - "mutex-missing" - ], - "title": "Missing Mutex Protection for Config Reload - Data Race on s.config" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The overlayDBConfig function modifies the shared cfg struct in-place through mergeDBConfig, creating race conditions with any concurrent readers.\n\nCritical issue: The function receives a pointer to the server's config struct and directly mutates its fields:\n- Line 42: mergeDBConfig(cfg, &dbCfg) - calls merge function\n- Lines 56-102 in mergeDBConfig: Direct field assignments like target.AgentField.Port = dbCfg.AgentField.Port\n\nThe storage section is protected (saved at line 33, restored at line 45), but all other config sections are unprotected during the merge operation.\n\nThis means concurrent readers can observe:\n1. Partially updated config (e.g., Port updated but NodeHealth not yet updated)\n2. Corrupted memory if writes overlap with reads\n3. 
Inconsistent state between related fields (e.g., DID.Enabled=true but DID.Authorization config not yet applied)", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "concurrency-safety-config-reload", - "dimension_name": "Concurrency Safety of Dynamic Config Reload", - "evidence": "Step 1: overlayDBConfig receives cfg *config.Config parameter at line 19\nStep 2: Only storage config is saved: savedStorage := cfg.Storage at line 33\nStep 3: mergeDBConfig(cfg, &dbCfg) at line 42 writes directly to cfg fields\nStep 4: mergeDBConfig lines 56-102 perform direct assignments: target.AgentField.Port = dbCfg.AgentField.Port, target.AgentField.NodeHealth = dbCfg.AgentField.NodeHealth, etc.\nStep 5: Storage is restored at line 45: cfg.Storage = savedStorage\nResult: All non-storage config fields are mutated in-place without atomicity or synchronization", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_022", - "line_end": 50, - "line_start": 19, - "score": 1.14, - "severity": "critical", - "suggestion": "Option 1: Require caller to hold mutex before calling overlayDBConfig (document in function comments). Option 2: Have overlayDBConfig create a deep copy of the config, modify the copy, then atomically swap the pointer (requires config to be stored as atomic.Pointer). Option 3: Protect each config section with its own mutex (more granular but complex).", - "tags": [ - "data-race", - "in-place-mutation", - "config", - "synchronization" - ], - "title": "overlayDBConfig Modifies Config Struct In-Place Without Synchronization" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The SetConfig handler uses io.ReadAll(c.Request.Body) without any size limitation. This allows attackers to send arbitrarily large request bodies, causing memory exhaustion and potential denial of service. 
The PR diff indicated a maxConfigBodySize constant (1 MB) and io.LimitReader should be used, but the actual implementation is missing this protection. Impact: An attacker with a valid API key can crash the server by uploading multi-gigabyte config files.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "config-storage-handler-review", - "dimension_name": "Config Storage Handler Implementation Review", - "evidence": "Step 1: Attacker sends PUT /api/v1/configs/agentfield.yaml with a 10GB request body. Step 2: Handler calls io.ReadAll(c.Request.Body). Step 3: io.ReadAll allocates memory proportional to request body size. Step 4: Server runs out of memory and crashes (OOM).", - "file_path": "control-plane/internal/handlers/config_storage.go", - "id": "f_027", - "line_end": 78, - "line_start": 70, - "score": 1.14, - "severity": "critical", - "suggestion": "Add a body size limit using io.LimitReader. Define const maxConfigBodySize = 1 << 20 // 1 MB. Then use body, err := io.ReadAll(io.LimitReader(c.Request.Body, maxConfigBodySize+1)) and check if len(body) > maxConfigBodySize then return http.StatusRequestEntityTooLarge with appropriate error message.", - "tags": [ - "security", - "dos", - "memory-exhaustion", - "missing-validation" - ], - "title": "No request body size limit - potential DoS vulnerability" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `mergeDBConfig()` function at lines 54-103 performs field-by-field merging of DB config into the target config struct. This happens in-place on the shared `s.config` object.\n\n**The Problem:**\n1. If a reader accesses `s.config` during `mergeDBConfig()`, they may see a partially updated config.\n2. For example, if the merge updates `AgentField.Port` first, then gets preempted, a reader might see the new Port but old NodeHealth settings.\n3. 
This can lead to inconsistent state where different config fields are from different config versions.\n\n**Even worse**, since `configMu` doesn't exist, there's no mutex protection at all. Multiple goroutines can read `s.config` while it's being modified.", - "confidence": 0.9, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "concurrency-safety-config-reload", - "dimension_name": "Concurrency Safety of Dynamic Config Reload", - "evidence": "Step 1: `overlayDBConfig()` at line 42 calls `mergeDBConfig(cfg, &dbCfg)` where `cfg` is `s.config`.\nStep 2: `mergeDBConfig()` modifies fields one-by-one (lines 56-103) without atomicity.\nStep 3: Example: Line 56-58 updates `AgentField.Port`, lines 59-61 update `NodeHealth` - a reader could see new Port but old NodeHealth.\nStep 4: No atomic snapshot or deep copy is performed.\nStep 5: The config struct is modified in-place while other goroutines may be reading it.", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_035", - "line_end": 103, - "line_start": 42, - "score": 1.08, - "severity": "critical", - "suggestion": "Use atomic config replacement instead of in-place modification:\n\n```go\nfunc (s *AgentFieldServer) configReloadFn() handlers.ConfigReloadFunc {\n return func() error {\n // Load new config\n newCfg := *s.config // Copy current config\n if err := overlayDBConfig(&newCfg, s.storage); err != nil {\n return err\n }\n // Atomically swap\n s.configMu.Lock()\n s.config = &newCfg\n s.configMu.Unlock()\n return nil\n }\n}\n```\n\nThis ensures readers always see a consistent (if potentially stale) config, never a partially updated one.", - "tags": [ - "concurrency", - "data-race", - "partial-update", - "atomicity", - "critical" - ], - "title": "Partial config visibility during reload - readers can see half-updated config" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The default configuration enables `config_management` capability with `read_only: false`. 
This grants any connector with a valid token write access to server configuration via the database-backed config storage API. Connectors can modify security-critical settings (API keys, admin tokens, DID authorization settings) without admin privileges. This is inconsistent with other sensitive capabilities like `did_management` which defaults to `enabled: false`.", - "confidence": 0.9, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "cluster_0", - "dimension_name": "Config Merge Correctness", - "evidence": "Step 1: agentfield.yaml:149-151 sets `config_management: enabled: true, read_only: false`. Step 2: PR description states connector routes are gated by `config_management` capability check. Step 3: With these defaults, any deployment using the default config exposes write access to configuration. Step 4: Connectors can call PUT/DELETE /api/v1/connector/configs/* to modify server config including auth tokens (lines mentioned in PR context: server.go:1573-1578).", - "file_path": "control-plane/config/agentfield.yaml", - "id": "f_038", - "line_end": 151, - "line_start": 149, - "score": 1.08, - "severity": "critical", - "suggestion": "Change the default to `enabled: false` or at minimum `read_only: true`. This follows the principle of least privilege and prevents unauthorized configuration modifications. 
Operators who need connector config management can explicitly enable it after reviewing security implications.", - "tags": [ - "security", - "default-values", - "authorization", - "connector" - ], - "title": "Security risk: config_management enabled with write access by default" - }, - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "The `NodeHealth` merge logic at lines 59-61 uses blanket struct assignment when `CheckInterval != 0`:\n\n```go\nif dbCfg.AgentField.NodeHealth.CheckInterval != 0 {\n target.AgentField.NodeHealth = dbCfg.AgentField.NodeHealth\n}\n```\n\n**Problem**: If the DB config only specifies `CheckInterval` but not other fields like `CheckTimeout`, `ConsecutiveFailures`, `RecoveryDebounce`, or `HeartbeatStaleThreshold`, the entire struct is overwritten. This means:\n1. File/env settings for other NodeHealth fields are lost\n2. The zero values from the YAML unmarshal (for unspecified fields) overwrite valid existing values\n\nThis contradicts the function's stated purpose of \"only non-zero/non-empty values from the DB config are applied.\"", - "confidence": 0.9, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "merge-logic-completeness", - "dimension_name": "Merge Logic Completeness and Correctness", - "evidence": "Step 1: File config has NodeHealth.CheckTimeout=10s, NodeHealth.CheckInterval=5s\nStep 2: DB config only sets CheckInterval=15s (leaving others at Go zero values)\nStep 3: mergeDBConfig checks CheckInterval != 0 (true)\nStep 4: target.AgentField.NodeHealth = dbCfg.AgentField.NodeHealth assigns entire struct\nStep 5: target.AgentField.NodeHealth.CheckTimeout becomes 0 (was 10s), data is lost", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_009", - "line_end": 61, - "line_start": 59, - "score": 0.983, - "severity": "important", - "suggestion": "Change NodeHealth merge to field-by-field approach like ExecutionCleanup:\n```go\nif 
dbCfg.AgentField.NodeHealth.CheckInterval != 0 {\n target.AgentField.NodeHealth.CheckInterval = dbCfg.AgentField.NodeHealth.CheckInterval\n}\nif dbCfg.AgentField.NodeHealth.CheckTimeout != 0 {\n target.AgentField.NodeHealth.CheckTimeout = dbCfg.AgentField.NodeHealth.CheckTimeout\n}\n// etc for all fields\n```", - "tags": [ - "config", - "merge", - "struct-assignment", - "data-loss" - ], - "title": "NodeHealth Struct Merge Uses Blanket Assignment, Risking Data Loss" - }, - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "The `Features.DID` merge at lines 87-89 only checks if `Method != \"\"` and then does blanket struct assignment:\n\n```go\nif dbCfg.Features.DID.Method != \"\" {\n target.Features.DID = dbCfg.Features.DID\n}\n```\n\n**Problems**:\n1. **Data loss**: Like NodeHealth, this uses blanket assignment, so unspecified fields in DB config overwrite valid file/env settings with zero values\n2. **Cannot set non-Method fields alone**: If a user wants to only change `KeyRotationDays` or `VCRequirements` in DB config without changing `Method`, they cannot - the condition requires Method to be non-empty\n\nThe `DIDConfig` struct (config.go:100-109) has 9 fields, but only `Method` can trigger a merge, and when triggered, all other fields are subject to zero-value overwrite.", - "confidence": 0.9, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "merge-logic-completeness", - "dimension_name": "Merge Logic Completeness and Correctness", - "evidence": "Step 1: File config sets DID.Enabled=true, Method=\"did:key\", KeyRotationDays=90\nStep 2: DB config only sets KeyRotationDays=30 (leaving Method empty)\nStep 3: Condition Method != \"\" evaluates to false\nStep 4: No merge happens, KeyRotationDays remains 90 despite DB having 30\nOR if Method WAS set in DB, entire struct is overwritten, losing file/env settings for unspecified fields", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_012", - 
"line_end": 89, - "line_start": 87, - "score": 0.983, - "severity": "important", - "suggestion": "Implement field-by-field merge for DIDConfig similar to ExecutionCleanup:\n```go\nif dbCfg.Features.DID.Method != \"\" {\n target.Features.DID.Method = dbCfg.Features.DID.Method\n}\nif dbCfg.Features.DID.KeyAlgorithm != \"\" {\n target.Features.DID.KeyAlgorithm = dbCfg.Features.DID.KeyAlgorithm\n}\n// Handle nested structs like VCRequirements, Keystore, Authorization recursively\n```", - "tags": [ - "config", - "merge", - "struct-assignment", - "missing-fields" - ], - "title": "DIDConfig Merge Only Checks Method Field, Missing All Other DID Settings" - }, - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "The logic for merging `ExecutionCleanup.Enabled` (lines 79-81) requires at least one other cleanup field to be non-zero:\n\n```go\nif dbCfg.AgentField.ExecutionCleanup.RetentionPeriod != 0 || dbCfg.AgentField.ExecutionCleanup.CleanupInterval != 0 {\n target.AgentField.ExecutionCleanup.Enabled = dbCfg.AgentField.ExecutionCleanup.Enabled\n}\n```\n\n**Problem**: A user who wants to explicitly **disable** cleanup by setting `enabled: false` in the DB config cannot do so unless they also set `retention_period` or `cleanup_interval` to non-zero values. 
If they only set `enabled: false` (with other fields at 0), the condition fails and `Enabled` is not updated.\n\nThis violates the principle that users should be able to explicitly set boolean flags to their zero value (false) independently of other fields.", - "confidence": 0.85, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "merge-logic-completeness", - "dimension_name": "Merge Logic Completeness and Correctness", - "evidence": "Step 1: File config has ExecutionCleanup.Enabled=true, RetentionPeriod=24h\nStep 2: User wants to disable cleanup, stores DB config with only 'enabled: false'\nStep 3: All duration fields in dbCfg are 0 (not specified)\nStep 4: Condition at line 79 evaluates to false (0 != 0 || 0 != 0)\nStep 5: target.AgentField.ExecutionCleanup.Enabled remains true, user's explicit false is ignored", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_010", - "line_end": 81, - "line_start": 79, - "score": 0.928, - "severity": "important", - "suggestion": "Use a sentinel/presence check pattern for booleans. Options:\n1. Use a `*bool` pointer type to distinguish between 'not set' and 'explicitly false'\n2. Add a comment explaining that to disable cleanup, users must also set a non-zero retention_period\n3. 
Always merge Enabled if any ExecutionCleanup field is non-zero (broader check)\n\nRecommended fix:\n```go\n// Check if any cleanup field is configured in DB\ncleanupConfigured := dbCfg.AgentField.ExecutionCleanup.RetentionPeriod != 0 ||\n dbCfg.AgentField.ExecutionCleanup.CleanupInterval != 0 ||\n dbCfg.AgentField.ExecutionCleanup.BatchSize != 0 ||\n dbCfg.AgentField.ExecutionCleanup.PreserveRecentDuration != 0 ||\n dbCfg.AgentField.ExecutionCleanup.StaleExecutionTimeout != 0\nif cleanupConfigured {\n target.AgentField.ExecutionCleanup.Enabled = dbCfg.AgentField.ExecutionCleanup.Enabled\n}\n```", - "tags": [ - "config", - "merge", - "boolean-handling", - "zero-value-ambiguity" - ], - "title": "ExecutionCleanup.Enabled Bool Cannot Be Explicitly Set to false Without Changing Other Fields" - }, - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "The `configReloadFn()` method returns a closure that calls `overlayDBConfig(s.config, s.storage)` without any mutex protection. This creates a data race when the reload endpoint is invoked while background services are reading config values.\n\n**Background services that read config concurrently:**\n- `healthMonitor` - uses `cfg.AgentField.NodeHealth.*` settings (line 160-166)\n- `cleanupService` - uses `cfg.AgentField.ExecutionCleanup.*` settings (line 392)\n- `webhookDispatcher` - uses execution queue settings (line 366-371)\n- `statusManager` - uses heartbeat thresholds (line 133-148)\n\n**The race condition:**\n1. Background goroutines read nested config fields (e.g., `s.config.AgentField.NodeHealth.CheckInterval`)\n2. Hot reload via `POST /api/v1/configs/reload` calls `overlayDBConfig()` which mutates the shared config struct\n3. Go's memory model doesn't guarantee atomicity of struct field writes - readers may see partially updated values\n4. 
This can cause services to operate with inconsistent configuration\n\n**Note:** While the PR narrative mentions 'Concurrent Config Access' as a known risk, the actual code doesn't implement the necessary synchronization to mitigate it.", - "confidence": 0.75, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "config-reload-func-verification", - "dimension_name": "ConfigReloadFunc Type and Usage Verification", - "evidence": "Step 1: `configReloadFn()` is defined at server.go:435-442, returns closure calling `overlayDBConfig(s.config, s.storage)`\nStep 2: `overlayDBConfig()` at config_db.go:19-50 directly mutates `cfg` fields via `mergeDBConfig()`\nStep 3: Background services initialized in NewAgentFieldServer (lines 133-392) store config references and access them concurrently\nStep 4: HTTP handlers invoke the reload function without any synchronization barrier\nStep 5: No mutex is defined in AgentFieldServer struct (lines 48-82)", - "file_path": "control-plane/internal/server/server.go", - "id": "f_016", - "line_end": 442, - "line_start": 435, - "score": 0.819, - "severity": "important", - "suggestion": "Add a `sync.RWMutex` field to `AgentFieldServer` struct to protect config access:\n\n1. Add `configMu sync.RWMutex` to the struct (line 48-82)\n2. In `configReloadFn()`, acquire write lock before calling `overlayDBConfig`:\n ```go\n return func() error {\n s.configMu.Lock()\n defer s.configMu.Unlock()\n return overlayDBConfig(s.config, s.storage)\n }\n ```\n3. Background services should acquire read locks when accessing config, OR config should be accessed through getter methods that acquire read locks", - "tags": [ - "concurrency", - "data-race", - "config-reload", - "mutex" - ], - "title": "Unprotected concurrent config access during hot reload - potential data race" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `HealthMonitor` service is initialized with a `HealthMonitorConfig` struct at lines 160-166. 
The config values (`CheckInterval`, `CheckTimeout`, `ConsecutiveFailures`, `RecoveryDebounce`) are copied into the service at startup and never updated.\n\nWhen `overlayDBConfig()` reloads config from the database (via the reload API), the health monitor will continue using the stale cached values. This means changes to `NodeHealth` configuration via the DB reload mechanism will NOT take effect until the server is restarted.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "concurrency-safety-config-reload", - "dimension_name": "Concurrency Safety of Dynamic Config Reload", - "evidence": "Step 1: `healthMonitorConfig` is created with values from `cfg.AgentField.NodeHealth` at lines 160-165.\nStep 2: `services.NewHealthMonitor()` receives the config by value (copied), not by reference.\nStep 3: The `HealthMonitor` struct stores `config HealthMonitorConfig` by value (see health_monitor.go:50).\nStep 4: `overlayDBConfig()` at server.go:109 and config_db.go:19-50 can update `AgentField.NodeHealth` values.\nStep 5: No mechanism exists to propagate reloaded config to the already-running HealthMonitor.", - "file_path": "control-plane/internal/server/server.go", - "id": "f_032", - "line_end": 166, - "line_start": 160, - "score": 0.798, - "severity": "important", - "suggestion": "Either document that NodeHealth config changes require a server restart, OR add a `ReloadConfig()` method to HealthMonitor that can be called after config reload, OR have HealthMonitor read from the shared config with proper mutex protection instead of caching values.", - "tags": [ - "concurrency", - "stale-config", - "health-monitor", - "config-reload" - ], - "title": "HealthMonitor caches config values at startup - won't see reloads" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `WebhookDispatcher` is initialized with a `WebhookDispatcherConfig` struct at lines 366-371. 
The config values (`WebhookTimeout`, `WebhookMaxAttempts`, `WebhookRetryBackoff`, `WebhookMaxRetryBackoff`) are copied into the dispatcher at startup and never updated.\n\nWhen `overlayDBConfig()` reloads config from the database, the webhook dispatcher will continue using the stale cached values. Changes to webhook configuration via DB reload will NOT take effect until server restart.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "concurrency-safety-config-reload", - "dimension_name": "Concurrency Safety of Dynamic Config Reload", - "evidence": "Step 1: `WebhookDispatcherConfig` is created with values from `cfg.AgentField.ExecutionQueue` at lines 367-370.\nStep 2: `services.NewWebhookDispatcher()` receives config by value.\nStep 3: The `webhookDispatcher` struct stores `cfg WebhookDispatcherConfig` by value (see webhook_dispatcher.go:51).\nStep 4: `overlayDBConfig()` can update `AgentField.ExecutionQueue` values.\nStep 5: No mechanism exists to propagate reloaded config to the running WebhookDispatcher.", - "file_path": "control-plane/internal/server/server.go", - "id": "f_033", - "line_end": 375, - "line_start": 366, - "score": 0.798, - "severity": "important", - "suggestion": "Either document that ExecutionQueue config changes require restart, or add a ReloadConfig method to WebhookDispatcher.", - "tags": [ - "concurrency", - "stale-config", - "webhook-dispatcher", - "config-reload" - ], - "title": "WebhookDispatcher caches config values at startup - won't see reloads" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `ExecutionCleanupService` is initialized with `cfg.AgentField.ExecutionCleanup` at line 392. The cleanup config (`RetentionPeriod`, `CleanupInterval`, `BatchSize`, etc.) is copied into the service at startup.\n\nWhen `overlayDBConfig()` reloads config, the cleanup service will continue using the stale cached values. 
Changes to `ExecutionCleanup` configuration via DB reload will NOT take effect until server restart. The cleanup service runs in a background goroutine (line 476) and uses its cached config for all operations.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "concurrency-safety-config-reload", - "dimension_name": "Concurrency Safety of Dynamic Config Reload", - "evidence": "Step 1: `NewExecutionCleanupService()` receives `cfg.AgentField.ExecutionCleanup` at line 392.\nStep 2: The `ExecutionCleanupService` struct stores `config config.ExecutionCleanupConfig` by value (see execution_cleanup.go:16).\nStep 3: `overlayDBConfig()` at config_db.go:63-81 can update `AgentField.ExecutionCleanup` values.\nStep 4: The cleanup service starts at line 476 and runs independently with cached config.", - "file_path": "control-plane/internal/server/server.go", - "id": "f_034", - "line_end": 392, - "line_start": 392, - "score": 0.798, - "severity": "important", - "suggestion": "Either document that ExecutionCleanup config changes require restart, or add a ReloadConfig method to ExecutionCleanupService.", - "tags": [ - "concurrency", - "stale-config", - "cleanup-service", - "config-reload" - ], - "title": "ExecutionCleanupService caches config values at startup - won't see reloads" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `connectorCapEnvMap` that maps environment variables to connector capabilities does not include the new `config_management` capability. This means the capability can be configured via YAML file but cannot be overridden via environment variables like other capabilities (e.g., `AGENTFIELD_CONNECTOR_CAP_POLICY_MANAGEMENT`). 
This breaks configuration parity and prevents operators from disabling or restricting config_management via environment variables in containerized deployments.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "cluster_0", - "dimension_name": "Config Merge Correctness", - "evidence": "Step 1: Line 333-340 defines `connectorCapEnvMap` with 6 capability mappings. Step 2: Lines 341-355 iterate this map to apply environment overrides. Step 3: The new `config_management` capability added to agentfield.yaml:149-151 is NOT present in this map. Step 4: Setting `AGENTFIELD_CONNECTOR_CAP_CONFIG_MANAGEMENT=readonly` in environment will have no effect, unlike other capabilities.", - "file_path": "control-plane/internal/config/config.go", - "id": "f_037", - "line_end": 340, - "line_start": 333, - "score": 0.798, - "severity": "important", - "suggestion": "Add the `config_management` capability to the `connectorCapEnvMap` with a corresponding environment variable name: `AGENTFIELD_CONNECTOR_CAP_CONFIG_MANAGEMENT`. The entry should map to the capability name `config_management` following the same pattern as other capabilities.", - "tags": [ - "config", - "environment-variables", - "inconsistency", - "connector" - ], - "title": "Missing environment variable override for config_management capability" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "Four background services are initialized with config values at server startup and cache them internally. When config is reloaded via the API, these services continue using the old cached values.\n\nServices affected:\n\n1. webhookDispatcher (lines 366-374):\n - Caches WebhookTimeout, WebhookMaxAttempts, WebhookRetryBackoff, WebhookMaxRetryBackoff\n - Values stored in WebhookDispatcherConfig struct at creation time\n - Reload does NOT update these values\n\n2. 
observabilityForwarder (lines 377-389):\n - Caches BatchSize, BatchTimeout, HTTPTimeout, MaxAttempts, etc.\n - Values stored in ObservabilityForwarderConfig struct at creation time\n - Has ReloadConfig() method but it only reloads webhook URL from storage, not the forwarder config\n\n3. cleanupService (line 392):\n - Caches ExecutionCleanupConfig (RetentionPeriod, CleanupInterval, BatchSize, etc.)\n - Used in cleanupLoop() which runs indefinitely\n - Reload does NOT update these values\n\n4. healthMonitor (line 166):\n - Caches HealthMonitorConfig (CheckInterval, CheckTimeout, ConsecutiveFailures, etc.)\n - Used in Start() method which runs indefinitely\n - Reload does NOT update these values\n\nImpact: After config reload, the server appears to use new config (API returns success), but background services silently continue with old values. This creates confusion and unexpected behavior.", - "confidence": 0.9, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "concurrency-safety-config-reload", - "dimension_name": "Concurrency Safety of Dynamic Config Reload", - "evidence": "Step 1: webhookDispatcher created at server.go:366-371 with services.WebhookDispatcherConfig{Timeout: cfg.AgentField.ExecutionQueue.WebhookTimeout, ...}\nStep 2: observabilityForwarder created at server.go:377-389 with services.ObservabilityForwarderConfig{BatchSize: 10, ...} (hardcoded defaults, not from config at all!)\nStep 3: cleanupService created at server.go:392 with cfg.AgentField.ExecutionCleanup\nStep 4: healthMonitor created at server.go:166 with services.HealthMonitorConfig{CheckInterval: cfg.AgentField.NodeHealth.CheckInterval, ...}\nStep 5: All services started before config reload can be triggered\nStep 6: None of these services have mechanisms to receive updated config values\nResult: Config reload only affects the main server's config struct, not the cached values in background services", - "file_path": "control-plane/internal/server/server.go", - "id": "f_023", - 
"line_end": 428, - "line_start": 366, - "score": 0.756, - "severity": "important", - "suggestion": "For each background service, either:\n1. Add a ReloadConfig(newCfg ConfigType) method that updates internal config (requires careful synchronization within the service)\n2. Document that certain config changes require server restart to take effect\n3. Pass config via callback function instead of static values, so services read latest config each time\n4. For observabilityForwarder, the config values are currently hardcoded - they should at least be read from config at startup", - "tags": [ - "stale-config", - "background-services", - "caching", - "config-reload" - ], - "title": "Background Services Cache Config Values at Startup - Reload Has No Effect" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The mergeDBConfig function updates config fields one by one, creating a window where readers can see a partially updated configuration. This is a form of torn read.\n\nExample scenario:\n1. Reader goroutine accesses cfg.AgentField.ExecutionCleanup during reload\n2. mergeDBConfig has updated RetentionPeriod but not yet updated CleanupInterval\n3. Reader sees inconsistent state: new retention period with old cleanup interval\n\nSpecific vulnerable fields:\n- Lines 63-81: ExecutionCleanup fields updated individually (RetentionPeriod, CleanupInterval, BatchSize, PreserveRecentDuration, StaleExecutionTimeout, Enabled)\n- Lines 82-84: Approval struct replaced atomically (better, but still mixed with other fields)\n- Lines 87-89: Features.DID struct replaced atomically\n- Lines 95-97: API.CORS struct replaced atomically\n\nThe problem: While individual struct assignments are atomic, the overall config is NOT updated atomically. 
Between the first and last field update, readers see an inconsistent mix of old and new values.", - "confidence": 0.85, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "concurrency-safety-config-reload", - "dimension_name": "Concurrency Safety of Dynamic Config Reload", - "evidence": "Step 1: mergeDBConfig at config_db.go:54-103 updates fields sequentially\nStep 2: Lines 63-81 update ExecutionCleanup field-by-field (not atomic as a group)\nStep 3: Concurrent reader at server.go:392 accessing s.config.AgentField.ExecutionCleanup could read during updates\nStep 4: Example race: Writer updates RetentionPeriod at line 64, then gets preempted\nStep 5: Reader reads ExecutionCleanup struct, sees new RetentionPeriod but old CleanupInterval (line 67 hasn't executed yet)\nResult: Reader observes inconsistent config state", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_024", - "line_end": 103, - "line_start": 54, - "score": 0.714, - "severity": "important", - "suggestion": "Make config updates atomic by either:\n1. Create a complete new Config struct, populate it with merged values, then atomically swap the pointer (using atomic.Pointer or similar)\n2. Hold a write lock during the entire merge operation, and have all readers acquire read lock (but this blocks readers during reload)\n3. Accept that partial visibility is a known limitation and document which config sections are updated atomically vs field-by-field", - "tags": [ - "atomicity", - "partial-visibility", - "config", - "consistency" - ], - "title": "Partial Config Visibility Risk - Individual Field Updates Not Atomic" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The DeleteConfig handler returns HTTP 404 (Not Found) for ANY error from storage.DeleteConfig(), regardless of the actual error cause. This incorrectly masks database errors, permission errors, or other internal failures as not found conditions. 
Current behavior: Database connection failure results in 404 Not Found. Expected behavior: Database connection failure results in 500 Internal Server Error. This makes debugging difficult and violates HTTP semantics.", - "confidence": 0.85, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "config-storage-handler-review", - "dimension_name": "Config Storage Handler Implementation Review", - "evidence": "Step 1: Database connection fails during DeleteConfig call. Step 2: storage.DeleteConfig returns error like connection refused. Step 3: Handler returns c.JSON(http.StatusNotFound, ...) for ANY error. Step 4: Client receives misleading 404 status instead of 500.", - "file_path": "control-plane/internal/handlers/config_storage.go", - "id": "f_028", - "line_end": 110, - "line_start": 106, - "score": 0.714, - "severity": "important", - "suggestion": "Check the error type to distinguish not found from other errors. If errors.Is(err, storage.ErrNotFound) then return http.StatusNotFound, otherwise return http.StatusInternalServerError. Or if the storage layer does not return typed errors, check for not found in the error message.", - "tags": [ - "error-handling", - "http-semantics", - "incorrect-status-code" - ], - "title": "DeleteConfig returns 404 for all errors, masking real failures" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "Line 502 accesses s.config.AgentField.Port without any synchronization:\nreturn s.Router.Run(: + strconv.Itoa(s.config.AgentField.Port))\n\nWhile this specific access happens during server startup (before reload is possible), other accesses to Port throughout the codebase may happen concurrently. Additionally, this pattern demonstrates the unsynchronized access pattern that's problematic.\n\nMore critically, if the port were to change via config reload, the server would need to restart to bind to the new port - but this isn't handled. 
The port is effectively 'cached' by the running HTTP server.", - "confidence": 0.7, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "concurrency-safety-config-reload", - "dimension_name": "Concurrency Safety of Dynamic Config Reload", - "evidence": "Step 1: Line 502 reads s.config.AgentField.Port to start HTTP server\nStep 2: No RLock acquired before reading\nStep 3: If config reload changes the port, the running server continues on old port\nResult: Port config is effectively immutable after startup, but this isn't enforced or documented", - "file_path": "control-plane/internal/server/server.go", - "id": "f_025", - "line_end": 502, - "line_start": 502, - "score": 0.588, - "severity": "important", - "suggestion": "Either document that port changes require restart, or add a check in config reload to reject changes to certain immutable fields (like port). Also add mutex protection for consistency.", - "tags": [ - "config", - "port", - "synchronization", - "documentation" - ], - "title": "HTTP Server Port Accessed Without Lock During Concurrent Reload" - }, - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "The type alias `ConfigReloadFunc` is correctly defined with an exported name (capitalized) and can be imported by the server package. 
The function signature `func() error` matches the expected usage pattern for configuration reload callbacks.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "config-reload-func-verification", - "dimension_name": "ConfigReloadFunc Type and Usage Verification", - "evidence": "Line 12: `type ConfigReloadFunc func() error` - exported type name, correct signature", - "file_path": "control-plane/internal/handlers/config_storage.go", - "id": "f_017", - "line_end": 12, - "line_start": 12, - "score": 0.445, - "severity": "suggestion", - "suggestion": null, - "tags": [ - "type-check", - "verification" - ], - "title": "ConfigReloadFunc type alias is correctly exported" - }, - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "Both call sites (lines 1552 and 1576) correctly invoke `NewConfigStorageHandlers(s.storage, s.configReloadFn())`. The `configReloadFn()` method returns `handlers.ConfigReloadFunc`, which matches the expected parameter type. 
Both admin routes and connector routes use the same initialization pattern.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "config-reload-func-verification", - "dimension_name": "ConfigReloadFunc Type and Usage Verification", - "evidence": "Line 1552: `configHandlers := handlers.NewConfigStorageHandlers(s.storage, s.configReloadFn())`\nLine 1576: `configHandlers := handlers.NewConfigStorageHandlers(s.storage, s.configReloadFn())`\nLine 435: `func (s *AgentFieldServer) configReloadFn() handlers.ConfigReloadFunc`", - "file_path": "control-plane/internal/server/server.go", - "id": "f_018", - "line_end": 1576, - "line_start": 1552, - "score": 0.445, - "severity": "suggestion", - "suggestion": null, - "tags": [ - "type-check", - "verification" - ], - "title": "NewConfigStorageHandlers receives correct function type at all call sites" - }, - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "The `ReloadConfig` handler correctly checks for nil `reloadFn` at line 115 and returns HTTP 503 with a descriptive error message when config reload is not available (AGENTFIELD_CONFIG_SOURCE != db). 
This prevents nil pointer dereference.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "config-reload-func-verification", - "dimension_name": "ConfigReloadFunc Type and Usage Verification", - "evidence": "Line 115-119: `if h.reloadFn == nil { c.JSON(http.StatusServiceUnavailable, gin.H{\"error\": \"config reload not available...\"}) }`", - "file_path": "control-plane/internal/handlers/config_storage.go", - "id": "f_019", - "line_end": 129, - "line_start": 114, - "score": 0.445, - "severity": "suggestion", - "suggestion": null, - "tags": [ - "nil-safety", - "verification" - ], - "title": "Nil reloadFn is handled correctly in ReloadConfig handler" - }, - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "GetConfig checks for 'no rows' condition by comparing err.Error() to a string literal 'sql: no rows in result set' instead of using errors.Is(err, sql.ErrNoRows). This is fragile because the error message string could change in future Go versions or with different database drivers. The standard approach throughout Go codebases is to use errors.Is() for error comparison.", - "confidence": 0.9, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "storage-provider-interface-extension", - "dimension_name": "StorageProvider Interface Extension for Config Storage", - "evidence": "Step 1: GetConfig at local.go:5186 checks `if err.Error() == \"sql: no rows in result set\"`. Step 2: The standard pattern in Go is `if errors.Is(err, sql.ErrNoRows)` as seen in GetWorkflowRun at local.go:300. 
Step 3: String comparison is fragile - the error message format could change or be driver-specific.", - "file_path": "control-plane/internal/storage/local.go", - "id": "f_006", - "line_end": 5192, - "line_start": 5163, - "score": 0.421, - "severity": "suggestion", - "suggestion": "Replace the string comparison with standard error checking:\n```go\nif errors.Is(err, sql.ErrNoRows) {\n return nil, nil\n}\n```\nThis requires importing `errors` package (which is already imported in the file).", - "tags": [ - "error-handling", - "best-practice", - "robustness" - ], - "title": "GetConfig uses string comparison for sql.ErrNoRows instead of errors.Is" - }, - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "The `API.CORS` merge at lines 95-97 only checks `AllowedOrigins` and does blanket assignment:\n\n```go\nif len(dbCfg.API.CORS.AllowedOrigins) > 0 {\n target.API.CORS = dbCfg.API.CORS\n}\n```\n\n**Missing fields** from CORSConfig (config.go:198-204):\n- `AllowedMethods`\n- `AllowedHeaders`\n- `ExposedHeaders`\n- `AllowCredentials`\n\nUsers cannot configure these CORS settings from DB config. 
Additionally, blanket assignment causes zero-value overwrite issues for unspecified fields.", - "confidence": 0.85, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "merge-logic-completeness", - "dimension_name": "Merge Logic Completeness and Correctness", - "evidence": "Step 1: CORSConfig struct at config.go:198-204 has 5 fields\nStep 2: mergeDBConfig lines 95-97 only checks AllowedOrigins\nStep 3: User stores DB config with AllowedMethods=[\"POST\", \"GET\"] but no AllowedOrigins\nStep 4: Condition len(AllowedOrigins) > 0 evaluates to false\nStep 5: AllowedMethods is ignored, CORS remains with default methods", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_013", - "line_end": 97, - "line_start": 95, - "score": 0.398, - "severity": "suggestion", - "suggestion": "Add field-by-field merge for all CORS fields:\n```go\nif len(dbCfg.API.CORS.AllowedOrigins) > 0 {\n target.API.CORS.AllowedOrigins = dbCfg.API.CORS.AllowedOrigins\n}\nif len(dbCfg.API.CORS.AllowedMethods) > 0 {\n target.API.CORS.AllowedMethods = dbCfg.API.CORS.AllowedMethods\n}\n// etc for AllowedHeaders, ExposedHeaders\n// For AllowCredentials (bool), use presence of other fields or pointer type\n```", - "tags": [ - "config", - "merge", - "missing-fields", - "cors" - ], - "title": "CORSConfig Merge Only Handles AllowedOrigins, Missing Other CORS Fields" - } - ], - "metadata": { - "agent_invocations": 25, - "anatomy": { - "blast_radius": [], - "clusters": [ - { - "description": "", - "files": [ - "control-plane/config/agentfield.yaml" - ], - "id": "cluster_0", - "name": "control-plane/config", - "primary_language": "yaml" - }, - { - "description": "", - "files": [ - "control-plane/internal/handlers/config_storage.go" - ], - "id": "cluster_1", - "name": "control-plane/internal/handlers", - "primary_language": "go" - }, - { - "description": "", - "files": [ - "control-plane/internal/server/config_db.go", - "control-plane/internal/server/server.go", - 
"control-plane/internal/server/server_routes_test.go" - ], - "id": "cluster_2", - "name": "control-plane/internal/server", - "primary_language": "go" - }, - { - "description": "", - "files": [ - "control-plane/internal/storage/local.go", - "control-plane/internal/storage/migrations.go", - "control-plane/internal/storage/models.go", - "control-plane/internal/storage/storage.go" - ], - "id": "cluster_3", - "name": "control-plane/internal/storage", - "primary_language": "go" - }, - { - "description": "", - "files": [ - "control-plane/migrations/028_create_config_storage.sql" - ], - "id": "cluster_4", - "name": "control-plane/migrations", - "primary_language": "sql" - } - ], - "context_notes": "This PR enables SaaS-style remote configuration management where a connector can push config to the control plane. The bootstrap safety mechanism (protecting storage section) is correctly implemented, but the security model assumes API keys are sufficient protection for config modification.", - "dependency_graph": {}, - "files": [ - { - "hunks": [ - { - "content": " enabled: true\n observability_config:\n enabled: false\n+ config_management:\n+ enabled: true\n+ read_only: false", - "header": "@@ -146,3 +146,6 @@ features:", - "new_count": 6, - "new_start": 146, - "old_count": 3, - "old_start": 146 - } - ], - "language": "yaml", - "lines_added": 3, - "lines_removed": 0, - "path": "control-plane/config/agentfield.yaml", - "status": "modified" - }, - { - "hunks": [ - { - "content": "+package handlers\n+\n+import (\n+\t\"io\"\n+\t\"net/http\"\n+\n+\t\"github.com/Agent-Field/agentfield/control-plane/internal/storage\"\n+\t\"github.com/gin-gonic/gin\"\n+)\n+\n+// maxConfigBodySize is the maximum allowed size for a config body (1 MB).\n+// Prevents DoS via unbounded request body reads.\n+const maxConfigBodySize = 1 << 20 // 1 MB\n+\n+// ConfigReloadFunc is called to reload configuration from the database.\n+type ConfigReloadFunc func() error\n+\n+// ConfigStorageHandlers provides HTTP 
handlers for database-backed configuration.\n+type ConfigStorageHandlers struct {\n+\tstorage storage.StorageProvider\n+\treloadFn ConfigReloadFunc\n+}\n+\n+// NewConfigStorageHandlers creates a new ConfigStorageHandlers instance.\n+func NewConfigStorageHandlers(store storage.StorageProvider, reloadFn ConfigReloadFunc) *ConfigStorageHandlers {\n+\treturn &ConfigStorageHandlers{storage: store, reloadFn: reloadFn}\n+}\n+\n+// RegisterRoutes registers config storage routes on the given router group.\n+func (h *ConfigStorageHandlers) RegisterRoutes(group *gin.RouterGroup) {\n+\tgroup.GET(\"/configs\", h.ListConfigs)\n+\tgroup.GET(\"/configs/:key\", h.GetConfig)\n+\tgroup.PUT(\"/configs/:key\", h.SetConfig)\n+\tgroup.DELETE(\"/configs/:key\", h.DeleteConfig)\n+\tgroup.POST(\"/configs/reload\", h.ReloadConfig)\n+}\n+\n+// ListConfigs returns all stored configuration entries.\n+func (h *ConfigStorageHandlers) ListConfigs(c *gin.Context) {\n+\tentries, err := h.storage.ListConfigs(c.Request.Context())\n+\tif err != nil {\n+\t\tc.JSON(http.StatusInternalServerError, gin.H{\"error\": err.Error()})\n+\t\treturn\n+\t}\n+\tif entries == nil {\n+\t\tentries = []*storage.ConfigEntry{}\n+\t}\n+\tc.JSON(http.StatusOK, gin.H{\n+\t\t\"configs\": entries,\n+\t\t\"total\": len(entries),\n+\t})\n+}\n+\n+// GetConfig returns a specific configuration entry by key.\n+func (h *ConfigStorageHandlers) GetConfig(c *gin.Context) {\n+\tkey := c.Param(\"key\")\n+\tentry, err := h.storage.GetConfig(c.Request.Context(), key)\n+\tif err != nil {\n+\t\tc.JSON(http.StatusInternalServerError, gin.H{\"error\": err.Error()})\n+\t\treturn\n+\t}\n+\tif entry == nil {\n+\t\tc.JSON(http.StatusNotFound, gin.H{\"error\": \"config not found\", \"key\": key})\n+\t\treturn\n+\t}\n+\tc.JSON(http.StatusOK, entry)\n+}\n+\n+// SetConfig creates or updates a configuration entry.\n+// Accepts raw YAML/text body as the config value.\n+func (h *ConfigStorageHandlers) SetConfig(c *gin.Context) {\n+\tkey := 
c.Param(\"key\")\n+\n+\tbody, err := io.ReadAll(io.LimitReader(c.Request.Body, maxConfigBodySize+1))\n+\tif err != nil {\n+\t\tc.JSON(http.StatusBadRequest, gin.H{\"error\": \"failed to read request body\"})\n+\t\treturn\n+\t}\n+\tif len(body) == 0 {\n+\t\tc.JSON(http.StatusBadRequest, gin.H{\"error\": \"request body is empty\"})\n+\t\treturn\n+\t}\n+\tif len(body) > maxConfigBodySize {\n+\t\tc.JSON(http.StatusRequestEntityTooLarge, gin.H{\n+\t\t\t\"error\": \"config body exceeds maximum size\",\n+\t\t\t\"max\": maxConfigBodySize,\n+\t\t})\n+\t\treturn\n+\t}\n+\n+\tupdatedBy := c.GetHeader(\"X-Updated-By\")\n+\tif updatedBy == \"\" {\n+\t\tupdatedBy = \"api\"\n+\t}\n+\n+\tif err := h.storage.SetConfig(c.Request.Context(), key, string(body), updatedBy); err != nil {\n+\t\tc.JSON(http.StatusInternalServerError, gin.H{\"error\": err.Error()})\n+\t\treturn\n+\t}\n+\n+\t// Return the saved entry\n+\tentry, err := h.storage.GetConfig(c.Request.Context(), key)\n+\tif err != nil {\n+\t\tc.JSON(http.StatusInternalServerError, gin.H{\"error\": err.Error()})\n+\t\treturn\n+\t}\n+\n+\tc.JSON(http.StatusOK, gin.H{\n+\t\t\"message\": \"config saved\",\n+\t\t\"config\": entry,\n+\t})\n+}\n+\n+// DeleteConfig removes a configuration entry by key.\n+func (h *ConfigStorageHandlers) DeleteConfig(c *gin.Context) {\n+\tkey := c.Param(\"key\")\n+\tif err := h.storage.DeleteConfig(c.Request.Context(), key); err != nil {\n+\t\tc.JSON(http.StatusNotFound, gin.H{\"error\": err.Error()})\n+\t\treturn\n+\t}\n+\tc.JSON(http.StatusOK, gin.H{\"message\": \"config deleted\", \"key\": key})\n+}\n+\n+// ReloadConfig triggers a hot-reload of configuration from the database.\n+func (h *ConfigStorageHandlers) ReloadConfig(c *gin.Context) {\n+\tif h.reloadFn == nil {\n+\t\tc.JSON(http.StatusServiceUnavailable, gin.H{\n+\t\t\t\"error\": \"config reload not available (AGENTFIELD_CONFIG_SOURCE != db)\",\n+\t\t})\n+\t\treturn\n+\t}\n+\tif err := h.reloadFn(); err != nil 
{\n+\t\tc.JSON(http.StatusInternalServerError, gin.H{\n+\t\t\t\"error\": \"config reload failed\",\n+\t\t\t\"details\": err.Error(),\n+\t\t})\n+\t\treturn\n+\t}\n+\tc.JSON(http.StatusOK, gin.H{\"message\": \"config reloaded from database\"})\n+}", - "header": "@@ -0,0 +1,140 @@", - "new_count": 140, - "new_start": 1, - "old_count": 0, - "old_start": 0 - } - ], - "language": "go", - "lines_added": 140, - "lines_removed": 0, - "path": "control-plane/internal/handlers/config_storage.go", - "status": "added" - }, - { - "hunks": [ - { - "content": "+package server\n+\n+import (\n+\t\"context\"\n+\t\"fmt\"\n+\t\"time\"\n+\n+\t\"github.com/Agent-Field/agentfield/control-plane/internal/config\"\n+\t\"github.com/Agent-Field/agentfield/control-plane/internal/storage\"\n+\t\"gopkg.in/yaml.v3\"\n+)\n+\n+const dbConfigKey = \"agentfield.yaml\"\n+\n+// overlayDBConfig loads config from the database and merges it into the\n+// existing config. The storage section is preserved from the original config\n+// to avoid the bootstrap problem (DB connection settings can't come from DB).\n+// Precedence: env vars > DB config > file config > defaults.\n+func overlayDBConfig(cfg *config.Config, store storage.StorageProvider) error {\n+\tctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)\n+\tdefer cancel()\n+\n+\tentry, err := store.GetConfig(ctx, dbConfigKey)\n+\tif err != nil {\n+\t\treturn fmt.Errorf(\"failed to read config from database: %w\", err)\n+\t}\n+\tif entry == nil {\n+\t\tfmt.Println(\"[config] No database config found (key: agentfield.yaml), using file/env config only.\")\n+\t\treturn nil\n+\t}\n+\n+\t// Preserve the storage config \u2014 it must always come from file/env (bootstrap)\n+\tsavedStorage := cfg.Storage\n+\n+\t// Parse the DB-stored YAML into a config struct\n+\tvar dbCfg config.Config\n+\tif err := yaml.Unmarshal([]byte(entry.Value), &dbCfg); err != nil {\n+\t\treturn fmt.Errorf(\"failed to parse database config YAML: %w\", 
err)\n+\t}\n+\n+\t// Overlay non-zero DB values onto the existing config\n+\tmergeDBConfig(cfg, &dbCfg)\n+\n+\t// Restore storage config (never overridden from DB)\n+\tcfg.Storage = savedStorage\n+\n+\tfmt.Printf(\"[config] Loaded config from database (key: %s, version: %d, updated: %s)\\n\",\n+\t\tentry.Key, entry.Version, entry.UpdatedAt.Format(time.RFC3339))\n+\treturn nil\n+}\n+\n+// mergeDBConfig selectively merges DB config values into the target config.\n+// Only non-zero/non-empty values from the DB config are applied.\n+func mergeDBConfig(target, dbCfg *config.Config) {\n+\t// AgentField settings\n+\tif dbCfg.AgentField.Port != 0 {\n+\t\ttarget.AgentField.Port = dbCfg.AgentField.Port\n+\t}\n+\tif dbCfg.AgentField.NodeHealth.CheckInterval != 0 {\n+\t\ttarget.AgentField.NodeHealth = dbCfg.AgentField.NodeHealth\n+\t}\n+\t// Merge execution cleanup field-by-field to avoid zeroing out unset fields\n+\tif dbCfg.AgentField.ExecutionCleanup.RetentionPeriod != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.RetentionPeriod = dbCfg.AgentField.ExecutionCleanup.RetentionPeriod\n+\t}\n+\tif dbCfg.AgentField.ExecutionCleanup.CleanupInterval != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.CleanupInterval = dbCfg.AgentField.ExecutionCleanup.CleanupInterval\n+\t}\n+\tif dbCfg.AgentField.ExecutionCleanup.BatchSize != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.BatchSize = dbCfg.AgentField.ExecutionCleanup.BatchSize\n+\t}\n+\tif dbCfg.AgentField.ExecutionCleanup.PreserveRecentDuration != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.PreserveRecentDuration = dbCfg.AgentField.ExecutionCleanup.PreserveRecentDuration\n+\t}\n+\tif dbCfg.AgentField.ExecutionCleanup.StaleExecutionTimeout != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.StaleExecutionTimeout = dbCfg.AgentField.ExecutionCleanup.StaleExecutionTimeout\n+\t}\n+\t// Enabled is a bool \u2014 only override if cleanup config is present in DB at all\n+\tif dbCfg.AgentField.ExecutionCleanup.RetentionPeriod != 0 || 
dbCfg.AgentField.ExecutionCleanup.CleanupInterval != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.Enabled = dbCfg.AgentField.ExecutionCleanup.Enabled\n+\t}\n+\tif dbCfg.AgentField.Approval.WebhookSecret != \"\" || dbCfg.AgentField.Approval.DefaultExpiryHours != 0 {\n+\t\ttarget.AgentField.Approval = dbCfg.AgentField.Approval\n+\t}\n+\n+\t// Features\n+\tif dbCfg.Features.DID.Method != \"\" {\n+\t\ttarget.Features.DID = dbCfg.Features.DID\n+\t}\n+\t// NOTE: Connector config (token, capabilities) is intentionally NOT merged\n+\t// from DB. These are security-sensitive and must come from file/env config,\n+\t// similar to how storage config is protected from the bootstrap problem.\n+\n+\t// API settings (but never override API key from DB for security)\n+\tif len(dbCfg.API.CORS.AllowedOrigins) > 0 {\n+\t\ttarget.API.CORS = dbCfg.API.CORS\n+\t}\n+\n+\t// UI settings\n+\tif dbCfg.UI.Mode != \"\" {\n+\t\ttarget.UI = dbCfg.UI\n+\t}\n+}", - "header": "@@ -0,0 +1,103 @@", - "new_count": 103, - "new_start": 1, - "old_count": 0, - "old_start": 0 - } - ], - "language": "go", - "lines_added": 103, - "lines_removed": 0, - "path": "control-plane/internal/server/config_db.go", - "status": "added" - }, - { - "hunks": [ - { - "content": " \t\"path/filepath\"\n \t\"strconv\"\n \t\"strings\"\n+\t\"sync\"\n \t\"time\"\n \n \t\"github.com/Agent-Field/agentfield/control-plane/internal/config\"", - "header": "@@ -13,6 +13,7 @@ import (", - "new_count": 7, - "new_start": 13, - "old_count": 6, - "old_start": 13 - }, - { - "content": " \tadminGRPCPort int\n \twebhookDispatcher services.WebhookDispatcher\n \tobservabilityForwarder services.ObservabilityForwarder\n+\tconfigMu sync.RWMutex\n }\n \n // NewAgentFieldServer creates a new instance of the AgentFieldServer.", - "header": "@@ -79,6 +80,7 @@ type AgentFieldServer struct {", - "new_count": 7, - "new_start": 80, - "old_count": 6, - "old_start": 79 - }, - { - "content": " \t\treturn nil, err\n \t}\n \n+\t// Overlay database-stored config 
if AGENTFIELD_CONFIG_SOURCE=db\n+\tif src := os.Getenv(\"AGENTFIELD_CONFIG_SOURCE\"); src == \"db\" {\n+\t\tif err := overlayDBConfig(cfg, storageProvider); err != nil {\n+\t\t\tfmt.Printf(\"Warning: failed to load config from database: %v\\n\", err)\n+\t\t}\n+\t}\n+\n \tRouter := gin.Default()\n \n \t// Sync installed.yaml to database for package visibility", - "header": "@@ -104,6 +106,13 @@ func NewAgentFieldServer(cfg *config.Config) (*AgentFieldServer, error) {", - "new_count": 13, - "new_start": 106, - "old_count": 6, - "old_start": 104 - }, - { - "content": " \t}, nil\n }\n \n+// configReloadFn returns a function that reloads config from the database,\n+// or nil if AGENTFIELD_CONFIG_SOURCE is not set to \"db\".\n+// The returned function acquires configMu to prevent data races with\n+// concurrent readers of s.config.\n+func (s *AgentFieldServer) configReloadFn() handlers.ConfigReloadFunc {\n+\tif src := os.Getenv(\"AGENTFIELD_CONFIG_SOURCE\"); src != \"db\" {\n+\t\treturn nil\n+\t}\n+\treturn func() error {\n+\t\ts.configMu.Lock()\n+\t\tdefer s.configMu.Unlock()\n+\t\treturn overlayDBConfig(s.config, s.storage)\n+\t}\n+}\n+\n // Start initializes and starts the AgentFieldServer.\n func (s *AgentFieldServer) Start() error {\n \t// Setup routes", - "header": "@@ -423,6 +432,21 @@ func NewAgentFieldServer(cfg *config.Config) (*AgentFieldServer, error) {", - "new_count": 21, - "new_start": 432, - "old_count": 6, - "old_start": 423 - }, - { - "content": " \t\t\tlogger.Logger.Info().Msg(\"\ud83d\udccb Authorization admin routes registered\")\n \t\t}\n \n+\t\t// Config storage routes (admin-authenticated)\n+\t\t{\n+\t\t\tconfigHandlers := handlers.NewConfigStorageHandlers(s.storage, s.configReloadFn())\n+\t\t\tconfigHandlers.RegisterRoutes(agentAPI)\n+\t\t\tlogger.Logger.Info().Msg(\"Config storage routes registered\")\n+\t\t}\n+\n \t\t// Connector routes (authenticated with separate connector token)\n \t\tif s.config.Features.Connector.Enabled && 
s.config.Features.Connector.Token != \"\" {\n \t\t\tconnectorGroup := agentAPI.Group(\"/connector\")", - "header": "@@ -1529,6 +1553,13 @@ func (s *AgentFieldServer) setupRoutes() {", - "new_count": 13, - "new_start": 1553, - "old_count": 6, - "old_start": 1529 - }, - { - "content": " \t\t\t)\n \t\t\tconnectorHandlers.RegisterRoutes(connectorGroup)\n \n+\t\t\t// Config management routes for connector\n+\t\t\tconfigGroup := connectorGroup.Group(\"\")\n+\t\t\tconfigGroup.Use(middleware.ConnectorCapabilityCheck(\"config_management\", s.config.Features.Connector.Capabilities))\n+\t\t\t{\n+\t\t\t\tconfigHandlers := handlers.NewConfigStorageHandlers(s.storage, s.configReloadFn())\n+\t\t\t\tconfigHandlers.RegisterRoutes(configGroup)\n+\t\t\t}\n+\n \t\t\tlogger.Logger.Info().Msg(\"\ud83d\udd0c Connector routes registered\")\n \t\t}\n \t}", - "header": "@@ -1544,6 +1575,14 @@ func (s *AgentFieldServer) setupRoutes() {", - "new_count": 14, - "new_start": 1575, - "old_count": 6, - "old_start": 1544 - } - ], - "language": "go", - "lines_added": 39, - "lines_removed": 0, - "path": "control-plane/internal/server/server.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": " }\n \n // Configuration\n-func (s *stubStorage) SetConfig(ctx context.Context, key string, value interface{}) error { return nil }\n-func (s *stubStorage) GetConfig(ctx context.Context, key string) (interface{}, error) {\n+func (s *stubStorage) SetConfig(ctx context.Context, key string, value string, updatedBy string) error {\n+\treturn nil\n+}\n+func (s *stubStorage) GetConfig(ctx context.Context, key string) (*storage.ConfigEntry, error) {\n+\treturn nil, nil\n+}\n+func (s *stubStorage) ListConfigs(ctx context.Context) ([]*storage.ConfigEntry, error) {\n \treturn nil, nil\n }\n+func (s *stubStorage) DeleteConfig(ctx context.Context, key string) error { return nil }\n \n // Reasoner Performance and History\n func (s *stubStorage) GetReasonerPerformanceMetrics(ctx context.Context, reasonerID 
string) (*types.ReasonerPerformanceMetrics, error) {", - "header": "@@ -230,10 +230,16 @@ func (s *stubStorage) ListAgentGroups(ctx context.Context, teamID string) ([]typ", - "new_count": 16, - "new_start": 230, - "old_count": 10, - "old_start": 230 - } - ], - "language": "go", - "lines_added": 8, - "lines_removed": 2, - "path": "control-plane/internal/server/server_routes_test.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": " \treturn nil\n }\n \n-// SetConfig stores a configuration key-value pair in SQLite.\n-func (ls *LocalStorage) SetConfig(ctx context.Context, key string, value interface{}) error {\n-\t// Fast-fail if context is already cancelled\n+// SetConfig upserts a configuration entry in the database.\n+// On conflict (duplicate key), it increments the version and updates the value.\n+func (ls *LocalStorage) SetConfig(ctx context.Context, key string, value string, updatedBy string) error {\n \tif err := ctx.Err(); err != nil {\n \t\treturn err\n \t}\n \n-\t// TODO: Implement configuration storage in SQLite\n-\treturn fmt.Errorf(\"SetConfig not yet implemented for LocalStorage\")\n+\tdb := ls.requireSQLDB()\n+\tnow := time.Now().UTC()\n+\n+\tif ls.mode == \"postgres\" {\n+\t\t_, err := db.ExecContext(ctx, `\n+\t\t\tINSERT INTO config_storage (key, value, version, created_by, updated_by, created_at, updated_at)\n+\t\t\tVALUES ($1, $2, 1, $3, $3, $4, $4)\n+\t\t\tON CONFLICT (key) DO UPDATE SET\n+\t\t\t\tvalue = EXCLUDED.value,\n+\t\t\t\tversion = config_storage.version + 1,\n+\t\t\t\tupdated_by = EXCLUDED.updated_by,\n+\t\t\t\tupdated_at = EXCLUDED.updated_at`,\n+\t\t\tkey, value, updatedBy, now)\n+\t\treturn err\n+\t}\n+\n+\t// SQLite\n+\t_, err := db.ExecContext(ctx, `\n+\t\tINSERT INTO config_storage (key, value, version, created_by, updated_by, created_at, updated_at)\n+\t\tVALUES (?, ?, 1, ?, ?, ?, ?)\n+\t\tON CONFLICT (key) DO UPDATE SET\n+\t\t\tvalue = excluded.value,\n+\t\t\tversion = config_storage.version + 1,\n+\t\t\tupdated_by 
= excluded.updated_by,\n+\t\t\tupdated_at = excluded.updated_at`,\n+\t\tkey, value, updatedBy, updatedBy, now, now)\n+\treturn err\n }\n \n-// GetConfig retrieves a configuration value from SQLite by key.\n-func (ls *LocalStorage) GetConfig(ctx context.Context, key string) (interface{}, error) {\n-\t// Fast-fail if context is already cancelled\n+// GetConfig retrieves a configuration entry by key.\n+func (ls *LocalStorage) GetConfig(ctx context.Context, key string) (*ConfigEntry, error) {\n+\tif err := ctx.Err(); err != nil {\n+\t\treturn nil, err\n+\t}\n+\n+\tdb := ls.requireSQLDB()\n+\tvar entry ConfigEntry\n+\n+\tvar placeholder string\n+\tif ls.mode == \"postgres\" {\n+\t\tplaceholder = \"$1\"\n+\t} else {\n+\t\tplaceholder = \"?\"\n+\t}\n+\n+\trow := db.QueryRowContext(ctx,\n+\t\tfmt.Sprintf(`SELECT key, value, version, COALESCE(created_by, ''), COALESCE(updated_by, ''), created_at, updated_at\n+\t\tFROM config_storage WHERE key = %s`, placeholder), key)\n+\n+\terr := row.Scan(&entry.Key, &entry.Value, &entry.Version,\n+\t\t&entry.CreatedBy, &entry.UpdatedBy, &entry.CreatedAt, &entry.UpdatedAt)\n+\tif err != nil {\n+\t\tif errors.Is(err, sql.ErrNoRows) {\n+\t\t\treturn nil, nil\n+\t\t}\n+\t\treturn nil, fmt.Errorf(\"failed to get config %q: %w\", key, err)\n+\t}\n+\treturn &entry, nil\n+}\n+\n+// ListConfigs returns all stored configuration entries.\n+func (ls *LocalStorage) ListConfigs(ctx context.Context) ([]*ConfigEntry, error) {\n \tif err := ctx.Err(); err != nil {\n \t\treturn nil, err\n \t}\n \n-\t// TODO: Implement configuration retrieval from SQLite\n-\treturn nil, fmt.Errorf(\"GetConfig not yet implemented for LocalStorage\")\n+\tdb := ls.requireSQLDB()\n+\trows, err := db.QueryContext(ctx,\n+\t\t`SELECT key, value, version, COALESCE(created_by, ''), COALESCE(updated_by, ''), created_at, updated_at\n+\t\tFROM config_storage ORDER BY key`)\n+\tif err != nil {\n+\t\treturn nil, fmt.Errorf(\"failed to list configs: %w\", err)\n+\t}\n+\tdefer 
rows.Close()\n+\n+\tvar entries []*ConfigEntry\n+\tfor rows.Next() {\n+\t\tvar entry ConfigEntry\n+\t\tif err := rows.Scan(&entry.Key, &entry.Value, &entry.Version,\n+\t\t\t&entry.CreatedBy, &entry.UpdatedBy, &entry.CreatedAt, &entry.UpdatedAt); err != nil {\n+\t\t\treturn nil, fmt.Errorf(\"failed to scan config row: %w\", err)\n+\t\t}\n+\t\tentries = append(entries, &entry)\n+\t}\n+\treturn entries, rows.Err()\n+}\n+\n+// DeleteConfig removes a configuration entry by key.\n+func (ls *LocalStorage) DeleteConfig(ctx context.Context, key string) error {\n+\tif err := ctx.Err(); err != nil {\n+\t\treturn err\n+\t}\n+\n+\tdb := ls.requireSQLDB()\n+\tvar placeholder string\n+\tif ls.mode == \"postgres\" {\n+\t\tplaceholder = \"$1\"\n+\t} else {\n+\t\tplaceholder = \"?\"\n+\t}\n+\n+\tresult, err := db.ExecContext(ctx,\n+\t\tfmt.Sprintf(`DELETE FROM config_storage WHERE key = %s`, placeholder), key)\n+\tif err != nil {\n+\t\treturn fmt.Errorf(\"failed to delete config %q: %w\", key, err)\n+\t}\n+\trows, _ := result.RowsAffected()\n+\tif rows == 0 {\n+\t\treturn fmt.Errorf(\"config %q not found\", key)\n+\t}\n+\treturn nil\n }\n \n // SubscribeToMemoryChanges implements the StorageProvider SubscribeToMemoryChanges method using local pub/sub.", - "header": "@@ -5124,26 +5124,124 @@ func (ls *LocalStorage) UpdateAgentTrafficWeight(ctx context.Context, id string,", - "new_count": 124, - "new_start": 5124, - "old_count": 26, - "old_start": 5124 - } - ], - "language": "go", - "lines_added": 108, - "lines_removed": 10, - "path": "control-plane/internal/storage/local.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": " \t\t&DIDDocumentModel{},\n \t\t&AccessPolicyModel{},\n \t\t&AgentTagVCModel{},\n+\t\t&ConfigStorageModel{},\n \t}\n \n \tif err := gormDB.WithContext(ctx).AutoMigrate(models...); err != nil {", - "header": "@@ -233,6 +233,7 @@ func (ls *LocalStorage) autoMigrateSchema(ctx context.Context) error {", - "new_count": 7, - "new_start": 233, - 
"old_count": 6, - "old_start": 233 - } - ], - "language": "go", - "lines_added": 1, - "lines_removed": 0, - "path": "control-plane/internal/storage/migrations.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": " }\n \n func (AgentTagVCModel) TableName() string { return \"agent_tag_vcs\" }\n+\n+// ConfigStorageModel stores configuration files in the database.\n+// Each record represents a named configuration (e.g. \"agentfield.yaml\")\n+// with versioning for audit trail.\n+type ConfigStorageModel struct {\n+\tID int64 `gorm:\"column:id;primaryKey;autoIncrement\"`\n+\tKey string `gorm:\"column:key;not null;uniqueIndex\"`\n+\tValue string `gorm:\"column:value;type:text;not null\"`\n+\tVersion int `gorm:\"column:version;not null;default:1\"`\n+\tCreatedBy *string `gorm:\"column:created_by\"`\n+\tUpdatedBy *string `gorm:\"column:updated_by\"`\n+\tCreatedAt time.Time `gorm:\"column:created_at;autoCreateTime\"`\n+\tUpdatedAt time.Time `gorm:\"column:updated_at;autoUpdateTime\"`\n+}\n+\n+func (ConfigStorageModel) TableName() string { return \"config_storage\" }", - "header": "@@ -472,3 +472,19 @@ type AgentTagVCModel struct {", - "new_count": 19, - "new_start": 472, - "old_count": 3, - "old_start": 472 - } - ], - "language": "go", - "lines_added": 16, - "lines_removed": 0, - "path": "control-plane/internal/storage/models.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": " \tActiveExecutions int\n }\n \n+// ConfigEntry represents a database-stored configuration file.\n+type ConfigEntry struct {\n+\tKey string `json:\"key\"`\n+\tValue string `json:\"value\"`\n+\tVersion int `json:\"version\"`\n+\tCreatedBy string `json:\"created_by,omitempty\"`\n+\tUpdatedBy string `json:\"updated_by,omitempty\"`\n+\tCreatedAt time.Time `json:\"created_at\"`\n+\tUpdatedAt time.Time `json:\"updated_at\"`\n+}\n+\n // StorageProvider is the interface for the primary data storage backend.\n type StorageProvider interface {\n \t// Lifecycle", - "header": "@@ 
-26,6 +26,17 @@ type RunSummaryAggregation struct {", - "new_count": 17, - "new_start": 26, - "old_count": 6, - "old_start": 26 - }, - { - "content": " \tUpdateAgentVersion(ctx context.Context, id string, version string) error\n \tUpdateAgentTrafficWeight(ctx context.Context, id string, version string, weight int) error\n \n-\t// Configuration\n-\tSetConfig(ctx context.Context, key string, value interface{}) error\n-\tGetConfig(ctx context.Context, key string) (interface{}, error)\n+\t// Configuration Storage (database-backed config files)\n+\tSetConfig(ctx context.Context, key string, value string, updatedBy string) error\n+\tGetConfig(ctx context.Context, key string) (*ConfigEntry, error)\n+\tListConfigs(ctx context.Context) ([]*ConfigEntry, error)\n+\tDeleteConfig(ctx context.Context, key string) error\n \n \t// Reasoner Performance and History\n \tGetReasonerPerformanceMetrics(ctx context.Context, reasonerID string) (*types.ReasonerPerformanceMetrics, error)", - "header": "@@ -118,9 +129,11 @@ type StorageProvider interface {", - "new_count": 11, - "new_start": 129, - "old_count": 9, - "old_start": 118 - } - ], - "language": "go", - "lines_added": 16, - "lines_removed": 3, - "path": "control-plane/internal/storage/storage.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": "+-- +goose Up\n+-- +goose StatementBegin\n+CREATE TABLE IF NOT EXISTS config_storage (\n+ id BIGSERIAL PRIMARY KEY,\n+ key TEXT NOT NULL UNIQUE,\n+ value TEXT NOT NULL,\n+ version INTEGER NOT NULL DEFAULT 1,\n+ created_by TEXT,\n+ updated_by TEXT,\n+ created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),\n+ updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()\n+);\n+\n+CREATE INDEX IF NOT EXISTS idx_config_storage_key ON config_storage(key);\n+-- +goose StatementEnd\n+\n+-- +goose Down\n+-- +goose StatementBegin\n+DROP INDEX IF EXISTS idx_config_storage_key;\n+DROP TABLE IF EXISTS config_storage;\n+-- +goose StatementEnd", - "header": "@@ -0,0 +1,21 @@", - 
"new_count": 21, - "new_start": 1, - "old_count": 0, - "old_start": 0 - } - ], - "language": "sql", - "lines_added": 21, - "lines_removed": 0, - "path": "control-plane/migrations/028_create_config_storage.sql", - "status": "added" - } - ], - "intent_gaps": [ - "**No Config Schema Validation**: The PR stores raw YAML in the database but never validates it against the config schema. Invalid configs can be stored and will cause server failures on restart (config_storage.go:67-101).", - "**No Audit Logging**: While version numbers track changes, there's no audit log of who changed what config when - only the `updated_by` field is captured (local.go:5137-5160).", - "**No Rollback Mechanism**: The reload endpoint loads current DB config but there's no API to view or restore previous versions of a config (config_storage.go:114-128).", - "**Silent Security Override**: The merge logic (config_db.go:54-103) silently overrides critical security settings like `Approval.WebhookSecret` and `API.CORS` from DB without explicit opt-in or warnings.", - "**No Test Coverage for Config Loading**: The test file `server_routes_test.go` has stub implementations for config methods (lines 232-242) but no tests for the actual DB config loading or overlay behavior." - ], - "pr_narrative": "This PR implements database-backed configuration storage for the AgentField control plane. The feature adds a new `config_storage` table to store YAML configuration files in the database, enables dynamic config loading at startup via `AGENTFIELD_CONFIG_SOURCE=db` environment variable, and exposes CRUD API endpoints for remote management.\n\n**Key Changes:**\n\n1. **Database Schema (migration 028)**: Creates `config_storage` table with `key` (unique), `value` (text), `version`, audit fields (`created_by`, `updated_by`, `created_at`, `updated_at`). GORM model added at `models.go:476-490`.\n\n2. 
**Storage Interface**: Added four methods to `StorageProvider` interface (`storage.go:132-136`): `SetConfig`, `GetConfig`, `ListConfigs`, `DeleteConfig`. Implementation in `local.go:5129-5245` supports both SQLite and PostgreSQL with dialect-specific SQL.\n\n3. **Startup Config Loading**: New `config_db.go` file with `overlayDBConfig()` function that loads config from DB and merges it with file/env config using precedence: env vars > DB config > file config > defaults. Called in `server.go:107-112` during server initialization when `AGENTFIELD_CONFIG_SOURCE=db`. Storage section is explicitly preserved from file config to avoid bootstrap circularity.\n\n4. **API Endpoints**: New `config_storage.go` handlers provide: `GET /api/v1/configs` (list), `GET /api/v1/configs/:key` (get), `PUT /api/v1/configs/:key` (set), `DELETE /api/v1/configs/:key` (delete), `POST /api/v1/configs/reload` (hot reload). Connector-scoped routes at `/api/v1/connector/configs/*` gated by `config_management` capability check (`server.go:1573-1578`).\n\n5. **Default Config**: Added `config_management` capability to default `agentfield.yaml:149-151`.\n\n**Old vs New Mechanism**: Previously, configuration was loaded only from YAML files and environment variables at startup. The new mechanism allows storing config in the database and dynamically overlaying it at startup, with hot-reload capability via API.", - "risk_surfaces": [ - "**Bootstrap/Startup Risk (server.go:107-112, config_db.go:19-50)**: Database config loading happens after storage init but before other services. If DB config contains invalid YAML, `yaml.Unmarshal` will fail and the server will crash on startup. 
The storage section is protected, but other security-critical settings (API keys, tokens) can be overridden from DB without validation.", - "**Authentication Bypass (server.go:1550-1555, config_storage.go:26-31)**: Config storage routes are registered under the main `/api/v1` group with only standard API key auth - no admin token requirement. Any caller with a valid API key can modify server configuration, potentially escalating privileges or disrupting service.", - "**Concurrent Config Access (config_db.go:19-50, server.go:435-442)**: The `overlayDBConfig` function mutates the shared config struct post-initialization. Background services (health monitor, cleanup service, webhook dispatcher) may have cached config values at startup. Changes to `ExecutionCleanup`, `Approval`, or `DID` settings via DB reload won't propagate to already-running services without restart.", - "**Connector Capability Escalation (server.go:1573-1578)**: Connector routes reuse the same `ConfigStorageHandlers` but with `ConnectorCapabilityCheck` middleware. If the capability check has logic bugs or the capability list is misconfigured, connectors could gain unauthorized write access to server configuration.", - "**SQL Injection Surface (local.go:5179-5181, 5235-5236)**: `GetConfig` and `DeleteConfig` use `fmt.Sprintf` to build queries with placeholder variables. While currently using parameterized placeholders (`$1`, `?`), future modifications could inadvertently introduce string interpolation of user input. The pattern `fmt.Sprintf(..., placeholder)` is risky.", - "**Version Concurrency (local.go:5137-5160)**: `SetConfig` uses UPSERT with `version = config_storage.version + 1` but has no optimistic locking or conflict detection. Concurrent updates from multiple API clients will result in last-write-wins without detecting overwrites.", - "**Missing Config Validation (config_storage.go:67-101)**: The `SetConfig` handler accepts raw YAML body without validation. 
Invalid YAML, malformed config structure, or missing required fields can be stored and will cause server startup failures on next restart with `AGENTFIELD_CONFIG_SOURCE=db`." - ], - "stats": { - "files_added": 3, - "files_modified": 7, - "files_removed": 0, - "files_renamed": 0, - "test_files_changed": 1, - "test_to_code_ratio": 0.1111111111111111, - "total_additions": 455, - "total_deletions": 15, - "total_files": 10 - }, - "unrelated_changes": [] - }, - "budget": { - "budget_exhausted": false, - "cost_breakdown": { - "adversary": 0, - "anatomy": 0, - "coverage": 0, - "cross_ref": 0, - "intake": 0, - "meta_selectors": 0, - "output": 0, - "review": 0, - "synthesis": 0 - }, - "max_cost_usd": 2, - "max_duration_seconds": 2400, - "total_cost_usd": 0 - }, - "intake": { - "ai_generated": 0.6666666666666666, - "areas_touched": [ - "database", - "api", - "tests", - "config" - ], - "complexity": ": ", - "languages": [ - "go", - "sql", - "yaml" - ], - "pr_summary": "## Summary\n- Add `config_storage` table (GORM model + Goose migration 028) for storing configuration files in the database\n- Implement `SetConfig`/`GetConfig`/`ListConfigs`/`DeleteConfig` on the `StorageProvider` interface (works on both SQLite and PostgreSQL)\n- Add `AGENTFIELD_CONFIG_SOURCE=db` environment variable to load config from the database at startup (overlays on top of file config, preserving storage section for bootstrap)\n- Add CRUD API endpoints at `GET/PUT/DELETE /api/v1/configs/:key`\n- Add connector-scoped config routes gated by `config_management` capability\n- Add `config_management` capability to default `agentfield.yaml`\n\n## How It Works\n1. **Store config in DB**: `PUT /api/v1/configs/agentfield.yaml` with YAML body\n2. **Load from DB at startup**: Set `AGENTFIELD_CONFIG_SOURCE=db` \u2192 server reads config from DB after storage init\n3. **Remote management**: SaaS \u2192 connector \u2192 `config_management` capability \u2192 CP config API\n4. 
**Precedence**: env vars > DB config > file config > defaults\n5. **Bootstrap safety**: The `storage` section is never overridden from DB (DB connection can't come from DB)\n\n## Related PRs\n- Connector: Agent-Field/connector (config_management capability)\n- hax-sdk: Agent-Field/hax-sdk (config editor UI)\n\n## Test plan\n- [x] `go build ./...` passes\n- [x] Server tests pass\n- [x] Storage test failure is pre-existing (FTS5 not available)\n- [ ] Manual test: create config via API, verify it loads on restart with `AGENTFIELD_CONFIG_SOURCE=db`\n- [ ] Manual test: verify connector flow end-to-end\n\n\ud83e\udd16 Generated with [Claude Code](https://claude.com/claude-code)", - "pr_type": ": ", - "review_depth": "standard", - "risk_signals": [ - "modifies data model or schema-affecting code", - "changes API surface or request/response behavior", - "includes configuration changes", - "test behavior updated" - ] - }, - "phases_completed": [ - "intake", - "anatomy", - "meta_selectors", - "review", - "adversary", - "cross_ref", - "coverage", - "synthesis", - "output" - ], - "plan": { - "ai_adjusted": false, - "cross_ref_hints": [], - "dimensions": [ - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 90, - "max_reference_follows": 5 - }, - "context_files": [ - "control-plane/internal/config/config.go", - "control-plane/internal/server/server.go" - ], - "id": "semantic_semantic-001", - "name": "Config Merge Correctness", - "priority": 10, - "review_prompt": "Review the mergeDBConfig function in control-plane/internal/server/config_db.go (lines 54-103). This function selectively merges DB config values into the target config, claiming to only apply non-zero/non-empty values.\n\nInvestigate:\n1. Does the merge logic correctly handle ALL config fields? Check if any fields in the Config struct are missing from the merge logic (e.g., logging, metrics, feature flags beyond DID, API settings beyond CORS).\n2. 
The function uses zero-value checks (e.g., `Port != 0`, `WebHookSecret != \"\"`). Does this correctly distinguish between 'not set in DB' vs 'explicitly set to zero/empty in DB'? A user might want to explicitly disable a feature by setting it to 0 or false.\n3. The ExecutionCleanup.Enabled bool is only set if RetentionPeriod or CleanupInterval is non-zero. What if a user wants to explicitly disable cleanup (Enabled=false) while keeping other settings?\n4. Verify that the Connector config is truly NOT being merged (security-sensitive) - confirm no accidental merge happens.\n5. The comment says \"Only non-zero/non-empty values from the DB config are applied\" - verify this holds true for all types including booleans, slices, and nested structs.\n\nLook for cases where the merge logic could produce different configuration results than expected, especially around partial updates and zero-value handling.", - "target_files": [ - "control-plane/internal/server/config_db.go" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 90, - "max_reference_follows": 5 - }, - "context_files": [ - "control-plane/internal/services/*.go" - ], - "id": "semantic_semantic-002", - "name": "Concurrent Config Access Safety", - "priority": 9, - "review_prompt": "Review the concurrency safety of the dynamic config reload mechanism.\n\nFocus on:\n1. The configReloadFn in control-plane/internal/server/server.go (lines 435-442) acquires configMu.Lock() during reload. Verify that ALL readers of s.config throughout the codebase hold configMu.RLock() or Lock() when accessing s.config.\n2. Check server.go for any goroutines or background services (webhook dispatcher, cleanup service, health monitor) that might cache config values at startup and not see reloads.\n3. Look for any direct field access on s.config that bypasses the mutex (e.g., s.config.AgentField.Port accessed without locking).\n4. 
The overlayDBConfig function in config_db.go modifies the cfg struct in-place. Verify this doesn't cause races with concurrent readers that might be iterating over the config.\n5. Check if there's a risk of partial config visibility during reload - can a reader see a half-updated config if they acquire RLock during a reload?\n\nIdentify any cases where concurrent access could lead to data races, stale config reads, or inconsistent state between different config fields.", - "target_files": [ - "control-plane/internal/server/server.go", - "control-plane/internal/server/config_db.go" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "control-plane/internal/server/server_routes_test.go" - ], - "id": "mechanical_mech-001", - "name": "StorageProvider Interface Method Signature Compatibility", - "priority": 10, - "review_prompt": "This PR changes the StorageProvider interface methods from:\n- SetConfig(ctx context.Context, key string, value interface{}) error\n- GetConfig(ctx context.Context, key string) (interface{}, error)\n\nTo:\n- SetConfig(ctx context.Context, key string, value string, updatedBy string) error\n- GetConfig(ctx context.Context, key string) (*ConfigEntry, error)\n- ListConfigs(ctx context.Context) ([]*ConfigEntry, error)\n- DeleteConfig(ctx context.Context, key string) error\n\nVERIFY:\n1. ALL implementations of StorageProvider have been updated (check local.go, any cloud implementations, mock implementations in tests)\n2. ALL callers of these methods pass the correct arguments (check handlers/config_storage.go, any other files calling storage.SetConfig/GetConfig)\n3. The return type change from interface{} to *ConfigEntry doesn't break any caller expecting the old type\n4. Test stubs in server_routes_test.go match the new signatures (appears updated but verify all 4 methods)\n5. 
No other files in the codebase call these methods with the old signatures\n\nFiles to examine:\n- control-plane/internal/storage/storage.go (interface definition)\n- control-plane/internal/storage/local.go (implementation)\n- control-plane/internal/handlers/config_storage.go (callers)\n- control-plane/internal/server/server_routes_test.go (test stubs)\n- Any other files that might implement or call these methods", - "target_files": [ - "control-plane/internal/storage/storage.go", - "control-plane/internal/storage/local.go", - "control-plane/internal/handlers/config_storage.go" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "control-plane/internal/server/config_db.go" - ], - "id": "mechanical_mech-002", - "name": "ConfigEntry Type Flow and Handler Response Consistency", - "priority": 8, - "review_prompt": "This PR introduces a new ConfigEntry struct in storage.go and uses it in handlers/config_storage.go.\n\nVERIFY:\n1. The ConfigEntry struct in storage/storage.go has correct JSON tags for API responses (check handlers use it properly)\n2. The handler in config_storage.go correctly serializes *storage.ConfigEntry to JSON responses\n3. ListConfigs returns []*ConfigEntry but the handler returns it directly - verify this doesn't cause JSON marshaling issues\n4. The GetConfig handler checks if entry == nil and returns 404 - verify this nil check is sufficient (entry could be non-nil but contain empty values)\n5. The SetConfig handler reads the body with io.ReadAll and passes it as 'value' string - verify the content-type handling and that binary/config data flows correctly\n6. 
Check that the import path \"github.com/Agent-Field/agentfield/control-plane/internal/storage\" resolves correctly in handlers/config_storage.go\n\nFiles to examine:\n- control-plane/internal/storage/storage.go (ConfigEntry definition)\n- control-plane/internal/handlers/config_storage.go (handler implementations)\n- control-plane/internal/server/config_db.go (caller of GetConfig)", - "target_files": [ - "control-plane/internal/storage/storage.go", - "control-plane/internal/handlers/config_storage.go" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "control-plane/internal/server/config_db.go" - ], - "id": "mechanical_mech-003", - "name": "ConfigReloadFunc Type and Handler Registration", - "priority": 7, - "review_prompt": "This PR defines a ConfigReloadFunc type alias in handlers/config_storage.go and uses it in server/server.go.\n\nVERIFY:\n1. The type alias `type ConfigReloadFunc func() error` in handlers/config_storage.go is correctly exported and can be imported by server.go\n2. NewConfigStorageHandlers receives `reloadFn ConfigReloadFunc` parameter - verify all call sites pass the correct function type\n3. The configReloadFn() method in server.go returns `handlers.ConfigReloadFunc` - verify this method signature matches what the handlers package expects\n4. Check that RegisterRoutes is called with the correct router group and that route paths don't conflict with existing routes\n5. Verify that when reloadFn is nil (AGENTFIELD_CONFIG_SOURCE != \"db\"), the handlers still work correctly (they should, but verify no nil pointer dereference in ReloadConfig handler)\n6. 
Check that the configMu mutex is properly initialized before configReloadFn is called\n\nFiles to examine:\n- control-plane/internal/handlers/config_storage.go (ConfigReloadFunc definition and usage)\n- control-plane/internal/server/server.go (configReloadFn method and handler registration)\n- control-plane/internal/server/config_db.go (overlayDBConfig function)", - "target_files": [ - "control-plane/internal/handlers/config_storage.go", - "control-plane/internal/server/server.go" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "control-plane/internal/storage/memory.go" - ], - "id": "systemic_storage-interface-consistency", - "name": "Storage Interface Extension Pattern", - "priority": 2, - "review_prompt": "Review the StorageProvider interface extension for config storage methods (SetConfig, GetConfig, ListConfigs, DeleteConfig) in control-plane/internal/storage/storage.go. Assess whether these new methods follow the established patterns in the codebase:\n\n1. Compare the signatures and error handling patterns with existing StorageProvider methods like SetMemory/GetMemory/ListMemory/DeleteMemory\n2. Check if the ConfigEntry struct follows the same patterns as other storage structs (e.g., MemoryEntry, ExecutionRecord)\n3. Verify that the LocalStorage implementation in local.go follows the same SQL patterns used elsewhere (transaction handling, context cancellation checks, placeholder variable usage)\n4. 
Look for any inconsistencies in return types - the old SetConfig/GetConfig returned interface{}, the new ones use concrete types\n\nIdentify any deviations from existing patterns that could introduce maintenance burden or confuse developers working with the storage layer.", - "target_files": [ - "control-plane/internal/storage/storage.go", - "control-plane/internal/storage/local.go", - "control-plane/internal/storage/models.go" - ] - } - ], - "total_budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - } - } - }, - "pr_url": "https://github.com/Agent-Field/agentfield/pull/254", - "review": { - "body": "## \ud83d\udd34 PR-AF Review \u2014 **Needs Major Rework**\n\n*Automated multi-agent code review \u00b7 [PR-AF](https://github.com/Agent-Field/agentfield) built with [AgentField](https://github.com/Agent-Field/agentfield)*\n\n> **25 findings** \u00b7 \ud83d\udd34 8 critical \u00b7 \ud83d\udfe0 12 important \u00b7 \ud83d\udd35 5 suggestions \u00b7 \u26aa 0 nitpicks\n\n
\nPR Overview\n\n## Summary\n- Add `config_storage` table (GORM model + Goose migration 028) for storing configuration files in the database\n- Implement `SetConfig`/`GetConfig`/`ListConfigs`/`DeleteConfig` on the `StorageProvider` interface (works on both SQLite and PostgreSQL)\n- Add `AGENTFIELD_CONFIG_SOURCE=db` environment variable to load config from the database at startup (overlays on top of file config, preserving storage section for bootstrap)\n- Add CRUD API endpoints at `GET/PUT/DELETE /api/v1/configs/:key`\n- Add connector-scoped config routes gated by `config_management` capability\n- Add `config_management` capability to default `agentfield.yaml`\n\n## How It Works\n1. **Store config in DB**: `PUT /api/v1/configs/agentfield.yaml` with YAML body\n2. **Load from DB at startup**: Set `AGENTFIELD_CONFIG_SOURCE=db` \u2192 server reads config from DB after storage init\n3. **Remote management**: SaaS \u2192 connector \u2192 `config_management` capability \u2192 CP config API\n4. **Precedence**: env vars > DB config > file config > defaults\n5. **Bootstrap safety**: The `storage` section is never overridden from DB (DB connection can't come from DB)\n\n## Related PRs\n- Connector: Agent-Field/connector (config_management capability)\n- hax-sdk: Agent-Field/hax-sdk (config editor UI)\n\n## Test plan\n- [x] `go build ./...` passes\n- [x] Server tests pass\n- [x] Storage test failure is pre-existing (FTS5 not available)\n- [ ] Manual test: create config via API, verify it loads on restart with `AGENTFIELD_CONFIG_SOURCE=db`\n- [ ] Manual test: verify connector flow end-to-end\n\n\ud83e\udd16 Generated with [Claude Code](https://claude.com/claude-code)\n\n
\n\n### Key Findings\n\n**20 issue(s) should be addressed before merge:**\n\n- \ud83d\udd34 **Multiple Config Sections Completely Missing from Merge Logic** (`control-plane/internal/server/config_db.go:54`) \u2014 The `mergeDBConfig` function claims to merge DB config values but **entire sections of the Config struct are not merged at all**, effectively ignoring user settings stored in the database.\n- \ud83d\udd34 **MockStorageProvider has outdated SetConfig/GetConfig signatures - will cause compilation failure** (`control-plane/internal/handlers/ui/config_test.go:289`) \u2014 The `MockStorageProvider` in `config_test.go` implements the old `SetConfig` and `GetConfig` method signatures that were changed in this PR.\n- \ud83d\udd34 **MockStorageProvider has outdated SetConfig/GetConfig signatures - will cause compilation failure** (`control-plane/internal/handlers/execute_test.go:173`) \u2014 The `MockStorageProvider` in `execute_test.go` implements the old `SetConfig` and `GetConfig` method signatures that were changed in this PR.\n- \ud83d\udd34 **Missing Mutex Protection for Config Reload - Data Race on s.config** (`control-plane/internal/server/server.go:435`) \u2014 The configReloadFn() function accesses and modifies s.config without any mutex protection, yet multiple goroutines throughout server.go read from s.config concurrently.\n- \ud83d\udd34 **overlayDBConfig Modifies Config Struct In-Place Without Synchronization** (`control-plane/internal/server/config_db.go:19`) \u2014 The overlayDBConfig function modifies the shared cfg struct in-place through mergeDBConfig, creating race conditions with any concurrent readers.\n- \ud83d\udd34 **No request body size limit - potential DoS vulnerability** (`control-plane/internal/handlers/config_storage.go:70`) \u2014 The SetConfig handler uses io.ReadAll(c.Request.Body) without any size limitation.\n- \ud83d\udd34 **Partial config visibility during reload - readers can see half-updated config** 
(`control-plane/internal/server/config_db.go:42`) \u2014 The `mergeDBConfig()` function at lines 54-103 performs field-by-field merging of DB config into the target config struct.\n- \ud83d\udd34 **Security risk: config_management enabled with write access by default** (`control-plane/config/agentfield.yaml:149`) \u2014 The default configuration enables `config_management` capability with `read_only: false`.\n- \u2026 and 12 more (see All Findings by Severity)\n\n**5 suggestion(s) and style note(s):**\n\n- \ud83d\udd35 ConfigReloadFunc type alias is correctly exported (`control-plane/internal/handlers/config_storage.go:12`)\n- \ud83d\udd35 NewConfigStorageHandlers receives correct function type at all call sites (`control-plane/internal/server/server.go:1552`)\n- \ud83d\udd35 Nil reloadFn is handled correctly in ReloadConfig handler (`control-plane/internal/handlers/config_storage.go:114`)\n- \ud83d\udd35 GetConfig uses string comparison for sql.ErrNoRows instead of errors.Is (`control-plane/internal/storage/local.go:5163`)\n- \ud83d\udd35 CORSConfig Merge Only Handles AllowedOrigins, Missing Other CORS Fields (`control-plane/internal/server/config_db.go:95`)\n\n**Files with findings:** `control-plane/config/agentfield.yaml`, `control-plane/internal/config/config.go`, `control-plane/internal/handlers/config_storage.go`, `control-plane/internal/handlers/execute_test.go`, `control-plane/internal/handlers/ui/config_test.go`, `control-plane/internal/server/config_db.go`, `control-plane/internal/server/server.go`, `control-plane/internal/storage/local.go`\n\n
\nAll Findings by Severity\n\n#### \ud83d\udd34 Critical (8)\n\n- **Multiple Config Sections Completely Missing from Merge Logic** `control-plane/internal/server/config_db.go:54`\n- **MockStorageProvider has outdated SetConfig/GetConfig signatures - will cause compilation failure** `control-plane/internal/handlers/ui/config_test.go:289`\n- **MockStorageProvider has outdated SetConfig/GetConfig signatures - will cause compilation failure** `control-plane/internal/handlers/execute_test.go:173`\n- **Missing Mutex Protection for Config Reload - Data Race on s.config** `control-plane/internal/server/server.go:435`\n- **overlayDBConfig Modifies Config Struct In-Place Without Synchronization** `control-plane/internal/server/config_db.go:19`\n- **No request body size limit - potential DoS vulnerability** `control-plane/internal/handlers/config_storage.go:70`\n- **Partial config visibility during reload - readers can see half-updated config** `control-plane/internal/server/config_db.go:42`\n- **Security risk: config_management enabled with write access by default** `control-plane/config/agentfield.yaml:149`\n\n#### \ud83d\udfe0 Important (12)\n\n- **NodeHealth Struct Merge Uses Blanket Assignment, Risking Data Loss** `control-plane/internal/server/config_db.go:59`\n- **DIDConfig Merge Only Checks Method Field, Missing All Other DID Settings** `control-plane/internal/server/config_db.go:87`\n- **ExecutionCleanup.Enabled Bool Cannot Be Explicitly Set to false Without Changing Other Fields** `control-plane/internal/server/config_db.go:79`\n- **Unprotected concurrent config access during hot reload - potential data race** `control-plane/internal/server/server.go:435`\n- **HealthMonitor caches config values at startup - won't see reloads** `control-plane/internal/server/server.go:160`\n- **WebhookDispatcher caches config values at startup - won't see reloads** `control-plane/internal/server/server.go:366`\n- **ExecutionCleanupService caches config values at startup - won't see 
reloads** `control-plane/internal/server/server.go:392`\n- **Missing environment variable override for config_management capability** `control-plane/internal/config/config.go:333`\n- **Background Services Cache Config Values at Startup - Reload Has No Effect** `control-plane/internal/server/server.go:366`\n- **Partial Config Visibility Risk - Individual Field Updates Not Atomic** `control-plane/internal/server/config_db.go:54`\n- **DeleteConfig returns 404 for all errors, masking real failures** `control-plane/internal/handlers/config_storage.go:106`\n- **HTTP Server Port Accessed Without Lock During Concurrent Reload** `control-plane/internal/server/server.go:502`\n\n#### \ud83d\udd35 Suggestion (5)\n\n- **ConfigReloadFunc type alias is correctly exported** `control-plane/internal/handlers/config_storage.go:12`\n- **NewConfigStorageHandlers receives correct function type at all call sites** `control-plane/internal/server/server.go:1552`\n- **Nil reloadFn is handled correctly in ReloadConfig handler** `control-plane/internal/handlers/config_storage.go:114`\n- **GetConfig uses string comparison for sql.ErrNoRows instead of errors.Is** `control-plane/internal/storage/local.go:5163`\n- **CORSConfig Merge Only Handles AllowedOrigins, Missing Other CORS Fields** `control-plane/internal/server/config_db.go:95`\n\n
\n\n
\nReview Process Details\n\n**Dimensions Analyzed (6):**\n\n- **Config Merge Correctness** \u2014 1 file(s)\n- **Concurrent Config Access Safety** \u2014 2 file(s)\n- **StorageProvider Interface Method Signature Compatibility** \u2014 3 file(s)\n- **ConfigEntry Type Flow and Handler Response Consistency** \u2014 2 file(s)\n- **ConfigReloadFunc Type and Handler Registration** \u2014 2 file(s)\n- **Storage Interface Extension Pattern** \u2014 3 file(s)\n\n**Meta-Dimension Lenses (3):**\n\n- **Semantic** \u2014 3 dimension(s), 85% coverage confidence\n- **Mechanical** \u2014 3 dimension(s), 85% coverage confidence\n- **Systemic** \u2014 2 dimension(s), 75% coverage confidence\n\n**Cross-Reference & Adversary Analysis:**\n\n- **24** finding(s) adversarially tested: 16 confirmed, 8 challenged\n\n
\n\n
\nPipeline Stats\n\n| Metric | Value |\n|--------|-------|\n| Duration | 1994.4s |\n| Agent invocations | 25 |\n| Coverage iterations | 2 |\n| Estimated cost | N/A (provider does not report cost) |\n| Budget exhausted | No |\n| PR type | : |\n| Complexity | : |\n\n
\n\nReview ID: `rev_5f6ae7c54951`", - "comments": [ - { - "body": "\ud83d\udd34 **[CRITICAL] Multiple Config Sections Completely Missing from Merge Logic**\n\nThe `mergeDBConfig` function claims to merge DB config values but **entire sections of the Config struct are not merged at all**, effectively ignoring user settings stored in the database.\n\n**Missing sections:**\n1. **`AgentField.ExecutionQueue`** (lines 72-78 in config.go): All webhook timeout, retry, and backoff settings are ignored from DB config\n2. **`API.Auth`** (lines 207-212 in config.go): SkipPaths configuration cannot be set from DB\n3. **Most `Features.DID` fields**: Only `Method` is merged; `Enabled`, `KeyAlgorithm`, `DerivationMethod`, `KeyRotationDays`, `VCRequirements`, `Keystore`, and `Authorization` are all ignored\n4. **Most `API.CORS` fields**: Only `AllowedOrigins` is merged; `AllowedMethods`, `AllowedHeaders`, `ExposedHeaders`, `AllowCredentials` are ignored\n5. **Most `NodeHealth` fields**: Only `CheckInterval` is merged; `CheckTimeout`, `ConsecutiveFailures`, `RecoveryDebounce`, `HeartbeatStaleThreshold` are ignored\n\nThis means users who store config in the database expecting to control webhook timeouts, DID authorization policies, CORS settings, or health check parameters will have their settings silently ignored, leading to **configuration drift** between what's stored in DB and what's actually applied.\n\n---\n\n> Step 1: Config struct at config.go:17-23 shows 5 top-level sections\n> Step 2: mergeDBConfig only handles partial subsets:\n> - AgentField: Port, partial NodeHealth (only CheckInterval), ExecutionCleanup, Approval, MISSING ExecutionQueue\n> - Features: Only DID.Method, intentionally skips Connector\n> - API: Only CORS.AllowedOrigins, MISSING Auth entirely\n> - UI: Fully merged\n> - Storage: Explicitly preserved (correct)\n> Step 3: User stores config with ExecutionQueue.WebhookTimeout=30s in DB\n> Step 4: mergeDBConfig has no logic for ExecutionQueue - value is silently 
ignored\n> Step 5: Server uses default timeout, user configuration is discarded\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd explicit merge logic for all config fields. For struct fields, either:\n1. Merge field-by-field like ExecutionCleanup, or\n2. Check a sentinel field to determine if the struct was intentionally set\n\nAt minimum, add merge logic for:\n- `AgentField.ExecutionQueue` (all fields)\n- `API.Auth.SkipPaths` (check slice length)\n- All `Features.DID` sub-fields\n- All `API.CORS` fields\n- All `NodeHealth` fields\n\n---\n*`Merge Logic Completeness and Correctness` \u00b7 confidence 95%*", - "line": 54, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd34 **[CRITICAL] Missing Mutex Protection for Config Reload - Data Race on s.config**\n\nThe configReloadFn() function accesses and modifies s.config without any mutex protection, yet multiple goroutines throughout server.go read from s.config concurrently.\n\nThe PR description claims configMu.Lock() is acquired during reload (lines 435-442), but NO SUCH MUTEX EXISTS in the codebase. 
The function directly calls overlayDBConfig(s.config, s.storage) which mutates the config struct in-place via mergeDBConfig().\n\nThis creates a data race:\n- HTTP request handlers read s.config.AgentField.Port, s.config.API.CORS, s.config.Features.DID.Enabled, etc.\n- The reload goroutine (triggered by API call) writes to these same fields\n- No synchronization primitive protects these concurrent accesses\n\nAffected readers include:\n- Route setup code (lines 834-838, 882-893, 913, 919-927, 971)\n- Execute handlers (lines 1246-1247, 1251)\n- Admin routes (lines 1531-1533)\n- DID middleware (lines 890, 1204, 1232)\n- UI routes (lines 1586, 1619)\n\nThis is a critical data race that can cause crashes, memory corruption, or inconsistent config state.\n\n---\n\n> Step 1: configReloadFn() at server.go:435-442 returns a closure that calls overlayDBConfig(s.config, s.storage)\n> Step 2: overlayDBConfig at config_db.go:19-50 calls mergeDBConfig(cfg, &dbCfg) at line 42\n> Step 3: mergeDBConfig at config_db.go:54-103 writes directly to target fields like target.AgentField.Port = dbCfg.AgentField.Port (line 57), target.AgentField.NodeHealth = dbCfg.AgentField.NodeHealth (line 60), etc.\n> Step 4: Concurrent goroutines in server.go read s.config fields without any mutex (e.g., line 502: s.config.AgentField.Port, line 834: s.config.API.CORS.AllowedOrigins)\n> Step 5: No configMu or similar mutex exists in the codebase - verified by grep search\n> Result: Unsynchronized concurrent read/write on shared config struct = data race\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd a sync.RWMutex field (configMu) to AgentFieldServer struct. Acquire Lock() in configReloadFn() before calling overlayDBConfig, and acquire RLock() in all HTTP handlers that read config. 
Alternatively, use atomic pointer swap: store config as atomic.Pointer[Config] and swap the entire struct atomically on reload, eliminating need for RLock in readers.\n\n---\n*`Concurrency Safety of Dynamic Config Reload` \u00b7 confidence 95%*", - "line": 435, - "path": "control-plane/internal/server/server.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd34 **[CRITICAL] overlayDBConfig Modifies Config Struct In-Place Without Synchronization**\n\nThe overlayDBConfig function modifies the shared cfg struct in-place through mergeDBConfig, creating race conditions with any concurrent readers.\n\nCritical issue: The function receives a pointer to the server's config struct and directly mutates its fields:\n- Line 42: mergeDBConfig(cfg, &dbCfg) - calls merge function\n- Lines 56-102 in mergeDBConfig: Direct field assignments like target.AgentField.Port = dbCfg.AgentField.Port\n\nThe storage section is protected (saved at line 33, restored at line 45), but all other config sections are unprotected during the merge operation.\n\nThis means concurrent readers can observe:\n1. Partially updated config (e.g., Port updated but NodeHealth not yet updated)\n2. Corrupted memory if writes overlap with reads\n3. 
Inconsistent state between related fields (e.g., DID.Enabled=true but DID.Authorization config not yet applied)\n\n---\n\n> Step 1: overlayDBConfig receives cfg *config.Config parameter at line 19\n> Step 2: Only storage config is saved: savedStorage := cfg.Storage at line 33\n> Step 3: mergeDBConfig(cfg, &dbCfg) at line 42 writes directly to cfg fields\n> Step 4: mergeDBConfig lines 56-102 perform direct assignments: target.AgentField.Port = dbCfg.AgentField.Port, target.AgentField.NodeHealth = dbCfg.AgentField.NodeHealth, etc.\n> Step 5: Storage is restored at line 45: cfg.Storage = savedStorage\n> Result: All non-storage config fields are mutated in-place without atomicity or synchronization\n\n**\ud83d\udca1 Suggested Fix**\n\nOption 1: Require caller to hold mutex before calling overlayDBConfig (document in function comments). Option 2: Have overlayDBConfig create a deep copy of the config, modify the copy, then atomically swap the pointer (requires config to be stored as atomic.Pointer). Option 3: Protect each config section with its own mutex (more granular but complex).\n\n---\n*`Concurrency Safety of Dynamic Config Reload` \u00b7 confidence 95%*", - "line": 19, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd34 **[CRITICAL] No request body size limit - potential DoS vulnerability**\n\nThe SetConfig handler uses io.ReadAll(c.Request.Body) without any size limitation. This allows attackers to send arbitrarily large request bodies, causing memory exhaustion and potential denial of service. The PR diff indicated a maxConfigBodySize constant (1 MB) and io.LimitReader should be used, but the actual implementation is missing this protection. Impact: An attacker with a valid API key can crash the server by uploading multi-gigabyte config files.\n\n---\n\n> Step 1: Attacker sends PUT /api/v1/configs/agentfield.yaml with a 10GB request body. Step 2: Handler calls io.ReadAll(c.Request.Body). 
Step 3: io.ReadAll allocates memory proportional to request body size. Step 4: Server runs out of memory and crashes (OOM).\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd a body size limit using io.LimitReader. Define const maxConfigBodySize = 1 << 20 // 1 MB. Then use body, err := io.ReadAll(io.LimitReader(c.Request.Body, maxConfigBodySize+1)) and check if len(body) > maxConfigBodySize then return http.StatusRequestEntityTooLarge with appropriate error message.\n\n---\n*`Config Storage Handler Implementation Review` \u00b7 confidence 95%*", - "line": 70, - "path": "control-plane/internal/handlers/config_storage.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd34 **[CRITICAL] Partial config visibility during reload - readers can see half-updated config**\n\nThe `mergeDBConfig()` function at lines 54-103 performs field-by-field merging of DB config into the target config struct. This happens in-place on the shared `s.config` object.\n\n**The Problem:**\n1. If a reader accesses `s.config` during `mergeDBConfig()`, they may see a partially updated config.\n2. For example, if the merge updates `AgentField.Port` first, then gets preempted, a reader might see the new Port but old NodeHealth settings.\n3. This can lead to inconsistent state where different config fields are from different config versions.\n\n**Even worse**, since `configMu` doesn't exist, there's no mutex protection at all. 
Multiple goroutines can read `s.config` while it's being modified.\n\n---\n\n> Step 1: `overlayDBConfig()` at line 42 calls `mergeDBConfig(cfg, &dbCfg)` where `cfg` is `s.config`.\n> Step 2: `mergeDBConfig()` modifies fields one-by-one (lines 56-103) without atomicity.\n> Step 3: Example: Line 56-58 updates `AgentField.Port`, lines 59-61 update `NodeHealth` - a reader could see new Port but old NodeHealth.\n> Step 4: No atomic snapshot or deep copy is performed.\n> Step 5: The config struct is modified in-place while other goroutines may be reading it.\n\n**\ud83d\udca1 Suggested Fix**\n\nUse atomic config replacement instead of in-place modification:\n\n```go\nfunc (s *AgentFieldServer) configReloadFn() handlers.ConfigReloadFunc {\n return func() error {\n // Load new config\n newCfg := *s.config // Copy current config\n if err := overlayDBConfig(&newCfg, s.storage); err != nil {\n return err\n }\n // Atomically swap\n s.configMu.Lock()\n s.config = &newCfg\n s.configMu.Unlock()\n return nil\n }\n}\n```\n\nThis ensures readers always see a consistent (if potentially stale) config, never a partially updated one.\n\n---\n*`Concurrency Safety of Dynamic Config Reload` \u00b7 confidence 90%*", - "line": 42, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd34 **[CRITICAL] Security risk: config_management enabled with write access by default**\n\nThe default configuration enables `config_management` capability with `read_only: false`. This grants any connector with a valid token write access to server configuration via the database-backed config storage API. Connectors can modify security-critical settings (API keys, admin tokens, DID authorization settings) without admin privileges. This is inconsistent with other sensitive capabilities like `did_management` which defaults to `enabled: false`.\n\n---\n\n> Step 1: agentfield.yaml:149-151 sets `config_management: enabled: true, read_only: false`. 
Step 2: PR description states connector routes are gated by `config_management` capability check. Step 3: With these defaults, any deployment using the default config exposes write access to configuration. Step 4: Connectors can call PUT/DELETE /api/v1/connector/configs/* to modify server config including auth tokens (lines mentioned in PR context: server.go:1573-1578).\n\n**\ud83d\udca1 Suggested Fix**\n\nChange the default to `enabled: false` or at minimum `read_only: true`. This follows the principle of least privilege and prevents unauthorized configuration modifications. Operators who need connector config management can explicitly enable it after reviewing security implications.\n\n---\n*`Config Merge Correctness` \u00b7 confidence 90%*", - "line": 149, - "path": "control-plane/config/agentfield.yaml", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] NodeHealth Struct Merge Uses Blanket Assignment, Risking Data Loss**\n\nThe `NodeHealth` merge logic at lines 59-61 uses blanket struct assignment when `CheckInterval != 0`:\n\n```go\nif dbCfg.AgentField.NodeHealth.CheckInterval != 0 {\n target.AgentField.NodeHealth = dbCfg.AgentField.NodeHealth\n}\n```\n\n**Problem**: If the DB config only specifies `CheckInterval` but not other fields like `CheckTimeout`, `ConsecutiveFailures`, `RecoveryDebounce`, or `HeartbeatStaleThreshold`, the entire struct is overwritten. This means:\n1. File/env settings for other NodeHealth fields are lost\n2. 
The zero values from the YAML unmarshal (for unspecified fields) overwrite valid existing values\n\nThis contradicts the function's stated purpose of \"only non-zero/non-empty values from the DB config are applied.\"\n\n---\n\n> Step 1: File config has NodeHealth.CheckTimeout=10s, NodeHealth.CheckInterval=5s\n> Step 2: DB config only sets CheckInterval=15s (leaving others at Go zero values)\n> Step 3: mergeDBConfig checks CheckInterval != 0 (true)\n> Step 4: target.AgentField.NodeHealth = dbCfg.AgentField.NodeHealth assigns entire struct\n> Step 5: target.AgentField.NodeHealth.CheckTimeout becomes 0 (was 10s), data is lost\n\n**\ud83d\udca1 Suggested Fix**\n\nChange NodeHealth merge to field-by-field approach like ExecutionCleanup:\n```go\nif dbCfg.AgentField.NodeHealth.CheckInterval != 0 {\n target.AgentField.NodeHealth.CheckInterval = dbCfg.AgentField.NodeHealth.CheckInterval\n}\nif dbCfg.AgentField.NodeHealth.CheckTimeout != 0 {\n target.AgentField.NodeHealth.CheckTimeout = dbCfg.AgentField.NodeHealth.CheckTimeout\n}\n// etc for all fields\n```\n\n---\n*`Merge Logic Completeness and Correctness` \u00b7 confidence 90%*", - "line": 59, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] DIDConfig Merge Only Checks Method Field, Missing All Other DID Settings**\n\nThe `Features.DID` merge at lines 87-89 only checks if `Method != \"\"` and then does blanket struct assignment:\n\n```go\nif dbCfg.Features.DID.Method != \"\" {\n target.Features.DID = dbCfg.Features.DID\n}\n```\n\n**Problems**:\n1. **Data loss**: Like NodeHealth, this uses blanket assignment, so unspecified fields in DB config overwrite valid file/env settings with zero values\n2. 
**Cannot set non-Method fields alone**: If a user wants to only change `KeyRotationDays` or `VCRequirements` in DB config without changing `Method`, they cannot - the condition requires Method to be non-empty\n\nThe `DIDConfig` struct (config.go:100-109) has 9 fields, but only `Method` can trigger a merge, and when triggered, all other fields are subject to zero-value overwrite.\n\n---\n\n> Step 1: File config sets DID.Enabled=true, Method=\"did:key\", KeyRotationDays=90\n> Step 2: DB config only sets KeyRotationDays=30 (leaving Method empty)\n> Step 3: Condition Method != \"\" evaluates to false\n> Step 4: No merge happens, KeyRotationDays remains 90 despite DB having 30\n> OR if Method WAS set in DB, entire struct is overwritten, losing file/env settings for unspecified fields\n\n**\ud83d\udca1 Suggested Fix**\n\nImplement field-by-field merge for DIDConfig similar to ExecutionCleanup:\n```go\nif dbCfg.Features.DID.Method != \"\" {\n target.Features.DID.Method = dbCfg.Features.DID.Method\n}\nif dbCfg.Features.DID.KeyAlgorithm != \"\" {\n target.Features.DID.KeyAlgorithm = dbCfg.Features.DID.KeyAlgorithm\n}\n// Handle nested structs like VCRequirements, Keystore, Authorization recursively\n```\n\n---\n*`Merge Logic Completeness and Correctness` \u00b7 confidence 90%*", - "line": 87, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] ExecutionCleanup.Enabled Bool Cannot Be Explicitly Set to false Without Changing Other Fields**\n\nThe logic for merging `ExecutionCleanup.Enabled` (lines 79-81) requires at least one other cleanup field to be non-zero:\n\n```go\nif dbCfg.AgentField.ExecutionCleanup.RetentionPeriod != 0 || dbCfg.AgentField.ExecutionCleanup.CleanupInterval != 0 {\n target.AgentField.ExecutionCleanup.Enabled = dbCfg.AgentField.ExecutionCleanup.Enabled\n}\n```\n\n**Problem**: A user who wants to explicitly **disable** cleanup by setting `enabled: false` in the DB config cannot do so 
unless they also set `retention_period` or `cleanup_interval` to non-zero values. If they only set `enabled: false` (with other fields at 0), the condition fails and `Enabled` is not updated.\n\nThis violates the principle that users should be able to explicitly set boolean flags to their zero value (false) independently of other fields.\n\n---\n\n> Step 1: File config has ExecutionCleanup.Enabled=true, RetentionPeriod=24h\n> Step 2: User wants to disable cleanup, stores DB config with only 'enabled: false'\n> Step 3: All duration fields in dbCfg are 0 (not specified)\n> Step 4: Condition at line 79 evaluates to false (0 != 0 || 0 != 0)\n> Step 5: target.AgentField.ExecutionCleanup.Enabled remains true, user's explicit false is ignored\n\n**\ud83d\udca1 Suggested Fix**\n\nUse a sentinel/presence check pattern for booleans. Options:\n1. Use a `*bool` pointer type to distinguish between 'not set' and 'explicitly false'\n2. Add a comment explaining that to disable cleanup, users must also set a non-zero retention_period\n3. 
Always merge Enabled if any ExecutionCleanup field is non-zero (broader check)\n\nRecommended fix:\n```go\n// Check if any cleanup field is configured in DB\ncleanupConfigured := dbCfg.AgentField.ExecutionCleanup.RetentionPeriod != 0 ||\n dbCfg.AgentField.ExecutionCleanup.CleanupInterval != 0 ||\n dbCfg.AgentField.ExecutionCleanup.BatchSize != 0 ||\n dbCfg.AgentField.ExecutionCleanup.PreserveRecentDuration != 0 ||\n dbCfg.AgentField.ExecutionCleanup.StaleExecutionTimeout != 0\nif cleanupConfigured {\n target.AgentField.ExecutionCleanup.Enabled = dbCfg.AgentField.ExecutionCleanup.Enabled\n}\n```\n\n---\n*`Merge Logic Completeness and Correctness` \u00b7 confidence 85%*", - "line": 79, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Unprotected concurrent config access during hot reload - potential data race**\n\nThe `configReloadFn()` method returns a closure that calls `overlayDBConfig(s.config, s.storage)` without any mutex protection. This creates a data race when the reload endpoint is invoked while background services are reading config values.\n\n**Background services that read config concurrently:**\n- `healthMonitor` - uses `cfg.AgentField.NodeHealth.*` settings (line 160-166)\n- `cleanupService` - uses `cfg.AgentField.ExecutionCleanup.*` settings (line 392)\n- `webhookDispatcher` - uses execution queue settings (line 366-371)\n- `statusManager` - uses heartbeat thresholds (line 133-148)\n\n**The race condition:**\n1. Background goroutines read nested config fields (e.g., `s.config.AgentField.NodeHealth.CheckInterval`)\n2. Hot reload via `POST /api/v1/configs/reload` calls `overlayDBConfig()` which mutates the shared config struct\n3. Go's memory model doesn't guarantee atomicity of struct field writes - readers may see partially updated values\n4. 
This can cause services to operate with inconsistent configuration\n\n**Note:** While the PR narrative mentions 'Concurrent Config Access' as a known risk, the actual code doesn't implement the necessary synchronization to mitigate it.\n\n---\n\n> Step 1: `configReloadFn()` is defined at server.go:435-442, returns closure calling `overlayDBConfig(s.config, s.storage)`\n> Step 2: `overlayDBConfig()` at config_db.go:19-50 directly mutates `cfg` fields via `mergeDBConfig()`\n> Step 3: Background services initialized in NewAgentFieldServer (lines 133-392) store config references and access them concurrently\n> Step 4: HTTP handlers invoke the reload function without any synchronization barrier\n> Step 5: No mutex is defined in AgentFieldServer struct (lines 48-82)\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd a `sync.RWMutex` field to `AgentFieldServer` struct to protect config access:\n\n1. Add `configMu sync.RWMutex` to the struct (line 48-82)\n2. In `configReloadFn()`, acquire write lock before calling `overlayDBConfig`:\n ```go\n return func() error {\n s.configMu.Lock()\n defer s.configMu.Unlock()\n return overlayDBConfig(s.config, s.storage)\n }\n ```\n3. Background services should acquire read locks when accessing config, OR config should be accessed through getter methods that acquire read locks\n\n---\n*`ConfigReloadFunc Type and Usage Verification` \u00b7 confidence 75%*", - "line": 435, - "path": "control-plane/internal/server/server.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Partial Config Visibility Risk - Individual Field Updates Not Atomic**\n\nThe mergeDBConfig function updates config fields one by one, creating a window where readers can see a partially updated configuration. This is a form of torn read.\n\nExample scenario:\n1. Reader goroutine accesses cfg.AgentField.ExecutionCleanup during reload\n2. mergeDBConfig has updated RetentionPeriod but not yet updated CleanupInterval\n3. 
Reader sees inconsistent state: new retention period with old cleanup interval\n\nSpecific vulnerable fields:\n- Lines 63-81: ExecutionCleanup fields updated individually (RetentionPeriod, CleanupInterval, BatchSize, PreserveRecentDuration, StaleExecutionTimeout, Enabled)\n- Lines 82-84: Approval struct replaced atomically (better, but still mixed with other fields)\n- Lines 87-89: Features.DID struct replaced atomically\n- Lines 95-97: API.CORS struct replaced atomically\n\nThe problem: While individual struct assignments are atomic, the overall config is NOT updated atomically. Between the first and last field update, readers see an inconsistent mix of old and new values.\n\n---\n\n> Step 1: mergeDBConfig at config_db.go:54-103 updates fields sequentially\n> Step 2: Lines 63-81 update ExecutionCleanup field-by-field (not atomic as a group)\n> Step 3: Concurrent reader at server.go:392 accessing s.config.AgentField.ExecutionCleanup could read during updates\n> Step 4: Example race: Writer updates RetentionPeriod at line 64, then gets preempted\n> Step 5: Reader reads ExecutionCleanup struct, sees new RetentionPeriod but old CleanupInterval (line 67 hasn't executed yet)\n> Result: Reader observes inconsistent config state\n\n**\ud83d\udca1 Suggested Fix**\n\nMake config updates atomic by either:\n1. Create a complete new Config struct, populate it with merged values, then atomically swap the pointer (using atomic.Pointer or similar)\n2. Hold a write lock during the entire merge operation, and have all readers acquire read lock (but this blocks readers during reload)\n3. 
Accept that partial visibility is a known limitation and document which config sections are updated atomically vs field-by-field\n\n---\n*`Concurrency Safety of Dynamic Config Reload` \u00b7 confidence 85%*", - "line": 54, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] DeleteConfig returns 404 for all errors, masking real failures**\n\nThe DeleteConfig handler returns HTTP 404 (Not Found) for ANY error from storage.DeleteConfig(), regardless of the actual error cause. This incorrectly masks database errors, permission errors, or other internal failures as not found conditions. Current behavior: Database connection failure results in 404 Not Found. Expected behavior: Database connection failure results in 500 Internal Server Error. This makes debugging difficult and violates HTTP semantics.\n\n---\n\n> Step 1: Database connection fails during DeleteConfig call. Step 2: storage.DeleteConfig returns error like connection refused. Step 3: Handler returns c.JSON(http.StatusNotFound, ...) for ANY error. Step 4: Client receives misleading 404 status instead of 500.\n\n**\ud83d\udca1 Suggested Fix**\n\nCheck the error type to distinguish not found from other errors. If errors.Is(err, storage.ErrNotFound) then return http.StatusNotFound, otherwise return http.StatusInternalServerError. Or if the storage layer does not return typed errors, check for not found in the error message.\n\n---\n*`Config Storage Handler Implementation Review` \u00b7 confidence 85%*", - "line": 106, - "path": "control-plane/internal/handlers/config_storage.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] ConfigReloadFunc type alias is correctly exported**\n\nThe type alias `ConfigReloadFunc` is correctly defined with an exported name (capitalized) and can be imported by the server package. 
The function signature `func() error` matches the expected usage pattern for configuration reload callbacks.\n\n---\n\n> Line 12: `type ConfigReloadFunc func() error` - exported type name, correct signature\n\n---\n*`ConfigReloadFunc Type and Usage Verification` \u00b7 confidence 95%*", - "line": 12, - "path": "control-plane/internal/handlers/config_storage.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] Nil reloadFn is handled correctly in ReloadConfig handler**\n\nThe `ReloadConfig` handler correctly checks for nil `reloadFn` at line 115 and returns HTTP 503 with a descriptive error message when config reload is not available (AGENTFIELD_CONFIG_SOURCE != db). This prevents nil pointer dereference.\n\n---\n\n> Line 115-119: `if h.reloadFn == nil { c.JSON(http.StatusServiceUnavailable, gin.H{\"error\": \"config reload not available...\"}) }`\n\n---\n*`ConfigReloadFunc Type and Usage Verification` \u00b7 confidence 95%*", - "line": 114, - "path": "control-plane/internal/handlers/config_storage.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] GetConfig uses string comparison for sql.ErrNoRows instead of errors.Is**\n\nGetConfig checks for 'no rows' condition by comparing err.Error() to a string literal 'sql: no rows in result set' instead of using errors.Is(err, sql.ErrNoRows). This is fragile because the error message string could change in future Go versions or with different database drivers. The standard approach throughout Go codebases is to use errors.Is() for error comparison.\n\n---\n\n> Step 1: GetConfig at local.go:5186 checks `if err.Error() == \"sql: no rows in result set\"`. Step 2: The standard pattern in Go is `if errors.Is(err, sql.ErrNoRows)` as seen in GetWorkflowRun at local.go:300. 
Step 3: String comparison is fragile - the error message format could change or be driver-specific.\n\n**\ud83d\udca1 Suggested Fix**\n\nReplace the string comparison with standard error checking:\n```go\nif errors.Is(err, sql.ErrNoRows) {\n return nil, nil\n}\n```\nThis requires importing `errors` package (which is already imported in the file).\n\n---\n*`StorageProvider Interface Extension for Config Storage` \u00b7 confidence 90%*", - "line": 5163, - "path": "control-plane/internal/storage/local.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] CORSConfig Merge Only Handles AllowedOrigins, Missing Other CORS Fields**\n\nThe `API.CORS` merge at lines 95-97 only checks `AllowedOrigins` and does blanket assignment:\n\n```go\nif len(dbCfg.API.CORS.AllowedOrigins) > 0 {\n target.API.CORS = dbCfg.API.CORS\n}\n```\n\n**Missing fields** from CORSConfig (config.go:198-204):\n- `AllowedMethods`\n- `AllowedHeaders`\n- `ExposedHeaders`\n- `AllowCredentials`\n\nUsers cannot configure these CORS settings from DB config. 
Additionally, blanket assignment causes zero-value overwrite issues for unspecified fields.\n\n---\n\n> Step 1: CORSConfig struct at config.go:198-204 has 5 fields\n> Step 2: mergeDBConfig lines 95-97 only checks AllowedOrigins\n> Step 3: User stores DB config with AllowedMethods=[\"POST\", \"GET\"] but no AllowedOrigins\n> Step 4: Condition len(AllowedOrigins) > 0 evaluates to false\n> Step 5: AllowedMethods is ignored, CORS remains with default methods\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd field-by-field merge for all CORS fields:\n```go\nif len(dbCfg.API.CORS.AllowedOrigins) > 0 {\n target.API.CORS.AllowedOrigins = dbCfg.API.CORS.AllowedOrigins\n}\nif len(dbCfg.API.CORS.AllowedMethods) > 0 {\n target.API.CORS.AllowedMethods = dbCfg.API.CORS.AllowedMethods\n}\n// etc for AllowedHeaders, ExposedHeaders\n// For AllowCredentials (bool), use presence of other fields or pointer type\n```\n\n---\n*`Merge Logic Completeness and Correctness` \u00b7 confidence 85%*", - "line": 95, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - } - ], - "event": "REQUEST_CHANGES" - }, - "review_id": "rev_5f6ae7c54951", - "summary": { - "adversary_challenged": 8, - "adversary_confirmed": 16, - "ai_generated_confidence": 0.6666666666666666, - "budget_exhausted": false, - "by_severity": { - "critical": 8, - "important": 12, - "suggestion": 5 - }, - "cost_usd": 0, - "coverage_iterations": 2, - "cross_ref_interactions": 0, - "dimensions_run": 6, - "duration_seconds": 1994.388, - "total_findings": 25 - } -} \ No newline at end of file diff --git a/benchmark/agentfield-254/pr-af-result-kimi-evidence-grounding.json b/benchmark/agentfield-254/pr-af-result-kimi-evidence-grounding.json deleted file mode 100644 index 8920eb2..0000000 --- a/benchmark/agentfield-254/pr-af-result-kimi-evidence-grounding.json +++ /dev/null @@ -1,1283 +0,0 @@ -{ - "findings": [ - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `MockStorageProvider` in `config_test.go` has 
obsolete method signatures for `SetConfig` and `GetConfig` that do not match the updated `StorageProvider` interface. The interface was changed in `storage.go` to use `string` parameters and `*ConfigEntry` return types, plus added `ListConfigs` and `DeleteConfig` methods. The mock still uses the old `interface{}` signatures and lacks the new methods entirely.\n\n**Interface signature (storage.go:132-136):**\n```go\nSetConfig(ctx context.Context, key string, value string, updatedBy string) error\nGetConfig(ctx context.Context, key string) (*ConfigEntry, error)\nListConfigs(ctx context.Context) ([]*ConfigEntry, error)\nDeleteConfig(ctx context.Context, key string) error\n```\n\n**Mock signature (config_test.go:289-297):**\n```go\nSetConfig(ctx context.Context, key string, value interface{}) error // WRONG: missing updatedBy, wrong type\nGetConfig(ctx context.Context, key string) (interface{}, error) // WRONG: wrong return type\n// ListConfigs - MISSING entirely\n// DeleteConfig - MISSING entirely\n```\n\nThis is a **compile-breaking issue**. 
Go's strict interface satisfaction rules mean `MockStorageProvider` no longer implements `StorageProvider`, causing build failures.", - "confidence": 1, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "storage-provider-interface-mismatch", - "dimension_name": "StorageProvider Interface Implementation Verification", - "evidence": "Step 1: StorageProvider interface defines SetConfig with signature `(ctx context.Context, key string, value string, updatedBy string) error` at storage.go:133\nStep 2: MockStorageProvider defines SetConfig with signature `(ctx context.Context, key string, value interface{}) error` at config_test.go:289\nStep 3: Parameter mismatch: interface expects 4 parameters (ctx, key, value, updatedBy) but mock has 3 parameters (ctx, key, value)\nStep 4: Type mismatch: interface expects `value string` but mock accepts `value interface{}`\nStep 5: Return type mismatch for GetConfig: interface expects `(*ConfigEntry, error)` but mock returns `(interface{}, error)` at config_test.go:294-297\nStep 6: Missing methods: MockStorageProvider lacks ListConfigs(ctx) ([]*ConfigEntry, error) and DeleteConfig(ctx, key string) error required by interface at storage.go:135-136", - "file_path": "control-plane/internal/handlers/ui/config_test.go", - "id": "f_004", - "line_end": 297, - "line_start": 289, - "score": 1.2, - "severity": "critical", - "suggestion": "Update the MockStorageProvider in config_test.go to match the new interface signatures:\n1. Change `SetConfig(ctx context.Context, key string, value interface{}) error` to `SetConfig(ctx context.Context, key string, value string, updatedBy string) error`\n2. Change `GetConfig(ctx context.Context, key string) (interface{}, error)` to `GetConfig(ctx context.Context, key string) (*storage.ConfigEntry, error)`\n3. Add `ListConfigs(ctx context.Context) ([]*storage.ConfigEntry, error)` method\n4. 
Add `DeleteConfig(ctx context.Context, key string) error` method", - "tags": [ - "compile-error", - "interface-mismatch", - "test-mock", - "breaking-change" - ], - "title": "MockStorageProvider.SetConfig/GetConfig have obsolete signatures - interface mismatch" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `MockStorageProvider` in `execute_test.go` has obsolete method signatures for `SetConfig` and `GetConfig` that do not match the updated `StorageProvider` interface. The interface was changed in `storage.go` to use `string` parameters and `*ConfigEntry` return types, plus added `ListConfigs` and `DeleteConfig` methods. The mock still uses the old `interface{}` signatures and lacks the new methods entirely.\n\n**Interface signature (storage.go:132-136):**\n```go\nSetConfig(ctx context.Context, key string, value string, updatedBy string) error\nGetConfig(ctx context.Context, key string) (*ConfigEntry, error)\nListConfigs(ctx context.Context) ([]*ConfigEntry, error)\nDeleteConfig(ctx context.Context, key string) error\n```\n\n**Mock signature (execute_test.go:173-178):**\n```go\nSetConfig(ctx context.Context, key string, value interface{}) error // WRONG: missing updatedBy, wrong type\nGetConfig(ctx context.Context, key string) (interface{}, error) // WRONG: wrong return type\n// ListConfigs - MISSING entirely\n// DeleteConfig - MISSING entirely\n```\n\nThis is a **compile-breaking issue**. 
Go's strict interface satisfaction rules mean `MockStorageProvider` no longer implements `StorageProvider`, causing build failures.", - "confidence": 1, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "storage-provider-interface-mismatch", - "dimension_name": "StorageProvider Interface Implementation Verification", - "evidence": "Step 1: StorageProvider interface defines SetConfig with signature `(ctx context.Context, key string, value string, updatedBy string) error` at storage.go:133\nStep 2: MockStorageProvider defines SetConfig with signature `(ctx context.Context, key string, value interface{}) error` at execute_test.go:173\nStep 3: Parameter mismatch: interface expects 4 parameters (ctx, key, value, updatedBy) but mock has 3 parameters (ctx, key, value)\nStep 4: Type mismatch: interface expects `value string` but mock accepts `value interface{}`\nStep 5: Return type mismatch for GetConfig: interface expects `(*ConfigEntry, error)` but mock returns `(interface{}, error)` at execute_test.go:176-178\nStep 6: Missing methods: MockStorageProvider lacks ListConfigs(ctx) ([]*ConfigEntry, error) and DeleteConfig(ctx, key string) error required by interface at storage.go:135-136", - "file_path": "control-plane/internal/handlers/execute_test.go", - "id": "f_005", - "line_end": 178, - "line_start": 173, - "score": 1.2, - "severity": "critical", - "suggestion": "Update the MockStorageProvider in execute_test.go to match the new interface signatures:\n1. Change `SetConfig(ctx context.Context, key string, value interface{}) error` to `SetConfig(ctx context.Context, key string, value string, updatedBy string) error`\n2. Change `GetConfig(ctx context.Context, key string) (interface{}, error)` to `GetConfig(ctx context.Context, key string) (*storage.ConfigEntry, error)`\n3. Add `ListConfigs(ctx context.Context) ([]*storage.ConfigEntry, error)` method\n4. 
Add `DeleteConfig(ctx context.Context, key string) error` method", - "tags": [ - "compile-error", - "interface-mismatch", - "test-mock", - "breaking-change" - ], - "title": "MockStorageProvider.SetConfig/GetConfig have obsolete signatures - interface mismatch" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `MockStorageProvider` in `config_test.go` is missing the two new configuration methods added to the `StorageProvider` interface: `ListConfigs` and `DeleteConfig`. These were added as part of the database-backed configuration storage feature in the PR.\n\n**Required by interface (storage.go:135-136):**\n```go\nListConfigs(ctx context.Context) ([]*ConfigEntry, error)\nDeleteConfig(ctx context.Context, key string) error\n```\n\n**Current state:** Neither method exists in MockStorageProvider\n\nThis causes the mock to fail to implement the interface, resulting in a compile error.", - "confidence": 1, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "storage-provider-interface-mismatch", - "dimension_name": "StorageProvider Interface Implementation Verification", - "evidence": "Step 1: StorageProvider interface at storage.go:40 defines four configuration methods at lines 133-136\nStep 2: MockStorageProvider at config_test.go:25 only implements SetConfig and GetConfig at lines 289-297\nStep 3: ListConfigs method is NOT present in the mock (grep found no match)\nStep 4: DeleteConfig method is NOT present in the mock (grep found no match)\nStep 5: Go compiler will report: 'MockStorageProvider does not implement StorageProvider (missing ListConfigs method)' and similar for DeleteConfig", - "file_path": "control-plane/internal/handlers/ui/config_test.go", - "id": "f_006", - "line_end": 30, - "line_start": 25, - "score": 1.2, - "severity": "critical", - "suggestion": "Add the missing methods to MockStorageProvider:\n\n```go\nfunc (m *MockStorageProvider) ListConfigs(ctx context.Context) ([]*storage.ConfigEntry, error) {\n args := 
m.Called(ctx)\n if args.Get(0) == nil {\n return nil, args.Error(1)\n }\n return args.Get(0).([]*storage.ConfigEntry), args.Error(1)\n}\n\nfunc (m *MockStorageProvider) DeleteConfig(ctx context.Context, key string) error {\n args := m.Called(ctx, key)\n return args.Error(0)\n}\n```", - "tags": [ - "compile-error", - "interface-mismatch", - "test-mock", - "missing-methods" - ], - "title": "MockStorageProvider missing ListConfigs and DeleteConfig methods" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `MockStorageProvider` in `execute_test.go` is missing the two new configuration methods added to the `StorageProvider` interface: `ListConfigs` and `DeleteConfig`. These were added as part of the database-backed configuration storage feature in the PR.\n\n**Required by interface (storage.go:135-136):**\n```go\nListConfigs(ctx context.Context) ([]*ConfigEntry, error)\nDeleteConfig(ctx context.Context, key string) error\n```\n\n**Current state:** Neither method exists in MockStorageProvider\n\nThis causes the mock to fail to implement the interface, resulting in a compile error.", - "confidence": 1, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "storage-provider-interface-mismatch", - "dimension_name": "StorageProvider Interface Implementation Verification", - "evidence": "Step 1: StorageProvider interface at storage.go:40 defines four configuration methods at lines 133-136\nStep 2: MockStorageProvider at execute_test.go:22 only implements SetConfig and GetConfig at lines 173-178\nStep 3: ListConfigs method is NOT present in the mock (grep found no match)\nStep 4: DeleteConfig method is NOT present in the mock (grep found no match)\nStep 5: Go compiler will report: 'MockStorageProvider does not implement StorageProvider (missing ListConfigs method)' and similar for DeleteConfig", - "file_path": "control-plane/internal/handlers/execute_test.go", - "id": "f_007", - "line_end": 25, - "line_start": 22, - "score": 1.2, - "severity": 
"critical", - "suggestion": "Add the missing methods to MockStorageProvider:\n\n```go\nfunc (m *MockStorageProvider) ListConfigs(ctx context.Context) ([]*storage.ConfigEntry, error) {\n return nil, nil\n}\n\nfunc (m *MockStorageProvider) DeleteConfig(ctx context.Context, key string) error {\n return nil\n}\n```", - "tags": [ - "compile-error", - "interface-mismatch", - "test-mock", - "missing-methods" - ], - "title": "MockStorageProvider missing ListConfigs and DeleteConfig methods" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `version` column is auto-incremented during upsert operations but there's no database-level constraint or application-level check to prevent lost updates. When two admins simultaneously update the same config key via `PUT /api/v1/configs/:key`, the second write will overwrite the first without any warning or conflict detection.\n\nThe storage implementation at `local.go:5129-5160` uses `ON CONFLICT DO UPDATE` with `version = config_storage.version + 1`, which is atomic but doesn't validate that the admin read the latest version before updating. This means:\n\n1. Admin A reads config version 5\n2. Admin B reads config version 5\n3. Admin A saves \u2192 version becomes 6\n4. 
Admin B saves \u2192 version becomes 7 (silently overwriting Admin A's changes)\n\n**Impact**: Configuration changes can be silently lost in multi-admin environments, potentially causing production misconfiguration.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "cluster_4", - "dimension_name": "Coverage Gap - Database Migration", - "evidence": "Step 1: Migration defines `version INTEGER NOT NULL DEFAULT 1` (line 7)\nStep 2: GORM model marks `Version int` with `not null;default:1` tag (models.go:483)\nStep 3: SetConfig() uses upsert: `version = config_storage.version + 1` (local.go:5143,5156)\nStep 4: No version check in WHERE clause or BEFORE UPDATE trigger to validate expected version\nStep 5: ConfigStorageHandlers.SetConfig() accepts no version parameter (config_storage.go:67-100)", - "file_path": "control-plane/migrations/028_create_config_storage.sql", - "id": "f_017", - "line_end": 21, - "line_start": 1, - "score": 1.14, - "severity": "critical", - "suggestion": "Add optimistic locking by either:\n1. **Preferred**: Add `expected_version` parameter to PUT endpoint and fail with 409 Conflict if current version != expected\n2. Alternative: Add timestamp-based conflict detection using `updated_at`\n3. Add application-level check in SetConfig: `UPDATE config_storage SET ... WHERE key = ? 
AND version = ?` then check RowsAffected", - "tags": [ - "concurrency", - "data-loss", - "api-design", - "migration" - ], - "title": "Version field lacks optimistic locking - concurrent updates cause silent data loss" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `SetConfig` method implements versioning without optimistic locking, causing **silent data loss** when concurrent updates occur.\n\n**The Problem:**\n- Admin A reads config at version 1\n- Admin B reads config at version 1\n- Both admins modify different parts of the config\n- Both call `SetConfig` with their changes\n- Both execute `ON CONFLICT (key) DO UPDATE SET version = config_storage.version + 1`\n- Both result in version = 2\n- **Admin A's changes are silently lost** with no error or warning\n\n**Why this is critical:**\nIn production environments with multiple admins or automated systems updating config, concurrent modifications will result in last-write-wins behavior that loses intermediate changes. The version field provides an **audit trail illusion** - it looks like versioning is working but actually provides no conflict detection.\n\n**Code analysis:**\n```go\nON CONFLICT (key) DO UPDATE SET\n value = EXCLUDED.value,\n version = config_storage.version + 1, // <-- No WHERE clause checking expected version!\n updated_by = EXCLUDED.updated_by,\n updated_at = EXCLUDED.updated_at\n```\n\nThis is different from proper optimistic locking which would use:\n```sql\nUPDATE config_storage SET value = ?, version = version + 1 WHERE key = ? 
AND version = ?\n```", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "cluster_3", - "dimension_name": "storage layer - ConfigStorageModel versioning and SetConfig implementation", - "evidence": "Step 1: Two admins (A and B) both call `GET /api/v1/configs/agentfield.yaml` and receive version=1\nStep 2: Admin A modifies port setting, calls `PUT /api/v1/configs/agentfield.yaml` - succeeds, version becomes 2\nStep 3: Admin B modifies log level, calls `PUT` with payload based on version=1 they read earlier\nStep 4: In local.go:5137-5161, the SQL executes `ON CONFLICT...version + 1` without checking if the update is based on current version\nStep 5: Admin B's update succeeds (version becomes 2), but **Admin A's port change is silently overwritten**\nStep 6: No error is returned - the data loss is undetected", - "file_path": "control-plane/internal/storage/local.go", - "id": "f_022", - "line_end": 5161, - "line_start": 5129, - "score": 1.14, - "severity": "critical", - "suggestion": "Implement proper optimistic locking by:\n1. Adding an optional `expectedVersion` parameter to `SetConfig`\n2. Using a transaction with SELECT FOR UPDATE to read current version\n3. Only updating if current version matches expected version\n4. Returning a specific error (e.g., `ErrConfigVersionConflict`) when versions don't match\n5. Updating the handler to accept `If-Match` header with expected version and return 409 Conflict on mismatch", - "tags": [ - "concurrency", - "data-loss", - "optimistic-locking", - "versioning" - ], - "title": "VERSIONING WITHOUT OPTIMISTIC LOCKING: Concurrent updates cause silent data loss" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The config storage routes at /api/v1/configs/* are registered directly on agentAPI without any authentication middleware, despite the comment claiming they are 'admin-authenticated'. 
The vulnerability: (1) Line 1552-1553 registers config handlers on agentAPI without authentication. (2) The global APIKeyAuth middleware (line 881) is a no-op when no API key is configured (default state). (3) The AdminTokenAuth middleware used for other admin routes (line 1533) is NOT applied to config routes. (4) This leaves all config CRUD operations (list, get, set, delete, reload) exposed to unauthenticated requests. Attack scenario: Attacker calls GET /api/v1/configs to dump all configuration including secrets. Attacker calls PUT /api/v1/configs/agentfield.yaml with malicious config to modify server behavior. Attacker calls POST /api/v1/configs/reload to trigger immediate config reload. Server loads attacker-controlled configuration on next restart or reload. Impact: Full configuration compromise including admin tokens, storage credentials, DID settings, and feature toggles. This is a complete system compromise vector.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "cluster_0", - "dimension_name": "Coverage Gap Review - agentfield.yaml config_management capability", - "evidence": "Step 1: server.go:1552-1553 registers config handlers on agentAPI without auth middleware. Step 2: agentfield.yaml has no api.auth.api_key set, so APIKeyAuth is no-op (middleware/auth.go:26-28). Step 3: Other admin routes (lines 1532-1548) use AdminTokenAuth but config routes do not. Step 4: config_storage.go:26-31 exposes PUT/DELETE/POST endpoints for config modification. Step 5: Attacker can modify config without any authentication credentials.", - "file_path": "control-plane/internal/server/server.go", - "id": "f_026", - "line_end": 1555, - "line_start": 1550, - "score": 1.14, - "severity": "critical", - "suggestion": "Apply authentication middleware to config storage routes. Move config routes under adminGroup (line 1532) to inherit AdminTokenAuth, or add explicit AdminTokenAuth middleware to the config routes group. 
Example fix: Create a configGroup with agentAPI.Group('') and apply middleware.AdminTokenAuth(s.config.Features.DID.Authorization.AdminToken) before registering routes.", - "tags": [ - "security", - "authentication", - "authorization", - "configuration", - "critical" - ], - "title": "Config storage admin routes exposed without authentication" - }, - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "The `NodeHealth` struct has 5 fields (CheckInterval, CheckTimeout, ConsecutiveFailures, RecoveryDebounce, HeartbeatStaleThreshold), but `mergeDBConfig()` only handles `CheckInterval`. All other NodeHealth fields from DB config are silently ignored.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "config-merge-completeness", - "dimension_name": "Config Merge Completeness and Maintainability", - "evidence": "config.go:54-59 defines NodeHealthConfig with 5 fields. config_db.go:59-61 only checks `dbCfg.AgentField.NodeHealth.CheckInterval != 0`. Other fields have no corresponding merge logic.", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_010", - "line_end": 61, - "line_start": 59, - "score": 1.037, - "severity": "important", - "suggestion": "Add merge logic for all NodeHealth fields: CheckTimeout, ConsecutiveFailures, RecoveryDebounce, and HeartbeatStaleThreshold. Consider replacing the entire NodeHealth struct when any field is set, similar to how Approval and DID are handled.", - "tags": [ - "config", - "incomplete-merge" - ], - "title": "Incomplete NodeHealth Merge - Only CheckInterval Is Handled" - }, - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "The `mergeDBConfig` function at `config_db.go:54-103` selectively merges only specific known config fields from the database, leaving many fields unhandled. 
This creates a **maintenance hazard** where any new fields added to the `Config` struct will silently be ignored when loading from DB, causing confusion and incomplete configuration application.\n\n**Missing fields NOT merged from DB (partial list):**\n- `AgentFieldConfig.ExecutionQueue` (lines 39, 71-78 in config.go) - Agent call timeout, webhook settings\n- `NodeHealthConfig.CheckTimeout` (line 55) - Health check timeout\n- `NodeHealthConfig.ConsecutiveFailures` (line 56) - Failure threshold\n- `NodeHealthConfig.RecoveryDebounce` (line 57) - Recovery debounce\n- `NodeHealthConfig.HeartbeatStaleThreshold` (line 58) - Staleness threshold\n- `Features.DID.Authorization` (lines 111-135) - DID auth settings, admin tokens, access policies\n- `Features.DID.VCRequirements` (lines 171-179) - VC generation requirements\n- `Features.DID.Keystore` (lines 182-189) - Keystore configuration\n- `API.Auth` (lines 207-212) - API authentication settings\n- `UI.Enabled` (line 27) - UI enabled/disabled flag\n- `UI.SourcePath`, `UI.DistPath`, `UI.DevPort` (lines 29-31) - UI paths and dev port\n\n**Impact:** Users storing config in DB may set values like `execution_queue.agent_call_timeout` or `features.did.authorization.enabled`, but these will be silently ignored. 
The server continues running with incomplete config, making this a subtle bug that only manifests in production behavior differences.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "partial-config-merge", - "dimension_name": "Partial Config Merge Maintenance Hazard", - "evidence": "Step 1: Config struct defines AgentField.ExecutionQueue at config.go:39,72-78 with fields: AgentCallTimeout, WebhookTimeout, WebhookMaxAttempts, WebhookRetryBackoff, WebhookMaxRetryBackoff.\nStep 2: mergeDBConfig (config_db.go:54-103) checks AgentField.Port, NodeHealth, ExecutionCleanup, Approval, Features.DID (partially), API.CORS, UI.\nStep 3: ExecutionQueue is never referenced in mergeDBConfig - all queue settings are silently ignored when loading from DB.\nStep 4: This means webhook timeouts and agent call timeouts set via DB config API will have no effect.", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_013", - "line_end": 103, - "line_start": 54, - "score": 1.037, - "severity": "important", - "suggestion": "1. Add comprehensive handling for all current Config struct fields, OR\n2. Implement a reflection-based merge that uses struct tags to determine which fields should be merged (with explicit 'security' or 'nosync' tags to exclude sensitive fields), OR\n3. 
At minimum, add documentation comments listing all unhandled fields and a TODO/FIXME comment explaining that new fields must be manually added here\n\nRecommended approach: Add a struct tag like `merge:\"true\"` to fields that should be synced from DB, then use reflection to automatically merge those fields while preserving security-sensitive ones.", - "tags": [ - "config", - "database", - "maintenance-hazard", - "silent-failure", - "incomplete-implementation" - ], - "title": "Missing Config Fields in mergeDBConfig Creates Silent Failures" - }, - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "The DIDConfig struct has 8 fields (Enabled, Method, KeyAlgorithm, DerivationMethod, KeyRotationDays, VCRequirements, Keystore, Authorization), but `mergeDBConfig()` only checks if `Method != \"\"` and then replaces the entire struct. This means:\n1. If DB only sets `Enabled: false` without Method, the entire DID config is ignored\n2. Individual DID field updates from DB are not supported - it's all-or-nothing based on Method\n3. VCRequirements, Keystore, and Authorization sub-configs from DB are never applied", - "confidence": 0.9, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "config-merge-completeness", - "dimension_name": "Config Merge Completeness and Maintainability", - "evidence": "config.go:100-109 defines DIDConfig with 8 fields. config_db.go:87-89 only checks `dbCfg.Features.DID.Method != \"\"` before replacing entire struct. No handling for VCRequirements (lines 171-179), Keystore (lines 182-189), or Authorization (lines 112-135).", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_011", - "line_end": 89, - "line_start": 87, - "score": 0.983, - "severity": "important", - "suggestion": "Either handle DIDConfig fields individually (like ExecutionCleanup) or check for any non-zero DID field before replacing the struct. 
Ensure sub-structs (VCRequirements, Keystore, Authorization) are also considered.", - "tags": [ - "config", - "incomplete-merge" - ], - "title": "DIDConfig Merge Only Checks Method Field - Other DID Settings Ignored" - }, - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "The CORSConfig struct has 5 fields, but `mergeDBConfig()` only checks `AllowedOrigins`. If the DB config specifies `AllowedMethods`, `AllowedHeaders`, `ExposedHeaders`, or `AllowCredentials` without `AllowedOrigins`, those settings are silently ignored.", - "confidence": 0.9, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "config-merge-completeness", - "dimension_name": "Config Merge Completeness and Maintainability", - "evidence": "config.go:198-204 defines CORSConfig with 5 fields (AllowedOrigins, AllowedMethods, AllowedHeaders, ExposedHeaders, AllowCredentials). config_db.go:95-97 only checks `len(dbCfg.API.CORS.AllowedOrigins) > 0`.", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_012", - "line_end": 97, - "line_start": 95, - "score": 0.983, - "severity": "important", - "suggestion": "Expand the condition to check for any non-zero CORS field: `len(dbCfg.API.CORS.AllowedOrigins) > 0 || len(dbCfg.API.CORS.AllowedMethods) > 0 || ...` or check each field individually.", - "tags": [ - "config", - "incomplete-merge" - ], - "title": "CORSConfig Partial Merge - Only AllowedOrigins Is Checked" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `mergeDBConfig()` function only handles a subset of configuration fields, causing **silent data loss** when config is loaded from the database. 
Users storing complete config in the DB will find that most fields are ignored without warning.\n\n**Fields that ARE merged (minimal subset):**\n- `AgentField.Port`\n- `AgentField.NodeHealth.CheckInterval` (only this one field - other NodeHealth fields ignored)\n- `AgentField.ExecutionCleanup` (all 6 fields merged individually)\n- `AgentField.Approval` (both fields)\n- `Features.DID.Method` (entire struct replaced if Method is set)\n- `API.CORS` (only if AllowedOrigins has items)\n- `UI` (entire struct replaced if Mode is set)\n\n**Fields NOT merged from DB (will be silently ignored):**\n\n**ExecutionQueueConfig (lines 72-78 in config.go):**\n- `AgentField.ExecutionQueue.AgentCallTimeout`\n- `AgentField.ExecutionQueue.WebhookTimeout`\n- `AgentField.ExecutionQueue.WebhookMaxAttempts`\n- `AgentField.ExecutionQueue.WebhookRetryBackoff`\n- `AgentField.ExecutionQueue.WebhookMaxRetryBackoff`\n\n**NodeHealthConfig (lines 54-59 in config.go):**\n- `AgentField.NodeHealth.CheckTimeout`\n- `AgentField.NodeHealth.ConsecutiveFailures`\n- `AgentField.NodeHealth.RecoveryDebounce`\n- `AgentField.NodeHealth.HeartbeatStaleThreshold`\n\n**DIDConfig (lines 100-109 in config.go):**\n- `Features.DID.Enabled`\n- `Features.DID.KeyAlgorithm`\n- `Features.DID.DerivationMethod`\n- `Features.DID.KeyRotationDays`\n\n**VCRequirements (lines 171-179 in config.go):**\n- `Features.DID.VCRequirements.RequireVCForRegistration`\n- `Features.DID.VCRequirements.RequireVCForExecution`\n- `Features.DID.VCRequirements.RequireVCForCrossAgent`\n- `Features.DID.VCRequirements.StoreInputOutput`\n- `Features.DID.VCRequirements.HashSensitiveData`\n- `Features.DID.VCRequirements.PersistExecutionVC`\n- `Features.DID.VCRequirements.StorageMode`\n\n**KeystoreConfig (lines 182-189 in config.go):**\n- `Features.DID.Keystore.Type`\n- `Features.DID.Keystore.Path`\n- `Features.DID.Keystore.Encryption`\n- `Features.DID.Keystore.EncryptionPassphrase`\n- `Features.DID.Keystore.BackupEnabled`\n- 
`Features.DID.Keystore.BackupInterval`\n\n**AuthorizationConfig (lines 112-135 in config.go):**\n- `Features.DID.Authorization.Enabled`\n- `Features.DID.Authorization.DIDAuthEnabled`\n- `Features.DID.Authorization.Domain`\n- `Features.DID.Authorization.TimestampWindowSeconds`\n- `Features.DID.Authorization.DefaultApprovalDurationHours`\n- `Features.DID.Authorization.AdminToken`\n- `Features.DID.Authorization.InternalToken`\n- `Features.DID.Authorization.TagApprovalRules` (all subfields)\n- `Features.DID.Authorization.AccessPolicies` (all subfields)\n\n**CORSConfig partial (lines 198-204 in config.go):**\n- `API.CORS.AllowedMethods` (not merged even if DB has values)\n- `API.CORS.AllowedHeaders` (not merged even if DB has values)\n- `API.CORS.ExposedHeaders` (not merged even if DB has values)\n- `API.CORS.AllowCredentials` (not merged even if DB has values)\n\nThis is a **semantic drift hazard** - future developers adding new config fields will almost certainly forget to update `mergeDBConfig()`, causing silent failures where DB config values are ignored.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "config-merge-completeness", - "dimension_name": "Config Merge Completeness and Maintainability", - "evidence": "mergeDBConfig() at config_db.go:54-102 only has merge logic for:\n- AgentField.Port (line 56-58)\n- AgentField.NodeHealth.CheckInterval (line 59-61)\n- AgentField.ExecutionCleanup.* (lines 63-81)\n- AgentField.Approval (lines 82-84)\n- Features.DID.Method (lines 87-89)\n- API.CORS.AllowedOrigins (lines 95-97)\n- UI.Mode (lines 100-102)\n\nconfig.go shows many additional fields in AgentFieldConfig (ExecutionQueue), DIDConfig (Enabled, KeyAlgorithm, DerivationMethod, KeyRotationDays, VCRequirements, Keystore, Authorization), and CORSConfig (AllowedMethods, AllowedHeaders, ExposedHeaders, AllowCredentials) that have no corresponding merge logic.", - "file_path": "control-plane/internal/server/config_db.go", - "id": 
"f_008", - "line_end": 102, - "line_start": 54, - "score": 0.798, - "severity": "important", - "suggestion": "Replace the manual field-by-field merge with a generic deep-merge approach using reflection or a library like `mergo`. Alternatively, use a whitelist approach with explicit validation that fails if unknown fields are present in the DB config. At minimum, add a comment at the top of Config struct in config.go warning developers that new fields must be added to mergeDBConfig().", - "tags": [ - "config", - "maintainability", - "silent-failure", - "data-loss" - ], - "title": "Partial Config Merge - Many Config Fields Silently Ignored from DB" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `SetConfig` handler at `control-plane/internal/handlers/config_storage.go:67-78` accepts raw YAML via `io.ReadAll()` and stores it directly to the database without any validation. Only basic checks are performed (empty body at line 75-77), but **no YAML syntax validation** or **schema validation** occurs.\n\n**The Attack Scenario:**\n1. Attacker with API access calls `PUT /api/v1/configs/agentfield.yaml` with malformed YAML (e.g., invalid indentation, invalid types, or non-existent fields)\n2. Handler accepts and stores it successfully (line 85: `h.storage.SetConfig()`)\n3. Server continues running normally with current config\n4. On next restart with `AGENTFIELD_CONFIG_SOURCE=db`, `overlayDBConfig()` attempts to parse the invalid YAML at `config_db.go:37`\n5. `yaml.Unmarshal()` fails, returning an error\n6. At `server.go:109-110`, this error only prints a warning and the server continues with file/env config\n7. 
**Result**: Expected DB config is silently ignored, potentially causing production downtime or configuration drift\n\n**Why This Matters:**\n- In production environments using `AGENTFIELD_CONFIG_SOURCE=db`, operators expect the database to be the source of truth\n- Invalid config only surfaces during restart, which may be delayed hours/days after the bad config was stored\n- The silent fallback to file config can mask critical misconfigurations and cause cluster inconsistency", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "config-validation-gap", - "dimension_name": "Config Storage Validation Gap", - "evidence": "Step 1: Client calls `PUT /api/v1/configs/:key` endpoint at `config_storage.go:67`\nStep 2: Handler reads body at line 70: `body, err := io.ReadAll(c.Request.Body)`\nStep 3: Handler only checks `len(body) == 0` at lines 75-77 - no YAML validation\nStep 4: Handler stores raw body to DB at line 85: `h.storage.SetConfig(c.Request.Context(), key, string(body), updatedBy)`\nStep 5: On server restart with `AGENTFIELD_CONFIG_SOURCE=db`, `NewAgentFieldServer()` calls `overlayDBConfig(cfg, storageProvider)` at `server.go:108-109`\nStep 6: `overlayDBConfig()` calls `yaml.Unmarshal([]byte(entry.Value), &dbCfg)` at `config_db.go:37`\nStep 7: If YAML is malformed, error is returned: `fmt.Errorf(\"failed to parse database config YAML: %w\", err)`\nStep 8: At `server.go:109-110`, error is only logged as warning: `fmt.Printf(\"Warning: failed to load config from database: %v\\n\", err)`\nStep 9: Server continues startup with potentially stale file/env config instead of expected DB config", - "file_path": "control-plane/internal/handlers/config_storage.go", - "id": "f_016", - "line_end": 78, - "line_start": 67, - "score": 0.798, - "severity": "important", - "suggestion": "Add YAML validation in `SetConfig` handler before storing to database:\n\n1. 
**Immediate fix**: After reading body at line 70, validate it's valid YAML:\n```go\n// Validate YAML syntax\nvar yamlTest map[string]interface{}\nif err := yaml.Unmarshal(body, &yamlTest); err != nil {\n c.JSON(http.StatusBadRequest, gin.H{\"error\": \"invalid YAML syntax\", \"details\": err.Error()})\n return\n}\n```\n\n2. **Stronger validation**: Parse into actual Config struct to catch type mismatches:\n```go\nvar cfgTest config.Config\nif err := yaml.Unmarshal(body, &cfgTest); err != nil {\n c.JSON(http.StatusBadRequest, gin.H{\"error\": \"invalid config schema\", \"details\": err.Error()})\n return\n}\n```\n\n3. **Consider dry-run reload**: If `reloadFn` is available, attempt a config reload with the new YAML before persisting to catch runtime issues.", - "tags": [ - "validation", - "yaml", - "config", - "security", - "availability" - ], - "title": "SetConfig accepts invalid YAML without validation, causing delayed startup failures" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The migration sets `DEFAULT NOW()` for both `created_at` and `updated_at`, but lacks a database-level trigger to automatically update `updated_at` on row modification. While the Go implementation in `local.go` explicitly sets `updated_at` during upserts, this creates a risk for:\n\n1. Direct database updates via SQL console or admin tools won't update the timestamp\n2. Future code that uses GORM's generic Update() instead of the custom SetConfig() will fail to update the timestamp\n3. 
Data migration scripts or external tools won't maintain audit trail accuracy\n\n**Related risk**: The GORM model uses `autoUpdateTime` tag (models.go:487) which GORM handles automatically, but the storage layer bypasses GORM with raw SQL, creating inconsistency in behavior.", - "confidence": 0.85, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "cluster_4", - "dimension_name": "Coverage Gap - Database Migration", - "evidence": "Step 1: Migration line 11: `updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()` - only sets on INSERT\nStep 2: No `ON UPDATE` trigger or `GENERATED ALWAYS AS` clause present\nStep 3: GORM model line 487 uses `autoUpdateTime` but storage implementation bypasses GORM\nStep 4: local.go:5138-5160 uses raw SQL upsert which manually sets updated_at\nStep 5: If someone uses GORM db.Save(&model) directly, updated_at won't update due to schema limitation", - "file_path": "control-plane/migrations/028_create_config_storage.sql", - "id": "f_018", - "line_end": 11, - "line_start": 10, - "score": 0.714, - "severity": "important", - "suggestion": "Add database-level trigger to auto-update `updated_at` on any row modification:\n```sql\nCREATE OR REPLACE FUNCTION update_updated_at_column()\nRETURNS TRIGGER AS $$\nBEGIN\n NEW.updated_at = NOW();\n RETURN NEW;\nEND;\n$$ language 'plpgsql';\n\nCREATE TRIGGER update_config_storage_updated_at\n BEFORE UPDATE ON config_storage\n FOR EACH ROW\n EXECUTE FUNCTION update_updated_at_column();\n```", - "tags": [ - "data-integrity", - "audit-trail", - "schema-design" - ], - "title": "Missing ON UPDATE trigger for updated_at timestamp" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The config_management capability is added with enabled: true and read_only: false by default. This creates a privilege escalation risk if the connector token is compromised. The risk: (1) Connector routes (server.go:1558-1578) allow config management via connector token. 
(2) The connector token is a single shared secret stored in config (line 132: token: test-connector-token-123). (3) If an attacker obtains the connector token (via log leak, config exposure, etc.), they can modify configuration via /api/v1/connector/configs/* routes, change security settings, disable auth, redirect storage, and escalate from connector access to full control plane compromise. Current protections: config_db.go intentionally skips merging connector config from DB (good), but attacker can still modify OTHER critical sections (DID auth, storage, features). The connector is designed for SaaS integration with limited scope, but config_management gives it effectively full control over the control plane configuration. This violates the principle of least privilege.", - "confidence": 0.85, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "cluster_0", - "dimension_name": "Coverage Gap Review - agentfield.yaml config_management capability", - "evidence": "Step 1: agentfield.yaml:149-151 sets config_management enabled=true, read_only=false. Step 2: server.go:1560 applies ConnectorTokenAuth to connector routes. Step 3: server.go:1574 applies ConnectorCapabilityCheck middleware. Step 4: config_storage.go:26-31 exposes full CRUD via RegisterRoutes. Step 5: Compromised connector token leads to ability to modify any config except connector section.", - "file_path": "control-plane/config/agentfield.yaml", - "id": "f_027", - "line_end": 151, - "line_start": 149, - "score": 0.714, - "severity": "important", - "suggestion": "Change the default to enabled: false or at minimum read_only: true. Example: config_management: enabled: false (users must explicitly enable after understanding risks), read_only: true (or enable but restrict to read-only by default). 
Alternatively, require explicit opt-in via environment variable for write access.", - "tags": [ - "security", - "connector", - "capabilities", - "privilege-escalation" - ], - "title": "config_management capability enabled by default with write access" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `key` column is defined as `TEXT NOT NULL UNIQUE` without any length constraint or validation pattern. While this provides flexibility, it allows insertion of extremely large keys (up to 1GB in PostgreSQL) which could cause:\n\n1. **Performance issues**: Index `idx_config_storage_key` on large TEXT values increases storage and lookup overhead\n2. **API abuse**: Malicious actors could create configs with multi-MB keys causing DoS\n3. **UI/display issues**: The web UI and logs may truncate or fail to display extremely long keys\n4. **Storage waste**: Index entries for large text consume significant disk space\n\n**Context**: The primary use case is `agentfield.yaml` as the config key (as seen in config_db.go:13), which is short and predictable. 
There's no business requirement for arbitrary-length keys.", - "confidence": 0.8, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "cluster_4", - "dimension_name": "Coverage Gap - Database Migration", - "evidence": "Step 1: Migration line 5 defines `key TEXT NOT NULL UNIQUE`\nStep 2: No CHECK constraint or length validation present\nStep 3: Index at line 14 `idx_config_storage_key` will index full TEXT values\nStep 4: config_db.go:13 shows expected key is `agentfield.yaml` (14 chars)\nStep 5: config_storage.go handlers accept arbitrary key strings from URL path", - "file_path": "control-plane/migrations/028_create_config_storage.sql", - "id": "f_019", - "line_end": 5, - "line_start": 5, - "score": 0.672, - "severity": "important", - "suggestion": "Add length constraint to key column:\n```sql\n-- Add to migration\nkey VARCHAR(255) NOT NULL UNIQUE CHECK (LENGTH(key) > 0 AND LENGTH(key) <= 255)\n```\nOr add validation at application layer in SetConfig handler before storage call.", - "tags": [ - "data-validation", - "performance", - "security", - "dos" - ], - "title": "key column uses TEXT type without length limit or validation" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `GetConfig` method at line 5186-5187 returns `nil, nil` when config is not found, using string comparison `err.Error() == \"sql: no rows in result set\"` instead of the standard `errors.Is(err, sql.ErrNoRows)`.\n\n**Issues:**\n1. **Fragile error detection**: String comparison instead of `errors.Is()` may fail with different drivers or wrapped errors\n2. **Silent failures**: The handler in `config_storage.go` calls `GetConfig` after `SetConfig` to return saved state. 
If this call returns `nil, nil` (due to race condition where config was deleted between insert and select), the handler returns 500 with misleading error even though SetConfig succeeded.\n\nThis creates the scenario mentioned in the PR context: \"Error handling inconsistency: SetConfig calls storage.SetConfig(), then immediately calls storage.GetConfig() to return saved entry. If GetConfig fails, handler returns 500 error even though config WAS saved successfully\"", - "confidence": 0.75, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "cluster_3", - "dimension_name": "storage layer - ConfigStorageModel versioning and SetConfig implementation", - "evidence": "Step 1: Handler calls `storage.SetConfig()` successfully\nStep 2: Handler immediately calls `storage.GetConfig()` at config_storage.go:91-94\nStep 3: If GetConfig returns `nil, nil` (not found), handler checks `if err != nil` only\nStep 4: Handler proceeds with `nil` entry causing nil pointer dereference or returns incorrect response\nStep 5: Client receives 500 error despite config being successfully saved", - "file_path": "control-plane/internal/storage/local.go", - "id": "f_023", - "line_end": 5191, - "line_start": 5164, - "score": 0.63, - "severity": "important", - "suggestion": "1. Use `errors.Is(err, sql.ErrNoRows)` instead of string comparison at line 5186\n2. Consider returning a typed error like `ErrConfigNotFound` for missing configs\n3. Document in the `StorageProvider` interface what callers should expect for 'not found' cases", - "tags": [ - "error-handling", - "api-contract", - "nil-safety" - ], - "title": "INCONSISTENT ERROR HANDLING: GetConfig returns nil on 'not found' but storage.go contract is unclear" - }, - { - "active_multipliers": [ - "cross_ref_compound", - "ai_generated_pr" - ], - "body": "Multiple background goroutines access `s.config` fields during server startup without any mutex protection. 
These goroutines run concurrently and can race with config reload operations.\n\n**Affected goroutines:**\n1. **healthMonitor** (line 164): Reads `cfg.AgentField.NodeHealth.*` fields at startup\n2. **statusManager** (line 144): Reads config during initialization\n3. **presenceManager** (line 155): Uses status config\n4. **webhookDispatcher** (lines 366-371): Reads `cfg.AgentField.ExecutionQueue.*`\n5. **observabilityForwarder** (lines 377-389): Reads config fields\n6. **cleanupService** (line 392): Uses `cfg.AgentField.ExecutionCleanup`\n\nIf config is reloaded via `POST /api/v1/configs/reload` while these services are running, data races occur when they read config fields that are being modified.", - "confidence": 0.7, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "thread-safety-mutex-usage", - "dimension_name": "Thread Safety - Config Reload Mutex", - "evidence": "Step 1: healthMonitor reads cfg.AgentField.NodeHealth at line 161-165\nStep 2: webhookDispatcher reads cfg.AgentField.ExecutionQueue.WebhookTimeout at line 367\nStep 3: cleanupService reads cfg.AgentField.ExecutionCleanup at line 392\nStep 4: All these goroutines start at lines 450-485 and run concurrently\nStep 5: Config reload via overlayDBConfig() modifies these same fields without synchronization", - "file_path": "control-plane/internal/server/server.go", - "id": "f_002", - "line_end": 167, - "line_start": 133, - "score": 0.378, - "severity": "suggestion", - "suggestion": "For each goroutine that reads config, wrap the config access with `s.configMu.RLock()` and `defer s.configMu.RUnlock()`. 
Alternatively, consider making config reload an atomic pointer swap rather than in-place modification.", - "tags": [ - "data-race", - "goroutines", - "config-read", - "concurrency" - ], - "title": "Important: Background goroutines read s.config without mutex protection" - }, - { - "active_multipliers": [ - "adversary_challenged", - "ai_generated_pr" - ], - "body": "While the code correctly excludes `Connector` config (token, capabilities) from DB merge with a clear security comment (lines 90-92), it also silently omits `Features.DID.Authorization` which contains equally security-sensitive fields like `AdminToken`, `InternalToken`, `AccessPolicies`, and `DIDAuthEnabled` (config.go:111-135).\n\nThe DID Authorization struct contains:\n- `AdminToken` - Separate token for admin operations\n- `InternalToken` - Used for Authorization: Bearer header to agents\n- `Domain` - Domain for did:web identifiers\n- `AccessPolicies` - Tag-based authorization policies\n\nThese fields are **not merged from DB** despite being security-relevant, but unlike the Connector exclusion, there's no explanatory comment. 
This inconsistency makes it unclear whether the omission is intentional (security) or accidental (incomplete implementation).", - "confidence": 0.85, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "partial-config-merge", - "dimension_name": "Partial Config Merge Maintenance Hazard", - "evidence": "Step 1: DIDConfig.Authorization struct at config.go:111-135 defines security-sensitive fields: AdminToken, InternalToken, AccessPolicies, DIDAuthEnabled.\nStep 2: mergeDBConfig only checks dbCfg.Features.DID.Method at line 87, then assigns entire DID struct.\nStep 3: DID.Authorization is part of DID struct but never specifically handled - it would be zeroed if only Method is set, or copied wholesale if any Method is set.\nStep 4: No security comment explains why these sensitive fields are treated differently from Connector config.", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_014", - "line_end": 92, - "line_start": 86, - "score": 0.357, - "severity": "important", - "suggestion": "Add an explicit comment explaining why DID.Authorization fields are excluded from DB merge, similar to the Connector comment:\n\n```go\n// NOTE: DID.Authorization config (admin_token, internal_token, access_policies) is\n// intentionally NOT merged from DB for security, similar to connector config.\n// Only DID.Method is merged as it affects VC generation behavior.\n```", - "tags": [ - "config", - "security", - "inconsistency", - "documentation" - ], - "title": "Inconsistent Security Field Handling - DID.Authorization Omitted Without Comment" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "There is no automated mechanism (build-time check, code generation, or test) to ensure that `mergeDBConfig()` stays synchronized with the `Config` struct definition. When new fields are added to `config.Config`, developers must manually remember to update `mergeDBConfig()` in a different file. 
This is a classic source of drift bugs.", - "confidence": 0.85, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "config-merge-completeness", - "dimension_name": "Config Merge Completeness and Maintainability", - "evidence": "mergeDBConfig() comment at line 52-53 states 'selectively merges' but provides no mechanism to ensure completeness. The function and Config struct are in separate files (config_db.go vs config.go) increasing the likelihood of drift.", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_009", - "line_end": 54, - "line_start": 52, - "score": 0.306, - "severity": "suggestion", - "suggestion": "Consider adding a build tag or go:generate directive that uses reflection to verify all exported fields in Config have corresponding merge logic. Alternatively, add a unit test that uses reflection to compare the Config struct fields against known merged fields and fails if new fields are detected without test coverage in mergeDBConfig.", - "tags": [ - "maintainability", - "automation", - "testing-gap" - ], - "title": "No Automated Sync Check Between Config Struct and Merge Function" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The function comment at lines 52-53 describes what the function does but does not warn maintainers that this function must be updated whenever new config fields are added. The field-by-field merge approach creates a **compile-time blind spot** - the code compiles successfully even when Config struct has fields not handled here.\n\nA maintainer adding a new field to `Config` struct will have no indication that they also need to add handling here unless they happen to read this file. 
This is exactly the type of issue that caused the ExecutionCleanup bug requiring the a8bfc8c fix commit.", - "confidence": 0.8, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "partial-config-merge", - "dimension_name": "Partial Config Merge Maintenance Hazard", - "evidence": "Step 1: Function comment at lines 52-53 says 'selectively merges' and 'Only non-zero/non-empty values' but gives no warning about the maintenance requirement.\nStep 2: Config struct has 15+ fields/sub-structs (config.go:17-23, 34-41, etc.).\nStep 3: mergeDBConfig handles only 7 specific field paths (Port, NodeHealth.CheckInterval, ExecutionCleanup.*, Approval, DID.Method, API.CORS, UI).\nStep 4: No compile-time or comment-based guard exists to warn when Config grows but mergeDBConfig doesn't.", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_015", - "line_end": 53, - "line_start": 52, - "score": 0.288, - "severity": "suggestion", - "suggestion": "Add a prominent TODO/FIXME comment at the top of mergeDBConfig:\n\n```go\n// TODO: This function must be updated when adding new config fields.\n// Currently missing: ExecutionQueue, NodeHealth (partial), DID.Authorization,\n// DID.VCRequirements, DID.Keystore, API.Auth, UI.Enabled, etc.\n// Consider using reflection-based merging with struct tags to avoid\n// this maintenance burden (see also: viper's automatic config merging).\n```", - "tags": [ - "config", - "documentation", - "maintenance-hazard" - ], - "title": "Missing TODO/FIXME Comment Warning About Maintenance Burden" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "In `GetConfig` (lines 5180-5184), the SQL uses `COALESCE(created_by, '')` and `COALESCE(updated_by, '')` to handle NULL values.\n\n**Issues:**\n1. **Loss of semantic meaning**: Empty string `\"\"` and NULL have different meanings - NULL means \"unknown/system\" while empty string could mean \"intentionally blank\"\n2. 
**Inconsistent with model**: `ConfigStorageModel` uses `*string` pointers for these fields indicating they can be NULL\n3. **ConfigEntry uses non-pointer**: The `ConfigEntry` struct in storage.go:30-38 uses plain `string` not `*string`, forcing the COALESCE\n\nThis makes it impossible to distinguish between \"created by system (NULL)\" and \"created by user with empty name (empty string)\".", - "confidence": 0.7, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "cluster_3", - "dimension_name": "storage layer - ConfigStorageModel versioning and SetConfig implementation", - "evidence": "storage.go:30-38 defines ConfigEntry with `CreatedBy string` and `UpdatedBy string` (no pointers)\n\nlocal.go:5180-5181 uses `COALESCE(created_by, '')` and `COALESCE(updated_by, '')` to handle NULLs because ConfigEntry can't hold NULL\n\nmodels.go:484-485 defines `CreatedBy *string` and `UpdatedBy *string` as pointers in the model", - "file_path": "control-plane/internal/storage/local.go", - "id": "f_025", - "line_end": 5184, - "line_start": 5179, - "score": 0.252, - "severity": "suggestion", - "suggestion": "Change `ConfigEntry` to use `*string` for `CreatedBy` and `UpdatedBy`:\n```go\ntype ConfigEntry struct {\n Key string `json:\"key\"`\n Value string `json:\"value\"`\n Version int `json:\"version\"`\n CreatedBy *string `json:\"created_by,omitempty\"` // Use pointer\n UpdatedBy *string `json:\"updated_by,omitempty\"` // Use pointer\n CreatedAt time.Time `json:\"created_at\"`\n UpdatedAt time.Time `json:\"updated_at\"`\n}\n```\n\nRemove COALESCE from SQL and scan directly into pointer fields.", - "tags": [ - "api-design", - "null-handling", - "audit-trail" - ], - "title": "AMBIGUOUS NULL HANDLING: COALESCE converts NULL to empty string losing audit information" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `created_by` and `updated_by` columns are defined as nullable TEXT without foreign key constraints or validation. 
This design allows arbitrary strings that may not correspond to actual users in the system, making the audit trail unreliable.\n\n**Trade-offs**: Adding FK constraints to a users table would require that table to exist and be populated, which may not be true in all deployment scenarios (e.g., API-only authentication). However, even without FK constraints, the application should validate these values against authenticated principals.", - "confidence": 0.65, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "cluster_4", - "dimension_name": "Coverage Gap - Database Migration", - "evidence": "Step 1: Migration lines 8-9: `created_by TEXT` and `updated_by TEXT` - no constraints\nStep 2: GORM model lines 484-485 uses `*string` pointers allowing NULL\nStep 3: config_storage.go:76-78 extracts `updatedBy` from context but has no validation\nStep 4: No users/agents table reference exists to validate against", - "file_path": "control-plane/migrations/028_create_config_storage.sql", - "id": "f_020", - "line_end": 9, - "line_start": 8, - "score": 0.234, - "severity": "suggestion", - "suggestion": "Consider either:\n1. Add CHECK constraint to validate format (e.g., must be valid UUID or email)\n2. Document that application layer must validate principals before storage\n3. Add comment explaining audit trail limitations for external tools", - "tags": [ - "audit-trail", - "data-integrity", - "documentation" - ], - "title": "created_by/updated_by lack referential integrity constraints" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The `ConfigStorageModel` struct defines a `key` field with `uniqueIndex` but no constraints on key format, length, or allowed characters.\n\n**Potential issues:**\n1. Empty string keys allowed (no `NOT NULL` constraint validation at struct level)\n2. No maximum length enforcement\n3. 
No validation that keys follow expected naming conventions (e.g., no path traversal characters like `../` or `..\\`)\n\nWhile the API layer may validate, defense-in-depth suggests the storage layer should also enforce constraints.", - "confidence": 0.6, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "cluster_3", - "dimension_name": "storage layer - ConfigStorageModel versioning and SetConfig implementation", - "evidence": "models.go:479-488 shows ConfigStorageModel with `gorm:\"column:key;not null;uniqueIndex\"` - the `not null` is present but there's no size limit or format validation\n\nlocal.go:5129-5161 SetConfig accepts any key string and passes directly to SQL without validation", - "file_path": "control-plane/internal/storage/models.go", - "id": "f_024", - "line_end": 490, - "line_start": 476, - "score": 0.216, - "severity": "suggestion", - "suggestion": "Add GORM validation tags and constraints:\n```go\ntype ConfigStorageModel struct {\n ID int64 `gorm:\"column:id;primaryKey;autoIncrement\"`\n Key string `gorm:\"column:key;not null;uniqueIndex;size:255\"` // Add NOT NULL and size limit\n Value string `gorm:\"column:value;type:text;not null\"`\n // ...\n}\n```\n\nConsider adding application-level validation in `SetConfig` to reject keys containing path separators or control characters.", - "tags": [ - "validation", - "data-integrity", - "security" - ], - "title": "MISSING DATABASE CONSTRAINTS: ConfigStorageModel lacks validation for key format" - }, - { - "active_multipliers": [ - "cross_ref_compound", - "ai_generated_pr" - ], - "body": "The `configMu sync.RWMutex` field is declared in the AgentFieldServer struct at line 82, but there are **zero** usages of this mutex in the entire file.\n\nSearch results for 'configMu':\n- Line 82: Declaration only\n- NO calls to configMu.Lock()\n- NO calls to configMu.Unlock()\n- NO calls to configMu.RLock()\n- NO calls to configMu.RUnlock()\n\nThe mutex was added to the struct but never actually locked or 
unlocked. This makes it completely ineffective for preventing data races.", - "confidence": 0.99, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "thread-safety-mutex-usage", - "dimension_name": "Thread Safety - Config Reload Mutex", - "evidence": "Step 1: grep for 'configMu' in server.go shows only line 82 (declaration)\nStep 2: No Lock(), Unlock(), RLock(), or RUnlock() calls found\nStep 3: The mutex exists but provides zero protection\nStep 4: This indicates incomplete implementation of the thread-safety feature", - "file_path": "control-plane/internal/server/server.go", - "id": "f_001", - "line_end": 82, - "line_start": 82, - "score": 0.178, - "severity": "nitpick", - "suggestion": "Either:\n1. Add proper mutex protection around all config reads and writes (configMu.Lock() in configReloadFn, configMu.RLock() in goroutines that read config)\n2. OR remove the unused field if config reloading isn't meant to be thread-safe\n\nRecommended approach: Add RLock() around config reads in background goroutines like healthMonitor, presenceManager, etc.", - "tags": [ - "unused-code", - "mutex", - "incomplete-implementation" - ], - "title": "Important: configMu mutex is declared but NEVER used anywhere" - } - ], - "metadata": { - "agent_invocations": 21, - "anatomy": { - "blast_radius": [], - "clusters": [ - { - "description": "", - "files": [ - "control-plane/config/agentfield.yaml" - ], - "id": "cluster_0", - "name": "control-plane/config", - "primary_language": "yaml" - }, - { - "description": "", - "files": [ - "control-plane/internal/handlers/config_storage.go" - ], - "id": "cluster_1", - "name": "control-plane/internal/handlers", - "primary_language": "go" - }, - { - "description": "", - "files": [ - "control-plane/internal/server/config_db.go", - "control-plane/internal/server/server.go", - "control-plane/internal/server/server_routes_test.go" - ], - "id": "cluster_2", - "name": "control-plane/internal/server", - "primary_language": "go" - }, - { - 
"description": "", - "files": [ - "control-plane/internal/storage/local.go", - "control-plane/internal/storage/migrations.go", - "control-plane/internal/storage/models.go", - "control-plane/internal/storage/storage.go" - ], - "id": "cluster_3", - "name": "control-plane/internal/storage", - "primary_language": "go" - }, - { - "description": "", - "files": [ - "control-plane/migrations/028_create_config_storage.sql" - ], - "id": "cluster_4", - "name": "control-plane/migrations", - "primary_language": "sql" - } - ], - "context_notes": "This is a feature PR adding database-backed config storage with 455 lines added across 10 files. The implementation follows the existing patterns in the codebase (GORM models, Gin handlers, StorageProvider interface). Key files are config_db.go (103 lines) for config loading logic, config_storage.go (140 lines) for HTTP handlers, and local.go additions for storage implementation.", - "dependency_graph": {}, - "files": [ - { - "hunks": [ - { - "content": " enabled: true\n observability_config:\n enabled: false\n+ config_management:\n+ enabled: true\n+ read_only: false", - "header": "@@ -146,3 +146,6 @@ features:", - "new_count": 6, - "new_start": 146, - "old_count": 3, - "old_start": 146 - } - ], - "language": "yaml", - "lines_added": 3, - "lines_removed": 0, - "path": "control-plane/config/agentfield.yaml", - "status": "modified" - }, - { - "hunks": [ - { - "content": "+package handlers\n+\n+import (\n+\t\"io\"\n+\t\"net/http\"\n+\n+\t\"github.com/Agent-Field/agentfield/control-plane/internal/storage\"\n+\t\"github.com/gin-gonic/gin\"\n+)\n+\n+// maxConfigBodySize is the maximum allowed size for a config body (1 MB).\n+// Prevents DoS via unbounded request body reads.\n+const maxConfigBodySize = 1 << 20 // 1 MB\n+\n+// ConfigReloadFunc is called to reload configuration from the database.\n+type ConfigReloadFunc func() error\n+\n+// ConfigStorageHandlers provides HTTP handlers for database-backed configuration.\n+type 
ConfigStorageHandlers struct {\n+\tstorage storage.StorageProvider\n+\treloadFn ConfigReloadFunc\n+}\n+\n+// NewConfigStorageHandlers creates a new ConfigStorageHandlers instance.\n+func NewConfigStorageHandlers(store storage.StorageProvider, reloadFn ConfigReloadFunc) *ConfigStorageHandlers {\n+\treturn &ConfigStorageHandlers{storage: store, reloadFn: reloadFn}\n+}\n+\n+// RegisterRoutes registers config storage routes on the given router group.\n+func (h *ConfigStorageHandlers) RegisterRoutes(group *gin.RouterGroup) {\n+\tgroup.GET(\"/configs\", h.ListConfigs)\n+\tgroup.GET(\"/configs/:key\", h.GetConfig)\n+\tgroup.PUT(\"/configs/:key\", h.SetConfig)\n+\tgroup.DELETE(\"/configs/:key\", h.DeleteConfig)\n+\tgroup.POST(\"/configs/reload\", h.ReloadConfig)\n+}\n+\n+// ListConfigs returns all stored configuration entries.\n+func (h *ConfigStorageHandlers) ListConfigs(c *gin.Context) {\n+\tentries, err := h.storage.ListConfigs(c.Request.Context())\n+\tif err != nil {\n+\t\tc.JSON(http.StatusInternalServerError, gin.H{\"error\": err.Error()})\n+\t\treturn\n+\t}\n+\tif entries == nil {\n+\t\tentries = []*storage.ConfigEntry{}\n+\t}\n+\tc.JSON(http.StatusOK, gin.H{\n+\t\t\"configs\": entries,\n+\t\t\"total\": len(entries),\n+\t})\n+}\n+\n+// GetConfig returns a specific configuration entry by key.\n+func (h *ConfigStorageHandlers) GetConfig(c *gin.Context) {\n+\tkey := c.Param(\"key\")\n+\tentry, err := h.storage.GetConfig(c.Request.Context(), key)\n+\tif err != nil {\n+\t\tc.JSON(http.StatusInternalServerError, gin.H{\"error\": err.Error()})\n+\t\treturn\n+\t}\n+\tif entry == nil {\n+\t\tc.JSON(http.StatusNotFound, gin.H{\"error\": \"config not found\", \"key\": key})\n+\t\treturn\n+\t}\n+\tc.JSON(http.StatusOK, entry)\n+}\n+\n+// SetConfig creates or updates a configuration entry.\n+// Accepts raw YAML/text body as the config value.\n+func (h *ConfigStorageHandlers) SetConfig(c *gin.Context) {\n+\tkey := c.Param(\"key\")\n+\n+\tbody, err := 
io.ReadAll(io.LimitReader(c.Request.Body, maxConfigBodySize+1))\n+\tif err != nil {\n+\t\tc.JSON(http.StatusBadRequest, gin.H{\"error\": \"failed to read request body\"})\n+\t\treturn\n+\t}\n+\tif len(body) == 0 {\n+\t\tc.JSON(http.StatusBadRequest, gin.H{\"error\": \"request body is empty\"})\n+\t\treturn\n+\t}\n+\tif len(body) > maxConfigBodySize {\n+\t\tc.JSON(http.StatusRequestEntityTooLarge, gin.H{\n+\t\t\t\"error\": \"config body exceeds maximum size\",\n+\t\t\t\"max\": maxConfigBodySize,\n+\t\t})\n+\t\treturn\n+\t}\n+\n+\tupdatedBy := c.GetHeader(\"X-Updated-By\")\n+\tif updatedBy == \"\" {\n+\t\tupdatedBy = \"api\"\n+\t}\n+\n+\tif err := h.storage.SetConfig(c.Request.Context(), key, string(body), updatedBy); err != nil {\n+\t\tc.JSON(http.StatusInternalServerError, gin.H{\"error\": err.Error()})\n+\t\treturn\n+\t}\n+\n+\t// Return the saved entry\n+\tentry, err := h.storage.GetConfig(c.Request.Context(), key)\n+\tif err != nil {\n+\t\tc.JSON(http.StatusInternalServerError, gin.H{\"error\": err.Error()})\n+\t\treturn\n+\t}\n+\n+\tc.JSON(http.StatusOK, gin.H{\n+\t\t\"message\": \"config saved\",\n+\t\t\"config\": entry,\n+\t})\n+}\n+\n+// DeleteConfig removes a configuration entry by key.\n+func (h *ConfigStorageHandlers) DeleteConfig(c *gin.Context) {\n+\tkey := c.Param(\"key\")\n+\tif err := h.storage.DeleteConfig(c.Request.Context(), key); err != nil {\n+\t\tc.JSON(http.StatusNotFound, gin.H{\"error\": err.Error()})\n+\t\treturn\n+\t}\n+\tc.JSON(http.StatusOK, gin.H{\"message\": \"config deleted\", \"key\": key})\n+}\n+\n+// ReloadConfig triggers a hot-reload of configuration from the database.\n+func (h *ConfigStorageHandlers) ReloadConfig(c *gin.Context) {\n+\tif h.reloadFn == nil {\n+\t\tc.JSON(http.StatusServiceUnavailable, gin.H{\n+\t\t\t\"error\": \"config reload not available (AGENTFIELD_CONFIG_SOURCE != db)\",\n+\t\t})\n+\t\treturn\n+\t}\n+\tif err := h.reloadFn(); err != nil {\n+\t\tc.JSON(http.StatusInternalServerError, gin.H{\n+\t\t\t\"error\": 
\"config reload failed\",\n+\t\t\t\"details\": err.Error(),\n+\t\t})\n+\t\treturn\n+\t}\n+\tc.JSON(http.StatusOK, gin.H{\"message\": \"config reloaded from database\"})\n+}", - "header": "@@ -0,0 +1,140 @@", - "new_count": 140, - "new_start": 1, - "old_count": 0, - "old_start": 0 - } - ], - "language": "go", - "lines_added": 140, - "lines_removed": 0, - "path": "control-plane/internal/handlers/config_storage.go", - "status": "added" - }, - { - "hunks": [ - { - "content": "+package server\n+\n+import (\n+\t\"context\"\n+\t\"fmt\"\n+\t\"time\"\n+\n+\t\"github.com/Agent-Field/agentfield/control-plane/internal/config\"\n+\t\"github.com/Agent-Field/agentfield/control-plane/internal/storage\"\n+\t\"gopkg.in/yaml.v3\"\n+)\n+\n+const dbConfigKey = \"agentfield.yaml\"\n+\n+// overlayDBConfig loads config from the database and merges it into the\n+// existing config. The storage section is preserved from the original config\n+// to avoid the bootstrap problem (DB connection settings can't come from DB).\n+// Precedence: env vars > DB config > file config > defaults.\n+func overlayDBConfig(cfg *config.Config, store storage.StorageProvider) error {\n+\tctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)\n+\tdefer cancel()\n+\n+\tentry, err := store.GetConfig(ctx, dbConfigKey)\n+\tif err != nil {\n+\t\treturn fmt.Errorf(\"failed to read config from database: %w\", err)\n+\t}\n+\tif entry == nil {\n+\t\tfmt.Println(\"[config] No database config found (key: agentfield.yaml), using file/env config only.\")\n+\t\treturn nil\n+\t}\n+\n+\t// Preserve the storage config \u2014 it must always come from file/env (bootstrap)\n+\tsavedStorage := cfg.Storage\n+\n+\t// Parse the DB-stored YAML into a config struct\n+\tvar dbCfg config.Config\n+\tif err := yaml.Unmarshal([]byte(entry.Value), &dbCfg); err != nil {\n+\t\treturn fmt.Errorf(\"failed to parse database config YAML: %w\", err)\n+\t}\n+\n+\t// Overlay non-zero DB values onto the existing 
config\n+\tmergeDBConfig(cfg, &dbCfg)\n+\n+\t// Restore storage config (never overridden from DB)\n+\tcfg.Storage = savedStorage\n+\n+\tfmt.Printf(\"[config] Loaded config from database (key: %s, version: %d, updated: %s)\\n\",\n+\t\tentry.Key, entry.Version, entry.UpdatedAt.Format(time.RFC3339))\n+\treturn nil\n+}\n+\n+// mergeDBConfig selectively merges DB config values into the target config.\n+// Only non-zero/non-empty values from the DB config are applied.\n+func mergeDBConfig(target, dbCfg *config.Config) {\n+\t// AgentField settings\n+\tif dbCfg.AgentField.Port != 0 {\n+\t\ttarget.AgentField.Port = dbCfg.AgentField.Port\n+\t}\n+\tif dbCfg.AgentField.NodeHealth.CheckInterval != 0 {\n+\t\ttarget.AgentField.NodeHealth = dbCfg.AgentField.NodeHealth\n+\t}\n+\t// Merge execution cleanup field-by-field to avoid zeroing out unset fields\n+\tif dbCfg.AgentField.ExecutionCleanup.RetentionPeriod != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.RetentionPeriod = dbCfg.AgentField.ExecutionCleanup.RetentionPeriod\n+\t}\n+\tif dbCfg.AgentField.ExecutionCleanup.CleanupInterval != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.CleanupInterval = dbCfg.AgentField.ExecutionCleanup.CleanupInterval\n+\t}\n+\tif dbCfg.AgentField.ExecutionCleanup.BatchSize != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.BatchSize = dbCfg.AgentField.ExecutionCleanup.BatchSize\n+\t}\n+\tif dbCfg.AgentField.ExecutionCleanup.PreserveRecentDuration != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.PreserveRecentDuration = dbCfg.AgentField.ExecutionCleanup.PreserveRecentDuration\n+\t}\n+\tif dbCfg.AgentField.ExecutionCleanup.StaleExecutionTimeout != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.StaleExecutionTimeout = dbCfg.AgentField.ExecutionCleanup.StaleExecutionTimeout\n+\t}\n+\t// Enabled is a bool \u2014 only override if cleanup config is present in DB at all\n+\tif dbCfg.AgentField.ExecutionCleanup.RetentionPeriod != 0 || dbCfg.AgentField.ExecutionCleanup.CleanupInterval != 0 
{\n+\t\ttarget.AgentField.ExecutionCleanup.Enabled = dbCfg.AgentField.ExecutionCleanup.Enabled\n+\t}\n+\tif dbCfg.AgentField.Approval.WebhookSecret != \"\" || dbCfg.AgentField.Approval.DefaultExpiryHours != 0 {\n+\t\ttarget.AgentField.Approval = dbCfg.AgentField.Approval\n+\t}\n+\n+\t// Features\n+\tif dbCfg.Features.DID.Method != \"\" {\n+\t\ttarget.Features.DID = dbCfg.Features.DID\n+\t}\n+\t// NOTE: Connector config (token, capabilities) is intentionally NOT merged\n+\t// from DB. These are security-sensitive and must come from file/env config,\n+\t// similar to how storage config is protected from the bootstrap problem.\n+\n+\t// API settings (but never override API key from DB for security)\n+\tif len(dbCfg.API.CORS.AllowedOrigins) > 0 {\n+\t\ttarget.API.CORS = dbCfg.API.CORS\n+\t}\n+\n+\t// UI settings\n+\tif dbCfg.UI.Mode != \"\" {\n+\t\ttarget.UI = dbCfg.UI\n+\t}\n+}", - "header": "@@ -0,0 +1,103 @@", - "new_count": 103, - "new_start": 1, - "old_count": 0, - "old_start": 0 - } - ], - "language": "go", - "lines_added": 103, - "lines_removed": 0, - "path": "control-plane/internal/server/config_db.go", - "status": "added" - }, - { - "hunks": [ - { - "content": " \t\"path/filepath\"\n \t\"strconv\"\n \t\"strings\"\n+\t\"sync\"\n \t\"time\"\n \n \t\"github.com/Agent-Field/agentfield/control-plane/internal/config\"", - "header": "@@ -13,6 +13,7 @@ import (", - "new_count": 7, - "new_start": 13, - "old_count": 6, - "old_start": 13 - }, - { - "content": " \tadminGRPCPort int\n \twebhookDispatcher services.WebhookDispatcher\n \tobservabilityForwarder services.ObservabilityForwarder\n+\tconfigMu sync.RWMutex\n }\n \n // NewAgentFieldServer creates a new instance of the AgentFieldServer.", - "header": "@@ -79,6 +80,7 @@ type AgentFieldServer struct {", - "new_count": 7, - "new_start": 80, - "old_count": 6, - "old_start": 79 - }, - { - "content": " \t\treturn nil, err\n \t}\n \n+\t// Overlay database-stored config if AGENTFIELD_CONFIG_SOURCE=db\n+\tif src := 
os.Getenv(\"AGENTFIELD_CONFIG_SOURCE\"); src == \"db\" {\n+\t\tif err := overlayDBConfig(cfg, storageProvider); err != nil {\n+\t\t\tfmt.Printf(\"Warning: failed to load config from database: %v\\n\", err)\n+\t\t}\n+\t}\n+\n \tRouter := gin.Default()\n \n \t// Sync installed.yaml to database for package visibility", - "header": "@@ -104,6 +106,13 @@ func NewAgentFieldServer(cfg *config.Config) (*AgentFieldServer, error) {", - "new_count": 13, - "new_start": 106, - "old_count": 6, - "old_start": 104 - }, - { - "content": " \t}, nil\n }\n \n+// configReloadFn returns a function that reloads config from the database,\n+// or nil if AGENTFIELD_CONFIG_SOURCE is not set to \"db\".\n+// The returned function acquires configMu to prevent data races with\n+// concurrent readers of s.config.\n+func (s *AgentFieldServer) configReloadFn() handlers.ConfigReloadFunc {\n+\tif src := os.Getenv(\"AGENTFIELD_CONFIG_SOURCE\"); src != \"db\" {\n+\t\treturn nil\n+\t}\n+\treturn func() error {\n+\t\ts.configMu.Lock()\n+\t\tdefer s.configMu.Unlock()\n+\t\treturn overlayDBConfig(s.config, s.storage)\n+\t}\n+}\n+\n // Start initializes and starts the AgentFieldServer.\n func (s *AgentFieldServer) Start() error {\n \t// Setup routes", - "header": "@@ -423,6 +432,21 @@ func NewAgentFieldServer(cfg *config.Config) (*AgentFieldServer, error) {", - "new_count": 21, - "new_start": 432, - "old_count": 6, - "old_start": 423 - }, - { - "content": " \t\t\tlogger.Logger.Info().Msg(\"\ud83d\udccb Authorization admin routes registered\")\n \t\t}\n \n+\t\t// Config storage routes (admin-authenticated)\n+\t\t{\n+\t\t\tconfigHandlers := handlers.NewConfigStorageHandlers(s.storage, s.configReloadFn())\n+\t\t\tconfigHandlers.RegisterRoutes(agentAPI)\n+\t\t\tlogger.Logger.Info().Msg(\"Config storage routes registered\")\n+\t\t}\n+\n \t\t// Connector routes (authenticated with separate connector token)\n \t\tif s.config.Features.Connector.Enabled && s.config.Features.Connector.Token != \"\" {\n 
\t\t\tconnectorGroup := agentAPI.Group(\"/connector\")", - "header": "@@ -1529,6 +1553,13 @@ func (s *AgentFieldServer) setupRoutes() {", - "new_count": 13, - "new_start": 1553, - "old_count": 6, - "old_start": 1529 - }, - { - "content": " \t\t\t)\n \t\t\tconnectorHandlers.RegisterRoutes(connectorGroup)\n \n+\t\t\t// Config management routes for connector\n+\t\t\tconfigGroup := connectorGroup.Group(\"\")\n+\t\t\tconfigGroup.Use(middleware.ConnectorCapabilityCheck(\"config_management\", s.config.Features.Connector.Capabilities))\n+\t\t\t{\n+\t\t\t\tconfigHandlers := handlers.NewConfigStorageHandlers(s.storage, s.configReloadFn())\n+\t\t\t\tconfigHandlers.RegisterRoutes(configGroup)\n+\t\t\t}\n+\n \t\t\tlogger.Logger.Info().Msg(\"\ud83d\udd0c Connector routes registered\")\n \t\t}\n \t}", - "header": "@@ -1544,6 +1575,14 @@ func (s *AgentFieldServer) setupRoutes() {", - "new_count": 14, - "new_start": 1575, - "old_count": 6, - "old_start": 1544 - } - ], - "language": "go", - "lines_added": 39, - "lines_removed": 0, - "path": "control-plane/internal/server/server.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": " }\n \n // Configuration\n-func (s *stubStorage) SetConfig(ctx context.Context, key string, value interface{}) error { return nil }\n-func (s *stubStorage) GetConfig(ctx context.Context, key string) (interface{}, error) {\n+func (s *stubStorage) SetConfig(ctx context.Context, key string, value string, updatedBy string) error {\n+\treturn nil\n+}\n+func (s *stubStorage) GetConfig(ctx context.Context, key string) (*storage.ConfigEntry, error) {\n+\treturn nil, nil\n+}\n+func (s *stubStorage) ListConfigs(ctx context.Context) ([]*storage.ConfigEntry, error) {\n \treturn nil, nil\n }\n+func (s *stubStorage) DeleteConfig(ctx context.Context, key string) error { return nil }\n \n // Reasoner Performance and History\n func (s *stubStorage) GetReasonerPerformanceMetrics(ctx context.Context, reasonerID string) (*types.ReasonerPerformanceMetrics, error) 
{", - "header": "@@ -230,10 +230,16 @@ func (s *stubStorage) ListAgentGroups(ctx context.Context, teamID string) ([]typ", - "new_count": 16, - "new_start": 230, - "old_count": 10, - "old_start": 230 - } - ], - "language": "go", - "lines_added": 8, - "lines_removed": 2, - "path": "control-plane/internal/server/server_routes_test.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": " \treturn nil\n }\n \n-// SetConfig stores a configuration key-value pair in SQLite.\n-func (ls *LocalStorage) SetConfig(ctx context.Context, key string, value interface{}) error {\n-\t// Fast-fail if context is already cancelled\n+// SetConfig upserts a configuration entry in the database.\n+// On conflict (duplicate key), it increments the version and updates the value.\n+func (ls *LocalStorage) SetConfig(ctx context.Context, key string, value string, updatedBy string) error {\n \tif err := ctx.Err(); err != nil {\n \t\treturn err\n \t}\n \n-\t// TODO: Implement configuration storage in SQLite\n-\treturn fmt.Errorf(\"SetConfig not yet implemented for LocalStorage\")\n+\tdb := ls.requireSQLDB()\n+\tnow := time.Now().UTC()\n+\n+\tif ls.mode == \"postgres\" {\n+\t\t_, err := db.ExecContext(ctx, `\n+\t\t\tINSERT INTO config_storage (key, value, version, created_by, updated_by, created_at, updated_at)\n+\t\t\tVALUES ($1, $2, 1, $3, $3, $4, $4)\n+\t\t\tON CONFLICT (key) DO UPDATE SET\n+\t\t\t\tvalue = EXCLUDED.value,\n+\t\t\t\tversion = config_storage.version + 1,\n+\t\t\t\tupdated_by = EXCLUDED.updated_by,\n+\t\t\t\tupdated_at = EXCLUDED.updated_at`,\n+\t\t\tkey, value, updatedBy, now)\n+\t\treturn err\n+\t}\n+\n+\t// SQLite\n+\t_, err := db.ExecContext(ctx, `\n+\t\tINSERT INTO config_storage (key, value, version, created_by, updated_by, created_at, updated_at)\n+\t\tVALUES (?, ?, 1, ?, ?, ?, ?)\n+\t\tON CONFLICT (key) DO UPDATE SET\n+\t\t\tvalue = excluded.value,\n+\t\t\tversion = config_storage.version + 1,\n+\t\t\tupdated_by = excluded.updated_by,\n+\t\t\tupdated_at = 
excluded.updated_at`,\n+\t\tkey, value, updatedBy, updatedBy, now, now)\n+\treturn err\n }\n \n-// GetConfig retrieves a configuration value from SQLite by key.\n-func (ls *LocalStorage) GetConfig(ctx context.Context, key string) (interface{}, error) {\n-\t// Fast-fail if context is already cancelled\n+// GetConfig retrieves a configuration entry by key.\n+func (ls *LocalStorage) GetConfig(ctx context.Context, key string) (*ConfigEntry, error) {\n+\tif err := ctx.Err(); err != nil {\n+\t\treturn nil, err\n+\t}\n+\n+\tdb := ls.requireSQLDB()\n+\tvar entry ConfigEntry\n+\n+\tvar placeholder string\n+\tif ls.mode == \"postgres\" {\n+\t\tplaceholder = \"$1\"\n+\t} else {\n+\t\tplaceholder = \"?\"\n+\t}\n+\n+\trow := db.QueryRowContext(ctx,\n+\t\tfmt.Sprintf(`SELECT key, value, version, COALESCE(created_by, ''), COALESCE(updated_by, ''), created_at, updated_at\n+\t\tFROM config_storage WHERE key = %s`, placeholder), key)\n+\n+\terr := row.Scan(&entry.Key, &entry.Value, &entry.Version,\n+\t\t&entry.CreatedBy, &entry.UpdatedBy, &entry.CreatedAt, &entry.UpdatedAt)\n+\tif err != nil {\n+\t\tif errors.Is(err, sql.ErrNoRows) {\n+\t\t\treturn nil, nil\n+\t\t}\n+\t\treturn nil, fmt.Errorf(\"failed to get config %q: %w\", key, err)\n+\t}\n+\treturn &entry, nil\n+}\n+\n+// ListConfigs returns all stored configuration entries.\n+func (ls *LocalStorage) ListConfigs(ctx context.Context) ([]*ConfigEntry, error) {\n \tif err := ctx.Err(); err != nil {\n \t\treturn nil, err\n \t}\n \n-\t// TODO: Implement configuration retrieval from SQLite\n-\treturn nil, fmt.Errorf(\"GetConfig not yet implemented for LocalStorage\")\n+\tdb := ls.requireSQLDB()\n+\trows, err := db.QueryContext(ctx,\n+\t\t`SELECT key, value, version, COALESCE(created_by, ''), COALESCE(updated_by, ''), created_at, updated_at\n+\t\tFROM config_storage ORDER BY key`)\n+\tif err != nil {\n+\t\treturn nil, fmt.Errorf(\"failed to list configs: %w\", err)\n+\t}\n+\tdefer rows.Close()\n+\n+\tvar entries []*ConfigEntry\n+\tfor 
rows.Next() {\n+\t\tvar entry ConfigEntry\n+\t\tif err := rows.Scan(&entry.Key, &entry.Value, &entry.Version,\n+\t\t\t&entry.CreatedBy, &entry.UpdatedBy, &entry.CreatedAt, &entry.UpdatedAt); err != nil {\n+\t\t\treturn nil, fmt.Errorf(\"failed to scan config row: %w\", err)\n+\t\t}\n+\t\tentries = append(entries, &entry)\n+\t}\n+\treturn entries, rows.Err()\n+}\n+\n+// DeleteConfig removes a configuration entry by key.\n+func (ls *LocalStorage) DeleteConfig(ctx context.Context, key string) error {\n+\tif err := ctx.Err(); err != nil {\n+\t\treturn err\n+\t}\n+\n+\tdb := ls.requireSQLDB()\n+\tvar placeholder string\n+\tif ls.mode == \"postgres\" {\n+\t\tplaceholder = \"$1\"\n+\t} else {\n+\t\tplaceholder = \"?\"\n+\t}\n+\n+\tresult, err := db.ExecContext(ctx,\n+\t\tfmt.Sprintf(`DELETE FROM config_storage WHERE key = %s`, placeholder), key)\n+\tif err != nil {\n+\t\treturn fmt.Errorf(\"failed to delete config %q: %w\", key, err)\n+\t}\n+\trows, _ := result.RowsAffected()\n+\tif rows == 0 {\n+\t\treturn fmt.Errorf(\"config %q not found\", key)\n+\t}\n+\treturn nil\n }\n \n // SubscribeToMemoryChanges implements the StorageProvider SubscribeToMemoryChanges method using local pub/sub.", - "header": "@@ -5124,26 +5124,124 @@ func (ls *LocalStorage) UpdateAgentTrafficWeight(ctx context.Context, id string,", - "new_count": 124, - "new_start": 5124, - "old_count": 26, - "old_start": 5124 - } - ], - "language": "go", - "lines_added": 108, - "lines_removed": 10, - "path": "control-plane/internal/storage/local.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": " \t\t&DIDDocumentModel{},\n \t\t&AccessPolicyModel{},\n \t\t&AgentTagVCModel{},\n+\t\t&ConfigStorageModel{},\n \t}\n \n \tif err := gormDB.WithContext(ctx).AutoMigrate(models...); err != nil {", - "header": "@@ -233,6 +233,7 @@ func (ls *LocalStorage) autoMigrateSchema(ctx context.Context) error {", - "new_count": 7, - "new_start": 233, - "old_count": 6, - "old_start": 233 - } - ], - "language": "go", - 
"lines_added": 1, - "lines_removed": 0, - "path": "control-plane/internal/storage/migrations.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": " }\n \n func (AgentTagVCModel) TableName() string { return \"agent_tag_vcs\" }\n+\n+// ConfigStorageModel stores configuration files in the database.\n+// Each record represents a named configuration (e.g. \"agentfield.yaml\")\n+// with versioning for audit trail.\n+type ConfigStorageModel struct {\n+\tID int64 `gorm:\"column:id;primaryKey;autoIncrement\"`\n+\tKey string `gorm:\"column:key;not null;uniqueIndex\"`\n+\tValue string `gorm:\"column:value;type:text;not null\"`\n+\tVersion int `gorm:\"column:version;not null;default:1\"`\n+\tCreatedBy *string `gorm:\"column:created_by\"`\n+\tUpdatedBy *string `gorm:\"column:updated_by\"`\n+\tCreatedAt time.Time `gorm:\"column:created_at;autoCreateTime\"`\n+\tUpdatedAt time.Time `gorm:\"column:updated_at;autoUpdateTime\"`\n+}\n+\n+func (ConfigStorageModel) TableName() string { return \"config_storage\" }", - "header": "@@ -472,3 +472,19 @@ type AgentTagVCModel struct {", - "new_count": 19, - "new_start": 472, - "old_count": 3, - "old_start": 472 - } - ], - "language": "go", - "lines_added": 16, - "lines_removed": 0, - "path": "control-plane/internal/storage/models.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": " \tActiveExecutions int\n }\n \n+// ConfigEntry represents a database-stored configuration file.\n+type ConfigEntry struct {\n+\tKey string `json:\"key\"`\n+\tValue string `json:\"value\"`\n+\tVersion int `json:\"version\"`\n+\tCreatedBy string `json:\"created_by,omitempty\"`\n+\tUpdatedBy string `json:\"updated_by,omitempty\"`\n+\tCreatedAt time.Time `json:\"created_at\"`\n+\tUpdatedAt time.Time `json:\"updated_at\"`\n+}\n+\n // StorageProvider is the interface for the primary data storage backend.\n type StorageProvider interface {\n \t// Lifecycle", - "header": "@@ -26,6 +26,17 @@ type RunSummaryAggregation struct {", - "new_count": 
17, - "new_start": 26, - "old_count": 6, - "old_start": 26 - }, - { - "content": " \tUpdateAgentVersion(ctx context.Context, id string, version string) error\n \tUpdateAgentTrafficWeight(ctx context.Context, id string, version string, weight int) error\n \n-\t// Configuration\n-\tSetConfig(ctx context.Context, key string, value interface{}) error\n-\tGetConfig(ctx context.Context, key string) (interface{}, error)\n+\t// Configuration Storage (database-backed config files)\n+\tSetConfig(ctx context.Context, key string, value string, updatedBy string) error\n+\tGetConfig(ctx context.Context, key string) (*ConfigEntry, error)\n+\tListConfigs(ctx context.Context) ([]*ConfigEntry, error)\n+\tDeleteConfig(ctx context.Context, key string) error\n \n \t// Reasoner Performance and History\n \tGetReasonerPerformanceMetrics(ctx context.Context, reasonerID string) (*types.ReasonerPerformanceMetrics, error)", - "header": "@@ -118,9 +129,11 @@ type StorageProvider interface {", - "new_count": 11, - "new_start": 129, - "old_count": 9, - "old_start": 118 - } - ], - "language": "go", - "lines_added": 16, - "lines_removed": 3, - "path": "control-plane/internal/storage/storage.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": "+-- +goose Up\n+-- +goose StatementBegin\n+CREATE TABLE IF NOT EXISTS config_storage (\n+ id BIGSERIAL PRIMARY KEY,\n+ key TEXT NOT NULL UNIQUE,\n+ value TEXT NOT NULL,\n+ version INTEGER NOT NULL DEFAULT 1,\n+ created_by TEXT,\n+ updated_by TEXT,\n+ created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),\n+ updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()\n+);\n+\n+CREATE INDEX IF NOT EXISTS idx_config_storage_key ON config_storage(key);\n+-- +goose StatementEnd\n+\n+-- +goose Down\n+-- +goose StatementBegin\n+DROP INDEX IF EXISTS idx_config_storage_key;\n+DROP TABLE IF EXISTS config_storage;\n+-- +goose StatementEnd", - "header": "@@ -0,0 +1,21 @@", - "new_count": 21, - "new_start": 1, - "old_count": 0, - "old_start": 0 - } - ], 
- "language": "sql", - "lines_added": 21, - "lines_removed": 0, - "path": "control-plane/migrations/028_create_config_storage.sql", - "status": "added" - } - ], - "intent_gaps": [ - "MISSING CONNECTOR ROUTES: PR description promises 'Add connector-scoped config routes gated by config_management capability' but no connector handler code visible in this PR. The 'Related PRs' section mentions 'Connector: Agent-Field/connector' - these routes may be implemented there instead.", - "NO AUTOMATED TESTS: Test plan only lists manual tests. No unit/integration tests added for SetConfig/GetConfig/ListConfigs/DeleteConfig, overlayDBConfig, or config_storage handlers.", - "NO CONFIG VALIDATION: PR description mentions 'Store config in DB' via PUT endpoint but doesn't describe any validation of config content. Invalid YAML can be stored and will only fail on server restart.", - "HOT RELOAD LIMITATIONS: The POST /configs/reload endpoint reloads config into memory, but services initialized at startup (health monitor intervals, webhook timeouts) won't pick up changes without full server restart.", - "NO ROLLBACK MECHANISM: Once DB config is loaded, there's no API to revert to file-only config without restarting server with AGENTFIELD_CONFIG_SOURCE unset.", - "CONFIG MERGE INCOMPLETENESS: The mergeDBConfig() function in config_db.go:54-102 handles specific fields but comments suggest it should handle all config. Missing: logging config, feature flags (other than DID), mcp config, and any future config sections.", - "VERSIONING SEMANTICS: PR mentions 'versioning for audit trail' but no endpoint to retrieve historical config versions or rollback to previous version." - ], - "pr_narrative": "This PR introduces a database-backed configuration storage system with the following architecture:\n\n1. 
**Database Schema**: New `config_storage` table (migration 028) with GORM model `ConfigStorageModel` storing key-value config pairs with versioning, audit trail (created_by, updated_by, timestamps).\n\n2. **Storage Layer**: Four new methods on `StorageProvider` interface (`SetConfig`, `GetConfig`, `ListConfigs`, `DeleteConfig`) implemented in `local.go` using GORM for both SQLite and PostgreSQL.\n\n3. **Config Loading at Startup**: New `config_db.go` containing `overlayDBConfig()` function that loads config from DB when `AGENTFIELD_CONFIG_SOURCE=db` env var is set. The function:\n - Reads config entry with key 'agentfield.yaml' from DB\n - Parses YAML into config struct\n - Selectively merges non-zero values into existing config (preserving storage section for bootstrap safety)\n - Precedence: env vars > DB config > file config > defaults\n\n4. **API Surface**: New `config_storage.go` handlers providing:\n - `GET /api/v1/configs` - List all configs\n - `GET /api/v1/configs/:key` - Get specific config\n - `PUT /api/v1/configs/:key` - Create/update config (accepts raw YAML body)\n - `DELETE /api/v1/configs/:key` - Delete config\n - `POST /api/v1/configs/reload` - Hot-reload config from DB\n\n5. **Server Integration**: Modified `server.go` to:\n - Call `overlayDBConfig()` during `NewAgentFieldServer()` initialization (lines 107-112)\n - Add `configReloadFn()` method that returns reload function when `AGENTFIELD_CONFIG_SOURCE=db`\n - Config storage handlers receive reload function via constructor\n\n6. **Default Config**: Added `config_management` capability to connector capabilities in `agentfield.yaml` (lines 149-151).", - "risk_surfaces": [ - "BOOTSTRAP TIMING RACE: overlayDBConfig() in server.go:107-112 runs AFTER storage initialization but BEFORE config is fully used. If DB config fails to load (line 109-110), server continues with only a warning print. 
In production with `AGENTFIELD_CONFIG_SOURCE=db` expected, silent fallback to file config could cause config drift across instances.", - "PARTIAL CONFIG MERGE: config_db.go:54-102 mergeDBConfig() only handles specific known fields (AgentField.Port, NodeHealth, ExecutionCleanup, Approval, Features.DID, API.CORS, UI). Any NEW config fields added to the Config struct in the future will NOT be merged from DB unless explicitly added here - this is a maintenance hazard.", - "SECURITY FIELD PROTECTION: config_db.go:91-92 comment states connector config (token, capabilities) is intentionally NOT merged from DB for security. However, the PR adds `config_management` capability to default agentfield.yaml:149-151. If connector compromise occurs, attacker could potentially modify config via connector API (routes not visible in this PR but implied).", - "ERROR HANDLING INCONSISTENCY: config_storage.go:85-100 SetConfig calls storage.SetConfig(), then immediately calls storage.GetConfig() to return saved entry. If GetConfig fails (lines 91-94), handler returns 500 error even though config WAS saved successfully, leaving client uncertain of actual state.", - "NO CONFIG VALIDATION: config_storage.go:67-78 accepts raw YAML body via io.ReadAll() without any validation that the YAML is valid, matches expected schema, or won't break server on next restart. Invalid YAML will only surface when server restarts with `AGENTFIELD_CONFIG_SOURCE=db`.", - "VERSIONING WITHOUT OPTIMISTIC LOCKING: models.go:479-488 ConfigStorageModel has version field auto-incremented by GORM, but storage.go SetConfig() implementation (not visible in this PR) likely uses simple upsert. Concurrent updates from multiple admins could cause last-write-wins data loss.", - "RELOAD RACE CONDITION: config_storage.go:114-128 ReloadConfig handler calls reloadFn which modifies in-memory config struct. 
No mutex protection visible - concurrent reloads or reload during config access could cause race conditions.", - "MISSING CONNECTOR ROUTES: PR description mentions 'connector-scoped config routes gated by config_management capability' but no connector handler code or routes are visible in the provided files. Either these routes are in a separate PR (mentioned as 'Related PRs') or this is incomplete implementation.", - "YAML PARSING FAILURE MODE: config_db.go:36-39 calls yaml.Unmarshal() on DB config value. If YAML is malformed, overlayDBConfig() returns error which is only logged as warning (server.go:110). Server continues startup with potentially incomplete config - could mask critical misconfiguration.", - "STORAGE SECTION PROTECTION BYPASS: config_db.go:33-45 preserves storage config and restores it after merge. However, if DB config contains storage section with empty/zero values, the merge logic (lines 54-102) might still apply changes before restoration at line 45, potentially causing temporary connection issues." - ], - "stats": { - "files_added": 3, - "files_modified": 7, - "files_removed": 0, - "files_renamed": 0, - "test_files_changed": 1, - "test_to_code_ratio": 0.1111111111111111, - "total_additions": 455, - "total_deletions": 15, - "total_files": 10 - }, - "unrelated_changes": [ - "server_routes_test.go:233-242 adds stub implementations for new Config methods to stubStorage, but these are required for interface compliance, not unrelated.", - "migrations/028_create_config_storage.sql:14 creates index on key column, but GORM model at models.go:481 already defines `uniqueIndex` on Key - potentially redundant index creation.", - "models.go:479-488 ConfigStorageModel includes both `CreatedAt` with `autoCreateTime` and explicit time.Time fields - standard GORM pattern, not truly unrelated." 
- ] - }, - "budget": { - "budget_exhausted": true, - "cost_breakdown": { - "adversary": 0, - "anatomy": 0, - "coverage": 0, - "cross_ref": 0, - "intake": 0, - "meta_selectors": 0, - "output": 0, - "review": 0, - "synthesis": 0 - }, - "max_cost_usd": 2, - "max_duration_seconds": 2400, - "total_cost_usd": 0 - }, - "intake": { - "ai_generated": 0.6666666666666666, - "areas_touched": [ - "database", - "api", - "tests", - "config" - ], - "complexity": "complex", - "languages": [ - "go", - "sql", - "yaml" - ], - "pr_summary": "## Summary\n- Add `config_storage` table (GORM model + Goose migration 028) for storing configuration files in the database\n- Implement `SetConfig`/`GetConfig`/`ListConfigs`/`DeleteConfig` on the `StorageProvider` interface (works on both SQLite and PostgreSQL)\n- Add `AGENTFIELD_CONFIG_SOURCE=db` environment variable to load config from the database at startup (overlays on top of file config, preserving storage section for bootstrap)\n- Add CRUD API endpoints at `GET/PUT/DELETE /api/v1/configs/:key`\n- Add connector-scoped config routes gated by `config_management` capability\n- Add `config_management` capability to default `agentfield.yaml`\n\n## How It Works\n1. **Store config in DB**: `PUT /api/v1/configs/agentfield.yaml` with YAML body\n2. **Load from DB at startup**: Set `AGENTFIELD_CONFIG_SOURCE=db` \u2192 server reads config from DB after storage init\n3. **Remote management**: SaaS \u2192 connector \u2192 `config_management` capability \u2192 CP config API\n4. **Precedence**: env vars > DB config > file config > defaults\n5. 
**Bootstrap safety**: The `storage` section is never overridden from DB (DB connection can't come from DB)\n\n## Related PRs\n- Connector: Agent-Field/connector (config_management capability)\n- hax-sdk: Agent-Field/hax-sdk (config editor UI)\n\n## Test plan\n- [x] `go build ./...` passes\n- [x] Server tests pass\n- [x] Storage test failure is pre-existing (FTS5 not available)\n- [ ] Manual test: create config via API, verify it loads on restart with `AGENTFIELD_CONFIG_SOURCE=db`\n- [ ] Manual test: verify connector flow end-to-end\n\n\ud83e\udd16 Generated with [Claude Code](https://claude.com/claude-code)", - "pr_type": "feature", - "review_depth": "standard", - "risk_signals": [ - "modifies data model or schema-affecting code", - "changes API surface or request/response behavior", - "includes configuration changes", - "test behavior updated" - ] - }, - "phases_completed": [ - "intake", - "anatomy", - "meta_selectors", - "review", - "adversary", - "cross_ref", - "coverage", - "synthesis", - "output" - ], - "plan": { - "ai_adjusted": false, - "cross_ref_hints": [], - "dimensions": [ - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "control-plane/internal/config/config.go" - ], - "id": "semantic_sem-001", - "name": "Config Reload Race Condition", - "priority": 10, - "review_prompt": "Investigate the thread safety of the config reload mechanism introduced in this PR.\n\n**Problem**: The PR adds a `configMu sync.RWMutex` field to `AgentFieldServer` struct (server.go:81) and uses it in `configReloadFn()` to protect writes during reload. However, the config (`s.config`) is accessed from 35+ locations throughout the codebase (grep for 's\\.config\\.' 
in server.go) WITHOUT any mutex protection.\n\n**Key Files to Examine**:\n- `control-plane/internal/server/server.go:433-442` - configReloadFn() implementation\n- `control-plane/internal/server/server.go:48-82` - AgentFieldServer struct definition showing config field\n- `control-plane/internal/server/config_db.go:19-50` - overlayDBConfig() that modifies config\n\n**Verification Steps**:\n1. Check if ANY readers of s.config acquire configMu.RLock() before access\n2. Look at server.go:502 (s.config.AgentField.Port), 834-838 (CORS config access), 882-883 (API key access), 913 (DID config), etc.\n3. Confirm that overlayDBConfig() modifies the config struct in-place (line 42: mergeDBConfig(cfg, &dbCfg))\n4. Verify that concurrent config access during reload could cause data races\n\n**Expected Issue**: The mutex only protects the reload operation itself, not the readers. During a reload, readers may see partially updated config, torn reads, or stale data. This is a classic readers-writers problem where readers run unsynchronized.", - "target_files": [ - "control-plane/internal/server/server.go", - "control-plane/internal/server/config_db.go" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "control-plane/internal/storage/local.go" - ], - "id": "mechanical_mech-001", - "name": "StorageProvider Interface Signature Compatibility", - "priority": 10, - "review_prompt": "The PR changes the StorageProvider interface methods from (SetConfig/GetConfig with interface{} return types) to new signatures with string parameters and *ConfigEntry return types, plus adds ListConfigs and DeleteConfig methods.\n\nVerify that ALL implementations of StorageProvider have been updated:\n\n1. 
**Check these test mocks have OLD signatures (WILL BREAK):**\n - `control-plane/internal/handlers/ui/config_test.go:289-297` - MockStorageProvider.SetConfig/GetConfig still use `interface{}`\n - `control-plane/internal/handlers/execute_test.go:173-178` - MockStorageProvider has old signatures\n\n2. **Verify these mocks are missing NEW methods:**\n - Both mocks above lack `ListConfigs(ctx context.Context) ([]*storage.ConfigEntry, error)`\n - Both mocks above lack `DeleteConfig(ctx context.Context, key string) error`\n - Both mocks have wrong signature for `SetConfig(ctx, key, value string, updatedBy string)`\n\n3. **Check if interface is fully implemented:**\n - Run: `cd control-plane && go build ./...`\n - Any compile errors about interface satisfaction?\n - Check: `go test ./internal/handlers/ui/...` and `./internal/handlers/...`\n\nThis is a CRITICAL mechanical issue - the PR will not compile due to interface mismatch.", - "target_files": [ - "control-plane/internal/handlers/ui/config_test.go", - "control-plane/internal/handlers/execute_test.go", - "control-plane/internal/storage/storage.go" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [], - "id": "semantic_sem-002", - "name": "Partial Config Merge Maintenance Hazard", - "priority": 8, - "review_prompt": "Analyze the completeness and maintainability of the config merge logic in mergeDBConfig().\n\n**Problem**: The mergeDBConfig() function in config_db.go:54-102 selectively merges only specific known config fields. Any NEW config fields added to the Config struct in the future will NOT be merged from DB unless explicitly added to this function.\n\n**Key Files to Examine**:\n- `control-plane/internal/server/config_db.go:54-102` - mergeDBConfig() implementation\n- `control-plane/internal/config/config.go` - Full Config struct definition\n\n**Verification Steps**:\n1. 
List all config fields that ARE merged: AgentField.Port, NodeHealth, ExecutionCleanup fields, Approval, Features.DID, API.CORS, UI\n2. List config fields that are NOT merged (check config.go):\n - ExecutionQueue (AgentCallTimeout, WebhookTimeout, etc.)\n - Features.DID.Authorization (all security settings)\n - Features.DID.VCRequirements\n - Features.DID.Keystore\n - API.Auth (API key from DB is explicitly ignored per comment)\n - Logging config (if any)\n - MCP config (if any)\n3. Check if there's any automated way to ensure mergeDBConfig stays in sync with Config struct\n4. Verify this creates a maintenance burden where adding new config fields requires updating mergeDBConfig\n\n**Expected Issue**: This is a semantic drift hazard. Future developers adding config fields will likely forget to update mergeDBConfig(), causing silent failures where DB config values are ignored.", - "target_files": [ - "control-plane/internal/server/config_db.go", - "control-plane/internal/config/config.go" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "control-plane/internal/config/config.go" - ], - "id": "semantic_sem-005", - "name": "Unvalidated Config Storage and Late Failure", - "priority": 8, - "review_prompt": "Investigate the validation gap in config storage and its impact on server startup.\n\n**Problem**: The SetConfig handler (config_storage.go:67-78) accepts raw YAML without validating it's valid YAML or matches the expected config schema. Invalid YAML is stored successfully but only fails when the server restarts with AGENTFIELD_CONFIG_SOURCE=db.\n\n**Key Files to Examine**:\n- `control-plane/internal/handlers/config_storage.go:67-78` - SetConfig body reading\n- `control-plane/internal/server/config_db.go:36-39` - YAML parsing at startup\n- `control-plane/internal/config/config.go:222-249` - LoadConfig validation\n\n**Verification Steps**:\n1. 
Check what validation occurs in SetConfig:\n - Line 70-77: Only checks for empty body and size limit\n - No YAML syntax validation\n - No schema validation against Config struct\n2. Verify when invalid YAML is detected:\n - config_db.go:37-38: yaml.Unmarshal() at server startup\n - Line 110: Only prints warning, server continues\n3. Consider attack vector: attacker with API access stores malformed YAML, server cannot restart with DB config\n4. Check if there's any way to validate config without full server restart\n\n**Expected Issue**: Malformed config can be stored via API and will only surface as a startup failure, potentially causing downtime or forcing fallback to file config when DB config was intended.", - "target_files": [ - "control-plane/internal/handlers/config_storage.go", - "control-plane/internal/server/config_db.go" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "control-plane/internal/server/config_db.go" - ], - "id": "mechanical_mech-002", - "name": "Config Reload Mutex Protection", - "priority": 8, - "review_prompt": "The PR adds a `configMu sync.RWMutex` to AgentFieldServer struct (server.go:82) but the configReloadFn method (server.go:433-442) does NOT acquire this mutex when reloading config.\n\nInvestigate the thread-safety:\n\n1. **Check server.go:433-442** - configReloadFn() returns a function that calls overlayDBConfig()\n - Does it acquire s.configMu.Lock()? (It should but verify)\n - The overlayDBConfig function modifies s.config directly\n\n2. **Check for concurrent access patterns:**\n - Search for other readers of s.config throughout server.go\n - Are there goroutines that read config without holding the mutex?\n - Specifically check: health monitor, cleanup service, webhook dispatcher - these all read config fields\n\n3. 
**Verify the mutex is actually used:**\n - Search for `configMu` usage in server.go\n - Is it only declared but never locked/unlocked?\n - The PR adds the mutex field but may not use it consistently\n\nThis could cause data races if config is reloaded while other goroutines read config values.", - "target_files": [ - "control-plane/internal/server/server.go" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "control-plane/internal/config/config.go" - ], - "id": "systemic_systemic-001", - "name": "Config Merge Completeness and Maintainability", - "priority": 8, - "review_prompt": "Review the mergeDBConfig function in control-plane/internal/server/config_db.go:54-102. This function implements field-by-field merging of DB config into the target config, but only handles specific known fields (AgentField.Port, NodeHealth, ExecutionCleanup, Approval, Features.DID, API.CORS, UI).\n\nKey concerns:\n1. The function has a maintenance hazard - any NEW config fields added to the Config struct in the future will NOT be merged from DB unless explicitly added here. Check if this is documented or if there's a more robust pattern.\n2. Compare with existing config loading patterns in the codebase (e.g., how viper handles config merging).\n3. Look at the Config struct in control-plane/internal/config/config.go to identify fields that are NOT handled by mergeDBConfig (e.g., Storage, Logging, MCP, Feature flags other than DID).\n4. Determine if the selective merge is intentional (for security/bootstrap safety) or if it creates an incomplete feature.\n5. 
Check if there's a TODO or comment explaining this limitation and when it should be expanded.", - "target_files": [ - "control-plane/internal/server/config_db.go" - ] - } - ], - "total_budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - } - } - }, - "pr_url": "https://github.com/Agent-Field/agentfield/pull/254", - "review": { - "body": "## \ud83d\udd34 PR-AF Review \u2014 **Needs Major Rework**\n\n*Automated multi-agent code review \u00b7 [PR-AF](https://github.com/Agent-Field/agentfield) built with [AgentField](https://github.com/Agent-Field/agentfield)*\n\n> **25 findings** \u00b7 \ud83d\udd34 7 critical \u00b7 \ud83d\udfe0 11 important \u00b7 \ud83d\udd35 6 suggestions \u00b7 \u26aa 1 nitpicks\n\n
\nPR Overview\n\n## Summary\n- Add `config_storage` table (GORM model + Goose migration 028) for storing configuration files in the database\n- Implement `SetConfig`/`GetConfig`/`ListConfigs`/`DeleteConfig` on the `StorageProvider` interface (works on both SQLite and PostgreSQL)\n- Add `AGENTFIELD_CONFIG_SOURCE=db` environment variable to load config from the database at startup (overlays on top of file config, preserving storage section for bootstrap)\n- Add CRUD API endpoints at `GET/PUT/DELETE /api/v1/configs/:key`\n- Add connector-scoped config routes gated by `config_management` capability\n- Add `config_management` capability to default `agentfield.yaml`\n\n## How It Works\n1. **Store config in DB**: `PUT /api/v1/configs/agentfield.yaml` with YAML body\n2. **Load from DB at startup**: Set `AGENTFIELD_CONFIG_SOURCE=db` \u2192 server reads config from DB after storage init\n3. **Remote management**: SaaS \u2192 connector \u2192 `config_management` capability \u2192 CP config API\n4. **Precedence**: env vars > DB config > file config > defaults\n5. **Bootstrap safety**: The `storage` section is never overridden from DB (DB connection can't come from DB)\n\n## Related PRs\n- Connector: Agent-Field/connector (config_management capability)\n- hax-sdk: Agent-Field/hax-sdk (config editor UI)\n\n## Test plan\n- [x] `go build ./...` passes\n- [x] Server tests pass\n- [x] Storage test failure is pre-existing (FTS5 not available)\n- [ ] Manual test: create config via API, verify it loads on restart with `AGENTFIELD_CONFIG_SOURCE=db`\n- [ ] Manual test: verify connector flow end-to-end\n\n\ud83e\udd16 Generated with [Claude Code](https://claude.com/claude-code)\n\n
\n\n### Key Findings\n\n**18 issue(s) should be addressed before merge:**\n\n- \ud83d\udd34 **MockStorageProvider.SetConfig/GetConfig have obsolete signatures - interface mismatch** (`control-plane/internal/handlers/ui/config_test.go:289`) \u2014 The `MockStorageProvider` in `config_test.go` has obsolete method signatures for `SetConfig` and `GetConfig` that do not match the updated `StorageProvider` interface.\n- \ud83d\udd34 **MockStorageProvider.SetConfig/GetConfig have obsolete signatures - interface mismatch** (`control-plane/internal/handlers/execute_test.go:173`) \u2014 The `MockStorageProvider` in `execute_test.go` has obsolete method signatures for `SetConfig` and `GetConfig` that do not match the updated `StorageProvider` interface.\n- \ud83d\udd34 **MockStorageProvider missing ListConfigs and DeleteConfig methods** (`control-plane/internal/handlers/ui/config_test.go:25`) \u2014 The `MockStorageProvider` in `config_test.go` is missing the two new configuration methods added to the `StorageProvider` interface: `ListConfigs` and `DeleteConfig`.\n- \ud83d\udd34 **MockStorageProvider missing ListConfigs and DeleteConfig methods** (`control-plane/internal/handlers/execute_test.go:22`) \u2014 The `MockStorageProvider` in `execute_test.go` is missing the two new configuration methods added to the `StorageProvider` interface: `ListConfigs` and `DeleteConfig`.\n- \ud83d\udd34 **Version field lacks optimistic locking - concurrent updates cause silent data loss** (`control-plane/migrations/028_create_config_storage.sql:1`) \u2014 The `version` column is auto-incremented during upsert operations but there's no database-level constraint or application-level check to prevent lost updates.\n- \ud83d\udd34 **VERSIONING WITHOUT OPTIMISTIC LOCKING: Concurrent updates cause silent data loss** (`control-plane/internal/storage/local.go:5129`) \u2014 The `SetConfig` method implements versioning without optimistic locking, causing **silent data loss** when concurrent updates 
occur.\n- \ud83d\udd34 **Config storage admin routes exposed without authentication** (`control-plane/internal/server/server.go:1550`) \u2014 The config storage routes at /api/v1/configs/* are registered directly on agentAPI without any authentication middleware, despite the comment claiming they are 'admin-authenticated'.\n- \ud83d\udfe0 **Incomplete NodeHealth Merge - Only CheckInterval Is Handled** (`control-plane/internal/server/config_db.go:59`) \u2014 The `NodeHealth` struct has 5 fields (CheckInterval, CheckTimeout, ConsecutiveFailures, RecoveryDebounce, HeartbeatStaleThreshold), but `mergeDBConfig()` only handles `CheckInterval`.\n- \u2026 and 10 more (see All Findings by Severity)\n\n**7 suggestion(s) and style note(s):**\n\n- \ud83d\udd35 Important: Background goroutines read s.config without mutex protection (`control-plane/internal/server/server.go:133`)\n- \ud83d\udd35 No Automated Sync Check Between Config Struct and Merge Function (`control-plane/internal/server/config_db.go:52`)\n- \ud83d\udd35 Missing TODO/FIXME Comment Warning About Maintenance Burden (`control-plane/internal/server/config_db.go:52`)\n- \ud83d\udd35 AMBIGUOUS NULL HANDLING: COALESCE converts NULL to empty string losing audit information (`control-plane/internal/storage/local.go:5179`)\n- \ud83d\udd35 created_by/updated_by lack referential integrity constraints (`control-plane/migrations/028_create_config_storage.sql:8`)\n- \u2026 and 2 more (see All Findings by Severity)\n\n**Files with findings:** `control-plane/config/agentfield.yaml`, `control-plane/internal/handlers/config_storage.go`, `control-plane/internal/handlers/execute_test.go`, `control-plane/internal/handlers/ui/config_test.go`, `control-plane/internal/server/config_db.go`, `control-plane/internal/server/server.go`, `control-plane/internal/storage/local.go`, `control-plane/internal/storage/models.go`, `control-plane/migrations/028_create_config_storage.sql`\n\n
\nAll Findings by Severity\n\n#### \ud83d\udd34 Critical (7)\n\n- **MockStorageProvider.SetConfig/GetConfig have obsolete signatures - interface mismatch** `control-plane/internal/handlers/ui/config_test.go:289`\n- **MockStorageProvider.SetConfig/GetConfig have obsolete signatures - interface mismatch** `control-plane/internal/handlers/execute_test.go:173`\n- **MockStorageProvider missing ListConfigs and DeleteConfig methods** `control-plane/internal/handlers/ui/config_test.go:25`\n- **MockStorageProvider missing ListConfigs and DeleteConfig methods** `control-plane/internal/handlers/execute_test.go:22`\n- **Version field lacks optimistic locking - concurrent updates cause silent data loss** `control-plane/migrations/028_create_config_storage.sql:1`\n- **VERSIONING WITHOUT OPTIMISTIC LOCKING: Concurrent updates cause silent data loss** `control-plane/internal/storage/local.go:5129`\n- **Config storage admin routes exposed without authentication** `control-plane/internal/server/server.go:1550`\n\n#### \ud83d\udfe0 Important (11)\n\n- **Incomplete NodeHealth Merge - Only CheckInterval Is Handled** `control-plane/internal/server/config_db.go:59`\n- **Missing Config Fields in mergeDBConfig Creates Silent Failures** `control-plane/internal/server/config_db.go:54`\n- **DIDConfig Merge Only Checks Method Field - Other DID Settings Ignored** `control-plane/internal/server/config_db.go:87`\n- **CORSConfig Partial Merge - Only AllowedOrigins Is Checked** `control-plane/internal/server/config_db.go:95`\n- **Partial Config Merge - Many Config Fields Silently Ignored from DB** `control-plane/internal/server/config_db.go:54`\n- **SetConfig accepts invalid YAML without validation, causing delayed startup failures** `control-plane/internal/handlers/config_storage.go:67`\n- **Missing ON UPDATE trigger for updated_at timestamp** `control-plane/migrations/028_create_config_storage.sql:10`\n- **config_management capability enabled by default with write access** 
`control-plane/config/agentfield.yaml:149`\n- **key column uses TEXT type without length limit or validation** `control-plane/migrations/028_create_config_storage.sql:5`\n- **INCONSISTENT ERROR HANDLING: GetConfig returns nil on 'not found' but storage.go contract is unclear** `control-plane/internal/storage/local.go:5164`\n- **Inconsistent Security Field Handling - DID.Authorization Omitted Without Comment** `control-plane/internal/server/config_db.go:86`\n\n#### \ud83d\udd35 Suggestion (6)\n\n- **Important: Background goroutines read s.config without mutex protection** `control-plane/internal/server/server.go:133`\n- **No Automated Sync Check Between Config Struct and Merge Function** `control-plane/internal/server/config_db.go:52`\n- **Missing TODO/FIXME Comment Warning About Maintenance Burden** `control-plane/internal/server/config_db.go:52`\n- **AMBIGUOUS NULL HANDLING: COALESCE converts NULL to empty string losing audit information** `control-plane/internal/storage/local.go:5179`\n- **created_by/updated_by lack referential integrity constraints** `control-plane/migrations/028_create_config_storage.sql:8`\n- **MISSING DATABASE CONSTRAINTS: ConfigStorageModel lacks validation for key format** `control-plane/internal/storage/models.go:476`\n\n#### \u26aa Nitpick (1)\n\n- **Important: configMu mutex is declared but NEVER used anywhere** `control-plane/internal/server/server.go:82`\n\n
\n\n
\nReview Process Details\n\n**Dimensions Analyzed (6):**\n\n- **Config Reload Race Condition** \u2014 2 file(s)\n- **StorageProvider Interface Signature Compatibility** \u2014 3 file(s)\n- **Partial Config Merge Maintenance Hazard** \u2014 2 file(s)\n- **Unvalidated Config Storage and Late Failure** \u2014 2 file(s)\n- **Config Reload Mutex Protection** \u2014 1 file(s)\n- **Config Merge Completeness and Maintainability** \u2014 1 file(s)\n\n**Meta-Dimension Lenses (3):**\n\n- **Semantic** \u2014 5 dimension(s), 85% coverage confidence\n- **Mechanical** \u2014 3 dimension(s), 85% coverage confidence\n- **Systemic** \u2014 3 dimension(s), 75% coverage confidence\n\n**Cross-Reference & Adversary Analysis:**\n\n- **8** cross-change interaction(s) detected\n- **16** finding(s) adversarially tested: 13 confirmed, 3 challenged\n\n
\n\n
\nPipeline Stats\n\n| Metric | Value |\n|--------|-------|\n| Duration | 2608.6s |\n| Agent invocations | 21 |\n| Coverage iterations | 1 |\n| Estimated cost | N/A (provider does not report cost) |\n| Budget exhausted | Yes (timeout: 2608s > 2400s limit) |\n| PR type | feature |\n| Complexity | complex |\n\n
\n\nReview ID: `rev_5795c21d6bdd`", - "comments": [ - { - "body": "\ud83d\udd34 **[CRITICAL] Version field lacks optimistic locking - concurrent updates cause silent data loss**\n\nThe `version` column is auto-incremented during upsert operations but there's no database-level constraint or application-level check to prevent lost updates. When two admins simultaneously update the same config key via `PUT /api/v1/configs/:key`, the second write will overwrite the first without any warning or conflict detection.\n\nThe storage implementation at `local.go:5129-5160` uses `ON CONFLICT DO UPDATE` with `version = config_storage.version + 1`, which is atomic but doesn't validate that the admin read the latest version before updating. This means:\n\n1. Admin A reads config version 5\n2. Admin B reads config version 5\n3. Admin A saves \u2192 version becomes 6\n4. Admin B saves \u2192 version becomes 7 (silently overwriting Admin A's changes)\n\n**Impact**: Configuration changes can be silently lost in multi-admin environments, potentially causing production misconfiguration.\n\n---\n\n> Step 1: Migration defines `version INTEGER NOT NULL DEFAULT 1` (line 7)\n> Step 2: GORM model marks `Version int` with `not null;default:1` tag (models.go:483)\n> Step 3: SetConfig() uses upsert: `version = config_storage.version + 1` (local.go:5143,5156)\n> Step 4: No version check in WHERE clause or BEFORE UPDATE trigger to validate expected version\n> Step 5: ConfigStorageHandlers.SetConfig() accepts no version parameter (config_storage.go:67-100)\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd optimistic locking by either:\n1. **Preferred**: Add `expected_version` parameter to PUT endpoint and fail with 409 Conflict if current version != expected\n2. Alternative: Add timestamp-based conflict detection using `updated_at`\n3. Add application-level check in SetConfig: `UPDATE config_storage SET ... WHERE key = ? 
AND version = ?` then check RowsAffected\n\n---\n*`Coverage Gap - Database Migration` \u00b7 confidence 95%*", - "line": 1, - "path": "control-plane/migrations/028_create_config_storage.sql", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd34 **[CRITICAL] VERSIONING WITHOUT OPTIMISTIC LOCKING: Concurrent updates cause silent data loss**\n\nThe `SetConfig` method implements versioning without optimistic locking, causing **silent data loss** when concurrent updates occur.\n\n**The Problem:**\n- Admin A reads config at version 1\n- Admin B reads config at version 1\n- Both admins modify different parts of the config\n- Both call `SetConfig` with their changes\n- Both execute `ON CONFLICT (key) DO UPDATE SET version = config_storage.version + 1`\n- Both result in version = 2\n- **Admin A's changes are silently lost** with no error or warning\n\n**Why this is critical:**\nIn production environments with multiple admins or automated systems updating config, concurrent modifications will result in last-write-wins behavior that loses intermediate changes. The version field provides an **audit trail illusion** - it looks like versioning is working but actually provides no conflict detection.\n\n**Code analysis:**\n```go\nON CONFLICT (key) DO UPDATE SET\n value = EXCLUDED.value,\n version = config_storage.version + 1, // <-- No WHERE clause checking expected version!\n updated_by = EXCLUDED.updated_by,\n updated_at = EXCLUDED.updated_at\n```\n\nThis is different from proper optimistic locking which would use:\n```sql\nUPDATE config_storage SET value = ?, version = version + 1 WHERE key = ? 
AND version = ?\n```\n\n---\n\n> Step 1: Two admins (A and B) both call `GET /api/v1/configs/agentfield.yaml` and receive version=1\n> Step 2: Admin A modifies port setting, calls `PUT /api/v1/configs/agentfield.yaml` - succeeds, version becomes 2\n> Step 3: Admin B modifies log level, calls `PUT` with payload based on version=1 they read earlier\n> Step 4: In local.go:5137-5161, the SQL executes `ON CONFLICT...version + 1` without checking if the update is based on current version\n> Step 5: Admin B's update succeeds (version becomes 3), but **Admin A's port change is silently overwritten**\n> Step 6: No error is returned - the data loss is undetected\n\n**\ud83d\udca1 Suggested Fix**\n\nImplement proper optimistic locking by:\n1. Adding an optional `expectedVersion` parameter to `SetConfig`\n2. Using a transaction with SELECT FOR UPDATE to read current version\n3. Only updating if current version matches expected version\n4. Returning a specific error (e.g., `ErrConfigVersionConflict`) when versions don't match\n5. Updating the handler to accept `If-Match` header with expected version and return 409 Conflict on mismatch\n\n---\n*`storage layer - ConfigStorageModel versioning and SetConfig implementation` \u00b7 confidence 95%*", - "line": 5129, - "path": "control-plane/internal/storage/local.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Incomplete NodeHealth Merge - Only CheckInterval Is Handled**\n\nThe `NodeHealth` struct has 5 fields (CheckInterval, CheckTimeout, ConsecutiveFailures, RecoveryDebounce, HeartbeatStaleThreshold), but `mergeDBConfig()` only handles `CheckInterval`. All other NodeHealth fields from DB config are silently ignored.\n\n---\n\n> config.go:54-59 defines NodeHealthConfig with 5 fields. config_db.go:59-61 only checks `dbCfg.AgentField.NodeHealth.CheckInterval != 0`. 
Other fields have no corresponding merge logic.\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd merge logic for all NodeHealth fields: CheckTimeout, ConsecutiveFailures, RecoveryDebounce, and HeartbeatStaleThreshold. Consider replacing the entire NodeHealth struct when any field is set, similar to how Approval and DID are handled.\n\n---\n*`Config Merge Completeness and Maintainability` \u00b7 confidence 95%*", - "line": 59, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Missing Config Fields in mergeDBConfig Creates Silent Failures**\n\nThe `mergeDBConfig` function at `config_db.go:54-103` selectively merges only specific known config fields from the database, leaving many fields unhandled. This creates a **maintenance hazard** where any new fields added to the `Config` struct will silently be ignored when loading from DB, causing confusion and incomplete configuration application.\n\n**Missing fields NOT merged from DB (partial list):**\n- `AgentFieldConfig.ExecutionQueue` (lines 39, 71-78 in config.go) - Agent call timeout, webhook settings\n- `NodeHealthConfig.CheckTimeout` (line 55) - Health check timeout\n- `NodeHealthConfig.ConsecutiveFailures` (line 56) - Failure threshold\n- `NodeHealthConfig.RecoveryDebounce` (line 57) - Recovery debounce\n- `NodeHealthConfig.HeartbeatStaleThreshold` (line 58) - Staleness threshold\n- `Features.DID.Authorization` (lines 111-135) - DID auth settings, admin tokens, access policies\n- `Features.DID.VCRequirements` (lines 171-179) - VC generation requirements\n- `Features.DID.Keystore` (lines 182-189) - Keystore configuration\n- `API.Auth` (lines 207-212) - API authentication settings\n- `UI.Enabled` (line 27) - UI enabled/disabled flag\n- `UI.SourcePath`, `UI.DistPath`, `UI.DevPort` (lines 29-31) - UI paths and dev port\n\n**Impact:** Users storing config in DB may set values like `execution_queue.agent_call_timeout` or `features.did.authorization.enabled`, 
but these will be silently ignored. The server continues running with incomplete config, making this a subtle bug that only manifests in production behavior differences.\n\n---\n\n> Step 1: Config struct defines AgentField.ExecutionQueue at config.go:39,72-78 with fields: AgentCallTimeout, WebhookTimeout, WebhookMaxAttempts, WebhookRetryBackoff, WebhookMaxRetryBackoff.\n> Step 2: mergeDBConfig (config_db.go:54-103) checks AgentField.Port, NodeHealth, ExecutionCleanup, Approval, Features.DID (partially), API.CORS, UI.\n> Step 3: ExecutionQueue is never referenced in mergeDBConfig - all queue settings are silently ignored when loading from DB.\n> Step 4: This means webhook timeouts and agent call timeouts set via DB config API will have no effect.\n\n**\ud83d\udca1 Suggested Fix**\n\n1. Add comprehensive handling for all current Config struct fields, OR\n2. Implement a reflection-based merge that uses struct tags to determine which fields should be merged (with explicit 'security' or 'nosync' tags to exclude sensitive fields), OR\n3. At minimum, add documentation comments listing all unhandled fields and a TODO/FIXME comment explaining that new fields must be manually added here\n\nRecommended approach: Add a struct tag like `merge:\"true\"` to fields that should be synced from DB, then use reflection to automatically merge those fields while preserving security-sensitive ones.\n\n---\n*`Partial Config Merge Maintenance Hazard` \u00b7 confidence 95%*", - "line": 54, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] DIDConfig Merge Only Checks Method Field - Other DID Settings Ignored**\n\nThe DIDConfig struct has 8 fields (Enabled, Method, KeyAlgorithm, DerivationMethod, KeyRotationDays, VCRequirements, Keystore, Authorization), but `mergeDBConfig()` only checks if `Method != \"\"` and then replaces the entire struct. This means:\n1. 
If DB only sets `Enabled: false` without Method, the entire DID config is ignored\n2. Individual DID field updates from DB are not supported - it's all-or-nothing based on Method\n3. VCRequirements, Keystore, and Authorization sub-configs from DB are never applied\n\n---\n\n> config.go:100-109 defines DIDConfig with 8 fields. config_db.go:87-89 only checks `dbCfg.Features.DID.Method != \"\"` before replacing entire struct. No handling for VCRequirements (lines 171-179), Keystore (lines 182-189), or Authorization (lines 112-135).\n\n**\ud83d\udca1 Suggested Fix**\n\nEither handle DIDConfig fields individually (like ExecutionCleanup) or check for any non-zero DID field before replacing the struct. Ensure sub-structs (VCRequirements, Keystore, Authorization) are also considered.\n\n---\n*`Config Merge Completeness and Maintainability` \u00b7 confidence 90%*", - "line": 87, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] CORSConfig Partial Merge - Only AllowedOrigins Is Checked**\n\nThe CORSConfig struct has 5 fields, but `mergeDBConfig()` only checks `AllowedOrigins`. If the DB config specifies `AllowedMethods`, `AllowedHeaders`, `ExposedHeaders`, or `AllowCredentials` without `AllowedOrigins`, those settings are silently ignored.\n\n---\n\n> config.go:198-204 defines CORSConfig with 5 fields (AllowedOrigins, AllowedMethods, AllowedHeaders, ExposedHeaders, AllowCredentials). 
config_db.go:95-97 only checks `len(dbCfg.API.CORS.AllowedOrigins) > 0`.\n\n**\ud83d\udca1 Suggested Fix**\n\nExpand the condition to check for any non-zero CORS field: `len(dbCfg.API.CORS.AllowedOrigins) > 0 || len(dbCfg.API.CORS.AllowedMethods) > 0 || ...` or check each field individually.\n\n---\n*`Config Merge Completeness and Maintainability` \u00b7 confidence 90%*", - "line": 95, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Partial Config Merge - Many Config Fields Silently Ignored from DB**\n\nThe `mergeDBConfig()` function only handles a subset of configuration fields, causing **silent data loss** when config is loaded from the database. Users storing complete config in the DB will find that most fields are ignored without warning.\n\n**Fields that ARE merged (minimal subset):**\n- `AgentField.Port`\n- `AgentField.NodeHealth.CheckInterval` (only this one field - other NodeHealth fields ignored)\n- `AgentField.ExecutionCleanup` (all 6 fields merged individually)\n- `AgentField.Approval` (both fields)\n- `Features.DID.Method` (entire struct replaced if Method is set)\n- `API.CORS` (only if AllowedOrigins has items)\n- `UI` (entire struct replaced if Mode is set)\n\n**Fields NOT merged from DB (will be silently ignored):**\n\n**ExecutionQueueConfig (lines 72-78 in config.go):**\n- `AgentField.ExecutionQueue.AgentCallTimeout`\n- `AgentField.ExecutionQueue.WebhookTimeout`\n- `AgentField.ExecutionQueue.WebhookMaxAttempts`\n- `AgentField.ExecutionQueue.WebhookRetryBackoff`\n- `AgentField.ExecutionQueue.WebhookMaxRetryBackoff`\n\n**NodeHealthConfig (lines 54-59 in config.go):**\n- `AgentField.NodeHealth.CheckTimeout`\n- `AgentField.NodeHealth.ConsecutiveFailures`\n- `AgentField.NodeHealth.RecoveryDebounce`\n- `AgentField.NodeHealth.HeartbeatStaleThreshold`\n\n**DIDConfig (lines 100-109 in config.go):**\n- `Features.DID.Enabled`\n- `Features.DID.KeyAlgorithm`\n- 
`Features.DID.DerivationMethod`\n- `Features.DID.KeyRotationDays`\n\n**VCRequirements (lines 171-179 in config.go):**\n- `Features.DID.VCRequirements.RequireVCForRegistration`\n- `Features.DID.VCRequirements.RequireVCForExecution`\n- `Features.DID.VCRequirements.RequireVCForCrossAgent`\n- `Features.DID.VCRequirements.StoreInputOutput`\n- `Features.DID.VCRequirements.HashSensitiveData`\n- `Features.DID.VCRequirements.PersistExecutionVC`\n- `Features.DID.VCRequirements.StorageMode`\n\n**KeystoreConfig (lines 182-189 in config.go):**\n- `Features.DID.Keystore.Type`\n- `Features.DID.Keystore.Path`\n- `Features.DID.Keystore.Encryption`\n- `Features.DID.Keystore.EncryptionPassphrase`\n- `Features.DID.Keystore.BackupEnabled`\n- `Features.DID.Keystore.BackupInterval`\n\n**AuthorizationConfig (lines 112-135 in config.go):**\n- `Features.DID.Authorization.Enabled`\n- `Features.DID.Authorization.DIDAuthEnabled`\n- `Features.DID.Authorization.Domain`\n- `Features.DID.Authorization.TimestampWindowSeconds`\n- `Features.DID.Authorization.DefaultApprovalDurationHours`\n- `Features.DID.Authorization.AdminToken`\n- `Features.DID.Authorization.InternalToken`\n- `Features.DID.Authorization.TagApprovalRules` (all subfields)\n- `Features.DID.Authorization.AccessPolicies` (all subfields)\n\n**CORSConfig partial (lines 198-204 in config.go):**\n- `API.CORS.AllowedMethods` (not merged even if DB has values)\n- `API.CORS.AllowedHeaders` (not merged even if DB has values)\n- `API.CORS.ExposedHeaders` (not merged even if DB has values)\n- `API.CORS.AllowCredentials` (not merged even if DB has values)\n\nThis is a **semantic drift hazard** - future developers adding new config fields will almost certainly forget to update `mergeDBConfig()`, causing silent failures where DB config values are ignored.\n\n---\n\n> mergeDBConfig() at config_db.go:54-102 only has merge logic for:\n> - AgentField.Port (line 56-58)\n> - AgentField.NodeHealth.CheckInterval (line 59-61)\n> - 
AgentField.ExecutionCleanup.* (lines 63-81)\n> - AgentField.Approval (lines 82-84)\n> - Features.DID.Method (lines 87-89)\n> - API.CORS.AllowedOrigins (lines 95-97)\n> - UI.Mode (lines 100-102)\n> \n> config.go shows many additional fields in AgentFieldConfig (ExecutionQueue), DIDConfig (Enabled, KeyAlgorithm, DerivationMethod, KeyRotationDays, VCRequirements, Keystore, Authorization), and CORSConfig (AllowedMethods, AllowedHeaders, ExposedHeaders, AllowCredentials) that have no corresponding merge logic.\n\n**\ud83d\udca1 Suggested Fix**\n\nReplace the manual field-by-field merge with a generic deep-merge approach using reflection or a library like `mergo`. Alternatively, use a whitelist approach with explicit validation that fails if unknown fields are present in the DB config. At minimum, add a comment at the top of Config struct in config.go warning developers that new fields must be added to mergeDBConfig().\n\n---\n*`Config Merge Completeness and Maintainability` \u00b7 confidence 95%*", - "line": 54, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] SetConfig accepts invalid YAML without validation, causing delayed startup failures**\n\nThe `SetConfig` handler at `control-plane/internal/handlers/config_storage.go:67-78` accepts raw YAML via `io.ReadAll()` and stores it directly to the database without any validation. Only basic checks are performed (empty body at line 75-77), but **no YAML syntax validation** or **schema validation** occurs.\n\n**The Attack Scenario:**\n1. Attacker with API access calls `PUT /api/v1/configs/agentfield.yaml` with malformed YAML (e.g., invalid indentation, invalid types, or non-existent fields)\n2. Handler accepts and stores it successfully (line 85: `h.storage.SetConfig()`)\n3. Server continues running normally with current config\n4. 
On next restart with `AGENTFIELD_CONFIG_SOURCE=db`, `overlayDBConfig()` attempts to parse the invalid YAML at `config_db.go:37`\n5. `yaml.Unmarshal()` fails, returning an error\n6. At `server.go:109-110`, this error only prints a warning and the server continues with file/env config\n7. **Result**: Expected DB config is silently ignored, potentially causing production downtime or configuration drift\n\n**Why This Matters:**\n- In production environments using `AGENTFIELD_CONFIG_SOURCE=db`, operators expect the database to be the source of truth\n- Invalid config only surfaces during restart, which may be delayed hours/days after the bad config was stored\n- The silent fallback to file config can mask critical misconfigurations and cause cluster inconsistency\n\n---\n\n> Step 1: Client calls `PUT /api/v1/configs/:key` endpoint at `config_storage.go:67`\n> Step 2: Handler reads body at line 70: `body, err := io.ReadAll(c.Request.Body)`\n> Step 3: Handler only checks `len(body) == 0` at lines 75-77 - no YAML validation\n> Step 4: Handler stores raw body to DB at line 85: `h.storage.SetConfig(c.Request.Context(), key, string(body), updatedBy)`\n> Step 5: On server restart with `AGENTFIELD_CONFIG_SOURCE=db`, `NewAgentFieldServer()` calls `overlayDBConfig(cfg, storageProvider)` at `server.go:108-109`\n> Step 6: `overlayDBConfig()` calls `yaml.Unmarshal([]byte(entry.Value), &dbCfg)` at `config_db.go:37`\n> Step 7: If YAML is malformed, error is returned: `fmt.Errorf(\"failed to parse database config YAML: %w\", err)`\n> Step 8: At `server.go:109-110`, error is only logged as warning: `fmt.Printf(\"Warning: failed to load config from database: %v\\n\", err)`\n> Step 9: Server continues startup with potentially stale file/env config instead of expected DB config\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd YAML validation in `SetConfig` handler before storing to database:\n\n1. 
**Immediate fix**: After reading body at line 70, validate it's valid YAML:\n```go\n// Validate YAML syntax\nvar yamlTest map[string]interface{}\nif err := yaml.Unmarshal(body, &yamlTest); err != nil {\n c.JSON(http.StatusBadRequest, gin.H{\"error\": \"invalid YAML syntax\", \"details\": err.Error()})\n return\n}\n```\n\n2. **Stronger validation**: Parse into actual Config struct to catch type mismatches:\n```go\nvar cfgTest config.Config\nif err := yaml.Unmarshal(body, &cfgTest); err != nil {\n c.JSON(http.StatusBadRequest, gin.H{\"error\": \"invalid config schema\", \"details\": err.Error()})\n return\n}\n```\n\n3. **Consider dry-run reload**: If `reloadFn` is available, attempt a config reload with the new YAML before persisting to catch runtime issues.\n\n---\n*`Config Storage Validation Gap` \u00b7 confidence 95%*", - "line": 67, - "path": "control-plane/internal/handlers/config_storage.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Missing ON UPDATE trigger for updated_at timestamp**\n\nThe migration sets `DEFAULT NOW()` for both `created_at` and `updated_at`, but lacks a database-level trigger to automatically update `updated_at` on row modification. While the Go implementation in `local.go` explicitly sets `updated_at` during upserts, this creates a risk for:\n\n1. Direct database updates via SQL console or admin tools won't update the timestamp\n2. Future code that uses GORM's generic Update() instead of the custom SetConfig() will fail to update the timestamp\n3. 
Data migration scripts or external tools won't maintain audit trail accuracy\n\n**Related risk**: The GORM model uses `autoUpdateTime` tag (models.go:487) which GORM handles automatically, but the storage layer bypasses GORM with raw SQL, creating inconsistency in behavior.\n\n---\n\n> Step 1: Migration line 11: `updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()` - only sets on INSERT\n> Step 2: No `ON UPDATE` trigger or `GENERATED ALWAYS AS` clause present\n> Step 3: GORM model line 487 uses `autoUpdateTime` but storage implementation bypasses GORM\n> Step 4: local.go:5138-5160 uses raw SQL upsert which manually sets updated_at\n> Step 5: If someone uses GORM db.Save(&model) directly, updated_at won't update due to schema limitation\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd database-level trigger to auto-update `updated_at` on any row modification:\n```sql\nCREATE OR REPLACE FUNCTION update_updated_at_column()\nRETURNS TRIGGER AS $$\nBEGIN\n NEW.updated_at = NOW();\n RETURN NEW;\nEND;\n$$ language 'plpgsql';\n\nCREATE TRIGGER update_config_storage_updated_at\n BEFORE UPDATE ON config_storage\n FOR EACH ROW\n EXECUTE FUNCTION update_updated_at_column();\n```\n\n---\n*`Coverage Gap - Database Migration` \u00b7 confidence 85%*", - "line": 10, - "path": "control-plane/migrations/028_create_config_storage.sql", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] config_management capability enabled by default with write access**\n\nThe config_management capability is added with enabled: true and read_only: false by default. This creates a privilege escalation risk if the connector token is compromised. The risk: (1) Connector routes (server.go:1558-1578) allow config management via connector token. (2) The connector token is a single shared secret stored in config (line 132: token: test-connector-token-123). 
(3) If an attacker obtains the connector token (via log leak, config exposure, etc.), they can modify configuration via /api/v1/connector/configs/* routes, change security settings, disable auth, redirect storage, and escalate from connector access to full control plane compromise. Current protections: config_db.go intentionally skips merging connector config from DB (good), but attacker can still modify OTHER critical sections (DID auth, storage, features). The connector is designed for SaaS integration with limited scope, but config_management gives it effectively full control over the control plane configuration. This violates the principle of least privilege.\n\n---\n\n> Step 1: agentfield.yaml:149-151 sets config_management enabled=true, read_only=false. Step 2: server.go:1560 applies ConnectorTokenAuth to connector routes. Step 3: server.go:1574 applies ConnectorCapabilityCheck middleware. Step 4: config_storage.go:26-31 exposes full CRUD via RegisterRoutes. Step 5: Compromised connector token leads to ability to modify any config except connector section.\n\n**\ud83d\udca1 Suggested Fix**\n\nChange the default to enabled: false or at minimum read_only: true. Example: config_management: enabled: false (users must explicitly enable after understanding risks), read_only: true (or enable but restrict to read-only by default). Alternatively, require explicit opt-in via environment variable for write access.\n\n---\n*`Coverage Gap Review - agentfield.yaml config_management capability` \u00b7 confidence 85%*", - "line": 149, - "path": "control-plane/config/agentfield.yaml", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] key column uses TEXT type without length limit or validation**\n\nThe `key` column is defined as `TEXT NOT NULL UNIQUE` without any length constraint or validation pattern. While this provides flexibility, it allows insertion of extremely large keys (up to 1GB in PostgreSQL) which could cause:\n\n1. 
**Performance issues**: Index `idx_config_storage_key` on large TEXT values increases storage and lookup overhead\n2. **API abuse**: Malicious actors could create configs with multi-MB keys causing DoS\n3. **UI/display issues**: The web UI and logs may truncate or fail to display extremely long keys\n4. **Storage waste**: Index entries for large text consume significant disk space\n\n**Context**: The primary use case is `agentfield.yaml` as the config key (as seen in config_db.go:13), which is short and predictable. There's no business requirement for arbitrary-length keys.\n\n---\n\n> Step 1: Migration line 5 defines `key TEXT NOT NULL UNIQUE`\n> Step 2: No CHECK constraint or length validation present\n> Step 3: Index at line 14 `idx_config_storage_key` will index full TEXT values\n> Step 4: config_db.go:13 shows expected key is `agentfield.yaml` (15 chars)\n> Step 5: config_storage.go handlers accept arbitrary key strings from URL path\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd length constraint to key column:\n```sql\n-- Add to migration\nkey VARCHAR(255) NOT NULL UNIQUE CHECK (LENGTH(key) > 0 AND LENGTH(key) <= 255)\n```\nOr add validation at application layer in SetConfig handler before storage call.\n\n---\n*`Coverage Gap - Database Migration` \u00b7 confidence 80%*", - "line": 5, - "path": "control-plane/migrations/028_create_config_storage.sql", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] INCONSISTENT ERROR HANDLING: GetConfig returns nil on 'not found' but storage.go contract is unclear**\n\nThe `GetConfig` method at line 5186-5187 returns `nil, nil` when config is not found, using string comparison `err.Error() == \"sql: no rows in result set\"` instead of the standard `errors.Is(err, sql.ErrNoRows)`.\n\n**Issues:**\n1. **Fragile error detection**: String comparison instead of `errors.Is()` may fail with different drivers or wrapped errors\n2. 
**Silent failures**: The handler in `config_storage.go` calls `GetConfig` after `SetConfig` to return saved state. If this call returns `nil, nil` (due to race condition where config was deleted between insert and select), the handler returns 500 with misleading error even though SetConfig succeeded.\n\nThis creates the scenario mentioned in the PR context: \"Error handling inconsistency: SetConfig calls storage.SetConfig(), then immediately calls storage.GetConfig() to return saved entry. If GetConfig fails, handler returns 500 error even though config WAS saved successfully\"\n\n---\n\n> Step 1: Handler calls `storage.SetConfig()` successfully\n> Step 2: Handler immediately calls `storage.GetConfig()` at config_storage.go:91-94\n> Step 3: If GetConfig returns `nil, nil` (not found), handler checks `if err != nil` only\n> Step 4: Handler proceeds with `nil` entry causing nil pointer dereference or returns incorrect response\n> Step 5: Client receives 500 error despite config being successfully saved\n\n**\ud83d\udca1 Suggested Fix**\n\n1. Use `errors.Is(err, sql.ErrNoRows)` instead of string comparison at line 5186\n2. Consider returning a typed error like `ErrConfigNotFound` for missing configs\n3. 
Document in the `StorageProvider` interface what callers should expect for 'not found' cases\n\n---\n*`storage layer - ConfigStorageModel versioning and SetConfig implementation` \u00b7 confidence 75%*", - "line": 5164, - "path": "control-plane/internal/storage/local.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Inconsistent Security Field Handling - DID.Authorization Omitted Without Comment**\n\nWhile the code correctly excludes `Connector` config (token, capabilities) from DB merge with a clear security comment (lines 90-92), it also silently omits `Features.DID.Authorization` which contains equally security-sensitive fields like `AdminToken`, `InternalToken`, `AccessPolicies`, and `DIDAuthEnabled` (config.go:111-135).\n\nThe DID Authorization struct contains:\n- `AdminToken` - Separate token for admin operations\n- `InternalToken` - Used for Authorization: Bearer header to agents\n- `Domain` - Domain for did:web identifiers\n- `AccessPolicies` - Tag-based authorization policies\n\nThese fields are **not merged from DB** despite being security-relevant, but unlike the Connector exclusion, there's no explanatory comment. 
This inconsistency makes it unclear whether the omission is intentional (security) or accidental (incomplete implementation).\n\n---\n\n> Step 1: DIDConfig.Authorization struct at config.go:111-135 defines security-sensitive fields: AdminToken, InternalToken, AccessPolicies, DIDAuthEnabled.\n> Step 2: mergeDBConfig only checks dbCfg.Features.DID.Method at line 87, then assigns entire DID struct.\n> Step 3: DID.Authorization is part of DID struct but never specifically handled - it would be zeroed if only Method is set, or copied wholesale if any Method is set.\n> Step 4: No security comment explains why these sensitive fields are treated differently from Connector config.\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd an explicit comment explaining why DID.Authorization fields are excluded from DB merge, similar to the Connector comment:\n\n```go\n// NOTE: DID.Authorization config (admin_token, internal_token, access_policies) is\n// intentionally NOT merged from DB for security, similar to connector config.\n// Only DID.Method is merged as it affects VC generation behavior.\n```\n\n---\n*`Partial Config Merge Maintenance Hazard` \u00b7 confidence 85%*", - "line": 86, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] No Automated Sync Check Between Config Struct and Merge Function**\n\nThere is no automated mechanism (build-time check, code generation, or test) to ensure that `mergeDBConfig()` stays synchronized with the `Config` struct definition. When new fields are added to `config.Config`, developers must manually remember to update `mergeDBConfig()` in a different file. This is a classic source of drift bugs.\n\n---\n\n> mergeDBConfig() comment at line 52-53 states 'selectively merges' but provides no mechanism to ensure completeness. 
The function and Config struct are in separate files (config_db.go vs config.go) increasing the likelihood of drift.\n\n**\ud83d\udca1 Suggested Fix**\n\nConsider adding a build tag or go:generate directive that uses reflection to verify all exported fields in Config have corresponding merge logic. Alternatively, add a unit test that uses reflection to compare the Config struct fields against known merged fields and fails if new fields are detected without test coverage in mergeDBConfig.\n\n---\n*`Config Merge Completeness and Maintainability` \u00b7 confidence 85%*", - "line": 52, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] Missing TODO/FIXME Comment Warning About Maintenance Burden**\n\nThe function comment at lines 52-53 describes what the function does but does not warn maintainers that this function must be updated whenever new config fields are added. The field-by-field merge approach creates a **compile-time blind spot** - the code compiles successfully even when Config struct has fields not handled here.\n\nA maintainer adding a new field to `Config` struct will have no indication that they also need to add handling here unless they happen to read this file. 
This is exactly the type of issue that caused the ExecutionCleanup bug requiring the a8bfc8c fix commit.\n\n---\n\n> Step 1: Function comment at lines 52-53 says 'selectively merges' and 'Only non-zero/non-empty values' but gives no warning about the maintenance requirement.\n> Step 2: Config struct has 15+ fields/sub-structs (config.go:17-23, 34-41, etc.).\n> Step 3: mergeDBConfig handles only 7 specific field paths (Port, NodeHealth.CheckInterval, ExecutionCleanup.*, Approval, DID.Method, API.CORS, UI).\n> Step 4: No compile-time or comment-based guard exists to warn when Config grows but mergeDBConfig doesn't.\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd a prominent TODO/FIXME comment at the top of mergeDBConfig:\n\n```go\n// TODO: This function must be updated when adding new config fields.\n// Currently missing: ExecutionQueue, NodeHealth (partial), DID.Authorization,\n// DID.VCRequirements, DID.Keystore, API.Auth, UI.Enabled, etc.\n// Consider using reflection-based merging with struct tags to avoid\n// this maintenance burden (see also: viper's automatic config merging).\n```\n\n---\n*`Partial Config Merge Maintenance Hazard` \u00b7 confidence 80%*", - "line": 52, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] AMBIGUOUS NULL HANDLING: COALESCE converts NULL to empty string losing audit information**\n\nIn `GetConfig` (lines 5180-5184), the SQL uses `COALESCE(created_by, '')` and `COALESCE(updated_by, '')` to handle NULL values.\n\n**Issues:**\n1. **Loss of semantic meaning**: Empty string `\"\"` and NULL have different meanings - NULL means \"unknown/system\" while empty string could mean \"intentionally blank\"\n2. **Inconsistent with model**: `ConfigStorageModel` uses `*string` pointers for these fields indicating they can be NULL\n3. 
**ConfigEntry uses non-pointer**: The `ConfigEntry` struct in storage.go:30-38 uses plain `string` not `*string`, forcing the COALESCE\n\nThis makes it impossible to distinguish between \"created by system (NULL)\" and \"created by user with empty name (empty string)\".\n\n---\n\n> storage.go:30-38 defines ConfigEntry with `CreatedBy string` and `UpdatedBy string` (no pointers)\n> \n> local.go:5180-5181 uses `COALESCE(created_by, '')` and `COALESCE(updated_by, '')` to handle NULLs because ConfigEntry can't hold NULL\n> \n> models.go:484-485 defines `CreatedBy *string` and `UpdatedBy *string` as pointers in the model\n\n**\ud83d\udca1 Suggested Fix**\n\nChange `ConfigEntry` to use `*string` for `CreatedBy` and `UpdatedBy`:\n```go\ntype ConfigEntry struct {\n Key string `json:\"key\"`\n Value string `json:\"value\"`\n Version int `json:\"version\"`\n CreatedBy *string `json:\"created_by,omitempty\"` // Use pointer\n UpdatedBy *string `json:\"updated_by,omitempty\"` // Use pointer\n CreatedAt time.Time `json:\"created_at\"`\n UpdatedAt time.Time `json:\"updated_at\"`\n}\n```\n\nRemove COALESCE from SQL and scan directly into pointer fields.\n\n---\n*`storage layer - ConfigStorageModel versioning and SetConfig implementation` \u00b7 confidence 70%*", - "line": 5179, - "path": "control-plane/internal/storage/local.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] created_by/updated_by lack referential integrity constraints**\n\nThe `created_by` and `updated_by` columns are defined as nullable TEXT without foreign key constraints or validation. This design allows arbitrary strings that may not correspond to actual users in the system, making the audit trail unreliable.\n\n**Trade-offs**: Adding FK constraints to a users table would require that table to exist and be populated, which may not be true in all deployment scenarios (e.g., API-only authentication). 
However, even without FK constraints, the application should validate these values against authenticated principals.\n\n---\n\n> Step 1: Migration lines 8-9: `created_by TEXT` and `updated_by TEXT` - no constraints\n> Step 2: GORM model lines 484-485 uses `*string` pointers allowing NULL\n> Step 3: config_storage.go:76-78 extracts `updatedBy` from context but has no validation\n> Step 4: No users/agents table reference exists to validate against\n\n**\ud83d\udca1 Suggested Fix**\n\nConsider either:\n1. Add CHECK constraint to validate format (e.g., must be valid UUID or email)\n2. Document that application layer must validate principals before storage\n3. Add comment explaining audit trail limitations for external tools\n\n---\n*`Coverage Gap - Database Migration` \u00b7 confidence 65%*", - "line": 8, - "path": "control-plane/migrations/028_create_config_storage.sql", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] MISSING DATABASE CONSTRAINTS: ConfigStorageModel lacks validation for key format**\n\nThe `ConfigStorageModel` struct defines a `key` field with `uniqueIndex` but no constraints on key format, length, or allowed characters.\n\n**Potential issues:**\n1. Empty string keys allowed (no `NOT NULL` constraint validation at struct level)\n2. No maximum length enforcement\n3. 
No validation that keys follow expected naming conventions (e.g., no path traversal characters like `../` or `..\\`)\n\nWhile the API layer may validate, defense-in-depth suggests the storage layer should also enforce constraints.\n\n---\n\n> models.go:479-488 shows ConfigStorageModel with `gorm:\"column:key;not null;uniqueIndex\"` - the `not null` is present but there's no size limit or format validation\n> \n> local.go:5129-5161 SetConfig accepts any key string and passes directly to SQL without validation\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd GORM validation tags and constraints:\n```go\ntype ConfigStorageModel struct {\n ID int64 `gorm:\"column:id;primaryKey;autoIncrement\"`\n Key string `gorm:\"column:key;not null;uniqueIndex;size:255\"` // Add NOT NULL and size limit\n Value string `gorm:\"column:value;type:text;not null\"`\n // ...\n}\n```\n\nConsider adding application-level validation in `SetConfig` to reject keys containing path separators or control characters.\n\n---\n*`storage layer - ConfigStorageModel versioning and SetConfig implementation` \u00b7 confidence 60%*", - "line": 476, - "path": "control-plane/internal/storage/models.go", - "side": "RIGHT" - }, - { - "body": "\u26aa **[NITPICK] Important: configMu mutex is declared but NEVER used anywhere**\n\nThe `configMu sync.RWMutex` field is declared in the AgentFieldServer struct at line 82, but there are **zero** usages of this mutex in the entire file.\n\nSearch results for 'configMu':\n- Line 82: Declaration only\n- NO calls to configMu.Lock()\n- NO calls to configMu.Unlock()\n- NO calls to configMu.RLock()\n- NO calls to configMu.RUnlock()\n\nThe mutex was added to the struct but never actually locked or unlocked. 
This makes it completely ineffective for preventing data races.\n\n---\n\n> Step 1: grep for 'configMu' in server.go shows only line 82 (declaration)\n> Step 2: No Lock(), Unlock(), RLock(), or RUnlock() calls found\n> Step 3: The mutex exists but provides zero protection\n> Step 4: This indicates incomplete implementation of the thread-safety feature\n\n**\ud83d\udca1 Suggested Fix**\n\nEither:\n1. Add proper mutex protection around all config reads and writes (configMu.Lock() in configReloadFn, configMu.RLock() in goroutines that read config)\n2. OR remove the unused field if config reloading isn't meant to be thread-safe\n\nRecommended approach: Add RLock() around config reads in background goroutines like healthMonitor, presenceManager, etc.\n\n---\n*`Thread Safety - Config Reload Mutex` \u00b7 confidence 99%*", - "line": 82, - "path": "control-plane/internal/server/server.go", - "side": "RIGHT" - } - ], - "event": "REQUEST_CHANGES" - }, - "review_id": "rev_5795c21d6bdd", - "summary": { - "adversary_challenged": 3, - "adversary_confirmed": 13, - "ai_generated_confidence": 0.6666666666666666, - "budget_exhausted": true, - "by_severity": { - "critical": 7, - "important": 11, - "nitpick": 1, - "suggestion": 6 - }, - "cost_usd": 0, - "coverage_iterations": 1, - "cross_ref_interactions": 8, - "dimensions_run": 6, - "duration_seconds": 2608.64, - "total_findings": 25 - } -} \ No newline at end of file diff --git a/benchmark/agentfield-254/pr-af-result-sonnet-254.json b/benchmark/agentfield-254/pr-af-result-sonnet-254.json deleted file mode 100644 index 3e279a3..0000000 --- a/benchmark/agentfield-254/pr-af-result-sonnet-254.json +++ /dev/null @@ -1,1139 +0,0 @@ -{ - "execution_id": "exec_20260310_165506_23twwiqt", - "run_id": "run_20260310_165506_1qym4blk", - "status": "succeeded", - "result": { - "findings": [ - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "The `MockStorageProvider` in `config_test.go` (and identically in 
`execute_test.go`) implements `SetConfig` and `GetConfig` with signatures that do **not** match the `StorageProvider` interface defined in `storage/storage.go:133-136`.\n\n**Interface (storage.go:133-136):**\n```go\nSetConfig(ctx context.Context, key string, value string, updatedBy string) error\nGetConfig(ctx context.Context, key string) (*ConfigEntry, error)\nListConfigs(ctx context.Context) ([]*ConfigEntry, error)\nDeleteConfig(ctx context.Context, key string) error\n```\n\n**Mock (config_test.go:289-297):**\n```go\nfunc (m *MockStorageProvider) SetConfig(ctx context.Context, key string, value interface{}) error\nfunc (m *MockStorageProvider) GetConfig(ctx context.Context, key string) (interface{}, error)\n```\n\nDifferences:\n1. `SetConfig`: interface takes `(value string, updatedBy string)`, mock takes `(value interface{})` \u2014 wrong parameter count AND wrong type\n2. `GetConfig`: interface returns `(*ConfigEntry, error)`, mock returns `(interface{}, error)` \u2014 wrong return type\n3. `ListConfigs` is **entirely absent** from the mock\n4. `DeleteConfig` is **entirely absent** from the mock\n\nBecause both files carry `//go:build integration`, these compile errors are **suppressed during default `go test ./...` runs** and will only surface when running with the `integration` build tag. 
This means the broken mocks are silently excluded from CI unless integration tests are explicitly exercised, creating a false sense of correctness.", - "confidence": 0.98, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "interface-compliance", - "dimension_name": "StorageProvider Interface Implementation Completeness", - "evidence": "Step 1: Interface at storage/storage.go:133 defines `SetConfig(ctx context.Context, key string, value string, updatedBy string) error` with two string parameters after key.\nStep 2: Interface at storage/storage.go:134 defines `GetConfig(ctx context.Context, key string) (*ConfigEntry, error)` returning a concrete pointer type.\nStep 3: Mock at config_test.go:289 implements `SetConfig(ctx context.Context, key string, value interface{}) error` \u2014 only one parameter after key, and typed as `interface{}` not `string`.\nStep 4: Mock at config_test.go:294 implements `GetConfig(ctx context.Context, key string) (interface{}, error)` \u2014 returns `interface{}` not `*storage.ConfigEntry`.\nStep 5: Searching config_test.go for `ListConfigs` and `DeleteConfig` returns 0 matches \u2014 both methods are entirely absent.\nStep 6: execute_test.go:173 and :176 contain identical wrong signatures.\nStep 7: Both files are `//go:build integration` (config_test.go:1, execute_test.go:1), so these compile errors are hidden from default test runs but will break `go test -tags integration ./...`.", - "file_path": "control-plane/internal/handlers/ui/config_test.go", - "id": "f_000", - "line_end": 297, - "line_start": 289, - "score": 1.529, - "severity": "critical", - "suggestion": "Update both mock files to match the current interface signatures exactly:\n```go\nfunc (m *MockStorageProvider) SetConfig(ctx context.Context, key string, value string, updatedBy string) error {\n args := m.Called(ctx, key, value, updatedBy)\n return args.Error(0)\n}\nfunc (m *MockStorageProvider) GetConfig(ctx context.Context, key string) (*storage.ConfigEntry, 
error) {\n args := m.Called(ctx, key)\n if args.Get(0) == nil {\n return nil, args.Error(1)\n }\n return args.Get(0).(*storage.ConfigEntry), args.Error(1)\n}\nfunc (m *MockStorageProvider) ListConfigs(ctx context.Context) ([]*storage.ConfigEntry, error) {\n args := m.Called(ctx)\n if args.Get(0) == nil {\n return nil, args.Error(1)\n }\n return args.Get(0).([]*storage.ConfigEntry), args.Error(1)\n}\nfunc (m *MockStorageProvider) DeleteConfig(ctx context.Context, key string) error {\n args := m.Called(ctx, key)\n return args.Error(0)\n}\n```\nApply the same fix to `internal/handlers/execute_test.go`.", - "tags": [ - "interface-mismatch", - "test", - "compile-error", - "integration-test" - ], - "title": "MockStorageProvider implements SetConfig/GetConfig with wrong signatures and is missing ListConfigs and DeleteConfig entirely" - }, - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "When `AGENTFIELD_CONFIG_SOURCE=db` is set, `mergeDBConfig` in `config_db.go:87-89` replaces the **entire** `target.Features.DID` struct \u2014 including `Authorization.AdminToken` and `Authorization.InternalToken` \u2014 with values from the DB-stored YAML if `dbCfg.Features.DID.Method != \"\"`.\n\n```go\n// config_db.go:86-89\nif dbCfg.Features.DID.Method != \"\" {\n target.Features.DID = dbCfg.Features.DID // replaces AdminToken, InternalToken, all auth config\n}\n```\n\nThe comment at line 94 says `// API settings (but never override API key from DB for security)` and correctly protects `API.Auth.APIKey`. However, `AdminToken` (used to guard admin routes including tag approval, policy management, and the config routes themselves) and `InternalToken` (used as bearer for agent-to-agent calls) are both nested under `Features.DID.Authorization` and are **not similarly protected**.\n\nAttack chain:\n1. 
Attacker calls `PUT /api/v1/configs/agentfield.yaml` with a YAML body containing `features.did.method: did:key` and `features.did.authorization.admin_token: attacker-controlled-token` (unauthenticated, due to Finding 1).\n2. Attacker calls `POST /api/v1/configs/reload` to trigger `overlayDBConfig`.\n3. `mergeDBConfig` sees `dbCfg.Features.DID.Method == \"did:key\"` (non-empty), replaces `target.Features.DID` entirely, overwriting `AdminToken` with the attacker-controlled value.\n4. Attacker now has full `X-Admin-Token` admin access over tag approval, policy management, and all future admin routes.", - "confidence": 0.92, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "auth-config-crud", - "dimension_name": "Config CRUD Route Authorization Gap", - "evidence": "Step 1: Attacker sends `PUT /api/v1/configs/agentfield.yaml` with body `features:\\n did:\\n method: did:key\\n authorization:\\n admin_token: evil-token` \u2014 unauthenticated because `APIKeyAuth` is a no-op when `api_key` is empty (Finding 1).\nStep 2: `SetConfig` at config_storage.go:85 calls `h.storage.SetConfig(ctx, \"agentfield.yaml\", body, \"api\")` \u2014 no validation or sanitization of the YAML content.\nStep 3: Attacker sends `POST /api/v1/configs/reload`. 
`ReloadConfig` at config_storage.go:121 calls `h.reloadFn()` which calls `overlayDBConfig(s.config, s.storage)` (server.go:440).\nStep 4: `overlayDBConfig` at config_db.go:37-42 parses the stored YAML into `dbCfg` and calls `mergeDBConfig(cfg, &dbCfg)`.\nStep 5: `mergeDBConfig` at config_db.go:87-89: `dbCfg.Features.DID.Method == \"did:key\"` (non-empty), so `target.Features.DID = dbCfg.Features.DID` executes, replacing `Authorization.AdminToken` with `evil-token`.\nStep 6: Subsequent requests using `X-Admin-Token: evil-token` are accepted by `AdminTokenAuth` at middleware/auth.go:99.", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_008", - "line_end": 89, - "line_start": 87, - "score": 1.435, - "severity": "critical", - "suggestion": "Add explicit protection in `mergeDBConfig` for security-sensitive fields inside `Features.DID`, mirroring the API key protection at line 94:\n\n```go\nif dbCfg.Features.DID.Method != \"\" {\n // Preserve security-sensitive authorization tokens \u2014 must come from file/env only\n savedAdminToken := target.Features.DID.Authorization.AdminToken\n savedInternalToken := target.Features.DID.Authorization.InternalToken\n target.Features.DID = dbCfg.Features.DID\n target.Features.DID.Authorization.AdminToken = savedAdminToken\n target.Features.DID.Authorization.InternalToken = savedInternalToken\n}\n```\n\nLong-term, fixing Finding 1 (adding AdminTokenAuth to the config routes) removes the unauthenticated write path, making this a defense-in-depth item. Both fixes should be applied.", - "tags": [ - "security", - "authorization-bypass", - "config-injection", - "token-overwrite" - ], - "title": "PUT /configs/agentfield.yaml can overwrite admin_token and internal_token via mergeDBConfig when DID.Method is set" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The comment at line 1550 says `// Config storage routes (admin-authenticated)` but **no `AdminTokenAuth` middleware is applied**. 
The routes are registered directly on `agentAPI` (the bare `/api/v1` group) with no sub-group and no `.Use(middleware.AdminTokenAuth(...))` call.\n\nCompare this with lines 1532\u20131545 where the actual admin-protected routes are set up:\n\n```go\n// Lines 1532-1545 \u2014 ACTUAL admin auth\nadminGroup := agentAPI.Group(\"\")\nadminGroup.Use(middleware.AdminTokenAuth(s.config.Features.DID.Authorization.AdminToken))\n```\n\nBut the config routes at lines 1551\u20131554 are:\n\n```go\n// Lines 1550-1555 \u2014 NO admin auth applied\n{\n configHandlers := handlers.NewConfigStorageHandlers(s.storage, s.configReloadFn())\n configHandlers.RegisterRoutes(agentAPI) // directly on agentAPI, NOT on adminGroup\n}\n```\n\nThe **only** protection is the global `middleware.APIKeyAuth` at line 881. As confirmed in `middleware/auth.go:26-29`, when `config.APIKey == \"\"` the middleware is an explicit no-op (`c.Next()` is called immediately). The default `agentfield.yaml` in the repo has **no `api.auth.api_key` field at all**, meaning `cfg.API.Auth.APIKey` is the zero value (empty string). 
The dev environment therefore runs fully unauthenticated.\n\nThis means on any default or dev deployment:\n- `GET /api/v1/configs` \u2014 lists **all** stored configuration entries including `agentfield.yaml`\n- `GET /api/v1/configs/agentfield.yaml` \u2014 returns the full config YAML including `admin_token`, `internal_token`, `webhook_secret`, DID keystore config\n- `PUT /api/v1/configs/agentfield.yaml` \u2014 overwrites the stored config, and if `AGENTFIELD_CONFIG_SOURCE=db` is set, `POST /api/v1/configs/reload` activates it, allowing an attacker to replace `admin_token`, `cors.allowed_origins`, DID authorization settings, etc.\n- `DELETE /api/v1/configs/:key` \u2014 deletes any stored configuration key", - "confidence": 0.98, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "auth-config-crud", - "dimension_name": "Config CRUD Route Authorization Gap", - "evidence": "Step 1: `setupRoutes()` (server.go:831) registers global middleware including `middleware.APIKeyAuth(middleware.AuthConfig{APIKey: s.config.API.Auth.APIKey, ...})` at line 881.\nStep 2: `middleware.APIKeyAuth` at `middleware/auth.go:26-29` returns `c.Next()` immediately when `config.APIKey == \"\"`.\nStep 3: `agentfield.yaml` (config/agentfield.yaml) has no `api.auth.api_key` key at all. `AuthConfig.APIKey` is an untagged Go string, defaulting to `\"\"`. The `applyEnvOverrides` function at config.go:263 only overrides if `AGENTFIELD_API_KEY` env var is non-empty.\nStep 4: With no API key set, the global middleware is a no-op. 
No other middleware guards the `/api/v1/configs` routes.\nStep 5: `configHandlers.RegisterRoutes(agentAPI)` at server.go:1553 calls `group.GET(\"/configs\", ...)`, `group.GET(\"/configs/:key\", ...)`, `group.PUT(\"/configs/:key\", ...)`, `group.DELETE(\"/configs/:key\", ...)`, and `group.POST(\"/configs/reload\", ...)` directly on the unauthenticated `agentAPI` group (server.go:1164 `agentAPI := s.Router.Group(\"/api/v1\")`).\nStep 6: `GetConfig` at config_storage.go:51-63 calls `h.storage.GetConfig(ctx, key)` and returns the full entry value without redaction. `ListConfigs` at config_storage.go:35-48 returns all entries.\nStep 7: Any unauthenticated HTTP client can `curl http://localhost:8080/api/v1/configs/agentfield.yaml` and receive the stored YAML including secrets.", - "file_path": "control-plane/internal/server/server.go", - "id": "f_007", - "line_end": 1555, - "line_start": 1550, - "score": 1.176, - "severity": "critical", - "suggestion": "Create a dedicated sub-group with `AdminTokenAuth` applied before registering config routes, mirroring the pattern used for tag-approval and access-policy admin routes (lines 1532\u20131545):\n\n```go\n// Config storage routes \u2014 require admin token\nconfigAdminGroup := agentAPI.Group(\"\")\nconfigAdminGroup.Use(middleware.AdminTokenAuth(s.config.Features.DID.Authorization.AdminToken))\nconfigHandlers := handlers.NewConfigStorageHandlers(s.storage, s.configReloadFn())\nconfigHandlers.RegisterRoutes(configAdminGroup)\n```\n\nNote: `AdminTokenAuth` is itself a no-op when `adminToken == \"\"` (see `middleware/auth.go:92-94`), so the admin token must also be required to be non-empty for this to be effective in production. 
Add a startup warning (similar to line 268) if the config routes are reachable but `AdminToken` is empty.", - "tags": [ - "security", - "authentication", - "authorization", - "missing-auth" - ], - "title": "Config CRUD routes are not admin-authenticated: comment is false, no AdminTokenAuth applied" - }, - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "The `ReloadConfig` handler returns:\n\n```json\n{\"message\": \"config reloaded from database\"}\n```\n\nwith `HTTP 200` when `reloadFn()` succeeds. However, `reloadFn` is `overlayDBConfig`, which **only mutates the in-memory `*config.Config` struct**. As established by the other findings in this review, the overwhelming majority of services that consume config values have already copied those values at construction time and will not observe any change:\n\n- `ExecutionCleanupService` \u2014 reads retention period, cleanup interval, batch size from its own frozen copy\n- `HealthMonitor` \u2014 uses a frozen check interval ticker\n- `WebhookDispatcher` \u2014 uses a frozen `http.Client` timeout\n- `ExecuteHandler`/`ExecuteAsyncHandler` \u2014 use a frozen agent-call timeout\n- `ApprovalWebhookHandler` \u2014 uses a frozen HMAC secret\n- CORS middleware \u2014 configured once at `setupRoutes()` from the config values at that time\n- API key auth middleware \u2014 similarly frozen at route registration\n\nThe only fields that _are_ lazily re-read (because handlers call `s.config.*` directly) are a small subset of route-guard conditions checked on each request. But these are not what callers typically expect to change via a config reload.\n\nThere is **no documented contract** in the handler, any comment block, or any API response body that tells callers which fields are applied immediately versus which require a restart. 
A caller who updates `execution_cleanup.retention_period` in the DB, calls `POST /configs/reload`, receives `HTTP 200 \"config reloaded from database\"`, and concludes the cleanup service is now running with the new retention period is completely misled.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "config-reload-behavioral-contract", - "dimension_name": "Config Reload Behavioral Contract", - "evidence": "Step 1: `config_storage.go:121` calls `h.reloadFn()` which is `overlayDBConfig(s.config, s.storage)` (server.go:440).\nStep 2: `overlayDBConfig` calls `mergeDBConfig` which writes to fields of `*config.Config` in place (config_db.go:42,54-102).\nStep 3: All background services examined hold value copies of the mutated fields (see companion findings above).\nStep 4: `config_storage.go:128` returns `{\"message\": \"config reloaded from database\"}` \u2014 no qualification, no list of affected vs. unaffected subsystems.\nStep 5: No code comment, no API documentation file, and no OpenAPI annotation in the target files describes which fields are hot-reloadable.", - "file_path": "control-plane/internal/handlers/config_storage.go", - "id": "f_018", - "line_end": 128, - "line_start": 121, - "score": 1.037, - "severity": "important", - "suggestion": "The response body should be honest about what was applied. At minimum, add a disclaimer: return a structured body listing which config sections were merged and a note that changes to cleanup intervals, health monitor timings, webhook settings, and execution timeouts require a server restart to take effect. 
Longer term, either (a) implement true hot-reload for each service via `Reconfigure()` methods and enumerate the actually-reloaded subsystems in the response, or (b) make the API contract explicit in documentation and return a `partial_reload` status with a list of fields that only take effect after restart.", - "tags": [ - "api-contract", - "config-reload", - "misleading-response", - "behavioral-contract" - ], - "title": "POST /configs/reload returns HTTP 200 with a success message even though most running services are unaffected by the reload" - }, - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "The `config_storage` table is created via two independent mechanisms that are never coordinated:\n\n1. **GORM AutoMigrate** (`migrations.go:236`): `&ConfigStorageModel{}` is included in the `autoMigrateSchema` call, which runs unconditionally on every server startup for **both** `local` (SQLite) and `postgres` modes.\n2. **Goose SQL migration** (`028_create_config_storage.sql`): A standalone DDL file intended to be run manually via `goose -dir ./migrations postgres ... up` before the server starts in PostgreSQL mode.\n\nEvery other model that has a Goose migration file also relies on GORM AutoMigrate for its schema (e.g., `DIDDocumentModel` \u2194 `019_create_did_documents.sql`, `AccessPolicyModel` \u2194 `021_create_access_policies.sql`, `AgentTagVCModel` \u2194 `022_create_agent_tag_vcs.sql`). This is the **established pattern** for this codebase: Goose files are the PostgreSQL-mode canonical DDL, and GORM AutoMigrate handles schema reconciliation on startup. `config_storage` follows this same dual-path \u2014 so the pattern is consistent \u2014 but the **design itself** is an undocumented hazard for future maintainers.\n\nThe critical risk is schema divergence over time. If a developer adds a column to `ConfigStorageModel` (e.g., `Tags string`), GORM AutoMigrate will silently add that column to both SQLite and PostgreSQL. 
But Goose migration `028` will not be updated. The reverse is equally true: if someone adds a `CHECK` constraint in a new Goose migration `029_alter_config_storage.sql`, GORM AutoMigrate will not reproduce it on a fresh install that skips Goose. Because neither mechanism has visibility into what the other has done, schema drift is a when-not-if scenario.", - "confidence": 0.92, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "dual-track-schema-management", - "dimension_name": "Dual-Track Schema Management: AutoMigrate vs Goose", - "evidence": "Step 1: `StorageFactory.CreateStorage` (storage.go:350) calls `pgStorage.Initialize(ctx, ...)` for postgres mode.\nStep 2: `Initialize` (local.go:534) calls `ls.initializePostgres(ctx)`.\nStep 3: `initializePostgres` (local.go:734) calls `ls.createSchema(ctx)`.\nStep 4: `createSchema` (local.go:862) calls `ls.autoMigrateSchema(ctx)` unconditionally, which includes `&ConfigStorageModel{}` (migrations.go:236), creating the table via GORM.\nStep 5: The CLAUDE.md documentation instructs operators to also run `goose -dir ./migrations postgres ... up` before starting in PostgreSQL mode, which would also execute `028_create_config_storage.sql` (with `CREATE TABLE IF NOT EXISTS`, so no hard error, but the DDL is effectively applied twice from two separate sources).\nStep 6: No mechanism prevents `ConfigStorageModel` fields from being changed in models.go without a corresponding Goose migration update.", - "file_path": "control-plane/internal/storage/migrations.go", - "id": "f_003", - "line_end": 236, - "line_start": 236, - "score": 1.005, - "severity": "important", - "suggestion": "Document explicitly (in a comment in `migrations.go` near the AutoMigrate list, and in a header comment in `028_create_config_storage.sql`) that for PostgreSQL mode, the Goose file is the authoritative DDL for initial creation and structural constraints, while GORM AutoMigrate handles additive column additions. 
Add a CI check or test that compares the column set of the GORM model struct against the columns created by the corresponding Goose migration, to detect drift early. Alternatively, adopt the stricter approach used by `kv_store`, `distributed_locks`, and `memory_events` tables: create them entirely via `ensurePostgres*` helper functions (Go code with `CREATE TABLE IF NOT EXISTS`), removing the Goose SQL file entirely for purely application-managed tables.", - "tags": [ - "schema-management", - "migration-pattern", - "maintenance-hazard", - "postgresql" - ], - "title": "Dual-path schema creation for config_storage breaks the established single-source-of-truth migration pattern" - }, - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "The comment on `AdminTokenAuth` says *\"falls back to global API key auth\"* when `adminToken` is empty. However, the global API key auth is **also** a no-op when `api_key` is empty (confirmed above). The combination means: in the default `agentfield.yaml` configuration where `admin_token: \"admin-secret\"` is set, admin routes are protected \u2014 but any operator who forgets to set `admin_token` in production leaves admin routes fully open.\n\nMore critically for the existing admin group (lines 1532\u20131545), the empty-token guard for `AdminTokenAuth` is the **only** runtime protection difference between DID being enabled and not. 
The code at server.go:1531 wraps the admin group in a conditional `if s.config.Features.DID.Authorization.Enabled`, but if `Enabled` is `true` and `AdminToken` is `\"\"`, `AdminTokenAuth` is still a no-op.\n\nWhile the default `agentfield.yaml` does ship with `admin_token: \"admin-secret\"` (line 96 of agentfield.yaml), this is a **well-known default credential** that many operators will fail to rotate, providing essentially no real protection.", - "confidence": 0.88, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "auth-config-crud", - "dimension_name": "Config CRUD Route Authorization Gap", - "evidence": "Step 1: `agentfield.yaml:96` sets `admin_token: \"admin-secret\"` \u2014 a known hardcoded default.\nStep 2: If an operator deploys without overriding this, `s.config.Features.DID.Authorization.AdminToken == \"admin-secret\"`.\nStep 3: `AdminTokenAuth(\"admin-secret\")` at middleware/auth.go:99 requires `X-Admin-Token: admin-secret`. Since this value is in the public repo, any attacker who reads the documentation or source code can trivially provide this header.\nStep 4: For the no-api-key case, `middleware.APIKeyAuth` no-ops at line 26-29, so the fallback described in the comment provides zero protection.\nStep 5: `middleware/auth.go:92-94`: `if adminToken == \"\" { c.Next(); return }` \u2014 if AdminToken is unset, all admin route requests pass through.", - "file_path": "control-plane/internal/server/middleware/auth.go", - "id": "f_009", - "line_end": 95, - "line_start": 90, - "score": 0.961, - "severity": "important", - "suggestion": "1. Add a hard startup failure (not just a warning) when `Authorization.Enabled == true && AdminToken == \"\"`. The existing log message at server.go:268 is a warning; it should be a fatal error or at minimum should disable the admin routes entirely.\n2. 
Consider shipping with an empty `admin_token` in the default config and requiring operators to explicitly set it, rather than shipping a known-bad default (`admin-secret`).\n3. When `AdminTokenAuth` receives an empty token, it should deny all requests rather than being a no-op, since a missing token configuration is a security misconfiguration, not a deliberate bypass.", - "tags": [ - "security", - "default-credentials", - "misconfiguration", - "no-op-middleware" - ], - "title": "AdminTokenAuth is a no-op when adminToken is empty \u2014 existing admin routes (tag approval, policy management) are unprotected in default dev config" - }, - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "`GetConfig` at line 5186 checks for the not-found condition by comparing the error's string representation:\n\n```go\nif err.Error() == \"sql: no rows in result set\" {\n return nil, nil\n}\n```\n\nThis is fragile for two reasons:\n\n1. **Driver-dependent string**: The message `\"sql: no rows in result set\"` is the canonical text for `sql.ErrNoRows`, but the comparison bypasses the sentinel value. If any driver wraps `sql.ErrNoRows` (e.g., with `fmt.Errorf(\"...: %w\", sql.ErrNoRows)`), `errors.Is` would still match, but the string comparison would fail \u2014 causing a generic `\"failed to get config\"` error instead of the intended `nil, nil` (not-found) return.\n\n2. **Inconsistency**: Every other `GetX` method in `local.go` uses the idiomatic `errors.Is(err, sql.ErrNoRows)` pattern (e.g., `GetWorkflowRun` at line 300: `if errors.Is(err, sql.ErrNoRows) { return nil, nil }`). This deviation from the established pattern is a latent defect.\n\nThe downstream caller `config_db.go:27` relies on `entry == nil` to mean \"not found\" and prints an informational message. 
If the string comparison fails under a different driver or future wrapping, `overlayDBConfig` would instead return an error and potentially block server startup.", - "confidence": 0.85, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "interface-compliance", - "dimension_name": "StorageProvider Interface Implementation Completeness", - "evidence": "Step 1: `GetConfig` at local.go:5185-5188 checks `err.Error() == \"sql: no rows in result set\"` to detect missing rows.\nStep 2: `sql.ErrNoRows` is defined in `database/sql` as `var ErrNoRows = errors.New(\"sql: no rows in result set\")` \u2014 the string match coincidentally works today with direct `sql.QueryRowContext` usage.\nStep 3: But `errors.Is(err, sql.ErrNoRows)` is the correct, future-proof idiom \u2014 used by the same file at line 300 (`GetWorkflowRun`), line 302: `if errors.Is(err, sql.ErrNoRows)`.\nStep 4: If the underlying row scan ever returns a wrapped error (driver upgrade, middleware), `err.Error()` will not equal the bare string, causing a generic error to propagate instead of the nil-not-found signal.\nStep 5: `config_db.go:27-29` consumes the nil return from `GetConfig` as \"no config in DB\" and silently continues; a spurious error here would cause `overlayDBConfig` to return an error, propagating to server startup.", - "file_path": "control-plane/internal/storage/local.go", - "id": "f_001", - "line_end": 5187, - "line_start": 5186, - "score": 0.928, - "severity": "important", - "suggestion": "Replace the string comparison with the standard sentinel check, consistent with the rest of the file:\n```go\nif errors.Is(err, sql.ErrNoRows) {\n return nil, nil\n}\n```\nThe `errors` package is already imported at line 8 of `local.go`.", - "tags": [ - "error-handling", - "fragile-comparison", - "not-found" - ], - "title": "GetConfig uses fragile string comparison instead of errors.Is(sql.ErrNoRows) for not-found detection" - }, - { - "active_multipliers": [ - "adversary_confirmed", - 
"ai_generated_pr" - ], - "body": "The `GetConfig` implementation detects a missing key by comparing the error string:\n\n```go\nif err.Error() == \"sql: no rows in result set\" {\n return nil, nil\n}\n```\n\nThis is the critical code path that `overlayDBConfig` depends on for safe early-return when `agentfield.yaml` does not exist in the DB. The guard in `overlayDBConfig` at line 27 (`if entry == nil { return nil }`) is only safe **if** `GetConfig` reliably returns `(nil, nil)` for a not-found key.\n\nThe string comparison is fragile for two concrete reasons:\n\n1. **Standard library contract:** `database/sql` defines `sql.ErrNoRows` as a sentinel error. The idiomatic and safe check is `errors.Is(err, sql.ErrNoRows)`. The string `\"sql: no rows in result set\"` is the `.Error()` text of `sql.ErrNoRows` \u2014 but it is not part of the public API and could change between Go versions.\n\n2. **Wrapped errors:** If any middleware, driver wrapper, or future refactoring wraps the `sql.ErrNoRows` error (e.g., `fmt.Errorf(\"scan failed: %w\", err)`), `err.Error()` will no longer match the literal string, but `errors.Is(err, sql.ErrNoRows)` would still return `true`. 
A wrapped error would fall through to the generic error path and return `(nil, wrappedError)`, causing `overlayDBConfig` to fail with `\"failed to read config from database\"` instead of silently skipping the DB config \u2014 a behavioral regression that would break startup whenever the DB config key is absent.\n\nWhile the current code works today (the string is stable in the standard `database/sql` implementation), this is an API contract violation that creates a latent bug.", - "confidence": 0.85, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "config-db-runtime-trace", - "dimension_name": "overlayDBConfig Runtime Execution Trace", - "evidence": "Step 1: `overlayDBConfig` (config_db.go:23) calls `store.GetConfig(ctx, \"agentfield.yaml\")`.\nStep 2: `LocalStorage.GetConfig` (local.go) executes `SELECT ... WHERE key = ?` / `$1`.\nStep 3: If key is absent, `row.Scan` returns `sql.ErrNoRows`.\nStep 4: The implementation checks `err.Error() == \"sql: no rows in result set\"` \u2014 a string literal, not `errors.Is(err, sql.ErrNoRows)`.\nStep 5: If the error is wrapped at any layer (now or in a future refactor), `err.Error()` no longer matches the literal, the condition is false, and the function returns `(nil, fmt.Errorf(\"failed to get config %q: %w\", key, err))`.\nStep 6: `overlayDBConfig` receives `(nil, nonNilError)`, hits the `if err != nil` branch at line 24, and returns `fmt.Errorf(\"failed to read config from database: %w\", err)`.\nStep 7: Server startup fails with an error even though no DB config was intended \u2014 a silent regression triggered by any error-wrapping change in the storage stack.", - "file_path": "control-plane/internal/storage/local.go", - "id": "f_013", - "line_end": 5183, - "line_start": 5179, - "score": 0.928, - "severity": "important", - "suggestion": "Replace the string comparison with `errors.Is`:\n\n```go\nimport (\n \"database/sql\"\n \"errors\"\n)\n\nif errors.Is(err, sql.ErrNoRows) {\n return nil, 
nil\n}\n```\n\nThis is both idiomatic Go and resilient to error wrapping. No behavioral change for the current code path.", - "tags": [ - "error-handling", - "api-contract", - "sql", - "fragile-comparison", - "startup-path" - ], - "title": "Fragile `no rows` detection via string comparison instead of `errors.Is(sql.ErrNoRows)`" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "The config storage routes (`GET/PUT/DELETE /api/v1/configs/:key`, `GET /api/v1/configs`, `POST /api/v1/configs/reload`) are registered directly on the `agentAPI` group at line 1553 via `configHandlers.RegisterRoutes(agentAPI)`. The `agentAPI` group itself has **no middleware** \u2014 authentication is only provided by the global `s.Router.Use(middleware.APIKeyAuth(...))` applied at line 881.\n\nThe `APIKeyAuth` middleware has an explicit early-return when the configured key is empty:\n```go\n// No auth configured, allow everything.\nif config.APIKey == \"\" {\n c.Next()\n return\n}\n```\n\nWhen `AGENTFIELD_API_KEY` / `s.config.API.Auth.APIKey` is not set (which is the default in local/dev mode), **every** config endpoint \u2014 including `PUT /api/v1/configs/:key` (write arbitrary config), `DELETE /api/v1/configs/:key`, and `POST /api/v1/configs/reload` \u2014 is fully unauthenticated and accessible to any HTTP client with network access.\n\nContrast this with the comment on line 1550 which says \"admin-authenticated\": this is **misleading** \u2014 no admin token (`AdminTokenAuth`) is enforced here. The connector-facing duplicate at line 1572\u20131578 at least sits behind `ConnectorTokenAuth` + `ConnectorCapabilityCheck`. 
The `agentAPI`-facing endpoints have no equivalent protection beyond the optional global API key.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "dual-config-route-registration", - "dimension_name": "Dual Registration of Config Routes", - "evidence": "Step 1: Global auth is registered at server.go:881 \u2014 `s.Router.Use(middleware.APIKeyAuth(middleware.AuthConfig{APIKey: s.config.API.Auth.APIKey, ...}))`. Step 2: `middleware.APIKeyAuth` (middleware/auth.go:26) returns early with `c.Next()` when `config.APIKey == \"\"`. Step 3: `agentAPI` is created at server.go:1164 as `s.Router.Group(\"/api/v1\")` with no middleware of its own. Step 4: `configHandlers.RegisterRoutes(agentAPI)` at server.go:1553 registers `PUT /api/v1/configs/:key`, `DELETE /api/v1/configs/:key`, and `POST /api/v1/configs/reload` directly on that group. Step 5: With default configuration (no API key set), any unauthenticated HTTP request to `PUT /api/v1/configs/some-key` with arbitrary body will write to the config store and return 200 OK.", - "file_path": "control-plane/internal/server/server.go", - "id": "f_011", - "line_end": 1555, - "line_start": 1550, - "score": 0.798, - "severity": "important", - "suggestion": "Register the config routes on a sub-group that requires the admin token middleware, consistent with how other admin-only routes are handled (e.g., the `adminGroup` created at line 1532). 
Replace:\n```go\n// Config storage routes (admin-authenticated)\n{\n configHandlers := handlers.NewConfigStorageHandlers(s.storage, s.configReloadFn())\n configHandlers.RegisterRoutes(agentAPI)\n}\n```\nwith:\n```go\n// Config storage routes (admin-authenticated)\n{\n cfgAdminGroup := agentAPI.Group(\"\")\n cfgAdminGroup.Use(middleware.AdminTokenAuth(s.config.Features.DID.Authorization.AdminToken))\n configHandlers := handlers.NewConfigStorageHandlers(s.storage, s.configReloadFn())\n configHandlers.RegisterRoutes(cfgAdminGroup)\n}\n```\nAlternatively, reuse the existing `adminGroup` (lines 1532\u20131545) if DID authorization is enabled, but ensure a fallback exists when it is not.", - "tags": [ - "security", - "authentication", - "misconfiguration" - ], - "title": "Config routes registered on unauthenticated `agentAPI` group \u2014 no dedicated auth guard" - }, - { - "active_multipliers": [ - "ai_generated_pr" - ], - "body": "Multiple handler and service constructors eagerly copy scalar config values at startup, making them permanently immune to reload:\n\n**1. WebhookDispatcher** (`server.go:366-371`) copies `WebhookTimeout`, `WebhookMaxAttempts`, `WebhookRetryBackoff`, and `WebhookMaxRetryBackoff` into a `WebhookDispatcherConfig` struct stored by value in `webhookDispatcher.cfg`. The `http.Client` timeout (`webhook_dispatcher.go:71-73`) is set once from this config and never updated.\n\n**2. `ExecuteHandler` and `ExecuteAsyncHandler`** (`server.go:1246-1247`) copy `cfg.AgentField.ExecutionQueue.AgentCallTimeout` and `cfg.Features.DID.Authorization.InternalToken` as bare `time.Duration` and `string` values into the `executionController` struct (`execute.go:198-212`). Even if `overlayDBConfig` were to update these fields, the registered route closures hold independent copies.\n\n**3. `ApprovalWebhookHandler`** (`server.go:1267`) passes `cfg.AgentField.Approval.WebhookSecret` as a `string` argument. 
The `webhookApprovalController` captures this string at registration time (`webhook_approval.go:127-129`). A DB reload that changes the HMAC secret will leave the running handler verifying against the old secret.\n\nIn all three cases, the issue is the same: `setupRoutes()` is called once at `Start()` time, and all handler constructors receive primitive copies of config fields. There is no mechanism to re-register routes or re-inject values after `configReloadFn` runs.", - "confidence": 0.93, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "config-reload-behavioral-contract", - "dimension_name": "Config Reload Behavioral Contract", - "evidence": "Step 1: `server.go:366-371` constructs `WebhookDispatcherConfig` by copying four scalar values from `cfg`; `webhook_dispatcher.go:66-74` stores this config by value and bakes the timeout into `http.Client{Timeout: normalized.Timeout}`.\nStep 2: `server.go:1246` passes `s.config.AgentField.ExecutionQueue.AgentCallTimeout` as a `time.Duration` argument; `execute.go:169,198-212` stores it in `executionController.timeout` \u2014 a plain struct field.\nStep 3: `server.go:1267` passes `s.config.AgentField.Approval.WebhookSecret` as a `string`; `webhook_approval.go:127-129` stores it in `webhookApprovalController.webhookSecret`.\nStep 4: `server.go:439-441` shows `configReloadFn` only mutates `s.config` in memory; `setupRoutes()` is never called again, so no handler is re-registered with new values.", - "file_path": "control-plane/internal/server/server.go", - "id": "f_017", - "line_end": 371, - "line_start": 366, - "score": 0.781, - "severity": "important", - "suggestion": "For operational parameters that must be hot-reloadable (timeouts, retry counts, secrets), pass the parent `*config.Config` pointer into handlers and read values lazily on each request, or wrap the values behind an `atomic.Value` / `sync.RWMutex`-protected struct updated by the reload function. 
For the webhook secret specifically, changing HMAC validation secrets mid-flight is a security-sensitive operation that should be explicitly documented as requiring a restart, since there is a window where in-flight requests with the old signature will be rejected.", - "tags": [ - "eager-copy", - "config-reload", - "behavioral-contract", - "webhook", - "security" - ], - "title": "WebhookDispatcher and ExecuteHandler/ApprovalWebhookHandler capture config values eagerly: reload cannot change webhook timeouts, agent-call timeout, secrets, or internal token" - }, - { - "active_multipliers": [ - "adversary_challenged", - "ai_generated_pr" - ], - "body": "**`ExecutionCleanupService`** stores `config.ExecutionCleanupConfig` as a **value copy** (not a pointer) in its struct field:\n\n```go\n// execution_cleanup.go:16\ntype ExecutionCleanupService struct {\n storage storage.StorageProvider\n config config.ExecutionCleanupConfig // value copy, not *config.ExecutionCleanupConfig\n ...\n}\n```\n\nAt construction time (`server.go:392`), the current value of `cfg.AgentField.ExecutionCleanup` is copied into the service struct:\n\n```go\ncleanupService := handlers.NewExecutionCleanupService(storageProvider, cfg.AgentField.ExecutionCleanup)\n```\n\nThe `cleanupLoop` (`execution_cleanup.go:96`) creates a `time.NewTicker(ecs.config.CleanupInterval)` from this value-copy and then **never re-reads the config**. The `performCleanup` method reads `ecs.config.RetentionPeriod`, `ecs.config.BatchSize`, and `ecs.config.StaleExecutionTimeout` directly from the same frozen copy.\n\nWhen `POST /configs/reload` is called, `overlayDBConfig` mutates the in-memory `*config.Config` struct (e.g., updating `cfg.AgentField.ExecutionCleanup.RetentionPeriod`), but the running `ExecutionCleanupService` goroutine holds its own copy \u2014 those fields are **never updated**. 
A caller who changes the retention period from 72h to 24h via the DB config and then calls reload will see the old 72h behavior continue until the server restarts.", - "confidence": 0.97, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "config-reload-behavioral-contract", - "dimension_name": "Config Reload Behavioral Contract", - "evidence": "Step 1: `server.go:392` calls `handlers.NewExecutionCleanupService(storageProvider, cfg.AgentField.ExecutionCleanup)` \u2014 passing the config struct by value.\nStep 2: `execution_cleanup.go:29-35` stores this value in `ecs.config config.ExecutionCleanupConfig` (not a pointer).\nStep 3: `execution_cleanup.go:96` calls `time.NewTicker(ecs.config.CleanupInterval)` \u2014 ticker interval is baked in at goroutine start.\nStep 4: `execution_cleanup.go:124,125,134,147,164` read `ecs.config.RetentionPeriod`, `ecs.config.BatchSize`, `ecs.config.StaleExecutionTimeout` from the frozen copy on every invocation.\nStep 5: `server.go:439-441` (configReloadFn) calls `overlayDBConfig(s.config, s.storage)` which mutates `s.config.AgentField.ExecutionCleanup` in place.\nStep 6: The `cleanupService` struct holds a completely independent copy \u2014 no path exists from the mutated `s.config` to the running service's fields.", - "file_path": "control-plane/internal/server/server.go", - "id": "f_015", - "line_end": 392, - "line_start": 392, - "score": 0.407, - "severity": "important", - "suggestion": "Change `ExecutionCleanupService.config` to a pointer (`*config.ExecutionCleanupConfig`) or wrap it in an atomic/sync-protected accessor. Then pass a pointer at construction: `handlers.NewExecutionCleanupService(storageProvider, &cfg.AgentField.ExecutionCleanup)`. The cleanup loop and `performCleanup` will then read through the pointer and observe any in-place mutations made by `overlayDBConfig`. 
Alternatively, if pointer semantics are avoided, add a `UpdateConfig(cfg config.ExecutionCleanupConfig)` method and call it from within `configReloadFn`.", - "tags": [ - "eager-copy", - "config-reload", - "behavioral-contract", - "cleanup-service" - ], - "title": "ExecutionCleanupService copies config by value at construction: reload has zero effect on running behavior" - }, - { - "active_multipliers": [ - "adversary_challenged", - "ai_generated_pr" - ], - "body": "**`HealthMonitor`** receives a `HealthMonitorConfig` struct **by value** at construction and stores it as `hm.config HealthMonitorConfig` (not a pointer):\n\n```go\n// health_monitor.go:50\ntype HealthMonitor struct {\n config HealthMonitorConfig // value copy\n ...\n}\n```\n\nThe `Start()` method at `health_monitor.go:217` creates a ticker from the frozen copy:\n\n```go\nticker := time.NewTicker(hm.config.CheckInterval)\n```\n\nThis ticker is never reset after construction. When `POST /configs/reload` mutates `s.config.AgentField.NodeHealth.CheckInterval` via `overlayDBConfig` (`config_db.go:59-61`), the health monitor's loop continues running at the original check interval indefinitely.\n\nSimilarly, `cfg.AgentField.NodeHealth.HeartbeatStaleThreshold` is copied at construction into `StatusManagerConfig` (`server.go:137`), and that struct is also stored by value in the `StatusManager`. 
None of these operational parameters take effect until restart.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "config-reload-behavioral-contract", - "dimension_name": "Config Reload Behavioral Contract", - "evidence": "Step 1: `server.go:160-165` constructs `healthMonitorConfig` from `cfg.AgentField.NodeHealth.*` by value.\nStep 2: `server.go:166` passes this value to `services.NewHealthMonitor(...)`, which stores it in `hm.config` at `health_monitor.go:85`.\nStep 3: `health_monitor.go:217` calls `time.NewTicker(hm.config.CheckInterval)` once; the ticker is never recreated.\nStep 4: `config_db.go:59-61` shows `mergeDBConfig` updates `target.AgentField.NodeHealth = dbCfg.AgentField.NodeHealth` on reload.\nStep 5: `server.go:439-441` shows `configReloadFn` updates `s.config` via `overlayDBConfig`, but the running `healthMonitor` field holds a fully independent value copy with no reference back to `s.config`.", - "file_path": "control-plane/internal/server/server.go", - "id": "f_016", - "line_end": 166, - "line_start": 160, - "score": 0.399, - "severity": "important", - "suggestion": "Pass a pointer to the config, or add a `Reconfigure(cfg HealthMonitorConfig)` method that stops the existing ticker and restarts the loop with new intervals. For the common case where only the interval changes, the stop/start approach is straightforward: call `hm.Stop()` then restart with the new config. 
If the goal is zero-downtime reconfiguration, store config behind a `sync/atomic.Value` or `sync.RWMutex` and re-read it at each tick loop iteration.", - "tags": [ - "eager-copy", - "config-reload", - "behavioral-contract", - "health-monitor" - ], - "title": "HealthMonitor copies config by value at construction: NodeHealth interval/timeout changes on reload are silently ignored" - }, - { - "active_multipliers": [ - "adversary_challenged", - "ai_generated_pr" - ], - "body": "**`setupRoutes()`** is called once from `Start()` and constructs a `cors.Config` value by copying fields directly from `s.config.API.CORS`:\n\n```go\ncorsConfig := cors.Config{\n AllowOrigins: s.config.API.CORS.AllowedOrigins,\n AllowMethods: s.config.API.CORS.AllowedMethods,\n AllowHeaders: s.config.API.CORS.AllowedHeaders,\n ExposeHeaders: s.config.API.CORS.ExposedHeaders,\n AllowCredentials: s.config.API.CORS.AllowCredentials,\n}\ns.Router.Use(cors.New(corsConfig))\n```\n\nThe `gin-contrib/cors` middleware is a `gin.HandlerFunc` closure that captures the `cors.Config` by value at the time `cors.New()` is called. Even though `mergeDBConfig` (`config_db.go:95-97`) explicitly handles CORS updates:\n\n```go\nif len(dbCfg.API.CORS.AllowedOrigins) > 0 {\n target.API.CORS = dbCfg.API.CORS\n}\n```\n\n...this update reaches `s.config.API.CORS`, but the already-registered Gin middleware closure is completely unaffected. Requests after a reload continue to use the original CORS policy.\n\nThis is particularly notable because CORS is one of the primary reasons an operator would want a runtime config reload \u2014 e.g., to add a new allowed origin for a newly deployed frontend. 
The API surface implies this is a supported use case, but it does not work.", - "confidence": 0.92, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "config-reload-behavioral-contract", - "dimension_name": "Config Reload Behavioral Contract", - "evidence": "Step 1: `server.go:831` `setupRoutes()` is called from `server.go:447` (`s.setupRoutes()`) inside `Start()`.\nStep 2: `server.go:833-839` constructs `cors.Config` from `s.config.API.CORS.*` \u2014 these are value copies (slice headers are copied, not the underlying arrays, but a new CORS config replaces them with new slice references that the middleware never sees).\nStep 3: `server.go:852` calls `s.Router.Use(cors.New(corsConfig))` \u2014 the `gin-contrib/cors` middleware captures this struct at call time.\nStep 4: `config_db.go:95-97` shows `mergeDBConfig` updates `target.API.CORS` in the live `*config.Config`, but the Gin router's middleware chain is immutable after `setupRoutes()` returns.\nStep 5: `configReloadFn` (server.go:439-441) never calls `setupRoutes()` again.", - "file_path": "control-plane/internal/server/server.go", - "id": "f_019", - "line_end": 852, - "line_start": 831, - "score": 0.386, - "severity": "important", - "suggestion": "Use a `sync.RWMutex`-protected wrapper around the CORS config and implement a custom middleware that reads the live config pointer on each request rather than using `cors.New()` at setup time. Alternatively, replace the static middleware with a dynamic closure:\n```go\ns.Router.Use(func(c *gin.Context) {\n // Read current CORS config on each request\n cfg := s.config.API.CORS // reads through the *config.Config pointer\n cors.New(cors.Config{AllowOrigins: cfg.AllowedOrigins, ...})(c)\n})\n```\nNote: this has performance implications (allocates a new middleware on each request). 
A better approach is to cache the `cors.Handler` behind an `atomic.Pointer[cors.Config]` and swap it on reload.", - "tags": [ - "eager-copy", - "config-reload", - "cors", - "middleware", - "behavioral-contract" - ], - "title": "CORS middleware is registered once at startup: reloading API.CORS config has no effect on running requests" - }, - { - "active_multipliers": [ - "adversary_challenged", - "ai_generated_pr" - ], - "body": "Tables with `updated_at` columns in the Goose migrations for this codebase are paired with `BEFORE UPDATE` triggers that call `update_updated_at_column()`. For example:\n- `workflow_runs` (migration 011) has `CREATE TRIGGER update_workflow_runs_updated_at BEFORE UPDATE ... EXECUTE FUNCTION update_updated_at_column()`\n- `workflow_steps` (migration 011) has the same pattern\n\nMigration `028_create_config_storage.sql` defines `updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()` but **does not create a `BEFORE UPDATE` trigger** to keep `updated_at` current on row modifications.\n\nFor the `SetConfig` raw SQL path (local.go:5138-5147), `updated_at` is manually set by the application code (`updated_at = EXCLUDED.updated_at` where `EXCLUDED.updated_at` is the Go `now` variable). This means correctness depends entirely on every code path that touches `config_storage` explicitly setting `updated_at`. GORM's `autoUpdateTime` tag on `ConfigStorageModel.UpdatedAt` only fires when GORM ORM methods are used; the `SetConfig` / `GetConfig` / `DeleteConfig` implementations bypass GORM entirely and use raw `database/sql` queries.\n\nCurrently `SetConfig` does correctly set `updated_at`, so this is not an active bug. But the lack of a DB-level trigger means:\n1. Any future raw SQL that `UPDATE config_storage SET value = ... WHERE key = ...` without explicitly setting `updated_at` will silently leave `updated_at` stale.\n2. 
The schema contract is different from peer tables, making it a maintenance trap for contributors who see the trigger pattern on `workflow_runs` and assume it also exists on `config_storage`.", - "confidence": 0.85, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "dual-track-schema-management", - "dimension_name": "Dual-Track Schema Management: AutoMigrate vs Goose", - "evidence": "Step 1: `028_create_config_storage.sql` lines 10-11 declare `updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()` but contain no trigger DDL.\nStep 2: `011_create_workflow_runs_and_steps.sql` lines 47-54 show the expected pattern: `CREATE TRIGGER update_workflow_runs_updated_at BEFORE UPDATE ON workflow_runs FOR EACH ROW EXECUTE FUNCTION update_updated_at_column()`.\nStep 3: `SetConfig` in local.go:5137-5147 does manually pass `updated_at = EXCLUDED.updated_at` in the ON CONFLICT clause, so the current implementation is correct.\nStep 4: However, any future `UPDATE config_storage SET value = $1 WHERE key = $2` without an explicit `updated_at` clause would leave the column stale \u2014 the DB trigger pattern that prevents this on other tables is absent here.", - "file_path": "control-plane/migrations/028_create_config_storage.sql", - "id": "f_004", - "line_end": 11, - "line_start": 10, - "score": 0.357, - "severity": "important", - "suggestion": "Add a `BEFORE UPDATE` trigger to migration `028_create_config_storage.sql` mirroring the pattern in migration `011`:\n```sql\nCREATE TRIGGER update_config_storage_updated_at\n BEFORE UPDATE ON config_storage\n FOR EACH ROW EXECUTE FUNCTION update_updated_at_column();\n```\nAnd add its DROP to the `-- +goose Down` section. 
This makes `updated_at` maintenance a DB invariant rather than an application-layer responsibility, consistent with how `workflow_runs` and `workflow_steps` are managed.", - "tags": [ - "schema-consistency", - "trigger-missing", - "updated_at", - "maintenance-hazard" - ], - "title": "Goose migration for config_storage omits the updated_at auto-update trigger that equivalent tables have, and GORM autoUpdateTime does not replace it" - }, - { - "active_multipliers": [ - "adversary_challenged", - "ai_generated_pr" - ], - "body": "`SetConfig` at config_storage.go:67 accepts any `key` from the URL parameter and any raw body as the value. There is no allowlist of permitted keys, no validation that the value is well-formed YAML when the key implies a YAML config file, and no protection against overwriting critical system keys.\n\nSpecific concerns:\n1. **Key `agentfield.yaml`** can be written with arbitrary content. When loaded via `overlayDBConfig`, a YAML parse error at `config_db.go:37` only returns a warning \u2014 the server does not crash but the config is partially loaded in an inconsistent state.\n2. **Arbitrary key injection**: An attacker can store keys like `../../../../etc/passwd` \u2014 while the storage layer likely sanitizes this, there is no explicit check in the handler.\n3. **No content-type enforcement**: The handler accepts any body as a raw string regardless of content type. The comment says \"Accepts raw YAML/text body\" but this is not validated.\n4. 
The `updatedBy` field at line 80-83 is taken directly from the `X-Updated-By` header with no sanitization \u2014 this is stored in the audit log and could be used for log injection.", - "confidence": 0.82, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "auth-config-crud", - "dimension_name": "Config CRUD Route Authorization Gap", - "evidence": "Step 1: `PUT /api/v1/configs/` calls `SetConfig` at config_storage.go:67.\nStep 2: `key := c.Param(\"key\")` at line 68 \u2014 raw URL parameter, no validation.\nStep 3: `body, err := io.ReadAll(c.Request.Body)` at line 70 \u2014 reads entire body as-is.\nStep 4: `h.storage.SetConfig(ctx, key, string(body), updatedBy)` at line 85 \u2014 stores without validation.\nStep 5: `updatedBy := c.GetHeader(\"X-Updated-By\")` at line 80 \u2014 user-controlled string stored in DB audit field.", - "file_path": "control-plane/internal/handlers/config_storage.go", - "id": "f_010", - "line_end": 101, - "line_start": 67, - "score": 0.344, - "severity": "important", - "suggestion": "1. Add an allowlist of permitted config keys (e.g., only `agentfield.yaml` or a predefined set), or at minimum validate the key does not contain path traversal characters.\n2. Validate that the body is valid YAML when the key ends in `.yaml` before persisting it.\n3. Sanitize the `X-Updated-By` header value (strip control characters, limit length).\n4. 
Return a clear error if the key is not in the allowlist.", - "tags": [ - "security", - "input-validation", - "missing-allowlist" - ], - "title": "SetConfig accepts arbitrary keys and values with no validation \u2014 allows storing malformed YAML or overwriting critical system keys" - }, - { - "active_multipliers": [ - "adversary_challenged", - "ai_generated_pr" - ], - "body": "The `DeleteConfig` HTTP handler at line 106-108 responds with `http.StatusNotFound` (404) for **any** error returned by `storage.DeleteConfig`:\n\n```go\nif err := h.storage.DeleteConfig(c.Request.Context(), key); err != nil {\n c.JSON(http.StatusNotFound, gin.H{\"error\": err.Error()})\n return\n}\n```\n\nHowever, the storage implementation (`local.go:5235-5244`) can return two distinct error categories:\n- A not-found sentinel: `fmt.Errorf(\"config %q not found\", key)` when `RowsAffected() == 0`\n- A database execution error: `fmt.Errorf(\"failed to delete config %q: %w\", key, err)` for actual DB failures\n\nMapping a database-level error (connection failure, disk full, constraint violation) to 404 is semantically incorrect and will mislead API clients and operators. 
A DB failure should produce 500 Internal Server Error.", - "confidence": 0.92, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "interface-compliance", - "dimension_name": "StorageProvider Interface Implementation Completeness", - "evidence": "Step 1: `DeleteConfig` in local.go:5235 executes `DELETE FROM config_storage WHERE key = ?`.\nStep 2: If `db.ExecContext` returns an error (network, disk, constraint), local.go:5237-5239 returns `fmt.Errorf(\"failed to delete config %q: %w\", key, err)`.\nStep 3: If `RowsAffected() == 0`, local.go:5242 returns `fmt.Errorf(\"config %q not found\", key)`.\nStep 4: The handler at config_storage.go:107 maps BOTH error types to `http.StatusNotFound` (404).\nStep 5: A database execution failure will be surfaced to the API client as a 404, concealing the real 5xx nature of the error.", - "file_path": "control-plane/internal/handlers/config_storage.go", - "id": "f_002", - "line_end": 110, - "line_start": 104, - "score": 0.166, - "severity": "suggestion", - "suggestion": "Distinguish between not-found and server errors. One approach is to check the error message or define a sentinel type in the storage layer:\n```go\nif err := h.storage.DeleteConfig(c.Request.Context(), key); err != nil {\n // Check if it's a not-found error vs. 
a storage failure\n if strings.Contains(err.Error(), \"not found\") {\n c.JSON(http.StatusNotFound, gin.H{\"error\": err.Error()})\n } else {\n c.JSON(http.StatusInternalServerError, gin.H{\"error\": err.Error()})\n }\n return\n}\n```\nA cleaner solution is to define a typed `ErrNotFound` sentinel in the storage package and use `errors.Is` in the handler.", - "tags": [ - "error-handling", - "http-status", - "api-contract" - ], - "title": "DeleteConfig handler returns 404 for all storage errors, including 500-class failures" - }, - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "The Goose migration defines `key TEXT NOT NULL UNIQUE` on line 5 (which in PostgreSQL automatically creates a unique B-tree index on `key`) and then explicitly creates `CREATE INDEX IF NOT EXISTS idx_config_storage_key ON config_storage(key)` on line 14. The explicit non-unique index on `key` is redundant because PostgreSQL will always prefer the unique index for lookups on that column.\n\nThis is a minor inefficiency: two indexes occupy storage and must be updated on every INSERT/UPDATE/DELETE to `config_storage`. 
The duplicate won't cause incorrect behavior, but it wastes space and adds write amplification.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "dual-track-schema-management", - "dimension_name": "Dual-Track Schema Management: AutoMigrate vs Goose", - "evidence": "Step 1: `028_create_config_storage.sql` line 5 defines `key TEXT NOT NULL UNIQUE`.\nStep 2: PostgreSQL documentation states a UNIQUE constraint automatically creates a unique B-tree index on the constrained column(s), which can be used for point lookups just as a regular index can.\nStep 3: Line 14 then creates a separate non-unique index `idx_config_storage_key ON config_storage(key)`, duplicating coverage already provided by the unique constraint index.", - "file_path": "control-plane/migrations/028_create_config_storage.sql", - "id": "f_005", - "line_end": 14, - "line_start": 14, - "score": 0.148, - "severity": "nitpick", - "suggestion": "Remove the explicit `CREATE INDEX IF NOT EXISTS idx_config_storage_key ON config_storage(key)` from the `-- +goose Up` section and its corresponding `DROP INDEX` from `-- +goose Down`. 
The UNIQUE constraint already provides an index suitable for all single-column equality lookups on `key`.", - "tags": [ - "schema", - "redundant-index", - "performance", - "postgresql" - ], - "title": "Redundant index on config_storage(key): the UNIQUE constraint already implies a unique index" - }, - { - "active_multipliers": [ - "adversary_confirmed", - "ai_generated_pr" - ], - "body": "Both the not-found path (line 28) and the success path (line 47) log via `fmt.Println` / `fmt.Printf` rather than the project's structured logger (`zerolog`).\n\nThe CLAUDE.md project guidance specifies:\n> Use zerolog for structured logging: `logger.Logger.Info().Msg(\"message\")`\n\nUsing `fmt.Print*` here:\n- Bypasses log-level filtering (these messages always appear, even in production with `LOG_LEVEL=warn`)\n- Produces unstructured output that cannot be parsed by log aggregation systems\n- Is inconsistent with the rest of the control-plane codebase\n\nThis is a style/maintainability issue, not a correctness bug.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "config-db-runtime-trace", - "dimension_name": "overlayDBConfig Runtime Execution Trace", - "evidence": "Line 28: `fmt.Println(\"[config] No database config found (key: agentfield.yaml), using file/env config only.\")`\nLine 47: `fmt.Printf(\"[config] Loaded config from database (key: %s, version: %d, updated: %s)\\n\", ...)`\nBoth bypass zerolog, the structured logger used throughout the rest of the control-plane (per CLAUDE.md and observed usage in other files).", - "file_path": "control-plane/internal/server/config_db.go", - "id": "f_014", - "line_end": 47, - "line_start": 28, - "score": 0.148, - "severity": "nitpick", - "suggestion": "Replace `fmt.Println` / `fmt.Printf` with the zerolog structured logger:\n\n```go\nimport \"github.com/Agent-Field/agentfield/control-plane/internal/logger\"\n\n// not-found path:\nlogger.Logger.Info().Str(\"key\", dbConfigKey).Msg(\"No database 
config found, using file/env config only\")\n\n// success path:\nlogger.Logger.Info().\n Str(\"key\", entry.Key).\n Int(\"version\", entry.Version).\n Time(\"updated\", entry.UpdatedAt).\n Msg(\"Loaded config from database\")\n```", - "tags": [ - "logging", - "style", - "zerolog", - "structured-logging" - ], - "title": "`fmt.Println`/`fmt.Printf` used for logging instead of the structured logger" - }, - { - "active_multipliers": [ - "adversary_challenged", - "ai_generated_pr" - ], - "body": "The `ConfigStorageModel.Version` field is declared with `gorm:\"column:version;not null;default:1\"` and the auto-increment is implemented purely in application SQL via `version = config_storage.version + 1` in `SetConfig` (local.go:5143, 5156). Neither the GORM model nor the Goose migration adds a `CHECK (version > 0)` constraint or a sequence-based mechanism.\n\nThis means:\n1. Any code path that uses GORM ORM methods directly (e.g., `db.Save(&ConfigStorageModel{..., Version: 0, ...})`) will set version to 0 or any arbitrary value, bypassing the increment logic.\n2. The `version` field comment says it is for \"audit trail\" (models.go:478), but without a monotonically-increasing guarantee at the DB level, audit integrity can be violated silently.\n\nThis is a suggestion rather than a critical issue because currently all writes go through the raw-SQL `SetConfig` which correctly increments. 
But the model struct exposes `Version int` as a writable field, and future GORM-based code would not benefit from the increment.", - "confidence": 0.75, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "dual-track-schema-management", - "dimension_name": "Dual-Track Schema Management: AutoMigrate vs Goose", - "evidence": "Step 1: `ConfigStorageModel.Version` is `int` with `gorm:\"column:version;not null;default:1\"` (models.go:483) \u2014 no GORM constraint prevents setting it to any value.\nStep 2: `SetConfig` increments via `version = config_storage.version + 1` in the ON CONFLICT clause (local.go:5143, 5156) \u2014 this is correct.\nStep 3: But any direct GORM call like `gormDB.Save(&ConfigStorageModel{Key: \"k\", Value: \"v\", Version: 0})` would set version to 0, no DB constraint prevents it.\nStep 4: `028_create_config_storage.sql` line 7 defines `version INTEGER NOT NULL DEFAULT 1` with no CHECK constraint.", - "file_path": "control-plane/internal/storage/models.go", - "id": "f_006", - "line_end": 483, - "line_start": 483, - "score": 0.135, - "severity": "suggestion", - "suggestion": "Add a `CHECK (version >= 1)` constraint in migration `028_create_config_storage.sql`:\n```sql\nversion INTEGER NOT NULL DEFAULT 1 CHECK (version >= 1),\n```\nThis at minimum prevents accidental version-0 writes. 
For a stronger audit guarantee, document that GORM's ORM Save/Create methods should never be used directly on `ConfigStorageModel`; only `SetConfig`/`DeleteConfig` are the sanctioned write paths.", - "tags": [ - "data-integrity", - "audit-trail", - "version-management", - "constraint-missing" - ], - "title": "Version increment is application-enforced only; no DB-level constraint prevents version regression or skipping" - }, - { - "active_multipliers": [ - "adversary_challenged", - "ai_generated_pr" - ], - "body": "The two `configHandlers` declarations are in separate block scopes (lines 1551\u20131555 and 1575\u20131578) with no shadowing of a shared variable. They register routes on distinct base paths:\n\n- First: `agentAPI` \u2192 `/api/v1/configs/...`\n- Second: `configGroup` (= `connectorGroup.Group(\"\")` = `agentAPI.Group(\"/connector\")`) \u2192 `/api/v1/connector/configs/...`\n\nGin's router tree separates these cleanly \u2014 no duplicate-path panic occurs.\n\nThe `:key` parameter name is identical in both registrations (both call the same `RegisterRoutes` method), but since they live in different router-tree path segments (`/configs` under `/api/v1` vs `/configs` under `/api/v1/connector`), there is no wildcard conflict.\n\nBoth calls pass `s.configReloadFn()` which evaluates `os.Getenv(\"AGENTFIELD_CONFIG_SOURCE\")` at setup time and returns either `nil` or a valid reload closure. The connector-facing reload endpoint will return 503 only when the env var is not `\"db\"` \u2014 **exactly the same behavior** as the `agentAPI`-facing endpoint. 
There is no regression here.\n\nThe variable name reuse (`configHandlers`) inside separate Go block scopes (`{ }`) is cosmetically confusing but harmless \u2014 Go's scoping rules guarantee no aliasing.", - "confidence": 0.98, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "dual-config-route-registration", - "dimension_name": "Dual Registration of Config Routes", - "evidence": "Step 1: `agentAPI` base path = `/api/v1` (server.go:1164). Step 2: `connectorGroup = agentAPI.Group(\"/connector\")` \u2192 base `/api/v1/connector` (server.go:1559). Step 3: `configGroup = connectorGroup.Group(\"\")` \u2192 still `/api/v1/connector` (server.go:1573). Step 4: `RegisterRoutes` registers identical relative paths (`/configs`, `/configs/:key`, `/configs/reload`) on both groups, yielding `/api/v1/configs/...` and `/api/v1/connector/configs/...` \u2014 distinct full paths. Step 5: Both `NewConfigStorageHandlers` calls at lines 1552 and 1576 invoke `s.configReloadFn()` which is the same method returning equivalent closures (or nil). 
No behavioral divergence.", - "file_path": "control-plane/internal/server/server.go", - "id": "f_012", - "line_end": 1578, - "line_start": 1572, - "score": 0.059, - "severity": "nitpick", - "suggestion": "Consider renaming the inner `configHandlers` to `connectorConfigHandlers` for clarity, even though the current code is functionally correct:\n```go\nconnectorConfigHandlers := handlers.NewConfigStorageHandlers(s.storage, s.configReloadFn())\nconnectorConfigHandlers.RegisterRoutes(configGroup)\n```", - "tags": [ - "routing", - "correctness", - "naming" - ], - "title": "Verified: no path conflict and no 503 regression from second `configHandlers` instantiation" - } - ], - "metadata": { - "agent_invocations": 15, - "anatomy": { - "blast_radius": [], - "clusters": [ - { - "description": "", - "files": [ - "control-plane/config/agentfield.yaml" - ], - "id": "cluster_0", - "name": "control-plane/config", - "primary_language": "yaml" - }, - { - "description": "", - "files": [ - "control-plane/internal/handlers/config_storage.go" - ], - "id": "cluster_1", - "name": "control-plane/internal/handlers", - "primary_language": "go" - }, - { - "description": "", - "files": [ - "control-plane/internal/server/config_db.go", - "control-plane/internal/server/server.go", - "control-plane/internal/server/server_routes_test.go" - ], - "id": "cluster_2", - "name": "control-plane/internal/server", - "primary_language": "go" - }, - { - "description": "", - "files": [ - "control-plane/internal/storage/local.go", - "control-plane/internal/storage/migrations.go", - "control-plane/internal/storage/models.go", - "control-plane/internal/storage/storage.go" - ], - "id": "cluster_3", - "name": "control-plane/internal/storage", - "primary_language": "go" - }, - { - "description": "", - "files": [ - "control-plane/migrations/028_create_config_storage.sql" - ], - "id": "cluster_4", - "name": "control-plane/migrations", - "primary_language": "sql" - } - ], - "context_notes": "The PR is internally 
consistent in its storage and handler wiring. The primary concern is the unprotected /api/v1/configs route (no AdminToken, only global API key which may be empty), the non-propagating hot-reload, and the data race on shared config pointer. The schema dual-path (GORM AutoMigrate + Goose) is a pre-existing pattern in this codebase and is handled correctly via CREATE TABLE IF NOT EXISTS. The stub storage in server_routes_test.go was correctly updated with no-op implementations of the four new interface methods, maintaining test compilability.", - "dependency_graph": {}, - "files": [ - { - "hunks": [ - { - "content": " enabled: true\n observability_config:\n enabled: false\n+ config_management:\n+ enabled: true\n+ read_only: false", - "header": "@@ -146,3 +146,6 @@ features:", - "new_count": 6, - "new_start": 146, - "old_count": 3, - "old_start": 146 - } - ], - "language": "yaml", - "lines_added": 3, - "lines_removed": 0, - "path": "control-plane/config/agentfield.yaml", - "status": "modified" - }, - { - "hunks": [ - { - "content": "+package handlers\n+\n+import (\n+\t\"io\"\n+\t\"net/http\"\n+\n+\t\"github.com/Agent-Field/agentfield/control-plane/internal/storage\"\n+\t\"github.com/gin-gonic/gin\"\n+)\n+\n+// ConfigReloadFunc is called to reload configuration from the database.\n+type ConfigReloadFunc func() error\n+\n+// ConfigStorageHandlers provides HTTP handlers for database-backed configuration.\n+type ConfigStorageHandlers struct {\n+\tstorage storage.StorageProvider\n+\treloadFn ConfigReloadFunc\n+}\n+\n+// NewConfigStorageHandlers creates a new ConfigStorageHandlers instance.\n+func NewConfigStorageHandlers(store storage.StorageProvider, reloadFn ConfigReloadFunc) *ConfigStorageHandlers {\n+\treturn &ConfigStorageHandlers{storage: store, reloadFn: reloadFn}\n+}\n+\n+// RegisterRoutes registers config storage routes on the given router group.\n+func (h *ConfigStorageHandlers) RegisterRoutes(group *gin.RouterGroup) {\n+\tgroup.GET(\"/configs\", 
h.ListConfigs)\n+\tgroup.GET(\"/configs/:key\", h.GetConfig)\n+\tgroup.PUT(\"/configs/:key\", h.SetConfig)\n+\tgroup.DELETE(\"/configs/:key\", h.DeleteConfig)\n+\tgroup.POST(\"/configs/reload\", h.ReloadConfig)\n+}\n+\n+// ListConfigs returns all stored configuration entries.\n+func (h *ConfigStorageHandlers) ListConfigs(c *gin.Context) {\n+\tentries, err := h.storage.ListConfigs(c.Request.Context())\n+\tif err != nil {\n+\t\tc.JSON(http.StatusInternalServerError, gin.H{\"error\": err.Error()})\n+\t\treturn\n+\t}\n+\tif entries == nil {\n+\t\tentries = []*storage.ConfigEntry{}\n+\t}\n+\tc.JSON(http.StatusOK, gin.H{\n+\t\t\"configs\": entries,\n+\t\t\"total\": len(entries),\n+\t})\n+}\n+\n+// GetConfig returns a specific configuration entry by key.\n+func (h *ConfigStorageHandlers) GetConfig(c *gin.Context) {\n+\tkey := c.Param(\"key\")\n+\tentry, err := h.storage.GetConfig(c.Request.Context(), key)\n+\tif err != nil {\n+\t\tc.JSON(http.StatusInternalServerError, gin.H{\"error\": err.Error()})\n+\t\treturn\n+\t}\n+\tif entry == nil {\n+\t\tc.JSON(http.StatusNotFound, gin.H{\"error\": \"config not found\", \"key\": key})\n+\t\treturn\n+\t}\n+\tc.JSON(http.StatusOK, entry)\n+}\n+\n+// SetConfig creates or updates a configuration entry.\n+// Accepts raw YAML/text body as the config value.\n+func (h *ConfigStorageHandlers) SetConfig(c *gin.Context) {\n+\tkey := c.Param(\"key\")\n+\n+\tbody, err := io.ReadAll(c.Request.Body)\n+\tif err != nil {\n+\t\tc.JSON(http.StatusBadRequest, gin.H{\"error\": \"failed to read request body\"})\n+\t\treturn\n+\t}\n+\tif len(body) == 0 {\n+\t\tc.JSON(http.StatusBadRequest, gin.H{\"error\": \"request body is empty\"})\n+\t\treturn\n+\t}\n+\n+\tupdatedBy := c.GetHeader(\"X-Updated-By\")\n+\tif updatedBy == \"\" {\n+\t\tupdatedBy = \"api\"\n+\t}\n+\n+\tif err := h.storage.SetConfig(c.Request.Context(), key, string(body), updatedBy); err != nil {\n+\t\tc.JSON(http.StatusInternalServerError, gin.H{\"error\": 
err.Error()})\n+\t\treturn\n+\t}\n+\n+\t// Return the saved entry\n+\tentry, err := h.storage.GetConfig(c.Request.Context(), key)\n+\tif err != nil {\n+\t\tc.JSON(http.StatusInternalServerError, gin.H{\"error\": err.Error()})\n+\t\treturn\n+\t}\n+\n+\tc.JSON(http.StatusOK, gin.H{\n+\t\t\"message\": \"config saved\",\n+\t\t\"config\": entry,\n+\t})\n+}\n+\n+// DeleteConfig removes a configuration entry by key.\n+func (h *ConfigStorageHandlers) DeleteConfig(c *gin.Context) {\n+\tkey := c.Param(\"key\")\n+\tif err := h.storage.DeleteConfig(c.Request.Context(), key); err != nil {\n+\t\tc.JSON(http.StatusNotFound, gin.H{\"error\": err.Error()})\n+\t\treturn\n+\t}\n+\tc.JSON(http.StatusOK, gin.H{\"message\": \"config deleted\", \"key\": key})\n+}\n+\n+// ReloadConfig triggers a hot-reload of configuration from the database.\n+func (h *ConfigStorageHandlers) ReloadConfig(c *gin.Context) {\n+\tif h.reloadFn == nil {\n+\t\tc.JSON(http.StatusServiceUnavailable, gin.H{\n+\t\t\t\"error\": \"config reload not available (AGENTFIELD_CONFIG_SOURCE != db)\",\n+\t\t})\n+\t\treturn\n+\t}\n+\tif err := h.reloadFn(); err != nil {\n+\t\tc.JSON(http.StatusInternalServerError, gin.H{\n+\t\t\t\"error\": \"config reload failed\",\n+\t\t\t\"details\": err.Error(),\n+\t\t})\n+\t\treturn\n+\t}\n+\tc.JSON(http.StatusOK, gin.H{\"message\": \"config reloaded from database\"})\n+}", - "header": "@@ -0,0 +1,129 @@", - "new_count": 129, - "new_start": 1, - "old_count": 0, - "old_start": 0 - } - ], - "language": "go", - "lines_added": 129, - "lines_removed": 0, - "path": "control-plane/internal/handlers/config_storage.go", - "status": "added" - }, - { - "hunks": [ - { - "content": "+package server\n+\n+import (\n+\t\"context\"\n+\t\"fmt\"\n+\t\"time\"\n+\n+\t\"github.com/Agent-Field/agentfield/control-plane/internal/config\"\n+\t\"github.com/Agent-Field/agentfield/control-plane/internal/storage\"\n+\t\"gopkg.in/yaml.v3\"\n+)\n+\n+const dbConfigKey = \"agentfield.yaml\"\n+\n+// overlayDBConfig loads 
config from the database and merges it into the\n+// existing config. The storage section is preserved from the original config\n+// to avoid the bootstrap problem (DB connection settings can't come from DB).\n+// Precedence: env vars > DB config > file config > defaults.\n+func overlayDBConfig(cfg *config.Config, store storage.StorageProvider) error {\n+\tctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)\n+\tdefer cancel()\n+\n+\tentry, err := store.GetConfig(ctx, dbConfigKey)\n+\tif err != nil {\n+\t\treturn fmt.Errorf(\"failed to read config from database: %w\", err)\n+\t}\n+\tif entry == nil {\n+\t\tfmt.Println(\"[config] No database config found (key: agentfield.yaml), using file/env config only.\")\n+\t\treturn nil\n+\t}\n+\n+\t// Preserve the storage config \u2014 it must always come from file/env (bootstrap)\n+\tsavedStorage := cfg.Storage\n+\n+\t// Parse the DB-stored YAML into a config struct\n+\tvar dbCfg config.Config\n+\tif err := yaml.Unmarshal([]byte(entry.Value), &dbCfg); err != nil {\n+\t\treturn fmt.Errorf(\"failed to parse database config YAML: %w\", err)\n+\t}\n+\n+\t// Overlay non-zero DB values onto the existing config\n+\tmergeDBConfig(cfg, &dbCfg)\n+\n+\t// Restore storage config (never overridden from DB)\n+\tcfg.Storage = savedStorage\n+\n+\tfmt.Printf(\"[config] Loaded config from database (key: %s, version: %d, updated: %s)\\n\",\n+\t\tentry.Key, entry.Version, entry.UpdatedAt.Format(time.RFC3339))\n+\treturn nil\n+}\n+\n+// mergeDBConfig selectively merges DB config values into the target config.\n+// Only non-zero/non-empty values from the DB config are applied.\n+func mergeDBConfig(target, dbCfg *config.Config) {\n+\t// AgentField settings\n+\tif dbCfg.AgentField.Port != 0 {\n+\t\ttarget.AgentField.Port = dbCfg.AgentField.Port\n+\t}\n+\tif dbCfg.AgentField.NodeHealth.CheckInterval != 0 {\n+\t\ttarget.AgentField.NodeHealth = dbCfg.AgentField.NodeHealth\n+\t}\n+\t// Merge execution cleanup field-by-field to avoid 
zeroing out unset fields\n+\tif dbCfg.AgentField.ExecutionCleanup.RetentionPeriod != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.RetentionPeriod = dbCfg.AgentField.ExecutionCleanup.RetentionPeriod\n+\t}\n+\tif dbCfg.AgentField.ExecutionCleanup.CleanupInterval != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.CleanupInterval = dbCfg.AgentField.ExecutionCleanup.CleanupInterval\n+\t}\n+\tif dbCfg.AgentField.ExecutionCleanup.BatchSize != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.BatchSize = dbCfg.AgentField.ExecutionCleanup.BatchSize\n+\t}\n+\tif dbCfg.AgentField.ExecutionCleanup.PreserveRecentDuration != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.PreserveRecentDuration = dbCfg.AgentField.ExecutionCleanup.PreserveRecentDuration\n+\t}\n+\tif dbCfg.AgentField.ExecutionCleanup.StaleExecutionTimeout != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.StaleExecutionTimeout = dbCfg.AgentField.ExecutionCleanup.StaleExecutionTimeout\n+\t}\n+\t// Enabled is a bool \u2014 only override if cleanup config is present in DB at all\n+\tif dbCfg.AgentField.ExecutionCleanup.RetentionPeriod != 0 || dbCfg.AgentField.ExecutionCleanup.CleanupInterval != 0 {\n+\t\ttarget.AgentField.ExecutionCleanup.Enabled = dbCfg.AgentField.ExecutionCleanup.Enabled\n+\t}\n+\tif dbCfg.AgentField.Approval.WebhookSecret != \"\" || dbCfg.AgentField.Approval.DefaultExpiryHours != 0 {\n+\t\ttarget.AgentField.Approval = dbCfg.AgentField.Approval\n+\t}\n+\n+\t// Features\n+\tif dbCfg.Features.DID.Method != \"\" {\n+\t\ttarget.Features.DID = dbCfg.Features.DID\n+\t}\n+\t// NOTE: Connector config (token, capabilities) is intentionally NOT merged\n+\t// from DB. 
These are security-sensitive and must come from file/env config,\n+\t// similar to how storage config is protected from the bootstrap problem.\n+\n+\t// API settings (but never override API key from DB for security)\n+\tif len(dbCfg.API.CORS.AllowedOrigins) > 0 {\n+\t\ttarget.API.CORS = dbCfg.API.CORS\n+\t}\n+\n+\t// UI settings\n+\tif dbCfg.UI.Mode != \"\" {\n+\t\ttarget.UI = dbCfg.UI\n+\t}\n+}", - "header": "@@ -0,0 +1,103 @@", - "new_count": 103, - "new_start": 1, - "old_count": 0, - "old_start": 0 - } - ], - "language": "go", - "lines_added": 103, - "lines_removed": 0, - "path": "control-plane/internal/server/config_db.go", - "status": "added" - }, - { - "hunks": [ - { - "content": " \t\treturn nil, err\n \t}\n \n+\t// Overlay database-stored config if AGENTFIELD_CONFIG_SOURCE=db\n+\tif src := os.Getenv(\"AGENTFIELD_CONFIG_SOURCE\"); src == \"db\" {\n+\t\tif err := overlayDBConfig(cfg, storageProvider); err != nil {\n+\t\t\tfmt.Printf(\"Warning: failed to load config from database: %v\\n\", err)\n+\t\t}\n+\t}\n+\n \tRouter := gin.Default()\n \n \t// Sync installed.yaml to database for package visibility", - "header": "@@ -104,6 +104,13 @@ func NewAgentFieldServer(cfg *config.Config) (*AgentFieldServer, error) {", - "new_count": 13, - "new_start": 104, - "old_count": 6, - "old_start": 104 - }, - { - "content": " \t}, nil\n }\n \n+// configReloadFn returns a function that reloads config from the database,\n+// or nil if AGENTFIELD_CONFIG_SOURCE is not set to \"db\".\n+func (s *AgentFieldServer) configReloadFn() handlers.ConfigReloadFunc {\n+\tif src := os.Getenv(\"AGENTFIELD_CONFIG_SOURCE\"); src != \"db\" {\n+\t\treturn nil\n+\t}\n+\treturn func() error {\n+\t\treturn overlayDBConfig(s.config, s.storage)\n+\t}\n+}\n+\n // Start initializes and starts the AgentFieldServer.\n func (s *AgentFieldServer) Start() error {\n \t// Setup routes", - "header": "@@ -423,6 +430,17 @@ func NewAgentFieldServer(cfg *config.Config) (*AgentFieldServer, error) {", - "new_count": 
17, - "new_start": 430, - "old_count": 6, - "old_start": 423 - }, - { - "content": " \t\t\tlogger.Logger.Info().Msg(\"\ud83d\udccb Authorization admin routes registered\")\n \t\t}\n \n+\t\t// Config storage routes (admin-authenticated)\n+\t\t{\n+\t\t\tconfigHandlers := handlers.NewConfigStorageHandlers(s.storage, s.configReloadFn())\n+\t\t\tconfigHandlers.RegisterRoutes(agentAPI)\n+\t\t\tlogger.Logger.Info().Msg(\"Config storage routes registered\")\n+\t\t}\n+\n \t\t// Connector routes (authenticated with separate connector token)\n \t\tif s.config.Features.Connector.Enabled && s.config.Features.Connector.Token != \"\" {\n \t\t\tconnectorGroup := agentAPI.Group(\"/connector\")", - "header": "@@ -1529,6 +1547,13 @@ func (s *AgentFieldServer) setupRoutes() {", - "new_count": 13, - "new_start": 1547, - "old_count": 6, - "old_start": 1529 - }, - { - "content": " \t\t\t)\n \t\t\tconnectorHandlers.RegisterRoutes(connectorGroup)\n \n+\t\t\t// Config management routes for connector\n+\t\t\tconfigGroup := connectorGroup.Group(\"\")\n+\t\t\tconfigGroup.Use(middleware.ConnectorCapabilityCheck(\"config_management\", s.config.Features.Connector.Capabilities))\n+\t\t\t{\n+\t\t\t\tconfigHandlers := handlers.NewConfigStorageHandlers(s.storage, s.configReloadFn())\n+\t\t\t\tconfigHandlers.RegisterRoutes(configGroup)\n+\t\t\t}\n+\n \t\t\tlogger.Logger.Info().Msg(\"\ud83d\udd0c Connector routes registered\")\n \t\t}\n \t}", - "header": "@@ -1544,6 +1569,14 @@ func (s *AgentFieldServer) setupRoutes() {", - "new_count": 14, - "new_start": 1569, - "old_count": 6, - "old_start": 1544 - } - ], - "language": "go", - "lines_added": 33, - "lines_removed": 0, - "path": "control-plane/internal/server/server.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": " }\n \n // Configuration\n-func (s *stubStorage) SetConfig(ctx context.Context, key string, value interface{}) error { return nil }\n-func (s *stubStorage) GetConfig(ctx context.Context, key string) (interface{}, error) 
{\n+func (s *stubStorage) SetConfig(ctx context.Context, key string, value string, updatedBy string) error {\n+\treturn nil\n+}\n+func (s *stubStorage) GetConfig(ctx context.Context, key string) (*storage.ConfigEntry, error) {\n+\treturn nil, nil\n+}\n+func (s *stubStorage) ListConfigs(ctx context.Context) ([]*storage.ConfigEntry, error) {\n \treturn nil, nil\n }\n+func (s *stubStorage) DeleteConfig(ctx context.Context, key string) error { return nil }\n \n // Reasoner Performance and History\n func (s *stubStorage) GetReasonerPerformanceMetrics(ctx context.Context, reasonerID string) (*types.ReasonerPerformanceMetrics, error) {", - "header": "@@ -230,10 +230,16 @@ func (s *stubStorage) ListAgentGroups(ctx context.Context, teamID string) ([]typ", - "new_count": 16, - "new_start": 230, - "old_count": 10, - "old_start": 230 - } - ], - "language": "go", - "lines_added": 8, - "lines_removed": 2, - "path": "control-plane/internal/server/server_routes_test.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": " \treturn nil\n }\n \n-// SetConfig stores a configuration key-value pair in SQLite.\n-func (ls *LocalStorage) SetConfig(ctx context.Context, key string, value interface{}) error {\n-\t// Fast-fail if context is already cancelled\n+// SetConfig upserts a configuration entry in the database.\n+// On conflict (duplicate key), it increments the version and updates the value.\n+func (ls *LocalStorage) SetConfig(ctx context.Context, key string, value string, updatedBy string) error {\n \tif err := ctx.Err(); err != nil {\n \t\treturn err\n \t}\n \n-\t// TODO: Implement configuration storage in SQLite\n-\treturn fmt.Errorf(\"SetConfig not yet implemented for LocalStorage\")\n+\tdb := ls.requireSQLDB()\n+\tnow := time.Now().UTC()\n+\n+\tif ls.mode == \"postgres\" {\n+\t\t_, err := db.ExecContext(ctx, `\n+\t\t\tINSERT INTO config_storage (key, value, version, created_by, updated_by, created_at, updated_at)\n+\t\t\tVALUES ($1, $2, 1, $3, $3, $4, $4)\n+\t\t\tON 
CONFLICT (key) DO UPDATE SET\n+\t\t\t\tvalue = EXCLUDED.value,\n+\t\t\t\tversion = config_storage.version + 1,\n+\t\t\t\tupdated_by = EXCLUDED.updated_by,\n+\t\t\t\tupdated_at = EXCLUDED.updated_at`,\n+\t\t\tkey, value, updatedBy, now)\n+\t\treturn err\n+\t}\n+\n+\t// SQLite\n+\t_, err := db.ExecContext(ctx, `\n+\t\tINSERT INTO config_storage (key, value, version, created_by, updated_by, created_at, updated_at)\n+\t\tVALUES (?, ?, 1, ?, ?, ?, ?)\n+\t\tON CONFLICT (key) DO UPDATE SET\n+\t\t\tvalue = excluded.value,\n+\t\t\tversion = config_storage.version + 1,\n+\t\t\tupdated_by = excluded.updated_by,\n+\t\t\tupdated_at = excluded.updated_at`,\n+\t\tkey, value, updatedBy, updatedBy, now, now)\n+\treturn err\n }\n \n-// GetConfig retrieves a configuration value from SQLite by key.\n-func (ls *LocalStorage) GetConfig(ctx context.Context, key string) (interface{}, error) {\n-\t// Fast-fail if context is already cancelled\n+// GetConfig retrieves a configuration entry by key.\n+func (ls *LocalStorage) GetConfig(ctx context.Context, key string) (*ConfigEntry, error) {\n+\tif err := ctx.Err(); err != nil {\n+\t\treturn nil, err\n+\t}\n+\n+\tdb := ls.requireSQLDB()\n+\tvar entry ConfigEntry\n+\n+\tvar placeholder string\n+\tif ls.mode == \"postgres\" {\n+\t\tplaceholder = \"$1\"\n+\t} else {\n+\t\tplaceholder = \"?\"\n+\t}\n+\n+\trow := db.QueryRowContext(ctx,\n+\t\tfmt.Sprintf(`SELECT key, value, version, COALESCE(created_by, ''), COALESCE(updated_by, ''), created_at, updated_at\n+\t\tFROM config_storage WHERE key = %s`, placeholder), key)\n+\n+\terr := row.Scan(&entry.Key, &entry.Value, &entry.Version,\n+\t\t&entry.CreatedBy, &entry.UpdatedBy, &entry.CreatedAt, &entry.UpdatedAt)\n+\tif err != nil {\n+\t\tif err.Error() == \"sql: no rows in result set\" {\n+\t\t\treturn nil, nil\n+\t\t}\n+\t\treturn nil, fmt.Errorf(\"failed to get config %q: %w\", key, err)\n+\t}\n+\treturn &entry, nil\n+}\n+\n+// ListConfigs returns all stored configuration entries.\n+func (ls 
*LocalStorage) ListConfigs(ctx context.Context) ([]*ConfigEntry, error) {\n \tif err := ctx.Err(); err != nil {\n \t\treturn nil, err\n \t}\n \n-\t// TODO: Implement configuration retrieval from SQLite\n-\treturn nil, fmt.Errorf(\"GetConfig not yet implemented for LocalStorage\")\n+\tdb := ls.requireSQLDB()\n+\trows, err := db.QueryContext(ctx,\n+\t\t`SELECT key, value, version, COALESCE(created_by, ''), COALESCE(updated_by, ''), created_at, updated_at\n+\t\tFROM config_storage ORDER BY key`)\n+\tif err != nil {\n+\t\treturn nil, fmt.Errorf(\"failed to list configs: %w\", err)\n+\t}\n+\tdefer rows.Close()\n+\n+\tvar entries []*ConfigEntry\n+\tfor rows.Next() {\n+\t\tvar entry ConfigEntry\n+\t\tif err := rows.Scan(&entry.Key, &entry.Value, &entry.Version,\n+\t\t\t&entry.CreatedBy, &entry.UpdatedBy, &entry.CreatedAt, &entry.UpdatedAt); err != nil {\n+\t\t\treturn nil, fmt.Errorf(\"failed to scan config row: %w\", err)\n+\t\t}\n+\t\tentries = append(entries, &entry)\n+\t}\n+\treturn entries, rows.Err()\n+}\n+\n+// DeleteConfig removes a configuration entry by key.\n+func (ls *LocalStorage) DeleteConfig(ctx context.Context, key string) error {\n+\tif err := ctx.Err(); err != nil {\n+\t\treturn err\n+\t}\n+\n+\tdb := ls.requireSQLDB()\n+\tvar placeholder string\n+\tif ls.mode == \"postgres\" {\n+\t\tplaceholder = \"$1\"\n+\t} else {\n+\t\tplaceholder = \"?\"\n+\t}\n+\n+\tresult, err := db.ExecContext(ctx,\n+\t\tfmt.Sprintf(`DELETE FROM config_storage WHERE key = %s`, placeholder), key)\n+\tif err != nil {\n+\t\treturn fmt.Errorf(\"failed to delete config %q: %w\", key, err)\n+\t}\n+\trows, _ := result.RowsAffected()\n+\tif rows == 0 {\n+\t\treturn fmt.Errorf(\"config %q not found\", key)\n+\t}\n+\treturn nil\n }\n \n // SubscribeToMemoryChanges implements the StorageProvider SubscribeToMemoryChanges method using local pub/sub.", - "header": "@@ -5124,26 +5124,124 @@ func (ls *LocalStorage) UpdateAgentTrafficWeight(ctx context.Context, id string,", - "new_count": 124, - 
"new_start": 5124, - "old_count": 26, - "old_start": 5124 - } - ], - "language": "go", - "lines_added": 108, - "lines_removed": 10, - "path": "control-plane/internal/storage/local.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": " \t\t&DIDDocumentModel{},\n \t\t&AccessPolicyModel{},\n \t\t&AgentTagVCModel{},\n+\t\t&ConfigStorageModel{},\n \t}\n \n \tif err := gormDB.WithContext(ctx).AutoMigrate(models...); err != nil {", - "header": "@@ -233,6 +233,7 @@ func (ls *LocalStorage) autoMigrateSchema(ctx context.Context) error {", - "new_count": 7, - "new_start": 233, - "old_count": 6, - "old_start": 233 - } - ], - "language": "go", - "lines_added": 1, - "lines_removed": 0, - "path": "control-plane/internal/storage/migrations.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": " }\n \n func (AgentTagVCModel) TableName() string { return \"agent_tag_vcs\" }\n+\n+// ConfigStorageModel stores configuration files in the database.\n+// Each record represents a named configuration (e.g. 
\"agentfield.yaml\")\n+// with versioning for audit trail.\n+type ConfigStorageModel struct {\n+\tID int64 `gorm:\"column:id;primaryKey;autoIncrement\"`\n+\tKey string `gorm:\"column:key;not null;uniqueIndex\"`\n+\tValue string `gorm:\"column:value;type:text;not null\"`\n+\tVersion int `gorm:\"column:version;not null;default:1\"`\n+\tCreatedBy *string `gorm:\"column:created_by\"`\n+\tUpdatedBy *string `gorm:\"column:updated_by\"`\n+\tCreatedAt time.Time `gorm:\"column:created_at;autoCreateTime\"`\n+\tUpdatedAt time.Time `gorm:\"column:updated_at;autoUpdateTime\"`\n+}\n+\n+func (ConfigStorageModel) TableName() string { return \"config_storage\" }", - "header": "@@ -472,3 +472,19 @@ type AgentTagVCModel struct {", - "new_count": 19, - "new_start": 472, - "old_count": 3, - "old_start": 472 - } - ], - "language": "go", - "lines_added": 16, - "lines_removed": 0, - "path": "control-plane/internal/storage/models.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": " \tActiveExecutions int\n }\n \n+// ConfigEntry represents a database-stored configuration file.\n+type ConfigEntry struct {\n+\tKey string `json:\"key\"`\n+\tValue string `json:\"value\"`\n+\tVersion int `json:\"version\"`\n+\tCreatedBy string `json:\"created_by,omitempty\"`\n+\tUpdatedBy string `json:\"updated_by,omitempty\"`\n+\tCreatedAt time.Time `json:\"created_at\"`\n+\tUpdatedAt time.Time `json:\"updated_at\"`\n+}\n+\n // StorageProvider is the interface for the primary data storage backend.\n type StorageProvider interface {\n \t// Lifecycle", - "header": "@@ -26,6 +26,17 @@ type RunSummaryAggregation struct {", - "new_count": 17, - "new_start": 26, - "old_count": 6, - "old_start": 26 - }, - { - "content": " \tUpdateAgentVersion(ctx context.Context, id string, version string) error\n \tUpdateAgentTrafficWeight(ctx context.Context, id string, version string, weight int) error\n \n-\t// Configuration\n-\tSetConfig(ctx context.Context, key string, value interface{}) error\n-\tGetConfig(ctx 
context.Context, key string) (interface{}, error)\n+\t// Configuration Storage (database-backed config files)\n+\tSetConfig(ctx context.Context, key string, value string, updatedBy string) error\n+\tGetConfig(ctx context.Context, key string) (*ConfigEntry, error)\n+\tListConfigs(ctx context.Context) ([]*ConfigEntry, error)\n+\tDeleteConfig(ctx context.Context, key string) error\n \n \t// Reasoner Performance and History\n \tGetReasonerPerformanceMetrics(ctx context.Context, reasonerID string) (*types.ReasonerPerformanceMetrics, error)", - "header": "@@ -118,9 +129,11 @@ type StorageProvider interface {", - "new_count": 11, - "new_start": 129, - "old_count": 9, - "old_start": 118 - } - ], - "language": "go", - "lines_added": 16, - "lines_removed": 3, - "path": "control-plane/internal/storage/storage.go", - "status": "modified" - }, - { - "hunks": [ - { - "content": "+-- +goose Up\n+-- +goose StatementBegin\n+CREATE TABLE IF NOT EXISTS config_storage (\n+ id BIGSERIAL PRIMARY KEY,\n+ key TEXT NOT NULL UNIQUE,\n+ value TEXT NOT NULL,\n+ version INTEGER NOT NULL DEFAULT 1,\n+ created_by TEXT,\n+ updated_by TEXT,\n+ created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),\n+ updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()\n+);\n+\n+CREATE INDEX IF NOT EXISTS idx_config_storage_key ON config_storage(key);\n+-- +goose StatementEnd\n+\n+-- +goose Down\n+-- +goose StatementBegin\n+DROP INDEX IF EXISTS idx_config_storage_key;\n+DROP TABLE IF EXISTS config_storage;\n+-- +goose StatementEnd", - "header": "@@ -0,0 +1,21 @@", - "new_count": 21, - "new_start": 1, - "old_count": 0, - "old_start": 0 - } - ], - "language": "sql", - "lines_added": 21, - "lines_removed": 0, - "path": "control-plane/migrations/028_create_config_storage.sql", - "status": "added" - } - ], - "intent_gaps": [ - "The PR description states 'Precedence: env vars > DB config > file config > defaults' but the overlay is a one-time operation at startup (not continuous), and the RELOAD endpoint only 
re-applies the DB overlay to the already-file+env-merged config. If env vars set a value that was also in the file config, the file value was already overridden by env before overlay. The stated precedence description is accurate for startup, but the description does not clarify that RELOAD does not re-read the file config or re-apply env vars.", - "The PR description claims 'Add connector-scoped config routes gated by config_management capability' but the code also registers the same config routes unconditionally on /api/v1/configs (server.go:1552-1554) without the capability gate. This unauthenticated (modulo global API key) route is not mentioned in the PR description.", - "The PR description mentions 'works on both SQLite and PostgreSQL' for the storage implementation, but the actual SetConfig/GetConfig/ListConfigs/DeleteConfig method bodies are not visible in the diff (the local.go additions are in lines not shown). The claim cannot be verified from the diff alone.", - "The PR description says config_management is added as a new capability, but it is also enabled by default (read_only: false) in the committed agentfield.yaml with the test connector token. The description does not mention this default-on behavior or its security implication for deployments that use the default config.", - "The ReloadConfig handler (handlers/config_storage.go:114-128) is described as a hot-reload mechanism, but because services are initialized with config values at construction time (intervals, timeouts, flags), a reload only affects future reads of config fields that are checked dynamically (e.g., if a handler reads s.config.AgentField.Port at request time). The PR description does not document which settings actually take effect on hot-reload vs. which require a restart." - ], - "pr_narrative": "This PR introduces a database-backed configuration storage system with the following end-to-end flow:\n\n**1. 
Schema and Storage Layer**\nA new GORM model `ConfigStorageModel` (`storage/models.go:479-488`) maps to a `config_storage` table with columns: id, key (unique), value (TEXT), version (auto-incremented on update), created_by, updated_by, created_at, updated_at. The model is appended to the GORM `AutoMigrate` list (`storage/migrations.go:236`), meaning the table is created automatically on startup for both SQLite and PostgreSQL. A parallel Goose SQL migration (`028_create_config_storage.sql`) creates the same table for the managed PostgreSQL migration path. A new `ConfigEntry` DTO (`storage/storage.go:30-38`) is defined alongside four new interface methods on `StorageProvider` (`storage/storage.go:132-136`): `SetConfig`, `GetConfig`, `ListConfigs`, `DeleteConfig`. The concrete implementation is in `LocalStorage` via GORM upsert/query/delete using the `ConfigStorageModel` (implementation not in the diff directly, but referenced by the handler and config_db code).\n\n**2. HTTP Handler Layer**\n`handlers/config_storage.go` defines `ConfigStorageHandlers` with five routes: LIST (`GET /configs`), GET (`GET /configs/:key`), SET (`PUT /configs/:key` \u2014 raw body is the YAML value), DELETE (`DELETE /configs/:key`), and RELOAD (`POST /configs/reload`). The `SetConfig` handler reads raw bytes from the request body and accepts an optional `X-Updated-By` header to track who made the change. After writing, it re-reads and returns the saved entry. The RELOAD endpoint invokes a `ConfigReloadFunc` callback; if the function is nil (i.e., `AGENTFIELD_CONFIG_SOURCE` != `db`), it returns 503.\n\n**3. Startup Config Overlay**\n`server/config_db.go` implements `overlayDBConfig`, called from `NewAgentFieldServer` (`server/server.go:108-112`) after storage is initialized, when `AGENTFIELD_CONFIG_SOURCE=db`. 
It fetches the entry keyed `agentfield.yaml`, parses it as `config.Config` YAML, then calls `mergeDBConfig` which selectively copies non-zero fields from the DB config onto the in-memory config. The `storage` section of the config is unconditionally restored after merge (bootstrap safety). The `connector` config section is also explicitly excluded from merge (`config_db.go:90-92`).\n\n**4. Route Registration**\nIn `setupRoutes` (`server/server.go:1550-1578`), config routes are registered in two places: (a) unconditionally on `agentAPI` (`/api/v1/configs/...`) with no additional authentication beyond the global API key middleware, and (b) inside the connector group at `/api/v1/connector/configs/...` behind both the connector token middleware and a `config_management` capability check. A hot-reload route (`POST /configs/reload`) is registered at both locations.\n\n**5. Config Precedence at Runtime**\nThe stated precedence is: env vars (Viper) > DB config (overlay at startup) > file config > defaults. The DB overlay happens once, at server construction, not continuously. The RELOAD endpoint (`POST /configs/reload`) re-invokes `overlayDBConfig` live, but this modifies the in-memory `*config.Config` struct that was already used to initialize services \u2014 downstream services (health monitor intervals, cleanup, etc.) are NOT reinitialized.\n\n**6. Default Config Change**\n`agentfield.yaml` gains `config_management.enabled: true, read_only: false` under `features.connector.capabilities`, enabling the capability by default for the dev/test token.", - "risk_surfaces": [ - "AUTHORIZATION GAP \u2014 config CRUD routes at /api/v1/configs/:key (server.go:1552-1554) are registered with no authentication beyond the global API key middleware. If no API key is configured (the default dev/test scenario with api.auth.api_key empty), these endpoints are completely unauthenticated. 
Any caller can read, write, or delete the server's configuration YAML, including settings like CORS origins, admin_token, and DID authorization flags. The PR description states 'admin-authenticated' in a comment but the code does not use AdminTokenAuth on this route group.", - "HOT-RELOAD DOES NOT PROPAGATE \u2014 overlayDBConfig (config_db.go:19-50) modifies s.config in place, but all services were constructed from that config at startup (health monitor intervals, cleanup batch sizes, webhook timeouts, execution cleanup enabled flag, CORS origins, etc.). Calling POST /configs/reload via the ReloadConfig handler will silently succeed while leaving all service behaviors unchanged until the next server restart. Callers will expect the reload to take effect immediately.", - "DUPLICATE ROUTE REGISTRATION \u2014 ConfigStorageHandlers.RegisterRoutes registers the same five route patterns (GET /configs, GET /configs/:key, PUT /configs/:key, DELETE /configs/:key, POST /configs/reload) twice: once on agentAPI at server.go:1552-1554 and again inside the connector capability group at server.go:1573-1578. Both registrations use the same handler instance but with different middleware chains. In Gin, duplicate route registration panics at startup if patterns conflict. The connector group uses prefix /connector, so the full paths differ (/api/v1/configs vs /api/v1/connector/configs), but this dual registration is non-obvious and the inner configHandlers variable shadows the outer one (server.go:1552 vs 1576).", - "SCHEMA DUAL-PATH DIVERGENCE \u2014 The table is created via two independent mechanisms: GORM AutoMigrate (migrations.go:236) and Goose migration 028. For SQLite (local mode), only GORM AutoMigrate runs. For PostgreSQL in managed deployments using Goose, both run. 
The GORM model has `version NOT NULL DEFAULT 1` and auto-increments on update via GORM hooks, but the Goose SQL schema has `version INTEGER NOT NULL DEFAULT 1` with no trigger or sequence for auto-increment \u2014 the increment logic must be in the Go GORM layer (Upsert with version+1). If a raw SQL INSERT bypasses GORM, version will always be 1. Additionally, if AutoMigrate runs first on a fresh Postgres DB and then Goose migration 028 also runs, the CREATE TABLE IF NOT EXISTS in 028 is a no-op, so no conflict \u2014 but this dual-track is a maintenance hazard.", - "VERSION INCREMENT CONTRACT \u2014 The ConfigStorageModel has `Version int` with `gorm:\"default:1\"`. SetConfig presumably uses an upsert that increments version, but the diff does not show the actual SetConfig implementation in local.go (the new methods are not in the shown portion of local.go). If the GORM upsert does not explicitly increment version (e.g., uses Save without a version bump expression), the audit trail promise in the PR description is broken. The ConfigEntry DTO exposes Version to API callers who may rely on it for optimistic locking.", - "YAML INJECTION / ARBITRARY CONFIG OVERRIDE \u2014 PUT /configs/:key accepts raw bytes with no YAML schema validation before storage. On load, overlayDBConfig calls yaml.Unmarshal into a config.Config struct. A malformed YAML will return a parse error at startup/reload (safe), but a structurally valid YAML that sets unexpected fields (e.g., changing agentfield.port, did.authorization.admin_token to empty string, or disabling DID) will silently succeed because mergeDBConfig only checks for zero-values before applying. 
If admin_token is set to a non-empty value in the file but an empty string in the DB config, the zero-value guard (`dbCfg.Features.DID.Method != \"\"` line config_db.go:87) prevents the DID block from being applied \u2014 but the Approval block (config_db.go:82-84) is applied wholesale if either WebhookSecret or DefaultExpiryHours is non-zero.", - "BOOL FIELD OVERRIDE HEURISTIC \u2014 mergeDBConfig uses a heuristic to decide whether to override the boolean ExecutionCleanup.Enabled: it only overrides if RetentionPeriod or CleanupInterval is also non-zero (config_db.go:79-81). This means a DB config that sets `enabled: false` alone will be silently ignored. An operator who stores a config to disable cleanup will be surprised that cleanup keeps running.", - "CONNECTOR CAPABILITY CHECK MIDDLEWARE PLACEMENT \u2014 At server.go:1573-1574, a new middleware.ConnectorCapabilityCheck is applied to a sub-group of the connector group. If ConnectorCapabilityCheck uses c.Abort() correctly, requests without config_management capability will be rejected. However, the capability check middleware is applied to a new group created with connectorGroup.Group(\"\") \u2014 in Gin, middleware from connectorGroup is inherited by this sub-group. If ConnectorCapabilityCheck does not call c.Abort() on failure, requests could fall through to the handlers. This pattern should be verified.", - "CONCURRENT CONFIG MODIFICATION \u2014 s.config (*config.Config) is a pointer shared across goroutines (health monitor, status manager, cleanup service all hold references or read from it). overlayDBConfig modifies the struct fields without any synchronization (no mutex, no atomic swap). Concurrent reads from goroutines checking config values (e.g., cleanup interval, node health thresholds) while a reload is in progress creates a data race.", - "CONTEXT TIMEOUT IN HOT-RELOAD PATH \u2014 overlayDBConfig creates its own 10-second context (config_db.go:20). 
When invoked from the HTTP RELOAD handler (handlers/config_storage.go:121), this timeout is independent of the request context. If the DB is slow, the reload may succeed from the handler's perspective but the overlay timed out, and the handler returns HTTP 500 (error path at config_db.go:25) while the handler at line 122-126 propagates that as 500. This is fine, but the handler at line 128 returns 200 on success without indicating what changed.", - "MISSING POSTGRESQL IMPLEMENTATION VERIFICATION \u2014 The SetConfig/GetConfig/ListConfigs/DeleteConfig implementations on LocalStorage are not shown in the diff (the local.go diff shows 108 additions but the shown content is pre-existing code). For PostgreSQL mode, the GORM-based implementation must handle the upsert (incrementing version) correctly. If the implementation uses GORM Save() on a new record vs. an existing one, behavior may differ between SQLite and PostgreSQL due to different GORM driver behaviors for upsert with auto-increment fields." - ], - "stats": { - "files_added": 3, - "files_modified": 7, - "files_removed": 0, - "files_renamed": 0, - "test_files_changed": 1, - "test_to_code_ratio": 0.1111111111111111, - "total_additions": 438, - "total_deletions": 15, - "total_files": 10 - }, - "unrelated_changes": [ - "agentfield.yaml already has a connector section; adding config_management capability to the default dev config (lines 149-151) is functional but also sets read_only: false with a known test token ('test-connector-token-123'), which means any deployment using the default config file without overriding has this capability enabled with a public token." 
- ] - }, - "budget": { - "budget_exhausted": true, - "cost_breakdown": { - "adversary": 0, - "anatomy": 0, - "coverage": 0, - "cross_ref": 0, - "intake": 0, - "meta_selectors": 0, - "output": 0, - "review": 0, - "synthesis": 0 - }, - "max_cost_usd": 2, - "max_duration_seconds": 900, - "total_cost_usd": 0 - }, - "intake": { - "ai_generated": 0.8, - "areas_touched": [ - "database", - "api", - "tests", - "config" - ], - "complexity": "complex", - "languages": [ - "go", - "sql", - "yaml" - ], - "pr_summary": "## Summary\n- Add `config_storage` table (GORM model + Goose migration 028) for storing configuration files in the database\n- Implement `SetConfig`/`GetConfig`/`ListConfigs`/`DeleteConfig` on the `StorageProvider` interface (works on both SQLite and PostgreSQL)\n- Add `AGENTFIELD_CONFIG_SOURCE=db` environment variable to load config from the database at startup (overlays on top of file config, preserving storage section for bootstrap)\n- Add CRUD API endpoints at `GET/PUT/DELETE /api/v1/configs/:key`\n- Add connector-scoped config routes gated by `config_management` capability\n- Add `config_management` capability to default `agentfield.yaml`\n\n## How It Works\n1. **Store config in DB**: `PUT /api/v1/configs/agentfield.yaml` with YAML body\n2. **Load from DB at startup**: Set `AGENTFIELD_CONFIG_SOURCE=db` \u2192 server reads config from DB after storage init\n3. **Remote management**: SaaS \u2192 connector \u2192 `config_management` capability \u2192 CP config API\n4. **Precedence**: env vars > DB config > file config > defaults\n5. 
**Bootstrap safety**: The `storage` section is never overridden from DB (DB connection can't come from DB)\n\n## Related PRs\n- Connector: Agent-Field/connector (config_management capability)\n- hax-sdk: Agent-Field/hax-sdk (config editor UI)\n\n## Test plan\n- [x] `go build ./...` passes\n- [x] Server tests pass\n- [x] Storage test failure is pre-existing (FTS5 not available)\n- [ ] Manual test: create config via API, verify it loads on restart with `AGENTFIELD_CONFIG_SOURCE=db`\n- [ ] Manual test: verify connector flow end-to-end\n\n\ud83e\udd16 Generated with [Claude Code](https://claude.com/claude-code)", - "pr_type": "feature", - "review_depth": "standard", - "risk_signals": [ - "modifies data model or schema-affecting code", - "changes API surface or request/response behavior", - "includes configuration changes", - "test behavior updated" - ] - }, - "phases_completed": [ - "intake", - "anatomy", - "meta_selectors", - "review", - "adversary", - "cross_ref", - "coverage", - "synthesis", - "output" - ], - "plan": { - "ai_adjusted": false, - "cross_ref_hints": [], - "dimensions": [ - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 90, - "max_reference_follows": 4 - }, - "context_files": [ - "control-plane/config/agentfield.yaml", - "control-plane/internal/server/config_db.go" - ], - "id": "semantic_sem_01", - "name": "Unauthenticated Config Write/Read/Delete via /api/v1/configs", - "priority": 10, - "review_prompt": "The config CRUD routes at /api/v1/configs/:key are registered on agentAPI (server.go:1552-1554) with no authentication beyond the global API key middleware. Investigate: (1) What is the default value of api.auth.api_key in agentfield.yaml and in the dev/test environment? If it is empty or not set, is the global API key middleware a no-op? (2) Is there any AdminTokenAuth or equivalent applied to this route group? The PR description says 'admin-authenticated' but the code must be checked. 
(3) What data is accessible via GET /configs? Can an unauthenticated caller retrieve the stored agentfield.yaml which may contain admin_token, webhook secrets, DID config, or CORS origins? (4) Can an unauthenticated caller PUT /configs/agentfield.yaml and override security-sensitive config fields (admin_token, did.authorization, cors.allowed_origins)? Focus on the authorization gap between the /api/v1/configs routes and the /api/v1/connector/configs routes (which DO have ConnectorCapabilityCheck). Determine the actual security boundary enforced at runtime.", - "target_files": [ - "control-plane/internal/server/server.go", - "control-plane/internal/handlers/config_storage.go" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 4 - }, - "context_files": [ - "control-plane/internal/handlers/config_storage.go", - "control-plane/internal/server/config_db.go" - ], - "id": "mechanical_mech_01", - "name": "StorageProvider Interface Completeness: Missing Method Implementations", - "priority": 10, - "review_prompt": "Verify that ALL concrete types implementing `StorageProvider` (storage/storage.go:132-136) have implementations for the four new methods: `SetConfig`, `GetConfig`, `ListConfigs`, and `DeleteConfig`. Specifically: (1) Check `LocalStorage` in `storage/local.go` \u2014 the diff claims 108 additions but the shown content may not include these methods. Confirm they exist and have the correct signatures matching the interface exactly (parameter types, return types). (2) If there is a PostgreSQL-specific storage type or any mock/stub in tests, confirm it also satisfies the interface or will produce a compile error. (3) Verify the `ConfigEntry` DTO (storage/storage.go:30-38) return type matches what callers in `handlers/config_storage.go` and `server/config_db.go` expect \u2014 e.g., does `GetConfig` return `(*ConfigEntry, error)` or `(ConfigEntry, error)`? 
Pointer vs value mismatches will cause compile failures or nil-dereference panics at runtime.", - "target_files": [ - "control-plane/internal/storage/storage.go", - "control-plane/internal/storage/local.go" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.4, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "control-plane/internal/handlers/config_storage.go" - ], - "id": "mechanical_mech_02", - "name": "Gin Route Registration: Duplicate Pattern Panic Risk", - "priority": 9, - "review_prompt": "Inspect `server/server.go:1550-1578` for the dual registration of config routes. In Gin, registering two routes with identical HTTP method + full path will panic at startup. Determine the full resolved paths for both registrations: (a) the `agentAPI` group base path + `/configs`, `/configs/:key`, `/configs/reload` and (b) the connector group base path + the sub-group prefix + the same suffixes. Confirm that the full paths are truly distinct (e.g., `/api/v1/configs/...` vs `/api/v1/connector/configs/...`). Also check whether the `:key` parameter name is consistent \u2014 if one registration uses `:key` and another uses a different param name at the same position within the same router tree segment, Gin will panic with a wildcard conflict error. 
Additionally, verify that the `configHandlers` variable at server.go:1552 and the re-used or shadowing `configHandlers` at server.go:1576 reference the same `ConfigStorageHandlers` instance with the same `ConfigReloadFunc` \u2014 if the inner declaration creates a new instance without a reload func, the connector-facing RELOAD endpoint will always return 503.", - "target_files": [ - "control-plane/internal/server/server.go" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 90, - "max_reference_follows": 4 - }, - "context_files": [ - "control-plane/config/agentfield.yaml" - ], - "id": "semantic_sem_02", - "name": "Hot-Reload Does Not Reinitialize Services \u2014 Silent Staleness", - "priority": 8, - "review_prompt": "overlayDBConfig (config_db.go:19-50) modifies the in-memory *config.Config struct in place. However, all services (health monitor, cleanup service, webhook dispatcher, CORS middleware, etc.) were constructed using that config at startup and hold either a copy of config values or a pointer to the struct. Investigate: (1) Do downstream services read config values lazily (via the pointer at call time) or eagerly (copied into local fields at construction)? If eagerly copied, a reload will have zero effect on running behavior. (2) After a successful POST /configs/reload, what actually changes at runtime vs. what the caller expects to change? (3) Does the reload handler (handlers/config_storage.go:121-128) return any indication of what fields were applied? Does it return HTTP 200 with no body to indicate success, even when no services were reinitialized? (4) Is there a documented contract for which config fields take effect on reload vs. which require restart? 
If not, this is a behavioral contract violation for callers who rely on reload to change operational parameters.", - "target_files": [ - "control-plane/internal/server/config_db.go", - "control-plane/internal/handlers/config_storage.go", - "control-plane/internal/server/server.go" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.4, - "max_duration_seconds": 60, - "max_reference_follows": 4 - }, - "context_files": [ - "control-plane/internal/storage/storage.go", - "control-plane/internal/server/server.go" - ], - "id": "mechanical_mech_03", - "name": "overlayDBConfig: yaml.Unmarshal Target Type and Nil Pointer Safety", - "priority": 8, - "review_prompt": "Trace the exact runtime execution of `overlayDBConfig` in `server/config_db.go:19-50`. (1) Verify `GetConfig` returns a type from which the raw YAML bytes/string are accessed \u2014 confirm no nil pointer dereference if the key `agentfield.yaml` does not exist in the DB (the not-found code path must return early without error). (2) Confirm `yaml.Unmarshal` is called with a `*config.Config` target \u2014 if called with a value type, the populated struct is discarded. (3) In `mergeDBConfig`, confirm each field access on `dbCfg` (the unmarshaled struct) is nil-safe \u2014 if `dbCfg.Features` or nested structs are pointer types and the YAML omits those sections, accessing `dbCfg.Features.DID.Method` at config_db.go:87 will panic with a nil pointer dereference. 
Check whether `config.Config` uses value types or pointer types for nested structs, and whether `yaml.Unmarshal` zero-initializes nested structs or leaves them nil when YAML keys are absent.", - "target_files": [ - "control-plane/internal/server/config_db.go" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "control-plane/internal/storage/storage.go", - "control-plane/internal/storage/local.go" - ], - "id": "systemic_systemic_schema_dual_path", - "name": "Schema Dual-Path Divergence: GORM AutoMigrate vs Goose Migration", - "priority": 8, - "review_prompt": "This PR creates the `config_storage` table via two independent mechanisms: GORM AutoMigrate (storage/migrations.go:236) and a Goose SQL migration (migrations/028_create_config_storage.sql). Investigate whether this dual-track schema management is consistent with how other tables in this codebase are managed. Specifically: (1) Do other models use both AutoMigrate AND a Goose migration, or is one mechanism the established pattern? (2) Does the GORM model schema (version auto-increment via hooks) match the Goose SQL DDL precisely, or are there divergences (e.g., missing triggers, different column constraints)? (3) If AutoMigrate runs first on a fresh PostgreSQL database and Goose migration 028 also runs, is the result deterministic and conflict-free? (4) What is the maintenance risk if the GORM model is updated but the Goose migration is not (or vice versa)? 
Conclude whether this dual-path is justified or whether it introduces a long-term maintenance hazard inconsistent with the codebase's existing migration strategy.", - "target_files": [ - "control-plane/internal/storage/migrations.go", - "control-plane/internal/storage/models.go", - "control-plane/migrations/028_create_config_storage.sql" - ] - } - ], - "total_budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - } - } - }, - "pr_url": "https://github.com/Agent-Field/agentfield/pull/254", - "review": { - "body": "## \ud83d\udd34 PR-AF Review \u2014 **Needs Major Rework**\n\n*Automated multi-agent code review \u00b7 [PR-AF](https://github.com/Agent-Field/agentfield) built with [AgentField](https://github.com/Agent-Field/agentfield)*\n\n> **20 findings** \u00b7 \ud83d\udd34 3 critical \u00b7 \ud83d\udfe0 12 important \u00b7 \ud83d\udd35 2 suggestions \u00b7 \u26aa 3 nitpicks\n\n
\nPR Overview\n\n## Summary\n- Add `config_storage` table (GORM model + Goose migration 028) for storing configuration files in the database\n- Implement `SetConfig`/`GetConfig`/`ListConfigs`/`DeleteConfig` on the `StorageProvider` interface (works on both SQLite and PostgreSQL)\n- Add `AGENTFIELD_CONFIG_SOURCE=db` environment variable to load config from the database at startup (overlays on top of file config, preserving storage section for bootstrap)\n- Add CRUD API endpoints at `GET/PUT/DELETE /api/v1/configs/:key`\n- Add connector-scoped config routes gated by `config_management` capability\n- Add `config_management` capability to default `agentfield.yaml`\n\n## How It Works\n1. **Store config in DB**: `PUT /api/v1/configs/agentfield.yaml` with YAML body\n2. **Load from DB at startup**: Set `AGENTFIELD_CONFIG_SOURCE=db` \u2192 server reads config from DB after storage init\n3. **Remote management**: SaaS \u2192 connector \u2192 `config_management` capability \u2192 CP config API\n4. **Precedence**: env vars > DB config > file config > defaults\n5. **Bootstrap safety**: The `storage` section is never overridden from DB (DB connection can't come from DB)\n\n## Related PRs\n- Connector: Agent-Field/connector (config_management capability)\n- hax-sdk: Agent-Field/hax-sdk (config editor UI)\n\n## Test plan\n- [x] `go build ./...` passes\n- [x] Server tests pass\n- [x] Storage test failure is pre-existing (FTS5 not available)\n- [ ] Manual test: create config via API, verify it loads on restart with `AGENTFIELD_CONFIG_SOURCE=db`\n- [ ] Manual test: verify connector flow end-to-end\n\n\ud83e\udd16 Generated with [Claude Code](https://claude.com/claude-code)\n\n
\n\n### Key Findings\n\n**15 issue(s) should be addressed before merge:**\n\n- \ud83d\udd34 **MockStorageProvider implements SetConfig/GetConfig with wrong signatures and is missing ListConfigs and DeleteConfig entirely** (`control-plane/internal/handlers/ui/config_test.go:289`) \u2014 The `MockStorageProvider` in `config_test.go` (and identically in `execute_test.go`) implements `SetConfig` and `GetConfig` with signatures that do **not** match the `StorageProvider` interface define\u2026\n- \ud83d\udd34 **PUT /configs/agentfield.yaml can overwrite admin_token and internal_token via mergeDBConfig when DID.Method is set** (`control-plane/internal/server/config_db.go:87`) \u2014 When `AGENTFIELD_CONFIG_SOURCE=db` is set, `mergeDBConfig` in `config_db.go:87-89` replaces the **entire** `target.Features.DID` struct \u2014 including `Authorization.AdminToken` and `Authorization.Intern\u2026\n- \ud83d\udd34 **Config CRUD routes are not admin-authenticated: comment is false, no AdminTokenAuth applied** (`control-plane/internal/server/server.go:1550`) \u2014 The comment at line 1550 says `// Config storage routes (admin-authenticated)` but **no `AdminTokenAuth` middleware is applied**.\n- \ud83d\udfe0 **POST /configs/reload returns HTTP 200 with a success message even though most running services are unaffected by the reload** (`control-plane/internal/handlers/config_storage.go:121`) \u2014 The `ReloadConfig` handler returns: ```json {\"message\": \"config reloaded from database\"} ``` with `HTTP 200` when `reloadFn()` succeeds.\n- \ud83d\udfe0 **Dual-path schema creation for config_storage breaks the established single-source-of-truth migration pattern** (`control-plane/internal/storage/migrations.go:236`) \u2014 The `config_storage` table is created via two independent mechanisms that are never coordinated: 1.\n- \ud83d\udfe0 **AdminTokenAuth is a no-op when adminToken is empty \u2014 existing admin routes (tag approval, policy management) are unprotected in default dev 
config** (`control-plane/internal/server/middleware/auth.go:90`) \u2014 The comment on `AdminTokenAuth` says *\"falls back to global API key auth\"* when `adminToken` is empty.\n- \ud83d\udfe0 **GetConfig uses fragile string comparison instead of errors.Is(sql.ErrNoRows) for not-found detection** (`control-plane/internal/storage/local.go:5186`) \u2014 `GetConfig` at line 5186 checks for the not-found condition by comparing the error's string representation: ```go if err.Error() == \"sql: no rows in result set\" { return nil, nil } ``` This is f\u2026\n- \ud83d\udfe0 **Fragile `no rows` detection via string comparison instead of `errors.Is(sql.ErrNoRows)`** (`control-plane/internal/storage/local.go:5179`) \u2014 The `GetConfig` implementation detects a missing key by comparing the error string: ```go if err.Error() == \"sql: no rows in result set\" { return nil, nil } ``` This is the critical code path th\u2026\n- \u2026 and 7 more (see All Findings by Severity)\n\n**5 suggestion(s) and style note(s):**\n\n- \ud83d\udd35 DeleteConfig handler returns 404 for all storage errors, including 500-class failures (`control-plane/internal/handlers/config_storage.go:104`)\n- \ud83d\udd35 Version increment is application-enforced only; no DB-level constraint prevents version regression or skipping (`control-plane/internal/storage/models.go:483`)\n- \u26aa Redundant index on config_storage(key): the UNIQUE constraint already implies a unique index (`control-plane/migrations/028_create_config_storage.sql:14`)\n- \u26aa `fmt.Println`/`fmt.Printf` used for logging instead of the structured logger (`control-plane/internal/server/config_db.go:28`)\n- \u26aa Verified: no path conflict and no 503 regression from second `configHandlers` instantiation (`control-plane/internal/server/server.go:1572`)\n\n**Files with findings:** `control-plane/internal/handlers/config_storage.go`, `control-plane/internal/handlers/ui/config_test.go`, `control-plane/internal/server/config_db.go`, 
`control-plane/internal/server/middleware/auth.go`, `control-plane/internal/server/server.go`, `control-plane/internal/storage/local.go`, `control-plane/internal/storage/migrations.go`, `control-plane/internal/storage/models.go`, `control-plane/migrations/028_create_config_storage.sql`\n\n
\nAll Findings by Severity\n\n#### \ud83d\udd34 Critical (3)\n\n- **MockStorageProvider implements SetConfig/GetConfig with wrong signatures and is missing ListConfigs and DeleteConfig entirely** `control-plane/internal/handlers/ui/config_test.go:289`\n- **PUT /configs/agentfield.yaml can overwrite admin_token and internal_token via mergeDBConfig when DID.Method is set** `control-plane/internal/server/config_db.go:87`\n- **Config CRUD routes are not admin-authenticated: comment is false, no AdminTokenAuth applied** `control-plane/internal/server/server.go:1550`\n\n#### \ud83d\udfe0 Important (12)\n\n- **POST /configs/reload returns HTTP 200 with a success message even though most running services are unaffected by the reload** `control-plane/internal/handlers/config_storage.go:121`\n- **Dual-path schema creation for config_storage breaks the established single-source-of-truth migration pattern** `control-plane/internal/storage/migrations.go:236`\n- **AdminTokenAuth is a no-op when adminToken is empty \u2014 existing admin routes (tag approval, policy management) are unprotected in default dev config** `control-plane/internal/server/middleware/auth.go:90`\n- **GetConfig uses fragile string comparison instead of errors.Is(sql.ErrNoRows) for not-found detection** `control-plane/internal/storage/local.go:5186`\n- **Fragile `no rows` detection via string comparison instead of `errors.Is(sql.ErrNoRows)`** `control-plane/internal/storage/local.go:5179`\n- **Config routes registered on unauthenticated `agentAPI` group \u2014 no dedicated auth guard** `control-plane/internal/server/server.go:1550`\n- **WebhookDispatcher and ExecuteHandler/ApprovalWebhookHandler capture config values eagerly: reload cannot change webhook timeouts, agent-call timeout, secrets, or internal token** `control-plane/internal/server/server.go:366`\n- **ExecutionCleanupService copies config by value at construction: reload has zero effect on running behavior** 
`control-plane/internal/server/server.go:392`\n- **HealthMonitor copies config by value at construction: NodeHealth interval/timeout changes on reload are silently ignored** `control-plane/internal/server/server.go:160`\n- **CORS middleware is registered once at startup: reloading API.CORS config has no effect on running requests** `control-plane/internal/server/server.go:831`\n- **Goose migration for config_storage omits the updated_at auto-update trigger that equivalent tables have, and GORM autoUpdateTime does not replace it** `control-plane/migrations/028_create_config_storage.sql:10`\n- **SetConfig accepts arbitrary keys and values with no validation \u2014 allows storing malformed YAML or overwriting critical system keys** `control-plane/internal/handlers/config_storage.go:67`\n\n#### \ud83d\udd35 Suggestion (2)\n\n- **DeleteConfig handler returns 404 for all storage errors, including 500-class failures** `control-plane/internal/handlers/config_storage.go:104`\n- **Version increment is application-enforced only; no DB-level constraint prevents version regression or skipping** `control-plane/internal/storage/models.go:483`\n\n#### \u26aa Nitpick (3)\n\n- **Redundant index on config_storage(key): the UNIQUE constraint already implies a unique index** `control-plane/migrations/028_create_config_storage.sql:14`\n- **`fmt.Println`/`fmt.Printf` used for logging instead of the structured logger** `control-plane/internal/server/config_db.go:28`\n- **Verified: no path conflict and no 503 regression from second `configHandlers` instantiation** `control-plane/internal/server/server.go:1572`\n\n
\n\n
\nReview Process Details\n\n**Dimensions Analyzed (6):**\n\n- **Unauthenticated Config Write/Read/Delete via /api/v1/configs** \u2014 2 file(s)\n- **StorageProvider Interface Completeness: Missing Method Implementations** \u2014 2 file(s)\n- **Gin Route Registration: Duplicate Pattern Panic Risk** \u2014 1 file(s)\n- **Hot-Reload Does Not Reinitialize Services \u2014 Silent Staleness** \u2014 3 file(s)\n- **overlayDBConfig: yaml.Unmarshal Target Type and Nil Pointer Safety** \u2014 1 file(s)\n- **Schema Dual-Path Divergence: GORM AutoMigrate vs Goose Migration** \u2014 3 file(s)\n\n**Meta-Dimension Lenses (3):**\n\n- **Semantic** \u2014 5 dimension(s), 87% coverage confidence\n- **Mechanical** \u2014 5 dimension(s), 82% coverage confidence\n- **Systemic** \u2014 2 dimension(s), 78% coverage confidence\n\n**Cross-Reference & Adversary Analysis:**\n\n- **17** finding(s) adversarially tested: 9 confirmed, 8 challenged\n\n
\n\n
\nPipeline Stats\n\n| Metric | Value |\n|--------|-------|\n| Duration | 1933.7s |\n| Agent invocations | 15 |\n| Coverage iterations | 0 |\n| Estimated cost | N/A (provider does not report cost) |\n| Budget exhausted | Yes (timeout: 1933s > 900s limit) |\n| PR type | feature |\n| Complexity | complex |\n\n
\n\nReview ID: `rev_4840f78ef080`", - "comments": [ - { - "body": "\ud83d\udd34 **[CRITICAL] PUT /configs/agentfield.yaml can overwrite admin_token and internal_token via mergeDBConfig when DID.Method is set**\n\nWhen `AGENTFIELD_CONFIG_SOURCE=db` is set, `mergeDBConfig` in `config_db.go:87-89` replaces the **entire** `target.Features.DID` struct \u2014 including `Authorization.AdminToken` and `Authorization.InternalToken` \u2014 with values from the DB-stored YAML if `dbCfg.Features.DID.Method != \"\"`.\n\n```go\n// config_db.go:86-89\nif dbCfg.Features.DID.Method != \"\" {\n target.Features.DID = dbCfg.Features.DID // replaces AdminToken, InternalToken, all auth config\n}\n```\n\nThe comment at line 94 says `// API settings (but never override API key from DB for security)` and correctly protects `API.Auth.APIKey`. However, `AdminToken` (used to guard admin routes including tag approval, policy management, and the config routes themselves) and `InternalToken` (used as bearer for agent-to-agent calls) are both nested under `Features.DID.Authorization` and are **not similarly protected**.\n\nAttack chain:\n1. Attacker calls `PUT /api/v1/configs/agentfield.yaml` with a YAML body containing `features.did.method: did:key` and `features.did.authorization.admin_token: attacker-controlled-token` (unauthenticated, due to Finding 1).\n2. Attacker calls `POST /api/v1/configs/reload` to trigger `overlayDBConfig`.\n3. `mergeDBConfig` sees `dbCfg.Features.DID.Method == \"did:key\"` (non-empty), replaces `target.Features.DID` entirely, overwriting `AdminToken` with the attacker-controlled value.\n4. 
Attacker now has full `X-Admin-Token` admin access over tag approval, policy management, and all future admin routes.\n\n---\n\n> Step 1: Attacker sends `PUT /api/v1/configs/agentfield.yaml` with body `features:\\n did:\\n method: did:key\\n authorization:\\n admin_token: evil-token` \u2014 unauthenticated because `APIKeyAuth` is a no-op when `api_key` is empty (Finding 1).\n> Step 2: `SetConfig` at config_storage.go:85 calls `h.storage.SetConfig(ctx, \"agentfield.yaml\", body, \"api\")` \u2014 no validation or sanitization of the YAML content.\n> Step 3: Attacker sends `POST /api/v1/configs/reload`. `ReloadConfig` at config_storage.go:121 calls `h.reloadFn()` which calls `overlayDBConfig(s.config, s.storage)` (server.go:440).\n> Step 4: `overlayDBConfig` at config_db.go:37-42 parses the stored YAML into `dbCfg` and calls `mergeDBConfig(cfg, &dbCfg)`.\n> Step 5: `mergeDBConfig` at config_db.go:87-89: `dbCfg.Features.DID.Method == \"did:key\"` (non-empty), so `target.Features.DID = dbCfg.Features.DID` executes, replacing `Authorization.AdminToken` with `evil-token`.\n> Step 6: Subsequent requests using `X-Admin-Token: evil-token` are accepted by `AdminTokenAuth` at middleware/auth.go:99.\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd explicit protection in `mergeDBConfig` for security-sensitive fields inside `Features.DID`, mirroring the API key protection at line 94:\n\n```go\nif dbCfg.Features.DID.Method != \"\" {\n // Preserve security-sensitive authorization tokens \u2014 must come from file/env only\n savedAdminToken := target.Features.DID.Authorization.AdminToken\n savedInternalToken := target.Features.DID.Authorization.InternalToken\n target.Features.DID = dbCfg.Features.DID\n target.Features.DID.Authorization.AdminToken = savedAdminToken\n target.Features.DID.Authorization.InternalToken = savedInternalToken\n}\n```\n\nLong-term, fixing Finding 1 (adding AdminTokenAuth to the config routes) removes the unauthenticated write path, making this a defense-in-depth 
item. Both fixes should be applied.\n\n---\n*`Config CRUD Route Authorization Gap` \u00b7 confidence 92%*", - "line": 87, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd34 **[CRITICAL] Config CRUD routes are not admin-authenticated: comment is false, no AdminTokenAuth applied**\n\nThe comment at line 1550 says `// Config storage routes (admin-authenticated)` but **no `AdminTokenAuth` middleware is applied**. The routes are registered directly on `agentAPI` (the bare `/api/v1` group) with no sub-group and no `.Use(middleware.AdminTokenAuth(...))` call.\n\nCompare this with lines 1532\u20131545 where the actual admin-protected routes are set up:\n\n```go\n// Lines 1532-1545 \u2014 ACTUAL admin auth\nadminGroup := agentAPI.Group(\"\")\nadminGroup.Use(middleware.AdminTokenAuth(s.config.Features.DID.Authorization.AdminToken))\n```\n\nBut the config routes at lines 1551\u20131554 are:\n\n```go\n// Lines 1550-1555 \u2014 NO admin auth applied\n{\n configHandlers := handlers.NewConfigStorageHandlers(s.storage, s.configReloadFn())\n configHandlers.RegisterRoutes(agentAPI) // directly on agentAPI, NOT on adminGroup\n}\n```\n\nThe **only** protection is the global `middleware.APIKeyAuth` at line 881. As confirmed in `middleware/auth.go:26-29`, when `config.APIKey == \"\"` the middleware is an explicit no-op (`c.Next()` is called immediately). The default `agentfield.yaml` in the repo has **no `api.auth.api_key` field at all**, meaning `cfg.API.Auth.APIKey` is the zero value (empty string). 
The dev environment therefore runs fully unauthenticated.\n\nThis means on any default or dev deployment:\n- `GET /api/v1/configs` \u2014 lists **all** stored configuration entries including `agentfield.yaml`\n- `GET /api/v1/configs/agentfield.yaml` \u2014 returns the full config YAML including `admin_token`, `internal_token`, `webhook_secret`, DID keystore config\n- `PUT /api/v1/configs/agentfield.yaml` \u2014 overwrites the stored config, and if `AGENTFIELD_CONFIG_SOURCE=db` is set, `POST /api/v1/configs/reload` activates it, allowing an attacker to replace `admin_token`, `cors.allowed_origins`, DID authorization settings, etc.\n- `DELETE /api/v1/configs/:key` \u2014 deletes any stored configuration key\n\n---\n\n> Step 1: `setupRoutes()` (server.go:831) registers global middleware including `middleware.APIKeyAuth(middleware.AuthConfig{APIKey: s.config.API.Auth.APIKey, ...})` at line 881.\n> Step 2: `middleware.APIKeyAuth` at `middleware/auth.go:26-29` returns `c.Next()` immediately when `config.APIKey == \"\"`.\n> Step 3: `agentfield.yaml` (config/agentfield.yaml) has no `api.auth.api_key` key at all. `AuthConfig.APIKey` is an untagged Go string, defaulting to `\"\"`. The `applyEnvOverrides` function at config.go:263 only overrides if `AGENTFIELD_API_KEY` env var is non-empty.\n> Step 4: With no API key set, the global middleware is a no-op. No other middleware guards the `/api/v1/configs` routes.\n> Step 5: `configHandlers.RegisterRoutes(agentAPI)` at server.go:1553 calls `group.GET(\"/configs\", ...)`, `group.GET(\"/configs/:key\", ...)`, `group.PUT(\"/configs/:key\", ...)`, `group.DELETE(\"/configs/:key\", ...)`, and `group.POST(\"/configs/reload\", ...)` directly on the unauthenticated `agentAPI` group (server.go:1164 `agentAPI := s.Router.Group(\"/api/v1\")`).\n> Step 6: `GetConfig` at config_storage.go:51-63 calls `h.storage.GetConfig(ctx, key)` and returns the full entry value without redaction. 
`ListConfigs` at config_storage.go:35-48 returns all entries.\n> Step 7: Any unauthenticated HTTP client can `curl http://localhost:8080/api/v1/configs/agentfield.yaml` and receive the stored YAML including secrets.\n\n**\ud83d\udca1 Suggested Fix**\n\nCreate a dedicated sub-group with `AdminTokenAuth` applied before registering config routes, mirroring the pattern used for tag-approval and access-policy admin routes (lines 1532\u20131545):\n\n```go\n// Config storage routes \u2014 require admin token\nconfigAdminGroup := agentAPI.Group(\"\")\nconfigAdminGroup.Use(middleware.AdminTokenAuth(s.config.Features.DID.Authorization.AdminToken))\nconfigHandlers := handlers.NewConfigStorageHandlers(s.storage, s.configReloadFn())\nconfigHandlers.RegisterRoutes(configAdminGroup)\n```\n\nNote: `AdminTokenAuth` is itself a no-op when `adminToken == \"\"` (see `middleware/auth.go:92-94`), so the admin token must also be required to be non-empty for this to be effective in production. Add a startup warning (similar to line 268) if the config routes are reachable but `AdminToken` is empty.\n\n---\n*`Config CRUD Route Authorization Gap` \u00b7 confidence 98%*", - "line": 1550, - "path": "control-plane/internal/server/server.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] POST /configs/reload returns HTTP 200 with a success message even though most running services are unaffected by the reload**\n\nThe `ReloadConfig` handler returns:\n\n```json\n{\"message\": \"config reloaded from database\"}\n```\n\nwith `HTTP 200` when `reloadFn()` succeeds. However, `reloadFn` is `overlayDBConfig`, which **only mutates the in-memory `*config.Config` struct**. 
As established by the other findings in this review, the overwhelming majority of services that consume config values have already copied those values at construction time and will not observe any change:\n\n- `ExecutionCleanupService` \u2014 reads retention period, cleanup interval, batch size from its own frozen copy\n- `HealthMonitor` \u2014 uses a frozen check interval ticker\n- `WebhookDispatcher` \u2014 uses a frozen `http.Client` timeout\n- `ExecuteHandler`/`ExecuteAsyncHandler` \u2014 use a frozen agent-call timeout\n- `ApprovalWebhookHandler` \u2014 uses a frozen HMAC secret\n- CORS middleware \u2014 configured once at `setupRoutes()` from the config values at that time\n- API key auth middleware \u2014 similarly frozen at route registration\n\nThe only fields that _are_ lazily re-read (because handlers call `s.config.*` directly) are a small subset of route-guard conditions checked on each request. But these are not what callers typically expect to change via a config reload.\n\nThere is **no documented contract** in the handler, any comment block, or any API response body that tells callers which fields are applied immediately versus which require a restart. A caller who updates `execution_cleanup.retention_period` in the DB, calls `POST /configs/reload`, receives `HTTP 200 \"config reloaded from database\"`, and concludes the cleanup service is now running with the new retention period is completely misled.\n\n---\n\n> Step 1: `config_storage.go:121` calls `h.reloadFn()` which is `overlayDBConfig(s.config, s.storage)` (server.go:440).\n> Step 2: `overlayDBConfig` calls `mergeDBConfig` which writes to fields of `*config.Config` in place (config_db.go:42,54-102).\n> Step 3: All background services examined hold value copies of the mutated fields (see companion findings above).\n> Step 4: `config_storage.go:128` returns `{\"message\": \"config reloaded from database\"}` \u2014 no qualification, no list of affected vs. 
unaffected subsystems.\n> Step 5: No code comment, no API documentation file, and no OpenAPI annotation in the target files describes which fields are hot-reloadable.\n\n**\ud83d\udca1 Suggested Fix**\n\nThe response body should be honest about what was applied. At minimum, add a disclaimer: return a structured body listing which config sections were merged and a note that changes to cleanup intervals, health monitor timings, webhook settings, and execution timeouts require a server restart to take effect. Longer term, either (a) implement true hot-reload for each service via `Reconfigure()` methods and enumerate the actually-reloaded subsystems in the response, or (b) make the API contract explicit in documentation and return a `partial_reload` status with a list of fields that only take effect after restart.\n\n---\n*`Config Reload Behavioral Contract` \u00b7 confidence 95%*", - "line": 121, - "path": "control-plane/internal/handlers/config_storage.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Dual-path schema creation for config_storage breaks the established single-source-of-truth migration pattern**\n\nThe `config_storage` table is created via two independent mechanisms that are never coordinated:\n\n1. **GORM AutoMigrate** (`migrations.go:236`): `&ConfigStorageModel{}` is included in the `autoMigrateSchema` call, which runs unconditionally on every server startup for **both** `local` (SQLite) and `postgres` modes.\n2. **Goose SQL migration** (`028_create_config_storage.sql`): A standalone DDL file intended to be run manually via `goose -dir ./migrations postgres ... up` before the server starts in PostgreSQL mode.\n\nEvery other model that has a Goose migration file also relies on GORM AutoMigrate for its schema (e.g., `DIDDocumentModel` \u2194 `019_create_did_documents.sql`, `AccessPolicyModel` \u2194 `021_create_access_policies.sql`, `AgentTagVCModel` \u2194 `022_create_agent_tag_vcs.sql`). 
This is the **established pattern** for this codebase: Goose files are the PostgreSQL-mode canonical DDL, and GORM AutoMigrate handles schema reconciliation on startup. `config_storage` follows this same dual-path \u2014 so the pattern is consistent \u2014 but the **design itself** is an undocumented hazard for future maintainers.\n\nThe critical risk is schema divergence over time. If a developer adds a column to `ConfigStorageModel` (e.g., `Tags string`), GORM AutoMigrate will silently add that column to both SQLite and PostgreSQL. But Goose migration `028` will not be updated. The reverse is equally true: if someone adds a `CHECK` constraint in a new Goose migration `029_alter_config_storage.sql`, GORM AutoMigrate will not reproduce it on a fresh install that skips Goose. Because neither mechanism has visibility into what the other has done, schema drift is a when-not-if scenario.\n\n---\n\n> Step 1: `StorageFactory.CreateStorage` (storage.go:350) calls `pgStorage.Initialize(ctx, ...)` for postgres mode.\n> Step 2: `Initialize` (local.go:534) calls `ls.initializePostgres(ctx)`.\n> Step 3: `initializePostgres` (local.go:734) calls `ls.createSchema(ctx)`.\n> Step 4: `createSchema` (local.go:862) calls `ls.autoMigrateSchema(ctx)` unconditionally, which includes `&ConfigStorageModel{}` (migrations.go:236), creating the table via GORM.\n> Step 5: The CLAUDE.md documentation instructs operators to also run `goose -dir ./migrations postgres ... 
up` before starting in PostgreSQL mode, which would also execute `028_create_config_storage.sql` (with `CREATE TABLE IF NOT EXISTS`, so no hard error, but the DDL is effectively applied twice from two separate sources).\n> Step 6: No mechanism prevents `ConfigStorageModel` fields from being changed in models.go without a corresponding Goose migration update.\n\n**\ud83d\udca1 Suggested Fix**\n\nDocument explicitly (in a comment in `migrations.go` near the AutoMigrate list, and in a header comment in `028_create_config_storage.sql`) that for PostgreSQL mode, the Goose file is the authoritative DDL for initial creation and structural constraints, while GORM AutoMigrate handles additive column additions. Add a CI check or test that compares the column set of the GORM model struct against the columns created by the corresponding Goose migration, to detect drift early. Alternatively, adopt the stricter approach used by `kv_store`, `distributed_locks`, and `memory_events` tables: create them entirely via `ensurePostgres*` helper functions (Go code with `CREATE TABLE IF NOT EXISTS`), removing the Goose SQL file entirely for purely application-managed tables.\n\n---\n*`Dual-Track Schema Management: AutoMigrate vs Goose` \u00b7 confidence 92%*", - "line": 236, - "path": "control-plane/internal/storage/migrations.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] GetConfig uses fragile string comparison instead of errors.Is(sql.ErrNoRows) for not-found detection**\n\n`GetConfig` at line 5186 checks for the not-found condition by comparing the error's string representation:\n\n```go\nif err.Error() == \"sql: no rows in result set\" {\n return nil, nil\n}\n```\n\nThis is fragile for two reasons:\n\n1. **Driver-dependent string**: The message `\"sql: no rows in result set\"` is the canonical text for `sql.ErrNoRows`, but the comparison bypasses the sentinel value. 
If any driver wraps `sql.ErrNoRows` (e.g., with `fmt.Errorf(\"...: %w\", sql.ErrNoRows)`), `errors.Is` would still match, but the string comparison would fail \u2014 causing a generic `\"failed to get config\"` error instead of the intended `nil, nil` (not-found) return.\n\n2. **Inconsistency**: Every other `GetX` method in `local.go` uses the idiomatic `errors.Is(err, sql.ErrNoRows)` pattern (e.g., `GetWorkflowRun` at line 300: `if errors.Is(err, sql.ErrNoRows) { return nil, nil }`). This deviation from the established pattern is a latent defect.\n\nThe downstream caller `config_db.go:27` relies on `entry == nil` to mean \"not found\" and prints an informational message. If the string comparison fails under a different driver or future wrapping, `overlayDBConfig` would instead return an error and potentially block server startup.\n\n---\n\n> Step 1: `GetConfig` at local.go:5185-5188 checks `err.Error() == \"sql: no rows in result set\"` to detect missing rows.\n> Step 2: `sql.ErrNoRows` is defined in `database/sql` as `var ErrNoRows = errors.New(\"sql: no rows in result set\")` \u2014 the string match coincidentally works today with direct `sql.QueryRowContext` usage.\n> Step 3: But `errors.Is(err, sql.ErrNoRows)` is the correct, future-proof idiom \u2014 used by the same file at line 300 (`GetWorkflowRun`), line 302: `if errors.Is(err, sql.ErrNoRows)`.\n> Step 4: If the underlying row scan ever returns a wrapped error (driver upgrade, middleware), `err.Error()` will not equal the bare string, causing a generic error to propagate instead of the nil-not-found signal.\n> Step 5: `config_db.go:27-29` consumes the nil return from `GetConfig` as \"no config in DB\" and silently continues; a spurious error here would cause `overlayDBConfig` to return an error, propagating to server startup.\n\n**\ud83d\udca1 Suggested Fix**\n\nReplace the string comparison with the standard sentinel check, consistent with the rest of the file:\n```go\nif errors.Is(err, sql.ErrNoRows) 
{\n return nil, nil\n}\n```\nThe `errors` package is already imported at line 8 of `local.go`.\n\n---\n*`StorageProvider Interface Implementation Completeness` \u00b7 confidence 85%*", - "line": 5186, - "path": "control-plane/internal/storage/local.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Fragile `no rows` detection via string comparison instead of `errors.Is(sql.ErrNoRows)`**\n\nThe `GetConfig` implementation detects a missing key by comparing the error string:\n\n```go\nif err.Error() == \"sql: no rows in result set\" {\n return nil, nil\n}\n```\n\nThis is the critical code path that `overlayDBConfig` depends on for safe early-return when `agentfield.yaml` does not exist in the DB. The guard in `overlayDBConfig` at line 27 (`if entry == nil { return nil }`) is only safe **if** `GetConfig` reliably returns `(nil, nil)` for a not-found key.\n\nThe string comparison is fragile for two concrete reasons:\n\n1. **Standard library contract:** `database/sql` defines `sql.ErrNoRows` as a sentinel error. The idiomatic and safe check is `errors.Is(err, sql.ErrNoRows)`. The string `\"sql: no rows in result set\"` is the `.Error()` text of `sql.ErrNoRows` \u2014 but it is not part of the public API and could change between Go versions.\n\n2. **Wrapped errors:** If any middleware, driver wrapper, or future refactoring wraps the `sql.ErrNoRows` error (e.g., `fmt.Errorf(\"scan failed: %w\", err)`), `err.Error()` will no longer match the literal string, but `errors.Is(err, sql.ErrNoRows)` would still return `true`. 
A wrapped error would fall through to the generic error path and return `(nil, wrappedError)`, causing `overlayDBConfig` to fail with `\"failed to read config from database\"` instead of silently skipping the DB config \u2014 a behavioral regression that would break startup whenever the DB config key is absent.\n\nWhile the current code works today (the string is stable in the standard `database/sql` implementation), this is an API contract violation that creates a latent bug.\n\n---\n\n> Step 1: `overlayDBConfig` (config_db.go:23) calls `store.GetConfig(ctx, \"agentfield.yaml\")`.\n> Step 2: `LocalStorage.GetConfig` (local.go) executes `SELECT ... WHERE key = ?` / `$1`.\n> Step 3: If key is absent, `row.Scan` returns `sql.ErrNoRows`.\n> Step 4: The implementation checks `err.Error() == \"sql: no rows in result set\"` \u2014 a string literal, not `errors.Is(err, sql.ErrNoRows)`.\n> Step 5: If the error is wrapped at any layer (now or in a future refactor), `err.Error()` no longer matches the literal, the condition is false, and the function returns `(nil, fmt.Errorf(\"failed to get config %q: %w\", key, err))`.\n> Step 6: `overlayDBConfig` receives `(nil, nonNilError)`, hits the `if err != nil` branch at line 24, and returns `fmt.Errorf(\"failed to read config from database: %w\", err)`.\n> Step 7: Server startup fails with an error even though no DB config was intended \u2014 a silent regression triggered by any error-wrapping change in the storage stack.\n\n**\ud83d\udca1 Suggested Fix**\n\nReplace the string comparison with `errors.Is`:\n\n```go\nimport (\n \"database/sql\"\n \"errors\"\n)\n\nif errors.Is(err, sql.ErrNoRows) {\n return nil, nil\n}\n```\n\nThis is both idiomatic Go and resilient to error wrapping. 
No behavioral change for the current code path.\n\n---\n*`overlayDBConfig Runtime Execution Trace` \u00b7 confidence 85%*", - "line": 5179, - "path": "control-plane/internal/storage/local.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Config routes registered on unauthenticated `agentAPI` group \u2014 no dedicated auth guard**\n\nThe config storage routes (`GET/PUT/DELETE /api/v1/configs/:key`, `GET /api/v1/configs`, `POST /api/v1/configs/reload`) are registered directly on the `agentAPI` group at line 1553 via `configHandlers.RegisterRoutes(agentAPI)`. The `agentAPI` group itself has **no middleware** \u2014 authentication is only provided by the global `s.Router.Use(middleware.APIKeyAuth(...))` applied at line 881.\n\nThe `APIKeyAuth` middleware has an explicit early-return when the configured key is empty:\n```go\n// No auth configured, allow everything.\nif config.APIKey == \"\" {\n c.Next()\n return\n}\n```\n\nWhen `AGENTFIELD_API_KEY` / `s.config.API.Auth.APIKey` is not set (which is the default in local/dev mode), **every** config endpoint \u2014 including `PUT /api/v1/configs/:key` (write arbitrary config), `DELETE /api/v1/configs/:key`, and `POST /api/v1/configs/reload` \u2014 is fully unauthenticated and accessible to any HTTP client with network access.\n\nContrast this with the comment on line 1550 which says \"admin-authenticated\": this is **misleading** \u2014 no admin token (`AdminTokenAuth`) is enforced here. The connector-facing duplicate at line 1572\u20131578 at least sits behind `ConnectorTokenAuth` + `ConnectorCapabilityCheck`. The `agentAPI`-facing endpoints have no equivalent protection beyond the optional global API key.\n\n---\n\n> Step 1: Global auth is registered at server.go:881 \u2014 `s.Router.Use(middleware.APIKeyAuth(middleware.AuthConfig{APIKey: s.config.API.Auth.APIKey, ...}))`. Step 2: `middleware.APIKeyAuth` (middleware/auth.go:26) returns early with `c.Next()` when `config.APIKey == \"\"`. 
Step 3: `agentAPI` is created at server.go:1164 as `s.Router.Group(\"/api/v1\")` with no middleware of its own. Step 4: `configHandlers.RegisterRoutes(agentAPI)` at server.go:1553 registers `PUT /api/v1/configs/:key`, `DELETE /api/v1/configs/:key`, and `POST /api/v1/configs/reload` directly on that group. Step 5: With default configuration (no API key set), any unauthenticated HTTP request to `PUT /api/v1/configs/some-key` with arbitrary body will write to the config store and return 200 OK.\n\n**\ud83d\udca1 Suggested Fix**\n\nRegister the config routes on a sub-group that requires the admin token middleware, consistent with how other admin-only routes are handled (e.g., the `adminGroup` created at line 1532). Replace:\n```go\n// Config storage routes (admin-authenticated)\n{\n configHandlers := handlers.NewConfigStorageHandlers(s.storage, s.configReloadFn())\n configHandlers.RegisterRoutes(agentAPI)\n}\n```\nwith:\n```go\n// Config storage routes (admin-authenticated)\n{\n cfgAdminGroup := agentAPI.Group(\"\")\n cfgAdminGroup.Use(middleware.AdminTokenAuth(s.config.Features.DID.Authorization.AdminToken))\n configHandlers := handlers.NewConfigStorageHandlers(s.storage, s.configReloadFn())\n configHandlers.RegisterRoutes(cfgAdminGroup)\n}\n```\nAlternatively, reuse the existing `adminGroup` (lines 1532\u20131545) if DID authorization is enabled, but ensure a fallback exists when it is not.\n\n---\n*`Dual Registration of Config Routes` \u00b7 confidence 95%*", - "line": 1550, - "path": "control-plane/internal/server/server.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Goose migration for config_storage omits the updated_at auto-update trigger that equivalent tables have, and GORM autoUpdateTime does not replace it**\n\nTables with `updated_at` columns in the Goose migrations for this codebase are paired with `BEFORE UPDATE` triggers that call `update_updated_at_column()`. 
For example:\n- `workflow_runs` (migration 011) has `CREATE TRIGGER update_workflow_runs_updated_at BEFORE UPDATE ... EXECUTE FUNCTION update_updated_at_column()`\n- `workflow_steps` (migration 011) has the same pattern\n\nMigration `028_create_config_storage.sql` defines `updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()` but **does not create a `BEFORE UPDATE` trigger** to keep `updated_at` current on row modifications.\n\nFor the `SetConfig` raw SQL path (local.go:5138-5147), `updated_at` is manually set by the application code (`updated_at = EXCLUDED.updated_at` where `EXCLUDED.updated_at` is the Go `now` variable). This means correctness depends entirely on every code path that touches `config_storage` explicitly setting `updated_at`. GORM's `autoUpdateTime` tag on `ConfigStorageModel.UpdatedAt` only fires when GORM ORM methods are used; the `SetConfig` / `GetConfig` / `DeleteConfig` implementations bypass GORM entirely and use raw `database/sql` queries.\n\nCurrently `SetConfig` does correctly set `updated_at`, so this is not an active bug. But the lack of a DB-level trigger means:\n1. Any future raw SQL that runs `UPDATE config_storage SET value = ... WHERE key = ...` without explicitly setting `updated_at` will silently leave `updated_at` stale.\n2. 
The schema contract is different from peer tables, making it a maintenance trap for contributors who see the trigger pattern on `workflow_runs` and assume it also exists on `config_storage`.\n\n---\n\n> Step 1: `028_create_config_storage.sql` lines 10-11 declare `updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()` but contain no trigger DDL.\n> Step 2: `011_create_workflow_runs_and_steps.sql` lines 47-54 show the expected pattern: `CREATE TRIGGER update_workflow_runs_updated_at BEFORE UPDATE ON workflow_runs FOR EACH ROW EXECUTE FUNCTION update_updated_at_column()`.\n> Step 3: `SetConfig` in local.go:5137-5147 does manually pass `updated_at = EXCLUDED.updated_at` in the ON CONFLICT clause, so the current implementation is correct.\n> Step 4: However, any future `UPDATE config_storage SET value = $1 WHERE key = $2` without an explicit `updated_at` clause would leave the column stale \u2014 the DB trigger pattern that prevents this on other tables is absent here.\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd a `BEFORE UPDATE` trigger to migration `028_create_config_storage.sql` mirroring the pattern in migration `011`:\n```sql\nCREATE TRIGGER update_config_storage_updated_at\n BEFORE UPDATE ON config_storage\n FOR EACH ROW EXECUTE FUNCTION update_updated_at_column();\n```\nAnd add its DROP to the `-- +goose Down` section. This makes `updated_at` maintenance a DB invariant rather than an application-layer responsibility, consistent with how `workflow_runs` and `workflow_steps` are managed.\n\n---\n*`Dual-Track Schema Management: AutoMigrate vs Goose` \u00b7 confidence 85%*", - "line": 10, - "path": "control-plane/migrations/028_create_config_storage.sql", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] SetConfig accepts arbitrary keys and values with no validation \u2014 allows storing malformed YAML or overwriting critical system keys**\n\n`SetConfig` at config_storage.go:67 accepts any `key` from the URL parameter and any raw body as the value. 
There is no allowlist of permitted keys, no validation that the value is well-formed YAML when the key implies a YAML config file, and no protection against overwriting critical system keys.\n\nSpecific concerns:\n1. **Key `agentfield.yaml`** can be written with arbitrary content. When loaded via `overlayDBConfig`, a YAML parse error at `config_db.go:37` only returns a warning \u2014 the server does not crash but the config is partially loaded in an inconsistent state.\n2. **Arbitrary key injection**: An attacker can store keys like `../../../../etc/passwd` \u2014 while the storage layer likely sanitizes this, there is no explicit check in the handler.\n3. **No content-type enforcement**: The handler accepts any body as a raw string regardless of content type. The comment says \"Accepts raw YAML/text body\" but this is not validated.\n4. The `updatedBy` field at line 80-83 is taken directly from the `X-Updated-By` header with no sanitization \u2014 this is stored in the audit log and could be used for log injection.\n\n---\n\n> Step 1: `PUT /api/v1/configs/` calls `SetConfig` at config_storage.go:67.\n> Step 2: `key := c.Param(\"key\")` at line 68 \u2014 raw URL parameter, no validation.\n> Step 3: `body, err := io.ReadAll(c.Request.Body)` at line 70 \u2014 reads entire body as-is.\n> Step 4: `h.storage.SetConfig(ctx, key, string(body), updatedBy)` at line 85 \u2014 stores without validation.\n> Step 5: `updatedBy := c.GetHeader(\"X-Updated-By\")` at line 80 \u2014 user-controlled string stored in DB audit field.\n\n**\ud83d\udca1 Suggested Fix**\n\n1. Add an allowlist of permitted config keys (e.g., only `agentfield.yaml` or a predefined set), or at minimum validate the key does not contain path traversal characters.\n2. Validate that the body is valid YAML when the key ends in `.yaml` before persisting it.\n3. Sanitize the `X-Updated-By` header value (strip control characters, limit length).\n4. 
Return a clear error if the key is not in the allowlist.\n\n---\n*`Config CRUD Route Authorization Gap` \u00b7 confidence 82%*", - "line": 67, - "path": "control-plane/internal/handlers/config_storage.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] DeleteConfig handler returns 404 for all storage errors, including 500-class failures**\n\nThe `DeleteConfig` HTTP handler at line 106-108 responds with `http.StatusNotFound` (404) for **any** error returned by `storage.DeleteConfig`:\n\n```go\nif err := h.storage.DeleteConfig(c.Request.Context(), key); err != nil {\n c.JSON(http.StatusNotFound, gin.H{\"error\": err.Error()})\n return\n}\n```\n\nHowever, the storage implementation (`local.go:5235-5244`) can return two distinct error categories:\n- A not-found sentinel: `fmt.Errorf(\"config %q not found\", key)` when `RowsAffected() == 0`\n- A database execution error: `fmt.Errorf(\"failed to delete config %q: %w\", key, err)` for actual DB failures\n\nMapping a database-level error (connection failure, disk full, constraint violation) to 404 is semantically incorrect and will mislead API clients and operators. A DB failure should produce 500 Internal Server Error.\n\n---\n\n> Step 1: `DeleteConfig` in local.go:5235 executes `DELETE FROM config_storage WHERE key = ?`.\n> Step 2: If `db.ExecContext` returns an error (network, disk, constraint), local.go:5237-5239 returns `fmt.Errorf(\"failed to delete config %q: %w\", key, err)`.\n> Step 3: If `RowsAffected() == 0`, local.go:5242 returns `fmt.Errorf(\"config %q not found\", key)`.\n> Step 4: The handler at config_storage.go:107 maps BOTH error types to `http.StatusNotFound` (404).\n> Step 5: A database execution failure will be surfaced to the API client as a 404, concealing the real 5xx nature of the error.\n\n**\ud83d\udca1 Suggested Fix**\n\nDistinguish between not-found and server errors. 
One approach is to check the error message or define a sentinel type in the storage layer:\n```go\nif err := h.storage.DeleteConfig(c.Request.Context(), key); err != nil {\n // Check if it's a not-found error vs. a storage failure\n if strings.Contains(err.Error(), \"not found\") {\n c.JSON(http.StatusNotFound, gin.H{\"error\": err.Error()})\n } else {\n c.JSON(http.StatusInternalServerError, gin.H{\"error\": err.Error()})\n }\n return\n}\n```\nA cleaner solution is to define a typed `ErrNotFound` sentinel in the storage package and use `errors.Is` in the handler.\n\n---\n*`StorageProvider Interface Implementation Completeness` \u00b7 confidence 92%*", - "line": 104, - "path": "control-plane/internal/handlers/config_storage.go", - "side": "RIGHT" - }, - { - "body": "\u26aa **[NITPICK] Redundant index on config_storage(key): the UNIQUE constraint already implies a unique index**\n\nThe Goose migration defines `key TEXT NOT NULL UNIQUE` on line 5 (which in PostgreSQL automatically creates a unique B-tree index on `key`) and then explicitly creates `CREATE INDEX IF NOT EXISTS idx_config_storage_key ON config_storage(key)` on line 14. The explicit non-unique index on `key` is redundant because PostgreSQL will always prefer the unique index for lookups on that column.\n\nThis is a minor inefficiency: two indexes occupy storage and must be updated on every INSERT/UPDATE/DELETE to `config_storage`. 
The duplicate won't cause incorrect behavior, but it wastes space and write amplification.\n\n---\n\n> Step 1: `028_create_config_storage.sql` line 5 defines `key TEXT NOT NULL UNIQUE`.\n> Step 2: PostgreSQL documentation states a UNIQUE constraint automatically creates a unique B-tree index on the constrained column(s), which can be used for point lookups just as a regular index can.\n> Step 3: Line 14 then creates a separate non-unique index `idx_config_storage_key ON config_storage(key)`, duplicating coverage already provided by the unique constraint index.\n\n**\ud83d\udca1 Suggested Fix**\n\nRemove the explicit `CREATE INDEX IF NOT EXISTS idx_config_storage_key ON config_storage(key)` from the `-- +goose Up` section and its corresponding `DROP INDEX` from `-- +goose Down`. The UNIQUE constraint already provides an index suitable for all single-column equality lookups on `key`.\n\n---\n*`Dual-Track Schema Management: AutoMigrate vs Goose` \u00b7 confidence 95%*", - "line": 14, - "path": "control-plane/migrations/028_create_config_storage.sql", - "side": "RIGHT" - }, - { - "body": "\u26aa **[NITPICK] `fmt.Println`/`fmt.Printf` used for logging instead of the structured logger**\n\nBoth the not-found path (line 28) and the success path (line 47) log via `fmt.Println` / `fmt.Printf` rather than the project's structured logger (`zerolog`).\n\nThe CLAUDE.md project guidance specifies:\n> Use zerolog for structured logging: `logger.Logger.Info().Msg(\"message\")`\n\nUsing `fmt.Print*` here:\n- Bypasses log-level filtering (these messages always appear, even in production with `LOG_LEVEL=warn`)\n- Produces unstructured output that cannot be parsed by log aggregation systems\n- Is inconsistent with the rest of the control-plane codebase\n\nThis is a style/maintainability issue, not a correctness bug.\n\n---\n\n> Line 28: `fmt.Println(\"[config] No database config found (key: agentfield.yaml), using file/env config only.\")`\n> Line 47: `fmt.Printf(\"[config] Loaded 
config from database (key: %s, version: %d, updated: %s)\\n\", ...)`\n> Both bypass zerolog, the structured logger used throughout the rest of the control-plane (per CLAUDE.md and observed usage in other files).\n\n**\ud83d\udca1 Suggested Fix**\n\nReplace `fmt.Println` / `fmt.Printf` with the zerolog structured logger:\n\n```go\nimport \"github.com/Agent-Field/agentfield/control-plane/internal/logger\"\n\n// not-found path:\nlogger.Logger.Info().Str(\"key\", dbConfigKey).Msg(\"No database config found, using file/env config only\")\n\n// success path:\nlogger.Logger.Info().\n Str(\"key\", entry.Key).\n Int(\"version\", entry.Version).\n Time(\"updated\", entry.UpdatedAt).\n Msg(\"Loaded config from database\")\n```\n\n---\n*`overlayDBConfig Runtime Execution Trace` \u00b7 confidence 95%*", - "line": 28, - "path": "control-plane/internal/server/config_db.go", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] Version increment is application-enforced only; no DB-level constraint prevents version regression or skipping**\n\nThe `ConfigStorageModel.Version` field is declared with `gorm:\"column:version;not null;default:1\"` and the auto-increment is implemented purely in application SQL via `version = config_storage.version + 1` in `SetConfig` (local.go:5143, 5156). Neither the GORM model nor the Goose migration adds a `CHECK (version > 0)` constraint or a sequence-based mechanism.\n\nThis means:\n1. Any code path that uses GORM ORM methods directly (e.g., `db.Save(&ConfigStorageModel{..., Version: 0, ...})`) will set version to 0 or any arbitrary value, bypassing the increment logic.\n2. The `version` field comment says it is for \"audit trail\" (models.go:478), but without a monotonically-increasing guarantee at the DB level, audit integrity can be violated silently.\n\nThis is a suggestion rather than a critical issue because currently all writes go through the raw-SQL `SetConfig` which correctly increments. 
But the model struct exposes `Version int` as a writable field, and future GORM-based code would not benefit from the increment.\n\n---\n\n> Step 1: `ConfigStorageModel.Version` is `int` with `gorm:\"column:version;not null;default:1\"` (models.go:483) \u2014 no GORM constraint prevents setting it to any value.\n> Step 2: `SetConfig` increments via `version = config_storage.version + 1` in the ON CONFLICT clause (local.go:5143, 5156) \u2014 this is correct.\n> Step 3: But any direct GORM call like `gormDB.Save(&ConfigStorageModel{Key: \"k\", Value: \"v\", Version: 0})` would set version to 0, no DB constraint prevents it.\n> Step 4: `028_create_config_storage.sql` line 7 defines `version INTEGER NOT NULL DEFAULT 1` with no CHECK constraint.\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd a `CHECK (version >= 1)` constraint in migration `028_create_config_storage.sql`:\n```sql\nversion INTEGER NOT NULL DEFAULT 1 CHECK (version >= 1),\n```\nThis at minimum prevents accidental version-0 writes. For a stronger audit guarantee, document that GORM's ORM Save/Create methods should never be used directly on `ConfigStorageModel`; only `SetConfig`/`DeleteConfig` are the sanctioned write paths.\n\n---\n*`Dual-Track Schema Management: AutoMigrate vs Goose` \u00b7 confidence 75%*", - "line": 483, - "path": "control-plane/internal/storage/models.go", - "side": "RIGHT" - }, - { - "body": "\u26aa **[NITPICK] Verified: no path conflict and no 503 regression from second `configHandlers` instantiation**\n\nThe two `configHandlers` declarations are in separate block scopes (lines 1551\u20131555 and 1575\u20131578) with no shadowing of a shared variable. 
They register routes on distinct base paths:\n\n- First: `agentAPI` \u2192 `/api/v1/configs/...`\n- Second: `configGroup` (= `connectorGroup.Group(\"\")` = `agentAPI.Group(\"/connector\")`) \u2192 `/api/v1/connector/configs/...`\n\nGin's router tree separates these cleanly \u2014 no duplicate-path panic occurs.\n\nThe `:key` parameter name is identical in both registrations (both call the same `RegisterRoutes` method), but since they live in different router-tree path segments (`/configs` under `/api/v1` vs `/configs` under `/api/v1/connector`), there is no wildcard conflict.\n\nBoth calls pass `s.configReloadFn()` which evaluates `os.Getenv(\"AGENTFIELD_CONFIG_SOURCE\")` at setup time and returns either `nil` or a valid reload closure. The connector-facing reload endpoint will return 503 only when the env var is not `\"db\"` \u2014 **exactly the same behavior** as the `agentAPI`-facing endpoint. There is no regression here.\n\nThe variable name reuse (`configHandlers`) inside separate Go block scopes (`{ }`) is cosmetically confusing but harmless \u2014 Go's scoping rules guarantee no aliasing.\n\n---\n\n> Step 1: `agentAPI` base path = `/api/v1` (server.go:1164). Step 2: `connectorGroup = agentAPI.Group(\"/connector\")` \u2192 base `/api/v1/connector` (server.go:1559). Step 3: `configGroup = connectorGroup.Group(\"\")` \u2192 still `/api/v1/connector` (server.go:1573). Step 4: `RegisterRoutes` registers identical relative paths (`/configs`, `/configs/:key`, `/configs/reload`) on both groups, yielding `/api/v1/configs/...` and `/api/v1/connector/configs/...` \u2014 distinct full paths. Step 5: Both `NewConfigStorageHandlers` calls at lines 1552 and 1576 invoke `s.configReloadFn()` which is the same method returning equivalent closures (or nil). 
No behavioral divergence.\n\n**\ud83d\udca1 Suggested Fix**\n\nConsider renaming the inner `configHandlers` to `connectorConfigHandlers` for clarity, even though the current code is functionally correct:\n```go\nconnectorConfigHandlers := handlers.NewConfigStorageHandlers(s.storage, s.configReloadFn())\nconnectorConfigHandlers.RegisterRoutes(configGroup)\n```\n\n---\n*`Dual Registration of Config Routes` \u00b7 confidence 98%*", - "line": 1572, - "path": "control-plane/internal/server/server.go", - "side": "RIGHT" - } - ], - "event": "REQUEST_CHANGES" - }, - "review_id": "rev_4840f78ef080", - "summary": { - "adversary_challenged": 8, - "adversary_confirmed": 9, - "ai_generated_confidence": 0.8, - "budget_exhausted": true, - "by_severity": { - "critical": 3, - "important": 12, - "nitpick": 3, - "suggestion": 2 - }, - "cost_usd": 0, - "coverage_iterations": 0, - "cross_ref_interactions": 0, - "dimensions_run": 6, - "duration_seconds": 1933.747, - "total_findings": 20 - } - }, - "started_at": "2026-03-10T16:55:06Z", - "completed_at": "2026-03-10T17:27:35Z", - "duration_ms": 1948494, - "webhook_registered": false -} diff --git a/benchmark/truenas-middleware-18291/EVALUATION.md b/benchmark/truenas-middleware-18291/EVALUATION.md deleted file mode 100644 index 11fc88d..0000000 --- a/benchmark/truenas-middleware-18291/EVALUATION.md +++ /dev/null @@ -1,522 +0,0 @@ -# LLM-as-a-Judge Evaluation: Automated PR Review Systems -## truenas/middleware PR #18291 — ZFS Dataset Encryption Refactor - -**Evaluation date**: 2026-03-10 -**Evaluator**: LLM-as-a-Judge (structured rubric) -**Systems compared**: PR-AF + Kimi k2.5, PR-AF + Sonnet 4.6, Claude Code (claude[bot]) -**Architecture note**: Both PR-AF runs use the same v2 meta-selector pipeline. This document evaluates model choice, not architecture version. 
-**Companion data**: `pr-af-result-kimi.json` (Kimi), `pr-af-result-sonnet.json` (Sonnet), `claude-code-inline-comments.json`, `claude-code-reviews.json` (same directory) - ---- - -## 1. Executive Summary - -Three automated PR review systems were evaluated against truenas/middleware PR #18291, a high-risk refactor replacing py-libzfs with truenas_pylibzfs across encryption key management, KMIP key sync, pool/dataset creation, and failover unlock paths. - -**Sonnet 4.6 is the strongest overall reviewer.** It found the hardest bug in the dataset (the `k in existing_datasets` type mismatch that silently wipes the KMIP cache), discovered a novel runtime crash nobody else caught (missing `ds['id']` argument in `datastore.update`), and correctly investigated and ruled out a false alarm that Claude Code flagged as critical. Its 14 findings had zero adversary challenges, indicating high precision. - -**Kimi k2.5 found the highest-scoring individual finding** (method name shadowing causing infinite recursion, score 1.852) and produced the broadest coverage at 25 findings across 8 dimensions. However, 7 of those findings were adversary-challenged, and it missed both the KMIP cache wipe bug and the novel datastore crash. - -**Claude Code** operates in a fundamentally different regime: near-instant, single-agent, inline comments. It caught CC-1 (decorator dispatch crash) that both multi-agent systems missed, and CC-4 (KMIP cache wipe) that Kimi missed. Its value is speed and GitHub-native integration, not depth. - -**No system caught everything.** The decorator dispatch crash (CC-1) was found only by Claude Code. The method shadowing bug was found only by Kimi. The novel datastore argument bug was found only by Sonnet. This is the central finding: complementary coverage, not dominance. 
- -| System | Findings | Duration | Critical Bugs Found | Novel Bugs | Adversary Challenges | -|---|---|---|---|---|---| -| PR-AF + Kimi k2.5 | 25 | ~19 min | 6 labeled critical | 2 unique | 7 challenged (28%) | -| PR-AF + Sonnet 4.6 | 14 | ~35 min | 2 labeled critical | 3 unique | 0 challenged (0%) | -| Claude Code | ~6 automated | Near-instant | 2 critical flagged | 0 unique | N/A | - ---- - -## 2. Methodology - -### 2.1 What Was Compared - -All three systems reviewed the same PR diff. PR-AF runs used identical pipeline architecture (v2 meta-selectors: intake -> anatomy -> meta_selectors -> review -> adversary -> cross_ref -> coverage -> synthesis -> output). The only variable between the two PR-AF runs is the underlying LLM: Kimi k2.5 vs Claude Sonnet 4.6. - -Claude Code is a single-agent GitHub App that reads the diff and produces inline comments. It is included as a baseline representing the current state of production automated review. - -### 2.2 Ground Truth - -Ground truth was established by cross-referencing all findings across systems and identifying bugs confirmed by multiple independent systems or by explicit code analysis. The confirmed bug set used for recall scoring: - -1. **CC-1**: `@pass_thread_local_storage` dispatch crash in `sync_zfs_keys` -2. **CC-2**: `ZFSKeyFormat` enum comparison always False -3. **CC-3**: `pbkdf2iters` minimum inconsistency across option classes -4. **CC-4**: `k in existing_datasets` type mismatch silently wipes KMIP cache -5. **Method shadowing**: `check_key` name shadows imported function, causing infinite recursion -6. **Duplicate export**: `PoolRemoveArgs` appears twice in `__all__` -7. **Missing argument**: `ds['id']` missing from `datastore.update` call -8. **Exception contract**: Broad `Exception` catch masks `ZFSNotEncryptedException` -9. **TOCTOU**: Race condition in `load_key()` - -This is a 9-bug ground truth set. No system found all 9. 
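
Recall against this set is the fraction of the nine bugs a reviewer reported. The sketch below is a minimal illustration only: the short bug identifiers and the `example_reviewer` set are hypothetical and are not taken from any of the three systems' actual results.

```python
# Hypothetical short identifiers for the nine ground-truth bugs listed above.
GROUND_TRUTH = frozenset({
    "CC-1", "CC-2", "CC-3", "CC-4",
    "method-shadowing", "duplicate-export", "missing-argument",
    "exception-contract", "toctou",
})


def recall(found):
    """Return the fraction of ground-truth bugs present in `found`.

    Findings that are not in the ground-truth set are ignored here;
    they affect precision, not recall.
    """
    return len(GROUND_TRUTH & set(found)) / len(GROUND_TRUTH)


# Hypothetical reviewer: three real bugs plus one finding outside the set.
example_reviewer = {"CC-1", "CC-3", "toctou", "not-a-real-bug"}
print(f"{recall(example_reviewer):.0%}")  # 33%
```

Because the set is built from the union of all findings, a recall computed this way can only reward bugs that at least one system surfaced.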
- -### 2.3 Scoring Rubric - -Five criteria, weighted: - -| Criterion | Weight | Description | -|---|---|---| -| Recall | 30% | Fraction of ground-truth bugs found | -| Precision | 25% | Fraction of findings that are real bugs (not noise) | -| Evidence quality | 20% | Specificity of reasoning, code references, impact analysis | -| Severity calibration | 15% | Critical bugs labeled critical; suggestions not over-elevated | -| Breadth | 10% | Coverage across multiple risk dimensions | - -### 2.4 Limitations - -- Ground truth is constructed post-hoc from the union of all findings. Bugs that all systems missed cannot be scored. -- Kimi's budget was exhausted by duration (19 min cap), meaning some planned phases may have been truncated. -- Sonnet's budget was also exhausted by duration (35 min cap), but it ran longer and produced fewer findings, suggesting more deliberate analysis per finding. -- Claude Code's inline comments mix automated (claude[bot]) and human (yocalebo) reviewer comments. Only claude[bot] comments are scored here. -- The adversary phase for Sonnet ran but produced zero challenges. This could mean Sonnet's findings are genuinely solid, or that the adversary agent was under-resourced in that run. - ---- - -## 3. The PR Under Review - -**truenas/middleware PR #18291** replaces py-libzfs with truenas_pylibzfs as the Python ZFS binding across the TrueNAS middleware stack. 
The refactor touches: - -- `dataset_encryption_operations.py`: encryption key management, load/unload, change key -- `kmip_operations.py`: KMIP key sync (push/pull ZFS keys to/from KMIP server) -- `pool_dataset.py`: pool and dataset creation, option validation -- Failover unlock paths - -This is a high-risk refactor because: (a) it changes the exception hierarchy (new library throws different exception types), (b) it changes method signatures in some cases, (c) encryption key management bugs can cause data loss or silent security failures, and (d) KMIP integration bugs can corrupt the key sync state silently. - -The PR touches 8+ files, is non-trivial in scope, and the new library's behavior differences from py-libzfs are not fully documented in the diff. - ---- - -## 4. Reviewer Profiles - -### 4.1 PR-AF + Kimi k2.5 - -**Architecture**: v2 meta-selector pipeline, 9 phases, 20 agent invocations -**Duration**: ~1122 seconds (~19 minutes), budget exhausted -**Output**: 25 findings across 8 analysis dimensions -**Severity distribution**: critical=6, important=10, suggestion=9 -**Adversary results**: 7 challenged, 3 confirmed, 15 no adversary result -**Average finding score**: 0.524 -**Peak finding score**: 1.852 (method shadowing bug) - -Kimi operates as a high-volume, broad-coverage reviewer. It generates more findings than Sonnet and covers more distinct dimensions (8 vs 6). The adversary phase challenged 28% of its findings, which is a meaningful false-positive signal. Three findings were confirmed by the adversary; the seven challenged findings remain unresolved, and the remaining 15 received no adversary result. The high peak score on the method shadowing finding reflects genuine depth on that specific bug.
- -### 4.2 PR-AF + Sonnet 4.6 - -**Architecture**: v2 meta-selector pipeline, 9 phases, 11 agent invocations -**Duration**: ~2100 seconds (~35 minutes), budget exhausted -**Output**: 14 findings across 6 analysis dimensions -**Severity distribution**: critical=2, important=9, suggestion=2, nitpick=1 -**Adversary results**: 0 challenged, 0 confirmed -**Average finding score**: 0.611 -**Peak finding score**: 0.97 (KMIP cache wipe bug) - -Sonnet operates as a precision-focused reviewer. It produces fewer findings but with higher average score and zero adversary challenges. The 0-challenge adversary result is notable: either Sonnet's findings are genuinely solid (supported by the fact that its top two findings are confirmed critical bugs), or the adversary agent was under-resourced. Given that Sonnet's top findings include a novel bug nobody else caught, the former explanation is more credible. - -Sonnet used fewer agent invocations (11 vs 20) despite running nearly twice as long. This suggests longer per-invocation reasoning rather than more parallel exploration. - -### 4.3 Claude Code (claude[bot]) - -**Architecture**: Single-agent GitHub App, reads diff, produces inline comments -**Duration**: Near-instant (seconds) -**Output**: ~6 automated inline comments (claude[bot] only; yocalebo comments excluded) -**Adversary**: None (single-agent, no pipeline) - -Claude Code is the production baseline. It operates at a fundamentally different cost and latency point. Its value is immediate feedback on the diff without any pipeline overhead. It caught CC-1 (decorator dispatch crash) and CC-4 (KMIP cache wipe) that Kimi missed entirely. It did not find the method shadowing bug, the novel datastore argument bug, or the exception contract violations that the multi-agent systems found. - ---- - -## 5. Cross-System Coverage Matrix - -The following matrix maps each confirmed bug to which system found it. 
- | Bug | Kimi k2.5 | Sonnet 4.6 | Claude Code | -|---|---|---|---| -| CC-1: Decorator dispatch crash (`@pass_thread_local_storage`) | NO | NO (investigated, ruled not a bug) | YES | -| CC-2: Enum comparison always False (`ZFSKeyFormat`) | NO | YES (finding #3, score 0.686) | YES | -| CC-3: `pbkdf2iters` minimum inconsistency | YES (findings #6, #7, #8) | YES (findings #5, #7, #11) | YES | -| CC-4: `k in existing_datasets` type mismatch, KMIP cache wipe | NO | YES (finding #1, score 0.97) | YES | -| Method shadowing / infinite recursion | YES (finding #1, score 1.852) | NO | NO | -| Duplicate export `PoolRemoveArgs` in `__all__` | YES (finding #3) | NO | NO | -| Missing `ds['id']` in `datastore.update` | NO | YES (finding #2, score 0.95) | NO | -| Exception contract violations / broad `Exception` catch | YES (findings #2, #11, #12, #13) | YES (findings #4, #8, #9, #10) | NO | -| TOCTOU race condition in `load_key()` | YES (finding #5) | NO | NO | - -**Recall summary** (counting the YES cells in the matrix above): -- Kimi: found 5 of 9 ground-truth bugs (56%) -- Sonnet: found 5 of 9 ground-truth bugs (56%) -- Claude Code: found 4 of 9 ground-truth bugs (44%) - -Both PR-AF systems achieve the same raw recall, but on different subsets of bugs. This is the most important finding in the matrix: the two systems are complementary, not redundant. - ---- - -## 6.
Finding-by-Finding Comparison - -### 6.1 PR-AF + Kimi k2.5 — All 25 Findings - -| # | Severity | Score | Status | Summary | -|---|---|---|---|---| -| 1 | critical | 1.852 | CONFIRMED+CROSSREF | Method name shadows imported function, causing infinite recursion in `dataset_encryption_operations.py` | -| 2 | important | 1.092 | CONFIRMED+CROSSREF | `sync_db_keys()` marks non-encrypted datasets for removal due to broad `Exception` catch | -| 3 | critical | 1.0 | — | Duplicate export: `PoolRemoveArgs` appears twice in `__all__` | -| 4 | important | 0.892 | CROSSREF | Missing hex validation on encryption keys before database storage | -| 5 | important | 0.787 | CROSSREF | TOCTOU race condition in `load_key()` | -| 6 | important | 0.63 | — | Breaking API change: `pbkdf2iters` minimum raised from 100,000 to 1,300,000 | -| 7 | important | 0.63 | — | Breaking API change: `PoolDatasetChangeKeyOptions.pbkdf2iters` minimum raised | -| 8 | important | 0.595 | — | `from_previous` silently modifies `pbkdf2iters` without notification | -| 9 | important | 0.49 | — | Hardcoded minimum prevents users from choosing lower security settings | -| 10 | critical | 0.475 | CHALLENGED | Malformed hex key causes confusing 'Missing key' error | -| 11 | critical | 0.475 | CHALLENGED | KMIP `push_zfs_keys()` crashes when `check_key()` raises `ZFSNotEncryptedException` | -| 12 | critical | 0.475 | CHALLENGED | KMIP `pull_zfs_keys()` crashes when `check_key()` raises `ZFSNotEncryptedException` | -| 13 | critical | 0.475 | CHALLENGED | Generic `Exception` catching masks `ZFSNotEncryptedException` | -| 14 | suggestion | 0.38 | CONFIRMED+CROSSREF | Key file validation uses different hex parsing logic than unlock path | -| 15 | suggestion | 0.337 | CROSSREF | Silent failure when hex decoding fails during unlock | -| 16 | suggestion | 0.315 | CROSSREF | No database-level constraints on `encryption_key` column | -| 17 | important | 0.297 | CHALLENGED | Silent hex conversion failure preserves invalid 
string | -| 18 | important | 0.297 | CHALLENGED | Broad `Exception` catch masks `ZFSNotEncryptedException` as 'invalid key' | -| 19 | important | 0.28 | CHALLENGED | Malformed hex keys cause unnecessary key removal during sync | -| 20 | suggestion | 0.27 | CROSSREF | Missing key validation before load in `unlock()` | -| 21 | suggestion | 0.27 | CROSSREF | Staleness of `check_key()` result in `pull_zfs_keys` | -| 22 | suggestion | 0.225 | — | Significant performance impact from increased PBKDF2 iterations | -| 23 | suggestion | 0.195 | — | Missing key existence check in `from_previous` migration method | -| 24 | suggestion | 0.195 | — | Missing key existence check in `PoolDatasetChangeKeyOptions.from_previous` | -| 25 | suggestion | 0.18 | — | Key validation without subsequent load in `push_zfs_keys` | - -**Adversary breakdown**: Findings #10, #11, #12, #13, #17, #18, #19 were challenged. Of these, none received a "confirmed" adversary result — they remain in a challenged/unresolved state. Findings #1, #2, #14 were confirmed by the adversary and cross-referenced. 
- -### 6.2 PR-AF + Sonnet 4.6 — All 14 Findings - -| # | Severity | Score | Status | Summary | -|---|---|---|---|---| -| 1 | critical | 0.97 | — | `zfs_keys` cache silently wiped: `k in existing_datasets` checks string against list-of-dicts, always False | -| 2 | critical | 0.95 | — | Missing `ds['id']` argument in `datastore.update` call — wrong argument count, guaranteed runtime crash | -| 3 | important | 0.686 | — | Old guard was always False: key-encrypted child under passphrase-root inheritance never blocked (enum comparison bug) | -| 4 | important | 0.665 | — | `ZFSKeyAlreadyLoadedException` and `ZFSNotEncryptedException` silently swallowed as string errors | -| 5 | important | 0.665 | — | `from_previous` fires on write only; legacy API callers have `pbkdf2iters` silently upgraded to 1,300,000 | -| 6 | important | 0.644 | — | `sync_db_keys` lock lambda embeds full args list, causing inconsistent lock keys | -| 7 | important | 0.644 | — | Existing passphrase-encrypted datasets silently re-keyed at 3.7x higher iteration count on next change | -| 8 | important | 0.63 | — | Custom ZFS exceptions inherit from plain `Exception` instead of `CallError`, breaking structured error propagation | -| 9 | important | 0.574 | — | `ZFSNotEncryptedException` from `change_key()` propagates as raw `Exception` to WebSocket API layer | -| 10 | important | 0.56 | — | Raw `truenas_pylibzfs.ZFSException` from `crypto.load_key()` propagates out of `encryption.load_key()` | -| 11 | important | 0.525 | — | 3.7x PBKDF2 iteration increase enforced with no hardware capability check | -| 12 | suggestion | 0.294 | — | No double-injection bug: explicit TLS passing is correct for direct calls (CC-1 investigated, ruled out) | -| 13 | suggestion | 0.285 | — | No test covers the newly-enforced rejection path | -| 14 | nitpick | 0.097 | — | Original TLS-injection concern is a false alarm: decorator order is correct (CC-1 re-investigated) | - -**Adversary breakdown**: Zero findings challenged. 
All 14 passed the adversary phase without challenge. - -**Notable**: Findings #12 and #14 are explicit investigations of the CC-1 concern (decorator dispatch crash). Sonnet analyzed the `@pass_thread_local_storage` pattern and concluded that TLS is explicitly passed in the direct call path, making the dispatch crash a non-issue in the current code. This is a judgment call — Claude Code flagged it as critical. Sonnet's reasoning may be correct for the specific call site analyzed, or it may have missed a different call path where the crash occurs. - -### 6.3 Claude Code (claude[bot]) — Key Automated Findings - -| Label | Severity | Summary | -|---|---|---| -| CC-1 | Critical | `@pass_thread_local_storage` dispatch crash: `sync_zfs_keys` calls `push_zfs_keys(ids)` and `pull_zfs_keys()` directly, bypassing middleware dispatch, wrong arg binding | -| CC-2 | Critical | `ZFSKeyFormat(val) == ZFSKeyFormat.PASSPHRASE.value` compares enum instance to string, always False | -| CC-3 | Important | PR raises `pbkdf2iters` minimum to 1.3M in `pool_dataset` but leaves `PoolCreateEncryptionOptions` with old value | -| CC-4 | Critical | `k in existing_datasets` where k is str and `existing_datasets` is list[dict], always False, silently wipes KMIP cache | - -Claude Code also produced pattern/naming observations (open_handle pattern, docstrings, method behavior) that are minor and not scored here. - ---- - -## 7. Critical Misses Analysis - -### 7.1 CC-1: Decorator Dispatch Crash (Found only by Claude Code) - -`sync_zfs_keys` calls `push_zfs_keys(ids)` and `pull_zfs_keys()` directly. Both functions are decorated with `@pass_thread_local_storage`, which is designed to inject `tls` via middleware dispatch. A direct call bypasses this injection, causing wrong argument binding and a crash. - -**Why Kimi missed it**: Kimi's analysis focused on exception handling and hex validation patterns. The decorator injection mechanism was not in any of its 8 analysis dimensions. 
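
The CC-1 mechanism described above can be illustrated with a minimal sketch. Everything here is hypothetical: `pass_tls`, `dispatch`, `push_keys`, and `pull_keys` are stand-ins invented for illustration, not the middleware's actual decorator or dispatcher. It assumes a marker-style decorator whose `tls` argument is supplied only by a dispatcher, and shows what a direct call binds instead.

```python
import threading

_tls = threading.local()  # stand-in for the middleware's thread-local storage


def pass_tls(func):
    """Hypothetical marker decorator: flags func so a dispatcher injects tls."""
    func._pass_tls = True
    return func


def dispatch(func, *args):
    """Hypothetical dispatcher: prepends tls for flagged functions."""
    if getattr(func, "_pass_tls", False):
        return func(_tls, *args)
    return func(*args)


@pass_tls
def push_keys(tls, ids=None):
    # Report what each parameter was actually bound to.
    return {"tls_is_storage": tls is _tls, "ids": ids}


@pass_tls
def pull_keys(tls):
    return tls is _tls


via_dispatch = dispatch(push_keys, ["tank/a"])   # tls injected, ids bound correctly
direct_wrong = push_keys(["tank/a"])             # the list binds to tls, ids stays None
direct_explicit = push_keys(_tls, ["tank/a"])    # caller passes tls by hand

try:
    pull_keys()                                  # no tls supplied at all
    direct_missing = None
except TypeError as exc:
    direct_missing = type(exc).__name__
```

Under these assumptions, whether a direct call misbehaves depends entirely on whether the caller passes `tls` by hand, which is the point on which Claude Code and Sonnet disagree.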
- -**Why Sonnet missed it (sort of)**: Sonnet explicitly investigated this concern (findings #12 and #14) and concluded it is not a bug because TLS is explicitly passed in the direct call path. This is a substantive disagreement with Claude Code's assessment. One of them is wrong. Without running the code, the evaluation cannot definitively resolve this — but the fact that Sonnet investigated and made a reasoned judgment is itself valuable signal. - -**Implication**: If CC-1 is a real bug, both multi-agent systems failed to catch a critical crash. If Sonnet's analysis is correct and CC-1 is a false alarm, then Claude Code has a false positive and Sonnet correctly ruled it out. - -### 7.2 CC-4: KMIP Cache Wipe (Missed by Kimi, found by Sonnet and Claude Code) - -`k in existing_datasets` where `k` is a string (dataset ID) and `existing_datasets` is a list of dicts. The `in` operator on a list checks for element equality, not key membership. A string is never equal to a dict, so this check always returns False. The result: every push/pull cycle wipes the `zfs_keys` cache, treating all datasets as new. - -This is a pre-existing bug that the PR did not introduce but also did not fix. It is subtle because the code looks plausible at a glance — the variable name `existing_datasets` suggests it should contain dataset identifiers, not dicts. - -**Why Kimi missed it**: Kimi's analysis of KMIP operations focused on exception handling (findings #11, #12) and key validation. The type mismatch in the cache lookup was not surfaced. - -**Why Sonnet found it**: Sonnet's top finding (score 0.97) is precisely this bug. The analysis correctly identifies the type mismatch and its consequence (cache always wiped). This is the hardest bug in the dataset to find because it requires understanding both the data structure of `existing_datasets` and the semantics of Python's `in` operator on lists vs dicts. 
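
The membership semantics behind CC-4 can be reproduced in a few lines. This is a minimal sketch, not the middleware's code: the dataset names and key values are made up, and it assumes each record in `existing_datasets` carries its identifier under an `id` key, as the fix quoted in Section 9.2 implies.

```python
# Hypothetical data: records as a list of dicts, cache keyed by dataset ID.
existing_datasets = [{"id": "tank/secure"}, {"id": "tank/backups"}]
zfs_keys = {"tank/secure": "key-a", "tank/gone": "key-b"}

# Buggy filter: `in` on a list compares each element for equality, and a
# str never equals a dict, so every cached key is dropped.
buggy = {k: v for k, v in zfs_keys.items() if k in existing_datasets}

# One possible fix: test membership against a set of IDs instead.
existing_ids = {d["id"] for d in existing_datasets}
fixed = {k: v for k, v in zfs_keys.items() if k in existing_ids}
```

With the list-of-dicts check the filtered cache is always empty; with the set of IDs only the stale entry (`tank/gone` in this made-up data) is dropped.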
- -### 7.3 Method Shadowing / Infinite Recursion (Found only by Kimi) - -A method named `check_key` in `dataset_encryption_operations.py` shadows an imported function also named `check_key`. When the method calls `check_key(...)`, it calls itself recursively rather than the imported function, causing infinite recursion. - -This is Kimi's highest-scoring finding (1.852) and was confirmed by the adversary phase and cross-referenced. It is a genuine critical bug. - -**Why Sonnet missed it**: Sonnet's analysis dimensions did not include name shadowing or import resolution. Its focus on exception handling, type mismatches, and API contracts left this category uncovered. - -**Why Claude Code missed it**: Single-agent diff review is unlikely to catch name shadowing without explicit analysis of import resolution. - -### 7.4 Missing `ds['id']` in `datastore.update` (Found only by Sonnet) - -Sonnet's second-highest finding (score 0.95) is a missing argument in a `datastore.update` call. The call passes the wrong number of arguments — `ds['id']` is missing — which would cause a guaranteed runtime crash when this code path executes. - -This is a novel finding: neither Kimi nor Claude Code identified it. It is the kind of bug that requires careful argument-count analysis against the `datastore.update` API signature, which Sonnet apparently performed. - ---- - -## 8. Strengths Analysis - -### 8.1 Kimi k2.5 Strengths - -**Breadth**: 8 analysis dimensions vs Sonnet's 6. Kimi covered TLS parameter verification, exception contract changes, encryption key storage validation, hex string conversion error handling, TOCTOU races, and coverage gap analysis. This breadth is why it found the method shadowing bug and the TOCTOU race that Sonnet missed. - -**Volume with adversary filtering**: 25 findings with 7 adversary challenges is a reasonable precision/recall tradeoff. The adversary phase is doing its job — it challenged 28% of findings, which is a meaningful filter. 
- -**Top finding quality**: The method shadowing bug (score 1.852, confirmed+crossref) is the highest-quality finding across all three systems. When Kimi finds something, it can find it with depth. - -**Speed**: 19 minutes vs 35 minutes for Sonnet. For time-sensitive review workflows, Kimi's throughput advantage matters. - -**Exception contract coverage**: Findings #2, #11, #12, #13 all address exception handling failures. While some were challenged, the pattern of analysis is correct — the new library's exception hierarchy is a genuine risk area. - -### 8.2 Sonnet 4.6 Strengths - -**Precision**: Zero adversary challenges across 14 findings. Every finding survived the adversary phase. This is the strongest precision signal in the evaluation. - -**Hardest bug found**: CC-4 (KMIP cache wipe, score 0.97) is the most subtle bug in the dataset. Sonnet found it and ranked it as its top finding. This demonstrates genuine depth of analysis. - -**Novel bug found**: Missing `ds['id']` in `datastore.update` (score 0.95) was found by no other system. This is a guaranteed runtime crash that would have shipped undetected. - -**Active false-positive investigation**: Findings #12 and #14 show Sonnet explicitly investigating the CC-1 concern and making a reasoned judgment. This is qualitatively different from simply missing a bug — it is active analysis with a conclusion. - -**Higher average score**: 0.611 vs 0.524 for Kimi. Sonnet's findings are more consistently high-quality. - -**Exception hierarchy analysis**: Findings #4, #8, #9, #10 address the exception inheritance and propagation issues with more specificity than Kimi's equivalent findings. Finding #8 specifically identifies that custom ZFS exceptions should inherit from `CallError` rather than `Exception` — a concrete, actionable recommendation. - -### 8.3 Claude Code Strengths - -**Speed**: Near-instant. For a first-pass review on every PR, this is the dominant advantage. 
- -**CC-1 detection**: Claude Code is the only system that flagged the decorator dispatch crash. Whether this is a true positive or false positive (Sonnet argues the latter), Claude Code's pattern recognition on decorator injection is unique. - -**GitHub-native integration**: Inline comments on the diff are immediately actionable for the PR author. No pipeline, no latency, no cost overhead. - -**CC-4 detection**: Claude Code also caught the KMIP cache wipe, matching Sonnet's top finding. For a single-agent system, this is impressive. - ---- - -## 9. Evidence Quality Comparison - -Evidence quality measures whether a finding includes: specific file and line references, a clear explanation of the failure mode, concrete impact analysis, and a suggested fix or direction. - -### 9.1 Kimi Evidence Quality - -Kimi's top findings (method shadowing, sync_db_keys exception catch) include specific code references and clear failure mode descriptions. The method shadowing finding explains the recursion mechanism precisely. However, many lower-scoring findings (hex validation, database constraints) are more speculative — they identify a pattern that could be a problem without demonstrating that the pattern actually causes a failure in this code. - -The 7 adversary-challenged findings tend to have weaker evidence: they assert a failure mode without fully tracing the execution path. Finding #10 (malformed hex causes 'Missing key' error) is challenged because the error message behavior depends on implementation details not fully analyzed. - -**Evidence quality rating**: High for top 5 findings, moderate for findings 6-15, low for findings 16-25. - -### 9.2 Sonnet Evidence Quality - -Sonnet's findings consistently include type-level analysis. 
Finding #1 (KMIP cache wipe) explains the Python `in` operator semantics on lists vs dicts, traces the consequence (cache always wiped), and identifies the correct fix (use a dict keyed by dataset ID, or check `k in {d['id'] for d in existing_datasets}`). Finding #2 (missing argument) identifies the specific call site and the expected vs actual argument count. - -The exception hierarchy findings (#8, #9, #10) are particularly well-evidenced: they trace the exception propagation path from the ZFS library through the middleware layer to the WebSocket API, identifying exactly where the exception type mismatch causes information loss. - -**Evidence quality rating**: High across all 14 findings. No finding is purely speculative. - -### 9.3 Claude Code Evidence Quality - -Claude Code's inline comments are concise by design. CC-1 and CC-4 are identified with enough specificity to be actionable, but without the depth of analysis that the multi-agent systems provide. The comments point to the problem but do not trace the full impact or suggest a fix. - -**Evidence quality rating**: Moderate. Sufficient for a developer to investigate, insufficient for a developer to fix without additional analysis. - ---- - -## 10. False Positive Analysis - -### 10.1 Kimi False Positives - -Seven findings were adversary-challenged. Of these: -- Findings #10, #11, #12, #13 (critical severity) were challenged and remain unresolved. These findings assert that KMIP operations crash when `check_key()` raises `ZFSNotEncryptedException`. The adversary challenge likely questioned whether `check_key()` can actually raise this exception in the call paths analyzed. -- Findings #17, #18, #19 (important severity) were challenged on similar grounds — they assert failure modes that depend on specific exception behavior that may not occur in practice. - -The challenged findings cluster around exception handling in KMIP operations. 
This suggests Kimi's exception analysis is directionally correct (the exception hierarchy is a real risk) but over-specific in asserting which exact exceptions propagate through which exact paths. - -**Estimated false positive rate**: 4-7 of 25 findings (16-28%) are likely false positives or over-stated. - -### 10.2 Sonnet False Positives - -Zero adversary challenges. The most likely false positive candidate is the CC-1 investigation (findings #12 and #14), but these are explicitly framed as "this is NOT a bug" — they are true negatives, not false positives. - -Finding #13 (no test covers the rejection path) is a suggestion, not a bug claim. It is accurate but low-value. - -**Estimated false positive rate**: 0-1 of 14 findings (0-7%). - -### 10.3 Claude Code False Positives - -CC-1 (decorator dispatch crash) is disputed by Sonnet's analysis. If Sonnet is correct that TLS is explicitly passed in the direct call path, CC-1 is a false positive. This is the primary false positive risk for Claude Code. - -**Estimated false positive rate**: 0-1 of 6 findings (0-17%), depending on CC-1 resolution. - ---- - -## 11. Scoring Rubric and Weighted Scorecard - -### 11.1 Recall Scoring (30% weight) - -Ground truth: 9 bugs. Partial credit for bugs found in related form. - -| System | Bugs Found | Recall Score | -|---|---|---| -| Kimi k2.5 | 6/9 (CC-3, method shadowing, duplicate export, exception contract, TOCTOU, partial CC-3) | 0.67 | -| Sonnet 4.6 | 6/9 (CC-2, CC-3, CC-4, missing argument, exception contract, lock lambda) | 0.67 | -| Claude Code | 4/9 (CC-1, CC-2, CC-3, CC-4) | 0.44 | - -Both PR-AF systems achieve the same recall, but on different bugs. Combined recall of Kimi+Sonnet would be 8/9 (89%). 
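
The combined-recall figure can be checked with a set union. The sketch below is illustrative only: the short bug labels are ad hoc, and the sets encode just the strict YES cells of the ground-truth matrix in Section 5, so per-system counts derived from them can differ from the partial-credit figures in the table above.

```python
GROUND_TRUTH = {
    "CC-1", "CC-2", "CC-3", "CC-4",
    "method shadowing", "duplicate export",
    "missing ds['id']", "exception contracts", "TOCTOU",
}

# Strict YES cells only, transcribed from the Section 5 matrix.
FOUND = {
    "Kimi k2.5": {"CC-3", "method shadowing", "duplicate export",
                  "exception contracts", "TOCTOU"},
    "Sonnet 4.6": {"CC-2", "CC-3", "CC-4", "missing ds['id']",
                   "exception contracts"},
    "Claude Code": {"CC-1", "CC-2", "CC-3", "CC-4"},
}


def recall(bugs):
    """Fraction of ground-truth bugs present in the given set."""
    return len(bugs & GROUND_TRUTH) / len(GROUND_TRUTH)


kimi_plus_sonnet = FOUND["Kimi k2.5"] | FOUND["Sonnet 4.6"]
combined_recall = round(recall(kimi_plus_sonnet), 2)
still_missed = GROUND_TRUTH - kimi_plus_sonnet
all_three = set().union(*FOUND.values())
```

The union of the two PR-AF runs covers eight of the nine bugs, and the single remaining miss is CC-1. Adding Claude Code's set closes that gap.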
- -### 11.2 Precision Scoring (25% weight) - -| System | Estimated True Positives | Total Findings | Precision Score | -|---|---|---|---| -| Kimi k2.5 | ~18-21 of 25 | 25 | 0.72-0.84, midpoint 0.78 | -| Sonnet 4.6 | ~13-14 of 14 | 14 | 0.93-1.00, midpoint 0.96 | -| Claude Code | ~5-6 of 6 | 6 | 0.83-1.00, midpoint 0.92 | - -### 11.3 Evidence Quality Scoring (20% weight) - -Scored 0-1 based on specificity, code references, impact analysis, and actionability. - -| System | Evidence Quality Score | -|---|---| -| Kimi k2.5 | 0.68 (high for top findings, drops off significantly) | -| Sonnet 4.6 | 0.87 (consistently high across all findings) | -| Claude Code | 0.62 (sufficient for identification, insufficient for remediation) | - -### 11.4 Severity Calibration Scoring (15% weight) - -Measures whether critical bugs are labeled critical and suggestions are not over-elevated. - -| System | Calibration Notes | Score | -|---|---|---| -| Kimi k2.5 | 6 critical labels; 4 of these were adversary-challenged (over-elevation risk). Method shadowing correctly critical. | 0.70 | -| Sonnet 4.6 | 2 critical labels (CC-4 and missing argument) — both are genuinely critical. 9 important labels are well-calibrated. | 0.92 | -| Claude Code | 2 critical labels (CC-1, CC-4) — CC-4 is correct; CC-1 is disputed. | 0.80 | - -### 11.5 Breadth Scoring (10% weight) - -Measures coverage across distinct risk dimensions. 
- -| System | Dimensions Covered | Score | -|---|---|---| -| Kimi k2.5 | 8 dimensions: TLS, exception contracts, key storage, hex conversion, TOCTOU, coverage gaps | 0.90 | -| Sonnet 4.6 | 6 dimensions: decorator injection, enum comparison, exception handling, lock keys, PBKDF2, argument validation | 0.75 | -| Claude Code | 3-4 dimensions: decorator injection, enum comparison, PBKDF2, type mismatch | 0.50 | - -### 11.6 Weighted Final Scores - -| Criterion | Weight | Kimi k2.5 | Sonnet 4.6 | Claude Code | -|---|---|---|---|---| -| Recall | 30% | 0.67 | 0.67 | 0.44 | -| Precision | 25% | 0.78 | 0.96 | 0.92 | -| Evidence quality | 20% | 0.68 | 0.87 | 0.62 | -| Severity calibration | 15% | 0.70 | 0.92 | 0.80 | -| Breadth | 10% | 0.90 | 0.75 | 0.50 | -| **Weighted total** | 100% | **0.727** | **0.828** | **0.656** | - -Calculation: -- Kimi: (0.67x0.30) + (0.78x0.25) + (0.68x0.20) + (0.70x0.15) + (0.90x0.10) = 0.201 + 0.195 + 0.136 + 0.105 + 0.090 = **0.727** -- Sonnet: (0.67x0.30) + (0.96x0.25) + (0.87x0.20) + (0.92x0.15) + (0.75x0.10) = 0.201 + 0.240 + 0.174 + 0.138 + 0.075 = **0.828** -- Claude Code: (0.44x0.30) + (0.92x0.25) + (0.62x0.20) + (0.80x0.15) + (0.50x0.10) = 0.132 + 0.230 + 0.124 + 0.120 + 0.050 = **0.656** - -**Sonnet 4.6 scores highest overall (0.828), driven by precision and evidence quality advantages. Kimi k2.5 scores second (0.727), with breadth as its strongest dimension. Claude Code scores third (0.656) but operates at a fundamentally different cost/latency point.** - ---- - -## 12. Conclusions and Recommendations - -### 12.1 Primary Conclusions - -**Sonnet 4.6 is the better model for PR-AF on this class of PR.** Its precision advantage (0.96 vs 0.78) and evidence quality advantage (0.87 vs 0.68) are substantial. It found the hardest bug (CC-4), found a novel bug nobody else caught (missing `ds['id']`), and produced zero false positives. The cost is 1.9x longer runtime. 
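
The weighted totals from Section 11.6, on which this conclusion rests, can be reproduced with a short script. It is a sketch for checking the arithmetic only: the dictionary keys are ad hoc names, and the weights and per-criterion scores are copied from the scorecard rather than recomputed from the raw findings.

```python
WEIGHTS = {
    "recall": 0.30,
    "precision": 0.25,
    "evidence": 0.20,
    "calibration": 0.15,
    "breadth": 0.10,
}

# Per-criterion scores copied from the Section 11.6 table.
SCORES = {
    "Kimi k2.5": {"recall": 0.67, "precision": 0.78, "evidence": 0.68,
                  "calibration": 0.70, "breadth": 0.90},
    "Sonnet 4.6": {"recall": 0.67, "precision": 0.96, "evidence": 0.87,
                   "calibration": 0.92, "breadth": 0.75},
    "Claude Code": {"recall": 0.44, "precision": 0.92, "evidence": 0.62,
                    "calibration": 0.80, "breadth": 0.50},
}


def weighted_total(scores):
    """Sum of score * weight over all criteria."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


totals = {system: round(weighted_total(s), 3) for system, s in SCORES.items()}
ranking = sorted(totals, key=totals.get, reverse=True)
```

Changing the entries in `WEIGHTS` is a quick way to see how sensitive the ranking is to the rubric, for example when breadth is weighted more heavily than precision.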
- -**Kimi k2.5 provides complementary coverage.** It found the method shadowing bug and the TOCTOU race that Sonnet missed. Its breadth advantage (8 dimensions vs 6) is real. For PRs where coverage breadth matters more than precision, Kimi is the better choice. - -**Neither system is sufficient alone.** The combined recall of Kimi+Sonnet is 8/9 (89%), compared to 67% for either alone. The one remaining miss (CC-1, the decorator dispatch crash) was flagged only by Claude Code. - -**Claude Code remains valuable as a first-pass filter.** Its near-instant feedback and GitHub-native integration make it the right tool for immediate PR feedback. It flagged CC-1 (still disputed) and caught CC-4, two of the most impactful issues in the dataset, without any pipeline overhead. - -**The adversary phase visibly filters Kimi's output, but its effect on Sonnet is unclear.** Kimi's 28% challenge rate shows the adversary is filtering noise. Sonnet's 0% challenge rate is either a sign of genuine precision or an under-resourced adversary run. This warrants investigation in future evaluations. - -### 12.2 Recommendations - -**For production deployment of PR-AF:** - -1. **Use Sonnet 4.6 as the primary model** for high-risk PRs (encryption, authentication, data integrity). Its precision and evidence quality reduce reviewer fatigue from false positives. - -2. **Use Kimi k2.5 as a secondary sweep** on the same PR when breadth matters. The 19-minute runtime is acceptable for a background job. The complementary coverage justifies the cost. - -3. **Keep Claude Code as the first-pass reviewer** on every PR. Its speed and GitHub integration make it the right tool for immediate feedback, and it flagged an issue (CC-1) that neither multi-agent run reported as a bug. - -4. **Investigate the adversary phase for Sonnet.** Zero challenges across 14 findings is unusual. Either the adversary agent needs more resources, or Sonnet's self-filtering before the adversary phase is so effective that the adversary has nothing to challenge. 
Understanding which is true matters for calibrating confidence in Sonnet's findings. - -5. **Add name shadowing and import resolution as an explicit analysis dimension.** The method shadowing bug (Kimi's top finding) is a category that neither Sonnet nor Claude Code covered. Adding it as a required dimension would improve recall across all systems. - -6. **Resolve the CC-1 dispute.** Sonnet's analysis (findings #12, #14) argues CC-1 is not a bug. Claude Code says it is. This should be resolved by running the code or by a human reviewer examining the specific call path. The answer will calibrate trust in Sonnet's false-positive investigation capability. - -### 12.3 Model Selection Heuristic - -For future PR-AF deployments, use this heuristic: - -- **High-risk, precision-critical PRs** (encryption, auth, data integrity): Sonnet 4.6 -- **Large PRs requiring broad coverage** (refactors touching many subsystems): Kimi k2.5 -- **Time-sensitive PRs needing immediate feedback**: Claude Code -- **Maximum coverage on critical PRs**: Run all three, deduplicate findings, prioritize by cross-system confirmation - ---- - -## 13. 
Appendix: Finding Count Summary - -### A.1 By System - -| System | Critical | Important | Suggestion | Nitpick | Total | -|---|---|---|---|---|---| -| PR-AF + Kimi k2.5 | 6 | 10 | 9 | 0 | 25 | -| PR-AF + Sonnet 4.6 | 2 | 9 | 2 | 1 | 14 | -| Claude Code (automated) | 2 | 1 | 3 | 0 | ~6 | - -### A.2 By Ground Truth Bug - -| Bug | Systems That Found It | Confidence | -|---|---|---| -| CC-1: Decorator dispatch crash | Claude Code only | Disputed (Sonnet ruled out) | -| CC-2: Enum comparison always False | Sonnet, Claude Code | High | -| CC-3: `pbkdf2iters` inconsistency | All three | High | -| CC-4: KMIP cache wipe | Sonnet, Claude Code | High | -| Method shadowing / infinite recursion | Kimi only | High (confirmed+crossref) | -| Duplicate export `PoolRemoveArgs` | Kimi only | High | -| Missing `ds['id']` in `datastore.update` | Sonnet only | High | -| Exception contract violations | Kimi, Sonnet | High | -| TOCTOU race in `load_key()` | Kimi only | Moderate | - -### A.3 Unique Contributions - -| System | Unique findings (not found by others) | -|---|---| -| Kimi k2.5 | Method shadowing, duplicate export, TOCTOU, hex validation patterns | -| Sonnet 4.6 | Missing `ds['id']` argument, lock lambda inconsistency (CC-4 was also found, but is shared with Claude Code) | -| Claude Code | CC-1 (decorator dispatch crash) | - -### A.4 Data Sources - -All findings are sourced from: -- `pr-af-result-kimi.json`: Kimi k2.5 pipeline output -- `pr-af-result-sonnet.json`: Sonnet 4.6 pipeline output -- `claude-code-inline-comments.json`: Claude Code inline comments -- `claude-code-reviews.json`: Claude Code review summaries - -All files are located in the same directory as this evaluation document. - ---- - -*This document evaluates model choice (Kimi k2.5 vs Sonnet 4.6) on the v2 meta-selector PR-AF architecture against the Claude Code single-agent baseline. It does not compare architecture versions. For architecture version comparison (v1 vs v2), see the archived evaluation document.* - -*Evaluation produced by LLM-as-a-judge analysis. 
All findings sourced from `pr-af-result.json` (v2), `pr-af-result-old.json` (v1), `claude-code-inline-comments.json`, and `claude-code-reviews.json` in this directory. No findings were invented or inferred beyond what the source data contains.* diff --git a/benchmark/truenas-middleware-18291/claude-code-inline-comments.json b/benchmark/truenas-middleware-18291/claude-code-inline-comments.json deleted file mode 100644 index c831888..0000000 --- a/benchmark/truenas-middleware-18291/claude-code-inline-comments.json +++ /dev/null @@ -1,1038 +0,0 @@ -[ - { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2849532373", - "pull_request_review_id": 3850511973, - "id": 2849532373, - "node_id": "PRRC_kwDOAIOmMM6p2GnV", - "diff_hunk": "@@ -0,0 +1,138 @@\n+from typing import Any, Iterable, Sequence, TypedDict, TYPE_CHECKING\n+\n+import truenas_pylibzfs\n+from truenas_pylibzfs import ZFSException\n+from middlewared.service import CallError\n+\n+if TYPE_CHECKING:\n+ from middlewared.service import ServiceContext\n+\n+\n+class EncryptionProperties(TypedDict, total=False):\n+ keyformat: Any\n+ keylocation: str\n+ pbkdf2iters: Any\n+\n+\n+def load_key(", - "path": "src/middlewared/middlewared/plugins/zfs/dataset_encryption.py", - "commit_id": "2ced671986f6cadbd599d54b2b780f85c320fb1d", - "original_commit_id": "2ced671986f6cadbd599d54b2b780f85c320fb1d", - "user": { - "login": "yocalebo", - "id": 30729806, - "node_id": "MDQ6VXNlcjMwNzI5ODA2", - "avatar_url": "https://avatars.githubusercontent.com/u/30729806?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/yocalebo", - "html_url": "https://github.com/yocalebo", - "followers_url": "https://api.github.com/users/yocalebo/followers", - "following_url": "https://api.github.com/users/yocalebo/following{/other_user}", - "gists_url": "https://api.github.com/users/yocalebo/gists{/gist_id}", - "starred_url": "https://api.github.com/users/yocalebo/starred{/owner}{/repo}", - "subscriptions_url": 
"https://api.github.com/users/yocalebo/subscriptions", - "organizations_url": "https://api.github.com/users/yocalebo/orgs", - "repos_url": "https://api.github.com/users/yocalebo/repos", - "events_url": "https://api.github.com/users/yocalebo/events{/privacy}", - "received_events_url": "https://api.github.com/users/yocalebo/received_events", - "type": "User", - "user_view_type": "public", - "site_admin": false - }, - "body": "I don't like this approach. IIRC the other plugins that I wrote get an `open_handle` passed into the function. We need to do that instead, otherwise, every time this function is called it opens a libzfs handle...which isn't the worst but we've specifically designed around it.", - "created_at": "2026-02-24T21:03:40Z", - "updated_at": "2026-02-24T21:03:44Z", - "html_url": "https://github.com/truenas/middleware/pull/18291#discussion_r2849532373", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "_links": { - "self": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/comments/2849532373" - }, - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#discussion_r2849532373" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "reactions": { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2849532373/reactions", - "total_count": 0, - "+1": 0, - "-1": 0, - "laugh": 0, - "hooray": 0, - "confused": 0, - "heart": 0, - "rocket": 0, - "eyes": 0 - }, - "start_line": null, - "original_start_line": null, - "start_side": null, - "line": null, - "original_line": 17, - "side": "RIGHT", - "author_association": "CONTRIBUTOR", - "original_position": 17, - "position": 1, - "subject_type": "line" - }, - { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2872438785", - "pull_request_review_id": 3876201194, - "id": 2872438785, - "node_id": "PRRC_kwDOAIOmMM6rNfAB", - "diff_hunk": "@@ -338,6 +342,55 @@ 
def nest_paths(self, flat_list: list[dict[str, typing.Any]]) -> list[dict[str, t\n roots.append(item)\n return roots\n \n+ @private\n+ @pass_thread_local_storage\n+ def load_key(self, tls, id_: str, **kwargs) -> None:", - "path": "src/middlewared/middlewared/plugins/zfs/resource_crud.py", - "commit_id": "f20e1d231d9276a131dead5ea78803ef8fab52ad", - "original_commit_id": "f20e1d231d9276a131dead5ea78803ef8fab52ad", - "user": { - "login": "yocalebo", - "id": 30729806, - "node_id": "MDQ6VXNlcjMwNzI5ODA2", - "avatar_url": "https://avatars.githubusercontent.com/u/30729806?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/yocalebo", - "html_url": "https://github.com/yocalebo", - "followers_url": "https://api.github.com/users/yocalebo/followers", - "following_url": "https://api.github.com/users/yocalebo/following{/other_user}", - "gists_url": "https://api.github.com/users/yocalebo/gists{/gist_id}", - "starred_url": "https://api.github.com/users/yocalebo/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/yocalebo/subscriptions", - "organizations_url": "https://api.github.com/users/yocalebo/orgs", - "repos_url": "https://api.github.com/users/yocalebo/repos", - "events_url": "https://api.github.com/users/yocalebo/events{/privacy}", - "received_events_url": "https://api.github.com/users/yocalebo/received_events", - "type": "User", - "user_view_type": "public", - "site_admin": false - }, - "body": "The positional argument of `id_` should be changed to something more relevant that also matches the other patterns in this file.", - "created_at": "2026-03-02T13:25:17Z", - "updated_at": "2026-03-02T13:29:42Z", - "html_url": "https://github.com/truenas/middleware/pull/18291#discussion_r2872438785", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "_links": { - "self": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/comments/2872438785" - }, - "html": { - "href": 
"https://github.com/truenas/middleware/pull/18291#discussion_r2872438785" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "reactions": { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2872438785/reactions", - "total_count": 0, - "+1": 0, - "-1": 0, - "laugh": 0, - "hooray": 0, - "confused": 0, - "heart": 0, - "rocket": 0, - "eyes": 0 - }, - "start_line": null, - "original_start_line": null, - "start_side": null, - "line": null, - "original_line": 347, - "side": "RIGHT", - "author_association": "CONTRIBUTOR", - "original_position": 17, - "position": 1, - "subject_type": "line" - }, - { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2872444154", - "pull_request_review_id": 3876201194, - "id": 2872444154, - "node_id": "PRRC_kwDOAIOmMM6rNgT6", - "diff_hunk": "@@ -338,6 +342,55 @@ def nest_paths(self, flat_list: list[dict[str, typing.Any]]) -> list[dict[str, t\n roots.append(item)\n return roots\n \n+ @private\n+ @pass_thread_local_storage\n+ def load_key(self, tls, id_: str, **kwargs) -> None:\n+ \"\"\"Load the encryption key for dataset `id_`.", - "path": "src/middlewared/middlewared/plugins/zfs/resource_crud.py", - "commit_id": "f20e1d231d9276a131dead5ea78803ef8fab52ad", - "original_commit_id": "f20e1d231d9276a131dead5ea78803ef8fab52ad", - "user": { - "login": "yocalebo", - "id": 30729806, - "node_id": "MDQ6VXNlcjMwNzI5ODA2", - "avatar_url": "https://avatars.githubusercontent.com/u/30729806?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/yocalebo", - "html_url": "https://github.com/yocalebo", - "followers_url": "https://api.github.com/users/yocalebo/followers", - "following_url": "https://api.github.com/users/yocalebo/following{/other_user}", - "gists_url": "https://api.github.com/users/yocalebo/gists{/gist_id}", - "starred_url": "https://api.github.com/users/yocalebo/starred{/owner}{/repo}", - "subscriptions_url": 
"https://api.github.com/users/yocalebo/subscriptions", - "organizations_url": "https://api.github.com/users/yocalebo/orgs", - "repos_url": "https://api.github.com/users/yocalebo/repos", - "events_url": "https://api.github.com/users/yocalebo/events{/privacy}", - "received_events_url": "https://api.github.com/users/yocalebo/received_events", - "type": "User", - "user_view_type": "public", - "site_admin": false - }, - "body": "Please update the docstrings to match the pattern that other methods in this file follow. (i.e. (datasets and volumes))", - "created_at": "2026-03-02T13:26:11Z", - "updated_at": "2026-03-02T13:29:42Z", - "html_url": "https://github.com/truenas/middleware/pull/18291#discussion_r2872444154", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "_links": { - "self": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/comments/2872444154" - }, - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#discussion_r2872444154" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "reactions": { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2872444154/reactions", - "total_count": 0, - "+1": 0, - "-1": 0, - "laugh": 0, - "hooray": 0, - "confused": 0, - "heart": 0, - "rocket": 0, - "eyes": 0 - }, - "start_line": null, - "original_start_line": null, - "start_side": null, - "line": null, - "original_line": 348, - "side": "RIGHT", - "author_association": "CONTRIBUTOR", - "original_position": 18, - "position": 1, - "subject_type": "line" - }, - { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2872447573", - "pull_request_review_id": 3876201194, - "id": 2872447573, - "node_id": "PRRC_kwDOAIOmMM6rNhJV", - "diff_hunk": "@@ -338,6 +342,55 @@ def nest_paths(self, flat_list: list[dict[str, typing.Any]]) -> list[dict[str, t\n roots.append(item)\n return roots\n \n+ @private\n+ 
@pass_thread_local_storage\n+ def load_key(self, tls, id_: str, **kwargs) -> None:\n+ \"\"\"Load the encryption key for dataset `id_`.\n+\n+ Raises CallError if the dataset is not encrypted, the key is already\n+ loaded, or the ZFS operation fails.\n+\n+ `key` (str | bytes) and `key_location` (str) are mutually exclusive.\n+ Pass `key` as str for hex/passphrase keyformats or as bytes for raw\n+ keyformat. Key material is passed to ZFS via an in-memory file and\n+ never written to disk.\n+ \"\"\"\n+ return load_key(self.context, tls, id_, **kwargs)", - "path": "src/middlewared/middlewared/plugins/zfs/resource_crud.py", - "commit_id": "f20e1d231d9276a131dead5ea78803ef8fab52ad", - "original_commit_id": "f20e1d231d9276a131dead5ea78803ef8fab52ad", - "user": { - "login": "yocalebo", - "id": 30729806, - "node_id": "MDQ6VXNlcjMwNzI5ODA2", - "avatar_url": "https://avatars.githubusercontent.com/u/30729806?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/yocalebo", - "html_url": "https://github.com/yocalebo", - "followers_url": "https://api.github.com/users/yocalebo/followers", - "following_url": "https://api.github.com/users/yocalebo/following{/other_user}", - "gists_url": "https://api.github.com/users/yocalebo/gists{/gist_id}", - "starred_url": "https://api.github.com/users/yocalebo/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/yocalebo/subscriptions", - "organizations_url": "https://api.github.com/users/yocalebo/orgs", - "repos_url": "https://api.github.com/users/yocalebo/repos", - "events_url": "https://api.github.com/users/yocalebo/events{/privacy}", - "received_events_url": "https://api.github.com/users/yocalebo/received_events", - "type": "User", - "user_view_type": "public", - "site_admin": false - }, - "body": "This doesn't come close to matching the behavior of the other methods that have been written in this file. 
Please review the other methods in this file and take note of the zfs error exceptions that are raised.", - "created_at": "2026-03-02T13:26:48Z", - "updated_at": "2026-03-02T13:29:42Z", - "html_url": "https://github.com/truenas/middleware/pull/18291#discussion_r2872447573", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "_links": { - "self": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/comments/2872447573" - }, - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#discussion_r2872447573" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "reactions": { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2872447573/reactions", - "total_count": 0, - "+1": 0, - "-1": 0, - "laugh": 0, - "hooray": 0, - "confused": 0, - "heart": 0, - "rocket": 0, - "eyes": 0 - }, - "start_line": null, - "original_start_line": null, - "start_side": null, - "line": null, - "original_line": 358, - "side": "RIGHT", - "author_association": "CONTRIBUTOR", - "original_position": 28, - "position": 1, - "subject_type": "line" - }, - { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2872454096", - "pull_request_review_id": 3876201194, - "id": 2872454096, - "node_id": "PRRC_kwDOAIOmMM6rNivQ", - "diff_hunk": "@@ -0,0 +1,112 @@\n+import threading", - "path": "src/middlewared/middlewared/plugins/zfs/dataset_encryption.py", - "commit_id": "f20e1d231d9276a131dead5ea78803ef8fab52ad", - "original_commit_id": "f20e1d231d9276a131dead5ea78803ef8fab52ad", - "user": { - "login": "yocalebo", - "id": 30729806, - "node_id": "MDQ6VXNlcjMwNzI5ODA2", - "avatar_url": "https://avatars.githubusercontent.com/u/30729806?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/yocalebo", - "html_url": "https://github.com/yocalebo", - "followers_url": "https://api.github.com/users/yocalebo/followers", - "following_url": 
"https://api.github.com/users/yocalebo/following{/other_user}", - "gists_url": "https://api.github.com/users/yocalebo/gists{/gist_id}", - "starred_url": "https://api.github.com/users/yocalebo/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/yocalebo/subscriptions", - "organizations_url": "https://api.github.com/users/yocalebo/orgs", - "repos_url": "https://api.github.com/users/yocalebo/repos", - "events_url": "https://api.github.com/users/yocalebo/events{/privacy}", - "received_events_url": "https://api.github.com/users/yocalebo/received_events", - "type": "User", - "user_view_type": "public", - "site_admin": false - }, - "body": "The name of this file should not be \"dataset_encryption.py\". Name it \"encryption.py\" or something that follows paradigm of other files in this directory.", - "created_at": "2026-03-02T13:28:07Z", - "updated_at": "2026-03-02T13:29:42Z", - "html_url": "https://github.com/truenas/middleware/pull/18291#discussion_r2872454096", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "_links": { - "self": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/comments/2872454096" - }, - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#discussion_r2872454096" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "reactions": { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2872454096/reactions", - "total_count": 0, - "+1": 0, - "-1": 0, - "laugh": 0, - "hooray": 0, - "confused": 0, - "heart": 0, - "rocket": 0, - "eyes": 0 - }, - "start_line": null, - "original_start_line": null, - "start_side": null, - "line": null, - "original_line": 1, - "side": "RIGHT", - "author_association": "CONTRIBUTOR", - "original_position": 1, - "position": 1, - "subject_type": "line" - }, - { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2872461390", - 
"pull_request_review_id": 3876201194, - "id": 2872461390, - "node_id": "PRRC_kwDOAIOmMM6rNkhO", - "diff_hunk": "@@ -0,0 +1,112 @@\n+import threading\n+from typing import Iterable, Literal, NotRequired, TypedDict, TYPE_CHECKING, cast\n+\n+from truenas_pylibzfs import ZFSException\n+from middlewared.service import CallError\n+\n+if TYPE_CHECKING:\n+ from middlewared.service import ServiceContext\n+\n+\n+class EncryptionProperties(TypedDict, total=False):\n+ keyformat: Literal['hex', 'passphrase', 'raw']\n+ keylocation: str\n+ pbkdf2iters: int | None\n+\n+\n+class CheckKeyParams(TypedDict):\n+ id_: str\n+ key: NotRequired[str | bytes]\n+ key_location: NotRequired[str]\n+\n+\n+class CheckKeyResult(TypedDict):\n+ result: bool | None\n+ error: str | None\n+\n+\n+def load_key(ctx: 'ServiceContext', tls: threading.local, id_: str, **kwargs) -> None:", - "path": "src/middlewared/middlewared/plugins/zfs/dataset_encryption.py", - "commit_id": "f20e1d231d9276a131dead5ea78803ef8fab52ad", - "original_commit_id": "f20e1d231d9276a131dead5ea78803ef8fab52ad", - "user": { - "login": "yocalebo", - "id": 30729806, - "node_id": "MDQ6VXNlcjMwNzI5ODA2", - "avatar_url": "https://avatars.githubusercontent.com/u/30729806?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/yocalebo", - "html_url": "https://github.com/yocalebo", - "followers_url": "https://api.github.com/users/yocalebo/followers", - "following_url": "https://api.github.com/users/yocalebo/following{/other_user}", - "gists_url": "https://api.github.com/users/yocalebo/gists{/gist_id}", - "starred_url": "https://api.github.com/users/yocalebo/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/yocalebo/subscriptions", - "organizations_url": "https://api.github.com/users/yocalebo/orgs", - "repos_url": "https://api.github.com/users/yocalebo/repos", - "events_url": "https://api.github.com/users/yocalebo/events{/privacy}", - "received_events_url": 
"https://api.github.com/users/yocalebo/received_events", - "type": "User", - "user_view_type": "public", - "site_admin": false - }, - "body": "These methods follow their own pattern and completely ignore how other functions have been implemented in this directory. I don't want to raise `CallError` in this file. We need to catch known zfs exceptions and raise custom exceptions with proper errnos (cf. \"mount_unmount_impl.py\").", - "created_at": "2026-03-02T13:29:29Z", - "updated_at": "2026-03-02T13:29:42Z", - "html_url": "https://github.com/truenas/middleware/pull/18291#discussion_r2872461390", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "_links": { - "self": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/comments/2872461390" - }, - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#discussion_r2872461390" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "reactions": { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2872461390/reactions", - "total_count": 0, - "+1": 0, - "-1": 0, - "laugh": 0, - "hooray": 0, - "confused": 0, - "heart": 0, - "rocket": 0, - "eyes": 0 - }, - "start_line": null, - "original_start_line": null, - "start_side": null, - "line": null, - "original_line": 28, - "side": "RIGHT", - "author_association": "CONTRIBUTOR", - "original_position": 28, - "position": 1, - "subject_type": "line" - }, - { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879306118", - "pull_request_review_id": 3883749924, - "id": 2879306118, - "node_id": "PRRC_kwDOAIOmMM6rnrmG", - "diff_hunk": "@@ -167,37 +168,48 @@ def sync_db_keys(self, job, name=None):\n # It is possible we have a pool configured but for some mistake/reason the pool did not import like\n # during repair disks were not plugged in and system was booted, in such cases we would like to not\n # remove the encryption keys 
from the database.\n- for root_ds in {pool['name'] for pool in self.middleware.call_sync('pool.query')} - {\n- ds['id'] for ds in self.middleware.call_sync(\n- 'pool.dataset.query', [], {'extra': {'retrieve_children': False, 'properties': []}}\n- )\n- }:\n+ for root_ds in (", - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "commit_id": "3f933a880207082b67be6b664f5f79b6b7472f08", - "original_commit_id": "3f933a880207082b67be6b664f5f79b6b7472f08", - "user": { - "login": "yocalebo", - "id": 30729806, - "node_id": "MDQ6VXNlcjMwNzI5ODA2", - "avatar_url": "https://avatars.githubusercontent.com/u/30729806?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/yocalebo", - "html_url": "https://github.com/yocalebo", - "followers_url": "https://api.github.com/users/yocalebo/followers", - "following_url": "https://api.github.com/users/yocalebo/following{/other_user}", - "gists_url": "https://api.github.com/users/yocalebo/gists{/gist_id}", - "starred_url": "https://api.github.com/users/yocalebo/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/yocalebo/subscriptions", - "organizations_url": "https://api.github.com/users/yocalebo/orgs", - "repos_url": "https://api.github.com/users/yocalebo/repos", - "events_url": "https://api.github.com/users/yocalebo/events{/privacy}", - "received_events_url": "https://api.github.com/users/yocalebo/received_events", - "type": "User", - "user_view_type": "public", - "site_admin": false - }, - "body": "nested for loops with inner comprehension.....gross. 
Let's make this part not suck as much please", - "created_at": "2026-03-03T16:34:32Z", - "updated_at": "2026-03-03T16:34:32Z", - "html_url": "https://github.com/truenas/middleware/pull/18291#discussion_r2879306118", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "_links": { - "self": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879306118" - }, - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#discussion_r2879306118" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "reactions": { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879306118/reactions", - "total_count": 0, - "+1": 0, - "-1": 0, - "laugh": 0, - "hooray": 0, - "confused": 0, - "heart": 0, - "rocket": 0, - "eyes": 0 - }, - "start_line": null, - "original_start_line": null, - "start_side": null, - "line": null, - "original_line": 171, - "side": "RIGHT", - "author_association": "CONTRIBUTOR", - "original_position": 123, - "position": 1, - "subject_type": "line" - }, - { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879314036", - "pull_request_review_id": 3883759592, - "id": 2879314036, - "node_id": "PRRC_kwDOAIOmMM6rnth0", - "diff_hunk": "@@ -214,11 +219,15 @@ def unlock(self, job, id_, options):\n \n job.set_progress(int(name_i / len(names) * 90 + 0.5), f'Unlocking {name!r}')\n try:\n- self.middleware.call_sync(\n- 'zfs.dataset.load_key', name, {'key': datasets[name]['key'], 'mount': False}\n- )\n- except CallError as e:\n- failed[name]['error'] = 'Invalid Key' if 'incorrect key provided' in str(e).lower() else str(e)\n+ load_key(tls, name, key=datasets[name]['key'])\n+ except ZFSException as e:\n+ if ZFSError(e.code) == ZFSError.EZFS_CRYPTOFAILED:", - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "commit_id": "3f933a880207082b67be6b664f5f79b6b7472f08", - 
"original_commit_id": "3f933a880207082b67be6b664f5f79b6b7472f08", - "user": { - "login": "yocalebo", - "id": 30729806, - "node_id": "MDQ6VXNlcjMwNzI5ODA2", - "avatar_url": "https://avatars.githubusercontent.com/u/30729806?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/yocalebo", - "html_url": "https://github.com/yocalebo", - "followers_url": "https://api.github.com/users/yocalebo/followers", - "following_url": "https://api.github.com/users/yocalebo/following{/other_user}", - "gists_url": "https://api.github.com/users/yocalebo/gists{/gist_id}", - "starred_url": "https://api.github.com/users/yocalebo/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/yocalebo/subscriptions", - "organizations_url": "https://api.github.com/users/yocalebo/orgs", - "repos_url": "https://api.github.com/users/yocalebo/repos", - "events_url": "https://api.github.com/users/yocalebo/events{/privacy}", - "received_events_url": "https://api.github.com/users/yocalebo/received_events", - "type": "User", - "user_view_type": "public", - "site_admin": false - }, - "body": "You don't need to create another instance of ZFSError, you can just do `if e.code == ZFSError.EZFS_CRYPTOFAILED`", - "created_at": "2026-03-03T16:35:58Z", - "updated_at": "2026-03-03T16:35:58Z", - "html_url": "https://github.com/truenas/middleware/pull/18291#discussion_r2879314036", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "_links": { - "self": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879314036" - }, - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#discussion_r2879314036" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "reactions": { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879314036/reactions", - "total_count": 0, - "+1": 0, - "-1": 0, - "laugh": 0, - "hooray": 0, - "confused": 0, - 
"heart": 0, - "rocket": 0, - "eyes": 0 - }, - "start_line": null, - "original_start_line": null, - "start_side": null, - "line": null, - "original_line": 224, - "side": "RIGHT", - "author_association": "CONTRIBUTOR", - "original_position": 38, - "position": 1, - "subject_type": "line" - }, - { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879626077", - "pull_request_review_id": 3884111323, - "id": 2879626077, - "node_id": "PRRC_kwDOAIOmMM6ro5td", - "diff_hunk": "@@ -50,7 +52,8 @@ def get_encrypted_datasets(self, filters):\n return rv\n \n @private\n- def push_zfs_keys(self, ids=None):\n+ @pass_thread_local_storage\n+ def push_zfs_keys(self, tls, ids=None):", - "path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "commit_id": "5be20327222bf023533f2dbd7d143645f692a372", - "original_commit_id": "8db5fce3a922f1296d588f6c7f0532e6d6e465f0", - "user": { - "login": "claude[bot]", - "id": 209825114, - "node_id": "BOT_kgDODIGtWg", - "avatar_url": "https://avatars.githubusercontent.com/in/1236702?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/claude%5Bbot%5D", - "html_url": "https://github.com/apps/claude", - "followers_url": "https://api.github.com/users/claude%5Bbot%5D/followers", - "following_url": "https://api.github.com/users/claude%5Bbot%5D/following{/other_user}", - "gists_url": "https://api.github.com/users/claude%5Bbot%5D/gists{/gist_id}", - "starred_url": "https://api.github.com/users/claude%5Bbot%5D/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/claude%5Bbot%5D/subscriptions", - "organizations_url": "https://api.github.com/users/claude%5Bbot%5D/orgs", - "repos_url": "https://api.github.com/users/claude%5Bbot%5D/repos", - "events_url": "https://api.github.com/users/claude%5Bbot%5D/events{/privacy}", - "received_events_url": "https://api.github.com/users/claude%5Bbot%5D/received_events", - "type": "Bot", - "user_view_type": "public", - "site_admin": false - }, - "body": 
"\ud83d\udd34 `sync_zfs_keys` (line ~130) calls `self.push_zfs_keys(ids)` and `self.pull_zfs_keys()` as direct Python method calls, but both methods now require `tls` as a positional parameter due to `@pass_thread_local_storage`. Since `sync_zfs_keys` lacks this decorator and bypasses middleware dispatch, `push_zfs_keys(ids)` binds `ids` to the `tls` parameter (causing `AttributeError` on `tls.lzh`), and `pull_zfs_keys()` raises `TypeError` for the missing required `tls` arg. This completely breaks KMIP ZFS key synchronization. Fix by either routing through `self.middleware.call_sync(\"kmip.push_zfs_keys\", ids)` / `self.middleware.call_sync(\"kmip.pull_zfs_keys\")`, or by adding `@pass_thread_local_storage` to `sync_zfs_keys` and passing `tls` through explicitly.\n\n
\nExtended reasoning...\n\n## What the bug is\n\n`push_zfs_keys` and `pull_zfs_keys` were updated in this PR to use `@pass_thread_local_storage`, adding `tls` as their first positional parameter (after `self`). Their signatures are now:\n- `push_zfs_keys(self, tls, ids=None)`\n- `pull_zfs_keys(self, tls)`\n\nHowever, `sync_zfs_keys` (which calls both methods) was NOT updated with `@pass_thread_local_storage` and still calls them as direct Python method calls at lines ~137 and ~141.\n\n## How @pass_thread_local_storage works\n\nThe `@pass_thread_local_storage` decorator (defined in `decorators.py:209-222`) only sets `fn._pass_thread_local_storage = True` on the function object and returns the function **unmodified**. It does NOT wrap the function or inject any arguments. The actual `tls` injection only happens when the method is invoked through middleware dispatch (`middleware.call_sync()`) or the Job framework, which check for the `_pass_thread_local_storage` flag and inject the thread-local storage object.\n\n## Step-by-step proof of failure\n\n**Case 1: `self.push_zfs_keys(ids)` where `ids` is a list like `[42]`:**\n1. `sync_zfs_keys` is called with `ids=[42]` (e.g., from `insert_or_update_encrypted_record`)\n2. `self.push_zfs_keys([42])` is called as a direct Python method call\n3. Python binds: `self=self`, `tls=[42]`, `ids=None` (default)\n4. Inside `push_zfs_keys`, `check_key(tls, ds[name], key=...)` is called where `tls` is `[42]`\n5. `check_key` calls `open_resource(tls, dataset)` which calls `tls.lzh.open_resource(...)` \n6. `[42].lzh` raises `AttributeError: list object has no attribute lzh`\n\n**Case 2: `self.pull_zfs_keys()`:**\n1. `sync_zfs_keys` calls `self.pull_zfs_keys()` with no arguments\n2. `pull_zfs_keys(self, tls)` requires `tls` as a positional argument with no default\n3. Immediate `TypeError: pull_zfs_keys() missing 1 required positional argument: tls`\n\n## Impact\n\nThis completely breaks KMIP ZFS key synchronization. 
The `sync_zfs_keys` method is called from `insert_or_update_encrypted_record` (when KMIP is enabled and managing ZFS keys) and from itself as a periodic sync job. Any TrueNAS Enterprise deployment using KMIP for ZFS key management will fail to push or pull encryption keys after this PR is merged.\n\n## How to fix\n\nEither:\n1. Route calls through middleware dispatch: `self.middleware.call_sync(\"kmip.push_zfs_keys\", ids)` and `self.middleware.call_sync(\"kmip.pull_zfs_keys\")`, which will handle `tls` injection automatically.\n2. Add `@pass_thread_local_storage` to `sync_zfs_keys` itself and pass `tls` through: `self.push_zfs_keys(tls, ids)` and `self.pull_zfs_keys(tls)`. Note that `sync_zfs_keys` also has `@job`, so `tls` would need to be positioned correctly relative to the `job` parameter.\n\n**Note regarding potential duplicate:** A verifier flagged this as a possible duplicate of bug_007. Whether or not that is the case, the bug itself is confirmed by all verifiers and must be fixed before merging.\n\n
", - "created_at": "2026-03-03T17:36:39Z", - "updated_at": "2026-03-03T17:36:39Z", - "html_url": "https://github.com/truenas/middleware/pull/18291#discussion_r2879626077", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "_links": { - "self": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879626077" - }, - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#discussion_r2879626077" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "reactions": { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879626077/reactions", - "total_count": 2, - "+1": 1, - "-1": 1, - "laugh": 0, - "hooray": 0, - "confused": 0, - "heart": 0, - "rocket": 0, - "eyes": 0 - }, - "start_line": 53, - "original_start_line": 53, - "start_side": "RIGHT", - "line": 56, - "original_line": 56, - "side": "RIGHT", - "author_association": "NONE", - "original_position": 16, - "position": 16, - "subject_type": "line" - }, - { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879626085", - "pull_request_review_id": 3884111323, - "id": 2879626085, - "node_id": "PRRC_kwDOAIOmMM6ro5tl", - "diff_hunk": "@@ -233,23 +232,23 @@ async def inherit_parent_encryption_properties(self, id_):\n elif '/' not in id_:\n raise CallError('Root datasets do not have a parent and cannot inherit encryption settings')\n else:\n- parent = await self.middleware.call(\n+ parent = self.middleware.call_sync(\n 'pool.dataset.get_instance_quick', id_.rsplit('/', 1)[0], {\n 'encryption': True,\n }\n )\n if not parent['encrypted']:\n raise CallError('This operation requires the parent dataset to be encrypted')\n else:\n- parent_encrypted_root = await self.middleware.call(\n+ parent_encrypted_root = self.middleware.call_sync(\n 'pool.dataset.get_instance_quick', parent['encryption_root'], {\n 'encryption': True,\n }\n )\n if 
ZFSKeyFormat(parent_encrypted_root['key_format']['value']) == ZFSKeyFormat.PASSPHRASE.value:", - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "commit_id": "14d2ef618892e058b5864226929595fe04eb9eca", - "original_commit_id": "8db5fce3a922f1296d588f6c7f0532e6d6e465f0", - "user": { - "login": "claude[bot]", - "id": 209825114, - "node_id": "BOT_kgDODIGtWg", - "avatar_url": "https://avatars.githubusercontent.com/in/1236702?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/claude%5Bbot%5D", - "html_url": "https://github.com/apps/claude", - "followers_url": "https://api.github.com/users/claude%5Bbot%5D/followers", - "following_url": "https://api.github.com/users/claude%5Bbot%5D/following{/other_user}", - "gists_url": "https://api.github.com/users/claude%5Bbot%5D/gists{/gist_id}", - "starred_url": "https://api.github.com/users/claude%5Bbot%5D/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/claude%5Bbot%5D/subscriptions", - "organizations_url": "https://api.github.com/users/claude%5Bbot%5D/orgs", - "repos_url": "https://api.github.com/users/claude%5Bbot%5D/repos", - "events_url": "https://api.github.com/users/claude%5Bbot%5D/events{/privacy}", - "received_events_url": "https://api.github.com/users/claude%5Bbot%5D/received_events", - "type": "Bot", - "user_view_type": "public", - "site_admin": false - }, - "body": "\ud83d\udfe3 Pre-existing bug: ZFSKeyFormat(...) == ZFSKeyFormat.PASSPHRASE.value compares an enum member to a string (\"PASSPHRASE\"), which always evaluates to False since ZFSKeyFormat is a plain enum.Enum, not StrEnum. This means the security check that prevents passphrase-encrypted parents from having key-encrypted children is silently bypassed. The fix is to remove .value so it reads == ZFSKeyFormat.PASSPHRASE (enum-to-enum), consistent with every other comparison in the codebase.\n\n
\nExtended reasoning...\n\n## Bug analysis\n\nZFSKeyFormat is defined as a plain enum.Enum in src/middlewared/middlewared/plugins/pool_/utils.py:213. In Python, a plain enum member never compares equal to its .value string:\n\n```python\n>>> from enum import Enum\n>>> class ZFSKeyFormat(Enum):\n... PASSPHRASE = \"PASSPHRASE\"\n>>> ZFSKeyFormat(\"PASSPHRASE\") == ZFSKeyFormat.PASSPHRASE.value\nFalse # Comparing enum member to string \"PASSPHRASE\"\n>>> ZFSKeyFormat(\"PASSPHRASE\") == ZFSKeyFormat.PASSPHRASE\nTrue # Correct: enum member to enum member\n```\n\n## Affected code path\n\nOn line 248 of dataset_encryption_operations.py, inside inherit_parent_encryption_properties:\n\n```python\nif ZFSKeyFormat(parent_encrypted_root[\"key_format\"][\"value\"]) == ZFSKeyFormat.PASSPHRASE.value:\n```\n\nZFSKeyFormat(parent_encrypted_root[\"key_format\"][\"value\"]) constructs a ZFSKeyFormat enum member (e.g. ZFSKeyFormat.PASSPHRASE), then compares it to ZFSKeyFormat.PASSPHRASE.value which is the string \"PASSPHRASE\". This always returns False.\n\n## Why existing code does not prevent it\n\nPython does not warn or error on comparing an enum to a string - it simply returns False. There are no type checks or runtime guards catching this mismatch. The code compiles and runs without any error; it just never enters the if block.\n\n## Impact\n\nBecause this condition is always False, the security validation that checks whether a passphrase-encrypted parent has key-encrypted children is completely dead code. A user could call inherit_parent_encryption_properties and inherit encryption from a passphrase-encrypted parent even when key-encrypted children exist - an operation the code was explicitly designed to prevent.\n\n## Step-by-step proof\n\n1. User calls inherit_parent_encryption_properties(\"pool/child\") where pool/child is an encryption root with key-encrypted sub-children, and the parent encryption root uses passphrase encryption.\n2. 
Code reaches line 248: ZFSKeyFormat(parent_encrypted_root[\"key_format\"][\"value\"]) constructs ZFSKeyFormat.PASSPHRASE (an enum member).\n3. The right-hand side is ZFSKeyFormat.PASSPHRASE.value, which is the string \"PASSPHRASE\".\n4. Python evaluates ZFSKeyFormat.PASSPHRASE == \"PASSPHRASE\" -> False (enum member != string).\n5. The if block is skipped entirely, so the check for key-encrypted children never runs.\n6. Execution falls through to change_encryption_root(), allowing an operation that should have been rejected.\n\n## How to fix\n\nRemove .value from the right-hand side of the comparison on line 248:\n\n```python\n# Before (broken):\nif ZFSKeyFormat(parent_encrypted_root[\"key_format\"][\"value\"]) == ZFSKeyFormat.PASSPHRASE.value:\n\n# After (correct):\nif ZFSKeyFormat(parent_encrypted_root[\"key_format\"][\"value\"]) == ZFSKeyFormat.PASSPHRASE:\n```\n\nThis is consistent with every other comparison in the codebase - e.g. line 28 (== ZFSKeyFormat.PASSPHRASE), line 49 (!= ZFSKeyFormat.PASSPHRASE), line 206 (!= ZFSKeyFormat.PASSPHRASE), and line 310 (== ZFSKeyFormat.PASSPHRASE). Line 248 is the only instance with the erroneous .value on the RHS.\n\n## Pre-existing note\n\nThis bug pre-dates this PR (the same incorrect comparison existed in the async version). However, since this PR is actively rewriting this method (async-to-sync conversion, adding tls parameter), this is an ideal time to fix it.\n\n
", - "created_at": "2026-03-03T17:36:39Z", - "updated_at": "2026-03-03T17:36:39Z", - "html_url": "https://github.com/truenas/middleware/pull/18291#discussion_r2879626085", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "_links": { - "self": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879626085" - }, - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#discussion_r2879626085" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "reactions": { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879626085/reactions", - "total_count": 2, - "+1": 1, - "-1": 1, - "laugh": 0, - "hooray": 0, - "confused": 0, - "heart": 0, - "rocket": 0, - "eyes": 0 - }, - "start_line": null, - "original_start_line": null, - "start_side": null, - "line": null, - "original_line": 248, - "side": "RIGHT", - "author_association": "NONE", - "original_position": 170, - "position": 1, - "subject_type": "line" - }, - { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879774141", - "pull_request_review_id": 3884281546, - "id": 2879774141, - "node_id": "PRRC_kwDOAIOmMM6rpd29", - "diff_hunk": "@@ -50,7 +52,8 @@ def get_encrypted_datasets(self, filters):\n return rv\n \n @private\n- def push_zfs_keys(self, ids=None):\n+ @pass_thread_local_storage\n+ def push_zfs_keys(self, tls, ids=None):", - "path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "commit_id": "5be20327222bf023533f2dbd7d143645f692a372", - "original_commit_id": "8db5fce3a922f1296d588f6c7f0532e6d6e465f0", - "user": { - "login": "claude[bot]", - "id": 209825114, - "node_id": "BOT_kgDODIGtWg", - "avatar_url": "https://avatars.githubusercontent.com/in/1236702?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/claude%5Bbot%5D", - "html_url": "https://github.com/apps/claude", - "followers_url": 
"https://api.github.com/users/claude%5Bbot%5D/followers", - "following_url": "https://api.github.com/users/claude%5Bbot%5D/following{/other_user}", - "gists_url": "https://api.github.com/users/claude%5Bbot%5D/gists{/gist_id}", - "starred_url": "https://api.github.com/users/claude%5Bbot%5D/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/claude%5Bbot%5D/subscriptions", - "organizations_url": "https://api.github.com/users/claude%5Bbot%5D/orgs", - "repos_url": "https://api.github.com/users/claude%5Bbot%5D/repos", - "events_url": "https://api.github.com/users/claude%5Bbot%5D/events{/privacy}", - "received_events_url": "https://api.github.com/users/claude%5Bbot%5D/received_events", - "type": "Bot", - "user_view_type": "public", - "site_admin": false - }, - "body": "\ud83d\udd34 `sync_zfs_keys` (line 130) calls `self.push_zfs_keys(ids)` and `self.pull_zfs_keys()` as direct Python method calls, but `@pass_thread_local_storage` is a marker-only decorator that does not inject `tls` \u2014 injection only happens through the middleware dispatcher. This means `push_zfs_keys(ids)` binds `ids` to the `tls` parameter (causing `AttributeError` on `tls.lzh`), and `pull_zfs_keys()` raises `TypeError` for the missing required `tls` argument, completely breaking KMIP ZFS key sync at runtime.\n\n
\nExtended reasoning...\n\n## What the bug is\n\nThe PR adds `@pass_thread_local_storage` and a `tls` parameter to both `push_zfs_keys(self, tls, ids=None)` (line 56) and `pull_zfs_keys(self, tls)` (line 99). However, `sync_zfs_keys` (line 130) calls these methods directly via `self.push_zfs_keys(ids)` (line 137) and `self.pull_zfs_keys()` (line 141) \u2014 not through the middleware dispatcher.\n\n## Why injection does not happen\n\nThe `@pass_thread_local_storage` decorator (`decorators.py:221`) only sets `fn._pass_thread_local_storage = True` as a flag and returns the unmodified function. It does **not** wrap the function or inject `tls`. The actual `tls` injection happens exclusively in the middleware dispatch paths: `job.py:620-621` (for job methods) and `main.py:862-865` (for regular method calls). The `Service` class has no `__getattr__` or method interception that would inject `tls` on direct `self.method()` calls.\n\n## Step-by-step proof of the crash\n\n**Path 1 \u2014 `self.push_zfs_keys(ids)` (line 137):**\n1. `sync_zfs_keys` is called when KMIP is enabled and managing ZFS keys.\n2. It calls `self.push_zfs_keys(ids)` where `ids` is e.g. `[pk]` (a list of integers).\n3. `push_zfs_keys` signature is `(self, tls, ids=None)`. Python binds: `tls = [pk]`, `ids = None`.\n4. Inside `push_zfs_keys`, `check_key(tls, ds[name], key=...)` is called (line 65).\n5. `check_key` calls `open_resource(tls, dataset)` which does `tls.lzh.open_resource(...)`.\n6. Since `tls` is actually a list, this raises `AttributeError: list object has no attribute lzh`.\n\n**Path 2 \u2014 `self.pull_zfs_keys()` (line 141):**\n1. `sync_zfs_keys` calls `self.pull_zfs_keys()` with no arguments (besides `self`).\n2. `pull_zfs_keys` signature is `(self, tls)`. `tls` is a required positional parameter.\n3. 
Python raises `TypeError: pull_zfs_keys() missing 1 required positional argument: tls`.\n\n## Impact\n\nBoth code paths crash at runtime whenever KMIP ZFS key sync is triggered (`kmip.sync_zfs_keys`), completely breaking KMIP key management. This is called from `insert_or_update_encrypted_record` (line 60 in `dataset_encryption_operations.py`) and periodically via `zfs_keys_pending_sync`.\n\n## How to fix\n\nEither:\n1. Add `@pass_thread_local_storage` to `sync_zfs_keys` and pass `tls` through: `self.push_zfs_keys(tls, ids)` and `self.pull_zfs_keys(tls)`. Note that since `sync_zfs_keys` is a `@job` method, the job runner (`job.py:620-621`) will inject `tls` when it has `_pass_thread_local_storage`.\n2. Or call these methods through middleware dispatch: `self.middleware.call_sync(kmip.push_zfs_keys, ids)` and `self.middleware.call_sync(kmip.pull_zfs_keys)`.\n\n
", - "created_at": "2026-03-03T18:06:52Z", - "updated_at": "2026-03-03T18:06:53Z", - "html_url": "https://github.com/truenas/middleware/pull/18291#discussion_r2879774141", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "_links": { - "self": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879774141" - }, - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#discussion_r2879774141" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "reactions": { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879774141/reactions", - "total_count": 2, - "+1": 1, - "-1": 1, - "laugh": 0, - "hooray": 0, - "confused": 0, - "heart": 0, - "rocket": 0, - "eyes": 0 - }, - "start_line": 55, - "original_start_line": 55, - "start_side": "RIGHT", - "line": 56, - "original_line": 56, - "side": "RIGHT", - "author_association": "NONE", - "original_position": 16, - "position": 16, - "subject_type": "line" - }, - { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879774145", - "pull_request_review_id": 3884281546, - "id": 2879774145, - "node_id": "PRRC_kwDOAIOmMM6rpd3B", - "diff_hunk": "@@ -233,23 +232,23 @@ async def inherit_parent_encryption_properties(self, id_):\n elif '/' not in id_:\n raise CallError('Root datasets do not have a parent and cannot inherit encryption settings')\n else:\n- parent = await self.middleware.call(\n+ parent = self.middleware.call_sync(\n 'pool.dataset.get_instance_quick', id_.rsplit('/', 1)[0], {\n 'encryption': True,\n }\n )\n if not parent['encrypted']:\n raise CallError('This operation requires the parent dataset to be encrypted')\n else:\n- parent_encrypted_root = await self.middleware.call(\n+ parent_encrypted_root = self.middleware.call_sync(\n 'pool.dataset.get_instance_quick', parent['encryption_root'], {\n 'encryption': True,\n }\n )\n if 
ZFSKeyFormat(parent_encrypted_root['key_format']['value']) == ZFSKeyFormat.PASSPHRASE.value:", - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "commit_id": "14d2ef618892e058b5864226929595fe04eb9eca", - "original_commit_id": "8db5fce3a922f1296d588f6c7f0532e6d6e465f0", - "user": { - "login": "claude[bot]", - "id": 209825114, - "node_id": "BOT_kgDODIGtWg", - "avatar_url": "https://avatars.githubusercontent.com/in/1236702?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/claude%5Bbot%5D", - "html_url": "https://github.com/apps/claude", - "followers_url": "https://api.github.com/users/claude%5Bbot%5D/followers", - "following_url": "https://api.github.com/users/claude%5Bbot%5D/following{/other_user}", - "gists_url": "https://api.github.com/users/claude%5Bbot%5D/gists{/gist_id}", - "starred_url": "https://api.github.com/users/claude%5Bbot%5D/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/claude%5Bbot%5D/subscriptions", - "organizations_url": "https://api.github.com/users/claude%5Bbot%5D/orgs", - "repos_url": "https://api.github.com/users/claude%5Bbot%5D/repos", - "events_url": "https://api.github.com/users/claude%5Bbot%5D/events{/privacy}", - "received_events_url": "https://api.github.com/users/claude%5Bbot%5D/received_events", - "type": "Bot", - "user_view_type": "public", - "site_admin": false - }, - "body": "\ud83d\udfe3 Pre-existing bug: On line 248, `ZFSKeyFormat(...) == ZFSKeyFormat.PASSPHRASE.value` compares an enum instance against the string `\"PASSPHRASE\"`, which always returns `False` for `enum.Enum` (not `StrEnum`). This means the safeguard preventing key-encrypted children under passphrase-encrypted parents in `inherit_parent_encryption_properties` is completely bypassed. Should be `== ZFSKeyFormat.PASSPHRASE` (without `.value`).\n\n
\nExtended reasoning...\n\n## Bug Analysis\n\n`ZFSKeyFormat` is defined as `enum.Enum` (not `StrEnum`) in `pool_/utils.py:213`. Its members are standard enum instances: `ZFSKeyFormat.PASSPHRASE` is an enum instance, and `ZFSKeyFormat.PASSPHRASE.value` is the string `\"PASSPHRASE\"`. In Python, comparing a standard `enum.Enum` instance with a string via `==` always returns `False`.\n\nOn line 248 of `dataset_encryption_operations.py`, the code reads:\n```python\nif ZFSKeyFormat(parent_encrypted_root[\"key_format\"][\"value\"]) == ZFSKeyFormat.PASSPHRASE.value:\n```\nThe left side constructs a `ZFSKeyFormat` enum instance (e.g., `ZFSKeyFormat.PASSPHRASE`), and the right side is `ZFSKeyFormat.PASSPHRASE.value` which is the string `\"PASSPHRASE\"`. Since `ZFSKeyFormat` is `enum.Enum`, this comparison always evaluates to `False`.\n\n## Step-by-step proof\n\n1. `parent_encrypted_root[\"key_format\"][\"value\"]` returns `\"PASSPHRASE\"` (a string from ZFS properties).\n2. `ZFSKeyFormat(\"PASSPHRASE\")` constructs `ZFSKeyFormat.PASSPHRASE` (an enum instance).\n3. `ZFSKeyFormat.PASSPHRASE.value` evaluates to `\"PASSPHRASE\"` (a string).\n4. `ZFSKeyFormat.PASSPHRASE == \"PASSPHRASE\"` returns `False` because Python standard `enum.Enum.__eq__` does not coerce types.\n5. The `if` block never executes, so the safeguard is bypassed.\n\nEvery other `ZFSKeyFormat` comparison in the codebase correctly compares enum-to-enum (e.g., `== ZFSKeyFormat.RAW`, `== ZFSKeyFormat.PASSPHRASE` at lines 102, 196, 309, 177, and 28). Line 248 is the only instance with `.value` on the right-hand side.\n\n## Impact\n\nThe purpose of this check is to prevent creating key-encrypted children under passphrase-encrypted datasets when calling `inherit_parent_encryption_properties`. 
Because the condition never triggers, a user could corrupt their ZFS encryption hierarchy by inheriting encryption from a passphrase-encrypted parent while having key-encrypted children underneath, which ZFS does not support correctly.\n\n## Fix\n\nRemove `.value` from the right-hand side:\n```python\nif ZFSKeyFormat(parent_encrypted_root[\"key_format\"][\"value\"]) == ZFSKeyFormat.PASSPHRASE:\n```\n\nThis is a pre-existing bug \u2014 line 248 was not changed in this PR \u2014 but the PR actively modifies the surrounding method (converting async to sync, adding `tls` parameter), making this a natural time to fix it.\n\n
", - "created_at": "2026-03-03T18:06:53Z", - "updated_at": "2026-03-03T18:06:53Z", - "html_url": "https://github.com/truenas/middleware/pull/18291#discussion_r2879774145", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "_links": { - "self": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879774145" - }, - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#discussion_r2879774145" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "reactions": { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879774145/reactions", - "total_count": 2, - "+1": 1, - "-1": 1, - "laugh": 0, - "hooray": 0, - "confused": 0, - "heart": 0, - "rocket": 0, - "eyes": 0 - }, - "start_line": null, - "original_start_line": null, - "start_side": null, - "line": null, - "original_line": 248, - "side": "RIGHT", - "author_association": "NONE", - "original_position": 170, - "position": 1, - "subject_type": "line" - }, - { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879988304", - "pull_request_review_id": 3884518820, - "id": 2879988304, - "node_id": "PRRC_kwDOAIOmMM6rqSJQ", - "diff_hunk": "@@ -172,14 +172,19 @@ class PoolDatasetChangeKeyOptions(BaseModel):\n \"\"\"Generate a new random encryption key instead of using a provided key or passphrase.\"\"\"\n key_file: bool = False\n \"\"\"Whether the provided key is from a key file rather than entered directly.\"\"\"\n- pbkdf2iters: int = Field(default=350000, ge=100000)\n+ pbkdf2iters: int = Field(default=1300000, ge=1300000)\n \"\"\"Number of PBKDF2 iterations for passphrase-based keys. Higher values improve security against \\\n- brute force attacks but increase unlock time. 
Default 350,000 balances security and performance.\"\"\"\n+ brute force attacks but increase unlock time.\"\"\"\n passphrase: Secret[NonEmptyString | None] = None\n \"\"\"Passphrase to use for encryption key derivation.\"\"\"\n key: Secret[Annotated[str, Field(min_length=64, max_length=64)] | None] = None\n \"\"\"Raw hex-encoded encryption key.\"\"\"\n \n+ @classmethod\n+ def from_previous(cls, value):\n+ value['pbkdf2iters'] = max(1300000, value['pbkdf2iters'])\n+ return value", - "path": "src/middlewared/middlewared/api/v26_0_0/pool_dataset.py", - "commit_id": "5be20327222bf023533f2dbd7d143645f692a372", - "original_commit_id": "bfe3651a785219d1540aacc8df9c98ed8e1de254", - "user": { - "login": "claude[bot]", - "id": 209825114, - "node_id": "BOT_kgDODIGtWg", - "avatar_url": "https://avatars.githubusercontent.com/in/1236702?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/claude%5Bbot%5D", - "html_url": "https://github.com/apps/claude", - "followers_url": "https://api.github.com/users/claude%5Bbot%5D/followers", - "following_url": "https://api.github.com/users/claude%5Bbot%5D/following{/other_user}", - "gists_url": "https://api.github.com/users/claude%5Bbot%5D/gists{/gist_id}", - "starred_url": "https://api.github.com/users/claude%5Bbot%5D/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/claude%5Bbot%5D/subscriptions", - "organizations_url": "https://api.github.com/users/claude%5Bbot%5D/orgs", - "repos_url": "https://api.github.com/users/claude%5Bbot%5D/repos", - "events_url": "https://api.github.com/users/claude%5Bbot%5D/events{/privacy}", - "received_events_url": "https://api.github.com/users/claude%5Bbot%5D/received_events", - "type": "Bot", - "user_view_type": "public", - "site_admin": false - }, - "body": "\ud83d\udfe1 This PR raises `PoolDatasetChangeKeyOptions.pbkdf2iters` to `ge=1300000, default=1300000` but leaves `PoolCreateEncryptionOptions.pbkdf2iters` in `v26_0_0/pool.py:139` at `ge=100000, default=350000`. 
Both feed into `validate_encryption_data`, so users can create passphrase-encrypted datasets with 350k iterations but are then required to use >= 1.3M iterations when changing the key. The `from_previous` clamping to 1.3M confirms the higher value is the intended floor \u2014 `PoolCreateEncryptionOptions` should be updated to match.\n\n
\nExtended reasoning...\n\n## What the bug is\n\nThis PR updates `PoolDatasetChangeKeyOptions.pbkdf2iters` in `v26_0_0/pool_dataset.py:175` from `Field(ge=100000, default=350000)` to `Field(ge=1300000, default=1300000)`. However, `PoolCreateEncryptionOptions.pbkdf2iters` in `v26_0_0/pool.py:139` still has `Field(ge=100000, default=350000)`. Both models are in the same API version (v26_0_0) and both flow through the same `validate_encryption_data` function.\n\n## How the inconsistency manifests\n\nWhen a user creates a new passphrase-encrypted pool or dataset, the API allows `pbkdf2iters` as low as 100,000 (defaulting to 350,000). But when that same user later tries to change the encryption key on that dataset, the API now requires `pbkdf2iters >= 1,300,000`. This creates an asymmetric policy: you can create a dataset with weak key derivation, but you cannot maintain that same setting when rotating keys.\n\n## Step-by-step proof\n\n1. User calls `pool.dataset.create` with `encryption=True`, `encryption_options.passphrase=\"secret\"`. The `PoolCreateEncryptionOptions` model accepts this with `pbkdf2iters=350000` (the default).\n2. `validate_encryption_data` (pool_dataset.py:116) produces `{\"pbkdf2iters\": 350000, ...}` and passes it to ZFS.\n3. Later, user calls `pool.dataset.change_key` on the same dataset with `options.passphrase=\"newsecret\"`. The `PoolDatasetChangeKeyOptions` model applies `default=1300000, ge=1300000`.\n4. If the user explicitly passes `pbkdf2iters=350000` in the change_key call, Pydantic validation rejects it because `350000 < 1300000`.\n5. If the user omits pbkdf2iters, it defaults to 1,300,000 \u2014 silently upgrading the iteration count without the user requesting it.\n\n## Why the higher minimum is likely the intended floor everywhere\n\nThe `from_previous` classmethod added to `PoolDatasetChangeKeyOptions` clamps old values upward: `value[\"pbkdf2iters\"] = max(1300000, value[\"pbkdf2iters\"])`. 
This migration logic strongly suggests 1,300,000 is the new minimum across the board, aligning with OpenZFS 2.2+ which raised the default from 350,000 to 1,300,000.\n\n## Impact\n\nThis is a policy inconsistency rather than a crash. OpenZFS still accepts 350,000 iterations at the kernel level, so dataset creation with the old default will not fail. However, users who create passphrase-encrypted datasets get weaker key derivation than what the PR intends to enforce, and they encounter a surprising jump in iterations if they later change the key.\n\n## How to fix\n\nUpdate `PoolCreateEncryptionOptions` in `v26_0_0/pool.py:139` to match:\n```python\npbkdf2iters: int = Field(ge=1300000, default=1300000)\n```\nAnd add a corresponding `from_previous` migration method if needed for backward compatibility with existing API callers.\n\n
", - "created_at": "2026-03-03T19:03:07Z", - "updated_at": "2026-03-03T19:03:08Z", - "html_url": "https://github.com/truenas/middleware/pull/18291#discussion_r2879988304", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "_links": { - "self": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879988304" - }, - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#discussion_r2879988304" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "reactions": { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879988304/reactions", - "total_count": 2, - "+1": 1, - "-1": 1, - "laugh": 0, - "hooray": 0, - "confused": 0, - "heart": 0, - "rocket": 0, - "eyes": 0 - }, - "start_line": 175, - "original_start_line": 175, - "start_side": "RIGHT", - "line": 186, - "original_line": 186, - "side": "RIGHT", - "author_association": "NONE", - "original_position": 17, - "position": 17, - "subject_type": "line" - }, - { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879988308", - "pull_request_review_id": 3884518820, - "id": 2879988308, - "node_id": "PRRC_kwDOAIOmMM6rqSJU", - "diff_hunk": "@@ -91,16 +95,18 @@ def push_zfs_keys(self, ids=None):\n return failed", - "path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "commit_id": "5be20327222bf023533f2dbd7d143645f692a372", - "original_commit_id": "bfe3651a785219d1540aacc8df9c98ed8e1de254", - "user": { - "login": "claude[bot]", - "id": 209825114, - "node_id": "BOT_kgDODIGtWg", - "avatar_url": "https://avatars.githubusercontent.com/in/1236702?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/claude%5Bbot%5D", - "html_url": "https://github.com/apps/claude", - "followers_url": "https://api.github.com/users/claude%5Bbot%5D/followers", - "following_url": "https://api.github.com/users/claude%5Bbot%5D/following{/other_user}", - 
"gists_url": "https://api.github.com/users/claude%5Bbot%5D/gists{/gist_id}", - "starred_url": "https://api.github.com/users/claude%5Bbot%5D/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/claude%5Bbot%5D/subscriptions", - "organizations_url": "https://api.github.com/users/claude%5Bbot%5D/orgs", - "repos_url": "https://api.github.com/users/claude%5Bbot%5D/repos", - "events_url": "https://api.github.com/users/claude%5Bbot%5D/events{/privacy}", - "received_events_url": "https://api.github.com/users/claude%5Bbot%5D/received_events", - "type": "Bot", - "user_view_type": "public", - "site_admin": false - }, - "body": "\ud83d\udfe3 Pre-existing bug: `self.zfs_keys = {k: v for k, v in self.zfs_keys.items() if k in existing_datasets}` on lines 94 and 125 always produces an empty dict because `existing_datasets` is a `list[dict]` and `k` is a `str`. In Python, `str in list[dict]` checks element-wise equality (`str == dict`), which is always `False`, so the entire KMIP key cache is wiped after every push/pull call. Fix by building a set of names first: `existing_names = {ds[\"name\"] for ds in existing_datasets}` and filtering with `if k in existing_names`.\n\n
\nExtended reasoning...\n\n## What the bug is\n\n`get_encrypted_datasets()` (line 34-52) returns a `list[dict]` \u2014 it initializes `rv = list()` and appends datastore record dicts via `rv.append(ds_in_db[i[\"name\"]])`. On lines 94 and 125, `self.zfs_keys` is filtered with:\n\n```python\nself.zfs_keys = {k: v for k, v in self.zfs_keys.items() if k in existing_datasets}\n```\n\nHere `k` is a string (dataset name like `\"pool/ds1\"`) and `existing_datasets` is a `list[dict]`. The `in` operator checks element-wise equality, and since `str == dict` is always `False` in Python, every key is filtered out.\n\n## Step-by-step proof\n\n1. `push_zfs_keys` or `pull_zfs_keys` is called.\n2. `existing_datasets = self.get_encrypted_datasets(filters)` returns e.g. `[{\"name\": \"pool/ds1\", \"id\": 1, ...}]`.\n3. During the loop, keys are added to `self.zfs_keys`, e.g. `self.zfs_keys[\"pool/ds1\"] = \"\"`.\n4. After the loop, the comprehension runs: `k = \"pool/ds1\"`, `\"pool/ds1\" in [{\"name\": \"pool/ds1\", ...}]`.\n5. Python evaluates: `\"pool/ds1\" == {\"name\": \"pool/ds1\", ...}` \u2192 `False` (string never equals dict).\n6. Result: `self.zfs_keys = {}` \u2014 all cached keys are lost.\n\nQuick verification:\n```python\n>>> existing_datasets = [{\"name\": \"pool/ds1\", \"id\": 1}]\n>>> \"pool/ds1\" in existing_datasets\nFalse\n```\n\n## Why existing code does not prevent it\n\nPython does not warn when comparing strings to dicts via `in`. The operation silently returns `False` for every element, producing an empty dict without any error or exception.\n\n## Impact\n\n`self.zfs_keys` is the in-memory cache of ZFS encryption keys managed by KMIP. 
After every push or pull:\n- All cached keys are lost, including keys for datasets not involved in the current operation.\n- `kmip.retrieve_zfs_keys` returns `{}`, so `query_encrypted_roots_keys` cannot find KMIP-managed keys.\n- This affects `encryption_summary`, `sync_db_keys`, `export_keys`, and unlock operations for KMIP-managed datasets.\n- `zfs_keys_pending_sync` always thinks keys need syncing because `ds[\"name\"] not in self.zfs_keys` is always `True` after the cache is wiped.\n\n## How to fix\n\nBuild a set of dataset names before filtering:\n\n```python\nexisting_names = {ds[\"name\"] for ds in existing_datasets}\nself.zfs_keys = {k: v for k, v in self.zfs_keys.items() if k in existing_names}\n```\n\nThis needs to be applied on both line 94 (in `push_zfs_keys`) and line 125 (in `pull_zfs_keys`).\n\n
", - "created_at": "2026-03-03T19:03:08Z", - "updated_at": "2026-03-03T19:03:08Z", - "html_url": "https://github.com/truenas/middleware/pull/18291#discussion_r2879988308", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "_links": { - "self": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879988308" - }, - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#discussion_r2879988308" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "reactions": { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2879988308/reactions", - "total_count": 2, - "+1": 1, - "-1": 1, - "laugh": 0, - "hooray": 0, - "confused": 0, - "heart": 0, - "rocket": 0, - "eyes": 0 - }, - "start_line": null, - "original_start_line": null, - "start_side": null, - "line": 95, - "original_line": 95, - "side": "RIGHT", - "author_association": "NONE", - "original_position": 33, - "position": 33, - "subject_type": "line" - }, - { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2880169596", - "pull_request_review_id": 3884717354, - "id": 2880169596, - "node_id": "PRRC_kwDOAIOmMM6rq-Z8", - "diff_hunk": "@@ -91,16 +95,18 @@ def push_zfs_keys(self, ids=None):\n return failed", - "path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "commit_id": "5be20327222bf023533f2dbd7d143645f692a372", - "original_commit_id": "bfe3651a785219d1540aacc8df9c98ed8e1de254", - "user": { - "login": "creatorcary", - "id": 54003261, - "node_id": "MDQ6VXNlcjU0MDAzMjYx", - "avatar_url": "https://avatars.githubusercontent.com/u/54003261?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/creatorcary", - "html_url": "https://github.com/creatorcary", - "followers_url": "https://api.github.com/users/creatorcary/followers", - "following_url": "https://api.github.com/users/creatorcary/following{/other_user}", - "gists_url": 
"https://api.github.com/users/creatorcary/gists{/gist_id}", - "starred_url": "https://api.github.com/users/creatorcary/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/creatorcary/subscriptions", - "organizations_url": "https://api.github.com/users/creatorcary/orgs", - "repos_url": "https://api.github.com/users/creatorcary/repos", - "events_url": "https://api.github.com/users/creatorcary/events{/privacy}", - "received_events_url": "https://api.github.com/users/creatorcary/received_events", - "type": "User", - "user_view_type": "public", - "site_admin": false - }, - "body": "Needs separate PR", - "created_at": "2026-03-03T19:50:59Z", - "updated_at": "2026-03-03T19:50:59Z", - "html_url": "https://github.com/truenas/middleware/pull/18291#discussion_r2880169596", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "_links": { - "self": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/comments/2880169596" - }, - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#discussion_r2880169596" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "reactions": { - "url": "https://api.github.com/repos/truenas/middleware/pulls/comments/2880169596/reactions", - "total_count": 0, - "+1": 0, - "-1": 0, - "laugh": 0, - "hooray": 0, - "confused": 0, - "heart": 0, - "rocket": 0, - "eyes": 0 - }, - "start_line": null, - "original_start_line": null, - "start_side": null, - "line": 95, - "original_line": 95, - "side": "RIGHT", - "in_reply_to_id": 2879988308, - "author_association": "CONTRIBUTOR", - "original_position": 33, - "position": 33, - "subject_type": "line" - } -] diff --git a/benchmark/truenas-middleware-18291/claude-code-reviews.json b/benchmark/truenas-middleware-18291/claude-code-reviews.json deleted file mode 100644 index aafd8e0..0000000 --- a/benchmark/truenas-middleware-18291/claude-code-reviews.json +++ /dev/null @@ 
-1,402 +0,0 @@ -[ - { - "id": 3850511973, - "node_id": "PRR_kwDOAIOmMM7lgiZl", - "user": { - "login": "yocalebo", - "id": 30729806, - "node_id": "MDQ6VXNlcjMwNzI5ODA2", - "avatar_url": "https://avatars.githubusercontent.com/u/30729806?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/yocalebo", - "html_url": "https://github.com/yocalebo", - "followers_url": "https://api.github.com/users/yocalebo/followers", - "following_url": "https://api.github.com/users/yocalebo/following{/other_user}", - "gists_url": "https://api.github.com/users/yocalebo/gists{/gist_id}", - "starred_url": "https://api.github.com/users/yocalebo/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/yocalebo/subscriptions", - "organizations_url": "https://api.github.com/users/yocalebo/orgs", - "repos_url": "https://api.github.com/users/yocalebo/repos", - "events_url": "https://api.github.com/users/yocalebo/events{/privacy}", - "received_events_url": "https://api.github.com/users/yocalebo/received_events", - "type": "User", - "user_view_type": "public", - "site_admin": false - }, - "body": "", - "state": "CHANGES_REQUESTED", - "html_url": "https://github.com/truenas/middleware/pull/18291#pullrequestreview-3850511973", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "author_association": "CONTRIBUTOR", - "_links": { - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#pullrequestreview-3850511973" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "submitted_at": "2026-02-24T21:03:44Z", - "commit_id": "2ced671986f6cadbd599d54b2b780f85c320fb1d" - }, - { - "id": 3867975352, - "node_id": "PRR_kwDOAIOmMM7mjJ64", - "user": { - "login": "claude[bot]", - "id": 209825114, - "node_id": "BOT_kgDODIGtWg", - "avatar_url": "https://avatars.githubusercontent.com/in/1236702?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/claude%5Bbot%5D", - 
"html_url": "https://github.com/apps/claude", - "followers_url": "https://api.github.com/users/claude%5Bbot%5D/followers", - "following_url": "https://api.github.com/users/claude%5Bbot%5D/following{/other_user}", - "gists_url": "https://api.github.com/users/claude%5Bbot%5D/gists{/gist_id}", - "starred_url": "https://api.github.com/users/claude%5Bbot%5D/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/claude%5Bbot%5D/subscriptions", - "organizations_url": "https://api.github.com/users/claude%5Bbot%5D/orgs", - "repos_url": "https://api.github.com/users/claude%5Bbot%5D/repos", - "events_url": "https://api.github.com/users/claude%5Bbot%5D/events{/privacy}", - "received_events_url": "https://api.github.com/users/claude%5Bbot%5D/received_events", - "type": "Bot", - "user_view_type": "public", - "site_admin": false - }, - "body": "No bugs found, but this PR touches encryption-sensitive code paths across 8 files and depends on unreleased truenas_pylibzfs changes (PR #145). Worth a human review given the scope and security sensitivity.\n\n\n
\nExtended reasoning...\n\n### Overview\n\nThis PR replaces the deprecated py-libzfs with truenas_pylibzfs for ZFS dataset encryption methods (`load_key`, `check_key`, `change_key`, `change_encryption_root`, `bulk_check`). The old `zfs_/dataset_encryption.py` (which ran in a process pool) is deleted, and new standalone functions are added in `zfs/dataset_encryption.py`. Service wrappers are added to `resource_crud.py` using `@pass_thread_local_storage` to provide thread-safe libzfs handles. Callers across KMIP, pool dataset encryption info/lock/operations are updated accordingly. The `pbkdf2iters` default and minimum are raised from 350,000 to 1,300,000 with a `from_previous` migration hook.\n\n### Security risks\n\nThis PR directly handles encryption key loading, verification, and rotation for ZFS datasets. While the code patterns look correct (proper error handling, thread-local libzfs handles, no key material logged), the security-critical nature of these operations means any subtle behavioral difference between the old py-libzfs and new truenas_pylibzfs implementations could have serious consequences (failed unlocks, key verification bypass, etc.).\n\n### Level of scrutiny\n\nHigh scrutiny warranted. This is production-critical encryption infrastructure code. The changes span 8 files and replace the underlying ZFS library for all encryption operations. 
The `from_previous` migration for `pbkdf2iters` is a breaking API change that requires careful review.\n\n### Other factors\n\n- The PR description states it depends on unreleased truenas_pylibzfs changes (PR #145) and the author is \"waiting for next nightly to test\", suggesting the PR is not yet fully validated.\n- A reviewer (yocalebo) previously raised an architectural concern about libzfs handle management that appears to have been addressed by using `tls.lzh` instead of opening new handles, and the comment is resolved.\n- The removal of the generic `bulk_process` job (which dispatched by method name) in favor of the specific `bulk_check` function is a positive safety improvement.\n- No test changes are included, which is expected since integration tests would cover this, but it means correctness depends on the nightly test run the author mentioned.\n\n
", - "state": "COMMENTED", - "html_url": "https://github.com/truenas/middleware/pull/18291#pullrequestreview-3867975352", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "author_association": "NONE", - "_links": { - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#pullrequestreview-3867975352" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "submitted_at": "2026-02-27T16:50:36Z", - "commit_id": "f20e1d231d9276a131dead5ea78803ef8fab52ad" - }, - { - "id": 3876201194, - "node_id": "PRR_kwDOAIOmMM7nCiLq", - "user": { - "login": "yocalebo", - "id": 30729806, - "node_id": "MDQ6VXNlcjMwNzI5ODA2", - "avatar_url": "https://avatars.githubusercontent.com/u/30729806?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/yocalebo", - "html_url": "https://github.com/yocalebo", - "followers_url": "https://api.github.com/users/yocalebo/followers", - "following_url": "https://api.github.com/users/yocalebo/following{/other_user}", - "gists_url": "https://api.github.com/users/yocalebo/gists{/gist_id}", - "starred_url": "https://api.github.com/users/yocalebo/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/yocalebo/subscriptions", - "organizations_url": "https://api.github.com/users/yocalebo/orgs", - "repos_url": "https://api.github.com/users/yocalebo/repos", - "events_url": "https://api.github.com/users/yocalebo/events{/privacy}", - "received_events_url": "https://api.github.com/users/yocalebo/received_events", - "type": "User", - "user_view_type": "public", - "site_admin": false - }, - "body": "This needs to be cleaned up and polished.", - "state": "CHANGES_REQUESTED", - "html_url": "https://github.com/truenas/middleware/pull/18291#pullrequestreview-3876201194", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "author_association": "CONTRIBUTOR", - "_links": { - "html": { - "href": 
"https://github.com/truenas/middleware/pull/18291#pullrequestreview-3876201194" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "submitted_at": "2026-03-02T13:29:42Z", - "commit_id": "f20e1d231d9276a131dead5ea78803ef8fab52ad" - }, - { - "id": 3883749924, - "node_id": "PRR_kwDOAIOmMM7nfVIk", - "user": { - "login": "yocalebo", - "id": 30729806, - "node_id": "MDQ6VXNlcjMwNzI5ODA2", - "avatar_url": "https://avatars.githubusercontent.com/u/30729806?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/yocalebo", - "html_url": "https://github.com/yocalebo", - "followers_url": "https://api.github.com/users/yocalebo/followers", - "following_url": "https://api.github.com/users/yocalebo/following{/other_user}", - "gists_url": "https://api.github.com/users/yocalebo/gists{/gist_id}", - "starred_url": "https://api.github.com/users/yocalebo/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/yocalebo/subscriptions", - "organizations_url": "https://api.github.com/users/yocalebo/orgs", - "repos_url": "https://api.github.com/users/yocalebo/repos", - "events_url": "https://api.github.com/users/yocalebo/events{/privacy}", - "received_events_url": "https://api.github.com/users/yocalebo/received_events", - "type": "User", - "user_view_type": "public", - "site_admin": false - }, - "body": "", - "state": "COMMENTED", - "html_url": "https://github.com/truenas/middleware/pull/18291#pullrequestreview-3883749924", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "author_association": "CONTRIBUTOR", - "_links": { - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#pullrequestreview-3883749924" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "submitted_at": "2026-03-03T16:34:32Z", - "commit_id": "3f933a880207082b67be6b664f5f79b6b7472f08" - }, - { - "id": 3883759592, - "node_id": 
"PRR_kwDOAIOmMM7nfXfo", - "user": { - "login": "yocalebo", - "id": 30729806, - "node_id": "MDQ6VXNlcjMwNzI5ODA2", - "avatar_url": "https://avatars.githubusercontent.com/u/30729806?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/yocalebo", - "html_url": "https://github.com/yocalebo", - "followers_url": "https://api.github.com/users/yocalebo/followers", - "following_url": "https://api.github.com/users/yocalebo/following{/other_user}", - "gists_url": "https://api.github.com/users/yocalebo/gists{/gist_id}", - "starred_url": "https://api.github.com/users/yocalebo/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/yocalebo/subscriptions", - "organizations_url": "https://api.github.com/users/yocalebo/orgs", - "repos_url": "https://api.github.com/users/yocalebo/repos", - "events_url": "https://api.github.com/users/yocalebo/events{/privacy}", - "received_events_url": "https://api.github.com/users/yocalebo/received_events", - "type": "User", - "user_view_type": "public", - "site_admin": false - }, - "body": "", - "state": "COMMENTED", - "html_url": "https://github.com/truenas/middleware/pull/18291#pullrequestreview-3883759592", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "author_association": "CONTRIBUTOR", - "_links": { - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#pullrequestreview-3883759592" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "submitted_at": "2026-03-03T16:35:58Z", - "commit_id": "3f933a880207082b67be6b664f5f79b6b7472f08" - }, - { - "id": 3884004701, - "node_id": "PRR_kwDOAIOmMM7ngTVd", - "user": { - "login": "yocalebo", - "id": 30729806, - "node_id": "MDQ6VXNlcjMwNzI5ODA2", - "avatar_url": "https://avatars.githubusercontent.com/u/30729806?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/yocalebo", - "html_url": "https://github.com/yocalebo", - "followers_url": 
"https://api.github.com/users/yocalebo/followers", - "following_url": "https://api.github.com/users/yocalebo/following{/other_user}", - "gists_url": "https://api.github.com/users/yocalebo/gists{/gist_id}", - "starred_url": "https://api.github.com/users/yocalebo/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/yocalebo/subscriptions", - "organizations_url": "https://api.github.com/users/yocalebo/orgs", - "repos_url": "https://api.github.com/users/yocalebo/repos", - "events_url": "https://api.github.com/users/yocalebo/events{/privacy}", - "received_events_url": "https://api.github.com/users/yocalebo/received_events", - "type": "User", - "user_view_type": "public", - "site_admin": false - }, - "body": "", - "state": "APPROVED", - "html_url": "https://github.com/truenas/middleware/pull/18291#pullrequestreview-3884004701", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "author_association": "CONTRIBUTOR", - "_links": { - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#pullrequestreview-3884004701" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "submitted_at": "2026-03-03T17:19:31Z", - "commit_id": "8db5fce3a922f1296d588f6c7f0532e6d6e465f0" - }, - { - "id": 3884111323, - "node_id": "PRR_kwDOAIOmMM7ngtXb", - "user": { - "login": "claude[bot]", - "id": 209825114, - "node_id": "BOT_kgDODIGtWg", - "avatar_url": "https://avatars.githubusercontent.com/in/1236702?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/claude%5Bbot%5D", - "html_url": "https://github.com/apps/claude", - "followers_url": "https://api.github.com/users/claude%5Bbot%5D/followers", - "following_url": "https://api.github.com/users/claude%5Bbot%5D/following{/other_user}", - "gists_url": "https://api.github.com/users/claude%5Bbot%5D/gists{/gist_id}", - "starred_url": "https://api.github.com/users/claude%5Bbot%5D/starred{/owner}{/repo}", - 
"subscriptions_url": "https://api.github.com/users/claude%5Bbot%5D/subscriptions", - "organizations_url": "https://api.github.com/users/claude%5Bbot%5D/orgs", - "repos_url": "https://api.github.com/users/claude%5Bbot%5D/repos", - "events_url": "https://api.github.com/users/claude%5Bbot%5D/events{/privacy}", - "received_events_url": "https://api.github.com/users/claude%5Bbot%5D/received_events", - "type": "Bot", - "user_view_type": "public", - "site_admin": false - }, - "body": "", - "state": "COMMENTED", - "html_url": "https://github.com/truenas/middleware/pull/18291#pullrequestreview-3884111323", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "author_association": "NONE", - "_links": { - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#pullrequestreview-3884111323" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "submitted_at": "2026-03-03T17:36:39Z", - "commit_id": "8db5fce3a922f1296d588f6c7f0532e6d6e465f0" - }, - { - "id": 3884281546, - "node_id": "PRR_kwDOAIOmMM7nhW7K", - "user": { - "login": "claude[bot]", - "id": 209825114, - "node_id": "BOT_kgDODIGtWg", - "avatar_url": "https://avatars.githubusercontent.com/in/1236702?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/claude%5Bbot%5D", - "html_url": "https://github.com/apps/claude", - "followers_url": "https://api.github.com/users/claude%5Bbot%5D/followers", - "following_url": "https://api.github.com/users/claude%5Bbot%5D/following{/other_user}", - "gists_url": "https://api.github.com/users/claude%5Bbot%5D/gists{/gist_id}", - "starred_url": "https://api.github.com/users/claude%5Bbot%5D/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/claude%5Bbot%5D/subscriptions", - "organizations_url": "https://api.github.com/users/claude%5Bbot%5D/orgs", - "repos_url": "https://api.github.com/users/claude%5Bbot%5D/repos", - "events_url": 
"https://api.github.com/users/claude%5Bbot%5D/events{/privacy}", - "received_events_url": "https://api.github.com/users/claude%5Bbot%5D/received_events", - "type": "Bot", - "user_view_type": "public", - "site_admin": false - }, - "body": "", - "state": "COMMENTED", - "html_url": "https://github.com/truenas/middleware/pull/18291#pullrequestreview-3884281546", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "author_association": "NONE", - "_links": { - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#pullrequestreview-3884281546" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "submitted_at": "2026-03-03T18:06:52Z", - "commit_id": "8db5fce3a922f1296d588f6c7f0532e6d6e465f0" - }, - { - "id": 3884518820, - "node_id": "PRR_kwDOAIOmMM7niQ2k", - "user": { - "login": "claude[bot]", - "id": 209825114, - "node_id": "BOT_kgDODIGtWg", - "avatar_url": "https://avatars.githubusercontent.com/in/1236702?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/claude%5Bbot%5D", - "html_url": "https://github.com/apps/claude", - "followers_url": "https://api.github.com/users/claude%5Bbot%5D/followers", - "following_url": "https://api.github.com/users/claude%5Bbot%5D/following{/other_user}", - "gists_url": "https://api.github.com/users/claude%5Bbot%5D/gists{/gist_id}", - "starred_url": "https://api.github.com/users/claude%5Bbot%5D/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/claude%5Bbot%5D/subscriptions", - "organizations_url": "https://api.github.com/users/claude%5Bbot%5D/orgs", - "repos_url": "https://api.github.com/users/claude%5Bbot%5D/repos", - "events_url": "https://api.github.com/users/claude%5Bbot%5D/events{/privacy}", - "received_events_url": "https://api.github.com/users/claude%5Bbot%5D/received_events", - "type": "Bot", - "user_view_type": "public", - "site_admin": false - }, - "body": "", - "state": "COMMENTED", - 
"html_url": "https://github.com/truenas/middleware/pull/18291#pullrequestreview-3884518820", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "author_association": "NONE", - "_links": { - "html": { - "href": "https://github.com/truenas/middleware/pull/18291#pullrequestreview-3884518820" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "submitted_at": "2026-03-03T19:03:07Z", - "commit_id": "bfe3651a785219d1540aacc8df9c98ed8e1de254" - }, - { - "id": 3884717354, - "node_id": "PRR_kwDOAIOmMM7njBUq", - "user": { - "login": "creatorcary", - "id": 54003261, - "node_id": "MDQ6VXNlcjU0MDAzMjYx", - "avatar_url": "https://avatars.githubusercontent.com/u/54003261?v=4", - "gravatar_id": "", - "url": "https://api.github.com/users/creatorcary", - "html_url": "https://github.com/creatorcary", - "followers_url": "https://api.github.com/users/creatorcary/followers", - "following_url": "https://api.github.com/users/creatorcary/following{/other_user}", - "gists_url": "https://api.github.com/users/creatorcary/gists{/gist_id}", - "starred_url": "https://api.github.com/users/creatorcary/starred{/owner}{/repo}", - "subscriptions_url": "https://api.github.com/users/creatorcary/subscriptions", - "organizations_url": "https://api.github.com/users/creatorcary/orgs", - "repos_url": "https://api.github.com/users/creatorcary/repos", - "events_url": "https://api.github.com/users/creatorcary/events{/privacy}", - "received_events_url": "https://api.github.com/users/creatorcary/received_events", - "type": "User", - "user_view_type": "public", - "site_admin": false - }, - "body": "", - "state": "COMMENTED", - "html_url": "https://github.com/truenas/middleware/pull/18291#pullrequestreview-3884717354", - "pull_request_url": "https://api.github.com/repos/truenas/middleware/pulls/18291", - "author_association": "CONTRIBUTOR", - "_links": { - "html": { - "href": 
"https://github.com/truenas/middleware/pull/18291#pullrequestreview-3884717354" - }, - "pull_request": { - "href": "https://api.github.com/repos/truenas/middleware/pulls/18291" - } - }, - "submitted_at": "2026-03-03T19:50:59Z", - "commit_id": "bfe3651a785219d1540aacc8df9c98ed8e1de254" - } -] diff --git a/benchmark/truenas-middleware-18291/pr-af-result-kimi.json b/benchmark/truenas-middleware-18291/pr-af-result-kimi.json deleted file mode 100644 index b9059dd..0000000 --- a/benchmark/truenas-middleware-18291/pr-af-result-kimi.json +++ /dev/null @@ -1,1425 +0,0 @@ -{ - "execution_id": "exec_20260310_113453_ohqpddr0", - "run_id": "run_20260310_113453_owqznuac", - "status": "succeeded", - "result": { - "findings": [ - { - "active_multipliers": [ - "cross_ref_compound", - "adversary_confirmed" - ], - "body": "**CRITICAL BUG**: The method `change_key` at line 121 shadows the imported function `change_key` from `middlewared.plugins.zfs.encryption` (imported at line 7). When line 200 calls `change_key(tls, id_, encryption_dict, key)`, Python's name resolution (LEGB rule) binds the unqualified name `change_key` to the method in the class scope, NOT the module-level import.\n\nThis causes:\n1. **Infinite recursion**: The method calls itself instead of the encryption function\n2. 
**Type mismatch**: The recursive call binds parameters incorrectly:\n - `job` receives `tls` (thread-local object)\n - `tls` receives `id_` (string dataset name)\n - `id_` receives `encryption_dict` (dict)\n - `options` receives `key` (string)\n\n**Impact**: When users attempt to change encryption keys via the API, the system will crash with `RecursionError` or fail when trying to access attributes like `tls.lzh` on a string.\n\n**Root cause**: The import at line 7 brings `change_key` into the module namespace, but the method definition at line 121 creates a class attribute with the same name, shadowing the import within method bodies.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "1", - "dimension_name": "TLS Parameter Verification for @pass_thread_local_storage Decorated Functions", - "evidence": "Step 1: Import at line 7: `from middlewared.plugins.zfs.encryption import change_encryption_root, change_key`\nStep 2: Method definition at line 121: `def change_key(self, job, tls, id_, options):`\nStep 3: Call at line 200: `change_key(tls, id_, encryption_dict, key)`\nStep 4: Python resolves `change_key` to the method (class scope), not the imported function (module scope)\nStep 5: Method recursively calls itself with wrong parameter types causing RecursionError or AttributeError", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "id": "f_004", - "line_end": 200, - "line_start": 200, - "score": 1.852, - "severity": "critical", - "suggestion": "Rename the import to avoid shadowing: `from middlewared.plugins.zfs.encryption import change_key as zfs_change_key, change_encryption_root`, then update line 200 to call `zfs_change_key(tls, id_, encryption_dict, key)`. 
Alternatively, rename the method to `do_change_key` and update the API method decorator.", - "tags": [ - "shadowing", - "infinite-recursion", - "name-resolution", - "encryption" - ], - "title": "Method name shadows imported function causing infinite recursion" - }, - { - "active_multipliers": [ - "cross_ref_compound", - "adversary_confirmed" - ], - "body": "The `sync_db_keys()` method at lines 200-203 catches all exceptions from `check_key()` and sets `should_remove = True`. With the new exception contract, if a dataset is not encrypted but exists in the database, `check_key()` raises `ZFSNotEncryptedException`, which is caught and the dataset is marked for removal from the database.\n\n**Potential issue**: While removing non-encrypted datasets from the encryption database might be correct behavior, the broad exception catch also catches other legitimate errors (ZFS errors, I/O errors, etc.) and treats them the same way. A dataset with a valid key but experiencing a transient ZFS error would be incorrectly removed from the database.\n\n**Previous behavior**: Only datasets with genuinely invalid keys would return `False` and be marked for removal.\n**New behavior**: ANY exception (including ZFS errors, not just non-encrypted datasets) causes removal.", - "confidence": 0.8, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "exception_contract_check_key", - "dimension_name": "Exception Contract Change in check_key()", - "evidence": "Step 1: `sync_db_keys()` at line 194 iterates over `db_datasets`\nStep 2: At line 201, calls `should_remove = not check_key(tls, ds_name, key=key)`\nStep 3: Lines 200-203 use `except Exception:` to catch all exceptions and set `should_remove = True`\nStep 4: `check_key()` raises `ZFSNotEncryptedException` for non-encrypted datasets\nStep 5: Also catches any other ZFS errors, treating them all as 'invalid key' and removing from DB\nStep 6: `should_remove = True` causes dataset to be added to `to_remove` list at line 205-206", - 
"file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "id": "f_015", - "line_end": 203, - "line_start": 200, - "score": 1.092, - "severity": "important", - "suggestion": "Catch `ZFSNotEncryptedException` specifically and mark those datasets for removal (since they shouldn't be in the encryption database). Re-raise or handle other exceptions differently - perhaps log them and skip removal rather than assuming the key is invalid.", - "tags": [ - "exception-handling", - "data-loss-risk", - "database-consistency" - ], - "title": "sync_db_keys() marks non-encrypted datasets for removal due to broad Exception catch" - }, - { - "active_multipliers": [], - "body": "The `__all__` list contains `PoolRemoveArgs` twice (lines 20 and 21). While this doesn't cause runtime errors, it indicates potential copy-paste errors or incomplete cleanup that may mask other issues.\n\n```python\n\"PoolRemoveArgs\", \"PoolRemoveArgs\", \"PoolRemoveResult\",\n```\n\nThis is a minor issue but suggests insufficient code review for this module.", - "confidence": 1, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "cluster_1", - "dimension_name": "Coverage gap review - cluster_1 API schema changes", - "evidence": "Line 20-21 of pool.py shows: \"PoolRemoveArgs\", \"PoolRemoveArgs\",\nThis is a straightforward duplication that should have been caught.", - "file_path": "src/middlewared/middlewared/api/v26_0_0/pool.py", - "id": "f_031", - "line_end": 20, - "line_start": 20, - "score": 1, - "severity": "critical", - "suggestion": "Remove the duplicate 'PoolRemoveArgs' entry from the __all__ list.", - "tags": [ - "code-quality", - "export-list" - ], - "title": "Duplicate export: PoolRemoveArgs appears twice in __all__ list" - }, - { - "active_multipliers": [ - "cross_ref_compound" - ], - "body": "The `insert_or_update_encrypted_record` method stores encryption keys in the database without validating they are valid hexadecimal strings. 
While the method correctly skips storing passphrase keys (lines 28-30), it does not validate that HEX format keys are properly formatted before storage.\n\nThe only hex validation in the codebase exists in `validate_encryption_data` (lines 101-106), but this only applies to keys read from file input pipes, not to keys provided directly via API parameters. When `options['key']` is provided directly, it bypasses the hex validation entirely.\n\nThis creates a data integrity risk where invalid hex keys could be stored in the database, only to fail later when retrieved and passed to `bytes.fromhex()` in unlock operations.", - "confidence": 0.85, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "encryption_key_validation", - "dimension_name": "Encryption Key Storage Validation", - "evidence": "Step 1: `insert_or_update_encrypted_record` is called from multiple locations:\n - dataset.py:690-693 during dataset creation\n - pool.py:524-530 during pool creation\n - dataset_encryption_lock.py:344-346 during unlock\n - dataset_encryption_operations.py:205 during key change\n\nStep 2: In `insert_or_update_encrypted_record` (lines 26-58), the key is stored directly:\n```python\ndata['encryption_key'] = data['encryption_key'] # Line 38 - no validation\n```\n\nStep 3: The only hex validation exists in `validate_encryption_data` (lines 101-106) but ONLY for file input:\n```python\nif not key and job:\n job.check_pipe('input')\n key = job.pipes.input.r.read(64)\n try:\n key = hex(int(key, 16))[2:]\n if len(key) != 64:\n raise ValueError('Invalid key')\n except ValueError:\n verrors.add(f'{schema}.key_file', 'Please specify a valid key')\n```\n\nStep 4: When keys are retrieved for unlock operations (dataset_encryption_lock.py:177-182), they are passed to `bytes.fromhex()`:\n```python\nif ZFSKeyFormat(ds['key_format']['value']) == ZFSKeyFormat.RAW and ds_key:\n try:\n ds_key = bytes.fromhex(ds_key)\n except ValueError:\n ds_key = None\n```\n\nStep 5: The error is 
silently suppressed, meaning invalid keys stored in the database will silently fail to unlock datasets.", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "id": "f_025", - "line_end": 58, - "line_start": 26, - "score": 0.892, - "severity": "important", - "suggestion": "Add hex validation in `insert_or_update_encrypted_record` before storing the key:\n\n```python\nif data['encryption_key'] and ZFSKeyFormat(key_format.upper()) == ZFSKeyFormat.HEX:\n try:\n # Validate it's a valid hex string of correct length (64 chars = 32 bytes)\n if len(data['encryption_key']) != 64 or int(data['encryption_key'], 16) < 0:\n raise ValueError('Invalid hex key format')\n except ValueError:\n raise CallError(f'Invalid hex encryption key format for {data[\"name\"]}')\n```\n\nAlternatively, move the hex validation to a common validation function that is called for ALL key inputs, not just file inputs.", - "tags": [ - "security", - "data-integrity", - "validation", - "encryption" - ], - "title": "Missing hex validation on encryption keys before database storage" - }, - { - "active_multipliers": [ - "cross_ref_compound" - ], - "body": "The `load_key()` function in `encryption.py` contains a Time-Of-Check-Time-Of-Use (TOCTOU) race condition. At lines 32-34, the function first checks `crypto.info().key_is_loaded` and then immediately calls `crypto.load_key()`. Between this check and the actual load operation, another process or thread could load a key into the same ZFS dataset, causing the subsequent `load_key()` call to fail with an unexpected error.\n\nThe function does raise `ZFSKeyAlreadyLoadedException` if the key is loaded at check time, but this exception is not designed to handle the race where the key gets loaded AFTER the check but BEFORE the load. In a concurrent environment, this race window\u2014though small\u2014is non-zero and could lead to:\n1. Unnecessary error propagation to the caller\n2. 
Failed unlock operations even when valid keys are provided\n3. Inconsistent dataset states when multiple unlock operations are triggered concurrently\n\nThe ZFS kernel module provides atomic operations, but this Python wrapper introduces a race window by separating the check from the operation.", - "confidence": 0.75, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "race_condition_check_load_key", - "dimension_name": "TOCTOU Race Between check_key() and load_key() Operations", - "evidence": "Step 1: `load_key()` is called at encryption.py:29-34.\nStep 2: Line 32 checks `crypto.info().key_is_loaded` - this is a separate ZFS operation.\nStep 3: If key_is_loaded is False, execution proceeds to line 34.\nStep 4: At line 34, `crypto.load_key(**kwargs)` is called.\nStep 5: Between Step 2 and Step 4, another thread/process could successfully call `load_key()` on the same dataset.\nStep 6: This causes the second `load_key()` call to fail with an unexpected ZFS error rather than the handled `ZFSKeyAlreadyLoadedException`.", - "file_path": "src/middlewared/middlewared/plugins/zfs/encryption.py", - "id": "f_021", - "line_end": 34, - "line_start": 29, - "score": 0.787, - "severity": "important", - "suggestion": "Consider removing the pre-check for `key_is_loaded` and instead directly attempt `crypto.load_key()`, catching the specific ZFS error that occurs when a key is already loaded. This reduces the race window to the atomic ZFS operation itself. Alternatively, implement a per-dataset locking mechanism to serialize key loading operations.", - "tags": [ - "race-condition", - "toctou", - "concurrency", - "zfs", - "encryption" - ], - "title": "TOCTOU Race Condition in load_key() Function" - }, - { - "active_multipliers": [], - "body": "The `PoolCreateEncryptionOptions.pbkdf2iters` field changed its constraint from `ge=100000` (v25) to `ge=1300000` (v26). 
This is a **breaking API change** that will cause validation failures for API clients that explicitly set pbkdf2iters to any value between 100000 and 1299999.\n\n**Impact Analysis:**\n- **Silent behavioral change**: Clients relying on the default value (changed from 350000 to 1300000) will experience 3.7x slower encryption key derivation without warning\n- **Explicit validation failures**: Clients sending explicit values in the previously-valid range (100000-1299999) will receive Pydantic validation errors\n- **Breaking change for automation**: Scripts or integrations that hardcoded iteration values within the old range will fail when upgraded to API v26\n\n**Previous constraints (v25_10_2):**\n```python\npbkdf2iters: int = Field(ge=100000, default=350000)\n```\n\n**New constraints (v26_0_0):**\n```python\npbkdf2iters: int = Field(ge=1300000, default=1300000)\n```\n\nThe `from_previous` method (lines 151-154) mitigates this for clients *upgrading* API versions (by forcing values to max(1300000, old_value)), but this does not help:\n1. New API v26 clients making fresh calls\n2. Clients who migrate to v26 without going through upgrade path\n3. 
Configuration-as-code tools that validate against the new schema\n\nThe security improvement (higher minimum iterations) is valid, but should be introduced with deprecation warnings or a transitional period.", - "confidence": 0.9, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "cluster_1", - "dimension_name": "Coverage gap review - cluster_1 API schema changes", - "evidence": "Step 1: Client on API v26 calls pool.create with encryption_options={'pbkdf2iters': 500000, 'passphrase': 'secret'}\nStep 2: Pydantic validates the input against PoolCreateEncryptionOptions at line 139\nStep 3: Field constraint ge=1300000 rejects 500000 as below minimum\nStep 4: ValidationError raised with message about failing ge constraint\n\nEvidence from v25_10_2/pool.py line 167: pbkdf2iters: int = Field(ge=100000, default=350000)\nEvidence from v26_0_0/pool.py line 139: pbkdf2iters: int = Field(ge=1300000, default=1300000)", - "file_path": "src/middlewared/middlewared/api/v26_0_0/pool.py", - "id": "f_028", - "line_end": 139, - "line_start": 139, - "score": 0.63, - "severity": "important", - "suggestion": "Consider one of the following approaches:\n1. **Soft deprecation path**: Keep ge=100000 for one release cycle, log deprecation warnings for values < 1300000, then enforce the new minimum in v27\n2. **Document migration requirements**: Explicitly document that API v26 requires clients to update their pbkdf2iters values\n3. 
**Conditional validation**: Use a model_validator to allow old values during a transition period with warnings\n\nIf this change is intentional and acceptable as a breaking change in a major version, ensure it is prominently documented in the API changelog with clear migration instructions.", - "tags": [ - "api-breaking-change", - "validation", - "encryption", - "backward-compatibility" - ], - "title": "Breaking API change: pbkdf2iters minimum raised from 100000 to 1300000" - }, - { - "active_multipliers": [], - "body": "The `PoolDatasetChangeKeyOptions.pbkdf2iters` field changed its constraint from `ge=100000` (v25) to `ge=1300000` (v26). This is a breaking change for the `pool.dataset.change_key` endpoint.\n\n**Impact Analysis:**\n- Clients calling `pool.dataset.change_key` with explicit pbkdf2iters values between 100000-1299999 will receive validation errors\n- Clients relying on the default (350000 -> 1300000) will experience slower key derivation without warning\n\n**Previous (v25_10_2 line 175):**\n```python\npbkdf2iters: int = Field(default=350000, ge=100000)\n```\n\n**New (v26_0_0 line 175):**\n```python\npbkdf2iters: int = Field(default=1300000, ge=1300000)\n```\n\nThis change mirrors the issue in PoolCreateEncryptionOptions but affects the dataset key change operation specifically.", - "confidence": 0.9, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "cluster_1", - "dimension_name": "Coverage gap review - cluster_1 API schema changes", - "evidence": "Step 1: Client calls pool.dataset.change_key with options={'pbkdf2iters': 200000, 'passphrase': 'newsecret'}\nStep 2: Pydantic validates PoolDatasetChangeKeyOptions at line 175\nStep 3: ge=1300000 constraint fails for value 200000\nStep 4: ValidationError raised\n\nEvidence from v25_10_2/pool_dataset.py line 175: pbkdf2iters: int = Field(default=350000, ge=100000)\nEvidence from v26_0_0/pool_dataset.py line 175: pbkdf2iters: int = Field(default=1300000, ge=1300000)", - "file_path": 
"src/middlewared/middlewared/api/v26_0_0/pool_dataset.py", - "id": "f_029", - "line_end": 175, - "line_start": 175, - "score": 0.63, - "severity": "important", - "suggestion": "Apply the same migration strategy as PoolCreateEncryptionOptions. Consider soft deprecation with warnings before enforcing the new minimum, or clearly document this as a breaking change requiring client updates.", - "tags": [ - "api-breaking-change", - "validation", - "encryption", - "backward-compatibility" - ], - "title": "Breaking API change: PoolDatasetChangeKeyOptions.pbkdf2iters minimum raised from 100000 to 1300000" - }, - { - "active_multipliers": [], - "body": "The `from_previous` classmethod at lines 151-154 silently increases pbkdf2iters to 1300000 without any warning or indication to the client. While this ensures compatibility, it creates a **silent behavioral change** that may confuse users.\n\n```python\n@classmethod\ndef from_previous(cls, value):\n value['pbkdf2iters'] = max(1300000, value['pbkdf2iters'])\n return value\n```\n\n**Issues:**\n1. **Silent upgrade**: A client requesting 350000 iterations (for performance reasons) will silently get 1300000 instead, making encryption/unlocking 3.7x slower without any indication\n2. **No audit trail**: The system doesn't log that it modified the requested value\n3. **Performance surprise**: Users who explicitly chose lower iterations for performance will experience unexplained slowdowns\n4. 
**No opt-out**: There's no way for clients to preserve the old behavior during transition\n\nThis pattern also exists in PoolDatasetChangeKeyOptions.from_previous (pool_dataset.py:183-186).", - "confidence": 0.85, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "cluster_1", - "dimension_name": "Coverage gap review - cluster_1 API schema changes", - "evidence": "Step 1: Client on API v25 calls pool.create with encryption_options={'pbkdf2iters': 350000}\nStep 2: API version adapter detects UPGRADE direction and calls PoolCreateEncryptionOptions.from_previous at line 233 of version.py\nStep 3: from_previous silently replaces 350000 with 1300000 via max() operation\nStep 4: New value 1300000 is validated (passes ge=1300000) and used\nStep 5: Client gets 3.7x slower encryption without any notification\n\nEvidence: version.py line 233 calls new_model.from_previous(value) during UPGRADE", - "file_path": "src/middlewared/middlewared/api/v26_0_0/pool.py", - "id": "f_030", - "line_end": 154, - "line_start": 153, - "score": 0.595, - "severity": "important", - "suggestion": "Add a warning log when from_previous increases the value:\n```python\n@classmethod\ndef from_previous(cls, value):\n old_value = value.get('pbkdf2iters', 350000)\n new_value = max(1300000, old_value)\n if new_value > old_value:\n logger.warning(\n 'pbkdf2iters automatically increased from %d to %d for security compliance',\n old_value, new_value\n )\n value['pbkdf2iters'] = new_value\n return value\n```\nAlternatively, return a response header or metadata indicating the value was modified.", - "tags": [ - "silent-behavior-change", - "logging", - "user-experience" - ], - "title": "from_previous implementation silently modifies pbkdf2iters without notification" - }, - { - "active_multipliers": [], - "body": "The `ge=1300000` constraint combined with the `from_previous` migration means users CANNOT choose lower iteration counts even if they understand the security trade-offs and prioritize 
unlock speed. This removes user agency and could be problematic for: development/test environments where fast unlock is preferred, systems with weak CPUs where 1.3M iterations cause unacceptable delays, and emergency recovery scenarios. The old API allowed any value >= 100000. The new API forces >= 1300000 with no opt-out.", - "confidence": 0.7, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "cluster_0", - "dimension_name": "Root cluster coverage gap review", - "evidence": "Step 1: v25_10_2 allowed pbkdf2iters >= 100000 (Field(ge=100000, default=350000)). Step 2: v26_0_0 requires pbkdf2iters >= 1300000 (Field(ge=1300000, default=1300000)). Step 3: from_previous uses max() to force upgrade of any existing lower values. Step 4: No mechanism exists for users to opt-out of this minimum requirement. Step 5: This is a breaking change that removes flexibility for edge cases.", - "file_path": "src/middlewared/middlewared/api/v26_0_0/pool.py", - "id": "f_035", - "line_end": 153, - "line_start": 139, - "score": 0.49, - "severity": "important", - "suggestion": "Consider whether the hard minimum of 1300000 is appropriate for all use cases, or if there should be an escape hatch for users who need lower iteration counts and accept the security trade-offs. At minimum, document why this specific value was chosen and what users should expect.", - "tags": [ - "api-design", - "user-choice", - "breaking-change" - ], - "title": "Hardcoded minimum prevents users from choosing lower security settings" - }, - { - "active_multipliers": [ - "adversary_challenged" - ], - "body": "When a RAW format encryption key contains malformed hex, the code catches `ValueError` from `bytes.fromhex()` and sets `ds_key = None` (lines 179-182). This causes the subsequent check at line 216-217 to report 'Missing key' even though a key was actually provided. 
This is a confusing user experience - the error message should indicate the key format is invalid, not that no key was provided.\n\n**The failure flow:**\n1. User provides a malformed hex key (e.g., 'gggg' instead of valid hex)\n2. Line 180: `bytes.fromhex(ds_key)` raises `ValueError`\n3. Line 182: `ds_key` is silently set to `None`\n4. Line 216: `not datasets[name]['key']` evaluates to `True` (because key is None)\n5. Line 217: Reports 'Missing key' - which is misleading\n\nThis bypasses the actual error (invalid hex format) and produces a confusing message that suggests no key was provided at all.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "hex-conversion-error-handling", - "dimension_name": "Hex String to Bytes Conversion Error Handling", - "evidence": "Step 1: `pool.dataset.unlock` API is called with malformed hex key\nStep 2: Line 177-182: `bytes.fromhex(ds_key)` raises ValueError, `ds_key` set to None\nStep 3: Line 216: Check `if not datasets[name]['key']` is True\nStep 4: Line 217: Reports 'Missing key' error\nStep 5: User sees confusing error message instead of 'Invalid hex key format'", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "id": "f_000", - "line_end": 217, - "line_start": 177, - "score": 0.475, - "severity": "critical", - "suggestion": "Change the exception handler to raise a clear `CallError` or `ValidationErrors` with a message like 'Invalid hex format for RAW encryption key' instead of silently setting the key to None. 
This ensures users get actionable feedback about the actual problem.", - "tags": [ - "error-handling", - "user-experience", - "encryption", - "hex-conversion" - ], - "title": "Malformed hex key causes confusing 'Missing key' error instead of clear validation message" - }, - { - "active_multipliers": [ - "adversary_challenged" - ], - "body": "The `check_key()` function now raises `ZFSNotEncryptedException` for non-encrypted datasets instead of returning `False`. The KMIP `push_zfs_keys()` method at lines 64-69 calls `check_key()` without any exception handling, expecting a boolean return value.\n\n**Impact**: If a dataset in the database is not actually encrypted (e.g., encryption was removed, or database is out of sync with ZFS), the entire `push_zfs_keys()` operation will crash with an unhandled exception. This could prevent KMIP key synchronization from completing, leaving encryption keys in an inconsistent state.\n\n**The code path**:\n1. `push_zfs_keys()` iterates over datasets from database (line 59)\n2. For each dataset without `encryption_key`, it checks if the in-memory key is valid (line 67)\n3. `check_key()` raises `ZFSNotEncryptedException` if the dataset is not encrypted\n4. 
Exception propagates uncaught, aborting the entire sync operation", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "exception_contract_check_key", - "dimension_name": "Exception Contract Change in check_key()", - "evidence": "Step 1: `push_zfs_keys()` at line 56 iterates over `existing_datasets` from database\nStep 2: At line 64-69, for datasets without `encryption_key`, it checks `if ds['name'] in self.zfs_keys and check_key(tls, ds['name'], key=self.zfs_keys[ds['name']])`\nStep 3: `check_key()` in encryption.py:57-58 raises `ZFSNotEncryptedException(dataset)` when `rsrc.crypto()` returns None (dataset not encrypted)\nStep 4: No exception handling in this code path causes unhandled exception to propagate up\nStep 5: This aborts the entire KMIP key push operation, potentially leaving other datasets unsynchronized", - "file_path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "id": "f_012", - "line_end": 69, - "line_start": 64, - "score": 0.475, - "severity": "critical", - "suggestion": "Wrap the `check_key()` call in a try-except block to catch `ZFSNotEncryptedException` and handle it appropriately. Options:\n1. Skip datasets that are not encrypted (they don't need KMIP key management)\n2. Log a warning and continue with other datasets\n3. Consider removing such datasets from `self.zfs_keys` since they shouldn't have encryption keys", - "tags": [ - "exception-handling", - "kmip", - "zfs-encryption", - "crash" - ], - "title": "KMIP push_zfs_keys() crashes when check_key() raises ZFSNotEncryptedException" - }, - { - "active_multipliers": [ - "adversary_challenged" - ], - "body": "The `pull_zfs_keys()` method at lines 107-111 calls `check_key()` without exception handling. 
Similar to `push_zfs_keys()`, if a dataset is not encrypted but exists in `self.zfs_keys`, the call to `check_key()` will raise `ZFSNotEncryptedException` and crash the operation.\n\n**Impact**: The KMIP key pull operation will fail entirely if any dataset in the iteration is not encrypted. This prevents migrating keys from KMIP server back to local database for datasets that are actually encrypted, because the operation aborts on the first non-encrypted dataset encountered.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "exception_contract_check_key", - "dimension_name": "Exception Contract Change in check_key()", - "evidence": "Step 1: `pull_zfs_keys()` at line 99 iterates over `existing_datasets` with KMIP UIDs\nStep 2: At lines 107-111, it checks `elif ds['name'] in self.zfs_keys and check_key(tls, ds['name'], key=self.zfs_keys[ds['name']])`\nStep 3: `check_key()` in encryption.py:57-58 raises `ZFSNotEncryptedException` if dataset not encrypted\nStep 4: No try-except block catches this exception in `pull_zfs_keys()`\nStep 5: Unhandled exception aborts the entire key pull operation, preventing other datasets from being synchronized", - "file_path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "id": "f_013", - "line_end": 111, - "line_start": 107, - "score": 0.475, - "severity": "critical", - "suggestion": "Add explicit exception handling for `ZFSNotEncryptedException` around the `check_key()` call at lines 107-109. When a dataset is not encrypted, it should be skipped (continue to next dataset) or handled appropriately rather than crashing the entire operation.", - "tags": [ - "exception-handling", - "kmip", - "zfs-encryption", - "crash" - ], - "title": "KMIP pull_zfs_keys() crashes when check_key() raises ZFSNotEncryptedException" - }, - { - "active_multipliers": [ - "adversary_challenged" - ], - "body": "The code at lines 106-109 catches generic `Exception` instead of the specific `ZFSNotEncryptedException`. 
This has two serious problems:\n\n1. **Real errors are masked**: Any actual error (ZFS communication failure, invalid dataset name, memory errors, etc.) will be silently converted to `valid_key = False`, making it indistinguishable from a non-encrypted dataset case.\n\n2. **Missing specific exception import**: The file does not import `ZFSNotEncryptedException` from `middlewared.plugins.zfs.exceptions`, which is required for proper exception handling.\n\nThe OLD behavior was: `check_key()` returned `False` for non-encrypted datasets.\nThe NEW behavior is: `check_key()` raises `ZFSNotEncryptedException` for non-encrypted datasets.\n\nThe current code catches the new exception, but also catches ALL other exceptions, including critical failures that should be propagated to the caller or logged as errors.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "check_key_exception_contract", - "dimension_name": "check_key() Exception Contract Review", - "evidence": "Step 1: `encryption_summary()` calls `check_key(tls, name, key=ds_key)` at line 107\nStep 2: For non-encrypted datasets, `check_key()` raises `ZFSNotEncryptedException` (encryption.py:58)\nStep 3: The generic `except Exception:` at line 108 catches this AND any other exception\nStep 4: `valid_key = False` is set regardless of whether it's a non-encrypted dataset or a real error\nStep 5: Real errors (ZFS failures, communication issues) are masked and logged as routine 'invalid key' cases", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "id": "f_016", - "line_end": 109, - "line_start": 106, - "score": 0.475, - "severity": "critical", - "suggestion": "Import `ZFSNotEncryptedException` and catch it specifically. Re-raise or log other exceptions appropriately. 
Recommended change:\n\n```python\nfrom middlewared.plugins.zfs.exceptions import ZFSNotEncryptedException\n\ntry:\n valid_key = check_key(tls, name, key=ds_key)\nexcept ZFSNotEncryptedException:\n valid_key = False\nexcept Exception as e:\n self.logger.error('Failed to check key for %s: %s', name, e, exc_info=True)\n valid_key = False\n```", - "tags": [ - "exception-handling", - "error-masking", - "api-contract-change" - ], - "title": "Generic Exception catching masks ZFSNotEncryptedException and real errors" - }, - { - "active_multipliers": [ - "cross_ref_compound", - "adversary_confirmed" - ], - "body": "In `validate_encryption_data()` at lines 101-107, there's a different approach to hex validation using `hex(int(key, 16))` instead of `bytes.fromhex()`. This is inconsistent with the hex parsing in `dataset_encryption_lock.py` and `dataset_encryption_info.py`.\n\nWhile both approaches validate hex, using different methods across the codebase:\n1. Makes maintenance harder - fixes to hex validation need to be applied in multiple places\n2. Could have subtle differences in what they accept (e.g., leading zeros, case sensitivity)\n3. 
Creates technical debt and potential for divergence\n\nNote: This location DOES properly handle errors with a clear validation message (line 106), which is good practice that should be emulated in the other locations.", - "confidence": 0.65, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "hex-conversion-error-handling", - "dimension_name": "Hex String to Bytes Conversion Error Handling", - "evidence": "Step 1: `validate_encryption_data()` uses `hex(int(key, 16))` for validation\nStep 2: `dataset_encryption_lock.py` and `dataset_encryption_info.py` use `bytes.fromhex()`\nStep 3: Different parsing methods could accept different formats\nStep 4: Inconsistent error handling - one raises validation error, others suppress or use generic messages", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "id": "f_003", - "line_end": 107, - "line_start": 101, - "score": 0.38, - "severity": "suggestion", - "suggestion": "Consider refactoring to use a common utility function for hex key validation/conversion that is used consistently across all encryption-related code paths. This would centralize the validation logic and ensure consistent error handling.", - "tags": [ - "code-quality", - "consistency", - "encryption", - "hex-conversion" - ], - "title": "Key file validation uses different hex parsing logic than unlock path" - }, - { - "active_multipliers": [ - "cross_ref_compound" - ], - "body": "When retrieving keys from the database for unlock operations, the code attempts to convert hex-encoded keys to bytes using `bytes.fromhex()`. 
If this fails due to invalid hex format stored in the database, the `ValueError` is silently suppressed and the key is set to `None`.\n\nThis silent failure mode could make debugging difficult - the user would see a generic 'Invalid Key' error (line 225) without knowing that the root cause was corrupt data in the database.", - "confidence": 0.75, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "encryption_key_validation", - "dimension_name": "Encryption Key Storage Validation", - "evidence": "```python\nif ZFSKeyFormat(ds['key_format']['value']) == ZFSKeyFormat.RAW and ds_key:\n try:\n ds_key = bytes.fromhex(ds_key)\n except ValueError:\n ds_key = None # Silent failure - key is lost\n```", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "id": "f_027", - "line_end": 182, - "line_start": 177, - "score": 0.337, - "severity": "suggestion", - "suggestion": "Consider logging a warning when hex decoding fails, indicating potential database corruption:\n\n```python\nif ZFSKeyFormat(ds['key_format']['value']) == ZFSKeyFormat.RAW and ds_key:\n try:\n ds_key = bytes.fromhex(ds_key)\n except ValueError:\n self.logger.warning(\n 'Invalid hex key format stored in database for dataset %s',\n name\n )\n ds_key = None\n```", - "tags": [ - "error-handling", - "logging", - "debugging" - ], - "title": "Silent failure when hex decoding fails during unlock" - }, - { - "active_multipliers": [ - "cross_ref_compound" - ], - "body": "The database model defines `encryption_key` as `sa.EncryptedText(), nullable=True` with no CHECK constraints or validation at the database level. While the application should validate inputs, adding a database CHECK constraint would provide defense-in-depth against invalid data insertion from any source (migrations, manual database edits, bugs).\n\nHowever, since the column uses `EncryptedText`, the stored value is encrypted and a CHECK constraint on the raw value would not be feasible. 
The validation must happen at the application layer before encryption.", - "confidence": 0.7, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "encryption_key_validation", - "dimension_name": "Encryption Key Storage Validation", - "evidence": "```python\nclass PoolDatasetEncryptionModel(sa.Model):\n __tablename__ = 'storage_encrypteddataset'\n\n id = sa.Column(sa.Integer(), primary_key=True)\n name = sa.Column(sa.String(255))\n encryption_key = sa.Column(sa.EncryptedText(), nullable=True) # No validation\n kmip_uid = sa.Column(sa.String(255), nullable=True, default=None)\n```", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset.py", - "id": "f_026", - "line_end": 47, - "line_start": 41, - "score": 0.315, - "severity": "suggestion", - "suggestion": "Since `EncryptedText` encrypts the value before storage, database-level CHECK constraints cannot validate the plaintext hex format. Ensure application-level validation is implemented in `insert_or_update_encrypted_record` as suggested in the previous finding.", - "tags": [ - "database", - "constraints", - "defense-in-depth" - ], - "title": "No database-level constraints on encryption_key column" - }, - { - "active_multipliers": [ - "adversary_challenged" - ], - "body": "In `encryption_summary()` at lines 102-104, malformed hex keys are silently suppressed using `contextlib.suppress(ValueError)`. When `bytes.fromhex()` fails, the original hex string is preserved instead of being converted to bytes. This means an invalid hex string gets passed to `check_key()` at line 107.\n\nWhile `check_key()` may handle this gracefully, this creates an inconsistent state where:\n- The code expects `ds_key` to be bytes for RAW format\n- But it may actually be a string (the original malformed hex)\n\nThis violates type expectations and could cause subtle bugs. 
The `valid_key` result at line 107 will likely be `False` for malformed keys (caught by generic Exception handler at line 108-109), but the user gets no indication that their key format was invalid.", - "confidence": 0.85, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "hex-conversion-error-handling", - "dimension_name": "Hex String to Bytes Conversion Error Handling", - "evidence": "Step 1: `encryption_summary` processes a dataset with RAW key format\nStep 2: Line 102-104: `bytes.fromhex(ds_key)` raises ValueError, silently suppressed\nStep 3: `ds_key` remains a string (the invalid hex), not bytes as expected\nStep 4: Line 107: `check_key()` called with invalid type (string instead of bytes)\nStep 5: Generic Exception handler catches and sets `valid_key = False`\nStep 6: User sees 'valid_key: false' with no indication the key format was invalid", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "id": "f_001", - "line_end": 109, - "line_start": 102, - "score": 0.297, - "severity": "important", - "suggestion": "Instead of silently suppressing the error, either:\n1. Track that the key format was invalid and include this in the response (e.g., add 'key_format_invalid' field to results)\n2. Set `ds_key = None` when conversion fails to ensure consistent types\n3. Raise a validation error if this is called via an API that should reject invalid keys upfront", - "tags": [ - "error-handling", - "type-safety", - "encryption", - "hex-conversion" - ], - "title": "Silent hex conversion failure preserves invalid string, causing potential downstream errors" - }, - { - "active_multipliers": [ - "adversary_challenged" - ], - "body": "The `encryption_summary()` method uses a broad `except Exception:` catch at lines 106-109 to handle any exception from `check_key()`. 
While this prevents crashes, it semantically conflates 'dataset is not encrypted' with 'key is invalid'.\n\n**Previous behavior**: `check_key()` returned `False` for non-encrypted datasets, which was set as `valid_key = False`\n**New behavior**: `check_key()` raises `ZFSNotEncryptedException`, which is caught and also sets `valid_key = False`\n\n**Issue**: The user sees 'valid_key: false' but cannot distinguish between:\n1. The dataset is not encrypted (shouldn't even be in the encryption summary)\n2. The provided key is actually invalid\n\nThis could mislead users trying to unlock datasets that aren't actually encrypted.", - "confidence": 0.85, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "exception_contract_check_key", - "dimension_name": "Exception Contract Change in check_key()", - "evidence": "Step 1: `encryption_summary()` at line 100 iterates over encrypted datasets from `query_encrypted_datasets()`\nStep 2: At line 107, it calls `check_key(tls, name, key=ds_key)`\nStep 3: If dataset is not encrypted, `check_key()` raises `ZFSNotEncryptedException` (encryption.py:58)\nStep 4: Lines 106-109 catch ALL exceptions and set `valid_key = False`\nStep 5: The user cannot distinguish between 'not encrypted' vs 'wrong key' - both show as `valid_key: false`", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "id": "f_014", - "line_end": 109, - "line_start": 106, - "score": 0.297, - "severity": "important", - "suggestion": "Catch `ZFSNotEncryptedException` specifically and handle it differently from other exceptions. Options:\n1. Skip non-encrypted datasets from the results entirely (they shouldn't appear in an 'encryption summary')\n2. Add a specific flag or error message indicating the dataset is not encrypted\n3. 
Consider filtering non-encrypted datasets earlier in the method before calling `check_key()`", - "tags": [ - "exception-handling", - "semantic-confusion", - "user-experience" - ], - "title": "Broad Exception catch masks ZFSNotEncryptedException as 'invalid key' in encryption_summary" - }, - { - "active_multipliers": [ - "adversary_challenged" - ], - "body": "In `sync_db_keys()` at lines 196-198, malformed hex keys from the database are silently suppressed using `contextlib.suppress(ValueError)`. When `bytes.fromhex()` fails, the original hex string is preserved and passed to `check_key()` at line 201.\n\nIf `check_key()` fails (which is likely with a malformed key), the dataset is marked for removal from the database at line 206. This means:\n1. A user stores a valid hex key in the database\n2. Somehow the key becomes corrupted in the database (manual edit, migration issue, etc.)\n3. The periodic sync job (runs every 86400 seconds) sees the malformed key\n4. The malformed key fails validation and is removed from the database\n5. 
The user loses their encryption key permanently\n\nThis is a data loss scenario - corrupted keys in the database should not be silently deleted; instead, an error should be logged alerting administrators to the corruption.", - "confidence": 0.8, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "hex-conversion-error-handling", - "dimension_name": "Hex String to Bytes Conversion Error Handling", - "evidence": "Step 1: Periodic job `sync_db_keys` runs (every 86400 seconds via @periodic decorator)\nStep 2: Line 196-198: Database key fails `bytes.fromhex()`, silently suppressed\nStep 3: Original invalid string passed to `check_key()` at line 201\nStep 4: `check_key()` likely fails (returns False or raises)\nStep 5: Line 206: Dataset name added to `to_remove` list\nStep 6: Line 212: Corrupted key deleted from database permanently", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "id": "f_002", - "line_end": 206, - "line_start": 196, - "score": 0.28, - "severity": "important", - "suggestion": "Instead of silently suppressing the error and potentially deleting corrupted keys:\n1. Log an explicit error when hex conversion fails, including the dataset name\n2. Do NOT remove keys that fail hex conversion - they might be recoverable\n3. Consider adding a validation check when keys are INSERTED/UPDATED in the database to prevent invalid hex from being stored in the first place", - "tags": [ - "error-handling", - "data-loss", - "encryption", - "hex-conversion", - "periodic-job" - ], - "title": "Malformed hex keys in database cause unnecessary key removal during sync" - }, - { - "active_multipliers": [ - "cross_ref_compound" - ], - "body": "The `unlock()` method in `dataset_encryption_lock.py` directly calls `load_key()` at line 222 without first calling `check_key()` to validate the key. 
While this avoids a TOCTOU race between check and load (since there's no check), it means that invalid keys will only be discovered during the load attempt, potentially leaving the dataset in a partially processed state.\n\nThe current implementation catches `ZFSException` and handles `EZFS_CRYPTOFAILED` as 'Invalid Key', which is correct. However, the investigation prompt suggested looking for `check_key()` followed by `load_key()` patterns. In this file, no such pattern exists\u2014the code correctly avoids the TOCTOU by not checking before loading.\n\nThe job lock at line 93 (`@job(lock=lambda args: f'dataset_unlock_{args[0]}')`) provides some serialization for unlock operations targeting the same dataset, but different datasets can still be unlocked concurrently, and the ZFS resource operations themselves are not protected by this high-level lock.", - "confidence": 0.6, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "race_condition_check_load_key", - "dimension_name": "TOCTOU Race Between check_key() and load_key() Operations", - "evidence": "Step 1: `unlock()` job acquires lock for specific dataset ID at line 93.\nStep 2: At line 222, `load_key(tls, name, key=datasets[name]['key'])` is called directly.\nStep 3: No `check_key()` call precedes this load operation.\nStep 4: Lines 223-231 catch exceptions from the load operation.\nObservation: The code correctly avoids TOCTOU by not separating validation from action, though this means error feedback is only available after attempting the operation.", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "id": "f_022", - "line_end": 231, - "line_start": 221, - "score": 0.27, - "severity": "suggestion", - "suggestion": "The current approach of loading directly and catching exceptions is actually safer than check-then-load. No change needed unless you want to add pre-validation for better error messages. 
If pre-validation is added, ensure it's understood that the validation result could be stale by the time load is called.", - "tags": [ - "race-condition", - "zfs", - "encryption", - "validation" - ], - "title": "Missing Key Validation Before Load in unlock()" - }, - { - "active_multipliers": [ - "cross_ref_compound" - ], - "body": "In `pull_zfs_keys()` at lines 107-111, `check_key()` is used to determine if an in-memory key is valid for a dataset. If valid, the key is used for database updates (line 120) but NOT for loading into ZFS.\n\nThe validation at line 109 confirms the key can unlock the dataset at that moment, but the actual use of the key is for database operations (line 120: `update_data = {'encryption_key': key, 'kmip_uid': None}`). This is appropriate usage because:\n1. No `load_key()` follows the `check_key()`\n2. The database update doesn't depend on the current ZFS state\n\nHowever, the check validates against current ZFS state, which could change before any future unlock operation. This is a minor concern about validation staleness rather than a TOCTOU race.", - "confidence": 0.6, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "race_condition_check_load_key", - "dimension_name": "TOCTOU Race Between check_key() and load_key() Operations", - "evidence": "Step 1: At line 109, `check_key(tls, ds['name'], key=self.zfs_keys[ds['name']])` validates the in-memory key.\nStep 2: If True, line 111 assigns the key to a local variable.\nStep 3: Lines 119-121 use this key to update the database, not to load into ZFS.\nStep 4: No `load_key()` call exists in this code path.\nObservation: The check is used to select a key source, not to validate before an action.", - "file_path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "id": "f_024", - "line_end": 111, - "line_start": 107, - "score": 0.27, - "severity": "suggestion", - "suggestion": "No immediate fix needed. 
The `check_key()` usage here is for determining which key source to use (in-memory vs KMIP vs database). The validation result staleness is acceptable because the key will be validated again when actually used for unlocking. Consider adding a comment explaining that this is a point-in-time validation.", - "tags": [ - "race-condition", - "kmip", - "zfs", - "validation" - ], - "title": "Staleness of check_key() Result in pull_zfs_keys" - }, - { - "active_multipliers": [], - "body": "The default `pbkdf2iters` was increased from 350,000 to 1,300,000 (3.7x increase). This is a security improvement against brute force attacks, but it will significantly increase unlock times for passphrase-encrypted datasets. Users with passphrase-encrypted pools will experience ~3-4x longer unlock times without warning. This could impact system boot time for encrypted pools, dataset unlock operations, and user experience for large-scale deployments. Consider adding a release note or documentation about this performance trade-off.", - "confidence": 0.75, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "cluster_0", - "dimension_name": "Root cluster coverage gap review", - "evidence": "Step 1: Previous API versions (v25_10_2) had default=350000, ge=100000. Step 2: New v26_0_0 has default=1300000, ge=1300000. Step 3: PBKDF2 iterations directly correlate with unlock time - higher iterations = slower unlock. Step 4: Users upgrading to v26 who had passphrase-encrypted pools will see significantly longer unlock times without any warning.", - "file_path": "src/middlewared/middlewared/api/v26_0_0/pool.py", - "id": "f_034", - "line_end": 139, - "line_start": 139, - "score": 0.225, - "severity": "suggestion", - "suggestion": "Add documentation or release notes warning users about increased unlock times for passphrase-encrypted datasets. 
Consider allowing users to explicitly set a lower value if they understand the security trade-offs (the ge=1300000 constraint currently prevents this).", - "tags": [ - "performance", - "user-experience", - "security" - ], - "title": "Significant performance impact from increased PBKDF2 iterations" - }, - { - "active_multipliers": [], - "body": "The `from_previous` classmethod in `PoolCreateEncryptionOptions` accesses `value['pbkdf2iters']` without first checking if the key exists. While this may work in normal API flows where pydantic populates defaults before migration, it's a fragile pattern that could cause a `KeyError` if called with incomplete data during API version transitions or internal usage. The method should use `.get()` with a default value or check key existence before accessing it.", - "confidence": 0.65, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "cluster_0", - "dimension_name": "Root cluster coverage gap review", - "evidence": "Step 1: `from_previous` is called during API version migrations to convert data from previous API versions. Step 2: The method directly accesses `value['pbkdf2iters']` at line 153 without checking key existence. Step 3: If the input dict lacks this key (e.g., from malformed client data or internal calls), a KeyError will be raised. 
Step 4: This causes an unhandled exception instead of graceful migration.", - "file_path": "src/middlewared/middlewared/api/v26_0_0/pool.py", - "id": "f_032", - "line_end": 154, - "line_start": 151, - "score": 0.195, - "severity": "suggestion", - "suggestion": "Change `value['pbkdf2iters']` to `value.get('pbkdf2iters', 1300000)` to safely handle cases where the key might not be present.", - "tags": [ - "defensive-coding", - "api-migration", - "backward-compatibility" - ], - "title": "Missing key existence check in from_previous migration method" - }, - { - "active_multipliers": [], - "body": "Same issue as in pool.py - the `from_previous` method in `PoolDatasetChangeKeyOptions` accesses `value['pbkdf2iters']` without checking if the key exists first. This could cause a `KeyError` in edge cases during API version migrations.", - "confidence": 0.65, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "cluster_0", - "dimension_name": "Root cluster coverage gap review", - "evidence": "Step 1: The `from_previous` method is designed to migrate data from previous API versions. Step 2: Line 185 directly accesses dictionary key without existence check. Step 3: While pydantic typically populates defaults, internal calls or edge cases could omit this key. Step 4: This results in KeyError instead of graceful handling.", - "file_path": "src/middlewared/middlewared/api/v26_0_0/pool_dataset.py", - "id": "f_033", - "line_end": 186, - "line_start": 183, - "score": 0.195, - "severity": "suggestion", - "suggestion": "Use `value.get('pbkdf2iters', 1300000)` instead of `value['pbkdf2iters']` to safely handle missing keys.", - "tags": [ - "defensive-coding", - "api-migration", - "backward-compatibility" - ], - "title": "Missing key existence check in PoolDatasetChangeKeyOptions.from_previous" - }, - { - "active_multipliers": [], - "body": "In `push_zfs_keys()` at lines 65-76, `check_key()` is called to validate an in-memory key. 
If the check passes, the code continues to the next iteration (line 69). If it fails, the code attempts to retrieve the key from KMIP.\n\nWhile there's no `load_key()` call immediately following the `check_key()` in this specific code path, there is a logical issue: the `check_key()` validates the key against the ZFS dataset's current state, but by the time the key is used (potentially later in the same method or by other callers), the dataset state may have changed. The validation result has a limited time window of validity.\n\nHowever, this is not a TOCTOU race in the traditional sense because no action is taken based on the check result other than skipping to the next dataset. The investigation prompt asked about `check_key()` followed by `load_key()` patterns\u2014this file does not contain such a pattern.", - "confidence": 0.6, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "race_condition_check_load_key", - "dimension_name": "TOCTOU Race Between check_key() and load_key() Operations", - "evidence": "Step 1: At line 67, `check_key(tls, ds['name'], key=self.zfs_keys[ds['name']])` is called.\nStep 2: If True, the code executes `continue` at line 69 and proceeds to the next dataset.\nStep 3: If False or exception, lines 71-76 retrieve and store the key from KMIP.\nObservation: No `load_key()` follows the `check_key()` call. The check is used for decision-making, not for validating before an action.", - "file_path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "id": "f_023", - "line_end": 76, - "line_start": 65, - "score": 0.18, - "severity": "suggestion", - "suggestion": "The usage of `check_key()` here is appropriate for determining whether to retrieve a key from KMIP. However, be aware that the validation result represents a point-in-time check and may not reflect the state when the key is actually used. 
Consider documenting this behavior or adding comments about the temporal nature of the validation.", - "tags": [ - "race-condition", - "kmip", - "zfs", - "validation" - ], - "title": "Key Validation Without Subsequent Load in push_zfs_keys" - } - ], - "metadata": { - "agent_invocations": 20, - "anatomy": { - "blast_radius": [], - "clusters": [ - { - "description": "", - "files": [ - "" - ], - "id": "cluster_0", - "name": "root", - "primary_language": "" - }, - { - "description": "", - "files": [ - "src/middlewared/middlewared/api/v26_0_0/pool.py", - "src/middlewared/middlewared/api/v26_0_0/pool_dataset.py" - ], - "id": "cluster_1", - "name": "src/middlewared/middlewared/api/v26_0_0", - "primary_language": "python" - }, - { - "description": "", - "files": [ - "src/middlewared/middlewared/plugins/kmip/zfs_keys.py" - ], - "id": "cluster_2", - "name": "src/middlewared/middlewared/plugins/kmip", - "primary_language": "python" - }, - { - "description": "", - "files": [ - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py" - ], - "id": "cluster_3", - "name": "src/middlewared/middlewared/plugins/pool_", - "primary_language": "python" - }, - { - "description": "", - "files": [ - "src/middlewared/middlewared/plugins/zfs/encryption.py", - "src/middlewared/middlewared/plugins/zfs/exceptions.py" - ], - "id": "cluster_4", - "name": "src/middlewared/middlewared/plugins/zfs", - "primary_language": "python" - } - ], - "context_notes": "This PR is part of a larger migration from py-libzfs to truenas_pylibzfs. The new encryption.py module follows the pattern established by other _impl.py files in the zfs/ directory (destroy_impl.py, load_unload_impl.py, etc.). 
The use of @pass_thread_local_storage is consistent with the new architecture where ZFS operations are performed directly in the main process using thread-local libzfs handles rather than being dispatched to a process pool. The change increases PBKDF2 iterations which aligns with current security best practices (OWASP recommends 600k+ iterations for PBKDF2).", - "dependency_graph": {}, - "files": [ - { - "hunks": [ - { - "content": " key.\"\"\"\n generate_key: bool = False\n \"\"\"Automatically generate the key to be used for dataset encryption.\"\"\"\n- pbkdf2iters: int = Field(ge=100000, default=350000)\n+ pbkdf2iters: int = Field(ge=1300000, default=1300000)\n \"\"\"Number of PBKDF2 iterations for key derivation from passphrase. Higher iterations improve security \\\n- against brute force attacks but increase unlock time. Default 350,000 balances security and performance.\"\"\"\n+ against brute force attacks but increase unlock time.\"\"\"\n algorithm: Literal[\n \"AES-128-CCM\", \"AES-192-CCM\", \"AES-256-CCM\", \"AES-128-GCM\", \"AES-192-GCM\", \"AES-256-GCM\"\n ] = \"AES-256-GCM\"", - "header": "@@ -136,9 +136,9 @@ class PoolCreateEncryptionOptions(BaseModel):", - "new_count": 9, - "new_start": 136, - "old_count": 9, - "old_start": 136 - }, - { - "content": " key: Secret[Annotated[str, Field(min_length=64, max_length=64)] | None] = None\n \"\"\"A hex-encoded key specified as an alternative to using `passphrase`.\"\"\"\n \n+ @classmethod\n+ def from_previous(cls, value):\n+ value['pbkdf2iters'] = max(1300000, value['pbkdf2iters'])\n+ return value\n+\n \n class PoolCreateTopologyVdevDRAID(BaseModel):\n type: Literal[\"DRAID1\", \"DRAID2\", \"DRAID3\"]", - "header": "@@ -148,6 +148,11 @@ class PoolCreateEncryptionOptions(BaseModel):", - "new_count": 11, - "new_start": 148, - "old_count": 6, - "old_start": 148 - } - ], - "language": "python", - "lines_added": 7, - "lines_removed": 2, - "path": "src/middlewared/middlewared/api/v26_0_0/pool.py", - "status": 
"modified" - }, - { - "hunks": [ - { - "content": " \"\"\"Generate a new random encryption key instead of using a provided key or passphrase.\"\"\"\n key_file: bool = False\n \"\"\"Whether the provided key is from a key file rather than entered directly.\"\"\"\n- pbkdf2iters: int = Field(default=350000, ge=100000)\n+ pbkdf2iters: int = Field(default=1300000, ge=1300000)\n \"\"\"Number of PBKDF2 iterations for passphrase-based keys. Higher values improve security against \\\n- brute force attacks but increase unlock time. Default 350,000 balances security and performance.\"\"\"\n+ brute force attacks but increase unlock time.\"\"\"\n passphrase: Secret[NonEmptyString | None] = None\n \"\"\"Passphrase to use for encryption key derivation.\"\"\"\n key: Secret[Annotated[str, Field(min_length=64, max_length=64)] | None] = None\n \"\"\"Raw hex-encoded encryption key.\"\"\"\n \n+ @classmethod\n+ def from_previous(cls, value):\n+ value['pbkdf2iters'] = max(1300000, value['pbkdf2iters'])\n+ return value\n+\n \n class PoolDatasetCreateUserProperty(BaseModel):\n key: Annotated[str, Field(examples=[\"custom:backup_policy\", \"org:created_by\"], pattern=\".*:.*\")]", - "header": "@@ -172,14 +172,19 @@ class PoolDatasetChangeKeyOptions(BaseModel):", - "new_count": 19, - "new_start": 172, - "old_count": 14, - "old_start": 172 - } - ], - "language": "python", - "lines_added": 7, - "lines_removed": 2, - "path": "src/middlewared/middlewared/api/v26_0_0/pool_dataset.py", - "status": "modified" - }, - { - "hunks": [ - { - "content": " # See the file LICENSE.IX for complete terms and conditions\n \n from middlewared.api.current import ZFSResourceQuery\n+from middlewared.plugins.zfs.encryption import check_key\n from middlewared.service import job, private, Service\n+from middlewared.service.decorators import pass_thread_local_storage\n \n from .connection import KMIPServerMixin\n ", - "header": "@@ -4,7 +4,9 @@", - "new_count": 9, - "new_start": 4, - "old_count": 7, - "old_start": 4 - 
}, - { - "content": " return rv\n \n @private\n- def push_zfs_keys(self, ids=None):\n+ @pass_thread_local_storage\n+ def push_zfs_keys(self, tls, ids=None):\n failed = []\n filters = [] if ids is None else [['id', 'in', ids]]\n existing_datasets = self.get_encrypted_datasets(filters)", - "header": "@@ -50,7 +52,8 @@ def get_encrypted_datasets(self, filters):", - "new_count": 8, - "new_start": 52, - "old_count": 7, - "old_start": 50 - }, - { - "content": " if not ds['encryption_key']:\n # We want to make sure we have the KMIP server's keys and in-memory keys in sync\n try:\n- if ds['name'] in self.zfs_keys and self.middleware.call_sync(\n- 'zfs.dataset.check_key', ds['name'], {'key': self.zfs_keys[ds['name']]}\n+ if (\n+ ds['name'] in self.zfs_keys\n+ and check_key(tls, ds['name'], key=self.zfs_keys[ds['name']])\n ):\n continue\n else:", - "header": "@@ -59,8 +62,9 @@ def push_zfs_keys(self, ids=None):", - "new_count": 9, - "new_start": 62, - "old_count": 8, - "old_start": 59 - }, - { - "content": " return failed\n \n @private\n- def pull_zfs_keys(self):\n+ @pass_thread_local_storage\n+ def pull_zfs_keys(self, tls):\n existing_datasets = self.get_encrypted_datasets([['kmip_uid', '!=', None]])\n failed = []\n connection_successful = self.middleware.call_sync('kmip.test_connection')", - "header": "@@ -91,7 +95,8 @@ def push_zfs_keys(self, ids=None):", - "new_count": 8, - "new_start": 95, - "old_count": 7, - "old_start": 91 - }, - { - "content": " try:\n if ds['encryption_key']:\n key = ds['encryption_key']\n- elif ds['name'] in self.zfs_keys and self.middleware.call_sync(\n- 'zfs.dataset.check_key', ds['name'], {'key': self.zfs_keys[ds['name']]}\n+ elif (\n+ ds['name'] in self.zfs_keys\n+ and check_key(tls, ds['name'], key=self.zfs_keys[ds['name']])\n ):\n key = self.zfs_keys[ds['name']]\n elif connection_successful:", - "header": "@@ -99,8 +104,9 @@ def pull_zfs_keys(self):", - "new_count": 9, - "new_start": 104, - "old_count": 8, - "old_start": 99 - }, - { - 
"content": " return failed\n \n @private\n+ @pass_thread_local_storage\n @job(lock=lambda args: f'kmip_sync_zfs_keys_{args}')\n- def sync_zfs_keys(self, job, ids=None):\n+ def sync_zfs_keys(self, job, tls, ids=None):\n if not self.middleware.call_sync('kmip.zfs_keys_pending_sync'):\n return\n config = self.middleware.call_sync('kmip.config')\n conn_successful = self.middleware.call_sync('kmip.test_connection', None, True)\n if config['enabled'] and config['manage_zfs_keys']:\n if conn_successful:\n- failed = self.push_zfs_keys(ids)\n+ failed = self.push_zfs_keys(tls, ids) # type: ignore\n else:\n return\n else:\n- failed = self.pull_zfs_keys()\n+ failed = self.pull_zfs_keys(tls) # type: ignore\n if failed:\n self.middleware.call_sync(\n 'alert.oneshot_create', 'KMIPZFSDatasetsSyncFailure', {'datasets': ','.join(failed)}", - "header": "@@ -120,19 +126,20 @@ def pull_zfs_keys(self):", - "new_count": 20, - "new_start": 126, - "old_count": 19, - "old_start": 120 - } - ], - "language": "python", - "lines_added": 16, - "lines_removed": 9, - "path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "status": "modified" - }, - { - "hunks": [ - { - "content": " from middlewared.service.decorators import pass_thread_local_storage\n from middlewared.utils.filter_list import filter_list\n from middlewared.plugins.pool_.utils import get_dataset_parents\n+from middlewared.plugins.zfs.encryption import check_key\n \n from .utils import DATASET_DATABASE_MODEL_NAME, dataset_can_be_mounted, retrieve_keys_from_file, ZFSKeyFormat\n ", - "header": "@@ -18,6 +18,7 @@", - "new_count": 7, - "new_start": 18, - "old_count": 6, - "old_start": 18 - }, - { - "content": " namespace = 'pool.dataset'\n \n @api_method(PoolDatasetEncryptionSummaryArgs, PoolDatasetEncryptionSummaryResult, roles=['DATASET_READ'])\n+ @pass_thread_local_storage\n @job(lock=lambda args: f'encryption_summary_options_{args[0]}', pipes=['input'], check_pipes=False)\n- def encryption_summary(self, job, id_, 
options):\n+ def encryption_summary(self, job, tls, id_, options):\n \"\"\"\n Retrieve summary of all encrypted roots under `id`.\n ", - "header": "@@ -28,8 +29,9 @@ class Config:", - "new_count": 9, - "new_start": 29, - "old_count": 8, - "old_start": 28 - }, - { - "content": " verrors.check()\n datasets = self.query_encrypted_datasets(id_, {'all': True})\n \n- to_check = []\n+ results = []\n for name, ds in datasets.items():\n ds_key = keys_supplied.get(name, {}).get('key') or ds['encryption_key']\n if ZFSKeyFormat(ds['key_format']['value']) == ZFSKeyFormat.RAW and ds_key:\n with contextlib.suppress(ValueError):\n ds_key = bytes.fromhex(ds_key)\n- to_check.append((name, {'key': ds_key}))\n \n- check_job = self.middleware.call_sync('zfs.dataset.bulk_process', 'check_key', to_check)\n- check_job.wait_sync()\n- if check_job.error:\n- raise CallError(f'Failed to retrieve encryption summary for {id_}: {check_job.error}')\n+ try:\n+ valid_key = check_key(tls, name, key=ds_key)\n+ except Exception:\n+ valid_key = False\n \n- results = []\n- for ds_data, status in zip(to_check, check_job.result):\n- ds_name = ds_data[0]\n- data = datasets[ds_name]\n results.append({\n- 'name': ds_name,\n- 'key_format': ZFSKeyFormat(data['key_format']['value']).value,\n- 'key_present_in_database': bool(data['encryption_key']),\n- 'valid_key': bool(status['result']), 'locked': data['locked'],\n+ 'name': name,\n+ 'key_format': ZFSKeyFormat(ds['key_format']['value']).value,\n+ 'key_present_in_database': bool(ds['encryption_key']),\n+ 'valid_key': valid_key,\n+ 'locked': ds['locked'],\n 'unlock_error': None,\n 'unlock_successful': False,\n })\n \n failed = set()\n for ds in sorted(results, key=lambda d: d['name'].count('/')):\n- for i in range(1, ds['name'].count('/') + 1):\n- check = ds['name'].rsplit('/', i)[0]\n+ ds_name = ds['name']\n+ for i in range(1, ds_name.count('/') + 1):\n+ check = ds_name.rsplit('/', i)[0]\n if check in failed:\n- failed.add(ds['name'])\n+ failed.add(ds_name)\n 
ds['unlock_error'] = f'Child cannot be unlocked when parent \"{check}\" is locked'\n \n- if ds['locked'] and not options['force'] and not keys_supplied.get(ds['name'], {}).get('force'):\n- err = dataset_can_be_mounted(ds['name'], os.path.join('/mnt', ds['name']))\n+ ds_locked = ds['locked']\n+ if ds_locked and not options['force'] and not keys_supplied.get(ds_name, {}).get('force'):\n+ err = dataset_can_be_mounted(ds_name, os.path.join('/mnt', ds_name))\n if ds['unlock_error'] and err:\n ds['unlock_error'] += f' and {err}'\n elif err:", - "header": "@@ -94,42 +96,40 @@ def encryption_summary(self, job, id_, options):", - "new_count": 40, - "new_start": 96, - "old_count": 42, - "old_start": 94 - }, - { - "content": " \n if ds['valid_key']:\n ds['unlock_successful'] = not bool(ds['unlock_error'])\n- elif not ds['locked']:\n+ elif not ds_locked:\n # For datasets which are already not locked, unlock operation for them\n # will succeed as they are not locked\n ds['unlock_successful'] = True\n else:\n- key_provided = ds['name'] in keys_supplied or ds['key_present_in_database']\n+ key_provided = ds_name in keys_supplied or ds['key_present_in_database']\n if key_provided:\n if ds['unlock_error']:\n- if ds['name'] in keys_supplied or ds['key_present_in_database']:\n+ if ds_name in keys_supplied or ds['key_present_in_database']:\n ds['unlock_error'] += ' and provided key is invalid'\n else:\n ds['unlock_error'] = 'Provided key is invalid'\n elif not ds['unlock_error']:\n ds['unlock_error'] = 'Key not provided'\n- failed.add(ds['name'])\n+ failed.add(ds_name)\n \n return results\n \n @periodic(86400)\n @private\n+ @pass_thread_local_storage\n @job(lock=lambda args: f'sync_encrypted_pool_dataset_keys_{args}')\n- def sync_db_keys(self, job, name=None):\n+ def sync_db_keys(self, job, tls, name=None):\n if not self.middleware.call_sync('failover.is_single_master_node'):\n # We don't want to do this for passive controller\n return", - "header": "@@ -137,28 +137,29 @@ def 
encryption_summary(self, job, id_, options):", - "new_count": 29, - "new_start": 137, - "old_count": 28, - "old_start": 137 - }, - { - "content": " # It is possible we have a pool configured but for some mistake/reason the pool did not import like\n # during repair disks were not plugged in and system was booted, in such cases we would like to not\n # remove the encryption keys from the database.\n- for root_ds in {pool['name'] for pool in self.middleware.call_sync('pool.query')} - {\n- ds['id'] for ds in self.middleware.call_sync(\n+ pool_names = {pool['name'] for pool in self.middleware.call_sync('pool.query')}\n+ ds_names = {\n+ ds['id']\n+ for ds in self.middleware.call_sync(\n 'pool.dataset.query', [], {'extra': {'retrieve_children': False, 'properties': []}}\n )\n- }:\n+ }\n+ for root_ds in pool_names - ds_names:\n filters.extend([['name', '!=', root_ds], ['name', '!^', f'{root_ds}/']])\n \n db_datasets = self.query_encrypted_roots_keys(filters)\n encrypted_roots = {\n- d['name']: d for d in self.middleware.call_sync(\n- 'pool.dataset.query', filters, {'extra': {'properties': ['encryptionroot']}}\n- ) if d['name'] == d['encryption_root']\n+ d['name']: d\n+ for d in self.middleware.call_sync(\n+ 'pool.dataset.query',\n+ filters,\n+ {'extra': {'properties': ['encryptionroot', 'keyformat']}}\n+ )\n+ if d['name'] == d['encryption_root']\n }\n+\n to_remove = []\n- check_key_job = self.middleware.call_sync('zfs.dataset.bulk_process', 'check_key', [\n- (name, {'key': db_datasets[name]}) for name in db_datasets\n- ])\n- check_key_job.wait_sync()\n- if check_key_job.error:\n- self.logger.error(f'Failed to sync database keys: {check_key_job.error}')\n+ try:\n+ for ds_name, key in db_datasets.items():\n+ ds = encrypted_roots.get(ds_name)\n+ if ds and ZFSKeyFormat(ds['key_format']['value']) == ZFSKeyFormat.RAW and key:\n+ with contextlib.suppress(ValueError):\n+ key = bytes.fromhex(key)\n+\n+ try:\n+ should_remove = not check_key(tls, ds_name, key=key)\n+ except 
Exception:\n+ should_remove = True\n+\n+ if should_remove:\n+ to_remove.append(ds_name)\n+\n+ except Exception as exc:\n+ self.logger.error(f'Failed to sync database keys: {exc}')\n return\n \n- for dataset, status in zip(db_datasets, check_key_job.result):\n- if not status['result']:\n- to_remove.append(dataset)\n- elif status['error']:\n- if dataset not in encrypted_roots:\n- to_remove.append(dataset)\n- else:\n- self.logger.error(f'Failed to check encryption status for {dataset}: {status[\"error\"]}')\n-\n self.middleware.call_sync('pool.dataset.delete_encrypted_datasets_from_db', [['name', 'in', to_remove]])\n \n @private", - "header": "@@ -167,37 +168,47 @@ def sync_db_keys(self, job, name=None):", - "new_count": 47, - "new_start": 168, - "old_count": 37, - "old_start": 167 - } - ], - "language": "python", - "lines_added": 57, - "lines_removed": 46, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "status": "modified" - }, - { - "hunks": [ - { - "content": " from datetime import datetime\n from pathlib import Path\n \n+from truenas_pylibzfs import ZFSError, ZFSException\n+\n from middlewared.api import api_method\n from middlewared.api.current import (\n PoolDatasetLockArgs, PoolDatasetLockResult, PoolDatasetUnlockArgs, PoolDatasetUnlockResult\n )\n+from middlewared.plugins.zfs.encryption import load_key\n from middlewared.service import CallError, job, private, Service, ValidationErrors\n+from middlewared.service.decorators import pass_thread_local_storage\n from middlewared.utils.filesystem.directory import directory_is_empty\n \n from .utils import (", - "header": "@@ -6,11 +6,15 @@", - "new_count": 15, - "new_start": 6, - "old_count": 11, - "old_start": 6 - }, - { - "content": " return True\n \n @api_method(PoolDatasetUnlockArgs, PoolDatasetUnlockResult, roles=['DATASET_WRITE'])\n+ @pass_thread_local_storage\n @job(lock=lambda args: f'dataset_unlock_{args[0]}', pipes=['input'], check_pipes=False)\n- def unlock(self, job, 
id_, options):\n+ def unlock(self, job, tls, id_, options):\n \"\"\"\n Unlock dataset `id` (and its children if `unlock_options.recursive` is `true`).\n ", - "header": "@@ -85,8 +89,9 @@ async def lock(self, job, id_, options):", - "new_count": 9, - "new_start": 89, - "old_count": 8, - "old_start": 85 - }, - { - "content": " \n job.set_progress(int(name_i / len(names) * 90 + 0.5), f'Unlocking {name!r}')\n try:\n- self.middleware.call_sync(\n- 'zfs.dataset.load_key', name, {'key': datasets[name]['key'], 'mount': False}\n- )\n- except CallError as e:\n- failed[name]['error'] = 'Invalid Key' if 'incorrect key provided' in str(e).lower() else str(e)\n+ load_key(tls, name, key=datasets[name]['key'])\n+ except ZFSException as e:\n+ if e.code == ZFSError.EZFS_CRYPTOFAILED:\n+ failed[name]['error'] = 'Invalid Key'\n+ else:\n+ failed[name]['error'] = str(e)\n+ continue\n+ except Exception as e:\n+ failed[name]['error'] = str(e)\n continue\n \n # Before we mount the dataset in question, we should ensure that the path where it will be mounted", - "header": "@@ -214,11 +219,15 @@ def unlock(self, job, id_, options):", - "new_count": 15, - "new_start": 219, - "old_count": 11, - "old_start": 214 - } - ], - "language": "python", - "lines_added": 15, - "lines_removed": 6, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "status": "modified" - }, - { - "hunks": [ - { - "content": " PoolDatasetChangeKeyArgs, PoolDatasetChangeKeyResult, PoolDatasetInheritParentEncryptionPropertiesArgs,\n PoolDatasetInheritParentEncryptionPropertiesResult\n )\n+from middlewared.plugins.zfs.encryption import change_encryption_root, change_key\n from middlewared.service import CallError, job, private, Service, ValidationErrors\n+from middlewared.service.decorators import pass_thread_local_storage\n from middlewared.utils import secrets\n \n from .utils import DATASET_DATABASE_MODEL_NAME, ZFSKeyFormat", - "header": "@@ -4,7 +4,9 @@", - "new_count": 9, - "new_start": 4, 
- "old_count": 7, - "old_start": 4 - }, - { - "content": " PoolDatasetInsertOrUpdateEncryptedRecordResult,\n roles=['DATASET_WRITE']\n )\n- async def insert_or_update_encrypted_record(self, data):\n+ def insert_or_update_encrypted_record(self, data):\n key_format = data.pop('key_format') or ZFSKeyFormat.PASSPHRASE.value\n if not data['encryption_key'] or ZFSKeyFormat(key_format.upper()) == ZFSKeyFormat.PASSPHRASE:\n # We do not want to save passphrase keys - they are only known to the user\n return\n \n ds_id = data.pop('id')\n- ds = await self.middleware.call(\n+ ds = self.middleware.call_sync(\n 'datastore.query', DATASET_DATABASE_MODEL_NAME,\n [['id', '=', ds_id]] if ds_id else [['name', '=', data['name']]]\n )", - "header": "@@ -21,14 +23,14 @@ class Config:", - "new_count": 14, - "new_start": 23, - "old_count": 14, - "old_start": 21 - }, - { - "content": " \n pk = ds[0]['id'] if ds else None\n if ds:\n- await self.middleware.call(\n+ self.middleware.call_sync(\n 'datastore.update',\n DATASET_DATABASE_MODEL_NAME,\n ds[0]['id'], data\n )\n else:\n- pk = await self.middleware.call(\n+ pk = self.middleware.call_sync(\n 'datastore.insert',\n DATASET_DATABASE_MODEL_NAME,\n data\n )\n \n- kmip_config = await self.middleware.call('kmip.config')\n+ kmip_config = self.middleware.call_sync('kmip.config')\n if kmip_config['enabled'] and kmip_config['manage_zfs_keys']:\n- await self.middleware.call('kmip.sync_zfs_keys', [pk])\n+ self.middleware.call_sync('kmip.sync_zfs_keys', [pk])\n \n return pk\n ", - "header": "@@ -37,21 +39,21 @@ async def insert_or_update_encrypted_record(self, data):", - "new_count": 21, - "new_start": 39, - "old_count": 21, - "old_start": 37 - }, - { - "content": " return opts\n \n @api_method(PoolDatasetChangeKeyArgs, PoolDatasetChangeKeyResult, roles=['DATASET_WRITE'])\n+ @pass_thread_local_storage\n @job(lock=lambda args: f'dataset_change_key_{args[0]}', pipes=['input'], check_pipes=False)\n- async def change_key(self, job, id_, options):\n+ def 
change_key(self, job, tls, id_, options):\n \"\"\"\n Change encryption properties for `id` encrypted dataset.\n ", - "header": "@@ -114,8 +116,9 @@ def validate_encryption_data(self, job, verrors, encryption_dict, schema):", - "new_count": 9, - "new_start": 116, - "old_count": 8, - "old_start": 114 - }, - { - "content": " 1) It has encrypted roots as children which are encrypted with a key\n 2) If it is a root dataset where the system dataset is located\n \"\"\"\n- ds = await self.middleware.call('pool.dataset.get_instance_quick', id_, {\n+ ds = self.middleware.call_sync('pool.dataset.get_instance_quick', id_, {\n 'encryption': True,\n })\n verrors = ValidationErrors()", - "header": "@@ -124,7 +127,7 @@ async def change_key(self, job, id_, options):", - "new_count": 7, - "new_start": 127, - "old_count": 7, - "old_start": 124 - }, - { - "content": " )\n elif any(\n d['name'] == d['encryption_root']\n- for d in await self.middleware.call(\n+ for d in self.middleware.call_sync(\n 'pool.dataset.query', [\n ['id', '^', f'{id_}/'], ['encrypted', '=', True],\n ['key_format.value', '!=', ZFSKeyFormat.PASSPHRASE.value]", - "header": "@@ -142,7 +145,7 @@ async def change_key(self, job, id_, options):", - "new_count": 7, - "new_start": 145, - "old_count": 7, - "old_start": 142 - }, - { - "content": " f'{id_} has children which are encrypted with a key. It is not allowed to have encrypted '\n 'roots which are encrypted with a key as children for passphrase encrypted datasets.'\n )\n- elif id_ == (await self.middleware.call('systemdataset.config'))['pool']:\n+ elif id_ == self.middleware.call_sync('systemdataset.config')['pool']:\n verrors.add(\n 'id',\n f'{id_} contains the system dataset. 
Please move the system dataset to a '", - "header": "@@ -154,7 +157,7 @@ async def change_key(self, job, id_, options):", - "new_count": 7, - "new_start": 157, - "old_count": 7, - "old_start": 154 - }, - { - "content": " f'change_key_options.{k}',\n 'Either Key or passphrase must be provided.'\n )\n- elif id_.count('/') and await self.middleware.call(\n+ elif id_.count('/') and self.middleware.call_sync(\n 'pool.dataset.query', [\n ['id', 'in', [id_.rsplit('/', i)[0] for i in range(1, id_.count('/') + 1)]],\n ['key_format.value', '=', ZFSKeyFormat.PASSPHRASE.value], ['encrypted', '=', True]", - "header": "@@ -167,7 +170,7 @@ async def change_key(self, job, id_, options):", - "new_count": 7, - "new_start": 170, - "old_count": 7, - "old_start": 167 - }, - { - "content": " \n verrors.check()\n \n- encryption_dict = await self.middleware.call(\n+ encryption_dict = self.middleware.call_sync(\n 'pool.dataset.validate_encryption_data', job, verrors, {\n 'enabled': True, 'passphrase': options['passphrase'],\n 'generate_key': options['generate_key'], 'key_file': options['key_file'],", - "header": "@@ -181,7 +184,7 @@ async def change_key(self, job, id_, options):", - "new_count": 7, - "new_start": 184, - "old_count": 7, - "old_start": 181 - }, - { - "content": " encryption_dict.pop('encryption')\n key = encryption_dict.pop('key')\n \n- await self.middleware.call(\n- 'zfs.dataset.change_key', id_, {\n- 'encryption_properties': encryption_dict,\n- 'key': key, 'load_key': False,\n- }\n- )\n+ change_key(tls, id_, encryption_dict, key)\n \n # TODO: Handle renames of datasets appropriately wrt encryption roots and db - this will be done when\n # devd changes are in from the OS end\n data = {'encryption_key': key, 'key_format': 'PASSPHRASE' if options['passphrase'] else 'HEX', 'name': id_}\n- await self.insert_or_update_encrypted_record(data)\n+ self.insert_or_update_encrypted_record(data)\n if options['passphrase'] and ZFSKeyFormat(ds['key_format']['value']) != 
ZFSKeyFormat.PASSPHRASE:\n- await self.middleware.call('pool.dataset.sync_db_keys', id_)\n+ self.middleware.call_sync('pool.dataset.sync_db_keys', id_)\n \n data['old_key_format'] = ds['key_format']['value']\n- await self.middleware.call_hook('dataset.change_key', data)\n+ self.middleware.call_hook_sync('dataset.change_key', data)\n \n @api_method(\n PoolDatasetInheritParentEncryptionPropertiesArgs,\n PoolDatasetInheritParentEncryptionPropertiesResult,\n roles=['DATASET_WRITE']\n )\n- async def inherit_parent_encryption_properties(self, id_):\n+ @pass_thread_local_storage\n+ def inherit_parent_encryption_properties(self, tls, id_):\n \"\"\"\n Allows inheriting parent's encryption root discarding its current encryption settings. This\n can only be done where `id` has an encrypted parent and `id` itself is an encryption root.\n \"\"\"\n- ds = await self.middleware.call('pool.dataset.get_instance_quick', id_, {\n+ ds = self.middleware.call_sync('pool.dataset.get_instance_quick', id_, {\n 'encryption': True,\n })\n if not ds['encrypted']:", - "header": "@@ -194,34 +197,30 @@ async def change_key(self, job, id_, options):", - "new_count": 30, - "new_start": 197, - "old_count": 34, - "old_start": 194 - }, - { - "content": " elif '/' not in id_:\n raise CallError('Root datasets do not have a parent and cannot inherit encryption settings')\n else:\n- parent = await self.middleware.call(\n+ parent = self.middleware.call_sync(\n 'pool.dataset.get_instance_quick', id_.rsplit('/', 1)[0], {\n 'encryption': True,\n }", - "header": "@@ -233,7 +232,7 @@ async def inherit_parent_encryption_properties(self, id_):", - "new_count": 7, - "new_start": 232, - "old_count": 7, - "old_start": 233 - }, - { - "content": " if not parent['encrypted']:\n raise CallError('This operation requires the parent dataset to be encrypted')\n else:\n- parent_encrypted_root = await self.middleware.call(\n+ parent_encrypted_root = self.middleware.call_sync(\n 'pool.dataset.get_instance_quick', 
parent['encryption_root'], {\n 'encryption': True,\n }\n )\n- if ZFSKeyFormat(parent_encrypted_root['key_format']['value']) == ZFSKeyFormat.PASSPHRASE.value:\n+ if parent_encrypted_root['key_format']['value'] == ZFSKeyFormat.PASSPHRASE.value:\n if any(\n d['name'] == d['encryption_root']\n- for d in await self.middleware.call(\n+ for d in self.middleware.call_sync(\n 'pool.dataset.query', [\n ['id', '^', f'{id_}/'], ['encrypted', '=', True],\n ['key_format.value', '!=', ZFSKeyFormat.PASSPHRASE.value]", - "header": "@@ -241,15 +240,15 @@ async def inherit_parent_encryption_properties(self, id_):", - "new_count": 15, - "new_start": 240, - "old_count": 15, - "old_start": 241 - }, - { - "content": " 'roots which are encrypted with a key as children for passphrase encrypted datasets.'\n )\n \n- await self.middleware.call('zfs.dataset.change_encryption_root', id_, {'load_key': False})\n- await self.middleware.call('pool.dataset.sync_db_keys', id_)\n- await self.middleware.call_hook('dataset.inherit_parent_encryption_root', id_)\n+ change_encryption_root(tls, id_)\n+ self.middleware.call_sync('pool.dataset.sync_db_keys', id_)\n+ self.middleware.call_hook_sync('dataset.inherit_parent_encryption_root', id_)", - "header": "@@ -261,6 +260,6 @@ async def inherit_parent_encryption_properties(self, id_):", - "new_count": 6, - "new_start": 260, - "old_count": 6, - "old_start": 261 - } - ], - "language": "python", - "lines_added": 29, - "lines_removed": 30, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "status": "modified" - }, - { - "hunks": [ - { - "content": "+import threading\n+from typing import Literal, TypedDict, cast\n+\n+from .exceptions import ZFSKeyAlreadyLoadedException, ZFSNotEncryptedException\n+from .utils import open_resource\n+\n+\n+class EncryptionProperties(TypedDict, total=False):\n+ keyformat: Literal['hex', 'passphrase', 'raw']\n+ keylocation: str\n+ pbkdf2iters: int | None\n+\n+\n+def load_key(tls: threading.local, 
dataset: str, **kwargs: str | bytes) -> None:\n+ \"\"\"\n+ Load the encryption key for a ZFS dataset.\n+\n+ Args:\n+ dataset: Name of the ZFS dataset whose key should be loaded.\n+\n+ Keyword Args:\n+ key: Key material as ``str`` (hex/passphrase) or ``bytes`` (raw).\n+ Mutually exclusive with ``key_location``.\n+ key_location: Path to the key file on disk.\n+ Mutually exclusive with ``key``.\n+ \"\"\"\n+ if len(kwargs) > 1:\n+ raise ValueError('Cannot specify both key and key location')\n+ rsrc = open_resource(tls, dataset)\n+ if (crypto := rsrc.crypto()) is None:\n+ raise ZFSNotEncryptedException(dataset)\n+ if crypto.info().key_is_loaded:\n+ raise ZFSKeyAlreadyLoadedException(dataset)\n+ crypto.load_key(**kwargs)\n+\n+\n+def check_key(tls: threading.local, dataset: str, **kwargs: str | bytes) -> bool:\n+ \"\"\"\n+ Return True if ``key`` (or the key at ``key_location``) can unlock ``dataset``.\n+\n+ Does not actually load the key. Raises ZFSNotEncryptedException if the\n+ dataset is not encrypted or if the ZFS operation fails for a reason other\n+ than a wrong key (EZFS_CRYPTOFAILED returns False rather than raising).\n+\n+ Args:\n+ dataset: Name of the ZFS dataset to check.\n+\n+ Keyword Args:\n+ key: Key material as ``str`` (hex/passphrase) or ``bytes`` (raw).\n+ Mutually exclusive with ``key_location``.\n+ key_location: Path to the key file on disk.\n+ Mutually exclusive with ``key``.\n+ \"\"\"\n+ if len(kwargs) > 1:\n+ raise ValueError('Cannot specify both key and key location')\n+ rsrc = open_resource(tls, dataset)\n+ if (crypto := rsrc.crypto()) is None:\n+ raise ZFSNotEncryptedException(dataset)\n+ return crypto.check_key(**kwargs) # type: ignore[no-any-return]\n+\n+\n+def change_key(\n+ tls: threading.local,\n+ dataset: str,\n+ properties: EncryptionProperties | None = None,\n+ key: str | None = None\n+) -> None:\n+ \"\"\"\n+ Change the encryption key and/or properties for ``dataset``.\n+\n+ The dataset's key must already be loaded before calling 
this.\n+\n+ Args:\n+ dataset: Name of the ZFS dataset whose key should be changed.\n+ properties: May contain any combination of keyformat, keylocation, and\n+ pbkdf2iters.\n+ key: New key material. Required when keylocation is not given.\n+ \"\"\"\n+ props = {} if properties is None else cast(dict[str, str | int | None], properties.copy())\n+ if key:\n+ props.pop('keylocation', None)\n+ props['key'] = key\n+ elif 'keylocation' not in props:\n+ raise ValueError('Must specify either key or key location')\n+\n+ rsrc = open_resource(tls, dataset)\n+ if (crypto := rsrc.crypto()) is None:\n+ raise ZFSNotEncryptedException(dataset)\n+ config = tls.lzh.resource_cryptography_config(**props)\n+ crypto.change_key(info=config)\n+\n+\n+def change_encryption_root(tls: threading.local, dataset: str) -> None:\n+ \"\"\"\n+ Make ``dataset`` inherit encryption from its parent, removing it as\n+ an encryption root.\n+\n+ ``dataset`` must currently be an encryption root and its key must be loaded.\n+\n+ Args:\n+ dataset: Name of the ZFS dataset to remove as an encryption root.\n+ \"\"\"\n+ rsrc = open_resource(tls, dataset)\n+ if (crypto := rsrc.crypto()) is None:\n+ raise ZFSNotEncryptedException(dataset)\n+ crypto.inherit_key()", - "header": "@@ -0,0 +1,106 @@", - "new_count": 106, - "new_start": 1, - "old_count": 0, - "old_start": 0 - } - ], - "language": "python", - "lines_added": 106, - "lines_removed": 0, - "path": "src/middlewared/middlewared/plugins/zfs/encryption.py", - "status": "added" - }, - { - "hunks": [ - { - "content": "-from typing import Collection\n+from typing import Iterable\n \n __all__ = (\n+ \"ZFSKeyAlreadyLoadedException\",\n+ \"ZFSNotEncryptedException\",\n \"ZFSPathAlreadyExistsException\",\n \"ZFSPathInvalidException\",\n \"ZFSPathNotASnapshotException\",", - "header": "@@ -1,6 +1,8 @@", - "new_count": 8, - "new_start": 1, - "old_count": 6, - "old_start": 1 - }, - { - "content": " )\n \n \n+class ZFSKeyAlreadyLoadedException(Exception):\n+ def 
__init__(self, path: str):\n+ self.message = f\"{path!r} key is already loaded\"\n+ super().__init__(self.message)\n+\n+\n+class ZFSNotEncryptedException(Exception):\n+ def __init__(self, path: str):\n+ self.message = f\"{path!r} is not encrypted\"\n+ super().__init__(self.message)\n+\n+\n class ZFSPathAlreadyExistsException(Exception):\n def __init__(self, path: str):\n self.message = f\"{path!r} already exists\"", - "header": "@@ -9,6 +11,18 @@", - "new_count": 18, - "new_start": 11, - "old_count": 6, - "old_start": 9 - }, - { - "content": " \n \n class ZFSPathHasClonesException(Exception):\n- def __init__(self, path: str, clones: Collection[str]):\n+ def __init__(self, path: str, clones: Iterable[str]):\n self.path = path\n self.clones = clones\n self.message = f\"{path!r} has the following clones: {','.join(clones)}\"", - "header": "@@ -16,7 +30,7 @@ def __init__(self, path: str):", - "new_count": 7, - "new_start": 30, - "old_count": 7, - "old_start": 16 - }, - { - "content": " \n \n class ZFSPathHasHoldsException(Exception):\n- def __init__(self, path: str, holds: Collection[str]):\n+ def __init__(self, path: str, holds: Iterable[str]):\n self.message = f\"{path!r} has the following holds: {','.join(holds)}\"\n super().__init__(self.message)\n ", - "header": "@@ -24,7 +38,7 @@ def __init__(self, path: str, clones: Collection[str]):", - "new_count": 7, - "new_start": 38, - "old_count": 7, - "old_start": 24 - } - ], - "language": "python", - "lines_added": 17, - "lines_removed": 3, - "path": "src/middlewared/middlewared/plugins/zfs/exceptions.py", - "status": "modified" - }, - { - "hunks": [ - { - "content": "-import libzfs\n-\n-from middlewared.service import CallError, job, Service\n-\n-\n-class ZFSDatasetService(Service):\n-\n- class Config:\n- namespace = 'zfs.dataset'\n- private = True\n- process_pool = True\n-\n- def common_load_dataset_checks(self, id_, ds):\n- self.common_encryption_checks(id_, ds)\n- if ds.key_loaded:\n- raise CallError(f'{id_} key is 
already loaded')\n-\n- def common_encryption_checks(self, id_, ds):\n- if not ds.encrypted:\n- raise CallError(f'{id_} is not encrypted')\n-\n- def load_key(self, id_: str, options: dict | None = None):\n- if options is None:\n- options = {\n- 'mount': True,\n- 'recursive': False,\n- 'key': None,\n- 'key_location': None,\n- }\n- options.setdefault('mount', True)\n- options.setdefault('recursive', False)\n- options.setdefault('key', None)\n- options.setdefault('key_location', None)\n-\n- mount_ds = options.pop('mount')\n- recursive = options.pop('recursive')\n- try:\n- with libzfs.ZFS() as zfs:\n- ds = zfs.get_dataset(id_)\n- self.common_load_dataset_checks(id_, ds)\n- ds.load_key(**options)\n- except libzfs.ZFSException as e:\n- self.logger.error(f'Failed to load key for {id_}', exc_info=True)\n- raise CallError(f'Failed to load key for {id_}: {e}')\n- else:\n- if mount_ds:\n- self.call_sync2(self.s.zfs.resource.mount, id_, recursive=recursive)\n-\n- def check_key(self, id_: str, options: dict | None = None):\n- \"\"\"\n- Returns `true` if the `key` is valid, `false` otherwise.\n- \"\"\"\n- if options is None:\n- options = {\n- 'key': None,\n- 'key_location': None,\n- }\n-\n- try:\n- with libzfs.ZFS() as zfs:\n- ds = zfs.get_dataset(id_)\n- self.common_encryption_checks(id_, ds)\n- return ds.check_key(**options)\n- except libzfs.ZFSException as e:\n- self.logger.error(f'Failed to check key for {id_}', exc_info=True)\n- raise CallError(f'Failed to check key for {id_}: {e}')\n-\n- def change_key(self, id_: str, options: dict | None = None):\n- if options is None:\n- options = {\n- 'encryption_properties': {},\n- 'load_key': True,\n- 'key': None,\n- }\n-\n- try:\n- with libzfs.ZFS() as zfs:\n- ds = zfs.get_dataset(id_)\n- self.common_encryption_checks(id_, ds)\n- ds.change_key(props=options['encryption_properties'], load_key=options['load_key'], key=options['key'])\n- except libzfs.ZFSException as e:\n- self.logger.error(f'Failed to change key for {id_}', 
exc_info=True)\n- raise CallError(f'Failed to change key for {id_}: {e}')\n-\n- def change_encryption_root(self, id_: str, options: dict | None = None):\n- if options is None:\n- options = {'load_key': True}\n-\n- try:\n- with libzfs.ZFS() as zfs:\n- ds = zfs.get_dataset(id_)\n- ds.change_key(load_key=options['load_key'], inherit=True)\n- except libzfs.ZFSException as e:\n- raise CallError(f'Failed to change encryption root for {id_}: {e}')\n-\n- @job()\n- def bulk_process(self, job, name: str, params: list):\n- f = getattr(self, name, None)\n- if not f:\n- raise CallError(f'{name} method not found in zfs.dataset')\n-\n- statuses = []\n- for i in params:\n- result = error = None\n- try:\n- result = f(*i)\n- except Exception as e:\n- error = str(e)\n- finally:\n- statuses.append({'result': result, 'error': error})\n-\n- return statuses", - "header": "@@ -1,112 +0,0 @@", - "new_count": 0, - "new_start": 0, - "old_count": 112, - "old_start": 1 - } - ], - "language": "", - "lines_added": 0, - "lines_removed": 112, - "path": "", - "status": "removed" - } - ], - "intent_gaps": [ - "The PR description mentions 'Depends on changes made in https://github.com/truenas/truenas_pylibzfs/pull/145' but doesn't specify what those changes are. The code uses crypto.load_key(), crypto.check_key(), crypto.change_key() methods that presumably were added in that PR - reviewers need to verify those methods exist and have correct signatures.", - "PR description says 'removes another use case of our process pool' but doesn't document which process_pool usages remain. The zfs_/pool*.py files still have process_pool=True in their Config classes - full migration status is unclear.", - "The PR adds new exception types but doesn't document when they're raised vs when ZFSException is raised. 
Code in encryption.py raises ZFSNotEncryptedException before calling ZFS operations, but ZFS operations themselves can also fail - error contract is implicit.", - "No tests are included in this PR (test_files_changed: 0). For a security-critical encryption refactor, this is a significant gap. The PR should include tests for: key loading, key validation, key changing, encryption root inheritance, error cases (wrong key, non-encrypted dataset, already loaded key).", - "The PR description mentions converting 'zfs.dataset encryption methods' but also changes KMIP integration (kmip/zfs_keys.py). This cross-service impact isn't mentioned in the PR description.", - "Dead code risk: The old process_pool-based encryption implementation files are removed (112 lines deleted in removed file), but it's unclear if any other code still references those removed functions. Static analysis should confirm no dangling references." - ], - "pr_narrative": "This PR refactors ZFS dataset encryption operations to replace the deprecated py-libzfs/process_pool mechanism with direct truenas_pylibzfs calls.\n\nOLD MECHANISM:\n- ZFS operations ran in a separate process pool (process_pool=True in service Config)\n- Used py-libzfs bindings for encryption operations\n- Required marshaling data between main process and worker processes\n\nNEW MECHANISM:\n- New src/middlewared/middlewared/plugins/zfs/encryption.py module with 4 functions:\n - load_key(tls, dataset, **kwargs): Load encryption key into ZFS\n - check_key(tls, dataset, **kwargs): Validate key without loading (returns bool)\n - change_key(tls, dataset, properties, key): Change encryption key/properties\n - change_encryption_root(tls, dataset): Inherit encryption from parent\n\n- All functions use @pass_thread_local_storage decorator to receive 'tls' parameter\n- tls.lzh (libzfs handle) is used to open ZFS resources directly via truenas_pylibzfs\n- Functions validate preconditions (encrypted, key not already loaded) before calling 
ZFS\n\nENTRY POINT TO EFFECT FLOW:\n1. pool.dataset.unlock() -> calls load_key() for each locked dataset -> mounts datasets\n2. pool.dataset.encryption_summary() -> calls check_key() to validate keys -> returns validation results\n3. pool.dataset.sync_db_keys() -> calls check_key() to verify keys -> removes invalid keys from DB\n4. pool.dataset.change_key() -> calls change_key() -> updates DB with new key\n5. pool.dataset.inherit_parent_encryption_properties() -> calls change_encryption_root()\n6. kmip.sync_zfs_keys() -> calls check_key() to verify key validity before syncing to KMIP\n\nADDITIONAL CHANGES:\n- Added new exceptions: ZFSKeyAlreadyLoadedException, ZFSNotEncryptedException\n- Updated PoolCreateEncryptionOptions.pbkdf2iters default from 350000 to 1300000 (security hardening)\n- Changed API field type for 'id' parameter in pool_dataset.py from str to NonEmptyString", - "risk_surfaces": [ - "Thread-local storage contract violation: All new encryption functions require 'tls' parameter with 'lzh' attribute (libzfs handle). Callers must use @pass_thread_local_storage decorator. Risk: If any caller forgets the decorator, tls will be None causing AttributeError at tls.lzh.open_resource(). Affected: dataset_encryption_lock.py:222, dataset_encryption_info.py:107,201, dataset_encryption_operations.py:200,263, kmip/zfs_keys.py:67,109", - "Exception contract change: check_key() now raises ZFSNotEncryptedException for non-encrypted datasets instead of returning False. Old code in dataset_encryption_info.py:107-109 catches generic Exception to handle this - risk of masking other real errors. The exception is NOT caught in kmip/zfs_keys.py:67,109 where it's expected to propagate up - this changes error handling semantics.", - "Key format conversion risk: RAW keys are hex-encoded in database but truenas_pylibzfs expects bytes. Code converts via bytes.fromhex() in multiple places (dataset_encryption_info.py:103-104,178-182,196-198). 
Risk: ValueError from malformed hex is caught and silently sets key to None, which causes 'Missing key' failure later without clear error message about the hex parsing failure.", - "Race condition in check_key: check_key() in encryption.py:57-59 opens resource, checks crypto, returns crypto.check_key(). Between check and actual load_key() call, another process could load/unload the key. This is existing behavior but more explicit now.", - "ZFSException EZFS_CRYPTOFAILED handling: In dataset_encryption_lock.py:223-226, ZFSException with EZFS_CRYPTOFAILED returns 'Invalid Key' error. If truenas_pylibzfs changes error code mapping or introduces new error codes for key validation failures, this error handling breaks.", - "KMIP integration risk: kmip/zfs_keys.py push_zfs_keys() and pull_zfs_keys() now use check_key() to verify keys before syncing. If check_key() raises unexpected exceptions (not ZFSNotEncryptedException), the sync will fail. The code catches generic Exception at lines 72,117 but this could mask real failures.", - "API compatibility: The change to PoolCreateEncryptionOptions.pbkdf2iters default (350000 -> 1300000) is a breaking change for API consumers expecting the old default. Existing scripts creating encrypted datasets will get stronger (slower) key derivation without explicitly requesting it.", - "Load order dependency: path_in_locked_datasets() in dataset_encryption_info.py:216-283 now relies on tls.lzh directly instead of process pool. This is a hot code path - any issue with thread-local storage initialization will cause failures in path validation throughout the system.", - "Missing validation in change_key: encryption.py:62-90 receives 'properties' dict that may contain None values (e.g., pbkdf2iters). These are passed directly to tls.lzh.resource_cryptography_config() - if truenas_pylibzfs doesn't handle None properly, this could cause crashes." 
- ], - "stats": { - "files_added": 1, - "files_modified": 7, - "files_removed": 1, - "files_renamed": 0, - "test_files_changed": 0, - "test_to_code_ratio": 0, - "total_additions": 254, - "total_deletions": 210, - "total_files": 9 - }, - "unrelated_changes": [ - "PoolCreateEncryptionOptions.pbkdf2iters default changed from 350000 to 1300000 in src/middlewared/middlewared/api/v26_0_0/pool.py:139 and pool_dataset.py:175. This is a security hardening change unrelated to the py-libzfs -> truenas_pylibzfs migration. It increases PBKDF2 iterations for passphrase-based encryption, making key derivation more secure but slower.", - "PoolDatasetRenameArgs.id field type changed from str to NonEmptyString in pool_dataset.py:815. This adds stricter validation for rename operations, unrelated to encryption refactoring.", - "ZFSPathHasClonesException and ZFSPathHasHoldsException added to exceptions.py but not used in encryption operations. These appear to be added for completeness/consistency but are orthogonal to the encryption changes." - ] - }, - "budget": { - "budget_exhausted": true, - "cost_breakdown": { - "adversary": 0, - "anatomy": 0, - "coverage": 0, - "cross_ref": 0, - "intake": 0, - "meta_selectors": 0, - "output": 0, - "review": 0, - "synthesis": 0 - }, - "max_cost_usd": 2, - "max_duration_seconds": 900, - "total_cost_usd": 0 - }, - "intake": { - "ai_generated": 0, - "areas_touched": [ - "api" - ], - "complexity": "standard", - "languages": [ - "python" - ], - "pr_summary": "Replace usage of the deprecated py-libzfs with truenas_pylibzfs for these private methods. 
This removes another use case of our process pool.\r\n\r\nDepends on changes made in https://github.com/truenas/truenas_pylibzfs/pull/145.", - "pr_type": "refactor", - "review_depth": "standard", - "risk_signals": [ - "changes API surface or request/response behavior" - ] - }, - "phases_completed": [ - "intake", - "anatomy", - "meta_selectors", - "review", - "adversary", - "cross_ref", - "coverage", - "synthesis", - "output" - ], - "plan": { - "ai_adjusted": false, - "cross_ref_hints": [], - "dimensions": [ - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "src/middlewared/middlewared/plugins/zfs/exceptions.py" - ], - "id": "semantic_sem-001", - "name": "Thread-local storage contract verification", - "priority": 10, - "review_prompt": "Investigate the thread-local storage contract for the new encryption functions. All four new functions (load_key, check_key, change_key, change_encryption_root) in encryption.py require the 'tls' parameter with 'lzh' attribute. Verify that EVERY caller of these functions in dataset_encryption_lock.py:222, dataset_encryption_info.py:107,201, dataset_encryption_operations.py:200,263, and kmip/zfs_keys.py:67,109 properly uses the @pass_thread_local_storage decorator. Check for any edge cases where tls might be None, leading to AttributeError at tls.lzh.open_resource(). 
Look for any code paths where decorators might be bypassed or where nested function calls could lose the tls context.", - "target_files": [ - "src/middlewared/middlewared/plugins/zfs/encryption.py", - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "src/middlewared/middlewared/plugins/kmip/zfs_keys.py" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "src/middlewared/middlewared/plugins/zfs/encryption.py" - ], - "id": "mechanical_dim_tls_decorator_contract", - "name": "Thread-Local Storage Decorator Contract Verification", - "priority": 10, - "review_prompt": "Verify that ALL callers of functions decorated with @pass_thread_local_storage actually receive the 'tls' parameter through proper decorator application.\n\nFunctions requiring tls: load_key(), check_key(), change_key(), change_encryption_root() in zfs/encryption.py\n\nRequired checks:\n1. Verify kmip/zfs_keys.py:67 and 109 - are these function calls wrapped in @pass_thread_local_storage decorator?\n2. Verify dataset_encryption_info.py:107,201 - ensure check_key() and path_in_locked_datasets() receive tls through decorator chain\n3. Verify dataset_encryption_lock.py:222 - ensure load_key() caller is decorated\n4. Verify dataset_encryption_operations.py:200,263 - ensure change_key() and change_encryption_root() callers are decorated\n5. Search for any direct calls to these functions WITHOUT going through the decorator chain\n\nCritical: If tls is None, accessing tls.lzh will raise AttributeError. 
Each call path must be traced to verify the decorator is present in the complete call chain from entry point to ZFS function.", - "target_files": [ - "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "src/middlewared/middlewared/plugins/zfs/exceptions.py" - ], - "id": "semantic_sem-002", - "name": "Exception contract change in check_key()", - "priority": 9, - "review_prompt": "Verify the exception contract change in check_key() function. The new implementation raises ZFSNotEncryptedException for non-encrypted datasets instead of returning False. Trace through all callers: dataset_encryption_info.py lines 107-109 use broad Exception catching which could mask real errors; kmip/zfs_keys.py lines 67,109 expect exceptions to propagate up. Ensure the exception handling is consistent across all call sites. Check if there are any callers that still expect a boolean return and will break with the new exception-based flow. 
Verify the ZFSNotEncryptedException is properly defined in exceptions.py with correct inheritance chain.", - "target_files": [ - "src/middlewared/middlewared/plugins/zfs/encryption.py", - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "src/middlewared/middlewared/plugins/kmip/zfs_keys.py" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "src/middlewared/middlewared/plugins/zfs/encryption.py", - "src/middlewared/middlewared/plugins/zfs/exceptions.py" - ], - "id": "mechanical_dim_exception_contract_check_key", - "name": "check_key() Exception Contract Change Verification", - "priority": 9, - "review_prompt": "Verify that check_key() exception contract change is handled correctly in ALL call sites.\n\nOLD behavior: check_key() returned False for non-encrypted datasets\nNEW behavior: check_key() raises ZFSNotEncryptedException for non-encrypted datasets\n\nRequired checks:\n1. dataset_encryption_info.py:107-109 - verify it catches ZFSNotEncryptedException explicitly (not generic Exception) to handle non-encrypted datasets\n2. kmip/zfs_keys.py:67,109 - verify these call sites either catch ZFSNotEncryptedException or are designed to let it propagate (check expected behavior)\n3. Verify no code relies on check_key() returning False - search for any `if not check_key(...)` patterns\n4. Verify ZFSNotEncryptedException is properly imported in all files using check_key()\n\nRisk: Generic Exception catching masks real errors. 
Unhandled ZFSNotEncryptedException propagates as unexpected error to API consumers.", - "target_files": [ - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "src/middlewared/middlewared/plugins/kmip/zfs_keys.py" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "src/middlewared/middlewared/plugins/zfs/encryption.py", - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py" - ], - "id": "semantic_sem-003", - "name": "Key format conversion and hex parsing errors", - "priority": 8, - "review_prompt": "Analyze the key format conversion from hex string to bytes across the codebase. RAW keys stored as hex strings in the database are converted via bytes.fromhex() in dataset_encryption_info.py lines 103-104, 178-182, and 196-198. Check that all ValueError exceptions from malformed hex are properly caught and handled with clear error messages. Verify that silent failures (setting key to None) don't propagate to cause confusing 'Missing key' errors later. Check for any other locations where hex encoding/decoding might fail. Ensure that malformed hex keys don't bypass validation and cause cryptic downstream failures.", - "target_files": [ - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [], - "id": "semantic_sem-005", - "name": "Race condition in check_key vs load_key sequence", - "priority": 8, - "review_prompt": "Investigate the race condition between check_key() and load_key() operations. In encryption.py:57-59, check_key() opens a resource, validates the key, and returns. Between this check and the actual load_key() call, another process could load or unload the key. 
Trace all code paths where check_key() is followed by load_key() (dataset_encryption_lock.py, kmip/zfs_keys.py). Verify whether the system correctly handles the TOCTOU (time-of-check-time-of-use) race. Check if there are any synchronization mechanisms in place or if the code assumes single-threaded access to ZFS datasets.", - "target_files": [ - "src/middlewared/middlewared/plugins/zfs/encryption.py", - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "src/middlewared/middlewared/plugins/kmip/zfs_keys.py" - ] - } - ], - "total_budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - } - } - }, - "pr_url": "https://github.com/truenas/middleware/pull/18291", - "review": { - "body": "## \ud83d\udd34 PR-AF Review \u2014 **Needs Major Rework**\n\n*Automated multi-agent code review \u00b7 [PR-AF](https://github.com/Agent-Field/agentfield) built with [AgentField](https://github.com/Agent-Field/agentfield)*\n\n> **25 findings** \u00b7 \ud83d\udd34 6 critical \u00b7 \ud83d\udfe0 10 important \u00b7 \ud83d\udd35 9 suggestions \u00b7 \u26aa 0 nitpicks\n\n
\nPR Overview\n\nReplace usage of the deprecated py-libzfs with truenas_pylibzfs for these private methods. This removes another use case of our process pool.\r\n\r\nDepends on changes made in https://github.com/truenas/truenas_pylibzfs/pull/145.\n\n
\n\n### Key Findings\n\n**16 issue(s) should be addressed before merge:**\n\n- \ud83d\udd34 **Method name shadows imported function causing infinite recursion** (`src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py:200`) \u2014 **CRITICAL BUG**: The method `change_key` at line 121 shadows the imported function `change_key` from `middlewared.plugins.zfs.encryption` (imported at line 7).\n- \ud83d\udd34 **Duplicate export: PoolRemoveArgs appears twice in __all__ list** (`src/middlewared/middlewared/api/v26_0_0/pool.py:20`) \u2014 The `__all__` list contains `PoolRemoveArgs` twice (lines 20 and 21).\n- \ud83d\udd34 **Malformed hex key causes confusing 'Missing key' error instead of clear validation message** (`src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py:177`) \u2014 When a RAW format encryption key contains malformed hex, the code catches `ValueError` from `bytes.fromhex()` and sets `ds_key = None` (lines 179-182).\n- \ud83d\udd34 **KMIP push_zfs_keys() crashes when check_key() raises ZFSNotEncryptedException** (`src/middlewared/middlewared/plugins/kmip/zfs_keys.py:64`) \u2014 The `check_key()` function now raises `ZFSNotEncryptedException` for non-encrypted datasets instead of returning `False`.\n- \ud83d\udd34 **KMIP pull_zfs_keys() crashes when check_key() raises ZFSNotEncryptedException** (`src/middlewared/middlewared/plugins/kmip/zfs_keys.py:107`) \u2014 The `pull_zfs_keys()` method at lines 107-111 calls `check_key()` without exception handling.\n- \ud83d\udd34 **Generic Exception catching masks ZFSNotEncryptedException and real errors** (`src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py:106`) \u2014 The code at lines 106-109 catches generic `Exception` instead of the specific `ZFSNotEncryptedException`.\n- \ud83d\udfe0 **sync_db_keys() marks non-encrypted datasets for removal due to broad Exception catch** (`src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py:200`) \u2014 The 
`sync_db_keys()` method at lines 200-203 catches all exceptions from `check_key()` and sets `should_remove = True`.\n- \ud83d\udfe0 **Missing hex validation on encryption keys before database storage** (`src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py:26`) \u2014 The `insert_or_update_encrypted_record` method stores encryption keys in the database without validating they are valid hexadecimal strings.\n- \u2026 and 8 more (see All Findings by Severity)\n\n**9 suggestion(s) and style note(s):**\n\n- \ud83d\udd35 Key file validation uses different hex parsing logic than unlock path (`src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py:101`)\n- \ud83d\udd35 Silent failure when hex decoding fails during unlock (`src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py:177`)\n- \ud83d\udd35 No database-level constraints on encryption_key column (`src/middlewared/middlewared/plugins/pool_/dataset.py:41`)\n- \ud83d\udd35 Missing Key Validation Before Load in unlock() (`src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py:221`)\n- \ud83d\udd35 Staleness of check_key() Result in pull_zfs_keys (`src/middlewared/middlewared/plugins/kmip/zfs_keys.py:107`)\n- \u2026 and 4 more (see All Findings by Severity)\n\n**Files with findings:** `src/middlewared/middlewared/api/v26_0_0/pool.py`, `src/middlewared/middlewared/api/v26_0_0/pool_dataset.py`, `src/middlewared/middlewared/plugins/kmip/zfs_keys.py`, `src/middlewared/middlewared/plugins/pool_/dataset.py`, `src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py`, `src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py`, `src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py`, `src/middlewared/middlewared/plugins/zfs/encryption.py`\n\n
\nAll Findings by Severity\n\n#### \ud83d\udd34 Critical (6)\n\n- **Method name shadows imported function causing infinite recursion** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py:200`\n- **Duplicate export: PoolRemoveArgs appears twice in __all__ list** `src/middlewared/middlewared/api/v26_0_0/pool.py:20`\n- **Malformed hex key causes confusing 'Missing key' error instead of clear validation message** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py:177`\n- **KMIP push_zfs_keys() crashes when check_key() raises ZFSNotEncryptedException** `src/middlewared/middlewared/plugins/kmip/zfs_keys.py:64`\n- **KMIP pull_zfs_keys() crashes when check_key() raises ZFSNotEncryptedException** `src/middlewared/middlewared/plugins/kmip/zfs_keys.py:107`\n- **Generic Exception catching masks ZFSNotEncryptedException and real errors** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py:106`\n\n#### \ud83d\udfe0 Important (10)\n\n- **sync_db_keys() marks non-encrypted datasets for removal due to broad Exception catch** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py:200`\n- **Missing hex validation on encryption keys before database storage** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py:26`\n- **TOCTOU Race Condition in load_key() Function** `src/middlewared/middlewared/plugins/zfs/encryption.py:29`\n- **Breaking API change: pbkdf2iters minimum raised from 100000 to 1300000** `src/middlewared/middlewared/api/v26_0_0/pool.py:139`\n- **Breaking API change: PoolDatasetChangeKeyOptions.pbkdf2iters minimum raised from 100000 to 1300000** `src/middlewared/middlewared/api/v26_0_0/pool_dataset.py:175`\n- **from_previous implementation silently modifies pbkdf2iters without notification** `src/middlewared/middlewared/api/v26_0_0/pool.py:153`\n- **Hardcoded minimum prevents users from choosing lower security settings** `src/middlewared/middlewared/api/v26_0_0/pool.py:139`\n- 
**Silent hex conversion failure preserves invalid string, causing potential downstream errors** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py:102`\n- **Broad Exception catch masks ZFSNotEncryptedException as 'invalid key' in encryption_summary** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py:106`\n- **Malformed hex keys in database cause unnecessary key removal during sync** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py:196`\n\n#### \ud83d\udd35 Suggestion (9)\n\n- **Key file validation uses different hex parsing logic than unlock path** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py:101`\n- **Silent failure when hex decoding fails during unlock** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py:177`\n- **No database-level constraints on encryption_key column** `src/middlewared/middlewared/plugins/pool_/dataset.py:41`\n- **Missing Key Validation Before Load in unlock()** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py:221`\n- **Staleness of check_key() Result in pull_zfs_keys** `src/middlewared/middlewared/plugins/kmip/zfs_keys.py:107`\n- **Significant performance impact from increased PBKDF2 iterations** `src/middlewared/middlewared/api/v26_0_0/pool.py:139`\n- **Missing key existence check in from_previous migration method** `src/middlewared/middlewared/api/v26_0_0/pool.py:151`\n- **Missing key existence check in PoolDatasetChangeKeyOptions.from_previous** `src/middlewared/middlewared/api/v26_0_0/pool_dataset.py:183`\n- **Key Validation Without Subsequent Load in push_zfs_keys** `src/middlewared/middlewared/plugins/kmip/zfs_keys.py:65`\n\n
\n\n
\nReview Process Details\n\n**Meta-Dimension Lenses (3):**\n\n- **Semantic** \u2014 5 dimension(s), 85% coverage confidence\n- **Mechanical** \u2014 3 dimension(s), 85% coverage confidence\n- **Systemic** \u2014 3 dimension(s), 85% coverage confidence\n\n**Dimensions Analyzed (6):**\n\n- **Thread-local storage contract verification** \u2014 5 file(s)\n- **Thread-Local Storage Decorator Contract Verification** \u2014 4 file(s)\n- **Exception contract change in check_key()** \u2014 3 file(s)\n- **check_key() Exception Contract Change Verification** \u2014 2 file(s)\n- **Key format conversion and hex parsing errors** \u2014 1 file(s)\n- **Race condition in check_key vs load_key sequence** \u2014 3 file(s)\n\n**Cross-Reference & Adversary Analysis:**\n\n- **8** cross-change interaction(s) detected\n- **20** finding(s) adversarially tested: 4 confirmed, 16 challenged\n\n
\n\n
\nPipeline Stats\n\n| Metric | Value |\n|--------|-------|\n| Duration | 1120.0s |\n| Agent invocations | 20 |\n| Coverage iterations | 1 |\n| Estimated cost | N/A (provider does not report cost) |\n| Budget exhausted | Yes (timeout: 1120s > 900s limit) |\n| PR type | refactor |\n| Complexity | standard |\n\n
\n\nReview ID: `rev_4d1f3985141a`", - "comments": [ - { - "body": "\ud83d\udd34 **[CRITICAL] Method name shadows imported function causing infinite recursion**\n\n**CRITICAL BUG**: The method `change_key` at line 121 shadows the imported function `change_key` from `middlewared.plugins.zfs.encryption` (imported at line 7). When line 200 calls `change_key(tls, id_, encryption_dict, key)`, Python's name resolution (LEGB rule) binds the unqualified name `change_key` to the method in the class scope, NOT the module-level import.\n\nThis causes:\n1. **Infinite recursion**: The method calls itself instead of the encryption function\n2. **Type mismatch**: The recursive call binds parameters incorrectly:\n - `job` receives `tls` (thread-local object)\n - `tls` receives `id_` (string dataset name)\n - `id_` receives `encryption_dict` (dict)\n - `options` receives `key` (string)\n\n**Impact**: When users attempt to change encryption keys via the API, the system will crash with `RecursionError` or fail when trying to access attributes like `tls.lzh` on a string.\n\n**Root cause**: The import at line 7 brings `change_key` into the module namespace, but the method definition at line 121 creates a class attribute with the same name, shadowing the import within method bodies.\n\n---\n\n> Step 1: Import at line 7: `from middlewared.plugins.zfs.encryption import change_encryption_root, change_key`\n> Step 2: Method definition at line 121: `def change_key(self, job, tls, id_, options):`\n> Step 3: Call at line 200: `change_key(tls, id_, encryption_dict, key)`\n> Step 4: Python resolves `change_key` to the method (class scope), not the imported function (module scope)\n> Step 5: Method recursively calls itself with wrong parameter types causing RecursionError or AttributeError\n\n**\ud83d\udca1 Suggested Fix**\n\nRename the import to avoid shadowing: `from middlewared.plugins.zfs.encryption import change_key as zfs_change_key, change_encryption_root`, then update line 200 to call 
`zfs_change_key(tls, id_, encryption_dict, key)`. Alternatively, rename the method to `do_change_key` and update the API method decorator.\n\n---\n*`TLS Parameter Verification for @pass_thread_local_storage Decorated Functions` \u00b7 confidence 95%*", - "line": 200, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] sync_db_keys() marks non-encrypted datasets for removal due to broad Exception catch**\n\nThe `sync_db_keys()` method at lines 200-203 catches all exceptions from `check_key()` and sets `should_remove = True`. With the new exception contract, if a dataset is not encrypted but exists in the database, `check_key()` raises `ZFSNotEncryptedException`, which is caught and the dataset is marked for removal from the database.\n\n**Potential issue**: While removing non-encrypted datasets from the encryption database might be correct behavior, the broad exception catch also catches other legitimate errors (ZFS errors, I/O errors, etc.) and treats them the same way. 
A dataset with a valid key but experiencing a transient ZFS error would be incorrectly removed from the database.\n\n**Previous behavior**: Only datasets with genuinely invalid keys would return `False` and be marked for removal.\n**New behavior**: ANY exception (including ZFS errors, not just non-encrypted datasets) causes removal.\n\n---\n\n> Step 1: `sync_db_keys()` at line 194 iterates over `db_datasets`\n> Step 2: At line 201, calls `should_remove = not check_key(tls, ds_name, key=key)`\n> Step 3: Lines 200-203 use `except Exception:` to catch all exceptions and set `should_remove = True`\n> Step 4: `check_key()` raises `ZFSNotEncryptedException` for non-encrypted datasets\n> Step 5: Also catches any other ZFS errors, treating them all as 'invalid key' and removing from DB\n> Step 6: `should_remove = True` causes dataset to be added to `to_remove` list at line 205-206\n\n**\ud83d\udca1 Suggested Fix**\n\nCatch `ZFSNotEncryptedException` specifically and mark those datasets for removal (since they shouldn't be in the encryption database). Re-raise or handle other exceptions differently - perhaps log them and skip removal rather than assuming the key is invalid.\n\n---\n*`Exception Contract Change in check_key()` \u00b7 confidence 80%*", - "line": 200, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Missing hex validation on encryption keys before database storage**\n\nThe `insert_or_update_encrypted_record` method stores encryption keys in the database without validating they are valid hexadecimal strings. While the method correctly skips storing passphrase keys (lines 28-30), it does not validate that HEX format keys are properly formatted before storage.\n\nThe only hex validation in the codebase exists in `validate_encryption_data` (lines 101-106), but this only applies to keys read from file input pipes, not to keys provided directly via API parameters. 
When `options['key']` is provided directly, it bypasses the hex validation entirely.\n\nThis creates a data integrity risk where invalid hex keys could be stored in the database, only to fail later when retrieved and passed to `bytes.fromhex()` in unlock operations.\n\n---\n\n> Step 1: `insert_or_update_encrypted_record` is called from multiple locations:\n> - dataset.py:690-693 during dataset creation\n> - pool.py:524-530 during pool creation\n> - dataset_encryption_lock.py:344-346 during unlock\n> - dataset_encryption_operations.py:205 during key change\n> \n> Step 2: In `insert_or_update_encrypted_record` (lines 26-58), the key is stored directly:\n> ```python\n> data['encryption_key'] = data['encryption_key'] # Line 38 - no validation\n> ```\n> \n> Step 3: The only hex validation exists in `validate_encryption_data` (lines 101-106) but ONLY for file input:\n> ```python\n> if not key and job:\n> job.check_pipe('input')\n> key = job.pipes.input.r.read(64)\n> try:\n> key = hex(int(key, 16))[2:]\n> if len(key) != 64:\n> raise ValueError('Invalid key')\n> except ValueError:\n> verrors.add(f'{schema}.key_file', 'Please specify a valid key')\n> ```\n> \n> Step 4: When keys are retrieved for unlock operations (dataset_encryption_lock.py:177-182), they are passed to `bytes.fromhex()`:\n> ```python\n> if ZFSKeyFormat(ds['key_format']['value']) == ZFSKeyFormat.RAW and ds_key:\n> try:\n> ds_key = bytes.fromhex(ds_key)\n> except ValueError:\n> ds_key = None\n> ```\n> \n> Step 5: The error is silently suppressed, meaning invalid keys stored in the database will silently fail to unlock datasets.\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd hex validation in `insert_or_update_encrypted_record` before storing the key:\n\n```python\nif data['encryption_key'] and ZFSKeyFormat(key_format.upper()) == ZFSKeyFormat.HEX:\n try:\n # Validate it's a valid hex string of correct length (64 chars = 32 bytes)\n if len(data['encryption_key']) != 64 or int(data['encryption_key'], 16) < 0:\n raise 
ValueError('Invalid hex key format')\n except ValueError:\n raise CallError(f'Invalid hex encryption key format for {data[\"name\"]}')\n```\n\nAlternatively, move the hex validation to a common validation function that is called for ALL key inputs, not just file inputs.\n\n---\n*`Encryption Key Storage Validation` \u00b7 confidence 85%*", - "line": 26, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] TOCTOU Race Condition in load_key() Function**\n\nThe `load_key()` function in `encryption.py` contains a Time-Of-Check-Time-Of-Use (TOCTOU) race condition. At lines 32-34, the function first checks `crypto.info().key_is_loaded` and then immediately calls `crypto.load_key()`. Between this check and the actual load operation, another process or thread could load a key into the same ZFS dataset, causing the subsequent `load_key()` call to fail with an unexpected error.\n\nThe function does raise `ZFSKeyAlreadyLoadedException` if the key is loaded at check time, but this exception is not designed to handle the race where the key gets loaded AFTER the check but BEFORE the load. In a concurrent environment, this race window\u2014though small\u2014is non-zero and could lead to:\n1. Unnecessary error propagation to the caller\n2. Failed unlock operations even when valid keys are provided\n3. 
Inconsistent dataset states when multiple unlock operations are triggered concurrently\n\nThe ZFS kernel module provides atomic operations, but this Python wrapper introduces a race window by separating the check from the operation.\n\n---\n\n> Step 1: `load_key()` is called at encryption.py:29-34.\n> Step 2: Line 32 checks `crypto.info().key_is_loaded` - this is a separate ZFS operation.\n> Step 3: If key_is_loaded is False, execution proceeds to line 34.\n> Step 4: At line 34, `crypto.load_key(**kwargs)` is called.\n> Step 5: Between Step 2 and Step 4, another thread/process could successfully call `load_key()` on the same dataset.\n> Step 6: This causes the second `load_key()` call to fail with an unexpected ZFS error rather than the handled `ZFSKeyAlreadyLoadedException`.\n\n**\ud83d\udca1 Suggested Fix**\n\nConsider removing the pre-check for `key_is_loaded` and instead directly attempt `crypto.load_key()`, catching the specific ZFS error that occurs when a key is already loaded. This reduces the race window to the atomic ZFS operation itself. Alternatively, implement a per-dataset locking mechanism to serialize key loading operations.\n\n---\n*`TOCTOU Race Between check_key() and load_key() Operations` \u00b7 confidence 75%*", - "line": 29, - "path": "src/middlewared/middlewared/plugins/zfs/encryption.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Breaking API change: pbkdf2iters minimum raised from 100000 to 1300000**\n\nThe `PoolCreateEncryptionOptions.pbkdf2iters` field changed its constraint from `ge=100000` (v25) to `ge=1300000` (v26). 
This is a **breaking API change** that will cause validation failures for API clients that explicitly set pbkdf2iters to any value between 100000 and 1299999.\n\n**Impact Analysis:**\n- **Silent behavioral change**: Clients relying on the default value (changed from 350000 to 1300000) will experience 3.7x slower encryption key derivation without warning\n- **Explicit validation failures**: Clients sending explicit values in the previously-valid range (100000-1299999) will receive Pydantic validation errors\n- **Breaking change for automation**: Scripts or integrations that hardcoded iteration values within the old range will fail when upgraded to API v26\n\n**Previous constraints (v25_10_2):**\n```python\npbkdf2iters: int = Field(ge=100000, default=350000)\n```\n\n**New constraints (v26_0_0):**\n```python\npbkdf2iters: int = Field(ge=1300000, default=1300000)\n```\n\nThe `from_previous` method (lines 151-154) mitigates this for clients *upgrading* API versions (by forcing values to max(1300000, old_value)), but this does not help:\n1. New API v26 clients making fresh calls\n2. Clients who migrate to v26 without going through upgrade path\n3. 
Configuration-as-code tools that validate against the new schema\n\nThe security improvement (higher minimum iterations) is valid, but should be introduced with deprecation warnings or a transitional period.\n\n---\n\n> Step 1: Client on API v26 calls pool.create with encryption_options={'pbkdf2iters': 500000, 'passphrase': 'secret'}\n> Step 2: Pydantic validates the input against PoolCreateEncryptionOptions at line 139\n> Step 3: Field constraint ge=1300000 rejects 500000 as below minimum\n> Step 4: ValidationError raised with message about failing ge constraint\n> \n> Evidence from v25_10_2/pool.py line 167: pbkdf2iters: int = Field(ge=100000, default=350000)\n> Evidence from v26_0_0/pool.py line 139: pbkdf2iters: int = Field(ge=1300000, default=1300000)\n\n**\ud83d\udca1 Suggested Fix**\n\nConsider one of the following approaches:\n1. **Soft deprecation path**: Keep ge=100000 for one release cycle, log deprecation warnings for values < 1300000, then enforce the new minimum in v27\n2. **Document migration requirements**: Explicitly document that API v26 requires clients to update their pbkdf2iters values\n3. **Conditional validation**: Use a model_validator to allow old values during a transition period with warnings\n\nIf this change is intentional and acceptable as a breaking change in a major version, ensure it is prominently documented in the API changelog with clear migration instructions.\n\n---\n*`Coverage gap review - cluster_1 API schema changes` \u00b7 confidence 90%*", - "line": 139, - "path": "src/middlewared/middlewared/api/v26_0_0/pool.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Breaking API change: PoolDatasetChangeKeyOptions.pbkdf2iters minimum raised from 100000 to 1300000**\n\nThe `PoolDatasetChangeKeyOptions.pbkdf2iters` field changed its constraint from `ge=100000` (v25) to `ge=1300000` (v26). 
This is a breaking change for the `pool.dataset.change_key` endpoint.\n\n**Impact Analysis:**\n- Clients calling `pool.dataset.change_key` with explicit pbkdf2iters values between 100000-1299999 will receive validation errors\n- Clients relying on the default (350000 -> 1300000) will experience slower key derivation without warning\n\n**Previous (v25_10_2 line 175):**\n```python\npbkdf2iters: int = Field(default=350000, ge=100000)\n```\n\n**New (v26_0_0 line 175):**\n```python\npbkdf2iters: int = Field(default=1300000, ge=1300000)\n```\n\nThis change mirrors the issue in PoolCreateEncryptionOptions but affects the dataset key change operation specifically.\n\n---\n\n> Step 1: Client calls pool.dataset.change_key with options={'pbkdf2iters': 200000, 'passphrase': 'newsecret'}\n> Step 2: Pydantic validates PoolDatasetChangeKeyOptions at line 175\n> Step 3: ge=1300000 constraint fails for value 200000\n> Step 4: ValidationError raised\n> \n> Evidence from v25_10_2/pool_dataset.py line 175: pbkdf2iters: int = Field(default=350000, ge=100000)\n> Evidence from v26_0_0/pool_dataset.py line 175: pbkdf2iters: int = Field(default=1300000, ge=1300000)\n\n**\ud83d\udca1 Suggested Fix**\n\nApply the same migration strategy as PoolCreateEncryptionOptions. Consider soft deprecation with warnings before enforcing the new minimum, or clearly document this as a breaking change requiring client updates.\n\n---\n*`Coverage gap review - cluster_1 API schema changes` \u00b7 confidence 90%*", - "line": 175, - "path": "src/middlewared/middlewared/api/v26_0_0/pool_dataset.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] from_previous implementation silently modifies pbkdf2iters without notification**\n\nThe `from_previous` classmethod at lines 151-154 silently increases pbkdf2iters to 1300000 without any warning or indication to the client. 
While this ensures compatibility, it creates a **silent behavioral change** that may confuse users.\n\n```python\n@classmethod\ndef from_previous(cls, value):\n value['pbkdf2iters'] = max(1300000, value['pbkdf2iters'])\n return value\n```\n\n**Issues:**\n1. **Silent upgrade**: A client requesting 350000 iterations (for performance reasons) will silently get 1300000 instead, making encryption/unlocking 3.7x slower without any indication\n2. **No audit trail**: The system doesn't log that it modified the requested value\n3. **Performance surprise**: Users who explicitly chose lower iterations for performance will experience unexplained slowdowns\n4. **No opt-out**: There's no way for clients to preserve the old behavior during transition\n\nThis pattern also exists in PoolDatasetChangeKeyOptions.from_previous (pool_dataset.py:183-186).\n\n---\n\n> Step 1: Client on API v25 calls pool.create with encryption_options={'pbkdf2iters': 350000}\n> Step 2: API version adapter detects UPGRADE direction and calls PoolCreateEncryptionOptions.from_previous at line 233 of version.py\n> Step 3: from_previous silently replaces 350000 with 1300000 via max() operation\n> Step 4: New value 1300000 is validated (passes ge=1300000) and used\n> Step 5: Client gets 3.7x slower encryption without any notification\n> \n> Evidence: version.py line 233 calls new_model.from_previous(value) during UPGRADE\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd a warning log when from_previous increases the value:\n```python\n@classmethod\ndef from_previous(cls, value):\n old_value = value.get('pbkdf2iters', 350000)\n new_value = max(1300000, old_value)\n if new_value > old_value:\n logger.warning(\n 'pbkdf2iters automatically increased from %d to %d for security compliance',\n old_value, new_value\n )\n value['pbkdf2iters'] = new_value\n return value\n```\nAlternatively, return a response header or metadata indicating the value was modified.\n\n---\n*`Coverage gap review - cluster_1 API schema changes` \u00b7 
confidence 85%*", - "line": 153, - "path": "src/middlewared/middlewared/api/v26_0_0/pool.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Hardcoded minimum prevents users from choosing lower security settings**\n\nThe `ge=1300000` constraint combined with the `from_previous` migration means users CANNOT choose lower iteration counts even if they understand the security trade-offs and prioritize unlock speed. This removes user agency and could be problematic for: development/test environments where fast unlock is preferred, systems with weak CPUs where 1.3M iterations cause unacceptable delays, and emergency recovery scenarios. The old API allowed any value >= 100000. The new API forces >= 1300000 with no opt-out.\n\n---\n\n> Step 1: v25_10_2 allowed pbkdf2iters >= 100000 (Field(ge=100000, default=350000)). Step 2: v26_0_0 requires pbkdf2iters >= 1300000 (Field(ge=1300000, default=1300000)). Step 3: from_previous uses max() to force upgrade of any existing lower values. Step 4: No mechanism exists for users to opt-out of this minimum requirement. Step 5: This is a breaking change that removes flexibility for edge cases.\n\n**\ud83d\udca1 Suggested Fix**\n\nConsider whether the hard minimum of 1300000 is appropriate for all use cases, or if there should be an escape hatch for users who need lower iteration counts and accept the security trade-offs. At minimum, document why this specific value was chosen and what users should expect.\n\n---\n*`Root cluster coverage gap review` \u00b7 confidence 70%*", - "line": 139, - "path": "src/middlewared/middlewared/api/v26_0_0/pool.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd34 **[CRITICAL] KMIP push_zfs_keys() crashes when check_key() raises ZFSNotEncryptedException**\n\nThe `check_key()` function now raises `ZFSNotEncryptedException` for non-encrypted datasets instead of returning `False`. 
The KMIP `push_zfs_keys()` method at lines 64-69 calls `check_key()` without any exception handling, expecting a boolean return value.\n\n**Impact**: If a dataset in the database is not actually encrypted (e.g., encryption was removed, or database is out of sync with ZFS), the entire `push_zfs_keys()` operation will crash with an unhandled exception. This could prevent KMIP key synchronization from completing, leaving encryption keys in an inconsistent state.\n\n**The code path**:\n1. `push_zfs_keys()` iterates over datasets from database (line 59)\n2. For each dataset without `encryption_key`, it checks if the in-memory key is valid (line 67)\n3. `check_key()` raises `ZFSNotEncryptedException` if the dataset is not encrypted\n4. Exception propagates uncaught, aborting the entire sync operation\n\n---\n\n> Step 1: `push_zfs_keys()` at line 56 iterates over `existing_datasets` from database\n> Step 2: At line 64-69, for datasets without `encryption_key`, it checks `if ds['name'] in self.zfs_keys and check_key(tls, ds['name'], key=self.zfs_keys[ds['name']])`\n> Step 3: `check_key()` in encryption.py:57-58 raises `ZFSNotEncryptedException(dataset)` when `rsrc.crypto()` returns None (dataset not encrypted)\n> Step 4: No exception handling in this code path causes unhandled exception to propagate up\n> Step 5: This aborts the entire KMIP key push operation, potentially leaving other datasets unsynchronized\n\n**\ud83d\udca1 Suggested Fix**\n\nWrap the `check_key()` call in a try-except block to catch `ZFSNotEncryptedException` and handle it appropriately. Options:\n1. Skip datasets that are not encrypted (they don't need KMIP key management)\n2. Log a warning and continue with other datasets\n3. 
Consider removing such datasets from `self.zfs_keys` since they shouldn't have encryption keys\n\n---\n*`Exception Contract Change in check_key()` \u00b7 confidence 95%*", - "line": 64, - "path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd34 **[CRITICAL] KMIP pull_zfs_keys() crashes when check_key() raises ZFSNotEncryptedException**\n\nThe `pull_zfs_keys()` method at lines 107-111 calls `check_key()` without exception handling. Similar to `push_zfs_keys()`, if a dataset is not encrypted but exists in `self.zfs_keys`, the call to `check_key()` will raise `ZFSNotEncryptedException` and crash the operation.\n\n**Impact**: The KMIP key pull operation will fail entirely if any dataset in the iteration is not encrypted. This prevents migrating keys from KMIP server back to local database for datasets that are actually encrypted, because the operation aborts on the first non-encrypted dataset encountered.\n\n---\n\n> Step 1: `pull_zfs_keys()` at line 99 iterates over `existing_datasets` with KMIP UIDs\n> Step 2: At lines 107-111, it checks `elif ds['name'] in self.zfs_keys and check_key(tls, ds['name'], key=self.zfs_keys[ds['name']])`\n> Step 3: `check_key()` in encryption.py:57-58 raises `ZFSNotEncryptedException` if dataset not encrypted\n> Step 4: No try-except block catches this exception in `pull_zfs_keys()`\n> Step 5: Unhandled exception aborts the entire key pull operation, preventing other datasets from being synchronized\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd explicit exception handling for `ZFSNotEncryptedException` around the `check_key()` call at lines 107-109. 
When a dataset is not encrypted, it should be skipped (continue to next dataset) or handled appropriately rather than crashing the entire operation.\n\n---\n*`Exception Contract Change in check_key()` \u00b7 confidence 95%*", - "line": 107, - "path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd34 **[CRITICAL] Generic Exception catching masks ZFSNotEncryptedException and real errors**\n\nThe code at lines 106-109 catches generic `Exception` instead of the specific `ZFSNotEncryptedException`. This has two serious problems:\n\n1. **Real errors are masked**: Any actual error (ZFS communication failure, invalid dataset name, memory errors, etc.) will be silently converted to `valid_key = False`, making it indistinguishable from a non-encrypted dataset case.\n\n2. **Missing specific exception import**: The file does not import `ZFSNotEncryptedException` from `middlewared.plugins.zfs.exceptions`, which is required for proper exception handling.\n\nThe OLD behavior was: `check_key()` returned `False` for non-encrypted datasets.\nThe NEW behavior is: `check_key()` raises `ZFSNotEncryptedException` for non-encrypted datasets.\n\nThe current code catches the new exception, but also catches ALL other exceptions, including critical failures that should be propagated to the caller or logged as errors.\n\n---\n\n> Step 1: `encryption_summary()` calls `check_key(tls, name, key=ds_key)` at line 107\n> Step 2: For non-encrypted datasets, `check_key()` raises `ZFSNotEncryptedException` (encryption.py:58)\n> Step 3: The generic `except Exception:` at line 108 catches this AND any other exception\n> Step 4: `valid_key = False` is set regardless of whether it's a non-encrypted dataset or a real error\n> Step 5: Real errors (ZFS failures, communication issues) are masked and logged as routine 'invalid key' cases\n\n**\ud83d\udca1 Suggested Fix**\n\nImport `ZFSNotEncryptedException` and catch it specifically. 
Re-raise or log other exceptions appropriately. Recommended change:\n\n```python\nfrom middlewared.plugins.zfs.exceptions import ZFSNotEncryptedException\n\ntry:\n valid_key = check_key(tls, name, key=ds_key)\nexcept ZFSNotEncryptedException:\n valid_key = False\nexcept Exception as e:\n self.logger.error('Failed to check key for %s: %s', name, e, exc_info=True)\n valid_key = False\n```\n\n---\n*`check_key() Exception Contract Review` \u00b7 confidence 95%*", - "line": 106, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Silent hex conversion failure preserves invalid string, causing potential downstream errors**\n\nIn `encryption_summary()` at lines 102-104, malformed hex keys are silently suppressed using `contextlib.suppress(ValueError)`. When `bytes.fromhex()` fails, the original hex string is preserved instead of being converted to bytes. This means an invalid hex string gets passed to `check_key()` at line 107.\n\nWhile `check_key()` may handle this gracefully, this creates an inconsistent state where:\n- The code expects `ds_key` to be bytes for RAW format\n- But it may actually be a string (the original malformed hex)\n\nThis violates type expectations and could cause subtle bugs. 
The `valid_key` result at line 107 will likely be `False` for malformed keys (caught by generic Exception handler at line 108-109), but the user gets no indication that their key format was invalid.\n\n---\n\n> Step 1: `encryption_summary` processes a dataset with RAW key format\n> Step 2: Line 102-104: `bytes.fromhex(ds_key)` raises ValueError, silently suppressed\n> Step 3: `ds_key` remains a string (the invalid hex), not bytes as expected\n> Step 4: Line 107: `check_key()` called with invalid type (string instead of bytes)\n> Step 5: Generic Exception handler catches and sets `valid_key = False`\n> Step 6: User sees 'valid_key: false' with no indication the key format was invalid\n\n**\ud83d\udca1 Suggested Fix**\n\nInstead of silently suppressing the error, either:\n1. Track that the key format was invalid and include this in the response (e.g., add 'key_format_invalid' field to results)\n2. Set `ds_key = None` when conversion fails to ensure consistent types\n3. Raise a validation error if this is called via an API that should reject invalid keys upfront\n\n---\n*`Hex String to Bytes Conversion Error Handling` \u00b7 confidence 85%*", - "line": 102, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Broad Exception catch masks ZFSNotEncryptedException as 'invalid key' in encryption_summary**\n\nThe `encryption_summary()` method uses a broad `except Exception:` catch at lines 106-109 to handle any exception from `check_key()`. While this prevents crashes, it semantically conflates 'dataset is not encrypted' with 'key is invalid'.\n\n**Previous behavior**: `check_key()` returned `False` for non-encrypted datasets, which was set as `valid_key = False`\n**New behavior**: `check_key()` raises `ZFSNotEncryptedException`, which is caught and also sets `valid_key = False`\n\n**Issue**: The user sees 'valid_key: false' but cannot distinguish between:\n1. 
The dataset is not encrypted (shouldn't even be in the encryption summary)\n2. The provided key is actually invalid\n\nThis could mislead users trying to unlock datasets that aren't actually encrypted.\n\n---\n\n> Step 1: `encryption_summary()` at line 100 iterates over encrypted datasets from `query_encrypted_datasets()`\n> Step 2: At line 107, it calls `check_key(tls, name, key=ds_key)`\n> Step 3: If dataset is not encrypted, `check_key()` raises `ZFSNotEncryptedException` (encryption.py:58)\n> Step 4: Lines 106-109 catch ALL exceptions and set `valid_key = False`\n> Step 5: The user cannot distinguish between 'not encrypted' vs 'wrong key' - both show as `valid_key: false`\n\n**\ud83d\udca1 Suggested Fix**\n\nCatch `ZFSNotEncryptedException` specifically and handle it differently from other exceptions. Options:\n1. Skip non-encrypted datasets from the results entirely (they shouldn't appear in an 'encryption summary')\n2. Add a specific flag or error message indicating the dataset is not encrypted\n3. Consider filtering non-encrypted datasets earlier in the method before calling `check_key()`\n\n---\n*`Exception Contract Change in check_key()` \u00b7 confidence 85%*", - "line": 106, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Malformed hex keys in database cause unnecessary key removal during sync**\n\nIn `sync_db_keys()` at lines 196-198, malformed hex keys from the database are silently suppressed using `contextlib.suppress(ValueError)`. When `bytes.fromhex()` fails, the original hex string is preserved and passed to `check_key()` at line 201.\n\nIf `check_key()` fails (which is likely with a malformed key), the dataset is marked for removal from the database at line 206. This means:\n1. A user stores a valid hex key in the database\n2. Somehow the key becomes corrupted in the database (manual edit, migration issue, etc.)\n3. 
The periodic sync job (runs every 86400 seconds) sees the malformed key\n4. The malformed key fails validation and is removed from the database\n5. The user loses their encryption key permanently\n\nThis is a data loss scenario - corrupted keys in the database should not be silently deleted; instead, an error should be logged alerting administrators to the corruption.\n\n---\n\n> Step 1: Periodic job `sync_db_keys` runs (every 86400 seconds via @periodic decorator)\n> Step 2: Line 196-198: Database key fails `bytes.fromhex()`, silently suppressed\n> Step 3: Original invalid string passed to `check_key()` at line 201\n> Step 4: `check_key()` likely fails (returns False or raises)\n> Step 5: Line 206: Dataset name added to `to_remove` list\n> Step 6: Line 212: Corrupted key deleted from database permanently\n\n**\ud83d\udca1 Suggested Fix**\n\nInstead of silently suppressing the error and potentially deleting corrupted keys:\n1. Log an explicit error when hex conversion fails, including the dataset name\n2. Do NOT remove keys that fail hex conversion - they might be recoverable\n3. Consider adding a validation check when keys are INSERTED/UPDATED in the database to prevent invalid hex from being stored in the first place\n\n---\n*`Hex String to Bytes Conversion Error Handling` \u00b7 confidence 80%*", - "line": 196, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] Missing Key Validation Before Load in unlock()**\n\nThe `unlock()` method in `dataset_encryption_lock.py` directly calls `load_key()` at line 222 without first calling `check_key()` to validate the key. 
While this avoids a TOCTOU race between check and load (since there's no check), it means that invalid keys will only be discovered during the load attempt, potentially leaving the dataset in a partially processed state.\n\nThe current implementation catches `ZFSException` and handles `EZFS_CRYPTOFAILED` as 'Invalid Key', which is correct. However, the investigation prompt suggested looking for `check_key()` followed by `load_key()` patterns. In this file, no such pattern exists\u2014the code correctly avoids the TOCTOU by not checking before loading.\n\nThe job lock at line 93 (`@job(lock=lambda args: f'dataset_unlock_{args[0]}')`) provides some serialization for unlock operations targeting the same dataset, but different datasets can still be unlocked concurrently, and the ZFS resource operations themselves are not protected by this high-level lock.\n\n---\n\n> Step 1: `unlock()` job acquires lock for specific dataset ID at line 93.\n> Step 2: At line 222, `load_key(tls, name, key=datasets[name]['key'])` is called directly.\n> Step 3: No `check_key()` call precedes this load operation.\n> Step 4: Lines 223-231 catch exceptions from the load operation.\n> Observation: The code correctly avoids TOCTOU by not separating validation from action, though this means error feedback is only available after attempting the operation.\n\n**\ud83d\udca1 Suggested Fix**\n\nThe current approach of loading directly and catching exceptions is actually safer than check-then-load. No change needed unless you want to add pre-validation for better error messages. 
If pre-validation is added, ensure it's understood that the validation result could be stale by the time load is called.\n\n---\n*`TOCTOU Race Between check_key() and load_key() Operations` \u00b7 confidence 60%*", - "line": 221, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] Staleness of check_key() Result in pull_zfs_keys**\n\nIn `pull_zfs_keys()` at lines 107-111, `check_key()` is used to determine if an in-memory key is valid for a dataset. If valid, the key is used for database updates (line 120) but NOT for loading into ZFS.\n\nThe validation at line 109 confirms the key can unlock the dataset at that moment, but the actual use of the key is for database operations (line 120: `update_data = {'encryption_key': key, 'kmip_uid': None}`). This is appropriate usage because:\n1. No `load_key()` follows the `check_key()`\n2. The database update doesn't depend on the current ZFS state\n\nHowever, the check validates against current ZFS state, which could change before any future unlock operation. This is a minor concern about validation staleness rather than a TOCTOU race.\n\n---\n\n> Step 1: At line 109, `check_key(tls, ds['name'], key=self.zfs_keys[ds['name']])` validates the in-memory key.\n> Step 2: If True, line 111 assigns the key to a local variable.\n> Step 3: Lines 119-121 use this key to update the database, not to load into ZFS.\n> Step 4: No `load_key()` call exists in this code path.\n> Observation: The check is used to select a key source, not to validate before an action.\n\n**\ud83d\udca1 Suggested Fix**\n\nNo immediate fix needed. The `check_key()` usage here is for determining which key source to use (in-memory vs KMIP vs database). The validation result staleness is acceptable because the key will be validated again when actually used for unlocking. 
Consider adding a comment explaining that this is a point-in-time validation.\n\n---\n*`TOCTOU Race Between check_key() and load_key() Operations` \u00b7 confidence 60%*", - "line": 107, - "path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] Significant performance impact from increased PBKDF2 iterations**\n\nThe default `pbkdf2iters` was increased from 350,000 to 1,300,000 (3.7x increase). This is a security improvement against brute force attacks, but it will significantly increase unlock times for passphrase-encrypted datasets. Users with passphrase-encrypted pools will experience ~3-4x longer unlock times without warning. This could impact system boot time for encrypted pools, dataset unlock operations, and user experience for large-scale deployments. Consider adding a release note or documentation about this performance trade-off.\n\n---\n\n> Step 1: Previous API versions (v25_10_2) had default=350000, ge=100000. Step 2: New v26_0_0 has default=1300000, ge=1300000. Step 3: PBKDF2 iterations directly correlate with unlock time - higher iterations = slower unlock. Step 4: Users upgrading to v26 who had passphrase-encrypted pools will see significantly longer unlock times without any warning.\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd documentation or release notes warning users about increased unlock times for passphrase-encrypted datasets. 
Consider allowing users to explicitly set a lower value if they understand the security trade-offs (the ge=1300000 constraint currently prevents this).\n\n---\n*`Root cluster coverage gap review` \u00b7 confidence 75%*", - "line": 139, - "path": "src/middlewared/middlewared/api/v26_0_0/pool.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] Missing key existence check in from_previous migration method**\n\nThe `from_previous` classmethod in `PoolCreateEncryptionOptions` accesses `value['pbkdf2iters']` without first checking if the key exists. While this may work in normal API flows where pydantic populates defaults before migration, it's a fragile pattern that could cause a `KeyError` if called with incomplete data during API version transitions or internal usage. The method should use `.get()` with a default value or check key existence before accessing it.\n\n---\n\n> Step 1: `from_previous` is called during API version migrations to convert data from previous API versions. Step 2: The method directly accesses `value['pbkdf2iters']` at line 153 without checking key existence. Step 3: If the input dict lacks this key (e.g., from malformed client data or internal calls), a KeyError will be raised. Step 4: This causes an unhandled exception instead of graceful migration.\n\n**\ud83d\udca1 Suggested Fix**\n\nChange `value['pbkdf2iters']` to `value.get('pbkdf2iters', 1300000)` to safely handle cases where the key might not be present.\n\n---\n*`Root cluster coverage gap review` \u00b7 confidence 65%*", - "line": 151, - "path": "src/middlewared/middlewared/api/v26_0_0/pool.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] Missing key existence check in PoolDatasetChangeKeyOptions.from_previous**\n\nSame issue as in pool.py - the `from_previous` method in `PoolDatasetChangeKeyOptions` accesses `value['pbkdf2iters']` without checking if the key exists first. 
This could cause a `KeyError` in edge cases during API version migrations.\n\n---\n\n> Step 1: The `from_previous` method is designed to migrate data from previous API versions. Step 2: Line 185 directly accesses dictionary key without existence check. Step 3: While pydantic typically populates defaults, internal calls or edge cases could omit this key. Step 4: This results in KeyError instead of graceful handling.\n\n**\ud83d\udca1 Suggested Fix**\n\nUse `value.get('pbkdf2iters', 1300000)` instead of `value['pbkdf2iters']` to safely handle missing keys.\n\n---\n*`Root cluster coverage gap review` \u00b7 confidence 65%*", - "line": 183, - "path": "src/middlewared/middlewared/api/v26_0_0/pool_dataset.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] Key Validation Without Subsequent Load in push_zfs_keys**\n\nIn `push_zfs_keys()` at lines 65-76, `check_key()` is called to validate an in-memory key. If the check passes, the code continues to the next iteration (line 69). If it fails, the code attempts to retrieve the key from KMIP.\n\nWhile there's no `load_key()` call immediately following the `check_key()` in this specific code path, there is a logical issue: the `check_key()` validates the key against the ZFS dataset's current state, but by the time the key is used (potentially later in the same method or by other callers), the dataset state may have changed. The validation result has a limited time window of validity.\n\nHowever, this is not a TOCTOU race in the traditional sense because no action is taken based on the check result other than skipping to the next dataset. 
The investigation prompt asked about `check_key()` followed by `load_key()` patterns\u2014this file does not contain such a pattern.\n\n---\n\n> Step 1: At line 67, `check_key(tls, ds['name'], key=self.zfs_keys[ds['name']])` is called.\n> Step 2: If True, the code executes `continue` at line 69 and proceeds to the next dataset.\n> Step 3: If False or exception, lines 71-76 retrieve and store the key from KMIP.\n> Observation: No `load_key()` follows the `check_key()` call. The check is used for decision-making, not for validating before an action.\n\n**\ud83d\udca1 Suggested Fix**\n\nThe usage of `check_key()` here is appropriate for determining whether to retrieve a key from KMIP. However, be aware that the validation result represents a point-in-time check and may not reflect the state when the key is actually used. Consider documenting this behavior or adding comments about the temporal nature of the validation.\n\n---\n*`TOCTOU Race Between check_key() and load_key() Operations` \u00b7 confidence 60%*", - "line": 65, - "path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "side": "RIGHT" - } - ], - "event": "REQUEST_CHANGES" - }, - "review_id": "rev_4d1f3985141a", - "summary": { - "adversary_challenged": 16, - "adversary_confirmed": 4, - "ai_generated_confidence": 0, - "budget_exhausted": true, - "by_severity": { - "critical": 6, - "important": 10, - "suggestion": 9 - }, - "cost_usd": 0, - "coverage_iterations": 1, - "cross_ref_interactions": 8, - "dimensions_run": 6, - "duration_seconds": 1120.021, - "total_findings": 25 - } - }, - "started_at": "2026-03-10T11:34:53Z", - "completed_at": "2026-03-10T11:53:35Z", - "duration_ms": 1122026, - "webhook_registered": false -} diff --git a/benchmark/truenas-middleware-18291/pr-af-result-sonnet.json b/benchmark/truenas-middleware-18291/pr-af-result-sonnet.json deleted file mode 100644 index adcef99..0000000 --- a/benchmark/truenas-middleware-18291/pr-af-result-sonnet.json +++ /dev/null @@ -1,1086 +0,0 @@ -{ 
- "execution_id": "exec_20260310_144121_rkn7qq8x", - "run_id": "run_20260310_144121_ji0fblzy", - "status": "succeeded", - "result": { - "findings": [ - { - "active_multipliers": [], - "body": "`get_encrypted_datasets` returns a `list` of dataset dicts (each a `dict` with keys `'name'`, `'id'`, `'encryption_key'`, `'kmip_uid'`, etc.). The in-memory key cache is a `dict[str, bytes]` keyed by dataset name.\n\nAt line 94 (and identically at line 125), the filter expression `if k in existing_datasets` checks whether the **string** `k` (a dataset name) is a member of a **list of dicts**. Python's `in` operator for lists uses `==` equality \u2014 a string will never equal a dict, so this membership test is **always `False`** for every dataset name.\n\nAs a result, **`self.zfs_keys` is emptied to `{}` after every call to `push_zfs_keys` or `pull_zfs_keys`**, regardless of which datasets were actually processed. This defeats the entire purpose of the in-memory key cache: subsequent calls cannot reuse previously loaded keys, and the optimization at lines 64-69 and 107-111 (skipping KMIP retrieval when the key is already known and valid) will never trigger after the first sync.\n\nThe fix should use `{ds['name'] for ds in existing_datasets}` to build a set of names for the membership check.", - "confidence": 0.97, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "decorator_injection", - "dimension_name": "Decorator Double-Injection Analysis", - "evidence": "Step 1: `get_encrypted_datasets` (lines 33-52) builds `rv` by appending `ds_in_db[i['name']]` \u2014 each element is a dict like `{'id': 1, 'name': 'pool/ds', 'encryption_key': ..., 'kmip_uid': ...}`.\nStep 2: `push_zfs_keys` line 59: `existing_datasets = self.get_encrypted_datasets(filters)` \u2192 list of dicts.\nStep 3: Line 94: `{k: v for k, v in self.zfs_keys.items() if k in existing_datasets}` \u2014 `k` is a string (e.g. `'pool/ds'`), `existing_datasets` is a list of dicts. 
Python evaluates `'pool/ds' == {'id': 1, 'name': 'pool/ds', ...}` \u2192 `False` for every element.\nStep 4: All items are filtered out. `self.zfs_keys` becomes `{}`.\nStep 5: Same logic applies identically at line 125 in `pull_zfs_keys`.\nStep 6: On the next call, lines 64-69 check `ds['name'] in self.zfs_keys` \u2192 always `False` \u2192 unnecessary KMIP round-trips for every dataset on every sync.", - "file_path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "id": "f_001", - "line_end": 94, - "line_start": 94, - "score": 0.97, - "severity": "critical", - "suggestion": "Change both occurrences to build a name-set first:\n\n```python\n# Line 94 in push_zfs_keys:\nexisting_names = {ds['name'] for ds in existing_datasets}\nself.zfs_keys = {k: v for k, v in self.zfs_keys.items() if k in existing_names}\n\n# Line 125 in pull_zfs_keys:\nexisting_names = {ds['name'] for ds in existing_datasets}\nself.zfs_keys = {k: v for k, v in self.zfs_keys.items() if k in existing_names}\n```\n\nThis restores the intended behavior: evict cache entries for datasets that no longer exist, while preserving entries for datasets that do.", - "tags": [ - "logic-error", - "cache", - "silent-data-loss", - "membership-check" - ], - "title": "zfs_keys cache silently wiped on every push/pull: `k in existing_datasets` checks string in list-of-dicts" - }, - { - "active_multipliers": [], - "body": "The `datastore.update` API signature is `(table: str, id: int, data: dict)`. At line 157, the call is:\n\n```python\nawait self.middleware.call('datastore.update', 'storage.encrypteddataset', {'kmip_uid': None})\n```\n\nThis passes **only two positional arguments** after the method name: `table='storage.encrypteddataset'` and `id={'kmip_uid': None}`. The `data` dict argument is missing entirely. 
The middleware will either raise a `TypeError` due to wrong argument count/types, or silently misinterpret `{'kmip_uid': None}` as the row `id`, attempting to look up a row by dict identity \u2014 which will fail.\n\nThe intent (from surrounding context in `clear_sync_pending_zfs_keys`, lines 153-161) is clearly to update the specific dataset record `ds` to clear its `kmip_uid`. The missing argument is `ds['id']`.\n\nThis means `clear_sync_pending_zfs_keys` will **always raise an error** when processing any dataset whose `encryption_key` is set, leaving `kmip_uid` values un-cleared and the sync-pending state stale.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "decorator_injection", - "dimension_name": "Decorator Double-Injection Analysis", - "evidence": "Step 1: `clear_sync_pending_zfs_keys` at lines 153-160 iterates over encrypted datasets with non-null `kmip_uid`.\nStep 2: For a dataset where `ds['encryption_key']` is truthy (line 156), it calls `datastore.update` at line 157.\nStep 3: The call is `('datastore.update', 'storage.encrypteddataset', {'kmip_uid': None})` \u2014 three args total, but `datastore.update` requires four: `(method, table, id, data)`.\nStep 4: Compare with correct usages at line 93: `self.middleware.call_sync('datastore.update', 'storage.encrypteddataset', ds['id'], update_data)` and line 121: same pattern with `ds['id']`.\nStep 5: The missing `ds['id']` means the dict `{'kmip_uid': None}` is passed as the `id` parameter \u2014 this will cause a runtime error in the datastore layer when it tries to use a dict as a row identifier.", - "file_path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "id": "f_002", - "line_end": 157, - "line_start": 157, - "score": 0.95, - "severity": "critical", - "suggestion": "Add the missing `ds['id']` argument:\n\n```python\nawait self.middleware.call('datastore.update', 'storage.encrypteddataset', ds['id'], {'kmip_uid': None})\n```\n\nThis matches the 
pattern used elsewhere in the codebase (e.g., line 93 and line 121).", - "tags": [ - "runtime-error", - "wrong-arguments", - "data-integrity", - "typo" - ], - "title": "Missing `id` argument in `datastore.update` call \u2014 wrong argument count, update never applied to correct row" - }, - { - "active_multipliers": [], - "body": "**The old comparison was provably always `False`.**\n\nIn the prior code (`bde8f1de3b`), the guard in `inherit_parent_encryption_properties_impl` read:\n\n```python\nif ZFSKeyFormat(parent_encrypted_root.key_format.value) == ZFSKeyFormat.PASSPHRASE.value:\n```\n\nThe left-hand side is `ZFSKeyFormat('PASSPHRASE')` \u2014 a `ZFSKeyFormat` enum *instance* \u2014 while the right-hand side is `ZFSKeyFormat.PASSPHRASE.value` \u2014 the raw string `'PASSPHRASE'`. Python's `==` for `Enum` instances does **not** fall back to comparing against the `.value`; an enum member only equals itself (or another member with the same identity), never a plain string. This was verified:\n\n```\nZFSKeyFormat('PASSPHRASE') == 'PASSPHRASE' # \u2192 False, always\n```\n\n**What the guard was supposed to do:** prevent a key-encrypted dataset (`id_`) that has its own key-encrypted child encryption roots from inheriting a passphrase-encrypted parent root. If such a dataset were allowed to inherit, its key-encrypted children would end up under a passphrase root, violating the invariant that passphrase roots cannot have key-encrypted encryption-root descendants.\n\n**Behavioral change introduced by the fix:** The new code uses:\n\n```python\nif parent_encrypted_root['key_format']['value'] == ZFSKeyFormat.PASSPHRASE.value:\n```\n\nThis is a string-to-string comparison (`'PASSPHRASE' == 'PASSPHRASE'`) that evaluates to `True` correctly. 
For the first time, the inner `any(...)` check that looks for key-encrypted child encryption roots is actually executed, and if any are found, a `CallError` is raised, preventing the operation.\n\n**Concrete scenario now blocked that was previously silently allowed:**\n\n1. Pool `tank` has dataset `tank/passroot` encrypted with a passphrase (encryption root).\n2. Under it, `tank/passroot/keyroot` is a key-encrypted encryption root (HEX format).\n3. Under `keyroot`, `tank/passroot/keyroot/keychild` is *also* a key-encrypted encryption root.\n4. A user calls `pool.dataset.inherit_parent_encryption_properties('tank/passroot/keyroot')`.\n5. **Old code:** guard fires `False`, inner check is skipped, `change_encryption_root` executes. `keyroot` now falls under `passroot`'s passphrase root, but `keychild` remains a separate key-encrypted root under a passphrase root \u2014 an explicitly forbidden structure.\n6. **New code:** guard fires `True`, inner `any()` detects `keychild`, raises `CallError` with a clear message. The operation is rejected.\n\n**Does any existing production workflow depend on the old no-op guard?** The only test exercising `inherit_parent_encryption_properties` (`test_key_encrypted_dataset` at line 404) uses a *hex-key* parent root, so `parent_encrypted_root['key_format']['value'] == 'HEX'`, and the guard evaluates to `False` in both old and new code. That test is unaffected. There is no test covering the now-enforced case (passphrase parent root + key-encrypted child roots), which is the exact gap described below.", - "confidence": 0.98, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "enum-comparison-guard", - "dimension_name": "Enum vs String Comparison Bug in Encryption Root Guard", - "evidence": "Step 1: Old code at `bde8f1de3b` line ~222: `if ZFSKeyFormat(parent_encrypted_root.key_format.value) == ZFSKeyFormat.PASSPHRASE.value:`\nStep 2: `parent_encrypted_root.key_format.value` is a string, e.g. 
`'PASSPHRASE'`.\nStep 3: `ZFSKeyFormat('PASSPHRASE')` constructs `ZFSKeyFormat.PASSPHRASE`, an enum instance.\nStep 4: `ZFSKeyFormat.PASSPHRASE == 'PASSPHRASE'` \u2192 `False` (Python Enum.__eq__ compares member identity, not value string).\nStep 5: The `if` body (the `any()` child-root check and potential `raise CallError`) is NEVER reached regardless of input.\nStep 6: `change_encryption_root` / `zfs.dataset.change_encryption_root` always executes even when the parent root is passphrase-encrypted and the dataset has key-encrypted child roots.\nVerification: `python3 -c \"from enum import Enum; class E(Enum): P='PASSPHRASE'; print(E('PASSPHRASE') == 'PASSPHRASE')\"` prints `False`.", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "id": "f_003", - "line_end": 261, - "line_start": 248, - "score": 0.686, - "severity": "important", - "suggestion": "The fix is correct. The only follow-up needed is a regression test for the newly-enforced path: create a passphrase-encrypted root, a key-encrypted encryption root beneath it, and a second key-encrypted encryption root as a child of that \u2014 then assert that `inherit_parent_encryption_properties` on the middle dataset raises a `CallError`. This ensures the guard remains correct if the code is refactored again.", - "tags": [ - "logic-error", - "enum-comparison", - "security", - "encryption", - "guard-bypassed" - ], - "title": "Old guard was always False: key-encrypted child under passphrase-root inheritance was never blocked" - }, - { - "active_multipliers": [], - "body": "The bare `except Exception as e` branch on line 229 catches `ZFSKeyAlreadyLoadedException` and `ZFSNotEncryptedException` (both plain `Exception` subclasses from `zfs/exceptions.py`) and converts them to `failed[name]['error'] = str(e)` \u2014 a raw string embedded in the return value dict.\n\nThis is a contract violation because:\n1. 
These exceptions are **pre-condition guards** (dataset not encrypted, or key already loaded) that signal programmer/caller errors, not transient ZFS crypto failures. Treating them identically to \"Invalid Key\" hides the actual cause.\n2. The `unlock` API method's structured return `{'unlocked': [...], 'failed': {...}}` will surface these as opaque string errors (e.g. `\"'pool/ds' key is already loaded\"`) with no errno or structured error code, making it impossible for callers to distinguish pre-condition failures from crypto failures.\n3. The old code path (before `load_key` was extracted to `zfs/encryption.py`) presumably raised `CallError` directly \u2014 the refactoring broke this by introducing new exception types without updating the catch sites.\n\nSpecifically:\n- `ZFSKeyAlreadyLoadedException` raised at `encryption.py:33` falls into `except Exception` at `dataset_encryption_lock.py:229`\n- `ZFSNotEncryptedException` raised at `encryption.py:31` similarly falls into `except Exception` at `dataset_encryption_lock.py:229`\n\nNeither is ever re-raised as a `CallError`.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "exception-handling-contract", - "dimension_name": "Exception Handling Contract", - "evidence": "Step 1: `unlock` calls `load_key(tls, name, key=datasets[name]['key'])` at line 222.\nStep 2: `load_key` in `zfs/encryption.py:31` calls `rsrc.crypto()`, and if it returns `None`, raises `ZFSNotEncryptedException(dataset)` \u2014 a subclass of plain `Exception` (confirmed at `exceptions.py:20`).\nStep 3: `load_key` at `encryption.py:33` raises `ZFSKeyAlreadyLoadedException(dataset)` if `crypto.info().key_is_loaded` is True \u2014 also a plain `Exception` subclass (`exceptions.py:14`).\nStep 4: Neither exception is a `ZFSException` subclass (imported from `truenas_pylibzfs`), so the `except ZFSException as e` block at line 223 does NOT catch them.\nStep 5: They fall through to `except Exception as e` at line 229, 
where `failed[name]['error'] = str(e)` stores the message string `\"'pool/ds' key is already loaded\"` or `\"'pool/ds' is not encrypted\"` \u2014 no `CallError`, no errno.", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "id": "f_005", - "line_end": 231, - "line_start": 229, - "score": 0.665, - "severity": "important", - "suggestion": "Either (a) make `ZFSKeyAlreadyLoadedException` and `ZFSNotEncryptedException` inherit from `CallError` (with appropriate `errno` values such as `errno.ENOTSUP` for not-encrypted and `errno.EEXIST` for already-loaded), OR (b) add an explicit catch before the bare `except Exception` block:\n```python\nfrom middlewared.plugins.zfs.exceptions import ZFSKeyAlreadyLoadedException, ZFSNotEncryptedException\n\ntry:\n load_key(tls, name, key=datasets[name]['key'])\nexcept ZFSKeyAlreadyLoadedException:\n # Key already loaded means dataset is effectively unlocked; treat as success or specific error\n failed[name]['error'] = 'Key is already loaded'\n continue\nexcept ZFSNotEncryptedException:\n failed[name]['error'] = 'Dataset is not encrypted'\n continue\nexcept ZFSException as e:\n ...\nexcept Exception as e:\n failed[name]['error'] = str(e)\n continue\n```\nOption (a) is cleaner and ensures these exceptions carry structured error information everywhere they propagate.", - "tags": [ - "exception-handling", - "api-contract", - "error-propagation" - ], - "title": "ZFSKeyAlreadyLoadedException and ZFSNotEncryptedException silently swallowed as string errors instead of structured CallError" - }, - { - "active_multipliers": [], - "body": "**`from_previous` is invoked exclusively on incoming write operations (argument upgrade), never on reads (API responses).**\n\nThe `APIVersionsAdapter` in `legacy_api_method.py` upgrades incoming parameters from an older API version to the current version via `_adapt_params`, which calls `adapter.adapt(params_dict, model_name, self.api_version, 
self.adapter.current_version)`. Because `version1_index < version2_index` the direction resolves to `Direction.UPGRADE`, triggering `new_model.from_previous(value)` at `version.py:233`.\n\nConversely, `_dump_result` adapts the **result** from `current_version` back to `api_version` (downgrade direction), which calls `to_previous`. Neither `PoolDatasetChangeKeyOptions` nor `PoolCreateEncryptionOptions` define `to_previous`, so outgoing responses are never touched.\n\n**Practical impact:** An automation client or script pinned to API v25.x that deliberately submits `pbkdf2iters=350000` (valid under `ge=100000` in v25.10.x) will have that value silently overwritten to `1300000` by `from_previous` before the `change_key` handler executes. The caller receives `{\"result\": null}` \u2014 the standard success response for `PoolDatasetChangeKeyResult` \u2014 with no indication that a different iteration count was actually applied to ZFS.\n\nNote: `pbkdf2iters` is only forwarded to the ZFS layer when `passphrase_key_format=True` (plugin line 114), so this affects only passphrase-encrypted datasets. For raw-hex keyed datasets `pbkdf2iters` is excluded from `opts` entirely and no iteration count is stored.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "pbkdf2iters-migration-behavior", - "dimension_name": "PBKDF2 Iteration Count Silent Migration", - "evidence": "Step 1: Client on API v25.10.2 calls `pool.dataset.change_key` with `options={\"pbkdf2iters\": 350000, \"passphrase\": \"mypass\"}`. 
Old model allows this: `pbkdf2iters: int = Field(default=350000, ge=100000)` (v25_10_2/pool_dataset.py:175).\nStep 2: `LegacyAPIMethod.call()` (legacy_api_method.py:60) calls `_adapt_params()` \u2192 `adapter.adapt(params_dict, 'PoolDatasetChangeKeyArgs', 'v25.10.2', 'v26.0.0')`.\nStep 3: `adapt_model` computes `version1_index < version2_index` \u2192 `direction = Direction.UPGRADE`.\nStep 4: `_adapt_value` on `PoolDatasetChangeKeyArgs` calls `_adapt_nested_value` on the `options` field because both versions define a model named `PoolDatasetChangeKeyOptions`; this triggers a recursive `_adapt_value` call.\nStep 5: At the end of the nested `_adapt_value`, line 233 of version.py: `value = new_model.from_previous(value)` where `new_model` is v26_0_0's `PoolDatasetChangeKeyOptions`.\nStep 6: `from_previous` (pool_dataset.py:185) executes `value['pbkdf2iters'] = max(1300000, 350000)` \u2192 `1300000`.\nStep 7: `change_key` plugin receives `options['pbkdf2iters'] == 1300000`, passes it to `validate_encryption_data` (line 191), which includes it in `opts` because `passphrase_key_format=True` (line 114).\nStep 8: `zfs/encryption.py::change_key()` permanently stores `pbkdf2iters=1300000` in the dataset's ZFS config.\nStep 9: `_dump_result` downgrades `{\"result\": null}` \u2014 no clamping info is surfaced.", - "file_path": "src/middlewared/middlewared/api/v26_0_0/pool_dataset.py", - "id": "f_011", - "line_end": 186, - "line_start": 183, - "score": 0.665, - "severity": "important", - "suggestion": "At minimum, emit a job log warning when `pbkdf2iters` is clamped upward. A job-status message such as `job.set_progress(0, f'Note: pbkdf2iters elevated from submitted value to {options[\"pbkdf2iters\"]}')` would make the override visible to operators. 
Longer-term, consider returning the effective `pbkdf2iters` in the result payload or adding a `to_previous` on the result model so legacy clients can detect the discrepancy.", - "tags": [ - "api-versioning", - "silent-migration", - "encryption", - "pbkdf2" - ], - "title": "from_previous fires on write only; legacy API callers have pbkdf2iters silently upgraded to 1,300,000 without any notification" - }, - { - "active_multipliers": [], - "body": "The `lock` lambda on `sync_db_keys` uses `args` (the entire raw-arguments list) rather than `args[0]` (the first positional argument, `name`):\n\n```python\n@job(lock=lambda args: f'sync_encrypted_pool_dataset_keys_{args}')\ndef sync_db_keys(self, job, tls, name=None):\n```\n\nThe `@job` and `@pass_thread_local_storage` decorators are both **pure marker decorators** \u2014 they stamp attributes on the function and return it unchanged. `Job.__init__` stores the raw caller-supplied `params` list as `self.args`, and the lock lambda is evaluated with that list before the job is queued (in `JobsQueue.handle_lock` \u2192 `Job.get_lock_name`). The `tls` object is injected at run time in `Job.__run_body`, well after lock computation, so `tls` is **not** visible to the lambda.\n\nThe real problem is that `name` has a default of `None`. This means:\n\n| Call site | `self.args` passed to lambda | Resulting lock key |\n|---|---|---|\n| Periodic scheduler (no args) | `[]` | `sync_encrypted_pool_dataset_keys_[]` |\n| `call_sync('pool.dataset.sync_db_keys', 'tank')` | `['tank']` | `sync_encrypted_pool_dataset_keys_['tank']` |\n| `call_sync('pool.dataset.sync_db_keys', None)` | `[None]` | `sync_encrypted_pool_dataset_keys_[None]` |\n\nThe periodic invocation produces the key `sync_encrypted_pool_dataset_keys_[]` while an explicit `sync_db_keys(None)` produces `sync_encrypted_pool_dataset_keys_[None]` \u2014 these are **different lock keys**, so the two calls do NOT share a lock and can run concurrently. 
This defeats the purpose of the lock for the all-datasets sync case.\n\nBy contrast, the `encryption_summary` lock lambda on the same class correctly uses `args[0]`:\n```python\n@job(lock=lambda args: f'encryption_summary_options_{args[0]}', ...)\n```\n\nAdditionally, the lock key includes Python list-repr brackets (e.g., `['tank']`) rather than a clean string like `tank`, making the key non-human-readable and fragile if calling conventions change.", - "confidence": 0.92, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "decorator-order-lock-key", - "dimension_name": "Decorator Order and Lock Key Correctness", - "evidence": "Step 1: `sync_db_keys` is decorated with `@job(lock=lambda args: f'sync_encrypted_pool_dataset_keys_{args}')` at line 161.\nStep 2: `@job` is a pure marker decorator (`decorators.py:153-166`) \u2014 it sets `fn._job = {'lock': lock, ...}` and returns `fn` unchanged.\nStep 3: `_call_prepare` in `main.py:880` constructs `Job(self, name, serviceobj, methodobj, params, ...)` where `params` is the raw caller-supplied arguments list.\nStep 4: `Job.__init__` at `job.py:333` stores `self.args = args` (the `params` parameter passed in).\nStep 5: `JobsQueue.add` at `job.py:149` calls `self.handle_lock(job)`, which calls `job.get_lock_name()` at `job.py:422`: `lock_name = lock_name(self.args)` \u2014 so the lambda receives the raw `params` list.\nStep 6: Periodic scheduler calls `sync_db_keys` with zero user arguments \u2192 `params = []` \u2192 lambda receives `[]` \u2192 lock key is `sync_encrypted_pool_dataset_keys_[]`.\nStep 7: Explicit `call_sync('pool.dataset.sync_db_keys', None)` \u2192 `params = [None]` \u2192 lambda receives `[None]` \u2192 lock key is `sync_encrypted_pool_dataset_keys_[None]`.\nStep 8: Keys differ \u2192 neither invocation blocks the other \u2192 two full-dataset syncs can run concurrently.", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "id": "f_009", - "line_end": 162, - 
"line_start": 161, - "score": 0.644, - "severity": "important", - "suggestion": "Change the lambda to extract only the first argument and treat a missing argument as `None`, mirroring the pattern used by `encryption_summary`:\n\n```python\n@job(lock=lambda args: f'sync_encrypted_pool_dataset_keys_{args[0] if args else None}')\n```\n\nThis ensures:\n- A periodic call (no args) and an explicit `call(..., None)` both produce the same lock key: `sync_encrypted_pool_dataset_keys_None`\n- A call with a specific pool name produces `sync_encrypted_pool_dataset_keys_tank`\n- The key no longer contains list brackets", - "tags": [ - "locking", - "concurrency", - "decorator-order", - "correctness" - ], - "title": "`sync_db_keys` lock lambda embeds the full args list, causing inconsistent lock keys between periodic and explicit calls" - }, - { - "active_multipliers": [], - "body": "**Existing datasets with `pbkdf2iters` between 100,000 and 1,299,999 will have their iteration count permanently changed to 1,300,000 on the next `change_key` call, regardless of whether the user explicitly requested this change.**\n\nThere are two distinct triggers:\n\n1. **Legacy API client omits `pbkdf2iters`:** The v25.10.x default was 350,000. When a v25.x client calls `change_key` without specifying `pbkdf2iters`, `_adapt_value` fills in the missing field using the **v26.0.0 new default** of `1300000` (version.py:226: `value[key_to_use] = field_info.get_default(call_default_factory=True)`). `from_previous` then sees `max(1300000, 1300000)` which is a no-op \u2014 but the applied value is the new default, not what the user would have expected from their v25.x context.\n\n2. **Legacy API client explicitly submits `pbkdf2iters=350000`:** `from_previous` clamps it to 1,300,000 as described in the companion finding.\n\nIn both cases, `change_key` permanently alters the ZFS dataset property `pbkdf2iters`. 
Once a dataset is re-keyed at 1,300,000 iterations, every subsequent passphrase-unlock of that dataset (at boot, during HA failover, or via `pool.dataset.unlock`) will run PBKDF2 with 1,300,000 iterations. The user never saw a prompt asking to confirm this change, and the API response `{\"result\": null}` provides no visibility into what iteration count was applied.\n\n**Scope:** Only passphrase-encrypted datasets are affected (line 114 of `dataset_encryption_operations.py` guards `pbkdf2iters` inclusion on `passphrase_key_format=True`). Raw-hex keyed datasets are not affected.", - "confidence": 0.92, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "pbkdf2iters-migration-behavior", - "dimension_name": "PBKDF2 Iteration Count Silent Migration", - "evidence": "Step 1: User has a passphrase-encrypted dataset with `pbkdf2iters=350000` (set under v25.x).\nStep 2: User or script calls `pool.dataset.change_key` via v25.x API client without specifying `pbkdf2iters`.\nStep 3: `_adapt_value` (version.py:224-227) detects `pbkdf2iters` is absent; since the field has a default in v26 (`1300000`), it fills: `value['pbkdf2iters'] = 1300000`.\nStep 4: `from_previous` is a no-op for `max(1300000, 1300000)`, but the effective value is now 1,300,000 instead of the user's expected 350,000.\nStep 5: `change_key` plugin line 191 passes `pbkdf2iters: 1300000` to `validate_encryption_data`.\nStep 6: Since `passphrase_key_format=True`, line 114 includes `pbkdf2iters` in `opts`.\nStep 7: `zfs/encryption.py::change_key()` writes `pbkdf2iters=1300000` permanently to ZFS.\nStep 8: API returns `{\"result\": null}` \u2014 no indication the iteration count was elevated.", - "file_path": "src/middlewared/middlewared/api/v26_0_0/pool_dataset.py", - "id": "f_012", - "line_end": 186, - "line_start": 175, - "score": 0.644, - "severity": "important", - "suggestion": "Compare `options['pbkdf2iters']` against the dataset's current stored iteration count before applying the change 
(available via `ds['pbkdf2iters']['parsed']` from `get_instance_quick`). If the value is being elevated due to the minimum-floor and not due to the user explicitly passing the new value, emit a warning. Consider adding a `pbkdf2iters_effective` field to `PoolDatasetChangeKeyResult` so callers can detect the actual value applied.", - "tags": [ - "encryption", - "silent-mutation", - "pbkdf2", - "dataset-state-change", - "api-versioning" - ], - "title": "Existing passphrase-encrypted datasets silently re-keyed at 3.7x higher iteration count on next change_key call via any API version" - }, - { - "active_multipliers": [], - "body": "`ZFSKeyAlreadyLoadedException` (line 14) and `ZFSNotEncryptedException` (line 20) both inherit directly from `Exception`. This is the root cause of the contract break identified in the other findings.\n\nIn the TrueNAS middleware architecture, user-facing errors are expected to be `CallError` instances (with an `errno` attribute). Any unhandled non-`CallError` exception that escapes a service method is treated as an internal server error by the WebSocket API layer, producing unstructured error responses.\n\nBy making these exceptions plain `Exception` subclasses:\n1. Every call site that calls `load_key()`, `check_key()`, `change_key()`, or `change_encryption_root()` must manually wrap exceptions to convert them to `CallError` \u2014 creating a systemic catch-site gap.\n2. Existing bare `except Exception` handlers (as in `dataset_encryption_lock.py:229`) silently absorb them as string errors with no errno, making them indistinguishable from other failures.\n3. 
The `.message` attribute is redundant with `str(e)` since `super().__init__(self.message)` already sets the string representation \u2014 the `.message` attribute is never used by any handler.", - "confidence": 0.9, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "exception-handling-contract", - "dimension_name": "Exception Handling Contract", - "evidence": "Step 1: `exceptions.py:14` \u2014 `class ZFSKeyAlreadyLoadedException(Exception)` \u2014 base class is plain `Exception`.\nStep 2: `exceptions.py:20` \u2014 `class ZFSNotEncryptedException(Exception)` \u2014 base class is plain `Exception`.\nStep 3: These are imported and raised in `zfs/encryption.py` at lines 31, 33, 58, 88, 105.\nStep 4: `dataset_encryption_lock.py:229` and `dataset_encryption_operations.py:200,263` are call sites with no conversion to `CallError`.\nStep 5: The middleware WebSocket error dispatch (not read, but standard TrueNAS architecture) wraps `CallError` into structured JSON error responses with errno codes; plain `Exception` becomes an unstructured internal error.", - "file_path": "src/middlewared/middlewared/plugins/zfs/exceptions.py", - "id": "f_007", - "line_end": 23, - "line_start": 14, - "score": 0.63, - "severity": "important", - "suggestion": "Change the base class of both exceptions to `CallError` with appropriate errno values:\n```python\nfrom middlewared.service.core import CallError # or wherever CallError is importable\nimport errno\n\nclass ZFSKeyAlreadyLoadedException(CallError):\n def __init__(self, path: str):\n super().__init__(f\"{path!r} key is already loaded\", errno=errno.EEXIST)\n\nclass ZFSNotEncryptedException(CallError):\n def __init__(self, path: str):\n super().__init__(f\"{path!r} is not encrypted\", errno=errno.ENOTSUP)\n```\nThis ensures that wherever these exceptions propagate \u2014 through `except Exception`, `except CallError`, or unhandled \u2014 they carry structured error information and are handled correctly by the middleware's error 
dispatch layer. Note: verify there are no circular import issues between `middlewared.plugins.zfs` and `middlewared.service`; if so, an intermediate base class in `zfs/exceptions.py` may be needed.", - "tags": [ - "exception-hierarchy", - "api-contract", - "architecture", - "error-propagation" - ], - "title": "Custom ZFS exceptions inherit from plain Exception instead of CallError, breaking structured error propagation across all callers" - }, - { - "active_multipliers": [], - "body": "`dataset_encryption_operations.py:200` calls `change_key(tls, id_, encryption_dict, key)` with no surrounding try/except. The `change_key` function in `zfs/encryption.py:87-88` can raise `ZFSNotEncryptedException` if `rsrc.crypto()` returns `None`.\n\nAlthough the `change_key` method does validate `ds['encrypted']` at line 134 via `verrors.add`, this is a **database/metadata check** \u2014 it does NOT prevent a race condition where the ZFS state diverges from the database (e.g. dataset was recreated between the query and the `change_key` call). 
If the ZFS layer reports the dataset as unencrypted but the DB still has it marked encrypted, `ZFSNotEncryptedException` will propagate all the way to the WebSocket API layer as an unhandled `Exception`, not a `CallError`.\n\nSimilarly, `change_encryption_root` at `dataset_encryption_operations.py:263` calls `change_encryption_root(tls, id_)` which also raises `ZFSNotEncryptedException` at `encryption.py:104-105` with no catch.", - "confidence": 0.82, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "exception-handling-contract", - "dimension_name": "Exception Handling Contract", - "evidence": "Step 1: `change_key` method in `dataset_encryption_operations.py:200` calls `change_key(tls, id_, encryption_dict, key)` with no try/except.\nStep 2: `change_key` in `zfs/encryption.py:86-88`: `rsrc = open_resource(tls, dataset); if (crypto := rsrc.crypto()) is None: raise ZFSNotEncryptedException(dataset)`.\nStep 3: `ZFSNotEncryptedException` inherits from `Exception` (confirmed at `exceptions.py:20`), NOT from `CallError`.\nStep 4: No catch exists between `encryption.py:88` and the WebSocket layer. The exception propagates as a raw `Exception`.\nStep 5: The WebSocket API layer expects `CallError` for user-facing error messages with structured errno codes. 
A raw `Exception` results in an unstructured 500-style error.\nSame path applies to `change_encryption_root` at `dataset_encryption_operations.py:263` calling `encryption.py:103-105`.", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "id": "f_006", - "line_end": 200, - "line_start": 200, - "score": 0.574, - "severity": "important", - "suggestion": "Wrap the `change_key` and `change_encryption_root` calls with try/except to convert `ZFSNotEncryptedException` (and `ZFSKeyAlreadyLoadedException` if applicable) into `CallError`:\n```python\nfrom middlewared.plugins.zfs.exceptions import ZFSNotEncryptedException\n\ntry:\n change_key(tls, id_, encryption_dict, key)\nexcept ZFSNotEncryptedException as e:\n raise CallError(str(e), errno=errno.ENOTSUP)\n```\nAlternatively, make `ZFSNotEncryptedException` a subclass of `CallError` with a fixed errno so it automatically presents correctly to all callers throughout the codebase.", - "tags": [ - "exception-handling", - "api-contract", - "race-condition", - "error-propagation" - ], - "title": "ZFSNotEncryptedException from change_key() propagates as raw Exception to WebSocket API layer \u2014 no CallError wrapping" - }, - { - "active_multipliers": [], - "body": "In the old `zfs.dataset.load_key` service method, all `libzfs.ZFSException` instances were caught and re-raised as `CallError`. In the new `encryption.py:load_key()`, the call to `crypto.load_key(**kwargs)` at line 34 is **not wrapped in any try/except**.\n\nAny `truenas_pylibzfs.ZFSException` raised by `crypto.load_key()` propagates directly out of `encryption.load_key()` back to its caller with:\n- A `.code` attribute (a `ZFSError` enum value)\n- **No `.errmsg`** or **`.errno`** fields in the `CallError` sense\n- No `CallError` wrapping\n\nFor the `unlock` call path in `dataset_encryption_lock.py`, this is handled correctly: `except ZFSException as e:` at line 223 catches these and processes `EZFS_CRYPTOFAILED` vs. 
other codes. So the current only caller handles it.\n\nHowever, the **API contract has silently changed**: any other present or future caller of `encryption.load_key()` that expects `CallError` (because the old `zfs.dataset.load_key` always raised `CallError`) will receive raw `ZFSException` instead. If such a caller reaches the WebSocket dispatch layer without intermediate handling, `websocket_app.py:196-207` catches the bare `Exception`, calls `adapt_exception(e)` (which only handles `subprocess.CalledProcessError` \u2014 not `ZFSException`), and falls back to `send_error(message, EINVAL, str(e))`, losing the original ZFS error code entirely and emitting a generic `EINVAL` to the client.", - "confidence": 0.8, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "error-handling-exception-flow", - "dimension_name": "Exception Handling and Error Flow", - "evidence": "Step 1: `encryption.py:load_key()` calls `crypto.load_key(**kwargs)` at line 34 with no surrounding try/except block.\nStep 2: `truenas_pylibzfs.ZFSException` is the exception type raised by `crypto.load_key()` on failure (e.g., wrong key \u2192 `EZFS_CRYPTOFAILED`).\nStep 3: `ZFSException` has a `.code` attribute (a `ZFSError` enum), but no `.errmsg` or `.errno` in the `CallError` sense.\nStep 4: The old service method `zfs.dataset.load_key` caught all `libzfs.ZFSException` and re-raised as `CallError` \u2014 all callers expected `CallError`.\nStep 5: A hypothetical new caller of `encryption.load_key()` that does not import `truenas_pylibzfs.ZFSException` and uses only `except CallError` will miss the exception.\nStep 6: That uncaught `ZFSException` reaches `websocket_app.py:196`, `adapt_exception(e)` returns `None` (only handles `CalledProcessError`), and `send_error(message, EINVAL, str(e))` emits an unstructured `EINVAL` response to the client.", - "file_path": "src/middlewared/middlewared/plugins/zfs/encryption.py", - "id": "f_008", - "line_end": 34, - "line_start": 34, - "score": 0.56, 
- "severity": "important", - "suggestion": "Either:\n1. **Document the contract explicitly** in `load_key()`'s docstring: state that it may raise `truenas_pylibzfs.ZFSException` directly (in addition to `ZFSNotEncryptedException` and `ZFSKeyAlreadyLoadedException`), so all callers know they must handle `ZFSException`.\n2. **Convert at the boundary**: wrap `crypto.load_key(**kwargs)` in a try/except that re-raises as a typed domain exception (e.g., add `ZFSInvalidKeyException` to `exceptions.py`), so `encryption.py` never leaks `truenas_pylibzfs` types to callers:\n```python\ntry:\n crypto.load_key(**kwargs)\nexcept ZFSException as e:\n if e.code == ZFSError.EZFS_CRYPTOFAILED:\n raise ZFSInvalidKeyException(dataset) from e\n raise\n```\nOption 2 is the cleaner design: it keeps `truenas_pylibzfs` as an internal implementation detail.", - "tags": [ - "api-contract", - "exception-propagation", - "error-handling", - "refactoring" - ], - "title": "Raw truenas_pylibzfs.ZFSException from crypto.load_key() propagates out of encryption.load_key() undecorated, breaking the old CallError contract for any caller outside unlock" - }, - { - "active_multipliers": [], - "body": "**The 3.7x increase from 350,000 to 1,300,000 PBKDF2 iterations is applied unconditionally with no runtime check for hardware capability. On low-power or embedded hardware, this can cause passphrase-based key derivation to exceed unlock timeouts, making encrypted datasets permanently inaccessible without manual CLI intervention.**\n\nOnce a passphrase-encrypted dataset is re-keyed with `pbkdf2iters=1300000` (whether explicitly or via the silent clamping in `from_previous`), every future unlock attempt runs PBKDF2-SHA256 with 1,300,000 iterations synchronously. 
On ARM SoCs and Atom-class CPUs common in consumer NAS hardware:\n- At 350,000 iters: typically ~0.5\u20131 second per dataset\n- At 1,300,000 iters: typically ~2\u20134 seconds per dataset\n\nFor pools with multiple passphrase-encrypted datasets that must all unlock at pool import (a common TrueNAS configuration), unlock times multiply linearly. If this occurs during boot under a systemd service timeout, or during HA failover under a failover timeout, the unlock will fail \u2014 and with `ge=1300000` enforced as the hard minimum, there is **no API path** to reduce the iteration count back down without using the ZFS CLI directly (`zfs change-key -o pbkdf2iters=...`).\n\nThe `change_key` plugin (`dataset_encryption_operations.py:118`) does not measure or estimate key derivation time before applying the new iteration count. Neither `PoolCreateEncryptionOptions` nor `PoolDatasetChangeKeyOptions` expose any per-hardware tuning path below the new minimum.\n\nNote: `PoolCreateEncryptionOptions.from_previous` in `pool.py:152` applies the same clamping on pool creation encryption options. 
For new pool creation this affects the root dataset's initial encryption setup, not just re-keying.", - "confidence": 0.75, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "pbkdf2iters-migration-behavior", - "dimension_name": "PBKDF2 Iteration Count Silent Migration", - "evidence": "Step 1: Passphrase-encrypted dataset is re-keyed to `pbkdf2iters=1300000` via `change_key` (either explicitly or via silent clamping from `from_previous`).\nStep 2: `dataset_encryption_operations.py:191` passes `pbkdf2iters: options['pbkdf2iters']` to `validate_encryption_data`.\nStep 3: `validate_encryption_data` line 114 includes `pbkdf2iters` in `opts` when `passphrase_key_format=True`.\nStep 4: `zfs/encryption.py::change_key()` line 89 calls `tls.lzh.resource_cryptography_config(**props)` with `pbkdf2iters=1300000`, permanently recording it as a ZFS dataset property.\nStep 5: On the next pool import or `pool.dataset.unlock`, ZFS runs PBKDF2-SHA256 with 1,300,000 iterations to derive the wrapping key from the passphrase.\nStep 6: On low-power hardware (e.g., Cortex-A53 at 1.4GHz, ~350k iters/sec for PBKDF2-SHA256), this takes ~3.7 seconds per dataset. With 5 passphrase datasets: ~18.5 seconds total.\nStep 7: If a systemd or HA failover timeout fires during this window, unlock fails; dataset remains locked.\nStep 8: The `ge=1300000` constraint on `PoolDatasetChangeKeyOptions` means there is no supported API path to reduce `pbkdf2iters` on an already-re-keyed dataset \u2014 only direct ZFS CLI access can recover.", - "file_path": "src/middlewared/middlewared/api/v26_0_0/pool.py", - "id": "f_013", - "line_end": 154, - "line_start": 151, - "score": 0.525, - "severity": "important", - "suggestion": "Consider the following mitigations: (1) **Benchmark gate:** Before applying `change_key` with a high `pbkdf2iters`, run a short PBKDF2 benchmark and warn or reject if estimated unlock time exceeds a configurable threshold. 
(2) **System-wide override:** Allow a `tunable` or system config option to set a lower `pbkdf2iters` floor for constrained hardware, overriding the API minimum for that installation. (3) **Recovery documentation:** Explicitly document that `zfs change-key -o pbkdf2iters=` is available as a recovery path if unlock times become prohibitive. (4) **Job warning:** At minimum, have the `change_key` job emit a progress message noting the effective iteration count when it exceeds the old default.", - "tags": [ - "encryption", - "availability", - "hardware", - "pbkdf2", - "timeout-risk", - "embedded" - ], - "title": "3.7x PBKDF2 iteration increase enforced with no hardware capability check; may cause passphrase unlock timeouts making datasets inaccessible" - }, - { - "active_multipliers": [], - "body": "`@pass_thread_local_storage` is a **marker-only decorator** \u2014 it sets `fn._pass_thread_local_storage = True` and returns `fn` unchanged (`decorators.py:221-222`). The actual `tls` injection happens only at API dispatch time: in `main.py:862-865` for normal methods and `job.py:620-621` for `@job` methods.\n\nWhen `sync_zfs_keys` calls `self.push_zfs_keys(tls, ids)` and `self.pull_zfs_keys(tls)` directly (lines 138 and 142), these are **plain Python method calls** \u2014 they bypass the middleware dispatch system entirely. The `_pass_thread_local_storage` attribute on `push_zfs_keys` and `pull_zfs_keys` has **no effect** on direct calls. Therefore, `tls` is supplied exactly once by the caller, and the functions receive it correctly.\n\nThe decorators on `push_zfs_keys`/`pull_zfs_keys` are intentional: they allow those methods to be called independently through the middleware dispatch system (e.g., `self.middleware.call_sync('kmip.push_zfs_keys', ...)`) with `tls` injected automatically. The `# type: ignore` comments are consistent with the decorator's type signature hiding `tls` from external callers.\n\n**No double-injection occurs. 
The code is correct for this pattern.**", - "confidence": 0.98, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "decorator_injection", - "dimension_name": "Decorator Double-Injection Analysis", - "evidence": "Step 1: `pass_thread_local_storage` in `service/decorators.py:209-222` sets `fn._pass_thread_local_storage = True` and returns `fn` unchanged \u2014 no wrapping, no injection at decoration time.\nStep 2: `main.py:862-865` \u2014 injection only occurs inside `_call_prepare`, which is invoked by the middleware dispatch system, not on direct Python calls.\nStep 3: `job.py:620-621` \u2014 same: injection only at job run time via `prepend.append(thread_local_storage)`.\nStep 4: `sync_zfs_keys` at lines 138/142 calls `self.push_zfs_keys(tls, ids)` directly \u2014 this is a plain Python attribute lookup and call, bypassing `_call_prepare` entirely.\nStep 5: `push_zfs_keys` receives `(self, tls, ids)` \u2014 one `tls` from the caller, zero injected by decorator. Correct.", - "file_path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "id": "f_000", - "line_end": 142, - "line_start": 138, - "score": 0.294, - "severity": "suggestion", - "suggestion": "No change needed for the decorator/injection pattern. The explicit `tls` passing at lines 138 and 142 is correct because these are direct Python method calls, not middleware dispatches.", - "tags": [ - "decorator", - "thread-local-storage", - "no-bug", - "call-convention" - ], - "title": "No double-injection bug: explicit tls passing is correct for direct calls" - }, - { - "active_multipliers": [], - "body": "The only integration test for `inherit_parent_encryption_properties` (`tests/api2/test_pool_dataset_encryption.py:404`) exercises the case where the parent's encryption root uses a **hex key** \u2014 so `parent_encrypted_root['key_format']['value'] == 'HEX'`. 
The guard evaluates to `False` in both old and new code, meaning this test provides **zero coverage** of the bug fix.\n\nThe case that was silently broken (passphrase-encrypted parent root + key-encrypted child encryption roots under `id_`) has never been tested. Now that the guard works correctly, there is a real behavioral difference: the operation **raises a `CallError`** instead of silently succeeding. Without a test for this path:\n\n1. There is no automated verification that the `CallError` message is correct.\n2. A future refactor could re-introduce the same type-comparison mistake and no test would catch it.\n3. The complementary allowed case \u2014 passphrase parent root, `id_` has *no* key-encrypted child roots \u2014 is also untested; verifying it proceeds successfully is equally important.\n\nThe guard itself (`any(d['name'] == d['encryption_root'] for d in self.middleware.call_sync('pool.dataset.query', [...]))`) is logically sound and the fix is correct, but the absence of test coverage for the enforced path is a gap worth closing.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "enum-comparison-guard", - "dimension_name": "Enum vs String Comparison Bug in Encryption Root Guard", - "evidence": "Only test reference: `tests/api2/test_pool_dataset_encryption.py:404`\n```python\ndef test_key_encrypted_dataset(self):\n # parent uses HEX key\n payload = {'name': dataset, 'encryption_options': {'key': dataset_token_hex}, ...}\n call('pool.dataset.create', payload)\n # child uses PASSPHRASE\n payload.update({'name': child_dataset, 'encryption_options': {'passphrase': passphrase}})\n call('pool.dataset.create', payload)\n # parent_encrypted_root is the HEX-keyed parent -> guard evaluates False in both old and new code\n call('pool.dataset.inherit_parent_encryption_properties', child_dataset)\n ds = call('pool.dataset.get_instance', child_dataset)\n assert ds['key_format']['value'] == 'HEX', ds\n```\nNo test exercises the 
path where `parent_encrypted_root['key_format']['value'] == 'PASSPHRASE'`.", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "id": "f_004", - "line_end": 261, - "line_start": 248, - "score": 0.285, - "severity": "suggestion", - "suggestion": "Add a test case in `tests/api2/test_pool_dataset_encryption.py` that:\n1. Creates a passphrase-encrypted dataset `P` as an encryption root.\n2. Creates `P/K` as a key-encrypted encryption root (child of P).\n3. Creates `P/K/KC` as a second key-encrypted encryption root (grandchild).\n4. Calls `pool.dataset.inherit_parent_encryption_properties('P/K')` and asserts a `ClientException` / `CallError` is raised containing the expected message.\n5. Also tests the allowed sub-case: `P/K` with no key-encrypted child roots successfully inherits from the passphrase root.", - "tags": [ - "test-coverage", - "encryption", - "guard", - "regression-risk" - ], - "title": "No test covers the newly-enforced rejection path (passphrase root + key-encrypted child roots)" - }, - { - "active_multipliers": [], - "body": "The review prompt raised a concern that if `@pass_thread_local_storage` wraps the `@job`-decorated function, the lock lambda might see `(tls, name)` instead of `(name,)`.\n\nThis concern does **not** apply. Both decorators are pure markers:\n\n```python\n# decorators.py:153-166\ndef check_job(fn):\n fn._job = {'lock': lock, ...}\n return fn # fn is returned unchanged\n\n# decorators.py:221-222\nfn._pass_thread_local_storage = True\nreturn fn # fn is returned unchanged\n```\n\nNeither decorator wraps the function \u2014 they only set attributes. The `tls` object is injected at job run time in `job.py:620-621` inside `Job.__run_body`, well after `get_lock_name()` has already evaluated the lock lambda at queue time. 
The `Job` object is constructed with `params` (raw caller args), and that is what the lambda sees \u2014 never `tls`.\n\nThe actual decorator stacking requirement is documented in `api/base/decorator.py:53-59`: `@job` must be the innermost (bottommost) decorator, and the current ordering is correct.", - "confidence": 0.97, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "decorator-order-lock-key", - "dimension_name": "Decorator Order and Lock Key Correctness", - "evidence": "Step 1: `@pass_thread_local_storage` at `decorators.py:209-222` sets `fn._pass_thread_local_storage = True` and returns `fn` \u2014 no wrapping.\nStep 2: `@job` at `decorators.py:153-166` sets `fn._job = {...}` and returns `fn` \u2014 no wrapping.\nStep 3: `_call_prepare` at `main.py:880` constructs `Job(..., params, job_options, ...)` where `params` is the raw caller args \u2014 `tls` is NOT in this list.\nStep 4: `tls` injection for jobs occurs in `job.py:620-621` inside `Job.__run_body`, which runs after the job has been queued and the lock key has already been computed.\nStep 5: `get_lock_name` at `job.py:422` calls `lock_name(self.args)` where `self.args = params` \u2014 the lambda never sees `tls`.", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "id": "f_010", - "line_end": 162, - "line_start": 158, - "score": 0.097, - "severity": "nitpick", - "suggestion": "No code change needed for this specific concern. 
The decorator order is correct and `tls` is never present in the lock lambda's argument list.", - "tags": [ - "decorator-order", - "false-positive-cleared", - "tls", - "locking" - ], - "title": "Original `tls`-injection concern is a false alarm: decorator order is correct and `tls` is never visible to the lock lambda" - } - ], - "metadata": { - "agent_invocations": 11, - "anatomy": { - "blast_radius": [], - "clusters": [ - { - "description": "", - "files": [ - "" - ], - "id": "cluster_0", - "name": "root", - "primary_language": "" - }, - { - "description": "", - "files": [ - "src/middlewared/middlewared/api/v26_0_0/pool.py", - "src/middlewared/middlewared/api/v26_0_0/pool_dataset.py" - ], - "id": "cluster_1", - "name": "src/middlewared/middlewared/api/v26_0_0", - "primary_language": "python" - }, - { - "description": "", - "files": [ - "src/middlewared/middlewared/plugins/kmip/zfs_keys.py" - ], - "id": "cluster_2", - "name": "src/middlewared/middlewared/plugins/kmip", - "primary_language": "python" - }, - { - "description": "", - "files": [ - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py" - ], - "id": "cluster_3", - "name": "src/middlewared/middlewared/plugins/pool_", - "primary_language": "python" - }, - { - "description": "", - "files": [ - "src/middlewared/middlewared/plugins/zfs/encryption.py", - "src/middlewared/middlewared/plugins/zfs/exceptions.py" - ], - "id": "cluster_4", - "name": "src/middlewared/middlewared/plugins/zfs", - "primary_language": "python" - } - ], - "context_notes": "The removed file `src/middlewared/middlewared/plugins/zfs_/dataset_encryption.py` used `process_pool = True`, meaning every call to `zfs.dataset.*` previously serialized through a subprocess via the process pool mechanism. 
The new code runs synchronously in the middleware's main worker threads, sharing the thread-local `tls.lzh` handle managed by `@pass_thread_local_storage`. This is architecturally consistent with the broader truenas_pylibzfs migration effort visible in other modules (load_unload_impl.py, resource_crud.py, etc.). The `truenas_pylibzfs` dependency (PR #145) must provide: `ZFSResource.crypto()` returning an optional `ZFSResourceCryptography` object; `ZFSResourceCryptography.info()` returning an object with `key_is_loaded: bool`; `ZFSResourceCryptography.load_key(**kwargs)`, `.check_key(**kwargs) -> bool`, `.change_key(info)`, and `.inherit_key()`; and `ZFSLibHandle.resource_cryptography_config(**props)` returning a config object. None of these are visible in this repository \u2014 the PR is incomplete without that upstream merge.", - "dependency_graph": {}, - "files": [ - { - "hunks": [ - { - "content": " key.\"\"\"\n generate_key: bool = False\n \"\"\"Automatically generate the key to be used for dataset encryption.\"\"\"\n- pbkdf2iters: int = Field(ge=100000, default=350000)\n+ pbkdf2iters: int = Field(ge=1300000, default=1300000)\n \"\"\"Number of PBKDF2 iterations for key derivation from passphrase. Higher iterations improve security \\\n- against brute force attacks but increase unlock time. 
Default 350,000 balances security and performance.\"\"\"\n+ against brute force attacks but increase unlock time.\"\"\"\n algorithm: Literal[\n \"AES-128-CCM\", \"AES-192-CCM\", \"AES-256-CCM\", \"AES-128-GCM\", \"AES-192-GCM\", \"AES-256-GCM\"\n ] = \"AES-256-GCM\"", - "header": "@@ -136,9 +136,9 @@ class PoolCreateEncryptionOptions(BaseModel):", - "new_count": 9, - "new_start": 136, - "old_count": 9, - "old_start": 136 - }, - { - "content": " key: Secret[Annotated[str, Field(min_length=64, max_length=64)] | None] = None\n \"\"\"A hex-encoded key specified as an alternative to using `passphrase`.\"\"\"\n \n+ @classmethod\n+ def from_previous(cls, value):\n+ value['pbkdf2iters'] = max(1300000, value['pbkdf2iters'])\n+ return value\n+\n \n class PoolCreateTopologyVdevDRAID(BaseModel):\n type: Literal[\"DRAID1\", \"DRAID2\", \"DRAID3\"]", - "header": "@@ -148,6 +148,11 @@ class PoolCreateEncryptionOptions(BaseModel):", - "new_count": 11, - "new_start": 148, - "old_count": 6, - "old_start": 148 - } - ], - "language": "python", - "lines_added": 7, - "lines_removed": 2, - "path": "src/middlewared/middlewared/api/v26_0_0/pool.py", - "status": "modified" - }, - { - "hunks": [ - { - "content": " \"\"\"Generate a new random encryption key instead of using a provided key or passphrase.\"\"\"\n key_file: bool = False\n \"\"\"Whether the provided key is from a key file rather than entered directly.\"\"\"\n- pbkdf2iters: int = Field(default=350000, ge=100000)\n+ pbkdf2iters: int = Field(default=1300000, ge=1300000)\n \"\"\"Number of PBKDF2 iterations for passphrase-based keys. Higher values improve security against \\\n- brute force attacks but increase unlock time. 
Default 350,000 balances security and performance.\"\"\"\n+ brute force attacks but increase unlock time.\"\"\"\n passphrase: Secret[NonEmptyString | None] = None\n \"\"\"Passphrase to use for encryption key derivation.\"\"\"\n key: Secret[Annotated[str, Field(min_length=64, max_length=64)] | None] = None\n \"\"\"Raw hex-encoded encryption key.\"\"\"\n \n+ @classmethod\n+ def from_previous(cls, value):\n+ value['pbkdf2iters'] = max(1300000, value['pbkdf2iters'])\n+ return value\n+\n \n class PoolDatasetCreateUserProperty(BaseModel):\n key: Annotated[str, Field(examples=[\"custom:backup_policy\", \"org:created_by\"], pattern=\".*:.*\")]", - "header": "@@ -172,14 +172,19 @@ class PoolDatasetChangeKeyOptions(BaseModel):", - "new_count": 19, - "new_start": 172, - "old_count": 14, - "old_start": 172 - } - ], - "language": "python", - "lines_added": 7, - "lines_removed": 2, - "path": "src/middlewared/middlewared/api/v26_0_0/pool_dataset.py", - "status": "modified" - }, - { - "hunks": [ - { - "content": " # See the file LICENSE.IX for complete terms and conditions\n \n from middlewared.api.current import ZFSResourceQuery\n+from middlewared.plugins.zfs.encryption import check_key\n from middlewared.service import job, private, Service\n+from middlewared.service.decorators import pass_thread_local_storage\n \n from .connection import KMIPServerMixin\n ", - "header": "@@ -4,7 +4,9 @@", - "new_count": 9, - "new_start": 4, - "old_count": 7, - "old_start": 4 - }, - { - "content": " return rv\n \n @private\n- def push_zfs_keys(self, ids=None):\n+ @pass_thread_local_storage\n+ def push_zfs_keys(self, tls, ids=None):\n failed = []\n filters = [] if ids is None else [['id', 'in', ids]]\n existing_datasets = self.get_encrypted_datasets(filters)", - "header": "@@ -50,7 +52,8 @@ def get_encrypted_datasets(self, filters):", - "new_count": 8, - "new_start": 52, - "old_count": 7, - "old_start": 50 - }, - { - "content": " if not ds['encryption_key']:\n # We want to make sure we have the 
KMIP server's keys and in-memory keys in sync\n try:\n- if ds['name'] in self.zfs_keys and self.middleware.call_sync(\n- 'zfs.dataset.check_key', ds['name'], {'key': self.zfs_keys[ds['name']]}\n+ if (\n+ ds['name'] in self.zfs_keys\n+ and check_key(tls, ds['name'], key=self.zfs_keys[ds['name']])\n ):\n continue\n else:", - "header": "@@ -59,8 +62,9 @@ def push_zfs_keys(self, ids=None):", - "new_count": 9, - "new_start": 62, - "old_count": 8, - "old_start": 59 - }, - { - "content": " return failed\n \n @private\n- def pull_zfs_keys(self):\n+ @pass_thread_local_storage\n+ def pull_zfs_keys(self, tls):\n existing_datasets = self.get_encrypted_datasets([['kmip_uid', '!=', None]])\n failed = []\n connection_successful = self.middleware.call_sync('kmip.test_connection')", - "header": "@@ -91,7 +95,8 @@ def push_zfs_keys(self, ids=None):", - "new_count": 8, - "new_start": 95, - "old_count": 7, - "old_start": 91 - }, - { - "content": " try:\n if ds['encryption_key']:\n key = ds['encryption_key']\n- elif ds['name'] in self.zfs_keys and self.middleware.call_sync(\n- 'zfs.dataset.check_key', ds['name'], {'key': self.zfs_keys[ds['name']]}\n+ elif (\n+ ds['name'] in self.zfs_keys\n+ and check_key(tls, ds['name'], key=self.zfs_keys[ds['name']])\n ):\n key = self.zfs_keys[ds['name']]\n elif connection_successful:", - "header": "@@ -99,8 +104,9 @@ def pull_zfs_keys(self):", - "new_count": 9, - "new_start": 104, - "old_count": 8, - "old_start": 99 - }, - { - "content": " return failed\n \n @private\n+ @pass_thread_local_storage\n @job(lock=lambda args: f'kmip_sync_zfs_keys_{args}')\n- def sync_zfs_keys(self, job, ids=None):\n+ def sync_zfs_keys(self, job, tls, ids=None):\n if not self.middleware.call_sync('kmip.zfs_keys_pending_sync'):\n return\n config = self.middleware.call_sync('kmip.config')\n conn_successful = self.middleware.call_sync('kmip.test_connection', None, True)\n if config['enabled'] and config['manage_zfs_keys']:\n if conn_successful:\n- failed = 
self.push_zfs_keys(ids)\n+ failed = self.push_zfs_keys(tls, ids) # type: ignore\n else:\n return\n else:\n- failed = self.pull_zfs_keys()\n+ failed = self.pull_zfs_keys(tls) # type: ignore\n if failed:\n self.middleware.call_sync(\n 'alert.oneshot_create', 'KMIPZFSDatasetsSyncFailure', {'datasets': ','.join(failed)}", - "header": "@@ -120,19 +126,20 @@ def pull_zfs_keys(self):", - "new_count": 20, - "new_start": 126, - "old_count": 19, - "old_start": 120 - } - ], - "language": "python", - "lines_added": 16, - "lines_removed": 9, - "path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "status": "modified" - }, - { - "hunks": [ - { - "content": " from middlewared.service.decorators import pass_thread_local_storage\n from middlewared.utils.filter_list import filter_list\n from middlewared.plugins.pool_.utils import get_dataset_parents\n+from middlewared.plugins.zfs.encryption import check_key\n \n from .utils import DATASET_DATABASE_MODEL_NAME, dataset_can_be_mounted, retrieve_keys_from_file, ZFSKeyFormat\n ", - "header": "@@ -18,6 +18,7 @@", - "new_count": 7, - "new_start": 18, - "old_count": 6, - "old_start": 18 - }, - { - "content": " namespace = 'pool.dataset'\n \n @api_method(PoolDatasetEncryptionSummaryArgs, PoolDatasetEncryptionSummaryResult, roles=['DATASET_READ'])\n+ @pass_thread_local_storage\n @job(lock=lambda args: f'encryption_summary_options_{args[0]}', pipes=['input'], check_pipes=False)\n- def encryption_summary(self, job, id_, options):\n+ def encryption_summary(self, job, tls, id_, options):\n \"\"\"\n Retrieve summary of all encrypted roots under `id`.\n ", - "header": "@@ -28,8 +29,9 @@ class Config:", - "new_count": 9, - "new_start": 29, - "old_count": 8, - "old_start": 28 - }, - { - "content": " verrors.check()\n datasets = self.query_encrypted_datasets(id_, {'all': True})\n \n- to_check = []\n+ results = []\n for name, ds in datasets.items():\n ds_key = keys_supplied.get(name, {}).get('key') or ds['encryption_key']\n if 
ZFSKeyFormat(ds['key_format']['value']) == ZFSKeyFormat.RAW and ds_key:\n with contextlib.suppress(ValueError):\n ds_key = bytes.fromhex(ds_key)\n- to_check.append((name, {'key': ds_key}))\n \n- check_job = self.middleware.call_sync('zfs.dataset.bulk_process', 'check_key', to_check)\n- check_job.wait_sync()\n- if check_job.error:\n- raise CallError(f'Failed to retrieve encryption summary for {id_}: {check_job.error}')\n+ try:\n+ valid_key = check_key(tls, name, key=ds_key)\n+ except Exception:\n+ valid_key = False\n \n- results = []\n- for ds_data, status in zip(to_check, check_job.result):\n- ds_name = ds_data[0]\n- data = datasets[ds_name]\n results.append({\n- 'name': ds_name,\n- 'key_format': ZFSKeyFormat(data['key_format']['value']).value,\n- 'key_present_in_database': bool(data['encryption_key']),\n- 'valid_key': bool(status['result']), 'locked': data['locked'],\n+ 'name': name,\n+ 'key_format': ZFSKeyFormat(ds['key_format']['value']).value,\n+ 'key_present_in_database': bool(ds['encryption_key']),\n+ 'valid_key': valid_key,\n+ 'locked': ds['locked'],\n 'unlock_error': None,\n 'unlock_successful': False,\n })\n \n failed = set()\n for ds in sorted(results, key=lambda d: d['name'].count('/')):\n- for i in range(1, ds['name'].count('/') + 1):\n- check = ds['name'].rsplit('/', i)[0]\n+ ds_name = ds['name']\n+ for i in range(1, ds_name.count('/') + 1):\n+ check = ds_name.rsplit('/', i)[0]\n if check in failed:\n- failed.add(ds['name'])\n+ failed.add(ds_name)\n ds['unlock_error'] = f'Child cannot be unlocked when parent \"{check}\" is locked'\n \n- if ds['locked'] and not options['force'] and not keys_supplied.get(ds['name'], {}).get('force'):\n- err = dataset_can_be_mounted(ds['name'], os.path.join('/mnt', ds['name']))\n+ ds_locked = ds['locked']\n+ if ds_locked and not options['force'] and not keys_supplied.get(ds_name, {}).get('force'):\n+ err = dataset_can_be_mounted(ds_name, os.path.join('/mnt', ds_name))\n if ds['unlock_error'] and err:\n ds['unlock_error'] 
+= f' and {err}'\n elif err:", - "header": "@@ -94,42 +96,40 @@ def encryption_summary(self, job, id_, options):", - "new_count": 40, - "new_start": 96, - "old_count": 42, - "old_start": 94 - }, - { - "content": " \n if ds['valid_key']:\n ds['unlock_successful'] = not bool(ds['unlock_error'])\n- elif not ds['locked']:\n+ elif not ds_locked:\n # For datasets which are already not locked, unlock operation for them\n # will succeed as they are not locked\n ds['unlock_successful'] = True\n else:\n- key_provided = ds['name'] in keys_supplied or ds['key_present_in_database']\n+ key_provided = ds_name in keys_supplied or ds['key_present_in_database']\n if key_provided:\n if ds['unlock_error']:\n- if ds['name'] in keys_supplied or ds['key_present_in_database']:\n+ if ds_name in keys_supplied or ds['key_present_in_database']:\n ds['unlock_error'] += ' and provided key is invalid'\n else:\n ds['unlock_error'] = 'Provided key is invalid'\n elif not ds['unlock_error']:\n ds['unlock_error'] = 'Key not provided'\n- failed.add(ds['name'])\n+ failed.add(ds_name)\n \n return results\n \n @periodic(86400)\n @private\n+ @pass_thread_local_storage\n @job(lock=lambda args: f'sync_encrypted_pool_dataset_keys_{args}')\n- def sync_db_keys(self, job, name=None):\n+ def sync_db_keys(self, job, tls, name=None):\n if not self.middleware.call_sync('failover.is_single_master_node'):\n # We don't want to do this for passive controller\n return", - "header": "@@ -137,28 +137,29 @@ def encryption_summary(self, job, id_, options):", - "new_count": 29, - "new_start": 137, - "old_count": 28, - "old_start": 137 - }, - { - "content": " # It is possible we have a pool configured but for some mistake/reason the pool did not import like\n # during repair disks were not plugged in and system was booted, in such cases we would like to not\n # remove the encryption keys from the database.\n- for root_ds in {pool['name'] for pool in self.middleware.call_sync('pool.query')} - {\n- ds['id'] for ds in 
self.middleware.call_sync(\n+ pool_names = {pool['name'] for pool in self.middleware.call_sync('pool.query')}\n+ ds_names = {\n+ ds['id']\n+ for ds in self.middleware.call_sync(\n 'pool.dataset.query', [], {'extra': {'retrieve_children': False, 'properties': []}}\n )\n- }:\n+ }\n+ for root_ds in pool_names - ds_names:\n filters.extend([['name', '!=', root_ds], ['name', '!^', f'{root_ds}/']])\n \n db_datasets = self.query_encrypted_roots_keys(filters)\n encrypted_roots = {\n- d['name']: d for d in self.middleware.call_sync(\n- 'pool.dataset.query', filters, {'extra': {'properties': ['encryptionroot']}}\n- ) if d['name'] == d['encryption_root']\n+ d['name']: d\n+ for d in self.middleware.call_sync(\n+ 'pool.dataset.query',\n+ filters,\n+ {'extra': {'properties': ['encryptionroot', 'keyformat']}}\n+ )\n+ if d['name'] == d['encryption_root']\n }\n+\n to_remove = []\n- check_key_job = self.middleware.call_sync('zfs.dataset.bulk_process', 'check_key', [\n- (name, {'key': db_datasets[name]}) for name in db_datasets\n- ])\n- check_key_job.wait_sync()\n- if check_key_job.error:\n- self.logger.error(f'Failed to sync database keys: {check_key_job.error}')\n+ try:\n+ for ds_name, key in db_datasets.items():\n+ ds = encrypted_roots.get(ds_name)\n+ if ds and ZFSKeyFormat(ds['key_format']['value']) == ZFSKeyFormat.RAW and key:\n+ with contextlib.suppress(ValueError):\n+ key = bytes.fromhex(key)\n+\n+ try:\n+ should_remove = not check_key(tls, ds_name, key=key)\n+ except Exception:\n+ should_remove = True\n+\n+ if should_remove:\n+ to_remove.append(ds_name)\n+\n+ except Exception as exc:\n+ self.logger.error(f'Failed to sync database keys: {exc}')\n return\n \n- for dataset, status in zip(db_datasets, check_key_job.result):\n- if not status['result']:\n- to_remove.append(dataset)\n- elif status['error']:\n- if dataset not in encrypted_roots:\n- to_remove.append(dataset)\n- else:\n- self.logger.error(f'Failed to check encryption status for {dataset}: {status[\"error\"]}')\n-\n 
self.middleware.call_sync('pool.dataset.delete_encrypted_datasets_from_db', [['name', 'in', to_remove]])\n \n @private", - "header": "@@ -167,37 +168,47 @@ def sync_db_keys(self, job, name=None):", - "new_count": 47, - "new_start": 168, - "old_count": 37, - "old_start": 167 - } - ], - "language": "python", - "lines_added": 57, - "lines_removed": 46, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "status": "modified" - }, - { - "hunks": [ - { - "content": " from datetime import datetime\n from pathlib import Path\n \n+from truenas_pylibzfs import ZFSError, ZFSException\n+\n from middlewared.api import api_method\n from middlewared.api.current import (\n PoolDatasetLockArgs, PoolDatasetLockResult, PoolDatasetUnlockArgs, PoolDatasetUnlockResult\n )\n+from middlewared.plugins.zfs.encryption import load_key\n from middlewared.service import CallError, job, private, Service, ValidationErrors\n+from middlewared.service.decorators import pass_thread_local_storage\n from middlewared.utils.filesystem.directory import directory_is_empty\n \n from .utils import (", - "header": "@@ -6,11 +6,15 @@", - "new_count": 15, - "new_start": 6, - "old_count": 11, - "old_start": 6 - }, - { - "content": " return True\n \n @api_method(PoolDatasetUnlockArgs, PoolDatasetUnlockResult, roles=['DATASET_WRITE'])\n+ @pass_thread_local_storage\n @job(lock=lambda args: f'dataset_unlock_{args[0]}', pipes=['input'], check_pipes=False)\n- def unlock(self, job, id_, options):\n+ def unlock(self, job, tls, id_, options):\n \"\"\"\n Unlock dataset `id` (and its children if `unlock_options.recursive` is `true`).\n ", - "header": "@@ -85,8 +89,9 @@ async def lock(self, job, id_, options):", - "new_count": 9, - "new_start": 89, - "old_count": 8, - "old_start": 85 - }, - { - "content": " \n job.set_progress(int(name_i / len(names) * 90 + 0.5), f'Unlocking {name!r}')\n try:\n- self.middleware.call_sync(\n- 'zfs.dataset.load_key', name, {'key': datasets[name]['key'], 
'mount': False}\n- )\n- except CallError as e:\n- failed[name]['error'] = 'Invalid Key' if 'incorrect key provided' in str(e).lower() else str(e)\n+ load_key(tls, name, key=datasets[name]['key'])\n+ except ZFSException as e:\n+ if e.code == ZFSError.EZFS_CRYPTOFAILED:\n+ failed[name]['error'] = 'Invalid Key'\n+ else:\n+ failed[name]['error'] = str(e)\n+ continue\n+ except Exception as e:\n+ failed[name]['error'] = str(e)\n continue\n \n # Before we mount the dataset in question, we should ensure that the path where it will be mounted", - "header": "@@ -214,11 +219,15 @@ def unlock(self, job, id_, options):", - "new_count": 15, - "new_start": 219, - "old_count": 11, - "old_start": 214 - } - ], - "language": "python", - "lines_added": 15, - "lines_removed": 6, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "status": "modified" - }, - { - "hunks": [ - { - "content": " PoolDatasetChangeKeyArgs, PoolDatasetChangeKeyResult, PoolDatasetInheritParentEncryptionPropertiesArgs,\n PoolDatasetInheritParentEncryptionPropertiesResult\n )\n+from middlewared.plugins.zfs.encryption import change_encryption_root, change_key\n from middlewared.service import CallError, job, private, Service, ValidationErrors\n+from middlewared.service.decorators import pass_thread_local_storage\n from middlewared.utils import secrets\n \n from .utils import DATASET_DATABASE_MODEL_NAME, ZFSKeyFormat", - "header": "@@ -4,7 +4,9 @@", - "new_count": 9, - "new_start": 4, - "old_count": 7, - "old_start": 4 - }, - { - "content": " PoolDatasetInsertOrUpdateEncryptedRecordResult,\n roles=['DATASET_WRITE']\n )\n- async def insert_or_update_encrypted_record(self, data):\n+ def insert_or_update_encrypted_record(self, data):\n key_format = data.pop('key_format') or ZFSKeyFormat.PASSPHRASE.value\n if not data['encryption_key'] or ZFSKeyFormat(key_format.upper()) == ZFSKeyFormat.PASSPHRASE:\n # We do not want to save passphrase keys - they are only known to the user\n return\n \n 
ds_id = data.pop('id')\n- ds = await self.middleware.call(\n+ ds = self.middleware.call_sync(\n 'datastore.query', DATASET_DATABASE_MODEL_NAME,\n [['id', '=', ds_id]] if ds_id else [['name', '=', data['name']]]\n )", - "header": "@@ -21,14 +23,14 @@ class Config:", - "new_count": 14, - "new_start": 23, - "old_count": 14, - "old_start": 21 - }, - { - "content": " \n pk = ds[0]['id'] if ds else None\n if ds:\n- await self.middleware.call(\n+ self.middleware.call_sync(\n 'datastore.update',\n DATASET_DATABASE_MODEL_NAME,\n ds[0]['id'], data\n )\n else:\n- pk = await self.middleware.call(\n+ pk = self.middleware.call_sync(\n 'datastore.insert',\n DATASET_DATABASE_MODEL_NAME,\n data\n )\n \n- kmip_config = await self.middleware.call('kmip.config')\n+ kmip_config = self.middleware.call_sync('kmip.config')\n if kmip_config['enabled'] and kmip_config['manage_zfs_keys']:\n- await self.middleware.call('kmip.sync_zfs_keys', [pk])\n+ self.middleware.call_sync('kmip.sync_zfs_keys', [pk])\n \n return pk\n ", - "header": "@@ -37,21 +39,21 @@ async def insert_or_update_encrypted_record(self, data):", - "new_count": 21, - "new_start": 39, - "old_count": 21, - "old_start": 37 - }, - { - "content": " return opts\n \n @api_method(PoolDatasetChangeKeyArgs, PoolDatasetChangeKeyResult, roles=['DATASET_WRITE'])\n+ @pass_thread_local_storage\n @job(lock=lambda args: f'dataset_change_key_{args[0]}', pipes=['input'], check_pipes=False)\n- async def change_key(self, job, id_, options):\n+ def change_key(self, job, tls, id_, options):\n \"\"\"\n Change encryption properties for `id` encrypted dataset.\n ", - "header": "@@ -114,8 +116,9 @@ def validate_encryption_data(self, job, verrors, encryption_dict, schema):", - "new_count": 9, - "new_start": 116, - "old_count": 8, - "old_start": 114 - }, - { - "content": " 1) It has encrypted roots as children which are encrypted with a key\n 2) If it is a root dataset where the system dataset is located\n \"\"\"\n- ds = await 
self.middleware.call('pool.dataset.get_instance_quick', id_, {\n+ ds = self.middleware.call_sync('pool.dataset.get_instance_quick', id_, {\n 'encryption': True,\n })\n verrors = ValidationErrors()", - "header": "@@ -124,7 +127,7 @@ async def change_key(self, job, id_, options):", - "new_count": 7, - "new_start": 127, - "old_count": 7, - "old_start": 124 - }, - { - "content": " )\n elif any(\n d['name'] == d['encryption_root']\n- for d in await self.middleware.call(\n+ for d in self.middleware.call_sync(\n 'pool.dataset.query', [\n ['id', '^', f'{id_}/'], ['encrypted', '=', True],\n ['key_format.value', '!=', ZFSKeyFormat.PASSPHRASE.value]", - "header": "@@ -142,7 +145,7 @@ async def change_key(self, job, id_, options):", - "new_count": 7, - "new_start": 145, - "old_count": 7, - "old_start": 142 - }, - { - "content": " f'{id_} has children which are encrypted with a key. It is not allowed to have encrypted '\n 'roots which are encrypted with a key as children for passphrase encrypted datasets.'\n )\n- elif id_ == (await self.middleware.call('systemdataset.config'))['pool']:\n+ elif id_ == self.middleware.call_sync('systemdataset.config')['pool']:\n verrors.add(\n 'id',\n f'{id_} contains the system dataset. 
Please move the system dataset to a '", - "header": "@@ -154,7 +157,7 @@ async def change_key(self, job, id_, options):", - "new_count": 7, - "new_start": 157, - "old_count": 7, - "old_start": 154 - }, - { - "content": " f'change_key_options.{k}',\n 'Either Key or passphrase must be provided.'\n )\n- elif id_.count('/') and await self.middleware.call(\n+ elif id_.count('/') and self.middleware.call_sync(\n 'pool.dataset.query', [\n ['id', 'in', [id_.rsplit('/', i)[0] for i in range(1, id_.count('/') + 1)]],\n ['key_format.value', '=', ZFSKeyFormat.PASSPHRASE.value], ['encrypted', '=', True]", - "header": "@@ -167,7 +170,7 @@ async def change_key(self, job, id_, options):", - "new_count": 7, - "new_start": 170, - "old_count": 7, - "old_start": 167 - }, - { - "content": " \n verrors.check()\n \n- encryption_dict = await self.middleware.call(\n+ encryption_dict = self.middleware.call_sync(\n 'pool.dataset.validate_encryption_data', job, verrors, {\n 'enabled': True, 'passphrase': options['passphrase'],\n 'generate_key': options['generate_key'], 'key_file': options['key_file'],", - "header": "@@ -181,7 +184,7 @@ async def change_key(self, job, id_, options):", - "new_count": 7, - "new_start": 184, - "old_count": 7, - "old_start": 181 - }, - { - "content": " encryption_dict.pop('encryption')\n key = encryption_dict.pop('key')\n \n- await self.middleware.call(\n- 'zfs.dataset.change_key', id_, {\n- 'encryption_properties': encryption_dict,\n- 'key': key, 'load_key': False,\n- }\n- )\n+ change_key(tls, id_, encryption_dict, key)\n \n # TODO: Handle renames of datasets appropriately wrt encryption roots and db - this will be done when\n # devd changes are in from the OS end\n data = {'encryption_key': key, 'key_format': 'PASSPHRASE' if options['passphrase'] else 'HEX', 'name': id_}\n- await self.insert_or_update_encrypted_record(data)\n+ self.insert_or_update_encrypted_record(data)\n if options['passphrase'] and ZFSKeyFormat(ds['key_format']['value']) != 
ZFSKeyFormat.PASSPHRASE:\n- await self.middleware.call('pool.dataset.sync_db_keys', id_)\n+ self.middleware.call_sync('pool.dataset.sync_db_keys', id_)\n \n data['old_key_format'] = ds['key_format']['value']\n- await self.middleware.call_hook('dataset.change_key', data)\n+ self.middleware.call_hook_sync('dataset.change_key', data)\n \n @api_method(\n PoolDatasetInheritParentEncryptionPropertiesArgs,\n PoolDatasetInheritParentEncryptionPropertiesResult,\n roles=['DATASET_WRITE']\n )\n- async def inherit_parent_encryption_properties(self, id_):\n+ @pass_thread_local_storage\n+ def inherit_parent_encryption_properties(self, tls, id_):\n \"\"\"\n Allows inheriting parent's encryption root discarding its current encryption settings. This\n can only be done where `id` has an encrypted parent and `id` itself is an encryption root.\n \"\"\"\n- ds = await self.middleware.call('pool.dataset.get_instance_quick', id_, {\n+ ds = self.middleware.call_sync('pool.dataset.get_instance_quick', id_, {\n 'encryption': True,\n })\n if not ds['encrypted']:", - "header": "@@ -194,34 +197,30 @@ async def change_key(self, job, id_, options):", - "new_count": 30, - "new_start": 197, - "old_count": 34, - "old_start": 194 - }, - { - "content": " elif '/' not in id_:\n raise CallError('Root datasets do not have a parent and cannot inherit encryption settings')\n else:\n- parent = await self.middleware.call(\n+ parent = self.middleware.call_sync(\n 'pool.dataset.get_instance_quick', id_.rsplit('/', 1)[0], {\n 'encryption': True,\n }", - "header": "@@ -233,7 +232,7 @@ async def inherit_parent_encryption_properties(self, id_):", - "new_count": 7, - "new_start": 232, - "old_count": 7, - "old_start": 233 - }, - { - "content": " if not parent['encrypted']:\n raise CallError('This operation requires the parent dataset to be encrypted')\n else:\n- parent_encrypted_root = await self.middleware.call(\n+ parent_encrypted_root = self.middleware.call_sync(\n 'pool.dataset.get_instance_quick', 
parent['encryption_root'], {\n 'encryption': True,\n }\n )\n- if ZFSKeyFormat(parent_encrypted_root['key_format']['value']) == ZFSKeyFormat.PASSPHRASE.value:\n+ if parent_encrypted_root['key_format']['value'] == ZFSKeyFormat.PASSPHRASE.value:\n if any(\n d['name'] == d['encryption_root']\n- for d in await self.middleware.call(\n+ for d in self.middleware.call_sync(\n 'pool.dataset.query', [\n ['id', '^', f'{id_}/'], ['encrypted', '=', True],\n ['key_format.value', '!=', ZFSKeyFormat.PASSPHRASE.value]", - "header": "@@ -241,15 +240,15 @@ async def inherit_parent_encryption_properties(self, id_):", - "new_count": 15, - "new_start": 240, - "old_count": 15, - "old_start": 241 - }, - { - "content": " 'roots which are encrypted with a key as children for passphrase encrypted datasets.'\n )\n \n- await self.middleware.call('zfs.dataset.change_encryption_root', id_, {'load_key': False})\n- await self.middleware.call('pool.dataset.sync_db_keys', id_)\n- await self.middleware.call_hook('dataset.inherit_parent_encryption_root', id_)\n+ change_encryption_root(tls, id_)\n+ self.middleware.call_sync('pool.dataset.sync_db_keys', id_)\n+ self.middleware.call_hook_sync('dataset.inherit_parent_encryption_root', id_)", - "header": "@@ -261,6 +260,6 @@ async def inherit_parent_encryption_properties(self, id_):", - "new_count": 6, - "new_start": 260, - "old_count": 6, - "old_start": 261 - } - ], - "language": "python", - "lines_added": 29, - "lines_removed": 30, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "status": "modified" - }, - { - "hunks": [ - { - "content": "+import threading\n+from typing import Literal, TypedDict, cast\n+\n+from .exceptions import ZFSKeyAlreadyLoadedException, ZFSNotEncryptedException\n+from .utils import open_resource\n+\n+\n+class EncryptionProperties(TypedDict, total=False):\n+ keyformat: Literal['hex', 'passphrase', 'raw']\n+ keylocation: str\n+ pbkdf2iters: int | None\n+\n+\n+def load_key(tls: threading.local, 
dataset: str, **kwargs: str | bytes) -> None:\n+ \"\"\"\n+ Load the encryption key for a ZFS dataset.\n+\n+ Args:\n+ dataset: Name of the ZFS dataset whose key should be loaded.\n+\n+ Keyword Args:\n+ key: Key material as ``str`` (hex/passphrase) or ``bytes`` (raw).\n+ Mutually exclusive with ``key_location``.\n+ key_location: Path to the key file on disk.\n+ Mutually exclusive with ``key``.\n+ \"\"\"\n+ if len(kwargs) > 1:\n+ raise ValueError('Cannot specify both key and key location')\n+ rsrc = open_resource(tls, dataset)\n+ if (crypto := rsrc.crypto()) is None:\n+ raise ZFSNotEncryptedException(dataset)\n+ if crypto.info().key_is_loaded:\n+ raise ZFSKeyAlreadyLoadedException(dataset)\n+ crypto.load_key(**kwargs)\n+\n+\n+def check_key(tls: threading.local, dataset: str, **kwargs: str | bytes) -> bool:\n+ \"\"\"\n+ Return True if ``key`` (or the key at ``key_location``) can unlock ``dataset``.\n+\n+ Does not actually load the key. Raises ZFSNotEncryptedException if the\n+ dataset is not encrypted or if the ZFS operation fails for a reason other\n+ than a wrong key (EZFS_CRYPTOFAILED returns False rather than raising).\n+\n+ Args:\n+ dataset: Name of the ZFS dataset to check.\n+\n+ Keyword Args:\n+ key: Key material as ``str`` (hex/passphrase) or ``bytes`` (raw).\n+ Mutually exclusive with ``key_location``.\n+ key_location: Path to the key file on disk.\n+ Mutually exclusive with ``key``.\n+ \"\"\"\n+ if len(kwargs) > 1:\n+ raise ValueError('Cannot specify both key and key location')\n+ rsrc = open_resource(tls, dataset)\n+ if (crypto := rsrc.crypto()) is None:\n+ raise ZFSNotEncryptedException(dataset)\n+ return crypto.check_key(**kwargs) # type: ignore[no-any-return]\n+\n+\n+def change_key(\n+ tls: threading.local,\n+ dataset: str,\n+ properties: EncryptionProperties | None = None,\n+ key: str | None = None\n+) -> None:\n+ \"\"\"\n+ Change the encryption key and/or properties for ``dataset``.\n+\n+ The dataset's key must already be loaded before calling 
this.\n+\n+ Args:\n+ dataset: Name of the ZFS dataset whose key should be changed.\n+ properties: May contain any combination of keyformat, keylocation, and\n+ pbkdf2iters.\n+ key: New key material. Required when keylocation is not given.\n+ \"\"\"\n+ props = {} if properties is None else cast(dict[str, str | int | None], properties.copy())\n+ if key:\n+ props.pop('keylocation', None)\n+ props['key'] = key\n+ elif 'keylocation' not in props:\n+ raise ValueError('Must specify either key or key location')\n+\n+ rsrc = open_resource(tls, dataset)\n+ if (crypto := rsrc.crypto()) is None:\n+ raise ZFSNotEncryptedException(dataset)\n+ config = tls.lzh.resource_cryptography_config(**props)\n+ crypto.change_key(info=config)\n+\n+\n+def change_encryption_root(tls: threading.local, dataset: str) -> None:\n+ \"\"\"\n+ Make ``dataset`` inherit encryption from its parent, removing it as\n+ an encryption root.\n+\n+ ``dataset`` must currently be an encryption root and its key must be loaded.\n+\n+ Args:\n+ dataset: Name of the ZFS dataset to remove as an encryption root.\n+ \"\"\"\n+ rsrc = open_resource(tls, dataset)\n+ if (crypto := rsrc.crypto()) is None:\n+ raise ZFSNotEncryptedException(dataset)\n+ crypto.inherit_key()", - "header": "@@ -0,0 +1,106 @@", - "new_count": 106, - "new_start": 1, - "old_count": 0, - "old_start": 0 - } - ], - "language": "python", - "lines_added": 106, - "lines_removed": 0, - "path": "src/middlewared/middlewared/plugins/zfs/encryption.py", - "status": "added" - }, - { - "hunks": [ - { - "content": "-from typing import Collection\n+from typing import Iterable\n \n __all__ = (\n+ \"ZFSKeyAlreadyLoadedException\",\n+ \"ZFSNotEncryptedException\",\n \"ZFSPathAlreadyExistsException\",\n \"ZFSPathInvalidException\",\n \"ZFSPathNotASnapshotException\",", - "header": "@@ -1,6 +1,8 @@", - "new_count": 8, - "new_start": 1, - "old_count": 6, - "old_start": 1 - }, - { - "content": " )\n \n \n+class ZFSKeyAlreadyLoadedException(Exception):\n+ def 
__init__(self, path: str):\n+ self.message = f\"{path!r} key is already loaded\"\n+ super().__init__(self.message)\n+\n+\n+class ZFSNotEncryptedException(Exception):\n+ def __init__(self, path: str):\n+ self.message = f\"{path!r} is not encrypted\"\n+ super().__init__(self.message)\n+\n+\n class ZFSPathAlreadyExistsException(Exception):\n def __init__(self, path: str):\n self.message = f\"{path!r} already exists\"", - "header": "@@ -9,6 +11,18 @@", - "new_count": 18, - "new_start": 11, - "old_count": 6, - "old_start": 9 - }, - { - "content": " \n \n class ZFSPathHasClonesException(Exception):\n- def __init__(self, path: str, clones: Collection[str]):\n+ def __init__(self, path: str, clones: Iterable[str]):\n self.path = path\n self.clones = clones\n self.message = f\"{path!r} has the following clones: {','.join(clones)}\"", - "header": "@@ -16,7 +30,7 @@ def __init__(self, path: str):", - "new_count": 7, - "new_start": 30, - "old_count": 7, - "old_start": 16 - }, - { - "content": " \n \n class ZFSPathHasHoldsException(Exception):\n- def __init__(self, path: str, holds: Collection[str]):\n+ def __init__(self, path: str, holds: Iterable[str]):\n self.message = f\"{path!r} has the following holds: {','.join(holds)}\"\n super().__init__(self.message)\n ", - "header": "@@ -24,7 +38,7 @@ def __init__(self, path: str, clones: Collection[str]):", - "new_count": 7, - "new_start": 38, - "old_count": 7, - "old_start": 24 - } - ], - "language": "python", - "lines_added": 17, - "lines_removed": 3, - "path": "src/middlewared/middlewared/plugins/zfs/exceptions.py", - "status": "modified" - }, - { - "hunks": [ - { - "content": "-import libzfs\n-\n-from middlewared.service import CallError, job, Service\n-\n-\n-class ZFSDatasetService(Service):\n-\n- class Config:\n- namespace = 'zfs.dataset'\n- private = True\n- process_pool = True\n-\n- def common_load_dataset_checks(self, id_, ds):\n- self.common_encryption_checks(id_, ds)\n- if ds.key_loaded:\n- raise CallError(f'{id_} key is 
already loaded')\n-\n- def common_encryption_checks(self, id_, ds):\n- if not ds.encrypted:\n- raise CallError(f'{id_} is not encrypted')\n-\n- def load_key(self, id_: str, options: dict | None = None):\n- if options is None:\n- options = {\n- 'mount': True,\n- 'recursive': False,\n- 'key': None,\n- 'key_location': None,\n- }\n- options.setdefault('mount', True)\n- options.setdefault('recursive', False)\n- options.setdefault('key', None)\n- options.setdefault('key_location', None)\n-\n- mount_ds = options.pop('mount')\n- recursive = options.pop('recursive')\n- try:\n- with libzfs.ZFS() as zfs:\n- ds = zfs.get_dataset(id_)\n- self.common_load_dataset_checks(id_, ds)\n- ds.load_key(**options)\n- except libzfs.ZFSException as e:\n- self.logger.error(f'Failed to load key for {id_}', exc_info=True)\n- raise CallError(f'Failed to load key for {id_}: {e}')\n- else:\n- if mount_ds:\n- self.call_sync2(self.s.zfs.resource.mount, id_, recursive=recursive)\n-\n- def check_key(self, id_: str, options: dict | None = None):\n- \"\"\"\n- Returns `true` if the `key` is valid, `false` otherwise.\n- \"\"\"\n- if options is None:\n- options = {\n- 'key': None,\n- 'key_location': None,\n- }\n-\n- try:\n- with libzfs.ZFS() as zfs:\n- ds = zfs.get_dataset(id_)\n- self.common_encryption_checks(id_, ds)\n- return ds.check_key(**options)\n- except libzfs.ZFSException as e:\n- self.logger.error(f'Failed to check key for {id_}', exc_info=True)\n- raise CallError(f'Failed to check key for {id_}: {e}')\n-\n- def change_key(self, id_: str, options: dict | None = None):\n- if options is None:\n- options = {\n- 'encryption_properties': {},\n- 'load_key': True,\n- 'key': None,\n- }\n-\n- try:\n- with libzfs.ZFS() as zfs:\n- ds = zfs.get_dataset(id_)\n- self.common_encryption_checks(id_, ds)\n- ds.change_key(props=options['encryption_properties'], load_key=options['load_key'], key=options['key'])\n- except libzfs.ZFSException as e:\n- self.logger.error(f'Failed to change key for {id_}', 
exc_info=True)\n- raise CallError(f'Failed to change key for {id_}: {e}')\n-\n- def change_encryption_root(self, id_: str, options: dict | None = None):\n- if options is None:\n- options = {'load_key': True}\n-\n- try:\n- with libzfs.ZFS() as zfs:\n- ds = zfs.get_dataset(id_)\n- ds.change_key(load_key=options['load_key'], inherit=True)\n- except libzfs.ZFSException as e:\n- raise CallError(f'Failed to change encryption root for {id_}: {e}')\n-\n- @job()\n- def bulk_process(self, job, name: str, params: list):\n- f = getattr(self, name, None)\n- if not f:\n- raise CallError(f'{name} method not found in zfs.dataset')\n-\n- statuses = []\n- for i in params:\n- result = error = None\n- try:\n- result = f(*i)\n- except Exception as e:\n- error = str(e)\n- finally:\n- statuses.append({'result': result, 'error': error})\n-\n- return statuses", - "header": "@@ -1,112 +0,0 @@", - "new_count": 0, - "new_start": 0, - "old_count": 112, - "old_start": 1 - } - ], - "language": "", - "lines_added": 0, - "lines_removed": 112, - "path": "", - "status": "removed" - } - ], - "intent_gaps": [ - "The PR description says 'Replace usage of the deprecated py-libzfs with truenas_pylibzfs for these private methods' but does not enumerate which methods. The actual scope is: check_key, load_key, change_key, change_encryption_root in four separate call sites across three files. The description gives no indication that kmip/zfs_keys.py is included.", - "The PR description says 'This removes another use case of our process pool' but does not explain that the `zfs.dataset` service (`process_pool = True`) is being entirely deleted, not just reduced. 
The deleted file's `bulk_process` method was the batch dispatch mechanism; its removal means no more batch key-checking across datasets \u2014 checks are now serial within the job thread.", - "The PR description mentions a dependency on truenas_pylibzfs/pull/145 but does not specify what that PR adds (presumably the `crypto()` method on ZFS resources, `resource_cryptography_config`, and `ZFSResourceCryptography.check_key/load_key/change_key/inherit_key`). The correct behavior of this PR is entirely contingent on that dependency, which is not merged in this repository.", - "The pbkdf2iters security hardening (350k \u2192 1.3M) is not mentioned anywhere in the PR description. Reviewers would not know to scrutinize the performance and migration implications of this change without reading the API model diffs.", - "The PR does not address what happens to the `zfs.dataset.bulk_process` method that was used by callers outside the encryption path (if any). The deleted file's `bulk_process` was a generic dispatcher for any method on `ZFSDatasetService`; its removal is silent and no audit of other callers is documented.", - "The description does not clarify the error-handling philosophy change: old code wrapped all libzfs errors in CallError (friendly, loggable); new code lets raw truenas_pylibzfs ZFSException propagate to callers, relying on catch-all `except Exception` blocks in the job layer for recovery." - ], - "pr_narrative": "This PR replaces the deprecated `py-libzfs` (via `libzfs` Python bindings and the process-pool-dispatched `zfs.dataset` service) with direct `truenas_pylibzfs` calls for four ZFS dataset encryption operations: key loading, key checking, key changing, and encryption root inheritance.\n\n**Old mechanism**: `src/middlewared/middlewared/plugins/zfs_/dataset_encryption.py` defined a `ZFSDatasetService` class (namespace `zfs.dataset`) with `process_pool = True`. 
This class used `import libzfs` and opened a new `libzfs.ZFS()` context for every operation. Callers in `pool_/dataset_encryption_info.py` and `pool_/dataset_encryption_operations.py` dispatched to this service via `self.middleware.call('zfs.dataset.bulk_process', ...)` or `self.middleware.call('zfs.dataset.change_key', ...)` \u2014 meaning all operations ran in a subprocess pool, fully isolated from the main event loop, and all were `async`.\n\n**New mechanism**: A new module `src/middlewared/middlewared/plugins/zfs/encryption.py` is introduced containing four free functions (`load_key`, `check_key`, `change_key`, `change_encryption_root`) that operate directly on `truenas_pylibzfs` objects via a thread-local `tls.lzh` handle. These functions are called inline (no subprocess) from the same thread that holds the job or method. The `@pass_thread_local_storage` decorator is added to every consuming method to inject the `tls` argument, and each consuming method is converted from `async def` to synchronous `def` (with `await self.middleware.call(...)` replaced by `self.middleware.call_sync(...)`).\n\nThe change touches five callers:\n1. `pool_/dataset_encryption_info.py` \u2014 `encryption_summary` and `sync_db_keys` now call `check_key(tls, ...)` directly instead of dispatching a `bulk_process` job.\n2. `pool_/dataset_encryption_lock.py` \u2014 `unlock` now calls `load_key(tls, ...)` directly.\n3. `pool_/dataset_encryption_operations.py` \u2014 `change_key` and `inherit_parent_encryption_properties` now call `change_key(tls, ...)` and `change_encryption_root(tls, ...)` directly; `insert_or_update_encrypted_record` is also converted from `async` to sync.\n4. `kmip/zfs_keys.py` \u2014 `push_zfs_keys` and `pull_zfs_keys` now call `check_key(tls, ...)` directly with `@pass_thread_local_storage`.\n5. 
`api/v26_0_0/pool.py` and `api/v26_0_0/pool_dataset.py` \u2014 `pbkdf2iters` minimum raised from 100,000 and default raised from 350,000, both to 1,300,000, for both `PoolCreateEncryptionOptions` and `PoolDatasetChangeKeyOptions`; a `from_previous` classmethod is added to clamp old values to the new minimum when migrating from prior API versions.\n\nThe deleted file `zfs_/dataset_encryption.py` (112 lines) is fully removed; its `bulk_process` method, subprocess dispatch, and per-call `libzfs.ZFS()` context creation are gone.", - "risk_surfaces": [ - "EXCEPTION CONTRACT CHANGE \u2014 load_key: The old `zfs.dataset.load_key` wrapped all `libzfs.ZFSException` in `CallError` and logged before raising. The new `load_key` in `zfs/encryption.py` raises `ZFSNotEncryptedException` or `ZFSKeyAlreadyLoadedException` for those pre-checks, then calls `crypto.load_key(**kwargs)` which propagates raw `truenas_pylibzfs.ZFSException` directly. In `dataset_encryption_lock.py:222-228`, the `unlock` method catches `ZFSException` (checking `e.code == ZFSError.EZFS_CRYPTOFAILED`) and bare `Exception`, so the raw `ZFSException` from `crypto.load_key()` is still caught. However, `ZFSKeyAlreadyLoadedException` and `ZFSNotEncryptedException` are plain `Exception` subclasses with no `code` attribute \u2014 they will be caught by the bare `except Exception` branch and surfaced as a string error rather than the typed `CallError` the old code would have produced. Callers expecting a `CallError` (e.g. the WebSocket client) would previously get a structured error; now they get a raw exception string.", - "EXCEPTION CONTRACT CHANGE \u2014 check_key: Old `zfs.dataset.check_key` raised `CallError` on any `libzfs.ZFSException` (including wrong-key scenarios). The new `check_key` raises `ZFSNotEncryptedException` for non-encrypted datasets but returns `False` for `EZFS_CRYPTOFAILED` (per docstring). 
In `encryption_summary` (line 106-109) and `sync_db_keys` (line 200-203), both sites wrap `check_key` in `except Exception: valid_key/should_remove = False/True`, so the behavior is preserved for the happy path. However, there is no guard against passing `key=None` to `crypto.check_key()`. In `encryption_summary`, `ds_key` can be `None` if `ds['encryption_key']` is `None` and no key was supplied by the user \u2014 `check_key(tls, name, key=None)` would pass `key=None` as a kwarg into `crypto.check_key(key=None)`. The behavior of `truenas_pylibzfs`'s `check_key(key=None)` is not visible in this repo; if it does not accept `None`, an exception is raised and silently swallowed to `valid_key = False`, which is the same end result as before \u2014 but relying on an exception catch to cover this is fragile.", - "BULK PROCESS REMOVED \u2014 error aggregation semantics: The old `sync_db_keys` called `zfs.dataset.bulk_process('check_key', [...])` which processed all datasets, accumulated per-dataset errors in `status['error']`, and only aborted if the job itself errored. The new code wraps the entire loop in a single `try/except Exception` (line 208-210). If any unexpected exception escapes the inner `try/except Exception` at line 200-203 (which seems impossible in current code but is a structural fragility), the outer handler will abort the entire loop and return early without processing remaining datasets. The old code continued on a per-dataset error and then separately checked `check_key_job.error` for the job-level error. The new outer catch at line 208-210 logging `f'Failed to sync database keys: {exc}'` uses an f-string without `exc_info=True`, losing the stack trace.", - "ASYNC-TO-SYNC CONVERSION \u2014 insert_or_update_encrypted_record: This method changed from `async def` to `def`. Its callers in `dataset_encryption_lock.py` (`unlock`) and `dataset_encryption_operations.py` (`change_key`) are also sync, so the immediate callers are fine. 
However, if any other caller invokes this as `await self.middleware.call('pool.dataset.insert_or_update_encrypted_record', ...)` from an async context, it will still work through the middleware dispatch layer. The concern is whether any external caller relied on this being co-routine-safe. No external callers are visible in the diff, but this should be verified.", - "DECORATOR ORDERING \u2014 @pass_thread_local_storage with @job: In `encryption_summary` and `sync_db_keys`, the decorator order is `@pass_thread_local_storage` above `@job`. The `tls` argument is injected between `self/job` and the user-visible arguments (`id_`, `options`, `name`). If the `@job` decorator wraps the function and then `@pass_thread_local_storage` wraps that, the positional argument order seen by the actual function body is `(self, job, tls, id_, options)`. This pattern matches how `unlock` was already written (`def unlock(self, job, tls, id_, options)`), so it appears intentional. But `sync_db_keys` has `lock=lambda args: f'sync_encrypted_pool_dataset_keys_{args}'` \u2014 the `args` lambda receives the job's original positional args. If `tls` is now injected before `name`, the lock key computation could change. Verify that the `args` lambda in `@job` sees the pre-`tls`-injection argument list.", - "change_key \u2014 load_key parameter removed: The old `zfs.dataset.change_key` accepted a `load_key` boolean (always passed as `False` from the calling site). The new `change_key` in `zfs/encryption.py` does not accept or pass `load_key` at all to `crypto.change_key(info=config)`. If `truenas_pylibzfs`'s `crypto.change_key` has a different default for whether it reloads the key, the behavior could diverge from the old code's explicit `load_key=False`.", - "change_key \u2014 props/key argument shape: The old code called `ds.change_key(props=options['encryption_properties'], load_key=False, key=options['key'])` with `props` as a dict. 
The new `change_key` builds a `props` dict from `EncryptionProperties`, calls `tls.lzh.resource_cryptography_config(**props)` to get a config object, then passes `info=config` to `crypto.change_key`. The `resource_cryptography_config` API (defined in `truenas_pylibzfs`) must accept the same property names (`keyformat`, `keylocation`, `pbkdf2iters`, `key`). If `truenas_pylibzfs` rejects unknown property names or has different semantics for `pbkdf2iters=None` (the TypedDict marks it as `int | None`), key-change operations could fail silently or raise.", - "change_encryption_root \u2014 ZFSKeyFormat comparison bug fix: In the old code (line in diff): `if ZFSKeyFormat(parent_encrypted_root['key_format']['value']) == ZFSKeyFormat.PASSPHRASE.value:` \u2014 this compared a `ZFSKeyFormat` enum member to a string (`.value`), which would always be `False`. The new code: `if parent_encrypted_root['key_format']['value'] == ZFSKeyFormat.PASSPHRASE.value:` \u2014 correctly compares two strings. This is a behavioral change: the passphrase-key-children guard in `inherit_parent_encryption_properties` was previously never enforced (always skipped) and will now be enforced. This is a semantics fix, but it is an undocumented behavior change that could break workflows where users inherited encryption roots from passphrase-encrypted parents that had key-encrypted children.", - "pbkdf2iters default increase \u2014 from_previous migration: `PoolCreateEncryptionOptions` and `PoolDatasetChangeKeyOptions` in `api/v26_0_0` raise the minimum from 100,000 to 1,300,000 and the default from 350,000 to 1,300,000. The `from_previous` classmethod clamps existing values upward with `max(1300000, value['pbkdf2iters'])`. This means any existing dataset or pool that was created with pbkdf2iters between 100,000 and 1,299,999 will silently have their iteration count upgraded on the next API operation touching these fields. This can cause a significant increase in key-derivation time during unlock. 
This is a security hardening but is a breaking change for automated scripts that stored or compared pbkdf2iters values.", - "KMIP check_key \u2014 no tls guard: In `kmip/zfs_keys.py`, `push_zfs_keys` and `pull_zfs_keys` now call `check_key(tls, ...)` directly. The `@pass_thread_local_storage` decorator was added to both. However, these are called from `sync_zfs_keys` at lines 138 and 142 as `self.push_zfs_keys(tls, ids)` and `self.pull_zfs_keys(tls)` \u2014 passing `tls` explicitly. If `@pass_thread_local_storage` injects `tls` automatically, passing it explicitly would result in a double injection (`tls` appears twice in the argument list). This is a potential signature mismatch that could cause a `TypeError` at runtime.", - "path_in_locked_datasets \u2014 not in PR scope but adjacent risk: This method in `dataset_encryption_info.py` (lines 216-283) already uses `tls.lzh.open_resource(...)` directly and was not changed by this PR. It is annotated as a hot code path and handles `ZFSException` with EZFS_NOENT and EZFS_INVALIDNAME filtering. This code is architecturally similar to the new functions but was not touched, which is correct. However, reviewers should verify no regression was introduced in how `ZFSException` is imported \u2014 the import at line 9 is `from truenas_pylibzfs import ZFSError, ZFSException`, which is correct." - ], - "stats": { - "files_added": 1, - "files_modified": 7, - "files_removed": 1, - "files_renamed": 0, - "test_files_changed": 0, - "test_to_code_ratio": 0, - "total_additions": 254, - "total_deletions": 210, - "total_files": 9 - }, - "unrelated_changes": [ - "api/v26_0_0/pool.py and api/v26_0_0/pool_dataset.py \u2014 pbkdf2iters minimum raised from 100,000 and default raised from 350,000, both to 1,300,000, with a `from_previous` migration validator added. This is a security hardening change unrelated to the py-libzfs \u2192 truenas_pylibzfs refactor. 
The PR description makes no mention of this change.", - "dataset_encryption_operations.py \u2014 The `ZFSKeyFormat` comparison bug fix in `inherit_parent_encryption_properties` (old: compared enum instance to string value, always False; new: compares two strings, now actually enforces the constraint) is a behavioral bug fix bundled into this refactor PR without mention in the PR description.", - "dataset_encryption_info.py sync_db_keys \u2014 The query for `encrypted_roots` was changed to also fetch the `keyformat` property (`'properties': ['encryptionroot', 'keyformat']`) where before it only fetched `encryptionroot`. This is needed for the new hex-key detection logic but represents a query change not mentioned in the PR description.", - "kmip/zfs_keys.py get_encrypted_datasets \u2014 Changed from calling `self.middleware.call_sync('pool.dataset.query', ...)` (old code, visible from context) to using `self.call_sync2(self.s.zfs.resource.query_impl, ZFSResourceQuery(...))` \u2014 an internal implementation-level change that shifts from the high-level dataset query to the low-level ZFS resource query. This may filter or format results differently." - ] - }, - "budget": { - "budget_exhausted": true, - "cost_breakdown": { - "adversary": 0, - "anatomy": 0, - "coverage": 0, - "cross_ref": 0, - "intake": 0, - "meta_selectors": 0, - "output": 0, - "review": 0, - "synthesis": 0 - }, - "max_cost_usd": 2, - "max_duration_seconds": 900, - "total_cost_usd": 0 - }, - "intake": { - "ai_generated": 0, - "areas_touched": [ - "api" - ], - "complexity": "standard", - "languages": [ - "python" - ], - "pr_summary": "Replace usage of the deprecated py-libzfs with truenas_pylibzfs for these private methods. 
This removes another use case of our process pool.\r\n\r\nDepends on changes made in https://github.com/truenas/truenas_pylibzfs/pull/145.", - "pr_type": "refactor", - "review_depth": "standard", - "risk_signals": [ - "changes API surface or request/response behavior" - ] - }, - "phases_completed": [ - "intake", - "anatomy", - "meta_selectors", - "review", - "adversary", - "cross_ref", - "coverage", - "synthesis", - "output" - ], - "plan": { - "ai_adjusted": false, - "cross_ref_hints": [], - "dimensions": [ - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "src/middlewared/middlewared/plugins/zfs/exceptions.py" - ], - "id": "semantic_sem_01", - "name": "Exception contract change in load_key: typed exceptions vs. CallError", - "priority": 10, - "review_prompt": "The old `zfs.dataset.load_key` caught all `libzfs.ZFSException` and re-raised as `CallError`. The new `load_key` in `zfs/encryption.py` raises `ZFSNotEncryptedException` or `ZFSKeyAlreadyLoadedException` (plain `Exception` subclasses with no `code` attribute) for pre-check failures, and lets raw `truenas_pylibzfs.ZFSException` propagate from `crypto.load_key()`. In `dataset_encryption_lock.py`, the `unlock` method catches `ZFSException` (checking `e.code == ZFSError.EZFS_CRYPTOFAILED`) and then a bare `except Exception`. Verify: (1) `ZFSNotEncryptedException` and `ZFSKeyAlreadyLoadedException` \u2014 do they fall through to the bare `except Exception` branch and get surfaced as a raw string error rather than a structured `CallError`? (2) Do any callers of `unlock` (e.g., WebSocket dispatch) depend on receiving a `CallError` with a specific `.errno` or `.errmsg` structure? 
(3) Are there any paths where the new typed exceptions bypass all error handling and bubble up to the framework uncaught?", - "target_files": [ - "src/middlewared/middlewared/plugins/zfs/encryption.py", - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 4 - }, - "context_files": [ - "src/middlewared/middlewared/plugins/zfs/encryption.py" - ], - "id": "mechanical_mech_1", - "name": "KMIP double-injection: @pass_thread_local_storage + explicit tls arg causes TypeError", - "priority": 10, - "review_prompt": "In `kmip/zfs_keys.py`, `push_zfs_keys` and `pull_zfs_keys` have been decorated with `@pass_thread_local_storage`, which automatically injects `tls` as the first argument after `self`. However, their caller `sync_zfs_keys` invokes them as `self.push_zfs_keys(tls, ids)` and `self.pull_zfs_keys(tls)` \u2014 passing `tls` explicitly as a positional argument. If `@pass_thread_local_storage` injects `tls` into the argument list before the call executes, and the caller also passes `tls` explicitly, the function receives `tls` twice: once from the decorator injection and once from the caller. This will produce a `TypeError: push_zfs_keys() got multiple values for argument 'tls'` (or a positional argument count mismatch) at runtime.\n\nYour task:\n1. Read `kmip/zfs_keys.py` in full. Identify the signatures of `push_zfs_keys`, `pull_zfs_keys`, and `sync_zfs_keys`.\n2. Read or infer the implementation of `@pass_thread_local_storage` to understand exactly when and how it injects `tls` \u2014 does it inject before or after the decorated function is called, and does it strip `tls` from the call-site args?\n3. 
Determine whether `sync_zfs_keys` must be updated to NOT pass `tls` explicitly (because the decorator handles it), or whether the decorator was NOT intended to be added to these methods (and they should instead receive `tls` from their caller).\n4. If a double-injection bug exists, report the exact file and line numbers, the erroneous decorator placement or call-site, and the correct fix.\n5. If no double-injection occurs (e.g., the decorator is a pass-through that does not inject when already present), explain the mechanism that prevents the bug.", - "target_files": [ - "src/middlewared/middlewared/plugins/kmip/zfs_keys.py" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 4 - }, - "context_files": [ - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py" - ], - "id": "mechanical_mech_2", - "name": "Exception contract break: ZFSKeyAlreadyLoadedException / ZFSNotEncryptedException caught by bare except as string, not CallError", - "priority": 9, - "review_prompt": "The new `load_key` function in `zfs/encryption.py` raises `ZFSKeyAlreadyLoadedException` or `ZFSNotEncryptedException` (both plain `Exception` subclasses defined in `zfs/exceptions.py`) as pre-condition guards before calling `crypto.load_key()`. In `dataset_encryption_lock.py`, the `unlock` method catches exceptions in two branches: `except ZFSException as e` (checking `e.code == ZFSError.EZFS_CRYPTOFAILED`) and a bare `except Exception as e`. The new custom exceptions are NOT `ZFSException` subclasses, so they fall into the bare `except Exception` branch and are stringified into the error result \u2014 instead of being raised as a structured `CallError` as the old code did.\n\nYour task:\n1. Read `zfs/exceptions.py` to confirm the class hierarchy of `ZFSKeyAlreadyLoadedException` and `ZFSNotEncryptedException`. Do they inherit from `ZFSException`, `CallError`, or plain `Exception`?\n2. 
Read `dataset_encryption_lock.py` lines 200\u2013240 (approximate). Trace what happens when each of these two exceptions is raised: which `except` branch catches it, what is placed in the error result (stringified message vs. structured `CallError`), and whether a `CallError` is ever re-raised.\n3. Read `zfs/encryption.py` `load_key` function fully. Confirm it raises these exceptions before calling `crypto.load_key()`.\n4. Determine whether the callers of `unlock` (e.g., the WebSocket API layer) expect a `CallError` with a specific `errno` or just any exception. If `CallError` is expected, the current code is a contract break.\n5. Report all locations where the exception handling must be updated to convert these custom exceptions into `CallError` before they escape to callers, or where the exception class hierarchy must be changed.", - "target_files": [ - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "src/middlewared/middlewared/plugins/zfs/exceptions.py", - "src/middlewared/middlewared/plugins/zfs/encryption.py" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "src/middlewared/middlewared/plugins/zfs/encryption.py" - ], - "id": "semantic_sem_03", - "name": "ZFSKeyFormat enum comparison fix silently activates previously dead guard", - "priority": 8, - "review_prompt": "In the old `inherit_parent_encryption_properties` / `change_encryption_root`, the condition `if ZFSKeyFormat(parent_encrypted_root['key_format']['value']) == ZFSKeyFormat.PASSPHRASE.value:` compared a `ZFSKeyFormat` enum instance to a string (`.value`), which always evaluates to `False` in Python due to type-strict `==` semantics. This means the guard that prevents key-encrypted children from inheriting encryption roots from passphrase-encrypted parents was never enforced. 
The new code uses `if parent_encrypted_root['key_format']['value'] == ZFSKeyFormat.PASSPHRASE.value:`, a string-to-string comparison that correctly enforces the guard. Verify: (1) Confirm the old code's comparison was indeed always `False` \u2014 that is, no datasets exist in production that relied on this guard being a no-op. (2) What is the exact behavior change for a key-encrypted child dataset whose parent has a passphrase-encrypted root \u2014 will the operation now raise an error, return early, or behave differently in some other way? (3) Is there any documented or tested workflow that previously worked because this guard was silently skipped, and will now fail?", - "target_files": [ - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [], - "id": "semantic_sem_04", - "name": "pbkdf2iters silent upgrade via from_previous: latency regression and breakage for automation", - "priority": 7, - "review_prompt": "In `api/v26_0_0/pool.py` and `api/v26_0_0/pool_dataset.py`, `PoolCreateEncryptionOptions` and `PoolDatasetChangeKeyOptions` now default `pbkdf2iters` to 1,300,000 (up from 350,000) with a minimum of 1,300,000. The `from_previous` classmethod uses `max(1300000, value['pbkdf2iters'])` to silently clamp old values upward. Verify: (1) Is the `from_previous` migration invoked on read (i.e., for existing dataset API responses) or only on write (i.e., only when the user explicitly submits a key-change operation)? If invoked on write, does the caller receive the upgraded value transparently without being warned? (2) For existing datasets with pbkdf2iters between 100,000 and 1,299,999, will the iteration count be silently changed to 1,300,000 on the next `change_key` call, meaning the encryption parameters of a live dataset change without explicit user intent? 
(3) On low-power or embedded hardware, does a 3.7x increase in PBKDF2 iterations cause key-derivation to exceed unlock timeouts, potentially making encrypted datasets permanently inaccessible without intervention?", - "target_files": [ - "src/middlewared/middlewared/api/v26_0_0/pool.py", - "src/middlewared/middlewared/api/v26_0_0/pool_dataset.py" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 4 - }, - "context_files": [ - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py" - ], - "id": "mechanical_mech_3", - "name": "Decorator ordering: @pass_thread_local_storage above @job \u2014 does @job lambda see pre- or post-tls-injection arg list?", - "priority": 7, - "review_prompt": "In `dataset_encryption_info.py`, `sync_db_keys` uses `@job(lock=lambda args: f'sync_encrypted_pool_dataset_keys_{args}')` stacked beneath `@pass_thread_local_storage`. The `args` lambda passed to `@job` receives the positional arguments at the time the job dispatch layer captures them. If `@pass_thread_local_storage` is the outer decorator (applied last, wraps the `@job`-decorated function), then `tls` is injected AFTER the `@job` lock-key computation runs \u2014 meaning the lock lambda sees `(name,)` as intended. But if the decorator order means `@job` wraps the already-`tls`-injected function, the lambda would see `(tls, name)` and the lock key would be `sync_encrypted_pool_dataset_keys_(tls_object, 'poolname')`, producing an incorrect and potentially non-unique lock key.\n\nYour task:\n1. Read `dataset_encryption_info.py` to confirm the exact decorator order on `sync_db_keys` (which decorator appears on the line immediately above `def sync_db_keys`).\n2. 
Find and read the implementation of `@pass_thread_local_storage` to understand its wrapping behavior \u2014 does it wrap the already-decorated function or is it the inner decorator?\n3. Find and read the `@job` decorator implementation to understand when the `lock` lambda is evaluated relative to argument injection by outer decorators.\n4. Determine whether the `lock` lambda in `sync_db_keys` receives `(name,)` or `(tls, name)` at runtime.\n5. If `tls` is visible to the lambda, report the exact file/line and explain why the lock key will be malformed, and what the correct fix is (e.g., swap decorator order, or adjust the lambda to index `args[1]` instead of `args`).", - "target_files": [ - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py" - ] - } - ], - "total_budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - } - } - }, - "pr_url": "https://github.com/truenas/middleware/pull/18291", - "review": { - "body": "## \ud83d\udd34 PR-AF Review \u2014 **Changes Required**\n\n*Automated multi-agent code review \u00b7 [PR-AF](https://github.com/Agent-Field/agentfield) built with [AgentField](https://github.com/Agent-Field/agentfield)*\n\n> **14 findings** \u00b7 \ud83d\udd34 2 critical \u00b7 \ud83d\udfe0 9 important \u00b7 \ud83d\udd35 2 suggestions \u00b7 \u26aa 1 nitpicks\n\n
\nPR Overview\n\nReplace usage of the deprecated py-libzfs with truenas_pylibzfs for these private methods. This removes another use case of our process pool.\r\n\r\nDepends on changes made in https://github.com/truenas/truenas_pylibzfs/pull/145.\n\n
\n\n### Key Findings\n\n**11 issue(s) should be addressed before merge:**\n\n- \ud83d\udd34 **zfs_keys cache silently wiped on every push/pull: `k in existing_datasets` checks string in list-of-dicts** (`src/middlewared/middlewared/plugins/kmip/zfs_keys.py:94`) \u2014 `get_encrypted_datasets` returns a `list` of dataset dicts (each a `dict` with keys `'name'`, `'id'`, `'encryption_key'`, `'kmip_uid'`, etc.).\n- \ud83d\udd34 **Missing `id` argument in `datastore.update` call \u2014 wrong argument count, update never applied to correct row** (`src/middlewared/middlewared/plugins/kmip/zfs_keys.py:157`) \u2014 The `datastore.update` API signature is `(table: str, id: int, data: dict)`.\n- \ud83d\udfe0 **Old guard was always False: key-encrypted child under passphrase-root inheritance was never blocked** (`src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py:248`) \u2014 **The old comparison was provably always `False`.** In the prior code (`bde8f1de3b`), the guard in `inherit_parent_encryption_properties_impl` read: ```python if ZFSKeyFormat(parent_encrypted_root.k\u2026\n- \ud83d\udfe0 **ZFSKeyAlreadyLoadedException and ZFSNotEncryptedException silently swallowed as string errors instead of structured CallError** (`src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py:229`) \u2014 The bare `except Exception as e` branch on line 229 catches `ZFSKeyAlreadyLoadedException` and `ZFSNotEncryptedException` (both plain `Exception` subclasses from `zfs/exceptions.py`) and converts them\u2026\n- \ud83d\udfe0 **from_previous fires on write only; legacy API callers have pbkdf2iters silently upgraded to 1,300,000 without any notification** (`src/middlewared/middlewared/api/v26_0_0/pool_dataset.py:183`) \u2014 **`from_previous` is invoked exclusively on incoming write operations (argument upgrade), never on reads (API responses).** The `APIVersionsAdapter` in `legacy_api_method.py` upgrades incoming parame\u2026\n- \ud83d\udfe0 **`sync_db_keys` 
lock lambda embeds the full args list, causing inconsistent lock keys between periodic and explicit calls** (`src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py:161`) \u2014 The `lock` lambda on `sync_db_keys` uses `args` (the entire raw-arguments list) rather than `args[0]` (the first positional argument, `name`): ```python @job(lock=lambda args: f'sync_encrypted_pool_d\u2026\n- \ud83d\udfe0 **Existing passphrase-encrypted datasets silently re-keyed at 3.7x higher iteration count on next change_key call via any API version** (`src/middlewared/middlewared/api/v26_0_0/pool_dataset.py:175`) \u2014 **Existing datasets with `pbkdf2iters` between 100,000 and 1,299,999 will have their iteration count permanently changed to 1,300,000 on the next `change_key` call, regardless of whether the user expl\u2026\n- \ud83d\udfe0 **Custom ZFS exceptions inherit from plain Exception instead of CallError, breaking structured error propagation across all callers** (`src/middlewared/middlewared/plugins/zfs/exceptions.py:14`) \u2014 `ZFSKeyAlreadyLoadedException` (line 14) and `ZFSNotEncryptedException` (line 20) both inherit directly from `Exception`.\n- \u2026 and 3 more (see All Findings by Severity)\n\n**3 suggestion(s) and style note(s):**\n\n- \ud83d\udd35 No double-injection bug: explicit tls passing is correct for direct calls (`src/middlewared/middlewared/plugins/kmip/zfs_keys.py:138`)\n- \ud83d\udd35 No test covers the newly-enforced rejection path (passphrase root + key-encrypted child roots) (`src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py:248`)\n- \u26aa Original `tls`-injection concern is a false alarm: decorator order is correct and `tls` is never visible to the lock lambda (`src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py:158`)\n\n**Files with findings:** `src/middlewared/middlewared/api/v26_0_0/pool.py`, `src/middlewared/middlewared/api/v26_0_0/pool_dataset.py`, 
`src/middlewared/middlewared/plugins/kmip/zfs_keys.py`, `src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py`, `src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py`, `src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py`, `src/middlewared/middlewared/plugins/zfs/encryption.py`, `src/middlewared/middlewared/plugins/zfs/exceptions.py`\n\n
\nAll Findings by Severity\n\n#### \ud83d\udd34 Critical (2)\n\n- **zfs_keys cache silently wiped on every push/pull: `k in existing_datasets` checks string in list-of-dicts** `src/middlewared/middlewared/plugins/kmip/zfs_keys.py:94`\n- **Missing `id` argument in `datastore.update` call \u2014 wrong argument count, update never applied to correct row** `src/middlewared/middlewared/plugins/kmip/zfs_keys.py:157`\n\n#### \ud83d\udfe0 Important (9)\n\n- **Old guard was always False: key-encrypted child under passphrase-root inheritance was never blocked** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py:248`\n- **ZFSKeyAlreadyLoadedException and ZFSNotEncryptedException silently swallowed as string errors instead of structured CallError** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py:229`\n- **from_previous fires on write only; legacy API callers have pbkdf2iters silently upgraded to 1,300,000 without any notification** `src/middlewared/middlewared/api/v26_0_0/pool_dataset.py:183`\n- **`sync_db_keys` lock lambda embeds the full args list, causing inconsistent lock keys between periodic and explicit calls** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py:161`\n- **Existing passphrase-encrypted datasets silently re-keyed at 3.7x higher iteration count on next change_key call via any API version** `src/middlewared/middlewared/api/v26_0_0/pool_dataset.py:175`\n- **Custom ZFS exceptions inherit from plain Exception instead of CallError, breaking structured error propagation across all callers** `src/middlewared/middlewared/plugins/zfs/exceptions.py:14`\n- **ZFSNotEncryptedException from change_key() propagates as raw Exception to WebSocket API layer \u2014 no CallError wrapping** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py:200`\n- **Raw truenas_pylibzfs.ZFSException from crypto.load_key() propagates out of encryption.load_key() undecorated, breaking the old CallError 
contract for any caller outside unlock** `src/middlewared/middlewared/plugins/zfs/encryption.py:34`\n- **3.7x PBKDF2 iteration increase enforced with no hardware capability check; may cause passphrase unlock timeouts making datasets inaccessible** `src/middlewared/middlewared/api/v26_0_0/pool.py:151`\n\n#### \ud83d\udd35 Suggestion (2)\n\n- **No double-injection bug: explicit tls passing is correct for direct calls** `src/middlewared/middlewared/plugins/kmip/zfs_keys.py:138`\n- **No test covers the newly-enforced rejection path (passphrase root + key-encrypted child roots)** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py:248`\n\n#### \u26aa Nitpick (1)\n\n- **Original `tls`-injection concern is a false alarm: decorator order is correct and `tls` is never visible to the lock lambda** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py:158`\n\n
\n\n
\nReview Process Details\n\n**Dimensions Analyzed (6):**\n\n- **Exception contract change in load_key: typed exceptions vs. CallError** \u2014 2 file(s)\n- **KMIP double-injection: @pass_thread_local_storage + explicit tls arg causes TypeError** \u2014 1 file(s)\n- **Exception contract break: ZFSKeyAlreadyLoadedException / ZFSNotEncryptedException caught by bare except as string, not CallError** \u2014 3 file(s)\n- **ZFSKeyFormat enum comparison fix silently activates previously dead guard** \u2014 1 file(s)\n- **pbkdf2iters silent upgrade via from_previous: latency regression and breakage for automation** \u2014 2 file(s)\n- **Decorator ordering: @pass_thread_local_storage above @job \u2014 does @job lambda see pre- or post-tls-injection arg list?** \u2014 1 file(s)\n\n**Meta-Dimension Lenses (3):**\n\n- **Semantic** \u2014 5 dimension(s), 88% coverage confidence\n- **Mechanical** \u2014 3 dimension(s), 87% coverage confidence\n- **Systemic** \u2014 2 dimension(s), 82% coverage confidence\n\n
\n\n
\nPipeline Stats\n\n| Metric | Value |\n|--------|-------|\n| Duration | 1808.7s |\n| Agent invocations | 11 |\n| Coverage iterations | 0 |\n| Estimated cost | N/A (provider does not report cost) |\n| Budget exhausted | Yes (timeout: 1808s > 900s limit) |\n| PR type | refactor |\n| Complexity | standard |\n\n
\n\nReview ID: `rev_07c8d4f2bf5a`", - "comments": [ - { - "body": "\ud83d\udfe0 **[IMPORTANT] Old guard was always False: key-encrypted child under passphrase-root inheritance was never blocked**\n\n**The old comparison was provably always `False`.**\n\nIn the prior code (`bde8f1de3b`), the guard in `inherit_parent_encryption_properties_impl` read:\n\n```python\nif ZFSKeyFormat(parent_encrypted_root.key_format.value) == ZFSKeyFormat.PASSPHRASE.value:\n```\n\nThe left-hand side is `ZFSKeyFormat('PASSPHRASE')` \u2014 a `ZFSKeyFormat` enum *instance* \u2014 while the right-hand side is `ZFSKeyFormat.PASSPHRASE.value` \u2014 the raw string `'PASSPHRASE'`. Python's `==` for `Enum` instances does **not** fall back to comparing against the `.value`; an enum member only equals itself (or another member with the same identity), never a plain string. This was verified:\n\n```\nZFSKeyFormat('PASSPHRASE') == 'PASSPHRASE' # \u2192 False, always\n```\n\n**What the guard was supposed to do:** prevent a key-encrypted dataset (`id_`) that has its own key-encrypted child encryption roots from inheriting a passphrase-encrypted parent root. If such a dataset were allowed to inherit, its key-encrypted children would end up under a passphrase root, violating the invariant that passphrase roots cannot have key-encrypted encryption-root descendants.\n\n**Behavioral change introduced by the fix:** The new code uses:\n\n```python\nif parent_encrypted_root['key_format']['value'] == ZFSKeyFormat.PASSPHRASE.value:\n```\n\nThis is a string-to-string comparison (`'PASSPHRASE' == 'PASSPHRASE'`) that evaluates to `True` correctly. For the first time, the inner `any(...)` check that looks for key-encrypted child encryption roots is actually executed, and if any are found, a `CallError` is raised, preventing the operation.\n\n**Concrete scenario now blocked that was previously silently allowed:**\n\n1. Pool `tank` has dataset `tank/passroot` encrypted with a passphrase (encryption root).\n2. 
Under it, `tank/passroot/keyroot` is a key-encrypted encryption root (HEX format).\n3. Under `keyroot`, `tank/passroot/keyroot/keychild` is *also* a key-encrypted encryption root.\n4. A user calls `pool.dataset.inherit_parent_encryption_properties('tank/passroot/keyroot')`.\n5. **Old code:** guard fires `False`, inner check is skipped, `change_encryption_root` executes. `keyroot` now falls under `passroot`'s passphrase root, but `keychild` remains a separate key-encrypted root under a passphrase root \u2014 an explicitly forbidden structure.\n6. **New code:** guard fires `True`, inner `any()` detects `keychild`, raises `CallError` with a clear message. The operation is rejected.\n\n**Does any existing production workflow depend on the old no-op guard?** The only test exercising `inherit_parent_encryption_properties` (`test_key_encrypted_dataset` at line 404) uses a *hex-key* parent root, so `parent_encrypted_root['key_format']['value'] == 'HEX'`, and the guard evaluates to `False` in both old and new code. That test is unaffected. There is no test covering the now-enforced case (passphrase parent root + key-encrypted child roots), which is the exact gap described below.\n\n---\n\n> Step 1: Old code at `bde8f1de3b` line ~222: `if ZFSKeyFormat(parent_encrypted_root.key_format.value) == ZFSKeyFormat.PASSPHRASE.value:`\n> Step 2: `parent_encrypted_root.key_format.value` is a string, e.g. 
`'PASSPHRASE'`.\n> Step 3: `ZFSKeyFormat('PASSPHRASE')` constructs `ZFSKeyFormat.PASSPHRASE`, an enum instance.\n> Step 4: `ZFSKeyFormat.PASSPHRASE == 'PASSPHRASE'` \u2192 `False` (Python Enum.__eq__ compares member identity, not value string).\n> Step 5: The `if` body (the `any()` child-root check and potential `raise CallError`) is NEVER reached regardless of input.\n> Step 6: `change_encryption_root` / `zfs.dataset.change_encryption_root` always executes even when the parent root is passphrase-encrypted and the dataset has key-encrypted child roots.\n> Verification: `python3 -c \"from enum import Enum; E = Enum('E', {'P': 'PASSPHRASE'}); print(E('PASSPHRASE') == 'PASSPHRASE')\"` prints `False`.\n\n**\ud83d\udca1 Suggested Fix**\n\nThe fix is correct. The only follow-up needed is a regression test for the newly-enforced path: create a passphrase-encrypted root, a key-encrypted encryption root beneath it, and a second key-encrypted encryption root as a child of that \u2014 then assert that `inherit_parent_encryption_properties` on the middle dataset raises a `CallError`. This ensures the guard remains correct if the code is refactored again.\n\n---\n*`Enum vs String Comparison Bug in Encryption Root Guard` \u00b7 confidence 98%*", - "line": 248, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] ZFSKeyAlreadyLoadedException and ZFSNotEncryptedException silently swallowed as string errors instead of structured CallError**\n\nThe bare `except Exception as e` branch on line 229 catches `ZFSKeyAlreadyLoadedException` and `ZFSNotEncryptedException` (both plain `Exception` subclasses from `zfs/exceptions.py`) and converts them to `failed[name]['error'] = str(e)` \u2014 a raw string embedded in the return value dict.\n\nThis is a contract violation because:\n1. 
These exceptions are **pre-condition guards** (dataset not encrypted, or key already loaded) that signal programmer/caller errors, not transient ZFS crypto failures. Treating them identically to \"Invalid Key\" hides the actual cause.\n2. The `unlock` API method's structured return `{'unlocked': [...], 'failed': {...}}` will surface these as opaque string errors (e.g. `\"'pool/ds' key is already loaded\"`) with no errno or structured error code, making it impossible for callers to distinguish pre-condition failures from crypto failures.\n3. The old code path (before `load_key` was extracted to `zfs/encryption.py`) presumably raised `CallError` directly \u2014 the refactoring broke this by introducing new exception types without updating the catch sites.\n\nSpecifically:\n- `ZFSKeyAlreadyLoadedException` raised at `encryption.py:33` falls into `except Exception` at `dataset_encryption_lock.py:229`\n- `ZFSNotEncryptedException` raised at `encryption.py:31` similarly falls into `except Exception` at `dataset_encryption_lock.py:229`\n\nNeither is ever re-raised as a `CallError`.\n\n---\n\n> Step 1: `unlock` calls `load_key(tls, name, key=datasets[name]['key'])` at line 222.\n> Step 2: `load_key` in `zfs/encryption.py:31` calls `rsrc.crypto()`, and if it returns `None`, raises `ZFSNotEncryptedException(dataset)` \u2014 a subclass of plain `Exception` (confirmed at `exceptions.py:20`).\n> Step 3: `load_key` at `encryption.py:33` raises `ZFSKeyAlreadyLoadedException(dataset)` if `crypto.info().key_is_loaded` is True \u2014 also a plain `Exception` subclass (`exceptions.py:14`).\n> Step 4: Neither exception is a `ZFSException` subclass (imported from `truenas_pylibzfs`), so the `except ZFSException as e` block at line 223 does NOT catch them.\n> Step 5: They fall through to `except Exception as e` at line 229, where `failed[name]['error'] = str(e)` stores the message string `\"'pool/ds' key is already loaded\"` or `\"'pool/ds' is not encrypted\"` \u2014 no `CallError`, no 
errno.\n\n**\ud83d\udca1 Suggested Fix**\n\nEither (a) make `ZFSKeyAlreadyLoadedException` and `ZFSNotEncryptedException` inherit from `CallError` (with appropriate `errno` values such as `errno.ENOTSUP` for not-encrypted and `errno.EEXIST` for already-loaded), OR (b) add an explicit catch before the bare `except Exception` block:\n```python\nfrom middlewared.plugins.zfs.exceptions import ZFSKeyAlreadyLoadedException, ZFSNotEncryptedException\n\ntry:\n load_key(tls, name, key=datasets[name]['key'])\nexcept ZFSKeyAlreadyLoadedException:\n # Key already loaded means dataset is effectively unlocked; treat as success or specific error\n failed[name]['error'] = 'Key is already loaded'\n continue\nexcept ZFSNotEncryptedException:\n failed[name]['error'] = 'Dataset is not encrypted'\n continue\nexcept ZFSException as e:\n ...\nexcept Exception as e:\n failed[name]['error'] = str(e)\n continue\n```\nOption (a) is cleaner and ensures these exceptions carry structured error information everywhere they propagate.\n\n---\n*`Exception Handling Contract` \u00b7 confidence 95%*", - "line": 229, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] from_previous fires on write only; legacy API callers have pbkdf2iters silently upgraded to 1,300,000 without any notification**\n\n**`from_previous` is invoked exclusively on incoming write operations (argument upgrade), never on reads (API responses).**\n\nThe `APIVersionsAdapter` in `legacy_api_method.py` upgrades incoming parameters from an older API version to the current version via `_adapt_params`, which calls `adapter.adapt(params_dict, model_name, self.api_version, self.adapter.current_version)`. 
Because `version1_index < version2_index` the direction resolves to `Direction.UPGRADE`, triggering `new_model.from_previous(value)` at `version.py:233`.\n\nConversely, `_dump_result` adapts the **result** from `current_version` back to `api_version` (downgrade direction), which calls `to_previous`. Neither `PoolDatasetChangeKeyOptions` nor `PoolCreateEncryptionOptions` define `to_previous`, so outgoing responses are never touched.\n\n**Practical impact:** An automation client or script pinned to API v25.x that deliberately submits `pbkdf2iters=350000` (valid under `ge=100000` in v25.10.x) will have that value silently overwritten to `1300000` by `from_previous` before the `change_key` handler executes. The caller receives `{\"result\": null}` \u2014 the standard success response for `PoolDatasetChangeKeyResult` \u2014 with no indication that a different iteration count was actually applied to ZFS.\n\nNote: `pbkdf2iters` is only forwarded to the ZFS layer when `passphrase_key_format=True` (plugin line 114), so this affects only passphrase-encrypted datasets. For raw-hex keyed datasets `pbkdf2iters` is excluded from `opts` entirely and no iteration count is stored.\n\n---\n\n> Step 1: Client on API v25.10.2 calls `pool.dataset.change_key` with `options={\"pbkdf2iters\": 350000, \"passphrase\": \"mypass\"}`. 
Old model allows this: `pbkdf2iters: int = Field(default=350000, ge=100000)` (v25_10_2/pool_dataset.py:175).\n> Step 2: `LegacyAPIMethod.call()` (legacy_api_method.py:60) calls `_adapt_params()` \u2192 `adapter.adapt(params_dict, 'PoolDatasetChangeKeyArgs', 'v25.10.2', 'v26.0.0')`.\n> Step 3: `adapt_model` computes `version1_index < version2_index` \u2192 `direction = Direction.UPGRADE`.\n> Step 4: `_adapt_value` on `PoolDatasetChangeKeyArgs` calls `_adapt_nested_value` on the `options` field because both versions define a model named `PoolDatasetChangeKeyOptions`; this triggers a recursive `_adapt_value` call.\n> Step 5: At the end of the nested `_adapt_value`, line 233 of version.py: `value = new_model.from_previous(value)` where `new_model` is v26_0_0's `PoolDatasetChangeKeyOptions`.\n> Step 6: `from_previous` (pool_dataset.py:185) executes `value['pbkdf2iters'] = max(1300000, 350000)` \u2192 `1300000`.\n> Step 7: `change_key` plugin receives `options['pbkdf2iters'] == 1300000`, passes it to `validate_encryption_data` (line 191), which includes it in `opts` because `passphrase_key_format=True` (line 114).\n> Step 8: `zfs/encryption.py::change_key()` permanently stores `pbkdf2iters=1300000` in the dataset's ZFS config.\n> Step 9: `_dump_result` downgrades `{\"result\": null}` \u2014 no clamping info is surfaced.\n\n**\ud83d\udca1 Suggested Fix**\n\nAt minimum, emit a job log warning when `pbkdf2iters` is clamped upward. A job-status message such as `job.set_progress(0, f'Note: pbkdf2iters elevated from submitted value to {options[\"pbkdf2iters\"]}')` would make the override visible to operators. 
Longer-term, consider returning the effective `pbkdf2iters` in the result payload or adding a `to_previous` on the result model so legacy clients can detect the discrepancy.\n\n---\n*`PBKDF2 Iteration Count Silent Migration` \u00b7 confidence 95%*", - "line": 183, - "path": "src/middlewared/middlewared/api/v26_0_0/pool_dataset.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] `sync_db_keys` lock lambda embeds the full args list, causing inconsistent lock keys between periodic and explicit calls**\n\nThe `lock` lambda on `sync_db_keys` uses `args` (the entire raw-arguments list) rather than `args[0]` (the first positional argument, `name`):\n\n```python\n@job(lock=lambda args: f'sync_encrypted_pool_dataset_keys_{args}')\ndef sync_db_keys(self, job, tls, name=None):\n```\n\nThe `@job` and `@pass_thread_local_storage` decorators are both **pure marker decorators** \u2014 they stamp attributes on the function and return it unchanged. `Job.__init__` stores the raw caller-supplied `params` list as `self.args`, and the lock lambda is evaluated with that list before the job is queued (in `JobsQueue.handle_lock` \u2192 `Job.get_lock_name`). The `tls` object is injected at run time in `Job.__run_body`, well after lock computation, so `tls` is **not** visible to the lambda.\n\nThe real problem is that `name` has a default of `None`. 
This means:\n\n| Call site | `self.args` passed to lambda | Resulting lock key |\n|---|---|---|\n| Periodic scheduler (no args) | `[]` | `sync_encrypted_pool_dataset_keys_[]` |\n| `call_sync('pool.dataset.sync_db_keys', 'tank')` | `['tank']` | `sync_encrypted_pool_dataset_keys_['tank']` |\n| `call_sync('pool.dataset.sync_db_keys', None)` | `[None]` | `sync_encrypted_pool_dataset_keys_[None]` |\n\nThe periodic invocation produces the key `sync_encrypted_pool_dataset_keys_[]` while an explicit `sync_db_keys(None)` produces `sync_encrypted_pool_dataset_keys_[None]` \u2014 these are **different lock keys**, so the two calls do NOT share a lock and can run concurrently. This defeats the purpose of the lock for the all-datasets sync case.\n\nBy contrast, the `encryption_summary` lock lambda on the same class correctly uses `args[0]`:\n```python\n@job(lock=lambda args: f'encryption_summary_options_{args[0]}', ...)\n```\n\nAdditionally, the lock key includes Python list-repr brackets (e.g., `['tank']`) rather than a clean string like `tank`, making the key non-human-readable and fragile if calling conventions change.\n\n---\n\n> Step 1: `sync_db_keys` is decorated with `@job(lock=lambda args: f'sync_encrypted_pool_dataset_keys_{args}')` at line 161.\n> Step 2: `@job` is a pure marker decorator (`decorators.py:153-166`) \u2014 it sets `fn._job = {'lock': lock, ...}` and returns `fn` unchanged.\n> Step 3: `_call_prepare` in `main.py:880` constructs `Job(self, name, serviceobj, methodobj, params, ...)` where `params` is the raw caller-supplied arguments list.\n> Step 4: `Job.__init__` at `job.py:333` stores `self.args = args` (the `params` parameter passed in).\n> Step 5: `JobsQueue.add` at `job.py:149` calls `self.handle_lock(job)`, which calls `job.get_lock_name()` at `job.py:422`: `lock_name = lock_name(self.args)` \u2014 so the lambda receives the raw `params` list.\n> Step 6: Periodic scheduler calls `sync_db_keys` with zero user arguments \u2192 `params = []` \u2192 
lambda receives `[]` \u2192 lock key is `sync_encrypted_pool_dataset_keys_[]`.\n> Step 7: Explicit `call_sync('pool.dataset.sync_db_keys', None)` \u2192 `params = [None]` \u2192 lambda receives `[None]` \u2192 lock key is `sync_encrypted_pool_dataset_keys_[None]`.\n> Step 8: Keys differ \u2192 neither invocation blocks the other \u2192 two full-dataset syncs can run concurrently.\n\n**\ud83d\udca1 Suggested Fix**\n\nChange the lambda to extract only the first argument and treat a missing argument as `None`, mirroring the pattern used by `encryption_summary`:\n\n```python\n@job(lock=lambda args: f'sync_encrypted_pool_dataset_keys_{args[0] if args else None}')\n```\n\nThis ensures:\n- A periodic call (no args) and an explicit `call(..., None)` both produce the same lock key: `sync_encrypted_pool_dataset_keys_None`\n- A call with a specific pool name produces `sync_encrypted_pool_dataset_keys_tank`\n- The key no longer contains list brackets\n\n---\n*`Decorator Order and Lock Key Correctness` \u00b7 confidence 92%*", - "line": 161, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Existing passphrase-encrypted datasets silently re-keyed at 3.7x higher iteration count on next change_key call via any API version**\n\n**Existing datasets with `pbkdf2iters` between 100,000 and 1,299,999 will have their iteration count permanently changed to 1,300,000 on the next `change_key` call, regardless of whether the user explicitly requested this change.**\n\nThere are two distinct triggers:\n\n1. **Legacy API client omits `pbkdf2iters`:** The v25.10.x default was 350,000. When a v25.x client calls `change_key` without specifying `pbkdf2iters`, `_adapt_value` fills in the missing field using the **v26.0.0 new default** of `1300000` (version.py:226: `value[key_to_use] = field_info.get_default(call_default_factory=True)`). 
`from_previous` then sees `max(1300000, 1300000)` which is a no-op \u2014 but the applied value is the new default, not what the user would have expected from their v25.x context.\n\n2. **Legacy API client explicitly submits `pbkdf2iters=350000`:** `from_previous` clamps it to 1,300,000 as described in the companion finding.\n\nIn both cases, `change_key` permanently alters the ZFS dataset property `pbkdf2iters`. Once a dataset is re-keyed at 1,300,000 iterations, every subsequent passphrase-unlock of that dataset (at boot, during HA failover, or via `pool.dataset.unlock`) will run PBKDF2 with 1,300,000 iterations. The user never saw a prompt asking to confirm this change, and the API response `{\"result\": null}` provides no visibility into what iteration count was applied.\n\n**Scope:** Only passphrase-encrypted datasets are affected (line 114 of `dataset_encryption_operations.py` guards `pbkdf2iters` inclusion on `passphrase_key_format=True`). Raw-hex keyed datasets are not affected.\n\n---\n\n> Step 1: User has a passphrase-encrypted dataset with `pbkdf2iters=350000` (set under v25.x).\n> Step 2: User or script calls `pool.dataset.change_key` via v25.x API client without specifying `pbkdf2iters`.\n> Step 3: `_adapt_value` (version.py:224-227) detects `pbkdf2iters` is absent; since the field has a default in v26 (`1300000`), it fills: `value['pbkdf2iters'] = 1300000`.\n> Step 4: `from_previous` is a no-op for `max(1300000, 1300000)`, but the effective value is now 1,300,000 instead of the user's expected 350,000.\n> Step 5: `change_key` plugin line 191 passes `pbkdf2iters: 1300000` to `validate_encryption_data`.\n> Step 6: Since `passphrase_key_format=True`, line 114 includes `pbkdf2iters` in `opts`.\n> Step 7: `zfs/encryption.py::change_key()` writes `pbkdf2iters=1300000` permanently to ZFS.\n> Step 8: API returns `{\"result\": null}` \u2014 no indication the iteration count was elevated.\n\n**\ud83d\udca1 Suggested Fix**\n\nCompare `options['pbkdf2iters']` 
against the dataset's current stored iteration count before applying the change (available via `ds['pbkdf2iters']['parsed']` from `get_instance_quick`). If the value is being elevated due to the minimum-floor and not due to the user explicitly passing the new value, emit a warning. Consider adding a `pbkdf2iters_effective` field to `PoolDatasetChangeKeyResult` so callers can detect the actual value applied.\n\n---\n*`PBKDF2 Iteration Count Silent Migration` \u00b7 confidence 92%*", - "line": 175, - "path": "src/middlewared/middlewared/api/v26_0_0/pool_dataset.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Custom ZFS exceptions inherit from plain Exception instead of CallError, breaking structured error propagation across all callers**\n\n`ZFSKeyAlreadyLoadedException` (line 14) and `ZFSNotEncryptedException` (line 20) both inherit directly from `Exception`. This is the root cause of the contract break identified in the other findings.\n\nIn the TrueNAS middleware architecture, user-facing errors are expected to be `CallError` instances (with an `errno` attribute). Any unhandled non-`CallError` exception that escapes a service method is treated as an internal server error by the WebSocket API layer, producing unstructured error responses.\n\nBy making these exceptions plain `Exception` subclasses:\n1. Every call site that calls `load_key()`, `check_key()`, `change_key()`, or `change_encryption_root()` must manually wrap exceptions to convert them to `CallError` \u2014 creating a systemic catch-site gap.\n2. Existing bare `except Exception` handlers (as in `dataset_encryption_lock.py:229`) silently absorb them as string errors with no errno, making them indistinguishable from other failures.\n3. 
The `.message` attribute is redundant with `str(e)` since `super().__init__(self.message)` already sets the string representation \u2014 the `.message` attribute is never used by any handler.\n\n---\n\n> Step 1: `exceptions.py:14` \u2014 `class ZFSKeyAlreadyLoadedException(Exception)` \u2014 base class is plain `Exception`.\n> Step 2: `exceptions.py:20` \u2014 `class ZFSNotEncryptedException(Exception)` \u2014 base class is plain `Exception`.\n> Step 3: These are imported and raised in `zfs/encryption.py` at lines 31, 33, 58, 88, 105.\n> Step 4: `dataset_encryption_lock.py:229` and `dataset_encryption_operations.py:200,263` are call sites with no conversion to `CallError`.\n> Step 5: The middleware WebSocket error dispatch (not read, but standard TrueNAS architecture) wraps `CallError` into structured JSON error responses with errno codes; plain `Exception` becomes an unstructured internal error.\n\n**\ud83d\udca1 Suggested Fix**\n\nChange the base class of both exceptions to `CallError` with appropriate errno values:\n```python\nfrom middlewared.service.core import CallError # or wherever CallError is importable\nimport errno\n\nclass ZFSKeyAlreadyLoadedException(CallError):\n def __init__(self, path: str):\n super().__init__(f\"{path!r} key is already loaded\", errno=errno.EEXIST)\n\nclass ZFSNotEncryptedException(CallError):\n def __init__(self, path: str):\n super().__init__(f\"{path!r} is not encrypted\", errno=errno.ENOTSUP)\n```\nThis ensures that wherever these exceptions propagate \u2014 through `except Exception`, `except CallError`, or unhandled \u2014 they carry structured error information and are handled correctly by the middleware's error dispatch layer. 
Note: verify there are no circular import issues between `middlewared.plugins.zfs` and `middlewared.service`; if so, an intermediate base class in `zfs/exceptions.py` may be needed.\n\n---\n*`Exception Handling Contract` \u00b7 confidence 90%*", - "line": 14, - "path": "src/middlewared/middlewared/plugins/zfs/exceptions.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] ZFSNotEncryptedException from change_key() propagates as raw Exception to WebSocket API layer \u2014 no CallError wrapping**\n\n`dataset_encryption_operations.py:200` calls `change_key(tls, id_, encryption_dict, key)` with no surrounding try/except. The `change_key` function in `zfs/encryption.py:87-88` can raise `ZFSNotEncryptedException` if `rsrc.crypto()` returns `None`.\n\nAlthough the `change_key` method does validate `ds['encrypted']` at line 134 via `verrors.add`, this is a **database/metadata check** \u2014 it does NOT prevent a race condition where the ZFS state diverges from the database (e.g. dataset was recreated between the query and the `change_key` call). 
If the ZFS layer reports the dataset as unencrypted but the DB still has it marked encrypted, `ZFSNotEncryptedException` will propagate all the way to the WebSocket API layer as an unhandled `Exception`, not a `CallError`.\n\nSimilarly, `change_encryption_root` at `dataset_encryption_operations.py:263` calls `change_encryption_root(tls, id_)` which also raises `ZFSNotEncryptedException` at `encryption.py:104-105` with no catch.\n\n---\n\n> Step 1: `change_key` method in `dataset_encryption_operations.py:200` calls `change_key(tls, id_, encryption_dict, key)` with no try/except.\n> Step 2: `change_key` in `zfs/encryption.py:86-88`: `rsrc = open_resource(tls, dataset); if (crypto := rsrc.crypto()) is None: raise ZFSNotEncryptedException(dataset)`.\n> Step 3: `ZFSNotEncryptedException` inherits from `Exception` (confirmed at `exceptions.py:20`), NOT from `CallError`.\n> Step 4: No catch exists between `encryption.py:88` and the WebSocket layer. The exception propagates as a raw `Exception`.\n> Step 5: The WebSocket API layer expects `CallError` for user-facing error messages with structured errno codes. 
A raw `Exception` results in an unstructured 500-style error.\n> Same path applies to `change_encryption_root` at `dataset_encryption_operations.py:263` calling `encryption.py:103-105`.\n\n**\ud83d\udca1 Suggested Fix**\n\nWrap the `change_key` and `change_encryption_root` calls with try/except to convert `ZFSNotEncryptedException` (and `ZFSKeyAlreadyLoadedException` if applicable) into `CallError`:\n```python\nfrom middlewared.plugins.zfs.exceptions import ZFSNotEncryptedException\n\ntry:\n change_key(tls, id_, encryption_dict, key)\nexcept ZFSNotEncryptedException as e:\n raise CallError(str(e), errno=errno.ENOTSUP)\n```\nAlternatively, make `ZFSNotEncryptedException` a subclass of `CallError` with a fixed errno so it automatically presents correctly to all callers throughout the codebase.\n\n---\n*`Exception Handling Contract` \u00b7 confidence 82%*", - "line": 200, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Raw truenas_pylibzfs.ZFSException from crypto.load_key() propagates out of encryption.load_key() undecorated, breaking the old CallError contract for any caller outside unlock**\n\nIn the old `zfs.dataset.load_key` service method, all `libzfs.ZFSException` instances were caught and re-raised as `CallError`. In the new `encryption.py:load_key()`, the call to `crypto.load_key(**kwargs)` at line 34 is **not wrapped in any try/except**.\n\nAny `truenas_pylibzfs.ZFSException` raised by `crypto.load_key()` propagates directly out of `encryption.load_key()` back to its caller with:\n- A `.code` attribute (a `ZFSError` enum value)\n- **No `.errmsg`** or **`.errno`** fields in the `CallError` sense\n- No `CallError` wrapping\n\nFor the `unlock` call path in `dataset_encryption_lock.py`, this is handled correctly: `except ZFSException as e:` at line 223 catches these and processes `EZFS_CRYPTOFAILED` vs. other codes. 
So the current only caller handles it.\n\nHowever, the **API contract has silently changed**: any other present or future caller of `encryption.load_key()` that expects `CallError` (because the old `zfs.dataset.load_key` always raised `CallError`) will receive raw `ZFSException` instead. If such a caller reaches the WebSocket dispatch layer without intermediate handling, `websocket_app.py:196-207` catches the bare `Exception`, calls `adapt_exception(e)` (which only handles `subprocess.CalledProcessError` \u2014 not `ZFSException`), and falls back to `send_error(message, EINVAL, str(e))`, losing the original ZFS error code entirely and emitting a generic `EINVAL` to the client.\n\n---\n\n> Step 1: `encryption.py:load_key()` calls `crypto.load_key(**kwargs)` at line 34 with no surrounding try/except block.\n> Step 2: `truenas_pylibzfs.ZFSException` is the exception type raised by `crypto.load_key()` on failure (e.g., wrong key \u2192 `EZFS_CRYPTOFAILED`).\n> Step 3: `ZFSException` has a `.code` attribute (a `ZFSError` enum), but no `.errmsg` or `.errno` in the `CallError` sense.\n> Step 4: The old service method `zfs.dataset.load_key` caught all `libzfs.ZFSException` and re-raised as `CallError` \u2014 all callers expected `CallError`.\n> Step 5: A hypothetical new caller of `encryption.load_key()` that does not import `truenas_pylibzfs.ZFSException` and uses only `except CallError` will miss the exception.\n> Step 6: That uncaught `ZFSException` reaches `websocket_app.py:196`, `adapt_exception(e)` returns `None` (only handles `CalledProcessError`), and `send_error(message, EINVAL, str(e))` emits an unstructured `EINVAL` response to the client.\n\n**\ud83d\udca1 Suggested Fix**\n\nEither:\n1. **Document the contract explicitly** in `load_key()`'s docstring: state that it may raise `truenas_pylibzfs.ZFSException` directly (in addition to `ZFSNotEncryptedException` and `ZFSKeyAlreadyLoadedException`), so all callers know they must handle `ZFSException`.\n2. 
**Convert at the boundary**: wrap `crypto.load_key(**kwargs)` in a try/except that re-raises as a typed domain exception (e.g., add `ZFSLoadKeyException` to `exceptions.py`), so `encryption.py` never leaks `truenas_pylibzfs` types to callers:\n```python\ntry:\n crypto.load_key(**kwargs)\nexcept ZFSException as e:\n if e.code == ZFSError.EZFS_CRYPTOFAILED:\n raise ZFSInvalidKeyException(dataset) from e\n raise\n```\nOption 2 is the cleaner design: it keeps `truenas_pylibzfs` as an internal implementation detail.\n\n---\n*`Exception Handling and Error Flow` \u00b7 confidence 80%*", - "line": 34, - "path": "src/middlewared/middlewared/plugins/zfs/encryption.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] 3.7x PBKDF2 iteration increase enforced with no hardware capability check; may cause passphrase unlock timeouts making datasets inaccessible**\n\n**The 3.7x increase from 350,000 to 1,300,000 PBKDF2 iterations is applied unconditionally with no runtime check for hardware capability. On low-power or embedded hardware, this can cause passphrase-based key derivation to exceed unlock timeouts, making encrypted datasets permanently inaccessible without manual CLI intervention.**\n\nOnce a passphrase-encrypted dataset is re-keyed with `pbkdf2iters=1300000` (whether explicitly or via the silent clamping in `from_previous`), every future unlock attempt runs PBKDF2-SHA256 with 1,300,000 iterations synchronously. On ARM SoCs and Atom-class CPUs common in consumer NAS hardware:\n- At 350,000 iters: typically ~0.5\u20131 second per dataset\n- At 1,300,000 iters: typically ~2\u20134 seconds per dataset\n\nFor pools with multiple passphrase-encrypted datasets that must all unlock at pool import (a common TrueNAS configuration), unlock times multiply linearly. 
If this occurs during boot under a systemd service timeout, or during HA failover under a failover timeout, the unlock will fail \u2014 and with `ge=1300000` enforced as the hard minimum, there is **no API path** to reduce the iteration count back down without using the ZFS CLI directly (`zfs change-key -o pbkdf2iters=...`).\n\nThe `change_key` plugin (`dataset_encryption_operations.py:118`) does not measure or estimate key derivation time before applying the new iteration count. Neither `PoolCreateEncryptionOptions` nor `PoolDatasetChangeKeyOptions` expose any per-hardware tuning path below the new minimum.\n\nNote: `PoolCreateEncryptionOptions.from_previous` in `pool.py:152` applies the same clamping on pool creation encryption options. For new pool creation this affects the root dataset's initial encryption setup, not just re-keying.\n\n---\n\n> Step 1: Passphrase-encrypted dataset is re-keyed to `pbkdf2iters=1300000` via `change_key` (either explicitly or via silent clamping from `from_previous`).\n> Step 2: `dataset_encryption_operations.py:191` passes `pbkdf2iters: options['pbkdf2iters']` to `validate_encryption_data`.\n> Step 3: `validate_encryption_data` line 114 includes `pbkdf2iters` in `opts` when `passphrase_key_format=True`.\n> Step 4: `zfs/encryption.py::change_key()` line 89 calls `tls.lzh.resource_cryptography_config(**props)` with `pbkdf2iters=1300000`, permanently recording it as a ZFS dataset property.\n> Step 5: On the next pool import or `pool.dataset.unlock`, ZFS runs PBKDF2-SHA256 with 1,300,000 iterations to derive the wrapping key from the passphrase.\n> Step 6: On low-power hardware (e.g., Cortex-A53 at 1.4GHz, ~350k iters/sec for PBKDF2-SHA256), this takes ~3.7 seconds per dataset. 
With 5 passphrase datasets: ~18.5 seconds total.\n> Step 7: If a systemd or HA failover timeout fires during this window, unlock fails; dataset remains locked.\n> Step 8: The `ge=1300000` constraint on `PoolDatasetChangeKeyOptions` means there is no supported API path to reduce `pbkdf2iters` on an already-re-keyed dataset \u2014 only direct ZFS CLI access can recover.\n\n**\ud83d\udca1 Suggested Fix**\n\nConsider the following mitigations: (1) **Benchmark gate:** Before applying `change_key` with a high `pbkdf2iters`, run a short PBKDF2 benchmark and warn or reject if estimated unlock time exceeds a configurable threshold. (2) **System-wide override:** Allow a `tunable` or system config option to set a lower `pbkdf2iters` ceiling for constrained hardware, overriding the API minimum for that installation. (3) **Recovery documentation:** Explicitly document that `zfs change-key -o pbkdf2iters=` is available as a recovery path if unlock times become prohibitive. (4) **Job warning:** At minimum, have the `change_key` job emit a progress message noting the effective iteration count when it exceeds the old default.\n\n---\n*`PBKDF2 Iteration Count Silent Migration` \u00b7 confidence 75%*", - "line": 151, - "path": "src/middlewared/middlewared/api/v26_0_0/pool.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] No double-injection bug: explicit tls passing is correct for direct calls**\n\n`@pass_thread_local_storage` is a **marker-only decorator** \u2014 it sets `fn._pass_thread_local_storage = True` and returns `fn` unchanged (`decorators.py:221-222`). The actual `tls` injection happens only at API dispatch time: in `main.py:862-865` for normal methods and `job.py:620-621` for `@job` methods.\n\nWhen `sync_zfs_keys` calls `self.push_zfs_keys(tls, ids)` and `self.pull_zfs_keys(tls)` directly (lines 138 and 142), these are **plain Python method calls** \u2014 they bypass the middleware dispatch system entirely. 
The `_pass_thread_local_storage` attribute on `push_zfs_keys` and `pull_zfs_keys` has **no effect** on direct calls. Therefore, `tls` is supplied exactly once by the caller, and the functions receive it correctly.\n\nThe decorators on `push_zfs_keys`/`pull_zfs_keys` are intentional: they allow those methods to be called independently through the middleware dispatch system (e.g., `self.middleware.call_sync('kmip.push_zfs_keys', ...)`) with `tls` injected automatically. The `# type: ignore` comments are consistent with the decorator's type signature hiding `tls` from external callers.\n\n**No double-injection occurs. The code is correct for this pattern.**\n\n---\n\n> Step 1: `pass_thread_local_storage` in `service/decorators.py:209-222` sets `fn._pass_thread_local_storage = True` and returns `fn` unchanged \u2014 no wrapping, no injection at decoration time.\n> Step 2: `main.py:862-865` \u2014 injection only occurs inside `_call_prepare`, which is invoked by the middleware dispatch system, not on direct Python calls.\n> Step 3: `job.py:620-621` \u2014 same: injection only at job run time via `prepend.append(thread_local_storage)`.\n> Step 4: `sync_zfs_keys` at lines 138/142 calls `self.push_zfs_keys(tls, ids)` directly \u2014 this is a plain Python attribute lookup and call, bypassing `_call_prepare` entirely.\n> Step 5: `push_zfs_keys` receives `(self, tls, ids)` \u2014 one `tls` from the caller, zero injected by decorator. Correct.\n\n**\ud83d\udca1 Suggested Fix**\n\nNo change needed for the decorator/injection pattern. 
The explicit `tls` passing at lines 138 and 142 is correct because these are direct Python method calls, not middleware dispatches.\n\n---\n*`Decorator Double-Injection Analysis` \u00b7 confidence 98%*", - "line": 138, - "path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] No test covers the newly-enforced rejection path (passphrase root + key-encrypted child roots)**\n\nThe only integration test for `inherit_parent_encryption_properties` (`tests/api2/test_pool_dataset_encryption.py:404`) exercises the case where the parent's encryption root uses a **hex key** \u2014 so `parent_encrypted_root['key_format']['value'] == 'HEX'`. The guard evaluates to `False` in both old and new code, meaning this test provides **zero coverage** of the bug fix.\n\nThe case that was silently broken (passphrase-encrypted parent root + key-encrypted child encryption roots under `id_`) has never been tested. Now that the guard works correctly, there is a real behavioral difference: the operation **raises a `CallError`** instead of silently succeeding. Without a test for this path:\n\n1. There is no automated verification that the `CallError` message is correct.\n2. A future refactor could re-introduce the same type-comparison mistake and no test would catch it.\n3. 
The complementary allowed case \u2014 passphrase parent root, `id_` has *no* key-encrypted child roots \u2014 is also untested; verifying it proceeds successfully is equally important.\n\nThe guard itself (`any(d['name'] == d['encryption_root'] for d in self.middleware.call_sync('pool.dataset.query', [...]))`) is logically sound and the fix is correct, but the absence of test coverage for the enforced path is a gap worth closing.\n\n---\n\n> Only test reference: `tests/api2/test_pool_dataset_encryption.py:404`\n> ```python\n> def test_key_encrypted_dataset(self):\n> # parent uses HEX key\n> payload = {'name': dataset, 'encryption_options': {'key': dataset_token_hex}, ...}\n> call('pool.dataset.create', payload)\n> # child uses PASSPHRASE\n> payload.update({'name': child_dataset, 'encryption_options': {'passphrase': passphrase}})\n> call('pool.dataset.create', payload)\n> # parent_encrypted_root is the HEX-keyed parent -> guard evaluates False in both old and new code\n> call('pool.dataset.inherit_parent_encryption_properties', child_dataset)\n> ds = call('pool.dataset.get_instance', child_dataset)\n> assert ds['key_format']['value'] == 'HEX', ds\n> ```\n> No test exercises the path where `parent_encrypted_root['key_format']['value'] == 'PASSPHRASE'`.\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd a test case in `tests/api2/test_pool_dataset_encryption.py` that:\n1. Creates a passphrase-encrypted dataset `P` as an encryption root.\n2. Creates `P/K` as a key-encrypted encryption root (child of P).\n3. Creates `P/K/KC` as a second key-encrypted encryption root (grandchild).\n4. Calls `pool.dataset.inherit_parent_encryption_properties('P/K')` and asserts a `ClientException` / `CallError` is raised containing the expected message.\n5. 
Also tests the allowed sub-case: `P/K` with no key-encrypted child roots successfully inherits from the passphrase root.\n\n---\n*`Enum vs String Comparison Bug in Encryption Root Guard` \u00b7 confidence 95%*", - "line": 248, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "side": "RIGHT" - }, - { - "body": "\u26aa **[NITPICK] Original `tls`-injection concern is a false alarm: decorator order is correct and `tls` is never visible to the lock lambda**\n\nThe review prompt raised a concern that if `@pass_thread_local_storage` wraps the `@job`-decorated function, the lock lambda might see `(tls, name)` instead of `(name,)`.\n\nThis concern does **not** apply. Both decorators are pure markers:\n\n```python\n# decorators.py:153-166\ndef check_job(fn):\n fn._job = {'lock': lock, ...}\n return fn # fn is returned unchanged\n\n# decorators.py:221-222\nfn._pass_thread_local_storage = True\nreturn fn # fn is returned unchanged\n```\n\nNeither decorator wraps the function \u2014 they only set attributes. The `tls` object is injected at job run time in `job.py:620-621` inside `Job.__run_body`, well after `get_lock_name()` has already evaluated the lock lambda at queue time. 
The `Job` object is constructed with `params` (raw caller args), and that is what the lambda sees \u2014 never `tls`.\n\nThe actual decorator stacking requirement is documented in `api/base/decorator.py:53-59`: `@job` must be the innermost (bottommost) decorator, and the current ordering is correct.\n\n---\n\n> Step 1: `@pass_thread_local_storage` at `decorators.py:209-222` sets `fn._pass_thread_local_storage = True` and returns `fn` \u2014 no wrapping.\n> Step 2: `@job` at `decorators.py:153-166` sets `fn._job = {...}` and returns `fn` \u2014 no wrapping.\n> Step 3: `_call_prepare` at `main.py:880` constructs `Job(..., params, job_options, ...)` where `params` is the raw caller args \u2014 `tls` is NOT in this list.\n> Step 4: `tls` injection for jobs occurs in `job.py:620-621` inside `Job.__run_body`, which runs after the job has been queued and the lock key has already been computed.\n> Step 5: `get_lock_name` at `job.py:422` calls `lock_name(self.args)` where `self.args = params` \u2014 the lambda never sees `tls`.\n\n**\ud83d\udca1 Suggested Fix**\n\nNo code change needed for this specific concern. 
The decorator order is correct and `tls` is never present in the lock lambda's argument list.\n\n---\n*`Decorator Order and Lock Key Correctness` \u00b7 confidence 97%*", - "line": 158, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "side": "RIGHT" - } - ], - "event": "REQUEST_CHANGES" - }, - "review_id": "rev_07c8d4f2bf5a", - "summary": { - "adversary_challenged": 0, - "adversary_confirmed": 0, - "ai_generated_confidence": 0, - "budget_exhausted": true, - "by_severity": { - "critical": 2, - "important": 9, - "nitpick": 1, - "suggestion": 2 - }, - "cost_usd": 0, - "coverage_iterations": 0, - "cross_ref_interactions": 0, - "dimensions_run": 6, - "duration_seconds": 1808.733, - "total_findings": 14 - } - }, - "started_at": "2026-03-10T14:41:21Z", - "completed_at": "2026-03-10T15:11:32Z", - "duration_ms": 1811005, - "webhook_registered": false -} diff --git a/benchmark/truenas-middleware-18291/pr-af-result.json b/benchmark/truenas-middleware-18291/pr-af-result.json deleted file mode 100644 index adcef99..0000000 --- a/benchmark/truenas-middleware-18291/pr-af-result.json +++ /dev/null @@ -1,1086 +0,0 @@ -{ - "execution_id": "exec_20260310_144121_rkn7qq8x", - "run_id": "run_20260310_144121_ji0fblzy", - "status": "succeeded", - "result": { - "findings": [ - { - "active_multipliers": [], - "body": "`get_encrypted_datasets` returns a `list` of dataset dicts (each a `dict` with keys `'name'`, `'id'`, `'encryption_key'`, `'kmip_uid'`, etc.). The in-memory key cache is a `dict[str, bytes]` keyed by dataset name.\n\nAt line 94 (and identically at line 125), the filter expression `if k in existing_datasets` checks whether the **string** `k` (a dataset name) is a member of a **list of dicts**. 
Python's `in` operator for lists uses `==` equality \u2014 a string will never equal a dict, so this membership test is **always `False`** for every dataset name.\n\nAs a result, **`self.zfs_keys` is emptied to `{}` after every call to `push_zfs_keys` or `pull_zfs_keys`**, regardless of which datasets were actually processed. This defeats the entire purpose of the in-memory key cache: subsequent calls cannot reuse previously loaded keys, and the optimization at lines 64-69 and 107-111 (skipping KMIP retrieval when the key is already known and valid) will never trigger after the first sync.\n\nThe fix should use `{ds['name'] for ds in existing_datasets}` to build a set of names for the membership check.", - "confidence": 0.97, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "decorator_injection", - "dimension_name": "Decorator Double-Injection Analysis", - "evidence": "Step 1: `get_encrypted_datasets` (lines 33-52) builds `rv` by appending `ds_in_db[i['name']]` \u2014 each element is a dict like `{'id': 1, 'name': 'pool/ds', 'encryption_key': ..., 'kmip_uid': ...}`.\nStep 2: `push_zfs_keys` line 59: `existing_datasets = self.get_encrypted_datasets(filters)` \u2192 list of dicts.\nStep 3: Line 94: `{k: v for k, v in self.zfs_keys.items() if k in existing_datasets}` \u2014 `k` is a string (e.g. `'pool/ds'`), `existing_datasets` is a list of dicts. Python evaluates `'pool/ds' == {'id': 1, 'name': 'pool/ds', ...}` \u2192 `False` for every element.\nStep 4: All items are filtered out. 
`self.zfs_keys` becomes `{}`.\nStep 5: Same logic applies identically at line 125 in `pull_zfs_keys`.\nStep 6: On the next call, lines 64-69 check `ds['name'] in self.zfs_keys` \u2192 always `False` \u2192 unnecessary KMIP round-trips for every dataset on every sync.", - "file_path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "id": "f_001", - "line_end": 94, - "line_start": 94, - "score": 0.97, - "severity": "critical", - "suggestion": "Change both occurrences to build a name-set first:\n\n```python\n# Line 94 in push_zfs_keys:\nexisting_names = {ds['name'] for ds in existing_datasets}\nself.zfs_keys = {k: v for k, v in self.zfs_keys.items() if k in existing_names}\n\n# Line 125 in pull_zfs_keys:\nexisting_names = {ds['name'] for ds in existing_datasets}\nself.zfs_keys = {k: v for k, v in self.zfs_keys.items() if k in existing_names}\n```\n\nThis restores the intended behavior: evict cache entries for datasets that no longer exist, while preserving entries for datasets that do.", - "tags": [ - "logic-error", - "cache", - "silent-data-loss", - "membership-check" - ], - "title": "zfs_keys cache silently wiped on every push/pull: `k in existing_datasets` checks string in list-of-dicts" - }, - { - "active_multipliers": [], - "body": "The `datastore.update` API signature is `(table: str, id: int, data: dict)`. At line 157, the call is:\n\n```python\nawait self.middleware.call('datastore.update', 'storage.encrypteddataset', {'kmip_uid': None})\n```\n\nThis passes **only two positional arguments** after the method name: `table='storage.encrypteddataset'` and `id={'kmip_uid': None}`. The `data` dict argument is missing entirely. 
The middleware will either raise a `TypeError` due to wrong argument count/types, or silently misinterpret `{'kmip_uid': None}` as the row `id`, attempting to look up a row by dict identity \u2014 which will fail.\n\nThe intent (from surrounding context in `clear_sync_pending_zfs_keys`, lines 153-161) is clearly to update the specific dataset record `ds` to clear its `kmip_uid`. The missing argument is `ds['id']`.\n\nThis means `clear_sync_pending_zfs_keys` will **always raise an error** when processing any dataset whose `encryption_key` is set, leaving `kmip_uid` values un-cleared and the sync-pending state stale.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "decorator_injection", - "dimension_name": "Decorator Double-Injection Analysis", - "evidence": "Step 1: `clear_sync_pending_zfs_keys` at lines 153-160 iterates over encrypted datasets with non-null `kmip_uid`.\nStep 2: For a dataset where `ds['encryption_key']` is truthy (line 156), it calls `datastore.update` at line 157.\nStep 3: The call is `('datastore.update', 'storage.encrypteddataset', {'kmip_uid': None})` \u2014 three args total, but `datastore.update` requires four: `(method, table, id, data)`.\nStep 4: Compare with correct usages at line 93: `self.middleware.call_sync('datastore.update', 'storage.encrypteddataset', ds['id'], update_data)` and line 121: same pattern with `ds['id']`.\nStep 5: The missing `ds['id']` means the dict `{'kmip_uid': None}` is passed as the `id` parameter \u2014 this will cause a runtime error in the datastore layer when it tries to use a dict as a row identifier.", - "file_path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "id": "f_002", - "line_end": 157, - "line_start": 157, - "score": 0.95, - "severity": "critical", - "suggestion": "Add the missing `ds['id']` argument:\n\n```python\nawait self.middleware.call('datastore.update', 'storage.encrypteddataset', ds['id'], {'kmip_uid': None})\n```\n\nThis matches the 
pattern used elsewhere in the codebase (e.g., line 93 and line 121).", - "tags": [ - "runtime-error", - "wrong-arguments", - "data-integrity", - "typo" - ], - "title": "Missing `id` argument in `datastore.update` call \u2014 wrong argument count, update never applied to correct row" - }, - { - "active_multipliers": [], - "body": "**The old comparison was provably always `False`.**\n\nIn the prior code (`bde8f1de3b`), the guard in `inherit_parent_encryption_properties_impl` read:\n\n```python\nif ZFSKeyFormat(parent_encrypted_root.key_format.value) == ZFSKeyFormat.PASSPHRASE.value:\n```\n\nThe left-hand side is `ZFSKeyFormat('PASSPHRASE')` \u2014 a `ZFSKeyFormat` enum *instance* \u2014 while the right-hand side is `ZFSKeyFormat.PASSPHRASE.value` \u2014 the raw string `'PASSPHRASE'`. Python's `==` for `Enum` instances does **not** fall back to comparing against the `.value`; an enum member only equals itself (or another member with the same identity), never a plain string. This was verified:\n\n```\nZFSKeyFormat('PASSPHRASE') == 'PASSPHRASE' # \u2192 False, always\n```\n\n**What the guard was supposed to do:** prevent a key-encrypted dataset (`id_`) that has its own key-encrypted child encryption roots from inheriting a passphrase-encrypted parent root. If such a dataset were allowed to inherit, its key-encrypted children would end up under a passphrase root, violating the invariant that passphrase roots cannot have key-encrypted encryption-root descendants.\n\n**Behavioral change introduced by the fix:** The new code uses:\n\n```python\nif parent_encrypted_root['key_format']['value'] == ZFSKeyFormat.PASSPHRASE.value:\n```\n\nThis is a string-to-string comparison (`'PASSPHRASE' == 'PASSPHRASE'`) that evaluates to `True` correctly. 
For the first time, the inner `any(...)` check that looks for key-encrypted child encryption roots is actually executed, and if any are found, a `CallError` is raised, preventing the operation.\n\n**Concrete scenario now blocked that was previously silently allowed:**\n\n1. Pool `tank` has dataset `tank/passroot` encrypted with a passphrase (encryption root).\n2. Under it, `tank/passroot/keyroot` is a key-encrypted encryption root (HEX format).\n3. Under `keyroot`, `tank/passroot/keyroot/keychild` is *also* a key-encrypted encryption root.\n4. A user calls `pool.dataset.inherit_parent_encryption_properties('tank/passroot/keyroot')`.\n5. **Old code:** guard fires `False`, inner check is skipped, `change_encryption_root` executes. `keyroot` now falls under `passroot`'s passphrase root, but `keychild` remains a separate key-encrypted root under a passphrase root \u2014 an explicitly forbidden structure.\n6. **New code:** guard fires `True`, inner `any()` detects `keychild`, raises `CallError` with a clear message. The operation is rejected.\n\n**Does any existing production workflow depend on the old no-op guard?** The only test exercising `inherit_parent_encryption_properties` (`test_key_encrypted_dataset` at line 404) uses a *hex-key* parent root, so `parent_encrypted_root['key_format']['value'] == 'HEX'`, and the guard evaluates to `False` in both old and new code. That test is unaffected. There is no test covering the now-enforced case (passphrase parent root + key-encrypted child roots), which is the exact gap described below.", - "confidence": 0.98, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "enum-comparison-guard", - "dimension_name": "Enum vs String Comparison Bug in Encryption Root Guard", - "evidence": "Step 1: Old code at `bde8f1de3b` line ~222: `if ZFSKeyFormat(parent_encrypted_root.key_format.value) == ZFSKeyFormat.PASSPHRASE.value:`\nStep 2: `parent_encrypted_root.key_format.value` is a string, e.g. 
`'PASSPHRASE'`.\nStep 3: `ZFSKeyFormat('PASSPHRASE')` constructs `ZFSKeyFormat.PASSPHRASE`, an enum instance.\nStep 4: `ZFSKeyFormat.PASSPHRASE == 'PASSPHRASE'` \u2192 `False` (Python Enum.__eq__ compares member identity, not value string).\nStep 5: The `if` body (the `any()` child-root check and potential `raise CallError`) is NEVER reached regardless of input.\nStep 6: `change_encryption_root` / `zfs.dataset.change_encryption_root` always executes even when the parent root is passphrase-encrypted and the dataset has key-encrypted child roots.\nVerification: `python3 -c \"from enum import Enum; class E(Enum): P='PASSPHRASE'; print(E('PASSPHRASE') == 'PASSPHRASE')\"` prints `False`.", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "id": "f_003", - "line_end": 261, - "line_start": 248, - "score": 0.686, - "severity": "important", - "suggestion": "The fix is correct. The only follow-up needed is a regression test for the newly-enforced path: create a passphrase-encrypted root, a key-encrypted encryption root beneath it, and a second key-encrypted encryption root as a child of that \u2014 then assert that `inherit_parent_encryption_properties` on the middle dataset raises a `CallError`. This ensures the guard remains correct if the code is refactored again.", - "tags": [ - "logic-error", - "enum-comparison", - "security", - "encryption", - "guard-bypassed" - ], - "title": "Old guard was always False: key-encrypted child under passphrase-root inheritance was never blocked" - }, - { - "active_multipliers": [], - "body": "The bare `except Exception as e` branch on line 229 catches `ZFSKeyAlreadyLoadedException` and `ZFSNotEncryptedException` (both plain `Exception` subclasses from `zfs/exceptions.py`) and converts them to `failed[name]['error'] = str(e)` \u2014 a raw string embedded in the return value dict.\n\nThis is a contract violation because:\n1. 
These exceptions are **pre-condition guards** (dataset not encrypted, or key already loaded) that signal programmer/caller errors, not transient ZFS crypto failures. Treating them identically to \"Invalid Key\" hides the actual cause.\n2. The `unlock` API method's structured return `{'unlocked': [...], 'failed': {...}}` will surface these as opaque string errors (e.g. `\"'pool/ds' key is already loaded\"`) with no errno or structured error code, making it impossible for callers to distinguish pre-condition failures from crypto failures.\n3. The old code path (before `load_key` was extracted to `zfs/encryption.py`) presumably raised `CallError` directly \u2014 the refactoring broke this by introducing new exception types without updating the catch sites.\n\nSpecifically:\n- `ZFSKeyAlreadyLoadedException` raised at `encryption.py:33` falls into `except Exception` at `dataset_encryption_lock.py:229`\n- `ZFSNotEncryptedException` raised at `encryption.py:31` similarly falls into `except Exception` at `dataset_encryption_lock.py:229`\n\nNeither is ever re-raised as a `CallError`.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "exception-handling-contract", - "dimension_name": "Exception Handling Contract", - "evidence": "Step 1: `unlock` calls `load_key(tls, name, key=datasets[name]['key'])` at line 222.\nStep 2: `load_key` in `zfs/encryption.py:31` calls `rsrc.crypto()`, and if it returns `None`, raises `ZFSNotEncryptedException(dataset)` \u2014 a subclass of plain `Exception` (confirmed at `exceptions.py:20`).\nStep 3: `load_key` at `encryption.py:33` raises `ZFSKeyAlreadyLoadedException(dataset)` if `crypto.info().key_is_loaded` is True \u2014 also a plain `Exception` subclass (`exceptions.py:14`).\nStep 4: Neither exception is a `ZFSException` subclass (imported from `truenas_pylibzfs`), so the `except ZFSException as e` block at line 223 does NOT catch them.\nStep 5: They fall through to `except Exception as e` at line 229, 
where `failed[name]['error'] = str(e)` stores the message string `\"'pool/ds' key is already loaded\"` or `\"'pool/ds' is not encrypted\"` \u2014 no `CallError`, no errno.", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "id": "f_005", - "line_end": 231, - "line_start": 229, - "score": 0.665, - "severity": "important", - "suggestion": "Either (a) make `ZFSKeyAlreadyLoadedException` and `ZFSNotEncryptedException` inherit from `CallError` (with appropriate `errno` values such as `errno.ENOTSUP` for not-encrypted and `errno.EEXIST` for already-loaded), OR (b) add an explicit catch before the bare `except Exception` block:\n```python\nfrom middlewared.plugins.zfs.exceptions import ZFSKeyAlreadyLoadedException, ZFSNotEncryptedException\n\ntry:\n load_key(tls, name, key=datasets[name]['key'])\nexcept ZFSKeyAlreadyLoadedException:\n # Key already loaded means dataset is effectively unlocked; treat as success or specific error\n failed[name]['error'] = 'Key is already loaded'\n continue\nexcept ZFSNotEncryptedException:\n failed[name]['error'] = 'Dataset is not encrypted'\n continue\nexcept ZFSException as e:\n ...\nexcept Exception as e:\n failed[name]['error'] = str(e)\n continue\n```\nOption (a) is cleaner and ensures these exceptions carry structured error information everywhere they propagate.", - "tags": [ - "exception-handling", - "api-contract", - "error-propagation" - ], - "title": "ZFSKeyAlreadyLoadedException and ZFSNotEncryptedException silently swallowed as string errors instead of structured CallError" - }, - { - "active_multipliers": [], - "body": "**`from_previous` is invoked exclusively on incoming write operations (argument upgrade), never on reads (API responses).**\n\nThe `APIVersionsAdapter` in `legacy_api_method.py` upgrades incoming parameters from an older API version to the current version via `_adapt_params`, which calls `adapter.adapt(params_dict, model_name, self.api_version, 
self.adapter.current_version)`. Because `version1_index < version2_index` the direction resolves to `Direction.UPGRADE`, triggering `new_model.from_previous(value)` at `version.py:233`.\n\nConversely, `_dump_result` adapts the **result** from `current_version` back to `api_version` (downgrade direction), which calls `to_previous`. Neither `PoolDatasetChangeKeyOptions` nor `PoolCreateEncryptionOptions` define `to_previous`, so outgoing responses are never touched.\n\n**Practical impact:** An automation client or script pinned to API v25.x that deliberately submits `pbkdf2iters=350000` (valid under `ge=100000` in v25.10.x) will have that value silently overwritten to `1300000` by `from_previous` before the `change_key` handler executes. The caller receives `{\"result\": null}` \u2014 the standard success response for `PoolDatasetChangeKeyResult` \u2014 with no indication that a different iteration count was actually applied to ZFS.\n\nNote: `pbkdf2iters` is only forwarded to the ZFS layer when `passphrase_key_format=True` (plugin line 114), so this affects only passphrase-encrypted datasets. For raw-hex keyed datasets `pbkdf2iters` is excluded from `opts` entirely and no iteration count is stored.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "pbkdf2iters-migration-behavior", - "dimension_name": "PBKDF2 Iteration Count Silent Migration", - "evidence": "Step 1: Client on API v25.10.2 calls `pool.dataset.change_key` with `options={\"pbkdf2iters\": 350000, \"passphrase\": \"mypass\"}`. 
Old model allows this: `pbkdf2iters: int = Field(default=350000, ge=100000)` (v25_10_2/pool_dataset.py:175).\nStep 2: `LegacyAPIMethod.call()` (legacy_api_method.py:60) calls `_adapt_params()` \u2192 `adapter.adapt(params_dict, 'PoolDatasetChangeKeyArgs', 'v25.10.2', 'v26.0.0')`.\nStep 3: `adapt_model` computes `version1_index < version2_index` \u2192 `direction = Direction.UPGRADE`.\nStep 4: `_adapt_value` on `PoolDatasetChangeKeyArgs` calls `_adapt_nested_value` on the `options` field because both versions define a model named `PoolDatasetChangeKeyOptions`; this triggers a recursive `_adapt_value` call.\nStep 5: At the end of the nested `_adapt_value`, line 233 of version.py: `value = new_model.from_previous(value)` where `new_model` is v26_0_0's `PoolDatasetChangeKeyOptions`.\nStep 6: `from_previous` (pool_dataset.py:185) executes `value['pbkdf2iters'] = max(1300000, 350000)` \u2192 `1300000`.\nStep 7: `change_key` plugin receives `options['pbkdf2iters'] == 1300000`, passes it to `validate_encryption_data` (line 191), which includes it in `opts` because `passphrase_key_format=True` (line 114).\nStep 8: `zfs/encryption.py::change_key()` permanently stores `pbkdf2iters=1300000` in the dataset's ZFS config.\nStep 9: `_dump_result` downgrades `{\"result\": null}` \u2014 no clamping info is surfaced.", - "file_path": "src/middlewared/middlewared/api/v26_0_0/pool_dataset.py", - "id": "f_011", - "line_end": 186, - "line_start": 183, - "score": 0.665, - "severity": "important", - "suggestion": "At minimum, emit a job log warning when `pbkdf2iters` is clamped upward. A job-status message such as `job.set_progress(0, f'Note: pbkdf2iters elevated from submitted value to {options[\"pbkdf2iters\"]}')` would make the override visible to operators. 
Longer-term, consider returning the effective `pbkdf2iters` in the result payload or adding a `to_previous` on the result model so legacy clients can detect the discrepancy.", - "tags": [ - "api-versioning", - "silent-migration", - "encryption", - "pbkdf2" - ], - "title": "from_previous fires on write only; legacy API callers have pbkdf2iters silently upgraded to 1,300,000 without any notification" - }, - { - "active_multipliers": [], - "body": "The `lock` lambda on `sync_db_keys` uses `args` (the entire raw-arguments list) rather than `args[0]` (the first positional argument, `name`):\n\n```python\n@job(lock=lambda args: f'sync_encrypted_pool_dataset_keys_{args}')\ndef sync_db_keys(self, job, tls, name=None):\n```\n\nThe `@job` and `@pass_thread_local_storage` decorators are both **pure marker decorators** \u2014 they stamp attributes on the function and return it unchanged. `Job.__init__` stores the raw caller-supplied `params` list as `self.args`, and the lock lambda is evaluated with that list before the job is queued (in `JobsQueue.handle_lock` \u2192 `Job.get_lock_name`). The `tls` object is injected at run time in `Job.__run_body`, well after lock computation, so `tls` is **not** visible to the lambda.\n\nThe real problem is that `name` has a default of `None`. This means:\n\n| Call site | `self.args` passed to lambda | Resulting lock key |\n|---|---|---|\n| Periodic scheduler (no args) | `[]` | `sync_encrypted_pool_dataset_keys_[]` |\n| `call_sync('pool.dataset.sync_db_keys', 'tank')` | `['tank']` | `sync_encrypted_pool_dataset_keys_['tank']` |\n| `call_sync('pool.dataset.sync_db_keys', None)` | `[None]` | `sync_encrypted_pool_dataset_keys_[None]` |\n\nThe periodic invocation produces the key `sync_encrypted_pool_dataset_keys_[]` while an explicit `sync_db_keys(None)` produces `sync_encrypted_pool_dataset_keys_[None]` \u2014 these are **different lock keys**, so the two calls do NOT share a lock and can run concurrently. 
This defeats the purpose of the lock for the all-datasets sync case.\n\nBy contrast, the `encryption_summary` lock lambda on the same class correctly uses `args[0]`:\n```python\n@job(lock=lambda args: f'encryption_summary_options_{args[0]}', ...)\n```\n\nAdditionally, the lock key includes Python list-repr brackets (e.g., `['tank']`) rather than a clean string like `tank`, making the key non-human-readable and fragile if calling conventions change.", - "confidence": 0.92, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "decorator-order-lock-key", - "dimension_name": "Decorator Order and Lock Key Correctness", - "evidence": "Step 1: `sync_db_keys` is decorated with `@job(lock=lambda args: f'sync_encrypted_pool_dataset_keys_{args}')` at line 161.\nStep 2: `@job` is a pure marker decorator (`decorators.py:153-166`) \u2014 it sets `fn._job = {'lock': lock, ...}` and returns `fn` unchanged.\nStep 3: `_call_prepare` in `main.py:880` constructs `Job(self, name, serviceobj, methodobj, params, ...)` where `params` is the raw caller-supplied arguments list.\nStep 4: `Job.__init__` at `job.py:333` stores `self.args = args` (the `params` parameter passed in).\nStep 5: `JobsQueue.add` at `job.py:149` calls `self.handle_lock(job)`, which calls `job.get_lock_name()` at `job.py:422`: `lock_name = lock_name(self.args)` \u2014 so the lambda receives the raw `params` list.\nStep 6: Periodic scheduler calls `sync_db_keys` with zero user arguments \u2192 `params = []` \u2192 lambda receives `[]` \u2192 lock key is `sync_encrypted_pool_dataset_keys_[]`.\nStep 7: Explicit `call_sync('pool.dataset.sync_db_keys', None)` \u2192 `params = [None]` \u2192 lambda receives `[None]` \u2192 lock key is `sync_encrypted_pool_dataset_keys_[None]`.\nStep 8: Keys differ \u2192 neither invocation blocks the other \u2192 two full-dataset syncs can run concurrently.", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "id": "f_009", - "line_end": 162, - 
"line_start": 161, - "score": 0.644, - "severity": "important", - "suggestion": "Change the lambda to extract only the first argument and treat a missing argument as `None`, mirroring the pattern used by `encryption_summary`:\n\n```python\n@job(lock=lambda args: f'sync_encrypted_pool_dataset_keys_{args[0] if args else None}')\n```\n\nThis ensures:\n- A periodic call (no args) and an explicit `call(..., None)` both produce the same lock key: `sync_encrypted_pool_dataset_keys_None`\n- A call with a specific pool name produces `sync_encrypted_pool_dataset_keys_tank`\n- The key no longer contains list brackets", - "tags": [ - "locking", - "concurrency", - "decorator-order", - "correctness" - ], - "title": "`sync_db_keys` lock lambda embeds the full args list, causing inconsistent lock keys between periodic and explicit calls" - }, - { - "active_multipliers": [], - "body": "**Existing datasets with `pbkdf2iters` between 100,000 and 1,299,999 will have their iteration count permanently changed to 1,300,000 on the next `change_key` call, regardless of whether the user explicitly requested this change.**\n\nThere are two distinct triggers:\n\n1. **Legacy API client omits `pbkdf2iters`:** The v25.10.x default was 350,000. When a v25.x client calls `change_key` without specifying `pbkdf2iters`, `_adapt_value` fills in the missing field using the **v26.0.0 new default** of `1300000` (version.py:226: `value[key_to_use] = field_info.get_default(call_default_factory=True)`). `from_previous` then sees `max(1300000, 1300000)` which is a no-op \u2014 but the applied value is the new default, not what the user would have expected from their v25.x context.\n\n2. **Legacy API client explicitly submits `pbkdf2iters=350000`:** `from_previous` clamps it to 1,300,000 as described in the companion finding.\n\nIn both cases, `change_key` permanently alters the ZFS dataset property `pbkdf2iters`. 
Once a dataset is re-keyed at 1,300,000 iterations, every subsequent passphrase-unlock of that dataset (at boot, during HA failover, or via `pool.dataset.unlock`) will run PBKDF2 with 1,300,000 iterations. The user never saw a prompt asking to confirm this change, and the API response `{\"result\": null}` provides no visibility into what iteration count was applied.\n\n**Scope:** Only passphrase-encrypted datasets are affected (line 114 of `dataset_encryption_operations.py` guards `pbkdf2iters` inclusion on `passphrase_key_format=True`). Raw-hex keyed datasets are not affected.", - "confidence": 0.92, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "pbkdf2iters-migration-behavior", - "dimension_name": "PBKDF2 Iteration Count Silent Migration", - "evidence": "Step 1: User has a passphrase-encrypted dataset with `pbkdf2iters=350000` (set under v25.x).\nStep 2: User or script calls `pool.dataset.change_key` via v25.x API client without specifying `pbkdf2iters`.\nStep 3: `_adapt_value` (version.py:224-227) detects `pbkdf2iters` is absent; since the field has a default in v26 (`1300000`), it fills: `value['pbkdf2iters'] = 1300000`.\nStep 4: `from_previous` is a no-op for `max(1300000, 1300000)`, but the effective value is now 1,300,000 instead of the user's expected 350,000.\nStep 5: `change_key` plugin line 191 passes `pbkdf2iters: 1300000` to `validate_encryption_data`.\nStep 6: Since `passphrase_key_format=True`, line 114 includes `pbkdf2iters` in `opts`.\nStep 7: `zfs/encryption.py::change_key()` writes `pbkdf2iters=1300000` permanently to ZFS.\nStep 8: API returns `{\"result\": null}` \u2014 no indication the iteration count was elevated.", - "file_path": "src/middlewared/middlewared/api/v26_0_0/pool_dataset.py", - "id": "f_012", - "line_end": 186, - "line_start": 175, - "score": 0.644, - "severity": "important", - "suggestion": "Compare `options['pbkdf2iters']` against the dataset's current stored iteration count before applying the change 
(available via `ds['pbkdf2iters']['parsed']` from `get_instance_quick`). If the value is being elevated due to the minimum-floor and not due to the user explicitly passing the new value, emit a warning. Consider adding a `pbkdf2iters_effective` field to `PoolDatasetChangeKeyResult` so callers can detect the actual value applied.", - "tags": [ - "encryption", - "silent-mutation", - "pbkdf2", - "dataset-state-change", - "api-versioning" - ], - "title": "Existing passphrase-encrypted datasets silently re-keyed at 3.7x higher iteration count on next change_key call via any API version" - }, - { - "active_multipliers": [], - "body": "`ZFSKeyAlreadyLoadedException` (line 14) and `ZFSNotEncryptedException` (line 20) both inherit directly from `Exception`. This is the root cause of the contract break identified in the other findings.\n\nIn the TrueNAS middleware architecture, user-facing errors are expected to be `CallError` instances (with an `errno` attribute). Any unhandled non-`CallError` exception that escapes a service method is treated as an internal server error by the WebSocket API layer, producing unstructured error responses.\n\nBy making these exceptions plain `Exception` subclasses:\n1. Every call site that calls `load_key()`, `check_key()`, `change_key()`, or `change_encryption_root()` must manually wrap exceptions to convert them to `CallError` \u2014 creating a systemic catch-site gap.\n2. Existing bare `except Exception` handlers (as in `dataset_encryption_lock.py:229`) silently absorb them as string errors with no errno, making them indistinguishable from other failures.\n3. 
The `.message` attribute is redundant with `str(e)` since `super().__init__(self.message)` already sets the string representation \u2014 the `.message` attribute is never used by any handler.", - "confidence": 0.9, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "exception-handling-contract", - "dimension_name": "Exception Handling Contract", - "evidence": "Step 1: `exceptions.py:14` \u2014 `class ZFSKeyAlreadyLoadedException(Exception)` \u2014 base class is plain `Exception`.\nStep 2: `exceptions.py:20` \u2014 `class ZFSNotEncryptedException(Exception)` \u2014 base class is plain `Exception`.\nStep 3: These are imported and raised in `zfs/encryption.py` at lines 31, 33, 58, 88, 105.\nStep 4: `dataset_encryption_lock.py:229` and `dataset_encryption_operations.py:200,263` are call sites with no conversion to `CallError`.\nStep 5: The middleware WebSocket error dispatch (not read, but standard TrueNAS architecture) wraps `CallError` into structured JSON error responses with errno codes; plain `Exception` becomes an unstructured internal error.", - "file_path": "src/middlewared/middlewared/plugins/zfs/exceptions.py", - "id": "f_007", - "line_end": 23, - "line_start": 14, - "score": 0.63, - "severity": "important", - "suggestion": "Change the base class of both exceptions to `CallError` with appropriate errno values:\n```python\nfrom middlewared.service.core import CallError # or wherever CallError is importable\nimport errno\n\nclass ZFSKeyAlreadyLoadedException(CallError):\n def __init__(self, path: str):\n super().__init__(f\"{path!r} key is already loaded\", errno=errno.EEXIST)\n\nclass ZFSNotEncryptedException(CallError):\n def __init__(self, path: str):\n super().__init__(f\"{path!r} is not encrypted\", errno=errno.ENOTSUP)\n```\nThis ensures that wherever these exceptions propagate \u2014 through `except Exception`, `except CallError`, or unhandled \u2014 they carry structured error information and are handled correctly by the middleware's error 
dispatch layer. Note: verify there are no circular import issues between `middlewared.plugins.zfs` and `middlewared.service`; if so, an intermediate base class in `zfs/exceptions.py` may be needed.", - "tags": [ - "exception-hierarchy", - "api-contract", - "architecture", - "error-propagation" - ], - "title": "Custom ZFS exceptions inherit from plain Exception instead of CallError, breaking structured error propagation across all callers" - }, - { - "active_multipliers": [], - "body": "`dataset_encryption_operations.py:200` calls `change_key(tls, id_, encryption_dict, key)` with no surrounding try/except. The `change_key` function in `zfs/encryption.py:87-88` can raise `ZFSNotEncryptedException` if `rsrc.crypto()` returns `None`.\n\nAlthough the `change_key` method does validate `ds['encrypted']` at line 134 via `verrors.add`, this is a **database/metadata check** \u2014 it does NOT prevent a race condition where the ZFS state diverges from the database (e.g. dataset was recreated between the query and the `change_key` call). 
If the ZFS layer reports the dataset as unencrypted but the DB still has it marked encrypted, `ZFSNotEncryptedException` will propagate all the way to the WebSocket API layer as an unhandled `Exception`, not a `CallError`.\n\nSimilarly, `change_encryption_root` at `dataset_encryption_operations.py:263` calls `change_encryption_root(tls, id_)` which also raises `ZFSNotEncryptedException` at `encryption.py:104-105` with no catch.", - "confidence": 0.82, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "exception-handling-contract", - "dimension_name": "Exception Handling Contract", - "evidence": "Step 1: `change_key` method in `dataset_encryption_operations.py:200` calls `change_key(tls, id_, encryption_dict, key)` with no try/except.\nStep 2: `change_key` in `zfs/encryption.py:86-88`: `rsrc = open_resource(tls, dataset); if (crypto := rsrc.crypto()) is None: raise ZFSNotEncryptedException(dataset)`.\nStep 3: `ZFSNotEncryptedException` inherits from `Exception` (confirmed at `exceptions.py:20`), NOT from `CallError`.\nStep 4: No catch exists between `encryption.py:88` and the WebSocket layer. The exception propagates as a raw `Exception`.\nStep 5: The WebSocket API layer expects `CallError` for user-facing error messages with structured errno codes. 
A raw `Exception` results in an unstructured 500-style error.\nSame path applies to `change_encryption_root` at `dataset_encryption_operations.py:263` calling `encryption.py:103-105`.", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "id": "f_006", - "line_end": 200, - "line_start": 200, - "score": 0.574, - "severity": "important", - "suggestion": "Wrap the `change_key` and `change_encryption_root` calls with try/except to convert `ZFSNotEncryptedException` (and `ZFSKeyAlreadyLoadedException` if applicable) into `CallError`:\n```python\nfrom middlewared.plugins.zfs.exceptions import ZFSNotEncryptedException\n\ntry:\n change_key(tls, id_, encryption_dict, key)\nexcept ZFSNotEncryptedException as e:\n raise CallError(str(e), errno=errno.ENOTSUP)\n```\nAlternatively, make `ZFSNotEncryptedException` a subclass of `CallError` with a fixed errno so it automatically presents correctly to all callers throughout the codebase.", - "tags": [ - "exception-handling", - "api-contract", - "race-condition", - "error-propagation" - ], - "title": "ZFSNotEncryptedException from change_key() propagates as raw Exception to WebSocket API layer \u2014 no CallError wrapping" - }, - { - "active_multipliers": [], - "body": "In the old `zfs.dataset.load_key` service method, all `libzfs.ZFSException` instances were caught and re-raised as `CallError`. In the new `encryption.py:load_key()`, the call to `crypto.load_key(**kwargs)` at line 34 is **not wrapped in any try/except**.\n\nAny `truenas_pylibzfs.ZFSException` raised by `crypto.load_key()` propagates directly out of `encryption.load_key()` back to its caller with:\n- A `.code` attribute (a `ZFSError` enum value)\n- **No `.errmsg`** or **`.errno`** fields in the `CallError` sense\n- No `CallError` wrapping\n\nFor the `unlock` call path in `dataset_encryption_lock.py`, this is handled correctly: `except ZFSException as e:` at line 223 catches these and processes `EZFS_CRYPTOFAILED` vs. 
other codes. So the only current caller handles it.\n\nHowever, the **API contract has silently changed**: any other present or future caller of `encryption.load_key()` that expects `CallError` (because the old `zfs.dataset.load_key` always raised `CallError`) will receive raw `ZFSException` instead. If such a caller reaches the WebSocket dispatch layer without intermediate handling, `websocket_app.py:196-207` catches the bare `Exception`, calls `adapt_exception(e)` (which only handles `subprocess.CalledProcessError` \u2014 not `ZFSException`), and falls back to `send_error(message, EINVAL, str(e))`, losing the original ZFS error code entirely and emitting a generic `EINVAL` to the client.", - "confidence": 0.8, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "error-handling-exception-flow", - "dimension_name": "Exception Handling and Error Flow", - "evidence": "Step 1: `encryption.py:load_key()` calls `crypto.load_key(**kwargs)` at line 34 with no surrounding try/except block.\nStep 2: `truenas_pylibzfs.ZFSException` is the exception type raised by `crypto.load_key()` on failure (e.g., wrong key \u2192 `EZFS_CRYPTOFAILED`).\nStep 3: `ZFSException` has a `.code` attribute (a `ZFSError` enum), but no `.errmsg` or `.errno` in the `CallError` sense.\nStep 4: The old service method `zfs.dataset.load_key` caught all `libzfs.ZFSException` and re-raised as `CallError` \u2014 all callers expected `CallError`.\nStep 5: A hypothetical new caller of `encryption.load_key()` that does not import `truenas_pylibzfs.ZFSException` and uses only `except CallError` will miss the exception.\nStep 6: That uncaught `ZFSException` reaches `websocket_app.py:196`, `adapt_exception(e)` returns `None` (only handles `CalledProcessError`), and `send_error(message, EINVAL, str(e))` emits an unstructured `EINVAL` response to the client.", - "file_path": "src/middlewared/middlewared/plugins/zfs/encryption.py", - "id": "f_008", - "line_end": 34, - "line_start": 34, - "score": 0.56, 
- "severity": "important", - "suggestion": "Either:\n1. **Document the contract explicitly** in `load_key()`'s docstring: state that it may raise `truenas_pylibzfs.ZFSException` directly (in addition to `ZFSNotEncryptedException` and `ZFSKeyAlreadyLoadedException`), so all callers know they must handle `ZFSException`.\n2. **Convert at the boundary**: wrap `crypto.load_key(**kwargs)` in a try/except that re-raises as a typed domain exception (e.g., add `ZFSInvalidKeyException` to `exceptions.py`), so `encryption.py` never leaks `truenas_pylibzfs` types to callers:\n```python\ntry:\n crypto.load_key(**kwargs)\nexcept ZFSException as e:\n if e.code == ZFSError.EZFS_CRYPTOFAILED:\n raise ZFSInvalidKeyException(dataset) from e\n raise\n```\nOption 2 is the cleaner design: it keeps `truenas_pylibzfs` as an internal implementation detail.", - "tags": [ - "api-contract", - "exception-propagation", - "error-handling", - "refactoring" - ], - "title": "Raw truenas_pylibzfs.ZFSException from crypto.load_key() propagates out of encryption.load_key() undecorated, breaking the old CallError contract for any caller outside unlock" - }, - { - "active_multipliers": [], - "body": "**The 3.7x increase from 350,000 to 1,300,000 PBKDF2 iterations is applied unconditionally with no runtime check for hardware capability. On low-power or embedded hardware, this can cause passphrase-based key derivation to exceed unlock timeouts, making encrypted datasets permanently inaccessible without manual CLI intervention.**\n\nOnce a passphrase-encrypted dataset is re-keyed with `pbkdf2iters=1300000` (whether explicitly or via the silent clamping in `from_previous`), every future unlock attempt runs PBKDF2-SHA256 with 1,300,000 iterations synchronously. 
On ARM SoCs and Atom-class CPUs common in consumer NAS hardware:\n- At 350,000 iters: typically ~0.5\u20131 second per dataset\n- At 1,300,000 iters: typically ~2\u20134 seconds per dataset\n\nFor pools with multiple passphrase-encrypted datasets that must all unlock at pool import (a common TrueNAS configuration), unlock times multiply linearly. If this occurs during boot under a systemd service timeout, or during HA failover under a failover timeout, the unlock will fail \u2014 and with `ge=1300000` enforced as the hard minimum, there is **no API path** to reduce the iteration count back down without using the ZFS CLI directly (`zfs change-key -o pbkdf2iters=...`).\n\nThe `change_key` plugin (`dataset_encryption_operations.py:118`) does not measure or estimate key derivation time before applying the new iteration count. Neither `PoolCreateEncryptionOptions` nor `PoolDatasetChangeKeyOptions` expose any per-hardware tuning path below the new minimum.\n\nNote: `PoolCreateEncryptionOptions.from_previous` in `pool.py:152` applies the same clamping on pool creation encryption options. 
For new pool creation this affects the root dataset's initial encryption setup, not just re-keying.", - "confidence": 0.75, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "pbkdf2iters-migration-behavior", - "dimension_name": "PBKDF2 Iteration Count Silent Migration", - "evidence": "Step 1: Passphrase-encrypted dataset is re-keyed to `pbkdf2iters=1300000` via `change_key` (either explicitly or via silent clamping from `from_previous`).\nStep 2: `dataset_encryption_operations.py:191` passes `pbkdf2iters: options['pbkdf2iters']` to `validate_encryption_data`.\nStep 3: `validate_encryption_data` line 114 includes `pbkdf2iters` in `opts` when `passphrase_key_format=True`.\nStep 4: `zfs/encryption.py::change_key()` line 89 calls `tls.lzh.resource_cryptography_config(**props)` with `pbkdf2iters=1300000`, permanently recording it as a ZFS dataset property.\nStep 5: On the next pool import or `pool.dataset.unlock`, ZFS runs PBKDF2-SHA256 with 1,300,000 iterations to derive the wrapping key from the passphrase.\nStep 6: On low-power hardware (e.g., Cortex-A53 at 1.4GHz, ~350k iters/sec for PBKDF2-SHA256), this takes ~3.7 seconds per dataset. With 5 passphrase datasets: ~18.5 seconds total.\nStep 7: If a systemd or HA failover timeout fires during this window, unlock fails; dataset remains locked.\nStep 8: The `ge=1300000` constraint on `PoolDatasetChangeKeyOptions` means there is no supported API path to reduce `pbkdf2iters` on an already-re-keyed dataset \u2014 only direct ZFS CLI access can recover.", - "file_path": "src/middlewared/middlewared/api/v26_0_0/pool.py", - "id": "f_013", - "line_end": 154, - "line_start": 151, - "score": 0.525, - "severity": "important", - "suggestion": "Consider the following mitigations: (1) **Benchmark gate:** Before applying `change_key` with a high `pbkdf2iters`, run a short PBKDF2 benchmark and warn or reject if estimated unlock time exceeds a configurable threshold. 
(2) **System-wide override:** Allow a `tunable` or system config option to set a lower `pbkdf2iters` floor for constrained hardware, overriding the API minimum for that installation. (3) **Recovery documentation:** Explicitly document that `zfs change-key -o pbkdf2iters=` is available as a recovery path if unlock times become prohibitive. (4) **Job warning:** At minimum, have the `change_key` job emit a progress message noting the effective iteration count when it exceeds the old default.", - "tags": [ - "encryption", - "availability", - "hardware", - "pbkdf2", - "timeout-risk", - "embedded" - ], - "title": "3.7x PBKDF2 iteration increase enforced with no hardware capability check; may cause passphrase unlock timeouts making datasets inaccessible" - }, - { - "active_multipliers": [], - "body": "`@pass_thread_local_storage` is a **marker-only decorator** \u2014 it sets `fn._pass_thread_local_storage = True` and returns `fn` unchanged (`decorators.py:221-222`). The actual `tls` injection happens only at API dispatch time: in `main.py:862-865` for normal methods and `job.py:620-621` for `@job` methods.\n\nWhen `sync_zfs_keys` calls `self.push_zfs_keys(tls, ids)` and `self.pull_zfs_keys(tls)` directly (lines 138 and 142), these are **plain Python method calls** \u2014 they bypass the middleware dispatch system entirely. The `_pass_thread_local_storage` attribute on `push_zfs_keys` and `pull_zfs_keys` has **no effect** on direct calls. Therefore, `tls` is supplied exactly once by the caller, and the functions receive it correctly.\n\nThe decorators on `push_zfs_keys`/`pull_zfs_keys` are intentional: they allow those methods to be called independently through the middleware dispatch system (e.g., `self.middleware.call_sync('kmip.push_zfs_keys', ...)`) with `tls` injected automatically. The `# type: ignore` comments are consistent with the decorator's type signature hiding `tls` from external callers.\n\n**No double-injection occurs. 
The code is correct for this pattern.**", - "confidence": 0.98, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "decorator_injection", - "dimension_name": "Decorator Double-Injection Analysis", - "evidence": "Step 1: `pass_thread_local_storage` in `service/decorators.py:209-222` sets `fn._pass_thread_local_storage = True` and returns `fn` unchanged \u2014 no wrapping, no injection at decoration time.\nStep 2: `main.py:862-865` \u2014 injection only occurs inside `_call_prepare`, which is invoked by the middleware dispatch system, not on direct Python calls.\nStep 3: `job.py:620-621` \u2014 same: injection only at job run time via `prepend.append(thread_local_storage)`.\nStep 4: `sync_zfs_keys` at lines 138/142 calls `self.push_zfs_keys(tls, ids)` directly \u2014 this is a plain Python attribute lookup and call, bypassing `_call_prepare` entirely.\nStep 5: `push_zfs_keys` receives `(self, tls, ids)` \u2014 one `tls` from the caller, zero injected by decorator. Correct.", - "file_path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "id": "f_000", - "line_end": 142, - "line_start": 138, - "score": 0.294, - "severity": "suggestion", - "suggestion": "No change needed for the decorator/injection pattern. The explicit `tls` passing at lines 138 and 142 is correct because these are direct Python method calls, not middleware dispatches.", - "tags": [ - "decorator", - "thread-local-storage", - "no-bug", - "call-convention" - ], - "title": "No double-injection bug: explicit tls passing is correct for direct calls" - }, - { - "active_multipliers": [], - "body": "The only integration test for `inherit_parent_encryption_properties` (`tests/api2/test_pool_dataset_encryption.py:404`) exercises the case where the parent's encryption root uses a **hex key** \u2014 so `parent_encrypted_root['key_format']['value'] == 'HEX'`. 
The guard evaluates to `False` in both old and new code, meaning this test provides **zero coverage** of the bug fix.\n\nThe case that was silently broken (passphrase-encrypted parent root + key-encrypted child encryption roots under `id_`) has never been tested. Now that the guard works correctly, there is a real behavioral difference: the operation **raises a `CallError`** instead of silently succeeding. Without a test for this path:\n\n1. There is no automated verification that the `CallError` message is correct.\n2. A future refactor could re-introduce the same type-comparison mistake and no test would catch it.\n3. The complementary allowed case \u2014 passphrase parent root, `id_` has *no* key-encrypted child roots \u2014 is also untested; verifying it proceeds successfully is equally important.\n\nThe guard itself (`any(d['name'] == d['encryption_root'] for d in self.middleware.call_sync('pool.dataset.query', [...]))`) is logically sound and the fix is correct, but the absence of test coverage for the enforced path is a gap worth closing.", - "confidence": 0.95, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "enum-comparison-guard", - "dimension_name": "Enum vs String Comparison Bug in Encryption Root Guard", - "evidence": "Only test reference: `tests/api2/test_pool_dataset_encryption.py:404`\n```python\ndef test_key_encrypted_dataset(self):\n # parent uses HEX key\n payload = {'name': dataset, 'encryption_options': {'key': dataset_token_hex}, ...}\n call('pool.dataset.create', payload)\n # child uses PASSPHRASE\n payload.update({'name': child_dataset, 'encryption_options': {'passphrase': passphrase}})\n call('pool.dataset.create', payload)\n # parent_encrypted_root is the HEX-keyed parent -> guard evaluates False in both old and new code\n call('pool.dataset.inherit_parent_encryption_properties', child_dataset)\n ds = call('pool.dataset.get_instance', child_dataset)\n assert ds['key_format']['value'] == 'HEX', ds\n```\nNo test exercises the 
path where `parent_encrypted_root['key_format']['value'] == 'PASSPHRASE'`.", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "id": "f_004", - "line_end": 261, - "line_start": 248, - "score": 0.285, - "severity": "suggestion", - "suggestion": "Add a test case in `tests/api2/test_pool_dataset_encryption.py` that:\n1. Creates a passphrase-encrypted dataset `P` as an encryption root.\n2. Creates `P/K` as a key-encrypted encryption root (child of P).\n3. Creates `P/K/KC` as a second key-encrypted encryption root (grandchild).\n4. Calls `pool.dataset.inherit_parent_encryption_properties('P/K')` and asserts a `ClientException` / `CallError` is raised containing the expected message.\n5. Also tests the allowed sub-case: `P/K` with no key-encrypted child roots successfully inherits from the passphrase root.", - "tags": [ - "test-coverage", - "encryption", - "guard", - "regression-risk" - ], - "title": "No test covers the newly-enforced rejection path (passphrase root + key-encrypted child roots)" - }, - { - "active_multipliers": [], - "body": "The review prompt raised a concern that if `@pass_thread_local_storage` wraps the `@job`-decorated function, the lock lambda might see `(tls, name)` instead of `(name,)`.\n\nThis concern does **not** apply. Both decorators are pure markers:\n\n```python\n# decorators.py:153-166\ndef check_job(fn):\n fn._job = {'lock': lock, ...}\n return fn # fn is returned unchanged\n\n# decorators.py:221-222\nfn._pass_thread_local_storage = True\nreturn fn # fn is returned unchanged\n```\n\nNeither decorator wraps the function \u2014 they only set attributes. The `tls` object is injected at job run time in `job.py:620-621` inside `Job.__run_body`, well after `get_lock_name()` has already evaluated the lock lambda at queue time. 
The `Job` object is constructed with `params` (raw caller args), and that is what the lambda sees \u2014 never `tls`.\n\nThe actual decorator stacking requirement is documented in `api/base/decorator.py:53-59`: `@job` must be the innermost (bottommost) decorator, and the current ordering is correct.", - "confidence": 0.97, - "diff_line": null, - "diff_side": "RIGHT", - "dimension_id": "decorator-order-lock-key", - "dimension_name": "Decorator Order and Lock Key Correctness", - "evidence": "Step 1: `@pass_thread_local_storage` at `decorators.py:209-222` sets `fn._pass_thread_local_storage = True` and returns `fn` \u2014 no wrapping.\nStep 2: `@job` at `decorators.py:153-166` sets `fn._job = {...}` and returns `fn` \u2014 no wrapping.\nStep 3: `_call_prepare` at `main.py:880` constructs `Job(..., params, job_options, ...)` where `params` is the raw caller args \u2014 `tls` is NOT in this list.\nStep 4: `tls` injection for jobs occurs in `job.py:620-621` inside `Job.__run_body`, which runs after the job has been queued and the lock key has already been computed.\nStep 5: `get_lock_name` at `job.py:422` calls `lock_name(self.args)` where `self.args = params` \u2014 the lambda never sees `tls`.", - "file_path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "id": "f_010", - "line_end": 162, - "line_start": 158, - "score": 0.097, - "severity": "nitpick", - "suggestion": "No code change needed for this specific concern. 
The decorator order is correct and `tls` is never present in the lock lambda's argument list.", - "tags": [ - "decorator-order", - "false-positive-cleared", - "tls", - "locking" - ], - "title": "Original `tls`-injection concern is a false alarm: decorator order is correct and `tls` is never visible to the lock lambda" - } - ], - "metadata": { - "agent_invocations": 11, - "anatomy": { - "blast_radius": [], - "clusters": [ - { - "description": "", - "files": [ - "" - ], - "id": "cluster_0", - "name": "root", - "primary_language": "" - }, - { - "description": "", - "files": [ - "src/middlewared/middlewared/api/v26_0_0/pool.py", - "src/middlewared/middlewared/api/v26_0_0/pool_dataset.py" - ], - "id": "cluster_1", - "name": "src/middlewared/middlewared/api/v26_0_0", - "primary_language": "python" - }, - { - "description": "", - "files": [ - "src/middlewared/middlewared/plugins/kmip/zfs_keys.py" - ], - "id": "cluster_2", - "name": "src/middlewared/middlewared/plugins/kmip", - "primary_language": "python" - }, - { - "description": "", - "files": [ - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py" - ], - "id": "cluster_3", - "name": "src/middlewared/middlewared/plugins/pool_", - "primary_language": "python" - }, - { - "description": "", - "files": [ - "src/middlewared/middlewared/plugins/zfs/encryption.py", - "src/middlewared/middlewared/plugins/zfs/exceptions.py" - ], - "id": "cluster_4", - "name": "src/middlewared/middlewared/plugins/zfs", - "primary_language": "python" - } - ], - "context_notes": "The removed file `src/middlewared/middlewared/plugins/zfs_/dataset_encryption.py` used `process_pool = True`, meaning every call to `zfs.dataset.*` previously serialized through a subprocess via the process pool mechanism. 
The new code runs synchronously in the middleware's main worker threads, sharing the thread-local `tls.lzh` handle managed by `@pass_thread_local_storage`. This is architecturally consistent with the broader truenas_pylibzfs migration effort visible in other modules (load_unload_impl.py, resource_crud.py, etc.). The `truenas_pylibzfs` dependency (PR #145) must provide: `ZFSResource.crypto()` returning an optional `ZFSResourceCryptography` object; `ZFSResourceCryptography.info()` returning an object with `key_is_loaded: bool`; `ZFSResourceCryptography.load_key(**kwargs)`, `.check_key(**kwargs) -> bool`, `.change_key(info)`, and `.inherit_key()`; and `ZFSLibHandle.resource_cryptography_config(**props)` returning a config object. None of these are visible in this repository \u2014 the PR is incomplete without that upstream merge.", - "dependency_graph": {}, - "files": [ - { - "hunks": [ - { - "content": " key.\"\"\"\n generate_key: bool = False\n \"\"\"Automatically generate the key to be used for dataset encryption.\"\"\"\n- pbkdf2iters: int = Field(ge=100000, default=350000)\n+ pbkdf2iters: int = Field(ge=1300000, default=1300000)\n \"\"\"Number of PBKDF2 iterations for key derivation from passphrase. Higher iterations improve security \\\n- against brute force attacks but increase unlock time. 
Default 350,000 balances security and performance.\"\"\"\n+ against brute force attacks but increase unlock time.\"\"\"\n algorithm: Literal[\n \"AES-128-CCM\", \"AES-192-CCM\", \"AES-256-CCM\", \"AES-128-GCM\", \"AES-192-GCM\", \"AES-256-GCM\"\n ] = \"AES-256-GCM\"", - "header": "@@ -136,9 +136,9 @@ class PoolCreateEncryptionOptions(BaseModel):", - "new_count": 9, - "new_start": 136, - "old_count": 9, - "old_start": 136 - }, - { - "content": " key: Secret[Annotated[str, Field(min_length=64, max_length=64)] | None] = None\n \"\"\"A hex-encoded key specified as an alternative to using `passphrase`.\"\"\"\n \n+ @classmethod\n+ def from_previous(cls, value):\n+ value['pbkdf2iters'] = max(1300000, value['pbkdf2iters'])\n+ return value\n+\n \n class PoolCreateTopologyVdevDRAID(BaseModel):\n type: Literal[\"DRAID1\", \"DRAID2\", \"DRAID3\"]", - "header": "@@ -148,6 +148,11 @@ class PoolCreateEncryptionOptions(BaseModel):", - "new_count": 11, - "new_start": 148, - "old_count": 6, - "old_start": 148 - } - ], - "language": "python", - "lines_added": 7, - "lines_removed": 2, - "path": "src/middlewared/middlewared/api/v26_0_0/pool.py", - "status": "modified" - }, - { - "hunks": [ - { - "content": " \"\"\"Generate a new random encryption key instead of using a provided key or passphrase.\"\"\"\n key_file: bool = False\n \"\"\"Whether the provided key is from a key file rather than entered directly.\"\"\"\n- pbkdf2iters: int = Field(default=350000, ge=100000)\n+ pbkdf2iters: int = Field(default=1300000, ge=1300000)\n \"\"\"Number of PBKDF2 iterations for passphrase-based keys. Higher values improve security against \\\n- brute force attacks but increase unlock time. 
Default 350,000 balances security and performance.\"\"\"\n+ brute force attacks but increase unlock time.\"\"\"\n passphrase: Secret[NonEmptyString | None] = None\n \"\"\"Passphrase to use for encryption key derivation.\"\"\"\n key: Secret[Annotated[str, Field(min_length=64, max_length=64)] | None] = None\n \"\"\"Raw hex-encoded encryption key.\"\"\"\n \n+ @classmethod\n+ def from_previous(cls, value):\n+ value['pbkdf2iters'] = max(1300000, value['pbkdf2iters'])\n+ return value\n+\n \n class PoolDatasetCreateUserProperty(BaseModel):\n key: Annotated[str, Field(examples=[\"custom:backup_policy\", \"org:created_by\"], pattern=\".*:.*\")]", - "header": "@@ -172,14 +172,19 @@ class PoolDatasetChangeKeyOptions(BaseModel):", - "new_count": 19, - "new_start": 172, - "old_count": 14, - "old_start": 172 - } - ], - "language": "python", - "lines_added": 7, - "lines_removed": 2, - "path": "src/middlewared/middlewared/api/v26_0_0/pool_dataset.py", - "status": "modified" - }, - { - "hunks": [ - { - "content": " # See the file LICENSE.IX for complete terms and conditions\n \n from middlewared.api.current import ZFSResourceQuery\n+from middlewared.plugins.zfs.encryption import check_key\n from middlewared.service import job, private, Service\n+from middlewared.service.decorators import pass_thread_local_storage\n \n from .connection import KMIPServerMixin\n ", - "header": "@@ -4,7 +4,9 @@", - "new_count": 9, - "new_start": 4, - "old_count": 7, - "old_start": 4 - }, - { - "content": " return rv\n \n @private\n- def push_zfs_keys(self, ids=None):\n+ @pass_thread_local_storage\n+ def push_zfs_keys(self, tls, ids=None):\n failed = []\n filters = [] if ids is None else [['id', 'in', ids]]\n existing_datasets = self.get_encrypted_datasets(filters)", - "header": "@@ -50,7 +52,8 @@ def get_encrypted_datasets(self, filters):", - "new_count": 8, - "new_start": 52, - "old_count": 7, - "old_start": 50 - }, - { - "content": " if not ds['encryption_key']:\n # We want to make sure we have the 
KMIP server's keys and in-memory keys in sync\n try:\n- if ds['name'] in self.zfs_keys and self.middleware.call_sync(\n- 'zfs.dataset.check_key', ds['name'], {'key': self.zfs_keys[ds['name']]}\n+ if (\n+ ds['name'] in self.zfs_keys\n+ and check_key(tls, ds['name'], key=self.zfs_keys[ds['name']])\n ):\n continue\n else:", - "header": "@@ -59,8 +62,9 @@ def push_zfs_keys(self, ids=None):", - "new_count": 9, - "new_start": 62, - "old_count": 8, - "old_start": 59 - }, - { - "content": " return failed\n \n @private\n- def pull_zfs_keys(self):\n+ @pass_thread_local_storage\n+ def pull_zfs_keys(self, tls):\n existing_datasets = self.get_encrypted_datasets([['kmip_uid', '!=', None]])\n failed = []\n connection_successful = self.middleware.call_sync('kmip.test_connection')", - "header": "@@ -91,7 +95,8 @@ def push_zfs_keys(self, ids=None):", - "new_count": 8, - "new_start": 95, - "old_count": 7, - "old_start": 91 - }, - { - "content": " try:\n if ds['encryption_key']:\n key = ds['encryption_key']\n- elif ds['name'] in self.zfs_keys and self.middleware.call_sync(\n- 'zfs.dataset.check_key', ds['name'], {'key': self.zfs_keys[ds['name']]}\n+ elif (\n+ ds['name'] in self.zfs_keys\n+ and check_key(tls, ds['name'], key=self.zfs_keys[ds['name']])\n ):\n key = self.zfs_keys[ds['name']]\n elif connection_successful:", - "header": "@@ -99,8 +104,9 @@ def pull_zfs_keys(self):", - "new_count": 9, - "new_start": 104, - "old_count": 8, - "old_start": 99 - }, - { - "content": " return failed\n \n @private\n+ @pass_thread_local_storage\n @job(lock=lambda args: f'kmip_sync_zfs_keys_{args}')\n- def sync_zfs_keys(self, job, ids=None):\n+ def sync_zfs_keys(self, job, tls, ids=None):\n if not self.middleware.call_sync('kmip.zfs_keys_pending_sync'):\n return\n config = self.middleware.call_sync('kmip.config')\n conn_successful = self.middleware.call_sync('kmip.test_connection', None, True)\n if config['enabled'] and config['manage_zfs_keys']:\n if conn_successful:\n- failed = 
self.push_zfs_keys(ids)\n+ failed = self.push_zfs_keys(tls, ids) # type: ignore\n else:\n return\n else:\n- failed = self.pull_zfs_keys()\n+ failed = self.pull_zfs_keys(tls) # type: ignore\n if failed:\n self.middleware.call_sync(\n 'alert.oneshot_create', 'KMIPZFSDatasetsSyncFailure', {'datasets': ','.join(failed)}", - "header": "@@ -120,19 +126,20 @@ def pull_zfs_keys(self):", - "new_count": 20, - "new_start": 126, - "old_count": 19, - "old_start": 120 - } - ], - "language": "python", - "lines_added": 16, - "lines_removed": 9, - "path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "status": "modified" - }, - { - "hunks": [ - { - "content": " from middlewared.service.decorators import pass_thread_local_storage\n from middlewared.utils.filter_list import filter_list\n from middlewared.plugins.pool_.utils import get_dataset_parents\n+from middlewared.plugins.zfs.encryption import check_key\n \n from .utils import DATASET_DATABASE_MODEL_NAME, dataset_can_be_mounted, retrieve_keys_from_file, ZFSKeyFormat\n ", - "header": "@@ -18,6 +18,7 @@", - "new_count": 7, - "new_start": 18, - "old_count": 6, - "old_start": 18 - }, - { - "content": " namespace = 'pool.dataset'\n \n @api_method(PoolDatasetEncryptionSummaryArgs, PoolDatasetEncryptionSummaryResult, roles=['DATASET_READ'])\n+ @pass_thread_local_storage\n @job(lock=lambda args: f'encryption_summary_options_{args[0]}', pipes=['input'], check_pipes=False)\n- def encryption_summary(self, job, id_, options):\n+ def encryption_summary(self, job, tls, id_, options):\n \"\"\"\n Retrieve summary of all encrypted roots under `id`.\n ", - "header": "@@ -28,8 +29,9 @@ class Config:", - "new_count": 9, - "new_start": 29, - "old_count": 8, - "old_start": 28 - }, - { - "content": " verrors.check()\n datasets = self.query_encrypted_datasets(id_, {'all': True})\n \n- to_check = []\n+ results = []\n for name, ds in datasets.items():\n ds_key = keys_supplied.get(name, {}).get('key') or ds['encryption_key']\n if 
ZFSKeyFormat(ds['key_format']['value']) == ZFSKeyFormat.RAW and ds_key:\n with contextlib.suppress(ValueError):\n ds_key = bytes.fromhex(ds_key)\n- to_check.append((name, {'key': ds_key}))\n \n- check_job = self.middleware.call_sync('zfs.dataset.bulk_process', 'check_key', to_check)\n- check_job.wait_sync()\n- if check_job.error:\n- raise CallError(f'Failed to retrieve encryption summary for {id_}: {check_job.error}')\n+ try:\n+ valid_key = check_key(tls, name, key=ds_key)\n+ except Exception:\n+ valid_key = False\n \n- results = []\n- for ds_data, status in zip(to_check, check_job.result):\n- ds_name = ds_data[0]\n- data = datasets[ds_name]\n results.append({\n- 'name': ds_name,\n- 'key_format': ZFSKeyFormat(data['key_format']['value']).value,\n- 'key_present_in_database': bool(data['encryption_key']),\n- 'valid_key': bool(status['result']), 'locked': data['locked'],\n+ 'name': name,\n+ 'key_format': ZFSKeyFormat(ds['key_format']['value']).value,\n+ 'key_present_in_database': bool(ds['encryption_key']),\n+ 'valid_key': valid_key,\n+ 'locked': ds['locked'],\n 'unlock_error': None,\n 'unlock_successful': False,\n })\n \n failed = set()\n for ds in sorted(results, key=lambda d: d['name'].count('/')):\n- for i in range(1, ds['name'].count('/') + 1):\n- check = ds['name'].rsplit('/', i)[0]\n+ ds_name = ds['name']\n+ for i in range(1, ds_name.count('/') + 1):\n+ check = ds_name.rsplit('/', i)[0]\n if check in failed:\n- failed.add(ds['name'])\n+ failed.add(ds_name)\n ds['unlock_error'] = f'Child cannot be unlocked when parent \"{check}\" is locked'\n \n- if ds['locked'] and not options['force'] and not keys_supplied.get(ds['name'], {}).get('force'):\n- err = dataset_can_be_mounted(ds['name'], os.path.join('/mnt', ds['name']))\n+ ds_locked = ds['locked']\n+ if ds_locked and not options['force'] and not keys_supplied.get(ds_name, {}).get('force'):\n+ err = dataset_can_be_mounted(ds_name, os.path.join('/mnt', ds_name))\n if ds['unlock_error'] and err:\n ds['unlock_error'] 
+= f' and {err}'\n elif err:", - "header": "@@ -94,42 +96,40 @@ def encryption_summary(self, job, id_, options):", - "new_count": 40, - "new_start": 96, - "old_count": 42, - "old_start": 94 - }, - { - "content": " \n if ds['valid_key']:\n ds['unlock_successful'] = not bool(ds['unlock_error'])\n- elif not ds['locked']:\n+ elif not ds_locked:\n # For datasets which are already not locked, unlock operation for them\n # will succeed as they are not locked\n ds['unlock_successful'] = True\n else:\n- key_provided = ds['name'] in keys_supplied or ds['key_present_in_database']\n+ key_provided = ds_name in keys_supplied or ds['key_present_in_database']\n if key_provided:\n if ds['unlock_error']:\n- if ds['name'] in keys_supplied or ds['key_present_in_database']:\n+ if ds_name in keys_supplied or ds['key_present_in_database']:\n ds['unlock_error'] += ' and provided key is invalid'\n else:\n ds['unlock_error'] = 'Provided key is invalid'\n elif not ds['unlock_error']:\n ds['unlock_error'] = 'Key not provided'\n- failed.add(ds['name'])\n+ failed.add(ds_name)\n \n return results\n \n @periodic(86400)\n @private\n+ @pass_thread_local_storage\n @job(lock=lambda args: f'sync_encrypted_pool_dataset_keys_{args}')\n- def sync_db_keys(self, job, name=None):\n+ def sync_db_keys(self, job, tls, name=None):\n if not self.middleware.call_sync('failover.is_single_master_node'):\n # We don't want to do this for passive controller\n return", - "header": "@@ -137,28 +137,29 @@ def encryption_summary(self, job, id_, options):", - "new_count": 29, - "new_start": 137, - "old_count": 28, - "old_start": 137 - }, - { - "content": " # It is possible we have a pool configured but for some mistake/reason the pool did not import like\n # during repair disks were not plugged in and system was booted, in such cases we would like to not\n # remove the encryption keys from the database.\n- for root_ds in {pool['name'] for pool in self.middleware.call_sync('pool.query')} - {\n- ds['id'] for ds in 
self.middleware.call_sync(\n+ pool_names = {pool['name'] for pool in self.middleware.call_sync('pool.query')}\n+ ds_names = {\n+ ds['id']\n+ for ds in self.middleware.call_sync(\n 'pool.dataset.query', [], {'extra': {'retrieve_children': False, 'properties': []}}\n )\n- }:\n+ }\n+ for root_ds in pool_names - ds_names:\n filters.extend([['name', '!=', root_ds], ['name', '!^', f'{root_ds}/']])\n \n db_datasets = self.query_encrypted_roots_keys(filters)\n encrypted_roots = {\n- d['name']: d for d in self.middleware.call_sync(\n- 'pool.dataset.query', filters, {'extra': {'properties': ['encryptionroot']}}\n- ) if d['name'] == d['encryption_root']\n+ d['name']: d\n+ for d in self.middleware.call_sync(\n+ 'pool.dataset.query',\n+ filters,\n+ {'extra': {'properties': ['encryptionroot', 'keyformat']}}\n+ )\n+ if d['name'] == d['encryption_root']\n }\n+\n to_remove = []\n- check_key_job = self.middleware.call_sync('zfs.dataset.bulk_process', 'check_key', [\n- (name, {'key': db_datasets[name]}) for name in db_datasets\n- ])\n- check_key_job.wait_sync()\n- if check_key_job.error:\n- self.logger.error(f'Failed to sync database keys: {check_key_job.error}')\n+ try:\n+ for ds_name, key in db_datasets.items():\n+ ds = encrypted_roots.get(ds_name)\n+ if ds and ZFSKeyFormat(ds['key_format']['value']) == ZFSKeyFormat.RAW and key:\n+ with contextlib.suppress(ValueError):\n+ key = bytes.fromhex(key)\n+\n+ try:\n+ should_remove = not check_key(tls, ds_name, key=key)\n+ except Exception:\n+ should_remove = True\n+\n+ if should_remove:\n+ to_remove.append(ds_name)\n+\n+ except Exception as exc:\n+ self.logger.error(f'Failed to sync database keys: {exc}')\n return\n \n- for dataset, status in zip(db_datasets, check_key_job.result):\n- if not status['result']:\n- to_remove.append(dataset)\n- elif status['error']:\n- if dataset not in encrypted_roots:\n- to_remove.append(dataset)\n- else:\n- self.logger.error(f'Failed to check encryption status for {dataset}: {status[\"error\"]}')\n-\n 
self.middleware.call_sync('pool.dataset.delete_encrypted_datasets_from_db', [['name', 'in', to_remove]])\n \n @private", - "header": "@@ -167,37 +168,47 @@ def sync_db_keys(self, job, name=None):", - "new_count": 47, - "new_start": 168, - "old_count": 37, - "old_start": 167 - } - ], - "language": "python", - "lines_added": 57, - "lines_removed": 46, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "status": "modified" - }, - { - "hunks": [ - { - "content": " from datetime import datetime\n from pathlib import Path\n \n+from truenas_pylibzfs import ZFSError, ZFSException\n+\n from middlewared.api import api_method\n from middlewared.api.current import (\n PoolDatasetLockArgs, PoolDatasetLockResult, PoolDatasetUnlockArgs, PoolDatasetUnlockResult\n )\n+from middlewared.plugins.zfs.encryption import load_key\n from middlewared.service import CallError, job, private, Service, ValidationErrors\n+from middlewared.service.decorators import pass_thread_local_storage\n from middlewared.utils.filesystem.directory import directory_is_empty\n \n from .utils import (", - "header": "@@ -6,11 +6,15 @@", - "new_count": 15, - "new_start": 6, - "old_count": 11, - "old_start": 6 - }, - { - "content": " return True\n \n @api_method(PoolDatasetUnlockArgs, PoolDatasetUnlockResult, roles=['DATASET_WRITE'])\n+ @pass_thread_local_storage\n @job(lock=lambda args: f'dataset_unlock_{args[0]}', pipes=['input'], check_pipes=False)\n- def unlock(self, job, id_, options):\n+ def unlock(self, job, tls, id_, options):\n \"\"\"\n Unlock dataset `id` (and its children if `unlock_options.recursive` is `true`).\n ", - "header": "@@ -85,8 +89,9 @@ async def lock(self, job, id_, options):", - "new_count": 9, - "new_start": 89, - "old_count": 8, - "old_start": 85 - }, - { - "content": " \n job.set_progress(int(name_i / len(names) * 90 + 0.5), f'Unlocking {name!r}')\n try:\n- self.middleware.call_sync(\n- 'zfs.dataset.load_key', name, {'key': datasets[name]['key'], 
'mount': False}\n- )\n- except CallError as e:\n- failed[name]['error'] = 'Invalid Key' if 'incorrect key provided' in str(e).lower() else str(e)\n+ load_key(tls, name, key=datasets[name]['key'])\n+ except ZFSException as e:\n+ if e.code == ZFSError.EZFS_CRYPTOFAILED:\n+ failed[name]['error'] = 'Invalid Key'\n+ else:\n+ failed[name]['error'] = str(e)\n+ continue\n+ except Exception as e:\n+ failed[name]['error'] = str(e)\n continue\n \n # Before we mount the dataset in question, we should ensure that the path where it will be mounted", - "header": "@@ -214,11 +219,15 @@ def unlock(self, job, id_, options):", - "new_count": 15, - "new_start": 219, - "old_count": 11, - "old_start": 214 - } - ], - "language": "python", - "lines_added": 15, - "lines_removed": 6, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "status": "modified" - }, - { - "hunks": [ - { - "content": " PoolDatasetChangeKeyArgs, PoolDatasetChangeKeyResult, PoolDatasetInheritParentEncryptionPropertiesArgs,\n PoolDatasetInheritParentEncryptionPropertiesResult\n )\n+from middlewared.plugins.zfs.encryption import change_encryption_root, change_key\n from middlewared.service import CallError, job, private, Service, ValidationErrors\n+from middlewared.service.decorators import pass_thread_local_storage\n from middlewared.utils import secrets\n \n from .utils import DATASET_DATABASE_MODEL_NAME, ZFSKeyFormat", - "header": "@@ -4,7 +4,9 @@", - "new_count": 9, - "new_start": 4, - "old_count": 7, - "old_start": 4 - }, - { - "content": " PoolDatasetInsertOrUpdateEncryptedRecordResult,\n roles=['DATASET_WRITE']\n )\n- async def insert_or_update_encrypted_record(self, data):\n+ def insert_or_update_encrypted_record(self, data):\n key_format = data.pop('key_format') or ZFSKeyFormat.PASSPHRASE.value\n if not data['encryption_key'] or ZFSKeyFormat(key_format.upper()) == ZFSKeyFormat.PASSPHRASE:\n # We do not want to save passphrase keys - they are only known to the user\n return\n \n 
ds_id = data.pop('id')\n- ds = await self.middleware.call(\n+ ds = self.middleware.call_sync(\n 'datastore.query', DATASET_DATABASE_MODEL_NAME,\n [['id', '=', ds_id]] if ds_id else [['name', '=', data['name']]]\n )", - "header": "@@ -21,14 +23,14 @@ class Config:", - "new_count": 14, - "new_start": 23, - "old_count": 14, - "old_start": 21 - }, - { - "content": " \n pk = ds[0]['id'] if ds else None\n if ds:\n- await self.middleware.call(\n+ self.middleware.call_sync(\n 'datastore.update',\n DATASET_DATABASE_MODEL_NAME,\n ds[0]['id'], data\n )\n else:\n- pk = await self.middleware.call(\n+ pk = self.middleware.call_sync(\n 'datastore.insert',\n DATASET_DATABASE_MODEL_NAME,\n data\n )\n \n- kmip_config = await self.middleware.call('kmip.config')\n+ kmip_config = self.middleware.call_sync('kmip.config')\n if kmip_config['enabled'] and kmip_config['manage_zfs_keys']:\n- await self.middleware.call('kmip.sync_zfs_keys', [pk])\n+ self.middleware.call_sync('kmip.sync_zfs_keys', [pk])\n \n return pk\n ", - "header": "@@ -37,21 +39,21 @@ async def insert_or_update_encrypted_record(self, data):", - "new_count": 21, - "new_start": 39, - "old_count": 21, - "old_start": 37 - }, - { - "content": " return opts\n \n @api_method(PoolDatasetChangeKeyArgs, PoolDatasetChangeKeyResult, roles=['DATASET_WRITE'])\n+ @pass_thread_local_storage\n @job(lock=lambda args: f'dataset_change_key_{args[0]}', pipes=['input'], check_pipes=False)\n- async def change_key(self, job, id_, options):\n+ def change_key(self, job, tls, id_, options):\n \"\"\"\n Change encryption properties for `id` encrypted dataset.\n ", - "header": "@@ -114,8 +116,9 @@ def validate_encryption_data(self, job, verrors, encryption_dict, schema):", - "new_count": 9, - "new_start": 116, - "old_count": 8, - "old_start": 114 - }, - { - "content": " 1) It has encrypted roots as children which are encrypted with a key\n 2) If it is a root dataset where the system dataset is located\n \"\"\"\n- ds = await 
self.middleware.call('pool.dataset.get_instance_quick', id_, {\n+ ds = self.middleware.call_sync('pool.dataset.get_instance_quick', id_, {\n 'encryption': True,\n })\n verrors = ValidationErrors()", - "header": "@@ -124,7 +127,7 @@ async def change_key(self, job, id_, options):", - "new_count": 7, - "new_start": 127, - "old_count": 7, - "old_start": 124 - }, - { - "content": " )\n elif any(\n d['name'] == d['encryption_root']\n- for d in await self.middleware.call(\n+ for d in self.middleware.call_sync(\n 'pool.dataset.query', [\n ['id', '^', f'{id_}/'], ['encrypted', '=', True],\n ['key_format.value', '!=', ZFSKeyFormat.PASSPHRASE.value]", - "header": "@@ -142,7 +145,7 @@ async def change_key(self, job, id_, options):", - "new_count": 7, - "new_start": 145, - "old_count": 7, - "old_start": 142 - }, - { - "content": " f'{id_} has children which are encrypted with a key. It is not allowed to have encrypted '\n 'roots which are encrypted with a key as children for passphrase encrypted datasets.'\n )\n- elif id_ == (await self.middleware.call('systemdataset.config'))['pool']:\n+ elif id_ == self.middleware.call_sync('systemdataset.config')['pool']:\n verrors.add(\n 'id',\n f'{id_} contains the system dataset. 
Please move the system dataset to a '", - "header": "@@ -154,7 +157,7 @@ async def change_key(self, job, id_, options):", - "new_count": 7, - "new_start": 157, - "old_count": 7, - "old_start": 154 - }, - { - "content": " f'change_key_options.{k}',\n 'Either Key or passphrase must be provided.'\n )\n- elif id_.count('/') and await self.middleware.call(\n+ elif id_.count('/') and self.middleware.call_sync(\n 'pool.dataset.query', [\n ['id', 'in', [id_.rsplit('/', i)[0] for i in range(1, id_.count('/') + 1)]],\n ['key_format.value', '=', ZFSKeyFormat.PASSPHRASE.value], ['encrypted', '=', True]", - "header": "@@ -167,7 +170,7 @@ async def change_key(self, job, id_, options):", - "new_count": 7, - "new_start": 170, - "old_count": 7, - "old_start": 167 - }, - { - "content": " \n verrors.check()\n \n- encryption_dict = await self.middleware.call(\n+ encryption_dict = self.middleware.call_sync(\n 'pool.dataset.validate_encryption_data', job, verrors, {\n 'enabled': True, 'passphrase': options['passphrase'],\n 'generate_key': options['generate_key'], 'key_file': options['key_file'],", - "header": "@@ -181,7 +184,7 @@ async def change_key(self, job, id_, options):", - "new_count": 7, - "new_start": 184, - "old_count": 7, - "old_start": 181 - }, - { - "content": " encryption_dict.pop('encryption')\n key = encryption_dict.pop('key')\n \n- await self.middleware.call(\n- 'zfs.dataset.change_key', id_, {\n- 'encryption_properties': encryption_dict,\n- 'key': key, 'load_key': False,\n- }\n- )\n+ change_key(tls, id_, encryption_dict, key)\n \n # TODO: Handle renames of datasets appropriately wrt encryption roots and db - this will be done when\n # devd changes are in from the OS end\n data = {'encryption_key': key, 'key_format': 'PASSPHRASE' if options['passphrase'] else 'HEX', 'name': id_}\n- await self.insert_or_update_encrypted_record(data)\n+ self.insert_or_update_encrypted_record(data)\n if options['passphrase'] and ZFSKeyFormat(ds['key_format']['value']) != 
ZFSKeyFormat.PASSPHRASE:\n- await self.middleware.call('pool.dataset.sync_db_keys', id_)\n+ self.middleware.call_sync('pool.dataset.sync_db_keys', id_)\n \n data['old_key_format'] = ds['key_format']['value']\n- await self.middleware.call_hook('dataset.change_key', data)\n+ self.middleware.call_hook_sync('dataset.change_key', data)\n \n @api_method(\n PoolDatasetInheritParentEncryptionPropertiesArgs,\n PoolDatasetInheritParentEncryptionPropertiesResult,\n roles=['DATASET_WRITE']\n )\n- async def inherit_parent_encryption_properties(self, id_):\n+ @pass_thread_local_storage\n+ def inherit_parent_encryption_properties(self, tls, id_):\n \"\"\"\n Allows inheriting parent's encryption root discarding its current encryption settings. This\n can only be done where `id` has an encrypted parent and `id` itself is an encryption root.\n \"\"\"\n- ds = await self.middleware.call('pool.dataset.get_instance_quick', id_, {\n+ ds = self.middleware.call_sync('pool.dataset.get_instance_quick', id_, {\n 'encryption': True,\n })\n if not ds['encrypted']:", - "header": "@@ -194,34 +197,30 @@ async def change_key(self, job, id_, options):", - "new_count": 30, - "new_start": 197, - "old_count": 34, - "old_start": 194 - }, - { - "content": " elif '/' not in id_:\n raise CallError('Root datasets do not have a parent and cannot inherit encryption settings')\n else:\n- parent = await self.middleware.call(\n+ parent = self.middleware.call_sync(\n 'pool.dataset.get_instance_quick', id_.rsplit('/', 1)[0], {\n 'encryption': True,\n }", - "header": "@@ -233,7 +232,7 @@ async def inherit_parent_encryption_properties(self, id_):", - "new_count": 7, - "new_start": 232, - "old_count": 7, - "old_start": 233 - }, - { - "content": " if not parent['encrypted']:\n raise CallError('This operation requires the parent dataset to be encrypted')\n else:\n- parent_encrypted_root = await self.middleware.call(\n+ parent_encrypted_root = self.middleware.call_sync(\n 'pool.dataset.get_instance_quick', 
parent['encryption_root'], {\n 'encryption': True,\n }\n )\n- if ZFSKeyFormat(parent_encrypted_root['key_format']['value']) == ZFSKeyFormat.PASSPHRASE.value:\n+ if parent_encrypted_root['key_format']['value'] == ZFSKeyFormat.PASSPHRASE.value:\n if any(\n d['name'] == d['encryption_root']\n- for d in await self.middleware.call(\n+ for d in self.middleware.call_sync(\n 'pool.dataset.query', [\n ['id', '^', f'{id_}/'], ['encrypted', '=', True],\n ['key_format.value', '!=', ZFSKeyFormat.PASSPHRASE.value]", - "header": "@@ -241,15 +240,15 @@ async def inherit_parent_encryption_properties(self, id_):", - "new_count": 15, - "new_start": 240, - "old_count": 15, - "old_start": 241 - }, - { - "content": " 'roots which are encrypted with a key as children for passphrase encrypted datasets.'\n )\n \n- await self.middleware.call('zfs.dataset.change_encryption_root', id_, {'load_key': False})\n- await self.middleware.call('pool.dataset.sync_db_keys', id_)\n- await self.middleware.call_hook('dataset.inherit_parent_encryption_root', id_)\n+ change_encryption_root(tls, id_)\n+ self.middleware.call_sync('pool.dataset.sync_db_keys', id_)\n+ self.middleware.call_hook_sync('dataset.inherit_parent_encryption_root', id_)", - "header": "@@ -261,6 +260,6 @@ async def inherit_parent_encryption_properties(self, id_):", - "new_count": 6, - "new_start": 260, - "old_count": 6, - "old_start": 261 - } - ], - "language": "python", - "lines_added": 29, - "lines_removed": 30, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "status": "modified" - }, - { - "hunks": [ - { - "content": "+import threading\n+from typing import Literal, TypedDict, cast\n+\n+from .exceptions import ZFSKeyAlreadyLoadedException, ZFSNotEncryptedException\n+from .utils import open_resource\n+\n+\n+class EncryptionProperties(TypedDict, total=False):\n+ keyformat: Literal['hex', 'passphrase', 'raw']\n+ keylocation: str\n+ pbkdf2iters: int | None\n+\n+\n+def load_key(tls: threading.local, 
dataset: str, **kwargs: str | bytes) -> None:\n+ \"\"\"\n+ Load the encryption key for a ZFS dataset.\n+\n+ Args:\n+ dataset: Name of the ZFS dataset whose key should be loaded.\n+\n+ Keyword Args:\n+ key: Key material as ``str`` (hex/passphrase) or ``bytes`` (raw).\n+ Mutually exclusive with ``key_location``.\n+ key_location: Path to the key file on disk.\n+ Mutually exclusive with ``key``.\n+ \"\"\"\n+ if len(kwargs) > 1:\n+ raise ValueError('Cannot specify both key and key location')\n+ rsrc = open_resource(tls, dataset)\n+ if (crypto := rsrc.crypto()) is None:\n+ raise ZFSNotEncryptedException(dataset)\n+ if crypto.info().key_is_loaded:\n+ raise ZFSKeyAlreadyLoadedException(dataset)\n+ crypto.load_key(**kwargs)\n+\n+\n+def check_key(tls: threading.local, dataset: str, **kwargs: str | bytes) -> bool:\n+ \"\"\"\n+ Return True if ``key`` (or the key at ``key_location``) can unlock ``dataset``.\n+\n+ Does not actually load the key. Raises ZFSNotEncryptedException if the\n+ dataset is not encrypted or if the ZFS operation fails for a reason other\n+ than a wrong key (EZFS_CRYPTOFAILED returns False rather than raising).\n+\n+ Args:\n+ dataset: Name of the ZFS dataset to check.\n+\n+ Keyword Args:\n+ key: Key material as ``str`` (hex/passphrase) or ``bytes`` (raw).\n+ Mutually exclusive with ``key_location``.\n+ key_location: Path to the key file on disk.\n+ Mutually exclusive with ``key``.\n+ \"\"\"\n+ if len(kwargs) > 1:\n+ raise ValueError('Cannot specify both key and key location')\n+ rsrc = open_resource(tls, dataset)\n+ if (crypto := rsrc.crypto()) is None:\n+ raise ZFSNotEncryptedException(dataset)\n+ return crypto.check_key(**kwargs) # type: ignore[no-any-return]\n+\n+\n+def change_key(\n+ tls: threading.local,\n+ dataset: str,\n+ properties: EncryptionProperties | None = None,\n+ key: str | None = None\n+) -> None:\n+ \"\"\"\n+ Change the encryption key and/or properties for ``dataset``.\n+\n+ The dataset's key must already be loaded before calling 
this.\n+\n+ Args:\n+ dataset: Name of the ZFS dataset whose key should be changed.\n+ properties: May contain any combination of keyformat, keylocation, and\n+ pbkdf2iters.\n+ key: New key material. Required when keylocation is not given.\n+ \"\"\"\n+ props = {} if properties is None else cast(dict[str, str | int | None], properties.copy())\n+ if key:\n+ props.pop('keylocation', None)\n+ props['key'] = key\n+ elif 'keylocation' not in props:\n+ raise ValueError('Must specify either key or key location')\n+\n+ rsrc = open_resource(tls, dataset)\n+ if (crypto := rsrc.crypto()) is None:\n+ raise ZFSNotEncryptedException(dataset)\n+ config = tls.lzh.resource_cryptography_config(**props)\n+ crypto.change_key(info=config)\n+\n+\n+def change_encryption_root(tls: threading.local, dataset: str) -> None:\n+ \"\"\"\n+ Make ``dataset`` inherit encryption from its parent, removing it as\n+ an encryption root.\n+\n+ ``dataset`` must currently be an encryption root and its key must be loaded.\n+\n+ Args:\n+ dataset: Name of the ZFS dataset to remove as an encryption root.\n+ \"\"\"\n+ rsrc = open_resource(tls, dataset)\n+ if (crypto := rsrc.crypto()) is None:\n+ raise ZFSNotEncryptedException(dataset)\n+ crypto.inherit_key()", - "header": "@@ -0,0 +1,106 @@", - "new_count": 106, - "new_start": 1, - "old_count": 0, - "old_start": 0 - } - ], - "language": "python", - "lines_added": 106, - "lines_removed": 0, - "path": "src/middlewared/middlewared/plugins/zfs/encryption.py", - "status": "added" - }, - { - "hunks": [ - { - "content": "-from typing import Collection\n+from typing import Iterable\n \n __all__ = (\n+ \"ZFSKeyAlreadyLoadedException\",\n+ \"ZFSNotEncryptedException\",\n \"ZFSPathAlreadyExistsException\",\n \"ZFSPathInvalidException\",\n \"ZFSPathNotASnapshotException\",", - "header": "@@ -1,6 +1,8 @@", - "new_count": 8, - "new_start": 1, - "old_count": 6, - "old_start": 1 - }, - { - "content": " )\n \n \n+class ZFSKeyAlreadyLoadedException(Exception):\n+ def 
__init__(self, path: str):\n+ self.message = f\"{path!r} key is already loaded\"\n+ super().__init__(self.message)\n+\n+\n+class ZFSNotEncryptedException(Exception):\n+ def __init__(self, path: str):\n+ self.message = f\"{path!r} is not encrypted\"\n+ super().__init__(self.message)\n+\n+\n class ZFSPathAlreadyExistsException(Exception):\n def __init__(self, path: str):\n self.message = f\"{path!r} already exists\"", - "header": "@@ -9,6 +11,18 @@", - "new_count": 18, - "new_start": 11, - "old_count": 6, - "old_start": 9 - }, - { - "content": " \n \n class ZFSPathHasClonesException(Exception):\n- def __init__(self, path: str, clones: Collection[str]):\n+ def __init__(self, path: str, clones: Iterable[str]):\n self.path = path\n self.clones = clones\n self.message = f\"{path!r} has the following clones: {','.join(clones)}\"", - "header": "@@ -16,7 +30,7 @@ def __init__(self, path: str):", - "new_count": 7, - "new_start": 30, - "old_count": 7, - "old_start": 16 - }, - { - "content": " \n \n class ZFSPathHasHoldsException(Exception):\n- def __init__(self, path: str, holds: Collection[str]):\n+ def __init__(self, path: str, holds: Iterable[str]):\n self.message = f\"{path!r} has the following holds: {','.join(holds)}\"\n super().__init__(self.message)\n ", - "header": "@@ -24,7 +38,7 @@ def __init__(self, path: str, clones: Collection[str]):", - "new_count": 7, - "new_start": 38, - "old_count": 7, - "old_start": 24 - } - ], - "language": "python", - "lines_added": 17, - "lines_removed": 3, - "path": "src/middlewared/middlewared/plugins/zfs/exceptions.py", - "status": "modified" - }, - { - "hunks": [ - { - "content": "-import libzfs\n-\n-from middlewared.service import CallError, job, Service\n-\n-\n-class ZFSDatasetService(Service):\n-\n- class Config:\n- namespace = 'zfs.dataset'\n- private = True\n- process_pool = True\n-\n- def common_load_dataset_checks(self, id_, ds):\n- self.common_encryption_checks(id_, ds)\n- if ds.key_loaded:\n- raise CallError(f'{id_} key is 
already loaded')\n-\n- def common_encryption_checks(self, id_, ds):\n- if not ds.encrypted:\n- raise CallError(f'{id_} is not encrypted')\n-\n- def load_key(self, id_: str, options: dict | None = None):\n- if options is None:\n- options = {\n- 'mount': True,\n- 'recursive': False,\n- 'key': None,\n- 'key_location': None,\n- }\n- options.setdefault('mount', True)\n- options.setdefault('recursive', False)\n- options.setdefault('key', None)\n- options.setdefault('key_location', None)\n-\n- mount_ds = options.pop('mount')\n- recursive = options.pop('recursive')\n- try:\n- with libzfs.ZFS() as zfs:\n- ds = zfs.get_dataset(id_)\n- self.common_load_dataset_checks(id_, ds)\n- ds.load_key(**options)\n- except libzfs.ZFSException as e:\n- self.logger.error(f'Failed to load key for {id_}', exc_info=True)\n- raise CallError(f'Failed to load key for {id_}: {e}')\n- else:\n- if mount_ds:\n- self.call_sync2(self.s.zfs.resource.mount, id_, recursive=recursive)\n-\n- def check_key(self, id_: str, options: dict | None = None):\n- \"\"\"\n- Returns `true` if the `key` is valid, `false` otherwise.\n- \"\"\"\n- if options is None:\n- options = {\n- 'key': None,\n- 'key_location': None,\n- }\n-\n- try:\n- with libzfs.ZFS() as zfs:\n- ds = zfs.get_dataset(id_)\n- self.common_encryption_checks(id_, ds)\n- return ds.check_key(**options)\n- except libzfs.ZFSException as e:\n- self.logger.error(f'Failed to check key for {id_}', exc_info=True)\n- raise CallError(f'Failed to check key for {id_}: {e}')\n-\n- def change_key(self, id_: str, options: dict | None = None):\n- if options is None:\n- options = {\n- 'encryption_properties': {},\n- 'load_key': True,\n- 'key': None,\n- }\n-\n- try:\n- with libzfs.ZFS() as zfs:\n- ds = zfs.get_dataset(id_)\n- self.common_encryption_checks(id_, ds)\n- ds.change_key(props=options['encryption_properties'], load_key=options['load_key'], key=options['key'])\n- except libzfs.ZFSException as e:\n- self.logger.error(f'Failed to change key for {id_}', 
exc_info=True)\n- raise CallError(f'Failed to change key for {id_}: {e}')\n-\n- def change_encryption_root(self, id_: str, options: dict | None = None):\n- if options is None:\n- options = {'load_key': True}\n-\n- try:\n- with libzfs.ZFS() as zfs:\n- ds = zfs.get_dataset(id_)\n- ds.change_key(load_key=options['load_key'], inherit=True)\n- except libzfs.ZFSException as e:\n- raise CallError(f'Failed to change encryption root for {id_}: {e}')\n-\n- @job()\n- def bulk_process(self, job, name: str, params: list):\n- f = getattr(self, name, None)\n- if not f:\n- raise CallError(f'{name} method not found in zfs.dataset')\n-\n- statuses = []\n- for i in params:\n- result = error = None\n- try:\n- result = f(*i)\n- except Exception as e:\n- error = str(e)\n- finally:\n- statuses.append({'result': result, 'error': error})\n-\n- return statuses", - "header": "@@ -1,112 +0,0 @@", - "new_count": 0, - "new_start": 0, - "old_count": 112, - "old_start": 1 - } - ], - "language": "", - "lines_added": 0, - "lines_removed": 112, - "path": "", - "status": "removed" - } - ], - "intent_gaps": [ - "The PR description says 'Replace usage of the deprecated py-libzfs with truenas_pylibzfs for these private methods' but does not enumerate which methods. The actual scope is: check_key, load_key, change_key, change_encryption_root in four separate call sites across three files. The description gives no indication that kmip/zfs_keys.py is included.", - "The PR description says 'This removes another use case of our process pool' but does not explain that the `zfs.dataset` service (`process_pool = True`) is being entirely deleted, not just reduced. 
The deleted file's `bulk_process` method was the batch dispatch mechanism; its removal means no more batch key-checking across datasets \u2014 checks are now serial within the job thread.", - "The PR description mentions a dependency on truenas_pylibzfs/pull/145 but does not specify what that PR adds (presumably the `crypto()` method on ZFS resources, `resource_cryptography_config`, and `ZFSResourceCryptography.check_key/load_key/change_key/inherit_key`). The correct behavior of this PR is entirely contingent on that dependency, which is not merged in this repository.", - "The pbkdf2iters security hardening (350k \u2192 1.3M) is not mentioned anywhere in the PR description. Reviewers would not know to scrutinize the performance and migration implications of this change without reading the API model diffs.", - "The PR does not address what happens to the `zfs.dataset.bulk_process` method that was used by callers outside the encryption path (if any). The deleted file's `bulk_process` was a generic dispatcher for any method on `ZFSDatasetService`; its removal is silent and no audit of other callers is documented.", - "The description does not clarify the error-handling philosophy change: old code wrapped all libzfs errors in CallError (friendly, loggable); new code lets raw truenas_pylibzfs ZFSException propagate to callers, relying on catch-all `except Exception` blocks in the job layer for recovery." - ], - "pr_narrative": "This PR replaces the deprecated `py-libzfs` (via `libzfs` Python bindings and the process-pool-dispatched `zfs.dataset` service) with direct `truenas_pylibzfs` calls for four ZFS dataset encryption operations: key loading, key checking, key changing, and encryption root inheritance.\n\n**Old mechanism**: `src/middlewared/middlewared/plugins/zfs_/dataset_encryption.py` defined a `ZFSDatasetService` class (namespace `zfs.dataset`) with `process_pool = True`. 
This class used `import libzfs` and opened a new `libzfs.ZFS()` context for every operation. Callers in `pool_/dataset_encryption_info.py` and `pool_/dataset_encryption_operations.py` dispatched to this service via `self.middleware.call('zfs.dataset.bulk_process', ...)` or `self.middleware.call('zfs.dataset.change_key', ...)` \u2014 meaning all operations ran in a subprocess pool, fully isolated from the main event loop, and all were `async`.\n\n**New mechanism**: A new module `src/middlewared/middlewared/plugins/zfs/encryption.py` is introduced containing four free functions (`load_key`, `check_key`, `change_key`, `change_encryption_root`) that operate directly on `truenas_pylibzfs` objects via a thread-local `tls.lzh` handle. These functions are called inline (no subprocess) from the same thread that holds the job or method. The `@pass_thread_local_storage` decorator is added to every consuming method to inject the `tls` argument, and each consuming method is converted from `async def` to synchronous `def` (with `await self.middleware.call(...)` replaced by `self.middleware.call_sync(...)`).\n\nThe change touches five callers:\n1. `pool_/dataset_encryption_info.py` \u2014 `encryption_summary` and `sync_db_keys` now call `check_key(tls, ...)` directly instead of dispatching a `bulk_process` job.\n2. `pool_/dataset_encryption_lock.py` \u2014 `unlock` now calls `load_key(tls, ...)` directly.\n3. `pool_/dataset_encryption_operations.py` \u2014 `change_key` and `inherit_parent_encryption_properties` now call `change_key(tls, ...)` and `change_encryption_root(tls, ...)` directly; `insert_or_update_encrypted_record` is also converted from `async` to sync.\n4. `kmip/zfs_keys.py` \u2014 `push_zfs_keys` and `pull_zfs_keys` now call `check_key(tls, ...)` directly with `@pass_thread_local_storage`.\n5. 
`api/v26_0_0/pool.py` and `api/v26_0_0/pool_dataset.py` \u2014 `pbkdf2iters` minimum/default raised from 350,000 to 1,300,000 for both `PoolCreateEncryptionOptions` and `PoolDatasetChangeKeyOptions`; a `from_previous` classmethod is added to clamp old values to the new minimum when migrating from prior API versions.\n\nThe deleted file `zfs_/dataset_encryption.py` (112 lines) is fully removed; its `bulk_process` method, subprocess dispatch, and per-call `libzfs.ZFS()` context creation are gone.", - "risk_surfaces": [ - "EXCEPTION CONTRACT CHANGE \u2014 load_key: The old `zfs.dataset.load_key` wrapped all `libzfs.ZFSException` in `CallError` and logged before raising. The new `load_key` in `zfs/encryption.py` raises `ZFSNotEncryptedException` or `ZFSKeyAlreadyLoadedException` for those pre-checks, then calls `crypto.load_key(**kwargs)` which propagates raw `truenas_pylibzfs.ZFSException` directly. In `dataset_encryption_lock.py:222-228`, the `unlock` method catches `ZFSException` (checking `e.code == ZFSError.EZFS_CRYPTOFAILED`) and bare `Exception`, so the raw `ZFSException` from `crypto.load_key()` is still caught. However, `ZFSKeyAlreadyLoadedException` and `ZFSNotEncryptedException` are plain `Exception` subclasses with no `code` attribute \u2014 they will be caught by the bare `except Exception` branch and surfaced as a string error rather than the typed `CallError` the old code would have produced. Callers expecting a `CallError` (e.g. the WebSocket client) would previously get a structured error; now they get a raw exception string.", - "EXCEPTION CONTRACT CHANGE \u2014 check_key: Old `zfs.dataset.check_key` raised `CallError` on any `libzfs.ZFSException` (including wrong-key scenarios). The new `check_key` raises `ZFSNotEncryptedException` for non-encrypted datasets but returns `False` for `EZFS_CRYPTOFAILED` (per docstring). 
In `encryption_summary` (line 106-109) and `sync_db_keys` (line 200-203), both sites wrap `check_key` in `except Exception: valid_key/should_remove = False/True`, so the behavior is preserved for the happy path. However, there is no guard against passing `key=None` to `crypto.check_key()`. In `encryption_summary`, `ds_key` can be `None` if `ds['encryption_key']` is `None` and no key was supplied by the user \u2014 `check_key(tls, name, key=None)` would pass `key=None` as a kwarg into `crypto.check_key(key=None)`. The behavior of `truenas_pylibzfs`'s `check_key(key=None)` is not visible in this repo; if it does not accept `None`, an exception is raised and silently swallowed to `valid_key = False`, which is the same end result as before \u2014 but relying on an exception catch to cover this is fragile.", - "BULK PROCESS REMOVED \u2014 error aggregation semantics: The old `sync_db_keys` called `zfs.dataset.bulk_process('check_key', [...])` which processed all datasets, accumulated per-dataset errors in `status['error']`, and only aborted if the job itself errored. The new code wraps the entire loop in a single `try/except Exception` (line 208-210). If any unexpected exception escapes the inner `try/except Exception` at line 200-203 (which seems impossible in current code but is a structural fragility), the outer handler will abort the entire loop and return early without processing remaining datasets. The old code continued on a per-dataset error and then separately checked `check_key_job.error` for the job-level error. The new outer catch at line 208-210 logging `f'Failed to sync database keys: {exc}'` uses an f-string without `exc_info=True`, losing the stack trace.", - "ASYNC-TO-SYNC CONVERSION \u2014 insert_or_update_encrypted_record: This method changed from `async def` to `def`. Its callers in `dataset_encryption_lock.py` (`unlock`) and `dataset_encryption_operations.py` (`change_key`) are also sync, so the immediate callers are fine. 
However, if any other caller invokes this as `await self.middleware.call('pool.dataset.insert_or_update_encrypted_record', ...)` from an async context, it will still work through the middleware dispatch layer. The concern is whether any external caller relied on this being co-routine-safe. No external callers are visible in the diff, but this should be verified.", - "DECORATOR ORDERING \u2014 @pass_thread_local_storage with @job: In `encryption_summary` and `sync_db_keys`, the decorator order is `@pass_thread_local_storage` above `@job`. The `tls` argument is injected between `self/job` and the user-visible arguments (`id_`, `options`, `name`). If the `@job` decorator wraps the function and then `@pass_thread_local_storage` wraps that, the positional argument order seen by the actual function body is `(self, job, tls, id_, options)`. This pattern matches how `unlock` was already written (`def unlock(self, job, tls, id_, options)`), so it appears intentional. But `sync_db_keys` has `lock=lambda args: f'sync_encrypted_pool_dataset_keys_{args}'` \u2014 the `args` lambda receives the job's original positional args. If `tls` is now injected before `name`, the lock key computation could change. Verify that the `args` lambda in `@job` sees the pre-`tls`-injection argument list.", - "change_key \u2014 load_key parameter removed: The old `zfs.dataset.change_key` accepted a `load_key` boolean (always passed as `False` from the calling site). The new `change_key` in `zfs/encryption.py` does not accept or pass `load_key` at all to `crypto.change_key(info=config)`. If `truenas_pylibzfs`'s `crypto.change_key` has a different default for whether it reloads the key, the behavior could diverge from the old code's explicit `load_key=False`.", - "change_key \u2014 props/key argument shape: The old code called `ds.change_key(props=options['encryption_properties'], load_key=False, key=options['key'])` with `props` as a dict. 
The new `change_key` builds a `props` dict from `EncryptionProperties`, calls `tls.lzh.resource_cryptography_config(**props)` to get a config object, then passes `info=config` to `crypto.change_key`. The `resource_cryptography_config` API (defined in `truenas_pylibzfs`) must accept the same property names (`keyformat`, `keylocation`, `pbkdf2iters`, `key`). If `truenas_pylibzfs` rejects unknown property names or has different semantics for `pbkdf2iters=None` (the TypedDict marks it as `int | None`), key-change operations could fail silently or raise.", - "change_encryption_root \u2014 ZFSKeyFormat comparison bug fix: In the old code (line in diff): `if ZFSKeyFormat(parent_encrypted_root['key_format']['value']) == ZFSKeyFormat.PASSPHRASE.value:` \u2014 this compared a `ZFSKeyFormat` enum member to a string (`.value`), which would always be `False`. The new code: `if parent_encrypted_root['key_format']['value'] == ZFSKeyFormat.PASSPHRASE.value:` \u2014 correctly compares two strings. This is a behavioral change: the passphrase-key-children guard in `inherit_parent_encryption_properties` was previously never enforced (always skipped) and will now be enforced. This is a semantics fix, but it is an undocumented behavior change that could break workflows where users inherited encryption roots from passphrase-encrypted parents that had key-encrypted children.", - "pbkdf2iters default increase \u2014 from_previous migration: `PoolCreateEncryptionOptions` and `PoolDatasetChangeKeyOptions` in `api/v26_0_0` raise the minimum from 100,000 to 1,300,000 and the default from 350,000 to 1,300,000. The `from_previous` classmethod clamps existing values upward with `max(1300000, value['pbkdf2iters'])`. This means any existing dataset or pool that was created with pbkdf2iters between 100,000 and 1,299,999 will silently have their iteration count upgraded on the next API operation touching these fields. This can cause a significant increase in key-derivation time during unlock. 
This is a security hardening but is a breaking change for automated scripts that stored or compared pbkdf2iters values.", - "KMIP check_key \u2014 no tls guard: In `kmip/zfs_keys.py`, `push_zfs_keys` and `pull_zfs_keys` now call `check_key(tls, ...)` directly. The `@pass_thread_local_storage` decorator was added to both. However, these are called from `sync_zfs_keys` at lines 138 and 142 as `self.push_zfs_keys(tls, ids)` and `self.pull_zfs_keys(tls)` \u2014 passing `tls` explicitly. If `@pass_thread_local_storage` injects `tls` automatically, passing it explicitly would result in a double injection (`tls` appears twice in the argument list). This is a potential signature mismatch that could cause a `TypeError` at runtime.", - "path_in_locked_datasets \u2014 not in PR scope but adjacent risk: This method in `dataset_encryption_info.py` (lines 216-283) already uses `tls.lzh.open_resource(...)` directly and was not changed by this PR. It is annotated as a hot code path and handles `ZFSException` with EZFS_NOENT and EZFS_INVALIDNAME filtering. This code is architecturally similar to the new functions but was not touched, which is correct. However, reviewers should verify no regression was introduced in how `ZFSException` is imported \u2014 the import at line 9 is `from truenas_pylibzfs import ZFSError, ZFSException`, which is correct." - ], - "stats": { - "files_added": 1, - "files_modified": 7, - "files_removed": 1, - "files_renamed": 0, - "test_files_changed": 0, - "test_to_code_ratio": 0, - "total_additions": 254, - "total_deletions": 210, - "total_files": 9 - }, - "unrelated_changes": [ - "api/v26_0_0/pool.py and api/v26_0_0/pool_dataset.py \u2014 pbkdf2iters default/minimum raised from 350,000 to 1,300,000 with a `from_previous` migration validator added. This is a security hardening change unrelated to the py-libzfs \u2192 truenas_pylibzfs refactor. 
The PR description makes no mention of this change.", - "dataset_encryption_operations.py \u2014 The `ZFSKeyFormat` comparison bug fix in `inherit_parent_encryption_properties` (old: compared enum instance to string value, always False; new: compares two strings, now actually enforces the constraint) is a behavioral bug fix bundled into this refactor PR without mention in the PR description.", - "dataset_encryption_info.py sync_db_keys \u2014 The query for `encrypted_roots` was changed to also fetch the `keyformat` property (`'properties': ['encryptionroot', 'keyformat']`) where before it only fetched `encryptionroot`. This is needed for the new hex-key detection logic but represents a query change not mentioned in the PR description.", - "kmip/zfs_keys.py get_encrypted_datasets \u2014 Changed from calling `self.middleware.call_sync('pool.dataset.query', ...)` (old code, visible from context) to using `self.call_sync2(self.s.zfs.resource.query_impl, ZFSResourceQuery(...))` \u2014 an internal implementation-level change that shifts from the high-level dataset query to the low-level ZFS resource query. This may filter or format results differently." - ] - }, - "budget": { - "budget_exhausted": true, - "cost_breakdown": { - "adversary": 0, - "anatomy": 0, - "coverage": 0, - "cross_ref": 0, - "intake": 0, - "meta_selectors": 0, - "output": 0, - "review": 0, - "synthesis": 0 - }, - "max_cost_usd": 2, - "max_duration_seconds": 900, - "total_cost_usd": 0 - }, - "intake": { - "ai_generated": 0, - "areas_touched": [ - "api" - ], - "complexity": "standard", - "languages": [ - "python" - ], - "pr_summary": "Replace usage of the deprecated py-libzfs with truenas_pylibzfs for these private methods. 
This removes another use case of our process pool.\r\n\r\nDepends on changes made in https://github.com/truenas/truenas_pylibzfs/pull/145.", - "pr_type": "refactor", - "review_depth": "standard", - "risk_signals": [ - "changes API surface or request/response behavior" - ] - }, - "phases_completed": [ - "intake", - "anatomy", - "meta_selectors", - "review", - "adversary", - "cross_ref", - "coverage", - "synthesis", - "output" - ], - "plan": { - "ai_adjusted": false, - "cross_ref_hints": [], - "dimensions": [ - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "src/middlewared/middlewared/plugins/zfs/exceptions.py" - ], - "id": "semantic_sem_01", - "name": "Exception contract change in load_key: typed exceptions vs. CallError", - "priority": 10, - "review_prompt": "The old `zfs.dataset.load_key` caught all `libzfs.ZFSException` and re-raised as `CallError`. The new `load_key` in `zfs/encryption.py` raises `ZFSNotEncryptedException` or `ZFSKeyAlreadyLoadedException` (plain `Exception` subclasses with no `code` attribute) for pre-check failures, and lets raw `truenas_pylibzfs.ZFSException` propagate from `crypto.load_key()`. In `dataset_encryption_lock.py`, the `unlock` method catches `ZFSException` (checking `e.code == ZFSError.EZFS_CRYPTOFAILED`) and then a bare `except Exception`. Verify: (1) `ZFSNotEncryptedException` and `ZFSKeyAlreadyLoadedException` \u2014 do they fall through to the bare `except Exception` branch and get surfaced as a raw string error rather than a structured `CallError`? (2) Do any callers of `unlock` (e.g., WebSocket dispatch) depend on receiving a `CallError` with a specific `.errno` or `.errmsg` structure? 
(3) Are there any paths where the new typed exceptions bypass all error handling and bubble up to the framework uncaught?", - "target_files": [ - "src/middlewared/middlewared/plugins/zfs/encryption.py", - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 4 - }, - "context_files": [ - "src/middlewared/middlewared/plugins/zfs/encryption.py" - ], - "id": "mechanical_mech_1", - "name": "KMIP double-injection: @pass_thread_local_storage + explicit tls arg causes TypeError", - "priority": 10, - "review_prompt": "In `kmip/zfs_keys.py`, `push_zfs_keys` and `pull_zfs_keys` have been decorated with `@pass_thread_local_storage`, which automatically injects `tls` as the first argument after `self`. However, their caller `sync_zfs_keys` invokes them as `self.push_zfs_keys(tls, ids)` and `self.pull_zfs_keys(tls)` \u2014 passing `tls` explicitly as a positional argument. If `@pass_thread_local_storage` injects `tls` into the argument list before the call executes, and the caller also passes `tls` explicitly, the function receives `tls` twice: once from the decorator injection and once from the caller. This will produce a `TypeError: push_zfs_keys() got multiple values for argument 'tls'` (or a positional argument count mismatch) at runtime.\n\nYour task:\n1. Read `kmip/zfs_keys.py` in full. Identify the signatures of `push_zfs_keys`, `pull_zfs_keys`, and `sync_zfs_keys`.\n2. Read or infer the implementation of `@pass_thread_local_storage` to understand exactly when and how it injects `tls` \u2014 does it inject before or after the decorated function is called, and does it strip `tls` from the call-site args?\n3. 
Determine whether `sync_zfs_keys` must be updated to NOT pass `tls` explicitly (because the decorator handles it), or whether the decorator was NOT intended to be added to these methods (and they should instead receive `tls` from their caller).\n4. If a double-injection bug exists, report the exact file and line numbers, the erroneous decorator placement or call-site, and the correct fix.\n5. If no double-injection occurs (e.g., the decorator is a pass-through that does not inject when already present), explain the mechanism that prevents the bug.", - "target_files": [ - "src/middlewared/middlewared/plugins/kmip/zfs_keys.py" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 4 - }, - "context_files": [ - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py" - ], - "id": "mechanical_mech_2", - "name": "Exception contract break: ZFSKeyAlreadyLoadedException / ZFSNotEncryptedException caught by bare except as string, not CallError", - "priority": 9, - "review_prompt": "The new `load_key` function in `zfs/encryption.py` raises `ZFSKeyAlreadyLoadedException` or `ZFSNotEncryptedException` (both plain `Exception` subclasses defined in `zfs/exceptions.py`) as pre-condition guards before calling `crypto.load_key()`. In `dataset_encryption_lock.py`, the `unlock` method catches exceptions in two branches: `except ZFSException as e` (checking `e.code == ZFSError.EZFS_CRYPTOFAILED`) and a bare `except Exception as e`. The new custom exceptions are NOT `ZFSException` subclasses, so they fall into the bare `except Exception` branch and are stringified into the error result \u2014 instead of being raised as a structured `CallError` as the old code did.\n\nYour task:\n1. Read `zfs/exceptions.py` to confirm the class hierarchy of `ZFSKeyAlreadyLoadedException` and `ZFSNotEncryptedException`. Do they inherit from `ZFSException`, `CallError`, or plain `Exception`?\n2. 
Read `dataset_encryption_lock.py` lines 200\u2013240 (approximate). Trace what happens when each of these two exceptions is raised: which `except` branch catches it, what is placed in the error result (stringified message vs. structured `CallError`), and whether a `CallError` is ever re-raised.\n3. Read `zfs/encryption.py` `load_key` function fully. Confirm it raises these exceptions before calling `crypto.load_key()`.\n4. Determine whether the callers of `unlock` (e.g., the WebSocket API layer) expect a `CallError` with a specific `errno` or just any exception. If `CallError` is expected, the current code is a contract break.\n5. Report all locations where the exception handling must be updated to convert these custom exceptions into `CallError` before they escape to callers, or where the exception class hierarchy must be changed.", - "target_files": [ - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "src/middlewared/middlewared/plugins/zfs/exceptions.py", - "src/middlewared/middlewared/plugins/zfs/encryption.py" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [ - "src/middlewared/middlewared/plugins/zfs/encryption.py" - ], - "id": "semantic_sem_03", - "name": "ZFSKeyFormat enum comparison fix silently activates previously dead guard", - "priority": 8, - "review_prompt": "In the old `inherit_parent_encryption_properties` / `change_encryption_root`, the condition `if ZFSKeyFormat(parent_encrypted_root['key_format']['value']) == ZFSKeyFormat.PASSPHRASE.value:` compared a `ZFSKeyFormat` enum instance to a string (`.value`), which always evaluates to `False` in Python due to type-strict `==` semantics. This means the guard that prevents key-encrypted children from inheriting encryption roots from passphrase-encrypted parents was never enforced. 
The new code uses `if parent_encrypted_root['key_format']['value'] == ZFSKeyFormat.PASSPHRASE.value:`, a string-to-string comparison that correctly enforces the guard. Verify: (1) Confirm the old code's comparison was indeed always `False` \u2014 that is, no datasets exist in production that relied on this guard being a no-op. (2) What is the exact behavior change for a key-encrypted child dataset whose parent has a passphrase-encrypted root \u2014 will the operation now raise an error, return early, or behave differently in some other way? (3) Is there any documented or tested workflow that previously worked because this guard was silently skipped, and will now fail?", - "target_files": [ - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - }, - "context_files": [], - "id": "semantic_sem_04", - "name": "pbkdf2iters silent upgrade via from_previous: latency regression and breakage for automation", - "priority": 7, - "review_prompt": "In `api/v26_0_0/pool.py` and `api/v26_0_0/pool_dataset.py`, `PoolCreateEncryptionOptions` and `PoolDatasetChangeKeyOptions` now default `pbkdf2iters` to 1,300,000 (up from 350,000) with a minimum of 1,300,000. The `from_previous` classmethod uses `max(1300000, value['pbkdf2iters'])` to silently clamp old values upward. Verify: (1) Is the `from_previous` migration invoked on read (i.e., for existing dataset API responses) or only on write (i.e., only when the user explicitly submits a key-change operation)? If invoked on write, does the caller receive the upgraded value transparently without being warned? (2) For existing datasets with pbkdf2iters between 100,000 and 1,299,999, will the iteration count be silently changed to 1,300,000 on the next `change_key` call, meaning the encryption parameters of a live dataset change without explicit user intent? 
(3) On low-power or embedded hardware, does a 3.7x increase in PBKDF2 iterations cause key-derivation to exceed unlock timeouts, potentially making encrypted datasets permanently inaccessible without intervention?", - "target_files": [ - "src/middlewared/middlewared/api/v26_0_0/pool.py", - "src/middlewared/middlewared/api/v26_0_0/pool_dataset.py" - ] - }, - { - "budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 4 - }, - "context_files": [ - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py" - ], - "id": "mechanical_mech_3", - "name": "Decorator ordering: @pass_thread_local_storage above @job \u2014 does @job lambda see pre- or post-tls-injection arg list?", - "priority": 7, - "review_prompt": "In `dataset_encryption_info.py`, `sync_db_keys` uses `@job(lock=lambda args: f'sync_encrypted_pool_dataset_keys_{args}')` stacked beneath `@pass_thread_local_storage`. The `args` lambda passed to `@job` receives the positional arguments at the time the job dispatch layer captures them. If `@pass_thread_local_storage` is the outer decorator (applied last, wraps the `@job`-decorated function), then `tls` is injected AFTER the `@job` lock-key computation runs \u2014 meaning the lock lambda sees `(name,)` as intended. But if the decorator order means `@job` wraps the already-`tls`-injected function, the lambda would see `(tls, name)` and the lock key would be `sync_encrypted_pool_dataset_keys_(tls_object, 'poolname')`, producing an incorrect and potentially non-unique lock key.\n\nYour task:\n1. Read `dataset_encryption_info.py` to confirm the exact decorator order on `sync_db_keys` (which decorator appears on the line immediately above `def sync_db_keys`).\n2. 
Find and read the implementation of `@pass_thread_local_storage` to understand its wrapping behavior \u2014 does it wrap the already-decorated function or is it the inner decorator?\n3. Find and read the `@job` decorator implementation to understand when the `lock` lambda is evaluated relative to argument injection by outer decorators.\n4. Determine whether the `lock` lambda in `sync_db_keys` receives `(name,)` or `(tls, name)` at runtime.\n5. If `tls` is visible to the lambda, report the exact file/line and explain why the lock key will be malformed, and what the correct fix is (e.g., swap decorator order, or adjust the lambda to index `args[1]` instead of `args`).", - "target_files": [ - "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py" - ] - } - ], - "total_budget": { - "max_child_spawns": 2, - "max_cost_usd": 0.5, - "max_duration_seconds": 60, - "max_reference_follows": 3 - } - } - }, - "pr_url": "https://github.com/truenas/middleware/pull/18291", - "review": { - "body": "## \ud83d\udd34 PR-AF Review \u2014 **Changes Required**\n\n*Automated multi-agent code review \u00b7 [PR-AF](https://github.com/Agent-Field/agentfield) built with [AgentField](https://github.com/Agent-Field/agentfield)*\n\n> **14 findings** \u00b7 \ud83d\udd34 2 critical \u00b7 \ud83d\udfe0 9 important \u00b7 \ud83d\udd35 2 suggestions \u00b7 \u26aa 1 nitpicks\n\n
\nPR Overview\n\nReplace usage of the deprecated py-libzfs with truenas_pylibzfs for these private methods. This removes another use case of our process pool.\r\n\r\nDepends on changes made in https://github.com/truenas/truenas_pylibzfs/pull/145.\n\n
\n\n### Key Findings\n\n**11 issue(s) should be addressed before merge:**\n\n- \ud83d\udd34 **zfs_keys cache silently wiped on every push/pull: `k in existing_datasets` checks string in list-of-dicts** (`src/middlewared/middlewared/plugins/kmip/zfs_keys.py:94`) \u2014 `get_encrypted_datasets` returns a `list` of dataset dicts (each a `dict` with keys `'name'`, `'id'`, `'encryption_key'`, `'kmip_uid'`, etc.).\n- \ud83d\udd34 **Missing `id` argument in `datastore.update` call \u2014 wrong argument count, update never applied to correct row** (`src/middlewared/middlewared/plugins/kmip/zfs_keys.py:157`) \u2014 The `datastore.update` API signature is `(table: str, id: int, data: dict)`.\n- \ud83d\udfe0 **Old guard was always False: key-encrypted child under passphrase-root inheritance was never blocked** (`src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py:248`) \u2014 **The old comparison was provably always `False`.** In the prior code (`bde8f1de3b`), the guard in `inherit_parent_encryption_properties_impl` read: ```python if ZFSKeyFormat(parent_encrypted_root.k\u2026\n- \ud83d\udfe0 **ZFSKeyAlreadyLoadedException and ZFSNotEncryptedException silently swallowed as string errors instead of structured CallError** (`src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py:229`) \u2014 The bare `except Exception as e` branch on line 229 catches `ZFSKeyAlreadyLoadedException` and `ZFSNotEncryptedException` (both plain `Exception` subclasses from `zfs/exceptions.py`) and converts them\u2026\n- \ud83d\udfe0 **from_previous fires on write only; legacy API callers have pbkdf2iters silently upgraded to 1,300,000 without any notification** (`src/middlewared/middlewared/api/v26_0_0/pool_dataset.py:183`) \u2014 **`from_previous` is invoked exclusively on incoming write operations (argument upgrade), never on reads (API responses).** The `APIVersionsAdapter` in `legacy_api_method.py` upgrades incoming parame\u2026\n- \ud83d\udfe0 **`sync_db_keys` 
lock lambda embeds the full args list, causing inconsistent lock keys between periodic and explicit calls** (`src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py:161`) \u2014 The `lock` lambda on `sync_db_keys` uses `args` (the entire raw-arguments list) rather than `args[0]` (the first positional argument, `name`): ```python @job(lock=lambda args: f'sync_encrypted_pool_d\u2026\n- \ud83d\udfe0 **Existing passphrase-encrypted datasets silently re-keyed at 3.7x higher iteration count on next change_key call via any API version** (`src/middlewared/middlewared/api/v26_0_0/pool_dataset.py:175`) \u2014 **Existing datasets with `pbkdf2iters` between 100,000 and 1,299,999 will have their iteration count permanently changed to 1,300,000 on the next `change_key` call, regardless of whether the user expl\u2026\n- \ud83d\udfe0 **Custom ZFS exceptions inherit from plain Exception instead of CallError, breaking structured error propagation across all callers** (`src/middlewared/middlewared/plugins/zfs/exceptions.py:14`) \u2014 `ZFSKeyAlreadyLoadedException` (line 14) and `ZFSNotEncryptedException` (line 20) both inherit directly from `Exception`.\n- \u2026 and 3 more (see All Findings by Severity)\n\n**3 suggestion(s) and style note(s):**\n\n- \ud83d\udd35 No double-injection bug: explicit tls passing is correct for direct calls (`src/middlewared/middlewared/plugins/kmip/zfs_keys.py:138`)\n- \ud83d\udd35 No test covers the newly-enforced rejection path (passphrase root + key-encrypted child roots) (`src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py:248`)\n- \u26aa Original `tls`-injection concern is a false alarm: decorator order is correct and `tls` is never visible to the lock lambda (`src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py:158`)\n\n**Files with findings:** `src/middlewared/middlewared/api/v26_0_0/pool.py`, `src/middlewared/middlewared/api/v26_0_0/pool_dataset.py`, 
`src/middlewared/middlewared/plugins/kmip/zfs_keys.py`, `src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py`, `src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py`, `src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py`, `src/middlewared/middlewared/plugins/zfs/encryption.py`, `src/middlewared/middlewared/plugins/zfs/exceptions.py`\n\n
\nAll Findings by Severity\n\n#### \ud83d\udd34 Critical (2)\n\n- **zfs_keys cache silently wiped on every push/pull: `k in existing_datasets` checks string in list-of-dicts** `src/middlewared/middlewared/plugins/kmip/zfs_keys.py:94`\n- **Missing `id` argument in `datastore.update` call \u2014 wrong argument count, update never applied to correct row** `src/middlewared/middlewared/plugins/kmip/zfs_keys.py:157`\n\n#### \ud83d\udfe0 Important (9)\n\n- **Old guard was always False: key-encrypted child under passphrase-root inheritance was never blocked** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py:248`\n- **ZFSKeyAlreadyLoadedException and ZFSNotEncryptedException silently swallowed as string errors instead of structured CallError** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py:229`\n- **from_previous fires on write only; legacy API callers have pbkdf2iters silently upgraded to 1,300,000 without any notification** `src/middlewared/middlewared/api/v26_0_0/pool_dataset.py:183`\n- **`sync_db_keys` lock lambda embeds the full args list, causing inconsistent lock keys between periodic and explicit calls** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py:161`\n- **Existing passphrase-encrypted datasets silently re-keyed at 3.7x higher iteration count on next change_key call via any API version** `src/middlewared/middlewared/api/v26_0_0/pool_dataset.py:175`\n- **Custom ZFS exceptions inherit from plain Exception instead of CallError, breaking structured error propagation across all callers** `src/middlewared/middlewared/plugins/zfs/exceptions.py:14`\n- **ZFSNotEncryptedException from change_key() propagates as raw Exception to WebSocket API layer \u2014 no CallError wrapping** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py:200`\n- **Raw truenas_pylibzfs.ZFSException from crypto.load_key() propagates out of encryption.load_key() undecorated, breaking the old CallError 
contract for any caller outside unlock** `src/middlewared/middlewared/plugins/zfs/encryption.py:34`\n- **3.7x PBKDF2 iteration increase enforced with no hardware capability check; may cause passphrase unlock timeouts making datasets inaccessible** `src/middlewared/middlewared/api/v26_0_0/pool.py:151`\n\n#### \ud83d\udd35 Suggestion (2)\n\n- **No double-injection bug: explicit tls passing is correct for direct calls** `src/middlewared/middlewared/plugins/kmip/zfs_keys.py:138`\n- **No test covers the newly-enforced rejection path (passphrase root + key-encrypted child roots)** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py:248`\n\n#### \u26aa Nitpick (1)\n\n- **Original `tls`-injection concern is a false alarm: decorator order is correct and `tls` is never visible to the lock lambda** `src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py:158`\n\n
\n\n
\nReview Process Details\n\n**Dimensions Analyzed (6):**\n\n- **Exception contract change in load_key: typed exceptions vs. CallError** \u2014 2 file(s)\n- **KMIP double-injection: @pass_thread_local_storage + explicit tls arg causes TypeError** \u2014 1 file(s)\n- **Exception contract break: ZFSKeyAlreadyLoadedException / ZFSNotEncryptedException caught by bare except as string, not CallError** \u2014 3 file(s)\n- **ZFSKeyFormat enum comparison fix silently activates previously dead guard** \u2014 1 file(s)\n- **pbkdf2iters silent upgrade via from_previous: latency regression and breakage for automation** \u2014 2 file(s)\n- **Decorator ordering: @pass_thread_local_storage above @job \u2014 does @job lambda see pre- or post-tls-injection arg list?** \u2014 1 file(s)\n\n**Meta-Dimension Lenses (3):**\n\n- **Semantic** \u2014 5 dimension(s), 88% coverage confidence\n- **Mechanical** \u2014 3 dimension(s), 87% coverage confidence\n- **Systemic** \u2014 2 dimension(s), 82% coverage confidence\n\n
\n\n
\nPipeline Stats\n\n| Metric | Value |\n|--------|-------|\n| Duration | 1808.7s |\n| Agent invocations | 11 |\n| Coverage iterations | 0 |\n| Estimated cost | N/A (provider does not report cost) |\n| Budget exhausted | Yes (timeout: 1808s > 900s limit) |\n| PR type | refactor |\n| Complexity | standard |\n\n
\n\nReview ID: `rev_07c8d4f2bf5a`", - "comments": [ - { - "body": "\ud83d\udfe0 **[IMPORTANT] Old guard was always False: key-encrypted child under passphrase-root inheritance was never blocked**\n\n**The old comparison was provably always `False`.**\n\nIn the prior code (`bde8f1de3b`), the guard in `inherit_parent_encryption_properties_impl` read:\n\n```python\nif ZFSKeyFormat(parent_encrypted_root.key_format.value) == ZFSKeyFormat.PASSPHRASE.value:\n```\n\nThe left-hand side is `ZFSKeyFormat('PASSPHRASE')` \u2014 a `ZFSKeyFormat` enum *instance* \u2014 while the right-hand side is `ZFSKeyFormat.PASSPHRASE.value` \u2014 the raw string `'PASSPHRASE'`. Python's `==` for `Enum` instances does **not** fall back to comparing against the `.value`; an enum member only equals itself (or another member with the same identity), never a plain string. This was verified:\n\n```\nZFSKeyFormat('PASSPHRASE') == 'PASSPHRASE' # \u2192 False, always\n```\n\n**What the guard was supposed to do:** prevent a key-encrypted dataset (`id_`) that has its own key-encrypted child encryption roots from inheriting a passphrase-encrypted parent root. If such a dataset were allowed to inherit, its key-encrypted children would end up under a passphrase root, violating the invariant that passphrase roots cannot have key-encrypted encryption-root descendants.\n\n**Behavioral change introduced by the fix:** The new code uses:\n\n```python\nif parent_encrypted_root['key_format']['value'] == ZFSKeyFormat.PASSPHRASE.value:\n```\n\nThis is a string-to-string comparison (`'PASSPHRASE' == 'PASSPHRASE'`) that evaluates to `True` correctly. For the first time, the inner `any(...)` check that looks for key-encrypted child encryption roots is actually executed, and if any are found, a `CallError` is raised, preventing the operation.\n\n**Concrete scenario now blocked that was previously silently allowed:**\n\n1. Pool `tank` has dataset `tank/passroot` encrypted with a passphrase (encryption root).\n2. 
Under it, `tank/passroot/keyroot` is a key-encrypted encryption root (HEX format).\n3. Under `keyroot`, `tank/passroot/keyroot/keychild` is *also* a key-encrypted encryption root.\n4. A user calls `pool.dataset.inherit_parent_encryption_properties('tank/passroot/keyroot')`.\n5. **Old code:** guard fires `False`, inner check is skipped, `change_encryption_root` executes. `keyroot` now falls under `passroot`'s passphrase root, but `keychild` remains a separate key-encrypted root under a passphrase root \u2014 an explicitly forbidden structure.\n6. **New code:** guard fires `True`, inner `any()` detects `keychild`, raises `CallError` with a clear message. The operation is rejected.\n\n**Does any existing production workflow depend on the old no-op guard?** The only test exercising `inherit_parent_encryption_properties` (`test_key_encrypted_dataset` at line 404) uses a *hex-key* parent root, so `parent_encrypted_root['key_format']['value'] == 'HEX'`, and the guard evaluates to `False` in both old and new code. That test is unaffected. There is no test covering the now-enforced case (passphrase parent root + key-encrypted child roots), which is the exact gap described below.\n\n---\n\n> Step 1: Old code at `bde8f1de3b` line ~222: `if ZFSKeyFormat(parent_encrypted_root.key_format.value) == ZFSKeyFormat.PASSPHRASE.value:`\n> Step 2: `parent_encrypted_root.key_format.value` is a string, e.g. 
`'PASSPHRASE'`.\n> Step 3: `ZFSKeyFormat('PASSPHRASE')` constructs `ZFSKeyFormat.PASSPHRASE`, an enum instance.\n> Step 4: `ZFSKeyFormat.PASSPHRASE == 'PASSPHRASE'` \u2192 `False` (Python Enum.__eq__ compares member identity, not value string).\n> Step 5: The `if` body (the `any()` child-root check and potential `raise CallError`) is NEVER reached regardless of input.\n> Step 6: `change_encryption_root` / `zfs.dataset.change_encryption_root` always executes even when the parent root is passphrase-encrypted and the dataset has key-encrypted child roots.\n> Verification: `python3 -c \"from enum import Enum; E = Enum('E', {'P': 'PASSPHRASE'}); print(E('PASSPHRASE') == 'PASSPHRASE')\"` prints `False`.\n\n**\ud83d\udca1 Suggested Fix**\n\nThe fix is correct. The only follow-up needed is a regression test for the newly-enforced path: create a passphrase-encrypted root, a key-encrypted encryption root beneath it, and a second key-encrypted encryption root as a child of that \u2014 then assert that `inherit_parent_encryption_properties` on the middle dataset raises a `CallError`. This ensures the guard remains correct if the code is refactored again.\n\n---\n*`Enum vs String Comparison Bug in Encryption Root Guard` \u00b7 confidence 98%*", - "line": 248, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] ZFSKeyAlreadyLoadedException and ZFSNotEncryptedException silently swallowed as string errors instead of structured CallError**\n\nThe bare `except Exception as e` branch on line 229 catches `ZFSKeyAlreadyLoadedException` and `ZFSNotEncryptedException` (both plain `Exception` subclasses from `zfs/exceptions.py`) and converts them to `failed[name]['error'] = str(e)` \u2014 a raw string embedded in the return value dict.\n\nThis is a contract violation because:\n1. 
These exceptions are **pre-condition guards** (dataset not encrypted, or key already loaded) that signal programmer/caller errors, not transient ZFS crypto failures. Treating them identically to \"Invalid Key\" hides the actual cause.\n2. The `unlock` API method's structured return `{'unlocked': [...], 'failed': {...}}` will surface these as opaque string errors (e.g. `\"'pool/ds' key is already loaded\"`) with no errno or structured error code, making it impossible for callers to distinguish pre-condition failures from crypto failures.\n3. The old code path (before `load_key` was extracted to `zfs/encryption.py`) presumably raised `CallError` directly \u2014 the refactoring broke this by introducing new exception types without updating the catch sites.\n\nSpecifically:\n- `ZFSKeyAlreadyLoadedException` raised at `encryption.py:33` falls into `except Exception` at `dataset_encryption_lock.py:229`\n- `ZFSNotEncryptedException` raised at `encryption.py:31` similarly falls into `except Exception` at `dataset_encryption_lock.py:229`\n\nNeither is ever re-raised as a `CallError`.\n\n---\n\n> Step 1: `unlock` calls `load_key(tls, name, key=datasets[name]['key'])` at line 222.\n> Step 2: `load_key` in `zfs/encryption.py:31` calls `rsrc.crypto()`, and if it returns `None`, raises `ZFSNotEncryptedException(dataset)` \u2014 a subclass of plain `Exception` (confirmed at `exceptions.py:20`).\n> Step 3: `load_key` at `encryption.py:33` raises `ZFSKeyAlreadyLoadedException(dataset)` if `crypto.info().key_is_loaded` is True \u2014 also a plain `Exception` subclass (`exceptions.py:14`).\n> Step 4: Neither exception is a `ZFSException` subclass (imported from `truenas_pylibzfs`), so the `except ZFSException as e` block at line 223 does NOT catch them.\n> Step 5: They fall through to `except Exception as e` at line 229, where `failed[name]['error'] = str(e)` stores the message string `\"'pool/ds' key is already loaded\"` or `\"'pool/ds' is not encrypted\"` \u2014 no `CallError`, no 
errno.\n\n**\ud83d\udca1 Suggested Fix**\n\nEither (a) make `ZFSKeyAlreadyLoadedException` and `ZFSNotEncryptedException` inherit from `CallError` (with appropriate `errno` values such as `errno.ENOTSUP` for not-encrypted and `errno.EEXIST` for already-loaded), OR (b) add an explicit catch before the bare `except Exception` block:\n```python\nfrom middlewared.plugins.zfs.exceptions import ZFSKeyAlreadyLoadedException, ZFSNotEncryptedException\n\ntry:\n load_key(tls, name, key=datasets[name]['key'])\nexcept ZFSKeyAlreadyLoadedException:\n # Key already loaded means dataset is effectively unlocked; treat as success or specific error\n failed[name]['error'] = 'Key is already loaded'\n continue\nexcept ZFSNotEncryptedException:\n failed[name]['error'] = 'Dataset is not encrypted'\n continue\nexcept ZFSException as e:\n ...\nexcept Exception as e:\n failed[name]['error'] = str(e)\n continue\n```\nOption (a) is cleaner and ensures these exceptions carry structured error information everywhere they propagate.\n\n---\n*`Exception Handling Contract` \u00b7 confidence 95%*", - "line": 229, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_lock.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] from_previous fires on write only; legacy API callers have pbkdf2iters silently upgraded to 1,300,000 without any notification**\n\n**`from_previous` is invoked exclusively on incoming write operations (argument upgrade), never on reads (API responses).**\n\nThe `APIVersionsAdapter` in `legacy_api_method.py` upgrades incoming parameters from an older API version to the current version via `_adapt_params`, which calls `adapter.adapt(params_dict, model_name, self.api_version, self.adapter.current_version)`. 
Because `version1_index < version2_index` the direction resolves to `Direction.UPGRADE`, triggering `new_model.from_previous(value)` at `version.py:233`.\n\nConversely, `_dump_result` adapts the **result** from `current_version` back to `api_version` (downgrade direction), which calls `to_previous`. Neither `PoolDatasetChangeKeyOptions` nor `PoolCreateEncryptionOptions` define `to_previous`, so outgoing responses are never touched.\n\n**Practical impact:** An automation client or script pinned to API v25.x that deliberately submits `pbkdf2iters=350000` (valid under `ge=100000` in v25.10.x) will have that value silently overwritten to `1300000` by `from_previous` before the `change_key` handler executes. The caller receives `{\"result\": null}` \u2014 the standard success response for `PoolDatasetChangeKeyResult` \u2014 with no indication that a different iteration count was actually applied to ZFS.\n\nNote: `pbkdf2iters` is only forwarded to the ZFS layer when `passphrase_key_format=True` (plugin line 114), so this affects only passphrase-encrypted datasets. For raw-hex keyed datasets `pbkdf2iters` is excluded from `opts` entirely and no iteration count is stored.\n\n---\n\n> Step 1: Client on API v25.10.2 calls `pool.dataset.change_key` with `options={\"pbkdf2iters\": 350000, \"passphrase\": \"mypass\"}`. 
Old model allows this: `pbkdf2iters: int = Field(default=350000, ge=100000)` (v25_10_2/pool_dataset.py:175).\n> Step 2: `LegacyAPIMethod.call()` (legacy_api_method.py:60) calls `_adapt_params()` \u2192 `adapter.adapt(params_dict, 'PoolDatasetChangeKeyArgs', 'v25.10.2', 'v26.0.0')`.\n> Step 3: `adapt_model` computes `version1_index < version2_index` \u2192 `direction = Direction.UPGRADE`.\n> Step 4: `_adapt_value` on `PoolDatasetChangeKeyArgs` calls `_adapt_nested_value` on the `options` field because both versions define a model named `PoolDatasetChangeKeyOptions`; this triggers a recursive `_adapt_value` call.\n> Step 5: At the end of the nested `_adapt_value`, line 233 of version.py: `value = new_model.from_previous(value)` where `new_model` is v26_0_0's `PoolDatasetChangeKeyOptions`.\n> Step 6: `from_previous` (pool_dataset.py:185) executes `value['pbkdf2iters'] = max(1300000, 350000)` \u2192 `1300000`.\n> Step 7: `change_key` plugin receives `options['pbkdf2iters'] == 1300000`, passes it to `validate_encryption_data` (line 191), which includes it in `opts` because `passphrase_key_format=True` (line 114).\n> Step 8: `zfs/encryption.py::change_key()` permanently stores `pbkdf2iters=1300000` in the dataset's ZFS config.\n> Step 9: `_dump_result` downgrades `{\"result\": null}` \u2014 no clamping info is surfaced.\n\n**\ud83d\udca1 Suggested Fix**\n\nAt minimum, emit a job log warning when `pbkdf2iters` is clamped upward. A job-status message such as `job.set_progress(0, f'Note: pbkdf2iters elevated from submitted value to {options[\"pbkdf2iters\"]}')` would make the override visible to operators. 
Longer-term, consider returning the effective `pbkdf2iters` in the result payload or adding a `to_previous` on the result model so legacy clients can detect the discrepancy.\n\n---\n*`PBKDF2 Iteration Count Silent Migration` \u00b7 confidence 95%*", - "line": 183, - "path": "src/middlewared/middlewared/api/v26_0_0/pool_dataset.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] `sync_db_keys` lock lambda embeds the full args list, causing inconsistent lock keys between periodic and explicit calls**\n\nThe `lock` lambda on `sync_db_keys` uses `args` (the entire raw-arguments list) rather than `args[0]` (the first positional argument, `name`):\n\n```python\n@job(lock=lambda args: f'sync_encrypted_pool_dataset_keys_{args}')\ndef sync_db_keys(self, job, tls, name=None):\n```\n\nThe `@job` and `@pass_thread_local_storage` decorators are both **pure marker decorators** \u2014 they stamp attributes on the function and return it unchanged. `Job.__init__` stores the raw caller-supplied `params` list as `self.args`, and the lock lambda is evaluated with that list before the job is queued (in `JobsQueue.handle_lock` \u2192 `Job.get_lock_name`). The `tls` object is injected at run time in `Job.__run_body`, well after lock computation, so `tls` is **not** visible to the lambda.\n\nThe real problem is that `name` has a default of `None`. 
This means:\n\n| Call site | `self.args` passed to lambda | Resulting lock key |\n|---|---|---|\n| Periodic scheduler (no args) | `[]` | `sync_encrypted_pool_dataset_keys_[]` |\n| `call_sync('pool.dataset.sync_db_keys', 'tank')` | `['tank']` | `sync_encrypted_pool_dataset_keys_['tank']` |\n| `call_sync('pool.dataset.sync_db_keys', None)` | `[None]` | `sync_encrypted_pool_dataset_keys_[None]` |\n\nThe periodic invocation produces the key `sync_encrypted_pool_dataset_keys_[]` while an explicit `sync_db_keys(None)` produces `sync_encrypted_pool_dataset_keys_[None]` \u2014 these are **different lock keys**, so the two calls do NOT share a lock and can run concurrently. This defeats the purpose of the lock for the all-datasets sync case.\n\nBy contrast, the `encryption_summary` lock lambda on the same class correctly uses `args[0]`:\n```python\n@job(lock=lambda args: f'encryption_summary_options_{args[0]}', ...)\n```\n\nAdditionally, the lock key includes Python list-repr brackets (e.g., `['tank']`) rather than a clean string like `tank`, making the key non-human-readable and fragile if calling conventions change.\n\n---\n\n> Step 1: `sync_db_keys` is decorated with `@job(lock=lambda args: f'sync_encrypted_pool_dataset_keys_{args}')` at line 161.\n> Step 2: `@job` is a pure marker decorator (`decorators.py:153-166`) \u2014 it sets `fn._job = {'lock': lock, ...}` and returns `fn` unchanged.\n> Step 3: `_call_prepare` in `main.py:880` constructs `Job(self, name, serviceobj, methodobj, params, ...)` where `params` is the raw caller-supplied arguments list.\n> Step 4: `Job.__init__` at `job.py:333` stores `self.args = args` (the `params` parameter passed in).\n> Step 5: `JobsQueue.add` at `job.py:149` calls `self.handle_lock(job)`, which calls `job.get_lock_name()` at `job.py:422`: `lock_name = lock_name(self.args)` \u2014 so the lambda receives the raw `params` list.\n> Step 6: Periodic scheduler calls `sync_db_keys` with zero user arguments \u2192 `params = []` \u2192 
lambda receives `[]` \u2192 lock key is `sync_encrypted_pool_dataset_keys_[]`.\n> Step 7: Explicit `call_sync('pool.dataset.sync_db_keys', None)` \u2192 `params = [None]` \u2192 lambda receives `[None]` \u2192 lock key is `sync_encrypted_pool_dataset_keys_[None]`.\n> Step 8: Keys differ \u2192 neither invocation blocks the other \u2192 two full-dataset syncs can run concurrently.\n\n**\ud83d\udca1 Suggested Fix**\n\nChange the lambda to extract only the first argument and normalize a missing argument to `None`, mirroring the pattern used by `encryption_summary`:\n\n```python\n@job(lock=lambda args: f'sync_encrypted_pool_dataset_keys_{args[0] if args else None}')\n```\n\nThis ensures:\n- A periodic call (no args) and an explicit `call(..., None)` both produce the same lock key: `sync_encrypted_pool_dataset_keys_None`\n- A call with a specific pool name produces `sync_encrypted_pool_dataset_keys_tank`\n- The key no longer contains list brackets\n\n---\n*`Decorator Order and Lock Key Correctness` \u00b7 confidence 92%*", - "line": 161, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Existing passphrase-encrypted datasets silently re-keyed at 3.7x higher iteration count on next change_key call via any API version**\n\n**Existing datasets with `pbkdf2iters` between 100,000 and 1,299,999 will have their iteration count permanently changed to 1,300,000 on the next `change_key` call, regardless of whether the user explicitly requested this change.**\n\nThere are two distinct triggers:\n\n1. **Legacy API client omits `pbkdf2iters`:** The v25.10.x default was 350,000. When a v25.x client calls `change_key` without specifying `pbkdf2iters`, `_adapt_value` fills in the missing field using the **v26.0.0 new default** of `1300000` (version.py:226: `value[key_to_use] = field_info.get_default(call_default_factory=True)`). 
`from_previous` then sees `max(1300000, 1300000)` which is a no-op \u2014 but the applied value is the new default, not what the user would have expected from their v25.x context.\n\n2. **Legacy API client explicitly submits `pbkdf2iters=350000`:** `from_previous` clamps it to 1,300,000 as described in the companion finding.\n\nIn both cases, `change_key` permanently alters the ZFS dataset property `pbkdf2iters`. Once a dataset is re-keyed at 1,300,000 iterations, every subsequent passphrase-unlock of that dataset (at boot, during HA failover, or via `pool.dataset.unlock`) will run PBKDF2 with 1,300,000 iterations. The user never saw a prompt asking to confirm this change, and the API response `{\"result\": null}` provides no visibility into what iteration count was applied.\n\n**Scope:** Only passphrase-encrypted datasets are affected (line 114 of `dataset_encryption_operations.py` guards `pbkdf2iters` inclusion on `passphrase_key_format=True`). Raw-hex keyed datasets are not affected.\n\n---\n\n> Step 1: User has a passphrase-encrypted dataset with `pbkdf2iters=350000` (set under v25.x).\n> Step 2: User or script calls `pool.dataset.change_key` via v25.x API client without specifying `pbkdf2iters`.\n> Step 3: `_adapt_value` (version.py:224-227) detects `pbkdf2iters` is absent; since the field has a default in v26 (`1300000`), it fills: `value['pbkdf2iters'] = 1300000`.\n> Step 4: `from_previous` is a no-op for `max(1300000, 1300000)`, but the effective value is now 1,300,000 instead of the user's expected 350,000.\n> Step 5: `change_key` plugin line 191 passes `pbkdf2iters: 1300000` to `validate_encryption_data`.\n> Step 6: Since `passphrase_key_format=True`, line 114 includes `pbkdf2iters` in `opts`.\n> Step 7: `zfs/encryption.py::change_key()` writes `pbkdf2iters=1300000` permanently to ZFS.\n> Step 8: API returns `{\"result\": null}` \u2014 no indication the iteration count was elevated.\n\n**\ud83d\udca1 Suggested Fix**\n\nCompare `options['pbkdf2iters']` 
against the dataset's current stored iteration count before applying the change (available via `ds['pbkdf2iters']['parsed']` from `get_instance_quick`). If the value is being elevated due to the minimum-floor and not due to the user explicitly passing the new value, emit a warning. Consider adding a `pbkdf2iters_effective` field to `PoolDatasetChangeKeyResult` so callers can detect the actual value applied.\n\n---\n*`PBKDF2 Iteration Count Silent Migration` \u00b7 confidence 92%*", - "line": 175, - "path": "src/middlewared/middlewared/api/v26_0_0/pool_dataset.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Custom ZFS exceptions inherit from plain Exception instead of CallError, breaking structured error propagation across all callers**\n\n`ZFSKeyAlreadyLoadedException` (line 14) and `ZFSNotEncryptedException` (line 20) both inherit directly from `Exception`. This is the root cause of the contract break identified in the other findings.\n\nIn the TrueNAS middleware architecture, user-facing errors are expected to be `CallError` instances (with an `errno` attribute). Any unhandled non-`CallError` exception that escapes a service method is treated as an internal server error by the WebSocket API layer, producing unstructured error responses.\n\nBy making these exceptions plain `Exception` subclasses:\n1. Every call site that calls `load_key()`, `check_key()`, `change_key()`, or `change_encryption_root()` must manually wrap exceptions to convert them to `CallError` \u2014 creating a systemic catch-site gap.\n2. Existing bare `except Exception` handlers (as in `dataset_encryption_lock.py:229`) silently absorb them as string errors with no errno, making them indistinguishable from other failures.\n3. 
The `.message` attribute is redundant with `str(e)` since `super().__init__(self.message)` already sets the string representation \u2014 the `.message` attribute is never used by any handler.\n\n---\n\n> Step 1: `exceptions.py:14` \u2014 `class ZFSKeyAlreadyLoadedException(Exception)` \u2014 base class is plain `Exception`.\n> Step 2: `exceptions.py:20` \u2014 `class ZFSNotEncryptedException(Exception)` \u2014 base class is plain `Exception`.\n> Step 3: These are imported and raised in `zfs/encryption.py` at lines 31, 33, 58, 88, 105.\n> Step 4: `dataset_encryption_lock.py:229` and `dataset_encryption_operations.py:200,263` are call sites with no conversion to `CallError`.\n> Step 5: The middleware WebSocket error dispatch (not read, but standard TrueNAS architecture) wraps `CallError` into structured JSON error responses with errno codes; plain `Exception` becomes an unstructured internal error.\n\n**\ud83d\udca1 Suggested Fix**\n\nChange the base class of both exceptions to `CallError` with appropriate errno values:\n```python\nfrom middlewared.service.core import CallError # or wherever CallError is importable\nimport errno\n\nclass ZFSKeyAlreadyLoadedException(CallError):\n def __init__(self, path: str):\n super().__init__(f\"{path!r} key is already loaded\", errno=errno.EEXIST)\n\nclass ZFSNotEncryptedException(CallError):\n def __init__(self, path: str):\n super().__init__(f\"{path!r} is not encrypted\", errno=errno.ENOTSUP)\n```\nThis ensures that wherever these exceptions propagate \u2014 through `except Exception`, `except CallError`, or unhandled \u2014 they carry structured error information and are handled correctly by the middleware's error dispatch layer. 
Note: verify there are no circular import issues between `middlewared.plugins.zfs` and `middlewared.service`; if so, an intermediate base class in `zfs/exceptions.py` may be needed.\n\n---\n*`Exception Handling Contract` \u00b7 confidence 90%*", - "line": 14, - "path": "src/middlewared/middlewared/plugins/zfs/exceptions.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] ZFSNotEncryptedException from change_key() propagates as raw Exception to WebSocket API layer \u2014 no CallError wrapping**\n\n`dataset_encryption_operations.py:200` calls `change_key(tls, id_, encryption_dict, key)` with no surrounding try/except. The `change_key` function in `zfs/encryption.py:87-88` can raise `ZFSNotEncryptedException` if `rsrc.crypto()` returns `None`.\n\nAlthough the `change_key` method does validate `ds['encrypted']` at line 134 via `verrors.add`, this is a **database/metadata check** \u2014 it does NOT prevent a race condition where the ZFS state diverges from the database (e.g. dataset was recreated between the query and the `change_key` call). 
If the ZFS layer reports the dataset as unencrypted but the DB still has it marked encrypted, `ZFSNotEncryptedException` will propagate all the way to the WebSocket API layer as an unhandled `Exception`, not a `CallError`.\n\nSimilarly, `change_encryption_root` at `dataset_encryption_operations.py:263` calls `change_encryption_root(tls, id_)` which also raises `ZFSNotEncryptedException` at `encryption.py:104-105` with no catch.\n\n---\n\n> Step 1: `change_key` method in `dataset_encryption_operations.py:200` calls `change_key(tls, id_, encryption_dict, key)` with no try/except.\n> Step 2: `change_key` in `zfs/encryption.py:86-88`: `rsrc = open_resource(tls, dataset); if (crypto := rsrc.crypto()) is None: raise ZFSNotEncryptedException(dataset)`.\n> Step 3: `ZFSNotEncryptedException` inherits from `Exception` (confirmed at `exceptions.py:20`), NOT from `CallError`.\n> Step 4: No catch exists between `encryption.py:88` and the WebSocket layer. The exception propagates as a raw `Exception`.\n> Step 5: The WebSocket API layer expects `CallError` for user-facing error messages with structured errno codes. 
A raw `Exception` results in an unstructured 500-style error.\n> Same path applies to `change_encryption_root` at `dataset_encryption_operations.py:263` calling `encryption.py:103-105`.\n\n**\ud83d\udca1 Suggested Fix**\n\nWrap the `change_key` and `change_encryption_root` calls with try/except to convert `ZFSNotEncryptedException` (and `ZFSKeyAlreadyLoadedException` if applicable) into `CallError`:\n```python\nfrom middlewared.plugins.zfs.exceptions import ZFSNotEncryptedException\n\ntry:\n change_key(tls, id_, encryption_dict, key)\nexcept ZFSNotEncryptedException as e:\n raise CallError(str(e), errno=errno.ENOTSUP)\n```\nAlternatively, make `ZFSNotEncryptedException` a subclass of `CallError` with a fixed errno so it automatically presents correctly to all callers throughout the codebase.\n\n---\n*`Exception Handling Contract` \u00b7 confidence 82%*", - "line": 200, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] Raw truenas_pylibzfs.ZFSException from crypto.load_key() propagates out of encryption.load_key() undecorated, breaking the old CallError contract for any caller outside unlock**\n\nIn the old `zfs.dataset.load_key` service method, all `libzfs.ZFSException` instances were caught and re-raised as `CallError`. In the new `encryption.py:load_key()`, the call to `crypto.load_key(**kwargs)` at line 34 is **not wrapped in any try/except**.\n\nAny `truenas_pylibzfs.ZFSException` raised by `crypto.load_key()` propagates directly out of `encryption.load_key()` back to its caller with:\n- A `.code` attribute (a `ZFSError` enum value)\n- **No `.errmsg`** or **`.errno`** fields in the `CallError` sense\n- No `CallError` wrapping\n\nFor the `unlock` call path in `dataset_encryption_lock.py`, this is handled correctly: `except ZFSException as e:` at line 223 catches these and processes `EZFS_CRYPTOFAILED` vs. other codes. 
So the current only caller handles it.\n\nHowever, the **API contract has silently changed**: any other present or future caller of `encryption.load_key()` that expects `CallError` (because the old `zfs.dataset.load_key` always raised `CallError`) will receive raw `ZFSException` instead. If such a caller reaches the WebSocket dispatch layer without intermediate handling, `websocket_app.py:196-207` catches the bare `Exception`, calls `adapt_exception(e)` (which only handles `subprocess.CalledProcessError` \u2014 not `ZFSException`), and falls back to `send_error(message, EINVAL, str(e))`, losing the original ZFS error code entirely and emitting a generic `EINVAL` to the client.\n\n---\n\n> Step 1: `encryption.py:load_key()` calls `crypto.load_key(**kwargs)` at line 34 with no surrounding try/except block.\n> Step 2: `truenas_pylibzfs.ZFSException` is the exception type raised by `crypto.load_key()` on failure (e.g., wrong key \u2192 `EZFS_CRYPTOFAILED`).\n> Step 3: `ZFSException` has a `.code` attribute (a `ZFSError` enum), but no `.errmsg` or `.errno` in the `CallError` sense.\n> Step 4: The old service method `zfs.dataset.load_key` caught all `libzfs.ZFSException` and re-raised as `CallError` \u2014 all callers expected `CallError`.\n> Step 5: A hypothetical new caller of `encryption.load_key()` that does not import `truenas_pylibzfs.ZFSException` and uses only `except CallError` will miss the exception.\n> Step 6: That uncaught `ZFSException` reaches `websocket_app.py:196`, `adapt_exception(e)` returns `None` (only handles `CalledProcessError`), and `send_error(message, EINVAL, str(e))` emits an unstructured `EINVAL` response to the client.\n\n**\ud83d\udca1 Suggested Fix**\n\nEither:\n1. **Document the contract explicitly** in `load_key()`'s docstring: state that it may raise `truenas_pylibzfs.ZFSException` directly (in addition to `ZFSNotEncryptedException` and `ZFSKeyAlreadyLoadedException`), so all callers know they must handle `ZFSException`.\n2. 
**Convert at the boundary**: wrap `crypto.load_key(**kwargs)` in a try/except that re-raises as a typed domain exception (e.g., add `ZFSLoadKeyException` to `exceptions.py`), so `encryption.py` never leaks `truenas_pylibzfs` types to callers:\n```python\ntry:\n crypto.load_key(**kwargs)\nexcept ZFSException as e:\n if e.code == ZFSError.EZFS_CRYPTOFAILED:\n raise ZFSInvalidKeyException(dataset) from e\n raise\n```\nOption 2 is the cleaner design: it keeps `truenas_pylibzfs` as an internal implementation detail.\n\n---\n*`Exception Handling and Error Flow` \u00b7 confidence 80%*", - "line": 34, - "path": "src/middlewared/middlewared/plugins/zfs/encryption.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udfe0 **[IMPORTANT] 3.7x PBKDF2 iteration increase enforced with no hardware capability check; may cause passphrase unlock timeouts making datasets inaccessible**\n\n**The 3.7x increase from 350,000 to 1,300,000 PBKDF2 iterations is applied unconditionally with no runtime check for hardware capability. On low-power or embedded hardware, this can cause passphrase-based key derivation to exceed unlock timeouts, making encrypted datasets permanently inaccessible without manual CLI intervention.**\n\nOnce a passphrase-encrypted dataset is re-keyed with `pbkdf2iters=1300000` (whether explicitly or via the silent clamping in `from_previous`), every future unlock attempt runs PBKDF2-SHA256 with 1,300,000 iterations synchronously. On ARM SoCs and Atom-class CPUs common in consumer NAS hardware:\n- At 350,000 iters: typically ~0.5\u20131 second per dataset\n- At 1,300,000 iters: typically ~2\u20134 seconds per dataset\n\nFor pools with multiple passphrase-encrypted datasets that must all unlock at pool import (a common TrueNAS configuration), unlock times multiply linearly. 
If this occurs during boot under a systemd service timeout, or during HA failover under a failover timeout, the unlock will fail \u2014 and with `ge=1300000` enforced as the hard minimum, there is **no API path** to reduce the iteration count back down without using the ZFS CLI directly (`zfs change-key -o pbkdf2iters=...`).\n\nThe `change_key` plugin (`dataset_encryption_operations.py:118`) does not measure or estimate key derivation time before applying the new iteration count. Neither `PoolCreateEncryptionOptions` nor `PoolDatasetChangeKeyOptions` expose any per-hardware tuning path below the new minimum.\n\nNote: `PoolCreateEncryptionOptions.from_previous` in `pool.py:152` applies the same clamping on pool creation encryption options. For new pool creation this affects the root dataset's initial encryption setup, not just re-keying.\n\n---\n\n> Step 1: Passphrase-encrypted dataset is re-keyed to `pbkdf2iters=1300000` via `change_key` (either explicitly or via silent clamping from `from_previous`).\n> Step 2: `dataset_encryption_operations.py:191` passes `pbkdf2iters: options['pbkdf2iters']` to `validate_encryption_data`.\n> Step 3: `validate_encryption_data` line 114 includes `pbkdf2iters` in `opts` when `passphrase_key_format=True`.\n> Step 4: `zfs/encryption.py::change_key()` line 89 calls `tls.lzh.resource_cryptography_config(**props)` with `pbkdf2iters=1300000`, permanently recording it as a ZFS dataset property.\n> Step 5: On the next pool import or `pool.dataset.unlock`, ZFS runs PBKDF2-SHA256 with 1,300,000 iterations to derive the wrapping key from the passphrase.\n> Step 6: On low-power hardware (e.g., Cortex-A53 at 1.4GHz, ~350k iters/sec for PBKDF2-SHA256), this takes ~3.7 seconds per dataset. 
With 5 passphrase datasets: ~18.5 seconds total.\n> Step 7: If a systemd or HA failover timeout fires during this window, unlock fails; dataset remains locked.\n> Step 8: The `ge=1300000` constraint on `PoolDatasetChangeKeyOptions` means there is no supported API path to reduce `pbkdf2iters` on an already-re-keyed dataset \u2014 only direct ZFS CLI access can recover.\n\n**\ud83d\udca1 Suggested Fix**\n\nConsider the following mitigations: (1) **Benchmark gate:** Before applying `change_key` with a high `pbkdf2iters`, run a short PBKDF2 benchmark and warn or reject if estimated unlock time exceeds a configurable threshold. (2) **System-wide override:** Allow a `tunable` or system config option to set a lower `pbkdf2iters` ceiling for constrained hardware, overriding the API minimum for that installation. (3) **Recovery documentation:** Explicitly document that `zfs change-key -o pbkdf2iters=` is available as a recovery path if unlock times become prohibitive. (4) **Job warning:** At minimum, have the `change_key` job emit a progress message noting the effective iteration count when it exceeds the old default.\n\n---\n*`PBKDF2 Iteration Count Silent Migration` \u00b7 confidence 75%*", - "line": 151, - "path": "src/middlewared/middlewared/api/v26_0_0/pool.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] No double-injection bug: explicit tls passing is correct for direct calls**\n\n`@pass_thread_local_storage` is a **marker-only decorator** \u2014 it sets `fn._pass_thread_local_storage = True` and returns `fn` unchanged (`decorators.py:221-222`). The actual `tls` injection happens only at API dispatch time: in `main.py:862-865` for normal methods and `job.py:620-621` for `@job` methods.\n\nWhen `sync_zfs_keys` calls `self.push_zfs_keys(tls, ids)` and `self.pull_zfs_keys(tls)` directly (lines 138 and 142), these are **plain Python method calls** \u2014 they bypass the middleware dispatch system entirely. 
The `_pass_thread_local_storage` attribute on `push_zfs_keys` and `pull_zfs_keys` has **no effect** on direct calls. Therefore, `tls` is supplied exactly once by the caller, and the functions receive it correctly.\n\nThe decorators on `push_zfs_keys`/`pull_zfs_keys` are intentional: they allow those methods to be called independently through the middleware dispatch system (e.g., `self.middleware.call_sync('kmip.push_zfs_keys', ...)`) with `tls` injected automatically. The `# type: ignore` comments are consistent with the decorator's type signature hiding `tls` from external callers.\n\n**No double-injection occurs. The code is correct for this pattern.**\n\n---\n\n> Step 1: `pass_thread_local_storage` in `service/decorators.py:209-222` sets `fn._pass_thread_local_storage = True` and returns `fn` unchanged \u2014 no wrapping, no injection at decoration time.\n> Step 2: `main.py:862-865` \u2014 injection only occurs inside `_call_prepare`, which is invoked by the middleware dispatch system, not on direct Python calls.\n> Step 3: `job.py:620-621` \u2014 same: injection only at job run time via `prepend.append(thread_local_storage)`.\n> Step 4: `sync_zfs_keys` at lines 138/142 calls `self.push_zfs_keys(tls, ids)` directly \u2014 this is a plain Python attribute lookup and call, bypassing `_call_prepare` entirely.\n> Step 5: `push_zfs_keys` receives `(self, tls, ids)` \u2014 one `tls` from the caller, zero injected by decorator. Correct.\n\n**\ud83d\udca1 Suggested Fix**\n\nNo change needed for the decorator/injection pattern. 
The explicit `tls` passing at lines 138 and 142 is correct because these are direct Python method calls, not middleware dispatches.\n\n---\n*`Decorator Double-Injection Analysis` \u00b7 confidence 98%*", - "line": 138, - "path": "src/middlewared/middlewared/plugins/kmip/zfs_keys.py", - "side": "RIGHT" - }, - { - "body": "\ud83d\udd35 **[SUGGESTION] No test covers the newly-enforced rejection path (passphrase root + key-encrypted child roots)**\n\nThe only integration test for `inherit_parent_encryption_properties` (`tests/api2/test_pool_dataset_encryption.py:404`) exercises the case where the parent's encryption root uses a **hex key** \u2014 so `parent_encrypted_root['key_format']['value'] == 'HEX'`. The guard evaluates to `False` in both old and new code, meaning this test provides **zero coverage** of the bug fix.\n\nThe case that was silently broken (passphrase-encrypted parent root + key-encrypted child encryption roots under `id_`) has never been tested. Now that the guard works correctly, there is a real behavioral difference: the operation **raises a `CallError`** instead of silently succeeding. Without a test for this path:\n\n1. There is no automated verification that the `CallError` message is correct.\n2. A future refactor could re-introduce the same type-comparison mistake and no test would catch it.\n3. 
The complementary allowed case \u2014 passphrase parent root, `id_` has *no* key-encrypted child roots \u2014 is also untested; verifying it proceeds successfully is equally important.\n\nThe guard itself (`any(d['name'] == d['encryption_root'] for d in self.middleware.call_sync('pool.dataset.query', [...]))`) is logically sound and the fix is correct, but the absence of test coverage for the enforced path is a gap worth closing.\n\n---\n\n> Only test reference: `tests/api2/test_pool_dataset_encryption.py:404`\n> ```python\n> def test_key_encrypted_dataset(self):\n> # parent uses HEX key\n> payload = {'name': dataset, 'encryption_options': {'key': dataset_token_hex}, ...}\n> call('pool.dataset.create', payload)\n> # child uses PASSPHRASE\n> payload.update({'name': child_dataset, 'encryption_options': {'passphrase': passphrase}})\n> call('pool.dataset.create', payload)\n> # parent_encrypted_root is the HEX-keyed parent -> guard evaluates False in both old and new code\n> call('pool.dataset.inherit_parent_encryption_properties', child_dataset)\n> ds = call('pool.dataset.get_instance', child_dataset)\n> assert ds['key_format']['value'] == 'HEX', ds\n> ```\n> No test exercises the path where `parent_encrypted_root['key_format']['value'] == 'PASSPHRASE'`.\n\n**\ud83d\udca1 Suggested Fix**\n\nAdd a test case in `tests/api2/test_pool_dataset_encryption.py` that:\n1. Creates a passphrase-encrypted dataset `P` as an encryption root.\n2. Creates `P/K` as a key-encrypted encryption root (child of P).\n3. Creates `P/K/KC` as a second key-encrypted encryption root (grandchild).\n4. Calls `pool.dataset.inherit_parent_encryption_properties('P/K')` and asserts a `ClientException` / `CallError` is raised containing the expected message.\n5. 
Also tests the allowed sub-case: `P/K` with no key-encrypted child roots successfully inherits from the passphrase root.\n\n---\n*`Enum vs String Comparison Bug in Encryption Root Guard` \u00b7 confidence 95%*", - "line": 248, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_operations.py", - "side": "RIGHT" - }, - { - "body": "\u26aa **[NITPICK] Original `tls`-injection concern is a false alarm: decorator order is correct and `tls` is never visible to the lock lambda**\n\nThe review prompt raised a concern that if `@pass_thread_local_storage` wraps the `@job`-decorated function, the lock lambda might see `(tls, name)` instead of `(name,)`.\n\nThis concern does **not** apply. Both decorators are pure markers:\n\n```python\n# decorators.py:153-166\ndef check_job(fn):\n fn._job = {'lock': lock, ...}\n return fn # fn is returned unchanged\n\n# decorators.py:221-222\nfn._pass_thread_local_storage = True\nreturn fn # fn is returned unchanged\n```\n\nNeither decorator wraps the function \u2014 they only set attributes. The `tls` object is injected at job run time in `job.py:620-621` inside `Job.__run_body`, well after `get_lock_name()` has already evaluated the lock lambda at queue time. 
The `Job` object is constructed with `params` (raw caller args), and that is what the lambda sees \u2014 never `tls`.\n\nThe actual decorator stacking requirement is documented in `api/base/decorator.py:53-59`: `@job` must be the innermost (bottommost) decorator, and the current ordering is correct.\n\n---\n\n> Step 1: `@pass_thread_local_storage` at `decorators.py:209-222` sets `fn._pass_thread_local_storage = True` and returns `fn` \u2014 no wrapping.\n> Step 2: `@job` at `decorators.py:153-166` sets `fn._job = {...}` and returns `fn` \u2014 no wrapping.\n> Step 3: `_call_prepare` at `main.py:880` constructs `Job(..., params, job_options, ...)` where `params` is the raw caller args \u2014 `tls` is NOT in this list.\n> Step 4: `tls` injection for jobs occurs in `job.py:620-621` inside `Job.__run_body`, which runs after the job has been queued and the lock key has already been computed.\n> Step 5: `get_lock_name` at `job.py:422` calls `lock_name(self.args)` where `self.args = params` \u2014 the lambda never sees `tls`.\n\n**\ud83d\udca1 Suggested Fix**\n\nNo code change needed for this specific concern. 
The decorator order is correct and `tls` is never present in the lock lambda's argument list.\n\n---\n*`Decorator Order and Lock Key Correctness` \u00b7 confidence 97%*", - "line": 158, - "path": "src/middlewared/middlewared/plugins/pool_/dataset_encryption_info.py", - "side": "RIGHT" - } - ], - "event": "REQUEST_CHANGES" - }, - "review_id": "rev_07c8d4f2bf5a", - "summary": { - "adversary_challenged": 0, - "adversary_confirmed": 0, - "ai_generated_confidence": 0, - "budget_exhausted": true, - "by_severity": { - "critical": 2, - "important": 9, - "nitpick": 1, - "suggestion": 2 - }, - "cost_usd": 0, - "coverage_iterations": 0, - "cross_ref_interactions": 0, - "dimensions_run": 6, - "duration_seconds": 1808.733, - "total_findings": 14 - } - }, - "started_at": "2026-03-10T14:41:21Z", - "completed_at": "2026-03-10T15:11:32Z", - "duration_ms": 1811005, - "webhook_registered": false -} diff --git a/src/pr_af/config.py b/src/pr_af/config.py index 323b237..51d1d59 100644 --- a/src/pr_af/config.py +++ b/src/pr_af/config.py @@ -122,6 +122,18 @@ class CommentConfig(BaseModel): include_confidence: bool = True # Show confidence score suggestion_mode: str = "comment" # comment | code + # Parallel `.ai()` polish pass: rewrites each comment body to be more + # concise and developer-focused right before posting. On any per-call + # failure, the original body is kept. + polish_enabled: bool = True + + # Parallel `.ai()` merge-gate pass: classifies each finding as blocking + # vs non-blocking using a tight release-manager bar (build/security/data + # loss/contract break/regression only). Findings that don't meet the bar + # stay as advisory inline comments and never trigger REQUEST_CHANGES. + # Default ON for production noise reduction. Failures default to advisory. 
+ merge_gate_enabled: bool = True + severity_emojis: dict[str, str] = Field( default_factory=lambda: { "critical": "🔴", @@ -273,7 +285,10 @@ class AIIntegrationConfig(BaseModel): default_factory=lambda: os.getenv("PR_AF_MODEL", "minimax/minimax-m2.5") ) ai_model: str = Field( - default_factory=lambda: os.getenv("PR_AF_MODEL", "minimax/minimax-m2.5") + default_factory=lambda: os.getenv( + "PR_AF_AI_MODEL", + os.getenv("PR_AF_MODEL", "minimax/minimax-m2.5"), + ) ) max_turns: int = Field(default_factory=lambda: int(os.getenv("PR_AF_MAX_TURNS", "50"))) max_retries: int = Field(default_factory=lambda: int(os.getenv("PR_AF_AI_MAX_RETRIES", "3"))) diff --git a/src/pr_af/merge_gate.py b/src/pr_af/merge_gate.py new file mode 100644 index 0000000..e5f7dc4 --- /dev/null +++ b/src/pr_af/merge_gate.py @@ -0,0 +1,163 @@ +"""Merge-blocker gate — parallel `.ai()` pass over scored findings. + +A separate lens from severity. Severity asks "how bad is this issue?". +The merge gate asks "must this be fixed BEFORE the PR ships, or can it +ship and be addressed in a follow-up?". A pedantic `critical` finding +(e.g. a wrong test mock signature in unreachable code) can still be +non-blocking. A subtle `important` regression on a hot path can be +blocking. + +Production goal: keep automated review useful without forcing a +REQUEST_CHANGES gate on every alarmist finding. Only true must-fix +issues should block merge. Everything else stays as advisory comments. + +Architecture: +- One `.ai()` per finding, fired in parallel (mirror of polish.py). +- Failure mode: default to `blocking=False`. False negatives (a real + blocker passes through as advisory) are recoverable by human review. + False positives (advisory issue flagged blocking) erode trust and + block merge — worse outcome. 
+""" + +from __future__ import annotations + +import asyncio +import json +from typing import TYPE_CHECKING, Any + +from pydantic import BaseModel, Field + +if TYPE_CHECKING: + from .schemas.output import ScoredFinding + + +_MERGE_GATE_SYSTEM = ( + "You are the release manager for an automated code reviewer. Your job is to " + "decide whether a single review finding must be fixed BEFORE this PR is merged, " + "or whether the team can safely merge now and address it later.\n" + "\n" + "Apply a TIGHT bar. Only call something `blocking` if at least one is true:\n" + " - It breaks the build, tests, or type-checking.\n" + " - It introduces a security vulnerability reachable from a real user-facing " + "code path (auth bypass, injection, credential leak, RCE, exposed secret, " + "missing access control on a route real clients hit).\n" + " - It causes data loss, data corruption, or irreversible state damage in " + "production-running code.\n" + " - It breaks an existing public API/CLI/schema contract that real callers " + "depend on, with no migration path.\n" + " - It is a regression of behavior that was working before this PR.\n" + "\n" + "Treat the following as NON-blocking (return blocking=false):\n" + " - Code quality, style, naming, refactor opportunities.\n" + " - Missing tests for edge cases, low test coverage, mock signature drift in " + "test helpers.\n" + " - Defensive programming opportunities, missing input validation that has " + "no demonstrated reachable exploit path.\n" + " - Performance suggestions that don't change correctness.\n" + " - Documentation, comments, README, type-hint completeness.\n" + " - 'Should also handle X' suggestions when X isn't currently reachable.\n" + " - Architectural critiques (DRY, single source of truth, layering) without a " + "concrete production impact described in the finding.\n" + " - Issues whose reachability or exploitability the finding itself cannot " + "demonstrate concretely.\n" + "\n" + "If the finding's evidence 
does NOT concretely demonstrate one of the blocking " + "criteria above — even when the severity is labeled 'critical' — return " + "blocking=false. Reviewers are often alarmist; you are the calibrating layer.\n" + "\n" + "Output strict JSON with this exact shape and nothing else:\n" + ' {"blocking": true | false, "reason": ""}\n' + "Do not add prose. Do not wrap in markdown fences. JSON only." +) + + +class MergeGateVerdict(BaseModel): + """Per-finding gate output.""" + + blocking: bool = False + reason: str = Field(default="", max_length=400) + + +def _build_user_prompt(finding: ScoredFinding) -> str: + parts = [ + "# Finding\n", + f"Severity (reviewer's label): {finding.severity}\n", + f"Confidence: {finding.confidence:.2f}\n", + f"File: {finding.file_path}:{finding.line_start}\n", + f"Title: {finding.title}\n", + f"\n## Body\n{finding.body}\n", + ] + if finding.evidence: + parts.append(f"\n## Evidence\n{finding.evidence}\n") + if finding.suggestion: + parts.append(f"\n## Suggested fix\n{finding.suggestion}\n") + parts.append( + "\n# Question\n" + "Must this be fixed before this PR is merged to production? " + "Apply the bar described in the system prompt. Reply with JSON only." 
+ ) + return "".join(parts) + + +def _parse_verdict(raw: str) -> MergeGateVerdict: + """Tolerant JSON parsing — strips markdown fences, picks the first JSON object.""" + + text = raw.strip() + if text.startswith("```"): + text = text.strip("`") + if text.startswith("json"): + text = text[4:].strip() + # Find first `{...}` block + start = text.find("{") + end = text.rfind("}") + if start >= 0 and end > start: + text = text[start : end + 1] + try: + data = json.loads(text) + except json.JSONDecodeError: + return MergeGateVerdict(blocking=False, reason="gate parse error") + if not isinstance(data, dict): + return MergeGateVerdict(blocking=False, reason="gate non-object") + return MergeGateVerdict( + blocking=bool(data.get("blocking", False)), + reason=str(data.get("reason", ""))[:400], + ) + + +async def _gate_one(app: Any, finding: ScoredFinding) -> MergeGateVerdict: + try: + # response_format="json" forces JSON output so reasoning models don't + # leak their chain-of-thought before the verdict and truncate the answer. + out = await app.ai( + _build_user_prompt(finding), + system=_MERGE_GATE_SYSTEM, + response_format="json", + ) + except Exception as exc: # noqa: BLE001 + print(f"[PR-AF] Merge-gate skipped for {finding.id}: {exc.__class__.__name__}", flush=True) + return MergeGateVerdict(blocking=False, reason="gate error") + text = getattr(out, "text", None) or getattr(out, "content", None) or str(out) + return _parse_verdict(text) + + +async def classify_findings( + app: Any, findings: list[ScoredFinding] +) -> list[ScoredFinding]: + """Return a new list of findings with `blocking` and `blocking_reason` populated. + + Original order preserved. Pure function — input is not mutated. 
+ """ + + if not findings: + return findings + verdicts = await asyncio.gather(*(_gate_one(app, f) for f in findings)) + classified = [ + f.model_copy(update={"blocking": v.blocking, "blocking_reason": v.reason}) + for f, v in zip(findings, verdicts, strict=True) + ] + blocking_count = sum(1 for f in classified if f.blocking) + print( + f"[PR-AF] Merge-gate: {blocking_count}/{len(classified)} findings classified blocking", + flush=True, + ) + return classified diff --git a/src/pr_af/orchestrator.py b/src/pr_af/orchestrator.py index 2ab9659..1f66e54 100644 --- a/src/pr_af/orchestrator.py +++ b/src/pr_af/orchestrator.py @@ -27,6 +27,7 @@ build_hax_client_from_env, request_review_approval, ) +from .merge_gate import classify_findings from .reasoners.harnesses import ( adversary_phase, anatomy_phase, @@ -268,6 +269,8 @@ async def _run_review_phases( print("[PR-AF] Phase 7: SYNTHESIS", flush=True) scored_findings = self._synthesize(all_findings, adversary_results) print(f"[PR-AF] Synthesis complete: {len(scored_findings)} scored findings", flush=True) + if self.config.comments.merge_gate_enabled: + scored_findings = await classify_findings(self.app, scored_findings) return plan, scored_findings async def _finish( @@ -1009,6 +1012,30 @@ async def _generate_output( ) comments = comments[: self.config.comments.max_comments] + if self.config.comments.polish_enabled and comments: + from .polish import polish_comments + + comments = await polish_comments(self.app, comments) + + findings_by_location = { + ( + self._normalize_path(f.file_path), + f.line_start, + f.diff_side, + ): f + for f in filtered_for_comments + } + comments = [ + comment.model_copy( + update={ + "body": self._decorate_with_blocking( + findings_by_location.get((comment.path, comment.line, comment.side)), + comment.body, + ) + } + ) + for comment in comments + ] review_event = determine_review_event(filtered_for_comments) summary_body = self._format_summary( @@ -1074,6 +1101,8 @@ async def _generate_output( 
summary = ReviewSummary( total_findings=len(scored_findings), by_severity=by_severity, + blocking_count=sum(1 for finding in scored_findings if finding.blocking), + advisory_count=sum(1 for finding in scored_findings if not finding.blocking), dimensions_run=len(plan.dimensions), cross_ref_interactions=self.cross_ref_count, adversary_challenged=self.adversary_challenged_count, @@ -1454,18 +1483,35 @@ def _build_gap_dimensions( ) return dimensions + def _decorate_with_blocking(self, finding: ScoredFinding | None, body: str) -> str: + """Wrap a comment in a GitHub-native merge-gate callout.""" + + if finding and finding.blocking: + header = "> [!CAUTION]\n> **Must-fix before merge.**" + if finding.blocking_reason: + header += f" {finding.blocking_reason}" + else: + header = "> [!NOTE]\n> **Advisory — non-blocking.** Safe to merge and address in a follow-up." + if finding and finding.blocking_reason: + header += f"\n>\n> _Why non-blocking:_ {finding.blocking_reason}" + return f"{header}\n\n{body}" + def _format_comment_body(self, finding: ScoredFinding) -> str: emoji = self.config.comments.severity_emojis.get(finding.severity, "") severity_label = finding.severity.upper() - lines = [f"{emoji} **[{severity_label}] {finding.title}**", ""] - - lines.append(finding.body) + lines = [ + f"**{finding.title}**", + "", + f"{emoji} `{severity_label}` · confidence {int(finding.confidence * 100)}%", + "", + finding.body.strip(), + ] if finding.evidence: - lines.extend(["", "---", ""]) - evidence_lines = finding.evidence.strip().splitlines() - for ev_line in evidence_lines: + lines.extend(["", "
<details><summary>Evidence</summary>", ""]) + for ev_line in finding.evidence.strip().splitlines(): lines.append(f"> {ev_line}") + lines.extend(["", "</details>
"]) if self.config.comments.include_suggestions and finding.suggestion: suggestion_text = finding.suggestion.strip() @@ -1538,16 +1584,36 @@ def _format_summary( emojis = self.config.comments.severity_emojis duration = round(time.monotonic() - self.started_at, 1) - rating = self._compute_rating(by_severity, len(findings)) + blocking_count = sum(1 for f in findings if f.blocking) + advisory_count = len(findings) - blocking_count + rating = self._compute_rating_v2(blocking_count, advisory_count, len(findings)) + + if blocking_count == 0 and advisory_count == 0: + verdict_line = "> ✅ **Safe to merge.** No findings from automated review." + elif blocking_count == 0: + verdict_line = ( + f"> ✅ **Safe to merge.** {advisory_count} advisory note" + f"{'s' if advisory_count != 1 else ''} below — non-blocking, " + "safe to address in a follow-up." + ) + else: + verdict_line = ( + f"> 🚫 **Merge blocked.** {blocking_count} must-fix issue" + f"{'s' if blocking_count != 1 else ''} found by automated review." 
+ ) lines: list[str] = [ f"## {rating['emoji']} PR-AF Review — **{rating['label']}**", "", + verdict_line, + "", "*Automated multi-agent code review · " "[PR-AF](https://github.com/Agent-Field/pr-af) built with " "[AgentField](https://github.com/Agent-Field/agentfield)*", "", f"> **{len(findings)} findings** · " + f"🚫 {sum(1 for f in findings if f.blocking)} blocking · " + f"💬 {sum(1 for f in findings if not f.blocking)} advisory · " f"{emojis.get('critical', '')} {by_severity.get('critical', 0)} critical · " f"{emojis.get('important', '')} {by_severity.get('important', 0)} important · " f"{emojis.get('suggestion', '')} {by_severity.get('suggestion', 0)} suggestions · " @@ -1613,12 +1679,14 @@ def _format_summary( return "\n".join(lines) - def _compute_rating(self, by_severity: dict[str, int], total: int) -> dict[str, str]: + def _compute_rating(self, by_severity: dict[str, int], total: int, blocking_count: int = 0) -> dict[str, str]: critical = by_severity.get("critical", 0) important = by_severity.get("important", 0) if total == 0: return {"emoji": "🟢", "label": "Looks Good", "grade": "A"} + if blocking_count == 0: + return {"emoji": "🟢", "label": "Advisory Findings Only", "grade": "A-"} if critical >= 3: return {"emoji": "🔴", "label": "Needs Major Rework", "grade": "D"} if critical >= 1: @@ -1631,6 +1699,21 @@ def _compute_rating(self, by_severity: dict[str, int], total: int) -> dict[str, return {"emoji": "🟡", "label": "Mostly Good", "grade": "B+"} return {"emoji": "🟢", "label": "Looks Good — Minor Suggestions", "grade": "A-"} + def _compute_rating_v2(self, blocking: int, advisory: int, total: int) -> dict[str, str]: + """Blocking-driven rating; severity is secondary display context.""" + + if total == 0: + return {"emoji": "🟢", "label": "Looks Good", "grade": "A"} + if blocking >= 3: + return {"emoji": "🔴", "label": "Multiple Merge-Blockers", "grade": "D"} + if blocking >= 1: + return {"emoji": "🔴", "label": "Merge Blocked — Must-Fix Found", "grade": "C"} + if 
advisory >= 10: + return {"emoji": "🟢", "label": "Safe to Merge — Many Advisories", "grade": "B"} + if advisory >= 3: + return {"emoji": "🟢", "label": "Safe to Merge — Advisories Only", "grade": "B+"} + return {"emoji": "🟢", "label": "Safe to Merge — Minor Advisories", "grade": "A-"} + def _build_key_findings(self, findings: list[ScoredFinding]) -> list[str]: if not findings: return ["**No issues found.** This PR looks clean across all review dimensions.", ""] @@ -1640,8 +1723,8 @@ def _build_key_findings(self, findings: list[ScoredFinding]) -> list[str]: for f in findings: by_sev.setdefault(f.severity, []).append(f) - blocking = by_sev.get("critical", []) + by_sev.get("important", []) - non_blocking = by_sev.get("suggestion", []) + by_sev.get("nitpick", []) + blocking = [f for f in findings if f.blocking] + non_blocking = [f for f in findings if not f.blocking] lines.append("### Key Findings") lines.append("") @@ -1652,13 +1735,14 @@ def _build_key_findings(self, findings: list[ScoredFinding]) -> list[str]: for f in blocking[:8]: emoji = self.config.comments.severity_emojis.get(f.severity, "") path_ref = f" (`{self._normalize_path(f.file_path)}:{f.line_start}`)" if f.file_path else "" - lines.append(f"- {emoji} **{f.title}**{path_ref} — {self._first_sentence(f.body)}") + reason = f" Gate: {f.blocking_reason}" if f.blocking_reason else "" + lines.append(f"- {emoji} **{f.title}**{path_ref} — {self._first_sentence(f.body)}{reason}") if len(blocking) > 8: lines.append(f"- … and {len(blocking) - 8} more (see All Findings by Severity)") lines.append("") if non_blocking: - lines.append(f"**{len(non_blocking)} suggestion(s) and style note(s):**") + lines.append(f"**{len(non_blocking)} advisory finding(s) surfaced as non-blocking:**") lines.append("") for f in non_blocking[:5]: emoji = self.config.comments.severity_emojis.get(f.severity, "") @@ -1677,6 +1761,14 @@ def _build_key_findings(self, findings: list[ScoredFinding]) -> list[str]: return lines + def _loc(self, f: 
ScoredFinding) -> str: + if not f.file_path: + return "—" + norm = self._normalize_path(f.file_path) + if f.line_start > 0: + return f"`{norm}:{f.line_start}`" + return f"`{norm}`" + def _build_review_details(self, findings: list[ScoredFinding], plan: ReviewPlan | None) -> list[str]: lines: list[str] = [] detail_parts: list[str] = [] diff --git a/src/pr_af/polish.py b/src/pr_af/polish.py new file mode 100644 index 0000000..bd714d9 --- /dev/null +++ b/src/pr_af/polish.py @@ -0,0 +1,51 @@ +"""Parallel polish pass — rewrites each inline comment body to be concise and +developer-focused right before posting to GitHub. + +One `.ai()` call per comment, fired in parallel. Returns rewritten bodies. +On any per-comment failure, the original body is kept. +""" + +from __future__ import annotations + +import asyncio +from typing import TYPE_CHECKING, Any + +if TYPE_CHECKING: + from .schemas.output import GitHubComment + + +_POLISH_SYSTEM = ( + "You rewrite GitHub PR review comments. A good PR comment tells the author " + "exactly what to fix and why, so they can act in under 30 seconds. Open with a " + "one-sentence directive. Then one short paragraph (2-3 sentences) on the concrete " + "failure mode — no abstract security lectures, no 'attacker-controlled' filler. " + "Preserve every file path, line number, identifier, code block, markdown " + "header, GitHub alert callout (`> [!CAUTION]`, `> [!NOTE]`), `
` block, " + "and `` line verbatim. Never invent facts. Never soften severity. Output " + "the polished comment body only — no preamble, no commentary." +) + + +async def _polish_one(app: Any, body: str) -> str: + try: + out = await app.ai( + f"Rewrite this PR review comment to be concise and developer-focused.\n\n{body}", + system=_POLISH_SYSTEM, + ) + except Exception as exc: # noqa: BLE001 + print(f"[PR-AF] Polish skipped: {exc.__class__.__name__}", flush=True) + return body + text = getattr(out, "text", None) or getattr(out, "content", None) or str(out) + text = text.strip() + return text or body + + +async def polish_comments(app: Any, comments: list[GitHubComment]) -> list[GitHubComment]: + """Rewrite each comment body in parallel. Returns a new list.""" + if not comments: + return comments + new_bodies = await asyncio.gather(*(_polish_one(app, c.body) for c in comments)) + polished = [c.model_copy(update={"body": b}) for c, b in zip(comments, new_bodies, strict=True)] + changed = sum(1 for c, b in zip(comments, new_bodies, strict=True) if b != c.body) + print(f"[PR-AF] Polish complete: {changed}/{len(comments)} comments rewritten", flush=True) + return polished diff --git a/src/pr_af/schemas/output.py b/src/pr_af/schemas/output.py index ccb71d3..dec57c2 100644 --- a/src/pr_af/schemas/output.py +++ b/src/pr_af/schemas/output.py @@ -30,6 +30,10 @@ class ScoredFinding(BaseModel): tags: list[str] = Field(default_factory=list) score: float = 0.0 active_multipliers: list[str] = Field(default_factory=list) + # Orthogonal release-blocker axis. Severity = "how bad", blocking = "must fix before merge". + # Default False so unreviewed findings never falsely block merge. 
+ blocking: bool = False + blocking_reason: str = "" class ReviewSummary(BaseModel): @@ -37,6 +41,8 @@ class ReviewSummary(BaseModel): total_findings: int = 0 by_severity: dict[str, int] = Field(default_factory=dict) + blocking_count: int = 0 + advisory_count: int = 0 dimensions_run: int = 0 cross_ref_interactions: int = Field( default=0, diff --git a/src/pr_af/scoring.py b/src/pr_af/scoring.py index 8d5569d..d023e1f 100644 --- a/src/pr_af/scoring.py +++ b/src/pr_af/scoring.py @@ -39,9 +39,32 @@ def score_findings( scored: list[ScoredFinding] = [] + # Severity normalization — reviewer LLMs sometimes emit uppercase or aliases + # like "high"/"medium". Map them to the canonical lowercase rubric so downstream + # code (emoji lookup, by_severity counting, severity_rank gates) doesn't break. + aliases = { + "critical": "critical", + "high": "critical", + "blocker": "critical", + "important": "important", + "medium": "important", + "major": "important", + "suggestion": "suggestion", + "minor": "suggestion", + "low": "suggestion", + "nitpick": "nitpick", + "info": "nitpick", + "trivia": "nitpick", + "trivial": "nitpick", + } + + def _norm_sev(s: str) -> str: + return aliases.get((s or "").strip().lower(), "suggestion") + for finding in findings: + norm_sev = _norm_sev(finding.severity) # Base weight from severity - base = config.base_weights.get(finding.severity, 0.3) + base = config.base_weights.get(norm_sev, 0.3) # Confidence-weighted base score = base * finding.confidence @@ -70,7 +93,7 @@ def score_findings( active_multipliers.append("blast_radius_high") # Confidence threshold filtering - min_confidence = config.confidence_thresholds.get(finding.severity, 0.5) + min_confidence = config.confidence_thresholds.get(norm_sev, 0.5) if finding.confidence < min_confidence: continue # Drop low-confidence findings @@ -82,7 +105,7 @@ def score_findings( file_path=finding.file_path, line_start=finding.line_start, line_end=finding.line_end, - severity=finding.severity, + 
severity=norm_sev, title=finding.title, body=finding.body, suggestion=finding.suggestion, @@ -122,19 +145,19 @@ def score_findings( def determine_review_event(findings: list[ScoredFinding]) -> str: - """Determine the GitHub review event based on findings. + """Determine the GitHub review event based on the merge-gate verdict. + + Decoupled from severity. The merge-gate is the single source of truth + for "must fix before merging". Severity remains the reviewer's badness + label and drives sorting/display, not the event. Returns: APPROVE | COMMENT | REQUEST_CHANGES """ - severities = {f.severity for f in findings} - - if "critical" in severities: + if any(f.blocking for f in findings): return "REQUEST_CHANGES" - if "important" in severities: - return "COMMENT" if findings: - return "APPROVE" # Only suggestions/nitpicks → approve with comments - return "APPROVE" # Clean → approve + return "COMMENT" # Advisory-only findings: surface, but don't gate merge. + return "APPROVE" def deduplicate_exact(findings: list[ReviewFinding]) -> list[ReviewFinding]: diff --git a/tests/test_budget_env.py b/tests/test_budget_env.py index 7a09ec8..0259691 100644 --- a/tests/test_budget_env.py +++ b/tests/test_budget_env.py @@ -9,7 +9,10 @@ from __future__ import annotations -import pytest +from typing import TYPE_CHECKING + +if TYPE_CHECKING: + import pytest from pr_af.app import _resolve_budget_caps diff --git a/tests/test_merge_gate.py b/tests/test_merge_gate.py new file mode 100644 index 0000000..e1246bb --- /dev/null +++ b/tests/test_merge_gate.py @@ -0,0 +1,139 @@ +"""Unit tests for the merge-gate machinery — verdict parsing and event mapping. + +These tests do NOT hit OpenRouter. They cover the deterministic parts: +- `_parse_verdict` tolerance to messy model output (markdown fences, leading prose). +- `determine_review_event` correctness across blocking/advisory combinations. +- `classify_findings` integration with a stubbed `app.ai`. 
+""" + +from __future__ import annotations + +import asyncio +from types import SimpleNamespace + +import pytest + +from pr_af.merge_gate import ( + MergeGateVerdict, + _parse_verdict, + classify_findings, +) +from pr_af.schemas.output import ScoredFinding +from pr_af.scoring import determine_review_event + + +def _f(id_: str, *, blocking: bool = False, severity: str = "important") -> ScoredFinding: + return ScoredFinding( + id=id_, + dimension_id="d", + dimension_name="D", + file_path="a.go", + line_start=1, + line_end=1, + severity=severity, + title=f"t-{id_}", + body="body", + blocking=blocking, + ) + + +class TestParseVerdict: + def test_plain_json(self) -> None: + v = _parse_verdict('{"blocking": true, "reason": "build break"}') + assert v.blocking is True + assert v.reason == "build break" + + def test_with_markdown_fence(self) -> None: + v = _parse_verdict('```json\n{"blocking": false, "reason": "style"}\n```') + assert v.blocking is False + assert v.reason == "style" + + def test_with_leading_prose(self) -> None: + v = _parse_verdict('Sure, here is the verdict: {"blocking": true, "reason": "x"} done.') + assert v.blocking is True + + def test_garbage_defaults_to_advisory(self) -> None: + # On parse failure, advisory is the safe default — we'd rather under-block. 
+ v = _parse_verdict("model rambled but never closed JSON") + assert v.blocking is False + assert "parse" in v.reason.lower() + + def test_non_object_json_defaults_to_advisory(self) -> None: + v = _parse_verdict('"some string"') + assert v.blocking is False + + def test_reason_truncated_at_400(self) -> None: + long = "x" * 1000 + v = _parse_verdict(f'{{"blocking": false, "reason": "{long}"}}') + assert len(v.reason) <= 400 + + +class TestDetermineReviewEvent: + def test_empty(self) -> None: + assert determine_review_event([]) == "APPROVE" + + def test_only_advisory(self) -> None: + assert determine_review_event([_f("1"), _f("2", severity="critical")]) == "COMMENT" + + def test_any_blocking(self) -> None: + assert ( + determine_review_event([_f("1"), _f("2", blocking=True, severity="suggestion")]) + == "REQUEST_CHANGES" + ) + + def test_severity_alone_does_not_block(self) -> None: + # A critical-severity finding that the gate ruled advisory must not block. + assert determine_review_event([_f("1", severity="critical", blocking=False)]) == "COMMENT" + + +class _StubAI: + """Stub `app.ai` that returns canned verdicts keyed by finding title.""" + + def __init__(self, verdicts: dict[str, MergeGateVerdict]) -> None: + self.verdicts = verdicts + self.calls: list[str] = [] + + async def __call__(self, prompt: str, **kwargs: object) -> SimpleNamespace: + self.calls.append(prompt) + for title, verdict in self.verdicts.items(): + if title in prompt: + return SimpleNamespace(text=verdict.model_dump_json()) + return SimpleNamespace(text='{"blocking": false, "reason": "default"}') + + +class TestClassifyFindings: + def test_empty_passthrough(self) -> None: + result = asyncio.run(classify_findings(object(), [])) + assert result == [] + + def test_each_finding_classified(self) -> None: + f1 = _f("1", severity="critical") + f2 = _f("2", severity="important") + stub = _StubAI({ + "t-1": MergeGateVerdict(blocking=True, reason="ship blocker"), + "t-2": MergeGateVerdict(blocking=False, 
reason="advisory only"), + }) + app = SimpleNamespace(ai=stub) + out = asyncio.run(classify_findings(app, [f1, f2])) + assert len(out) == 2 + assert out[0].blocking is True + assert out[0].blocking_reason == "ship blocker" + assert out[1].blocking is False + assert out[1].blocking_reason == "advisory only" + # Each finding produces one .ai() call. No de-dup. + assert len(stub.calls) == 2 + + def test_failure_defaults_to_advisory(self) -> None: + async def boom(*_a: object, **_kw: object) -> SimpleNamespace: + raise RuntimeError("upstream blew up") + + app = SimpleNamespace(ai=boom) + f = _f("1", severity="critical") + out = asyncio.run(classify_findings(app, [f])) + # Safe default: never block on infra failure. + assert out[0].blocking is False + assert "error" in out[0].blocking_reason + + +if __name__ == "__main__": + pytest.main([__file__, "-v"]) diff --git a/tests/test_resolve_repo.py b/tests/test_resolve_repo.py index d297f09..080f32a 100644 --- a/tests/test_resolve_repo.py +++ b/tests/test_resolve_repo.py @@ -17,12 +17,15 @@ from __future__ import annotations import subprocess -from pathlib import Path +from typing import TYPE_CHECKING import pytest from pr_af.app import _checkout_pr_branch +if TYPE_CHECKING: + from pathlib import Path + def _git(*args: str, cwd: Path) -> str: return subprocess.run(