feat(audit): durable writes, graceful-shutdown flush, and clearer verify diagnostics#184
Open
dgenio wants to merge 2 commits into
Open
feat(audit): durable writes, graceful-shutdown flush, and clearer verify diagnostics#184dgenio wants to merge 2 commits into
dgenio wants to merge 2 commits into
Conversation
…ify diagnostics Harden the hash-chained audit log — the project's security-critical artifact — against write loss and damaged input, and lock in two concurrency invariants with race tests. Addresses #132, #158, #172, #180. - #132 Durable writes: add `--audit-fsync` (check/proxy/proxy-http) and `audit.Options.Fsync`. `Writer` fsyncs each committed line through to stable storage when enabled; `Rotator` gains `Sync()` so rotating logs are equally durable. Default behaviour is unchanged (page-cache only). - #158 Graceful shutdown: add `Writer.Sync()` and fsync-then-close the audit destination on the proxies' SIGINT/SIGTERM path, so an in-flight decision (audited before the call is forwarded) is never lost on exit. - #172 Concurrency: race-detector tests for the stdio proxy `lockedWriter` (frame atomicity) and the audit `Writer` (gap-free sequence numbers). - #180 Diagnostics: `audit verify` now distinguishes a corrupt/unreadable log (`CORRUPT`) from an integrity break (`FAILED`); `VerifyError` carries a `Malformed` flag to separate "damaged file" from "tampering". Docs: threat-model durability/shutdown/diagnosis notes, README status row, CHANGELOG entries. https://claude.ai/code/session_01GQ2VQaTfN7JzTq3VxN3K3q
There was a problem hiding this comment.
Pull request overview
This PR hardens AgentFence’s security-critical hash-chained audit log lifecycle by adding an opt-in durability mode (--audit-fsync), improving shutdown flushing semantics, adding race-oriented concurrency tests, and making audit verify diagnostics distinguish corrupt input from integrity failures.
Changes:
- Add opt-in per-event audit
fsyncsupport end-to-end (writer, rotator, CLI flags, shutdown close path) to reduce audit record loss on power failure. - Improve audit verification diagnostics by distinguishing malformed/corrupt lines from chain integrity failures (
VerifyError.Malformed,CORRUPTvsFAILED). - Add concurrency/race tests covering stdio proxy frame atomicity (
lockedWriter) and audit writer durability/sequence behavior.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| README.md | Documents durable audit writes feature flag in feature matrix. |
| internal/proxy/lockedwriter_test.go | Adds race-friendly test asserting concurrent writes do not interleave frames. |
| internal/audit/verify.go | Adds VerifyError.Malformed to distinguish corrupt input from integrity failures. |
| internal/audit/verify_corrupt_test.go | Adds tests covering corrupt-line vs tamper scenarios for VerifyChain. |
| internal/audit/rotate.go | Adds Rotator.Sync() so rotating logs can participate in fsync durability. |
| internal/audit/durability_test.go | Adds durability-related unit tests for Writer/Rotator sync behavior. |
| internal/audit/audit.go | Adds Options.Fsync, Writer.Sync(), and per-write fsync behavior. |
| docs/threat-model.md | Documents audit durability, shutdown ordering, and CORRUPT vs FAILED diagnostics. |
| cmd/agentfence/main.go | Plumbs --audit-fsync through CLI, adds shutdown fsync close behavior, and updates verify output. |
| cmd/agentfence/durability_test.go | Adds end-to-end CLI tests for --audit-fsync and verify diagnostics. |
| CHANGELOG.md | Records new durability flag, shutdown flush semantics, concurrency tests, and verify diagnostics changes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+1373
to
+1375
| if cfg.fsync && cfg.path == "" { | ||
| fmt.Fprintln(os.Stderr, "AgentFence: warning: --audit-fsync has no effect without --audit-log") | ||
| } |
Comment on lines
+108
to
+124
| // TestWriterFsyncThroughRotatorIsDurable proves the Fsync option reaches a | ||
| // rotating destination (the path the proxies use with --audit-max-size), so a | ||
| // rotated log is as durable as a plain one. | ||
| func TestWriterFsyncThroughRotatorIsDurable(t *testing.T) { | ||
| path := filepath.Join(t.TempDir(), "audit.jsonl") | ||
| rot, err := NewRotator(RotationConfig{Path: path}) | ||
| if err != nil { | ||
| t.Fatalf("NewRotator: %v", err) | ||
| } | ||
| defer rot.Close() | ||
|
|
||
| w := NewWriterOptions(rot, Options{Fsync: true}) | ||
| if err := w.Write(Event{CallID: "c", Tool: "t", Decision: policy.DecisionAllow}); err != nil { | ||
| t.Fatalf("Write: %v", err) | ||
| } | ||
| // The event must be readable from disk immediately (it was flushed). | ||
| b, err := os.ReadFile(path) |
…rotator test Address PR review feedback: - openAuditOutput now clears Options.Fsync when no --audit-log is given. Previously the flag still set Fsync=true, so `check --output text --audit-fsync` would Sync() os.Stdout — which returns EINVAL on a TTY or pipe and failed the run. This matches the documented "no effect" warning. Added a regression test (fails with `audit: fsync: invalid argument` before the fix). - Renamed TestWriterFsyncThroughRotatorIsDurable -> ...WritesEvent and reworded it: it checks the Fsync path drives a rotating destination without error and the event lands on disk, not real fsync(2) semantics (covered by the counting-syncer test). https://claude.ai/code/session_01GQ2VQaTfN7JzTq3VxN3K3q
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Hardens the hash-chained audit log — the security-critical artifact the threat model relies on — against write loss and damaged input, and locks in two concurrency invariants with race tests. The four issues form one coherent change: they all touch the audit
Writer/Rotatorwrite-and-read lifecycle, the stdio proxy'slockedWriter, and the proxy shutdown path, and share fault-injection/concurrency test scaffolding.What changed:
--audit-fsyncflag oncheck/proxy/proxy-httpandaudit.Options.Fsync. When set,audit.Writerflushes each committed JSONL line through to stable storage (via aSync() errordestination) before the call is allowed to proceed;RotatorgainsSync()so rotating logs are equally durable. Default behaviour is unchanged (events reach the OS page cache, surviving a process crash but not power loss). A no-op warning fires for--audit-fsyncwithout--audit-log.Writer.Sync();openAuditOutput's close path fsync-then-closes the destination on shutdown. Because each decision is audited before the proxy forwards the call, and both proxies already run undersignal.NotifyContext(SIGINT, SIGTERM), no call is left made-but-unaudited on exit. Shutdown ordering is now documented.-race) for the stdio proxylockedWriter(JSON-RPC frame atomicity under concurrent relay writes) and the auditWriter(gap-free monotonic sequence numbers under load).audit summarize/verifyon corrupt files #180 — Corrupt-vs-tamper diagnostics.audit verifynow printsCORRUPTfor a damaged/unreadable line andFAILEDfor an integrity break (possible tampering) on distinct status lines;audit.VerifyErrorcarries aMalformedflag so callers can tell a damaged file from a rewritten one.Files:
internal/audit/{audit,rotate,verify}.go,cmd/agentfence/main.go, plus docs (docs/threat-model.md,README.md,CHANGELOG.md) and four new test files.Related issue
Fixes #132
Fixes #158
Fixes #172
Fixes #180
How verified
make ci(fmt-check, vet,go test -raceacross all packages): all packages pass, total coverage 81.0%.TestVerifyChainFlagsCorruptLineAsMalformed,TestWriterFsyncSyncsEveryWrite) and pass with it.TestLockedWriterKeepsFramesAtomicrun with-race -count=3: clean.Tradeoffs / risks
--audit-fsyncis opt-in and off by default; per-event fsync trades throughput for durability, as documented.Writer.Writeafter the line and sequence/chain state are committed (the record is on disk): it signals a durability problem without rolling back state, preserving the no-gap-on-retry contract. This is intentional and covered by a test.fsync(2)from a test isn't portable, so durability is verified at the wiring level (a counting syncer) plus an end-to-end "log is complete and verifies" check.Checklist
Scope notes
Changes are limited to the audit durability/integrity cluster (#132, #158, #172, #180). Adjacent audit work surfaced during triage but deliberately left as follow-ups: a typed reason/decision-code taxonomy (#136),
audit verify --json(#160), and a redaction hot-path allocation budget (#147).https://claude.ai/code/session_01GQ2VQaTfN7JzTq3VxN3K3q
Generated by Claude Code