feat(audit): durable writes, graceful-shutdown flush, and clearer verify diagnostics by dgenio · Pull Request #184 · dgenio/agentfence

dgenio · 2026-06-14T20:06:33Z

Summary

Hardens the hash-chained audit log — the security-critical artifact the threat model relies on — against write loss and damaged input, and locks in two concurrency invariants with race tests. The four issues form one coherent change: they all touch the audit Writer/Rotator write-and-read lifecycle, the stdio proxy's lockedWriter, and the proxy shutdown path, and share fault-injection/concurrency test scaffolding.

What changed:

Make audit-log writes durable on a configurable flush/fsync policy #132 — Durable audit writes. New --audit-fsync flag on check/proxy/proxy-http and audit.Options.Fsync. When set, audit.Writer flushes each committed JSONL line through to stable storage (via a Sync() error destination) before the call is allowed to proceed; Rotator gains Sync() so rotating logs are equally durable. Default behaviour is unchanged (events reach the OS page cache, surviving a process crash but not power loss). A no-op warning fires for --audit-fsync without --audit-log.
Add graceful-shutdown audit-flush guarantees on signal across both proxies #158 — Graceful-shutdown audit flush. New Writer.Sync(); openAuditOutput's close path fsync-then-closes the destination on shutdown. Because each decision is audited before the proxy forwards the call, and both proxies already run under signal.NotifyContext(SIGINT, SIGTERM), no call is left made-but-unaudited on exit. Shutdown ordering is now documented.
Add concurrency and atomicity tests for the locked-writer and audit writer under load #172 — Concurrency/atomicity tests. Race-detector coverage (CI runs -race) for the stdio proxy lockedWriter (JSON-RPC frame atomicity under concurrent relay writes) and the audit Writer (gap-free monotonic sequence numbers under load).
Add audit-log input validation and a recovery path for audit summarize/verify on corrupt files #180 — Corrupt-vs-tamper diagnostics. audit verify now prints CORRUPT for a damaged/unreadable line and FAILED for an integrity break (possible tampering) on distinct status lines; audit.VerifyError carries a Malformed flag so callers can tell a damaged file from a rewritten one.

Files: internal/audit/{audit,rotate,verify}.go, cmd/agentfence/main.go, plus docs (docs/threat-model.md, README.md, CHANGELOG.md) and four new test files.

Related issue

Fixes #132
Fixes #158
Fixes #172
Fixes #180

How verified

make ci (fmt-check, vet, go test -race across all packages): all packages pass, total coverage 81.0%.
New tests fail without their corresponding change (e.g. TestVerifyChainFlagsCorruptLineAsMalformed, TestWriterFsyncSyncsEveryWrite) and pass with it.
TestLockedWriterKeepsFramesAtomic run with -race -count=3: clean.

Tradeoffs / risks

--audit-fsync is opt-in and off by default; per-event fsync trades throughput for durability, as documented.
A fsync failure is returned by Writer.Write after the line and sequence/chain state are committed (the record is on disk): it signals a durability problem without rolling back state, preserving the no-gap-on-retry contract. This is intentional and covered by a test.
Observing a real fsync(2) from a test isn't portable, so durability is verified at the wiring level (a counting syncer) plus an end-to-end "log is complete and verifies" check.

Checklist

Tests added or updated
Documentation updated if needed
CI passes
Issue number included

Scope notes

Changes are limited to the audit durability/integrity cluster (#132, #158, #172, #180). Adjacent audit work surfaced during triage but deliberately left as follow-ups: a typed reason/decision-code taxonomy (#136), audit verify --json (#160), and a redaction hot-path allocation budget (#147).

https://claude.ai/code/session_01GQ2VQaTfN7JzTq3VxN3K3q

Generated by Claude Code

…ify diagnostics Harden the hash-chained audit log — the project's security-critical artifact — against write loss and damaged input, and lock in two concurrency invariants with race tests. Addresses #132, #158, #172, #180. - #132 Durable writes: add `--audit-fsync` (check/proxy/proxy-http) and `audit.Options.Fsync`. `Writer` fsyncs each committed line through to stable storage when enabled; `Rotator` gains `Sync()` so rotating logs are equally durable. Default behaviour is unchanged (page-cache only). - #158 Graceful shutdown: add `Writer.Sync()` and fsync-then-close the audit destination on the proxies' SIGINT/SIGTERM path, so an in-flight decision (audited before the call is forwarded) is never lost on exit. - #172 Concurrency: race-detector tests for the stdio proxy `lockedWriter` (frame atomicity) and the audit `Writer` (gap-free sequence numbers). - #180 Diagnostics: `audit verify` now distinguishes a corrupt/unreadable log (`CORRUPT`) from an integrity break (`FAILED`); `VerifyError` carries a `Malformed` flag to separate "damaged file" from "tampering". Docs: threat-model durability/shutdown/diagnosis notes, README status row, CHANGELOG entries. https://claude.ai/code/session_01GQ2VQaTfN7JzTq3VxN3K3q

Copilot

Pull request overview

This PR hardens AgentFence’s security-critical hash-chained audit log lifecycle by adding an opt-in durability mode (--audit-fsync), improving shutdown flushing semantics, adding race-oriented concurrency tests, and making audit verify diagnostics distinguish corrupt input from integrity failures.

Changes:

Add opt-in per-event audit fsync support end-to-end (writer, rotator, CLI flags, shutdown close path) to reduce audit record loss on power failure.
Improve audit verification diagnostics by distinguishing malformed/corrupt lines from chain integrity failures (VerifyError.Malformed, CORRUPT vs FAILED).
Add concurrency/race tests covering stdio proxy frame atomicity (lockedWriter) and audit writer durability/sequence behavior.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
README.md	Documents durable audit writes feature flag in feature matrix.
internal/proxy/lockedwriter_test.go	Adds race-friendly test asserting concurrent writes do not interleave frames.
internal/audit/verify.go	Adds `VerifyError.Malformed` to distinguish corrupt input from integrity failures.
internal/audit/verify_corrupt_test.go	Adds tests covering corrupt-line vs tamper scenarios for `VerifyChain`.
internal/audit/rotate.go	Adds `Rotator.Sync()` so rotating logs can participate in fsync durability.
internal/audit/durability_test.go	Adds durability-related unit tests for Writer/Rotator sync behavior.
internal/audit/audit.go	Adds `Options.Fsync`, `Writer.Sync()`, and per-write fsync behavior.
docs/threat-model.md	Documents audit durability, shutdown ordering, and CORRUPT vs FAILED diagnostics.
cmd/agentfence/main.go	Plumbs `--audit-fsync` through CLI, adds shutdown fsync close behavior, and updates verify output.
cmd/agentfence/durability_test.go	Adds end-to-end CLI tests for `--audit-fsync` and verify diagnostics.
CHANGELOG.md	Records new durability flag, shutdown flush semantics, concurrency tests, and verify diagnostics changes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+	if cfg.fsync && cfg.path == "" {
+		fmt.Fprintln(os.Stderr, "AgentFence: warning: --audit-fsync has no effect without --audit-log")
+	}


+// TestWriterFsyncThroughRotatorIsDurable proves the Fsync option reaches a
+// rotating destination (the path the proxies use with --audit-max-size), so a
+// rotated log is as durable as a plain one.
+func TestWriterFsyncThroughRotatorIsDurable(t *testing.T) {
+	path := filepath.Join(t.TempDir(), "audit.jsonl")
+	rot, err := NewRotator(RotationConfig{Path: path})
+	if err != nil {
+		t.Fatalf("NewRotator: %v", err)
+	}
+	defer rot.Close()
+
+	w := NewWriterOptions(rot, Options{Fsync: true})
+	if err := w.Write(Event{CallID: "c", Tool: "t", Decision: policy.DecisionAllow}); err != nil {
+		t.Fatalf("Write: %v", err)
+	}
+	// The event must be readable from disk immediately (it was flushed).
+	b, err := os.ReadFile(path)


…rotator test Address PR review feedback: - openAuditOutput now clears Options.Fsync when no --audit-log is given. Previously the flag still set Fsync=true, so `check --output text --audit-fsync` would Sync() os.Stdout — which returns EINVAL on a TTY or pipe and failed the run. This matches the documented "no effect" warning. Added a regression test (fails with `audit: fsync: invalid argument` before the fix). - Renamed TestWriterFsyncThroughRotatorIsDurable -> ...WritesEvent and reworded it: it checks the Fsync path drives a rotating destination without error and the event lands on disk, not real fsync(2) semantics (covered by the counting-syncer test). https://claude.ai/code/session_01GQ2VQaTfN7JzTq3VxN3K3q

Copilot AI review requested due to automatic review settings June 14, 2026 20:06

Copilot started reviewing on behalf of dgenio June 14, 2026 20:06 View session

Copilot AI reviewed Jun 14, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(audit): durable writes, graceful-shutdown flush, and clearer verify diagnostics#184

feat(audit): durable writes, graceful-shutdown flush, and clearer verify diagnostics#184
dgenio wants to merge 2 commits into
mainfrom
claude/github-issues-triage-q5ubjq

dgenio commented Jun 14, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dgenio commented Jun 14, 2026

Summary

Related issue

How verified

Tradeoffs / risks

Checklist

Scope notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants