Skip to content

feat(audit): durable writes, graceful-shutdown flush, and clearer verify diagnostics#184

Open
dgenio wants to merge 2 commits into
mainfrom
claude/github-issues-triage-q5ubjq
Open

feat(audit): durable writes, graceful-shutdown flush, and clearer verify diagnostics#184
dgenio wants to merge 2 commits into
mainfrom
claude/github-issues-triage-q5ubjq

Conversation

@dgenio

@dgenio dgenio commented Jun 14, 2026

Copy link
Copy Markdown
Owner

Summary

Hardens the hash-chained audit log — the security-critical artifact the threat model relies on — against write loss and damaged input, and locks in two concurrency invariants with race tests. The four issues form one coherent change: they all touch the audit Writer/Rotator write-and-read lifecycle, the stdio proxy's lockedWriter, and the proxy shutdown path, and share fault-injection/concurrency test scaffolding.

What changed:

Files: internal/audit/{audit,rotate,verify}.go, cmd/agentfence/main.go, plus docs (docs/threat-model.md, README.md, CHANGELOG.md) and four new test files.

Related issue

Fixes #132
Fixes #158
Fixes #172
Fixes #180

How verified

  • make ci (fmt-check, vet, go test -race across all packages): all packages pass, total coverage 81.0%.
  • New tests fail without their corresponding change (e.g. TestVerifyChainFlagsCorruptLineAsMalformed, TestWriterFsyncSyncsEveryWrite) and pass with it.
  • TestLockedWriterKeepsFramesAtomic run with -race -count=3: clean.

Tradeoffs / risks

  • --audit-fsync is opt-in and off by default; per-event fsync trades throughput for durability, as documented.
  • A fsync failure is returned by Writer.Write after the line and sequence/chain state are committed (the record is on disk): it signals a durability problem without rolling back state, preserving the no-gap-on-retry contract. This is intentional and covered by a test.
  • Observing a real fsync(2) from a test isn't portable, so durability is verified at the wiring level (a counting syncer) plus an end-to-end "log is complete and verifies" check.

Checklist

  • Tests added or updated
  • Documentation updated if needed
  • CI passes
  • Issue number included

Scope notes

Changes are limited to the audit durability/integrity cluster (#132, #158, #172, #180). Adjacent audit work surfaced during triage but deliberately left as follow-ups: a typed reason/decision-code taxonomy (#136), audit verify --json (#160), and a redaction hot-path allocation budget (#147).

https://claude.ai/code/session_01GQ2VQaTfN7JzTq3VxN3K3q


Generated by Claude Code

…ify diagnostics

Harden the hash-chained audit log — the project's security-critical
artifact — against write loss and damaged input, and lock in two
concurrency invariants with race tests. Addresses #132, #158, #172, #180.

- #132 Durable writes: add `--audit-fsync` (check/proxy/proxy-http) and
  `audit.Options.Fsync`. `Writer` fsyncs each committed line through to
  stable storage when enabled; `Rotator` gains `Sync()` so rotating logs
  are equally durable. Default behaviour is unchanged (page-cache only).
- #158 Graceful shutdown: add `Writer.Sync()` and fsync-then-close the
  audit destination on the proxies' SIGINT/SIGTERM path, so an in-flight
  decision (audited before the call is forwarded) is never lost on exit.
- #172 Concurrency: race-detector tests for the stdio proxy `lockedWriter`
  (frame atomicity) and the audit `Writer` (gap-free sequence numbers).
- #180 Diagnostics: `audit verify` now distinguishes a corrupt/unreadable
  log (`CORRUPT`) from an integrity break (`FAILED`); `VerifyError` carries
  a `Malformed` flag to separate "damaged file" from "tampering".

Docs: threat-model durability/shutdown/diagnosis notes, README status row,
CHANGELOG entries.

https://claude.ai/code/session_01GQ2VQaTfN7JzTq3VxN3K3q
Copilot AI review requested due to automatic review settings June 14, 2026 20:06

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens AgentFence’s security-critical hash-chained audit log lifecycle by adding an opt-in durability mode (--audit-fsync), improving shutdown flushing semantics, adding race-oriented concurrency tests, and making audit verify diagnostics distinguish corrupt input from integrity failures.

Changes:

  • Add opt-in per-event audit fsync support end-to-end (writer, rotator, CLI flags, shutdown close path) to reduce audit record loss on power failure.
  • Improve audit verification diagnostics by distinguishing malformed/corrupt lines from chain integrity failures (VerifyError.Malformed, CORRUPT vs FAILED).
  • Add concurrency/race tests covering stdio proxy frame atomicity (lockedWriter) and audit writer durability/sequence behavior.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
README.md Documents durable audit writes feature flag in feature matrix.
internal/proxy/lockedwriter_test.go Adds race-friendly test asserting concurrent writes do not interleave frames.
internal/audit/verify.go Adds VerifyError.Malformed to distinguish corrupt input from integrity failures.
internal/audit/verify_corrupt_test.go Adds tests covering corrupt-line vs tamper scenarios for VerifyChain.
internal/audit/rotate.go Adds Rotator.Sync() so rotating logs can participate in fsync durability.
internal/audit/durability_test.go Adds durability-related unit tests for Writer/Rotator sync behavior.
internal/audit/audit.go Adds Options.Fsync, Writer.Sync(), and per-write fsync behavior.
docs/threat-model.md Documents audit durability, shutdown ordering, and CORRUPT vs FAILED diagnostics.
cmd/agentfence/main.go Plumbs --audit-fsync through CLI, adds shutdown fsync close behavior, and updates verify output.
cmd/agentfence/durability_test.go Adds end-to-end CLI tests for --audit-fsync and verify diagnostics.
CHANGELOG.md Records new durability flag, shutdown flush semantics, concurrency tests, and verify diagnostics changes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cmd/agentfence/main.go
Comment on lines +1373 to +1375
if cfg.fsync && cfg.path == "" {
fmt.Fprintln(os.Stderr, "AgentFence: warning: --audit-fsync has no effect without --audit-log")
}
Comment thread internal/audit/durability_test.go Outdated
Comment on lines +108 to +124
// TestWriterFsyncThroughRotatorIsDurable proves the Fsync option reaches a
// rotating destination (the path the proxies use with --audit-max-size), so a
// rotated log is as durable as a plain one.
func TestWriterFsyncThroughRotatorIsDurable(t *testing.T) {
path := filepath.Join(t.TempDir(), "audit.jsonl")
rot, err := NewRotator(RotationConfig{Path: path})
if err != nil {
t.Fatalf("NewRotator: %v", err)
}
defer rot.Close()

w := NewWriterOptions(rot, Options{Fsync: true})
if err := w.Write(Event{CallID: "c", Tool: "t", Decision: policy.DecisionAllow}); err != nil {
t.Fatalf("Write: %v", err)
}
// The event must be readable from disk immediately (it was flushed).
b, err := os.ReadFile(path)
…rotator test

Address PR review feedback:

- openAuditOutput now clears Options.Fsync when no --audit-log is given.
  Previously the flag still set Fsync=true, so `check --output text
  --audit-fsync` would Sync() os.Stdout — which returns EINVAL on a TTY or
  pipe and failed the run. This matches the documented "no effect" warning.
  Added a regression test (fails with `audit: fsync: invalid argument`
  before the fix).
- Renamed TestWriterFsyncThroughRotatorIsDurable ->
  ...WritesEvent and reworded it: it checks the Fsync path drives a rotating
  destination without error and the event lands on disk, not real fsync(2)
  semantics (covered by the counting-syncer test).

https://claude.ai/code/session_01GQ2VQaTfN7JzTq3VxN3K3q
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants