Adapting the methodology to agentic behavioral monitoring (PSA v3 case study) #7

## Summary

We adapted your claim-driven testing methodology to **PSA v3**, a multi-agent behavioral monitoring system built by Silicon Psyche Labs. This issue shares the outcome and a few observations from applying the framework to a non-database domain.

## What we built

A full tutorial (Tutorial 14 in our public SDK repo) covering:
- Claim formalization for a behavioral monitoring API (claims like determinism, session consistency, propagation correctness, concurrent isolation, score immutability)
- 9-state verdict taxonomy adapted for HTTP API testing
- Five concrete test scenarios with property oracles, not just key-presence checks
- Coverage adequacy argument table
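
To make the claim-formalization and coverage-adequacy steps concrete, here is a minimal sketch. The claim names come from the list above; the `C1`..`C5` numbering, the scenario names, and the scenario-to-claim mapping are hypothetical and are not copied from the tutorial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    claim_id: str
    name: str


# Claim names are from the tutorial summary; the ID assignment is illustrative.
CLAIMS = [
    Claim("C1", "determinism"),
    Claim("C2", "session consistency"),
    Claim("C3", "propagation correctness"),
    Claim("C4", "concurrent isolation"),
    Claim("C5", "score immutability"),
]

# Hypothetical mapping from test scenarios to the claims they exercise.
SCENARIOS = {
    "repeat_identical_input": ["C1"],
    "parallel_posts": ["C4"],
    "read_after_restart": ["C5"],
}


def uncovered_claims(claims, scenarios):
    """Return the IDs of claims that no scenario exercises."""
    covered = {cid for ids in scenarios.values() for cid in ids}
    return [c.claim_id for c in claims if c.claim_id not in covered]


print(uncovered_claims(CLAIMS, SCENARIOS))
```

With this deliberately incomplete mapping, the session-consistency and propagation-correctness claims are reported as uncovered, which is the kind of gap a coverage adequacy table is meant to expose.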

The tutorial is published here: https://github.com/SiliconPsycheLabs/PSA-core/blob/main/tutorials/14-testing-agentic-systems.md

## Key observations

**What translated directly:**
- The claim-formalization step (C1, C2, ...) is the single highest-value activity — it surfaced 6 implicit guarantees our API was making without any test coverage.
- The oracle discipline critique ("PARTIAL-surface vs PASS-hardening") forced us to upgrade from key-presence checks (`assert key in response`) to property checks (`assert response["scs"] > 0.5`), a qualitative jump in test value.
- The 9-state verdict taxonomy replaced our binary PASS/FAIL with diagnostically precise output. `INCONCLUSIVE-oracle-too-weak` alone identified 4 tests in our existing suite that were effectively no-ops.
- The fault injection categories (process fault → restart, concurrency fault → parallel POSTs, staleness fault → time-separated reads) mapped well even without infrastructure-level access (no iptables, single-node deployment).
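
The difference between a key-presence check and a property oracle can be sketched as follows. The `scs` field and the `0.5` threshold are taken from the bullet above. The `is_too_weak` helper is our own assumption about one way to flag a no-op oracle (feed it responses that are known to be bad and see whether it still accepts them); it is not necessarily how the methodology defines `INCONCLUSIVE-oracle-too-weak`.

```python
def key_presence_oracle(response):
    """Surface-level check: passes whenever the field exists."""
    return "scs" in response


def property_oracle(response):
    """Property check: the score must be numeric and above the threshold."""
    score = response.get("scs")
    return isinstance(score, (int, float)) and score > 0.5


# Hypothetical responses that a meaningful oracle should reject.
KNOWN_BAD = [{"scs": 0.0}, {"scs": None}]


def is_too_weak(oracle, known_bad):
    """An oracle that accepts a known-bad response cannot detect that failure."""
    return any(oracle(response) for response in known_bad)


for oracle in (key_presence_oracle, property_oracle):
    verdict = "INCONCLUSIVE-oracle-too-weak" if is_too_weak(oracle, KNOWN_BAD) else "oracle can fail"
    print(oracle.__name__, "->", verdict)
```

The key-presence oracle accepts both bad responses, so a test built on it can never fail for those cases; the property oracle rejects them.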

**What required adaptation:**
- Linearizability/Elle does not apply: the API has no register operations and no history of conflicting reads/writes. We replaced it with determinism oracles (same input → same output) and immutability oracles (scores do not change post-inference).
- Network partition testing is not feasible without topology control. We substituted concurrency faults (N parallel writes) and staleness faults (immediate GET after POST).
- The `§7.M` model/history/checker discipline is overkill for single-endpoint scoring APIs, but the *spirit* of it — requiring a named abstract model, a history schema, and a machine-checkable checker — was extremely useful for formalizing the adversarial propagation oracle.
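
The substitute oracles described above can be sketched against an in-memory stand-in. `FakeScoringService` and its `post`/`get` methods are hypothetical and do not represent the PSA v3 API; the sketch only shows the shape of a determinism oracle, an immutability oracle, and a concurrency fault made of N parallel writes.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeScoringService:
    """In-memory stand-in for a scoring API (hypothetical, for illustration)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._records = {}
        self._next_id = 0

    def post(self, session_id, text):
        # Deterministic pseudo-score in [0, 1] derived from the input text.
        score = hashlib.sha256(text.encode("utf-8")).digest()[0] / 255
        with self._lock:
            self._next_id += 1
            record_id = self._next_id
            self._records[record_id] = {"session": session_id, "scs": score}
        return record_id

    def get(self, record_id):
        with self._lock:
            return dict(self._records[record_id])


def determinism_oracle(svc, text, runs=5):
    """Same input must yield the same score on every submission."""
    scores = {svc.get(svc.post("s-det", text))["scs"] for _ in range(runs)}
    return len(scores) == 1


def immutability_oracle(svc, text, reads=5):
    """A stored record must not change across repeated reads."""
    record_id = svc.post("s-imm", text)
    first = svc.get(record_id)
    return all(svc.get(record_id) == first for _ in range(reads))


def concurrent_isolation_oracle(svc, n=8):
    """N parallel writes: each record must stay attached to its own session."""
    def write(i):
        session_id = f"s-{i}"
        return session_id, svc.post(session_id, f"message {i}")

    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(write, range(n)))
    return all(svc.get(rid)["session"] == sid for sid, rid in results)
```

Against a real HTTP service the same three functions would wrap requests to the scoring endpoint; the oracles themselves stay unchanged.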

## References

The methodology is documented in your SKILL.md files. The oracle-patterns and verdict-taxonomy reference files were the most directly useful. The fault-injection-howto framing ("faults must fire, produce evidence, and be reversible") guided our durability-after-restart scenario design.
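
As an illustration of the "fire, produce evidence, be reversible" framing, here is a minimal sketch of a process fault wrapped in a context manager. `FakeNode`, `process_fault`, and the stored key are hypothetical; a real scenario would stop and restart the actual service process rather than flip a flag.

```python
from contextlib import contextmanager


class FakeNode:
    """Hypothetical single-node service with a store that survives restarts."""

    def __init__(self):
        self.durable = {}
        self.running = True

    def ping(self):
        return self.running

    def write(self, key, value):
        if not self.running:
            raise RuntimeError("node is down")
        self.durable[key] = value

    def read(self, key):
        if not self.running:
            raise RuntimeError("node is down")
        return self.durable[key]


@contextmanager
def process_fault(node, evidence):
    """Stop the node, record proof that the fault fired, always restart it."""
    node.running = False
    if node.ping():
        raise AssertionError("fault did not fire")
    evidence.append("fired: node stopped responding")
    try:
        yield
    finally:
        node.running = True  # reversal runs even if the body raises
        evidence.append(
            "reverted: node responding again" if node.ping() else "revert failed"
        )


def durability_after_restart():
    """Write a score, inject the fault, then check the score is still there."""
    node, evidence = FakeNode(), []
    node.write("score:1", 0.73)
    with process_fault(node, evidence):
        pass  # the node is down for the duration of this block
    return node.read("score:1") == 0.73, evidence
```

The evidence list is what separates a fault that demonstrably fired from one that was only requested, so a passing durability check cannot be attributed to a fault that never happened.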

Happy to share more details or discuss how the approach works on non-database stateful systems. Thanks for publishing this — it filled a real gap.

— Silicon Psyche Labs
