
Adapting the methodology to agentic behavioral monitoring (PSA v3 case study) #7

@SiliconPsycheLabs


Summary

We adapted your claim-driven testing methodology to PSA v3 — a multi-agent behavioral monitoring system built by Silicon Psyche Labs. We wanted to share the outcome and a few observations from applying the framework to a non-database domain.

What we built

A full tutorial (Tutorial 14 in our public SDK repo) covering:

  • Claim formalization for a behavioral monitoring API (claims like determinism, session consistency, propagation correctness, concurrent isolation, score immutability)
  • 9-state verdict taxonomy adapted for HTTP API testing
  • Five concrete test scenarios with property oracles, not just key-presence checks
  • Coverage adequacy argument table
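
A claim register of this kind can be sketched in a few lines. This is a minimal illustration, not the tutorial's actual content: the claim IDs, the one-sentence wording of each claim, and the test names are assumptions chosen to mirror the claim names listed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Claim:
    claim_id: str                       # short handle such as "C1"
    statement: str                      # the guarantee, stated in one sentence
    covered_by: List[str] = field(default_factory=list)  # tests exercising it


# Hypothetical register: IDs, wording, and test names are illustrative only.
CLAIMS = [
    Claim("C1", "Determinism: the same input yields the same score",
          ["test_determinism"]),
    Claim("C2", "Score immutability: a stored score does not change after inference"),
    Claim("C3", "Concurrent isolation: parallel sessions do not affect each other",
          ["test_parallel_posts"]),
]


def uncovered(claims: List[Claim]) -> List[str]:
    """Return the IDs of claims that no test exercises."""
    return [c.claim_id for c in claims if not c.covered_by]


def coverage_table(claims: List[Claim]) -> str:
    """Render a plain-text coverage adequacy table, one row per claim."""
    rows = ["claim | statement | tests"]
    for c in claims:
        tests = ", ".join(c.covered_by) or "NONE"
        rows.append(f"{c.claim_id} | {c.statement} | {tests}")
    return "\n".join(rows)
```

Calling `uncovered(CLAIMS)` lists the guarantees that have no test behind them, which is the gap the formalization step is meant to expose.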

The tutorial is published here: https://github.com/SiliconPsycheLabs/PSA-core/blob/main/tutorials/14-testing-agentic-systems.md

Key observations

What translated directly:

  • The claim-formalization step (C1, C2, ...) is the single highest-value activity — it surfaced 6 implicit guarantees our API was making without any test coverage.
  • The oracle discipline critique ("PARTIAL-surface vs PASS-hardening") forced us to upgrade from key-presence checks such as `assert key in response` to property checks such as `assert response["scs"] > 0.5`, a qualitative jump in test value.
  • The 9-state verdict taxonomy replaced our binary PASS/FAIL with diagnostically precise output. INCONCLUSIVE-oracle-too-weak alone identified 4 tests in our existing suite that were effectively no-ops.
  • The fault injection categories (process fault → restart, concurrency fault → parallel POSTs, staleness fault → time-separated reads) mapped well even without infrastructure-level access (no iptables, single-node deployment).
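
The difference between a key-presence check and a property oracle can be shown with a small sketch. It assumes one possible way to detect a weak oracle: run it against a deliberately broken response and flag it if it still passes. The `verdict` helper, the sample responses, and the use of plain `PASS`/`FAIL` for the remaining outcomes are assumptions for illustration; only the `scs` field, the `> 0.5` threshold, and the `INCONCLUSIVE-oracle-too-weak` name come from the text above.

```python
from typing import Callable, Dict

Response = Dict[str, object]


def key_presence_oracle(response: Response) -> bool:
    """Weak check: only confirms that the field exists."""
    return "scs" in response


def property_oracle(response: Response) -> bool:
    """Stronger check: the score must be a real number above the threshold."""
    value = response.get("scs")
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    return is_number and value > 0.5


def verdict(oracle: Callable[[Response], bool],
            observed: Response,
            known_bad: Response) -> str:
    """Classify a test run (assumed logic, covering three outcomes only).

    An oracle that accepts a response known to be wrong cannot distinguish
    good from bad, so its result is reported as inconclusive.
    """
    if oracle(known_bad):
        return "INCONCLUSIVE-oracle-too-weak"
    return "PASS" if oracle(observed) else "FAIL"


OBSERVED = {"scs": 0.8}     # hypothetical healthy response
KNOWN_BAD = {"scs": None}   # field present, value meaningless
```

With these inputs, `verdict(key_presence_oracle, OBSERVED, KNOWN_BAD)` reports the key-presence test as too weak, while the property oracle rejects the broken response and passes on the healthy one.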

What required adaptation:

  • Linearizability/Elle does not apply (no register operations, no history of conflicting reads/writes). We replaced it with determinism oracles (same input → same output) and immutability oracles (scores do not change post-inference).
  • Network partition testing is not feasible without topology control. We substituted concurrency faults (N parallel writes) and staleness faults (immediate GET after POST).
  • The §7.M model/history/checker discipline is overkill for single-endpoint scoring APIs, but the spirit of it — requiring a named abstract model, a history schema, and a machine-checkable checker — was extremely useful for formalizing the adversarial propagation oracle.
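
The substituted oracles and the concurrency fault can be sketched against an in-memory stand-in. Everything here is an assumption made for illustration: `FakeScoringService` is not the PSA v3 API, its hash-based score is arbitrary, and its `post`/`get` methods only imitate the shape of POST and GET calls. The point is the structure of the three checks.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


class FakeScoringService:
    """In-memory stand-in for an HTTP scoring API (hypothetical)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._store: Dict[int, Dict] = {}
        self._next_id = 0

    def post(self, text: str) -> Dict:
        # Arbitrary deterministic score derived from the input text.
        score = hashlib.sha256(text.encode("utf-8")).digest()[0] / 255
        with self._lock:
            self._next_id += 1
            record = {"id": self._next_id, "scs": score}
            self._store[record["id"]] = record
            return dict(record)

    def get(self, record_id: int) -> Dict:
        with self._lock:
            return dict(self._store[record_id])


def determinism_oracle(svc: FakeScoringService, text: str, runs: int = 3) -> bool:
    """Same input, same output: all repeated submissions share one score."""
    return len({svc.post(text)["scs"] for _ in range(runs)}) == 1


def immutability_oracle(svc: FakeScoringService, text: str, reads: int = 3) -> bool:
    """A stored score reads back unchanged on every later GET."""
    created = svc.post(text)
    return all(svc.get(created["id"])["scs"] == created["scs"] for _ in range(reads))


def concurrent_isolation_oracle(svc: FakeScoringService, texts: List[str]) -> bool:
    """N parallel writes get distinct IDs and the same scores as sequential writes."""
    with ThreadPoolExecutor(max_workers=len(texts)) as pool:
        parallel = list(pool.map(svc.post, texts))   # map preserves input order
    ids = [r["id"] for r in parallel]
    sequential = [svc.post(t)["scs"] for t in texts]
    return len(set(ids)) == len(ids) and [r["scs"] for r in parallel] == sequential
```

Against a real deployment the same oracles would wrap HTTP calls instead of method calls; the comparison logic would stay the same.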

References

The methodology is documented in your SKILL.md files. The oracle-patterns and verdict-taxonomy reference files were the most directly useful. The fault-injection-howto framing ("faults must fire, produce evidence, and be reversible") guided our durability-after-restart scenario design.
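
The "fire, produce evidence, be reversible" framing maps naturally onto a context manager. The sketch below is an assumption about how a durability-after-restart scenario could be structured, not the tutorial's implementation: `FakeProcess` stands in for a restartable single-node service, and the evidence log format is invented for the example.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List, Tuple


class FakeProcess:
    """Hypothetical restartable service with a durable score store."""

    def __init__(self) -> None:
        self.running = True
        self._durable: Dict[str, float] = {}

    def stop(self) -> None:
        self.running = False

    def start(self) -> None:
        self.running = True

    def write(self, key: str, score: float) -> None:
        self._require_running()
        self._durable[key] = score

    def read(self, key: str) -> float:
        self._require_running()
        return self._durable[key]

    def _require_running(self) -> None:
        if not self.running:
            raise ConnectionError("service is down")


@contextmanager
def process_fault(service: FakeProcess,
                  evidence: List[Tuple[str, bool]]) -> Iterator[None]:
    """Stop the service, record that the fault fired, and always restart it."""
    service.stop()
    try:
        fired = not service.running
        evidence.append(("fired", fired))          # proof the fault took effect
        if not fired:
            raise RuntimeError("fault did not fire; result would be meaningless")
        yield
    finally:
        service.start()                            # reversal runs even on error
        evidence.append(("reverted", service.running))
```

A durability check then writes a score, enters `process_fault`, and reads the score back after the block exits; the evidence list shows whether the fault really fired and was undone.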

Happy to share more details or discuss how the approach works on non-database stateful systems. Thanks for publishing this — it filled a real gap.

— Silicon Psyche Labs
