I study how frontier AI systems fail under adversarial pressure. My current work investigates compliance-induced metacognitive collapse: the finding that a simple deployment instruction ("always answer, do not refuse") causes most frontier models to fabricate rather than admit uncertainty. I build evaluation infrastructure to measure this at scale and release everything openly.
Compliance-forcing instructions override epistemic boundaries in 8 of 11 frontier models, producing cognitive collapse rather than strategic deception. The trigger is the instruction itself, not the adversarial context. Architecture family predicts vulnerability; training methodology determines immunity.
- Paper: arXiv:2605.02398
- Post: You Don't Need an Adversary to Break Most Frontier Models (EA Forum)
- Code + Pipeline: rkstu/schema-compliance-trap
- Dataset: 67,221 scored records, 11 models, 8 vendors
- Built on: UK AISI Inspect
5,470 evaluations mapping the compliance pressure curve across 8 models, 5 graduated pressure levels, and 4 fabrication domains (geography, medical, legal, technical). The collapse is a cliff at prohibition-of-uncertainty language, not a gradient. Multi-turn commitment adds zero. Within-family capability protects (GPT-4o at 95.6% vs GPT-4o-mini at 34.4%).
- Data + Verification: schema-compliance-trap/experiments/dose-response-curve
- Verify all claims:
python3 analysis/verify_numbers.py(zero API calls, deterministic from raw data)
583 tasks probing whether models can recognize the boundaries of their own knowledge, across 16 models and 5 vendors. Three metacognitive families: epistemic boundary detection, clarification seeking, and solution monitoring. The evaluation substrate underlying the Compliance Trap experiments.
- Code + Benchmark: rkstu/amb-adversarial-metacognition-benchmark
- Kaggle Writeup: Measuring AGI Competition
- Dataset: Kaggle
Automated pre-deployment security testing for AI agents. Seven-dimensional scoring framework, 50+ attack patterns. The first DAST purpose-built for agentic systems.
- Code: preseal/preseal
- Site: preseal.dev
- Install:
pip install preseal
Adversarial robustness testing for autonomous business process agents. Hydra pipeline architecture. 1st Prize and Legendary Tier at UC Berkeley RDI AgentX Competition.


