Skip to content
View rkstu's full-sized avatar

Block or report rkstu

Block user

Prevent this user from interacting with your repositories and sending you notifications. Learn more about blocking users.

You must be logged in to block users.

Maximum 250 characters. Please don’t include any personal information such as legal names or email addresses. Markdown is supported. This note will only be visible to you.
Report abuse

Contact GitHub support about this user’s behavior. Learn more about reporting abuse.

Report abuse
rkstu/README.md

I study how frontier AI systems fail under adversarial pressure. My current work investigates compliance-induced metacognitive collapse: the finding that a simple deployment instruction ("always answer, do not refuse") causes most frontier models to fabricate rather than admit uncertainty. I build evaluation infrastructure to measure this at scale and release everything openly.

Research

The Compliance Trap

Compliance-forcing instructions override epistemic boundaries in 8 of 11 frontier models, producing cognitive collapse rather than strategic deception. The trigger is the instruction itself, not the adversarial context. Architecture family predicts vulnerability; training methodology determines immunity.

Dose-Response Extension

5,470 evaluations mapping the compliance pressure curve across 8 models, 5 graduated pressure levels, and 4 fabrication domains (geography, medical, legal, technical). The collapse is a cliff at prohibition-of-uncertainty language, not a gradient. Multi-turn commitment adds zero. Within-family capability protects (GPT-4o at 95.6% vs GPT-4o-mini at 34.4%).

Adversarial Metacognition Benchmark (AMB)

583 tasks probing whether models can recognize the boundaries of their own knowledge, across 16 models and 5 vendors. Three metacognitive families: epistemic boundary detection, clarification seeking, and solution monitoring. The evaluation substrate underlying the Compliance Trap experiments.

Tools

Preseal

Automated pre-deployment security testing for AI agents. Seven-dimensional scoring framework, 50+ attack patterns. The first DAST purpose-built for agentic systems.

Entropic CRMArena

Adversarial robustness testing for autonomous business process agents. Hydra pipeline architecture. 1st Prize and Legendary Tier at UC Berkeley RDI AgentX Competition.

Contact

rahulkc.dev@gmail.com | Google Scholar | LinkedIn

Pinned Loading

  1. schema-compliance-trap schema-compliance-trap Public

    We measure when AI systems lose the ability to say `I don't know` and build the tools to catch it before deployment.

    Python

  2. preseal/preseal preseal/preseal Public

    Safety linter and regression gate for AI agents. Catches security regressions before deployment.

    Python

  3. BioCalibrate/BioCalibrate BioCalibrate/BioCalibrate Public

    Python

  4. entropic-crmarenapro entropic-crmarenapro Public

    Python 4

  5. amb-adversarial-metacognition-benchmark amb-adversarial-metacognition-benchmark Public

    AMB: A 583-task adversarial benchmark evaluating metacognition across 7 families. 16 models, 5 vendors, deterministic scoring. Kaggle Measuring AGI 2026, Metacognition Track.

    Python

  6. Epistemic-Sec Epistemic-Sec Public

    TypeScript