Apply a scientific-method workflow to any decision: coding, debugging, career choices, finances, time management, design, strategy. sci-method is a Claude Code subagent that runs hypothesis generation, falsifiability tests, evidence gathering, and pre-mortem analysis on whatever you're thinking about.
Empirically validated: pre-registered controlled A/B experiment, N=90 reviewer evaluations, omnibus LMM β=+2.10 (p<0.0001), 6/7 measures dominant.
curl -o ~/.claude/agents/sci-method.md \
https://raw.githubusercontent.com/Transconnectome/simplicitas/main/experiments/sci_method_ab/sci-method.md
Then restart your Claude Code session (registry cache refresh) and try:
"Analyze [your question] with the sci-method agent"
If you see an 8-stage output (Cynefin → Hypotheses → Falsifiability → Evidence → Critic → Bayesian → Recommendation → Pre-mortem), install succeeded.
Don't like it? Remove in 1 second: rm ~/.claude/agents/sci-method.md
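The install check above (an 8-stage output) can also be done mechanically. The following is a minimal sketch, assuming the stage names appear literally in the agent's output in the order listed; the real output format may differ, and `stages_in_order` is a hypothetical helper, not part of the repo.

```python
# The eight stage names, in the order given in the install check above.
STAGES = [
    "Cynefin", "Hypotheses", "Falsifiability", "Evidence",
    "Critic", "Bayesian", "Recommendation", "Pre-mortem",
]


def stages_in_order(output: str) -> bool:
    """Return True if every stage name occurs, each after the previous one.

    Assumption: stage names appear as plain text (case-insensitive match).
    """
    lowered = output.lower()
    position = 0
    for stage in STAGES:
        found = lowered.find(stage.lower(), position)
        if found == -1:
            return False
        position = found + len(stage)
    return True


sample = " -> ".join(STAGES)
print(stages_in_order(sample))
```

A plain substring search is deliberately loose: it only confirms that the stages show up in sequence, not that each stage's content is well formed.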
The Claude Code agent registry caches at session start, so a freshly installed agent is not yet visible in the current session. We provide two paths:
Option A, direct subagent call (after restart):
"Analyze [your question] with the sci-method agent"
→ Claude routes to subagent_type="sci-method" directly. This is the cleanest path, but it requires a session restart so the registry sees the new agent file.
Option B, optional companion skill (one-time install):
curl -o ~/.claude/commands/sci-method.md \
https://raw.githubusercontent.com/Transconnectome/sci-method/main/commands/sci-method.md
Then in any conversation:
/sci-method "your question"
→ The skill tries the direct subagent first, and automatically falls back to general-purpose with the agent definition injected if the registry hasn't refreshed yet. Same 8-stage output schema either way (verified equivalent in our M3 Pro purchase demo).
Recommendation: After install, restart once so Option A works long-term. Use Option B when you want to try sci-method immediately in the current session, or when teaching others who don't want to think about session restarts.
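To see which of the two paths is available on a machine, you can check for the two files written by the `curl` commands. This is a small sketch that only inspects the file system; it cannot tell whether the current session's registry has already picked up the agent. `install_status` is a hypothetical helper name.

```python
from pathlib import Path


def install_status(home: Path) -> dict:
    """Report whether the agent file and the companion skill file exist.

    The two paths mirror the curl targets in the install instructions.
    """
    agent = home / ".claude" / "agents" / "sci-method.md"
    skill = home / ".claude" / "commands" / "sci-method.md"
    return {"agent": agent.is_file(), "skill": skill.is_file()}


print(install_status(Path.home()))
```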
The agent applies scientific-method patterns (hypothesis → evidence → falsification → synthesis) to any decision. Four real-world scenarios:
"For my next step, a PhD, should I go to Stanford Lab A (hot field, top publications) or MIT Lab B (strongly recommended by my MS advisor)? Shouldn't I just go where my advisor recommends?"
→ sci-method generates 4-5 hypotheses (career trajectory, risk tolerance, advisor relationship, fit), a Bayesian probability for each, [P10/P50/P90] outcomes, and a pre-mortem ("If you regret this in five years, what will be the biggest cause?").
"Can I manage 19 credits + 20 hours/week as an RA + my minor's graduation requirements this semester? The schedule works out, but my friends say it's too much..."
→ Multiple hypotheses (capacity, sleep deficit, GPA risk, burnout), each with an explicit "wrong if X" condition (e.g., "reject if sleep is <6h/night after 3 weeks"), and a drop-out plan.
"Should I buy an M3 Pro MacBook? All my friends have one and say they use it for coding + video editing. Would that be enough for me?"
→ Premise challenge ("'Everyone has one' is not evidence"), counter-evidence (M2 Pro price range, refurb, alternatives), [P10/P50/P90] cost-utility comparison.
"My Wi-Fi keeps dropping, but only in the living room. Moving the router should fix it, right? My mother says it's an ISP problem..."
→ Multiple hypotheses (interference, distance/walls, ISP throttling, device-specific, channel congestion), ranked by likelihood, with a 30-minute test plan to disambiguate.
Common pattern: If you find yourself thinking "it should be fine, right?" or "that's correct, isn't it?", that's exactly when sci-method is most valuable.
For any input question, the agent structurally requires:
| Dimension | What it forces | Why it matters |
|---|---|---|
| Cynefin triage | Classify problem (clear / complicated / complex / chaotic) | Match method to domain — avoid over-analyzing simple problems |
| Multiple hypotheses | ≥3 distinct, with explicit prior probabilities (sum=1.0) | Prevents tunnel vision on first idea |
| Falsifiability | Each claim ships with "wrong if X" (observable disproof) | Pretty arguments aren't enough — they must be testable |
| Counter-evidence | Tier-1 sources (peer-reviewed > expert > forum) | Confirmation bias resistance via active opposition |
| Calibrated confidence | [P10/P50/P90] distribution, not point estimates | Expresses uncertainty honestly |
| Pre-mortem | "If this fails in 30 days, why? + mitigation" | Prospective failure analysis (Klein 2007) |
| Premise challenge | Detect leading questions, unfalsifiable framing | Resists user pressure to validate (one of seven, not the headline) |
The agent also auto-invokes a critic round to stress-test its own conclusions before delivering.
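Two of the requirements in the table (at least 3 distinct hypotheses with priors summing to 1.0, and a "wrong if X" condition on each claim) are easy to express as checks. The sketch below is illustrative only: the `Hypothesis` class and `check_hypotheses` function are hypothetical names, and the agent itself enforces these rules through its prompt, not through this code. The example priors are made up.

```python
import math
from dataclasses import dataclass


@dataclass
class Hypothesis:
    statement: str
    prior: float   # prior probability in [0, 1]
    wrong_if: str  # observable condition that would disprove the claim


def check_hypotheses(hypotheses: list) -> list:
    """Return a list of rule violations; an empty list means the set passes."""
    problems = []
    if len(hypotheses) < 3:
        problems.append("need at least 3 hypotheses")
    if len({h.statement for h in hypotheses}) != len(hypotheses):
        problems.append("hypotheses must be distinct")
    if any(not 0.0 <= h.prior <= 1.0 for h in hypotheses):
        problems.append("each prior must lie in [0, 1]")
    if not math.isclose(sum(h.prior for h in hypotheses), 1.0, abs_tol=1e-9):
        problems.append("priors must sum to 1.0")
    if any(not h.wrong_if.strip() for h in hypotheses):
        problems.append("every hypothesis needs a 'wrong if' condition")
    return problems


# Illustrative values for the Wi-Fi scenario (not taken from a real run).
wifi = [
    Hypothesis("channel interference", 0.40, "drops persist on an empty channel"),
    Hypothesis("distance / walls", 0.35, "drops persist next to the router"),
    Hypothesis("ISP throttling", 0.25, "wired connection stays stable during drops"),
]
print(check_hypotheses(wifi))
```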
| Metric | Value |
|---|---|
| Omnibus LMM β (condition effect) | +2.10 (large) |
| p-value | < 0.0001 |
| Measures BH-FDR significant | 7 / 7 |
| Measures sci-method dominant | 6 / 7 |
| Large-effect measures (Cliff's δ > 0.5) | 4 (M2-M5) |
| Token cost ratio (sci-method / baseline) | 1.7× |
| Latency ratio | 2.4× |
| ICC (inter-rater reliability) | 6/7 acceptable–excellent |
Pre-registered hypothesis H1 (sci-method dominant) — prior 0.35 → posterior 0.85 ✅ confirmed.
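For readers who want to gauge how strong an update from 0.35 to 0.85 is, the odds form of Bayes' rule gives the Bayes factor implied by those two numbers. This is plain arithmetic on the stated prior and posterior; the source does not say how the posterior was derived, so treat the result as an implied ratio, not a reported statistic.

```python
def odds(p: float) -> float:
    """Convert a probability to odds."""
    return p / (1.0 - p)


def implied_bayes_factor(prior: float, posterior: float) -> float:
    """Posterior odds divided by prior odds (odds form of Bayes' rule)."""
    return odds(posterior) / odds(prior)


bf = implied_bayes_factor(0.35, 0.85)
print(round(bf, 2))  # 10.52
```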
| # | Measure | A (sci-method) | B (baseline) | Cliff's δ | Effect |
|---|---|---|---|---|---|
| M1 | Premise Challenge | 9.64 ± 0.48 | 7.67 ± 3.70 | 0.32 | small ★ |
| M2 | Hypothesis Count+Quality | 10.00 ± 0.00 | 7.27 ± 2.88 | 0.91 | large ★★★ |
| M3 | Falsifiability Coverage | 9.82 ± 0.61 | 6.56 ± 2.68 | 0.90 | large ★★★ |
| M4 | Counter-Evidence Depth | 9.27 ± 0.89 | 6.16 ± 3.54 | 0.55 | large ★★★ |
| M5 | Confidence Interval Specificity | 9.04 ± 0.93 | 6.24 ± 3.12 | 0.68 | large ★★★ |
| M6 | Pre-mortem Rigor | 8.20 ± 1.31 | 6.67 ± 3.59 | 0.12 | negligible |
| M7 | Output Efficiency | 8.20 ± 1.62 | 8.91 ± 0.85 | -0.26 | small (B wins) |
All 7 measures survive Benjamini-Hochberg FDR correction (q = 0.05).
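For reference, the two statistics in the table can be sketched in a few lines of plain Python. This is not the repo's `statistical_analysis.py`; it is a minimal illustration of Cliff's δ and the Benjamini-Hochberg step-up procedure. The magnitude cut-offs (0.147 / 0.33 / 0.474) are a commonly used convention assumed here, and they happen to reproduce the Effect labels in the table.

```python
def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b) over all cross-group pairs."""
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))


def magnitude(delta):
    """Label |delta| using assumed conventional cut-offs."""
    size = abs(delta)
    if size < 0.147:
        return "negligible"
    if size < 0.33:
        return "small"
    if size < 0.474:
        return "medium"
    return "large"


def benjamini_hochberg(p_values, q=0.05):
    """Return booleans, True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * q.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected


print(magnitude(0.32), magnitude(0.55), magnitude(0.12))  # small large negligible
```

Because BH is a step-up procedure, a p-value that misses its own threshold can still be rejected if a larger p-value passes a later one, which is why the loop tracks the largest passing rank.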
Cost trade-off: sci-method outputs are ~1.7× longer (more thorough). Use selectively for decisions where rigor matters more than speed.
- Simple factual lookup ("Python list comprehension 문법은?") — direct answer is best
- Single typo / one-line edit — overhead > benefit
- Time-pressured response — sci-method takes ~120s vs. baseline ~50s
- Cynefin "clear" domain problems with known best practices
The agent itself will short-circuit on clear-domain problems (Stages 1, 7, 8 only) — but you can also just not invoke it.
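The short-circuit rule can be pictured as a simple stage selector. This is a sketch under the assumption that stage numbers follow the order of the 8-stage output listed in the install check (so Stages 1, 7, 8 are Cynefin, Recommendation, Pre-mortem); `stages_to_run` is a hypothetical name, and the actual agent implements this in its prompt.

```python
STAGES = [
    "Cynefin", "Hypotheses", "Falsifiability", "Evidence",
    "Critic", "Bayesian", "Recommendation", "Pre-mortem",
]
DOMAINS = {"clear", "complicated", "complex", "chaotic"}


def stages_to_run(domain: str) -> list:
    """Pick the stages for a Cynefin domain; 'clear' runs only 1, 7 and 8."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown Cynefin domain: {domain!r}")
    if domain == "clear":
        return [STAGES[number - 1] for number in (1, 7, 8)]
    return list(STAGES)


print(stages_to_run("clear"))  # ['Cynefin', 'Recommendation', 'Pre-mortem']
```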
- Independent variable: Condition — A (sci-method 8-stage workflow) vs. B (baseline general-purpose response)
- Dependent variables: 7 measures (0–10 scale) per response
- Stimulus: 5 problems spanning coding, strategy, statistical methodology, software design, paper self-evaluation
- Design: 5 problems × 2 conditions × 3 reps = 30 paired runs
- Reviewers: 3 cross-vendor LLMs via LiteLLM proxy (Gemini 2.5 Pro, GPT-4o, Claude Sonnet 4.6) → 90 evaluations
- Blinding: A/B labels randomized to "Condition X/Y" per evaluation
- Statistics: LMM omnibus + Wilcoxon signed-rank + Cliff's δ + BH-FDR + ICC(2,k)
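The design arithmetic above (30 runs, 90 evaluations) and the label blinding can be sketched as follows. The file-name pattern follows the `runs/` listing in this README; the `blind_labels` function and the seeded `random.Random` are assumptions for illustration and may not match how the repo's reviewer scripts randomize labels.

```python
import itertools
import random

PROBLEMS = [f"P{i}" for i in range(1, 6)]
CONDITIONS = ["A", "B"]
REPS = [1, 2, 3]
REVIEWERS = ["Gemini 2.5 Pro", "GPT-4o", "Claude Sonnet 4.6"]

# 5 problems x 2 conditions x 3 reps = 30 runs
runs = [
    f"{problem}_{condition}_rep{rep}.txt"
    for problem, condition, rep in itertools.product(PROBLEMS, CONDITIONS, REPS)
]
# 30 runs x 3 reviewers = 90 evaluations
evaluations = list(itertools.product(runs, REVIEWERS))


def blind_labels(rng: random.Random) -> dict:
    """Randomly map conditions A/B to neutral labels X/Y for one evaluation."""
    labels = ["X", "Y"]
    rng.shuffle(labels)
    return dict(zip(CONDITIONS, labels))


rng = random.Random(0)  # fixed seed so the mapping is reproducible
print(len(runs), len(evaluations))
print(blind_labels(rng))
```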
Problem P5 in the experiment asked the AI to "evaluate strengths only" of its own paper abstract, a textbook leading question. Baseline scored as low as M1=0, M3=1, M5=0 (multiple reviewers agreed); sci-method consistently scored 10/10. This single problem illustrates where a structured workflow most clearly outperforms an unstructured response: scenarios where the question itself contains a hidden assumption you should challenge before answering.
experiments/sci_method_ab/
├── README.md # this file
├── sci-method.md # ⭐ the agent definition (1-line install target)
├── infographic.png # visual summary (16:9, 4K)
├── paper_draft.md # full draft (~3500 words)
├── stimulus.json # 5 problems verbatim
├── rubric.json # 7 measures, 0-10 anchors, blinding
├── runs/ # 30 raw responses (P{1-5}_{A,B}_rep{1-3}.txt)
├── reviews/
│   └── reviewer_scores_v2.json # 90 reviewer score rows (consolidated)
├── analysis/
│   ├── full_results.json # descriptives, Wilcoxon, Cliff's δ, BH-FDR, ICC, LMM
│   └── summary_report.md # human-readable summary
└── scripts/
    ├── reviewer_pipeline.py # initial v1 (had truncation issues)
    ├── reviewer_pipeline_v2.py # v2 sequential, single-condition per call
    ├── reviewer_parallel.py # ThreadPoolExecutor parallel (used for final)
    └── statistical_analysis.py # LMM + non-parametric tests + ICC
- Python 3.10+ with numpy, scipy, pandas, statsmodels, requests
- LiteLLM proxy or direct API access to Gemini 2.5 Pro, GPT-4o, Claude Sonnet 4.6 (or equivalents)
git clone https://github.com/Transconnectome/simplicitas.git
cd simplicitas/experiments/sci_method_ab
# 1. Reviewer evaluations (uses cache if available)
python scripts/reviewer_parallel.py
# 2. Statistical analysis
python scripts/statistical_analysis.py
# Outputs:
# reviews/reviewer_scores_v2.json
# analysis/full_results.json
# analysis/summary_report.md
- Reasoning models truncate: Initial reviewer pipeline used Gemini 3.1 Pro Preview, GPT-5.4-Pro, Claude Opus 4.6 with max_tokens=600. Reasoning models consumed token budget on internal reasoning, leaving truncated JSON outputs. Fix: switched to non-reasoning equivalents + max_tokens=4000.
- Aggressive parallelism triggers rate-limit: 30 sub-agents launched simultaneously caused 4 immediate failures. Fix: staggered retries, ThreadPoolExecutor with max_workers=20.
- Pre-registration pays off: All 7 measures and 4 hypotheses defined before running. Post-hoc temptation to change measures was avoided. Result: H1 confirmed cleanly.
- Condition A is asymmetrically expensive: sci-method (~118k tokens, ~120s) vs. baseline (~70k tokens, ~50s), i.e. 1.7× and 2.4× respectively. Production deployment requires this awareness.
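The rate-limit fix (a bounded ThreadPoolExecutor plus staggered retries) follows a common pattern, sketched below. This is not the code in `scripts/reviewer_parallel.py`; `call_with_retries`, `run_all`, and the delay values are hypothetical, and only `max_workers=20` comes from the notes above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_with_retries(task, arg, max_attempts=3, base_delay=0.5):
    """Call task(arg); after each failure wait a bit longer before retrying."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(arg)
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            time.sleep(base_delay * attempt)  # staggered: 1x, 2x, ... base_delay


def run_all(task, args, max_workers=20, **retry_options):
    """Run task over args with bounded parallelism; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(call_with_retries, task, arg, **retry_options)
            for arg in args
        ]
        return [future.result() for future in futures]


if __name__ == "__main__":
    print(run_all(lambda x: x * 2, range(5)))
```

Capping `max_workers` limits how many requests are in flight at once, while the growing sleep spreads retries out so that failed calls do not all hit the API again at the same moment.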
- Single base model: Both conditions used Claude Opus 4.7 — generalization to other LLMs requires replication.
- N = 5 problems: Narrow stimulus sample; domain-specific effects possible.
- N = 3 reps per cell: Low statistical power — emphasis on effect size over p-value.
- Reviewer family overlap: Claude Sonnet 4.6 reviewer shares model family with Condition A agent (Opus 4.7). Sensitivity analysis recommended.
- Prompt sensitivity: Condition A's explicit instruction injects schema awareness. Equivalent explicit instruction to Condition B might yield similar structure.
Cha, J. (2026). sci-method: A scientific-reasoning agent for general
problem-solving — controlled A/B validation. Internal report,
Connectome Lab, Seoul National University. 2026-05-02.
https://github.com/Transconnectome/simplicitas/tree/main/experiments/sci_method_ab
- Plan / design history: ~/.claude/plans/claude-code-velvet-pebble.md § Phase E2
- chavis (sister project), anti-sycophancy specialist: ~/.claude/projects/-home-juke/memory/chavis_project.md
- simplicitas (parent project), code complexity razor: repo root
