sci-method Agent

Apply a scientific-method workflow to any decision: coding, debugging, career choices, finances, time management, design, strategy. A Claude Code subagent that runs hypotheses, falsifiability tests, evidence gathering, and pre-mortem analysis on whatever you're thinking about.

Empirically validated: pre-registered controlled A/B experiment, N=90 reviewer evaluations, omnibus LMM β=+2.10 (p<0.0001), 6/7 measures dominant.

Quick Install (1 minute)

curl -fL --create-dirs -o ~/.claude/agents/sci-method.md \
  https://raw.githubusercontent.com/Transconnectome/simplicitas/main/experiments/sci_method_ab/sci-method.md

Then: restart your Claude Code session (registry cache refresh) and try:

"Analyze [your question] with the sci-method agent"

If you see an 8-stage output (Cynefin → Hypotheses → Falsifiability → Evidence → Critic → Bayesian → Recommendation → Pre-mortem), install succeeded.

Don't like it? Remove in 1 second: rm ~/.claude/agents/sci-method.md

Two ways to call (after install)

The Claude Code agent registry caches at session start, so a freshly installed agent is not yet visible in the current session. We provide two paths:

Option A — Direct subagent (after restarting Claude Code)

"Analyze [your question] with the sci-method agent"

→ Claude routes to subagent_type="sci-method" directly. Cleanest path. Requires session restart so the registry sees the new agent file.

Option B — Slash skill (works in any session, automatic fallback)

Optional companion skill (one-time install):

curl -fL --create-dirs -o ~/.claude/commands/sci-method.md \
  https://raw.githubusercontent.com/Transconnectome/sci-method/main/commands/sci-method.md

Then in any conversation:

/sci-method "your question"

→ The skill tries the direct subagent first, and automatically falls back to general-purpose with the agent definition injected if the registry hasn't refreshed yet. Same 8-stage output schema either way (verified equivalent in our M3 Pro purchase demo).
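
The fallback described above can be pictured with a small sketch. This is only an illustration of the routing idea, not the skill's actual implementation: the function name `choose_route` and its arguments are hypothetical, and the real skill is a prompt file rather than Python code.

```python
AGENT_NAME = "sci-method"  # subagent_type named in this README


def choose_route(registered_agents, agent_definition):
    """Pick how to dispatch a question (hypothetical helper).

    registered_agents: names the current session's registry already knows.
    agent_definition:  text of sci-method.md, used only for the fallback.
    Returns (subagent_type, injected_definition_or_None).
    """
    if AGENT_NAME in registered_agents:
        # Registry has refreshed: call the subagent directly (Option A path).
        return AGENT_NAME, None
    # Registry is stale: use general-purpose and inject the definition.
    return "general-purpose", agent_definition


# Fresh install, session not yet restarted -> fallback path.
print(choose_route({"general-purpose"}, "<contents of sci-method.md>"))
```

Either branch is meant to produce the same 8-stage output; only the dispatch target differs.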

Recommendation: After install, restart once so Option A works long-term. Use Option B when you want to try sci-method immediately in the current session, or when teaching others who don't want to think about session restarts.

Use Cases — Beyond Science

The agent applies scientific-method patterns (hypothesis → evidence → falsification → synthesis) to any decision. Four real-world scenarios:

1. Career Decision (high social pressure)

"For my next step after the PhD, should I go to Stanford Lab A (hot field, top publications) or MIT Lab B (strongly recommended by my MS advisor)? Shouldn't I follow my advisor's recommendation?"

→ sci-method generates 4-5 hypotheses (career trajectory, risk tolerance, advisor relationship, fit), a Bayesian probability for each, [P10/P50/P90] outcomes, and a pre-mortem ("If you regret this in five years, what will be the biggest cause?").

2. Time Management (planning fallacy + sunk cost)

"Can I handle 19 credits this semester, a 20-hour/week RA position, and my minor's graduation requirements all at once? The schedule fits, but my friends say it's too much..."

→ Multiple hypotheses (capacity, sleep deficit, GPA risk, burnout), each with an explicit "wrong if X" condition (e.g., "reject if sleep is <6h/night after 3 weeks"), and a drop-out plan.

3. Purchase Decision (FOMO sycophancy)

"Should I buy an M3 Pro MacBook? All my friends have one, and I'd use it for coding and video editing. Would that be enough?"

→ Premise challenge ("'everyone has one' is not evidence"), counter-evidence (M2 Pro price range, refurb, alternatives), [P10/P50/P90] cost-utility comparison.

4. Everyday Debugging (premise challenge)

"My Wi-Fi keeps dropping, but only in the living room. Moving the router should fix it, right? My mother says it's an ISP problem..."

→ Multiple hypotheses (interference, distance/walls, ISP throttling, device-specific, channel congestion), ranked by likelihood, with a 30-minute test plan to disambiguate.

Common pattern: if you catch yourself thinking "...that should work, right?" or "...that's right, isn't it?", that is exactly when sci-method is most valuable.

What sci-method Adds (7 Dimensions)

For any input question, the agent structurally requires:

Dimension	What it forces	Why it matters
Cynefin triage	Classify problem (clear / complicated / complex / chaotic)	Match method to domain — avoid over-analyzing simple problems
Multiple hypotheses	≥3 distinct, with explicit prior probabilities (sum=1.0)	Prevents tunnel vision on first idea
Falsifiability	Each claim ships with "wrong if X" (observable disproof)	Pretty arguments aren't enough — they must be testable
Counter-evidence	Tier-1 sources (peer-reviewed > expert > forum)	Confirmation bias resistance via active opposition
Calibrated confidence	[P10/P50/P90] distribution, not point estimates	Expresses uncertainty honestly
Pre-mortem	"If this fails in 30 days, why? + mitigation"	Prospective failure analysis (Klein 2007)
Premise challenge	Detect leading questions, unfalsifiable framing	Resists user pressure to validate (one of seven, not the headline)

The agent also auto-invokes a critic round to stress-test its own conclusions before delivering.
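
As a rough illustration of the "multiple hypotheses" and "falsifiability" requirements, the sketch below checks a hypothesis set for the three structural rules in the table: at least 3 distinct hypotheses, priors summing to 1.0, and a "wrong if X" condition on each. The `Hypothesis` class, the `validate_hypotheses` helper, and the prior values are assumptions made for this example; the agent enforces these rules through its prompt, not through this code.

```python
import math
from dataclasses import dataclass


@dataclass
class Hypothesis:
    claim: str
    prior: float   # prior probability in [0, 1]
    wrong_if: str  # observable result that would disprove the claim


def validate_hypotheses(hypotheses, min_count=3):
    """Return a list of problems; an empty list means the set passes."""
    problems = []
    if len(hypotheses) < min_count:
        problems.append(f"need at least {min_count} hypotheses, got {len(hypotheses)}")
    if len({h.claim for h in hypotheses}) != len(hypotheses):
        problems.append("hypotheses must be distinct")
    total = sum(h.prior for h in hypotheses)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        problems.append(f"priors sum to {total:.3f}, expected 1.0")
    for h in hypotheses:
        if not h.wrong_if.strip():
            problems.append(f"'{h.claim}' has no 'wrong if' condition")
    return problems


# Wi-Fi example from the use cases; the priors are made up for illustration.
wifi = [
    Hypothesis("distance/walls", 0.4, "signal stays weak next to the router"),
    Hypothesis("interference", 0.3, "drops persist with nearby devices off"),
    Hypothesis("channel congestion", 0.2, "drops persist on an empty channel"),
    Hypothesis("ISP throttling", 0.1, "wired connection never drops"),
]
print(validate_hypotheses(wifi))  # an empty list means all checks passed
```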

Visual Summary

[Figure: infographic.png (visual summary, 16:9, 4K)]

Empirical Validation (Pre-registered A/B Experiment)

Metric	Value
Omnibus LMM β (condition effect)	+2.10 (large)
p-value	< 0.0001
Measures BH-FDR significant	7 / 7
Measures sci-method dominant	6 / 7
Large-effect measures (Cliff's δ > 0.5)	4 (M2-M5)
Token cost ratio (sci-method / baseline)	1.7×
Latency ratio	2.4×
ICC (inter-rater reliability)	6/7 acceptable–excellent

Pre-registered hypothesis H1 (sci-method dominant) — prior 0.35 → posterior 0.85 ✅ confirmed.
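
To give a feel for how strong the evidence must be to move H1 from 0.35 to 0.85, the sketch below computes the Bayes factor implied by those two numbers. It assumes a plain odds-form Bayesian update; this README does not state how the posterior was actually derived, so treat the result as a back-of-the-envelope reading rather than a reported statistic.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def implied_bayes_factor(prior, posterior):
    """Bayes factor that turns `prior` into `posterior` (odds form)."""
    return odds(posterior) / odds(prior)


def update(prior, bayes_factor):
    """Posterior probability after multiplying prior odds by a Bayes factor."""
    post_odds = odds(prior) * bayes_factor
    return post_odds / (1.0 + post_odds)


bf = implied_bayes_factor(0.35, 0.85)
print(f"implied Bayes factor: {bf:.2f}")           # about 10.5
print(f"round trip posterior: {update(0.35, bf):.2f}")
```

Under this assumption, the data would need to be roughly ten times more likely under H1 than under its alternatives.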

Per-Measure Effect Sizes

#	Measure	A (sci-method)	B (baseline)	Cliff's δ	Effect
M1	Premise Challenge	9.64 ± 0.48	7.67 ± 3.70	0.32	small ★
M2	Hypothesis Count+Quality	10.00 ± 0.00	7.27 ± 2.88	0.91	large ★★★
M3	Falsifiability Coverage	9.82 ± 0.61	6.56 ± 2.68	0.90	large ★★★
M4	Counter-Evidence Depth	9.27 ± 0.89	6.16 ± 3.54	0.55	large ★★★
M5	Confidence Interval Specificity	9.04 ± 0.93	6.24 ± 3.12	0.68	large ★★★
M6	Pre-mortem Rigor	8.20 ± 1.31	6.67 ± 3.59	0.12	negligible
M7	Output Efficiency	8.20 ± 1.62	8.91 ± 0.85	-0.26	small (B wins)
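
For readers unfamiliar with Cliff's δ, here is a minimal pure-Python sketch of the statistic and of a magnitude labelling. The thresholds (0.147, 0.33, 0.474) are the commonly cited conventions and are an assumption here; this README only states δ > 0.5 for "large", and the repository's `statistical_analysis.py` may compute things differently. The sample scores are invented.

```python
def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b) over all cross-group pairs."""
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))


def magnitude(delta):
    """Label |delta| using commonly cited thresholds (an assumption)."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"


# Invented scores on the 0-10 scale, for illustration only.
scores_a = [10, 9, 10]
scores_b = [7, 9, 6]
d = cliffs_delta(scores_a, scores_b)
print(round(d, 3), magnitude(d))
```

A positive δ means condition A tends to score higher; a negative δ, as for M7, means B tends to score higher.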

All 7 measures survive Benjamini-Hochberg FDR correction (q = 0.05).
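
The Benjamini-Hochberg step-up procedure mentioned above can be sketched in a few lines. This is a generic textbook version, not the code from `statistical_analysis.py`, and the p-values in the example are invented to show the step-up behaviour; they are not the experiment's p-values.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * q.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Invented p-values for 7 measures (illustration only).
example = [0.9, 0.001, 0.042, 0.008, 0.030, 0.035, 0.040]
print(benjamini_hochberg(example))
```

Note the step-up effect: 0.030 alone misses its own threshold (3/7 × 0.05 ≈ 0.021), but it is still rejected because a larger p-value (0.042 at rank 6) passes its threshold.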

Cost trade-off: sci-method outputs are ~1.7× longer (more thorough). Use selectively for decisions where rigor matters more than speed.

When NOT to Use

Simple factual lookup ("What is the syntax for a Python list comprehension?"): a direct answer is best
Single typo / one-line edit — overhead > benefit
Time-pressured response — sci-method takes ~120s vs. baseline ~50s
Cynefin "clear" domain problems with known best practices

The agent itself will short-circuit on clear-domain problems (Stages 1, 7, 8 only) — but you can also just not invoke it.
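
The short-circuit can be summarized as a stage-selection rule. The sketch below uses the 8 stage names listed in the install check and the stage numbers given above (1, 7, 8); the function `stages_for` is a hypothetical helper written for this illustration, since the agent applies the rule inside its prompt.

```python
# The 8 stages, in the order shown in the install check.
STAGES = [
    "Cynefin", "Hypotheses", "Falsifiability", "Evidence",
    "Critic", "Bayesian", "Recommendation", "Pre-mortem",
]
CLEAR_STAGE_NUMBERS = (1, 7, 8)  # stages kept for clear-domain problems


def stages_for(domain):
    """Return the stages to run for a Cynefin domain (hypothetical helper)."""
    if domain == "clear":
        return [STAGES[n - 1] for n in CLEAR_STAGE_NUMBERS]
    return list(STAGES)


print(stages_for("clear"))    # ['Cynefin', 'Recommendation', 'Pre-mortem']
print(len(stages_for("complex")))  # 8
```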

Design Summary (Experimental Methods)

Independent variable: Condition — A (sci-method 8-stage workflow) vs. B (baseline general-purpose response)
Dependent variables: 7 measures (0–10 scale) per response
Stimulus: 5 problems spanning coding, strategy, statistical methodology, software design, paper self-evaluation
Design: 5 problems × 2 conditions × 3 reps = 30 runs (15 paired A/B comparisons)
Reviewers: 3 cross-vendor LLMs via LiteLLM proxy (Gemini 2.5 Pro, GPT-4o, Claude Sonnet 4.6) → 90 evaluations
Blinding: A/B labels randomized to "Condition X/Y" per evaluation
Statistics: LMM omnibus + Wilcoxon signed-rank + Cliff's δ + BH-FDR + ICC(2,k)
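
The design arithmetic above (30 runs, 90 evaluations) and the blinding step can be reproduced with a short sketch. Run names follow the `P{1-5}_{A,B}_rep{1-3}` pattern from the repository layout; the `blind_labels` helper is an assumption about how the per-evaluation X/Y randomization might look, not the pipeline's actual code.

```python
import itertools
import random

PROBLEMS = [f"P{i}" for i in range(1, 6)]
CONDITIONS = ["A", "B"]
REPS = [1, 2, 3]
REVIEWERS = ["Gemini 2.5 Pro", "GPT-4o", "Claude Sonnet 4.6"]

# 5 problems x 2 conditions x 3 reps = 30 runs.
runs = [f"{p}_{c}_rep{r}" for p, c, r in itertools.product(PROBLEMS, CONDITIONS, REPS)]

# Every reviewer scores every run: 3 x 30 = 90 evaluations.
evaluations = [(reviewer, run) for reviewer in REVIEWERS for run in runs]


def blind_labels(rng):
    """Randomly map the real conditions to 'X'/'Y' for one evaluation."""
    labels = ["X", "Y"]
    rng.shuffle(labels)
    return dict(zip(CONDITIONS, labels))


rng = random.Random(0)  # fixed seed so the mapping can be audited later
print(len(runs), len(evaluations), blind_labels(rng))
```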

Notable Pattern: Pressure-Trap Problems

Problem P5 in the experiment asked the AI to "evaluate strengths only" of its own paper abstract, a textbook leading question. Baseline scored as low as M1=0, M3=1, M5=0 (multiple reviewers agreed); sci-method consistently scored 10/10. This single problem illustrates where a structured workflow most clearly outperforms an unstructured response: scenarios where the question itself contains a hidden assumption you should challenge before answering.

Repository Layout

experiments/sci_method_ab/
├── README.md                       # this file
├── sci-method.md                   # ⭐ the agent definition (1-line install target)
├── infographic.png                 # visual summary (16:9, 4K)
├── paper_draft.md                  # full draft (~3500 words)
├── stimulus.json                   # 5 problems verbatim
├── rubric.json                     # 7 measures, 0-10 anchors, blinding
├── runs/                           # 30 raw responses (P{1-5}_{A,B}_rep{1-3}.txt)
├── reviews/
│   └── reviewer_scores_v2.json     # 90 reviewer score rows (consolidated)
├── analysis/
│   ├── full_results.json           # descriptives, Wilcoxon, Cliff's δ, BH-FDR, ICC, LMM
│   └── summary_report.md           # human-readable summary
└── scripts/
    ├── reviewer_pipeline.py        # initial v1 (had truncation issues)
    ├── reviewer_pipeline_v2.py     # v2 sequential, single-condition per call
    ├── reviewer_parallel.py        # ThreadPoolExecutor parallel (used for final)
    └── statistical_analysis.py     # LMM + non-parametric tests + ICC

Reproduce the Experiment

Prerequisites

Python 3.10+ with numpy, scipy, pandas, statsmodels, requests
LiteLLM proxy or direct API access to Gemini 2.5 Pro, GPT-4o, Claude Sonnet 4.6 (or equivalents)

Re-run pipeline

git clone https://github.com/Transconnectome/simplicitas.git
cd simplicitas/experiments/sci_method_ab

# 1. Reviewer evaluations (uses cache if available)
python scripts/reviewer_parallel.py

# 2. Statistical analysis
python scripts/statistical_analysis.py

# Outputs:
#   reviews/reviewer_scores_v2.json
#   analysis/full_results.json
#   analysis/summary_report.md

Lessons Learned (Methodology)

Reasoning models truncate: Initial reviewer pipeline used Gemini 3.1 Pro Preview, GPT-5.4-Pro, Claude Opus 4.6 with max_tokens=600. Reasoning models consumed token budget on internal reasoning, leaving truncated JSON outputs. Fix: switched to non-reasoning equivalents + max_tokens=4000.
Aggressive parallelism triggers rate-limit: 30 sub-agents launched simultaneously caused 4 immediate failures. Fix: staggered retries, ThreadPoolExecutor with max_workers=20.
Pre-registration pays off: All 7 measures and 4 hypotheses defined before running. Post-hoc temptation to change measures was avoided. Result: H1 confirmed cleanly.
Condition A is asymmetrically expensive: sci-method (~118k tokens, ~120s) vs. baseline (~70k tokens, ~50s) — 1.7× and 2.4× respectively. Production deployment requires this awareness.
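
The rate-limit lesson above (bounded parallelism plus staggered retries) can be sketched as follows. This is a generic pattern under stated assumptions, not the contents of `reviewer_parallel.py`: the helper names, the backoff schedule, and the catch-all `except Exception` are choices made for the example, and a real pipeline would likely catch only rate-limit errors.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def call_with_retry(fn, arg, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(arg), retrying with exponential backoff plus random jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(arg)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delays grow 1x, 2x, 4x...; jitter staggers simultaneous retries.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay))


def run_all(fn, args, max_workers=20, **retry_kwargs):
    """Run fn over args with bounded parallelism; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda a: call_with_retry(fn, a, **retry_kwargs), args))


# Usage with a stand-in task; replace `square` with a real reviewer call.
def square(x):
    return x * x


print(run_all(square, range(5)))  # [0, 1, 4, 9, 16]
```

Capping `max_workers` limits how many requests are in flight at once, while the jitter keeps failed calls from retrying in lockstep.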

Limitations

Single base model: Both conditions used Claude Opus 4.7 — generalization to other LLMs requires replication.
N = 5 problems: Narrow stimulus sample; domain-specific effects possible.
N = 3 reps per cell: Low statistical power — emphasis on effect size over p-value.
Reviewer family overlap: Claude Sonnet 4.6 reviewer shares model family with Condition A agent (Opus 4.7). Sensitivity analysis recommended.
Prompt sensitivity: Condition A's explicit instruction injects schema awareness. Equivalent explicit instruction to Condition B might yield similar structure.

Citation

Cha, J. (2026). sci-method: A scientific-reasoning agent for general
problem-solving — controlled A/B validation. Internal report,
Connectome Lab, Seoul National University. 2026-05-02.
https://github.com/Transconnectome/simplicitas/tree/main/experiments/sci_method_ab
