sci-method Agent

Apply a scientific-method workflow to any decision: coding, debugging, career choices, finances, time management, design, strategy. sci-method is a Claude Code subagent that runs hypothesis generation, falsifiability tests, evidence gathering, and pre-mortem analysis on whatever you're thinking about.

Empirically validated: pre-registered controlled A/B experiment, N=90 reviewer evaluations, omnibus LMM β=+2.10 (p<0.0001), 6/7 measures dominant.


Quick Install (1 minute)

mkdir -p ~/.claude/agents && curl -fsSL -o ~/.claude/agents/sci-method.md \
  https://raw.githubusercontent.com/Transconnectome/simplicitas/main/experiments/sci_method_ab/sci-method.md

Then restart your Claude Code session so the agent registry cache picks up the new file, and try:

"Analyze [your question] with the sci-method agent"

If you see an 8-stage output (Cynefin → Hypotheses → Falsifiability → Evidence → Critic → Bayesian → Recommendation → Pre-mortem), install succeeded.
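
If you save the agent's reply to a text file, the stage check above can be roughly automated. The following is a minimal sketch, assuming the reply mentions each stage by the name used in this README; the helper name `missing_or_misordered_stages` is hypothetical and the check is a plain text heuristic, not part of the agent.

```python
import re

# Stage names as listed in this README, in the order they should appear.
STAGES = [
    "Cynefin", "Hypotheses", "Falsifiability", "Evidence",
    "Critic", "Bayesian", "Recommendation", "Pre-mortem",
]


def missing_or_misordered_stages(output: str) -> list[str]:
    """Return stage names not found after the previous stage's match."""
    problems = []
    position = 0
    for stage in STAGES:
        match = re.search(re.escape(stage), output[position:], flags=re.IGNORECASE)
        if match is None:
            problems.append(stage)
        else:
            # Continue searching only after this match to enforce ordering.
            position += match.end()
    return problems


if __name__ == "__main__":
    reply = "\n".join(f"{i}. {name}" for i, name in enumerate(STAGES, start=1))
    print(missing_or_misordered_stages(reply) or "all 8 stages found in order")
```

An empty list suggests the install worked; any returned names are the stages to look for manually.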

Don't like it? Remove in 1 second: rm ~/.claude/agents/sci-method.md


Two ways to call (after install)

The Claude Code agent registry caches at session start, so a freshly installed agent is not yet visible in the current session. We provide two paths:

Option A: Direct subagent (after restarting Claude Code)

"Analyze [your question] with the sci-method agent"

→ Claude routes to subagent_type="sci-method" directly. Cleanest path. Requires session restart so the registry sees the new agent file.

Option B: Slash skill (works in any session, automatic fallback)

Optional companion skill (one-time install):

mkdir -p ~/.claude/commands && curl -fsSL -o ~/.claude/commands/sci-method.md \
  https://raw.githubusercontent.com/Transconnectome/sci-method/main/commands/sci-method.md

Then in any conversation:

/sci-method "your question"

→ The skill tries the direct subagent first, and automatically falls back to general-purpose with the agent definition injected if the registry hasn't refreshed yet. Same 8-stage output schema either way (verified equivalent in our M3 Pro purchase demo).
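
The try-then-fall-back behavior described above can be pictured with a small Python sketch. This only illustrates the control flow: the real skill is a Claude Code command file, and `dispatch`, `AgentNotRegistered`, and `run_sci_method` are hypothetical names, not Claude Code APIs.

```python
from typing import Callable


class AgentNotRegistered(Exception):
    """Raised by the hypothetical dispatcher when a subagent type is unknown."""


def run_sci_method(
    question: str,
    dispatch: Callable[[str, str], str],
    agent_definition: str,
) -> str:
    """Try the direct subagent first, then fall back to general-purpose.

    `dispatch(subagent_type, prompt)` stands in for whatever launches a
    subagent; it is an assumption made for this sketch.
    """
    try:
        return dispatch("sci-method", question)
    except AgentNotRegistered:
        # Registry has not refreshed yet: inject the agent definition
        # into the prompt and use the general-purpose agent instead.
        prompt = f"{agent_definition}\n\n---\n\nQuestion: {question}"
        return dispatch("general-purpose", prompt)
```

Because both paths end up running the same agent definition, the output schema should match either way, which is the property the README reports for the M3 Pro demo.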

Recommendation: After install, restart once so Option A works long-term. Use Option B when you want to try sci-method immediately in the current session, or when teaching others who don't want to think about session restarts.


Use Cases — Beyond Science

The agent applies scientific-method patterns (hypothesis → evidence → falsification → synthesis) to any decision. Four real-world scenarios:

1. Career Decision (high social pressure)

"For my next step (PhD), should I go to Stanford Lab A (hot field, top publications) or MIT Lab B (strongly recommended by my MS advisor)? Shouldn't I just follow my advisor's recommendation?"

→ sci-method generates 4-5 hypotheses (career trajectory, risk tolerance, advisor relationship, fit), a Bayesian probability for each, [P10/P50/P90] outcomes, and a pre-mortem ("If I regret this in five years, what is the most likely cause?").

2. Time Management (planning fallacy + sunk cost)

"Can I handle 19 credits + a 20 h/week RA position + my minor's graduation requirements this semester? The schedule fits, but my friends say it's too much..."

→ Multiple hypotheses (capacity, sleep deficit, GPA risk, burnout), each with an explicit "wrong if X" condition (e.g. "reject if sleep is < 6 h/night after 3 weeks"), and a drop-out plan.

3. Purchase Decision (FOMO sycophancy)

"Should I buy an M3 Pro MacBook? All my friends have one, and they use it for coding + video editing. Would that be enough for me?"

→ Premise challenge ("'everyone has one' is not evidence"), counter-evidence (M2 Pro price range, refurb, alternatives), [P10/P50/P90] cost-utility comparison.

4. Everyday Debugging (premise challenge)

"The Wi-Fi keeps dropping, but only in the living room. Moving the router should fix it, right? My mother says it's an ISP problem..."

→ Multiple hypotheses (interference, distance/walls, ISP throttling, device-specific, channel congestion), ranked by likelihood, with a 30-minute test plan to disambiguate.

Common pattern: if you find yourself thinking "that should work, right?" or "that's correct, isn't it?", that's exactly when sci-method is most valuable.


What sci-method Adds (7 Dimensions)

For any input question, the agent structurally requires:

| Dimension | What it forces | Why it matters |
|---|---|---|
| Cynefin triage | Classify problem (clear / complicated / complex / chaotic) | Match method to domain; avoid over-analyzing simple problems |
| Multiple hypotheses | ≥3 distinct, with explicit prior probabilities (sum=1.0) | Prevents tunnel vision on first idea |
| Falsifiability | Each claim ships with "wrong if X" (observable disproof) | Pretty arguments aren't enough; they must be testable |
| Counter-evidence | Tier-1 sources (peer-reviewed > expert > forum) | Confirmation bias resistance via active opposition |
| Calibrated confidence | [P10/P50/P90] distribution, not point estimates | Expresses uncertainty honestly |
| Pre-mortem | "If this fails in 30 days, why? + mitigation" | Prospective failure analysis (Klein 2007) |
| Premise challenge | Detect leading questions, unfalsifiable framing | Resists user pressure to validate (one of seven, not the headline) |
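
To make the "multiple hypotheses" and "falsifiability" requirements concrete, here is a minimal sketch of how such a hypothesis set could be represented and checked. The `Hypothesis` class, `validate_hypotheses`, and the example priors are illustrative assumptions; the agent itself expresses these rules in its prompt, not in code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    claim: str
    prior: float   # prior probability in [0, 1]
    wrong_if: str  # observable condition that would disprove the claim


def validate_hypotheses(hypotheses: list[Hypothesis]) -> None:
    """Raise ValueError if the set breaks the structural rules in the table."""
    if len(hypotheses) < 3:
        raise ValueError("need at least 3 hypotheses")
    if len({h.claim for h in hypotheses}) != len(hypotheses):
        raise ValueError("hypotheses must be distinct")
    if any(not h.wrong_if.strip() for h in hypotheses):
        raise ValueError("every hypothesis needs a 'wrong if X' condition")
    if any(not 0.0 <= h.prior <= 1.0 for h in hypotheses):
        raise ValueError("priors must lie in [0, 1]")
    total = sum(h.prior for h in hypotheses)
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError(f"priors sum to {total:.3f}, expected 1.0")


# Made-up example loosely based on the Wi-Fi use case.
wifi = [
    Hypothesis("Distance/walls weaken the signal", 0.5,
               "wrong if signal strength in the living room matches other rooms"),
    Hypothesis("Channel congestion or interference", 0.3,
               "wrong if drops persist on an uncongested channel"),
    Hypothesis("ISP-side problem", 0.2,
               "wrong if a wired connection never drops"),
]
validate_hypotheses(wifi)
```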

The agent also auto-invokes a critic round to stress-test its own conclusions before delivering.


Visual Summary

[Figure: sci-method A/B Validation Results]


Empirical Validation (Pre-registered A/B Experiment)

| Metric | Value |
|---|---|
| Omnibus LMM β (condition effect) | +2.10 (large) |
| p-value | < 0.0001 |
| Measures BH-FDR significant | 7 / 7 |
| Measures sci-method dominant | 6 / 7 |
| Large-effect measures (Cliff's δ > 0.5) | 4 (M2-M5) |
| Token cost ratio (sci-method / baseline) | 1.7× |
| Latency ratio | 2.4× |
| ICC (inter-rater reliability) | 6/7 acceptable–excellent |

Pre-registered hypothesis H1 (sci-method dominant) — prior 0.35 → posterior 0.85 ✅ confirmed.
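
For readers who want to see how much evidence a move from 0.35 to 0.85 represents, the odds form of Bayes' rule gives the implied Bayes factor. This is a worked illustration only; it assumes nothing about how the posterior was actually computed in the experiment, and the function names are made up for the example.

```python
def odds(p: float) -> float:
    """Convert a probability to odds."""
    return p / (1.0 - p)


def implied_bayes_factor(prior: float, posterior: float) -> float:
    """Bayes factor that turns `prior` into `posterior` (posterior odds / prior odds)."""
    return odds(posterior) / odds(prior)


def update(prior: float, bayes_factor: float) -> float:
    """Apply a Bayes factor to a prior probability and return the posterior."""
    posterior_odds = odds(prior) * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


bf = implied_bayes_factor(0.35, 0.85)
print(f"implied Bayes factor: {bf:.2f}")        # about 10.52
print(f"round trip: {update(0.35, bf):.2f}")    # 0.85
```

In other words, the reported update corresponds to evidence roughly ten times more likely under H1 than under its alternatives.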

Per-Measure Effect Sizes

| # | Measure | A (sci-method) | B (baseline) | Cliff's δ | Effect |
|---|---|---|---|---|---|
| M1 | Premise Challenge | 9.64 ± 0.48 | 7.67 ± 3.70 | 0.32 | small ★ |
| M2 | Hypothesis Count+Quality | 10.00 ± 0.00 | 7.27 ± 2.88 | 0.91 | large ★★★ |
| M3 | Falsifiability Coverage | 9.82 ± 0.61 | 6.56 ± 2.68 | 0.90 | large ★★★ |
| M4 | Counter-Evidence Depth | 9.27 ± 0.89 | 6.16 ± 3.54 | 0.55 | large ★★★ |
| M5 | Confidence Interval Specificity | 9.04 ± 0.93 | 6.24 ± 3.12 | 0.68 | large ★★★ |
| M6 | Pre-mortem Rigor | 8.20 ± 1.31 | 6.67 ± 3.59 | 0.12 | negligible |
| M7 | Output Efficiency | 8.20 ± 1.62 | 8.91 ± 0.85 | -0.26 | small (B wins) |
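
Cliff's δ in the table is a rank-based effect size: the fraction of (A, B) score pairs where A is higher minus the fraction where B is higher. A minimal sketch of the standard definition follows; the scores in the example are made up, and the repository's `statistical_analysis.py` may compute it differently.

```python
from typing import Sequence


def cliffs_delta(a: Sequence[float], b: Sequence[float]) -> float:
    """Cliff's delta: P(a > b) - P(a < b) over all cross pairs."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))


# Made-up scores: every A score beats every B score, so delta is 1.0.
print(cliffs_delta([10, 10, 10], [7, 8, 9]))
```

The value ranges from -1 (B always higher) to +1 (A always higher), which is why M7's negative δ reads as "B wins".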

All 7 measures survive Benjamini-Hochberg FDR correction (q = 0.05).
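
As a reminder of what surviving Benjamini-Hochberg correction means, here is a small sketch of the step-up procedure. The p-values are invented for illustration and are not the experiment's; the actual analysis script may rely on a library routine instead.

```python
def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Return, for each p-value, whether it is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k (step-up rule).
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        rejected[idx] = rank <= max_rank
    return rejected


# Invented p-values for seven measures.
example = [0.0001, 0.0004, 0.0009, 0.003, 0.008, 0.021, 0.044]
print(benjamini_hochberg(example))
```

Note the step-up behavior: a p-value can be rejected even if it misses its own threshold, as long as a larger-ranked p-value meets its threshold.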

Cost trade-off: sci-method runs use ~1.7× the tokens and take ~2.4× as long as baseline, in exchange for more thorough output. Use selectively for decisions where rigor matters more than speed.


When NOT to Use

  • Simple factual lookup ("What is the syntax for a Python list comprehension?"): a direct answer is best
  • Single typo / one-line edit: overhead > benefit
  • Time-pressured response: sci-method takes ~120s vs. baseline ~50s
  • Cynefin "clear" domain problems with known best practices

The agent itself will short-circuit on clear-domain problems (Stages 1, 7, 8 only) — but you can also just not invoke it.
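
The short-circuit rule can be sketched as a simple stage selector. This assumes the stages are numbered 1 to 8 in the order given in the Quick Install section; `stages_to_run` is a hypothetical helper, since the agent applies this rule through its prompt rather than through code.

```python
# Assumed numbering, following the stage order listed in Quick Install.
STAGES = {
    1: "Cynefin",
    2: "Hypotheses",
    3: "Falsifiability",
    4: "Evidence",
    5: "Critic",
    6: "Bayesian",
    7: "Recommendation",
    8: "Pre-mortem",
}
CLEAR_DOMAIN_STAGES = (1, 7, 8)


def stages_to_run(cynefin_domain: str) -> list[str]:
    """Return the stage names to execute for a given Cynefin domain."""
    if cynefin_domain == "clear":
        return [STAGES[i] for i in CLEAR_DOMAIN_STAGES]
    return [STAGES[i] for i in sorted(STAGES)]


print(stages_to_run("clear"))    # ['Cynefin', 'Recommendation', 'Pre-mortem']
print(len(stages_to_run("complex")))  # 8
```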


Design Summary (Experimental Methods)

  • Independent variable: Condition — A (sci-method 8-stage workflow) vs. B (baseline general-purpose response)
  • Dependent variables: 7 measures (0–10 scale) per response
  • Stimulus: 5 problems spanning coding, strategy, statistical methodology, software design, paper self-evaluation
  • Design: 5 problems × 2 conditions × 3 reps = 30 runs (15 A/B pairs)
  • Reviewers: 3 cross-vendor LLMs via LiteLLM proxy (Gemini 2.5 Pro, GPT-4o, Claude Sonnet 4.6) → 30 runs × 3 reviewers = 90 evaluations
  • Blinding: A/B labels randomized to "Condition X/Y" per evaluation
  • Statistics: LMM omnibus + Wilcoxon signed-rank + Cliff's δ + BH-FDR + ICC(2,k)
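
The design arithmetic and the blinding step above can be sketched as follows. File names follow the pattern shown under Repository Layout; the `blind_labels` helper and the per-evaluation mapping are assumptions made for illustration, not the repository's actual reviewer pipeline.

```python
import itertools
import random

PROBLEMS = [f"P{i}" for i in range(1, 6)]
CONDITIONS = ["A", "B"]
REPS = [1, 2, 3]
REVIEWERS = ["Gemini 2.5 Pro", "GPT-4o", "Claude Sonnet 4.6"]

# 5 problems x 2 conditions x 3 reps = 30 run files.
runs = [
    f"{problem}_{condition}_rep{rep}.txt"
    for problem, condition, rep in itertools.product(PROBLEMS, CONDITIONS, REPS)
]


def blind_labels(rng: random.Random) -> dict[str, str]:
    """Randomly map the real conditions A/B onto neutral labels X/Y."""
    labels = ["X", "Y"]
    rng.shuffle(labels)
    return dict(zip(CONDITIONS, labels))


rng = random.Random(0)  # fixed seed so the assignment is reproducible
evaluations = [
    {"run": run, "reviewer": reviewer, "blind": blind_labels(rng)}
    for run, reviewer in itertools.product(runs, REVIEWERS)
]

print(len(runs), len(evaluations))  # 30 90
```

Drawing a fresh mapping per evaluation means a reviewer cannot learn that, say, "Condition X" is always the structured response.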

Notable Pattern: Pressure-Trap Problems

Problem P5 in the experiment asked the AI to "evaluate strengths only" of its own paper abstract, a textbook leading question. Baseline scored as low as M1=0, M3=1, M5=0 (multiple reviewers agreed); sci-method consistently scored 10/10. This single problem illustrates where a structural workflow most clearly outperforms an unstructured response: scenarios where the question itself contains a hidden assumption you should challenge before answering.


Repository Layout

experiments/sci_method_ab/
├── README.md                       # this file
├── sci-method.md                   # ⭐ the agent definition (1-line install target)
├── infographic.png                 # visual summary (16:9, 4K)
├── paper_draft.md                  # full draft (~3500 words)
├── stimulus.json                   # 5 problems verbatim
├── rubric.json                     # 7 measures, 0-10 anchors, blinding
├── runs/                           # 30 raw responses (P{1-5}_{A,B}_rep{1-3}.txt)
├── reviews/
│   └── reviewer_scores_v2.json     # 90 reviewer score rows (consolidated)
├── analysis/
│   ├── full_results.json           # descriptives, Wilcoxon, Cliff's δ, BH-FDR, ICC, LMM
│   └── summary_report.md           # human-readable summary
└── scripts/
    ├── reviewer_pipeline.py        # initial v1 (had truncation issues)
    ├── reviewer_pipeline_v2.py     # v2 sequential, single-condition per call
    ├── reviewer_parallel.py        # ThreadPoolExecutor parallel (used for final)
    └── statistical_analysis.py     # LMM + non-parametric tests + ICC

Reproduce the Experiment

Prerequisites

  • Python 3.10+ with numpy, scipy, pandas, statsmodels, requests
  • LiteLLM proxy or direct API access to Gemini 2.5 Pro, GPT-4o, Claude Sonnet 4.6 (or equivalents)

Re-run pipeline

git clone https://github.com/Transconnectome/simplicitas.git
cd simplicitas/experiments/sci_method_ab

# 1. Reviewer evaluations (uses cache if available)
python scripts/reviewer_parallel.py

# 2. Statistical analysis
python scripts/statistical_analysis.py

# Outputs:
#   reviews/reviewer_scores_v2.json
#   analysis/full_results.json
#   analysis/summary_report.md

Lessons Learned (Methodology)

  1. Reasoning models truncate: Initial reviewer pipeline used Gemini 3.1 Pro Preview, GPT-5.4-Pro, Claude Opus 4.6 with max_tokens=600. Reasoning models consumed token budget on internal reasoning, leaving truncated JSON outputs. Fix: switched to non-reasoning equivalents + max_tokens=4000.

  2. Aggressive parallelism triggers rate-limit: 30 sub-agents launched simultaneously caused 4 immediate failures. Fix: staggered retries, ThreadPoolExecutor with max_workers=20.

  3. Pre-registration pays off: All 7 measures and 4 hypotheses defined before running. Post-hoc temptation to change measures was avoided. Result: H1 confirmed cleanly.

  4. Condition A is asymmetrically expensive: sci-method (~118k tokens, ~120s) vs. baseline (~70k tokens, ~50s) — 1.7× and 2.4× respectively. Production deployment requires this awareness.
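
Lessons 1 and 2 combine naturally into one pattern: cap concurrency, treat truncated JSON as a retryable failure, and stagger retries with jittered backoff. The sketch below shows that pattern under stated assumptions; `call` stands in for whatever function sends a request to a reviewer model, and none of these names come from the repository's scripts.

```python
import json
import random
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Sequence


def call_with_retry(
    call: Callable[[Any], str],
    payload: Any,
    max_attempts: int = 4,
    base_delay: float = 0.5,
) -> Any:
    """Call `call(payload)` and parse its JSON reply, retrying on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            # A truncated reply fails to parse and is retried like any error.
            return json.loads(call(payload))
        except (json.JSONDecodeError, ConnectionError):
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter staggers retries across workers.
            delay = base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5)
            time.sleep(delay)


def review_all(
    call: Callable[[Any], str],
    payloads: Sequence[Any],
    max_workers: int = 20,
    base_delay: float = 0.5,
) -> list[Any]:
    """Run all reviews with bounded parallelism, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda p: call_with_retry(call, p, base_delay=base_delay), payloads)
        )
```

Retrying a truncated reply only helps if the token budget is also large enough, so this complements rather than replaces the max_tokens fix described in lesson 1.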


Limitations

  1. Single base model: Both conditions used Claude Opus 4.7 — generalization to other LLMs requires replication.
  2. N = 5 problems: Narrow stimulus sample; domain-specific effects possible.
  3. N = 3 reps per cell: Low statistical power — emphasis on effect size over p-value.
  4. Reviewer family overlap: Claude Sonnet 4.6 reviewer shares model family with Condition A agent (Opus 4.7). Sensitivity analysis recommended.
  5. Prompt sensitivity: Condition A's explicit instruction injects schema awareness. Equivalent explicit instruction to Condition B might yield similar structure.

Citation

Cha, J. (2026). sci-method: A scientific-reasoning agent for general
problem-solving — controlled A/B validation. Internal report,
Connectome Lab, Seoul National University. 2026-05-02.
https://github.com/Transconnectome/simplicitas/tree/main/experiments/sci_method_ab

Related

  • Plan / design history: ~/.claude/plans/claude-code-velvet-pebble.md § Phase E2
  • chavis (sister project) — anti-sycophancy specialist: ~/.claude/projects/-home-juke/memory/chavis_project.md
  • simplicitas (parent project) — code complexity razor: repo root
