LLM-based literature assessment system for rare disease gene curation. Automatically screens papers for relevance, extracts evidence against PanelApp Australia diagnostic criteria, and generates comprehensive gene-centric reports with panel recommendations.
# Basic installation (data ingestion, reporting, variant analysis)
uv sync
# With ML dependencies (required for LLM inference and screening classifier)
uv sync --extra ml
# With macOS-specific Docling acceleration (optional)
uv sync --extra docling-macos
Install the NCBI EDirect tools:
sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
Enable the local pre-commit hook (runs the same lint set as CI):
uv run pre-commit install
The system uses multiple databases:
- Main workflow (data/db.sqlite): Created from schema.sql by palit ingest-preprints or palit ingest-pubmed
- Screening workflow (data/pubmed_baseline_screening.sqlite): Created from schema.sql by palit screen-pubmed
- Classifier training (data/screening_classifier/training.sqlite): Created from src/palit/screening_classifier/training.sql (only needed for training)
Both main and screening workflows use the same schema for consistency, allowing the same tools (e.g., assess-relevance) to work on both databases.
Step 11 (palit fetch-variant-frequencies) requires a running variant-lookup service. Copy .env.example to .env and set both:
VARIANT_LOOKUP_BASE_URL=https://<host>:<port>
VARIANT_LOOKUP_API_KEY=<bearer-token>
.env is gitignored. The command exits immediately on startup if either variable is missing.
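Because the command fails at startup when either variable is missing, it can be convenient to check them in the shell before launching a long run. The following is a minimal sketch, not part of palit: `require_env` is a hypothetical helper name, and loading `.env` with `set -a` assumes the file contains only plain `KEY=value` lines that a POSIX shell can source.

```shell
# Hypothetical helper: report every named variable that is unset or empty.
# Returns 0 when all are set, 1 otherwise.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example use (assumes .env is shell-compatible KEY=value lines):
#   set -a; . ./.env; set +a
#   require_env VARIANT_LOOKUP_BASE_URL VARIANT_LOOKUP_API_KEY || exit 1
```

The helper only checks that the variables are non-empty; it does not contact the service or validate the token.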
# Configuration
PANEL_DATE=2025-10-01
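END_DATE=2025-10-15
# BUFFER_START (used in step 1) is not set here; define it as the start of the widened ingestion window.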
# 1. Setup: Create database and ingest papers (creates DB from schema.sql if needed)
# --previous-db widens the date range into the previous run's window and skips
# papers already ingested (buffer window for API flakiness resilience).
# Preprints first: ensures preprint metadata (version) is preserved for automatic
# PDF download. PubMed backfills PMIDs into preprint rows without overwriting.
uv run palit ingest-preprints --previous-db data/db_prev.sqlite $BUFFER_START $END_DATE
uv run palit ingest-pubmed --previous-db data/db_prev.sqlite $BUFFER_START $END_DATE
# 2. Assess relevance of papers
uv run palit assess-relevance --panel-date $PANEL_DATE
# 2a. (Optional) Parallel assessment across multiple GPUs
for i in 0 1; do sbatch -p GPU-H100 --gpus=1 -t 24:00:00 -J "assess-relevance-shard-$i" -o "assess_relevance_shard_$i.log" --wrap="uv run palit assess-relevance --panel-date $PANEL_DATE --db-path data/pubmed_baseline_screening.sqlite --prompt-path prompts/retrospective_screening_prompt.txt --shard-index $i --num-shards 2"; done
# 3. Download full-text papers (automated PMC + preprints, manual fallback)
uv run palit download-papers attempt-pmc
uv run palit download-papers download-preprints
uv run palit download-papers open-browser
# ... manually download PDFs to data/papers/ ...
uv run palit docling convert
uv run palit download-papers register
# 4. Extract evidence from full-text papers
uv run palit extract-evidence --panel-date $PANEL_DATE
# 4a. (Optional) Parallel extraction across multiple GPUs
for i in 0 1; do sbatch -p GPU-H100 --gpus=1 -t 24:00:00 -J "extract-evidence-shard-$i" -o "extract_evidence_shard_$i.log" --wrap="uv run palit extract-evidence --panel-date $PANEL_DATE --shard-index $i --num-shards 2"; done
# 5. Discover papers referenced in evidence (citation-based expansion)
uv run palit discover-citations discover
# 5a. Optionally add papers manually that weren't found automatically
uv run palit discover-citations add --gene GENE_SYMBOL PMID1 PMID2 ...
# 6. Expand literature beyond citations
uv run palit expand-literature --cutoff-date $PANEL_DATE
# 7. Download expansion papers (same workflow as step 3)
uv run palit download-papers attempt-pmc
uv run palit download-papers download-preprints
uv run palit download-papers open-browser --expansion-only
# ... manually download PDFs to data/papers/ ...
uv run palit docling convert
uv run palit download-papers register
# 8. Extract evidence from expansion papers
uv run palit extract-evidence --panel-date $PANEL_DATE
# 9. Aggregate evidence across papers per gene (panel-agnostic)
uv run palit assess-genes --panel-date $PANEL_DATE
# 9a. (Optional) Parallel gene assessment across multiple GPUs
for i in 0 1; do sbatch -p GPU-H100 --gpus=1 -t 24:00:00 -J "assess-genes-shard-$i" -o "assess_genes_shard_$i.log" --wrap="uv run palit assess-genes --panel-date $PANEL_DATE --shard-index $i --num-shards 2"; done
# 10. Match genes to appropriate panels based on phenotype descriptions
uv run palit match-panels --panel-date $PANEL_DATE
# 11. Look up variant frequencies from gnomAD via the variant-lookup
# service (requires VARIANT_LOOKUP_* env vars — see Setup).
uv run palit fetch-variant-frequencies
# 12. Create annotated PDFs with highlighted citations
uv run palit annotate-pdfs
# 13. Generate assessment report package with panel recommendations
uv run palit generate-report --report-id report_mendeliome --panel-date $PANEL_DATE
# 1. Install ML dependencies and set up W&B
uv sync --extra ml
wandb login
# 2. Create training database
sqlite3 data/screening_classifier/training.sqlite < src/palit/screening_classifier/training.sql
# 3. Extract positive PMIDs from main workflow database
uv run palit screening-classifier extract-pmids
# 4. Prepare training data (fetches negatives from PubMed, assigns train/val/test splits)
uv run palit screening-classifier prepare-data
# 5. Train classifier
uv run palit screening-classifier train
# 6. Evaluate classifier
uv run palit screening-classifier evaluate
Model outputs are saved to outputs/best_model/ (HuggingFace format + optimal threshold).
Once trained, use the classifier to screen PubMed baseline XML files:
# Download PubMed baseline (all XML files + checksums, ~47GB compressed)
mkdir -p data/pubmed_baseline
cd data/pubmed_baseline
for kind in baseline updatefiles
do
curl -s https://ftp.ncbi.nlm.nih.gov/pubmed/$kind/ | \
grep -oP '(?<=href=")[^"]*\.(xml\.gz|md5)' | \
parallel --bar -j 8 "if [ ! -f {} ]; then curl -s -O \"https://ftp.ncbi.nlm.nih.gov/pubmed/$kind/{}\"; else echo \"{} exists, skipping.\"; fi"
done
cd ../..
# Screen baseline files with trained classifier
uv run palit screen-pubmed \
--checkpoint outputs/best_model \
--baseline-dir data/pubmed_baseline \
--output-db data/pubmed_baseline_screening.sqlite
Relevant papers are stored in pubmed_baseline_screening.sqlite. Processing progress is tracked in data/screening_progress.json for resumability.
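The download loop above also fetches the `.md5` files, so the archives can be checked before a long screening run. This is a hedged sketch rather than part of palit: `verify_md5` is a hypothetical helper, it assumes each `.md5` file ends with the hex digest (for example `MD5(name.xml.gz)= <digest>`), and it relies on GNU `md5sum` (on macOS, `md5 -q` would be the rough equivalent).

```shell
# Hypothetical helper: compare a file against its sibling "<file>.md5".
# Returns 0 on match, non-zero on mismatch or when the .md5 file is missing.
verify_md5() {
  file=$1
  # Assumption: the digest is the last whitespace-separated field of the .md5 file.
  expected=$(awk '{print $NF}' "$file.md5" 2>/dev/null)
  actual=$(md5sum "$file" | awk '{print $1}')
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Example use over the baseline directory:
#   for f in data/pubmed_baseline/*.xml.gz; do
#     verify_md5 "$f" || echo "checksum mismatch: $f"
#   done
```

Files that fail the check can be deleted and re-fetched by re-running the download loop, which skips files that already exist.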
For historical baseline screening (2000-2025), use the retrospective screening prompt, which evaluates papers on the evidence they provide rather than on novelty:
# Retrospective mode: evaluates historical evidence value, not novelty
uv run palit assess-relevance \
--db-path data/pubmed_baseline_screening.sqlite \
--panel-date $PANEL_DATE \
--prompt-path prompts/retrospective_screening_prompt.txt
Key difference from standard relevance assessment:
- Standard prompt (relevance_assessment_prompt.txt): Asks "Is this NEW evidence for diagnostic panels?" - optimized for recent literature
- Retrospective prompt (retrospective_screening_prompt.txt): Asks "Does this provide SUBSTANTIAL evidence for gene-disease relationships?" - optimized for historical baseline screening
The retrospective prompt evaluates papers in their historical context, accepting important early descriptions of gene-disease associations even if those genes are now well-established. This ensures comprehensive coverage across 25 years of literature for downstream tournament selection and analysis.
After each fortnightly processing run completes, feed majority-relevant papers back into the baseline screening DB so it grows as a comprehensive repository:
FORTNIGHTLY_DB=data/db_2026_february_h1.sqlite
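# The heredoc stays quoted so the shell leaves the '$[0]' JSON paths untouched;
# the ATTACH is passed via -cmd so that $FORTNIGHTLY_DB is still expanded.
sqlite3 -cmd "ATTACH '$FORTNIGHTLY_DB' AS source" data/pubmed_baseline_screening.sqlite <<'SQL'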
CREATE TEMP TABLE relevant_dois AS
SELECT doi FROM source.papers
WHERE relevance_assessment_json IS NOT NULL
AND (json_extract(relevance_assessment_json, '$[0].relevant')
+ json_extract(relevance_assessment_json, '$[1].relevant')
+ json_extract(relevance_assessment_json, '$[2].relevant')) >= 2;
INSERT OR IGNORE INTO papers
(doi, pmid, title, abstract, authors, journal, source_date,
source, source_metadata, source_type, source_details, download_status,
relevance_assessment_raw, relevance_assessment_json)
SELECT doi, pmid, title, abstract, authors, journal, source_date,
source, source_metadata, source_type, source_details, 'scheduled',
relevance_assessment_raw, relevance_assessment_json
FROM source.papers WHERE doi IN relevant_dois;
INSERT OR IGNORE INTO gene_mentions
(hgnc_id, paper_gene_symbol, paper_doi, source)
SELECT hgnc_id, paper_gene_symbol, paper_doi, source
FROM source.gene_mentions
WHERE source = 'relevance_assessment'
AND paper_doi IN relevant_dois;
DROP TABLE relevant_dois;
DETACH source;
SQL
This step is tracked as UPDATE_BASELINE in the pipeline tracker and also syncs the updated baseline to the cluster.
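One pitfall with the feedback step: SQLite's ATTACH does not fail when the file is missing. It attaches a new, empty database, and the error only surfaces later as a missing `source.papers` table. A small guard before the `sqlite3` call gives a clearer message. This is a sketch under assumptions; `require_db` is a hypothetical helper name, not a palit command.

```shell
# Hypothetical helper: fail early if a database file is missing or empty.
require_db() {
  # -s is true only when the file exists and has a size greater than zero.
  if [ ! -s "$1" ]; then
    echo "database not found or empty: $1" >&2
    return 1
  fi
}

# Example use before the feedback step:
#   require_db "$FORTNIGHTLY_DB" || exit 1
#   require_db data/pubmed_baseline_screening.sqlite || exit 1
```

The check only looks at the file itself; it does not confirm that the schema matches or that relevance assessments are present.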
For curating literature for a specific panel (e.g., Arthrogryposis):
# Configuration
PANEL_DATE=2025-10-01
PANEL_ID=47 # Arthrogryposis panel ID
PANEL_NAME=arthrogryposis
# 1. Copy pre-filtered baseline DB papers to new database
sqlite3 data/$PANEL_NAME.sqlite < schema.sql
sqlite3 data/$PANEL_NAME.sqlite "ATTACH 'data/pubmed_baseline_screening.sqlite' AS source; INSERT INTO papers (pmid, title, abstract, authors, journal, entrez_date, source_type, source_details) SELECT pmid, title, abstract, authors, journal, entrez_date, 'initial', source_details FROM source.papers"
# 2. Assess relevance scoped to the panel
uv run palit assess-relevance \
--db-path data/$PANEL_NAME.sqlite \
--panel-date $PANEL_DATE \
--scope-panel-id $PANEL_ID \
--prompt-path prompts/panel_relevance_assessment_prompt.txt
# 3. (Optional) Reduce literature for well-researched genes
# The aggregation step has a practical limit of ~30-40 papers per gene due to
# context window constraints. For panels with well-researched genes (e.g., POLG
# with 200+ papers), use tournament selection to keep only the most informative:
uv run palit reduce-literature --db-path data/$PANEL_NAME.sqlite
# 4. Download full-text papers (now reduced set if step 3 was run)
uv run palit download-papers attempt-pmc --db-path data/$PANEL_NAME.sqlite
uv run palit download-papers download-preprints --db-path data/$PANEL_NAME.sqlite
uv run palit download-papers open-browser --db-path data/$PANEL_NAME.sqlite
# ... manually download PDFs ...
uv run palit docling convert --db-path data/$PANEL_NAME.sqlite
uv run palit download-papers register --db-path data/$PANEL_NAME.sqlite
# 5. Extract evidence and assess genes (panel-scoped)
uv run palit extract-evidence --db-path data/$PANEL_NAME.sqlite --panel-date $PANEL_DATE --scope-panel-id $PANEL_ID
uv run palit assess-genes --db-path data/$PANEL_NAME.sqlite --panel-date $PANEL_DATE --target-panel-ids $PANEL_ID --scope-panel-id $PANEL_ID
# 6. Generate report package with panel-scoped novelty detection
uv run palit annotate-pdfs --db-path data/$PANEL_NAME.sqlite --output-dir data/annotated_$PANEL_NAME
uv run palit generate-report \
--report-id panel_$PANEL_NAME \
--db-path data/$PANEL_NAME.sqlite \
--panel-date $PANEL_DATE \
--target-panel-ids $PANEL_ID \
--annotated-dir data/annotated_$PANEL_NAME
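Most commands in both workflows take --panel-date, and the examples use the YYYY-MM-DD form. Since several steps are long-running GPU jobs, a quick shape check on the configured dates can catch typos early. The sketch below is an illustration only: `is_iso_date` is a hypothetical helper, and it checks the digit pattern, not whether the date is a real calendar day or what palit itself accepts.

```shell
# Hypothetical helper: succeed only for strings shaped like YYYY-MM-DD.
is_iso_date() {
  case "$1" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use before starting a run:
#   is_iso_date "$PANEL_DATE" || { echo "bad PANEL_DATE: $PANEL_DATE" >&2; exit 1; }
```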