PanelApp Australia Literature Assessment

LLM-based literature assessment system for rare disease gene curation. Automatically screens papers for relevance, extracts evidence against PanelApp Australia diagnostic criteria, and generates comprehensive gene-centric reports with panel recommendations.

Setup

Installation

# Basic installation (data ingestion, reporting, variant analysis)
uv sync

# With ML dependencies (required for LLM inference and screening classifier)
uv sync --extra ml

# With macOS-specific Docling acceleration (optional)
uv sync --extra docling-macos

Install the NCBI EDirect tools:

sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
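After the installer finishes, it can help to confirm the tools are on PATH before running the ingest steps. The sketch below assumes the installer's default location of $HOME/edirect and uses esearch and efetch as example tool names; which EDirect commands palit actually calls is not stated in this README, and require_tools is a hypothetical helper.

```shell
# Hypothetical helper: report any command that is not on PATH.
# Returns 0 when every named tool is found, 1 otherwise.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Assumed default install location of the EDirect installer.
PATH="$HOME/edirect:$PATH"
export PATH

# Example (assumed tool names):
#   require_tools esearch efetch
```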

Enable the local pre-commit hook (runs the same lint set as CI):

uv run pre-commit install

Database Setup

The system uses three SQLite databases:

Main workflow (data/db.sqlite): Created from schema.sql by palit ingest-preprints or palit ingest-pubmed
Screening workflow (data/pubmed_baseline_screening.sqlite): Created from schema.sql by palit screen-pubmed
Classifier training (data/screening_classifier/training.sqlite): Created from src/palit/screening_classifier/training.sql (only needed for training)

Both main and screening workflows use the same schema for consistency, allowing the same tools (e.g., assess-relevance) to work on both databases.
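Because both workflow databases are created from schema.sql, their schemas can be compared directly. A minimal sketch using the sqlite3 CLI follows; same_schema is a hypothetical helper, and it assumes both files already exist and were created from the same schema.sql, so that objects are listed in the same order.

```shell
# Hypothetical helper: succeed when two SQLite files report identical schemas.
# Compares the output of the sqlite3 CLI's .schema dot-command verbatim, so
# object order and whitespace both matter.
same_schema() {
  schema_a=$(sqlite3 -readonly "$1" .schema) || return 2
  schema_b=$(sqlite3 -readonly "$2" .schema) || return 2
  [ "$schema_a" = "$schema_b" ]
}

# Example:
#   same_schema data/db.sqlite data/pubmed_baseline_screening.sqlite \
#     && echo "schemas match"
```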

External Services

Variant Frequency Lookup

Step 11 (palit fetch-variant-frequencies) requires a running variant-lookup service. Copy .env.example to .env and set both:

VARIANT_LOOKUP_BASE_URL=https://<host>:<port>
VARIANT_LOOKUP_API_KEY=<bearer-token>

.env is gitignored. The command exits at startup if either variable is missing.
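A similar pre-flight check can be run in the shell before starting a long pipeline. This is only a sketch: check_variant_lookup_env is a hypothetical helper, not part of palit, and the usage example assumes .env contains plain KEY=value lines that are safe to source.

```shell
# Hypothetical pre-flight check mirroring the documented behaviour:
# fail if either variant-lookup variable is unset or empty.
check_variant_lookup_env() {
  status=0
  for name in VARIANT_LOOKUP_BASE_URL VARIANT_LOOKUP_API_KEY; do
    # Indirect lookup of the variable named in $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "error: $name is not set" >&2
      status=1
    fi
  done
  return "$status"
}

# Example: export everything defined in .env, then check.
#   set -a; . ./.env; set +a
#   check_variant_lookup_env && uv run palit fetch-variant-frequencies
```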

Complete Workflow

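# Configuration
PANEL_DATE=2025-10-01
END_DATE=2025-10-15
# BUFFER_START must also be set before step 1: it is the ingest start date,
# reaching back into the previous run's window (see --previous-db below).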

# 1. Setup: Create database and ingest papers (creates DB from schema.sql if needed)
#    --previous-db widens the date range into the previous run's window and skips
#    papers already ingested (buffer window for API flakiness resilience).
#    Preprints first: ensures preprint metadata (version) is preserved for automatic
#    PDF download. PubMed backfills PMIDs into preprint rows without overwriting.
uv run palit ingest-preprints --previous-db data/db_prev.sqlite $BUFFER_START $END_DATE
uv run palit ingest-pubmed --previous-db data/db_prev.sqlite $BUFFER_START $END_DATE

# 2. Assess relevance of papers
uv run palit assess-relevance --panel-date $PANEL_DATE

# 2a. (Optional) Parallel assessment across multiple GPUs
for i in 0 1; do sbatch -p GPU-H100 --gpus=1 -t 24:00:00 -J "assess-relevance-shard-$i" -o "assess_relevance_shard_$i.log" --wrap="uv run palit assess-relevance --panel-date $PANEL_DATE --db-path data/pubmed_baseline_screening.sqlite --prompt-path prompts/retrospective_screening_prompt.txt --shard-index $i --num-shards 2"; done

# 3. Download full-text papers (automated PMC + preprints, manual fallback)
uv run palit download-papers attempt-pmc
uv run palit download-papers download-preprints
uv run palit download-papers open-browser
# ... manually download PDFs to data/papers/ ...
uv run palit docling convert
uv run palit download-papers register

# 4. Extract evidence from full-text papers
uv run palit extract-evidence --panel-date $PANEL_DATE

# 4a. (Optional) Parallel extraction across multiple GPUs
for i in 0 1; do sbatch -p GPU-H100 --gpus=1 -t 24:00:00 -J "extract-evidence-shard-$i" -o "extract_evidence_shard_$i.log" --wrap="uv run palit extract-evidence --panel-date $PANEL_DATE --shard-index $i --num-shards 2"; done

# 5. Discover papers referenced in evidence (citation-based expansion)
uv run palit discover-citations discover

# 5a. Optionally add papers manually that weren't found automatically
uv run palit discover-citations add --gene GENE_SYMBOL PMID1 PMID2 ...

# 6. Expand literature beyond citations
uv run palit expand-literature --cutoff-date $PANEL_DATE

# 7. Download expansion papers (same workflow as step 3)
uv run palit download-papers attempt-pmc
uv run palit download-papers download-preprints
uv run palit download-papers open-browser --expansion-only
# ... manually download PDFs to data/papers/ ...
uv run palit docling convert
uv run palit download-papers register

# 8. Extract evidence from expansion papers
uv run palit extract-evidence --panel-date $PANEL_DATE

# 9. Aggregate evidence across papers per gene (panel-agnostic)
uv run palit assess-genes --panel-date $PANEL_DATE

# 9a. (Optional) Parallel gene assessment across multiple GPUs
for i in 0 1; do sbatch -p GPU-H100 --gpus=1 -t 24:00:00 -J "assess-genes-shard-$i" -o "assess_genes_shard_$i.log" --wrap="uv run palit assess-genes --panel-date $PANEL_DATE --shard-index $i --num-shards 2"; done

# 10. Match genes to appropriate panels based on phenotype descriptions
uv run palit match-panels --panel-date $PANEL_DATE

# 11. Look up variant frequencies from gnomAD via the variant-lookup
#     service (requires the VARIANT_LOOKUP_* env vars; see Variant Frequency Lookup).
uv run palit fetch-variant-frequencies

# 12. Create annotated PDFs with highlighted citations
uv run palit annotate-pdfs

# 13. Generate assessment report package with panel recommendations
uv run palit generate-report --report-id report_mendeliome --panel-date $PANEL_DATE
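The optional sharded steps (2a, 4a, 9a) all repeat one pattern: submit one Slurm job per shard, passing --shard-index and --num-shards. A sketch that generalises the loop to any shard count is shown here. print_shard_jobs is a hypothetical helper that only prints the sbatch commands (a dry run); the partition name, GPU count, and time limit are copied from the examples above, and extra arguments are assumed to contain no quotes or spaces.

```shell
# Hypothetical helper: print one sbatch command per shard (dry run).
#   $1:   palit subcommand, e.g. assess-genes
#   $2:   number of shards
#   $3..: extra arguments passed through to palit
print_shard_jobs() {
  subcommand=$1
  num_shards=$2
  shift 2
  # Log file names use underscores, matching the examples above.
  log_prefix=$(printf '%s' "$subcommand" | tr '-' '_')
  i=0
  while [ "$i" -lt "$num_shards" ]; do
    printf 'sbatch -p GPU-H100 --gpus=1 -t 24:00:00 -J "%s-shard-%s" -o "%s_shard_%s.log" --wrap="uv run palit %s %s --shard-index %s --num-shards %s"\n' \
      "$subcommand" "$i" "$log_prefix" "$i" \
      "$subcommand" "$*" "$i" "$num_shards"
    i=$((i + 1))
  done
}

# Example: review the commands, then pipe them to sh to submit.
#   print_shard_jobs assess-genes 4 --panel-date "$PANEL_DATE"
#   print_shard_jobs assess-genes 4 --panel-date "$PANEL_DATE" | sh
```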

Relevance Screening Classifier

Training Workflow

# 1. Install ML dependencies and setup W&B
uv sync --extra ml
wandb login

# 2. Create training database
sqlite3 data/screening_classifier/training.sqlite < src/palit/screening_classifier/training.sql

# 3. Extract positive PMIDs from main workflow database
uv run palit screening-classifier extract-pmids

# 4. Prepare training data (fetches negatives from PubMed, assigns train/val/test splits)
uv run palit screening-classifier prepare-data

# 5. Train classifier
uv run palit screening-classifier train

# 6. Evaluate classifier
uv run palit screening-classifier evaluate

Model outputs are saved to outputs/best_model/ (HuggingFace format + optimal threshold).

Screening PubMed Baseline

Once trained, use the classifier to screen PubMed baseline XML files:

# Download PubMed baseline (all XML files + checksums, ~47GB compressed)
# Requires GNU grep (for -P) and GNU parallel.
mkdir -p data/pubmed_baseline
cd data/pubmed_baseline

for kind in baseline updatefiles
do
        curl -s https://ftp.ncbi.nlm.nih.gov/pubmed/$kind/ | \
                grep -oP '(?<=href=")[^"]*\.(xml\.gz|md5)' | \
                parallel --bar -j 8 "if [ ! -f {} ]; then curl -s -O \"https://ftp.ncbi.nlm.nih.gov/pubmed/$kind/{}\"; else echo \"{} exists, skipping.\"; fi"
done

cd ../..

# Screen baseline files with trained classifier
uv run palit screen-pubmed \
  --checkpoint outputs/best_model \
  --baseline-dir data/pubmed_baseline \
  --output-db data/pubmed_baseline_screening.sqlite

Relevant papers are stored in pubmed_baseline_screening.sqlite. Processing progress is tracked in data/screening_progress.json for resumability.
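The download loop fetches .md5 files alongside each archive, but nothing above checks them. A sketch of that check follows. verify_md5 is a hypothetical helper; it assumes GNU md5sum is available, that each checksum file is named <archive>.md5, and that it contains a 32-character hex digest (the exact file layout on the NCBI server is not verified here).

```shell
# Hypothetical helper: succeed when an archive matches its sibling .md5 file.
#   $1: path to the archive, with the checksum expected at "$1.md5"
verify_md5() {
  archive=$1
  # Take the first 32-character hex string in the checksum file, lowercased.
  expected=$(grep -oE '[0-9a-fA-F]{32}' "$archive.md5" | head -n 1 | tr 'A-F' 'a-f')
  actual=$(md5sum "$archive" | cut -d ' ' -f 1)
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Example: list archives whose checksum does not match.
#   for f in data/pubmed_baseline/*.xml.gz; do
#     verify_md5 "$f" || echo "checksum mismatch: $f"
#   done
```

Files reported as mismatched can be deleted and the download loop re-run, since it skips files that already exist.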

Retrospective Assessment of Baseline Screening

For historical baseline screening (2000-2025), use the retrospective screening prompt, which evaluates papers on the evidence they provide rather than on novelty:

# Retrospective mode: evaluates historical evidence value, not novelty
uv run palit assess-relevance \
  --db-path data/pubmed_baseline_screening.sqlite \
  --panel-date $PANEL_DATE \
  --prompt-path prompts/retrospective_screening_prompt.txt

Key difference from standard relevance assessment:

Standard prompt (relevance_assessment_prompt.txt): Asks "Is this NEW evidence for diagnostic panels?" - optimized for recent literature
Retrospective prompt (retrospective_screening_prompt.txt): Asks "Does this provide SUBSTANTIAL evidence for gene-disease relationships?" - optimized for historical baseline screening

The retrospective prompt evaluates papers in their historical context, accepting important early descriptions of gene-disease associations even if those genes are now well-established. This ensures comprehensive coverage across 25 years of literature for downstream tournament selection and analysis.

Updating the Baseline

After each fortnightly processing run completes, feed majority-relevant papers back into the baseline screening DB so it grows as a comprehensive repository:

FORTNIGHTLY_DB=data/db_2026_february_h1.sqlite

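# The heredoc delimiter is quoted so the shell leaves the '$[...]' JSON paths
# untouched. The ATTACH therefore goes through -cmd, where $FORTNIGHTLY_DB is
# expanded by the shell before sqlite3 reads the script from stdin.
sqlite3 -cmd "ATTACH '$FORTNIGHTLY_DB' AS source;" data/pubmed_baseline_screening.sqlite <<'SQL'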

CREATE TEMP TABLE relevant_dois AS
SELECT doi FROM source.papers
WHERE relevance_assessment_json IS NOT NULL
  AND (json_extract(relevance_assessment_json, '$[0].relevant')
     + json_extract(relevance_assessment_json, '$[1].relevant')
     + json_extract(relevance_assessment_json, '$[2].relevant')) >= 2;

INSERT OR IGNORE INTO papers
  (doi, pmid, title, abstract, authors, journal, source_date,
   source, source_metadata, source_type, source_details, download_status,
   relevance_assessment_raw, relevance_assessment_json)
SELECT doi, pmid, title, abstract, authors, journal, source_date,
       source, source_metadata, source_type, source_details, 'scheduled',
       relevance_assessment_raw, relevance_assessment_json
FROM source.papers WHERE doi IN relevant_dois;

INSERT OR IGNORE INTO gene_mentions
  (hgnc_id, paper_gene_symbol, paper_doi, source)
SELECT hgnc_id, paper_gene_symbol, paper_doi, source
FROM source.gene_mentions
WHERE source = 'relevance_assessment'
  AND paper_doi IN relevant_dois;

DROP TABLE relevant_dois;
DETACH source;
SQL

This step is tracked as UPDATE_BASELINE in the pipeline tracker and also syncs the updated baseline to the cluster.
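The >= 2 filter in the SQL above acts as a majority vote. It appears to assume that relevance_assessment_json holds an array of three assessments, each with a boolean relevant field, and it relies on json_extract returning 1 for JSON true and 0 for JSON false. Before modifying the baseline, the same predicate can be run read-only to see how many papers would be carried over. count_majority_relevant is a hypothetical helper for that dry run.

```shell
# Hypothetical helper: count papers that at least two of three assessments
# marked relevant, without writing to the database.
#   $1: path to a fortnightly database
count_majority_relevant() {
  sqlite3 -readonly "$1" <<'SQL'
SELECT COUNT(*) FROM papers
WHERE relevance_assessment_json IS NOT NULL
  AND (json_extract(relevance_assessment_json, '$[0].relevant')
     + json_extract(relevance_assessment_json, '$[1].relevant')
     + json_extract(relevance_assessment_json, '$[2].relevant')) >= 2;
SQL
}

# Example:
#   count_majority_relevant "$FORTNIGHTLY_DB"
```

Rows with fewer than three assessments are excluded, because a missing array element makes the sum NULL.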

Panel-Specific Curation

For curating literature for a specific panel (e.g., Arthrogryposis):

# Configuration
PANEL_DATE=2025-10-01
PANEL_ID=47  # Arthrogryposis panel ID
PANEL_NAME=arthrogryposis

# 1. Copy pre-filtered baseline DB papers to new database
sqlite3 data/$PANEL_NAME.sqlite < schema.sql
sqlite3 data/$PANEL_NAME.sqlite "ATTACH 'data/pubmed_baseline_screening.sqlite' AS source; INSERT INTO papers (pmid, title, abstract, authors, journal, entrez_date, source_type, source_details) SELECT pmid, title, abstract, authors, journal, entrez_date, 'initial', source_details FROM source.papers"

# 2. Assess relevance scoped to the panel
uv run palit assess-relevance \
  --db-path data/$PANEL_NAME.sqlite \
  --panel-date $PANEL_DATE \
  --scope-panel-id $PANEL_ID \
  --prompt-path prompts/panel_relevance_assessment_prompt.txt

# 3. (Optional) Reduce literature for well-researched genes
# The aggregation step has a practical limit of ~30-40 papers per gene due to
# context window constraints. For panels with well-researched genes (e.g., POLG
# with 200+ papers), use tournament selection to keep only the most informative:
uv run palit reduce-literature --db-path data/$PANEL_NAME.sqlite

# 4. Download full-text papers (now reduced set if step 3 was run)
uv run palit download-papers attempt-pmc --db-path data/$PANEL_NAME.sqlite
uv run palit download-papers download-preprints --db-path data/$PANEL_NAME.sqlite
uv run palit download-papers open-browser --db-path data/$PANEL_NAME.sqlite
# ... manually download PDFs ...
uv run palit docling convert --db-path data/$PANEL_NAME.sqlite
uv run palit download-papers register --db-path data/$PANEL_NAME.sqlite

# 5. Extract evidence and assess genes (panel-scoped)
uv run palit extract-evidence --db-path data/$PANEL_NAME.sqlite --panel-date $PANEL_DATE --scope-panel-id $PANEL_ID
uv run palit assess-genes --db-path data/$PANEL_NAME.sqlite --panel-date $PANEL_DATE --target-panel-ids $PANEL_ID --scope-panel-id $PANEL_ID

# 6. Generate report package with panel-scoped novelty detection
uv run palit annotate-pdfs --db-path data/$PANEL_NAME.sqlite --output-dir data/annotated_$PANEL_NAME
uv run palit generate-report \
  --report-id panel_$PANEL_NAME \
  --db-path data/$PANEL_NAME.sqlite \
  --panel-date $PANEL_DATE \
  --target-panel-ids $PANEL_ID \
  --annotated-dir data/annotated_$PANEL_NAME
