layout default
title Appendix B: Prompt Library
parent Appendices
nav_order 2

Appendix B: Prompt Library for Biostatistics, Bioinformatics, and Data Science

Large language models (LLMs) such as ChatGPT are increasingly used to assist with coding, data analysis, and interpretation tasks in biomedical research. Because they follow natural-language instructions, data scientists and bioinformaticians can request analyses and visualizations through written or spoken commands. However, the quality of LLM output depends heavily on prompt design: phrasing questions clearly, providing examples, and breaking tasks into steps can significantly improve results [Chauhan]. This appendix collects example prompts that illustrate how LLMs can support code generation, schema documentation, aggregate-result interpretation, reproducibility planning, and review. Each prompt is written to be pedagogically useful while keeping the LLM away from individual-level data.

{: .warning }

Default rule: do not expose individual-level data to an LLM. For the workflows in this guide, prompts that require patient-, participant-, sample-, customer-, or record-level data are out of scope. Do not paste, upload, attach, or mount PHI, PII, controlled-access genomic data, credentials, private institutional paths, or small-cell outputs; doing so can violate institutional policy, data-use agreements, IRB expectations, HIPAA rules, or other hard requirements [NIH genomic data guidance], [45 CFR 164.514].

Use these prompts with code, schemas, data dictionaries, approved aggregate summaries, and synthetic or toy fixtures. Ideally, keep the codebase separate from the data: use a code-only repository, keep protected data in its approved compute environment, and synchronize code through GitHub only in the approved direction. Do not run GenAI tools or repo-native agents inside an environment that can access protected PHI, PII, controlled data, secrets, or sensitive output files. For repo-native coding agents, add the relevant code files, allowed edit scope, and verification command instead of asking for free-form code in isolation [GitHub Copilot agent best practices], [Claude Code best practices].

{: .warning }

Treat every prompt in this library as a template. Replace real data with synthetic examples, remove sensitive details, and add verification commands before using it in a repository.
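
One way to act on this rule is a small local check that runs before any prompt text leaves your machine. The sketch below is an illustration only, not a de-identification tool: the function name `preflight_scan` and the pattern set are hypothetical, the patterns catch only a few obvious formats, and an empty result does not show that a prompt is safe to send.

```python
import re

# Hypothetical, deliberately incomplete patterns; extend them to match local policy.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "iso_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "path_like": re.compile(r"(?:/[\w.-]+){2,}|[A-Za-z]:\\[\w\\.-]+"),
}


def preflight_scan(prompt_text):
    """Return the sorted names of every pattern that matches the prompt text."""
    return sorted(
        name
        for name, pattern in SENSITIVE_PATTERNS.items()
        if pattern.search(prompt_text)
    )


# Both strings are synthetic; the address uses a reserved example domain.
clean_prompt = "Write an R function that parses a date column, using synthetic dates."
risky_prompt = "jane.doe@example.org was seen 2021-03-14, see /data/protected/visits.csv"

print(preflight_scan(clean_prompt))
print(preflight_scan(risky_prompt))
```

A scan like this is best treated as a tripwire that prompts a human review, since free-text identifiers, rare diagnoses, and small cells will not match any regular expression.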


Data Cleaning

Effective data cleaning is a critical first step in any analysis. LLMs can help streamline data cleaning by suggesting code to handle missing values, detect outliers, harmonize formats, and even parse unstructured data into structured form [AutoDCWorkflow]. These prompts ask for code or local QC plans only; run the code on data inside the approved environment, not inside the LLM session. Example prompts:

  1. (R, tidyverse) – Harmonizing EHR Codes Prompt: "Using R (tidyverse), write code that maps diagnosis-code columns from ICD-9 to ICD-10 using a local crosswalk file. Include flags for unmapped codes and manual review. Use only synthetic example codes in the response."

  2. (Python, pandas) – Missing Data Imputation Prompt: "In Python with pandas, write a function that identifies columns with more than 20% missing values and imputes remaining columns using mean imputation for numeric columns and mode for categorical columns. Provide code using a synthetic toy dataframe only."

  3. (R, lubridate) – Date Parsing Prompt: "Using R’s lubridate, write a reusable function that converts date strings in a named column to YYYY-MM-DD and flags entries that do not parse. Demonstrate with synthetic dates only."

  4. (Python, pandas) – IQR-Based Outlier Detection Prompt: "Write Python code (numpy/pandas) to detect outliers in a numeric column using the IQR method. Return both a flag column and a filtered dataframe. Demonstrate with a synthetic toy dataframe only."

  5. (R, dplyr) – Duplicate-Key Logic Prompt: "In R using dplyr, write a deduplication function that accepts key-column names and a timestamp column, keeps the latest record per key, and logs counts before and after. Demonstrate with synthetic event records only."

  6. (Multi-step Prompt) – Local Data QC Plan Prompt: "Design an R QC workflow that I can run locally on an approved dataset. It should summarize column types, missingness, unique counts, validation failures, and suggested fixes without requiring me to paste any rows or outputs into the LLM. Provide functions and a synthetic fixture."

  7. (Structured Output) – Data Dictionary Draft Prompt: "Using the column list below, draft a data dictionary as a Markdown table with exactly these columns: name, type, units, allowed_values, description, lineage, and validation_rule. If a field is unknown, write needs_review rather than guessing. After the table, list validation checks I should run before using it."
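
To make the IQR prompt (item 4) concrete, here is a minimal sketch of the rule it asks for. It swaps in Python's standard `statistics` module for numpy/pandas so that it runs without extra packages. The function name `flag_iqr_outliers` and the multiplier default `k=1.5` are assumptions for illustration, and the values are synthetic.

```python
from statistics import quantiles


def flag_iqr_outliers(values, k=1.5):
    """Return (flags, kept): one outlier flag per value and the non-outlier values."""
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flags = [v < lower or v > upper for v in values]
    kept = [v for v, is_outlier in zip(values, flags) if not is_outlier]
    return flags, kept


# Synthetic toy values only.
toy_values = [10, 12, 11, 13, 12, 11, 95]
flags, kept = flag_iqr_outliers(toy_values)
print(flags)
print(kept)
```

Returning both the flags and the filtered values mirrors the prompt's request for a flag column plus a filtered dataframe, which keeps the decision to drop or review a value with the analyst.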


Data Analysis

LLMs can assist with a wide array of data analysis tasks—from classical statistical tests to advanced modeling. Use them for code templates, method checks, and interpretation of approved aggregate outputs, not for processing or viewing individual-level rows. Example prompts:

  1. (R, survival) – Kaplan-Meier Survival Prompt: "Using R and the survival package, provide code to fit a Kaplan-Meier curve by treatment group, plot the curves, and perform a log-rank test on a local approved dataset. Use a synthetic fixture for demonstration and do not fabricate real output."

  2. (R, lme4) – Mixed Effects Model Prompt: "Provide R code using lme4 for a repeated-measures model with a random intercept per cluster_id. Demonstrate with synthetic measurements and explain how to interpret the fixed effects and variance components after I run it locally."

  3. (R, stats) – Dimensionality Reduction Prompt: "Write R code using prcomp for PCA on a scaled feature matrix. Demonstrate with a synthetic matrix and show how to extract the first 5 PCs and variance explained after local execution."

  4. (Python, scikit-learn) – Classification Prompt: "Write Python scikit-learn code for a random forest classifier using a local feature matrix and outcome column. Include train/test split, training with 100 trees, and feature importance extraction. Demonstrate with synthetic features only."

  5. (Python, TensorFlow) – LSTM Prompt: "Give Python code using TensorFlow/Keras to build and train an LSTM model on synthetic sensor sequences of length 10 to predict the next value. Keep the example fully synthetic."

  6. (R, epiR) – Epidemiological Analysis Prompt: "Provide R code (e.g., epiR or base methods) to compute the odds ratio, 95% CI, and a chi-square test of association for an exposure in a case-control study. Demonstrate with a synthetic 2×2 table only."

  7. (R, pwr) – Clinical Trial Sample Size Prompt: "Using R, demonstrate how to perform a power analysis for a two-arm trial (80% power, two-sided alpha=0.05) to detect a 5-percentage-point difference in a binary outcome, given a baseline rate that I will specify. Show the code (e.g., pwr package) and the required sample size per arm."

  8. (R, glm) – Logistic Regression with Interaction Prompt: "Use glm() to fit a logistic regression (outcome: disease yes/no) on age, smoking, and their interaction. Provide the R code and interpret the interaction term."

  9. (Python, statsmodels) – Regression Diagnostics Prompt: "Fit a linear regression and generate diagnostic plots (residuals vs fitted, Q-Q plot) to check assumptions. Provide the Python code using statsmodels and matplotlib."
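
As a reference point when checking what an LLM returns for the odds-ratio prompt (item 6), the calculation can be sketched by hand. The example below uses the Wald (Woolf) interval on the log scale, which is one common choice and may differ from the interval a given R package reports. The function name `odds_ratio_ci` and the counts are hypothetical and synthetic.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: exposed cases      b: exposed controls
    c: unexposed cases    d: unexposed controls
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("This sketch needs all four cells to be positive")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Synthetic counts only.
or_est, ci_low, ci_high = odds_ratio_ci(a=30, b=70, c=15, d=85)
print(f"OR = {or_est:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```

Having an independent calculation like this on a toy table makes it easier to notice when generated code has transposed the table or used the wrong reference group.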


Data Visualization

Data visualization is another area where LLMs can provide plotting code and suggestions [LLMs in radiology biostatistics]. Ask for plotting code and synthetic demonstrations; generate figures from real data only in the approved local environment. Example prompts:

  1. (R, ggplot2) – Scatter Plot with Regression Prompt: "In R with ggplot2, provide code to create a scatter plot of two numeric variables, add a regression line with confidence interval, and set labels and a title. Demonstrate with synthetic data only."

  2. (Python, seaborn) – Correlation Heatmap Prompt: "Use Python (pandas + seaborn) to compute a correlation matrix for approved numeric columns and plot an annotated heatmap. Demonstrate with a synthetic toy dataframe only."

  3. (R, survminer) – Kaplan-Meier Plot Prompt: "Plot a Kaplan-Meier curve by treatment group using the survminer package. Include a risk table below the plot."

  4. (R, EnhancedVolcano) – Volcano Plot Prompt: "After differential expression analysis, create a volcano plot highlighting genes with p<0.001 and |log2FC| > 2. Label the top 10 genes. Provide the code (EnhancedVolcano or ggplot2)."

  5. (Python, matplotlib) – ROC Curve Prompt: "Provide Python code to plot an ROC curve using scikit-learn. Compute AUC, add a diagonal reference line, and annotate the AUC."
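
The ROC prompt (item 5) asks for a plot. The sketch below computes only the points and area that such a plot would draw, using the standard library in place of scikit-learn and matplotlib so that it runs anywhere. The function names `roc_points` and `auc_trapezoid` are hypothetical, and the labels and scores are synthetic.

```python
def roc_points(y_true, scores):
    """Return ROC points as (false positive rate, true positive rate) pairs."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("ROC needs at least one positive and one negative label")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every observation tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Synthetic labels (1 = positive) and scores only.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_points = roc_points(toy_labels, toy_scores)
print(toy_points)
print(auc_trapezoid(toy_points))
```

The diagonal reference line requested in the prompt corresponds to an area of 0.5, so an annotated AUC is most informative when read against that baseline.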


Documentation

LLMs are adept at producing human-readable text from structured information, making them useful for code and schema documentation. Do not paste raw records, protected examples, or sensitive file paths. Example prompts:

  1. (Python) – Code Comments & Docstring Prompt: "Given this Python function calc_auc(...), write a clear docstring and inline comments explaining each step."

  2. (R) – Explain an R Script Prompt: "Explain the following R script step by step. Provide markdown comments for a newcomer to understand the workflow."

  3. Summarizing a Data Table Prompt: "Given this approved aggregate results table with no small cells or identifiers, generate a brief summary of the key findings in 2–3 sentences suitable for a report."

  4. Model Output Explanation Prompt: "Explain the meaning of these linear regression coefficients and p-values in plain English. Which are significant? What does the intercept represent?"

  5. (Roxygen) – R Function Documentation Prompt: "Write Roxygen2 documentation for an R function clean_data(). Include description, parameters, return value, and example usage."

  6. Project README Generation Prompt: "Generate a README.md for a data analysis project with sections: Introduction, Data, Methods, Results. Keep the tone professional."

  7. AI Provenance Note Prompt: "Draft a concise AI provenance note for this repository. Include: AI tools used, date range, which tasks AI assisted, human reviewer, tests or builds run, and a statement that no PHI or controlled-access data were included in prompts. Do not list the AI tool as an author."


Reproducibility

Reproducibility is crucial in scientific analysis, and LLM-assisted data science workflows need explicit prompts and review to make outputs reproducible [AIRepr]. Keep data outside the code repository and ask the LLM for code, environment, and verification patterns. Example prompts:

  1. (R, set.seed) Prompt: "Provide R code for a random forest analysis with a fixed random seed (set.seed(123)) so results can be reproduced."

  2. (Python, environment) Prompt: "How do I ensure reproducibility on another machine? Provide a short guide: saving environment dependencies (pip freeze, conda env export), setting random seeds in numpy/TensorFlow, etc."

  3. Workflow Tools Prompt: "Outline a reproducible analysis pipeline (Snakemake or Nextflow). Explain how code, configuration, synthetic fixtures, and documentation can be version-controlled while protected data remains outside the repository."

  4. (R, renv) Prompt: "Using R, show how to make a project's environment reproducible with renv: initialization, snapshot, and restoring on another machine."

  5. Version Control Integration Prompt: "Explain how to integrate Git version control into a data analysis project, track changes, collaborate on GitHub, and maintain a single source of truth."

  6. Comparing Outputs Prompt: "I have two versions of the same analysis before and after refactoring. Provide a strategy to compare approved aggregate outputs, schemas, logs, and tests without pasting row-level data into the LLM."

  7. Agent Verification Plan Prompt: "You are working in this repository. Before editing, inspect the project plan, script, and template files I name. Implement only the requested change. Then run these verification commands: <paste commands>. In your final response, report changed files, assumptions, commands run, and exact pass/fail results."

  8. R Test Cases from Success Criteria Prompt: "Convert the success criteria below into testthat tests for an R data-cleaning script. Include tests for boundary values, missing values, empty files, unexpected columns, and output schema. Use small synthetic fixtures only; do not use real patient data."
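
The seed and environment prompts above (items 1 and 2) share one pattern: fix the source of randomness and record the context in which the code ran. Below is a minimal Python sketch of that pattern using only the standard library. `run_analysis` is a hypothetical stand-in for a real stochastic analysis, and `environment_record` is a hypothetical helper, not a replacement for `pip freeze`, `conda env export`, or renv.

```python
import json
import platform
import random
import sys


def run_analysis(seed):
    """Toy stochastic analysis: bootstrap the mean of synthetic data."""
    rng = random.Random(seed)  # a local generator avoids hidden global state
    data = [rng.gauss(50, 10) for _ in range(200)]
    boot_means = []
    for _ in range(100):
        sample = rng.choices(data, k=len(data))
        boot_means.append(sum(sample) / len(sample))
    return round(sum(boot_means) / len(boot_means), 6)


def environment_record(seed):
    """Collect a few facts worth saving next to any result."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "seed": seed,
    }


first = run_analysis(123)
second = run_analysis(123)
print(first == second)
print(json.dumps(environment_record(123), indent=2))
```

The interpreter version is recorded because identical seeds are only expected to give identical streams within the same software versions, which is the gap that lockfiles and environment exports are meant to close.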


Agentic Coding and Code Review

Repo-native agents can edit files and run commands, so prompts should define the work area and the evidence required for completion. Run them in a code-only workspace with no mounted protected data directories, secrets, or sensitive generated outputs. Example prompts:

  1. Scoped Agent Task Prompt: "Work only in scripts/ and tests/testthat/. Add the requested BMI validation behavior, preserve existing command-line arguments, and do not change documentation files. Use synthetic fixtures only; do not read data/, mounted drives, secrets, or protected outputs. Run the narrow test command below and show the result. If a dependency is missing, stop and ask before installing anything."

  2. Security Review of Agent Changes Prompt: "Review this diff as a skeptical security and reproducibility reviewer. Identify any new external calls, package installs, credential exposure risks, PHI risks, uncontrolled randomness, missing tests, or generated files that should not be committed. Return a prioritized checklist."

  3. Structured Output Validation Prompt: "The following LLM output was generated from a schema. Check whether the values are semantically valid for this research context, not just whether the JSON parses. Flag impossible units, invalid categories, missing provenance, and values that need human review."
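
The structured-output prompt (item 3) distinguishes JSON that parses from JSON that makes sense. The sketch below shows what a local semantic check might look like. The field names, units, range, and categories in `RULES` are hypothetical placeholders; real values would come from the project's reviewed data dictionary.

```python
import json

# Hypothetical rules for illustration only.
RULES = {
    "bmi": {"unit": "kg/m2", "min": 10.0, "max": 80.0},
    "smoking_status": {"allowed": {"never", "former", "current"}},
}


def semantic_issues(raw_json):
    """Return human-readable issues for one record; an empty list means none found."""
    record = json.loads(raw_json)  # raises ValueError if the JSON does not parse
    issues = []

    bmi = record.get("bmi")
    if not isinstance(bmi, dict):
        bmi = {}
    rule = RULES["bmi"]
    if bmi.get("unit") != rule["unit"]:
        issues.append(f"bmi unit {bmi.get('unit')!r} is not {rule['unit']!r}")
    value = bmi.get("value")
    if not isinstance(value, (int, float)) or not rule["min"] <= value <= rule["max"]:
        issues.append(f"bmi value {value!r} is outside {rule['min']}-{rule['max']}")

    status = record.get("smoking_status")
    if status not in RULES["smoking_status"]["allowed"]:
        issues.append(f"smoking_status {status!r} is not an allowed category")

    if not record.get("source"):
        issues.append("missing provenance: 'source' is empty or absent")
    return issues


# Synthetic records: both parse as JSON, but only the first is plausible.
good = '{"bmi": {"value": 24.5, "unit": "kg/m2"}, "smoking_status": "never", "source": "synthetic_fixture_v1"}'
bad = '{"bmi": {"value": 231.0, "unit": "lb"}, "smoking_status": "sometimes", "source": ""}'

print(semantic_issues(good))
for issue in semantic_issues(bad):
    print(issue)
```

A rule-based check like this catches only what the rules encode, so values it passes may still need the human review that the prompt asks the LLM to flag.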


Model Interpretation

Interpreting models is essential in biostatistics and ML. LLMs can help translate complex outputs into understandable explanations. Example prompts:

  1. Linear Regression Coefficient Prompt: "Interpret a linear regression coefficient for age of 1.5 mmHg per year (p=0.01) in a model of blood pressure. Also clarify what the intercept means if it is 100 mmHg."

  2. Logistic Regression Odds Ratio Prompt: "If the OR for smoking is 2.0 (95% CI 1.5–2.7), explain in plain terms what that means about disease risk."

  3. Random Forest Feature Importance Prompt: "Explain how to interpret feature importance = 0.15 vs 0.10. Also mention any caution about random forests' bias toward variables with many categories."

  4. (Python, SHAP) – Explainable AI Prompt: "Provide Python code using the SHAP library to interpret an XGBoost model’s predictions. Include a summary plot and how to read it."

  5. Cox Proportional Hazards Prompt: "A hazard ratio of 0.75 with p=0.03 for a new treatment—what does that mean for the aggregate survival outcome? Also mention the assumption of proportional hazards."

  6. Model Assumptions & Diagnostics Prompt: "List the key assumptions of a linear model and how to check them (residual plots, tests for heteroscedasticity, normal errors, etc.)."

  7. Interpreting PCA Prompt: "After PCA, the first two PCs explain 60% of the variance. How do I describe these components and which variables are most influential?"


Statistical Reporting

Reporting results clearly is as important as doing the analysis. LLMs can assist in writing results sections and checking adherence to reporting standards. Example prompts:

  1. Results Paragraph (ANOVA) Prompt: "Write a short results paragraph for an ANOVA test, e.g., F(3,116)=5.23, p=0.002. Interpret the finding."

  2. APA Style (Regression) Prompt: "Provide an APA-style paragraph for a linear regression: coefficient=0.5 (SE=0.1), p<0.001, R²=0.30. 2–3 sentences only."

  3. Summary Table (R, stargazer) Prompt: "In R, create a table of regression results for multiple models using stargazer or gt. Show the code and briefly interpret."

  4. Methods Section Draft Prompt: "Draft a Methods section for a logistic regression analysis. Include study design, variable selection, model fit assessment. Write formally in past tense."

  5. Non-significant Result Prompt: "Explain how to report a non-significant difference in mean cholesterol (mean diff=5 mg/dL, CI [-2, 12], p=0.18) without 'failing to reject' jargon."

  6. CONSORT Summary Prompt: "Outline how to report a randomized controlled trial according to CONSORT guidelines: participant flow, baseline, primary outcome, adverse events."

  7. Automated Report Generation Prompt: "Suggest a way to auto-generate a report (R Markdown or Jupyter) that includes code, outputs, and narrative interpretation for transparency."


Genomic Data Handling

Genomic data come in specialized formats (FASTA, VCF, BAM). LLMs can help with code patterns and workflow planning, but human genomic records, sample metadata, controlled-access variants, and derived individual-level risk scores must stay out of the LLM environment [bioinformatics LLM review]. Example prompts:

  1. (R, VariantAnnotation) – VCF Filtering Prompt: "Provide R code using VariantAnnotation to read a local approved or synthetic VCF path, filter variants on chromosome 21 with MAF > 0.05, and save a filtered VCF. Do not request or include real variants, genotypes, sample IDs, or paths."

  2. (PLINK) – GWAS Prompt: "Outline steps for a GWAS on a binary trait using PLINK: data QC, association test, multiple testing correction. Provide example commands."

  3. (R, snpStats) – Polygenic Risk Score Prompt: "Write R code that demonstrates PRS calculation on a synthetic toy genotype matrix and synthetic effect sizes. Explain where local approved genotype and effect-size files would be supplied without exposing them to the LLM."

  4. (Python, BioPython) – FASTA Analysis Prompt: "Read a public or synthetic FASTA file in Python, compute GC content per sequence, and search for the motif 'ATGGC' using BioPython. Show code only."

  5. (R, biomaRt) – Gene Annotation Prompt: "Take a list of Ensembl IDs, query gene symbols and descriptions from Ensembl using biomaRt. Provide the example code."

  6. Variant Interpretation Prompt: "Interpret 'SNP rs123456 – OR=1.3, p=4e-6' in a GWAS. Explain what OR=1.3 means for risk and note genome-wide significance thresholds."

  7. (R, phyloseq) – Microbiome Data Prompt: "Provide R code that imports synthetic OTU counts, synthetic sample metadata, and taxonomy into phyloseq, computes Shannon alpha diversity, and plots by a toy group variable. Explain how to adapt the paths locally for approved data."

  8. Controlled-Access Data Preflight Prompt: "Before we use any LLM or coding agent for this genomic workflow, create a preflight checklist for controlled-access data. Include data-use restrictions, a default rule that prompts must not contain real variants or sample metadata, approved compute environments, logging, model/tool access, and what synthetic examples can be used instead."
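
For the FASTA prompt (item 4), a dependency-free sketch helps show what the requested code has to do. It uses a small hand-written parser in place of BioPython, so it is a stand-in rather than the library's API. The function names and the two toy sequences are hypothetical and synthetic.

```python
import io


def read_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA-formatted text handle."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            header = line[1:].split()
            record_id, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line.upper())
    if record_id is not None:
        yield record_id, "".join(chunks)


def gc_fraction(seq):
    """Fraction of G and C bases; 0.0 for an empty sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def motif_positions(seq, motif):
    """Return 0-based start positions of every match, including overlapping ones."""
    positions, start = [], seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


# Synthetic sequences only.
TOY_FASTA = """\
>toy_seq_1
ATGGCATGGC
>toy_seq_2
TTTTAAAA
GGCC
"""

for name, seq in read_fasta(io.StringIO(TOY_FASTA)):
    print(name, round(gc_fraction(seq), 3), motif_positions(seq, "ATGGC"))
```

Because the parser reads from any text handle, the same functions can be pointed at a public or synthetic file path locally without the sequences ever appearing in a prompt.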


Data Sharing

Data sharing and collaboration are vital in science, and LLMs can help prepare code, documentation, metadata, and synthetic examples for broader use. Synthetic data can help navigate patient-data sharing restrictions, but privacy risk must still be evaluated [synthetic data review]. Example prompts:

  1. Data Dictionary Prompt: "Generate a data dictionary from this schema-only column list: record_id, age_band, group, measure_before, measure_after, outcome_flag. Provide descriptions, units, possible values, and validation rules. Do not ask for row examples."

  2. De-identification Planning Prompt: "Create a local de-identification planning checklist for a health dataset without seeing the dataset. Include direct identifiers, quasi-identifiers, dates, geographic fields, free text, small cells, approval requirements, and validation steps a human should perform in the approved environment."

  3. Packaging Code (R) Prompt: "Explain how to convert an analysis script into an R package: usethis for structure, functions in R/, documentation, sharing on GitHub."

  4. Reusable Notebook (Python, Jupyter) Prompt: "Give tips for structuring a Jupyter Notebook so that others can run it easily: install/import blocks, relative paths, parameter sections, and clear markdown instructions."

  5. Synthetic Data Generation Prompt: "How can I generate a synthetic dataset from documented schema rules and approved aggregate distributions, without giving the LLM real rows? Mention why synthetic data helps with examples but still requires privacy review."

  6. Public Repository Prompt: "Write a checklist for sharing code, documentation, synthetic fixtures, and approved aggregate results on Dryad/Zenodo: choose a license, add metadata, exclude protected data, verify ignored files, and explain how to cite. Provide a short best-practices list."

  7. Collaboration Workflow Prompt: "Describe a workflow for collaborative analysis using Git and LLM assistance while keeping the codebase separate from protected data. Emphasize a code-only GitHub repository, approved one-direction code synchronization into the protected environment, synthetic fixtures for tests, and no GenAI agent access to PHI/PII."

  8. PHI-Safe Prompt Rewrite Prompt: "Review the sanitized prompt below and identify any remaining categories of sensitive information that should be removed before LLM use. Preserve the technical task, replace any remaining concrete examples with synthetic placeholders, and return a safer version."

  9. Publication Disclosure Draft Prompt: "Draft a manuscript disclosure statement describing AI assistance for code refactoring and documentation. State what humans reviewed and verified, which tests were run, and avoid implying that the AI tool is an author or a primary source."
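
The synthetic-data prompt (item 5) can be illustrated with a small fixture generator built from the schema-only column list in item 1. In the sketch below, every range, category, and distribution is a made-up assumption, not an approved aggregate; the function names and the `SYN` identifier prefix are hypothetical.

```python
import csv
import io
import random

FIELDNAMES = [
    "record_id", "age_band", "group",
    "measure_before", "measure_after", "outcome_flag",
]

# Hypothetical schema rules; real rules belong in the reviewed data dictionary.
AGE_BANDS = ["18-39", "40-64", "65+"]
GROUPS = ["control", "treatment"]


def make_synthetic_rows(n_rows, seed=2024):
    """Generate toy rows from schema rules alone, with a fixed seed."""
    rng = random.Random(seed)
    rows = []
    for i in range(1, n_rows + 1):
        before = round(rng.uniform(50.0, 150.0), 1)
        after = round(before + rng.gauss(0.0, 5.0), 1)
        rows.append({
            "record_id": f"SYN{i:04d}",
            "age_band": rng.choice(AGE_BANDS),
            "group": rng.choice(GROUPS),
            "measure_before": before,
            "measure_after": after,
            "outcome_flag": rng.randint(0, 1),
        })
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(rows_to_csv(make_synthetic_rows(3)))
```

Fixtures generated this way carry no information from real records. Once parameters are fitted to real data or approved aggregates, that stops being true, and the privacy review mentioned in the prompt still applies.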


Each of these prompts shows how users in biostatistics, bioinformatics, and data science can engage LLMs for technical help without exposing individual-level data. By specifying the programming language or libraries, clarifying the output format, and supplying only schemas, synthetic fixtures, or approved aggregate results, one can leverage generative AI to save time, learn new techniques, and maintain reproducible workflows [PharmaSUG prompt engineering]. As these examples illustrate, LLMs can act as coding assistants, statistical consultants, or explainers, but they should not be treated as approved handlers of PHI, PII, controlled-access data, or private data environments.

{: .takeaway }

The best prompt-library entries are starting points, not final instructions. Add your project context, constraints, expected outputs, and evidence required for acceptance.