Merged
4 changes: 2 additions & 2 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,10 +1,10 @@
Package: DataFindR
Title: Automated LLM Study Assessment and Data Extraction
Title: Automated Study Assessment and Data Extraction using Large Language Models (LLMs)
Version: 0.0.0.9000
Authors@R:
person("Anders Hagen", "Jarmund", , "anders.h.jarmund@ntnu.no", role = c("aut", "cre"),
comment = c(ORCID = "0000-0002-3923-1637"))
Description: Works in combination with the metawoRld package.
Description: Provides functions to automate the process of study relevance assessment and data extraction from research papers using Large Language Models (LLMs). Key features include fetching metadata, assessing studies based on title/abstract against user-defined criteria, extracting structured data from full-text documents, and caching results to optimize workflows. It is designed to integrate with the 'metawoRld' package for systematic review and meta-analysis projects.
License: GPL (>= 3)
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
80 changes: 52 additions & 28 deletions R/assessment.R
@@ -1,24 +1,32 @@
#' Functions for assessing study relevance using Large Language Models (LLMs).
#'
#' This file contains functions to assess the relevance of studies based on their
#' title and abstract, utilizing LLMs for the assessment process. It includes
#' functions for single study assessment and batch processing.
#' @title Assess Study Relevance using LLM
#'
#' @description
#' Fetches metadata for an identifier, generates a prompt based on project
#' criteria, calls an LLM API to assess relevance based on Title/Abstract,
#' parses the response, and caches results.
#'
#' @param chat An `ellmer` chat object.
#' @param chat An `ellmer` chat object, initialized with a specific LLM model (e.g., `ellmer::chat_openai()`). This object manages the communication with the LLM service.
#' @param identifier Character string. The DOI or PMID of the study.
#' @param metawoRld_path Character string. Path to the root of the metawoRld project.
#' @param force_fetch Logical. If TRUE, bypass the metadata cache and re-fetch
#' from online sources. Defaults to FALSE.
#' @param force_assess Logical. If TRUE, bypass the assessment cache and
#' re-run the LLM assessment. Defaults to FALSE.
#' @param service Character string. The LLM service to use (currently only "openai").
#' @param model Character string. The specific LLM model name.
#' @param email Character string (optional). Email for NCBI Entrez.
#' @param ncbi_api_key Character string (optional). NCBI API key.
#' @param ... Additional arguments passed to the underlying LLM API call function
#' (e.g., `temperature`, `max_tokens` passed to `.call_llm_openai`).
#'
#' @section API Keys:
#' LLM API keys (e.g., `OPENAI_API_KEY`) are typically expected to be set as
#' environment variables. Refer to the `ellmer` package documentation for more
#' details on API key management.
#'
#' @return A list containing the structured assessment result (decision, score,
#' rationale) or aborts on critical failure.
#' @export
@@ -46,23 +54,31 @@
#' # --- Run Assessment ---
#' pmid <- "31772108" # Example PMID relevant to cytokines/pregnancy
#' tryCatch({
#' assessment_res <- df_assess_relevance(
#' identifier = pmid,
#' metawoRld_path = proj_path,
#' email = "your.email@example.com", # Replace with your email
#' service = "openai",
#' model = "gpt-3.5-turbo" # Use a cheaper model for testing initially
#' )
#' print(assessment_res)
#' # # Initialize the chat object (assuming OPENAI_API_KEY is set)
#' # # Ensure ellmer is installed: install.packages("ellmer")
#' # # The development version is available from GitHub:
#' # # remotes::install_github("tidyverse/ellmer")
#' if (requireNamespace("ellmer", quietly = TRUE) && Sys.getenv("OPENAI_API_KEY") != "") {
#' my_chat <- ellmer::chat_openai(model = "gpt-3.5-turbo")
#'
#' # --- Run again (should use cache) ---
#' assessment_res_cached <- df_assess_relevance(pmid, proj_path, email = "your.email@example.com")
#' print(assessment_res_cached)
#' assessment_res <- df_assess_relevance(
#' chat = my_chat,
#' identifier = pmid,
#' metawoRld_path = proj_path,
#' email = "your.email@example.com" # Replace with your email
#' )
#' print(assessment_res)
#'
#' # --- Force re-assessment ---
#' assessment_res_forced <- df_assess_relevance(pmid, proj_path, email = "your.email@example.com", force_assess = TRUE)
#' print(assessment_res_forced)
#' # --- Run again (should use cache) ---
#' assessment_res_cached <- df_assess_relevance(my_chat, pmid, proj_path, email = "your.email@example.com")
#' print(assessment_res_cached)
#'
#' # --- Force re-assessment ---
#' assessment_res_forced <- df_assess_relevance(my_chat, pmid, proj_path, email = "your.email@example.com", force_assess = TRUE)
#' print(assessment_res_forced)
#' } else {
#' message("ellmer package not available or OPENAI_API_KEY not set. Skipping example execution.")
#' }
#' }, error = function(e) {
#' message("Assessment failed: ", e$message)
#' })
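The `force_fetch` and `force_assess` flags documented above describe a cache-bypass pattern. A minimal sketch of that pattern follows, assuming an on-disk `.rds` cache; `cached_or_compute()`, `fake_assess()` and the file layout are hypothetical illustrations, not DataFindR's actual internals.

```r
# Hypothetical helper: return the cached value unless `force` is TRUE.
cached_or_compute <- function(key, compute, cache_dir, force = FALSE) {
  cache_file <- file.path(cache_dir, paste0(key, ".rds"))
  if (!force && file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  result <- compute()
  saveRDS(result, cache_file)
  result
}

# Stand-in for an LLM call; counts how often it is actually invoked.
counter <- new.env()
counter$calls <- 0
fake_assess <- function() {
  counter$calls <- counter$calls + 1
  list(decision = "Include", score = 0.9, rationale = "placeholder")
}

cache_dir <- file.path(tempdir(), "demo_assessment_cache")
unlink(cache_dir, recursive = TRUE)  # start from an empty cache
dir.create(cache_dir)

first  <- cached_or_compute("pmid_31772108", fake_assess, cache_dir)
second <- cached_or_compute("pmid_31772108", fake_assess, cache_dir)  # cache hit
third  <- cached_or_compute("pmid_31772108", fake_assess, cache_dir, force = TRUE)  # recomputed
```

In this sketch only the first and the forced call reach `fake_assess()`, which is the behaviour the "Run again (should use cache)" and "Force re-assessment" examples rely on.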
@@ -214,12 +230,11 @@ df_assess_relevance <- function(chat,
#' Runs the relevance assessment workflow (`df_assess_relevance`) for multiple
#' DOIs/PMIDs, leveraging caching and providing a summary of results.
#'
#' @param chat An `ellmer` chat object, initialized with a specific LLM model (e.g., `ellmer::chat_openai()`). This object will be used for all assessments in the batch.
#' @param identifiers Character vector. A vector of DOIs and/or PMIDs.
#' @param metawoRld_path Character string. Path to the root of the metawoRld project.
#' @param force_fetch Logical. If TRUE, bypass the metadata cache for all identifiers.
#' @param force_assess Logical. If TRUE, bypass the assessment cache for all identifiers.
#' @param service Character string. The LLM service to use (e.g., "openai").
#' @param model Character string. The specific LLM model name.
#' @param email Character string (optional). Email for NCBI Entrez.
#' @param ncbi_api_key Character string (optional). NCBI API key.
#' @param stop_on_error Logical. If TRUE, the batch process stops if any single
@@ -228,6 +243,11 @@ df_assess_relevance <- function(chat,
#' @param ... Additional arguments passed down to `df_assess_relevance` and
#' subsequently to the LLM API call function (e.g., `temperature`).
#'
#' @section API Keys:
#' LLM API keys (e.g., `OPENAI_API_KEY`) are typically expected to be set as
#' environment variables. Refer to the `ellmer` package documentation for more
#' details on API key management.
#'
#' @return A data frame (tibble) summarizing the assessment results for each
#' identifier, with columns:
#' \describe{
@@ -271,15 +291,19 @@ df_assess_relevance <- function(chat,
#' )
#'
#' # --- Run Batch Assessment ---
#' batch_results <- df_assess_batch(
#' identifiers = ids_to_assess,
#' metawoRld_path = proj_path,
#' email = "your.email@example.com", # Replace with your email
#' service = "openai",
#' model = "gpt-3.5-turbo",
#' stop_on_error = FALSE # Continue processing even if one fails
#' )
#'
#' if (requireNamespace("ellmer", quietly = TRUE) && Sys.getenv("OPENAI_API_KEY") != "") {
#' my_chat_batch <- ellmer::chat_openai(model = "gpt-3.5-turbo")
#' batch_results <- df_assess_batch(
#' chat = my_chat_batch,
#' identifiers = ids_to_assess,
#' metawoRld_path = proj_path,
#' email = "your.email@example.com", # Replace with your email
#' stop_on_error = FALSE # Continue processing even if one fails
#' )
#' print(batch_results)
#' } else {
#' message("ellmer package not available or OPENAI_API_KEY not set. Skipping example execution.")
#' }
#' # --- View Results (only defined if the batch ran above) ---
#' if (exists("batch_results")) print(batch_results)
#'
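The `stop_on_error` behaviour described for the batch function can be sketched in base R as below. This is an illustration under assumptions: `run_batch()`, `toy_assess()` and the summary columns are hypothetical names, and the real `df_assess_batch()` may be organised differently.

```r
# Hypothetical batch runner: one summary row per identifier.
run_batch <- function(ids, assess_one, stop_on_error = FALSE) {
  rows <- lapply(ids, function(id) {
    outcome <- tryCatch(
      list(status = "Success", decision = assess_one(id), error = NA_character_),
      error = function(e) {
        if (stop_on_error) stop(e)  # re-raise and abort the whole batch
        list(status = "Failure", decision = NA_character_,
             error = conditionMessage(e))
      }
    )
    data.frame(
      identifier = id,
      status = outcome$status,
      decision = outcome$decision,
      error = outcome$error,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Stand-in for a single assessment; fails on identifiers it cannot recognise.
toy_assess <- function(id) {
  if (!grepl("^[0-9]+$", id)) stop("unrecognised identifier: ", id)
  "Include"
}

batch_summary <- run_batch(c("31772108", "not-an-id"), toy_assess)
```

With `stop_on_error = FALSE` the failure is recorded in the summary and processing continues; with `TRUE` the first error propagates to the caller.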
6 changes: 6 additions & 0 deletions R/caching.R
@@ -1,3 +1,8 @@
#' Caching Utility Functions for DataFindR
#'
#' This file contains functions for managing cached data within a metawoRld project.
#' It includes functions for saving, loading, and clearing cached items related to
#' metadata, assessment results, and extracted data.
#' Get the Path to the DataFindR Cache Directory within a metawoRld Project
#'
#' @param metawoRld_path Path to the root of the metawoRld project.
@@ -136,6 +141,7 @@
#' Clear DataFindR Cache within a metawoRld Project
#'
#' Removes cached files for a specific identifier or the entire cache.
#' Note: a running Shiny application may still hold cached data in memory;
#' restart the app after clearing the cache for the change to take full effect.
#'
#' @param identifier The original identifier (e.g., DOI) to clear cache for.
#' If NULL (default), clears the *entire* DataFindR cache for the project.
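The "one identifier or the entire cache" behaviour can be sketched as follows. The helper name `clear_cache_sketch()`, the flat directory layout and the way identifiers are turned into file names are all assumptions for illustration; the package's own cache structure is not shown in this diff.

```r
# Hypothetical sketch: remove cache files for one identifier, or all of them.
clear_cache_sketch <- function(cache_dir, identifier = NULL) {
  if (is.null(identifier)) {
    targets <- list.files(cache_dir, full.names = TRUE)
  } else {
    # Assumed convention: non-alphanumeric characters become underscores.
    safe_id <- gsub("[^A-Za-z0-9]", "_", identifier)
    targets <- list.files(
      cache_dir,
      pattern = paste0("^", safe_id, "\\."),
      full.names = TRUE
    )
  }
  unlink(targets, recursive = TRUE)
  invisible(length(targets))
}

demo_dir <- file.path(tempdir(), "demo_clear_cache")
unlink(demo_dir, recursive = TRUE)
dir.create(demo_dir)
file.create(file.path(demo_dir, c("10_1000_abc.json", "31772108.json")))

n_one <- clear_cache_sketch(demo_dir, identifier = "10.1000/abc")  # one study
n_all <- clear_cache_sketch(demo_dir)                              # everything left
```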
93 changes: 88 additions & 5 deletions R/extraction.R
@@ -1,3 +1,8 @@
#' Functions for extracting structured data from full-text study documents using LLMs.
#'
#' This file contains functions to extract data from PDF documents.
#' It leverages LLMs to parse and structure the information according to a predefined schema.
#' It includes functions for batch processing of documents.
#' @title Extract Data for a Batch of Studies using LLM
#'
#' @description
@@ -6,21 +11,24 @@
#' extraction cache within the `metawoRld` project. Does *not* import into metawoRld.
#' Use `df_import_batch` subsequently to import cached results.
#'
#' @param chat An `ellmer` chat object, initialized with a specific LLM model (e.g., `ellmer::chat_openai()`). This object manages communication with the LLM service for data extraction.
#' @param identifiers Character vector. A vector of DOIs and/or PMIDs for studies
#' identified as relevant (e.g., having an "Include" assessment decision).
#' @param paper_paths Character vector. **Named** vector or list where names are
#' the identifiers (matching `identifiers` argument) and values are the file
#' paths to the corresponding full-text **plain text (.txt) files**.
#' @param paper_paths A named character vector or list. Names must be the study identifiers (matching the `identifiers` argument). Values must be the file paths to the corresponding full-text PDF (.pdf) files. Ensure these files are accessible. Currently, only PDF files are supported.
#' @param metawoRld_path Character string. Path to the root of the metawoRld project.
#' @param force_extract Logical. If TRUE, bypass the extraction cache and re-run
#' LLM extraction even if cached JSON exists. Defaults to FALSE.
#' @param service Character string. The LLM service to use (e.g., "openai").
#' @param model Character string. The specific LLM model name (e.g., "gpt-4-turbo").
#' @param stop_on_error Logical. If TRUE, stop the batch if any single extraction fails.
#' Defaults to FALSE.
#' @param ellmer_timeout_s Numeric. Timeout in seconds for the underlying `ellmer` API calls during extraction. Defaults to 300 seconds (5 minutes).
#' @param ... Additional arguments passed down to the LLM API call function
#' (e.g., `temperature`, `max_tokens`).
#'
#' @section API Keys:
#' LLM API keys (e.g., `OPENAI_API_KEY`) are typically expected to be set as
#' environment variables. Refer to the `ellmer` package documentation for more
#' details on API key management.
#'
#' @return A data frame (tibble) summarizing the extraction attempt for each
#' identifier, with columns:
#' \describe{
@@ -32,6 +40,81 @@
#' Also prints progress and summary information.
#'
#' @export
#' @examples
#' \dontrun{
#' # --- Prerequisites ---
#' # 1. Set API key: usethis::edit_r_environ("project") -> add OPENAI_API_KEY=sk-... -> Restart R
#' # 2. Create a dummy metawoRld project & files
#' proj_path <- file.path(tempdir(), "extract_batch_proj")
#' metawoRld::create_metawoRld(
#' proj_path,
#' project_name = "Test Batch Extraction",
#' project_description = "Testing DataFindR batch extraction",
#' data_extraction_schema = list( # Define a simple schema for the example
#' list(name = "sample_size", type = "integer", description = "Total sample size"),
#' list(name = "population", type = "string", description = "Study population description")
#' )
#' )
#'
#' # Create dummy PDF files for extraction (replace with actual PDFs)
#' id1 <- "study123"
#' id2 <- "study456"
#' pdf_path1 <- file.path(tempdir(), "study123.pdf")
#' pdf_path2 <- file.path(tempdir(), "study456.pdf")
#' # Create minimal valid PDF content for example purposes
#' pdf_content <- c(
#' "%PDF-1.4",
#' "1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj",
#' "2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> endobj",
#' "3 0 obj << /Type /Page /MediaBox [0 0 612 792] /Contents 4 0 R >> endobj",
#' "4 0 obj << /Length 0 >> stream endstream endobj",
#' "xref", "0 5", "0000000000 65535 f ", "0000000009 00000 n ",
#' "0000000058 00000 n ", "0000000117 00000 n ", "0000000178 00000 n ",
#' "trailer << /Size 5 /Root 1 0 R >>", "startxref", "225", "%%EOF"
#' )
#' writeLines(pdf_content, pdf_path1)
#' writeLines(pdf_content, pdf_path2) # Using same content for simplicity
#'
#' # --- Identifiers and Paths ---
#' ids_to_extract <- c(id1, id2)
#' paper_paths_list <- stats::setNames(c(pdf_path1, pdf_path2), ids_to_extract)
#'
#' # --- Run Batch Extraction ---
#' # Ensure ellmer is installed: install.packages("ellmer")
#' # The development version is available from GitHub:
#' # remotes::install_github("tidyverse/ellmer")
#' if (requireNamespace("ellmer", quietly = TRUE) && Sys.getenv("OPENAI_API_KEY") != "") {
#' my_chat_extract <- ellmer::chat_openai(model = "gpt-4o") # Or other capable model
#'
#' # Mock df_assess_relevance results for included studies (if necessary for your full flow)
#' # For this example, we assume these IDs are already assessed as "Include"
#' # and their metadata might be cached (though not strictly used by df_extract_batch directly
#' # beyond what .generate_extraction_prompt might use if it reads metadata cache)
#'
#' # Before running, ensure extraction prompt and schema are in the project
#' # (Typically handled by metawoRld::create_metawoRld or manually copying)
#' # For the example, let's assume they exist or are created by default
#' # For a real run, ensure these files are correctly set up:
#' # file.copy(system.file("prompts/_extraction_prompt.txt", package="DataFindR"), proj_path)
#' # file.copy(system.file("prompts/_extraction_schema.yml", package="DataFindR"), proj_path)
#'
#' batch_extract_results <- df_extract_batch(
#' chat = my_chat_extract,
#' identifiers = ids_to_extract,
#' paper_paths = paper_paths_list,
#' metawoRld_path = proj_path,
#' ellmer_timeout_s = 600 # Increase timeout for potentially long documents
#' )
#' print(batch_extract_results)
#' } else {
#' message("ellmer package not available or OPENAI_API_KEY not set. Skipping example execution.")
#' }
#'
#' # --- Clean up ---
#' unlink(proj_path, recursive = TRUE)
#' unlink(pdf_path1)
#' unlink(pdf_path2)
#' }
#' @importFrom purrr map safely list_transpose keep discard set_names map_chr map_lgl compact walk pmap
#' @importFrom dplyr bind_rows tibble mutate select relocate if_else everything rename
#' @importFrom rlang inform warn is_character abort is_logical is_named `%||%` list2 !!! is_null
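The `paper_paths` contract (named, names matching `identifiers`, values pointing at existing PDF files) could be checked up front along these lines. `check_paper_paths()` is a hypothetical helper written for this illustration and is not part of the diff.

```r
# Hypothetical pre-flight check; returns a character vector of problems.
check_paper_paths <- function(identifiers, paper_paths) {
  ids <- names(paper_paths)
  if (is.null(ids) || any(!nzchar(ids))) {
    return("paper_paths must be named with study identifiers")
  }
  problems <- character(0)
  missing_ids <- setdiff(identifiers, ids)
  if (length(missing_ids) > 0) {
    problems <- c(problems, paste0("no path supplied for: ", missing_ids))
  }
  for (id in intersect(identifiers, ids)) {
    path <- as.character(paper_paths[[id]])
    if (!grepl("\\.pdf$", path, ignore.case = TRUE)) {
      problems <- c(problems, paste0(id, ": not a .pdf file: ", path))
    } else if (!file.exists(path)) {
      problems <- c(problems, paste0(id, ": file not found: ", path))
    }
  }
  problems
}

pdf_ok <- file.path(tempdir(), "study123.pdf")
file.create(pdf_ok)

ok  <- check_paper_paths("study123", c(study123 = pdf_ok))
bad <- check_paper_paths(
  c("study123", "study456"),
  c(study123 = file.path(tempdir(), "study123.txt"))
)
```

An empty result means the inputs satisfy the documented contract; otherwise each element names the identifier at fault.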
4 changes: 4 additions & 0 deletions R/fetching.R
@@ -1,3 +1,7 @@
#' Metadata Fetching Functions
#'
#' This file contains functions to retrieve study metadata (e.g., title, abstract, authors)
#' from online sources like PubMed and Crossref using DOIs or PMIDs.
#' @title Fetch Metadata for a Study Identifier (DOI or PMID)
#'
#' @description
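Since the fetching functions accept either a DOI or a PMID, some step has to tell the two apart. A rough sketch of one way to do that is shown here; `classify_identifier()` and its regular expressions are assumptions for illustration and may not match how DataFindR actually routes identifiers.

```r
# Hypothetical classifier: all digits => PMID, "10.<registrant>/<suffix>" => DOI.
classify_identifier <- function(id) {
  id <- trimws(id)
  if (grepl("^[0-9]+$", id)) {
    "pmid"
  } else if (grepl("^10\\.[0-9]{4,9}/\\S+$", id)) {
    "doi"
  } else {
    "unknown"
  }
}

id_types <- vapply(
  c("31772108", "10.1000/xyz123", "not an id"),
  classify_identifier,
  character(1)
)
```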
5 changes: 5 additions & 0 deletions R/importing.R
@@ -1,3 +1,8 @@
#' Data Importing Functions for metawoRld Integration
#'
#' This file provides functions to import structured data, typically extracted by LLMs,
#' into a `metawoRld` project. It handles validation against project schemas and
#' integration with `metawoRld`'s data structures.
#' Import Extracted Study Data into a metawoRld Project
#'
#' Reads cached extraction JSON data for a study, validates it against the
7 changes: 6 additions & 1 deletion R/parsing_validation.R
@@ -1,3 +1,7 @@
#' Parsing and Validation Utilities
#'
#' This file contains utility functions for parsing schema definitions and validating
#' data structures, particularly for data extracted by LLMs against these schemas.
#' Validate extracted data against a schema definition (list format).
#'
#' Checks if the structure, types, and required fields of the extracted data
@@ -9,7 +13,8 @@
#'
#' @return A list of validation error messages. An empty list indicates successful
#' validation. Each error message includes the path to the problematic node.
#' @export
#' @noRd
#' @keywords internal
#'
#' @examples
#' \dontrun{
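The return contract described above (a list of error messages, each naming the path to the offending node, empty on success) can be illustrated with a small stand-alone validator. The flat schema format and the name `validate_sketch()` are hypothetical; the package's real schema handling is richer than this.

```r
# Hypothetical validator: schema is a named list of list(type =, required =).
validate_sketch <- function(data, schema, path = "$") {
  errors <- list()
  for (field in names(schema)) {
    rule <- schema[[field]]
    node <- paste0(path, ".", field)
    value <- data[[field]]
    if (is.null(value)) {
      if (isTRUE(rule$required)) {
        errors <- c(errors, paste0(node, ": required field is missing"))
      }
      next
    }
    ok <- switch(
      rule$type,
      integer = is.numeric(value) && length(value) == 1 && value == round(value),
      string = is.character(value) && length(value) == 1,
      FALSE  # unknown types are treated as invalid in this sketch
    )
    if (!ok) {
      errors <- c(errors, paste0(node, ": expected type '", rule$type, "'"))
    }
  }
  errors
}

schema <- list(
  sample_size = list(type = "integer", required = TRUE),
  population = list(type = "string", required = TRUE)
)

valid_errors   <- validate_sketch(list(sample_size = 120, population = "adults"), schema)
invalid_errors <- validate_sketch(list(sample_size = "many"), schema)
```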
8 changes: 7 additions & 1 deletion R/prompts.R
@@ -1,3 +1,8 @@
#' LLM Prompt Generation and Schema Parsing
#'
#' This file includes functions for generating formatted prompts for Large Language Models (LLMs)
#' for tasks like study assessment and data extraction. It also contains functions for
#' parsing YAML-based schema definitions into formats usable by the `ellmer` package.
#' Generate Assessment Prompt
#'
#' Constructs a formatted prompt string for assessing a study based on its
@@ -68,7 +73,8 @@
#' If provided, the prompt can instruct the LLM to focus less on these.
#'
#' @return Character string. The formatted LLM prompt for data extraction.
#' @export
#' @noRd
#' @keywords internal
#' @importFrom glue glue
#' @importFrom rlang `%||%` is_list abort is_scalar_character is_null
#' @importFrom jsonlite toJSON
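For orientation, building a formatted prompt from a title, an abstract and project criteria might look roughly like the following. The diff imports `glue` for this purpose; the sketch uses base R `paste()` to stay dependency-free, and both `build_prompt_sketch()` and the prompt wording are invented for illustration rather than taken from the package.

```r
# Hypothetical prompt builder using only base R.
build_prompt_sketch <- function(title, abstract, inclusion, exclusion) {
  bullets <- function(x) paste0("- ", x, collapse = "\n")
  paste(
    "Assess whether the study below meets the criteria.",
    "",
    "Inclusion criteria:",
    bullets(inclusion),
    "",
    "Exclusion criteria:",
    bullets(exclusion),
    "",
    paste0("Title: ", title),
    paste0("Abstract: ", abstract),
    "",
    "Reply with JSON containing \"decision\", \"score\" and \"rationale\".",
    sep = "\n"
  )
}

prompt <- build_prompt_sketch(
  title = "Cytokines in pregnancy",
  abstract = "Placeholder abstract text.",
  inclusion = c("Human studies", "Measures cytokines"),
  exclusion = "Review articles"
)
cat(prompt)
```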
46 changes: 46 additions & 0 deletions R/shiny_assess.R
@@ -1,4 +1,50 @@
#' Shiny Application for Interactive Study Assessment
#'
#' This file defines a Shiny application that provides a user interface for
#' fetching study metadata, assessing study relevance using LLMs based on title/abstract
#' and project criteria, and viewing/editing/saving assessment results to cache.
# --- (Keep existing @import tags and function definition) ---
#' Launch Shiny App for Interactive Study Assessment
#'
#' Launches a Shiny application that provides a user interface for the study
#' assessment phase of a systematic review or meta-analysis workflow.
#'
#' @details
#' The application allows users to:
#' \itemize{
#' \item Specify a `metawoRld` project path.
#' \item Enter a study identifier (DOI or PMID).
#' \item Fetch bibliographic metadata (title, abstract) for the identifier.
#' \item Manually input or edit the title and abstract.
#' \item View inclusion/exclusion criteria from the `metawoRld` project.
#' \item View the generated prompt for LLM-based assessment.
#' \item Trigger an LLM call (e.g., OpenAI, Google) to assess study relevance based on the
#' title/abstract and criteria.
#' \item View the LLM's assessment result (decision, score, rationale).
#' \item Manually input or edit the assessment result in JSON format.
#' \item Save the assessment result (from LLM or manual input) to the project's cache.
#' \item Load existing assessment and metadata from cache when an identifier is entered.
#' }
#'
#' Prerequisites:
#' \itemize{
#' \item The `shiny`, `glue`, `rlang`, `yaml`, `fs`, `jsonlite` packages must be installed.
#' \item `rclipboard` is suggested for the "Copy Prompt" functionality.
#' \item For LLM calls, the appropriate API key (e.g., `OPENAI_API_KEY` for OpenAI models,
#' `GOOGLE_API_KEY` for Google models) must be set as an environment variable
#' (e.g., in your `.Renviron` file, followed by an R session restart).
#' Refer to the `ellmer` package for more details on API key setup.
#' \item The specified `metawoRld` project path must point to a valid project
#' containing a `_metawoRld.yml` configuration file with defined
#' inclusion and exclusion criteria.
#' }
#'
#' @param launch.browser Logical, passed to `shiny::runApp`. If `TRUE` (default),
#' the app will launch in the system's default web browser.
#' @param ... Additional arguments passed to `shiny::runApp` (e.g., `port`, `host`).
#'
#' @return Does not return a value; runs the Shiny application.
#' @export
#' @import shiny
#' @importFrom glue glue
#' @importFrom rlang inform warn abort is_list `%||%` check_installed is_scalar_character
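The API-key prerequisite listed above can be checked before launching the app or calling an LLM. The helper `missing_env_vars()` below is a hypothetical convenience, and the `DEMO_*` variable names exist only to keep the example deterministic; in practice one would pass `"OPENAI_API_KEY"` or `"GOOGLE_API_KEY"`.

```r
# Hypothetical helper: which of the named environment variables are unset or empty?
missing_env_vars <- function(vars) {
  vars[!nzchar(Sys.getenv(vars))]
}

# Demo with throwaway variables so the result does not depend on the machine.
Sys.setenv(DEMO_KEY_SET = "sk-placeholder")
Sys.unsetenv("DEMO_KEY_UNSET")

still_missing <- missing_env_vars(c("DEMO_KEY_SET", "DEMO_KEY_UNSET"))
if (length(still_missing) > 0) {
  message("Set these in .Renviron and restart R: ",
          paste(still_missing, collapse = ", "))
}
```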