Merged
1 change: 1 addition & 0 deletions .Rbuildignore
Original file line number Diff line number Diff line change
@@ -7,3 +7,4 @@
^.github$
^\.github$
^vignettes$
^.claude
165 changes: 165 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,165 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What this package is

`openalexConvert` is an R package that converts an OpenAlex parquet/Arrow corpus
(as produced by the sibling package [openalexPro](https://github.com/rkrug/openalexPro))
into bibliography formats. It is part of the `openalexPro` ecosystem (alongside
`openalexSnowball`) and is distributed via r-universe (`https://rkrug.r-universe.dev`),
not CRAN. Status is alpha (see the startup message in `R/zzz.R`).

## Commands

Run all commands from the package root; they use R's standard tooling.

```r
# Run the full test suite
devtools::test()

# Run a single test file
testthat::test_file("tests/testthat/test-001-corpus_csl_pandoc.R")

# Regenerate man/*.Rd and NAMESPACE from roxygen2 comments (required after
# changing @export tags or @param docs — NAMESPACE is generated, never hand-edit)
devtools::document()

# Full R CMD check
devtools::check()

# Render the pkgdown site locally
pkgdown::build_site()
```

CI (`.github/workflows/`) runs `R-CMD-check` on macOS/Windows/Ubuntu ×
devel/release/oldrel-1, plus `test-coverage` and `pkgdown`. Because `openalexPro`
lives on r-universe, `Additional_repositories` in `DESCRIPTION` must list that
r-universe repository; otherwise dependency resolution fails in CI.

## Architecture

The package is a thin two-stage pipeline. Each exported function lives in its own
self-contained file in `R/`.

1. **Corpus → CSL JSON** ([R/corpus_to_csljson.R](R/corpus_to_csljson.R)):
`corpus_to_csljson()` reads an Arrow Dataset/Table/data.frame, registers it
with DuckDB (`duckdb_register_arrow`), and processes it in chunks of
`chunk_size` rows via `LIMIT/OFFSET`. Output is one `chunk_N.json` file per
chunk, each a CSL JSON array. This is the memory-efficient core — large
corpora never load fully into R.

2. **CSL JSON → everything else** ([R/csljson_convert_pandoc.R](R/csljson_convert_pandoc.R)):
`csljson_convert_pandoc()` shells out to Pandoc (via `rmarkdown::pandoc_convert`)
to produce bibtex, biblatex, docx, markdown, latex, html, or pdf. It dispatches
on whether the input is a directory of chunks or a single file, crossed with
whether `to` is a bibliography format (`bibtex`/`biblatex`, direct conversion)
or a formatted-document format (renders via citeproc with a generated
`nocite: "@*"` markdown stub).
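
The chunking in stage 1 can be pictured with a small base-R sketch. It only builds the per-chunk query strings; the helper name `chunk_queries()`, the `SELECT *` projection, and the 1-based chunk numbering are assumptions for illustration, and the package's real SQL comes from `.build_select_sql()`.

```r
# Hypothetical helper: one LIMIT/OFFSET query per chunk of `chunk_size` rows.
# The table name `src` follows the text above; everything else is illustrative.
chunk_queries <- function(n_rows, chunk_size, table = "src") {
  if (n_rows <= 0) {
    return(character(0))
  }
  offsets <- seq(0, n_rows - 1, by = chunk_size)
  queries <- sprintf(
    "SELECT * FROM %s LIMIT %d OFFSET %d",
    table,
    as.integer(chunk_size),
    as.integer(offsets)
  )
  # Assumed naming: chunk files are numbered from 1.
  names(queries) <- sprintf("chunk_%d.json", seq_along(queries))
  queries
}

queries <- chunk_queries(n_rows = 2500, chunk_size = 1000)
print(queries)
```

Only one chunk's rows are materialized in R at a time under this scheme, which is what keeps memory use bounded for large corpora.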

`corpus_export_via_pandoc()` ([R/corpus_export_via_pandoc.R](R/corpus_export_via_pandoc.R))
is a convenience wrapper chaining stage 1 → stage 2 through a temp dir.

`csljson_to_zotero_upload()` ([R/csljson_to_zotero_upload.R](R/csljson_to_zotero_upload.R))
is independent of the above: it POSTs CSL JSON files to the Zotero Web API
(`/groups/{id}/items`) using `httr2`, with `progressr` progress. API key comes
from `ZOTERO_API_KEY`.

### Key design points in the corpus→CSL mapping

#### Schema resilience (`.build_select_sql`)

This is the central concern and the function most likely to need edits. OpenAlex
parquet schemas vary across openalexPro versions and across record sources, so the
mapping never assumes a column exists. The function:

- Receives the live column set as `cols`; the caller (`corpus_to_csljson`) reads it
via `SELECT * FROM src LIMIT 0`. Every field expression is chosen with a `has(name)` guard.
- Falls back to a typed null (`CAST(NULL AS VARCHAR/INTEGER/BOOLEAN)`) for missing
scalar columns, or an empty list `[]` for missing list columns (issns, authors,
orcids, keywords). The downstream R code relies on these placeholders existing,
so the `SELECT` always emits the *same column names* regardless of input schema.
- Handles **two venue layouts** that coexist in the wild: the legacy `host_venue`
struct and the newer `primary_location.source` struct. When both are present it
`COALESCE`s them (host_venue first); when only one is present it uses that. This
applies to venue name, venue type, publisher, `issn_l`, and `issn`. If you touch
venue handling, keep all five expressions consistent.
- Uses DuckDB struct/list navigation directly in SQL: dotted paths like
`primary_location.source.display_name`, and `list_transform(authorships, x -> ...)`
to pluck author display names and ORCIDs into parallel arrays.
- `url` is a `COALESCE` priority chain: `doi_url` → `open_access.oa_url` →
`primary_location.landing_page_url` → `id`.
- ISBN is deliberately hard-coded to `CAST(NULL AS VARCHAR)` — the nested path is
unreliable, so ISBN is effectively disabled at the SQL layer (see the comment).
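
As a rough illustration of the guard pattern described in the list above, the following sketch builds the venue-name expression for the three possible schemas. It is not the package's code: `build_venue_name_sql()`, the alias `venue_name`, and the exact struct paths are assumptions.

```r
# Hypothetical re-creation of the has()-guard pattern for a single field.
build_venue_name_sql <- function(cols) {
  has <- function(name) name %in% cols
  legacy <- "host_venue.display_name"                 # assumed legacy path
  modern <- "primary_location.source.display_name"    # path named in the text

  expr <- if (has("host_venue") && has("primary_location")) {
    # Both layouts present: prefer host_venue, fall back to primary_location.
    sprintf("COALESCE(%s, %s)", legacy, modern)
  } else if (has("host_venue")) {
    legacy
  } else if (has("primary_location")) {
    modern
  } else {
    # Column missing entirely: emit a typed null so the alias still exists.
    "CAST(NULL AS VARCHAR)"
  }
  paste(expr, "AS venue_name")
}

build_venue_name_sql(c("id", "host_venue", "primary_location"))
build_venue_name_sql(c("id", "title"))
```

Whatever the input schema, the expression always ends in the same alias, so downstream R code can read the column unconditionally.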

**To add a mapped field you must touch two places**, and they share assumptions:
add the `has()`-guarded expression with an `AS <name>` alias in `.build_select_sql()`,
then read `rec$<name>` in `.map_record_to_csl()`. List-typed columns come back as
list-columns, so they are accessed as `rec$<name>[[1]]` (note the double bracket).
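
The double-bracket access can be seen on a one-row data frame with a list-column. This is a minimal sketch; the column name `keywords` is only an example and the real result rows may look different.

```r
# One-row result with a list-column, similar in shape to a SQL result row.
rec <- data.frame(title = "Example work")
rec$keywords <- list(c("ecology", "networks"))  # example column name

wrapped <- rec$keywords        # a list of length 1 wrapping the vector
values  <- rec$keywords[[1]]   # the character vector itself

str(wrapped)
str(values)
```

Using single brackets or no brackets here would hand a list to code that expects a plain vector.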

#### Row → CSL item (`.map_record_to_csl`)

Converts one SQL result row to a CSL item list. Notable behaviors:

- `%||%` is a custom coalesce defined here (not rlang's): it treats `NULL`,
length-0, `NA`, **and empty string** all as missing. Used pervasively for scalars.
- Authors: names are split by `.split_name()`, which handles both `"Family, Given"`
and space-separated names (taking the last token as the family name). ORCIDs ride
along in a parallel array, matched by index.
- `issued` date-parts: prefers `publication_date` (split on `-` into year/month/day),
falling back to `year` alone.
- An aggregated `note` field encodes OA status and citation count
(`OA:true; OA_status:gold; Citations:42`) since CSL has no native fields for these.
- Keywords (from `concepts`) are collapsed to a single `"; "`-joined string.
- ISBN is only emitted for `book`/`chapter`/`report` types — but since the SQL
always nulls ISBN, this branch is currently dead unless the SQL is changed too.
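
Two of the behaviors listed above, the custom coalesce and the `issued` date-parts fallback, might look roughly like this. It is a sketch under stated assumptions: `issued_parts()` is a hypothetical helper name, and the package's own `%||%` may differ in detail.

```r
# Coalesce that treats NULL, length-0, NA, and "" as missing (sketch).
`%||%` <- function(x, y) {
  if (is.null(x) || length(x) == 0L) {
    return(y)
  }
  if (length(x) == 1L && (is.na(x) || identical(x, ""))) {
    return(y)
  }
  x
}

# Hypothetical helper: prefer the full publication date, fall back to the year.
issued_parts <- function(publication_date, year) {
  date <- publication_date %||% NULL
  if (!is.null(date)) {
    parts <- as.integer(strsplit(date, "-", fixed = TRUE)[[1]])
    return(list(`date-parts` = list(as.list(parts))))
  }
  yr <- year %||% NULL
  if (is.null(yr)) {
    return(NULL)
  }
  list(`date-parts` = list(list(as.integer(yr))))
}

str(issued_parts("2020-05-17", 2020))
str(issued_parts(NA, 2019))
```

Treating the empty string as missing matters because absent string fields do not always arrive as `NA`.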

#### CSL type inference (`.infer_csl_type`)

Maps OpenAlex `type`/venue-type strings to CSL types via regex (e.g.
`journal-article` → `article-journal`, `book-chapter` → `chapter`,
`posted-content`/`preprint` → `manuscript`). It then applies signal-based overrides:
presence of ISBN nudges toward `book`, ISSN toward `article-journal`, and a
container + volume/issue toward `article-journal`. Default when nothing matches is
`article-journal`.
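
A much-reduced sketch of this inference is shown below. It covers only the mappings named above, and it assumes that the ISBN/ISSN signals are consulted only when the type string gave no match; the real `.infer_csl_type()` may order these steps differently and handles more cases.

```r
# Hypothetical, simplified type inference (not the package's implementation).
infer_csl_type <- function(type, has_isbn = FALSE, has_issn = FALSE) {
  type <- if (is.null(type) || is.na(type)) "" else tolower(type)

  if (grepl("journal-article", type, fixed = TRUE)) return("article-journal")
  if (grepl("book-chapter", type, fixed = TRUE)) return("chapter")
  if (grepl("posted-content|preprint", type)) return("manuscript")

  # Assumption: signals act as tie-breakers when nothing matched.
  if (has_isbn) return("book")
  if (has_issn) return("article-journal")

  # Default when nothing matches.
  "article-journal"
}

infer_csl_type("journal-article")
infer_csl_type("preprint")
infer_csl_type(NA, has_isbn = TRUE)
```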

#### Sanitization (`.sanitize_csl_item`)

Runs recursively over the assembled item before serialization: drops `NULL` and
scalar `NA`, normalizes character data to UTF-8 (`iconv`), strips control chars and
collapses whitespace, caps `abstract` at **700 chars**, and cleans author sublists
(removes `NA`/`"NA"` ORCIDs, blanks `NA` given/family). The 700-char abstract cap
here is separate from the 10000-char cap in the Pandoc stage (below).
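
The recursive clean-up can be sketched in base R as follows. This is an illustrative simplification: `sanitize_item()` is a hypothetical name, and the author-specific rules (such as the `"NA"` string ORCIDs) are left out.

```r
# Hypothetical, simplified sanitizer; `max_abstract` mirrors the 700-char cap.
sanitize_item <- function(x, max_abstract = 700L) {
  clean_chr <- function(s) {
    s <- iconv(s, to = "UTF-8", sub = "")      # normalize encoding
    s <- gsub("[[:cntrl:]]", " ", s)           # strip control characters
    trimws(gsub("[[:space:]]+", " ", s))       # collapse whitespace
  }
  walk <- function(node) {
    if (is.list(node)) {
      node <- lapply(node, walk)
      return(Filter(Negate(is.null), node))    # drop NULL children
    }
    if (length(node) == 1L && is.na(node)) return(NULL)  # drop scalar NA
    if (is.character(node)) return(clean_chr(node))
    node
  }
  out <- walk(x)
  if (is.character(out$abstract)) {
    out$abstract <- substr(out$abstract, 1L, max_abstract)
  }
  out
}

item <- list(
  title = "  A\ttitle \n",
  volume = NA,
  abstract = strrep("x", 800),
  author = list(list(family = "Doe", ORCID = NA))
)
clean <- sanitize_item(item)
str(clean)
```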

#### DOI normalization (`.normalize_doi`)

Delegates to `openalexPro::extract_doi(..., normalize = TRUE, what = "doi")` to get
a bare DOI, with a regex strip of the resolver prefix as fallback if that call
errors. Do **not** reimplement DOI parsing here — a past bug (see `NEWS.md`) came
from a local regex that missed lowercase suffixes; the fix was to delegate.

#### Pandoc abstract guard (`.normalize_json_for_pandoc`)

Before bibtex/biblatex/formatted conversion, this re-serializes the JSON (for
consistent encoding) and, for directory/chunk conversion, drops any abstract longer
than **10000 chars** — long abstracts can stall pandoc/LaTeX. Single-file bibtex
conversion calls it with `drop_long_abstracts = FALSE`. The returned `$sanitized`
flag is surfaced in the verbose `[sanitized]` log marker.
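
The abstract guard could be pictured like this, operating on an already-parsed list of CSL items. It is a sketch: `drop_long_abstracts()` as a function name and the return shape are assumptions modeled on the `$sanitized` flag mentioned above, and the JSON re-serialization step is omitted.

```r
# Hypothetical sketch: drop abstracts above `max_chars` and report whether
# anything was changed.
drop_long_abstracts <- function(items, max_chars = 10000L) {
  sanitized <- FALSE
  items <- lapply(items, function(item) {
    abs <- item$abstract
    too_long <- is.character(abs) && length(abs) == 1L &&
      !is.na(abs) && nchar(abs) > max_chars
    if (too_long) {
      item$abstract <- NULL   # remove the field entirely
      sanitized <<- TRUE
    }
    item
  })
  list(items = items, sanitized = sanitized)
}

items <- list(
  list(id = "a", abstract = strrep("x", 10001)),
  list(id = "b", abstract = "short")
)
res <- drop_long_abstracts(items)
res$sanitized
```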

## Tests

`tests/testthat/test-001-corpus_csl_pandoc.R` runs the full pipeline against
parquet fixtures in `tests/fixtures/corpus/` and compares output against golden
fixtures (`corpus_csl/`, `corpus_bibtex/`, `corpus_biblatex/`, `corpus_docs/`).

Important: exact-text comparison of Pandoc output is **Pandoc-version sensitive**,
so those assertions are guarded by `skip_on_ci()`. JSON-structure comparison (the
stage-1 output) is exact and always runs. Most tests are skipped when Pandoc is
unavailable (`skip_if_not(rmarkdown::pandoc_available())`).
If you change the CSL mapping, the golden JSON fixtures in `tests/fixtures/corpus_csl/`
must be regenerated to match.

## Docs / vignettes

Vignettes are Quarto (`.qmd`, `VignetteBuilder: quarto`) in `vignettes/`. The
`vignettes/` dir is in `.Rbuildignore` (excluded from the tarball to avoid R CMD
check warnings). `pkgnet_report.qmd` is a generated package-network report.
6 changes: 3 additions & 3 deletions DESCRIPTION
@@ -8,8 +8,8 @@ Description: Utilities to convert an OpenAlex parquet/Arrow corpus into
other formats. Implemented are at the moment CSL JSON, BibTeX, BibLaTeX,
Markdown, LaTeX, HTML, or PDF via Pandoc. Uses DuckDB over Arrow for
efficient chunked CSL JSON conversion.
URL: https://github.com/rkrug/openalexConvert, https://rkrug.github.io/openalexConvert/
BugReports: https://github.com/rkrug/openalexConvert/issues
URL: https://github.com/openalexPro/openalexConvert, https://openalexpro.github.io/openalexConvert/, https://doi.org/10.5281/zenodo.20448988
BugReports: https://github.com/openalexPro/openalexConvert/issues
License: GPL (>= 2)
Depends:
R (>= 4.1.0)
@@ -26,7 +26,7 @@ Suggests:
knitr,
quarto,
testthat (>= 3.0.0)
Additional_repositories: https://rkrug.r-universe.dev
Additional_repositories: https://openalexpro.r-universe.dev
Encoding: UTF-8
RoxygenNote: 7.3.3
VignetteBuilder: quarto
2 changes: 2 additions & 0 deletions NAMESPACE
@@ -15,5 +15,7 @@ importFrom(httr2,req_perform)
importFrom(httr2,request)
importFrom(httr2,resp_body_string)
importFrom(httr2,resp_status)
importFrom(jsonlite,read_json)
importFrom(jsonlite,write_json)
importFrom(progressr,progressor)
importFrom(progressr,with_progress)
37 changes: 31 additions & 6 deletions R/corpus_export_via_pandoc.R
@@ -8,9 +8,12 @@
#' @param to Target format passed to Pandoc (e.g., `"bibtex"`, `"biblatex"`).
#' @param csl_tmp Optional path for a temporary CSL JSON directory. If `NULL`, a
#' temporary directory is used and removed afterwards.
#' @param ... Additional arguments passed to `corpus_to_csljson()` (e.g., `chunk_size`).
#' @param ... Additional arguments passed to `corpus_to_csljson()`
#' (e.g., `chunk_size`).
#'
#' @return Invisibly returns `normalizePath(output)`.
#' @return Invisibly returns the normalized path to the created file.
#'
#' @importFrom jsonlite read_json write_json
#'
#' @export
corpus_export_via_pandoc <- function(
@@ -23,17 +26,39 @@ corpus_export_via_pandoc <- function(
to <- match.arg(to)
remove_tmp <- FALSE
if (is.null(csl_tmp)) {
# `corpus_to_csljson()` creates the directory itself and errors if it
# already exists, so only reserve the path here.
csl_tmp <- tempfile(pattern = "csljson_")
dir.create(csl_tmp, recursive = TRUE, showWarnings = FALSE)
remove_tmp <- TRUE
}
corpus_to_csljson(corpus, csl_tmp, ...)
corpus_to_csljson(corpus = corpus, output = csl_tmp, ...)
on.exit(
if (remove_tmp) {
try(unlink(csl_tmp, recursive = TRUE, force = TRUE), silent = TRUE)
},
add = TRUE
)
csljson_convert_pandoc(csl_tmp, output, to = to)
invisible(normalizePath(output))

# Merge the chunked CSL JSON into a single array before conversion so that
# `output` is a single file (e.g. `corpus.bib`) rather than a directory of
# per-chunk files (which is what passing the directory to
# `csljson_convert_pandoc()` would produce).
chunk_files <- sort(list.files(
csl_tmp,
pattern = "^chunk_\\d+\\.json$",
full.names = TRUE
))
if (!length(chunk_files)) {
stop("No CSL JSON chunks were produced from `corpus`.")
}
items <- unlist(
lapply(chunk_files, jsonlite::read_json),
recursive = FALSE
)
combined <- tempfile(fileext = ".json")
on.exit(try(unlink(combined, force = TRUE), silent = TRUE), add = TRUE)
jsonlite::write_json(items, combined, auto_unbox = TRUE, pretty = FALSE)

out_path <- csljson_convert_pandoc(combined, output, to = to)
invisible(out_path)
}
7 changes: 6 additions & 1 deletion R/corpus_to_csljson.R
@@ -427,7 +427,12 @@ corpus_to_csljson <- function(
}
doi_raw <- as.character(doi_raw)
tryCatch(
openalexPro::extract_doi(doi_raw, non_doi_value = "", normalize = TRUE, what = "doi"),
openalexPro::extract_doi(
doi_raw,
non_doi_value = "",
normalize = TRUE,
what = "doi"
),
error = function(e) sub("^(?i)https?://(dx\\.)?doi\\.org/", "", doi_raw)
)
}