8 changes: 5 additions & 3 deletions DESCRIPTION
@@ -1,7 +1,7 @@
Package: cd
Title: Climate Departure Analysis from ERA5-Land Reanalysis
-Version: 0.3.2
-Date: 2026-06-06
+Version: 0.4.0
+Date: 2026-06-25
Authors@R: c(
person("Allan", "Irvine", , "al@newgraphenvironment.com", role = c("aut", "cre"),
comment = c(ORCID = "0000-0002-3495-2128")),
@@ -35,9 +35,11 @@ Suggests:
rmarkdown,
testthat (>= 3.0.0),
tidyterra,
+withr,
zyp
Config/testthat/edition: 3
-Imports:
+Imports:
+curl,
dplyr,
jsonlite,
rappdirs,
1 change: 1 addition & 0 deletions NAMESPACE
@@ -4,6 +4,7 @@ export(cd_aggregate)
export(cd_anomaly)
export(cd_baseline)
export(cd_cache_clear)
export(cd_cache_fetch)
export(cd_cache_info)
export(cd_cache_path)
export(cd_catalog)
4 changes: 4 additions & 0 deletions NEWS.md
@@ -1,3 +1,7 @@
# cd 0.4.0 (2026-06-25)

* On-disk caching wired into the consumer read path, so repeated extractions, report renders, and vignette rebuilds pull each COG from S3 **once** and read locally thereafter — turning the dominant recurring S3 egress driver into a one-time cost. New exported `cd_cache_fetch()` downloads a remote http(s) COG to the cd cache (keyed by URL hash, with a sidecar `.meta` recording the S3 ETag and size), validates freshness with a cheap HTTP HEAD (ETag, falling back to Content-Length), and serves the local copy on a hit. Downloads are size-validated and atomically renamed so a truncated file is never served; a failed HEAD with a cached copy present serves the cache, and `options(cd.cache_revalidate = FALSE)` skips revalidation entirely for offline work. `cd_crop()` and `cd_extract()` gain `cache = TRUE` (default), threading remote reads through the cache while local paths pass through unchanged. Live S3 confirmation: a repeat read drops from a full-COG download (megabytes) to a ~1 KB HEAD (or zero network with revalidation off). Adds `curl` to Imports. See the new README "Caching" section, which also documents the GDAL `/vsicurl/` env-var stopgap. ([#76](https://github.com/NewGraphEnvironment/cd/pull/76))
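
A minimal base-R sketch of the "size-validated and atomically renamed" download step described in the entry above. `atomic_fetch()` and its `fetch_fun` argument are hypothetical names for illustration, not part of cd, and a local file stands in for the remote COG so the example runs offline.

```r
# Hypothetical helper (not cd's API): write to a temp file in the target
# directory, check the byte count, then rename into place.
atomic_fetch <- function(src, dest, expected_size = NA_real_,
                         fetch_fun = function(from, to) {
                           file.copy(from, to, overwrite = TRUE)
                         }) {
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  # Same directory as `dest`, so the rename stays on one filesystem.
  tmp <- tempfile(tmpdir = dirname(dest))
  on.exit(if (file.exists(tmp)) unlink(tmp), add = TRUE)

  fetch_fun(src, tmp)

  got <- file.size(tmp)
  if (!is.na(expected_size) && (is.na(got) || got != expected_size)) {
    stop("incomplete download: ", got, " of ", expected_size, " bytes",
         call. = FALSE)
  }
  if (!file.rename(tmp, dest)) {
    stop("could not move the download into place", call. = FALSE)
  }
  dest
}

# A 10-byte local file stands in for the remote COG.
src <- tempfile(fileext = ".tif")
writeBin(as.raw(1:10), src)
cache_dir <- file.path(tempdir(), "cd-cache-demo")

dest <- atomic_fetch(src, file.path(cache_dir, "demo.tif"), expected_size = 10)
ok_size <- file.size(dest)

# A size mismatch aborts before the rename, so no partial file appears.
bad_dest <- file.path(cache_dir, "bad.tif")
res <- tryCatch(
  atomic_fetch(src, bad_dest, expected_size = 999),
  error = function(e) "error"
)
```

Because the temp file is removed on exit and the rename only happens after the size check, a reader of the cache directory sees either the complete file or nothing.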

# cd 0.3.2 (2026-06-06)

* Both regional vignettes (kootenay-lake, peace-fwcp) rewritten for new readers: plainer-language opener for the snowpack section ("In BC, most of the year's runoff starts as winter snow…" instead of the "hinge of BC hydrology" metaphor), Trends / Recent-Decade / bias-notes preambles compressed and de-jargoned, Annual snowpack signals intro reduced to a 3-bullet plain-language list, salmonid Interpretation closer tightened to one paragraph with three bold knock-on effects. Figure trim: cut `plot-tmean` (covered by `facet-tmean`), `plot-dtr` (asymmetry numbers already in prose), and `snow-rate-peak` (not load-bearing); fold `plot-tmax` + `plot-tmin` into one 2-panel faceted `plot-tmaxmin`, and `snow-swe-max` + `snow-doy-50` + `snow-fraction` into one 3-panel faceted `snow-annual` (free y-scales). Net per vignette: 3 fewer standalone figures, same coverage. Bibliography: dropped `kouki_etal2023` and `yue_wang2002` (no longer cited); union now 15/15. ([#75](https://github.com/NewGraphEnvironment/cd/pull/75))
153 changes: 153 additions & 0 deletions R/cd_cache_fetch.R
@@ -0,0 +1,153 @@
#' Fetch a remote COG through the on-disk cache
#'
#' Given a remote `href` (http/https), downloads the file once to the cd
#' cache directory and returns a local path; subsequent calls read the
#' local copy instead of re-pulling from the network. Freshness is
#' checked with a cheap HTTP HEAD request (comparing the S3 ETag), so a
#' monthly catalog republish is picked up automatically while repeat
#' builds do near-zero egress. Local paths — and non-http URLs such as
#' `s3://`, which GDAL reads directly — are returned unchanged.
#'
#' @param href Character. Path or URL to a COG.
#' @param refresh Logical. If `TRUE`, force a re-download even when a
#' valid cached copy exists. Default `FALSE`.
#' @param cache_dir Character. Override the cache location. If `NULL`,
#' uses [cd_cache_path()].
#'
#' @details
#' Freshness uses the ETag when the server provides one, falling back to
#' the `Content-Length` size when it does not. A host that returns
#' neither validator cannot be proven fresh, so the file is re-downloaded
#' on each call (safe, but un-cached) — S3, the default host, always
#' returns both. Revalidation can be disabled for a fully-offline fast
#' path with `options(cd.cache_revalidate = FALSE)`, which serves any
#' existing cached copy without an HTTP HEAD. When the HEAD fails (e.g.
#' offline) but a cached copy exists, the cached copy is served with a
#' message. Downloads are written to a temporary file, validated against
#' the advertised `Content-Length`, then atomically renamed, so a
#' truncated download is never served as complete.
#'
#' @return Character path to the local (cached) file, or `href`
#' unchanged for local / non-http inputs.
#'
#' @examples
#' # Local files pass through untouched:
#' f <- system.file("extdata", "example_climate.tif", package = "cd")
#' identical(cd_cache_fetch(f), f)
#'
#' @export
cd_cache_fetch <- function(href, refresh = FALSE, cache_dir = NULL) {
  if (length(href) != 1L || is.na(href) || !cd_is_remote(href)) {
    return(href)
  }

  dir <- cd_cache_path(cache_dir)
  ext <- tools::file_ext(href)
  key <- rlang::hash(href)
  fname <- if (nzchar(ext)) paste0(key, ".", ext) else key
  local_path <- file.path(dir, fname)
  meta_path <- paste0(local_path, ".meta")

  have_local <- file.exists(local_path) && file.exists(meta_path)
  revalidate <- isTRUE(getOption("cd.cache_revalidate", default = TRUE))

  # Offline fast path: trust an existing cache without a HEAD request.
  if (have_local && !refresh && !revalidate) {
    return(local_path)
  }

  head <- cd_remote_head(href)

  # HEAD failed (offline / server error): serve a cached copy if present.
  if (is.null(head)) {
    if (have_local && !refresh) {
      rlang::inform(
        paste0("cd_cache_fetch: could not reach '", href,
               "'; serving cached copy.")
      )
      return(local_path)
    }
    stop("cd_cache_fetch: failed to reach '", href,
         "' and no cached copy is available.", call. = FALSE)
  }

  # Valid cache: serve local, no download.
  if (have_local && !refresh) {
    meta <- jsonlite::read_json(meta_path)
    if (cd_cache_valid(head, meta)) {
      return(local_path)
    }
  }

  # Download to a temp file, validate size, atomic rename, write meta.
  tmp <- tempfile(tmpdir = dir, fileext = if (nzchar(ext)) paste0(".", ext) else "")
  on.exit(if (file.exists(tmp)) unlink(tmp), add = TRUE)
  cd_remote_download(href, tmp)

  if (!is.null(head$size) && !is.na(head$size)) {
    got <- file.size(tmp)
    if (is.na(got) || got != head$size) {
      stop("cd_cache_fetch: incomplete download of '", href, "' (",
           got, " of ", head$size, " bytes).", call. = FALSE)
    }
  }

  if (!file.rename(tmp, local_path)) {
    stop("cd_cache_fetch: failed to move the download into the cache for '",
         href, "'.", call. = FALSE)
  }
  jsonlite::write_json(
    list(url = href, etag = head$etag, size = head$size,
         downloaded_at = format(Sys.time(), "%Y-%m-%dT%H:%M:%S%z")),
    meta_path, auto_unbox = TRUE
  )
  local_path
}

#' Is an href a cacheable remote (http/https) URL?
#' @noRd
cd_is_remote <- function(href) {
  grepl("^https?://", href)
}

#' Is a cached copy still valid against fresh HEAD metadata?
#'
#' Prefers the ETag; falls back to Content-Length size when the server
#' (or the stored meta) carries no ETag, so ETag-less hosts still get a
#' cache hit instead of re-downloading on every call.
#' @noRd
cd_cache_valid <- function(head, meta) {
  if (!is.null(head$etag) && !is.null(meta$etag)) {
    return(identical(head$etag, meta$etag))
  }
  if (!is.null(head$size) && !is.na(head$size) && !is.null(meta$size)) {
    return(isTRUE(as.numeric(meta$size) == head$size))
  }
  FALSE
}

#' HTTP HEAD a remote COG; return its ETag and size, or NULL on failure.
#' @noRd
cd_remote_head <- function(href) {
  handle <- curl::new_handle(nobody = TRUE)
  res <- tryCatch(
    curl::curl_fetch_memory(href, handle = handle),
    error = function(e) NULL
  )
  if (is.null(res) || res$status_code >= 400) {
    return(NULL)
  }
  hdrs <- curl::parse_headers_list(res$headers)
  etag <- hdrs[["etag"]]
  cl <- hdrs[["content-length"]]
  list(
    etag = if (!is.null(etag)) gsub('"', "", etag) else NULL,
    size = if (!is.null(cl)) as.numeric(cl) else NA_real_
  )
}

#' Download a remote COG to destfile (binary).
#' @noRd
cd_remote_download <- function(href, destfile) {
  curl::curl_download(href, destfile, mode = "wb")
}
8 changes: 7 additions & 1 deletion R/cd_crop.R
@@ -6,6 +6,9 @@
#'
#' @param href Character. Path or URL to a COG or raster file.
#' @param aoi An `sf` or `SpatVector` polygon to crop to.
#' @param cache Logical. If `TRUE` (default), route remote http(s) hrefs
#' through the on-disk cache via [cd_cache_fetch()] so repeated reads
#' pull from S3 once instead of every call. Local paths are unaffected.
#'
#' @return A [terra::SpatRaster] cropped and masked to the AOI.
#'
@@ -19,7 +22,10 @@
#' r
#'
#' @export
-cd_crop <- function(href, aoi) {
+cd_crop <- function(href, aoi, cache = TRUE) {
+  if (isTRUE(cache)) {
+    href <- cd_cache_fetch(href)
+  }
   r <- terra::rast(href)
   if (inherits(aoi, "sf") || inherits(aoi, "sfc")) {
     aoi <- terra::vect(aoi)
9 changes: 7 additions & 2 deletions R/cd_extract.R
@@ -13,6 +13,10 @@
#' @param periods Character vector of periods to extract.
#' Defaults to all periods in `catalog`.
#' @param years Optional integer vector to filter specific years.
#' @param cache Logical. If `TRUE` (default), remote COGs are read
#' through the on-disk cache (see [cd_cache_fetch()]) so repeated
#' extractions and report rebuilds download each COG from S3 once
#' rather than on every call. Passed through to [cd_crop()].
#'
#' @return A tibble with columns:
#' \describe{
@@ -36,11 +40,12 @@
 cd_extract <- function(catalog, aoi,
                        variables = catalog$variable,
                        periods = catalog$period,
-                       years = NULL) {
+                       years = NULL,
+                       cache = TRUE) {
   rows <- catalog[catalog$variable %in% variables & catalog$period %in% periods, ]

   results <- lapply(seq_len(nrow(rows)), function(i) {
-    r <- cd_crop(rows$href[i], aoi)
+    r <- cd_crop(rows$href[i], aoi, cache = cache)
     means <- terra::global(r, fun = "mean", na.rm = TRUE)
     yr <- as.integer(names(r))

34 changes: 34 additions & 0 deletions README.md
@@ -41,6 +41,40 @@ cd_summary(trn)
cd_compare(ts, window_a = 1956:1960, window_b = 1951:1955)
```

## Caching

`cd_extract()` and `cd_crop()` cache each COG on first read, so repeated
extractions, report renders, and vignette rebuilds pull each file from S3
**once** and read locally thereafter — turning recurring S3 egress into a
one-time cost. Caching is on by default (`cache = TRUE`).

```r
# First call downloads; later calls read the local cache.
ts <- cd_extract(catalog, aoi) # cache = TRUE by default

cd_cache_info() # where the cache lives + size
cd_cache_clear() # wipe it
cd_extract(catalog, aoi, cache = FALSE) # bypass the cache for one call
```
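
As a rough illustration of the download-once behaviour, the sketch below uses a hypothetical `fetch_once()` helper and a fake downloader that counts its calls. None of these names are part of cd, the URL is a placeholder, and the cache is keyed by `basename()` for brevity, whereas cd keys by a hash of the URL.

```r
# Hypothetical stand-in for a remote download; counts how often it runs.
downloads <- 0L
fake_download <- function(url, destfile) {
  downloads <<- downloads + 1L
  writeLines(paste("contents of", url), destfile)
}

# Hypothetical helper: download on a miss, reuse the local file on a hit.
# Simplification: keyed by basename(); cd keys by a hash of the URL.
fetch_once <- function(url, cache_dir, download = fake_download) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  local_path <- file.path(cache_dir, basename(url))
  if (!file.exists(local_path)) {
    download(url, local_path)
  }
  local_path
}

cache_dir <- file.path(tempdir(), "fetch-once-demo")
unlink(cache_dir, recursive = TRUE)                 # start from an empty cache
url <- "https://example.com/cogs/tmean_annual.tif"  # placeholder URL

p1 <- fetch_once(url, cache_dir)  # miss: runs the downloader
p2 <- fetch_once(url, cache_dir)  # hit: reads the existing file
```

After both calls the downloader has run once and both calls return the same local path. This sketch skips freshness checks entirely.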

Freshness is checked with a cheap HTTP HEAD (S3 ETag), so the monthly
catalog republish is picked up automatically; `cd_cache_fetch(href,
refresh = TRUE)` forces a re-download. For a fully-offline session set
`options(cd.cache_revalidate = FALSE)` to serve cached copies without any
network call.
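
The rules in the paragraph above can be sketched as two small pure functions. `is_fresh()` and `cache_action()` are hypothetical names, not cd's API; they restate the order of checks (offline fast path, failed HEAD, freshness, download) under the assumption that `head` is `NULL` when the HEAD request fails.

```r
# Hypothetical restatement of the freshness rule: prefer the ETag,
# fall back to the Content-Length size, otherwise treat as stale.
is_fresh <- function(head, meta) {
  if (!is.null(head$etag) && !is.null(meta$etag)) {
    return(identical(head$etag, meta$etag))
  }
  if (!is.null(head$size) && !is.na(head$size) && !is.null(meta$size)) {
    return(isTRUE(as.numeric(meta$size) == head$size))
  }
  FALSE  # no validator available: freshness cannot be proven
}

# Hypothetical decision function: which action follows from the cache state.
cache_action <- function(have_local, head, meta = NULL,
                         refresh = FALSE, revalidate = TRUE) {
  usable <- have_local && !refresh
  if (usable && !revalidate) return("serve cache (no HEAD)")
  if (is.null(head)) {
    if (usable) return("serve cache (HEAD failed)")
    return("error")
  }
  if (usable && is_fresh(head, meta)) return("serve cache")
  "download"
}

meta <- list(etag = "abc", size = 1000)
same <- list(etag = "abc", size = 1000)

a1 <- cache_action(TRUE, same, meta)                                # unchanged
a2 <- cache_action(TRUE, list(etag = "xyz", size = 1000), meta)     # republished
a3 <- cache_action(TRUE, NULL, meta)                                # offline
a4 <- cache_action(TRUE, NULL, meta, revalidate = FALSE)            # no HEAD at all
a5 <- cache_action(TRUE, same, meta, refresh = TRUE)                # forced
a6 <- cache_action(FALSE, NULL)                                     # nothing to serve
a7 <- cache_action(TRUE, list(etag = NULL, size = 1000), list(size = 1000))
```

Keeping the decision separate from the network and file I/O makes each branch easy to check without touching S3.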

**Stopgap without the cache.** If you read COGs through GDAL directly
(e.g. raw `terra::rast("/vsicurl/...")` outside `cd_crop()`), you can cut
repeat egress within a session by enabling GDAL's `/vsicurl/` cache:

```r
Sys.setenv(VSI_CACHE = "TRUE", VSI_CACHE_SIZE = "100000000") # 100 MB
Sys.setenv(GDAL_HTTP_MAX_RETRY = "3", GDAL_HTTP_RETRY_DELAY = "1")
```

This only persists within one R session; the `cd_*` cache above persists
across sessions, which is what kills recurring report-dev egress.
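
If setting these variables for the whole session is undesirable, one option is to scope them to a single expression. The sketch below is a base-R illustration only: `with_vsi_cache()` is a hypothetical helper, not part of cd, and it assumes GDAL reads these settings from the environment when the file is opened, which is worth verifying on your own setup.

```r
# Hypothetical helper: set the GDAL /vsicurl/ variables for one expression,
# then restore whatever was there before.
with_vsi_cache <- function(expr, cache_bytes = 100000000) {
  vars <- c(
    VSI_CACHE = "TRUE",
    VSI_CACHE_SIZE = format(cache_bytes, scientific = FALSE),
    GDAL_HTTP_MAX_RETRY = "3",
    GDAL_HTTP_RETRY_DELAY = "1"
  )
  old <- Sys.getenv(names(vars), unset = NA, names = TRUE)
  on.exit({
    was_set <- !is.na(old)
    if (any(was_set)) do.call(Sys.setenv, as.list(old[was_set]))
    if (any(!was_set)) Sys.unsetenv(names(old)[!was_set])
  }, add = TRUE)

  do.call(Sys.setenv, as.list(vars))
  force(expr)  # evaluated lazily, so it runs with the variables set
}

# Demonstration without GDAL: the variable is visible inside, gone after.
Sys.unsetenv("VSI_CACHE")
inside <- with_vsi_cache(Sys.getenv("VSI_CACHE"))
after <- Sys.getenv("VSI_CACHE", unset = NA)
```

In practice the wrapped expression would be the raw GDAL read, for example a `terra::rast()` call on a `/vsicurl/` path.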

## Data

The producer pipeline fetches ERA5-Land hourly reanalysis from
49 changes: 49 additions & 0 deletions man/cd_cache_fetch.Rd

6 changes: 5 additions & 1 deletion man/cd_crop.Rd

8 changes: 7 additions & 1 deletion man/cd_extract.Rd
