feat: retry CLI download on transient GitHub failures #23

Merged
dangrondahl merged 2 commits into main from retry-on-transient-download-failures on Jun 8, 2026
@dangrondahl dangrondahl commented Jun 8, 2026

Contributor

Summary

  • Wrap the release-asset download (and latest-version resolution when used) in an exponential-backoff retry helper so a transient GitHub 504 / connection reset no longer fails the whole job
  • Full-jitter exponential backoff (5 retries, 2s base, 2x factor, 60s per-delay cap) to avoid lockstep retries from many parallel CI jobs after an outage clears
  • Run npm test in CI (it wasn't running before), and delete the orphaned workflows/test.yml that GitHub never picked up
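
A minimal sketch of what such a retry helper could look like, using the defaults listed above. The names `withRetry`, `fullJitterDelay`, and the option keys `sleep`, `random`, and `shouldRetry` are hypothetical and chosen for illustration; the action's real helper may be structured differently.

```javascript
// Hypothetical helper: names and option shapes are illustrative,
// not taken from the action's source.
const DEFAULTS = { retries: 5, baseDelayMs: 2000, factor: 2, maxDelayMs: 60000 };

// Full jitter: pick a delay uniformly below min(cap, base * factor^attempt).
function fullJitterDelay(attempt, opts, random = Math.random) {
  const ceiling = Math.min(opts.maxDelayMs, opts.baseDelayMs * opts.factor ** attempt);
  return Math.floor(random() * ceiling);
}

async function withRetry(fn, options = {}) {
  const opts = { ...DEFAULTS, ...options };
  // sleep and random are injectable so tests can run without real waiting.
  const sleep = opts.sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  const shouldRetry = opts.shouldRetry ?? (() => true);
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      // Give up once the retry budget is spent or the error is not retryable.
      if (attempt >= opts.retries || !shouldRetry(err)) throw err;
      await sleep(fullJitterDelay(attempt, opts, opts.random));
    }
  }
}
```

Under these assumptions the download call would be wrapped roughly as `await withRetry(() => tc.downloadTool(url))`, with `retries` counting additional attempts after the first one.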

Motivation

An internal Slack thread reported a consistent failure in cyber-dojo/saver CI with:

installing Kosli CLI from https://github.com/kosli-dev/cli/releases/download/v2.24.2/kosli_2.24.2_linux_amd64.tar.gz ...
Unexpected HTTP response: 504
Waiting 17 seconds before trying again
Unexpected HTTP response: 504
Waiting 19 seconds before trying again
##[error]Error: Unexpected HTTP response: 504

This was caused by a GitHub release-downloads incident. The reason only the Kosli setup step was affected (while every other action in the same workflow ran fine) is that this is the only step that fetches a release asset at runtime - every other action ships its code via the action-download API, which is a different GitHub subsystem.

@actions/tool-cache does some internal retries already (2 attempts visible in the log) but that wasn't enough to ride out the incident. This PR layers our own retry around it.

Design notes

  • Why full jitter? When a GitHub outage clears, every queued CI job retries at once. With un-jittered exponential backoff, all of them compute the same delays and hammer GitHub again in lockstep. Full jitter (delay drawn uniformly from [0, min(cap, base * factor^attempt)]) smears retries across the window. See the AWS blog post "Exponential Backoff And Jitter."
  • Why only latest is wrapped for resolveVersion? Non-latest is a pure local string return - no network, nothing to retry. The download itself is wrapped regardless of which version path was taken.
  • What's retried: HTTP 5xx, 429, 408, common Node network error codes (ECONNRESET, ETIMEDOUT, ENOTFOUND, EAI_AGAIN, ECONNREFUSED, EPIPE), and the "Unexpected HTTP response: 5xx" message shape that tc.downloadTool throws.
  • What's NOT retried: 4xx errors (e.g. a wrong version tag) - those fail fast.
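
The retried / not-retried split above could be expressed as a small classifier along these lines. This is a sketch under assumptions: the function names are hypothetical, and the error properties checked (`httpStatusCode`, `statusCode`, `status`, `code`) are guesses at where a status or Node error code might live on a thrown error, not a description of the merged code.

```javascript
// Hypothetical classifier; which properties carry the status is an assumption.
const TRANSIENT_CODES = new Set([
  "ECONNRESET", "ETIMEDOUT", "ENOTFOUND", "EAI_AGAIN", "ECONNREFUSED", "EPIPE",
]);

function isRetryableStatus(status) {
  // 408 (request timeout), 429 (rate limited), and any 5xx are retried.
  return status === 408 || status === 429 || (status >= 500 && status <= 599);
}

function isTransient(err) {
  if (!err) return false;
  // An explicit HTTP status decides the outcome, so other 4xx errors fail fast.
  const status = err.httpStatusCode ?? err.statusCode ?? err.status;
  if (typeof status === "number") return isRetryableStatus(status);
  // Node network errors expose a string code such as ECONNRESET.
  if (typeof err.code === "string" && TRANSIENT_CODES.has(err.code)) return true;
  // Fall back to the "Unexpected HTTP response: 5xx" message shape.
  return /Unexpected HTTP response: 5\d\d/.test(String(err.message ?? ""));
}
```

Keeping the code list in one `Set` makes it easy to tighten which network errors count as transient without touching the rest of the logic.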

Test plan

  • npm test - all 23 tests pass locally (10 new for retry, 13 existing)
  • CI green on this PR (now includes the unit test step)
  • End-to-end matrix (macos-latest, windows-latest, ubuntu-latest x 2.11.27, latest) still installs the CLI successfully

Wrap the release-asset download (and the latest-version resolution when
applicable) in an exponential-backoff retry helper so a transient GitHub
504 / connection reset no longer fails the whole job. Uses full-jitter
backoff to avoid lockstep retries from many CI jobs after an outage
clears. Defaults: 5 retries, 2s base, 2x factor, 60s per-delay cap.

Also wire npm test into CI (it was not running before) and remove the
orphaned workflows/test.yml that GitHub never picked up.
@dangrondahl dangrondahl requested a review from a team as a code owner June 8, 2026 08:42
mbevc1 commented Jun 8, 2026

Contributor

Looks good, but I wonder about the following: Compounding retry budget - tc.downloadTool already retries internally (the PR notes 2 visible attempts), so wrapping it multiplies attempts: your 5 × tool-cache's internal retries. Combined with full-jitter delays whose caps climb to [2s, 4s, 8s, 16s, 32s], a sustained outage means a single job can sit for well over a minute before failing - and there's no overall deadline. For CI that's probably acceptable, but it's worth a deliberate decision rather than an emergent one. Consider exposing retries/baseDelayMs as action inputs, or lowering the cap.

Marking ENOTFOUND / ECONNREFUSED as transient. These are often permanent misconfiguration (bad host, nothing listening) rather than transient. Retrying them just adds the full backoff delay to a job that was going to fail anyway. ECONNRESET/ETIMEDOUT/EAI_AGAIN are the more clearly transient ones.

Address review feedback on #23:

- Drop retries from 5 to 3 and maxDelayMs from 60s to 15s. The outer
  wrapper stacks on top of @actions/tool-cache's own internal retries
  (~3 attempts, 10-20s waits), so the original defaults pushed the
  worst-case time-to-fail past 5 minutes. New defaults add at most
  ~14s of jittered waits on top of tool-cache.

- Remove ENOTFOUND and ECONNREFUSED from the transient set. The
  download URL is hardcoded by this action, so those codes only fire
  on a hard GitHub outage where retrying just delays failure.
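
The arithmetic behind those numbers can be checked with a short sketch. It assumes the 2s base and 2x factor are unchanged, takes the inner tool-cache behaviour from the figures quoted above (about 3 attempts with waits of up to 20s), and ignores the time spent in the requests themselves. The function names are hypothetical.

```javascript
// Upper bound on total jittered wait: the sum of the per-attempt ceilings.
function maxJitterBudgetMs({ retries, baseDelayMs, factor, maxDelayMs }) {
  let total = 0;
  for (let attempt = 0; attempt < retries; attempt++) {
    total += Math.min(maxDelayMs, baseDelayMs * factor ** attempt);
  }
  return total;
}

// Assumed inner behaviour: up to 3 attempts per call, at most 20s between them.
function worstCaseWaitMs(outer, inner = { attempts: 3, maxWaitMs: 20000 }) {
  const innerWaitPerCall = (inner.attempts - 1) * inner.maxWaitMs;
  const outerCalls = outer.retries + 1; // first try plus every retry
  return outerCalls * innerWaitPerCall + maxJitterBudgetMs(outer);
}

const original = { retries: 5, baseDelayMs: 2000, factor: 2, maxDelayMs: 60000 };
const tuned = { retries: 3, baseDelayMs: 2000, factor: 2, maxDelayMs: 15000 };

console.log(maxJitterBudgetMs(original)); // 62000
console.log(maxJitterBudgetMs(tuned)); // 14000
console.log(worstCaseWaitMs(original)); // 302000
console.log(worstCaseWaitMs(tuned)); // 174000
```

With the tuned values the 15s cap never binds, because the three ceilings are 2s, 4s, and 8s; it only matters if the retry count is raised again.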
@dangrondahl

Contributor Author

Addressed by tuning the defaults rather than exposing them as inputs. I consider them to be kind of plumbing. :)

@dangrondahl dangrondahl merged commit 5601264 into main Jun 8, 2026
6 checks passed
@dangrondahl dangrondahl deleted the retry-on-transient-download-failures branch June 8, 2026 09:52