feat: retry CLI download on transient GitHub failures#23
Conversation
Wrap the release-asset download (and the latest-version resolution when applicable) in an exponential-backoff retry helper so a transient GitHub 504 / connection reset no longer fails the whole job. Uses full-jitter backoff to avoid lockstep retries from many CI jobs after an outage clears. Defaults: 5 retries, 2s base, 2x factor, 60s per-delay cap. Also wire npm test into CI (it was not running before) and remove the orphaned workflows/test.yml that GitHub never picked up.
|
Looks good, but I wonder about to following: Compounding retry budget - Marking |
Address review feedback on #23: - Drop retries from 5 to 3 and maxDelayMs from 60s to 15s. The outer wrapper stacks on top of @actions/tool-cache's own internal retries (~3 attempts, 10-20s waits), so the original defaults pushed the worst-case time-to-fail past 5 minutes. New defaults add at most ~14s of jittered waits on top of tool-cache. - Remove ENOTFOUND and ECONNREFUSED from the transient set. The download URL is hardcoded by this action, so those codes only fire on a hard GitHub outage where retrying just delays failure.
Addressed by tuning the defaults rather than exposing them as inputs. I consider them to be kind of plumbing. :) |
Summary
latest-version resolution when used) in an exponential-backoff retry helper so a transient GitHub 504 / connection reset no longer fails the whole jobnpm testin CI (it wasn't running before), and delete the orphanedworkflows/test.ymlthat GitHub never picked upMotivation
Internal Slack thread reported a consistent failure in
cyber-dojo/saverCI with:This was caused by a GitHub release-downloads incident. The reason only the Kosli setup step was affected (while every other action in the same workflow ran fine) is that this is the only step that fetches a release asset at runtime - every other action ships its code via the action-download API, which is a different GitHub subsystem.
@actions/tool-cachedoes some internal retries already (2 attempts visible in the log) but that wasn't enough to ride out the incident. This PR layers our own retry around it.Design notes
[0, min(cap, base * factor^attempt)]) smears retries across the window. See the AWS blog post "Exponential Backoff And Jitter."latestis wrapped forresolveVersion? Non-latestis a pure local string return - no network, nothing to retry. The download itself is wrapped regardless of which version path was taken.ECONNRESET,ETIMEDOUT,ENOTFOUND,EAI_AGAIN,ECONNREFUSED,EPIPE), and the"Unexpected HTTP response: 5xx"message shape thattc.downloadToolthrows.Test plan
npm test- all 23 tests pass locally (10 new for retry, 13 existing)macos-latest,windows-latest,ubuntu-latestx2.11.27,latest) still installs the CLI successfully