fix(cli): don't downgrade prebuilt providers on a transient registry failure#298
fix(cli): don't downgrade prebuilt providers on a transient registry failure#298X-Guardian wants to merge 2 commits into
Conversation
so0k
left a comment
There was a problem hiding this comment.
Improves reliability & correctness: it stops transient registry failures from being misclassified as “version absent,” avoids silent provider downgrades, and surfaces a detailed External error instead.
Concerns checked and resolved
-
Is
fetchWithRetrylogic already pre-existing in CLI/package-manager?
Resolved: no. Search only did not surface existing shared CLI/package-manager fetch retry helper to reuse. -
Are
promise-retry/@gar/promise-retryalready dependencies?
Resolved: they are present only as transitive dev/tooling dependencies, not as direct@cdktn/cli-coredependencies. They come from npm/lerna/pacote tooling paths such aslerna -> pacote,npm-registry-fetch,make-fetch-happen,@npmcli/git, andsigstore.
@cdktn/cli-coredirectly depends onnode-fetch, not these retry packages. So the localfetchWithRetryhelper earns its keep and does not duplicate an existing project-level dependency-manager abstraction. -
Is a local
delayhelper duplicating an existing CLI helper?
Resolved: no shared reusable CLI delay helper was found. There are isolatedsetTimeout/sleeppatterns elsewhere, but nothing obvious incli-corethat this code should import. The tiny local helper is acceptable. -
Does the retry logic already use exponential backoff?
Resolved: yes. It uses bounded exponential backoff:REGISTRY_PROBE_ATTEMPTS = 3 REGISTRY_PROBE_BACKOFF_MS = 250 delay = 250ms, then 500ms
That is appropriate for this CLI path.
-
Should
429 Retry-Afterbe honored?
Considered a non-issue for this PR. For an interactive CLI, failing fast with a clearExternalerror after bounded retries is acceptable — often preferable to silently blocking for5s/60s. Not honoringRetry-Afterdoes not reintroduce the correctness bug and is not worth a follow-up cleanup here. -
Do the tests provide meaningful coverage?
Resolved: yes. The tests exercise definitive success/absence and transient-failure paths across Maven, GitHub/Go, PyPI, and NuGet. They are behavior-focused and run in the normal test path.
Related issue
Fixes #297
Description
When
cdktn provider addresolves which prebuilt-provider version to install, it walks the npm candidate versions highest-to-lowest and installs the first thatisNpmVersionAvailablereports as present. The problem: for most languages a transient registry failure (network error, rate limit, 5xx) was indistinguishable from "this version doesn't exist", so the resolver skipped the present newest version and silently installed an older one — no error, no warning.The probes conflated the two failure modes, and the resolution loop's
catchswallowed anything that threw:response.json()succeeded,json.refwasundefined, and the probe returnedfalse. A throttled request read as "version absent".fetchwith no status check; a non-JSON error body maderesponse.json()throw.DependencyManager.getLanguageSpecificPackageVersionwrapped the body intry { … } catch { logger.info("… Skipping…") }, so any throw was logged atinfoand the loop moved on to an older version.This PR gives the probes a shared, deliberate transient-vs-definitive distinction and decides at the loop level what a transient failure should do:
package-manager.ts): a newfetchWithRetryclassifies transient failures — network errors and HTTP408/429/5xx— and retries them (3 attempts, short exponential backoff), while definitive responses (2xx,404, other4xx) return immediately. If the registry stays unreachable after all attempts, it throws anExternalerror rather than reporting the version as absent.isNpmVersionAvailablenow routes throughfetchWithRetryand keys off the HTTP status instead of inferring absence from a missing field. Go checks200vs404(fixing the rate-limit-reads-as-absent bug), Python and .NET check404explicitly, and Java (already probingrepo1.maven.orgdirectly) shares the same retry behaviour.catch(dependency-manager.ts): a registry-unreachable (External) error now propagates and aborts the resolution, so the CLI never silently downgrades on an infrastructure hiccup. A definitive "not available" (afalsereturn, no throw) still advances to the next-oldest candidate as before.Why abort rather than retry-forever or degrade silently: installing an older version than is actually available is a silent correctness problem that's hard to notice after the fact. Surfacing a clear "could not reach the registry" error lets the user retry against a recovered network and get the version they expect.
Checklist