Skip to content

fix(cli): don't downgrade prebuilt providers on a transient registry failure#298

Open
X-Guardian wants to merge 2 commits into
open-constructs:mainfrom
X-Guardian:fix/add-package-manager-retries
Open

fix(cli): don't downgrade prebuilt providers on a transient registry failure#298
X-Guardian wants to merge 2 commits into
open-constructs:mainfrom
X-Guardian:fix/add-package-manager-retries

Conversation

@X-Guardian

Copy link
Copy Markdown
Contributor

Related issue

Fixes #297

Description

When cdktn provider add resolves which prebuilt-provider version to install, it walks the npm candidate versions highest-to-lowest and installs the first that isNpmVersionAvailable reports as present. The problem: for most languages a transient registry failure (network error, rate limit, 5xx) was indistinguishable from "this version doesn't exist", so the resolver skipped the present newest version and silently installed an older one — no error, no warning.

The probes conflated the two failure modes, and the resolution loop's catch swallowed anything that threw:

  • Go — GitHub returns a JSON body on rate-limit / 5xx, so response.json() succeeded, json.ref was undefined, and the probe returned false. A throttled request read as "version absent".
  • Python (PyPI) and .NET (NuGet) — bare fetch with no status check; a non-JSON error body made response.json() throw.
  • The loop in DependencyManager.getLanguageSpecificPackageVersion wrapped the body in try { … } catch { logger.info("… Skipping…") }, so any throw was logged at info and the loop moved on to an older version.

This PR gives the probes a shared, deliberate transient-vs-definitive distinction and decides at the loop level what a transient failure should do:

  • Shared retry helper (package-manager.ts): a new fetchWithRetry classifies transient failures — network errors and HTTP 408 / 429 / 5xx — and retries them (3 attempts, short exponential backoff), while definitive responses (2xx, 404, other 4xx) return immediately. If the registry stays unreachable after all attempts, it throws an External error rather than reporting the version as absent.
  • Status-based probes: each isNpmVersionAvailable now routes through fetchWithRetry and keys off the HTTP status instead of inferring absence from a missing field. Go checks 200 vs 404 (fixing the rate-limit-reads-as-absent bug), Python and .NET check 404 explicitly, and Java (already probing repo1.maven.org directly) shares the same retry behaviour.
  • Narrowed the resolution loop's catch (dependency-manager.ts): a registry-unreachable (External) error now propagates and aborts the resolution, so the CLI never silently downgrades on an infrastructure hiccup. A definitive "not available" (a false return, no throw) still advances to the next-oldest candidate as before.

Why abort rather than retry-forever or degrade silently: installing an older version than is actually available is a silent correctness problem that's hard to notice after the fact. Surfacing a clear "could not reach the registry" error lets the user retry against a recovered network and get the version they expect.

Checklist

  • I have updated the PR title to match CDKTN's style guide
  • I have run the linter on my code locally
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation if applicable
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works if applicable
  • New and existing unit tests pass locally with my changes

@X-Guardian X-Guardian marked this pull request as ready for review July 2, 2026 17:54
@X-Guardian X-Guardian requested a review from a team as a code owner July 2, 2026 17:54

@so0k so0k left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improves reliability & correctness: it stops transient registry failures from being misclassified as “version absent,” avoids silent provider downgrades, and surfaces a detailed External error instead.

Concerns checked and resolved

  1. Is fetchWithRetry logic already pre-existing in CLI/package-manager?
    Resolved: no. Search only did not surface existing shared CLI/package-manager fetch retry helper to reuse.

  2. Are promise-retry / @gar/promise-retry already dependencies?
    Resolved: they are present only as transitive dev/tooling dependencies, not as direct @cdktn/cli-core dependencies. They come from npm/lerna/pacote tooling paths such as lerna -> pacote, npm-registry-fetch, make-fetch-happen, @npmcli/git, and sigstore.
    @cdktn/cli-core directly depends on node-fetch, not these retry packages. So the local fetchWithRetry helper earns its keep and does not duplicate an existing project-level dependency-manager abstraction.

  3. Is a local delay helper duplicating an existing CLI helper?
    Resolved: no shared reusable CLI delay helper was found. There are isolated setTimeout / sleep patterns elsewhere, but nothing obvious in cli-core that this code should import. The tiny local helper is acceptable.

  4. Does the retry logic already use exponential backoff?
    Resolved: yes. It uses bounded exponential backoff:

    REGISTRY_PROBE_ATTEMPTS = 3
    REGISTRY_PROBE_BACKOFF_MS = 250
    delay = 250ms, then 500ms

    That is appropriate for this CLI path.

  5. Should 429 Retry-After be honored?
    Considered a non-issue for this PR. For an interactive CLI, failing fast with a clear External error after bounded retries is acceptable — often preferable to silently blocking for 5s/60s. Not honoring Retry-After does not reintroduce the correctness bug and is not worth a follow-up cleanup here.

  6. Do the tests provide meaningful coverage?
    Resolved: yes. The tests exercise definitive success/absence and transient-failure paths across Maven, GitHub/Go, PyPI, and NuGet. They are behavior-focused and run in the normal test path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cli: prebuilt-provider version resolution silently downgrades on a transient registry failure

2 participants