Skip to content

Retry postinstall downloads on connect failure#3599

Open
gavande1 wants to merge 9 commits into
trunkfrom
stu-1749-add-retry-to-postinstall-wp-download
Open

Retry postinstall downloads on connect failure#3599
gavande1 wants to merge 9 commits into
trunkfrom
stu-1749-add-retry-to-postinstall-wp-download

Conversation

@gavande1
Copy link
Copy Markdown
Contributor

@gavande1 gavande1 commented May 22, 2026

Related issues

How AI was used in this PR

AI assisted with diagnosing the failure mode (undici's default 10s connect.timeout firing at the TCP/TLS handshake to wordpress.org, never reaching the body phase, and the script giving up after a single attempt) and drafted the implementation. I reviewed each change manually and ran lint + typecheck locally.

Proposed Changes

New shared libscripts/lib/with-retry.ts:

  • withRetry< T >( name, fn, options? ) — generic retry helper with exponential backoff (1s → 2s) and up to 3 attempts.
  • NonRetriableError — throw this from inside fn to opt an error out of retry.
  • throwForHttpStatus( context, status, statusText? ) — maps response status to retriable vs non-retriable: 5xx and 429 retry; other 4xx (404, 401, 403, ...) throw NonRetriableError so we fail fast.
  • sharedDispatcher — undici Agent with connect: { timeout: 10_000 } (matches Node's undici default; declared centrally so it can be tuned in one place if needed).

Per-attempt total connect budget is roughly 3 × 10s + (1s + 2s) ≈ 33s, which absorbs transient TLS handshake failures without bumping the per-attempt cap.

Adopted by scripts/download-wp-server-files.ts:

  • fetchWithRetry artifact downloader (WordPress, SQLite integration, WP-CLI, sqlite-command, phpMyAdmin) now wraps the fetch in withRetry.
  • fetchLatestGithubRelease is also wrapped in withRetry — the GitHub API call that resolves the sqlite-command URL no longer uses a 5s total timeout with no retries.

Adopted by scripts/download-language-packs.ts (sibling postinstall script that had its own near-identical retry helper):

  • Removed the local recursive fetchWithRetry, MAX_RETRIES, INITIAL_RETRY_DELAY_MS.
  • The translations API call and each per-locale zip download are now wrapped in withRetry individually, so JSON parsing and body streaming both benefit from retry.
  • Inherits the same 4xx-fail-fast behavior.

Other notes:

  • Error logging is single-sourced: the retry helper logs per-attempt failures (Attempt N/M failed ... Retrying in Xms); the script entry points already log the final thrown error before process.exit(1), so failures surface exactly once.
  • undici is imported but kept as a transitive dependency (already pulled in by @octokit/core, hono, mime-db, and others) rather than declared in package.json, to keep this change minimal. Happy to pin it explicitly in a follow-up if reviewers prefer.

Testing Instructions

  • Happy pathrm -rf wp-files && npx tsx ./scripts/download-wp-server-files.ts && npx tsx ./scripts/download-language-packs.ts should succeed and repopulate wp-files/.
  • Retry on connect timeout — point wordpress.org at a blackholed IP to force connect-phase failures:
    sudo sh -c 'echo "192.0.2.1 wordpress.org" >> /etc/hosts'
    npx tsx ./scripts/download-wp-server-files.ts
    sudo sed -i '' '/192.0.2.1 wordpress.org/d' /etc/hosts
    
    Expect three [wordpress] Attempt N/3 failed: ... Retrying in Xms... lines followed by the final thrown error and a non-zero exit. The same blackhole trick on api.wordpress.org exercises the language-packs path.
  • Fail-fast on 4xx — temporarily change a URL in either script to one that 404s and re-run. Expect a single attempt and an immediate Request failed with status code: 404 — no retries.
  • GitHub API retry path — blackholing api.github.com should produce three [github:automattic/wp-cli-sqlite-command] Attempt N/3 failed ... lines before failing.
  • npx eslint scripts/ — clean.
  • npx tsc -p scripts/tsconfig.json --noEmit — clean.

Pre-merge Checklist

  • Have you checked for TypeScript, React or other console errors?

Fixes STU-1749. The postinstall download script relied on undici's
default 10s connect timeout and gave up after a single attempt, so a
transient slow TLS handshake to wordpress.org (or any of the other
artifact hosts) failed the entire install. Wrap the fetch in a
fetchWithRetry helper that uses an undici Agent with
connect.timeout=15000 and retries up to 3 times with exponential
backoff.
@gavande1 gavande1 force-pushed the stu-1749-add-retry-to-postinstall-wp-download branch from 5007219 to 0db6c18 Compare May 22, 2026 09:32
@gavande1 gavande1 requested a review from Copilot May 22, 2026 09:41
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the postinstall WordPress/server-files download script by introducing an undici-backed fetchWithRetry wrapper with a longer connect timeout, reducing failures caused by slow TCP/TLS handshakes during artifact downloads.

Changes:

  • Add an undici Agent with a 15s connect timeout and a fetchWithRetry helper with exponential backoff retries.
  • Route artifact downloads in downloadFile through fetchWithRetry instead of a single fetch.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread scripts/download-wp-server-files.ts Outdated
Comment thread scripts/download-wp-server-files.ts Outdated
Comment thread scripts/download-wp-server-files.ts Outdated
Comment thread scripts/download-wp-server-files.ts Outdated
Comment thread scripts/download-wp-server-files.ts
gavande1 added 4 commits May 22, 2026 15:24
…dupe logs

- Don't retry 4xx responses other than 429 — those are non-transient
  and only waste time on retries.
- Type lastError as Error and normalize non-Error throws so the final
  re-throw preserves a real stack trace.
- Drop the in-helper console.error on the final attempt; downloadFiles
  already logs the thrown error once before exiting.
Extract a generic withRetry<T> helper backed by a NonRetriableError
class so any network call in this script can opt into the same retry
behavior. Refactor fetchWithRetry to use it and wrap
fetchLatestGithubRelease in withRetry so the GitHub API lookup
also benefits from the 15s connect timeout and 3-attempt backoff
instead of its previous 5s total timeout with no retries.
…b.com:Automattic/studio into stu-1749-add-retry-to-postinstall-wp-download
@gavande1 gavande1 requested a review from a team May 22, 2026 10:15
gavande1 added 4 commits May 22, 2026 15:57
Both download-wp-server-files.ts and download-language-packs.ts had
their own fetchWithRetry helpers with overlapping behavior. Extract
withRetry<T>, NonRetriableError, throwForHttpStatus, and a shared
undici Agent (15s connect timeout) into scripts/lib/with-retry.ts and
have both scripts import from there.

Also bring download-language-packs in line with the improvements
already in download-wp-server-files: fail fast on 4xx (other than 429)
and use the shared 15s connect timeout instead of undici's 10s default.
…b.com:Automattic/studio into stu-1749-add-retry-to-postinstall-wp-download
The retry loop (3 attempts with backoff) provides 30s+ of total connect
budget, which is enough to absorb the transient TLS handshake failures
reported in STU-1749 without bumping the per-attempt cap. Keep the
dispatcher and constant declared explicitly so the value can be tuned
in one place if needed later.
@gavande1 gavande1 changed the title Add retry and 15s connect timeout to postinstall WP download Retry postinstall downloads on connect failure May 22, 2026
@wpmobilebot
Copy link
Copy Markdown
Collaborator

📊 Performance Test Results

Comparing 6b9f9b0 vs trunk

app-size

Metric trunk 6b9f9b0 Diff Change
App Size (Mac) 1339.06 MB 1339.06 MB +0.00 MB ⚪ 0.0%

site-editor

Metric trunk 6b9f9b0 Diff Change
load 1776 ms 1717 ms 59 ms 🟢 -3.3%

site-startup

Metric trunk 6b9f9b0 Diff Change
siteCreation 9595 ms 9591 ms 4 ms ⚪ 0.0%
siteStartup 4920 ms 4921 ms +1 ms ⚪ 0.0%

Results are median values from multiple test runs.

Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants