Skip to content

Commit f977aa2

Browse files
authored
fix: retry transient clean-room install timeouts (#70)
1 parent c2ec7aa commit f977aa2

6 files changed

Lines changed: 108 additions & 13 deletions

File tree

AGENTS.md

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -183,8 +183,11 @@ Work in CortexPilot as a contract-first engineering agent:
183183
same patch; current examples include `.runtime-cache/test_output/ci/` and
184184
`configs/pip_audit_ignored_advisories.json`, plus the dashboard
185185
and desktop install-time ENOSPC recovery knobs plus the Docker daemon
186-
precheck retry knobs registered in `configs/env.registry.json`; current CI
187-
credential/evidence examples also include the upstream receipt refresh
186+
precheck retry knobs registered in `configs/env.registry.json`, and the
187+
bounded transient npm registry socket-timeout retries inside
188+
`scripts/install_dashboard_deps.sh` plus
189+
`scripts/install_desktop_deps.sh`; current CI credential/evidence examples
190+
also include the upstream receipt refresh
188191
fallback to `scripts/verify_upstream_slices.py --mode smoke` and the strict
189192
live-provider rule that resolves process env first and `~/.codex/config.toml`
190193
second while keeping dotenv and shell-export fallbacks disabled on mainline;

CLAUDE.md

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -108,8 +108,11 @@ This file mirrors the root AI entrypoint for tools that prefer `CLAUDE.md`.
108108
same patch; current examples include `.runtime-cache/test_output/ci/` and
109109
`configs/pip_audit_ignored_advisories.json`, plus the dashboard
110110
and desktop install-time ENOSPC recovery knobs plus the Docker daemon
111-
precheck retry knobs registered in `configs/env.registry.json`; current CI
112-
credential/evidence examples also include the upstream receipt refresh
111+
precheck retry knobs registered in `configs/env.registry.json`, and the
112+
bounded transient npm registry socket-timeout retries inside
113+
`scripts/install_dashboard_deps.sh` plus
114+
`scripts/install_desktop_deps.sh`; current CI credential/evidence examples
115+
also include the upstream receipt refresh
113116
fallback to `scripts/verify_upstream_slices.py --mode smoke` and the strict
114117
live-provider rule that resolves process env first and `~/.codex/config.toml`
115118
second while keeping dotenv and shell-export fallbacks disabled on mainline;

README.md

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -647,7 +647,9 @@ governance manifest refresh falls back to `scripts/verify_upstream_slices.py --m
647647
to regenerate the receipts instead of failing on missing files alone.
648648
Dashboard dependency installs now also carry an ENOSPC recovery branch that
649649
retries with a workspace-local pnpm store and the registered dashboard install
650-
env knobs when copy-heavy CI or local maintenance installs run out of disk.
650+
env knobs when copy-heavy CI or local maintenance installs run out of disk, and
651+
dashboard/desktop clean-room installs now retry bounded transient npm registry
652+
socket timeouts before they fail closed.
651653
Desktop dependency installs now mirror the same ENOSPC recovery strategy,
652654
including the registered desktop install env knobs that scope hardlink imports
653655
to the recovery attempt and move retry stores onto workspace-local temp roots.

docs/README.md

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -240,7 +240,10 @@ navigation set.
240240
same patch; the current examples are `.runtime-cache/test_output/ci/` and
241241
`configs/pip_audit_ignored_advisories.json`, plus the dashboard and desktop
242242
ENOSPC recovery knobs plus the Docker daemon precheck retry knobs registered
243-
in `configs/env.registry.json`; current CI contract changes also include the
243+
in `configs/env.registry.json`, together with the bounded transient npm
244+
registry socket-timeout retries inside
245+
`scripts/install_dashboard_deps.sh` / `scripts/install_desktop_deps.sh`;
246+
current CI contract changes also include the
244247
upstream receipt refresh fallback to `scripts/verify_upstream_slices.py --mode smoke`
245248
and the strict hosted-first live-provider rule that allows
246249
process env first and `~/.codex/config.toml` second while keeping dotenv and

scripts/install_dashboard_deps.sh

Lines changed: 41 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,8 @@ INSTALL_SHAMEFULLY_HOIST="${CORTEXPILOT_DASHBOARD_PNPM_SHAMEFULLY_HOIST:-1}"
1717
BASE_INSTALL_NODE_LINKER="$INSTALL_NODE_LINKER"
1818
BASE_INSTALL_PACKAGE_IMPORT_METHOD="$INSTALL_PACKAGE_IMPORT_METHOD"
1919
BASE_INSTALL_SHAMEFULLY_HOIST="$INSTALL_SHAMEFULLY_HOIST"
20+
NETWORK_RETRY_ATTEMPTS=2
21+
NETWORK_RETRY_SLEEP_SECONDS=5
2022
INSTALL_LOG="$ROOT_DIR/.runtime-cache/logs/runtime/deps_install/install_dashboard_deps.log"
2123
LOCK_DIR="${STATE_ROOT}/install-dashboard-deps.lock"
2224
LOCK_OWNER_FILE="$LOCK_DIR/owner"
@@ -36,6 +38,10 @@ install_log_has_partial_bin_warning() {
3638
log_contains "Failed to create bin at " && log_contains "ENOENT"
3739
}
3840

41+
install_log_has_socket_timeout() {
42+
log_contains "ERR_SOCKET_TIMEOUT" || log_contains "Socket timeout"
43+
}
44+
3945
print_install_log_tail() {
4046
local lines="${1:-80}"
4147
[[ -f "$INSTALL_LOG" ]] || return 0
@@ -225,6 +231,29 @@ run_install() {
225231
)
226232
}
227233

234+
run_install_with_network_retry() {
235+
local context="$1"
236+
local attempt=0
237+
while true; do
238+
if run_install; then
239+
return 0
240+
fi
241+
if ! install_log_has_socket_timeout; then
242+
return 1
243+
fi
244+
if (( attempt >= NETWORK_RETRY_ATTEMPTS )); then
245+
return 1
246+
fi
247+
attempt=$((attempt + 1))
248+
echo "⚠️ [install-dashboard-deps] ${context}; transient npm registry socket timeout detected; retrying install (${attempt}/${NETWORK_RETRY_ATTEMPTS})" >&2
249+
sleep "$NETWORK_RETRY_SLEEP_SECONDS"
250+
if ! reset_app_node_modules; then
251+
print_install_log_tail 80
252+
exit 1
253+
fi
254+
done
255+
}
256+
228257
reset_app_node_modules() {
229258
local target="$APP_DIR/node_modules"
230259
[[ -d "$target" ]] || return 0
@@ -306,10 +335,15 @@ recover_with_workspace_store() {
306335
print_install_log_tail 80
307336
exit 1
308337
fi
309-
if ! run_install; then
338+
if ! run_install_with_network_retry "workspace-local recovery install"; then
310339
INSTALL_PACKAGE_IMPORT_METHOD="$previous_import_method"
311340
INSTALL_NODE_LINKER="$previous_node_linker"
312341
INSTALL_SHAMEFULLY_HOIST="$previous_shamefully_hoist"
342+
if install_log_has_socket_timeout; then
343+
echo "❌ [install-dashboard-deps] workspace-local recovery exhausted transient npm registry retries; tail follows" >&2
344+
print_install_log_tail 80
345+
exit 1
346+
fi
313347
echo "❌ [install-dashboard-deps] pnpm install failed after workspace-local recovery; tail follows" >&2
314348
print_install_log_tail 80
315349
exit 1
@@ -319,18 +353,22 @@ recover_with_workspace_store() {
319353
INSTALL_SHAMEFULLY_HOIST="$previous_shamefully_hoist"
320354
}
321355

322-
if ! run_install; then
356+
if ! run_install_with_network_retry "initial install"; then
323357
if log_contains "ERR_PNPM_ENOENT"; then
324358
recover_with_fresh_store "detected pnpm store ENOENT"
325359
elif log_contains "ERR_PNPM_ENOSPC" || log_contains "no space left on device"; then
326360
recover_with_workspace_store "detected pnpm ENOSPC"
361+
elif install_log_has_socket_timeout; then
362+
echo "❌ [install-dashboard-deps] pnpm install failed after transient npm registry retries; tail follows" >&2
363+
print_install_log_tail 80
364+
exit 1
327365
elif log_contains "ERR_PNPM_ENOTDIR"; then
328366
echo "⚠️ [install-dashboard-deps] detected app-local node_modules ENOTDIR; resetting dashboard node_modules and retrying once" >&2
329367
if ! reset_app_node_modules; then
330368
print_install_log_tail 80
331369
exit 1
332370
fi
333-
if ! run_install; then
371+
if ! run_install_with_network_retry "node_modules reset retry"; then
334372
echo "❌ [install-dashboard-deps] pnpm install failed after node_modules reset; tail follows" >&2
335373
print_install_log_tail 80
336374
exit 1

scripts/install_desktop_deps.sh

Lines changed: 50 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,8 @@ INSTALL_PACKAGE_IMPORT_METHOD="${CORTEXPILOT_DESKTOP_PNPM_IMPORT_METHOD:-copy}"
1212
INSTALL_SHAMEFULLY_HOIST="${CORTEXPILOT_DESKTOP_PNPM_SHAMEFULLY_HOIST:-1}"
1313
BASE_INSTALL_NODE_LINKER="$INSTALL_NODE_LINKER"
1414
BASE_INSTALL_SHAMEFULLY_HOIST="$INSTALL_SHAMEFULLY_HOIST"
15+
NETWORK_RETRY_ATTEMPTS=2
16+
NETWORK_RETRY_SLEEP_SECONDS=5
1517
INSTALL_LOG="$ROOT_DIR/.runtime-cache/logs/runtime/deps_install/install_desktop_deps.log"
1618
LOCK_DIR="$ROOT_DIR/.runtime-cache/cortexpilot/locks/install-desktop-deps.lock"
1719
LOCK_OWNER_FILE="$LOCK_DIR/owner"
@@ -159,6 +161,36 @@ run_install() {
159161
)
160162
}
161163

164+
install_log_has_socket_timeout() {
165+
[[ -f "$INSTALL_LOG" ]] || return 1
166+
local content=""
167+
content="$(<"$INSTALL_LOG")"
168+
[[ "$content" == *"ERR_SOCKET_TIMEOUT"* || "$content" == *"Socket timeout"* ]]
169+
}
170+
171+
run_install_with_network_retry() {
172+
local context="$1"
173+
local attempt=0
174+
while true; do
175+
if run_install; then
176+
return 0
177+
fi
178+
if ! install_log_has_socket_timeout; then
179+
return 1
180+
fi
181+
if (( attempt >= NETWORK_RETRY_ATTEMPTS )); then
182+
return 1
183+
fi
184+
attempt=$((attempt + 1))
185+
echo "⚠️ [install-desktop-deps] ${context}; transient npm registry socket timeout detected; retrying install (${attempt}/${NETWORK_RETRY_ATTEMPTS})" >&2
186+
sleep "$NETWORK_RETRY_SLEEP_SECONDS"
187+
if ! reset_app_node_modules; then
188+
tail -n 80 "$INSTALL_LOG" >&2 || true
189+
exit 1
190+
fi
191+
done
192+
}
193+
162194
verify_typescript_toolchain() {
163195
(
164196
cd "$APP_DIR"
@@ -175,12 +207,17 @@ recover_with_fresh_store() {
175207
tail -n 80 "$INSTALL_LOG" >&2 || true
176208
exit 1
177209
fi
178-
if ! run_install; then
210+
if ! run_install_with_network_retry "fresh-store recovery install"; then
179211
if grep -q "ERR_PNPM_ENOENT" "$INSTALL_LOG"; then
180212
echo "⚠️ [install-desktop-deps] fresh-store recovery hit another ERR_PNPM_ENOENT; escalating to workspace-local pnpm store recovery" >&2
181213
recover_with_workspace_store "fresh-store ERR_PNPM_ENOENT persisted after desktop retry"
182214
return 0
183215
fi
216+
if install_log_has_socket_timeout; then
217+
echo "❌ [install-desktop-deps] fresh-store recovery exhausted transient npm registry retries; tail follows" >&2
218+
tail -n 80 "$INSTALL_LOG" >&2 || true
219+
exit 1
220+
fi
184221
echo "❌ [install-desktop-deps] pnpm install failed after fresh-store recovery; tail follows" >&2
185222
tail -n 80 "$INSTALL_LOG" >&2 || true
186223
exit 1
@@ -206,10 +243,15 @@ recover_with_workspace_store() {
206243
tail -n 80 "$INSTALL_LOG" >&2 || true
207244
exit 1
208245
fi
209-
if ! run_install; then
246+
if ! run_install_with_network_retry "workspace-local recovery install"; then
210247
INSTALL_PACKAGE_IMPORT_METHOD="$previous_import_method"
211248
INSTALL_NODE_LINKER="$previous_node_linker"
212249
INSTALL_SHAMEFULLY_HOIST="$previous_shamefully_hoist"
250+
if install_log_has_socket_timeout; then
251+
echo "❌ [install-desktop-deps] workspace-local recovery exhausted transient npm registry retries; tail follows" >&2
252+
tail -n 80 "$INSTALL_LOG" >&2 || true
253+
exit 1
254+
fi
213255
echo "❌ [install-desktop-deps] pnpm install failed after workspace-local recovery; tail follows" >&2
214256
tail -n 80 "$INSTALL_LOG" >&2 || true
215257
exit 1
@@ -238,18 +280,22 @@ reset_app_node_modules() {
238280
return 1
239281
}
240282

241-
if ! run_install; then
283+
if ! run_install_with_network_retry "initial install"; then
242284
if grep -q "ERR_PNPM_ENOENT" "$INSTALL_LOG"; then
243285
recover_with_fresh_store "detected pnpm store ENOENT"
244286
elif grep -q "ERR_PNPM_ENOSPC" "$INSTALL_LOG" || grep -qi "no space left on device" "$INSTALL_LOG"; then
245287
recover_with_workspace_store "detected pnpm ENOSPC"
288+
elif install_log_has_socket_timeout; then
289+
echo "❌ [install-desktop-deps] pnpm install failed after transient npm registry retries; tail follows" >&2
290+
tail -n 80 "$INSTALL_LOG" >&2 || true
291+
exit 1
246292
elif grep -q "ERR_PNPM_ENOTDIR" "$INSTALL_LOG"; then
247293
echo "⚠️ [install-desktop-deps] detected app-local node_modules ENOTDIR; resetting desktop node_modules and retrying once" >&2
248294
if ! reset_app_node_modules; then
249295
tail -n 80 "$INSTALL_LOG" >&2 || true
250296
exit 1
251297
fi
252-
if ! run_install; then
298+
if ! run_install_with_network_retry "node_modules reset retry"; then
253299
echo "❌ [install-desktop-deps] pnpm install failed after node_modules reset; tail follows" >&2
254300
tail -n 80 "$INSTALL_LOG" >&2 || true
255301
exit 1

0 commit comments

Comments
 (0)