# OAuth monitor gives up on transient failures, leaving workloads dead

## Bug description

When an OAuth token-refresh attempt returns an error that
`isTransientNetworkError` classifies as transient — 5xx, 429, or 4xx
without an RFC 6749 `error` code, per the rule established by #5170 —
`pkg/auth/monitored_token_source.go` runs an in-loop short retry (5
attempts with exponential backoff, ~1–2 minutes at defaults, bounded
by `TOOLHIVE_TOKEN_REFRESH_MAX_TRIES` and
`TOOLHIVE_TOKEN_REFRESH_MAX_ELAPSED_TIME`). If the error persists past
that window, the monitor marks the workload `unauthenticated` and
exits its goroutine. **No further refresh is ever attempted**, even
after the underlying condition clears. The workload stays
`unauthenticated` until manual intervention (`thv restart`, `thv rm`
+ `thv run`, or similar).

The transient classification is correct: the response shape doesn't
carry a definitive "denied" verdict from the OAuth server (no RFC 6749
error code), so ToolHive can't conclude the credentials are bad. The
gap is at the next layer up: the in-loop retry window is too short
to cover realistic recovery time scales for the conditions that
produce these errors (see Additional context). The monitor should
keep trying on a longer cadence rather than giving up after ~1–2 minutes.
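
To make the classification rule concrete, here is a minimal sketch of it in Go. `isTransientResponse` and `rfc6749Error` are hypothetical names for illustration only; this is not the body of `isTransientNetworkError`, which may inspect the error differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// rfc6749Error mirrors the JSON error body from RFC 6749 section 5.2.
type rfc6749Error struct {
	Error string `json:"error"`
}

// isTransientResponse is a hypothetical stand-in for the rule described
// above: 5xx, 429, or 4xx without an RFC 6749 error code is transient.
func isTransientResponse(status int, body []byte) bool {
	if status >= 500 || status == http.StatusTooManyRequests {
		return true
	}
	if status >= 400 {
		var e rfc6749Error
		// A body that is not JSON (for example a WAF's HTML page) or that
		// has no "error" field carries no verdict from the OAuth server.
		if err := json.Unmarshal(body, &e); err != nil || e.Error == "" {
			return true
		}
		return false
	}
	return false
}

func main() {
	fmt.Println(isTransientResponse(403, []byte("<!DOCTYPE html>...")))
	fmt.Println(isTransientResponse(400, []byte(`{"error":"invalid_grant"}`)))
	fmt.Println(isTransientResponse(503, nil))
}
```

Under this sketch the 403+HTML response is transient, while `invalid_grant` is a definitive denial and is not.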

## Steps to reproduce

Reproduction requires either (a) a real OAuth endpoint behind a network
control point where traffic can be selectively dropped, or (b) the
naturally occurring real-world trigger described in Additional context
(e.g., a client-side VPN disconnect that routes requests through an
IP-allowlisted WAF or CDN).

1. Run an OAuth-backed remote workload with `thv run --remote-url ...`
   and let it complete the initial OAuth flow.
2. Wait until the cached access token is close to expiry (typically
   1 hour after issuance), when the monitor will attempt a refresh.
3. Just before refresh time, block traffic to the token endpoint
   (`pfctl` on macOS / `iptables` on Linux) for several minutes — long
   enough for the short retry to exhaust all 5 attempts (see Additional
   context for the default backoff schedule; ~3 minutes is comfortable).
4. Observe that the workload transitions to `unauthenticated`. Restore
   the network and observe that ToolHive does NOT attempt to recover,
   even after an arbitrarily long wait.

Note that both conditions for the bug must be met: (a) the failure
classifies as transient (5xx, 429, or 4xx without an RFC 6749 error
code; see #5170), and (b) the failure persists past the short-retry
window. Permanent OAuth failures (`invalid_grant`, `invalid_client`)
correctly stop the monitor and are not affected.

## Expected behavior

When a transient token-refresh failure exceeds the short-retry window,
the background monitor should keep attempting refresh on a longer
cadence until either the upstream recovers (→ `running`) or a
configurable ceiling is reached, at which point the workload is finally
marked `unauthenticated`. Workloads should not be permanently broken by
transient failures that resolve within a reasonable ceiling.

## Actual behavior

After the short-retry window exhausts on a still-transient error:

- The retry exhaustion branch in `Token()` (in the `if err != nil`
  block following `refresher.Refresh(...)` in
  `pkg/auth/monitored_token_source.go`) calls `markAsUnauthenticated`.
- `markAsUnauthenticated` writes `WorkloadStatusUnauthenticated` and
  closes the `stopMonitoring` channel.
- The monitor goroutine exits via the `monitorLoop`'s select on
  `stopMonitoring`.
- No further refresh is ever attempted by this workload's monitor.
- The workload remains `unauthenticated` indefinitely.

## Environment (if relevant)

- OS: macOS (also affects Linux; the relevant code path is
  platform-independent).
- ToolHive: current `main` (post-#5170 and post-#5044). The affected
  code paths in `pkg/auth/monitored_token_source.go::Token` and
  `::onTick` have had this shape since the short-retry layer was
  introduced in #4281.

## Additional context

**Real-world trigger:** the canonical scenario is a client-side
network-context change — disconnecting from a corporate VPN, putting
the laptop to sleep on one network and resuming on another, etc.
Token-refresh requests that previously traversed an IP-allowlisted
path now reach the OAuth server from a residential IP, where a WAF or
CDN consistently returns 403+HTML until the trusted path is restored.
The block isn't intermittent from the WAF's perspective — it's a
stable response to a different network origin — but from the
workload's perspective it's a transient failure window that resolves
on its own when the user reconnects.

In one such environment the bug surfaced every 1–3 days; each time the
underlying network state reverted on its own (e.g., morning VPN
reconnect), but the workload stayed `unauthenticated` until manual
recovery.

**Production error shape** (real recurrence):

```
oauth2: cannot fetch token: 403 Forbidden
Response: <!DOCTYPE html>...
```

This is a 4xx response without an RFC 6749 error code, which
`isTransientNetworkError` correctly classifies as transient after #5170.
The short retry exhausts via the 5-try cap (typically within ~1–2 minutes
given the default backoff: 10s initial interval, 1.5 multiplier, ±50%
randomization, 5 max tries), well before the 5-minute
`TOOLHIVE_TOKEN_REFRESH_MAX_ELAPSED_TIME` cap. The monitor exits at that
point.

**Affected code:**

- `pkg/auth/monitored_token_source.go::Token` — the `refresher.Refresh`
  exhaustion branch (the `if err != nil` block right after the
  `refresher.Refresh` call).
- `pkg/auth/monitored_token_source.go::onTick` — calls `Token()`, so
  inherits the same exit-on-exhaustion behavior. The monitor's only
  response to a transient refresh error that outlasts the short retry
  is to mark the workload `unauthenticated` and stop.
- `pkg/auth/monitored_token_source.go::markAsUnauthenticated` is the
  single-shot exit point both call into.

**Related PRs:**

- #5170 — 4xx-without-error-code classification (merged). Necessary
  precondition for this bug to manifest predictably; the short retry
  now correctly retries WAF-shaped responses, then exhausts via the
  5-try cap (~1–2 minutes default).
- #4513 — retry transient errors in the background monitor (merged).
  Refined the short-retry layer.
- #4281 — introduced the short-retry layer in the background monitor.
- #5044 — DCR Warn + `upstream`/`clientID` constructor context
  (merged). A fix here must continue to gate the DCR remediation Warn
  correctly (only on permanent errors, not on transient-ceiling
  give-ups).

