OAuth monitor gives up on transient failures, leaving workloads dead #5349

@gkatz2

Bug description

When an OAuth token-refresh attempt returns an error that
isTransientNetworkError classifies as transient (5xx, 429, or 4xx
without an RFC 6749 error code, per the rule established by #5170),
pkg/auth/monitored_token_source.go runs an in-loop short retry (5
attempts with exponential backoff, ~1–2 minutes at defaults, bounded
by TOOLHIVE_TOKEN_REFRESH_MAX_TRIES and
TOOLHIVE_TOKEN_REFRESH_MAX_ELAPSED_TIME). If the error persists past
that window, the monitor marks the workload unauthenticated and
exits its goroutine. No further refresh is ever attempted, even
after the underlying condition clears. The workload stays
unauthenticated until manual intervention (thv restart, thv rm +
thv run, or similar).

The transient classification is correct: the response shape doesn't
carry a definitive "denied" verdict from the OAuth server (no RFC 6749
error code), so ToolHive can't conclude the credentials are bad. The
gap is at the next layer up — the in-loop retry window is too short
to cover realistic recovery time scales for the conditions that
produce these errors (see Additional context). The monitor should
keep trying on a longer cadence, not give up after ~2 minutes.
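
For readers unfamiliar with the #5170 rule, here is a minimal sketch of the classification as described above. looksTransient and rfc6749Error are hypothetical names for illustration only; this is not ToolHive's isTransientNetworkError, whose signature and internals may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// rfc6749Error mirrors the JSON error body from RFC 6749 section 5.2.
type rfc6749Error struct {
	Error string `json:"error"`
}

// looksTransient is a hypothetical stand-in for the rule in this issue:
// 5xx, 429, or 4xx without an RFC 6749 error code counts as transient.
func looksTransient(status int, body []byte) bool {
	if status >= 500 || status == http.StatusTooManyRequests {
		return true
	}
	if status >= 400 {
		var e rfc6749Error
		// A body that is not JSON, or that has no "error" field, carries
		// no definitive verdict from the OAuth server.
		if err := json.Unmarshal(body, &e); err != nil || e.Error == "" {
			return true
		}
	}
	return false
}

func main() {
	// WAF/CDN block page: 4xx with an HTML body.
	fmt.Println(looksTransient(403, []byte("<!DOCTYPE html>...")))
	// Definitive denial from the OAuth server.
	fmt.Println(looksTransient(400, []byte(`{"error":"invalid_grant"}`)))
	// Upstream outage.
	fmt.Println(looksTransient(503, nil))
}
```

The 403+HTML case is the one that matters here: it is classified as transient, so what happens after the short retry window decides whether the workload ever recovers.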

Steps to reproduce

Reproduction requires either (a) a real OAuth endpoint behind a network
control point that can be selectively blocked, or (b) the naturally
occurring real-world trigger described in Additional context (e.g., a
client-side VPN disconnect that sends requests to an IP-allowlisted
WAF or CDN from a non-allowlisted address).

  1. Run an OAuth-backed remote workload with thv run --remote-url ...
    and let it complete the initial OAuth flow.
  2. Wait for the cached access token to expire (typically 1 hour) so the
    monitor will attempt a refresh.
  3. Just before refresh time, block traffic to the token endpoint
    (pfctl on macOS / iptables on Linux) for several minutes — long
    enough for the short retry to exhaust all 5 attempts (see Additional
    context for the default backoff schedule; ~3 minutes is comfortable).
  4. Observe that the workload transitions to unauthenticated. Restore the
    network and observe that ToolHive does NOT attempt to recover, even
    after waiting an arbitrarily long time.

Note that both conditions for the bug must be met: (a) the failure
classifies as transient (5xx, 429, or 4xx without an RFC 6749 error
code; see #5170), and (b) the failure persists past the short-retry
window. Permanent OAuth failures (invalid_grant, invalid_client)
correctly stop the monitor and are not affected.

Expected behavior

When a transient token-refresh failure exceeds the short-retry window,
the background monitor should keep attempting refresh on a longer
cadence until either the upstream recovers (→ running) or a
configurable ceiling is reached, at which point the workload is finally
marked unauthenticated. Workloads should not be permanently broken by
transient failures that resolve within a reasonable ceiling.
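
One possible shape for the expected behavior, as a rough sketch rather than a proposed patch. All names here (retryUntilCeiling, errTransient, simulate) and the interval and ceiling parameters are assumptions for illustration; in ToolHive the refresh callback would presumably wrap the existing short-retry refresh.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errTransient marks a failure the OAuth server did not definitively reject.
var errTransient = errors.New("transient refresh failure")

// retryUntilCeiling calls refresh every interval while it keeps failing with
// errTransient. It returns nil on recovery, a permanent error as soon as one
// is seen, or the last transient error once the ceiling would be exceeded.
func retryUntilCeiling(ctx context.Context, refresh func(context.Context) error,
	interval, ceiling time.Duration) error {
	deadline := time.Now().Add(ceiling)
	for {
		err := refresh(ctx)
		if !errors.Is(err, errTransient) {
			return err // recovered (nil) or permanent failure
		}
		if time.Now().Add(interval).After(deadline) {
			return err // ceiling reached: caller marks the workload unauthenticated
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
}

// simulate fails transiently `failures` times, then succeeds, and reports
// how many attempts were made and the final result.
func simulate(failures int, interval, ceiling time.Duration) (int, error) {
	attempts := 0
	refresh := func(context.Context) error {
		attempts++
		if attempts <= failures {
			return errTransient
		}
		return nil
	}
	err := retryUntilCeiling(context.Background(), refresh, interval, ceiling)
	return attempts, err
}

func main() {
	attempts, err := simulate(3, 10*time.Millisecond, time.Second)
	fmt.Println(attempts, err)
}
```

The key property is that only the ceiling (or a permanent error) ends the loop; a transient failure on its own never does.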

Actual behavior

After the short-retry window exhausts on a still-transient error:

  • The retry exhaustion branch in Token() (in the if err != nil
    block following refresher.Refresh(...) in
    pkg/auth/monitored_token_source.go) calls markAsUnauthenticated.
  • markAsUnauthenticated writes WorkloadStatusUnauthenticated and
    closes the stopMonitoring channel.
  • The monitor goroutine exits via the monitorLoop's select on
    stopMonitoring.
  • No further refresh is ever attempted by this workload's monitor.
  • The workload remains unauthenticated indefinitely.
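
The sequence above can be modelled in a few lines. This is a deliberately simplified model, not the code in pkg/auth/monitored_token_source.go: the monitor type, its fields, run, and simulateExit are hypothetical, and only the names markAsUnauthenticated and stopMonitoring are borrowed from this issue.

```go
package main

import (
	"errors"
	"fmt"
)

// monitor is a toy model of the background token monitor.
type monitor struct {
	status         string
	stopMonitoring chan struct{}
}

// markAsUnauthenticated records the status and closes the stop channel,
// which is what makes the exit single-shot.
func (m *monitor) markAsUnauthenticated() {
	m.status = "unauthenticated"
	close(m.stopMonitoring)
}

// run services up to `ticks` timer ticks and returns how many refreshes
// were attempted before the monitor stopped.
func (m *monitor) run(ticks int, refresh func() error) int {
	attempts := 0
	for i := 0; i < ticks; i++ {
		select {
		case <-m.stopMonitoring:
			return attempts // monitor has exited; later ticks are never serviced
		default:
		}
		attempts++
		if err := refresh(); err != nil {
			m.markAsUnauthenticated()
		}
	}
	return attempts
}

// simulateExit makes the refresh fail only on call number failOnCall and
// succeed on every other call, i.e. the upstream recovers right afterwards.
func simulateExit(ticks, failOnCall int) (int, string) {
	m := &monitor{status: "running", stopMonitoring: make(chan struct{})}
	calls := 0
	attempts := m.run(ticks, func() error {
		calls++
		if calls == failOnCall {
			return errors.New("transient failure outlasted the short retry")
		}
		return nil
	})
	return attempts, m.status
}

func main() {
	attempts, status := simulateExit(10, 2)
	fmt.Println(attempts, status)
}
```

Even though every call after the second would succeed in this model, the monitor never makes a third attempt, which matches the observed behavior.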

Environment (if relevant)

Additional context

Real-world trigger: the canonical scenario is a client-side
network-context change — disconnecting from a corporate VPN, putting
the laptop to sleep on one network and resuming on another, etc.
Token-refresh requests that previously traversed an IP-allowlisted
path now reach the OAuth server from a residential IP, where a WAF or
CDN consistently returns 403+HTML until the trusted path is restored.
The block isn't intermittent from the WAF's perspective — it's a
stable response to a different network origin — but from the
workload's perspective it's a transient failure window that resolves
on its own when the user reconnects.

In one such environment the bug surfaced every 1–3 days; each time the
underlying network state reverted on its own (e.g., morning VPN
reconnect), but the workload stayed unauthenticated until manual
recovery.

Production error shape (real recurrence):

oauth2: cannot fetch token: 403 Forbidden
Response: <!DOCTYPE html>...

This is a 4xx-without-RFC-6749-error-code response — correctly
classified as transient by isTransientNetworkError after #5170. The
short retry exhausts via the 5-try cap (typically within ~1–2 minutes
given the default backoff: 10s initial interval, 1.5 multiplier, ±50%
randomization, 5 max tries), well before the 5-minute
MAX_ELAPSED_TIME cap. The monitor exits at that point.
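
To sanity-check the ~1–2 minute figure, the arithmetic can be sketched as below. This assumes the jitter scales each wait by a factor between 0.5 and 1.5 and ignores the time spent in the requests themselves; totalWait is a hypothetical helper, not part of ToolHive.

```go
package main

import "fmt"

// totalWait sums the waits between tries for an exponential backoff whose
// waits start at initialSec and grow by multiplier, each scaled by factor
// (1.0 means no jitter).
func totalWait(initialSec, multiplier, factor float64, maxTries int) float64 {
	total, wait := 0.0, initialSec
	// maxTries attempts have maxTries-1 waits between them.
	for i := 0; i < maxTries-1; i++ {
		total += wait * factor
		wait *= multiplier
	}
	return total
}

func main() {
	for _, f := range []float64{0.5, 1.0, 1.5} {
		fmt.Printf("jitter factor %.1f: %.1fs\n", f, totalWait(10, 1.5, f, 5))
	}
}
```

With the defaults the nominal waits are 10s + 15s + 22.5s + 33.75s = 81.25s, so the window ranges from roughly 41s to 122s depending on jitter, which is well under the 5-minute elapsed-time cap.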

Affected code:

  • pkg/auth/monitored_token_source.go::Token — the refresher.Refresh
    exhaustion branch (the if err != nil block right after the
    refresher.Refresh call).
  • pkg/auth/monitored_token_source.go::onTick — calls Token(), so
    inherits the same exit-on-exhaustion behavior. The monitor's only
    response to a transient refresh error that outlasts the short retry
    is to mark the workload unauthenticated and stop.
  • pkg/auth/monitored_token_source.go::markAsUnauthenticated is the
    single-shot exit point both call into.
