From 368e27087cfb0de43df1d25a338b8329ae52a4ef Mon Sep 17 00:00:00 2001
From: Darren Janeczek
Date: Tue, 26 May 2026 08:29:26 -0400
Subject: [PATCH] fix(wait-for-grafana): lower startupTimeout default from 300s to 60s
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The two-phase polling introduced in v1.0.3 was based on the hypothesis
that Grafana startups on contested CI runners are slow-but-alive. Recent
evidence (see grafana/grafana#122993, mitigated in grafana/grafana#123034
and grafana/logs-drilldown#1886) shows that an important class of
Playwright matrix failures is actually a *crash* during provisioning:
specifically, a SQLITE_BUSY self-deadlock between the Grafana Advisor
checktyperegisterer and legacy provisioning, which prevents the HTTP
listener from ever binding.

In that scenario the v1.0.3 default of 300 seconds makes the failure 5x
slower to diagnose than v1.0.2 was: the action waits 5 minutes observing
only "Current status: 000" before timing out, during which the operator
has no signal at all. wait-for-grafana cannot distinguish "still booting"
from "dead before booting"; both produce the same ECONNREFUSED.

Lowering the default to 60 seconds restores the v1.0.2 fail-fast
behaviour for crash scenarios while keeping the two-phase split (Phase 1
TCP-bind vs Phase 2 health endpoint) available for repos that have a
genuine slow-start need and want to opt into a higher value explicitly.

A follow-up will propose a separate `grafana-startup-logs` sibling action
that dumps `docker compose ps` and a filtered tail of Grafana's own logs
(with secret re-masking and configurable redaction) when wait-for-grafana
fails. That is the right place to surface the real signal, rather than
waiting longer at this layer.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 wait-for-grafana/README.md           | 4 ++--
 wait-for-grafana/action.yml          | 4 ++--
 wait-for-grafana/wait-for-grafana.sh | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/wait-for-grafana/README.md b/wait-for-grafana/README.md
index 22f81bf7..77325378 100644
--- a/wait-for-grafana/README.md
+++ b/wait-for-grafana/README.md
@@ -22,11 +22,11 @@ The time to wait between each check, in seconds. Default is `0.5`.
 
 ### `startupTimeout` (optional)
 
-The maximum time to wait for the server's TCP port to bind, in seconds. Default is `300`.
+The maximum time to wait for the server's TCP port to bind, in seconds. Default is `60`.
 
 This covers the window between the container starting and Grafana's HTTP listener becoming active. During this phase the action polls every 5 seconds. Once the port responds (with any status other than `000`), normal health polling begins using the `timeout` and `interval` values above.
 
-On contested CI runners, Grafana's HTTP listener can take longer to bind than the default health-check window allows, regardless of Grafana version. Increasing this value gives the process more time to start without affecting the health-check phase.
+Raise this value only if Grafana genuinely takes longer than the default to start. A long default delays diagnosis of startup *crashes* (for example, a provisioning deadlock that prevents the HTTP listener from ever binding), since the action can only observe `000` and cannot distinguish "still booting" from "dead before booting." Pair this action with a separate diagnostic step that dumps container logs on failure to recover the real signal.
 
 ## How to use?
 
diff --git a/wait-for-grafana/action.yml b/wait-for-grafana/action.yml
index b742ccba..4e9cbc8d 100644
--- a/wait-for-grafana/action.yml
+++ b/wait-for-grafana/action.yml
@@ -18,9 +18,9 @@ inputs:
     required: true
     default: "0.5"
   startupTimeout:
-    description: "Seconds to wait for the TCP port to bind before health polling begins. On contested CI runners, Grafana's HTTP listener can take longer to start than the default health-check window allows."
+    description: "Seconds to wait for the TCP port to bind before health polling begins. Raise this only if Grafana genuinely takes longer than the default to start — a long default delays diagnosis of startup crashes (e.g. provisioning deadlocks), which never recover no matter how long we wait."
     required: true
-    default: "300"
+    default: "60"
 runs:
   using: "composite"
   steps:
diff --git a/wait-for-grafana/wait-for-grafana.sh b/wait-for-grafana/wait-for-grafana.sh
index 5da6bac9..7ae478e5 100755
--- a/wait-for-grafana/wait-for-grafana.sh
+++ b/wait-for-grafana/wait-for-grafana.sh
@@ -4,7 +4,7 @@ url="$1"
 expected_response_code="$2"
 timeout="$3"
 interval="$4"
-startup_timeout="${5:-300}"
+startup_timeout="${5:-60}"
 
 echo "Checking URL: $url"
 echo "Expected response code: $expected_response_code"
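
For reviewers who want a mental model of the two-phase wait that the README hunk describes, the following is a minimal sketch, not the actual contents of `wait-for-grafana.sh` (which this patch shows only in part). The function names `probe_status` and `wait_two_phase` and the optional sixth argument are hypothetical; the `000` convention, the 5-second Phase 1 polling, and the 60-second default are taken from the patch. It assumes `curl` is available and that both timeouts are whole seconds.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: NOT the contents of wait-for-grafana.sh.
# probe_status and wait_two_phase are hypothetical names.

# Print the HTTP status code for a URL, or 000 if no connection was made.
probe_status() {
  local code
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1" 2>/dev/null)"
  echo "${code:-000}"
}

# wait_two_phase URL EXPECTED TIMEOUT INTERVAL [STARTUP_TIMEOUT] [STARTUP_INTERVAL]
# Assumes TIMEOUT and STARTUP_TIMEOUT are whole seconds.
# Returns 0 when ready, 1 if the port never bound, 2 if health never matched.
wait_two_phase() {
  local url="$1" expected="$2" timeout="$3" interval="$4"
  local startup_timeout="${5:-60}" startup_interval="${6:-5}"
  local status deadline

  # Phase 1: wait for the TCP port to bind (any status other than 000).
  deadline=$(( $(date +%s) + startup_timeout ))
  status="$(probe_status "$url")"
  while [ "$status" = "000" ]; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "Port did not bind within ${startup_timeout}s" >&2
      return 1
    fi
    echo "Current status: $status"
    sleep "$startup_interval"
    status="$(probe_status "$url")"
  done

  # Phase 2: normal health polling until the expected status appears.
  deadline=$(( $(date +%s) + timeout ))
  while [ "$status" != "$expected" ]; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "Got $status, expected $expected after ${timeout}s" >&2
      return 2
    fi
    sleep "$interval"
    status="$(probe_status "$url")"
  done
  echo "Ready: $status"
}
```

A caller might invoke it as `wait_two_phase "$url" 200 60 0.5 60`. The distinct return codes are a design choice of this sketch: they let a later step tell "never bound" (the crash case this patch targets) apart from "bound but unhealthy", which a single timeout cannot do.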