test(hosting): fix flaky "Docker is not running" failures in pipeline tests #18126

Open
radical wants to merge 2 commits into microsoft:main from radical:ankj/fix-pipeline-tests-docker-preflight

radical (Member) commented Jun 11, 2026

Description

DistributedApplicationPipelineTests intermittently went red on windows-latest
CI (and passed on re-run) with:

    Aspire.Hosting.DistributedApplicationException: Docker is not running.
    Start Docker and try again.
       at ...DistributedApplicationPipeline... (check-container-runtime step)

One run failed 33 tests at once; this failure signature accounts for the
red-then-green CI results on the Hosting-4 (windows-latest) job.

Root cause: each test builds a publish pipeline with step: null, so
ExecuteAsync runs every default step, including check-container-runtime,
which resolves the real IContainerRuntimeResolver and shells out to
`docker container ls`. GitHub-hosted Windows runners don't reliably have a
Linux-container Docker daemon running, so the step throws; on re-run, when a
daemon happens to be up, the identical code passes. This is pure daemon-state
flakiness: these tests only validate pipeline ordering and never intend to
touch a container runtime.
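
The failure mode described above can be sketched in isolation. This is a minimal illustration, not Aspire's pipeline: SketchStep, SketchPipeline, the "publish" step, and the isDaemonRunning callback are hypothetical stand-ins, and only the step name check-container-runtime comes from the description. It assumes, per the root-cause note, that a null step selects every default step.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a pipeline step; not Aspire's actual type.
public sealed record SketchStep(string Name, Action Run);

public static class SketchPipeline
{
    // Default steps: a runtime preflight followed by a (hypothetical) step under test.
    public static IReadOnlyList<SketchStep> DefaultSteps(Func<bool> isDaemonRunning) => new[]
    {
        new SketchStep("check-container-runtime", () =>
        {
            // Stand-in for probing the daemon; the callback replaces the real probe.
            if (!isDaemonRunning())
            {
                throw new InvalidOperationException("Docker is not running.");
            }
        }),
        new SketchStep("publish", () => { }),
    };

    // step == null runs every default step; otherwise only the named step runs.
    public static List<string> Execute(IReadOnlyList<SketchStep> steps, string? step = null)
    {
        var executed = new List<string>();
        foreach (var s in steps.Where(candidate => step is null || candidate.Name == step))
        {
            s.Run();
            executed.Add(s.Name);
        }
        return executed;
    }
}
```

With isDaemonRunning returning false, Execute(steps) throws from the preflight while Execute(steps, "publish") completes, so the outcome depends on machine state rather than on the ordering logic under test.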

Fix: register the shared FakeContainerRuntime as IContainerRuntimeResolver
in each test, so the preflight resolves a fake that reports "running" instead of
probing a real daemon. The builder registers the real resolver with AddSingleton
(not TryAdd), so the later test registration wins. This mirrors the existing
pattern in DockerComposePublisherTests and ResourceContainerImageManagerTests.
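
The registration-order behavior this fix relies on can be checked on its own with Microsoft.Extensions.DependencyInjection. The sketch below is only an illustration: IRuntimeProbe, RealProbe, and FakeProbe are hypothetical stand-ins for the resolver types, and it assumes the Microsoft.Extensions.DependencyInjection package is referenced.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.DependencyInjection.Extensions;

// Hypothetical stand-ins; not the Aspire resolver types.
public interface IRuntimeProbe { bool IsRunning(); }

public sealed class RealProbe : IRuntimeProbe
{
    public bool IsRunning() => throw new InvalidOperationException("would probe a real daemon");
}

public sealed class FakeProbe : IRuntimeProbe
{
    public bool IsRunning() => true;
}

public static class RegistrationOrderDemo
{
    // Two AddSingleton calls for the same service type: resolving a single
    // instance returns the implementation registered last.
    public static string ResolveAfterAddSingleton()
    {
        var services = new ServiceCollection();
        services.AddSingleton<IRuntimeProbe, RealProbe>(); // earlier "builder" registration
        services.AddSingleton<IRuntimeProbe, FakeProbe>(); // later "test" registration
        using var provider = services.BuildServiceProvider();
        return provider.GetRequiredService<IRuntimeProbe>().GetType().Name;
    }

    // TryAddSingleton is a no-op when the service type is already registered,
    // so the earlier registration stays in effect.
    public static string ResolveAfterTryAdd()
    {
        var services = new ServiceCollection();
        services.AddSingleton<IRuntimeProbe, RealProbe>();
        services.TryAddSingleton<IRuntimeProbe, FakeProbe>();
        using var provider = services.BuildServiceProvider();
        return provider.GetRequiredService<IRuntimeProbe>().GetType().Name;
    }
}
```

In this sketch, ResolveAfterAddSingleton resolves the fake and ResolveAfterTryAdd resolves the real implementation, which is why the test registration has to be a plain AddSingleton made after the builder's own.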

Why the test-side fix: check-container-runtime is a legitimate default step;
the bug is that ordering-only unit tests run it against a real daemon. Faking the
resolver is the minimal, established pattern and touches no product code.

Call-outs:

  • Test-only; no product code changed.
  • Verified locally with Docker both stopped and running: with the fix the full
    class passes (77/77) in both states; reverting it makes the tests fail when the
    daemon is down and pass when it's up, reproducing the CI behavior exactly.
  • The same latent pattern exists in AddJavaScriptAppTests / AddViteAppTests
    (they run where Docker is present, so they aren't flaky today); left out to keep
    this PR tight.

Checklist

  • Is this feature complete?
    • Yes. Ready to ship.
  • Are you including unit tests for the changes and scenario tests if relevant?
    • Yes
  • Did you add public API?
    • No
  • Does the change make any security assumptions or guarantees?
    • No

DistributedApplicationPipelineTests intermittently failed on
windows-latest CI (then passed on re-run) with:

    Aspire.Hosting.DistributedApplicationException: Docker is not
    running. Start Docker and try again.

In one job this took out 33 tests at once.

Root cause: each test builds a publish pipeline with step: null, so
ExecuteAsync runs every default step, including check-container-runtime.
That step resolves the real IContainerRuntimeResolver and shells out to
`docker container ls`. On runners where the Docker daemon isn't up the
step throws; on re-run, when the daemon is up, the same code passes -- so
the failures are pure daemon-state flakiness, not a real test or product
bug. These tests only validate pipeline ordering and never intend to
touch a container runtime.

Fix: register the shared FakeContainerRuntime as IContainerRuntimeResolver
in each test. The preflight then resolves a fake that reports running,
making the tests independent of a real daemon. This mirrors the existing
pattern in DockerComposePublisherTests and ResourceContainerImageManagerTests.

Validated locally (no product code changed): with the fix the full class
passes whether or not Docker is running; reverting it makes the tests
fail when the daemon is down and pass when it is up -- reproducing the CI
behavior exactly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings June 11, 2026 20:36

github-actions Bot (Contributor) commented Jun 11, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 18126

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 18126"

radical marked this pull request as ready for review June 11, 2026 20:38

Copilot AI (Contributor) left a comment

Pull request overview

This PR fixes intermittent "Docker is not running" failures in DistributedApplicationPipelineTests on Windows CI by registering a FakeContainerRuntime as the IContainerRuntimeResolver in every test. The tests validate pipeline ordering/behavior and never intend to touch a real container runtime, so the check-container-runtime default step was failing on GitHub-hosted Windows runners where Docker isn't always available.

Changes:

  • Added #pragma warning disable ASPIRECONTAINERRUNTIME001 to suppress the experimental API warning for IContainerRuntimeResolver
  • Registered FakeContainerRuntime as IContainerRuntimeResolver in all 65 test methods that create a TestDistributedApplicationBuilder, using the established fake-registration pattern

Comment thread on tests/Aspire.Hosting.Tests/Pipelines/DistributedApplicationPipelineTests.cs (outdated)

Every test in DistributedApplicationPipelineTests repeated the same
builder construction and service registrations (test output helper,
FakeContainerRuntime as IContainerRuntimeResolver, and the activity
reporter). Extract this into a CreatePipelineTestBuilder helper so the
shared setup lives in one place.

Call sites collapse to a single line; the step, log level, and a
caller-supplied activity reporter are passed as helper arguments to
cover the few tests that diverge from the defaults.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
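
The shape of such a helper can be sketched as follows. This is not the CreatePipelineTestBuilder from the commit: every type and the CreatePipelineTestServices name are hypothetical stand-ins, the log level is modeled as a plain string, and the sketch returns a bare service collection rather than Aspire's test builder. It assumes the Microsoft.Extensions.DependencyInjection package is referenced.

```csharp
#nullable enable
using System;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical stand-ins for the services the real helper registers.
public interface IRuntimeResolverSketch { bool IsRunning { get; } }
public sealed class FakeRuntimeSketch : IRuntimeResolverSketch { public bool IsRunning => true; }
public interface IActivityReporterSketch { }
public sealed class DefaultReporterSketch : IActivityReporterSketch { }
public sealed record PipelineOptionsSketch(string? Step, string LogLevel);

public static class PipelineTestSetupSketch
{
    // Shared setup in one place; optional arguments cover the few tests that
    // diverge from the defaults (step, log level, caller-supplied reporter).
    public static IServiceCollection CreatePipelineTestServices(
        string? step = null,
        string logLevel = "Information",
        IActivityReporterSketch? activityReporter = null)
    {
        var services = new ServiceCollection();
        services.AddSingleton(new PipelineOptionsSketch(step, logLevel));
        // The fake always reports "running", so no real daemon is probed.
        services.AddSingleton<IRuntimeResolverSketch, FakeRuntimeSketch>();
        services.AddSingleton<IActivityReporterSketch>(activityReporter ?? new DefaultReporterSketch());
        return services;
    }
}
```

A call site that needs only the defaults then reduces to a single call such as PipelineTestSetupSketch.CreatePipelineTestServices(), while a divergent test passes just the argument it cares about.
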
radical requested review from adamint and davidfowl June 12, 2026 18:02