Skip to content

[WIP] test: e2e CI speedups#13904

Open
chandler-solo wants to merge 14 commits intokgateway-dev:mainfrom
chandler-solo:chandler/e2etimesaved_thanks_to_k3d
Open

[WIP] test: e2e CI speedups#13904
chandler-solo wants to merge 14 commits intokgateway-dev:mainfrom
chandler-solo:chandler/e2etimesaved_thanks_to_k3d

Conversation

@chandler-solo
Copy link
Copy Markdown
Contributor

@chandler-solo chandler-solo commented Apr 23, 2026

Description

e2e tests on CI are faster and centralize common build steps, making them less flaky.

Change Type

/kind flake

Changelog

NONE

Additional Notes

Signed-off-by: David L. Chandler <david.chandler@solo.io>
Signed-off-by: David L. Chandler <david.chandler@solo.io>
Signed-off-by: David L. Chandler <david.chandler@solo.io>
Signed-off-by: David L. Chandler <david.chandler@solo.io>
Signed-off-by: David L. Chandler <david.chandler@solo.io>
Copilot AI review requested due to automatic review settings April 23, 2026 01:03
@chandler-solo chandler-solo added the work in progress Indicates that a PR should not merge because it is a work in progress label Apr 23, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to speed up e2e CI by reusing prebuilt Docker images and improving Go caching, while adjusting e2e Helm chart resolution to avoid requiring packaged charts in _test/.

Changes:

  • Add a shared-image build job and artifact transfer to avoid rebuilding e2e images per shard.
  • Introduce Go build-cache priming/saving in the cache setup workflow and expose cache metadata from prep-go-runner.
  • Update e2e chart path selection to prefer packaged charts but fall back to in-repo unpackaged charts; add Makefile/script toggles to skip certain setup work.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
test/e2e/testutils/helper/install.go Prefer packaged charts in _test/ but fall back to install/helm/<chart>; add packaged-chart existence checks.
test/e2e/testutils/helper/install_test.go Add e2e-tagged unit tests covering packaged-vs-directory chart path selection.
hack/kind/setup-kind.sh Add LOAD_DOCKER_IMAGES path to load prebuilt images when SKIP_DOCKER=true; stop packaging charts.
hack/k3d/setup-k3d.sh Same as kind script for k3d.
Makefile Add shared-image build/save targets; add SKIP_EXTPROC_SERVER_SETUP gating for e2e-test; add k3d-load aggregate target.
.github/workflows/e2e.yaml Add image build job + artifact download/load in e2e shards; skip rebuilding images; load extproc-server separately.
.github/workflows/cache-setup.yaml Prime and persist Go build cache.
.github/actions/prep-go-runner/action.yaml Restore Go build cache and expose outputs (hit/key/path) for downstream workflows.

Comment thread hack/kind/setup-kind.sh
echo "SKIP_DOCKER=true, not building images or chart"
echo "SKIP_DOCKER=true, not building images"
if [[ $LOAD_DOCKER_IMAGES == 'true' ]]; then
VERSION=$VERSION CLUSTER_NAME=$CLUSTER_NAME make kind-load
Comment thread hack/k3d/setup-k3d.sh
echo "SKIP_DOCKER=true, not building images or chart"
echo "SKIP_DOCKER=true, not building images"
if [[ $LOAD_DOCKER_IMAGES == 'true' ]]; then
VERSION=$VERSION K3D_CLUSTER_NAME=$CLUSTER_NAME make k3d-load
Comment on lines +38 to +43
if fsutils.IsDirectory(testAssetDir) {
chartPath, err := getPackagedChartPath(testAssetDir, chartName)
if err == nil {
return chartPath, nil
}
logger.Info("falling back to unpackaged Helm chart", "chart", chartName, "assetDir", testAssetDir, "error", err)
Comment on lines +62 to +65
return "", fmt.Errorf("stat packaged chart: %w", err)
}
if info.IsDir() {
return "", fmt.Errorf("packaged chart path is a directory: %s", chartPath)
Comment thread .github/workflows/cache-setup.yaml
CGO_ENABLED=0 was true almost everywhere and I did not want to waste
time creating a cache for the exceptions so I stomped them out.

Signed-off-by: David L. Chandler <david.chandler@solo.io>

ci: standardize Go workflows on cgo off
Signed-off-by: David L. Chandler <david.chandler@solo.io>
Signed-off-by: David L. Chandler <david.chandler@solo.io>
chandler-solo and others added 6 commits April 23, 2026 16:42
Signed-off-by: David L. Chandler <david.chandler@solo.io>
* fix: Tries to remedy an xDS 'HACK'

This fixes a reconnect-time xDS race that can surface during kgateway controller restarts.

When a new uniquely connected client is created, `snapshotPerClient` can run in parallel with per-client backend cluster translation. The existing guard only waited for the per-client cluster collection to become non-`nil`, which still allowed a partial CDS snapshot to be published for that client. In that state, routes and listeners could already reference backend clusters that were not yet present in the snapshot, and Envoy would return `500` responses with the `NC` flag until the missing clusters arrived in a later update.

This change defers publishing a per-client snapshot until all clusters referenced by the route and listener resources are present in the merged cluster set. Clusters already tracked as errored are excluded from that readiness check so existing error-handling behavior is preserved.

/kind fix

```release-note
Fix a reconnect-time xDS race where Envoy could briefly receive routes and listeners before all referenced backend clusters were present, causing transient NC/500 responses during controller restart.
```

Added regression coverage for the partial-snapshot case:
- a new test that failed on the old code because a snapshot was published before all referenced clusters were ready
- a companion test confirming snapshots are still published when a referenced cluster is explicitly marked errored

Verified with:

```bash
go test ./pkg/kgateway/proxy_syncer
```

Signed-off-by: David L. Chandler <david.chandler@solo.io>

* Try harder to find clusters.

Signed-off-by: David L. Chandler <david.chandler@solo.io>

* fixup

Signed-off-by: David L. Chandler <david.chandler@solo.io>

* fixup

Signed-off-by: David L. Chandler <david.chandler@solo.io>

* review feedback

Signed-off-by: David L. Chandler <david.chandler@solo.io>

* optimize: one proto walk per Gateway translation

Signed-off-by: David L. Chandler <david.chandler@solo.io>

* fixup

Signed-off-by: David L. Chandler <david.chandler@solo.io>

* HACK comment rewrite

Signed-off-by: David L. Chandler <david.chandler@solo.io>

* test cases for both scenarios reviewers raised

Signed-off-by: David L. Chandler <david.chandler@solo.io>

* attempt to repro e2e

Signed-off-by: David L. Chandler <david.chandler@solo.io>

* scope defer gate to dataplane clusters; document last-good retention

Narrow collectReferencedClusters to RouteAction/TcpProxy targets only.
BackendRef typos already resolve to BlackholeClusterName (skipped), and
ancillary refs (access-log GrpcService, JWT HttpUri, etc.) ship with
their ExtraClusters in the same per-gateway snapshot, so there is no
reconnect race to guard. Gating on them only created starvation risk
when a plugin or user misconfiguration left an ancillary reference
undeclared.

Document the xDS cache's intentional no-op on EventDelete: when the
per-client handler defers (returns nil), Envoy keeps serving its last
coherent snapshot, so valid traffic sees no 500/NC during the defer
window. Update the readiness-gate block comment to tie both guards
to this retention behavior.

Update tests to match: the "unresolvable cluster" scenarios now simulate
the realistic BackendRef-typo path using BlackholeClusterName, and the
HCM nested-refs test is inverted to assert ancillaries are intentionally
excluded.

Signed-off-by: David L. Chandler <david.chandler@solo.io>

* Update pkg/kgateway/proxy_syncer/perclient.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: David L. Chandler <david.chandler@solo.io>

* defer until EDS clusters have mathcing ClusterLoadAssignment

Signed-off-by: David L. Chandler <david.chandler@solo.io>

* comment re: leak

Signed-off-by: David L. Chandler <david.chandler@solo.io>

---------

Signed-off-by: David L. Chandler <david.chandler@solo.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: David L. Chandler <david.chandler@solo.io>
…imesaved_thanks_to_k3d"

This reverts commit 9e10fad, reversing
changes made to c775308.
@chandler-solo chandler-solo requested a review from a team as a code owner April 27, 2026 06:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/flake release-note-none work in progress Indicates that a PR should not merge because it is a work in progress

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants