[WIP] test: e2e CI speedups#13904
Open
chandler-solo wants to merge 14 commits intokgateway-dev:mainfrom
Open
Conversation
Signed-off-by: David L. Chandler <david.chandler@solo.io>
Signed-off-by: David L. Chandler <david.chandler@solo.io>
Signed-off-by: David L. Chandler <david.chandler@solo.io>
Signed-off-by: David L. Chandler <david.chandler@solo.io>
Signed-off-by: David L. Chandler <david.chandler@solo.io>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR aims to speed up e2e CI by reusing prebuilt Docker images and improving Go caching, while adjusting e2e Helm chart resolution to avoid requiring packaged charts in _test/.
Changes:
- Add a shared-image build job and artifact transfer to avoid rebuilding e2e images per shard.
- Introduce Go build-cache priming/saving in the cache setup workflow and expose cache metadata from
prep-go-runner. - Update e2e chart path selection to prefer packaged charts but fall back to in-repo unpackaged charts; add Makefile/script toggles to skip certain setup work.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
test/e2e/testutils/helper/install.go |
Prefer packaged charts in _test/ but fall back to install/helm/<chart>; add packaged-chart existence checks. |
test/e2e/testutils/helper/install_test.go |
Add e2e-tagged unit tests covering packaged-vs-directory chart path selection. |
hack/kind/setup-kind.sh |
Add LOAD_DOCKER_IMAGES path to load prebuilt images when SKIP_DOCKER=true; stop packaging charts. |
hack/k3d/setup-k3d.sh |
Same as kind script for k3d. |
Makefile |
Add shared-image build/save targets; add SKIP_EXTPROC_SERVER_SETUP gating for e2e-test; add k3d-load aggregate target. |
.github/workflows/e2e.yaml |
Add image build job + artifact download/load in e2e shards; skip rebuilding images; load extproc-server separately. |
.github/workflows/cache-setup.yaml |
Prime and persist Go build cache. |
.github/actions/prep-go-runner/action.yaml |
Restore Go build cache and expose outputs (hit/key/path) for downstream workflows. |
| echo "SKIP_DOCKER=true, not building images or chart" | ||
| echo "SKIP_DOCKER=true, not building images" | ||
| if [[ $LOAD_DOCKER_IMAGES == 'true' ]]; then | ||
| VERSION=$VERSION CLUSTER_NAME=$CLUSTER_NAME make kind-load |
| echo "SKIP_DOCKER=true, not building images or chart" | ||
| echo "SKIP_DOCKER=true, not building images" | ||
| if [[ $LOAD_DOCKER_IMAGES == 'true' ]]; then | ||
| VERSION=$VERSION K3D_CLUSTER_NAME=$CLUSTER_NAME make k3d-load |
Comment on lines
+38
to
+43
| if fsutils.IsDirectory(testAssetDir) { | ||
| chartPath, err := getPackagedChartPath(testAssetDir, chartName) | ||
| if err == nil { | ||
| return chartPath, nil | ||
| } | ||
| logger.Info("falling back to unpackaged Helm chart", "chart", chartName, "assetDir", testAssetDir, "error", err) |
Comment on lines
+62
to
+65
| return "", fmt.Errorf("stat packaged chart: %w", err) | ||
| } | ||
| if info.IsDir() { | ||
| return "", fmt.Errorf("packaged chart path is a directory: %s", chartPath) |
CGO_ENABLED=0 was true almost everywhere and I did not want to waste time creating a cache for the exceptions so I stomped them out. Signed-off-by: David L. Chandler <david.chandler@solo.io> ci: standardize Go workflows on cgo off
Signed-off-by: David L. Chandler <david.chandler@solo.io>
* fix: Tries to remedy an xDS 'HACK' This fixes a reconnect-time xDS race that can surface during kgateway controller restarts. When a new uniquely connected client is created, `snapshotPerClient` can run in parallel with per-client backend cluster translation. The existing guard only waited for the per-client cluster collection to become non-`nil`, which still allowed a partial CDS snapshot to be published for that client. In that state, routes and listeners could already reference backend clusters that were not yet present in the snapshot, and Envoy would return `500` responses with the `NC` flag until the missing clusters arrived in a later update. This change defers publishing a per-client snapshot until all clusters referenced by the route and listener resources are present in the merged cluster set. Clusters already tracked as errored are excluded from that readiness check so existing error-handling behavior is preserved. /kind fix ```release-note Fix a reconnect-time xDS race where Envoy could briefly receive routes and listeners before all referenced backend clusters were present, causing transient NC/500 responses during controller restart. ``` Added regression coverage for the partial-snapshot case: - a new test that failed on the old code because a snapshot was published before all referenced clusters were ready - a companion test confirming snapshots are still published when a referenced cluster is explicitly marked errored Verified with: ```bash go test ./pkg/kgateway/proxy_syncer ``` Signed-off-by: David L. Chandler <david.chandler@solo.io> * Try harder to find clusters. Signed-off-by: David L. Chandler <david.chandler@solo.io> * fixup Signed-off-by: David L. Chandler <david.chandler@solo.io> * fixup Signed-off-by: David L. Chandler <david.chandler@solo.io> * review feedback Signed-off-by: David L. Chandler <david.chandler@solo.io> * optimize: one proto walk per Gateway translation Signed-off-by: David L. Chandler <david.chandler@solo.io> * fixup Signed-off-by: David L. Chandler <david.chandler@solo.io> * HACK comment rewrite Signed-off-by: David L. Chandler <david.chandler@solo.io> * test cases for both scenarios reviewers raised Signed-off-by: David L. Chandler <david.chandler@solo.io> * attempt to repro e2e Signed-off-by: David L. Chandler <david.chandler@solo.io> * scope defer gate to dataplane clusters; document last-good retention Narrow collectReferencedClusters to RouteAction/TcpProxy targets only. BackendRef typos already resolve to BlackholeClusterName (skipped), and ancillary refs (access-log GrpcService, JWT HttpUri, etc.) ship with their ExtraClusters in the same per-gateway snapshot, so there is no reconnect race to guard. Gating on them only created starvation risk when a plugin or user misconfiguration left an ancillary reference undeclared. Document the xDS cache's intentional no-op on EventDelete: when the per-client handler defers (returns nil), Envoy keeps serving its last coherent snapshot, so valid traffic sees no 500/NC during the defer window. Update the readiness-gate block comment to tie both guards to this retention behavior. Update tests to match: the "unresolvable cluster" scenarios now simulate the realistic BackendRef-typo path using BlackholeClusterName, and the HCM nested-refs test is inverted to assert ancillaries are intentionally excluded. Signed-off-by: David L. Chandler <david.chandler@solo.io> * Update pkg/kgateway/proxy_syncer/perclient.go Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: David L. Chandler <david.chandler@solo.io> * defer until EDS clusters have mathcing ClusterLoadAssignment Signed-off-by: David L. Chandler <david.chandler@solo.io> * comment re: leak Signed-off-by: David L. Chandler <david.chandler@solo.io> --------- Signed-off-by: David L. Chandler <david.chandler@solo.io> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: David L. Chandler <david.chandler@solo.io>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
e2e tests on CI are faster and centralize common build steps, making them less flaky.
Change Type
/kind flake
Changelog
Additional Notes