[smoke-test] fix a flake: retry tag validation to tolerate Docker Hub rate limiting#1903
Open
V-Subhankar-infy wants to merge 3 commits into
Pull request overview
This PR hardens the smoke-test tag validator so transient Docker Hub failures (rate limiting/network/5xx) aren’t incorrectly reported as missing image tags, reducing false CI failures during variant validation.
Changes:
- Adds retry-with-backoff around `docker manifest inspect` and distinguishes between valid, invalid (missing), and unverified (transient) outcomes.
- Treats only known “missing manifest” errors as invalid; other errors are retried and then downgraded to warnings if still unresolved.
- Introduces UNVERIFIED reporting to allow CI to proceed when tags cannot be verified due to transient issues.
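The three outcomes described in this overview could be separated along the following lines. This is only a sketch: `classify_result` is a hypothetical helper name, and the matched error strings are assumptions about what the registry client prints for a missing tag, not text taken from the PR.

```shell
#!/usr/bin/env bash
# Sketch only: classify the result of one `docker manifest inspect` call.
# classify_result is a hypothetical name; the patterns below are assumed
# examples of "missing manifest" error text.

classify_result() {
    local status="$1" output="$2"
    if [ "$status" -eq 0 ]; then
        # The registry returned a manifest, so the tag exists.
        echo "valid"
    elif printf '%s' "$output" | grep -qiE 'no such manifest|manifest unknown'; then
        # The registry answered definitively: the tag is missing.
        echo "invalid"
    else
        # Anything else (rate limiting, network, 5xx) is not proof of absence.
        echo "unverified"
    fi
}

classify_result 0 '{"schemaVersion": 2}'
classify_result 1 'no such manifest: docker.io/library/python:0.0-none'
classify_result 1 'toomanyrequests: You have reached your pull rate limit.'
```

Only the "invalid" outcome would fail the job in this scheme; "unverified" tags would be reported as warnings after the retries are exhausted.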
return 1
fi

echo " ! Transient error (attempt ${attempt}/${max_attempts}): $(echo "$output" | tr '\n' ' ' | cut -c1-200)"
done

echo "These were not treated as missing; the build will continue."
echo ""
echo "✓ No invalid base image tags found (${#UNVERIFIED_TAGS[@]} could not be verified and were skipped)."
Kaniska244 previously requested changes on Jun 23, 2026
2edfafd to 742bb3b
… rate limiting

validate-tags.sh ran a single anonymous `docker manifest inspect` per tag, so on shared-IP CI runners any non-zero exit (including HTTP 429 rate limiting) was misreported as "tag does not exist" and failed the whole smoke-test job with a false "Invalid variants" error, even though the tag exists upstream.

- Retry the registry query up to 3 times with a 10s pause before marking a tag invalid, so a transient failure isn't mistaken for a missing tag.
- After all retries fail, flag the one unambiguous transient cause, Docker Hub rate limiting (HTTP 429); every other case keeps the original "tag does not exist" message unchanged.
- Echo the raw registry response to aid debugging.

Fail-closed behaviour is unchanged: a genuinely missing tag still exits 1.
742bb3b to a4f9f14
I have considered all the review points and incorporated them into the code with the following changes.
- Removed the complex string matching that depended on the Docker Hub response data type.
- Removed the complex code that handled multiple cases and stuck to a simple approach: sleep for 10 seconds and retry 2 more times before stopping the build.
- Only when Docker Hub sends a definitive response code of 429, which can only mean rate limiting, is that logged to the console.
- The change only updates the log displayed on failure; the core logic is untouched and the error validation is unaffected.
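The simplified retry described above might look roughly like this. It is a sketch under stated assumptions, not the PR's actual code: the function name `check_image_exists` comes from the PR description, but `INSPECT_CMD` and `RETRY_DELAY` are hypothetical knobs introduced here so the loop can be exercised without Docker.

```shell
#!/usr/bin/env bash
# Sketch only. INSPECT_CMD and RETRY_DELAY are hypothetical overrides, not
# part of the PR; the defaults mirror the behaviour described in the comment.
INSPECT_CMD="${INSPECT_CMD:-docker manifest inspect}"
RETRY_DELAY="${RETRY_DELAY:-10}"

check_image_exists() {
    local image="$1" max_attempts=3 attempt=1 output
    while [ "$attempt" -le "$max_attempts" ]; do
        # Word splitting of INSPECT_CMD is intentional (command plus arguments).
        if output=$($INSPECT_CMD "$image" 2>&1); then
            return 0
        fi
        # Echo the raw registry response to aid debugging.
        echo "  attempt ${attempt}/${max_attempts} failed: ${output}" >&2
        # Pause between attempts, but not after the last one.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$RETRY_DELAY"
        fi
        attempt=$((attempt + 1))
    done
    # Fail closed: after three failed attempts the tag is treated as invalid.
    return 1
}
```

A caller would use it as `check_image_exists "python:3.13-trixie" || exit 1`. With the defaults, a persistently failing tag costs two 10-second pauses before the build stops.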
Summary
Smoke-test tag validation failed intermittently with false `Invalid variants` errors (e.g. `3.13-trixie`) when anonymous Docker Hub calls got rate limited.

Root cause
`validate-tags.sh` ran a single `docker manifest inspect` per tag and treated any non-zero exit, including HTTP 429 rate limiting, as "tag does not exist", so one throttled call failed the whole job despite the tag existing upstream.

Changes (`check_image_exists()` only)
everything else keeps the original `✗ Invalid - tag does not exist` message.

Preserved
Fail-closed: genuinely missing tags still `exit 1`, and the default message is unchanged and displays "invalid-tag"

Testing
`bash -n` passes; classification self-test confirms 429 → rate-limit message and all else → original message;
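A classification self-test of the kind mentioned under Testing could be sketched as follows. The helper names `failure_message` and `assert_message`, the rate-limit message wording, and the sample registry responses are all assumptions made for illustration; only the `✗ Invalid - tag does not exist` message comes from the PR.

```shell
#!/usr/bin/env bash
# Sketch only: failure_message and assert_message are hypothetical helpers,
# and the matched strings are assumed examples of a rate-limited response.

failure_message() {
    local output="$1"
    # Match "toomanyrequests" or a standalone 429 (not digits inside a tag).
    if printf '%s' "$output" | grep -qiE 'toomanyrequests|(^|[^0-9])429([^0-9]|$)'; then
        echo "! Rate limited by Docker Hub (HTTP 429)"
    else
        echo "✗ Invalid - tag does not exist"
    fi
}

assert_message() {
    local got
    got=$(failure_message "$1")
    if [ "$got" != "$2" ]; then
        echo "FAIL: '$1' -> '$got'" >&2
        return 1
    fi
}

# 429 responses map to the rate-limit message; everything else keeps the original.
assert_message 'toomanyrequests: You have reached your pull rate limit.' '! Rate limited by Docker Hub (HTTP 429)' &&
assert_message 'unexpected HTTP status: 429 Too Many Requests' '! Rate limited by Docker Hub (HTTP 429)' &&
assert_message 'no such manifest: docker.io/library/python:0.0-none' '✗ Invalid - tag does not exist' &&
assert_message 'connection reset by peer' '✗ Invalid - tag does not exist' &&
echo "self-test passed"
```

The bounded `429` pattern is a deliberate choice in this sketch so that a tag name that merely contains those digits is not mistaken for a rate-limit response.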