feat: spot-capacity diversification (cut on-demand fallback + t3 unlimited-mode burst) by kaio6fellipe · Pull Request #90 · devopsfactory-io/jit-runners

kaio6fellipe · 2026-07-02T02:55:09Z

Summary

Cuts jit-runners' recurring AWS cost (from the June FinOps pass) by launching runners across diversified spot pools before the on-demand last resort, and pinning t-family runners to standard CPU credits. Targets the ~$22/mo on-demand fallback + ~$4.55/mo t3 unlimited-mode burst.

What changed (7 commits, TDD)

refactor(compute) — widen LaunchSpec to InstanceTypes []string / SubnetIDs []string (all readers updated: AWS + GCP launchers, scaleup, tests).
feat(scaleup) — LabelMapping.InstanceTypes + resolveInstanceTypes (ordered candidate list; back-compat with single instance_type).
feat(ec2) — Launch now tries each candidate type × subnet (AZ) on spot, first-success-wins, then one on-demand last resort. isFatal (smithy.APIError) aborts only on a closed allow-list of request-level errors; unknown/non-API errors default to retryable so launch success is provably ≥ today. Deterministic rotate spreads concurrent launches across AZs. Emits event=spot_exhausted_ondemand_fallback on fallback.
feat(ec2) — t-family (t2/t3/t3a/t4g) launches get CreditSpecification=standard (gated on isBurstable), wired via EC2_CPU_CREDITS (default standard).
feat(infra) — CFN + Terraform: six-class LabelMappings default with candidate lists + EC2_CPU_CREDITS=standard.
refactor(launcher) — GCP empty-InstanceTypes guard, rotate test, isFatal coupling note.
docs / fix(lint) — docs for the new behavior/config; revive var-naming fix.

Safety (CI availability is the hard constraint)

The "launch success ≥ today" guarantee holds under two enforced conditions: the candidate list always includes the original primary type + on-demand fallback preserved; and isFatal is a closed allow-list defaulting unknown → retryable. The spec reviewer mutation-tested this — flipping the default to fatal makes the unknown/non-API tests fail, so the property is load-bearing. release candidates are all 16 GiB (no OOM). No new IAM (ec2:RunInstances suffices).

Verification

make check green (lint + vet + 232 tests). New tests cover: spot-success, capacity→next-candidate, all-fail→on-demand, fatal-abort, unknown→retryable, non-API→retryable, empty-subnets, empty-types, burstable/non-burstable credits, resolveInstanceTypes, rotate.

Design docs & context

Spec + plan: devopsfactory-io/zettelkasten#108 (Projects/jit-runners/specs|plans/2026-07-01-scaleup-spot-diversification*)
Issue scaleup: large-labelled jobs starve when idle smaller runners exist (demand/supply not segmented by label) #88 (label-aware supply) was verified already-fixed on main and scoped out — see the evidence comment on scaleup: large-labelled jobs starve when idle smaller runners exist (demand/supply not segmented by label) #88.
Reviewed subagent-driven: each task got independent spec-compliance + code-quality review, plus a final whole-branch review.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Enhanced runner launch to try multiple candidate instance types and subnets in order (first-success-wins) to improve spot capacity outcomes.
- Added explicit spot-to–on-demand fallback when spot candidates are exhausted, with a dedicated fallback log event.
- Added EC2 CPU credit configuration for burstable instances (defaulting to standard).
Bug Fixes
- Improved behavior when instance type/subnet inputs are missing or empty across AWS and GCP.
Documentation
- Updated setup docs to support instance_types candidate lists and clarified AWS spot fallback behavior.
- Added troubleshooting guidance for frequent on-demand fallback.

Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>

…PPINGS Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>

Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>

…fault Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>

…te isFatal coupling Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>

…REDITS Document the new diversified spot launch (multi-candidate types × AZs before on-demand fallback + the spot_exhausted_ondemand_fallback log), the optional instance_types list in LABEL_MAPPINGS with the new six-class default, and the EC2_CPU_CREDITS env var. Fixes the stale label_mappings=[] doc drift. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>

…r-naming) Our option field triggered revive var-naming (initialisms → CPU). The AWS SDK's CreditSpecificationRequest.CpuCredits field name is unchanged (not ours to rename). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>

coderabbitai · 2026-07-02T02:55:18Z

📝 Walkthrough

Walkthrough

This PR adds ordered instance-type and subnet candidate support to the scaleup flow, updates AWS and GCP launchers to consume the new contract, configures CPU-credit handling for AWS, refreshes infra defaults, and updates documentation for the new spot selection and fallback behavior.

Changes

Multi-candidate spot launch support

Layer / File(s)	Summary
LaunchSpec and label mapping contract `lambda/internal/compute/compute.go`, `lambda/internal/config/config.go`	`LaunchSpec.InstanceType`/`SubnetID` become slices `InstanceTypes`/`SubnetIDs`; `LabelMapping` gains optional `InstanceTypes` field.
Scaleup Lambda resolution and wiring `lambda/cmd/scaleup/main.go`, `lambda/cmd/scaleup/main_integration_test.go`	`resolveInstanceType` is replaced by `resolveInstanceTypes`; launch spec now carries instance type and subnet lists; new tests cover list, single, and default fallback resolution.
AWS EC2 launcher retry/fallback and CPU credits `lambda/internal/aws/ec2/launcher.go`, `lambda/internal/aws/ec2/launcher_test.go`, `lambda/internal/provider/aws.go`, `lambda/go.mod`	Launch iterates rotated subnets and candidate instance types for spot, classifies fatal vs retryable errors, falls back to on-demand once, and applies CPU-credit pinning for burstable types via `EC2_CPU_CREDITS`; new tests cover launch paths, CPU credits, and rotation.
GCE launcher candidate handling `lambda/internal/gcp/gce/launcher.go`, `lambda/internal/gcp/gce/launcher_test.go`	GCE launcher validates non-empty instance types, uses the first candidate instance type and subnet, and updates tests to the slice-based spec.
Infra defaults `infra/cloudformation/template.yaml`, `infra/terraform/lambda.tf`, `infra/terraform/variables.tf`	`LabelMappings`/`label_mappings` defaults are populated with instance-type mappings, and `EC2_CPU_CREDITS=standard` is added to the scaleup Lambda environment.
Documentation updates `AGENTS.md`, `CLAUDE.md`, `README.md`, `docs/getting-started-aws.md`, `docs/getting-started-gcp.md`, `docs/troubleshooting.md`	Docs describe multi-candidate spot selection, on-demand fallback event logging, the AWS/GCP `label_mappings` behavior, CPU credits, and the new troubleshooting section with renumbering.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Scaleup as processRecord
  participant Launcher as ec2.Launcher.Launch
  participant EC2 as EC2 API

  Scaleup->>Launcher: Launch(spec with InstanceTypes, SubnetIDs)
  loop rotated subnets x instance types
    Launcher->>EC2: RunInstances (spot, instanceType, subnetID)
    EC2-->>Launcher: success or error
    alt fatal error
      Launcher-->>Scaleup: return error
    else success
      Launcher-->>Scaleup: return instance
    end
  end
  Launcher->>EC2: RunInstances (on-demand fallback)
  EC2-->>Launcher: success or error
  Launcher-->>Scaleup: return instance or error

Possibly related PRs

devopsfactory-io/jit-runners#46: Modifies the same scaleup instance-type resolution and LaunchSpec logic in lambda/cmd/scaleup/main.go that this PR extends into multi-candidate form.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title captures the main change: diversified spot launches with reduced on-demand fallback and standard CPU credits.
Description check	✅ Passed	The description is substantive and covers scope, rationale, verification, and docs, though it doesn't follow the repository's exact template.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/spot-diversification

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

github-actions · 2026-07-02T02:55:38Z

OpenSSF Scorecard — 8.3/10 ✅

Check	Score	Details
Binary-Artifacts	10/10	no binaries found in the repo
CI-Tests	10/10	15 out of 15 merged PRs checked by a CI test -- score normalized to 10
Code-Review	0/10	Found 0/30 approved changesets -- score normalized to 0
Dangerous-Workflow	10/10	no dangerous workflow patterns detected
License	10/10	license file detected
Pinned-Dependencies	6/10	dependency not pinned by hash detected -- score normalized to 6
Security-Policy	10/10	security policy file detected
Token-Permissions	10/10	GitHub workflow tokens follow principle of least privilege
Vulnerabilities	10/10	0 existing vulnerabilities detected

Threshold: 7 | Full report

coderabbitai

Actionable comments posted: 5

♻️ Duplicate comments (1)

infra/terraform/variables.tf (1)
63-63: 🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

Same large-label instance_type/instance_types[0] mismatch as infra/cloudformation/template.yaml.

instance_type is "c5.xlarge" while instance_types[0] is "c6i.xlarge"; align them so both IaC defaults stay consistent.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/terraform/variables.tf` at line 63, The default for the large instance
definition is inconsistent because `instance_type` and the first entry in
`instance_types` do not match. Update the large object in `variables.tf` so the
`instance_type` value and `instance_types[0]` align with the same instance
family, matching the corresponding large entry used in `template.yaml`.

🧹 Nitpick comments (1)

lambda/internal/gcp/gce/launcher_test.go (1)

79-126: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add coverage for the new empty-InstanceTypes guard and subnet fallback.

The EC2 launcher's equivalent behavior (empty InstanceTypes → error, empty SubnetIDs → fallback) is directly tested (TestLaunch_NoInstanceTypesErrors, TestLaunch_EmptySubnetsNoPanic in lambda/internal/aws/ec2/launcher_test.go), but this file only updates defaultSpec()'s field types without adding equivalent tests for gce.Launcher.Launch.

✅ Suggested additional test cases

func TestLauncher_Launch_NoInstanceTypesErrors(t *testing.T) {
	fake := &fakeGCE{}
	launcher := newLauncherWithAPI(fake, defaultOpts())

	spec := defaultSpec()
	spec.InstanceTypes = nil
	if _, err := launcher.Launch(context.Background(), spec); err == nil {
		t.Fatal("want error for empty InstanceTypes, got nil")
	}
	if fake.insertCalled {
		t.Error("expected Insert not to be called")
	}
}

func TestLauncher_Launch_EmptySubnetIDsFallsBackToOptsSubnet(t *testing.T) {
	fake := &fakeGCE{insertReturnID: "jit-runner-abcdef12"}
	launcher := newLauncherWithAPI(fake, defaultOpts())

	spec := defaultSpec()
	spec.SubnetIDs = nil
	if _, err := launcher.Launch(context.Background(), spec); err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	got := fake.insertReq.GetInstanceResource().GetNetworkInterfaces()[0].GetSubnetwork()
	if got != defaultOpts().Subnet {
		t.Errorf("subnet = %q, want opts.Subnet %q", got, defaultOpts().Subnet)
	}
}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lambda/internal/gcp/gce/launcher_test.go` around lines 79 - 126, Add tests in
gce.Launcher.Launch for the new validation and fallback behavior: verify that an
empty InstanceTypes slice returns an error and does not call fakeGCE.Insert, and
verify that when SubnetIDs is empty the launcher falls back to the subnet from
defaultOpts() instead of panicking or using an empty value. Use the existing
TestLauncher_Launch_ReturnsInstance pattern and unique symbols like defaultSpec,
newLauncherWithAPI, fakeGCE, and Launch to place the new cases nearby.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/getting-started-aws.md`:
- Around line 160-164: The `LabelMappings` example in the AWS getting-started
doc is presented as a complete override, but `--parameter-overrides` replaces
the entire JSON value and would drop the default labels. Update the
`LabelMappings` guidance to either show the full default mapping set or clearly
mark it as a partial override that must be expanded before use, keeping the
explanation attached to the `LabelMappings`/`instance_types` example.

In `@docs/troubleshooting.md`:
- Around line 196-217: The fallback diagnosis section is inaccurate: the `aws
logs filter-log-events` example returns raw messages rather than a count, and
the scaleup Lambda log fields referenced in the text are not actually emitted by
the current launcher path. Update the documentation to match the real output, or
change the logging in the scaleup flow so the `spot_exhausted_ondemand_fallback`
event includes the label class and tried candidate list alongside `runner` and
`attempted_types`, using the relevant logging code in the scaleup Lambda that
emits these messages.

In `@infra/cloudformation/template.yaml`:
- Line 73: The `large` entry’s primary `instance_type` is out of sync with the
first value in its `instance_types` list, so update the `large` label in the
CloudFormation default so both fields use the same leading instance type,
matching the pattern used by the other size labels. Also make the same alignment
fix in the duplicated default in the Terraform variables definition so both
sources stay consistent.

In `@lambda/internal/aws/ec2/launcher.go`:
- Around line 119-146: The isFatal helper in launcher.go currently treats
InvalidParameterValue and InvalidParameterCombination as fatal, which can stop
launch fallback too early; change the logic in isFatal to classify those two EC2
API errors as retryable so candidate selection can continue. Keep the existing
fatal handling for the other explicit codes, and add a test alongside
TestLaunch_UnknownErrorStaysRetryable that verifies both InvalidParameterValue
and InvalidParameterCombination do not short-circuit fallback.
- Around line 62-98: The sequential spot sweep in Launcher.Launch can consume
the full Lambda timeout before the on-demand fallback is tried. Add a ctx.Err()
check inside the nested instance-type/subnet loops, before each runInstance
attempt and after failures, so Launch exits promptly when the context is near
deadline. Keep the fix localized to Launcher.Launch and preserve the existing
isFatal handling and fallback behavior.

---

Duplicate comments:
In `@infra/terraform/variables.tf`:
- Line 63: The default for the large instance definition is inconsistent because
`instance_type` and the first entry in `instance_types` do not match. Update the
large object in `variables.tf` so the `instance_type` value and
`instance_types[0]` align with the same instance family, matching the
corresponding large entry used in `template.yaml`.

---

Nitpick comments:
In `@lambda/internal/gcp/gce/launcher_test.go`:
- Around line 79-126: Add tests in gce.Launcher.Launch for the new validation
and fallback behavior: verify that an empty InstanceTypes slice returns an error
and does not call fakeGCE.Insert, and verify that when SubnetIDs is empty the
launcher falls back to the subnet from defaultOpts() instead of panicking or
using an empty value. Use the existing TestLauncher_Launch_ReturnsInstance
pattern and unique symbols like defaultSpec, newLauncherWithAPI, fakeGCE, and
Launch to place the new cases nearby.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 83bcd4da-a623-4051-9a8c-7f3da7907f6d

📥 Commits

Reviewing files that changed from the base of the PR and between 7be033c and e4b2115.

📒 Files selected for processing (19)

AGENTS.md
CLAUDE.md
README.md
docs/getting-started-aws.md
docs/getting-started-gcp.md
docs/troubleshooting.md
infra/cloudformation/template.yaml
infra/terraform/lambda.tf
infra/terraform/variables.tf
lambda/cmd/scaleup/main.go
lambda/cmd/scaleup/main_integration_test.go
lambda/go.mod
lambda/internal/aws/ec2/launcher.go
lambda/internal/aws/ec2/launcher_test.go
lambda/internal/compute/compute.go
lambda/internal/config/config.go
lambda/internal/gcp/gce/launcher.go
lambda/internal/gcp/gce/launcher_test.go
lambda/internal/provider/aws.go

coderabbitai · 2026-07-02T03:28:28Z

  LabelMappings:
    Type: String
-    Default: "[]"
+    Default: '[{"label":"nano","instance_type":"t2.nano"},{"label":"micro","instance_type":"t2.micro"},{"label":"small","instance_type":"t2.small"},{"label":"medium","instance_type":"t3.medium","instance_types":["t3.medium","t3a.medium","m6i.large","m5.large"]},{"label":"large","instance_type":"c5.xlarge","instance_types":["c6i.xlarge","c5.xlarge","c5a.xlarge","m6i.xlarge"]},{"label":"release","instance_type":"m5.xlarge","instance_types":["m5.xlarge","m5a.xlarge","m6i.xlarge","m6a.xlarge"]}]'


🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

large label's instance_type doesn't match the head of its instance_types list.

instance_type is "c5.xlarge" but instance_types[0] is "c6i.xlarge". Every other label with an instance_types array keeps instance_type aligned with the first (primary) candidate. This looks like a copy/paste slip and is confusing regardless of which field actually drives resolution at runtime.

🔧 Suggested fix

-{"label":"large","instance_type":"c5.xlarge","instance_types":["c6i.xlarge","c5.xlarge","c5a.xlarge","m6i.xlarge"]} +{"label":"large","instance_type":"c6i.xlarge","instance_types":["c6i.xlarge","c5.xlarge","c5a.xlarge","m6i.xlarge"]}

Same mismatch is duplicated in infra/terraform/variables.tf.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Default: '[{"label":"nano","instance_type":"t2.nano"},{"label":"micro","instance_type":"t2.micro"},{"label":"small","instance_type":"t2.small"},{"label":"medium","instance_type":"t3.medium","instance_types":["t3.medium","t3a.medium","m6i.large","m5.large"]},{"label":"large","instance_type":"c5.xlarge","instance_types":["c6i.xlarge","c5.xlarge","c5a.xlarge","m6i.xlarge"]},{"label":"release","instance_type":"m5.xlarge","instance_types":["m5.xlarge","m5a.xlarge","m6i.xlarge","m6a.xlarge"]}]'

Default: '[{"label":"nano","instance_type":"t2.nano"},{"label":"micro","instance_type":"t2.micro"},{"label":"small","instance_type":"t2.small"},{"label":"medium","instance_type":"t3.medium","instance_types":["t3.medium","t3a.medium","m6i.large","m5.large"]},{"label":"large","instance_type":"c6i.xlarge","instance_types":["c6i.xlarge","c5.xlarge","c5a.xlarge","m6i.xlarge"]},{"label":"release","instance_type":"m5.xlarge","instance_types":["m5.xlarge","m5a.xlarge","m6i.xlarge","m6a.xlarge"]}]'

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@infra/cloudformation/template.yaml` at line 73, The `large` entry’s primary `instance_type` is out of sync with the first value in its `instance_types` list, so update the `large` label in the CloudFormation default so both fields use the same leading instance type, matching the pattern used by the other size labels. Also make the same alignment fix in the duplicated default in the Terraform variables definition so both sources stay consistent.

coderabbitai · 2026-07-02T03:28:28Z

+// isFatal reports whether err is a request-level error that would fail on
+// on-demand too — the only case where skipping the on-demand fallback is safe.
+// DEFAULT: unknown/non-API errors are retryable (return false), preserving the
+// "launch success ≥ today" guarantee.
+func isFatal(err error) bool {
+	var apiErr smithy.APIError
+	if !errors.As(err, &apiErr) {
+		return false
+	}
+	code := apiErr.ErrorCode()
+	if strings.HasPrefix(code, "InvalidAMIID") {
+		return true
+	}
+	switch code {
+	case "UnauthorizedOperation", "AuthFailure", "PendingVerification":
+		return true
+	// NOTE: InvalidParameterValue/Combination are treated as fatal (they fail on
+	// on-demand too). This is safe only while spot-specific params are hardcoded
+	// valid constants. If SpotMaxPrice (currently unwired) is ever populated and
+	// misconfigured, it surfaces as InvalidParameterValue on the SPOT request only
+	// — revisit excluding these codes then, so a bad MaxPrice still falls back.
+	case "InvalidParameterValue", "InvalidParameterCombination":
+		return true
+	}
+	// Allow-list is intentionally minimal and grows only as production logs
+	// reveal codes that also fail on-demand. Unknown ⇒ retryable (see doc above).
+	return false
+}


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

EC2 RunInstances InvalidParameterValue instance type architecture mismatch AMI

💡 Result:

The error "InvalidParameterValue: The architecture... of the specified instance type does not match the architecture... of the specified AMI" occurs when you attempt to launch an EC2 instance using an Amazon Machine Image (AMI) that is not compatible with the selected instance type's processor architecture (e.g., trying to use an x86_64 AMI on an arm64 Graviton instance, or vice-versa) [1][2][3]. To resolve this issue, ensure the architectures of both the AMI and the instance type are compatible [4][2]. 1. Identify the architecture of your AMI: You can check the architecture of your selected AMI using the AWS CLI [1][2]: aws ec2 describe-images --image-ids <YOUR_AMI_ID> --query 'Images.Architecture' 2. Identify compatible instance types: Once you know the AMI's architecture (e.g., x86_64 or arm64), you must select an instance type that supports that same architecture [2][3]. You can list instance types for a specific architecture using the AWS CLI [3]: aws ec2 describe-instance-types --filters Name=processor-info.supported-architecture,Values=arm64 --query "InstanceTypes[*].InstanceType" --output text Common architectures include: - x86_64: Supported by most standard instance families (e.g., m5, c5, t3). - arm64: Supported by AWS Graviton-based instance families (e.g., m6g, c6g, t4g). If you are using Infrastructure as Code (such as Terraform or AWS CloudFormation), verify that your resource configuration defines an instance type and AMI that align [2][5][3]. If you encounter this error during a validation/dry-run phase of a tool (like AWS ParallelCluster), ensure you are using the latest supported configuration or, if appropriate, verify if the tool is incorrectly performing validations against an incompatible default AMI [6].

Citations:

1: https://oneuptime.com/blog/post/2026-02-12-troubleshoot-ec2-instance-launch-failures/view

2: https://docs.aws.amazon.com/autoscaling/ec2/userguide/ts-as-ami.html

3: https://stackoverflow.com/questions/67276619/how-to-find-the-list-of-applicable-instance-types-for-an-ami

4: The architecture 'x86_64' of the specified instance type does not match the architecture 'arm64' of the specified AMI aws/aws-parallelcluster#3795

5: https://stackoverflow.com/questions/75975304/terraform-aws-ec2-instance-ubuntu-error

6: https://github.com/aws/aws-parallelcluster/wiki/Configuration-validation-failure:-architecture-of-AMI-and-instance-type-does-not-match

🏁 Script executed:

#!/bin/bash set -euo pipefail git ls-files lambda/internal/aws/ec2 printf '\n--- launcher outline ---\n' ast-grep outline lambda/internal/aws/ec2/launcher.go --view expanded || true printf '\n--- tests outline ---\n' ast-grep outline lambda/internal/aws/ec2/*_test.go --view expanded || true printf '\n--- search for isFatal and launch loop ---\n' rg -n "isFatal|InvalidParameterValue|InvalidParameterCombination|UnknownErrorStaysRetryable|divers|fallback|spot" lambda/internal/aws/ec2 -S

Repository: devopsfactory-io/jit-runners

Length of output: 7670

🏁 Script executed:

#!/bin/bash set -euo pipefail sed -n '1,260p' lambda/internal/aws/ec2/launcher.go | cat -n printf '\n--- tests ---\n' for f in lambda/internal/aws/ec2/*_test.go; do echo "### $f" sed -n '1,260p' "$f" | cat -n done

Repository: devopsfactory-io/jit-runners

Length of output: 23890

Treat InvalidParameterValue/InvalidParameterCombination as retryable
These EC2 errors can be candidate-specific (for example AMI/type or AZ/subnet incompatibilities), so classifying them as fatal can abort the first failed combination and skip other candidates that may still launch. Demote them to retryable and add a test covering both codes, similar to TestLaunch_UnknownErrorStaysRetryable.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lambda/internal/aws/ec2/launcher.go` around lines 119 - 146, The isFatal helper in launcher.go currently treats InvalidParameterValue and InvalidParameterCombination as fatal, which can stop launch fallback too early; change the logic in isFatal to classify those two EC2 API errors as retryable so candidate selection can continue. Keep the existing fatal handling for the other explicit codes, and add a test alongside TestLaunch_UnknownErrorStaysRetryable that verifies both InvalidParameterValue and InvalidParameterCombination do not short-circuit fallback.

…g, doc + config consistency, GCP guard tests Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>

kaio6fellipe · 2026-07-02T17:25:44Z

Addressed all 5 review findings in 3656513 (evaluated each against the code first):

ctx-bounded spot sweep (launcher.go:98, Major) — valid: scaleup timeout is 60s and diversification made the sweep up to ~12 sequential attempts. Added a ctx.Err() guard at the top of the inner loop so a near-deadline invocation exits cleanly (→ markLaunchFailed deregisters the token) instead of a hard Lambda kill. Covered by TestLaunch_ContextCanceledStopsSweep.
partial LabelMappings example (getting-started-aws.md, Major) — valid: added a caveat that --parameter-overrides replaces the whole value; pointed at the full six-class default table below.
fallback log fields (troubleshooting.md, Minor) — valid: enriched the log to labels=%v attempted_types=%v (the class + actual candidates), which is what makes "tune your candidate lists" actionable; fixed the # Count→# List comment.
large instance_type mismatch (template.yaml + variables.tf, Minor) — valid (cosmetic; the singular is inert when instance_types is set): aligned large.instance_type to c6i.xlarge in both IaC sources + the doc echoes.
missing GCP launcher tests (gce/launcher_test.go, Trivial) — valid: added TestLauncher_Launch_NoInstanceTypesErrors + TestLauncher_Launch_EmptySubnetUsesOptsFallback.

make check green (235 tests). Rebased onto the latest main merge (#82's 5432 egress rule) — both changes coexist.

coderabbitai

🧹 Nitpick comments (1)

lambda/internal/gcp/gce/launcher_test.go (1)
128-165: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Correct test coverage; consider table-driven style per guidelines.

Both new tests correctly assert the Launch() contract shown in launcher.go — "if len(spec.InstanceTypes) == 0 { return compute.Instance{}, fmt.Errorf("gcp/gce: launch: spec has no instance types") }" and the subnet fallback logic where subnetURL falls back to l.opts.Subnet when spec.SubnetIDs is empty.

That said, these two independent test functions could be consolidated into a single table-driven test (varying spec mutation, expected error, and expected subnetwork) per the repo's testing guideline.

As per coding guidelines, lambda/**/*_test.go: "Place tests in the same package as the code they test and write them in table-driven style."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lambda/internal/gcp/gce/launcher_test.go` around lines 128 - 165, The new
Launcher tests cover two independent Launch() behaviors, but they should be
consolidated into one table-driven test to match the repo’s testing guideline.
Update the launcher_test.go coverage around newLauncherWithAPI, defaultSpec, and
Launch so the empty InstanceTypes error case and the SubnetIDs fallback case are
expressed as table rows with expected error/insert behavior and expected
subnetwork values.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@lambda/internal/gcp/gce/launcher_test.go`:
- Around line 128-165: The new Launcher tests cover two independent Launch()
behaviors, but they should be consolidated into one table-driven test to match
the repo’s testing guideline. Update the launcher_test.go coverage around
newLauncherWithAPI, defaultSpec, and Launch so the empty InstanceTypes error
case and the SubnetIDs fallback case are expressed as table rows with expected
error/insert behavior and expected subnetwork values.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bc02efdb-f738-4d66-98f6-41646c7a34d2

📥 Commits

Reviewing files that changed from the base of the PR and between e4b2115 and 3656513.

📒 Files selected for processing (7)

docs/getting-started-aws.md
docs/troubleshooting.md
infra/cloudformation/template.yaml
infra/terraform/variables.tf
lambda/internal/aws/ec2/launcher.go
lambda/internal/aws/ec2/launcher_test.go
lambda/internal/gcp/gce/launcher_test.go

🚧 Files skipped from review as they are similar to previous changes (6)

infra/terraform/variables.tf
docs/getting-started-aws.md
infra/cloudformation/template.yaml
lambda/internal/aws/ec2/launcher_test.go
docs/troubleshooting.md
lambda/internal/aws/ec2/launcher.go

kaio6fellipe and others added 8 commits July 1, 2026 23:06

refactor(compute): widen LaunchSpec to InstanceTypes/SubnetIDs slices

4387c16

Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>

feat(scaleup): resolve ordered candidate instance types from LABEL_MA…

4774bed

…PPINGS Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>

feat(ec2): diversified spot launch across candidate types and subnets

16011eb

Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>

feat(ec2): standard CPU credits for t-family + on-demand fallback log

338cfe5

Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>

feat(infra): enable spot diversification + standard CPU credits by de…

d9c8ef5

…fault Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>

refactor(launcher): guard empty InstanceTypes on GCP, test rotate, no…

54be89a

…te isFatal coupling Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>

github-actions Bot added documentation Improvements or additions to documentation jit-runners feature dependencies labels Jul 2, 2026

Merge branch 'main' into feat/spot-diversification

e4b2115

coderabbitai Bot reviewed Jul 2, 2026

View reviewed changes

fix: address PR review — ctx-bounded spot sweep, enriched fallback lo…

3656513

…g, doc + config consistency, GCP guard tests Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>

coderabbitai Bot reviewed Jul 2, 2026

View reviewed changes

kaio6fellipe merged commit 715c97d into main Jul 3, 2026
10 checks passed

kaio6fellipe deleted the feat/spot-diversification branch July 3, 2026 05:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: spot-capacity diversification (cut on-demand fallback + t3 unlimited-mode burst)#90

feat: spot-capacity diversification (cut on-demand fallback + t3 unlimited-mode burst)#90
kaio6fellipe merged 10 commits into
mainfrom
feat/spot-diversification

kaio6fellipe commented Jul 2, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Uh oh!

github-actions Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jul 2, 2026

Uh oh!

Uh oh!

coderabbitai Bot Jul 2, 2026

Uh oh!

kaio6fellipe commented Jul 2, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

kaio6fellipe commented Jul 2, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed (7 commits, TDD)

Safety (CI availability is the hard constraint)

Verification

Design docs & context

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Uh oh!

github-actions Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

OpenSSF Scorecard — 8.3/10 ✅

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

kaio6fellipe commented Jul 2, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

kaio6fellipe commented Jul 2, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading

github-actions Bot commented Jul 2, 2026 •

edited

Loading