Skip to content

feat: spot-capacity diversification (cut on-demand fallback + t3 unlimited-mode burst)#90

Merged
kaio6fellipe merged 10 commits into
mainfrom
feat/spot-diversification
Jul 3, 2026
Merged

feat: spot-capacity diversification (cut on-demand fallback + t3 unlimited-mode burst)#90
kaio6fellipe merged 10 commits into
mainfrom
feat/spot-diversification

Conversation

@kaio6fellipe

@kaio6fellipe kaio6fellipe commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Summary

Cuts jit-runners' recurring AWS cost (from the June FinOps pass) by launching runners across diversified spot pools before the on-demand last resort, and pinning t-family runners to standard CPU credits. Targets the ~$22/mo on-demand fallback + ~$4.55/mo t3 unlimited-mode burst.

What changed (7 commits, TDD)

  1. refactor(compute) — widen LaunchSpec to InstanceTypes []string / SubnetIDs []string (all readers updated: AWS + GCP launchers, scaleup, tests).
  2. feat(scaleup)LabelMapping.InstanceTypes + resolveInstanceTypes (ordered candidate list; back-compat with single instance_type).
  3. feat(ec2)Launch now tries each candidate type × subnet (AZ) on spot, first-success-wins, then one on-demand last resort. isFatal (smithy.APIError) aborts only on a closed allow-list of request-level errors; unknown/non-API errors default to retryable so launch success is provably ≥ today. Deterministic rotate spreads concurrent launches across AZs. Emits event=spot_exhausted_ondemand_fallback on fallback.
  4. feat(ec2) — t-family (t2/t3/t3a/t4g) launches get CreditSpecification=standard (gated on isBurstable), wired via EC2_CPU_CREDITS (default standard).
  5. feat(infra) — CFN + Terraform: six-class LabelMappings default with candidate lists + EC2_CPU_CREDITS=standard.
  6. refactor(launcher) — GCP empty-InstanceTypes guard, rotate test, isFatal coupling note.
  7. docs / fix(lint) — docs for the new behavior/config; revive var-naming fix.

Safety (CI availability is the hard constraint)

The "launch success ≥ today" guarantee holds under two enforced conditions: the candidate list always includes the original primary type + on-demand fallback preserved; and isFatal is a closed allow-list defaulting unknown → retryable. The spec reviewer mutation-tested this — flipping the default to fatal makes the unknown/non-API tests fail, so the property is load-bearing. release candidates are all 16 GiB (no OOM). No new IAM (ec2:RunInstances suffices).

Verification

make check green (lint + vet + 232 tests). New tests cover: spot-success, capacity→next-candidate, all-fail→on-demand, fatal-abort, unknown→retryable, non-API→retryable, empty-subnets, empty-types, burstable/non-burstable credits, resolveInstanceTypes, rotate.

Design docs & context

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Enhanced runner launch to try multiple candidate instance types and subnets in order (first-success-wins) to improve spot capacity outcomes.
    • Added explicit spot-to–on-demand fallback when spot candidates are exhausted, with a dedicated fallback log event.
    • Added EC2 CPU credit configuration for burstable instances (defaulting to standard).
  • Bug Fixes

    • Improved behavior when instance type/subnet inputs are missing or empty across AWS and GCP.
  • Documentation

    • Updated setup docs to support instance_types candidate lists and clarified AWS spot fallback behavior.
    • Added troubleshooting guidance for frequent on-demand fallback.

kaio6fellipe and others added 8 commits July 1, 2026 23:06
Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>
…PPINGS

Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>
Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>
Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>
…fault

Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>
…te isFatal coupling

Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>
…REDITS

Document the new diversified spot launch (multi-candidate types × AZs before
on-demand fallback + the spot_exhausted_ondemand_fallback log), the optional
instance_types list in LABEL_MAPPINGS with the new six-class default, and the
EC2_CPU_CREDITS env var. Fixes the stale label_mappings=[] doc drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>
…r-naming)

Our option field triggered revive var-naming (initialisms → CPU). The AWS SDK's
CreditSpecificationRequest.CpuCredits field name is unchanged (not ours to rename).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds ordered instance-type and subnet candidate support to the scaleup flow, updates AWS and GCP launchers to consume the new contract, configures CPU-credit handling for AWS, refreshes infra defaults, and updates documentation for the new spot selection and fallback behavior.

Changes

Multi-candidate spot launch support

Layer / File(s) Summary
LaunchSpec and label mapping contract
lambda/internal/compute/compute.go, lambda/internal/config/config.go
LaunchSpec.InstanceType/SubnetID become slices InstanceTypes/SubnetIDs; LabelMapping gains optional InstanceTypes field.
Scaleup Lambda resolution and wiring
lambda/cmd/scaleup/main.go, lambda/cmd/scaleup/main_integration_test.go
resolveInstanceType is replaced by resolveInstanceTypes; launch spec now carries instance type and subnet lists; new tests cover list, single, and default fallback resolution.
AWS EC2 launcher retry/fallback and CPU credits
lambda/internal/aws/ec2/launcher.go, lambda/internal/aws/ec2/launcher_test.go, lambda/internal/provider/aws.go, lambda/go.mod
Launch iterates rotated subnets and candidate instance types for spot, classifies fatal vs retryable errors, falls back to on-demand once, and applies CPU-credit pinning for burstable types via EC2_CPU_CREDITS; new tests cover launch paths, CPU credits, and rotation.
GCE launcher candidate handling
lambda/internal/gcp/gce/launcher.go, lambda/internal/gcp/gce/launcher_test.go
GCE launcher validates non-empty instance types, uses the first candidate instance type and subnet, and updates tests to the slice-based spec.
Infra defaults
infra/cloudformation/template.yaml, infra/terraform/lambda.tf, infra/terraform/variables.tf
LabelMappings/label_mappings defaults are populated with instance-type mappings, and EC2_CPU_CREDITS=standard is added to the scaleup Lambda environment.
Documentation updates
AGENTS.md, CLAUDE.md, README.md, docs/getting-started-aws.md, docs/getting-started-gcp.md, docs/troubleshooting.md
Docs describe multi-candidate spot selection, on-demand fallback event logging, the AWS/GCP label_mappings behavior, CPU credits, and the new troubleshooting section with renumbering.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Scaleup as processRecord
  participant Launcher as ec2.Launcher.Launch
  participant EC2 as EC2 API

  Scaleup->>Launcher: Launch(spec with InstanceTypes, SubnetIDs)
  loop rotated subnets x instance types
    Launcher->>EC2: RunInstances (spot, instanceType, subnetID)
    EC2-->>Launcher: success or error
    alt fatal error
      Launcher-->>Scaleup: return error
    else success
      Launcher-->>Scaleup: return instance
    end
  end
  Launcher->>EC2: RunInstances (on-demand fallback)
  EC2-->>Launcher: success or error
  Launcher-->>Scaleup: return instance or error
Loading

Possibly related PRs

  • devopsfactory-io/jit-runners#46: Modifies the same scaleup instance-type resolution and LaunchSpec logic in lambda/cmd/scaleup/main.go that this PR extends into multi-candidate form.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title captures the main change: diversified spot launches with reduced on-demand fallback and standard CPU credits.
Description check ✅ Passed The description is substantive and covers scope, rationale, verification, and docs, though it doesn't follow the repository's exact template.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/spot-diversification

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@github-actions github-actions Bot added documentation Improvements or additions to documentation jit-runners feature dependencies labels Jul 2, 2026
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown

OpenSSF Scorecard — 8.3/10 ✅

Check Score Details
Binary-Artifacts 10/10 no binaries found in the repo
CI-Tests 10/10 15 out of 15 merged PRs checked by a CI test -- score normalized to 10
Code-Review 0/10 Found 0/30 approved changesets -- score normalized to 0
Dangerous-Workflow 10/10 no dangerous workflow patterns detected
License 10/10 license file detected
Pinned-Dependencies 6/10 dependency not pinned by hash detected -- score normalized to 6
Security-Policy 10/10 security policy file detected
Token-Permissions 10/10 GitHub workflow tokens follow principle of least privilege
Vulnerabilities 10/10 0 existing vulnerabilities detected

Threshold: 7 | Full report

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

♻️ Duplicate comments (1)
infra/terraform/variables.tf (1)

63-63: 🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

Same large-label instance_type/instance_types[0] mismatch as infra/cloudformation/template.yaml.

instance_type is "c5.xlarge" while instance_types[0] is "c6i.xlarge"; align them so both IaC defaults stay consistent.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/terraform/variables.tf` at line 63, The default for the large instance
definition is inconsistent because `instance_type` and the first entry in
`instance_types` do not match. Update the large object in `variables.tf` so the
`instance_type` value and `instance_types[0]` align with the same instance
family, matching the corresponding large entry used in `template.yaml`.
🧹 Nitpick comments (1)
lambda/internal/gcp/gce/launcher_test.go (1)

79-126: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add coverage for the new empty-InstanceTypes guard and subnet fallback.

The EC2 launcher's equivalent behavior (empty InstanceTypes → error, empty SubnetIDs → fallback) is directly tested (TestLaunch_NoInstanceTypesErrors, TestLaunch_EmptySubnetsNoPanic in lambda/internal/aws/ec2/launcher_test.go), but this file only updates defaultSpec()'s field types without adding equivalent tests for gce.Launcher.Launch.

✅ Suggested additional test cases
func TestLauncher_Launch_NoInstanceTypesErrors(t *testing.T) {
	fake := &fakeGCE{}
	launcher := newLauncherWithAPI(fake, defaultOpts())

	spec := defaultSpec()
	spec.InstanceTypes = nil
	if _, err := launcher.Launch(context.Background(), spec); err == nil {
		t.Fatal("want error for empty InstanceTypes, got nil")
	}
	if fake.insertCalled {
		t.Error("expected Insert not to be called")
	}
}

func TestLauncher_Launch_EmptySubnetIDsFallsBackToOptsSubnet(t *testing.T) {
	fake := &fakeGCE{insertReturnID: "jit-runner-abcdef12"}
	launcher := newLauncherWithAPI(fake, defaultOpts())

	spec := defaultSpec()
	spec.SubnetIDs = nil
	if _, err := launcher.Launch(context.Background(), spec); err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	got := fake.insertReq.GetInstanceResource().GetNetworkInterfaces()[0].GetSubnetwork()
	if got != defaultOpts().Subnet {
		t.Errorf("subnet = %q, want opts.Subnet %q", got, defaultOpts().Subnet)
	}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lambda/internal/gcp/gce/launcher_test.go` around lines 79 - 126, Add tests in
gce.Launcher.Launch for the new validation and fallback behavior: verify that an
empty InstanceTypes slice returns an error and does not call fakeGCE.Insert, and
verify that when SubnetIDs is empty the launcher falls back to the subnet from
defaultOpts() instead of panicking or using an empty value. Use the existing
TestLauncher_Launch_ReturnsInstance pattern and unique symbols like defaultSpec,
newLauncherWithAPI, fakeGCE, and Launch to place the new cases nearby.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/getting-started-aws.md`:
- Around line 160-164: The `LabelMappings` example in the AWS getting-started
doc is presented as a complete override, but `--parameter-overrides` replaces
the entire JSON value and would drop the default labels. Update the
`LabelMappings` guidance to either show the full default mapping set or clearly
mark it as a partial override that must be expanded before use, keeping the
explanation attached to the `LabelMappings`/`instance_types` example.

In `@docs/troubleshooting.md`:
- Around line 196-217: The fallback diagnosis section is inaccurate: the `aws
logs filter-log-events` example returns raw messages rather than a count, and
the scaleup Lambda log fields referenced in the text are not actually emitted by
the current launcher path. Update the documentation to match the real output, or
change the logging in the scaleup flow so the `spot_exhausted_ondemand_fallback`
event includes the label class and tried candidate list alongside `runner` and
`attempted_types`, using the relevant logging code in the scaleup Lambda that
emits these messages.

In `@infra/cloudformation/template.yaml`:
- Line 73: The `large` entry’s primary `instance_type` is out of sync with the
first value in its `instance_types` list, so update the `large` label in the
CloudFormation default so both fields use the same leading instance type,
matching the pattern used by the other size labels. Also make the same alignment
fix in the duplicated default in the Terraform variables definition so both
sources stay consistent.

In `@lambda/internal/aws/ec2/launcher.go`:
- Around line 119-146: The isFatal helper in launcher.go currently treats
InvalidParameterValue and InvalidParameterCombination as fatal, which can stop
launch fallback too early; change the logic in isFatal to classify those two EC2
API errors as retryable so candidate selection can continue. Keep the existing
fatal handling for the other explicit codes, and add a test alongside
TestLaunch_UnknownErrorStaysRetryable that verifies both InvalidParameterValue
and InvalidParameterCombination do not short-circuit fallback.
- Around line 62-98: The sequential spot sweep in Launcher.Launch can consume
the full Lambda timeout before the on-demand fallback is tried. Add a ctx.Err()
check inside the nested instance-type/subnet loops, before each runInstance
attempt and after failures, so Launch exits promptly when the context is near
deadline. Keep the fix localized to Launcher.Launch and preserve the existing
isFatal handling and fallback behavior.

---

Duplicate comments:
In `@infra/terraform/variables.tf`:
- Line 63: The default for the large instance definition is inconsistent because
`instance_type` and the first entry in `instance_types` do not match. Update the
large object in `variables.tf` so the `instance_type` value and
`instance_types[0]` align with the same instance family, matching the
corresponding large entry used in `template.yaml`.

---

Nitpick comments:
In `@lambda/internal/gcp/gce/launcher_test.go`:
- Around line 79-126: Add tests in gce.Launcher.Launch for the new validation
and fallback behavior: verify that an empty InstanceTypes slice returns an error
and does not call fakeGCE.Insert, and verify that when SubnetIDs is empty the
launcher falls back to the subnet from defaultOpts() instead of panicking or
using an empty value. Use the existing TestLauncher_Launch_ReturnsInstance
pattern and unique symbols like defaultSpec, newLauncherWithAPI, fakeGCE, and
Launch to place the new cases nearby.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 83bcd4da-a623-4051-9a8c-7f3da7907f6d

📥 Commits

Reviewing files that changed from the base of the PR and between 7be033c and e4b2115.

📒 Files selected for processing (19)
  • AGENTS.md
  • CLAUDE.md
  • README.md
  • docs/getting-started-aws.md
  • docs/getting-started-gcp.md
  • docs/troubleshooting.md
  • infra/cloudformation/template.yaml
  • infra/terraform/lambda.tf
  • infra/terraform/variables.tf
  • lambda/cmd/scaleup/main.go
  • lambda/cmd/scaleup/main_integration_test.go
  • lambda/go.mod
  • lambda/internal/aws/ec2/launcher.go
  • lambda/internal/aws/ec2/launcher_test.go
  • lambda/internal/compute/compute.go
  • lambda/internal/config/config.go
  • lambda/internal/gcp/gce/launcher.go
  • lambda/internal/gcp/gce/launcher_test.go
  • lambda/internal/provider/aws.go

Comment thread docs/getting-started-aws.md Outdated
Comment thread docs/troubleshooting.md Outdated
Comment thread infra/cloudformation/template.yaml Outdated
LabelMappings:
Type: String
Default: "[]"
Default: '[{"label":"nano","instance_type":"t2.nano"},{"label":"micro","instance_type":"t2.micro"},{"label":"small","instance_type":"t2.small"},{"label":"medium","instance_type":"t3.medium","instance_types":["t3.medium","t3a.medium","m6i.large","m5.large"]},{"label":"large","instance_type":"c5.xlarge","instance_types":["c6i.xlarge","c5.xlarge","c5a.xlarge","m6i.xlarge"]},{"label":"release","instance_type":"m5.xlarge","instance_types":["m5.xlarge","m5a.xlarge","m6i.xlarge","m6a.xlarge"]}]'

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

large label's instance_type doesn't match the head of its instance_types list.

instance_type is "c5.xlarge" but instance_types[0] is "c6i.xlarge". Every other label with an instance_types array keeps instance_type aligned with the first (primary) candidate. This looks like a copy/paste slip and is confusing regardless of which field actually drives resolution at runtime.

🔧 Suggested fix
-{"label":"large","instance_type":"c5.xlarge","instance_types":["c6i.xlarge","c5.xlarge","c5a.xlarge","m6i.xlarge"]}
+{"label":"large","instance_type":"c6i.xlarge","instance_types":["c6i.xlarge","c5.xlarge","c5a.xlarge","m6i.xlarge"]}

Same mismatch is duplicated in infra/terraform/variables.tf.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Default: '[{"label":"nano","instance_type":"t2.nano"},{"label":"micro","instance_type":"t2.micro"},{"label":"small","instance_type":"t2.small"},{"label":"medium","instance_type":"t3.medium","instance_types":["t3.medium","t3a.medium","m6i.large","m5.large"]},{"label":"large","instance_type":"c5.xlarge","instance_types":["c6i.xlarge","c5.xlarge","c5a.xlarge","m6i.xlarge"]},{"label":"release","instance_type":"m5.xlarge","instance_types":["m5.xlarge","m5a.xlarge","m6i.xlarge","m6a.xlarge"]}]'
Default: '[{"label":"nano","instance_type":"t2.nano"},{"label":"micro","instance_type":"t2.micro"},{"label":"small","instance_type":"t2.small"},{"label":"medium","instance_type":"t3.medium","instance_types":["t3.medium","t3a.medium","m6i.large","m5.large"]},{"label":"large","instance_type":"c6i.xlarge","instance_types":["c6i.xlarge","c5.xlarge","c5a.xlarge","m6i.xlarge"]},{"label":"release","instance_type":"m5.xlarge","instance_types":["m5.xlarge","m5a.xlarge","m6i.xlarge","m6a.xlarge"]}]'
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/cloudformation/template.yaml` at line 73, The `large` entry’s primary
`instance_type` is out of sync with the first value in its `instance_types`
list, so update the `large` label in the CloudFormation default so both fields
use the same leading instance type, matching the pattern used by the other size
labels. Also make the same alignment fix in the duplicated default in the
Terraform variables definition so both sources stay consistent.

Comment thread lambda/internal/aws/ec2/launcher.go
Comment on lines +119 to +146
// isFatal reports whether err is a request-level error that would fail on
// on-demand too — the only case where skipping the on-demand fallback is safe.
// DEFAULT: unknown/non-API errors are retryable (return false), preserving the
// "launch success ≥ today" guarantee.
func isFatal(err error) bool {
var apiErr smithy.APIError
if !errors.As(err, &apiErr) {
return false
}
code := apiErr.ErrorCode()
if strings.HasPrefix(code, "InvalidAMIID") {
return true
}
switch code {
case "UnauthorizedOperation", "AuthFailure", "PendingVerification":
return true
// NOTE: InvalidParameterValue/Combination are treated as fatal (they fail on
// on-demand too). This is safe only while spot-specific params are hardcoded
// valid constants. If SpotMaxPrice (currently unwired) is ever populated and
// misconfigured, it surfaces as InvalidParameterValue on the SPOT request only
// — revisit excluding these codes then, so a bad MaxPrice still falls back.
case "InvalidParameterValue", "InvalidParameterCombination":
return true
}
// Allow-list is intentionally minimal and grows only as production logs
// reveal codes that also fail on-demand. Unknown ⇒ retryable (see doc above).
return false
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

EC2 RunInstances InvalidParameterValue instance type architecture mismatch AMI

💡 Result:

The error "InvalidParameterValue: The architecture... of the specified instance type does not match the architecture... of the specified AMI" occurs when you attempt to launch an EC2 instance using an Amazon Machine Image (AMI) that is not compatible with the selected instance type's processor architecture (e.g., trying to use an x86_64 AMI on an arm64 Graviton instance, or vice-versa) [1][2][3]. To resolve this issue, ensure the architectures of both the AMI and the instance type are compatible [4][2]. 1. Identify the architecture of your AMI: You can check the architecture of your selected AMI using the AWS CLI [1][2]: aws ec2 describe-images --image-ids <YOUR_AMI_ID> --query 'Images.Architecture' 2. Identify compatible instance types: Once you know the AMI's architecture (e.g., x86_64 or arm64), you must select an instance type that supports that same architecture [2][3]. You can list instance types for a specific architecture using the AWS CLI [3]: aws ec2 describe-instance-types --filters Name=processor-info.supported-architecture,Values=arm64 --query "InstanceTypes[*].InstanceType" --output text Common architectures include: - x86_64: Supported by most standard instance families (e.g., m5, c5, t3). - arm64: Supported by AWS Graviton-based instance families (e.g., m6g, c6g, t4g). If you are using Infrastructure as Code (such as Terraform or AWS CloudFormation), verify that your resource configuration defines an instance type and AMI that align [2][5][3]. If you encounter this error during a validation/dry-run phase of a tool (like AWS ParallelCluster), ensure you are using the latest supported configuration or, if appropriate, verify if the tool is incorrectly performing validations against an incompatible default AMI [6].

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

git ls-files lambda/internal/aws/ec2
printf '\n--- launcher outline ---\n'
ast-grep outline lambda/internal/aws/ec2/launcher.go --view expanded || true
printf '\n--- tests outline ---\n'
ast-grep outline lambda/internal/aws/ec2/*_test.go --view expanded || true
printf '\n--- search for isFatal and launch loop ---\n'
rg -n "isFatal|InvalidParameterValue|InvalidParameterCombination|UnknownErrorStaysRetryable|divers|fallback|spot" lambda/internal/aws/ec2 -S

Repository: devopsfactory-io/jit-runners

Length of output: 7670


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '1,260p' lambda/internal/aws/ec2/launcher.go | cat -n
printf '\n--- tests ---\n'
for f in lambda/internal/aws/ec2/*_test.go; do
  echo "### $f"
  sed -n '1,260p' "$f" | cat -n
done

Repository: devopsfactory-io/jit-runners

Length of output: 23890


Treat InvalidParameterValue/InvalidParameterCombination as retryable
These EC2 errors can be candidate-specific (for example AMI/type or AZ/subnet incompatibilities), so classifying them as fatal can abort the first failed combination and skip other candidates that may still launch. Demote them to retryable and add a test covering both codes, similar to TestLaunch_UnknownErrorStaysRetryable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lambda/internal/aws/ec2/launcher.go` around lines 119 - 146, The isFatal
helper in launcher.go currently treats InvalidParameterValue and
InvalidParameterCombination as fatal, which can stop launch fallback too early;
change the logic in isFatal to classify those two EC2 API errors as retryable so
candidate selection can continue. Keep the existing fatal handling for the other
explicit codes, and add a test alongside TestLaunch_UnknownErrorStaysRetryable
that verifies both InvalidParameterValue and InvalidParameterCombination do not
short-circuit fallback.

…g, doc + config consistency, GCP guard tests

Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>
@kaio6fellipe

Copy link
Copy Markdown
Contributor Author

Addressed all 5 review findings in 3656513 (evaluated each against the code first):

  • ctx-bounded spot sweep (launcher.go:98, Major) — valid: scaleup timeout is 60s and diversification made the sweep up to ~12 sequential attempts. Added a ctx.Err() guard at the top of the inner loop so a near-deadline invocation exits cleanly (→ markLaunchFailed deregisters the token) instead of a hard Lambda kill. Covered by TestLaunch_ContextCanceledStopsSweep.
  • partial LabelMappings example (getting-started-aws.md, Major) — valid: added a caveat that --parameter-overrides replaces the whole value; pointed at the full six-class default table below.
  • fallback log fields (troubleshooting.md, Minor) — valid: enriched the log to labels=%v attempted_types=%v (the class + actual candidates), which is what makes "tune your candidate lists" actionable; fixed the # Count# List comment.
  • large instance_type mismatch (template.yaml + variables.tf, Minor) — valid (cosmetic; the singular is inert when instance_types is set): aligned large.instance_type to c6i.xlarge in both IaC sources + the doc echoes.
  • missing GCP launcher tests (gce/launcher_test.go, Trivial) — valid: added TestLauncher_Launch_NoInstanceTypesErrors + TestLauncher_Launch_EmptySubnetUsesOptsFallback.

make check green (235 tests). Rebased onto the latest main merge (#82's 5432 egress rule) — both changes coexist.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
lambda/internal/gcp/gce/launcher_test.go (1)

128-165: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Correct test coverage; consider table-driven style per guidelines.

Both new tests correctly assert the Launch() contract shown in launcher.go — "if len(spec.InstanceTypes) == 0 { return compute.Instance{}, fmt.Errorf("gcp/gce: launch: spec has no instance types") }" and the subnet fallback logic where subnetURL falls back to l.opts.Subnet when spec.SubnetIDs is empty.

That said, these two independent test functions could be consolidated into a single table-driven test (varying spec mutation, expected error, and expected subnetwork) per the repo's testing guideline.

As per coding guidelines, lambda/**/*_test.go: "Place tests in the same package as the code they test and write them in table-driven style."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lambda/internal/gcp/gce/launcher_test.go` around lines 128 - 165, The new
Launcher tests cover two independent Launch() behaviors, but they should be
consolidated into one table-driven test to match the repo’s testing guideline.
Update the launcher_test.go coverage around newLauncherWithAPI, defaultSpec, and
Launch so the empty InstanceTypes error case and the SubnetIDs fallback case are
expressed as table rows with expected error/insert behavior and expected
subnetwork values.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@lambda/internal/gcp/gce/launcher_test.go`:
- Around line 128-165: The new Launcher tests cover two independent Launch()
behaviors, but they should be consolidated into one table-driven test to match
the repo’s testing guideline. Update the launcher_test.go coverage around
newLauncherWithAPI, defaultSpec, and Launch so the empty InstanceTypes error
case and the SubnetIDs fallback case are expressed as table rows with expected
error/insert behavior and expected subnetwork values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bc02efdb-f738-4d66-98f6-41646c7a34d2

📥 Commits

Reviewing files that changed from the base of the PR and between e4b2115 and 3656513.

📒 Files selected for processing (7)
  • docs/getting-started-aws.md
  • docs/troubleshooting.md
  • infra/cloudformation/template.yaml
  • infra/terraform/variables.tf
  • lambda/internal/aws/ec2/launcher.go
  • lambda/internal/aws/ec2/launcher_test.go
  • lambda/internal/gcp/gce/launcher_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • infra/terraform/variables.tf
  • docs/getting-started-aws.md
  • infra/cloudformation/template.yaml
  • lambda/internal/aws/ec2/launcher_test.go
  • docs/troubleshooting.md
  • lambda/internal/aws/ec2/launcher.go

@kaio6fellipe kaio6fellipe merged commit 715c97d into main Jul 3, 2026
10 checks passed
@kaio6fellipe kaio6fellipe deleted the feat/spot-diversification branch July 3, 2026 05:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies documentation Improvements or additions to documentation feature jit-runners

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant