Improve parallel catchup v2 resource consumption #397

### What problem does your feature solve?

Currently, the parallel catchup v2 mission works as follows:
1. Spin up n worker pods
2. Workers pick up jobs
3. When all jobs are complete, or when the mission fails, the worker pods are torn down

This can cause excessive resource consumption. For example:
1. We spin up 1024 worker pods. This results in dynamic provisioning of k8s worker nodes (possibly hundreds of them, depending on instance type).
2. If a job gets stuck, it will be retried, but only once all other jobs are finished.
3. The stuck job may take a while to complete. During this time we keep all worker pods and k8s worker nodes online, even though only one job is running.

Another scenario is when a significant portion of the jobs has already finished. Only 50% of the worker pods may still be doing work, yet we keep all of them up.
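To put a rough number on the first scenario, here is a back-of-the-envelope sketch. It assumes a simplified model in which each worker pod runs exactly one job, and the durations are made up for illustration; they are not measurements from a real mission.

```python
def pod_hours(job_hours, keep_all_until_end):
    """Pod-hours consumed, assuming one job per worker pod (a simplification)."""
    mission_length = max(job_hours)
    if keep_all_until_end:
        # Current behaviour: every pod stays up until the slowest job is done.
        return mission_length * len(job_hours)
    # Alternative: each pod is torn down as soon as its own job finishes.
    return sum(job_hours)


# Hypothetical durations: 1023 jobs take 2 hours, one stuck job takes 10 hours.
jobs = [2.0] * 1023 + [10.0]
current = pod_hours(jobs, keep_all_until_end=True)
early = pod_hours(jobs, keep_all_until_end=False)
print(f"current: {current:.0f} pod-hours, early teardown: {early:.0f} pod-hours")
# prints "current: 10240 pod-hours, early teardown: 2056 pod-hours"
```

Under these assumed numbers, a single slow job makes the mission cost roughly five times as many pod-hours as the work itself requires.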

### What would you like to see?

It would be good for idle worker pods to be torn down before the mission finishes. Once they are torn down, Karpenter should be able to release some k8s worker nodes to reduce cost.
Perhaps we can also improve the retry logic to catch dead jobs earlier. If we can retry earlier in the run, we'll avoid the long tail.
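
One possible shape for both ideas is sketched below. This is not how the mission is implemented today; `Job`, `requeue_stuck_jobs`, `desired_workers`, the timeout, and the attempt limit are all hypothetical names and values chosen for illustration. The sketch requeues any job that has been running longer than a timeout instead of waiting for all other jobs to finish, and it computes how many worker pods are still worth keeping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    job_id: str
    started_at: Optional[float] = None  # seconds; None while the job is queued
    attempts: int = 0                   # number of times the job has been started


def requeue_stuck_jobs(running, queue, now, timeout_s, max_attempts=3):
    """Requeue jobs running longer than timeout_s; return jobs that are given up on."""
    failed = []
    for job in list(running):  # iterate over a copy because we mutate `running`
        if now - job.started_at <= timeout_s:
            continue
        running.remove(job)
        job.started_at = None
        if job.attempts < max_attempts:
            queue.append(job)   # retry now, not after all other jobs finish
        else:
            failed.append(job)
    return failed


def desired_workers(queue, running, max_workers):
    """Worker pods worth keeping: one per outstanding job, capped at max_workers."""
    return min(max_workers, len(queue) + len(running))


# Hypothetical state: one queued job, two running, one of which is stuck.
queue = [Job("range-3")]
running = [
    Job("range-1", started_at=0.0, attempts=1),
    Job("range-2", started_at=500.0, attempts=1),
]
failed = requeue_stuck_jobs(running, queue, now=700.0, timeout_s=600.0)
workers = desired_workers(queue, running, max_workers=1024)
print([j.job_id for j in queue], [j.job_id for j in running], workers)
```

With only three outstanding jobs, the sketch asks for three workers rather than 1024. Whether nodes are actually released then depends on the surplus pods being removed so that Karpenter sees empty nodes.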

### What alternatives are there?



