What problem does your feature solve?
Currently, the parallel catchup v2 mission works as follows:
- Spin up n worker pods
- Workers pick up jobs
- When all jobs are complete, or when the mission fails, the worker pods are torn down
This can cause excessive resource consumption. For example:
- We spin up 1024 worker pods. This results in dynamic provisioning of k8s worker nodes (possibly hundreds of them, depending on instance type)
- If a job gets stuck, it will be retried, but only once all other jobs are finished
- The stuck job may take a while to complete. During this time we keep all worker pods, and the k8s worker nodes under them, online, even though only one job is running
Another scenario is when a significant portion of the jobs has already finished. Only 50% of the worker pods may still be doing work, yet we keep all of them up.
What would you like to see?
It would be good for worker pods that no longer have work to be torn down before the mission finishes. Once they are torn down, Karpenter should be able to release some k8s worker nodes, which reduces cost.
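As a rough illustration of the scale-down rule, here is a minimal Python sketch. It assumes the mission can observe how many jobs are still queued and how many are running; the function name and its parameters are hypothetical and do not come from the existing mission code.

```python
def desired_worker_count(queued_jobs: int, running_jobs: int, current_workers: int) -> int:
    """Return how many worker pods are still needed (hypothetical helper).

    Assumption: each worker pod processes one job at a time, so we never
    need more pods than there are outstanding (queued + running) jobs.
    This sketch only ever scales down; it never asks for more pods than
    are currently running.
    """
    if min(queued_jobs, running_jobs, current_workers) < 0:
        raise ValueError("counts must be non-negative")
    outstanding = queued_jobs + running_jobs
    return min(current_workers, outstanding)
```

With 1024 pods, an empty queue, and one stuck job still running, this would ask for a single pod, so the remaining 1023 could be torn down and their nodes released.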
Perhaps we can also improve the retry logic to catch dead jobs earlier. If we can retry earlier in the run, we'll avoid the long tail.
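One possible shape for earlier retries is a per-job timeout checked while the mission is still running, instead of waiting for all other jobs to finish. The sketch below is only an assumption about how that could look: `RunningJob`, `jobs_to_retry`, the timeout, and the attempt limit are all hypothetical names and values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunningJob:
    job_id: str
    started_at: float  # seconds since some fixed reference point
    attempts: int = 1  # how many times this job has been started so far


def jobs_to_retry(running: List[RunningJob], now: float,
                  timeout_s: float, max_attempts: int = 2) -> List[str]:
    """Return IDs of jobs that look stuck and still have retries left.

    A job is treated as stuck when it has been running longer than
    timeout_s. Jobs that have already used max_attempts are left alone
    so that the mission can fail them instead of retrying forever.
    """
    return [
        job.job_id
        for job in running
        if now - job.started_at > timeout_s and job.attempts < max_attempts
    ]
```

Calling this periodically during the run would surface stuck jobs while other workers are still busy, so the retry overlaps with useful work instead of extending the tail.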
What alternatives are there?