What are you really trying to do?
When the SDK is operating under high CPU contention, it can emit workflow history that cannot be replayed. This permanently breaks the workflow, and the only way to recover is a reset or termination.
For example, suppose we want to start 1000 activities in parallel, sleeping between batches of 100:
    futures = 1000.times.map do |i|
      Temporalio::Workflow.sleep(1) if i.positive? && i % 100 == 0
      Temporalio::Workflow::Future.new do
        Temporalio::Workflow.execute_activity(MyActivity, i, schedule_to_close_timeout: 300)
      end
    end
    Temporalio::Workflow::Future.all_of(*futures).wait
    results = futures.map(&:result)
Describe the bug
In practice, under CPU contention the behavior is inconsistent: a batch of 100 starts is only partially flushed, and the remainder lands in the next window.
This leads to histories like:
Sleep
100x Starts
Sleep
50x Starts
Sleep
150x Starts
If this workflow then moves to a new worker, replay fails because of the unexpected partial batch and sleep. I would instead expect the partial window to hit a workflow task timeout, leading to a task retry rather than a partial flush.
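To make the mismatch concrete, here is a toy comparison in plain Ruby. It is only a sketch and does not use the SDK: `history` and `first_divergence` are hypothetical helpers, and the "replayed" sequence assumes that an uncontended worker would produce full batches of 100, which is my reading of the failure rather than something confirmed from SDK internals.

```ruby
# Toy model only: sequences of command kinds, not real Temporal history events.

# Build a flat command list from batch sizes: one timer, then n activity starts.
def history(batch_sizes)
  batch_sizes.flat_map { |n| [:timer] + [:start] * n }
end

# Return the index of the first command that differs, or nil if they match.
def first_divergence(recorded, replayed)
  recorded.zip(replayed).each_with_index do |(rec, rep), i|
    return i if rec != rep
  end
  return nil if recorded.length == replayed.length

  [recorded.length, replayed.length].min
end

recorded = history([100, 50, 150])   # what the contended worker wrote
replayed = history([100, 100, 100])  # assumed output of a clean replay

divergence = first_divergence(recorded, replayed)
puts "first mismatch at command #{divergence}: " \
     "recorded #{recorded[divergence].inspect}, replayed #{replayed[divergence].inspect}"
```

In this model the two sequences agree until the early timer that follows the partial batch of 50, which is the point where I would expect a replaying worker to report nondeterminism.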
Minimal Reproduction
I pointed Cursor+Fable(rip) at this and it created this test reproduction in the SDK: e0f9134. There is also a proposed fix on the fork/branch, but it is all fully AI slop. The behavior the test highlights does match what we saw in production.
Environment/Versions
- OS and processor: Linux
- Temporal Version: 1.4.0
- Are you using Docker or Kubernetes or building Temporal from source? Kubernetes
Additional context
While I think it may be intended behavior, the way Futures interact with other mechanisms (e.g. sleep) was very unexpected. From the above code I would expect 100 activity starts, then 1 timer started and fired. Instead we get 1 timer start, 100 activity starts, and then 1 timer fired. This is because the Future fibers are not processed until the sleep itself is started; then the fibers are unwound, and only then is the timer fired event handled.
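The ordering described above can be reproduced with a toy deferred-fiber scheduler in plain Ruby. This is a sketch of the behavior as I understand it, not the SDK's actual scheduler: `ToyScheduler` and all of its methods are hypothetical names made up for illustration.

```ruby
# Toy model only: plain Ruby fibers, not the Temporal SDK.
class ToyScheduler
  def initialize
    @commands = [] # commands in the order they would be emitted
    @ready = []    # fibers that were created but have not run yet
  end

  # Stand-in for creating a Future with a block: the fiber is queued, not run.
  def future(&block)
    @ready << Fiber.new(&block)
  end

  def start_activity(id)
    @commands << "start_activity(#{id})"
  end

  # Stand-in for a sleep: emit the timer, then yield so queued fibers can run.
  def sleep_timer
    @commands << 'start_timer'
    Fiber.yield(:sleeping)
    @commands << 'timer_fired'
  end

  def run(&workflow)
    main = Fiber.new(&workflow)
    state = main.resume                      # runs until the sleep yields
    @ready.shift.resume until @ready.empty?  # queued futures only run now
    main.resume if state == :sleeping        # finally the timer fires
    @commands
  end
end

sched = ToyScheduler.new
order = sched.run do
  3.times { |i| sched.future { sched.start_activity(i) } }
  sched.sleep_timer
end
puts order
```

With three futures instead of 100, the toy emits the timer start first, then the three activity starts, then the timer fired, which matches the ordering observed in the real workflow history.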