[Bug] Sync primitives can lead to un-replayable workflow history. #464

@macb

Description

What are you really trying to do?

When the SDK is operating under high CPU contention, it can emit an un-replayable workflow history. This permanently breaks the workflow, which then requires a reset or termination to resolve.

For example, suppose we want to start 1000 activities in parallel, with a sleep between batches of 100:

futures = 1000.times.map do |i|
  Temporalio::Workflow.sleep(1) if i.positive? && i % 100 == 0

  Temporalio::Workflow::Future.new do
    Temporalio::Workflow.execute_activity(MyActivity, i, schedule_to_close_timeout: 300)
  end
end

Temporalio::Workflow::Future.all_of(*futures).wait
results = futures.map(&:result)

Describe the bug

Under CPU contention the behavior is inconsistent: the workflow history can end up with only part of a batch of 100 starts flushed before a sleep, with the rest landing in the next window.

This leads to a history like:
Sleep
100x Starts
Sleep
50x Starts
Sleep
150x Starts

If this workflow then moves to a new worker, the replay fails because of the unexpected partial batch + sleep. Instead, I would expect the partial window to hit a workflow task timeout error, leading to a task retry rather than a partial flush.
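To make the mismatch concrete, here is a toy model in plain Ruby. It is only a sketch: the `expand` and `first_divergence` helpers and the `:timer` / `:activity` symbols are made up for illustration and are not how the SDK or server represent history. It compares the sequence recorded above with the full-batch sequence a fresh worker would presumably produce from the same code.

```ruby
# Toy model only: not the SDK's or the server's real history format.
# Each run is [command_kind, count].

# Expand run-length pairs into a flat list of command symbols.
def expand(runs)
  runs.flat_map { |kind, count| [kind] * count }
end

# Index of the first position where the two sequences disagree,
# or nil when they are identical.
def first_divergence(recorded, replayed)
  limit = [recorded.length, replayed.length].max
  (0...limit).find { |i| recorded[i] != replayed[i] }
end

# Sequence from the report: the second batch was only partially flushed.
recorded = expand([
  [:timer, 1], [:activity, 100],
  [:timer, 1], [:activity, 50],
  [:timer, 1], [:activity, 150]
])

# Assumed sequence on replay: every batch is a full 100 starts.
replayed = expand([
  [:timer, 1], [:activity, 100],
  [:timer, 1], [:activity, 100],
  [:timer, 1], [:activity, 100]
])

divergence = first_divergence(recorded, replayed)
puts "first mismatch at command #{divergence}: " \
     "recorded #{recorded[divergence]}, replayed #{replayed[divergence]}"
```

In this model the two sequences agree until the recorded history contains a timer where the replaying code would start another activity, which is the kind of mismatch a replay would be expected to reject.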

Minimal Reproduction

I pointed Cursor+Fable(rip) at this and it created this test reproduction in the SDK: e0f9134. There is also a proposed fix on the fork/branch, but it is all fully AI slop. The behavior it highlights does match what we saw in production.

Environment/Versions

  • OS and processor: Linux
  • Temporal Version: 1.4.0
  • Are you using Docker or Kubernetes or building Temporal from source? Kubernetes

Additional context

While I think it may be intended behavior, the way Futures interact with other mechanisms (e.g. Sleep) was very unexpected. From the code above, I would expect 100 starts, then 1 timer started + fired. Instead we get 1 timer started, 100 starts, and then 1 timer fired. This is because the Future fibers are not processed until the sleep itself is started; then the fibers are unwound, and only then is the timer fired event handled.
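The ordering described above can be mimicked with a small stand-in scheduler. This is a sketch under assumptions, not the SDK's implementation: `ToyScheduler`, `future`, `start_activity`, and `toy_sleep` are hypothetical names, and the only behavior modeled is "queued blocks run once the sleep has started, before the timer fires".

```ruby
# Hypothetical stand-in for the workflow scheduler; not Temporal SDK code.
class ToyScheduler
  attr_reader :log

  def initialize
    @log = []      # ordered list of emitted events
    @pending = []  # blocks queued by #future, not yet run
  end

  # Stand-in for Future.new: queue the block without running it.
  def future(&block)
    @pending << block
  end

  # Stand-in for scheduling an activity: just record the event.
  def start_activity(index)
    @log << [:activity_scheduled, index]
  end

  # Stand-in for Workflow.sleep: the timer starts first, queued
  # blocks run while "waiting", and only then does the timer fire.
  def toy_sleep
    @log << [:timer_started]
    @pending.shift.call until @pending.empty?
    @log << [:timer_fired]
  end
end

scheduler = ToyScheduler.new
3.times { |i| scheduler.future { scheduler.start_activity(i) } }
scheduler.toy_sleep

kinds = scheduler.log.map(&:first)
puts kinds.inspect
```

With three queued blocks, the log shows the timer start ahead of all three activity events, matching the order reported here rather than the order a reader of the loop might expect.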

Metadata

Labels

bug: Something isn't working
