[Improvement-16994][TaskPlugin] support retry for every api call for serverless spark#17476
Conversation
```java
StartJobRunRequest startJobRunRequest = buildStartJobRunRequest(aliyunServerlessSparkParameters);
StartJobRunResponse startJobRunResponse = RetryUtils.retryFunction(() -> {
    try {
        return aliyunServerlessSparkClient.startJobRun(
                aliyunServerlessSparkParameters.getWorkspaceId(), startJobRunRequest);
    } catch (Exception e) {
        throw new AliyunServerlessSparkTaskException("Failed to start job run!");
    }
});
```
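The wrapper above delegates the actual retry loop to `RetryUtils.retryFunction`. The real DolphinScheduler utility may have a different signature; the following is a minimal, self-contained sketch (hypothetical `maxAttempts`/`backoffMs` parameters) of what such a retry helper typically does: re-invoke the supplier on failure, back off between attempts, and rethrow the last exception once attempts are exhausted.

```java
import java.util.function.Supplier;

// Minimal sketch of a retry helper in the spirit of RetryUtils.retryFunction.
// The parameter names and signature are assumptions for illustration only.
public class RetrySketch {

    public static <T> T retryFunction(Supplier<T> action, int maxAttempts, long backoffMs) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        // Simple fixed backoff between attempts.
                        Thread.sleep(backoffMs);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw last;
                    }
                }
            }
        }
        // All attempts failed: surface the last error to the caller.
        throw last;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fails twice, then succeeds: the helper returns on the third attempt.
        String result = retryFunction(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new RuntimeException("transient failure");
            }
            return "ok";
        }, 5, 1L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Note that this blanket retry is what prompts the idempotency discussion below: it re-sends the request even when the first attempt may have succeeded server-side.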
There seems to be a timeout issue here: if the HTTP request times out, the client will retry, but the server side might have already handled the previous request, so the request could end up being handled twice. I'm unsure whether the service side has implemented idempotency handling, because a new token is passed on each attempt, so the server side cannot tell that the second request is a retry.
@ruanwenjun @abzymeatsjtu It looks like the token is generated and set when the request is initialized (line#257), so I assume idempotency is fine here?
Ah, I missed that. If the server side can implement idempotency handling via tokens, that would be a great solution.
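The key point in the exchange above is that the client token must be generated once when the request object is built, not per attempt, so that every retry sends the same token and the server can deduplicate. The request class below is a hypothetical stub, not the real Alibaba Cloud `StartJobRunRequest`; it only illustrates the property being discussed:

```java
import java.util.UUID;

// Hedged sketch: a client token fixed at request-construction time makes
// retries idempotent, because every retry reuses the same request object
// and therefore sends an identical token. The stub class is illustrative.
public class TokenRetrySketch {

    static class StartJobRunRequestStub {
        // Generated exactly once, when the request is constructed --
        // NOT regenerated on each retry attempt.
        final String clientToken = UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        StartJobRunRequestStub request = new StartJobRunRequestStub();
        String firstAttemptToken = request.clientToken;

        // Simulated retry loop: each attempt reuses the same request object,
        // so the server sees the same token every time and can deduplicate.
        boolean sameTokenOnEveryAttempt = true;
        for (int attempt = 0; attempt < 3; attempt++) {
            sameTokenOnEveryAttempt &= request.clientToken.equals(firstAttemptToken);
        }
        System.out.println(sameTokenOnEveryAttempt);
    }
}
```

If instead the token were regenerated inside the retry lambda, each attempt would look like a fresh request to the server, which is exactly the double-handling concern raised in the first comment.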
…serverless spark (apache#17476) * [Improvement-16994][TaskPlugin] support retry for every api call for serverless spark --------- Co-authored-by: sunyifan.syf <sunyifan.syf@alibaba-inc.com> Co-authored-by: Eric Gao <ericgao.apache@gmail.com>




Support retry for every API call for EMR Serverless Spark; this will improve the robustness of this task plugin against temporary malfunctions of the remote service.
part of #16994