Skip to content

Commit 8077c2b

Browse files
authored
[Feature-18070][Task] Add Amazon EMR Serverless task plugin (#18069)
1 parent 08db465 commit 8077c2b

38 files changed

Lines changed: 1958 additions & 0 deletions

File tree

.github/workflows/api-test.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -121,6 +121,8 @@ jobs:
121121
class: org.apache.dolphinscheduler.api.test.cases.OidcLoginAPITest
122122
- name: DependentTaskAPITest
123123
class: org.apache.dolphinscheduler.api.test.cases.tasks.DependentTaskAPITest
124+
- name: EmrServerlessTaskAPITest
125+
class: org.apache.dolphinscheduler.api.test.cases.tasks.EmrServerlessTaskAPITest
124126
env:
125127
RECORDING_PATH: /tmp/recording-${{ matrix.case.name }}
126128
steps:

docs/configs/docsdev.js

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -153,6 +153,10 @@ export default {
153153
title: 'Amazon EMR',
154154
link: '/en-us/docs/dev/user_doc/guide/task/emr.html',
155155
},
156+
{
157+
title: 'Amazon EMR Serverless',
158+
link: '/en-us/docs/dev/user_doc/guide/task/emr-serverless.html',
159+
},
156160
{
157161
title: 'Apache Zeppelin',
158162
link: '/en-us/docs/dev/user_doc/guide/task/zeppelin.html',
@@ -877,6 +881,10 @@ export default {
877881
title: 'Amazon EMR',
878882
link: '/zh-cn/docs/dev/user_doc/guide/task/emr.html',
879883
},
884+
{
885+
title: 'Amazon EMR Serverless',
886+
link: '/zh-cn/docs/dev/user_doc/guide/task/emr-serverless.html',
887+
},
880888
{
881889
title: 'Apache Zeppelin',
882890
link: '/zh-cn/docs/dev/user_doc/guide/task/zeppelin.html',
Lines changed: 144 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,144 @@
1+
# Amazon EMR Serverless
2+
3+
## Overview
4+
5+
Amazon EMR Serverless task type, for submitting and monitoring job runs on [Amazon EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html) applications.
6+
Unlike traditional EMR on EC2, EMR Serverless requires no cluster infrastructure management and automatically scales compute resources on demand, suitable for Spark and Hive workloads.
7+
8+
Using [aws-java-sdk](https://aws.amazon.com/cn/sdk-for-java/) in the background code, to transfer JSON parameters to a [StartJobRunRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/emrserverless/model/StartJobRunRequest.html) object
9+
and submit it to AWS via the [StartJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_StartJobRun.html), then poll job status via the [GetJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_GetJobRun.html) until completion.
10+
11+
## Create Task
12+
13+
- Click `Project Management -> Project Name -> Workflow Definition`, click the `Create Workflow` button to enter the DAG editing page.
14+
- Drag `AmazonEMRServerless` task from the toolbar to the artboard to complete the creation.
15+
16+
## Task Parameters
17+
18+
[//]: # (TODO: use the commented anchor below once our website template supports this syntax)
19+
[//]: # (- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) `Default Task Parameters` section for default parameters.)
20+
21+
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md) `Default Task Parameters` section for default parameters.
22+
23+
| **Parameter** | **Description** |
24+
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
25+
| Application Id | EMR Serverless application ID (e.g. `00fkht2eodujab09`), obtainable from the [EMR Serverless Console](https://console.aws.amazon.com/emr/home#/serverless) |
26+
| Execution Role Arn | ARN of the IAM role for job execution (e.g. `arn:aws:iam::123456789012:role/EMRServerlessRole`), this role needs permissions to access S3, Glue, and other services |
27+
| Job Name | Job name (optional), used to identify the job in the EMR Serverless console |
28+
| StartJobRunRequest JSON | JSON corresponding to the `JobDriver` and `ConfigurationOverrides` portions of the [StartJobRunRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/emrserverless/model/StartJobRunRequest.html), see examples below. **Note**: `ApplicationId` and `ExecutionRoleArn` do not need to be included in the JSON as they are automatically injected from the form parameters above |
29+
30+
![RUN_JOB_FLOW](../../../../img/tasks/demo/emr_serverless_create.png)
31+
32+
## Task Example
33+
34+
### Submit a Spark Job
35+
36+
This example shows how to create an `EMR_SERVERLESS` task node to submit a Spark job to an EMR Serverless application.
37+
38+
StartJobRunRequest JSON example (Spark):
39+
40+
```json
41+
{
42+
"JobDriver": {
43+
"SparkSubmit": {
44+
"EntryPoint": "s3://my-bucket/scripts/my-spark-job.jar",
45+
"EntryPointArguments": [
46+
"s3://my-bucket/input/",
47+
"s3://my-bucket/output/"
48+
],
49+
"SparkSubmitParameters": "--class com.example.MySparkApp --conf spark.executor.cores=4 --conf spark.executor.memory=8g --conf spark.executor.instances=10"
50+
}
51+
},
52+
"ConfigurationOverrides": {
53+
"MonitoringConfiguration": {
54+
"S3MonitoringConfiguration": {
55+
"LogUri": "s3://my-bucket/emr-serverless-logs/"
56+
}
57+
}
58+
}
59+
}
60+
```
61+
62+
### Submit a Hive Job
63+
64+
This example shows how to create an `EMR_SERVERLESS` task node to submit a Hive query job.
65+
66+
StartJobRunRequest JSON example (Hive):
67+
68+
```json
69+
{
70+
"JobDriver": {
71+
"HiveSQL": {
72+
"Query": "s3://my-bucket/scripts/my-hive-query.sql",
73+
"Parameters": "--hiveconf hive.exec.dynamic.partition=true --hiveconf hive.exec.dynamic.partition.mode=nonstrict"
74+
}
75+
},
76+
"ConfigurationOverrides": {
77+
"MonitoringConfiguration": {
78+
"S3MonitoringConfiguration": {
79+
"LogUri": "s3://my-bucket/emr-serverless-logs/"
80+
}
81+
},
82+
"ApplicationConfiguration": [
83+
{
84+
"Classification": "hive-site",
85+
"Properties": {
86+
"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
87+
}
88+
}
89+
]
90+
}
91+
}
92+
```
93+
94+
## AWS Authentication Configuration
95+
96+
The EMR Serverless task reads AWS credentials from the DolphinScheduler `aws.yaml` configuration file, under the `aws.emr` section at `conf/aws.yaml`.
97+
98+
### Using IAM Role (Recommended)
99+
100+
If the DolphinScheduler Worker node runs on an EC2 instance with an attached IAM Role:
101+
102+
```yaml
103+
aws:
104+
emr:
105+
credentials.provider.type: InstanceProfileCredentialsProvider
106+
region: us-east-1
107+
```
108+
109+
### Using Access Key
110+
111+
If you need to authenticate using AK/SK:
112+
113+
```yaml
114+
aws:
115+
emr:
116+
credentials.provider.type: AWSStaticCredentialsProvider
117+
access.key.id: your-access-key-id
118+
access.key.secret: your-secret-access-key
119+
region: us-east-1
120+
```
121+
122+
> **Note**: The `aws.emr` section configuration is shared by both EMR on EC2 and EMR Serverless task types.
123+
124+
## Job State Transitions
125+
126+
After an EMR Serverless job is submitted, DolphinScheduler polls the job status every 10 seconds:
127+
128+
```
129+
SUBMITTED → PENDING → SCHEDULED → RUNNING → SUCCESS
130+
→ FAILED
131+
→ CANCELLED
132+
```
133+
134+
- When a job reaches `SUCCESS` state, the task is marked as successful
135+
- When a job reaches `FAILED` or `CANCELLED` state, the task is marked as failed
136+
- If a DolphinScheduler task is killed, it automatically calls the [CancelJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_CancelJobRun.html) to cancel the running job
137+
138+
## Notice
139+
140+
- The **Application Id** must correspond to a pre-existing EMR Serverless application (created via the AWS Console or API) in `STARTED` or `CREATED` state
141+
- The **Execution Role** requires the following minimum permissions: `emr-serverless:StartJobRun`, `emr-serverless:GetJobRun`, `emr-serverless:CancelJobRun`, plus S3, Glue and other data access permissions required by the job
142+
- `StartJobRunRequest JSON` should NOT include `ApplicationId` or `ExecutionRoleArn` fields — they are automatically injected from the form parameters
143+
- EMR Serverless task supports failover: when a Worker node fails, a new Worker can recover tracking of running jobs through `appIds` (the `jobRunId`)
144+
Lines changed: 144 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,144 @@
1+
# Amazon EMR Serverless
2+
3+
## 综述
4+
5+
Amazon EMR Serverless 任务类型,用于向 [Amazon EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html) 应用程序提交并监控作业运行。
6+
与传统的 EMR on EC2 不同,EMR Serverless 无需管理集群基础设施,按需自动扩缩容计算资源,适用于 Spark 和 Hive 工作负载。
7+
8+
后台使用 [aws-java-sdk](https://aws.amazon.com/cn/sdk-for-java/) 将 JSON 参数转换为 [StartJobRunRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/emrserverless/model/StartJobRunRequest.html) 对象,
9+
通过 [StartJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_StartJobRun.html) 提交到 AWS,并通过 [GetJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_GetJobRun.html) 轮询作业状态直到完成。
10+
11+
## 创建任务
12+
13+
- 点击 `项目管理 -> 项目名称 -> 工作流定义`,点击 `创建工作流` 按钮进入 DAG 编辑页面。
14+
- 从工具栏中拖拽 `AmazonEMRServerless` 任务到画布中完成创建。
15+
16+
## 任务参数
17+
18+
[//]: # (TODO: use the commented anchor below once our website template supports this syntax)
19+
[//]: # (- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)`默认任务参数`一栏。)
20+
21+
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md)`默认任务参数`一栏。
22+
23+
| **任务参数** | **描述** |
24+
|-------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
25+
| Application Id | EMR Serverless 应用程序 ID(格式如 `00fkht2eodujab09`),可在 [EMR Serverless 控制台](https://console.aws.amazon.com/emr/home#/serverless) 获取 |
26+
| Execution Role Arn | 作业执行 IAM 角色的 ARN(格式如 `arn:aws:iam::123456789012:role/EMRServerlessRole`),该角色需要有访问 S3、Glue 等服务的权限 |
27+
| Job Name | 作业名称(可选),用于在 EMR Serverless 控制台中标识作业 |
28+
| StartJobRunRequest JSON | [StartJobRunRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/emrserverless/model/StartJobRunRequest.html)`JobDriver``ConfigurationOverrides` 部分对应的 JSON,详细定义见下方示例。**注意**`ApplicationId``ExecutionRoleArn` 无需在 JSON 中重复填写,系统会自动从上方参数注入 |
29+
30+
![RUN_JOB_FLOW](../../../../img/tasks/demo/emr_serverless_create.png)
31+
32+
## 任务样例
33+
34+
### 提交 Spark 作业
35+
36+
该样例展示了如何创建 `EMR_SERVERLESS` 任务节点来提交一个 Spark 作业到 EMR Serverless 应用程序。
37+
38+
StartJobRunRequest JSON 参数样例(Spark):
39+
40+
```json
41+
{
42+
"JobDriver": {
43+
"SparkSubmit": {
44+
"EntryPoint": "s3://my-bucket/scripts/my-spark-job.jar",
45+
"EntryPointArguments": [
46+
"s3://my-bucket/input/",
47+
"s3://my-bucket/output/"
48+
],
49+
"SparkSubmitParameters": "--class com.example.MySparkApp --conf spark.executor.cores=4 --conf spark.executor.memory=8g --conf spark.executor.instances=10"
50+
}
51+
},
52+
"ConfigurationOverrides": {
53+
"MonitoringConfiguration": {
54+
"S3MonitoringConfiguration": {
55+
"LogUri": "s3://my-bucket/emr-serverless-logs/"
56+
}
57+
}
58+
}
59+
}
60+
```
61+
62+
### 提交 Hive 作业
63+
64+
该样例展示了如何创建 `EMR_SERVERLESS` 任务节点来提交一个 Hive 查询作业。
65+
66+
StartJobRunRequest JSON 参数样例(Hive):
67+
68+
```json
69+
{
70+
"JobDriver": {
71+
"HiveSQL": {
72+
"Query": "s3://my-bucket/scripts/my-hive-query.sql",
73+
"Parameters": "--hiveconf hive.exec.dynamic.partition=true --hiveconf hive.exec.dynamic.partition.mode=nonstrict"
74+
}
75+
},
76+
"ConfigurationOverrides": {
77+
"MonitoringConfiguration": {
78+
"S3MonitoringConfiguration": {
79+
"LogUri": "s3://my-bucket/emr-serverless-logs/"
80+
}
81+
},
82+
"ApplicationConfiguration": [
83+
{
84+
"Classification": "hive-site",
85+
"Properties": {
86+
"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
87+
}
88+
}
89+
]
90+
}
91+
}
92+
```
93+
94+
## AWS 认证配置
95+
96+
EMR Serverless 任务通过 DolphinScheduler 的 `aws.yaml` 配置文件读取 AWS 认证信息,配置路径为 `conf/aws.yaml` 中的 `aws.emr` 段。
97+
98+
### 使用 IAM Role(推荐)
99+
100+
如果 DolphinScheduler Worker 节点运行在 EC2 实例上并已绑定 IAM Role,配置如下:
101+
102+
```yaml
103+
aws:
104+
emr:
105+
credentials.provider.type: InstanceProfileCredentialsProvider
106+
region: us-east-1
107+
```
108+
109+
### 使用 Access Key
110+
111+
如果需要使用 AK/SK 方式认证:
112+
113+
```yaml
114+
aws:
115+
emr:
116+
credentials.provider.type: AWSStaticCredentialsProvider
117+
access.key.id: your-access-key-id
118+
access.key.secret: your-secret-access-key
119+
region: us-east-1
120+
```
121+
122+
> **注意**:`aws.emr` 段的配置同时被 EMR on EC2 和 EMR Serverless 任务类型共享。
123+
124+
## 作业状态流转
125+
126+
EMR Serverless 作业提交后,DolphinScheduler 会每 10 秒轮询一次作业状态:
127+
128+
```
129+
SUBMITTED → PENDING → SCHEDULED → RUNNING → SUCCESS
130+
→ FAILED
131+
→ CANCELLED
132+
```
133+
134+
- 作业进入 `SUCCESS` 状态时,任务标记为成功
135+
- 作业进入 `FAILED` 或 `CANCELLED` 状态时,任务标记为失败
136+
- 如果 DolphinScheduler 任务被终止,会自动调用 [CancelJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_CancelJobRun.html) 取消正在运行的作业
137+
138+
## 注意事项
139+
140+
- **Application Id** 对应的 EMR Serverless 应用程序需要预先在 AWS 控制台或通过 API 创建,并确保处于 `STARTED` 或 `CREATED` 状态
141+
- **Execution Role** 需要有以下最小权限:`emr-serverless:StartJobRun`、`emr-serverless:GetJobRun`、`emr-serverless:CancelJobRun`,以及作业所需的 S3、Glue 等数据访问权限
142+
- `StartJobRunRequest JSON` 中无需填写 `ApplicationId` 和 `ExecutionRoleArn` 字段,系统会自动从表单参数注入
143+
- EMR Serverless 任务支持故障转移(Failover):当 Worker 节点发生故障时,新的 Worker 可以通过 `appIds`(即 `jobRunId`)恢复对正在运行作业的跟踪
144+
284 KB
Loading

0 commit comments

Comments
 (0)