| 1 | +# Amazon EMR Serverless |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +The Amazon EMR Serverless task type submits and monitors job runs on [Amazon EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html) applications. |
| 6 | +Unlike traditional EMR on EC2, EMR Serverless requires no cluster infrastructure management and scales compute resources automatically on demand, which makes it well suited to Spark and Hive workloads. |
| 7 | + |
| 8 | +In the background, the task uses [aws-java-sdk](https://aws.amazon.com/cn/sdk-for-java/) to convert the JSON parameters into a [StartJobRunRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/emrserverless/model/StartJobRunRequest.html) object, |
| 9 | +submits it to AWS via the [StartJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_StartJobRun.html), and then polls the job status via the [GetJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_GetJobRun.html) until the job completes. |
| 10 | + |
| 11 | +## Create Task |
| 12 | + |
| 13 | +- Click `Project Management -> Project Name -> Workflow Definition`, then click the `Create Workflow` button to open the DAG editing page. |
| 14 | +- Drag the `AmazonEMRServerless` task from the toolbar onto the artboard to complete the creation. |
| 15 | + |
| 16 | +## Task Parameters |
| 17 | + |
| 18 | +[//]: # (TODO: use the commented anchor below once our website template supports this syntax) |
| 19 | +[//]: # (- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) `Default Task Parameters` section for default parameters.) |
| 20 | + |
| 21 | +- For default parameters, refer to the `Default Task Parameters` section of the [DolphinScheduler Task Parameters Appendix](appendix.md). |
| 22 | + |
| 23 | +| **Parameter** | **Description** | |
| 24 | +|-------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |
| 25 | +| Application Id | EMR Serverless application ID (e.g. `00fkht2eodujab09`), obtainable from the [EMR Serverless Console](https://console.aws.amazon.com/emr/home#/serverless) | |
| 26 | +| Execution Role Arn | ARN of the IAM role used for job execution (e.g. `arn:aws:iam::123456789012:role/EMRServerlessRole`); this role needs permissions to access S3, Glue, and any other services the job uses | |
| 27 | +| Job Name | Job name (optional), used to identify the job in the EMR Serverless console | |
| 28 | +| StartJobRunRequest JSON | JSON corresponding to the `JobDriver` and `ConfigurationOverrides` portions of the [StartJobRunRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/emrserverless/model/StartJobRunRequest.html), see examples below. **Note**: `ApplicationId` and `ExecutionRoleArn` do not need to be included in the JSON as they are automatically injected from the form parameters above | |
| 29 | + |
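To illustrate how the form parameters and the `StartJobRunRequest JSON` field fit together, here is a minimal Python sketch of the merge step. It is not the plugin's code (which is Java and uses aws-java-sdk); the function name `build_start_job_run_request` and the rule that form values overwrite any same-named JSON keys are assumptions made for illustration.

```python
import json


def build_start_job_run_request(application_id, execution_role_arn,
                                request_json, job_name=None):
    """Hypothetical helper: merge the form fields into the user-supplied JSON.

    Returns a dict shaped like a StartJobRunRequest with UpperCamelCase keys.
    """
    request = json.loads(request_json)
    if not isinstance(request, dict):
        raise ValueError("StartJobRunRequest JSON must be a JSON object")

    # Assumption: values from the form always win over keys found in the JSON.
    request["ApplicationId"] = application_id
    request["ExecutionRoleArn"] = execution_role_arn
    if job_name:
        # The job name is optional, so it is only set when provided.
        request["Name"] = job_name
    return request
```

Calling it with the sample values from the table and a JSON body that contains only `JobDriver` yields a dictionary that also carries `ApplicationId` and `ExecutionRoleArn`.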
| 30 | + |
| 31 | + |
| 32 | +## Task Example |
| 33 | + |
| 34 | +### Submit a Spark Job |
| 35 | + |
| 36 | +This example shows how to create an `EMR_SERVERLESS` task node to submit a Spark job to an EMR Serverless application. |
| 37 | + |
| 38 | +StartJobRunRequest JSON example (Spark): |
| 39 | + |
| 40 | +```json |
| 41 | +{ |
| 42 | +  "JobDriver": { |
| 43 | +    "SparkSubmit": { |
| 44 | +      "EntryPoint": "s3://my-bucket/scripts/my-spark-job.jar", |
| 45 | +      "EntryPointArguments": [ |
| 46 | +        "s3://my-bucket/input/", |
| 47 | +        "s3://my-bucket/output/" |
| 48 | +      ], |
| 49 | +      "SparkSubmitParameters": "--class com.example.MySparkApp --conf spark.executor.cores=4 --conf spark.executor.memory=8g --conf spark.executor.instances=10" |
| 50 | +    } |
| 51 | +  }, |
| 52 | +  "ConfigurationOverrides": { |
| 53 | +    "MonitoringConfiguration": { |
| 54 | +      "S3MonitoringConfiguration": { |
| 55 | +        "LogUri": "s3://my-bucket/emr-serverless-logs/" |
| 56 | +      } |
| 57 | +    } |
| 58 | +  } |
| 59 | +} |
| 60 | +``` |
| 61 | + |
| 62 | +### Submit a Hive Job |
| 63 | + |
| 64 | +This example shows how to create an `EMR_SERVERLESS` task node to submit a Hive query job. |
| 65 | + |
| 66 | +StartJobRunRequest JSON example (Hive): |
| 67 | + |
| 68 | +```json |
| 69 | +{ |
| 70 | +  "JobDriver": { |
| 71 | +    "HiveSQL": { |
| 72 | +      "Query": "s3://my-bucket/scripts/my-hive-query.sql", |
| 73 | +      "Parameters": "--hiveconf hive.exec.dynamic.partition=true --hiveconf hive.exec.dynamic.partition.mode=nonstrict" |
| 74 | +    } |
| 75 | +  }, |
| 76 | +  "ConfigurationOverrides": { |
| 77 | +    "MonitoringConfiguration": { |
| 78 | +      "S3MonitoringConfiguration": { |
| 79 | +        "LogUri": "s3://my-bucket/emr-serverless-logs/" |
| 80 | +      } |
| 81 | +    }, |
| 82 | +    "ApplicationConfiguration": [ |
| 83 | +      { |
| 84 | +        "Classification": "hive-site", |
| 85 | +        "Properties": { |
| 86 | +          "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" |
| 87 | +        } |
| 88 | +      } |
| 89 | +    ] |
| 90 | +  } |
| 91 | +} |
| 92 | +``` |
| 93 | + |
| 94 | +## AWS Authentication Configuration |
| 95 | + |
| 96 | +The EMR Serverless task reads AWS credentials from the `aws.emr` section of the DolphinScheduler configuration file `conf/aws.yaml`. |
| 97 | + |
| 98 | +### Using IAM Role (Recommended) |
| 99 | + |
| 100 | +If the DolphinScheduler Worker node runs on an EC2 instance with an attached IAM Role: |
| 101 | + |
| 102 | +```yaml |
| 103 | +aws: |
| 104 | +  emr: |
| 105 | +    credentials.provider.type: InstanceProfileCredentialsProvider |
| 106 | +    region: us-east-1 |
| 107 | +``` |
| 108 | + |
| 109 | +### Using Access Key |
| 110 | + |
| 111 | +If you need to authenticate with an access key ID and secret access key (AK/SK): |
| 112 | + |
| 113 | +```yaml |
| 114 | +aws: |
| 115 | +  emr: |
| 116 | +    credentials.provider.type: AWSStaticCredentialsProvider |
| 117 | +    access.key.id: your-access-key-id |
| 118 | +    access.key.secret: your-secret-access-key |
| 119 | +    region: us-east-1 |
| 120 | +``` |
| 121 | + |
| 122 | +> **Note**: The `aws.emr` section configuration is shared by both EMR on EC2 and EMR Serverless task types. |
| 123 | + |
| 124 | +## Job State Transitions |
| 125 | + |
| 126 | +After an EMR Serverless job is submitted, DolphinScheduler polls the job status every 10 seconds: |
| 127 | + |
| 128 | +``` |
| 129 | +SUBMITTED → PENDING → SCHEDULED → RUNNING → SUCCESS |
| 130 | +                                          → FAILED |
| 131 | +                                          → CANCELLED |
| 132 | +``` |
| 133 | + |
| 134 | +- When a job reaches `SUCCESS` state, the task is marked as successful |
| 135 | +- When a job reaches `FAILED` or `CANCELLED` state, the task is marked as failed |
| 136 | +- If a DolphinScheduler task is killed, DolphinScheduler automatically calls the [CancelJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_CancelJobRun.html) to cancel the running job |
| 137 | + |
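The polling behaviour described above can be sketched as a small loop. This is an illustrative Python sketch, not the plugin's Java implementation: `get_state` stands in for a call to the GetJobRun API, and the function name, the injectable `sleep` parameter, and the boolean return value are assumptions for the example.

```python
import time

# Terminal states, taken from the state list in this section.
SUCCESS_STATES = {"SUCCESS"}
FAILURE_STATES = {"FAILED", "CANCELLED"}


def wait_for_job_run(get_state, poll_interval_seconds=10, sleep=time.sleep):
    """Poll get_state() until the job run reaches a terminal state.

    Returns True if the job run succeeded and False if it failed or was
    cancelled. `get_state` is a hypothetical callable that returns the
    current state string of the job run.
    """
    while True:
        state = get_state()
        if state in SUCCESS_STATES:
            return True
        if state in FAILURE_STATES:
            return False
        # Any other state (SUBMITTED, PENDING, SCHEDULED, RUNNING) is
        # treated as still in progress, so wait and poll again.
        sleep(poll_interval_seconds)
```

Passing `sleep` as a parameter keeps the loop easy to exercise in a test without actually waiting 10 seconds between polls.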
| 138 | +## Notice |
| 139 | + |
| 140 | +- The **Application Id** must correspond to a pre-existing EMR Serverless application (created via the AWS Console or API) in `STARTED` or `CREATED` state |
| 141 | +- The AWS credentials used by DolphinScheduler (configured in `conf/aws.yaml`) need at least `emr-serverless:StartJobRun`, `emr-serverless:GetJobRun`, and `emr-serverless:CancelJobRun`, plus `iam:PassRole` on the execution role; the **Execution Role** itself needs the S3, Glue, and other data access permissions required by the job |
| 142 | +- `StartJobRunRequest JSON` should NOT include `ApplicationId` or `ExecutionRoleArn` fields — they are automatically injected from the form parameters |
| 143 | +- The EMR Serverless task supports failover: when a Worker node fails, a new Worker can resume tracking the running job through `appIds` (which holds the `jobRunId`) |
| 144 | + |
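The failover note can be pictured with a short Python sketch. It only illustrates the idea that a stored `appIds` value (the `jobRunId`) lets a new Worker skip resubmission; the function name `resolve_job_run_id` and the `submit_job` callable are hypothetical and do not reflect the plugin's actual Java code.

```python
def resolve_job_run_id(app_ids, submit_job):
    """Decide whether to resume tracking an existing job run or submit a new one.

    `app_ids` is the value stored for the task instance (the jobRunId), or
    None/empty if nothing has been submitted yet. `submit_job` is a
    hypothetical callable that submits the job and returns a new jobRunId.
    Returns a (job_run_id, submitted_now) tuple.
    """
    if app_ids:
        # A previous Worker already submitted the job: reuse its jobRunId
        # instead of starting a duplicate job run.
        return app_ids, False
    return submit_job(), True
```

With this shape, a recovering Worker resumes status tracking from the returned `jobRunId` rather than launching a second copy of the job.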