Add EMR plugin to trigger a Spark job and wait for its completion #563

anna-geller · 2024-12-03T18:34:04Z

Feature description

Trigger an EMR job and wait for its completion
(TBD) Spin up an on-demand EMR cluster and output its details so that a job can be submitted to it; the cluster can then be deleted

Specs

CreateCluster

type: io.kestra.plugin.aws.emr.cluster.CreateCluster
description: Create an EMR cluster

clusterName - STRING - required
releaseLabel - STRING, defaults to the latest release
- description: specifies the EMR release version label
applications - ARRAY ["Spark"]
enableDebugging - BOOLEAN
logUri - STRING
- description: a URI in S3 for log files; required when debugging is enabled
masterInstanceType - STRING
- description: EC2 instance type for master instances
slaveInstanceType - STRING
- description: EC2 instance type for slave instances
instanceCount - INTEGER
- description: number of instances

steps - ARRAY - list of jobs to run

- `actionOnFailure` - ENUM - 'TERMINATE_JOB_FLOW' | 'TERMINATE_CLUSTER' | 'CANCEL_AND_WAIT' | 'CONTINUE'
- `name` - STRING
- `hadoopJarStep` - OBJECT
  - `jar` - STRING, a path to a JAR file run during the step
  - `mainClass` - STRING, the name of the main class in the specified Java file. If not specified, the JAR file should specify a Main-Class in its manifest file.
  - `args` - ARRAY, a list of command line arguments passed to the JAR file’s main function when executed.

keepJobFlowAliveWhenNoSteps - BOOLEAN
vpcARN - STRING
- description: VPC ARN
subnetARN - STRING
- description: subnet ARN
emrRole - STRING
- description: IAM service role for EMR
emrEc2Role - STRING
- description: IAM role for the EC2 instance profile

(these below are probably needed to authenticate the Java client)

accessKeyId - STRING
secretKeyId - STRING
region - STRING
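
The one cross-property rule stated in the spec (a `logUri` is required when `enableDebugging` is on) and the required `clusterName` could be checked before any AWS call is made. Below is a minimal sketch of such a pre-flight check. The class name `CreateClusterChecks` and the `validate` signature are hypothetical and are not part of any existing Kestra or AWS API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight checks for the CreateCluster properties listed above. */
final class CreateClusterChecks {
    private CreateClusterChecks() {}

    /** Returns human-readable problems; an empty list means the inputs look consistent. */
    static List<String> validate(String clusterName, boolean enableDebugging, String logUri) {
        List<String> problems = new ArrayList<>();
        // clusterName is the only property marked "required" in the spec.
        if (clusterName == null || clusterName.isBlank()) {
            problems.add("clusterName is required");
        }
        // The spec states that a log URI is required when debugging is enabled.
        if (enableDebugging && (logUri == null || logUri.isBlank())) {
            problems.add("logUri is required when enableDebugging is true");
        }
        return problems;
    }
}
```

For instance, `CreateClusterChecks.validate("demo", true, null)` reports the missing `logUri`, while the same call with `enableDebugging` set to `false` returns an empty list.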

References & Example

AWS documentation example:

https://docs.aws.amazon.com/fr_fr/emr/latest/ManagementGuide/calling-emr-with-java-sdk.html
https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/javav2/example_code/emr/src/main/java/aws/example/emr/CreateCluster.java
https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/emr/package-summary.html
(mainly for @Ben8t to understand what's going on + good documentation) Python SDK, example Pandas SDK

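		// Classes below come from com.amazonaws.services.elasticmapreduce.model (AWS SDK for Java v1);
		// StepFactory is in com.amazonaws.services.elasticmapreduce.util.

		// Step that enables debugging in the AWS Management Console
		StepConfig enabledebugging = new StepConfig()
				.withName("Enable debugging")
				.withActionOnFailure("TERMINATE_JOB_FLOW")
				.withHadoopJarStep(new StepFactory().newEnableDebuggingStep());

		// Applications to install and configure when EMR creates the cluster
		Application hive = new Application().withName("Hive");
		Application spark = new Application().withName("Spark");
		Application ganglia = new Application().withName("Ganglia");
		Application zeppelin = new Application().withName("Zeppelin");

		RunJobFlowRequest request = new RunJobFlowRequest()
				.withName("MyClusterCreatedFromJava")
				.withReleaseLabel("emr-5.20.0") // specifies the EMR release version label, we recommend the latest release
				.withSteps(enabledebugging)
				.withApplications(hive, spark, ganglia, zeppelin)
				.withLogUri("s3://path/to/my/emr/logs") // a URI in S3 for log files is required when debugging is enabled
				.withServiceRole("EMR_DefaultRole") // replace the default with a custom IAM service role if one is used
				.withJobFlowRole("EMR_EC2_DefaultRole") // replace the default with a custom EMR role for the EC2 instance profile if one is used
				.withInstances(new JobFlowInstancesConfig()
						.withEc2SubnetId("subnet-12ab34c56")
						.withEc2KeyName("myEc2Key")
						.withInstanceCount(3)
						.withKeepJobFlowAliveWhenNoSteps(true)
						.withMasterInstanceType("m4.large")
						.withSlaveInstanceType("m4.large"));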

👉 Note for @mgabelle

It seems that `RunJobFlow` (called with a `RunJobFlowRequest`) is the operation to look for.

Let's implement the steps property so that Kestra users can either create a cluster and run a job directly, or simply create a cluster (KeepJobFlowAliveWhenNoSteps=true, see note below).

RunJobFlow creates and starts running a new cluster (job flow). The cluster runs the steps specified. After the steps complete, the cluster stops and the HDFS partition is lost. To prevent loss of data, configure the last step of the job flow to store results in Amazon S3.
If the JobFlowInstancesConfig KeepJobFlowAliveWhenNoSteps parameter is set to TRUE, the cluster transitions to the WAITING state rather than shutting down after the steps have completed.
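
Since the goal of the task is to wait for completion, the behaviour described above suggests a polling loop that stops once the cluster reaches WAITING (kept alive) or a terminated state. Below is a minimal sketch of that loop, using only the JDK. The class `EmrClusterWaiter`, the `stateSupplier` hook, and the `maxPolls` limit are hypothetical; in a real task the supplier would presumably wrap a DescribeCluster call and return the cluster state name.

```java
import java.time.Duration;
import java.util.Set;
import java.util.function.Supplier;

/** Polls a cluster-state source until the cluster stops making progress on its own. */
final class EmrClusterWaiter {
    // EMR cluster states after which no further step execution is expected.
    private static final Set<String> RESTING_STATES =
            Set.of("WAITING", "TERMINATED", "TERMINATED_WITH_ERRORS");

    private EmrClusterWaiter() {}

    /**
     * @param stateSupplier hypothetical hook returning the current cluster state name
     * @param pollInterval  pause between two polls
     * @param maxPolls      assumption: give up after this many polls instead of waiting forever
     * @return the resting state that ended the wait
     */
    static String waitForCompletion(Supplier<String> stateSupplier, Duration pollInterval, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            String state = stateSupplier.get();
            if (RESTING_STATES.contains(state)) {
                return state;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for the cluster", e);
            }
        }
        throw new IllegalStateException("Cluster did not reach a resting state after " + maxPolls + " polls");
    }
}
```

A caller would treat WAITING or TERMINATED as success and TERMINATED_WITH_ERRORS as a failed run; whether the task should also inspect individual step statuses is left open here.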

DeleteCluster

type: io.kestra.plugin.aws.emr.cluster.DeleteCluster
description: Delete an EMR cluster

id: cluster id to delete

It seems that `TerminateJobFlows` (called with a `TerminateJobFlowsRequest`) is the operation to use: first list the clusters, then call TerminateJobFlows. See this example

AddJobFlowsSteps

type: io.kestra.plugin.aws.emr.cluster.AddJobFlowsSteps
description: Adds new steps to a running cluster
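
For a Spark job, a step's `hadoopJarStep` is commonly filled by pointing `jar` at EMR's `command-runner.jar` and passing a `spark-submit` command line as `args`. Below is a minimal sketch of a helper that assembles those `args`. The class `SparkStepArgs`, its parameters, and the choice of `--deploy-mode cluster` are assumptions made for illustration, not part of the spec above.

```java
import java.util.ArrayList;
import java.util.List;

/** Builds the args list for a step whose jar is command-runner.jar. */
final class SparkStepArgs {
    private SparkStepArgs() {}

    /**
     * @param mainClass      entry-point class, or null to rely on the JAR manifest
     * @param applicationJar location of the Spark application JAR (for example an S3 URI)
     * @param appArgs        arguments forwarded to the application itself
     */
    static List<String> sparkSubmitArgs(String mainClass, String applicationJar, List<String> appArgs) {
        List<String> args = new ArrayList<>();
        args.add("spark-submit");
        // Assumption: run the driver on the cluster rather than on the primary node.
        args.add("--deploy-mode");
        args.add("cluster");
        if (mainClass != null) {
            args.add("--class");
            args.add(mainClass);
        }
        args.add(applicationJar);
        args.addAll(appArgs);
        return List.copyOf(args);
    }
}
```

The returned list would go into the step's `args` property, with `jar` set to `command-runner.jar`, whether the step is passed at cluster creation or added to a running cluster.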


Ben8t · 2024-12-04T08:28:16Z

For references:

Ben8t · 2024-12-09T15:03:28Z

@mgabelle use case looks like these two blueprints:

Ben8t · 2024-12-12T14:04:25Z

ping @mgabelle I added more specs in the main comment above. Please check on your side and sync with me whenever needed 👍

anna-geller added the enhancement, area/plugin, and kind/customer-request labels Dec 3, 2024

anna-geller assigned mgabelle Dec 3, 2024

anna-geller added this to Issues Dec 3, 2024

github-project-automation bot moved this to Backlog in Issues Dec 3, 2024
