Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add EMR plugin to trigger a Spark job and wait for its completion #563

Open
2 tasks
anna-geller opened this issue Dec 3, 2024 · 3 comments
Open
2 tasks
Assignees
Labels
area/plugin Plugin-related issue or feature request enhancement New feature or request kind/customer-request Requested by one or more customers

Comments

@anna-geller
Copy link
Member

anna-geller commented Dec 3, 2024

Feature description

  • Trigger an EMR job and wait for its completion
  • (TBD) spin up an on-demand EMR cluster + output its details so that a job can be submitted to it + the cluster can then be deleted

Specs

CreateCluster

type: io.kestra.plugin.aws.emr.cluster.CreateCluster
description: Create an EMR cluster

  • clusterName - STRING - required

  • releaseLabel- STRING, default to latest
    description: specifies the EMR release version label

  • applications - ARRAY ["Spark"]

  • enableDebugging - BOOLEAN

  • logUri - STRING
    description: a URI in S3 for log files is required when debugging is enabled

  • masterInstanceType - STRING

    • description: EC2 instance type for master instances
  • slaveInstanceType - STRING

    • description: EC2 isntance type for slave instances
  • instanceCount - INTEGER

    • description: number of instances
  • steps - ARRAY - list of jobs to run

    - `actionOnFailure` - ENUM - 'TERMINATE_JOB_FLOW'|'TERMINATE_CLUSTER'|'CANCEL_AND_WAIT'|'CONTINUE',
    - `name` - STRING
    - `hadoopJarStep` -  OBJECT
    - `jar` - STRING, A path to a JAR file run during the step
    - `mainClass` - STRING, The name of the main class in the specified Java file. If not specified, the JAR file should specify a Main-Class in its manifest file.
    -  `args` - ARRAY A list of command line arguments passed to the JAR file’s main function when executed.
    
  • keepJobFlowAliveWhenNoSteps - BOOLEAN

  • vpcARN - STRING
    description: VPC arn

  • subnetARN - STRING
    description: Subnet arn

  • emrRole - STRING
    description: IAM service role

  • emrEc2Role - STRING
    description: IAM service role

(these below are probably needed to authenticate the Java client)

  • accessKeyId: STRING
  • secretKeyId: STRING
  • region: STRING

References & Example

AWS documentation example:

		RunJobFlowRequest request = new RunJobFlowRequest()
				.withName("MyClusterCreatedFromJava")
				.withReleaseLabel("emr-5.20.0") // specifies the EMR release version label, we recommend the latest release
				.withSteps(enabledebugging)
				.withApplications(hive, spark, ganglia, zeppelin)
				.withLogUri("s3://path/to/my/emr/logs") // a URI in S3 for log files is required when debugging is enabled
				.withServiceRole("EMR_DefaultRole") // replace the default with a custom IAM service role if one is used
				.withJobFlowRole("EMR_EC2_DefaultRole") // replace the default with a custom EMR role for the EC2 instance
																								// profile if one is used
				.withInstances(new JobFlowInstancesConfig()
						.withEc2SubnetId("subnet-12ab34c56")
						.withEc2KeyName("myEc2Key")
						.withInstanceCount(3)
						.withKeepJobFlowAliveWhenNoSteps(true)
						.withMasterInstanceType("m4.large")
						.withSlaveInstanceType("m4.large"));

👉 Note for @mgabelle

It seems like the RunJobFlowRequest is the method to look for

Let's implement the steps property so Kestra user can either: create cluster and run job directly OR simply create cluster (KeepJobFlowAliveWhenNoSteps=true, see note below)

RunJobFlow creates and starts running a new cluster (job flow). The cluster runs the steps specified. After the steps complete, the cluster stops and the HDFS partition is lost. To prevent loss of data, configure the last step of the job flow to store results in Amazon S3.
If the JobFlowInstancesConfig KeepJobFlowAliveWhenNoSteps parameter is set to TRUE, the cluster transitions to the WAITING state rather than shutting down after the steps have completed.

DeleteCluster

type: io.kestra.plugin.aws.emr.cluster.DeleteCluster
description: Delete an EMR cluster

  • id: cluster id to delete

Seems like the TerminateJobFlowsRequest is the method to use (first list clusters and then call the TerminateJobFlows). See this example

AddJobFlowsSteps

type: io.kestra.plugin.aws.emr.cluster.AddJobFlowsSteps
description: Adds new steps to a running cluster

@anna-geller anna-geller added enhancement New feature or request area/plugin Plugin-related issue or feature request kind/customer-request Requested by one or more customers labels Dec 3, 2024
@github-project-automation github-project-automation bot moved this to Backlog in Issues Dec 3, 2024
@Ben8t
Copy link
Member

Ben8t commented Dec 4, 2024

For references:

@Ben8t
Copy link
Member

Ben8t commented Dec 9, 2024

@mgabelle use case looks like these two blueprints:

@Ben8t
Copy link
Member

Ben8t commented Dec 12, 2024

ping @mgabelle I added more specs in the main comment above. Please check on your side and sync with me whenever needed 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/plugin Plugin-related issue or feature request enhancement New feature or request kind/customer-request Requested by one or more customers
Projects
Status: Backlog
Development

No branches or pull requests

3 participants