Create an EMR cluster and submit a job using Boto3

  1. Create a job to submit as a step to the EMR cluster.
  2. Copy the executable jar file of the job into a bucket in AWS S3.
  3. Create an AWS EMR cluster and add the step details, such as the location of the jar file and its arguments, as part of the cluster creation.

For this post, I’m going to create a simple Java program that copies a file from one S3 bucket to another. I’m using the aws-java-sdk to access the S3 APIs. The prerequisites are to add the aws-java-sdk dependency to the pom file and to set up your AWS credentials using the AWS CLI.

The entire source code can be grabbed from my repo.

Create the executable jar using any build tool. I have used Maven in this example. Once the jar has been built (mvn clean install), upload it to one of your S3 buckets.

The executable jar file of the EMR job
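If you prefer to script the upload rather than use the console, a minimal sketch with Boto3 could look like the following. The local jar path, bucket name, and object key are hypothetical placeholders, and the helper name upload_jar is mine, not part of the original project.

```python
def upload_jar(s3_client, jar_path, bucket, key):
    """Upload a local jar to S3 and return its s3:// URI."""
    # upload_file(Filename, Bucket, Key) is the Boto3 S3 client's managed upload call.
    s3_client.upload_file(jar_path, bucket, key)
    return f"s3://{bucket}/{key}"


def main():
    # Imported here so that upload_jar can be exercised without Boto3 or AWS access.
    import boto3

    # Credentials are resolved from the default chain (e.g. those set via the AWS CLI).
    s3 = boto3.client("s3")

    # Hypothetical path, bucket, and key: replace them with your own values.
    jar_uri = upload_jar(
        s3,
        "target/s3-copy-job.jar",
        "my-emr-artifacts",
        "jars/s3-copy-job.jar",
    )
    print(jar_uri)
```

Calling main() with valid credentials should print the S3 URI of the uploaded jar, which is the value the EMR step needs later on.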

We can use the Boto3 EMR client to create a cluster and submit the job as part of the cluster creation.
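A minimal sketch of such a call is shown here. It is an illustration under assumptions, not necessarily the exact code used for this post: the cluster name, bucket names, release label, instance types, key pair, subnet ID, tag, and the order of the jar arguments are all hypothetical placeholders, and the role names are the EMR defaults, which may differ in your account.

```python
def make_emr_client(region_name):
    """Create an EMR client using credentials from the default chain (e.g. the AWS CLI config)."""
    # Imported here so the request builder below can be exercised without Boto3 installed.
    import boto3

    return boto3.client("emr", region_name=region_name)


def build_cluster_request(jar_uri, job_args):
    """Assemble the keyword arguments for run_job_flow. All concrete values are placeholders."""
    return {
        "Name": "s3-copy-cluster",
        "LogUri": "s3://my-emr-logs/",
        "ReleaseLabel": "emr-5.30.0",
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "Ec2KeyName": "my-key-pair",
            "Ec2SubnetId": "subnet-0123456789abcdef0",
            # Keep the cluster alive (WAITING) once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "Steps": [
            {
                "Name": "copy-file-between-buckets",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {"Jar": jar_uri, "Args": list(job_args)},
            }
        ],
        "VisibleToAllUsers": True,
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "Tags": [{"Key": "project", "Value": "emr-demo"}],
    }


def create_cluster(emr_client, request):
    """Create the cluster with its step and return the cluster (job flow) ID."""
    response = emr_client.run_job_flow(**request)
    return response["JobFlowId"]
```

Usage would be along the lines of create_cluster(make_emr_client("us-east-1"), build_cluster_request(jar_uri, ["source-bucket", "dest-bucket", "file.txt"])), where the region and arguments are again assumptions. Keeping the request in a plain dictionary makes it easy to inspect before any AWS call is made.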

In the above code sample, I first set up the necessary AWS credentials and then use the run_job_flow method to create the cluster and submit the job. The parameters used are:

a) Name = a name of your choice for the cluster
b) LogUri = the S3 location in which to store the EMR logs
c) ReleaseLabel = the EMR release version you want to use
d) Applications = the applications to install on the cluster, if you need any
e) Instances = the type and count of the instances, the key pair name to use for SSH, and the subnet of the VPC in which the instances should be created
f) Steps = the EMR job details: the name of the step, the action to take if the step fails, the location of the jar file, and the arguments to pass to it
g) VisibleToAllUsers = indicates whether the cluster is visible to all IAM users of the AWS account associated with the cluster
h) JobFlowRole = the IAM role specified when the job flow is launched; the EC2 instances of the job flow assume this role
i) ServiceRole = the IAM role that the Amazon EMR service assumes to access AWS resources on your behalf
j) Tags = a list of tags to associate with the cluster and propagate to its Amazon EC2 instances

The job can also be submitted later, after the cluster has been created, instead of being included in Steps during the cluster creation.
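As a rough sketch, submitting a step to an existing cluster can be done with the EMR client's add_job_flow_steps method. The helper name submit_step, the step name, and the failure action chosen here are assumptions for illustration.

```python
def submit_step(emr_client, cluster_id, jar_uri, job_args):
    """Add one jar step to a running cluster and return the new step's ID."""
    step = {
        "Name": "copy-file-between-buckets",  # hypothetical step name
        "ActionOnFailure": "CONTINUE",        # keep the cluster running if the step fails
        "HadoopJarStep": {"Jar": jar_uri, "Args": list(job_args)},
    }
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    # One step was submitted, so exactly one step ID is returned.
    return response["StepIds"][0]
```

Here emr_client would be a boto3.client("emr") instance and cluster_id the JobFlowId returned when the cluster was created.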

Once the cluster has been created and is in the RUNNING state, the job starts to execute. Check that the job completed successfully, then verify the outcome by looking in the destination S3 bucket to confirm that the file was copied over from the source bucket. After the job has completed, the cluster goes back to the WAITING state until the next job is submitted, provided it was configured to stay alive when no steps remain.
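These checks can also be scripted. The sketch below polls the step with describe_step until it reaches a terminal state and then looks for the copied object with list_objects_v2. The helper names and the polling interval are assumptions, and Boto3's built-in step_complete waiter is an alternative to the manual loop.

```python
import time

# Step states after which EMR will not change the step any further.
TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED", "INTERRUPTED"}


def wait_for_step(emr_client, cluster_id, step_id, poll_seconds=30, sleep=time.sleep):
    """Poll the step until it reaches a terminal state and return that state."""
    while True:
        response = emr_client.describe_step(ClusterId=cluster_id, StepId=step_id)
        state = response["Step"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)


def object_exists(s3_client, bucket, key):
    """Return True if an object with exactly this key exists in the bucket."""
    # Keys are listed in lexicographic order, so an exact match for the prefix
    # would be the first (and only) result returned with MaxKeys=1.
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
    return any(obj["Key"] == key for obj in response.get("Contents", []))
```

A typical use would be to call wait_for_step, confirm it returned "COMPLETED", and then call object_exists on the destination bucket with the key of the copied file.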

Machine Learning has kept me thriving…https://about.me/kulasangar
