Creating a Spark job using PySpark and executing it in AWS EMR

Published in AWS in Plain English · 4 min read · Aug 7, 2018

What is Spark?

Spark is a data processing engine that suits a wide range of workloads. Data scientists and application developers integrate Spark into their own applications to transform, analyze, and query data at scale. The tasks most commonly associated with Spark include queries that span huge data sets, machine learning problems, and processing of streaming data from various sources.

What is PySpark?

PySpark is the interface that provides access to Spark from the Python programming language. In short, PySpark is the Python API for Spark.

What is EMR?

Amazon Elastic MapReduce, also known as EMR, is an Amazon Web Services service for big data analysis and processing. It is built on Apache Hadoop, a Java-based programming framework that supports the processing of huge data sets in a distributed computing environment. EMR also handles a wide range of big data use cases, such as bioinformatics, scientific simulation, machine learning, and data transformations.

[Figure: flowchart of the functionalities covered in this article]

Let me walk through each of these functionalities with the appropriate snippets.

I’ve been experimenting with PySpark for the last few days, and I was able to build a simple Spark application and execute it as a step in an AWS EMR cluster. The following functionalities are covered in this use case:

Reading CSV files from AWS S3 and storing them in two different RDDs (Resilient Distributed Datasets).
Converting an RDD into a DataFrame.
Replacing 0s with null values.
Dropping the rows that have null values.
Performing an inner join based on a column.
Saving the joined DataFrame in the Parquet format back to S3.
Executing the script in an EMR cluster as a step via the CLI.