Some Definitions

Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. Amazon Elastic MapReduce (EMR) is a web service that enables you to run Big Data analysis using Apache Hadoop and Apache Spark. You can process data for analytics and business intelligence workloads, and move data into and out of other AWS data stores and databases, such as Amazon S3 and DynamoDB. This article, Amazon EMR Exam Tips, gives you an overview of the Elastic MapReduce service and some core concepts you should know for the AWS Certified Solutions Architect Associate Exam.

Use Cases

  • Log Processing – Amazon EMR can be used to process logs, turning petabytes of unstructured or semi-structured data into useful insights about your applications and how they are used.
  • Clickstream Analysis – Amazon EMR can be used to conduct clickstream analysis, which can help identify demographics and usage patterns. This is often used to deliver appropriate ads.
  • Genomics and Life Sciences – Amazon EMR can be used to analyse and process large sets of genomic and scientific data in a matter of days because of the processing power available with EMR clusters.

Clusters and Nodes

The primary component of Amazon EMR is the cluster, which is a collection of EC2 instances. Specifically:

  • Each instance is called a node
  • Each node has a role within the cluster, known as its node type
  • The role determines which software components run on the node

Node Types

  • Master node – This is a node that manages the cluster by running software components which coordinate the distribution of data and tasks among other nodes. The master node also monitors the health of the slave nodes.
  • Core node – This is a slave node that has software components which run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster.
  • Task node – This is a slave node that has software components which only run tasks.
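
The three node types can be sketched as instance group definitions. The example below is a minimal illustration, assuming the dictionary shape that the boto3 EMR client accepts for the `InstanceGroups` list of `run_job_flow`; the function name `build_instance_groups`, the group names, and the `m5.xlarge` instance type are hypothetical choices, and nothing here calls AWS.

```python
# Sketch only: builds plain dictionaries and never contacts AWS.
# The keys follow the shape boto3's EMR run_job_flow accepts under
# Instances["InstanceGroups"]; names, types and counts are assumptions.

def build_instance_groups(core_count, task_count, instance_type="m5.xlarge"):
    """Describe one master node, N core nodes and M task nodes."""
    groups = [
        {
            # Manages the cluster and coordinates data and tasks.
            "Name": "Master node",
            "InstanceRole": "MASTER",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            # Runs tasks and stores data in HDFS.
            "Name": "Core nodes",
            "InstanceRole": "CORE",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        groups.append(
            {
                # Runs tasks only; holds no HDFS data.
                "Name": "Task nodes",
                "InstanceRole": "TASK",
                "InstanceType": instance_type,
                "InstanceCount": task_count,
            }
        )
    return groups


if __name__ == "__main__":
    for group in build_instance_groups(core_count=2, task_count=3):
        print(group["InstanceRole"], group["InstanceCount"])
```

Task nodes are optional in this sketch because they only add compute capacity; the core nodes are the ones that hold HDFS data.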

Submitting Work to a Cluster

When you run your cluster on Amazon EMR, you need to specify the work that needs to be done.  The options include:

  • You can provide the entire definition of the work to be done in the Map and Reduce functions. Use this option for clusters that process a set amount of data and then terminate once the processing is complete.
  • You can also create a long-running cluster and use the Amazon EMR console, API, or CLI to submit steps, which may contain one or more Hadoop jobs.
  • You can create a cluster with a Hadoop application, e.g. Hive or Pig, and use the interface provided by the application to submit queries, which can be scripted or interactive.
  • You can create a long-running cluster, connect to it, and submit Hadoop jobs using the Hadoop API.
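
To make the idea of defining work as Map and Reduce functions more concrete, here is a small word-count sketch in plain Python. It only imitates the map and reduce phases in a single process and does not use the Hadoop API; the function names `map_words`, `reduce_counts` and `word_count` are made up for this illustration.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(pairs):
    """Reduce phase: sum the counts emitted for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Run the map phase over every line, then reduce the combined pairs."""
    pairs = []
    for line in lines:
        pairs.extend(map_words(line))
    return reduce_counts(pairs)


if __name__ == "__main__":
    print(word_count(["error warn error", "Warn info"]))
```

On a real cluster the map calls would run in parallel across nodes and the framework would group the pairs by key before the reduce phase; the sketch collapses both into one local loop.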

 

Processing Data

When you launch your cluster, you have two ways to process your data. You can either submit jobs or queries directly to the applications that are installed on your cluster, or you can run ‘steps’ in the cluster.

  • Submitting Jobs Directly to Applications – You can submit jobs and interact directly with the software installed on your EMR cluster. You connect to the master node over a secure connection and access the interfaces and tools that are available for the software that runs directly on your cluster.
  • Running Steps to Process Data – You can submit one or more ordered steps to an Amazon EMR cluster. Each step is a unit of work that contains instructions to manipulate data for processing by software installed on the cluster.
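
An ordered list of steps might look like the sketch below. It assumes the step shape that boto3's EMR client accepts for `add_job_flow_steps` (a `Name`, an `ActionOnFailure`, and a `HadoopJarStep` that runs `command-runner.jar`); the helper `build_spark_step`, the step names, and the `s3://example-bucket/...` script paths are hypothetical, and the code only builds dictionaries.

```python
# Sketch only: builds step definitions and never contacts AWS.
# The bucket and script paths below are placeholders, not real resources.

def build_spark_step(name, script_uri, action_on_failure="CONTINUE"):
    """Describe one step that runs a Spark script through command-runner.jar."""
    return {
        "Name": name,
        # What the cluster should do if this step fails.
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


# Steps are processed in the order in which they appear in the list.
steps = [
    build_spark_step("Clean logs", "s3://example-bucket/jobs/clean_logs.py"),
    build_spark_step(
        "Aggregate logs",
        "s3://example-bucket/jobs/aggregate_logs.py",
        action_on_failure="CANCEL_AND_WAIT",
    ),
]

if __name__ == "__main__":
    for position, step in enumerate(steps, start=1):
        print(position, step["Name"], step["ActionOnFailure"])
    # With AWS credentials and a running cluster, a list like this could be
    # passed as Steps=steps to the boto3 EMR client's add_job_flow_steps call.
```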

 

EMR Storage Concepts

There are three different storage types available:

  • Hadoop Distributed File System (HDFS) – This storage type distributes data across instances in the cluster, storing multiple copies of data on different instances to ensure that no data is lost if an individual instance fails. HDFS storage is ephemeral: the data is lost when the instances or the cluster are terminated. If you require data to persist beyond the life of the cluster, store it in Amazon S3. HDFS is useful for caching intermediate results during MapReduce processing.
  • EMR File System (EMRFS) – This storage type lets the cluster access data stored in Amazon S3 as if it were a file system like HDFS. You can use either HDFS or Amazon S3 as the file system in your cluster.
  • Local File System – This is storage attached locally to the instance. When you create a Hadoop cluster, each node is created from an Amazon EC2 instance that comes with preconfigured block storage called an instance store.
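
In practice, the storage type a job touches is usually visible from the URI scheme of its input or output path. The sketch below assumes the common conventions on an EMR cluster (`hdfs://` for HDFS, `s3://` for EMRFS, `file://` for the local file system); the function name `storage_type` and the example paths are hypothetical.

```python
from urllib.parse import urlparse

# Assumed mapping from URI scheme to the storage types described above.
STORAGE_BY_SCHEME = {
    "hdfs": "HDFS",
    "s3": "EMRFS (Amazon S3)",
    "file": "local file system",
}


def storage_type(uri):
    """Return the storage type implied by a path's URI scheme."""
    scheme = urlparse(uri).scheme
    if scheme not in STORAGE_BY_SCHEME:
        raise ValueError(f"Unrecognised storage scheme in {uri!r}")
    return STORAGE_BY_SCHEME[scheme]


if __name__ == "__main__":
    for path in (
        "hdfs:///user/hadoop/intermediate",
        "s3://example-bucket/input/logs",
        "file:///mnt/scratch/tmp",
    ):
        print(path, "->", storage_type(path))
```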

 

Persistent and Transient Clusters

  • Persistent Clusters – These run 24/7 after they are launched and are used when continuous analysis needs to be run on data. For Persistent Clusters, HDFS is a reasonable storage choice: the cluster is not shut down, and HDFS keeps multiple copies of data so the failure of an individual instance does not lose it.
  • Transient Clusters – For inconsistent workloads, it can be more cost effective to turn off the cluster. Transient clusters are those where the cluster is started when needed and turned off when not. Thus it makes sense to use the EMRFS storage type for Transient Clusters, as the data is stored independently of the cluster in Amazon S3.
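
The difference between the two cluster styles can be sketched as a single setting. The example assumes the `KeepJobFlowAliveWhenNoSteps` flag that boto3's `run_job_flow` accepts inside `Instances` (true keeps the cluster running after its steps finish, false lets it terminate); the helper `lifecycle_settings` and the `suggested_storage` field are hypothetical and simply restate the guidance above.

```python
# Sketch only: returns plain dictionaries and never contacts AWS.

def lifecycle_settings(transient):
    """Return lifecycle settings for a transient or persistent cluster."""
    return {
        "instances": {
            # False: the cluster shuts down once its steps have finished.
            # True: the cluster keeps running and waits for more work.
            "KeepJobFlowAliveWhenNoSteps": not transient,
        },
        # Storage suggested by the article for each cluster style.
        "suggested_storage": "EMRFS (Amazon S3)" if transient else "HDFS",
    }


if __name__ == "__main__":
    print("transient: ", lifecycle_settings(transient=True))
    print("persistent:", lifecycle_settings(transient=False))
```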

 
