AWS Data Pipeline is a web service that enables you to schedule the regular transfer and processing of data between AWS compute and storage services. You can also transfer data from on-premises sources into the AWS Cloud. You can access your data at the source, transform or process it, and then deliver the results to AWS services such as Amazon S3, RDS, DynamoDB, Redshift, and EMR. This article, AWS Data Pipeline Exam Tips, gives you a quick overview of the core concepts and components you need to study for the AWS Certification Exam.
AWS Data Pipeline Components
- Pipeline – Carries out activities such as moving data from one location to another or running Hive queries. A pipeline definition specifies data sources, destinations, and predefined or custom data processing activities. When an activity needs additional AWS resources, such as EC2 instances, Data Pipeline launches them automatically and terminates them once the activity is complete
- Data node – References the location of the data a pipeline reads or writes, such as a specific Amazon S3 path, a MySQL database, or a DynamoDB table
- Activity – An action that AWS Data Pipeline initiates on your behalf as part of a pipeline, such as running EMR or Hive jobs or performing SQL queries
- Pipeline lifecycle – You can activate and deactivate a pipeline and change its definition and schedule; when you are finished with a pipeline, you can delete it
- Task Runner – Polls for tasks and then performs them, such as copying log files to an S3 bucket or launching EMR clusters. Task Runner is installed and runs automatically on resources created by your pipeline definitions
- Preconditions – Conditional statements that must be true before an activity can run. For example, data must exist in a DynamoDB table before a query activity can run against it. Preconditions are especially useful for activities that are expensive to run, because they ensure the activity executes only when the condition is met (the sketch after this list shows a precondition in a full pipeline definition)
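To make these components concrete, here is a minimal sketch using the boto3 `datapipeline` client. The pipeline name, bucket paths, schedule, role names, and object IDs are placeholders invented for illustration, and the `obj` helper is our own convenience wrapper, not part of the API:

```python
# Minimal sketch: define and activate a pipeline with a schedule, two S3 data
# nodes, a precondition, a managed EC2 resource, and a copy activity.
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

# Create the (empty) pipeline; uniqueId guards against duplicate creation.
pipeline_id = dp.create_pipeline(
    name="daily-s3-copy", uniqueId="daily-s3-copy-v1"
)["pipelineId"]

def obj(obj_id, name, **fields):
    """Helper (ours, hypothetical): build a pipeline object from keyword
    fields. Keys prefixed with 'ref_' become refValue references."""
    return {
        "id": obj_id,
        "name": name,
        "fields": [
            {"key": k[4:], "refValue": v} if k.startswith("ref_")
            else {"key": k, "stringValue": v}
            for k, v in fields.items()
        ],
    }

dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        # Default object: the other objects inherit the schedule from here.
        obj("Default", "Default", scheduleType="cron",
            ref_schedule="DailySchedule", failureAndRerunMode="CASCADE"),
        obj("DailySchedule", "DailySchedule", type="Schedule",
            period="1 day", startDateTime="2024-01-01T00:00:00"),
        # Data nodes: where the data lives before and after the activity.
        obj("InputNode", "InputNode", type="S3DataNode",
            directoryPath="s3://example-bucket/input/"),
        obj("OutputNode", "OutputNode", type="S3DataNode",
            directoryPath="s3://example-bucket/output/"),
        # Precondition: only run if the input prefix actually contains data.
        obj("InputReady", "InputReady", type="S3PrefixNotEmpty",
            s3Prefix="s3://example-bucket/input/"),
        # Managed resource: Data Pipeline launches this instance on demand
        # and terminates it when the activity is done.
        obj("CopyInstance", "CopyInstance", type="Ec2Resource",
            instanceType="t2.micro", terminateAfter="1 Hour",
            role="DataPipelineDefaultRole",
            resourceRole="DataPipelineDefaultResourceRole"),
        # Activity: the copy itself, gated on the precondition.
        obj("CopyData", "CopyData", type="CopyActivity",
            ref_input="InputNode", ref_output="OutputNode",
            ref_runsOn="CopyInstance", ref_precondition="InputReady"),
    ],
)

dp.activate_pipeline(pipelineId=pipeline_id)
```

Once activated, the pipeline runs on its schedule until you deactivate or delete it; `deactivate_pipeline` and `delete_pipeline` take the same `pipelineId`.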
Activity Failures
If an activity fails, it retries three times by default before entering a hard failure state; you can increase the number of automatic retries to 10. After an activity exhausts its attempts, it triggers any configured onFailure alarm and will not try to run again unless you manually issue a rerun command. You can rerun a set of completed or failed activities by resetting their state to SCHEDULED.
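As a hedged sketch of both knobs in boto3: the pipeline and object IDs below are placeholders, `maximumRetries` is the standard activity field for the retry ceiling, and `SetStatus` with `RERUN` is one way to requeue objects that have already failed or finished.

```python
# Sketch: raise the retry ceiling on an activity and manually rerun a
# failed object. IDs are placeholders, not real resources.
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

# In the pipeline definition, add maximumRetries to the activity's fields
# to raise the automatic retry ceiling (here to the maximum of 10):
retry_field = {"key": "maximumRetries", "stringValue": "10"}

# To rerun objects that have already failed or completed, reset their state;
# SetStatus with "RERUN" puts the objects back in the scheduler's queue.
dp.set_status(
    pipelineId="df-0123456789EXAMPLE",  # placeholder pipeline ID
    objectIds=["CopyData"],             # the failed activity's object ID
    status="RERUN",
)
```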
Data Pipeline and AWS Resource Usage
AWS Data Pipeline runs its activities on compute resources. There are two types:
- AWS Data Pipeline-managed – Amazon EMR clusters or Amazon EC2 instances that the AWS Data Pipeline service launches only when they’re needed and terminates when the work is done
- Self-managed – Longer-running resources that you manage yourself; any resource capable of running the AWS Data Pipeline Java-based Task Runner qualifies (for example, on-premises hardware or a customer-managed Amazon EC2 instance). The sketch after this list contrasts the two
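The sketch below contrasts the two approaches for a ShellCommandActivity. The object IDs and worker group name are placeholders we invented; the Task Runner invocation in the comment follows the pattern from the AWS documentation.

```python
# Sketch: the same activity targeted at a managed vs. a self-managed resource.
# Field fragments only; IDs and the worker group name are placeholders.

# Managed: reference an Ec2Resource (or EmrCluster) object via "runsOn".
# Data Pipeline launches the instance when needed and terminates it afterwards.
managed_activity_fields = [
    {"key": "type", "stringValue": "ShellCommandActivity"},
    {"key": "command", "stringValue": "echo hello"},
    {"key": "runsOn", "refValue": "CopyInstance"},  # Data Pipeline-managed EC2
]

# Self-managed: omit "runsOn" and name a worker group instead. Any host running
# the Java Task Runner with a matching --workerGroup polls for these tasks:
#   java -jar TaskRunner-1.0.jar --config credentials.json \
#        --workerGroup=my-worker-group --region=us-east-1
self_managed_activity_fields = [
    {"key": "type", "stringValue": "ShellCommandActivity"},
    {"key": "command", "stringValue": "echo hello"},
    {"key": "workerGroup", "stringValue": "my-worker-group"},
]
```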
AWS Certification – 540 Practice Exam Questions – Get Prepared for your Exam Day!
Our AWS Exam Simulator with 540 practice exam questions comes with comprehensive explanations that will help you prepare for one of the most sought-after IT Certifications of the year. Register Today and start preparing for your AWS Certification.