As part of the AWS Certified Solutions Architect – Associate Exam, you need to ensure that you have a high-level understanding of the core database services on the AWS platform, which include Amazon RDS, DynamoDB, Redshift, ElastiCache and Aurora. This article, Redshift, ElastiCache and Aurora Exam Tips, covers the core concepts, features and limitations of each database platform and will help you with your revision.

Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers.

Redshift is primarily used for Online Analytical Processing (OLAP) and enables you to conduct management analysis for the business as a whole.

Redshift Performance

Amazon Redshift offers up to 10x higher performance than traditional databases for data warehousing workloads. This is due to an underlying infrastructure and database architecture designed specifically for the purpose, as follows:

  • Columnar Data Storage – Amazon Redshift organises the data by column. A row-based system is ideal for transaction processing, but column-based systems allow queries to be performed for aggregates over large data sets. This results in far fewer I/Os, greatly improving query performance.
  • Advanced Compression – Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk. Amazon Redshift doesn’t require indexes or materialised views and so uses less space than traditional relational database systems. When you load data into an empty table, Amazon Redshift samples your data and selects the most appropriate compression scheme.
  • Massively Parallel Processing (MPP) – Amazon Redshift automatically distributes data and query load across all nodes. Amazon Redshift makes it easy to add nodes to your data warehouse and enables you to maintain fast query performance as your data warehouse grows.
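To make the columnar storage point concrete, here is a minimal sketch in plain Python. It is not how Redshift is implemented; the table, column names and the "values read" counter are hypothetical and only illustrate why an aggregate over one column touches far less data in a column layout than in a row layout.

```python
# Hypothetical sales table with three columns (illustrative data only).
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]


def to_columns(table_rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in table_rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_store(table_rows, column):
    """Row layout: every whole row is read to reach one column."""
    total = 0.0
    values_read = 0
    for row in table_rows:
        values_read += len(row)  # all fields of the row are touched
        total += row[column]
    return total, values_read


def sum_column_store(columns, column):
    """Column layout: only the requested column is read."""
    values = columns[column]
    return sum(values), len(values)


columns = to_columns(rows)
row_total, row_reads = sum_row_store(rows, "amount")
col_total, col_reads = sum_column_store(columns, "amount")
print(row_total, row_reads)  # 250.0 9
print(col_total, col_reads)  # 250.0 3
```

Both layouts return the same total, but the column layout reads a third of the values in this three-column example. The gap widens as tables get wider.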

Clusters and Nodes

The biggest selling point of Amazon Redshift is that it runs as a cluster. Each cluster can comprise multiple nodes. Multi-node configurations require a Leader Node and one or more Compute Nodes. Client applications interact directly with the Leader Node, which parses queries and develops execution plans.

  • The Leader Node coordinates the parallel execution of these plans with the Compute Nodes, aggregates the intermediate results and returns the final results to the client applications. You can only have one Leader Node.
  • Compute Nodes execute the steps specified in the execution plans and transmit data among themselves to serve these queries. Compute Nodes work in the background, transparent to the client applications. You can have one or more Compute Nodes, up to a maximum of 128.
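The division of labour between the Leader Node and the Compute Nodes can be sketched as a scatter-gather pattern. The example below is an illustration only, assuming a simple SUM query and a round-robin data distribution; the function names are hypothetical and threads stand in for real nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_node_sum(data_slice):
    """A compute node runs its step of the plan on its local data."""
    return sum(data_slice)


def leader_node_sum(values, compute_nodes):
    """The leader splits the work, runs it in parallel and merges the results."""
    if compute_nodes < 1:
        raise ValueError("a cluster needs at least one compute node")
    # Assumed round-robin distribution of the data across compute nodes.
    slices = [values[i::compute_nodes] for i in range(compute_nodes)]
    with ThreadPoolExecutor(max_workers=compute_nodes) as pool:
        partials = list(pool.map(compute_node_sum, slices))
    # The leader aggregates the intermediate results for the client.
    return sum(partials)


total = leader_node_sum(list(range(1, 101)), compute_nodes=4)
print(total)  # 5050
```

The client only ever calls the leader function, which mirrors how client applications talk to the Leader Node and never address Compute Nodes directly.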

When you launch a cluster, you need to specify the node type, which determines the CPU, RAM, storage type and storage capacity of each node. There are two types of nodes:

  • Dense Storage (DS) – This is for node types that are storage optimized. DS2 nodes are used to handle large workloads and use HDD drives. DS2 nodes are available in xlarge and 8xlarge sizes.
  • Dense Compute (DC) – This is for node types that are compute optimized. They use SSD drives, which means less storage space, but they are ideal for performance-intensive workloads. DC1 nodes are available in large and 8xlarge sizes.

Note – Some node types support single-node clusters. Here the single node is shared for leader and compute functionality.

 

Table of Node Types

Dense Storage Node Types

Node Size | vCPU | ECU | RAM (GiB) | Slices Per Node | Storage Per Node | Node Range | Total Capacity
ds2.xlarge | 4 | 13 | 31 | 2 | 2 TB HDD | 1–32 | 64 TB
ds2.8xlarge | 36 | 119 | 244 | 16 | 16 TB HDD | 2–128 | 2 PB

Dense Compute Node Types

Node Size | vCPU | ECU | RAM (GiB) | Slices Per Node | Storage Per Node | Node Range | Total Capacity
dc1.large | 2 | 7 | 15 | 2 | 160 GB SSD | 1–32 | 5.12 TB
dc1.8xlarge | 32 | 104 | 244 | 32 | 2.56 TB SSD | 2–128 | 326 TB

As an example, for a Dense Storage type,  if you have 32 TB of data, you can choose either 16 ds2.xlarge nodes or 2 ds2.8xlarge nodes. If your data grows in small increments, choosing the ds2.xlarge node size will allow you to scale in increments of 2 TB. If you typically see data growth in larger increments, a ds2.8xlarge node size might be a better choice.
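The sizing arithmetic in this example can be checked with a short calculation. The sketch below uses the per-node storage and node ranges from the Dense Storage table; the function name is hypothetical, and it deliberately ignores real-world factors such as free-space headroom, so treat it as a rough guide rather than a sizing tool.

```python
import math

# Storage per node (TB) and supported node range for the DS2 node sizes.
NODE_TYPES = {
    "ds2.xlarge": {"storage_tb": 2, "min_nodes": 1, "max_nodes": 32},
    "ds2.8xlarge": {"storage_tb": 16, "min_nodes": 2, "max_nodes": 128},
}


def nodes_needed(data_tb, node_type):
    """Return the smallest node count whose raw capacity holds data_tb."""
    spec = NODE_TYPES[node_type]
    count = max(spec["min_nodes"], math.ceil(data_tb / spec["storage_tb"]))
    if count > spec["max_nodes"]:
        raise ValueError(
            f"{data_tb} TB exceeds the capacity of {spec['max_nodes']} "
            f"{node_type} nodes"
        )
    return count


print(nodes_needed(32, "ds2.xlarge"))   # 16
print(nodes_needed(32, "ds2.8xlarge"))  # 2
print(nodes_needed(34, "ds2.xlarge"))   # 17 (one extra 2 TB step)
```

Growing from 32 TB to 34 TB costs one more ds2.xlarge node, whereas on ds2.8xlarge the next step adds a full 16 TB node, which is the trade-off described above.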

Additional key points to note:

  • ‘Slices per node’ refers to the number of slices into which a compute node is partitioned.
  • Node Range is the minimum and maximum number of nodes that Amazon Redshift supports for the node type and size.
  • Amazon Redshift applies quotas to resources for each AWS account in each region. This restricts the number of resources that your account can create for a given resource type, such as nodes or snapshots, based on the region in which you create your Redshift deployment.

Redshift distributes and executes queries in parallel across compute nodes. This means you can increase query performance by adding more nodes to your cluster. In addition, data is distributed across all compute nodes in a cluster. When you run a cluster with at least two compute nodes, data on each node is always mirrored on disks on another node, which reduces the risk of data loss from disk failures.

Cluster Security

Amazon Redshift offers a range of security options to ensure compliance with regulatory requirements as well as core business objectives including:

  • You can use IAM policies to limit the actions that administrators can perform on the database. This includes creating policies that define the lifecycle of a cluster, scaling, and backup and recovery options.
  • You can deploy Redshift clusters in the private IP space of your VPCs and limit them to private subnets.
  • You create a master username and password, from which you can create additional users and groups to define access rights.
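As an illustration of the IAM point, the sketch below builds a policy document that lets an administrator describe clusters and take snapshots but denies deleting one specific cluster. The account ID, region and cluster name are hypothetical placeholders, and the helper function is not part of any AWS SDK; it only assembles the JSON that you would attach through IAM.

```python
import json


def snapshot_only_admin_policy(account_id, region, cluster_id):
    """Build an IAM policy dict: allow snapshots, deny cluster deletion."""
    cluster_arn = f"arn:aws:redshift:{region}:{account_id}:cluster:{cluster_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowDescribeAndSnapshot",
                "Effect": "Allow",
                "Action": [
                    "redshift:DescribeClusters",
                    "redshift:CreateClusterSnapshot",
                ],
                "Resource": "*",
            },
            {
                "Sid": "DenyClusterDeletion",
                "Effect": "Deny",
                "Action": ["redshift:DeleteCluster"],
                "Resource": cluster_arn,
            },
        ],
    }


# Placeholder values for illustration only.
policy = snapshot_only_admin_policy("123456789012", "us-east-1", "analytics")
print(json.dumps(policy, indent=2))
```

An explicit Deny always wins over an Allow in IAM, so the second statement protects the cluster even if another policy grants broader Redshift permissions.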

Encryption

Amazon Redshift can be configured with encryption at rest using AES-256, as well as encryption in transit using SSL. By default, Redshift manages the keys using KMS. You can also use your own keys using Hardware Security Modules (HSM).

Backup and Recovery

You can create automatic and manual snapshots of your Redshift cluster. The snapshots can then be used to restore or clone a cluster. Note that a Redshift cluster runs in a single Availability Zone, but you can restore snapshots to other Availability Zones and Regions if required.

 

ElastiCache

Amazon ElastiCache is a web service that makes it easy to deploy, operate, and scale an in-memory data store or cache in the cloud. The service improves the performance of web applications by allowing you to retrieve information from fast, managed, in-memory data stores, instead of relying entirely on slower disk-based databases.

The primary purpose of an in-memory key-value store is to provide ultra-fast and inexpensive access to copies of data. Most data is seldom updated but is read and queried very frequently. Querying a database will generally be slower and more expensive than locating a key in a key-value cache. Some database queries are especially expensive to perform, for example, queries that involve joins across multiple tables or queries with intensive calculations. By caching query results, you pay the price of the query once and are then able to retrieve the data quickly multiple times without having to re-execute the query.
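The "pay once, read many times" idea is often implemented as a cache-aside pattern. The following is a minimal sketch in which a Python dictionary stands in for an ElastiCache cluster; the class, key name, TTL and the fake report query are all hypothetical, and a real application would use a Memcached or Redis client instead.

```python
import time


class QueryCache:
    """In-process stand-in for an in-memory key-value cache."""

    def __init__(self, ttl_seconds=60.0):
        self._ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, run_query):
        """Return the cached value, running the query only on a miss."""
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no database work
        value = run_query()  # cache miss: pay for the query once
        self._store[key] = (now + self._ttl, value)
        return value


calls = {"count": 0}


def expensive_report():
    """Pretend this is a multi-table join against the database."""
    calls["count"] += 1
    return {"top_customer": "acme", "orders": 42}


cache = QueryCache(ttl_seconds=300)
first = cache.get_or_compute("report:top_customer", expensive_report)
second = cache.get_or_compute("report:top_customer", expensive_report)
print(first == second, calls["count"])  # True 1
```

The time-to-live bounds how stale a cached result can get, which matters for the minority of data that does change.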

Visit our Amazon ElastiCache Exam Tips article to learn more.

Aurora

Amazon Aurora is a MySQL-compatible relational database engine that offers enterprise-level performance, durability and availability. It is a fully managed service and you can expect up to 5x the performance of MySQL without making major modifications to your web applications. Amazon Aurora database instances are created as DB clusters.

Each DB Cluster consists of one or more instances and a cluster volume that manages the data for those instances. An Aurora cluster volume is a virtual database storage volume that spans multiple Availability Zones, and each Availability Zone has a copy of the cluster data. Two types of instances make up an Aurora DB cluster:

  • Primary instance – This is the main instance. It supports read-write workloads and performs all of the data modifications to the cluster volume. Note that each Aurora DB cluster has one primary instance.
  • Aurora Replica – These are used just like any read replicas, in that they support only read operations. Each DB cluster can have up to 15 Aurora Replicas in addition to the primary instance. You can use Aurora Replicas to distribute the read workload, and by placing the replicas in separate Availability Zones you can also increase database availability.

Aurora Endpoints

You connect to your DB Cluster using any one of the following endpoints:

Cluster Endpoint – Each DB Cluster has a cluster endpoint, which connects you to the primary instance of the DB Cluster. Here you can perform both read and write operations. Note that the primary instance also has its own endpoint, which differs from the cluster endpoint in that the cluster endpoint always points to the current primary instance. Thus, if the primary instance fails and a new primary instance is created, the cluster endpoint connects to the new one.

Therefore, for high availability, it is always recommended to connect to the cluster endpoint. This ensures that applications fail over when the primary instance fails.

Reader Endpoint – Aurora DB Clusters also have a reader endpoint, which connects you to the Aurora Replicas. The reader endpoint enables you to load balance client connections across the replicas in a cluster. Note that if the primary DB instance fails and the Aurora Replica that you are connected to gets promoted, the connection is dropped.

You can use the reader endpoint to provide high availability for your read-only queries from your DB cluster. You can place multiple Aurora Replicas in different Availability Zones and then connect to the reader endpoint for your read workload.

  • NOTE: The reader endpoint only load-balances connections to the Aurora Replicas in a DB cluster. If you want to load-balance queries to distribute the read workload for a cluster, you will need to manage that in your application

Instance Endpoint – The primary instance and each Aurora Replica also have their own individual endpoints. Instance endpoints do not have cluster- included in the DNS name of the endpoint. Before connecting to an instance using the instance endpoint, consider using the cluster endpoint or the reader endpoint for the DB cluster to provide high availability.
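The three endpoint types can usually be told apart by their DNS names. The sketch below classifies a hostname on that basis and picks an endpoint for a workload. All hostnames are made-up placeholders, the function names are hypothetical, and the cluster-ro- prefix for the reader endpoint is an assumption based on the usual Aurora naming rather than something stated in this article. Endpoint types not covered here are not handled.

```python
def endpoint_kind(hostname):
    """Classify an Aurora endpoint by the second label of its DNS name."""
    labels = hostname.split(".")
    if len(labels) < 2:
        raise ValueError(f"not a full endpoint hostname: {hostname!r}")
    marker = labels[1]
    if marker.startswith("cluster-ro-"):  # assumed reader endpoint prefix
        return "reader"
    if marker.startswith("cluster-"):
        return "cluster"
    return "instance"  # no cluster- in the name


def pick_endpoint(endpoints, read_only):
    """Prefer the reader endpoint for reads and the cluster endpoint for writes."""
    wanted = "reader" if read_only else "cluster"
    for host in endpoints:
        if endpoint_kind(host) == wanted:
            return host
    raise LookupError(f"no {wanted} endpoint configured")


# Placeholder hostnames for illustration only.
ENDPOINTS = [
    "mydb.cluster-abc123.us-east-1.rds.amazonaws.com",
    "mydb.cluster-ro-abc123.us-east-1.rds.amazonaws.com",
    "mydb-instance-1.abc123.us-east-1.rds.amazonaws.com",
]

print(pick_endpoint(ENDPOINTS, read_only=False))
print(pick_endpoint(ENDPOINTS, read_only=True))
```

Routing writes to the cluster endpoint and reads to the reader endpoint keeps the application pointed at whichever instances currently hold those roles, so no instance hostname needs to be hard-coded.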

Key Points:

  • The minimum storage is 10 GB. Based on your database usage, your Amazon Aurora storage will automatically grow, up to 64 TB, in 10 GB increments with no impact on database performance. You do not have to provision storage in advance. Your database volume is divided into 10 GB segments spread across many disks. Each 10 GB chunk of your database volume is replicated six ways, across three Availability Zones.
  • With regards to compute resources, you can scale up to 32 vCPUs and 244 GiB Memory
  • There is no push-button scaling as there is with DynamoDB, but Aurora scales fast
  • Amazon Aurora automatically maintains 6 copies of your data across 3 Availability Zones and so that gives you 2 copies of your data in each Availability Zone. It will automatically attempt to recover your database in a healthy AZ with no data loss.
  • You can restore a DB Snapshot or perform a point-in-time restore operation to a new instance. Note that the latest restorable time for a point-in-time restore operation can be up to 5 minutes in the past.
  • Aurora recovery takes seconds in most cases because there is no need to replay logs before the database is available again
  • Amazon Aurora supports two kinds of replicas. You can create up to 15 Aurora Replicas and up to 5 MySQL Replicas. Unlike MySQL Replicas, Aurora Replicas offer automatic failover.
  • You can set up a cross-region Aurora Replica
  • You can add Aurora Replicas in the cluster that will share the same underlying storage as the cross-region replica.
  • You can promote your cross-region replica to be the new primary from the RDS console. Note that the cross-region replication will stop once you initiate the promotion process.
  • You can assign a promotion priority tier to each instance on your cluster. If the primary instance fails, Amazon RDS will promote the replica with the highest priority to the primary. If there is contention between 2 or more replicas in the same priority tier, then Amazon RDS will promote the replica that is the same size as the primary instance
  • All Amazon Aurora DB Instances must be created in a VPC
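The promotion rule described in the key points can be sketched as a small selection function. This is an illustration of the rule as stated above, not the actual Amazon RDS logic. The class and function names are hypothetical, and two details are assumptions: a lower tier number means a higher priority, and when no replica in the winning tier matches the primary's size, the first candidate is taken.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    tier: int             # assumed: lower number = higher promotion priority
    instance_class: str   # e.g. "db.r3.large" (placeholder values)


def choose_new_primary(replicas, primary_class):
    """Pick the failover target by priority tier, then by matching size."""
    if not replicas:
        raise ValueError("no Aurora Replicas available for promotion")
    best_tier = min(r.tier for r in replicas)
    candidates = [r for r in replicas if r.tier == best_tier]
    # Tie-break: prefer a replica the same size as the failed primary.
    same_size = [r for r in candidates if r.instance_class == primary_class]
    return (same_size or candidates)[0]


replicas = [
    Replica("replica-a", tier=1, instance_class="db.r3.xlarge"),
    Replica("replica-b", tier=0, instance_class="db.r3.large"),
    Replica("replica-c", tier=0, instance_class="db.r3.xlarge"),
]

chosen = choose_new_primary(replicas, primary_class="db.r3.xlarge")
print(chosen.name)  # replica-c
```

Here replica-b and replica-c share the top tier, and replica-c wins because it matches the size of the failed primary.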

 
