In the previous article we looked at the core concepts of Recovery Time Objective (RTO) and Recovery Point Objective (RPO), which fundamentally describe the time it would take to recover from a disaster and the level of data loss a company is willing to sustain when one occurs.  We then went on to look at the various AWS services on offer and how they have been designed to help develop a complete disaster recovery and business continuity plan.  In this article, we look at four DR scenarios that highlight the use of AWS and compare AWS with traditional DR methods.

Backup & Restore

Of the DR scenarios recommended by Amazon Web Services, this is by far the cheapest configuration.  It’s also important to note that there is a direct correlation between the investment made in a DR scenario and the RTO and RPO levels that can be achieved.

With a Backup & Restore scenario for DR, you essentially back up your data at regular intervals and restore the previously backed up data in the event of a disaster.  Amazon S3 is an ideal destination for backup data that might be needed quickly to perform a restore. Data is typically transferred to and from Amazon S3 over the network, so it is accessible from any location.  You can also use the following methods to back up your data from your on-premises servers or VPC hosted servers to Amazon S3:

  • AWS Import/Export Disk and Snowball
  • Direct Connect
  • Storage Gateway
    • This includes using the gateway-VTL configuration of AWS Storage Gateway as a backup target for your existing backup management software.
  • For services hosted in AWS, snapshots of Amazon EBS volumes, Amazon RDS databases, and Amazon Redshift data warehouses can be stored on Amazon S3.
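
Because backups in this scenario are taken at regular intervals, the interval itself bounds how much data can be lost. The following is a minimal sketch of that relationship, assuming backups complete instantly at their scheduled time; the schedule, the function names and the disaster time are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def latest_restorable_backup(backup_times, disaster_time):
    """Return the most recent backup taken at or before the disaster, or None."""
    candidates = [t for t in backup_times if t <= disaster_time]
    return max(candidates) if candidates else None


def data_loss_window(backup_times, disaster_time):
    """Time between the last usable backup and the disaster (the achieved recovery point)."""
    last = latest_restorable_backup(backup_times, disaster_time)
    if last is None:
        raise ValueError("no backup exists before the disaster")
    return disaster_time - last


# Hypothetical schedule: one backup every 6 hours (00:00, 06:00, 12:00, 18:00).
start = datetime(2024, 1, 1, 0, 0)
backups = [start + timedelta(hours=6 * i) for i in range(4)]

# Hypothetical disaster shortly before the 18:00 backup would have run.
disaster = datetime(2024, 1, 1, 17, 30)
print(data_loss_window(backups, disaster))  # 5:30:00
```

Under these assumptions the worst case approaches the full backup interval, so a 6-hour schedule cannot meet an RPO of, say, one hour.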

Quick Restores for Business Continuity

If disaster strikes, you can easily build environments in AWS within a VPC and then restore the backed up data from Amazon S3 into your VPC.

Restore from Backup into AWS

Key points to note:

  • Ensure you select the appropriate tool or method to back up your data into AWS. This is vital to remember in the Exam, as the details of the scenario question will determine the right answer.  So if you have only 3 days in which to transfer 50TB of data and your Internet bandwidth is not sufficient, you would consider using Import/Export Disk or Snowball.
    • Important Note – Remember that the scenario may allow you to start the backup of your data earlier if only a subset of that data changes over time. In this case be sure to evaluate the answer options very carefully.
  • Ensure that you have an appropriate retention policy for your data. This will influence the RPO levels too.
  • Ensure you have appropriate security measures in place, including encryption of data both in flight and at rest. Consider IAM for access policy definitions and the use of IAM Roles.
  • Testing is vital to ensure you meet compliance and regulatory requirements and that you can meet RTO and RPO levels.
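
The 50TB in 3 days example above comes down to simple arithmetic. Here is a rough sketch of that calculation; the link speeds and the 80% usable-bandwidth figure are assumptions for illustration, not values from any AWS documentation.

```python
def transfer_days(data_terabytes, link_mbps, utilization=0.8):
    """Estimate the days needed to move data over a network link.

    data_terabytes: data size in decimal terabytes (1 TB = 10**12 bytes).
    link_mbps: nominal link speed in megabits per second.
    utilization: assumed fraction of the link actually usable for the transfer.
    """
    bits = data_terabytes * 10**12 * 8
    effective_bits_per_second = link_mbps * 10**6 * utilization
    seconds = bits / effective_bits_per_second
    return seconds / 86400  # seconds per day


def fits_deadline(data_terabytes, link_mbps, deadline_days, utilization=0.8):
    """True if the network transfer is estimated to finish within the deadline."""
    return transfer_days(data_terabytes, link_mbps, utilization) <= deadline_days


# 50 TB over an assumed 1 Gbps link at 80% utilization.
print(f"{transfer_days(50, 1000):.1f} days")  # 5.8 days
print(fits_deadline(50, 1000, deadline_days=3))  # False
```

With these assumed numbers even a 1 Gbps link misses a 3-day window, which is the kind of situation where a physical transfer option such as Snowball becomes the better answer.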


Pilot Light

The term pilot light refers to the minimum set of services that needs to be available in order to quickly build and provision a live production system.  Certain services need to be in a state where they are technically on and can act as a pilot light to ignite the rest of the infrastructure.  For example, in a web tier application environment the pilot light typically includes your database servers, which would replicate data to Amazon EC2 or Amazon RDS. This is the critical core of the system (the pilot light) around which all other infrastructure pieces in AWS (the rest of the furnace) can quickly be provisioned to restore the complete system.

To provision the remaining systems in the event of a disaster, for example the fleet of EC2 Instances running as web servers, you would keep the necessary AMIs in your AWS Region and quickly launch instances from these AMIs, which would already have most of the configuration in place to reduce the time it takes to go live.

For Network design, you have two main options for provisioning:

  • You can use Elastic IP addresses, which can be pre-allocated and identified in the preparation phase for DR, and associate them with your instances. You can also use elastic network interfaces (ENIs), which have a MAC address that can likewise be pre-allocated, so that licenses can be provisioned against it for applications that are licensed by MAC address.
  • You can use Elastic Load Balancing (ELB) to distribute traffic to multiple EC2 instances. You then update your DNS records to point at your Amazon EC2 instance, or point to your load balancer using a CNAME record or, if you are using Route 53, an Alias record.

The pilot light method gives you a quicker recovery time than the backup-and-restore method because the core pieces of the system are already running and are continually kept up to date.

In addition, you can use automation tools like AWS CloudFormation to automate the deployment of all core infrastructure services, such as launching EC2 Instances and setting up ELBs and Auto Scaling.  This will further reduce your RTO.
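
As an illustration of what such automation might look like, the sketch below builds a minimal CloudFormation template in Python and prints it as JSON. The logical names (WebServerAmi, WebServer), the description text and the default instance type are assumptions made for this example; a real DR template would also describe load balancers, Auto Scaling groups and networking.

```python
import json


def web_tier_template(instance_type="t3.micro"):
    """Build a minimal CloudFormation template for a single web server instance.

    The AMI is passed in as a stack parameter so that the pre-built DR image
    can be chosen at launch time.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "DR web tier sketch (illustrative only)",
        "Parameters": {
            "WebServerAmi": {
                "Type": "AWS::EC2::Image::Id",
                "Description": "Pre-built AMI for the web server",
            }
        },
        "Resources": {
            "WebServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    # Resolved from the stack parameter when the stack is created.
                    "ImageId": {"Ref": "WebServerAmi"},
                    "InstanceType": instance_type,
                },
            }
        },
    }


template_body = json.dumps(web_tier_template(), indent=2)
print(template_body)
```

Keeping the template under version control alongside the AMIs means the recovery phase can create the stack from a known definition rather than from manual steps.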

Preparation Phase

You need to have your regularly changing data replicated to the pilot light, so in this example, it would be a slave copy of your database in a mirroring configuration.  This will ensure that the full environment will be started in the recovery phase successfully.

Key steps for preparation:

  • Set up Amazon EC2 instances to replicate or mirror data.
  • Ensure that you have all supporting custom software packages available in AWS.
  • Create and maintain AMIs of key servers where fast recovery is required.
  • Regularly run these servers, test them, and apply any software updates and configuration changes.
  • Consider automating the provisioning of AWS resources with a tool such as AWS CloudFormation.

Recovery Phase

With the pilot light in place, you can fail over to the replica and then launch the EC2 servers and other services that are needed for a full production setup.  You can also scale both vertically and horizontally, and upgrade the instance types that run your RDS Databases.  For network configurations, you can configure DNS as required.

Key steps for recovery:

  • Start your application Amazon EC2 instances from your custom AMIs.
  • Resize existing database/data store instances to process the increased traffic.
  • Add additional database/data store instances to give the DR site resilience in the data tier; if you are using Amazon RDS, turn on Multi-AZ to improve resilience.
  • Change DNS to point at the Amazon EC2 servers.
  • Install and configure any non-AMI based systems, ideally in an automated way.


Warm Standby

Warm Standby describes a DR scenario in which a scaled down version of a fully functional environment is always running on AWS.  This extends the pilot light approach, where a number of services are turned off until they are needed.  In a Warm Standby scenario, the infrastructure runs on a minimum sized fleet using the smallest possible instance sizes.  So both vertical and horizontal scaling will need to be performed during the recovery phase.

Preparation Phase

In addition to database replication as per the Pilot Light scenario, here, you may have some application data that is also being replicated to a failover application server in the cloud.  Elastic Load Balancers may already be deployed waiting for you to add more EC2 Instances to the ELB when you enter a recovery phase.  Also, Route 53 will have a DNS entry that will point to the DR site but not be active.

Key steps for preparation:

  • Set up Amazon EC2 instances to replicate or mirror data.
  • Create and maintain AMIs.
  • Run your application using a minimal footprint of Amazon EC2 instances or AWS infrastructure.
  • Patch and update software and configuration files in line with your live environment.

Recovery Phase

The Warm Standby environment is fully functional, albeit scaled down.  Going into a production state requires scaling up and out as needed and configuring DNS to make the failover services live.

Key steps for recovery:

  • Increase the size of the Amazon EC2 fleets in service with the load balancer (horizontal scaling).
  • Start applications on larger Amazon EC2 instance types as needed (vertical scaling).
  • Either manually change the DNS records, or use Amazon Route 53 automated health checks so that all traffic is routed to the AWS environment.
  • Consider using Auto Scaling to right-size the fleet or accommodate the increased load.
  • Add resilience or scale up your database.

Important Note – Horizontal Scaling is preferred over vertical scaling to avoid downtime
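
To make the horizontal scaling step concrete, here is a small capacity sketch that estimates how many instances to add to the standby fleet. The request rates, the per-instance capacity and the 25% headroom are hypothetical numbers chosen for the example.

```python
import math


def instances_needed(target_rps, rps_per_instance, headroom=0.25):
    """Instances required to serve target_rps with spare capacity.

    headroom is the assumed fraction of extra capacity kept in reserve.
    """
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    return math.ceil(target_rps * (1 + headroom) / rps_per_instance)


def scale_out_plan(current_instances, target_rps, rps_per_instance, headroom=0.25):
    """Number of instances to add to the running standby fleet (never negative)."""
    needed = instances_needed(target_rps, rps_per_instance, headroom)
    return max(needed - current_instances, 0)


# Hypothetical: the standby runs 2 instances, production load is 3000 requests
# per second, and each instance is assumed to handle 400 requests per second.
print(instances_needed(3000, 400))   # 10
print(scale_out_plan(2, 3000, 400))  # 8
```

In practice an Auto Scaling policy would make this adjustment automatically; the calculation is mainly useful for checking that account limits and budgets allow the full-size fleet.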


Multi-Site Solution

Multi-Site DR Solutions relate to having a failover site that runs in an active-active setup.  You will replicate your data from the live site to the DR site, but the services in the DR site will be in active mode, ready for you to switch over very quickly in the event of a disaster.  This will be the most expensive scenario but offers the lowest RTO.

Amazon suggests that you can use DNS weighting to route production traffic to different sites that deliver the same application or service. A proportion of traffic will go to your infrastructure in AWS, and the remainder will go to your on-site infrastructure.  You can then send all traffic to the AWS Servers in a DR situation by adjusting the weighting.
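
With weighted routing, each record receives a share of traffic equal to its weight divided by the sum of all weights. The sketch below models that arithmetic and the failover adjustment; the site names and the 90/10 split are assumptions for illustration, and the code does not call Route 53 itself.

```python
def traffic_share(weights):
    """Fraction of requests each site receives under weighted routing."""
    total = sum(weights.values())
    if total == 0:
        # This sketch simply rejects the all-zero case.
        raise ValueError("at least one weight must be non-zero")
    return {name: weight / total for name, weight in weights.items()}


def fail_over(weights, failed_site):
    """Return new weights with the failed site's weight set to 0."""
    return {name: (0 if name == failed_site else weight)
            for name, weight in weights.items()}


# Hypothetical steady state: 90% of traffic on-site, 10% in AWS.
normal = {"on-site": 90, "aws": 10}
print(traffic_share(normal))
# {'on-site': 0.9, 'aws': 0.1}

# Disaster at the primary site: all traffic moves to AWS.
print(traffic_share(fail_over(normal, "on-site")))
# {'on-site': 0.0, 'aws': 1.0}
```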

You can use Amazon EC2 Auto Scaling to automate this process. You might need some application logic to detect the failure of the primary database services and cut over to the parallel database services running in AWS.

Since cost is going to be an issue here, you can employ various strategies to reduce it.  For example, if you will use your DR site as your on-going production site, it might be a good idea to purchase Reserved Instances (RI) instead of On Demand ones.

Preparation Phase

With a DNS weighted configuration, you use Amazon Route 53 DNS to route a portion of your traffic to the AWS site. The application on AWS might access data sources in the on-site production system. Data is replicated or mirrored to the AWS infrastructure.  The key point to note is that application servers can access database servers in either site.  However, the database mirroring will take place from the primary site to the DR site.

Multi-Site DR Scenario with AWS

Recovery Phase

During the recovery phase, following a disaster at the primary site, traffic will be cut over to AWS by updating DNS.

Key steps for recovery:

  • Either manually or by using DNS failover, change the DNS weighting so that all requests are sent to the AWS site.
  • Have application logic for failover to use the local AWS database servers for all queries.
  • Consider using Auto Scaling to automatically right-size the AWS fleet.


Replication of Data

Key factors to consider when replicating data:

  • Distance between sites
  • Available Bandwidth – Would Snowball data transfer be better?
  • Data rate required by your application.
  • Replication topology – replication should run in parallel so that it uses the network effectively.

Two approaches to replicating data are:

Synchronous Replication

Data is updated atomically in multiple locations, which makes writes dependent on network performance and availability.  This approach suits services like RDS deployed in a Multi-AZ configuration, where synchronous replication ensures data is not lost if the primary AZ becomes unavailable. Remember that RDS in Multi-AZ only operates between the AZs of a region and not across regions.

Asynchronous Replication

Data is not updated atomically in multiple locations.  It is transferred as network performance and availability allow, and the application continues to write data that may not have been replicated yet.  The source does not wait for replication to complete before continuing operations.  RDS Read Replicas use asynchronous replication and, because of this, the replica can be located remotely in another region.  This is possible because the data does not need to be completely in sync with the source.
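
The difference between the two approaches can be shown with a toy in-memory model. This is only a sketch: the class names are hypothetical, dictionaries stand in for databases, and a queue stands in for the network between sites.

```python
from collections import deque


class SyncReplicatedStore:
    """Every write reaches both copies before it is acknowledged."""

    def __init__(self):
        self.primary, self.replica = {}, {}

    def write(self, key, value):
        self.primary[key] = value
        self.replica[key] = value  # the write returns only after both copies hold the value


class AsyncReplicatedStore:
    """Writes are acknowledged by the primary and shipped to the replica later."""

    def __init__(self):
        self.primary, self.replica = {}, {}
        self.pending = deque()  # writes not yet applied to the replica

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def replicate(self, max_items=None):
        """Apply up to max_items queued writes to the replica (all if None)."""
        count = len(self.pending) if max_items is None else min(max_items, len(self.pending))
        for _ in range(count):
            key, value = self.pending.popleft()
            self.replica[key] = value

    def lag(self):
        """Number of writes the replica is behind the primary."""
        return len(self.pending)


sync_store = SyncReplicatedStore()
sync_store.write("order-1", "paid")
print(sync_store.replica == sync_store.primary)  # True

async_store = AsyncReplicatedStore()
async_store.write("order-1", "paid")
async_store.write("order-2", "paid")
async_store.replicate(max_items=1)  # the network only delivered one write so far
print(async_store.lag())                   # 1
print("order-2" in async_store.replica)    # False
```

If the primary were lost at this point, the asynchronous replica would be missing order-2, which is exactly the data loss that the RPO has to account for.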