Monday, April 14, 2014

Designing for failure in the AWS cloud

Everything fails all the time: Hard disks break, computers overheat, wires get broken, the power goes out, earthquakes damage buildings, and because of all this, no single device should be considered fault-tolerant.

It is a guarantee that something in your setup will eventually fail. At large enough scale, something is always failing. 

The reasoning is simple: hardware is physical, subject to the laws of physics, and so it eventually breaks. This is a mantra that anyone designing software systems in the AWS cloud must always keep in mind. It is also one of the design principles on which AWS itself is built, and it is a concern that AWS does not eliminate for its customers.

Something or someone has to deal with all these uncertainties. Ideally, cloud solutions providers such as Amazon AWS would solve problems transparently for their customers and would take care of providing a failure-proof environment, but that’s just not how it works.

As Reed Hastings said during the AWS re:Invent 2012 keynote in Las Vegas, “We are still in the assembly language phase of cloud computing.” Coping with failure is for the most part left in the hands of cloud computing users, and software architects must take these inevitable failures very seriously. Designing for failure must be deliberate: assume that the components in a system will in fact fail, then plan backwards from those failures to address them.

When this backwards planning is done, an important advantage is discovered: Designing a solution for reliability leads to a path that, with little additional cost and effort, solves scalability issues as well.

For example, let’s say that you have a simple web application where customers can log in from the Internet. The application collects data from those individuals, stores it, and provides reports based on that data. 

Initially, you could design it to run very simply across one server and one instance of MySql. It would look something like this:

This solution might work for a while, and could even run on the computer under your desk with minimal initial cost. Unfortunately, this design has two serious problems:

  1. It doesn’t scale. As traffic increases you’ll quickly reach a breaking point. Either the database (DB) machine or the product server is bound to get saturated and stop working properly.
  2. There are only two physical components and they depend on each other. If either of them fails, the system is going to be down.

So let’s start by focusing on the component that is further away from the customer, and move backwards from there. What happens if the MySql DB starts failing? I’ll indicate with a red color a component that represents a single point of failure, with green a component that has been determined fault-tolerant, and with white a component that I haven’t yet examined. Our goal is to make all the boxes green. With that in mind:

In this design if the database fails:

  1. Customers can no longer log in because their credentials are stored in the DB.
  2. The system is not able to collect data since there is no place to put it.
  3. The business logic cannot operate because the data is not available.
  4. The system is unable to display reports since the data is inaccessible.

In other words, the product becomes completely non-functional, resulting in total disaster:

Failure started from the DB and moved backward toward the customer, ultimately halting the experience. 

How can this be avoided? We can prevent this scenario by assuming during the design of the system that any MySql instance could fail at any time. We need to replace that single point of failure with something that is tolerant of errors. There are many options to achieve this result. 

For example, we could partition the data across several MySql instances. With this scheme, we would divide the population of customers into buckets and use a separate MySql instance for each bucket. If any of the instances goes bad, only the group of customers stored in the “bad bucket” will experience issues.

In this simple example we have two buckets. Assuming that each customer has a numeric customer ID, we could assign customers with odd ID numbers to DB1 and customers with even ID numbers to DB2. If one of the two MySql instances goes bad, you’d get this situation:

While this is not ideal, at least the product will continue working for 50 percent of your customers while you resolve the problem. In general, if you have n buckets and one goes bad, the system would be broken for only 1/n of the customers. The more buckets that exist, the better this works.
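The bucketing idea can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and routing by `customer_id % num_buckets` is an assumption that happens to reproduce the odd/even split when there are two buckets (index 1 holds odd IDs, index 0 holds even IDs).

```python
def bucket_for(customer_id: int, num_buckets: int) -> int:
    """Return the index of the database bucket that holds this customer."""
    return customer_id % num_buckets


def affected_customers(customer_ids, num_buckets, failed_bucket):
    """Return the customers whose data lives in the failed bucket."""
    return [cid for cid in customer_ids
            if bucket_for(cid, num_buckets) == failed_bucket]


customers = range(1, 1001)  # hypothetical customer IDs 1..1000
down = affected_customers(customers, num_buckets=2, failed_bucket=1)
print(len(down) / len(customers))  # 0.5: half of the customers are affected
```

With four buckets instead of two, the same call reports a quarter of the customers, which matches the 1/n rule above.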

However, there are several drawbacks to this design. The first is that data partitioned this way is costly to repartition. For instance, if we start with 3 buckets and want to move to 5, we would need to reassign the customers to new buckets and move their data between databases. There are many ways to address this issue, but it would be nice to come up with a solution that:

  1. Is simple and inexpensive to implement on day one.
  2. Allows managing of costs as the business grows and the available budget increases.
  3. Solves the single point of failure issue.
  4. Allows changing database technology without reengineering the system.
  5. Improves the ability to test the system.
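To put a rough number on the repartitioning cost mentioned before the list, the sketch below counts how many customers change bucket when moving from 3 to 5 buckets. It assumes a plain `customer_id % n` routing rule, which is an assumption made for illustration; other routing schemes would give different numbers.

```python
def moved_fraction(customer_ids, old_buckets: int, new_buckets: int) -> float:
    """Fraction of customers whose bucket changes under modulo routing."""
    ids = list(customer_ids)
    moved = sum(1 for cid in ids if cid % old_buckets != cid % new_buckets)
    return moved / len(ids)


# Hypothetical population of 15,000 customers with IDs 0..14999.
print(moved_fraction(range(15000), old_buckets=3, new_buckets=5))  # 0.8
```

Under this assumption, four out of five customers would have to be moved, which is why a naive partitioning scheme is expensive to grow.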

The solution we adopted at DreamBox is based on a physical storage abstraction—we call it “DataBag.” It is designed to separate the physical storage from the data consumers with a very simple NoSql API compatible with memcached. Each physical storage type that is supported is implemented as a plugin of the DataBag system and can be switched depending on needs. For example, in production we can choose to use a different storage solution than in the test or development environments. 
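A storage abstraction of this kind might look roughly like the sketch below. It is not the DreamBox implementation: the class names, the `get`/`set`/`delete` methods, and the `make_databag` helper are all hypothetical, chosen only to show consumers depending on a small key/value contract rather than on a specific database.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Optional


class StoragePlugin(ABC):
    """Hypothetical plugin contract: a small memcached-like key/value API."""

    @abstractmethod
    def get(self, key: str) -> Optional[str]: ...

    @abstractmethod
    def set(self, key: str, value: str) -> None: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...


class InMemoryPlugin(StoragePlugin):
    """Keeps everything in a dict; handy for unit tests."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)


class FlatFilePlugin(StoragePlugin):
    """Stores all keys in one JSON file; simple, not meant for production."""

    def __init__(self, path: str) -> None:
        self._path = path

    def _load(self) -> dict:
        if not os.path.exists(self._path):
            return {}
        with open(self._path, "r", encoding="utf-8") as f:
            return json.load(f)

    def _save(self, data: dict) -> None:
        with open(self._path, "w", encoding="utf-8") as f:
            json.dump(data, f)

    def get(self, key):
        return self._load().get(key)

    def set(self, key, value):
        data = self._load()
        data[key] = value
        self._save(data)

    def delete(self, key):
        data = self._load()
        data.pop(key, None)
        self._save(data)


def make_databag(environment: str) -> StoragePlugin:
    """Pick a plugin per environment (this mapping is made up)."""
    if environment == "test":
        return InMemoryPlugin()
    return FlatFilePlugin(os.path.join(tempfile.gettempdir(), "databag.json"))


bag = make_databag("test")
bag.set("customer:42:name", "Ada")
print(bag.get("customer:42:name"))
```

The calling code never names a concrete storage class, so a plugin backed by a real database could be added later without touching the consumers.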

Using three storage plugins as an example (MySql, DynamoDB, and a simple Flat File), the system looks like this:

This solution addresses the various concerns mentioned above:

  1. It is inexpensive to implement on day one because the DataBag API is very simple and creating the first plugin is easy.
  2. It allows tuning the costs because some storage solutions cost more than others. At DreamBox we started using a MySql implementation and eventually moved to AWS DynamoDB as budget constraints changed. 
  3. The logic that deals with the single point of failure is contained in the plugins and is specific to each storage system, but it is hidden from the rest of the system, so it can be implemented and then refined over time. For example, a first implementation based on a single MySql instance can easily be replaced with a partitioned MySql plugin or with an AWS managed service such as DynamoDB.
  4. Switching database technology is as easy as implementing one of the plugins and can be completed as an iterative process of experimentation, testing, and refinement. 
  5. Some plugins, such as the flat file version, make it very easy to run tests in a controlled, inexpensive environment. 
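Point 3 can be illustrated with a small sketch in which a single store and a partitioned store expose the same two methods, so the consumer cannot tell them apart. The class and function names are hypothetical, and plain dictionaries stand in for real database instances.

```python
class DictStore:
    """Stand-in for one storage instance (for example, one database)."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class PartitionedStore:
    """Exposes the same get/set API but spreads keys across several stores."""

    def __init__(self, stores) -> None:
        self._stores = list(stores)

    def _pick(self, customer_id: int):
        return self._stores[customer_id % len(self._stores)]

    def get(self, customer_id: int):
        return self._pick(customer_id).get(customer_id)

    def set(self, customer_id: int, value) -> None:
        self._pick(customer_id).set(customer_id, value)


def save_report(store, customer_id: int, report: str) -> None:
    """Consumer code: works with any object offering get/set."""
    store.set(customer_id, report)


single = DictStore()
partitioned = PartitionedStore([DictStore(), DictStore()])
for store in (single, partitioned):
    save_report(store, 7, "weekly summary")
    print(store.get(7))
```

Swapping `single` for `partitioned` requires no change to `save_report`, which is the property the plugin design relies on.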

In addition, the DataBag model can be enhanced to deal automatically and transparently with multi-storage solutions. For example, at DreamBox we store data in DynamoDB, back it up on S3, and keep low-performance mirrors on MySql that can be used for easy reporting. All of this becomes an implementation detail of the DataBag abstraction. We even generalized the migration from one storage type to another with a high-level concept of migration (both lazy and proactive) between data stores. We were also able to generalize the concept of caching by implementing a DataBag based on ElastiCache. All of this is handled in the DataBag subsystem with no dependencies on the rest of the system.
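One way to picture lazy versus proactive migration is the sketch below. It is an assumption about how such a wrapper could work, not a description of the DreamBox code; the class name is hypothetical and plain dictionaries stand in for the old and new storage systems.

```python
class LazyMigratingStore:
    """Reads from the new store first; on a miss, copies from the old one."""

    def __init__(self, old: dict, new: dict) -> None:
        self._old = old
        self._new = new

    def get(self, key):
        if key in self._new:
            return self._new[key]
        if key in self._old:
            # Lazy migration: copy the record the first time it is read.
            self._new[key] = self._old[key]
            return self._new[key]
        return None

    def set(self, key, value) -> None:
        # All writes go to the new store only.
        self._new[key] = value

    def migrate_all(self) -> int:
        """Proactive migration: copy every record not yet moved."""
        pending = [k for k in self._old if k not in self._new]
        for key in pending:
            self._new[key] = self._old[key]
        return len(pending)


old_db = {"a": 1, "b": 2}   # pretend this is the legacy storage
new_db = {}                 # pretend this is the new storage
store = LazyMigratingStore(old_db, new_db)
store.get("a")              # reading "a" copies it to the new store
print(sorted(new_db))       # ['a']
store.migrate_all()         # copies the remaining record, "b"
print(sorted(new_db))       # ['a', 'b']
```

A caching layer can follow the same pattern, with a fast store checked first and a slower store consulted on a miss.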

The DataBag solution in this example made the storage itself scalable and fault-tolerant, so our system now looks like this:

Moving our attention backwards, closer to the customer, we encounter another single point of failure: one single physical server running all the logic. If that server fails, our services become unavailable. 

This picture represents that grim situation:

In addition, this single machine running the entire product doesn’t scale because as traffic increases, it will eventually become saturated. An easy solution is to run the product on a cluster of servers with a load balancer in front of it. At DreamBox we use the AWS Elastic Load Balancer. 
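The toy sketch below shows the basic idea of a load balancer in front of a cluster: requests rotate across instances, and instances that fail a health check are skipped. It is not how the AWS load balancer is implemented; the class, the instance names, and the health-check callable are all hypothetical.

```python
from itertools import count


class ToyLoadBalancer:
    """Round-robin over instances that currently pass a health check."""

    def __init__(self, instances, is_healthy) -> None:
        self._instances = list(instances)
        self._is_healthy = is_healthy  # callable: instance -> bool
        self._counter = count()

    def pick(self):
        healthy = [i for i in self._instances if self._is_healthy(i)]
        if not healthy:
            raise RuntimeError("no healthy instances available")
        return healthy[next(self._counter) % len(healthy)]


unresponsive = {"web-2"}  # pretend this instance stopped responding
lb = ToyLoadBalancer(["web-1", "web-2", "web-3"],
                     lambda instance: instance not in unresponsive)
print([lb.pick() for _ in range(4)])  # ['web-1', 'web-3', 'web-1', 'web-3']
```

As long as at least one instance stays healthy, the failure of any single server no longer takes the product down.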

The ELB balances traffic across the instances in the cluster and stops routing requests to instances that fail its health checks. Paired with AWS Auto Scaling, the cluster can also grow or shrink with demand and replace unresponsive instances. With this design choice, the system can now be represented as follows:

This completes the process of turning the initially simple but very fragile system into a fault-tolerant and scalable solution, built for failure and for growth. For scalability you would also want to break down the product into multiple services, each with its own API and running on a cluster with a load balancer in front of it.

We didn’t address the concept of AWS Availability Zones and Regions in this article, but I’ll make sure to discuss those options in future articles.

Written by Lorenzo Pasqualis
Director of Engineering at DreamBox Learning
