Monday, April 14, 2014

Designing for failure in the AWS cloud

Everything fails, all the time: hard disks break, computers overheat, cables get cut, the power goes out, earthquakes damage buildings. Because of all this, no single device should ever be considered fault-tolerant.

It is a guarantee that something in your setup will eventually fail. At large enough scale, something is always failing. 

The reasoning is simple: hardware is physical, subject to the laws of physics, and it breaks all the time. This is a mantra that anyone designing software systems in the AWS cloud must keep in mind. It is also one of the design principles on which AWS itself is built, and it is not a problem that AWS eliminates for its customers.

Something or someone has to deal with all this uncertainty. Ideally, cloud providers such as AWS would handle failures transparently for their customers and provide a failure-proof environment, but that’s just not how it works.

As Reed Hastings said during the AWS Re:Invent 2012 keynote in Las Vegas, “We are still in the assembly language phase of cloud computing.” Coping with failure is, for the most part, left in the hands of cloud computing users, and software architects must take these inevitable failures seriously. Designing for failure must be deliberate: assume that the components of a system will in fact fail, and plan backwards from those failures to address them.

When this backwards planning is done, an important advantage emerges: designing a solution for reliability leads down a path that, with little additional cost and effort, solves scalability issues as well.

For example, let’s say that you have a simple web application where customers can log in from the Internet. The application collects data from those individuals, stores it, and provides reports based on that data. 

Initially, you could design it to run very simply on one server and one instance of MySQL. It would look something like this:


[Figure 1: one web server talking to a single MySQL instance]
This solution might work for a while, and could even run on the computer under your desk with minimal initial cost. Unfortunately, this scenario has several problems:

  1. It doesn’t scale. As traffic increases you’ll quickly reach a breaking point. Either the database (DB) machine or the product server is bound to get saturated and stop working properly.
  2. There are only two physical components and they depend on each other. If either of them fails, the system is going to be down.

So let’s start by focusing on the component farthest from the customer, and move backwards from there. What happens if the MySQL DB starts failing? I’ll use red for a component that is a single point of failure, green for a component that has been made fault-tolerant, and white for a component that I haven’t yet examined. Our goal is to make all the boxes green. With that in mind:


[Figure 2: the same system, with the MySQL database under examination]
In this design, if the database fails:

  1. Customers can no longer log in because their credentials are stored in the DB.
  2. The system is not able to collect data since there is no place to put it.
  3. The business logic cannot operate because the data is not available.
  4. The system is unable to display reports since the data is inaccessible.

In other words, the product becomes completely non-functional, resulting in total disaster:


[Figure 3: the database failure cascades backwards and takes down every component]
Failure started from the DB and moved backward toward the customer, ultimately halting the experience. 

How can this be avoided? We can prevent this scenario by assuming, while designing the system, that any MySQL instance could fail at any time. We need to replace that single point of failure with something fault-tolerant. There are many ways to achieve this result.

For example, we could partition the data across several MySQL instances. With this scheme, we would put the population of customers into buckets and use a separate MySQL instance for each bucket. If any of the instances goes bad, only the group of customers stored in the “bad bucket” experiences issues.

[Figure 4: customers partitioned into buckets, one MySQL instance per bucket]
In this simple example we have two buckets. Assuming that each customer has a numeric customer ID, we could assign customers with odd IDs to DB1 and customers with even IDs to DB2. If one of the two MySQL instances goes bad, you’d get this situation:


[Figure 5: one of the two buckets fails; half of the customers are affected]
While this is not ideal, at least the product keeps working for 50 percent of your customers while you resolve the problem. In general, if you have n buckets and one goes bad, the system breaks for only 1/n of the customers. The more buckets there are, the better this works.
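To make the bucketing concrete, here is a minimal Python sketch of the routing logic; the host names are hypothetical, and in a real system this lookup would live in the data-access layer:

    # Hypothetical host names for the two bucket databases.
    DB_HOSTS = [
        "mysql-bucket-1.internal",
        "mysql-bucket-2.internal",
    ]

    def bucket_for(customer_id: int) -> str:
        """Map a numeric customer ID to the MySQL instance that owns it.

        With two buckets this reduces to the odd/even split described
        above; with n buckets, a single failed instance affects only the
        1/n of customers whose IDs map to it.
        """
        return DB_HOSTS[customer_id % len(DB_HOSTS)]

The mapping is deliberately a pure function of the customer ID and the bucket count, so any server in the fleet can compute it without coordination.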

However, there are several drawbacks to this design. The first is that data partitioned this way is costly to repartition: because the bucket assignment depends on the number of buckets, going from 3 buckets to 5 changes which bucket most customers map to, and all of that data would have to be moved between databases. There are many ways to address this issue, but it would be nice to come up with a solution that:

  1. Is simple and inexpensive to implement on day one.
  2. Allows costs to be managed as the business grows and the available budget increases.
  3. Solves the single point of failure issue.
  4. Allows changing database technology without reengineering the system.
  5. Improves the ability to test the system.

The solution we adopted at DreamBox is based on a physical-storage abstraction that we call “DataBag.” It separates the physical storage from the data consumers behind a very simple NoSQL API compatible with memcached. Each supported storage type is implemented as a plugin of the DataBag system and can be swapped depending on need. For example, in production we can choose a different storage solution than in the test or development environments.

Using three storage plugins as an example (MySQL, DynamoDB, and a simple flat file), the system looks like this:


[Figure 6: consumers talk to the DataBag API; MySQL, DynamoDB, and flat-file plugins sit behind it]
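DreamBox has not published the DataBag code, so the following is only a rough Python sketch of what a memcached-style get/set/delete abstraction with swappable plugins might look like; every name in it is an illustrative assumption, not the actual API.

    import hashlib
    import os
    from abc import ABC, abstractmethod
    from typing import Optional

    class DataBag(ABC):
        """Illustrative storage abstraction: consumers only ever see
        get/set/delete on string keys, never the physical storage."""

        @abstractmethod
        def get(self, key: str) -> Optional[bytes]: ...

        @abstractmethod
        def set(self, key: str, value: bytes) -> None: ...

        @abstractmethod
        def delete(self, key: str) -> None: ...

    class FlatFileBag(DataBag):
        """Toy plugin that stores one file per key; handy for tests
        and local development."""

        def __init__(self, root: str):
            self._root = root
            os.makedirs(root, exist_ok=True)

        def _path(self, key: str) -> str:
            # Hash the key so that arbitrary keys become safe file names.
            return os.path.join(self._root, hashlib.sha1(key.encode()).hexdigest())

        def get(self, key: str) -> Optional[bytes]:
            try:
                with open(self._path(key), "rb") as f:
                    return f.read()
            except FileNotFoundError:
                return None

        def set(self, key: str, value: bytes) -> None:
            with open(self._path(key), "wb") as f:
                f.write(value)

        def delete(self, key: str) -> None:
            try:
                os.remove(self._path(key))
            except FileNotFoundError:
                pass

A MySQL or DynamoDB plugin would implement the same three methods, so swapping storage technologies never touches the data consumers.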
This solution addresses the various concerns mentioned above:

  1. It is inexpensive to implement on day one because the DataBag API is very simple and creating the first plugin is easy.
  2. It allows tuning of costs, because some storage solutions cost more than others. At DreamBox we started with a MySQL implementation and eventually moved to AWS DynamoDB as budget constraints changed.
  3. The logic that deals with single points of failure is contained in the plugins and is specific to each storage system, but it is abstracted from the rest of the system, so it can be implemented once and refined over time. For example, a first implementation based on a single MySQL instance can easily be swapped for a partitioned MySQL plugin or for an AWS managed service such as DynamoDB.
  4. Switching database technology is as easy as implementing one of the plugins and can be completed as an iterative process of experimentation, testing, and refinement. 
  5. Some plugins, such as the flat file version, make it very easy to run tests in a controlled, inexpensive environment. 

In addition, the DataBag model can be enhanced to deal with multi-storage solutions automatically and transparently. For example, at DreamBox we store data in DynamoDB, back it up on S3, and keep low-performance MySQL mirrors that can be used for easy reporting. All of this is an implementation detail of the DataBag abstraction. We even generalized migration from one storage type to another with a high-level concept of migration, both lazy and proactive, between data storages. We were also able to generalize caching by implementing a DataBag based on ElastiCache. All of this is handled in the DataBag subsystem with no dependencies on the rest of the system.
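As a rough illustration of the multi-storage idea (again an assumption, not DreamBox’s actual implementation), a composite plugin built on the DataBag interface sketched above could serve reads from a primary store while fanning writes out to mirrors:

    from typing import List, Optional

    class MirroredBag(DataBag):
        """Hypothetical composite plugin: reads come from the primary
        store; writes go to the primary and are then fanned out to
        mirrors (e.g., a backup store or a reporting replica).
        Handling of partial-write failures is elided for brevity."""

        def __init__(self, primary: DataBag, mirrors: List[DataBag]):
            self._primary = primary
            self._mirrors = mirrors

        def get(self, key: str) -> Optional[bytes]:
            return self._primary.get(key)

        def set(self, key: str, value: bytes) -> None:
            self._primary.set(key, value)
            for mirror in self._mirrors:
                mirror.set(key, value)

        def delete(self, key: str) -> None:
            self._primary.delete(key)
            for mirror in self._mirrors:
                mirror.delete(key)

Because MirroredBag is itself a DataBag, the rest of the system cannot tell whether it is talking to one store or several.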

The DataBag solution in this example made the storage itself scalable and fault-tolerant, so our system now looks like this:


[Figure 7: the storage layer is now fault-tolerant; the product server remains a single point of failure]
Moving our attention backwards, closer to the customer, we encounter another single point of failure: a single physical server running all the logic. If that server fails, our services become unavailable.

This picture represents that grim situation:


[Figure 8: the single product server fails and the service goes down]
In addition, this single machine running the entire product doesn’t scale: as traffic increases, it will eventually become saturated. An easy solution is to run the product on a cluster of servers with a load balancer in front of it. At DreamBox we use the AWS Elastic Load Balancer (ELB).


[Figure 9: a cluster of product servers behind an Elastic Load Balancer]
The ELB distributes traffic across the instances in the cluster and, combined with AWS Auto Scaling, takes care of growing and shrinking the fleet and of replacing unresponsive instances; a sketch of the instance-side health check follows the figure. With this design choice, the system can now be represented as follows:


[Figure 10: the final system, fault-tolerant and scalable end to end]
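On the instance side, the main requirement for this arrangement is a health-check endpoint that the load balancer can poll. Here is a minimal sketch; the /health path and port 8080 are assumptions that would have to match the health-check target configured on the load balancer:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class HealthHandler(BaseHTTPRequestHandler):
        """Minimal health-check endpoint. The load balancer marks the
        instance unhealthy, and takes it out of rotation, when this
        endpoint stops answering 200 OK."""

        def do_GET(self):
            if self.path == "/health":
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"OK")
            else:
                self.send_response(404)
                self.end_headers()

    if __name__ == "__main__":
        # The port must match the load balancer's health-check target
        # (e.g., HTTP:8080/health).
        HTTPServer(("", 8080), HealthHandler).serve_forever()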
This completes the process of turning the initial simple but very fragile system into a fault-tolerant and scalable solution, built for failure and for growth. For scalability you would also want to break the product down into multiple services, each with its own API and each running on a cluster with a load balancer in front of it.

We didn’t address the concept of AWS Availability Zones and Regions in this article, but I’ll make sure to discuss those options in future articles.

Written by Lorenzo Pasqualis
Sr. Director of Engineering at DreamBox Learning

Thursday, April 3, 2014

Engineering at DreamBox Learning

The engineers who decide to join DreamBox Learning do so for many different reasons. Some reasons are of course personal, but there are a few clear patterns that I have observed over the years that I’d like to share.

#1 Engineering with a Higher Purpose
Most of our engineers are veterans of the technology world and have worked for many software companies across different industries. Only some of them, however, have been lucky enough to feel that their work at previous jobs had a higher purpose. Today, our engineers find that DreamBox fulfills that aspect of their professional life.
It is a great feeling to know that hundreds of thousands of children are improving their math skills by using a product you’ve had a direct hand in creating.
For engineers with young children of their own, it is also refreshing to be able to show their kids what they work on—sparking interest in those young minds for both software development and math in the process.

#2 Engineering for Scale
DreamBox is continually refined to scale to millions of users on the web. Our student population increases year after year, and we recognize that keeping pace with that demand starts with a solid architecture.
The improvements we continuously make to our platform require an ever-growing amount of data to be collected, stored, and analyzed for each student. This combination of complex, evolving data-analysis needs and growing data volume is an exciting challenge for our engineers, who become experts in “Big Data” solutions.

#3 Engineering in the Cloud
At DreamBox we use AWS cloud solutions heavily. In fact, we are 100 percent hosted in the cloud, and both our engineering and operations departments are continually working on new ways to maximize the value we get from this choice.
We have attended AWS Re:Invent in Las Vegas every year, and we stay on the cutting edge of Amazon’s technology solutions, which makes us a great choice for engineers who like to keep their skill sets current.
We use a number of Amazon services and systems, including ELB, EC2, CloudFormation, ElastiCache, RDS, DynamoDB, SQS, SES, and EMR.
We architect our backend as a collection of highly scalable, highly reliable distributed services, loosely coupled and built for failure. This strategy allows us to scale to the constant demands of our business.

#4 Engineering with Smart People
We hire well and we hire smart. Our engineers are a pleasure to work with and are carefully selected for both skills and cultural fit. We work in an environment where coders and educators interact constantly to bring the best possible educational product to market.
Recognizing that the market moves at incredible speed, we move fast and are constantly researching and experimenting to keep pace with customer needs and technological advances. We strive to make innovation part of our engineering culture, and we are very data-driven, basing much of our innovation on what the numbers tell us rather than on guesses.
We are also not afraid of thinking outside the box or of making mistakes, as we recognize that controlled errors are often a basis for improvement.

#5 Engineering without Dogmas
We are an agile software development shop—fast and lean. We currently operate mainly using a Scrum process but without subscribing to any particular “book” or position. In fact, we adapt our processes and technology choices to our particular needs and we shy away from engineering dogmas or “this is the way we do it here” stances.

#6 Engineering with a Personal Life
We come to work every day to help all children acquire better math skills, but we do not forget about our own children and families in the process.
Most of our engineers have children of their own, and DreamBox has been very successful in promoting a healthy work/life balance where employee needs are acknowledged and supported by management. 
We come to work to solve hard problems, and we work on them passionately, keeping in mind the value we bring to millions of schoolchildren. We recognize that this is made possible because each of us knows that, at the end of the day, our employer is as respectful of our personal lives outside the office as it is of those of our customers.

#7 Engineering with Dogs
We have a dog-friendly office space and our employees are welcome to bring in their well-behaved canine pets, and they do.

Written by Lorenzo Pasqualis
Sr. Director of Engineering at DreamBox Learning