Something in your setup is guaranteed to fail eventually. At large enough scale, something is always failing.
The reasoning is simple: hardware is physical, subject to the laws of physics, and it breaks all the time. This is a mantra that anyone designing software systems in the AWS cloud must always keep in mind. It is also one of the design principles on which the AWS cloud is built, and it is a reality that AWS does not eliminate for its customers.
Something or someone has to deal with all these uncertainties. Ideally, cloud solutions providers such as Amazon AWS would solve problems transparently for their customers and would take care of providing a failure-proof environment, but that’s just not how it works.
As Reed Hastings said during the AWS Re:Invent 2012 Keynote in Las Vegas, “We are still in the assembly language phase of cloud computing.” Coping with failure is, for the most part, left in the hands of cloud computing users, and software architects must take these inevitable failures very seriously. Designing for failure must be deliberate: assume that the components of a system will in fact fail, and plan backwards from those failures to address them.
When this backwards planning is done, an important advantage is discovered: Designing a solution for reliability leads to a path that, with little additional cost and effort, solves scalability issues as well.
For example, let’s say that you have a simple web application where customers can log in from the Internet. The application collects data from those individuals, stores it, and provides reports based on that data.
Initially, you could design it to run very simply on a single product server and a single MySql instance. It would look something like this:
![fig1]()
This design has two major problems:

- It doesn’t scale. As traffic increases, you’ll quickly reach a breaking point: either the database (DB) machine or the product server is bound to get saturated and stop working properly.
- There are only two physical components and they depend on each other. If either of them fails, the system is going to be down.
So let’s start by focusing on the component that is farthest from the customer, and move backwards from there. What happens if the MySql DB starts failing? I’ll use red for a component that represents a single point of failure, green for a component that has been determined to be fault-tolerant, and white for a component that I haven’t examined yet. Our goal is to make all the boxes green. With that in mind:
![fig2]()
- Customers can no longer log in because their credentials are stored in the DB.
- The system is not able to collect data since there is no place to put it.
- The business logic cannot operate because the data is not available.
- The system is unable to display reports since the data is inaccessible.
In other words, the product becomes completely non-functional, resulting in total disaster:
![fig3]()
How can this be avoided? We can prevent this scenario by assuming, during the design of the system, that any MySql instance could fail at any time. We need to replace that single point of failure with something fault-tolerant. There are many options to achieve this result.
For example, we could partition the data across several MySql instances. With this scheme, we would divide the population of customers into buckets and use a separate MySql instance for each bucket. If any of the instances goes bad, only the group of customers stored in the “bad bucket” will experience issues.
![fig4]()
![fig5]()
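To make the idea concrete, here is a minimal sketch of what that bucket routing could look like, assuming a simple hash-modulo scheme; the host names and the choice of hash are illustrative, not an actual implementation.

```python
# Minimal sketch of bucket-based partitioning; the host names and the
# hashing scheme are illustrative assumptions.
import hashlib

MYSQL_HOSTS = [  # one MySql endpoint per bucket (placeholders)
    "mysql-bucket-0.example.internal",
    "mysql-bucket-1.example.internal",
    "mysql-bucket-2.example.internal",
]


def bucket_for(customer_id: str, num_buckets: int = len(MYSQL_HOSTS)) -> int:
    """Deterministically map a customer to one of the buckets."""
    digest = hashlib.md5(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def host_for(customer_id: str) -> str:
    """Pick the MySql instance that owns this customer's data."""
    return MYSQL_HOSTS[bucket_for(customer_id)]


# If one instance fails, only the customers whose bucket maps to that
# instance are affected; the other buckets keep working.
print(host_for("customer-42"))
```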
However, there are several drawbacks to this design. The first is that data partitioned this way becomes costly to repartition. For instance, if we start with 3 buckets and want to move to 5, we would need to repartition all the customers and move them between databases (the sketch after the list below illustrates how much data has to move). There are many ways to address this issue, but it would be nice to come up with a solution that:
- Is simple and inexpensive to implement on day one.
- Allows costs to be managed as the business grows and the available budget increases.
- Solves the single point of failure issue.
- Allows changing database technology without reengineering the system.
- Improves the ability to test the system.
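The repartitioning cost mentioned above is easy to demonstrate with a small, hypothetical calculation: with a modulo-based scheme, moving from 3 to 5 buckets forces most customers into a different bucket.

```python
# Hypothetical illustration of the repartitioning cost: count how many
# customers change buckets when going from 3 to 5 buckets with a
# modulo-based scheme.
import hashlib


def bucket_for(customer_id: str, num_buckets: int) -> int:
    digest = hashlib.md5(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


customers = [f"customer-{i}" for i in range(10_000)]
moved = sum(1 for c in customers if bucket_for(c, 3) != bucket_for(c, 5))
print(f"{moved / len(customers):.0%} of customers change buckets")
# Roughly 80% of customers land in a different bucket, which means
# copying most of the data between databases.
```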
The solution we adopted at DreamBox is based on a physical storage abstraction—we call it “DataBag.” It is designed to separate the physical storage from the data consumers with a very simple NoSql API compatible with memcached. Each physical storage type that is supported is implemented as a plugin of the DataBag system and can be switched depending on needs. For example, in production we can choose to use a different storage solution than in the test or development environments.
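The actual DataBag API is not reproduced here, but a minimal sketch of what such a memcached-style abstraction could look like might be the following; the class and method names are assumptions for illustration.

```python
# Hypothetical sketch of a DataBag-style interface; names and signatures
# are assumptions. The shape mirrors memcached: get / set / delete on
# opaque keys and values.
from abc import ABC, abstractmethod
from typing import Optional


class DataBag(ABC):
    """Storage abstraction: data consumers only ever see this interface."""

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]:
        ...

    @abstractmethod
    def set(self, key: str, value: bytes) -> None:
        ...

    @abstractmethod
    def delete(self, key: str) -> None:
        ...


# Each physical storage type (MySql, DynamoDB, a flat file, ...) would be
# a plugin implementing this interface, so each environment can pick a
# different backend without touching the consumers.
```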
Using three storage plugins as an example (MySql, DynamoDB, and a simple Flat File), the system looks like this:
![fig6]()
- It is inexpensive to implement on day one because the DataBag API is very simple and creating the first plugin is easy.
- It allows tuning the costs because some storage solutions cost more than others. At DreamBox we started using a MySql implementation and eventually moved to AWS DynamoDB as budget constraints changed.
- The logic to deal with single points of failure is contained in the plugins and is specific to each storage system, but it is abstracted from the rest of the system, so it can be implemented once and refined over time. For example, a first implementation based on a single MySql instance can easily be swapped for a partitioned MySql plugin or for an AWS managed service such as DynamoDB.
- Switching database technology is as easy as implementing one of the plugins and can be completed as an iterative process of experimentation, testing, and refinement.
- Some plugins, such as the flat file version, make it very easy to run tests in a controlled, inexpensive environment (a sketch of such a plugin follows this list).
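As an illustration of that last point, here is what a flat-file plugin could look like; the class name and file layout are assumptions, not DreamBox’s actual code.

```python
# Hypothetical flat-file plugin implementing the same get/set/delete
# shape sketched above; handy for tests because it needs no external
# services.
import os
from typing import Optional


class FlatFileDataBag:
    """Stores each key as a small file under a root directory."""

    def __init__(self, root: str) -> None:
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key: str) -> str:
        # Hex-encode the key so it is always a safe file name.
        return os.path.join(self.root, key.encode("utf-8").hex())

    def get(self, key: str) -> Optional[bytes]:
        try:
            with open(self._path(key), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def set(self, key: str, value: bytes) -> None:
        with open(self._path(key), "wb") as f:
            f.write(value)

    def delete(self, key: str) -> None:
        try:
            os.remove(self._path(key))
        except FileNotFoundError:
            pass
```

A test can point the rest of the system at a FlatFileDataBag over a temporary directory and exercise the same code paths that a production plugin would serve.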
In addition, the DataBag model can be enhanced to automatically and transparently deal with multi-storage solutions. For example, at DreamBox we store data in DynamoDB, we back it up on S3, and we keep low-performance MySql mirrors that can be used for easy reporting. All of this becomes an implementation detail of the DataBag abstraction. We even generalized migration from one storage type to another with a high-level concept of migration, both lazy and proactive, between data storages. We were also able to generalize the concept of caching by implementing a DataBag based on ElastiCache. All of this is handled in the DataBag subsystem with no dependencies on the rest of the system.
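To illustrate the lazy-migration idea, a wrapper over two backends could look like the sketch below; names and behavior are assumptions, not the actual DreamBox implementation.

```python
# Hypothetical sketch of lazy migration between two DataBag-like backends:
# reads fall back to the old storage and copy the value forward; writes
# and deletes target the new storage.
from typing import Optional


class MigratingDataBag:
    """Wraps an old and a new backend during a storage migration."""

    def __init__(self, old_storage, new_storage) -> None:
        self.old = old_storage
        self.new = new_storage

    def get(self, key: str) -> Optional[bytes]:
        value = self.new.get(key)
        if value is None:
            value = self.old.get(key)
            if value is not None:
                # Lazily copy the record forward on first access.
                self.new.set(key, value)
        return value

    def set(self, key: str, value: bytes) -> None:
        self.new.set(key, value)

    def delete(self, key: str) -> None:
        self.new.delete(key)
        self.old.delete(key)


# A "proactive" migration would walk the keys of the old storage in the
# background and apply the same forward-copy logic ahead of time.
```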
The DataBag solution in this example made the storage itself scalable and fault-tolerant, so our system now looks like this:
![fig7]()
With the storage layer taken care of, the product server is now the remaining single point of failure: if it goes down, the whole system goes down with it. This picture represents that grim situation:
![fig8]()
![fig9]()
![fig10]()
I didn’t address AWS Availability Zones and Regions in this article, but I’ll make sure to discuss those options in future articles.
Written by Lorenzo Pasqualis
Sr. Director of Engineering at DreamBox Learning