Thursday, March 31, 2016

Software Development Process in a Scaling Organization

DreamBox is growing. As we grow, we are learning to be more adaptive and ensure efficient scaling as things change. In Engineering, we formed several teams to build our solutions. Having several independent teams has given us the ability to parallelize much of the work we do, and move faster to serve our customers. As a result we had to improve our ability to synchronize and coordinate across teams, and create some additional structure around the daily engineering operations and the development cycle.

We had already been using the roles of Product Owner (PO) and Scrum Master (SM) extensively for years. Those are traditional Agile Development roles, and their clear definition worked well for us. In addition, we felt the need to formalize a third role in each team: the Project Lead (PL). This is nothing new as this role usually exists in development teams in one form or another, but we found it useful to formalize the PL as a specific person for each of the scrum teams. The formalization clarified roles, improved communication, and prevented running in circles when it came to organizing the process to keep the teams synchronized and focused on common business goals.

Together these roles represent the three legs of the stool that facilitate, support, and guide the teams to work in an organized and efficient way. Effort with positive impact is best achieved by ensuring accountability and ownership of a strong customer, product, business, and technical vision to guide the scrum teams to efficiently meet the right goals.

When defining any kind of role, it is critical that everyone involved is on the same page about why the role is needed and who is filling it, and that the company constantly learns and adapts as needs change.

In this chart the three roles are represented as circles, and the areas of interaction are labeled with letters.

The chart shows how the three roles:
  1. Operate independently, performing their specialized functions (A, B, and C).
  2. Interface with, understand, and support each other, and help each other make the project successful (D, E, and F).
  3. Collaborate tightly as a team to ensure the delivery team has clear priorities, technical vision, and supporting structure and process (X).
The circles are positioned horizontally based on their level of focus. External focus (customers) on the right, and internal focus (the company and the process) on the left. The PO is mostly focused externally, but is constantly interacting with the team. The SM is mostly focused internally, and is all about making sure the team is not impeded. The PL is in the middle, and needs to understand both focus areas, but not necessarily in great depth.

The circles are also positioned vertically based on the amount of detail they need to worry about and keep in mind. The PO stays high level, but has an ability to dig in and understand the details when the team needs them to. The SM is slightly lower level being aware of the day-to-day details, but does not need to know everything about the technical implementation. The PL needs to know all the technical details of the project, but also have an understanding of the business vision, the customer needs, and the overall process.

At DreamBox, responsibilities of these roles are described as follows:

Product Owner

A Product Owner (PO) is the empowered single point of product leadership. A PO simultaneously represents internal and external views (understanding the needs of the organization, stakeholders, customers, etc.) and clearly communicates to the development team what is needed and in what order to build it. A PO's role includes:
  • Managing the economic decisions of what is being built (business value, cost/benefit, opportunity cost). 
  • Participating as a key player in planning, providing valuable input that enables the team to select product backlog items to commit to for delivery in a sprint. 
  • Grooming the backlog, which includes creating, refining, and prioritizing items. 
  • Defining the acceptance criteria for stories so that teams can ensure that what is being built is in fact what the PO is asking for. 
  • Collaborating with the development team on a regular basis, keeping an open feedback loop throughout the development process. 
  • Collaborating with stakeholders, acting as the single voice for the stakeholder community and using their input to inform decisions about the product backlog.


Scrum Master

A Scrum Master (SM) should be viewed as: 
  • A coach to the PO and delivery team.
  • A servant-leader who helps the team become efficient and continually improve.
  • The process authority for the delivery team, not in setting the process for the team but rather in helping each team establish the process that works best for them. 
  • A shield protecting the team from interference and interruption so they can focus on their committed sprint. 
  • An impediment remover, helping the team maintain productivity.
  • A change agent helping evolve agile practices in the organization. 

Project Lead

A Project Lead (PL) is a member of the team who is responsible for:
  • Staying informed about all technical aspects of the project.
  • Clearly articulating to the team, other teams, and stakeholders how the implementation achieves the project's vision and goals.
  • Communicating and collaborating with other teams about the project.
  • Proactively understanding, sharing, and addressing cross-team project dependencies.
  • Understanding and socializing the implications of project decisions on customers and client-facing teams (Sales, Marketing, Finance, Customer Support, etc.), and creating opportunities for team members to get close to the customer.
  • Driving the technical vision, strategy, tactics, and direction of the project.

When Things Don't Go Well

These three roles are very important. If one of them is not fulfilled, symptoms of dysfunction start cropping up. Here we examine the various issues and symptoms of not effectively carrying out PO, SM, and PL functions.

When the Product Owner is Not Effectively Engaged

The PO carries the business vision of what the team is building, and it is absolutely necessary to have an engaged PO for the team to succeed. If the product owner is not effectively engaged with the PL and the SM, one or more of the following symptoms may be observed:

  • The backlog gets too short, and the team doesn’t know what’s coming next.
  • General lack of priority. The team doesn’t know what is most important and, lacking a clear vision, starts making up priorities. Often they take on experimental work just to fill time (there is nothing wrong with experimentation, but experimentation needs to be deliberate, not accidental).
  • The PL and/or the SM have to start collecting information to make business-priority decisions. That slows them down from their other important functions, and the decisions made are often not well informed.
  • There is no product vision or business vision. Engineers end up making all the product decisions based on insufficient information or wrong assumptions.
  • Poor product vision creates a lack of continuity in the user experience. Products end up looking like collections of features, not integrated solutions.
  • Stakeholders, and the business as a whole, don't know what's going on with the projects: (1) What is being released, (2) What is being worked on, (3) What's coming, (4) When things are coming.
  • Since there is not a clear vision, stakeholders get nervous and start creating fire drills, throwing work items to the teams directly, without a clear indication of relative priority. Without a PO helping to negotiate priorities with the stakeholders, the team starts reacting, jumping on the latest fire drill immediately, interrupting whatever they were already doing. Very little ends up getting completely done this way.
  • Since lots of work is started but not finished, we see lots of effort without much business impact.

When the Scrum Master is Not Effectively Engaged

The SM facilitates the inner workings of the team, and ensures that process or other impediments do not get in the way. If the SM is not effectively engaged with PO and the PL, we risk one or more of the following symptoms:

  • The team becomes disorganized.
  • The team spends a lot of time discussing process or simply ignoring the process.
  • Since the SM is the guide through an organized process, without an effective SM the process seems to get in the way of progress instead of facilitating progress.
  • The team becomes unpredictable – it is unclear to the PO how much time anything will take to finish.
  • It is difficult to measure the effectiveness of the team and the status of a project.
  • It is hard for the PL to keep track of what's happening and who is working on what.
  • The technical vision and the business vision start drifting from the actual work that is being done.
  • Impediments do not surface, or surface too late, blocking the team members for prolonged periods of time.
  • The team spends the majority of its time on unprioritized or untracked work.

When the Project Lead is Not Effectively Engaged

A PL carries the technical vision of the project and can explain what's going on with the project in great detail. If the PL is not effectively engaged with the PO and the SM, we risk one or more of the following symptoms:

  • General lack of technical vision. 
  • The team often has to ask the question: Where are we going with this?
  • Engineers start making incompatible decisions in the team and across teams.
  • There is no effective cross-team communication. 
  • Different teams start stepping on each other's toes, especially at release time.
  • Technical roadblocks are often found at integration time.
  • Various scrum teams start developing different and incompatible jargon and terms. 
  • Team is unable to settle technical disagreements.
  • Team is often surprised by technical choices made by other members of the team, or by other teams.
  • SM and PO are overwhelmed by technical details that they don't necessarily fully grasp.
  • Nobody can explain how things actually work.
  • Nobody spots (early enough) product designs and visions incompatible with the technical reality.

Written by:
Lorenzo Pasqualis, Sr. Director of Engineering at DreamBox Learning
Eryn Doerffler, Senior Project Manager and Scrum Master at DreamBox Learning

Monday, April 14, 2014

Designing for failure in the AWS cloud

Everything fails all the time: Hard disks break, computers overheat, wires get broken, the power goes out, earthquakes damage buildings, and because of all this, no single device should be considered fault-tolerant.

It is a guarantee that something in your setup will eventually fail. At large enough scale, something is always failing. 

The reasoning is simple: hardware is physical—subject to the laws of physics—and thus, it breaks all the time. This is a mantra that anyone designing software systems in the AWS cloud must always keep in mind. It is also one of the design principles on which the Amazon AWS cloud solutions are built, and it is a principle that AWS does not eliminate for its customers. 

Something or someone has to deal with all these uncertainties. Ideally, cloud solutions providers such as Amazon AWS would solve problems transparently for their customers and would take care of providing a failure-proof environment, but that’s just not how it works.

As Reed Hastings said during the AWS re:Invent 2012 keynote in Las Vegas, “We are still in the assembly language phase of cloud computing.” Coping with failure is for the most part left in the hands of cloud computing users, and software architects must take these inevitable failures very seriously. Designing for failure must be deliberate: assume that the components of a system will in fact fail, and plan backwards from those failures to address them. 

When this backwards planning is done, an important advantage is discovered: Designing a solution for reliability leads to a path that, with little additional cost and effort, solves scalability issues as well.

For example, let’s say that you have a simple web application where customers can log in from the Internet. The application collects data from those individuals, stores it, and provides reports based on that data. 

Initially, you could design it to run very simply across one server and one instance of MySql. It would look something like this:

This solution might work for a while, and could even run on the computer under your desk with minimal initial cost. Unfortunately though, this scenario includes several concerns: 

  1. It doesn’t scale. As traffic increases you’ll quickly reach a breaking point. Either the database (DB) machine or the product server is bound to get saturated and stop working properly.
  2. There are only two physical components and they depend on each other. If either of them fails, the system is going to be down.

So let’s start by focusing on the component that is furthest from the customer, and move backwards from there. What happens if the MySql DB starts failing? I’ll indicate in red a component that represents a single point of failure, in green a component that has been made fault-tolerant, and in white a component that I haven’t yet examined. Our goal is to make all the boxes green. With that in mind:

In this design if the database fails:

  1. Customers can no longer log in because their credentials are stored in the DB.
  2. The system is not able to collect data since there is no place to put it.
  3. The business logic cannot operate because the data is not available.
  4. The system is unable to display reports since the data is inaccessible.

In other words, the product becomes completely non-functional, resulting in total disaster:

Failure started from the DB and moved backward toward the customer, ultimately halting the experience. 

How can this be avoided? We can prevent this scenario by assuming during the design of the system that any MySql instance could fail at any time. We need to replace that single point of failure with something that is tolerant of errors. There are many options to achieve this result. 

For example, we could partition the data across several MySql instances. With this scheme, we would put the population of customers in buckets and use a separate MySql instance for each of the buckets. If any of the instances goes bad, only the group of customers stored in the “bad bucket” will experience issues.

In this simple example we have two buckets. Assuming that each customer has a numeric customer ID, we could assign customers with odd ID numbers to DB1 and customers with even ID numbers to DB2. If one of the two MySql instances goes bad, you’d get this situation:

While this is not ideal, at least the product will continue working for 50 percent of your customers while you resolve the problem. In general, if you have n buckets and one goes bad, the system would be broken for only 1/n of the customers. The more buckets that exist, the better this works.
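The bucketing scheme above can be sketched in a few lines. This is a minimal, illustrative example, not DreamBox's actual routing code; the function and connection names are assumptions. With two buckets, a modulo on the numeric customer ID reproduces the odd/even split described above, and changing the bucket count generalizes it to n buckets.

```python
# Hypothetical sketch of ID-based bucket partitioning (names are illustrative).

def bucket_for(customer_id: int, num_buckets: int) -> int:
    """Map a numeric customer ID to one of num_buckets database instances."""
    return customer_id % num_buckets

# With two buckets: odd IDs land in bucket 1 (DB1), even IDs in bucket 0 (DB2).
connections = {0: "DB2 (even IDs)", 1: "DB1 (odd IDs)"}

def db_for_customer(customer_id: int) -> str:
    """Pick the connection for a customer based on its bucket."""
    return connections[bucket_for(customer_id, len(connections))]
```

Note that this simple modulo scheme is exactly what makes repartitioning costly: changing `num_buckets` reassigns most customers to a different database, which is the drawback discussed next.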

However, there are several drawbacks to this design. The first is that data partitioned this way is costly to repartition. For instance, if we start with 3 buckets and want to move to 5, we would need to repartition all the customers and move them around within the databases. There are many ways to address this issue, but it would be nice to come up with a solution that:

  1. Is simple and inexpensive to implement on day one.
  2. Allows managing of costs as the business grows and the available budget increases.
  3. Solves the single point of failure issue.
  4. Allows changing database technology without reengineering the system.
  5. Improves the ability to test the system.

The solution we adopted at DreamBox is based on a physical storage abstraction—we call it “DataBag.” It is designed to separate the physical storage from the data consumers with a very simple NoSql API compatible with memcached. Each physical storage type that is supported is implemented as a plugin of the DataBag system and can be switched depending on needs. For example, in production we can choose to use a different storage solution than in the test or development environments. 

Using three storage plugins as an example (MySql, DynamoDB, and a simple Flat File), the system looks like this:

This solution addresses the various concerns mentioned above:

  1. It is inexpensive to implement on day one because the DataBag API is very simple and creating the first plugin is easy.
  2. It allows tuning the costs because some storage solutions cost more than others. At DreamBox we started using a MySql implementation and eventually moved to AWS DynamoDB as budget constraints changed. 
  3. The logic to deal with a single point of failure issue is contained in the plugins and is specific to the storage system, but it is abstracted from the rest of the system, so it can be implemented and then refined over time. For example, a first implementation of a single MySql instance plugin can be easily switched with a partitioned MySql plugin solution or with an AWS managed service such as DynamoDB. This abstracts the single point of failure issue and allows refining the handling over time.
  4. Switching database technology is as easy as implementing one of the plugins and can be completed as an iterative process of experimentation, testing, and refinement. 
  5. Some plugins, such as the flat file version, make it very easy to run tests in a controlled, inexpensive environment. 
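The plugin structure described above can be sketched as a small key-value interface with swappable backends. This is a hedged illustration of the idea, not DreamBox's actual DataBag API; all class and method names here are assumptions, and the in-memory plugin stands in for the flat-file backend used in testing.

```python
# Illustrative sketch of a DataBag-style storage abstraction.
from abc import ABC, abstractmethod

class StoragePlugin(ABC):
    """A physical storage backend exposing a minimal memcached-like get/set API."""

    @abstractmethod
    def get(self, key):
        ...

    @abstractmethod
    def set(self, key, value):
        ...

class InMemoryPlugin(StoragePlugin):
    """Stand-in for a flat-file or test backend; a MySql or DynamoDB
    plugin would implement the same two methods against real storage."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

class DataBag:
    """Consumers talk only to the DataBag; which plugin backs it
    is an implementation detail that can differ per environment."""

    def __init__(self, plugin: StoragePlugin):
        self._plugin = plugin

    def get(self, key):
        return self._plugin.get(key)

    def set(self, key, value):
        self._plugin.set(key, value)

bag = DataBag(InMemoryPlugin())
bag.set("student:42", {"lessons_completed": 7})
```

Because callers depend only on the `DataBag` interface, swapping `InMemoryPlugin` for a partitioned-MySql or DynamoDB plugin requires no changes outside the plugin itself, which is the property the numbered list above relies on.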

In addition, the DataBag model can be enhanced to automatically and transparently deal with multi-storage solutions. For example, at DreamBox we store data in DynamoDB, back it up on S3, and maintain low-performance MySql mirrors that can be used for easy reporting. All of this becomes an implementation detail of the DataBag abstraction. We even generalized migration from one storage type to another with a high-level concept of migration, both lazy and proactive, between data stores. We were also able to generalize the concept of caching by implementing a DataBag based on ElastiCache. All of this is handled in the DataBag subsystem with no dependencies on the rest of the system.
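One common way to implement the lazy side of such a migration is read-through fallback: reads check the new store first, fall back to the old store, and copy any hit forward. The sketch below is an assumption about how this could look, not DreamBox's actual implementation; the `Store` class is a minimal stand-in for a storage plugin.

```python
# Hedged sketch of lazy migration between two data stores (names hypothetical).

class Store:
    """Minimal stand-in for a storage plugin with a get/set API."""

    def __init__(self, data=None):
        self.data = dict(data or {})

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value

class LazyMigratingBag:
    """Reads fall back to the old store and migrate records forward;
    writes go only to the new store."""

    def __init__(self, new_store, old_store):
        self.new = new_store
        self.old = old_store

    def get(self, key):
        value = self.new.get(key)
        if value is None:
            value = self.old.get(key)      # fall back to the legacy store
            if value is not None:
                self.new.set(key, value)   # migrate the record forward
        return value

    def set(self, key, value):
        self.new.set(key, value)           # new data never touches the old store

old = Store({"student:7": {"score": 3}})   # legacy data
new = Store()                              # empty new store
bag = LazyMigratingBag(new, old)
```

Over time, every record that is read gets copied into the new store; a proactive migration pass can then sweep the remaining cold records, after which the old store can be retired.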

The DataBag solution in this example made the storage itself scalable and fault-tolerant, so our system now looks like this:

Moving our attention backwards, closer to the customer, we encounter another single point of failure: one single physical server running all the logic. If that server fails, our services become unavailable. 

This picture represents that grim situation:

In addition, this single machine running the entire product doesn’t scale because as traffic increases, it will eventually become saturated. An easy solution is to run the product on a cluster of servers with a load balancer in front of it. At DreamBox we use the AWS Elastic Load Balancer. 

The ELB, together with AWS Auto Scaling, takes care of load balancing across the instances in the cluster, scaling the cluster up and down, and replacing unresponsive instances. With this design choice, the system can now be represented as follows:

This completes the process of turning the initial simple but very fragile system into a fault-tolerant and scalable solution, built for failure and for growth. For scalability you would also want to break down the product into multiple services, each of them with an API and running on a cluster with a load balancer in front of it. 

We didn’t address the concept of AWS Availability Zones and Regions in this article, but I’ll make sure to discuss those options in future articles.

Written by Lorenzo Pasqualis
Sr. Director of Engineering at DreamBox Learning

Thursday, April 3, 2014

Engineering at DreamBox Learning

The engineers who decide to join DreamBox Learning do so for many different reasons. Some reasons are of course personal, but there are a few clear patterns that I have observed over the years that I’d like to share.

#1 Engineering with a Higher Purpose
Most of our engineers are veterans of the technology world and have been employed by many software companies and across different industries. Only some of them, however, have been lucky enough at previous jobs where they felt their work had a higher purpose. Today, our engineers find that DreamBox fulfills that aspect of their professional life.
It is a great feeling to know that hundreds of thousands of children are improving their math skills by using a product you’ve had a direct hand in creating.
For engineers with young children of their own, it is also refreshing to be able to show their kids what they work on—sparking interest in those young minds for both software development and math in the process.

#2 Engineering for Scale
DreamBox is continually built and constantly refined to scale to millions of users on the web. Our student population continues to increase year after year, and we recognize that being able to keep pace with the demand starts from having a solid architecture.
The improvements we continuously make to our platform require an ever-growing amount of data to be collected, stored, and analyzed for each student. This need for complex and evolving data analysis solutions and our growing data collection volume is an exciting challenge for our engineers who become experts in “Big Data” solutions.

#3 Engineering in the Cloud
At DreamBox we use AWS cloud solutions heavily. In fact, we are 100 percent hosted in the cloud, and both our engineering and operations departments are continually working on new ways to maximize the value we get from this choice.
We have attended AWS re:Invent in Las Vegas each year, and we ride the cutting edge of their technology solutions, making us a great choice for engineers who like to keep their skill sets current.
We use a number of Amazon services and systems such as ELB, EC2, CloudFormation, ElastiCache, RDS, DynamoDB, SQS, SES, EMR, etc.
We architect our backend as a collection of highly scalable, highly reliable distributed services, loosely coupled and built for failure. This strategy allows us to scale to the constant demands of our business.

#4 Engineering with Smart People
We hire well and we hire smart. Our engineers are a pleasure to work with and are carefully selected for both skills and cultural fit. We work in an environment where coders and educators interact constantly to bring the best possible educational product to market.
Recognizing that the market moves at an incredible speed, we move fast and are consistently researching and experimenting to keep stride with those needs, desires, and technological advances. We strive to make innovation part of our engineering culture and we are very data-driven, basing much of our innovation on what we see when evaluating the numbers rather than on guesses.
Also, we are not afraid of thinking outside the box nor of making mistakes, as we recognize that controlled errors are often a base for improvement.

#5 Engineering without Dogmas
We are an agile software development shop—fast and lean. We currently operate mainly using a Scrum process but without subscribing to any particular “book” or position. In fact, we adapt our processes and technology choices to our particular needs and we shy away from engineering dogmas or “this is the way we do it here” stances.

#6 Engineering with a Personal Life
We come to work every day to help all children acquire better math skills, but we do not forget about our own children and families in the process.
Most of our engineers have children of their own, and DreamBox has been very successful in promoting a healthy work/life balance where employee needs are acknowledged and supported by management. 
We come to work to solve hard problems; we work passionately on these problems, keeping in mind the value that we are bringing to millions of school children. We recognize that this is made possible because each of us knows that, at the end of the day, our employer is as respectful of our personal lives outside the office as it is of those of our customers.

#7 Engineering with Dogs
We have a dog-friendly office space and our employees are welcome to bring in their well-behaved canine pets, and they do.

Written by Lorenzo Pasqualis

Sr. Director of Engineering at DreamBox Learning

Monday, March 24, 2014

The Non-Linearity of Human Learning

The main purpose of this blog is to document some of the more technical aspects of what we do here at DreamBox, the types of problems we face, how we approach them, and how we solve them.

One of the major goals of our work is to study, understand, and model the data and processes related to human learning.

When considering human learning from a software engineering point of view, nothing should be assumed to be linear. The way students acquire, retain, and progress through the information we give them is nonlinear, organic, and multidimensional in nature. At DreamBox, our software engineers quickly become data scientists and spend much of their time thinking and designing technical solutions to model the non-linearity of the human learning process and studying its organic behaviors.

From modeling student skill acquisition and how that information is retained and potentially lost over time in unpredictable ways, to visualizing these multidimensional data models to show student progress across a complex array of skills—in a way that is interpretable by educators—the challenge of making sense of the natural non-linearity of the human mind is always before us.

Here is a partial list of some of the challenges that we have been working on, and will continue to tackle during our continuous process of refinement:

  • Defining which aspects of student interactions with DreamBox are important to use when “learning from the learners the way they learn.”
  • Modeling organic data—representing student mastery of academic concepts—as a complex and ever-growing graph formed by nodes and edges that morph over time in unpredictable ways due to a multitude of reasons that we do not have full control over.
  • Modeling the elementary and middle school math curriculum in a form that is compatible both with student learning preferences and academic standards.
  • Creating several engagement layers that are fun and appropriate for children while supporting their math skill acquisition journey.
  • Collecting, storing, organizing, and processing enormous amounts of data recorded during student interactions with DreamBox.
  • Analyzing data, defining hypotheses, and testing the hypotheses through experiments to refine our understanding of current student learning models.
  • Building distributed systems that are able to scale automatically and quickly during tremendous bursts and drops of traffic on our servers that we experience throughout the day, week, and school year.
  • Continuously adapting to what the students are doing, both at a real-time micro level and at an over-time macro level.
  • Building systems that allow us to visualize, study, and refine our understanding of the data we collect and the algorithms that process that data.
  • Finding methods of showing student progression in ways that are familiar to educators, by translating the enormous amount of data into efficient and functional reports with numbers and graphs.
  • Building technologies that allow educators to create lessons and content, effectively making them DreamBox lesson engineers.
In future articles we’ll explore some of the technical aspects related to these points.

Written by Lorenzo Pasqualis

Sr. Director of Engineering at DreamBox Learning

Sunday, March 23, 2014

Introducing the DreamBox Learning Tech Blog

Welcome to the DreamBox Learning Tech Blog.

On this site we'll focus on technology and technology issues and solutions.

Our engineers will directly share their perspectives, challenges and decisions regarding the software they build and use to create the DreamBox Learning experience. 

We'll share our life as software makers in the Education Technology world, the engineering culture at DreamBox and why we are so excited to be here. 

Stay tuned!

Written by Lorenzo Pasqualis

Sr. Director of Engineering at DreamBox Learning