When we—the Stability team—talk about our jobs, we spend a lot of time discussing “setting things on fire.” Our job is to keep Asana up no matter what, planning for possibilities like uncontrolled blazes in the AWS data centers that host our servers. It’s fun to envision and important to guard against, but we didn’t always have the time to be so vigilant.
In 2010 Dustin—one of our co-founders—introduced the first in a series of scripts that led to the automated server deployment system we have today. Back then, Asana was just getting started, so we were more worried about having a site that worked at all than having a site that worked come what may. We made pragmatic choices, knowing they weren’t necessarily sustainable ones.
As the years passed, our product feature set and customer base grew, which increased our server traffic and code footprint and put new demands on the configuration system. We kept configuring our machines in essentially the same way as in 2010, using scripts that started out lightweight but became complicated as features and requirements were added. These scripts required manual oversight, which was fine when we had five servers, but with hundreds of them, manually patching things up when they failed got old fast. As more and more engineer time was sunk into maintenance instead of development, we decided it was time to invest in a system that could take care of itself.
In 2015 we built the AutoProvisioner, a handy system that replaced the manual work of fixing our broken and dying servers with a simple cron job that checked the status of our infrastructure every five minutes and nursed things back to health whenever something went wrong.
The first thing the AutoProvisioner did was check how many healthy servers we had and how many we wanted. If those numbers matched, its work was done. If not, it launched a new server for us. It got the server ready by sending our code over and running our configuration scripts (the ones we used to run by hand!). Once the server was provisioned, it announced the new server to the rest of our fleet, leaving it indistinguishable from any of its long-running counterparts. With this mechanism, we smoothly replaced any servers that failed, with no human intervention required. Furthermore, if a server was ever behaving strangely and we didn’t have time to investigate it, we could just disable it, and the AutoProvisioner would replace it all on its own.
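At its core, the AutoProvisioner was a small reconciliation loop. The sketch below, written against EC2 with boto3, shows the general shape of that loop; the fleet tag, instance type, and the provision/register_with_fleet helpers are hypothetical stand-ins for our internal tooling rather than the real implementation.

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical tag used to find fleet members; our real bookkeeping differed.
FLEET_FILTER = [{"Name": "tag:fleet", "Values": ["webserver"]}]
FLEET_TAGS = [{"ResourceType": "instance",
               "Tags": [{"Key": "fleet", "Value": "webserver"}]}]


def healthy_server_ids():
    """Instance ids of fleet members that are running and passing status checks."""
    reservations = ec2.describe_instances(Filters=FLEET_FILTER)["Reservations"]
    running = [i["InstanceId"]
               for r in reservations
               for i in r["Instances"]
               if i["State"]["Name"] == "running"]
    if not running:
        return []
    statuses = ec2.describe_instance_status(InstanceIds=running)["InstanceStatuses"]
    return [s["InstanceId"] for s in statuses
            if s["InstanceStatus"]["Status"] == "ok"]


def provision(instance_id):
    """Placeholder: send the code over and run the configuration scripts."""


def register_with_fleet(instance_id):
    """Placeholder: announce the new server to the rest of the fleet."""


def reconcile(desired_count, base_ami, instance_type="m4.xlarge"):
    """Run by cron every five minutes: if the fleet has fewer healthy
    servers than we want, launch and provision replacements."""
    missing = desired_count - len(healthy_server_ids())
    if missing <= 0:
        return  # fleet is healthy, nothing to do
    launched = ec2.run_instances(ImageId=base_ami,
                                 InstanceType=instance_type,
                                 MinCount=missing, MaxCount=missing,
                                 TagSpecifications=FLEET_TAGS)
    for instance in launched["Instances"]:
        provision(instance["InstanceId"])
        register_with_fleet(instance["InstanceId"])
```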
We drew inspiration from Chaos Monkey, a system that does the equivalent of letting an enraged monkey with a sledgehammer loose in your server room, forcing you to build systems that can tolerate the assault. With a Chaos Monkey approach, we didn’t try to avoid failure—we expected it, we built our systems to withstand it, and when we thought we could handle it, we caused it!
With the AutoProvisioner in place, something like Chaos Monkey actually became feasible. We set up a server terminator to kill servers daily. The AutoProvisioner complained loudly if something failed when it tried to launch replacements, which became a strong motivating factor to fix regressions in our configuration immediately. Our infrastructure reached a happy equilibrium where it could sustain damage and continue working steadily.
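The terminator itself needs very little machinery. A sketch, reusing the hypothetical healthy_server_ids helper and ec2 client from above; one obvious safeguard is refusing to take out the last healthy server.

```python
import random


def terminate_random_server():
    """Run daily by cron: kill one healthy fleet member at random, so that
    server failure stays routine and the AutoProvisioner stays exercised."""
    candidates = healthy_server_ids()
    if len(candidates) <= 1:
        return  # never take out the last healthy server
    victim = random.choice(candidates)
    ec2.terminate_instances(InstanceIds=[victim])
```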
At that point in time, however, Asana was not moving at a steady pace. Our customer base was increasing, which required manual intervention to tell the AutoProvisioner that we wanted more and more servers. It was time to bring in AutoScaling Groups. If there was too much load on our servers, the AutoScaling Groups could detect that and launch new servers, which the AutoProvisioner would then provision. At this point, not only would our infrastructure maintain itself, but it could grow to meet our needs.
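AutoScaling Groups watch a load metric and adjust the group’s desired size on their own. The exact metric isn’t important here, so the snippet below shows one common setup, a target-tracking policy on average CPU, with a hypothetical group name and threshold rather than our actual configuration.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical group name and target value, for illustration only.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="webserver-fleet",
    PolicyName="scale-on-cpu",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        # Launch more servers when average CPU rises above the target.
        "TargetValue": 60.0,
    },
)
```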
This work carried us through the end of 2016, leaving us with a server deployment system that was stable but not dynamic. It should have been dynamic, since it could add servers at will, but there was one fatal flaw: the process was slow. It took two hours to launch and provision new servers.
This two-hour waiting period really limited our potential and, in some ways, was downright dangerous. A sudden increase in load could mean two hours of partial downtime before our system recovered. And while we never quite found ourselves in that situation, we came scarily close.
The other big issue we faced was cost. Many customers translated into a lot of servers. We would scale up when there was increased load, but we never scaled down during periods of reduced load, because the servers would take too long to bring back. We were missing out on some serious savings.
Moreover, what on earth were our servers doing for those two hours?
It turns out that a solid 40 minutes of that time was spent just sending 65 GB of code over to each server. There was a lot of code, and the entire repository went to each server many times over (so we could change versions in a flash if we needed to). That was one of the downsides of still using code from 2010—the codebase had grown into a bit of a jumbled mess, and we hadn’t managed to separate out just the parts we needed. And we weren’t just sending code, either—we were also building all of our code, and a fair number of the libraries we depend on, every time we deployed a new server.
We could have pursued ways to shorten each of those processes, but that would have required investigative and exploratory work unique to each time-guzzling step, and that work would have had to be ongoing as our deployment system grew and changed. In a world of finite engineering resources, we went for custom AMIs (Amazon Machine Images) instead.
Using custom AMIs was a low-effort, high-reward project that cut our server launch time by 90%. The constraints: we needed to change our deployment system as little as possible and ensure the time savings would stand up to future changes.
The idea behind using custom AMIs is simple. In our old system, our servers started out from a standard Ubuntu AMI. With custom AMIs, we asynchronously create pre-configured OS images that have all of our code, dependencies, and build artifacts already on them, and launch new servers from this custom image instead of the standard Ubuntu one. Servers launched this way then only need to be caught up with any changes to our codebase that occurred after the custom AMI was created, giving us a launching and provisioning process that takes 10 minutes total.
To create these custom AMIs, we run a cron job nightly that launches and provisions a server using the old, two-hour system, then takes a snapshot, creating an AMI from the fully provisioned server. We deploy to production twice a day, so this AMI is at most 2 versions behind, a difference easily made up in 10 minutes.
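As a rough sketch, the nightly bake looks something like the following, again using boto3 and the hypothetical provision helper from earlier; the instance type and naming scheme are illustrative, not our real ones.

```python
import datetime

import boto3

ec2 = boto3.client("ec2")


def bake_nightly_ami(base_ubuntu_ami):
    """Run nightly by cron: do the full slow launch and provision, then
    snapshot the result as a custom AMI for tomorrow's server launches."""
    instance = ec2.run_instances(ImageId=base_ubuntu_ami,
                                 InstanceType="m4.xlarge",
                                 MinCount=1, MaxCount=1)["Instances"][0]
    instance_id = instance["InstanceId"]
    ec2.get_waiter("instance_status_ok").wait(InstanceIds=[instance_id])

    provision(instance_id)  # the full two-hour code sync and build, as before

    stamp = datetime.datetime.utcnow().strftime("%Y-%m-%d")
    image = ec2.create_image(InstanceId=instance_id,
                             Name=f"webserver-{stamp}",
                             Description="Fully provisioned web server snapshot")
    ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])

    # The builder instance has done its job; the snapshot is what we keep.
    ec2.terminate_instances(InstanceIds=[instance_id])
    return image["ImageId"]
```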
We use an S3 object to keep track of which custom AMI we should use to launch new servers. We keep a two-week backlog of old AMIs, in case we need to revert to an earlier version of the code. To perform the revert, all we need to do is change the S3 object to point to an older AMI, thus maintaining the ability to launch servers quickly.
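That pointer is just a small object that gets read at launch time and rewritten to promote a new AMI or roll back to an old one. A minimal sketch, with a hypothetical bucket, key, and JSON layout:

```python
import json

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket, key, and schema; the real object is internal.
POINTER_BUCKET = "deploy-config"
POINTER_KEY = "current-webserver-ami.json"


def current_ami():
    """The custom AMI that new servers should launch from right now."""
    body = s3.get_object(Bucket=POINTER_BUCKET, Key=POINTER_KEY)["Body"].read()
    return json.loads(body)["ami_id"]


def point_at_ami(ami_id):
    """Promote a freshly baked AMI, or revert to one from the two-week
    backlog; either way it is just a single S3 write."""
    s3.put_object(Bucket=POINTER_BUCKET, Key=POINTER_KEY,
                  Body=json.dumps({"ami_id": ami_id}))
```

In this sketch, launching a server starts from current_ami() instead of the stock Ubuntu image, and a rollback is a single point_at_ami call with an AMI id from the backlog.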
A notable aspect of this process is that we are still doing two-hour server launches—just asynchronously and on a controlled schedule, instead of on-demand. We could have done this differently. Instead of starting from scratch every night, we could have built upon the previous night’s AMI, applying that day’s changes to it and snapshotting the result. This would have cut 110 minutes from the AMI creation process.
However, with that approach, we would have been vulnerable to missing regressions in our configuration code, since it would never execute in full. There’d be a security vulnerability as well: in the unlikely event that someone managed to get access to a server a custom AMI was created from, they could introduce a flaw that would eventually propagate to our entire fleet without ever appearing in our code. With all that potential for disaster, 110 asynchronous minutes in the middle of the night seemed like a fair trade for peace of mind.
Though we’ve come far, our current world is not perfect. Our configuration code is still messy and does unnecessary work. Our code bundle deployment is intertwined with our infrastructure management, which means that code bundle errors block infrastructure changes and vice versa. Our servers run numerous processes that aren’t isolated from each other. As Asana grows, we want our teams to be able to deploy their components independently of each other, without fear of cascading issues.
Getting to our desired world will involve breaking up our current monolithic deployment into several components, a process that is already under way. Our newer infrastructure is deployed using Kubernetes, with different components isolated into pods and clusters. This gives us more flexibility and allows our teams to move faster without blocking each other. Additionally, within our Kubernetes system, every application runs an instance of Chaos Monkey, so applications have to be fault-tolerant from the start. There’s a lot more exciting work to be done here, so if you would like to set some fires of your own in the course of making this vision a reality, get in touch.