In the early 1900s, a sentinel species (a species that is both more susceptible than humans to a dangerous environmental condition and shows clear signs of being affected by it) was identified for carbon monoxide: the canary. The implementation of the sentinel species pattern was simple but effective: miners carried a canary with them into the mine. If the canary began showing signs of distress, that was their warning to escape the mine quickly, before the humans were affected by the dangerous environment.
In the old days at Innovatz, we used to ship code to all of production at the same time, following the pattern that many companies use today (early integration testing, then full production deployment). This pattern is woefully insufficient. It is capable of catching the obvious stuff (new build doesn’t work, API incompatibility, etc.), but it will never be able to catch performance, load, or user-behavior-driven problems. This wasn’t good enough for us, so we began looking for a better solution.
A couple of years ago, we adapted the “sentinel species” pattern to our own production releases. It turned out to be relatively simple, as long as we enforced backwards compatibility from one version of code to the next.
First, we test our code as thoroughly as possible in a private testing environment referred to as “early integration.” Once the code is ready to be deployed to production, we start with a canary. A “canary” in this context means deploying the new version of a service to a single production node and then watching that node closely for half an hour to ensure it does not behave in a significantly different manner or show signs of distress. If the canary remains healthy, we deploy the updated code to the rest of the nodes for the service. However, if for any reason the canary shows signs of distress, failure, or behavior that differs from the untouched nodes, we roll back immediately, and a disaster is prevented.
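To make the check concrete, here is a minimal sketch in Python of that half-hour soak. The metric names, the distress thresholds, and the deploy/rollback/promote/sample hooks are hypothetical stand-ins for whatever deployment tooling and monitoring a service actually exposes; the point is only the shape of the comparison between the canary and the untouched nodes.

```python
# A minimal sketch of the canary soak, assuming hypothetical metric names,
# thresholds, and deployment hooks.
import time
from statistics import mean

# The canary is "in distress" if any watched metric drifts too far from the
# fleet baseline. The ratios are illustrative, not recommendations.
DISTRESS_THRESHOLDS = {
    "error_rate": 1.5,      # canary may not exceed 1.5x the fleet error rate
    "p99_latency_ms": 1.3,  # ...or 1.3x the fleet p99 latency
}

def canary_is_healthy(canary_metrics, fleet_nodes):
    """Compare the canary against the average of the untouched nodes."""
    for metric, max_ratio in DISTRESS_THRESHOLDS.items():
        baseline = mean(node[metric] for node in fleet_nodes)
        if baseline > 0 and canary_metrics[metric] / baseline > max_ratio:
            return False
    return True

def run_canary(deploy, rollback, promote, sample, soak_minutes=30):
    """Deploy to one node, watch it for the soak period, then promote or roll back.

    deploy/rollback/promote/sample are hooks into your own tooling;
    sample() returns (canary_metrics, fleet_metrics) for one observation.
    """
    deploy()
    for _ in range(soak_minutes):
        canary_metrics, fleet_metrics = sample()
        if not canary_is_healthy(canary_metrics, fleet_metrics):
            rollback()        # the canary showed distress; disaster prevented
            return False
        time.sleep(60)        # one observation per minute of the soak
    promote()                 # canary stayed healthy; roll out to the rest
    return True
```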
Learning from “The canary in the coal mine”
Using a canary is no longer a nice-to-have capability for digital companies—it’s a necessity. The traditional operational models of software system development, qualification, deployment, and operation simply cannot keep up with the velocity of change required for an internet-based company to thrive.
Put simply, when the minimum required velocity for change is high enough, you have no option but to take intelligent risks. It is far better for a single node to die (for a very short amount of time) than an entire cluster. The key to using this technique is that it is a controlled experiment: the success criteria are known in advance, metrics are in place, and a quick rollback plan is available.
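One way to keep such an experiment honest is to write the plan down before anything ships. The sketch below is illustrative only, with made-up field names and values: the canary does not start unless the success criteria, the metrics, and the rollback plan have already been declared.

```python
# A sketch of a pre-declared canary experiment; every field is illustrative.
CANARY_EXPERIMENT = {
    "service": "checkout-api",                    # hypothetical service name
    "success_criteria": {
        "error_rate": "<= 1.5x fleet baseline",
        "p99_latency_ms": "<= 1.3x fleet baseline",
        "new_exception_types": 0,
    },
    "metrics_in_place": ["errors", "latency", "jvm_gc_pauses"],
    "soak_minutes": 30,
    "rollback_plan": "redeploy the previous build to the canary node",
}

def ready_to_start(experiment):
    """Refuse to start the canary unless the experiment is fully specified."""
    required = ("success_criteria", "metrics_in_place", "rollback_plan")
    return all(experiment.get(field) for field in required)
```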
Sequencing versus parallelization
It was 9:25 a.m. when I sat down in our daily site status review meeting. I was prepared to talk about the three major outages JIRA had experienced the previous day. We had squashed three different problems: unexpected user behavior, memory exhaustion, and even a file descriptor problem. We had not, however, found the root cause. During our attempt to solve the file descriptor problem, two teams had made changes at around the same time that JIRA recovered, so it was unclear what impact each change had.
At 9:30 a.m., the leader of the site status meeting called the start of the meeting and hit refresh on the JIRA dashboard we used to run our meeting: Error 502, Bad Gateway. Refresh: Error 503, Service Unavailable. Refresh (30 seconds later): Error 500, Internal Server Error. I stood up, called for a war room, and asked for representatives from some of the teams to join me.
A few minutes later, we had a war room up, and representatives from all teams were on a conference call. We had been fighting this cluster of issues for the past day, and by now we had identified a number of possible causes, ranging from database performance degradation to JVM garbage collection, from NGINX misconfigurations to a potential memory inconsistency issue on the host running our service.
Acting as the coordinator for the war room, I parallelized the search for possible causes. I had each team set out to confirm, deny, or label as suspect the plausible causes we had identified from the prior day’s investigations. No group was allowed to implement changes—just to report in on the status of each possible cause. As each group checked in, we filtered our list of causes, coming up with about 15 that were suspect, 0 confirmed, and more than 70 that were not contributing.
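A rough sketch of that bookkeeping might look like the following. The cause names, verdicts, and three-way classification are illustrative rather than a record of the actual tooling; the useful property is that unchecked causes stay on the suspect list instead of silently dropping off.

```python
# A sketch of the triage bookkeeping: teams report verdicts in parallel, and
# the shared cause list is filtered down to the remaining suspects.
from enum import Enum

class Verdict(Enum):
    CONFIRMED = "confirmed"
    SUSPECT = "suspect"
    NOT_CONTRIBUTING = "not contributing"

def filter_causes(possible_causes, team_reports):
    """Group the cause list by the verdicts reported so far.

    Causes no team has checked yet stay on the suspect list, so nothing
    quietly falls off the board.
    """
    buckets = {verdict: [] for verdict in Verdict}
    for cause in possible_causes:
        buckets[team_reports.get(cause, Verdict.SUSPECT)].append(cause)
    return buckets

# Illustrative verdicts for a few of the causes mentioned above.
team_reports = {
    "database performance degradation": Verdict.SUSPECT,
    "NGINX misconfiguration": Verdict.NOT_CONTRIBUTING,
    "JVM garbage collection pauses": Verdict.SUSPECT,
}
causes = list(team_reports) + ["host memory inconsistency"]
suspects = filter_causes(causes, team_reports)[Verdict.SUSPECT]
```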
Once we had the suspect causes, we tested each of them in sequence (only one being tested at any point in time). For each suspected cause, one team applied a “fix,” all teams monitored their areas, and then we accepted or rejected the fix. The total cycle time was around five minutes per “fix.” In the end we identified the cause: an ordering change in the parameters of the calls made to the database caused a slight increase in database latency, which, in turn, caused bad JVM behavior. We had our service restored just 20 minutes after we developed our list of suspect causes.
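The sequencing itself amounts to a loop like the sketch below. The apply_fix, revert_fix, and service_recovered hooks are hypothetical stand-ins for each team’s tooling and monitoring; the essential property is that exactly one change is in flight at any time.

```python
# A sketch of the sequenced testing loop: one suspected cause is "fixed" at a
# time, everyone watches, and the fix is kept or reverted before the next.
import time

def test_suspects_in_sequence(suspects, apply_fix, revert_fix,
                              service_recovered, observe_seconds=300):
    """Walk the suspect list one cause at a time (~5 minutes per fix)."""
    for cause in suspects:
        apply_fix(cause)
        time.sleep(observe_seconds)   # all teams monitor their own areas
        if service_recovered():
            return cause              # accept the fix; likely root cause
        revert_fix(cause)             # reject the fix before trying the next
    return None                       # no suspect explained the outage
```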
Learning from “Sequencing versus parallelization”
When triaging the site, there are times when it’s best to sequence the work and times when it’s best to parallelize it. It is important to know when to do one rather than the other. If there is a site problem and we do not know the cause, it is normal to spin up people from multiple teams in an effort to find out what’s wrong. This is a good thing because it lets us quickly determine which areas are healthy and which are not. Further, it lets us narrow our search for the problem: the network team verifies network health, the database team confirms the database is responding fast enough, the app team confirms new exceptions are not appearing in the logs, etc.
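As a sketch, the parallel part of the triage can be as simple as fanning out one health check per area and collecting whichever areas come back unhealthy. The area names and the health_checks mapping are hypothetical; in practice each team plugs in its own check.

```python
# A sketch of the parallel fan-out: one health check per area, run
# concurrently, so the unhealthy areas surface quickly.
from concurrent.futures import ThreadPoolExecutor

def check_all_areas(health_checks):
    """Run every team's health check in parallel and return the unhealthy areas.

    health_checks: dict mapping an area name to a zero-argument callable
    that returns True when that area looks healthy.
    """
    with ThreadPoolExecutor() as pool:
        futures = {area: pool.submit(check) for area, check in health_checks.items()}
        return [area for area, future in futures.items() if not future.result()]

# Illustrative usage; each lambda stands in for a team's real check.
unhealthy = check_all_areas({
    "network": lambda: True,
    "database": lambda: True,
    "application": lambda: False,  # e.g., new exceptions found in the logs
})
```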
When it comes to actually implementing changes to the site, we need to sequence the work carefully. If two teams make changes simultaneously, it is often unclear which one had an impact and which one did not. At first glance, sequencing changes might look like it slows down the restoration process, but in reality, this is where things get faster. All eyes are watching one change, so there is less chance of missing the impact each change has (good or bad) and far less chance of going down the rabbit hole of trying to untangle the effects of multiple simultaneous changes.
When it comes to mean time to detect (MTTD) and mean time to recover (MTTR), lower is always better. The canary concept helps us prevent a site-wide service outage; after all, no outage is better than even the lowest measured MTTR. For issues that do manifest, the key is to move quickly through the safe areas of the investigation (finding possible causes) while sequencing the testing of possible solutions, so that you do not waste time trying to understand how more than one change affects your service.