Canary deployment
In my last post I covered the recreate deployment strategy. Since I’m trying to review as many deployment strategies as possible, this post covers another one: the canary deployment.
Canary deployment
A bit of history
In my previous post, the names of the deployment strategies were pretty self-explanatory, and this one is too, once we understand where its name comes from. Miners in the 19th and 20th centuries faced a multitude of dangers, one of them being carbon monoxide exposure, whose effects range from headaches, nausea, and loss of muscle coordination to death. Moreover, carbon monoxide has no smell or taste and, in very high concentrations, death can occur in a matter of hours or less. To detect when they were being exposed to carbon monoxide, miners would carry a caged canary into the mines, because the bird is more susceptible to the effects of the gas than humans are. If the bird died, well, it meant it was time for the miners to get out as fast as possible.
Deploying canaries
Much like the miners’ canary, a canary deployment uses a small part of the system as a test subject for the new software version. The new version is deployed so that it takes only a small share of the regular traffic, while the remaining traffic still flows through older versions of the software. If the canary stays healthy after simmering for a while, we gain enough confidence in the new version to proceed with the deployment. Many different strategies can be applied at that point, but it is common to proceed in a phased fashion, gradually increasing the share of traffic the new version receives. On the other hand, if the new version is not performing up to standards, it is time to roll back to the previous healthy version.
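To make the traffic split and the phased rollout a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the CanaryRouter class, the next_weight helper, and the phase percentages (5%, 25%, 50%, 100%) are hypothetical and not taken from any particular tool. Real setups usually delegate this to a load balancer or service mesh.

```python
import random


class CanaryRouter:
    """Hypothetical router: sends a fraction of requests to the canary."""

    def __init__(self, canary_weight=0.05, seed=None):
        if not 0.0 <= canary_weight <= 1.0:
            raise ValueError("canary_weight must be between 0 and 1")
        self.canary_weight = canary_weight
        self._rng = random.Random(seed)

    def route(self):
        # random() returns a float in [0.0, 1.0), so a weight of 0 never
        # picks the canary and a weight of 1 always does.
        if self._rng.random() < self.canary_weight:
            return "canary"
        return "stable"


def next_weight(current, healthy, phases=(0.05, 0.25, 0.5, 1.0)):
    """Return the canary weight for the next phase (illustrative phases)."""
    if not healthy:
        # Roll back: stop sending any traffic to the new version.
        return 0.0
    for phase in phases:
        if phase > current:
            # Advance to the first phase above the current weight.
            return phase
    # Already at the last phase: the new version takes all traffic.
    return phases[-1]
```

A rollout would then alternate between letting the canary simmer at its current weight and calling next_weight with the outcome of the health checks: a healthy canary at 0.05 moves to 0.25, while an unhealthy one drops to 0.0 regardless of the phase it was in.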
Pros and cons
The main advantage of using this deployment strategy is very much analogous to the miners’ story. If the canary dies, the miners can get out safely before they die too. In our case, if the new software version has issues, we can roll it back much faster than if we had deployed it to all instances. Additionally, since we only route a fraction of the traffic through the new version, the blast radius is minimized, i.e., only a small subset of traffic/customers is affected by the faulty version. In fact, if the traffic routed to the canary is not too large, you can simply use a load balancer to cut off the new version completely and redistribute its traffic across the instances running the previous version, which further minimizes the blast radius. In essence, it allows us to test our software live in a production environment with minimal risk.
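The load balancer cut-off described above can be sketched as follows. This is a toy round-robin balancer written for illustration only: the LoadBalancer class, the instance names, and the version labels are all hypothetical, and a real load balancer would expose this through its own configuration or API rather than code like this.

```python
class LoadBalancer:
    """Toy round-robin balancer over named instances (illustrative only)."""

    def __init__(self, instances):
        # instances maps an instance name to the version it is running,
        # e.g. {"web-1": "v1", "web-2": "v2"}.
        self.instances = dict(instances)
        self._index = 0

    def remove_version(self, version):
        # Cut off every instance running the given version; later picks
        # are spread across whatever instances remain.
        self.instances = {
            name: ver for name, ver in self.instances.items() if ver != version
        }
        self._index = 0

    def pick(self):
        names = sorted(self.instances)
        if not names:
            raise RuntimeError("no instances left to serve traffic")
        name = names[self._index % len(names)]
        self._index += 1
        return name


lb = LoadBalancer({"web-1": "v1", "web-2": "v1", "web-3": "v2"})
before = [lb.pick() for _ in range(3)]  # the canary web-3 gets a turn
lb.remove_version("v2")                 # the canary misbehaves: cut it off
after = [lb.pick() for _ in range(4)]   # only v1 instances are picked now
```

After remove_version("v2"), requests alternate between web-1 and web-2 only, which is the "redistribute the traffic" step from the paragraph above.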
However, this type of deployment is not always possible. Not all architectures support routing only a fraction of the traffic through instances running the newer version of your software. Additionally, you now have to manage two distinct versions of your software for a period of time (note that you can have more than one canary and run multiple versions of your software this way, but please don’t). And last, but not least, this is still very much a form of “testing in production”, meaning that it requires a tight monitoring loop to ensure that everything is working as expected and that it is safe to proceed with the deployment of the newer version.
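As a rough idea of what one step of that monitoring loop could look like, the sketch below compares the canary’s error rate against the stable version’s and returns a verdict. The Metrics type, the canary_verdict function, and the thresholds (at least 100 canary requests, one percentage point of tolerance) are assumptions picked for illustration. A real setup would pull these numbers from its monitoring system and would likely look at latency and other signals as well.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Request and error counts observed for one version (hypothetical)."""

    requests: int
    errors: int

    @property
    def error_rate(self):
        # Treat a version with no traffic as having a zero error rate.
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(stable, canary, min_requests=100, tolerance=0.01):
    """Return "wait", "rollback", or "promote" for the canary."""
    if canary.requests < min_requests:
        # Too little traffic on the canary to judge it yet.
        return "wait"
    if canary.error_rate > stable.error_rate + tolerance:
        # The canary is clearly worse than the stable version.
        return "rollback"
    return "promote"
```

For example, with a stable version at 50 errors in 10,000 requests (0.5%), a canary with 30 errors in 500 requests (6%) yields "rollback", while one with 3 errors in 500 requests (0.6%) yields "promote".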