A fairly typical scenario: production software was released some time ago, and the entire development team is working on a set of features that will be part of the next release (continuous deployment is not practiced). The deadline is approaching. More and more work is added. Finally, the release date arrives.
Basic deployment strategy
The development team decided to follow the basic deployment strategy.
This requires first shutting down the application that runs in production. While it is down, it is not reachable on the Internet, which is the first problem. The next step is the update itself. After a successful update, and with approval from the business side, the new version of the application is shown to the entire world at once, which is the second problem.
During the following week, complaints started coming in from users. First, why was the app unavailable for an hour? Customers could not place the orders they needed, which meant financial losses due to halted production. There is another important issue as well. The SLA is set at 99.9%, which means the allowed application downtime is 8h 45m 56s per year. As you can calculate, this is enough for at most 8 releases (an hour each), with almost 46 minutes left for everything else. Ouch!
Second, there are a lot of bugs and performance issues, and the application is unstable. The new version is now used by all 100 000 users. Issues were reported by 20% of them, which in this case means 20 000 users. Thousands of tickets have to be handled somehow. The rating drops extremely fast.
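The downtime budget above can be checked with a few lines of arithmetic. This is only a sketch: it assumes an average Gregorian year of 365.2425 days (31 556 952 seconds), which is the convention that yields the 8h 45m 56s figure, and the function names are my own.

```python
# Assumption: an average Gregorian year of 365.2425 days = 31,556,952 seconds.
SECONDS_PER_YEAR = 31_556_952


def downtime_budget_seconds(sla_per_mille: int) -> int:
    """Whole seconds of downtime allowed per year.

    The SLA is passed in per mille (999 means 99.9%) so the
    calculation stays in integers and avoids float rounding.
    """
    return SECONDS_PER_YEAR * (1000 - sla_per_mille) // 1000


def as_hms(seconds: int) -> tuple:
    """Split a number of seconds into (hours, minutes, seconds)."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return hours, minutes, secs


budget = downtime_budget_seconds(999)          # 99.9% SLA
hours, minutes, secs = as_hms(budget)
one_hour_releases = budget // 3600             # releases that take an hour each
leftover = as_hms(budget - one_hour_releases * 3600)

print(f"Yearly budget: {hours}h {minutes}m {secs}s")
print(f"One-hour releases that fit: {one_hour_releases}")
print(f"Left for other things: {leftover[1]}m {leftover[2]}s")
```

With a stricter SLA the picture gets worse quickly: calling `downtime_budget_seconds(9999)` for 99.99% leaves less than one hour per year, so not even a single one-hour release would fit.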
After some time, the fire is extinguished. There is another release on the horizon. Learning from previous experience, the team decided to implement a different strategy.
Blue-green deployment strategy
This time the choice falls on another deployment strategy, called blue-green:
This approach completely solves the first problem the team was struggling with: there is no downtime at any point. The new version is installed on a second instance while the old one keeps serving users. Once the update of that second instance has finished successfully, all users are redirected to the new version of the application. The old version is then no longer live and serves as a rollback target. Whenever a critical issue (e.g. a security one) is found, traffic can be quickly routed back to the old version.
Unfortunately, the second problem remains unresolved. All users still get all the new features at once, along with any bugs introduced into existing code. Sometimes, even with the best analysis and research done on a group of users, an added feature is not liked by the real audience. Again, when the application is released to all users, there is a high chance that the overall rating will drop immediately because thousands of people are affected.
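To make the switch and the rollback more concrete, here is a toy model of the routing part. It is a minimal sketch, not a real load balancer: the `BlueGreenRouter` class, its method names and the version labels `v1.1` and `v1.2` are all hypothetical and exist only for illustration.

```python
class BlueGreenRouter:
    """Toy router with two environments, exactly one of which is live."""

    def __init__(self, initial_version: str) -> None:
        # "blue" starts as the live environment; "green" is empty.
        self.versions = {"blue": initial_version, "green": None}
        self.live = "blue"

    @property
    def idle(self) -> str:
        """Name of the environment that is not serving users."""
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version: str) -> None:
        # The live environment keeps serving users during the update,
        # which is why there is no downtime.
        self.versions[self.idle] = version

    def switch(self) -> None:
        """Point all traffic at the idle environment.

        Calling it again points traffic back, which is the rollback.
        """
        if self.versions[self.idle] is None:
            raise RuntimeError("idle environment has nothing deployed")
        self.live = self.idle

    def serve(self) -> str:
        """Version that every user currently sees."""
        return self.versions[self.live]


router = BlueGreenRouter("v1.1")      # hypothetical version labels
router.deploy_to_idle("v1.2")
before_switch = router.serve()        # users still see the old version
router.switch()
after_switch = router.serve()         # all users see the new version
router.switch()                       # critical issue found: roll back
after_rollback = router.serve()
print(before_switch, after_switch, after_rollback)
```

Note that `serve()` returns the same version for every user. The switch is all-or-nothing, which is exactly why this strategy does not help with the second problem.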
Canary deployment strategy
The recent releases were a tough time for the team. They learned a lot about different deployment strategies and managed to solve the problem of shutting the application down while a new version is installed. However, improvements are still needed.
One of the team members read an article about canary releases. To give this solution a chance and try to solve the other problem, he shared it with the others. After heated discussions, the team democratically decided to try this approach.
A new version of the application was released. Two versions of the application ran in parallel: v1.2 and v1.3. As a first step, the team decided to redirect the traffic of 10% of users (10 000) to the new version. A few bugs were reported, and one feature received negative feedback. All reported bugs were fixed. The feature in question got a second chance with another group of users (the next 10 000). The cycle repeated, with more bugs and the same disliked feature, so the business decided to remove it from the release. Subsequent iterations surfaced other bugs, but fewer and fewer with each iteration. The second problem was solved.
Tremendous success: the version change was positively received by users. The rating went up. The development team lived happily ever after.
Is that so?
Canary releases are a great way to minimize the risk of introducing changes to the application: you run different versions in parallel and test the new one with a subset of users. You can even choose this subset. It might be based on region (e.g. the new version is visible to users accessing the application from Western Europe) or consist of a group of early adopters who really appreciate using your application. The percentage of users is also up to your team: it can start at 1%, 10% or any other small number. In my opinion, the smaller the better, but be careful: when your application is used by 100 users, 1% is a single user, so the result may be unreliable.
Another thing worth mentioning is whether you are dealing with a monolith (standard or modular, it makes no difference here) or a distributed architecture. A monolith is always deployed as a single unit, so the canary deployment goes as described above. With a distributed architecture (multiple deployment units), you can release a new version of each service independently, which works to your advantage (speed and minimal changes to the entire ecosystem): when something goes wrong, a rollback is usually quick and easy.
In my opinion, the biggest difficulties are on the infrastructure side (finding a way to monitor, analyze and route the traffic, and preparing scripts) and in convincing the business to follow this direction, given how unpopular it is.
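One possible way to pick the subset is sketched below. It is an illustration under my own assumptions, not a description of any particular tool: users are assigned to a stable bucket from 0 to 99 by hashing their ID, and the function names, the region label and the user ID format are hypothetical. The version labels v1.2 and v1.3 are the ones from the story above.

```python
import hashlib

STABLE_VERSION = "v1.2"
CANARY_VERSION = "v1.3"


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99.

    Hashing keeps the assignment deterministic: the same user always
    lands in the same bucket, so they do not flip between versions.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def pick_version(user_id: str, region: str, canary_percent: int,
                 canary_regions: frozenset = frozenset()) -> str:
    """Choose the version for one request.

    A user gets the canary if their region is explicitly opted in,
    or if their bucket falls below the current percentage.
    """
    if region in canary_regions:
        return CANARY_VERSION
    if bucket(user_id) < canary_percent:
        return CANARY_VERSION
    return STABLE_VERSION


# Hypothetical user IDs, only to see how the split behaves.
users = [f"user-{i}" for i in range(10_000)]
at_10 = {u for u in users if pick_version(u, "other", 10) == CANARY_VERSION}
at_20 = {u for u in users if pick_version(u, "other", 20) == CANARY_VERSION}
print(f"{len(at_10)} of {len(users)} users on the canary at 10%")
print(f"{len(at_20)} of {len(users)} users on the canary at 20%")
```

Because the bucket is compared against a threshold, raising the percentage only adds users to the canary group. Everyone who already saw the new version keeps seeing it, and setting the percentage to 0 sends everybody back to the stable version.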
On top of that, if you are able to combine canary releases with continuous deployment, it is a game-changer. But that’s a topic for a separate story.
And what does it look like in your projects? Have you given canary releases a chance?