TL;DR: At Transmit Security, we implemented a Blue/Green deployment strategy to achieve zero downtime during releases. This approach allows us to deploy, test, and release new versions frequently while reducing risk to service availability. This article details our implementation process, challenges overcome, and the significant improvements achieved in our deployment velocity and system reliability.
It is neither a secret nor a novelty that one of the most important criteria for a company's growth is the velocity of delivering new features to customers. Big, complex, and frequent changes that combine application development, infrastructure, and data stand in the way of achieving this velocity while maintaining stability and uptime.
To achieve the much-desired high velocity, you must have the right mechanisms in place to allow frequent and safe deployments. One of those mechanisms is ensuring that every change occurs with zero downtime. In this post, we'll talk about how we achieved this at Transmit Security.
Transmit Security is a SaaS company providing identity and authentication solutions, keeping businesses and users protected and delighted. Identity protection means everything to us. It's what we do and what we're passionate about. With that mission in mind, it was clear that to provide a secure, stable, and updated service, we must prioritize our deployment process.
Securing digital identities requires both robust technology and reliable infrastructure
To achieve zero downtime deployments, we chose to implement a Blue/Green environment. While many companies talk about this approach, it's surprising to see how few actually deploy it. The main reason is the challenges around database migrations and the use of shared underlying infrastructure between environments, which can lead to production outages as older code versions must be supported in all application components.
Let's dive into Blue/Green deployment concepts, pros, and cons.
Blue/Green deployment is a technique that reduces downtime and risk by running two identical production environments called Blue and Green. At any time, only one environment is live and serves all production traffic.
Blue/Green deployment architecture allows for seamless traffic switching
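To make the concept concrete, here is a minimal sketch of the idea in Python. It is an illustration only, not our production tooling: the `Environment` and `BlueGreenRouter` names and the version strings are hypothetical, and we assume Blue starts as the live environment.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    name: str
    version: str


class BlueGreenRouter:
    """Sends all production traffic to exactly one of two environments."""

    def __init__(self, blue: Environment, green: Environment) -> None:
        self.environments = {"blue": blue, "green": green}
        self.live = "blue"  # assumption: Blue starts as the live environment

    @property
    def idle(self) -> str:
        # The environment that currently serves no production traffic.
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version: str) -> None:
        # New versions only ever land on the idle environment.
        self.environments[self.idle].version = version

    def switch(self) -> None:
        # The cutover is a routing change, not a redeploy.
        self.live = self.idle

    def serve(self) -> str:
        # Returns the version that answers production requests right now.
        return self.environments[self.live].version


router = BlueGreenRouter(Environment("blue", "1.0"), Environment("green", "1.0"))
router.deploy_to_idle("1.1")
before = router.serve()  # users still see the old version
router.switch()
after = router.serve()   # users now see the new version
```

The point of the sketch is that deploying and releasing are separate steps: the new version exists in full before any user traffic reaches it.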
When we started designing our cloud infrastructure, there was no doubt we would use Kubernetes as the orchestrator for our applications, as it's the de facto tool for automating the deployment, scaling, and management of containerized applications. Additionally, we use MongoDB databases on the backend. Naturally, all our services reside behind a CDN and load balancing services, so client connectivity is centrally managed.
To implement the zero downtime strategy, we designed a Blue/Green solution in which two deployments of the same version run side by side in the existing environment. We wanted to avoid depending on DNS TTLs at the CDN, so the switch had to happen on the same load balancer. We therefore routed traffic through the load balancer's target groups and shifted it to the Blue (main) deployment.
We added load balancer listener rules for each service with a lower priority than the existing rules. Because the existing rules still matched first, this made no actual change to the environment. After that, we could simply remove the old Kubernetes ingress instances, and the traffic would automatically shift to the Blue deployment.
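The priority trick can be sketched as a small simulation. This is a simplified model, assuming rules are evaluated in ascending priority number and the first match wins; the service names `legacy-api-service` and `blue-api-service`, the path pattern, and the priority values are hypothetical.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class ListenerRule:
    priority: int      # lower number is evaluated first
    path_pattern: str  # shell-style wildcard pattern
    target: str        # name of the service that receives the request


def resolve_target(rules: list[ListenerRule], path: str, default: str = "default") -> str:
    # Walk the rules in priority order and return the first match.
    for rule in sorted(rules, key=lambda r: r.priority):
        if fnmatchcase(path, rule.path_pattern):
            return rule.target
    return default


rules = [
    ListenerRule(priority=10, path_pattern="/api/*", target="legacy-api-service"),
    # Added alongside the existing rule, with a lower priority (higher number).
    ListenerRule(priority=20, path_pattern="/api/*", target="blue-api-service"),
]

# While the legacy rule exists, it still wins, so nothing changes for users.
with_legacy = resolve_target(rules, "/api/users")

# Removing the legacy rule lets the Blue rule take over automatically.
remaining = [r for r in rules if r.target != "legacy-api-service"]
without_legacy = resolve_target(remaining, "/api/users")
```

Adding the new rules is therefore a no-op for live traffic, and the actual switch is just the removal of the old rules.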
When we first thought about the Blue/Green implementation, we wanted this solution to be as flexible as possible.
We defined four main phases for releasing a new deployment and moving traffic from Blue to Green and back:
In the first phase, we upgrade the Green (offline) environment with the new version that is about to be released. In this phase, we also copy the database collections from the Blue DB to the Green DB.
In the second phase, we create new ingress annotations that add listener rules forwarding traffic to the Green application when the request's source IP header matches the VPN server's IP address. Because we work behind a CDN, we match on the X-Forwarded-For header, which holds the IP address of the original client that sent the request (in this case, our VPN).
This allows us to QA the new deployment via VPN connection and verify the new version without exposing it to the public and without impacting our current service.
During the testing phase, only VPN traffic is routed to the Green environment
In the third phase, after successfully verifying the newly deployed version, we shift all traffic to the new deployment (Green).
In the fourth and final phase, with traffic now served by the Green environment running the new deployment, we set the replica count of the Blue environment to 0. In this final state, Green has become our main running environment, and Blue is ready to start the next deployment cycle.
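The four phases can be summarized as a small routing model. This is a sketch under stated assumptions, not our actual implementation: the phase names, the VPN address `203.0.113.10` (a documentation-range IP), and the default replica count are hypothetical, and the model only looks at the leftmost X-Forwarded-For entry, whereas a real load balancer applies its own header-matching rules.

```python
from enum import Enum


class Phase(Enum):
    UPGRADE = 1     # Green is upgraded offline; Blue serves everyone
    TEST = 2        # only VPN traffic reaches Green
    SHIFT = 3       # all traffic moves to Green
    SCALE_DOWN = 4  # Blue is scaled to zero replicas


VPN_IP = "203.0.113.10"  # hypothetical VPN egress address


def client_ip(x_forwarded_for: str) -> str:
    # By convention, the leftmost entry is the original client.
    return x_forwarded_for.split(",")[0].strip()


def route(phase: Phase, x_forwarded_for: str) -> str:
    """Return which environment serves a request in the given phase."""
    if phase is Phase.UPGRADE:
        return "blue"
    if phase is Phase.TEST:
        return "green" if client_ip(x_forwarded_for) == VPN_IP else "blue"
    return "green"  # SHIFT and SCALE_DOWN: Green is the live environment


def blue_replicas(phase: Phase, normal_replicas: int = 3) -> int:
    # Blue keeps running until the final phase, then drops to zero.
    return 0 if phase is Phase.SCALE_DOWN else normal_replicas


qa_request = route(Phase.TEST, "203.0.113.10, 198.51.100.7")
public_request = route(Phase.TEST, "198.51.100.23")
```

During the test phase, `qa_request` resolves to Green while `public_request` stays on Blue, which is what lets QA verify the release without exposing it.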
Since we started working on this implementation after we had a running Production environment, we had to build a solution where we wouldn't harm the existing environment while adding support for Blue/Green deployment. After adding Blue/Green support in our Helm charts, we deployed the same application's deployment twice (blue and green). During this deployment, we copied the existing database collections to separate blue and green databases. Then, we removed the existing deployment and were left with blue and green while traffic was forwarded to blue.
To support the Blue/Green strategy when releasing a new version, we created new ingress instances with a few changes, including ingress annotations that create new listener rules for each service.
Let's take our API service's ingress instance as an example.
These annotations create a new listener rule that forwards traffic to the green-api-service based on the VPN IP address as the value of the XFF header. Note that in this phase, all other traffic is still forwarded to the blue deployment.
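As an illustration of what such annotations can look like, the sketch below builds them in Python. It assumes the AWS Load Balancer Controller's `actions.<name>` and `conditions.<name>` annotation format with JSON values; the rule name `green-api-rule`, the port, and the VPN address are hypothetical, and the exact annotations in our charts may differ.

```python
import json

ANNOTATION_PREFIX = "alb.ingress.kubernetes.io"


def vpn_rule_annotations(
    rule_name: str, service_name: str, service_port: int, vpn_ip: str
) -> dict[str, str]:
    """Build annotations for a rule that forwards VPN traffic to one service."""
    # Forward matching requests to the given Kubernetes service.
    action = {
        "type": "forward",
        "forwardConfig": {
            "targetGroups": [
                {"serviceName": service_name, "servicePort": service_port, "weight": 100}
            ]
        },
    }
    # Match only requests whose X-Forwarded-For header carries the VPN IP.
    conditions = [
        {
            "field": "http-header",
            "httpHeaderConfig": {
                "httpHeaderName": "X-Forwarded-For",
                "values": [vpn_ip],
            },
        }
    ]
    return {
        f"{ANNOTATION_PREFIX}/actions.{rule_name}": json.dumps(action),
        f"{ANNOTATION_PREFIX}/conditions.{rule_name}": json.dumps(conditions),
    }


annotations = vpn_rule_annotations(
    rule_name="green-api-rule",       # hypothetical rule name
    service_name="green-api-service",
    service_port=80,                  # hypothetical port
    vpn_ip="203.0.113.10",            # hypothetical VPN address
)
```

The resulting dictionary maps directly onto the `metadata.annotations` section of an ingress manifest, so the same values could be templated in a Helm chart instead of generated in code.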
After we connect to the VPN server, the ALB routes us to the new green deployment. Once QA completes its sanity checks on the new green deployment, we go ahead and forward all traffic to it.
After successful testing, all traffic is shifted to the Green environment
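One possible way to express the full shift is a weighted forward action that is only flipped once the sanity checks pass. This is a hedged sketch rather than a description of our exact mechanism: it assumes the AWS Load Balancer Controller's weighted `forwardConfig` format, and the `blue-api-service` name, the port, and the `cutover` helper are hypothetical.

```python
import json


def forward_action(blue_service: str, green_service: str, port: int, green_weight: int) -> str:
    """Return a JSON forward action splitting traffic between Blue and Green."""
    if not 0 <= green_weight <= 100:
        raise ValueError("green_weight must be between 0 and 100")
    action = {
        "type": "forward",
        "forwardConfig": {
            "targetGroups": [
                {"serviceName": blue_service, "servicePort": port, "weight": 100 - green_weight},
                {"serviceName": green_service, "servicePort": port, "weight": green_weight},
            ]
        },
    }
    return json.dumps(action)


def cutover(sanity_checks_passed: bool) -> str:
    # Shift everything to Green only when QA has signed off; otherwise stay on Blue.
    green_weight = 100 if sanity_checks_passed else 0
    return forward_action("blue-api-service", "green-api-service", 80, green_weight)


shifted = json.loads(cutover(sanity_checks_passed=True))
held_back = json.loads(cutover(sanity_checks_passed=False))
```

Because the switch is a single rule change on the load balancer, rolling back amounts to applying the previous weights again.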
As a result of using Blue/Green deployment, we gained the ability to deploy new versions of our application, test them, and release frequently while reducing the risk to our service availability and functionality. This enables our Development team to move faster, increase our overall velocity, and provide better protection to our users' online identity (our obsession and passion!).
Implementing Blue/Green deployments at Transmit Security has transformed how we approach releases. We've moved from treating deployments as high-risk events to embracing them as routine operations. This shift has been instrumental in our ability to deliver new features and security enhancements to our customers more rapidly.
While the journey to zero downtime deployments required significant planning and architectural considerations, the benefits have far outweighed the challenges. The additional infrastructure costs are offset by the business value of increased velocity and improved reliability.
For organizations looking to implement similar strategies, we recommend starting with a thorough assessment of your database architecture, as this is often the most challenging aspect of Blue/Green implementations. Invest time in designing proper migration strategies and ensuring backward compatibility between versions.
With the right approach, zero downtime deployments are achievable and can become a competitive advantage in delivering value to your customers.