The Road to Zero Downtime

Blue/Green Deployments

TL;DR: At Transmit Security, we implemented a Blue/Green deployment strategy to achieve zero downtime during releases. This approach allows us to deploy, test, and release new versions frequently while reducing risk to service availability. This article details our implementation process, challenges overcome, and the significant improvements achieved in our deployment velocity and system reliability.


It is neither a secret nor a novelty that one of the most important criteria for a company's growth is the velocity of delivering new features to customers. Big, complex, and frequent changes spanning application development, infrastructure, and data stand in the way of achieving this velocity while maintaining stability and uptime.

To achieve the much-desired high velocity, you must have the right mechanisms in place to allow frequent and safe deployments. One of those mechanisms is to ensure that any change occurs with zero downtime. In this blog, we'll talk about how we achieved this at Transmit Security.

Who We Are

Transmit Security is a SaaS company providing identity and authentication solutions, keeping businesses and users protected and delighted. Identity protection means everything to us. It's what we do and what we're passionate about. With that mission in mind, it was clear that to provide a secure, stable, and updated service, we must prioritize our deployment process.

Securing digital identities requires both robust technology and reliable infrastructure

The Challenge: Achieving Zero Downtime

To achieve zero downtime deployments, we chose to implement a Blue/Green environment. While many companies talk about this approach, it's surprising to see how few actually deploy it. The main reason is the challenges around database migrations and the use of shared underlying infrastructure between environments, which can lead to production outages as older code versions must be supported in all application components.

Let's dive into Blue/Green deployment concepts, pros, and cons.

Blue/Green: Pros and Cons

Blue/Green deployment is a technique that reduces downtime and risk by running two identical production environments called Blue and Green. At any time, only one environment is live and serves all production traffic.

Pros

  • This strategy can eliminate downtime due to app deployment and reduces risk since we can shift traffic back and forth if something unexpected happens.
  • You can release software practically at any time. You don't need to schedule weekend or off-hours releases because, in most cases, all that's necessary to go live is a routing change.
  • It allows us to support A/B testing and gradual rollout, splitting traffic between deployments with feature toggles.
  • Simple rollback, as the reverse process is equally fast. Since Blue/Green deployments use two parallel production environments, we can quickly shift back to the stable one if issues arise in the live environment.

Cons

  • Database management - With Blue/Green deployments, schema and data changes must stay backward compatible, since both the old and new versions need to work against their databases while traffic shifts between Blue and Green.
  • Cost - Running two production environments can mean paying nearly double the infrastructure cost, although the idle environment can be scaled down between releases.

Blue/Green deployment architecture allows for seamless traffic switching

Our Kubernetes Foundation

When we started designing our cloud infrastructure, there was no doubt we would use Kubernetes as the orchestrator for our applications, as it's the de facto tool for automating deployments, scaling, and management of containerized applications. Additionally, we leverage MongoDB databases at the backend. Naturally, all our services reside behind CDN and load balancing services, so client connectivity is centrally managed.

Initial Implementation

To implement the zero downtime strategy, we designed a Blue/Green solution in which two copies of the same application version are deployed to the existing environment. With that in mind, we understood that the switch needed to happen on the same load balancer, since we wanted to avoid DNS TTL dependencies from the CDN. So we routed traffic through the load balancer's target groups and shifted it to the Blue (main) deployment.

We added load balancer listener rules for each service with a lower priority than the existing rules. That way, we didn't make an actual change to the live environment. After this change, we could simply remove the old Kubernetes ingress instances, and traffic would automatically shift to the Blue deployment.
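With the AWS Load Balancer Controller, one way to express this kind of rule ordering is the IngressGroup group.order annotation, where a higher order translates to a lower-priority listener rule. The sketch below illustrates the idea; the group name, order value, and service names are illustrative, not our exact configuration:

```yaml
# Sketch: ingresses sharing an IngressGroup are merged onto one ALB,
# and group.order controls listener rule priority (lower order = evaluated first).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: blue-api-ingress               # hypothetical name
  annotations:
    alb.ingress.kubernetes.io/group.name: production
    alb.ingress.kubernetes.io/group.order: "20"   # evaluated after existing rules
spec:
  rules:
    - host: api.DOMAIN_NAME
      http:
        paths:
          - path: /*
            pathType: ImplementationSpecific
            backend:
              service:
                name: blue-api-service
                port:
                  number: 80
```

Because the new rule sits behind the existing ones, it only takes effect once the old ingress instances (and their higher-priority rules) are removed.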

Blue/Green Phases

When we first thought about the Blue/Green implementation, we wanted this solution to be as flexible as possible.

We defined four main phases for releasing a deployment and moving traffic from Blue to Green and back:

1. Upgrade

Upgrade the Green (offline) environment with the new version that is about to be deployed. In this phase, we also copy the database collections from the Blue DB to the Green DB.
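The database copy can be run as a one-off Kubernetes Job that pipes a dump of the Blue database straight into the Green one. This is a minimal sketch assuming MongoDB's standard mongodump/mongorestore tools; the secret names and connection URIs are hypothetical:

```yaml
# Sketch: copy collections from the Blue DB to the Green DB before upgrading Green.
apiVersion: batch/v1
kind: Job
metadata:
  name: copy-blue-to-green
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: mongo-copy
          image: mongo:6
          command: ["/bin/sh", "-c"]
          args:
            - |
              # Stream the dump; --drop replaces any stale Green collections.
              mongodump --uri "$BLUE_URI" --archive \
                | mongorestore --uri "$GREEN_URI" --archive --drop
          env:
            - name: BLUE_URI
              valueFrom: { secretKeyRef: { name: mongo-uris, key: blue } }
            - name: GREEN_URI
              valueFrom: { secretKeyRef: { name: mongo-uris, key: green } }
```

Streaming the archive avoids staging the dump on disk, which keeps the Job stateless.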

2. Test

In this phase, we create new ingress annotations that add new listener rules forwarding traffic to the Green application, with a condition on the client source IP matching the VPN server's IP address. Since we're working behind a CDN, we match on the X-Forwarded-For header, as it holds the original client IP address of the request (our VPN).

This allows us to QA the new deployment via VPN connection and verify the new version without exposing it to the public and without impacting our current service.

During the testing phase, only VPN traffic is routed to the Green environment

3. Shift

After successful verification of the newly deployed version, we shift all traffic to the new deployment (Green).

4. Final

After shifting traffic to the Green environment running the new deployment, we change the replica count of the Blue environment to 0. This is the final state in which the Green environment has become our main running environment, and Blue is now ready to start the next deployment cycle.
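In Helm terms, this final step can be as small as flipping a replica count in the values file. The value names below are hypothetical, not our actual chart schema:

```yaml
# Hypothetical values.yaml after the final phase: Blue idles at zero
# replicas, Green serves all traffic and becomes the new baseline.
blue:
  replicaCount: 0
green:
  replicaCount: 3
```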

Implementation in an Existing Environment

Since we started working on this implementation after we already had a running Production environment, we had to build a solution that wouldn't harm the existing environment while adding support for Blue/Green deployment. After adding Blue/Green support in our Helm charts, we deployed the same application twice (Blue and Green). During this deployment, we copied the existing database collections to separate Blue and Green databases. Then, we removed the existing deployment and were left with Blue and Green, with traffic forwarded to Blue.
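A minimal sketch of what "Blue/Green support in Helm charts" can look like, assuming a hypothetical colors list and per-color replica map in values (not our actual chart):

```yaml
# templates/deployment.yaml (sketch): render one Deployment per color.
{{- range $color := .Values.colors }}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ $color }}-api
spec:
  replicas: {{ index $.Values.replicas $color }}
  selector:
    matchLabels: { app: api, color: {{ $color }} }
  template:
    metadata:
      labels: { app: api, color: {{ $color }} }
    spec:
      containers:
        - name: api
          image: {{ $.Values.image }}
---
{{- end }}
```

With `colors: [blue, green]` in values, the chart renders both Deployments from a single template, and each phase of the cycle only touches the replica map and the ingress rules.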

Ingress Traffic Movement

To support the Blue/Green strategy when releasing a new version, we created new ingress instances with a few changes:

First, we added ingress annotations that create new listener rules for each service. Let's take our API service's ingress as an example:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/actions.upgrade-svc: |
      {"type":"forward","forwardConfig":{"targetGroups":[{"serviceName":"green-api-service","servicePort":80,"weight":100}]}}
    alb.ingress.kubernetes.io/conditions.upgrade-svc: |
      [{"field":"http-header","httpHeaderConfig":{"httpHeaderName":"X-Forwarded-For","values":["VPN_IP_ADDRESS"]}}]
spec:
  rules:
    - host: api.DOMAIN_NAME
      http:
        paths:
          - backend:
              service:
                name: upgrade-svc
                port:
                  name: use-annotation
            path: /*
            pathType: ImplementationSpecific
          - backend:
              service:
                name: blue-api-service
                port:
                  number: 80
            path: /*
            pathType: ImplementationSpecific

These annotations create a new listener rule that forwards traffic to the green-api-service when the X-Forwarded-For (XFF) header matches the VPN IP address. Note that in this phase, all other traffic is still forwarded to the Blue deployment.

After connecting to the VPN server, the ALB routes us to the new Green deployment. Once QA completes its sanity checks over the new Green deployment, we can go ahead and forward all traffic to it:

apiVersion: networking.k8s.io/v1
kind: Ingress
spec:
  rules:
    - http:
        paths:
          - backend:
              service:
                name: ssl-redirect
                port:
                  name: use-annotation
            path: /*
            pathType: ImplementationSpecific
    - host: api.DOMAIN_NAME
      http:
        paths:
          - backend:
              service:
                name: green-api-service
                port:
                  number: 80
            path: /*
            pathType: ImplementationSpecific

After successful testing, all traffic is shifted to the Green environment

Results: Increasing Deployment Velocity

As a result of using Blue/Green deployment, we gained the ability to deploy new versions of our application, test them, and release frequently while reducing the risk to our service availability and functionality. This enables our Development team to move faster, increase our overall velocity, and provide better protection to our users' online identity (our obsession and passion!).

Key Benefits We've Realized

  • Elimination of Deployment Downtime: Our releases no longer impact user experience
  • Reduced Release Risk: Quick rollback capabilities if issues are detected
  • Improved Testing Capabilities: Thorough testing in a production-identical environment before public exposure
  • Increased Deployment Frequency: From bi-weekly to on-demand releases
  • Enhanced Developer Confidence: Teams are more willing to push changes knowing the safety net exists

Conclusion: Embracing Change with Confidence

Implementing Blue/Green deployments at Transmit Security has transformed how we approach releases. We've moved from treating deployments as high-risk events to embracing them as routine operations. This shift has been instrumental in our ability to deliver new features and security enhancements to our customers more rapidly.

While the journey to zero downtime deployments required significant planning and architectural considerations, the benefits have far outweighed the challenges. The additional infrastructure costs are offset by the business value of increased velocity and improved reliability.

For organizations looking to implement similar strategies, we recommend starting with a thorough assessment of your database architecture, as this is often the most challenging aspect of Blue/Green implementations. Invest time in designing proper migration strategies and ensuring backward compatibility between versions.

With the right approach, zero downtime deployments are achievable and can become a competitive advantage in delivering value to your customers.