From Chaos to Harmony

Centralizing Kubernetes Controller Upgrades

TL;DR: At Transmit Security, we developed a centralized system for upgrading Kubernetes DevOps controllers across multiple products and environments. Our solution combines GitHub Actions, custom Helm charts, and automated processes to streamline upgrades, enhance security, and improve consistency. This article details our approach, its implementation, and its real-world impact on our operations.


In the fast-paced world of Kubernetes orchestration, staying current with controller versions is not just a best practice — it's a necessity for maintaining robust, secure, and efficient infrastructure. Here at Transmit Security, where we manage a diverse portfolio of products across numerous Kubernetes clusters, this challenge is magnified by the scale and complexity of our operations. Today, we're pulling back the curtain on an innovative approach we've developed to centralize and streamline the upgrade process for Kubernetes DevOps controllers across our multiple products and environments, a solution born from the unique demands of our robust infrastructure.

The Challenge: Taming the Multi-Controller Beast

Before we dive into our solution, let's paint a picture of the challenges we faced:

  • Version Inconsistency: With multiple products and environments, keeping track of controller versions was like herding cats.
  • Manual Overhead: Upgrades often required manual intervention, eating into valuable developer time.
  • Lack of Standardization: Different teams had different upgrade processes, leading to potential errors and inconsistencies.
  • Scale Issues: As our infrastructure grew, so did the complexity of managing upgrades across the board.
  • Security Concerns: Delayed upgrades could potentially expose our systems to known vulnerabilities.

We needed a solution that would address these challenges while providing flexibility and control to our DevOps teams. Enter our centralized upgrade pipeline.


Our Solution: A Symphony of Automation and Control

We've architected a comprehensive solution that harmonizes GitHub Actions, custom Helm charts, and automated processes into a streamlined upgrade symphony. Let's break down the key components and processes:

Key Components: The Building Blocks

Charts

Devops-stack-charts: This is our maestro chart, conducting the orchestra of external charts. It contains:

  • External charts (such as cert-manager, cluster-autoscaler, etc.) defined in the Chart.yaml
  • A template file for each controller in the templates directory
  • Custom resources for each controller's chart

This elegant structure empowers us to efficiently manage and customize resources for each controller while maintaining a centralized approach, significantly enhancing our operational efficiency.

Application-sets-chart: Think of this as our sheet music. It contains:

  • A template for a MultiSource ApplicationSet resource
  • Per-product values files with ApplicationSet configurations
  • Global defaults and a hierarchy (product_name > cloud_name > environment_name)
  • Version specifications for controllers
  • Git repository URLs and branches for controller-specific values
  • Sync options for the ApplicationSet

This chart is instrumental in our ability to manage configurations consistently across diverse products and environments, ensuring uniformity and reliability in our deployments.
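To make the hierarchy concrete, here is a minimal Python sketch of how layered values could be resolved. It is not our chart's actual logic (Helm does the real templating). It assumes that more specific levels override less specific ones (global defaults, then product, then cloud, then environment), and the key names (`defaults`, `products`, `clouds`, `environments`) and version numbers are hypothetical placeholders.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which values from `override` win, recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve_values(values: dict, product: str, cloud: str, environment: str) -> dict:
    """Layer global defaults, then product, cloud, and environment overrides."""
    resolved = deepcopy(values.get("defaults", {}))
    product_cfg = values.get("products", {}).get(product, {})
    cloud_cfg = product_cfg.get("clouds", {}).get(cloud, {})
    env_cfg = cloud_cfg.get("environments", {}).get(environment, {})
    for layer in (product_cfg, cloud_cfg, env_cfg):
        # Structural keys only hold the nested levels; they are not settings.
        settings = {k: v for k, v in layer.items() if k not in ("clouds", "environments")}
        resolved = deep_merge(resolved, settings)
    return resolved


# Hypothetical values file content; names and versions are placeholders.
example = {
    "defaults": {"controllers": {"cert-manager": {"version": "1.0.0"}}},
    "products": {
        "productA": {
            "controllers": {"cluster-autoscaler": {"version": "2.0.0"}},
            "clouds": {
                "aws": {
                    "environments": {
                        "dev": {"controllers": {"cert-manager": {"version": "1.1.0"}}},
                    }
                }
            },
        }
    },
}

dev = resolve_values(example, "productA", "aws", "dev")
prod = resolve_values(example, "productA", "aws", "prod")
print(dev["controllers"])
print(prod["controllers"])
```

In this sketch the dev environment picks up its own cert-manager version, while prod falls back to the global default until someone sets an override for it.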

GitHub Actions: The Automation Ensemble

renovate-scheduled.yaml: Our update scout. Running periodically, it:

  • Checks for new controller versions
  • Triggers the publish-chart action when updates are found

This proactive approach ensures that our system consistently benefits from the latest controller versions, enhancing security and performance.
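Renovate performs the version detection for us, so the following Python sketch is only an illustration of the underlying idea: compare each pinned controller version with the latest one available and report the differences. The function names and version numbers are hypothetical, and the parser assumes plain `MAJOR.MINOR.PATCH` versions with an optional `v` prefix (no pre-release suffixes).

```python
def parse_version(version: str) -> tuple:
    """Turn '1.2.3' or 'v1.2.3' into a tuple of ints so versions compare numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def find_updates(pinned: dict, latest: dict) -> dict:
    """Map each controller with a newer release to its (current, latest) versions."""
    updates = {}
    for name, current in pinned.items():
        candidate = latest.get(name)
        if candidate and parse_version(candidate) > parse_version(current):
            updates[name] = (current, candidate)
    return updates


# Placeholder versions, not real release numbers.
pinned = {"cert-manager": "v1.9.3", "cluster-autoscaler": "2.0.0"}
latest = {"cert-manager": "v1.10.0", "cluster-autoscaler": "2.0.0"}
print(find_updates(pinned, latest))
```

Comparing integer tuples rather than strings matters here: as strings, "1.10.0" would sort before "1.9.3".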

publish-chart.yaml: The Packaging and Publishing Maestro. It:

  • Packages each chart from external helm repositories (e.g., Bitnami, Jetstack)
  • Incorporates relevant custom resources from devops-stack-charts/templates
  • Publishes the resulting chart to our internal artifact repository

This process ensures that our customized charts are readily available for deployment, streamlining our infrastructure management.

generate-applicationset.yaml: Our template virtuoso. This action:

  • Generates ApplicationSet templates based on user parameters
  • Creates a new branch
  • Opens a PR in the specific product's repository (e.g., mind-infra, riskid-infra)

By requiring review before new configurations are applied, this action protects the integrity of our infrastructure. It calls a script, get_target_repo.py, which holds the mapping from product name to repository name and tells the action which repository the PR should be opened in.
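A script like get_target_repo.py can be very small. The sketch below is an assumption about its shape, not the actual file: the repository names come from the examples above, while the product keys (`mind`, `riskid`) and the error handling are hypothetical.

```python
# Hypothetical mapping; the product keys are assumed, the repository names
# are the examples mentioned in this article.
PRODUCT_TO_REPO = {
    "mind": "mind-infra",
    "riskid": "riskid-infra",
}


def get_target_repo(product: str) -> str:
    """Return the repository that should receive the PR for a product."""
    try:
        return PRODUCT_TO_REPO[product.strip().lower()]
    except KeyError:
        known = ", ".join(sorted(PRODUCT_TO_REPO))
        raise ValueError(f"Unknown product {product!r}; known products: {known}") from None


print(get_target_repo("mind"))
```

Failing loudly on an unknown product is a deliberate choice in this sketch: a PR opened in the wrong repository is harder to notice than a failed workflow run.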

apply-appset.yaml: This action is our deployment conductor. It:

  • Runs in per-product repositories
  • Applies an ApplicationSet based on user-provided parameters

Its role is pivotal in the bootstrapping process, as the ApplicationSet takes over the reins for automatic syncing post-deployment.

├── README.md
├── .github
│   ├── scripts
│   │   └── get_target_repo.py
│   └── workflows
│       ├── apply-appset.yaml
│       ├── generate-applicationset.yaml
│       ├── publish-chart.yaml
│       └── renovate-scheduled.yaml
├── application-sets-chart
│   ├── Chart.yaml
│   ├── values.yaml
│   ├── productA.values.yaml
│   ├── productB.values.yaml
│   └── templates
│       └── applicationset.yaml
├── devops-stack-charts
│   ├── Chart.yaml
│   ├── templates
│   │   ├── controllerA.yaml
│   │   └── controllerB.yaml
│   └── values.yaml
└── renovate.json

The Upgrade Process: A Well-Orchestrated Performance

  1. Renovate Scheduler checks for updates
  2. If new versions are found, trigger the publish-chart action
  3. Pull the changed Chart
  4. Add Custom Resources
  5. Package the updated Chart
  6. Publish to the artifact repository
  7. Create PRs in Dev Values
  8. Generate ApplicationSet
  9. Create PRs in Product Repository
  10. DevOps Team Review
  11. If approved, Deploy to Dev
  12. Test in Dev
  13. If successful, Update Other Environments
  14. Generate New ApplicationSets
  15. Deploy to Staging/Prod

Let's walk through this process:

Weekly Update Check:

Periodically, our Renovate scheduler GitHub Action runs its reconnaissance mission, scouting for controller updates. This regular check ensures we're always aware of the latest versions available.

Automated Chart Publishing:

When Renovate detects updates, it triggers our publish-chart action. This process efficiently retrieves all charts from their repositories, unarchives them, and integrates our custom resources for each controller. The charts are then repackaged with these enhancements and published to our artifact repository. This streamlined approach ensures our deployment pipeline always has access to up-to-date, customized charts, ready for immediate use. By automating this process, we maintain consistency, reduce errors, and significantly improve our deployment efficiency.
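The pull, customize, repackage, and publish steps can be sketched as a list of commands. This Python example only builds the commands and does not run them. The Helm subcommands and flags used (`helm pull --repo/--version/--untar/--untardir`, `helm package --destination`, `helm push`) are standard Helm 3, but everything else is an assumption for illustration: the one-template-file-per-controller naming, the working directory, the OCI registry URL, and the placeholder version.

```python
from pathlib import PurePosixPath


def build_publish_commands(chart, repo_url, version, work_dir, custom_templates_dir, registry):
    """Return the shell commands that pull, customize, repackage, and push one chart."""
    work = PurePosixPath(work_dir)
    chart_dir = work / chart
    # Assumption: one custom template file per controller, named after the chart.
    custom_template = PurePosixPath(custom_templates_dir) / f"{chart}.yaml"
    # `helm package` names the archive <chart>-<version>.tgz.
    packaged = work / f"{chart}-{version}.tgz"
    return [
        # 1. Download the upstream chart and unpack it into the working directory.
        ["helm", "pull", chart, "--repo", repo_url, "--version", version,
         "--untar", "--untardir", str(work)],
        # 2. Add our custom resources to the unpacked chart.
        ["cp", str(custom_template), str(chart_dir / "templates" / custom_template.name)],
        # 3. Repackage the customized chart.
        ["helm", "package", str(chart_dir), "--destination", str(work)],
        # 4. Publish the archive to the internal (here: hypothetical OCI) registry.
        ["helm", "push", str(packaged), registry],
    ]


commands = build_publish_commands(
    chart="cert-manager",
    repo_url="https://charts.jetstack.io",
    version="v1.0.0",  # placeholder version
    work_dir="/tmp/charts",
    custom_templates_dir="devops-stack-charts/templates",
    registry="oci://registry.example.com/charts",  # hypothetical registry
)
for command in commands:
    print(" ".join(command))
```

Keeping the commands as data makes the sequence easy to review or dry-run before anything is executed, for example by passing each entry to `subprocess.run(command, check=True)` in a real workflow step.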

Automated PR Creation — The Proposal Stage:

As soon as new controller versions hit the artifact repository, our system kicks into high gear:

  • Pull requests are automatically opened against each per-product values file in the application-sets-chart (in our devops-stack repository), updating only the development section.
  • The generate-applicationset action runs for each product's development environment, preparing the ground for potential upgrades.
  • PRs pop up in per-product repositories, suggesting changes to the ApplicationSets for development environments.

This automated PR creation serves as a proposal system, allowing teams to review changes before implementation.

DevOps Team Review — The Human Touch:

While automation is at the heart of our system, we believe in the importance of human oversight:

  • DevOps teams receive Slack notifications about new versions, keeping them in the loop.
  • Teams review the proposed changes, considering factors like compatibility and potential impact.
  • They make the call on whether to implement the new version in the development environment.

This step ensures that despite the automation, humans remain in control of what goes into their systems.

Controlled Rollout: From Dev to Production

After a successful deployment in the development environment, DevOps teams can confidently roll out the new version across other environments:

  • They modify their per-product values file in the application-sets-chart (located in the devops-stack repository), updating the version for the desired environments.
  • Using the generate-applicationset GitHub action, they trigger the generation of new PRs for each cluster that needs the update.

This process allows for a gradual, controlled rollout, enabling teams to monitor the impact at each stage.
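As a rough illustration of "each cluster that needs the update", the Python sketch below compares the version each cluster currently runs with the target version set for its environment. The `Cluster` type, the inventory, and the version numbers are hypothetical; in practice this information lives in the values files and the ApplicationSets rather than in a Python list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    cloud: str
    environment: str
    deployed_version: str


def clusters_needing_update(clusters, target_versions):
    """Return clusters whose environment has a target version they do not run yet.

    `target_versions` maps environment name -> desired controller version.
    Environments missing from the mapping are left untouched.
    """
    return [
        cluster
        for cluster in clusters
        if cluster.environment in target_versions
        and cluster.deployed_version != target_versions[cluster.environment]
    ]


# Hypothetical inventory with placeholder versions.
inventory = [
    Cluster("cluster-name", "aws", "dev", "1.1.0"),
    Cluster("another-cluster", "aws", "staging", "1.0.0"),
    Cluster("prod-cluster", "aws", "prod", "1.0.0"),
]

# The team promotes 1.1.0 to staging only; prod is not mentioned yet.
for cluster in clusters_needing_update(inventory, {"dev": "1.1.0", "staging": "1.1.0"}):
    print(f"open PR for {cluster.cloud}/{cluster.environment}/{cluster.name}")
```

Because prod is absent from the target mapping, no PR is proposed for it, which mirrors the environment-by-environment rollout described above.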

ArgoCD: One Instance Per Kubernetes Cluster

A key aspect of our architecture is running a dedicated ArgoCD instance for each Kubernetes cluster. This approach offers several advantages:

  • Isolation: Each cluster's configuration and state are isolated, reducing the risk of cross-cluster issues.
  • Performance: Dedicated ArgoCD instances ensure optimal performance for each cluster, preventing resource contention.
  • Scalability: As we add new clusters, we can easily scale our ArgoCD deployment without impacting existing clusters.
  • Security: Cluster-specific ArgoCD instances enhance security by limiting the blast radius of potential breaches.
  • Customization: We can tailor ArgoCD configurations to the specific needs of each cluster.
  • Resilience: Issues with one ArgoCD instance don't affect other clusters, improving overall system resilience.

This architecture aligns perfectly with our centralized controller upgrade strategy, allowing for granular control and monitoring of the upgrade process across our infrastructure.

Per-Product Repository Structure

To give you a clearer picture of how our solution is organized, let's take a look at the structure of our per-product repositories:

product-infra/
├── devops-stack/
│   ├── appsets/
│   │   ├── aws/
│   │   │   ├── dev/
│   │   │   │   ├── cluster-name/
│   │   │   │   │   └── applicationSet.yaml
│   │   │   │   └── another-cluster/
│   │   │   │       └── applicationSet.yaml
│   │   │   ├── staging/
│   │   │   └── prod/
│   │   └── gcp/
│   │       ├── dev/
│   │       ├── staging/
│   │       └── prod/
│   └── apps/
│       ├── aws/
│       │   ├── values.yaml
│       │   ├── dev/
│       │   │   ├── values.yaml
│       │   │   ├── cluster-name/
│       │   │   │   └── values.yaml
│       │   │   └── another-cluster/
│       │   │       └── values.yaml
│       │   ├── staging/
│       │   └── prod/
│       └── gcp/
│           ├── dev/
│           ├── staging/
│           └── prod/
└── README.md

This structure allows us to:

  • Organize resources by cloud provider (aws, gcp)
  • Separate environments (dev, staging, prod)
  • Manage multiple clusters within each environment
  • Keep product-specific resources separate from the DevOps stack
  • Maintain consistency across products while allowing for product-specific customizations
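Because the layout is regular, the location of any file can be derived from three parameters: cloud, environment, and cluster. The Python sketch below shows that derivation. The helper names are hypothetical, and the ordering of the values files (cloud-wide first, cluster-specific last, so later files can override earlier ones) is an assumption about how the layers are meant to be combined.

```python
from pathlib import PurePosixPath

APPSETS_ROOT = PurePosixPath("devops-stack/appsets")
APPS_ROOT = PurePosixPath("devops-stack/apps")


def appset_path(cloud: str, environment: str, cluster: str) -> PurePosixPath:
    """Location of the ApplicationSet manifest for one cluster."""
    return APPSETS_ROOT / cloud / environment / cluster / "applicationSet.yaml"


def values_files(cloud: str, environment: str, cluster: str) -> list:
    """Values files from least to most specific (assumed override order)."""
    cloud_dir = APPS_ROOT / cloud
    return [
        cloud_dir / "values.yaml",
        cloud_dir / environment / "values.yaml",
        cloud_dir / environment / cluster / "values.yaml",
    ]


print(appset_path("aws", "dev", "cluster-name"))
for path in values_files("aws", "dev", "cluster-name"):
    print(path)
```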

The applicationSet.yaml files in each cluster directory are the key to our centralized upgrade process, as they define which controller versions and configurations to apply to each cluster.
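For readers unfamiliar with multi-source ApplicationSets, here is a hedged Python sketch that builds one as a plain dictionary and prints it as JSON (which is also valid YAML). The field names follow the Argo CD ApplicationSet and multi-source Application schema, but the resource is an illustration rather than our actual template: the repository URLs, project name, namespace convention, sync option, and versions are all hypothetical.

```python
import json


def build_applicationset(cluster, controllers, chart_repo, values_repo, values_branch, values_file):
    """Build a multi-source ApplicationSet with one Application per controller."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "ApplicationSet",
        "metadata": {"name": f"{cluster}-devops-stack"},
        "spec": {
            # List generator: one element (and thus one Application) per controller.
            "generators": [{"list": {"elements": [
                {"name": name, "version": version} for name, version in controllers.items()
            ]}}],
            "template": {
                "metadata": {"name": "{{name}}"},
                "spec": {
                    "project": "default",
                    "sources": [
                        # Source 1: the packaged chart, pinned to a version.
                        {
                            "repoURL": chart_repo,
                            "chart": "{{name}}",
                            "targetRevision": "{{version}}",
                            "helm": {"valueFiles": [f"$values/{values_file}"]},
                        },
                        # Source 2: the Git repository holding the values,
                        # referenced above through the `$values` prefix.
                        {"repoURL": values_repo, "targetRevision": values_branch, "ref": "values"},
                    ],
                    # Each cluster runs its own Argo CD, so it deploys in-cluster.
                    "destination": {
                        "server": "https://kubernetes.default.svc",
                        "namespace": "{{name}}",
                    },
                    "syncPolicy": {"syncOptions": ["CreateNamespace=true"]},
                },
            },
        },
    }


appset = build_applicationset(
    cluster="cluster-name",
    controllers={"cert-manager": "1.1.0"},  # placeholder version
    chart_repo="https://charts.example.com",  # hypothetical artifact repository
    values_repo="https://github.com/example/product-infra.git",  # hypothetical
    values_branch="main",
    values_file="devops-stack/apps/aws/dev/cluster-name/values.yaml",
)
print(json.dumps(appset, indent=2))
```

The second source carries no chart of its own; it exists only so the first source can read values files from Git, which is what lets chart versions and controller-specific values live in different repositories.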

Benefits: Why This Approach is a Game-Changer

1. Centralized Management:

  • Single source of truth for controller versions and configurations
  • Easier auditing and compliance management
  • Reduced cognitive load on DevOps teams

2. Automated Updates:

  • Significant reduction in manual intervention
  • Faster response to critical updates and security patches
  • Reduced human error in the update process

3. Consistency Across the Board:

  • Uniform controller versions across products and environments
  • Standardized upgrade process for all teams
  • Easier troubleshooting due to version consistency

4. Controlled Rollout:

  • DevOps teams retain oversight and control
  • Ability to test updates in development before production rollout
  • Flexible upgrade schedules per product and environment

5. Scalability:

  • Easily adaptable to new products and controllers
  • Process remains efficient regardless of the number of clusters or controllers
  • Reduced overhead as infrastructure grows

6. Enhanced Security:

  • Faster implementation of security patches
  • Reduced window of vulnerability due to outdated controllers
  • Improved compliance with security best practices

7. Time and Resource Efficiency:

  • DevOps teams can focus on strategic tasks rather than manual upgrades
  • Reduced downtime and smoother upgrades
  • Cost savings from optimized use of human resources

Real-world Impact: A Case Study

To illustrate the power of this approach, let's look at the following scenario:

When a critical security patch was released for one of our key controllers, our Renovate action picked it up. The publish-chart action immediately packaged and published the updated chart to the artifact repository. PRs were automatically created for our development environments across all products.

Our DevOps teams, alerted by Slack notifications, quickly reviewed and approved the changes. Within a day, the patch was tested and deployed across all development environments. Using the controlled rollout process, teams then systematically updated staging and production environments during the next maintenance window.

What once would have been a long, stress-filled scramble to manually update controllers across our entire infrastructure was reduced to a smooth process with minimal manual intervention.

Conclusion: Embracing the Future of Kubernetes Management

By centralizing our Kubernetes DevOps controller upgrade process, we've not only streamlined our operations but also set the stage for more advanced automation and optimization in our Kubernetes ecosystem. This approach has transformed a once-cumbersome process into a streamlined, efficient operation that enhances our ability to maintain a cutting-edge, secure infrastructure.

As the Kubernetes landscape continues to evolve, so too will our processes and tools. By building a flexible, automated foundation, we're well-positioned to adapt to whatever challenges the future of container orchestration may bring.

I'm excited about the possibilities this centralized approach opens up and committed to continual refinement and innovation in our DevOps practices.