From Web Scale to Edge Scale: Rancher 2.4 Supports 2,000 Clusters on its Way to 1 Million


Tom Callway
Published: March 31, 2020
Updated: April 9, 2020

Rancher 2.4 is here – with new under-the-hood changes that pave the way to supporting up to 1 million clusters. That’s probably the most exciting capability in the new version. But you might ask: why would anyone want to run thousands of Kubernetes clusters – let alone tens of thousands, hundreds of thousands or more? At Rancher Labs, we believe the future of Kubernetes is multi-cluster and fully heterogeneous. This means ‘breaking the monolith’ into many clusters and running the best Kubernetes distribution for each environment and use case.

The argument that operational consistency in the hybrid cloud is only guaranteed through a vendor monoculture is simply false. Kubernetes management platforms like Rancher exist to provide IT Ops with the ability to manage CNCF-certified Kubernetes distributions running on-prem, in the cloud and in remote locations at the edge consistently through a single pane of glass.

The Future of Kubernetes at the Edge is ‘Fleet Management’

Broadly speaking, enterprises see their low-latency performance requirements for running advanced workloads on Kubernetes at the edge being satisfied by one of two approaches:

  1. Run a cloud-tethered, monolithic cluster in an edge data center to manage ‘dumb’ edge devices.
  2. Run production-grade, lightweight Kubernetes clusters on ‘smart’ low-powered edge devices and manage them consistently as a ‘fleet’ from a central management control plane.

The cloud-tethered, monolithic approach makes sense for the vendor – they increase your dependency on their technology stack while charging you a premium. Some customers find such lock-in reassuring, but while your future might be certain, it’s not yours to control.

Alternatively, running a lightweight Kubernetes distribution such as K3s on the endpoints themselves and managing each cluster consistently from a central control plane delivers optimal performance, with the flexibility to adapt your strategy as the technology and economic landscape changes.

The scale improvements in Rancher 2.4 were a pre-condition for full fleet management capabilities coming later this year. So, what changes have we made to Rancher to enable it to support what we’re calling ‘edge scale’ Kubernetes deployments?

Optimizations in Rancher 2.4 Have Had Dramatic Results

During the development cycle for Rancher 2.4, our engineering team ran extensive scale tests, using k3d to create tens to hundreds of clusters on a single VM in AWS.

Using k3d this way let them simulate a large number of clusters attaching to Rancher while keeping costs down. Once they could scale up the number of clusters, the engineering team needed a way to visualize what Rancher was doing. They achieved this by adding a new metrics endpoint that gives insight into the various controllers Rancher runs.

After starting their tests, our engineers noticed that the amount of memory consumed per cluster increased with every cluster added. They traced this back to how contexts were being managed: each cluster held its own copy of the management context, which in turn contained a context for every cluster. With n clusters, that meant n copies of n contexts each. They addressed this by creating a single copy of the management context that all clusters share. As you can see below, this simple change dramatically reduced memory usage per cluster.

Memory usage of Rancher 2.3.5 vs. Rancher 2.4.0-rc11 with 100 clusters at steady state

For every cluster, Rancher spins up Kubernetes controllers to manage role-based access control (RBAC), sync state and handle other administrative tasks. With the new metrics endpoint and Golang profiles, our engineers observed and optimized the controllers to reduce load on the Kubernetes API and etcd, as well as CPU load on the host. In some cases, controller clients were not making use of available caches. They do now.

Controller optimizations in Rancher 2.4.0

As the diagram above demonstrates, these optimizations have had a dramatic impact on Rancher 2.4’s load profile and set the stage for 10x gains in the number of clusters it can manage.

In the coming months, we will build on this work, using fleet management to scale up to millions of clusters through federation.

This is just one of the improvements in our latest version. See what else is new in Rancher 2.4.

Tom Callway
Global Director of Product Marketing, Rancher
Tom has been working for high growth, B2B open source technology startups for over 15 years. Before joining Rancher, Tom ran the cloud marketing team at Canonical/Ubuntu and, before that, was responsible for transitioning MariaDB from a community database project to a global brand. Tom lives in Twickenham, UK with his wife Rosie, two children and miniature Schnauzer.