Continental Innovates with Rancher and Kubernetes
We created the Fleet Project to provide centralized GitOps-style management of a large number of Kubernetes clusters. A key design goal of Fleet is to be able to manage 1 million geographically distributed clusters. When we architected Fleet, we wanted to use a standard Kubernetes controller architecture. This meant in order to scale, we needed to prove we could scale Kubernetes much farther than we ever had. In this blog, I will cover Fleet’s architecture, the method we used to test scale and our findings.
As K3s has exploded in popularity, so too has Kubernetes on the edge. Organizations have adopted an edge-deployment model where each device is a single-node cluster. Or you might see three-node clusters to provide high availability (HA). The point is, we are dealing with a lot of small clusters, not a single large cluster with many nodes. Pretty much anywhere people are running Linux today, they are looking to use Kubernetes to manage the workloads. While most K3s edge deployments are less than 10,000 nodes, it is not unreasonable that one could get to 1 million. The most important takeaway is that Fleet will meet your scale requirements.
The key parts of Fleet’s architecture are as follows:
To do a deployment from git, the Fleet Manager first clones and stores the contents from git. The Fleet Manager then decides which cluster needs to be updated with the contents from git and creates deployment records for the agents to read. When they can, agents will check in to read the deployment records, deploy the new assets and report back the status.
We used two approaches to simulate 1 million clusters. First, we deployed a set of large VMs (m5ad.24xlarge - 384 GiB RAM). Each VM ran 10 K3s clusters using k3d. Then those 10 K3s clusters ran 750 agents each, with each agent representing a downstream cluster. In total, each VM simulated 7,500 clusters. On average, it took about 40 minutes to deploy a VM, register all clusters with Fleet and then reach a steady state. Over the course of two days, we launched VMs in this fashion until we reached 100,000 clusters. During the first 100,000 clusters, we discovered the majority of scaling issues. After fixing these issues, scaling became fairly predictable. At this rate, it would take a long time to simulate 900,000 clusters – and a fair amount of money.
Then we shifted to a second approach: running a single simulator that could make all the API calls that 1 million clusters would, without requiring a downstream Kubernetes cluster or deploying Kubernetes resources. Instead, the simulator made the API calls to register a new cluster, discover new deployments and report back status that they succeeded. Using this approach, we went from zero to 1 million simulated clusters in a day.
The Fleet Manager, a controller running on a Kubernetes cluster, ran on three large VMs (m5ad.24xlarge - 384 GiB RAM) and an RDS (db.m5.24xlarge) instance. We actually used K3s to run the Fleet Manager cluster. We did this because Kine was already integrated. I’ll explain why we used Kine and what it is later. Even though K3s targets small-scale clusters, it is probably the easiest Kubernetes distribution to run at massive scale, and we used it because of that simplicity. It is not possible to run Fleet at the scale we were doing on a managed provider such as EKS, which I’ll explain later.
The first issue we hit was completely unexpected. When a Fleet agent registers to the Fleet Manager, it does so with a temporary cluster registration token. Then that token is used to create a new identity and credential for that cluster/agent. Both the cluster registration token and the credential for that agent are service accounts. The rate at which we could register clusters was limited by the rate the controller-manager could create tokens for the service account. After some research, we found we could modify the controller-manager’s default settings to increase the rate at which we could create service accounts (–concurrent-serviceaccount-token-syncs=100) and the overall number of API requests per second (–kube-api-qps=10000).
Fleet is written as a Kubernetes controller. Therefore, scaling Fleet to 1 million clusters meant managing tens of millions of objects in Kubernetes. We knew going into this that etcd was not capable of managing that amount of data. Etcd has a limit of 8GB for its keyspace, set to 2GB by default. The keyspace includes the current values and their previous values that have been yet to be garbage collected. A simple cluster object in Fleet will take about 6KB. For 1 million clusters, we will need at least 6GB. But a single cluster is typically about 10 Kubernetes objects, plus one object per deployment. So in practice, we are more likely to need more than 10x the space for 1 million clusters.
To work around the limitation of etcd, we used Kine, which makes it possible to run any Kubernetes distribution using a traditional RDBMS. For this scale test, we ran RDS db.m5.24xlarge instances. We didn’t attempt to do proper sizing of the database and instead started with the largest m5 instance. By the end of the test, we had about 20 million objects in Kine. This means that running Fleet at this scale cannot be done on a managed provider like EKS because it is based on etcd and also will not provide a scalable enough datastore for our needs.
This test didn’t seem to push the database very hard. Granted, we used a very large database, but it’s clear we have a lot of vertical scaling room left. Inserts and lookup of a single record continued to perform at an acceptable rate. One thing we noticed was that randomly a large list of objects (at most 10,000) would take 30 seconds to a minute. In general, those queries would complete in less than 1 second, or 5 seconds under very abusive testing. These very long queries had little impact on the system overall because they happened during cache reloading, which we will discuss later. Regardless of the fact that these slow queries didn’t significantly impact Fleet, we still need to investigate further why this is happening.
When controllers load their caches, they first list all the objects and then start a watch from the revision of the list. If there is a very high change rate and listing takes long enough, you can easily get into a situation where you finish the list but can’t start a watch because the revision is not available in the API server watch cache or has been compacted in etcd. As a workaround, we set the watch caches to a very high amount (–default-watch-cache-size=10000000). In theory, we thought we’d hit compaction issues with Kine, but we didn’t. This requires further investigation. In general, Kine is much less aggressive in how often it compacts. But in this situation, we suspect that we were adding records faster than Kine could compact them. This is not that bad. We don’t expect that the change rates we were pushing to be consistent – it’s only because we were registering clusters at a rapid pace.
The standard implementation of Kubernetes controllers is to cache all objects you are working on in memory. With Fleet, that means we need to load millions of objects to build the caches. The listing of objects has a default pagination size of 500. It takes 2,000 API requests to load 1 million objects. If you assume we can make a list call, process the objects and start the next page every second, that means it would take about 30 minutes to load the cache. Unfortunately, if any of those 2,000 API requests fails, the process starts over. We tried increasing the page size to 10,000 objects but saw that the overall load times were not significantly faster. Once we started listing 10,000 objects at a time, we ran into an issue where Kine would randomly take over a minute to return all the objects. Then the Kubernetes API server would cancel the request, causing the whole load operation to fail and have to be restarted. We worked around this issue by increasing the API request timeout (–request-timeout=30m), but this is not an acceptable solution. Keeping the page size lower would ensure the requests were faster, but the number of requests increases the chances of one failing and causing the whole operation to restart.
Restarting the Fleet controller would take up to 45 minutes. The restart time also applies to kube-apiserver and kube-controller-manager. This means you need to be very careful how you restart services. This is one point where we found that running K3s was not as good as running a traditional distribution like RKE. K3s combines api-server and controller-manager into the same process, which makes restarting the api-server or controller-manager slower and more error prone than it should be. Simulating a more catastrophic failure that required a full restart of all services, including Kubernetes, took hours to get things back online.
The time it takes to load caches and the chance of failure is by far the greatest issue we found with scaling Fleet. Going forward, this is the number one issue we are looking to address.
With our tests, we proved that Fleet’s architecture will scale to 1 million clusters, and more importantly, that Kubernetes can be used as a platform to manage significantly more data. Fleet itself does nothing directly with containers and can be viewed as just a simple application that manages data in Kubernetes. These findings open up a large realm of possibility of treating Kubernetes as more of a generic orchestration platform to write code on. When you consider you can easily bundle a set of controllers with K3s, Kubernetes becomes a nice self-contained application server.
At scale, the time it takes to reload caches is concerning but definitely manageable. We will continue to improve in this area so that running 1 million clusters is not just doable, but simple. Because at Rancher Labs, we love simple.