Building a Highly Available Kubernetes Cluster

Chris Kim
Published: September 12, 2019

Introduction

As Kubernetes becomes increasingly prevalent as a container orchestration system, it is important to architect your Kubernetes clusters for high availability. This guide covers some of the options for deploying highly available clusters and walks through an example deployment.

In this article, we will be using the Rancher Kubernetes Engine (RKE) as the installation tool for our cluster. However, the concepts outlined here can easily be translated to other Kubernetes installers and tools.

Overview

We’ll be going over a few of the core components that are required to make a highly available Kubernetes cluster function:

  • Control plane
  • etcd
  • Ingress (L7) considerations
  • Node configuration best practices

LoadBalancer type services are not covered in this guide, but you should keep them in mind as well. They are omitted because the various load balancer implementations are cloud-specific. If you are working with an on-premises deployment, it is highly recommended to look at a solution like MetalLB.

High Availability of the Control Plane

When using rke to deploy a Kubernetes cluster, nodes designated with the controlplane role have a few unique components deployed onto them. These include:

  • kube-apiserver
  • kube-controller-manager
  • kube-scheduler

To review, the kube-apiserver component runs the Kubernetes API server. API server availability must be maintained for your cluster to function. Without a functional API endpoint, cluster operations come to a halt: the kubelet on each node cannot update the API server, the controllers cannot operate on the various control objects, and users cannot interact with the Kubernetes cluster using kubectl.

The kube-controller-manager component runs the various controllers that operate on the Kubernetes control objects, like Deployments, DaemonSets, ReplicaSets, Endpoints, etc. More information on this component can be found in the Kubernetes reference documentation.

The kube-scheduler component is responsible for scheduling Pods to nodes. More information on the kube-scheduler can be found in the Kubernetes reference documentation.

The kube-apiserver can run in parallel across multiple nodes, providing a highly available solution when requests are load balanced and failed over between nodes. When using RKE, load balancing of the API server for internal cluster components is handled by the nginx-proxy container. User-facing API server communication must be configured out-of-band to ensure maximum availability.

User (External) Load Balancing of the kube-apiserver

Here is a basic diagram outlining a configuration for external load balancing of the kube-apiserver:

[Diagram: externally load balanced Kubernetes API server]

In the diagram above, there is a Layer 4 load balancer listening on 443/tcp that is forwarding traffic to the two control plane hosts via 6443/tcp. This ensures that in the event of a control plane host failure, users are still able to access the Kubernetes API.
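
The failover behavior described above depends on the load balancer knowing which control plane hosts are currently accepting connections. As a rough illustration only, the following Python sketch probes 6443/tcp on a list of hosts. The function names are hypothetical, and a real L4 load balancer performs this kind of check with its own built-in health checking rather than with a script like this.

```python
import socket


def is_reachable(host: str, port: int = 6443, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def healthy_backends(hosts, port: int = 6443, timeout: float = 2.0):
    """Return only the hosts that accept TCP connections on the given port."""
    return [host for host in hosts if is_reachable(host, port, timeout)]
```

Calling healthy_backends with the two control plane hostnames used later in this guide (controlplane1.mycluster.example.com and controlplane2.mycluster.example.com) would return whichever of them is currently reachable on 6443/tcp. A TCP connect only shows that the port is open; it does not confirm that the API server is healthy.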

RKE has the ability to add additional hostnames to the kube-apiserver cert SAN list. This is useful when using an external load balancer that has a different hostname or IP (as it should) than the nodes that are serving API server traffic. Information on configuring this functionality can be found in the RKE docs on authentication.

To set up the example shown in the diagram, the RKE cluster.yml configuration snippet to include the api.mycluster.example.com load balancer in the kube-apiserver SAN list would look like:

authentication:
    strategy: x509
    # Extra hostnames/IPs to add to the kube-apiserver certificate SAN list
    sans:
      - "api.mycluster.example.com"

If you are adding the L4 load balancer after the fact, it is important that you perform an RKE certificate rotation operation in order to properly add the additional hostname to the SAN list of the kube-apiserver.pem certificate.

Additionally, the generated kube_config_cluster.yml file is not configured to access the API server through the load balancer; it points at the first controlplane node in the list. You will need to generate a separate kubeconfig file for users that sets the L4 API server load balancer as the server value.
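
One way to produce such a file is to copy the generated kubeconfig and rewrite its server value. The Python sketch below does this with a plain text substitution. The function name and the abbreviated sample kubeconfig are hypothetical (a real kube_config_cluster.yml also carries certificate data), and the load balancer URL assumes the 443/tcp listener from the diagram.

```python
import re

# Matches a whole "server: ..." line, keeping its indentation and key.
SERVER_LINE = re.compile(r"^([ \t]*server:[ \t]*).*$", flags=re.MULTILINE)


def point_kubeconfig_at_lb(kubeconfig_text: str, lb_url: str) -> str:
    """Return kubeconfig_text with every server value replaced by lb_url."""
    return SERVER_LINE.sub(lambda m: f'{m.group(1)}"{lb_url}"', kubeconfig_text)


# Abbreviated, hypothetical kubeconfig used only to demonstrate the rewrite.
SAMPLE_KUBECONFIG = """apiVersion: v1
kind: Config
clusters:
- cluster:
    server: "https://controlplane1.mycluster.example.com:6443"
  name: "local"
"""

if __name__ == "__main__":
    print(point_kubeconfig_at_lb(SAMPLE_KUBECONFIG, "https://api.mycluster.example.com"))
```

The rewritten file only works if api.mycluster.example.com is present in the kube-apiserver SAN list, as configured above.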

User (Cluster Internal) Load Balancing of the kube-apiserver

By default, rke designates 10.43.0.1 as the internal Kubernetes API server endpoint (based on the default service cluster CIDR of 10.43.0.0/16). A ClusterIP service and endpoint named kubernetes are created in the default namespace, which resolves to the IP that is designated to the Kubernetes API server.
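
The relationship between the service CIDR and the API server's ClusterIP can be checked with a few lines of Python: the address is the first host address in the range. The function name here is hypothetical.

```python
import ipaddress


def first_service_ip(service_cidr: str) -> str:
    """Return the first host address in the given service cluster CIDR."""
    network = ipaddress.ip_network(service_cidr)
    # Index 0 is the network address itself; index 1 is the first host address.
    return str(network[1])


if __name__ == "__main__":
    print(first_service_ip("10.43.0.0/16"))  # prints 10.43.0.1
```

If you change the service cluster CIDR in cluster.yml, the internal API server address changes accordingly.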

In a default-configured RKE cluster, the ClusterIP is load balanced using iptables. More specifically, it uses NAT pre-routing rules to redirect traffic to one of the Kubernetes API server endpoints, with the selection probability determined by the number of API server hosts available. This provides a generally highly available solution for the Kubernetes API server internally, and libraries which connect to the API server from within pods should be able to handle failover through retry.
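
To see how the probability relates to the number of API server hosts, the sketch below works through the arithmetic. It assumes the rule layout typically used by kube-proxy in iptables mode, where the rules are evaluated in order, the first matches with probability 1/n, the next with 1/(n-1), and the last always matches. That layout is an assumption about kube-proxy rather than something specific to RKE, and the function names are hypothetical.

```python
from fractions import Fraction


def rule_probabilities(n: int):
    """Per-rule match probabilities for n endpoints: 1/n, 1/(n-1), ..., 1."""
    if n < 1:
        raise ValueError("at least one endpoint is required")
    return [Fraction(1, n - i) for i in range(n)]


def effective_share(n: int):
    """Overall share of traffic each endpoint receives after chaining the rules."""
    shares = []
    remaining = Fraction(1)  # traffic that has not matched an earlier rule
    for probability in rule_probabilities(n):
        shares.append(remaining * probability)
        remaining *= 1 - probability
    return shares


if __name__ == "__main__":
    for n in (2, 3):
        print(n, [str(share) for share in effective_share(n)])
```

Under this assumption, each of the n API server endpoints ends up with an equal 1/n share of new connections, even though the individual rule probabilities differ.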

kubelet / kube-proxy Load Balancing of the kube-apiserver

The kubelet and kube-proxy components on each Kubernetes cluster node are configured by rke to connect to 127.0.0.1:6443. You may be asking yourself, “how does 127.0.0.1:6443 resolve to anything on a worker node?” This works because an nginx-proxy container runs on each node that is not designated as controlplane. The nginx-proxy is a simple container that performs L4 round robin load balancing across the known controlplane node IPs with health checking, ensuring that the nodes are able to continue operating even during transient failures.
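
The idea of round robin selection with health checking can be illustrated with a small Python sketch. This is not how nginx implements it; it is a simplified model with hypothetical names, where is_healthy stands in for whatever health check the proxy performs.

```python
from itertools import cycle


def pick_backends(backends, is_healthy, count):
    """Choose `count` backends in round robin order, skipping unhealthy ones."""
    if not backends:
        raise ValueError("no backends configured")
    ring = cycle(backends)
    chosen = []
    while len(chosen) < count:
        # Try each backend at most once per pick before giving up.
        for _ in range(len(backends)):
            candidate = next(ring)
            if is_healthy(candidate):
                chosen.append(candidate)
                break
        else:
            raise RuntimeError("no healthy control plane backends")
    return chosen


if __name__ == "__main__":
    nodes = ["controlplane1.mycluster.example.com", "controlplane2.mycluster.example.com"]
    # Simulate controlplane1 being down: every pick lands on controlplane2.
    print(pick_backends(nodes, lambda node: node != nodes[0], 3))
```

With both control plane nodes healthy, picks alternate between them. With one node failing its health check, all traffic goes to the remaining node, which is why the kubelet and kube-proxy keep working during a single control plane failure.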

High Availability of etcd

etcd, when deployed, has high availability functionality built into it. When deploying a highly available Kubernetes cluster, it is very important to ensure that etcd is deployed in a multi-node configuration where it can achieve quorum and establish a leader.

When planning for your highly available etcd cluster, there are a few aspects to keep in mind:

  • Node count
  • Disk I/O capacity
  • Network latency and throughput

Information on performance tuning etcd in highly available architectures can be found in the related section of the etcd docs.

Quorum

In order to achieve quorum, etcd must have a majority of members available and in communication with each other. As such, odd numbers of members are best suited for etcd. Increased failure tolerance is achieved as the number of members grows. Keep in mind that more is not always better in this case. When you have too many members, etcd can actually slow down, because the Raft consensus algorithm it uses must replicate each write to a majority of members before the write is committed. A table comparing the total number of nodes and the number of node failures that can be tolerated is shown below:

Recommended for HA    Total etcd Members    Number of Failed Nodes Tolerated
No                    1                     0
No                    2                     0
Yes                   3                     1
No                    4                     1
Yes                   5                     2
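
The values in the table follow directly from the majority rule: quorum is a strict majority of members, and the number of failures tolerated is whatever remains. The short Python sketch below reproduces the table. The function names are hypothetical, and recommended_for_ha simply encodes this table's "odd and at least three" reading.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """Members that can fail while the remainder still forms a quorum."""
    return members - quorum(members)


def recommended_for_ha(members: int) -> bool:
    """Odd member counts of three or more, matching the table above."""
    return members >= 3 and members % 2 == 1


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, quorum(n), failures_tolerated(n), recommended_for_ha(n))
```

Going from 3 to 4 members raises the quorum from 2 to 3 without raising the number of tolerated failures, which is why even member counts are not recommended.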

etcd Deployment Architecture

When deploying multiple etcd hosts with rke in a highly available cluster, there are two generally accepted architectures. In the first, etcd is co-located with the controlplane components, which makes efficient use of compute resources. This is generally only recommended for small to medium sized clusters where compute resources may be limited. This configuration works because etcd is primarily memory bound, whereas the controlplane components are generally compute intensive. A diagram of the configuration is below:

[Diagram: co-located etcd and controlplane]

In production-critical environments, it can be preferable to run etcd on dedicated nodes with hardware ideal for running etcd. A diagram of a separated configuration is shown below:

[Diagram: etcd separated from controlplane]

The other architecture relies on a dedicated external etcd cluster that is not co-located with any controlplane components. This provides greater redundancy and availability at the expense of operating some additional nodes.

Ingress (L7) Considerations

Kubernetes Ingress objects allow specifying host and path based routes in order to serve traffic to users of the applications hosted within a Kubernetes cluster.

Ingress Controller Networking

There are two general networking configurations that you can choose between when installing an ingress controller:

  • Host network
  • Cluster network

The first model, host network, is where the ingress controller runs on a set of nodes in the same network namespace as the host. This exposes ports 80 and 443 of the ingress controller on the host directly. To external clients, it appears that the host has a web server listening on 80/tcp and 443/tcp.

The second option, cluster network, is where the ingress controller runs on the same cluster network as the workloads within the cluster. This deployment model is useful when using a service of type LoadBalancer or a NodePort service to multiplex the host's ports, while isolating the ingress controller so that it does not share the host's network namespace.

For the purposes of this guide, we will explore the option of deploying the ingress controller to operate on the host network, as rke configures the ingress controller in this manner by default.

Ingress Controller Deployment Model

By default, rke deploys the Nginx ingress controller as a DaemonSet which runs on all worker nodes in the Kubernetes cluster. These worker nodes can then be load balanced or have dynamic or round robin DNS records configured in order to land traffic at the nodes. In this model, application workloads will be co-located alongside the ingress controller:

[Diagram: ingress controller deployed as a DaemonSet on all worker nodes]

This works for most small-to-medium sized clusters, but when running workloads that are heterogeneous or have not been profiled, CPU or memory contention can prevent the ingress controller from serving traffic properly. In these scenarios, it can be preferable to designate specific nodes to run the ingress controller. In this model, it is still possible to use round robin DNS or dynamic DNS, but load balancing tends to be the preferred solution:

[Diagram: ingress controller on dedicated ingress nodes]

As of today, rke only supports setting a node selector to control scheduling of the ingress controller. However, a feature request is open to add the ability to place tolerations on the ingress controller and taints on nodes, allowing for more fine-grained ingress controller deployments.

Ingress Controller DNS Resolution/Routing

As mentioned earlier, there are two options to balance traffic when utilizing ingress controllers in this manner. The first is to simply create DNS records that point to your ingress controllers, and the second is to run a load balancer which will load balance across your ingress controllers. Let’s take a closer look at these options now.

Direct to Ingress Controller

In some models of deployment, it can be preferable to use a technique such as round-robin DNS or some other type of dynamic DNS solution to serve traffic for your application. Tools such as external-dns allow such dynamic configuration of DNS records to take place. In addition, Rancher Multi-Cluster App uses external-dns to dynamically configure DNS entries.

Load Balanced Ingress Controllers

When operating a highly available cluster, it is often desirable to operate a load balancer in front of the ingress controllers, whether to perform SSL offloading or to provide a single IP for DNS records.

Using RKE to Deploy a Production-Ready Cluster

When using rke to deploy a highly available Kubernetes cluster, a few configuration tweaks should be made.

A sample cluster.yml file is provided here for a hypothetical cluster with the following 10 nodes:

  • 2 controlplane
  • 3 etcd
  • 2 ingress
  • 3 worker nodes

In this hypothetical cluster, there are two VIP/DNS entries, the first of which points towards the API servers on 6443/tcp and the second of which points to the ingress controllers on ports 80/tcp and 443/tcp.

The following YAML will deploy each of the above nodes with RKE:

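nodes:
    # Control plane nodes: run kube-apiserver, kube-controller-manager, and kube-scheduler
    - address: controlplane1.mycluster.example.com
      user: ubuntu
      role:
        - controlplane
    - address: controlplane2.mycluster.example.com
      user: ubuntu
      role:
        - controlplane

    # Dedicated etcd nodes: three members tolerate one failure
    - address: etcd1.mycluster.example.com
      user: ubuntu
      role:
        - etcd
    - address: etcd2.mycluster.example.com
      user: ubuntu
      role:
        - etcd
    - address: etcd3.mycluster.example.com
      user: ubuntu
      role:
        - etcd

    # Ingress nodes: workers labeled so the ingress controller can be pinned to them
    - address: ingress1.mycluster.example.com
      user: ubuntu
      role:
        - worker
      labels:
        app: ingress
    - address: ingress2.mycluster.example.com
      user: ubuntu
      role:
        - worker
      labels:
        app: ingress

    # General-purpose worker nodes for application workloads
    - address: worker1.mycluster.example.com
      user: ubuntu
      role:
        - worker
    - address: worker2.mycluster.example.com
      user: ubuntu
      role:
        - worker
    - address: worker3.mycluster.example.com
      user: ubuntu
      role:
        - worker

# Add the API server load balancer hostname to the kube-apiserver certificate SAN list
authentication:
    strategy: x509
    sans:
      - "api.mycluster.example.com"

# Schedule the Nginx ingress controller only on nodes labeled app: ingress
ingress:
    provider: nginx
    node_selector:
      app: ingress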
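
Before running rke against a layout like this, it can be useful to sanity-check the role counts against the guidance in this article. The Python sketch below is one hypothetical way to do that. The function name and the thresholds (at least two controlplane nodes, an odd number of at least three etcd members, at least two workers) are assumptions drawn from the discussion above, not rules that RKE enforces.

```python
from collections import Counter


def check_layout(node_roles):
    """Return a list of warnings for a list of per-node role lists."""
    counts = Counter(role for roles in node_roles for role in roles)
    warnings = []
    if counts["controlplane"] < 2:
        warnings.append("fewer than 2 controlplane nodes: the API server is a single point of failure")
    if counts["etcd"] < 3 or counts["etcd"] % 2 == 0:
        warnings.append("etcd member count should be odd and at least 3")
    if counts["worker"] < 2:
        warnings.append("fewer than 2 worker nodes: workloads cannot be rescheduled after a node failure")
    return warnings


# Role lists for the 10 nodes in the sample cluster.yml above
# (the two ingress nodes carry the worker role).
SAMPLE_LAYOUT = [["controlplane"]] * 2 + [["etcd"]] * 3 + [["worker"]] * 5

if __name__ == "__main__":
    print(check_layout(SAMPLE_LAYOUT) or "layout looks reasonable")
```

The sample layout passes these checks, while a cluster with two etcd members or a single controlplane node would produce a warning.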

Conclusion

In this guide, we discussed some of the requirements of operating a highly available Kubernetes cluster. As you may gather, there are quite a few components that need to be scaled and replicated in order to eliminate single points of failure. Understanding the availability requirements of your deployments, automating this process, and leveraging tools like RKE to help configure your environments can help you reach your targets for fault tolerance and availability.
