Etcd is a highly available distributed key-value store that provides a reliable way to store data across machines. More importantly, Kubernetes uses it as the backing store for all of a cluster’s data.
In this post we discuss how to back up etcd and how to recover from a backup to restore operations to a Kubernetes cluster.
Etcd in Rancher 1.6
In Rancher 1.6 we use our own Docker image for etcd, which pulls the official etcd image and adds scripts and Go binaries for orchestration, backup, disaster recovery, and health checks.
The scripts communicate with Rancher’s metadata service to get important information, such as how many etcd members are running in the cluster and which one is the etcd leader. In Rancher 1.6, we introduced an etcd backup service, which runs in the background alongside the main etcd and is responsible for backup operations.
The service performs rolling backups of etcd at a specified interval and supports retention of old backups. Rancher-etcd exposes this through three environment variables on the Docker image:
EMBEDDED_BACKUPS: boolean variable to enable/disable backup.
BACKUP_PERIOD: etcd will perform backups at this time interval.
BACKUP_RETENTION: etcd will retain backups for this time interval.
Backups are written to /var/etcd/backups on the host and are taken using the following command:
etcdctl backup --data-dir <dataDir> --backup-dir <backupDir>
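To make the rolling-backup idea concrete, here is a minimal shell sketch of one backup cycle followed by retention pruning. It is not the script shipped in Rancher’s etcd image: the function names, the ETCDCTL override, the directory naming, and expressing retention in minutes are all assumptions made for illustration. Only the etcdctl backup invocation comes from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a rolling etcd backup with retention.
# ETCDCTL can be overridden (for example with "echo" for a dry run).
ETCDCTL="${ETCDCTL:-etcdctl}"

# Take one backup into <backup_root>/<UTC timestamp>_etcd and print its path.
take_backup() {
  local data_dir="$1" backup_root="$2"
  local target
  target="$backup_root/$(date -u +%Y-%m-%dT%H:%M:%SZ)_etcd"
  "$ETCDCTL" backup --data-dir "$data_dir" --backup-dir "$target" && echo "$target"
}

# Delete backup directories last modified more than <retention_minutes> ago.
prune_backups() {
  local backup_root="$1" retention_minutes="$2"
  find "$backup_root" -mindepth 1 -maxdepth 1 -type d \
    -mmin "+$retention_minutes" -exec rm -rf {} +
}
```

A scheduler such as cron, or a simple loop with sleep, could call take_backup followed by prune_backups once per backup period.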
To configure the backup operations for etcd in Rancher 1.6, you must supply these environment variables in the Kubernetes configuration template.
After configuring and launching Kubernetes, etcd should automatically take backups every 15 minutes by default.
Recovering etcd from a backup in Rancher 1.6 requires the backup data to be placed in the volume created for etcd. For example, suppose you have 3 nodes and backups created in the /var/etcd/backups directory:
# ls /var/etcd/backups/ -l
total 44
drwx------ 3 root root 4096 Apr 9 15:03 2018-04-09T15:03:54Z_etcd_1
drwx------ 3 root root 4096 Apr 9 15:05 2018-04-09T15:05:54Z_etcd_1
drwx------ 3 root root 4096 Apr 9 15:07 2018-04-09T15:07:54Z_etcd_1
drwx------ 3 root root 4096 Apr 9 15:09 2018-04-09T15:09:54Z_etcd_1
drwx------ 3 root root 4096 Apr 9 15:11 2018-04-09T15:11:54Z_etcd_1
drwx------ 3 root root 4096 Apr 9 15:13 2018-04-09T15:13:54Z_etcd_1
drwx------ 3 root root 4096 Apr 9 15:15 2018-04-09T15:15:54Z_etcd_1
drwx------ 3 root root 4096 Apr 9 15:17 2018-04-09T15:17:54Z_etcd_1
drwx------ 3 root root 4096 Apr 9 15:19 2018-04-09T15:19:54Z_etcd_1
drwx------ 3 root root 4096 Apr 9 15:21 2018-04-09T15:21:54Z_etcd_1
drwx------ 3 root root 4096 Apr 9 15:23 2018-04-09T15:23:54Z_etcd_1
Then you should be able to restore operations to etcd. Start with only one node, so that a single etcd restores from the backup; the remaining etcd members then join the cluster. To begin the restoration, use the following steps:
target=2018-04-09T15:23:54Z_etcd_1
docker volume create --name etcd
docker run -d -v etcd:/data --name etcd-restore busybox
docker cp /var/etcd/backups/$target etcd-restore:/data/data.current
docker rm etcd-restore
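The target in the steps above is typed by hand. Because the backup directory names begin with a UTC timestamp in a fixed format, the newest one can be found with a plain name sort. The helper below is a hypothetical convenience, not part of Rancher’s tooling, and it assumes the directory holds only backups named like the ones listed earlier.

```shell
#!/usr/bin/env bash
# Print the name of the newest backup directory under the given root.
# Relies on glob expansion being sorted and on names that start with a
# fixed-format UTC timestamp, so the last match is the most recent one.
latest_backup() {
  local backup_root="${1:-/var/etcd/backups}" newest="" dir
  for dir in "$backup_root"/*/; do
    [ -d "$dir" ] || continue
    newest="$(basename "$dir")"
  done
  [ -n "$newest" ] && echo "$newest"
}
```

With this in place, the first step could read target="$(latest_backup /var/etcd/backups)". The function returns a non-zero status when no backup directory exists.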
The next step is to start Kubernetes on this node normally.
After that you can add new hosts to the setup. Make sure that the new hosts don’t have existing etcd volumes.
It is also preferable to store the etcd backups on an NFS mount point, so that the backups remain available if the hosts go down for any reason.
Etcd in Rancher 2.0
Rancher recently announced general availability (GA) for Rancher 2.0, which is now ready for production deployments. Rancher 2.0 provides unified cluster management for different cloud providers, including GKE, AKS, and EKS, as well as providers that do not yet offer a managed Kubernetes service.
Starting with RKE v0.1.7, users can enable regular, automatic etcd snapshots. In addition, RKE lets users restore etcd from a snapshot stored on the cluster instances.
In this section we explain how to back up and restore your Rancher installation on an RKE-managed cluster. The steps for this kind of Rancher installation are explained in more detail in the official documentation.
After Rancher Installation
After you install Rancher using RKE as explained in the documentation, you should see output similar to the following when you run this command:
# kubectl get pods --all-namespaces
NAMESPACE       NAME                                    READY   STATUS    RESTARTS   AGE
cattle-system   cattle-859b6cdc6b-tns6g                 1/1     Running   0          19s
ingress-nginx   default-http-backend-564b9b6c5b-7wbkx   1/1     Running   0          25s
ingress-nginx   nginx-ingress-controller-shpn4          1/1     Running   0          25s
kube-system     canal-5xj2r                             3/3     Running   0          37s
kube-system     kube-dns-5ccb66df65-c72t9               3/3     Running   0          31s
kube-system     kube-dns-autoscaler-6c4b786f5-xtj26     1/1     Running   0          30s
You will notice that the cattle pod is up and running in the cattle-system namespace; this pod is the Rancher server, installed as a Kubernetes deployment.
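Scanning the pod listing by eye gets tedious on larger clusters. As a small sketch, the filter below prints every pod whose STATUS column is not Running. It only post-processes the output of kubectl get pods --all-namespaces and assumes the default six-column layout shown above; the function name is made up for this example.

```shell
#!/usr/bin/env bash
# Read `kubectl get pods --all-namespaces` output on stdin and print
# "namespace/name status" for each pod that is not in the Running state.
# NR > 1 skips the header row; $4 is the STATUS column.
not_running() {
  awk 'NR > 1 && $4 != "Running" { print $1 "/" $2, $4 }'
}

# Usage: kubectl get pods --all-namespaces | not_running
```

Note that pods belonging to finished jobs report Completed and would also be listed, so an empty result is a sufficient but not a necessary sign of a healthy cluster.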
RKE etcd Snapshots
RKE introduced two commands to save and restore etcd snapshots of a running RKE cluster; the two commands are:
rke etcd snapshot-save --config <config-path> --name <snapshot-name>
rke etcd snapshot-restore --config <config-path> --name <snapshot-name>
For more information about etcd snapshot save/restore in RKE, please refer to the official documentation.
First we will take a snapshot of etcd that is running on the cluster. To do that, let’s run the following command:
# rke etcd snapshot-save --name rancher.snapshot --config cluster.yml
INFO Starting saving snapshot on etcd hosts
INFO [dialer] Setup tunnel for host [x.x.x.x]
INFO [etcd] Saving snapshot [rancher.snapshot] on host [x.x.x.x]
INFO [etcd] Successfully started [etcd-snapshot-once] container on host [x.x.x.x]
INFO Finished saving snapshot [rancher.snapshot] on all etcd hosts
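When snapshots are taken by hand more than once, a fixed name such as rancher.snapshot makes it hard to tell them apart. The sketch below wraps the same rke etcd snapshot-save command with a timestamped name. The function names, the name format, and the RKE override variable are assumptions for illustration; only the --config and --name flags are taken from the command shown above.

```shell
#!/usr/bin/env bash
# RKE can be overridden (for example with "echo" for a dry run).
RKE="${RKE:-rke}"

# Build a snapshot name of the form <prefix>-<UTC timestamp>.
snapshot_name() {
  local prefix="${1:-rancher}"
  echo "${prefix}-$(date -u +%Y%m%dT%H%M%SZ)"
}

# Save a timestamped snapshot using the given cluster config file and
# print the snapshot name on success so it can be reused for a restore.
save_snapshot() {
  local config="${1:-cluster.yml}"
  local name
  name="$(snapshot_name "${2:-rancher}")"
  "$RKE" etcd snapshot-save --config "$config" --name "$name" && echo "$name"
}
```

Calling save_snapshot cluster.yml rancher would then run the save command with a name that encodes when the snapshot was taken.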
RKE etcd snapshot restore
Assuming the Kubernetes cluster has failed for any reason, we can restore it from the snapshot taken earlier, using the following command:
# rke etcd snapshot-restore --name rancher.snapshot --config cluster.yml
INFO Starting restoring snapshot on etcd hosts
INFO [dialer] Setup tunnel for host [x.x.x.x]
INFO [remove/etcd] Successfully removed container on host [x.x.x.x]
INFO [hosts] Cleaning up host [x.x.x.x]
INFO [hosts] Running cleaner container on host [x.x.x.x]
INFO [kube-cleaner] Successfully started [kube-cleaner] container on host [x.x.x.x]
INFO [hosts] Removing cleaner container on host [x.x.x.x]
INFO [hosts] Successfully cleaned up host [x.x.x.x]
INFO [etcd] Restoring [rancher.snapshot] snapshot on etcd host [x.x.x.x]
INFO [etcd] Successfully started [etcd-restore] container on host [x.x.x.x]
INFO [etcd] Building up etcd plane..
INFO [etcd] Successfully started [etcd] container on host [x.x.x.x]
INFO [etcd] Successfully started [rke-log-linker] container on host [x.x.x.x]
INFO [remove/rke-log-linker] Successfully removed container on host [x.x.x.x]
INFO [etcd] Successfully started etcd plane..
INFO Finished restoring snapshot [rancher.snapshot] on all etcd hosts
Notes
There are some important notes for the etcd restore process in RKE:
1. Restarting Kubernetes components
After restoring the cluster, you have to restart the Kubernetes components on all nodes; otherwise there will be conflicts with the resource versions of objects stored in etcd. This includes restarting both the Kubernetes components and the network components. For more information, please refer to the Kubernetes documentation. To restart the Kubernetes components, you can run the following on each node:
docker restart kube-apiserver kubelet kube-controller-manager kube-scheduler kube-proxy
docker ps | grep flannel | cut -f 1 -d " " | xargs docker restart
docker ps | grep calico | cut -f 1 -d " " | xargs docker restart
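On a node where no flannel or calico container is running, the grep pipelines above end up calling docker restart with no arguments, which fails. A slightly more defensive variant is sketched below. It uses docker ps -q with a name filter, so it assumes the container names (rather than only the image names) contain flannel or calico; the function name and the DOCKER override are made up for this example.

```shell
#!/usr/bin/env bash
# DOCKER can be overridden, for example with a stub when testing.
DOCKER="${DOCKER:-docker}"

# Restart every running container whose name matches the given pattern.
# Does nothing, and succeeds, when no container matches.
restart_matching() {
  local pattern="$1" ids
  ids="$("$DOCKER" ps -q --filter "name=$pattern")"
  if [ -z "$ids" ]; then
    return 0
  fi
  # Word splitting is intended here: one argument per container ID.
  # shellcheck disable=SC2086
  "$DOCKER" restart $ids
}

# Usage on each node:
#   restart_matching flannel
#   restart_matching calico
```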
2. Restoring etcd on a multi-node cluster
If you are restoring etcd on a cluster with multiple etcd nodes, the exact same snapshot must be present on every etcd node. rke etcd snapshot-save takes a different snapshot on each node, so you need to copy one of the created snapshots manually to all nodes before restoring.
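One way to do the manual copy is with scp from a machine that already holds the chosen snapshot. The helper below is a hypothetical sketch: it only prints the scp commands so that they can be reviewed first, and both the snapshot path and the node names are supplied by the caller, since the snapshot location is not given here. It also assumes SSH access to each node and the same path on every node.

```shell
#!/usr/bin/env bash
# Print one scp command per node that copies the snapshot to the same
# path on that node. Nothing is executed; pipe the output to sh to run it.
copy_snapshot_cmds() {
  local snapshot_path="$1"
  shift
  local node
  for node in "$@"; do
    printf 'scp %q %q\n' "$snapshot_path" "${node}:${snapshot_path}"
  done
}

# Usage: copy_snapshot_cmds <snapshot-path> <node> [<node> ...]
```

Printing the commands instead of running them keeps the step easy to audit before anything is overwritten on the etcd nodes.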
3. Invalidated service account tokens
Restoring etcd on a new Kubernetes cluster with new certificates is not currently supported, because the new cluster will contain different private keys, which are used to sign the service tokens for all service accounts. This may cause many problems for pods that communicate directly with the Kubernetes API server.
In this post we saw how etcd backups can be created and restored in Kubernetes clusters in both Rancher 1.6.x and 2.0.x. Etcd snapshots can be managed in 1.6 using Rancher’s etcd image and in 2.0 using the RKE CLI.