This section contains commands and tips for troubleshooting nodes with the etcd role.
The container for etcd should have status Up. The duration shown after Up is the time the container has been running.
docker ps -a -f=name=etcd$
Example output:
CONTAINER ID   IMAGE                         COMMAND                  CREATED       STATUS       PORTS   NAMES
605a124503b9   rancher/coreos-etcd:v3.2.18   "/usr/local/bin/et..."   2 hours ago   Up 2 hours           etcd
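The status check above can be scripted. The following is a minimal sketch, not part of Rancher or Docker: is_container_up is a hypothetical helper name, and it assumes the STATUS column begins with Up for a running container, as in the example output.

```shell
# is_container_up: hypothetical helper (not a Docker or Rancher command).
# Succeeds when a `docker ps` STATUS string starts with "Up".
is_container_up() {
  case "$1" in
    Up*) return 0 ;;
    *)   return 1 ;;
  esac
}

# Usage sketch, assuming the container is named etcd as in the command above:
#   status=$(docker ps -a -f=name=etcd$ --format '{{.Status}}')
#   if is_container_up "$status"; then echo "etcd is running"; else echo "etcd is not running: $status"; fi
```

Any other status, such as Exited or Restarting, makes the helper return non-zero.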
The container logs can indicate what the problem is.
docker logs etcd
health check for peer xxx could not connect: dial tcp IP:2380: getsockopt: connection refused
xxx is starting a new election at term x
connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: i/o timeout"; Reconnecting to {0.0.0.0:2379 0 <nil>}
rafthttp: request cluster ID mismatch
rafthttp: failed to find member
/var/lib/etcd
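To see quickly which of the messages listed above occur in a log, the lines can be counted per message type. This is a sketch under stated assumptions: scan_etcd_log and its short labels are hypothetical, and the patterns are taken only from the example messages on this page.

```shell
# scan_etcd_log: hypothetical helper; reads log text on stdin and prints
# "<count> <label>" for each known message type that appears at least once.
scan_etcd_log() {
  awk '
    /health check for peer .* could not connect/ { n["peer unreachable"]++ }
    /is starting a new election at term/         { n["new election"]++ }
    /i\/o timeout/                               { n["dial timeout"]++ }
    /request cluster ID mismatch/                { n["cluster ID mismatch"]++ }
    /failed to find member/                      { n["member not found"]++ }
    END { for (k in n) print n[k], k }
  ' | LC_ALL=C sort -k2
}

# Usage sketch (stderr is merged in because container logs may be written there):
#   docker logs etcd 2>&1 | scan_etcd_log
```

The output is sorted by label so that repeated runs are easy to compare.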
The address where etcd is listening depends on the address configuration of the host etcd is running on. If an internal address is configured for that host, the endpoint for etcdctl needs to be specified explicitly. If any of the commands respond with Error: context deadline exceeded, the etcd instance is unhealthy (either quorum is lost or the instance has not correctly joined the cluster).
The output should list every node with the etcd role and should be identical on all nodes.
Command:
docker exec etcd etcdctl member list
Command when using etcd version lower than 3.3.x (Kubernetes 1.13.x and lower) and --internal-address was specified when adding the node:
docker exec etcd sh -c "etcdctl --endpoints=\$ETCDCTL_ENDPOINT member list"
xxx, started, etcd-xxx, https://IP:2380, https://IP:2379,https://IP:4001
xxx, started, etcd-xxx, https://IP:2380, https://IP:2379,https://IP:4001
xxx, started, etcd-xxx, https://IP:2380, https://IP:2379,https://IP:4001
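The member list can also be checked mechanically. The sketch below assumes the comma-separated layout shown in the example output (ID, status, name, peer URLs, client URLs); check_members is a hypothetical helper name.

```shell
# check_members: hypothetical helper; reads `etcdctl member list` output on stdin.
# Prints every member whose status is not "started", then the member count.
# Returns non-zero if any member is not started.
check_members() {
  awk -F', *' '
    NF >= 2 {
      total++
      if ($2 != "started") { bad++; print "not started: " $1 " (" $2 ")" }
    }
    END { print "members: " (total + 0); exit (bad > 0) }
  '
}

# Usage sketch:
#   docker exec etcd etcdctl member list | check_members
```

Running it on every etcd node and comparing the results is one way to confirm that the output is identical across nodes.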
The values for RAFT TERM should be equal, and the values for RAFT INDEX should not be too far apart from each other.
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") etcd etcdctl endpoint status --write-out table
docker exec etcd etcdctl endpoint status --endpoints=$(docker exec etcd /bin/sh -c "etcdctl --endpoints=\$ETCDCTL_ENDPOINT member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") --write-out table
+-----------------+------------------+---------+---------+-----------+-----------+------------+
|    ENDPOINT     |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| https://IP:2379 | 333ef673fc4add56 |  3.2.18 |   24 MB |     false |        72 |      66887 |
| https://IP:2379 | 5feed52d940ce4cf |  3.2.18 |   24 MB |      true |        72 |      66887 |
| https://IP:2379 | db6b3bdb559a848d |  3.2.18 |   25 MB |     false |        72 |      66887 |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
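Comparing RAFT TERM and RAFT INDEX by eye gets tedious on larger clusters. The following sketch parses the table format shown above; check_raft is a hypothetical helper name. It reports only the index spread, because this page does not define a threshold for "too far apart".

```shell
# check_raft: hypothetical helper; reads `etcdctl endpoint status --write-out table`
# output on stdin. Assumes the column order shown above (RAFT TERM is the 6th
# column, RAFT INDEX the 7th). Returns 1 if the terms differ, 2 if no rows parse.
check_raft() {
  awk -F'|' '
    # Data rows: pipe-separated with a purely numeric RAFT TERM field.
    NF >= 8 && $7 ~ /^ *[0-9]+ *$/ {
      t = $7 + 0; i = $8 + 0
      if (n == 0) { tmin = t; tmax = t; imin = i; imax = i }
      if (t < tmin) { tmin = t }
      if (t > tmax) { tmax = t }
      if (i < imin) { imin = i }
      if (i > imax) { imax = i }
      n++
    }
    END {
      if (n == 0) { print "no endpoints found"; exit 2 }
      if (tmin == tmax) { verdict = "equal (" tmin ")" }
      else { verdict = "DIFFERS (" tmin ".." tmax ")" }
      print "endpoints: " n
      print "raft term: " verdict
      print "raft index spread: " (imax - imin)
      exit (tmin != tmax)
    }
  '
}

# Usage sketch: pipe the output of either endpoint status command above into check_raft.
```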
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") etcd etcdctl endpoint health
docker exec etcd etcdctl endpoint health --endpoints=$(docker exec etcd /bin/sh -c "etcdctl --endpoints=\$ETCDCTL_ENDPOINT member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','")
https://IP:2379 is healthy: successfully committed proposal: took = 2.113189ms
https://IP:2379 is healthy: successfully committed proposal: took = 2.649963ms
https://IP:2379 is healthy: successfully committed proposal: took = 2.451201ms
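For scripted checks, the health output can be reduced to a pass/fail result. This sketch treats every non-empty line that does not contain "is healthy" as a problem, so it does not depend on the exact wording of failure messages; check_health is a hypothetical helper name.

```shell
# check_health: hypothetical helper; reads `etcdctl endpoint health` output on stdin.
# Prints each line that is not a healthy report, then the healthy count.
# Returns non-zero if any such line was seen.
check_health() {
  awk '
    NF == 0        { next }
    / is healthy/  { ok++; next }
                   { bad++; print "NOT HEALTHY: " $0 }
    END { print "healthy endpoints: " (ok + 0); exit (bad > 0) }
  '
}

# Usage sketch: append `2>&1 | check_health` to either endpoint health command above
# (stderr is merged in case the etcdctl version in use reports there).
```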
for endpoint in $(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5"); do
  echo "Validating connection to ${endpoint}/health"
  docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -w "\n" --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) "${endpoint}/health"
done
for endpoint in $(docker exec etcd /bin/sh -c "etcdctl --endpoints=\$ETCDCTL_ENDPOINT member list | cut -d, -f5"); do
  echo "Validating connection to ${endpoint}/health"
  docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -w "\n" --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) "${endpoint}/health"
done
Validating connection to https://IP:2379/health
{"health": "true"}
Validating connection to https://IP:2379/health
{"health": "true"}
Validating connection to https://IP:2379/health
{"health": "true"}
for endpoint in $(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f4"); do
  echo "Validating connection to ${endpoint}/version"
  docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl --http1.1 -s -w "\n" --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) "${endpoint}/version"
done
for endpoint in $(docker exec etcd /bin/sh -c "etcdctl --endpoints=\$ETCDCTL_ENDPOINT member list | cut -d, -f4"); do
  echo "Validating connection to ${endpoint}/version"
  docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl --http1.1 -s -w "\n" --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) "${endpoint}/version"
done
Validating connection to https://IP:2380/version
{"etcdserver":"3.2.18","etcdcluster":"3.2.0"}
Validating connection to https://IP:2380/version
{"etcdserver":"3.2.18","etcdcluster":"3.2.0"}
Validating connection to https://IP:2380/version
{"etcdserver":"3.2.18","etcdcluster":"3.2.0"}
etcd will trigger alarms, for instance when it runs out of space.
docker exec etcd etcdctl alarm list
docker exec etcd sh -c "etcdctl --endpoints=\$ETCDCTL_ENDPOINT alarm list"
Example output when NOSPACE alarm is triggered:
memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE
Related error messages are etcdserver: mvcc: database space exceeded and applying raft message exceeded backend quota. When either occurs, the NOSPACE alarm is triggered.
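To find out which members raised the alarm, the member IDs can be extracted from the alarm list. This sketch assumes the memberID:<id> alarm:NOSPACE line format shown in the example output; nospace_members is a hypothetical helper name.

```shell
# nospace_members: hypothetical helper; reads `etcdctl alarm list` output on stdin
# and prints the unique member IDs that report a NOSPACE alarm, one per line.
nospace_members() {
  sed -n 's/^memberID:\([^ ]*\) alarm:NOSPACE.*$/\1/p' | LC_ALL=C sort -u
}

# Usage sketch:
#   docker exec etcd etcdctl alarm list | nospace_members
```

Empty output means no NOSPACE alarm is present in the input.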
Resolutions:
rev=$(docker exec etcd etcdctl endpoint status --write-out json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]*')
docker exec etcd etcdctl compact "$rev"
rev=$(docker exec etcd sh -c "etcdctl --endpoints=\$ETCDCTL_ENDPOINT endpoint status --write-out json | egrep -o '\"revision\":[0-9]*' | egrep -o '[0-9]*'")
docker exec etcd sh -c "etcdctl --endpoints=\$ETCDCTL_ENDPOINT compact \"$rev\""
compacted revision xxx
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") etcd etcdctl defrag
docker exec etcd sh -c "etcdctl defrag --endpoints=$(docker exec etcd /bin/sh -c "etcdctl --endpoints=\$ETCDCTL_ENDPOINT member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','")"
Finished defragmenting etcd member[https://IP:2379]
Finished defragmenting etcd member[https://IP:2379]
Finished defragmenting etcd member[https://IP:2379]
docker exec etcd sh -c "etcdctl endpoint status --endpoints=$(docker exec etcd /bin/sh -c "etcdctl --endpoints=\$ETCDCTL_ENDPOINT member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") --write-out table"
+-----------------+------------------+---------+---------+-----------+-----------+------------+
|    ENDPOINT     |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| https://IP:2379 |  e973e4419737125 |  3.2.18 |  553 kB |     false |        32 |    2449410 |
| https://IP:2379 | 4a509c997b26c206 |  3.2.18 |  553 kB |     false |        32 |    2449410 |
| https://IP:2379 | b217e736575e9dd3 |  3.2.18 |  553 kB |      true |        32 |    2449410 |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
After verifying that the DB size went down after compaction and defragmenting, the alarm needs to be disarmed for etcd to allow writes again.
docker exec etcd etcdctl alarm list
docker exec etcd etcdctl alarm disarm
docker exec etcd etcdctl alarm list
docker exec etcd sh -c "etcdctl --endpoints=\$ETCDCTL_ENDPOINT alarm list"
docker exec etcd sh -c "etcdctl --endpoints=\$ETCDCTL_ENDPOINT alarm disarm"
docker exec etcd sh -c "etcdctl --endpoints=\$ETCDCTL_ENDPOINT alarm list"
docker exec etcd etcdctl alarm list
memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE
docker exec etcd etcdctl alarm disarm
docker exec etcd etcdctl alarm list
The log level of etcd can be changed dynamically via the API. You can configure debug logging using the commands below.
docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -XPUT -d '{"Level":"DEBUG"}' --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) $(docker exec etcd printenv ETCDCTL_ENDPOINTS)/config/local/log
docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -XPUT -d '{"Level":"DEBUG"}' --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) $(docker exec etcd printenv ETCDCTL_ENDPOINT)/config/local/log
To reset the log level back to the default (INFO), you can use the following command.
docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -XPUT -d '{"Level":"INFO"}' --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) $(docker exec etcd printenv ETCDCTL_ENDPOINTS)/config/local/log
docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -XPUT -d '{"Level":"INFO"}' --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) $(docker exec etcd printenv ETCDCTL_ENDPOINT)/config/local/log
If you want to investigate the contents of your etcd, you can either watch streaming events or query etcd directly. See the examples below.
docker exec etcd etcdctl watch --prefix /registry
docker exec etcd sh -c "etcdctl --endpoints=\$ETCDCTL_ENDPOINT watch --prefix /registry"
If you only want to see the affected keys (and not the binary data), you can append | grep -a ^/registry to the command to filter for keys only.
docker exec etcd etcdctl get /registry --prefix=true --keys-only
docker exec etcd sh -c "etcdctl --endpoints=\$ETCDCTL_ENDPOINT get /registry --prefix=true --keys-only"
You can process the data to get a summary of count per key, using the command below:
docker exec etcd etcdctl get /registry --prefix=true --keys-only | grep -v ^$ | awk -F'/' '{ if ($3 ~ /cattle.io/) {h[$3"/"$4]++} else { h[$3]++ }} END { for(k in h) print h[k], k }' | sort -nr
When a node in your etcd cluster becomes unhealthy, the recommended approach is to fix or remove the failed or unhealthy node before adding a new etcd node to the cluster.