In Part 1: Rancher Server HA, we looked into setting up Rancher Server in HA mode to secure it against failure. That gives us a solid foundation on which we can iterate. So what now? In this installment, we’ll look at building better service resiliency with Rancher Health Checks and Load Balancing.

Since the Rancher documentation for Health Checks and Load Balancing is extremely detailed, Part 2 will focus on illustrating how they work, so we can become familiar with the nuances of running services in Rancher. A person tasked with supporting the system might have several questions. For example, how does Rancher know a container is down? How is this scenario different from a Health Check? What component is responsible for operating the health checks? How does networking work with Health Checks?

Note: the experiments here are for illustration only. For troubleshooting and support, we encourage you to check out the various Rancher resources, including the forums and GitHub.

Service Scale

First, we will walk through how container scale is maintained in Rancher, continuing with the WordPress catalog installation from Part 1. Let’s check out the Rancher Server’s database on our Rancher quickstart container:

$> docker ps | grep rancher/server
cc801bdb5330 rancher/server "/usr/bin/s6-svscan /" 5 days ago Up 5 days 3306/tcp, 0.0.0.0:9999->8080/tcp thirsty_hugle
$> docker inspect -f '{{.NetworkSettings.IPAddress}}' thirsty_hugle
172.17.0.4
$> mysql --host 172.17.0.4 --port 3306 --user cattle -p
# The password's cattle too!
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| cattle |
+--------------------+
2 rows in set (0.00 sec)

We can drop into the database shell and check out the database, or hook up that IP address to a GUI such as MySQL Workbench. From there, we can see that our WordPress and DB services are registered in our Rancher Server’s metadata along with the other containers on the agent-managed host.

There are quite a lot of tables to browse manually, so instead I used the Rancher terminal to execute a shell on my rancher/server container and enable database logging.

# relevant queries
root@cc801bdb5330:/# mysql -u root
mysql> SHOW VARIABLES LIKE "general_log%";
+------------------+---------------------------------+
| Variable_name | Value |
+------------------+---------------------------------+
| general_log | OFF |
| general_log_file | /var/lib/mysql/cc801bdb5330.log |
+------------------+---------------------------------+
2 rows in set (0.00 sec)
mysql> SET GLOBAL general_log = 'ON';

# Don't forget this, or your local Rancher will be extremely slow and fill up disk space.
# mysql> SET GLOBAL general_log = 'OFF';

Now with database event logging turned on, let’s see what happens when we kill a WordPress container!

# on rancher agent host
$> docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
cffb01a9ea15 wordpress:latest "/entrypoint.sh apach" 20 minutes ago Up 20 minutes 0.0.0.0:80->80/tcp r-wordpress_wordpress_1
98e5bcbdc6b3 mariadb:latest "docker-entrypoint.sh" 15 hours ago Up 15 hours 3306/tcp r-wordpress_db_1
c0ac56d7da38 rancher/agent-instance:v0.8.3 "/etc/init.d/agent-in" 15 hours ago Up 15 hours 0.0.0.0:500->500/udp, 0.0.0.0:4500->4500/udp cbbbed1b-8727-41d1-aa3b-9fb2c7598210
6784df26c8a7 rancher/agent:v1.0.2 "/run.sh run" 5 days ago Up 5 days rancher-agent
$> docker rm -f r-wordpress_wordpress_1
r-wordpress_wordpress_1

Checking the audit trail on the Rancher UI, we can see that Rancher detects that a WordPress container failed and immediately spins up a new container. The database logs we extracted show that these events and actions touched the following Rancher database tables:

agent
container_event
process_instance
process_execution
service
config_item_status
instance
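
A list like the one above does not have to be compiled by hand. Here is a minimal sketch, assuming the general log has been copied out to a text file and that each statement names its table right after FROM, INTO or UPDATE. The function name and the sample lines are mine (shortened from the log excerpts in this post), not part of Rancher, and a statement that the log wraps across lines in the middle of an identifier may be missed.

```python
import re
from collections import Counter

# Matches the table named right after FROM / INTO / UPDATE in a logged statement.
TABLE_RE = re.compile(r"\b(?:from|into|update)\s+`(\w+)`", re.IGNORECASE)


def tables_touched(log_lines):
    """Count how often each table is the target of a logged statement."""
    counts = Counter()
    for line in log_lines:
        counts.update(name.lower() for name in TABLE_RE.findall(line))
    return counts


# Shortened sample lines in the style of the MySQL general log.
sample = [
    "225 Prepare select `agent`.`id`, `agent`.`name` from `agent` where (`agent`.`state` = ?)",
    "225 Execute insert into `container_event` (...) values (7, 'containerEvent', 'requested')",
    "225 Query SELECT 1",
]

for table, hits in sorted(tables_touched(sample).items()):
    print(table, hits)
```

To run it against a real log, pass an open file instead of the sample list, for example `tables_touched(open("cattle_mysql.log"))`.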

All of the queries logged by the database come from interactions between rancher/cattle and its agents.

$> head cattle_mysql.log
/usr/sbin/mysqld, Version: 5.5.49-0ubuntu0.14.04.1 ((Ubuntu)). started with:
Tcp port: 3306 Unix socket: /var/run/mysqld/mysqld.sock
Time Id Command Argument
160904 18:14:49 1597 Connect [email protected] on
160904 18:14:50 225 Query SELECT 1
 225 Prepare select `agent`.`id`, `agent`.`name`, `agent`.`account_id`, `agent`.`kind`, `agent`.`uuid`, `agent`.`description`, `agent`.`state`, `agent`.`created`, `agent`.`removed`, `agent`.`remove_time`, `
agent`.`data`, `agent`.`uri`, `agent`.`managed_config`, `agent`.`agent_group_id`, `agent`.`zone_id` from `agent` where (`agent`.`state` = ? and `agent`.`uri` not like ? and `agent`.`uri` not like ? and 1 = 1)
 225 Close stmt

My logging started at 18:14:49. From the logs, we can tell that every so often Rancher checks up with its agents on the state of the system through the cattle.agent table. When we killed the WordPress container around 18:15:05, the server received a cattle.container_event which signaled that WordPress was killed.

160904 18:15:05  225 Execute   insert into `container_event` (...omit columns...) values (7, 'containerEvent', 'requested', '2016-09-04 18:15:05', '{...}', 'cffb01a9ea154f167b8c852fab1f2a444d8e846beefb6b15147109580e3bcf36', 'kill', 'wordpress:latest', 1473012905, '7442b981-b62a-4d29-80ee-e6077589fabc', 1)

Cattle then calculated that the instance count for wordpress had fallen below the desired count stored in cattle.service, so it emitted a few cattle.process_instance entries to reconcile the difference. Agents then acted on those events, updating cattle.process_instance and cattle.process_execution within a few loops.

By 18:15:08, a new WordPress container was spun up to converge on the desired instance count. In brief, the Cattle event engine processes incoming host states from its agents; whenever an imbalance in service scale is detected, Cattle emits new events and the agents act on them to achieve the desired state. This does not ensure that your container is behaving correctly, only that it is up and running. To ensure correct behavior, we move on to our next topic.
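
As a recap of the scale behavior in this section, here is a toy model of a desired-state reconciliation loop. It is only a sketch of the idea, not Cattle's implementation: the `Service` class, the action tuples and the naming of replacement containers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    desired_scale: int
    running: list = field(default_factory=list)  # names of running containers
    next_index: int = 1  # hypothetical counter used to name replacements


def reconcile(service):
    """Return the actions needed to bring the running count back to the desired scale."""
    actions = []
    while len(service.running) < service.desired_scale:
        name = f"{service.name}_{service.next_index}"
        service.next_index += 1
        service.running.append(name)
        actions.append(("start", name))
    while len(service.running) > service.desired_scale:
        actions.append(("stop", service.running.pop()))
    return actions


def on_container_event(service, container, status):
    """Record an event reported by an agent, then reconcile the service."""
    if status in ("kill", "die") and container in service.running:
        service.running.remove(container)
    return reconcile(service)


wordpress = Service("wordpress", desired_scale=1, running=["wordpress_1"], next_index=2)
print(on_container_event(wordpress, "wordpress_1", "kill"))
# [('start', 'wordpress_2')]
```

The point of the design is that the server never reacts to "a container died" directly. It only compares the observed state with the desired scale and emits whatever actions close the gap.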

Health Checks

Health checks, on the other hand, are user-defined and use HTTP requests or TCP pings to report a status, instead of relying on container_event entries in the Rancher database. We’ll take a look at this once we set up a multi-container WordPress following the instructions on Creating a Multiple Container Application in the Rancher documentation.

Let’s introduce a new error type. This time, instead of killing the container, we will make the software fail. I dropped a line into the WordPress container’s .htaccess file to make it return 500s, while the container itself stays up.

$> docker exec -it r-wordpress-multi_mywordpress_1 bash
# inside the container
$> echo "failwhale" >> .htaccess
# container now returns 500s.
# we want it to fail when the software fails!

What happened? Nothing: the container kept serving 500s, because the multiple container example does not define a Health Check for the WordPress service. Its rancher-compose.yml only defines a Health Check for the Load Balancer itself, so I will go ahead and add a service-level Health Check to our WordPress service.

mywordpress:
  scale: 2
  health_check:
    # Which port to perform the check against
    port: 80
    # For TCP, request_line needs to be '' or not shown
    # TCP Example:
    # request_line: ''
    request_line: GET / HTTP/1.0
    # Interval is measured in milliseconds
    interval: 2000
    initializing_timeout: 60000
    unhealthy_threshold: 3
    # Strategy for what to do when unhealthy
    # In this service, Rancher will recreate any unhealthy containers
    strategy: recreate
    healthy_threshold: 2
    # Response timeout is measured in milliseconds
    response_timeout: 200
...

$> cd wordpress-multi
$> rancher-compose up --upgrade mywordpress
... log lines
$> rancher-compose up --upgrade --confirm-upgrade
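
To make healthy_threshold and unhealthy_threshold concrete, here is a small sketch of how consecutive check results could move a container between states. It follows the usual rise/fall idea of counting consecutive passes and failures. The state names are my own labels, initializing_timeout is not modeled, and this is not Rancher's code.

```python
class HealthTracker:
    """Tracks one container's state from a stream of check results."""

    def __init__(self, healthy_threshold=2, unhealthy_threshold=3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.state = "initializing"
        self.successes = 0  # consecutive passed checks
        self.failures = 0   # consecutive failed checks

    def record(self, passed):
        """Feed in one check result and return the resulting state."""
        if passed:
            self.successes += 1
            self.failures = 0
            if self.successes >= self.healthy_threshold:
                self.state = "healthy"
        else:
            self.failures += 1
            self.successes = 0
            if self.failures >= self.unhealthy_threshold:
                self.state = "unhealthy"
        return self.state


tracker = HealthTracker(healthy_threshold=2, unhealthy_threshold=3)
results = [True, True, False, False, True, False, False, False]
print([tracker.record(r) for r in results])
```

With the values from the YAML above, two failed checks in a row are tolerated, and only the third consecutive failure marks the container unhealthy. At an interval of 2000 ms, that is roughly six seconds of sustained failure before the recreate strategy would apply.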

I defined my Health Check through rancher-compose.yml, but you can also define it through the Rancher UI, which lets you browse through the options.

Note: You will only have access to this UI on new service creation.

[Figure: Creation of Simple TCP Ping]

The documentation covers the Health Check options in extreme detail. So in this post, we’ll instead look at which components support the Health Check feature.

With the addition of the Health Check, I repeated the above experiment. The moment the container started returning 500s, Rancher Health Checks marked the container as unhealthy, then proceeded to recreate it.

To get a deeper understanding of how Health Checking works, we will look at how the agent’s components facilitate Health Checks on one host. Entering the agent instance, we check out the processes running on it.

At a high level, our hosts communicate with the outside world on the physical eth0 interface. By default, Docker creates a bridge called docker0 and hands out container IP addresses, commonly known as Docker IPs, to the eth0 of each container through a virtual Ethernet (veth) pair. This is how we were able to connect to rancher/server’s MySQL earlier on 172.17.0.4:3306. The network agent contains a DNS server called rancher/rancher-dns; every container managed by Rancher uses this DNS to route to the private IPs, and every networking update is managed by the services found in the network agent container.
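
When reading logs and configs, it helps to tell the two kinds of address apart at a glance. The sketch below assumes the default subnets: 172.17.0.0/16 for the docker0 bridge and 10.42.0.0/16 for the Rancher managed network. Both can be changed in a real setup, so treat the constants as assumptions.

```python
import ipaddress

# Assumed defaults; either subnet can be configured differently.
DOCKER_BRIDGE = ipaddress.ip_network("172.17.0.0/16")
RANCHER_MANAGED = ipaddress.ip_network("10.42.0.0/16")


def classify(address):
    """Say which network an IP address from a log or config belongs to."""
    ip = ipaddress.ip_address(address)
    if ip in DOCKER_BRIDGE:
        return "docker bridge"
    if ip in RANCHER_MANAGED:
        return "rancher managed network"
    return "other"


for addr in ("172.17.0.4", "10.42.188.31", "192.168.1.10"):
    print(addr, "->", classify(addr))
```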

If you have a networking background, there is a great post on the blog called Life of a Packet in Rancher.

Breakdown of Processes Running on Network Agent Instance

# inside the network agent container
$> ps ax
 PID TTY STAT TIME COMMAND
 1 ? Ss 0:00 init
 306 ? Sl 0:02 /var/lib/cattle/bin/rancher-metadata -log /var/log/rancher-metadata.log -answers /var/lib/cattle/etc/cattle/metadata/answers.yml -pid-file /var/run/rancher-metadata.pid
 376 ? Sl 1:29 /var/lib/cattle/bin/rancher-dns -log /var/log/rancher-dns.log -answers /var/lib/cattle/etc/cattle/dns/answers.json -pid-file /var/run/rancher-dns.pid -ttl 1
 692 ? Ssl 0:16 /usr/bin/monit -Ic /etc/monit/monitrc
 715 ? Sl 0:30 /usr/local/sbin/charon
 736 ? Sl 0:40 /var/lib/cattle/bin/rancher-net --log /var/log/rancher-net.log -f /var/lib/cattle/etc/cattle/ipsec/config.json -c /var/lib/cattle/etc/cattle/ipsec -i 172.17.0.2/16 --pid-file /var/run/rancher-net.pi
 837 ? Sl 0:29 /var/lib/cattle/bin/host-api -log /var/log/haproxy-monitor.log -haproxy-monitor -pid-file /var/run/haproxy-monitor.pid
16231 ? Ss 0:00 haproxy -p /var/run/haproxy.pid -f /etc/healthcheck/healthcheck.cfg -sf 16162

If we dig into /etc/healthcheck/healthcheck.cfg, we can see our health checks defined inside for HAProxy:

...

backend 359346ff-33cb-445e-b1e2-7ec06d95bb19_backend
 mode http
 balance roundrobin
 timeout check 2000
 option httpchk GET / HTTP/1.0
 server cattle-359346ff-33cb-445e-b1e2-7ec06d95bb19_1 10.42.188.31:80 check port 80 inter 2000 rise 2 fall 3

backend cbc329bc-c7ec-4581-941b-da6660b8ef00_backend
 mode http
 balance roundrobin
 timeout check 2000
 option httpchk GET / HTTP/1.0
 server cattle-cbc329bc-c7ec-4581-941b-da6660b8ef00_1 10.42.179.149:80 check port 80 inter 2000 rise 2 fall 3

# This one is the Rancher Internal Health Check defined for Load Balancers
backend 3f730419-9554-4bf6-baef-a7439ba4d16f_backend
 mode tcp
 balance roundrobin
 timeout check 2000

server cattle-3f730419-9554-4bf6-baef-a7439ba4d16f_1 10.42.218.145:42 check port 42 inter 2000 rise 2 fall 3

...
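
The backends above line up closely with the health_check fields from rancher-compose.yml. The sketch below shows that mapping as I read it off the config excerpt: interval becomes inter, healthy_threshold becomes rise, unhealthy_threshold becomes fall, and an empty request_line switches the backend to TCP mode. The function is hypothetical and is not how host-api generates the file. Treating response_timeout as the source of `timeout check` is my assumption, and the values are taken from the rendered config above.

```python
def render_backend(uuid, ip, check):
    """Render a backend stanza shaped like the ones in healthcheck.cfg."""
    request_line = check.get("request_line", "")
    lines = [f"backend {uuid}_backend"]
    lines.append(" mode http" if request_line else " mode tcp")
    lines.append(" balance roundrobin")
    # Assumption: response_timeout is what ends up as "timeout check".
    lines.append(f" timeout check {check['response_timeout']}")
    if request_line:
        lines.append(f" option httpchk {request_line}")
    lines.append(
        f" server cattle-{uuid}_1 {ip}:{check['port']} check port {check['port']}"
        f" inter {check['interval']} rise {check['healthy_threshold']}"
        f" fall {check['unhealthy_threshold']}"
    )
    return "\n".join(lines)


wordpress_check = {
    "port": 80,
    "request_line": "GET / HTTP/1.0",
    "interval": 2000,
    "response_timeout": 2000,
    "healthy_threshold": 2,
    "unhealthy_threshold": 3,
}
print(render_backend("359346ff-33cb-445e-b1e2-7ec06d95bb19", "10.42.188.31", wordpress_check))
```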

Health Check Summary

Rancher’s Network agent runs the Health Checks from host-api, which queries the configured Health Checks from HAProxy and reports statuses back to Cattle. Paraphrasing the documentation:

In Cattle environments, Rancher implements a health monitoring system by running managed network agents across its hosts to coordinate the distributed health checking of containers and services.

You can see metadata for this being filled in cattle.healthcheck_instance.

When health checks are enabled either on an individual container or a service, each container is then monitored by up to three network agents running on hosts separate to that container’s parent host.

Since I am running only one host, my Health Checks come from the same host as the container. These Health Checks are all configured by the rancher/host-api binary with HAProxy. HAProxy is popular, battle-tested software, and can be found in well-known service discovery projects like Airbnb’s Synapse.

The container is considered healthy if at least one HAProxy instance reports a “passed” health check and it is considered unhealthy when all HAProxy instances report a “unhealthy” health check.
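
The rule quoted above is easy to state in code. This is a minimal sketch of the aggregation only. The report strings mirror the wording of the quote, and the "unknown" result for an empty or mixed set of reports is my own assumption, since the documentation quote does not cover that case.

```python
def aggregate(reports):
    """Combine the reports from up to three checking agents into one verdict."""
    if not reports:
        return "unknown"  # assumption: no reports yet, so no verdict
    if any(report == "passed" for report in reports):
        return "healthy"
    if all(report == "unhealthy" for report in reports):
        return "unhealthy"
    return "unknown"  # assumption: e.g. some agents have not reported a result


print(aggregate(["passed", "unhealthy", "unhealthy"]))
print(aggregate(["unhealthy", "unhealthy", "unhealthy"]))
```

A single passing report is enough to keep a container alive, which protects against one checking host having a network problem of its own.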

Events are propagated by the Rancher Agent to Cattle, at which point the Cattle server decides whether a Health Check’s unhealthy strategy (if any) needs to be applied. In our experiment, Cattle terminated the container returning 500s and recreated it. With the network services in view, we can connect the dots of how health checks are set up, and we now have a point of reference for the components supporting Health Checks in Rancher.

Load Balancers

So now we know that Cattle keeps our individual services at the scale we set, and that for more resiliency, we can also set up HAProxy Health Checks to ensure the software is running. Now let’s build up another layer of resiliency by introducing Load Balancers. The Rancher Load Balancer is a containerized HAProxy service that is managed by Service Scale like any other service in Rancher, though it is tagged by Cattle as a System Service and hidden by default in the UI (it is marked blue when we toggle system services).

When a WordPress container behind a Load Balancer fails, the Load Balancer automatically diverts traffic to the next available container. This is by no means unique to Rancher, and is a common way to balance traffic for most applications. Usually you would pay an hourly rate for such a service or maintain it yourself; Rancher lets you quickly and automatically set up an HAProxy load balancer, so we can get on with building software instead of infrastructure. If we dig into the container r-wordpress-multi_wordpresslb_1 to check its HAProxy config, we can see that it is periodically updated with the containers in the Rancher-managed network:

$> docker exec -it r-wordpress-multi_wordpresslb_1 bash
 # inside the load balancer container
 $> cat /etc/haproxy/haproxy.cfg
 ...
 frontend 6cd2e4b8-ea4c-4300-87f2-2a8f1fc96fec_80_frontend
 bind *:80
 mode http

default_backend 6cd2e4b8-ea4c-4300-87f2-2a8f1fc96fec_80_0_backend

backend 6cd2e4b8-ea4c-4300-87f2-2a8f1fc96fec_80_0_backend
 mode http
 timeout check 2000
 option httpchk GET / HTTP/1.0
 server cee0dd09-4307-4a5c-812e-df234b035694 10.42.188.31:80 check port 80 inter 2000 rise 2 fall 3
 server a7f20d4a-58fd-419e-8df2-f77e991fec3f 10.42.179.149:80 check port 80 inter 2000 rise 2 fall 3
 http-request set-header X-Forwarded-Port %[dst_port]

listen default
 bind *:42
 ...
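
To illustrate what the two server lines in that backend buy us, here is a toy round-robin picker that skips servers marked down by their health checks. It is a sketch of the general technique, not HAProxy's scheduler, and the class and method names are made up for this post.

```python
from itertools import cycle


class RoundRobin:
    """Rotates through servers, skipping any that are marked down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark(self, server, up):
        """Record the latest health check verdict for a server."""
        if up:
            self.healthy.add(server)
        else:
            self.healthy.discard(server)

    def pick(self):
        """Return the next healthy server, or None if every server is down."""
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        return None


lb = RoundRobin(["10.42.188.31:80", "10.42.179.149:80"])
print([lb.pick() for _ in range(2)])   # traffic alternates between both containers
lb.mark("10.42.188.31:80", up=False)   # the first container fails its health check
print([lb.pick() for _ in range(2)])   # all traffic now goes to the second container
```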

You can also achieve a similar result with DNS, like we did for Rancher HA in Part 1, though Load Balancers in Rancher offer additional features such as SSL certificates and load balancing algorithms other than round robin. For more details on all of the features, I highly recommend checking out the detailed Rancher documentation on Load Balancers.

Final Experiment, Killing the Database

Now for the final experiment: what happens when we kill the Database Container? Well, the container comes back up and WordPress connects to it. Though...oh no, WordPress is back in setup mode, and even worse, all my posts are gone! What happened?

Because the database keeps its data in a volume tied to the container, killing the container also takes that volume away from the service: the replacement container starts with an empty data directory, so our WordPress data is gone.

This is a major problem. Even if we can use a Load Balancer to scale all these containers, it doesn’t matter if we can’t properly protect the data running on them! Stay tuned for Part 3, where we will discuss data resiliency on Rancher with Convoy and how to launch a replicated MySQL cluster to make our WordPress setup more resilient inside Rancher.

Nick Ma is an Infrastructure Engineer who blogs about Rancher and Open Source. You can visit Nick’s blog, CodeSheppard.com, to catch up on practical guides for keeping your services sane and reliable with open-source solutions.