Prometheus is an open-source system for monitoring and alerting, originally developed at SoundCloud. It joined the Cloud Native Computing Foundation (CNCF) in 2016 and became one of the most popular projects after Kubernetes. It can monitor anything from an entire Linux server to a stand-alone web server, a database service or a single process. In Prometheus terminology, the things it monitors are called targets, and each unit of data collected from a target is called a metric. Prometheus pulls (scrapes) targets over HTTP at a set interval, collects their metrics and stores the data in its time-series database. You can then use the PromQL query language to query metrics about those targets.
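For instance, a simple PromQL query like the one below (shown purely as an illustration) returns the per-second CPU time spent in non-idle modes by each monitored node, averaged over the last five minutes; node_cpu_seconds_total is a metric exposed by node-exporter, which we will install later as part of the monitoring stack:

rate(node_cpu_seconds_total{mode!="idle"}[5m])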
In this article, we will provide a step-by-step guide showing how to:
- start a Rancher instance and use it to deploy a Kubernetes cluster on GKE
- deploy the Prometheus software stack with the prometheus-operator Helm chart
- define a custom Prometheus alert rule
- configure Alertmanager to send email notifications when the alert fires
Below, we can see all the components that form the Prometheus ecosystem.
All of the components in the Prometheus ecosystem (credit http://prometheus.io)
Here’s a quick review of the terms that are relevant to this exercise:
According to CoreOS, the creators of the Prometheus Operator project, the Operator makes Prometheus configuration Kubernetes-native and manages and operates Prometheus and Alertmanager clusters.
The Operator introduces the following Kubernetes custom resource definitions (CRDs): Prometheus, ServiceMonitor, PrometheusRule and Alertmanager. Get more details about these here. In our demo, we’ll use PrometheusRule to define custom rules.
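To give a feel for these CRDs, here is a minimal, hypothetical ServiceMonitor manifest (the app label and port name are made up for illustration) that tells the Operator which Services Prometheus should scrape:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring
  labels:
    release: demo          # label that a chart-managed Prometheus typically uses to select ServiceMonitors
spec:
  selector:
    matchLabels:
      app: example-app     # scrape Services carrying this label
  endpoints:
  - port: web              # name of the Service port to scrape
    interval: 30s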
To install, we will use the stable/prometheus-operator Helm chart.
The default installation deploys the following components: prometheus-operator, prometheus, alertmanager, node-exporter, kube-state-metrics and grafana. By default, Prometheus scrapes the main Kubernetes components: kube-apiserver, kube-scheduler, kube-controller-manager and etcd.
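Each of these components can be toggled through the chart's values. If you want to trim the default installation, a values override along the lines of the sketch below can be passed to helm install with -f; the key names are assumptions based on the chart's defaults, so confirm them with helm show values stable/prometheus-operator:

# values.yaml (sketch – verify keys against the chart's own values.yaml)
grafana:
  enabled: true        # set to false to skip the bundled Grafana
alertmanager:
  enabled: true
nodeExporter:
  enabled: true
kubeStateMetrics:
  enabled: true
kubeEtcd:
  enabled: true        # etcd scraping requires the etcd metrics endpoints to be reachable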
Prerequisites
To perform this demo, you will need the following:
Starting a Rancher instance
Follow this intuitive getting started guide.
Using Rancher to Deploy a GKE Cluster
Use Rancher to set up and configure your Kubernetes cluster. Follow the how-to guide.
Once the cluster is ready and you have configured the kubeconfig file with the appropriate credentials and endpoint information, you can use kubectl to point at that specific cluster.
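For example, assuming you saved the kubeconfig generated by Rancher to a local file (the path below is only illustrative), pointing kubectl at the cluster looks like this:

$ export KUBECONFIG=~/.kube/gke-demo-cluster.yaml
$ kubectl get nodes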
Deploying Prometheus Software
Let’s check which Helm version we are running:
$ helm version
version.BuildInfo{Version:"v3.1.2", GitCommit:"d878d4d45863e42fd5cff6743294a11d28a9abce", GitTreeState:"clean", GoVersion:"go1.13.8"}
As we are using Helm 3, we will need to add the stable repository, as this is not set by default.
$ helm repo add stable https://kubernetes-charts.storage.googleapis.com
"stable" has been added to your repositories
$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "stable" chart repository
Update Complete. ⎈ Happy Helming!⎈
$ helm repo list
NAME     URL
stable   https://kubernetes-charts.storage.googleapis.com
With Helm configured, we can proceed with the installation of prometheus-operator.
$ kubectl create namespace monitoring
namespace/monitoring created
$ helm install --namespace monitoring demo stable/prometheus-operator
manifest_sorter.go:192: info: skipping unknown hook: "crd-install"
manifest_sorter.go:192: info: skipping unknown hook: "crd-install"
manifest_sorter.go:192: info: skipping unknown hook: "crd-install"
manifest_sorter.go:192: info: skipping unknown hook: "crd-install"
manifest_sorter.go:192: info: skipping unknown hook: "crd-install"
manifest_sorter.go:192: info: skipping unknown hook: "crd-install"
NAME: demo
LAST DEPLOYED: Sat Mar 14 09:40:35 2020
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
NOTES:
The Prometheus Operator has been installed. Check its status by running:
  kubectl --namespace monitoring get pods -l "release=demo"

Visit https://github.com/coreos/prometheus-operator for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.
Besides monitoring, Prometheus lets us define rules that trigger alerts. These rules are written as Prometheus expression language (PromQL) expressions. Whenever a rule's condition is met, the alert fires and is sent to Alertmanager. Later, we'll see what such a rule looks like.
Let’s get back to our demo. Once Helm has finished the deployment, we can check which pods have been created:
$ kubectl -n monitoring get pods
NAME                                                    READY   STATUS    RESTARTS   AGE
alertmanager-demo-prometheus-operator-alertmanager-0    2/2     Running   0          61s
demo-grafana-5576fbf669-9l57b                           3/3     Running   0          72s
demo-kube-state-metrics-67bf64b7f4-4786k                1/1     Running   0          72s
demo-prometheus-node-exporter-ll8zx                     1/1     Running   0          72s
demo-prometheus-node-exporter-nqnr6                     1/1     Running   0          72s
demo-prometheus-node-exporter-sdndf                     1/1     Running   0          72s
demo-prometheus-operator-operator-b9c9b5457-db9dj       2/2     Running   0          72s
prometheus-demo-prometheus-operator-prometheus-0        3/3     Running   1          50s
In order to access Prometheus and Alertmanager from a Web browser, we need to use port forwarding.
As this demo uses a GCP instance, and all kubectl commands are run from this instance, we use the instance’s external IP address to access the resources.
$ kubectl port-forward --address 0.0.0.0 -n monitoring prometheus-demo-prometheus-operator-prometheus-0 9090 >/dev/null 2>&1 &
$ kubectl port-forward --address 0.0.0.0 -n monitoring alertmanager-demo-prometheus-operator-alertmanager-0 9093 >/dev/null 2>&1 &
The Alerts tab of the Prometheus UI shows us all the currently configured alerts and their state. We can check them from the CLI as well, by querying the CRD called prometheusrules:
$ kubectl -n monitoring get prometheusrules
NAME                                                             AGE
demo-prometheus-operator-alertmanager.rules                      3m21s
demo-prometheus-operator-etcd                                    3m21s
demo-prometheus-operator-general.rules                           3m21s
demo-prometheus-operator-k8s.rules                               3m21s
demo-prometheus-operator-kube-apiserver-error                    3m21s
demo-prometheus-operator-kube-apiserver.rules                    3m21s
demo-prometheus-operator-kube-prometheus-node-recording.rules    3m21s
demo-prometheus-operator-kube-scheduler.rules                    3m21s
demo-prometheus-operator-kubernetes-absent                       3m21s
demo-prometheus-operator-kubernetes-apps                         3m21s
demo-prometheus-operator-kubernetes-resources                    3m21s
demo-prometheus-operator-kubernetes-storage                      3m21s
demo-prometheus-operator-kubernetes-system                       3m21s
demo-prometheus-operator-kubernetes-system-apiserver             3m21s
demo-prometheus-operator-kubernetes-system-controller-manager    3m21s
demo-prometheus-operator-kubernetes-system-kubelet               3m21s
demo-prometheus-operator-kubernetes-system-scheduler             3m21s
demo-prometheus-operator-node-exporter                           3m21s
demo-prometheus-operator-node-exporter.rules                     3m21s
demo-prometheus-operator-node-network                            3m21s
demo-prometheus-operator-node-time                               3m21s
demo-prometheus-operator-node.rules                              3m21s
demo-prometheus-operator-prometheus                              3m21s
demo-prometheus-operator-prometheus-operator                     3m21s
We can also check the rule files themselves, which are located in the prometheus container of the Prometheus Pod.
$ kubectl -n monitoring exec -it prometheus-demo-prometheus-operator-prometheus-0 -- /bin/sh
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-demo-prometheus-operator-prometheus-0 -n monitoring' to see all of the containers in this pod.
Inside the container, we can check the path where rules are stored:
/prometheus $ ls /etc/prometheus/rules/prometheus-demo-prometheus-operator-prometheus-rulefiles-0/
monitoring-demo-prometheus-operator-alertmanager.rules.yaml
monitoring-demo-prometheus-operator-etcd.yaml
monitoring-demo-prometheus-operator-general.rules.yaml
monitoring-demo-prometheus-operator-k8s.rules.yaml
monitoring-demo-prometheus-operator-kube-apiserver-error.yaml
monitoring-demo-prometheus-operator-kube-apiserver.rules.yaml
monitoring-demo-prometheus-operator-kube-prometheus-node-recording.rules.yaml
monitoring-demo-prometheus-operator-kube-scheduler.rules.yaml
monitoring-demo-prometheus-operator-kubernetes-absent.yaml
monitoring-demo-prometheus-operator-kubernetes-apps.yaml
monitoring-demo-prometheus-operator-kubernetes-resources.yaml
monitoring-demo-prometheus-operator-kubernetes-storage.yaml
monitoring-demo-prometheus-operator-kubernetes-system-apiserver.yaml
monitoring-demo-prometheus-operator-kubernetes-system-controller-manager.yaml
monitoring-demo-prometheus-operator-kubernetes-system-kubelet.yaml
monitoring-demo-prometheus-operator-kubernetes-system-scheduler.yaml
monitoring-demo-prometheus-operator-kubernetes-system.yaml
monitoring-demo-prometheus-operator-node-exporter.rules.yaml
monitoring-demo-prometheus-operator-node-exporter.yaml
monitoring-demo-prometheus-operator-node-network.yaml
monitoring-demo-prometheus-operator-node-time.yaml
monitoring-demo-prometheus-operator-node.rules.yaml
monitoring-demo-prometheus-operator-prometheus-operator.yaml
monitoring-demo-prometheus-operator-prometheus.yaml
To understand how these rules are loaded into Prometheus, let's check the Pod's details. We can see that the config file used by the prometheus container is /etc/prometheus/config_out/prometheus.env.yaml. This config file points Prometheus at the location of the rule files, and the rules-configmap-reloader sidecar triggers a reload whenever those files change.
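For reference, the rule-loading part of that generated configuration usually boils down to a stanza like the one below (taken from a typical prometheus-operator setup; your generated file may differ slightly):

rule_files:
- /etc/prometheus/rules/prometheus-demo-prometheus-operator-prometheus-rulefiles-0/*.yaml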
$ kubectl -n monitoring describe pod prometheus-demo-prometheus-operator-prometheus-0
Name:               prometheus-demo-prometheus-operator-prometheus-0
Namespace:          monitoring
Priority:           0
Node:               gke-c-7dkls-default-0-c6ca178a-gmcq/10.132.0.15
Start Time:         Wed, 11 Mar 2020 18:06:47 +0000
Labels:             app=prometheus
                    controller-revision-hash=prometheus-demo-prometheus-operator-prometheus-5ccbbd8578
                    prometheus=demo-prometheus-operator-prometheus
                    statefulset.kubernetes.io/pod-name=prometheus-demo-prometheus-operator-prometheus-0
Annotations:        <none>
Status:             Running
IP:                 10.40.0.7
IPs:                <none>
Controlled By:      StatefulSet/prometheus-demo-prometheus-operator-prometheus
Containers:
  prometheus:
    Container ID:  docker://360db8a9f1cce8d72edd81fcdf8c03fe75992e6c2c59198b89807aa0ce03454c
    Image:         quay.io/prometheus/prometheus:v2.15.2
    Image ID:      docker-pullable://quay.io/prometheus/prometheus@sha256:914525123cf76a15a6aaeac069fcb445ce8fb125113d1bc5b15854bc1e8b6353
    Port:          9090/TCP
    Host Port:     0/TCP
    Args:
      --web.console.templates=/etc/prometheus/consoles
      --web.console.libraries=/etc/prometheus/console_libraries
      --config.file=/etc/prometheus/config_out/prometheus.env.yaml
      --storage.tsdb.path=/prometheus
      --storage.tsdb.retention.time=10d
      --web.enable-lifecycle
      --storage.tsdb.no-lockfile
      --web.external-url=http://demo-prometheus-operator-prometheus.monitoring:9090
      --web.route-prefix=/
    State:          Running
      Started:      Wed, 11 Mar 2020 18:07:07 +0000
    Last State:     Terminated
      Reason:       Error
      Message:      caller=main.go:648 msg="Starting TSDB ..."
        level=info ts=2020-03-11T18:07:02.185Z caller=web.go:506 component=web msg="Start listening for connections" address=0.0.0.0:9090
        level=info ts=2020-03-11T18:07:02.192Z caller=head.go:584 component=tsdb msg="replaying WAL, this may take awhile"
        level=info ts=2020-03-11T18:07:02.192Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
        level=info ts=2020-03-11T18:07:02.194Z caller=main.go:663 fs_type=EXT4_SUPER_MAGIC
        level=info ts=2020-03-11T18:07:02.194Z caller=main.go:664 msg="TSDB started"
        level=info ts=2020-03-11T18:07:02.194Z caller=main.go:734 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
        level=info ts=2020-03-11T18:07:02.194Z caller=main.go:517 msg="Stopping scrape discovery manager..."
        level=info ts=2020-03-11T18:07:02.194Z caller=main.go:531 msg="Stopping notify discovery manager..."
        level=info ts=2020-03-11T18:07:02.194Z caller=main.go:553 msg="Stopping scrape manager..."
        level=info ts=2020-03-11T18:07:02.194Z caller=manager.go:814 component="rule manager" msg="Stopping rule manager..."
        level=info ts=2020-03-11T18:07:02.194Z caller=manager.go:820 component="rule manager" msg="Rule manager stopped"
        level=info ts=2020-03-11T18:07:02.194Z caller=main.go:513 msg="Scrape discovery manager stopped"
        level=info ts=2020-03-11T18:07:02.194Z caller=main.go:527 msg="Notify discovery manager stopped"
        level=info ts=2020-03-11T18:07:02.194Z caller=main.go:547 msg="Scrape manager stopped"
        level=info ts=2020-03-11T18:07:02.197Z caller=notifier.go:598 component=notifier msg="Stopping notification manager..."
        level=info ts=2020-03-11T18:07:02.197Z caller=main.go:718 msg="Notifier manager stopped"
        level=error ts=2020-03-11T18:07:02.197Z caller=main.go:727 err="error loading config from \"/etc/prometheus/config_out/prometheus.env.yaml\": couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): open /etc/prometheus/config_out/prometheus.env.yaml: no such file or directory"
      Exit Code:    1
      Started:      Wed, 11 Mar 2020 18:07:02 +0000
      Finished:     Wed, 11 Mar 2020 18:07:02 +0000
    Ready:          True
    Restart Count:  1
    Liveness:       http-get http://:web/-/healthy delay=0s timeout=3s period=5s #success=1 #failure=6
    Readiness:      http-get http://:web/-/ready delay=0s timeout=3s period=5s #success=1 #failure=120
    Environment:    <none>
    Mounts:
      /etc/prometheus/certs from tls-assets (ro)
      /etc/prometheus/config_out from config-out (ro)
      /etc/prometheus/rules/prometheus-demo-prometheus-operator-prometheus-rulefiles-0 from prometheus-demo-prometheus-operator-prometheus-rulefiles-0 (rw)
      /prometheus from prometheus-demo-prometheus-operator-prometheus-db (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from demo-prometheus-operator-prometheus-token-jvbrr (ro)
  prometheus-config-reloader:
    Container ID:  docker://de27cdad7067ebd5154c61b918401b2544299c161850daf3e317311d2d17af3d
    Image:         quay.io/coreos/prometheus-config-reloader:v0.37.0
    Image ID:      docker-pullable://quay.io/coreos/prometheus-config-reloader@sha256:5e870e7a99d55a5ccf086063efd3263445a63732bc4c04b05cf8b664f4d0246e
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/prometheus-config-reloader
    Args:
      --log-format=logfmt
      --reload-url=http://127.0.0.1:9090/-/reload
      --config-file=/etc/prometheus/config/prometheus.yaml.gz
      --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
    State:          Running
      Started:      Wed, 11 Mar 2020 18:07:04 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  25Mi
    Requests:
      cpu:     100m
      memory:  25Mi
    Environment:
      POD_NAME:  prometheus-demo-prometheus-operator-prometheus-0 (v1:metadata.name)
    Mounts:
      /etc/prometheus/config from config (rw)
      /etc/prometheus/config_out from config-out (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from demo-prometheus-operator-prometheus-token-jvbrr (ro)
  rules-configmap-reloader:
    Container ID:  docker://5804e45380ed1b5374a4c2c9ee4c9c4e365bee93b9ccd8b5a21f50886ea81a91
    Image:         quay.io/coreos/configmap-reload:v0.0.1
    Image ID:      docker-pullable://quay.io/coreos/configmap-reload@sha256:e2fd60ff0ae4500a75b80ebaa30e0e7deba9ad107833e8ca53f0047c42c5a057
    Port:          <none>
    Host Port:     <none>
    Args:
      --webhook-url=http://127.0.0.1:9090/-/reload
      --volume-dir=/etc/prometheus/rules/prometheus-demo-prometheus-operator-prometheus-rulefiles-0
    State:          Running
      Started:      Wed, 11 Mar 2020 18:07:06 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  25Mi
    Requests:
      cpu:     100m
      memory:  25Mi
    Environment:  <none>
    Mounts:
      /etc/prometheus/rules/prometheus-demo-prometheus-operator-prometheus-rulefiles-0 from prometheus-demo-prometheus-operator-prometheus-rulefiles-0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from demo-prometheus-operator-prometheus-token-jvbrr (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-demo-prometheus-operator-prometheus
    Optional:    false
  tls-assets:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-demo-prometheus-operator-prometheus-tls-assets
    Optional:    false
  config-out:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  prometheus-demo-prometheus-operator-prometheus-rulefiles-0:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-demo-prometheus-operator-prometheus-rulefiles-0
    Optional:  false
  prometheus-demo-prometheus-operator-prometheus-db:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  demo-prometheus-operator-prometheus-token-jvbrr:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  demo-prometheus-operator-prometheus-token-jvbrr
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age                    From                                          Message
  ----    ------     ----                   ----                                          -------
  Normal  Scheduled  4m51s                  default-scheduler                             Successfully assigned monitoring/prometheus-demo-prometheus-operator-prometheus-0 to gke-c-7dkls-default-0-c6ca178a-gmcq
  Normal  Pulling    4m45s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Pulling image "quay.io/prometheus/prometheus:v2.15.2"
  Normal  Pulled     4m39s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Successfully pulled image "quay.io/prometheus/prometheus:v2.15.2"
  Normal  Pulling    4m36s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Pulling image "quay.io/coreos/prometheus-config-reloader:v0.37.0"
  Normal  Pulled     4m35s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Successfully pulled image "quay.io/coreos/prometheus-config-reloader:v0.37.0"
  Normal  Pulling    4m34s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Pulling image "quay.io/coreos/configmap-reload:v0.0.1"
  Normal  Started    4m34s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Started container prometheus-config-reloader
  Normal  Created    4m34s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Created container prometheus-config-reloader
  Normal  Pulled     4m33s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Successfully pulled image "quay.io/coreos/configmap-reload:v0.0.1"
  Normal  Created    4m32s (x2 over 4m36s)  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Created container prometheus
  Normal  Created    4m32s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Created container rules-configmap-reloader
  Normal  Started    4m32s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Started container rules-configmap-reloader
  Normal  Pulled     4m32s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Container image "quay.io/prometheus/prometheus:v2.15.2" already present on machine
  Normal  Started    4m31s (x2 over 4m36s)  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Started container prometheus
Let’s clean up the default rules so we can better observe the one that we’ll create. The command below deletes all the rules except one, demo-prometheus-operator-alertmanager.rules:
$ kubectl -n monitoring delete $(kubectl -n monitoring get prometheusrules -o name | grep -v alertmanager)
$ kubectl -n monitoring get prometheusrules
NAME                                          AGE
demo-prometheus-operator-alertmanager.rules   8m53s
Note: We kept only one rule to make this demo easier to follow. However, there is one rule you should never delete in a real deployment. It sits in monitoring-demo-prometheus-operator-general.rules.yaml and is called Watchdog. This alert is always firing; its purpose is to ensure that the entire alerting pipeline is functional.
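For reference, the Watchdog rule essentially looks like the snippet below (as commonly shipped with the kube-prometheus rule set; the exact annotation text may vary between chart versions):

- alert: Watchdog
  expr: vector(1)      # a constant expression, so the alert is always firing
  labels:
    severity: none
  annotations:
    message: This alert is always firing; it ensures that the entire alerting pipeline is functional.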
Let’s check the rule we kept from the CLI and compare it with what we will see in the browser.
$ kubectl -n monitoring describe prometheusrule demo-prometheus-operator-alertmanager.rules
Name:         demo-prometheus-operator-alertmanager.rules
Namespace:    monitoring
Labels:       app=prometheus-operator
              chart=prometheus-operator-8.12.1
              heritage=Tiller
              release=demo
Annotations:  prometheus-operator-validated: true
API Version:  monitoring.coreos.com/v1
Kind:         PrometheusRule
Metadata:
  Creation Timestamp:  2020-03-11T18:06:25Z
  Generation:          1
  Resource Version:    4871
  Self Link:           /apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheusrules/demo-prometheus-operator-alertmanager.rules
  UID:                 6a84dbb0-feba-4f17-b3dc-4b6486818bc0
Spec:
  Groups:
    Name:  alertmanager.rules
    Rules:
      Alert:  AlertmanagerConfigInconsistent
      Annotations:
        Message:  The configuration of the instances of the Alertmanager cluster `{{$labels.service}}` are out of sync.
      Expr:  count_values("config_hash", alertmanager_config_hash{job="demo-prometheus-operator-alertmanager",namespace="monitoring"}) BY (service) / ON(service) GROUP_LEFT() label_replace(max(prometheus_operator_spec_replicas{job="demo-prometheus-operator-operator",namespace="monitoring",controller="alertmanager"}) by (name, job, namespace, controller), "service", "$1", "name", "(.*)") != 1
      For:   5m
      Labels:
        Severity:  critical
      Alert:       AlertmanagerFailedReload
      Annotations:
        Message:  Reloading Alertmanager's configuration has failed for {{ $labels.namespace }}/{{ $labels.pod}}.
      Expr:  alertmanager_config_last_reload_successful{job="demo-prometheus-operator-alertmanager",namespace="monitoring"} == 0
      For:   10m
      Labels:
        Severity:  warning
      Alert:       AlertmanagerMembersInconsistent
      Annotations:
        Message:  Alertmanager has not found all other members of the cluster.
      Expr:  alertmanager_cluster_members{job="demo-prometheus-operator-alertmanager",namespace="monitoring"} != on (service) GROUP_LEFT() count by (service) (alertmanager_cluster_members{job="demo-prometheus-operator-alertmanager",namespace="monitoring"})
      For:   5m
      Labels:
        Severity:  critical
Events:  <none>
Let’s remove all three default alerts and create one of our own:
$ kubectl -n monitoring edit prometheusrules demo-prometheus-operator-alertmanager.rules
prometheusrule.monitoring.coreos.com/demo-prometheus-operator-alertmanager.rules edited
Our custom alert looks like this:
$ kubectl -n monitoring describe prometheusrule demo-prometheus-operator-alertmanager.rules
Name:         demo-prometheus-operator-alertmanager.rules
Namespace:    monitoring
Labels:       app=prometheus-operator
              chart=prometheus-operator-8.12.1
              heritage=Tiller
              release=demo
Annotations:  prometheus-operator-validated: true
API Version:  monitoring.coreos.com/v1
Kind:         PrometheusRule
Metadata:
  Creation Timestamp:  2020-03-11T18:06:25Z
  Generation:          3
  Resource Version:    18180
  Self Link:           /apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheusrules/demo-prometheus-operator-alertmanager.rules
  UID:                 6a84dbb0-feba-4f17-b3dc-4b6486818bc0
Spec:
  Groups:
    Name:  alertmanager.rules
    Rules:
      Alert:  PodHighCpuLoad
      Annotations:
        Message:  Alertmanager has found {{ $labels.instance }} with CPU too high
      Expr:  rate (container_cpu_usage_seconds_total{pod_name=~"nginx-.*", image!="", container!="POD"}[5m]) > 0.04
      For:   1m
      Labels:
        Severity:  critical
Events:  <none>
Here are the options for the alert we created (see the manifest sketch after this list for the same rule in file form):
- alert: the name of the alert, PodHighCpuLoad in our case.
- expr: the PromQL expression that has to be true for the alert to trigger; here, the per-second CPU usage of the nginx Pods averaged over the last 5 minutes must exceed 0.04 (4% of a core).
- for: how long the expression has to stay true before the alert starts firing; until then, the alert is shown as pending. We use 1 minute.
- labels: extra labels attached to the alert, such as severity: critical, which Alertmanager can use for routing.
- annotations/message: a human-readable message that is sent along with the alert notification.
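If you prefer applying the rule from a file instead of editing it in place, the object above corresponds roughly to the following manifest (a sketch of the same rule, suitable for kubectl apply -f):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: demo-prometheus-operator-alertmanager.rules
  namespace: monitoring
  labels:
    app: prometheus-operator
    release: demo
spec:
  groups:
  - name: alertmanager.rules
    rules:
    - alert: PodHighCpuLoad
      expr: rate (container_cpu_usage_seconds_total{pod_name=~"nginx-.*", image!="", container!="POD"}[5m]) > 0.04
      for: 1m
      labels:
        severity: critical
      annotations:
        message: Alertmanager has found {{ $labels.instance }} with CPU too high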
Now that we’re finished with the Prometheus alert, let's configure Alertmanager so that, as soon as it receives our alert, it notifies us via email. Alertmanager's configuration lives in a Kubernetes Secret object.
$ kubectl get secrets -n monitoring
NAME                                                         TYPE                                  DATA   AGE
alertmanager-demo-prometheus-operator-alertmanager           Opaque                                1      32m
default-token-x4rgq                                          kubernetes.io/service-account-token   3      37m
demo-grafana                                                 Opaque                                3      32m
demo-grafana-test-token-p6qnk                                kubernetes.io/service-account-token   3      32m
demo-grafana-token-ff6nl                                     kubernetes.io/service-account-token   3      32m
demo-kube-state-metrics-token-vmvbr                          kubernetes.io/service-account-token   3      32m
demo-prometheus-node-exporter-token-wlnk9                    kubernetes.io/service-account-token   3      32m
demo-prometheus-operator-admission                           Opaque                                3      32m
demo-prometheus-operator-alertmanager-token-rrx4k            kubernetes.io/service-account-token   3      32m
demo-prometheus-operator-operator-token-q9744                kubernetes.io/service-account-token   3      32m
demo-prometheus-operator-prometheus-token-jvbrr              kubernetes.io/service-account-token   3      32m
prometheus-demo-prometheus-operator-prometheus               Opaque                                1      31m
prometheus-demo-prometheus-operator-prometheus-tls-assets    Opaque                                0      31m
We’re interested only in alertmanager-demo-prometheus-operator-alertmanager. Let’s look at it:
$ kubectl -n monitoring get secret alertmanager-demo-prometheus-operator-alertmanager -o yaml
apiVersion: v1
data:
  alertmanager.yaml: Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6ICJudWxsIgpyb3V0ZToKICBncm91cF9ieToKICAtIGpvYgogIGdyb3VwX2ludGVydmFsOiA1bQogIGdyb3VwX3dhaXQ6IDMwcwogIHJlY2VpdmVyOiAibnVsbCIKICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIHJvdXRlczoKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogIm51bGwiCg==
kind: Secret
metadata:
  creationTimestamp: "2020-03-11T18:06:24Z"
  labels:
    app: prometheus-operator-alertmanager
    chart: prometheus-operator-8.12.1
    heritage: Tiller
    release: demo
  name: alertmanager-demo-prometheus-operator-alertmanager
  namespace: monitoring
  resourceVersion: "3018"
  selfLink: /api/v1/namespaces/monitoring/secrets/alertmanager-demo-prometheus-operator-alertmanager
  uid: 6baf6883-f690-47a1-bb49-491935956c22
type: Opaque
The alertmanager.yaml field is base64-encoded. Let's see what's inside:
$ echo 'Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6ICJudWxsIgpyb3V0ZToKICBncm91cF9ieToKICAtIGpvYgogIGdyb3VwX2ludGVydmFsOiA1bQogIGdyb3VwX3dhaXQ6IDMwcwogIHJlY2VpdmVyOiAibnVsbCIKICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIHJvdXRlczoKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogIm51bGwiCg==' | base64 --decode
global:
  resolve_timeout: 5m
receivers:
- name: "null"
route:
  group_by:
  - job
  group_interval: 5m
  group_wait: 30s
  receiver: "null"
  repeat_interval: 12h
  routes:
  - match:
      alertname: Watchdog
    receiver: "null"
As we can see, this is the default Alertmanager configuration. You can also see it in the Status tab of the Alertmanager UI. Let's replace it with one that will actually do something – in our case, send emails:
$ cat alertmanager.yaml
global:
  resolve_timeout: 5m
route:
  group_by: [Alertname]
  # Send all notifications to me.
  receiver: demo-alert
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  routes:
  - match:
      alertname: DemoAlertName
    receiver: 'demo-alert'
receivers:
- name: demo-alert
  email_configs:
  - to: your_email@gmail.com
    from: from_email@gmail.com
    # Your smtp server address
    smarthost: smtp.gmail.com:587
    auth_username: from_email@gmail.com
    auth_identity: from_email@gmail.com
    auth_password: 16letter_generated token # you can use a gmail account password, but it is better to create a dedicated token for this
    headers:
      From: from_email@gmail.com
      Subject: 'Demo ALERT'
First, we need to encode this:
$ cat alertmanager.yaml | base64 -w0
Once we get the encoded output, we need to insert it into the yaml file below, which we will then apply:
$ cat alertmanager-secret-k8s.yaml
apiVersion: v1
data:
  alertmanager.yaml: <paste here the encoded content of alertmanager.yaml>
kind: Secret
metadata:
  name: alertmanager-demo-prometheus-operator-alertmanager
  namespace: monitoring
type: Opaque
$ kubectl apply -f alertmanager-secret-k8s.yaml
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
secret/alertmanager-demo-prometheus-operator-alertmanager configured
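As a side note, you can skip the manual base64 step entirely by letting kubectl build the Secret from the file; something along these lines should produce an equivalent result (the --dry-run syntax depends on your kubectl version):

$ kubectl -n monitoring create secret generic alertmanager-demo-prometheus-operator-alertmanager \
    --from-file=alertmanager.yaml --dry-run=client -o yaml | kubectl apply -f -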
The configuration is automatically reloaded and the changes are visible in the UI as well.
Let’s deploy something to monitor. A simple nginx deployment should be enough for this exercise:
$ cat nginx-deployment.yaml
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 3 # tells deployment to run 3 pods matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
$ kubectl apply -f nginx-deployment.yaml
deployment.apps/nginx-deployment created
We have three replicas, as per our configuration yaml:
$ kubectl get pods
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-5754944d6c-7g6gq   1/1     Running   0          67s
nginx-deployment-5754944d6c-lhvx8   1/1     Running   0          67s
nginx-deployment-5754944d6c-whhtr   1/1     Running   0          67s
In the Prometheus UI, using the same expression we configured for the alert:
rate (container_cpu_usage_seconds_total{pod_name=~"nginx-.*", image!="", container!="POD"}[5m])
we can check the data for these Pods. The value for all the Pods should be 0, since they are not serving any traffic yet.
Let’s put some load on one of the Pods to see the value change. When the value is greater than 0.04, we should get an alert:
$ kubectl exec -it nginx-deployment-5754944d6c-7g6gq -- /bin/sh
# yes > /dev/null
The alert has three phases:
- Inactive: the rule's condition is not met.
- Pending: the condition has been met, but for less time than the configured for interval (1 minute in our case).
- Firing: the condition has been met for longer than the for interval, and the alert is sent to Alertmanager.
We already saw the alert in the inactive state, so putting some load on the CPU will let us observe the other two as well:
As soon as the alert fires, it will show up in Alertmanager.
Alertmanager is configured to send emails when we receive alerts. If we check the inbox, we’ll see something like this:
We know how important monitoring is, but it would not be complete without alerting. Alerts can notify us as soon as a problem occurs, letting us know immediately when something goes wrong with our system. Prometheus covers both of these aspects: monitoring the solution and alerting on it via its Alertmanager component. We saw how alerts are defined in the Prometheus configuration and how they reach Alertmanager when fired. From there, based on how Alertmanager is configured and integrated, we received an email with the details of the triggered alert (notifications can also be sent via Slack or PagerDuty).
Join our Master Class: Monitoring and Alerting with Prometheus & Grafana on March 24th.