Introduction
In part I of this blog series, we understood that monitoring a Kubernetes cluster is a challenge we can overcome if we use the right tools. We also understood that the default Kubernetes dashboard allows us to monitor the different resources running inside our cluster, but it is very basic. We suggested tools and platforms like cAdvisor, Kube-state-metrics, Prometheus, Grafana, Kubewatch, Jaeger, and MetricFire.
In this blog post, we are going to look at the Four Golden Signals of building an observable system, and then see how Prometheus can help us apply them.
To get started, sign up for the MetricFire free trial. You can try out our Prometheus alternative with almost no setup.
Key Takeaways
- Monitoring Kubernetes clusters is challenging, but using tools like Prometheus can help.
- Google's Four Golden Signals (Latency, Traffic, Errors, Saturation) are crucial for building observable systems.
- Prometheus, an open-source tool, aids in monitoring and alerting Kubernetes clusters.
- Utilize advanced features like AlertManager and integrate with Grafana dashboards for better insights.
What’s Broken, and Why?
A big part of a DevOps team's job is to empower development teams to take part in operational responsibility. DevOps is based on cooperation between the various IT players around good practices to design, develop, and deploy applications more quickly, less expensively, and with higher quality. It aligns the development and operations teams around the famous principle given to us by Werner Vogels, CTO of Amazon: "You build it, you run it". Therefore, the people who make up the team are one of DevOps' main assets.
Taking responsibility for running an application also requires the DevOps team to get involved in other subtasks, like monitoring it. This is when choosing the right metrics to watch in production is critical. What you monitor and the data you see will impact your DevOps approach.
Also, involving the team in traditional monitoring tasks is not enough. Monitoring tells you what is happening in your production infrastructure. For example, you can determine whether there is a high volume of activity on a server or a pool of servers. With observability (or white-box monitoring), however, you can also understand why it is happening and detect a problem before it turns into an incident.
"Your monitoring system should address two questions: what’s broken, and why? The “what’s broken” indicates the symptom; the “why” indicates a (possibly intermediate) cause. “What” versus “why” is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise." ~ Google SRE Book.
There are no ready-to-use methodologies for choosing the right metrics; everything depends on your team's technical and business needs. However, the following approaches may inspire you:
- The SRE Book from Google
- USE Method from Brendan Gregg
- RED Method from Tom Wilkie
We will try to understand some of the most essential and common metrics to watch in a Kubernetes-based production system, based on Google's Four Golden Signals.
Monitoring Distributed Systems: The Four Golden Signals
In chapter 6, "Monitoring Distributed Systems," of the famous Google SRE book, Google defines the four main signals to observe constantly. These are called the four golden signals: latency, traffic, errors, and saturation.
These signals are extremely important, as they are essential to ensuring high application availability. Let's briefly examine what each one means.
Latency
Latency is the time required to send a request and receive a response. It is usually measured on the server side, but it can also be measured on the client side to account for differences in network speed. The operations team has the most control over server-side latency, but client-side latency is what end users actually experience.
The target threshold you choose may vary depending on the type of application. You also need to track the latency of successful and failed requests separately, because failed requests often fail quickly, without further processing.
Traffic
Traffic is a measure of the number of requests passing through the network. These can be HTTP requests sent to your web or API server or messages sent to a processing queue. Peak traffic periods can stress your infrastructure and drive it to its limits, which can have downstream consequences. That is why traffic is a key signal: it helps you distinguish between two root causes that produce the same symptoms, namely capacity problems and inappropriate system configuration, since configuration issues can cause problems even in times of low traffic.
For distributed systems, particularly Kubernetes, this will help you plan capacity to meet future demand.
Errors
Errors can tell you about a bug in your code, an unresolved dependency, or configuration errors in your infrastructure. Take the example of a database failure that generates a spike in the error rate and compare it with a network error that induces a similar spike: you can't tell the two apart by looking at the error rate alone.
Following a change in your Kubernetes deployment, the errors may indicate bugs in the code that were not detected during testing or only appeared in your production system.
Therefore, the error message provides a more accurate report of the problem. Errors can also distort other metrics, for example by artificially lowering latency, and they can trigger repeated retries that overwhelm your Kubernetes clusters.
Saturation
Saturation is defined as the load on your server's resources, such as the network and CPU. Each resource has a limit beyond which performance degrades or becomes unavailable.
Saturation applies to resources such as disk I/O (read/write operations per second), disk capacity, CPU usage, and memory usage. The design of your Kubernetes cluster needs to account for which parts of the service are likely to become saturated first.
Often, the metrics used are leading indicators that allow you to adjust capacity before performance degrades. For example, network saturation can cause dropped packets, a saturated CPU can cause delayed responses, and full disks can cause write failures and data loss.
Prometheus and the Four Golden Signals
Prometheus is an open-source tool for monitoring and alerting. It was developed at SoundCloud and later donated to the CNCF. It integrates natively with some applications and indirectly with others through metrics exporters. Using the Prometheus Operator, installing and managing Prometheus on top of Kubernetes becomes easier than expected; it is an easy way to run Prometheus, Alertmanager, and Grafana inside a Kubernetes cluster. So, what are the Prometheus metrics to watch to implement the Four Golden Signals?
Prometheus collects and stores plenty of metrics, but we will look at just a few of them as a demonstration; the following list is therefore not exhaustive.
First, "http_requests_total" counts the number of HTTP responses issued and classifies them by code and method. It can be used to observe traffic.
Example:
sum(rate(http_requests_total[1m]))
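Since the metric is labeled by response code and method, you can also break the same traffic down by code to see at a glance how it is distributed:
sum by (code) (rate(http_requests_total[1m]))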
Other metrics, such as "node_network_transmit_bytes" or "node_network_receive_bytes," can be used to monitor traffic. Choosing the right metric depends on what you need to measure and the use case. Do you need to monitor HTTP requests, TCP requests, or transmitted and received bytes?
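As a minimal sketch, assuming a recent node exporter (which exposes these counters with a "_total" suffix), you could chart the bytes received per second on each node like this:
sum by (instance) (rate(node_network_receive_bytes_total[5m]))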
Latency, which is another golden signal, can also be observed using metrics like "http_request_duration_seconds". Using PromQL, we can, for instance, get the percentage of requests that complete within 400 ms:
sum(rate(http_request_duration_seconds_bucket{le="0.4"}[1m])) / sum(rate(http_request_duration_seconds_count[1m]))
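If "http_request_duration_seconds" is exposed as a Prometheus histogram, as assumed above, you can also estimate percentile latency with the histogram_quantile function. A minimal sketch for the 95th percentile over the last five minutes:
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))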
Error percentages can be measured in almost the same way, using metrics like "http_status_500_total" and "http_responses_total":
rate(http_status_500_total[1m]) / rate(http_requests_total[1m])
or
sum(rate(http_responses_total{code="500"}[1m])) / sum(rate(http_responses_total[1m]))
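Assuming the same "http_responses_total" counter with a "code" label, you can widen the numerator to cover every 5xx status with a regex matcher instead of a single code:
sum(rate(http_responses_total{code=~"5.."}[1m])) / sum(rate(http_responses_total[1m]))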
You should usually refer to system metrics like memory, disk, or CPU to measure saturation. These metrics are collected directly from the Kubernetes nodes and don't rely on application instrumentation. For instance, to monitor CPU saturation, you can use the node exporter's CPU counter, "node_cpu_seconds_total", combined with the rate function; averaging the idle mode across cores and subtracting from 1 gives the CPU usage per node:
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))
If you need to apply the same idea to other resources, like disk I/O, you can take the rate of the node exporter's "node_disk_io_time_seconds_total" counter, which gives the fraction of time the disk was busy:
rate(node_disk_io_time_seconds_total[1m])
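Memory saturation can be watched in a similar way. As a minimal sketch, assuming the standard node exporter memory metrics, the fraction of memory in use on each node is:
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)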
Conclusion
Choosing the right metric comes easily when you understand the design of your Kubernetes cluster and the nature of the services it runs. The four golden signals help design an observable system. However, to use the full power of Prometheus, you need to extend it from basic metrics to other advanced features like the AlertManager and then integrate data and alerts into a Grafana dashboard. To learn more about how to use AlertManager, check out our article on the top 5 AlertManager gotchas. Also, sign up for the MetricFire free trial and experiment with querying and alerting on your Hosted Graphite metrics today.