So what exactly is observability? Is it just a new-fangled term for 'monitoring'? Well, no. Observability goes further than mere monitoring. Observability involves the combination of 3 pillars – Metrics, Logs, and Tracing – to give a much more in-depth view of what your application is doing.
Observability offers proactive insights into how your application and/or infrastructure are likely to behave, whereas monitoring is only reactive in nature. Information from these 3 pillars (Metrics, Logs, and Tracing) comes together to provide us with the four Golden Signals of Observability.
The four Golden Signals of Observability are typically understood as:
- Latency
- Traffic
- Errors
- Saturation
You can read more about these principles in Google’s SRE handbook, and you can see them at work in our article on how to set up Kubernetes monitoring with Prometheus.
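To make the four signals a little more concrete, here is a minimal sketch of how they could be derived from one window of request data. The `Request` record, the function name, and the choice of the 95th percentile and in-flight requests as the latency and saturation measures are all assumptions made for illustration; they are not taken from the SRE handbook or from any particular tool.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    """One handled request (hypothetical record shape, for illustration only)."""
    duration_ms: float
    status: int  # HTTP status code


def golden_signals(requests: Sequence[Request], window_s: float,
                   in_flight: int, max_in_flight: int) -> dict:
    """Summarize one observation window into the four golden signals."""
    saturation = in_flight / max_in_flight
    if not requests:
        return {"latency_p95_ms": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": saturation}
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, using integer math to avoid float rounding.
    rank = max(1, (95 * len(durations) + 99) // 100)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": durations[rank - 1],        # Latency
        "traffic_rps": len(requests) / window_s,      # Traffic
        "error_rate": errors / len(requests),         # Errors
        "saturation": saturation,                     # Saturation
    }
```

In practice you would rarely compute these by hand, since your metrics pipeline does it for you, but the sketch shows that each signal is just a different summary of the same underlying data.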
We collect the Golden Signals for these key reasons:
- Alerting: inform us when something is wrong
- Troubleshooting: help us to isolate and fix the problem
- Tuning and Capacity Planning: to assist us in improving our setup over time
Observability also enables Whitebox or Clearbox application monitoring; that is, monitoring in which the design, code internals, and application structure are all known to us. Simple monitoring tools, on the other hand, only allow Blackbox monitoring and metrics collection. This is monitoring in which the innards and structure of the application are opaque or unknown to the tester (hence the name 'Blackbox' - you cannot see what's inside).
Let’s look at a simple example of the 3 pillars, how they come together to build the four golden signals, and how they differentiate observability from monitoring. Imagine a server's disk is getting full. A basic monitoring tool, such as Nagios, is limited to a Blackbox view of an application. It can only reactively inform you one day that a certain disk is 90% full. But it offers no useful meta-information: why did it get full, how suddenly (over what timeframe) did it get to 90%, and is this just a temporary spike in disk usage that may or may not decrease later?
An observability stack that includes a tool such as Grafana can take advantage of its Whitebox point of view to proactively gather metrics and display them on a dashboard that shows a steady week-over-week increase in disk usage, long before it gets to the alarming 90% level.
Next, access to application and system logs will pinpoint the specific application that's causing the increase we've seen over the last few weeks. And finally, fine-grained tracing will show that the application suddenly increased its 'disk write' activity exactly 7 weeks ago, right after it was upgraded, and that one of the changes made during that upgrade was to enable the 'verbose' flag for all of the application's logging.
Boom! The whole-of-system view afforded by your observability tool enables you to very quickly drill down and proactively know exactly why your disk usage has been increasing before it becomes a major problem.
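The "long before it gets to 90%" part of this example boils down to trend extrapolation. Below is a minimal sketch of that idea: it fits a straight line to weekly disk-usage samples and estimates how many weeks remain before the threshold is crossed. The function name and the straight-line (least-squares) model are assumptions for illustration; real dashboards and alert rules may use different forecasting methods.

```python
from typing import Optional, Sequence


def weeks_until_threshold(weekly_usage_pct: Sequence[float],
                          threshold_pct: float = 90.0) -> Optional[float]:
    """Fit a straight line to weekly samples and extrapolate to the threshold.

    Returns the number of weeks after the latest sample at which the fitted
    line crosses threshold_pct, 0.0 if the latest sample is already at or
    above it, or None if there is too little data or usage is not growing.
    """
    n = len(weekly_usage_pct)
    if n < 2:
        return None
    if weekly_usage_pct[-1] >= threshold_pct:
        return 0.0
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(weekly_usage_pct) / n
    # Ordinary least-squares slope and intercept.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, weekly_usage_pct))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_week = (threshold_pct - intercept) / slope
    return max(0.0, crossing_week - (n - 1))
```

For example, `weeks_until_threshold([40, 45, 50, 55, 60])` sees usage growing by 5 points a week and projects roughly 6 more weeks before the 90% mark, which is exactly the kind of early warning a purely reactive threshold check cannot give you.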
The best way to get started with Graphite and Grafana is to get on to the MetricFire free trial. You should get a demo booked, where we can give you free trial access.
Metrics Collection in Prometheus and Graphite
Let us zoom in a bit more on metrics, because 'metrics' is the first among the 3 pillars. Two of the most widely used open-source and cloud-native monitoring tools are Prometheus and Graphite. Of course, a big part of their raison d'être is metrics collection, but they go about it in very different ways.
First up, Prometheus. You can read all about Prometheus' workings here, but for now, we are just interested in how Prometheus collects and stores metrics. Metrics collection with Prometheus works on the pull model; this means that the Prometheus server actively extracts metrics (via scraping) from the services that it monitors.
This has the major benefit that the services and applications being monitored only need to expose their metrics and then they can leave the rest to Prometheus. The actual nodes being monitored can be lightweight or very thin in nature since they do not have to do any heavy lifting in terms of exporting or pushing out the metrics.
The availability of metrics on the client-side is typically achieved by exposing an HTTP endpoint (usually '/metrics') that returns the full list of metrics, their corresponding values, and all other relevant metadata such as labels. In terms of the load placed on system resources, this endpoint is relatively light to both maintain and call, as it simply exposes the current value of each metric, without the need for any calculation.
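As a rough illustration of what such an endpoint involves, here is a minimal sketch that uses only the Python standard library to serve metrics in the Prometheus text exposition format. The metric name, label, value, and port are made up for the example, and label values are not escaped; in a real service you would normally reach for an official Prometheus client library rather than hand-rolling this.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metrics: (name, help text, type, labels, current value).
EXAMPLE_METRICS = [
    ("disk_used_bytes", "Bytes used on the volume.", "gauge",
     {"mount": "/var"}, 1500),
]


def render_metrics(metrics) -> str:
    """Render metrics as Prometheus text exposition lines."""
    lines = []
    for name, help_text, metric_type, labels, value in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        sample = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{sample} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current metric values on GET /metrics and nothing else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(EXAMPLE_METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, serving /metrics on localhost (port is an assumption)."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` and then fetching `http://127.0.0.1:8000/metrics` would return the rendered text. Notice that the handler does no calculation at request time: it only reports whatever the current values are.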
The Prometheus server then scrapes each target at a regular predefined interval (the scrape interval). Each scrape action reads the /metrics endpoint to obtain the current state of the node's metrics, then stores these values in Prometheus' internal time-series database. Prometheus can also make this time-series data available via its own API endpoints to other consumers such as Grafana.
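The pull model can be sketched in a few lines. The code below is a toy illustration, not how Prometheus is implemented: `parse_samples`, `TimeSeriesStore`, and `scrape` are hypothetical names, the parser assumes simple `series value` lines with no trailing timestamps, and the "database" is just an in-memory dictionary.

```python
import time
from collections import defaultdict
from typing import Dict, List, Optional, Tuple
from urllib.request import urlopen


def parse_samples(text: str) -> Dict[str, float]:
    """Parse 'series value' lines, skipping comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space so label values containing spaces survive.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


class TimeSeriesStore:
    """Toy in-memory store: one list of (timestamp, value) points per series."""

    def __init__(self) -> None:
        self._series: Dict[str, List[Tuple[float, float]]] = defaultdict(list)

    def record_scrape(self, text: str, timestamp: Optional[float] = None) -> None:
        """Stamp every sample in one scrape with the scrape time and store it."""
        ts = time.time() if timestamp is None else timestamp
        for series, value in parse_samples(text).items():
            self._series[series].append((ts, value))

    def points(self, series: str) -> List[Tuple[float, float]]:
        return list(self._series.get(series, []))


def scrape(store: TimeSeriesStore, url: str, timeout_s: float = 5.0) -> None:
    """Pull the target's current metrics over HTTP and record them."""
    with urlopen(url, timeout=timeout_s) as response:
        store.record_scrape(response.read().decode("utf-8"))
```

Calling `scrape(store, url)` once per scrape interval gives you one new point per series each time, which is how a stream of "current values" turns into a time series.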
Graphite, on the other hand, is not a true collection agent, but it offers a much simpler path than Prometheus for gathering your metrics into a time-series database.
Graphite uses a passive model, in which the Carbon daemon passively listens for time-series data that must be pushed from the node/client being monitored. Data is stored in a simple backend database called Whisper, and graphs are rendered as needed using a basic Django web app.
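To show what "pushing" to Carbon looks like from the client side, here is a minimal sketch using Carbon's plaintext protocol, in which each data point is a single line of the form `metric.path value timestamp`. The metric path, host, and helper names are assumptions for illustration; port 2003 is Carbon's default plaintext listener, but your installation may be configured differently.

```python
import socket
import time
from typing import Iterable, Optional


def carbon_line(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Format one data point as 'path value timestamp' plus a newline."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_to_carbon(lines: Iterable[str], host: str = "localhost",
                   port: int = 2003, timeout_s: float = 5.0) -> None:
    """Open a TCP connection to Carbon and push the given lines."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        sock.sendall(payload)
```

A client would then call something like `send_to_carbon([carbon_line("servers.web01.disk.used_pct", 72.5)])` on its own schedule. Carbon never reaches out to the client; it only listens.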
However, Graphite can also be turned into an active collector of metrics using special integrations to enable a Prometheus-like pull mode. Need a collection agent or language bindings? Graphite has one of the largest ecosystems of data integrations and third-party tools. We are particularly interested in two of these tools, collectd and statsd, because they enable the active-mode/pull capability in Graphite:
- Collectd is an add-on daemon that actively collects system and application performance metrics at preset intervals, and also provides storage mechanisms to hold the values - the most common storage type is RRD files.
- StatsD began life as a standalone daemon but is now a collection of tools that together actively collect and aggregate custom metrics from almost any application or service.
- Yet a third tool is graphite-remote-adapter, a read-write adapter that plugs into Prometheus, receives data via Prometheus's own remote write protocol, and stores it in Graphite. One use case for this is that Prometheus can only store data in a single location, whereas Graphite can save data to multiple clusters.
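To give a feel for how an application hands custom metrics to StatsD, here is a minimal sketch of the StatsD line format, `name:value|type`, sent as a UDP datagram. The metric names and helper functions are hypothetical; port 8125 is the usual StatsD default, and `c`, `g`, and `ms` are the common counter, gauge, and timer types. The StatsD daemon then aggregates what it receives and forwards the results to Graphite.

```python
import socket


def statsd_packet(name: str, value, metric_type: str) -> str:
    """Format one StatsD sample, e.g. 'app.logins:1|c' for a counter."""
    return f"{name}:{value}|{metric_type}"


def send_statsd(packet: str, host: str = "localhost", port: int = 8125) -> None:
    """Fire-and-forget a single StatsD packet over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet.encode("ascii"), (host, port))
```

For instance, `send_statsd(statsd_packet("app.request_time", 320, "ms"))` records a 320 ms timing. Because UDP is connectionless, the application does not wait on the metrics pipeline, which keeps the instrumentation overhead low.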
So Prometheus and Graphite are both useful monitoring/observability utilities. The main difference is that Prometheus works on an active or pull model, and Graphite uses a passive or push model. Read more about Prometheus vs. Graphite here.
Metrics Visualization Using Grafana
Now that we've seen how we can collect metrics using either Prometheus or Graphite, let's talk about the importance of visualizing these metrics via Grafana.
If you aren't visualizing your metrics, or even worse, if you are collecting but not utilizing them, then you are not taking full advantage of your observability platform. You are, in effect, doing little more than basic monitoring.
It is important to utilize and collate all your signal data in the form of metrics, logs and tracing, to provide a holistic view of what's going on in your systems and applications. Using a great visualization tool, such as Grafana, with its rich layout and presentation options like panels and dashboards, enables you to turn your raw mass of data points into actionable, proactive intelligence for your engineers; see again the 'disk full' example we outlined earlier.
Grafana's slogan is "The Open Observability Platform," and this is not just a slogan; it's the philosophy behind the way this tool works. This open-source visualization tool allows you to see and analyze all of your metrics in one unified dashboard. Grafana can pull metrics from nearly any source, such as Graphite and Prometheus, display that data, then enable you to annotate and understand the data directly in the dashboard. Grafana dashboards are designed to allow you to visualize information in a ton of ways, from histograms and heatmaps to world maps.
Along with metrics coming from data sources like Prometheus and Graphite, Grafana can integrate with many other tools and datasources to visualize logs. And since Grafana 3.0 you can install data sources as plugins. Check out the plugins page for more data sources.
Grafana can also act as an alerting tool. You can define alerts and their triggers, then set up notification channels such as email, text message (SMS), or even apps such as Slack. Or you can create alerts in Grafana but outsource the entire notification part to a specialized tool like PagerDuty.
Note that Prometheus also has a built-in alerting component called Alertmanager; you can weigh the options and decide whether to set up your alerting in Prometheus or in Grafana.
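Whichever tool you choose, an alert trigger usually comes down to a condition that must hold for some length of time. The sketch below illustrates that general idea only; it is not Grafana's or Alertmanager's actual evaluation logic, and the function name and parameters are assumptions for illustration.

```python
from typing import Sequence, Tuple


def should_fire(samples: Sequence[Tuple[float, float]],
                threshold: float, for_seconds: float) -> bool:
    """Return True if the value has stayed above threshold for at least
    for_seconds, measured up to the most recent sample.

    samples is a sequence of (timestamp_seconds, value) pairs, assumed to be
    sorted by timestamp.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # start of the current unbroken breach
        else:
            breach_start = None    # any recovery resets the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds
```

Requiring the condition to hold for a period, rather than firing on a single sample, is what keeps a temporary spike (like the one in the 'disk full' example) from paging someone unnecessarily.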
Finally, the Grafana team has recently added support for the 3rd pillar by implementing tracing. So, in addition to metrics and logging data side-by-side on a single screen, Grafana will also add traces, by enabling interoperability with two data sources to begin with: Zipkin and Jaeger.
Conclusion
Hopefully, you now have a much better grasp of the importance and usefulness of migrating from mere monitoring to observability; it is akin to the old data vs. information comparison on the business side of things. And you also understand how the three tools Prometheus, Graphite, and Grafana (all of which are part of MetricFire's domain of expertise) can help you migrate from monitoring to observability for your data center applications.
The great thing is that you can use these tools directly in our platform and monitor metrics without any setup. You can also talk to us directly by booking a demo - we're always happy to talk with you about your company's monitoring needs. You can try a free trial and test Grafana as a service and/or Hosted Graphite, and apply what you've learned from this article.