Introduction
Prometheus and its querying language, PromQL, have several functions that perform various calculations on their data. One of the most widely used functions is rate(); however, it is also one of the most misunderstood.
Having a monitoring stack in your company, such as the one that Metricfire provides, gives you the essential functionality that you need, and one of these crucial functions is predicting trends. That is where rate() comes into play. As the name suggests, it lets you calculate the per-second average rate of how a value increases over time. It is the function you can use if you want, for instance, to calculate how the number of requests coming into your server changes over time or the CPU usage of your servers. But first, let's talk about its internals. We need to understand how it works under the hood to build up our knowledge from there.
If you're interested in trying a Prometheus alternative, you can sign up now for our Hosted Graphite free trial - or sign up for a demo.
Key Takeaways
- The rate() function in PromQL is essential for calculating a metric's per-second average rate of change over time. It's commonly used for monitoring trends, such as server request rates and CPU usage.
- PromQL uses two types of arguments - range and instant vectors. Range vectors have a time dimension, while instant vectors represent the most recent data point. rate() and similar functions require range arguments for trend analysis.
- The choice of time range for range vectors is crucial. It should be at least two times the scrape interval, but the optimal range depends on the specific use case and whether detailed data or broader trends are needed.
- You can break rate() down by specific dimensions by aggregating its result by label, which is useful for monitoring error rates across different backends.
What are Prometheus Functions?
Before we discuss Prometheus functions too deeply, we first need to discuss some basic concepts and terminology.
PromQL is a custom query language for the Prometheus project used to filter and search through Prometheus' time series data. When querying in PromQL, every metric has four components:
- The metric name
- Labels, i.e. key-value pairs to distinguish metrics with the same name
- The metric value, a 64-bit floating point number
- A timestamp with millisecond-level precision
The Prometheus expression language has four different data types:
- A scalar represents a floating point value.
- A string is a simple string value; it is currently unused.
- An instant vector is a set of time series with a single sample for each time series, all sharing the same timestamp.
- A range vector is a set of time series with a range of data points over time for each time series.
In addition, there are four different metric types in the Prometheus client libraries:
- Counter: Useful for increasing values; the counter resets to zero on restart.
- Gauge: Useful for counts that go up and down or for rising and falling values.
- Histogram: Useful for sampling observations (such as response sizes), counting them in configurable buckets, and providing a sum of the observed values.
- Summary: Similar to a histogram, this metric type records a total count of observations and a sum of observed values. It also computes configurable quantiles over a sliding time window.
A Prometheus query using the PromQL query language can return a scalar, an instant vector, or a range vector, depending on the expression you write. Now that we've got all that out of the way, let's return to the original question: what are Prometheus functions?
Simply put, Prometheus functions are functions in the PromQL language that can be used to query a Prometheus database. In the next section, we'll review a few of the most common use cases of Prometheus functions.
If you're interested in using Prometheus but think its setup and management would be too resource-consuming, book a demo and talk to us about how Hosted Graphite can fit into your monitoring environment. You can also get a free trial and check it out now.
How The Prometheus rate() function Works
Types of Arguments
There are two types of arguments in PromQL: range and instant vectors. The easiest way to see the difference is to picture the same series in both forms.
Imagine a matrix of three range vectors, each encompassing one minute of data scraped every 10 seconds. Each one is a data set identified by a unique set of label pairs. Range vectors also have a time dimension (in this case, one minute), whereas instant vectors do not. The same three series as instant vectors would each hold a single value.
Instant vectors contain only the most recently scraped value of each series. rate() and its cousins take an argument of the range type because calculating any change requires at least two data points. They do not return any results if fewer than two samples are available. In PromQL, you write a range vector by appending a time range in square brackets to a selector, which says how far into the past it should reach, for example http_requests_total[5m].
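The difference between the two argument types can be sketched in a few lines of Python. This is only an illustration: the sample data, the function names, and the exact handling of the window boundary are assumptions of the sketch, not Prometheus internals, and staleness handling is ignored.

```python
# Hypothetical series: (timestamp_seconds, value) pairs, scraped every 10 s.
samples = [(0, 1.0), (10, 3.0), (20, 4.0), (30, 7.0), (40, 9.0), (50, 12.0), (60, 15.0)]


def instant_value(samples, eval_time):
    """Return the most recent sample at or before eval_time (instant-vector style)."""
    candidates = [s for s in samples if s[0] <= eval_time]
    return candidates[-1] if candidates else None


def range_values(samples, eval_time, window):
    """Return every sample in (eval_time - window, eval_time] (range-vector style)."""
    return [s for s in samples if eval_time - window < s[0] <= eval_time]


latest = instant_value(samples, eval_time=60)
last_30s = range_values(samples, eval_time=60, window=30)
too_short = range_values(samples, eval_time=5, window=10)
```

Here `latest` is a single sample, `last_30s` holds three samples, and `too_short` holds only one, which is the situation in which rate() has nothing to compute.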
Choosing the time range for range vectors
What time range should we choose? There is no silver bullet here. At the very minimum, it should be two times the scrape interval, so that the window holds at least two samples. However, with such a short range the result will be very “sharp”: every change in the value is reflected in the result faster than with any longer range, and the result drops back to 0 just as swiftly. Increasing the time range achieves the opposite: the resulting line (if you plotted the results) becomes “smoother”, and it is harder to spot the spikes. Thus, the recommendation is to put the time range into a dashboard variable in Grafana (let’s say with the values 1m, 5m, 15m, 2h), so that you can choose whichever value fits your case best when you are trying to spot something, such as a spike or a trend.
You could also use the Grafana variable $__interval, which Grafana derives from the dashboard time range and the panel width so that it roughly matches the query step. It could seem like the perfect solution, as all of the data points between each step would be considered, but it has the same problem as mentioned previously: it is impossible to see both very detailed graphs and broad trends at the same time. Also, your time range becomes tied to your query step, so if you zoom in far enough or your scrape interval ever changes, the range may become too short to contain the two samples that rate() needs.
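The trade-off between a short and a long range can be illustrated with a small Python sketch. The counter data and the `simple_rate` helper are made up for this example; the helper divides the increase by the time actually covered by samples and ignores resets and extrapolation.

```python
def simple_rate(samples, eval_time, window):
    """Per-second increase between the first and last sample inside the window."""
    in_window = [(t, v) for t, v in samples if eval_time - window <= t <= eval_time]
    if len(in_window) < 2:
        return None  # not enough data, like rate() returning no result
    (first_t, first_v), (last_t, last_v) = in_window[0], in_window[-1]
    return (last_v - first_v) / (last_t - first_t)


# Hypothetical counter scraped every 10 s: flat, a burst of 50 at t=300, flat again.
samples = [(t, 0.0 if t < 300 else 50.0) for t in range(0, 601, 10)]

short_at_spike = simple_rate(samples, eval_time=300, window=20)
long_at_spike = simple_rate(samples, eval_time=300, window=300)
short_later = simple_rate(samples, eval_time=400, window=20)
long_later = simple_rate(samples, eval_time=400, window=300)
```

With the 20-second window the burst shows up as a tall value that is gone again 100 seconds later, while the 300-second window reports a much lower value that is still visible at that point.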
Something to remember - MetricFire is also a Hosted Grafana service. Explore our free trial here - or sign up for a demo.
Calculation
Just like everything else, the function gets evaluated at each step. But how does it work?
It roughly calculates the following:
rate(x[35s]) = difference in value over 35 seconds / 35s
The nice thing about the rate() function is that it looks at all data points in the range, not just the first and last ones, so it can detect counter resets anywhere in the window. Another function, irate(), uses only the last two data points.
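As a rough comparison, the two calculations could be sketched in Python like this. Both helpers are hypothetical simplifications: they assume a counter without resets and ignore extrapolation, and the sample values are invented.

```python
def rate_sketch(samples):
    """Average per-second change between the first and last sample in the window."""
    if len(samples) < 2:
        return None
    (first_t, first_v), (last_t, last_v) = samples[0], samples[-1]
    return (last_v - first_v) / (last_t - first_t)


def irate_sketch(samples):
    """Per-second change between the last two samples only."""
    if len(samples) < 2:
        return None
    (prev_t, prev_v), (last_t, last_v) = samples[-2], samples[-1]
    return (last_v - prev_v) / (last_t - prev_t)


# Hypothetical counter: steady growth, then a jump in the final scrape.
window = [(0, 0.0), (10, 10.0), (20, 20.0), (30, 30.0), (40, 80.0)]

average_rate = rate_sketch(window)   # 80 / 40 = 2.0
instant_rate = irate_sketch(window)  # 50 / 10 = 5.0
```

The average over the whole window dampens the final jump, while the last-two-samples calculation reacts to it fully, which is why the latter is better suited to fast-moving graphs than to alerting.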
You might now ask: why not delta()? Well, the rate() function we have just described has one excellent characteristic: it automatically adjusts for counter resets. This means it is only suitable for constantly increasing metrics, a.k.a. the metric type called a “counter”. It’s not suitable for a “gauge”. A keen reader will have noticed that this adjustment works around a practical limitation: a counter cannot go up indefinitely, because it starts again from zero whenever the process exposing it restarts. The reset logic prevents us from losing the increase accumulated before the reset, so using rate() is a good idea when you need this feature.
Note: Because of this automatic adjustment for resets, if you want to combine rate() with an aggregation such as sum(), you must apply rate() first and aggregate afterward; otherwise, the counter resets will not be caught, and you will get weird results.
Either way, PromQL currently will not prevent you from using rate() with a gauge, so it is essential to know which type of metric you are passing to this function. Using rate() with gauges is incorrect because the reset detection logic will mistake every decrease in the value for a “counter reset”, and you will get wrong results.
All in all, let’s say you have a counter metric that is changing like this:
- 0
- 4
- 6
- 10
- 2
The reset between “10” and “2” would be caught by irate() and rate(), and it would be treated as if the value after the reset were “12”, i.e. as if the counter had increased by “2” (from zero). Let’s say that we were trying to calculate the rate with rate() over 60 seconds, and we got these five samples on ideal timestamps. So, the resulting average rate of increase per second would be:
(12 - 0) / 60 = 0.2. Because everything is perfectly ideal in our situation, the inverse calculation also holds: 0.2 * 60 = 12. However, the inverse calculation does not always hold when the samples do not cover the full range or when they do not line up perfectly due to random delays introduced between scrapes. Let me explain this in more detail in the following section.
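The arithmetic above can be reproduced with a short Python sketch. It is a simplified model of the reset adjustment, not the Prometheus implementation: the helper name is made up, and any drop in the value is assumed to be a restart from zero.

```python
def increase_with_resets(values):
    """Total increase of a counter, treating any drop as a reset to zero."""
    total = 0.0
    previous = values[0]
    for current in values[1:]:
        if current < previous:
            # Reset detected: the counter restarted from zero, so the whole
            # current value counts as new increase.
            total += current
        else:
            total += current - previous
        previous = current
    return total


counter_values = [0, 4, 6, 10, 2]
window_seconds = 60

increase = increase_with_resets(counter_values)  # 4 + 2 + 4 + 2 = 12
rate = increase / window_seconds                 # 12 / 60 = 0.2

# The same logic misreads a gauge: 10 -> 8 -> 9 is a net change of -1,
# but the drop is taken for a reset and the result becomes 8 + 1 = 9.
gauge_misread = increase_with_resets([10, 8, 9])
```

The last line shows why passing a gauge to this kind of logic gives misleading numbers: every decrease is counted as a fresh start from zero.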
Extrapolation: what rate() does when missing information
Lastly, it’s essential to understand that rate() performs extrapolation. Knowing this will save you from headaches in the long term. When rate() is evaluated at a given point, some data might be missing because some scrapes failed. Moreover, because scrape times carry some added randomness, the samples might not align perfectly with the boundaries of the range, even if the range is an exact multiple of the scrape interval.
In such a case, rate() calculates the rate from the data it has and then, if the samples do not reach the edges of the selected window, extrapolates toward the beginning or the end of the window based on those samples. This means that you might get non-integer results even if all of the data points are integers, so this function is best suited for spotting trends and spikes and for alerting if something happens, not for exact counting.
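To make the effect concrete, here is a simplified Python sketch of this kind of extrapolation. It is loosely modeled on the behavior described above and is not the actual Prometheus code: the function name, the 1.1 threshold factor, and the half-gap fallback are assumptions of the sketch, the real logic may differ between versions, and counter resets are left out.

```python
def extrapolated_increase(samples, range_start, range_end):
    """Estimate the increase over the whole window from samples inside it.

    Assumes a counter with no resets; samples are (timestamp, value) pairs.
    """
    if len(samples) < 2:
        return None
    first_t, first_v = samples[0]
    last_t, last_v = samples[-1]
    raw_increase = last_v - first_v
    sampled_interval = last_t - first_t
    avg_gap = sampled_interval / (len(samples) - 1)

    gap_to_start = first_t - range_start
    gap_to_end = range_end - last_t

    # A counter cannot be negative, so do not extrapolate back past the
    # point where it would have been zero.
    if raw_increase > 0 and first_v >= 0:
        gap_to_start = min(gap_to_start, sampled_interval * (first_v / raw_increase))

    # Extend to a window edge only if it is close to the nearest sample;
    # otherwise extend by half an average gap between samples.
    threshold = avg_gap * 1.1
    covered = sampled_interval
    covered += gap_to_start if gap_to_start < threshold else avg_gap / 2
    covered += gap_to_end if gap_to_end < threshold else avg_gap / 2

    return raw_increase * (covered / sampled_interval)


# Hypothetical samples that miss both edges of a 60-second window.
samples = [(5, 10.0), (20, 13.0), (35, 16.0), (50, 20.0)]
increase = extrapolated_increase(samples, range_start=0, range_end=60)
rate = increase / 60
```

The raw samples grew by exactly 10, yet the extrapolated increase over the full window comes out as a non-integer value above 13, which is the kind of result that surprises people who expect exact counts.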
Aggregation
Optionally, you can break the result of rate() down by specific dimensions by wrapping it in an aggregation operator. For example, sum by (bar) (rate(foo[5m])) calculates the rate of change of foo for every value of the label bar. This can be useful if you have, for example, haproxy running and you want to calculate the rate of connection errors per backend, so you can write something like sum by (backend) (rate(haproxy_connection_errors_total[5m])).
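Conceptually, grouping by a label means adding up the per-series rates that share the same label value. A minimal Python sketch, with made-up label sets and rate values, might look like this:

```python
from collections import defaultdict


def sum_by(per_series_rates, label):
    """Sum per-series values that share the same value of the given label."""
    totals = defaultdict(float)
    for labels, value in per_series_rates:
        totals[labels[label]] += value
    return dict(totals)


# Hypothetical per-series rates, as if rate() had already been applied.
per_series_rates = [
    ({"backend": "api", "instance": "a"}, 0.25),
    ({"backend": "api", "instance": "b"}, 0.5),
    ({"backend": "web", "instance": "a"}, 0.125),
]

errors_per_backend = sum_by(per_series_rates, "backend")
```

Note that the rates are computed per series first and summed afterward, in line with the rule that rate() has to come before any aggregation.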
Examples
Alerting Rules
As described previously, rate() works perfectly when you want to get an alert when the number of errors jumps. So, you could write an alert like this:
groups:
  - name: Errors
    rules:
      - alert: ErrorsCountIncreased
        expr: sum by (backend) (rate(haproxy_connection_errors_total[5m])) > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High connection error count in {{ $labels.backend }}
This would inform you if any of the backends have increased connection errors. As you can see, rate() is perfect for this use case. Feel free to implement similar alerts for your services that you monitor with MetricFire. Interested in seeing what we can do for you? Try our free trial or sign up for a demo.
SLO Calculation
Another everyday use case for the rate() function is calculating SLIs and checking that you do not violate your SLO/SLA. Google has recently released a popular book for site-reliability engineers, which shows how to calculate the availability of a service this way.
They calculate the rate of all requests that did not return a 5xx status and then divide it by the rate of all requests. If there are any 5xx responses, the value is less than one. You can, again, use this formula in your alerting rules with some specified threshold, so that you get an alert when it is violated, or you could predict the near future with predict_linear() and avoid any SLA/SLO problems.
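The ratio itself is simple arithmetic. The following Python sketch computes it from per-status request rates; the numbers, the function name, and the 99% target are hypothetical and only serve to show the calculation.

```python
def availability(rates_by_status):
    """Share of requests that did not end in a 5xx status."""
    total = sum(rates_by_status.values())
    if total == 0:
        return None  # no traffic, so availability is undefined
    good = sum(r for status, r in rates_by_status.items() if not status.startswith("5"))
    return good / total


# Hypothetical per-second request rates, keyed by HTTP status code.
request_rates = {"200": 95.0, "404": 3.0, "500": 1.5, "503": 0.5}

sli = availability(request_rates)  # 98 / 100
slo_target = 0.99                  # assumed target for this example
slo_met = sli >= slo_target
```

In this example the SLI falls just short of the assumed target, which is exactly the condition an alerting rule with a threshold would fire on.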
What are other use cases for Prometheus functions?
There are as many Prometheus function use cases as there are functions—and even more. Below, we'll demonstrate just a few uses of Prometheus functions.
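Aggregation over time
The _over_time() family of functions calculates a mathematical value for each series over a time range. (These are distinct from aggregation operators such as sum(), which combine multiple series into one.) You can use Prometheus functions such as the ones below to aggregate over a given range vector: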
- avg_over_time() for the average (mean) value
- max_over_time() for the maximum value
- count_over_time() for the total count of all values
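For a single series, these three functions boil down to ordinary aggregations over the values inside the window. A small Python sketch with invented sample values:

```python
def avg_over_time(values):
    """Mean of all values in the window."""
    return sum(values) / len(values)


def max_over_time(values):
    """Largest value in the window."""
    return max(values)


def count_over_time(values):
    """Number of samples in the window."""
    return len(values)


# Hypothetical values of one series inside the selected range.
window_values = [2.0, 4.0, 9.0]

average = avg_over_time(window_values)
maximum = max_over_time(window_values)
count = count_over_time(window_values)
```

Each function returns one number per series, so applying it to a range vector yields an instant vector.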
Counting HTTP requests
As another example, you can use the increase() Prometheus function to count the number of HTTP requests over the past 5 minutes, e.g.:
increase(http_requests_total{job="api-server"}[5m])
Linear regression
Given a range vector and a scalar t, the predict_linear() Prometheus function uses simple linear regression to predict the value of each time series t seconds from now. It is intended for gauges.
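The idea behind this prediction can be sketched with an ordinary least-squares fit in Python. The helper name and the disk-space numbers are invented for illustration, and the sketch makes no claim about how Prometheus implements the regression internally.

```python
def predict_linear_sketch(samples, eval_time, t_ahead):
    """Fit a straight line to (timestamp, value) pairs and evaluate it
    t_ahead seconds after eval_time."""
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        return None  # all samples share one timestamp; no slope to fit
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (eval_time + t_ahead)


# Hypothetical gauge: free disk space in GB, shrinking by 6 GB per minute.
free_space = [(0, 100.0), (60, 94.0), (120, 88.0), (180, 82.0)]

in_one_hour = predict_linear_sketch(free_space, eval_time=180, t_ahead=3600)
```

A predicted value below zero, as in this example, means the disk would be full well before the hour is over, which is a typical condition to alert on.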
Regular expressions
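For each time series in the input vector, the label_replace() Prometheus function matches the given regex (regular expression) against the value of a source label and, if it matches, writes the replacement, which may reference capture groups such as $1, to a destination label. Series that do not match are returned unchanged.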
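A rough Python equivalent for a single label set is shown below. It is a sketch under stated assumptions: the function name and argument order are illustrative, the replacement uses Python's `\1` syntax instead of PromQL's `$1`, and `re.fullmatch` stands in for PromQL's anchored regex matching.

```python
import re


def label_replace_sketch(labels, dst_label, replacement, src_label, regex):
    """Return a copy of labels with dst_label set when regex fully matches src_label."""
    match = re.fullmatch(regex, labels.get(src_label, ""))
    if match is None:
        return dict(labels)  # no match: the series is returned unchanged
    new_labels = dict(labels)
    new_labels[dst_label] = match.expand(replacement)
    return new_labels


# Hypothetical series labels: extract the host part of the instance label.
series = {"job": "node", "instance": "db1.example.com:9100"}

with_host = label_replace_sketch(series, "host", r"\1", "instance", r"([^:]+):(\d+)")
unchanged = label_replace_sketch(series, "host", r"\1", "instance", r"nomatch")
```

The first call adds a `host` label holding the part of `instance` before the colon; the second leaves the label set as it was because the regex does not match.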
Calculating percentiles
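Given a quantile φ between 0 and 1 and the buckets of a histogram, the histogram_quantile() Prometheus function estimates the φ-quantile of the observed values, for example the 95th percentile of request durations with φ = 0.95. The highest bucket in the histogram must have an upper bound of +Inf (positive infinity).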
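One common way to estimate a quantile from cumulative buckets is linear interpolation inside the bucket that contains the target rank. The Python sketch below follows that idea; it is an approximation written for this article, with a hypothetical function name and bucket data, and it assumes the lowest bucket starts at zero.

```python
import math


def histogram_quantile_sketch(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return None  # no observations
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls into the +Inf bucket: report the
                # highest finite upper bound instead.
                return buckets[-2][0]
            # Interpolate linearly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count


# Hypothetical request-duration buckets in seconds (cumulative counts).
duration_buckets = [(0.1, 10.0), (0.5, 60.0), (1.0, 90.0), (math.inf, 100.0)]

median = histogram_quantile_sketch(0.5, duration_buckets)
p95 = histogram_quantile_sketch(0.95, duration_buckets)
```

Because the estimate depends on where the bucket boundaries sit, the result is only as precise as the bucket layout allows; here the 95th percentile lands in the +Inf bucket, so the sketch can only report the highest finite bound.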
Calculating differences
Given a range vector, the delta() Prometheus function calculates the difference between the first and last value of each time series in the range. It is meant for gauges. For example, to calculate the difference in CPU temperature between now and 2 hours ago:
delta(cpu_temp_celsius{host="zeus"}[2h])
To enhance your use of Prometheus functions, you can also integrate Prometheus with Grafana, a web application for data analytics and visualization. MetricFire's hosted Grafana service makes it easy for any Prometheus user to enjoy high-quality, informative data visualizations. When you choose MetricFire, you also get access to our Hosted Graphite data source and platform and 24/7 support for all your Graphite and Grafana needs. Sign up today!
How MetricFire Can Help
MetricFire is a cloud infrastructure and application monitoring platform that makes understanding your data at a glance simple. We offer a Hosted Graphite service that handles issues such as scaling, data storage, and support and maintenance—so that you can spend more time on the metrics and less on the technical details.
To learn more, check out the Prometheus posts on our blog or book a demo to discuss your business needs and objectives. Want to get started with your monitoring right away? The MetricFire free trial is an excellent way to explore what you can do with hosted Graphite.