Table of Contents
- Introduction
- Key Takeaways
- What is Kafka?
- How does Kafka work?
- Benefits of using Kafka
- Kafka metrics
- Broker metrics
- Producer metrics
- Consumer metrics
- ZooKeeper metrics
- Collecting Kafka metrics
- What is Graphite and Grafana?
- Using hosted Grafana and Graphite for monitoring Kafka metrics
- How to integrate Kafka and Grafana via Graphite?
- Benefits of using MetricFire
- Conclusion
Introduction
In this article, we look at the key metrics for monitoring Kafka performance and explain why it is important to monitor them continuously. We also walk through monitoring Kafka metrics with Hosted Graphite by MetricFire.
To learn more about MetricFire, book a demo with the MetricFire team or sign up for the free trial.
Key Takeaways
- Kafka is an open-source distributed event streaming platform used for storing, processing, and analyzing streaming data.
- Kafka offers high throughput, low latency, fault tolerance, durability, scalability, and real-time data processing.
- Monitoring Kafka is essential for ensuring the stable operation of applications.
- Grafana is an open-source system for visualizing metrics with customizable dashboards.
- Graphite is a monitoring tool that stores and processes data, and Grafana can connect to it as a data source.
What is Kafka?
Kafka is an open-source distributed event streaming platform used by thousands of users to store, process, and analyze streaming data.
How does Kafka work?
Kafka consists of servers and clients that communicate over a high-performance TCP network protocol. Kafka operates as a cluster of one or more servers. Some of these servers store data. Other servers import and export data as streams of events to integrate Kafka with your existing systems. The Kafka cluster is highly scalable and resilient: if one of the servers fails, the others take over its work to keep the cluster running without data loss. Clients enable you to write distributed applications and microservices that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner.
Kafka allows you to create topics, connect applications to them, and write records to those topics. Records are byte arrays in which you can store any information. A record has four attributes: key, value, timestamp, and headers. The key and value carry the application data; the timestamp and headers are metadata that you do not have to set yourself.
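To make the record structure concrete, here is a minimal Python sketch of the four attributes. The `Record` class and the `make_record` helper are hypothetical names used only for illustration; they are not part of any Kafka client library, and a real client would build records for you.

```python
import time
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Record:
    """Illustrative model of a Kafka record (not a class from any Kafka client)."""

    key: bytes    # commonly used to group related records
    value: bytes  # the payload itself
    # Milliseconds since the Unix epoch; defaults to the creation time.
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))
    # Optional metadata as (name, value) pairs.
    headers: List[Tuple[str, bytes]] = field(default_factory=list)


def make_record(key: str, value: str) -> Record:
    """Encode text as UTF-8, because records are stored as byte arrays."""
    return Record(key=key.encode("utf-8"), value=value.encode("utf-8"))


record = make_record("sensor-1", '{"temperature": 21.5}')
print(record.key, record.value, record.headers)
```

Only the key and value are passed in here; the timestamp and headers fall back to defaults.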
Kafka consists of four main components:
- The broker handles all requests from clients and stores data. A cluster can have one or more brokers.
- ZooKeeper maintains the state of the cluster.
- The producer sends records to the broker.
- The consumer receives batches of records from the broker.
Benefits of using Kafka
Let’s take a closer look at the benefits of using Kafka.
- High throughput. Kafka can sustain a throughput of thousands of messages per second and handle high-velocity, high-volume data.
- Low latency. Kafka can process messages with latency in the range of milliseconds.
- Fault tolerance. This is one of Kafka's main advantages: the cluster keeps working when a node or machine fails.
- Durability. Kafka replicates messages across brokers, which protects them against loss when a broker fails.
- Scalability. Kafka can be scaled up on the fly by adding additional nodes.
- Distributed architecture. Kafka’s distributed architecture makes it scalable by leveraging capabilities such as replication and partitioning.
- Convenience for consumers. Kafka can behave differently depending on the consumer it integrates with, and it works with consumers written in many different programming languages.
- Real-time processing. Kafka can process the data pipeline in real-time.
Kafka metrics
To ensure the stable operation of applications that depend on Kafka, you need to constantly monitor its status and efficiency. To do this, you need to monitor the key metrics of each component that the cluster includes:
- Broker metrics.
- Producer metrics.
- Consumer metrics.
- ZooKeeper metrics.
The sections below cover the key performance metrics for each of these components.
Broker metrics
Every message passes through a broker before it is consumed, so brokers play a key role in Kafka. It is very important to track their performance characteristics, which can be divided into three main categories:
- Kafka system metrics.
- JVM garbage collector metrics.
- Host metrics.
Kafka system metrics
| Name | Description | 
| UnderReplicatedPartitions | The number of under-replicated partitions across all topics on the broker. Under-replicated partition metrics are a leading indicator of one or more brokers being unavailable. | 
| IsrShrinksPerSec/IsrExpandsPerSec | If a broker goes down, the in-sync replica sets (ISRs) for some of its partitions shrink. When that broker is up again, the ISRs expand once its replicas have fully caught up. | 
| ActiveControllerCount | Indicates whether the broker is the active controller. The sum across all brokers should always equal 1, since only one broker acts as the controller at any time. | 
| OfflinePartitionsCount | The number of partitions that don’t have an active leader and are hence not writable or readable. A non-zero value indicates that brokers are not available. | 
| LeaderElectionRateAndTimeMs | A partition leader election happens when ZooKeeper is not able to connect with the leader. This metric may indicate a broker is unavailable. | 
| UncleanLeaderElectionsPerSec | A leader may be chosen from out-of-sync replicas if the broker which is the leader of the partition is unavailable and a new leader needs to be elected. This metric can indicate potential message loss. | 
| TotalTimeMs | The total time taken to serve a request. | 
| PurgatorySize | The number of requests waiting in purgatory. Can help identify the underlying causes of latency. | 
| BytesInPerSec/BytesOutPerSec | The amount of data brokers receive from producers and the amount that consumers read from brokers. This is an indicator of the overall throughput or workload in the Kafka cluster. | 
| RequestsPerSecond | The rate of requests from producers, consumers, and followers. | 
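As a rough illustration of how a few of these metrics can be turned into alerts, the sketch below checks a snapshot of broker metrics against the rules described in the table. The function name, the snapshot format (a dict per broker keyed by metric name), and the message wording are assumptions made for this example; the values would come from whatever tool collects your broker metrics.

```python
from typing import Dict, List


def check_cluster_health(brokers: Dict[str, Dict[str, float]]) -> List[str]:
    """Return warnings for a {broker_name: {metric_name: value}} snapshot."""
    warnings = []
    for name, metrics in sorted(brokers.items()):
        if metrics.get("UnderReplicatedPartitions", 0) > 0:
            warnings.append(f"{name}: under-replicated partitions present")
        if metrics.get("OfflinePartitionsCount", 0) > 0:
            warnings.append(f"{name}: partitions without an active leader")
        if metrics.get("UncleanLeaderElectionsPerSec", 0) > 0:
            warnings.append(f"{name}: unclean leader elections, possible message loss")

    # Exactly one broker in the cluster should report itself as controller.
    controllers = sum(m.get("ActiveControllerCount", 0) for m in brokers.values())
    if controllers != 1:
        warnings.append(f"cluster: expected 1 active controller, found {controllers}")
    return warnings


snapshot = {
    "broker-1": {"ActiveControllerCount": 1, "UnderReplicatedPartitions": 0},
    "broker-2": {"ActiveControllerCount": 0, "UnderReplicatedPartitions": 4},
}
for warning in check_cluster_health(snapshot):
    print(warning)
```

The thresholds here are deliberately strict (any non-zero value warns); in practice you may want to alert only when a condition persists for some time.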
JVM garbage collector metrics
| Name | Description | 
| CollectionCount | The total number of young or old garbage collection processes performed by the JVM. | 
| CollectionTime | The total amount of time in milliseconds that the JVM spent executing young or old garbage collection processes. | 
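Because CollectionTime is a running total, it is usually more useful to look at how much it grows between two samples. The sketch below, with a hypothetical `gc_overhead` helper, estimates the fraction of a sampling interval that the JVM spent in garbage collection. The reset handling is an assumption about how you might treat a broker restart.

```python
def gc_overhead(prev_time_ms: float, curr_time_ms: float, interval_ms: float) -> float:
    """Fraction of the sampling interval spent in GC, from two CollectionTime samples."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    delta = curr_time_ms - prev_time_ms
    if delta < 0:
        # The counter went backwards, e.g. after a JVM restart:
        # treat the current value as the time accumulated since the restart.
        delta = curr_time_ms
    return delta / interval_ms


# Two samples taken 60 seconds apart: 600 ms of GC in 60,000 ms.
print(gc_overhead(1000, 1600, 60_000))
```

The same calculation applies to young and old generation collectors separately, since each reports its own counters.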
Host metrics
| Name | Description | 
| Page cache reads ratio | The ratio of reads served from the page cache to reads served from disk. | 
| Disk usage | The amount of used and available disk space. | 
| CPU usage | The CPU is rarely the source of performance issues. However, if you see spikes in CPU usage, this metric should be investigated. | 
| Network bytes sent/received | The amount of incoming and outgoing network traffic. | 
Producer metrics
Producers are processes that send messages to brokers. If producers stop working, consumers will not receive new messages. Let’s take a look at the key producer metrics.
| Name | Description | 
| compression-rate-avg | Average compression rate of sent batches. | 
| response-rate | The average number of responses received per second. | 
| request-rate | The average number of requests sent per second. | 
| request-latency-avg | Average request latency in milliseconds. | 
| outgoing-byte-rate | An average number of outgoing bytes per second. | 
| io-wait-time-ns-avg | The average length of time the I/O thread spent waiting for a socket (in ns). | 
| batch-size-avg | The average number of bytes sent per partition per request. | 
Consumer metrics
Monitoring consumer metrics can show how efficiently data is being retrieved by consumers, which can help identify system performance problems. Let’s take a look at the consumer metrics below.
| Name | Description | 
| records-lag | The number of messages by which the consumer is behind the producer on this partition. | 
| records-lag-max | The maximum record lag. An increasing value means that the consumer is not keeping up with the producers. | 
| bytes-consumed-rate | Average bytes consumed per second for each consumer for a specific topic or across all topics. | 
| records-consumed-rate | An average number of records consumed per second for a specific topic or across all topics. | 
| fetch-rate | The number of fetch requests per second from the consumer. | 
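Lag can be thought of as the distance between the newest offset in a partition and the offset the consumer has reached. The following sketch computes it per partition and takes the maximum, roughly mirroring records-lag and records-lag-max. The function names and the plain dicts of offsets are assumptions for illustration; a real client or a tool such as Burrow reports these numbers for you.

```python
from typing import Dict


def partition_lag(end_offsets: Dict[int, int], consumer_offsets: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition: newest offset minus the offset the consumer has reached."""
    lag = {}
    for partition, end in end_offsets.items():
        reached = consumer_offsets.get(partition, 0)  # unknown position counts from 0
        lag[partition] = max(end - reached, 0)
    return lag


def max_lag(lag: Dict[int, int]) -> int:
    """Largest lag across all partitions, or 0 if there are none."""
    return max(lag.values(), default=0)


lag = partition_lag({0: 100, 1: 50}, {0: 90, 1: 50})
print(lag, max_lag(lag))
```

A lag that stays flat is usually fine; a lag that keeps growing from sample to sample is the signal to investigate.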
ZooKeeper metrics
In a ZooKeeper-based deployment, ZooKeeper is an essential component, and shutting it down will stop Kafka. ZooKeeper stores information about brokers and Kafka topics, applies quotas to control the rate of traffic passing through the cluster, and stores information about replicas. Below are the ZooKeeper metrics.
| Name | Description | 
| outstanding-requests | The number of requests that are in the queue. | 
| avg-latency | The response time to a client request, in milliseconds. | 
| num-alive-connections | The number of clients connected to ZooKeeper. | 
| followers | The number of active followers. | 
| pending-syncs | The number of pending syncs from followers. | 
| open-file-descriptor-count | The number of used file descriptors. | 
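One common way to read these values is ZooKeeper's `mntr` command, which prints one key and value per line with a `zk_` prefix (for example `zk_avg_latency`). The sketch below parses such output into a dict. We assume here that the table names correspond to the `zk_`-prefixed keys, that `mntr` is enabled on your servers, and the sample values are made up for the example.

```python
from typing import Dict, Union

Value = Union[int, float, str]


def parse_mntr(output: str) -> Dict[str, Value]:
    """Parse 'key<whitespace>value' lines into a dict, converting numbers where possible."""
    stats: Dict[str, Value] = {}
    for line in output.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        key, raw = parts[0], parts[1].strip()
        try:
            stats[key] = int(raw)
        except ValueError:
            try:
                stats[key] = float(raw)
            except ValueError:
                stats[key] = raw  # keep non-numeric values such as the version
    return stats


# Made-up sample output for illustration only.
SAMPLE = "zk_version\t3.6.3\nzk_avg_latency\t0\nzk_outstanding_requests\t2\nzk_num_alive_connections\t5\n"
stats = parse_mntr(SAMPLE)
print(stats["zk_outstanding_requests"], stats["zk_num_alive_connections"])
```

A steadily growing number of outstanding requests is the kind of value you would then forward to your monitoring system.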
Collecting Kafka metrics
There are several tools for collecting Kafka metrics:
- JConsole is the GUI that comes with the JDK. It provides an interface for browsing the metrics that Kafka exposes.
- JMX. A lot of monitoring tools can collect JMX metrics from Kafka through JMX plugins, through metric reporter plugins, or through connectors that write JMX metrics to Graphite or other systems.
- Burrow is a tool that monitors consumer lag and gives you detailed metrics on how well all consumers are keeping up.
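Whichever tool collects the metrics, writing them to Graphite can be as simple as sending lines of text. Graphite's plaintext protocol expects `path value timestamp` per line, by default on TCP port 2003. The sketch below is a minimal example under a few assumptions: the helper names, the `kafka.broker1...` metric path, and the host are placeholders, and a hosted service may require an account-specific prefix in the path, so check your provider's documentation.

```python
import socket
import time
from typing import Iterable, Optional


def format_graphite_line(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Build one plaintext-protocol line: '<path> <value> <unix timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_to_graphite(lines: Iterable[str], host: str, port: int = 2003, timeout: float = 5.0) -> None:
    """Send already formatted lines to a Graphite (carbon) plaintext listener over TCP."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


# Formatting only; no network connection is made here.
line = format_graphite_line("kafka.broker1.UnderReplicatedPartitions", 0, 1700000000)
print(line, end="")
```

To actually send the line, you would call `send_to_graphite([line], "graphite.example.com")` with the address of your own Graphite endpoint.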
What is Graphite and Grafana?
Grafana is an open-source system that provides tools for the graphical visualization of metrics. Its customizable dashboards let you build graphs and charts from your data, and it can read from many different data sources.
Graphite is a monitoring tool that allows you to store and process data. Grafana can connect to Graphite as a data source and can be used with it to monitor your system’s metrics.
Using hosted Grafana and Graphite for monitoring Kafka metrics
To monitor Kafka metrics, use Grafana dashboards. First, create a dashboard of the type that suits you, then choose a data source. Graphite is a well-established data source for Grafana and the one we use here. The Kafka metrics you have collected must first be saved in Graphite. Next, create and configure all the necessary charts. The finished dashboard can be exported to a JSON file. You can also create an external link to the dashboard or take a snapshot of it.
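Before building charts, it can help to confirm that the metrics really are in Graphite. Graphite's render API accepts `target`, `from`, and `format` query parameters, so a quick check is to build a render URL and open it or fetch it with any HTTP client. In the sketch below, the `render_url` helper, the base URL, and the `kafka.*.BytesInPerSec` path are assumptions for illustration; use the address and metric names of your own setup.

```python
from urllib.parse import urlencode


def render_url(base_url: str, target: str, window: str = "-1h") -> str:
    """Build a Graphite render API URL that returns the target series as JSON."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


# The wildcard matches every broker under the (assumed) 'kafka.' prefix.
print(render_url("http://graphite.example.com", "kafka.*.BytesInPerSec"))
```

If the response contains datapoints for the series you expect, the same target expression can be used in a Grafana panel that reads from the Graphite data source.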
How to integrate Kafka and Grafana via Graphite?
Save your Kafka metrics in Graphite, connect Graphite to Grafana as a data source, and then use Grafana's tools to build customizable dashboards for monitoring those metrics.
For more information on how to integrate Kafka with Grafana via Graphite, book a demo with the MetricFire team or sign up for MetricFire for free.
Benefits of using MetricFire
MetricFire offers hosted Graphite and Grafana, which make monitoring Kafka metrics easier and more convenient. With MetricFire, you can focus on Kafka performance metrics while we take care of setting up the monitoring system.
Conclusion
In this article, we explored how the Kafka event streaming platform works and the benefits of using it. We also took a closer look at Kafka performance metrics and at tools for monitoring them, such as the hosted Graphite and Grafana offered by MetricFire.
To learn more about MetricFire, book a demo with our experts or sign up for the MetricFire free trial today.
