I am excited to announce that, together with Jay Kreps and Jun Rao, I’m founding Confluent to make real-time data and Apache Kafka available widely.

Since we created and open sourced Kafka more than 4 years ago, it has seen widespread adoption across thousands of companies like LinkedIn, Netflix, Pinterest, AirBnb, Verizon, Salesforce and more. At Confluent, we are solving a very key problem that companies face, which is collecting data about everything happening inside the company and allowing applications and systems to react to events in real-time.

Apache Kafka is used in different ways, from collecting user activity data, IT/operational metrics to collecting set-top box metrics and stock ticker data, but its biggest differentiator is that it enables collection of large volume data for real time consumption. The impact that this has on the way data flows through an organization, is the ability to deploy a single infrastructure, that can act as the central nervous system transmitting data to the company’s brain, which is all the systems and applications that process data.

Until a few years ago and even now for traditional enterprises, data was locked up in silos. Traditional solutions for handling real time data, like messaging systems, could not keep up with the newer high throughput data sources. And traditional batch ETL tools as well as newer log aggregation systems were not designed for real time consumption. The absence of a single scalable infrastructure for handling variety of data sources, led to an ad-hoc approach to data integration, where people built custom pipes to move data between various sources and destination systems. The problem with this approach is limited data coverage in each of the data processing systems, Hadoop as well as real time systems. Kafka’s popularity can be attributed to the fact that it provides an elegant solution to this age old problem of data movement and it does so reliably in a scalable way.

Web companies are used to managing data in real time, most of which use Kafka to do the same. We see an opportunity in the enterprise who have started to realize this trend and are adopting Kafka.

There is a story behind every company; you can read about ours here. It is truly exciting to get an opportunity to create a company and one that backs the technology you helped create. I look forward to this next phase of my career.

In this post, I’m going to compare Kafka performance with GZIP and Snappy compression codecs.

Why compression ?

It is a well known fact that compression helps increase the performance of I/O intensive applications. The reason is simple – disks are slow. Compression reduces the disk footprint of your data leading to faster reads and writes. But at the same time, you invest CPU cycles in decompressing the data read from disk. So it is about striking a balance between I/O load and CPU load.

If the application starves on disk capacity but has plenty of CPU cycles to spare, then picking a compression algorithm that yields the largest compression ratio makes sense. If your application is I/O intensive and the compression ratio achieved is minimal, the savings in CPU cycles might get overshadowed by the time required to read the data from disk. Furthermore, if compressed data read from disk needs to be transferred over the network, compression ratio directly affects the number of roundtrips required to read compressed data over the wire.

 Compression in Kafka

At LinkedIn, we have deployed Kafka in production at LinkedIn for almost 3 years successfully. Currently all Kafka data at LinkedIn is GZIP compressed and though it is relatively heavy on CPU usage, it has worked fairly well so far. A reason to look into using the right compression codec now is due to the changes in Kafka 0.8 that impact compression performance.

In Kafka 0.7, the compression took place on the producer where it compressed a batch of messages into one compressed message. This message gets appended, as is, to the Kafka broker’s log file. When a consumer fetches compressed data, it decompresses the underlying compressed data and hands out the original messages to the user. So once data is compressed at source, it stays compressed until it reaches the end consumer. The broker pays no penalty as far as compression overhead is concerned.

In Kafka 0.8, there are changes made to the broker that can have an impact on performance if the data is sent compressed by the producer. To explain that, I need to briefly mention one important change made in Kafka 0.8 concerning message offsets. As you know, in Kafka 0.7, messages were addressable by physical bytes offsets into the partition’s write ahead log. In Kafka 0.8, each message is addressable by a monotonically increasing logical offset that is unique per partition. The 1st message has an offset of 1, the 100th message has an offset of 100 and so on. This feature has simplified offset management and consumer rewind capability considerably. However, it has an impact on broker performance if the incoming data is compressed. Note that, in Kafka 0.8, messages for a partition are served by the leader broker. The leader assigns these unique logical offsets to every message it appends to its log. Now, if the data is compressed, the leader has to decompress the data in order to assign offsets to the messages inside the compressed message. So the leader decompresses data, assigns offsets, compresses it again and then appends the re-compressed data to disk. And it has to do that for every compressed message received! One can imagine the CPU load on a Kafka broker serving production traffic for thousands of partitions with GZIP compression enabled.

I benchmarked 2 popular compression codecs – GZIP and Snappy. GZIP is known for large compression ratios, but poor decompression speeds and high CPU usage; while Snappy trades off compression ratio for higher compression and decompression speed.


1. CPU load

In this test, I ran a Kafka consumer with 20 threads to consume 300 topics from a Kafka cluster configured to host data compressed in the GZIP format. In another test, I ran a Kafka consumer with 20 threads to consume 300 topics from a Kafka cluster configured to host data compressed in Snappy. Note that the consumers were configured to fetch data from the tail of the Kafka topics, so they were operating at the maximum possible throughput. I wrote a short python script to correlate the per thread CPU usage stats (from top) with the thread dump to see the per thread CPU usage between the 2 tests. This test showed that consumer threads in a Kafka process running in catch up mode utilize 2x more CPU when consuming GZIP data compared to one that is consuming Snappy data. I ran the same test, but this time with a single consumer thread in the Kafka consumer. In this test, the throughput of the GZIP consumer reduced since it gets pegged at 100% CPU usage.

2. Compression Ratio

In this test, I copied a 1GB worth data from one of our production topics and ran the replay log tool that ships with Kafka. Using the tool, I recreated the log segment in GZIP and Snappy compression formats. This test showed that for reasonable production data, GZIP compresses data 30% more as compared to Snappy. (Compression ratio of GZIP was 2.8x, while that of Snappy was 2x)

3. Consumer throughput

In this test, I ran a Kafka consumer to consume 1 million messages from a Kafka topic in catch up mode. Similar to the test setup above, I ran one consumer against GZIP compressed data and another against Snappy compressed data. Note that the Snappy cluster is a mirror of the GZIP cluster, so they host identical data sets, but in a different compression format. The expectation was that since GZIP compresses data 30% better than Snappy, it will fetch data proportionately faster over the network and hence lead to a higher consumer throughput. It turns out that even though the GZIP consumer issues 30% fewer fetch requests to the Kafka brokers, it’s throughput is comparable to that of the Snappy consumer. To understand this result, let me explain how the high level consumer (ZookeeperConsumerConnector) works in Kafka. The consumer has background “fetcher” threads that continuously fetch data in batches of 1MB from the brokers and add it to an internal blocking queue. The consumer thread dequeues data from this blocking queue, decompresses and iterates through the messages. Due to the buffering between the fetcher and consumer threads and the fact that the consumer is running in catch-up mode, there is always data ready to be decompressed by the consumer thread making it go as fast as it can decompress. The reason Snappy does not outperform GZIP is due to the fact that it has to decompress roughly 30% more data chunks as compared to GZIP. So the savings in decompression cost are offset by the overhead of making more 1MB roundtrips to the Kafka brokers.

4. Producer throughput

In this test, I ran one producer with batch size of 100 and message size of 1KB to produce 15 million messages to a Kafka 0.8 cluster in the same data center. The results are largely in favor of Snappy. I ran the producer in 2 modes –

1. Ack when the leader write data to log

In this mode, as far as compression is concerned, the data gets compressed at the producer, decompressed and compressed on the broker before it sends the ack to the producer. The producer throughput with Snappy compression was roughly 22.3MB/s as compared to 8.9MB/s of the GZIP producer. Producer throughput is 150% higher with Snappy as compared to GZIP.

2. No ack, similar to Kafka 0.7 behavior

In this mode, the data gets compressed at the producer and it doesn’t wait for the ack from the broker. The producer throughput with Snappy compression was roughly 60.8MB/s as compared to 18.5MB/s of the GZIP producer. Producer throughput is 228% higher with Snappy as compared to GZIP. The higher compression savings in this test are due to the fact that the producer does not wait for the leader to re-compress and append the data; it simply compresses messages and fires away. Since Snappy has very high compression speed and low CPU usage, a single producer is able to compress the same amount of messages much faster as compared to GZIP.

Cross DC mirroring

One thing that I skimmed over in my discussion is cross data center mirroring. Cross colo network resources are typically limited and expensive. For such mirroring, it might make more sense to use GZIP instead of Snappy in spite of the low CPU load and higher throughput.