Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
In the context of data-intensive applications, efficient data serialization is essential for maintaining high performance and scalability. This thesis investigates the impact of different serialization protocols on latency and throughput in Apache Kafka, a widely used distributed streaming platform. Given the diverse array of serialization protocols, this study focuses on four prevalent ones: Apache Avro, Protocol Buffers (Protobuf), JSON, and MessagePack. These protocols were selected based on their widespread use in academic research and industry, and on their varying approaches to balancing human readability, efficiency, and performance.
JSON, the most commonly used serialization protocol in many systems, is a baseline for comparison in this study. While JSON offers ease of use and broad compatibility, it may not be optimal in terms of speed and data size efficiency. This research aims to determine whether alternative serialization protocols can improve performance.
This research utilized a testing framework involving two distinct types of tests: batch processing and single-message processing. Each test type consisted of 1,048,575 records and was applied across three different record sizes: 1,176 bytes, 4,696 bytes, and 9,312 bytes. This setup was used to evaluate how record size affects serialization and deserialization times, total execution times, throughput, and latency. Throughput is measured in records per second (rps).
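The core of such a measurement can be sketched as follows. This is a minimal, Kafka-free illustration using only Python's standard-library `json` module; the record contents, counts, and helper names here are illustrative assumptions, not the thesis's actual test harness (which additionally covers Avro, Protobuf, and MessagePack via their respective libraries):

```python
import json
import time

def measure_throughput(records, serialize, deserialize):
    """Round-trip every record through serialize/deserialize
    and return the throughput in records per second (rps)."""
    start = time.perf_counter()
    for record in records:
        payload = serialize(record)
        deserialize(payload)
    elapsed = time.perf_counter() - start
    return len(records) / elapsed

# Illustrative workload; the thesis used 1,048,575 records per test
# at payload sizes of 1,176, 4,696, and 9,312 bytes.
records = [{"id": i, "data": "x" * 1024} for i in range(10_000)]

rps = measure_throughput(
    records,
    serialize=lambda r: json.dumps(r).encode("utf-8"),
    deserialize=lambda b: json.loads(b),
)
print(f"JSON round-trip throughput: {rps:,.0f} rps")
```

Swapping the `serialize`/`deserialize` callables for another protocol's encode/decode functions allows a like-for-like comparison on identical records.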
The throughput results indicate that MessagePack achieves roughly twice the throughput of JSON. The batch-processing results, from smallest to largest record size, were 34,254 rps vs. 14,243 rps; 7,377 rps vs. 3,411 rps; and 3,802 rps vs. 1,784 rps. The single-message results were 29,212 rps vs. 14,126 rps; 8,350 rps vs. 3,344 rps; and 3,781 rps vs. 1,803 rps. Protobuf showed the highest throughput for the smallest tested record size, at 36,945 rps for batch processing and 36,364 rps for single-message processing. Avro showed a slight throughput edge over JSON, though the gain was smaller than MessagePack's. All protocols serialized faster than JSON, with Protobuf being the quickest.
Regarding latency, Protobuf consistently achieved the lowest median latencies across all test sizes in batch processing, recording 38.97 ms, 57.41 ms, and 63.14 ms for increasing record sizes, whereas JSON showed higher latencies of 77.59 ms, 72.60 ms, and 78.09 ms. In single-message tests, Protobuf also displayed the lowest median latency at 1.68 ms for the smallest size, significantly outperforming JSON’s 7.94 ms. Interestingly, for the record size of 4,696 bytes, JSON exhibited the lowest median latency at 3.76 ms. Avro presented the lowest median latency for the largest size at 2.71 ms, compared to JSON's 4.18 ms.
The results indicate that migrating from JSON to MessagePack, or to Protobuf for the smallest record size, can roughly double throughput.
Protobuf reduces latency across all tested record sizes in batch-processing scenarios, making it a compelling choice for systems prioritizing rapid data handling. For single-message tests, Protobuf is recommended for the smallest record size, while Avro offers advantages for the largest.
2024, p. 48