ABSTRACT
Apache Kafka is an open-source distributed publish-subscribe system, which is widely used in data centers for messaging between applications, log aggregation, and stream processing. The existing Kafka implementation uses TCP/IP for communication, which has various inefficiencies such as a high message dispatch cost due to OS involvement and excessive memory copies. Recently, the availability of cost-effective RDMA-capable network controllers within data centers and cloud infrastructures have encouraged many modern applications to adopt RDMA networking, which offers the potential to outperform classical TCP/IP. We introduce KafkaDirect, an extension to Apache Kafka, that uses RDMA to accelerate the three most network intensive datapaths: record production, record replication, and record consumption. In this work, we explore the design choices including which RDMA operations to use to take full advantage of offloaded communication. Our RDMA design relies on one-sided RDMA requests to attain true zero-copy communication completely avoiding the need for using intermediate buffers in Kafka servers, thereby ensuring low latency and high throughput communication. KafkaDirect can offer up to 9x increase in throughput for both Kafka producers and Kafka consumers, and can provide 4x and 50x reduction in latency for Kafka producers and Kafka consumers, respectively.
Supplemental Material
- InfiniBand Trade Association et al . 2020. The InfiniBand Architecture Specification 1.4. https://www.infinibandta.org/ibta-specification/.Google Scholar
- Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobbler, Michael Wei, and John D. Davis. 2012. CORFU: A Shared Log Design for Flash Clusters. In Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI'12). USENIX Association, 1--14.Google ScholarDigital Library
- Claude Barthels, Gustavo Alonso, and Torsten Hoefler. 2017. Designing Databases for Future High-Performance Networks. IEEE Data Engineering Bulletin 40, 1 (2017), 15--26.Google Scholar
- Claude Barthels, Simon Loesing, Gustavo Alonso, and Donald Kossmann. 2015. Rack-Scale In-Memory Join Processing Using RDMA. In Proceedings of the 2015 ACM International Conference on Management of Data (SIGMOD'15). Association for Computing Machinery, 1463--1475. https://doi.org/10.1145/2723372.2750547Google ScholarDigital Library
- Carsten Binnig, Andrew Crotty, Alex Galakatos, Tim Kraska, and Erfan Zamanian. 2016. The End of Slow Networks: It's Time for a Redesign. Proceedings of the VLDB Endowment 9, 7 (2016), 528--539. https://doi.org/10.14778/2904483.2904485Google ScholarDigital Library
- Andrew D. Birrell and Bruce Jay Nelson. 1984. Implementing Remote Procedure Calls. ACM Transactions on Computer Systems 2, 1 (1984), 39--59. https://doi.org/10.1145/2080.357392Google ScholarDigital Library
- Mark S Birrittella, Mark Debbage, Ram Huggahalli, James Kunz, Tom Lovett, Todd Rimmer, Keith D Underwood, and Robert C Zak. 2015. Intel® Omni-path Architecture: Enabling Scalable, High Performance Fabrics. In Proceedings of the 23rd IEEE Symposium on High-Performance Interconnects (HOTI'15). IEEE Computer Society, 1--9.Google ScholarDigital Library
- Matthew Burke, Sowmya Dharanipragada, Shannon Joyner, Adriana Szekeres, Jacob Nelson, Irene Zhang, and Dan R. K. Ports. 2021. PRISM: Rethinking the RDMA Interface for Distributed Systems. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP'21). Association for Computing Machinery, 228--242. https://doi.org/10.1145/3477132.3483587Google ScholarDigital Library
- Haibo Chen, Rong Chen, Xingda Wei, Jiaxin Shi, Yanzhe Chen, Zhaoguo Wang, Binyu Zang, and Haibing Guan. 2015. Fast In-Memory Transaction Processing Using RDMA and HTM. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP'15). Association for Computing Machinery, 87--104.Google Scholar
- Youmin Chen, Youyou Lu, and Jiwu Shu. 2019. Scalable RDMA RPC on Reliable Connection with Efficient Resource Sharing. In Proceedings of the 14th EuroSys Conference (EuroSys'19). Association for Computing Machinery, Article 19, 14 pages. https://doi.org/10.1145/3302424.3303968Google ScholarDigital Library
- Youmin Chen, Youyou Lu, Fan Yang, Qing Wang, Yang Wang, and Jiwu Shu. 2020. FlatStore: An Efficient Log-Structured Key-Value Storage Engine for Persistent Memory. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'20). Association for Computing Machinery, 1077--1091. https://doi.org/10.1145/3373376.3378515Google ScholarDigital Library
- Alibaba Cloud. 2018. Super computing cluster. https://www.alibabacloud.com/product/scc.Google Scholar
- Inc. Cloudera. 2019. kafka-*-perf-test. https://docs.cloudera.com/runtime/7.2.0/kafka-managing/topics/kafka-manage-cli-perf-test.html.Google Scholar
- Cong Ding, David Chu, Evan Zhao, Xiang Li, Lorenzo Alvisi, and Robbert Van Renesse. 2020. Scalog: Seamless Reconfiguration and Total Order in a Scalable Shared Log . In Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI'20). USENIX Association, 325--338.Google Scholar
- Aleksandar Dragojevi, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. 2014. FaRM: Fast Remote Memory. In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI'14). USENIX Association, 401--414.Google Scholar
- Ken Goodhope, Joel Koshy, Jay Kreps, Neha Narkhede, Richard Park, Jun Rao, and Victor Yang Ye. 2012. Building LinkedIn's Real-time Activity Data Pipeline. IEEE Data Engineering Bulletin 35, 2 (2012), 33--45.Google Scholar
- R. Guerraoui and A. Schiper. 1997. Total order multicast to multiple groups. In Proceedings of the 17th International Conference on Distributed Computing Systems (ICDCS'97). IEEE Computer Society, 578--585.Google Scholar
- Torsten Hoefler, Salvatore Di Girolamo, Konstantin Taranov, Ryan E. Grant, and Ron Brightwell. 2017. sPIN: High-performance Streaming Processing In the Network. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'17). Association for Computing Machinery, Article 59, 16 pages.Google ScholarDigital Library
- Sagar Jha, Jonathan Behrens, Theo Gkountouvas, Matthew Milano, Weijia Song, Edward Tremel, Robbert Van Renesse, Sydney Zink, and Kenneth P. Birman. 2019. Derecho: Fast State Machine Replication for Cloud Services. ACM Transactions on Computer Systems 36, 2, Article 4 (2019), 49 pages. https://doi.org/10.1145/3302258Google ScholarDigital Library
- Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2014. Using RDMA Efficiently for Key-Value Services. In Proceedings of the 2014 ACM Conference on SIGCOMM (SIGCOMM'14). Association for Computing Machinery, 295--306. https://doi.org/10.1145/2619239.2626299Google ScholarDigital Library
- Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2016. Design Guidelines for High Performance RDMA Systems. In Proceedings of the 2016 USENIX Annual Technical Conference (USENIX ATC'16). USENIX Association, 437--450.Google Scholar
- Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2016. FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, 185--201.Google ScholarDigital Library
- Svilen Kanev, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks. 2015. Profiling a Warehouse- Scale Computer. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA'15). Association for Computing Machinery, 158--169. https://doi.org/10.1145/2749469.2750392Google ScholarDigital Library
- Tejas Karmarkar. 2015. Availability of linux RDMA on Microsoft Azure. https://azure.microsoft.com/en-us/blog/azure-linux-rdma-hpc-available.Google Scholar
- Daehyeok Kim, Amirsaman Memaripour, Anirudh Badam, Yibo Zhu, Hongqiang Harry Liu, Jitu Padhye, Shachar Raindel, Steven Swanson, Vyas Sekar, and Srinivasan Seshan. 2018. Hyperloop: Group-Based NIC-Offloading to Accelerate Replicated Transactions in Multi-Tenant Storage Systems. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication (SIGCOMM'18). Association for Computing Machinery, 297--312. https://doi.org/10.1145/3230543.3230572Google ScholarDigital Library
- Jay Kreps, Neha Narkhede, Jun Rao, et al. 2011. Kafka: A distributed messaging system for log processing. In Proceedings of the 2011 IEEE International Workshop on Networking Meets Databases (NetDB'11). 1--7.Google Scholar
- Ilya Lesokhin, Haggai Eran, Shachar Raindel, Guy Shapiro, Sagi Grimberg, Liran Liss, Muli Ben-Yehuda, Nadav Amit, and Dan Tsafrir. 2017. Page fault support for network controllers. ACM SIGARCH Computer Architecture News 45, 1 (2017), 449--466.Google ScholarDigital Library
- Bojie Li, Zhenyuan Ruan, Wencong Xiao, Yuanwei Lu, Yongqiang Xiong, Andrew Putnam, Enhong Chen, and Lintao Zhang. 2017. KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP'17). Association for Computing Machinery, 137--152. https://doi.org/10.1145/3132747.3132756Google ScholarDigital Library
- Feng Li, Sudipto Das, Manoj Syamala, and Vivek R. Narasayya. 2016. Accelerating Relational Databases by Leveraging Remote Memory and RDMA. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD'16). Association for Computing Machinery, 355--370. https://doi.org/10.1145/2882903.2882949Google ScholarDigital Library
- Joshua Lockerman, Jose M. Faleiro, Juno Kim, Soham Sankaran, Daniel J. Abadi, James Aspnes, Siddhartha Sen, and Mahesh Balakrishnan. 2018. The FuzzyLog: A Partially Ordered Shared Log. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (OSDI'18). USENIX Association, 357--372.Google Scholar
- Microsoft. 2020. Azure Service Bus Messaging. https://azure.microsoft.com/en-us/services/service-bus/.Google Scholar
- Christopher Mitchell, Yifeng Geng, and Jinyang Li. 2013. Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store. In Proceedings of the 2016 USENIX Annual Technical Conference (USENIX ATC'16). USENIX Association, 103--114.Google Scholar
- The Ohio State University Network-Based Computing Laboratory. 2018. RDMA-based Apache Kafka (RDMA-Kafka). http://hibd.cse.ohio-state.edu/#kafka.Google Scholar
- Oracle. 2020. Oracle Cloud. https://www.oracle.com/cloud/hpc/.Google Scholar
- Oracle. 2020. Oracle Messaging Cloud Service. https://www.oracle.com/technical-resources/articles/cloud/wilkins-ocms.html.Google Scholar
- John Ousterhout, Arjun Gopalan, Ashish Gupta, Ankita Kejriwal, Collin Lee, Behnam Montazeri, Diego Ongaro, Seo Jin Park, Henry Qin, Mendel Rosenblum, Stephen Rumble, Ryan Stutsman, and Stephen Yang. 2015. The RAMCloud Storage System. ACM Transactions on Computer Systems 33, 3, Article 7 (Aug. 2015), 55 pages. https://doi.org/10.1145/2806887Google ScholarDigital Library
- Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, and Byung-Gon Chun. 2015. Making Sense of Performance in Data Analytics Frameworks. In Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI'15). USENIX Association, 293--307. https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/ousterhoutGoogle ScholarDigital Library
- Sathish K Palaniappan and Pramod B Nagaraja. 2008. Efficient data transfer through zero copy. IBM developerworks (2008).Google Scholar
- Marius Poke and Torsten Hoefler. 2015. DARE: High-Performance State Machine Replication on RDMA Networks. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC'15). Association for Computing Machinery, 107--118. https://doi.org/10.1145/2749246.2749267Google ScholarDigital Library
- OpenMessaging Project. 2017. OpenMessaging Benchmark Framework. https://github.com/openmessaging/openmessaging-benchmark.Google Scholar
- Renato Recio, Bernard Metzler, Paul Culley, Jeff Hilland, and Dave Garcia. 2007. A Remote Direct Memory Access Protocol Specification. Technical Report RFC 5040. Network Working Group.Google Scholar
- Eric Rescorla. 2018. The Transport Layer Security (TLS) Protocol Version 1.3. Technical Report RFC 8446. Network Working Group.Google Scholar
- Wolf Rödiger, Tobias Mühlbauer, Alfons Kemper, and Thomas Neumann. 2015. High-Speed Query Processing over High-Speed Networks. Proceedings of the VLDB Endowment 9, 4 (Dec. 2015), 228--239. https://doi.org/10.14778/2856318.2856319Google ScholarDigital Library
- Benjamin Rothenberger, Konstantin Taranov, Adrian Perrig, and Torsten Hoefler. 2021. ReDMArk: Bypassing RDMA Security Mechanisms. In Proceedings of the 30th USENIX Security Symposium (USENIX Security'21). USENIX Association.Google Scholar
- Amazon Web Services. 2020. Amazon Simple Queue Service. https://aws.amazon.com/sqs/.Google Scholar
- Peter Snyder. 1990. tmpfs: A virtual memory file system. In Proceedings of the Autumn 1990 European UNIX Users' Group Conference (EUUG'90). 241--248.Google Scholar
- Patrick Stuedi, Bernard Metzler, and Animesh Trivedi. 2013. JVerbs: Ultra-Low Latency for Data Center Applications. In Proceedings of the 4th ACM Symposium on Cloud Computing (SoCC'13). Association for Computing Machinery, Article 10, 14 pages. https://doi.org/10.1145/2523616.2523631Google ScholarDigital Library
- Patrick Stuedi, Animesh Trivedi, Bernard Metzler, and Jonas Pfefferle. 2014. DaRPC: Data Center RPC. In Proceedings of the 5th ACM Symposium on Cloud Computing (SoCC'14). Association for Computing Machinery, 1--13. https://doi.org/10.1145/2670979.2670994Google ScholarDigital Library
- Maomeng Su, Mingxing Zhang, Kang Chen, Zhenyu Guo, and Yongwei Wu. 2017. RFP: When RPC is Faster than Server-Bypass with RDMA. In Proceedings of the 12th European Conference on Computer Systems (EuroSys'17). Association for Computing Machinery, 1--15. https://doi.org/10.1145/3064176.3064189Google ScholarDigital Library
- Yacine Taleb, Ryan Stutsman, Gabriel Antoniu, and Toni Cortes. 2018. Tailwind: Fast and Atomic RDMA-based Replication. In Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC'18). USENIX Association, 851--863.Google Scholar
- Konstantin Taranov, Rodrigo Bruno, Gustavo Alonso, and Torsten Hoefler. 2021. Naos: Serialization-free RDMA networking in Java. In Proceedings of the 2021 USENIX Annual Technical Conference (USENIX ATC'21). USENIX Association, 1--14.Google Scholar
- Konstantin Taranov, Benjamin Rothenberger, Adrian Perrig, and Torsten Hoefler. 2020. sRDMA -- Efficient NIC-based Authentication and Encryption for Remote Direct Memory Access. In Proceedings of the 2020 USENIX Annual Technical Conference (USENIX ATC'20). USENIX Association, 691--704.Google Scholar
- Mellanox Technologies. 2015. RDMA Aware Networks Programming User Manual, Rev 1.7. https://www.mellanox.com/related-docs/prod_software/RDMA_Aware_ Programming_user_manual.pdf.Google Scholar
- Gigabyte Technology. 2021. AORUS Gen4 AIC SSD 8TB. https://www.gigabyte.com/Solid-State-Drive/AORUS-Gen4-AIC-SSD-8TB.Google Scholar
- Hung-Wei Tseng, Qianchen Zhao, Yuxiao Zhou, Mark Gahagan, and Steven Swanson. 2016. Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA'16). IEEE Press, 53--65. https://doi.org/10.1109/ISCA.2016.15Google ScholarDigital Library
- Giselle van Dongen and Dirk Van den Poel. 2020. Evaluation of Stream Processing Frameworks. IEEE Transactions on Parallel and Distributed Systems 31, 8 (2020), 1845--1858.Google ScholarCross Ref
- Guozhang Wang, Joel Koshy, Sriram Subramanian, Kartik Paramasivam, Mammad Zadeh, Neha Narkhede, Jun Rao, Jay Kreps, and Joe Stein. 2015. Building a Replicated Logging System with Apache Kafka. Proceedings of the VLDB Endowment 8, 12 (Aug. 2015), 1654--1655. https://doi.org/10.14778/2824032.2824063Google ScholarDigital Library
- Michael Wei, Amy Tai, Christopher J. Rossbach, Ittai Abraham, Maithem Munshed, Medhavi Dhawan, Jim Stabile, Udi Wieder, Scott Fritchie, Steven Swanson, Michael J. Freedman, and Dahlia Malkhi. 2017. vCorfu: A Cloud-Scale Object Store on a Shared Log. In Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI'17). USENIX Association, 35--49.Google ScholarDigital Library
- Xingda Wei, Zhiyuan Dong, Rong Chen, and Haibo Chen. 2018. Deconstructing RDMA-enabled Distributed Transactions: Hybrid is Better!. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI'18). USENIX Association, Carlsbad, CA, 233--251. https://www.usenix.org/conference/osdi18/presentation/weiGoogle Scholar
- Jilong Xue, Youshan Miao, Cheng Chen, Ming Wu, Lintao Zhang, and Lidong Zhou. 2019. Fast Distributed Deep Learning over RDMA. In Proceedings of the 14th European Conference on Computer Systems (EuroSys'19). Association for Computing Machinery, Article 44, 14 pages. https://doi.org/10.1145/3302424.3303975Google ScholarDigital Library
- Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing with Working Sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud'10). USENIX Association, 10.Google ScholarDigital Library
- Erfan Zamanian, Carsten Binnig, Tim Harris, and Tim Kraska. 2017. The End of a Myth: Distributed Transactions Can Scale. Proceedings of the VLDB Endowment 10, 6 (2017), 685--696. https://doi.org/10.14778/3055330.3055335Google ScholarDigital Library
- Erfan Zamanian, Xiangyao Yu, Michael Stonebraker, and Tim Kraska. 2019. Rethinking Database High Availability with RDMA Networks. Proceedings of the VLDB Endowment 12, 11 (July 2019), 1637--1650. https://doi.org/10.14778/3342263.3342639Google ScholarDigital Library
- Tobias Ziegler, Sumukha Tumkur Vani, Carsten Binnig, Rodrigo Fonseca, and Tim Kraska. 2019. Designing Distributed Tree-Based Index Structures for Fast RDMA-Capable Networks. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD'19). Association for Computing Machinery, 741--758. https://doi.org/10.1145/3299869.3300081Google ScholarDigital Library
Index Terms
- KafkaDirect: Zero-copy Data Access for Apache Kafka over RDMA Networks
Recommendations
Message Latency-Based Load Shedding Mechanism in Apache Kafka
Euro-Par 2019: Parallel Processing WorkshopsAbstractApache Kafka is a distributed message queuing platform that delivers data streams in real time. Through the distributed processing technology, Kafka has the advantage of delivering very large data streams very fast. However, when the data ...
Comments