ABSTRACT
The advent of RoCE (RDMA over Converged Ethernet) has led to a significant increase in the use of RDMA in datacenter networks. To achieve good performance, RoCE requires a lossless network, which is in turn achieved by enabling Priority Flow Control (PFC) within the network. However, PFC brings with it a host of problems such as head-of-the-line blocking, congestion spreading, and occasional deadlocks. Rather than seek to fix these issues, we instead ask: is PFC fundamentally required to support RDMA over Ethernet?
We show that the need for PFC is an artifact of current RoCE NIC designs rather than a fundamental requirement. We propose an improved RoCE NIC (IRN) design that makes a few simple changes to the RoCE NIC for better handling of packet losses. We show that IRN (without PFC) outperforms RoCE (with PFC) by 6-83% for typical network scenarios. Thus not only does IRN eliminate the need for PFC, it improves performance in the process! We further show that the changes that IRN introduces can be implemented with modest overheads of about 3-10% to NIC resources. Based on our results, we argue that research and industry should rethink the current trajectory of network support for RDMA.