ABSTRACT
Transactions with strong consistency and high availability simplify building and reasoning about distributed systems. However, previous implementations performed poorly. This forced system designers to avoid transactions completely, to weaken consistency guarantees, or to provide single-machine transactions that require programmers to partition their data. In this paper, we show that there is no need to compromise in modern data centers. We show that a main memory distributed computing platform called FaRM can provide distributed transactions with strict serializability, high performance, durability, and high availability. FaRM achieves a peak throughput of 140 million TATP transactions per second on 90 machines with a 4.9 TB database, and it recovers from a failure in less than 50 ms. Key to achieving these results was the design of new transaction, replication, and recovery protocols from first principles to leverage commodity networks with RDMA and a new, inexpensive approach to providing non-volatile DRAM.
Supplemental Material
- Memcached. http://memcached.org.Google Scholar
- Viking Technology. http://www.vikingtechnology.com/.Google Scholar
- Apache Cassandra. http://cassandra.apache.org/, 2015.Google Scholar
- MySQL. http://www.mysql.com/, 2015.Google Scholar
- neo4j. http://neo4j.com/, 2015.Google Scholar
- redis. http://redis.io/, 2015.Google Scholar
- Adya, A., Dunagan, J., and Wolman, A. Centrifuge: Integrated lease management and partitioning for cloud services. In Proceedings of the 7th USENIX Symposium on Networked Systems Design and Implementation (2010), NSDI'10. Google ScholarDigital Library
- Aaguilera, M. K., Merchant, A., Shah, M., Veitch, A., and Karamanolis, C. Sinfonia: A new paradigm for building scalable distributed systems. In Proceedings of 21st ACM SIGOPS Symposium on Operating Systems Principles (2007), SOSP'07. Google ScholarDigital Library
- Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: A distributed storage system for structured data. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (2006), OSDI'06. Google ScholarDigital Library
- Chockler, G. V., Keidar, I., and Vitenberg, R. Group communication specifications: a comprehensive study. ACM Computing Surveys (CSUR) 33, 4 (2001). Google ScholarDigital Library
- Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J. J., Ghemawat, S., Gubarev, A., Heiser, C., Hochschild, P., Hsieh, W. C., Kanthak, S., Kogan, E., Li, H., Lloyd, A., Melnik, S., Mwaura, D., Nagle, D., Quinlan, S., Rao, R., Rolig, L., Saito, Y., Szymaniak, M., Taylor, C., Wang, R., and Woodford, D. Spanner: Google's globally-distributed database. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (2012), OSDI'12. Google ScholarDigital Library
- Dalessandro, L., and Scott, M. L. Sandboxing transactional memory. In Proceedings of the 21st ACM International Conference on Parallel Architectures and Compilation Techniques (2012), PACT'12. Google ScholarDigital Library
- DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. Dynamo: Amazon's highly available key-value store. In Proceedings of the the 21st ACM Symposium on Operating Systems Principles (2007), SOSP'07. Google ScholarDigital Library
- Diaconu, C., Freedman, C., Ismert, E., Larson, P.-Å., Mittal, P., Stonecipher, R., Verma, N., and Zwilling, M. Hekaton: SQL Server's memory-optimized OLTP engine. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2013), SIGMOD' 13. Google ScholarDigital Library
- Dice, D., Shalev, O., and Shavit, N. Transactional locking II. In Proceedings of the 20th International Symposium on Distributed Computing (2006), DISC'06. Google ScholarDigital Library
- Dragojević, A., Narayanan, D., Hodson, O., and Castro, M. FaRM: Fast remote memory. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (2014), NSDI'14. Google ScholarDigital Library
- Graefe, G. Write-optimized B-trees. In Proceedings of the 30th International Conference on Very Large Data Bases (2004), VLDB'04. Google ScholarDigital Library
- Gray, C., and Cheriton, D. Leases: An efficient fault-tolerant mechanism for distributed file cache consistency. SIGOPS Operating Systems Review (OSR) 23, 5 (1989). Google ScholarDigital Library
- Gray, J., and Reuter, A. Transaction Processing: Concepts and Techniques. 1992. Google ScholarDigital Library
- Guerraoui, R., and Kapalka, M. On the correctness of transactional memory. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2008), PPoPP'08. Google ScholarDigital Library
- Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. Zookeeper: wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Annual Technical Conference (2010), USENIX ATC'10. Google ScholarDigital Library
- InfiniBand Trade Association. Supplement to InfiniBand Architecture Specification Volume 1 Release 1.2.2 Annex A16: RDMA over Converged Ethernet (RoCE), 2010.Google Scholar
- Kalia, A., Kaminsky, M., and Andersen, D. G. Using RDMA efficiently for key-value services. In Proceedings of the 2014 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (2014), SIGCOMM'14. Google ScholarDigital Library
- Lamport, L. The part-time parliament. ACM Transactions on Computer Systems 16, 2. Google ScholarDigital Library
- Lamport, L., Malkhi, D., and Zhou, L. Vertical Paxos and primary-backup replication. In Proceedings of the 28th ACM Symposium on Principles of Distributed Computing (2009), PODC'09. Google ScholarDigital Library
- Larson, P.-Å., Blanas, S., Diaconu, C., Freedman, C., Patel, J. M., and Zwilling, M. High-performance concurrency control mechanisms for main-memory databases. PVLDB 5, 4 (2011). Google ScholarDigital Library
- Lehman, P. L., and Yao, S. B. Efficient locking for concurrent operations on B-trees. ACM Transactions on Database Systems 6, 4 (Dec. 1981). Google ScholarDigital Library
- Microsoft. Scaling out SQL Server. http://www.microsoft.com/en-us/server-cloud/solutions/high-availability.aspx.Google Scholar
- Microsoft. Open CloudServer OCS V2 specification: Blade, 2014.Google Scholar
- Microsoft. OCS Open CloudServer power supply v2.0. http://www.opencompute.org/wiki/Server/SpecsAndDesigns, 2015.Google Scholar
- Mitchell, C., Yifeng, G., and Jinyang, L. Using one-sided RDMA reads to build a fast, CPU-efficient key-value store. In Proceedings of the 2013 USENIX Annual Technical Conference (2013), USENIX ATC'13. Google ScholarDigital Library
- Neuvonen, S., Wolski, A., Manner, M., and Raatikka, V. Telecom Application Transaction Processing benchmark. http://tatpbenchmark.sourceforge.net/.Google Scholar
- Ongaro, D., Rumble, S. M., Stutsman, R., Ousterhout, J., and Rosenblum, M. Fast crash recovery in RAMCloud. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (2011), SOSP'11. Google ScholarDigital Library
- Rumble, S. M., Kejriwal, A., and Ousterhout, J. Log-structured Memory for DRAM-based Storage. In Proceedings of the 12th USENIX Conference on File and Storage Technologies (2014), FAST'14. Google ScholarDigital Library
- Sethi, R. Useless actions make a difference: Strict serializability of database updates. JACM 29, 2 (1982). Google ScholarDigital Library
- Shaun Harris. Microsoft reinvents datacenter power backup with new Open Compute project specification. http://blogs.msdn.com/b/windowsazure/archive/2012/11/13/windows-azure-benchmarks-show-top-performance-for-big-compute.aspx, 2015.Google Scholar
- Sowell, B., Golab, W. M., and Shah, M. A. Minuet: A scalable distributed multiversion B-tree. PVLDB 5, 9 (2012). Google ScholarDigital Library
- Transaction Processing Performance Council (TPC). TPC benchmark C: Standard specification. http://www.tpc.org.Google Scholar
- Tu, S., Zheng, W., Kohler, E., Liskov, B., and Madden, S. Speedy transactions in multicore in-memory databases. In Proceedings of the 24th Symposium on Operating Systems Principles (2013), SOSP'13. Google ScholarDigital Library
- Zheng, W., Tu, S., Kohler, E., and Liskov, B. Fast databases with fast durability and recovery through multicore parallelism. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (2014), OSDI'14. Google ScholarDigital Library
Index Terms
- No compromises: distributed transactions with consistency, availability, and performance
Recommendations
Fast in-memory transaction processing using RDMA and HTM
SOSP '15: Proceedings of the 25th Symposium on Operating Systems PrinciplesWe present DrTM, a fast in-memory transaction processing system that exploits advanced hardware features (i.e., RDMA and HTM) to improve latency and throughput by over one order of magnitude compared to state-of-the-art distributed transaction systems. ...
Using RDMA efficiently for key-value services
SIGCOMM '14: Proceedings of the 2014 ACM conference on SIGCOMMThis paper describes the design and implementation of HERD, a key-value system designed to make the best use of an RDMA network. Unlike prior RDMA-based key-value systems, HERD focuses its design on reducing network round trips while using efficient ...
Spanner: Google’s Globally Distributed Database
Spanner is Google’s scalable, multiversion, globally distributed, and synchronously replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This article describes how ...
Comments