ABSTRACT
Geo-replicated, distributed data stores that support complex online applications, such as social networks, must provide an "always-on" experience in which operations always complete with low latency. Today's systems often sacrifice strong consistency to achieve these goals, exposing inconsistencies to their clients and necessitating complex application logic. In this paper, we identify and define a consistency model---causal consistency with convergent conflict handling, or causal+---that is the strongest consistency model achieved to date under these constraints.
We present the design and implementation of COPS, a key-value store that delivers this consistency model across the wide area. A key contribution of COPS is its scalability: it can enforce causal dependencies between keys stored across an entire cluster, rather than on a single server as in previous systems. The central approach in COPS is tracking causal dependencies between keys and explicitly checking that those dependencies are satisfied in the local cluster before exposing writes. Further, in COPS-GT, we introduce get transactions, which provide a consistent view of multiple keys without locking or blocking. Our evaluation shows that COPS completes operations in less than a millisecond, provides throughput similar to previous systems when using one server per cluster, and scales well as we increase the number of servers in each cluster. It also shows that COPS-GT provides latency, throughput, and scaling similar to COPS for common workloads.
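The dependency-checking idea described above can be sketched as follows. This is an illustrative model only, not COPS's actual implementation (the real system is built in C++ on FAWN-KV); the names `Write`, `LocalCluster`, and `apply_replicated` are hypothetical. Each write carries the (key, version) pairs it causally depends on, and a receiving cluster buffers a replicated write until every dependency is already visible locally:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    key: str
    version: int
    value: str
    deps: tuple = ()  # (key, version) pairs this write causally depends on


class LocalCluster:
    """Exposes a replicated write only after all of its causal
    dependencies are already visible in this cluster."""

    def __init__(self):
        self.store = {}    # key -> (version, value) of the newest visible write
        self.pending = []  # replicated writes whose dependencies are unmet

    def _deps_satisfied(self, w):
        # A dependency (k, v) is met if we have seen version v (or newer) of k.
        return all(self.store.get(k, (0,))[0] >= v for k, v in w.deps)

    def apply_replicated(self, w):
        self.pending.append(w)
        self._drain()

    def _drain(self):
        # Repeatedly expose any buffered write whose dependencies are now met;
        # exposing one write may unblock others, so loop until a fixed point.
        progress = True
        while progress:
            progress = False
            for w in list(self.pending):
                if self._deps_satisfied(w):
                    self.store[w.key] = (w.version, w.value)
                    self.pending.remove(w)
                    progress = True
```

For example, if a comment that depends on a photo arrives at a remote cluster before the photo does, the comment stays buffered (invisible to readers) until the photo write arrives, so no client can ever read the comment without the photo it refers to.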
Don't settle for eventual: scalable causal consistency for wide-area storage with COPS