Abstract
In this paper, we present PaxosStore, a high-availability storage system developed to support the comprehensive business of WeChat. It employs a combinational design in the storage layer to engage multiple storage engines constructed for different storage models. PaxosStore is characteristic of extracting the Paxos-based distributed consensus protocol as a middleware that is universally accessible to the underlying multi-model storage engines. This facilitates tuning, maintaining, scaling and extending the storage engines. According to our experience in engineering practice, to achieve a practical consistent read/write protocol is far more complex than its theory. To tackle such engineering complexity, we propose a layered design of the Paxos-based storage protocol stack, where PaxosLog, the key data structure used in the protocol, is devised to bridge the programming-oriented consistent read/write to the storage-oriented Paxos procedure. Additionally, we present optimizations based on Paxos that made fault-tolerance more efficient. Discussion throughout the paper primarily focuses on pragmatic solutions that could be insightful for building practical distributed storage systems.
- R. Ananthanarayanan, V. Basker, S. Das, A. Gupta, H. Jiang, T. Qiu, et al. Photon: Fault-tolerant and scalable joining of continuous data streams. In Proc. of SIGMOD, 2013. Google ScholarDigital Library
- J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, et al. Megastore: Providing scalable, highly available storage for interactive services. In Proc. of CIDR, 2011.Google Scholar
- P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley Longman Publishing Co., Inc., 1987. Google ScholarDigital Library
- M. Burrows. The chubby lock service for loosely-coupled distributed systems. In Proc. of OSDI, 2006. Google ScholarDigital Library
- T. D. Chandra, R. Griesemer, and J. Redstone. Paxos made live: An engineering perspective. In Proc. of PODC, 2007. Google ScholarDigital Library
- J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, et al. Spanner: Google's globally-distributed database. In Proc. of OSDI, 2012. Google ScholarDigital Library
- J. Cowling, D. Myers, B. Liskov, R. Rodrigues, and L. Shrira. Hq replication: A hybrid quorum protocol for byzantine fault tolerance. In Proc. of OSDI, 2006. Google ScholarDigital Library
- R. De Prisco, B. Lampson, and N. Lynch. Revisiting the paxos algorithm. In Proc. of WDAG, 1997. Google ScholarDigital Library
- F. Färber, N. May, W. Lehner, P. Große, I. Müller, H. Rauhe, et al. The sap hana database--an architecture overview. IEEE Data Engineering Bulletin, 35(1):28--33, 2012.Google Scholar
- M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of distributed consensus with one faulty process. JACM, 32(2):374--382, 1985. Google ScholarDigital Library
- L. Glendenning, I. Beschastnikh, A. Krishnamurthy, and T. Anderson. Scalable consistency in scatter. In Proc. of SOSP, 2011. Google ScholarDigital Library
- J. Gray and L. Lamport. Consensus on transaction commit. TODS, 31(1):133--160, 2006. Google ScholarDigital Library
- A. Gupta and J. Shute. High-availability at massive scale: Building google's data infrastructure for ads. In Proc. of BIRTE, 2015.Google Scholar
- A. Gupta, F. Yang, J. Govig, A. Kirsch, K. Chan, K. Lai, et al. Mesa: Geo-replicated, near real-time, scalable data warehousing. PVLDB, 7(12):1259--1270, 2014. Google ScholarDigital Library
- S. Héman, M. Zukowski, N. J. Nes, L. Sidirourgos, and P. Boncz. Positional update handling in column stores. In Proc. of SIGMOD, 2010. Google ScholarDigital Library
- P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. Zookeeper: Wait-free coordination for internet-scale systems. In Proc. of USENIX ATC, 2010. Google ScholarDigital Library
- J. Kirsch and Y. Amir. Paxos for system builders: An overview. In Proc. of LADIS, 2008. Google ScholarDigital Library
- L. Lamport. Time, clocks, and the ordering of events in a distributed system. CACM, 21(7):558--565, 1978. Google ScholarDigital Library
- L. Lamport. The part-time parliament. TOCS, 16(2):133--169, 1998. Google ScholarDigital Library
- L. Lamport. Fast paxos. Distributed Computing, 19(2):79--103, 2006.Google ScholarDigital Library
- L. Lamport and M. Massa. Cheap paxos. In Proc. of DSN, 2004. Google ScholarDigital Library
- L. Lamport, R. Shostak, and M. Pease. The byzantine generals problem. TOPLAS, 4(3):382--401, 1982. Google ScholarDigital Library
- B. Lampson. The abcd's of paxos. In Proc. of PODC, 2001. Google ScholarDigital Library
- B. W. Lampson. How to build a highly available system using consensus. In Proc. of WDAG, 1996. Google ScholarDigital Library
- I. Moraru, D. G. Andersen, and M. Kaminsky. Paxos quorum leases: Fast reads without sacrificing writes. In Proc. of SoCC, 2014. Google ScholarDigital Library
- P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil. The log-structured merge-tree (lsm-tree). Acta Informatica, 33(4):351--385, 1996. Google ScholarDigital Library
- D. Ongaro and J. Ousterhout. In search of an understandable consensus algorithm. In Proc. of USENIX ATC, 2014. Google ScholarDigital Library
- D. R. K. Ports, J. Li, V. Liu, N. K. Sharma, and A. Krishnamurthy. Designing distributed systems using approximate synchrony in data center networks. In Proc. of NSDI, 2015. Google ScholarDigital Library
- R. D. Prisco, B. Lampson, and N. Lynch. Revisiting the paxos algorithm. TCS, 243(1-2):35--91, 2000. Google ScholarDigital Library
- R. Sears and R. Ramakrishnan. blsm: A general purpose log structured merge tree. In Proc. of SIGMOD, 2012. Google ScholarDigital Library
- J. Sheehy and D. Smith. Bitcask: a log-structured hash table for fast key/value data. White paper, Basho Technologies, 2010.Google Scholar
- I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proc. of SIGCOMM, 2001. Google ScholarDigital Library
- M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, et al. C-store: A column-oriented dbms. In Proc. of VLDB, 2005. Google ScholarDigital Library
- R. Van Renesse and D. Altinbuken. Paxos made moderately complex. CSUR, 47(3):42:1--42:36, 2015. Google ScholarDigital Library
Index Terms
- PaxosStore: high-availability storage made practical in WeChat
Recommendations
HPDA: A hybrid parity-based disk array for enhanced performance and reliability
Flash-based Solid State Drive (SSD) has been productively shipped and deployed in large scale storage systems. However, a single flash-based SSD cannot satisfy the capacity, performance and reliability requirements of the modern storage systems that ...
Higher reliability redundant disk arrays: Organization, operation, and coding
Parity is a popular form of data protection in redundant arrays of inexpensive/independent disks (RAID). RAID5 dedicates one out of N disks to parity to mask single disk failures, that is, the contents of a block on a failed disk can be reconstructed by ...
Comments