Abstract
CORFU is a global log which clients can append-to and read-from over a network. Internally, CORFU is distributed over a cluster of machines in such a way that there is no single I/O bottleneck to either appends or reads. Data is fully replicated for fault tolerance, and a modest cluster of about 16--32 machines with SSD drives can sustain 1 million 4-KByte operations per second.
The CORFU log enabled the construction of a variety of distributed applications that require strong consistency at high speeds, such as databases, transactional key-value stores, replicated state machines, and metadata services.
- 10Gen. 2011. MongoDB. http://www.10gen.com/white-papers.Google Scholar
- Anderson, T., Dahlin, M., Neefe, J., Patterson, D., Roselli, D., and Wang, R. 1995. Serverless network file systems. ACM SIGOPS Oper. Syst. Rev. 29, 109--126. Google ScholarDigital Library
- Apache. 2011. CouchDB. http://couchdb.apache.org/.Google Scholar
- Baker, J., Bond, C., Corbett, J., Furman, J., Khorlin, A., Larson, J., L'Eon, J., Li, Y., Lloyd, A., and Yushprakh, V. 2011. Megastore: providing scalable, highly available storage for interactive services. In Proceedings of the Conference on Innovative Data Systems Research (CIDR). 223--234.Google Scholar
- Balakrishnan, M., Malkhi, D., Prabhakaran, V., Wobber, T., Wei, M., and Davis, J. 2012. Corfu: A shared log design for flash clusters. In Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI'12). USENIX Association. Google ScholarDigital Library
- Balakrishnan, M., Malkhi, D., Wobber, T., Wu, M., Prabhakaran, V., Wei, M., Davis, J. D., Rao, S., Zou, T., and Zuck, A. 2013. Tango: Distributed data structures over a shared log. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP). ACM, New York. Google ScholarDigital Library
- Bernstein, P., Reid, C., and Das, S. 2011. Hyder—A transactional record manager for shared flash. In Proceedings of the 5th Biennial Conference on Innovative Data Systems Research (CIDR). 9--20.Google Scholar
- Birman, K., Malkhi, D., and Van Renesse, R. 2010. Virtually synchronous methodology for dynamic service replication. Tech. rep. MSR-TR-2010-151, Microsoft Research.Google Scholar
- Burrows, M. 2006. The chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI'06). USENIX Association, 335--350. Google ScholarDigital Library
- Calder, B., Wang, J., Ogus, A., Nilakantan, N., Skjolsvold, A., McKelvie, S., Xu, Y., Srivastav, S., Wu, J., Simitci, H., et al. 2011. Windows Azure Storage: A highly available cloud storage service with strong consistency. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP). ACM, New York, 143--157. Google ScholarDigital Library
- Chang, F., Dean, J., Ghemawat, S., Hsieh, W., Wallach, D., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. 2008. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst. 26, 2, 4. Google ScholarDigital Library
- Chockler, G. and Malkhi, D. 2005. Active disk Paxos with infinitely many processes. Distrib. Comput. 18, 1, 73--84. Google ScholarDigital Library
- Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J. J., Ghemawat, S., Gubarev, A., Heiser, C., Hochschild, P., Hsieh, W., Kanthak, S., Kogan, E., Li, H., Lloyd, A., Melnik, S., Mwaura, D., Nagle, D., Quinlan, S., Rao, R., Rolig, L., Saito, Y., Szymaniak, M., Taylor, C., Wang, R., and Woodford, D. 2012. Spanner: Google's globally-distributed database. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI'12). USENIX Association, 251--264. Google ScholarDigital Library
- Davis, J., Thacker, C. P., and Chang, C. 2009. BEE3: Revitalizing computer architecture research. Tech. rep. MSR-TR-2009-45, Microsoft Research.Google Scholar
- Decandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. 2007. Dynamo: Amazon's highly available key-value store. In Proceedings of the 21st Symposium on Operating Systems Principles (SOSP'07). Google ScholarDigital Library
- Defago, X., Schiper, A., and Urban, P. 2003. Total order broadcast and multicast algorithms: Taxonomy and survey. ACM Comput. Surv. 36, 2004. Google ScholarDigital Library
- Gafni, E. and Lamport, L. 2000. Disk Paxos. In Proceedings of the 14th International Conference on Distributed Computing (DISC'00). Springer, Berlin, 330--344. Google ScholarDigital Library
- Hartman, J. H. and Ousterhout, J. K. 1995. The zebra striped network file system. ACM Trans. Comput. Syst. 13, 3, 274--310. Google ScholarDigital Library
- Haskin, R., Malachi, Y., and Chan, G. 1988. Recovery management in quicksilver. ACM Trans. Comput. Syst. 6, 1, 82--108. Google ScholarDigital Library
- Herlihy, M. P. and Wing, J. M. 1990. Linearizability: A correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst. 12, 3, 463--492. Google ScholarDigital Library
- Holbrook, H. W., Singhal, S. K., and Cheriton, D. R. 1995. Log-based receiver-reliable multicast for distributed interactive simulation. SIGCOMM Comput. Commun. Rev. 25, 4, 328--341. Google ScholarDigital Library
- Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. 2010. Zookeeper: Wait-free coordination for internet-scale systems. In Proceedings of the USENIX Annual Technical Conference (USENIXATC'10). USENIX Association, Berkeley, CA, 11--11. Google ScholarDigital Library
- Ji, M., Veitch, A., and Wilkes, J., et al. 2003. Seneca: Remote mirroring done write. In Proceedings of the USENIX Annual Technical Conference.Google Scholar
- Junqueira, F. 2012. Durability with BookKeeper. In Proceedings of LADIS'12.Google Scholar
- Junqueira, F., Reed, B., and Yabandeh, M. 2011. Lock-free transactional support for large-scale storage systems. In Proceedings of the IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops (DSN-W). IEEE, 176--181. Google ScholarDigital Library
- Kapritsos, M. and Junqueira, F. P. 2010. Scalable agreement: Toward ordering as a service. In Proceedings of the Sixth International Conference on Hot Topics In System Dependability (HotDep'10). USENIX Association, 1--8. Google ScholarDigital Library
- Lakshman, A. and Malik, P. 2010. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev. 44, 35--40. Google ScholarDigital Library
- Lamport, L. 1978. Time, clocks, and the ordering of events in a distributed system. Comm. ACM 21, 7, 558--565. Google ScholarDigital Library
- Lamport, L. 1998. The part-time parliament. ACM Trans. Comput. Syst. 16, 133--169. Google ScholarDigital Library
- Lamport, L., Malkhi, D., and Zhou, L. 2009. Vertical Paxos and primary-backup replication. In Proceedings of the 28th ACM Symposium on Principles of Distributed Computing (PODC'09). ACM, New York, 312--313. Google ScholarDigital Library
- Lamport, L., Malkhi, D., and Zhou, L. 2010. Reconfiguring a state machine. ACM SIGACT News 41, 1, 63--73. Google ScholarDigital Library
- Lee, E. and Thekkath, C. 1996. Petal: Distributed virtual disks. ACM SIGOPS Oper. Syst. Rev. 30, 5, 84--92. Google ScholarDigital Library
- Linkedin. 2011. Voldemort. http://www.project-voldemort.com/voldemort/.Google Scholar
- Liskov, B., Ghemawat, S., Gruber, R., Johnson, P., and Shrira, L. 1991. Replication in the harp file system. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (SOSP'91). ACM, New York, 226--238. Google ScholarDigital Library
- MacCormick, J., Murphy, N., Najork, M., Thekkath, C. A., and Zhou, L. 2004. Boxwood: Abstractions as the foundation for storage infrastructure. In Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation (OSDI'04). USENIX Association, Berkeley, CA, 105--120. Google ScholarDigital Library
- Mao, Y., Junqueira, F. P., and Marzullo, K. 2008. Mencius: Building efficient replicated state machines for WANS. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI'08). USENIX Association, Berkeley, CA, 369--384. Google ScholarDigital Library
- Meyer, D. T., Aggarwal, G., Cully, B., Lefebvre, G., Feeley, M. J., Hutchinson, N. C., and Warfield, A. 2008. Parallax: virtual disks for virtual machines. SIGOPS Oper. Syst. Rev. 42, 4, 41--54. Google ScholarDigital Library
- Peng, D. and Dabek, F. 2010. Large-scale incremental processing using distributed transactions and notifications. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation. USENIX Association, Berkeley, CA, 1--15. Google ScholarDigital Library
- Rosenblum, M. and Ousterhout, J. K. 1991. The design and implementation of a log-structured file system. SIGOPS Oper. Syst. Rev. 25, 5, 1--15. Google ScholarDigital Library
- Schmuck, F. and Wylie, J. 1991. Experience with transactions in quicksilver. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (SOSP'91). ACM, New York, 239--253. Google ScholarDigital Library
- Schneider, F. B. 1990. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Surv. 22, 4, 299--319. Google ScholarDigital Library
- Seltzer, M., Smith, K. A., Balakrishnan, H., Chang, J., McMains, S., and Padmanabhan, V. 1995. File system logging versus clustering: A performance comparison. In Proceedings of the USENIX Technical Conference (TCON'95). USENIX Association, Berkeley, CA, 21--21. Google ScholarDigital Library
- Sovran, Y., Power, R., Aguilera, M. K., and Li, J. 2011. Transactional storage for geo-replicated systems. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP'11). ACM, New York, 385--400. Google ScholarDigital Library
- Spector, A. Z., Daniels, D., Duchamp, D., Eppinger, J. L., and Pausch, R. 1985. Distributed transactions for reliable systems. SIGOPS Oper. Syst. Rev. 19, 5, 127--146. Google ScholarDigital Library
- Thacker, C. P. Beehive: A many-core computer for FPGAs. Unpublished Manuscript.Google Scholar
- Thekkath, C. A., Mann, T., and Lee, E. K. 1997. Frangipani: A scalable distributed file system. In Proceedings of the 16th ACM Symposium on Operating Systems Principles (SOSP'97). ACM, New York, NY, 224--237. Google ScholarDigital Library
- Thomson, A., Diamond, T., Weng, S.-C., Ren, K., Shao, P., and Abadi, D. J. 2012. Calvin: Fast distributed transactions for partitioned database systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'12). ACM, New York, 1--12. Google ScholarDigital Library
- Van Renesse, R. and Schneider, F. B. 2004. Chain replication for supporting high throughput and availability. In Proceedings of the 6th Symposium on Operating Systems Design & Implementation (OSDI'04). USENIX Association, Berkeley, CA, 7--7. Google ScholarDigital Library
- Wei, M., Davis, J. D., Wobber, T., Balakrishnan, M., and Malkhi, D. 2013. Beyond block i/o: implementing a distributed shared log in hardware. In Proceedings of the 6th International Systems and Storage Conference (SYSTOR'13). ACM, New York, 21:1--21:11. Google ScholarDigital Library
- XILINX. 2011. Xilinx university program xupv5-lx110t development system. http://www.xilinx.com/univ/xupv5-lx110t.htm.Google Scholar
Index Terms
- CORFU: A distributed shared log
Recommendations
CORFU: a shared log design for flash clusters
NSDI'12: Proceedings of the 9th USENIX conference on Networked Systems Design and ImplementationCORFU organizes a cluster of flash devices as a single, shared log that can be accessed concurrently by multiple clients over the network. The CORFU shared log makes it easy to build distributed applications that require strong consistency at high ...
Higher reliability redundant disk arrays: Organization, operation, and coding
Parity is a popular form of data protection in redundant arrays of inexpensive/independent disks (RAID). RAID5 dedicates one out of N disks to parity to mask single disk failures, that is, the contents of a block on a failed disk can be reconstructed by ...
HPDA: A hybrid parity-based disk array for enhanced performance and reliability
Flash-based Solid State Drive (SSD) has been productively shipped and deployed in large scale storage systems. However, a single flash-based SSD cannot satisfy the capacity, performance and reliability requirements of the modern storage systems that ...
Comments