research-article

CORFU: A distributed shared log

Authors:
Mahesh Balakrishnan

Microsoft Research Silicon Valley

Microsoft Research Silicon Valley
View Profile

,
Dahlia Malkhi

Microsoft Research Silicon Valley

Microsoft Research Silicon Valley
View Profile

,
John D. Davis

Microsoft Research Silicon Valley

Microsoft Research Silicon Valley
View Profile

,
Vijayan Prabhakaran

Microsoft Research Silicon Valley

Microsoft Research Silicon Valley
View Profile

,
Michael Wei

University of California, San Diego

University of California, San Diego
View Profile

,
Ted Wobber

Microsoft Research Silicon Valley

Microsoft Research Silicon Valley
View Profile

Authors Info & Claims

ACM Transactions on Computer Systems Volume 31 Issue 4Article No.: 10pp 1–24https://doi.org/10.1145/2535930

Published:20 December 2013Publication History

ACM Transactions on Computer Systems

Abstract

CORFU is a global log which clients can append-to and read-from over a network. Internally, CORFU is distributed over a cluster of machines in such a way that there is no single I/O bottleneck to either appends or reads. Data is fully replicated for fault tolerance, and a modest cluster of about 16--32 machines with SSD drives can sustain 1 million 4-KByte operations per second.

The CORFU log enabled the construction of a variety of distributed applications that require strong consistency at high speeds, such as databases, transactional key-value stores, replicated state machines, and metadata services.

References

10Gen. 2011. MongoDB. http://www.10gen.com/white-papers.Google Scholar
Anderson, T., Dahlin, M., Neefe, J., Patterson, D., Roselli, D., and Wang, R. 1995. Serverless network file systems. ACM SIGOPS Oper. Syst. Rev. 29, 109--126. Google ScholarDigital Library
Apache. 2011. CouchDB. http://couchdb.apache.org/.Google Scholar
Baker, J., Bond, C., Corbett, J., Furman, J., Khorlin, A., Larson, J., L'Eon, J., Li, Y., Lloyd, A., and Yushprakh, V. 2011. Megastore: providing scalable, highly available storage for interactive services. In Proceedings of the Conference on Innovative Data Systems Research (CIDR). 223--234.Google Scholar
Balakrishnan, M., Malkhi, D., Prabhakaran, V., Wobber, T., Wei, M., and Davis, J. 2012. Corfu: A shared log design for flash clusters. In Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI'12). USENIX Association. Google ScholarDigital Library
Balakrishnan, M., Malkhi, D., Wobber, T., Wu, M., Prabhakaran, V., Wei, M., Davis, J. D., Rao, S., Zou, T., and Zuck, A. 2013. Tango: Distributed data structures over a shared log. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP). ACM, New York. Google ScholarDigital Library
Bernstein, P., Reid, C., and Das, S. 2011. Hyder—A transactional record manager for shared flash. In Proceedings of the 5th Biennial Conference on Innovative Data Systems Research (CIDR). 9--20.Google Scholar
Birman, K., Malkhi, D., and Van Renesse, R. 2010. Virtually synchronous methodology for dynamic service replication. Tech. rep. MSR-TR-2010-151, Microsoft Research.Google Scholar
Burrows, M. 2006. The chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI'06). USENIX Association, 335--350. Google ScholarDigital Library
Calder, B., Wang, J., Ogus, A., Nilakantan, N., Skjolsvold, A., McKelvie, S., Xu, Y., Srivastav, S., Wu, J., Simitci, H., et al. 2011. Windows Azure Storage: A highly available cloud storage service with strong consistency. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP). ACM, New York, 143--157. Google ScholarDigital Library
Chang, F., Dean, J., Ghemawat, S., Hsieh, W., Wallach, D., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. 2008. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst. 26, 2, 4. Google ScholarDigital Library
Chockler, G. and Malkhi, D. 2005. Active disk Paxos with infinitely many processes. Distrib. Comput. 18, 1, 73--84. Google ScholarDigital Library
Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J. J., Ghemawat, S., Gubarev, A., Heiser, C., Hochschild, P., Hsieh, W., Kanthak, S., Kogan, E., Li, H., Lloyd, A., Melnik, S., Mwaura, D., Nagle, D., Quinlan, S., Rao, R., Rolig, L., Saito, Y., Szymaniak, M., Taylor, C., Wang, R., and Woodford, D. 2012. Spanner: Google's globally-distributed database. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI'12). USENIX Association, 251--264. Google ScholarDigital Library
Davis, J., Thacker, C. P., and Chang, C. 2009. BEE3: Revitalizing computer architecture research. Tech. rep. MSR-TR-2009-45, Microsoft Research.Google Scholar
Decandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. 2007. Dynamo: Amazon's highly available key-value store. In Proceedings of the 21st Symposium on Operating Systems Principles (SOSP'07). Google ScholarDigital Library
Defago, X., Schiper, A., and Urban, P. 2003. Total order broadcast and multicast algorithms: Taxonomy and survey. ACM Comput. Surv. 36, 2004. Google ScholarDigital Library
Gafni, E. and Lamport, L. 2000. Disk Paxos. In Proceedings of the 14th International Conference on Distributed Computing (DISC'00). Springer, Berlin, 330--344. Google ScholarDigital Library
Hartman, J. H. and Ousterhout, J. K. 1995. The zebra striped network file system. ACM Trans. Comput. Syst. 13, 3, 274--310. Google ScholarDigital Library
Haskin, R., Malachi, Y., and Chan, G. 1988. Recovery management in quicksilver. ACM Trans. Comput. Syst. 6, 1, 82--108. Google ScholarDigital Library
Herlihy, M. P. and Wing, J. M. 1990. Linearizability: A correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst. 12, 3, 463--492. Google ScholarDigital Library
Holbrook, H. W., Singhal, S. K., and Cheriton, D. R. 1995. Log-based receiver-reliable multicast for distributed interactive simulation. SIGCOMM Comput. Commun. Rev. 25, 4, 328--341. Google ScholarDigital Library
Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. 2010. Zookeeper: Wait-free coordination for internet-scale systems. In Proceedings of the USENIX Annual Technical Conference (USENIXATC'10). USENIX Association, Berkeley, CA, 11--11. Google ScholarDigital Library
Ji, M., Veitch, A., and Wilkes, J., et al. 2003. Seneca: Remote mirroring done write. In Proceedings of the USENIX Annual Technical Conference.Google Scholar
Junqueira, F. 2012. Durability with BookKeeper. In Proceedings of LADIS'12.Google Scholar
Junqueira, F., Reed, B., and Yabandeh, M. 2011. Lock-free transactional support for large-scale storage systems. In Proceedings of the IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops (DSN-W). IEEE, 176--181. Google ScholarDigital Library
Kapritsos, M. and Junqueira, F. P. 2010. Scalable agreement: Toward ordering as a service. In Proceedings of the Sixth International Conference on Hot Topics In System Dependability (HotDep'10). USENIX Association, 1--8. Google ScholarDigital Library
Lakshman, A. and Malik, P. 2010. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev. 44, 35--40. Google ScholarDigital Library
Lamport, L. 1978. Time, clocks, and the ordering of events in a distributed system. Comm. ACM 21, 7, 558--565. Google ScholarDigital Library
Lamport, L. 1998. The part-time parliament. ACM Trans. Comput. Syst. 16, 133--169. Google ScholarDigital Library
Lamport, L., Malkhi, D., and Zhou, L. 2009. Vertical Paxos and primary-backup replication. In Proceedings of the 28th ACM Symposium on Principles of Distributed Computing (PODC'09). ACM, New York, 312--313. Google ScholarDigital Library
Lamport, L., Malkhi, D., and Zhou, L. 2010. Reconfiguring a state machine. ACM SIGACT News 41, 1, 63--73. Google ScholarDigital Library
Lee, E. and Thekkath, C. 1996. Petal: Distributed virtual disks. ACM SIGOPS Oper. Syst. Rev. 30, 5, 84--92. Google ScholarDigital Library
Linkedin. 2011. Voldemort. http://www.project-voldemort.com/voldemort/.Google Scholar
Liskov, B., Ghemawat, S., Gruber, R., Johnson, P., and Shrira, L. 1991. Replication in the harp file system. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (SOSP'91). ACM, New York, 226--238. Google ScholarDigital Library
MacCormick, J., Murphy, N., Najork, M., Thekkath, C. A., and Zhou, L. 2004. Boxwood: Abstractions as the foundation for storage infrastructure. In Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation (OSDI'04). USENIX Association, Berkeley, CA, 105--120. Google ScholarDigital Library
Mao, Y., Junqueira, F. P., and Marzullo, K. 2008. Mencius: Building efficient replicated state machines for WANS. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI'08). USENIX Association, Berkeley, CA, 369--384. Google ScholarDigital Library
Meyer, D. T., Aggarwal, G., Cully, B., Lefebvre, G., Feeley, M. J., Hutchinson, N. C., and Warfield, A. 2008. Parallax: virtual disks for virtual machines. SIGOPS Oper. Syst. Rev. 42, 4, 41--54. Google ScholarDigital Library
Peng, D. and Dabek, F. 2010. Large-scale incremental processing using distributed transactions and notifications. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation. USENIX Association, Berkeley, CA, 1--15. Google ScholarDigital Library
Rosenblum, M. and Ousterhout, J. K. 1991. The design and implementation of a log-structured file system. SIGOPS Oper. Syst. Rev. 25, 5, 1--15. Google ScholarDigital Library
Schmuck, F. and Wylie, J. 1991. Experience with transactions in quicksilver. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (SOSP'91). ACM, New York, 239--253. Google ScholarDigital Library
Schneider, F. B. 1990. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Surv. 22, 4, 299--319. Google ScholarDigital Library
Seltzer, M., Smith, K. A., Balakrishnan, H., Chang, J., McMains, S., and Padmanabhan, V. 1995. File system logging versus clustering: A performance comparison. In Proceedings of the USENIX Technical Conference (TCON'95). USENIX Association, Berkeley, CA, 21--21. Google ScholarDigital Library
Sovran, Y., Power, R., Aguilera, M. K., and Li, J. 2011. Transactional storage for geo-replicated systems. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP'11). ACM, New York, 385--400. Google ScholarDigital Library
Spector, A. Z., Daniels, D., Duchamp, D., Eppinger, J. L., and Pausch, R. 1985. Distributed transactions for reliable systems. SIGOPS Oper. Syst. Rev. 19, 5, 127--146. Google ScholarDigital Library
Thacker, C. P. Beehive: A many-core computer for FPGAs. Unpublished Manuscript.Google Scholar
Thekkath, C. A., Mann, T., and Lee, E. K. 1997. Frangipani: A scalable distributed file system. In Proceedings of the 16th ACM Symposium on Operating Systems Principles (SOSP'97). ACM, New York, NY, 224--237. Google ScholarDigital Library
Thomson, A., Diamond, T., Weng, S.-C., Ren, K., Shao, P., and Abadi, D. J. 2012. Calvin: Fast distributed transactions for partitioned database systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'12). ACM, New York, 1--12. Google ScholarDigital Library
Van Renesse, R. and Schneider, F. B. 2004. Chain replication for supporting high throughput and availability. In Proceedings of the 6th Symposium on Operating Systems Design & Implementation (OSDI'04). USENIX Association, Berkeley, CA, 7--7. Google ScholarDigital Library
Wei, M., Davis, J. D., Wobber, T., Balakrishnan, M., and Malkhi, D. 2013. Beyond block i/o: implementing a distributed shared log in hardware. In Proceedings of the 6th International Systems and Storage Conference (SYSTOR'13). ACM, New York, 21:1--21:11. Google ScholarDigital Library
XILINX. 2011. Xilinx university program xupv5-lx110t development system. http://www.xilinx.com/univ/xupv5-lx110t.htm.Google Scholar

Index Terms

CORFU: A distributed shared log
1. Information systems
  1. Information retrieval
    1. Search engine architectures and scalability
      1. Distributed retrieval
      2. Peer-to-peer retrieval
  2. Information storage systems
    1. Storage architectures
      1. Distributed storage
2. Software and its engineering
  1. Software organization and properties
    1. Software system structures
      1. Distributed systems organizing principles
        Organizing principles for web applications

Recommendations

CORFU: a shared log design for flash clusters
NSDI'12: Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation

CORFU organizes a cluster of flash devices as a single, shared log that can be accessed concurrently by multiple clients over the network. The CORFU shared log makes it easy to build distributed applications that require strong consistency at high ...
Read More
Higher reliability redundant disk arrays: Organization, operation, and coding

Parity is a popular form of data protection in redundant arrays of inexpensive/independent disks (RAID). RAID5 dedicates one out of N disks to parity to mask single disk failures, that is, the contents of a block on a failed disk can be reconstructed by ...
Read More
HPDA: A hybrid parity-based disk array for enhanced performance and reliability

Flash-based Solid State Drive (SSD) has been productively shipped and deployed in large scale storage systems. However, a single flash-based SSD cannot satisfy the capacity, performance and reliability requirements of the modern storage systems that ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on Computer Systems Volume 31, Issue 4
December 2013
90 pages
ISSN:0734-2071
EISSN:1557-7333
DOI:10.1145/2542150
Editor:
Todd C. Mowry
Issue’s Table of Contents
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 20 December 2013
- Accepted: 1 March 2013
- Received: 1 December 2012
Published in tocs Volume 31, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 60
  Total Citations
  View Citations
- 1,718
  Total Downloads
- Downloads (Last 12 months)73
- Downloads (Last 6 weeks)4
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

CORFU: A distributed shared log

ACM Transactions on Computer Systems

Abstract

References

Cited By

Index Terms

Recommendations

CORFU: a shared log design for flash clusters

Higher reliability redundant disk arrays: Organization, operation, and coding

HPDA: A hybrid parity-based disk array for enhanced performance and reliability

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

CORFU: A distributed shared log

ACM Transactions on Computer Systems

Abstract

References

Cited By

Index Terms

Recommendations

CORFU: a shared log design for flash clusters

Higher reliability redundant disk arrays: Organization, operation, and coding

HPDA: A hybrid parity-based disk array for enhanced performance and reliability

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media