Abstract
This paper describes the design, implementation, and evaluation of a Federated Array of Bricks (FAB), a distributed disk array that provides the reliability of traditional enterprise arrays with lower cost and better scalability. FAB is built from a collection of bricks, small storage appliances containing commodity disks, CPU, NVRAM, and network interface cards. FAB deploys a new majority-voting-based algorithm to replicate or erasure-code logical blocks across bricks and a reconfiguration algorithm to move data in the background when bricks are added or decommissioned. We argue that voting is practical and necessary for reliable, high-throughput storage systems such as FAB. We have implemented a FAB prototype on a 22-node Linux cluster. This prototype sustains 85MB/second of throughput for a database workload, and 270MB/second for a bulk-read workload. In addition, it can outperform traditional master-slave replication through performance decoupling and can handle brick failures and recoveries smoothly without disturbing client requests.
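To make the majority-voting idea concrete, here is a minimal single-process sketch of a majority-quorum replicated block. It is not FAB's actual algorithm, which the abstract does not spell out. It is a simplified timestamp-based register, and the names `Brick`, `VotingRegister`, and `QuorumError` are hypothetical. The point is only that an operation can finish once any majority of bricks responds, so one slow or failed brick need not stall clients.

```python
from dataclasses import dataclass


@dataclass
class Brick:
    """Hypothetical storage node holding one timestamped logical block."""
    up: bool = True
    ts: int = 0
    value: bytes = b""


class QuorumError(Exception):
    """Raised when fewer than a majority of bricks are reachable."""


class VotingRegister:
    """Simplified majority-quorum register (an assumption, not FAB's protocol)."""

    def __init__(self, n: int) -> None:
        self.bricks = [Brick() for _ in range(n)]
        self.majority = n // 2 + 1

    def _live(self) -> list:
        # In a real system these would be network replies; here we just
        # filter on a flag that simulates brick failure.
        live = [b for b in self.bricks if b.up]
        if len(live) < self.majority:
            raise QuorumError("fewer than a majority of bricks responded")
        return live

    def write(self, value: bytes) -> None:
        live = self._live()
        # Choose a timestamp newer than anything a majority has seen.
        new_ts = max(b.ts for b in live) + 1
        for b in live:
            b.ts, b.value = new_ts, value

    def read(self) -> bytes:
        live = self._live()
        newest = max(live, key=lambda b: b.ts)
        # Write the newest version back so a majority holds what we return.
        for b in live:
            if b.ts < newest.ts:
                b.ts, b.value = newest.ts, newest.value
        return newest.value
```

Because any two majorities share at least one brick, a read always sees the latest completed write, even when the brick that missed the write has since recovered and a different brick is down.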
FAB: building distributed enterprise disk arrays from commodity components
ASPLOS XI: Proceedings of the 11th international conference on Architectural support for programming languages and operating systems