FAB: building distributed enterprise disk arrays from commodity components

Published: 7 October 2004

Abstract

This paper describes the design, implementation, and evaluation of a Federated Array of Bricks (FAB), a distributed disk array that provides the reliability of traditional enterprise arrays with lower cost and better scalability. FAB is built from a collection of bricks, small storage appliances containing commodity disks, CPU, NVRAM, and network interface cards. FAB deploys a new majority-voting-based algorithm to replicate or erasure-code logical blocks across bricks and a reconfiguration algorithm to move data in the background when bricks are added or decommissioned. We argue that voting is practical and necessary for reliable, high-throughput storage systems such as FAB. We have implemented a FAB prototype on a 22-node Linux cluster. This prototype sustains 85MB/second of throughput for a database workload, and 270MB/second for a bulk-read workload. In addition, it can outperform traditional master-slave replication through performance decoupling and can handle brick failures and recoveries smoothly without disturbing client requests.
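To make the abstract's voting scheme concrete, the sketch below illustrates the general majority-voting (quorum) technique in Python. It is a minimal, single-process illustration under simplifying assumptions: the Brick, Replica, and Coordinator classes, the logical-timestamp field, and the in-process method calls are hypothetical stand-ins, not FAB's actual protocol or API, and the sketch ignores erasure coding, partial writes, and coordinator failures, which the paper's algorithm handles.

# Minimal sketch of majority-voting replication of one logical block.
# All names are illustrative; this is not FAB's protocol.

from dataclasses import dataclass

@dataclass
class Replica:
    value: bytes = b""
    timestamp: int = 0          # logical timestamp attached by the coordinator

class Brick:
    """One storage node holding a single logical block (simplified)."""
    def __init__(self):
        self.replica = Replica()

    def read(self) -> Replica:
        return self.replica

    def write(self, value: bytes, timestamp: int) -> None:
        # Accept only writes newer than what is already stored.
        if timestamp > self.replica.timestamp:
            self.replica = Replica(value, timestamp)

class Coordinator:
    """Coordinates quorum reads and writes across a set of bricks."""
    def __init__(self, bricks):
        self.bricks = bricks
        self.clock = 0

    def _majority(self) -> int:
        return len(self.bricks) // 2 + 1

    def write(self, value: bytes) -> None:
        self.clock += 1
        acks = 0
        for b in self.bricks:   # a real system would issue these over the network
            b.write(value, self.clock)
            acks += 1
        # A majority of acknowledgements is enough to consider the write durable.
        assert acks >= self._majority()

    def read(self) -> bytes:
        # Consult any majority of bricks and return the newest value seen.
        replies = [b.read() for b in self.bricks[: self._majority()]]
        newest = max(replies, key=lambda r: r.timestamp)
        return newest.value

if __name__ == "__main__":
    bricks = [Brick() for _ in range(3)]
    coordinator = Coordinator(bricks)
    coordinator.write(b"block-0 contents")
    print(coordinator.read())   # b'block-0 contents'

The reason majorities suffice is that any two majorities of bricks intersect in at least one brick, so a read that consults a majority always encounters at least one replica carrying the newest committed timestamp.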



Reviews

Elliot Jaffe

In recent years, file system research has focused on using the massive quantities of commodity disk space now residing in each desktop personal computer (PC). Desktops are frequently shut down or rebooted, so significant research has focused on availability and reliability in such systems; these features usually come at the expense of performance. The federated array of bricks (FAB) system described in this paper addresses a related problem: how to build a stable, reliable, and high-performance file system out of commodity equipment. Specifically, what kind of file system can you build if the basic unit consists of a central processing unit (CPU), nonvolatile random access memory (NVRAM), disks, and a network connection?

The paper concentrates on the internal algorithms and designs used to create a working prototype on top of a cluster of Linux servers. The algorithms maintain system consistency and correctness in the face of non-Byzantine node failures; voting is used to maintain file system data and to record and synchronize the state of the various system components. Researchers working on quorum-based storage systems will find the practical extensions and implementations developed here of particular interest. The level of detail presented in the paper suggests that the researchers did a thorough job of exploring, developing, and implementing a basic file system, and the evaluation section supports the contention that these algorithms do indeed perform as advertised.

Missing from the presentation are real-world issues related to multiple clients, such as disk contention, file locking, and access controls. The system claims to support multiple clients, but all of the tests seem to be generated by single clients. In many such reports, it is clear that the evaluated system was a proof of concept; the FAB system seems sufficiently robust that I found myself wondering how close the researchers came to a production file system, in terms of compatibility and compliance.


Published in

ACM SIGARCH Computer Architecture News, Volume 32, Issue 5 (ASPLOS 2004), December 2004, 283 pages. ISSN: 0163-5964. DOI: 10.1145/1037947.

Also appears in: ASPLOS XI: Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems, October 2004, 296 pages. ISBN: 1581138040. DOI: 10.1145/1024393.

Copyright © 2004 ACM


Publisher: Association for Computing Machinery, New York, NY, United States

