FAB: building distributed enterprise disk arrays from commodity components

Authors: Yasushi Saito, Svend Frølund, Alistair Veitch, Arif Merchant, Susan Spence
Hewlett-Packard Laboratories

ACM SIGARCH Computer Architecture News, Volume 32, Issue 5, December 2004, pp. 48–58. https://doi.org/10.1145/1037947.1024400

Published: 07 October 2004

Abstract

This paper describes the design, implementation, and evaluation of a Federated Array of Bricks (FAB), a distributed disk array that provides the reliability of traditional enterprise arrays with lower cost and better scalability. FAB is built from a collection of bricks, small storage appliances containing commodity disks, CPU, NVRAM, and network interface cards. FAB deploys a new majority-voting-based algorithm to replicate or erasure-code logical blocks across bricks and a reconfiguration algorithm to move data in the background when bricks are added or decommissioned. We argue that voting is practical and necessary for reliable, high-throughput storage systems such as FAB. We have implemented a FAB prototype on a 22-node Linux cluster. This prototype sustains 85MB/second of throughput for a database workload, and 270MB/second for a bulk-read workload. In addition, it can outperform traditional master-slave replication through performance decoupling and can handle brick failures and recoveries smoothly without disturbing client requests.
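
The abstract's idea of majority voting for replicated logical blocks can be illustrated with a small in-process simulation. This is only a sketch under stated assumptions and not FAB's actual protocol: the names `Brick`, `write_block`, `read_block`, and `QuorumUnavailable` are hypothetical, timestamps are supplied by the caller, and network messages, erasure coding, and reconfiguration are all left out. It shows only why any two majorities overlap, so a read that reaches a majority sees the newest completed write.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Brick:
    """Hypothetical brick: maps block number -> (timestamp, data)."""
    up: bool = True
    store: Dict[int, Tuple[int, bytes]] = field(default_factory=dict)


class QuorumUnavailable(Exception):
    """Raised when fewer than a majority of bricks are reachable."""


def _live_majority(bricks: List[Brick]) -> List[Brick]:
    # A majority is floor(n/2) + 1; any two majorities share a brick.
    live = [b for b in bricks if b.up]
    if len(live) < len(bricks) // 2 + 1:
        raise QuorumUnavailable(f"only {len(live)} of {len(bricks)} bricks up")
    return live


def write_block(bricks: List[Brick], block: int, ts: int, data: bytes) -> int:
    """Store (ts, data) on every reachable brick; needs a majority."""
    live = _live_majority(bricks)
    for b in live:
        current_ts, _ = b.store.get(block, (0, b""))
        if ts > current_ts:  # never overwrite a newer version
            b.store[block] = (ts, data)
    return len(live)


def read_block(bricks: List[Brick], block: int) -> bytes:
    """Return the newest version seen by a majority, repairing stale replicas."""
    live = _live_majority(bricks)
    ts, data = max((b.store.get(block, (0, b"")) for b in live),
                   key=lambda version: version[0])
    for b in live:  # write back so lagging replicas catch up
        if b.store.get(block, (0, b""))[0] < ts:
            b.store[block] = (ts, data)
    return data


def demo() -> bytes:
    bricks = [Brick() for _ in range(3)]
    write_block(bricks, 7, 1, b"v1")
    bricks[0].up = False               # one brick fails
    write_block(bricks, 7, 2, b"v2")   # still succeeds with 2 of 3
    bricks[0].up, bricks[1].up = True, False
    return read_block(bricks, 7)       # stale brick 0 plus fresh brick 2
```

In `demo()`, the read contacts one brick that missed the second write and one that received it; the higher timestamp wins and the stale replica is repaired. With only one of three bricks reachable, both operations raise `QuorumUnavailable` instead of returning possibly stale data.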



Reviews

Reviewer: Elliot Jaffe

In recent years, file system research has focused on using the massive quantities of commodity disk space now residing in each desktop personal computer (PC). Desktops are frequently shut down or rebooted, and, hence, significant research has focused on availability and reliability in such systems. These features usually come at the expense of system performance. The federated array of bricks (FAB) system described in this paper focuses on a related problem: how to build a stable, reliable, and high-performance file system out of commodity equipment. Specifically, what kind of file system can you build if the basic unit is made up of a central processing unit (CPU), nonvolatile random access memory (NVRAM), disks, and a network connection?

The paper focuses on the internal algorithms and designs that were used to create a working prototype on top of a cluster of Linux servers. The algorithms focus on maintaining system consistency and correctness in the face of non-Byzantine node failures. Voting algorithms are used to maintain file system data, and to record and synchronize the various system components. Researchers interested in quorum-based storage systems will be interested in the practical extensions and implementations that were developed for this system. The level of detail presented in the paper suggests that the researchers did a thorough job of exploring, developing, and implementing a basic file system. The evaluation section supports the contention that these algorithms do indeed perform as advertised.

Missing from the presentation are real-world issues related to multiple clients, such as disk contention, file locking, and access controls. The system claims to support multiple clients, but all of the system tests seem to be generated by single clients. In many reports, it is clear that the evaluated system was a proof of concept. The FAB system seems sufficiently robust, and I found myself wondering how close the researchers came to a production file system, in terms of compatibility and compliance.

Online Computing Reviews Service

Published in
ACM SIGARCH Computer Architecture News, Volume 32, Issue 5 (ASPLOS 2004)
December 2004, 283 pages
ISSN: 0163-5964
DOI: 10.1145/1037947

ASPLOS XI: Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
October 2004, 296 pages
ISBN: 1581138040
DOI: 10.1145/1024393
General Chair: Shubu Mukherjee, Intel Corporation
Program Chair: Kathryn S. McKinley, University of Texas at Austin
Copyright © 2004 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
consensus
disk array
erasure coding
replication
storage
voting