research-article

Purity: Building Fast, Highly-Available Enterprise Flash Storage from Commodity Components

Authors:
John Colgrove

Pure Storage, Mountain View, CA, USA

Pure Storage, Mountain View, CA, USA
View Profile

,
John D. Davis

Pure Storage, Mountain View, CA, USA

Pure Storage, Mountain View, CA, USA
View Profile

,
John Hayes

Pure Storage, Mountain View, CA, USA

Pure Storage, Mountain View, CA, USA
View Profile

,
Ethan L. Miller

Pure Storage, Mountain View, CA, USA

Pure Storage, Mountain View, CA, USA
View Profile

,
Cary Sandvig

Pure Storage, Mountain View, CA, USA

Pure Storage, Mountain View, CA, USA
View Profile

,
Russell Sears

Pure Storage, Mountain View, CA, USA

Pure Storage, Mountain View, CA, USA
View Profile

,
Ari Tamches

Pure Storage, Mountain View, CA, USA

Pure Storage, Mountain View, CA, USA
View Profile

,
Neil Vachharajani

Pure Storage, Mountain View, CA, USA

Pure Storage, Mountain View, CA, USA
View Profile

,
Feng Wang

Pure Storage, Mountain View, CA, USA

Pure Storage, Mountain View, CA, USA
View Profile

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of DataMay 2015Pages 1683–1694https://doi.org/10.1145/2723372.2742798

Published:27 May 2015Publication History

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Pages 1683–1694

ABSTRACT

Although flash storage has largely replaced hard disks in consumer class devices, enterprise workloads pose unique challenges that have slowed adoption of flash in ``performance tier'' storage appliances. In this paper, we describe Purity, the foundation of Pure Storage's Flash Arrays, the first all-flash enterprise storage system to support compression, deduplication, and high-availability.

Purity borrows techniques from modern database and key-value storage architectures, and introduces novel storage primitives that have wide applicability to data management systems. For instance, all writes in Purity are monotonic, and deletions are handled using an atomic predicate-based tuple elision primitive.

Purity's redundancy mechanisms are optimized for SSD failure modes and performance characteristics, allowing for fast recovery from component failures and lower space overhead than the best hard disk systems. We built deduplication and data compression schemes atop these primitives.

Flash changes storage capacity/performance tradeoffs: unlike disk-based systems, flash deployments are rarely performance bound. A single Purity appliance can provide over 7GiB/s of throughput on 32KiB random I/Os, even through multiple device failures, and while providing asynchronous off-site replication. Typical installations have 99.9% latencies under 1ms, and production arrays average 5.4x data reduction and 99.999% availability.

Purity takes advantage of storage performance increasing more rapidly than computational performance to build a simpler (with respect to engineering, installation, and management) scale-up storage appliance that supports hundreds of terabytes of highly-available, high-performance storage. The resulting performance and capacity supports many customer deployments of multiple applications, including scale-out and parallel systems, such as MongoDB and Oracle RAC, on a single Purity appliance.

References

D. Abadi, S. Madden, and M. Ferreira. Integrating compression and execution in column-oriented database systems. In Proc. SIGMOD Conf., pages 671--682, 2006. Google ScholarDigital Library
D. Agrawal, D. Ganesan, R. Sitaraman, Y. Diao, and S. Singh. Lazy-adaptive tree: An optimized index structure for flash devices. In Proc. 35th VLDB Conf., Aug. 2009. Google ScholarDigital Library
P. Alvaro, N. Conway, J. Hellerstein, and W. R. Marczak. Consistency analysis in Bloom: a CALM and collected approach. In Proc. CIDR, pages 249--260, 2011.Google Scholar
T. J. Ameloot and J. Van den Bussche. Positive Dedalus programs tolerate non-causality. Journal of Computer and System Sciences, 80(7):1191--1213, 2014.Google ScholarCross Ref
A. Anand, C. Muthukrishnan, S. Kappes, A. Akella, and S. Nath. Cheap and large CAMs for high performance data-intensive networked systems. In Proc. 7th NSDI Symp., 2010. Google ScholarDigital Library
D. G. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan. FAWN: A fast array of wimpy nodes. In Proc. 22nd SOSP Conf., pages 1--14, Oct. 2009. Google ScholarDigital Library
B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny. Workload analysis of a large-scale key-value store. In Proc. 2012 SIGMETRICS, pages 53--64, 2012. Google ScholarDigital Library
M. Balakrishnan, D. Malkhi, V. Prabhakaran, T. Wobber, M. Wei, and J. D. Davis. CORFU: A shared log design for flash clusters. In Proc. 9th NSDI Symp., Apr. 2012. Google ScholarDigital Library
M. Balakrishnan, D. Malkhi, T. Wobber, M. Wu, V. Prabhakaran, M. Wei, J. D. Davis, S. Rao, T. Zou, and A. Zuck. Tango: Distributed data structures over a shared log. In Proc. 24th SOSP Conf., pages 325--340, Nov. 2013. Google ScholarDigital Library
M. A. Bender, M. Farach-Colton, J. T. Fineman, Y. R. Fogel, B. C. Kuszmaul, and J. Nelson. Cache-oblivious streaming B-trees. In Proc. 19th Symp. on Parallel Algorithms and Architectures, pages 81--92, 2007. Google ScholarDigital Library
E. A. Brewer. Lessons from giant-scale services. IEEE Internet Computing, 5(4):46--55, Aug. 2001. Google ScholarDigital Library
M. Burrows, C. Jerian, B. Lampson, and T. Mann. On-line data compression in a log-structured file system. In Proc. 5th ASPLOS, pages 2--9, Oct. 1992. Google ScholarDigital Library
F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. In Proc. 7th OSDI Symp., Nov. 2006. Google ScholarDigital Library
F. Chen, T. Luo, and X. Zhang. CAFTL: A content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives. In Proc. 9th FAST Conf., Feb. 2011. Google ScholarDigital Library
B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!'s hosted data serving platform. In Proc. 34th VLDB Conf., Aug. 2008. Google ScholarDigital Library
B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with ycsb. In Proc. 1st ACM Symp. on Cloud Computing, 2010. Google ScholarDigital Library
B. Cully, J. Wires, D. Meyer, K. Jamieson, K. Fraser, T. Deegan, D. Stodden, G. Lefebvre, D. Ferstay, and A. Warfield. Strata: High-performance scalable storage on virtualized non-volatile memory. In Proc. 12th FAST Conf., Feb. 2014. Google ScholarDigital Library
J. Dean. Designs, lessons and advice from building large distributed systems. Keynote from LADIS, 2009.Google Scholar
J. Dean and L. A. Barroso. The tail at scale. Communications of the ACM, 56(2):74--80, 2013. Google ScholarDigital Library
B. Debnath, M. F. Mokbel, D. J. Lilja, and D. Du. Deferred updates for flash-based storage. In Proc. 26th IEEE Conf. on Mass Storage Systems and Technologies, 2010. Google ScholarDigital Library
B. Debnath, S. Sengupta, and J. Li. ChunkStash: Speeding up inline storage deduplication using flash memory. In Proc. 2010 USENIX Annual Technical Conference, June 2010. Google ScholarDigital Library
B. Debnath, S. Sengupta, and J. Li. FlashStore: High throughput persistent key-value store. In Proc. 36th VLDB Conf., Sept. 2010. Google ScholarDigital Library
B. Debnath, S. Sengupta, and J. Li. SkimpyStash: RAM space skimpy key-value store on flash-based storage. In Proc. SIGMOD Conf., 2011. Google ScholarDigital Library
J. Gray and G. Graefe. The five-minute rule ten years later, and other computer storage rules of thumb. ACM Sigmod Record, 26(4):63--68, 1997. Google ScholarDigital Library
J. Gray, P. Helland, P. O'Neil, and D. Shasha. The dangers of replication and a solution. In Proc. SIGMOD Conf., pages 173--182, June 1996. Google ScholarDigital Library
J. N. Gray, R. A. Lorie, G. R. Putzolu, and I. L. Traiger. Granularity of locks and degrees of consistency in a shared data base. In Proc. IFIP Working Conf. on Modelling in Data Base Management Systems, pages 365--394, 1976.Google Scholar
K. M. Greenan, D. D. Long, E. L. Miller, T. J. E. Schwarz, S.J., and A. Wildani. Building flexible, fault-tolerant flash-based storage systems. In Proc. 5th Workshop on Hot Topics in System Dependability, June 2009.Google Scholar
L. M. Grupp, J. D. Davis, and S. Swanson. The bleak future of NAND flash memory. In Proc. 10th FAST Conf., Feb. 2012. Google ScholarDigital Library
L. M. Grupp, J. D. Davis, and S. Swanson. The harey tortoise: Managing heterogeneous write performance in SSDs. In Proc. 2013 USENIX Annual Technical Conference, June 2013. Google ScholarDigital Library
A. Gupta, R. Pisolkar, B. Urgaonkar, and A. Sivasubramaniam. Leveraging value locality in optimizing NAND flash-based SSDs. In Proc. 9th FAST Conf., Feb. 2011. Google ScholarDigital Library
J. Hamilton. Why scale matters and why the cloud is different. AWS re:Invent, 2013.Google Scholar
J. Hamilton. AWS innovation at scale. AWS re:Invent 2014.Google Scholar
A. L. Holloway, V. Raman, G. Swart, and D. J. DeWitt. How to barter bits for chronons: compression and bandwidth trade offs for database scans. In Proc. SIGMOD Conf., pages 389--400, 2007. Google ScholarDigital Library
B. R. Iyer and D. Wilhite. Data compression support in databases. In Proc. 20th VLDB Conf., pages 695--704, 1994. Google ScholarDigital Library
C. Jermaine, E. Omiecinski, and W. G. Yee. The partitioned exponential file for database storage management. Proc. VLDB Endowment, 16:417--437, 2007. Google ScholarDigital Library
K. Jin and E. L. Miller. The effectiveness of deduplication on virtual machine disk images. In Proc. SYSTOR 2009, May 2009. Google ScholarDigital Library
W. K. Josephson, L. A. Bongo, K. Li, and D. Flynn. DFS: A file system for virtualized flash storage. ACM Trans. on Storage, 6(3), Sept. 2010. Google ScholarDigital Library
Y. Kang and E. L. Miller. Adding aggressive error correction to a high-performance compressing flash file system. In Proc. 9th ACM & IEEE Conf. on Embedded Software (EMSOFT '09), Oct. 2009. Google ScholarDigital Library
LevelDB: A fast and lightweight key/value database library by Google. https://code.google.com/p/leveldb/.Google Scholar
Y. Li, B. He, Q. Luo, and K. Yi. Tree indexing on flash disks. In Proc. 25th Int'l Conf. on Data Engineering, 2009. Google ScholarDigital Library
H. Lim, B. Fan, D. G. Andersen, and M. Kaminsky. SILT: a memory-efficient, high-performance key-value store. In Proc. 23rd SOSP Conf., Oct. 2011. Google ScholarDigital Library
D. T. Meyer and W. J. Bolosky. A study of practical deduplication. In Proc. 9th FAST Conf., Feb. 2011. Google ScholarDigital Library
C. Min, K. Kim, H. Cho, S.-W. Lee, and Y. I. Eom. SFS: Random write considered harmful in solid state drives. In Proc. 10th FAST Conf., Feb. 2012. Google ScholarDigital Library
P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 33:351--385, 1996. Google ScholarDigital Library
J. S. Plank, K. M. Greenan, and E. L. Miller. Screaming fast Galois field arithmetic using Intel SIMD instructions. In Proc. 11th FAST Conf., Feb. 2013. Google ScholarDigital Library
Pure Storage reference architecture for Oracle databases, 2014.Google Scholar
RocksDB: A fork of LevelDB by Facebook. https://github.com/facebook/rocksdb/.Google Scholar
M. Rosenblum and J. K. Ousterhout. The design and implementation of a log-structured file system. ACM Trans. on Computer Systems, 10(1):26--52, Feb. 1992. Google ScholarDigital Library
S. M. Rumble, A. Kejriwal, and J. Ousterhout. Log-structured memory for DRAM-based storage. In Proc. 12th FAST Conf., Feb. 2014. Google ScholarDigital Library
R. Sears, M. Callaghan, and E. Brewer. Rose: Compressed, log-structured replication. In Proc. 34th VLDB Conf., pages 526--537, Aug. 2008. Google ScholarDigital Library
R. Sears and R. Ramakrishnan. bLSM: A general purpose log structured merge tree. In Proc. SIGMOD Conf., pages 217--228, May 2012. Google ScholarDigital Library
D. J. Sheehy and D. Smith. Bitcask. a log-structured hash table for fast key value data. Technical report, Technical report, Basho Technologies, 04 2010, 2010.Google Scholar
V. Sikka, F. Färber, W. Lehner, S. K. Cha, T. Peh, and C. Bornhövd. Efficient transaction processing in SAP HANA database: the end of a column store myth. In Proc. SIGMOD Conf., pages 731--742, 2012. Google ScholarDigital Library
K. Srinivasan, T. Bisson, G. Goodson, and K. Voruganti. iDedup: Latency-aware, inline data deduplication for primary storage. In Proc. 10th FAST Conf., Feb. 2012. Google ScholarDigital Library
R. Stoica, M. Athanassoulis, R. Johnson, and A. Ailamaki. Evaluating and repairing write performance on flash devices. In DaMoN, pages 9--14, June 2009. Google ScholarDigital Library
M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, P. O'Neil, A. Rasin, N. Tran, and S. Zdonik. C-store: a column-oriented DBMS. In Proc. 31st VLDB Conf., pages 553--564, 2005. Google ScholarDigital Library
Y. Wang, M. Kapritsos, Z. Ren, P. Mahajan, J. Kirubanandam, L. Alvisi, and M. Dahlin. Robustness in the Salus scalable block store. In Proc. 10th NSDI Symp., pages 357--370, 2013. Google ScholarDigital Library
H. Yadava. The Berkeley DB Book. Apress, 2014. Google ScholarDigital Library

Index Terms

Purity: Building Fast, Highly-Available Enterprise Flash Storage from Commodity Components
1. Information systems
  1. Information storage systems
    1. Storage management
      1. Hierarchical storage management
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management
        Secondary storage

Recommendations

WOJ: Enabling Write-Once Full-data Journaling in SSDs by Using Weak-Hashing-based Deduplication

Journaling is a commonly used technique to ensure data consistency in file systems, such as ext3 and ext4. With journaling technique, file system updates are first recorded in a journal (in the commit phase) and later applied to their home locations in ...
Read More
Storage Deduplication by Virtual Large-Scale Disks
NBIS '12: Proceedings of the 2012 15th International Conference on Network-Based Information Systems

Recently, the demand of low cost large scale storages increases. We developed VLSD (Virtual Large Scale Disks) toolkit for constructing virtual disk based distributed storages, which aggregate free spaces of individual disks. VLSD realizes low-cost ...
Read More
GHOST: GPGPU-offloaded high performance storage I/O deduplication for primary storage system
PMAM '12: Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores

Data deduplication has been an effective way to eliminate redundant data mainly for backup storage systems. Since the recent primary storage systems in cloud services are expected to have the redundancy, the deduplication technique can also bring ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN:9781450327589
DOI:10.1145/2723372
General Chair:
Timos Sellis
RMIT University, Australia
,
Program Chairs:
Susan B. Davidson
University of Pennsylvania, USA
,
Zack Ives
University of Pennsylvania, USA
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 May 2015
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
deduplication
enterprise flash storage
high availability
log structured storage
scale up architectures
storage area networks
Qualifiers
- research-article
Conference

Acceptance Rates
SIGMOD '15 Paper Acceptance Rate106of415submissions,26%Overall Acceptance Rate785of4,003submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 46
  Total Citations
  View Citations
- 990
  Total Downloads
- Downloads (Last 12 months)70
- Downloads (Last 6 weeks)10
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Purity: Building Fast, Highly-Available Enterprise Flash Storage from Commodity Components

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

WOJ: Enabling Write-Once Full-data Journaling in SSDs by Using Weak-Hashing-based Deduplication

Storage Deduplication by Virtual Large-Scale Disks

GHOST: GPGPU-offloaded high performance storage I/O deduplication for primary storage system