ABSTRACT
Although flash storage has largely replaced hard disks in consumer class devices, enterprise workloads pose unique challenges that have slowed adoption of flash in ``performance tier'' storage appliances. In this paper, we describe Purity, the foundation of Pure Storage's Flash Arrays, the first all-flash enterprise storage system to support compression, deduplication, and high-availability.
Purity borrows techniques from modern database and key-value storage architectures, and introduces novel storage primitives that have wide applicability to data management systems. For instance, all writes in Purity are monotonic, and deletions are handled using an atomic predicate-based tuple elision primitive.
Purity's redundancy mechanisms are optimized for SSD failure modes and performance characteristics, allowing for fast recovery from component failures and lower space overhead than the best hard disk systems. We built deduplication and data compression schemes atop these primitives.
Flash changes storage capacity/performance tradeoffs: unlike disk-based systems, flash deployments are rarely performance bound. A single Purity appliance can provide over 7GiB/s of throughput on 32KiB random I/Os, even through multiple device failures, and while providing asynchronous off-site replication. Typical installations have 99.9% latencies under 1ms, and production arrays average 5.4x data reduction and 99.999% availability.
Purity takes advantage of storage performance increasing more rapidly than computational performance to build a simpler (with respect to engineering, installation, and management) scale-up storage appliance that supports hundreds of terabytes of highly-available, high-performance storage. The resulting performance and capacity supports many customer deployments of multiple applications, including scale-out and parallel systems, such as MongoDB and Oracle RAC, on a single Purity appliance.
- D. Abadi, S. Madden, and M. Ferreira. Integrating compression and execution in column-oriented database systems. In Proc. SIGMOD Conf., pages 671--682, 2006. Google ScholarDigital Library
- D. Agrawal, D. Ganesan, R. Sitaraman, Y. Diao, and S. Singh. Lazy-adaptive tree: An optimized index structure for flash devices. In Proc. 35th VLDB Conf., Aug. 2009. Google ScholarDigital Library
- P. Alvaro, N. Conway, J. Hellerstein, and W. R. Marczak. Consistency analysis in Bloom: a CALM and collected approach. In Proc. CIDR, pages 249--260, 2011.Google Scholar
- T. J. Ameloot and J. Van den Bussche. Positive Dedalus programs tolerate non-causality. Journal of Computer and System Sciences, 80(7):1191--1213, 2014.Google ScholarCross Ref
- A. Anand, C. Muthukrishnan, S. Kappes, A. Akella, and S. Nath. Cheap and large CAMs for high performance data-intensive networked systems. In Proc. 7th NSDI Symp., 2010. Google ScholarDigital Library
- D. G. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan. FAWN: A fast array of wimpy nodes. In Proc. 22nd SOSP Conf., pages 1--14, Oct. 2009. Google ScholarDigital Library
- B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny. Workload analysis of a large-scale key-value store. In Proc. 2012 SIGMETRICS, pages 53--64, 2012. Google ScholarDigital Library
- M. Balakrishnan, D. Malkhi, V. Prabhakaran, T. Wobber, M. Wei, and J. D. Davis. CORFU: A shared log design for flash clusters. In Proc. 9th NSDI Symp., Apr. 2012. Google ScholarDigital Library
- M. Balakrishnan, D. Malkhi, T. Wobber, M. Wu, V. Prabhakaran, M. Wei, J. D. Davis, S. Rao, T. Zou, and A. Zuck. Tango: Distributed data structures over a shared log. In Proc. 24th SOSP Conf., pages 325--340, Nov. 2013. Google ScholarDigital Library
- M. A. Bender, M. Farach-Colton, J. T. Fineman, Y. R. Fogel, B. C. Kuszmaul, and J. Nelson. Cache-oblivious streaming B-trees. In Proc. 19th Symp. on Parallel Algorithms and Architectures, pages 81--92, 2007. Google ScholarDigital Library
- E. A. Brewer. Lessons from giant-scale services. IEEE Internet Computing, 5(4):46--55, Aug. 2001. Google ScholarDigital Library
- M. Burrows, C. Jerian, B. Lampson, and T. Mann. On-line data compression in a log-structured file system. In Proc. 5th ASPLOS, pages 2--9, Oct. 1992. Google ScholarDigital Library
- F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. In Proc. 7th OSDI Symp., Nov. 2006. Google ScholarDigital Library
- F. Chen, T. Luo, and X. Zhang. CAFTL: A content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives. In Proc. 9th FAST Conf., Feb. 2011. Google ScholarDigital Library
- B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!'s hosted data serving platform. In Proc. 34th VLDB Conf., Aug. 2008. Google ScholarDigital Library
- B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with ycsb. In Proc. 1st ACM Symp. on Cloud Computing, 2010. Google ScholarDigital Library
- B. Cully, J. Wires, D. Meyer, K. Jamieson, K. Fraser, T. Deegan, D. Stodden, G. Lefebvre, D. Ferstay, and A. Warfield. Strata: High-performance scalable storage on virtualized non-volatile memory. In Proc. 12th FAST Conf., Feb. 2014. Google ScholarDigital Library
- J. Dean. Designs, lessons and advice from building large distributed systems. Keynote from LADIS, 2009.Google Scholar
- J. Dean and L. A. Barroso. The tail at scale. Communications of the ACM, 56(2):74--80, 2013. Google ScholarDigital Library
- B. Debnath, M. F. Mokbel, D. J. Lilja, and D. Du. Deferred updates for flash-based storage. In Proc. 26th IEEE Conf. on Mass Storage Systems and Technologies, 2010. Google ScholarDigital Library
- B. Debnath, S. Sengupta, and J. Li. ChunkStash: Speeding up inline storage deduplication using flash memory. In Proc. 2010 USENIX Annual Technical Conference, June 2010. Google ScholarDigital Library
- B. Debnath, S. Sengupta, and J. Li. FlashStore: High throughput persistent key-value store. In Proc. 36th VLDB Conf., Sept. 2010. Google ScholarDigital Library
- B. Debnath, S. Sengupta, and J. Li. SkimpyStash: RAM space skimpy key-value store on flash-based storage. In Proc. SIGMOD Conf., 2011. Google ScholarDigital Library
- J. Gray and G. Graefe. The five-minute rule ten years later, and other computer storage rules of thumb. ACM Sigmod Record, 26(4):63--68, 1997. Google ScholarDigital Library
- J. Gray, P. Helland, P. O'Neil, and D. Shasha. The dangers of replication and a solution. In Proc. SIGMOD Conf., pages 173--182, June 1996. Google ScholarDigital Library
- J. N. Gray, R. A. Lorie, G. R. Putzolu, and I. L. Traiger. Granularity of locks and degrees of consistency in a shared data base. In Proc. IFIP Working Conf. on Modelling in Data Base Management Systems, pages 365--394, 1976.Google Scholar
- K. M. Greenan, D. D. Long, E. L. Miller, T. J. E. Schwarz, S.J., and A. Wildani. Building flexible, fault-tolerant flash-based storage systems. In Proc. 5th Workshop on Hot Topics in System Dependability, June 2009.Google Scholar
- L. M. Grupp, J. D. Davis, and S. Swanson. The bleak future of NAND flash memory. In Proc. 10th FAST Conf., Feb. 2012. Google ScholarDigital Library
- L. M. Grupp, J. D. Davis, and S. Swanson. The harey tortoise: Managing heterogeneous write performance in SSDs. In Proc. 2013 USENIX Annual Technical Conference, June 2013. Google ScholarDigital Library
- A. Gupta, R. Pisolkar, B. Urgaonkar, and A. Sivasubramaniam. Leveraging value locality in optimizing NAND flash-based SSDs. In Proc. 9th FAST Conf., Feb. 2011. Google ScholarDigital Library
- J. Hamilton. Why scale matters and why the cloud is different. AWS re:Invent, 2013.Google Scholar
- J. Hamilton. AWS innovation at scale. AWS re:Invent 2014.Google Scholar
- A. L. Holloway, V. Raman, G. Swart, and D. J. DeWitt. How to barter bits for chronons: compression and bandwidth trade offs for database scans. In Proc. SIGMOD Conf., pages 389--400, 2007. Google ScholarDigital Library
- B. R. Iyer and D. Wilhite. Data compression support in databases. In Proc. 20th VLDB Conf., pages 695--704, 1994. Google ScholarDigital Library
- C. Jermaine, E. Omiecinski, and W. G. Yee. The partitioned exponential file for database storage management. Proc. VLDB Endowment, 16:417--437, 2007. Google ScholarDigital Library
- K. Jin and E. L. Miller. The effectiveness of deduplication on virtual machine disk images. In Proc. SYSTOR 2009, May 2009. Google ScholarDigital Library
- W. K. Josephson, L. A. Bongo, K. Li, and D. Flynn. DFS: A file system for virtualized flash storage. ACM Trans. on Storage, 6(3), Sept. 2010. Google ScholarDigital Library
- Y. Kang and E. L. Miller. Adding aggressive error correction to a high-performance compressing flash file system. In Proc. 9th ACM & IEEE Conf. on Embedded Software (EMSOFT '09), Oct. 2009. Google ScholarDigital Library
- LevelDB: A fast and lightweight key/value database library by Google. https://code.google.com/p/leveldb/.Google Scholar
- Y. Li, B. He, Q. Luo, and K. Yi. Tree indexing on flash disks. In Proc. 25th Int'l Conf. on Data Engineering, 2009. Google ScholarDigital Library
- H. Lim, B. Fan, D. G. Andersen, and M. Kaminsky. SILT: a memory-efficient, high-performance key-value store. In Proc. 23rd SOSP Conf., Oct. 2011. Google ScholarDigital Library
- D. T. Meyer and W. J. Bolosky. A study of practical deduplication. In Proc. 9th FAST Conf., Feb. 2011. Google ScholarDigital Library
- C. Min, K. Kim, H. Cho, S.-W. Lee, and Y. I. Eom. SFS: Random write considered harmful in solid state drives. In Proc. 10th FAST Conf., Feb. 2012. Google ScholarDigital Library
- P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 33:351--385, 1996. Google ScholarDigital Library
- J. S. Plank, K. M. Greenan, and E. L. Miller. Screaming fast Galois field arithmetic using Intel SIMD instructions. In Proc. 11th FAST Conf., Feb. 2013. Google ScholarDigital Library
- Pure Storage reference architecture for Oracle databases, 2014.Google Scholar
- RocksDB: A fork of LevelDB by Facebook. https://github.com/facebook/rocksdb/.Google Scholar
- M. Rosenblum and J. K. Ousterhout. The design and implementation of a log-structured file system. ACM Trans. on Computer Systems, 10(1):26--52, Feb. 1992. Google ScholarDigital Library
- S. M. Rumble, A. Kejriwal, and J. Ousterhout. Log-structured memory for DRAM-based storage. In Proc. 12th FAST Conf., Feb. 2014. Google ScholarDigital Library
- R. Sears, M. Callaghan, and E. Brewer. Rose: Compressed, log-structured replication. In Proc. 34th VLDB Conf., pages 526--537, Aug. 2008. Google ScholarDigital Library
- R. Sears and R. Ramakrishnan. bLSM: A general purpose log structured merge tree. In Proc. SIGMOD Conf., pages 217--228, May 2012. Google ScholarDigital Library
- D. J. Sheehy and D. Smith. Bitcask. a log-structured hash table for fast key value data. Technical report, Technical report, Basho Technologies, 04 2010, 2010.Google Scholar
- V. Sikka, F. Färber, W. Lehner, S. K. Cha, T. Peh, and C. Bornhövd. Efficient transaction processing in SAP HANA database: the end of a column store myth. In Proc. SIGMOD Conf., pages 731--742, 2012. Google ScholarDigital Library
- K. Srinivasan, T. Bisson, G. Goodson, and K. Voruganti. iDedup: Latency-aware, inline data deduplication for primary storage. In Proc. 10th FAST Conf., Feb. 2012. Google ScholarDigital Library
- R. Stoica, M. Athanassoulis, R. Johnson, and A. Ailamaki. Evaluating and repairing write performance on flash devices. In DaMoN, pages 9--14, June 2009. Google ScholarDigital Library
- M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, P. O'Neil, A. Rasin, N. Tran, and S. Zdonik. C-store: a column-oriented DBMS. In Proc. 31st VLDB Conf., pages 553--564, 2005. Google ScholarDigital Library
- Y. Wang, M. Kapritsos, Z. Ren, P. Mahajan, J. Kirubanandam, L. Alvisi, and M. Dahlin. Robustness in the Salus scalable block store. In Proc. 10th NSDI Symp., pages 357--370, 2013. Google ScholarDigital Library
- H. Yadava. The Berkeley DB Book. Apress, 2014. Google ScholarDigital Library
Index Terms
- Purity: Building Fast, Highly-Available Enterprise Flash Storage from Commodity Components
Recommendations
WOJ: Enabling Write-Once Full-data Journaling in SSDs by Using Weak-Hashing-based Deduplication
Journaling is a commonly used technique to ensure data consistency in file systems, such as ext3 and ext4. With journaling technique, file system updates are first recorded in a journal (in the commit phase) and later applied to their home locations in ...
Storage Deduplication by Virtual Large-Scale Disks
NBIS '12: Proceedings of the 2012 15th International Conference on Network-Based Information SystemsRecently, the demand of low cost large scale storages increases. We developed VLSD (Virtual Large Scale Disks) toolkit for constructing virtual disk based distributed storages, which aggregate free spaces of individual disks. VLSD realizes low-cost ...
GHOST: GPGPU-offloaded high performance storage I/O deduplication for primary storage system
PMAM '12: Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and ManycoresData deduplication has been an effective way to eliminate redundant data mainly for backup storage systems. Since the recent primary storage systems in cloud services are expected to have the redundancy, the deduplication technique can also bring ...
Comments