Abstract
Log-structured merge (LSM) stores have emerged as the technology of choice for building scalable write-intensive key-value storage systems. An LSM store replaces random I/O with sequential I/O by accumulating large batches of writes in a memory store prior to flushing them to log-structured disk storage; the latter is continuously re-organized in the background through a compaction process for efficiency of reads. Though inherent to the LSM design, frequent compactions are a major pain point because they slow down data store operations, primarily writes, and also increase disk wear. Another performance bottleneck in today's state-of-the-art LSM stores, in particular ones that use managed languages like Java, is the fragmented memory layout of their dynamic memory store.
In this paper we show that these pain points may be mitigated via better organization of the memory store. We present Accordion - an algorithm that addresses these problems by re-applying the LSM design principles to memory management. Accordion is implemented in the production code of Apache HBase, where it was extensively evaluated. We demonstrate Accordion's double-digit performance gains versus the baseline HBase implementation and discuss some unexpected lessons learned in the process.
- Apache HBase. http://hbase.apache.org.Google Scholar
- Apache hbase at airbnb. https://www.slideshare.net/HBaseCon/apache-hbase-at-airbnb.Google Scholar
- Avoiding full gcs with memstore-local allocation buffers. http://blog.cloudera.com/blog/2011/02/.Google Scholar
- Cassandra. http://cassandra.apache.org.Google Scholar
- Dragon: A distributed graph query engine. https://code.facebook.com/posts/1737605303120405.Google Scholar
- Hbase compaction tuning tips. https://community.hortonworks.com/articles/52616/hbase-compaction-tuning-tips.html.Google Scholar
- Hbase operations in a flurry. http://bit.ly/2yaTfoP.Google Scholar
- Java concurrent skiplist map. https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ConcurrentSkipListSet.html.Google Scholar
- LevelDB. https://github.com/google/leveldb.Google Scholar
- Mongodb. https://mongorocks.org.Google Scholar
- Mysql. http://myrocks.io.Google Scholar
- Off-heap memtables in cassandra 2.1. https://www.datastax.com/dev/blog/off-heap-memtables-in-cassandra-2-1.Google Scholar
- Offheap read-path in production the alibaba story. https://blog.cloudera.com/blog/2017/03/.Google Scholar
- RocksDB. http://rocksdb.org/.Google Scholar
- Rocksdb tuning guide.Google Scholar
- Scylladb. https://github.com/scylladb/scylla.Google Scholar
- Sstable compaction and compaction strategies. http://bit.ly/2wXc7ah.Google Scholar
- Universal compaction.Google Scholar
- Zen: Pinterest's graph storage service. http://bit.ly/2ft4YDx.Google Scholar
- O. Balmau, R. Guerraoui, V. Trigonakis, and I. Zablotchi. Flodb: Unlocking memory in persistent key-value stores. In Proceedings of the Twelfth European Conference on Computer Systems, EuroSys '17, pages 80--94, New York, NY, USA, 2017. ACM. Google ScholarDigital Library
- J. L. Bentley and J. B. Saxe. Decomposable searching problems i. static-to-dynamic transformation. Journal of Algorithms, 1(4):301--358, 1980.Google ScholarCross Ref
- F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2):4:1--4:26, June 2008. Google ScholarDigital Library
- B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with ycsb. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, pages 143--154, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
- P. Devineni, D. Koutra, M. Faloutsos, and C. Faloutsos. If walls could talk: Patterns and anomalies in facebook wallposts. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015, ASONAM '15, pages 367--374, New York, NY, USA, 2015. ACM. Google ScholarDigital Library
- G. Golan-Gueta, E. Bortnikov, E. Hillel, and I. Keidar. Scaling concurrent log-structured data stores. In Proceedings of the Tenth European Conference on Computer Systems, EuroSys '15, pages 32:1--32:14, New York, NY, USA, 2015. ACM. Google ScholarDigital Library
- J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly generating billion-record synthetic databases. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, SIGMOD '94, pages 243--252, New York, NY, USA, 1994. ACM. Google ScholarDigital Library
- X.-Y. Hu, E. Eleftheriou, R. Haas, I. Iliadis, and R. Pletka. Write amplification analysis in flash-based solid state drives. In Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, SYSTOR '09, pages 10:1--10:9, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
- L. Lu, T. S. Pillai, H. Gopalakrishnan, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Wisckey: Separating keys from values in ssd-conscious storage. TOS, 13(1):5:1--5:28, 2017. Google ScholarDigital Library
- P. Muth, P. E. O'Neil, A. Pick, and G. Weikum. Design, implementation, and performance of the lham log-structured history data access method. In Proceedings of the 24rd International Conference on Very Large Data Bases, VLDB '98, pages 452--463, 1998. Google ScholarDigital Library
- C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, and D. Lomet. AlphaSort: A cache-sensitive parallel external sort. The VLDB Journal, 4(4):603--628, Oct. 1995. Google ScholarDigital Library
- P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil. The log-structured merge-tree (LSM-tree). Acta Inf., 33(4):351--385, June 1996. Google ScholarDigital Library
- R. Sears and R. Ramakrishnan. blsm: A general purpose log structured merge tree. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, pages 217--228, New York, NY, USA, 2012. ACM. Google ScholarDigital Library
- A. S. Tanenbaum and H. Bos. Modern Operating Systems. Prentice Hall Press, Upper Saddle River, NJ, USA, 4th edition, 2014. Google ScholarDigital Library
- G. Wu, X. He, and B. Eckart. An adaptive write buffer management scheme for flash-based ssds. Trans. Storage, 8(1):1:1--1:24, Feb. 2012. Google ScholarDigital Library
- X. Wu, Y. Xu, Z. Shao, and S. Jiang. Lsm-trie: An lsm-tree-based ultra-large key-value store for small data items. In 2015 USENIX Annual Technical Conference (USENIX ATC 15), pages 71--82, Santa Clara, CA, 2015. USENIX Association. Google ScholarDigital Library
Index Terms
- Accordion: better memory organization for LSM key-value stores
Recommendations
Accordion: multi-scale recipes for adaptive detection of duplication
HotStorage'15: Proceedings of the 7th USENIX Conference on Hot Topics in Storage and File SystemsA recipe is metadata that describes the contents of a file as a sequence of blocks identified by their hash. Using recipes, one can rapidly compare the contents of two files without reading the files themselves. Unfortunately, recipes present a space/...
Accordion arrays
ISMM '07: Proceedings of the 6th international symposium on Memory managementIn this work, we present accordion arrays, a straight-forward and effective memory compression technique targeting Unicode-based character arrays. In many non-numeric Java programs, character arrays represent a significant fraction (30-40% on average) ...
FRASH: hierarchical file system for FRAM and flash
ICCSA'07: Proceedings of the 2007 international conference on Computational science and its applications - Volume Part IIn this work, we develop novel file system, FRASH, for byte-addressable NVRAM (FRAM[1]) and NAND Flash device. Byte addressable NVRAM and NAND Flash is typified by the DRAM-like fast access latency and high storage density, respectively. Hierarchical ...
Comments