ABSTRACT
In this paper, we show that all mainstream LSM-tree based key-value stores in the literature and in industry are suboptimal with respect to how they trade off among the I/O costs of updates, point lookups, range lookups, as well as the cost of storage, measured as space-amplification. The reason is that they perform expensive merge operations in order to (1) bound the number of runs that a lookup has to probe, and to (2) remove obsolete entries to reclaim space. However, most of these merge operations reduce point lookup cost, long range lookup cost, and space-amplification by a negligible amount.
To address this problem, we expand the LSM-tree design space with Lazy Leveling, a new design that prohibits merge operations at all levels of LSM-tree but the largest. We show that Lazy Leveling improves the worst-case cost complexity of updates while maintaining the same bounds on point lookup cost, long range lookup cost, and space-amplification. To be able to navigate between Lazy Leveling and other designs, we make the LSM-tree design space fluid by introducing Fluid LSM-tree, a generalization of LSM-tree that can be parameterized to assume all existing LSM-tree designs. We show how to fluidly transition from Lazy Leveling to (1) designs that are more optimized for updates by merging less at the largest level, and (2) designs that are more optimized for small range lookups by merging more at all other levels.
We put everything together to design Dostoevsky, a key-value store that navigates the entire Fluid LSM-tree design space based on the application workload and hardware to maximize throughput using a novel closed-form performance model. We implemented Dostoevsky on top of RocksDB, and we show that it strictly dominates state-of-the-art LSM-tree based key-value stores in terms of performance and space-amplification.
- A. Aggarwal and J. S. Vitter. The Input/Output Complexity of Sorting and Related Problems. Communications of the ACM, 31(9):1116--1127, 1988. Google ScholarDigital Library
- D. Agrawal, D. Ganesan, R. Sitaraman, Y. Diao, and S. Singh. Lazy-Adaptive Tree: An Optimized Index Structure for Flash Devices. Proceedings of the VLDB Endowment, 2(1):361--372, 2009. Google ScholarDigital Library
- M. Y. Ahmad and B. Kemme. Compaction management in distributed key-value datastores. Proceedings of the VLDB Endowment, 8(8):850--861, 2015. Google ScholarDigital Library
- D. G. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan. FAWN: A Fast Array of Wimpy Nodes. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), pages 1--14, 2009. Google ScholarDigital Library
- Apache. Accumulo. https://accumulo.apache.org/, 2016.Google Scholar
- Apache. Cassandra. http://cassandra.apache.org, 2016.Google Scholar
- Apache. HBase. http://hbase.apache.org/, 2016.Google Scholar
- T. G. Armstrong, V. Ponnekanti, D. Borthakur, and M. Callaghan. LinkBench: a Database Benchmark Based on the Facebook Social Graph. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 1185--1196, 2013. Google ScholarDigital Library
- M. Athanassoulis, S. Chen, A. Ailamaki, P. B. Gibbons, and R. Stoica. MaSM: Efficient Online Updates in Data Warehouses. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 865--876, 2011. Google ScholarDigital Library
- M. Athanassoulis, S. Chen, A. Ailamaki, P. B. Gibbons, and R. Stoica. Online Updates on Data Warehouses via Judicious Use of Solid-State Storage. ACM Transactions on Database Systems (TODS), 40(1), 2015. Google ScholarDigital Library
- A. Badam, K. Park, V. S. Pai, and L. L. Peterson. HashCache: Cache Storage for the Next Billion. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 123--136, 2009. Google ScholarDigital Library
- O. Balmau, D. Didona, R. Guerraoui, W. Zwaenepoel, H. Yuan, A. Arora, K. Gupta, and P. Konka. TRIAD: Creating synergies between memory, disk and log in log structured key-value stores. In USENIX Annual Technical Conference, 2017. Google ScholarDigital Library
- M. A. Bender, M. Farach-Colton, J. T. Fineman, Y. R. Fogel, B. C. Kuszmaul, and J. Nelson. Cache-Oblivious Streaming B-trees. In Proceedings of the Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 81--92, 2007. Google ScholarDigital Library
- B. H. Bloom. Space/Time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM, 13(7):422--426, 1970. Google ScholarDigital Library
- G. S. Brodal and R. Fagerberg. Lower Bounds for External Memory Dictionaries. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 546--554, 2003. Google ScholarDigital Library
- N. Bronson, Z. Amsden, G. Cabrera, P. Chakka, P. Dimov, H. Ding, J. Ferris, A. Giardullo, S. Kulkarni, H. C. Li, M. Marchukov, D. Petrov, L. Puzar, Y. J. Song, and V. Venkataramani. TAO: Facebook's Distributed Data Store for the Social Graph. In Proceedings of the USENIX Annual Technical Conference (ATC), pages 49--60, 2013. Google ScholarDigital Library
- Y. Bu, V. Borkar, J. Jia, M. J. Carey, and Condie. Pregelix: Big (ger) graph analytics on a dataflow engine. In Proceedings of the International Conference on Very Large Data Bases (VLDB), pages 161--172, 2014. Google ScholarDigital Library
- Z. Cao, S. Chen, F. Li, M. Wang, and X. S. Wang. LogKV: Exploiting Key-Value Stores for Log Processing. In Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR), 2013.Google Scholar
- F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A Distributed Storage System for Structured Data. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 205--218, 2006. Google ScholarDigital Library
- J. Chen, C. Douglas, M. Mutsuzaki, P. Quaid, R. Ramakrishnan, S. Rao, and R. Sears. Walnut: A Unified Cloud Object Store. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 743--754, 2012. Google ScholarDigital Library
- B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with ycsb. In Proceedings of the ACM Symposium on Cloud Computing (SoCC), pages 143--154, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
- N. Dayan, M. Athanassoulis, and S. Idreos. Monkey: Optimal Navigable Key-Value Store. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 79--94, 2017. Google ScholarDigital Library
- N. Dayan, P. Bonnet, and S. Idreos. GeckoFTL: Scalable Flash Translation Techniques For Very Large Flash Devices. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 327--342, 2016. Google ScholarDigital Library
- B. Debnath, S. Sengupta, and J. Li. FlashStore: high throughput persistent key-value store. Proceedings of the VLDB Endowment, 3(1--2):1414--1425, 2010. Google ScholarDigital Library
- B. Debnath, S. Sengupta, and J. Li. SkimpyStash: RAM space skimpy key-value store on flash-based storage. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 25--36, 2011. Google ScholarDigital Library
- G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's Highly Available Key-value Store. ACM SIGOPS Operating Systems Review, 41(6):205--220, 2007. Google ScholarDigital Library
- S. Dong, M. Callaghan, L. Galanis, D. Borthakur, T. Savor, and M. Strum. Optimizing Space Amplification in RocksDB. In Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR), 2017.Google Scholar
- Facebook. RocksDB. https://github.com/facebook/rocksdb, 2016.Google Scholar
- Facebook. MyRocks. http://myrocks.io/, 2017.Google Scholar
- B. Fitzpatrick and A. Vorobey. Memcached: a distributed memory object caching system, 2011.Google Scholar
- G. Golan-Gueta, E. Bortnikov, E. Hillel, and I. Keidar. Scaling Concurrent Log-Structured Data Stores. In Proceedings of the ACM European Conference on Computer Systems (EuroSys), pages 32:1---32:14, 2015. Google ScholarDigital Library
- Google. LevelDB. https://github.com/google/leveldb/, 2016.Google Scholar
- H. V. Jagadish, P. P. S. Narayan, S. Seshadri, S. Sudarshan, and R. Kanneganti. Incremental Organization for Data Recording and Warehousing. In Proceedings of the International Conference on Very Large Data Bases (VLDB), pages 16--25, 1997. Google ScholarDigital Library
- A. Lakshman and P. Malik. Cassandra - A Decentralized Structured Storage System. ACM SIGOPS Operating Systems Review, 44(2):35--40, 2010. Google ScholarDigital Library
- Y. Li, B. He, J. Yang, Q. Luo, K. Yi, and R. J. Yang. Tree Indexing on Solid State Drives. Proceedings of the VLDB Endowment, 3(1--2):1195--1206, 2010. Google ScholarDigital Library
- H. Lim, D. G. Andersen, and M. Kaminsky. Towards Accurate and Fast Evaluation of Multi-Stage Log-structured Designs. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST), pages 149--166, 2016. Google ScholarDigital Library
- H. Lim, B. Fan, D. G. Andersen, and M. Kaminsky. SILT: A Memory-Efficient, High-Performance Key-Value Store. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), pages 1--13, 2011. Google ScholarDigital Library
- LinkedIn. Online reference. http://www.project-voldemort.com, 2016.Google Scholar
- L. Lu, T. S. Pillai, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. WiscKey: Separating Keys from Values in SSD-conscious Storage. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST), pages 133--148, 2016. Google ScholarDigital Library
- S. Nath and A. Kansal. FlashDB: dynamic self-tuning database for NAND flash. ACM/IEEE IPSN, 2007. Google ScholarDigital Library
- P. E. O'Neil, E. Cheng, D. Gawlick, and E. J. O'Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 33(4):351--385, 1996. Google ScholarDigital Library
- A. Papagiannis, G. Saloustros, P. González-Férez, and A. Bilas. Tucana: Design and Implementation of a Fast and Efficient Scale-up Key-value Store. In Proceedings of the USENIX Annual Technical Conference (ATC), pages 537--550, 2016. Google ScholarDigital Library
- M. Pilman, K. Bocksrocker, L. Braun, R. Marroquin, and D. Kossmann. Fast Scans on Key-Value Stores. Proceedings of the VLDB Endowment, 10(11):1526--1537, 2017. Google ScholarDigital Library
- J. Pitman. Probability. 1999.Google Scholar
- P. Raju, R. Kadekodi, V. Chidambaram, and I. Abraham. Pebblesdb: Building key-value stores using fragmented log-structured merge trees. Proceedings of the ACM Symposium on Operating Systems Principles, 2017. Google ScholarDigital Library
- Redis. Online reference. http://redis.io/.Google Scholar
- K. Ren, Q. Zheng, J. Arulraj, and G. Gibson. SlimDB: A Space-Efficient Key-Value Storage Engine For Semi-Sorted Data. Proceedings of the VLDB Endowment, 10(13):2037--2048, 2017. Google ScholarDigital Library
- R. Sears and R. Ramakrishnan. bLSM: A General Purpose Log Structured Merge Tree. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 217--228, 2012. Google ScholarDigital Library
- P. Shetty, R. P. Spillane, R. Malpani, B. Andrews, J. Seyster, and E. Zadok. Building Workload-Independent Storage with VT-trees. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST), pages 17--30, 2013. Google ScholarDigital Library
- S. Tarkoma, C. E. Rothenberg, and E. Lagerspetz. Theory and Practice of Bloom Filters for Distributed Systems. IEEE Communications Serveys & Tutorials, 14(1):131--155, 2012.Google ScholarCross Ref
- R. Thonangi and J. Yang. On Log-Structured Merge for Solid-State Drives. In Proceedings of the IEEE International Conference on Data Engineering (ICDE), pages 683--694, 2017.Google ScholarCross Ref
- WiredTiger. WiredTiger. https://github.com/wiredtiger/wiredtiger, 2016.Google Scholar
- X. Wu, Y. Xu, Z. Shao, and S. Jiang. LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data Items. In Proceedings of the USENIX Annual Technical Conference (ATC), pages 71--82, 2015. Google ScholarDigital Library
Index Terms
- Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging
Recommendations
GHStore: A High Performance Global Hash Based Key-Value Store
Database Systems for Advanced ApplicationsAbstractLog-Structured Merge tree (LSM-tree) has become the mainstream data structure of persistent key-value (KV) stores, but it suffers from serious write and read amplification. In update intensive workloads, repeated and useless compaction of outdated ...
Monkey: Optimal Navigable Key-Value Store
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of DataIn this paper, we show that key-value stores backed by an LSM-tree exhibit an intrinsic trade-off between lookup cost, update cost, and main memory footprint, yet all existing designs expose a suboptimal and difficult to tune trade-off among these ...
Optimal Bloom Filters and Adaptive Merging for LSM-Trees
Best of SIGMOD 2017 PapersIn this article, we show that key-value stores backed by a log-structured merge-tree (LSM-tree) exhibit an intrinsic tradeoff between lookup cost, update cost, and main memory footprint, yet all existing designs expose a suboptimal and difficult to tune ...
Comments