research-article

An LSM-based tuple compaction framework for Apache AsterixDB

Published: 01 May 2020

Abstract

Document database systems store self-describing semi-structured records, such as JSON, "as-is" without requiring users to pre-define a schema. This gives users the flexibility to change the structure of incoming records without worrying about taking the system offline or hindering the performance of currently running queries. However, the flexibility of such systems is not free. The large amount of redundancy in the records can introduce unnecessary storage overhead and impact query performance.

Our focus in this paper is to address the storage overhead issue by introducing a tuple compactor framework that infers and extracts the schema from self-describing semi-structured records during data ingestion. As many prominent document stores, such as MongoDB and Couchbase, adopt Log-Structured Merge (LSM) trees in their storage engines, our framework exploits LSM lifecycle events to piggyback the schema inference and extraction operations. We have implemented and empirically evaluated our approach to measure its impact on storage, data ingestion, and query performance in the context of Apache AsterixDB.
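
The general idea of extracting field names from self-describing records when an in-memory batch is flushed can be illustrated with a toy sketch. This is not AsterixDB code and not necessarily the algorithm the paper uses: the functions `infer_schema`, `compact`, `expand`, and `flush`, and the flat field-name-to-ID dictionary, are hypothetical names and simplifications chosen for illustration only.

```python
def infer_schema(records, schema=None):
    """Assign an integer ID to every field name seen in a batch of records.

    `schema` (hypothetical) maps field name -> ID. It only ever grows, so IDs
    handed out during an earlier flush remain valid for later ones.
    """
    schema = dict(schema or {})

    def visit(value):
        if isinstance(value, dict):
            for name, child in value.items():
                schema.setdefault(name, len(schema))
                visit(child)
        elif isinstance(value, list):
            for item in value:
                visit(item)

    for record in records:
        visit(record)
    return schema


def compact(value, schema):
    """Replace each field name with its ID; scalar values are left untouched."""
    if isinstance(value, dict):
        return {schema[name]: compact(child, schema) for name, child in value.items()}
    if isinstance(value, list):
        return [compact(item, schema) for item in value]
    return value


def expand(value, schema):
    """Inverse of `compact`: restore field names from IDs."""
    names = {field_id: name for name, field_id in schema.items()}

    def restore(v):
        if isinstance(v, dict):
            return {names[k]: restore(child) for k, child in v.items()}
        if isinstance(v, list):
            return [restore(item) for item in v]
        return v

    return restore(value)


def flush(memtable, schema=None):
    """Simulated flush event: infer the schema, then write compacted records."""
    schema = infer_schema(memtable, schema)
    return schema, [compact(record, schema) for record in memtable]


memtable = [
    {"id": 1, "name": "Ann", "address": {"city": "Irvine"}},
    {"id": 2, "name": "Bob", "tags": ["a", "b"]},
]
schema, on_disk = flush(memtable)
```

In this sketch the schema is stored once per flush rather than once per record, which is where the space saving would come from. Passing the returned `schema` into the next `flush` call lets later batches reuse existing IDs and add new ones as the record structure changes.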


      • Published in

        Proceedings of the VLDB Endowment  Volume 13, Issue 9
        May 2020
        295 pages
        ISSN: 2150-8097

        Publisher

        VLDB Endowment
