Abstract
Document database systems store self-describing semi-structured records, such as JSON, "as-is" without requiring users to pre-define a schema. This gives users the flexibility to change the structure of incoming records without worrying about taking the system offline or hindering the performance of currently running queries. However, the flexibility of such systems is not free. The large amount of redundancy in the records can introduce unnecessary storage overhead and impact query performance.
Our focus in this paper is to address the storage overhead issue by introducing a tuple compactor framework that infers and extracts the schema from self-describing semi-structured records during data ingestion. As many prominent document stores, such as MongoDB and Couchbase, adopt Log-Structured Merge (LSM) trees in their storage engines, our framework piggybacks the schema inference and extraction operations on LSM lifecycle events. We have implemented and empirically evaluated our approach to measure its impact on storage, data ingestion, and query performance in the context of Apache AsterixDB.
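The general idea of extracting a schema while records are flushed can be illustrated with a small sketch. This is not the AsterixDB implementation: the `SchemaInferrer` class, the `flush` function, and the choice of replacing field names with integer IDs are all assumptions made for illustration, and a plain Python list stands in for the in-memory component of an LSM tree.

```python
class SchemaInferrer:
    """Hypothetical helper: maps each distinct field name to a small integer ID."""

    def __init__(self):
        self.field_ids = {}    # field name -> integer ID
        self.field_names = []  # integer ID -> field name (the inferred schema)

    def _id_for(self, name):
        # Register a field name the first time it is seen.
        if name not in self.field_ids:
            self.field_ids[name] = len(self.field_names)
            self.field_names.append(name)
        return self.field_ids[name]

    def compact(self, value):
        """Replace repeated field names with IDs, recursing into nested values."""
        if isinstance(value, dict):
            out = {}
            for name, child in value.items():
                field_id = self._id_for(name)
                out[field_id] = self.compact(child)
            return out
        if isinstance(value, list):
            return [self.compact(item) for item in value]
        return value

    def expand(self, value):
        """Inverse of compact: restore field names from the inferred schema."""
        if isinstance(value, dict):
            return {self.field_names[fid]: self.expand(child)
                    for fid, child in value.items()}
        if isinstance(value, list):
            return [self.expand(item) for item in value]
        return value


def flush(memtable, inferrer):
    """Simulated flush (assumed lifecycle hook): compact records as they are written out."""
    return [inferrer.compact(record) for record in memtable]


# Records with different structures arrive without a pre-defined schema.
memtable = [
    {"id": 1, "name": "Ann", "address": {"city": "Irvine"}},
    {"id": 2, "name": "Bob", "tags": ["a", "b"]},
]
inferrer = SchemaInferrer()
component = flush(memtable, inferrer)
restored = [inferrer.expand(record) for record in component]
```

In this sketch the field names are stored once in `inferrer.field_names` rather than in every record, which is the kind of redundancy the abstract refers to, and doing the work inside `flush` means no separate pass over the data is needed.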
Index Terms
- An LSM-based tuple compaction framework for Apache AsterixDB