research-article

An LSM-based tuple compaction framework for Apache AsterixDB

Authors:
Wail Y. Alkowaileet, University of California
Sattam Alsubaiee, King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia
Michael J. Carey, University of California

Proceedings of the VLDB Endowment, Volume 13, Issue 9, pp. 1388–1400. https://doi.org/10.14778/3397230.3397236

Published: 01 May 2020

Abstract

Document database systems store self-describing semi-structured records, such as JSON, "as-is" without requiring users to pre-define a schema. This gives users the flexibility to change the structure of incoming records without worrying about taking the system offline or hindering the performance of currently running queries. However, the flexibility of such systems is not free. The large amount of redundancy in the records can introduce unnecessary storage overhead and impact query performance.

Our focus in this paper is to address the storage overhead issue by introducing a tuple compactor framework that infers and extracts the schema from self-describing semi-structured records during data ingestion. As many prominent document stores, such as MongoDB and Couchbase, adopt Log-Structured Merge (LSM) trees in their storage engines, our framework exploits LSM lifecycle events to piggyback the schema inference and extraction operations. We have implemented and empirically evaluated our approach to measure its impact on storage, data ingestion, and query performance in the context of Apache AsterixDB.
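
The general idea of piggybacking schema inference on an LSM flush can be illustrated with a minimal Python sketch. It is not the paper's implementation and does not use any AsterixDB API: the class `SchemaInferringMemtable`, its methods, and the compacted `[field_id, value]` layout are all hypothetical, and only flat (non-nested) records are handled. The sketch assumes that records stay self-describing while in memory and that the flush event is the point where field names are replaced by small integer ids drawn from the inferred schema.

```python
class SchemaInferringMemtable:
    """Toy in-memory LSM component (hypothetical, for illustration only)."""

    def __init__(self):
        self.records = {}      # primary key -> self-describing record (dict)
        self.field_ids = {}    # inferred schema: field name -> integer id
        self.field_types = {}  # field name -> set of observed type names

    def insert(self, key, record):
        # Ingestion path: store the record as-is, no schema work yet.
        self.records[key] = record

    def _observe(self, record):
        # Extend the inferred schema with any field not seen before.
        for name, value in record.items():
            self.field_ids.setdefault(name, len(self.field_ids))
            self.field_types.setdefault(name, set()).add(type(value).__name__)

    def flush(self):
        # Flush event: infer the schema and strip field names in one pass.
        component = []
        for key in sorted(self.records):
            record = self.records[key]
            self._observe(record)
            compacted = [[self.field_ids[n], v] for n, v in record.items()]
            component.append((key, compacted))
        self.records = {}  # the in-memory component is now empty
        return component


memtable = SchemaInferringMemtable()
memtable.insert(2, {"id": 2, "name": "Bob", "age": 26})
memtable.insert(1, {"id": 1, "name": "Ann"})
component = memtable.flush()
```

After the flush, `memtable.field_ids` holds the inferred field names once, while each entry in `component` carries only integer ids and values. Records with differing structure remain acceptable: a new field simply extends the schema at the next flush.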

Index Terms

An LSM-based tuple compaction framework for Apache AsterixDB
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Index terms have been assigned to the content through auto-classification.

Published in
Proceedings of the VLDB Endowment, Volume 13, Issue 9
May 2020, 295 pages
ISSN: 2150-8097
Editors: Magdalena Balazinska (University of Washington), Xiaofang Zhou (University of Queensland, Australia)
Publisher: VLDB Endowment