Abstract
Social networks, online communities, mobile devices, and instant messaging applications generate large volumes of complex, unstructured data at a high rate. This poses new challenges for data management systems that aim to ingest, store, index, and analyze such data efficiently. In response, we released the first public version of AsterixDB, an open-source Big Data Management System (BDMS), in June 2013. This paper describes the storage management layer of AsterixDB in detail, covering its ingestion-oriented approach to local storage and presenting a set of initial measurements of its ingestion-related performance characteristics.
To support high-frequency insertions, AsterixDB has wholly adopted Log-Structured Merge-trees (LSM-trees) as the storage technology for all of its index structures. We describe how the AsterixDB software framework enables "LSM-ification" (conversion from an in-place-update, disk-based data structure to a deferred-update, append-only data structure) of any kind of index structure that supports certain primitive operations, enabling the index to ingest data efficiently. We also describe how AsterixDB ensures the ACID properties for operations involving multiple heterogeneous LSM-based indexes. Lastly, we highlight the challenges of managing system resources when many LSM indexes are used concurrently and present AsterixDB's initial solution.
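The general idea behind a deferred-update, append-only index can be illustrated with a minimal sketch. This is not AsterixDB's implementation: the class name `LSMIndex`, the `memory_budget` parameter, the tombstone marker, and the use of plain dictionaries in place of on-disk components are all assumptions made for illustration. The sketch shows writes going to a mutable in-memory component, a flush producing an immutable component, lookups consulting components from newest to oldest, and a merge collapsing the immutable components.

```python
from typing import Dict, List, Optional

# Marker recording that a key was deleted (an assumption of this sketch).
_TOMBSTONE = object()


class LSMIndex:
    """Toy LSM-style index: one mutable in-memory component plus
    immutable flushed components (dicts standing in for disk files)."""

    def __init__(self, memory_budget: int = 4) -> None:
        self.memory_budget = memory_budget       # hypothetical flush threshold
        self.memory: Dict[str, object] = {}      # mutable component
        self.disk: List[Dict[str, object]] = []  # immutable, newest first

    def upsert(self, key: str, value: object) -> None:
        # Updates are deferred: nothing already flushed is modified.
        self.memory[key] = value
        self._maybe_flush()

    def delete(self, key: str) -> None:
        # A delete is recorded as a tombstone rather than applied in place.
        self.memory[key] = _TOMBSTONE
        self._maybe_flush()

    def _maybe_flush(self) -> None:
        if len(self.memory) >= self.memory_budget:
            self.flush()

    def flush(self) -> None:
        # Write the in-memory entries, in key order, as a new immutable component.
        if self.memory:
            entries = sorted(self.memory.items(), key=lambda kv: kv[0])
            self.disk.insert(0, dict(entries))
            self.memory = {}

    def get(self, key: str) -> Optional[object]:
        # Search newest to oldest; the first hit wins.
        for component in [self.memory, *self.disk]:
            if key in component:
                value = component[key]
                return None if value is _TOMBSTONE else value
        return None

    def merge(self) -> None:
        # Combine all flushed components; newer entries overwrite older ones.
        merged: Dict[str, object] = {}
        for component in reversed(self.disk):
            merged.update(component)
        # Every flushed component took part, so tombstones can be dropped.
        live = {k: merged[k] for k in sorted(merged) if merged[k] is not _TOMBSTONE}
        self.disk = [live] if live else []
```

In this sketch a lookup may have to consult several components, which is why merging matters: it trades background work for fewer components to search.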
Index Terms
- Storage management in AsterixDB