research-article

An IDEA: An Ingestion Framework for Data Enrichment in AsterixDB

Authors:
Xikui Wang, University of California Irvine
Michael J. Carey, University of California Irvine

Proceedings of the VLDB Endowment, Volume 12, Issue 11, pp 1485–1498, https://doi.org/10.14778/3342263.3342628

Published: 01 July 2019

Abstract

Big Data today is being generated at an unprecedented rate from various sources such as sensors, applications, and devices, and it often needs to be enriched based on other reference information to support complex analytical queries. Depending on the use case, the enrichment operations can be compiled code, declarative queries, or machine learning models with different complexities. For enrichments that will be frequently used in the future, it can be advantageous to push their computation into the ingestion pipeline so that they can be stored (and queried) together with the data. In some cases, the referenced information may change over time, so the ingestion pipeline should be able to adapt to such changes to guarantee the currency and/or correctness of the enrichment results.

In this paper, we present a new data ingestion framework that supports data ingestion at scale, enrichments requiring complex operations, and adaptiveness to reference data changes. We explain how this framework has been built on top of Apache AsterixDB and investigate its performance at scale under various workloads.
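
To make the idea of enrichment during ingestion concrete, here is a minimal sketch in Python. It is not the paper's implementation and does not use any AsterixDB API. It assumes, purely for illustration, that incoming records arrive in batches and that the reference data is re-read once per batch, so that later batches pick up reference changes. The names `enrich_stream`, `load_reference`, `country`, and `risk` are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


def enrich_stream(
    batches: Iterable[List[Record]],
    load_reference: Callable[[], Dict[str, str]],
) -> Iterator[Record]:
    """Attach a reference-derived field to each record before it is stored.

    Assumption: reference data is reloaded once per batch (hypothetical policy).
    """
    for batch in batches:
        # Reload the reference data so this batch sees its latest state.
        reference = load_reference()
        for record in batch:
            enriched = dict(record)  # copy, leaving the input record untouched
            key = str(record.get("country", ""))
            enriched["risk"] = reference.get(key, "unknown")
            yield enriched


def demo() -> List[Record]:
    """Ingest two batches while the reference data changes in between."""
    reference_table = {"US": "low"}  # hypothetical reference dataset

    def load_reference() -> Dict[str, str]:
        return dict(reference_table)

    batches = [
        [{"id": 1, "country": "US"}],
        [{"id": 2, "country": "US"}, {"id": 3, "country": "FR"}],
    ]
    stored: List[Record] = []
    for i, rec in enumerate(enrich_stream(batches, load_reference)):
        stored.append(rec)
        if i == 0:
            # The reference data is updated after the first record is stored.
            reference_table["US"] = "high"
    return stored
```

In this sketch the enriched value is stored with each record, so later queries do not need to recompute it. The per-batch reload is one simple way to trade freshness against lookup cost: a record enriched before a reference update keeps the older value.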

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines

Published in
Proceedings of the VLDB Endowment, Volume 12, Issue 11
July 2019, 543 pages
ISSN: 2150-8097
Editors: Lei Chen, Fatma Özcan
Publisher: VLDB Endowment