Abstract
Big Data today is being generated at an unprecedented rate from various sources such as sensors, applications, and devices, and it often needs to be enriched based on other reference information to support complex analytical queries. Depending on the use case, the enrichment operations can be compiled code, declarative queries, or machine learning models with different complexities. For enrichments that will be frequently used in the future, it can be advantageous to push their computation into the ingestion pipeline so that they can be stored (and queried) together with the data. In some cases, the referenced information may change over time, so the ingestion pipeline should be able to adapt to such changes to guarantee the currency and/or correctness of the enrichment results.
In this paper, we present a new data ingestion framework that supports data ingestion at scale, enrichments requiring complex operations, and adaptiveness to reference data changes. We explain how this framework has been built on top of Apache AsterixDB and investigate its performance at scale under various workloads.