research-article

An IDEA: An Ingestion Framework for Data Enrichment in AsterixDB

Published: 01 July 2019

Abstract

Big Data today is being generated at an unprecedented rate from various sources such as sensors, applications, and devices, and it often needs to be enriched based on other reference information to support complex analytical queries. Depending on the use case, the enrichment operations can be compiled code, declarative queries, or machine learning models with different complexities. For enrichments that will be frequently used in the future, it can be advantageous to push their computation into the ingestion pipeline so that they can be stored (and queried) together with the data. In some cases, the referenced information may change over time, so the ingestion pipeline should be able to adapt to such changes to guarantee the currency and/or correctness of the enrichment results.

In this paper, we present a new data ingestion framework that supports data ingestion at scale, enrichments requiring complex operations, and adaptiveness to reference data changes. We explain how this framework has been built on top of Apache AsterixDB and investigate its performance at scale under various workloads.



Published in

Proceedings of the VLDB Endowment, Volume 12, Issue 11
July 2019
543 pages

Publisher

VLDB Endowment
