Published in: Innovations in Systems and Software Engineering 1/2020

15-05-2019 | S.I. : CICBA 2018

Efficient incremental loading in ETL processing for real-time data integration

Authors: Neepa Biswas, Anamitra Sarkar, Kartick Chandra Mondal

Abstract

ETL (extract, transform, load) is the widely used standard process for creating and maintaining a data warehouse (DW). It is the most resource-, cost- and time-intensive process in DW implementation and maintenance. Many graphical user interface (GUI)-based solutions are now available to facilitate ETL processes. Despite the popularity of GUI-based tools, this approach still has some drawbacks. This paper focuses on an alternative approach to ETL development: hand coding. In some contexts, such as research and academic work, a custom-coded solution is appropriate because it can be cheaper, faster and more maintainable than a GUI-based tool. This article studies several well-known code-based open-source ETL tools developed in academia, addressing their architecture and implementation details. The aim of the paper is to present a comparative evaluation of these code-based ETL tools. Finally, an efficient ETL model is designed to meet present-day near real-time requirements.

Metadata
Title
Efficient incremental loading in ETL processing for real-time data integration
Authors
Neepa Biswas
Anamitra Sarkar
Kartick Chandra Mondal
Publication date
15-05-2019
Publisher
Springer London
Published in
Innovations in Systems and Software Engineering / Issue 1/2020
Print ISSN: 1614-5046
Electronic ISSN: 1614-5054
DOI
https://doi.org/10.1007/s11334-019-00344-4
