Published in: Knowledge and Information Systems 3/2014

01-09-2014 | Regular Paper

Efficient processing of streaming updates with archived master data in near-real-time data warehousing

Authors: M. Asif Naeem, Gillian Dobbie, Gerald Weber

Abstract

In order to make timely and effective decisions, businesses need the latest information from data warehouse repositories. To keep these repositories up to date with respect to end-user updates, near-real-time data integration is required. An important phase in near-real-time data integration is data transformation, where the stream of updates is joined with disk-based master data. The stream-based algorithm MESHJOIN (Mesh Join) has been proposed to amortize disk access over fast streams. MESHJOIN makes no assumptions about the data distribution. In real-world applications, however, skewed distributions can be found, such as a stream of products sold, where certain products are sold more frequently than the rest. The question that arises is how much performance MESHJOIN loses by not adapting to data skew. In this paper, we perform a rigorous experimental study analyzing the possible performance improvements while considering typical data distributions. For this purpose, we design an algorithm, Extended Hybrid Join (X-HYBRIDJOIN), that is complementary to MESHJOIN in that it can adapt to data skew and stores parts of the master data in memory permanently, reducing the disk access overhead significantly. We compare the performance of X-HYBRIDJOIN against that of MESHJOIN. We take several precautions to make sure the comparison is adequate and focuses on the utilization of data skew. The experiments show that considering data skew offers substantial room for performance gains that cannot be found in non-adaptive approaches such as MESHJOIN. We also present a cost model for X-HYBRIDJOIN and tune the algorithm based on that cost model.
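
The intuition behind keeping part of the master data in memory can be illustrated with a toy simulation. The sketch below is not MESHJOIN or X-HYBRIDJOIN; it is a deliberately simplified model that only counts how many stream tuples would need a disk lookup when the most frequent master-data keys are pinned in memory. The function names, the Zipf-like weights (exponent 1), the stream length, and the pinned fraction are all assumptions chosen for illustration and do not come from the paper.

```python
import random
from itertools import accumulate


def zipf_stream(num_keys, length, exponent=1.0, seed=0):
    """Return `length` keys drawn from range(num_keys); key 0 is the most frequent.

    Assumption: key popularity follows weights 1 / rank**exponent (Zipf-like).
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) ** exponent for rank in range(num_keys)]
    cumulative = list(accumulate(weights))
    return rng.choices(range(num_keys), cum_weights=cumulative, k=length)


def count_disk_lookups(stream, pinned_keys):
    """Count stream tuples whose master record is NOT pinned in memory."""
    pinned = set(pinned_keys)
    return sum(1 for key in stream if key not in pinned)


def simulate(num_keys=10_000, length=50_000, pinned_fraction=0.01):
    """Compare disk lookups with no pinned keys against pinning the hottest keys."""
    stream = zipf_stream(num_keys, length)
    pinned_count = int(num_keys * pinned_fraction)
    # Assumption: popularity ranks are known in advance, so the hottest
    # keys are simply 0 .. pinned_count - 1.
    without_pinning = count_disk_lookups(stream, [])
    with_pinning = count_disk_lookups(stream, range(pinned_count))
    return without_pinning, with_pinning


if __name__ == "__main__":
    baseline, pinned = simulate()
    print(f"disk lookups without pinned keys: {baseline}")
    print(f"disk lookups with 1% of keys pinned: {pinned}")
```

Under these assumed weights, the 100 most popular of 10,000 keys account for roughly half of all stream tuples, so pinning 1% of the master data removes about half of the lookups in this toy model. With a uniform stream (`exponent=0.0`) the same pinned set would save only about 1%, which is the gap the abstract attributes to data skew.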

Metadata
Title
Efficient processing of streaming updates with archived master data in near-real-time data warehousing
Authors
M. Asif Naeem
Gillian Dobbie
Gerald Weber
Publication date
01-09-2014
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 3/2014
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-013-0653-7
