Skip to main content
Top
Published in: The Journal of Supercomputing 2/2014

01-08-2014

Join processing with threshold-based filtering in MapReduce

Authors: Taewhi Lee, Hye-Chan Bae, Hyoung-Joo Kim

Published in: The Journal of Supercomputing | Issue 2/2014

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Data analytics, in particular those involving heterogeneous data, often require join operations on datasets collected from different sources. MapReduce, one of the most popular frameworks for large-scale data processing, is not suited for joining multiple datasets. This is because MapReduce often produces a large number of redundant intermediate results, irrespective of the size of the joined records. Although several existing approaches attempt to reduce the number of such redundant results using Bloom filters, they may be inefficient if large portions of records are joined or the number of distinct keys is large. To alleviate this problem, we propose a join processing method with threshold-based filtering in MapReduce, called TMFR-Join, which is an abbreviation for “Threshold-based Map-Filter-Reduce Join”. TMFR-Join applies filters according to their performance, which is estimated in terms of false-positive rates. It also provides a general framework for exploiting various filtering techniques that support certain desired operations. The experimental results indicate that the performance of TMFR-Join is close to that of the better of existing join processing techniques, both with and without filters.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Thusoo A, Antony S, Jain N, Murthy R, Shao Z, Borthakur D, Sarma JS, Liu H (2010) Data warehousing and analytics infrastructure at facebook. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, SIGMOD’10, pp 1013–1020 Thusoo A, Antony S, Jain N, Murthy R, Shao Z, Borthakur D, Sarma JS, Liu H (2010) Data warehousing and analytics infrastructure at facebook. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, SIGMOD’10, pp 1013–1020
2.
go back to reference Gupta R, Gupta H, Nambiar U, Mohania M (2010) Efficiently querying archived data using hadoop. In: Proceedings of the 19th ACM international conference on information and knowledge management, CIKM’10, pp 1301–1304 Gupta R, Gupta H, Nambiar U, Mohania M (2010) Efficiently querying archived data using hadoop. In: Proceedings of the 19th ACM international conference on information and knowledge management, CIKM’10, pp 1301–1304
3.
go back to reference Dean J, Ghemawat S (2004) Mapreduce: simplified data processing on large clusters. In: Proceedings of the 6th USENIX symposium on opearting systems design and implementation, OSDI’04, pp 137–150 Dean J, Ghemawat S (2004) Mapreduce: simplified data processing on large clusters. In: Proceedings of the 6th USENIX symposium on opearting systems design and implementation, OSDI’04, pp 137–150
5.
go back to reference Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, SIGMOD’10, pp 975–986 Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, SIGMOD’10, pp 975–986
6.
go back to reference Yang HC, Dasdan A, Hsiao RL, Parker DS (2007) Map-reduce-merge: simplified relational data processing on large clusters. In: Proceedings of the 2007 ACM SIGMOD international conference on management of data, SIGMOD’07, pp 1029–1040 Yang HC, Dasdan A, Hsiao RL, Parker DS (2007) Map-reduce-merge: simplified relational data processing on large clusters. In: Proceedings of the 2007 ACM SIGMOD international conference on management of data, SIGMOD’07, pp 1029–1040
7.
go back to reference Espinosa A, Hernandez P, Moure JC, Protasio J, Ripoll A (2012) Analysis and improvement of map-reduce data distribution in read mapping applications. J Supercomput 62(3):1305–1317CrossRef Espinosa A, Hernandez P, Moure JC, Protasio J, Ripoll A (2012) Analysis and improvement of map-reduce data distribution in read mapping applications. J Supercomput 62(3):1305–1317CrossRef
8.
go back to reference Bloom BH (1970) Space/time trade-offs in hash coding with allowable errors. Commun ACM 13(7):422–426CrossRefMATH Bloom BH (1970) Space/time trade-offs in hash coding with allowable errors. Commun ACM 13(7):422–426CrossRefMATH
10.
go back to reference Lee T, Kim K, Kim HJ (2012) Join processing using bloom filter in mapreduce. In: Proceedings of the 2012 ACM research in applied computation symposium, RACS’12, pp 100–105 Lee T, Kim K, Kim HJ (2012) Join processing using bloom filter in mapreduce. In: Proceedings of the 2012 ACM research in applied computation symposium, RACS’12, pp 100–105
11.
go back to reference Palla K (2009) A comparative analysis of join algorithms using the hadoop map/reduce framework. Master’s thesis, University of Edinburgh, Edinburgh Palla K (2009) A comparative analysis of join algorithms using the hadoop map/reduce framework. Master’s thesis, University of Edinburgh, Edinburgh
12.
go back to reference Zhang C, Wu L, Li J (2013) Efficient processing distributed joins with bloomfilter using mapreduce. Int J Grid Distrib Comput 6(3):43–58 Zhang C, Wu L, Li J (2013) Efficient processing distributed joins with bloomfilter using mapreduce. Int J Grid Distrib Comput 6(3):43–58
13.
go back to reference Tarkoma S, Rothenberg CE, Lagerspetz E (2012) Theory and practice of bloom filters for distributed systems. IEEE Commun Surv Tutor 14(1):131–155CrossRef Tarkoma S, Rothenberg CE, Lagerspetz E (2012) Theory and practice of bloom filters for distributed systems. IEEE Commun Surv Tutor 14(1):131–155CrossRef
14.
go back to reference Bender MA, Farach-Colton M, Johnson R, Kraner R, Kuszmaul BC, Medjedovic D, Montes P, Shetty P, Spillane RP, Zadok E (2012) Don’t thrash: how to cache your hash on flash. Proc VLDB Endow 5(11):1627–1637CrossRef Bender MA, Farach-Colton M, Johnson R, Kraner R, Kuszmaul BC, Medjedovic D, Montes P, Shetty P, Spillane RP, Zadok E (2012) Don’t thrash: how to cache your hash on flash. Proc VLDB Endow 5(11):1627–1637CrossRef
15.
go back to reference Quislant R, Gutierrez E, Plata O, Zapata EL (2010), Interval filter: a locality-aware alternative to bloom filters for hardware membership queries by interval classification. In: Proceedings of the 11th international conference on intelligent data engineering and automated learning, IDEAL’10, pp 162–169 Quislant R, Gutierrez E, Plata O, Zapata EL (2010), Interval filter: a locality-aware alternative to bloom filters for hardware membership queries by interval classification. In: Proceedings of the 11th international conference on intelligent data engineering and automated learning, IDEAL’10, pp 162–169
16.
go back to reference Lee KH, Lee YJ, Choi H, Chung YD, Moon B (2011) Parallel data processing with mapreduce: a survey. ACM SIGMOD Rec 40(4):11–20CrossRef Lee KH, Lee YJ, Choi H, Chung YD, Moon B (2011) Parallel data processing with mapreduce: a survey. ACM SIGMOD Rec 40(4):11–20CrossRef
17.
go back to reference White T (2011) Hadoop: the definitive guide, 2nd edn. O’Reilly Media Inc., USA White T (2011) Hadoop: the definitive guide, 2nd edn. O’Reilly Media Inc., USA
18.
go back to reference Afrati FN, Ullman JD (2010) Optimizing joins in a map-reduce environment. In: Proceedings of the 13th international conference on extending database technology, EDBT’10, pp 99–110 Afrati FN, Ullman JD (2010) Optimizing joins in a map-reduce environment. In: Proceedings of the 13th international conference on extending database technology, EDBT’10, pp 99–110
19.
go back to reference Jiang D, Tung AKH, Chen G (2011) Map-join-reduce: toward scalable and efficient data analysis on large clusters. IEEE Trans Knowl Data Eng 23(9):1299–1311CrossRef Jiang D, Tung AKH, Chen G (2011) Map-join-reduce: toward scalable and efficient data analysis on large clusters. IEEE Trans Knowl Data Eng 23(9):1299–1311CrossRef
20.
go back to reference Mackert LF, Lohman GM (1986) R* optimizer validation and performance evaluation for distributed queries. In: Proceedings of the 12th international conference on very large data bases, VLDB’86, pp 149–159 Mackert LF, Lohman GM (1986) R* optimizer validation and performance evaluation for distributed queries. In: Proceedings of the 12th international conference on very large data bases, VLDB’86, pp 149–159
21.
go back to reference Kemper A, Kossmann D, Wiesner C (1999), Generalized hash teams for join and group-by. In: Proceedings of the 25th international conference on very large data bases, VLDB’99, pp 30–41 Kemper A, Kossmann D, Wiesner C (1999), Generalized hash teams for join and group-by. In: Proceedings of the 25th international conference on very large data bases, VLDB’99, pp 30–41
22.
go back to reference Michael L, Nejdl W, Papapetrou O, Siberski W (2007) Improving distributed join efficiency with extended bloom filter operations. In: Proceedings of the 21st international conference on advanced networking and applications, AINA’07, pp 187–194 Michael L, Nejdl W, Papapetrou O, Siberski W (2007) Improving distributed join efficiency with extended bloom filter operations. In: Proceedings of the 21st international conference on advanced networking and applications, AINA’07, pp 187–194
23.
go back to reference Ramesh S, Papapetrou O, Siberski W (2008) Optimizing distributed joins with bloom filters. Distributed computing and internet technology. In: Lecture notes in computer science, vol 5375, pp. 145–156 Ramesh S, Papapetrou O, Siberski W (2008) Optimizing distributed joins with bloom filters. Distributed computing and internet technology. In: Lecture notes in computer science, vol 5375, pp. 145–156
24.
go back to reference Papapetrou O, Siberski W, Nejdl W (2010) Cardinality estimation and dynamic length adaptation for bloom filters. Distrib Parallel Databases 28(2–3):119–156CrossRef Papapetrou O, Siberski W, Nejdl W (2010) Cardinality estimation and dynamic length adaptation for bloom filters. Distrib Parallel Databases 28(2–3):119–156CrossRef
Metadata
Title
Join processing with threshold-based filtering in MapReduce
Authors
Taewhi Lee
Hye-Chan Bae
Hyoung-Joo Kim
Publication date
01-08-2014
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 2/2014
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-014-1179-9

Other articles of this Issue 2/2014

The Journal of Supercomputing 2/2014 Go to the issue

EditorialNotes

Preface

Premium Partner