nach oben

The Journal of Supercomputing

Erschienen in:

01.08.2014

Join processing with threshold-based filtering in MapReduce

verfasst von: Taewhi Lee, Hye-Chan Bae, Hyoung-Joo Kim

Erschienen in: The Journal of Supercomputing | Ausgabe 2/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Data analytics, in particular those involving heterogeneous data, often require join operations on datasets collected from different sources. MapReduce, one of the most popular frameworks for large-scale data processing, is not suited for joining multiple datasets. This is because MapReduce often produces a large number of redundant intermediate results, irrespective of the size of the joined records. Although several existing approaches attempt to reduce the number of such redundant results using Bloom filters, they may be inefficient if large portions of records are joined or the number of distinct keys is large. To alleviate this problem, we propose a join processing method with threshold-based filtering in MapReduce, called TMFR-Join, which is an abbreviation for “Threshold-based Map-Filter-Reduce Join”. TMFR-Join applies filters according to their performance, which is estimated in terms of false-positive rates. It also provides a general framework for exploiting various filtering techniques that support certain desired operations. The experimental results indicate that the performance of TMFR-Join is close to that of the better of existing join processing techniques, both with and without filters.

Vorheriger Artikel A shortly connected mesh topology for high performance and energy efficient network-on-chip architectures

Nächster Artikel Parallelizing exact motif finding algorithms on multi-core

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Thusoo A, Antony S, Jain N, Murthy R, Shao Z, Borthakur D, Sarma JS, Liu H (2010) Data warehousing and analytics infrastructure at facebook. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, SIGMOD’10, pp 1013–1020

Gupta R, Gupta H, Nambiar U, Mohania M (2010) Efficiently querying archived data using hadoop. In: Proceedings of the 19th ACM international conference on information and knowledge management, CIKM’10, pp 1301–1304

Dean J, Ghemawat S (2004) Mapreduce: simplified data processing on large clusters. In: Proceedings of the 6th USENIX symposium on opearting systems design and implementation, OSDI’04, pp 137–150

Hadoop. http://hadoop.apache.org/. Accessed 3 April 2014

Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, SIGMOD’10, pp 975–986

Yang HC, Dasdan A, Hsiao RL, Parker DS (2007) Map-reduce-merge: simplified relational data processing on large clusters. In: Proceedings of the 2007 ACM SIGMOD international conference on management of data, SIGMOD’07, pp 1029–1040

Espinosa A, Hernandez P, Moure JC, Protasio J, Ripoll A (2012) Analysis and improvement of map-reduce data distribution in read mapping applications. J Supercomput 62(3):1305–1317CrossRef

Bloom BH (1970) Space/time trade-offs in hash coding with allowable errors. Commun ACM 13(7):422–426CrossRefMATH

Koutris P (2011) Bloom filters in distributed query execution. University of Washington, Washington. http://www.cs.washington.edu/education/courses/cse544/11wi/projects/koutris.pdf. Accessed 3 April 2014

10.

Lee T, Kim K, Kim HJ (2012) Join processing using bloom filter in mapreduce. In: Proceedings of the 2012 ACM research in applied computation symposium, RACS’12, pp 100–105

11.

Palla K (2009) A comparative analysis of join algorithms using the hadoop map/reduce framework. Master’s thesis, University of Edinburgh, Edinburgh

12.

Zhang C, Wu L, Li J (2013) Efficient processing distributed joins with bloomfilter using mapreduce. Int J Grid Distrib Comput 6(3):43–58

13.

Tarkoma S, Rothenberg CE, Lagerspetz E (2012) Theory and practice of bloom filters for distributed systems. IEEE Commun Surv Tutor 14(1):131–155CrossRef

14.

Bender MA, Farach-Colton M, Johnson R, Kraner R, Kuszmaul BC, Medjedovic D, Montes P, Shetty P, Spillane RP, Zadok E (2012) Don’t thrash: how to cache your hash on flash. Proc VLDB Endow 5(11):1627–1637CrossRef

15.

Quislant R, Gutierrez E, Plata O, Zapata EL (2010), Interval filter: a locality-aware alternative to bloom filters for hardware membership queries by interval classification. In: Proceedings of the 11th international conference on intelligent data engineering and automated learning, IDEAL’10, pp 162–169

16.

Lee KH, Lee YJ, Choi H, Chung YD, Moon B (2011) Parallel data processing with mapreduce: a survey. ACM SIGMOD Rec 40(4):11–20CrossRef

17.

White T (2011) Hadoop: the definitive guide, 2nd edn. O’Reilly Media Inc., USA

18.

Afrati FN, Ullman JD (2010) Optimizing joins in a map-reduce environment. In: Proceedings of the 13th international conference on extending database technology, EDBT’10, pp 99–110

19.

Jiang D, Tung AKH, Chen G (2011) Map-join-reduce: toward scalable and efficient data analysis on large clusters. IEEE Trans Knowl Data Eng 23(9):1299–1311CrossRef

20.

Mackert LF, Lohman GM (1986) R* optimizer validation and performance evaluation for distributed queries. In: Proceedings of the 12th international conference on very large data bases, VLDB’86, pp 149–159

21.

Kemper A, Kossmann D, Wiesner C (1999), Generalized hash teams for join and group-by. In: Proceedings of the 25th international conference on very large data bases, VLDB’99, pp 30–41

22.

Michael L, Nejdl W, Papapetrou O, Siberski W (2007) Improving distributed join efficiency with extended bloom filter operations. In: Proceedings of the 21st international conference on advanced networking and applications, AINA’07, pp 187–194

23.

Ramesh S, Papapetrou O, Siberski W (2008) Optimizing distributed joins with bloom filters. Distributed computing and internet technology. In: Lecture notes in computer science, vol 5375, pp. 145–156

24.

Papapetrou O, Siberski W, Nejdl W (2010) Cardinality estimation and dynamic length adaptation for bloom filters. Distrib Parallel Databases 28(2–3):119–156CrossRef

25.

Herodotou H (2011) Hadoop performance models. In: Technical report CS-2011-05, Duke University, Durham. http://www.cs.duke.edu/starfish/files/hadoop-models.pdf. Accessed 3 April 2014

26.

Cluster setup. http://hadoop.apache.org/docs/r0.19.1/cluster_setup.html. Accessed 3 April 2014

27.

TPC-H benchmark. http://www.tpc.org/tpch/. Accessed 3 April 2014

Titel: Join processing with threshold-based filtering in MapReduce
verfasst von: Taewhi Lee
Hye-Chan Bae
Hyoung-Joo Kim
Publikationsdatum: 01.08.2014
Verlag: Springer US
Erschienen in: The Journal of Supercomputing / Ausgabe 2/2014
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-014-1179-9

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Springer Professional "Wirtschaft+Technik"

Weitere Artikel der Ausgabe 2/2014

Lightweight dynamic partitioning for last-level cache of multicore processor on real system

An augmented Lagrangian method for VLSI global placement

Parallelizing exact motif finding algorithms on multi-core

High performance parallel -means clustering for disk-resident datasets on multi-core CPUs

TB-SnW: Trust-based Spray-and-Wait routing for delay-tolerant networks

An OpenCL micro-benchmark suite for GPUs and CPUs