nach oben

The Journal of Supercomputing

Erschienen in:

01.03.2016

An efficient MapReduce algorithm for similarity join in metric spaces

verfasst von: Wen Liu, Yanming Shen, Peng Wang

Erschienen in: The Journal of Supercomputing | Ausgabe 3/2016

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Given a massive set of records, similarity join is to find pairs of records with similarity score greater than a threshold. In this paper, we address the problem of scaling up similarity join for general metric distance functions using MapReduce. First, we propose a novel index structure, Similarity Join Tree (SJT), which partitions data based on the underlying data distribution, and distributes similar records to the same group. Different from existing approaches, SJT can prune a large number of comparisons within reduce tasks by utilizing the by-product results generated in partitioning data. Then, to avoid the straggler reduce tasks, we design a graph partition algorithm by extending the well known Fiduccia–Mattheyses algorithm which can ensure load balancing while minimizing communication cost and redundancy in all reduce tasks. Experimental results using real data sets show that our approach is more effective and scalable compared to state-of-the-art algorithms.

Vorheriger Artikel A performance study of incentive schemes in peer-to-peer file-sharing systems

Nächster Artikel Location-aided route discovery mechanism based on two-hop neighbor information for ad hoc network

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

For the added virtual nodes in SJT\(_C\), the weight is set to 1.

Henzinger MR (2006) Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval, pp 284–291

Broder AZ, Glassman SC, Manasse MS, Zweig G (1997) Syntactic clustering of the web. Comput Netw 29(813):1157–1166

Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113CrossRef

Metwally A, Faloutsos C (2012) V-smart-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors. Proc VLDB Endow 5(8):704–715CrossRef

Sarma AD, He Y, Chaudhuri S (2014) Clusterjoin: a similarity joins framework using map-reduce. Proc VLDB Endow 7(12):1059–1070CrossRef

Shim K, Srikant R, Agrawal R (2002) High-dimensional similarity joins. IEEE Trans Knowl Data Eng 14(1):156–171CrossRef

Wang Y, Metwally A, Parthasarathy S (2013) Scalable all-pairs similarity search in metric spaces. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, pp 829–837

Weber R, Schek H-J, Blott S (1998) A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: Proceedings of the VLDB conference, pp 194–205

Bingham E, Mannila H (2001) Random projection in dimensionality reduction: applications to image and text data. In: Knowledge discovery and data mining, pp 245–250

10.

Korn F, Jagadish HV, Faloutsos C (1997) Efficiently supporting ad hoc queries in large datasets of time sequences. ACM SIGMOD 26(2):289–300CrossRef

11.

Chakrabarti K, Keogh EJ, Mehrotra S, Pazzani MJ (2002) Locally adaptive dimensionality reduction for indexing large time series databases. ACM Trans Database Syst 27(2):188–228CrossRef

12.

Keogh EJ, Pazzani MJ (2000) A simple dimensionality reduction technique for fast similarity search in large time series databases. Knowl Discov Data Min Curr Issues New Appl 122–133

13.

Vernica R, Carey MJ, Li C (2010) Efficient parallel set-similarity joins using mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, pp 495–506

14.

Chaudhuri S, Ganti V, Kaushik R (2006) A primitive operator for similarity joins in data cleaning. In: ICDE’06. Proceedings of the 22nd international conference on data engineering, p 5

15.

Xiao C, Wang W, Lin X, Yu JX, Wang G (2011) Efficient similarity joins for near-duplicate detection. ACM Trans Database Syst (TODS) 36(3):15CrossRef

16.

Okcan A, Riedewald M (2011) Processing theta-joins using mapreduce. In: SIGMOD, pp 949–960

17.

Beyer Kevin S, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful? In: Proceedings of the ICDT, pp 217–235

18.

Bryant V (1985) Metric spaces: iteration and application. cambridge University Press, CambridgeMATH

19.

Traina C Jr, Santos Filho RF, Traina AJM, Vieira MR, Faloutsos Christos (2007) The omni-family of all-purpose access methods: a simple and effective way to make similarity search more efficient. VLDB J 16(4):483–505CrossRef

20.

Chen L, Gao Y, Li X, Jensen CS, Chen G (2015) Efficient metric indexing for similarity search. In: International conference on data engineering (ICDE)

21.

Yang S, Yan X, Zong B, Khan A (2012) Towards effective partition management for large graphs. In: SIGMOD

22.

Bourse F, Lelarge M, Vojnovic M (2014) Balanced graph edge partition. In: KDD ’14—20th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1456–1465

23.

Harary F, Norman RZ (1960) Some properties of line digraphs. Rend Circ Mat Palermo 9(2):161–168MathSciNetCrossRefMATH

24.

Fiduccia CM, Mattheyses RM (1982) A linear-time heuristic for improving network partitions. In: 19th Proceedings of the design automation conference, pp 175–181

25.

Newman DJ, Asuncion A (2007) UCI machine learning repository. http://mlearn.ics.uci.edu/MLRepository.html. Accessed 26 Dec 2015

26.

Wikipedia. https://en.wikipedia.org/wiki/Main_Page. Accessed 28 Dec 2015

Titel: An efficient MapReduce algorithm for similarity join in metric spaces
verfasst von: Wen Liu
Yanming Shen
Peng Wang
Publikationsdatum: 01.03.2016
Verlag: Springer US
Erschienen in: The Journal of Supercomputing / Ausgabe 3/2016
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-016-1651-9

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Springer Professional "Wirtschaft+Technik"

Weitere Artikel der Ausgabe 3/2016

Design procedures and NML cost analysis of reversible barrel shifters optimizing garbage and ancilla lines

Location-aided route discovery mechanism based on two-hop neighbor information for ad hoc network

Parallel Partition and Merge QuickSort (PPMQSort) on Multicore CPUs

Sibling virtual machine co-location confirmation and avoidance tactics for Public Infrastructure Clouds

Achieving secure communication over MANET using secret sharing schemes

A performance study of incentive schemes in peer-to-peer file-sharing systems

Premium Partner