Skip to main content
Erschienen in: The Journal of Supercomputing 3/2016

01.03.2016

An efficient MapReduce algorithm for similarity join in metric spaces

verfasst von: Wen Liu, Yanming Shen, Peng Wang

Erschienen in: The Journal of Supercomputing | Ausgabe 3/2016

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Given a massive set of records, similarity join is to find pairs of records with similarity score greater than a threshold. In this paper, we address the problem of scaling up similarity join for general metric distance functions using MapReduce. First, we propose a novel index structure, Similarity Join Tree (SJT), which partitions data based on the underlying data distribution, and distributes similar records to the same group. Different from existing approaches, SJT can prune a large number of comparisons within reduce tasks by utilizing the by-product results generated in partitioning data. Then, to avoid the straggler reduce tasks, we design a graph partition algorithm by extending the well known Fiduccia–Mattheyses algorithm which can ensure load balancing while minimizing communication cost and redundancy in all reduce tasks. Experimental results using real data sets show that our approach is more effective and scalable compared to state-of-the-art algorithms.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Fußnoten
1
For the added virtual nodes in SJT\(_C\), the weight is set to 1.
 
Literatur
1.
Zurück zum Zitat Henzinger MR (2006) Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval, pp 284–291 Henzinger MR (2006) Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval, pp 284–291
2.
Zurück zum Zitat Broder AZ, Glassman SC, Manasse MS, Zweig G (1997) Syntactic clustering of the web. Comput Netw 29(813):1157–1166 Broder AZ, Glassman SC, Manasse MS, Zweig G (1997) Syntactic clustering of the web. Comput Netw 29(813):1157–1166
3.
Zurück zum Zitat Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113CrossRef Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113CrossRef
4.
Zurück zum Zitat Metwally A, Faloutsos C (2012) V-smart-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors. Proc VLDB Endow 5(8):704–715CrossRef Metwally A, Faloutsos C (2012) V-smart-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors. Proc VLDB Endow 5(8):704–715CrossRef
5.
Zurück zum Zitat Sarma AD, He Y, Chaudhuri S (2014) Clusterjoin: a similarity joins framework using map-reduce. Proc VLDB Endow 7(12):1059–1070CrossRef Sarma AD, He Y, Chaudhuri S (2014) Clusterjoin: a similarity joins framework using map-reduce. Proc VLDB Endow 7(12):1059–1070CrossRef
6.
Zurück zum Zitat Shim K, Srikant R, Agrawal R (2002) High-dimensional similarity joins. IEEE Trans Knowl Data Eng 14(1):156–171CrossRef Shim K, Srikant R, Agrawal R (2002) High-dimensional similarity joins. IEEE Trans Knowl Data Eng 14(1):156–171CrossRef
7.
Zurück zum Zitat Wang Y, Metwally A, Parthasarathy S (2013) Scalable all-pairs similarity search in metric spaces. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, pp 829–837 Wang Y, Metwally A, Parthasarathy S (2013) Scalable all-pairs similarity search in metric spaces. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, pp 829–837
8.
Zurück zum Zitat Weber R, Schek H-J, Blott S (1998) A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: Proceedings of the VLDB conference, pp 194–205 Weber R, Schek H-J, Blott S (1998) A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: Proceedings of the VLDB conference, pp 194–205
9.
Zurück zum Zitat Bingham E, Mannila H (2001) Random projection in dimensionality reduction: applications to image and text data. In: Knowledge discovery and data mining, pp 245–250 Bingham E, Mannila H (2001) Random projection in dimensionality reduction: applications to image and text data. In: Knowledge discovery and data mining, pp 245–250
10.
Zurück zum Zitat Korn F, Jagadish HV, Faloutsos C (1997) Efficiently supporting ad hoc queries in large datasets of time sequences. ACM SIGMOD 26(2):289–300CrossRef Korn F, Jagadish HV, Faloutsos C (1997) Efficiently supporting ad hoc queries in large datasets of time sequences. ACM SIGMOD 26(2):289–300CrossRef
11.
Zurück zum Zitat Chakrabarti K, Keogh EJ, Mehrotra S, Pazzani MJ (2002) Locally adaptive dimensionality reduction for indexing large time series databases. ACM Trans Database Syst 27(2):188–228CrossRef Chakrabarti K, Keogh EJ, Mehrotra S, Pazzani MJ (2002) Locally adaptive dimensionality reduction for indexing large time series databases. ACM Trans Database Syst 27(2):188–228CrossRef
12.
Zurück zum Zitat Keogh EJ, Pazzani MJ (2000) A simple dimensionality reduction technique for fast similarity search in large time series databases. Knowl Discov Data Min Curr Issues New Appl 122–133 Keogh EJ, Pazzani MJ (2000) A simple dimensionality reduction technique for fast similarity search in large time series databases. Knowl Discov Data Min Curr Issues New Appl 122–133
13.
Zurück zum Zitat Vernica R, Carey MJ, Li C (2010) Efficient parallel set-similarity joins using mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, pp 495–506 Vernica R, Carey MJ, Li C (2010) Efficient parallel set-similarity joins using mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, pp 495–506
14.
Zurück zum Zitat Chaudhuri S, Ganti V, Kaushik R (2006) A primitive operator for similarity joins in data cleaning. In: ICDE’06. Proceedings of the 22nd international conference on data engineering, p 5 Chaudhuri S, Ganti V, Kaushik R (2006) A primitive operator for similarity joins in data cleaning. In: ICDE’06. Proceedings of the 22nd international conference on data engineering, p 5
15.
Zurück zum Zitat Xiao C, Wang W, Lin X, Yu JX, Wang G (2011) Efficient similarity joins for near-duplicate detection. ACM Trans Database Syst (TODS) 36(3):15CrossRef Xiao C, Wang W, Lin X, Yu JX, Wang G (2011) Efficient similarity joins for near-duplicate detection. ACM Trans Database Syst (TODS) 36(3):15CrossRef
16.
Zurück zum Zitat Okcan A, Riedewald M (2011) Processing theta-joins using mapreduce. In: SIGMOD, pp 949–960 Okcan A, Riedewald M (2011) Processing theta-joins using mapreduce. In: SIGMOD, pp 949–960
17.
Zurück zum Zitat Beyer Kevin S, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful? In: Proceedings of the ICDT, pp 217–235 Beyer Kevin S, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful? In: Proceedings of the ICDT, pp 217–235
18.
Zurück zum Zitat Bryant V (1985) Metric spaces: iteration and application. cambridge University Press, CambridgeMATH Bryant V (1985) Metric spaces: iteration and application. cambridge University Press, CambridgeMATH
19.
Zurück zum Zitat Traina C Jr, Santos Filho RF, Traina AJM, Vieira MR, Faloutsos Christos (2007) The omni-family of all-purpose access methods: a simple and effective way to make similarity search more efficient. VLDB J 16(4):483–505CrossRef Traina C Jr, Santos Filho RF, Traina AJM, Vieira MR, Faloutsos Christos (2007) The omni-family of all-purpose access methods: a simple and effective way to make similarity search more efficient. VLDB J 16(4):483–505CrossRef
20.
Zurück zum Zitat Chen L, Gao Y, Li X, Jensen CS, Chen G (2015) Efficient metric indexing for similarity search. In: International conference on data engineering (ICDE) Chen L, Gao Y, Li X, Jensen CS, Chen G (2015) Efficient metric indexing for similarity search. In: International conference on data engineering (ICDE)
21.
Zurück zum Zitat Yang S, Yan X, Zong B, Khan A (2012) Towards effective partition management for large graphs. In: SIGMOD Yang S, Yan X, Zong B, Khan A (2012) Towards effective partition management for large graphs. In: SIGMOD
22.
Zurück zum Zitat Bourse F, Lelarge M, Vojnovic M (2014) Balanced graph edge partition. In: KDD ’14—20th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1456–1465 Bourse F, Lelarge M, Vojnovic M (2014) Balanced graph edge partition. In: KDD ’14—20th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1456–1465
24.
Zurück zum Zitat Fiduccia CM, Mattheyses RM (1982) A linear-time heuristic for improving network partitions. In: 19th Proceedings of the design automation conference, pp 175–181 Fiduccia CM, Mattheyses RM (1982) A linear-time heuristic for improving network partitions. In: 19th Proceedings of the design automation conference, pp 175–181
Metadaten
Titel
An efficient MapReduce algorithm for similarity join in metric spaces
verfasst von
Wen Liu
Yanming Shen
Peng Wang
Publikationsdatum
01.03.2016
Verlag
Springer US
Erschienen in
The Journal of Supercomputing / Ausgabe 3/2016
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-016-1651-9

Weitere Artikel der Ausgabe 3/2016

The Journal of Supercomputing 3/2016 Zur Ausgabe

Premium Partner