Skip to main content
Top
Published in: International Journal of Parallel Programming 3-4/2022

23-05-2022

A Scalable Similarity Join Algorithm Based on MapReduce and LSH

Authors: Sébastien Rivault, Mostafa Bamha, Sébastien Limet, Sophie Robert

Published in: International Journal of Parallel Programming | Issue 3-4/2022

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Similarity joins are recognized to be among the most useful data processing and analysis operations. A similarity join is used to retrieve all data pairs whose distances are smaller than a predefined threshold \(\lambda\). In this paper, we introduce the MRS-join algorithm to perform similarity joins on large trajectory datasets. The MapReduce model and a randomized local sensitive hashing keys redistribution approach are used to balance load among processing nodes while reducing communications and computations to almost all relevant data by using distributed histograms. A cost analysis of the MRS-join algorithm shows that our approach is insensitive to data skew and guarantees perfect balancing properties, in large scale systems, during all stages of similarity join computations. These performances have been confirmed by a series of experiments using the Fréchet distance on large datasets of trajectories from real world and synthetic data benchmarks.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Alt, H., Godau, M.: Computing the fréchet distance between two polygonal curves. Int. J. Comput. Geomet. Appl. 05(1), 75–91 (1995)CrossRef Alt, H., Godau, M.: Computing the fréchet distance between two polygonal curves. Int. J. Comput. Geomet. Appl. 05(1), 75–91 (1995)CrossRef
2.
go back to reference Baldus, J., Bringmann, K.: A fast implementation of near neighbors queries for fréchet distance (GIS cup). In Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, SIGSPATIAL ’17, pp. 1–4. Association for Computing Machinery (2017) Baldus, J., Bringmann, K.: A fast implementation of near neighbors queries for fréchet distance (GIS cup). In Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, SIGSPATIAL ’17, pp. 1–4. Association for Computing Machinery (2017)
3.
go back to reference Bamha, M.: An optimal and skew-insensitive join and multi-join algorithm for distributed architectures. In: Proceedings of the International Conference on Database and Expert Systems Applications (DEXA’2005). 22–26 August, Copenhagen, Danemark. LNCS, vol. 3588, pp. 616–625. Springer, New York (2005) Bamha, M.: An optimal and skew-insensitive join and multi-join algorithm for distributed architectures. In: Proceedings of the International Conference on Database and Expert Systems Applications (DEXA’2005). 22–26 August, Copenhagen, Danemark. LNCS, vol. 3588, pp. 616–625. Springer, New York (2005)
4.
go back to reference Bamha, M., Exbrayat, M.: Pipelining a skew-insensitive parallel join algorithm. Parallel Process. Lett. 13(3), 317–328 (2003)MathSciNetCrossRef Bamha, M., Exbrayat, M.: Pipelining a skew-insensitive parallel join algorithm. Parallel Process. Lett. 13(3), 317–328 (2003)MathSciNetCrossRef
5.
go back to reference Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E. J., Tian, Y.: A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pp. 975–986. ACM, New York (2010) Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E. J., Tian, Y.: A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pp. 975–986. ACM, New York (2010)
6.
go back to reference Bringmann, K.: Why walking the dog takes time: Frechet distance has no strongly subquadratic algorithms unless seth fails. In: Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, FOCS ’14, pp. 661–670. IEEE Computer Society, USA (2014) Bringmann, K.: Why walking the dog takes time: Frechet distance has no strongly subquadratic algorithms unless seth fails. In: Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, FOCS ’14, pp. 661–670. IEEE Computer Society, USA (2014)
7.
go back to reference Buchin, K., Buchin, M., Meulemans, W., Mulzer, W.: Four soviets walk the dog: Improved bounds for computing the fréchet distance. Discret. Comput. Geomet. 58(1), 180–216 (2017)CrossRef Buchin, K., Buchin, M., Meulemans, W., Mulzer, W.: Four soviets walk the dog: Improved bounds for computing the fréchet distance. Discret. Comput. Geomet. 58(1), 180–216 (2017)CrossRef
8.
go back to reference Ceccarello, M., Driemel, A., Silvestri, F.: Fresh: Fréchet similarity with hashing. In: Friggstad, Z., Sack, J.-R., Salavatipour, M.R. (eds.) Algorithms and Data Structures, pp. 254–268. Springer International Publishing, Cham (2019)CrossRef Ceccarello, M., Driemel, A., Silvestri, F.: Fresh: Fréchet similarity with hashing. In: Friggstad, Z., Sack, J.-R., Salavatipour, M.R. (eds.) Algorithms and Data Structures, pp. 254–268. Springer International Publishing, Cham (2019)CrossRef
9.
go back to reference Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)CrossRef Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)CrossRef
10.
go back to reference Driemel, A., Har-Peled, S., Wenk, C.: Approximating the fréchet distance for realistic curves in near linear time. Discret. Comput. Geomet. 48(1), 94–127 (2012)CrossRef Driemel, A., Har-Peled, S., Wenk, C.: Approximating the fréchet distance for realistic curves in near linear time. Discret. Comput. Geomet. 48(1), 94–127 (2012)CrossRef
11.
go back to reference Driemel, A., Silvestri, F.: Locality-Sensitive Hashing of Curves. In: B. Aronov and M.J. Katz (eds.) 33rd International Symposium on Computational Geometry (SoCG 2017) Leibniz International Proceedings in Informatics (LIPIcs), vol. 77, pp. 37:1–37:16. Dagstuhl, Germany, Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik (2017) Driemel, A., Silvestri, F.: Locality-Sensitive Hashing of Curves. In: B. Aronov and M.J. Katz (eds.) 33rd International Symposium on Computational Geometry (SoCG 2017) Leibniz International Proceedings in Informatics (LIPIcs), vol. 77, pp. 37:1–37:16. Dagstuhl, Germany, Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik (2017)
12.
go back to reference Florence, P.S.: Human behaviour and the principle of least effort. Econ. J. 60(240), 808–810 (1950)CrossRef Florence, P.S.: Human behaviour and the principle of least effort. Econ. J. 60(240), 808–810 (1950)CrossRef
13.
go back to reference Hassan, M.A.H., Bamha, M.: Towards scalability and data skew handling in groupby-joins using mapreduce model. Procedia Comput. Sci. 51, 70–79 (2015)CrossRef Hassan, M.A.H., Bamha, M.: Towards scalability and data skew handling in groupby-joins using mapreduce model. Procedia Comput. Sci. 51, 70–79 (2015)CrossRef
14.
go back to reference Hassan, M.A.H., Bamha, M., Loulergue, F.: Handling data-skew effects in join operations using mapreduce. Procedia Comput. Sci. 29, 145–158 (2014)CrossRef Hassan, M.A.H., Bamha, M., Loulergue, F.: Handling data-skew effects in join operations using mapreduce. Procedia Comput. Sci. 29, 145–158 (2014)CrossRef
15.
go back to reference Hu, X., Tao, Y., Yi, K.: Output-optimal parallel algorithms for similarity joins. In: Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pp. 79–90. ACM, New York (2017) Hu, X., Tao, Y., Yi, K.: Output-optimal parallel algorithms for similarity joins. In: Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pp. 79–90. ACM, New York (2017)
16.
go back to reference Indyk, P.: Approximate nearest neighbor algorithms for frechet distance via product metrics. In: Proceedings of the Eighteenth Annual Symposium on Computational Geometry—SCG ’02, pp. 102–106. ACM Press, New York (2002) Indyk, P.: Approximate nearest neighbor algorithms for frechet distance via product metrics. In: Proceedings of the Eighteenth Annual Symposium on Computational Geometry—SCG ’02, pp. 102–106. ACM Press, New York (2002)
17.
go back to reference Indyk, P., Motwani, R.: Approximate nearest neighbors: Towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, pp. 604–613. Association for Computing Machinery, New York, NY (1998) Indyk, P., Motwani, R.: Approximate nearest neighbors: Towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, pp. 604–613. Association for Computing Machinery, New York, NY (1998)
18.
go back to reference Konzack, M., Mcketterick, T.J., Ophelders, T., Buchin, M., Giuggioli, L., Long, J., Nelson, T., Westenberg, M.A., Buchin, K.: Visual analytics of delays and interaction in movement data. Int. J. Geogr. Inf. Sci. 31(2), 320–345 (2017)CrossRef Konzack, M., Mcketterick, T.J., Ophelders, T., Buchin, M., Giuggioli, L., Long, J., Nelson, T., Westenberg, M.A., Buchin, K.: Visual analytics of delays and interaction in movement data. Int. J. Geogr. Inf. Sci. 31(2), 320–345 (2017)CrossRef
19.
go back to reference Metwally, A., Faloutsos, C.: V-smart-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors. Proc. VLDB Endow. 5(8), 704–715 (2012)CrossRef Metwally, A., Faloutsos, C.: V-smart-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors. Proc. VLDB Endow. 5(8), 704–715 (2012)CrossRef
20.
go back to reference Sriraghavendra, E., Bhattacharyya, K.K., Fréchet, C.: distance based approach for searching online handwritten documents. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), pp. 461–465. IEEE Computer Society (2007) Sriraghavendra, E., Bhattacharyya, K.K., Fréchet, C.: distance based approach for searching online handwritten documents. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), pp. 461–465. IEEE Computer Society (2007)
21.
go back to reference Werner, M., Oliver, D.: ACM SIGSPATIAL GIS cup 2017: range queries under fréchet distance. SIGSPATIAL Special 10(1), 24–27 (2018)CrossRef Werner, M., Oliver, D.: ACM SIGSPATIAL GIS cup 2017: range queries under fréchet distance. SIGSPATIAL Special 10(1), 24–27 (2018)CrossRef
22.
go back to reference Xie, D., Li, F., Phillips, J.M.: Distributed trajectory similarity search. Proc. VLDB Endowment 10(11), 1478–1489 (2017)CrossRef Xie, D., Li, F., Phillips, J.M.: Distributed trajectory similarity search. Proc. VLDB Endowment 10(11), 1478–1489 (2017)CrossRef
23.
go back to reference Yuan, H., Li, G.: Distributed in-memory trajectory similarity search and join on road network. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 1262–1273. IEEE (2019) Yuan, H., Li, G.: Distributed in-memory trajectory similarity search and join on road network. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 1262–1273. IEEE (2019)
Metadata
Title
A Scalable Similarity Join Algorithm Based on MapReduce and LSH
Authors
Sébastien Rivault
Mostafa Bamha
Sébastien Limet
Sophie Robert
Publication date
23-05-2022
Publisher
Springer US
Published in
International Journal of Parallel Programming / Issue 3-4/2022
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-022-00733-6

Other articles of this Issue 3-4/2022

International Journal of Parallel Programming 3-4/2022 Go to the issue

Premium Partner