2016 | OriginalPaper | Chapter

Multidimensional Similarity Join Using MapReduce

Authors: Ye Li, Jian Wang, Leong Hou U

Published in: Web-Age Information Management

Publisher: Springer International Publishing

Abstract

Similarity join is arguably one of the most important operators in multidimensional data analysis. However, processing a similarity join is costly, especially for large-volume, high-dimensional data. In this work, we process the similarity join on MapReduce so that the join computation can be scaled horizontally. To balance the workload across all MapReduce nodes, we systematically select the most profitable feature using a novel data selectivity approach. Given the selected feature, we develop a partitioning scheme for MapReduce processing under two different optimization goals. Our proposed techniques are extensively evaluated on real datasets.
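
To make the general idea concrete, the following is a minimal single-process sketch of a distance-threshold similarity join that partitions on one feature, in the spirit of a map and a reduce phase. It is not the authors' method: the abstract does not give the selectivity-based feature selection or the two optimization goals, so the join dimension `dim`, the threshold `eps`, the equal-width strips of size `width`, the Euclidean distance, and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import dist, floor


def map_phase(points, dim, eps, width):
    """Assign each point to a strip along `dim` (hypothetical partitioning).

    A point is also replicated to the next strip when it lies within `eps`
    of its strip's upper boundary, so cross-boundary pairs meet in one reducer.
    """
    if width < eps:
        raise ValueError("width must be at least eps")
    for pid, p in enumerate(points):
        strip = floor(p[dim] / width)
        yield strip, ("home", pid, p)
        if (strip + 1) * width - p[dim] <= eps:
            yield strip + 1, ("replica", pid, p)


def reduce_phase(records, eps):
    """Join the points of one strip; replicas are only compared to home points."""
    home = [(pid, p) for tag, pid, p in records if tag == "home"]
    replica = [(pid, p) for tag, pid, p in records if tag == "replica"]
    for (i, p), (j, q) in combinations(home, 2):
        if dist(p, q) <= eps:
            yield (min(i, j), max(i, j))
    for i, p in replica:
        for j, q in home:
            if dist(p, q) <= eps:
                yield (min(i, j), max(i, j))


def similarity_join(points, dim, eps, width):
    """Simulate shuffle and reduce locally; returns a set of index pairs."""
    groups = defaultdict(list)
    for key, value in map_phase(points, dim, eps, width):
        groups[key].append(value)
    result = set()
    for records in groups.values():
        result.update(reduce_phase(records, eps))
    return result
```

Calling `similarity_join(points, dim=0, eps=0.5, width=1.0)` on a list of coordinate tuples returns the index pairs whose Euclidean distance is at most `eps`. Because strips are at least `eps` wide, replicating in one direction only is enough to find every qualifying pair exactly once. Equal-width strips do nothing to balance skewed data, which is the problem the chapter's feature selection and partitioning scheme are meant to address.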

Metadata

Title: Multidimensional Similarity Join Using MapReduce

Authors: Ye Li, Jian Wang, Leong Hou U

Copyright Year: 2016

DOI: https://doi.org/10.1007/978-3-319-39958-4_36