2016 | OriginalPaper | Book chapter

Multidimensional Similarity Join Using MapReduce

Written by: Ye Li, Jian Wang, Leong Hou U

Published in: Web-Age Information Management

Publisher: Springer International Publishing

Abstract

Similarity join is arguably one of the most important operators in multidimensional data analysis. However, processing a similarity join is costly, especially for large-volume, high-dimensional data. In this work, we attempt to process the similarity join on MapReduce so that the join computation can be scaled horizontally. To balance the workload across all MapReduce nodes, we systematically select the most profitable feature using a novel data selectivity approach. Given the selected feature, we develop a partitioning scheme for MapReduce processing based on two different optimization goals. Our proposed techniques are extensively evaluated on real datasets.
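
The abstract gives no algorithmic detail, so the following is only a minimal, single-process sketch of the general idea it describes: pick one feature (dimension), partition the data along it, and join within partitions. Everything here is an assumption made for illustration and not the authors' method. The variance heuristic in `pick_dimension` is a hypothetical stand-in for the paper's selectivity-based feature selection. The fixed-width stripes are a stand-in for its partitioning schemes. The function names and the Euclidean distance threshold `eps` are also hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def pick_dimension(points):
    """Hypothetical heuristic: choose the dimension with the largest variance."""
    def variance(d):
        vals = [p[d] for p in points]
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals) / len(vals)
    return max(range(len(points[0])), key=variance)


def map_phase(points, dim, eps, width):
    """Emit (stripe, (is_home, index)) records, one stripe per `width` along `dim`."""
    for idx, p in enumerate(points):
        home = math.floor(p[dim] / width)
        yield home, (True, idx)
        # A point within eps of its upper stripe boundary may match points
        # in the next stripe, so it is replicated there as a non-home copy.
        if p[dim] + eps >= (home + 1) * width:
            yield home + 1, (False, idx)


def reduce_phase(members, points, eps):
    """Join one stripe; pairs of two replicas are skipped to avoid duplicates."""
    for (home_a, a), (home_b, b) in combinations(members, 2):
        if not (home_a or home_b):
            continue  # both replicas: this pair is reported by their home stripe
        if math.dist(points[a], points[b]) <= eps:
            yield (min(a, b), max(a, b))


def similarity_join(points, eps, width=None):
    """Self-join: all index pairs (i, j), i < j, with Euclidean distance <= eps."""
    width = eps if width is None else width
    if width < eps:
        raise ValueError("stripe width must be at least eps")
    dim = pick_dimension(points)
    groups = defaultdict(list)  # simulates the shuffle step
    for stripe, member in map_phase(points, dim, eps, width):
        groups[stripe].append(member)
    result = set()
    for members in groups.values():
        result.update(reduce_phase(members, points, eps))
    return result
```

Calling `similarity_join(points, eps=0.5, width=2.0)` on a list of equal-length tuples returns the set of matching index pairs. In this sketch, wider stripes reduce replication but put more comparisons on each reducer. That is the kind of trade-off a partitioning scheme has to balance, although the paper's two optimization goals are not specified in the abstract.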

Metadata
Title
Multidimensional Similarity Join Using MapReduce
Authors
Ye Li
Jian Wang
Leong Hou U
Copyright year
2016
DOI
https://doi.org/10.1007/978-3-319-39958-4_36
