Skip to main content

2018 | OriginalPaper | Buchkapitel

Document Nearest Neighbors Query Based on Pairwise Similarity with MapReduce

verfasst von : Peipei Lv, Peng Yang, Yong-Qiang Dong, Liang Gu

Erschienen in: Algorithms and Architectures for Parallel Processing

Verlag: Springer International Publishing

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

With the continuous development of Web technology, many Internet issues evolve into Big Data problems, characterized by volume, variety, velocity and variability. Among them, how to organize plenty of web pages and retrieval information needed is a critical one. An important notion is document classification, in which nearest neighbors query is the key issue to be solved. Most parallel nearest neighbors query methods adopt Cartesian Product between training set and testing set resulting in poor time efficiency. In this paper, two methods are proposed on document nearest neighbor query based on pairwise similarity, i.e. brute-force and pre-filtering. brute-force is constituted by two phases (i.e. copying and filtering) and one map-reduce procedure is conducted. In order to obtain nearest neighbors for each document, each document pair is copied twice and all records generated are shuffled. However, time efficiency of shuffle is sensitive to the number of the intermediate results. For the purpose of intermediate results reduction, pre-filtering is proposed for nearest neighbor query based on pairwise similarity. Since only first top-k neighbors are output for each document, the size of records shuffled is kept in the same magnitude as input size in pre-filtering. Additionally, detailed theoretical analysis is provided. The performance of the algorithms is demonstrated by experiments on real world dataset.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Ahmed, O.S., Franklin, S.E., Wulder, M.A., White, J.C.: Extending airborne lidar-derived estimates of forest canopy cover and height over large areas using knn with landsat time series data. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 9(8), 3489–3496 (2016)CrossRef Ahmed, O.S., Franklin, S.E., Wulder, M.A., White, J.C.: Extending airborne lidar-derived estimates of forest canopy cover and height over large areas using knn with landsat time series data. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 9(8), 3489–3496 (2016)CrossRef
2.
Zurück zum Zitat Al Aghbari, Z.: Array-index: a plug&search K nearest neighbors method for high-dimensional data. Data Knowl. Eng. 52(3), 333–352 (2005)CrossRef Al Aghbari, Z.: Array-index: a plug&search K nearest neighbors method for high-dimensional data. Data Knowl. Eng. 52(3), 333–352 (2005)CrossRef
3.
Zurück zum Zitat Almalawi, A.M., Fahad, A., Tari, Z., Cheema, M.A., Khalil, I.: \( k \) NNVWC: an efficient \( k \)-nearest neighbors approach based on various-widths clustering. IEEE Trans. Knowl. Data Eng. 28(1), 68–81 (2016)CrossRef Almalawi, A.M., Fahad, A., Tari, Z., Cheema, M.A., Khalil, I.: \( k \) NNVWC: an efficient \( k \)-nearest neighbors approach based on various-widths clustering. IEEE Trans. Knowl. Data Eng. 28(1), 68–81 (2016)CrossRef
4.
Zurück zum Zitat Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)MathSciNetCrossRef Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)MathSciNetCrossRef
5.
Zurück zum Zitat Cha, G.H., Zhu, X., Petkovic, D., Chung, C.W.: An efficient indexing method for nearest neighbor searches in high-dirnensional image databases. IEEE Trans. Multimed. 4(1), 76–87 (2002)CrossRef Cha, G.H., Zhu, X., Petkovic, D., Chung, C.W.: An efficient indexing method for nearest neighbor searches in high-dirnensional image databases. IEEE Trans. Multimed. 4(1), 76–87 (2002)CrossRef
6.
Zurück zum Zitat Chim, H., Deng, X.: Efficient phrase-based document similarity for clustering. IEEE Trans. Knowl. Data Eng. 20(9), 1217–1229 (2008)CrossRef Chim, H., Deng, X.: Efficient phrase-based document similarity for clustering. IEEE Trans. Knowl. Data Eng. 20(9), 1217–1229 (2008)CrossRef
7.
Zurück zum Zitat Dai, J., Ding, Z.M.: MapReduce based fast kNN join. Chin. J. Comput. (2015) Dai, J., Ding, Z.M.: MapReduce based fast kNN join. Chin. J. Comput. (2015)
8.
Zurück zum Zitat Deng, Z., Zhu, X., Cheng, D., Zong, M., Zhang, S.: Efficient kNN classification algorithm for big data. Neurocomputing 195, 143–148 (2016)CrossRef Deng, Z., Zhu, X., Cheng, D., Zong, M., Zhang, S.: Efficient kNN classification algorithm for big data. Neurocomputing 195, 143–148 (2016)CrossRef
9.
Zurück zum Zitat Dhanabal, S., Chandramathi, S.: A review of various k-nearest neighbor query processing techniques. Int. J. Comput. Appl. 31(7), 14–22 (2011) Dhanabal, S., Chandramathi, S.: A review of various k-nearest neighbor query processing techniques. Int. J. Comput. Appl. 31(7), 14–22 (2011)
10.
Zurück zum Zitat Elsayed, T., Lin, J., Oard, D.W.: Pairwise document similarity in large collections with MapReduce. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pp. 265–268. Association for Computational Linguistics (2008) Elsayed, T., Lin, J., Oard, D.W.: Pairwise document similarity in large collections with MapReduce. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pp. 265–268. Association for Computational Linguistics (2008)
11.
Zurück zum Zitat Fier, F.: Distributed similarity joins on big textual data: toward a robust cost-based framework (2017) Fier, F.: Distributed similarity joins on big textual data: toward a robust cost-based framework (2017)
12.
Zurück zum Zitat Ghiassi, M., Fa’al, F., Abrishamchi, A.: Large metropolitan water demand forecasting using DAN2, FTDNN, and KNN models: a case study of the city of Tehran, Iran. Urban Water J. 14(6), 655–659 (2017)CrossRef Ghiassi, M., Fa’al, F., Abrishamchi, A.: Large metropolitan water demand forecasting using DAN2, FTDNN, and KNN models: a case study of the city of Tehran, Iran. Urban Water J. 14(6), 655–659 (2017)CrossRef
13.
Zurück zum Zitat Kibanov, M., Becker, M., Mueller, J., Atzmueller, M., Hotho, A., Stumme, G.: Adaptive kNN using expected accuracy for classification of geo-spatial data. arXiv preprint arXiv:1801.01453 (2017) Kibanov, M., Becker, M., Mueller, J., Atzmueller, M., Hotho, A., Stumme, G.: Adaptive kNN using expected accuracy for classification of geo-spatial data. arXiv preprint arXiv:​1801.​01453 (2017)
14.
Zurück zum Zitat Lai, J., Liaw, Y.C., Liu, J.: Fast k-nearest-neighbor search based on projection and triangular inequality. Pattern Recognit. 40(2), 351–359 (2007)CrossRef Lai, J., Liaw, Y.C., Liu, J.: Fast k-nearest-neighbor search based on projection and triangular inequality. Pattern Recognit. 40(2), 351–359 (2007)CrossRef
15.
Zurück zum Zitat Lee, K.H., Lee, Y.J., Choi, H., Chung, Y.D., Moon, B.: Parallel data processing with mapreduce: a survey. AcM sIGMoD Rec. 40(4), 11–20 (2012)CrossRef Lee, K.H., Lee, Y.J., Choi, H., Chung, Y.D., Moon, B.: Parallel data processing with mapreduce: a survey. AcM sIGMoD Rec. 40(4), 11–20 (2012)CrossRef
16.
Zurück zum Zitat Li, S.Z., Chan, K.L., Wang, C.: Performance evaluation of the nearest feature line method in image classification and retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 11, 1335–1349 (2000) Li, S.Z., Chan, K.L., Wang, C.: Performance evaluation of the nearest feature line method in image classification and retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 11, 1335–1349 (2000)
17.
Zurück zum Zitat Liaw, Y.C., Leou, M.L., Wu, C.M.: Fast exact k nearest neighbors search using an orthogonal search tree. Pattern Recognit. 43(6), 2351–2358 (2010)CrossRef Liaw, Y.C., Leou, M.L., Wu, C.M.: Fast exact k nearest neighbors search using an orthogonal search tree. Pattern Recognit. 43(6), 2351–2358 (2010)CrossRef
18.
Zurück zum Zitat Lin, J.: Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 155–162. ACM (2009) Lin, J.: Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 155–162. ACM (2009)
19.
Zurück zum Zitat Liu, T., Moore, A.W., Gray, A.: New algorithms for efficient high-dimensional nonparametric classification. J. Mach. Learn. Res. 7, 1135–1158 (2006)MathSciNetMATH Liu, T., Moore, A.W., Gray, A.: New algorithms for efficient high-dimensional nonparametric classification. J. Mach. Learn. Res. 7, 1135–1158 (2006)MathSciNetMATH
20.
Zurück zum Zitat Maillo, J., Triguero, I., Herrera, F.: A MapReduce-based k-nearest neighbor approach for big data classification. In: Trustcom/BigDataSE/ISPA, 2015 IEEE. vol. 2, pp. 167–172. IEEE (2015) Maillo, J., Triguero, I., Herrera, F.: A MapReduce-based k-nearest neighbor approach for big data classification. In: Trustcom/BigDataSE/ISPA, 2015 IEEE. vol. 2, pp. 167–172. IEEE (2015)
21.
Zurück zum Zitat McNames, J.: A fast nearest-neighbor algorithm based on a principal axis search tree. IEEE Trans. Pattern Anal. Mach. Intell. 23(9), 964–976 (2001)CrossRef McNames, J.: A fast nearest-neighbor algorithm based on a principal axis search tree. IEEE Trans. Pattern Anal. Mach. Intell. 23(9), 964–976 (2001)CrossRef
22.
Zurück zum Zitat Nodarakis, N., Sioutas, S., Tsoumakos, D., Tzimas, G., Pitoura, E.: Rapid AkNN query processing for fast classification of multidimensional data in the cloud. Eprint Arxiv (2014) Nodarakis, N., Sioutas, S., Tsoumakos, D., Tzimas, G., Pitoura, E.: Rapid AkNN query processing for fast classification of multidimensional data in the cloud. Eprint Arxiv (2014)
23.
Zurück zum Zitat Omohundro, S.M.: Five balltree construction algorithms. International Computer Science Institute Berkeley (1989) Omohundro, S.M.: Five balltree construction algorithms. International Computer Science Institute Berkeley (1989)
24.
Zurück zum Zitat Schiaffino, L., et al.: Feature selection for KNN classifier to improve accurate detection of subthalamic nucleus during deep brain stimulation surgery in Parkinson’s patients. In: Torres, I., Bustamante, J., Sierra, D. (eds.) VII Latin American Congress on Biomedical Engineering CLAIB 2016, Bucaramanga, Santander, Colombia, October 26th -28th, 2016. IP, vol. 60, pp. 441–444. Springer, Singapore (2017). https://doi.org/10.1007/978-981-10-4086-3_111CrossRef Schiaffino, L., et al.: Feature selection for KNN classifier to improve accurate detection of subthalamic nucleus during deep brain stimulation surgery in Parkinson’s patients. In: Torres, I., Bustamante, J., Sierra, D. (eds.) VII Latin American Congress on Biomedical Engineering CLAIB 2016, Bucaramanga, Santander, Colombia, October 26th -28th, 2016. IP, vol. 60, pp. 441–444. Springer, Singapore (2017). https://​doi.​org/​10.​1007/​978-981-10-4086-3_​111CrossRef
25.
Zurück zum Zitat Sproull, R.F.: Refinements to nearest-neighbor searching in k-dimensional trees. Algorithmica 6(1–6), 579–589 (1991)MathSciNetCrossRef Sproull, R.F.: Refinements to nearest-neighbor searching in k-dimensional trees. Algorithmica 6(1–6), 579–589 (1991)MathSciNetCrossRef
26.
Zurück zum Zitat Tan, S.: An effective refinement strategy for KNN text classifier. Expert Syst. Appl. 30(2), 290–298 (2006)CrossRef Tan, S.: An effective refinement strategy for KNN text classifier. Expert Syst. Appl. 30(2), 290–298 (2006)CrossRef
28.
Zurück zum Zitat Velásquez, J.D., et al.: Docode 5: building a real-world plagiarism detection system. Eng. Appl. Artif. Intell. 64, 261–271 (2017)CrossRef Velásquez, J.D., et al.: Docode 5: building a real-world plagiarism detection system. Eng. Appl. Artif. Intell. 64, 261–271 (2017)CrossRef
29.
Zurück zum Zitat Wang, Y., Wang, Z.O.: A fast KNN algorithm for text categorization. In: 2007 International Conference on Machine Learning and Cybernetics, vol. 6, pp. 3436–3441. IEEE (2007) Wang, Y., Wang, Z.O.: A fast KNN algorithm for text categorization. In: 2007 International Conference on Machine Learning and Cybernetics, vol. 6, pp. 3436–3441. IEEE (2007)
30.
Zurück zum Zitat Yu, C., Ooi, B.C., Tan, K.L., Jagadish, H.: Indexing the distance: an efficient method to KNN processing. In: VLDB, vol. 1, pp. 421–430 (2001) Yu, C., Ooi, B.C., Tan, K.L., Jagadish, H.: Indexing the distance: an efficient method to KNN processing. In: VLDB, vol. 1, pp. 421–430 (2001)
31.
Zurück zum Zitat Zamir, O., Etzioni, O.: Web document clustering: a feasibility demonstration. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 46–54. ACM (1998) Zamir, O., Etzioni, O.: Web document clustering: a feasibility demonstration. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 46–54. ACM (1998)
32.
Zurück zum Zitat Zhang, C., Li, F., Jestes, J.: Efficient parallel KNN joins for large data in MapReduce. In: Proceedings of the 15th International Conference on Extending Database Technology, pp. 38–49. ACM (2012) Zhang, C., Li, F., Jestes, J.: Efficient parallel KNN joins for large data in MapReduce. In: Proceedings of the 15th International Conference on Extending Database Technology, pp. 38–49. ACM (2012)
33.
Zurück zum Zitat Zhang, S., Li, X., Zong, M., Zhu, X., Wang, R.: Efficient knn classification with different numbers of nearest neighbors. IEEE Trans. Neural Netw. Learn. Syst. 29(5), 1774–1785 (2018)MathSciNetCrossRef Zhang, S., Li, X., Zong, M., Zhu, X., Wang, R.: Efficient knn classification with different numbers of nearest neighbors. IEEE Trans. Neural Netw. Learn. Syst. 29(5), 1774–1785 (2018)MathSciNetCrossRef
Metadaten
Titel
Document Nearest Neighbors Query Based on Pairwise Similarity with MapReduce
verfasst von
Peipei Lv
Peng Yang
Yong-Qiang Dong
Liang Gu
Copyright-Jahr
2018
DOI
https://doi.org/10.1007/978-3-030-05051-1_3