Skip to main content
Top
Published in: Cluster Computing 1/2016

01-03-2016

Efficient top-k similarity document search utilizing distributed file systems and cosine similarity

Authors: Mahmoud Alewiwi, Cengiz Orencik, Erkay Savaş

Published in: Cluster Computing | Issue 1/2016

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Document similarity has important real life applications such as finding duplicate web sites and identifying plagiarism. While the basic techniques such as k-similarity algorithms have been long known, overwhelming amount of data, being collected such as in big data setting, calls for novel algorithms to find highly similar documents in reasonably short amount of time. In particular, pairwise comparison of documents’ features, a key operation in calculating document similarity, necessitates prohibitively high storage and computation power. In this paper, we propose a new filtering technique that decreases the number of comparisons between the query set and the search set to find highly similar documents. The proposed filtering technique utilizes Z-order prefix, based on the cosine similarity measure, in which only the most important features are used first to find highly similar documents. We propose a three-phase approach, where the phases are near duplicate detection, common important terms and join phase. We utilize the Hadoop distributed file system and the MapReduce parallel programming model to scale our techniques to big data setting. Our experimental results on real data show that the proposed method performs better than the previous work in the literature in terms of the number of joins, and therefore, speed.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Angiulli, F., Pizzuti, C.: An approximate algorithm for top-k closest pairs join query in large high dimensional data. Data Knowl. Eng. 53(3), 263–281 (2005)CrossRefMATH Angiulli, F., Pizzuti, C.: An approximate algorithm for top-k closest pairs join query in large high dimensional data. Data Knowl. Eng. 53(3), 263–281 (2005)CrossRefMATH
3.
go back to reference Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB ’06, pp. 918–929. VLDB Endowment (2006) Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB ’06, pp. 918–929. VLDB Endowment (2006)
4.
go back to reference Baraglia, R., De Francisci Morales, G., Lucchese, C.: Document similarity self-join with MapReduce. In: 2010 IEEE 10th International Conference on Data Mining (ICDM), pp. 731–736 (2010). doi:10.1109/ICDM.2010.70 Baraglia, R., De Francisci Morales, G., Lucchese, C.: Document similarity self-join with MapReduce. In: 2010 IEEE 10th International Conference on Data Mining (ICDM), pp. 731–736 (2010). doi:10.​1109/​ICDM.​2010.​70
5.
go back to reference Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pp. 131–140. ACM, New York (2007). doi:10.1145/1242572.1242591 Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pp. 131–140. ACM, New York (2007). doi:10.​1145/​1242572.​1242591
7.
go back to reference Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd International Conference on Data Engineering, ICDE ’06, p. 5. IEEE Computer Society, Washington, DC (2006). doi:10.1109/ICDE.2006.9 Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd International Conference on Data Engineering, ICDE ’06, p. 5. IEEE Computer Society, Washington, DC (2006). doi:10.​1109/​ICDE.​2006.​9
10.
go back to reference Elsayed, T., Lin, J., Oard, D.W.: Pairwise document similarity in large collections with mapreduce. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers. HLT-Short ’08, pp. 265–268. Association for Computational Linguistics, Stroudsburg (2008) Elsayed, T., Lin, J., Oard, D.W.: Pairwise document similarity in large collections with mapreduce. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers. HLT-Short ’08, pp. 265–268. Association for Computational Linguistics, Stroudsburg (2008)
12.
go back to reference Falchi, F., Perego, R., Lucchese, C., Rabitti, F., Orlando, S.: A metric cache for similarity search. In: LSDS-IR (2008) Falchi, F., Perego, R., Lucchese, C., Rabitti, F., Orlando, S.: A metric cache for similarity search. In: LSDS-IR (2008)
14.
go back to reference Li, R., Ju, L., Peng, Z., Yu, Z., Wang, C.: Batch text similarity search with mapreduce. In: Du, X., Fan, W., Peng, Z., Sharaf, M.A. (eds.) APWeb. Lecture Notes in Computer Science, vol. 6612, pp. 412–423. Springer, Heidelberg (2011) Li, R., Ju, L., Peng, Z., Yu, Z., Wang, C.: Batch text similarity search with mapreduce. In: Du, X., Fan, W., Peng, Z., Sharaf, M.A. (eds.) APWeb. Lecture Notes in Computer Science, vol. 6612, pp. 412–423. Springer, Heidelberg (2011)
16.
go back to reference Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)CrossRefMATH Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)CrossRefMATH
17.
go back to reference Phan, T.C., d’Orazio, L., Rigaux, P.: Toward intersection filter-based optimization for joins in mapreduce. In: Cloud-I’13, p. 2 (2013) Phan, T.C., d’Orazio, L., Rigaux, P.: Toward intersection filter-based optimization for joins in mapreduce. In: Cloud-I’13, p. 2 (2013)
18.
go back to reference Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2012) Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2012)
19.
go back to reference Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD ’04, pp. 743–754. ACM, New York (2004). doi:10.1145/1007568.1007652 Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD ’04, pp. 743–754. ACM, New York (2004). doi:10.​1145/​1007568.​1007652
20.
go back to reference Tao, Y., Yi, K., Sheng, C., Kalnis, P.: Efficient and accurate nearest neighbor and closest pair search in high-dimensional space. ACM Trans. Database Syst. 35(3), 20:1–20:46 (2010). doi:10.1145/1806907.1806912 CrossRef Tao, Y., Yi, K., Sheng, C., Kalnis, P.: Efficient and accurate nearest neighbor and closest pair search in high-dimensional space. ACM Trans. Database Syst. 35(3), 20:1–20:46 (2010). doi:10.​1145/​1806907.​1806912 CrossRef
21.
go back to reference Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using mapreduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pp. 495–506. ACM, New York (2010). doi:10.1145/1807167.1807222 Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using mapreduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pp. 495–506. ACM, New York (2010). doi:10.​1145/​1807167.​1807222
22.
go back to reference Xiao, C., Wang, W., Lin, X., Yu, J.X.: Efficient similarity joins for near duplicate detection. In: Proceedings of the 17th International Conference on World Wide Web, WWW ’08, pp. 131–140. ACM, New York (2008). doi:10.1145/1367497.1367516 Xiao, C., Wang, W., Lin, X., Yu, J.X.: Efficient similarity joins for near duplicate detection. In: Proceedings of the 17th International Conference on World Wide Web, WWW ’08, pp. 131–140. ACM, New York (2008). doi:10.​1145/​1367497.​1367516
23.
go back to reference Yang, B., Myung, J., Lee, S.G., Lee, D.: A mapreduce-based filtering algorithm for vector similarity join. In: Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication, ICUIMC ’13, pp. 71:1–71:5. ACM, New York (2013). doi:10.1145/2448556.2448627 Yang, B., Myung, J., Lee, S.G., Lee, D.: A mapreduce-based filtering algorithm for vector similarity join. In: Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication, ICUIMC ’13, pp. 71:1–71:5. ACM, New York (2013). doi:10.​1145/​2448556.​2448627
24.
go back to reference Zhang, C., Li, F., Jestes, J.: Efficient parallel knn joins for large data in mapreduce. In: Proceedings of the 15th International Conference on Extending Database Technology, EDBT ’12, pp. 38–49. ACM, New York (2012). doi:10.1145/2247596.2247602 Zhang, C., Li, F., Jestes, J.: Efficient parallel knn joins for large data in mapreduce. In: Proceedings of the 15th International Conference on Extending Database Technology, EDBT ’12, pp. 38–49. ACM, New York (2012). doi:10.​1145/​2247596.​2247602
25.
go back to reference Zhu, S., Wu, J., Xiong, H., Xia, G.: Scaling up top-k cosine similarity search. Data Knowl. Eng. 70(1), 60–83 (2011)CrossRef Zhu, S., Wu, J., Xiong, H., Xia, G.: Scaling up top-k cosine similarity search. Data Knowl. Eng. 70(1), 60–83 (2011)CrossRef
Metadata
Title
Efficient top-k similarity document search utilizing distributed file systems and cosine similarity
Authors
Mahmoud Alewiwi
Cengiz Orencik
Erkay Savaş
Publication date
01-03-2016
Publisher
Springer US
Published in
Cluster Computing / Issue 1/2016
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-015-0506-0

Other articles of this Issue 1/2016

Cluster Computing 1/2016 Go to the issue

Premium Partner