
2016 | OriginalPaper | Chapter

Implicit Links-Based Techniques to Enrich K-Nearest Neighbors and Naive Bayes Algorithms for Web Page Classification


Abstract

The web has developed into one of the most relevant data sources and has become a broad knowledge base for almost all fields. Its content keeps growing, and its size increases every day. Because of this large amount of data, web page classification has become crucial: users have difficulty finding what they are looking for, even when they use search engines. Web page classification is the process of assigning a web page to one or more classes based on previously seen labeled examples. Web pages contain many contextual features that can be used to improve classification accuracy. In this paper, we present a similarity computation technique that is based on implicit links extracted from the query log and is used with K-Nearest Neighbors (KNN) for web page classification. We also introduce an implicit links-based probability computation method used with Naive Bayes (NB) for web page classification. The newly computed similarity and probability enrich KNN and NB, respectively. Experiments are conducted on two subsets of the Open Directory Project (ODP). Results show that (1) when used as the similarity measure for KNN, the implicit links-based similarity improves results, and (2) the implicit links-based probability improves on the results obtained by NB using only text-based probability.
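The abstract does not say how implicit links are extracted or how the similarity is computed, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes that two pages are implicitly linked when users clicked both for the same query, and it uses the Jaccard overlap of the pages' query sets as a stand-in similarity for KNN. The query log, the URLs, the labels, and the function names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical query log: (query, clicked_url) pairs. Illustrative data only.
QUERY_LOG = [
    ("python tutorial", "a.com"),
    ("python tutorial", "b.com"),
    ("learn python", "a.com"),
    ("learn python", "c.com"),
    ("cheap flights", "d.com"),
    ("cheap flights", "e.com"),
    ("flight deals", "d.com"),
    ("flight deals", "c.com"),
]


def queries_per_page(log):
    """Map each page to the set of queries that led to a click on it."""
    pages = defaultdict(set)
    for query, url in log:
        pages[url].add(query)
    return pages


def implicit_link_similarity(page_a, page_b, pages):
    """Jaccard overlap of the two pages' query sets (an assumed measure)."""
    qa = pages.get(page_a, set())
    qb = pages.get(page_b, set())
    union = qa | qb
    return len(qa & qb) / len(union) if union else 0.0


def knn_classify(target, labeled, pages, k=3):
    """Similarity-weighted vote among the k labeled pages most similar to target."""
    scored = sorted(
        ((implicit_link_similarity(target, url, pages), label)
         for url, label in labeled.items()),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = Counter()
    for sim, label in scored:
        votes[label] += sim
    # With no implicit link to any labeled page, no decision can be made.
    if not votes or max(votes.values()) == 0:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    pages = queries_per_page(QUERY_LOG)
    labeled = {"a.com": "Computers", "b.com": "Computers", "d.com": "Travel"}
    print(knn_classify("e.com", labeled, pages, k=3))
```

In this sketch a page with no query in common with any labeled page gets no label. A practical system would need a fallback for that case, for example a text-based similarity.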

Metadata
Title
Implicit Links-Based Techniques to Enrich K-Nearest Neighbors and Naive Bayes Algorithms for Web Page Classification
Authors
Abdelbadie Belmouhcine
Mohammed Benkhalifa
Copyright Year
2016
DOI
https://doi.org/10.1007/978-3-319-26227-7_71
