DOI:10.1145/1008992.1009035
Article

Web-page classification through summarization

Published: 25 July 2004

ABSTRACT

Web-page classification is much more difficult than pure-text classification because of the large variety of noisy information embedded in Web pages. In this paper, we propose a new Web-page classification algorithm based on Web summarization to improve accuracy. We first give empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms. We then propose a new Web summarization-based classification algorithm and evaluate it along with several other state-of-the-art text summarization algorithms on the LookSmart Web directory. Experimental results show that our proposed summarization-based classification algorithm achieves an improvement of approximately 8.8% over a pure-text-based classification algorithm. We further introduce an ensemble classifier using the improved summarization algorithm and show that it achieves an improvement of about 12.9% over pure-text-based methods.
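The summarize-then-classify idea in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' algorithm: the frequency-based sentence scorer is a simple stand-in summarizer, and `STOP_WORDS`, `tokenize`, `summarize`, and `classify_through_summary` are hypothetical names introduced here. The classifier is assumed to be any callable that maps text to a label.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def tokenize(text):
    """Return lowercase alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def summarize(text, max_sentences=2):
    """Stand-in extractive summarizer: keep the sentences whose words
    are, on average, the most frequent in the whole page."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Emit the selected sentences in their original page order.
    return " ".join(s for s in sentences if s in top)


def classify_through_summary(page_text, classifier, max_sentences=2):
    """Classify the summary of a page instead of its full, noisy text."""
    return classifier(summarize(page_text, max_sentences))


if __name__ == "__main__":
    page = ("Cheap flights to Paris. Book flights and hotels in Paris today. "
            "Click here to log in. Copyright notice.")
    # Toy keyword classifier, used only to exercise the pipeline.
    toy = lambda text: "travel" if "flights" in text.lower() else "other"
    print(summarize(page))
    print(classify_through_summary(page, toy))
```

The point of the design is that navigation and legal text score low and are dropped before the classifier sees the page, so any text classifier can be plugged in unchanged.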

            Reviews

            Christoph F. Strnadl

            Undoubtedly, information retrieval is pivotal to the continued success of the Web. One means of accomplishing this retrieval is a meaningful categorization of Web pages, which is nontrivial given the idiosyncratic Hypertext Markup Language (HTML)-based structure of a Web document. This paper proposes a Web page categorization scheme based on Web page summarization: the algorithm first obtains a textual summary of the examined Web page, and the summary is then classified into the respective categories.

            The authors implement four different summarization algorithms: an adaptation of Luhn's method, latent semantic analysis, the function-based object model, and supervised classification based on machine learning. Human summarization by category editors is used as a proxy for the "ideal" summary. They test their scheme's effectiveness on 150,000 pre-categorized pages. Classifiers are obtained from the output of the summarization stage by applying a naïve Bayesian classifier-learning method or by using a support vector machine. The authors use the standard F1 measure of the field (the harmonic mean of precision and recall).

            Two findings are interesting. First, categorization based on summarization outperforms simple, text-based categorization by approximately 14 percent. Second, no individual summarization algorithm is as effective as human summarization. However, an unweighted sum of the four summarization methods nearly reproduces the "ideal" case.

            The paper is easy to follow, and focuses on the overall experiment and its results, with only short introductions to the algorithms used. The intended audience for the paper is the research scientist or professional who is fairly deeply involved in the design or implementation of Web page information management algorithms.

            Online Computing Reviews Service
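Two quantities named in the review can be made concrete with a short sketch. The F1 measure is computed here from raw counts as the harmonic mean of precision and recall. The "unweighted sum" is shown under one possible reading, assumed for illustration and not confirmed by the review: each summarization method assigns a score to every sentence, and the scores are added with equal weight. The function names `f1_score` and `combine_unweighted` are hypothetical.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 = harmonic mean of precision and recall, computed from counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def combine_unweighted(scores_by_method):
    """Add per-sentence scores from several summarizers with equal weight.

    Assumption for this sketch: scores_by_method is a list with one entry
    per method, and each entry holds one score per sentence of the page.
    """
    sentence_count = len(scores_by_method[0])
    return [sum(scores[i] for scores in scores_by_method)
            for i in range(sentence_count)]


if __name__ == "__main__":
    # 6 correct, 2 spurious, 6 missed: precision 0.75, recall 0.5.
    print(f1_score(6, 2, 6))
    # Two hypothetical methods scoring the same three sentences.
    print(combine_unweighted([[1, 0, 2], [0, 1, 1]]))
```

In practice the per-method scores would likely need to be brought onto a comparable scale before being summed; that step is omitted here.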

            • Published in

              SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
              July 2004
              624 pages
              ISBN:1581138814
              DOI:10.1145/1008992

              Copyright © 2004 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Acceptance Rates

              Overall Acceptance Rate: 792 of 3,983 submissions, 20%
