Article

Web-page classification through summarization

Authors:
Dou Shen, Tsinghua University, Beijing, P.R. China
Zheng Chen, Microsoft Research Asia, Beijing, P.R. China
Qiang Yang, Hong Kong University of Science and Technology, Kowloon, Hong Kong
Hua-Jun Zeng, Microsoft Research Asia, Beijing, P.R. China
Benyu Zhang, Microsoft Research Asia, Beijing, P.R. China
Yuchang Lu, Tsinghua University, Beijing, P.R. China
Wei-Ying Ma, Microsoft Research Asia, Beijing, P.R. China

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
July 2004
Pages 242–249
https://doi.org/10.1145/1008992.1009035

Published: 25 July 2004

ABSTRACT

Web-page classification is much more difficult than pure-text classification because of the large variety of noisy information embedded in Web pages. In this paper, we propose a new Web-page classification algorithm that uses Web summarization to improve accuracy. We first give empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms. We then propose a new Web summarization-based classification algorithm and evaluate it along with several other state-of-the-art text summarization algorithms on the LookSmart Web directory. Experimental results show that our proposed summarization-based classification algorithm achieves an improvement of approximately 8.8% over a pure-text-based classification algorithm. We further introduce an ensemble classifier using the improved summarization algorithm and show that it achieves an improvement of about 12.9% over pure-text-based methods.
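To make the summarize-then-classify idea concrete, the following is a minimal sketch of an extractive summarizer that could run before a classifier. It is not the authors' algorithm: it assumes a simplified frequency-based sentence score loosely in the spirit of Luhn-style extraction, and the names `STOPWORDS`, `split_sentences`, `tokenize`, and `summarize` are hypothetical helpers introduced only for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is",
             "for", "on", "this", "that", "it", "with"}


def split_sentences(text):
    """Split text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase a sentence and return its non-stopword tokens."""
    return [w for w in re.findall(r"[a-z0-9]+", sentence.lower())
            if w not in STOPWORDS]


def summarize(text, max_sentences=2):
    """Keep the sentences whose words are most frequent in the page.

    Each sentence is scored by the average page-level frequency of its
    non-stopword tokens; the top sentences are returned in original order.
    """
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

In a pipeline of the kind the abstract describes, the string returned by `summarize` would replace the full page text as input to the classifier, so that navigation text and similar noise contribute fewer features.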

Reviews

Reviewer: Christoph F. Strnadl

Undoubtedly, information retrieval is pivotal to the continued success of the Web. One means of accomplishing this retrieval is by way of a significant categorization of Web pages (which is nontrivial, given the idiosyncratic Hypertext Markup Language (HTML)-based structure of a Web document). This paper proposes a Web page categorization scheme based on Web page summarization: the algorithm first obtains a textual summary of the examined Web page, and the summary is then classified into the respective categories. The authors implement four different summarization algorithms (an adaptation of Luhn's method, latent semantic analysis, the function-based object model, and supervised classification based on machine learning; human summarization by category editors is used as proxy to the "ideal" summary). They test their scheme's effectiveness on 150,000 pre-categorized pages. Classifiers are obtained from the output of the summarization stage, by applying a naïve Bayesian classifier-learning method, or using a support vector machine. The authors use the standard F1 measure of the field (the harmonic mean of precision and recall).

Two findings are interesting. First, categorization based on summarization outperforms simple, text-based categorization by approximately 14 percent. Second, no individual summarization algorithm is as effective as human summarization. However, an unweighted sum of the four summarization methods nearly reproduces the "ideal" case.

The paper is easy to follow, and focuses on the overall experiment and its results, with only short introductions to the algorithms used. The intended audience for the paper is the research scientist or professional who is fairly deeply involved in the design or implementation of Web page information management algorithms.

Online Computing Reviews Service
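The F1 measure mentioned in the review, the harmonic mean of precision and recall, can be computed as in the short sketch below. The function name `f1_score` and the choice to return 0.0 when precision or recall is undefined are assumptions made for illustration; the paper's own evaluation code is not shown on this page.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1, the harmonic mean of precision and recall.

    Returns 0.0 when precision or recall is undefined (nothing predicted
    or nothing relevant), which is a convention assumed here.
    """
    predicted = true_positives + false_positives
    relevant = true_positives + false_negatives
    if predicted == 0 or relevant == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / relevant
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, a category with 8 correct assignments, 2 wrong assignments, and 8 missed pages has precision 0.8 and recall 0.5, so F1 sits between the two but closer to the lower value.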

Published in
SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
July 2004
624 pages
ISBN: 1581138814
DOI: 10.1145/1008992
General Chair: Mark Sanderson, University of Sheffield (UK)
Program Chairs: Kalervo Järvelin, University of Tampere (Finland); James Allan, University of Massachusetts (USA); Peter Bruza, Distributed Systems Technology Centre (Australia)
Copyright © 2004 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
content body
web page categorization
web page summarization
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%
Article Metrics
- Total Citations: 107
- Total Downloads: 3,312
- Downloads (Last 12 months): 43
- Downloads (Last 6 weeks): 10