Article

Eliminating noisy information in Web pages for data mining

Authors:
Lan Yi (National University of Singapore, Singapore)
Bing Liu (University of Illinois at Chicago, Chicago, IL)
Xiaoli Li (National University of Singapore, Singapore)

KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2003, Pages 296–305
https://doi.org/10.1145/956750.956785

Published: 24 August 2003

ABSTRACT

A commercial Web page typically contains many information blocks. Apart from the main content blocks, it usually has such blocks as navigation panels, copyright and privacy notices, and advertisements (for business purposes and for easy user access). We call these blocks that are not the main content blocks of the page the noisy blocks. We show that the information contained in these noisy blocks can seriously harm Web data mining. Eliminating these noises is thus of great importance. In this paper, we propose a noise elimination technique based on the following observation: In a given Web site, noisy blocks usually share some common contents and presentation styles, while the main content blocks of the pages are often diverse in their actual contents and/or presentation styles. Based on this observation, we propose a tree structure, called Style Tree, to capture the common presentation styles and the actual contents of the pages in a given Web site. By sampling the pages of the site, a Style Tree can be built for the site, which we call the Site Style Tree (SST). We then introduce an information based measure to determine which parts of the SST represent noises and which parts represent the main contents of the site. The SST is employed to detect and eliminate noises in any Web page of the site by mapping this page to the SST. The proposed technique is evaluated with two data mining tasks, Web page clustering and classification. Experimental results show that our noise elimination technique is able to improve the mining results significantly.
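The observation in the abstract (noisy blocks repeat across a site, main content varies) can be illustrated with a minimal sketch. This is not the paper's Style Tree or its information-based measure, which the abstract does not specify. It is a flat simplification under stated assumptions: each page is already split into blocks keyed by a hypothetical position label such as "nav" or "main", and diversity is scored as the entropy of the block's observed variants with the logarithm taken to base "number of sampled pages". The names `diversity`, `score_blocks`, `remove_noise`, and the 0.5 threshold are all illustrative choices.

```python
import math
from collections import Counter, defaultdict


def diversity(variant_counts, num_pages):
    """Entropy of the observed variants, log base num_pages.

    0 means every page shows the same variant; 1 means every page differs.
    """
    if num_pages <= 1:
        return 1.0  # assumption: one page gives no evidence of repetition
    total = sum(variant_counts.values())
    entropy = 0.0
    for count in variant_counts.values():
        p = count / total
        entropy -= p * math.log(p, num_pages)
    return entropy


def score_blocks(pages):
    """Score each block position by how much its content varies across pages."""
    variants = defaultdict(Counter)
    for page in pages:
        for position, content in page.items():
            variants[position][content] += 1
    return {pos: diversity(counts, len(pages)) for pos, counts in variants.items()}


def remove_noise(page, scores, threshold=0.5):
    """Keep only blocks whose position looks diverse (hypothetical threshold)."""
    return {
        pos: content
        for pos, content in page.items()
        if scores.get(pos, 1.0) >= threshold  # unseen positions are kept
    }


# Hypothetical sample of three pages from one site.
sample_pages = [
    {"nav": "Home | Products | Contact", "main": "Review of camera A",
     "footer": "Copyright Example Inc."},
    {"nav": "Home | Products | Contact", "main": "Review of camera B",
     "footer": "Copyright Example Inc."},
    {"nav": "Home | Products | Contact", "main": "Review of camera C",
     "footer": "Copyright Example Inc."},
]

scores = score_blocks(sample_pages)
cleaned = remove_noise(sample_pages[0], scores)
```

In this toy sample the navigation bar and footer are identical on every page and score as fully repetitive, while the main block differs on every page. A tree-structured version would apply a comparable score at each node rather than at flat positions.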


Index Terms

Eliminating noisy information in Web pages for data mining
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
    2. Retrieval tasks and goals
  2. Information systems applications
    1. Data mining
      1. Clustering


Published in
KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2003
736 pages
ISBN:1581137370
DOI:10.1145/956750
Conference Chair: Lise Getoor (University of Maryland, College Park)
General Chair: Ted Senator (DARPA)
Program Chairs: Pedro Domingos (University of Washington), Christos Faloutsos (Carnegie Mellon University)
Copyright © 2003 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Web mining
noise detection
noise elimination
Qualifiers
- Article
Conference

Acceptance Rates
KDD '03 Paper Acceptance Rate: 46 of 298 submissions, 15%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
