DOI: 10.1145/956750.956785
Article

Eliminating noisy information in Web pages for data mining

Published: 24 August 2003

ABSTRACT

A commercial Web page typically contains many information blocks. Besides the main content blocks, it usually has blocks such as navigation panels, copyright and privacy notices, and advertisements (for business purposes and for easy user access). We call the blocks that are not main content blocks of the page the noisy blocks. We show that the information contained in these noisy blocks can seriously harm Web data mining, so eliminating this noise is of great importance. In this paper, we propose a noise elimination technique based on the following observation: within a given Web site, noisy blocks usually share common contents and presentation styles, while the main content blocks of the pages are often diverse in their actual contents and/or presentation styles. Based on this observation, we propose a tree structure, called the Style Tree, to capture the common presentation styles and the actual contents of the pages in a given Web site. By sampling the pages of the site, a Style Tree can be built for the site, which we call the Site Style Tree (SST). We then introduce an information-based measure to determine which parts of the SST represent noise and which parts represent the main contents of the site. The SST is then used to detect and eliminate noise in any Web page of the site by mapping that page to the SST. The proposed technique is evaluated on two data mining tasks, Web page clustering and classification. Experimental results show that our noise elimination technique improves the mining results significantly.
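
The observation behind the technique can be illustrated with a small Python sketch. This is not the Style Tree or the information-based measure from the paper, neither of which is specified in the abstract. It is a deliberately simplified stand-in that treats each page as a flat list of text blocks and flags a block as noisy when it recurs on most sampled pages of the site. The function names, the `min_share` threshold, and the sample pages are all hypothetical.

```python
from collections import Counter


def find_noisy_blocks(pages, min_share=0.8):
    """Return the blocks that appear on at least min_share of the pages.

    Simplified stand-in for the paper's approach: a block repeated across
    most pages of a site is assumed to be noise (navigation, notices, ads).
    """
    counts = Counter()
    for blocks in pages:
        counts.update(set(blocks))  # count each block at most once per page
    threshold = min_share * len(pages)
    return {block for block, n in counts.items() if n >= threshold}


def strip_noise(blocks, noisy):
    """Keep only the blocks of one page that were not flagged as noisy."""
    return [block for block in blocks if block not in noisy]


# Hypothetical sample: three pages from one site, already split into blocks.
pages = [
    ["Home | Products | Contact", "Review of camera A", "Copyright 2003 Example Inc."],
    ["Home | Products | Contact", "Specs for laptop B", "Copyright 2003 Example Inc."],
    ["Home | Products | Contact", "Guide to printer C", "Copyright 2003 Example Inc."],
]

noisy = find_noisy_blocks(pages)
cleaned = [strip_noise(page, noisy) for page in pages]
```

Exact string matching is the crudest possible notion of "shared content". According to the abstract, the actual method also compares presentation styles and works on a tree built from the sampled pages, which this sketch does not attempt.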


Published in

KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2003
736 pages
ISBN: 1581137370
DOI: 10.1145/956750

Copyright © 2003 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

KDD '03 paper acceptance rate: 46 of 298 submissions (15%). Overall acceptance rate: 1,133 of 8,635 submissions (13%).
