DOI: 10.1145/1066677.1067065
Article

Automatic extraction of informative blocks from webpages

Published: 13 March 2005

ABSTRACT

Search engines crawl and index webpages according to their informative content. However, webpages, especially dynamically generated ones, contain items that cannot be classified as "primary content", such as navigation sidebars, advertisements, and copyright notices. Most end-users search for the primary content and largely do not seek the non-informative content. A tool that helps an end-user or application search and process information from webpages automatically must therefore separate the "primary content blocks" from the other blocks. This paper proposes two new algorithms, ContentExtractor and FeatureExtractor. They identify primary content blocks by (i) looking for blocks that do not occur a large number of times across webpages and (ii) looking for blocks with desired features, respectively. Both identify the primary content blocks with high precision and recall, and they reduce the storage requirement for search engines, which results in smaller indexes, faster search times, and better user satisfaction. On several thousand webpages obtained from 11 news websites, our algorithms significantly outperform the entropy-based algorithm proposed by Lin and Ho [7] in both accuracy and run-time.
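
The two selection ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' ContentExtractor or FeatureExtractor, whose details are not given here. It assumes that each page has already been split into blocks represented as plain strings, that two blocks are "the same" only if their text matches exactly, and that a hypothetical threshold `max_fraction` decides what counts as occurring "a large number of times". The function names and the `has_feature` predicate are also illustrative assumptions.

```python
from collections import Counter


def rare_blocks(pages, max_fraction=0.5):
    """Keep, for each page, the blocks that occur in at most
    `max_fraction` of all pages (assumed threshold, exact text match)."""
    n_pages = len(pages)
    page_count = Counter()
    for blocks in pages:
        # Count each distinct block once per page, not once per occurrence.
        page_count.update(set(blocks))
    return [
        [b for b in blocks if page_count[b] / n_pages <= max_fraction]
        for blocks in pages
    ]


def blocks_with_feature(blocks, has_feature):
    """Keep the blocks for which the caller-supplied predicate holds
    (a stand-in for 'blocks with desired features')."""
    return [b for b in blocks if has_feature(b)]


# Hypothetical pages from one site: shared navigation and copyright
# blocks, plus one story block that is unique to each page.
pages = [
    ["Home | News | Sports", "Story A text", "Copyright 2005 Example"],
    ["Home | News | Sports", "Story B text", "Copyright 2005 Example"],
    ["Home | News | Sports", "Story C text", "Copyright 2005 Example"],
]
primary = rare_blocks(pages)
long_blocks = blocks_with_feature(pages[0], lambda b: len(b.split()) >= 3)
```

In this toy input the navigation and copyright blocks appear on every page and are dropped, while each story block appears on one page out of three and is kept. Real pages would need a block segmentation step and a tolerant similarity measure instead of exact string equality.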

References

  1. Ziv Bar-Yossef and Sridhar Rajagopalan. Template detection via data mining and its applications. In Proceedings of WWW 2002, pages 580--591, 2002.
  2. Boris Chidlovskii, Jon Ragetli, and Maarten de Rijke. Wrapper generation via grammar induction. In Machine Learning: ECML 2000, 11th European Conference on Machine Learning, Barcelona, Catalonia, Spain, May 31 - June 2, 2000, Proceedings, volume 1810, pages 96--108. Springer, Berlin, 2000.
  3. Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. Roadrunner: Towards automatic data extraction from large web sites. In Proceedings of the 27th International Conference on Very Large Data Bases, pages 109--118, 2001.
  4. C. Hsu. Initial results on wrapping semistructured web pages with finite-state transducers and contextual rules. In AAAI-98 Workshop on AI and Information Integration, pages 66--73. AAAI Press, 1998.
  5. Nicholas Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1-2):15--68, 2000.
  6. Nicholas Kushmerick, Daniel S. Weld, and Robert B. Doorenbos. Wrapper induction for information extraction. In International Joint Conference on Artificial Intelligence (IJCAI), pages 729--737, 1997.
  7. Shian-Hua Lin and Jan-Ming Ho. Discovering informative content blocks from web documents. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 588--593, 2002.
  8. Bing Liu, Kaidi Zhao, and Lan Yi. Eliminating noisy information in web pages for data mining. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 296--305, 2003.
  9. Ion Muslea, Steven Minton, and Craig A. Knoblock. Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems, 4(1/2):93--114, 2001.
  10. Lan Yi, Bing Liu, and Xiaoli Li. Visualizing web site comparisons. In Proceedings of the eleventh international conference on World Wide Web, pages 693--703, 2002.

    • Published in

      SAC '05: Proceedings of the 2005 ACM symposium on Applied computing
      March 2005
      1814 pages
      ISBN:1581139640
      DOI:10.1145/1066677

      Copyright © 2005 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
