ABSTRACT
Search engines crawl and index webpages according to their informative content. However, webpages, especially dynamically generated ones, contain items that cannot be classified as "primary content", e.g., navigation sidebars, advertisements, and copyright notices. Most end-users search for the primary content and largely do not seek the non-informative content. A tool that helps an end-user or application search and process information from webpages automatically must therefore separate the "primary content blocks" from the other blocks. In this paper, we propose two new algorithms, ContentExtractor and FeatureExtractor. They identify primary content blocks by (i) looking for blocks that do not occur a large number of times across webpages and (ii) looking for blocks with desired features, respectively. Both identify the primary content blocks with high precision and recall. They also reduce the storage requirement for search engines, which yields smaller indexes, faster search times, and better user satisfaction. On several thousand webpages obtained from 11 news websites, our algorithms significantly outperform the Entropy-based algorithm proposed by Lin and Ho [7] in both accuracy and run-time.
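The first idea in the abstract, keeping blocks that do not recur across many webpages, can be illustrated with a minimal sketch. This is not the paper's ContentExtractor algorithm. It assumes each page has already been split into text blocks, compares blocks by exact string equality, and uses a hypothetical function name `rare_blocks` with a hypothetical threshold `max_fraction`; none of these choices come from the source.

```python
from collections import Counter


def rare_blocks(pages, max_fraction=0.5):
    """Keep, per page, the blocks found on at most max_fraction of all pages.

    pages: list of pages, each a list of block texts (assumed pre-segmented).
    max_fraction: hypothetical cutoff; blocks seen on more pages are dropped.
    """
    # Count the number of distinct pages in which each block text occurs.
    page_counts = Counter()
    for blocks in pages:
        page_counts.update(set(blocks))

    limit = max_fraction * len(pages)
    return [[b for b in blocks if page_counts[b] <= limit] for blocks in pages]


# Toy input: a navigation bar and a copyright notice repeat on every page.
pages = [
    ["Home | News | Sports", "Storm hits the coast", "Copyright 2004"],
    ["Home | News | Sports", "Election results announced", "Copyright 2004"],
    ["Home | News | Sports", "Local team wins final", "Copyright 2004"],
]
primary = rare_blocks(pages)
```

In this toy input the repeated navigation and copyright blocks appear on all three pages and are dropped, while each page's article text appears once and is kept. Real pages would need approximate block matching rather than exact string equality.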
References
- Ziv Bar-Yossef and Sridhar Rajagopalan. Template detection via data mining and its applications. In Proceedings of WWW 2002, pages 580--591, 2002.
- Boris Chidlovskii, Jon Ragetli, and Maarten de Rijke. Wrapper generation via grammar induction. In Machine Learning: ECML 2000, 11th European Conference on Machine Learning, Barcelona, Catalonia, Spain, May 31 - June 2, 2000, Proceedings, volume 1810, pages 96--108. Springer, Berlin, 2000.
- Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. Roadrunner: Towards automatic data extraction from large web sites. In Proceedings of the 27th International Conference on Very Large Data Bases, pages 109--118, 2001.
- C. Hsu. Initial results on wrapping semistructured web pages with finite-state transducers and contextual rules. In AAAI-98 Workshop on AI and Information Integration, pages 66--73. AAAI Press, 1998.
- Nicholas Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1-2):15--68, 2000.
- Nicholas Kushmerick, Daniel S. Weld, and Robert B. Doorenbos. Wrapper induction for information extraction. In International Joint Conference on Artificial Intelligence (IJCAI), pages 729--737, 1997.
- Shian-Hua Lin and Jan-Ming Ho. Discovering informative content blocks from web documents. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 588--593, 2002.
- Bing Liu, Kaidi Zhao, and Lan Yi. Eliminating noisy information in web pages for data mining. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 296--305, 2003.
- Ion Muslea, Steven Minton, and Craig A. Knoblock. Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems, 4(1/2):93--114, 2001.
- Lan Yi, Bing Liu, and Xiaoli Li. Visualizing web site comparisons. In Proceedings of the eleventh international conference on World Wide Web, pages 693--703, 2002.
Index Terms
- Automatic extraction of informative blocks from webpages