Article

Automatic extraction of informative blocks from webpages

Authors:
Sandip Debnath, The Pennsylvania State University, PA
Prasenjit Mitra, The Pennsylvania State University, PA
C. Lee Giles, The Pennsylvania State University, PA

SAC '05: Proceedings of the 2005 ACM symposium on Applied computing, March 2005, Pages 1722–1726
https://doi.org/10.1145/1066677.1067065

Published: 13 March 2005

ABSTRACT

Search engines crawl and index webpages according to their informative content. However, webpages, especially dynamically generated ones, contain items that cannot be classified as "primary content", e.g., navigation sidebars, advertisements, and copyright notices. Most end-users search for the primary content and largely do not seek the non-informative content. A tool that helps an end-user or application search and process information from webpages automatically must separate the "primary content blocks" from the other blocks. In this paper, two new algorithms, ContentExtractor and FeatureExtractor, are proposed. The algorithms identify primary content blocks by (i) looking for blocks that do not occur a large number of times across webpages and (ii) looking for blocks with desired features, respectively. They identify the primary content blocks with high precision and recall, reduce the storage requirement for search engines, and result in smaller indexes, faster search times, and better user satisfaction. On several thousand webpages obtained from 11 news websites, our algorithms significantly outperform the entropy-based algorithm proposed by Lin and Ho [7] in both accuracy and run-time.
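
The two ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' ContentExtractor or FeatureExtractor: it assumes each page has already been split into text blocks, treats two blocks as the same only when their text matches exactly, and uses a hypothetical threshold `max_fraction` and hypothetical function names chosen for illustration.

```python
from collections import Counter


def block_document_frequency(pages):
    """Count, for each distinct block, how many pages contain it at least once."""
    counts = Counter()
    for blocks in pages:
        counts.update(set(blocks))  # set(): count a block once per page
    return counts


def rare_blocks(pages, max_fraction=0.5):
    """Idea (i): keep blocks that occur on at most max_fraction of the pages.

    max_fraction is an assumed cut-off, not a value taken from the paper.
    """
    counts = block_document_frequency(pages)
    limit = max_fraction * len(pages)
    return [[b for b in blocks if counts[b] <= limit] for blocks in pages]


def blocks_with_feature(blocks, has_feature):
    """Idea (ii): keep blocks for which a caller-supplied feature test holds."""
    return [b for b in blocks if has_feature(b)]


# Three toy pages from one hypothetical news site, already split into blocks.
pages = [
    ["Home | World | Sports", "Storm closes coastal highway", "Copyright 2005 Example News"],
    ["Home | World | Sports", "Council approves new budget", "Copyright 2005 Example News"],
    ["Home | World | Sports", "Local team wins final", "Copyright 2005 Example News"],
]

primary = rare_blocks(pages)
```

In this toy input the navigation bar and the copyright notice appear on all three pages and are dropped, while each story block appears once and is kept. A real system would need a block-segmentation step and a similarity measure more tolerant than exact string equality.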

References

[1] Ziv Bar-Yossef and Sridhar Rajagopalan. Template detection via data mining and its applications. In Proceedings of WWW 2002, pages 580–591, 2002.
[2] Boris Chidlovskii, Jon Ragetli, and Maarten de Rijke. Wrapper generation via grammar induction. In Machine Learning: ECML 2000, 11th European Conference on Machine Learning, Barcelona, Catalonia, Spain, May 31 - June 2, 2000, Proceedings, volume 1810, pages 96–108. Springer, Berlin, 2000.
[3] Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. Roadrunner: Towards automatic data extraction from large web sites. In Proceedings of the 27th International Conference on Very Large Data Bases, pages 109–118, 2001.
[4] C. Hsu. Initial results on wrapping semistructured web pages with finite-state transducers and contextual rules. In AAAI-98 Workshop on AI and Information Integration, pages 66–73. AAAI Press, 1998.
[5] Nicholas Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1-2):15–68, 2000.
[6] Nicholas Kushmerick, Daniel S. Weld, and Robert B. Doorenbos. Wrapper induction for information extraction. In International Joint Conference on Artificial Intelligence (IJCAI), pages 729–737, 1997.
[7] Shian-Hua Lin and Jan-Ming Ho. Discovering informative content blocks from web documents. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 588–593, 2002.
[8] Bing Liu, Kaidi Zhao, and Lan Yi. Eliminating noisy information in web pages for data mining. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 296–305, 2003.
[9] Ion Muslea, Steven Minton, and Craig A. Knoblock. Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems, 4(1/2):93–114, 2001.
[10] Lan Yi, Bing Liu, and Xiaoli Li. Visualizing web site comparisons. In Proceedings of the eleventh international conference on World Wide Web, pages 693–703, 2002.

Index Terms

1. Information systems
  1. Information systems applications

Published in
SAC '05: Proceedings of the 2005 ACM symposium on Applied computing
March 2005
1814 pages
ISBN: 1581139640
DOI: 10.1145/1066677
Conference Chair: Hisham M. Haddad, Kennesaw State University
Editor: Lorie M. Liebrock, New Mexico Institute of Mining and Technology, Socorro, NM
Program Chairs: Andrea Omicini, Alma Mater Studiorum, Università di Bologna, Italy; Roger L. Wainwright, University of Tulsa, OK
Copyright © 2005 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data mining
electronic publishing
information systems applications
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
Article Metrics
- Total Citations: 47
- Total Downloads: 639
- Downloads (Last 12 months): 2
- Downloads (Last 6 weeks): 1