Fully automatic wrapper generation for search engines

Authors:
Hongkun Zhao, SUNY at Binghamton, Binghamton, NY
Weiyi Meng, SUNY at Binghamton, Binghamton, NY
Zonghuan Wu, Univ. of Louisiana at Lafayette, Lafayette, LA
Vijay Raghavan, Univ. of Louisiana at Lafayette, Lafayette, LA
Clement Yu, University of Illinois at Chicago, Chicago, IL

WWW '05: Proceedings of the 14th international conference on World Wide Web, May 2005, Pages 66–75
https://doi.org/10.1145/1060745.1060760

Published: 10 May 2005

ABSTRACT

When a query is submitted to a search engine, the search engine returns a dynamically generated result page containing the result records, each of which usually consists of a link to, and/or a snippet of, a retrieved Web page. Such a result page often also contains information irrelevant to the query, such as advertisements and information about the site hosting the search engine. In this paper, we present a technique for automatically producing wrappers that extract search result records from the dynamically generated result pages returned by search engines. Automatic extraction of search result records is important for many applications that need to interact with search engines, such as deep Web crawling and the automatic construction and maintenance of metasearch engines. The novel aspect of the proposed technique is that it uses both the visual content features of the result page as displayed in a browser and the HTML tag structure of the page's source file. Experimental results indicate that this technique can achieve very high extraction accuracy.
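
The abstract does not spell out the algorithm, so the following is only a rough illustration of the tag-structure half of the idea, not the authors' method, and it uses no visual features at all. It assumes that result records show up as repeated sibling subtrees with an identical tag signature, and that the input HTML is well formed. The names `TreeBuilder`, `signature`, `candidate_records`, and `SAMPLE_PAGE` are hypothetical and exist only for this sketch.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so we must not descend into them.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """A minimal DOM node: tag name, parent, children, and direct text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.texts = []


class TreeBuilder(HTMLParser):
    """Builds a simple tag tree; assumes well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Move back up only when the closing tag matches the open element.
        if self.current.tag == tag and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.current.texts.append(stripped)


def signature(node):
    """Tag-only shape of a subtree, e.g. 'li(a(),p())'."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def full_text(node):
    """All text in a subtree, in document order, joined by spaces."""
    parts = list(node.texts) + [full_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def candidate_records(html):
    """Return the text of the largest group of same-shaped sibling subtrees."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best_key, best = (0, 0), []
    stack = [builder.root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        groups = defaultdict(list)
        for child in node.children:
            groups[signature(child)].append(child)
        for sig, members in groups.items():
            # Prefer more repetitions; break ties with the richer structure.
            key = (len(members), len(sig))
            if len(members) >= 2 and key > best_key:
                best_key, best = key, members
    return [full_text(m) for m in best]


SAMPLE_PAGE = """
<html><body>
<div><a href="/home">Home</a></div>
<ul>
<li><a href="http://a.example">Title A</a><p>Snippet A</p></li>
<li><a href="http://b.example">Title B</a><p>Snippet B</p></li>
<li><a href="http://c.example">Title C</a><p>Snippet C</p></li>
</ul>
<div><a href="/ads">Ad</a></div>
</body></html>
"""

if __name__ == "__main__":
    for record in candidate_records(SAMPLE_PAGE):
        print(record)
```

On the sample page, the three `li` subtrees outnumber the two navigation and advertisement `div` blocks, so they are picked as the candidate records. A heuristic this simple would fail on real pages where irrelevant blocks repeat more often than the records, which is presumably one reason to combine tag structure with visual cues.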

Index Terms

1. Information systems
  1. World Wide Web

Published in
WWW '05: Proceedings of the 14th international conference on World Wide Web
May 2005
781 pages
ISBN: 1595930469
DOI: 10.1145/1060745
Conference Chairs: Allan Ellis (Southern Cross University), Tatsuya Hagino (Keio University)
Copyright © 2005 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
information extraction
search engine
wrapper generation
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%