ABSTRACT
When a query is submitted to a search engine, the search engine returns a dynamically generated result page containing the result records, each of which usually consists of a link to, and/or a snippet of, a retrieved Web page. Such a result page often also contains information irrelevant to the query, such as advertisements and information about the site hosting the search engine. In this paper, we present a technique for automatically producing wrappers that extract search result records from the dynamically generated result pages returned by search engines. Automatic search result record extraction is important for many applications that need to interact with search engines, such as the automatic construction and maintenance of metasearch engines and deep Web crawling. The novel aspect of the proposed technique is that it utilizes both the visual content features of the result page as displayed in a browser and the HTML tag structures of the result page's HTML source file. Experimental results indicate that this technique can achieve very high extraction accuracy.
Index Terms
- Fully automatic wrapper generation for search engines