DOI: 10.1145/1060745.1060760
Article

Fully automatic wrapper generation for search engines

Published: 10 May 2005

ABSTRACT

When a query is submitted to a search engine, the search engine returns a dynamically generated result page containing the result records, each of which usually consists of a link to and/or snippet of a retrieved Web page. In addition, such a result page often also contains information irrelevant to the query, such as information related to the hosting site of the search engine and advertisements. In this paper, we present a technique for automatically producing wrappers that can be used to extract search result records from dynamically generated result pages returned by search engines. Automatic search result record extraction is very important for many applications that need to interact with search engines such as automatic construction and maintenance of metasearch engines and deep Web crawling. The novel aspect of the proposed technique is that it utilizes both the visual content features on the result page as displayed on a browser and the HTML tag structures of the HTML source file of the result page. Experimental results indicate that this technique can achieve very high extraction accuracy.
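
To make the idea of a wrapper concrete, here is a minimal sketch of the tag-structure half of the problem only. It is not the technique proposed in the paper: it ignores visual features entirely and relies on a simple assumption, namely that result records are the most frequently repeated tag path whose blocks each contain exactly one link plus some text. The names `TagPathCollector`, `extract_records`, and `SAMPLE_PAGE` are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag; skipped so the tag stack stays balanced.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Records, for every element, its tag path plus the text and links inside it."""

    def __init__(self):
        super().__init__()
        self.stack = []        # tags that are currently open
        self.open_blocks = []  # one block per open element, parallel to the stack
        self.blocks = []       # blocks of elements that have been closed

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        self.open_blocks.append(
            {"path": "/".join(self.stack), "text": [], "links": []}
        )
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # A link belongs to every element that encloses it.
                for block in self.open_blocks:
                    block["links"].append(href)

    def handle_endtag(self, tag):
        # This sketch assumes well-formed input and ignores mismatched end tags.
        if tag in VOID_TAGS or not self.stack or self.stack[-1] != tag:
            return
        self.stack.pop()
        self.blocks.append(self.open_blocks.pop())

    def handle_data(self, data):
        text = data.strip()
        if text:
            for block in self.open_blocks:
                block["text"].append(text)


def extract_records(html):
    """Return the blocks under the most repeated 'one link plus text' tag path."""
    collector = TagPathCollector()
    collector.feed(html)
    by_path = defaultdict(list)
    for block in collector.blocks:
        if len(block["links"]) == 1 and block["text"]:
            by_path[block["path"]].append(block)
    if not by_path:
        return []
    # Prefer the most frequent path; on ties, prefer the shorter (outermost) path.
    best = max(by_path, key=lambda p: (len(by_path[p]), -len(p)))
    return [
        {"link": b["links"][0], "text": " ".join(b["text"])} for b in by_path[best]
    ]


# Hypothetical result page: three records surrounded by navigation and an ad.
SAMPLE_PAGE = """
<html><body>
<div id="nav"><a href="/home">Home</a></div>
<ul>
<li><a href="http://a.example/1">First</a> snippet one</li>
<li><a href="http://a.example/2">Second</a> snippet two</li>
<li><a href="http://a.example/3">Third</a> snippet three</li>
</ul>
<div id="ad"><a href="/ad">Buy</a></div>
</body></html>
"""

for record in extract_records(SAMPLE_PAGE):
    print(record["link"], "|", record["text"])
```

On this sample the navigation and advertisement blocks are left out only because they repeat less often than the list items. A heuristic this simple breaks as soon as irrelevant blocks outnumber the records, which is one reason a technique might want additional evidence such as how the page is rendered.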

        • Published in

          WWW '05: Proceedings of the 14th international conference on World Wide Web
          May 2005
          781 pages
          ISBN:1595930469
          DOI:10.1145/1060745

          Copyright © 2005 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Qualifiers

          • Article

          Acceptance Rates

          Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
