DOI: 10.1145/956750.956826
Article

Mining data records in Web pages

Published: 24 August 2003

ABSTRACT

A large amount of information on the Web is contained in regularly structured objects, which we call data records. Such data records are important because they often present the essential information of their host pages, e.g., lists of products or services. Mining such data records makes it possible to extract the information they contain and use it to provide value-added services. Existing automatic techniques are not satisfactory because of their poor accuracy. In this paper, we propose a more effective technique for this task. The technique is based on two observations about data records on the Web and a string matching algorithm. It is able to mine both contiguous and non-contiguous data records. Our experimental results show that the proposed technique substantially outperforms existing techniques.
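The abstract does not say which string matching algorithm is used or what the two observations are. As a rough illustration only, the sketch below assumes that each candidate region of a page has already been flattened into a sequence of HTML tag names, and that similarity is measured with Levenshtein edit distance normalized by the longer sequence. The function names, the 0.3 threshold, and the sample tag sequences are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (assumed metric)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance scaled to [0, 1] by the longer sequence length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def looks_like_records(tag_strings: List[Sequence[str]],
                       threshold: float = 0.3) -> bool:
    """True if every adjacent pair of tag sequences is similar enough.

    The threshold is a hypothetical value chosen for illustration.
    """
    return all(normalized_distance(x, y) <= threshold
               for x, y in zip(tag_strings, tag_strings[1:]))


# Hypothetical tag sequences for three product rows and two unrelated blocks.
rows = [["tr", "td", "img", "td", "a"],
        ["tr", "td", "img", "td", "a"],
        ["tr", "td", "td", "a"]]
unrelated = [["tr", "td", "a"], ["p", "b", "i", "u"]]

print(looks_like_records(rows))
print(looks_like_records(unrelated))
```

Comparing tag sequences rather than visible text reflects the idea that records of the same kind tend to share markup even when their content differs. Whether the paper compares regions in exactly this way cannot be determined from the abstract alone.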


      • Published in

        KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
        August 2003
        736 pages
        ISBN: 1581137370
        DOI: 10.1145/956750

        Copyright © 2003 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States



        Acceptance Rates

        KDD '03 paper acceptance rate: 46 of 298 submissions, 15%
        Overall acceptance rate: 1,133 of 8,635 submissions, 13%
