skip to main content
10.1145/335168.335216acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
Article
Free Access

Computational aspects of resilient data extraction from semistructured sources (extended abstract)

Authors Info & Claims
Published:01 May 2000Publication History

ABSTRACT

Automatic data extraction from semistructured sources such as HTML pages is rapidly growing into a problem of significant importance, spurred by the growing popularity of the so called “shopbots” that enable end users to compare prices of goods and other services at various web sites without having to manually browse and fill out forms at each one of these sites.

The main problem one has to contend with when designing data extraction techniques is that the contents of a web page changes frequently, either because its data is generated dynamically, in response to filling out a form, or because of changes to its presentation format. This makes the problem of data extraction particularly challenging, since a desirable requirement of any data extraction technique is that it be “resilient”, i.e., using it we should always be able to locate the object of interest in a page (such as a form or an element in a table generated by a form fill-out) in spite of changes to the page's ntent and layout.

In this paper we propose a formal computation model for developing resilient data extraction techniques from semistructured sources. Specifically we formalize the problem of data extraction as one of generating unambiguous extraction expressions, which are regular expressions with some additional structure. The problem of resilience is then formalized as one of generating a maximal extraction expression of this kind. We present characterization theorems for maximal extraction expressions, complexity results for testing them, and algorithms for synthesizing them.

References

  1. 1.S. Abiteboul. Querying semi-structured data. In Int'l Conference on Database Theory, volume 1186, pages 1-18, Delphi, Greece, 1997. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. 2.Serge Abiteboul, Dallan Quass, Jason McHugh, Jennifer Widom, and Janet L. Wiener. The lorel query language for semistructured data. Int. J. on Digital Libraries, 1(1):68-88, 1997.Google ScholarGoogle ScholarCross RefCross Ref
  3. 3.B. Adelberg. NoDoSe: A tool for semi-automatically extracting structured and semi-structured data from text documents. In A CM SIGMOD Conference on Management of Data, pages 283-294, Washington, 1998. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. 4.D. Angluin. Finding patterns common to a set of strings. In A CM Symposium on Theory of Computing, pages 130-141, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. 5.D. Angluin. Inductive inference of formal languages from positive data. Information and Control, 45:117-135, 1980.Google ScholarGoogle ScholarCross RefCross Ref
  6. 6.N. Ashish and C. Knoblock. Wrapper generation for semi-structured internet sources. ACM SIGMOD Record, 26(4):8-15, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. 7.P. Atzeni and G. Mecca. Cut & paste. In ACM Symposium on Principles of Database Systems, pages 117-121, Arizona, June 1997. ACM.Google ScholarGoogle Scholar
  8. 8.R.C. Berwick and S. Pilato. Learning syntax by automata induction. Machine Learning, 2:9-38, 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. 9.B.Ribeiro-Neto, A.H.L. Laender, and A.S. da Silva. Extracting semi-structured data through examples. In Proceedings of the International Conference on Knowledge Management, November 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. 10.P. Buneman. Semistructured data. In A CM Symposium on Principles of Database Systems, pages 117-121, Tucson, Arizona, June 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. 11.P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu. A query language and optimization techniques for unstructured data. In ACM SIGMOD Conference on Management of Data, Montreal, Canada, 1996. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. 12.D.W. Embley, D.M. Campbell, Y.S. Jiang, S.W. Liddle aand D.W. Lonsdale, Y.-K. Ng, and R.D. Smith. Conceptual-model-based data extraction from multiple-record web pages. Journal of Data and Knowledge Engineering, November 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. 13.Jean-Robert Gruser, L. Raschid, M. E. Vidal, and L. Bright. Wrapper generation for web accessible data sources. In Proceedings of the Third International Conference on Cooperative Information Systems (CoopIS98), pages 14-23, New York City, New York, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. 14.J. Hammer, H. Garcia-Molina, J. Cho, A. Crespo, and R. Aranha. Extracting semistructured information from the web. In In Proceedings of the Workshop on Management of Semistructured Data, pages 18-25, Tucson, Arizona, May 1997.Google ScholarGoogle Scholar
  15. 15.J. Hammer, Hector Garcia-Molina, S. Nestorov, Ramana Yerneni, Markus M. Breunig, and Vasilis Vassalos. Template-based wrappers in the tsimmis system. In A CM SIGMOD Conference on Management of Data, pages 532-535. ACM, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. 16.J.E. Hopcroft and J.D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, MA, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. 17.http://www.jango.com. Jango Corporation.Google ScholarGoogle Scholar
  18. 18.N. Kushmerick, D. S. Weld, and R. B. Doorenbos. Wrapper induction for information extraction. In Int'l Joint Conference on Artificial Intelligence, volume 1, pages 729-737, Nagoya, Japan, 1997.Google ScholarGoogle Scholar
  19. 19.H.R. Lewis and C.H. Papadimitriou. Elements of the Theory of Computation. Prentice Hall, Englewood Cliffs, N J, 1981. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. 20.Seung-Jin Lira and Yiu-Kai Ng. An automated approach for retrieving hierarchical data from html table. In Proceedings of the International Conference on Knowledge Management, November 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. 21.M. Perkowitz, R.B. Doorenbos, O. Etzioni, and D.S. Weld. Learning to understand information on the internet: An example-based approach. Journal of Intelligent Information Systems, 8(2):133-153, March 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. 22.A. Sahuguet and F. Azavant. Web Ecology: Recycling HTML pages as XML documents using W4F. In ACM SIGMOD Workshop on Database the Web and Databases (WebBD'99), pages 31-35, Philadelphia, Pennsylvania, June 1999.Google ScholarGoogle Scholar

Index Terms

  1. Computational aspects of resilient data extraction from semistructured sources (extended abstract)

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        PODS '00: Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
        May 2000
        281 pages
        ISBN:158113214X
        DOI:10.1145/335168

        Copyright © 2000 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 1 May 2000

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • Article

        Acceptance Rates

        PODS '00 Paper Acceptance Rate26of119submissions,22%Overall Acceptance Rate642of2,707submissions,24%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader