ABSTRACT
Automatic data extraction from semistructured sources such as HTML pages is rapidly growing into a problem of significant importance, spurred by the growing popularity of so-called “shopbots” that enable end users to compare prices of goods and other services at various web sites without having to manually browse and fill out forms at each of these sites.
The main problem one has to contend with when designing data extraction techniques is that the contents of a web page change frequently, either because its data is generated dynamically in response to filling out a form, or because of changes to its presentation format. This makes data extraction particularly challenging, since a desirable requirement of any data extraction technique is that it be “resilient”, i.e., using it we should always be able to locate the object of interest in a page (such as a form or an element in a table generated by a form fill-out) in spite of changes to the page's content and layout.
In this paper we propose a formal computation model for developing resilient data extraction techniques from semistructured sources. Specifically, we formalize the problem of data extraction as one of generating unambiguous extraction expressions, which are regular expressions with some additional structure. The problem of resilience is then formalized as one of generating a maximal extraction expression of this kind. We present characterization theorems for maximal extraction expressions, complexity results for testing them, and algorithms for synthesizing them.
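As a rough illustration only, and not the paper's formal definition, the idea of an extraction expression can be sketched in Python. The sketch assumes that a page is abstracted as a sequence of tag tokens and that the "additional structure" is a marked target token placed between a prefix pattern and a suffix pattern. That shape, the token encoding, and the names `encode`, `extraction_positions`, `ANY`, and `NOT_FORM` are all assumptions made for this example. The check is also per page: it reports how many positions an expression selects in one token sequence, whereas unambiguity in the abstract's sense is a property of the expression as a whole.

```python
import re

def encode(tokens):
    # Hypothetical encoding: each token is followed by one space, so a
    # regular expression over tokens becomes a regex over this string.
    return "".join(tok + " " for tok in tokens)

def extraction_positions(prefix_re, target, suffix_re, tokens):
    # Return every index i such that tokens[i] is the target, the tokens
    # before it match prefix_re, and the tokens after it match suffix_re.
    hits = []
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        before = encode(tokens[:i])
        after = encode(tokens[i + 1:])
        if re.fullmatch(prefix_re, before) and re.fullmatch(suffix_re, after):
            hits.append(i)
    return hits

ANY = r"(?:\S+ )*"                   # any sequence of tokens
NOT_FORM = r"(?:(?!FORM )\S+ )*"     # any sequence of tokens other than FORM

# Two versions of a made-up page: the layout changes, the target does not.
old_page = ["P", "H1", "FORM", "P", "FORM"]
new_page = ["H1", "P", "P", "FORM", "TABLE", "FORM"]

# "The first FORM on the page": selects exactly one position in each version.
first_form_old = extraction_positions(NOT_FORM, "FORM", ANY, old_page)
first_form_new = extraction_positions(NOT_FORM, "FORM", ANY, new_page)

# "Some FORM on the page": selects two positions, so it is ambiguous here.
ambiguous = extraction_positions(ANY, "FORM", ANY, old_page)

print(first_form_old, first_form_new, ambiguous)
```

In this toy setting, the first expression keeps locating the same object after the layout change, which is the informal sense of resilience described above. The second expression shows why unambiguity is required: it cannot say which of the two forms is meant.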
Index Terms
- Computational aspects of resilient data extraction from semistructured sources (extended abstract)