Article

Free Access

Computational aspects of resilient data extraction from semistructured sources (extended abstract)

Authors:
Hasan Davulcu

Department of Computer Science, SUNY at Stony Brook, Stony Brook, NY

Department of Computer Science, SUNY at Stony Brook, Stony Brook, NY
View Profile

,
Guizhen Yang

Department of Computer Science, SUNY at Stony Brook, Stony Brook, NY

Department of Computer Science, SUNY at Stony Brook, Stony Brook, NY
View Profile

,
Michael Kifer

Department of Computer Science, SUNY at Stony Brook, Stony Brook, NY

Department of Computer Science, SUNY at Stony Brook, Stony Brook, NY
View Profile

,
I. V. Ramakrishnan

Department of Computer Science, SUNY at Stony Brook, Stony Brook, NY

Department of Computer Science, SUNY at Stony Brook, Stony Brook, NY
View Profile

PODS '00: Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systemsMay 2000Pages 136–144https://doi.org/10.1145/335168.335216

Published:01 May 2000Publication History

PODS '00: Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

Pages 136–144

ABSTRACT

Automatic data extraction from semistructured sources such as HTML pages is rapidly growing into a problem of significant importance, spurred by the growing popularity of the so called “shopbots” that enable end users to compare prices of goods and other services at various web sites without having to manually browse and fill out forms at each one of these sites.

The main problem one has to contend with when designing data extraction techniques is that the contents of a web page changes frequently, either because its data is generated dynamically, in response to filling out a form, or because of changes to its presentation format. This makes the problem of data extraction particularly challenging, since a desirable requirement of any data extraction technique is that it be “resilient”, i.e., using it we should always be able to locate the object of interest in a page (such as a form or an element in a table generated by a form fill-out) in spite of changes to the page's ntent and layout.

In this paper we propose a formal computation model for developing resilient data extraction techniques from semistructured sources. Specifically we formalize the problem of data extraction as one of generating unambiguous extraction expressions, which are regular expressions with some additional structure. The problem of resilience is then formalized as one of generating a maximal extraction expression of this kind. We present characterization theorems for maximal extraction expressions, complexity results for testing them, and algorithms for synthesizing them.

References

1.S. Abiteboul. Querying semi-structured data. In Int'l Conference on Database Theory, volume 1186, pages 1-18, Delphi, Greece, 1997. Springer. Google ScholarDigital Library
2.Serge Abiteboul, Dallan Quass, Jason McHugh, Jennifer Widom, and Janet L. Wiener. The lorel query language for semistructured data. Int. J. on Digital Libraries, 1(1):68-88, 1997.Google ScholarCross Ref
3.B. Adelberg. NoDoSe: A tool for semi-automatically extracting structured and semi-structured data from text documents. In A CM SIGMOD Conference on Management of Data, pages 283-294, Washington, 1998. ACM. Google ScholarDigital Library
4.D. Angluin. Finding patterns common to a set of strings. In A CM Symposium on Theory of Computing, pages 130-141, 1979. Google ScholarDigital Library
5.D. Angluin. Inductive inference of formal languages from positive data. Information and Control, 45:117-135, 1980.Google ScholarCross Ref
6.N. Ashish and C. Knoblock. Wrapper generation for semi-structured internet sources. ACM SIGMOD Record, 26(4):8-15, 1997. Google ScholarDigital Library
7.P. Atzeni and G. Mecca. Cut & paste. In ACM Symposium on Principles of Database Systems, pages 117-121, Arizona, June 1997. ACM.Google Scholar
8.R.C. Berwick and S. Pilato. Learning syntax by automata induction. Machine Learning, 2:9-38, 1987. Google ScholarDigital Library
9.B.Ribeiro-Neto, A.H.L. Laender, and A.S. da Silva. Extracting semi-structured data through examples. In Proceedings of the International Conference on Knowledge Management, November 1999. Google ScholarDigital Library
10.P. Buneman. Semistructured data. In A CM Symposium on Principles of Database Systems, pages 117-121, Tucson, Arizona, June 1997. Google ScholarDigital Library
11.P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu. A query language and optimization techniques for unstructured data. In ACM SIGMOD Conference on Management of Data, Montreal, Canada, 1996. ACM. Google ScholarDigital Library
12.D.W. Embley, D.M. Campbell, Y.S. Jiang, S.W. Liddle aand D.W. Lonsdale, Y.-K. Ng, and R.D. Smith. Conceptual-model-based data extraction from multiple-record web pages. Journal of Data and Knowledge Engineering, November 1999. Google ScholarDigital Library
13.Jean-Robert Gruser, L. Raschid, M. E. Vidal, and L. Bright. Wrapper generation for web accessible data sources. In Proceedings of the Third International Conference on Cooperative Information Systems (CoopIS98), pages 14-23, New York City, New York, 1998. Google ScholarDigital Library
14.J. Hammer, H. Garcia-Molina, J. Cho, A. Crespo, and R. Aranha. Extracting semistructured information from the web. In In Proceedings of the Workshop on Management of Semistructured Data, pages 18-25, Tucson, Arizona, May 1997.Google Scholar
15.J. Hammer, Hector Garcia-Molina, S. Nestorov, Ramana Yerneni, Markus M. Breunig, and Vasilis Vassalos. Template-based wrappers in the tsimmis system. In A CM SIGMOD Conference on Management of Data, pages 532-535. ACM, 1997. Google ScholarDigital Library
16.J.E. Hopcroft and J.D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, MA, 1979. Google ScholarDigital Library
17.http://www.jango.com. Jango Corporation.Google Scholar
18.N. Kushmerick, D. S. Weld, and R. B. Doorenbos. Wrapper induction for information extraction. In Int'l Joint Conference on Artificial Intelligence, volume 1, pages 729-737, Nagoya, Japan, 1997.Google Scholar
19.H.R. Lewis and C.H. Papadimitriou. Elements of the Theory of Computation. Prentice Hall, Englewood Cliffs, N J, 1981. Google ScholarDigital Library
20.Seung-Jin Lira and Yiu-Kai Ng. An automated approach for retrieving hierarchical data from html table. In Proceedings of the International Conference on Knowledge Management, November 1999. Google ScholarDigital Library
21.M. Perkowitz, R.B. Doorenbos, O. Etzioni, and D.S. Weld. Learning to understand information on the internet: An example-based approach. Journal of Intelligent Information Systems, 8(2):133-153, March 1997. Google ScholarDigital Library
22.A. Sahuguet and F. Azavant. Web Ecology: Recycling HTML pages as XML documents using W4F. In ACM SIGMOD Workshop on Database the Web and Databases (WebBD'99), pages 31-35, Philadelphia, Pennsylvania, June 1999.Google Scholar

Index Terms

Computational aspects of resilient data extraction from semistructured sources (extended abstract)
1. Information systems
  1. Information retrieval
  2. Information storage systems

Recommendations

Semistructured data and XML
Information organization and databases

XML poses a new set of challenges for semistructured data research. The Extensible Markup Language, XML, is a new recommendation from World Wide Web Consortium that will become a universal data exchange format for the Web. XML shares many common ...
Read More
Effective Web Data Extraction with Ducky
IDEAS '15: Proceedings of the 19th International Database Engineering & Applications Symposium

The World Wide Web has become an invaluable source of data. However, extracting useful information from the vastness of the web can become a challenge as depending on the amount of data needed, manual extraction or creation of web scraping programs may ...
Read More
Semistructured Data: From Practice to Theory
LICS '01: Proceedings of the 16th Annual IEEE Symposium on Logic in Computer Science

Abstract: Semistructured data is data that presents some regularity (it is not an image or plain text) but perhaps not as much as some relational data or some ODMG data (the standard of object databases). Such data is becoming increasingly important and,...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
PODS '00: Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
May 2000
281 pages
ISBN:158113214X
DOI:10.1145/335168
Chairmen:
Victor Vianu
Univ. of California at San Diego, La Jolla
,
Georg Gottlob
Technische Univ. Wien, Vienna, Austria
Copyright © 2000 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 May 2000
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- Article
Conference

Acceptance Rates
PODS '00 Paper Acceptance Rate26of119submissions,22%Overall Acceptance Rate642of2,707submissions,24%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 23
  Total Citations
  View Citations
- 301
  Total Downloads
- Downloads (Last 12 months)16
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Computational aspects of resilient data extraction from semistructured sources (extended abstract)

PODS '00: Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

ABSTRACT

References

Cited By

Index Terms

Recommendations

Semistructured data and XML

Effective Web Data Extraction with Ducky

Semistructured Data: From Practice to Theory

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Computational aspects of resilient data extraction from semistructured sources (extended abstract)

PODS '00: Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

ABSTRACT

References

Cited By

Index Terms

Recommendations

Semistructured data and XML

Effective Web Data Extraction with Ducky

Semistructured Data: From Practice to Theory

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media