research-article

Learning deterministic regular expressions for the inference of schemas from XML data

Authors:
Geert Jan Bex

Hasselt University/Transnational University of Limburg, Diepenbeek, Belgium

Hasselt University/Transnational University of Limburg, Diepenbeek, Belgium
View Profile

,
Wouter Gelade

Hasselt University/Transnational University of Limburg, Diepenbeek, Belgium

Hasselt University/Transnational University of Limburg, Diepenbeek, Belgium
View Profile

,
Frank Neven

Hasselt University/Transnational University of Limburg, Diepenbeek, Belgium

Hasselt University/Transnational University of Limburg, Diepenbeek, Belgium
View Profile

,
Stijn Vansummeren

Hasselt University/Transnational University of Limburg, Diepenbeek, Belgium

Hasselt University/Transnational University of Limburg, Diepenbeek, Belgium
View Profile

WWW '08: Proceedings of the 17th international conference on World Wide WebApril 2008Pages 825–834https://doi.org/10.1145/1367497.1367609

Published:21 April 2008Publication History

WWW '08: Proceedings of the 17th international conference on World Wide Web

Pages 825–834

ABSTRACT

Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.

References

Castor. www.castor.org.Google Scholar
SUN Microsystems JAXB. java.sun.com/webservices/jaxb.Google Scholar
P. Adriaans and P. Vitanyi. The Power and Perils of MDL, 2006.Google Scholar
H. Ahonen. Generating Grammars for Structured Documents using Grammatical Inference Methods. Report A-1996-4, Dept. of Comp. Sci., Univ. of Finland, 1996.Google Scholar
D. Angluin and C. Smith. Inductive Inference: Theory and Methods. ACM Computing Surveys, 15(3):237--269, 1983. Google ScholarDigital Library
D. Barbosa, L. Mignet, and P. Veltri. Studying the XML Web: Gathering Statistics from an XML Sample. World Wide Web, 8(4):413--438, 2005. Google ScholarDigital Library
M. Benedikt, W. Fan, and F. Geerts. XPath Satisfiability in the Presence of DTDs. In PODS, pages 25--36, 2005. Google ScholarDigital Library
P. A. Bernstein. Applying Model Management to Classical Meta Data Problems. In CIDR, 2003.Google Scholar
G. Bex, W. Gelade, F. Neven, and S. Vansummeren. Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data, 2007. http://www.uhasselt.be/~lucg5855/papers/infkore.pdf Google ScholarDigital Library
G. J. Bex, F. Neven, T. Schwentick, and K. Tuyls. Inference of Concise DTDs from XML Data. In VLDB, 2006. Google ScholarDigital Library
G. J. Bex, F. Neven, and J. Van den Bussche. DTDs versus XML Schema: a Practical Study. In WebDB, pages 79--84, 2004. Google ScholarDigital Library
G. J. Bex, F. Neven, and S. Vansummeren. Inferring XML Schema Definitions from XML data. In VLDB, pages 998--1009, 2007. Google ScholarDigital Library
A. Bräzma. Efficient Identification of Regular Expressions from Representative Examples. In COLT, pages 236--242. ACM Press, 1993. Google ScholarDigital Library
A. Brüggeman-Klein. Regular Expressions into Finite Automata. Theor. Comput. Sci., 120(2):197--213, 1993. Google ScholarDigital Library
A. Brüggemann-Klein and D. Wood. One-unambiguous Regular Languages. Information and computation, 140(2):229--253, 1998. Google ScholarDigital Library
P. Buneman, S. B. Davidson, M. F. Fernandez, and D. Suciu. Adding Structure to Unstructured Data. In ICDT, pages 336--350, 1997. Google ScholarDigital Library
P. Caron and D. Ziadi. Characterization of Glushkov Automata. Theo. Comp. Sc., 233(1-2):75--90, 2000. Google ScholarDigital Library
D. Che, K. Aberer, and M. T. Özsu. Query Optimization in XML Structured-Document Databases. VLDB J., 15(3):263--289, 2006. Google ScholarDigital Library
B. Chidlovskii. Schema Extraction from XML: a Grammatical Inference Approach. In KRDB, 2001.Google Scholar
J. Clark. Trang: Multi-format Schema Converter. http://www.thaiopensource.com/relaxng/trang.html.Google Scholar
J. Clark and M. Murata. RELAX NG Specification. OASIS, December 2001.Google Scholar
R. Cover. The Cover Pages. http://xml.coverpages.org/, 2003.Google Scholar
J. F. Fang Du, Sihem Amer-Yahia. ShreX: Managing XML Documents in Relational Databases. In VLDB, pages 1297--1300, 2004. Google ScholarDigital Library
H. Fernau. Algorithms for Learning Regular Expressions. In ALT, pages 297--311, 2005. Google ScholarDigital Library
R. Finn, J. Mistry, B. Schuster-Böckler, S. Griffiths-Jones, et al. Pfam: Clans, Web Tools and Services. Nucleic Acids Research, 34:D247--D251, 2006.Google ScholarCross Ref
D. Florescu. Managing Semi-structured Data. ACM Queue, 3(8), October 2005. Google ScholarDigital Library
J.-M. François. Jahmm. http://www.run.montefiore.ulg.ac.be/~francois/software/jahmm/, April 2006.Google Scholar
J. Freire, J. R. Haritsa, M. Ramanath, P. Roy, and J. Siméon. StatiX: Making XML Count. In SIGMOD, pages 181--191, 2002. Google ScholarDigital Library
D. Freitag and A. McCallum. Information Extraction with HMM Structures Learned by Stochastic Optimization. In AAAI/IAAI, pages 584--589, 2000. Google ScholarDigital Library
P. Garcia and E. Vidal. Inference of k-Testable Languages in the Strict Sense and Application to Syntactic Pattern Recognition. IEEE Trans. Pattern Anal. Mach. Intell., 12(9):920--925, 1990. Google ScholarDigital Library
M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim. XTRACT: Learning Document Type Descriptors from XML Document Collections. Data Mining and Knowledge Discovery, 7:23--56, 2003. Google ScholarDigital Library
E. Gold. Language Identification in the Limit. Information and Control, 10(5):447--474, May 1967.Google ScholarCross Ref
R. Goldman and J. Widom. DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. In VLDB, pages 436--445, 1997. Google ScholarDigital Library
J. Hegewald, F. Naumann, and M. Weis. XStruct: Efficient Schema Extraction from Multiple and Large XML Documents. In ICDE Workshops, page 81, 2006. Google ScholarDigital Library
J. Hopcroft and J. Ullman. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, MA, 2007. Google ScholarDigital Library
C. Koch, S. Scherzinger, N. Schweikardt, and B. Stegmaier. Schema-based Scheduling of Event Processors and Buffer Minimization for Queries on Structured Data Streams. In VLDB, pages 228--239, 2004. Google ScholarDigital Library
I. Manolescu, D. Florescu, and D. Kossmann. Answering XML Queries on Heterogeneous Data Sources. In VLDB, pages 241--250, 2001. Google ScholarDigital Library
W. Martens, F. Neven, T. Schwentick, and G. J. Bex. Expressiveness and Complexity of XML Schema. ACM TODS, 31(3), 2006. Google ScholarDigital Library
L. Mignet, D. Barbosa, and P. Veltri. The XML Web: A First Study. In WWW, pages 500--510, 2003. Google ScholarDigital Library
M. Murata, D. Lee, M. Mani, and K. Kawaguchi. Taxonomy of xml schema languages using formal language theory. ACM Trans. Internet Techn., 5(4):660--704, 2005. Google ScholarDigital Library
S. Nestorov, S. Abiteboul, and R. Motwani. Extracting Schema from Semistructured Data. In ICDM, pages 295--306. 1998.Google ScholarDigital Library
F. Neven and T. Schwentick. On the Complexity of XPath Containment in the Presence of Disjunction, DTDs, and Variables. Logical Methods in Computer Science, 2(3), 2006.Google Scholar
L. Pitt. Inductive Inference, DFAs, and Computational Complexity. In AII, pages 18--44, 1989. Google ScholarDigital Library
D. Quass, J. Widom, R. Goldman, et al. LORE: a Lightweight Object REpository for Semistructured Data. In SIGMOD, page 549, 1996. Google ScholarDigital Library
L. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE, 77(2):257--286, 1989.Google ScholarCross Ref
E. Rahm and P. A. Bernstein. A Survey of Approaches to Automatic Schema Matching. VLDB J., 10(4):334--350, 2001. Google ScholarDigital Library
A. Sahuguet. Everything You Ever Wanted to Know about DTDs, but Were Afraid to Ask. In WebDB, 2000. Google ScholarDigital Library
Y. Sakakibara. Recent Advances of Grammatical Inference. Theor. Comput. Sci., 185(1):15--45, 1997. Google ScholarDigital Library
J. Sankey and R. K. Wong. Structural Inference for Semistructured Data. In CIKM, pages 159--166. 2001. Google ScholarDigital Library
H. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML Schema Part 1: Structures. W3C, May 2001.Google Scholar
M. Young-Lai and F. W. Tompa. Stochastic Grammatical Inference of Text Database Structure. Machine Learning, 40(2):111--137, 2000. Google ScholarDigital Library

Index Terms

Learning deterministic regular expressions for the inference of schemas from XML data

Recommendations

Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data

Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of ...
Read More
Inference of concise regular expressions and DTDs

We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML-documents, a problem that basically reduces to learning concise regular expressions from positive examples strings. We identify two classes of concise ...
Read More
Complexity of Decision Problems for XML Schemas and Chain Regular Expressions

We study the complexity of the inclusion, equivalence, and intersection problem of extended chain regular expressions (eCHAREs). These are regular expressions with a very simple structure: they basically consist of the concatenation of factors, where ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
WWW '08: Proceedings of the 17th international conference on World Wide Web
April 2008
1326 pages
ISBN:9781605580852
DOI:10.1145/1367497
General Chairs:
Jinpeng Huai
Beihang University, China
,
Robin Chen
AT&T Labs, USA
,
Hsiao-Wuen Hon
Microsoft Research Asia, China
,
Yunhao Liu
HK University of Science and Technology, Hong Kong
,
Program Chairs:
Wei-Ying Ma
Microsoft Research Asia, China
,
Andrew Tomkins
Yahoo! Research, USA
,
Xiaodong Zhang
The Ohio State University, USA
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 21 April 2008
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
XML
regular expressions
schema inference
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate1,899of8,196submissions,23%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 32
  Total Citations
  View Citations
- 482
  Total Downloads
- Downloads (Last 12 months)2
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Learning deterministic regular expressions for the inference of schemas from XML data

WWW '08: Proceedings of the 17th international conference on World Wide Web

ABSTRACT

References

Cited By

Index Terms

Recommendations

Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data

Inference of concise regular expressions and DTDs

Complexity of Decision Problems for XML Schemas and Chain Regular Expressions