research-article

Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data

Authors:
Geert Jan Bex

Hasselt University and Transnational University of Limburg

Hasselt University and Transnational University of Limburg
View Profile

,
Wouter Gelade

Hasselt University and Transnational University of Limburg

Hasselt University and Transnational University of Limburg
View Profile

,
Frank Neven

Hasselt University and Transnational University of Limburg

Hasselt University and Transnational University of Limburg
View Profile

,
Stijn Vansummeren

Université Libre de Bruxelles

Université Libre de Bruxelles
View Profile

Authors Info & Claims

ACM Transactions on the Web Volume 4 Issue 4Article No.: 14pp 1–32https://doi.org/10.1145/1841909.1841911

Published:01 September 2010Publication History

ACM Transactions on the Web

Abstract

Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of deterministic regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the deterministic one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.

References

}}Adriaans, P. and Vitányi, P. 2006. The power and perils of MDL. In Proceedings of the International Symposium on Information Theory.Google Scholar
}}Ahonen, H. 1996. Generating Grammars for structured documents using grammatical inference methods. Tech. rep. A-1996-4, Department of Computer Science, University of Finland.Google Scholar
}}Angluin, D. and Smith, C. H. 1983. Inductive inference: Theory and methods. ACM Comput. Surv. 15, 3, 237--269. Google ScholarDigital Library
}}Barbosa, D., Mignet, L., and Veltri, P. 2005. Studying the XML Web: Gathering statistics from an XML sample. World Wide Web 8, 4, 413--438. Google ScholarDigital Library
}}Benedikt, M., Fan, W., and Geerts, F. 2005. XPath satisfiability in the presence of DTDs. In Proceedings of the 24th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. 25--36. Google ScholarDigital Library
}}Bernstein, P. A. 2003. Applying model management to classical meta data problems. In Proceedings of the 1st Biennial Conference on Innovative Data Systems Research.Google Scholar
}}Bex, G. J., Neven, F., and Van den Bussche, J. 2004. DTDs versus XML Schema: A practical study. In Proceedings of the 7th International Workshop on the Web and Databases. 79--84. Google ScholarDigital Library
}}Bex, G. J., Neven, F., Schwentick, T., and Tuyls, K. 2006. Inference of concise DTDs from XML data. In Proceedings of the 32nd International Conference on Very Large Data Bases. 115--126. Google ScholarDigital Library
}}Bex, G. J., Neven, F., and Vansummeren, S. 2007. Inferring XML Schema definitions from XML data. In Proceedings of the 33rd International Conference on Very Large Databases. 998--1009. Google ScholarDigital Library
}}Bex, G. J., Gelade, W., Neven, F., and Vansummeren, S. 2008. Learning deterministic regular expressions for the inference of schemas from XML data. In Proceedings of the International Conference on the World Wide Web (WWW’08). 825--834. Google ScholarDigital Library
}}Bex, G., Neven, F., Schwentick, T., and Vansummeren, S. 2010. Inference of concise regular expressions and DTDs. ACM Trans. Datab. Syst. 35, 2. Google ScholarDigital Library
}}Brāzma, A. 1993. Efficient identification of regular expressions from representative examples. In Proceedings of the 6th Annual ACM Conference on Computational Learning Theory. ACM Press, 236--242. Google ScholarDigital Library
}}Brüggeman-Klein, A. 1993. Regular expressions into finite automata. Theor. Comput. Sci. 120, 2, 197--213. Google ScholarDigital Library
}}Brüggemann-Klein, A. and Wood, D. 1998. One-unambiguous regular languages. Inform. Comput. 140, 2, 229--253. Google ScholarDigital Library
}}Buneman, P., Davidson, S. B., Fernandez, M. F., and Suciu, D. 1997. Adding structure to unstructured data. In Proceedings of the 6th International Conference on Database Theory (ICDT’97). F. N. Afrati and P. G. Kolaitis, Eds. Lecture Notes in Computer Science, vol. 1186. Springer, 336--350. Google ScholarDigital Library
}}Che, D., Aberer, K., and Özsu, M. T. 2006. Query optimization in XML structured-document databases. VLDB J. 15, 3, 263--289. Google ScholarDigital Library
}}Chidlovskii, B. 2001. Schema extraction from XML: A grammatical inference approach. In Proceedings of the 8th International Workshop on Knowledge Representation meets Databases.Google Scholar
}}Clark, J. Trang: Multi-format schema converter based on RELAX NG. http://www.thaiopensource.com/relaxng/trang.html.Google Scholar
}}Clark, J. and Murata, M. 2001. RELAX NG specification. OASIS.Google Scholar
}}Cover, R. 2003. The Cover Pages. http://xml.coverpages.org/.Google Scholar
}}Du, F., Amer-Yahia, S., and Freire, J. 2004. ShreX: Managing XML documents in relational databases. In Proceedings of the 30th International Conference on Very Large Data Bases. 1297--1300. Google ScholarDigital Library
}}Ehrenfeucht, A. and Zeiger, P. 1976. Complexity measures for regular expressions. J. Comput. Syst. Sci. 12, 134--146. Google ScholarDigital Library
}}Fernau, H. 2004. Extracting minimum length document type definitions is NP-hard. In Proceedings of the International Colloquium on Grammatical Inference (ICGI). 277--278.Google ScholarCross Ref
}}Fernau, H. 2005. Algorithms for Learning Regular Expressions. In Proceedings of the 16th International Conference on Algorithmic Learning Theory. 297--311. Google ScholarDigital Library
}}Finn, R., Mistry, J., Schuster-Böckler, B., Griffiths-Jones, S., et al. 2006. Pfam: clans, Web tools and services. Nucleic Acids Resear. 34, D247--D251.Google ScholarCross Ref
}}Florescu, D. 2005. Managing semi-structured data. ACM Queue 3, 8. Google ScholarDigital Library
}}François, J.-M. 2006. Jahmm. http://www.run.montefiore.ulg.ac.be/~francois/software/jahmm/.Google Scholar
}}Freire, J., Haritsa, J. R., Ramanath, M., Roy, P., and Siméon, J. 2002. StatiX: making XML count. In Proceedings of the SIGMOD Conference. 181--191. Google ScholarDigital Library
}}Freitag, D. and McCallum, A. 2000. Information extraction with HMM structures learned by stochastic optimization. In Proceedings of the 17th National Conference on Artificial Intelligence. AAAI Press/The MIT Press, 584--589. Google ScholarDigital Library
}}Garcia, P. and Vidal, E. 1990. Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Trans. Patt. Anal. Mach. Intell. 12, 9, 920--925. Google ScholarDigital Library
}}Garofalakis, M., Gionis, A., Rastogi, R., Seshadri, S., and Shim, K. 2003. XTRACT: Learning document type descriptors from XML document collections. Data Min. Knowl. Disc. 7, 23--56. Google ScholarDigital Library
}}Gelade, W. and Neven, F. 2008. Succinctness of the complement and intersection of regular expressions. In Proceedings of the Annual Symposium on Theoretical Aspects of Computer Science (STACS). 325--336.Google Scholar
}}Gold, E. 1967. Language identification in the limit. Inform. Cont. 10, 5, 447--474.Google ScholarCross Ref
}}Goldman, R. and Widom, J. 1997. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the 23rd International Conference on Very Large Data Bases. 436--445. Google ScholarDigital Library
}}Gruber, H. and Holzer, M. 2008. Finite automata, digraph connectivity, and regular expression size. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP). 39--50. Google ScholarDigital Library
}}Hegewald, J., Naumann, F., and Weis, M. 2006. XStruct: Efficient schema extraction from multiple and large XML documents. In Proceedings of the ICDE Workshops. 81. Google ScholarDigital Library
}}Hopcroft, J. and Ullman, J. 2007. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, MA. Google ScholarDigital Library
}}Koch, C., Scherzinger, S., Schweikardt, N., and Stegmaier, B. 2004. Schema-based scheduling of event processors and buffer minimization for queries on structured data streams. In Proceedings of the 30th International Conference on Very Large Data Bases. 228--239. Google ScholarDigital Library
}}Manolescu, I., Florescu, D., and Kossmann, D. 2001. Answering XML queries on heterogeneous data sources. In Proceedings of 27th International Conference on Very Large Data Bases. 241--250. Google ScholarDigital Library
}}Martens, W., Neven, F., Schwentick, T., and Bex, G. J. 2006. Expressiveness and complexity of XML schema. ACM Trans. Datab. Syst. 31, 3, 770--813. Google ScholarDigital Library
}}Mignet, L., Barbosa, D., and Veltri, P. 2003. The XML Web: A first study. In Proceedings of the 12th International World Wide Web Conference. 500--510. Google ScholarDigital Library
}}Nestorov, S., Abiteboul, S., and Motwani, R. 1998. Extracting schema from semistructured data. In Proceedings of the International Conference on Management of Data. ACM Press, 295--306. Google ScholarDigital Library
}}Neven, F. and Schwentick, T. 2006. On the complexity of XPath containment in the presence of disjunction, DTDs, and variables. Logic. Meth. Comput. Sci. 2, 3.Google ScholarCross Ref
}}Pitt, L. 1989. Inductive inference, DFAs, and computational complexity. In Proceedings of the International Workshop on Analogical and Inductive Inference. K. P. Jantke, Ed. Lecture Notes in Computer Science, vol. 397. Springer-Verlag, 18--44. Google ScholarDigital Library
}}Quass, D., Widom, J., Goldman, R., et al. 1996. LORE: A Lightweight Object REpository for semistructured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 549. Google ScholarDigital Library
}}Rabiner, L. 1989. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proc. IEEE 77, 2, 257--286.Google ScholarCross Ref
}}Rahm, E. and Bernstein, P. A. 2001. A survey of approaches to automatic schema matching. VLDB J. 10, 4, 334--350. Google ScholarDigital Library
}}Sahuguet, A. 2000. Everything you ever wanted to know about DTDs, but were afraid to ask (Extended Abstract). In Proceedings of the 3rd International Workshop on The World Wide Web and Databases. D. Suciu and G. Vossen, Eds. Lecture Notes in Computer Science, vol. 1997. Springer, 171--183. Google ScholarDigital Library
}}Sakakibara, Y. 1997. Recent advances of grammatical inference. Theor. Comput. Sci. 185, 1, 15--45. Google ScholarDigital Library
}}Sankey, J. and Wong, R. K. 2001. Structural inference for semistructured data. In Proceedings of the 10th International Conference on Information and Knowledge Management. ACM Press, 159--166. Google ScholarDigital Library
}}Thompson, H., Beech, D., Maloney, M., and Mendelsohn, N. 2001. XML Schema part 1: Structures. W3C.Google Scholar
}}Young-Lai, M. and Tompa, F. W. 2000. Stochastic grammatical inference of text database structure. Mach. Learn. 40, 2, 111--137. Google ScholarDigital Library

Index Terms

Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data

Recommendations

Inference of concise regular expressions and DTDs

We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML-documents, a problem that basically reduces to learning concise regular expressions from positive examples strings. We identify two classes of concise ...
Read More
Learning deterministic regular expressions for the inference of schemas from XML data
WWW '08: Proceedings of the 17th international conference on World Wide Web

Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of ...
Read More
Complexity of Decision Problems for XML Schemas and Chain Regular Expressions

We study the complexity of the inclusion, equivalence, and intersection problem of extended chain regular expressions (eCHAREs). These are regular expressions with a very simple structure: they basically consist of the concatenation of factors, where ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on the Web Volume 4, Issue 4
September 2010
173 pages
ISSN:1559-1131
EISSN:1559-114X
DOI:10.1145/1841909
Issue’s Table of Contents

Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 September 2010
- Accepted: 1 July 2010
- Revised: 1 February 2010
- Received: 1 November 2008
Published in tweb Volume 4, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Regular expressions
XML
schema inference
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 53
  Total Citations
  View Citations
- 720
  Total Downloads
- Downloads (Last 12 months)16
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data

ACM Transactions on the Web

Abstract

References

Cited By

Index Terms

Recommendations

Inference of concise regular expressions and DTDs

Learning deterministic regular expressions for the inference of schemas from XML data

Complexity of Decision Problems for XML Schemas and Chain Regular Expressions

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data

ACM Transactions on the Web

Abstract

References

Cited By

Index Terms

Recommendations

Inference of concise regular expressions and DTDs

Learning deterministic regular expressions for the inference of schemas from XML data

Complexity of Decision Problems for XML Schemas and Chain Regular Expressions

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media