ABSTRACT
Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.
- Castor. www.castor.org.Google Scholar
- SUN Microsystems JAXB. java.sun.com/webservices/jaxb.Google Scholar
- P. Adriaans and P. Vitanyi. The Power and Perils of MDL, 2006.Google Scholar
- H. Ahonen. Generating Grammars for Structured Documents using Grammatical Inference Methods. Report A-1996-4, Dept. of Comp. Sci., Univ. of Finland, 1996.Google Scholar
- D. Angluin and C. Smith. Inductive Inference: Theory and Methods. ACM Computing Surveys, 15(3):237--269, 1983. Google ScholarDigital Library
- D. Barbosa, L. Mignet, and P. Veltri. Studying the XML Web: Gathering Statistics from an XML Sample. World Wide Web, 8(4):413--438, 2005. Google ScholarDigital Library
- M. Benedikt, W. Fan, and F. Geerts. XPath Satisfiability in the Presence of DTDs. In PODS, pages 25--36, 2005. Google ScholarDigital Library
- P. A. Bernstein. Applying Model Management to Classical Meta Data Problems. In CIDR, 2003.Google Scholar
- G. Bex, W. Gelade, F. Neven, and S. Vansummeren. Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data, 2007. http://www.uhasselt.be/~lucg5855/papers/infkore.pdf Google ScholarDigital Library
- G. J. Bex, F. Neven, T. Schwentick, and K. Tuyls. Inference of Concise DTDs from XML Data. In VLDB, 2006. Google ScholarDigital Library
- G. J. Bex, F. Neven, and J. Van den Bussche. DTDs versus XML Schema: a Practical Study. In WebDB, pages 79--84, 2004. Google ScholarDigital Library
- G. J. Bex, F. Neven, and S. Vansummeren. Inferring XML Schema Definitions from XML data. In VLDB, pages 998--1009, 2007. Google ScholarDigital Library
- A. Bräzma. Efficient Identification of Regular Expressions from Representative Examples. In COLT, pages 236--242. ACM Press, 1993. Google ScholarDigital Library
- A. Brüggeman-Klein. Regular Expressions into Finite Automata. Theor. Comput. Sci., 120(2):197--213, 1993. Google ScholarDigital Library
- A. Brüggemann-Klein and D. Wood. One-unambiguous Regular Languages. Information and computation, 140(2):229--253, 1998. Google ScholarDigital Library
- P. Buneman, S. B. Davidson, M. F. Fernandez, and D. Suciu. Adding Structure to Unstructured Data. In ICDT, pages 336--350, 1997. Google ScholarDigital Library
- P. Caron and D. Ziadi. Characterization of Glushkov Automata. Theo. Comp. Sc., 233(1-2):75--90, 2000. Google ScholarDigital Library
- D. Che, K. Aberer, and M. T. Özsu. Query Optimization in XML Structured-Document Databases. VLDB J., 15(3):263--289, 2006. Google ScholarDigital Library
- B. Chidlovskii. Schema Extraction from XML: a Grammatical Inference Approach. In KRDB, 2001.Google Scholar
- J. Clark. Trang: Multi-format Schema Converter. http://www.thaiopensource.com/relaxng/trang.html.Google Scholar
- J. Clark and M. Murata. RELAX NG Specification. OASIS, December 2001.Google Scholar
- R. Cover. The Cover Pages. http://xml.coverpages.org/, 2003.Google Scholar
- J. F. Fang Du, Sihem Amer-Yahia. ShreX: Managing XML Documents in Relational Databases. In VLDB, pages 1297--1300, 2004. Google ScholarDigital Library
- H. Fernau. Algorithms for Learning Regular Expressions. In ALT, pages 297--311, 2005. Google ScholarDigital Library
- R. Finn, J. Mistry, B. Schuster-Böckler, S. Griffiths-Jones, et al. Pfam: Clans, Web Tools and Services. Nucleic Acids Research, 34:D247--D251, 2006.Google ScholarCross Ref
- D. Florescu. Managing Semi-structured Data. ACM Queue, 3(8), October 2005. Google ScholarDigital Library
- J.-M. François. Jahmm. http://www.run.montefiore.ulg.ac.be/~francois/software/jahmm/, April 2006.Google Scholar
- J. Freire, J. R. Haritsa, M. Ramanath, P. Roy, and J. Siméon. StatiX: Making XML Count. In SIGMOD, pages 181--191, 2002. Google ScholarDigital Library
- D. Freitag and A. McCallum. Information Extraction with HMM Structures Learned by Stochastic Optimization. In AAAI/IAAI, pages 584--589, 2000. Google ScholarDigital Library
- P. Garcia and E. Vidal. Inference of k-Testable Languages in the Strict Sense and Application to Syntactic Pattern Recognition. IEEE Trans. Pattern Anal. Mach. Intell., 12(9):920--925, 1990. Google ScholarDigital Library
- M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim. XTRACT: Learning Document Type Descriptors from XML Document Collections. Data Mining and Knowledge Discovery, 7:23--56, 2003. Google ScholarDigital Library
- E. Gold. Language Identification in the Limit. Information and Control, 10(5):447--474, May 1967.Google ScholarCross Ref
- R. Goldman and J. Widom. DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. In VLDB, pages 436--445, 1997. Google ScholarDigital Library
- J. Hegewald, F. Naumann, and M. Weis. XStruct: Efficient Schema Extraction from Multiple and Large XML Documents. In ICDE Workshops, page 81, 2006. Google ScholarDigital Library
- J. Hopcroft and J. Ullman. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, MA, 2007. Google ScholarDigital Library
- C. Koch, S. Scherzinger, N. Schweikardt, and B. Stegmaier. Schema-based Scheduling of Event Processors and Buffer Minimization for Queries on Structured Data Streams. In VLDB, pages 228--239, 2004. Google ScholarDigital Library
- I. Manolescu, D. Florescu, and D. Kossmann. Answering XML Queries on Heterogeneous Data Sources. In VLDB, pages 241--250, 2001. Google ScholarDigital Library
- W. Martens, F. Neven, T. Schwentick, and G. J. Bex. Expressiveness and Complexity of XML Schema. ACM TODS, 31(3), 2006. Google ScholarDigital Library
- L. Mignet, D. Barbosa, and P. Veltri. The XML Web: A First Study. In WWW, pages 500--510, 2003. Google ScholarDigital Library
- M. Murata, D. Lee, M. Mani, and K. Kawaguchi. Taxonomy of xml schema languages using formal language theory. ACM Trans. Internet Techn., 5(4):660--704, 2005. Google ScholarDigital Library
- S. Nestorov, S. Abiteboul, and R. Motwani. Extracting Schema from Semistructured Data. In ICDM, pages 295--306. 1998.Google ScholarDigital Library
- F. Neven and T. Schwentick. On the Complexity of XPath Containment in the Presence of Disjunction, DTDs, and Variables. Logical Methods in Computer Science, 2(3), 2006.Google Scholar
- L. Pitt. Inductive Inference, DFAs, and Computational Complexity. In AII, pages 18--44, 1989. Google ScholarDigital Library
- D. Quass, J. Widom, R. Goldman, et al. LORE: a Lightweight Object REpository for Semistructured Data. In SIGMOD, page 549, 1996. Google ScholarDigital Library
- L. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE, 77(2):257--286, 1989.Google ScholarCross Ref
- E. Rahm and P. A. Bernstein. A Survey of Approaches to Automatic Schema Matching. VLDB J., 10(4):334--350, 2001. Google ScholarDigital Library
- A. Sahuguet. Everything You Ever Wanted to Know about DTDs, but Were Afraid to Ask. In WebDB, 2000. Google ScholarDigital Library
- Y. Sakakibara. Recent Advances of Grammatical Inference. Theor. Comput. Sci., 185(1):15--45, 1997. Google ScholarDigital Library
- J. Sankey and R. K. Wong. Structural Inference for Semistructured Data. In CIKM, pages 159--166. 2001. Google ScholarDigital Library
- H. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML Schema Part 1: Structures. W3C, May 2001.Google Scholar
- M. Young-Lai and F. W. Tompa. Stochastic Grammatical Inference of Text Database Structure. Machine Learning, 40(2):111--137, 2000. Google ScholarDigital Library
Index Terms
- Learning deterministic regular expressions for the inference of schemas from XML data
Recommendations
Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data
Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of ...
Inference of concise regular expressions and DTDs
We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML-documents, a problem that basically reduces to learning concise regular expressions from positive examples strings. We identify two classes of concise ...
Complexity of Decision Problems for XML Schemas and Chain Regular Expressions
We study the complexity of the inclusion, equivalence, and intersection problem of extended chain regular expressions (eCHAREs). These are regular expressions with a very simple structure: they basically consist of the concatenation of factors, where ...
Comments