Abstract
Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of deterministic regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the deterministic one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.
- }}Adriaans, P. and Vitányi, P. 2006. The power and perils of MDL. In Proceedings of the International Symposium on Information Theory.Google Scholar
- }}Ahonen, H. 1996. Generating Grammars for structured documents using grammatical inference methods. Tech. rep. A-1996-4, Department of Computer Science, University of Finland.Google Scholar
- }}Angluin, D. and Smith, C. H. 1983. Inductive inference: Theory and methods. ACM Comput. Surv. 15, 3, 237--269. Google ScholarDigital Library
- }}Barbosa, D., Mignet, L., and Veltri, P. 2005. Studying the XML Web: Gathering statistics from an XML sample. World Wide Web 8, 4, 413--438. Google ScholarDigital Library
- }}Benedikt, M., Fan, W., and Geerts, F. 2005. XPath satisfiability in the presence of DTDs. In Proceedings of the 24th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. 25--36. Google ScholarDigital Library
- }}Bernstein, P. A. 2003. Applying model management to classical meta data problems. In Proceedings of the 1st Biennial Conference on Innovative Data Systems Research.Google Scholar
- }}Bex, G. J., Neven, F., and Van den Bussche, J. 2004. DTDs versus XML Schema: A practical study. In Proceedings of the 7th International Workshop on the Web and Databases. 79--84. Google ScholarDigital Library
- }}Bex, G. J., Neven, F., Schwentick, T., and Tuyls, K. 2006. Inference of concise DTDs from XML data. In Proceedings of the 32nd International Conference on Very Large Data Bases. 115--126. Google ScholarDigital Library
- }}Bex, G. J., Neven, F., and Vansummeren, S. 2007. Inferring XML Schema definitions from XML data. In Proceedings of the 33rd International Conference on Very Large Databases. 998--1009. Google ScholarDigital Library
- }}Bex, G. J., Gelade, W., Neven, F., and Vansummeren, S. 2008. Learning deterministic regular expressions for the inference of schemas from XML data. In Proceedings of the International Conference on the World Wide Web (WWW’08). 825--834. Google ScholarDigital Library
- }}Bex, G., Neven, F., Schwentick, T., and Vansummeren, S. 2010. Inference of concise regular expressions and DTDs. ACM Trans. Datab. Syst. 35, 2. Google ScholarDigital Library
- }}Brāzma, A. 1993. Efficient identification of regular expressions from representative examples. In Proceedings of the 6th Annual ACM Conference on Computational Learning Theory. ACM Press, 236--242. Google ScholarDigital Library
- }}Brüggeman-Klein, A. 1993. Regular expressions into finite automata. Theor. Comput. Sci. 120, 2, 197--213. Google ScholarDigital Library
- }}Brüggemann-Klein, A. and Wood, D. 1998. One-unambiguous regular languages. Inform. Comput. 140, 2, 229--253. Google ScholarDigital Library
- }}Buneman, P., Davidson, S. B., Fernandez, M. F., and Suciu, D. 1997. Adding structure to unstructured data. In Proceedings of the 6th International Conference on Database Theory (ICDT’97). F. N. Afrati and P. G. Kolaitis, Eds. Lecture Notes in Computer Science, vol. 1186. Springer, 336--350. Google ScholarDigital Library
- }}Che, D., Aberer, K., and Özsu, M. T. 2006. Query optimization in XML structured-document databases. VLDB J. 15, 3, 263--289. Google ScholarDigital Library
- }}Chidlovskii, B. 2001. Schema extraction from XML: A grammatical inference approach. In Proceedings of the 8th International Workshop on Knowledge Representation meets Databases.Google Scholar
- }}Clark, J. Trang: Multi-format schema converter based on RELAX NG. http://www.thaiopensource.com/relaxng/trang.html.Google Scholar
- }}Clark, J. and Murata, M. 2001. RELAX NG specification. OASIS.Google Scholar
- }}Cover, R. 2003. The Cover Pages. http://xml.coverpages.org/.Google Scholar
- }}Du, F., Amer-Yahia, S., and Freire, J. 2004. ShreX: Managing XML documents in relational databases. In Proceedings of the 30th International Conference on Very Large Data Bases. 1297--1300. Google ScholarDigital Library
- }}Ehrenfeucht, A. and Zeiger, P. 1976. Complexity measures for regular expressions. J. Comput. Syst. Sci. 12, 134--146. Google ScholarDigital Library
- }}Fernau, H. 2004. Extracting minimum length document type definitions is NP-hard. In Proceedings of the International Colloquium on Grammatical Inference (ICGI). 277--278.Google ScholarCross Ref
- }}Fernau, H. 2005. Algorithms for Learning Regular Expressions. In Proceedings of the 16th International Conference on Algorithmic Learning Theory. 297--311. Google ScholarDigital Library
- }}Finn, R., Mistry, J., Schuster-Böckler, B., Griffiths-Jones, S., et al. 2006. Pfam: clans, Web tools and services. Nucleic Acids Resear. 34, D247--D251.Google ScholarCross Ref
- }}Florescu, D. 2005. Managing semi-structured data. ACM Queue 3, 8. Google ScholarDigital Library
- }}François, J.-M. 2006. Jahmm. http://www.run.montefiore.ulg.ac.be/~francois/software/jahmm/.Google Scholar
- }}Freire, J., Haritsa, J. R., Ramanath, M., Roy, P., and Siméon, J. 2002. StatiX: making XML count. In Proceedings of the SIGMOD Conference. 181--191. Google ScholarDigital Library
- }}Freitag, D. and McCallum, A. 2000. Information extraction with HMM structures learned by stochastic optimization. In Proceedings of the 17th National Conference on Artificial Intelligence. AAAI Press/The MIT Press, 584--589. Google ScholarDigital Library
- }}Garcia, P. and Vidal, E. 1990. Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Trans. Patt. Anal. Mach. Intell. 12, 9, 920--925. Google ScholarDigital Library
- }}Garofalakis, M., Gionis, A., Rastogi, R., Seshadri, S., and Shim, K. 2003. XTRACT: Learning document type descriptors from XML document collections. Data Min. Knowl. Disc. 7, 23--56. Google ScholarDigital Library
- }}Gelade, W. and Neven, F. 2008. Succinctness of the complement and intersection of regular expressions. In Proceedings of the Annual Symposium on Theoretical Aspects of Computer Science (STACS). 325--336.Google Scholar
- }}Gold, E. 1967. Language identification in the limit. Inform. Cont. 10, 5, 447--474.Google ScholarCross Ref
- }}Goldman, R. and Widom, J. 1997. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the 23rd International Conference on Very Large Data Bases. 436--445. Google ScholarDigital Library
- }}Gruber, H. and Holzer, M. 2008. Finite automata, digraph connectivity, and regular expression size. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP). 39--50. Google ScholarDigital Library
- }}Hegewald, J., Naumann, F., and Weis, M. 2006. XStruct: Efficient schema extraction from multiple and large XML documents. In Proceedings of the ICDE Workshops. 81. Google ScholarDigital Library
- }}Hopcroft, J. and Ullman, J. 2007. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, MA. Google ScholarDigital Library
- }}Koch, C., Scherzinger, S., Schweikardt, N., and Stegmaier, B. 2004. Schema-based scheduling of event processors and buffer minimization for queries on structured data streams. In Proceedings of the 30th International Conference on Very Large Data Bases. 228--239. Google ScholarDigital Library
- }}Manolescu, I., Florescu, D., and Kossmann, D. 2001. Answering XML queries on heterogeneous data sources. In Proceedings of 27th International Conference on Very Large Data Bases. 241--250. Google ScholarDigital Library
- }}Martens, W., Neven, F., Schwentick, T., and Bex, G. J. 2006. Expressiveness and complexity of XML schema. ACM Trans. Datab. Syst. 31, 3, 770--813. Google ScholarDigital Library
- }}Mignet, L., Barbosa, D., and Veltri, P. 2003. The XML Web: A first study. In Proceedings of the 12th International World Wide Web Conference. 500--510. Google ScholarDigital Library
- }}Nestorov, S., Abiteboul, S., and Motwani, R. 1998. Extracting schema from semistructured data. In Proceedings of the International Conference on Management of Data. ACM Press, 295--306. Google ScholarDigital Library
- }}Neven, F. and Schwentick, T. 2006. On the complexity of XPath containment in the presence of disjunction, DTDs, and variables. Logic. Meth. Comput. Sci. 2, 3.Google ScholarCross Ref
- }}Pitt, L. 1989. Inductive inference, DFAs, and computational complexity. In Proceedings of the International Workshop on Analogical and Inductive Inference. K. P. Jantke, Ed. Lecture Notes in Computer Science, vol. 397. Springer-Verlag, 18--44. Google ScholarDigital Library
- }}Quass, D., Widom, J., Goldman, R., et al. 1996. LORE: A Lightweight Object REpository for semistructured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 549. Google ScholarDigital Library
- }}Rabiner, L. 1989. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proc. IEEE 77, 2, 257--286.Google ScholarCross Ref
- }}Rahm, E. and Bernstein, P. A. 2001. A survey of approaches to automatic schema matching. VLDB J. 10, 4, 334--350. Google ScholarDigital Library
- }}Sahuguet, A. 2000. Everything you ever wanted to know about DTDs, but were afraid to ask (Extended Abstract). In Proceedings of the 3rd International Workshop on The World Wide Web and Databases. D. Suciu and G. Vossen, Eds. Lecture Notes in Computer Science, vol. 1997. Springer, 171--183. Google ScholarDigital Library
- }}Sakakibara, Y. 1997. Recent advances of grammatical inference. Theor. Comput. Sci. 185, 1, 15--45. Google ScholarDigital Library
- }}Sankey, J. and Wong, R. K. 2001. Structural inference for semistructured data. In Proceedings of the 10th International Conference on Information and Knowledge Management. ACM Press, 159--166. Google ScholarDigital Library
- }}Thompson, H., Beech, D., Maloney, M., and Mendelsohn, N. 2001. XML Schema part 1: Structures. W3C.Google Scholar
- }}Young-Lai, M. and Tompa, F. W. 2000. Stochastic grammatical inference of text database structure. Mach. Learn. 40, 2, 111--137. Google ScholarDigital Library
Index Terms
- Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data
Recommendations
Inference of concise regular expressions and DTDs
We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML-documents, a problem that basically reduces to learning concise regular expressions from positive examples strings. We identify two classes of concise ...
Learning deterministic regular expressions for the inference of schemas from XML data
WWW '08: Proceedings of the 17th international conference on World Wide WebInferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of ...
Complexity of Decision Problems for XML Schemas and Chain Regular Expressions
We study the complexity of the inclusion, equivalence, and intersection problem of extended chain regular expressions (eCHAREs). These are regular expressions with a very simple structure: they basically consist of the concatenation of factors, where ...
Comments