skip to main content
research-article

Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data

Authors Info & Claims
Published:01 September 2010Publication History
Skip Abstract Section

Abstract

Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of deterministic regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the deterministic one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.

References

  1. }}Adriaans, P. and Vitányi, P. 2006. The power and perils of MDL. In Proceedings of the International Symposium on Information Theory.Google ScholarGoogle Scholar
  2. }}Ahonen, H. 1996. Generating Grammars for structured documents using grammatical inference methods. Tech. rep. A-1996-4, Department of Computer Science, University of Finland.Google ScholarGoogle Scholar
  3. }}Angluin, D. and Smith, C. H. 1983. Inductive inference: Theory and methods. ACM Comput. Surv. 15, 3, 237--269. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. }}Barbosa, D., Mignet, L., and Veltri, P. 2005. Studying the XML Web: Gathering statistics from an XML sample. World Wide Web 8, 4, 413--438. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. }}Benedikt, M., Fan, W., and Geerts, F. 2005. XPath satisfiability in the presence of DTDs. In Proceedings of the 24th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. 25--36. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. }}Bernstein, P. A. 2003. Applying model management to classical meta data problems. In Proceedings of the 1st Biennial Conference on Innovative Data Systems Research.Google ScholarGoogle Scholar
  7. }}Bex, G. J., Neven, F., and Van den Bussche, J. 2004. DTDs versus XML Schema: A practical study. In Proceedings of the 7th International Workshop on the Web and Databases. 79--84. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. }}Bex, G. J., Neven, F., Schwentick, T., and Tuyls, K. 2006. Inference of concise DTDs from XML data. In Proceedings of the 32nd International Conference on Very Large Data Bases. 115--126. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. }}Bex, G. J., Neven, F., and Vansummeren, S. 2007. Inferring XML Schema definitions from XML data. In Proceedings of the 33rd International Conference on Very Large Databases. 998--1009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. }}Bex, G. J., Gelade, W., Neven, F., and Vansummeren, S. 2008. Learning deterministic regular expressions for the inference of schemas from XML data. In Proceedings of the International Conference on the World Wide Web (WWW’08). 825--834. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. }}Bex, G., Neven, F., Schwentick, T., and Vansummeren, S. 2010. Inference of concise regular expressions and DTDs. ACM Trans. Datab. Syst. 35, 2. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. }}Brāzma, A. 1993. Efficient identification of regular expressions from representative examples. In Proceedings of the 6th Annual ACM Conference on Computational Learning Theory. ACM Press, 236--242. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. }}Brüggeman-Klein, A. 1993. Regular expressions into finite automata. Theor. Comput. Sci. 120, 2, 197--213. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. }}Brüggemann-Klein, A. and Wood, D. 1998. One-unambiguous regular languages. Inform. Comput. 140, 2, 229--253. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. }}Buneman, P., Davidson, S. B., Fernandez, M. F., and Suciu, D. 1997. Adding structure to unstructured data. In Proceedings of the 6th International Conference on Database Theory (ICDT’97). F. N. Afrati and P. G. Kolaitis, Eds. Lecture Notes in Computer Science, vol. 1186. Springer, 336--350. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. }}Che, D., Aberer, K., and Özsu, M. T. 2006. Query optimization in XML structured-document databases. VLDB J. 15, 3, 263--289. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. }}Chidlovskii, B. 2001. Schema extraction from XML: A grammatical inference approach. In Proceedings of the 8th International Workshop on Knowledge Representation meets Databases.Google ScholarGoogle Scholar
  18. }}Clark, J. Trang: Multi-format schema converter based on RELAX NG. http://www.thaiopensource.com/relaxng/trang.html.Google ScholarGoogle Scholar
  19. }}Clark, J. and Murata, M. 2001. RELAX NG specification. OASIS.Google ScholarGoogle Scholar
  20. }}Cover, R. 2003. The Cover Pages. http://xml.coverpages.org/.Google ScholarGoogle Scholar
  21. }}Du, F., Amer-Yahia, S., and Freire, J. 2004. ShreX: Managing XML documents in relational databases. In Proceedings of the 30th International Conference on Very Large Data Bases. 1297--1300. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. }}Ehrenfeucht, A. and Zeiger, P. 1976. Complexity measures for regular expressions. J. Comput. Syst. Sci. 12, 134--146. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. }}Fernau, H. 2004. Extracting minimum length document type definitions is NP-hard. In Proceedings of the International Colloquium on Grammatical Inference (ICGI). 277--278.Google ScholarGoogle ScholarCross RefCross Ref
  24. }}Fernau, H. 2005. Algorithms for Learning Regular Expressions. In Proceedings of the 16th International Conference on Algorithmic Learning Theory. 297--311. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. }}Finn, R., Mistry, J., Schuster-Böckler, B., Griffiths-Jones, S., et al. 2006. Pfam: clans, Web tools and services. Nucleic Acids Resear. 34, D247--D251.Google ScholarGoogle ScholarCross RefCross Ref
  26. }}Florescu, D. 2005. Managing semi-structured data. ACM Queue 3, 8. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. }}François, J.-M. 2006. Jahmm. http://www.run.montefiore.ulg.ac.be/~francois/software/jahmm/.Google ScholarGoogle Scholar
  28. }}Freire, J., Haritsa, J. R., Ramanath, M., Roy, P., and Siméon, J. 2002. StatiX: making XML count. In Proceedings of the SIGMOD Conference. 181--191. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. }}Freitag, D. and McCallum, A. 2000. Information extraction with HMM structures learned by stochastic optimization. In Proceedings of the 17th National Conference on Artificial Intelligence. AAAI Press/The MIT Press, 584--589. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. }}Garcia, P. and Vidal, E. 1990. Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Trans. Patt. Anal. Mach. Intell. 12, 9, 920--925. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. }}Garofalakis, M., Gionis, A., Rastogi, R., Seshadri, S., and Shim, K. 2003. XTRACT: Learning document type descriptors from XML document collections. Data Min. Knowl. Disc. 7, 23--56. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. }}Gelade, W. and Neven, F. 2008. Succinctness of the complement and intersection of regular expressions. In Proceedings of the Annual Symposium on Theoretical Aspects of Computer Science (STACS). 325--336.Google ScholarGoogle Scholar
  33. }}Gold, E. 1967. Language identification in the limit. Inform. Cont. 10, 5, 447--474.Google ScholarGoogle ScholarCross RefCross Ref
  34. }}Goldman, R. and Widom, J. 1997. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the 23rd International Conference on Very Large Data Bases. 436--445. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. }}Gruber, H. and Holzer, M. 2008. Finite automata, digraph connectivity, and regular expression size. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP). 39--50. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. }}Hegewald, J., Naumann, F., and Weis, M. 2006. XStruct: Efficient schema extraction from multiple and large XML documents. In Proceedings of the ICDE Workshops. 81. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. }}Hopcroft, J. and Ullman, J. 2007. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, MA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. }}Koch, C., Scherzinger, S., Schweikardt, N., and Stegmaier, B. 2004. Schema-based scheduling of event processors and buffer minimization for queries on structured data streams. In Proceedings of the 30th International Conference on Very Large Data Bases. 228--239. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. }}Manolescu, I., Florescu, D., and Kossmann, D. 2001. Answering XML queries on heterogeneous data sources. In Proceedings of 27th International Conference on Very Large Data Bases. 241--250. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. }}Martens, W., Neven, F., Schwentick, T., and Bex, G. J. 2006. Expressiveness and complexity of XML schema. ACM Trans. Datab. Syst. 31, 3, 770--813. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. }}Mignet, L., Barbosa, D., and Veltri, P. 2003. The XML Web: A first study. In Proceedings of the 12th International World Wide Web Conference. 500--510. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. }}Nestorov, S., Abiteboul, S., and Motwani, R. 1998. Extracting schema from semistructured data. In Proceedings of the International Conference on Management of Data. ACM Press, 295--306. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. }}Neven, F. and Schwentick, T. 2006. On the complexity of XPath containment in the presence of disjunction, DTDs, and variables. Logic. Meth. Comput. Sci. 2, 3.Google ScholarGoogle ScholarCross RefCross Ref
  44. }}Pitt, L. 1989. Inductive inference, DFAs, and computational complexity. In Proceedings of the International Workshop on Analogical and Inductive Inference. K. P. Jantke, Ed. Lecture Notes in Computer Science, vol. 397. Springer-Verlag, 18--44. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. }}Quass, D., Widom, J., Goldman, R., et al. 1996. LORE: A Lightweight Object REpository for semistructured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 549. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. }}Rabiner, L. 1989. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proc. IEEE 77, 2, 257--286.Google ScholarGoogle ScholarCross RefCross Ref
  47. }}Rahm, E. and Bernstein, P. A. 2001. A survey of approaches to automatic schema matching. VLDB J. 10, 4, 334--350. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. }}Sahuguet, A. 2000. Everything you ever wanted to know about DTDs, but were afraid to ask (Extended Abstract). In Proceedings of the 3rd International Workshop on The World Wide Web and Databases. D. Suciu and G. Vossen, Eds. Lecture Notes in Computer Science, vol. 1997. Springer, 171--183. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. }}Sakakibara, Y. 1997. Recent advances of grammatical inference. Theor. Comput. Sci. 185, 1, 15--45. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. }}Sankey, J. and Wong, R. K. 2001. Structural inference for semistructured data. In Proceedings of the 10th International Conference on Information and Knowledge Management. ACM Press, 159--166. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. }}Thompson, H., Beech, D., Maloney, M., and Mendelsohn, N. 2001. XML Schema part 1: Structures. W3C.Google ScholarGoogle Scholar
  52. }}Young-Lai, M. and Tompa, F. W. 2000. Stochastic grammatical inference of text database structure. Mach. Learn. 40, 2, 111--137. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          • Published in

            cover image ACM Transactions on the Web
            ACM Transactions on the Web  Volume 4, Issue 4
            September 2010
            173 pages
            ISSN:1559-1131
            EISSN:1559-114X
            DOI:10.1145/1841909
            Issue’s Table of Contents

            Copyright © 2010 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 1 September 2010
            • Accepted: 1 July 2010
            • Revised: 1 February 2010
            • Received: 1 November 2008
            Published in tweb Volume 4, Issue 4

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • research-article
            • Research
            • Refereed

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader