skip to main content
10.1145/1367497.1367609acmconferencesArticle/Chapter ViewAbstractPublication PageswwwConference Proceedingsconference-collections
research-article

Learning deterministic regular expressions for the inference of schemas from XML data

Authors Info & Claims
Published:21 April 2008Publication History

ABSTRACT

Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.

References

  1. Castor. www.castor.org.Google ScholarGoogle Scholar
  2. SUN Microsystems JAXB. java.sun.com/webservices/jaxb.Google ScholarGoogle Scholar
  3. P. Adriaans and P. Vitanyi. The Power and Perils of MDL, 2006.Google ScholarGoogle Scholar
  4. H. Ahonen. Generating Grammars for Structured Documents using Grammatical Inference Methods. Report A-1996-4, Dept. of Comp. Sci., Univ. of Finland, 1996.Google ScholarGoogle Scholar
  5. D. Angluin and C. Smith. Inductive Inference: Theory and Methods. ACM Computing Surveys, 15(3):237--269, 1983. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. D. Barbosa, L. Mignet, and P. Veltri. Studying the XML Web: Gathering Statistics from an XML Sample. World Wide Web, 8(4):413--438, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. M. Benedikt, W. Fan, and F. Geerts. XPath Satisfiability in the Presence of DTDs. In PODS, pages 25--36, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. P. A. Bernstein. Applying Model Management to Classical Meta Data Problems. In CIDR, 2003.Google ScholarGoogle Scholar
  9. G. Bex, W. Gelade, F. Neven, and S. Vansummeren. Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data, 2007. http://www.uhasselt.be/~lucg5855/papers/infkore.pdf Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. G. J. Bex, F. Neven, T. Schwentick, and K. Tuyls. Inference of Concise DTDs from XML Data. In VLDB, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. G. J. Bex, F. Neven, and J. Van den Bussche. DTDs versus XML Schema: a Practical Study. In WebDB, pages 79--84, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. G. J. Bex, F. Neven, and S. Vansummeren. Inferring XML Schema Definitions from XML data. In VLDB, pages 998--1009, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. A. Bräzma. Efficient Identification of Regular Expressions from Representative Examples. In COLT, pages 236--242. ACM Press, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. A. Brüggeman-Klein. Regular Expressions into Finite Automata. Theor. Comput. Sci., 120(2):197--213, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. A. Brüggemann-Klein and D. Wood. One-unambiguous Regular Languages. Information and computation, 140(2):229--253, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. P. Buneman, S. B. Davidson, M. F. Fernandez, and D. Suciu. Adding Structure to Unstructured Data. In ICDT, pages 336--350, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. P. Caron and D. Ziadi. Characterization of Glushkov Automata. Theo. Comp. Sc., 233(1-2):75--90, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. D. Che, K. Aberer, and M. T. Özsu. Query Optimization in XML Structured-Document Databases. VLDB J., 15(3):263--289, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. B. Chidlovskii. Schema Extraction from XML: a Grammatical Inference Approach. In KRDB, 2001.Google ScholarGoogle Scholar
  20. J. Clark. Trang: Multi-format Schema Converter. http://www.thaiopensource.com/relaxng/trang.html.Google ScholarGoogle Scholar
  21. J. Clark and M. Murata. RELAX NG Specification. OASIS, December 2001.Google ScholarGoogle Scholar
  22. R. Cover. The Cover Pages. http://xml.coverpages.org/, 2003.Google ScholarGoogle Scholar
  23. J. F. Fang Du, Sihem Amer-Yahia. ShreX: Managing XML Documents in Relational Databases. In VLDB, pages 1297--1300, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. H. Fernau. Algorithms for Learning Regular Expressions. In ALT, pages 297--311, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. R. Finn, J. Mistry, B. Schuster-Böckler, S. Griffiths-Jones, et al. Pfam: Clans, Web Tools and Services. Nucleic Acids Research, 34:D247--D251, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  26. D. Florescu. Managing Semi-structured Data. ACM Queue, 3(8), October 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. J.-M. François. Jahmm. http://www.run.montefiore.ulg.ac.be/~francois/software/jahmm/, April 2006.Google ScholarGoogle Scholar
  28. J. Freire, J. R. Haritsa, M. Ramanath, P. Roy, and J. Siméon. StatiX: Making XML Count. In SIGMOD, pages 181--191, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. D. Freitag and A. McCallum. Information Extraction with HMM Structures Learned by Stochastic Optimization. In AAAI/IAAI, pages 584--589, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. P. Garcia and E. Vidal. Inference of k-Testable Languages in the Strict Sense and Application to Syntactic Pattern Recognition. IEEE Trans. Pattern Anal. Mach. Intell., 12(9):920--925, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim. XTRACT: Learning Document Type Descriptors from XML Document Collections. Data Mining and Knowledge Discovery, 7:23--56, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. E. Gold. Language Identification in the Limit. Information and Control, 10(5):447--474, May 1967.Google ScholarGoogle ScholarCross RefCross Ref
  33. R. Goldman and J. Widom. DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. In VLDB, pages 436--445, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. J. Hegewald, F. Naumann, and M. Weis. XStruct: Efficient Schema Extraction from Multiple and Large XML Documents. In ICDE Workshops, page 81, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. J. Hopcroft and J. Ullman. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, MA, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. C. Koch, S. Scherzinger, N. Schweikardt, and B. Stegmaier. Schema-based Scheduling of Event Processors and Buffer Minimization for Queries on Structured Data Streams. In VLDB, pages 228--239, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. I. Manolescu, D. Florescu, and D. Kossmann. Answering XML Queries on Heterogeneous Data Sources. In VLDB, pages 241--250, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. W. Martens, F. Neven, T. Schwentick, and G. J. Bex. Expressiveness and Complexity of XML Schema. ACM TODS, 31(3), 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. L. Mignet, D. Barbosa, and P. Veltri. The XML Web: A First Study. In WWW, pages 500--510, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. M. Murata, D. Lee, M. Mani, and K. Kawaguchi. Taxonomy of xml schema languages using formal language theory. ACM Trans. Internet Techn., 5(4):660--704, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. S. Nestorov, S. Abiteboul, and R. Motwani. Extracting Schema from Semistructured Data. In ICDM, pages 295--306. 1998.Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. F. Neven and T. Schwentick. On the Complexity of XPath Containment in the Presence of Disjunction, DTDs, and Variables. Logical Methods in Computer Science, 2(3), 2006.Google ScholarGoogle Scholar
  43. L. Pitt. Inductive Inference, DFAs, and Computational Complexity. In AII, pages 18--44, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. D. Quass, J. Widom, R. Goldman, et al. LORE: a Lightweight Object REpository for Semistructured Data. In SIGMOD, page 549, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. L. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE, 77(2):257--286, 1989.Google ScholarGoogle ScholarCross RefCross Ref
  46. E. Rahm and P. A. Bernstein. A Survey of Approaches to Automatic Schema Matching. VLDB J., 10(4):334--350, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. A. Sahuguet. Everything You Ever Wanted to Know about DTDs, but Were Afraid to Ask. In WebDB, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Y. Sakakibara. Recent Advances of Grammatical Inference. Theor. Comput. Sci., 185(1):15--45, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. J. Sankey and R. K. Wong. Structural Inference for Semistructured Data. In CIKM, pages 159--166. 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. H. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML Schema Part 1: Structures. W3C, May 2001.Google ScholarGoogle Scholar
  51. M. Young-Lai and F. W. Tompa. Stochastic Grammatical Inference of Text Database Structure. Machine Learning, 40(2):111--137, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Learning deterministic regular expressions for the inference of schemas from XML data

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            WWW '08: Proceedings of the 17th international conference on World Wide Web
            April 2008
            1326 pages
            ISBN:9781605580852
            DOI:10.1145/1367497

            Copyright © 2008 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 21 April 2008

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • research-article

            Acceptance Rates

            Overall Acceptance Rate1,899of8,196submissions,23%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader