skip to main content
research-article

Discovering XSD Keys from XML Data

Authors Info & Claims
Published:30 December 2014Publication History
Skip Abstract Section

Abstract

A great deal of research into the learning of schemas from XML data has been conducted in recent years to enable the automatic discovery of XML schemas from XML documents when no schema or only a low-quality one is available. Unfortunately, and in strong contrast to, for instance, the relational model, the automatic discovery of even the simplest of XML constraints, namely XML keys, has been left largely unexplored in this context. A major obstacle here is the unavailability of a theory on reasoning about XML keys in the presence of XML schemas, which is needed to validate the quality of candidate keys. The present article embarks on a fundamental study of such a theory and classifies the complexity of several crucial properties concerning XML keys in the presence of an XSD, like, for instance, testing for consistency, boundedness, satisfiability, universality, and equivalence. Of independent interest, novel results are obtained related to cardinality estimation of XPath result sets. A mining algorithm is then developed within the framework of levelwise search. The algorithm leverages known discovery algorithms for functional dependencies in the relational model, but incorporates the properties mentioned before to assess and refine the quality of derived keys. An experimental study on an extensive body of real-world XML data evaluating the effectiveness of the proposed algorithm is provided.

Skip Supplemental Material Section

Supplemental Material

References

  1. Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, and Pierre Senellart. 2012. Finding optimal probabilistic generators for XML collections. In Proceedings of the 15th International Conference on Database Theory (ICDT'12). ACM Press, New York, 127--139. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of Databases. Addison-Wesley. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Marcelo Arenas, Jonny Daenen, Frank Neven, Martin Ugarte, Jan Van den Bussche, and Stijn Vansummeren. 2013. Discovering XSD keys from XML data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'13). 61--72. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Marcelo Arenas, Wenfei Fan, and Leonid Libkin. 2002. What's hard about XML schema constraints? In Proceedings of the 13th International Conference on Database and Expert Systems Applications (DEXA'02). Lecture Notes in Computer Science, vol. 2453, Springer, 269--278. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Denilson Barbosa and Alberto O. Mendelzon. 2003. Finding id attributes in XML documents. In Proceedings of the 1st International XML Database Symposium (XSym'03). Lecture Notes in Computer Science, vol. 2824, Springer, 180--194.Google ScholarGoogle ScholarCross RefCross Ref
  6. Geert Jan Bex, Wouter Gelade, Wim Martens, and Frank Neven. 2009. Simplifying XML schema: Effortless handling of nondeterministic regular expressions. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'09). 731--744. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Geert Jan Bex, Wouter Gelade, Frank Neven, and Stijn Vansummeren. 2010a. Learning deterministic regular expressions for the inference of schemas from XML data. ACM Trans. Web 4, 4. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Geert Jan Bex, Frank Neven, Thomas Schwentick, and Stijn Vansummeren. 2010b. Inference of concise regular expressions and dtds. ACM Trans. Database Syst. 35, 2. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Geert Jan Bex, Frank Neven, and Stijn Vansummeren. 2007. Inferring XML schema definitions from XML data. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB'07). 998--1009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Geert Jan Bex, Frank Neven, and Stijn Vansummeren. 2008. SchemaScope: A system for inferring and cleaning XML schemas. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'08). 1259--1262. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Dina Bitton, Jeffrey Millman, and Solveig Torgersen. 1989. A feasibility and performance study of dependency inference. In Proceedings of the International Conference on Data Engineering (ICDE'89). 635--641. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Henrik Björklund, Wim Martens, and Thomas Schwentick. 2013. Validity of tree pattern queries with respect to schema information. In Proceedings of the 38th International Symposium on Mathematical Foundations of Computer Science (MFCS'13). Lecture Notes in Computer Science, vol. 8087, Springer, 171--182.Google ScholarGoogle ScholarCross RefCross Ref
  13. Mikolaj Bojanczyk. 2008. Tree-walking automata. In Proceedings of the 2nd International Conference on Language and Automata Theory and Applications (LATA'08). Lecture Notes in Computer Science, vol. 5196, Springer, 1--2. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Anne Bruggemann-Klein and Derick Wood. 1998. One-unambiguous regular languages. Inf. Comput. 140, 2, 229--253. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Peter Buneman, Susan B. Davidson, Wenfei Fan, Carmem S. Hara, and Wang Chiew Tan. 2002. Keys for XML. Comput. Netw. 39, 5, 473--487.Google ScholarGoogle ScholarCross RefCross Ref
  16. Peter Buneman, Susan B. Davidson, Wenfei Fan, Carmem S. Hara, and Wang Chiew Tan. 2003. Reasoning about keys for XML. Inf. Syst. 28, 8, 1037--1063. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Stanislav Fajt, Irena Mlynkova, and Martin Necasky. 2011. On mining xml integrity constraints. In Proceedings of the 6th IEEE International Conference on Digital Information Management (ICDIM'11). 23--29.Google ScholarGoogle ScholarCross RefCross Ref
  18. Wenfei Fan and Leonid Libkin. 2002. On XML integrity constraints in the presence of dtds. J. ACM 49, 3, 368--406. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Minos N. Garofalakis, Aristides Gionis, Rajeev Rastogi, Sridhar Seshadri, and Kyuseok Shim. 2003. XTRACT: Learning document type descriptors from XML document collections. Data Min. Knowl. Discov. 7, 1, 23--56. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Gösta Grahne and Jianfei Zhu. 2002. Discovering approximate keys in XML data. In Proceedings of the International Conference on Information and Knowledge Management (CIKM'02). 453--460. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Steven Grijzenhout and Maarten Marx. 2010. University of amsterdam XML web collection. http://data.politicalmashup.nl/sgrijzen/xmlweb/.Google ScholarGoogle Scholar
  22. Sven Hartmann and Sebastian Link. 2009. Efficient reasoning about a robust XML key fragment. ACM Trans. Database Syst. 34, 2. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. 2003. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Donald E. Knuth, James H. Morris Jr., and Vaughan R. PRATT. 1977. Fast pattern matching in strings. SIAM J. Comput. 6, 2, 323--350.Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Heiki Mannila and Kari-Jouko Räihä. 1989. Practical algorithms for finding prime attributes and testing normal forms. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS'89). ACM Press, New York, 128--133. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Heikki Mannila and Kari-Jouko Räihä. 1991. The Design of Relational Databases. Addison-Wesley. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Heikki Mannila and Kari-Jouko Räihä. 1994. Algorithms for inferring functional dependencies from relations. Data Knowl. Engin. 12, 1, 83--99. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Heikki Mannila and Hannu Toivonen. 1997. Levelwise search and borders of theories in knowledge discovery. Data Min. Knowl. Discov. 1, 3, 241--258. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Wim Martens, Frank Neven, and Thomas Schwentick. 2007. Simple off the shelf abstractions for XML schema. SIGMOD Rec. 36, 3, 15--22. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Wim Martens, Frank Neven, Thomas Schwentick, and Geert Jan Bex. 2006. Expressiveness and complexity of XML schema. ACM Trans. Database Syst. 31, 3, 770--813. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Makoto Murata, Dongwon Lee, Murali Mani, and Kohsuke Kawaguchi. 2005. Taxonomy of XML schema languages using formal language theory. ACM Trans. Internet Technol. 5, 4, 660--704. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Martin Necaský and Irena Mlýnkova. 2009. Discovering XML keys and foreign keys in queries. In Proceedings of the ACM Symposium on Applied Computing (SAC'09). ACM Press, New York, 632--638. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Raghu Ramakrishnan and Johannes Gehrke. 2003. Database Management Systems 3rd Ed. McGraw-Hill. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Helmut Seidl. 1990. Deciding equivalence of finite tree automata. SIAM J. Comput. 19, 3, 424--437. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Richard Edwin Stearns and Harry B. Hunt Iii. 1985. On the equivalence and containment problems for unambiguous regular expressions, regular grammars and finite automata. SIAM J. Comput. 14, 3, 598--611.Google ScholarGoogle ScholarCross RefCross Ref
  36. Larry J. Stockmeyer and Albert R. Meyer. 1973. Word problems requiring exponential time: Preliminary report. In Proceedings of the 5th Annual ACM Symposium on Theory of Computing (STOC'73). 1--9. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Peter Van Emde Boas. 1997. The convenience of tilings. In Complexity, Logic, and Recursion Theory, Marcel Dekker, 331--363.Google ScholarGoogle Scholar
  38. W3C. 2004. XML Schema Part 1: Structures 2nd Ed. http://www.w3.org/TR/xmlschema-1/#cIdentity-constraintGoogle ScholarGoogle Scholar
  39. Cong Yu and H. V. Jagadish. 2008. XML schema refinement through redundancy detection and normalization. VLDB J. 17, 2, 203--223. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Discovering XSD Keys from XML Data

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Transactions on Database Systems
      ACM Transactions on Database Systems  Volume 39, Issue 4
      Invited Articles Issue, SIGMOD 2013, PODS 2013 and ICDT 2013
      December 2014
      341 pages
      ISSN:0362-5915
      EISSN:1557-4644
      DOI:10.1145/2691190
      Issue’s Table of Contents

      Copyright © 2014 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 30 December 2014
      • Accepted: 1 June 2014
      • Revised: 1 April 2014
      • Received: 1 October 2013
      Published in tods Volume 39, Issue 4

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Author Tags

      Qualifiers

      • research-article
      • Research
      • Refereed

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader