Abstract
A great deal of research into the learning of schemas from XML data has been conducted in recent years to enable the automatic discovery of XML schemas from XML documents when no schema or only a low-quality one is available. Unfortunately, and in strong contrast to, for instance, the relational model, the automatic discovery of even the simplest of XML constraints, namely XML keys, has been left largely unexplored in this context. A major obstacle here is the unavailability of a theory on reasoning about XML keys in the presence of XML schemas, which is needed to validate the quality of candidate keys. The present article embarks on a fundamental study of such a theory and classifies the complexity of several crucial properties concerning XML keys in the presence of an XSD, like, for instance, testing for consistency, boundedness, satisfiability, universality, and equivalence. Of independent interest, novel results are obtained related to cardinality estimation of XPath result sets. A mining algorithm is then developed within the framework of levelwise search. The algorithm leverages known discovery algorithms for functional dependencies in the relational model, but incorporates the properties mentioned before to assess and refine the quality of derived keys. An experimental study on an extensive body of real-world XML data evaluating the effectiveness of the proposed algorithm is provided.
Supplemental Material
Available for Download
Supplemental movie, appendix, image and software files for, Discovering XSD Keys from XML Data
- Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, and Pierre Senellart. 2012. Finding optimal probabilistic generators for XML collections. In Proceedings of the 15th International Conference on Database Theory (ICDT'12). ACM Press, New York, 127--139. Google ScholarDigital Library
- Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of Databases. Addison-Wesley. Google ScholarDigital Library
- Marcelo Arenas, Jonny Daenen, Frank Neven, Martin Ugarte, Jan Van den Bussche, and Stijn Vansummeren. 2013. Discovering XSD keys from XML data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'13). 61--72. Google ScholarDigital Library
- Marcelo Arenas, Wenfei Fan, and Leonid Libkin. 2002. What's hard about XML schema constraints? In Proceedings of the 13th International Conference on Database and Expert Systems Applications (DEXA'02). Lecture Notes in Computer Science, vol. 2453, Springer, 269--278. Google ScholarDigital Library
- Denilson Barbosa and Alberto O. Mendelzon. 2003. Finding id attributes in XML documents. In Proceedings of the 1st International XML Database Symposium (XSym'03). Lecture Notes in Computer Science, vol. 2824, Springer, 180--194.Google ScholarCross Ref
- Geert Jan Bex, Wouter Gelade, Wim Martens, and Frank Neven. 2009. Simplifying XML schema: Effortless handling of nondeterministic regular expressions. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'09). 731--744. Google ScholarDigital Library
- Geert Jan Bex, Wouter Gelade, Frank Neven, and Stijn Vansummeren. 2010a. Learning deterministic regular expressions for the inference of schemas from XML data. ACM Trans. Web 4, 4. Google ScholarDigital Library
- Geert Jan Bex, Frank Neven, Thomas Schwentick, and Stijn Vansummeren. 2010b. Inference of concise regular expressions and dtds. ACM Trans. Database Syst. 35, 2. Google ScholarDigital Library
- Geert Jan Bex, Frank Neven, and Stijn Vansummeren. 2007. Inferring XML schema definitions from XML data. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB'07). 998--1009. Google ScholarDigital Library
- Geert Jan Bex, Frank Neven, and Stijn Vansummeren. 2008. SchemaScope: A system for inferring and cleaning XML schemas. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'08). 1259--1262. Google ScholarDigital Library
- Dina Bitton, Jeffrey Millman, and Solveig Torgersen. 1989. A feasibility and performance study of dependency inference. In Proceedings of the International Conference on Data Engineering (ICDE'89). 635--641. Google ScholarDigital Library
- Henrik Björklund, Wim Martens, and Thomas Schwentick. 2013. Validity of tree pattern queries with respect to schema information. In Proceedings of the 38th International Symposium on Mathematical Foundations of Computer Science (MFCS'13). Lecture Notes in Computer Science, vol. 8087, Springer, 171--182.Google ScholarCross Ref
- Mikolaj Bojanczyk. 2008. Tree-walking automata. In Proceedings of the 2nd International Conference on Language and Automata Theory and Applications (LATA'08). Lecture Notes in Computer Science, vol. 5196, Springer, 1--2. Google ScholarDigital Library
- Anne Bruggemann-Klein and Derick Wood. 1998. One-unambiguous regular languages. Inf. Comput. 140, 2, 229--253. Google ScholarDigital Library
- Peter Buneman, Susan B. Davidson, Wenfei Fan, Carmem S. Hara, and Wang Chiew Tan. 2002. Keys for XML. Comput. Netw. 39, 5, 473--487.Google ScholarCross Ref
- Peter Buneman, Susan B. Davidson, Wenfei Fan, Carmem S. Hara, and Wang Chiew Tan. 2003. Reasoning about keys for XML. Inf. Syst. 28, 8, 1037--1063. Google ScholarDigital Library
- Stanislav Fajt, Irena Mlynkova, and Martin Necasky. 2011. On mining xml integrity constraints. In Proceedings of the 6th IEEE International Conference on Digital Information Management (ICDIM'11). 23--29.Google ScholarCross Ref
- Wenfei Fan and Leonid Libkin. 2002. On XML integrity constraints in the presence of dtds. J. ACM 49, 3, 368--406. Google ScholarDigital Library
- Minos N. Garofalakis, Aristides Gionis, Rajeev Rastogi, Sridhar Seshadri, and Kyuseok Shim. 2003. XTRACT: Learning document type descriptors from XML document collections. Data Min. Knowl. Discov. 7, 1, 23--56. Google ScholarDigital Library
- Gösta Grahne and Jianfei Zhu. 2002. Discovering approximate keys in XML data. In Proceedings of the International Conference on Information and Knowledge Management (CIKM'02). 453--460. Google ScholarDigital Library
- Steven Grijzenhout and Maarten Marx. 2010. University of amsterdam XML web collection. http://data.politicalmashup.nl/sgrijzen/xmlweb/.Google Scholar
- Sven Hartmann and Sebastian Link. 2009. Efficient reasoning about a robust XML key fragment. ACM Trans. Database Syst. 34, 2. Google ScholarDigital Library
- John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. 2003. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley. Google ScholarDigital Library
- Donald E. Knuth, James H. Morris Jr., and Vaughan R. PRATT. 1977. Fast pattern matching in strings. SIAM J. Comput. 6, 2, 323--350.Google ScholarDigital Library
- Heiki Mannila and Kari-Jouko Räihä. 1989. Practical algorithms for finding prime attributes and testing normal forms. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS'89). ACM Press, New York, 128--133. Google ScholarDigital Library
- Heikki Mannila and Kari-Jouko Räihä. 1991. The Design of Relational Databases. Addison-Wesley. Google ScholarDigital Library
- Heikki Mannila and Kari-Jouko Räihä. 1994. Algorithms for inferring functional dependencies from relations. Data Knowl. Engin. 12, 1, 83--99. Google ScholarDigital Library
- Heikki Mannila and Hannu Toivonen. 1997. Levelwise search and borders of theories in knowledge discovery. Data Min. Knowl. Discov. 1, 3, 241--258. Google ScholarDigital Library
- Wim Martens, Frank Neven, and Thomas Schwentick. 2007. Simple off the shelf abstractions for XML schema. SIGMOD Rec. 36, 3, 15--22. Google ScholarDigital Library
- Wim Martens, Frank Neven, Thomas Schwentick, and Geert Jan Bex. 2006. Expressiveness and complexity of XML schema. ACM Trans. Database Syst. 31, 3, 770--813. Google ScholarDigital Library
- Makoto Murata, Dongwon Lee, Murali Mani, and Kohsuke Kawaguchi. 2005. Taxonomy of XML schema languages using formal language theory. ACM Trans. Internet Technol. 5, 4, 660--704. Google ScholarDigital Library
- Martin Necaský and Irena Mlýnkova. 2009. Discovering XML keys and foreign keys in queries. In Proceedings of the ACM Symposium on Applied Computing (SAC'09). ACM Press, New York, 632--638. Google ScholarDigital Library
- Raghu Ramakrishnan and Johannes Gehrke. 2003. Database Management Systems 3rd Ed. McGraw-Hill. Google ScholarDigital Library
- Helmut Seidl. 1990. Deciding equivalence of finite tree automata. SIAM J. Comput. 19, 3, 424--437. Google ScholarDigital Library
- Richard Edwin Stearns and Harry B. Hunt Iii. 1985. On the equivalence and containment problems for unambiguous regular expressions, regular grammars and finite automata. SIAM J. Comput. 14, 3, 598--611.Google ScholarCross Ref
- Larry J. Stockmeyer and Albert R. Meyer. 1973. Word problems requiring exponential time: Preliminary report. In Proceedings of the 5th Annual ACM Symposium on Theory of Computing (STOC'73). 1--9. Google ScholarDigital Library
- Peter Van Emde Boas. 1997. The convenience of tilings. In Complexity, Logic, and Recursion Theory, Marcel Dekker, 331--363.Google Scholar
- W3C. 2004. XML Schema Part 1: Structures 2nd Ed. http://www.w3.org/TR/xmlschema-1/#cIdentity-constraintGoogle Scholar
- Cong Yu and H. V. Jagadish. 2008. XML schema refinement through redundancy detection and normalization. VLDB J. 17, 2, 203--223. Google ScholarDigital Library
Index Terms
- Discovering XSD Keys from XML Data
Recommendations
Discovering XSD keys from XML data
SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of DataA great deal of research into the learning of schemas from XML data has been conducted in recent years to enable the automatic discovery of XML Schemas from XML documents when no schema, or only a low-quality one is available. Unfortunately, and in ...
Deriving relation keys from XML keys
ADC '03: Proceedings of the 14th Australasian database conference - Volume 17Much work on XML data was around storage and querying and did not consider constraints of XML, especially keys. Since constraints have been proposed in many papers for XML, much research work on constraints has been being done. In this paper, we consider ...
Translating relational schema into XML schema definition with data semantic preservation and XSD graph
Many legacy systems have been created by using relational database operating not for the Internet expression. Since the relational database is not an efficient way for data explosion, electronic transfer of data, and electronic business on the Web, we ...
Comments