ABSTRACT
The availability of summary data for XML documents has many applications, from providing users with quick feedback about their queries, to cost-based storage design and query optimization. StatiX is a novel XML Schema-aware statistics framework that exploits the structure derived by regular expressions (which define elements in an XML Schema) to pinpoint places in the schema that are likely sources of structural skew. As we discuss below, this information can be used to build concise, yet accurate, statistical summaries for XML data. StatiX leverages standard XML technology for gathering statistics, notably XML Schema validators, and it uses histograms to summarize both the structure and values in an XML document. In this paper we describe the StatiX system. We develop algorithms that decompose schemas to obtain statistics at different granularities and discuss how statistics can be gathered as documents are validated. We also present an experimental evaluation which demonstrates the accuracy and scalability of our approach and show an application of these statistics to cost-based XML storage design.
- A. Aboulnaga, A. R. Alameldeen, and J. Naughton. Estimating the selectivity of XML path expressions for Internet scale applications. In Proceedings of VLDB, pages 591-600, 2001. Google ScholarDigital Library
- P. Bohannon, J. Freire, P. Roy, and J. Siméon. From XML schema to relations: A cost-based approach to XML storage. In Proceedings of lCDE, pages 64-75, 2002. Google ScholarDigital Library
- A. Brüggemann-Klein. Regular expressions into finite automata. TCS, 120(2):197-213, 1993. Google ScholarDigital Library
- A. Brüggemann-Klein and D. Wood. One-unambiguous regular languages. Information and Computation, 140(2):229-253, 1998. Google ScholarDigital Library
- D. Chambelin, J. Clark, D. Florescu, Jonathan Robie, J. Siméon, and M. Stefanescu. XQuery 1.0: An XML query language. W3C Working Draft, June 2001.Google Scholar
- Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. T. Ng, and D. Srivastava. Counting twig matches in a tree. In Proceedings of lCDE, pages 595-604, 2001. Google ScholarDigital Library
- A. Deutsch, M. Fernandez, and D. Suciu. Storing semi-structured data with STORED. In Proceedings of SIGMOD, pages 431-442, 1999. Google ScholarDigital Library
- P. Fankhauser, M. Fernandez, A. Malhotra, M. Rys, J. Siméon, and P. Wadler. The XML query algebra, February 2001. http://www.w3.org/TR/2001/WD-query-algebra-20010215.Google Scholar
- Galax system, October 2001. http://db.bell-labs.com/galax/.Google Scholar
- G. Graefe and W. McKenna. The volcano optimizer generator: Extensibility and efficient search. In Proceedings of lCDE, pages 209-218, 1993. Google ScholarDigital Library
- Internet Movie Database. http://www.imdb.com.Google Scholar
- Y. Ioannidis and S. Christodoulakis. Optimal histograms for limiting worst-case error propagation in the size of join results. ACM TODS, 18(4):709-748, 1993. Google ScholarDigital Library
- A. Kemper and G. Moerkotte. Advanced query processing in object bases using access support relations. In Proceedings of VLDB, pages 290-301, 1990. Google ScholarDigital Library
- J. McHugh and J. Widom. Query optimization for XML. In Proceedings of VLDB, pages 315-326, 1999. Google ScholarDigital Library
- XML query language (xql). http://www.oasis-open.org, 2001.Google Scholar
- G. Piatetsky-Shapiro and C. Connell. Accurate estimation of the number of tuples satisfying a condition. In Proceedings of SIGMOD, pages 256-276, 1984. Google ScholarDigital Library
- N. Polyzotis and M. Garofalakis. Statistical synopses for graph structured XML databases. In Proceedings of SIGMOD, 2002. Google ScholarDigital Library
- V. Poosala and Y. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Proceedings of VLDB, pages 486-495, 1997. Google ScholarDigital Library
- V. Poosala, Y. Ioannidis, P. Haas, and E. Shekita. Improved histograms for selectivity estimation of range predicates. In Proceedings of SIGMOD, pages 294-305, 1996. Google ScholarDigital Library
- G. Rozenberg and A. Salomaa, editors. Handbook of formal languages, volume 3. Springer Verlag, 1997. Google ScholarDigital Library
- P. Selinger, M. Astrahan, D. Chamberlin, R. Lorie, and T. Price. Access path selection in a relational database management system. In Proceedings of SIGMOD, pages 23-34, 1979. Google ScholarDigital Library
- J. Shanmugasundaram, K. Tufle, G. He, C. Zhang, D. DeWitt, and J. Naughton. Relational databases for querying XML documents: Limitations and opportunities. In Proceedings of VLDB, pages 302-314, 1999. Google ScholarDigital Library
- H. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML Schema Part 1: Structures. W3C Working Draft, February 2000.Google Scholar
- Y. Wu, J. M. Patel, and H. V. Jagadish. Estimating answer sizes for xml queries. In Proceedings of EDBT, 2002. Google ScholarDigital Library
- Xerces java parser 1.4.3. http://xml.apache.org/xerces-j/.Google Scholar
- Xmark. http://monetdb.cwi.nl/xml.Google Scholar
Index Terms
- StatiX: making XML count
Recommendations
Mapping of bibliographical standards into XML
The most popular bibliographical standards, which prescribe the exchange of bibliographical data in machine readable form, are MARC (Machine Readable Cataloguing) and UNIMARC (Universal Machine Readable Cataloguing). This paper presents two schemas, ...
XML-based information mediation with MIX
SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of dataThe MIX mediator system, MIXm, is developed as part of the MIX Project at the San Diego Supercomputer Center, and the University of California, San Diego.1 MIXm uses XML as the common model for data exchange. Mediator views are expressed in XMAS (XML ...
Comments