skip to main content
10.1145/564691.564713acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article

StatiX: making XML count

Published:03 June 2002Publication History

ABSTRACT

The availability of summary data for XML documents has many applications, from providing users with quick feedback about their queries, to cost-based storage design and query optimization. StatiX is a novel XML Schema-aware statistics framework that exploits the structure derived by regular expressions (which define elements in an XML Schema) to pinpoint places in the schema that are likely sources of structural skew. As we discuss below, this information can be used to build concise, yet accurate, statistical summaries for XML data. StatiX leverages standard XML technology for gathering statistics, notably XML Schema validators, and it uses histograms to summarize both the structure and values in an XML document. In this paper we describe the StatiX system. We develop algorithms that decompose schemas to obtain statistics at different granularities and discuss how statistics can be gathered as documents are validated. We also present an experimental evaluation which demonstrates the accuracy and scalability of our approach and show an application of these statistics to cost-based XML storage design.

References

  1. A. Aboulnaga, A. R. Alameldeen, and J. Naughton. Estimating the selectivity of XML path expressions for Internet scale applications. In Proceedings of VLDB, pages 591-600, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. P. Bohannon, J. Freire, P. Roy, and J. Siméon. From XML schema to relations: A cost-based approach to XML storage. In Proceedings of lCDE, pages 64-75, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. A. Brüggemann-Klein. Regular expressions into finite automata. TCS, 120(2):197-213, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. A. Brüggemann-Klein and D. Wood. One-unambiguous regular languages. Information and Computation, 140(2):229-253, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. D. Chambelin, J. Clark, D. Florescu, Jonathan Robie, J. Siméon, and M. Stefanescu. XQuery 1.0: An XML query language. W3C Working Draft, June 2001.Google ScholarGoogle Scholar
  6. Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. T. Ng, and D. Srivastava. Counting twig matches in a tree. In Proceedings of lCDE, pages 595-604, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. A. Deutsch, M. Fernandez, and D. Suciu. Storing semi-structured data with STORED. In Proceedings of SIGMOD, pages 431-442, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. P. Fankhauser, M. Fernandez, A. Malhotra, M. Rys, J. Siméon, and P. Wadler. The XML query algebra, February 2001. http://www.w3.org/TR/2001/WD-query-algebra-20010215.Google ScholarGoogle Scholar
  9. Galax system, October 2001. http://db.bell-labs.com/galax/.Google ScholarGoogle Scholar
  10. G. Graefe and W. McKenna. The volcano optimizer generator: Extensibility and efficient search. In Proceedings of lCDE, pages 209-218, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Internet Movie Database. http://www.imdb.com.Google ScholarGoogle Scholar
  12. Y. Ioannidis and S. Christodoulakis. Optimal histograms for limiting worst-case error propagation in the size of join results. ACM TODS, 18(4):709-748, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. A. Kemper and G. Moerkotte. Advanced query processing in object bases using access support relations. In Proceedings of VLDB, pages 290-301, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. J. McHugh and J. Widom. Query optimization for XML. In Proceedings of VLDB, pages 315-326, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. XML query language (xql). http://www.oasis-open.org, 2001.Google ScholarGoogle Scholar
  16. G. Piatetsky-Shapiro and C. Connell. Accurate estimation of the number of tuples satisfying a condition. In Proceedings of SIGMOD, pages 256-276, 1984. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. N. Polyzotis and M. Garofalakis. Statistical synopses for graph structured XML databases. In Proceedings of SIGMOD, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. V. Poosala and Y. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Proceedings of VLDB, pages 486-495, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. V. Poosala, Y. Ioannidis, P. Haas, and E. Shekita. Improved histograms for selectivity estimation of range predicates. In Proceedings of SIGMOD, pages 294-305, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. G. Rozenberg and A. Salomaa, editors. Handbook of formal languages, volume 3. Springer Verlag, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. P. Selinger, M. Astrahan, D. Chamberlin, R. Lorie, and T. Price. Access path selection in a relational database management system. In Proceedings of SIGMOD, pages 23-34, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. J. Shanmugasundaram, K. Tufle, G. He, C. Zhang, D. DeWitt, and J. Naughton. Relational databases for querying XML documents: Limitations and opportunities. In Proceedings of VLDB, pages 302-314, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. H. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML Schema Part 1: Structures. W3C Working Draft, February 2000.Google ScholarGoogle Scholar
  24. Y. Wu, J. M. Patel, and H. V. Jagadish. Estimating answer sizes for xml queries. In Proceedings of EDBT, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Xerces java parser 1.4.3. http://xml.apache.org/xerces-j/.Google ScholarGoogle Scholar
  26. Xmark. http://monetdb.cwi.nl/xml.Google ScholarGoogle Scholar

Index Terms

  1. StatiX: making XML count

            Recommendations

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in
            • Published in

              cover image ACM Conferences
              SIGMOD '02: Proceedings of the 2002 ACM SIGMOD international conference on Management of data
              June 2002
              654 pages
              ISBN:1581134975
              DOI:10.1145/564691

              Copyright © 2002 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 3 June 2002

              Permissions

              Request permissions about this article.

              Request Permissions

              Check for updates

              Qualifiers

              • Article

              Acceptance Rates

              SIGMOD '02 Paper Acceptance Rate42of240submissions,18%Overall Acceptance Rate785of4,003submissions,20%

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader