skip to main content
10.1145/564691.564733acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article

Statistical synopses for graph-structured XML databases

Published:03 June 2002Publication History

ABSTRACT

Effective support for XML query languages is becoming increasingly important with the emergence of new applications that access large volumes of XML data. All existing proposals for querying XML (e.g., XQuery) rely on a pattern-specification language that allows path navigation and branching through the XML data graph in order to reach the desired data elements. Optimizing such queries depends crucially on the existence of concise synopsis structures that enable accurate compile-time selectivity estimates for complex path expressions over graph-structured XML data. In this paper, We introduce a novel approach to building and using statistical summaries of large XML data graphs for effective path-expression selectivity estimation. Our proposed graph-synopsis model (termed XSKETCH) exploits localized graph stability to accurately approximate (in limited space) the path and branching distribution in the data graph. To estimate the selectivities of complex path expressions over concise XSKETCH synopses, we develop an estimation framework that relies on appropriate statistical (uniformity and independence) assumptions to compensate for the lack of detailed distribution information. Given our estimation framework, we demonstrate that the problem of building an accuracy-optimal XSKETCH for a given amount of space is 𝒩𝒫-hard, and propose an efficient heuristic algorithm based on greedy forward selection. Briefly, our algorithm constructs an XSKETCH synopsis by successive refinements of the label-split graph, the coarsest summary of the XML data graph. Our refinement operations act locally and attempt to capture important statistical correlations between data paths. Extensive experimental results with synthetic as well as real-life data sets verify the effectiveness of our approach. To the best of our knowledge, ours is the first work to address this timely problem in the most general setting of graph-structured data and complex (branching) path expressions.

References

  1. A. Aboulnaga, A. R. Alameldeen, and J. F. Naughton. "Estimating the Selectivity of XML Path Expressions for Internet Scale Applications". In Proc. of the 27th Intl. Conf. on Very Large Data Bases, Sept. 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. T. Bray, J. Paoli, C. M. Sperberg-McQueen, and E. Maler. "Extensible Markup Language (XML) 1.0 (Second Edition)". W3C Recommendation (www.w3.org/TR/REC-xml/), October 2000.]]Google ScholarGoogle Scholar
  3. D. Chamberlin, J. Clark, D. Florescu, J. Robie, J. Siméon, and M. Stefanescu. "XQuery 1.0: An XML Query Language". W3C Working Draft 07 (www.w3.org/TR/xquery/), June 2001.]]Google ScholarGoogle Scholar
  4. Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. Ng, and D. Srivastava. "Counting Twig Matches in a Tree". In Proc. of the 17th Intl. Conf. on Data Engineering, April 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. R. Christensen. "Log-Linear Models and Logistic Regression". Springer-Verlag New York, Inc. (Springer Series in Statistics), 1997.]]Google ScholarGoogle Scholar
  6. J. Clark. "XSL Transformations (XSLT), Version 1.0". W3C Recommendation (www.w3.org/TR/xslt/), November 1999.]]Google ScholarGoogle Scholar
  7. J. Clark and S. DeRose. "XML Path Language (XPath), Version 1.0". W3C Recomm. (www.w3.org/TR/xpath/), November 1999.]]Google ScholarGoogle Scholar
  8. S. DeRose, E. Maler, and D. Orchard. "XML Linking Language (XLink), Version 1.0". W3C Recommendation (www.w3.org/-TR/xlink/), June 2001.]]Google ScholarGoogle Scholar
  9. W. Feller. "An Introduction to Probability Theory and its Applications --- Vol. 1". John Wiley & Sons, 1968. (3rd Edition).]]Google ScholarGoogle Scholar
  10. B. Fox. "Discrete Optimization Via Marginal Analysis". Management Science, 13(3), November 1966.]]Google ScholarGoogle Scholar
  11. Y. Kanemasa H. Ishikawa, K. Kubota. XQL: A query language for xml data. In QL98 - The Query Languages Workshop, 1998.]]Google ScholarGoogle Scholar
  12. R. Kaushik, P. Shenoy, P. Bohannon, and E. Gudes. "Exploiting Local Similarity for Efficient Indexing of Paths in Graph Structured Data". In Proc. of the 18th Intl. Conf. on Data Engineering, Feb. 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. F. M. Malvestuto. "Approximating Discrete Probability Distributions with Decomposable Models". IEEE Trans. on Systems, Man, and Cybernetics, 21(5), September 1991.]]Google ScholarGoogle ScholarCross RefCross Ref
  14. J. McHugh and J. Widom. "Query Optimization for XML". In Proc. of the 25th Intl. Conf. on Very Large Data Bases, September 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. T. Milo and D. Suciu. "Index structures for Path Expressions". In Proc. of the 7th Intl. Conf. on Database Theory, January 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. J. F. Naughton, D. J. DeWitt, D. Maier, et al. "The Niagara Internet Query System". IEEE Data Engineering Bulletin, 24(2), 2001.]]Google ScholarGoogle Scholar
  17. R. Paige and R. E. Tarjan. "Three Partition Refinement Algorithms". SIAM Journal on Computing, 16(6), December 1987.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. J. Pearl. "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference". Morgan Kaufmann Publishers, Inc., 1988.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. N. Polyzotis and M. Garofalakis. "Statistical Synopses for Graph-Structured XML Databases". Bell Labs Tech. Memo., Dec. 2001.]]Google ScholarGoogle Scholar
  20. V. Poosala and Y. E. Ioannidis. "Selectivity Estimation Without the Attribute Value Independence Assumption". In Proc. of the 23rd Intl. Conf. on Very Large Data Bases, August 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. V. Poosala, Y. E. Ioannidis, P. J. Haas, and E. J. Shekita. "Improved Histograms for Selectivity Estimation of Range Predicates". In Proc. of the 1996 ACM SIGMOD Intl. Conf. on Management of Data, June 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. A. R. Schmidt, F. Waas, M. L. Kersten, D. Florescu, I. Manolescu, M. J. Carey, and R. Busse. "The XML Benchmark Project". Tech. Report, CWI, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. J. S. Vitter and M. Wang. "Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets". In Proc. of the 1999 ACM SIGMOD Intl. Conf. on Management of Data, May 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Statistical synopses for graph-structured XML databases

              Recommendations

              Comments

              Login options

              Check if you have access through your login credentials or your institution to get full access on this article.

              Sign in
              • Published in

                cover image ACM Conferences
                SIGMOD '02: Proceedings of the 2002 ACM SIGMOD international conference on Management of data
                June 2002
                654 pages
                ISBN:1581134975
                DOI:10.1145/564691

                Copyright © 2002 ACM

                Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                Publisher

                Association for Computing Machinery

                New York, NY, United States

                Publication History

                • Published: 3 June 2002

                Permissions

                Request permissions about this article.

                Request Permissions

                Check for updates

                Qualifiers

                • Article

                Acceptance Rates

                SIGMOD '02 Paper Acceptance Rate42of240submissions,18%Overall Acceptance Rate785of4,003submissions,20%

              PDF Format

              View or Download as a PDF file.

              PDF

              eReader

              View online with eReader.

              eReader