Article

Statistical synopses for graph-structured XML databases

Authors:
Neoklis Polyzotis

Univ. of Wisconsin-Madison

Univ. of Wisconsin-Madison
View Profile

,
Minos Garofalakis

Bell Labs, Lucent Technologies

Bell Labs, Lucent Technologies
View Profile

SIGMOD '02: Proceedings of the 2002 ACM SIGMOD international conference on Management of dataJune 2002Pages 358–369https://doi.org/10.1145/564691.564733

Published:03 June 2002Publication History

SIGMOD '02: Proceedings of the 2002 ACM SIGMOD international conference on Management of data

Pages 358–369

ABSTRACT

Effective support for XML query languages is becoming increasingly important with the emergence of new applications that access large volumes of XML data. All existing proposals for querying XML (e.g., XQuery) rely on a pattern-specification language that allows path navigation and branching through the XML data graph in order to reach the desired data elements. Optimizing such queries depends crucially on the existence of concise synopsis structures that enable accurate compile-time selectivity estimates for complex path expressions over graph-structured XML data. In this paper, We introduce a novel approach to building and using statistical summaries of large XML data graphs for effective path-expression selectivity estimation. Our proposed graph-synopsis model (termed XSKETCH) exploits localized graph stability to accurately approximate (in limited space) the path and branching distribution in the data graph. To estimate the selectivities of complex path expressions over concise XSKETCH synopses, we develop an estimation framework that relies on appropriate statistical (uniformity and independence) assumptions to compensate for the lack of detailed distribution information. Given our estimation framework, we demonstrate that the problem of building an accuracy-optimal XSKETCH for a given amount of space is 𝒩𝒫-hard, and propose an efficient heuristic algorithm based on greedy forward selection. Briefly, our algorithm constructs an XSKETCH synopsis by successive refinements of the label-split graph, the coarsest summary of the XML data graph. Our refinement operations act locally and attempt to capture important statistical correlations between data paths. Extensive experimental results with synthetic as well as real-life data sets verify the effectiveness of our approach. To the best of our knowledge, ours is the first work to address this timely problem in the most general setting of graph-structured data and complex (branching) path expressions.

References

A. Aboulnaga, A. R. Alameldeen, and J. F. Naughton. "Estimating the Selectivity of XML Path Expressions for Internet Scale Applications". In Proc. of the 27th Intl. Conf. on Very Large Data Bases, Sept. 2001.]] Google ScholarDigital Library
T. Bray, J. Paoli, C. M. Sperberg-McQueen, and E. Maler. "Extensible Markup Language (XML) 1.0 (Second Edition)". W3C Recommendation (www.w3.org/TR/REC-xml/), October 2000.]]Google Scholar
D. Chamberlin, J. Clark, D. Florescu, J. Robie, J. Siméon, and M. Stefanescu. "XQuery 1.0: An XML Query Language". W3C Working Draft 07 (www.w3.org/TR/xquery/), June 2001.]]Google Scholar
Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. Ng, and D. Srivastava. "Counting Twig Matches in a Tree". In Proc. of the 17th Intl. Conf. on Data Engineering, April 2001.]] Google ScholarDigital Library
R. Christensen. "Log-Linear Models and Logistic Regression". Springer-Verlag New York, Inc. (Springer Series in Statistics), 1997.]]Google Scholar
J. Clark. "XSL Transformations (XSLT), Version 1.0". W3C Recommendation (www.w3.org/TR/xslt/), November 1999.]]Google Scholar
J. Clark and S. DeRose. "XML Path Language (XPath), Version 1.0". W3C Recomm. (www.w3.org/TR/xpath/), November 1999.]]Google Scholar
S. DeRose, E. Maler, and D. Orchard. "XML Linking Language (XLink), Version 1.0". W3C Recommendation (www.w3.org/-TR/xlink/), June 2001.]]Google Scholar
W. Feller. "An Introduction to Probability Theory and its Applications --- Vol. 1". John Wiley & Sons, 1968. (3rd Edition).]]Google Scholar
B. Fox. "Discrete Optimization Via Marginal Analysis". Management Science, 13(3), November 1966.]]Google Scholar
Y. Kanemasa H. Ishikawa, K. Kubota. XQL: A query language for xml data. In QL98 - The Query Languages Workshop, 1998.]]Google Scholar
R. Kaushik, P. Shenoy, P. Bohannon, and E. Gudes. "Exploiting Local Similarity for Efficient Indexing of Paths in Graph Structured Data". In Proc. of the 18th Intl. Conf. on Data Engineering, Feb. 2002.]] Google ScholarDigital Library
F. M. Malvestuto. "Approximating Discrete Probability Distributions with Decomposable Models". IEEE Trans. on Systems, Man, and Cybernetics, 21(5), September 1991.]]Google ScholarCross Ref
J. McHugh and J. Widom. "Query Optimization for XML". In Proc. of the 25th Intl. Conf. on Very Large Data Bases, September 1999.]] Google ScholarDigital Library
T. Milo and D. Suciu. "Index structures for Path Expressions". In Proc. of the 7th Intl. Conf. on Database Theory, January 1999.]] Google ScholarDigital Library
J. F. Naughton, D. J. DeWitt, D. Maier, et al. "The Niagara Internet Query System". IEEE Data Engineering Bulletin, 24(2), 2001.]]Google Scholar
R. Paige and R. E. Tarjan. "Three Partition Refinement Algorithms". SIAM Journal on Computing, 16(6), December 1987.]] Google ScholarDigital Library
J. Pearl. "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference". Morgan Kaufmann Publishers, Inc., 1988.]] Google ScholarDigital Library
N. Polyzotis and M. Garofalakis. "Statistical Synopses for Graph-Structured XML Databases". Bell Labs Tech. Memo., Dec. 2001.]]Google Scholar
V. Poosala and Y. E. Ioannidis. "Selectivity Estimation Without the Attribute Value Independence Assumption". In Proc. of the 23rd Intl. Conf. on Very Large Data Bases, August 1997.]] Google ScholarDigital Library
V. Poosala, Y. E. Ioannidis, P. J. Haas, and E. J. Shekita. "Improved Histograms for Selectivity Estimation of Range Predicates". In Proc. of the 1996 ACM SIGMOD Intl. Conf. on Management of Data, June 1996.]] Google ScholarDigital Library
A. R. Schmidt, F. Waas, M. L. Kersten, D. Florescu, I. Manolescu, M. J. Carey, and R. Busse. "The XML Benchmark Project". Tech. Report, CWI, 2001.]] Google ScholarDigital Library
J. S. Vitter and M. Wang. "Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets". In Proc. of the 1999 ACM SIGMOD Intl. Conf. on Management of Data, May 1999.]] Google ScholarDigital Library

Index Terms

Statistical synopses for graph-structured XML databases

Recommendations

Statistical synopses for xml data
Read More
XSKETCH synopses for XML data graphs

Effective support for XML query languages is becoming increasingly important with the emergence of new applications that access large volumes of XML data. All existing proposals for querying XML (e.g., XQuery) rely on a pattern-specification language ...
Read More
Efficient xpath query processing in native xml databases
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '02: Proceedings of the 2002 ACM SIGMOD international conference on Management of data
June 2002
654 pages
ISBN:1581134975
DOI:10.1145/564691
Conference Chair:
Bongki Moon
University of Wisconsin - Madison
,
General Chair:
David DeWitt
University of Wisconsin - Madison
,
Program Chair:
Michael Franklin
University of California, Berkeley
Copyright © 2002 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 3 June 2002
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- Article
Conference

Acceptance Rates
SIGMOD '02 Paper Acceptance Rate42of240submissions,18%Overall Acceptance Rate785of4,003submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 106
  Total Citations
  View Citations
- 939
  Total Downloads
- Downloads (Last 12 months)2
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Statistical synopses for graph-structured XML databases

SIGMOD '02: Proceedings of the 2002 ACM SIGMOD international conference on Management of data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Statistical synopses for xml data

XSKETCH synopses for XML data graphs

Efficient xpath query processing in native xml databases