ABSTRACT
Effective support for XML query languages is becoming increasingly important with the emergence of new applications that access large volumes of XML data. All existing proposals for querying XML (e.g., XQuery) rely on a pattern-specification language that allows path navigation and branching through the XML data graph in order to reach the desired data elements. Optimizing such queries depends crucially on the existence of concise synopsis structures that enable accurate compile-time selectivity estimates for complex path expressions over graph-structured XML data. In this paper, We introduce a novel approach to building and using statistical summaries of large XML data graphs for effective path-expression selectivity estimation. Our proposed graph-synopsis model (termed XSKETCH) exploits localized graph stability to accurately approximate (in limited space) the path and branching distribution in the data graph. To estimate the selectivities of complex path expressions over concise XSKETCH synopses, we develop an estimation framework that relies on appropriate statistical (uniformity and independence) assumptions to compensate for the lack of detailed distribution information. Given our estimation framework, we demonstrate that the problem of building an accuracy-optimal XSKETCH for a given amount of space is 𝒩𝒫-hard, and propose an efficient heuristic algorithm based on greedy forward selection. Briefly, our algorithm constructs an XSKETCH synopsis by successive refinements of the label-split graph, the coarsest summary of the XML data graph. Our refinement operations act locally and attempt to capture important statistical correlations between data paths. Extensive experimental results with synthetic as well as real-life data sets verify the effectiveness of our approach. To the best of our knowledge, ours is the first work to address this timely problem in the most general setting of graph-structured data and complex (branching) path expressions.
- A. Aboulnaga, A. R. Alameldeen, and J. F. Naughton. "Estimating the Selectivity of XML Path Expressions for Internet Scale Applications". In Proc. of the 27th Intl. Conf. on Very Large Data Bases, Sept. 2001.]] Google ScholarDigital Library
- T. Bray, J. Paoli, C. M. Sperberg-McQueen, and E. Maler. "Extensible Markup Language (XML) 1.0 (Second Edition)". W3C Recommendation (www.w3.org/TR/REC-xml/), October 2000.]]Google Scholar
- D. Chamberlin, J. Clark, D. Florescu, J. Robie, J. Siméon, and M. Stefanescu. "XQuery 1.0: An XML Query Language". W3C Working Draft 07 (www.w3.org/TR/xquery/), June 2001.]]Google Scholar
- Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. Ng, and D. Srivastava. "Counting Twig Matches in a Tree". In Proc. of the 17th Intl. Conf. on Data Engineering, April 2001.]] Google ScholarDigital Library
- R. Christensen. "Log-Linear Models and Logistic Regression". Springer-Verlag New York, Inc. (Springer Series in Statistics), 1997.]]Google Scholar
- J. Clark. "XSL Transformations (XSLT), Version 1.0". W3C Recommendation (www.w3.org/TR/xslt/), November 1999.]]Google Scholar
- J. Clark and S. DeRose. "XML Path Language (XPath), Version 1.0". W3C Recomm. (www.w3.org/TR/xpath/), November 1999.]]Google Scholar
- S. DeRose, E. Maler, and D. Orchard. "XML Linking Language (XLink), Version 1.0". W3C Recommendation (www.w3.org/-TR/xlink/), June 2001.]]Google Scholar
- W. Feller. "An Introduction to Probability Theory and its Applications --- Vol. 1". John Wiley & Sons, 1968. (3rd Edition).]]Google Scholar
- B. Fox. "Discrete Optimization Via Marginal Analysis". Management Science, 13(3), November 1966.]]Google Scholar
- Y. Kanemasa H. Ishikawa, K. Kubota. XQL: A query language for xml data. In QL98 - The Query Languages Workshop, 1998.]]Google Scholar
- R. Kaushik, P. Shenoy, P. Bohannon, and E. Gudes. "Exploiting Local Similarity for Efficient Indexing of Paths in Graph Structured Data". In Proc. of the 18th Intl. Conf. on Data Engineering, Feb. 2002.]] Google ScholarDigital Library
- F. M. Malvestuto. "Approximating Discrete Probability Distributions with Decomposable Models". IEEE Trans. on Systems, Man, and Cybernetics, 21(5), September 1991.]]Google ScholarCross Ref
- J. McHugh and J. Widom. "Query Optimization for XML". In Proc. of the 25th Intl. Conf. on Very Large Data Bases, September 1999.]] Google ScholarDigital Library
- T. Milo and D. Suciu. "Index structures for Path Expressions". In Proc. of the 7th Intl. Conf. on Database Theory, January 1999.]] Google ScholarDigital Library
- J. F. Naughton, D. J. DeWitt, D. Maier, et al. "The Niagara Internet Query System". IEEE Data Engineering Bulletin, 24(2), 2001.]]Google Scholar
- R. Paige and R. E. Tarjan. "Three Partition Refinement Algorithms". SIAM Journal on Computing, 16(6), December 1987.]] Google ScholarDigital Library
- J. Pearl. "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference". Morgan Kaufmann Publishers, Inc., 1988.]] Google ScholarDigital Library
- N. Polyzotis and M. Garofalakis. "Statistical Synopses for Graph-Structured XML Databases". Bell Labs Tech. Memo., Dec. 2001.]]Google Scholar
- V. Poosala and Y. E. Ioannidis. "Selectivity Estimation Without the Attribute Value Independence Assumption". In Proc. of the 23rd Intl. Conf. on Very Large Data Bases, August 1997.]] Google ScholarDigital Library
- V. Poosala, Y. E. Ioannidis, P. J. Haas, and E. J. Shekita. "Improved Histograms for Selectivity Estimation of Range Predicates". In Proc. of the 1996 ACM SIGMOD Intl. Conf. on Management of Data, June 1996.]] Google ScholarDigital Library
- A. R. Schmidt, F. Waas, M. L. Kersten, D. Florescu, I. Manolescu, M. J. Carey, and R. Busse. "The XML Benchmark Project". Tech. Report, CWI, 2001.]] Google ScholarDigital Library
- J. S. Vitter and M. Wang. "Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets". In Proc. of the 1999 ACM SIGMOD Intl. Conf. on Management of Data, May 1999.]] Google ScholarDigital Library
Index Terms
- Statistical synopses for graph-structured XML databases
Recommendations
XSKETCH synopses for XML data graphs
Effective support for XML query languages is becoming increasingly important with the emergence of new applications that access large volumes of XML data. All existing proposals for querying XML (e.g., XQuery) rely on a pattern-specification language ...
Comments