Abstract
Effective support for XML query languages is becoming increasingly important with the emergence of new applications that access large volumes of XML data. All existing proposals for querying XML (e.g., XQuery) rely on a pattern-specification language that allows (1) path navigation and branching through the label structure of the XML data graph, and (2) predicates on the values of specific path/branch nodes, in order to reach the desired data elements. Clearly, optimizing such queries requires approximating the result cardinality of the referenced paths and hence hinges on the existence of concise synopsis structures that enable accurate compile-time selectivity estimates for complex path expressions over the base XML data. In this article, we introduce a novel approach to building and using statistical summaries of large XML data graphs for effective path-expression selectivity estimation. Our proposed graph-synopsis model (termed XSketch) exploits localized graph stability and value-distribution summaries (e.g., histograms) to accurately approximate (in limited space) the path and branching distribution, as well as the complex correlation patterns that can exist between and across path structure and element values in the data graph. To the best of our knowledge, ours is the first work to address this timely problem in the most general setting of graph-structured XML data with values, and complex (branching) path expressions.
Supplemental Material
Available for Download
Online appendix to designing mediation for context-aware applications. The appendix supports the information on page 1014.
- Aboulnaga, A., Alameldeen, A. R., and Naughton, J. F. 2001. Estimating the Selectivity of XML Path Expressions for Internet Scale Applications. In Proceedings of the 27th International Conference on Very Large Data Bases.]] Google ScholarDigital Library
- Aboulnaga, A. and Naughton, J. F. 2003. Building XML statistics for the hidden web. In Proceedings of the Twelfth International Conference on Information and Knowledge Management. 358--365.]] Google ScholarDigital Library
- Altinel, M. and Franklin, M. J. 2000. Efficient filtering of XML documents for selective dissemination of information. In Proceedings of the 26th International Conference on Very Large Data Bases. 53--64.]] Google ScholarDigital Library
- Boag, S., Chamberlin, D., Fernndez, M. F., Florescu, D., Robie, J., and Simon, J. 2005. XQuery 1.0: An XML query language. W3C Candidate Recommendation. Go online to www.w3.org.]]Google Scholar
- Bray, T., Paoli, J., Sperberg-McQueen, C. M., and Maler, E. 2000. Extensible Markup Language (XML) 1.0 (second edition). W3C Recommendation. Go online to www.w3.org/.]]Google Scholar
- Buneman, P., Davidson, S., Hillebrand, G., and Suciu, D. 1996. A query language and optimization techniques for unstructured data. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data.]] Google ScholarDigital Library
- Chaudhuri, S., Motwani, R., and Narasayya, V. 1999. On random sampling over joins. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. 263--274.]] Google ScholarDigital Library
- Chen, Z., Jagadish, H. V., Korn, F., Koudas, N., Muthukrishnan, S., Ng, R., and Srivastava, D. 2001. Counting twig matches in a tree. In Proceedings of the 17th International Conference on Data Engineering. 595--604.]] Google ScholarDigital Library
- Clark, J. 1999. XSL Transformations (XSLT), Version 1.0. W3C Recommendation. Go online to www.w3.org.]]Google Scholar
- Clark, J. and DeRose, S. 1999. XML path language (XPath), version 1.0. W3C Recommendation. Go online to www.w3.org.]]Google Scholar
- DeRose, S., Maler, E., and Orchard, D. 2001. XML linking language (XLink), version 1.0. W3C Recommendation. Go online to www.w3.org.]]Google Scholar
- Deshpande, A., Garofalakis, M., and Rastogi, R. 2001. Independence is good: Dependency-based histogram synopses for high-dimensional data. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data.]] Google ScholarDigital Library
- Feller, W. 1968. An Introduction to Probability Theory and its Applications---Volume I, John Wiley & Sons, New York, NY.]]Google Scholar
- Fowler, R. J., Paterson, M. S., and Tanimoto, S. L. 1981. Optimal packing and covering in the plane are NP-complete. Inform. Process. Lett. 12, 3 (Jun.), 133--137.]]Google ScholarCross Ref
- Fox, B. 1966. Discrete Optimization Via Marginal Analysis. Manage. Sci. 13, 3, 211--216.]]Google Scholar
- Freire, J., Haritsa, J. R., Ramanath, M., Roy, P., and Siméon, J. 2002. StatiX: Making XML count. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. 181--191.]] Google ScholarDigital Library
- Garofalakis, M. N. and Gibbons, P. 2001. Approximate Query processing: Taming the terabytes. In Proceedings of the 27th International Conference on Very Large Data Bases.]] Google ScholarDigital Library
- Gibbons, P. B. 2001. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proceedings of the 27th International Conference on Very Large Data Bases. 541--550.]] Google ScholarDigital Library
- Goldman, R. and Widom, J. 1997. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the 23rd International Conference on Very Large Data Bases.]] Google ScholarDigital Library
- Gonzalez, T. F. 1985. Clustering to minimize the maximum intercluster distance. Theoret. Comput. Sci. 38, 293--306.]]Google ScholarCross Ref
- Jagadish, H., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K. C., and Suel, T. 1998. Optimal histograms with quality guarantees. In Proceedings of the 24th International Conference on Very Large Data Bases.]] Google ScholarDigital Library
- Jagadish, H. V., Al-Khalifa, S., Chapman, A., Lakshmanan, L. V. S., Nierman, A., Paparizos, S., Patel, J. M., Srivastava, D., Wiwatwattana, N., Wu, Y., and Yu, C. 2002. TIMBER: A native XML database. Internat. J. Very Large Data Bases 11, 4, 274--291.]] Google ScholarDigital Library
- Kaushik, R., Bohannon, P., Naughton, J. F., and Korth, H. F. 2002a. Covering indexes for branching path queries. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. 133--144.]] Google ScholarDigital Library
- Kaushik, R., Shenoy, P., Bohannon, P., and Gudes, E. 2002b. Exploiting local similarity for efficient indexing of paths in graph structured data. In Proceedings of the 18th International Conference on Data Engineering. 129--140.]] Google ScholarDigital Library
- Lim, L., Wang, M., Padmanabhan, S., Vitter, J., and Parr, R. 2002. XPathLearner: An on-line self-tuning markov histogram for XML path selectivity estimation. In Proceedings of the 28th International Conference on Very Large Data Bases. 442--453.]]Google Scholar
- Megiddo, N. and Supowit, K. J. 1984. On the complexity of some common geometric location problems. SIAM J. Comput. 13, 1, 182--196.]]Google ScholarCross Ref
- Milo, T. and Suciu, D. 1999. Index structures for path expressions. In Proceedings of the 7th International Conference on Database Theory (ICDT'99). 277--295.]] Google ScholarDigital Library
- Naughton, J. F., DeWitt, D. J., and et al 2001. The niagara internet query system. IEEE Data Eng. Bull. 24, 2.]]Google Scholar
- Paige, R. and Tarjan, R. E. 1987. Three partition refinement algorithms. SIAM J. Comput. 16, 6 (Dec.), 973--989.]] Google ScholarDigital Library
- Pearl, J. 1988. Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann, San Francisco, CA.]] Google ScholarDigital Library
- Polyzotis, N. and Garofalakis, M. 2006. Xcluster synopses for structured xml content. In Proceedings of the 22nd International Conference on Data Engineering.]] Google ScholarDigital Library
- Polyzotis, N., Garofalakis, M., and Ioannidis, Y. 2004a. Approximate XML query answers. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data. 263--274.]] Google ScholarDigital Library
- Polyzotis, N., Garofalakis, M., and Ioannidis, Y. 2004b. Selectivity estimation for XML twigs. In Proceedings of the 20th International Conference on Data Engineering. 264--275.]] Google ScholarDigital Library
- Poosala, V. and Ioannidis, Y. E. 1997. Selectivity estimation without the attribute value independence assumption. In Proceedings of the 23rd International Conference on Very Large Data Bases. 486--495.]] Google ScholarDigital Library
- Poosala, V., Ioannidis, Y. E., Haas, P. J., and Shekita, E. J. 1996. Improved histograms for selectivity estimation of range predicates. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data.]] Google ScholarDigital Library
- Ramanath, M., Zhang, L., Freire, J., and Haritsa, J. 2005. IMAX: The big picture of dynamic XML statistics. In Proceedings of the 21st International Conference on Data Engineering. 273--284.]] Google ScholarDigital Library
- Vitter, J. S. and Wang, M. 1999. Approximate computation of multidimensional aggregates of sparse data using wavelets. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. 193--204.]] Google ScholarDigital Library
- Wang, W., Jiang, H., Lu, H., and Yu, J. X. 2003. Containment join size estimation: Models and methods. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. 145--156.]] Google ScholarDigital Library
- Wang, W., Jiang, H., Lu, H., and Yu, J. X. 2004. Bloom histogram: Path selectivity estimation for XML data with updates. In Proceedings of the 30th International Conference on Very Large Data Bases. 240--251.]]Google Scholar
- Wu, Y., Patel, J. M., and Jagadish, H. 2002. Estimating answer sizes for XML queries. In Proceedings of the 8th International Conference on Extending Database Technology (EDBT '02). 590--608.]] Google ScholarDigital Library
- Wu, Y., Patel, J. M., and Jagadish, H. 2003. Structural join order selection for XML query optimization. In Proceedings of the 19th International Conference on Data Engineering. 443--454.]]Google Scholar
- Zhang, N., Ozsu, M. T., Aboulnaga, A., and Ilyas, I. 2006. Xseed: Accurate and fast cardinality estimation for xpath queries. In Proceedings of the 22nd International Conference on Data Engineering.]] Google ScholarDigital Library
Index Terms
- XSKETCH synopses for XML data graphs
Recommendations
DVQ: Towards Visual Query Processing of XML Database Systems
XML has been recognized as a promising language for data exchange over the Internet. A number of query languages have been proposed for querying XML data. Most of those languages are path-expression based. One difficulty in forming path-expression based ...
Mapping of bibliographical standards into XML
The most popular bibliographical standards, which prescribe the exchange of bibliographical data in machine readable form, are MARC (Machine Readable Cataloguing) and UNIMARC (Universal Machine Readable Cataloguing). This paper presents two schemas, ...
TuG synopses for approximate query answering
This article introduces the Tuple Graph (TuG) synopses, a new class of data summaries that enable accurate approximate answers for complex relational queries. The proposed summarization framework adopts a “semi-structured” view of the relational ...
Comments