skip to main content
article

XSKETCH synopses for XML data graphs

Published:01 September 2006Publication History
Skip Abstract Section

Abstract

Effective support for XML query languages is becoming increasingly important with the emergence of new applications that access large volumes of XML data. All existing proposals for querying XML (e.g., XQuery) rely on a pattern-specification language that allows (1) path navigation and branching through the label structure of the XML data graph, and (2) predicates on the values of specific path/branch nodes, in order to reach the desired data elements. Clearly, optimizing such queries requires approximating the result cardinality of the referenced paths and hence hinges on the existence of concise synopsis structures that enable accurate compile-time selectivity estimates for complex path expressions over the base XML data. In this article, we introduce a novel approach to building and using statistical summaries of large XML data graphs for effective path-expression selectivity estimation. Our proposed graph-synopsis model (termed XSketch) exploits localized graph stability and value-distribution summaries (e.g., histograms) to accurately approximate (in limited space) the path and branching distribution, as well as the complex correlation patterns that can exist between and across path structure and element values in the data graph. To the best of our knowledge, ours is the first work to address this timely problem in the most general setting of graph-structured XML data with values, and complex (branching) path expressions.

Skip Supplemental Material Section

Supplemental Material

References

  1. Aboulnaga, A., Alameldeen, A. R., and Naughton, J. F. 2001. Estimating the Selectivity of XML Path Expressions for Internet Scale Applications. In Proceedings of the 27th International Conference on Very Large Data Bases.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Aboulnaga, A. and Naughton, J. F. 2003. Building XML statistics for the hidden web. In Proceedings of the Twelfth International Conference on Information and Knowledge Management. 358--365.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Altinel, M. and Franklin, M. J. 2000. Efficient filtering of XML documents for selective dissemination of information. In Proceedings of the 26th International Conference on Very Large Data Bases. 53--64.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Boag, S., Chamberlin, D., Fernndez, M. F., Florescu, D., Robie, J., and Simon, J. 2005. XQuery 1.0: An XML query language. W3C Candidate Recommendation. Go online to www.w3.org.]]Google ScholarGoogle Scholar
  5. Bray, T., Paoli, J., Sperberg-McQueen, C. M., and Maler, E. 2000. Extensible Markup Language (XML) 1.0 (second edition). W3C Recommendation. Go online to www.w3.org/.]]Google ScholarGoogle Scholar
  6. Buneman, P., Davidson, S., Hillebrand, G., and Suciu, D. 1996. A query language and optimization techniques for unstructured data. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Chaudhuri, S., Motwani, R., and Narasayya, V. 1999. On random sampling over joins. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. 263--274.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Chen, Z., Jagadish, H. V., Korn, F., Koudas, N., Muthukrishnan, S., Ng, R., and Srivastava, D. 2001. Counting twig matches in a tree. In Proceedings of the 17th International Conference on Data Engineering. 595--604.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Clark, J. 1999. XSL Transformations (XSLT), Version 1.0. W3C Recommendation. Go online to www.w3.org.]]Google ScholarGoogle Scholar
  10. Clark, J. and DeRose, S. 1999. XML path language (XPath), version 1.0. W3C Recommendation. Go online to www.w3.org.]]Google ScholarGoogle Scholar
  11. DeRose, S., Maler, E., and Orchard, D. 2001. XML linking language (XLink), version 1.0. W3C Recommendation. Go online to www.w3.org.]]Google ScholarGoogle Scholar
  12. Deshpande, A., Garofalakis, M., and Rastogi, R. 2001. Independence is good: Dependency-based histogram synopses for high-dimensional data. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Feller, W. 1968. An Introduction to Probability Theory and its Applications---Volume I, John Wiley & Sons, New York, NY.]]Google ScholarGoogle Scholar
  14. Fowler, R. J., Paterson, M. S., and Tanimoto, S. L. 1981. Optimal packing and covering in the plane are NP-complete. Inform. Process. Lett. 12, 3 (Jun.), 133--137.]]Google ScholarGoogle ScholarCross RefCross Ref
  15. Fox, B. 1966. Discrete Optimization Via Marginal Analysis. Manage. Sci. 13, 3, 211--216.]]Google ScholarGoogle Scholar
  16. Freire, J., Haritsa, J. R., Ramanath, M., Roy, P., and Siméon, J. 2002. StatiX: Making XML count. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. 181--191.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Garofalakis, M. N. and Gibbons, P. 2001. Approximate Query processing: Taming the terabytes. In Proceedings of the 27th International Conference on Very Large Data Bases.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Gibbons, P. B. 2001. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proceedings of the 27th International Conference on Very Large Data Bases. 541--550.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Goldman, R. and Widom, J. 1997. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the 23rd International Conference on Very Large Data Bases.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Gonzalez, T. F. 1985. Clustering to minimize the maximum intercluster distance. Theoret. Comput. Sci. 38, 293--306.]]Google ScholarGoogle ScholarCross RefCross Ref
  21. Jagadish, H., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K. C., and Suel, T. 1998. Optimal histograms with quality guarantees. In Proceedings of the 24th International Conference on Very Large Data Bases.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Jagadish, H. V., Al-Khalifa, S., Chapman, A., Lakshmanan, L. V. S., Nierman, A., Paparizos, S., Patel, J. M., Srivastava, D., Wiwatwattana, N., Wu, Y., and Yu, C. 2002. TIMBER: A native XML database. Internat. J. Very Large Data Bases 11, 4, 274--291.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Kaushik, R., Bohannon, P., Naughton, J. F., and Korth, H. F. 2002a. Covering indexes for branching path queries. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. 133--144.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Kaushik, R., Shenoy, P., Bohannon, P., and Gudes, E. 2002b. Exploiting local similarity for efficient indexing of paths in graph structured data. In Proceedings of the 18th International Conference on Data Engineering. 129--140.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Lim, L., Wang, M., Padmanabhan, S., Vitter, J., and Parr, R. 2002. XPathLearner: An on-line self-tuning markov histogram for XML path selectivity estimation. In Proceedings of the 28th International Conference on Very Large Data Bases. 442--453.]]Google ScholarGoogle Scholar
  26. Megiddo, N. and Supowit, K. J. 1984. On the complexity of some common geometric location problems. SIAM J. Comput. 13, 1, 182--196.]]Google ScholarGoogle ScholarCross RefCross Ref
  27. Milo, T. and Suciu, D. 1999. Index structures for path expressions. In Proceedings of the 7th International Conference on Database Theory (ICDT'99). 277--295.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Naughton, J. F., DeWitt, D. J., and et al 2001. The niagara internet query system. IEEE Data Eng. Bull. 24, 2.]]Google ScholarGoogle Scholar
  29. Paige, R. and Tarjan, R. E. 1987. Three partition refinement algorithms. SIAM J. Comput. 16, 6 (Dec.), 973--989.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Pearl, J. 1988. Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann, San Francisco, CA.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Polyzotis, N. and Garofalakis, M. 2006. Xcluster synopses for structured xml content. In Proceedings of the 22nd International Conference on Data Engineering.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Polyzotis, N., Garofalakis, M., and Ioannidis, Y. 2004a. Approximate XML query answers. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data. 263--274.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Polyzotis, N., Garofalakis, M., and Ioannidis, Y. 2004b. Selectivity estimation for XML twigs. In Proceedings of the 20th International Conference on Data Engineering. 264--275.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Poosala, V. and Ioannidis, Y. E. 1997. Selectivity estimation without the attribute value independence assumption. In Proceedings of the 23rd International Conference on Very Large Data Bases. 486--495.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Poosala, V., Ioannidis, Y. E., Haas, P. J., and Shekita, E. J. 1996. Improved histograms for selectivity estimation of range predicates. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Ramanath, M., Zhang, L., Freire, J., and Haritsa, J. 2005. IMAX: The big picture of dynamic XML statistics. In Proceedings of the 21st International Conference on Data Engineering. 273--284.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Vitter, J. S. and Wang, M. 1999. Approximate computation of multidimensional aggregates of sparse data using wavelets. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. 193--204.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Wang, W., Jiang, H., Lu, H., and Yu, J. X. 2003. Containment join size estimation: Models and methods. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. 145--156.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Wang, W., Jiang, H., Lu, H., and Yu, J. X. 2004. Bloom histogram: Path selectivity estimation for XML data with updates. In Proceedings of the 30th International Conference on Very Large Data Bases. 240--251.]]Google ScholarGoogle Scholar
  40. Wu, Y., Patel, J. M., and Jagadish, H. 2002. Estimating answer sizes for XML queries. In Proceedings of the 8th International Conference on Extending Database Technology (EDBT '02). 590--608.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Wu, Y., Patel, J. M., and Jagadish, H. 2003. Structural join order selection for XML query optimization. In Proceedings of the 19th International Conference on Data Engineering. 443--454.]]Google ScholarGoogle Scholar
  42. Zhang, N., Ozsu, M. T., Aboulnaga, A., and Ilyas, I. 2006. Xseed: Accurate and fast cardinality estimation for xpath queries. In Proceedings of the 22nd International Conference on Data Engineering.]] Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. XSKETCH synopses for XML data graphs

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader