article

XSKETCH synopses for XML data graphs

Authors:
Neoklis Polyzotis

University of California, Santa Cruz, Santa Cruz, CA

University of California, Santa Cruz, Santa Cruz, CA
View Profile

,
Minos Garofalakis

Intel Research Berkeley, Berkeley, CA

Intel Research Berkeley, Berkeley, CA
View Profile

Authors Info & Claims

ACM Transactions on Database Systems Volume 31 Issue 3pp 1014–1063https://doi.org/10.1145/1166074.1166082

Published:01 September 2006Publication History

ACM Transactions on Database Systems

Abstract

Effective support for XML query languages is becoming increasingly important with the emergence of new applications that access large volumes of XML data. All existing proposals for querying XML (e.g., XQuery) rely on a pattern-specification language that allows (1) path navigation and branching through the label structure of the XML data graph, and (2) predicates on the values of specific path/branch nodes, in order to reach the desired data elements. Clearly, optimizing such queries requires approximating the result cardinality of the referenced paths and hence hinges on the existence of concise synopsis structures that enable accurate compile-time selectivity estimates for complex path expressions over the base XML data. In this article, we introduce a novel approach to building and using statistical summaries of large XML data graphs for effective path-expression selectivity estimation. Our proposed graph-synopsis model (termed XSketch) exploits localized graph stability and value-distribution summaries (e.g., histograms) to accurately approximate (in limited space) the path and branching distribution, as well as the complex correlation patterns that can exist between and across path structure and element values in the data graph. To the best of our knowledge, ours is the first work to address this timely problem in the most general setting of graph-structured XML data with values, and complex (branching) path expressions.

Supplemental Material

Available for Download

pdf

p1014-polyzotis-apndx.pdf (529.9 KB)

Online appendix to designing mediation for context-aware applications. The appendix supports the information on page 1014.

References

Aboulnaga, A., Alameldeen, A. R., and Naughton, J. F. 2001. Estimating the Selectivity of XML Path Expressions for Internet Scale Applications. In Proceedings of the 27th International Conference on Very Large Data Bases.]] Google ScholarDigital Library
Aboulnaga, A. and Naughton, J. F. 2003. Building XML statistics for the hidden web. In Proceedings of the Twelfth International Conference on Information and Knowledge Management. 358--365.]] Google ScholarDigital Library
Altinel, M. and Franklin, M. J. 2000. Efficient filtering of XML documents for selective dissemination of information. In Proceedings of the 26th International Conference on Very Large Data Bases. 53--64.]] Google ScholarDigital Library
Boag, S., Chamberlin, D., Fernndez, M. F., Florescu, D., Robie, J., and Simon, J. 2005. XQuery 1.0: An XML query language. W3C Candidate Recommendation. Go online to www.w3.org.]]Google Scholar
Bray, T., Paoli, J., Sperberg-McQueen, C. M., and Maler, E. 2000. Extensible Markup Language (XML) 1.0 (second edition). W3C Recommendation. Go online to www.w3.org/.]]Google Scholar
Buneman, P., Davidson, S., Hillebrand, G., and Suciu, D. 1996. A query language and optimization techniques for unstructured data. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data.]] Google ScholarDigital Library
Chaudhuri, S., Motwani, R., and Narasayya, V. 1999. On random sampling over joins. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. 263--274.]] Google ScholarDigital Library
Chen, Z., Jagadish, H. V., Korn, F., Koudas, N., Muthukrishnan, S., Ng, R., and Srivastava, D. 2001. Counting twig matches in a tree. In Proceedings of the 17th International Conference on Data Engineering. 595--604.]] Google ScholarDigital Library
Clark, J. 1999. XSL Transformations (XSLT), Version 1.0. W3C Recommendation. Go online to www.w3.org.]]Google Scholar
Clark, J. and DeRose, S. 1999. XML path language (XPath), version 1.0. W3C Recommendation. Go online to www.w3.org.]]Google Scholar
DeRose, S., Maler, E., and Orchard, D. 2001. XML linking language (XLink), version 1.0. W3C Recommendation. Go online to www.w3.org.]]Google Scholar
Deshpande, A., Garofalakis, M., and Rastogi, R. 2001. Independence is good: Dependency-based histogram synopses for high-dimensional data. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data.]] Google ScholarDigital Library
Feller, W. 1968. An Introduction to Probability Theory and its Applications---Volume I, John Wiley & Sons, New York, NY.]]Google Scholar
Fowler, R. J., Paterson, M. S., and Tanimoto, S. L. 1981. Optimal packing and covering in the plane are NP-complete. Inform. Process. Lett. 12, 3 (Jun.), 133--137.]]Google ScholarCross Ref
Fox, B. 1966. Discrete Optimization Via Marginal Analysis. Manage. Sci. 13, 3, 211--216.]]Google Scholar
Freire, J., Haritsa, J. R., Ramanath, M., Roy, P., and Siméon, J. 2002. StatiX: Making XML count. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. 181--191.]] Google ScholarDigital Library
Garofalakis, M. N. and Gibbons, P. 2001. Approximate Query processing: Taming the terabytes. In Proceedings of the 27th International Conference on Very Large Data Bases.]] Google ScholarDigital Library
Gibbons, P. B. 2001. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proceedings of the 27th International Conference on Very Large Data Bases. 541--550.]] Google ScholarDigital Library
Goldman, R. and Widom, J. 1997. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the 23rd International Conference on Very Large Data Bases.]] Google ScholarDigital Library
Gonzalez, T. F. 1985. Clustering to minimize the maximum intercluster distance. Theoret. Comput. Sci. 38, 293--306.]]Google ScholarCross Ref
Jagadish, H., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K. C., and Suel, T. 1998. Optimal histograms with quality guarantees. In Proceedings of the 24th International Conference on Very Large Data Bases.]] Google ScholarDigital Library
Jagadish, H. V., Al-Khalifa, S., Chapman, A., Lakshmanan, L. V. S., Nierman, A., Paparizos, S., Patel, J. M., Srivastava, D., Wiwatwattana, N., Wu, Y., and Yu, C. 2002. TIMBER: A native XML database. Internat. J. Very Large Data Bases 11, 4, 274--291.]] Google ScholarDigital Library
Kaushik, R., Bohannon, P., Naughton, J. F., and Korth, H. F. 2002a. Covering indexes for branching path queries. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. 133--144.]] Google ScholarDigital Library
Kaushik, R., Shenoy, P., Bohannon, P., and Gudes, E. 2002b. Exploiting local similarity for efficient indexing of paths in graph structured data. In Proceedings of the 18th International Conference on Data Engineering. 129--140.]] Google ScholarDigital Library
Lim, L., Wang, M., Padmanabhan, S., Vitter, J., and Parr, R. 2002. XPathLearner: An on-line self-tuning markov histogram for XML path selectivity estimation. In Proceedings of the 28th International Conference on Very Large Data Bases. 442--453.]]Google Scholar
Megiddo, N. and Supowit, K. J. 1984. On the complexity of some common geometric location problems. SIAM J. Comput. 13, 1, 182--196.]]Google ScholarCross Ref
Milo, T. and Suciu, D. 1999. Index structures for path expressions. In Proceedings of the 7th International Conference on Database Theory (ICDT'99). 277--295.]] Google ScholarDigital Library
Naughton, J. F., DeWitt, D. J., and et al 2001. The niagara internet query system. IEEE Data Eng. Bull. 24, 2.]]Google Scholar
Paige, R. and Tarjan, R. E. 1987. Three partition refinement algorithms. SIAM J. Comput. 16, 6 (Dec.), 973--989.]] Google ScholarDigital Library
Pearl, J. 1988. Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann, San Francisco, CA.]] Google ScholarDigital Library
Polyzotis, N. and Garofalakis, M. 2006. Xcluster synopses for structured xml content. In Proceedings of the 22nd International Conference on Data Engineering.]] Google ScholarDigital Library
Polyzotis, N., Garofalakis, M., and Ioannidis, Y. 2004a. Approximate XML query answers. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data. 263--274.]] Google ScholarDigital Library
Polyzotis, N., Garofalakis, M., and Ioannidis, Y. 2004b. Selectivity estimation for XML twigs. In Proceedings of the 20th International Conference on Data Engineering. 264--275.]] Google ScholarDigital Library
Poosala, V. and Ioannidis, Y. E. 1997. Selectivity estimation without the attribute value independence assumption. In Proceedings of the 23rd International Conference on Very Large Data Bases. 486--495.]] Google ScholarDigital Library
Poosala, V., Ioannidis, Y. E., Haas, P. J., and Shekita, E. J. 1996. Improved histograms for selectivity estimation of range predicates. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data.]] Google ScholarDigital Library
Ramanath, M., Zhang, L., Freire, J., and Haritsa, J. 2005. IMAX: The big picture of dynamic XML statistics. In Proceedings of the 21st International Conference on Data Engineering. 273--284.]] Google ScholarDigital Library
Vitter, J. S. and Wang, M. 1999. Approximate computation of multidimensional aggregates of sparse data using wavelets. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. 193--204.]] Google ScholarDigital Library
Wang, W., Jiang, H., Lu, H., and Yu, J. X. 2003. Containment join size estimation: Models and methods. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. 145--156.]] Google ScholarDigital Library
Wang, W., Jiang, H., Lu, H., and Yu, J. X. 2004. Bloom histogram: Path selectivity estimation for XML data with updates. In Proceedings of the 30th International Conference on Very Large Data Bases. 240--251.]]Google Scholar
Wu, Y., Patel, J. M., and Jagadish, H. 2002. Estimating answer sizes for XML queries. In Proceedings of the 8th International Conference on Extending Database Technology (EDBT '02). 590--608.]] Google ScholarDigital Library
Wu, Y., Patel, J. M., and Jagadish, H. 2003. Structural join order selection for XML query optimization. In Proceedings of the 19th International Conference on Data Engineering. 443--454.]]Google Scholar
Zhang, N., Ozsu, M. T., Aboulnaga, A., and Ilyas, I. 2006. Xseed: Accurate and fast cardinality estimation for xpath queries. In Proceedings of the 22nd International Conference on Data Engineering.]] Google ScholarDigital Library

Index Terms

XSKETCH synopses for XML data graphs

Recommendations

DVQ: Towards Visual Query Processing of XML Database Systems

XML has been recognized as a promising language for data exchange over the Internet. A number of query languages have been proposed for querying XML data. Most of those languages are path-expression based. One difficulty in forming path-expression based ...
Read More
Mapping of bibliographical standards into XML

The most popular bibliographical standards, which prescribe the exchange of bibliographical data in machine readable form, are MARC (Machine Readable Cataloguing) and UNIMARC (Universal Machine Readable Cataloguing). This paper presents two schemas, ...
Read More
TuG synopses for approximate query answering

This article introduces the Tuple Graph (TuG) synopses, a new class of data summaries that enable accurate approximate answers for complex relational queries. The proposed summarization framework adopts a “semi-structured” view of the relational ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Database Systems Volume 31, Issue 3
September 2006
400 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/1166074
Issue’s Table of Contents

Copyright © 2006 ACM
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 September 2006
Published in tods Volume 31, Issue 3

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
XML
approximate query processing
data synopses
path expressions
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 58
  Total Citations
  View Citations
- 843
  Total Downloads
- Downloads (Last 12 months)3
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

XSKETCH synopses for XML data graphs

ACM Transactions on Database Systems

Abstract

Supplemental Material

Available for Download

References

Cited By

Index Terms

Recommendations

DVQ: Towards Visual Query Processing of XML Database Systems

Mapping of bibliographical standards into XML

TuG synopses for approximate query answering

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

XSKETCH synopses for XML data graphs

ACM Transactions on Database Systems

Abstract

Supplemental Material

Available for Download

References

Cited By

Index Terms

Recommendations

DVQ: Towards Visual Query Processing of XML Database Systems

Mapping of bibliographical standards into XML

TuG synopses for approximate query answering

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media