Article

StatiX: making XML count

Authors:
Juliana Freire

Bell Labs

Bell Labs
View Profile

,
Jayant R. Haritsa

lndian Institute of Science

lndian Institute of Science
View Profile

,
Maya Ramanath

lndian Institute of Science

lndian Institute of Science
View Profile

,
Prasan Roy

Bell Labs

Bell Labs
View Profile

,
Jérôme Siméon

Bell Labs

Bell Labs
View Profile

SIGMOD '02: Proceedings of the 2002 ACM SIGMOD international conference on Management of dataJune 2002Pages 181–191https://doi.org/10.1145/564691.564713

Published:03 June 2002Publication History

SIGMOD '02: Proceedings of the 2002 ACM SIGMOD international conference on Management of data

Pages 181–191

ABSTRACT

The availability of summary data for XML documents has many applications, from providing users with quick feedback about their queries, to cost-based storage design and query optimization. StatiX is a novel XML Schema-aware statistics framework that exploits the structure derived by regular expressions (which define elements in an XML Schema) to pinpoint places in the schema that are likely sources of structural skew. As we discuss below, this information can be used to build concise, yet accurate, statistical summaries for XML data. StatiX leverages standard XML technology for gathering statistics, notably XML Schema validators, and it uses histograms to summarize both the structure and values in an XML document. In this paper we describe the StatiX system. We develop algorithms that decompose schemas to obtain statistics at different granularities and discuss how statistics can be gathered as documents are validated. We also present an experimental evaluation which demonstrates the accuracy and scalability of our approach and show an application of these statistics to cost-based XML storage design.

References

A. Aboulnaga, A. R. Alameldeen, and J. Naughton. Estimating the selectivity of XML path expressions for Internet scale applications. In Proceedings of VLDB, pages 591-600, 2001. Google ScholarDigital Library
P. Bohannon, J. Freire, P. Roy, and J. Siméon. From XML schema to relations: A cost-based approach to XML storage. In Proceedings of lCDE, pages 64-75, 2002. Google ScholarDigital Library
A. Brüggemann-Klein. Regular expressions into finite automata. TCS, 120(2):197-213, 1993. Google ScholarDigital Library
A. Brüggemann-Klein and D. Wood. One-unambiguous regular languages. Information and Computation, 140(2):229-253, 1998. Google ScholarDigital Library
D. Chambelin, J. Clark, D. Florescu, Jonathan Robie, J. Siméon, and M. Stefanescu. XQuery 1.0: An XML query language. W3C Working Draft, June 2001.Google Scholar
Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. T. Ng, and D. Srivastava. Counting twig matches in a tree. In Proceedings of lCDE, pages 595-604, 2001. Google ScholarDigital Library
A. Deutsch, M. Fernandez, and D. Suciu. Storing semi-structured data with STORED. In Proceedings of SIGMOD, pages 431-442, 1999. Google ScholarDigital Library
P. Fankhauser, M. Fernandez, A. Malhotra, M. Rys, J. Siméon, and P. Wadler. The XML query algebra, February 2001. http://www.w3.org/TR/2001/WD-query-algebra-20010215.Google Scholar
Galax system, October 2001. http://db.bell-labs.com/galax/.Google Scholar
G. Graefe and W. McKenna. The volcano optimizer generator: Extensibility and efficient search. In Proceedings of lCDE, pages 209-218, 1993. Google ScholarDigital Library
Internet Movie Database. http://www.imdb.com.Google Scholar
Y. Ioannidis and S. Christodoulakis. Optimal histograms for limiting worst-case error propagation in the size of join results. ACM TODS, 18(4):709-748, 1993. Google ScholarDigital Library
A. Kemper and G. Moerkotte. Advanced query processing in object bases using access support relations. In Proceedings of VLDB, pages 290-301, 1990. Google ScholarDigital Library
J. McHugh and J. Widom. Query optimization for XML. In Proceedings of VLDB, pages 315-326, 1999. Google ScholarDigital Library
XML query language (xql). http://www.oasis-open.org, 2001.Google Scholar
G. Piatetsky-Shapiro and C. Connell. Accurate estimation of the number of tuples satisfying a condition. In Proceedings of SIGMOD, pages 256-276, 1984. Google ScholarDigital Library
N. Polyzotis and M. Garofalakis. Statistical synopses for graph structured XML databases. In Proceedings of SIGMOD, 2002. Google ScholarDigital Library
V. Poosala and Y. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Proceedings of VLDB, pages 486-495, 1997. Google ScholarDigital Library
V. Poosala, Y. Ioannidis, P. Haas, and E. Shekita. Improved histograms for selectivity estimation of range predicates. In Proceedings of SIGMOD, pages 294-305, 1996. Google ScholarDigital Library
G. Rozenberg and A. Salomaa, editors. Handbook of formal languages, volume 3. Springer Verlag, 1997. Google ScholarDigital Library
P. Selinger, M. Astrahan, D. Chamberlin, R. Lorie, and T. Price. Access path selection in a relational database management system. In Proceedings of SIGMOD, pages 23-34, 1979. Google ScholarDigital Library
J. Shanmugasundaram, K. Tufle, G. He, C. Zhang, D. DeWitt, and J. Naughton. Relational databases for querying XML documents: Limitations and opportunities. In Proceedings of VLDB, pages 302-314, 1999. Google ScholarDigital Library
H. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML Schema Part 1: Structures. W3C Working Draft, February 2000.Google Scholar
Y. Wu, J. M. Patel, and H. V. Jagadish. Estimating answer sizes for xml queries. In Proceedings of EDBT, 2002. Google ScholarDigital Library
Xerces java parser 1.4.3. http://xml.apache.org/xerces-j/.Google Scholar
Xmark. http://monetdb.cwi.nl/xml.Google Scholar

Index Terms

StatiX: making XML count

Recommendations

Mapping of bibliographical standards into XML

The most popular bibliographical standards, which prescribe the exchange of bibliographical data in machine readable form, are MARC (Machine Readable Cataloguing) and UNIMARC (Universal Machine Readable Cataloguing). This paper presents two schemas, ...
Read More
XML: Visual QuickStart Guide
Read More
XML-based information mediation with MIX
SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data
The MIX mediator system, MIXm, is developed as part of the MIX Project at the San Diego Supercomputer Center, and the University of California, San Diego.¹ MIXm uses XML as the common model for data exchange. Mediator views are expressed in XMAS (XML ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '02: Proceedings of the 2002 ACM SIGMOD international conference on Management of data
June 2002
654 pages
ISBN:1581134975
DOI:10.1145/564691
Conference Chair:
Bongki Moon
University of Wisconsin - Madison
,
General Chair:
David DeWitt
University of Wisconsin - Madison
,
Program Chair:
Michael Franklin
University of California, Berkeley
Copyright © 2002 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 3 June 2002
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- Article
Conference

Acceptance Rates
SIGMOD '02 Paper Acceptance Rate42of240submissions,18%Overall Acceptance Rate785of4,003submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 89
  Total Citations
  View Citations
- 614
  Total Downloads
- Downloads (Last 12 months)1
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

StatiX: making XML count

SIGMOD '02: Proceedings of the 2002 ACM SIGMOD international conference on Management of data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Mapping of bibliographical standards into XML

XML: Visual QuickStart Guide

XML-based information mediation with MIX

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

StatiX: making XML count

SIGMOD '02: Proceedings of the 2002 ACM SIGMOD international conference on Management of data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Mapping of bibliographical standards into XML

XML: Visual QuickStart Guide

XML-based information mediation with MIX

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media