Abstract
XML compression has gained prominence recently because it counters the disadvantage of the verbose representation XML gives to data. In many applications, such as data exchange and data archiving, entirely compressing and decompressing a document is acceptable. In other applications, where queries must be run over compressed documents, compression may not be beneficial since the performance penalty in running the query processor over compressed data outweighs the data compression benefits. While balancing the interests of compression and query processing has received significant attention in the domain of relational databases, these results do not immediately translate to XML data.
In this article, we address the problem of embedding compression into XML databases without degrading query performance. Since the setting is rather different from relational databases, the choice of compression granularity and compression algorithms must be revisited. Query execution in the compressed domain must also be rethought in the framework of XML query processing due to the richer structure of XML data. Indeed, a proper storage design for the compressed data plays a crucial role here.
The XQueC system (XQuery Processor and Compressor) covers a wide set of XQuery queries in the compressed domain and relies on a workload-based cost model to perform the choices of the compression granules and of their corresponding compression algorithms. As a consequence, XQueC provides efficient query processing on compressed XML data. An extensive experimental assessment is presented, showing the effectiveness of the cost model, the compression ratios, and the query execution times.
- Al-Khalifa, S., Jagadish, H., Patel, J., Wu, Y., Koudas, N., and Srivastava, D. 2002. Structural joins: A primitive for efficient XML query pattern matching. In Proceedings of the 18th International Conference on Data Engineering. IEEE, 141--152. Google ScholarDigital Library
- Amer-Yahia, S. and Johnson, T. 2000. Optimizing queries on compressed bitmaps. In Proceedings of 26th International Conference on Very Large Data Bases. ACM, 329--338. Google ScholarDigital Library
- Antoshenkov, G. 1997. Dictionary-based order-preserving string compression. VLDB J. 6, 1, 26--39. Google ScholarDigital Library
- Antoshenkov, G., Lomet, D., and Murray, J. 1996. Order preserving string compression. In Proceedings of the 12th International Conference on Data Engineering. IEEE, 655--663. Google ScholarDigital Library
- APA 2004. Apache custom log format. http://www.apache.org/docs/mod/mod_log_config.html.Google Scholar
- Arion, A., Benzaken, V., Manolescu, I., Papakonstantinou, Y., and Vijay, R. 2006. Algebra-based identification of tree patterns in XQuery. In Proceedings of the International Conference on Flexible Query Answering Systems. 13--25. Google ScholarDigital Library
- Arion, A., Bonifati, A., Costa, G., D'Aguanno, S., Manolescu, I., and Pugliese, A. 2004. Efficient query evaluation over compressed XML data. In Proceedings of the International Conference on Extending Database Technologies. Heraklion, Grece, 200--218.Google Scholar
- Arion, A., Bonifati, A., Manolescu, I., and Pugliese, A. 2006. Path summaries and path partitioning in modern XML databases. In Proceedings of the International World Wide Web Conference. 1077--1078. Google ScholarDigital Library
- BER 2003. Berkeley DB Data Store. http://www.sleepycat.com/products/data.shtml.Google Scholar
- Bohannon, P., Freire, J., Roy, P., and Simeon, J. 2002. From XML schema to relations: A cost-based approach to XML storage. In Proceedings of the 18th International Conference on Data Engineering. IEEE, 64--76. Google ScholarDigital Library
- Buneman, P., Grohe, M., and Koch, C. 2003. Path queries on compressed XML. In Proceedings of 29th International Conference on Very Large Data Bases. Morgan Kaufmann, 141--152.Google Scholar
- Busatto, G., Lohrey, M., and Maneth, S. 2005. Efficient memory representation of XML documents. In Proceedings of the 10th International Symposium on Database Programming Languages. Trondheim, Norway, 199--216. Google ScholarDigital Library
- BZIP2 2002. The bzip2 and libbzip2 Official Home Page. http://sources.redhat.com/bzip2/.Google Scholar
- Chen, Z., Gehrke, J., and Korn, F. 2000. Query optimization in compressed database systems. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. ACM, 271--282. Google ScholarDigital Library
- Chen, Z., Jagadish, H., Lakshmanan, L., and Paparizos, S. 2003. From tree patterns to generalized tree patterns: On efficient evaluation of XQuery. In Proceedings of 29th International Conference on Very Large Data Bases. Morgan Kaufmann, 237--248.Google Scholar
- Cheney, J. 2001. Compressing XML with multiplexed hierarchical PPM models. In Data Compression Conference. IEEE Computer Society, 163--172. Google ScholarDigital Library
- Cheney, J. 2005. An empirical evaluation of simple DTD-conscious compression techniques. In WebDB. 43--48.Google Scholar
- Cheng, J. and Ng, W. 2004. XQzip: Querying compressed XML using structural indexing. In Proceedings of the International Conference on Extending Database Technologies. Heraklion, Greece, 219--236.Google Scholar
- Fiebig, T., Helmer, S., Kanne, C., Moerkotte, G., Neumann, J., Schiele, R., and Westmann, T. 2002. Anatomy of a native XML base management system. VLDB J. 11, 4, 292--314. Google ScholarDigital Library
- Galax 2006. Galax: An implementation of XQuery. Available at www.galaxquery.org.Google Scholar
- Goldman, R. and Widom, J. 1997. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of 23rd International Conference on Very Large Data Bases. Morgan Kaufman, 436--445. Google ScholarDigital Library
- Goldstein, J., Ramakrishnan, R., and Shaft, U. 1998. Compressing relations and indexes. In Proceedings of the 14th International Conference on Data Engineering. IEEE, 370--379. Google ScholarDigital Library
- Graefe, G. 1993. Query evaluation techniques for large databases. ACM Compu. Surv. 25, 2, 73--170. Google ScholarDigital Library
- Greer, R. 1999. Daytona and the fourth-generation language Cymbal. In Proceedings ACM SIGMOD International Conference on Management of Data. ACM, 525--526. Google ScholarDigital Library
- Grust, T. 2002. Accelerating XPath location steps. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. ACM, 109--120. Google ScholarDigital Library
- Halverson, A., Burger, J., Galanis, L., Kini, A., Krishnamurthy, R., Rao, A., Tian, F., Viglas, S., Wang, Y., Naughton, J., and DeWitt, D. 2003. Mixed mode XML query processing. In Proceedings of 29th International Conference on Very Large Data Bases. Morgan Kaufmann, 225--236.Google ScholarCross Ref
- Hu, T. C. and Tucker, A. C. 1971. Optimal computer search trees and variable-length alphabetical codes. SIAM J. of App. Mathe. 21, 4, 514--532.Google ScholarCross Ref
- Huffman, D. A. 1952. A method for construction of minimum-redundancy codes. In Proceedings of the IRE. 1098--1101.Google ScholarCross Ref
- IBIBLIO 2004. Ibiblio.org web site. Available at www.ibiblio.org/xml/books/biblegold/examples/baseball/.Google Scholar
- INEX 2004. INitiative for the Evaluation of XML retrieval. inex.is.informatik.uni-duisburg.de: 2004.Google Scholar
- Jagadish, H. V., Al-Khalifa, S., Chapman, A., Lakshmanan, L. V., Nierman, A., Paparizos, S., Patel, J., Srivastava, D., Wiwatwattana, N., Wu, Y., and Yu., C. 2002. Timber: A native XML database. VLDB J. 11, 4, 274--291. Google ScholarDigital Library
- Jagadish, H. V., Ng, R., Ooi, B. C., and Tung, A. K. H. 2004. ItCompress: An iterative semantic compression algorithm. In Proceedings of the International Conference on Data Engineering. IEEE Computer Society, 646--658. Google ScholarDigital Library
- Jain, A. K., Murty, M. N., and Flynn, P. J. 1999. Data clustering: A review. ACM Comput. Surv. 31, 3, 264--323. Google ScholarDigital Library
- Liefke, H. and Suciu, D. 2000. XMILL: An efficient compressor for XML Data. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. ACM, 153--164. Google ScholarDigital Library
- Miklau, G. and Suciu, D. 2002. Containment and equivalence for an XPath fragment. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Conference on the Principles of Database Systems. 65--76. Google ScholarDigital Library
- Milo, T. and Suciu, D. 1999. Index structures for path expressions. In Proceedings of the International Conference on Database Theory (ICDT). 277--295. Google ScholarDigital Library
- Min, J. K., Park, M., and Chung, C. 2003. XPRESS: A queriable compression for XML data. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. ACM, 122--133. Google ScholarDigital Library
- Min, J. K., Park, M., and Chung, C. 2006. A compressor for effective archiving, retrieval, and update of XML documents. ACM Trans. Intern. Techn. 6, 3. Google ScholarDigital Library
- Moffat, A. and Zobel, J. 1992. Coding for compression in full-text retrieval systems. In Proceedings of the Data Compression Conference (DCC). 72--81.Google Scholar
- Moura, E. D., Navarro, G., Ziviani, N., and Baeza-Yates, R. 2000. Fast and flexible word searching on compressed text. ACM Trans. Inform. Syst. 18, 2 (April), 113--139. Google ScholarDigital Library
- Ng, W., Lam, Y. W., and Cheng, J. 2006. Comparative analysis of XML compression technologies. WWW J. 9, 1, 5--33. Google ScholarDigital Library
- Ng, W., Lam, Y. W., Wood, P., and Levene, M. 2006. XCQ: A queriable XML compression system. Inter. J. Knowl. Inform. Syst. To appear. Google ScholarDigital Library
- Paparizos, S., Al-Khalifa, S., Chapman, A., Jagadish, H. V., Lakshmanan, L. V. S., Nierman, A., Patel, J. M., Srivastava, D., Wiwatwattana, N., Wu, Y., and Yu, C. 2003. TIMBER: A native system for querying XML. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. ACM, 672. Google ScholarDigital Library
- Poess, M. and Potapov, D. 2003. Data compression in Oracle. In Proceedings of 29th International Conference on Very Large Data Bases. Morgan Kaufmann, 937--947.Google Scholar
- Roy, P., Seshadri, S., Sudarshan, S., and Bhobe, S. 2000. Efficient and extensible algorithms for multi query optimization. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. Dallas, TX. 249--260. Google ScholarDigital Library
- Schmidt, A., Waas, F., Kersten, M., Carey, M., Manolescu, I., and Busse, R. 2002. XMark: A benchmark for XML data management. In Proceedings of 28th International Conference on Very Large Data Bases. Morgan Kaufmann, 974--985.Google Scholar
- Tolani, P. and Haritsa, J. 2002. XGRIND: A Query-friendly XML Compressor. In Proceedings of the 18th International Conference on Data Engineering. IEEE, 225--235. Google ScholarDigital Library
- Transaction Processing Performance Council. 1999. TPC-H benchmark database. http://www.tcp.org.Google Scholar
- UWXML 2004. University of Washington's XML repository. Available at www.cs.washington.edu/research/xmldatasets.Google Scholar
- Westmann, T., Kossmann, D., Helmer, S., and Moerkotte, G. 2000. The implementation and performance of compressed databases. ACM SIGMOD Rec. 29, 3, 55--67. Google ScholarDigital Library
- Witten, I. H. 1987. Arithmetic coding For data compression. Comm. ACM, 857--865. Google ScholarDigital Library
- XMLZIP 1999. XMLZip XML compressor. Available at http://www.xmls.com/products/xmlzip/xmlzip.html.Google Scholar
- XQUE 2004. The XML query language. http://www.w3.org/XML/Query.Google Scholar
Index Terms
- XQueC: A query-conscious compressed XML database
Recommendations
TraCX: transformation of compressed XML
BNCOD'11: Proceedings of the 28th British national conference on Advances in databasesWhile the volume of XML data and sometimes even the processing time of XML data can be reduced by using XML compression and storing, processing, and transferring compressed XML instead of uncompressed XML, a transformation of the transferred XML data ...
Optimized XPath evaluation for schema-compressed XML data
ADC '12: Proceedings of the Twenty-Third Australasian Database Conference - Volume 124XML has become the de facto standard for data exchange in enterprise information systems. But whenever XML data is stored or processed, e. g. in form of a DOM tree representation, the XML markup causes a huge blow-up of the memory consumption compared ...
Column-oriented Database Systems and XML Compression
DATA 2014: Proceedings of 3rd International Conference on Data Management Technologies and ApplicationsWith the renewed industrial and academic interest in Column-Oriented Database Management Systems, a lot of interest has been shown in the area of software optimizations designed to improve the efficiency of queries in the Column-Oriented domain. ...
Comments