skip to main content
article

XQueC: A query-conscious compressed XML database

Published:01 May 2007Publication History
Skip Abstract Section

Abstract

XML compression has gained prominence recently because it counters the disadvantage of the verbose representation XML gives to data. In many applications, such as data exchange and data archiving, entirely compressing and decompressing a document is acceptable. In other applications, where queries must be run over compressed documents, compression may not be beneficial since the performance penalty in running the query processor over compressed data outweighs the data compression benefits. While balancing the interests of compression and query processing has received significant attention in the domain of relational databases, these results do not immediately translate to XML data.

In this article, we address the problem of embedding compression into XML databases without degrading query performance. Since the setting is rather different from relational databases, the choice of compression granularity and compression algorithms must be revisited. Query execution in the compressed domain must also be rethought in the framework of XML query processing due to the richer structure of XML data. Indeed, a proper storage design for the compressed data plays a crucial role here.

The XQueC system (XQuery Processor and Compressor) covers a wide set of XQuery queries in the compressed domain and relies on a workload-based cost model to perform the choices of the compression granules and of their corresponding compression algorithms. As a consequence, XQueC provides efficient query processing on compressed XML data. An extensive experimental assessment is presented, showing the effectiveness of the cost model, the compression ratios, and the query execution times.

References

  1. Al-Khalifa, S., Jagadish, H., Patel, J., Wu, Y., Koudas, N., and Srivastava, D. 2002. Structural joins: A primitive for efficient XML query pattern matching. In Proceedings of the 18th International Conference on Data Engineering. IEEE, 141--152. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Amer-Yahia, S. and Johnson, T. 2000. Optimizing queries on compressed bitmaps. In Proceedings of 26th International Conference on Very Large Data Bases. ACM, 329--338. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Antoshenkov, G. 1997. Dictionary-based order-preserving string compression. VLDB J. 6, 1, 26--39. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Antoshenkov, G., Lomet, D., and Murray, J. 1996. Order preserving string compression. In Proceedings of the 12th International Conference on Data Engineering. IEEE, 655--663. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. APA 2004. Apache custom log format. http://www.apache.org/docs/mod/mod_log_config.html.Google ScholarGoogle Scholar
  6. Arion, A., Benzaken, V., Manolescu, I., Papakonstantinou, Y., and Vijay, R. 2006. Algebra-based identification of tree patterns in XQuery. In Proceedings of the International Conference on Flexible Query Answering Systems. 13--25. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Arion, A., Bonifati, A., Costa, G., D'Aguanno, S., Manolescu, I., and Pugliese, A. 2004. Efficient query evaluation over compressed XML data. In Proceedings of the International Conference on Extending Database Technologies. Heraklion, Grece, 200--218.Google ScholarGoogle Scholar
  8. Arion, A., Bonifati, A., Manolescu, I., and Pugliese, A. 2006. Path summaries and path partitioning in modern XML databases. In Proceedings of the International World Wide Web Conference. 1077--1078. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. BER 2003. Berkeley DB Data Store. http://www.sleepycat.com/products/data.shtml.Google ScholarGoogle Scholar
  10. Bohannon, P., Freire, J., Roy, P., and Simeon, J. 2002. From XML schema to relations: A cost-based approach to XML storage. In Proceedings of the 18th International Conference on Data Engineering. IEEE, 64--76. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Buneman, P., Grohe, M., and Koch, C. 2003. Path queries on compressed XML. In Proceedings of 29th International Conference on Very Large Data Bases. Morgan Kaufmann, 141--152.Google ScholarGoogle Scholar
  12. Busatto, G., Lohrey, M., and Maneth, S. 2005. Efficient memory representation of XML documents. In Proceedings of the 10th International Symposium on Database Programming Languages. Trondheim, Norway, 199--216. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. BZIP2 2002. The bzip2 and libbzip2 Official Home Page. http://sources.redhat.com/bzip2/.Google ScholarGoogle Scholar
  14. Chen, Z., Gehrke, J., and Korn, F. 2000. Query optimization in compressed database systems. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. ACM, 271--282. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Chen, Z., Jagadish, H., Lakshmanan, L., and Paparizos, S. 2003. From tree patterns to generalized tree patterns: On efficient evaluation of XQuery. In Proceedings of 29th International Conference on Very Large Data Bases. Morgan Kaufmann, 237--248.Google ScholarGoogle Scholar
  16. Cheney, J. 2001. Compressing XML with multiplexed hierarchical PPM models. In Data Compression Conference. IEEE Computer Society, 163--172. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Cheney, J. 2005. An empirical evaluation of simple DTD-conscious compression techniques. In WebDB. 43--48.Google ScholarGoogle Scholar
  18. Cheng, J. and Ng, W. 2004. XQzip: Querying compressed XML using structural indexing. In Proceedings of the International Conference on Extending Database Technologies. Heraklion, Greece, 219--236.Google ScholarGoogle Scholar
  19. Fiebig, T., Helmer, S., Kanne, C., Moerkotte, G., Neumann, J., Schiele, R., and Westmann, T. 2002. Anatomy of a native XML base management system. VLDB J. 11, 4, 292--314. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Galax 2006. Galax: An implementation of XQuery. Available at www.galaxquery.org.Google ScholarGoogle Scholar
  21. Goldman, R. and Widom, J. 1997. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of 23rd International Conference on Very Large Data Bases. Morgan Kaufman, 436--445. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Goldstein, J., Ramakrishnan, R., and Shaft, U. 1998. Compressing relations and indexes. In Proceedings of the 14th International Conference on Data Engineering. IEEE, 370--379. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Graefe, G. 1993. Query evaluation techniques for large databases. ACM Compu. Surv. 25, 2, 73--170. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Greer, R. 1999. Daytona and the fourth-generation language Cymbal. In Proceedings ACM SIGMOD International Conference on Management of Data. ACM, 525--526. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Grust, T. 2002. Accelerating XPath location steps. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. ACM, 109--120. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Halverson, A., Burger, J., Galanis, L., Kini, A., Krishnamurthy, R., Rao, A., Tian, F., Viglas, S., Wang, Y., Naughton, J., and DeWitt, D. 2003. Mixed mode XML query processing. In Proceedings of 29th International Conference on Very Large Data Bases. Morgan Kaufmann, 225--236.Google ScholarGoogle ScholarCross RefCross Ref
  27. Hu, T. C. and Tucker, A. C. 1971. Optimal computer search trees and variable-length alphabetical codes. SIAM J. of App. Mathe. 21, 4, 514--532.Google ScholarGoogle ScholarCross RefCross Ref
  28. Huffman, D. A. 1952. A method for construction of minimum-redundancy codes. In Proceedings of the IRE. 1098--1101.Google ScholarGoogle ScholarCross RefCross Ref
  29. IBIBLIO 2004. Ibiblio.org web site. Available at www.ibiblio.org/xml/books/biblegold/examples/baseball/.Google ScholarGoogle Scholar
  30. INEX 2004. INitiative for the Evaluation of XML retrieval. inex.is.informatik.uni-duisburg.de: 2004.Google ScholarGoogle Scholar
  31. Jagadish, H. V., Al-Khalifa, S., Chapman, A., Lakshmanan, L. V., Nierman, A., Paparizos, S., Patel, J., Srivastava, D., Wiwatwattana, N., Wu, Y., and Yu., C. 2002. Timber: A native XML database. VLDB J. 11, 4, 274--291. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Jagadish, H. V., Ng, R., Ooi, B. C., and Tung, A. K. H. 2004. ItCompress: An iterative semantic compression algorithm. In Proceedings of the International Conference on Data Engineering. IEEE Computer Society, 646--658. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Jain, A. K., Murty, M. N., and Flynn, P. J. 1999. Data clustering: A review. ACM Comput. Surv. 31, 3, 264--323. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Liefke, H. and Suciu, D. 2000. XMILL: An efficient compressor for XML Data. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. ACM, 153--164. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Miklau, G. and Suciu, D. 2002. Containment and equivalence for an XPath fragment. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Conference on the Principles of Database Systems. 65--76. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Milo, T. and Suciu, D. 1999. Index structures for path expressions. In Proceedings of the International Conference on Database Theory (ICDT). 277--295. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Min, J. K., Park, M., and Chung, C. 2003. XPRESS: A queriable compression for XML data. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. ACM, 122--133. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Min, J. K., Park, M., and Chung, C. 2006. A compressor for effective archiving, retrieval, and update of XML documents. ACM Trans. Intern. Techn. 6, 3. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Moffat, A. and Zobel, J. 1992. Coding for compression in full-text retrieval systems. In Proceedings of the Data Compression Conference (DCC). 72--81.Google ScholarGoogle Scholar
  40. Moura, E. D., Navarro, G., Ziviani, N., and Baeza-Yates, R. 2000. Fast and flexible word searching on compressed text. ACM Trans. Inform. Syst. 18, 2 (April), 113--139. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Ng, W., Lam, Y. W., and Cheng, J. 2006. Comparative analysis of XML compression technologies. WWW J. 9, 1, 5--33. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Ng, W., Lam, Y. W., Wood, P., and Levene, M. 2006. XCQ: A queriable XML compression system. Inter. J. Knowl. Inform. Syst. To appear. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Paparizos, S., Al-Khalifa, S., Chapman, A., Jagadish, H. V., Lakshmanan, L. V. S., Nierman, A., Patel, J. M., Srivastava, D., Wiwatwattana, N., Wu, Y., and Yu, C. 2003. TIMBER: A native system for querying XML. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. ACM, 672. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Poess, M. and Potapov, D. 2003. Data compression in Oracle. In Proceedings of 29th International Conference on Very Large Data Bases. Morgan Kaufmann, 937--947.Google ScholarGoogle Scholar
  45. Roy, P., Seshadri, S., Sudarshan, S., and Bhobe, S. 2000. Efficient and extensible algorithms for multi query optimization. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. Dallas, TX. 249--260. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Schmidt, A., Waas, F., Kersten, M., Carey, M., Manolescu, I., and Busse, R. 2002. XMark: A benchmark for XML data management. In Proceedings of 28th International Conference on Very Large Data Bases. Morgan Kaufmann, 974--985.Google ScholarGoogle Scholar
  47. Tolani, P. and Haritsa, J. 2002. XGRIND: A Query-friendly XML Compressor. In Proceedings of the 18th International Conference on Data Engineering. IEEE, 225--235. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Transaction Processing Performance Council. 1999. TPC-H benchmark database. http://www.tcp.org.Google ScholarGoogle Scholar
  49. UWXML 2004. University of Washington's XML repository. Available at www.cs.washington.edu/research/xmldatasets.Google ScholarGoogle Scholar
  50. Westmann, T., Kossmann, D., Helmer, S., and Moerkotte, G. 2000. The implementation and performance of compressed databases. ACM SIGMOD Rec. 29, 3, 55--67. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Witten, I. H. 1987. Arithmetic coding For data compression. Comm. ACM, 857--865. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. XMLZIP 1999. XMLZip XML compressor. Available at http://www.xmls.com/products/xmlzip/xmlzip.html.Google ScholarGoogle Scholar
  53. XQUE 2004. The XML query language. http://www.w3.org/XML/Query.Google ScholarGoogle Scholar

Index Terms

  1. XQueC: A query-conscious compressed XML database

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader