skip to main content
research-article

Compressing and indexing labeled trees, with applications

Published:27 November 2009Publication History
Skip Abstract Section

Abstract

Consider an ordered, static tree T where each node has a label from alphabet Σ. Tree T may be of arbitrary degree and shape. Our goal is designing a compressed storage scheme of T that supports basic navigational operations among the immediate neighbors of a node (i.e. parent, ith child, or any child with some label,…) as well as more sophisticated path-based search operations over its labeled structure.

We present a novel approach to this problem by designing what we call the XBW-transform of the tree in the spirit of the well-known Burrows-Wheeler transform for strings [1994]. The XBW-transform uses path-sorting to linearize the labeled tree T into two coordinated arrays, one capturing the structure and the other the labels. For the first time, by using the properties of the XBW-transform, our compressed indexes go beyond the information-theoretic lower bound, and support navigational and path-search operations over labeled trees within (near-)optimal time bounds and entropy-bounded space.

Our XBW-transform is simple and likely to spur new results in the theory of tree compression and indexing, as well as interesting application contexts. As an example, we use the XBW-transform to design and implement a compressed index for XML documents whose compression ratio is significantly better than the one achievable by state-of-the-art tools, and its query time performance is order of magnitudes faster.

References

  1. Adiego, J., de la Fuente, P., and Navarro, G. 2004. Merging prediction by partial matching with structural contexts model. In Proceedings of the IEEE Data Compression Conference (DCC). IEEE Computer Society Press, Los Alamitos, CA, 522. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Arion, A., Bonifati, A., Costa, G., D'Aguanno, S., Manolescu, I., and Pugliese, A. 2003. XQueC: pushing queries to compressed XML data. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB). 1065--1068. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Arroyuelo, D., Navarro, G., and Sadakane, K. 2006. Reducing the space requirement of LZ-index. In Proceedings of the 17th Combinatorial Pattern Matching Conference (CPM). Lecture Notes in Computer Science vol. 4009. Springer-Verlag, New York, 318--329. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Barbay, J., Golynski, A., Munro, J. I., and Rao, S. 2007. Adaptive searching in succinctly encoded binary relations and tree-structured documents. Theoret. Comput. Sci. 387, 3, 284--297. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Barbay, J., He, M., Munro, J., and Rao, S. 2008. Succinct indexes for string, bynary relations and multi-labeled trees. Submitted to Journal (Preliminary version appeared in Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms (SODA 07)). Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Benoit, D., Demaine, E., Munro, J. I., Raman, R., Raman, V., and Rao, S. 2005. Representing trees of higher degree. Algorithmica 43, 275--292.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Burrows, M., and Wheeler, D. 1994. A block-sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation.Google ScholarGoogle Scholar
  8. Busatto, G., Lohrey, M., and Maneth, S. 2008. Efficient memory representation of XML document trees. Inf. Syst. 33, 4-5, 456--474. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Catania, B., Maddalena, A., and Vakali, A. 2005. XML document indexes: A classification. IEEE Internet Computing, 1, 5, 64--71. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Cheney, J. 2001. Compressing XML with multiplexed hierarchical PPM models. In Proceedings of the IEEE Data Compression Conference (DCC). IEEE Computer Society Press, Los Alamitos, CA, 163--172. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Cheng, J., and Ng, W. 2004. XQzip: Querying compressed XML using structural indexing. In Proceedings of the 9th International Conference on Extending Database Technology. Lecture Notes in Computer Science, vol. 2992. Springer-Verlag, Berlin, Germany, 219--236.Google ScholarGoogle Scholar
  12. Farzan, A., and Munro, I. 2008. Succinct representations of arbitrary graphs. In Proceedings of the 16th European Symposium on Algorithms (ESA). Lecture Notes in Computer Science vol. 4009. Springer-Verlag, Berlin Germany. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Fernandez, M. F., Simeon, J., Choi, B., Marian, A., and Sur, G. 2003. Implementing Xquery 1.0: the Galax experience. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB). 1077--1080. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Ferragina, P., Giancarlo, R., and Manzini, G. 2006a. The engineering of a compression boosting library: Theory vs practice in BWT compression. In Proceedings of the 14th European Symposium on Algorithms (ESA). Lecture Notes in Computer Science, vol. 4168. Springer-Verlag, Berlin, Germany, 756--767. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Ferragina, P., Giancarlo, R., and Manzini, G. 2009. The myriad virtues of wavelet trees. Inf. Comput. 207, 849--866. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Ferragina, P., Giancarlo, R., Manzini, G., and Sciortino, M. 2005a. Boosting textual compression in optimal linear time. J. ACM 52, 688--713. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Ferragina, P., González, R., Navarro, G., and Venturini, R. 2008. Compressed text indexes: From theory to practice. ACM J. Exper. Algor. 13. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Ferragina, P., Luccio, F., Manzini, G., and Muthukrishnan, S. 2005b. Structuring labeled trees for optimal succinctness, and beyond. In Proceedings of the 46th IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society Press, Los Alamitos, CA, 184--193. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Ferragina, P., Luccio, F., Manzini, G., and Muthukrishnan, S. 2006b. Compressing and searching XML data via two zips. In Proceedings of the 15th International World Wide Web Conference (WWW). 751--760. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Ferragina, P., and Manzini, G. 2001. An experimental study of a compressed index. Information Sciences: special issue on “Dictionary Based Compression” 135, 13--28. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Ferragina, P., and Manzini, G. 2005. Indexing compressed text. J. ACM 52, 4, 552--581. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. 2007. Compressed representations of sequences and full-text indexes. ACM Trans. Algor. 3, 2. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Ferragina, P., and Navarro, G. 2007. The Pizza&Chili corpus home page. http://pizzachili.dcc.uchile.cl/or http://pizzachili.di.unipi.it/.Google ScholarGoogle Scholar
  24. Ferragina, P., and Rao, S. 2008. Tree compression and indexing. In Encyclopedia of Algorithms, Springer. 964--967.Google ScholarGoogle Scholar
  25. Ferragina, P., and Venturini, R. 2007. A simple storage scheme for strings achieving entropy bounds. Theoret. Comput. Sci. 372, 1, 115--121. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Geary, R., Raman, R., and Raman, V. 2006. Succinct ordinal trees with level-ancestor queries. ACM Trans. Algor. 2, 4, 510--534. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Goldman, R., and Widom, J. 1997. Dataguides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB). 436--445. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Golynski, A., Munro, J. I., and Rao, S. 2006. Rank/select operations on large alphabets: A tool for text indexing. In Proceedings of the 17th ACM-SIAM Symposium on Discrete Algorithms (SODA). ACM, New York, 368--373. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Gupta, A., Hon, W., Shah, R., and Vitter, J. 2007. A framework for dynamizing succinct data structures. In Proceedings of the 34th International Colloquium on Automata, Languages and Programming (ICALP). Lecture Notes in Computer Science, vol. 4596. Springer-Verlag, Berlin, Germany, 521--532. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Jacobson, G. 1989. Space-efficient static trees and graphs. In Proceedings of the 30th IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society Press, Los Alamitos, CA, 549--554. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Jansson, J., Sadakane, K., and Sung, W. 2007. Ultra-succinct representation of ordered trees. In Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms (SODA). ACM, New York, 575--584. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Kärkkäinen, J., Sanders, P., and Burkhardt, S. 2006. Linear work suffix array construction. J. ACM 53, 6, 918--936. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Karp, R., Miller, R., and Rosenberg, A. 1972. Rapid identification of repeated patterns in strings, arrays and trees. In Proceedings of the ACM Symposium on Theory of Computation. ACM, New York, 125--136. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Kaushik, R., Krishnamurthy, R., Naughton, J., and Ramakrishnan, R. 2004. On the integration of structure indexes and inverted lists. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 779--790. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Kosaraju, S. R. 1989. Efficient tree pattern matching. In Proceedings of the 20th IEEE Foundations of Computer Science (FOCS). 178--183. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Kurtz, S. 1999. Reducing the space requirement of suffix trees. Software—Practice and Experience 29, 13, 1149--1171. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Liefke, H., and Suciu, D. 2000. XMill: An efficient compressor for xml data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 153--164. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Mäkinen, V., and Navarro, G. 2006. Position-restricted substring searching. In Proceedings of the 7th Latin American Symposium on Theoretical Informatics (LATIN). Lecture Notes in Computer Science, vol. 3887. Springer-Verlag, Berlin, Germany, 703--714. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Manzini, G. 2001. An analysis of the Burrows-Wheeler transform. J. ACM 48, 3, 407--430. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Milo, T., and Suciu, D. 1999. Index structures for path expressions. In Proceedings of the 3rd International Conference on Database Theory. 277--295. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Min, J., Park, M., and Chung, C. 2003. Xpress: A queriable compression for XML data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 122--133. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Moura, E., Navarro, G., Ziviani, N., and Baeza-Yates, R. 2000. Fast and flexible word searching on compressed text. ACM Trans. Inform. Syst. 18, 2, 113--139. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Munro, J. I. 1996. Tables. In Proceeding of the 16th Conference on Foundations of Software Technology and Theoretical Computer Science. Lecture Notes in Computer Science, vol. 1180. Springer-Verlag, Berlin, Germany, 37--42. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Munro, J. I., Raman, R., Raman, V., and Rao, S. 2003. Succinct representations of permutations. In Proceedings of the 30th International Colloquium on Automata, Languages and Programming (ICALP). Lecture Notes in Computer Science, vol. 2719. Springer-Verlag, Berlin, Germany, 345--356. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Munro, J. I., and Raman, V. 1997. Succinct representation of balanced parentheses, static trees and planar graphs. In Proceedings of the 38th IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society Press, Los Alamitos, CA, 118--126. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Munro, J. I., and Raman, V. 2001. Succinct representation of balanced parentheses and static trees. SIAM J. Comput. 31, 762--776. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Munro, J. I., and Rao, S. 2004. Succinct representations of functions. In Proceedings of the 31st International Colloquium on Automata, Languages and Programming (ICALP). Lecture Notes in Computer Science, vol. 3142. Springer-Verlag, Germany, 1006--1015.Google ScholarGoogle Scholar
  48. Navarro, G., and Mäkinen, V. 2007. Compressed full-text indexes. ACM Comput. Surv. 39, 1. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Ng, W., Lam, W., Wood, P., and Levene, M. 2006. XCQ: A queriable XML compression system. Know. Inform. Syst. 10, 4, 421--452. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Raman, R., Raman, V., and Rao, S. 2007. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algor. 3, 4. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Raw, P., and Moon, B. 2004. PRIX: Indexing and querying XML using Prüfer sequences. In Proceedings of the 20th International Conference on Data Engineering (ICDE). 288--300. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Tolani, P. M., and Haritsa, J. R. 2002. XGRIND: A query-friendly XML compressor. In Proceedings of the 18th International Conference on Data Engineering (ICDE). 225--234. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Wang, H., Park, S., Fan, W., and Yu, P. S. 2003. ViST: A dynamic index methd for querying XML data by tree structures. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 110--121. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Wang, W., Wang, H., Lu, H., Jang, H., Lin, X., and Li, J. 2005. Efficient processing of XML path queries using the disk-based F&B index. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB). 145--156. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Witten, I. H., Moffat, A., and Bell, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images, Second ed. Morgan Kaufmann, Los Altos, CA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Ye, Z., and Berger, T. 1998. Information measures for discrete random fields. Science Press.Google ScholarGoogle Scholar

Index Terms

  1. Compressing and indexing labeled trees, with applications

                      Recommendations

                      Comments

                      Login options

                      Check if you have access through your login credentials or your institution to get full access on this article.

                      Sign in

                      Full Access

                      • Published in

                        cover image Journal of the ACM
                        Journal of the ACM  Volume 57, Issue 1
                        November 2009
                        149 pages
                        ISSN:0004-5411
                        EISSN:1557-735X
                        DOI:10.1145/1613676
                        Issue’s Table of Contents

                        Copyright © 2009 ACM

                        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                        Publisher

                        Association for Computing Machinery

                        New York, NY, United States

                        Publication History

                        • Published: 27 November 2009
                        • Accepted: 1 July 2009
                        • Revised: 1 September 2008
                        • Received: 1 March 2007
                        Published in jacm Volume 57, Issue 1

                        Permissions

                        Request permissions about this article.

                        Request Permissions

                        Check for updates

                        Qualifiers

                        • research-article
                        • Research
                        • Refereed

                      PDF Format

                      View or Download as a PDF file.

                      PDF

                      eReader

                      View online with eReader.

                      eReader