Abstract
Consider an ordered, static tree T where each node has a label from alphabet Σ. Tree T may be of arbitrary degree and shape. Our goal is designing a compressed storage scheme of T that supports basic navigational operations among the immediate neighbors of a node (i.e. parent, ith child, or any child with some label,…) as well as more sophisticated path-based search operations over its labeled structure.
We present a novel approach to this problem by designing what we call the XBW-transform of the tree in the spirit of the well-known Burrows-Wheeler transform for strings [1994]. The XBW-transform uses path-sorting to linearize the labeled tree T into two coordinated arrays, one capturing the structure and the other the labels. For the first time, by using the properties of the XBW-transform, our compressed indexes go beyond the information-theoretic lower bound, and support navigational and path-search operations over labeled trees within (near-)optimal time bounds and entropy-bounded space.
Our XBW-transform is simple and likely to spur new results in the theory of tree compression and indexing, as well as interesting application contexts. As an example, we use the XBW-transform to design and implement a compressed index for XML documents whose compression ratio is significantly better than the one achievable by state-of-the-art tools, and its query time performance is order of magnitudes faster.
- Adiego, J., de la Fuente, P., and Navarro, G. 2004. Merging prediction by partial matching with structural contexts model. In Proceedings of the IEEE Data Compression Conference (DCC). IEEE Computer Society Press, Los Alamitos, CA, 522. Google ScholarDigital Library
- Arion, A., Bonifati, A., Costa, G., D'Aguanno, S., Manolescu, I., and Pugliese, A. 2003. XQueC: pushing queries to compressed XML data. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB). 1065--1068. Google ScholarDigital Library
- Arroyuelo, D., Navarro, G., and Sadakane, K. 2006. Reducing the space requirement of LZ-index. In Proceedings of the 17th Combinatorial Pattern Matching Conference (CPM). Lecture Notes in Computer Science vol. 4009. Springer-Verlag, New York, 318--329. Google ScholarDigital Library
- Barbay, J., Golynski, A., Munro, J. I., and Rao, S. 2007. Adaptive searching in succinctly encoded binary relations and tree-structured documents. Theoret. Comput. Sci. 387, 3, 284--297. Google ScholarDigital Library
- Barbay, J., He, M., Munro, J., and Rao, S. 2008. Succinct indexes for string, bynary relations and multi-labeled trees. Submitted to Journal (Preliminary version appeared in Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms (SODA 07)). Google ScholarDigital Library
- Benoit, D., Demaine, E., Munro, J. I., Raman, R., Raman, V., and Rao, S. 2005. Representing trees of higher degree. Algorithmica 43, 275--292.Google ScholarDigital Library
- Burrows, M., and Wheeler, D. 1994. A block-sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation.Google Scholar
- Busatto, G., Lohrey, M., and Maneth, S. 2008. Efficient memory representation of XML document trees. Inf. Syst. 33, 4-5, 456--474. Google ScholarDigital Library
- Catania, B., Maddalena, A., and Vakali, A. 2005. XML document indexes: A classification. IEEE Internet Computing, 1, 5, 64--71. Google ScholarDigital Library
- Cheney, J. 2001. Compressing XML with multiplexed hierarchical PPM models. In Proceedings of the IEEE Data Compression Conference (DCC). IEEE Computer Society Press, Los Alamitos, CA, 163--172. Google ScholarDigital Library
- Cheng, J., and Ng, W. 2004. XQzip: Querying compressed XML using structural indexing. In Proceedings of the 9th International Conference on Extending Database Technology. Lecture Notes in Computer Science, vol. 2992. Springer-Verlag, Berlin, Germany, 219--236.Google Scholar
- Farzan, A., and Munro, I. 2008. Succinct representations of arbitrary graphs. In Proceedings of the 16th European Symposium on Algorithms (ESA). Lecture Notes in Computer Science vol. 4009. Springer-Verlag, Berlin Germany. Google ScholarDigital Library
- Fernandez, M. F., Simeon, J., Choi, B., Marian, A., and Sur, G. 2003. Implementing Xquery 1.0: the Galax experience. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB). 1077--1080. Google ScholarDigital Library
- Ferragina, P., Giancarlo, R., and Manzini, G. 2006a. The engineering of a compression boosting library: Theory vs practice in BWT compression. In Proceedings of the 14th European Symposium on Algorithms (ESA). Lecture Notes in Computer Science, vol. 4168. Springer-Verlag, Berlin, Germany, 756--767. Google ScholarDigital Library
- Ferragina, P., Giancarlo, R., and Manzini, G. 2009. The myriad virtues of wavelet trees. Inf. Comput. 207, 849--866. Google ScholarDigital Library
- Ferragina, P., Giancarlo, R., Manzini, G., and Sciortino, M. 2005a. Boosting textual compression in optimal linear time. J. ACM 52, 688--713. Google ScholarDigital Library
- Ferragina, P., González, R., Navarro, G., and Venturini, R. 2008. Compressed text indexes: From theory to practice. ACM J. Exper. Algor. 13. Google ScholarDigital Library
- Ferragina, P., Luccio, F., Manzini, G., and Muthukrishnan, S. 2005b. Structuring labeled trees for optimal succinctness, and beyond. In Proceedings of the 46th IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society Press, Los Alamitos, CA, 184--193. Google ScholarDigital Library
- Ferragina, P., Luccio, F., Manzini, G., and Muthukrishnan, S. 2006b. Compressing and searching XML data via two zips. In Proceedings of the 15th International World Wide Web Conference (WWW). 751--760. Google ScholarDigital Library
- Ferragina, P., and Manzini, G. 2001. An experimental study of a compressed index. Information Sciences: special issue on “Dictionary Based Compression” 135, 13--28. Google ScholarDigital Library
- Ferragina, P., and Manzini, G. 2005. Indexing compressed text. J. ACM 52, 4, 552--581. Google ScholarDigital Library
- Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. 2007. Compressed representations of sequences and full-text indexes. ACM Trans. Algor. 3, 2. Google ScholarDigital Library
- Ferragina, P., and Navarro, G. 2007. The Pizza&Chili corpus home page. http://pizzachili.dcc.uchile.cl/or http://pizzachili.di.unipi.it/.Google Scholar
- Ferragina, P., and Rao, S. 2008. Tree compression and indexing. In Encyclopedia of Algorithms, Springer. 964--967.Google Scholar
- Ferragina, P., and Venturini, R. 2007. A simple storage scheme for strings achieving entropy bounds. Theoret. Comput. Sci. 372, 1, 115--121. Google ScholarDigital Library
- Geary, R., Raman, R., and Raman, V. 2006. Succinct ordinal trees with level-ancestor queries. ACM Trans. Algor. 2, 4, 510--534. Google ScholarDigital Library
- Goldman, R., and Widom, J. 1997. Dataguides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB). 436--445. Google ScholarDigital Library
- Golynski, A., Munro, J. I., and Rao, S. 2006. Rank/select operations on large alphabets: A tool for text indexing. In Proceedings of the 17th ACM-SIAM Symposium on Discrete Algorithms (SODA). ACM, New York, 368--373. Google ScholarDigital Library
- Gupta, A., Hon, W., Shah, R., and Vitter, J. 2007. A framework for dynamizing succinct data structures. In Proceedings of the 34th International Colloquium on Automata, Languages and Programming (ICALP). Lecture Notes in Computer Science, vol. 4596. Springer-Verlag, Berlin, Germany, 521--532. Google ScholarDigital Library
- Jacobson, G. 1989. Space-efficient static trees and graphs. In Proceedings of the 30th IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society Press, Los Alamitos, CA, 549--554. Google ScholarDigital Library
- Jansson, J., Sadakane, K., and Sung, W. 2007. Ultra-succinct representation of ordered trees. In Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms (SODA). ACM, New York, 575--584. Google ScholarDigital Library
- Kärkkäinen, J., Sanders, P., and Burkhardt, S. 2006. Linear work suffix array construction. J. ACM 53, 6, 918--936. Google ScholarDigital Library
- Karp, R., Miller, R., and Rosenberg, A. 1972. Rapid identification of repeated patterns in strings, arrays and trees. In Proceedings of the ACM Symposium on Theory of Computation. ACM, New York, 125--136. Google ScholarDigital Library
- Kaushik, R., Krishnamurthy, R., Naughton, J., and Ramakrishnan, R. 2004. On the integration of structure indexes and inverted lists. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 779--790. Google ScholarDigital Library
- Kosaraju, S. R. 1989. Efficient tree pattern matching. In Proceedings of the 20th IEEE Foundations of Computer Science (FOCS). 178--183. Google ScholarDigital Library
- Kurtz, S. 1999. Reducing the space requirement of suffix trees. Software—Practice and Experience 29, 13, 1149--1171. Google ScholarDigital Library
- Liefke, H., and Suciu, D. 2000. XMill: An efficient compressor for xml data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 153--164. Google ScholarDigital Library
- Mäkinen, V., and Navarro, G. 2006. Position-restricted substring searching. In Proceedings of the 7th Latin American Symposium on Theoretical Informatics (LATIN). Lecture Notes in Computer Science, vol. 3887. Springer-Verlag, Berlin, Germany, 703--714. Google ScholarDigital Library
- Manzini, G. 2001. An analysis of the Burrows-Wheeler transform. J. ACM 48, 3, 407--430. Google ScholarDigital Library
- Milo, T., and Suciu, D. 1999. Index structures for path expressions. In Proceedings of the 3rd International Conference on Database Theory. 277--295. Google ScholarDigital Library
- Min, J., Park, M., and Chung, C. 2003. Xpress: A queriable compression for XML data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 122--133. Google ScholarDigital Library
- Moura, E., Navarro, G., Ziviani, N., and Baeza-Yates, R. 2000. Fast and flexible word searching on compressed text. ACM Trans. Inform. Syst. 18, 2, 113--139. Google ScholarDigital Library
- Munro, J. I. 1996. Tables. In Proceeding of the 16th Conference on Foundations of Software Technology and Theoretical Computer Science. Lecture Notes in Computer Science, vol. 1180. Springer-Verlag, Berlin, Germany, 37--42. Google ScholarDigital Library
- Munro, J. I., Raman, R., Raman, V., and Rao, S. 2003. Succinct representations of permutations. In Proceedings of the 30th International Colloquium on Automata, Languages and Programming (ICALP). Lecture Notes in Computer Science, vol. 2719. Springer-Verlag, Berlin, Germany, 345--356. Google ScholarDigital Library
- Munro, J. I., and Raman, V. 1997. Succinct representation of balanced parentheses, static trees and planar graphs. In Proceedings of the 38th IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society Press, Los Alamitos, CA, 118--126. Google ScholarDigital Library
- Munro, J. I., and Raman, V. 2001. Succinct representation of balanced parentheses and static trees. SIAM J. Comput. 31, 762--776. Google ScholarDigital Library
- Munro, J. I., and Rao, S. 2004. Succinct representations of functions. In Proceedings of the 31st International Colloquium on Automata, Languages and Programming (ICALP). Lecture Notes in Computer Science, vol. 3142. Springer-Verlag, Germany, 1006--1015.Google Scholar
- Navarro, G., and Mäkinen, V. 2007. Compressed full-text indexes. ACM Comput. Surv. 39, 1. Google ScholarDigital Library
- Ng, W., Lam, W., Wood, P., and Levene, M. 2006. XCQ: A queriable XML compression system. Know. Inform. Syst. 10, 4, 421--452. Google ScholarDigital Library
- Raman, R., Raman, V., and Rao, S. 2007. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algor. 3, 4. Google ScholarDigital Library
- Raw, P., and Moon, B. 2004. PRIX: Indexing and querying XML using Prüfer sequences. In Proceedings of the 20th International Conference on Data Engineering (ICDE). 288--300. Google ScholarDigital Library
- Tolani, P. M., and Haritsa, J. R. 2002. XGRIND: A query-friendly XML compressor. In Proceedings of the 18th International Conference on Data Engineering (ICDE). 225--234. Google ScholarDigital Library
- Wang, H., Park, S., Fan, W., and Yu, P. S. 2003. ViST: A dynamic index methd for querying XML data by tree structures. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 110--121. Google ScholarDigital Library
- Wang, W., Wang, H., Lu, H., Jang, H., Lin, X., and Li, J. 2005. Efficient processing of XML path queries using the disk-based F&B index. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB). 145--156. Google ScholarDigital Library
- Witten, I. H., Moffat, A., and Bell, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images, Second ed. Morgan Kaufmann, Los Altos, CA. Google ScholarDigital Library
- Ye, Z., and Berger, T. 1998. Information measures for discrete random fields. Science Press.Google Scholar
Index Terms
Compressing and indexing labeled trees, with applications
Recommendations
Compressed suffix trees: Efficient computation and storage of LCP-values
The suffix tree is a very important data structure in string processing, but typical implementations suffer from huge space consumption. In large-scale applications, compressed suffix trees (CSTs) are therefore used instead. A CST consists of three (...
Canonical forms for labelled trees and their applications in frequent subtree mining
Tree structures are used extensively in domains such as computational biology, pattern recognition, XML databases, computer networks, and so on. In this paper, we first present two canonical forms for labelled rooted unordered trees---the breadth-first ...
Infinite labeled trees: From rational to Sturmian trees
This paper studies infinite unordered d-ary trees with nodes labeled by {0,1}. We introduce the notions of rational and Sturmian trees along with the definitions of (strongly) balanced trees and mechanical trees, and study the relations among them. In ...
Comments