
Fast and Compact Web Graph Representations

Published: 01 September 2010

Abstract

Compressed graph representations, in particular for Web graphs, have become an attractive research topic because of their applications in the manipulation of huge graphs in main memory. The state of the art is well represented by the WebGraph project, which exploits several particular properties of Web graphs to offer a trade-off between space and access time. In this paper we show that the same properties can be exploited with a different and elegant technique that builds on grammar-based compression. In particular, we focus on Re-Pair and on Ziv-Lempel compression, which, although they cannot reach the best compression ratios of WebGraph, achieve much faster navigation of the graph when both are tuned to use the same space. Moreover, the technique adapts well to run on secondary memory and in distributed scenarios. As a byproduct, we introduce an approximate Re-Pair version that works efficiently with severely limited main memory.

                • Published in

                  ACM Transactions on the Web  Volume 4, Issue 4
                  September 2010
                  173 pages
                  ISSN:1559-1131
                  EISSN:1559-114X
                  DOI:10.1145/1841909
                  Copyright © 2010 ACM

                  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                  Publisher

                  Association for Computing Machinery

                  New York, NY, United States

                  Publication History

                  • Published: 1 September 2010
                  • Accepted: 1 April 2010
                  • Revised: 1 February 2010
                  • Received: 1 June 2009

                  Qualifiers

                  • research-article
                  • Research
                  • Refereed
