skip to main content
10.1145/1242572.1242628acmconferencesArticle/Chapter ViewAbstractPublication PageswwwConference Proceedingsconference-collections
Article

Efficient search in large textual collections with redundancy

Published:08 May 2007Publication History

ABSTRACT

Current web search engines focus on searching only themost recentsnapshot of the web. In some cases, however, it would be desirableto search over collections that include many different crawls andversions of each page. One important example of such a collectionis the Internet Archive, though there are many others. Sincethe data size of such an archive is multiple times that of a singlesnapshot, this presents us with significant performance challenges.Current engines use various techniques for index compression andoptimized query execution, but these techniques do not exploit thesignificant similarities between different versions of a page, or betweendifferent pages.In this paper, we propose a general framework for indexing andquery processing of archival collections and, more generally, anycollections with a sufficient amount of redundancy. Our approachresults in significant reductions in index size and query processingcosts on such collections, and it is orthogonal to and can be combinedwith the existing techniques. It also supports highly efficientupdates, both locally and over a network. Within this framework,we describe and evaluate different implementations that trade offindex size versus CPU cost and other factors, and discuss applicationsranging from archival web search to local search of web sites,email archives, or file systems. We present experimental resultsbased on search engine query log and a large collection consistingof multiple crawls.

References

  1. V. Anh and A. Moffat. Index compression using fixed binary codewords. In Proc. of the 15th Int. Australasian Database Conference, pages 61--67, January 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. A. Arasu, J. Cho, H. Garcia-Molina, and S. Raghavan. Searching the web. ACM Transactions on Internet Technologies, 1(1), June 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addision Wesley, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. D. Blandford and G. Blelloch. Index compression through document reordering. In IEEE Data Compression Conference, April 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc. of the Seventh World Wide Web Conference, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. A. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. Efficient query evaluation using a two-level retrieval process. In Proc. of the 12th Int. Conf. on Information and Knowledge Management, pages 426--434, November 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. A. Broder, N. Eiron, M. Fontoura, M. Herscovici, R. Lempel, J. McPherson, R. Qi, and E. Shekita. Indexing shared content in information retrieval systems. In Proc. of the 10th Int. Conf. on Extending Database Technology, pages 313--330, October 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the web. In Sixth Int. World Wide Web Conference, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. E. Brown, J. Callan, and W. Croft. Fast incremental indexing for full-text information retrieval. In Proc. of the 20th Int. Conf. on Very Large Databases, pages 192--202, September 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. R. Burns and D. Long. Efficient distributed backup with delta compression. In Proc. of the Fifth Workshop on I/O in Parallel and Distributed Systems (IOPADS), 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. T. Chiueh and L. Huang. Efficient real-time index updates in text retrieval systems. Technical Report TR-66,, Experimental Computer Systems Lab, Stony Brook University, March 1999.Google ScholarGoogle Scholar
  12. J. Cho, N. Shivakumar, and H. Garcia-Molina. Finding replicated web collections. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 355--366, May 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. L. Cox, C. Murray, and B. Noble. Pastiche: Making backup cheap and easy. In Proc. of the 5th Symp. on Operating System Design and Implementation, December 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. D. Hawking. Web search engines: Part 1 & 2. IEEE Computer, 39, June and August 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. S. Heman. Super-scalar database compression between ram and cpu-cache. MS Thesis, Centrum voor Wiskunde en Informatica (CWI), Amsterdam, Netherlands, July 2005.Google ScholarGoogle Scholar
  16. M. Herscovici, R. Lempel, and S. Yogev. Efficient indexing of versioned document sequences. In Proc. of the 29th European Conf. on Information Retrieval, April 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. J. Hunt, K.-P. Vo, and W. Tichy. Delta algorithms: An empirical analysis. ACM Transactions on Software Engineering and Methodology, 7, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. U. Irmak, S. Mihaylov, and T. Suel. Improved single-round protocols for remote file synchronization. In Proc. of Infocom, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  19. U. Irmak and T. Suel. Hierarchical substring caching for efficient content distribution to low-bandwidth clients. In Proc. of the 14th Int. World Wide Web Conference, pages 43--53, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. R. Karp and M. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249--260, 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. M. Kaszkiel, J. Zobel, and R. Sacks-Davis. Efficient passage ranking for document databases. ACM Transactions on Information Systems (TOIS), 17(4):406--439, Oct. 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. P. Kulkarni, F. Douglis, J. LaVoie, and J. Tracey. Redundancy elimination wthin large collections of files. In Proc. of the 2004 USENIX Annual Technical Conference, June 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. L. Lim, M. Wang, S. Padmanabhan, J. Vitter, and R. Agarwal. Dynamic maintenance of web indexes using landmarks. In Proc. of the 12th Int. World Wide Web Conference, pages 102--111, May 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Trans. on Information Systems, 14(4):349--379, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system. In Proc. of the 18th ACM Symp. on Operating Systems Principles, pages 174--187, October 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. S. Quinlan and S. Dorward. Venti: a new approach to archival storage. In Proc. of the 1st USENIX Conf. on File and Storage Technologies, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. S. Rhea, K. Liang, and E. Brewer. Value-based web caching. In Proc. of the 12th Int. World Wide Web Conference, May 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. K. Risvik and R. Michelsen. Search engines and web dynamics. Computer Networks, 39:289--302, 2002.Google ScholarGoogle ScholarCross RefCross Ref
  29. S. Sahinalp and U. Vishkin. Efficient approximate and dynamic matching of patterns using a labeling paradigm. In IEEE Symp. on Foundations of Computer Science, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. S. Schleimer, D. Wilkerson, and A. Aiken. Winnowing: Local algorithms for document fingerprinting. In Proc. of the 2003 ACM SIGMOD Int. Conf. on Management of Data, pages 76--85, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. F. Scholer, H. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In Proc. of the 25th Annual SIGIR Conf. on Research and Development in Information Retrieval, pages 222--229, Aug. 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. T. Schwarz, R. Bowdidge, and W. Burkhard. Low cost comparison of file copies. In Proc. of the 10th Int. Conf. on Distributed Computing Systems, pages 196--202, 1990.Google ScholarGoogle ScholarCross RefCross Ref
  33. F. Silvestri, S. Orlando, and R. Perego. Assigning identifiers to documents to enhance the clustering property of fulltext indexes. In Proc. of the 27th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. N. Spring and D. Wetherall. A protocol independent technique for eliminating redundant network traffic. In Proc. of the ACM SIGCOMM Conference, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. D. Teodosiu, N. Bjorner, Y. Gurevich, M. Manasse, and J. Porkka. Optimizing file replication over limited bandwidth networks using remote differential compression. Technical Report TR2006-157-1, Microsoft Corporation, 2006.Google ScholarGoogle Scholar
  36. A. Tomasic, H. Garcia-Molina, and K. Shoens. Incremental updates of inverted lists for text document retrieval. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. A. Tridgell and P. MacKerras. The rsync algorithm. Technical Report TR-CS-96-05, Australian National University, June 1996.Google ScholarGoogle Scholar
  38. I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, second edition, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing Surveys, 38(2), July 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Efficient search in large textual collections with redundancy

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      WWW '07: Proceedings of the 16th international conference on World Wide Web
      May 2007
      1382 pages
      ISBN:9781595936547
      DOI:10.1145/1242572

      Copyright © 2007 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 8 May 2007

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • Article

      Acceptance Rates

      Overall Acceptance Rate1,899of8,196submissions,23%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader