ABSTRACT
Current web search engines focus on searching only themost recentsnapshot of the web. In some cases, however, it would be desirableto search over collections that include many different crawls andversions of each page. One important example of such a collectionis the Internet Archive, though there are many others. Sincethe data size of such an archive is multiple times that of a singlesnapshot, this presents us with significant performance challenges.Current engines use various techniques for index compression andoptimized query execution, but these techniques do not exploit thesignificant similarities between different versions of a page, or betweendifferent pages.In this paper, we propose a general framework for indexing andquery processing of archival collections and, more generally, anycollections with a sufficient amount of redundancy. Our approachresults in significant reductions in index size and query processingcosts on such collections, and it is orthogonal to and can be combinedwith the existing techniques. It also supports highly efficientupdates, both locally and over a network. Within this framework,we describe and evaluate different implementations that trade offindex size versus CPU cost and other factors, and discuss applicationsranging from archival web search to local search of web sites,email archives, or file systems. We present experimental resultsbased on search engine query log and a large collection consistingof multiple crawls.
- V. Anh and A. Moffat. Index compression using fixed binary codewords. In Proc. of the 15th Int. Australasian Database Conference, pages 61--67, January 2004. Google ScholarDigital Library
- A. Arasu, J. Cho, H. Garcia-Molina, and S. Raghavan. Searching the web. ACM Transactions on Internet Technologies, 1(1), June 2001. Google ScholarDigital Library
- R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addision Wesley, 1999. Google ScholarDigital Library
- D. Blandford and G. Blelloch. Index compression through document reordering. In IEEE Data Compression Conference, April 2002. Google ScholarDigital Library
- S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc. of the Seventh World Wide Web Conference, 1998. Google ScholarDigital Library
- A. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. Efficient query evaluation using a two-level retrieval process. In Proc. of the 12th Int. Conf. on Information and Knowledge Management, pages 426--434, November 2003. Google ScholarDigital Library
- A. Broder, N. Eiron, M. Fontoura, M. Herscovici, R. Lempel, J. McPherson, R. Qi, and E. Shekita. Indexing shared content in information retrieval systems. In Proc. of the 10th Int. Conf. on Extending Database Technology, pages 313--330, October 2006. Google ScholarDigital Library
- A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the web. In Sixth Int. World Wide Web Conference, 1997. Google ScholarDigital Library
- E. Brown, J. Callan, and W. Croft. Fast incremental indexing for full-text information retrieval. In Proc. of the 20th Int. Conf. on Very Large Databases, pages 192--202, September 1994. Google ScholarDigital Library
- R. Burns and D. Long. Efficient distributed backup with delta compression. In Proc. of the Fifth Workshop on I/O in Parallel and Distributed Systems (IOPADS), 1997. Google ScholarDigital Library
- T. Chiueh and L. Huang. Efficient real-time index updates in text retrieval systems. Technical Report TR-66,, Experimental Computer Systems Lab, Stony Brook University, March 1999.Google Scholar
- J. Cho, N. Shivakumar, and H. Garcia-Molina. Finding replicated web collections. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 355--366, May 2000. Google ScholarDigital Library
- L. Cox, C. Murray, and B. Noble. Pastiche: Making backup cheap and easy. In Proc. of the 5th Symp. on Operating System Design and Implementation, December 2002. Google ScholarDigital Library
- D. Hawking. Web search engines: Part 1 & 2. IEEE Computer, 39, June and August 2006. Google ScholarDigital Library
- S. Heman. Super-scalar database compression between ram and cpu-cache. MS Thesis, Centrum voor Wiskunde en Informatica (CWI), Amsterdam, Netherlands, July 2005.Google Scholar
- M. Herscovici, R. Lempel, and S. Yogev. Efficient indexing of versioned document sequences. In Proc. of the 29th European Conf. on Information Retrieval, April 2007. Google ScholarDigital Library
- J. Hunt, K.-P. Vo, and W. Tichy. Delta algorithms: An empirical analysis. ACM Transactions on Software Engineering and Methodology, 7, 1998. Google ScholarDigital Library
- U. Irmak, S. Mihaylov, and T. Suel. Improved single-round protocols for remote file synchronization. In Proc. of Infocom, 2005.Google ScholarCross Ref
- U. Irmak and T. Suel. Hierarchical substring caching for efficient content distribution to low-bandwidth clients. In Proc. of the 14th Int. World Wide Web Conference, pages 43--53, 2005. Google ScholarDigital Library
- R. Karp and M. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249--260, 1987. Google ScholarDigital Library
- M. Kaszkiel, J. Zobel, and R. Sacks-Davis. Efficient passage ranking for document databases. ACM Transactions on Information Systems (TOIS), 17(4):406--439, Oct. 1999. Google ScholarDigital Library
- P. Kulkarni, F. Douglis, J. LaVoie, and J. Tracey. Redundancy elimination wthin large collections of files. In Proc. of the 2004 USENIX Annual Technical Conference, June 2004. Google ScholarDigital Library
- L. Lim, M. Wang, S. Padmanabhan, J. Vitter, and R. Agarwal. Dynamic maintenance of web indexes using landmarks. In Proc. of the 12th Int. World Wide Web Conference, pages 102--111, May 2003. Google ScholarDigital Library
- A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Trans. on Information Systems, 14(4):349--379, 1996. Google ScholarDigital Library
- A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system. In Proc. of the 18th ACM Symp. on Operating Systems Principles, pages 174--187, October 2001. Google ScholarDigital Library
- S. Quinlan and S. Dorward. Venti: a new approach to archival storage. In Proc. of the 1st USENIX Conf. on File and Storage Technologies, 2002. Google ScholarDigital Library
- S. Rhea, K. Liang, and E. Brewer. Value-based web caching. In Proc. of the 12th Int. World Wide Web Conference, May 2003. Google ScholarDigital Library
- K. Risvik and R. Michelsen. Search engines and web dynamics. Computer Networks, 39:289--302, 2002.Google ScholarCross Ref
- S. Sahinalp and U. Vishkin. Efficient approximate and dynamic matching of patterns using a labeling paradigm. In IEEE Symp. on Foundations of Computer Science, 1996. Google ScholarDigital Library
- S. Schleimer, D. Wilkerson, and A. Aiken. Winnowing: Local algorithms for document fingerprinting. In Proc. of the 2003 ACM SIGMOD Int. Conf. on Management of Data, pages 76--85, 2003. Google ScholarDigital Library
- F. Scholer, H. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In Proc. of the 25th Annual SIGIR Conf. on Research and Development in Information Retrieval, pages 222--229, Aug. 2002. Google ScholarDigital Library
- T. Schwarz, R. Bowdidge, and W. Burkhard. Low cost comparison of file copies. In Proc. of the 10th Int. Conf. on Distributed Computing Systems, pages 196--202, 1990.Google ScholarCross Ref
- F. Silvestri, S. Orlando, and R. Perego. Assigning identifiers to documents to enhance the clustering property of fulltext indexes. In Proc. of the 27th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2004. Google ScholarDigital Library
- N. Spring and D. Wetherall. A protocol independent technique for eliminating redundant network traffic. In Proc. of the ACM SIGCOMM Conference, 2000. Google ScholarDigital Library
- D. Teodosiu, N. Bjorner, Y. Gurevich, M. Manasse, and J. Porkka. Optimizing file replication over limited bandwidth networks using remote differential compression. Technical Report TR2006-157-1, Microsoft Corporation, 2006.Google Scholar
- A. Tomasic, H. Garcia-Molina, and K. Shoens. Incremental updates of inverted lists for text document retrieval. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, 1994. Google ScholarDigital Library
- A. Tridgell and P. MacKerras. The rsync algorithm. Technical Report TR-CS-96-05, Australian National University, June 1996.Google Scholar
- I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, second edition, 1999. Google ScholarDigital Library
- J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing Surveys, 38(2), July 2006. Google ScholarDigital Library
Index Terms
- Efficient search in large textual collections with redundancy
Recommendations
Performance of compressed inverted list caching in search engines
WWW '08: Proceedings of the 17th international conference on World Wide WebDue to the rapid growth in the size of the web, web search engines are facing enormous performance challenges. The larger engines in particular have to be able to process tens of thousands of queries per second on tens of billions of documents, making ...
The influence of commercial intent of search results on their perceived relevance
iConference '11: Proceedings of the 2011 iConferenceWe carried out a retrieval effectiveness test on the three major web search engines (i.e., Google, Microsoft and Yahoo). In addition to relevance judgments, we classified the results according to their commercial intent and whether or not they carried ...
Discovering search engine related queries using association rules
This work presents a method for online generation of query related suggestions for a Web search engine. The method uses association rules to extract related queries from the log of sbumitted queries to the search engine. Experimental results were ...
Comments