Article

Efficient search in large textual collections with redundancy

Authors:
Jiangong Zhang

Polytechnic University, Brooklyn, NY

Polytechnic University, Brooklyn, NY
View Profile

,
Torsten Suel

Polytechnic University, Brooklyn, NY

Polytechnic University, Brooklyn, NY
View Profile

WWW '07: Proceedings of the 16th international conference on World Wide WebMay 2007Pages 411–420https://doi.org/10.1145/1242572.1242628

Published:08 May 2007Publication History

WWW '07: Proceedings of the 16th international conference on World Wide Web

Pages 411–420

ABSTRACT

Current web search engines focus on searching only themost recentsnapshot of the web. In some cases, however, it would be desirableto search over collections that include many different crawls andversions of each page. One important example of such a collectionis the Internet Archive, though there are many others. Sincethe data size of such an archive is multiple times that of a singlesnapshot, this presents us with significant performance challenges.Current engines use various techniques for index compression andoptimized query execution, but these techniques do not exploit thesignificant similarities between different versions of a page, or betweendifferent pages.In this paper, we propose a general framework for indexing andquery processing of archival collections and, more generally, anycollections with a sufficient amount of redundancy. Our approachresults in significant reductions in index size and query processingcosts on such collections, and it is orthogonal to and can be combinedwith the existing techniques. It also supports highly efficientupdates, both locally and over a network. Within this framework,we describe and evaluate different implementations that trade offindex size versus CPU cost and other factors, and discuss applicationsranging from archival web search to local search of web sites,email archives, or file systems. We present experimental resultsbased on search engine query log and a large collection consistingof multiple crawls.

References

V. Anh and A. Moffat. Index compression using fixed binary codewords. In Proc. of the 15th Int. Australasian Database Conference, pages 61--67, January 2004. Google ScholarDigital Library
A. Arasu, J. Cho, H. Garcia-Molina, and S. Raghavan. Searching the web. ACM Transactions on Internet Technologies, 1(1), June 2001. Google ScholarDigital Library
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addision Wesley, 1999. Google ScholarDigital Library
D. Blandford and G. Blelloch. Index compression through document reordering. In IEEE Data Compression Conference, April 2002. Google ScholarDigital Library
S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc. of the Seventh World Wide Web Conference, 1998. Google ScholarDigital Library
A. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. Efficient query evaluation using a two-level retrieval process. In Proc. of the 12th Int. Conf. on Information and Knowledge Management, pages 426--434, November 2003. Google ScholarDigital Library
A. Broder, N. Eiron, M. Fontoura, M. Herscovici, R. Lempel, J. McPherson, R. Qi, and E. Shekita. Indexing shared content in information retrieval systems. In Proc. of the 10th Int. Conf. on Extending Database Technology, pages 313--330, October 2006. Google ScholarDigital Library
A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the web. In Sixth Int. World Wide Web Conference, 1997. Google ScholarDigital Library
E. Brown, J. Callan, and W. Croft. Fast incremental indexing for full-text information retrieval. In Proc. of the 20th Int. Conf. on Very Large Databases, pages 192--202, September 1994. Google ScholarDigital Library
R. Burns and D. Long. Efficient distributed backup with delta compression. In Proc. of the Fifth Workshop on I/O in Parallel and Distributed Systems (IOPADS), 1997. Google ScholarDigital Library
T. Chiueh and L. Huang. Efficient real-time index updates in text retrieval systems. Technical Report TR-66,, Experimental Computer Systems Lab, Stony Brook University, March 1999.Google Scholar
J. Cho, N. Shivakumar, and H. Garcia-Molina. Finding replicated web collections. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 355--366, May 2000. Google ScholarDigital Library
L. Cox, C. Murray, and B. Noble. Pastiche: Making backup cheap and easy. In Proc. of the 5th Symp. on Operating System Design and Implementation, December 2002. Google ScholarDigital Library
D. Hawking. Web search engines: Part 1 & 2. IEEE Computer, 39, June and August 2006. Google ScholarDigital Library
S. Heman. Super-scalar database compression between ram and cpu-cache. MS Thesis, Centrum voor Wiskunde en Informatica (CWI), Amsterdam, Netherlands, July 2005.Google Scholar
M. Herscovici, R. Lempel, and S. Yogev. Efficient indexing of versioned document sequences. In Proc. of the 29th European Conf. on Information Retrieval, April 2007. Google ScholarDigital Library
J. Hunt, K.-P. Vo, and W. Tichy. Delta algorithms: An empirical analysis. ACM Transactions on Software Engineering and Methodology, 7, 1998. Google ScholarDigital Library
U. Irmak, S. Mihaylov, and T. Suel. Improved single-round protocols for remote file synchronization. In Proc. of Infocom, 2005.Google ScholarCross Ref
U. Irmak and T. Suel. Hierarchical substring caching for efficient content distribution to low-bandwidth clients. In Proc. of the 14th Int. World Wide Web Conference, pages 43--53, 2005. Google ScholarDigital Library
R. Karp and M. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249--260, 1987. Google ScholarDigital Library
M. Kaszkiel, J. Zobel, and R. Sacks-Davis. Efficient passage ranking for document databases. ACM Transactions on Information Systems (TOIS), 17(4):406--439, Oct. 1999. Google ScholarDigital Library
P. Kulkarni, F. Douglis, J. LaVoie, and J. Tracey. Redundancy elimination wthin large collections of files. In Proc. of the 2004 USENIX Annual Technical Conference, June 2004. Google ScholarDigital Library
L. Lim, M. Wang, S. Padmanabhan, J. Vitter, and R. Agarwal. Dynamic maintenance of web indexes using landmarks. In Proc. of the 12th Int. World Wide Web Conference, pages 102--111, May 2003. Google ScholarDigital Library
A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Trans. on Information Systems, 14(4):349--379, 1996. Google ScholarDigital Library
A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system. In Proc. of the 18th ACM Symp. on Operating Systems Principles, pages 174--187, October 2001. Google ScholarDigital Library
S. Quinlan and S. Dorward. Venti: a new approach to archival storage. In Proc. of the 1st USENIX Conf. on File and Storage Technologies, 2002. Google ScholarDigital Library
S. Rhea, K. Liang, and E. Brewer. Value-based web caching. In Proc. of the 12th Int. World Wide Web Conference, May 2003. Google ScholarDigital Library
K. Risvik and R. Michelsen. Search engines and web dynamics. Computer Networks, 39:289--302, 2002.Google ScholarCross Ref
S. Sahinalp and U. Vishkin. Efficient approximate and dynamic matching of patterns using a labeling paradigm. In IEEE Symp. on Foundations of Computer Science, 1996. Google ScholarDigital Library
S. Schleimer, D. Wilkerson, and A. Aiken. Winnowing: Local algorithms for document fingerprinting. In Proc. of the 2003 ACM SIGMOD Int. Conf. on Management of Data, pages 76--85, 2003. Google ScholarDigital Library
F. Scholer, H. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In Proc. of the 25th Annual SIGIR Conf. on Research and Development in Information Retrieval, pages 222--229, Aug. 2002. Google ScholarDigital Library
T. Schwarz, R. Bowdidge, and W. Burkhard. Low cost comparison of file copies. In Proc. of the 10th Int. Conf. on Distributed Computing Systems, pages 196--202, 1990.Google ScholarCross Ref
F. Silvestri, S. Orlando, and R. Perego. Assigning identifiers to documents to enhance the clustering property of fulltext indexes. In Proc. of the 27th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2004. Google ScholarDigital Library
N. Spring and D. Wetherall. A protocol independent technique for eliminating redundant network traffic. In Proc. of the ACM SIGCOMM Conference, 2000. Google ScholarDigital Library
D. Teodosiu, N. Bjorner, Y. Gurevich, M. Manasse, and J. Porkka. Optimizing file replication over limited bandwidth networks using remote differential compression. Technical Report TR2006-157-1, Microsoft Corporation, 2006.Google Scholar
A. Tomasic, H. Garcia-Molina, and K. Shoens. Incremental updates of inverted lists for text document retrieval. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, 1994. Google ScholarDigital Library
A. Tridgell and P. MacKerras. The rsync algorithm. Technical Report TR-CS-96-05, Australian National University, June 1996.Google Scholar
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, second edition, 1999. Google ScholarDigital Library
J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing Surveys, 38(2), July 2006. Google ScholarDigital Library

Index Terms

Efficient search in large textual collections with redundancy
1. Information systems
  1. Information retrieval
    1. Document representation

Recommendations

Performance of compressed inverted list caching in search engines
WWW '08: Proceedings of the 17th international conference on World Wide Web

Due to the rapid growth in the size of the web, web search engines are facing enormous performance challenges. The larger engines in particular have to be able to process tens of thousands of queries per second on tens of billions of documents, making ...
Read More
The influence of commercial intent of search results on their perceived relevance
iConference '11: Proceedings of the 2011 iConference

We carried out a retrieval effectiveness test on the three major web search engines (i.e., Google, Microsoft and Yahoo). In addition to relevance judgments, we classified the results according to their commercial intent and whether or not they carried ...
Read More
Discovering search engine related queries using association rules

This work presents a method for online generation of query related suggestions for a Web search engine. The method uses association rules to extract related queries from the log of sbumitted queries to the search engine. Experimental results were ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
WWW '07: Proceedings of the 16th international conference on World Wide Web
May 2007
1382 pages
ISBN:9781595936547
DOI:10.1145/1242572
General Chairs:
Carey Williamson
University of Calgary, Canada
,
Mary Ellen Zurko
IBM, USA
,
Program Chairs:
Peter Patel-Schneider
Bell Labs Research, USA
,
Prashant Shenoy
University of Massachusetts at Amherst, USA
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 8 May 2007
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
index compression
inverted index
query execution
redundancy elimination
search engines
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate1,899of8,196submissions,23%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 23
  Total Citations
  View Citations
- 644
  Total Downloads
- Downloads (Last 12 months)3
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Efficient search in large textual collections with redundancy

WWW '07: Proceedings of the 16th international conference on World Wide Web

ABSTRACT

References

Cited By

Index Terms

Recommendations

Performance of compressed inverted list caching in search engines

The influence of commercial intent of search results on their perceived relevance

Discovering search engine related queries using association rules