Article

Capturing collection size for distributed non-cooperative retrieval

Authors:
Milad Shokouhi

RMIT University, Melbourne, Australia

Justin Zobel

RMIT University, Melbourne, Australia

Falk Scholer

RMIT University, Melbourne, Australia

S. M. M. Tahaghoghi

RMIT University, Melbourne, Australia


SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, August 2006, Pages 316–323, https://doi.org/10.1145/1148170.1148227

Published: 06 August 2006

ABSTRACT

Modern distributed information retrieval techniques require accurate knowledge of collection size. In non-cooperative environments, where detailed collection statistics are not available, the size of the underlying collections must be estimated. While several approaches for the estimation of collection size have been proposed, their accuracy has not been thoroughly evaluated. An empirical analysis of past estimation approaches across a variety of collections demonstrates that their prediction accuracy is low. Motivated by ecological techniques for the estimation of animal populations, we propose two new approaches for the estimation of collection size. We show that our approaches are significantly more accurate than previous methods, and are more efficient in their use of the resources required to perform the estimation.
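
As a rough illustration of the ecological idea the abstract refers to, the sketch below applies the classic two-sample capture-recapture (Lincoln-Petersen) estimator to sampled document identifiers. This is a textbook estimator, not necessarily either of the two approaches proposed in the paper; the function name `lincoln_petersen` and the use of integer document ids are assumptions made for the example, and it presumes both samples are drawn independently and uniformly at random from the collection.

```python
def lincoln_petersen(first_sample, second_sample):
    """Estimate collection size from two independent samples of document ids.

    Textbook Lincoln-Petersen estimator: N ~= (n1 * n2) / m, where n1 and n2
    are the numbers of distinct documents in each sample and m is the number
    of documents seen in both. Hypothetical helper, not the paper's method.
    """
    first = set(first_sample)
    second = set(second_sample)
    recaptured = len(first & second)  # documents "caught" in both samples
    if recaptured == 0:
        # With no overlap the estimator is undefined (division by zero).
        raise ValueError("samples do not overlap; estimate is undefined")
    return len(first) * len(second) / recaptured


if __name__ == "__main__":
    # Two samples of 100 documents sharing 50 ids: 100 * 100 / 50.
    sample_a = range(0, 100)
    sample_b = range(50, 150)
    print(lincoln_petersen(sample_a, sample_b))
```

In a non-cooperative setting the samples would come from query results rather than uniform draws, so an estimate of this kind should be read as indicative only.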

Index Terms

Capturing collection size for distributed non-cooperative retrieval
1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
2. Information systems

Published in
SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
August 2006
768 pages
ISBN: 1595933697
DOI: 10.1145/1148170
General Chair:
Efthimis N. Efthimiadis, University of Washington
Program Chairs:
Susan Dumais, Microsoft Research, Redmond
David Hawking, CSIRO ICT Centre, Canberra, Australia
Kalervo Järvelin, University of Tampere, Finland
Copyright © 2006 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
capture-history
capture-recapture
collection size estimation
sample-resample
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%