ABSTRACT
Modern distributed information retrieval techniques require accurate knowledge of collection size. In non-cooperative environments, where detailed collection statistics are not available, the size of the underlying collections must be estimated. While several approaches for the estimation of collection size have been proposed, their accuracy has not been thoroughly evaluated. An empirical analysis of past estimation approaches across a variety of collections demonstrates that their prediction accuracy is low. Motivated by ecological techniques for the estimation of animal populations, we propose two new approaches for the estimation of collection size. We show that our approaches are significantly more accurate than previous methods, and are more efficient in their use of the resources required to perform the estimation.
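As an illustration of the kind of ecological technique the abstract alludes to, the sketch below shows the classic two-sample capture-recapture (Lincoln-Petersen) estimator and its Chapman correction, applied to document identifiers rather than animals. It is a minimal sketch under stated assumptions, not the two approaches proposed in the paper: the function names, the simulated collection, and the assumption that each sample is drawn uniformly at random are all hypothetical. Samples obtained by querying a real search interface are generally not uniform, which is one reason naive estimates of this kind can be inaccurate.

```python
import random


def lincoln_petersen(first_sample, second_sample):
    """Two-sample capture-recapture estimate of population size.

    N is estimated as n1 * n2 / m, where n1 and n2 are the numbers of
    distinct items in each sample and m is the number seen in both.
    """
    first, second = set(first_sample), set(second_sample)
    recaptured = len(first & second)
    if recaptured == 0:
        raise ValueError("no overlap between samples; estimate is undefined")
    return len(first) * len(second) / recaptured


def chapman(first_sample, second_sample):
    """Chapman's variant, which stays defined when the overlap is zero."""
    first, second = set(first_sample), set(second_sample)
    recaptured = len(first & second)
    return (len(first) + 1) * (len(second) + 1) / (recaptured + 1) - 1


if __name__ == "__main__":
    # Hypothetical collection of 10,000 documents, sampled uniformly twice.
    rng = random.Random(0)
    true_size = 10_000
    doc_ids = range(true_size)
    sample_a = rng.sample(doc_ids, 500)
    sample_b = rng.sample(doc_ids, 500)
    print("true size:        ", true_size)
    print("Lincoln-Petersen: ", round(lincoln_petersen(sample_a, sample_b)))
    print("Chapman:          ", round(chapman(sample_a, sample_b)))
```

In this toy setting the two samples play the roles of the "capture" and "recapture" rounds: the smaller the overlap between them, the larger the estimated collection.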
Index Terms
- Capturing collection size for distributed non-cooperative retrieval