DOI: 10.1145/1148170.1148227
Article

Capturing collection size for distributed non-cooperative retrieval

Published: 06 August 2006

ABSTRACT

Modern distributed information retrieval techniques require accurate knowledge of collection size. In non-cooperative environments, where detailed collection statistics are not available, the size of the underlying collections must be estimated. While several approaches for the estimation of collection size have been proposed, their accuracy has not been thoroughly evaluated. An empirical analysis of past estimation approaches across a variety of collections demonstrates that their prediction accuracy is low. Motivated by ecological techniques for the estimation of animal populations, we propose two new approaches for the estimation of collection size. We show that our approaches are significantly more accurate than previous methods, and are more efficient in their use of the resources required to perform the estimation.
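
The ecological idea mentioned in the abstract can be illustrated with the classic two-sample capture-recapture (Lincoln-Petersen) estimator: draw two samples, count the items that appear in both, and estimate the population as the product of the sample sizes divided by the overlap. The sketch below is only an illustration of that general technique, not the authors' method, since the abstract does not describe the two proposed approaches. It assumes documents can be sampled uniformly at random, which a non-cooperative search interface does not generally guarantee. The function names `lincoln_petersen` and `simulate` are hypothetical.

```python
import random


def lincoln_petersen(first_sample, second_sample):
    """Estimate population size from two samples of item identifiers.

    Uses N ~= n1 * n2 / m, where n1 and n2 are the numbers of distinct
    items in each sample and m is the number of items seen in both.
    """
    first, second = set(first_sample), set(second_sample)
    recaptured = len(first & second)
    if recaptured == 0:
        # With no overlap the estimator divides by zero.
        raise ValueError("samples share no items; estimate is undefined")
    return len(first) * len(second) / recaptured


def simulate(collection_size=10_000, sample_size=1_000, seed=0):
    """Assumed setup: two uniform random samples of document ids."""
    rng = random.Random(seed)
    doc_ids = range(collection_size)
    first = rng.sample(doc_ids, sample_size)
    second = rng.sample(doc_ids, sample_size)
    return lincoln_petersen(first, second)


if __name__ == "__main__":
    print(f"estimated collection size: {simulate():.0f}")
```

In practice, samples obtained by querying a search engine are biased toward documents that match the probe queries, so an estimator of this kind tends to be inaccurate unless that bias is accounted for.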


Published in

SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
August 2006
768 pages
ISBN: 1595933697
DOI: 10.1145/1148170

Copyright © 2006 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 792 of 3,983 submissions, 20%
