DOI: 10.1145/564376.564382

Using sampled data and regression to merge search engine results

Published: 11 August 2002

ABSTRACT

This paper addresses the problem of merging results obtained from different databases and search engines in a distributed information retrieval environment. Prior research on this problem either assumed the exchange of the statistics needed to normalize scores (cooperative solutions) or relied on heuristics. Both approaches have disadvantages. We show that the problem in uncooperative environments is simpler when viewed as a component of a distributed IR system that uses query-based sampling to create resource descriptions. The documents sampled to create resource descriptions can also be used to build a sampled centralized index, and this index is a source of training data for adaptive results-merging algorithms. A variety of experiments demonstrate that this new approach is more effective than a well-known alternative, and that it allows query-by-query tuning of the results-merging function.
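The merging approach the abstract describes can be sketched as follows. This is a minimal illustration under my own assumptions (the function names, the fallback behavior, and the simple ordinary-least-squares fit are not taken from the paper): documents that appear both in an engine's result list and in the centralized index built from sampled documents yield (engine score, centralized score) training pairs, and a per-engine linear regression learned from those pairs maps all of that engine's scores onto the common centralized scale before a single merged sort.

```python
# Sketch of regression-based results merging. Overlap documents (those
# present in both an engine's result list and the sampled centralized
# index) supply training pairs; a per-engine linear fit then normalizes
# every score from that engine onto the centralized scale.

def fit_linear(pairs):
    """Ordinary least-squares fit y = a*x + b from (x, y) pairs."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def merge(engine_results, centralized_scores):
    """engine_results: {engine: [(doc_id, score), ...]}
    centralized_scores: {doc_id: score} from the sampled centralized index.
    Returns one merged list of (doc_id, normalized_score), best first."""
    merged = []
    for engine, results in engine_results.items():
        # Overlap documents are the training data for this engine.
        pairs = [(s, centralized_scores[d]) for d, s in results
                 if d in centralized_scores]
        if len(pairs) >= 2:
            a, b = fit_linear(pairs)
        else:
            a, b = 1.0, 0.0  # too little overlap: fall back to raw scores
        merged.extend((d, a * s + b) for d, s in results)
    merged.sort(key=lambda t: t[1], reverse=True)
    return merged
```

Because the fit is learned from the overlap documents of the current query's result lists, the mapping adapts per engine and per query, which is what makes query-by-query tuning of the merging function possible.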


Published in

SIGIR '02: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
August 2002, 478 pages
ISBN: 1581135610
DOI: 10.1145/564376
Copyright © 2002 ACM


Publisher: Association for Computing Machinery, New York, NY, United States



Acceptance Rates

SIGIR '02 paper acceptance rate: 44 of 219 submissions (20%). Overall acceptance rate: 792 of 3,983 submissions (20%).
