DOI:10.1145/2396761.2398535
short-paper

Federated search in the wild: the combined power of over a hundred search engines

Published: 29 October 2012

ABSTRACT

Federated search has the potential of improving web search: the user becomes less dependent on a single search provider and parts of the deep web become available through a unified interface, leading to a wider variety in the retrieved search results. However, a publicly available dataset for federated search reflecting an actual web environment has been absent. As a result, it has been difficult to assess whether proposed systems are suitable for the web setting. We introduce a new test collection containing the results from more than a hundred actual search engines, ranging from large general web search engines such as Google and Bing to small domain-specific engines. We discuss the design and analyze the effect of several sampling methods. For a set of test queries, we collected relevance judgements for the top 10 results of each search engine. The dataset is publicly available and is useful for researchers interested in resource selection for web search collections, result merging and size estimation of uncooperative resources.

    • Published in

      CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management
      October 2012
      2840 pages
      ISBN:9781450311564
      DOI:10.1145/2396761

      Copyright © 2012 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
