ABSTRACT
Federated search has the potential to improve web search: the user becomes less dependent on a single search provider, and parts of the deep web become available through a unified interface, leading to a wider variety of retrieved search results. However, a publicly available dataset for federated search that reflects an actual web environment has been absent. As a result, it has been difficult to assess whether proposed systems are suitable for the web setting. We introduce a new test collection containing the results from more than a hundred actual search engines, ranging from large general web search engines such as Google and Bing to small domain-specific engines. We discuss the design and analyze the effect of several sampling methods. For a set of test queries, we collected relevance judgements for the top 10 results of each search engine. The dataset is publicly available and is useful for researchers interested in resource selection for web search collections, result merging, and size estimation of uncooperative resources.
Index Terms
- Federated search in the wild: the combined power of over a hundred search engines