Article

Updating collection representations for federated search

Authors:
Milad Shokouhi, RMIT University, Melbourne, Australia
Mark Baillie, University of Strathclyde, Glasgow, Scotland, UK
Leif Azzopardi, University of Glasgow, Glasgow, Scotland, UK

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, July 2007, Pages 511–518. https://doi.org/10.1145/1277741.1277829

Published: 23 July 2007

ABSTRACT

To facilitate the search for relevant information across a set of online distributed collections, a federated information retrieval system typically represents each collection, centrally, by a set of vocabularies or sampled documents. Accurate retrieval is therefore related to how precisely each representation reflects the underlying content stored in that collection. As collections evolve over time, collection representations should also be updated to reflect any change; however, no solution has yet been proposed. In this study we examine the implications of out-of-date representation sets on retrieval accuracy and propose three different policies for managing necessary updates. Each policy is evaluated on a testbed of forty-four dynamic collections over an eight-week period. Our findings show that out-of-date representations significantly degrade performance over time; however, adopting a suitable update policy can minimise this problem.

Index Terms

Updating collection representations for federated search
1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
2. Information systems

Published in
SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
July 2007
946 pages
ISBN:9781595935977
DOI:10.1145/1277741
General Chairs: Wessel Kraaij (TNO, The Netherlands), Arjen P. de Vries (CWI, The Netherlands)
Program Chairs: Charles L. A. Clarke (University of Waterloo, Canada), Norbert Fuhr (University of Duisburg-Essen, Germany), Noriko Kando (National Institute of Informatics, Japan)
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
collection selection
distributed information retrieval
federated search
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%