Adaptive Query-Based Sampling of Distributed Collections

Baillie, Mark; Azzopardi, Leif; Crestani, Fabio

doi:10.1007/11880561_26

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4209)

Abstract

In a Distributed Information Retrieval system, a description of each remote information resource, archive or repository is usually stored centrally to facilitate resource selection. Acquiring precise resource descriptions is therefore an important phase in Distributed Information Retrieval, as the quality of these representations affects selection accuracy and, ultimately, retrieval performance. While Query-Based Sampling is currently used for content discovery of uncooperative resources, the technique depends on heuristic guidelines to determine when a sufficiently accurate representation of each remote resource has been obtained. In this paper we address this shortcoming by using the Predictive Likelihood to indicate both the quality of an acquired resource description estimate and the point at which a sufficiently good representation of a resource has been obtained during Query-Based Sampling.
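The abstract does not give the paper's formulation of the Predictive Likelihood, so the following is only an illustrative sketch of the general idea of stopping sampling once a likelihood-based quality indicator stabilises. It assumes an additively smoothed unigram model, a fixed set of held-out terms, and a simple tolerance-based stopping rule; the names `log_predictive_likelihood`, `adaptive_sample`, `mu` and `tol` are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def log_predictive_likelihood(sample_counts, heldout_terms, vocab_size, mu=1.0):
    """Average log-probability of held-out terms under an additively
    smoothed unigram model built from the sampled documents.
    (Assumed form; the paper's own estimator may differ.)"""
    total = sum(sample_counts.values())
    denom = total + mu * vocab_size
    ll = 0.0
    for term in heldout_terms:
        ll += math.log((sample_counts.get(term, 0) + mu) / denom)
    return ll / len(heldout_terms)


def adaptive_sample(batches, heldout_terms, vocab_size, tol=0.01):
    """Consume batches of sampled documents until the indicator changes
    by less than `tol` between consecutive batches (assumed stopping rule).

    Returns the number of batches used and the score after each batch."""
    counts = Counter()
    scores = []
    for i, batch in enumerate(batches, start=1):
        for doc in batch:
            counts.update(doc.split())  # naive whitespace tokenisation
        scores.append(log_predictive_likelihood(counts, heldout_terms, vocab_size))
        if len(scores) >= 2 and abs(scores[-1] - scores[-2]) < tol:
            return i, scores
    return len(scores), scores
```

In this sketch each batch stands in for the documents returned by one round of query probing; a fixed-size heuristic would instead stop after a preset number of documents regardless of how the indicator behaves.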

Author information

Authors and Affiliations

Department of Computing and Information Sciences, University of Strathclyde, Glasgow, UK
Mark Baillie, Leif Azzopardi & Fabio Crestani

Editor information

Editors and Affiliations

Department of Computer and Information Science, University of Strathclyde, Scotland
Fabio Crestani
Dipartimento di Informatica, University of Pisa, Largo B. Pontecorvo 3, 56127, Pisa, Italy
Paolo Ferragina
Department of Information Studies, University of Sheffield, Sheffield, UK
Mark Sanderson

About this paper

Cite this paper

Baillie, M., Azzopardi, L., Crestani, F. (2006). Adaptive Query-Based Sampling of Distributed Collections. In: Crestani, F., Ferragina, P., Sanderson, M. (eds) String Processing and Information Retrieval. SPIRE 2006. Lecture Notes in Computer Science, vol 4209. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880561_26

DOI: https://doi.org/10.1007/11880561_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45774-9
Online ISBN: 978-3-540-45775-6
eBook Packages: Computer Science, Computer Science (R0)
