DOI: 10.1145/1277741.1277829
Article

Updating collection representations for federated search

Published: 23 July 2007

ABSTRACT

To facilitate the search for relevant information across a set of online distributed collections, a federated information retrieval system typically represents each collection, centrally, by a set of vocabularies or sampled documents. Accurate retrieval is therefore related to how precisely each representation reflects the underlying content stored in that collection. As collections evolve over time, collection representations should also be updated to reflect any change; however, a solution has not yet been proposed. In this study we examine the implications of out-of-date representation sets on retrieval accuracy and propose three different policies for managing necessary updates. Each policy is evaluated on a testbed of forty-four dynamic collections over an eight-week period. Our findings show that out-of-date representations significantly degrade performance over time; however, adopting a suitable update policy can minimise this problem.


          • Published in

            SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
            July 2007
            946 pages
            ISBN: 9781595935977
            DOI: 10.1145/1277741

            Copyright © 2007 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Qualifiers

            • Article

            Acceptance Rates

            Overall Acceptance Rate: 792 of 3,983 submissions, 20%
