skip to main content
10.1145/2983323.2983706acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article
Public Access

Query-Biased Partitioning for Selective Search

Published:24 October 2016Publication History

ABSTRACT

Selective search is a cluster-based distributed retrieval architecture that reduces computational costs by partitioning a corpus into topical shards, and selectively searching them. Prior research formed topical shards by clustering the corpus based on the documents' contents. This content-based partitioning strategy reveals common topics in a corpus. However, the topic distribution produced by clustering may not match the distribution of topics in search traffic, which may reduce the effectiveness of selective search.

This paper presents a query-biased partitioning strategy that aligns document partitions with topics from query logs. It focuses on two parts of the partitioning process: clustering initialization and document similarity calculation. A query-driven clustering initialization algorithm uses topics from query logs to form cluster seeds. A query-biased similarity metric favors terms that are important in query logs. Both methods boost retrieval effectiveness, reduce variance, and produce a more balanced distribution of shard sizes.

References

  1. I. S. Altingövde, E. Demir, F. Can, and Ö. Ulusoy. Incremental cluster-based retrieval using compressed cluster-skipping inverted files. ACM Transactions on Information Systems, 26(3), 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. R. Aly, T. Demeester, and D. Hiemstra. Taily: Shard selection using the tail of score distributions. In Proceedings of SIGIR, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137--1155, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. M. E. Celebi, H. A. Kingravi, and P. A. Vela. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Systems with Applications, 40(1):200--210, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. C. L. Clarke, N. Craswell, and I. Soboroff. Overview of the trec 2004 terabyte track. In TREC, volume 4, page 74, 2004.Google ScholarGoogle Scholar
  6. Z. Dai, Y. Kim, and J. Callan. How random decisions affect selective distributed search. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 771--774. ACM, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. J. C. French, A. L. Powell, F. C. Gey, and N. Perelman. Exploiting a controlled vocabulary to improve collection selection and retrieval effectiveness. In Proceedings of the 2001 ACM CIKM International Conference on Information and Knowledge Management, pages 199--206, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Y. Kim, J. Callan, S. Culpepper, and A. Moffat. Load-balancing in distributed selective search. In Proceedings of the 39th International ACM SIGIR conference on on Research and Development in Information Retrieval, pages ??--?? ACM, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. A. Kulkarni. Efficient and Effective Large-scale Search. PhD thesis, Carnegie Mellon University, 2013.Google ScholarGoogle Scholar
  10. A. Kulkarni and J. Callan. Document allocation policies for selective searching of distributed indexes. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM '10), pages 449--458. ACM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. A. Kulkarni and J. Callan. Selective search: Efficient and effective search of large textual collections. ACM Transactions on Information Systems (TOIS), 33(4):17, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. A. Kulkarni, A. Tigelaar, J. Callan, and D. Hiemstra. Shard ranking and cutoff estimation for topically partitioned collections. In Proceedings of the 21st ACM Conference on Information and Knowledge Management (CIKM'12), 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. D. Metzler and W. B. Croft. A markov random field model for term dependencies. In Proceedings of the 28th International ACM SIGIR conference on Research and development in information retrieval, pages 472--479. ACM, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.Google ScholarGoogle Scholar
  15. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111--3119, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. B. Poblete and R. Baeza-Yates. Query-sets: using implicit feedback and query patterns to organize web documents. In Proceedings of the 17th International Conference on World Wide Web, pages 41--50. ACM, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. D. Puppin, F. Silvestri, and D. Laforenza. Query-driven document partitioning and collection selection. In Proceedings of the 1st international conference on Scalable information systems, page 34. ACM, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. L. Si and J. Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 298--305. ACM, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. P. Thomas and M. Shokouhi. Sushi: scoring scaled samples for server selection. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 419--426. ACM, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. K. Wagstaff, C. Cardie, S. Rogers, S. Schrödl, et al. Constrained k-means clustering with background knowledge. In Proceedings of the 18th International Conference on Machine Learning (ICML'01), volume 1, pages 577--584, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. E. P. Xing, M. I. Jordan, S. Russell, and A. Y. Ng. Distance metric learning with application to clustering with side-information. In Advances in neural information processing systems, pages 505--512, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS), 22(2):179--214, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Query-Biased Partitioning for Selective Search

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management
      October 2016
      2566 pages
      ISBN:9781450340731
      DOI:10.1145/2983323

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 24 October 2016

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      CIKM '16 Paper Acceptance Rate160of701submissions,23%Overall Acceptance Rate1,861of8,427submissions,22%

      Upcoming Conference

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader