ABSTRACT
We present an approach to improving the precision of an initial document ranking that uses cluster information within a graph-based framework. The main idea is to rerank documents by their centrality within bipartite graphs of documents (on one side) and clusters (on the other side), on the premise that these are mutually reinforcing entities. Links between entities are created using language models induced from them. We find that our cluster-document graphs yield much better retrieval performance than previously proposed document-only graphs. For example, authority-based reranking of documents via a HITS-style cluster-based approach outperforms a previously proposed PageRank-inspired algorithm applied to document-only graphs. Moreover, we show that computing authority scores for clusters is an effective method for identifying clusters that contain a large percentage of relevant documents.
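The reranking idea above can be illustrated with a minimal sketch of HITS-style power iteration on a weighted bipartite graph. This is not the authors' implementation: the function name `hits_bipartite`, the helper `_normalize`, and the toy edge weights are hypothetical, and the weights are simply given as numbers here, whereas the abstract states only that links are induced from language models. In this sketch, clusters act as hubs and documents as authorities.

```python
from math import sqrt


def _normalize(scores):
    """Scale a score vector to unit Euclidean length (no-op for all zeros)."""
    norm = sqrt(sum(s * s for s in scores))
    return [s / norm for s in scores] if norm > 0 else scores


def hits_bipartite(weights, iterations=50):
    """Run HITS updates on a bipartite graph.

    weights[h][a] is the nonnegative weight of the edge from hub h to
    authority a. Returns (hub_scores, authority_scores).
    """
    n_hubs = len(weights)
    n_auths = len(weights[0])
    hubs = [1.0] * n_hubs
    auths = [1.0] * n_auths
    for _ in range(iterations):
        # An authority is strong if strong hubs point to it.
        auths = _normalize([
            sum(weights[h][a] * hubs[h] for h in range(n_hubs))
            for a in range(n_auths)
        ])
        # A hub is strong if it points to strong authorities.
        hubs = _normalize([
            sum(weights[h][a] * auths[a] for a in range(n_auths))
            for h in range(n_hubs)
        ])
    return hubs, auths


# Hypothetical toy graph: 3 clusters (hubs) linking to 4 documents
# (authorities). The numbers are made up for illustration only.
cluster_to_doc = [
    [0.9, 0.8, 0.0, 0.0],
    [0.0, 0.7, 0.6, 0.0],
    [0.0, 0.0, 0.1, 0.2],
]

cluster_hub, doc_authority = hits_bipartite(cluster_to_doc)

# Rerank documents by descending authority score.
ranking = sorted(range(len(doc_authority)), key=lambda d: -doc_authority[d])
```

Swapping the roles (passing document-to-cluster weights so that documents are hubs and clusters are authorities) would give authority scores for clusters under the same assumptions.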
Index Terms
- Respect my authority!: HITS without hyperlinks, utilizing cluster-based language models