Published in: Discover Computing 4/2014

1 August 2014

Distance matters! Cumulative proximity expansions for ranking documents

Authors: Jeroen B. P. Vuurens, Arjen P. de Vries

Abstract

In the information retrieval process, functions that rank documents according to their estimated relevance to a query typically regard query terms as being independent. However, it is often the joint presence of query terms that is of interest to the user, which is overlooked when matching independent terms. One feature that can be used to express the relatedness of co-occurring terms is their proximity in text. In past research, models that are trained on the proximity information in a collection have performed better than models that are not estimated on data. We analyzed how co-occurring query terms can be used to estimate the relevance of documents based on their distance in text, and used this analysis to extend a unigram ranking function with a proximity model that accumulates the scores of all occurring term combinations. This proximity model is more practical than existing models, since it does not require any co-occurrence statistics, obviates the need to tune additional parameters, and has a retrieval speed close to that of competing models. We show that this approach is more robust than existing models, on both Web and newswire corpora, and on average performs as well as or better than existing proximity models across collections.
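The general idea of extending a unigram score with accumulated proximity evidence can be illustrated with a minimal sketch. This is not the authors' model: the abstract does not give the scoring formula, so the toy unigram score, the `1 / distance` decay, the restriction to term pairs, and all function names below are assumptions chosen only for illustration.

```python
from itertools import combinations
from math import log


def unigram_score(query_terms, doc_tokens):
    """Toy independent-term score: sum of log(1 + term frequency).

    Assumption: stands in for any unigram ranking function.
    """
    return sum(log(1 + doc_tokens.count(t)) for t in set(query_terms))


def proximity_score(query_terms, doc_tokens):
    """Accumulate a score over every co-occurrence of two distinct query terms.

    Assumption: each co-occurrence contributes 1 / distance, where distance
    is the difference in word positions. Only pairs are considered here.
    """
    terms = sorted(set(query_terms))
    positions = {
        t: [i for i, tok in enumerate(doc_tokens) if tok == t] for t in terms
    }
    total = 0.0
    for a, b in combinations(terms, 2):
        for pa in positions[a]:
            for pb in positions[b]:
                # Distinct terms never share a position, so distance >= 1.
                total += 1.0 / abs(pa - pb)
    return total


def score(query_terms, doc_tokens):
    """Unigram score extended with the accumulated proximity score."""
    return unigram_score(query_terms, doc_tokens) + proximity_score(
        query_terms, doc_tokens
    )


query = ["quick", "fox"]
near = "quick fox a b c".split()
far = "quick a b c fox".split()
ranked = sorted([far, near], key=lambda d: score(query, d), reverse=True)
```

Both example documents contain each query term once, so the unigram part is identical; the document in which the terms are adjacent ranks first only because of the proximity part. Note that this sketch needs no collection-level co-occurrence statistics, which matches the practical property the abstract emphasizes, but the specific decay function is purely hypothetical.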

Footnotes
1. These queries contained only one non-stop word or did not have any relevant documents in the relevance judgments.
 
Metadata
Title
Distance matters! Cumulative proximity expansions for ranking documents
Authors
Jeroen B. P. Vuurens
Arjen P. de Vries
Publication date
1 August 2014
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 4/2014
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-014-9243-x
