Published in: Journal of Intelligent Information Systems 3/2014

June 1, 2014

Producing efficient retrievability ranks of documents using normalized retrievability scoring function

Authors: Shariq Bashir, Akmal Saeed Khattak

Abstract

In this paper, we run a series of experiments with large-scale query sets to analyze the retrieval bias of standard retrieval models. The experiments measure how much the models differ in the retrieval bias they impose on a collection. Alongside this analysis, we examine a limitation of the standard retrievability scoring function and propose a normalized retrievability scoring function. The retrieval bias experiments show that when a collection has a highly skewed distribution, the standard retrievability scoring function does not take into account differences in vocabulary richness across the documents of the collection. In such cases, documents with a large vocabulary produce many more queries and therefore have a theoretically higher probability of being retrieved, because a much larger number of queries can reach them. We therefore propose a normalized retrievability scoring function that tries to mitigate this effect by normalizing each document's retrievability score relative to its total number of queries. This provides an unbiased representation of the retrieval bias that could occur due to vocabulary differences between the documents of a collection, without automatically penalizing retrieval models that favor or disfavor long documents. Finally, to examine which scoring function is more effective at producing correct retrievability ranks of documents, we compare the two functions using the known-item search method. The known-item search experiments show that the normalized retrievability scoring function is more effective than the standard retrievability scoring function.
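
The normalization described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the common cumulative form of retrievability (a document scores one point for each query that ranks it within a cutoff) and reads "total number of queries" as a per-document query count supplied by the caller. The names `retrievability_scores`, `results`, `queries_per_doc`, and `cutoff` are hypothetical, and the data is a toy example.

```python
def retrievability_scores(results, queries_per_doc, cutoff=100):
    """Return (standard, normalized) retrievability scores.

    results:         dict mapping query -> ranked list of document ids
    queries_per_doc: dict mapping document id -> number of queries
                     associated with that document (assumed to be given)
    cutoff:          rank threshold; a document counts as retrieved only
                     if it appears at or above this rank
    """
    # Standard score: count the queries that retrieve each document
    # within the cutoff. Documents never retrieved keep a score of 0.
    standard = {doc_id: 0 for doc_id in queries_per_doc}
    for ranked_docs in results.values():
        for doc_id in ranked_docs[:cutoff]:
            standard[doc_id] = standard.get(doc_id, 0) + 1

    # Normalized score: divide by the document's own query count, so a
    # large vocabulary does not inflate the score by itself.
    normalized = {
        doc_id: standard[doc_id] / n_queries
        for doc_id, n_queries in queries_per_doc.items()
        if n_queries > 0
    }
    return standard, normalized


# Toy data: d1 is a long document associated with four queries,
# d2 and d3 are short documents associated with one query each.
results = {
    "q1": ["d1", "d2"],
    "q2": ["d1"],
    "q3": ["d1", "d3"],
    "q4": ["d2", "d1"],
}
queries_per_doc = {"d1": 4, "d2": 1, "d3": 1}

standard, normalized = retrievability_scores(results, queries_per_doc, cutoff=1)
print(standard)    # {'d1': 3, 'd2': 1, 'd3': 0}
print(normalized)  # {'d1': 0.75, 'd2': 1.0, 'd3': 0.0}
```

In this toy run the long document d1 has the highest standard score, but after normalization d2 ranks first because it is retrieved by every query associated with it. This is the kind of vocabulary-size effect the abstract says the normalization is meant to mitigate.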

Metadata
Title
Producing efficient retrievability ranks of documents using normalized retrievability scoring function
Authors
Shariq Bashir
Akmal Saeed Khattak
Publication date
June 1, 2014
Publisher
Springer US
Published in
Journal of Intelligent Information Systems / Issue 3/2014
Print ISSN: 0925-9902
Electronic ISSN: 1573-7675
DOI
https://doi.org/10.1007/s10844-013-0274-3
