eXtended cumulated gain measures for the evaluation of content-oriented XML retrieval

Published: 01 October 2006

Abstract

We propose and evaluate a family of measures, the eXtended Cumulated Gain (XCG) measures, for the evaluation of content-oriented XML retrieval approaches. Our aim is to provide an evaluation framework that allows the consideration of dependency among XML document components. In particular, two aspects of dependency are considered: (1) near-misses, which are document components that are structurally related to relevant components, such as a neighboring paragraph or a container section, and (2) overlap, which concerns the situation in which the same text fragment is referenced multiple times, for example, when a paragraph and its container section are both retrieved. A further consideration is that the measures should be flexible enough that different models of user behavior may be instantiated within them. Both system- and user-oriented aspects are investigated, and both recall- and precision-like qualities are measured. We evaluate the reliability of the proposed measures on the INEX 2004 test collection. For example, the effects of assessment variation and topic set size on evaluation stability are investigated, and the upper and lower bounds of expected error rates are established. The evaluation demonstrates that the XCG measures are stable and reliable and, in particular, that the novel measures of effort-precision and gain-recall (ep/gr) show comparable behavior to established IR measures such as precision and recall.
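
To make the quantities named above concrete, the core definitions behind the XCG family can be sketched as follows, building on the cumulated-gain formulation of Järvelin and Kekäläinen. This is a simplified reading of the abstract rather than the paper's full definitions: the exact gain function xG, which folds near-miss and overlap handling into the score of each retrieved component, is specified in the body of the paper, and the symbols n, i_ideal(r), and i_run(r) are notation introduced here for illustration.

$$xCG[i] = \sum_{j=1}^{i} xG[j], \qquad nxCG[i] = \frac{xCG[i]}{xCI[i]},$$

$$gr[i] = \frac{xCG[i]}{xCI[n]}, \qquad ep(r) = \frac{i_{\mathrm{ideal}}(r)}{i_{\mathrm{run}}(r)},$$

where xCI[i] is the cumulated gain of an ideal ranking at rank i, n is the total number of relevant components, and i_ideal(r) and i_run(r) are the smallest ranks at which the ideal ranking and the system run, respectively, reach gain-recall level r. Under this reading, gr measures the fraction of the total attainable gain accumulated so far, while ep compares the rank depth (user effort) the ideal ranking needs against that the evaluated run needs to reach the same gain-recall level, which is what gives ep/gr its precision/recall-like character.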

