Abstract
We propose and evaluate a family of measures, the eXtended Cumulated Gain (XCG) measures, for the evaluation of content-oriented XML retrieval approaches. Our aim is to provide an evaluation framework that can account for dependency among XML document components. In particular, two aspects of dependency are considered: (1) near-misses, which are document components structurally related to relevant components, such as a neighboring paragraph or a container section, and (2) overlap, which arises when the same text fragment is retrieved multiple times, for example, when a paragraph and its container section are both returned. A further requirement is that the measures be flexible enough that different models of user behavior can be instantiated within them. Both system- and user-oriented aspects are investigated, and both recall- and precision-like qualities are measured. We evaluate the reliability of the proposed measures on the INEX 2004 test collection: the effects of assessment variation and topic set size on evaluation stability are investigated, and the upper and lower bounds of expected error rates are established. The evaluation demonstrates that the XCG measures are stable and reliable and, in particular, that the novel measures of effort-precision and gain-recall (ep/gr) behave comparably to established IR measures such as precision and recall.
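The XCG measures extend the cumulated-gain framework of Järvelin and Kekäläinen, in which graded relevance scores are summed down the ranking and normalized against an ideal ordering. As a minimal illustration of that underlying idea (not the XCG measures themselves, which additionally model near-misses and overlap), the following sketch computes normalized cumulated gain over a ranked list of gain values; the gain values are assumed to come from graded relevance assessments:

```python
# Minimal sketch of normalized cumulated gain (nxCG), the classic
# cumulated-gain idea that the XCG measures build on. Gain values per
# rank are assumed to be derived from graded relevance assessments.

def cumulated_gain(gains):
    """Running sum of gain values over the ranking."""
    total, cg = 0.0, []
    for g in gains:
        total += g
        cg.append(total)
    return cg

def normalized_cg(gains, ideal_gains):
    """nxCG at each rank: cumulated gain divided by the ideal cumulated gain,
    i.e. the gain obtainable from a perfect ordering of the same components."""
    cg = cumulated_gain(gains)
    icg = cumulated_gain(ideal_gains)
    return [c / i for c, i in zip(cg, icg)]

# Example: a system ranking's gains vs. the ideal (descending) ordering
system = [1.0, 0.0, 0.5, 1.0]
ideal = sorted(system, reverse=True)   # [1.0, 1.0, 0.5, 0.0]
print(normalized_cg(system, ideal))    # [1.0, 0.5, 0.6, 1.0]
```

A score of 1.0 at a given rank means the system has accumulated as much gain as an ideal ranking would have by that point; the full XCG framework further adjusts the gain values so that overlapping components and structural near-misses are credited consistently.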