ABSTRACT
Following the recent adoption by the machine translation community of automatic evaluation using the BLEU/NIST scoring process, we conduct an in-depth study of a similar idea for evaluating summaries. The results show that automatic evaluation using unigram co-occurrences between summary pairs correlates surprisingly well with human evaluations across various statistical metrics, while direct application of the BLEU evaluation procedure does not always give good results.
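The core idea of unigram co-occurrence scoring can be sketched as follows. This is an illustrative sketch only, not the paper's exact metric: it assumes whitespace tokenization and lowercasing, uses clipped unigram counts (as in BLEU's modified precision), and normalizes by the reference length to make the score recall-oriented; the function name and these choices are my own.

```python
from collections import Counter

def unigram_cooccurrence(candidate: str, reference: str) -> float:
    """Clipped unigram overlap between a candidate summary and a
    reference summary, normalized by reference length.
    Illustrative sketch of the unigram co-occurrence idea."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Each reference unigram can be matched at most as many times
    # as it occurs in the candidate (clipping).
    overlap = sum(min(cand_counts[w], n) for w, n in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

# Example: 5 of the 6 reference unigrams are covered.
print(unigram_cooccurrence("the cat sat on the mat",
                           "the cat was on the mat"))  # 0.8333...
```

In practice a system would average such scores over many candidate/reference summary pairs and then measure how the resulting rankings correlate with human judgments.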
- Donaway, R. L., Drummey, K. W., and Mather, L. A. 2000. A Comparison of Rankings Produced by Summarization Evaluation Measures. In Proceedings of the Workshop on Automatic Summarization, post-conference workshop of ANLP-NAACL-2000, pp. 69--78, Seattle, WA, 2000.
- DUC. 2002. The Document Understanding Conference. http://duc.nist.gov.
- Fukusima, T. and Okumura, M. 2001. Text Summarization Challenge: Text Summarization Evaluation at NTCIR Workshop 2. In Proceedings of the Second NTCIR Workshop on Research in Chinese & Japanese Text Retrieval and Text Summarization, NII, Tokyo, Japan, 2001.
- Lin, C.-Y. 2001. Summary Evaluation Environment. http://www.isi.edu/~cyl/SEE.
- Lin, C.-Y. and E. Hovy. 2002. Manual and Automatic Evaluations of Summaries. In Proceedings of the Workshop on Automatic Summarization, post-conference workshop of ACL-2002, pp. 45--51, Philadelphia, PA, 2002.
- McKeown, K., R. Barzilay, D. Evans, V. Hatzivassiloglou, J. L. Klavans, A. Nenkova, C. Sable, B. Schiffman, and S. Sigelman. 2002. Tracking and Summarizing News on a Daily Basis with Columbia's Newsblaster. In Proceedings of the Human Language Technology Conference 2002 (HLT 2002), San Diego, CA, 2002.
- Mani, I., D. House, G. Klein, L. Hirschman, L. Obrst, T. Firmin, M. Chrzanowski, and B. Sundheim. 1998. The TIPSTER SUMMAC Text Summarization Evaluation: Final Report. MITRE Corp. Tech. Report.
- NIST. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics.
- Over, P. 2003. Personal Communication.
- Papineni, K., S. Roukos, T. Ward, and W.-J. Zhu. 2001. BLEU: a Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).
- Porter, M. F. 1980. An Algorithm for Suffix Stripping. Program, 14, pp. 130--137.
- Radev, D. R., S. Blair-Goldensohn, Z. Zhang, and R. S. Raghavan. 2001. NewsInEssence: A System for Domain-Independent, Real-Time News Clustering and Multi-Document Summarization. In Proceedings of the Human Language Technology Conference (HLT 2001), San Diego, CA, 2001.
- Rath, G. J., Resnick, A., and Savage, T. R. 1961. The Formation of Abstracts by the Selection of Sentences. American Documentation, 12(2), pp. 139--143. Reprinted in Mani, I., and Maybury, M., eds., Advances in Automatic Text Summarization, MIT Press, pp. 287--292.
- Spärck Jones, K. and J. R. Galliers. 1996. Evaluating Natural Language Processing Systems: An Analysis and Review. New York: Springer.
- WAS. 2000. Workshop on Automatic Summarization, post-conference workshop of ANLP-NAACL-2000, Seattle, WA, 2000.
- WAS. 2001. Workshop on Automatic Summarization, pre-conference workshop of NAACL-2001, Pittsburgh, PA, 2001.
- WAS. 2002. Workshop on Automatic Summarization, post-conference workshop of ACL-2002, Philadelphia, PA, 2002.