
The Meteor metric for automatic evaluation of machine translation

Machine Translation

Abstract

The Meteor Automatic Metric for Machine Translation evaluation, originally developed and released in 2004, was designed with the explicit goal of producing sentence-level scores that correlate well with human judgments of translation quality. Several key design decisions were incorporated into Meteor in support of this goal. In contrast with IBM’s Bleu, which uses only precision-based features, Meteor uses and emphasizes recall in addition to precision, a property that several metric evaluation studies have confirmed to be critical for high correlation with human judgments. Meteor also addresses the problem of reference translation variability through flexible word matching, allowing morphological variants and synonyms to be counted as legitimate correspondences. Furthermore, the feature ingredients within Meteor are parameterized, allowing the metric’s free parameters to be tuned to values that yield optimal correlation with human judgments. Optimal parameters can be tuned separately for different types of human judgments and for different languages. We discuss the initial design of the Meteor metric, subsequent improvements, and performance in several independent evaluations in recent years.
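The scoring scheme the abstract describes can be made concrete with a small sketch. The Python below computes a Meteor-style sentence score under simplifying assumptions: unigrams are aligned by exact surface match only (the full metric also matches stems and WordNet synonyms, and searches for the alignment with the fewest chunks, whereas this version aligns greedily left to right), and the default values of alpha, beta, and gamma are illustrative placeholders for the tunable free parameters, not officially tuned values.

```python
def meteor_score(hypothesis: list[str], reference: list[str],
                 alpha: float = 0.9, beta: float = 3.0,
                 gamma: float = 0.5) -> float:
    """Meteor-style sentence score with exact unigram matching only.

    The full metric also matches stems and WordNet synonyms and picks
    the alignment with the fewest chunks; this sketch aligns greedily,
    left to right, and the parameter defaults are illustrative.
    """
    # Greedy one-to-one alignment: each hypothesis word claims the first
    # still-unclaimed reference position with the same surface form.
    positions: list[int | None] = []
    free = list(range(len(reference)))
    for word in hypothesis:
        for i in free:
            if reference[i] == word:
                positions.append(i)
                free.remove(i)
                break
        else:
            positions.append(None)

    matches = sum(1 for p in positions if p is not None)
    if matches == 0:
        return 0.0

    precision = matches / len(hypothesis)
    recall = matches / len(reference)
    # Parameterized harmonic mean: alpha near 1 weights recall heavily.
    fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)

    # A chunk is a maximal run of matches contiguous in both sentences;
    # fewer chunks means the matched words appear in the right order.
    chunks, prev = 0, None
    for p in positions:
        if p is not None and (prev is None or p != prev + 1):
            chunks += 1
        prev = p

    # Fragmentation penalty, approaching gamma as matches fragment.
    penalty = gamma * (chunks / matches) ** beta
    return fmean * (1.0 - penalty)


if __name__ == "__main__":
    hyp = "the cat sat on the mat".split()
    ref = "the cat was sitting on the mat".split()
    print(f"{meteor_score(hyp, ref):.3f}")  # ~0.70 with these defaults
```

Because alpha, beta, and gamma are exposed as arguments, the same code could be re-tuned against a corpus of human judgments, which is the parameterization property the abstract highlights; the flexible-matching modules (stemming, synonymy) would slot in where the exact-equality test sits.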


References

  • Agarwal A, Lavie A (2008) Meteor, m-bleu and m-ter: evaluation metrics for high-correlation with human rankings of machine translation output. In: Proceedings of the third ACL workshop on statistical machine translation. Columbus, OH, pp 115–118

  • Banerjee S, Lavie A (2005) METEOR: an automatic metric for MT Evaluation with improved correlation with human judgments. In: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. Ann Arbor, MI, pp 65–72

  • Blatz J, Fitzgerald E, Foster G, Gandrabur S, Goutte C, Kulesza A, Sanchis A, Ueffing N (2003) Confidence estimation for machine translation. Technical report, natural language engineering workshop final report, Johns Hopkins University

  • Callison-Burch C, Fordyce C, Koehn P, Monz C, Schroeder J (2007) (Meta-) evaluation of machine translation. In: Proceedings of the second workshop on statistical machine translation. Prague, Czech Republic, pp 136–158

  • Callison-Burch C, Fordyce C, Koehn P, Monz C, Schroeder J (2008) Further meta-evaluation of machine translation. In: Proceedings of the third workshop on statistical machine translation. Columbus, OH, pp 70–106

  • Callison-Burch C, Koehn P, Monz C, Schroeder J (2009) Findings of the 2009 workshop on statistical machine translation. In: Proceedings of the fourth workshop on statistical machine translation. Athens, Greece, pp 1–28

  • Heafield K, Hanneman G, Lavie A (2009) Machine translation system combination with flexible word ordering. In: Proceedings of the fourth workshop on statistical machine translation. Athens, Greece, pp 56–60

  • Lavie A, Agarwal A (2007) METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In: Proceedings of the second ACL workshop on statistical machine translation. Prague, Czech Republic, pp 228–231

  • Lavie A, Sagae K, Jayaraman S (2004) The significance of recall in automatic metrics for MT evaluation. In: Proceedings of the 6th conference of the association for machine translation in the Americas (AMTA-2004). Washington, DC, pp 134–143

  • Leusch G, Ueffing N, Ney H (2006) CDER: efficient MT evaluation using block movements. In: Proceedings of the thirteenth conference of the European chapter of the association for computational linguistics. pp 241–248

  • Melamed ID, Green R, Turian J (2003) Precision and recall of machine translation. In: Proceedings of the HLT-NAACL 2003 conference: short papers. Edmonton, Alberta, pp 61–63

  • Miller G, Fellbaum C (2007) WordNet. http://wordnet.princeton.edu/

  • Och FJ (2003) Minimum error rate training for statistical machine translation. In: Proceedings of the 41st annual meeting of the association for computational linguistics. pp 160–167

  • Papineni K, Roukos S, Ward T, Zhu W-J (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of 40th annual meeting of the association for computational linguistics (ACL). Philadelphia, PA, pp 311–318

  • Porter M (2001) Snowball: a language for stemming algorithms. http://snowball.tartarus.org/texts/introduction.html

  • Przybocki M, Peterson K, Bronsart S (2008) Official results of the NIST 2008 “Metrics for MAchine TRanslation” Challenge (MetricsMATR08). http://nist.gov/speech/tests/metricsmatr/2008/results/

  • Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: Proceedings of the 7th conference of the association for machine translation in the Americas (AMTA-2006). Cambridge, MA, pp 223–231

  • van Rijsbergen C (1979) Information retrieval, Chap. 7, 2nd edn. Butterworths, London, UK


  • Ye Y, Zhou M, Lin C-Y (2007) Sentence level machine translation evaluation as a ranking. In: Proceedings of the second workshop on statistical machine translation. Prague, Czech Republic, pp 240–247


Author information

Correspondence to Alon Lavie.


Cite this article

Lavie, A., Denkowski, M.J. The Meteor metric for automatic evaluation of machine translation. Machine Translation 23, 105–115 (2009). https://doi.org/10.1007/s10590-009-9059-4


