Maximum-entropy word alignment and posterior-based phrase extraction for machine translation

Published in: Machine Translation

Abstract

One of the fundamental assumptions in statistical machine translation (SMT) is that the correspondence between a sentence and its translation can be explained in terms of an alignment between their words. Such alignment information is typically not observed in the parallel corpora used to build the phrase table of an SMT system. Therefore, it is customary to estimate a probabilistic model of the assumed hidden word alignment, which is then used to extract bilingual phrase pairs. In standard extraction heuristics, the alignment model is under-exploited as the only information used from the posterior distribution is the Viterbi best alignment. This is due to the high computational complexity of the IBM models, which are the de facto standard for computing these alignments. Note that these models have other limitations, including their asymmetry and their inability to integrate rich, feature-based, descriptions. We argue that refining the word alignment model in a discriminative maximum-entropy framework substantially improves the alignment quality. We also show that these improved alignments combined with efficient and accurate computation of the link posterior distributions can also improve the overall translation performance, especially when applying posterior-based extraction methods.


Notes

  1. Under IBM model 3, for instance, each target word is generated from one input word according to the following process: first, the “fertility” of each input word, which represents the number of target words to be generated from it, is selected; then, each input word is translated a number of times corresponding to its fertility; finally, the words in the translation are reordered.
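
As a toy illustration of this three-step generative story, the following sketch uses invented fertility and translation tables (all words and probabilities are hypothetical, and reordering is simulated as a random shuffle rather than a learned distortion model):

```python
import random

# Hypothetical fertility and translation tables for a toy French source.
fertility = {"la": [1], "maison": [1], "bleue": [1, 2]}
translations = {"la": ["the"], "maison": ["house"], "bleue": ["blue"]}

def generate(source):
    target = []
    for word in source:
        phi = random.choice(fertility[word])        # 1) select a fertility
        for _ in range(phi):                        # 2) translate it phi times
            target.append(random.choice(translations[word]))
    random.shuffle(target)                          # 3) reorder the target words
    return target

print(generate(["la", "maison", "bleue"]))
```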

  2. In this context, the term “phrase” has no specific linguistic meaning.

  3. The probability \(p(\mathbf{T}|\mathbf{S})\) can be computed by summing over all possible derivations. In practice, this sum is approximated by the probability of the single best derivation, which keeps the computation tractable.
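
As a numeric sketch with invented derivation probabilities, the gap between the exact sum and the single-best approximation looks like this:

```python
# Invented values of p(T, d | S) for three derivations d of the same sentence pair.
derivations = [0.20, 0.15, 0.05]

exact = sum(derivations)    # exact p(T|S): sum over all derivations
viterbi = max(derivations)  # tractable approximation: keep only the best derivation

print(exact, viterbi)  # here the approximation recovers only half of the exact mass
```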

  4. Similar to reference translations, reference alignments are obtained by manual annotation.

  5. AER was originally defined to distinguish “sure” from “possible” links. Since our manual alignments contain only “sure” links, we drop this distinction and AER reduces to balanced \(F\)-measure.
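
With only "sure" links, AER can therefore be computed as one minus the balanced F-measure over link sets. A minimal sketch (the link pairs below are invented for the example):

```python
def aer(reference, hypothesis):
    """AER with sure links only, i.e. 1 - F1 over sets of (source, target) links."""
    matches = len(set(reference) & set(hypothesis))
    precision = matches / len(hypothesis)
    recall = matches / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return 1.0 - f1

ref = {(0, 0), (1, 1), (2, 2)}   # hypothetical manual alignment
hyp = {(0, 0), (1, 1), (2, 3)}   # hypothetical automatic alignment
print(aer(ref, hyp))             # precision = recall = 2/3, so AER = 1/3
```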

  6. Given a loss function \(L(A, A')\) and a probability distribution \(\mathrm{Pr}(A|\mathbf{S},\mathbf{T})\) over all possible link sets \(A\in \mathcal{A}\), the MBR decoder is given by: \({\hat{A}} = \mathop {\mathrm{arg\,min}}\nolimits _{A' \in \mathcal{A}} \sum _{A\in \mathcal{A}} L(A, A')\, \mathrm{Pr}(A|\mathbf{S},\mathbf{T})\). Substituting the actual alignment loss, \(L(A, A') = |A| + |A'| - 2 \sum _{l \in A} \sum _{l' \in A'} {1\!\!1}_{\{l\}}(l')\), into this equation yields: \({\hat{A}} = \mathop {\mathrm{arg\,min}}\nolimits _{A' \in \mathcal{A}} \sum _{l' \in A'} (1 - 2\, \mathrm{Pr}(l'|\mathbf{S},\mathbf{T}))\). This sum is minimized by including exactly those links whose posterior probability exceeds 0.5.
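
A direct consequence is that MBR decoding under this loss reduces to thresholding link posteriors at 0.5; a sketch with invented posterior values:

```python
def mbr_alignment(link_posteriors, threshold=0.5):
    # Keep every link whose posterior exceeds 0.5: each such link contributes
    # a negative term (1 - 2 * Pr) to the sum being minimized.
    return {link for link, p in link_posteriors.items() if p > threshold}

# Hypothetical posteriors Pr(l' | S, T) for candidate links (i, j).
post = {(0, 0): 0.9, (0, 1): 0.2, (1, 1): 0.6, (1, 2): 0.45}
print(sorted(mbr_alignment(post)))  # [(0, 0), (1, 1)]
```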

  7. The partial derivative of the \(\ell ^1\) regularizer with respect to each parameter is constant in magnitude: this means that gradient-based optimization can push the values of useless parameters to exactly zero. The \(\ell ^2\) regularizer, by contrast, shrinks parameter values less and less as they approach zero, producing parameters that are often very close to, but never exactly equal to, zero.
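
A minimal numeric illustration of this contrast (the step size, penalty strength, and proximal update for \(\ell ^1\) are chosen for the example; a useless parameter is modeled as one whose data gradient is zero):

```python
def l1_step(w, grad, lr, lam):
    # gradient step followed by soft-thresholding: the constant-magnitude
    # l1 penalty can move a weight to exactly zero
    w -= lr * grad
    return max(w - lr * lam, 0.0) if w > 0 else min(w + lr * lam, 0.0)

def l2_step(w, grad, lr, lam):
    # the l2 penalty shrinks a weight proportionally, so it never reaches zero
    return w - lr * (grad + 2.0 * lam * w)

w_l1 = w_l2 = 0.1   # a useless feature: its data gradient is zero
for _ in range(50):
    w_l1 = l1_step(w_l1, grad=0.0, lr=0.1, lam=1.0)
    w_l2 = l2_step(w_l2, grad=0.0, lr=0.1, lam=1.0)

print(w_l1, w_l2)   # l1 weight is exactly 0.0; l2 weight is tiny but nonzero
```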

  8. Generated, for instance, with any of the IBM models.

  9. In our experiments, we set \(w=2\) which resulted in the best AER performance on the development set.

  10. In the experiments, we set \(L=100\).

  11. In our experiments we use suffixes and prefixes of length 1–3.

  12. All Arabic transliterations are provided in the Buckwalter transliteration scheme (Habash et al. 2007).

  13. In our experiments, we set \(w=1\).

  14. http://wapiti.limsi.fr.

  15. http://nlp.stanford.edu/software/tagger.shtml.

  16. http://www.kyloo.net/software/doku.php/mgiza:overview.

  17. http://www.nlp.ict.ac.cn/~liuyang/wam/wam.html.

  18. http://www.seas.upenn.edu/~strctlrn/CAT/CAT.html.

  19. Implemented by Jan Niehues from the Karlsruhe Institute of Technology (KIT).

  20. For more details on Arabic processing, we refer the reader to Habash (2010).

  21. http://www1.ccls.columbia.edu/MADA/index.html.

  22. The D2 scheme tokenizes question, conjunction and preposition clitics, and uses “\(+\)” as a clitic marker. It also normalizes alef and yaa. We refer the reader to MADA\(+\)TOKAN manual for detailed information http://www1.ccls.columbia.edu/MADA/CCLS-12-01.pdf.

  23. http://www.statmt.org/moses.

  24. http://code.google.com/p/geppetto.

  25. BLEU values are between 0 and 1; we follow standard practice and consistently report 100 \(\times \) BLEU scores.

  26. For all the systems we evaluated on this test set, a difference in BLEU score \(\ge \)0.5 % corresponds to a \(p\) value \(\le \)0.001. Statistical significance is computed using paired bootstrap resampling (Koehn 2004).
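
The paired bootstrap test can be sketched as follows, here over per-sentence quality scores rather than corpus-level BLEU (the scores and sample count below are placeholders):

```python
import random

def paired_bootstrap(scores_a, scores_b, samples=1000, seed=0):
    """Fraction of bootstrap resamples in which system A outscores system B
    (Koehn 2004); one minus this fraction approximates the p value for A > B."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]      # resample with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples

print(paired_bootstrap([1.0] * 20, [0.0] * 20))  # 1.0: A wins in every resample
```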

  27. These sentences were manually aligned by Thomas Schoenemann. They can be downloaded from http://user.phil-fak.uni-duesseldorf.de/~tosch/downloads.html.

  28. This data can be downloaded from http://www.statmt.org/wmt07/shared-task.html.

  29. For all the reported results, a difference in BLEU score \(\ge \)0.2 % corresponds to a \(p\) value \(\le \)0.001.

  30. We only consider Viterbi-based extraction, since the posterior-based extraction does not seem to help in large corpora settings (Sects. 5.4, 5.5). Additionally, the PostCAT implementation did not scale to the largest training corpus on a machine with 64 GB of RAM.

References

  • Andrew G, Gao J (2007) Scalable training of L1-regularized log-linear models. In: Proceedings of the 24th international conference on machine learning, New York, pp 33–40

  • Ayan NF, Dorr BJ (2006) A maximum entropy approach to combining word alignments. In: Proceedings of the human language technology conference of the NAACL. Main conference, New York City, pp 96–103

  • Berger AL, Pietra VJD, Pietra SAD (1996) A maximum entropy approach to natural language processing. Comput Ling 22:39–71

  • Birch A, Callison-Burch C, Osborne M, Koehn P (2006) Constraining the phrase-based, joint probability statistical translation model. In: Proceedings of the workshop on statistical machine translation, pp 154–157

  • Blunsom P, Cohn T (2006) Discriminative word alignment with conditional random fields. In: Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics, Sydney, pp 65–72

  • Brown PF, Pietra VJD, Pietra SAD, Mercer RL (1993) The mathematics of statistical machine translation: parameter estimation. Comput Ling 19(2):263–311

  • Cela E (1997) The quadratic assignment problem: theory and algorithms. Combinatorial optimization. Springer, Berlin

  • Cherry C, Lin D (2003) A probability model to improve word alignment. In: Hinrichs E, Roth D (eds) Proceedings of the 41st annual meeting of the association for computational linguistics, pp 88–95

  • Chiang D, Marton Y, Resnik P (2008) Online large-margin training of syntactic and structural translation features. In: Proceedings of the 2008 conference on empirical methods in natural language processing, Honolulu, pp 224–233

  • Crammer K, Dekel O, Keshet J, Shalev-Shwartz S, Singer Y (2006) Online passive-aggressive algorithms. J Mach Learn Res 7:551–585

  • Davis P, Xie Z, Small K (2007) All links are not the same: evaluating word alignments for statistical machine translation. In: Proceedings of the MT Summit XI, pp 119–126

  • de Gispert A, Pino J, Byrne W (2010) Hierarchical phrase-based translation grammars extracted from alignment posterior probabilities. In: Proceedings of the 2010 conference on empirical methods in natural language processing, Morristown, pp 545–554

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39(1):1–38

  • DeNero J, Klein D (2008) The complexity of phrase alignment problems. In: Proceedings of ACL-08: HLT. Short Papers, Columbus, pp 25–28

  • DeNero J, Klein D (2010) Discriminative modeling of extraction sets for machine translation. In: Proceedings of the 48th annual meeting of the association for computational linguistics, pp 1453–1463

  • DeNero J, Gillick D, Zhang J, Klein D (2006) Why generative phrase models underperform surface heuristics. In: Proceedings of the workshop on statistical machine translation, New York City, pp 31–38

  • Deng Y, Byrne W (2005) HMM word and phrase alignment for statistical machine translation. In: Proceedings of human language technology conference and conference on empirical methods in natural language processing, Vancouver, pp 169–176

  • Deng Y, Byrne WJ (2008) HMM word and phrase alignment for statistical machine translation. IEEE Trans Audio Speech Lang Process 16(3):494–507

  • Diab M, Hacioglu K, Jurafsky D (2004) Automatic tagging of Arabic text: from raw text to base phrase chunks. In: Proceedings of HLT-NAACL 2004: Short Papers, pp 149–152

  • Dougherty J, Kohavi R, Sahami M (1995) Supervised and unsupervised discretization of continuous features. In: ICML, pp 194–202

  • Dyer C (2009) Using a maximum entropy model to build segmentation lattices for MT. In: Proceedings of human language technologies: the 2009 annual conference of the North American chapter of the association for computational linguistics, Stroudsburg, pp 406–414

  • Dyer C, Clark JH, Lavie A, Smith NA (2011) Unsupervised word alignment with arbitrary features. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Portland, pp 409–419

  • Fraser A (2007) Improved word alignments for statistical machine translation. PhD Thesis, University of Southern California, Los Angeles, AAI3291804

  • Fraser A, Marcu D (2007) Measuring word alignment quality for statistical machine translation. Comput Ling 33(3):293–303

  • Galley M, Graehl J, Knight K, Marcu D, DeNeefe S, Wang W, Thayer I (2006) Scalable inference and training of context-rich syntactic translation models. In: Proceedings of the 21st ICCL and 44th ACL, Sydney, pp 961–968

  • Ganchev K, Graca JV, Taskar B (2008) Better alignments = better translations? In: Proceedings of ACL-08: HLT, Columbus, pp 986–993

  • Ganchev K, Graça JV, Gillenwater J, Taskar B (2010) Posterior regularization for structured latent variable models. J Mach Learn Res 11:2001–2049

  • Gao Q, Vogel S (2008) Parallel implementations of word alignment tool. In: SETQA-NLP ’08, pp 49–57

  • Graça JV, Ganchev K, Taskar B (2010) Learning tractable word alignment models with complex constraints. Comput Ling 36:481–504

  • Guzman F, Gao Q, Vogel S (2009) Reassessment of the role of phrase extraction. In: 12th MT Summit

  • Habash N (2007) Arabic morphological representations for machine translation. In: Soudi A, van den Bosch A, Neumann G, Ide N (eds) Arabic computational morphology, vol 38 of Text, speech and language technology. Springer, Berlin, pp 263–285

  • Habash N (2010) Introduction to Arabic natural language processing. Synthesis lectures on human language technologies. Morgan & Claypool Publishers, San Rafael

  • Habash N, Sadat F (2006) Arabic preprocessing schemes for statistical machine translation. In: NAACL-HLT, Morristown, pp 49–52

  • Habash N, Soudi A, Buckwalter T (2007) On Arabic transliteration. In: Arabic computational morphology, vol 38 of Text, speech and language technology, chap 2. Springer, Berlin, pp 15–22

  • Habash N, Rambow O, Roth R (2009) MADA+TOKAN: a toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In: Maegaard B, Choukri K (eds) Proceedings of the second international conference on Arabic language resources and tools, Cairo, pp 102–109

  • Hopkins M, Langmead G, Vo T (2011) Extraction programs: a unified approach to translation rule extraction. In: Proceedings of the sixth workshop on statistical machine translation, Edinburgh, pp 523–532

  • Ittycheriah A, Roukos S (2005) A maximum entropy word aligner for Arabic–English machine translation. In: Proceedings of human language technology conference and conference on empirical methods in natural language processing, Vancouver, pp 89–96

  • Ittycheriah A, Al-Onaizan Y, Roukos S (2006) The IBM Arabic–English word alignment corpus. Technical report

  • Koehn P (2004) Statistical significance tests for machine translation evaluation. In: Lin D, Wu D (eds) Proceedings of EMNLP 2004, Barcelona, pp 388–395

  • Koehn P, Och FJ, Marcu D (2003) Statistical phrase-based translation. In: Proceedings of NAACL-HLT 2003, pp 48–54

  • Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques. MIT Press, Cambridge

  • Kumar S, Byrne W (2002) Minimum Bayes-Risk word alignments of bilingual texts. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), Philadelphia, pp 140–147

  • Lacoste-Julien S, Taskar B, Klein D, Jordan MI (2006) Word alignment via quadratic assignment. In: Proceedings of the human language technology conference of the NAACL. Main conference, New York City, pp 112–119

  • Lambert P, Banchs RE, Crego JM (2007) Discriminative alignment training without annotated data for machine translation. In: Human language technologies 2007: the conference of the North American chapter of the association for computational linguistics; companion volume. Short Papers, Rochester, pp 85–88

  • Lambert P, Ma Y, Ozdowska S, Way A (2009) Tracking relevant alignment characteristics for machine translation. In: Proceedings of MT Summit XII. Ottawa, pp 268–275

  • Lambert P, Petitrenaud S, Ma Y, Way A (2012) What types of word alignment improve statistical machine translation? Mach Transl 26(4):289–323

  • Langlais P, Simard M, Véronis J (1998) Methods and practical issues in evaluating alignment techniques. In: Proceedings of the 36th annual meeting of the association of computational linguistics (ACL), pp 711–717

  • Lavergne T, Cappé O, Yvon F (2010) Practical very large scale CRFs. In: Proceedings the 48th annual meeting of the association for computational linguistics (ACL), pp 504–513

  • Liang P, Bouchard-Côté A, Klein D, Taskar B (2006) An end-to-end discriminative approach to machine translation. In: Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the association for computational linguistics, pp 761–768

  • Ling W, Luís T, Graça J, Coheur L, Trancoso I (2010) Towards a general and extensible phrase-extraction algorithm. In: Federico M, Lane I, Paul M, Yvon F (eds) Proceedings of the seventh international workshop on spoken language, translation (IWSLT), pp 313–320

  • Liu DC, Nocedal J (1989) On the limited memory BFGS method for large scale optimization. Math Program 45(3):503–528

  • Liu Y, Xia T, Xiao X, Liu Q (2009) Weighted alignment matrices for statistical machine translation. In: Proceedings of the 2009 conference on empirical methods in natural language processing: vol 2, Morristown, pp 1017–1026

  • Liu Y, Liu Q, Lin S (2010) Discriminative word alignment by linear modeling. Comput Ling 36(3):303–339

  • Mi H, Huang L (2008) Forest-based translation rule extraction. In: Proceedings of the conference on empirical methods in natural language processing, Morristown, pp 206–214

  • Moore RC (2005) A discriminative framework for bilingual word alignment. In: Proceedings of human language technology conference and conference on empirical methods in natural language processing, Vancouver, pp 81–88

  • Niehues J, Vogel S (2008) Discriminative word alignment via alignment matrix modeling. In: Proceedings of the third workshop on statistical machine translation, Columbus, pp 18–25

  • Och FJ (2003) Minimum error rate training in statistical machine translation. In: Hinrichs E, Roth D (eds) Proceedings of the 41st annual meeting of the association for computational linguistics, pp 160–167

  • Och FJ, Ney H (2000) Improved statistical alignment models. In: Proceedings of the 38th annual meeting on association for computational linguistics, Stroudsburg, pp 440–447

  • Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Ling 29:19–51

  • Och FJ, Ney H (2004) The alignment template approach to statistical machine translation. Comput Ling 30(4):417–449

  • Papineni K, Roukos S, Ward T, Zhu W-J (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting on ACL, pp 311–318

  • Riesa J, Marcu D (2010) Hierarchical search for word alignment. In: Proceedings of the 48th annual meeting of the association for computational linguistics, pp 157–166

  • Roth R, Rambow O, Habash N, Diab M, Rudin C (2008) Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In: Proceedings of ACL-08: HLT. Short Papers, Columbus, pp 117–120

  • Søgaard A (2010) Can inversion transduction grammars generate hand alignments? In: Proceedings of the 14th annual conference of the European association for machine translation (EAMT), St. Raphael

  • Søgaard A, Wu D (2009) Empirical lower bounds on translation unit error rate for the full class of inversion transduction grammars. In: Proceedings of the 11th international conference on parsing technologies, pp 33–36

  • Stolcke A (2002) SRILM—an extensible language modeling toolkit. In: Proceedings of the international conference on spoken language processing (ICSLP), vol 2, Denver, pp 901–904

  • Sutton C, McCallum A, Rohanimanesh K (2007) Dynamic conditional random fields: factorized probabilistic models for labeling and segmenting sequence data. J Mach Learn Res 8:693–723

  • Taskar B, Lacoste-Julien S, Klein D (2005) A discriminative matching approach to word alignment. In: Proceedings of human language technology conference and conference on empirical methods in natural language processing, Vancouver, pp 73–80

  • Tomeh N, Allauzen A, Wisniewski G, Yvon F (2010) Refining word alignment with discriminative training. In: Proceedings of the ninth conference of the association for machine translation in the America (AMTA), Denver

  • Tomeh N, Allauzen A, Lavergne T, Yvon F (2011a) Designing an improved discriminative word aligner. Int J Comput Ling Appl (IJCLA)

  • Tomeh N, Allauzen A, Yvon F (2011b) Discriminative weighted alignment matrices for statistical machine translation. In: Forcada M, Depraetere H (eds) Proceedings of the European conference on machine translation, Leuven, pp 305–312

  • Venugopal A, Zollmann A, Smith NA, Vogel S (2008) Wider pipelines: N-best alignments and parses in MT training. In: Proceedings of the association for machine translation in the Americas (AMTA)

  • Vilar D, Popovic M, Ney H (2006) AER: do we need to “improve” our alignments? In: Proceedings of the international workshop on spoken language translation, Kyoto, pp 205–212

  • Vogel S, Ney H, Tillmann C (1996) HMM-based word alignment in statistical translation. In: Proceedings of the 16th conference on computational linguistics—vol 2, Stroudsburg, pp 836–841

  • Wang W, Knight K, Marcu D (2007) Binarizing syntax trees to improve syntax-based machine translation accuracy. In: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), pp 746–754

  • Wolpert DH (1992) Original contribution: stacked generalization. Neural Netw 5(2):241–259

  • Xue Y-Z, Li S, Zhao T-J, Yang M-Y, Li J (2006) Bilingual phrase extraction from N-best alignments. In: ICICIC, pp 410–414

  • Zaidan OF (2009) Z-MERT: a fully configurable open source tool for minimum error rate training of machine translation systems. Prague Bull Math Ling 91:79–88

  • Zens R, Och FJ, Ney H (2002) Phrase-based statistical machine translation. In: Proceedings of the German conference on artificial intelligence (KI 2002), pp 18–32

Acknowledgments

We are grateful to the anonymous reviewers for their precise analysis and their very insightful comments which have greatly helped us improve the article. All remaining mistakes are ours. We also wish to thank Jan Niehues for sharing his alignment toolkit, and Nizar Habash, Thomas Lavergne and Guillaume Wisniewski for valuable discussions. This work was partly funded by the French agency for innovation OSEO under the Quaero Program.

Author information

Correspondence to Nadi Tomeh.

About this article

Cite this article

Tomeh, N., Allauzen, A. & Yvon, F. Maximum-entropy word alignment and posterior-based phrase extraction for machine translation. Machine Translation 28, 19–56 (2014). https://doi.org/10.1007/s10590-013-9146-4
