Maximum-entropy word alignment and posterior-based phrase extraction for machine translation

Published in: Machine Translation

Abstract

One of the fundamental assumptions in statistical machine translation (SMT) is that the correspondence between a sentence and its translation can be explained in terms of an alignment between their words. Such alignment information is typically not observed in the parallel corpora used to build the phrase table of an SMT system. Therefore, it is customary to estimate a probabilistic model of the assumed hidden word alignment, which is then used to extract bilingual phrase pairs. In standard extraction heuristics, the alignment model is under-exploited as the only information used from the posterior distribution is the Viterbi best alignment. This is due to the high computational complexity of the IBM models, which are the de facto standard for computing these alignments. Note that these models have other limitations, including their asymmetry and their inability to integrate rich, feature-based, descriptions. We argue that refining the word alignment model in a discriminative maximum-entropy framework substantially improves the alignment quality. We also show that these improved alignments combined with efficient and accurate computation of the link posterior distributions can also improve the overall translation performance, especially when applying posterior-based extraction methods.


Notes

  1. Under IBM model 3, for instance, each target word is generated from one input word according to the following process: first, the “fertility” of each input word, which represents the number of target words to be generated from it, is selected; then, each input word is translated a number of times corresponding to its fertility; finally, the words in the translation are reordered.
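
As a toy illustration of this three-step generative story, the following sketch uses invented fertility and translation tables (all words and probabilities are hypothetical, and reordering is simulated as a random shuffle rather than a learned distortion model):

```python
import random

# Hypothetical fertility and translation tables for a toy French source.
fertility = {"la": [1], "maison": [1], "bleue": [1, 2]}
translations = {"la": ["the"], "maison": ["house"], "bleue": ["blue"]}

def generate(source):
    target = []
    for word in source:
        phi = random.choice(fertility[word])        # 1) select a fertility
        for _ in range(phi):                        # 2) translate it phi times
            target.append(random.choice(translations[word]))
    random.shuffle(target)                          # 3) reorder the target words
    return target

print(generate(["la", "maison", "bleue"]))
```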

  2. In this context, the term “phrase” has no specific linguistic meaning.

  3. The probability \(p(\mathbf{T}|\mathbf{S})\) can be computed by summing over all possible derivations. In practice, this sum is approximated by the probability of the single best derivation, which keeps the computation tractable.
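
As a numeric sketch with invented derivation probabilities, the gap between the exact sum and the single-best approximation looks like this:

```python
# Invented values of p(T, d | S) for three derivations d of the same sentence pair.
derivations = [0.20, 0.15, 0.05]

exact = sum(derivations)    # exact p(T|S): sum over all derivations
viterbi = max(derivations)  # tractable approximation: keep only the best derivation

print(exact, viterbi)  # here the approximation recovers only half of the exact mass
```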

  4. Similar to reference translations, reference alignments are obtained by manual annotation.

  5. AER was originally defined to distinguish “sure” from “possible” links. Since our manual alignments contain only “sure” links, we drop this distinction and AER reduces to balanced \(F\)-measure.
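
With only "sure" links, AER can therefore be computed as one minus the balanced F-measure over link sets. A minimal sketch (the link pairs below are invented for the example):

```python
def aer(reference, hypothesis):
    """AER with sure links only, i.e. 1 - F1 over sets of (source, target) links."""
    matches = len(set(reference) & set(hypothesis))
    precision = matches / len(hypothesis)
    recall = matches / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return 1.0 - f1

ref = {(0, 0), (1, 1), (2, 2)}   # hypothetical manual alignment
hyp = {(0, 0), (1, 1), (2, 3)}   # hypothetical automatic alignment
print(aer(ref, hyp))             # precision = recall = 2/3, so AER = 1/3
```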

  6. Given a loss function \(L(A, A')\) and a probability distribution \(\mathrm{Pr}(A|\mathbf{S},\mathbf{T})\) over all possible link sets \(A\in \mathcal{A}\), the MBR decoder is given by: \({\hat{A}} = \mathop {\mathrm{arg\,min}}\nolimits _{A' \in \mathcal{A}} \sum _{A\in \mathcal{A}} L(A, A')\, \mathrm{Pr}(A|\mathbf{S},\mathbf{T})\). Substituting the actual alignment loss, \(L(A, A') = |A| + |A'| - 2 \sum _{l \in A} \sum _{l' \in A'} {1\!\!1}_{\{l\}}(l')\), into this equation yields: \({\hat{A}} = \mathop {\mathrm{arg\,min}}\nolimits _{A' \in \mathcal{A}} \sum _{l' \in A'} (1 - 2\, \mathrm{Pr}(l'|\mathbf{S},\mathbf{T}))\). This sum is minimized by including exactly those links whose posterior probability exceeds 0.5.
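
A direct consequence is that MBR decoding under this loss reduces to thresholding link posteriors at 0.5; a sketch with invented posterior values:

```python
def mbr_alignment(link_posteriors, threshold=0.5):
    # Keep every link whose posterior exceeds 0.5: each such link contributes
    # a negative term (1 - 2 * Pr) to the sum being minimized.
    return {link for link, p in link_posteriors.items() if p > threshold}

# Hypothetical posteriors Pr(l' | S, T) for candidate links (i, j).
post = {(0, 0): 0.9, (0, 1): 0.2, (1, 1): 0.6, (1, 2): 0.45}
print(sorted(mbr_alignment(post)))  # [(0, 0), (1, 1)]
```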

  7. The partial derivative of the \(\ell ^1\) regularizer with respect to each parameter is constant in magnitude: this means that gradient-based optimization can push the values of useless parameters to exactly zero. The \(\ell ^2\) regularizer, by contrast, shrinks parameter values less and less as they approach zero, producing parameters that are often very close to, but never exactly equal to, zero.
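
A minimal numeric illustration of this contrast (the step size, penalty strength, and proximal update for \(\ell ^1\) are chosen for the example; a useless parameter is modeled as one whose data gradient is zero):

```python
def l1_step(w, grad, lr, lam):
    # gradient step followed by soft-thresholding: the constant-magnitude
    # l1 penalty can move a weight to exactly zero
    w -= lr * grad
    return max(w - lr * lam, 0.0) if w > 0 else min(w + lr * lam, 0.0)

def l2_step(w, grad, lr, lam):
    # the l2 penalty shrinks a weight proportionally, so it never reaches zero
    return w - lr * (grad + 2.0 * lam * w)

w_l1 = w_l2 = 0.1   # a useless feature: its data gradient is zero
for _ in range(50):
    w_l1 = l1_step(w_l1, grad=0.0, lr=0.1, lam=1.0)
    w_l2 = l2_step(w_l2, grad=0.0, lr=0.1, lam=1.0)

print(w_l1, w_l2)   # l1 weight is exactly 0.0; l2 weight is tiny but nonzero
```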

  8. Generated, for instance, with any of the IBM models.

  9. In our experiments, we set \(w=2\) which resulted in the best AER performance on the development set.

  10. In the experiments, we set \(L=100\).

  11. In our experiments we use suffixes and prefixes of length 1–3.

  12. All Arabic transliterations are provided in the Buckwalter transliteration scheme (Habash et al. 2007).

  13. In our experiments, we set \(w=1\).

  14. http://wapiti.limsi.fr.

  15. http://nlp.stanford.edu/software/tagger.shtml.

  16. http://www.kyloo.net/software/doku.php/mgiza:overview.

  17. http://www.nlp.ict.ac.cn/~liuyang/wam/wam.html.

  18. http://www.seas.upenn.edu/~strctlrn/CAT/CAT.html.

  19. Implemented by Jan Niehues from the Karlsruhe Institute of Technology (KIT).

  20. For more details on Arabic processing, we refer the reader to Habash (2010).

  21. http://www1.ccls.columbia.edu/MADA/index.html.

  22. The D2 scheme tokenizes question, conjunction and preposition clitics, and uses “\(+\)” as a clitic marker. It also normalizes alef and yaa. We refer the reader to MADA\(+\)TOKAN manual for detailed information http://www1.ccls.columbia.edu/MADA/CCLS-12-01.pdf.

  23. http://www.statmt.org/moses.

  24. http://code.google.com/p/geppetto.

  25. BLEU values are between 0 and 1; we follow standard practice and consistently report 100 \(\times \) BLEU scores.

  26. For all the systems we evaluated on this test set, a difference in BLEU score \(\ge \)0.5 % corresponds to a \(p\) value \(\le \)0.001. Statistical significance is computed using paired bootstrap resampling (Koehn 2004).
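
The paired bootstrap test can be sketched as follows, here over per-sentence quality scores rather than corpus-level BLEU (the scores and sample count below are placeholders):

```python
import random

def paired_bootstrap(scores_a, scores_b, samples=1000, seed=0):
    """Fraction of bootstrap resamples in which system A outscores system B
    (Koehn 2004); one minus this fraction approximates the p value for A > B."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]      # resample with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples

print(paired_bootstrap([1.0] * 20, [0.0] * 20))  # 1.0: A wins in every resample
```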

  27. These sentences were manually aligned by Thomas Schoenemann. They can be downloaded from http://user.phil-fak.uni-duesseldorf.de/~tosch/downloads.html.

  28. This data can be downloaded from http://www.statmt.org/wmt07/shared-task.html.

  29. For all the reported results, a difference in BLEU score \(\ge \)0.2 % corresponds to a \(p\) value \(\le \)0.001.

  30. We only consider Viterbi-based extraction, since the posterior-based extraction does not seem to help in large corpora settings (Sects. 5.4, 5.5). Additionally, the PostCAT implementation did not scale to the largest training corpus on a machine with 64 GB of RAM.

References

  • Andrew G, Gao J (2007) Scalable training of L1-regularized log-linear models. In: Proceedings of the 24th international conference on machine learning, New York, pp 33–40

  • Ayan NF, Dorr BJ (2006) A maximum entropy approach to combining word alignments. In: Proceedings of the human language technology conference of the NAACL. Main conference, New York City, pp 96–103

  • Berger AL, Pietra VJD, Pietra SAD (1996) A maximum entropy approach to natural language processing. Comput Ling 22:39–71

  • Birch A, Callison-Burch C, Osborne M, Koehn P (2006) Constraining the phrase-based, joint probability statistical translation model. In: Proceedings of the workshop on statistical machine translation, pp 154–157

  • Blunsom P, Cohn T (2006) Discriminative word alignment with conditional random fields. In: Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics, Sydney, pp 65–72

  • Brown PF, Pietra VJD, Pietra SAD, Mercer RL (1993) The mathematics of statistical machine translation: parameter estimation. Comput Ling 19(2):263–311

  • Cela E (1997) The quadratic assignment problem: theory and algorithms. Combinatorial optimization. Springer, Berlin

  • Cherry C, Lin D (2003) A probability model to improve word alignment. In: Hinrichs E, Roth D (eds) Proceedings of the 41st annual meeting of the association for computational linguistics, pp 88–95

  • Chiang D, Marton Y, Resnik P (2008) Online large-margin training of syntactic and structural translation features. In: Proceedings of the 2008 conference on empirical methods in natural language processing, Honolulu, pp 224–233

  • Crammer K, Dekel O, Keshet J, Shalev-Shwartz S, Singer Y (2006) Online passive-aggressive algorithms. J Mach Learn Res 7:551–585

  • Davis P, Xie Z, Small K (2007) All links are not the same: evaluating word alignments for statistical machine translation. In: Proceedings of the MT Summit XI, pp 119–126

  • de Gispert A, Pino J, Byrne W (2010) Hierarchical phrase-based translation grammars extracted from alignment posterior probabilities. In: Proceedings of the 2010 conference on empirical methods in natural language processing, Morristown, pp 545–554

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39(1):1–38

  • DeNero J, Klein D (2008) The complexity of phrase alignment problems. In: Proceedings of ACL-08: HLT. Short Papers, Columbus, pp 25–28

  • DeNero J, Klein D (2010) Discriminative modeling of extraction sets for machine translation. In: Proceedings of the 48th annual meeting of the association for computational linguistics, pp 1453–1463

  • DeNero J, Gillick D, Zhang J, Klein D (2006) Why generative phrase models underperform surface heuristics. In: Proceedings of the workshop on statistical machine translation, New York City, pp 31–38

  • Deng Y, Byrne W (2005) HMM word and phrase alignment for statistical machine translation. In: Proceedings of human language technology conference and conference on empirical methods in natural language processing, Vancouver, pp 169–176

  • Deng Y, Byrne WJ (2008) HMM word and phrase alignment for statistical machine translation. IEEE Trans Audio Speech Lang Process 16(3):494–507

  • Diab M, Hacioglu K, Jurafsky D (2004) Automatic tagging of Arabic text: from raw text to base phrase chunks. In: Proceedings of HLT-NAACL 2004: Short Papers, pp 149–152

  • Dougherty J, Kohavi R, Sahami M (1995) Supervised and unsupervised discretization of continuous features. In: ICML, pp 194–202

  • Dyer C (2009) Using a maximum entropy model to build segmentation lattices for MT. In: Proceedings of human language technologies: the 2009 annual conference of the North American chapter of the association for computational linguistics, Stroudsburg, pp 406–414

  • Dyer C, Clark JH, Lavie A, Smith NA (2011) Unsupervised word alignment with arbitrary features. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Portland, pp 409–419

  • Fraser A (2007) Improved word alignments for statistical machine translation. PhD Thesis, University of Southern California, Los Angeles, AAI3291804

  • Fraser A, Marcu D (2007) Measuring word alignment quality for statistical machine translation. Comput Ling 33(3):293–303

  • Galley M, Graehl J, Knight K, Marcu D, DeNeefe S, Wang W, Thayer I (2006) Scalable inference and training of context-rich syntactic translation models. In: Proceedings of the 21st ICCL and 44th ACL, Sydney, pp 961–968

  • Ganchev K, Graca JV, Taskar B (2008) Better alignments = better translations? In: Proceedings of ACL-08: HLT, Columbus, pp 986–993

  • Ganchev K, Graça JV, Gillenwater J, Taskar B (2010) Posterior regularization for structured latent variable models. J Mach Learn Res 11:2001–2049

  • Gao Q, Vogel S (2008) Parallel implementations of word alignment tool. In: SETQA-NLP ’08, pp 49–57

  • Graça JV, Ganchev K, Taskar B (2010) Learning tractable word alignment models with complex constraints. Comput Ling 36:481–504

  • Guzman F, Gao Q, Vogel S (2009) Reassessment of the role of phrase extraction. In: 12th MT Summit

  • Habash N (2007) Arabic morphological representations for machine translation. In: Soudi A, van den Bosch A, Neumann G, Ide N (eds) Arabic computational morphology, vol 38 of Text, speech and language technology. Springer, Berlin, pp 263–285

  • Habash N (2010) Introduction to Arabic natural language processing. Synthesis lectures on human language technologies. Morgan & Claypool Publishers, San Rafael

  • Habash N, Sadat F (2006) Arabic preprocessing schemes for statistical machine translation. In: NAACL-HLT, Morristown, pp 49–52

  • Habash N, Soudi A, Buckwalter T (2007) On Arabic transliteration. In: Arabic computational morphology, vol 38 of Text, speech and language technology, chap 2. Springer, Berlin, pp 15–22

  • Habash N, Rambow O, Roth R (2009) MADA+TOKAN: a toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In: Maegaard B, Choukri K (eds) Proceedings of the second international conference on Arabic language resources and tools, Cairo, pp 102–109

  • Hopkins M, Langmead G, Vo T (2011) Extraction programs: a unified approach to translation rule extraction. In: Proceedings of the sixth workshop on statistical machine translation, Edinburgh, pp 523–532

  • Ittycheriah A, Roukos S (2005) A maximum entropy word aligner for Arabic–English machine translation. In: Proceedings of human language technology conference and conference on empirical methods in natural language processing, Vancouver, pp 89–96

  • Ittycheriah A, Al-Onaizan Y, Roukos S (2006) The IBM Arabic–English word alignment corpus. Technical report

  • Koehn P (2004) Statistical significance tests for machine translation evaluation. In: Lin D, Wu D (eds) Proceedings of EMNLP 2004, Barcelona, pp 388–395

  • Koehn P, Och FJ, Marcu D (2003) Statistical phrase-based translation. In: Proceedings of NAACL-HLT 2003, pp 48–54

  • Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques. MIT Press, Cambridge

  • Kumar S, Byrne W (2002) Minimum Bayes-Risk word alignments of bilingual texts. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), Philadelphia, pp 140–147

  • Lacoste-Julien S, Taskar B, Klein D, Jordan MI (2006) Word alignment via quadratic assignment. In: Proceedings of the human language technology conference of the NAACL. Main conference, New York City, pp 112–119

  • Lambert P, Banchs RE, Crego JM (2007) Discriminative alignment training without annotated data for machine translation. In: Human language technologies 2007: the conference of the North American chapter of the association for computational linguistics; companion volume. Short Papers, Rochester, pp 85–88

  • Lambert P, Ma Y, Ozdowska S, Way A (2009) Tracking relevant alignment characteristics for machine translation. In: Proceedings of MT Summit XII. Ottawa, pp 268–275

  • Lambert P, Petitrenaud S, Ma Y, Way A (2012) What types of word alignment improve statistical machine translation? Mach Transl 26(4):289–323

  • Langlais P, Simard M, Véronis J (1998) Methods and practical issues in evaluating alignment techniques. In: Proceedings of the 36th annual meeting of the association of computational linguistics (ACL), pp 711–717

  • Lavergne T, Cappé O, Yvon F (2010) Practical very large scale CRFs. In: Proceedings the 48th annual meeting of the association for computational linguistics (ACL), pp 504–513

  • Liang P, Bouchard-Côté A, Klein D, Taskar B (2006) An end-to-end discriminative approach to machine translation. In: Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the association for computational linguistics, pp 761–768

  • Ling W, Luís T, Graça J, Coheur L, Trancoso I (2010) Towards a general and extensible phrase-extraction algorithm. In: Federico M, Lane I, Paul M, Yvon F (eds) Proceedings of the seventh international workshop on spoken language, translation (IWSLT), pp 313–320

  • Liu DC, Nocedal J (1989) On the limited memory BFGS method for large scale optimization. Math Program 45(3):503–528

  • Liu Y, Xia T, Xiao X, Liu Q (2009) Weighted alignment matrices for statistical machine translation. In: Proceedings of the 2009 conference on empirical methods in natural language processing: vol 2, Morristown, pp 1017–1026

  • Liu Y, Liu Q, Lin S (2010) Discriminative word alignment by linear modeling. Comput Ling 36(3):303–339

  • Mi H, Huang L (2008) Forest-based translation rule extraction. In: Proceedings of the conference on empirical methods in natural language processing, Morristown, pp 206–214

  • Moore RC (2005) A discriminative framework for bilingual word alignment. In: Proceedings of human language technology conference and conference on empirical methods in natural language processing, Vancouver, pp 81–88

  • Niehues J, Vogel S (2008) Discriminative word alignment via alignment matrix modeling. In: Proceedings of the third workshop on statistical machine translation, Columbus, pp 18–25

  • Och FJ (2003) Minimum error rate training in statistical machine translation. In: Hinrichs E, Roth D (eds) Proceedings of the 41st annual meeting of the association for computational linguistics, pp 160–167

  • Och FJ, Ney H (2000) Improved statistical alignment models. In: Proceedings of the 38th annual meeting on association for computational linguistics, Stroudsburg, pp 440–447

  • Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Ling 29:19–51

  • Och FJ, Ney H (2004) The alignment template approach to statistical machine translation. Comput Ling 30(4):417–449

  • Papineni K, Roukos S, Ward T, Zhu W-J (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting on ACL, pp 311–318

  • Riesa J, Marcu D (2010) Hierarchical search for word alignment. In: Proceedings of the 48th annual meeting of the association for computational linguistics, pp 157–166

  • Roth R, Rambow O, Habash N, Diab M, Rudin C (2008) Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In: Proceedings of ACL-08: HLT. Short Papers, Columbus, pp 117–120

  • Søgaard A (2010) Can inversion transduction grammars generate hand alignments? In: Proceedings of the 14th annual conference of the European association for machine translation (EAMT), St. Raphael

  • Søgaard A, Wu D (2009) Empirical lower bounds on translation unit error rate for the full class of inversion transduction grammars. In: Proceedings of the 11th international conference on parsing technologies, pp 33–36

  • Stolcke A (2002) SRILM—an extensible language modeling toolkit. In: Proceedings of the international conference on spoken language processing (ICSLP), vol 2, Denver, pp 901–904

  • Sutton C, McCallum A, Rohanimanesh K (2007) Dynamic conditional random fields: factorized probabilistic models for labeling and segmenting sequence data. J Mach Learn Res 8:693–723

  • Taskar B, Lacoste-Julien S, Klein D (2005) A discriminative matching approach to word alignment. In: Proceedings of human language technology conference and conference on empirical methods in natural language processing, Vancouver, pp 73–80

  • Tomeh N, Allauzen A, Wisniewski G, Yvon F (2010) Refining word alignment with discriminative training. In: Proceedings of the ninth conference of the association for machine translation in the America (AMTA), Denver

  • Tomeh N, Allauzen A, Lavergne T, Yvon F (2011a) Designing an improved discriminative word aligner. Int J Comput Ling Appl (IJCLA)

  • Tomeh N, Allauzen A, Yvon F (2011b) Discriminative weighted alignment matrices for statistical machine translation. In: Forcada M, Depraetere H (eds) Proceedings of the European conference on machine translation, Leuven, pp 305–312

  • Venugopal A, Zollmann A, Smith NA, Vogel S (2008) Wider pipelines: N-best alignments and parses in MT training. In: Proceedings of the association for machine translation in the Americas (AMTA)

  • Vilar D, Popovic M, Ney H (2006) AER: do we need to “improve” our alignments? In: Proceedings of the international workshop on spoken language translation, Kyoto, pp 205–212

  • Vogel S, Ney H, Tillmann C (1996) HMM-based word alignment in statistical translation. In: Proceedings of the 16th conference on computational linguistics—vol 2, Stroudsburg, pp 836–841

  • Wang W, Knight K, Marcu D (2007) Binarizing syntax trees to improve syntax-based machine translation accuracy. In: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), pp 746–754

  • Wolpert DH (1992) Original contribution: stacked generalization. Neural Netw 5(2):241–259

  • Xue Y-Z, Li S, Zhao T-J, Yang M-Y, Li J (2006) Bilingual phrase extraction from N-best alignments. In: ICICIC, pp 410–414

  • Zaidan OF (2009) Z-MERT: a fully configurable open source tool for minimum error rate training of machine translation systems. Prague Bull Math Ling 91:79–88

  • Zens R, Och FJ, Ney H (2002) Phrase-based statistical machine translation. In: Proceedings of the German conference on artificial intelligence (KI 2002), pp 18–32

Acknowledgments

We are grateful to the anonymous reviewers for their precise analysis and their very insightful comments which have greatly helped us improve the article. All remaining mistakes are ours. We also wish to thank Jan Niehues for sharing his alignment toolkit, and Nizar Habash, Thomas Lavergne and Guillaume Wisniewski for valuable discussions. This work was partly funded by the French agency for innovation OSEO under the Quaero Program.

Author information

Correspondence to Nadi Tomeh.

About this article

Cite this article

Tomeh, N., Allauzen, A. & Yvon, F. Maximum-entropy word alignment and posterior-based phrase extraction for machine translation. Machine Translation 28, 19–56 (2014). https://doi.org/10.1007/s10590-013-9146-4
