2014 | Book

Machine Translation

10th China Workshop, CWMT 2014, Macau, China, November 4-6, 2014. Proceedings

About this book

This book constitutes the refereed proceedings of the 10th China Workshop on Machine Translation, CWMT 2014, held in Macau, China, in November 2014. The 10 revised full English papers presented were carefully reviewed and selected from 15 submissions of English papers. The papers cover the following topics: machine translation; data selection; word segmentation; entity recognition; MT evaluation.

Table of Contents

Frontmatter
Making Language Model as Small as Possible in Statistical Machine Translation
Abstract
As one of the key components, the n-gram language model is most frequently used in statistical machine translation. Typically, a higher-order language model leads to better translation performance. However, a higher-order n-gram language model requires much more monolingual training data to avoid data sparseness. Furthermore, the model size increases exponentially as the n-gram order grows. In this paper, we investigate language model pruning techniques that aim at making the model size as small as possible while preserving translation quality. Based on our investigation, we further propose to replace the higher-order n-grams with a low-order cluster-based language model. Extensive experiments show that our method is very effective.
Yang Liu, Jiajun Zhang, Jie Hao, Dakun Zhang
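To illustrate the size problem this abstract describes, here is a minimal sketch that counts distinct n-grams per order on a toy corpus and then applies a simple count cutoff. The cutoff is a generic pruning technique chosen for illustration, not necessarily one the paper studies; the function names `ngram_counts` and `prune_by_count` and the toy corpus are assumptions.

```python
from collections import Counter

def ngram_counts(tokens, order):
    """Count every n-gram of the given order in a list of tokens."""
    return Counter(
        tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)
    )

def prune_by_count(counts, min_count):
    """Count-cutoff pruning (illustrative only): drop n-grams seen
    fewer than min_count times."""
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Toy corpus, invented for this sketch.
corpus = "the cat sat on the mat and the cat ate the fish".split()

for order in (1, 2, 3):
    counts = ngram_counts(corpus, order)
    kept = prune_by_count(counts, min_count=2)
    print(f"order={order} distinct={len(counts)} kept_after_cutoff={len(kept)}")
```

Even on twelve tokens, almost every higher-order n-gram occurs only once, which is the sparseness that motivates pruning or backing off to a lower-order model.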
Data Selection via Semi-supervised Recursive Autoencoders for SMT Domain Adaptation
Abstract
In this paper, we present a novel data selection approach based on semi-supervised recursive autoencoders. The model is trained to capture domain-specific features and is used to detect sentences relevant to a specific domain in a large general-domain corpus. The selected data are used for adapting the built language model and translation model to the target domain. Experiments were conducted on an in-domain corpus (IWSLT2014 Chinese-English TED Talk) and a general-domain corpus (UM-Corpus). We evaluated the proposed data selection model in both intrinsic and extrinsic evaluations to investigate the selection success rate (F-score) on pseudo data, as well as the translation quality (BLEU score) of the adapted SMT systems. Empirical results reveal that the proposed approach outperforms the state-of-the-art selection approach.
Yi Lu, Derek F. Wong, Lidia S. Chao, Longyue Wang
Effective Hypotheses Re-ranking Model in Statistical Machine Translation
Abstract
In statistical machine translation, an effective way to improve the translation quality is to regularize the posterior probabilities of translation hypotheses according to the information in the N-best list. In this paper, we present a novel approach to improve the final translation result by dynamically augmenting the translation scores of hypotheses derived from the N-best translation candidates. The proposed model was trained on the general-domain UM-Corpus and evaluated on IWSLT Chinese-English TED Talk data under document-level and sentence-level translation configurations, respectively. Empirical results reveal that the sentence-level translation model outperforms both the document-level model and the baseline system.
Yiming Wang, Longyue Wang, Derek F. Wong, Lidia S. Chao
Recognizing and Reordering the Translation Units in a Long NP for Chinese-English Patent Machine Translation
Abstract
This paper describes a rule-based method to identify and reorder the translation units (the smallest units for reordering) within a long Chinese NP for Chinese-English patent machine translation. By comparing the orders of translation units within long Chinese and English NPs, we developed a strategy for reordering the translation units in accordance with English usage. By analyzing the features of translation units within a long Chinese NP, we built formalized rules that use boundary words to recognize the boundaries of translation units and thus identify what to reorder. Finally, we used a rule-based MT system to test our work, and the experimental results showed that our rule-based method and strategy were very efficient.
Xiaodie Liu, Yun Zhu, Yaohong Jin
A Statistical Method for Translating Chinese into Under-resourced Minority Languages
Abstract
In order to improve the performance of statistical machine translation between Chinese and minority languages, most of which are under-resourced languages with different word order and rich morphology, the paper proposes a method which incorporates syntactic information of the source side and morphological information of the target side to simultaneously reduce the differences of word order and morphology. First, according to the word alignment and the phrase structure trees of the source language, reordering rules are extracted automatically to adjust the word order on the source side. Then, based on a Hidden Markov Model, a morphological segmentation method is adopted to obtain morphological information of the target language. In the experiments, we take Chinese-Mongolian translation as an example. A morpheme-level statistical machine translation system, built on the reordered source side and the segmented target side, achieves a gain of 2.1 BLEU points over the standard phrase-based system.
Lei Chen, Miao Li, Jian Zhang, Zede Zhu, Zhenxin Yang
Character Tagging-Based Word Segmentation for Uyghur
Abstract
To obtain information in Uyghur words effectively, we present a novel method based on character tagging for Uyghur word segmentation. In this paper, we suggest five labels for characters in a Uyghur word: Su, Bu, Iu, Eu and Au. With our method, we treat Uyghur word segmentation as a sequence labeling procedure that uses Conditional Random Fields (CRFs) as the basic labeling model. Experiments show that our method collects more features in Uyghur words and therefore significantly outperforms several traditionally used word segmentation models.
Yating Yang, Chenggang Mi, Bo Ma, Rui Dong, Lei Wang, Xiao Li
Analysis of the Chinese – Portuguese Machine Translation of Chinese Localizers Qian and Hou
Abstract
The focus of the present article is the two Chinese localizers qian (front) and hou (back), in their function of time, in the process of the Chinese-Portuguese machine translation, and is integrated in the project Autema SynTree (Annotation and Analysis of Bilingual Syntactic Trees for Chinese/Portuguese). The text corpus used in the research is composed of 46 Chinese texts, extracted from The International Chinese Newsweekly, identified as source text (ST), and target texts (TT) are composed of translations into Portuguese executed by the Portuguese-Chinese Translator (PCT) and humans. In Portuguese the prepositions of transversal axis, such as antes de and depois de, are used to indicate the time before and after, corresponding to qian and hou in Chinese. Nevertheless, inconsistencies related to the translation of the localizers are found in the output of the PCT when comparing it with the human translation (HT). Based thereupon, the present article shows the developed syntax rules to solve the inconsistencies found in the PCT output. The translations and the proposed rules were evaluated through the application of BLEU metrics.
Chunhui Lu, Ana Leal, Paulo Quaresma, Márcia Schmaltz
Chunk-Based Dependency-to-String Model with Japanese Case Frame
Abstract
This paper proposes integrating Japanese case frames into the chunk-based dependency-to-string model. First, case frames are acquired from Japanese chunk-based dependency analysis results. Then the case frames are used to constrain rule extraction and decoding in the chunk-based dependency-to-string model. Experimental results show that the proposed method performs well on long structural reordering and lexical translation, and achieves better performance than the hierarchical phrase-based model and the word-based dependency-to-string model on Japanese-to-Chinese test sets.
Jinan Xu, Peihao Wu, Jun Xie, Yujie Zhang
A Novel Hybrid Approach to Arabic Named Entity Recognition
Abstract
Named Entity Recognition (NER) is an essential preprocessing task for many Natural Language Processing (NLP) applications such as text summarization, document categorization, and information retrieval, among others. NER systems follow either a rule-based approach or a machine learning approach. In this paper, we introduce a novel NER system for Arabic using a hybrid approach, which combines a rule-based approach and a machine learning approach in order to improve the performance of Arabic NER. The system is able to recognize three types of named entities: Person, Location and Organization. Experimental results on the ANERcorp dataset showed that our hybrid approach achieves better performance than the rule-based approach and the machine learning approach used separately. It also outperforms the state-of-the-art hybrid Arabic NER systems.
Mohamed A. Meselhi, Hitham M. Abo Bakr, Ibrahim Ziedan, Khaled Shaalan
Reexamination on Voting for Crowd Sourcing MT Evaluation
Abstract
We describe a model based on Ranking Support Vector Machine (SVM) for dealing with crowdsourcing data. Our model focuses on how to use poor-quality crowdsourcing data to obtain high-quality sorted data. The data sets used for model training and testing contain missing data. We found that our model achieves better results than the voting model in all cases in our experiments, including the sorting of two translations and of four translations.
Yiming Wang, Muyun Yang
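For context on the voting baseline this abstract compares against, the following is a minimal sketch of majority voting over crowd judgments for a pair of translations, with missing judgments skipped. It is an assumption about how such a baseline might look, not the authors' implementation; the function name `majority_vote` and the labels `"A"` and `"B"` are hypothetical.

```python
from collections import Counter

def majority_vote(judgments):
    """Return the label most annotators preferred.

    Missing judgments are given as None and ignored. Returns None when
    there are no usable votes or when the top two labels are tied.
    """
    votes = Counter(j for j in judgments if j is not None)
    if not votes:
        return None
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: voting alone cannot decide
    return ranked[0][0]

# Hypothetical judgments: each annotator says which translation is better.
print(majority_vote(["A", "B", "A", None]))
print(majority_vote(["A", "B", None]))
```

Ties and all-missing cases leave plain voting without an answer, which is one reason a learned ranking model can be attractive for sparse, noisy crowd data.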
Backmatter
Metadata
Title
Machine Translation
Edited by
Xiaodong Shi
Yidong Chen
Copyright Year
2014
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-662-45701-6
Print ISBN
978-3-662-45700-9
DOI
https://doi.org/10.1007/978-3-662-45701-6