ABSTRACT
Prior research has shown that source code exhibits naturalness: because it is written by humans, it tends to be repetitive and predictable. That work also showed that an n-gram language model, trained on a large corpus of existing source code, is useful for predicting the next token in a source file. In this paper, we investigate how well statistical machine translation (SMT) models for natural languages can help in migrating source code from one programming language to another. We treat source code as a sequence of lexical tokens and apply a phrase-based SMT model to the lexemes of those tokens. Our empirical evaluation on migrating two Java projects to C# showed that lexical, phrase-based SMT can achieve high lexical translation accuracy (BLEU of 81.3-82.6%). Users would have to manually edit only 11.9-15.8% of the tokens in the resulting code to correct it. However, a high percentage of the translated methods (49.5-58.6%) are syntactically incorrect. Our results therefore call for a more program-oriented SMT model that better integrates the syntactic and semantic information of a program to support language migration.
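To make the token-level view concrete, the following minimal sketch shows one way to treat code as a sequence of lexical tokens and to estimate the share of tokens a user would have to edit. It is an illustration under stated assumptions, not the procedure used in the paper: the regular-expression lexer, the function names (`tokenize`, `edit_distance`, `edit_ratio`), and the choice of token-level Levenshtein distance divided by the number of translated tokens are all hypothetical.

```python
import re

# Assumption: a deliberately simplified lexer. It recognizes identifiers and
# keywords, numeric literals, and treats every other non-space character as
# its own token. A real Java or C# lexer handles strings, comments, and
# multi-character operators.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")


def tokenize(code):
    """Split source text into a flat list of lexemes."""
    return TOKEN_RE.findall(code)


def edit_distance(a, b):
    """Token-level Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute x with y (free if equal)
            ))
        prev = curr
    return prev[-1]


def edit_ratio(translated, reference):
    """Fraction of translated tokens that must be edited to match the reference.

    Assumption: the denominator is the token count of the translated code.
    """
    t = tokenize(translated)
    r = tokenize(reference)
    if not t:
        return 0.0 if not r else 1.0
    return edit_distance(t, r) / len(t)


if __name__ == "__main__":
    # Hypothetical example: the translation keeps the Java method name.
    translated = "bool done = list.contains(x);"
    reference = "bool done = list.Contains(x);"
    print(tokenize(translated))
    print(edit_ratio(translated, reference))
```

In this hypothetical example a single token out of ten differs, so the ratio is small even though the translated line would not compile against the C# API. This mirrors the gap between lexical accuracy and syntactic or semantic correctness that the abstract reports.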
Index Terms
- Lexical statistical machine translation for language migration