ABSTRACT
Every programmer has a characteristic style, ranging from preferences about identifier naming to preferences about object relationships and design patterns. Coding conventions define a consistent syntactic style, fostering readability and hence maintainability. When collaborating, programmers strive to obey a project’s coding conventions. However, one third of code reviews contain feedback about coding conventions, indicating that programmers do not always follow them and that project members care deeply about adherence. Unfortunately, programmers are often unaware of coding conventions because inferring them requires a global view, one that aggregates the many local decisions programmers make and identifies emergent consensus on style. We present NATURALIZE, a framework that learns the style of a codebase and suggests revisions to improve stylistic consistency. NATURALIZE builds on recent work in applying statistical natural language processing to source code. We apply NATURALIZE to suggest natural identifier names and formatting conventions. We present four tools focused on ensuring natural code during development and release management, including code review. NATURALIZE achieves 94% accuracy in its top suggestions for identifier names. We used NATURALIZE to generate 18 patches for 5 open source projects; 14 were accepted.
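The idea of learning a codebase's style and suggesting more consistent identifier names can be illustrated with a small sketch. The abstract states only that NATURALIZE builds on statistical natural language processing; the bigram model with add-one smoothing below is an assumption chosen for brevity, not a description of the framework's actual model, and the names `train_bigrams`, `log_prob`, and `rank_names` are hypothetical.

```python
import math
from collections import Counter


def train_bigrams(token_streams):
    """Count unigrams and bigrams over tokenized source files."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in token_streams:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(tokens, unigrams, bigrams):
    """Log-probability of a token sequence under an add-one smoothed bigram model."""
    vocab = len(unigrams) + 1  # one extra slot for unseen tokens
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
    return total


def rank_names(snippet, slot, candidates, unigrams, bigrams):
    """Rank candidate names for `slot` by how probable the renamed snippet is."""
    scored = []
    for name in candidates:
        renamed = [name if tok == slot else tok for tok in snippet]
        scored.append((log_prob(renamed, unigrams, bigrams), name))
    return [name for _, name in sorted(scored, reverse=True)]


# Toy "codebase" (illustrative data): loop counters are usually called i.
corpus = [
    "for ( int i = 0 ; i < n ; i ++ )".split(),
    "for ( int i = 0 ; i < len ; i ++ )".split(),
    "for ( int j = 0 ; j < n ; j ++ )".split(),
]
unigrams, bigrams = train_bigrams(corpus)

# A new snippet that uses an unconventional counter name.
snippet = "for ( int idx = 0 ; idx < n ; idx ++ )".split()
ranking = rank_names(snippet, "idx", ["i", "j", "idx"], unigrams, bigrams)
print(ranking)
```

On this toy corpus the name that the codebase uses most often in the same context ranks first, which is the sense in which a suggestion is "natural": it reflects the emergent consensus of the surrounding code instead of a fixed rule.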