Research article
DOI: 10.1145/2635868.2635883

Learning natural coding conventions

Published: 11 November 2014

ABSTRACT

Every programmer has a characteristic style, ranging from preferences about identifier naming to preferences about object relationships and design patterns. Coding conventions define a consistent syntactic style, fostering readability and hence maintainability. When collaborating, programmers strive to obey a project's coding conventions. However, one third of reviews of changes contain feedback about coding conventions, indicating that programmers do not always follow them and that project members care deeply about adherence. Unfortunately, programmers are often unaware of coding conventions because inferring them requires a global view, one that aggregates the many local decisions programmers make and identifies emergent consensus on style. We present NATURALIZE, a framework that learns the style of a codebase and suggests revisions to improve stylistic consistency. NATURALIZE builds on recent work in applying statistical natural language processing to source code. We apply NATURALIZE to suggest natural identifier names and formatting conventions. We present four tools focused on ensuring natural code during development and release management, including code review. NATURALIZE achieves 94% accuracy in its top suggestions for identifier names. We used NATURALIZE to generate 18 patches for 5 open source projects: 14 were accepted.
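The abstract's core idea, scoring how "natural" a token sequence is with a statistical language model trained on the codebase and ranking candidate identifier names by that score, can be sketched with a toy n-gram model. This is an illustrative sketch under assumptions, not the paper's actual implementation: the `NgramModel` class, the `suggest` function, and the add-one smoothing are all simplifications invented for this example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

class NgramModel:
    """Tiny trigram language model with add-one smoothing."""
    def __init__(self, corpus_tokens, n=3):
        self.n = n
        self.counts = Counter(ngrams(corpus_tokens, n))
        self.context = Counter(ngrams(corpus_tokens, n - 1))
        self.vocab = set(corpus_tokens)

    def logprob(self, tokens):
        lp = 0.0
        for g in ngrams(tokens, self.n):
            num = self.counts[g] + 1                       # add-one smoothing
            den = self.context[g[:-1]] + len(self.vocab)
            lp += math.log(num / den)
        return lp

def suggest(model, snippet, placeholder, candidates):
    """Rank candidate names by how natural the snippet becomes."""
    scored = []
    for name in candidates:
        toks = [name if t == placeholder else t for t in snippet]
        scored.append((model.logprob(toks), name))
    return [name for _, name in sorted(scored, reverse=True)]

# Train on a (toy) codebase where short loop counters are conventional...
corpus = ("for ( int i = 0 ; i < n ; i ++ ) { sum += a [ i ] ; } " * 5).split()
model = NgramModel(corpus)

# ...then rank candidate names for the placeholder in an unseen snippet.
snippet = "for ( int ? = 0 ; ? < n ; ? ++ )".split()
print(suggest(model, snippet, "?", ["counterVariable", "i"]))  # 'i' ranks first
```

The point of the sketch is only the ranking principle: a name that matches the codebase's emergent conventions makes the surrounding token sequence more probable under the learned model, so convention-following suggestions surface at the top.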


Published in:
FSE 2014: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering
November 2014, 856 pages
ISBN: 9781450330565
DOI: 10.1145/2635868
Copyright © 2014 ACM
Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance rate: 17 of 128 submissions, 13%
