DOI: 10.1145/2676726.2677009
research-article

Predicting Program Properties from "Big Code"

Published: 14 January 2015

ABSTRACT

We present a new approach for predicting program properties from massive codebases (aka "Big Code"). Our approach first learns a probabilistic model from existing data and then uses this model to predict properties of new, unseen programs.

The key idea of our work is to transform the input program into a representation which allows us to phrase the problem of inferring program properties as structured prediction in machine learning. This formulation enables us to leverage powerful probabilistic graphical models such as conditional random fields (CRFs) in order to perform joint prediction of program properties.

As an example of our approach, we built a scalable prediction engine called JSNice for solving two kinds of problems in the context of JavaScript: predicting (syntactic) names of identifiers and predicting (semantic) type annotations of variables. Experimentally, JSNice predicts correct names for 63% of name identifiers, and its type annotation predictions are correct in 81% of the cases. In the first week after its release, JSNice was used by more than 30,000 developers, and in only a few months it has become a popular tool in the JavaScript developer community.

By formulating the problem of inferring program properties as structured prediction and showing how to perform both learning and inference in this context, our work opens up new possibilities for attacking a wide range of difficult problems in the context of "Big Code" including invariant generation, decompilation, synthesis and others.
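
To make the structured-prediction formulation concrete, the following is a minimal sketch of joint prediction over a tiny dependency graph. It is not JSNice's implementation: the feature weights, relation names (`less_than`, `indexes`), candidate labels, and the exhaustive search are all assumptions chosen for illustration. In the approach described above the weights would be learned from existing code, and exhaustive search is only feasible at toy sizes.

```python
from itertools import product

# Hypothetical weights for (label_u, relation, label_v) features.
# In a learned model these would come from training on a large corpus.
WEIGHTS = {
    ("i", "less_than", "length"): 0.9,
    ("j", "less_than", "length"): 0.6,
    ("i", "indexes", "items"): 0.8,
    ("j", "indexes", "items"): 0.3,
    ("i", "indexes", "data"): 0.4,
    ("j", "indexes", "data"): 0.5,
}

# Nodes whose properties are already known (for example, a built-in property).
KNOWN = {"len_prop": "length"}

# Nodes whose names must be predicted, with hypothetical candidate labels.
CANDIDATES = {"a": ["i", "j"], "b": ["items", "data"]}

# Edges (u, relation, v) relating program elements to each other.
EDGES = [("a", "less_than", "len_prop"), ("a", "indexes", "b")]


def score(assignment):
    """Sum the weights of all edge features under a full assignment."""
    labels = {**KNOWN, **assignment}
    return sum(
        WEIGHTS.get((labels[u], rel, labels[v]), 0.0) for u, rel, v in EDGES
    )


def map_inference():
    """Return the highest-scoring joint assignment by exhaustive search."""
    unknowns = sorted(CANDIDATES)
    best, best_score = None, float("-inf")
    for combo in product(*(CANDIDATES[n] for n in unknowns)):
        assignment = dict(zip(unknowns, combo))
        s = score(assignment)
        if s > best_score:
            best, best_score = assignment, s
    return best, best_score


if __name__ == "__main__":
    print(map_inference())
```

The point of the sketch is that the labels for `a` and `b` are scored together through their shared edge, rather than each being predicted in isolation.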

  • Published in

    POPL '15: Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
    January 2015
    716 pages
    ISBN: 9781450333009
    DOI: 10.1145/2676726
    • ACM SIGPLAN Notices, Volume 50, Issue 1
      POPL '15
      January 2015
      682 pages
      ISSN: 0362-1340
      EISSN: 1558-1160
      DOI: 10.1145/2775051
      • Editor: Andy Gill

    Copyright © 2015 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Acceptance Rates

    POPL '15 Paper Acceptance Rate: 52 of 227 submissions, 23%
    Overall Acceptance Rate: 824 of 4,130 submissions, 20%
