research-article

Predicting Program Properties from "Big Code"

Authors:
Veselin Raychev

ETH Zurich, Zurich, Switzerland

ETH Zurich, Zurich, Switzerland
View Profile

,
Martin Vechev

ETH Zurich, Zurich, Switzerland

ETH Zurich, Zurich, Switzerland
View Profile

,
Andreas Krause

ETH Zurich, Zurich, Switzerland

ETH Zurich, Zurich, Switzerland
View Profile

POPL '15: Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming LanguagesJanuary 2015Pages 111–124https://doi.org/10.1145/2676726.2677009

Published:14 January 2015Publication History

POPL '15: Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages

Pages 111–124

ABSTRACT

We present a new approach for predicting program properties from massive codebases (aka "Big Code"). Our approach first learns a probabilistic model from existing data and then uses this model to predict properties of new, unseen programs.

The key idea of our work is to transform the input program into a representation which allows us to phrase the problem of inferring program properties as structured prediction in machine learning. This formulation enables us to leverage powerful probabilistic graphical models such as conditional random fields (CRFs) in order to perform joint prediction of program properties.

As an example of our approach, we built a scalable prediction engine called JSNice for solving two kinds of problems in the context of JavaScript: predicting (syntactic) names of identifiers and predicting (semantic) type annotations of variables. Experimentally, JSNice predicts correct names for 63% of name identifiers and its type annotation predictions are correct in 81% of the cases. In the first week since its release, JSNice was used by more than 30,000 developers and in only few months has become a popular tool in the JavaScript developer community.

By formulating the problem of inferring program properties as structured prediction and showing how to perform both learning and inference in this context, our work opens up new possibilities for attacking a wide range of difficult problems in the context of "Big Code" including invariant generation, decompilation, synthesis and others.

Supplemental Material

2677009.mov

mov

17 GB

Download

References

ALLAMANIS, M., AND SUTTON, C. Mining source code repositories at massive scale using language modeling. In MSR (2013). Google ScholarDigital Library
ANDRZEJEWSKI, D., MULHERN, A., LIBLIT, B., AND ZHU, X. Statistical debugging using latent topic models. In ECML (2007). Google ScholarDigital Library
BECKMAN, N. E., AND NORI, A. V. Probabilistic, modular and scalable inference of typestate specifications. PLDI '11, pp. 211--221. Google ScholarDigital Library
BESAG, J. On the Statistical Analysis of Dirty Pictures. Journal of the Royal Statistical Society. Series B (Methodol.) 48, 3 (1986), 259--302.Google Scholar
BLEI, D., AND LAFFERTY, J. Topic models. In Text Mining: Classification, Clustering, and Applications. 2009.Google Scholar
Closure compiler. https://developers.google.com/closure/compiler/.Google Scholar
Mining big code to improve software reliability and construction. http://www.darpa.mil/NewsEvents/Releases/2014/03/06a.aspx.Google Scholar
FINLEY, T., AND JOACHIMS, T. Training structural svms when exact inference is intractable. In ICML (2008), pp. 304--311. Google ScholarDigital Library
GULWANI, S., AND JOJIC, N. Program verification as probabilistic inference. POPL '07, ACM, pp. 277--289. Google ScholarDigital Library
HE, X., ZEMEL, R. S., AND CARREIRA-PERPIÑÁN, M. A. Multi-scale conditional random fields for image labeling. CVPR '04. Google ScholarDigital Library
JENSEN, S. H., MØLLER, A., AND THIEMANN, P. Type analysis for javascript. In SAS'09. Google ScholarDigital Library
KAPPES, J. H., ET AL. A comparative study of modern inference techniques for discrete energy minimization problems. CVPR'13. Google ScholarDigital Library
KARAIVANOV, S., RAYCHEV, V., AND VECHEV, M. Phrase-based statistical translation of programming languages. Onward! '14. Google ScholarDigital Library
KOLLER, D., AND FRIEDMAN, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009. Google ScholarDigital Library
KREMENEK, T., NG, A. Y., AND ENGLER, D. A factor graph model for software bug finding. IJCAI'07. Google ScholarDigital Library
KREMENEK, T., TWOHEY, P., BACK, G., NG, A., AND ENGLER, D. From uncertainty to belief: Inferring the specification within. OSDI'06, USENIX Association, pp. 161--176. Google ScholarDigital Library
LAFFERTY, J. D., MCCALLUM, A., AND PEREIRA, F. C. N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML '01, pp. 282--289. Google ScholarDigital Library
LIVSHITS, B., NORI, A. V., R AJAMANI, S. K., AND BANERJEE, A. Merlin: Specification inference for explicit information flow problems. PLDI '09, ACM, pp. 75--86. Google ScholarDigital Library
MADDISON, C. J., AND TARLOW, D. Structured generative models of natural source code.Google Scholar
MURPHY, K. P. Machine learning: a probabilistic perspective. Cambridge, MA, 2012. Google ScholarDigital Library
PINTO, D., MCCALLUM, A., WEI, X., AND CROFT, W. B. Table extraction using conditional random fields. SIGIR '03, pp. 235--242. Google ScholarDigital Library
QUATTONI, A., COLLINS, M., AND DARRELL, T. Conditional random fields for object recognition. In NIPS (2004), pp. 1097--1104.Google Scholar
RATLIFF, N. D., BAGNELL, J. A., AND ZINKEVICH, M. (approximate) subgradient methods for structured prediction. In AISTATS (2007), pp. 380--387.Google Scholar
RAYCHEV, V., VECHEV, M., AND YAHAV, E. Code completion with statistical language models. PLDI '14, ACM, pp. 419--428. Google ScholarDigital Library
STEENSGAARD, B. Points-to analysis in almost linear time. POPL'96, pp. 32--41. Google ScholarDigital Library
TASKAR, B., GUESTRIN, C., AND KOLLER, D. Max-margin markov networks. In NIPS (2003).Google ScholarDigital Library
TSOCHANTARIDIS, I., JOACHIMS, T., HOFMANN, T., AND ALTUN, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research 6, 2005, 1453--1484. Google ScholarDigital Library
Typescript language. http://www.typescriptlang.org/.Google Scholar
ZINKEVICH, M., WEIMER, M., LI, L., AND SMOLA, A. J. Parallelized stochastic gradient descent. In NIPS (2010), pp. 2595--2603.Google ScholarDigital Library

Recommendations

A general path-based representation for predicting program properties
PLDI 2018: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation

Predicting program properties such as names or expression types has a wide range of applications. It can ease the task of programming, and increase programmer productivity. A major challenge when learning from programs is how to represent programs in a ...
Read More
code2vec: learning distributed representations of code

We present a neural model for representing snippets of code as continuous distributed vectors (``code embeddings''). The main idea is to represent a code snippet as a single fixed-length code vector, which can be used to predict semantic properties of ...
Read More
Predicting program properties from 'big code'

We present a new approach for predicting program properties from large codebases (aka "Big Code"). Our approach learns a probabilistic model from "Big Code" and uses this model to predict properties of new, unseen programs.

The key idea of our work is ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
POPL '15: Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
January 2015
716 pages
ISBN:9781450333009
DOI:10.1145/2676726
General Chair:
Sriram Rajamani
Microsoft Research, India
,
Program Chair:
David Walker
Princeton University, USA
ACM SIGPLAN Notices Volume 50, Issue 1
POPL '15
January 2015
682 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2775051
Editor:
Andy Gill
University of Kansas, Lawrence, KS
Issue’s Table of Contents
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 14 January 2015
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
big code
closure compiler
conditional random fields
javascript
names
program properties
structured prediction
types
Qualifiers
- research-article
Conference

Acceptance Rates
POPL '15 Paper Acceptance Rate52of227submissions,23%Overall Acceptance Rate824of4,130submissions,20%
More
Upcoming Conference
POPL '25

Sponsor:

sigplan

The 52nd Annual ACM SIGPLAN Symposium on Principles of Programming Languages

January 19 - 25, 2025

Denver , CO , USA
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 238
  Total Citations
  View Citations
- 2,165
  Total Downloads
- Downloads (Last 12 months)170
- Downloads (Last 6 weeks)17
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Predicting Program Properties from "Big Code"

POPL '15: Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages

ABSTRACT

Supplemental Material

References

Cited By

Recommendations

A general path-based representation for predicting program properties

code2vec: learning distributed representations of code

Predicting program properties from 'big code'