1 Introduction
CuBERT (Kanade et al. 2020), PLBART (Ahmad et al. 2021), CodeBERT (Feng et al. 2020), and GraphCodeBERT (Guo et al. 2020) have achieved state-of-the-art performance on a number of Software Engineering (SE) tasks such as code generation, code search, code summarization, clone detection, code translation, and code refinement. Some models additionally leverage structural representations such as the Program Dependence Graph (PDG) and Data Flow Graph (DFG) to achieve state-of-the-art performance on bug prediction. Such models are commonly evaluated on the CodeXGLUE benchmark (Lu et al. 2021) for the code completion task. In our empirical studies, we consider four such models: BERT, CodeBERTa, CodeBERT, and GraphCodeBERT.
JEMMA as a Dataset

JEMMA has multiple levels of granularity: from methods, to classes, to packages, and entire projects. It consists of over 8 million Java method snippets along with substantial metadata; pre-processed source code representations, including graph representations that come with control- and data-flow information; call-graph information for all methods at the project level; and a variety of additional properties and metrics. The detail of data, down to the AST-node level, resulting from our extensive processing, ensures comprehensiveness at scale for the millions of source code entities defined within JEMMA. All of this contributes to the overall quality of the data presented.
JEMMA as a Workbench

JEMMA is not a static dataset: we purposefully designed it to be extensible in a variety of ways. Concretely, JEMMA comes with a set of tools to: add metrics or labels to source code snippets (e.g., by utilizing static analysis tools); define prediction tasks based on metrics, properties, or the representations themselves; process the code snippets and existing representations to generate new representations of source code; and run supported models on a task. We describe how to extend the dataset, along with several examples, in Section 4. This extensibility is critical, because it transforms JEMMA into a workbench with which users can experiment with the design of ML models of code and tasks, while saving a lot of time in pre-processing the data. The JEMMA Workbench is a set of tools and helper functions that help with several operations, such as viewing, creating, retrieving from, and appending to datasets (independent of how they are stored), as well as many other tasks that do not involve working directly with the datasets.
JEMMA can also be used to gain insights via empirical studies. The first is a study on the non-localness of software, and how it impacts the performance of models on a variant of the code completion task. This study shows how the data from JEMMA can be used to gain insights into how models perform on code samples, highlighting what performance issues exist and how we can address them by adding project-wide context (Section 5). We also outline further uses of JEMMA in empirical studies: for example, studies on fault-prone or misleading method names, the impact of complexity on other code properties, or the challenges of coupling in large projects.

2 Related Work
2.1 Datasets for Machine Learning on Code
JEMMA provides code entities in multiple granularities across several representation types, creating a wide range of modelling opportunities.

Unlike task-specific datasets, JEMMA is not tied to a single task, and supports multiple tasks out of the box. JEMMA supports several representations, including raw source code, tokens, ASTs, and graphs, at both the method and class levels, building up to even coarser granularities.

JEMMA balances the preceding concerns by building upon organic projects from a diverse set of domains (e.g., games, websites, standalone applications) and development standards (ranging from student projects to industry-grade open-source projects), which adds a healthy factor of generalization for source code modeling.

JEMMA mitigates such issues by providing a dataset of inter-related code entities across granularities, along with comprehensive intra- and inter-procedural relationship information derived from data flow, control flow, call graphs, etc. JEMMA keeps the necessary code information intact, whether within procedures, across procedures, or across files at the project level.

2.2 Datasets for Empirical Studies
The Qualitas corpus is a curated collection of Java systems, distributed with both source code and compiled .class files. The QUAATLAS corpus (De Roover et al. 2013) is a post-processed version of the Qualitas corpus that allows better support for API usage analysis. XCorpus (Dietrich et al. 2017) is a subset of the Qualitas corpus (70 programs), complemented by 6 additional programs, all of which can be automatically executed via test cases (natural or generated). Comparing JEMMA and Perceval, the differences highlight that the two tools complement each other well: Perceval can be used to fetch raw project data from a wide variety of data sources, while JEMMA focuses on source code and takes care of data analysis, data pre-processing, task definition, and model training out of the box.

2.3 The 50K-C Dataset
Since JEMMA builds upon the 50K-C dataset of 50,000 compilable projects, we provide detailed background information on it in this section. The 50K-C dataset is a collection of 50,000 compilable Java projects, with a total of almost 1.2 million Java class files, their compiled bytecode, dependent jar files, and build scripts. It is divided into three subsets:
- projects: contains the 50,000 Java projects as zipped files. The projects are organized into 115 subfolders, each with about 435 projects.
- jars: contains the 5,362 external jar dependencies required for successful project builds. This is important, as missing dependencies are a common cause of failure when compiling code at scale.
- build_results: contains the build outputs for the 50,000 projects, including compiled bytecode, build metadata, and original build scripts.

In addition to the above data, a mapping between each project and its GitHub URL is also provided. The bytecode is readily available for a variety of tasks, such as running static analysis tools or, if the projects can also be executed, serving as input for testing and dynamic analysis tools.
The extensive pre-processing that we perform on top of 50K-C requires the use of static analysis tools, for tasks such as call graph extraction and the extraction of valuable metrics about the systems. Since the vast majority of static analysis tools operate on bytecode, 50K-C was the most suitable option, combining both scale and the ability to automate the analysis at that scale.

3 The JEMMA Dataset
The goal of the JEMMA project is to provide the research community with a large, comprehensive, and extensible dataset for Java that can be used to advance the field of source code modeling. The JEMMA datasets consist of a large collection of code samples in varying granularities, with wide-ranging and diverse metadata, a range of supported source code representations, and several properties. In addition, they also include source code information related to code structure, data flow, control flow, caller-callee relationships, etc. For the JEMMA Dataset, we gather data at the project level and provide information on all the packages and classes. Furthermore, for every class, we parse and provide data on all the methods, including respective metadata, several representations, and properties. The data provided for every method entity is comprehensive, down to the level of the AST, with data-flow, control-flow, lexical-usage, and call-graph edges, among others. In addition to necessary information, such as line numbers and position numbers of code tokens, supplementary information such as token types, node types, etc., is also provided. More details are presented in the following sections. JEMMA also comes equipped with Workbench tools that allow users to perform a variety of tasks out of the box, such as transforming code samples into intermediate source code representations, making tailored selections of entities to define tasks and form custom datasets, or running supported models (Section 4 provides more details).
The 50K-C dataset contains a total of 50,000 projects: 85 projects with over 500 classes (with a maximum of 5,549 classes in a project), 1,264 projects with 101-500 classes, 2,751 projects with 51-100 classes, 10,693 projects with 21-50 classes, 14,322 projects with 11-20 classes, and 20,885 projects with 10 or fewer classes (with a minimum of 5 classes per project). We have collected metadata for all of these projects. Overall, the data consists of 1.2 million Java classes, which define over 8 million unique method entities. JEMMA supports multiple granularities: we have processed and catalogued data starting from the project level and descending to smaller entities, which means a full spectrum of code granularities can be accessed. The JEMMA Workbench allows the recomputation of properties if, for some properties, it is more efficient to recompute them than to download them. The data is uploaded to Zenodo; due to its size, it is provided as multiple artifacts. We present the components of the dataset, along with their DOIs (links to the download pages) and sizes, later on. The JEMMA datasets are organized in comma-separated values (CSV) files; consequently, basic analyses can be run with tools such as csvstat. Furthermore, our Workbench APIs can be used to gather extensive statistics on the projects, classes, methods, bytecode, and data- and control-flow information. The JEMMA datasets are grouped into three major parts: data at the metadata level (Section 3.1), data at the property level (Section 3.2), and data at the representation level (Section 3.3). In addition, we also provide project-wide call-graph information for the 50,000 projects, uniquely identifying and associating source and destination nodes in the call graph with the help of the metadata defined by JEMMA (Section 3.4). This allows project-wide data to be accessed as a whole, for different granularities of code entities. Figure 1 presents an overview of JEMMA: the top-left corner represents the raw data from 50K-C, which we catalog by adding UUIDs (symbolized by colored squares). The rest of the figure depicts the additional pre-/post-processing we performed: the colored gears represent external tools that we run to collect additional data (properties and representations), while the grey gears represent further post-processing that we perform on the tool outputs to integrate them into our dataset.
3.1 JEMMA: Metadata
Metadata constitutes the first level of the JEMMA datasets. The metadata is made available in CSV (comma-separated values) files. This allows for easy processing, even with simple command-line tools. The metadata is organized in four parts, from the largest units of interest to the smallest: projects, packages, classes, and methods. The units of interest can then be inter-related systematically. The metadata serves two major purposes. First, each entity is assigned a UUID: the UUID allows us to uniquely identify an entity in the dataset, and the supplementary metadata helps disambiguate entities (file paths, parent relationships, location information in the file, etc.). In Section 4, we show how this metadata can be used to add an additional property to source code entities. Second, JEMMA users can leverage the metadata to construct custom data queries and make selections from the large collection of data at different granularities.
Projects. We list all 50,000 projects of the 50K-C dataset along with their corresponding metadata: project_id, project_path, project_name. The project UUID is referenced by the entities contained in the project. The project path is relative to the root directory of the 50K-C dataset, and can be used to access the raw source code of the project.

Packages. For every package, we provide the UUID of the parent project as project_id, the UUID assigned to the package as package_id, the relative path of the package as package_path, and the name of the package directory as package_name.

Classes. We list all classes of the 50K-C dataset along with their corresponding metadata: project_id, package_id, class_id, class_path, class_name. Similarly to the projects, the class path is a relative path starting from the 50K-C dataset's root directory, which allows access to the raw source code of the class.

Methods. We list all methods of the 50K-C dataset along with their corresponding metadata: project_id, package_id, class_id, method_id, method_path, method_name, start_line, end_line, method_signature.
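Since the metadata and properties are plain CSV files keyed by these UUIDs, they can be combined with standard tooling. The following is a minimal sketch with pandas; the file names, the property column name, and the project UUID are hypothetical placeholders for downloaded artifacts, while the metadata columns are those listed above:

```python
import pandas as pd

# Hypothetical local file names for two downloaded artifacts.
methods = pd.read_csv("metadata_methods.csv")   # method_id, project_id, ...
cmpx = pd.read_csv("properties_cmpx.csv")       # assumed: method_id, CMPX

# UUIDs make the join across artifacts unambiguous.
df = methods.merge(cmpx, on="method_id")

# Example selection: the ten most complex methods of a single project.
top = (df[df["project_id"] == "some-project-uuid"]   # placeholder UUID
         .nlargest(10, "CMPX")[["method_name", "CMPX"]])
print(top)
```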
3.2 JEMMA: Properties
JEMMA leverages the UUIDs assigned to projects, classes, and methods as a way to attach additional properties to these entities. Thus, a property can be an arbitrary value associated with an entity, such as a metric. Even though we have gathered several properties associated with code entities, it should be noted that a particular property may not be available, or may not apply, for a given code entity. Users can add new properties associated with code entities as contributions to the dataset; the property should be given a unique name and be stored in the correct location for it to be visible to the JEMMA Workbench APIs (Section 4 provides more details). We computed the properties with the following tools:
- The Infer static analyser (Calcagno et al. 2015) provides advanced abstract-interpretation-based analyses for several languages, including Java. Examples of the analyses that Infer can run include an interprocedural analysis to detect possible null pointer dereferences. Infer can also perform additional analyses such as taint analysis, resource leak detection, and estimating the run-time cost of methods. We chose Infer mainly because it can perform inter-procedural analysis that reasons across procedure boundaries, while being able to scale to large codebases.
- Metrix++ is a tool that can compute a variety of basic metrics on source code entities, such as lines of code, code complexity, and several others. We chose Metrix++ because it is suitable for processing large codebases (thousands of files per minute), recognizes various types of entities (classes, interfaces, namespaces, functions, comments), and supports multiple metrics.
- PMD is a static code analysis tool that can compute a variety of static analysis warnings and metrics, such as the NPath complexity metric, among many others. We used PMD because it is inexpensive to run on large codebases, and it is trusted by industry practitioners and researchers. PMD can also be used to identify defects and problems in code entities, which may be useful for future work.
- The java-callgraph extractor is a tool for generating call graphs of Java programs. We used this tool to extract project-wide call graphs, from which callers and callees were identified and linked to their respective UUIDs at the post-processing stage. We chose java-callgraph because it is capable of generating both static and dynamic call graphs, which suits our dataset of compilable code entities.
Table 1 lists the properties currently included in JEMMA, and maps each property to the tool used to obtain it. Later, we provide a table with links to the datasets for all of the data.
Property | Tool used |
---|---|
[TLOC] Total Lines of Code | Metrix++ |
[SLOC] Source Lines of Code | Metrix++ |
[CMPX] McCabe or Cyclomatic Complexity | Metrix++ |
[MXIN] Maximum indent depth of nesting | Metrix++ |
[NPTH] Npath Complexity | PMD |
[NMTK] Number of Code Tokens | Parser |
[NMPR] Number of parameters | Parser |
[NUID] Number of unique identifiers | Parser |
[NMOP] Number of operators | Parser |
[NMLT] Number of literals | Parser |
[NMRT] Number of return statements | Parser |
[NAME] Name of source code entity | Parser |
[NUPC] Number of unique parent callers | java-callgraph |
[NUCC] Number of unique child callees | java-callgraph |
[NMNC] Number of non-local calls | java-callgraph |
[NMLC] Number of local calls | java-callgraph |
[NLDF] Presence of Null Dereference | Infer |
[RSLK] Presence of Resource Leak | Infer |
3.3 JEMMA: Representations
One of the goals of JEMMA is to provide the building blocks to experiment with the design space of representations. Since extracting the relevant information is costly in terms of computational resources, a significant effort went into adding several basic representations at the method level, ranging from the raw source code to the information behind a very complete graph representation. At the representation level, we provide several ready-to-use source code representations (compatible with different models) for over 8 million method snippets. The method-level representations that we provide are described in the following subsections.

3.3.1 Raw text (TEXT)

This representation provides the raw source text of each method snippet.
3.3.2 Tokens (TKNA, TKNB)
The code tokens are extracted using an ANTLR4 grammar (Parr 2013) and made available to the user pre-processed. The tokenized code includes method annotations, if any, but does not include natural language comments. However, with the entire raw text of method snippets made available by default, users are free to include comments in their custom tokenizations. A second token variant abstracts away the contents of literals with placeholder tokens (e.g., <LITCOMMA> tokens). This representation is recommended for users who would tokenize the code themselves, who want to avoid literals being split into several tokens, or who want to avoid ambiguities with symbols and special characters when using natural-language tokenizers.

3.3.3 Abstract Syntax Tree (ASTS)

This representation provides the serialized Abstract Syntax Tree of each method snippet.
3.3.4 code2vec (C2VC) and code2seq (C2SQ)

code2vec represents identifiers and AST paths with monolithic vocabularies, which can lead to out-of-vocabulary (OOV) issues, while code2seq models identifiers and paths as sequences of symbols from smaller vocabularies, which alleviates the same issues. However, the downside is that the code2seq representation is significantly larger. Both kinds of inputs are fed to models that use the attention mechanism to select a set of AST paths that are relevant to the model's training objective (by default, method naming).

3.3.5 Feature Graph (FTGR)
The feature graph representation extends the AST of a method with several types of edges:

- Child edges, encoding the AST.
- NextToken edges, encoding the sequential information of code tokens.
- LastRead, LastWrite, and ComputedFrom edges, which link variables together and provide data-flow information.
- LastLexicalUse edges, which link lexical usages of variables (independent of data flow).
- GuardedBy and GuardedByNegation edges, connecting a variable used in a block to the conditions of the control flow.
- ReturnTo edges, linking return tokens to the method declaration.
- FormalArgName edges, connecting arguments in method calls to the formal parameters.
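As a small illustration of what can be done with these edge types, the sketch below loads a feature graph into networkx and extracts its data-flow subgraph. The (src, dst, type) edge-list format is an assumption made for illustration; the actual serialization is the one shipped with the FTGR artifact:

```python
import networkx as nx

def load_ftgr(edge_list):
    """Build a typed multigraph from (src, dst, edge_type) triples."""
    g = nx.MultiDiGraph()
    for src, dst, etype in edge_list:
        g.add_edge(src, dst, type=etype)
    return g

def dataflow_view(g):
    """Keep only the edges that carry data-flow information."""
    keep = {"LastRead", "LastWrite", "ComputedFrom"}
    view = nx.MultiDiGraph()
    view.add_edges_from((u, v, d) for u, v, d in g.edges(data=True)
                        if d["type"] in keep)
    return view
```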
Inputs for models such as GraphCodeBERT can also be produced from the feature graph representations. The feature graph representation is obtained with Andrew Rice's feature graph extraction tool.

3.4 JEMMA: Callgraphs
For every project, we provide its call graph (CG), in which methods calling each other are explicitly linked. Thanks to our metadata, this method-call information can then be used to combine representations and create interesting global contexts for large-scale source code models. Callers and callees within a project are resolved to their UUIDs through post-processing (links to external calls are still recorded, but we do not assign UUIDs to them).
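A short sketch of how the call-graph data can be navigated follows; the file name and column names are assumptions for illustration, with the actual schema shipped alongside the callgraph artifact:

```python
import pandas as pd

# Assumed columns: caller_method_id, callee_method_id (empty for external calls).
cg = pd.read_csv("callgraphs_project.csv")

def callees(method_id, hops=1):
    """Return the methods reachable from `method_id` within `hops` calls."""
    frontier, seen = {method_id}, set()
    for _ in range(hops):
        frontier = set(cg.loc[cg["caller_method_id"].isin(frontier),
                              "callee_method_id"].dropna()) - seen
        seen |= frontier
    return seen
```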
4 Extending and Using JEMMA

So far, we have described the four components of JEMMA: metadata, properties, representations, and callgraphs. These are standalone CSV files that can be used on their own, but to make it easy for users to access and use them in common usage scenarios, we have added a Workbench component to JEMMA.
Table 2: JEMMA dataset artifacts, locations, and sizes.

Artifact | DOI | Size |
---|---|---|
Metadata: Projects | 4.7 MB | |
Metadata: Packages | 42.2 MB | |
Metadata: Classes | 269.7 MB | |
Metadata: Methods | 2.8 GB | |
Properties: [TLOC] | 335.5 MB | |
Properties: [SLOC] | 335.0 MB | |
Properties: [NUID] | 335.6 MB | |
Properties: [NTID] | 336.7 MB | |
Properties: [NMTK] | 342.5 MB | |
Properties: [NMRT] | 333.3 MB | |
Properties: [NMPR] | 333.3 MB | |
Properties: [NMOP] | 334.5 MB | |
Properties: [NMLT] | 333.4 MB | |
Properties: [NAME] | 432.0 MB | |
Properties: [MXIN] | 267.0 MB | |
Properties: [CMPX] | 267.1 MB | |
Properties: [NUPC] | 333.3 MB | |
Properties: [NUCC] | 333.6 MB | |
Properties: [NMNC] | 334.0 MB | |
Properties: [NMLC] | 333.2 MB | |
Properties: [NLDF] | 333.6 MB | |
Properties: [RSLK] | 334.0 MB | |
Represent.: (TEXT) | 3.8 GB | |
Represent.: (TKNA) | 3.3 GB | |
Represent.: (TKNB) | 4.6 GB | |
Represent.: (ASTS) * | 4.1 GB | |
Represent.: (FTGR) * | 5.2 GB | |
Represent.: (C2VC) * | 6.1 GB | |
Represent.: (C2SQ) * | 10.9 GB | |
Callgraphs: Projects | 7.2 GB |
When designing JEMMA, we intended it to be large-scale, yet extensible, flexible, and, most importantly, easy to use. We have implemented several tools to help with this; as a result, researchers can readily use JEMMA as a Workbench to experiment with variants of datasets, models, and tasks while minimizing the processing involved. The JEMMA Workbench tools and implementations, written in Python, are accessible through a set of APIs, which help developers interface with the Workbench when writing machine-learning code and take advantage of several pre-implemented functionalities: viewing, creating, retrieving from, and appending to datasets; defining task labels; generating custom or variant code representations; and training and evaluating supported models on such datasets. The main API groups are:
- GET: meta-data, properties, representations, callers/callees, n-hop context
- ADD: meta-data, properties, representations, callers/callees
- GEN: (create/adapt) representations
- RUN: (train/evaluate) supported models on a task
These operations are exposed as Python APIs. The JEMMA Workbench implementations, along with an exhaustive list of APIs, are made available online. We demonstrate the usage of some of the APIs in this section.
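To give a flavour of the four API groups, the following sketch strings them together. Function and parameter names here are hypothetical; the authoritative API list is the one published online:

```python
from jemma import workbench as jw  # hypothetical import path

# GET: retrieve metadata and pre-processed representations
methods = jw.get_metadata(granularity="method", project_id="some-project-uuid")
tokens = jw.get_representation(methods, kind="TKNA")

# ADD: attach a new property computed elsewhere
jw.add_property("MYPROP", "myprop_values.csv")

# GEN: derive a new or variant representation
graphs = jw.gen_representation(methods, kind="FTGR")

# RUN: train/evaluate a supported model on a task
jw.run_model("microsoft/codebert-base", task="complexity", data=tokens)
```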
Since JEMMA is built on top of the 50K-C Dataset, we have catalogued all of the 50,000 projects and their child code entities, made them uniquely identifiable, and provided a range of properties associated with them, along with information on their inter-relationships. Using this information, a multitude of datasets can be prepared, not just for ML4Code but also for other purposes: e.g., creating a dataset of projects based on project size, or creating a dataset of method snippets whose complexities satisfy a criterion, and so on, for a diverse range of use cases.
Task datasets can likewise be prepared with JEMMA. This may sound straightforward, but preparing a sound dataset for model training is one of the most important steps, and it is often time-consuming given the amount of data cleaning and transformation involved before model training. The JEMMA Workbench tools help users choose from a range of pre-processed source code representations across 8M samples, filter them based on a range of properties, and even use the properties as prediction labels. The representations and properties can be used, either singly or in combination, to generate thousands of combinations of clean and balanced ML task datasets, ready for training. Section 4.2.1 demonstrates such an example. In JEMMA, we catalogue code at the project, package, class, and method level. Furthermore, we process source code into feature graphs, yielding feature-rich code information at the AST-node level. This enables users to access a diverse range of granularities, from coarse file-level to finer node-level information. Moreover, information such as the data flow and control flow between nodes is also available at the node level, providing a remarkable level of detail for code entities. In addition, call-graph links provide information on the inter-procedural relationships across entities within projects. This allows users to access code entities at scale, in different granularities, with detailed and intricate information based on various intra- and inter-procedural relationships. Finally, since JEMMA was prepared with ML4Code in mind, users can easily train and evaluate a number of models, conduct inference, and establish benchmarks for tasks. The diversity of representations facilitates training on several different types of model architectures, from graph-based models, to models that take ASTs as input, to architectures such as code2seq that reason over a bag of AST paths. This enables users to model source code in various formats and combinations and extract valuable insights. In the following, we show how JEMMA can be extended and used, emphasizing some essential use cases.

4.1 Extending JEMMA
In Section 4.1.1, we describe how JEMMA can be extended with a new property; in Section 4.1.2, we describe how it can be extended with a new representation; and in Section 4.1.3, we show how new projects can be added to JEMMA.

4.1.1 Adding a New Property
The simplest way to extend JEMMA is to add a new property. This could be any property of interest that can be computed for a source code entity. Examples include a newly defined source code metric, or the result of a static analysis tool indicating the presence (or absence) of a specific source code characteristic. To extend JEMMA with a new property, the workflow has three main steps: a) accessing a set of source code entities, b) generating the associated property values, and c) merging the property values into the dataset. JEMMA facilitates accessing the correct code input by providing the location and metadata for code entities, and several initial representations (raw text, ASTs, etc.). An associated property can then be obtained either directly (e.g., the method name) or by means of a code analysis tool (e.g., cyclomatic complexity). Figure 3 shows how property values can be computed for entities in JEMMA and added to the dataset as properties. The yellow highlights mark the Workbench API calls in the code snippets.
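Since Figure 3 is not reproduced here, the sketch below illustrates the same three-step workflow with hypothetical Workbench calls and a deliberately trivial property (the number of return statements found in the raw text):

```python
import csv
from jemma import workbench as jw  # hypothetical import path

# a) access a set of source code entities (raw TEXT representation)
methods = jw.get_representation(granularity="method", kind="TEXT")

# b) generate the associated property values (a toy text-based metric)
rows = [(m.method_id, m.text.count("return")) for m in methods]

# c) merge the property into the dataset under a unique name
with open("NRET.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["method_id", "NRET"])
    writer.writerows(rows)
jw.add_property("NRET", "NRET.csv")
```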
4.1.2 Adding a New Representation
JEMMA makes it quite simple to do both: create new representations, and modify existing ones. There are three main steps to extend JEMMA with a representation: a) accessing a set of source code entities, b) generating the associated representations, and c) merging the representations into JEMMA. The feature graph representation, for example, is based on the Abstract Syntax Tree (AST) of the code, and is extended with a number of additional edges depicting various inter-relationships between the AST nodes (e.g., data-flow, control-flow, lexical-usage, and call edges, among others). In addition to other necessary information, such as line numbers and position numbers of every source code token, supplementary information such as token types and node types is also provided. Thus, with this representation, the detail of data provided for every code entity is comprehensive. JEMMA facilitates such extensions by providing the base representations for several million code entities. In a similar manner, the other representations included with JEMMA can be simplified, modified, or augmented to create new representations, keyed by the entities' UUIDs. The representations can then be added to JEMMA using the Workbench APIs, quite similarly to adding new properties as demonstrated in Fig. 3.
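As an illustration, the sketch below derives a simplified representation from the feature graphs by keeping only the AST Child edges. The API names and graph fields are assumptions; the actual calls are documented in the online API list:

```python
from jemma import workbench as jw  # hypothetical import path

graphs = jw.get_representation(granularity="method", kind="FTGR")

# Keep only the syntactic backbone; drop data-flow/control-flow edges.
ast_only = {
    g.method_id: [e for e in g.edges if e.type == "Child"]
    for g in graphs
}

# Merge the new representation back, keyed by the method UUIDs.
jw.add_representation("ASTC", ast_only)
```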
4.1.3 Adding a New Project

We built JEMMA on top of the 50K-C Dataset of 50,000 compilable Java projects; however, we want it to be extensible, so we have provided mechanisms to include additional projects into the fold of JEMMA. Adding a new project to JEMMA involves three main steps: 1) forking the jemma repository, 2) generating the metadata, representations, properties, and call graphs by running the relevant scripts, and 3) making a pull request to add the newly generated data. We provide a simple bash script that helps users generate all the relevant data in one go: generating metadata and cataloguing code entities within the project, generating representations, generating properties, and generating project-level call graphs. Once the data for the new project is ready, users can make a pull request to append the data to the JEMMA Datasets. A detailed tutorial is provided in our documentation. Figure 4 lists the command-line procedure to add a new project to JEMMA.
4.2 Using JEMMA

Next, we show how JEMMA can be put to use. In Section 4.2.1, we describe how a property can be used to define a prediction task, while discussing ways in which JEMMA can help avoid common pitfalls and biases. In Section 4.2.2, we explain how source code representations can be used for tasks such as mutation detection and masked prediction.

4.2.1 Defining Tasks Based on Properties
Once a property is included in JEMMA, it can be used in a variety of ways. One such way is to use property values as prediction labels for a prediction task. A good example of such a prediction task is complexity prediction: given a snippet of code as input, a source code model must predict its cyclomatic complexity (a property) as output. While this may appear trivial (taking a random sample of entities that have the property defined, and splitting it into training, validation, and test sets), in practice it is often more complex, because care must be taken that the data does not contain biases that would provide an inaccurate estimate of model performance. In this context, there are several groups of issues that JEMMA helps contend with while defining task datasets. Since JEMMA is large to start with (over 8 million method entities), the scale of the data makes it much more likely that there is enough data to learn from in the first place, compared to other alternatives. The Workbench API endpoints allow users to query and obtain a balanced set of prediction labels, ready for training. The JEMMA Workbench allows several such operations in the context of defining prediction labels for a machine learning task, and in managing and retrieving large amounts of specific information; users can thus query JEMMA and obtain clean, complete, and balanced datasets. The JEMMA Workbench tools can also be used to filter and leverage the already existing properties to empirically investigate the performance of models on the tasks and gather insights. For example, Figure 5 plots methods' source lines of code (SLOC) against their complexity as a hexbin plot. We observe an overall tendency for shorter methods to be less complex, and longer methods to be more complex. On the other hand, there are also methods that are very long but have very low complexity (along the bottom axis). This information can be used to properly balance the data, for instance by making sure that examples that are short and complex, and examples that are long and simple, are also included in the training and evaluation datasets.
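A minimal sketch of such a balanced selection with pandas follows; the file and column names are hypothetical, and the binning thresholds are arbitrary choices for illustration:

```python
import pandas as pd

cmpx = pd.read_csv("properties_cmpx.csv")   # assumed: method_id, CMPX
sloc = pd.read_csv("properties_sloc.csv")   # assumed: method_id, SLOC
df = cmpx.merge(sloc, on="method_id")

# Bucket complexity into labels, then downsample every bucket to the same
# size so that the resulting task dataset is balanced.
df["label"] = pd.cut(df["CMPX"], bins=[0, 1, 5, 10, float("inf")],
                     labels=["1", "2-5", "6-10", ">10"])
n = df["label"].value_counts().min()
balanced = df.groupby("label", observed=True).sample(n=n, random_state=42)
```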
Since JEMMA is built on top of 50K-C, we benefit from its selection of projects, which intentionally limited duplication; 50K-C's filtering significantly reduces the risk of data leakage across projects. Moreover, since JEMMA keeps the metadata of which project a method belongs to, it is easy to define training, validation, and test splits that all contain code from different projects, if necessary.

4.2.2 Defining Tasks Based on Representations
JEMMA can also be used to define tasks that operate on the source code representations themselves, rather than predicting a source code property. These tasks usually take two forms: a) masked code prediction tasks, and b) mutation detection tasks. In masked prediction tasks, parts of the representation are masked out (e.g., replaced with a "<MASK>" token), and the model is tasked with predicting the masked parts of the representations. Examples include the method naming task, where the name of the method is masked, or a method call completion task, where a method call is masked in the method's body. A simpler variant is a multiple-choice format, where the model has to recognize which of several possibilities is the token that was masked. JEMMA can help with this. For simple modifications (e.g., masking the first occurrence of an operator), it is enough to directly change the default textual representation, and then use the Workbench APIs to re-generate the other representations. Figure 6 shows an example of how to generate new representations for a masking task: method call completion.
JEMMA can then be used to analyse the performance of the models on the task and extract insights that may affect the design of the task. The gen_representation call handles running all the necessary tools in the background, so that, given any source code snippet, representations can be generated on the fly.
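The sketch below shows the general shape of such a masking step. The regex-based masking is a naive stand-in (in practice, the AST representation is the safer way to locate call sites), and the import path is hypothetical:

```python
import re
from jemma import workbench as jw  # hypothetical import path

snippet = "void f() { helper.compute(x); log(x); }"

# Naively mask the first qualified method call in the body.
masked = re.sub(r"(\.\s*)\w+(\s*\()", r"\1<MASK>\2", snippet, count=1)
# -> "void f() { helper.<MASK>(x); log(x); }"

# Re-generate a token representation for the modified snippet on the fly.
tokens = jw.gen_representation(masked, kind="TKNA")
```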
4.2.3 Running Models

The JEMMA Workbench APIs make it easy to run supported models on a task. Several basic baselines are pre-implemented, and models hosted on the HuggingFace platform are supported out of the box.
The JEMMA Workbench API also facilitates interaction with other libraries, in particular to run models using the code2vec and code2seq architectures, as well as Graph Neural Network (GNN) models implemented with the ptgnn library. JEMMA also makes it easy to interface with HuggingFace's Transformers library (Wolf et al. 2019b). This allows a variety of pre-trained models to be fine-tuned on the tasks defined with JEMMA (such as CodeBERT (Feng et al. 2020), GraphCodeBERT (Guo et al. 2020), etc.). Figure 7 shows how to run a Transformer model on the method complexity task using the Workbench.
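In the same spirit as Figure 7 (which we do not reproduce here), a condensed fine-tuning sketch with HuggingFace Transformers might look as follows; `train_ds` and `eval_ds` stand for pre-built HuggingFace datasets derived from a selection like the one in Section 4.2.1:

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=4)  # e.g., four complexity buckets

def encode(batch):
    return tok(batch["code"], truncation=True, max_length=512)

# `train_ds`/`eval_ds` are assumed to exist, with "code" and "label" columns.
train_ds = train_ds.map(encode, batched=True)
eval_ds = eval_ds.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cmpx-task", num_train_epochs=3),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)
trainer.train()
```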
4.2.4 Defining Representations with Larger Contexts
One of the goals of JEMMA is to allow experimentation with novel source code representations. In particular, we want users to be able to define representations that take into account a larger context than a single method, or a single file, as is done in the vast majority of work today. JEMMA gives us all the relevant tools to gather that information. The metadata of JEMMA documents the containment hierarchies (e.g., which files belong to which project, and which classes belong to which package) and provides the ability to uniquely and unambiguously identify source code entities at different granularities. In addition, the call graph data documents the immediate callers and callees of each individual method. Since the call graphs link to each method identified by its UUID, all the properties of the methods, including their representations, can be accessed easily and systematically. Thus, by navigating the call graph and the containment hierarchy, various types of global contexts can be defined at the class, package, or even project level. We present two simple examples in the Appendix.
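As a sketch of the general pattern (not the Appendix examples themselves): walk the call graph, pull the token representations of neighbouring methods, and concatenate them up to a fixed token budget. Here, `callers`, `callees`, and `get_tokens` are assumed helpers over the callgraph and representation data:

```python
def build_context(method_id, budget=512):
    """Concatenate token sequences of 1-hop neighbours up to `budget` tokens."""
    context, used = [], 0
    neighbours = callees(method_id, hops=1) | callers(method_id, hops=1)
    for mid in neighbours:
        tokens = get_tokens(mid)        # e.g., the TKNA representation
        if used + len(tokens) > budget:
            break
        context.extend(tokens)
        used += len(tokens)
    return context
```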
5 Empirical Study I: On the Extent of Non-Localness of Software

In this section, we demonstrate the JEMMA Dataset and Workbench through an empirical investigation. We study the extent to which software is made up of interacting functions and methods in a sample of projects contained in JEMMA by analysing their call graphs. We observe how often method calls are local to a file, cross file boundaries, or are calls to external APIs. Then, we analyse the performance on the method call code completion task through the lens of call types when non-local context is added. We pose the following research questions for this part of our study:
- RQ 1. To what extent are method calls non-local?
- RQ 2. What is the effect of adding non-local context for the method call completion task?
5.1 Extent of Non-Localness of Code

We classify each method call in the call graphs into one of four categories:

- Local calls. The called entity is defined in the same file; thus, a machine learning model that has a file context would be likely to see it.
- Package calls. The called entity is defined in the same Java package (i.e., the classes are in the same file directory).
- Project calls. The called entity is defined in the project, but in a different package than the caller.
- API calls. The called entity is not defined in the project, but is a call to an imported library.
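These categories can be derived by joining the call-graph and metadata tables. A sketch follows, with hypothetical file and column names, and class_id used as a proxy for the containing file:

```python
import pandas as pd

calls = pd.read_csv("callgraphs.csv")  # caller_method_id, callee_method_id
meta = pd.read_csv("metadata_methods.csv")[["method_id", "class_id", "package_id"]]

df = (calls
      .merge(meta.add_prefix("caller_"), on="caller_method_id")
      .merge(meta.add_prefix("callee_"), on="callee_method_id", how="left"))

def category(row):
    if pd.isna(row["callee_class_id"]):
        return "API"        # callee not defined in the project
    if row["caller_class_id"] == row["callee_class_id"]:
        return "Local"
    if row["caller_package_id"] == row["callee_package_id"]:
        return "Package"
    return "Project"

df["category"] = df.apply(category, axis=1)
print(df["category"].value_counts(normalize=True))
```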
5.2 Impact of Non-Localness on Code Completion

Early neural code completion models were RNNs with a closed vocabulary, which were unable to learn new identifiers. Since then, open-vocabulary models (Karampatsis et al. 2020; Ciniselli et al. 2021) have considerably improved the state of the art. However, RNNs have a very limited context size, so they are unable to know which identifiers are defined in the project. Since this information is spread over the entire project, it motivates our choice to design a code completion task, with many more data points, that focuses particularly on method-call completion considering a project-wide context.
We use the JEMMA Workbench to analyse the performance of three state-of-the-art Transformer code models, with the natural-language BERT model as the baseline, on a derivative of the code completion task: method-call completion. We sampled 100K data points from the JEMMA Datasets, splitting 80K samples as training data, 5K as validation data, and 15K as test data for training and evaluation. The three code models are CodeBERTa, CodeBERT, and GraphCodeBERT; we use the BERT model as the baseline for this task. All of these models accept sequences of tokens as input, so we use the token representation for training.

Models: A = BERT, B = CodeBERTa, C = CodeBERT, D = GraphCodeBERT (n/c = without non-local context, c = with non-local context, ± = relative improvement)
Model | Local n/c | Local c | ± | Package n/c | Package c | ± | Project n/c | Project c | ± | API n/c | API c | ±
---|---|---|---|---|---|---|---|---|---|---|---|---
A | 0.102 | 0.159 | 56% | 0.112 | 0.154 | 37% | 0.144 | 0.181 | 26% | 0.296 | 0.336 | 14% |
B | 0.171 | 0.284 | 66% | 0.255 | 0.362 | 42% | 0.278 | 0.370 | 33% | 0.524 | 0.606 | 16% |
C | 0.137 | 0.187 | 36% | 0.144 | 0.176 | 22% | 0.182 | 0.209 | 15% | 0.376 | 0.397 | 6% |
D | 0.142 | 0.188 | 32% | 0.147 | 0.176 | 20% | 0.184 | 0.209 | 14% | 0.379 | 0.398 | 5% |
5.3 Implications

6 Empirical Study II: OOW is the Next OOV

Early source code models struggled with out-of-vocabulary (OOV) issues (Hellendoorn and Devanbu 2017), until more recent models introduced and adopted an open vocabulary (Karampatsis et al. 2020). We argue that the next frontier is the out-of-window (OOW) issue: all modern state-of-the-art models tend to have a fixed input size, which may not be enough to fit the additional context needed. How to best use this limited resource is thus an open problem. To that effect, we pose the following research questions in this section:
- RQ 1. Given the need for fitting additional context, are English-based model tokenizers comparable to language-specific tokenizers?
- RQ 2. From the perspective of context size, what types of code entities fit modern transformer models at different input size limits?
6.1 Transformers, Window Sizes, and Tokenizers

The current state of the art in source code modeling is dominated by large pre-trained Transformer models: CodeBERT (Feng et al. 2020), CodeBERTa (Wolf et al. 2019a), PLBART (Ahmad et al. 2021), CodeT5 (Wang et al. 2021), CodeGen (Nijkamp et al. 2022), and GraphCodeBERT (Guo et al. 2020). Codex (Chen et al. 2021) is yet another of these large pre-trained Transformer models, which has demonstrated compelling competence on a variety of tasks without necessarily needing fine-tuning, ranging from program synthesis and program summarization (Chen et al. 2021) to program repair (Prenner and Robbes 2021). These models have a fixed input window size: for CodeBERT, it is 512 tokens, while for the largest Codex model (codex-davinci), it is 4,096 tokens. If an input is longer than the window, it is generally truncated. Transformers rely on self-attention, where the attention heads attend to each pair of tokens: the complexity is hence quadratic, which renders very large windows prohibitive in terms of training and inference time. This raises the question: for a given window size, how much code can we expect to fit? Moreover, CodeBERT and Codex are not models trained from scratch on source code: given the amount of time needed to train such a model from scratch, previous models trained on English (RoBERTa for CodeBERT, a version of GPT-3 for Codex) were fine-tuned on source code instead. This means that both CodeBERT and Codex use a subword tokenizer that was learned not for source code, but for English, which might lead to sub-optimal tokenization.
We sampled methods from JEMMA and used several subword tokenizers to estimate the ratio of subtokens that each subword tokenizer produces. We first noticed that the choice of subword tokenizer has a significant impact on the produced tokenization, and consequently on the amount of code that can fit in a model's input window. We used the following tokenizers for our analyses:
- RoBERTa tokenizer. A byte-level BPE tokenizer, trained on a large English corpus, with a vocabulary of slightly more than 50,000 tokens. A similar tokenizer is used by CodeBERT and Codex.
- CodeBERTa tokenizer. The tokenizer used by CodeBERTa. This tokenizer was trained on source code from the CodeSearchNet corpus, which comprises 2 million methods in 6 programming languages, including Java.
- Java BPE tokenizer. A tokenizer similar to the CodeBERTa tokenizer, trained on 200,000 Java methods from Maven, instead of several languages.
- Java Parser. A standard tokenizer from a Java parser, which does not perform sub-tokenization. We use this as the baseline for our analyses.
We used the Java Parser (standard tokenizer) as the baseline, and calculated the average percentage increase or decrease in the number of generated tokens. The CodeBERTa tokenizer, learned on multiple programming languages, generates on average 98 tokens per 100 tokens of the baseline Java Parser tokenizer. This is expected, since some common token sequences can be merged into a single token (e.g., (); can be counted as one token instead of three). The learned Java BPE tokenizer is even more efficient, using on average 85% of the tokens (i.e., it generates 85 tokens per 100 tokens of the standard tokenizer). This is possible since, for instance, specific class names will be common enough that they can be represented by a single token (e.g., ArrayIndexOutOfBoundsException). On the other hand, the RoBERTa tokenizer is considerably less efficient, needing 126% of the lexical tokens compared to the baseline. This means that a model with a 512-token window and an English subword tokenizer (such as CodeBERT and Codex) will be able to fit only around 409 actual tokens. For tokens such as ArrayIndexOutOfBoundsException, efficient language-specific code tokenizers produce a single token, rather than six separate tokens.
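This difference is easy to reproduce with the public HuggingFace checkpoints; the exact ratios will, of course, vary with the corpus sampled:

```python
from transformers import AutoTokenizer

code = "int[] a = new int[n]; throw new ArrayIndexOutOfBoundsException();"

# Compare an English-trained tokenizer against a code-trained one.
for name in ["roberta-base", "huggingface/CodeBERTa-small-v1"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {len(tok.tokenize(code))} subtokens")
```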
6.2 Fitting Code Entities

We consider five window sizes, corresponding to well-known models:

- Small. A window size of 256 tokens, representing a small transformer model.
- Base. A window size of 512 tokens, representing a model of the same size as CodeBERT (Feng et al. 2020).
- Large. A window size of 1,024 tokens, which is the context size used by the largest GPT-2 model (Radford et al. 2019).
- XL. A window size of 2,048 tokens, which is the context size used by the largest GPT-3 model (Brown et al. 2020).
- XXL. A window size of 4,096 tokens, which is the context size used by the largest Codex model (Chen et al. 2021).
6.2.1 Methods
6.2.2 Classes
6.2.3 Packages
6.2.4 Projects
6.3 Implications

Scaling is a challenge not only for token-based models: ASTs and graph representations of classes, packages, and projects will also have scaling issues, as the number of nodes to consider grows very quickly. Furthermore, Graph Neural Networks can also struggle with long-distance relationships in the graph (Alon and Yahav 2020). Clearly, significant work is needed to find architectures that can fit contexts at the project level, especially if the model size is to be kept small enough to be manageable. Significant research effort will be needed to address the out-of-window (OOW) problem; at a minimum, JEMMA provides the data at scale, and the tools, to investigate it.

7 Limitations
JEMMA is the only effort we are aware of in gathering enough data, preprocessed sufficiently, to enable empirical research on machine learning models that can reason over a more global context than the file or method level. Nevertheless, it has several limitations. Some of these issues are inherited from our use of 50K-C, while others are due to limitations in our pre-processing; while the former will be hard to overcome (barring extensive additional data collection), the latter could be mitigated by further processing on our side.

7.1 Limitations Stemming from the Use of 50K-C
First, JEMMA consists of projects in the Java programming language only. This raises the question of whether models that work well for Java would also work well for other languages. The reason for this limitation is twofold: 1) adding other languages at a similar scale would drastically increase the already extremely significant time we invested in pre-processing data, and 2) restricting ourselves to one language frees us from tooling issues: we do not need to settle on a "common denominator" in tool support (e.g., Infer supports few programming languages, and many of its analyses are limited to a single programming language). Second, JEMMA consists of snapshots of projects, rather than multiple project versions. This prevents us from using it for tasks that rely on multiple versions, or commit data, such as some program repair tasks. On the other hand, this frees us from issues related to the evolution of software systems, such as performing origin analysis (Godfrey and Zou 2005), which is essential as refactorings are very common in software evolution and can lead to discontinuities in the history of entities, particularly the most changed ones (Hora et al. 2018). Omitting versions also considerably reduces the size of the dataset, which is already rather large as it is. Third, while the projects in 50K-C were selected because they could be compiled, 50K-C provides no guarantees that they can be run. Indeed, it is hard to know whether a project can run, even if it can be compiled; and if it can run, the project likely expects some input of some sort. This leaves running test cases as the only option to reliably gather runtime data. In our previous work in Smalltalk, where we performed an empirical study of 1,000 Smalltalk projects, we could run tests for only 16% of them (Callaú et al. 2014). Thus, JEMMA makes no attempt at gathering properties that come from dynamic analysis tools at this time. In the future, JEMMA's property mechanism could be used to document whether a project has runnable test cases, as a first step towards gathering runtime information. We could also expand the dataset with the 76 projects coming from XCorpus, which were selected because they are runnable (Dietrich et al. 2017).

7.2 Limitations Stemming from our Pre-Processing
Although the projects in 50K-C were selected because they were successfully compiled, we were not able to successfully recompile all of them. Roughly 18% of the largest projects could not be compiled; this number trends down for smaller projects. We are not always sure of the reasons for this, although we suspect that issues related to dependencies come into play. This could add a bias to our data, in case the projects that we are unable to compile are markedly different from the ones that we could compile. Nevertheless, all of the metadata and call graphs, and almost all of the properties and representations, could be generated even for the uncompiled projects.
to them or to the methods defined in them, as this would significantly increase the complexity of our model (in terms of levels of nesting in the hierarchy), while these cases are overall rare. Additional pre-processing could handle these cases, but we do not expect this to become necessary.JEMMA
Some of our processing is still ongoing; the remaining data will be added to JEMMA in the coming weeks. A second category of incomplete processing is that some tools occasionally fail on very specific inputs (e.g., the parser used by an analysis tool may handle some edge cases differently than the official parser).

8 Conclusion
We presented JEMMA, a dataset and workbench to support research in the design and evaluation of source code machine learning models. Seen as a dataset, JEMMA is built upon the 50K-C dataset of 50,000 compilable Java projects, which we extend in several ways. We add multiple source code representations at the method level, to allow researchers to experiment with their effectiveness and their variations. We add project-level call graphs, so that researchers can experiment with models that consider multiple methods, rather than a single method or a single file. Finally, we add multiple source code properties, obtained by running source code static analyzers, ranging from basic metrics to advanced analyses based on abstract interpretation.
The JEMMA Workbench, its toolchain, and the corresponding APIs help achieve a variety of objectives. JEMMA can be extended with new properties and representations. It can be used to define machine learning tasks, using the properties and the representations themselves as the basis for prediction tasks. The properties defined in JEMMA can be used to gain insight into the performance on tasks and to pinpoint possible sources of bias. Finally, JEMMA provides all the tools to experiment with new representations that combine the existing ones, allowing the definition of models that can learn from larger contexts than a single method snippet.
We have shown how JEMMA can be used to define a metric prediction task and a method call completion task. We have also shown how JEMMA can be used for empirical studies. In particular, we investigated how the performance on our code completion task was impacted by the type of identifier to predict, showing that models performed much better on API method calls than on method calls defined in the project, indicating the need for models that take the project's context into account. Finally, we have shown that taking this global context into account will be challenging, by studying its size. While state-of-the-art transformer models such as CodeBERT can fit most methods in the dataset, fitting package-level or higher context is much more challenging, even for the largest models such as OpenAI's Codex. This indicates that significant effort lies ahead in defining models able to process this amount of data, a task that we hope JEMMA will support the community in achieving.