1 Introduction
(e.g. self) are highlighted in bold. Figure 2d shows that nearly half of the correct predictions are high-frequency tokens, which is one of the easy cases of code completion.

- We review the state of the art of ML-based code completion approaches (mostly from 2018 to 2021), not only to substantiate the usage of aggregated metrics but also to give an overview of the current research progress in this field. Even though recent approaches use the same modeling methods (e.g. Transformers) for the code completion task, it is still hard to compare models without explicitly re-evaluating them, due to differences in input representations (e.g. sequences of tokens or Abstract Syntax Trees), the datasets used, the evaluation metrics (e.g. accuracy, MRR), and the scope of prediction (e.g. the next n code tokens or a block of code).
- To obtain a refined evaluation of code completion accuracy, we introduce a methodology called Code Token Type Taxonomy (CT3), which characterizes code tokens along multiple dimensions. For each dimension, we derive the types by analyzing the Abstract Syntax Tree (AST) and the relationships between tokens in the AST. CT3 can be used for a comprehensive comparison between approaches, to gain a detailed view of the impact of each component in a prediction model, and to identify model challenges.
- We demonstrate the utility of this methodology with an empirical study of a Transformer-based code completion approach on the Python150k dataset. We compare the impact of using a closed vocabulary vs. an open vocabulary (Karampatsis et al. 2020) and find that the latter yields significantly better accuracy for relevant token types.
2 Background and related work
if condition, and (v) block, code blocks, e.g. a for loop or an entire function. The programming languages used in the datasets are documented in the Dataset column (except for the widespread datasets Python150k and JavaScript150k). We refer to the original papers for more details. The column Eval. ter. & non-ter. shows how the authors handled the results on terminal and non-terminal nodes in ASTs; this is only applicable when the input data is represented in an AST-related form.

Tool name | Modeling method | Year | Pred. Level | Dataset | Eval. Metrics | Input form | Eval. ter. & non-ter.\(^{\mathrm{d}}\)
---|---|---|---|---|---|---|---
RNNs | RNNs, n-gram | 2014 | Code line | Java | MRR, Accuracy | Seqs. tokens | N/A |
n-gram | n-gram | 2016 | Token | C, Java | Perplexity, cross-entropy | Seqs. tokens | N/A |
PHOG | PCFG | 2016 | Token | JavaScript150k | Error rate, log-probability, categories | AST nodes | Separated results |
Pointer Mixture Network\(^{\mathrm{a}}\) | RNNs, Pointer Network | 2017 | Token | Python150k, JavaScript150k | Accuracy | AST nodes | Separated results |
CodeGRU\(^{\mathrm{a}}\) | GRU | 2020 | Token, tokens | Java | MRR, accuracy | Seqs. tokens\(^{\mathrm{c}}\) | N/A |
Structural Language Modeling\(^{\mathrm{a}}\) | LSTM, Copy mechanism | 2020 | Construct | Java, C# | Accuracy, tree@k | AST paths | Integrated results |
Extended Network | PHOG, Pointer Mixture Network | 2020 | Token | Python150k | Accuracy | AST nodes | Terminal only |
IntelliCode Compose\(^{\mathrm{a}}\) | Transformers | 2020 | Tokens | Python, C#, JavaScript, TypeScript | Perplexity, ROUGE-L, Levenshtein | Seqs. tokens | N/A |
Feeding Trees to Transformers\(^{\mathrm{a}}\) | Transformers | 2021 | Token | Python150k, Facebook internal repositories. | MRR, categories | AST nodes, seqs. tokens | Integrated and separated results (with a local Beam search) |
CCAG | GNN | 2021 | Token | Python150k, JavaScript150k | Accuracy | AST graph | Separated results |
Transformers for Source Code\(^{\mathrm{a}}\) | Transformers | 2021 | Token | Python150k\(^{\mathrm{b}}\), JavaScript150k | MRR | AST nodes | Separated results |
BERT for Code Completion | RoBERTa | 2021 | Tokens, construct, block | Java, Android | BLEU, Levenshtein, perfect prediction, semantic equiv. | Seqs. tokens | N/A |
Transformers for Code Completion | Transformers | 2021 | Tokens, construct, block | Java, Android | BLEU, Levenshtein, perfect prediction | Seqs. tokens | N/A |
Codex\(^{\mathrm{a}}\) | GPT-3 | 2021 | Block | Python, APPS | pass@k | Docstrings | N/A |
Kite | GPT-2 | 2017 | Tokens | – | – | – | –
Deep TabNine | GPT-2 | 2019 | Tokens | – | – | – | –
GitHub Copilot | Codex | 2021 | Block | – | – | – | –
Byte-pair Encoding for OOV\(^{\mathrm{a}}\) | BPE | 2020 | Token | Java, C, Python | MRR, cross entropy | Seqs. tokens | N/A |
Anonymization for OOV\(^{\mathrm{a}}\) | Transformers | 2020 | Token | Python150k\(^{\mathrm{b}}\), JavaScript150k\(^{\mathrm{b}}\) | MRR | AST nodes | Integrated results |
2.1 Machine learning for code completion
2.2 Aggregated metrics and refined metrics for evaluating code completion
Dimension | Criteria for values of the dimension | Motivation |
---|---|---|
Syntax type | Covering relevant code token types with a sufficient level of detail | The identifier token type is the completion most demanded by developers |
 | Supporting the division between definition and usage of identifiers | Distinguishing between definition and usage of identifiers is essential\(^{\mathrm{a}}\) |
Context | Providing information about the surrounding structures of a code token in as much detail as possible | Local context plays a large role in code completion accuracy |
Origin | Indicating the location where the token is defined, i.e. in the same file or in external/standard/built-in libraries | Method invocations within the same project are the most prominent sub-category in datasets |
Length | Reflecting the length of a code token in comparison with other tokens in a dataset | The more characters a code token contains, the more likely a completion will be requested |
Frequency | Expressing the frequency of occurrence of a code token in an AST | Rare and difficult completions (i.e. long and not so frequent code tokens) are vital from the developers' point of view |
Methodology | Syntax Type | Context | Origin | Length | Frequency | Applicability to various completion models\(^{\mathrm{a}}\)
---|---|---|---|---|---|---
Bielik et al. (2016) | \(+\) | − | − | − | − | − |
Kim et al. (2021) | \(+\) | \(+\) | − | − | − | \(\checkmark \) |
CT3 (ours) | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) |
2.3 Out-of-vocabulary issue
3 Approach
3.1 Methodology for a refined evaluation
3.1.1 A general workflow
Syntax Type | Context | Origin | Length | Frequency |
---|---|---|---|---|
arg_def | in_arithmetic_op | from_builtin | long | high_frequent |
attribute | in_assign | from_extlib | medium | low_frequent |
class_def | in_bool_op | from_infile | short | medium_frequent |
class_usg | in_class_def | from_stdlib | ||
const_num | in_comparison | |||
const_str | in_else | |||
exception | in_except | |||
func_def | in_for | |||
func_usg | in_func_def | |||
imp_alias | in_if | |||
imp_lib | in_parameter | |||
imp_sublib | in_raise | |||
keyword | in_return | |||
method_def | in_try | |||
method_usg | in_while | |||
var_def | in_with | |||
var_usg | ||||
unknown |
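In the CT3-data (see the example table below), these schema values are stored as integer indices. The following minimal sketch shows the encoding as we read it from that example, i.e. each value's 1-based position within its column of Table 4; the published dataset may use a different internal encoding.

```python
# Assumed encoding (our reading of the example CT3-data below): each value is
# its 1-based position within the corresponding column of Table 4.
SYNTAX_TYPE = ["arg_def", "attribute", "class_def", "class_usg", "const_num",
               "const_str", "exception", "func_def", "func_usg", "imp_alias",
               "imp_lib", "imp_sublib", "keyword", "method_def", "method_usg",
               "var_def", "var_usg", "unknown"]
ORIGIN = ["from_builtin", "from_extlib", "from_infile", "from_stdlib"]
LENGTH = ["long", "medium", "short"]
FREQUENCY = ["high_frequent", "low_frequent", "medium_frequent"]

def decode(syntax_type, origin, length, frequency):
    """Turn the integer indices of a CT3 row back into readable labels."""
    return (SYNTAX_TYPE[syntax_type - 1], ORIGIN[origin - 1],
            LENGTH[length - 1], FREQUENCY[frequency - 1])

# Row of the token `tabs` in the example below: syntax_type=4, origin=2, length=2, frequency=1
print(decode(4, 2, 2, 1))   # -> ('class_usg', 'from_extlib', 'medium', 'high_frequent')
```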
The context value of a terminal node is a list of tuples of the form (context_index, quantity), which is explained later on. The value \(-1\) is also used to denote the context value of non-terminal nodes. However, if a terminal node does not belong to any context we focus on (see Table 4), its context value is [0]. In addition, we record the indices of nodes in an AST and the AST indices in the dataset, for easy tracking back to the original code files.

Token | syntax_type | context | origin | length | frequency | AST_idx | node_idx
---|---|---|---|---|---|---|---|
ImportFrom | \(-1\) | \(-1\) | \(-1\) | \(-1\) | \(-1\) | 0 | 1 |
django.utils.translation | 11 | [0] | 2 | 1 | 1 | 0 | 2 |
alias | \(-1\) | \(-1\) | \(-1\) | \(-1\) | \(-1\) | 0 | 3 |
ugettext_lazy | 12 | [0] | 2 | 1 | 1 | 0 | 4 |
\(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) |
ClassDef | \(-1\) | \(-1\) | \(-1\) | \(-1\) | \(-1\) | 0 | 11 |
NetworkProfileTab | 3 | [(4, 1)] | 3 | 1 | 1 | 0 | 12 |
bases | \(-1\) | \(-1\) | \(-1\) | \(-1\) | \(-1\) | 0 | 13 |
AttributeLoad | \(-1\) | \(-1\) | \(-1\) | \(-1\) | \(-1\) | 0 | 14 |
NameLoad | \(-1\) | \(-1\) | \(-1\) | \(-1\) | \(-1\) | 0 | 15 |
tabs | 4 | [(4, 1), (11, 1)] | 2 | 2 | 1 | 0 | 16 |
\(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) |
ClassDef is a non-terminal node, so its token types are \(-1\) for all dimensions. Meanwhile, the token tabs is a terminal node and a base class in a class definition (i.e. class NetworkProfileTab(tabs)). Its token types are: 4/class_usg for Syntax Type, 2/from_extlib for Origin, 2/medium for Length, and 1/high_frequent for Frequency. Ultimately, the value [(4, 1), (11, 1)] of the Context dimension illustrates that tabs occurs in the in_class_def context once and in the in_parameter context once. More details of the CT3-data and the storage formats are presented in our published dataset.

True symbol (closed vocab.) | Predicted symbol (closed vocab.) | True subtokens (open vocab.) | Predicted subtokens (open vocab.) | Syntax Type | Context | Origin | Length | Frequency | AST idx | Node idx
---|---|---|---|---|---|---|---|---|---|---
ImportFrom | Expr | [‘ImportFrom’, ‘<ENDTOKEN>’] | [‘Expr’, ‘<ENDTOKEN>’] | \(-1\) | \(-1\) | \(-1\) | \(-1\) | \(-1\) | 0 | 1
django.utils.translation | __future__ | [‘django’, ‘.’, ‘utils’, ‘.’, ‘translation’, ‘<ENDTOKEN>’] | [‘django’, ‘.’, ‘db’, ‘<ENDTOKEN>’] | 11 | [0] | 2 | 1 | 1 | 0 | 2
alias | alias | [‘alias’, ‘<ENDTOKEN>’] | [‘alias’, ‘<ENDTOKEN>’] | \(-1\) | \(-1\) | \(-1\) | \(-1\) | \(-1\) | 0 | 3
ugettext_lazy | ugettext_lazy | [‘ugettext_lazy’, ‘<ENDTOKEN>’] | [‘ugettext_lazy’, ‘<ENDTOKEN>’] | 12 | [0] | 2 | 1 | 1 | 0 | 4
\(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) |
ClassDef | ImportFrom | [‘ClassDef’, ‘<ENDTOKEN>’] | [‘ImportFrom’, ‘<ENDTOKEN>’] | \(-1\) | \(-1\) | \(-1\) | \(-1\) | \(-1\) | 0 | 11
<UNKNOWN> | User | [‘Network’, ‘Profile’, ‘Tab’, ‘<ENDTOKEN>’] | [‘Tab’, ‘<ENDTOKEN>’] | 3 | [(4,1)] | 3 | 1 | 1 | 0 | 12
bases | bases | [‘bases’, ‘<ENDTOKEN>’] | [‘bases’, ‘<ENDTOKEN>’] | \(-1\) | \(-1\) | \(-1\) | \(-1\) | \(-1\) | 0 | 13
AttributeLoad | AttributeLoad | [‘AttributeLoad’, ‘<ENDTOKEN>’] | [‘AttributeLoad’, ‘<ENDTOKEN>’] | \(-1\) | \(-1\) | \(-1\) | \(-1\) | \(-1\) | 0 | 14
NameLoad | NameLoad | [‘NameLoad’, ‘<ENDTOKEN>’] | [‘NameLoad’, ‘<ENDTOKEN>’] | \(-1\) | \(-1\) | \(-1\) | \(-1\) | \(-1\) | 0 | 15
tabs | tabs | [‘tabs’, ‘<ENDTOKEN>’] | [‘tabs’, ‘<ENDTOKEN>’] | 4 | [(4,1), (11,1)] | 2 | 2 | 1 | 0 | 16
\(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) |
The token tabs is predicted correctly in both the closed and open vocabulary cases, while only the first part of the django.utils.translation token is predicted by the open vocabulary variant. Two special symbols, <UNKNOWN> and <ENDTOKEN>, are used to indicate tokens that cannot be encoded by the closed/open vocabulary models and to mark the end of sequences of subtokens, respectively. More details on encoding code tokens with closed and open vocabularies in Transformer-based models are discussed in Sect. 4.2.

3.1.2 CT3 schema for Python
Arguments in function and method definitions are assigned the arg_def type, including the special arguments *args and **kwargs in Python. Code tokens which are attributes of packages or classes have attribute as their syntax type.

In Listing 1, get_info and std_id in line 1 are identified as the func_def and arg_def types. While the variables student (line 2), s_name (line 3), and s_class (line 4) are classified as the var_def token type, the std_id (line 2) and student (line 3) variables are grouped into the var_usg type. The tokens Student and StudentClass are labeled as the class_usg type. Ultimately, the attribute token type is used to represent the remaining tokens, i.e. profile and name in line 3 as well as info and name in line 4.
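The original Listing 1 is not reproduced in this excerpt; a hypothetical reconstruction consistent with the description above could look as follows (the actual listing may differ, e.g. in how StudentClass is used in line 4; Student and StudentClass are assumed to be defined or imported elsewhere in the file).

```python
def get_info(std_id):                 # line 1: get_info -> func_def, std_id -> arg_def
    student = Student(std_id)         # line 2: student -> var_def, Student -> class_usg, std_id -> var_usg
    s_name = student.profile.name     # line 3: s_name -> var_def, student -> var_usg, profile/name -> attribute
    s_class = StudentClass.info.name  # line 4: s_class -> var_def, StudentClass -> class_usg, info/name -> attribute
```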
Exceptions handled in try/except blocks are assigned the exception token type, which covers all built-in exceptions. Lastly, the token types imp_lib, imp_sublib, and imp_alias are used to identify imported libraries, imported sublibraries, and their aliases, respectively. The 18th value of the dimension (i.e. unknown) indicates tokens which cannot be assigned to any of the other values. Most of these tokens are empty lists of parameters; such lists are still present in the ASTs but do not affect the code completion accuracy.

For the Context dimension, tokens occurring in if or else statements are labeled with the in_if or in_else values; analogously for the values in_try, in_except, and in_raise. The context values in_for and in_while refer to tokens occurring in loop structures (i.e. for and while). Finally, the in_return value covers tokens in return statements, and the value in_with indicates code tokens in with statements.
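As an illustration of how such context labels can be derived, the following is a minimal sketch using Python's standard ast module. This is our own illustration, not the authors' implementation: the paper works on the Python150k AST format, whose node types and context set (e.g. in_parameter) differ slightly.

```python
import ast
from collections import Counter

# Minimal sketch (assumption, not the published tooling): label each identifier
# with the enclosing constructs it occurs in, mirroring the Context dimension.
CONTEXT_NODES = {
    ast.If: "in_if", ast.For: "in_for", ast.While: "in_while",
    ast.Try: "in_try", ast.ExceptHandler: "in_except", ast.Raise: "in_raise",
    ast.Return: "in_return", ast.With: "in_with",
    ast.FunctionDef: "in_func_def", ast.ClassDef: "in_class_def",
}

def context_of(tree):
    """Map every Name node to a Counter of enclosing context labels."""
    contexts = {}

    def visit(node, stack):
        if isinstance(node, ast.Name):
            contexts[node] = Counter(stack)
        label = CONTEXT_NODES.get(type(node))
        for child in ast.iter_child_nodes(node):
            visit(child, stack + ([label] if label else []))

    visit(tree, [])
    return contexts

tree = ast.parse("class NetworkProfileTab(tabs.TabGroup):\n    pass\n")
for node, ctx in context_of(tree).items():
    print(node.id, dict(ctx))   # -> tabs {'in_class_def': 1}
```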
For the Origin dimension, the from_builtin label marks built-in tokens and constants (e.g. True/False). Tokens categorized as from_extlib originate from external (non-standard) libraries or packages. Identifiers defined in the same file have from_infile as their origin label. Ultimately, the from_stdlib label refers to identifiers defined in standard libraries.

The origin of a token is determined by analyzing the import commands in each AST. Built-in code tokens are those within a predefined python_keyword set. Tokens from within the file are determined by exclusion, as these tokens are neither built-in nor from the standard library or an external library. Code tokens appearing as attributes of a particular library are categorized according to the origin information of the library. For example, considering code line 4 in Listing 1, if the token StudentClass is assigned to the from_infile label, then the info and name tokens also have from_infile as their origin label.
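A rough sketch of this origin classification follows. It is our own illustration, not the published tooling, and it uses dir(builtins) and sys.stdlib_module_names (Python 3.10+) in place of the authors' predefined python_keyword and standard-library lists.

```python
import ast
import builtins
import sys

def origin_of(name, tree):
    """Classify an identifier's origin from the imports in the same file (sketch)."""
    imported = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for a in node.names:
                imported[a.asname or a.name.split(".")[0]] = a.name.split(".")[0]
        elif isinstance(node, ast.ImportFrom):
            for a in node.names:
                imported[a.asname or a.name] = (node.module or "").split(".")[0]
    if name in dir(builtins):
        return "from_builtin"
    if name in imported:
        # Python 3.10+ provides the set of standard-library module names.
        return "from_stdlib" if imported[name] in sys.stdlib_module_names else "from_extlib"
    return "from_infile"   # neither built-in nor imported, i.e. defined in this file

tree = ast.parse("from django.utils.translation import ugettext_lazy\nx = len(ugettext_lazy('a'))\n")
print(origin_of("ugettext_lazy", tree), origin_of("len", tree), origin_of("x", tree))
# -> from_extlib from_builtin from_infile
```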
Short tokens (e.g. i as a loop counter) are less significant for code completion.

3.2 Open vocabulary for Transformers
4 Experimental evaluation
4.1 Research questions
4.2 Experimental setup
We use the <UNKNOWN> token to indicate tokens that cannot be encoded by the vocabulary. The token <PADDING> is used to fill up the data windows in the data files (explained below). Ultimately, the token <ENDTOKEN> marks the end point of a sequence of subtokens and is therefore only applicable in the open vocabulary case. In other words, if n is the vocabulary size, which is a hyperparameter defined before building the vocabulary, then \(n-2\) and \(n-3\) are the numbers of tokens/subtokens in the closed and open vocabulary, respectively. Tokens that cannot be encoded are mapped to the <UNKNOWN> token and omitted in the evaluation step. Consequently, a length threshold is again not needed for this case.
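The following is a minimal sketch of the two encoding schemes (our illustration; the actual open vocabulary uses a learned BPE subtokenizer as in Karampatsis et al. 2020, so the split shown here is only a stand-in).

```python
# Sketch (our illustration) of closed vs. open vocabulary encoding of one token.
def encode_closed(token, vocab):
    # Closed vocabulary: out-of-vocabulary tokens collapse to <UNKNOWN>.
    return token if token in vocab else "<UNKNOWN>"

def encode_open(token, subword_split):
    # Open vocabulary: a learned subtokenizer (e.g. BPE) splits the token;
    # <ENDTOKEN> terminates the sequence so token boundaries can be rebuilt.
    return subword_split(token) + ["<ENDTOKEN>"]

vocab = {"self", "tabs"}
print(encode_closed("django.utils.translation", vocab))   # -> <UNKNOWN>
print(encode_open("NetworkProfileTab", lambda t: ["Network", "Profile", "Tab"]))
# -> ['Network', 'Profile', 'Tab', '<ENDTOKEN>']
```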
Transformer models expect input sequences of a fixed length, with <PADDING> tokens added according to the chosen length of the input sequences (Vaswani et al. 2017). Determining this length involves various factors, such as the length of the ASTs and the memory capacity. Since the ASTs in the datasets are diverse and some can be exceedingly large, it is infeasible to set a value high enough to embed an AST of any length, especially given the memory limitations. To integrate large ASTs into the input data, we use the same technique as Kim et al. (2021), which splits large ASTs into smaller chunks with a sliding window while keeping overlaps between windows to preserve the context. Each window is defined by a window_size (i.e. the number of tokens/subtokens within the window) and a step_size of the sliding window. To ensure all windows have the same size, <PADDING> tokens are used to pad the last windows of the sequences.

Parameter | Value
> tokens are used to pad the last windows of the sequences.Parameter | Value |
---|---|
vocabulary_size | 10,000 |
(window_size, step_size) for closed vocabulary | (1000, 500) |
(window_size, step_size) for open vocabulary | (2000, 1000) |
batch_size | 8 |
epochs | 10 |
max_len_encoder | 50 |
max_len_data | 30 |
optimizer | Adam |
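A minimal sketch of the sliding-window chunking with padding described above follows; it is our illustration of the technique, not the authors' preprocessing code, using the open vocabulary window_size and step_size from the table.

```python
# Sketch of sliding-window chunking with padding (our illustration).
PAD = "<PADDING>"

def windows(tokens, window_size=2000, step_size=1000):
    """Split one encoded AST into overlapping, fixed-size windows."""
    out = []
    for start in range(0, max(len(tokens), 1), step_size):
        chunk = tokens[start:start + window_size]
        chunk += [PAD] * (window_size - len(chunk))   # pad the last window(s)
        out.append(chunk)
        if start + window_size >= len(tokens):        # all tokens are covered
            break
    return out

print(len(windows(list(range(4500)))))  # -> 4 overlapping windows
```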
4.3 Evaluation results
Dataset | c_acc. | o_acc. | c_oov_o_true\(^{*}\) | oov_c\(^{+}\) | oov_o\(^{++}\)
---|---|---|---|---|---|
PY50k | 0.68 | 0.72 | 1.35M | 5.0M | 0.41M |
JS50k | 0.71 | 0.75 | 1.9M | 6.76M | 0.4M |
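To illustrate how the CT3 metadata turns such aggregated numbers into refined ones, the following minimal sketch (our illustration, with rows shaped like the closed-vocabulary example above) computes accuracy per syntax type.

```python
from collections import defaultdict

# Sketch (our illustration): refined accuracy per CT3 syntax type from rows
# of the form (true token, predicted token, syntax_type index).
def refined_accuracy(rows):
    hit, total = defaultdict(int), defaultdict(int)
    for true, pred, syntax_type in rows:
        if syntax_type == -1:          # non-terminal node, skipped here
            continue
        total[syntax_type] += 1
        hit[syntax_type] += (true == pred)
    return {t: hit[t] / total[t] for t in total}

rows = [("tabs", "tabs", 4), ("<UNKNOWN>", "User", 3), ("ugettext_lazy", "ugettext_lazy", 12)]
print(refined_accuracy(rows))   # -> {4: 1.0, 3: 0.0, 12: 1.0}
```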
5 Discussion
5.1 CT3 Challenges
lambda functions, if/else blocks or loops might be obtained by slightly modifying the AST parser.

5.2 Threats to validity
6 Auxiliary experiments
6.1 Length distribution of terminal tokens in the Python150k dataset
6.2 Length threshold for building the open vocabulary
Max length | Single letters | No. of missing non-terminal types\(^{\mathrm{a}}\) | Max length of seq. of subtokens | Most freq. length of seq. of subtokens | Avg. length of seq. of subtokens | No. of included tokens as OOV \(^{\mathrm{b}}\) |
---|---|---|---|---|---|---|
Unlimited | \(\checkmark \) | 67 | 193 | 2 | 2.44 | 0 |
200 | \(\checkmark \) | 64 | 193 | 2 | 2.43 | 0 |
100 | \(\checkmark \) | 63 | 96 | 2 | 2.37 | 0 |
50 | \(\checkmark \) | 63 | 30 | 2 | 2.24 | 0 |
The missing non-terminal types are rarely occurring composite node types such as CompareLtLtLt, CompareLtLtEEqLtLtLt, or AugAssignLShift.

6.3 Length threshold for creating input data files in the open vocabulary case
Very long tokens become <UNKNOWN> tokens after the preprocessing step. We again studied the length distribution of code tokens in the Python and JavaScript datasets, considering both non-terminal and terminal tokens, to identify the threshold for very long tokens.

Max length | Portion of excluded tokens\(^{\mathrm{a}}\) | Accuracy after training\(^{\mathrm{b}}\)
---|---|---|
200 | 0.1% | 0.8179 |
100 | 0.2% | 0.8255 |
30 | 0.9% | 0.8363 |
6.4 Window size for creating input data files in the open vocabulary case
- step_size is inferred as half of window_size for the sake of simplicity.
- Each token is encoded to at least two subtokens (i.e. the token itself and the mark symbol <ENDTOKEN>). As a result, to prevent omitting information, the window used to incorporate sequences of subtokens (i.e. with open vocabulary) should be larger than the one used for sequences of tokens (i.e. with closed vocabulary).
- The value of window_size should be restricted due to the limited capacity of the used GPUs.
- batch_size, i.e. the number of data windows grouped in a batch, also affects the training process. We establish a batch_size threshold for each considered window size, since higher values exceed the capacity of our GPUs.
- Another important element is the Estimated Time of Arrival (ETA), i.e. the estimated time the model needs to complete one epoch (one training iteration). In other words, the training time can be estimated by multiplying the ETA by the number of epochs. We expect our model to be trained within an acceptable duration of one to two days.
- Ultimately, the accuracy obtained after training should be sufficiently high while the training time remains adequate (e.g. two days). Besides, as mentioned at the end of Sect. 6.3, the accuracy declines in the evaluation step since all subtokens of a token must be predicted correctly to form a proper prediction in the open vocabulary case.
Window size | Num. of epochs | Batch size\(^{\mathrm{a}}\) | Acc. after training | ETA (h:mm) |
---|---|---|---|---|
5k | 10 | 1 | 0.4468 | 5:18 |
3k | 10 | 4 | 0.4469 | 6:48 |
3k | 20 | 4 | 0.5243 | 8:01 |
2k | 10 | 8 | 0.8363 | 3:54 |