1 Introduction
- An empirical study across two programming languages that investigates the use of function documentation formats in practice.
- An extension of an existing language model for the task of code comment completion that leverages the additional section structure to improve the accuracy of the generated completion suggestions.
- A comparative evaluation of the study and of the language models across two programming languages to strengthen our findings.
2 Overview
3 Structure Analysis of Function Documentation Comments (RQ1)
Tooling, such as javadoc, exists for automatically extracting these comments from the source code into different kinds of output formats. While formatting styles for other programming languages do exist, to the best of our knowledge they are less common, with fewer available open-source projects. Therefore, we restrict our investigations to these two programming languages.

3.1 Dataset
- only include functions that have an associated documentation comment;
- filter out pairs for which the comment is shorter than 3 tokens;
- filter out pairs for which the function is shorter than 3 lines;
- remove functions with the word "test" in the name, as well as constructors and extension methods such as __str__ in Python or toString in Java;
- remove duplicates from the dataset using a Jaccard-similarity-based algorithm described by Allamanis (2018), keeping a single copy of each duplicate (see the sketch after this list).
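The following minimal sketch illustrates these filtering steps. The helper names (`tokenize`, `jaccard`), the record fields, and the 0.8 similarity threshold are illustrative assumptions, not the exact implementation:

```python
import re

def tokenize(text):
    # Illustrative tokenizer; the actual tokenization may differ.
    return re.findall(r"\w+", text.lower())

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def keep(func):
    """Apply the filtering rules listed above to one function record."""
    if not func.get("docstring"):
        return False                                  # must have a comment
    if len(tokenize(func["docstring"])) < 3:
        return False                                  # comment >= 3 tokens
    if len(func["code"].splitlines()) < 3:
        return False                                  # function >= 3 lines
    if "test" in func["name"].lower():
        return False                                  # exclude test functions
    if func["name"] in {"__init__", "__str__", "toString"}:
        return False                                  # constructors, extension methods
    return True

def deduplicate(funcs, threshold=0.8):
    # Greedy near-duplicate removal via Jaccard similarity; the published
    # algorithm (Allamanis 2018) is more involved and more scalable.
    kept = []
    for f in funcs:
        toks = tokenize(f["code"])
        if all(jaccard(toks, tokenize(k["code"])) < threshold for k in kept):
            kept.append(f)
    return kept
```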
| Dataset | 25% | 50% | 75% | 95% | 100% |
|---|---|---|---|---|---|
| **Python** | | | | | |
| Files | 1 | 2 | 5 | 16 | 517 |
| Function-Docstring Pairs | 4 | 10 | 27 | 127 | 11,211 |
| Code Length | 43 | 72 | 132 | 341 | 28,410 |
| Docstring Length | 11 | 26 | 61 | 184 | 8,510 |
| **Java** | | | | | |
| Files | 1 | 2 | 4 | 12 | 1,181 |
| Function-Docstring Pairs | 6 | 20 | 67 | 369 | 22,028 |
| Code Length | 42 | 66 | 121 | 331 | 68,278 |
| Docstring Length | 15 | 30 | 56 | 137 | 7,135 |
3.2 Parsing Documentation Comments
3.2.1 Parsing Python Docstrings
- Short description: a one-line summary;
- Long description: an extended summary clarifying the functionality;
- Parameters: one tag for each parameter of the function, describing the argument including its type;
- Return: a tag explaining the return type and value;
- Raises: one tag for each possible exception raised by the function and the conditions under which it is raised (an example docstring with all five sections follows this list).
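As an illustration, the following hypothetical Python function carries a docstring in reStructuredText style containing all five sections, in the order listed above:

```python
def divide(a, b):
    """Divide one number by another.

    Performs true division; this function exists only to
    illustrate the docstring structure described above.

    :param a: (float) The dividend.
    :param b: (float) The divisor.
    :returns: (float) The quotient a / b.
    :raises ZeroDivisionError: If b is zero.
    """
    return a / b
```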
| Section type | Correct | Incorrect | Incorrect cases |
|---|---|---|---|
| **Python** | | | |
| Short Description | 91 | 9 | incorrect sentence split |
| Long Description | 81 | 19 | incorrect sentence split, wrong formatting |
| Parameters | 77 | 23 | wrong formatting or parsing for type |
| Return | 76 | 24 | wrong formatting or parsing for type |
| Raises | 74 | 26 | wrong formatting or parsing for type |
| **Java** | | | |
| Short Description | 96 | 4 | incorrect sentence split |
| Long Description | 96 | 4 | incorrect sentence split |
| Parameters | 97 | 3 | docstring/signature name mismatch |
| Return | 100 | 0 | – |
| Raises | 100 | 0 | – |
| Section type | Partial docstring | Extracted section |
|---|---|---|
| Short description | Sample from a Polya-Gamma distribution, as in Proc Int Conf Mach Learn. 2012; 2012: 1343–1350. | Sample from a Polya-Gamma distribution, as in Proc Int Conf Mach Learn. |
| Long description | setLEDBrightness() Sets the brightness of the motion LED to the decimal hex value provided. 0 or 0x00=off 255 or 0xFF=Full Bright. (0-255 or 0x00-0xFF) | 0 or 0x00=off 255 or 0xFF=Full Bright. (0-255 or 0x00-0xFF) |
| Parameters | :param metadata_dict: (dict) Metadata dictionary | ('metadata_dict', '(dict) Metadata dictionary', None, None) |
| Return | Returns: bool. if the RPC interface file was written. | (None, 'bool. if the RPC interface file was written.') |
| Raises | Raises ------ ValueError if broadcast fails | (None, 'OAuth1Error if the request is invalid.') |
3.2.2 Parsing Java Documentation Comments
3.3 Comment Sections Analysis
- What is the prevalence of the four different styles for Python docstrings?
- How often are docstring sections missing where they should be present?
- How common are the different sections among the projects in our dataset?
- How well do parameter docstrings match up with the parameters specified in the function or method (correctness and completeness)?
- How often are parameters described in the short or long descriptions instead of having their own tag description?
- How can the length of parameter, return, and exception descriptions be quantified?
3.3.1 Distribution of Python Formatting Styles
3.3.2 Analysis of Missing Docstring Sections
| Section | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|
| **Python** | | | | | | | |
| Parameters | 0.424 | 0.340 | 0.000 | 0.083 | 0.400 | 0.733 | 1.000 |
| Return | 0.565 | 0.368 | 0.000 | 0.218 | 0.625 | 0.941 | 1.000 |
| Raises | 0.859 | 0.282 | 0.000 | 0.888 | 1.000 | 1.000 | 1.000 |
| **Java** | | | | | | | |
| Parameters | 0.303 | 0.316 | 0.000 | 0.002 | 0.197 | 0.500 | 1.000 |
| Return | 0.243 | 0.292 | 0.000 | 0.000 | 0.125 | 0.397 | 1.000 |
| Raises | 0.868 | 0.181 | 0.000 | 0.000 | 0.000 | 0.857 | 1.000 |
3.3.3 Distribution of Docstrings Sections
| Dataset | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|
| Python | 0.38 | 0.34 | 0.00 | 0.04 | 0.31 | 0.62 | 1.00 |
| Java | 0.29 | 0.31 | 0.00 | 0.00 | 0.18 | 0.47 | 1.00 |
| Section | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|
| **Python** | | | | | | | |
| Short description | 0.977 | 0.104 | 0.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Long description | 0.328 | 0.301 | 0.000 | 0.042 | 0.271 | 0.500 | 1.000 |
| Parameters | 0.252 | 0.335 | 0.000 | 0.000 | 0.000 | 0.500 | 1.000 |
| Return | 0.181 | 0.295 | 0.000 | 0.000 | 0.000 | 0.284 | 1.000 |
| Raises | 0.025 | 0.103 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| **Java** | | | | | | | |
| Short description | 0.996 | 0.434 | 0.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Long description | 0.250 | 0.242 | 0.000 | 0.333 | 0.200 | 0.375 | 1.000 |
| Parameters | 0.560 | 0.333 | 0.000 | 0.294 | 0.622 | 0.846 | 1.000 |
| Return | 0.448 | 0.327 | 0.000 | 0.150 | 0.443 | 0.709 | 1.000 |
| Raises | 0.156 | 0.230 | 0.000 | 0.000 | 0.500 | 0.222 | 1.000 |
3.3.4 Analysis of the Parameters Section
| Language | Metric | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|---|
| Python | Precision | 0.53 | 0.34 | 0.00 | 0.23 | 0.52 | 0.85 | 1.00 |
| | Recall | 0.51 | 0.33 | 0.00 | 0.20 | 0.50 | 0.81 | 1.00 |
| Java | Precision | 0.97 | 0.09 | 0.00 | 0.98 | 1.00 | 1.00 | 1.00 |
| | Recall | 0.65 | 0.35 | 0.00 | 0.38 | 0.77 | 0.98 | 1.00 |
To extract function signatures, we used javalang for Java code, while using Python's own ast module for parsing Python functions. We were able to parse 496,248 Java methods. Since the CodeSearchNet dataset also contains older Python code, we used lib2to3 (Lib2to3 2022) to refactor any Python functions that could not be parsed initially. This allowed us to parse an additional 3,964 functions, for a total of 442,980 Python functions. We were unable to parse 440 Java and 333 Python functions from the CodeSearchNet dataset, as they appear to be malformed. The mean fraction of mentioned parameters for each project is calculated via the following formula:
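$$\text{mentioned}(P) = \frac{1}{|F_P|} \sum_{f \in F_P} \frac{m_f}{n_f}$$

where $F_P$ denotes the set of documented functions in project $P$, $n_f$ the number of parameters in the signature of $f$, and $m_f$ the number of those parameters mentioned in the docstring of $f$.

A minimal sketch of this computation for Python, using the standard ast module mentioned above; the word-boundary mention check is a simplifying assumption:

```python
import ast
import re

def mentioned_fractions(source: str) -> list[float]:
    """For each documented function in a Python source file, compute the
    fraction of its parameters mentioned anywhere in its docstring."""
    fractions = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        doc = ast.get_docstring(node)
        params = [a.arg for a in node.args.args if a.arg not in ("self", "cls")]
        if not doc or not params:
            continue
        hits = sum(bool(re.search(rf"\b{re.escape(p)}\b", doc)) for p in params)
        fractions.append(hits / len(params))
    return fractions
```

The project-level value is then the mean of these per-function fractions over all files of the project.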
3.3.5 Length of Tag Descriptions
For example, the parameter i would be identified in the sentence "Takes a number i." but not in "Not implemented!" by using a word-boundary regex of the kind sketched below.
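A minimal sketch of such a check; the exact regex used in the study is not reproduced here, so this word-boundary variant is an assumption:

```python
import re

def mentions_param(name: str, text: str) -> bool:
    # Match the parameter name only as a standalone word, so the "i"
    # inside "implemented" does not count as a mention.
    return re.search(rf"\b{re.escape(name)}\b", text) is not None

assert mentions_param("i", "Takes a number i.")
assert not mentions_param("i", "Not implemented!")
```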
Many parameter descriptions are of the form @param thing the given thing; i.e., the documentation provides barely any additional information over the source code. The same holds true for exceptions, where a quarter are described in 6 words or fewer in both languages. Furthermore, one in four parameter descriptions mentions the parameter name itself, contributing to the word count, in both Python (27%) and Java (28%).
| Language | Java | Python |
|---|---|---|
| Total number of functions | 496,688 | 443,313 |
| Functions successfully parsed | 496,248 | 442,980 |
| Functions failed to parse | 440 | 333 |
| Parameter descriptions mentioning parameter | 147,350 | 96,234 |
| Functions with complete param tags | 329,940 | 57,539 |
| Functions with incomplete param tags | 166,748 | 385,774 |
| Parameters parsed | 325,219 | 993,581 |
| Parameters found in docstring tags | 558,432 | 367,705 |
| Parameters mentioned in short description | 43,481 | 128,726 |
| Parameters mentioned in long description | 14,263 | 111,777 |
| Parameters not mentioned at all | 273,948 | 776,230 |
| Empty parameter descriptions | 33,636 | 11,111 |
| Empty return descriptions | 288,856 | 350,749 |
| Empty exception descriptions | 17,590 | 344 |
4 Structure-based Comment Completion
For example, given a partially written comment, the model should suggest likely continuations such as "function that can", and so on. We adopt and extend neural language models that were presented in previous work (Ciurumelea et al. 2020) as a baseline for generating completion suggestions.
4.1 Neural Language Models
Given a sequence of words such as "Converts the class into an actual view", such a model can generate possible next words. Some potential ones are "function", "method" and "object", but not "the", "refrigerator" or "view". These kinds of models have applications in a large variety of tasks, such as speech recognition, spelling or grammatical error correction, and machine translation, among others. Traditionally, this problem was tackled using n-gram language models, which estimate the probability of a sequence by extracting frequency counts from the training corpus. However, n-gram models suffer from data sparsity and require complex back-off and smoothing techniques. Additionally, using larger n-gram sizes is very expensive in terms of memory, and these models do not generalize across contexts. For example, seeing sequences such as "blue car" and "red car" will not influence the estimated probability of "black car", whereas a neural language model is able to learn that "blue", "red" and "black" all represent the same concept of color. Neural language models are better able to handle the problems described above and will, in general, have a much higher prediction accuracy than an n-gram model for a particular training set (Jurafsky and Martin 2000). A toy example of the sparsity problem is sketched below.
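This illustrative snippet is not part of the evaluated models; it only demonstrates why unsmoothed n-gram estimates fail on unseen contexts:

```python
from collections import Counter

corpus = "the blue car stops . the red car stops .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def bigram_prob(word: str, prev: str) -> float:
    # Maximum-likelihood estimate: count(prev, word) / count(prev).
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_prob("car", "blue"))   # 1.0: "blue car" was seen in training
print(bigram_prob("car", "black"))  # 0.0: unseen context, despite the analogy
```

A neural model with word embeddings would instead place "blue", "red", and "black" close together in vector space and assign "black car" a reasonable probability.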
4.2 Language Models for Documentation Comments

5 Dataset Creation and Training Details
5.1 Documentation Comments Datasets
- no forks (not marked as a fork of another project);
- include at least 10 Python files;
- commit data: include at least 10 commits, and these commits include at least 2 different author emails;
- include at least 100 docstring-function pairs (a sketch of these checks follows this list).
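A minimal sketch of these selection criteria over repository metadata; the field names of the record are assumptions for illustration:

```python
def select_repository(repo: dict) -> bool:
    """Apply the repository selection criteria listed above."""
    return (
        not repo["is_fork"]                      # no forks
        and repo["num_python_files"] >= 10       # at least 10 Python files
        and repo["num_commits"] >= 10            # at least 10 commits
        and repo["num_author_emails"] >= 2       # at least 2 distinct authors
        and repo["num_docstring_pairs"] >= 100   # at least 100 pairs
    )
```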
5.2 Training Data Extraction
| Dataset | Repositories | Files | Method-Docstring Pairs |
|---|---|---|---|
| **Python** | | | |
| Train | 7,439 | 270,498 | 1,415,297 |
| Validation | 960 | 40,806 | 211,868 |
| Test | 963 | 40,528 | 213,181 |
| **Java** | | | |
| Train | 3,674 | 93,171 | 327,112 |
| Validation | 457 | 9,862 | 34,980 |
| Test | 461 | 13,453 | 46,797 |
| Section type | Train count | Validation count | Test count |
|---|---|---|---|
| **Python** | | | |
| Short description | 1,370,770 | 205,576 | 207,041 |
| Long description | 394,098 | 62,170 | 62,786 |
| Parameters | 790,204 | 115,192 | 113,115 |
| Raises | 38,392 | 5,438 | 5,484 |
| Returns | 233,090 | 33,435 | 32,960 |
| **Java** | | | |
| Short description | 327,111 | 34,980 | 46,797 |
| Long description | 103,044 | 10,852 | 14,908 |
| Parameters | 407,753 | 44,293 | 57,645 |
| Raises | 95,768 | 7,584 | 16,357 |
| Returns | 163,547 | 17,096 | 23,790 |
| Section | Mean | Std | Min | 25% | 50% | 75% |
|---|---|---|---|---|---|---|
| **Python** | | | | | | |
| Full Comment | 41.04 | 76.63 | 1.00 | 9.00 | 18.00 | 45.60 |
| Short description | 11.30 | 9.85 | 1.00 | 7.00 | 9.00 | 13.00 |
| Long description | 41.04 | 62.15 | 1.00 | 12.60 | 23.00 | 46.00 |
| Parameters | 15.58 | 27.57 | 0.00 | 7.00 | 11.00 | 17.00 |
| Returns | 13.31 | 49.65 | 0.00 | 4.00 | 8.00 | 14.00 |
| Raises | 15.45 | 15.58 | 0.00 | 9.00 | 12.00 | 18.40 |
| **Java** | | | | | | |
| Full Comment | 50.67 | 68.75 | 1.00 | 17.00 | 34.00 | 62.00 |
| Short description | 12.19 | 9.44 | 1.00 | 7.00 | 10.00 | 15.00 |
| Long description | 37.95 | 53.14 | 1.00 | 13.00 | 22.00 | 43.00 |
| Parameters | 10.40 | 10.72 | 1.00 | 6.00 | 9.00 | 12.00 |
| Returns | 10.26 | 9.46 | 1.00 | 5.00 | 9.00 | 13.00 |
| Raises | 11.60 | 14.73 | 0.00 | 6.00 | 10.00 | 14.00 |
| Section type | Section content |
|---|---|
| Short Description | Returns the cross entropy loss of the classifier on images. |
| Long Description | None |
| Parameters | [('images', 'A minibatch tensor of MNIST digits. Shape must be [batch, 28, 28, 1].', None, None), ('one_hot_labels', 'The one hot label of the examples. Tensor size is [batch, 10].', None, None)] |
| Returns | (None, 'A scalar Tensor representing the cross entropy of the image minibatch.') |
| Raises | None |
| Section type | Section content |
|---|---|
| Short Description | returns the cross entropy loss of the classifier on images <punct.> |
| Long Description | None |
| Parameters | [('images', 'a minibatch tensor of mnist digits <punct.> shape must be batch <punct,> 28 <punct,> 28 <punct,> < newline > 1 <punct.>', None, None), ('one hot labels', 'the one hot label of the examples <punct.> tensor size is batch <punct,> < newline > 10 <punct.>', None, None)] |
| Returns | (None, 'a scalar tensor representing the cross entropy of the image minibatch <punct.>') |
| Raises | None |
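The preprocessing visible in the table above can be sketched as follows; this is a simplified approximation, and the exact token set and identifier-splitting rules of the pipeline may differ:

```python
import re

def preprocess(text: str) -> str:
    """Lowercase, split snake_case identifiers, and replace punctuation
    with explicit tokens such as <punct.> and <punct,>."""
    text = text.lower().replace("_", " ")   # one_hot_labels -> one hot labels
    text = re.sub(r"\.", " <punct.> ", text)
    text = re.sub(r",", " <punct,> ", text)
    text = re.sub(r"[\[\]()]", " ", text)   # drop brackets, as in the example
    text = text.replace("\n", " < newline > ")
    return " ".join(text.split())

print(preprocess("The one hot label of the examples. Tensor size is [batch, 10]."))
# the one hot label of the examples <punct.> tensor size is batch <punct,> 10 <punct.>
```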
| Method body | Section type | Docstring sequence |
|---|---|---|
| def mnist cross entropy images... | Short Description | < sos > returns the cross entropy |
| def mnist cross entropy images... | Short Description | returns the cross entropy loss |
| def mnist cross entropy images... | Parameters | < sos > images < sod > a minibatch |
| def mnist cross entropy images... | Parameters | images < sod > a minibatch tensor |
| def mnist cross entropy images... | Returns | < sos > < sod > a scalar tensor |
| def mnist cross entropy images... | Returns | < sod > a scalar tensor representing |
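The rows above suggest a sliding-window scheme over the preprocessed section text, with special tokens < sos > (start of section) and < sod > (start of description, separating a tag name from its description), presumably predicting the token that follows each window. A sketch under those assumptions, with an assumed window length of five tokens:

```python
def training_windows(tokens: list[str], window: int = 5):
    """Yield (input_window, next_token) pairs over a token sequence that
    already starts with <sos> (and contains <sod> where applicable)."""
    for i in range(len(tokens) - window):
        yield tokens[i:i + window], tokens[i + window]

seq = "<sos> returns the cross entropy loss of the classifier".split()
for win, target in training_windows(seq):
    print(win, "->", target)
# ['<sos>', 'returns', 'the', 'cross', 'entropy'] -> loss
# ['returns', 'the', 'cross', 'entropy', 'loss'] -> of
# ...
```

Each such window is paired with the method-body tokens and the section type as additional model inputs, as the table indicates.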
5.3 Model Configuration and Training Details
- a size of 512 for the prefix and method body Embedding layers, while a size of 64 is used for the section type Embedding layer;
- an LSTM layer with size 256, followed by a Dense layer with size 128, and finally the Output layer with dimension equal to the vocabulary size of 30,000; we also add a Dropout layer between the Dense and Output layers with a dropout probability of 0.2;
- training is done using the Adam optimizer with a learning rate of 3e−4 (one way to assemble these layers is sketched below).
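A minimal sketch of how these layers could be wired together in Keras. How the comment prefix, method body, and section type inputs are combined is not specified above, so the concatenation points, sequence lengths, and the ReLU activation are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 30_000
NUM_SECTION_TYPES = 5
PREFIX_LEN, BODY_LEN = 20, 200  # assumed sequence lengths

prefix_in = layers.Input(shape=(PREFIX_LEN,), dtype="int32", name="comment_prefix")
body_in = layers.Input(shape=(BODY_LEN,), dtype="int32", name="method_body")
section_in = layers.Input(shape=(), dtype="int32", name="section_type")

# 512-dimensional embeddings for prefix and method body tokens,
# a 64-dimensional embedding for the section type.
prefix_emb = layers.Embedding(VOCAB_SIZE, 512)(prefix_in)
body_emb = layers.Embedding(VOCAB_SIZE, 512)(body_in)
section_emb = layers.Embedding(NUM_SECTION_TYPES, 64)(section_in)

x = layers.Concatenate(axis=1)([body_emb, prefix_emb])   # one token sequence
h = layers.LSTM(256)(x)                                  # LSTM layer, size 256
h = layers.Concatenate()([h, section_emb])               # inject section type
h = layers.Dense(128, activation="relu")(h)              # Dense layer, size 128
h = layers.Dropout(0.2)(h)                               # Dropout before Output
out = layers.Dense(VOCAB_SIZE, activation="softmax")(h)  # next-token distribution

model = tf.keras.Model([prefix_in, body_in, section_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-4),
              loss="sparse_categorical_crossentropy")
```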
6 Can Structural Information Improve Completion Accuracy? (RQ2)
| Model | Short Description | Long Description | Parameters | Return | Raises |
|---|---|---|---|---|---|
| **Top-1** | | | | | |
| LM | 0.180 (− 12.7%) | 0.182 (− 6.3%) | 0.251 (− 4.9%) | 0.247 (− 6.6%) | 0.307 (− 4.9%) |
| Section LM | 0.183 (− 11.0%) | 0.182 (− 6.2%) | 0.256 (− 3.1%) | 0.241 (− 9.0%) | 0.335 (+ 4.0%) |
| Context LM | 0.206 | 0.194 | 0.264 | 0.264 | 0.323 |
| Context-Sec LM | 0.210 (+ 2.0%) | 0.194 (+ 0.1%) | 0.270 (+ 2.5%) | 0.273 (+ 3.3%) | 0.348 (+ 8.0%) |
| Sec specific Models | 0.221 (+ 7.3%) | 0.202 (+ 4.1%) | 0.284 (+ 7.6%) | 0.290 (+ 9.6%) | 0.349 (+ 8.3%) |
| **Top-3** | | | | | |
| LM | 0.286 (− 13.6%) | 0.286 (− 7.2%) | 0.373 (− 5.7%) | 0.361 (− 7.9%) | 0.422 (− 6.5%) |
| Section LM | 0.294 (− 11.2%) | 0.287 (− 6.8%) | 0.380 (− 4.0%) | 0.356 (− 9.2%) | 0.456 (+ 1.1%) |
| Context LM | 0.331 | 0.308 | 0.396 | 0.392 | 0.451 |
| Context-Sec LM | 0.339 (+ 2.6%) | 0.308 (− 0.1%) | 0.405 (+ 2.3%) | 0.400 (+ 1.9%) | 0.483 (+ 7.1%) |
| Sec specific Models | 0.352 (+ 6.5%) | 0.315 (+ 2.0%) | 0.414 (+ 4.7%) | 0.402 (+ 2.6%) | 0.458 (+ 1.5%) |
| **Top-5** | | | | | |
| LM | 0.341 (− 13.8%) | 0.346 (− 7.4%) | 0.435 (− 6.2%) | 0.418 (− 8.5%) | 0.480 (− 6.9%) |
| Section LM | 0.349 (− 11.7%) | 0.350 (− 6.6%) | 0.443 (− 4.4%) | 0.417 (− 8.7%) | 0.512 (− 0.6%) |
| Context LM | 0.395 | 0.374 | 0.463 | 0.457 | 0.515 |
| Context-Sec LM | 0.404 (+ 2.3%) | 0.375 (+ 0.3%) | 0.473 (+ 2.0%) | 0.463 (+ 1.4%) | 0.546 (+ 6.0%) |
| Sec specific Models | 0.416 (+ 5.4%) | 0.381 (+ 1.8%) | 0.478 (+ 3.2%) | 0.456 (− 0.1%) | 0.509 (− 1.3%) |
| **Top-10** | | | | | |
| LM | 0.427 (− 13.4%) | 0.436 (− 7.6%) | 0.523 (− 6.4%) | 0.502 (− 8.4%) | 0.562 (− 7.0%) |
| Section LM | 0.433 (− 12.2%) | 0.440 (− 6.9%) | 0.530 (− 5.1%) | 0.500 (− 8.8%) | 0.586 (− 3.0%) |
| Context LM | 0.493 | 0.472 | 0.558 | 0.548 | 0.604 |
| Context-Sec LM | 0.500 (+ 1.4%) | 0.474 (+ 0.4%) | 0.565 (+ 1.3%) | 0.554 (+ 1.1%) | 0.633 (+ 4.7%) |
| Sec specific Models | 0.509 (+ 3.1%) | 0.476 (+ 0.8%) | 0.564 (+ 1.0%) | 0.533 (− 2.8%) | 0.578 (− 4.4%) |
| Model | Short Description | Long Description | Parameters | Return | Raises |
|---|---|---|---|---|---|
| **Python** | | | | | |
| Context LM | 0.232 | 0.229 | 0.327 | 0.279 | 0.340 |
| Context-Sec LM | 0.238 (+ 2.3%) | 0.228 (− 0.5%) | 0.337 (+ 3.0%) | 0.285 (+ 2.1%) | 0.368 (+ 8.1%) |
| Sec specific Models | 0.252 (+ 8.3%) | 0.229 (+ 0.3%) | 0.347 (+ 6.1%) | 0.266 (− 4.7%) | 0.358 (+ 5.2%) |
| **Java** | | | | | |
| Context LM | 0.265 | 0.229 | 0.403 | 0.302 | 0.362 |
| Context-Sec LM | 0.257 (− 2.9%) | 0.231 (+ 1.3%) | 0.423 (+ 5.0%) | 0.296 (− 1.9%) | 0.407 (+ 12.5%) |
| Sec specific Models | 0.251 (− 5.0%) | 0.216 (− 5.5%) | 0.415 (+ 2.9%) | 0.267 (− 11.4%) | 0.374 (+ 3.4%) |
7 How Language-Dependent Are The Evaluation Results? (RQ3)
| Model | Short Description | Long Description | Parameters | Return | Raises |
|---|---|---|---|---|---|
| **Top-1** | | | | | |
| Context LM | 0.222 | 0.192 | 0.325 | 0.270 | 0.317 |
| Sec specific Models | 0.213 (− 4.1%) | 0.187 (− 3.0%) | 0.326 (+ 0.5%) | 0.244 (− 9.5%) | 0.310 (− 2.3%) |
| Context-Sec LM | 0.223 (+ 0.2%) | 0.197 (+ 2.3%) | 0.335 (+ 3.3%) | 0.268 (− 0.6%) | 0.342 (+ 7.8%) |
| **Top-3** | | | | | |
| Context LM | 0.353 | 0.311 | 0.463 | 0.402 | 0.469 |
| Sec specific Models | 0.345 (− 2.4%) | 0.300 (− 3.4%) | 0.456 (− 1.5%) | 0.367 (− 8.6%) | 0.439 (− 6.3%) |
| Context-Sec LM | 0.364 (+ 2.9%) | 0.321 (+ 3.3%) | 0.480 (+ 3.6%) | 0.408 (+ 1.5%) | 0.495 (+ 5.6%) |
| **Top-5** | | | | | |
| Context LM | 0.421 | 0.376 | 0.530 | 0.467 | 0.541 |
| Sec specific Models | 0.408 (− 3.2%) | 0.360 (− 4.2%) | 0.517 (− 2.5%) | 0.430 (− 7.8%) | 0.500 (− 7.6%) |
| Context-Sec LM | 0.432 (+ 2.6%) | 0.386 (+ 2.7%) | 0.545 (+ 2.8%) | 0.476 (+ 2.0%) | 0.565 (+ 4.4%) |
| **Top-10** | | | | | |
| Context LM | 0.516 | 0.466 | 0.613 | 0.556 | 0.633 |
| Sec specific Models | 0.498 (− 3.5%) | 0.444 (− 4.9%) | 0.592 (− 3.5%) | 0.515 (− 7.3%) | 0.575 (− 9.1%) |
| Context-Sec LM | 0.526 (+ 2.0%) | 0.474 (+ 1.6%) | 0.625 (+ 1.9%) | 0.569 (+ 2.4%) | 0.650 (+ 2.8%) |