1 Introduction
“positionOfPlayer.” Because a compound name corresponds to a short phrase describing the variable’s role, many programmers and code reviewers can easily understand what the variable stores and how it works in the program. Empirical studies have also reported the positive effect provided by a compound name (Schankin et al. 2018).

“lineIndex” and “lineIndent.” Due to their high similarity, they cause a risk of misreading or mixing up variables during the programming or code review activities. Thus, they may adversely affect the code readability even though each variable name is informative. Aman et al. (2019) reported an empirical study showing that confusing variable pairs are related to the fault-proneness of Java methods. Hence, automatically detecting confusing variable pairs would be helpful for developers in successful programming and code review.

“sizeOfBlocks” similar to “blockSizes.” However, the Levenshtein distance-based measure in the previous study judges the pair as dissimilar since we need to edit 92% (11 out of 12) characters of “sizeOfBlocks” to convert it to “blockSizes.” It is better to consider both the string and semantic similarity to examine the confusing variable pairs. To analyze the name similarity from both perspectives above, we conduct large-scale investigations of compound variable names in Java and Python programs. Then, to examine if those name similarities can contribute to distilling confusing variable pairs that decay the code readability, we perform an evaluation study using an ordinal scale with human participants on the perceived confusion of a given pair of variable names. Furthermore, we develop support tools for automatically detecting confusing variable pairs in Java and Python (https://github.com/amanhirohisa/cvpfinder).

2 Related Work
“i, j, k” for loop indexes, “s” for strings, and “t” for times. Swidan et al. (2017) also analyzed single-letter names in Scratch programs using the dataset presented by Aivaloglou and Hermans (2016). The analysis showed that single-letter names are less common in Scratch programs: the percentage of such names was only 4%. As the above previous work reported, while single-letter names are fundamental variable names, they do not seem to form the majority in real-world variable naming. In other words, other naming styles (categories) appear more attractive to programmers. For example, Aman et al. (2021b) reported, through a large-scale investigation of Java programs from GitHub, that English word names and compound names account for about 40% and 34% of variable names, respectively; in comparison, the percentage of single-letter names was about 17%. Moreover, that percentage decreases to about 5% when the variable’s scope becomes broad.

“c”), abbreviation (e.g., “cnt”), and fully spelled word (e.g., “count”). Their experiment with 128 programmers statistically showed that fully spelled words and abbreviations are better than single-letter names in program comprehension. Scanniello et al. (2017) also conducted empirical studies involving 100 programmers to compare the above three kinds of names regarding program comprehension and fault detection/fixing. Their empirical results support the finding of Lawrie et al. (2007). Schankin et al. (2018) empirically demonstrated the positive aspect of descriptive compound names. Through an empirical study with 88 programmers, they reported that descriptive names help programmers detect faults more quickly than short, non-descriptive names. As the previous work showed, making variable names descriptive is a better way to name variables. This trend also underlies recent name-related studies: for example, Tran et al. (2019) studied a way to recover meaningful variable names from the shortened names in JavaScript programs; Lacomis et al. (2019) proposed a technique to reconstruct meaningful variable names in the program decompiled from the binary.

“sizeOfBlocks” and “blockSizes.” They are not similar to each other in terms of string similarity. However, when we focus on their semantic aspects, they can be similar names. Such a difference has motivated us to cover both name similarity concepts in this paper.

3 Compound Variable Name and Confusing Name
3.1 Compound Variable Name
“_”), whose first character is an English letter or an underscore. In other words, it must be a string matching the regular expression “[a-zA-Z_][a-zA-Z_0-9]*”.

“true”, “false”, “null”).

“_”). For example, if we make a name from “data file name,” the compound names in camelCase and snake_case can be “dataFileName” and “data_file_name,” respectively.

“testfile” (test + file) because the element terms are simple and widely used. We regard a name as a compound name in this paper if we can split it into English dictionary words or well-known abbreviations. We can utilize Spiral 1.1.0 (Hucka 2018), a sophisticated Python module for splitting identifiers, to split variable names.

3.2 Confusing Variable Pair
“shippingWeight,” and the IDE suggests candidates including the correct name and other highly similar ones. Suppose the programmer wrongly chose “shippingHeight” or “shippingMaxWeight” for it. In those cases, the programmer and the code reviewers might not quickly find those mistakes because “shippingHeight” and “shippingMaxWeight” are highly similar to “shippingWeight” in terms of string similarity or semantic similarity. Although the case shown in Fig. 2 is just an example, it illustrates the risk of variable mix-up caused by confusing variable pairs. Paying attention to confusing variable pairs is therefore worthwhile for managing code readability.

4 Evaluation of Name Similarity
4.1 Evaluation of String Similarity
“shippingHeight” looks more similar to “shippingWeight” than “productHeight.” We can quantify such a string similarity by focusing on how many characters we should edit to convert one name to another. A character edit is one of character addition, deletion, and substitution. Then, the least number of character edits to convert can be an index of string dissimilarity, referred to as the Levenshtein distance (Gusfield 1997).

(\( name _1=\)“shippingHeight”, \( name _2=\)“shippingWeight”), we can convert \( name _1\) to \( name _2\) by only substituting “H” with “W,” i.e., the least number of required character edits is one. On the other hand, for the second pair, (\( name _1=\)“shippingHeight”, \( name _2=\)“productHeight”), we need eight character edits to convert \( name _1\) to \( name _2\): s \(\rightarrow \) \(\emptyset \) (delete “s”), h \(\rightarrow \) p (substitute “h” with “p”), i \(\rightarrow \) r, p \(\rightarrow \) o, p \(\rightarrow \) d, i \(\rightarrow \) u, n \(\rightarrow \) c, and g \(\rightarrow \) t.
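
The edit counts in the example above can be checked with a short dynamic-programming sketch of the Levenshtein distance. The function names are hypothetical and are not taken from the authors' tools. The normalization in `string_similarity` (one minus the distance divided by the longer name's length) is our assumption; it is consistent with the string similarity scores quoted in Section 6.4, such as 0.947 for “csObservationMethod” vs. “vsObservationMethod.”

```python
def levenshtein(a: str, b: str) -> int:
    """Least number of character additions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string needs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # add cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def string_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer name."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("shippingHeight", "shippingWeight"))
print(levenshtein("shippingHeight", "productHeight"))
```

Keeping only two rows of the table at a time keeps memory linear in the length of the second name, which is enough for identifier-sized strings.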
4.2 Evaluation of Semantic Similarity
“numberOfLetter” and “number_of_letter” into “number, Of, Letter” and “number, of, letter,” respectively. Moreover, Spiral can split a simply-concatenated name like “testfile” into “test” and “file”. When we can obtain two or more English words or abbreviations through the splitting by Spiral, we regard the original variable name as a compound name. We leverage PyEnchant 3.2.2, a spellchecking library, to check whether or not a word is an English dictionary word. To cover abbreviations that are not ordinary English words, we prepared an additional private dictionary and used it in the PyEnchant checking.

“list,” “lists,” “listed,” and “listing.” We utilize PorterStemmer in Python nltk 3.7 to stem the words. Furthermore, we sometimes encounter a variable name using a number such as “inputBuffer2.” Although the number is a part of the name, it would not be essential for considering the meaning of the variable name. To avoid any impact caused by such a number, we replace all numbers appearing in a compound variable name with the special token “<num>.”

“sizeOfBlocks” is relatively similar to “blockSize” automatically.

5 Support Tools
5.1 Outline
5.2 Support Tool for Java
5.3 Support Tool for Python
ast module provided in Python to parse Python source files and obtained the corresponding ASTs. Python grammar has no explicit statement for variable declaration (except for the “global” declaration). The environment allocates a variable when the programmer first assigns a value to it. In other words, such a first assignment to a variable corresponds to the variable’s declaration. Thus, we traverse the AST to find ast.Name nodes and regard them as variable declaration points if they are used in the “Store” context. Furthermore, we identify the variable’s scope by checking the corresponding ancestor AST nodes. This tool supports the following three kinds of variables:
- global variables, which are available anywhere in the module,
- class attributes, which are available anywhere within the class, and
- local variables, which are available only within the function or the method.
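
The traversal described above can be sketched with the standard `ast` module as follows. This is a minimal illustration, not the tool's actual implementation: the class and function names are hypothetical, class attributes are limited to assignments in the class body (assignments through `self.` are `ast.Attribute` nodes and are not covered), and `global` statements are ignored. Names bound inside list, set, and dictionary comprehensions are skipped.

```python
import ast


class DeclarationCollector(ast.NodeVisitor):
    """Record the first assignment of each name in each scope."""

    def __init__(self):
        self.scopes = [("global", "<module>")]  # stack of (kind, name)
        self.seen = set()
        self.declarations = []  # (variable name, scope kind), in source order

    def _enter(self, node, kind):
        # Push the new scope, visit the body, then restore the outer scope.
        self.scopes.append((kind, node.name))
        self.generic_visit(node)
        self.scopes.pop()

    def visit_ClassDef(self, node):
        self._enter(node, "class")

    def visit_FunctionDef(self, node):
        self._enter(node, "local")

    visit_AsyncFunctionDef = visit_FunctionDef

    def _skip(self, node):
        # Do not descend: comprehension variables are not reported.
        return None

    visit_ListComp = visit_SetComp = visit_DictComp = _skip

    def visit_Name(self, node):
        # A name in the Store context is an assignment target.
        if isinstance(node.ctx, ast.Store):
            key = (tuple(self.scopes), node.id)
            if key not in self.seen:  # only the first assignment counts
                self.seen.add(key)
                self.declarations.append((node.id, self.scopes[-1][0]))


def find_declarations(source: str):
    collector = DeclarationCollector()
    collector.visit(ast.parse(source))
    return collector.declarations
```

The scope stack plays the role of the ancestor check: the innermost enclosing class or function definition decides whether a name is reported as global, class, or local.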
“x” in “[x for x in list]”; it is referred to as “list comprehension.” However, this tool does not support variables in the context of list comprehension because they are limited to list construction and would not become a part of confusing variable pairs. Similarly, the tool omits variables used in the context of “set comprehension” or “dictionary comprehension.”

5.4 Example
“cvpfinder4j” and “cvpfinder4p” for Java and Python, respectively.

“storm/external/storm-jdbc” as the input. The directory contains 20 Java source files, and our tool detected 18 confusing variable pairs. Figure 5 presents the report file (report/report.csv) produced by that execution, and Fig. 6 zooms in on the part listing the confusing variable names. The report CSV file contains the path of the analyzed source file, the mark representing whether it is a confusing variable pair, the variable names with their scopes, and the computed string and semantic similarities. The report is sorted so that the confusing variable pairs appear at the top of the list. The mark “**” indicates that the pair’s similarities exceed both the string and semantic similarity thresholds; the mark “*” indicates that the pair exceeds only one of the two thresholds. In the example, 16 pairs are identically named pairs, i.e., they are pairs of class fields and local variables with the same names; two pairs are highly similar names: “connectionProvider” vs. “connectionProviderParam,” and “columnList” vs. “columnLists.” They represent typical examples of confusing variable pairs.

“cvpfinder4p”) because the usage is the same as the tool for Java.

6 Large-Scale Investigation of Confusing Variable Pairs
6.1 Aim
- RQ1: Are the string and semantic similarity scores helpful in detecting confusing variable pairs? This is a fundamental question for our similarity quantification. We need to examine if our similarity evaluations are valid. We will answer this question through an evaluation study with human participants.
- RQ2: What kind of characteristics do the detected confusing variable pairs have? It is also helpful to understand the naming trends of the confusing variable names to prevent deterioration of the code readability caused by those variables. We will examine the names detected in the studied projects.
6.2 Data Collection and Evaluation Study
Item | Description |
---|---|
Processor | Apple M1 |
Memory | 16GB |
OS | macOS Ventura 13.0.1 |
Python | Python 3.9.15 |
repository collector | radon-repositories-collector 0.0.5 |
identifier splitter | Spiral 1.1.0 |
stemmer | PorterStemmer on nltk 3.7 |
spell checker | enchant 2.3.3, pyenchant 3.2.2 |
Doc2Vec | Doc2Vec on gensim 4.2.0 (vector_size=100, dm=0, min_count=1, epochs=400) |
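
A Doc2Vec model with the settings above yields one 100-dimensional vector per name. The sketch below shows one common way to compare two such vectors, cosine similarity, using only the standard library. That the semantic similarity is computed this way is our assumption, and the function name is hypothetical; cosine similarity ranges from -1 to 1.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)
```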
6.3 Results
6.3.1 Results of Data Collection
“valuesToArray1” and “valuesToArray2.”

6.3.2 Results of Evaluation Study with Human Participants
Similarity | Min | 25% | 50% | Mean | 75% | Max |
---|---|---|---|---|---|---|
String | 0.000 | 0.045 | 0.136 | 0.172 | 0.250 | 1.000 |
Semantic | –0.129 | 0.300 | 0.382 | 0.407 | 0.491 | 1.000 |
Similarity | Min | 25% | 50% | Mean | 75% | Max |
---|---|---|---|---|---|---|
String | 0.000 | 0.130 | 0.185 | 0.213 | 0.261 | 1.000 |
Semantic | –0.147 | 0.288 | 0.364 | 0.394 | 0.466 | 1.000 |
Confusion level evaluation | –2 | –1 | 0 | +1 | +2 | Total |
---|---|---|---|---|---|---|
#Samples | 197 (23.5%) | 197 (23.5%) | 80 (9.5%) | 226 (26.9%) | 139 (16.6%) | 839 |
Confusion level evaluation | –2 | –1 | 0 | +1 | +2 | Total |
---|---|---|---|---|---|---|
#Samples | 215 (24.7%) | 241 (27.7%) | 98 (11.3%) | 190 (21.9%) | 125 (14.4%) | 869 |
6.4 Discussion
6.4.1 RQ1: Are the String and Semantic Similarity Scores Helpful in Detecting Confusing Variable Pairs?
6.4.2 Answer to RQ1
6.4.3 RQ2: What Kind of Characteristics do the Detected Confusing Variable Pairs Have?
- if two compound names have high string similarity, they may share some words in their names, resulting in high semantic similarity;
- however, even if two compound names have high semantic similarity, they might not use identical words.

- Is the string similarity higher than \(\tau _{str}\)?, or
- Is the semantic similarity higher than \(\tau _{sem}\)?
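
The two checks above, combined with the report marks from Section 5.4, can be sketched as a small classifier. The threshold values are those listed for Java and Python in the threshold table of this section; the dictionary and function names are hypothetical and do not come from the tools.

```python
# (tau_str, tau_sem) per language, as reported in this section.
THRESHOLDS = {
    "java": (0.75, 0.96),
    "python": (0.84, 0.95),
}


def confusion_mark(string_sim: float, semantic_sim: float, language: str) -> str:
    """Return "**" if both thresholds are exceeded, "*" if one is, "" if none."""
    tau_str, tau_sem = THRESHOLDS[language]
    exceeded = int(string_sim > tau_str) + int(semantic_sim > tau_sem)
    return "*" * exceeded


def is_confusing(string_sim: float, semantic_sim: float, language: str) -> bool:
    """A pair is treated as confusing when at least one threshold is exceeded."""
    return confusion_mark(string_sim, semantic_sim, language) != ""
```

For instance, a Java pair with string similarity 0.067 and semantic similarity 0.972 receives the single mark because only the semantic threshold is exceeded.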
- “csObservationMethod” vs. “vsObservationMethod” (string similarity = 0.947), and
- “mRedditDataRoomDatabase” vs. “redditDataRoomDatabase” (string similarity = 0.913).
Language | String similarity (\(\tau _{str}\)) | Semantic similarity (\(\tau _{sem}\)) |
---|---|---|
Java | 0.75 | 0.96 |
Python | 0.84 | 0.95 |
Semantic similarity | String similarity \(> \tau _{str}\) | String similarity \(\le \tau _{str}\) |
---|---|---|
\(> \tau _{sem}\) | 1,007,963 (0.86%) | 32,991 (0.03%) |
\(\le \tau _{sem}\) | 583,287 (0.50%) | 115,296,885 (98.61%) |
Semantic similarity | String similarity \(> \tau _{str}\) | String similarity \(\le \tau _{str}\) |
---|---|---|
\(> \tau _{sem}\) | 300,935 (0.28%) | 917,025 (0.86%) |
\(\le \tau _{sem}\) | 120,169 (0.11%) | 105,605,394 (98.75%) |
“**” and the latter ones by mark “*” (see Fig. 5 in Section 5.4).

No. | Variable name pair | String similarity | Semantic similarity |
---|---|---|---|
1 | \(\texttt {expectedFormattedResultsList}\) \(\texttt {expectedFormattedResultsPColl}\) | 0.828 | 0.340 |
2 | \(\texttt {githubComKubernetesSigsServiceCatalogPkgApisServicecatalogV1beta1ServiceClass}\) \(\texttt {githubComKubernetesSigsServiceCatalogPkgApisServicecatalogV1beta1ServiceInstance}\) | 0.913 | 0.966 |
3 | \(\texttt {someString722}\) \(\texttt {someString872}\) | 0.846 | 0.583 |
4 | \(\texttt {currentP1Stages}\) \(\texttt {currentP3Stages}\) | 0.933 | 0.810 |
5 | \(\texttt {byteCodeAppenders}\) \(\texttt {byteCodeAppender}\) | 0.941 | 0.220 |
6 | \(\texttt {typeVariableAnnotationTokens}\) \(\texttt {typeVariableBoundAnnotationTokens}\) | 0.848 | 0.922 |
7 | \(\texttt {variablePattern}\) \(\texttt {patternVariable}\) | 0.067 | 0.972 |
8 | \(\texttt {FIELD\_TARGET\_FIELD}\) \(\texttt {targetField}\) | 0.056 | 0.989 |
No. | Variable name pair | String similarity | Semantic similarity |
---|---|---|---|
9 | \(\texttt {content\_ii}\) \(\texttt {content\_iii}\) | 0.909 | 0.579 |
10 | \(\texttt {RELATIVE\_POSITION\_II\_LG}\) \(\texttt {RELATIVE\_POSITION\_II\_BI}\) | 0.913 | 0.763 |
11 | \(\texttt {res4b20\_branch2a}\) \(\texttt {res5b\_branch2a}\) | 0.813 | 0.984 |
12 | \(\texttt {source\_options\_site\_coll\_model\_json}\) \(\texttt {source\_options\_site\_coll\_model\_json2}\) | 0.972 | 0.922 |
13 | \(\texttt {add\_one\_udfs}\) \(\texttt {add\_one\_udf}\) | 0.929 | 0.804 |
14 | \(\texttt {eigBlockVector}\) \(\texttt {eigBlockVectorX}\) | 0.933 | 0.912 |
15 | \(\texttt {frame\_num}\) \(\texttt {num\_frames}\) | 0.200 | 0.988 |
16 | \(\texttt {TASK\_TYPES\_TO\_STRING}\) \(\texttt {STRING\_TO\_TASK\_TYPES}\) | 0.200 | 0.987 |
6.4.4 Characteristic-1: Only the Tailing Tokens Differ
“textFieldMap” and “textFieldList” have that characteristic because only the tailing tokens, “map” and “list,” differ.

6.4.5 Characteristic-2: Only Their Numbered Parts Differ
“asyncResult1” and “asyncResult2” is a typical example. As a result, about 34.7% (563,582 confusing variable pairs) in Java and about 77.1% (1,031,958 confusing variable pairs) in Python had Characteristic-2 (see Figs. 17(b) and 18(b)). Notice that some variable pairs also had Characteristic-1 when their tailing tokens were numbers, so some variable pairs were counted for both Characteristics 1 and 2. On the other hand, we did not find non-confusing variable pairs having Characteristic-2 because the name pairs with Characteristic-2 have high string similarity scores, and all of them fall into the confusing variable pair category. Consequently, we can say that confusing variable pairs have Characteristic-2.

6.4.6 Characteristic-3: a Name is a Substring of the Other Name
“fileNameList” and “fileNameLists” corresponds to this characteristic. As a result of our counting, we found that about 9.8% (159,006) of confusing variable pairs in Java and about 3.8% (50,806) of confusing variable pairs in Python had Characteristic-3. About 0.1% of non-confusing variable pairs in Java and about 0.2% in Python also had this characteristic in our dataset. Because the percentages of confusing variable pairs with Characteristic-3 are much lower than those for Characteristics 1 and 2 (see Figs. 17(c) and 18(c)), we cannot confidently say that confusing variable pairs have Characteristic-3. However, the percentages of non-confusing variable pairs are low (0.1–0.2%), so the characteristic should not be neglected.

6.4.7 Characteristic-4: a Name’s Token Set is a Subset of the Other Name’s Token Set
“callsReceivedRow” and “callsBytesReceivedRow” because all of the tokens in the former name, “calls,” “received,” and “row,” are included in the latter name. As a result, we discovered that about 10.3% (167,503) of confusing variable pairs in Java and about 6.8% (90,687) of them in Python had the characteristic (see Figs. 17(d) and 18(d)). Only about 0.5% and 0.3% of non-confusing variable pairs in Java and Python had that characteristic. Similar to Characteristic-3, although this characteristic was also not observed in the majority of confusing variable pairs, the percentages of non-confusing ones are still low. Thus, we may also consider Characteristic-4 as a part of confusing variable pairs’ characteristics.
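
Characteristics 1 to 4 can be expressed as simple predicates over a pair of names, as in the following sketch. All function names are hypothetical. The regular-expression splitter is only a stand-in for Spiral and handles plain camelCase and snake_case; stemming is omitted, and treating Characteristic-4 as a proper-subset test is our assumption.

```python
import re


def split_tokens(name: str):
    """Split a camelCase or snake_case name into lowercase tokens (simplified)."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", name)
    return [p.lower() for p in parts]


def only_tailing_tokens_differ(a: str, b: str) -> bool:
    """Characteristic-1: same tokens except for the last one."""
    ta, tb = split_tokens(a), split_tokens(b)
    return (len(ta) == len(tb) and len(ta) >= 2
            and ta[:-1] == tb[:-1] and ta[-1] != tb[-1])


def only_numbered_parts_differ(a: str, b: str) -> bool:
    """Characteristic-2: the names become identical once digits are masked."""
    return a != b and re.sub(r"[0-9]+", "#", a) == re.sub(r"[0-9]+", "#", b)


def is_substring_pair(a: str, b: str) -> bool:
    """Characteristic-3: one name is a substring of the other."""
    return a != b and (a in b or b in a)


def is_token_subset_pair(a: str, b: str) -> bool:
    """Characteristic-4: one token set is a proper subset of the other."""
    sa, sb = set(split_tokens(a)), set(split_tokens(b))
    return sa < sb or sb < sa
```

With this tokenization, “asyncResult1” and “asyncResult2” satisfy both Characteristic-1 and Characteristic-2, which matches the double counting noted for pairs whose tailing tokens are numbers.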
6.4.8 Characteristic-5: a Name can be Converted into the Other Name by Changing the Order of Tokens
“variablePattern” and “patternVariable” is an example. Given such a pair, programmers and code reviewers might mix them up during the programming and code review activities. We checked our dataset and found that such pairs rarely appear in practice: the percentages of confusing variable pairs with Characteristic-5 were about 0.4% (6,992) in Java and about 1.1% (15,073) in Python (see Figs. 17(e) and 18(e)). Hence, although the variable pairs having the characteristic look confusing, we cannot say it is a common characteristic of confusing variable pairs.

6.4.9 Answer to RQ2
6.5 Threats to Validity
6.5.1 Internal Validity
vector_size and epochs. Because the Doc2Vec model steadily produced almost the same vectors for the same names when we set vector_size=100 and epochs=400, we used them in our study. However, there might be a better setting, and we might have missed a more suitable model for evaluating semantic similarity. Thus, we also tried analyzing the semantic similarity of compound variable names using the state-of-the-art natural language model, Sentence-BERT (Reimers and Gurevych 2019), described in Section 4.2. Although Sentence-BERT models did not work well in our settings, we plan to perform an additional fine-tuning process in our future work.

6.5.2 Construct Validity
6.5.3 External Validity
7 Conclusion and Future Work
https://github.com/amanhirohisa/cvpfinder and https://zenodo.org/record/7493554.