1 Introduction
2 Background
Foo
of the Java example is the class under test. It has a dependency on class Bar
: in order to generate an object of type Foo
we need an instance of Bar
, and the method doFoo
also requires a parameter of type Bar
.Bar
can be called, since all other methods and constructors require a parameter, resulting in t1 = 〈o1 = new Bar()
〉. Since t1 contains an object of type Bar
, in the second step the test generator now has a choice of either invoking doBar
on that object (and use the same object also as parameter), or invoking the constructor of Foo
. Assuming the chosen call is the constructor of Foo
, we now have t2 = 〈o1 = new Bar()
;o2 =new Foo(
o1)
;〉. Since there now is also an instance of Foo
in the sequence, in the next step also the method doFoo
is an option. The random test generator will continue extending the sequence in this manner, possibly integrating heuristics to select more relevant calls, or to decide when to start with a new sequence.doFoo
of class Foo
. In order to achieve this, it requires an instance of Foo
and an instance of Bar
to satisfy the dependencies. To generate a parameter object of type Bar
, the test generator would consider all calls that are declared to return an instance of Bar
—which is the case for the constructor of Bar
in our example, so it would prepend a call to Bar()
before the invocation of doFoo
. Furthermore, it would try to instantiate Foo
by calling the constructor. This, in turn, requires an instance of Bar
, for which the test generator might use the existing instance, or could invoke the constructor of Bar
.Bar
, but also that of Foo
with an arbitrary parameter type, and (2) has to consider all existing objects for all parameters of a selected call, and thus, for example, also str
or int
. Backwards construction without type information would also have to try to select generators from all possible calls, and all possible objects, which both result in a potentially large search space to select from.Foo
or calls to do_foo
with parameter values of unexpected types, such as:
Foo
and the do_foo
method both expect an object of type Bar
. When type information is not present such test cases can be generated and will only fail if the execution raises an exception; for example, due to a call to a method that is defined for the Bar
type but not on an int
or str
object. Type information can be provided in two ways in recent Python versions: either in a stub file that contains type hints or directly annotated in the source code. A stub file can be compared to C header files: they contain, for example, method declarations with their according types. Since Python 3.5, the types can also be annotated directly in the implementing source code, in a similar fashion known from statically typed languages (see PEP 4842).3 Unit Test Generation with Pynguin
3.1 The Pynguin Framework
pip
utility. We refer the reader to Pynguin’s web site5 for further links and information on the tool.3.1.1 Feedback-Directed Random Test Generation
3.1.2 Whole Suite Test Generation
3.1.3 Many-Objective Sorting Algorithm (MOSA)
3.1.4 Dynamic Target Selection MOSA (DynaMOSA)
bar
to fulfil the condition—and thus covering line 3—is not necessary unless the condition in line 1 is fulfilled. Figure 2 depicts the control-dependence graph. DynaMOSA uses the control-dependencies, which are found in the control-dependence graph of the respective code, to determine which goals to select for further optimisation. To compute the control dependencies within Pynguin, we generate a control-flow graph from the module under test’s byte code using the bytecode
7 library; we use standard algorithms to compute post-dominator tree and control-dependence graph from the control-flow graph (see, for example, the work of Ferrante et al. (1987) for such algorithms). We refer the reader to the literature for details on DynaMOSA (Panichella et al. 2018a). In the following we will refer to this algorithm as DynaMOSA.3.1.5 Many Independent Objectives (MIO)
3.2 Problem Representation
int
, float
, bool
, bytes
and str
variables, for example, var_0 = 42
. Value and type of a statement are defined by the primitive variable. Note that although in Python everything is an object, we treat these values as primitives because they do not require further construction in Python’s syntax.var_0 = SomeType()
. Value and type are defined by the constructed object; any parameters are satisfied from the set V = {v(sk)∣0 ≤ k < j}. Method statements invoke methods on objects, for example, var_1 = var_0.foo()
. Value and type are defined by the return value of the method; source object and any parameters are satisfied from the set V. Function statements invoke functions, for example, var_2 = bar()
. They do not require a source object but are otherwise identical to method statements.List
, Set
, Dict
, and Tuple
. An example for such a list collection statement is var_2 = [var_0, var_1]
; an example for a dictionary collection statement is var_4 = {var_0: var_1, var_2: var_3}
. Value and type are defined by the constructed collection; elements of a collection are satisfied from the set V. For dictionaries, both keys and values are satisfied from V. Tuples are treated similar to lists; their sole difference in Python is that lists are mutable while tuples are immutable.⋆args
and ⋆⋆kwargs
), when creating a constructor, method or function statement and passed the parameters by position. However, filling all parameters might not be necessary, as some parameters may have default values or are optional (for example ⋆args
and ⋆⋆kwargs
, which will result in an empty tuple or dictionary, respectively). It can also be impossible to pass certain parameters by position as it is possible to restrict them to be only passed by keyword. We improved our representation of statements with parameters, by (1) passing parameters in the correct way, that is, positional or by keyword, and (2) leaving optional parameters empty with some probability.⋆args
or ⋆⋆kwargs
capture positional or keyword arguments which are not bound to any other parameter. Hereby, args
and kwargs
are just names for the formal parameters; they can be chosen arbitrarily, but args
and kwargs
are the most common names. Their values can be accessed as a tuple (⋆args
) or a dictionary (⋆⋆kwargs
). We fill these parameters by constructing a list or dictionary of appropriate type and passing its elements as arguments by using the ⋆
or ⋆⋆
unpacking operator, respectively. Consider the example snippet in Listing 2 to shed light into how a test case for functions involving those parameter types may look like.
[]
-operator. While Python also has a []
-operator, the same effect can also be achieved by directly calling the __setitem__
or __getitem__
methods. Please note that we do not use the latter approach in Pynguin currently, because Pynguin considers all methods having names starting with one or more underscores to be private methods; private methods are not part of the public interface of a module and thus Pynguin does not directly call them. Given these constraints, we currently cannot generate a test case as depicted in Listing 3 because this would require some of the aforementioned features, such as reading and writing attributes or the []
-operator for list manipulation. The latter is currently not implemented in Pynguin because we assume that changing the value of a list element is not often required; it is more important to append values to lists, which is supported by Pynguin. Additionally, in Java an array has a fixed size, whereas lists in Python have variable size. This would require a way to find out a valid list index that could be used for the []
-operator.
3.3 Test Cluster
List
, Set
, Dict
, and Tuple
. To create the test cluster we load the module under test and inspect it using the inspect
module from the standard Python API to retrieve all available functions and classes from this module. Additionally, we transitively inspect all dependent modules.3.4 Operators for the Genetic Algorithms
3.4.1 The Crossover Operator
Foo
containing a constructor and two methods, as well as two test cases that construct instances of Foo
and execute both methods. During crossover, each test case is divided into two parts by the crossover point and the latter parts of both test cases are exchanged, which may result in the test cases depicted in Listing 5. Since statements in the exchanged parts may depend on variables defined in the original first part, the statement or test case needs to be repaired to remain valid. For example, the insertion of the call foo_0.baz(int_0)
into the first crossed-over test case requires an instance of Foo
as well as an int
value.Foo
instance instead of reusing the existing foo_0
as well as creating a new int
constant to satisfy the int
argument of the baz
method. For the second crossed-over test case, the operator simply reused both, the instance of Foo
as well the existing str
constant to satisfy the requirements for the bar
method.
3.4.2 The Mutation Operator
None
, possibly left empty if they are optional (see Section 3.2), or are randomly generated. The type of a randomly generated parameter is either defined by its type hint, or if not available, chosen randomly from the test cluster (see Section 3.3). If the type of the parameter is defined by a type hint, we can query the test cluster for callable elements in the subject under test or its dependencies that generate an object of the required type. Generic types currently cannot be handled properly in Pynguin, only Python’s collection types are addressed. A parameter that has no type annotation or the annotation Any
, requires us to consider all available types in the test cluster as potential candidates. For those, we can only randomly pick an element from the test cluster.⋆args
, ⋆⋆kwargs
as well as parameters with a default value are only filled with a certain probability. For ⋆args: T
and ⋆⋆kwargs: T
we try to create or reuse a parameter of type List[T]
or Dict[str, T]
, respectively. Primitive types are either randomly initialized within a certain range or reuse a value from static or dynamic constant seeding (Fraser and Arcuri 2012) with a certain probability. Complex types are constructed in a recursive backwards fashion, that is, by constructing their required parameters or reusing existing values.int
and float
primitives, we choose a random standard normally distributed value α. For int
primitives we add αΔint to the existing value. For float
primitives we either add αΔfloat or α to the existing value, or we change the amount of decimal digits of the current value to a random value in [0,Δdigits]. Here Δint, Δfloat and Δdigits are constants.str
primitives, with probability \(\frac {1}{3}\) each, we delete, replace, and insert characters. Each character is deleted or replaced with probability \(\frac {1}{|v(s_{i})|}\). A new character is inserted at a random location with probability σstr. If it is added, we add another character with probability \(\sigma _{\text {str}}^{2}\) and so on, until the i-th character is not added. This is similar to how we add test cases to a test suite. bytes
primitives are mutated similar to str
primitives. For bool
primitives, we simply negate v(si).None
. If no argument was replaced, we replace the whole statement si by a call to another method, function, or constructor, which is randomly chosen from the test cluster, has the same return type as v(si), and whose parameters can be satisfied with values from {v(sk)∣0 ≤ k < i}.3.5 Covering and Tracing Python Code
if
or while
statements, which are guarded by logical predicates. The control structures are represented by conditional jumps at the bytecode level, based on either a unary or binary predicate. We focus on branch coverage in this work, which requires that each of those predicates evaluates to both true and false.==
, is
, and in
keywords, respectively). For each 𝜃 ∈Θ, we define a function \(\delta _{\theta }: \mathbb {O}\times \mathbb {O} \to \mathbb {R}_{0}^{+} \cup \{\infty \}\) that computes the branch distance of the true branch of a predicate of the form a𝜃b, with \(a,b\in \mathbb {O}\) and 𝜃 ∈Θ. By \(\delta _{\bar {\theta }}(a,b)\) we denote the distance of the false branch, where \(\bar {\theta }\) is the complementary operator of 𝜃. Let further k be a positive number, and let lev(x, y) denote the Levenshtein distance (Levenshtein 1966) between two strings x and y. The value of k is used in cases where we know that the distance is not 0, but we cannot compute an actual distance value, for example, when a predicate compares two references for identity, the branch distance of the true branch is either 0, if the references point to the same object, or k, if they do not. While the actual value of k does not matter, we use k = 1.3.6 Fitness Functions
3.6.1 Test-suite Fitness
3.6.2 Test-case Fitness
3.7 Assertion Generation
int
, str
, bytes
, bool
, and None
), as well as builtin collection types (tuple
, list
, dict
, and set
) as long as they only consist of assertable elements. Additionally, Pynguin can generate assertions for values using an approximate comparison to mitigate the issue of imprecise number representation. For those assertable types, Pynguin can generate the assertions directly, by directly, by creating an appropriate expected value for the assertion’s comparison.len
function on the collection and assert for that size. As a fallback, if the aforementioned strategies is successful, Pynguin is asserting that an object is not None
. Listing 7 shows a simplified example of how test cases for the code snippet in Listing 6 could look like. Please note that the example was not directly generated with Pynguin but manually simplified for demonstration purpose.
pytest.raises
context from the PyTest framework8, which checks whether an exception of a given type was raised during execution and causes a failure of the test case if not. Pynguin decorates a test case, who’s execution causes an unexpected exception, with the pytest.mark.xfail
decorator from the PyTest library; this marks the test case as expected to be failing. Since Pynguin is not able to distinguish whether this failure is expected it is up to the user of Pynguin to inspect and deal with such a test case. Listing 8 shows two test cases that check for exceptions.
3.8 Limitations of Pynguin
sys.exit
, need to be handled differently. One of the reasons Pynguin fails on target modules is that they call sys.exit
in their code, which stops the Python interpreter and thus also Pynguin. Future work needs to address those challenges by, for example, providing a virtualised file system to or mocking calls to critical functions, such as sys.exit
.__getattr__
can fake non-existing attributes and many more. The work of Chen et al. (2018), for example, lists some of these dynamic features of the Python language and their impact on fixing bugs.List[Tuple[Any, ...]
and List[Tuple[int, str]]
; there is no built-in support in the language to decide at runtime whether one is a subtype of the other. Huge engineering effort has been put into static type checkers, such as mypy
10, however they are not considered feature complete. Additionally, each new version of Python brings new features regarding type information to better support developers. This, however, also causes new challenges for tool developers to support those new features.4 Empirical Evaluation
4.1 Experimental Setup
4.1.1 Experiment Subjects
numpy
from our dataset. We do this for two reasons: first, we want to avoid compiling these dependencies, which would require a C/C++ compilation environment in our experiment environment to keep this environment as small as possible. Second, Pynguin’s module analysis and on the source code and Python’s internally used byte code. Both are not available for native code, which can limit Pynguin’s applicability on such projects. Unfortunately, many projects depend to native-code libraries such as numpy
, either directly or transitively. Pynguin can still be used on such projects but its capabilities might be its capabilities might be limited by not being able to analyse these native-code dependencies. Fortunately, only few Python libraries are native-code only. Dependencies that only exist in native code, for example, the widely used numpy
library, can still be used, although Pynguin might not be able to retrieve all information it could retrieve from a Python source file.async_btree
, which provides an asynchronous behaviour-tree implementation. The functions in this project are implemented as co-routines, which would require a proper handling of the induced parallelism and special function invocations. Pynguin currently only targets sequential code; it therefore cannot deal with this form of parallelism to avoid any unintended side effects. Thus, for the async_btree
project, the test generation fails and only import coverage can be achieved, independent of the used configuration. None of the other used projects relies on co-routines.inspect
library for this task. This library is the standard way of inspecting modules in Python—to extract information about existing classes and functions, or type information. If fails to resolve a type that type will not be known to Pynguin’s test cluster, which means that Pynguin will not try to generate an object of such a type. However, since the inspect
library is part of the standard library we would consider its quality to be very good; furthermore, if such a type is reached in a transitive dependency, it might be used anyway.
Project Name | Version | LOCs | Modules | CodeObjs. | Preds. | Types |
---|---|---|---|---|---|---|
apimd | 1.2.1 | 390 | 1 | 50 | 110 | 10 |
codetiming | 1.3.0 | 89 | 2 | 7 | 5 | 4 |
dataclasses-json | 0.5.2 | 891 | 5 | 75 | 76 | 23 |
docstring_parser | 0.7.3 | 506 | 4 | 50 | 86 | 9 |
flake8 | 3.9.0 | 542 | 8 | 80 | 41 | 0 |
flutes | 0.3.0 | 235 | 3 | 4 | 0 | 8 |
flutils | 0.7 | 772 | 6 | 48 | 136 | 19 |
httpie | 2.4.0 | 1274 | 18 | 152 | 129 | 22 |
isort | 5.8.0 | 841 | 6 | 37 | 9 | 11 |
mimesis | 4.1.3 | 685 | 21 | 150 | 55 | 4 |
pdir2 | 0.3.2 | 461 | 5 | 43 | 46 | 9 |
py-backwards | 0.7 | 618 | 18 | 125 | 100 | 13 |
pyMonet | 0.12.0 | 394 | 8 | 145 | 59 | 6 |
pypara | 0.0.24 | 877 | 7 | 232 | 70 | 15 |
python-string-utils | 1.0.0 | 421 | 3 | 76 | 101 | 7 |
pytutils | 0.4.1 | 943 | 19 | 62 | 80 | 4 |
sanic | 21.3.2 | 1886 | 18 | 143 | 187 | 30 |
sty | 1.0.0rc1 | 136 | 3 | 18 | 4 | 2 |
thonny | 3.3.6 | 794 | 5 | 66 | 232 | 1 |
typesystem | 0.2.4 | 157 | 3 | 23 | 19 | 13 |
Total | 12912 | 163 | 1586 | 1545 | 210 |
4.1.2 Experiment Settings
python:3.10.5-slim-buster
image from Docker Hub is the basis of this container).4.1.3 Evaluation Metrics
4.1.4 Data Availability
4.2 Threats to Validity
4.2.1 Internal Validity
x = 0 if y > 42 else 1337
fits on one line but contains two branches, which are not considered by Coverage.py. We thus implemented our own coverage measurement based on bytecode instrumentation. By providing sufficient unit tests for it we try to mitigate possible errors in our implementation.4.2.2 External Validity
4.2.3 Construct Validity
4.3 RQ1: Comparison of the Test-Generation Approaches
def
statement but not the function’s body or any closures) as well as class definitions and their respective method definitions. Due to the structure of the Python bytecode these definitions are also (branchless) coverage targets that get executed anyway. Thus, they count towards coverage of a module. As a consequence coverage cannot drop below import coverage.
Algorithm | With Type Hints | Without Type Hints | ||
---|---|---|---|---|
DynaMOSA | 80.7 | 71.6 ± 30.5 | 76.5 | 69.4 ± 31.0 |
MIO | 80.0 | 71.3 ± 30.7 | 75.0 | 68.4 ± 31.3 |
MOSA | 80.0 | 71.3 ± 30.8 | 76.7 | 68.7 ± 31.7 |
Random | 71.4 | 66.9 ± 31.5 | 66.7 | 62.6 ± 32.9 |
WS | 80.0 | 70.6 ± 30.7 | 71.4 | 67.5 ± 31.5 |
flutes.math
from the flutes
project12.
b = 0
will cause a ZeroDivisionError
). Since this function is the only function in that particular module, achieving full coverage on this module is also trivially possible for all test-generation algorithms, especially since they know from the type hints that they shall provide integer values as parameters.flutes
13 project in Listing 10.yield
statement). Pynguin currently supports neither; the context manager would require a special syntax construct to be called, such as a with
block. Calling a function with a yield
statement in it does not actually execute its code. It generates an iterator object pointing to that function. The code will only be executed when iterating over the generator in a loop or by explicitly calling next
on the object. A test case Pynguin can come up with is similar to the one shown in Listing 11.import
and def
statements during module loading. As stated above, the body of the function will not even be executed.4.4 RQ2: Influence of Type Information
isinstance
check can only be executed if the argument to the function is of type str
—in contradiction to the annotated int
. Although it is possible that Pynguin disregards the type hint there is only a small probability for this. Besides, Pynguin would then pick a random type from all available types in the test cluster, where again the probability that it picks a str
is small, depending on the size of the test cluster.4.5 RQ3: Assertion Effectiveness to Reveal Faults
pyMonet
’s14 module pymonet.task
, which we show in Listing 13.
map
and methods). The methods of
and reject
are not even called as they are annotated as class methods, which is currently not supported by Pynguin. Pynguin, however, is generating a test cases like the one shown in Listing 14.
@classmethod
decorator. The consequence is that of
or reject
become normal methods by this operation. Executing the test case from Listing 14 against such a mutant causes a difference in the result and thus Pynguin counts the mutant as killed. As a consequence, both possible mutants get killed, although only one third of the modules coverage goals are actually covered. Furthermore, we also note many modules for those the tests achieve 100 % coverage but the achieved mutation scores range from 0 % to 100 %; this shows that the effectiveness of our assertions can still be improved in the future.
5 Related Work
6 Conclusions
mimesis.providers.date
from the mimesis
20 project contains the following function presented in Listing 15 (we removed various lines that are not relevant to the bug for presentation reasons).
mimesis
project provides utility functions that generate fake data, which aims to look as realistic as possible. The function presented in Listing 15 aims to generate a list of datetime
objects (essentially, a combination of date and time stamp) between a given start and end time using a specific interval. It requires a start time that is smaller than the end time, that is, lies before the end time. Besides, it takes a dictionary in its ⋆⋆kwargs
parameter, which it hands over to the timedelta
function of Python’s datetime
API.timedelta
function is used to compute a delta that can be added to the current time in order to retrieve the next date. Its API is very flexible: it allows to specify an arbitrary delta, especially also of negative and zero values. While we were investigating in Pynguin’s timeout handling, we noticed that Pynguin generated a test case for for this function similar to the one presented in Listing 16.timedelta
function will yield a delta of zero if no parameters given.mimesis
, who stated that this ‘is a major bug actually’21. They furthermore accepted the fix proposed by the second author of this paper and released a new version of their library.