1 Introduction
middleButtonOnClick
requires the context of an object of type View
. Similarly, it does not explicitly capture the fact that the operations setText(“clicked”)
, opt = r.nextInt()
, c = getBlue()
, and SetUpperRightButton(c)
take place within the method middleButtonOnClick
. In addition, the TCFG does not have any explicit construct to represent interactions among objects. A TCFG does not also contain the complete context of object information. That is, the TCFG contains the static datatype of a particular object, but the TCFG does not explicitly maintain the superclass hierarchy for each object.
middleButtonOnClick
in Fig. 2 has \(2 * 4 = 8\) possible traces due to the permutation of respective branches; Fig. 4 depicts one of those 8 trace nodes.middleButtonOnClick
requires the context of an object of type View
and a callee; the corresponding hyperedge in Fig. 4 is labeled accordingly with the destination method and program state information for context.
-
Section 2 formalizes a sequence diagram with respect to a hyperpath in a hypergraph.
-
StaticGen provides a query system to refine the set of generated diagrams and guide the user to the most interesting interactions in the source code. (Sect. 6).
-
We illustrate the efficacy of our technique (Sect. 7) with quantitative analyses and a case study to identify a security vulnerability.
2 Program abstraction and code hypergraphs
2.1 Program abstraction model
2.2 The code hypergraph
middleButtonOnClick
method takes a parameter of object v
of type View
and is represented as a code object in the corresponding code hypergraph in Fig. 4 (indicated by rounded edges). A trace is more than just a basic block in a TCFG, it is a sequential set of instructions corresponding to an execution path for an entire method. For example, in Fig. 4, a trace node corresponding to one path of the middleButtonOnClick
method is composed of instructions that would span many basic blocks in a TCFG. That is, a complete code hypergraph for the source code in Fig. 2 would contain 8 distinct trace nodes for the middleButtonOnClick
method consistent with the 8 unique execution paths.b
and c
) and a trace node representing the execution path of SetUpperRight
Button
. The target node of this call hyperedge is a trace node for SetBtnColor
. In general, a hyperedge consists of \(n + 1\) source nodes where n is the number of formal parameters as well as one target node. Finally, the edge is annotated with the name of the method being invoked (SetBtnColor
) as well as state information (\(\sigma _6\)). For a program P, we say a program state \(\sigma \) of a program \(P\) is data store for all variables at a given execution point in \(P\).getBlue
returns a value which will be assigned to variable c
(in middleButtonOnClick
) indicated with a dotted line and annotated with getBlueReturn
and state information (\(\sigma _3\)).2.3 Sequence diagrams
-
a partitioned set of “send” and “receive” events,
-
a mapping that maps an event to a process,
-
a bijective mapping that matches a send message with a corresponding receive message,
-
a mapping that labels each event, and
-
a partial order on the set of events.
2.4 Characteristics of sequence diagrams
middleButtonOnClick
(similar to Fig. 5); each of these diagrams has 6 messages. However, our generation routine may also construct Fig. 6 as an element in \(\mathcal {D} \) corresponding to method onOptionsItemSelected
in Fig. 2. Thus, \(\left| \mathcal {D} \right| = 9\) with one diagram having 2 messages and eight diagrams having 6 messages. The only way for the sequence diagram in Fig. 6 to be considered interesting is if \(k = 9\); that is, the entire set of sequence diagrams \(\mathcal {D} \) would be considered interesting. The sequence diagram in Fig. 6 is uninteresting for \(1 \le k \le 8\) since it has the lowest message count compared to the other eight sequence diagrams. Hence, the sequence diagram in Fig. 5 will be considered interesting when \(k = 8\) and, since ties are broken arbitrarily, may be considered interesting when \(1 \le k < 8\).3 Constructing the hypergraph
v
, b
, and c
. Figure 4 includes illustrative code objects (goodId
) declared in the corresponding code in Fig. 2 while omitting some others (btnID
, r
, and id
).ConstructTraces
). For each trace of method m, we add the corresponding node to \(H\) (Line 8 to Line 10). Hence, the complete code hypergraph corresponding to Fig. 2 would contain 8 trace nodes describing middleButtonOnClick
and a single trace node describing the single paths for each method get
Red
, getBlue
, SetBtnColor
, and SetUpperRightButton
.middle
ButtonOnClick
calls several methods; most prominent in our example are SetUpperRightButton
and getBlue
. Thus, in the code hypergraph (Fig. 4), we observe two corresponding call hyperedges for these method invocations.getReturnObject
on Line 12) and adding to \(H\). The annotation for r does not correspond to any existing function, so we append “Return” to the current method name m (to distinguish annotations accordingly) and a user-defined program state. Figure 4 contains the annotated return hyperedges getBlueReturn
and getIdReturn
since, respectively, getBlue
and getId
return code objects.4 Refinement of code hypergraphs through pebbling
4.1 Sub-hypergraph identification through pebbling
System.
or toString
in Java, would not be allowed in the resulting pebbled code hypergraph. As described in Algorithm 4, pebbling is a breadth-first traversal over an annotated hypergraph where we mark each node with a pebble once it is visited (Line 7). Then, on Line 9 through Line 19, we use the following rule for pebbling and propagation: if all origin nodes of a hyperedge are pebbled, we place the target node of the hyperedge in the work list. As pebbling continues, we add all “unexpanded,” pebbled nodes (Line 8) and hyperedges (Line 15) to the sub-hypergraph in preparation for the return of the pebbled version of the hypergraph (Line 21).middleButtonOnClick
, we immediately pebble the trace node for getBlue
. In turn, we pebble the code object node for c
. Then, since both source nodes are pebbled, we pebble the target trace node for Set
UpperRightButton
.4.2 Pebbling a code hypergraph
5 Static sequence diagram construction
5.1 Hyperpath identification
CollectTraceNodes
on Line 3), we construct all paths. In this case, a path is a DAG whose nodes are code instructions and edges are attributed to the flow of control of the code. For a path \({{\mathcal {P}}}\), we use an object-oriented notation in which (1) \({{\mathcal {P}}}.Path\) refers to the path associated with a trace path and (2) \({{\mathcal {P}}}.\mathcal {O}\) is an associated set of code objects.isInvocation
on Line 4 will return true if the instruction is a method call. If the instruction in T is a method call, hyperpath construction is recursive (Line 7) by following the corresponding call hyperedge to the target trace node, otherwise, we simply view the instruction i as a singleton set (Line 9). Regardless of how instruction i is processed, we append the new sets of paths (\(\mathcal {Q}\)) to the old path sets (\(\mathcal {Y}\)) using procedure AppendPaths \(\mathcal {Y}\), \(\mathcal {Q}\); this includes the objects in a particular path (Line 20). We assume set union with \(\cup \) operator maintains sequential ordering of instructions and creates a directed edge from the last instruction of one path to the first instruction of the other path.5.2 Generating sequence diagrams from a hyperpath
addReturnMessage
.Main
as well as a code object of type Button
(b
); no methods act on code object c
thus it does not appear in the resulting sequence diagram. To generate the set of method calls for the sequence diagram, we begin with the dashed hyperedge into the trace node for middleButtonOnClick
. Considering each of the instructions of this trace node in turn, we add messages in causal order: recursively following call hyperedges to getBlue
and SetUpperRightButton
then subsequently following their respective return hyperedges. The result is the sequence diagram in Fig. 5.6 Interface for diagram generation
6.1 Query over the language of sequence diagrams
main
method in a program. Generation can be performed on-demand beginning at any method reducing the size of the corresponding hypergraph. In order to acquire the initial set of sequence diagrams \(\mathcal{D}_S\), we may use the predicate “start
M,” where M is a method dictating where the resultant sequence diagram(s) will begin.-
For a method trace \(\ell \in \varSigma \), “
filter
\(\ell \ \mathcal {D} \)” prunes the substring from \(\ell \) to \(\ell '\) in each sequence diagram in \(\mathcal {D} \). This removal process efficiently eliminates calls to library-based functionality or method definitions that are not of interest. For a set of code objects \(\ell \subseteq \varSigma \), “filter
\(\ell \ \mathcal {D} \)” prunes all characters \(c \in \ell \) from each string in \(\mathcal {D} \). Removal of a code object allows the user to refine the set of sequence diagrams by omitting specific variables. -
For a set of predicates \(R\) describing strings in \(\mathcal {L}\), “
remove
\(R~\mathcal {D} \)” will remove all resulting sequence diagrams for which all \(r\in R\) evaluate totrue
. The complementary operation “accept
\(R~\mathcal {D} \)” will collect all sequence diagrams for which all \(r\in R\) evaluate totrue
. -
For an integer k, “
top-interesting
\(k\ \mathcal {D} \)” returns \(\mathsf {top} \left( \mathcal {D} , k \right) \). -
“
method cover
\(p\ \mathcal {D} \)” and “branch cover
\(p\ \mathcal {D} \)” each return sequence diagrams ensuring minimal method and branch coverage, respectively, for a lower bound percentage p.
main
method, instead they may begin their analysis at any arbitrary method. Our solution is a set of modes of refinement. A user-defined query is a tuple of refinement modes or global query predicates. A refinement mode is of the form \({\left\langle t, m \right\rangle }\) where t is a unique identifier for an object or a method and m is a mode descriptor; we define each of the mode descriptors below.-
Synthesis can be performed on-demand beginning at any method using “
start
M,” where M is a method from which the resulting diagram(s) will begin. -
Since source code may contain method calls to library-based functionality, “
filter
S” where S is a set of methods or code objects, removes messages or objects in a sequence diagram corresponding to all \(s \in S\). -
Synthesis may result in a set of sequence diagrams. For a set of predicates \(R\), “
remove
\(R\)” will remove all resulting sequence diagrams for which all \(r\in R\) evaluate totrue
. -
For a set of predicates \(R\), “
accept
\(R\)” collects all sequence diagrams for which all \(r\in R\) evaluate totrue
.
top-interesting
d” where d is an integer specifying the number of resulting diagrams. Similarly for code coverage, we may refer to “function-cover
p” or “branch cover
p” where p is a lower bound percentage of the coverage desired.6.2 Query language semantics
-
\( \mathsf {E} {\llbracket } \ \varepsilon \ \varvec{and}\ \varepsilon '\ {\rrbracket } \left( \mathsf {d}, R' \right) = \mathsf {E} {\llbracket } \varepsilon {\rrbracket } \left( \mathsf {d}, R' \right) \ \wedge \ \mathsf {E} {\llbracket } \varepsilon ' {\rrbracket } \left( \mathsf {d}, R' \right) \),
-
\( \mathsf {E} {\llbracket } \ \varepsilon \ \varvec{or}\ \varepsilon '\ {\rrbracket } \left( \mathsf {d}, R' \right) = \mathsf {E} {\llbracket } \varepsilon {\rrbracket } \left( \mathsf {d}, R' \right) \ \vee \ \mathsf {E} {\llbracket } \varepsilon ' {\rrbracket } \left( \mathsf {d}, R' \right) \),
-
\( \mathsf {E} {\llbracket } \ \varvec{not}\ \varepsilon \ {\rrbracket } \left( \mathsf {d}, R' \right) = \lnot \mathsf {E} {\llbracket } \varepsilon {\rrbracket } \left( \mathsf {d}, R' \right) \),
-
\( \mathsf {E} {\llbracket } \ \varvec{true}\ {\rrbracket } \left( \mathsf {d}, R' \right) = true\),
-
\( \mathsf {E} {\llbracket } \ \varvec{false}\ {\rrbracket } \left( \mathsf {d}, R' \right) = false\), and
-
\( \mathsf {E} {\llbracket } \ \mathsf {d}\ {\rrbracket } \left( \mathsf {d}, {\{ r \}} \subseteq R' \right) \in {\{ true, false \}}\).
accept
and remove
have similar, yet opposing goals where an accept
command followed by a remove
with the same set of predicates results in an empty set:remove
precedes accept
: \(\texttt {accept} \left( R, \texttt {remove}(R, \mathsf {D}) \right) = \emptyset \).filter
results in a substring \(\mathsf {d}'\) of \(\mathsf {d}\) omitting m.filter
results in a substring \(\mathsf {d}'\) of \(\mathsf {d}\) omitting all dependencies of o.remove
results in the set \(\mathsf {D}\setminus \mathsf {D}'\).accept
results in the set \(\mathsf {D}'\).top-interesting
returns \(\mathsf {D}' \subseteq \mathsf {D}\) with \(\left| \mathsf {D}' \right| = k\), the set of k sequence diagrams with the greatest number of messages.branch coverage
(resp. method coverage
) for p returns the subset \(\mathsf {D}' \subseteq \mathsf {D}\) of sequence diagrams with coverage greater than p.6.3 Query interface to diagram generation
filter
the resulting set of sequence diagrams related to method removal, coverage, or \(\mathsf {top} \) into the desired set of sequence diagrams.6.4 Sample queries
start
being middleButtonOn
Click
and filters object r
and its corresponding methods as well as the setText
method. The result is eight diagrams, seven of which are strictly isomorphically unique, and one of which is shown in Fig. 5. If we append to \(Q\) an accept
predicate with SetUpperRight
Button(int)
, the only diagram returned is shown in Fig. 5. As another example, Fig. 6 arises from a query requesting the least interesting diagram from analyzing all methods.SetUpperRightButton(int)
or SetUpperLeft
Button(int)
. In a larger example, the complete package name of the method to be queried would be required. Figure 8 will take the set of all possible diagrams, and remove any diagram that contains the method SetUpperRightButton
(int)
. In the remaining diagrams, any calls to the method nextInt(int)
will be omitted from the diagram. This will also necessarily remove any calls that nextInt(int)
would make to any other methods. The diagrams would appear as if the call to nextInt(int)
simply did not exist.
SetUpperLeftButton(int)
and getRed()
and not have a call to the method getBlue()
. This query language allows the user to select a very specific subset of the diagrams produced by the system. Additionally, the query framework produces an output both as diagrams and as a dataset of json files which can be used as another input to the querying system, allowing that subset to serve as a starting point for another round of querying.6.5 Possible incompleteness of query language features
7 Experimental analysis
Project name | Package | Class # | Time (s) | Op # |
---|---|---|---|---|
AdBlockPlus | org.adblockplus. android | 90 | 130.2 | 3481 |
APG | org.thialfihar. android.apg.ui | 16 | 396.4 | 863 |
ConnectBot | org.connectbot | 184 | 154.9 | 12102 |
CSipSimple | com.csipsimple. service | 78 | 593.2 | 3736 |
Fennec | com.squareup. picasso | 46 | 7.3 | 1468 |
Jitsi | org.jitsi.service | 110 | 10.2 | 2835 |
K-9 Mail | com.fsck.k9. service | 26 | 22.5 | 1595 |
Linphone | org.linphone.core | 52 | 23.8 | 2192 |
Orbot | a.a.a.a | 9 | 3.0 | 437 |
sipdroid | org.sipdroid.net | 6 | 4.4 | 641 |
AcDisplay | com.achep. acdisplay.services | 59 | 12.4 | 2588 |
AC Stopwatch | com.achep. stopwatch | 184 | 628.4 | 11426 |
Active Notify | com.aky. peek.notification | 119 | 69.2 | 6543 |
AdAway | org.adaway | 126 | 105.8 | 5386 |
AppOps | com.ssrij.appops | 15 | 2.09 | 116 |
Blackberry Unlocker | ir.irtci | 12 | 3.25 | 268 |
Better Battery Stats | com.asksven. betterbatterystats | 151 | 418.6 | 11612 |
Better Wifi On/Off | com.asksven. betterwifionoff. data | 10 | 3.0 | 310 |
Color Clock | com.brianco. colorclock | 17 | 2.9 | 316 |
Amaze File Manager | com.amaze. filemanager. adapters | 35 | 23.4 | 2730 |
Complete Linux | com.zpwebsites. linuxonandroid | 226 | 29.8 | 7407 |
CPU Spy Plus | com.cpuspy | 7 | 2.98 | 186 |
Desk Clock | com | 98 | 114.4 | 4965 |
Fifteen Puzzle | com | 37 | 8.4 | 2089 |
Fontster | com.chromium. fontinstaller | 112 | 112.1 | 4910 |
Halo Shortcuts | com | 23 | 3.9 | 450 |
Heads Up | com.achep. headsup | 20 | 9.2 | 1885 |
Jelly Bean Clock | com | 16 | 2.7 | 254 |
Fake GPS Path | com.rc | 33 | 3.1 | 783 |
Root Verifier | com | 11 | 1.7 | 72 |
7.1 Benchmark code
7.2 Time and scope of synthesis
7.3 Evaluation of interestingness and filtering
getServletPath
function is stored in the variable url
and is not sanitized by both branches.filter
each diagram in \(\mathcal {D} \) for code objects and methods found in \(D\). We repeat this process until k diagrams are acquired. This procedure thus selects diagrams by how much new information they add.doGet
method was the subject of the diagrams ranked 22 and 32 of the 45 interesting diagrams; a fragment of the rank-22 diagram is shown in Fig. 15. This example shows that prioritizing novel information allows us to reduce a set of diagrams while retaining information about code paths that can possibly lead to a vulnerability. This reduced set of diagrams can then be delegated to a human expert or a vulnerability analysis tool for further analysis for security vulnerabilities. It is possible that an excluded diagram may contain the vulnerability; however, our queries focusing on interestingness and filtration ensures each diagram provides novel information. The discarded diagrams, because they have fewer messages than the selected ones, and because they provide less novel information, would therefore less likely be the ones that solely contain/reveal the vulnerability. Hence, it is with low probability that a vulnerability is relegated to the set of discarded diagrams.
7.4 Analyzing polymorphic code paths
Derived
inheriting from BaseClass
. In main
, object obj
is created with \( \mathsf {Datatype} \left( obj \right) = \texttt {Derived}\) and two methods are subsequently called. Together these three method calls (new
, overridden
, and unique
) are reflected clearly in the sequence diagram.obj
with new
results in two subsequent self-calls to super-constructors in BaseClass
and class Object
. Next, overridden
is called and handled in class Derived
. We observe a self-call to overridden
since Derived.overridden
invokes a super
call to BaseClass.overridden
. Last, unique
is handled entirely in BaseClass
.BaseClass
(List<BaseClass>
) which may contain both BaseClass
and Derived
objects. In the list structure, the contents become amorphous and the corresponding sequence diagrams present more generic information.7.5 Diver case study
addOption
method in package edu.
umd.cs.findbugs.
config.CommandLine
shown in Fig. 19. The sequence diagram in Fig. 18 depicts the output from Diver and its trace. With Diver, there does not appear to be an obvious way to save or export a specific diagram from a trace without providing the entire diagram. The Diver trace allows expansion of called methods to the desired level in the resulting sequence diagram. There are several similarities between the Diver diagram in Fig. 20 and StaticGen in Fig. 18; clearly the order of the methods are equivalent. StaticGen preserves more, but not all, variable names (option
and argumentDesc
). StaticGen also maintains individual objects thus differentiating method calls to length
on distinct objects; similarly for the two calls to put
on Map
. It can be argued that the default Diver display is more compact; however, it is unclear in the expansion in Fig. 20 whether Diver differentiates method calls from distinct objects thus providing less accurate information.
edu.umd.findbugs.gui2
, we consider the constructor of class GUI2CommandLine
. The diagram in Fig. 21 is the complete, allowable expansion of this flow through Diver starting at the class constructor. For the code in Fig. 22, StaticGen was able to visualize greater depth into the static class constructor for the class Driver
(Fig. 23) in edu.umd.cs.findbugs.gui2
. Generally, in comparison, StaticGen can offer users a view of object creation from any level; StaticGen produced 55 total potential execution path diagrams which we could query and sort.