1 Introduction
The main contributions of this work are:

- Defining an attribute grammar capable of depicting sequential behavior while retaining information about the triggering process and its parameters,
- Developing a grammar inference framework based on the Sequitur algorithm, capable of performing input data compression and anomaly detection on arbitrary system traces, and
- Expanding this approach into a knowledge discovery system supporting automated evaluation and extraction of potentially interesting patterns through our novel KAMAS visualization tool.
2 Related work
3 Preliminaries
3.1 Grammar inference
| | Supervised | Semi-supervised | Unsupervised |
|---|---|---|---|
| Statistical | Co-training [49] | Self-training [32] | ADIOS [47] |
| Evolutionary | GA-based [42] | LAgts [8] | |
| Heuristic | ALLiS [13] | Inductive CYK [36] | ABL [54] |
| MDL | | | e-GRIDS [38], CDC [10], Eiland et al. [17] |
| Greedy search | | | ADIOS, CDC, Sequitur [37] |
| Clustering | | | EMILE [1], CDC |
- Evolutionary computing techniques, often used in computational biology, regularly update (evolve) the initial model or grammar. Each new iteration is produced by removing less desirable solutions. GA-based approaches [42] and LAgts [8] both use genetic algorithms to eliminate unnecessary non-terminal symbols and production rules from the grammar.
- Heuristic methods generate training examples of sentences. In grammar inference, ALLiS [13] uses heuristics to reduce the number of similar rules as well as to select the rules with the most content. ABL [54] finds, with the help of heuristics, the longest common sequence shared between sentences.
- Minimum description length (MDL) [40] assumes that the simplest, most compact representation of data is its best and most probable depiction. The principle finds its primary application in data reduction, where "any regularity in a given set of data can be used to compress the data" [20]. Examples include CDC [10] and e-GRIDS [38].
- Greedy search algorithms make decisions based on their internal logic, which may lead to the creation, removal, or fusion of rules. For example, Sequitur [37] recursively replaces repeated symbol sequences with new production rules. It produces a grammar that reflects these repetitions and thereby infers the hierarchical structure of the input. ADIOS [47] also applies a greedy learning algorithm to its graph representation of sentences.
- Clustering techniques require a starting grammar that contains all possible sentences. They subsequently cluster syntactic units until the grammar has been constructed. For example, EMILE [1] clusters expressions that occur in the same context, while CDC [10] creates sets of sequences within a context before selecting clusters that satisfy the MDL principle (see above).
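The compression intuition behind the MDL principle can be illustrated with a toy calculation. This sketch simply counts the symbols needed to write down a grammar; real MDL formulations measure encoded bit lengths, so the helper name and measure are illustrative assumptions only:

```python
def description_length(grammar):
    """Total number of symbols needed to write down all rule bodies."""
    return sum(len(body) for body in grammar.values())

# The raw string versus a grammar that factors out its repetitions:
raw = {"S": "abcdbcabcd"}                       # 10 symbols
factored = {"S": "CAC", "A": "bc", "C": "aAd"}  # 3 + 2 + 3 = 8 symbols
```

Under this crude measure the factored grammar is the shorter and, by the MDL principle, the preferable description of the same data.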
3.2 Algorithm selection
- Unsupervised learning – The learning module used for generating knowledge from malicious system behavior must be unassisted. Human intervention in deciding whether a grammar is valid would contradict the automation requirements set by most analysts.
- Context-free grammar (CFG) – A compromise between a regular and a context-sensitive grammar, CFGs offer a good balance between expressive power and computational efficiency. The language created by a CFG can be recognized in \(O(n^3)\) time, which will prove helpful in future parsing efforts. This selection prerequisite was complemented by the decision to use an attribute grammar for formal representation (see 3.3).
- Lossless operation – It is vital that the algorithm employed does not change the order or immutability of events, since both are likely to have an impact on the semantics of a behavioral sequence.
| Sym | String | Grammar | Remarks |
|---|---|---|---|
| 1 | a | S \(\rightarrow \) a | |
| 2 | ab | S \(\rightarrow \) ab | |
| 3 | abc | S \(\rightarrow \) abc | |
| 4 | abcd | S \(\rightarrow \) abcd | |
| 5 | abcdb | S \(\rightarrow \) abcdb | |
| 6 | abcdbc | S \(\rightarrow \) abcdbc | bc appears 2x |
| | | S \(\rightarrow \) aAdA; A \(\rightarrow \) bc | bigram uniq. |
| 7 | abcdbca | S \(\rightarrow \) aAdAa; A \(\rightarrow \) bc | |
| 8 | abcdbcab | S \(\rightarrow \) aAdAab; A \(\rightarrow \) bc | |
| 9 | abcdbcabc | S \(\rightarrow \) aAdAabc; A \(\rightarrow \) bc | bc reappears |
| | | S \(\rightarrow \) aAdAaA; A \(\rightarrow \) bc | bigram uniq.; aA appears 2x |
| | | S \(\rightarrow \) BdAB; A \(\rightarrow \) bc; B \(\rightarrow \) aA | bigram uniq. |
| 10 | abcdbcabcd | S \(\rightarrow \) BdABd; A \(\rightarrow \) bc; B \(\rightarrow \) aA | Bd appears 2x |
| | | S \(\rightarrow \) CAC; A \(\rightarrow \) bc; B \(\rightarrow \) aA; C \(\rightarrow \) Bd | bigram uniq.; B used only 1x |
| | | S \(\rightarrow \) CAC; A \(\rightarrow \) bc; C \(\rightarrow \) aAd | rule utility |
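The stepwise reduction traced in Table 2 can be reproduced with a minimal batch sketch of Sequitur's two invariants, bigram (digram) uniqueness and rule utility. This is a simplification for illustration: real Sequitur works online, one symbol at a time, enforces digram uniqueness across all rule bodies, and handles overlapping digrams; the function and variable names are our own.

```python
def sequitur_sketch(s):
    """Batch approximation of Sequitur on a complete (lowercase) input string."""
    rules = {}                                   # rule name -> body (list of symbols)
    names = iter("ABCDEFGHIJKLMNOPQRSTUVWXYZ")   # fresh non-terminal names
    seq = list(s)

    def repeated_digram(xs):
        """Return the first digram that occurs twice without overlapping."""
        seen = {}
        for i in range(len(xs) - 1):
            d = (xs[i], xs[i + 1])
            if d in seen and seen[d] < i - 1:    # non-overlapping repeat
                return d
            seen.setdefault(d, i)
        return None

    # Digram uniqueness: factor every repeated digram into a new rule.
    d = repeated_digram(seq)
    while d is not None:
        name = next(names)
        rules[name] = list(d)
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == d:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        d = repeated_digram(seq)

    # Rule utility: inline any rule that is referenced only once.
    for name in list(rules):
        bodies = [seq] + [b for n, b in rules.items() if n != name]
        if sum(b.count(name) for b in bodies) == 1:
            body = rules.pop(name)
            for xs in [seq] + list(rules.values()):
                while name in xs:
                    i = xs.index(name)
                    xs[i:i + 1] = body
    return {"S": seq, **rules}
```

For the string of Table 2, `sequitur_sketch("abcdbcabcd")` yields the final grammar S \(\rightarrow \) CAC, A \(\rightarrow \) bc, C \(\rightarrow \) aAd, with the intermediate rule B inlined by the rule utility pass.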
3.2.1 Sequitur
Sequitur reads its input incrementally and considers bigrams of adjacent symbols: for the input string abcdbcabcd, the first bigram would be ab, followed by a second bigram bc, and so forth. See Table 2 for a complete example of the process.

3.3 Formal definition
- \(G = (N, T, P, S)\) is a context-free grammar, where
  - N ... set of non-terminal symbols (variables)
  - T ... set of terminal symbols (alphabet)
  - P ... production rules
  - S ... start symbol
- A is a finite set of attributes
- R is a finite set of attribution rules (semantic rules)
- V is a finite set of values assigned to an attribute
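The tuple structure above can be captured in a few container types. The following is a sketch with hypothetical class and field names, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class CFG:
    N: set    # non-terminal symbols (variables)
    T: set    # terminal symbols (alphabet)
    P: dict   # production rules: variable -> list of right-hand sides
    S: str    # start symbol

@dataclass
class AttributeGrammar:
    G: CFG    # underlying context-free grammar
    A: set    # attributes, e.g. triggering process and element name
    R: dict   # attribution (semantic) rules
    V: dict   # values assigned to each attribute

# A minimal instance: one production S -> a b, with two attributes.
g = CFG(N={"S"}, T={"a", "b"}, P={"S": [["a", "b"]]}, S="S")
ag = AttributeGrammar(G=g, A={"tp", "en"}, R={}, V={"tp": set(), "en": set()})
```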
3.4 Event data
Events can be captured at the level of Windows API functions (e.g. CreateProcess) that offer a simple interface to the application programmer, or of native system calls (e.g. NtCreateProcess) that represent the underlying OS or kernel support functions. In the context of SEQUIN, event data is collected directly from the Windows kernel. We employ a driver-based monitoring agent [31] designed to collect and forward a number of events to a database server. This gives us unimpeded access to events depicting operations related to process and thread control, image loads, file management, registry modification, network socket interaction, and more. For example, a shell event that creates a new text file on a system may be simply denoted as the triple explorer.exe,file-create,document.txt. The same event may alternatively be expressed as explorer.exe,0.75,file-document.txt, whereby 0.75 is the edge label value currently associated with 'create' operations in a file context. Regardless of format, Sequitur still transforms the supplied input to sequences of strings during processing.
4 Inference and analysis process
4.1 Preprocessing
During preprocessing, unique element names are normalized into generic, numbered placeholders (e.g. 1.txt or 2.exe).

- Verbose – This trace format uses full, attribute-enabled events as individual words of the corpus. In verbose mode, the input data is transformed into the format triggering-process,operation,element-name, which translates to \(v_{i} \in V(X.a_{1})\), \(t_{x} \in T\), \(v_{j} \in V(X.a_{2})\). For example, a specific file creation operation triggered by the known explorer.exe process would be preprocessed into the following textual input format: explorer.exe,file-create,1.txt.
- Reduced – In this preprocessing mode, we omit attribute \(a_{2}\) to generate a quick view of the high-level activity exhibited by the processes under scrutiny. Here, \(v_{j}\) is not processed, resulting in the reduced format triggering-process,operation, depicted as e.g. explorer.exe,file-create.
- Granular – The goal in granular mode is to investigate operations not as a single word, but as elementary components. Each of the elements processed in verbose mode is treated by Sequitur as one terminal of the bigram. To maintain a level of separation between event triples, a fourth item denoting the start of a new event is prepended before each \(v_{i}\). This results in the following input (items delimited by semicolons): \(\texttt {<start>;triggering-process;operation;}{} \texttt {element-name}\).
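The three trace formats can be summarized in a small preprocessing sketch. The function name is hypothetical, and the event is assumed to arrive as an already-normalized triple:

```python
def preprocess(event, mode="verbose"):
    """Render a (triggering process, operation, element) triple in one of
    the three trace formats described above."""
    tp, op, en = event
    if mode == "verbose":                 # v_i, t_x, v_j as one word
        return f"{tp},{op},{en}"
    if mode == "reduced":                 # v_j omitted
        return f"{tp},{op}"
    if mode == "granular":                # one terminal per element, <start> marker
        return f"<start>;{tp};{op};{en}"
    raise ValueError(f"unknown mode: {mode}")
```

For instance, `preprocess(("explorer.exe", "file-create", "1.txt"))` yields the verbose line shown above, while `mode="reduced"` drops the element name.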
4.2 Rule extraction
- Lexical analysis – In this initial step, each unique terminal \(t \in T\) is assigned a corresponding symbol, called a token. This numerical representation streamlines the process by reducing the complexity of string-only comparisons. Each new terminal is additionally stored in a translation (symbol) table for later reference.
- Grammatical inference – After lexical analysis, the Sequitur algorithm is applied to generate an execution trace grammar consisting of tokenized terminal symbols. The first rule \(p \in P\) of each grammar is the start rule, or zero rule, which depicts the full grammar of the compressed input data. In our implementation, every line thereafter contains the following extracted information:
  - Rule – The rule consists of a left-side rule name (variable), which is sequentially numbered, as well as right-side variables and terminals. The non-terminals are, again, references to finer-grained rules, while the terminals represent the actual system events. In line with the definition of CFGs, there is only a single variable on the left side of a rule.
  - Resolved rule – In order to provide a detailed view of individual rules, we recursively resolve each sequence of non-terminals \(n \in N\) to its base terminals \(t \in T\).
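The lexical analysis step can be sketched as follows (hypothetical helper; the concrete token values in the real implementation may differ):

```python
def tokenize(terminals):
    """Assign each unique terminal a numeric token and build the
    translation (symbol) table for later reference."""
    table, tokens = {}, []
    for t in terminals:
        if t not in table:
            table[t] = len(table)      # next free token number
        tokens.append(table[t])
    return tokens, table
```

For example, `tokenize(["a", "b", "a"])` returns the token stream `[0, 1, 0]` together with the table `{"a": 0, "b": 1}`.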
4.3 Rule evaluation
- File rule (FR) count – This number shows how many times a rule occurs in the current input file.
- Grammar rule (GR) count – The overall count across all supplied input files. For a single trace, this number is identical to the FR count.
- Prevalence count – This value specifies the number of input files a particular derivation has been found in. The result is displayed as x/y (x in y), where x is the number of files the pattern is prevalent in and y is the overall count of individual input files.
- Match flag – The extraction of interesting rules is facilitated by determining rules that are identical in occurrence and number across all of the processed input files, indicated by a Boolean flag.
- Rule length – This value defines the overall number of items in the entire derivation (i.e. the resolved rule). Multiples of the input file count y are likely to represent recursively compressed rules.
- Rule density – This support metric facilitates anomaly detection by calculating the ratio between inferred rules and single terminals present in the input as well as rule zero.
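Several of these metrics can be computed directly from per-file occurrence counts. The sketch below assumes a hypothetical input shape, a mapping from file name to a rule-to-count dictionary:

```python
def evaluate_rule(rule, counts_per_file):
    """Return (GR count, prevalence, match flag) for one rule."""
    fr = {f: c.get(rule, 0) for f, c in counts_per_file.items()}
    gr = sum(fr.values())                            # grammar rule count
    prevalence = sum(1 for n in fr.values() if n > 0)
    match = gr > 0 and len(set(fr.values())) == 1    # same count in every file
    return gr, prevalence, match
```

A rule occurring twice in each of two traces thus yields a GR count of 4, a prevalence of 2/2, and a set match flag.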
4.4 Rule transformation
For example, a rule combining a process-create operation followed by a file-delete operation is transformed into the descriptive variable CREA-PROC_DEL-FILE. A rule that describes the loading of two image files is dubbed LOAD2-IMG.
\(NS = (O,E,MO,ME,L)\), where
-
Operation O = {CREA, MOD, START, LOAD, KILL, DEL, CONN}
-
Event type E = {PROC, THR, IMG, FILE, REG, NET}
-
Operation mapping rules MO = { CREA \(\rightarrow \) create, MOD \(\rightarrow \) modify \(|\) change \(|\) edit, START \(\rightarrow \) start \(|\) spawn, LOAD \(\rightarrow \) load, KILL \(\rightarrow \) kill \(|\) stop \(|\) terminate, DEL \(\rightarrow \) delete, CONN \(\rightarrow \) connect }
-
Event mapping rules ME = {PROC \(\rightarrow \) process, THR \(\rightarrow \) thread, IMG \(\rightarrow \) image, FILE \(\rightarrow \) file, REG \(\rightarrow \) registry, NET \(\rightarrow \) network}
-
-
and labeling rules L, where
-
\((O_{1} ||\) "-" \(||E_{1} ||"\_", ..., O_{n} ||\) "-" \(||E_{n})\)
-
If \(O_{n} == O_{n+1}\) then \(O_{n} ||\) "2"
-
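The labeling rules L can be sketched by reversing the mapping rules MO and ME. The helper name and its input format of `event-operation` strings are assumptions for illustration:

```python
# Reverse lookup tables derived from the mapping rules MO and ME above.
MO = {"create": "CREA", "modify": "MOD", "change": "MOD", "edit": "MOD",
      "start": "START", "spawn": "START", "load": "LOAD",
      "kill": "KILL", "stop": "KILL", "terminate": "KILL",
      "delete": "DEL", "connect": "CONN"}
ME = {"process": "PROC", "thread": "THR", "image": "IMG",
      "file": "FILE", "registry": "REG", "network": "NET"}

def label(events):
    """Build a descriptive variable name from 'event-operation' strings."""
    parts = []
    for ev in events:
        etype, op = ev.split("-")
        parts.append((MO[op], ME[etype]))
    out, i = [], 0
    while i < len(parts):
        o, e = parts[i]
        if i + 1 < len(parts) and parts[i + 1] == (o, e):  # identical pair follows
            out.append(f"{o}2-{e}")                        # merge with count suffix
            i += 2
        else:
            out.append(f"{o}-{e}")
            i += 1
    return "_".join(out)
```

For instance, `label(["image-load", "image-load"])` produces LOAD2-IMG, matching the example above.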
5 Implementation
5.1 Example
The events explorer.exe,file-create,1.exe (abbreviated: a) and explorer.exe,process-start,1.exe (b) form the rule 1 \(\rightarrow \) a b. For a more formal summary of example input file 1, please consult the grammar depiction \(AG_{1}\) below.
With the unique events 1.exe,registry-create,machine/system, 1.exe,thread-create,thread, and 1.exe,image-load,ws2_32.dll highlighted, we can immediately spot the deviations from the otherwise recurring behavior.
- \(G_{1} = (N, T, P, S)\), where:
  - N = {CREA-FILE_START-PROC; LOAD2-IMG; MOD2-REG_CREA2-PROC; MOD2-REG; KILL-PROC_KILL-THR_DEL-FILE}
  - T = {file-create.tp,en = explorer.exe,1.exe; file-delete.tp,en = explorer.exe,1.exe; process-create.tp,en = explorer.exe,1.exe; process-create.tp,en = 1.exe,cmd.exe; process-create.tp,en = 1.exe,net.exe; process-kill.tp,en = cmd.exe,net.exe; image-load.tp,en = 1.exe,kernel32.dll; image-load.tp,en = 1.exe,advapi32.dll; registry-create.tp,en = 1.exe,machine/system; registry-modify.tp,en = 1.exe,hklm/software/microsoft; thread-terminate.tp,en = 1.exe,thread}
  - P = {ZERO-RULE \(\rightarrow \) CREA-FILE_START-PROC LOAD2-IMG MOD2-REG_CREA2-PROC registry-create.tp,en = 1.exe,machine/system MOD2-REG KILL-PROC_KILL-THR_DEL-FILE; CREA-FILE_START-PROC \(\rightarrow \) file-create.tp,en = explorer.exe,1.exe process-create.tp,en = explorer.exe,1.exe; LOAD2-IMG \(\rightarrow \) image-load.tp,en = 1.exe,kernel32.dll image-load.tp,en = 1.exe,advapi32.dll; MOD2-REG \(\rightarrow \) registry-modify.tp,en = 1.exe,hklm/software/microsoft registry-modify.tp,en = 1.exe,hklm/software/microsoft; MOD2-REG_CREA2-PROC \(\rightarrow \) MOD2-REG process-create.tp,en = 1.exe,cmd.exe process-create.tp,en = cmd.exe,net.exe; KILL-PROC_KILL-THR_DEL-FILE \(\rightarrow \) process-kill.tp,en = cmd.exe,net.exe thread-terminate.tp,en = 1.exe,thread file-delete.tp,en = explorer.exe,1.exe}
  - S = {ZERO-RULE}
- A = {tp; en}
- R is described as part of the preprocessing stage and defines which portion of the data translates into triggering process tp (\(v_{i}\)), operation (\(t_{x}\)), and element en (\(v_{j}\))
- V = {explorer.exe; 1.exe; kernel32.dll; advapi32.dll; cmd.exe; net.exe; hklm/software/microsoft; machine/system; thread}
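The recursive rule resolution from Section 4.2 can be applied to a toy grammar shaped like \(AG_{1}\). Rule bodies are abbreviated here, and the dictionary form of P is an assumption for illustration:

```python
def resolve(symbol, P):
    """Recursively expand a non-terminal into its base terminals."""
    if symbol not in P:                    # terminal symbol
        return [symbol]
    out = []
    for s in P[symbol]:
        out.extend(resolve(s, P))
    return out

# Abbreviated fragment of the example grammar's production rules.
P = {"ZERO-RULE": ["CREA-FILE_START-PROC", "LOAD2-IMG"],
     "CREA-FILE_START-PROC": ["file-create", "process-create"],
     "LOAD2-IMG": ["image-load", "image-load"]}
```

Resolving ZERO-RULE against this fragment yields the flat event sequence file-create, process-create, image-load, image-load.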
| Scenario | Eval. | Significance |
|---|---|---|
| Compression | Reduction of input data size | |
| | Reduction of processing complexity (third-party) | |
| Anomaly detection | Detection and extraction of deviating behavior | |
| Baselining | Identification of common patterns in traces | |
| Visualization | Visual presentation of inferred rules | |
| Discovery | Interactive filtering and extraction of terminals/rules | |
| | Rule labeling, storing | |
| | Highlighting of known rules | |
6 Evaluation
6.1 Preparatory data reduction
6.1.1 Concept
6.1.2 Evaluation
The data in question comprises event traces of the svchost.exe process. Under normal circumstances, this data would have to be assessed in its uncompressed entirety, as it is used for creating baseline graph templates utilized in behavior deviation analysis using a combination of Hungarian distance computation [26] and Malheur heuristic clustering [39]. Thanks to our Sequitur-enabled data reduction, we can focus on event sequences (rules) that are representative of specific processes – or on the remaining terminals, which constitute a potential anomaly. Both significantly speed up the polynomial-complexity star graph matching operations (Fig. 4) used by the tested system [30]. At the same time, we increase the accuracy of the template creation process by drastically reducing the number of empty feature vectors that are normally produced in the clustering stage.
6.1.3 Discussion
6.2 Anomaly detection
6.2.1 Concept
6.2.2 Evaluation
| Sample trace | TRR | TRR* | Length | Length* | Preval. | Preval.* | Overall |
|---|---|---|---|---|---|---|---|
| Ben-1 | 58.60 | 0.33 | 8.87 | 0.44 | 2 | 0.20 | 0.274 |
| **Ben-2** | **64.00** | **1.00** | **9.28** | **0.95** | **9** | **0.90** | **0.945** |
| Ben-3 | 59.30 | 0.41 | 8.9 | 0.48 | 2 | 0.20 | 0.313 |
| Ben-4 | 56.00 | 0.00 | 8.51 | 0.00 | 1 | 0.10 | 0.050 |
| Ben-5 | 56.80 | 0.10 | 8.52 | 0.01 | 1 | 0.10 | 0.091 |
| Ben-6 | 58.30 | 0.29 | 8.79 | 0.35 | 2 | 0.20 | 0.250 |
| Ben-7 | 59.20 | 0.40 | 8.69 | 0.22 | 0 | 0.00 | 0.182 |
| Ben-8 | 60.00 | 0.50 | 8.72 | 0.26 | 4 | 0.40 | 0.426 |
| Ben-9 | **63.80** | **0.98** | **9.32** | **1.00** | 0 | 0.00 | 0.490 |
| Ben-10 | 57.30 | 0.16 | 8.6 | 0.11 | 3 | 0.30 | 0.226 |
| Ben-11 | 58.50 | 0.31 | 8.79 | 0.35 | **10** | **1.00** | 0.660 |
| **Mal-1** | **62.80** | **0.85** | 9.01 | 0.62 | **9** | **0.90** | **0.852** |
| **Mal-2** | **62.74** | **0.84** | 8.99 | 0.59 | **9** | **0.90** | **0.846** |
File | Rule | FR # | GR # | Prevalence | Length |
---|---|---|---|---|---|
Ben-2 | 3 \(\rightarrow \) 139 139 | 2 | 2 | 1/12 | 16 |
Ben-2 | 4 \(\rightarrow \) 140 140 | 3 | 3 | 1/12 | 4 |
Ben-2 | 12 \(\rightarrow \) 1-0-1-(...)-0-40 1-0-1-(...)-0-40 | 2 | 2 | 1/12 | 2 |
Ben-2 | 23 \(\rightarrow \) 1-0-1-(...)-1-56 1-0-1-(...)-1-56 | 2 | 2 | 1/12 | 2 |
Ben-2 | 69 \(\rightarrow \) 1-0-1-(...)-5-59 1-0-1-(...)-5-59 | 2 | 2 | 1/12 | 2 |
Ben-2 | 98 \(\rightarrow \) 1-0-1-(...)-8-60 1-0-1-(...)-8-60 | 2 | 2 | 1/12 | 2 |
Ben-2 | 102 \(\rightarrow \) 0-0-1-(...)-8-54 0-0-1-(...)-8-54 | 2 | 2 | 1/12 | 2 |
Ben-2 | 139 \(\rightarrow \) 4 4 | 2 | 2 | 1/12 | 8 |
Ben-2 | 140 \(\rightarrow \) 0-0-0-(...)-0-20 0-0-0-(...)-0-20 | 2 | 2 | 1/12 | 2 |
Mal-1 | 57 \(\rightarrow \) 1-0-1-(...)-4-60 1-0-1-(...)-4-60 | 2 | 2 | 1/12 | 2 |
Mal-1 | 72 \(\rightarrow \) 1-0-1-(...)-6-52 1-0-1-(...)-6-52 | 2 | 2 | 1/12 | 2 |
Mal-1 | 81 \(\rightarrow \) 82 82 | 2 | 2 | 1/12 | 64 |
Mal-1 | 82 \(\rightarrow \) 83 83 | 3 | 3 | 1/12 | 32 |
Mal-1 | 83 \(\rightarrow \) 157 157 | 3 | 3 | 1/12 | 16 |
Mal-1 | 84 \(\rightarrow \) 85 85 | 3 | 3 | 1/12 | 4 |
Mal-1 | 85 \(\rightarrow \) 1-0-1-(...)-7-60 1-0-1-(...)-7-60 | 3 | 3 | 1/12 | 2 |
Mal-1 | 86 \(\rightarrow \) 1-0-1-(...)-7-59 1-0-1-(...)-7-59 | 2 | 2 | 1/12 | 2 |
Mal-1 | 157 \(\rightarrow \) 84 84 | 2 | 2 | 1/12 | 8 |
File names within the windows/softwaredistribution/download folder structure were replaced during normalization as part of the preprocessing stage (Section 4.1). The remaining unique rules as well as terminals were deviations from the computed baseline caused by random temporary file names and minor changes in update chronology. In practice, this variant of the inference process can be used to create whitelists, templates for benign process behavior, or, again, for extracting anomalies from seemingly benign sequences of application behavior.

6.2.3 Discussion
7 Visualization & knowledge discovery
7.1 Visual analytics
7.2 Visualization considerations
- Representation of explicit knowledge. To support analysts in their task of behavior-based malware analysis, explicit expert knowledge should be made available in the system. Additionally, the actual generation of explicit knowledge needs to be facilitated. For the visualization of that knowledge, a basic hierarchy of events is required. We utilized the malicious behavior schema by Dornhackl et al. [14]: Using the provided semantic categorization, analysts can explore currently stored knowledge and add newly inferred rules to the system. The visualization of the malicious behavior schema employs a tree structure, where the nodes are the different types of malicious behavior and the leaves are the rules representing them (see Fig. 6:1).
- Representation of events. For the representation of the events included in an analysis file, two important aspects have to be covered. On the one hand, the name of the event (see Section 3.4 for more information on input data) is essential for the analyst trying to ascertain its purpose. On the other hand, it is very important to learn how often a single event is included in the analysis file. We use a table structure for the visualization of this data, whereby the event name is represented as a string and its occurrence as a bar chart with the total number as an overlay. Employing this visualization technique, the analyst gains the ability to quickly find events of interest, e.g. by visually comparing the sizes of the bar charts (see Fig. 6:3).
- Representation of rules. Since a rule is a sequential structure containing several events, it is prudent to use a representation similar to that of individual items. In contrast to the representation of all events included in a rule (resolved rule, see 4.2), a more abstract visualization can be applied here. The transformation of events based on their unique ID into a graphical representation (called a 'Graphical Summary' [57]) helps to more effectively locate unknown patterns in the data. Additionally, all other related information can be visualized as bar charts in combination with a label representing the total number (e.g., the rule's prevalence and length as introduced in Section 4.3). The original order of events within a rule is highlighted (see Fig. 6:2).