1 Introduction
example.com) is now over 296 million [44]. Multiple fully qualified domain names (FQDNs) (e.g., www.example.com) may exist under the same 2LD name; therefore, the number of all existing FQDNs could be in the billions. The third reason is that no one can fully understand all real-time changes in the mappings between domain names and IP addresses. Since DNS is a distributed system and the mappings are configured in each authoritative name server, the mappings of all domain names cannot feasibly be observed in real time. Given these reasons, blacklisting approaches based on DNS observations have failed to keep up with newly generated malicious domain names. Thus, we adopt an approach of prediction instead of observation, i.e., we aim to discover malicious domain names that are likely to be abused in the future. The key idea of this approach is to exploit temporal variation patterns (TVPs) of malicious domain names. The TVPs of domain names include the information about how and when a domain name has been listed in legitimate/popular and/or malicious domain name lists. We use TVPs to comprehend the variations in domain names. For example, a domain name may be newly registered or updated, IP addresses corresponding to the domain name may be changed, and the traffic directed to the domain name may be changed.

- We propose DomainProfiler, which identifies TVPs of domain names to precisely profile various types of malicious domain names.
- Our evaluation with real and large ground truth data reveals that DomainProfiler can predict malicious domain names 220 days beforehand with a true positive rate (TPR) of 0.985 in the best-case scenario.
- We reveal the contribution, or importance, of each feature in DomainProfiler for detecting future malicious domain names.
- We conduct a lifespan analysis for malicious domain names detected by DomainProfiler to illustrate the characteristics of various domain names abused in a series of cyber-attacks.
- We use a large number of actual malware samples to demonstrate the effectiveness of DomainProfiler at defending against malware activities.
2 Motivation: temporal variation pattern
TVPs | Objectives |
---|---|
Alexa-null | Improving true positive rates (TPRs) |
Alexa-stable | Improving true negative rates (TNRs) |
Alexa-fall | Improving true positive rates (TPRs) |
Alexa-rise | Improving true negative rates (TNRs) |
hpHosts-null | Improving true negative rates (TNRs) |
hpHosts-stable | Improving true positive rates (TPRs) |
hpHosts-fall | Improving true negative rates (TNRs) |
hpHosts-rise | Improving true positive rates (TPRs) |
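
As a minimal sketch of how the four patterns in the table above could be derived, the following Python function labels a window of list-membership observations. The rules used here (null: never listed, stable: listed in every snapshot, fall: listed earlier but absent at the end of the window, rise: present at the end but not throughout) are assumptions inferred from the pattern names. The function and parameter names are hypothetical and are not taken from DomainProfiler.

```python
from typing import Sequence


def classify_tvp(listed: Sequence[bool]) -> str:
    """Map a window of list-membership observations to a TVP label.

    listed[i] is True when the domain name appeared in the i-th snapshot
    of a domain name list (for example Alexa or hpHosts), oldest first.
    The decision rules are assumptions, not the paper's exact definition.
    """
    if not listed:
        raise ValueError("the time window must contain at least one snapshot")
    if not any(listed):
        return "null"      # never listed inside the window
    if all(listed):
        return "stable"    # listed in every snapshot
    # Mixed history: simplify by looking only at the newest observation.
    return "rise" if listed[-1] else "fall"


def tvp_feature_name(list_name: str, listed: Sequence[bool]) -> str:
    """Build a label such as 'hpHosts-rise' or 'Alexa-null'."""
    return f"{list_name}-{classify_tvp(listed)}"
```

For instance, a domain name that was absent from hpHosts in older snapshots but present in the newest one would be labeled hpHosts-rise under these assumed rules.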
3 Our system: DomainProfiler
3.1 Monitoring module
3.2 Profiling module
3.2.1 Step 1: identifying TVPs
.com.au, .co.jp, and .co.uk, as shown in Fig. 5. In general, TLDs are divided into generic top-level domains (gTLDs), such as .com, .net, and .org, and country code top-level domains (ccTLDs), such as .au, .jp, and .uk. If we do not use effective TLDs, the 2LD parts of gTLDs and ccTLDs differ significantly. For example, in the gTLD case of foo.bar.example.com, the 2LD part is example.com; however, in the ccTLD case of baz.qux.example.co.jp, the 2LD part is co.jp. Our definition of including effective TLDs is intended to treat both gTLD and ccTLD identically, that is, the 2LD part in the above ccTLD example is example.co.jp in this paper.

Type | No. | Feature name | Type | No. | Feature name | Type | No. | Feature name |
---|---|---|---|---|---|---|---|---|
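TVP (Legitimate/Popular) | 1 | Alexa1k-null | rIP (BGP) | 21 | # of BGP prefixes (FQDN) | rDomain (FQDN) | 39 | # of FQDNs |
TVP (Legitimate/Popular) | 2 | Alexa1k-stable | rIP (BGP) | 22 | # of BGP prefixes (3LD) | rDomain (FQDN) | 40 | Mean lengths |
TVP (Legitimate/Popular) | 3 | Alexa1k-fall | rIP (BGP) | 23 | # of BGP prefixes (2LD) | rDomain (FQDN) | 41 | SD lengths |
TVP (Legitimate/Popular) | 4 | Alexa1k-rise | rIP (BGP) | 24 | # of countries (FQDN) | rDomain (1-gram) | 42 | Mean distribution |
TVP (Legitimate/Popular) | 5 | Alexa10k-null | rIP (BGP) | 25 | # of countries (3LD) | rDomain (1-gram) | 43 | Median distribution |
TVP (Legitimate/Popular) | 6 | Alexa10k-stable | rIP (BGP) | 26 | # of countries (2LD) | rDomain (1-gram) | 44 | SD distribution |
TVP (Legitimate/Popular) | 7 | Alexa10k-fall | rIP (BGP) | 27 | # of IP addresses (3LD) | rDomain (2-grams) | 45 | Mean distribution |
TVP (Legitimate/Popular) | 8 | Alexa10k-rise | rIP (BGP) | 28 | # of IP addresses (2LD) | rDomain (2-grams) | 46 | Median distribution |
TVP (Legitimate/Popular) | 9 | Alexa100k-null | rIP (BGP) | 29 | # of organizations (FQDN) | rDomain (2-grams) | 47 | SD distribution |
TVP (Legitimate/Popular) | 10 | Alexa100k-stable | rIP (ASN) | 30 | # of ASNs (FQDN) | rDomain (3-grams) | 48 | Mean distribution |
TVP (Legitimate/Popular) | 11 | Alexa100k-fall | rIP (ASN) | 31 | # of ASNs (3LD) | rDomain (3-grams) | 49 | Median distribution |
TVP (Legitimate/Popular) | 12 | Alexa100k-rise | rIP (ASN) | 32 | # of ASNs (2LD) | rDomain (3-grams) | 50 | SD distribution |
TVP (Legitimate/Popular) | 13 | Alexa1M-null | rIP (Registration) | 33 | # of registries (FQDN) | rDomain (TLD) | 51 | # of TLDs |
TVP (Legitimate/Popular) | 14 | Alexa1M-stable | rIP (Registration) | 34 | # of registries (3LD) | rDomain (TLD) | 52 | Ratio of .com |
TVP (Legitimate/Popular) | 15 | Alexa1M-fall | rIP (Registration) | 35 | # of registries (2LD) | rDomain (TLD) | 53 | Mean distribution |
TVP (Legitimate/Popular) | 16 | Alexa1M-rise | rIP (Registration) | 36 | # of dates (FQDN) | rDomain (TLD) | 54 | Median distribution |
TVP (Malicious) | 17 | hpHosts-null | rIP (Registration) | 37 | # of dates (3LD) | rDomain (TLD) | 55 | SD distribution |
TVP (Malicious) | 18 | hpHosts-stable | rIP (Registration) | 38 | # of dates (2LD) | | | |
TVP (Malicious) | 19 | hpHosts-fall | | | | | | |
TVP (Malicious) | 20 | hpHosts-rise | | | | | | |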
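
The effective-TLD handling described in Step 1 can be illustrated with a short Python sketch. The `EFFECTIVE_TLDS` set is a hypothetical, hand-picked stand-in for a complete effective-TLD list, and `second_level_domain` is an illustrative helper rather than the system's actual code.

```python
# Hypothetical, deliberately tiny stand-in for a full effective-TLD list.
EFFECTIVE_TLDS = {"com", "net", "org", "com.au", "co.jp", "co.uk"}


def second_level_domain(fqdn: str, effective_tlds=EFFECTIVE_TLDS) -> str:
    """Return the 2LD part of a name, treating effective TLDs as one unit."""
    labels = fqdn.lower().rstrip(".").split(".")
    # Scan from the longest candidate suffix to the shortest, so that
    # "co.jp" is matched before a bare "jp" could be.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in effective_tlds:
            if i == 0:
                raise ValueError(f"{fqdn!r} is itself an effective TLD")
            # Keep exactly one label to the left of the effective TLD.
            return ".".join(labels[i - 1:])
    raise ValueError(f"no known effective TLD in {fqdn!r}")


print(second_level_domain("foo.bar.example.com"))
print(second_level_domain("baz.qux.example.co.jp"))
```

With this helper, both examples from the text resolve to a name of the same shape: example.com for the gTLD case and example.co.jp for the ccTLD case.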
3.2.2 Step 2: appending DNS-based features
foo.example.com. The graph is a union of every resolved IP address corresponding to each domain name at the FQDN level and its parent domain name levels, such as 3LD and 2LD, from historical DNS logs collected in the former monitoring module. In Fig. 6, FQDN and 3LD (foo.example.com) correspond to the IP address 192.0.2.2 at time \(t-1\) and 198.51.100.2 at \(t\), and 2LD (example.com) corresponds to the IP address 192.0.2.1 at \(t-1\) and 198.51.100.1 at \(t\). Thus, these four IP addresses are defined as rIPs for foo.example.com. Then, we extract the features from rIPs. These features consist of three subsets: border gateway protocol (BGP), autonomous system number (ASN), and registration.

foo.example.com. The graph is a union of domain names pointing to IP addresses in the same autonomous system number (ASN) of the historical IP addresses of each target domain name. In Fig. 7, the ASN for the target foo.example.com is AS64501 and another IP address 192.0.2.3 in AS64501 is connected to the domain names bar.example.net and baz.example.org. Thus, these three domain names are defined as rDomains for foo.example.com, and we extract their features. These features consist of three subsets: FQDN string, n-grams, and top-level domain (TLD).

example.com consists of pairs of letters such as ex, xa, and am. Specifically, we extract the mean, median, and SD of 1-gram (Nos. 42–44) in rDomains, those of 2-grams (Nos. 45–47), and those of 3-grams (Nos. 48–50).

.com TLD in the set (No. 52), and mean, median, and SD of the occurrence frequency of the TLDs in the set (Nos. 53–55).

Type | Dataset | Period | # FQDNs |
---|---|---|---|
Target domain names (training set) | Legitimate-alexa | 2013-05-22–2015-02-28 | 89,739 |
Target domain names (training set) | Malicious-hpHosts | 2013-01-17–2015-02-28 | 83,670 |
Target domain names (test set) | Honeyclient-exploit | 2015-03-01–2015-10-07 | 537 |
Target domain names (test set) | Honeyclient-malware | 2015-03-01–2015-10-07 | 68 |
Target domain names (test set) | Sandbox-malware | 2015-03-01–2015-10-07 | 775 |
Target domain names (test set) | Sandbox-C&C | 2015-03-01–2015-10-07 | 8473 |
Target domain names (test set) | Pro-C&C | 2015-03-01–2015-03-29 | 97 |
Target domain names (test set) | Pro-phishing | 2015-03-01–2015-03-29 | 78,221 |
Target domain names (test set) | Legitimate-new | 2015-03-01–2015-03-29 | 5868 |
Domain name lists DB | AlexaDB | 2013-05-22–2015-02-28 | 5,596,219 |
Domain name lists DB | hpHostsDB | 2013-01-17–2015-02-28 | 1,709,836 |
Historical DNS logs | DNSDB | 2014-10-01–2015-02-28 | 47,538,966 |
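
The rIP and rDomain constructions of Step 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: `DNS_LOG` is a hypothetical in-memory stand-in for the historical DNS logs, parent names are derived by simple label counting (ignoring effective TLDs for brevity), n-grams are taken within each label, and the statistics are computed over n-gram occurrence counts using the population standard deviation.

```python
from collections import Counter
from statistics import mean, median, pstdev

# Hypothetical historical DNS log: name -> IP addresses observed over time.
DNS_LOG = {
    "foo.example.com": {"192.0.2.2", "198.51.100.2"},
    "example.com": {"192.0.2.1", "198.51.100.1"},
}


def related_ips(fqdn, log, levels=(3, 2)):
    """Union of IPs resolved for the FQDN and its 3LD and 2LD parents."""
    labels = fqdn.split(".")
    names = {fqdn} | {".".join(labels[-n:]) for n in levels if n <= len(labels)}
    return set().union(*(log.get(name, set()) for name in names))


def ngram_stats(domains, n):
    """Mean, median, and SD of n-gram occurrence counts over a set of names."""
    counts = Counter(
        label[i:i + n]
        for name in domains
        for label in name.split(".")          # n-grams never span a dot
        for i in range(len(label) - n + 1)
    )
    freqs = list(counts.values())
    if not freqs:
        return 0.0, 0.0, 0.0
    return mean(freqs), median(freqs), pstdev(freqs)


r_ips = related_ips("foo.example.com", DNS_LOG)
r_domains = ["foo.example.com", "bar.example.net", "baz.example.org"]
print(sorted(r_ips))
print(ngram_stats(r_domains, 2))
```

Using the addresses from Fig. 6, `related_ips` yields the four rIPs of foo.example.com, and `ngram_stats` produces one (mean, median, SD) triple per value of n, which is the shape of feature Nos. 42–50.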
3.2.3 Step 3: applying machine learning
4 Evaluation
4.1 Dataset
4.2 Parameter tuning
4.2.1 Time window size
4.2.2 Random Forests
Feature set | TP | FP | FN | TN | # FQDNs | TPR/recall | TNR | FPR | Precision | F-measure |
---|---|---|---|---|---|---|---|---|---|---|
TVP | 81,436 | 834 | 2234 | 88,905 | 173,409 | 0.973 | 0.991 | 0.009 | 0.990 | 0.982 |
rIP | 62,734 | 32,347 | 20,270 | 57,306 | 172,657 | 0.756 | 0.639 | 0.361 | 0.660 | 0.705 |
rDomain | 58,095 | 16,523 | 24,873 | 72,893 | 172,384 | 0.700 | 0.815 | 0.185 | 0.779 | 0.737 |
rIP+rDomain | 61,928 | 13,197 | 21,033 | 76,206 | 172,364 | 0.746 | 0.852 | 0.148 | 0.824 | 0.783 |
TVP+rIP+rDomain | 80,879 | 798 | 2082 | 88,605 | 172,364 | 0.975 | 0.991 | 0.009 | 0.990 | 0.983 |
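
The rate columns in the table above follow from the four confusion-matrix counts. The Python sketch below assumes the standard definitions of TPR, TNR, FPR, precision, and F-measure; the function name `rates` is illustrative.

```python
def rates(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive the evaluation rates from raw confusion-matrix counts."""
    tpr = tp / (tp + fn)                  # true positive rate (recall)
    tnr = tn / (tn + fp)                  # true negative rate
    fpr = fp / (fp + tn)                  # false positive rate = 1 - TNR
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    return {
        "TPR": tpr,
        "TNR": tnr,
        "FPR": fpr,
        "Precision": precision,
        "F-measure": f_measure,
    }


# Counts taken from the TVP row of the table.
tvp = rates(tp=81436, fp=834, fn=2234, tn=88905)
print({name: round(value, 3) for name, value in tvp.items()})
```

Applied to the TVP row, the rounded values match the table: 0.973, 0.991, 0.009, 0.990, and 0.982.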
4.3 Feature set selection
Dataset | TP | FP | FN | TN | # FQDNs | TPR/recall | TNR | FPR | Precision | F-measure |
---|---|---|---|---|---|---|---|---|---|---|
Honeyclient-exploit | 529 | – | 8 | – | 537 | 0.985 | – | – | – | – |
Honeyclient-malware | 67 | – | 1 | – | 68 | 0.985 | – | – | – | – |
Sandbox-malware | 721 | – | 54 | – | 775 | 0.930 | – | – | – | – |
Sandbox-C&C | 7476 | – | 997 | – | 8473 | 0.882 | – | – | – | – |
Pro-C&C | 92 | – | 5 | – | 97 | 0.948 | – | – | – | – |
Pro-phishing | 75,583 | – | 2638 | – | 78,221 | 0.966 | – | – | – | – |
Legitimate-new | – | 142 | – | 5726 | 5868 | – | 0.976 | 0.024 | – | – |
Total | 84,468 | 142 | 3703 | 5726 | 94,039 | 0.958 | 0.976 | 0.024 | 0.998 | 0.978 |
Dataset | TP | FP | FN | TN | # FQDNs | TPR/recall | TNR | FPR | Precision | F-measure |
---|---|---|---|---|---|---|---|---|---|---|
Honeyclient-exploit | 184 | – | 353 | – | 537 | 0.343 | – | – | – | – |
Honeyclient-malware | 5 | – | 63 | – | 68 | 0.074 | – | – | – | – |
Sandbox-malware | 197 | – | 578 | – | 775 | 0.254 | – | – | – | – |
Sandbox-C&C | 2491 | – | 5982 | – | 8473 | 0.294 | – | – | – | – |
Pro-C&C | 39 | – | 58 | – | 97 | 0.402 | – | – | – | – |
Pro-phishing | 29,427 | – | 48,794 | – | 78,221 | 0.376 | – | – | – | – |
Legitimate-new | – | 1038 | – | 4830 | 5868 | – | 0.823 | 0.177 | – | – |
Total | 32,343 | 1038 | 55,828 | 4830 | 94,039 | 0.367 | 0.823 | 0.177 | 0.969 | 0.532 |
4.4 System performance
4.5 Predictive detection performance
Dataset | days_Min | days_1stQu | days_2ndQu | days_Mean | days_3rdQu | days_Max |
---|---|---|---|---|---|---|
Honeyclient-exploit | 16 | 140 | 176 | 164.5 | 197 | 220 |
Honeyclient-malware | 1 | 125 | 205 | 159.4 | 212 | 220 |
Sandbox-malware | 1 | 108 | 133 | 128.2 | 151 | 221 |
Sandbox-C&C | 1 | 38 | 99 | 98.31 | 140 | 221 |
Pro-C&C | 1 | 9 | 15 | 14.95 | 21 | 29 |
Pro-phishing | 1 | 9 | 14 | 14.04 | 19 | 28 |
Dataset | days_Min | days_1stQu | days_2ndQu | days_Mean | days_3rdQu | days_Max |
---|---|---|---|---|---|---|
Honeyclient-exploit | 19 | 158 | 187 | 172.3 | 210 | 219 |
Honeyclient-malware | 1 | 60 | 63 | 108.4 | 205 | 213 |
Sandbox-malware | 3 | 112 | 133 | 128.3 | 153 | 221 |
Sandbox-C&C | 1 | 47 | 108 | 103.9 | 144 | 221 |
Pro-C&C | 1 | 9 | 12 | 14.18 | 20 | 29 |
Pro-phishing | 1 | 11 | 17 | 15.21 | 20 | 28 |
No. | Feature (TVP) | GI | Rank | No. | Feature (rIP) | GI | Rank | No. | Feature (rDomain) | GI | Rank |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Alexa1k-null | 76.7 | 44 | 21 | # BGP Prefixes (FQDN) | 135.0 | 40 | 39 | # FQDNs | 210.6 | 21 |
2 | Alexa1k-stable | 37.7 | 51 | 22 | # BGP Prefixes (3LD) | 159.5 | 33 | 40 | Mean lengths | 196.1 | 25 |
3 | Alexa1k-fall | 19.4 | 53 | 23 | # BGP Prefixes (2LD) | 261.4 | 15 | 41 | SD lengths | 214.6 | 20 |
4 | Alexa1k-rise | 6.8 | 55 | 24 | # Countries (FQDN) | 156.2 | 35 | 42 | Mean distribution | 163.3 | 32 |
5 | Alexa10k-null | 155.6 | 36 | 25 | # Countries (3LD) | 203.1 | 23 | 43 | Median distribution | 179.5 | 27 |
6 | Alexa10k-stable | 70.3 | 46 | 26 | # Countries (2LD) | 148.2 | 37 | 44 | SD distribution | 174.9 | 29 |
7 | Alexa10k-fall | 48.0 | 50 | 27 | # IP addresses (3LD) | 319.4 | 13 | 45 | Mean distribution | 204.6 | 22 |
8 | Alexa10k-rise | 11.4 | 54 | 28 | # IP addresses (2LD) | 328.3 | 12 | 46 | Median distribution | 189.7 | 26 |
9 | Alexa100k-null | 23197.3 | 1 | 29 | # Organizations (FQDN) | 53.0 | 48 | 47 | SD distribution | 167.9 | 31 |
10 | Alexa100k-stable | 5073.1 | 5 | 30 | # ASN (FQDN) | 129.9 | 41 | 48 | Mean distribution | 177.7 | 28 |
11 | Alexa100k-fall | 2977.5 | 7 | 31 | # ASN (3LD) | 351.8 | 10 | 49 | Median distribution | 85.6 | 42 |
12 | Alexa100k-rise | 330.1 | 11 | 32 | # ASN (2LD) | 215.1 | 19 | 50 | SD distribution | 174.4 | 30 |
13 | Alexa1M-null | 3503.1 | 6 | 33 | # Registries (FQDN) | 35.8 | 52 | 51 | # TLDs | 157.7 | 34 |
14 | Alexa1M-stable | 8255.9 | 3 | 34 | # Registries (3LD) | 49.3 | 49 | 52 | Ratio of .com | 242.9 | 17 |
15 | Alexa1M-fall | 1068.8 | 9 | 35 | # Registries (2LD) | 74.2 | 45 | 53 | Mean distribution | 252.1 | 16 |
16 | Alexa1M-rise | 60.6 | 47 | 36 | # Dates (FQDN) | 141.5 | 38 | 54 | Median distribution | 140.5 | 39 |
17 | hpHosts-null | 17248.9 | 2 | 37 | # Dates (3LD) | 309.0 | 14 | 55 | SD distribution | 234.6 | 18 |
18 | hpHosts-stable | 5404.1 | 4 | 38 | # Dates (2LD) | 203.0 | 24 | ||||
19 | hpHosts-fall | 2861.2 | 8 | ||||||||
20 | hpHosts-rise | 79.0 | 43 |
4.6 Effectiveness of each feature
Dataset | # of FQDNs in test set | # of Queried FQDNs | # of Answered FQDNs |
---|---|---|---|
Honeyclient-exploit | 537 | 537 | 534 |
Honeyclient-malware | 68 | 68 | 67 |
Sandbox-malware | 775 | 775 | 610 |
Sandbox-C&C | 8473 | 8473 | 6079 |
Pro-C&C | 97 | 97 | 91 |
Pro-phishing | 78,221 | 50 | 49 |
Total | 88,171 | 10,000 | 7429 |
4.7 Effectiveness of our temporal variation patterns
.xyz and .solutions. This is because these domain names are less likely to be within Alexa1M.

84c7zq.example.com.

14c2c5h8[masked].yr7w2[masked].com. We observed that this type of 2LD will be continuously used for a while by attackers to create many subdomain names. The other is domain names under free subdomain name services, which offer subdomain name creation under 2LD parts, such as .flu.cc and .co.nr. These services are easily abused by attackers for creating distinct domain names.

4.8 Lifespan of detected domain names
Dataset | # of Answered FQDNs | # of Parking FQDNs | # of Sinkhole FQDNs |
---|---|---|---|
Honeyclient-exploit | 534 | 3 | 0 |
Honeyclient-malware | 67 | 2 | 0 |
Sandbox-malware | 610 | 82 | 0 |
Sandbox-C&C | 6079 | 600 | 72 |
Pro-C&C | 91 | 12 | 6 |
Pro-phishing | 49 | 12 | 0 |
Total | 7429 | 711 | 78 |
Rank | Malware family | # Blocked FQDNs | # Blocked samples | First submission date |
---|---|---|---|---|
1 | bladabindi | 33 | 2 | 2016/01/01–2016/01/03 |
2 | gamarue | 16 | 12 | 2015/10/08–2016/07/29 |
3 | banload | 12 | 23 | 2015/10/19–2016/08/14 |
4 | ramnit | 10 | 50 | 2015/10/14–2016/09/15 |
5 | zusy | 8 | 526 | 2015/10/08–2016/07/16 |
6 | zbot | 7 | 431 | 2015/10/08–2016/09/11 |
7 | upatre | 5 | 1406 | 2015/10/08–2016/09/15 |
8 | delf | 5 | 637 | 2015/10/09–2016/09/01 |
9 | badur | 5 | 6 | 2015/12/14–2016/08/14 |
10 | ymeta | 5 | 3 | 2015/10/11–2016/03/05 |
11 | yakes | 4 | 303 | 2015/10/08–2016/09/12 |
12 | barys | 3 | 7 | 2015/12/07–2016/04/14 |
13 | bayrob | 2 | 3208 | 2015/10/12–2016/09/15 |
14 | bublik | 2 | 917 | 2015/10/08–2016/09/15 |
15 | soxgrave | 2 | 6 | 2016/05/18–2016/05/24 |
Total | | 119 | 7537 | |
4.9 Defending against malware activities
Rank | Malware family | # Blocked FQDNs | # Blocked samples | First submission date |
---|---|---|---|---|
1 | bayrob | 218 | 3208 | 2015/10/12–2016/09/15 |
2 | upatre | 129 | 1406 | 2015/10/08–2016/09/15 |
3 | banload | 127 | 23 | 2015/10/19–2016/08/14 |
4 | zbot | 110 | 431 | 2015/10/08–2016/09/11 |
5 | delf | 105 | 637 | 2015/10/09–2016/09/15 |
6 | tinba | 89 | 272 | 2015/10/08–2016/08/23 |
7 | zusy | 88 | 526 | 2015/10/08–2016/07/16 |
8 | badur | 86 | 6 | 2015/12/14–2016/08/14 |
9 | sality | 83 | 124 | 2015/10/12–2016/08/03 |
10 | gamarue | 67 | 12 | 2015/10/08–2016/07/29 |
11 | ramnit | 56 | 50 | 2015/10/14–2016/09/15 |
12 | scar | 42 | 71 | 2015/11/16–2016/06/10 |
13 | barys | 40 | 7 | 2015/12/07–2016/04/14 |
14 | bicololo | 37 | 150 | 2015/10/10–2016/03/01 |
15 | crowti | 36 | 10 | 2015/11/13–2015/11/19 |
Total | | 1313 | 6933 | |
5 Discussion
5.1 Evading DomainProfiler
5.2 DNS-based blocking
6 Related work
6.1 Lexical/linguistic approach
6.2 User-centric approach
6.3 Historic relationship approach
.xyz and .top) have been in use since October 2013. The number of such new gTLDs was 1,184 as of September 2016 [22]. Attackers also leverage new gTLDs for their cyber-attacks. For example, Halvorson et al. [20] showed that domain names using new gTLDs are twice as likely to appear on blacklists; this means attackers now actively make use of new gTLDs. To keep up with this situation, PREDATOR needs real-time access to highly confidential data inside each new gTLD's registry. Although the concept of PREDATOR resembles that of DomainProfiler, their mechanisms are totally different because our system does not require any data owned exclusively by a registrar, registry, or authoritative name server.