Background
Related works
The URL based phishing detection system
Feature extraction and analysis
- URL_Size: the number of characters in the URL. Phishing URLs are usually longer than legitimate ones.
- Number_of_Hyphens: the number of occurrences of the character ‘-’ in the URL. Legitimate websites rarely contain this character.
- Number_of_Dots: the number of occurrences of the character ‘.’ (dot) in the URL. For example, Number_of_Dots = 4 for the URL sub-domain2.sub-domain3.sub-domain4.mcomerce.com.
- Number_of_Numeric_Chars: the number of numeric characters in the URL. Domain names of legitimate websites generally contain no numeric characters.
- IP_presence: a binary feature equal to 1 when the URL contains an IP address and 0 otherwise.
- Similarity_index: a mathematically calculated distance measuring the difference between two data items (two strings in our case). The similarity is 100% for two identical words. Several algorithms and variants have been developed to measure it; among them we cite the most prevalent in this field: Levenshtein [15], Jaro Winkler [16], Normalized Levenshtein [17], longest common subsequence [18], Q-Gram [19], and Hamming [20].
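The five lexical features above can be computed with a few string operations. The following is a minimal Python sketch, not the authors' implementation: the function name `extract_features` and the IPv4 regular expression are assumptions introduced here for illustration, and only dotted-quad IPv4 addresses are detected (octet ranges are not validated).

```python
import re
import string

# Assumption: an "IP address" means a dotted-quad IPv4 literal such as 192.168.0.1.
# The pattern does not check that each octet is <= 255.
_IPV4 = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def extract_features(url: str) -> dict:
    """Return the five lexical features described in the list above."""
    return {
        "URL_Size": len(url),
        "Number_of_Hyphens": url.count("-"),
        "Number_of_Dots": url.count("."),
        # Count ASCII digits only.
        "Number_of_Numeric_Chars": sum(ch in string.digits for ch in url),
        "IP_presence": 1 if _IPV4.search(url) else 0,
    }


example = "sub-domain2.sub-domain3.sub-domain4.mcomerce.com"
print(extract_features(example)["Number_of_Dots"])  # 4, as in the example above
```

The similarity measures are left out of this sketch because they require a second string (the targeted legitimate domain) to compare against.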
Statistic | URL_size | NH | ND | NNC | IP | NL | L | JW | LCS | QG | H |
---|---|---|---|---|---|---|---|---|---|---|---|
Min legitimate | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
Min phishing | 4 | 0 | 1 | 0 | 0 | 0.1 | 1 | 0 | 2 | 2 | 1 |
Average legitimate | 12.175 | 0.025 | 1.156 | 0.075 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
Average phishing | 20.01 | 0.254 | 1.685 | 1.155 | 0.036 | 0.759 | 15.889 | 0.520 | 19.999 | 23.173 | 19.391 |
Max legitimate | 31 | 2 | 3 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
Max phishing | 202 | 8 | 14 | 37 | 1 | 1 | 192 | 0.959 | 192 | 192 | 192 |
- NH for Number_of_Hyphens
- ND for Number_of_Dots
- NNC for Number_of_Numeric_Chars
- IP for IP_presence
- L for the classic Levenshtein distance
- NL for the Normalized Levenshtein distance
- JW for the Jaro Winkler distance
- LCS for the longest common subsequence
- QG for the Q-Gram distance
- H for the Hamming distance
- Phishing URLs are on average about eight characters longer than legitimate ones (20.01 versus 12.175 characters).
- Phishing URLs may contain up to thirty-seven numeric characters, against at most four for legitimate websites.
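The figures in these observations are per-class summaries of a single feature. A minimal Python sketch of that computation is given below. The helper `summarize` is a hypothetical name, and the sample lengths are simply those of the nine URLs listed in the tests table, so the output does not reproduce the statistics of the full dataset.

```python
from statistics import mean


def summarize(values):
    """Return the minimum, average, and maximum of one feature for one class."""
    return {"min": min(values), "average": mean(values), "max": max(values)}


# Illustrative sample only: URL lengths of the rows in the tests table.
legitimate_lengths = [17, 9, 15, 12]
phishing_lengths = [19, 18, 33, 10, 27]

print("legitimate:", summarize(legitimate_lengths))
print("phishing:  ", summarize(phishing_lengths))
```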
Measure | NL (%) | L (%) | JW (%) | LCS (%) | QG (%) | H (%) |
---|---|---|---|---|---|---|
NL | 100 | 62 | 95 | 72 | 76 | 66 |
L | 62 | 100 | 52 | 98 | 96 | 98 |
JW | 95 | 52 | 100 | 63 | 68 | 58 |
LCS | 72 | 98 | 63 | 100 | 91 | 98 |
QG | 76 | 96 | 68 | 99 | 100 | 97 |
H | 66 | 98 | 58 | 98 | 97 | 100 |
Feature | URL_Size (%) | NH (%) | ND (%) | NNC (%) | IP (%) |
---|---|---|---|---|---|
URL_Size | 100 | 42 | 75 | 63 | −3 |
NH | 42 | 100 | 34 | 26 | −3 |
ND | 75 | 34 | 100 | 60 | 20 |
NNC | 63 | 26 | 60 | 100 | 44 |
IP | −3 | −3 | 20 | 44 | 100 |
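The text does not say which coefficient the percentages in the two tables above represent. Assuming they are Pearson correlation coefficients expressed in percent, a matrix of this kind could be produced as in the following sketch. The helpers `pearson` and `correlation_matrix_percent` are hypothetical names, and the sample columns contain only the five phishing rows of the tests table, so the resulting values are illustrative and will not match the tables.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


def correlation_matrix_percent(columns):
    """Map each pair of column names to its correlation, rounded to whole percent."""
    names = list(columns)
    return {
        a: {b: round(100 * pearson(columns[a], columns[b])) for b in names}
        for a in names
    }


# Illustrative sample only (assumption): the five phishing rows of the tests table.
sample = {
    "URL_Size": [19, 18, 33, 10, 27],
    "ND": [1, 2, 3, 1, 2],
    "NNC": [0, 0, 0, 0, 2],
}
for name, row in correlation_matrix_percent(sample).items():
    print(name, row)
```

A constant column (for example IP_presence when no URL in the sample contains an IP address) has no defined correlation, which is why the sketch raises an error instead of returning a value.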
Phishing detection system
Test of the system on the database
Tests
Site | Target | Length | @ | – | . | [0–9] | IP? | NL | L | JW | LCS | QG | H | State |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
kinsloglasswall.com | capitecbank.co.za | 19 | 0 | 0 | 1 | 0 | 0 | 0.89 | 17 | 0.45 | 26 | 30 | 18 | P |
zooplaneta.sumy.ua | capitecbank.co.za | 18 | 0 | 0 | 2 | 0 | 0 | 0.77 | 14 | 0.60 | 23 | 31 | 18 | P |
aussiehydrovac.staginghost.com.au | capitecbank.co.za | 33 | 0 | 0 | 3 | 0 | 0 | 0.75 | 25 | 0.55 | 28 | 44 | 31 | P |
guneva.net | capitecbank.co.za | 10 | 0 | 0 | 1 | 0 | 0 | 0.82 | 14 | 0.46 | 21 | 25 | 17 | P |
reg-playiing.byethost11.com | zynga.com | 27 | 0 | 1 | 2 | 2 | 0 | 0.74 | 20 | 0.51 | 22 | 26 | 27 | P |
capitecbank.co.za | capitecbank.co.za | 17 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | L |
zynga.com | zynga.com | 9 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | L |
santander.co.uk | santander.co.uk | 15 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | L |
facebook.com | facebook.com | 12 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | L |
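Some of the distance columns of the table above can be illustrated with a short Python sketch. It is not the authors' code: the function names are hypothetical, the normalized Levenshtein distance is assumed to be the Levenshtein distance divided by the length of the longer string, the LCS distance is assumed to be |a| + |b| − 2·LCS(a, b), and the Hamming distance is assumed to count the surplus characters of the longer string as mismatches. Under these assumptions the sketch agrees with the NL, L, LCS, and H values of the guneva.net row. Jaro Winkler and Q-Gram are omitted.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Assumption: Levenshtein distance divided by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance(a: str, b: str) -> int:
    """Assumption: LCS distance defined as |a| + |b| - 2 * LCS(a, b)."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def padded_hamming(a: str, b: str) -> int:
    """Assumption: position-wise mismatches plus the length difference."""
    return sum(ca != cb for ca, cb in zip(a, b)) + abs(len(a) - len(b))


site, target = "guneva.net", "capitecbank.co.za"
print(levenshtein(site, target))                       # 14
print(round(normalized_levenshtein(site, target), 2))  # 0.82
print(lcs_distance(site, target))                      # 21
print(padded_hamming(site, target))                    # 17
```

For a legitimate site compared with itself, every distance in this sketch is 0, which is consistent with the rows marked L in the table.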
Records count | 100 | 250 | 500 | 750 | 1000 | 1250 | 1500 | 1750 | 2000 |
---|---|---|---|---|---|---|---|---|---|
SVM (%) | 40 | 41.26 | 34.40 | 36.70 | 28 | 30.35 | 28.26 | 23.97 | 26 |
SVM—JW (%) | 24 | 19.04 | 17.60 | 16.48 | 12 | 12.77 | 8 | 12.32 | 12.20 |
SVM—H (%) | 8 | 11.11 | 5.60 | 2.12 | 4 | 2.55 | 5.60 | 5.02 | 4.20 |
SVM—LCS (%) | 12 | 9.52 | 4.80 | 2.65 | 3.20 | 3.19 | 2.66 | 3.42 | 5.20 |
SVM—L (%) | 0 | 4.76 | 3.20 | 1.06 | 2.80 | 2.23 | 2.93 | 5.93 | 5.60 |
SVM—NL (%) | 20 | 39.68 | 16 | 11.70 | 16.8 | 15.01 | 15.73 | 14.84 | 13.4 |
SVM—QG (%) | 0 | 7.93 | 2.40 | 2.65 | 4.80 | 5.75 | 3.20 | 4.10 | 5 |
Records count | 100 | 250 | 500 | 750 | 1000 | 1250 | 1500 | 1750 | 2000 |
---|---|---|---|---|---|---|---|---|---|
Bayes (%) | 52 | 47.61 | 40.80 | 28.72 | 33.20 | 41.85 | 33.86 | 33.10 | 38.80 |
Bayes—JW (%) | 64 | 36.50 | 52 | 29.78 | 38 | 38.01 | 36.80 | 38.12 | 39.20 |
Bayes—H (%) | 4 | 7.93 | 2.40 | 3.72 | 2 | 38.01 | 36 | 36.98 | 36.80 |
Bayes—LCS (%) | 36 | 6.34 | 6.40 | 2.65 | 0.4 | 34.82 | 35.46 | 36.30 | 39.60 |
Bayes—L (%) | 16 | 6.34 | 5.60 | 2.12 | 7.20 | 0.63 | 6.66 | 35.38 | 37.6 |
Bayes—NL (%) | 32 | 36.50 | 32.8 | 35.10 | 31.60 | 39.29 | 34.13 | 37.67 | 36.4 |
Bayes—QG (%) | 24 | 6.34 | 2.40 | 2.12 | 42.4 | 36.42 | 39.20 | 36.98 | 34.4 |
Records count | 100 | 250 | 500 | 750 | 1000 | 1250 | 1500 | 1750 | 2000 |
---|---|---|---|---|---|---|---|---|---|
Naive bayes (%) | 52 | 53.96 | 36.8 | 36.70 | 34.40 | 36.74 | 33.60 | 38.12 | 35.8 |
Naive bayes—JW (%) | 72 | 26.98 | 47.20 | 37.76 | 39.20 | 37.69 | 28.8 | 35.15 | 37.6 |
Naive bayes—H (%) | 16 | 9.52 | 8 | 1.59 | 5.20 | 38.97 | 35.46 | 36.98 | 36 |
Naive bayes—LCS (%) | 20 | 11.11 | 4.80 | 3.19 | 0.4 | 37.06 | 32.80 | 36.75 | 40.6 |
Naive bayes—L (%) | 16 | 19.04 | 2.40 | 2.12 | 4 | 10.22 | 5.86 | 36.75 | 35 |
Naive bayes—NL (%) | 44 | 50.79 | 40.80 | 27.12 | 36.40 | 37.06 | 39.2 | 25.79 | 32.80 |
Naive bayes—QG (%) | 12 | 6.34 | 1.60 | 4.78 | 35.2 | 35.14 | 41.06 | 36.07 | 34.8 |
Records count | 100 | 250 | 500 | 750 | 1000 | 1250 | 1500 | 1750 | 2000 |
---|---|---|---|---|---|---|---|---|---|
PNN (%) | 64 | 57.14 | 51.2 | 49.46 | 46 | 53.03 | 49.60 | 52.73 | 48.8 |
PNN—JW (%) | 56 | 57.14 | 54.4 | 47.34 | 52.8 | 49.2 | 52 | 49.31 | 49 |
PNN—H (%) | 96 | 84.12 | 92 | 100 | 99.2 | 96.80 | 97.86 | 99.31 | 98.40 |
PNN—LCS (%) | 96 | 88.88 | 96.8 | 100 | 99.2 | 97.12 | 98.66 | 97.03 | 97.8 |
PNN—L (%) | 92 | 95.23 | 97.6 | 100 | 98.8 | 100 | 99.73 | 98.40 | 98 |
PNN—NL (%) | 72 | 49.20 | 64 | 52.65 | 52.4 | 53.03 | 52.53 | 52.05 | 51.80 |
PNN—QG (%) | 100 | 92.06 | 94.40 | 100 | 98 | 87.85 | 96.80 | 98.17 | 98.40 |