Published in: Soft Computing 12/2019

10-02-2018 | Methodologies and Application

Context-sensitive and keyword density-based supervised machine learning techniques for malicious webpage detection

Authors: Betul Altay, Tansel Dokeroglu, Ahmet Cosar

Abstract

Conventional malicious webpage detection methods use blacklists to decide whether a webpage is malicious. These blacklists are generally maintained by third-party organizations. However, keeping a list of all malicious websites and updating it regularly is not an easy task, given the frequently changing and rapidly growing number of webpages on the web. In this study, we propose a novel context-sensitive and keyword density-based method for classifying webpages using three supervised machine learning techniques: support vector machine, maximum entropy, and extreme learning machine. Features (words) of webpages are obtained from their HTML contents, and information is extracted using three feature extraction methods: existence of words, keyword frequencies, and keyword density. The performance of the proposed machine learning models is evaluated on a benchmark data set of one hundred thousand webpages. Experimental results show that the proposed method can detect malicious webpages with an accuracy of 98.24%, which is a significant improvement over state-of-the-art approaches.
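
The three feature representations named in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the tokenizer, the example vocabulary, the helper names (`tokenize`, `keyword_features`), and the definition of keyword density as a keyword's count divided by the total number of words on the page are all assumptions made for illustration. The context-sensitive part of the method is not modeled here.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def tokenize(html_text):
    """Lowercased alphanumeric tokens from the visible text of a page (assumed tokenizer)."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    return re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower())


def keyword_features(html_text, vocabulary):
    """Return (existence, frequency, density) vectors over a fixed vocabulary.

    Density is assumed here to be count / total words on the page;
    the abstract does not give the exact formula.
    """
    tokens = tokenize(html_text)
    counts = Counter(tokens)
    total = len(tokens)
    existence = [1 if counts[w] > 0 else 0 for w in vocabulary]
    frequency = [counts[w] for w in vocabulary]
    density = [counts[w] / total if total else 0.0 for w in vocabulary]
    return existence, frequency, density


# Hypothetical page and vocabulary, for illustration only.
page = (
    "<html><body><h1>Free prize</h1>"
    "<p>Claim your free prize now</p>"
    "<script>var x = 'free';</script></body></html>"
)
vocab = ["free", "prize", "login"]
existence, frequency, density = keyword_features(page, vocab)
```

Each of the three vectors could then be passed to a classifier such as a support vector machine. Existence and frequency ignore page length, whereas the density vector normalizes by it, so a keyword repeated twice on a short page weighs more than the same two occurrences on a long one.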


Metadata
Title: Context-sensitive and keyword density-based supervised machine learning techniques for malicious webpage detection
Authors: Betul Altay, Tansel Dokeroglu, Ahmet Cosar
Publication date: 10-02-2018
Publisher: Springer Berlin Heidelberg
Published in: Soft Computing, Issue 12/2019
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI: https://doi.org/10.1007/s00500-018-3066-4
