Published in: Soft Computing 22/2019

09.01.2019 | Methodologies and Application

An efficient character recognition method using enhanced HOG for spam image detection

Written by: Fatemeh Naiemi, Vahid Ghods, Hassan Khalesi

Abstract

A spam image is an unsolicited message sent electronically to a large group of arbitrary addresses. Because they are attractive to recipients and harder to detect, spam images are the most complicated type of spam. One way to counter spam images is optical character recognition (OCR). This paper proposes an enhanced histogram of oriented gradients (HOG) feature extraction method that makes the OCR system for spam both robust to variations in character scale and translation and computationally inexpensive. To this end, two steps, cropping the image and normalizing the input image size, are added to the pre-processing stage. A support vector machine (SVM) is used for classification. Two heuristic modifications are also proposed to increase recognition accuracy: thickening thin characters in the pre-processing stage, and not distinguishing between uppercase and lowercase letters with the same shape in the classification stage. In the first modification, when all pixels of the output image are empty (the character has been eliminated), the original image is made thicker by one layer. In the second, uppercase and lowercase letters with the same shape are treated as the same class during recognition. The modified HOG method with both heuristic modifications achieves an average recognition accuracy of 91.61% on the Char74K database. An optimum classification threshold was then investigated using the ROC curve: the optimal cutoff point was 0.736, which gave the highest average accuracy, 94.20%, and the area under the curve (AUC) for the ROC and precision–recall (PR) curves was 0.96 and 0.73, respectively. The proposed method was also evaluated on the ICDAR2003 database, where the average accuracy and its optimum using the ROC curve were 82.73% and 86.01%, respectively. These recognition accuracy and AUC results show a clear improvement over the best recognition rates of previous methods.
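The pre-processing steps and the two heuristics described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the nearest-neighbour resizing, the 4-neighbour thickening, and the set of same-shape letters are all assumptions chosen to keep the example small and dependency-free. HOG extraction and SVM training are omitted.

```python
# Illustrative sketch only. All names below are hypothetical, and the
# resizing and thickening rules are simplifying assumptions.

def crop_to_content(img):
    """Crop a binary image (list of rows of 0/1) to its foreground bounding box."""
    rows = [r for r, row in enumerate(img) if any(row)]
    cols = [c for c in range(len(img[0])) if any(row[c] for row in img)]
    if not rows:  # no foreground pixels: return an unchanged copy
        return [row[:] for row in img]
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


def resize_nearest(img, size):
    """Normalize to size x size with nearest-neighbour sampling (an assumption)."""
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


def thicken(img):
    """Add one layer of foreground pixels around each stroke (4-neighbour dilation)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for r in range(h):
        for c in range(w):
            if img[r][c]:
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        out[rr][cc] = 1
    return out


def preprocess(img, size):
    """Crop and normalize; if the character vanished, thicken the original and retry."""
    out = resize_nearest(crop_to_content(img), size)
    if not any(any(row) for row in out):
        out = resize_nearest(crop_to_content(thicken(img)), size)
    return out


# Hypothetical set of letters whose upper- and lowercase shapes match;
# the abstract does not list which letters the authors merged.
SAME_SHAPE = set("cosvwxz")


def merge_case(label):
    """Map same-shape uppercase and lowercase letters to one class label."""
    return label.lower() if label.lower() in SAME_SHAPE else label
```

In this sketch, a one-pixel-wide diagonal stroke in an 8 x 8 image disappears when sampled down to 4 x 4, so `preprocess` falls back to the thickened original. `merge_case("C")` and `merge_case("c")` return the same label, while `merge_case("A")` stays unchanged.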

Metadata
Title
An efficient character recognition method using enhanced HOG for spam image detection
Written by
Fatemeh Naiemi
Vahid Ghods
Hassan Khalesi
Publication date
09.01.2019
Publisher
Springer Berlin Heidelberg
Published in
Soft Computing / Issue 22/2019
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI
https://doi.org/10.1007/s00500-018-03728-z
