Published in: Social Network Analysis and Mining 1/2020

01-12-2020 | Original Article

Enhancing data quality in real-time threat intelligence systems using machine learning

Authors: Ariel Rodriguez, Koji Okamura

Abstract

In this research, we aim to expand the utility of keyword filtering on text-based data in the domain of cyber threat intelligence. Existing cyber threat intelligence systems, both research prototypes and production systems, often use keyword filtering either to obtain training data for a classification model or as a classifier in itself. This method is known to produce false positives, which degrade data quality and can create downstream problems for the security analysts who rely on these systems. We propose a method that classifies open-source intelligence data into a cybersecurity-related information stream and then increases the quality of that stream using unsupervised clustering. Our method extends keyword filtering with a word2vec-generated list of associated words, which helps classify ambiguous posts and so reduces false positives while still retrieving data of broad scope. We then apply k-means clustering to the positively classified entries to identify and remove clusters that are not relevant to threats. We further explore the method by investigating whether segmenting the data according to its characteristics improves classification. Together, these methods produce a higher-quality cyber threat-related data stream that can be applied to existing text-based threat intelligence systems that rely on keyword filtering.
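
The two stages described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the keyword set, the associated-words set (hand-written here as a stand-in for a word2vec-generated list), the `min_associated` threshold, the bag-of-words vectors, and all function names are hypothetical. The k-means routine is a plain Lloyd's iteration seeded with the first k points, chosen only to keep the example deterministic and dependency-free.

```python
# Illustrative sketch only: word lists, thresholds, and function names are
# hypothetical and are not taken from the paper.

# Seed keywords that can be ambiguous on their own ("attack", "virus", ...).
KEYWORDS = {"attack", "virus", "exploit", "breach"}
# Stand-in for a word2vec-generated associated-words list (hand-written here).
ASSOCIATED = {"malware", "server", "patch", "password", "network", "ransomware"}

PUNCT = ".,!?:;\"'()"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(PUNCT).lower() for word in text.split())
    return [t for t in tokens if t]


def is_security_related(text, min_associated=1):
    """Accept a post only if it has a keyword hit AND enough associated words."""
    tokens = set(tokenize(text))
    if not tokens & KEYWORDS:
        return False
    return len(tokens & ASSOCIATED) >= min_associated


def vectorize(posts):
    """Turn posts into bag-of-words count vectors over a shared vocabulary."""
    vocab = sorted({t for p in posts for t in tokenize(p)})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for p in posts:
        vec = [0.0] * len(vocab)
        for t in tokenize(p):
            vec[index[t]] += 1.0
        vectors.append(vec)
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    if k < 1 or k > len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def drop_clusters(posts, labels, irrelevant):
    """Remove posts whose cluster was judged not threat-related."""
    return [p for p, lab in zip(posts, labels) if lab not in irrelevant]
```

A possible usage, assuming `posts` is a list of strings: keep `positives = [p for p in posts if is_security_related(p)]`, compute `labels = kmeans(vectorize(positives), k)` for some chosen `k`, inspect each cluster, and pass the indices of the off-topic clusters to `drop_clusters`. How irrelevant clusters are identified is left to the analyst in this sketch.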

Metadata
Title
Enhancing data quality in real-time threat intelligence systems using machine learning
Authors
Ariel Rodriguez
Koji Okamura
Publication date
01-12-2020
Publisher
Springer Vienna
Published in
Social Network Analysis and Mining / Issue 1/2020
Print ISSN: 1869-5450
Electronic ISSN: 1869-5469
DOI
https://doi.org/10.1007/s13278-020-00707-x
