Published in: Social Network Analysis and Mining, Issue 1/2020

1 December 2020 | Original Article

Enhancing data quality in real-time threat intelligence systems using machine learning

Written by: Ariel Rodriguez, Koji Okamura

Abstract

In this research, we aim to expand the utility of keyword filtering on text-based data in the domain of cyber threat intelligence. Existing cyber threat intelligence systems, both research-based and in production, often use keyword filtering either to obtain training data for a classification model or as a classifier in itself. This method is known to produce false positives that degrade data quality and can therefore cause downstream issues for the security analysts who use these systems. We propose a method to classify open-source intelligence data into a cybersecurity-related information stream and subsequently increase the quality of that stream using an unsupervised clustering method. Our method expands on keyword filtering techniques by introducing a word2vec-generated list of associated words, which assists in the classification of ambiguous posts to reduce false positives while still retrieving a broad scope of data. We then use k-means clustering on positively classified entries to identify and remove clusters that are not relevant to threats. We further explore this method by investigating the effects of segmenting the data based on its characteristics to achieve better classification. Together, these methods create a higher-quality cyber threat-related data stream that can be applied to existing text-based threat intelligence systems that use keyword filtering.
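
To make the classification step more concrete, the following is a minimal Python sketch of keyword filtering extended with an associated words list. It is an illustration under stated assumptions, not the authors' implementation: the three word lists, the function names, and the `min_associated` threshold are hypothetical, and the associated words are hand-written here, whereas the method described above generates them with word2vec. The split into unambiguous and ambiguous keywords is likewise an assumption made for this example.

```python
import re

# All three lists are hypothetical and hand-written for illustration.
# In the described method, the associated words list is generated with word2vec.
UNAMBIGUOUS_KEYWORDS = {"ransomware", "malware", "phishing"}
AMBIGUOUS_KEYWORDS = {"virus", "worm", "trojan"}
ASSOCIATED_WORDS = {"payload", "infected", "server", "attacker", "exploit", "patch"}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_cyber_related(text, min_associated=1):
    """Classify one post as cybersecurity-related (True) or not (False)."""
    tokens = tokenize(text)
    # A keyword assumed to be unambiguous is accepted on its own.
    if tokens & UNAMBIGUOUS_KEYWORDS:
        return True
    # An ambiguous keyword needs supporting context from the associated words.
    if tokens & AMBIGUOUS_KEYWORDS:
        return len(tokens & ASSOCIATED_WORDS) >= min_associated
    # No keyword at all: plain keyword filtering would also reject this post.
    return False


posts = [
    "Ransomware hits hospital network",
    "New worm spreads via unpatched server",
    "Found a worm in my garden soil",
    "Lovely weather today",
]
cyber_stream = [p for p in posts if is_cyber_related(p)]
print(cyber_stream)
```

In this sketch, plain keyword filtering on "worm" would accept the gardening post as a false positive, while the associated words check rejects it. Raising `min_associated` would make the filter stricter at the cost of retrieving fewer posts.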

Metadata
Title
Enhancing data quality in real-time threat intelligence systems using machine learning
Written by
Ariel Rodriguez
Koji Okamura
Publication date
1 December 2020
Publisher
Springer Vienna
Published in
Social Network Analysis and Mining, Issue 1/2020
Print ISSN: 1869-5450
Electronic ISSN: 1869-5469
DOI
https://doi.org/10.1007/s13278-020-00707-x
