Top

Social Network Analysis and Mining

Published in:

01-12-2020 | Original Article

Enhancing data quality in real-time threat intelligence systems using machine learning

Authors: Ariel Rodriguez, Koji Okamura

Published in: Social Network Analysis and Mining | Issue 1/2020

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

In this research, we aim to expand the utility of keyword filtering on text-based data in the domain of cyber threat intelligence. Existing research-based cyber threat intelligence systems and production systems often utilize keyword filtering as a method to obtain training data for a classification model or as a classifier in itself. This method is known to have concerns with false-positives that affect data quality and thus can produce downstream issues for security analysts that utilize these types of systems. We propose a method to classify open-source intelligence data into a cybersecurity-related information stream and subsequently increase the quality of that stream using an unsupervised clustering method. Our method expands on keyword filtering techniques by introducing a word2vec generated associated words list which assists in the classification of ambiguous posts to reduce false-positives while still retrieving large scope data. We then use k-means clustering on positively classified entries to identify and remove clusters that are not relevant to threats. We further explore this method by investigating the effects of using segmentation based on data characteristics to achieve better classification. Together these methods are able to create a higher quality cyber threat-related data stream that can be applied to existing text-based threat intelligence systems that use keyword filtering methods.

previous article A framework for quantifying controversy of social network debates using attributed networks: biased random walk (BRW)

next article Elite versus mass polarization on the Brazilian impeachment proceedings of 2016

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

https://github.com/Ebiquity/Unified-Cybersecurity-Ontology.

https://nvd.nist.gov/.

https://twitter.com/.

https://www.reddit.com/.

https://stackexchange.com/.

https://www.welivesecurity.com/.

(2013) Selenium documentation¶. https://www.selenium.dev/selenium/docs/api/py/api.html

(2018) ”cisco 2018 annual cybersecurity report”. Technical report, ”Cisco Systems”, https://www.cisco.com/c/dam/m/hu_hu/campaigns/security-hub/pdf/acr-2018.pdf

(2019) ”the impact of security alert overload”. Technical report, ”CriticalStart”, https://www.criticalstart.com/wp-content/uploads/CS_MDR_Survey_Report.pdf

(2020) ”the economics of security operations centers: What is the true cost for effective results?”. Technical report, ”Ponemon Institute LLC sponsored by Respond Software”

Alves F, Bettini A, Ferreira PM, Bessani A (2019) Processing tweets for cybersecurity threat awareness. arXiv preprint arXiv:190402072

Baumgartner J, Zannettou S, Keegan B, Squire M, Blackburn J (2020) The pushshift reddit dataset. arXiv preprint arXiv:200108435

Behzadan V, Aguirre C, Bose A, Hsu W (2018) Corpus and deep learning classifier for collection of cyber threat indicators in twitter stream. In: 2018 IEEE international conference on big data (Big Data), IEEE, pp 5002–5007

Botes F, Leenen L, De La Harpe R (2017) Ant colony induced decision trees for intrusion detection. In: 16th European conference on cyber warfare and security, ACPI, pp 53–62

Caragea C, Silvescu A, Tapia AH (2016) Identifying informative messages in disaster events using convolutional neural networks. In: International conference on information systems for crisis response and management, pp 137–147

Concone F, De Paola A, Re GL, Morana M (2017) Twitter analysis for real-time malware discovery. In: 2017 AEIT international annual conference, IEEE, pp 1–6

Dionísio N, Alves F, Ferreira PM, Bessani A (2019) Cyberthreat detection from twitter using deep neural networks. arXiv preprint arXiv:190401127

ESET (2020) Welivesecurity. https://www.welivesecurity.com/

Exchange S (2019) The stack exchange data explorer. Online, http://datastackexchangecom/ Accessed September

Fink GA, North CL, Endert A, Rose S (2009) Visualizing cyber security: Usable workspaces. In: 2009 6th international workshop on visualization for cyber security, IEEE, pp 45–56

Hariharan A, Gupta A, Pal T (2020) Camlpad: Cybersecurity autonomous machine learning platform for anomaly detection. In: Future of information and communication conference, Springer, pp 705–720

Horawalavithana S, Bhattacharjee A, Liu R, Choudhury N, O Hall L, Iamnitchi A (2019) Mentions of security vulnerabilities on reddit, twitter and github. In: IEEE/WIC/ACM international conference on web intelligence, pp 200–207

Kaggle (2019) All the news. Online, https://wwwkagglecom/snapcrack/all-the-news Accessed September

Khatua A, Khatua A, Cambria E (2019) A tale of two epidemics: Contextual word2vec for classifying twitter streams during outbreaks. Inf Process Manag 56(1):247–257CrossRef

Khurana N, Mittal S, Piplai A, Joshi A (2019) Preventing poisoning attacks on ai based threat intelligence systems. In: 2019 IEEE 29th international workshop on machine learning for signal processing (MLSP), IEEE, pp 1–6

Le BD, Wang G, Nasim M, Babar MA (2019) Gathering cyber threat intelligence from twitter using novelty classification. In: 2019 International conference on cyberworlds (CW), IEEE, pp 316–323

Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: International conference on machine learning, pp 1188–1196

Le Sceller Q, Karbab EB, Debbabi M, Iqbal F (2017) Sonar: Automatic detection of cyber security events over the twitter stream. In: Proceedings of the 12th international conference on availability, Reliability and Security, ACM, p 23

Lee KC, Hsieh CH, Wei LJ, Mao CH, Dai JH, Kuang YT (2017) Sec-buzzer: cyber security emerging topic mining with open threat intelligence retrieval and timeline event annotation. Soft Comput 21(11):2883–2896CrossRef

Liu Y, Yao X (1999) Ensemble learning via negative correlation. Neural Netw 12(10):1399–1404CrossRef

Mendsaikhan O, Hasegawa H, Yamaguchi Y, Shimada H (2019) Identification of cybersecurity specific content using the doc2vec language model. In: 2019 IEEE 43rd annual computer software and applications conference (COMPSAC), vol 1, pp 396–401

Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word representations in vector space. arXiv preprint arXiv:13013781

Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013b) Distributed representations of words and phrases and their compositionality. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ (eds) Advances in neural information processing systems. Curran Associates, Inc., pp 3111–3119. PMID: 903

Miller ST, Busby-Earle C (2017) Multi-perspective machine learning a classifier ensemble method for intrusion detection. In: Proceedings of the 2017 international conference on machine learning and soft computing, pp 7–12

Mittal S, Das PK, Mulwad V, Joshi A, Finin T (2016) Cybertwitter: Using twitter to generate alerts for cybersecurity threats and vulnerabilities. In: Proceedings of the 2016 IEEE/ACM international conference on advances in social networks analysis and Mining, IEEE Press, pp 860–867

Moustafa N, Slay J (2015) Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In: 2015 military communications and information systems conference (MilCIS), IEEE, pp 1–6

News TH (2019) Cybersecurity news and analysis. https://thehackernews.com/

Nunes E, Diab A, Gunn A, Marin E, Mishra V, Paliath V, Robertson J, Shakarian J, Thart A, Shakarian P (2016) Darknet and deepnet mining for proactive cybersecurity threat intelligence. In: 2016 IEEE conference on intelligence and security informatics (ISI), IEEE, pp 7–12

Palshikar GK, Apte M, Pandita D (2017) Weakly supervised classification of tweets for disaster management. In: SMERP@ ECIR, pp 4–13

Rao A, Spasojevic N (2016) Actionable and political text classification using word embeddings and lstm. arXiv preprint arXiv:160702501

Rehurek R, Sojka P (2011) Gensim—-statistical semantics in python. statistical semantics; gensim; Python; LDA; SVD

Rodriguez A, Okamura K (2020) Cybersecurity text data classification and optimization for cti systems. In: Workshops of the international conference on advanced information networking and applications, Springer, pp 410–419

Roesslein J (2009) tweepy documentation. Online] http://tweepy.readthedocs.io/en/v3.5

Samtani S, Chinn R, Chen H, Nunamaker JF Jr (2017) Exploring emerging hacker assets and key hackers for proactive cyber threat intelligence. J Manag Inf Syst 34(4):1023–1053CrossRef

Shin HS, Kwon HY, Ryu SJ (2020) A new text classification model based on contrastive word embedding for detecting cybersecurity intelligence in twitter. Electronics 9(9):1527CrossRef

Shrestha Chitrakar A, Petrović S (2019) Efficient k-means using triangle inequality on spark for cyber security analytics. In: Proceedings of the ACM international workshop on security and privacy analytics, pp 37–45

Tripathy B, Thakur S, Chowdhury R (2017) A classification model to analyze the spread and emerging trends of the zika virus in twitter. In: Behera H, Mohapatra D (eds) Advances in intelligent systems and computing, 1st edn, chap 61. Springer Nature Singapore Pte Ltd., pp 643–650

Vasudevan A, Harshini E, Selvakumar S (2011) Ssenet-2011: a network intrusion detection system dataset and its comparison with kdd cup 99 dataset. In: 2011 second asian himalayas international conference on internet (AH-ICI), IEEE, pp 1–5

Zhang F, Stromer-Galley J, Tanupabrungsun S, Hegde Y, McCracken N, Hemsley J (2017) Understanding discourse acts: Political campaign messages classification on facebook and twitter. In: International conference on social computing. Springer, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation, pp 242–247

Zhang J, Chen X, Xiang Y, Zhou W, Wu J (2015) Robust network traffic classification. IEEE/ACM Trans Netw 23(4):1257–1270CrossRef

Zhou Y, Wang P (2019) An ensemble learning approach for xss attack detection with domain knowledge and threat intelligence. Comput Secur 82:261–269CrossRef

Zhou Y, Cheng G, Jiang S, Dai M (2019a) An efficient intrusion detection system based on feature selection and ensemble classifier. arXiv preprint arXiv:190401352

Zong S, Ritter A, Mueller G, Wright E (2019b) Analyzing the perceived severity of cybersecurity threats reported on social media. arXiv preprint arXiv:190210680

Title: Enhancing data quality in real-time threat intelligence systems using machine learning
Authors: Ariel Rodriguez
Koji Okamura
Publication date: 01-12-2020
Publisher: Springer Vienna
Published in: Social Network Analysis and Mining / Issue 1/2020
Print ISSN: 1869-5450
Electronic ISSN: 1869-5469
DOI: https://doi.org/10.1007/s13278-020-00707-x

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Other articles of this Issue 1/2020

FauxWard: a graph neural network approach to fauxtography detection using social media comments

Hiding Your Face Is Not Enough: user identity linkage with image recognition

Tweets can tell: activity recognition using hybrid gated recurrent neural networks

The dynamics of polarized beliefs in networks governed by viral diffusion and media influence

Correction to: Diachronic equivalence: an examination of the international news network

Towards establishing the effect of self-similarity on influence maximization in online social networks

Premium Partner