
2016 | OriginalPaper | Chapter

Text Mining in Social Media for Security Threats

Abstract

We discuss techniques for information extraction from texts, and present two applications that use these techniques. We focus in particular on social media texts (Twitter messages), which present challenges for the information extraction techniques because they are noisy and short. The first application is extracting the locations mentioned in Twitter messages, and the second one is detecting the location of the users based on all the tweets written by each user. The same techniques can be used for extracting other kinds of information from social media texts, with the purpose of monitoring the topics, events, emotions, or locations of interest to security and defence applications.


Footnotes
1
[25] recently released a dataset of various kinds of social media data annotated with generic location expressions, but not with cities, states/provinces, and countries.
 
4
The number of countries is larger than 200 because alternative names are counted; the same for states/provinces and cities.
 
8
Explained in Sect. 5.2.
 
9
Not all of these 5000 n-grams are necessarily good location indicators. We do not distinguish them manually; a trained machine learning model should be able to do so.
 
10
We also tried an alternative loss function, defined as the average squared error of the output numbers, which is equivalent (up to a constant factor) to the average squared Euclidean distance between the estimated location and the true location; this alternative model did not perform well.
 
14
We are unable to conduct t-tests on the Eisenstein models, because the detailed results produced by these models are not available.
 
15
We are unable to conduct t-tests on the other models, because the detailed results produced by these models are not available.
 
16
The author reported only this metric for the top 3% features configuration.
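
The alternative loss mentioned in footnote 10 can be illustrated with a minimal sketch. It assumes each model output is a (latitude, longitude) pair in degrees; the function names are hypothetical and are not taken from the chapter. Under these assumptions, averaging the squared error over the two output numbers gives half the mean squared Euclidean distance in coordinate space.

```python
import math


def mean_squared_error(pred, true):
    """Average squared error over all output numbers (latitude and longitude)."""
    total = 0.0
    count = 0
    for (plat, plon), (tlat, tlon) in zip(pred, true):
        total += (plat - tlat) ** 2 + (plon - tlon) ** 2
        count += 2  # two output numbers per example
    return total / count


def mean_squared_euclidean(pred, true):
    """Average squared Euclidean distance between predicted and true points."""
    total = 0.0
    for (plat, plon), (tlat, tlon) in zip(pred, true):
        total += math.hypot(plat - tlat, plon - tlon) ** 2
    return total / len(pred)


# Illustrative coordinates only (not data from the chapter).
pred = [(45.0, -75.0), (40.0, -74.0)]
true = [(45.5, -75.5), (41.0, -73.0)]

mse = mean_squared_error(pred, true)
msd = mean_squared_euclidean(pred, true)
# The two quantities differ only by the constant factor 2.
print(mse, msd, math.isclose(2 * mse, msd))
```

Note that Euclidean distance in degree space is not a geographic distance; a great-circle formula such as the haversine would be needed to measure error in kilometres.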
 
Literature
2.
Aman, S., Szpakowicz, S.: Identifying expressions of emotion in text. In: Text, Speech and Dialogue, pp. 196–205. Springer (2007)
6.
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: a CPU and GPU math expression compiler. In: Proceedings of the Python for Scientific Computing Conference (SciPy), vol. 4, p. 3 (2010)
7.
Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. Knowl. Data Eng. IEEE Trans. 18(10), 1411–1428 (2006)
8.
Cohen, W.W.: Minorthird: methods for identifying names and ontological relations in text using heuristics for inducing regularities from data (2004)
10.
Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: A framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02). Association for Computational Linguistics (2002)
11.
Domingos, P.: A few useful things to know about machine learning. Commun. ACM 55(10), 78–87 (2012)
12.
Eisenstein, J., Ahmed, A., Xing, E.P.: Sparse additive generative models of text. In: Proceedings of the 28th International Conference on Machine Learning (ICML’11), pp. 1041–1048 (2011)
13.
Eisenstein, J., O’Connor, B., Smith, N.A., Xing, E.P.: A latent variable model for geographic lexical variation. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 1277–1287. ACL (2010)
14.
Ghazi, D., Inkpen, D., Szpakowicz, S.: Hierarchical versus flat classification of emotions in text. In: Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pp. 140–146. Association for Computational Linguistics, Los Angeles (June 2010). http://www.aclweb.org/anthology/W10-0217
16.
Glorot, X., Bordes, A., Bengio, Y.: Domain adaptation for large-scale sentiment classification: a deep learning approach. In: Proceedings of the 28th International Conference on Machine Learning (ICML’11), pp. 513–520 (2011)
20.
Inkpen, D., Liu, J., Farzindar, A., Kazemi, F., Ghazi, D.: Detecting and disambiguating locations in Twitter messages. In: Proceedings of the 16th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2015). Cairo, Egypt (2014)
21.
Keshtkar, F., Inkpen, D.: A hierarchical approach to mood classification in blogs. Nat. Lang. Eng. 18(1), 61–81 (2012)
22.
Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pp. 282–289. Morgan Kaufmann Publishers Inc., San Francisco (2001). http://dl.acm.org/citation.cfm?id=645530.655813
24.
Li, T., Zhang, Y., Sindhwani, V.: A non-negative matrix tri-factorization approach to sentiment classification with lexical prior knowledge. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 244–252. Association for Computational Linguistics (Aug 2009). http://dl.acm.org/citation.cfm?id=1687878.1687914
25.
Liu, F., Vasardani, M., Baldwin, T.: Automatic identification of locative expressions from social media text: A comparative analysis. In: Proceedings of the 4th International Workshop on Location and the Web, LocWeb ’14, pp. 9–16. ACM, New York (2014). http://doi.acm.org/10.1145/2663713.2664426
26.
Liu, J., Inkpen, D.: Estimating user locations on social media: a deep learning approach. Technical Report. University of Ottawa (2014)
28.
29.
Owoputi, O., O’Connor, B., Dyer, C., Gimpel, K., Schneider, N., Smith, N.A.: Improved part-of-speech tagging for online conversational text with word clusters. In: Proceedings of NAACL-HLT, pp. 380–390 (2013)
30.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
31.
32.
Razavi, A.H., Inkpen, D., Brusilovsky, D., Bogouslavski, L.: General topic annotation in social networks: A latent Dirichlet allocation approach. In: Advances in Artificial Intelligence, Lecture Notes in Computer Science, vol. 7884, pp. 293–300. Springer, Berlin (2013). http://dx.doi.org/10.1007/978-3-642-38457-8_29
33.
Razavi, A.H., Inkpen, D., Falcon, R., Abielmona, R.: Textual risk mining for maritime situational awareness. In: 2014 IEEE International Inter-Disciplinary Conference on Cognitive Methods in Situation Awareness and Decision Support (CogSIMA), pp. 167–173. IEEE (2014)
34.
Roller, S., Speriosu, M., Rallapalli, S., Wing, B., Baldridge, J.: Supervised text-based geolocation using language models on an adaptive grid. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1500–1510. Association for Computational Linguistics (Jul 2012). http://dl.acm.org/citation.cfm?id=2390948.2391120
35.
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. Technical Report. DTIC Document (1985)
36.
Sarawagi, S., Cohen, W.W.: Semi-Markov conditional random fields for information extraction. NIPS 17, 1185–1192 (2004)
37.
Sinnott, R.W.: Virtues of the haversine. Sky Telesc. 68, 158 (1984)
40.
Wing, B.P., Baldridge, J.: Simple supervised document geolocation with geodesic grids. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL HLT ’11), pp. 955–964. Association for Computational Linguistics (June 2011). http://dl.acm.org/citation.cfm?id=2002472.2002593
41.
Metadata
Title
Text Mining in Social Media for Security Threats
Author
Diana Inkpen
Copyright Year
2016
DOI
https://doi.org/10.1007/978-3-319-26450-9_19
