2016 | OriginalPaper | Chapter

Text Mining in Social Media for Security Threats

Author : Diana Inkpen

Published in: Recent Advances in Computational Intelligence in Defense and Security

Publisher: Springer International Publishing

Abstract

We discuss techniques for information extraction from texts, and present two applications that use these techniques. We focus in particular on social media texts (Twitter messages), which present challenges for the information extraction techniques because they are noisy and short. The first application is extracting the locations mentioned in Twitter messages, and the second one is detecting the location of the users based on all the tweets written by each user. The same techniques can be used for extracting other kinds of information from social media texts, with the purpose of monitoring the topics, events, emotions, or locations of interest to security and defence applications.

[25] recently released a dataset of various kinds of social media data annotated with generic location expressions, but not with cities, states/provinces, and countries.

https://dev.twitter.com.

http://www.geonames.org.

The number of countries is larger than 200 because alternative names are counted; the same for states/provinces and cities.

https://github.com/rex911/locdet.

http://www.ark.cs.cmu.edu/GeoTwitter.

https://github.com/utcompling/textgrounder/wiki/RollerEtAl_EMNLP2012.

Explained in Sect. 5.2.

Not all of these 5000 n-grams are necessarily good location indicators. We do not distinguish them manually; a trained machine learning model should be able to do so.
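As an illustration of the note above, the following minimal sketch builds an n-gram vocabulary and turns a tweet into a count vector. It assumes the n-grams are chosen by raw frequency and that tokens are split on whitespace; neither detail is stated here, and the names `top_ngrams` and `to_feature_vector` are hypothetical.

```python
from collections import Counter


def extract_ngrams(text, max_n=2):
    """Yield word n-grams (n = 1..max_n) from a lower-cased, whitespace-split text."""
    tokens = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def top_ngrams(tweets, k=5000, max_n=2):
    """Return the k most frequent n-grams over all tweets (assumed selection rule)."""
    counts = Counter()
    for text in tweets:
        counts.update(extract_ngrams(text, max_n))
    return [gram for gram, _ in counts.most_common(k)]


def to_feature_vector(text, vocabulary, max_n=2):
    """Count how often each vocabulary n-gram occurs in one text."""
    grams = Counter(extract_ngrams(text, max_n))
    return [grams[g] for g in vocabulary]
```

No n-gram is filtered by hand in this sketch: uninformative entries simply stay in the vocabulary, and a classifier trained on the vectors is left to learn which ones matter.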

Alternatively, we also tried a loss function defined as the average squared error of the output numbers, which corresponds to the average squared Euclidean distance between the estimated location and the true location; this alternative model did not perform well.
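The relation between a squared-error loss and distance can be sketched as follows, assuming each location is a (latitude, longitude) pair in degrees. For a single example, the summed squared error over the two outputs equals the squared Euclidean distance in coordinate space. The haversine function (cf. Sinnott's formula in the reference list) is included only for comparison with great-circle distance; the function names and the Earth radius of 6371 km are assumptions of this sketch, not details taken from the chapter.

```python
import math


def mean_squared_error(pred, true):
    """Average, over examples, of the summed squared differences in (lat, lon)."""
    total = sum((p[0] - t[0]) ** 2 + (p[1] - t[1]) ** 2 for p, t in zip(pred, true))
    return total / len(pred)


def mean_euclidean_distance(pred, true):
    """Average straight-line distance in coordinate space (degrees, not kilometres)."""
    total = sum(math.hypot(p[0] - t[0], p[1] - t[1]) for p, t in zip(pred, true))
    return total / len(pred)


def haversine_km(a, b, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))
```

Because one degree of longitude covers fewer kilometres at higher latitudes, a distance measured in coordinate space is only a rough proxy for distance on the ground.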

http://www.mapquest.com.

http://www.census.gov/geo/maps-data/maps/pdfs/reference/us_regdiv.pdf.

Our code is available at https://github.com/rex911/usrloc.

We are unable to conduct t-tests on the Eisenstein models, because of the unavailability of the details of the results produced by these models.

We are unable to conduct t-tests on the other models, because of the unavailability of the details of the results produced by these models.

Only this metric was reported by the author in the top 3 % features configuration.

1.

Aggarwal, C., Zhai, C.: A survey of text classification algorithms. In: Aggarwal, C.C., Zhai, C. (eds.) Mining Text Data, pp. 163–222. Springer (2012). http://dx.doi.org/10.1007/978-1-4614-3223-4_6

2.

Aman, S., Szpakowicz, S.: Identifying expressions of emotion in text. In: Text, Speech and Dialogue, pp. 196–205. Springer (2007)

3.

Bengio, Y.: Practical recommendations for gradient-based training of deep architectures. Neural Networks: Tricks of the Trade 7700, pp. 437–478 (2012). http://link.springer.com/chapter/10.1007/978-3-642-35289-8_26

4.

Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013). http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6472238

5.

Bengio, Y., Lamblin, P.: Greedy layer-wise training of deep networks. Adv. Neural Inf. Process. Syst. 19(153) (2007). https://papers.nips.cc/paper/3048-greedy-layer-wise-training-of-deep-networks.pdf

6.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: a CPU and GPU math expression compiler. In: Proceedings of the Python for Scientific Computing Conference (SciPy). vol. 4, p. 3 (2010)

7.

Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. Knowl. Data Eng. IEEE Trans. 18(10), 1411–1428 (2006)

8.

Cohen, W.W.: Minorthird: methods for identifying names and ontological relations in text using heuristics for inducing regularities from data (2004)

9.

Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995). http://dx.doi.org/10.1007/BF00994018

10.

Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: A framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02). Association for Computational Linguistics (2002)

11.

Domingos, P.: A few useful things to know about machine learning. Commun. ACM 55(10), 78–87 (2012)

12.

Eisenstein, J., Ahmed, A., Xing, E.P.: Sparse additive generative models of text. In: Proceedings of the 28th International Conference on Machine Learning (ICML’11), pp. 1041–1048 (2011)

13.

Eisenstein, J., O’Connor, B., Smith, N.A., Xing, E.P.: A latent variable model for geographic lexical variation. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 1277–1287. ACL (2010)

14.

Ghazi, D., Inkpen, D., Szpakowicz, S.: Hierarchical versus flat classification of emotions in text. In: Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pp. 140–146. Association for Computational Linguistics, Los Angeles (June 2010). http://www.aclweb.org/anthology/W10-0217

15.

Ghazi, D., Inkpen, D., Szpakowicz, S.: Prior and contextual emotion of words in sentential context. Comput. Speech Lang. 28(1), 76–92 (2014). http://dx.doi.org/10.1016/j.csl.2013.04.009

16.

Glorot, X., Bordes, A., Bengio, Y.: Domain adaptation for large-scale sentiment classification: a deep learning approach. In: Proceedings of the 28th International Conference on Machine Learning (ICML’11). pp. 513–520 (2011)

17.

Han, B., Cook, P., Baldwin, T.: Text-based Twitter user geolocation prediction. Artif. Intell. Res. 49(1), 451–500 (Jan 2014). http://dl.acm.org/citation.cfm?id=2655713.2655726

18.

Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (Jul 2006). http://dl.acm.org/citation.cfm?id=1161603.1161605

19.

Huang, F., Yates, A.: Exploring representation-learning approaches to domain adaptation. In: Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, pp. 23–30 (2010). http://dl.acm.org/citation.cfm?id=1870530

20.

Inkpen, D., Liu, J., Farzindar, A., Kazemi, F., Ghazi, D.: Detecting and disambiguating locations in Twitter messages. In: Proceedings of the 16th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2015). Cairo, Egypt (2015)

21.

Keshtkar, F., Inkpen, D.: A hierarchical approach to mood classification in blogs. Nat. Lang. Eng. 18(1), 61–81 (2012)

22.

Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pp. 282–289. Morgan Kaufmann Publishers Inc., San Francisco (2001). http://dl.acm.org/citation.cfm?id=645530.655813

23.

Li, H., Srihari, R.K., Niu, C., Li, W.: Location normalization for information extraction. In: Proceedings of the 19th International Conference on Computational Linguistics, vol. 1, pp. 1–7. Association for Computational Linguistics, Morristown (Aug 2002). http://dl.acm.org/citation.cfm?id=1072228.1072355

24.

Li, T., Zhang, Y., Sindhwani, V.: A non-negative matrix tri-factorization approach to sentiment classification with lexical prior knowledge. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 244–252. Association for Computational Linguistics (Aug 2009). http://dl.acm.org/citation.cfm?id=1687878.1687914

25.

Liu, F., Vasardani, M., Baldwin, T.: Automatic identification of locative expressions from social media text: A comparative analysis. In: Proceedings of the 4th International Workshop on Location and the Web. LocWeb ’14, pp. 9–16. ACM, New York (2014). http://doi.acm.org/10.1145/2663713.2664426

26.

Liu, J., Inkpen, D.: Estimating user locations on social media: a deep learning approach. Technical Report. University of Ottawa (2014)

27.

Mani, I., Hitzeman, J., Richer, J., Harris, D., Quimby, R., Wellner, B.: SpatialML: annotation scheme, corpora, and tools. In: Proceedings of the 6th International Conference on Language Resources and Evaluation, p. 11 (2008). http://www.lrec-conf.org/proceedings/lrec2008/summaries/106.html

28.

Melville, P., Gryc, W., Lawrence, R.D.: Sentiment analysis of blogs by combining lexical knowledge with text classification. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’09), p. 1275. ACM Press, New York (June 2009). http://dl.acm.org/citation.cfm?id=1557019.1557156

29.

Owoputi, O., O’Connor, B., Dyer, C., Gimpel, K., Schneider, N., Smith, N.A.: Improved part-of-speech tagging for online conversational text with word clusters. In: Proceedings of NAACL-HLT, pp. 380–390 (2013)

30.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

31.

Priedhorsky, R., Culotta, A., Del Valle, S.Y.: Inferring the origin locations of tweets with quantitative confidence. In: Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’14), pp. 1523–1536. ACM Press, New York (Feb 2014). http://dl.acm.org/citation.cfm?id=2531602.2531607

32.

Razavi, A.H., Inkpen, D., Brusilovsky, D., Bogouslavski, L.: General topic annotation in social networks: A latent dirichlet allocation approach. In: Advances in Artificial Intelligence, Lecture Notes in Computer Science, vol. 7884, pp. 293–300. Springer, Berlin (2013). http://dx.doi.org/10.1007/978-3-642-38457-8_29

33.

Razavi, A.H., Inkpen, D., Falcon, R., Abielmona, R.: Textual risk mining for maritime situational awareness. In: 2014 IEEE International Inter-Disciplinary Conference on Cognitive Methods in Situation Awareness and Decision Support (CogSIMA), pp. 167–173. IEEE (2014)

34.

Roller, S., Speriosu, M., Rallapalli, S., Wing, B., Baldridge, J.: Supervised text-based geolocation using language models on an adaptive grid. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. pp. 1500–1510. Association for Computational Linguistics (Jul 2012). http://dl.acm.org/citation.cfm?id=2390948.2391120

35.

Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. Technical Report. DTIC Document (1985)

36.

Sarawagi, S., Cohen, W.W.: Semi-Markov conditional random fields for information extraction. NIPS 17, 1185–1192 (2004)

37.

Sinnott, R.W.: Virtues of the haversine. Sky Telesc. 68, 158 (1984)

38.

Tang, D., Qin, B., Liu, T., Li, Z.: Learning sentence representation for emotion classification on microblogs. Natural Language Processing and Chinese Computing, vol. 400, pp. 212–223 (2013). http://link.springer.com/chapter/10.1007/978-3-642-41644-6_20

39.

Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning (ICML’08), pp. 1096–1103 (2008). http://portal.acm.org/citation.cfm?doid=1390156.1390294

40.

Wing, B.P., Baldridge, J.: Simple supervised document geolocation with geodesic grids. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL HLT ’11), pp. 955–964. Association for Computational Linguistics (June 2011). http://dl.acm.org/citation.cfm?id=2002472.2002593

41.

Yao, Y., Rosasco, L., Caponnetto, A.: On early stopping in gradient descent learning. Constr. Approx. 26(2), 289–315 (2007)

Title: Text Mining in Social Media for Security Threats
Author: Diana Inkpen
Publisher: Springer International Publishing
Book: Recent Advances in Computational Intelligence in Defense and Security
Print ISBN: 978-3-319-26448-6

Electronic ISBN: 978-3-319-26450-9

Copyright Year: 2016
DOI: https://doi.org/10.1007/978-3-319-26450-9_19
