Published in: Social Network Analysis and Mining 1/2016

01-12-2016 | Original Article

Comparison of sentiment lexicon development techniques for event prediction

Authors: Mehmet Kaya, Shannon Conley



Abstract

What started as a social utility for sharing short bursts of ‘inconsequential information’ has become a powerful information network capable of both tracking and shaping current events. Whether it is used to orchestrate government insurgencies or to track epidemics, the majority of information shared via Twitter is semantically relevant to contemporary topics, according to recent statistics. Consequently, researchers consider Twitter an ideal platform for sentiment analysis. Compared to other online arenas such as forum discussions, blogs, and Facebook postings, Twitter frequently yields higher sentiment analysis accuracy because each post is short (a 140-character limit per Tweet). Various natural language processing techniques have been used successfully to perform sentiment classification on groups of Tweets. However, these techniques analyze text using English-specific grammar rules and lexicons. Because fewer resources and tools exist for other languages, researchers often first use machine translation to translate the text into English, and translation errors often introduce noise that obscures the results.

In this study, we analyze the accuracy of sentiment analysis using an ad hoc and a translated sentiment lexicon, measured by how well each predicts the result of a future event. Using the Twitter Search and Streaming APIs, we collected some 22,000 tweets about a highly popular TV show, “O Ses Türkiye” (the Turkish version of the globally known voice contest “The Voice of America”), to predict its winner. We first performed a frequency-based statistical classification using an English sentiment lexicon translated into Turkish as well as a small ad hoc Turkish sentiment lexicon generated specifically for this study. We also applied k-means clustering with the two sentiment lexicons to evaluate their accuracy.

Our study concludes that although a translated sentiment lexicon (or training data, for that matter) can give a rough estimate of the result of a future event, a language-specific ad hoc lexicon yields better granularity, with higher discriminative power between negative, positive, and neutral tweets. We also show how automatic spell checking and stemming of tweets affect the predictive and discriminative power of an auto-translated sentiment lexicon in the target language.
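
To make the frequency-based lexicon classification described in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. The word lists, contestant names, and function names are hypothetical placeholders; the paper's actual lexicons and scoring rules are not given on this page. The sketch uses plain whitespace tokenization and omits the spell checking and stemming that the abstract reports as influential.

```python
# Hypothetical toy lexicons for illustration only; they are NOT the
# translated or ad hoc Turkish lexicons used in the study.
POSITIVE = {"harika", "güzel", "süper"}
NEGATIVE = {"kötü", "berbat"}


def classify(tweet: str) -> str:
    """Label a tweet by comparing counts of positive and negative lexicon hits."""
    tokens = tweet.lower().split()
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


def rank_contestants(tweets, contestants):
    """Score each contestant as (positive mentions - negative mentions).

    A tweet counts as a mention when the contestant's name appears in it.
    Returns (name, score) pairs sorted from highest to lowest score.
    """
    scores = {name: 0 for name in contestants}
    for tweet in tweets:
        label = classify(tweet)
        lowered = tweet.lower()
        for name in contestants:
            if name in lowered:
                if label == "positive":
                    scores[name] += 1
                elif label == "negative":
                    scores[name] -= 1
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample_tweets = [
        "ayse harika söyledi",
        "mehmet kötü",
        "ayse süper",
        "mehmet güzel ama berbat",  # one positive and one negative hit: neutral
    ]
    print(rank_contestants(sample_tweets, ["ayse", "mehmet"]))
```

Under this assumed scoring rule, the contestant with the highest net count of positive mentions would be the predicted winner. Swapping the two word sets is all it takes to compare a translated lexicon against an ad hoc one on the same tweets.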


Metadata
Title: Comparison of sentiment lexicon development techniques for event prediction
Authors: Mehmet Kaya, Shannon Conley
Publication date: 01-12-2016
Publisher: Springer Vienna
Published in: Social Network Analysis and Mining, Issue 1/2016
Print ISSN: 1869-5450
Electronic ISSN: 1869-5469
DOI: https://doi.org/10.1007/s13278-015-0315-8
