Annotated news corpora and a lexicon for sentiment analysis in Slovene

Bučar, Jože; Žnidaršič, Martin; Povh, Janez

doi:10.1007/s10579-018-9413-3

Project Notes
Published: 06 February 2018

Volume 52, pages 895–919, (2018)
Language Resources and Evaluation

Abstract

In this study, we introduce Slovene web-crawled news corpora with sentiment annotation on three levels of granularity: sentence, paragraph and document levels. We describe the methodology and tools that were required for their construction. The corpora contain more than 250,000 documents with political, business, economic and financial content from five Slovene media resources on the web. More than 10,000 of them were manually annotated as negative, neutral or positive. All corpora are publicly available under a Creative Commons copyright license. We used the annotated documents to construct a Slovene sentiment lexicon, which is the first of its kind for Slovene, and to assess the sentiment classification approaches used. The constructed corpora were also utilised to monitor within-the-document sentiment dynamics, its changes over time and relations with news topics. We show that sentiment is, on average, more explicit at the beginning of documents, and it loses sharpness towards the end of documents.

Notes

Multinomial Naïve Bayes is a Bayesian approach that avoids explicit penalization of the non-occurrence of words in documents, which makes it suitable for text classification. A detailed description is beyond the scope of this paper but is available, for example, in Kibriya et al. (2004) or Aggarwal (2015).
https://w3techs.com/technologies/overview/content_language/all.
MSDs (morphosyntactic descriptions) are more detailed than typical part-of-speech (PoS) tags; they are compact string representations of a simplified kind of feature structure. The first letter of an MSD encodes the PoS. For each PoS, the specifications define its appropriate attributes, their values and one-letter codes, with each attribute assigned a fixed position in the string.
http://www.clarin.si/.
http://sentiwordnet.isti.cnr.it/.
http://wndomains.fbk.eu/wnaffect.html.
http://www.unipv.it/wnop.
http://mpqa.cs.pitt.edu/.
http://neuro.imm.dtu.dk/wiki/A_new_ANEW:_evaluation_of_a_word_list_for_sentiment_analysis_in_microblogs.
Apache OpenNLP, GATE, Lydia, MAE, MALLET, MPQA (OpinionFinder 2), Orange, QDA Miner Lite, Python (NLTK), R (tm), RapidMiner, TAMS Analyzer, WEKA, etc.
http://www.24ur.com/arhiv/novice/gospodarstvo/.
https://www.dnevnik.si/posel/novice/.
http://www.finance.si/danes/.
http://www.rtvslo.si/gospodarstvo/arhiv/.
http://www.zurnal24.si/archive/slovenija/.
http://dejan.amadej.si/test/.
Initially, the project leaders published a call with basic information about the project on the website of the Faculty of Information Studies and on its social networks. Several candidates responded to the call. We invited all of them to a first meeting, where they were introduced to the objectives of the project and to the content and scope of their work. The project leaders then selected six candidates, taking into account several criteria: (1) the candidate’s suitability for carrying out the task, (2) the candidate’s interest, (3) the candidate’s organisation and (4) gender and age equality. Three women and three men, aged between 19 and 30 and from two different faculties (Faculty of Computer and Information Science in Ljubljana and Faculty of Information Studies in Novo mesto), were chosen to annotate the texts.
http://www.fran.si/130/sskj-slovar-slovenskega-knjiznega-jezika/.
https://www.clarin.si/repository/xmlui/browse?value=Bu%C4%8Dar,%20Jo%C5%BEe&type=author.
https://github.com/19Joey85/Sentiment-annotated-news-corpus-and-sentiment-lexicon-in-Slovene.
https://creativecommons.org/licenses/by-sa/4.0/.
Classifiers and WEKA parameters:
- k-Nearest Neighbour (KNN): IBk -K 9 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A weka.core.EuclideanDistance -R first-last"
- Multinomial Naïve Bayes (NBM): NaiveBayesMultinomial
- Support Vector Machine (SVM): SMO -C 1.0 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0"
- Random Forest (RF): -I 10 -K 0 -S 1
- C4.5: J48 -C 0.25 -M 2
- Decision Table (DT): -X 1 -S "weka.attributeSelection.BestFirst -D 1 -N 5"
- Simple Logistic Regression (SLR): -I 0 -M 500 -H 50 -W 0.0
- Voted Perceptron (VP): -I 1 -E 1.0 -S 1 -M 10000
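
As a rough illustration of the multinomial Naïve Bayes idea described in the first note, the following is a minimal Python sketch. It is neither WEKA's NaiveBayesMultinomial nor the authors' code: the class name, the toy English tokens and the add-one (Laplace) smoothing are assumptions made for illustration only. It shows that a document is scored using only the words it contains, so words that do not occur are not explicitly penalized.

```python
import math
from collections import Counter


class ToyMultinomialNB:
    """Illustrative multinomial Naive Bayes with add-one smoothing (hypothetical helper)."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: one class label per document.
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        self.log_prior = {}
        self.log_lik = {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            counts = Counter(tok for d in class_docs for tok in d)
            total = sum(counts.values()) + len(self.vocab)
            # Smoothed log P(token | class) for every vocabulary token.
            self.log_lik[c] = {
                tok: math.log((counts[tok] + 1) / total) for tok in self.vocab
            }
        return self

    def predict(self, doc):
        # Only tokens present in the document contribute to the score;
        # vocabulary words that are absent add nothing.
        scores = {
            c: self.log_prior[c]
            + sum(self.log_lik[c][t] for t in doc if t in self.vocab)
            for c in self.classes
        }
        return max(scores, key=scores.get)


# Toy training data (assumed, not taken from the corpora).
train_docs = [
    ["profit", "growth", "record"],
    ["loss", "crisis", "debt"],
    ["meeting", "report", "schedule"],
    ["growth", "profit", "success"],
    ["crisis", "loss", "layoffs"],
]
train_labels = ["positive", "negative", "neutral", "positive", "negative"]

model = ToyMultinomialNB().fit(train_docs, train_labels)
print(model.predict(["record", "profit"]))
print(model.predict(["debt", "crisis"]))
```

Working in log space avoids numerical underflow on long documents, and the three labels mirror the negative, neutral and positive annotation scheme used for the corpora.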

References

Abdul-Mageed, M. & Diab, M. T. (2011). Subjectivity and sentiment annotation of modern standard Arabic newswire. In Proceedings of the 5th linguistic annotation workshop (pp. 110–118), Portland, OR. Association for Computational Linguistics, Stroudsburg, PA.
Aggarwal, C. C. (2015). Data mining: The textbook. New York: Springer.
Balahur, A., Steinberger, R., Kabadjov, M., Zavarella, V., Van Der Goot, E., Halkia, M., et al. (2013). Sentiment analysis in the news. arXiv Preprint ArXiv:1309.6202.
Berginc, N. L., Grčar, M., Brakus, M., Erjavec, T., Holdt, Š. A., Krek, S., et al. (2012). The Gigafida, KRES, ccGigafida and ccKRES corpora of Slovene language: Compilation, content, use. Institute for Applied Slovene Studies, Ljubljana: Trojina.
Berginc, N. L., & Ljubešić, N. (2013). Gigafida and slWaC: Topic comparison. Slovenščina 2.0: Empirical, Applied and Interdisciplinary Research, 1(1), 78–110.
Bučar, J. (2017a). Automatically sentiment annotated Slovenian news corpus AutoSentiNews 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1109.
Bučar, J. (2017b). Manually sentiment annotated Slovenian news corpus SentiNews 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1110.
Bučar, J. (2017c). R crawlers for five Slovenian web media 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1105.
Bučar, J. (2017d). Slovene sentiment lexicon JOB 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1112.
Bučar, J., Povh, J., & Žnidaršič, M. (2016). Sentiment classification of the Slovenian news texts. In Proceedings of the 9th international conference on computer recognition systems (CORES 2015) (pp. 777–787), Wrocław. Springer, Cham.
Ceron, A., Curini, L., & Iacus, S. M. (2015). Using sentiment analysis to monitor electoral campaigns: Method matters-evidence from the United States and Italy. Social Science Computer Review, 33(1), 3–20.
Colbaugh, R. & Glass, K. (2010). Estimating sentiment orientation in social media for intelligence monitoring and analysis. In Proceedings of the IEEE international conference on intelligence and security informatics (ISI) (pp. 135–137), Vancouver. IEEE.
Das, S. & Chen, M. (2001). Yahoo! for Amazon: Extracting market sentiment from stock message boards. In Proceedings of the Asia Pacific finance association annual conference (APFA), Bangkok.
Durant, K. T. & Smith, M. D. (2006). Mining sentiment classification from political web logs. In Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, Philadelphia, PA. ACM, New York.
Erjavec, T. (2014). Digital library and corpus of historical Slovene IMP 1.1. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1031.
Erjavec, T. & Fišer, D. (2006). Building Slovene WordNet. In Proceedings of the 5th international conference on language resources and evaluation (LREC 2006) (pp. 1678–1683), Genoa. European Language Resources Association.
Erjavec, T., Fišer, D., Krek, S., & Ledinek, N. (2010). The JOS linguistically tagged corpus of Slovene. In Proceedings of the 7th international conference on language resources and evaluation (LREC 2010), Valletta. European Language Resources Association.
Erjavec, T., Ignat, C., Pouliquen, B., & Steinberger, R. (2005). Massive multi lingual corpus compilation: Acquis Communautaire and ToTaLe. Archives of Control Science, 15(4), 253–264.
Erjavec, T. & Krek, S. (2010). Training corpus jos1M 1.1. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1037.
Fellbaum, C., et al. (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.
Fišer, D., Smailović, J., Erjavec, T., Mozetič, I., & Grčar, M. (2016). Sentiment annotation of Slovene user-generated content. In Proceedings of the 2016 conference language technologies and digital humanities (JTDH 2016) (pp. 65–70), Ljubljana. Faculty of Arts, University of Ljubljana.
Glavaš, G., Šnajder, J., & Bašić, B. D. (2012). Semi-supervised acquisition of Croatian sentiment lexicon. In Proceedings of the 15th international conference text, speech and dialogue (pp. 166–173). Springer, Brno.
Hatzivassiloglou, V. & McKeown, K. R. (1997). Predicting the semantic orientation of adjectives. In Proceedings of the 35th annual meeting of ACL and 8th conference of the European chapter of ACL (pp. 174–181), Madrid. Association for Computational Linguistics, New Brunswick, NJ.
Hsueh, P. Y., Melville, P., & Sindhwani, V. (2009). Data quality from crowdsourcing: A study of annotation selection criteria. In Proceedings of the NAACL HLT (pp. 27–35), Boulder, CO. Association for Computational Linguistics.
Jakopin, P. (2006). List of Slovenian headwords 1.1. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1038.
Jovanoski, D., Pachovski, V., & Nakov, P. (2015). Sentiment analysis in Twitter for Macedonian. In Proceedings of the international conference on recent advances in natural language processing (RANLP 2015) (pp. 249–257), Hissar.
Kadunc, K. & Robnik-Šikonja, M. (2016). Analiza mnenj s pomočjo strojnega učenja in slovenskega leksikona sentimenta [Opinion mining using machine learning and Slovene sentiment lexicon]. In Proceedings of the 2016 conference language technologies and digital humanities (JTDH 2016) (pp. 83–89), Ljubljana. Faculty of Arts, University of Ljubljana.
Kadunc, K. & Robnik-Šikonja, M. (2017). Opinion corpus of Slovene web commentaries KKS 1.001. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1115.
Kapukaranov, B. & Nakov, P. (2015). Fine-grained sentiment analysis for movie reviews in Bulgarian. In Proceedings of the international conference on recent advances in natural language processing (RANLP 2015) (pp. 266–274), Hissar.
Kibriya, A. M., Frank, E., Pfahringer, B., & Holmes, G. (2004). Multinomial Naive Bayes for text categorization revisited. In Australian conference on artificial intelligence (pp. 488–499), Cairns. Springer.
Krek, S., Dobrovoljc, K., Erjavec, T., Može, S., Ledinek, N., & Holz, N. (2015). Training corpus ssj500k 1.4. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1052.
Kushal, D., Lawrence, S., & Pennock, D. M. (2003). Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In Proceedings of the 12th ACM international conference on WWW (pp. 519–528), Budapest. ACM, New York.
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology.
Liu, B., Hu, M., & Cheng, J. (2005). Opinion observer: Analyzing and comparing opinions on the web. In Proceedings of the 14th ACM international conference on WWW (pp. 342–351), Bremen. ACM, New York.
Ljubešić, N. & Erjavec, T. (2011). HrWaC and slWac: Compiling web corpora for Croatian and Slovene. In Proceedings of the 14th international conference text, speech and dialogue (pp. 395–402), Pilsen. Springer.
Martinc, R. (2013). Measuring sentiment on social network Twitter: Designing a tool and evaluation. Ljubljana: Faculty of Social Sciences, University of Ljubljana.
McCallum, A. K. (2002). Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.
Mozetič, I., Grčar, M., & Smailović, J. (2016). Multilingual Twitter sentiment classification: The role of human annotators. PloS One, 11(5), 1–26.
Mozetič, I., Grčar, M., & Smailović, J. (2016). Twitter sentiment for 15 European languages. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1054.
Nakov, P., Rosenthal, S., Kiritchenko, S., Mohammad, S. M., Kozareva, Z., Ritter, A., et al. (2016). Developing a successful SemEval task in sentiment analysis of Twitter and other social media texts. Language Resources and Evaluation, 50(1), 35–65.
Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. In Proceedings of the 1st workshop on making sense of microposts: Big things come in small packages (pp. 93–98), Heraklion.
O’Hare, N., Davy, M., Bermingham, A., Ferguson, P., Sheridan, P., Gurrin, C., et al. (2009). Topic-dependent sentiment analysis of financial blogs. In Proceedings of the 1st ACM international CIKM workshop on topic-sentiment analysis for mass opinion (pp. 9–16), Hong Kong. ACM Press, New York.
Okruhlica, A. (2013). Slovak sentiment lexicon induction in absence of labeled data. Master’s thesis, Comenius University Bratislava.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of ACL-02 conference on empirical methods in natural language processing (pp. 79–86), Philadelphia, PA. Association for Computational Linguistics, Stroudsburg, PA.
Perez-Rosas, V., Banea, C., & Mihalcea, R. (2012). Learning sentiment lexicons in Spanish. In Proceedings of the 8th international conference on language resources and evaluation (LREC 2012) (pp. 3077–3081), Istanbul. European Language Resources Association.
Pustejovsky, J., & Stubbs, A. (2012). Natural language annotation for machine learning. Newton: O’Reilly Media Inc.
Reis, J., Benevenuto, F., de Melo, P. O., Prates, R., Kwak, H., & An, J. (2015). Breaking the news: First impressions matter on online news. arXiv Preprint ArXiv:1503.07921.
Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast—but is it good?: Evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing (pp. 254–263), Waikiki, Honolulu. ACM Press, New York.
Stone, P. J., Dunphy, D. C., & Smith, M. S. (1966). The General Inquirer: A computer approach to content analysis. Cambridge: MIT Press.
Taboada, M., Brooke, J., Tofiloski, M., Voll, K., & Stede, M. (2011). Lexicon-based methods for sentiment analysis. Computational Linguistics, 37(2), 267–307.
Turney, P. D. (2002). Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th annual meeting on ACL (pp. 417–424), Philadelphia, PA. Association for Computational Linguistics, Stroudsburg, PA.
Veselovská, K. (2013). Czech subjectivity lexicon: A lexical resource for Czech polarity classification. In Proceedings of the 7th international conference Slovko (pp. 279–284), Bratislava. RAM-Verlag, Lüdenscheid.
Wawer, A. (2012). Extracting emotive patterns for languages with rich morphology. International Journal of Computational Linguistics and Applications, 3(1), 11–24.
Wiebe, J. & Riloff, E. (2005). Creating subjective and objective sentence classifiers from unannotated texts. In Proceedings of the international conference on intelligent text processing and computational linguistics (pp. 486–497).
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques. Burlington: Morgan Kaufmann.

Acknowledgements

This study is based on work supported by the European Union, the European Regional Development Fund, the Slovene Human Resources Development and Scholarship Fund, the Ministry of Education, Science and Sport, Slovenia, and the Young Researcher Programme of the Slovenian Research Agency. Our research was conducted within the framework of the Operational Programme for Strengthening Regional Development Potentials for the period 2007–2013, Development Priority 1: Competitiveness and research excellence, Priority Guideline 1.1: Improving the competitive skills and research excellence. We acknowledge financial support from the Slovenian Research Agency for the research core funding No. P2-0103. We also thank the anonymous reviewers for their comments and suggestions, which helped to improve the paper.

Author information

Authors and Affiliations

Real Estate Mass Valuation System, Surveying and Mapping Authority of the Republic of Slovenia, Ljubljana, Slovenia
Jože Bučar
Laboratory of Data Technologies, Faculty of Information Studies, Novo mesto, Slovenia
Jože Bučar & Janez Povh
Laboratory for Engineering Design, Faculty of Mechanical Engineering, Ljubljana, Slovenia
Janez Povh
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
Martin Žnidaršič

Corresponding author

Correspondence to Jože Bučar.

About this article

Cite this article

Bučar, J., Žnidaršič, M. & Povh, J. Annotated news corpora and a lexicon for sentiment analysis in Slovene. Lang Resources & Evaluation 52, 895–919 (2018). https://doi.org/10.1007/s10579-018-9413-3

Published: 06 February 2018
Issue Date: September 2018
DOI: https://doi.org/10.1007/s10579-018-9413-3
