Annotated news corpora and a lexicon for sentiment analysis in Slovene

  • Project Notes
Language Resources and Evaluation

Abstract

In this study, we introduce Slovene web-crawled news corpora with sentiment annotation on three levels of granularity: sentence, paragraph and document levels. We describe the methodology and tools that were required for their construction. The corpora contain more than 250,000 documents with political, business, economic and financial content from five Slovene media resources on the web. More than 10,000 of them were manually annotated as negative, neutral or positive. All corpora are publicly available under a Creative Commons copyright license. We used the annotated documents to construct a Slovene sentiment lexicon, which is the first of its kind for Slovene, and to assess the sentiment classification approaches used. The constructed corpora were also utilised to monitor within-the-document sentiment dynamics, its changes over time and relations with news topics. We show that sentiment is, on average, more explicit at the beginning of documents, and it loses sharpness towards the end of documents.

Notes

  1. Multinomial Naïve Bayes is a Bayesian approach that avoids the explicit penalization of non-occurrence of words in documents and is suitable for text classification. A detailed description is beyond the scope of this paper but is available, for example, in Kibriya et al. (2004) or Aggarwal (2015).

  2. https://w3techs.com/technologies/overview/content_language/all.

  3. MSDs are more detailed than is commonly the case for part-of-speech [PoS] tags; they are compact string representations of a simplified kind of feature structure. The first letter of an MSD encodes the PoS. The letters that follow give the values of the position-determined attributes. For each PoS, the specifications define its appropriate attributes, their values and one-letter codes.

  4. http://www.clarin.si/.

  5. http://sentiwordnet.isti.cnr.it/.

  6. http://wndomains.fbk.eu/wnaffect.html.

  7. http://www.unipv.it/wnop.

  8. http://mpqa.cs.pitt.edu/.

  9. http://neuro.imm.dtu.dk/wiki/A_new_ANEW:_evaluation_of_a_word_list_for_sentiment_analysis_in_microblogs.

  10. Apache OpenNLP, GATE, Lydia, MAE, MALLET, MPQA (OpinionFinder 2), Orange, QDA Miner Lite, Python (NLTK), R (tm), RapidMiner, TAMS Analyzer, WEKA, etc.

  11. http://www.24ur.com/arhiv/novice/gospodarstvo/.

  12. https://www.dnevnik.si/posel/novice/.

  13. http://www.finance.si/danes/.

  14. http://www.rtvslo.si/gospodarstvo/arhiv/.

  15. http://www.zurnal24.si/archive/slovenija/.

  16. http://dejan.amadej.si/test/.

  17. Initially, the project leaders published a call with basic information about the project on the website of the Faculty of Information Studies and on its social networks. Several candidates responded to the call. We invited all of them to a first meeting, where they were introduced to the objectives of the project and to the content and scope of their work. The project leaders then selected six candidates according to four criteria: (1) the candidate’s suitability for carrying out the task, (2) the candidate’s interest, (3) the candidate’s organisation and (4) gender and age equality. Three women and three men, aged between 19 and 30 and from two different faculties (Faculty of Computer and Information Science in Ljubljana and Faculty of Information Studies in Novo mesto), were chosen to annotate the texts.

  18. http://www.fran.si/130/sskj-slovar-slovenskega-knjiznega-jezika/.

  19. https://www.clarin.si/repository/xmlui/browse?value=Bu%C4%8Dar,%20Jo%C5%BEe&type=author.

  20. https://github.com/19Joey85/Sentiment-annotated-news-corpus-and-sentiment-lexicon-in-Slovene.

  21. https://creativecommons.org/licenses/by-sa/4.0/.

  22. Classifiers and WEKA parameters:

    • k-Nearest Neighbour (KNN): IBk -K 9 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A weka.core.EuclideanDistance -R first-last"

    • Multinomial Naïve Bayes (NBM): NaiveBayesMultinomial

    • Support Vector Machine (SVM): SMO -C 1.0 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0"

    • Random Forest (RF): -I 10 -K 0 -S 1

    • C4.5: J48 -C 0.25 -M 2

    • Decision Table (DT): -X 1 -S "weka.attributeSelection.BestFirst -D 1 -N 5"

    • Simple Logistic Regression (SLR): -I 0 -M 500 -H 50 -W 0.0

    • Voted Perceptron (VP): -I 1 -E 1.0 -S 1 -M 10000
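
As a rough illustration of the Multinomial Naïve Bayes classifier mentioned in Notes 1 and 22, the following Python sketch trains on tokenised documents and scores a new document using only the words that occur in it, which is the property Note 1 refers to. It is not the WEKA NaiveBayesMultinomial implementation used in the study: the function names `train_mnb` and `predict_mnb`, the Laplace smoothing parameter `alpha` and the English toy documents are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on tokenised documents.

    docs   : list of token lists
    labels : list of class labels, one per document
    alpha  : Laplace smoothing constant (an assumption of this sketch)
    """
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    model = {}
    for label, n_docs in class_docs.items():
        total = sum(token_counts[label].values())
        denom = total + alpha * len(vocab)
        model[label] = {
            # log P(class), estimated from document frequencies
            "log_prior": math.log(n_docs / len(docs)),
            # log P(word | class), smoothed so unseen words get non-zero mass
            "log_lik": {
                tok: math.log((token_counts[label][tok] + alpha) / denom)
                for tok in vocab
            },
        }
    return model


def predict_mnb(model, doc):
    """Return the most probable class for a tokenised document.

    Only words that occur in the document contribute to the score;
    absent words are not penalised. Words outside the training
    vocabulary are skipped.
    """
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_lik"][tok] for tok in doc if tok in params["log_lik"]
        )

    return max(model, key=score)


# Hypothetical toy data using the three labels from the corpora.
toy_docs = [["profit", "growth"], ["loss", "crisis"], ["meeting", "report"]]
toy_labels = ["positive", "negative", "neutral"]
toy_model = train_mnb(toy_docs, toy_labels)
```

Calling `predict_mnb(toy_model, ["growth", "profit", "report"])` on this toy model selects the class whose smoothed word probabilities best explain the observed tokens. Log-probabilities are summed rather than multiplied to avoid numerical underflow on long documents.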

References

  • Abdul-Mageed, M. & Diab, M. T. (2011). Subjectivity and sentiment annotation of modern standard Arabic newswire. In Proceedings of the 5th linguistic annotation workshop (pp. 110–118), Portland, OR. Association for Computational Linguistics, Stroudsburg, PA.

  • Aggarwal, C. C. (2015). Data mining: The textbook. New York: Springer.

  • Balahur, A., Steinberger, R., Kabadjov, M., Zavarella, V., Van Der Goot, E., Halkia, M., et al. (2013). Sentiment analysis in the news. arXiv Preprint ArXiv:1309.6202.

  • Berginc, N. L., Grčar, M., Brakus, M., Erjavec, T., Holdt, Š. A., Krek, S., et al. (2012). The Gigafida, KRES, ccGigafida and ccKRES corpora of Slovene language: Compilation, content, use. Institute for Applied Slovene Studies, Ljubljana: Trojina.

  • Berginc, N. L., & Ljubešić, N. (2013). Gigafida and slWaC: Topic comparison. Slovenščina 2.0: Empirical, Applied and Interdisciplinary Research, 1(1), 78–110.

  • Bučar, J. (2017a). Automatically sentiment annotated Slovenian news corpus AutoSentiNews 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1109.

  • Bučar, J. (2017b). Manually sentiment annotated Slovenian news corpus SentiNews 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1110.

  • Bučar, J. (2017c). R crawlers for five Slovenian web media 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1105.

  • Bučar, J. (2017d). Slovene sentiment lexicon JOB 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1112.

  • Bučar, J., Povh, J., & Žnidaršič, M. (2016). Sentiment classification of the Slovenian news texts. In Proceedings of the 9th international conference on computer recognition systems (CORES 2015) (pp. 777–787), Wrocław. Springer, Cham.

  • Ceron, A., Curini, L., & Iacus, S. M. (2015). Using sentiment analysis to monitor electoral campaigns: Method matters-evidence from the United States and Italy. Social Science Computer Review, 33(1), 3–20.

  • Colbaugh, R. & Glass, K. (2010). Estimating sentiment orientation in social media for intelligence monitoring and analysis. In Proceedings of the IEEE international conference on intelligence and security informatics (ISI) (pp. 135–137), Vancouver. IEEE.

  • Das, S. & Chen, M. (2001). Yahoo! for Amazon: Extracting market sentiment from stock message boards. In Proceedings of the Asia Pacific finance association annual conference (APFA), Bangkok.

  • Durant, K. T. & Smith, M. D. (2006). Mining sentiment classification from political web logs. In Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, Philadelphia, PA. ACM, New York.

  • Erjavec, T. (2014). Digital library and corpus of historical Slovene IMP 1.1. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1031.

  • Erjavec, T. & Fišer, D. (2006). Building Slovene WordNet. In Proceedings of the 5th international conference on language resources and evaluation (LREC 2006) (pp. 1678–1683), Genoa. European Language Resources Association.

  • Erjavec, T., Fišer, D., Krek, S., & Ledinek, N. (2010). The JOS linguistically tagged corpus of Slovene. In Proceedings of the 7th international conference on language resources and evaluation (LREC 2010), Valletta. European Language Resources Association.

  • Erjavec, T., Ignat, C., Pouliquen, B., & Steinberger, R. (2005). Massive multi lingual corpus compilation: Acquis Communautaire and ToTaLe. Archives of Control Science, 15(4), 253–264.

  • Erjavec, T. & Krek, S. (2010). Training corpus jos1M 1.1. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1037.

  • Fellbaum, C., et al. (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.

  • Fišer, D., Smailović, J., Erjavec, T., Mozetič, I., & Grčar, M. (2016). Sentiment annotation of Slovene user-generated content. In Proceedings of the 2016 conference language technologies and digital humanities (JTDH 2016) (pp. 65–70), Ljubljana. Faculty of Arts, University of Ljubljana.

  • Glavaš, G., Šnajder, J., & Bašić, B. D. (2012). Semi-supervised acquisition of Croatian sentiment lexicon. In Proceedings of the 15th international conference text, speech and dialogue (pp. 166–173), Brno. Springer.

  • Hatzivassiloglou, V. & McKeown, K. R. (1997). Predicting the semantic orientation of adjectives. In Proceedings of the 35th annual meeting of ACL and 8th conference of the european chapter of ACL (pp. 174–181), Madrid. Association for Computational Linguistics, New Brunswick, NJ.

  • Hsueh, P. Y., Melville, P., & Sindhwani, V. (2009). Data quality from crowdsourcing: A study of annotation selection criteria. In Proceedings of the NAACL HLT (pp. 27–35), Boulder, CO. Association for Computational Linguistics.

  • Jakopin, P. (2006). List of Slovenian headwords 1.1. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1038.

  • Jovanoski, D., Pachovski, V., & Nakov, P. (2015). Sentiment analysis in Twitter for Macedonian. In Proceedings of the international conference on recent advances in natural language processing (RANLP 2015) (pp. 249–257), Hissar.

  • Kadunc, K. & Robnik-Šikonja, M. (2016). Analiza mnenj s pomočjo strojnega učenja in slovenskega leksikona sentimenta [Opinion mining using machine learning and Slovene sentiment lexicon]. In Proceedings of the 2016 conference language technologies and digital humanities (JTDH 2016) (pp. 83–89), Ljubljana. Faculty of Arts, University of Ljubljana.

  • Kadunc, K. & Robnik-Šikonja, M. (2017). Opinion corpus of Slovene web commentaries KKS 1.001. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1115.

  • Kapukaranov, B. & Nakov, P. (2015). Fine-grained sentiment analysis for movie reviews in Bulgarian. In Proceedings of the international conference on recent advances in natural language processing (RANLP 2015) (pp. 266–274), Hissar.

  • Kibriya, A. M., Frank, E., Pfahringer, B., & Holmes, G. (2004). Multinomial Naive Bayes for text categorization revisited. In Australian conference on artificial intelligence (pp. 488–499), Cairns. Springer.

  • Krek, S., Dobrovoljc, K., Erjavec, T., Može, S., Ledinek, N., & Holz, N. (2015). Training corpus ssj500k 1.4. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1052.

  • Dave, K., Lawrence, S., & Pennock, D. M. (2003). Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In Proceedings of the 12th ACM international conference on WWW (pp. 519–528), Budapest. ACM, New York.

  • Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology.

  • Liu, B., Hu, M., & Cheng, J. (2005). Opinion observer: Analyzing and comparing opinions on the web. In Proceedings of the 14th ACM international conference on WWW (pp. 342–351), Bremen. ACM, New York.

  • Ljubešić, N. & Erjavec, T. (2011). hrWaC and slWaC: Compiling web corpora for Croatian and Slovene. In Proceedings of the 14th international conference text, speech and dialogue (pp. 395–402), Pilsen. Springer.

  • Martinc, R. (2013). Measuring sentiment on social network Twitter: Designing a tool and evaluation. Ljubljana: Faculty of Social Sciences, University of Ljubljana.

  • McCallum, A. K. (2002). Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

  • Mozetič, I., Grčar, M., & Smailović, J. (2016). Multilingual Twitter sentiment classification: The role of human annotators. PLoS ONE, 11(5), 1–26.

  • Mozetič, I., Grčar, M., & Smailović, J. (2016). Twitter sentiment for 15 European languages. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1054.

  • Nakov, P., Rosenthal, S., Kiritchenko, S., Mohammad, S. M., Kozareva, Z., Ritter, A., et al. (2016). Developing a successful SemEval task in sentiment analysis of Twitter and other social media texts. Language Resources and Evaluation, 50(1), 35–65.

  • Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. In Proceedings of the 1st workshop on making sense of microposts: Big things come in small packages (pp. 93–98), Heraklion.

  • O’Hare, N., Davy, M., Bermingham, A., Ferguson, P., Sheridan, P., Gurrin, C., et al. (2009). Topic-dependent sentiment analysis of financial blogs. In Proceedings of the 1st ACM international CIKM workshop on topic-sentiment analysis for mass opinion (pp. 9–16), Hong Kong. ACM Press, New York.

  • Okruhlica, A. (2013). Slovak sentiment lexicon induction in absence of labeled data. Master’s thesis, Comenius University Bratislava.

  • Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of ACL-02 conference on empirical methods in natural language processing (pp. 79–86), Philadelphia, PA. Association for Computational Linguistics, Stroudsburg, PA.

  • Perez-Rosas, V., Banea, C., & Mihalcea, R. (2012). Learning sentiment lexicons in Spanish. In Proceedings of the 8th international conference on language resources and evaluation (LREC 2012) (pp. 3077–3081), Istanbul. European Language Resources Association.

  • Pustejovsky, J., & Stubbs, A. (2012). Natural language annotation for machine learning. Newton: O’Reilly Media Inc.

  • Reis, J., Benevenuto, F., de Melo, P. O., Prates, R., Kwak, H., & An, J. (2015). Breaking the news: First impressions matter on online news. arXiv Preprint ArXiv:1503.07921.

  • Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast—but is it good?: Evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing (pp. 254–263), Waikiki, Honolulu. ACM Press, New York.

  • Stone, P. J., Dunphy, D. C., & Smith, M. S. (1966). The General Inquirer: A computer approach to content analysis. Cambridge: MIT Press.

  • Taboada, M., Brooke, J., Tofiloski, M., Voll, K., & Stede, M. (2011). Lexicon-based methods for sentiment analysis. Computational Linguistics, 37(2), 267–307.

  • Turney, P. D. (2002). Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th annual meeting on ACL (pp. 417–424), Philadelphia, PA. Association for Computational Linguistics, Stroudsburg, PA.

  • Veselovská, K. (2013). Czech subjectivity lexicon: A lexical resource for Czech polarity classification. In Proceedings of the 7th international conference Slovko (pp. 279–284), Bratislava. RAM-Verlag, Lüdenscheid.

  • Wawer, A. (2012). Extracting emotive patterns for languages with rich morphology. International Journal of Computational Linguistics and Applications, 3(1), 11–24.

  • Wiebe, J. & Riloff, E. (2005). Creating subjective and objective sentence classifiers from unannotated texts. In Proceedings of the international conference on intelligent text processing and computational linguistics (pp. 486–497).

  • Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques. Burlington: Morgan Kaufmann.

Acknowledgements

This study is based on work supported by the European Union, the European Regional Development Fund, the Slovene Human Resources Development and Scholarship Fund, the Ministry of Education, Science and Sport, Slovenia, and the Young Researcher Programme of the Slovenian Research Agency. Our research was conducted within the framework of the Operational Programme for Strengthening Regional Development Potentials for the period 2007–2013, Development Priority 1: Competitiveness and research excellence, Priority Guideline 1.1: Improving the competitive skills and research excellence. We acknowledge financial support from the Slovenian Research Agency (research core funding No. P2-0103). We also thank the anonymous reviewers for their comments and suggestions, which helped to improve the paper.

Author information

Corresponding author

Correspondence to Jože Bučar.

About this article

Cite this article

Bučar, J., Žnidaršič, M. & Povh, J. Annotated news corpora and a lexicon for sentiment analysis in Slovene. Lang Resources & Evaluation 52, 895–919 (2018). https://doi.org/10.1007/s10579-018-9413-3

  • DOI: https://doi.org/10.1007/s10579-018-9413-3
