
2020 | Book

Artificial Intelligence and Natural Language

9th Conference, AINL 2020, Helsinki, Finland, October 7–9, 2020, Proceedings


About this book

This book constitutes the refereed proceedings of the 9th Conference on Artificial Intelligence and Natural Language, AINL 2020, held in Helsinki, Finland, in October 2020.
The 11 revised full papers and 3 short papers were carefully reviewed and selected from 36 submissions. Additionally, the volume includes 1 shared task paper. The papers present recent research in the areas of text mining, speech technologies, dialogue systems, information retrieval, machine learning, artificial intelligence, and robotics.

Table of contents

Frontmatter
PolSentiLex: Sentiment Detection in Socio-Political Discussions on Russian Social Media
Abstract
We present a freely available Russian language sentiment lexicon PolSentiLex designed to detect sentiment in user-generated content related to social and political issues. The lexicon was generated from a database of posts and comments of the top 2,000 LiveJournal bloggers posted during one year (\(\sim \)1.5 million posts and 20 million comments). Following a topic modeling approach, we extracted 85,898 documents that were used to retrieve domain-specific terms. This term list was then merged with several external sources. Together, they formed a lexicon (16,399 units) marked up using a crowdsourcing strategy. A sample of Russian native speakers (n = 105) was asked to assess words’ sentiment given the context of their use (randomly paired) as well as the prevailing sentiment of the respective texts. In total, we received 59,208 complete annotations for both texts and words. Several versions of the marked-up lexicon were experimented with, and the final version was tested for quality against the only other freely available Russian language lexicon and against three machine learning algorithms. All experiments were run on two different collections. They have shown that, in terms of \(\text {F}_{\text {macro}}\), lexicon-based approaches outperform machine learning by 11%, and our lexicon outperforms the alternative one by 11% on the first collection, and by 7% on the negative scale of the second collection while showing similar quality on the positive scale and being three times smaller. Our lexicon also outperforms or is similar to the best existing sentiment analysis results for other types of Russian-language texts.
Olessia Koltsova, Svetlana Alexeeva, Sergei Pashakhin, Sergei Koltsov
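The \(\text {F}_{\text {macro}}\) metric used in the evaluation above averages per-class F1 scores. A minimal pure-Python sketch, run on toy labels (not the paper's data):

```python
def f_macro(y_true, y_pred, classes):
    """Macro-averaged F1: compute F1 per class, then take the unweighted mean."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)
```

Because each class contributes equally regardless of its frequency, the macro average rewards methods that handle rare sentiment classes well.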
Automatic Detection of Hidden Communities in the Texts of Russian Social Network Corpus
Abstract
This paper proposes a linguistically rich approach to hidden community detection, tested in experiments with a Russian corpus of VKontakte posts. Modern algorithms for hidden community detection are based on graph theory and leave the linguistic features of the analyzed texts out of account. The authors have developed a new hybrid approach to the detection of hidden communities, combining author-topic modeling and automatic topic labeling. Specific linguistic parameters of Russian posts were revealed for correct language processing. The results justify the use of the algorithm, which can be further integrated with already developed graph methods.
Ivan Mamaev, Olga Mitrofanova
Dialog Modelling Experiments with Finnish One-to-One Chat Data
Abstract
We analyzed two conversational corpora in Finnish: public library question-answering (QA) data and a private medical chat dataset. We developed response retrieval (ranking) models using TF-IDF, StarSpace, ESIM and BERT methods. These four represent techniques ranging from simple and classical ones to recent pretrained transformer neural networks. We evaluated the effect of different preprocessing strategies, including raw text, casing, lemmatization and spell-checking, for the different methods. Using our medical chat data, we also developed a novel three-stage preprocessing pipeline with speaker role classification. We found the BERT model pretrained on Finnish (FinBERT) an unambiguous winner in ranking accuracy, reaching 92.2% for the medical chat and 98.7% for the library QA in the 1-out-of-10 response ranking task, where the chance level was 10%. The best accuracies were reached using uncased text with spell-checking (BERT models) or lemmatization (non-BERT models). Preprocessing had less impact for BERT models than for the classical and other neural network models. Furthermore, we found the TF-IDF method still a strong baseline for the vocabulary-rich library QA task, even surpassing the more advanced StarSpace method. Our results highlight the complex interplay between preprocessing strategies and model type when choosing the optimal approach in chat-data modelling. Our study is the first work on dialogue modelling using neural networks for the Finnish language. It is also the first of its kind to use real medical chat data. Our work contributes towards the development of automated chatbots in the professional domain.
Janne Kauttonen, Lili Aunimo
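The 1-out-of-10 response ranking task pairs a query with ten candidate responses and asks the model to rank the true one highest. A minimal sketch of the TF-IDF baseline on toy data (the hypothetical queries below stand in for the non-public corpora):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build simple TF-IDF vectors for a list of tokenized documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return num / (nu * nv) if nu and nv else 0.0

def rank_best_response(query, candidates):
    """Return the index of the candidate most similar to the query.
    For simplicity, TF-IDF is fit on the query plus candidates only."""
    vecs = tfidf_vectors([query] + candidates)
    sims = [cosine(vecs[0], c) for c in vecs[1:]]
    return max(range(len(sims)), key=sims.__getitem__)
```

A real system would fit IDF weights on the whole corpus rather than the candidate pool, but the ranking logic is the same.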
Advances of Transformer-Based Models for News Headline Generation
Abstract
Pretrained language models based on the Transformer architecture are behind recent breakthroughs in many areas of NLP, including sentiment analysis, question answering, and named entity recognition. Headline generation is a special kind of text summarization task. To succeed in it, models need strong natural language understanding that goes beyond the meaning of individual words and sentences, together with the ability to distinguish essential information. In this paper, we fine-tune two pretrained Transformer-based models (mBART and BertSumAbs) for this task and achieve new state-of-the-art results on the RIA and Lenta datasets of Russian news. BertSumAbs increases ROUGE on average by 2.9 and 2.0 points, respectively, over the previous best scores achieved by Phrase-Based Attentional Transformer and CopyNet.
Alexey Bukhtiyarov, Ilya Gusev
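ROUGE, the metric reported above, measures n-gram overlap between a generated headline and a reference. A simplified recall-oriented sketch (the official ROUGE also reports precision and F1, which this toy version omits):

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Simplified recall-oriented ROUGE-N: clipped n-gram overlap of the
    candidate with the reference, divided by the reference n-gram count."""
    def ngrams(tokens):
        return Counter(zip(*(tokens[i:] for i in range(n))))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```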
An Explanation Method for Black-Box Machine Learning Survival Models Using the Chebyshev Distance
Abstract
A new modification of the explanation method SurvLIME called SurvLIME-Inf for explaining machine learning survival models is proposed. The basic idea behind SurvLIME as well as SurvLIME-Inf is to apply the Cox proportional hazards model to approximate the black-box survival model at the local area around a test example. The Cox model is used due to the linear relationship of covariates. In contrast to SurvLIME, the proposed modification uses \(L_{\infty }\)-norm for defining distances between approximating and approximated cumulative hazard functions. This leads to a simple linear programming problem for determining important features and for explaining the black-box model prediction. Moreover, SurvLIME-Inf outperforms SurvLIME when the training set is very small. Numerical experiments with synthetic and real datasets demonstrate the SurvLIME-Inf efficiency.
Lev V. Utkin, Maxim S. Kovalev, Ernest M. Kasimov
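The \(L_{\infty }\)-norm that distinguishes SurvLIME-Inf from SurvLIME is the Chebyshev distance between cumulative hazard functions. Assuming both functions are evaluated on a shared time grid, it reduces to a single maximum:

```python
def chebyshev_distance(chf_a, chf_b):
    """L-infinity (Chebyshev) distance between two cumulative hazard
    functions evaluated at the same grid of time points."""
    return max(abs(a - b) for a, b in zip(chf_a, chf_b))
```

Minimizing this maximum deviation over the Cox model's coefficients is what yields the linear programming formulation mentioned in the abstract.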
Unsupervised Neural Aspect Extraction with Related Terms
Abstract
The tasks of aspect identification and term extraction remain challenging in natural language processing. While supervised methods achieve high scores, they are hard to use in real-world applications due to the lack of labelled datasets. Unsupervised approaches outperform these methods on several tasks, but it is still a challenge to extract both an aspect and a corresponding term, particularly in the multi-aspect setting. In this work, we present a novel unsupervised neural network with a convolutional multi-attention mechanism that extracts (aspect, term) pairs simultaneously, and demonstrate its effectiveness on a real-world dataset. We apply a special loss aimed at improving the quality of multi-aspect extraction. The experimental study demonstrates that with this loss we increase precision not only in this joint setting but also on aspect prediction alone.
Timur Sokhin, Maria Khodorchenko, Nikolay Butakov
Predicting Eurovision Song Contest Results Using Sentiment Analysis
Abstract
Over a million tweets were analyzed using various methods in an attempt to predict the results of the Eurovision Song Contest televoting. Different methods of sentiment analysis (English, multilingual polarity lexicons and deep learning) and translating the focus language tweets into English were used to determine the method that produced the best prediction for the contest. Furthermore, we analyzed the effect of sampling tweets during different periods, namely during the performances and/or during the televoting phase of the competition. The quality of the predictions was assessed through correlations between the actual ranks of the televoting and the predicted ranks. The prediction was based on the application of an adjusted Eurovision televoting scoring system to the results of the sentiment analysis of tweets. The predicted ranks for the performances yielded Spearman \(\rho \) correlation coefficients of 0.62 and 0.74 during the televoting period for the lexicon sentiment-based and deep learning approaches, respectively.
Iiro Kumpulainen, Eemil Praks, Tenho Korhonen, Anqi Ni, Ville Rissanen, Jouko Vankka
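The Spearman \(\rho \) used above compares predicted and actual rankings. For rankings without ties it has a closed form based on squared rank differences; a minimal sketch on toy scores:

```python
def spearman_rho(x, y):
    """Spearman rank correlation between two score lists (no ties assumed),
    via rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    def ranks(values):
        order = sorted(range(len(values)), key=values.__getitem__)
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A value of 1 means the predicted ordering matches the televoting ordering exactly; -1 means it is fully reversed.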
Improving Results on Russian Sentiment Datasets
Abstract
In this study, we test standard neural network architectures (CNN, LSTM, BiLSTM) and recently introduced BERT architectures on previous Russian sentiment evaluation datasets. We compare two variants of Russian BERT and show that for all sentiment tasks in this study the conversational variant of Russian BERT performs better. The best results were achieved by the BERT-NLI model, which treats sentiment classification as a natural language inference task. On one of the datasets, this model practically reaches the human level.
Anton Golubev, Natalia Loukachevitch
Dataset for Automatic Summarization of Russian News
Abstract
Automatic text summarization has been studied in a variety of domains and languages, but this does not hold for the Russian language. To overcome this issue, we present Gazeta, the first dataset for summarization of Russian news. We describe the properties of this dataset and benchmark several extractive and abstractive models. We demonstrate that the dataset constitutes a valid task for Russian text summarization methods. Additionally, we show the pretrained mBART model to be useful for Russian text summarization.
Ilya Gusev
Dataset for Evaluation of Mathematical Reasoning Abilities in Russian
Abstract
We present a Russian version of the DeepMind Mathematics Dataset. The original dataset is synthetically generated using inference rules and a set of linguistic templates. We translate the linguistic templates into Russian, leaving the inference part unchanged. As a result, we obtain a mathematically parallel dataset in which the same mathematical problems are explored in another language. We reproduce the experiment from the original paper to check whether the performance of a Transformer model is affected by the language in which the math problems are expressed. Though our contribution is small compared to the original work, we think it is valuable given that languages other than English (and Russian in particular) are underrepresented.
Mikhail Nefedov
Searching Case Law Judgments by Using Other Judgments as a Query
Abstract
This paper presents an effective method for case law retrieval based on semantic document similarity and a web application for querying Finnish case law. The novelty of the work comes from the idea of using legal documents for automatic formulation of the query, including case law judgments, legal case descriptions, or other texts. The query documents may be in various formats, including image files with text content. This approach allows efficient search for similar documents without the need to specify a query string or keywords, which can be difficult in this use case. The application leverages two traditional word frequency based methods, TF-IDF and LDA, alongside two modern neural network methods, Doc2Vec and Doc2VecC. Effectiveness of the approach for document relevance ranking has been evaluated using a gold standard set of inter-document similarities. We show that a linear combination of similarities derived from the individual models provides a robust automatic similarity assessment for ranking the case law documents for retrieval.
Sami Sarsa, Eero Hyvönen
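The linear combination of per-model similarities described above can be sketched as follows; the model names and weights here are hypothetical placeholders, not the values tuned in the paper:

```python
def rank_documents(per_model_scores, weights):
    """Rank candidate documents by a weighted linear combination of
    per-model similarity scores (higher combined score ranks first).

    per_model_scores: {model_name: [score per candidate document]}
    weights: {model_name: weight}"""
    n = len(next(iter(per_model_scores.values())))
    combined = [sum(weights[m] * scores[i] for m, scores in per_model_scores.items())
                for i in range(n)]
    return sorted(range(n), key=combined.__getitem__, reverse=True)
```

In the paper's setting the individual scores would come from TF-IDF, LDA, Doc2Vec and Doc2VecC similarity to the query document.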
GenPR: Generative PageRank Framework for Semi-supervised Learning on Citation Graphs
Abstract
Nowadays, semi-supervised learning (SSL) on citation graph data sets is a rapidly growing area of research. However, recently proposed graph-based SSL algorithms use a default adjacency matrix with binary weights on edges (citations), which causes a loss of node (paper) similarity information. In this work, we therefore propose a framework that embeds PageRank SSL in a generative model. The framework allows joint training of the nodes' latent-space representation and label spreading through the adjacency matrix reweighted by node similarities in the latent space. We explain how a generative model can improve accuracy and reduce the number of iteration steps for PageRank SSL. Moreover, we show that our framework outperforms the best graph-based SSL algorithms on four public citation graph data sets and improves the interpretability of classification results.
Mikhail Kamalov, Konstantin Avrachenkov
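The label-spreading component underlying PageRank SSL can be illustrated with a toy iteration on a small weighted graph; this is a generic sketch of personalized-PageRank label propagation, not the GenPR framework itself (which additionally learns the edge weights from latent node embeddings):

```python
def pagerank_label_spread(adj, seeds, num_classes, alpha=0.9, iters=50):
    """Toy personalized-PageRank label spreading on a weighted graph.

    adj: {node: [(neighbor, weight), ...]} for nodes 0..n-1
    seeds: {node: class_index} for the labelled nodes
    Returns a predicted class index for every node."""
    n = max(adj) + 1
    H = [[0.0] * num_classes for _ in range(n)]   # one-hot seed labels
    for node, c in seeds.items():
        H[node][c] = 1.0
    F = [[0.0] * num_classes for _ in range(n)]   # spread label mass
    for _ in range(iters):
        # restart term keeps seed labels anchored; the rest diffuses along edges
        new_f = [[(1 - alpha) * H[i][c] for c in range(num_classes)]
                 for i in range(n)]
        for i in range(n):
            deg = sum(w for _, w in adj[i]) or 1.0
            for j, w in adj[i]:
                for c in range(num_classes):
                    new_f[j][c] += alpha * (w / deg) * F[i][c]
        F = new_f
    return [max(range(num_classes), key=row.__getitem__) for row in F]
```

With binary edge weights every neighbor contributes equally; reweighting edges by node similarity, as GenPR does, lets closer papers pull more label mass.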
Finding New Multiword Expressions for Existing Thesaurus
Abstract
In this paper we study the task of adding new multiword expressions (MWEs) to an existing thesaurus. Standard methods of MWE discovery (statistical, context, distributional measures) can efficiently detect the most prominent MWEs. However, given a large number of MWEs already present in a lexical resource, those methods fail to provide sufficient results in extracting unseen expressions. We show that the information deduced from the thesaurus itself is more useful than observed frequency and other corpus statistics in detecting less prominent expressions. Focusing on nominal bigrams (Adj-Noun and Noun-Noun) in Russian, we propose a number of measures making use of thesaurus statistics (e.g. the number of expressions with a given word present in the thesaurus), which significantly outperform standard methods based on corpus statistics or word embeddings.
Petr Rossyaykin, Natalia Loukachevitch
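Pointwise mutual information (PMI) is one of the standard corpus-statistics measures that thesaurus-based measures are compared against in work of this kind. A minimal sketch on a toy English token stream (the paper itself works with Russian nominal bigrams):

```python
import math
from collections import Counter

def pmi_bigrams(tokens):
    """PMI for adjacent word pairs: log of the ratio between the observed
    bigram probability and the product of unigram probabilities."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    return {
        (a, b): math.log((c / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        for (a, b), c in bigrams.items()
    }
```

High-PMI bigrams co-occur more often than their word frequencies alone would predict, which is why PMI surfaces prominent MWEs but struggles with rarer ones.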
Matching LIWC with Russian Thesauri: An Exploratory Study
Abstract
In Author Profiling research, there is a growing interest in lexical resources providing various psychologically meaningful word categories. One such instrument is Linguistic Inquiry and Word Count (LIWC), which was compiled manually in English and translated into many other languages. We argue that the resource contains a lot of subjectivity, which is further increased in the translation process. As a result, the translated lexical resource is not linguistically transparent. In order to address this issue, we translate the resource from English to Russian semi-automatically, analyze the translation in terms of agreement and match the resulting translation with two Russian linguistic thesauri. One of the thesauri is chosen as a better match for the psychologically meaningful categories in question. We further apply the linguistic thesaurus to analyze the psychologically meaningful word categories in two Author Profiling tasks based on Russian texts. Our results indicate that linguistically motivated thesauri not only provide objective and linguistically motivated content, but also result in significant correlates of certain psychological states, replicating evidence obtained with hand-crafted lexical resources.
Polina Panicheva, Tatiana Litvinova
Chinese-Russian Shared Task on Multi-domain Translation
Abstract
We present the results of the first shared task on Machine Translation (MT) from Chinese into Russian, which is the only MT competition for this pair of languages to date. The task for participants was to train a general-purpose MT system that performs reasonably well on very diverse text domains and styles without additional fine-tuning. Eleven teams participated in the competition; some of the submitted models showed reasonably good performance, topping out at 19.7 BLEU.
Valentin Malykh, Varvara Logacheva
Backmatter
Metadata
Title
Artificial Intelligence and Natural Language
Edited by
Andrey Filchenkov
Janne Kauttonen
Lidia Pivovarova
Copyright year
2020
Electronic ISBN
978-3-030-59082-6
Print ISBN
978-3-030-59081-9
DOI
https://doi.org/10.1007/978-3-030-59082-6
