
2018 | Book

Artificial Intelligence and Natural Language

7th International Conference, AINL 2018, St. Petersburg, Russia, October 17–19, 2018, Proceedings


About this book

This book constitutes the refereed proceedings of the 7th Conference on Artificial Intelligence and Natural Language, AINL 2018, held in St. Petersburg, Russia, in October 2018. The 19 revised full papers were carefully reviewed and selected from 56 submissions and cover a wide range of topics, including morphology and word-level semantics, sentence and discourse representations, corpus linguistics, language resources, and social interaction analysis.

Table of Contents

Frontmatter

Morphology and Word-Level Semantics

Frontmatter
Deep Convolutional Networks for Supervised Morpheme Segmentation of Russian Language
Abstract
The present paper addresses the task of morphological segmentation for the Russian language. We show that deep convolutional neural networks solve this problem with an F1-score of 98% on morpheme boundaries and outperform existing non-neural approaches.
Alexey Sorokin, Anastasia Kravtsova
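The approach summarized above lends itself to a brief illustration. Below is a minimal sketch, in Python/PyTorch, of a character-level 1D-CNN that tags each character of a word as a morpheme boundary or not; all layer sizes, labels, and the vocabulary are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class CharCNNSegmenter(nn.Module):
    """Tags every character of a word as morpheme boundary / non-boundary."""
    def __init__(self, vocab_size, emb_dim=32, channels=64, kernel=5, n_labels=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv1 = nn.Conv1d(emb_dim, channels, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(channels, channels, kernel, padding=kernel // 2)
        self.out = nn.Linear(channels, n_labels)

    def forward(self, char_ids):                    # (batch, word_len)
        x = self.emb(char_ids).transpose(1, 2)      # (batch, emb_dim, word_len)
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        return self.out(x.transpose(1, 2))          # (batch, word_len, n_labels)

# Toy usage: per-character boundary scores for two padded 8-character words.
model = CharCNNSegmenter(vocab_size=50)
print(model(torch.randint(1, 50, (2, 8))).shape)    # torch.Size([2, 8, 2])
```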
Smart Context Generation for Disambiguation to Wikipedia
Abstract
Wikification is a crucial NLP task that aims to identify entities in text and disambiguate their meaning. While partially solved for English, the problem remains largely untouched for Russian. In this article we present a novel approach to Disambiguation to Wikipedia applied to the Russian language. Inspired by neural machine translation, our method uses an encoder-decoder neural network architecture that translates text tokens into concept embeddings, which are subsequently used as context for disambiguation. To test our hypothesis, we add our context features to the GLOW system, which serves as a baseline. Moreover, we present a publicly available dataset for the Disambiguation to Wikipedia task.
Andrey Sysoev, Irina Nikishina
A Multi-feature Classifier for Verbal Metaphor Identification in Russian Texts
Abstract
The paper presents a supervised machine learning experiment with multiple features for identifying sentences containing verbal metaphors in raw Russian text. We introduce a custom-created training dataset, describe the feature engineering techniques, and discuss the results. The following set of features is applied: distributional semantic features, lexical and morphosyntactic co-occurrence frequencies, flag words, quotation marks, and sentence length. We combine these features into models of varying complexity; the results of the experiment demonstrate that fairly simple models based on lexical, morphosyntactic, and semantic features are able to produce competitive results.
Yulia Badryzlova, Polina Panicheva
Lemmatization for Ancient Languages: Rules or Neural Networks?
Abstract
Lemmatisation, one of the most important stages of text preprocessing, consists in grouping the inflected forms of a word together so that they can be analysed as a single item. This task is often considered solved for most modern languages regardless of their morphological type, but the situation is dramatically different for ancient languages. The rich inflectional systems and high levels of orthographic variation common to these languages, together with a lack of resources, make lemmatising historical data a challenging task. It is becoming more and more important as manuscripts are now being extensively digitised, yet it remains poorly covered in the literature. In this work, I compare a rule-based and a neural-network-based approach to lemmatisation on Early Irish data (Old and Middle Irish are often described together as “Early Irish”).
Oksana Dereza
Named Entity Recognition in Russian with Word Representation Learned by a Bidirectional Language Model
Abstract
Named Entity Recognition is one of the most popular tasks in natural language processing. Pre-trained word embeddings learned from unlabeled text have become a standard component of neural network architectures for NLP tasks. However, in most cases, the recurrent network that operates on word-level representations to produce context-sensitive representations is trained on relatively little labeled data. The Russian language also poses many additional difficulties. In this paper, we present a semi-supervised approach for adding deep contextualized word representations that model both complex characteristics of word usage (e.g., syntax and semantics) and how these usages vary across linguistic contexts (i.e., polysemy). The word vectors are learned functions of the internal states of a deep bidirectional language model, which is pretrained on a large text corpus. We show that these representations can easily be added to existing models and combined with other word representation features. We evaluate our model on the FactRuEval-2016 dataset for named entity recognition in Russian and achieve state-of-the-art results.
Georgy Konoplich, Evgeniy Putin, Andrey Filchenkov, Roman Rybka
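The general idea of adding frozen, deep contextualized representations to an existing tagger can be sketched as follows. The toy bidirectional LM, all dimensions, and the tag set below are assumptions for illustration, not the authors' model.

```python
import torch
import torch.nn as nn

class ToyBiLM(nn.Module):
    """Stand-in for a large pretrained bidirectional LM (illustrative only)."""
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)

    def forward(self, ids):
        h, _ = self.rnn(self.emb(ids))
        return h                                    # (batch, seq, 2 * dim)

class NERWithBiLMFeatures(nn.Module):
    def __init__(self, bilm, vocab_size, bilm_dim, emb_dim=100, hidden=128, n_tags=9):
        super().__init__()
        self.bilm = bilm
        for p in self.bilm.parameters():            # contextual features stay frozen
            p.requires_grad = False
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim + bilm_dim, hidden,
                               batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, token_ids):
        ctx = self.bilm(token_ids)                  # deep contextualized representations
        x = torch.cat([self.emb(token_ids), ctx], dim=-1)
        h, _ = self.encoder(x)
        return self.out(h)                          # per-token tag scores

model = NERWithBiLMFeatures(ToyBiLM(10_000), vocab_size=10_000, bilm_dim=128)
print(model(torch.randint(0, 10_000, (2, 12))).shape)   # torch.Size([2, 12, 9])
```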

Sentence and Discourse Representations

Frontmatter
Supervised Mover’s Distance: A Simple Model for Sentence Comparison
Abstract
We propose a simple neural network model which can learn the relation between sentences by passing their representations obtained from a Long Short-Term Memory (LSTM) network through a Relation Network. The Relation Network module extracts similarity between multiple contextual representations obtained from the LSTM. The aim is to build a model which is simple to implement, light in terms of parameters, and works across multiple supervised sentence comparison tasks. We show good results for the model on two sentence comparison datasets.
Muktabh Mayank Srivastava
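A rough sketch of how LSTM states of two sentences could be passed through a Relation-Network-style module follows; the parameter sizes and the sum aggregation are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SentencePairRelationNet(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=128, rel_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.g = nn.Sequential(nn.Linear(2 * hidden, rel_dim), nn.ReLU())
        self.f = nn.Linear(rel_dim, 1)              # final relation / similarity score

    def encode(self, ids):
        h, _ = self.lstm(self.emb(ids))             # (batch, len, hidden)
        return h

    def forward(self, ids_a, ids_b):
        a, b = self.encode(ids_a), self.encode(ids_b)
        # Pair every contextual state of sentence A with every state of sentence B.
        pairs = torch.cat([a.unsqueeze(2).expand(-1, -1, b.size(1), -1),
                           b.unsqueeze(1).expand(-1, a.size(1), -1, -1)], dim=-1)
        rel = self.g(pairs).sum(dim=(1, 2))         # aggregate pairwise relations
        return self.f(rel)

model = SentencePairRelationNet(vocab_size=1000)
score = model(torch.randint(0, 1000, (4, 7)), torch.randint(0, 1000, (4, 9)))
print(score.shape)                                  # torch.Size([4, 1])
```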
Direct-Bridge Combination Scenario for Persian-Spanish Low-Resource Statistical Machine Translation
Abstract
This paper investigates making effective use of the bridge-language technique to address the bottleneck of minimal parallel resources and improve translation quality for the Persian-Spanish low-resource language pair, using a well-resourced language such as English as the bridge. We apply an optimized direct-bridge combination scenario to enhance translation performance and analyze the effects of this scenario on our case study.
Benyamin Ahmadnia, Javier Serrano, Gholamreza Haffari, Nik-Mohammad Balouchzahi
Automatic Mining of Discourse Connectives for Russian
Abstract
The identification of discourse connectives plays an important role in many discourse processing approaches. Connectives include functional words usually enumerated in grammars (iz-za ‘due to’, blagodarya ‘thanks to’) as well as non-grammaticalized expressions (X vedet k Y ‘X leads to Y’, prichina etogo ‘the cause is’). Both types of connectives signal certain relations between discourse units, but there are no ready-made lists of the second type. We suggest a method, based on vector representations, for expanding a seed list of connectives with candidate non-grammaticalized connectives for Russian. Firstly, we compile a list of patterns for this type of connective. These patterns are based on the following heuristics: the connectives are often used with anaphoric expressions substituting for discourse units (thus, some patterns include special anaphoric elements), and the connectives more frequently occur at the beginning of a sentence or after a comma. Secondly, we build multi-word tokens based on these patterns. Thirdly, we build vector representations for the multi-word tokens that match these patterns. Our experiments based on distributional semantics yield a quite reasonable list of candidate connectives.
Svetlana Toldova, Maria Kobozeva, Dina Pisarevskaya
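The mining pipeline above can be sketched roughly as follows. The seed list, the regular expression, and the word-vector lookup are toy stand-ins chosen for illustration, not the authors' resources.

```python
import re
import numpy as np

SEED = ["из-за этого", "благодаря этому"]            # toy seed connectives
# Heuristic: a candidate starts a sentence or follows a comma and contains an
# anaphoric element standing in for a discourse unit.
PATTERN = re.compile(r"(?:^|,\s*)((?:[\w-]+\s+){0,2}(?:это|этого|этому)\s+[\w-]+)")

def phrase_vector(phrase, w2v, dim=300):
    """Average the vectors of the words in a multi-word token (w2v: word -> array)."""
    vecs = [w2v[w] for w in phrase.split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def mine_candidates(sentences, w2v):
    seed_vec = np.mean([phrase_vector(p, w2v) for p in SEED], axis=0)
    scored = set()
    for sent in sentences:
        for match in PATTERN.finditer(sent.lower()):
            cand = match.group(1)
            vec = phrase_vector(cand, w2v)
            sim = float(vec @ seed_vec /
                        (np.linalg.norm(vec) * np.linalg.norm(seed_vec) + 1e-9))
            scored.add((sim, cand))
    return sorted(scored, reverse=True)              # best candidates first
```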

Corpus Linguistics

Frontmatter
Avoiding Echo-Responses in a Retrieval-Based Conversation System
Abstract
Retrieval-based conversation systems generally tend to highly rank responses that are semantically similar or even identical to the given conversation context. While the system’s goal is to find the most appropriate response, rather than the most semantically similar one, this tendency results in low-quality responses. We refer to this challenge as the echoing problem. To mitigate this problem, we utilize a hard negative mining approach at the training stage. The evaluation shows that the resulting model reduces echoing and achieves better results in terms of Average Precision and Recall@N metrics, compared to the models trained without the proposed approach.
Denis Fedorenko, Nikita Smetanin, Artem Rodichev
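A minimal sketch of hard negative mining at the training stage follows, assuming a dual-encoder scoring model and a margin loss; neither detail is taken from the paper. For each context, the highest-scoring wrong response in the batch is used as the negative, which penalizes the echo-like responses the abstract describes.

```python
import torch
import torch.nn.functional as F

def hard_negative_loss(context_vecs, response_vecs, margin=0.2):
    """context_vecs, response_vecs: (batch, dim); row i of each is a true pair."""
    scores = context_vecs @ response_vecs.t()           # every context vs. every response
    positive = scores.diag()                            # scores of the true pairs
    masked = scores - torch.eye(scores.size(0)) * 1e9   # hide the true pairs
    hardest_neg, _ = masked.max(dim=1)                  # highest-scoring wrong response
    return F.relu(margin - positive + hardest_neg).mean()

# Usage (with any encoder producing fixed-size vectors):
# loss = hard_negative_loss(encode(contexts), encode(responses))
print(hard_negative_loss(torch.randn(8, 16), torch.randn(8, 16)))
```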
A Model-Free Comorbidities-Based Events Prediction in ICU Unit
Abstract
In this work we focus on the recently introduced “medical concept vectors” (MCV) extracted from electronic health records (EHR), explore several methods for predicting events in a patient’s medical history, and provide our own novel state-of-the-art method for this problem. We use MCVs to analyze publicly available de-identified EHR data, with a strong focus on a fair comparison of several different models applied to the prediction of patient death, heart failure, and chronic liver diseases (cirrhosis and fibrosis). We propose an ontology-based regularization method that can be used to pre-train MCV embeddings. The approach we use to predict these diseases and conditions can be applied to other prediction tasks.
Tatiana Malygina, Ivan Drokin
Explicit Semantic Analysis as a Means for Topic Labelling
Abstract
This paper deals with a method for topic labelling that makes use of Explicit Semantic Analysis (ESA). The top words of a topic are given to ESA as input, and the algorithm yields the titles of the Wikipedia articles considered most relevant to that input. An alternative approach, which serves as a strong baseline, uses the titles of the first results returned by a search engine when the topic words are submitted as a query. In both methods, the obtained titles are automatically analysed, and phrases characterizing the topic are constructed from them with a graph algorithm and assigned weights. Within the proposed ESA-based method, post-processing is then performed to sort candidate labels according to empirically formulated rules. Experiments were conducted on a corpus of Russian encyclopaedic texts on linguistics. The results justify applying ESA to this task: although it performs slightly worse than the search-engine-based method in terms of label quality, it is a reasonable alternative because it exhibits two advantages that the baseline method lacks.
Anna Kriukova, Aliia Erofeeva, Olga Mitrofanova, Kirill Sukharev
Four Keys to Topic Interpretability in Topic Modeling
Abstract
Interpretability of the topics built by topic modeling is an important issue for researchers applying this technique. We suggest a new interpretability score, selected from a parametric space of interpretability scores defined by four components: a splitting method, a probability estimation method, a confirmation measure, and an aggregation function. We design a regularizer for topic modeling that represents this score. The resulting topic modeling method significantly outperforms all analogous methods in reflecting human assessments of topic interpretability.
Andrey Mavrin, Andrey Filchenkov, Sergei Koltcov
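One concrete instantiation of the four components named above (here: word pairs as the splitting method, document co-occurrence frequencies for probability estimation, NPMI as the confirmation measure, and the arithmetic mean as the aggregation function) might look like the sketch below; it illustrates the parametric space, not the particular score selected in the paper.

```python
import itertools
import math

def npmi_interpretability(top_words, docs, eps=1e-12):
    """docs: iterable of sets of words; top_words: top words of one topic."""
    n = len(docs)
    df = lambda *ws: sum(all(w in d for w in ws) for d in docs) / n
    scores = []
    for w1, w2 in itertools.combinations(top_words, 2):   # splitting: word pairs
        p1, p2, p12 = df(w1), df(w2), df(w1, w2)           # probability estimation
        pmi = math.log((p12 + eps) / (p1 * p2 + eps))
        scores.append(pmi / -math.log(p12 + eps))          # confirmation: NPMI
    return sum(scores) / len(scores)                       # aggregation: mean

docs = [{"год", "месяц", "неделя"}, {"год", "бюджет"}, {"месяц", "год", "день"}]
print(npmi_interpretability(["год", "месяц", "неделя"], docs))
```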

Language Resources

Frontmatter
Cleaning Up After a Party: Post-processing Thesaurus Crowdsourced Data
Abstract
The study deals with post-processing of a noisy collection of synsets created using crowdsourcing. First, we cluster long synsets in three different ways. Second, we apply four cluster cleaning techniques based either on word popularity or on word embeddings. Evaluation shows that the method based on word embeddings and existing dictionary definitions delivers the best results.
Oksana Antropova, Elena Arslanova, Maxim Shaposhnikov, Pavel Braslavski, Mikhail Mukhin
A Comparative Study of Publicly Available Russian Sentiment Lexicons
Abstract
Sentiment lexicons play an important role in sentiment analysis and opinion mining systems. The article examines eight publicly available Russian sentiment lexicons. A joint analysis of these lexicons is conducted by finding their unions and intersections and by analysing the distribution of parts of speech. To study the quality of the lexicons, sentiment classification is performed with an SVM classifier and the TF-IDF model. Text corpora of reviews of works of art (books and movies), organizations (banks and hotels), and goods (kitchen appliances) were built for this purpose. The lexicons are compared in terms of classification quality, and also on the basis of a linear regression model that reflects the dependence of their F1-measure on their TF-IDF model size. The resulting union lexicon most fully reflects the sentiment lexis of present-day Russian and can be used both in scientific research and in applied sentiment analysis systems.
Evgeny Kotelnikov, Tatiana Peskisheva, Anastasia Kotelnikova, Elena Razova
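The evaluation setup can be sketched as follows, assuming scikit-learn and placeholder texts, labels, and lexicon; the real study uses the review corpora and lexicons described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

lexicon = ["хороший", "плохой", "отличный", "ужасный"]      # one lexicon (toy example)
texts = ["отличный фильм", "ужасный сервис", "хороший банк", "плохой отель"]
labels = [1, 0, 1, 0]                                        # 1 = positive review

clf = make_pipeline(
    TfidfVectorizer(vocabulary=lexicon),    # the lexicon defines the feature space
    LinearSVC(),
)
# Compare lexicons by their cross-validated F1 on the same review corpus.
print(cross_val_score(clf, texts, labels, cv=2, scoring="f1").mean())
```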
Acoustic Features of Speech of Typically Developing Children Aged 5–16 Years
Abstract
The study investigates the formation of acoustic features of speech in typically developing (TD) Russian-speaking children. Its purpose is to describe the dynamics of the temporal and spectral characteristics of words produced by children aged 5–16 years depending on their gender and age. A decrease in the duration of stressed and unstressed vowels in children’s words up to the age of 13 years is revealed. Pitch values of vowels in words decrease significantly up to the age of 14 years in girls and 16 years in boys, and pitch values of vowels in girls’ words are higher than the corresponding values for boys. Differences in pitch values and the vowel articulation index between boys and girls at different ages are shown. The obtained data on the acoustic features of the speech of TD children can be used as a normative basis in artificial intelligence systems for teaching children, for creating alternative communication systems for children with atypical development, and for automatic recognition of child speech.
Alexey Grigorev, Olga Frolova, Elena Lyakso

Social Interaction Analysis

Frontmatter
Profiling the Age of Russian Bloggers
Abstract
The task of predicting the demographics of social media users, bloggers, and authors of other types of online texts is crucial for marketing, security, etc. However, most papers on authorship profiling deal with author gender prediction. In addition, most studies are performed on English-language corpora, and very little work has been done in this area for the Russian language. Filling this gap will provide multi-lingual insight into age-specific linguistic features and is a crucial step towards online security management in social networks. We present the first age-annotated dataset in Russian. The dataset contains blogs of 1260 authors from LiveJournal and is balanced with respect to both the age group and the gender of the author. We perform age classification experiments (for the age groups 20–30, 30–40, and 40–50) on the presented data using basic linguistic features (lemmas, part-of-speech unigrams and bigrams, etc.) and obtain a solid baseline for age classification in Russian. We also consider age as a continuous variable and build regression models to predict it. Finally, we analyze significant features and provide interpretation where possible.
Tatiana Litvinova, Alexandr Sboev, Polina Panicheva
Stierlitz Meets SVM: Humor Detection in Russian
Abstract
In this paper, we investigate the problem of humor detection for the Russian language. For our experiments, we used a large collection of jokes from social media and a contrasting collection of non-funny sentences, as well as a small collection of puns. We implemented a large set of features and trained several SVM classifiers. The results are promising and establish a baseline for further research in this direction.
Anton Ermilov, Natasha Murashkina, Valeria Goryacheva, Pavel Braslavski
Interactive Attention Network for Adverse Drug Reaction Classification
Abstract
Detection of new adverse drug reactions serves both to improve the quality of medications and to support drug reprofiling. Social media and electronic clinical reports are becoming increasingly popular sources of health-related information, such as for the identification of adverse drug reactions. One of the tasks in extracting adverse drug reactions from social media is the classification of entities that describe a state of health. In this paper, we investigate the applicability of the Interactive Attention Network to identifying adverse drug reactions in user reviews. We formulate this problem as a binary classification task and show the effectiveness of the method on a number of publicly available corpora.
Ilseyar Alimova, Valery Solovyev
Modeling Propaganda Battle: Decision-Making, Homophily, and Echo Chambers
Abstract
Studies of the social patterns that emerge as a result of propaganda and rumors generally neglect the behavior of the individuals who constitute these patterns, which places obvious limitations on the scope of such research. We propose a dynamical model of the mechanics of polarization and the formation of echo chambers. The model is based on the Rashevsky neurological scheme of decision-making.
Alexander Petrov, Olga Proncheva
Backmatter
Metadata
Title
Artificial Intelligence and Natural Language
Edited by
Dmitry Ustalov
Andrey Filchenkov
Lidia Pivovarova
Jan Žižka
Copyright Year
2018
Electronic ISBN
978-3-030-01204-5
Print ISBN
978-3-030-01203-8
DOI
https://doi.org/10.1007/978-3-030-01204-5
