2018 | Book

Artificial Intelligence and Natural Language

6th Conference, AINL 2017, St. Petersburg, Russia, September 20–23, 2017, Revised Selected Papers

About this book

This book constitutes the refereed proceedings of the 6th Conference on Artificial Intelligence and Natural Language, AINL 2017, held in St. Petersburg, Russia, in September 2017.

The 13 revised full papers and 4 revised short papers were carefully reviewed and selected from 35 submissions. The papers are organized in topical sections on social interaction analysis, speech processing, information extraction, Web-scale data processing, computation morphology and word embeddings, and machine learning. The volume also contains 6 papers participating in the Russian paraphrase detection shared task.

Table of Contents

Frontmatter
Erratum to: Multi-objective Topic Modeling for Exploratory Search in Tech News
Anastasia Ianina, Lev Golitsyn, Konstantin Vorontsov

Social Interaction Analysis

Frontmatter
Semantic Feature Aggregation for Gender Identification in Russian Facebook
Abstract
The goal of the current work is to evaluate semantic feature aggregation techniques in a task of gender classification of public social media texts in Russian. We collect Facebook posts of Russian-speaking users and apply them as a dataset for two topic modelling techniques and a distributional clustering approach. The output of the algorithms is applied as a feature aggregation method in a task of gender classification based on a smaller Facebook sample. The classification performance of the best model is favorably compared against the lemmas baseline and the state-of-the-art results reported for a different genre or language. The resulting successful features are exemplified, and the differences between the three techniques in terms of classification performance and feature contents are discussed, with the best technique clearly outperforming the others.
Polina Panicheva, Aliia Mirzagitova, Yanina Ledovaya
Using Linguistic Activity in Social Networks to Predict and Interpret Dark Psychological Traits
Abstract
Studying the relationships between one’s psychological characteristics and linguistic behaviour is a problem of profound importance in many fields ranging from psychology to marketing, but there are very few works of this kind on Russian-speaking samples. We use Latent Dirichlet Allocation on Facebook status updates to extract interpretable features that we then use to identify Facebook users with certain negative psychological traits (the so-called Dark Triad: narcissism, psychopathy, and Machiavellianism) and to find the themes that are most important to such individuals.
Arseny Moskvichev, Marina Dubova, Sergey Menshov, Andrey Filchenkov
Boosting a Rule-Based Chatbot Using Statistics and User Satisfaction Ratings
Abstract
Using data from user-chatbot conversations where users have rated the answers as good or bad, we propose a more efficient alternative to a chatbot’s keyword-based answer retrieval heuristic. We test two neural network approaches to the near-duplicate question detection task as a first step towards a better answer retrieval method. A convolutional neural network architecture gives promising results on this difficult task.
Octavia Efraim, Vladislav Maraev, João Rodrigues

Speech Processing

Frontmatter
Deep Learning for Acoustic Addressee Detection in Spoken Dialogue Systems
Abstract
The addressee detection problem arises in real spoken dialogue systems (SDSs), which are supposed to distinguish the speech addressed to them from the speech addressed to real humans. In this work, several modalities were analyzed, and acoustic data was chosen as the main modality because it is the most flexible to use in modern SDSs. To resolve the problem of addressee detection, deep learning methods such as fully-connected neural networks and Long Short-Term Memory were applied in the present study. The developed models were improved by using different optimization methods, activation functions and a learning rate optimization method. The models were also optimized by using a recursive feature elimination method and multiple initialization to increase the training speed. A fully-connected neural network reaches an average recall of 0.78, while a Long Short-Term Memory neural network shows an average recall of 0.65. Advantages and disadvantages of both architectures are provided for the particular task.
Aleksei Pugachev, Oleg Akhtiamov, Alexey Karpov, Wolfgang Minker
Deep Neural Networks in Russian Speech Recognition
Abstract
Hybrid speech recognition systems incorporating deep neural networks (DNNs) with Hidden Markov Models/Gaussian Mixture Models have achieved good results. We propose applying various DNNs in automatic recognition of Russian continuous speech. We used different neural network models such as Convolutional Neural Networks (CNNs), modifications of Long Short-Term Memory (LSTM), Residual Networks and Recurrent Convolutional Networks (RCNNs). The presented model achieved a 7.5% reduction in word error rate (WER) compared with the Kaldi baseline. Experiments are performed with extra-large vocabulary (more than 30 h) of Russian speech.
Nikita Markovnikov, Irina Kipyatkova, Alexey Karpov, Andrey Filchenkov
Combined Feature Representation for Emotion Classification from Russian Speech
Abstract
Acoustic feature extraction for emotion classification is possible on different levels. Frame-level features provide low-level descriptors that preserve the temporal structure of the utterance. Utterance-level features, on the other hand, represent functionals applied to the low-level descriptors and contain important information about the speaker's emotional state. Utterance-level features are particularly useful for determining emotion intensity; however, they lose information about temporal changes of the signal. Another drawback is the often insufficient number of feature vectors for complex classification tasks. One way to overcome these problems is to combine frame-level and utterance-level features to take advantage of both methods. This paper proposes to obtain a low-level feature representation by feeding frame-level descriptor sequences to a Long Short-Term Memory (LSTM) network, combine the outcome with the Principal Component Analysis (PCA) representation of utterance-level features, and make the final prediction with a logistic regression classifier.
Oxana Verkholyak, Alexey Karpov

Information Extraction

Frontmatter
Active Learning with Adaptive Density Weighted Sampling for Information Extraction from Scientific Papers
Abstract
The paper addresses the task of information extraction from scientific literature with machine learning methods. In particular, the tasks of definition and result extraction from scientific publications in Russian are considered. We note that annotating scientific texts to create a training dataset is a very labor-intensive and expensive process. To tackle this problem, we propose methods and tools based on active learning. We describe and evaluate a novel adaptive density-weighted sampling (ADWeS) meta-strategy for active learning. The experiments demonstrate that active learning can be a very efficient technique for scientific text mining, and the proposed meta-strategy can be beneficial for corpus annotation with strongly skewed class distribution. We also investigate informative task-independent features for information extraction from scientific texts and present an openly available tool for corpus annotation, which is equipped with ADWeS and compatible with well-known sampling strategies.
Roman Suvorov, Artem Shelmanov, Ivan Smirnov
Application of a Hybrid Bi-LSTM-CRF Model to the Task of Russian Named Entity Recognition
Abstract
Named Entity Recognition (NER) is one of the most common tasks in natural language processing. The purpose of NER is to find and classify tokens in text documents into predefined categories called tags, such as person names, quantity expressions, percentage expressions, names of locations and organizations, as well as expressions of time, currency and others. Although a number of approaches have been proposed for this task in the Russian language, there is still substantial potential for better solutions. In this work, we studied several deep neural network models, starting from a vanilla Bi-directional Long Short Term Memory (Bi-LSTM), then supplementing it with Conditional Random Fields (CRF) as well as highway networks, and finally adding external word embeddings. All models were evaluated across three datasets: Gareev’s, Person-1000 and FactRuEval 2016. We found that extending the Bi-LSTM model with CRF significantly increased the quality of predictions. Encoding input tokens with external word embeddings reduced training time and made it possible to achieve state-of-the-art results for the Russian NER task.
The Anh Le, Mikhail Y. Arkhipov, Mikhail S. Burtsev

Web-Scale Data Processing

Frontmatter
Employing Wikipedia Data for Coreference Resolution in Russian
Abstract
Semantic information has been deemed a valuable resource for solving the task of coreference resolution by many researchers. Unfortunately, not much has been done to use such information when working with Russian data. This work describes the first step of a research project that attempts to create a coreference resolution system for Russian based on semantic data; this step is concerned with using Wikipedia information for the task. The obtained results are comparable to those for English data, which gives reason to expect improvement in further steps of the research.
Ilya Azerkovich
Building Wordnet for Russian Language from Ru.Wiktionary
Abstract
This paper presents a method for the fully automatic transformation of the free-content Russian dictionary ru.wiktionary into a WordNet-like thesaurus. The primary concern of this study is to describe a procedure for relating words to their meanings throughout Wiktionary pages and to establish synonym and hyponym-hypernym relations between specific senses of words. The produced database contains 104696 synsets and is publicly available in an alpha version as a Python package, wiki-ru-wordnet.
Yuliya Chernobay
Corpus of Syntactic Co-Occurrences: A Delayed Promise
Abstract
The paper gives a technical description of CoSyCo, a corpus of syntactic co-occurrences, which provides information on syntactically connected words in the Russian language. The paper includes an overview of the corpora collected for CoSyCo creation and the amount of collected combinations. In the paper, we also provide a short evaluation of the gathered information.
Eduard S. Klyshinsky, Natalia Y. Lukashevich

Computation Morphology and Word Embeddings

Frontmatter
A Close Look at Russian Morphological Parsers: Which One Is the Best?
Abstract
This article presents a comparative study of four morphological parsers of Russian – mystem, pymorphy2, TreeTagger, and FreeLing – involving the two main tasks of morphological analysis: lemmatization and POS tagging. The experiments were conducted on three currently available Russian corpora which have qualitative morphological labeling – Russian National Corpus, OpenCorpora, and RU-EVAL (a small corpus created in 2010 to evaluate parsers). As evaluation measures, the authors use accuracy for lemmatization and F1-measure for POS tagging. The authors give error analysis, identify the most difficult parts of speech for the parsers, and analyze the work of parsers on dictionary words and predicted words.
Evgeny Kotelnikov, Elena Razova, Irina Fishcheva
Morpheme Level Word Embedding
Abstract
Modern NLP tasks such as sentiment analysis, semantic analysis, text entity extraction and others depend on the language model quality. Language structure influences quality: a model that fits analytic languages well for some NLP tasks does not fit synthetic languages well enough for the same tasks. For example, the well-known Word2Vec [27] model shows good results for English, which is more analytic than synthetic, but Word2Vec has problems with synthetic languages on some NLP tasks because of their high degree of inflection. Since every morpheme in synthetic languages provides some information, we propose to discuss a morpheme-level model to solve different NLP tasks. We consider the Russian language in our experiments. First, we describe how to build a morpheme extractor from prepared vocabularies. Our extractor reached 91% accuracy on the vocabularies of known morpheme segmentation. Second, we show how it can be applied to NLP tasks, and then we discuss our results, pros and cons, and our future work.
Ruslan Galinsky, Tatiana Kovalenko, Julia Yakovleva, Andrey Filchenkov
Comparison of Vector Space Representations of Documents for the Task of Information Retrieval of Massive Open Online Courses
Abstract
One of the important issues arising in the development of educational courses is maintaining relevance for the intended receivers of the course. In general, it requires developers of such courses to use and borrow some elements presented in similar content developed by others. This form of collaboration allows for the integration of experience and points of view of multiple authors, which tends to result in better, more relevant content. This article addresses the question of searching for relevant massive open online courses (MOOCs) using a course programme document as a query. As a novel solution to this task we propose the application of language modelling. We present the results of an experiment comparing several of the most popular models of vector space representation of text documents: the classical weighting scheme TF-IDF, Latent Semantic Indexing, topic modeling in the form of Latent Dirichlet Allocation, and the popular modern neural net language models word2vec and paragraph vectors. The experiment is carried out on a corpus of courses in Russian, collected from several popular MOOC platforms. The effectiveness of the proposed model is evaluated taking into account the opinions of university professors.
Julius Klenin, Dmitry Botov, Yuri Dmitrin
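
The abstract above names TF-IDF as the classical vector space baseline for retrieving courses with a course programme document as the query. The following is a minimal sketch of that general idea only, not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the function names and the toy English course titles are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of documents.

    Assumption: whitespace tokenization and a smoothed IDF,
    idf(t) = ln((1 + n) / (1 + df(t))) + 1, chosen for illustration.
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_courses(query, courses):
    """Return course indices ordered from most to least similar to the query."""
    vectors = tfidf_vectors(courses + [query])
    query_vec = vectors[-1]
    scores = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors[:-1])]
    # Sort by descending score; ties are broken by original position.
    return [i for _, i in sorted(scores, key=lambda p: (-p[0], p[1]))]


# Hypothetical toy data, not taken from the paper's corpus.
courses = [
    "machine learning basics",
    "history of art",
    "deep learning for language",
]
print(rank_courses("learning language models", courses))
```

In this toy run the third course shares two query terms, the first shares one, and the second shares none, so the ranking follows that order. Real Russian course descriptions would need lemmatization before weighting, which this sketch omits.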

Machine Learning

Frontmatter
Interpretable Probabilistic Embeddings: Bridging the Gap Between Topic Models and Neural Networks
Abstract
We consider probabilistic topic models and more recent word embedding techniques from a perspective of learning hidden semantic representations. Inspired by a striking similarity of the two approaches, we merge them and learn probabilistic embeddings with online EM-algorithm on word co-occurrence data. The resulting embeddings perform on par with Skip-Gram Negative Sampling (SGNS) on word similarity tasks and benefit in the interpretability of the components. Next, we learn probabilistic document embeddings that outperform paragraph2vec on a document similarity task and require less memory and time for training. Finally, we employ multimodal Additive Regularization of Topic Models (ARTM) to obtain a high sparsity and learn embeddings for other modalities, such as timestamps and categories. We observe further improvement of word similarity performance and meaningful inter-modality similarities.
Anna Potapenko, Artem Popov, Konstantin Vorontsov
Multi-objective Topic Modeling for Exploratory Search in Tech News
Abstract
Exploratory search is a paradigm of information retrieval, in which the user’s intention is to learn the subject domain better. To do this the user repeats “query–browse–refine” interactions with the search engine many times. We consider typical exploratory search tasks formulated by long text queries. People usually solve such a task in about half an hour and find dozens of documents using conventional search facilities iteratively. The goal of this paper is to reduce the time-consuming multi-step process to one step without impairing the quality of the search. Probabilistic topic modeling is a suitable text mining technique to retrieve documents, which are semantically relevant to a long text query. We use the additive regularization of topic models (ARTM) to build a model that meets multiple objectives. The model should have sparse, diverse and interpretable topics. Also, it should incorporate meta-data and multimodal data such as n-grams, authors, tags and categories. Balancing the regularization criteria is an important issue for ARTM. We tackle this problem with coordinate-wise optimization technique, which chooses the regularization trajectory automatically. We use the parallel online implementation of ARTM from the open source library BigARTM. Our evaluation technique is based on crowdsourcing and includes two tasks for assessors: the manual exploratory search and the explicit relevance feedback. Experiments on two popular tech news media show that our topic-based exploratory search outperforms assessors as well as simple baselines, achieving precision and recall of about 85–92%.
Anastasia Ianina, Lev Golitsyn, Konstantin Vorontsov
A Deep Forest for Transductive Transfer Learning by Using a Consensus Measure
Abstract
A Transfer Learning Deep Forest (TLDF) is proposed in the paper. It is based on the Deep Forest or gcForest proposed by Zhou and Feng and can be viewed as a gcForest modification whose aim is to implement the transductive transfer learning. The transfer learning is based on introducing weights of trees in forests which impact on the forest class probability distributions. The weights can be regarded as training parameters of the deep forest and are determined in order to maximize the agreement on target and source domains. The convex quadratic optimization problem with linear constraints is obtained to compute optimal weights for every forest taking into account the consensus principle. The numerical experiments illustrate the proposed distance metric method.
Lev V. Utkin, Mikhail A. Ryabinin

Russian Paraphrase Detection Shared Task

Frontmatter
ParaPhraser: Russian Paraphrase Corpus and Shared Task
Abstract
The paper describes the results of the First Russian Paraphrase Detection Shared Task held in St. Petersburg, Russia, in October 2016. Research in the area of paraphrase extraction, detection and generation has been developing successfully for a long time, while interest in the problem has surged only recently in the Russian computational linguistics community. We try to close this gap by introducing the project ParaPhraser.ru, dedicated to the collection of a Russian paraphrase corpus, and by organizing a Paraphrase Detection Shared Task, which uses the corpus as the training data. The participants of the task applied a wide variety of techniques to the problem of paraphrase detection, from rule-based approaches to deep learning, and the results of the task reflect the following tendencies: the best scores are obtained by the strategy of using traditional classifiers combined with fine-grained linguistic features; however, complex neural networks, shallow methods and purely technical methods also demonstrate competitive results.
Lidia Pivovarova, Ekaterina Pronoza, Elena Yagunova, Anton Pronoza
Effect of Semantic Parsing Depth on the Identification of Paraphrases in Russian Texts
Abstract
As a tool to solve the problem of identification of paraphrases in Russian texts, the paper proposes the semantic-syntactic parser SemSin and a semantic classifier. Several alternative methods for evaluating the similarity of sentence pairs—by words, by lemmas, by classes, by semantically related concepts, by predicate groups—have been analyzed. Advantages and drawbacks of the methods are discussed. The paraphrase identification quality has been shown to rise with increasing depth of using the semantic information. Yet, complementing the analysis with predicate groups, identified by the dependency tree, may even cause the identification to degrade due to the growing number of false positive decisions.
Kirill Boyarsky, Eugeni Kanevsky
RuThes Thesaurus in Detecting Russian Paraphrases
Abstract
In this paper we study the contribution of semantic features to the detection of Russian paraphrases. The features were calculated on the Russian Thesaurus RuThes. First, we applied RuThes synonyms in clustering news articles, many of which had been created with rewriting (that is paraphrasing) of source news, and found significant improvement. Second, we applied several semantic similarity measures proposed for English thesaurus WordNet to RuThes thesaurus and utilized them for detecting Russian paraphrased sentences.
Natalia Loukachevitch, Aleksandr Shevelev, Valerie Mozharova, Boris Dobrov, Andrey Pavlov
Knowledge-lean Paraphrase Identification Using Character-Based Features
Abstract
The paraphrase identification task has practical importance in the NLP community because of the need to deal with the pervasive problem of linguistic variation. Accurate methods should help improve the performance of NLP applications, including machine translation, information retrieval, question answering, text summarization, document clustering and plagiarism detection, amongst others. We consider an approach to paraphrase identification that may be considered “knowledge-lean”. Our approach minimizes the need for data transformation and avoids the use of knowledge-based tools and resources. Candidate paraphrase pairs are represented using combinations of word- and character-based features. We show that SVM classifiers may be trained to distinguish paraphrase and non-paraphrase pairs across a number of different paraphrase corpora with good results. Analysis shows that features derived from character bigrams are particularly informative. We also describe recent experiments in identifying paraphrase for Russian, a language with rich morphology and free word order that presents a particularly interesting challenge for our knowledge-lean approach. We are able to report good results on a three-way paraphrase classification task.
Asli Eyecioglu, Bill Keller
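
The abstract above reports that features derived from character bigrams are particularly informative for paraphrase identification. The sketch below illustrates one way such pair features could be computed. It is an assumption for illustration, not the authors' feature set: the Jaccard and Dice overlap measures, the lowercasing step and the function names are hypothetical choices, and in the described approach features of this kind would be passed to an SVM classifier, which is omitted here.

```python
from collections import Counter


def char_bigrams(text):
    """Return the multiset of character bigrams of the lowercased text."""
    s = text.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def bigram_features(a, b):
    """Compute overlap features for a candidate paraphrase pair.

    Assumption: multiset Jaccard and Dice coefficients stand in for
    whatever character-based features a real system would use.
    """
    ca, cb = char_bigrams(a), char_bigrams(b)
    inter = sum((ca & cb).values())  # multiset intersection size
    union = sum((ca | cb).values())  # multiset union size
    total = sum(ca.values()) + sum(cb.values())
    return {
        "jaccard": inter / union if union else 0.0,
        "dice": 2 * inter / total if total else 0.0,
    }


# Hypothetical sentence pair; no tokenizer, lemmatizer or lexicon is needed.
print(bigram_features("Он купил машину", "Он купил новую машину"))
```

Because the features operate on raw characters, inflected forms of the same Russian stem still share most of their bigrams, which is one plausible reason a knowledge-lean approach can cope with rich morphology.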
Paraphrase Detection Using Machine Translation and Textual Similarity Algorithms
Abstract
I present experiments on the task of paraphrase detection for Russian text using Machine Translation (MT) into English and applying existing English sentence similarity algorithms to the translated sentences. Because I use translation engines, my method can be applied to any other language for which translation into English is available from these engines. Specifically, I consider two tasks: given a pair of sentences in Russian, classify them into two (non-paraphrases, paraphrases) or three (non-paraphrases, near-paraphrases, precise-paraphrases) classes. I compare five different well-established sentence similarity methods developed in English and three different Machine Translation engines (Google, Microsoft and Yandex). I perform detailed ablation tests to identify the contribution of each component of the five methods, and identify the best combination of Machine Translation and sentence similarity method, including ensembles, on the Russian Paraphrase data set. My best results on the Russian data set are an Accuracy of 81.4% and F1 score of 78.5% for an ensemble method with the translation using three MT engines (Google, Microsoft and Yandex). This compares favorably with state-of-the-art methods in English on data sets of a similar size, which are in the range of Accuracy 80.41% and F1-score of 85.96%. This demonstrates that, with the current level of performance of public MT engines, the simple approach of translating and then classifying in English has become a feasible strategy to address the task. I perform detailed error analysis to indicate potential for further improvements.
Dmitry Kravchenko
Character-Level Convolutional Neural Network for Paraphrase Detection and Other Experiments
Abstract
The central goal of this paper is to report on the results of an experimental study on the application of character-level embeddings and a basic convolutional neural network to the shared task of sentence paraphrase detection in Russian. This approach was tested in the standard run of Task 2 of that shared task and revealed competitive results, namely 73.9% accuracy against the test set. It is compared against a word-level convolutional neural network for the same task, and various other approaches, such as rule-based and classical machine learning.
Vladislav Maraev, Chakaveh Saedi, João Rodrigues, António Branco, João Silva
Backmatter
Metadata
Title
Artificial Intelligence and Natural Language
edited by
Andrey Filchenkov
Lidia Pivovarova
Jan Žižka
Copyright year
2018
Electronic ISBN
978-3-319-71746-3
Print ISBN
978-3-319-71745-6
DOI
https://doi.org/10.1007/978-3-319-71746-3