
2014 | Book

Advances in Natural Language Processing

9th International Conference on NLP, PolTAL 2014, Warsaw, Poland, September 17-19, 2014. Proceedings

Edited by: Adam Przepiórkowski, Maciej Ogrodniczuk

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 9th International Conference on Advances in Natural Language Processing, PolTAL 2014, held in Warsaw, Poland, in September 2014. The 27 revised full papers and 20 revised short papers presented were carefully reviewed and selected from 83 submissions. The papers are organized in topical sections on morphology, named entity recognition, term extraction; lexical semantics; sentence level syntax, semantics, and machine translation; discourse, coreference resolution, automatic summarization, and question answering; text classification, information extraction and information retrieval; and speech processing, language modelling, and spell- and grammar-checking.

Table of Contents

Frontmatter

Morphology, Named Entity Recognition, Term Extraction

Development of Amharic Morphological Analyzer Using Memory-Based Learning

Morphological analysis of highly inflected languages like Amharic is a non-trivial task because of the complexity of the morphology. In this paper, we propose a supervised data-driven experimental approach to developing an Amharic morphological analyzer. We use a memory-based supervised machine learning method which extrapolates to new unseen classes based on previous examples in memory. We treat morphological analysis as a classification task which retrieves the grammatical functions and properties of morphologically inflected words. As the task is geared towards analyzing vowelled inflected Amharic words together with the grammatical functions of their morphemes, the morphological structure of words and the way they are represented in memory-based learning is exhaustively investigated. The performance of the model is evaluated using 10-fold cross-validation with the IB1 and IGtree algorithms, resulting in overall accuracies of 93.6% and 82.3%, respectively.

Mesfin Abate, Yaregal Assabie
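As an illustration only (not the authors' code), the following minimal sketch shows IB1-style memory-based classification with 10-fold cross-validation, as referred to in the abstract above; the toy word list, the suffix-window features and the scikit-learn components are assumptions made for this sketch.

```python
# Minimal sketch: memory-based (IB1-like) classification = 1-nearest-neighbour
# over stored training instances, evaluated with 10-fold cross-validation.
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def char_window_features(word, size=3):
    """Represent a word by its last `size` characters (a common cue for suffixing morphology)."""
    padded = ("_" * size + word)[-size:]
    return {f"char_{i}": c for i, c in enumerate(padded)}

# Hypothetical (word, morphological-class) pairs; a real experiment would use annotated Amharic data.
data = [("walked", "PAST"), ("talked", "PAST"), ("walks", "PRES"),
        ("talks", "PRES"), ("walking", "PROG"), ("talking", "PROG")] * 5

X = [char_window_features(w) for w, _ in data]
y = [label for _, label in data]

ib1_like = make_pipeline(DictVectorizer(), KNeighborsClassifier(n_neighbors=1))
scores = cross_val_score(ib1_like, X, y, cv=10)
print("10-fold accuracy: %.3f" % scores.mean())
```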
A Finite-State Treatment of Neoclassical Compounds in Modern Greek and English

This paper presents an attempt to process neoclassical compounds (NCs) in Modern Greek (MG) and English with the use of finite-state methods. Our approach is based on a theoretical background that applies to both languages and assumes that Modern Greek compounding has affected the formation of NCs in English. The proposed processing system aims to be both linguistically accurate and computationally efficient.

Evanthia Petropoulou, Eleni Galiotou, Angela Ralli
CroDeriV 2.0.: Initial Experiments

This paper deals with the processing of derivational morphology in Croatian and focuses on the expansion of CroDeriV, a resource with data on morphological structure and derivational relations. The purpose of CroDeriV is to systematically present the morphological structure and derivational relations of Croatian lexemes, and to use these data for the enrichment and development of existing resources and tools, as well as of new ones. One of the objectives of this ongoing project is to build an analyzer for Croatian capable of analyzing both inflectional and derivational morphemes. In this paper we present the initial experiments towards the enlargement of CroDeriV to include nouns, as well as the development of a morphological analyzer for inflectional and derivational morphemes.

Krešimir Šojat, Matea Srebačić, Tin Pavelić
Named Entity Matching Method Based on the Context-Free Morphological Generator

Polish named entities are mostly out-of-vocabulary words, i.e. they are not described in morphological lexicons, and their proper analysis by Polish morphological analysers is difficult. The existing approaches to guessing unknown word lemmas and descriptions do not provide results on a satisfactory level. Moreover, lemmatisation of multi-word named entities cannot be solved by word-by-word lemmatisation in Polish. Multi-word named entity lemmas (e.g. included in gazetteers) often contain word forms that differ from the lemmas of their constituents. Such multi-word lemmas can be produced only by tagger- or parser-based lemmatisation. Polish is a language with rich inflection (a rich variety of word forms), therefore comparing two words (even those which share the same lemma) is a difficult task. Instead of calculating the value of a form-based similarity function between the text words and gazetteer entries, we propose a method which uses a context-free morphological generator, built on top of the morphological lexicon and encoded as a set of inflection rules. The proposed solution outperforms several state-of-the-art methods that are based on word-to-word similarity functions.

Jan Kocoń, Maciej Piasecki
NER in Tweets Using Bagging and a Small Crowdsourced Dataset

Named entity recognition (NER) systems for Twitter are very sensitive to cross-sample variation, and the performance of off-the-shelf systems varies from reasonable (F1: 60–70%) to completely useless (F1: 40–50%) across available Twitter datasets. This paper introduces a semi-supervised wrapper method for robust learning of sequential problems with many negative examples, such as NER, and shows that using a simple conditional random fields (CRF) model and a small crowdsourced dataset [4] leads to good NER performance across datasets.

Hege Fromreide, Anders Søgaard
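A minimal sketch of the entity-level F1 measure quoted in the abstract above (not the paper's evaluation code); the example spans are hypothetical.

```python
# Minimal sketch: precision and recall over predicted vs. gold entity spans,
# combined into their harmonic mean (F1).
def f1_over_spans(gold_spans, predicted_spans):
    gold, pred = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical spans: (start_token, end_token, entity_type) per tweet.
gold = [(0, 2, "PER"), (5, 6, "LOC"), (9, 10, "ORG")]
pred = [(0, 2, "PER"), (5, 6, "ORG")]
print("F1 = %.2f" % f1_over_spans(gold, pred))  # 0.40
```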
Yet Another Ranking Function for Automatic Multiword Term Extraction

Term extraction is an essential task in domain knowledge acquisition. We propose two new measures to extract multiword terms from a domain-specific text. The first measure is based on both linguistic and statistical information. The second measure is graph-based, allowing assessment of the importance of a multiword term for a domain. Existing measures often solve some, but not all, of the problems related to term extraction, e.g., noise, silence, low frequency, large corpora, and the complexity of the multiword term extraction process. Instead, we focus on managing the entire set of problems, e.g., detecting rare terms and overcoming the low frequency issue. We show that the two proposed measures outperform precision results previously reported for automatic multiword extraction by comparing them with the state-of-the-art reference measures.

Juan Antonio Lossio-Ventura, Clement Jonquet, Mathieu Roche, Maguelonne Teisseire
Unsupervised Keyword Extraction from Polish Legal Texts

In this work, we present an application of the recently proposed unsupervised keyword extraction algorithm RAKE to a corpus of Polish legal texts from the field of public procurement. RAKE is essentially a language- and domain-independent method. Its only language-specific input is a stoplist containing a set of non-content words. The performance of the method heavily depends on the choice of such a stoplist, which should be adapted to the domain. Therefore, we complement the RAKE algorithm with an automatic approach to selecting non-content words, which is based on the statistical properties of term distribution.

Michał Jungiewicz, Michał Łopuszyński
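For orientation, a minimal RAKE-style sketch (not the authors' implementation): candidate keywords are runs of non-stoplist words, scored by the sum of their words' degree/frequency ratios. The toy stoplist below is an assumption; the paper derives its stoplist automatically from term-distribution statistics.

```python
# Minimal RAKE-style sketch: split on stoplist words, score candidates by
# summed word degree/frequency within candidates.
import re
from collections import defaultdict

def rake_keywords(text, stoplist):
    words = re.findall(r"\w+", text.lower())
    candidates, current = [], []
    for w in words:
        if w in stoplist:
            if current:
                candidates.append(tuple(current))
            current = []
        else:
            current.append(w)
    if current:
        candidates.append(tuple(current))

    freq, degree = defaultdict(int), defaultdict(int)
    for cand in candidates:
        for w in cand:
            freq[w] += 1
            degree[w] += len(cand)  # co-occurrence degree within the candidate
    score = lambda cand: sum(degree[w] / freq[w] for w in cand)
    return sorted(set(candidates), key=score, reverse=True)

stoplist = {"the", "of", "for", "a", "is", "in"}
print(rake_keywords("the award of the public procurement contract is described in the tender documentation", stoplist))
```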
Term Ranking Adaptation to the Domain: Genetic Algorithm-Based Optimisation of the C-Value

Term extraction methods based on linguistic rules have been proposed to help terminology building from corpora. As they face the difficulty of identifying the relevant terms among the extracted noun phrases, statistical measures have been proposed. However, the term selection results may depend on the corpus and on strong assumptions reflecting specific terminological practice. We tackle this problem by proposing a parametrised C-Value which optimally considers the length and the syntactic roles of the nested terms thanks to a genetic algorithm. We compare its impact on the ranking of terms extracted from three corpora. Results show average precision increased by 9% above the frequency-based ranking and by 12% above the C-Value-based ranking.

Thierry Hamon, Christopher Engström, Sergei Silvestrov
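A minimal sketch of the standard C-Value measure that the paper parametrises (the genetic-algorithm weighting itself is not reproduced here); the candidate counts are hypothetical.

```python
# Minimal sketch of the standard C-Value: log2 of the term length times its
# frequency, discounted by the mean frequency of longer candidates that nest it.
import math

def c_value(term, frequencies):
    """term: tuple of words; frequencies: dict mapping candidate term tuples to corpus counts."""
    length_weight = math.log2(len(term))
    nesting = [f for longer, f in frequencies.items()
               if len(longer) > len(term) and any(longer[i:i + len(term)] == term
                                                  for i in range(len(longer) - len(term) + 1))]
    freq = frequencies[term]
    if nesting:
        freq -= sum(nesting) / len(nesting)
    return length_weight * freq

# Hypothetical candidate counts from a corpus.
freqs = {("basal", "cell"): 30, ("basal", "cell", "carcinoma"): 12, ("cell", "carcinoma"): 14}
for t in freqs:
    print(t, round(c_value(t, freqs), 2))
```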

Lexical Semantics

Graph-Based, Supervised Machine Learning Approach to (Irregular) Polysemy in WordNet

This paper presents a supervised machine learning approach that aims at annotating those homograph word forms in WordNet that share some common meaning and can hence be thought of as belonging to a polysemous word. Using different graph-based measures, a set of features is selected, and a random forest model is trained and evaluated. The results are compared to other features used for polysemy identification in WordNet. The features proposed in this paper not only outperform the commonly used CoreLex resource, but they also work on different parts of speech and can be used to identify both regular and irregular polysemous word forms in WordNet.

Bastian Entrup
Attribute Value Acquisition through Clustering of Adjectives

In the paper we analyse Polish descriptive adjectives which occur in domain-related texts. The experiments were done on data obtained from hospital discharge records. Prenominal adjectives selected from these texts were filtered to remove presumably relative adjectives and were clustered on the basis of a set of context-related features and inter-word relations derived from Wordnet. We tested whether this procedure can be used to automatically identify concept features, i.e. whether adjectives representing different values of one feature form one cluster. The obtained results proved to be useful as a preprocessing step in a specialized subdomain ontology creation procedure.

Agnieszka Mykowiecka, Małgorzata Marciniak
Cross-Lingual Semantic Similarity Measure for Comparable Articles

A measure of similarity is required to find and compare cross-lingual articles concerning a specific topic. This measure can be based on bilingual dictionaries or on numerical methods such as Latent Semantic Indexing (LSI). In this paper, we use LSI in two ways to retrieve Arabic-English comparable articles. The first way is monolingual: the English article is translated into Arabic and then mapped into the Arabic LSI space; the second way is cross-lingual: Arabic and English documents are mapped into an Arabic-English LSI space. We then compare the LSI approaches to the dictionary-based approach on several English-Arabic parallel and comparable corpora. Results indicate that the performance of our cross-lingual LSI approach is competitive with the monolingual approach and even better for some corpora. Moreover, both LSI approaches outperform the dictionary approach.

Motaz Saad, David Langlois, Kamel Smaïli
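An illustrative sketch of the cross-lingual idea, assuming a shared LSI space built from concatenated comparable documents and cosine similarity for retrieval; the toy documents (with Arabic transliterated for readability) and the scikit-learn pipeline are assumptions, not the authors' setup.

```python
# Minimal sketch: train one LSI space over joined Arabic+English text, then
# compare new documents from either language there via cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import make_pipeline

# Each training "document" joins comparable Arabic and English text.
training = [
    "iqtisad economy market souq inflation tadakhum",
    "riyada sport football kurat qadam match mubarat",
    "siyasa politics hukuma government election intikhabat",
]
lsi = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
lsi.fit(training)

query_en = lsi.transform(["the government called an election"])
query_ar = lsi.transform(["intikhabat hukuma"])
print(cosine_similarity(query_en, query_ar))  # high similarity expected for comparable articles
```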
An Integrated Approach to Automatic Synonym Detection in Turkish Corpus

In this study, we designed a model to determine synonymy. Our main assumption is that, by definition, synonym pairs show similar semantic and dependency relations: they share the same meronym/holonym and hypernym/hyponym relations. Unlike synonymy, hypernymy and meronymy relations can often be acquired by applying lexico-syntactic patterns to a big corpus; such acquisition can be utilized to ease the detection of synonymy. Likewise, we utilized some particular dependency relations, such as object/subject of a verb. Machine learning algorithms were applied to all these acquired features. The first aim is to find out which dependency and semantic features are the most informative and contribute most to the model. The performance of each feature is individually evaluated with cross-validation. The model that combines all features shows promising results and successfully detects the synonymy relation. The main contribution of the study is to integrate both semantic and dependency relations within a distributional approach. The second contribution is that this is the first major attempt at corpus-driven synonym identification for Turkish.

Tuğba Yıldız, Savaş Yıldırım, Banu Diri
Distributional Context Generalisation and Normalisation as a Mean to Reduce Data Sparsity: Evaluation of Medical Corpora

Distributional analysis relies on the recurrence of information in the contexts of words to associate. But the vector space models implementing the approach suffer from data sparsity and from a high-dimensional context matrix. While reducing data sparsity is an important aspect with general corpora, it is also a major issue with specialised corpora, which are of much smaller size and have much lower context frequencies. We tackle this problem on specialised texts and propose a method to increase the matrix density by normalising and generalising distributional contexts with synonymy and hypernymy relations acquired from corpora. Experiments on a French biomedical corpus show that context generalisation and normalisation improve the results when combined with the use of relations acquired with lexico-syntactic patterns.

Amandine Périnet, Thierry Hamon
A Parallel Non-negative Sparse Large Matrix Factorization

This paper proposes parallel methods for non-negative factorization of large sparse matrices, a very popular technique in computational linguistics. The memory usage and data transmission requirements of the factorization algorithm were analysed and optimized. The described efficient GPU-based and distributed algorithms were implemented, tested and compared on the processing of large sparse matrices.

Anatoly Anisimov, Oleksandr Marchenko, Emil Nasirov, Stepan Palamarchuk
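A minimal single-machine sketch of non-negative factorization of a sparse matrix; the paper's actual contribution, GPU-based and distributed parallelisation, is not shown here, and the matrix contents are toy assumptions.

```python
# Minimal sketch: factorize a sparse non-negative matrix V into W * H.
from scipy.sparse import random as sparse_random
from sklearn.decomposition import NMF

V = sparse_random(1000, 500, density=0.01, random_state=0)  # sparse non-negative term-context counts

model = NMF(n_components=20, init="nndsvd", max_iter=300)
W = model.fit_transform(V)   # term factors, shape (1000, 20)
H = model.components_        # context factors, shape (20, 500)
print("reconstruction error:", round(model.reconstruction_err_, 3))
```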

Sentence-Level Syntax, Semantics and Machine Translation

Statistical Analysis of the Interaction between Word Order and Definiteness in Polish

Although (in-)definiteness is semantically relevant in Polish, the language lacks explicit linguistic features for marking it. The paper presents the first quantitative, statistical evaluation of the correlation between word order and definiteness. Our results support previous qualitative theories about the influence of the verb-relative position on definiteness in Polish.

Adrian Czardybon, Oliver Hellwig, Wiebke Petersen
Stanford Typed Dependencies: Slavic Languages Application

The Stanford typed dependency model [7] constitutes a universal schema of grammatical relationships for dependency parsing. However, it was based on English data and did not provide descriptions for grammatical features that are fundamental in other language types. This paper addresses the problem of applying the Stanford typed dependency model to Slavic languages. Language features specific to Slavic languages that are presented and described include ellipsis, different types of predicates, genitive constructions, direct vs. indirect objects, reflexive pronouns and determiners. In order to maintain cross-language consistency we try to avoid major changes in the original Stanford model, and rather devise new applications of the existing relation types.

Katarzyna Marszałek-Kowalewska, Anna Zaretskaya, Milan Souček
Towards a Weighted Induction Method of Dependency Annotation

This paper presents a method of annotating sentences with dependency trees which follows the mainstream of work on dependency projection. The approach builds on the idea of weighted projection. However, we involve a weighting factor not only in the process of projecting dependency relations (weighted projection) but also in the process of acquiring dependency trees from projected sets of dependency relations (weighted induction). The source side of a parallel corpus is automatically annotated with a syntactic parser and the resulting dependencies are transferred to equivalent target sentences via an extended set of word alignment links. Projected relations are initially weighted according to the certainty of the word alignment links used in projection. Since word alignments may be noisy and should not be relied on entirely, the initial weights are recalculated using a version of the EM algorithm. Then, maximum spanning trees fulfilling the properties of well-formed dependency structures are selected from the EM-scored directed graphs. An extrinsic evaluation shows that parsers trained on the induced trees perform comparably to parsers trained on a manually developed treebank.

Alina Wróblewska, Adam Przepiórkowski
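A minimal sketch of the final step described above, selecting a maximum spanning arborescence (a well-formed dependency tree) from a weighted directed graph of projected relations; the edge weights and the use of networkx are assumptions made for illustration only.

```python
# Minimal sketch: pick the highest-scoring well-formed dependency tree
# (maximum spanning arborescence) from a scored directed graph.
import networkx as nx

scores = nx.DiGraph()
# (head, dependent, score) for the sentence "ROOT Mary reads books"; weights are hypothetical.
scores.add_weighted_edges_from([
    ("ROOT", "reads", 0.9), ("ROOT", "Mary", 0.2), ("ROOT", "books", 0.1),
    ("reads", "Mary", 0.8), ("reads", "books", 0.7),
    ("Mary", "books", 0.3), ("books", "Mary", 0.2),
])
tree = nx.maximum_spanning_arborescence(scores)
print(sorted(tree.edges()))  # [('ROOT', 'reads'), ('reads', 'Mary'), ('reads', 'books')]
```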
Semantic and Syntactic Model of Natural Language Based on Non-negative Matrix and Tensor Factorization

A method for developing a structural model of natural language syntax and semantics is proposed. Factorization of lexical combinability arrays obtained from text corpora generates linguistic databases that are used for analysis of natural language semantics and syntax.

Anatoly Anisimov, Oleksandr Marchenko, Volodymyr Taranukha, Taras Vozniuk
Experiments on the Identification of Predicate-Argument Structure in Polish

This paper focuses on automatic methods of extracting predicate-argument structure in Polish. Two approaches to extracting selected aspects of the predicate-argument structure are evaluated. In the first experiment the multi-output version of the Random Forest classifier is used to extract a valency frame for each predicate in a sentence. In the second experiment the Conditional Random Fields classifier is used to find the syntactic heads of all arguments realised in a sentence. Furthermore, the importance of various sources of features is presented, including shallow syntactic parsing, dependency parsing and verb valency information. Due to the lack of a high-quality syntactic parser, the presented approach does not rely on deep syntactic information.

Konrad Gołuchowski
Syntactic Approximation of Semantic Roles

The aim of this paper is to propose a method of simulating, in a syntactico-semantic parser, the behaviour of semantic roles in the case of a language that has no resources such as VerbNet or FrameNet, but has relatively rich morphosyntax (here: Polish). We argue that using an approximation of semantic roles derived from the syntactic (grammatical functions) and morphosyntactic (grammatical cases) features of arguments may be beneficial for applications such as text entailment.

Wojciech Jaworski, Adam Przepiórkowski
Using Polish Wordnet for Predicting Semantic Roles for the Valency Dictionary of Polish Verbs

The paper describes a preliminary proposal of a method for creating the semantic layer in a valency dictionary of Polish by enriching it with information about semantic roles (meaning-related participant types of predicate’s arguments that ensure the same semantic properties across their different syntactic realizations) taken from a Polish wordnet. The peculiarities of organizing senses of verbs (in the form of synsets) that can be helpful for extracting information about semantic roles from the Polish wordnet are described. A working role set that is intended to satisfy predicates with both abstract and concrete meanings within the same role structure is proposed. Protoframes for selected verb classes are presented.

Natalia Kotsyba
Semantic Extraction with Use of Frames

This work describes an information extraction methodology which uses shallow parsing. We present detailed information on the extraction process, data structures used within that process as well as the evaluation of the described method. The extraction is fully automatic. Instead of machine learning it uses predefined frame templates and vocabulary stored within a domain ontology with elements related to frame templates. The architecture of the information extractor is modular and the main extraction module is capable of processing various languages when lexicalization for these languages is provided.

Jakub Dutkiewicz, Maciej Falkowski, Maciej Nowak, Czesław Jędrzejek
Constraint Grammar-Based Swedish-Danish Machine Translation

This paper describes and evaluates a grammar-based machine translation system for the Swedish-Danish language pair. Source-language structural analysis, polysemy resolution, syntactic movement rules and target-language agreement are based on Constraint Grammar morphosyntactic tags and dependency trees. Lexical transfer rules exploit dependency links to access contextual information, such as syntactic argument function, semantic type and quantifiers, or to integrate verbal features, e.g. diathesis and auxiliaries. Out-of-vocabulary words are handled by derivational and compound analysis with a combined coverage of 99.3%, as well as systematic morpho-phonemic transliterations for the remaining cases. The system achieved BLEU scores of 0.65-0.8 depending on the references and outperformed both SMT and RBMT competitors by a large margin.

Eckhard Bick
A Hybrid Approach to the Development of Bidirectional English-Oromiffa Machine Translation

This paper presents the development of a bidirectional English-Oromiffa machine translation system using a hybrid of rule-based and statistical approaches. Since English and Oromiffa have different sentence structures, we implement syntactic reordering with the purpose of making the structure of source sentences similar to the structure of target sentences. Accordingly, reordering rules are developed for simple, interrogative and complex English and Oromiffa sentences. Two groups of experiments are conducted, using a purely statistical approach and a hybrid approach. The Oromiffa-English SMT yields a BLEU score of 41.50%, whereas the English-Oromiffa SMT has a BLEU score of 32.39%. After applying local reordering rules, the systems improve to BLEU scores of 52.02% and 37.41% for Oromiffa-English and English-Oromiffa translations, respectively.

Jabesa Daba, Yaregal Assabie
Inflating a Training Corpus for SMT by Using Unrelated Unaligned Monolingual Data

To improve the translation quality of less resourced language pairs, the most natural answer is to build larger and larger aligned training data, that is, to make those language pairs well resourced. But aligned data is not always easy to collect. In contrast, monolingual data are usually easier to access. In this paper we show how to leverage unrelated unaligned monolingual data to construct additional training data that varies only a little from the original training data. We measure the contribution of such additional data to translation quality. We report an experiment between Chinese and Japanese where we use 70,000 sentences of unrelated unaligned monolingual additional data in each language to construct new sentence pairs that are not perfectly aligned. We add these sentence pairs to a training corpus of 110,000 sentence pairs, and report an increase of 6 BLEU points.

Wei Yang, Yves Lepage

Discourse, Coreference Resolution, Summarisation, Question Answering

Uncovering Discourse Relations to Insert Connectives between the Sentences of an Automatic Summary

This paper presents a machine learning approach to find and classify discourse relations between two unseen sentences. It describes the process of training a classifier that aims to determine (i) whether there is any discourse relation between two sentences, and, if a relation is found, (ii) which relation it is. The final goal of this task is to insert discourse connectives between sentences, seeking to enhance the text cohesion of a summary produced by an extractive summarization system for the Portuguese language.

Sara Botelho Silveira, António Branco
Toward Automatic Classification of Metadiscourse

This paper describes the supervised classification of four metadiscursive functions in English. Training data is collected using crowdsourcing to label a corpus of TED talk transcripts with occurrences of Introductions, Conclusions, Examples, and Emphasis. Using decision trees and lexical features, we report classification accuracy.

Rui Correia, Nuno Mamede, Jorge Baptista, Maxine Eskenazi
Detection of Nested Mentions for Coreference Resolution in Polish

This paper describes the results of creating a shallow grammar of Polish capable of detecting multi-level nested nominal phrases, intended to be used as mentions in coreference resolution tasks. The work is based on an existing grammar developed for the National Corpus of Polish and is evaluated on the manually annotated Polish Coreference Corpus.

Maciej Ogrodniczuk, Alicja Wójcicka, Katarzyna Głowińska, Mateusz Kopeć
Amharic Anaphora Resolution Using Knowledge-Poor Approach

Building complete anaphora resolution systems that incorporate all linguistic information is difficult because of the complexities of languages. In the case of Amharic, it is even more difficult because of its complex morphology. In addition to independent anaphors, Amharic has anaphors embedded inside words (hidden anaphors). In this paper, we propose an Amharic anaphora resolution system that also treats hidden anaphors, in addition to independent ones. Hidden anaphors are extracted using an Amharic morphological analyzer. After anaphoric terms are identified, their relationships with antecedents are built by making use of the grammatical structure of the language along with constraint and preference rules. The system is developed based on a knowledge-poor approach in the sense that we use low levels of linguistic knowledge such as morphology, avoiding the need for complex knowledge such as semantics, world knowledge and others. The performance of the system is evaluated using the 10-fold cross-validation technique and experimental results are reported.

Temesgen Dawit, Yaregal Assabie
The First Resource for Bengali Question Answering Research

This paper reports the development of the first tagged resource for question answering research for a less computerized Indian language, namely Bengali. We developed a tagging scheme for annotating the questions based on their types. Expected answer types and question topical targets are also marked to facilitate the answer search. Due to the scarcity of canonical web documents in Bengali, we could not take advantage of the web as a resource, and the major portion of the resource data was collected from authentic books. Six highly qualified annotators were involved in this rigorous work. At present, the resource contains 47 documents from three domains, namely history, geography and agriculture. Question-answering-based annotation was performed to prepare more than 2250 question-answer pairs. The inter-annotator agreement scores, measured with non-weighted kappa statistics, are satisfactory.

Somnath Banerjee, Pintu Lohar, Sudip Kumar Naskar, Sivaji Bandyopadhyay
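A minimal sketch of the non-weighted Cohen's kappa statistic used to report inter-annotator agreement: observed agreement corrected for chance agreement. The two label sequences below are hypothetical.

```python
# Minimal sketch: Cohen's kappa = (observed agreement - chance agreement) / (1 - chance agreement).
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical question-type labels from two annotators.
a = ["FACTOID", "LIST", "FACTOID", "DEFINITION", "FACTOID", "LIST"]
b = ["FACTOID", "LIST", "FACTOID", "FACTOID", "FACTOID", "LIST"]
print(round(cohens_kappa(a, b), 3))  # 0.7
```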
Computer-Assisted Scoring of Short Responses: The Efficiency of a Clustering-Based Approach in a Real-Life Task

We present an extrinsic evaluation of a clustering-based approach to computer-assisted scoring of short constructed response items, as encountered in educational assessment. Due to their open-ended nature, constructed response items need to be graded by human readers, which makes the overall testing process costly and time-consuming. In this paper we investigate the prospects for streamlining the grading task by grouping similar responses for scoring. The efficiency of scoring clustered responses is compared both with the traditional mode of grading individual test-takers’ sheets and with by-item scoring of non-clustered responses. Evaluation of the three grading modes is carried out during real-life language proficiency tests of German as a Foreign Language. We show that a system based on basic clustering techniques and shallow features yields a promising trend of reducing grading time and performs as well as a system displaying test-taker sheets for scoring.

Magdalena Wolska, Andrea Horbach, Alexis Palmer
Pinpointing Sentence-Level Subjectivity through Balanced Subjective and Objective Features

Sentence-level subjectivity classification is a challenging task. This paper pinpoints some of its unique characteristics. It argues that these characteristics should be considered when extracting subjective or objective features from sentences. Through various sentence-level subjectivity classification experiments with numerous feature combinations, we found that balanced features for both subjective and objective sentences help to achieve balanced precision and recall for sentence subjectivity classification.

Munhyong Kim, Hyopil Shin

Text Classification, Information Extraction, Information Retrieval

Using Graphs and Semantic Information to Improve Text Classifiers

Text classification using semantic information is a current research trend due to its greater potential to accurately represent text content compared with bag-of-words (BOW) approaches. On the other hand, representation of semantics through graphs has several advantages over the traditional feature-vector representation. Therefore, error-tolerant graph matching techniques can be used for text classification. Nevertheless, very few methodologies exist in the literature which use semantic representation through graphs. In the present work, a methodology is proposed to represent semantic information from a summarized text as a graph. The discourse representation structure of a text is utilized in order to represent its semantic content and is afterwards transformed into a graph. Five different graph matching techniques based on Maximum Common Subgraphs (mcs) and Minimum Common Supergraphs (MCS) are evaluated on 20 classes from the Reuters dataset, taking 10 documents of each class for both training and testing purposes, using the k-NN classifier. The results show that the technique has the potential to perform text classification as well as the traditional BOW approaches. Moreover, a majority-voting-based combination of the semantic representation and a traditional BOW approach provided improved recognition accuracy on the same dataset.

Nibaran Das, Swarnendu Ghosh, Teresa Gonçalves, Paulo Quaresma
Exploring the Traits of Manual E-Mail Categorization Text Patterns

Automated e-mail answering with a standard answer is a text categorization task. Text categorization by matching manual text patterns to messages yields good performance if the text categories are specific. Given that manual text patterns embody informal human perception of important wording in a written inquiry, it is interesting to investigate more formal traits of this important wording, such as the amount of matching text, distance between matching words, n-grams, part-of-speech patterns, and vocabulary in the matching words. Understanding these features may help us better design text-pattern extraction algorithms.

Eriks Sneiders, Gunnar Eriksson, Alyaa Alfalahi
Relation Extraction for the Food Domain without Labeled Training Data – Is Distant Supervision the Best Solution?

We examine the task of relation extraction in the food domain by employing distant supervision. We focus on the extraction of two relations that are not only relevant to product recommendation in the food domain, but that also have significance in other domains, such as the fashion or electronics domains. In order to select suitable training data, we investigate various degrees of freedom. We consider three processing levels: argument level, sentence level and feature level. As external resources, we employ manually created surface patterns and semantic types on all these levels. We also explore to what extent rule-based methods employing the same information are competitive.

Melanie Reiplinger, Michael Wiegand, Dietrich Klakow
Semantic Clustering of Relations between Named Entities

Most research in Information Extraction concentrates on the extraction of relations from texts, but less work has been done on organizing these relations after their extraction. We present in this article a multi-level clustering method to group semantically equivalent relations: a first step groups relation instances with similar expressions to form clusters with high precision; a second step groups these initial clusters into larger semantic clusters using more complex semantic similarities. Experiments demonstrate that our multi-level clustering not only improves the scalability of the method but also improves clustering results by exploiting redundancy in each initial cluster.

Wei Wang, Romaric Besançon, Olivier Ferret, Brigitte Grau
Automatic Prediction of Future Business Conditions

Predicting the future has been an aspiration of humans since the beginning of time. Today, predicting both macro- and micro-economic events is an important activity enabling better policy and the potential for profits. In this work, we present a novel method for automatically extracting forward-looking statements from a specific type of formal corporate documents called earnings call transcripts. Our main objective is to improve an analyst's ability to accurately forecast future events of economic relevance, over and above the predictive contribution of the quantitative firm data that companies are required to produce. By exploiting both Natural Language Processing and Machine Learning techniques, our approach is stronger and more reliable than the ones commonly used in the literature, and it is able to accurately classify forward-looking statements without requiring any user interaction or extensive tuning.

Lucia Noce, Alessandro Zamberletti, Ignazio Gallo, Gabriele Piccoli, Joaquin Alfredo Rodriguez
Evaluation of IR Strategies for Polish

The paper presents the results and conclusions of an ad hoc evaluation lab concerning information retrieval for Polish. A corpus of ca. one million document descriptions of Polish Europeana resources was indexed and matched against a set of fifty test queries. Different pre-processing procedures as well as different indexing and term weighting approaches were used and evaluated. The efficiency of different IR models was compared. Finally, human-based relevance assessments were provided for the retrieved documents.

Mitra Akasereh, Piotr Malak, Adam Pawłowski

Speech Processing, Language Modelling, Spell- and Grammar-Checking

Word, Syllable and Phoneme Based Metrics Do Not Correlate with Human Performance in ASR-Mediated Tasks

Automatic evaluation metrics should correlate with human judgement. We collected sixteen ASR-mediated dialogues using a map task scenario. The material was assessed extrinsically (i.e. in context) through measures like time to task completion, and intrinsically (i.e. out of context) using the word error rate and several variants thereof that are based on smaller units. Extrinsic and intrinsic results did not correlate, either for the word error rate or for the metrics based on characters, syllables or phonemes.

Anne H. Schneider, Johannes Hellrich, Saturnino Luz
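A minimal sketch of the word error rate used as one of the intrinsic measures: token-level Levenshtein distance between reference and ASR hypothesis divided by reference length. Swapping the word split for syllable or phoneme sequences gives the smaller-unit variants mentioned in the abstract; the example sentences are assumptions.

```python
# Minimal sketch: WER = (substitutions + insertions + deletions) / reference length.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(substitution, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("go past the old mill", "go fast the mill"))  # 0.4: one substitution, one deletion
```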
Concatenative Hymn Synthesis from Yared Notations

Yared musical notation has been widely used in Ethiopian liturgical and non-liturgical songs since it was invented about 1500 years ago. This paper presents automatic synthesis of Yared hymns from notations available in text form, in different contexts where a specific note may produce different utterances depending on the context. A concatenative approach with unit selection is used to synthesize the hymns. We also apply contextual pitch shifting and amplitude modification to remove discontinuities caused by concatenation. The system is evaluated using test data collected from various hymn lyrics. We use the Mean Opinion Score to evaluate the performance of the system for both naturalness and intelligibility, computed through an assessment made by domain experts. Experimental results are reported.

Girma Zemedu, Yaregal Assabie
NLP-Oriented Contrastive Study of Linguistic Productions of Alzheimer’s and Control People

The increase in Alzheimer's disease is due to the aging of the population; it is the most common neurodegenerative disorder. The progressive development of cognitive, emotional and behavioural troubles leads to the loss of autonomy and to dependency, which corresponds to the dementia phase. Language disorders are among the first clinical cognitive signs of the disease. Our objective is to study the verbal communication of people affected by Alzheimer's disease at early to moderate stages. One particularity of our approach is that we work in an ecological conversation situation: people converse with persons they know. We study the verbal productions of five people affected by Alzheimer's disease and of five control people. The conversations are transcribed and processed with NLP methods and tools. Over thirty features grouped into four categories are studied. Our results indicate that the Alzheimer's patients present a lexical and semantic deficit and that, in several ways, their conversation is notably poorer than the conversation of the control people.

Maïté Boyé, Thi Mai Tran, Natalia Grabar
Paraphrastic Reformulations in Spoken Corpora

Our work addresses the automatic detection of paraphrastic reformulations in French spoken corpora. The proposed approach is syntagmatic: it is based on specific markers and the specificities of spoken language. Manual multi-dimensional annotation performed by two annotators provides fine-grained reference data. An automatic method is proposed to decide whether or not sentences contain paraphrastic relations. The obtained results show up to 66.4% precision. Analysis of the manual annotations indicates that few paraphrastic segments show morphological modifications (inflection, derivation or compounding) and that syntactic equivalence between the segments is seldom respected, as these segments usually belong to different syntactic categories.

Iris Eshkol-Taravella, Natalia Grabar
Between Sound and Spelling: Combining Phonetics and Clustering Algorithms to Improve Target Word Recovery

In this paper we revisit the task of spell checking, focusing on target word recovery. We propose a new approach that relies on phonetic information to improve the accuracy of clustering algorithms in identifying misspellings and generating accurate suggestions. The use of phonetic information is not new to the task of spell checking, and it was used successfully in previous approaches. The combination of phonetics and cluster-based methods for spell checking has, to our knowledge, not yet been explored and is the new contribution of our work. We report an improvement of 8.16% in accuracy compared to a previously proposed spell checking approach.

Marcos Zampieri, Renato Cordeiro de Amorim
Experiments with Language Models for Word Completion and Prediction in Hebrew

In this paper, we describe various language models (LMs) and combinations created to support word prediction and completion in Hebrew. We define and apply 5 general types of LMs: (1) Basic LMs (unigrams, bigrams, trigrams, and quadgrams), (2) Backoff LMs, (3) LMs Integrated with tagged LMs, (4) Interpolated LMs, and (5) Interpolated LMs Integrated with tagged LMs. 16 specific implementations of these LMs were compared using 3 types of Israeli web newspaper corpora. The best keystroke-saving results were achieved with LMs of the most complex variety, the Interpolated LMs Integrated with tagged LMs. Therefore, we conclude that combining all strengths by creating a synthesis of all four basic LMs and the tagged LMs leads to the best results.

Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman
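A minimal sketch, using toy counts and hypothetical interpolation weights, of the linear interpolation of basic n-gram models on which such systems build; it is not the authors' implementation.

```python
# Minimal sketch: next-word probability as a weighted mix of unigram, bigram
# and trigram estimates; candidates are ranked for word prediction.
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def interpolated_prob(w, history, weights=(0.2, 0.3, 0.5)):
    h1, h2 = history[-1], history[-2] if len(history) > 1 else None
    p_uni = unigrams[w] / sum(unigrams.values())
    p_bi = bigrams[(h1, w)] / unigrams[h1] if unigrams[h1] else 0.0
    p_tri = trigrams[(h2, h1, w)] / bigrams[(h2, h1)] if bigrams[(h2, h1)] else 0.0
    l1, l2, l3 = weights
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Rank completions for the prefix "the cat ..." by interpolated probability.
candidates = sorted(set(corpus), key=lambda w: interpolated_prob(w, ["the", "cat"]), reverse=True)
print(candidates[:3])
```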
Slovak Web Discussion Corpus

This contribution aims to provide a representative sample of Slovak colloquial language in an organized corpus. The corpus makes it possible to study spontaneous, interactive communication that often includes various incorrect or unusual words. The corpus includes a complete set of web discussions about various topics from a single site. Each discussion is marked with a topic and speaker and is assigned to a specific section. The corpus includes an index for easy searching using regular expressions. The text of the discussions is processed with our tools for word tokenization, sentence boundary detection and morphological analysis. Token annotations include a correct word proposed by a statistical correction system.

Daniel Hládek, Ján Staš, Jozef Juhár
Towards the Development of the Multilingual Multimodal Virtual Agent

The mobile virtual agent (assistant) is one of today’s most intriguing new technologies. Development of such an agent is multidisciplinary work, and natural language processing is an indispensable part of this work. Our goal is to develop a multilingual multimodal virtual agent. In this paper, we describe the first steps towards this goal – the design, development, and evaluation of the intelligent translation agent. The agent provides speech to speech translation of words, phrases, and sentences from English into Spanish, French, or Russian. The initial evaluation performed for natural language components, as well as for the agent in general, indicated that there is user interest and that such an application is useful.

Inese Vīra, Jānis Teseļskis, Inguna Skadiņa
The WikEd Error Corpus: A Corpus of Corrective Wikipedia Edits and Its Application to Grammatical Error Correction

This paper introduces the freely available WikEd Error Corpus. We describe the data mining process from Wikipedia revision histories, corpus content and format. The corpus consists of more than 12 million sentences with a total of 14 million edits of various types. As one possible application, we show that WikEd can be successfully adapted to improve a strong baseline in a task of grammatical error correction for English-as-a-Second-Language (ESL) learners’ writings by 2.63%. Used together with an ESL error corpus, a composed system gains 1.64% when compared to the ESL-trained system.

Roman Grundkiewicz, Marcin Junczys-Dowmunt
Backmatter
Metadata
Title
Advances in Natural Language Processing
Edited by
Adam Przepiórkowski
Maciej Ogrodniczuk
Copyright Year
2014
Publisher
Springer International Publishing
Electronic ISBN
978-3-319-10888-9
Print ISBN
978-3-319-10887-2
DOI
https://doi.org/10.1007/978-3-319-10888-9