2018 | Book

Computational Linguistics and Intelligent Text Processing

17th International Conference, CICLing 2016, Konya, Turkey, April 3–9, 2016, Revised Selected Papers, Part I

About this book

The two-volume set LNCS 9623 + 9624 constitutes revised selected papers from the CICLing 2016 conference, which took place in Konya, Turkey, in April 2016.

The 89 papers presented in the two volumes were carefully reviewed and selected from 298 submissions. The book also contains 4 invited papers and a memorial paper on Adam Kilgarriff’s legacy to computational linguistics.

The papers are organized in the following topical sections:

Part I: In memoriam of Adam Kilgarriff; general formalisms; embeddings, language modeling, and sequence labeling; lexical resources and terminology extraction; morphology and part-of-speech tagging; syntax and chunking; named entity recognition; word sense disambiguation and anaphora resolution; semantics, discourse, and dialog.

Part II: machine translation and multilingualism; sentiment analysis, opinion mining, subjectivity, and social media; text classification and categorization; information extraction; and applications.

Table of Contents

Frontmatter

In Memoriam

Frontmatter
Adam Kilgarriff’s Legacy to Computational Linguistics and Beyond

The 2016 CICLing conference was dedicated to the memory of Adam Kilgarriff, who died the year before. Adam leaves behind a tremendous scientific legacy, and those working in computational linguistics, other fields of linguistics, and lexicography are indebted to him. This paper is a summary review of some of Adam’s main scientific contributions. It is not and cannot be exhaustive, and it is written by only a small selection of his large network of collaborators. Nevertheless, we hope it will provide a useful summary for readers wanting to know more about the origins of the work, events, and software that are so widely relied upon by scientists today, and that undoubtedly will continue to be so in the foreseeable future.

Roger Evans, Alexander Gelbukh, Gregory Grefenstette, Patrick Hanks, Miloš Jakubíček, Diana McCarthy, Martha Palmer, Ted Pedersen, Michael Rundell, Pavel Rychlý, Serge Sharoff, David Tugwell

General Formalisms

Frontmatter
A Roadmap Towards Machine Intelligence

The development of intelligent machines is one of the biggest unsolved challenges in computer science. In this paper, we propose some fundamental properties these machines should have, focusing in particular on communication and learning. We discuss a simple environment that could be used to incrementally teach a machine the basics of natural-language-based communication, as a prerequisite to more complex interaction with human users. We also present some conjectures on the sort of algorithms the machine should support in order to profitably learn from the environment.

Tomas Mikolov, Armand Joulin, Marco Baroni
Algebraic Specification for Interoperability Between Data Formats: Application on Arabic Lexical Data

Linguistic data formats (LDF) have become, over the years, more and more complex and heterogeneous due to the diversity of linguistic needs. Communication between these linguistic data formats is impossible since they are increasingly multi-platform and multi-provider, and they face several interoperability issues that must be resolved to guarantee consistency and avoid redundancy. In this context, we establish a method based on algebraic specifications to resolve interoperability among data formats. The proposed categorical method consists in constructing a unified language: to compose it, we apply the co-limit of the algebraic specification category of each data format. With this method, we establish a grid between existing data formats that allows mapping to the unifier using algebraic specifications. We then apply our approach to Arabic lexical data and experiment with it using the Specware software.

Malek Lhioui, Kais Haddar, Laurent Romary
Persianp: A Persian Text Processing Toolbox

This paper describes the Persianp Toolbox, an integrated Persian text-processing system that is easily used in other software applications. The toolbox, which provides the fundamental Persian text-processing steps, includes several modules. Some modules, such as the normalizer, tokenizer, sentencizer, stop-word detector, and part-of-speech tagger, build on previous studies. In other modules, i.e., the Persian lemmatizer and NP chunker, new ideas for preparing the required training data and/or applying new techniques are presented. Experimental results show the strong performance of each part of the toolbox. The accuracies of the tokenizer, the POS tagger, the lemmatizer, and the NP chunker are 97%, 95.6%, 97%, and 97.2%, respectively.

Mahdi Mohseni, Javad Ghofrani, Heshaam Faili

Embeddings, Language Modeling, and Sequence Labeling

Frontmatter
Generating Bags of Words from the Sums of Their Word Embeddings

Many methods have been proposed to generate sentence vector representations, such as recursive neural networks, latent distributed memory models, and the simple sum of word embeddings (SOWE). However, very few methods demonstrate the ability to reverse the process – recovering sentences from sentence embeddings. Amongst the many sentence embeddings, SOWE has been shown to maintain semantic meaning, so in this paper we introduce a method for moving from SOWE representations back to the bag of words (BOW) for the original sentences. This is a partway step towards recovering the whole sentence and has useful theoretical and practical applications of its own. It is done using a greedy algorithm to convert the vector to a bag of words. To our knowledge this is the first such work. It demonstrates qualitatively the ability to recreate the words of sentences from a large corpus based on their sentence embeddings. As well as the practical application of allowing classical information retrieval methods to be combined with more recent methods using the sums of word embeddings, the success of this method has theoretical implications for the degree of information maintained by the sum-of-embeddings representation. This lends some credence to the consideration of SOWE as a dimensionality-reduced, and meaning-enhanced, data manifold for the bag of words.

Lyndon White, Roberto Togneri, Wei Liu, Mohammed Bennamoun
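
A minimal illustration of the recovery idea described above (a sketch, not the authors' exact algorithm): greedily pick, at each step, the vocabulary word whose embedding best reduces the residual between the target SOWE vector and the words chosen so far. The vocab dictionary and max_words cap are illustrative assumptions.

    import numpy as np

    def greedy_bow(target, vocab, max_words=20):
        """Greedily recover a bag of words from a sum-of-word-embeddings vector.
        vocab: dict mapping word -> np.ndarray embedding."""
        residual = target.copy()
        bag = []
        for _ in range(max_words):
            best = min(vocab, key=lambda w: np.linalg.norm(residual - vocab[w]))
            if np.linalg.norm(residual - vocab[best]) >= np.linalg.norm(residual):
                break  # no remaining word improves the approximation
            bag.append(best)
            residual -= vocab[best]
        return bag
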
New Word Analogy Corpus for Exploring Embeddings of Czech Words

Word embedding methods have been proven very useful in many tasks of NLP (Natural Language Processing). Much has been investigated about embeddings of English words and phrases, but only little attention has been dedicated to other languages. Our goal in this paper is to explore the behavior of state-of-the-art word embedding methods on Czech, a language characterized by very rich morphology. We introduce a new corpus for the word analogy task that inspects syntactic, morphosyntactic, and semantic properties of Czech words and phrases. We experiment with the Word2Vec and GloVe algorithms and discuss the results on this corpus. The corpus is available to the research community.

Lukáš Svoboda, Tomáš Brychcín
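
As a hedged illustration of the word analogy task such a corpus supports, the standard vector-offset evaluation with gensim looks roughly as follows; the model path and example words are placeholders, not artifacts from the paper.

    from gensim.models import Word2Vec

    model = Word2Vec.load("czech_embeddings.model")  # hypothetical trained model
    # Analogy: muž (man) : žena (woman) :: král (king) : ?
    answer = model.wv.most_similar(positive=["žena", "král"], negative=["muž"], topn=1)
    print(answer)  # ideally recovers "královna" (queen)
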
Using Embedding Models for Lexical Categorization in Morphologically Rich Languages

Neural-network-based semantic embedding models are relatively new but popular tools in the field of natural language processing. It has been shown that continuous embedding vectors assigned to words provide an adequate representation of their meaning in the case of English. However, morphologically rich languages have not yet been the subject of experiments with these embedding models. In this paper, we investigate the performance of embedding models for Hungarian, trained on corpora with different levels of preprocessing. The models are evaluated on various lexical categorization tasks. They are used for enriching the lexical database of a morphological analyzer with semantic features automatically extracted from the corpora.

Borbála Siklósi
A New Language Model Based on Possibility Theory

Language modeling is a very important step in several NLP applications. Most current language models are based on probabilistic methods. In this paper, we propose a new language modeling approach based on possibility theory. Our goal is to suggest a method for estimating the possibility of a word sequence and to test this new approach in a machine translation system. We propose a word-sequence possibilistic measure, which can be estimated from a corpus. We proceeded in two ways: first, we checked the behavior of the new approach compared with the existing work; second, we compared the new language model with the probabilistic one used in statistical MT systems. The results, in terms of the METEOR metric, show that the possibilistic language model is better than the probabilistic one. However, in terms of BLEU and TER scores, the probabilistic model remains better.

Mohamed Amine Menacer, Abdelfetah Boumerdas, Chahnez Zakaria, Kamel Smaili
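
The paper's exact possibilistic measure is not reproduced in the abstract; as a rough sketch of the general idea, possibility theory commonly combines degrees with a minimum rather than a product, so a bigram-level possibilistic score of a word sequence might look like this (the estimator pi is an assumption for illustration, not the authors' measure):

    def sequence_possibility(words, pi):
        """pi(word, prev) -> possibility degree in [0, 1], e.g. estimated
        from corpus counts normalized by the maximum bigram count."""
        degrees = [pi(w, prev) for prev, w in zip(words, words[1:])]
        return min(degrees) if degrees else 1.0
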
Combining Discrete and Neural Features for Sequence Labeling

Neural network models have recently received heated research attention in the natural language processing community. Compared with traditional models with discrete features, neural models have two main advantages. First, they take low-dimensional, real-valued embedding vectors as inputs, which can be trained over large raw data, thereby addressing the issue of feature sparsity in discrete models. Second, deep neural networks can be used to automatically combine input features, including non-local features that capture semantic patterns that cannot be expressed using discrete indicator features. As a result, neural network models have achieved competitive accuracies compared with the best discrete models for a range of NLP tasks. On the other hand, manual feature templates have been carefully investigated for most NLP tasks over decades and typically cover the most useful indicator patterns for solving the problems. Such information can be complementary to the features automatically induced from neural networks, so combining discrete and neural features can potentially lead to better accuracy compared with models that leverage discrete or neural features only. In this paper, we systematically investigate the effect of combining discrete and neural features for a range of fundamental NLP tasks based on sequence labeling, including word segmentation, POS tagging, and named entity recognition, for both Chinese and English. Our results on standard benchmarks show that state-of-the-art neural models can give accuracies comparable to the best discrete models in the literature for most tasks, and that combining discrete and neural features consistently yields better results.

Jie Yang, Zhiyang Teng, Meishan Zhang, Yue Zhang
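
A hedged sketch of the combination idea in this paper: concatenating sparse discrete indicator features from manual templates with dense word embeddings before the sequence-labeling layers. Layer sizes and choices are illustrative, not the authors' architecture.

    import torch
    import torch.nn as nn

    class CombinedRepresentation(nn.Module):
        def __init__(self, vocab_size, embed_dim, n_discrete, hidden):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.proj = nn.Linear(embed_dim + n_discrete, hidden)

        def forward(self, word_ids, discrete_feats):
            # discrete_feats: multi-hot vector of template-based indicator features
            x = torch.cat([self.embed(word_ids), discrete_feats], dim=-1)
            return torch.relu(self.proj(x))  # fed to a tagger, e.g. CRF or softmax
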
New Recurrent Neural Network Variants for Sequence Labeling

In this paper we study different architectures of Recurrent Neural Networks (RNN) for sequence labeling tasks. We propose two new variants of RNN and compare them to the more traditional architectures of Elman and Jordan, explaining in detail the advantages of the new variants with respect to Elman's and Jordan's RNNs. We evaluate all models, new and traditional, on three different tasks: POS tagging of the French Treebank, and two Spoken Language Understanding (SLU) tasks, namely ATIS and MEDIA. The results we obtain clearly show that the new variants of RNN are more effective than the traditional ones.

Marco Dinarelli, Isabelle Tellier

Lexical Resources and Terminology Extraction

Frontmatter
Mining the Web for Collocations: IR Models of Term Associations

Automatic collocation recognition has attracted considerable attention from researchers in diverse fields, since it is one of the fundamental tasks in NLP, feeding into several other tasks (e.g., parsing, idioms, summarization, etc.). Despite this attention, the problem has remained a “daunting challenge.” As others have observed before, existing approaches based on frequencies and statistical information have limitations. An even bigger problem is that they are restricted to bigrams, and as yet there is no consensus on how to extend them to trigrams and higher-order n-grams. This paper presents encouraging results based on novel angles of general collocation extraction leveraging statistics and the Web. In contrast to existing work, our algorithms are applicable to n-grams of arbitrary order and are directional. Experiments across several datasets, including a gold-standard benchmark dataset that we created, demonstrate the effectiveness of the proposed methods.

Rakesh Verma, Vasanthi Vuppuluri, An Nguyen, Arjun Mukherjee, Ghita Mammar, Shahryar Baki, Reed Armstrong
A Continuum-Based Model of Lexical Acquisition

The automatic acquisition of verbal constructions is an important issue for natural language processing. In this paper, we have a closer look at two fundamental aspects of the description of the verb: the notion of lexical item and the distinction between arguments and adjuncts. Following up on studies in natural language processing and linguistics, we embrace the double hypothesis (i) of a continuum between ambiguity and vagueness, and (ii) of a continuum between arguments and adjuncts. We provide a complete approach to lexical knowledge acquisition of verbal constructions from an untagged news corpus. The approach is evaluated through the analysis of a sample of the 7,000 Japanese verbs automatically described by the system.

Pierre Marchal, Thierry Poibeau
Description of Turkish Paraphrase Corpus Structure and Generation Method

Because developing a corpus requires a long time and much human effort, it is desirable to make it as resourceful as possible: rich in coverage, flexible, multipurpose, and expandable. Here we describe the steps we took in the development of a Turkish paraphrase corpus, the factors we considered, the problems we faced, and how we dealt with them. Currently our corpus contains nearly 4000 sentences, with a ratio of 60% paraphrase to 40% non-paraphrase sentence pairs. The sentence pairs are annotated on a 5-point scale: paraphrase, encapsulating, encapsulated, non-paraphrase, and opposite. The corpus is formulated in a database structure integrated with a Turkish dictionary. The sources we have used so far are news texts from the Bilcon 2005 corpus, a set of professionally translated sentence pairs from the MSRP corpus, multiple Turkish translations from the different languages involved in the Tatoeba corpus, and user-generated paraphrases.

Bahar Karaoglan, Tarık Kışla, Senem Kumova Metin
Extracting Terminological Relationships from Historical Patterns of Social Media Terms

In this article we propose and evaluate a method to extract terminological relationships from microblogs. The idea is to analyze archived microblogs (tweets, for example) and trace the history of each term: similar histories indicate a relationship between terms, an indication that can be validated by further processing. For example, if the terms t1 and t2 were frequently used on Twitter on certain days, and their frequency patterns match over a period of time, then t1 and t2 may be related. Extracting standard terminological relationships can be difficult, especially in a dynamic context such as social media, where millions of microblogs (short textual messages) are published and thousands of new terms are coined every day. We therefore propose to compile a nonstandard raw repository of lexical units with unconfirmed relationships. This paper shows a method to draw relationships between time-sensitive Arabic terms by matching similar timelines of these terms, using dynamic time warping to align the timelines. To evaluate our approach we selected 430 terms and matched the similarity between their frequency patterns over a period of 30 days. Around 250 correct relationships were extracted, with a precision of 0.65. These relationships were drawn without using any parallel text or analyzing the textual context of the terms, which is notable considering that the studied terms can be newly coined by microbloggers and their availability in standard repositories is limited.

Daoud Daoud, Mohammad Daoud
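
The timeline alignment step relies on dynamic time warping (DTW); a compact textbook formulation over two daily-frequency series is sketched below (the threshold test is an illustrative assumption, not the paper's tuned value).

    def dtw(a, b):
        """DTW distance between two frequency timelines a and b."""
        n, m = len(a), len(b)
        inf = float("inf")
        cost = [[inf] * (m + 1) for _ in range(n + 1)]
        cost[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(a[i - 1] - b[j - 1])
                cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                     cost[i][j - 1],      # deletion
                                     cost[i - 1][j - 1])  # match
        return cost[n][m]

    # Terms whose 30-day timelines align closely become candidate relatives:
    # related = dtw(freq_t1, freq_t2) < threshold
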
Adaptation of Cross-Lingual Transfer Methods for the Building of Medical Terminology in Ukrainian

The increasing availability of parallel bilingual corpora and of automatic methods and tools makes it possible to build linguistic and terminological resources for low-resourced languages. We propose to exploit corpora available in several languages for building bilingual and trilingual terminologies. Typically, terminological information extracted in better-resourced languages is associated with the corresponding units in lower-resourced languages through multilingual transfer. The method is applied to corpora involving the Ukrainian language. According to the experiments, the precision of term extraction varies between 0.454 and 0.966, while the quality of the interlingual relations varies between 0.309 and 0.965. The resulting resource contains 4,588 medical terms in Ukrainian and their 34,267 relations with French and English terms.

Thierry Hamon, Natalia Grabar
Adaptation of a Term Extractor to Arabic Specialised Texts: First Experiments and Limits

In this paper, we present an adaptation to Modern Standard Arabic of a French and English term extractor. The goal of this work is to reduce the lack of resources and NLP tools for the Arabic language in specialised domains. The adaptation first focuses on describing extraction processes similar to those already defined for French and English while considering the morpho-syntactic specificities of Arabic; agglutination phenomena are then taken into account in the term extraction process. The current state of the adapted system was evaluated on a medical text corpus: 400 maximal candidate terms were examined, among which 288 were correct (72% precision). An error analysis shows that term extraction errors are due first to part-of-speech tagging errors and the difficulties induced by non-diacritised texts, and then to remaining agglutination phenomena.

Wafa Neifar, Thierry Hamon, Pierre Zweigenbaum, Mariem Ellouze Khemakhem, Lamia Hadrich Belguith

Morphology and Part-of-Speech Tagging

Frontmatter
Corpus Frequency and Affix Ordering in Turkish

Suffix sequences in agglutinative languages derive complex structures. Based on frequency information from corpus data, this study presents emerging multi-morpheme sequences in Turkish. Morphgrams formed by combining voice suffixes with other verbal suffixes from finite and non-finite templates are identified in the corpus, and statistical analyses are conducted on the permissible combinations of these suffixes. The findings of the study have implications for further studies on morphological processing in agglutinative languages.

Mustafa Aksan, Umut Ufuk Demirhan, Yeşim Aksan
Pluralising Nouns in isiZulu and Related Languages

There are compelling reasons for a Controlled Natural Language of isiZulu in software applications, which requires pluralising nouns. Only ‘canonical’ singular/plural pairs exist, however, which are insufficient for computational use of isiZulu. Starting from these rules, we take an experimental approach as a virtuous spiral, refining the rules by repeatedly testing two test sets against successive versions of the pluralisation rules. This resulted in the elucidation of additional pluralisation rules not included in typical isiZulu textbooks and grammar resources, and motivated design choices for algorithm development. We assessed the potential for reuse of the approach and the type of deviations with Runyankore, which demonstrated encouraging results.

Joan Byamugisha, C. Maria Keet, Langa Khumalo
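
To make the starting point concrete, here is a sketch of the kind of 'canonical' prefix-swap rules the paper begins from; the rule list is a tiny illustrative sample, and its ambiguity (two noun classes share the umu- prefix) is exactly the sort of insufficiency the experimental refinement addresses.

    CANONICAL_PAIRS = [
        ("umu", "aba"),  # class 1 -> 2, e.g. umuntu -> abantu (person -> people)
        ("umu", "imi"),  # class 3 -> 4: same singular prefix as class 1!
        ("ili", "ama"),  # class 5 -> 6, e.g. ilizwe -> amazwe (country -> countries)
        ("isi", "izi"),  # class 7 -> 8, e.g. isithombe -> izithombe (picture -> pictures)
    ]

    def naive_plural(noun):
        for singular, plural in CANONICAL_PAIRS:
            if noun.startswith(singular):
                return plural + noun[len(singular):]
        return None  # rule set incomplete: the gap the paper investigates
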
Morphological Analysis of Urdu Verbs

The acquisition of knowledge about word characteristics is a basic requirement for developing natural language processing applications for a particular language. In this paper, we present a detailed analysis of the morphology of Urdu verbs. During our analysis, we observed that Urdu verbs can have 47 different types of inflections. The different inflected forms of 975 Urdu verbs have been analyzed and the details of the analysis are presented. We propose a new classification scheme for Urdu verbs, based on morphology. The morphological rules proposed for each class have been tested by simulation with a 2-layer morphological analyzer based on finite state transducers. The analysis and generation of surface forms have been successfully carried out, indicating the robustness of the proposed methodology.

Aneeta Niazi
Stemming and Segmentation for Classical Tibetan

Tibetan is a monosyllabic language for which computerized language tools are largely lacking. We describe the development of a syllable stemmer for Tibetan. The stemmer is based on a set of rules that strive to identify the vowel, the core letter of the syllable, and then the other parts. We demonstrate the value of the stemmer with two applications: determining stem similarity of two syllables and word segmentation. Our stemmer is being made available as an open-source tool and word segmentation as a freely-available online tool.

Orna Almogi, Lena Dankin, Nachum Dershowitz, Yair Hoffman, Dimitri Pauls, Dorji Wangchuk, Lior Wolf
Part of Speech Tagging for Polish: State of the Art and Future Perspectives

In this paper we discuss the intricacies of part-of-speech tagging for the Polish language, present the current state of the art by comparing available taggers in detail, and show the main obstacles that limit the accuracy of Polish POS tagging to no more than 91% of correctly tagged word segments. As this result is not only lower than that of English taggers, but also below those for other highly inflective languages, such as Czech and Slovene, we try to identify the main weaknesses, whether in the taggers, their underlying algorithms, the training data, or difficulties inherent to the language, to explain this difference. For this purpose we analyze the errors made individually by each of the available Polish POS taggers, by an ensemble of the taggers, and also by the publicly available, well-known OpenNLP tagger adapted to the Polish tagset. Finally, we propose further steps that should be taken to narrow the gap between Polish and English POS tagging performance.

Łukasz Kobyliński, Witold Kieraś
Turkish PoS Tagging by Reducing Sparsity with Morpheme Tags in Small Datasets

Sparsity is one of the major problems in natural language processing. The problem becomes even more severe in agglutinative languages that are highly inflected. We deal with sparsity in Turkish by adopting morphological features for part-of-speech tagging. We learn inflectional and derivational morpheme tags in Turkish using conditional random fields (CRF), and we employ the morpheme tags in part-of-speech (PoS) tagging using hidden Markov models (HMMs) to mitigate sparsity. Results show that using morpheme tags in PoS tagging helps alleviate the sparsity of emission probabilities. Our model outperforms other hidden Markov model based PoS tagging models for small training datasets in Turkish. We obtain an accuracy of 94.1% in morpheme tagging and 89.2% in PoS tagging on a 5K training dataset.

Burcu Can, Ahmet Üstün, Murathan Kurfalı
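
To illustrate the sparsity argument (a schematic sketch under assumed data structures, not the authors' implementation): HMM emissions estimated over morpheme-tag sequences rather than raw surface forms let unseen inflected words share statistics with words bearing the same tags.

    from collections import Counter

    def emission_probs(tagged_corpus):
        """tagged_corpus: list of (pos_tag, morpheme_tags) pairs, where
        morpheme_tags is e.g. ('Noun', 'P3sg', 'Acc') from a CRF morpheme tagger.
        Returns P(morpheme_tags | pos_tag) by relative frequency."""
        pair_counts = Counter(tagged_corpus)
        pos_totals = Counter(pos for pos, _ in tagged_corpus)
        return {(pos, m): c / pos_totals[pos] for (pos, m), c in pair_counts.items()}
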
Part-of-Speech Tagging for Code Mixed English-Telugu Social Media Data

Part-of-speech tagging is a primary and important step for many Natural Language Processing applications. POS taggers have reported high accuracies on grammatically correct monolingual data. This paper reports work on annotating code-mixed English-Telugu data collected from the social media site Facebook and creating automatic POS taggers for this corpus. POS tagging is considered as a classification problem, and we use different classifiers such as linear SVMs, CRFs, and multinomial Bayes, with different combinations of features which capture both the context of the word and its internal structure. We also report our work on experimenting with combining monolingual POS taggers for POS tagging of this code-mixed English-Telugu data.

Kovida Nelakuditi, Divya Sai Jitta, Radhika Mamidi

Syntax and Chunking

Frontmatter
Analysis of Word Order in Multiple Treebanks

This paper gives an overview of the results of an automatic analysis of word order in 23 dependency treebanks. These treebanks have been collected in the framework of the HamleDT project, whose main goal is to provide universal annotation for dependency corpora, which also makes it possible to use identical queries for all the corpora. The analysis concentrates on basic characteristics of word order, namely the order of the three main constituents: predicate, subject, and object. A quantitative analysis is performed separately for main clauses and subordinated clauses, and the presence of an active verb is taken into account. We show that in many languages subordinated clauses have a slightly different order of words than main clauses, and that the choice of voice also has an impact on word order.

Vladislav Kuboň, Markéta Lopatková, Jiří Mírovský
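
A hedged sketch of the type of treebank query involved, counting the relative order of subject, verb, and object in CoNLL-U trees with the conllu package; the UD-style deprel labels and file name here are assumptions and may differ from the HamleDT annotation actually used.

    from collections import Counter
    from conllu import parse_incr

    orders = Counter()
    with open("treebank.conllu", encoding="utf-8") as f:  # hypothetical file
        for sent in parse_incr(f):
            for tok in sent:
                if not isinstance(tok["id"], int) or tok["upos"] != "VERB":
                    continue
                subj = [t for t in sent if t["head"] == tok["id"] and t["deprel"] == "nsubj"]
                obj = [t for t in sent if t["head"] == tok["id"] and t["deprel"] == "obj"]
                if subj and obj:
                    triple = sorted([("S", subj[0]["id"]), ("V", tok["id"]), ("O", obj[0]["id"])],
                                    key=lambda x: x[1])
                    orders["".join(label for label, _ in triple)] += 1

    print(orders.most_common())  # e.g. [('SVO', ...), ('SOV', ...), ...]
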
A Framework for Language Resource Construction and Syntactic Analysis: Case of Arabic

Language resources such as grammars or dictionaries are very important to any natural language processing application. Unfortunately, the manual construction of these resources is laborious and time-consuming. The use of annotated corpora as a knowledge database may be a solution for the fast construction of a grammar for a given language. In this paper, we present our framework to automatically induce a syntactic grammar, in our case a probabilistic context-free grammar, from an Arabic annotated corpus (the Penn Arabic Treebank). The developed system allows the user to build a probabilistic context-free grammar from the annotated corpus syntactic trees. It also offers the possibility of parsing Arabic sentences using the generated resource. Finally, we present evaluation results.

Nabil Khoufi, Chafik Aloulou, Lamia Hadrich Belguith
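
Inducing a PCFG from treebank trees can be sketched in a few lines with NLTK; the tree strings below are placeholders, since the Penn Arabic TreeBank itself cannot be reproduced here, and this is only the general recipe, not the paper's system.

    import nltk
    from nltk import Tree, Nonterminal, induce_pcfg

    corpus_tree_strings = [  # placeholder bracketed trees standing in for the ATB
        "(S (NP (NN example)) (VP (VB parses)))",
    ]
    trees = [Tree.fromstring(s) for s in corpus_tree_strings]
    productions = [p for t in trees for p in t.productions()]
    grammar = induce_pcfg(Nonterminal("S"), productions)  # rule probabilities from counts

    parser = nltk.ViterbiParser(grammar)  # parse new sentences with the induced grammar
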
Enhancing Neural Network Based Dependency Parsing Using Morphological Information for Hindi

In this paper, we propose a way of incorporating morphological resources to enhance the performance of neural network based dependency parsing. We conduct our experiments on Hindi, which is a morphologically rich language, and report our results on two well-known Hindi dependency parsing datasets. We show an improvement in both Unlabeled Attachment Score (UAS) and Labeled Attachment Score (LAS) compared to previous state-of-the-art Hindi dependency parsers using only word embeddings, POS tag embeddings, and arc-label embeddings as features. Using morphological features such as the number, gender, person, and case of words, we achieve an additional improvement in both LAS and UAS. We find that many of the erroneous sentences contain named entities, and we propose a treatment for named entities which further improves both UAS and LAS of our Hindi dependency parser (the parser is available at http://www.cicling.org/2016/data/126/CICLing_126.zip).

Agnivo Saha, Sudeshna Sarkar
Construction Grammar Based Annotation Framework for Parsing Tamil

Syntactic parsing in NLP is the task of working out the grammatical structure of sentences. Purely formal approaches to parsing, such as phrase structure grammar and dependency grammar, have been successfully employed for a variety of languages. While phrase structure based constituent analysis is possible for fixed-order languages such as English, dependency analysis between grammatical units has been suitable for many free word order languages. These approaches rely on identifying linguistic units based on their formal syntactic properties and establishing the relationships between such units in the form of a tree. Instead, we characterize every morphosyntactic unit as a mapping between form and function along the lines of Construction Grammar, and parsing as the identification of dependency relations between such conceptual units. Our approach to parser annotation shows an average MALT LAS score of 82.21% on a Tamil gold annotated corpus of 935 sentences in a five-fold validation experiment.

Vigneshwaran Muralidaran, Dipti Misra Sharma
Comparative Error Analysis of Parser Outputs on Telugu Dependency Treebank

We present a comparative error analysis of two parsers, MALT and MST, on Telugu Dependency Treebank data. MALT and MST are currently two of the most dominant data-driven dependency parsers. We discuss the performance of both parsers in relation to the Telugu language, and examine in detail both the algorithmic issues of the parsers and the language-specific constraints of Telugu. The purpose is to better understand how to help the parsers deal with complex structures, make sense of implicit language-specific cues, and build a more informed treebank.

Silpa Kanneganti, Himani Chaudhry, Dipti Misra Sharma
Gut, Besser, Chunker – Selecting the Best Models for Text Chunking with Voting

The CoNLL-2000 dataset is the de-facto standard dataset for measuring chunkers on the task of chunking base noun phrases (NP) and arbitrary phrases. The state-of-the-art tagging method utilises TnT, an HMM-based part-of-speech (POS) tagger, with simple majority voting on different representations and fine-grained classes created by lexicalising tags. In this paper the state-of-the-art English phrase chunking method was deeply investigated, re-implemented, and evaluated with several modifications. We also investigated a less studied side of phrase chunking, i.e. voting between the different currently available taggers, the checking of invalid sequences, and the way the state-of-the-art method can be adapted to morphologically rich, agglutinative languages. We propose a new, mild level of lexicalisation and a better combination of representations and taggers for English. The final architecture outperformed the state of the art for arbitrary phrase identification and NP chunking, achieving F-scores of 95.06% for arbitrary phrases and 96.49% for noun phrases.

Balázs Indig, István Endrédy

Named Entity Recognition

Frontmatter
A Deep Learning Solution to Named Entity Recognition

Identifying named entities is vital for many Natural Language Processing (NLP) applications. Much of the earlier work on identifying named entities focused on using handcrafted features and knowledge resources (feature engineering). This is a barrier for resource-scarce languages, as many resources are not readily available. Recently, deep learning techniques have been proposed for various NLP tasks that require little or no hand-crafted features and knowledge resources; instead, the features are learned from the data. Many proposed deep learning solutions for Named Entity Recognition (NER) still rely on feature engineering as opposed to feature learning, and it is not clear whether the deep learning system or the engineered features are responsible for the positive results reported. This is in contrast with the goal of deep learning systems, i.e., to learn the features from the data itself. In this study, we show that a feature-learning deep learning system is a viable solution to the NER task. We test our deep learning systems on the CoNLL English and Spanish NER datasets. Our system gives results comparable to existing state-of-the-art feature-engineered systems for English, and we report the best performance of 89.27 F-score for English when comparing with systems which do not use any handcrafted features or knowledge resources. Evaluation of our trained system on out-of-domain data indicates promising results. When tested on Spanish NER, our system achieves the best reported F-score of 82.59, indicating its applicability to other languages.

V. Rudra Murthy, Pushpak Bhattacharyya
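
For orientation, a minimal feature-learning tagger of the kind discussed (no hand-crafted features) can be written as a BiLSTM over word embeddings; the abstract does not specify the authors' exact architecture, so all sizes and layers below are assumptions.

    import torch.nn as nn

    class BiLSTMTagger(nn.Module):
        def __init__(self, vocab_size, n_tags, embed_dim=100, hidden=200):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden, bidirectional=True, batch_first=True)
            self.out = nn.Linear(2 * hidden, n_tags)

        def forward(self, word_ids):  # word_ids: (batch, seq_len)
            hidden_states, _ = self.lstm(self.embed(word_ids))
            return self.out(hidden_states)  # per-token tag scores
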
Deep Learning Approach for Arabic Named Entity Recognition

Inspired by recent work in deep learning that has achieved excellent performance on difficult problems such as computer vision and speech recognition, we introduce a simple and fast model for Arabic named entity recognition based on Deep Neural Networks (DNNs). Named Entity Recognition (NER) is the task of classifying or labelling atomic elements in text into categories such as Person, Location, or Organization. The unique characteristics and complexity of the Arabic language make the extraction of named entities a challenging task. Most state-of-the-art systems use a combination of various machine learning algorithms or rely on handcrafted engineered features and the output of other NLP tasks such as part-of-speech (POS) tagging, text chunking, prefixes and suffixes, as well as large gazetteers. In this paper, we present an Arabic NER system based on DNNs that automatically learns features from data. The experimental results show that our approach outperforms a model based on Conditional Random Fields by 12.36 points in F-measure. Moreover, our model outperforms the state of the art by 5.18 points in precision and achieves very close results in F-measure. Most importantly, our system can easily be extended to recognize other named entities without any additional rules or handcrafted engineered features.

Mourad Gridach
Hybrid Feature Selection Approach for Arabic Named Entity Recognition

The Named Entity Recognition (NER) task has drawn great attention in the research field in the last decade, as it plays an important role in Natural Language Processing (NLP) applications. In this paper, we investigate the effectiveness of a hybrid feature subset selection approach for Arabic NER that combines a filtering approach with an optimized genetic algorithm. The genetic algorithm parallelizes the fitness computation in order to reduce the computation time needed to search out the most appropriate and informative combination of features for classification. A Support Vector Machine (SVM) is used as the machine learning based classifier to evaluate the accuracy of Arabic NER with the proposed approach. Our experiments use the ANER dataset, represented by both language-independent and language-specific features for Arabic NER. Experimental results show the effectiveness of the feature subsets obtained by the proposed hybrid approach, which are smaller and more effective than the original feature set and lead to a considerable increase in classification accuracy.

Miran Shahine, Mohamed Sakre
Named-Entity-Recognition (NER) for Tamil Language Using Margin-Infused Relaxed Algorithm (MIRA)

Named-Entity-Recognition (NER) is widely used as a foundation for Natural Language Processing (NLP) applications. There have been few previous attempts at building generic NER systems for the Tamil language. These attempts were based on machine learning approaches such as Hidden Markov Models (HMM), Maximum Entropy Markov Models (MEMM), Support Vector Machines (SVM), and Conditional Random Fields (CRF). Among them, CRF has been proven to be the best with respect to the accuracy of NER in Tamil. This paper presents a novel approach to building a Tamil NER system using the Margin-Infused Relaxed Algorithm (MIRA), together with a comparison of the performance of the MIRA and CRF algorithms for Tamil NER. When gazetteer, POS tag, and orthographic features are used with the MIRA algorithm, it attains an F1-measure of 81.38% on the Tamil BBC news data, whereas the CRF algorithm shows only an F1-measure of 79.13% for the same set of features. Our NER system outperforms all previous NER systems for the Tamil language.

Pranavan Theivendiram, Megala Uthayakumar, Nilusija Nadarasamoorthy, Mokanarangan Thayaparan, Sanath Jayasena, Gihan Dias, Surangika Ranathunga

Word Sense Disambiguation and Anaphora Resolution

Frontmatter
Word Sense Disambiguation Using Swarm Intelligence: A Bee Colony Optimization Approach

Word Sense Disambiguation (WSD) is the problem of determining the correct sense of a word in a given context. We introduce an unsupervised, knowledge-based approach to word sense disambiguation using a bee colony optimization algorithm that is constructive in nature. Our algorithm, using WordNet, optimizes the search space by disambiguating a document globally, constructively determining the sense of each word using the previously disambiguated words. Heuristic methods for unsupervised word sense disambiguation mostly give less importance to the context words when determining the sense of the target word. In this paper, we put more emphasis on the context and the part of speech of a word while determining its correct sense, and we make use of a modified simplified Lesk algorithm as a relatedness measure. Our approach is compared with recent unsupervised heuristics such as ant colony optimization, genetic algorithms, and simulated annealing, and shows promising results. We finally introduce a voting strategy that further improves our results.

Saket Kumar, Omar El Ariss
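
The relatedness measure named above, simplified Lesk, scores a WordNet sense by the overlap between its gloss (plus examples) and the context words; a plain sketch follows (the paper's modified variant adds more, e.g. part-of-speech sensitivity, which is omitted here).

    from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

    def simplified_lesk_score(sense, context_words):
        signature = set(sense.definition().lower().split())
        for example in sense.examples():
            signature |= set(example.lower().split())
        return len(signature & {w.lower() for w in context_words})

    def best_sense(word, context_words, pos=None):
        return max(wn.synsets(word, pos=pos),
                   key=lambda s: simplified_lesk_score(s, context_words),
                   default=None)
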
Verb Sense Annotation for Turkish PropBank via Crowdsourcing

In order to extract meaning representations from sentences, a corpus annotated with semantic roles is obligatory. Unfortunately, building such a corpus requires a tremendous amount of manual work for creating semantic frames and annotating the corpus. We have therefore divided the annotation task into two microtasks, verb sense annotation and argument annotation, and employed crowd intelligence to perform them. In this paper, we present our approach and the challenges of crowdsourcing the verb sense disambiguation task, and introduce a resource of 5855 annotated verb senses with 83.15% annotator agreement.

Gözde Gül Şahin
Coreference Resolution for French Oral Data: Machine Learning Experiments with ANCOR

We present CROC (Coreference Resolution for Oral Corpus), the first machine learning system for coreference resolution in French. One specific aspect of the system is that it has been trained on data that come exclusively from transcribed speech, namely ANCOR (ANaphora and Coreference in ORal corpus), the first large-scale French corpus annotated with anaphoric relations. In its current state, the CROC system requires pre-annotated mentions. We detail the features used by the learning algorithms and present a set of experiments with these features. The scores we obtain are close to those of state-of-the-art systems for written English.

Adèle Désoyer, Frédéric Landragin, Isabelle Tellier, Anaïs Lefeuvre, Jean-Yves Antoine, Marco Dinarelli
Arabic Anaphora Resolution Using Markov Decision Process

Anaphora resolution is one of the attractive problems of the NLP field. In this paper, we treat the problem of resolving pronominal anaphora, which are very abundant in Arabic texts. Our approach includes a set of steps, namely: identifying anaphoric pronouns, removing the non-referential ones, identifying the lists of candidates from the context surrounding each identified anaphora, and choosing the best candidate for each anaphoric pronoun. The last two steps can be seen as a dynamic and probabilistic process that consists of a sequence of decisions, and can be modeled as a Markov Decision Process (MDP). In addition, we have opted for a reinforcement learning approach because it is an effective method for learning in an uncertain and stochastic environment like ours, and because it can solve MDPs. In order to evaluate the proposed approach, we have developed a system that gives encouraging results; the resolution accuracy reaches up to 80%.

Fériel Ben Fraj Trabelsi, Chiraz Ben Othmane Zribi, Saoussen Mathlouthi
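
As a schematic illustration of the reinforcement learning machinery involved (states, actions, and rewards here are placeholders, not the paper's design), a tabular Q-learning update over candidate-selection decisions looks like this:

    ACTIONS = ["pick_candidate_1", "pick_candidate_2", "skip"]  # schematic action set

    def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
        """One Q-learning step: Q is a dict keyed by (state, action)."""
        best_next = max(Q.get((next_state, a), 0.0) for a in ACTIONS)
        old = Q.get((state, action), 0.0)
        Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
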
Arabic Pronominal Anaphora Resolution Based on New Set of Features

In this paper, we present a machine learning approach for Arabic pronominal anaphora resolution. This approach resolves anaphoric pronouns without using linguistic or domain knowledge, or deep parsing. It relies on features which are widely used in the literature for other languages such as English, to which we add new features specific to the Arabic language. We provide a practical implementation of this approach, which has been evaluated on three data sets (a technical manual, newspaper articles, and educational texts). The results of the evaluation show that our approach provides good performance in resolving Arabic pronominal anaphora: the F-measures are 86.2% for the technical manual, 84.5% for the newspaper articles, and 72.1% for the educational texts.

Souha Mezghani Hammami, Lamia Hadrich Belguith

Semantics, Discourse, and Dialog

Frontmatter
GpSense: A GPU-Friendly Method for Commonsense Subgraph Matching in Massively Parallel Architectures

In the context of commonsense reasoning, spreading activation is used to select relevant concepts in a graph of commonsense knowledge. When such a graph starts growing, however, the number of relevant concepts selected during spreading activation tends to diminish. In the literature, such an issue has been addressed in different ways but two other important issues have been rather under-researched, namely: performance and scalability. Both issues are caused by the fact that many new nodes, i.e., natural language concepts, are continuously integrated into the graph. Both issues can be solved by means of GPU accelerated computing, which offers unprecedented performance by offloading compute-intensive portions of the application to the GPU, while the remainder of the code still runs on the CPU. To this end, we propose a GPU-friendly method, termed GpSense, which is designed for massively parallel architectures to accelerate the tasks of commonsense querying and reasoning via subgraph matching. We show that GpSense outperforms the state-of-the-art algorithms and efficiently answers subgraph queries on a large commonsense graph.

Ha-Nguyen Tran, Erik Cambria
Parameters Driving Effectiveness of LSA on Topic Segmentation

Latent Semantic Analysis (LSA) is an efficient statistical technique for extracting semantic knowledge from large corpora. One of the major problems of this technique is identifying the most efficient parameters of LSA and the best combination among them. Therefore, in this paper, we propose a new topic segmenter to study in depth the different parameters of LSA for topic segmentation. The aim of this study is to analyze the effect of these different parameters on the quality of topic segmentation and to identify the most efficient ones. Based on extensive experiments, we show that the choice of LSA parameters is very sensitive and has an impact on the quality of topic segmentation. More importantly, based on this study, we are able to propose appropriate recommendations for the selection of parameters in the field of topic segmentation.

Marwa Naili, Anja Chaibi Habacha, Henda Hajjami Ben Ghezala
A New Russian Paraphrase Corpus. Paraphrase Identification and Classification Based on Different Prediction Models

Our main objectives are constructing a paraphrase corpus for Russian and developing paraphrase identification and classification models based on this corpus. The corpus consists of pairs of news headlines from different media agencies which are extracted and analyzed in real time. Paraphrase candidates are extracted using an unsupervised matrix similarity metric: if the metric value satisfies a certain threshold, the corresponding pair of sentences is included in the corpus. These pairs of sentences are then annotated via crowdsourcing; we provide a user-friendly online interface for crowdsourced annotation, available at http://paraphraser.ru. There are 7480 annotated sentence pairs in the corpus at the moment, with more to come. The types and features of these sentence pairs are not revealed to the annotators. We adopt a 3-class classification of paraphrases, distinguishing precise paraphrases (conveying the same meaning), loose paraphrases (conveying similar meaning), and non-paraphrases (conveying different meanings).

Ekaterina Pronoza, Elena Yagunova
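
A sketch of the candidate-extraction step under stated assumptions: TF-IDF vectors with cosine similarity stand in for the paper's unsupervised matrix similarity metric, and the threshold value is illustrative.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def candidate_pairs(headlines, threshold=0.5):
        X = TfidfVectorizer().fit_transform(headlines)
        sims = cosine_similarity(X)
        return [(headlines[i], headlines[j], sims[i, j])
                for i in range(len(headlines))
                for j in range(i + 1, len(headlines))
                if sims[i, j] >= threshold]  # pairs sent on to crowd annotation
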
Constructing a Turkish Corpus for Paraphrase Identification and Semantic Similarity

The paraphrase identification (PI) task has practical importance for work in Natural Language Processing (NLP) because of the problem of linguistic variation, and accurate methods should help improve the performance of key NLP applications. Paraphrase corpora are important resources for developing and evaluating PI methods. This paper describes the construction of a paraphrase corpus for Turkish. The corpus comprises pairs of sentences with semantic similarity scores based on human judgments, permitting experimentation with both PI and semantic similarity. We believe this is the first such corpus for Turkish. The data collection and scoring methodology is described, and initial PI experiments with the corpus are reported. Our approach to PI is novel in using ‘knowledge lean’ methods (i.e., no use of manually constructed knowledge bases or of processing tools that rely on them). We have previously achieved excellent results using such techniques on the Microsoft Research Paraphrase Corpus, and close to state-of-the-art performance on the Twitter Paraphrase Corpus.

Asli Eyecioglu, Bill Keller
Evaluation of Semantic Relatedness Measures for Turkish Language

The problem of quantifying the semantic relatedness of two words is a fundamental sub-task for many natural language processing systems. While there is a large body of research on measuring semantic relatedness in the English language, the literature lacks detailed analysis of these methods for agglutinative languages. In this research, two new evaluation resources for the Turkish language are constructed, and an extensive set of experiments involving multiple tasks, namely word association, semantic categorization, and automatic WordNet relationship discovery, is performed to evaluate different semantic relatedness measures in the Turkish language. As Turkish is an agglutinative language, the morphological processing component is important for distributional similarity algorithms. For languages with rich morphological variation and productivity, methods ranging from simple stemming strategies to morphological disambiguation exist; in our experiments, different morphological processing methods for the Turkish language are investigated.

Ugur Sopaoglu, Gonenc Ercan
Using Sentence Semantic Similarity to Improve LMF Standardized Arabic Dictionary Quality

This paper presents a novel algorithm to measure semantic similarity between sentences. It introduces a method that takes into account not only semantic knowledge but also syntactico-semantic knowledge, notably semantic predicates, semantic classes, and thematic roles. Firstly, semantic similarity between sentences is derived from word synonymy. Secondly, syntactico-semantic similarity is computed from the common semantic classes and thematic roles of the words in the sentences; indeed, this information is related to the semantic predicate. Finally, overall similarity is computed as a combination of lexical similarity, semantic similarity, and syntactico-semantic similarity using supervised learning. The proposed algorithm is applied to detecting information redundancy in the LMF Arabic dictionary, especially in the definitions and examples of lexical entries. Experimental results show that the proposed algorithm reduces redundant information, improving the content quality of the LMF Arabic dictionary.

Wafa Wali, Bilel Gargouri, Abdelmajid Ben Hamadou
Multiword Expressions (MWE) for Mizo Language: Literature Survey

We examine the formation of multi-word expressions (MWE) and reduplicated words in the Mizo language, based on a news corpus (reduplication is the repetition of a linguistic unit, such as a morpheme, affix, word, or clause). To study the structure of reduplication, we follow lexical and morphological approaches which have been used for the study of other Indian languages, such as Manipuri, Bengali, Odia, and Marathi. We also show the effect of these phenomena on natural language processing tasks for the Mizo language. To develop an algorithm for the identification of reduplicated words in Mizo, we manually identified MWEs and reduplicated words and then studied their structural and semantic properties. The results were verified by linguists who are experts in the Mizo language.

Goutam Majumder, Partha Pakray, Zoramdinthara Khiangte, Alexander Gelbukh
Classification of Textual Genres Using Discourse Information

This paper aims to measure the influence of textual genre on the usage of discourse relations and discourse markers. Specifically, we wish to evaluate to what extent the use of certain discourse relations and discourse markers is correlated with textual genre and consequently can be used to predict textual genre. To do so, we used the British National Corpus and compared a variety of discourse-level features on the task of genre classification. The results show that, individually, discourse relations and discourse markers do not outperform the standard bag-of-words approach, even with an identical number of features. However, discourse features do provide a significant increase in performance when they are used to augment the bag-of-words approach: using discourse relations and discourse markers allowed us to increase the F-measure of the bag-of-words approach from 0.796 to 0.878.

Elnaz Davoodi, Leila Kosseim, Félix-Hervé Bachand, Majid Laali, Emmanuel Argollo
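
A hedged sketch of augmenting bag-of-words with discourse-marker features in scikit-learn; the marker list is a tiny illustrative sample, not the inventory or classifier actually used in the paper.

    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    MARKERS = ["however", "therefore", "meanwhile", "although", "in other words"]

    features = FeatureUnion([
        ("bow", CountVectorizer()),                      # standard bag of words
        ("markers", CountVectorizer(vocabulary=MARKERS,  # discourse markers only
                                    ngram_range=(1, 3))),
    ])
    genre_clf = Pipeline([("features", features), ("svm", LinearSVC())])
    # genre_clf.fit(train_texts, train_genres); genre_clf.predict(test_texts)
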
Features for Discourse-New Referent Detection in Russian

This paper concerns discourse-new mention detection in Russian. Detecting the mention of an entity newly introduced into discourse can be helpful for NLP applications such as coreference resolution, protagonist identification, summarization, and various information extraction tasks. We deal with Russian, which has no grammatical devices, such as articles in English, for overtly marking a newly introduced referent. Our aim is to check the impact of various features on this task, with a focus on the specific devices for introducing a new discourse-prominent referent in Russian identified in theoretical studies. We conduct a pilot study of feature impact and provide a series of experiments on detecting the first mention of a referent in a non-singleton coreference chain, drawing on linguistic insights about how a prominent entity introduced into discourse is reflected in structural, morphological, and lexical features.

Svetlana Toldova, Max Ionov
A Karaka Dependency Based Dialog Act Tagging for Telugu Using Combination of LMs and HMM

The main goal of this paper is to perform dialog act (DA) tagging for a Telugu corpus. Annotating utterances with dialog acts is necessary to recognize the intent of the speaker in dialog systems. While English follows a strict subject-verb-object (SVO) syntax, Telugu is a free word order language, so the n-gram DA tagging methods proposed for English will not work for languages like Telugu. In this paper, we propose a method to perform DA tagging for a Telugu corpus using machine learning techniques combined with karaka dependency relation modifiers. In other words, we use syntactic features obtained from karaka dependencies and apply a combination of language models (LMs) at the utterance level with a Hidden Markov Model (HMM) at the context level for DA tagging. The use of karaka dependencies for free word order languages like Telugu helps in extracting the modifier-modified relationships between words or word clusters in an utterance; these relationships remain fixed even when the word order changes, and they appear similar to n-grams. Statistical machine learning methods, a combination of LMs and an HMM, are then applied to predict the DA of an utterance in a dialog. The proposed method is compared with several baseline tagging algorithms.

Suman Dowlagar, Radhika Mamidi
Backmatter
Metadata
Title
Computational Linguistics and Intelligent Text Processing
Edited by
Dr. Alexander Gelbukh
Copyright year
2018
Electronic ISBN
978-3-319-75477-2
Print ISBN
978-3-319-75476-5
DOI
https://doi.org/10.1007/978-3-319-75477-2