
About this book

The two volumes LNCS 9041 and 9042 constitute the proceedings of the 16th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2015, held in Cairo, Egypt, in April 2015.

The 95 full papers presented were carefully reviewed and selected from 329 submissions. They were organized in topical sections on grammar formalisms and lexical resources; morphology and chunking; syntax and parsing; anaphora resolution and word sense disambiguation; semantics and dialogue; machine translation and multilingualism; sentiment analysis and emotion detection; opinion mining and social network analysis; natural language generation and text summarization; information retrieval, question answering, and information extraction; text classification; speech processing; and applications.



Grammar Formalisms and Lexical Resources


Towards a Universal Grammar for Natural Language Processing

Universal Dependencies is a recent initiative to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. In this paper, I outline the motivation behind the initiative and explain how the basic design principles follow from these requirements. I then discuss the different components of the annotation standard, including principles for word segmentation, morphological annotation, and syntactic annotation. I conclude with some thoughts on the challenges that lie ahead.

Joakim Nivre

Deletions and Node Reconstructions in a Dependency-Based Multilevel Annotation Scheme

The aim of the present contribution is to put under scrutiny the ways in which the so-called deletions of elements in the surface shape of the sentence are treated in syntactically annotated corpora and to attempt a categorization of deletions within a multilevel annotation scheme. We first explain (Sect. 1) the motivations of our research into this matter, and in Sect. 2 we briefly survey how deletions are treated in some of the advanced annotation schemes for different languages. The core of the paper is Sect. 3, which is devoted to the treatment of deletions and node reconstructions on the two syntactic levels of annotation of the annotation scheme of the Prague Dependency Treebank (PDT). After a short account of PDT relevant for the issue under discussion (Sect. 3.1) and of the treatment of deletions at the level of surface structure of sentences (Sect. 3.2), we concentrate on selected types of reconstructions of the deleted items on the underlying (tectogrammatical) level of PDT (Sect. 3.3). In Sect. 3.4 we present some statistical data that offer a stimulating and encouraging ground for further investigations, both for linguistic theory and annotation practice. The results and the advantages of the approach applied and further perspectives are summarized in Sect. 4.

Jan Hajič, Eva Hajičová, Marie Mikulová, Jiří Mírovský, Jarmila Panevová, Daniel Zeman

Enriching, Editing, and Representing Interlinear Glossed Text

The majority of the world’s languages have little to no NLP resources or tools. This is due to a lack of training data (“resources”) over which tools, such as taggers or parsers, can be trained. In recent years, there have been increasing efforts to apply NLP methods to a much broader swathe of the world’s languages. In many cases this involves bootstrapping the learning process with enriched or partially enriched resources. One promising line of research involves the use of Interlinear Glossed Text (IGT), a very common form of annotated data used in the field of linguistics. Although IGT is generally very richly annotated, and can be enriched even further (e.g., through structural projection), much of the content is not easily consumable by machines since it remains “trapped” in linguistic scholarly documents and in human-readable form. In this paper, we introduce several tools that make IGT more accessible and consumable by NLP researchers.

Fei Xia, Michael Wayne Goodman, Ryan Georgi, Glenn Slayden, William D. Lewis

Comparing Neural Lexical Models of a Classic National Corpus and a Web Corpus: The Case for Russian

In this paper we compare the Russian National Corpus to a larger Russian web corpus composed in 2014; the assumption behind our work is that the National corpus, being limited by the texts it contains and their proportions, presents lexical contexts (and thus meanings) which are different from those found ‘in the wild’ or in a language in use.

To do such a comparison, we used both corpora as training sets to learn vector word representations and found the nearest neighbors or associates for all top-frequency nominal lexical units. Then the difference between these two neighbor sets for each word was calculated using the Jaccard similarity coefficient. The resulting value is the measure of how much the meaning of a given word is different in the language of web pages from the Russian language in the National corpus. About 15% of words were found to acquire completely new neighbors in the web corpus.
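The neighbor-set comparison described above can be illustrated with a minimal sketch. It assumes the nearest neighbors of a word have already been extracted from each corpus model; the function name and the sample neighbor lists are hypothetical and are not taken from the paper.

```python
def jaccard(neighbors_a, neighbors_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two neighbor collections."""
    a, b = set(neighbors_a), set(neighbors_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical nearest neighbors of one word in the two models.
national_corpus_neighbors = ["river", "shore", "water", "stream"]
web_corpus_neighbors = ["loan", "account", "water", "river"]

similarity = jaccard(national_corpus_neighbors, web_corpus_neighbors)
print(round(similarity, 3))
```

A value of 0 corresponds to a word whose neighbors are completely different in the two corpora, which is the case the abstract reports for about 15% of words.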

In this paper, the research methodology is described and implications for the Russian National Corpus are proposed. All experimental data are available online.

Andrey Kutuzov, Elizaveta Kuzmenko

Lexical Network Enrichment Using Association Rules Model

In this paper, we present our method of lexical enrichment applied to a semantic network in the context of query disambiguation. This network represents the list of relevant sentences in French that respond to a given Arabic query. In a first step we generate the semantic network covering the content of this list. The generation of the network is based on our approach of semantic and conceptual indexing. In a second step, we apply a contextual enrichment to this network using an association rules model. The evaluation of our method shows the impact of this model on the semantic network enrichment. As a result, this enrichment increases the F-measure from 71% to 81% in terms of coverage of the list.

Souheyl Mallat, Emna Hkiri, Mohsen Maraoui, Mounir Zrigui

When was Macbeth Written? Mapping Book to Time

We address the question of predicting the time when a book was written using the Google Books Ngram corpus. This prediction could be useful for authorship and plagiarism detection, identification of literary movements, and forensic document examination. We propose an unsupervised approach and compare this with four baseline measures on a dataset consisting of 36 books written between 1551 and 1969. The proposed approach could be applicable to other languages as long as corpora of those languages similar to the Google Books Ngram are available.

Aminul Islam, Jie Mei, Evangelos E. Milios, Vlado Kešelj

Tharawat: A Vision for a Comprehensive Resource for Arabic Computational Processing

In this paper, we present a vision for a comprehensive unified lexical resource for computational processing of Arabic with as many of its variants as possible. We review the current state of the art for three existing resources and then propose a method to link and augment them in a manner that would render them even more useful for natural language processing, whether targeting enabling technologies such as part-of-speech tagging or parsing, or applications such as Machine Translation or Information Extraction. The unified lexical resource, Tharawat, meaning treasures, is an extension of our core resource Tharwa, which is a three-way computational lexicon of Dialectal Arabic, Modern Standard Arabic, and English lemma correspondents. Tharawat will incorporate two other current resources, namely SANA, our Arabic Sentiment Lexicon, and MuSTalAHAt, our Multiword Expression (MWE) version of Tharwa, which lists MWEs and their correspondents instead of lemmas and their correspondents. Moreover, we present a roadmap for linking Tharawat to existing English resources and corpora, leveraging advanced machine learning techniques and crowdsourcing methods. Such resources are at the core of NLP technologies. Specifically, we believe that such a resource could lead to significant leaps and strides for Arabic NLP. Possessing them for a language such as Arabic could be quite impactful for the development of advanced scientific material and hence lead to an Arabic scientific and economic revolution.

Mona Diab

High Quality Arabic Lexical Ontology Based on MUHIT, WordNet, SUMO and DBpedia

In this paper, we aim to move ontology-based Arabic NLP forward by experimenting with the generation of a comprehensive Arabic lexical ontology using multiple language resources. We recommend a combination of MUHIT, WordNet and SUMO and use a simple method to link them, which results in the generation of an Arabic-lexicalized version of the SUMO ontology. Then, we evaluate the generated ontology, and propose a method for increasing its named entity coverage using DBpedia, English-to-Arabic Transliteration, and Named Entity Recognition. We end up with an Arabic lexical ontology that has 228K Arabic synsets, linked to 7.8K concepts and 143K instances. This ontology achieves a precision of 96.9% and recall of 75.5% for NLU scenarios.

Eslam Kamal, Mohsen Rashwan, Sameh Alansary

Building a Nasa Yuwe Language Test Collection

Nasa Yuwe, the language of the Paez people in Colombia, is currently an endangered language[1]. The Nasa community has therefore been reviewing different strategies with the purpose of encouraging 1) the visibility of the language and 2) awareness raising about the use of the language, by means of computational tools. With the intention of making a contribution to both of these areas, the building of an information retrieval system (IRS) for texts written in Nasa Yuwe is proposed. This would be expected to encourage writing in Nasa Yuwe and the retrieval of documents written in the language. To implement the system, it is necessary to have a test collection with which to assess the IRS, so the first step, prior to IRS development, is to build that test collection specifically for Nasa Yuwe texts, something which is not currently available. This paper thus presents the first test collection in Nasa Yuwe, as well as showing its construction process and results. The results cover: 1) the process of building the Nasa Yuwe test collection; 2) the queries, expert opinions and documents; and 3) a statistical analysis of the data, including an analysis of Zipf’s Law[2].

Luz Marina Sierra, Carlos Alberto Cobos, Juan Carlos Corrales, Tulio Rojas Curieux

Morphology and Chunking


Making Morphologies the “Easy” Way

Computational morphologies often consist of a lexicon and some rule component, the creation of which requires various competences and considerable effort. Such a description, on the other hand, makes an easy extension of the morphology with new lexical items possible. Most freely available morphological resources, however, contain no rule component. They are usually based on just a morphological lexicon, containing base forms and some information (often just a paradigm ID) identifying the inflectional paradigm of the word, possibly augmented with some other morphosyntactic features. The aim of the research presented in this paper was to create an algorithm that makes the integration of new words into such resources similarly easy to the way a rule-based morphology can be extended. This is achieved by predicting the correct paradigm for words not present in the lexicon. The supervised machine learning algorithm described in this paper is based on longest matching suffixes and lexical frequency data, and is demonstrated and evaluated for Russian.

Attila Novák

To Split or Not, and If so, Where? Theoretical and Empirical Aspects of Unsupervised Morphological Segmentation

The purpose of this paper is twofold: First, it offers an overview of challenges encountered by unsupervised, knowledge-free methods when analysing language data (with a focus on morphology). Second, it presents a system for unsupervised morphological segmentation comprising two complementary methods that can handle a broad range of morphological processes. The first method collects words which share distributional and form similarity and applies Multiple Sequence Alignment to derive a segmentation of these words. The second method then analyses less frequent words utilizing the segmentation results of the first method. The challenges presented in the theoretical part are illustrated with the workings and output of the introduced unsupervised system and accompanied by suggestions on how to address them in future work.

Amit Kirschenbaum

Data-Driven Morphological Analysis and Disambiguation for Kazakh

We propose a method for morphological analysis and disambiguation for the Kazakh language that accounts for both inflectional and derivational morphology, including not fully productive derivation. The method is data-driven and does not require manually generated rules. We leverage so-called “transition chains” that help prune false segmentations while keeping correct ones. At the disambiguation step we use a standard HMM-based approach. Evaluating our method against open source solutions on several data sets, we show that it achieves better or on-par performance. We also provide an extensive error analysis that sheds light on common problems of the morphological disambiguation of the language.

Olzhas Makhambetov, Aibek Makazhanov, Islam Sabyrgaliyev, Zhandos Yessenbayev

Statistical Sandhi Splitter for Agglutinative Languages

Sandhi splitting is a primary and important step for any natural language processing (NLP) application for languages which have agglutinative morphology. This paper presents a statistical approach to building a sandhi splitter for agglutinative languages. The input to the model is a valid string in the language and the output is a split of that string into one or more meaningful words. The approach adopted comprises two stages, namely Segmentation and Word generation, both of which use conditional random fields (CRFs). Our approach is robust and language independent. The results for two Dravidian languages, viz. Telugu and Malayalam, show an accuracy of 89.07% and 90.50% respectively.

Prathyusha Kuncham, Kovida Nelakuditi, Sneha Nallani, Radhika Mamidi

Chunking in Turkish with Conditional Random Fields

In this paper, we report our work on chunking in Turkish. We used the data that we generated by manually translating a subset of the Penn Treebank. We exploited the already available tags in the trees to automatically identify and label chunks in their Turkish translations. We used conditional random fields (CRF) to train a model over the annotated data. We report our results on different levels of chunk resolution.

Olcay Taner Yıldız, Ercan Solak, Razieh Ehsani, Onur Görgün

Syntax and Parsing


Statistical Arabic Grammar Analyzer

Grammar analysis is considered one of the complex tasks in the Natural Language Processing (NLP) field, since it determines the relations between the words in the sentence. This paper proposes a system to automate the grammar analysis of Arabic language sentences (Sentence Grammar Analysis, <ErAb Aljml). The task of Arabic grammar analysis has been divided into three sub-tasks: determining the grammatical tag, the case, and the sign of each token at the level of the sentence. For the task of Arabic grammar analysis, a dataset has been compiled and a statistical system that assigns an appropriate tag, case and sign has been implemented. The proposed system has been tested, and the experiments show that it achieves an 89.74% token accuracy and a 63.56% overall sentence accuracy and that it has the potential to be further improved.

Michael Nawar Ibrahim

Bayesian Finite Mixture Models for Probabilistic Context-Free Grammars

Instead of using a common PCFG to parse all texts, we present an efficient generative probabilistic model for probabilistic context-free grammars (PCFGs) based on the Bayesian finite mixture model, where we assume that there are several PCFGs and that each of these PCFGs shares the same CFG but has different rule probabilities. Sentences of the same article in the corpus are generated from a common multinomial distribution over these PCFGs. We derive a Markov chain Monte Carlo algorithm for this model. In the experiments, our multi-grammar model outperforms both the single-grammar model and the Inside-Outside algorithm.

Philip L. H. Yu, Yaohua Tang

Employing Oracle Confusion for Parse Quality Estimation

We propose an approach for Parse Quality Estimation based on the dynamic computation of an entropy-based confusion score for directed arcs and for joint prediction of directed arcs and their dependency labels, in a typed dependency parsing framework. This score accompanies a parsed output and aims to present an exhaustive picture of the parse quality, detailed down to each arc of the parse tree. The methodology explores the confusion encountered by the oracle of a transition-based data-driven dependency parser. We support our hypothesis by analytically illustrating, for 18 languages, that the arcs with high confusion scores are notably the predominant parsing errors.
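As a rough illustration of what an entropy-based confusion score can look like, the sketch below computes the Shannon entropy of the normalized scores a parser assigns to its candidate decisions for one arc. This is an assumption about the general idea only: the abstract does not give the exact formulation used in the paper, and the function name and example scores are hypothetical.

```python
import math


def confusion_entropy(scores):
    """Shannon entropy (in bits) of non-negative candidate scores.

    The scores are normalized to a probability distribution first.
    Low entropy: one candidate dominates. High entropy: the model is confused.
    """
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")
    entropy = 0.0
    for score in scores:
        if score > 0:
            p = score / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical scores over three candidate decisions for two arcs.
confident_arc = confusion_entropy([0.97, 0.02, 0.01])
confused_arc = confusion_entropy([0.40, 0.35, 0.25])
print(confident_arc < confused_arc)
```

Under this reading, arcs would be ranked by their score and the highest-entropy arcs flagged as the most likely parsing errors.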

Sambhav Jain, Naman Jain, Bhasha Agrawal, Rajeev Sangal

Experiments on Sentence Boundary Detection in User-Generated Web Content

Sentence Boundary Detection (SBD) is a very important prerequisite for proper sentence analysis in different Natural Language Processing tasks. In recent years, many SBD methods have been applied to the transcriptions produced by Automatic Speech Recognition systems and to well-structured texts (e.g. news, scientific texts). However, there is little research on SBD in informal user-generated content such as web reviews, comments, and posts, which are not necessarily well written and structured. In this paper, we adapt and extend a well-known SBD method to the domain of opinionated texts on the web. In particular, we evaluate our proposal on a set of online product reviews and compare it with other traditional SBD methods. The experimental results show that we outperform these other methods.

Roque López, Thiago A. S. Pardo

Anaphora Resolution and Word Sense Disambiguation


An Investigation of Neural Embeddings for Coreference Resolution

Coreference Resolution is an important task in Natural Language Processing (NLP) and involves finding all the phrases in a document that refer to the same entity in the real world, with applications in question answering and document summarisation. Work from deep learning has led to the training of neural embeddings of words and sentences from unlabelled text. Word embeddings have been shown to capture syntactic and semantic properties of the words and have been used in POS tagging and NER tagging to achieve state-of-the-art performance. Therefore, the key contribution of this paper is to investigate whether neural embeddings can be leveraged to overcome challenges associated with the scarcity of coreference resolution labelled datasets for benchmarking. We show, as a preliminary result, that neural embeddings improve the performance of a coreference resolver when compared to a baseline.

Varun Godbole, Wei Liu, Roberto Togneri

Feature Selection in Anaphora Resolution for Bengali: A Multiobjective Approach

In this paper we propose a feature selection technique for anaphora resolution for a resource-poor language like Bengali. The technique is grounded on the principle of differential evolution (DE) based multiobjective optimization (MOO). For this we explore adapting BART, a state-of-the-art anaphora resolution system, which was originally designed for English. There does not exist any globally accepted metric for measuring the performance of anaphora resolution, and each of MUC, B³, CEAFM, CEAFE and BLANC exhibits significantly different behaviours. Systems optimized with respect to one metric often tend to perform poorly with respect to the others, and therefore comparing the performance between the different systems becomes quite difficult. In our work we determine the most relevant set of features that best optimize all the metrics. Evaluation results yield the overall average F-measure values of 66.70%, 59.70%, 51.56%, 33.08%, 72.75% for MUC, B³, CEAFM, CEAFE and BLANC, respectively.

Utpal Kumar Sikdar, Asif Ekbal, Sriparna Saha

A Language Modeling Approach for Acronym Expansion Disambiguation

Nonstandard words such as proper nouns, abbreviations, and acronyms are a major obstacle in natural language text processing and information retrieval. Acronyms, in particular, are difficult to read and process because they are often domain-specific with a high degree of polysemy. In this paper, we propose a language modeling approach for the automatic disambiguation of acronym senses using context information. First, a dictionary of all possible expansions of acronyms is generated automatically. The dictionary is used to search for all possible expansions or senses to expand a given acronym. The extracted dictionary consists of about 17 thousand acronym-expansion pairs defining 1,829 expansions from different fields, where the average number of expansions per acronym was 9.47. Training data is automatically collected from downloaded documents identified from the results of search engine queries. The collected data is used to build a unigram language model that models the context of each candidate expansion. At the in-context expansion prediction phase, the relevance of acronym expansion candidates is calculated based on the similarity between the context of each specific acronym occurrence and the language model of each candidate expansion. Unlike other work in the literature, our approach has the option to decline to expand an acronym if it is not confident in the disambiguation. We have evaluated the performance of our language modeling approach and compared it with a tf-idf discriminative approach.

Akram Gaballah Ahmed, Mohamed Farouk Abdel Hady, Emad Nabil, Amr Badr

Web Person Disambiguation Using Hierarchical Co-reference Model

As one of the entity disambiguation tasks, Web Person Disambiguation (WPD) identifies different persons with the same name by grouping search results for different persons into different clusters. Most current research uses clustering methods to conduct WPD. These approaches require the tuning of thresholds that are biased towards training data and may not work well for different datasets. In this paper, we propose a novel approach that uses pairwise co-reference modeling for WPD without the need for threshold tuning. Because person names are named entities, disambiguation of person names can use semantic measures based on the so-called co-reference resolution criterion across different documents. The algorithm first forms a forest with person names as observable leaf nodes. It then stochastically tries to form an entity hierarchy by merging names into a sub-tree as a latent entity group if they have a co-referential relationship across documents. As the joining/partition of nodes is based on co-reference-based comparative values, our method is independent of training data, and thus parameter tuning is not required. Experiments show that this semantics-based method achieves performance comparable with the top two state-of-the-art systems without using any training data. The stochastic approach also gives our algorithm near-linear processing time, making it much more efficient than HAC-based clustering methods. Because our model allows a small number of upper-level entity nodes to summarize a large number of name mentions, the model has much higher semantic representation power and is much more scalable over large collections of name mentions than HAC-based algorithms.

Jian Xu, Qin Lu, Minglei Li, Wenjie Li

Semantics and Dialogue


From Natural Logic to Natural Reasoning

This paper starts with a brief history of Natural Logic from its origins to the most recent work on implicatives. It then describes on-going attempts to represent the meanings of so-called evaluative adjectives in these terms based on what linguists have traditionally assumed about constructions such as “NP was stupid to VP” and “NP was not lucky to VP” that have been described as factive. It turns out that the account cannot be based solely on lexical classification as the existing framework of Natural Logic assumes.

The conclusion we draw from this ongoing work is that Natural Logic of the classical type must be grounded in a more inclusive theory of Natural Reasoning that takes into account pragmatic factors in the context of use such as the assumed relation between the evaluative adjective and even the perceived communicative intent of the speaker.

Lauri Karttunen

A Unified Framework to Identify and Extract Uncertainty Cues, Holders, and Scopes in One Fell-Swoop

We present a unified framework based on supervised sequence labelling methods to identify and extract uncertainty cues, holders, and scopes in one fell swoop with an application on Arabic tweets. The underlying technology employs Support Vector Machines with a rich set of morphological, syntactic, lexical, semantic, pragmatic, dialectal, and genre-specific features, and yields an average F-score of 0.759.

Rania Al-Sabbagh, Roxana Girju, Jana Diesner

Lemon and Tea Are Not Similar: Measuring Word-to-Word Similarity by Combining Different Methods

A substantial amount of work has been done on measuring word-to-word relatedness, which is also commonly referred to as similarity. Though relatedness and similarity are closely related, they are not the same, as illustrated by the words lemon and tea, which are related but not similar. Relatedness takes into account a broader range of relations, while similarity only considers subsumption relations to assess how two objects are similar. We present in this paper a method for measuring the semantic similarity of words as a combination of various techniques, including knowledge-based and corpus-based methods, that capture different aspects of similarity. Our corpus-based method exploits state-of-the-art word representations. We performed experiments with a recently published, significantly large dataset called Simlex-999 and achieved a significantly better correlation (0.642, P < 0.001) with human judgment compared to the performance of the individual methods.

Rajendra Banjade, Nabin Maharjan, Nobal B. Niraula, Vasile Rus, Dipesh Gautam

Domain-Specific Semantic Relatedness from Wikipedia Structure: A Case Study in Biomedical Text

Wikipedia is becoming an important knowledge source in various domain specific applications based on concept representation. This introduces the need for concrete evaluation of Wikipedia as a foundation for computing semantic relatedness between concepts. While lexical resources like WordNet cover generic English well, they are weak in their coverage of domain specific terms and named entities, which is one of the strengths of Wikipedia. Furthermore, semantic relatedness methods that rely on the hierarchical structure of a lexical resource are not directly applicable to the Wikipedia link structure, which is not hierarchical and whose links do not capture well defined semantic relationships like hyponymy.

In this paper we (1) evaluate Wikipedia in a domain specific semantic relatedness task and demonstrate that Wikipedia based methods can be competitive with state of the art ontology based methods and distributional methods in the biomedical domain; (2) adapt and evaluate the effectiveness of bibliometric methods of various degrees of sophistication on Wikipedia; and (3) propose a new graph-based method for calculating semantic relatedness that outperforms existing methods by considering some specific features of Wikipedia structure.

Armin Sajadi, Evangelos E. Milios, Vlado Kešelj, Jeannette C. M. Janssen

Unsupervised Induction of Meaningful Semantic Classes through Selectional Preferences

This paper addresses the general task of semantic class learning by introducing a methodology to induce semantic classes for labeling instances of predicate arguments in an input text. The proposed methodology takes a Proposition Store as Background Knowledge Base to first identify a set of classes capable of representing the arguments of predicates in the store, where the classes correspond to common nouns from the store to support interpretability. Then, it learns a selectional preference model for predicates based on tuples of classes to set up a generative model of propositions from which to perform the induction of classes. The proposed method is completely unsupervised and relies on a reference collection of unlabeled text documents used as the source of background knowledge to build the proposition store. We demonstrate our proposal on a collection of news stories. Specifically, we evaluate the learned model in the task of predicting tuples of argument instances for predicates from held-aside data.

Henry Anaya-Sánchez, Anselmo Peñas

Hypernym Extraction: Combining Machine-Learning and Dependency Grammar

Hypernym extraction is a crucial task for semantically motivated NLP tasks such as taxonomy and ontology learning, textual entailment or paraphrase identification. In this paper, we describe an approach to hypernym extraction from textual definitions, where machine-learning and post-classification refinement rules are combined. Our best-performing configuration shows competitive results compared to state-of-the-art systems in a well-known benchmarking dataset. The quality of our features is measured by combining them in different feature sets and by ranking them by their Information Gain score. Our experiments confirm that both syntactic and definitional information play a crucial role in the hypernym extraction task.

Luis Espinosa-Anke, Francesco Ronzano, Horacio Saggion

Arabic Event Detection in Social Media

Event detection is a concept that is crucial to the assurance of public safety surrounding real-world events. Decision makers use information from a range of terrestrial and online sources to help inform decisions that enable them to develop policies and react appropriately to events as they unfold. One such source of online information is social media. Twitter, as a form of social media, is a popular micro-blogging web application serving hundreds of millions of users. User-generated content can be utilized as a rich source of information to identify real-world events. In this paper, we present a novel detection framework for identifying such events, with a focus on ‘disruptive’ events, using Twitter data. The approach is based on five steps: data collection, pre-processing, classification, clustering and summarization. We use a Naïve Bayes classification model and an Online Clustering method to validate our model over multiple real-world data sets. To the best of our knowledge, this study is the first effort to identify real-world events in Arabic from social media.

Nasser Alsaedi, Pete Burnap

Learning Semantically Rich Event Inference Rules Using Definition of Verbs

Natural language understanding is a key requirement for many NLP tasks. Deep language understanding, which enables inference, requires systems that have large amounts of knowledge enabling them to connect natural language to the concepts of the world. We present a novel attempt to automatically acquire conceptual knowledge about events in the form of inference rules by reading verb definitions. We learn semantically rich inference rules which can be actively chained together in order to provide deeper understanding of conceptual events. We show that the acquired knowledge is precise and informative and can potentially be employed in different NLP tasks that require language understanding.

Nasrin Mostafazadeh, James F. Allen

Rehabilitation of Count-Based Models for Word Vector Representations

Recent works on word representations mostly rely on predictive models. Distributed word representations (aka word embeddings) are trained to optimally predict the contexts in which the corresponding words tend to appear. Such models have succeeded in capturing word similarities as well as semantic and syntactic regularities. Instead, we aim at reviving interest in a model based on counts. We present a systematic study of the use of the Hellinger distance to extract semantic representations from the word co-occurrence statistics of large text corpora. We show that this distance gives good performance on word similarity and analogy tasks, with a proper type and size of context, and a dimensionality reduction based on a stochastic low-rank approximation. Besides being both simple and intuitive, this method also provides an encoding function which can be used to infer unseen words or phrases. This becomes a clear advantage compared to predictive models which must train these new words.

Rémi Lebret, Ronan Collobert
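As a rough illustration of the distance named in the abstract above, the following minimal sketch computes the Hellinger distance between two word co-occurrence distributions. The words, context counts, and function names are hypothetical and chosen only for illustration; the sketch does not reproduce the authors' pipeline (context selection, dimensionality reduction) in any way.

```python
import math


def normalize(counts):
    """Turn raw co-occurrence counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length.

    The result lies in [0, 1]: 0 for identical distributions,
    1 for distributions with disjoint support.
    """
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


# Hypothetical counts of three context words for three target words.
cat = normalize([8, 1, 1])
dog = normalize([7, 2, 1])
car = normalize([1, 1, 8])

print(hellinger(cat, dog))  # small: similar context distributions
print(hellinger(cat, car))  # larger: dissimilar context distributions
```

Because the distance operates on probability distributions, the raw counts are normalized first; words that occur in similar contexts end up close to each other.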

Word Representations in Vector Space and their Applications for Arabic

A lot of work has been done to give the individual words of a certain language adequate representations in vector space so that these representations capture semantic and syntactic properties of the language. In this paper, we compare different techniques to build vectorized space representations for Arabic, and test these models via intrinsic and extrinsic evaluations. Intrinsic evaluation assesses the quality of models using benchmark semantic and syntactic datasets, while extrinsic evaluation assesses the quality of models by their impact on two Natural Language Processing applications: Information Retrieval and Short Answer Grading. Finally, we map the Arabic vector space to the English counterpart using a cosine error regression neural network and show that it outperforms standard mean square error regression neural networks in this task.

Mohamed A. Zahran, Ahmed Magooda, Ashraf Y. Mahgoub, Hazem Raafat, Mohsen Rashwan, Amir Atyia

Short Text Hashing Improved by Integrating Multi-granularity Topics and Tags

Due to the computational and storage efficiencies of compact binary codes, hashing has been widely used for large-scale similarity search. Unfortunately, many existing hashing methods based on observed keyword features are not effective for short texts due to their sparseness and shortness. Recently, some researchers have tried to utilize latent topics of a certain granularity to preserve semantic similarity in hash codes beyond keyword matching. However, topics of a single granularity are not adequate to represent the intrinsic semantic information. In this paper, we present a novel unified approach for short text Hashing using Multi-granularity Topics and Tags, dubbed HMTT. In particular, we propose a selection method to choose the optimal multi-granularity topics depending on the type of dataset, and design two distinct hashing strategies to incorporate multi-granularity topics. We also propose a simple and effective method to exploit tags to enhance the similarity of related texts. We carry out extensive experiments on one short text dataset as well as on one normal text dataset. The results demonstrate that our approach is effective and significantly outperforms baselines on several evaluation metrics.

Jiaming Xu, Bo Xu, Guanhua Tian, Jun Zhao, Fangyuan Wang, Hongwei Hao

A Computational Approach for Corpus Based Analysis of Reduplicated Words in Bengali

Reduplication is an important phenomenon in language studies, especially in Indian languages. Reduplication is defined as the partial or complete repetition of the smallest linguistic unit, i.e. repetition of a phoneme, morpheme, word, phrase, clause or the utterance as a whole, and it gives a different meaning at the syntactic as well as the semantic level. Reduplicated words have an important role in many natural language processing (NLP) applications, namely in machine translation (MT), text summarization, identification of multiword expressions, etc. This article focuses on an algorithm for identifying the reduplicated words in a text corpus and computing descriptive statistics of reduplicated words frequently used in Bengali.

Apurbalal Senapati, Utpal Garain

Automatic Dialogue Act Annotation within Arabic Debates

Dialogue acts play an important role in the identification of argumentative discourse structure in human conversations. In this paper, we propose an automatic dialogue act annotation method based on supervised learning techniques for Arabic debate programs. The choice of this kind of corpus is justified by its large content of argumentative information. To evaluate the annotation results, we used a specific annotation scheme that is relatively reliable for our task, with a kappa agreement of 84%. The annotation process was carried out using Weka platform algorithms, experimenting with Naive Bayes, SVM and Decision Tree classifiers. We obtained encouraging results with an average accuracy of 53%.

Samira Ben Dbabis, Hatem Ghorbel, Lamia Hadrich Belguith, Mohamed Kallel

E-Quotes: Enunciative Modalities Analysis Tool for Direct Reported Speech in Arabic

With rapidly growing Arabic online sources aimed at encouraging people’s discussions concerning personal, public or social issues (news, blogs, forums…), there is a critical need for the development of computational tools for the analysis of Enunciative Modalities (opinion, commitment…). We present a new system that identifies and categorizes quotations in Arabic texts and proposes a strategy to determine whether a given speaker’s quotation conveys some enunciative modalities and potentially its evaluation by the enunciator. Our system enables two types of keyword search within the “categorized” quotations: searching for keywords in the part potentially containing the reported speech source (the reporting clause) or searching for keywords in the part concerning the topic (the reported clause). The annotation is performed with a rule-based system using the reporting markers’ meaning. We applied our system to process a corpus of Arabic newspaper articles and we obtained promising results for the evaluation.

Motasem Alrahabi

Textual Entailment Using Different Similarity Metrics

The textual entailment (TE) relation determines whether a text can be inferred from another. Given two texts, one called the “Text”, denoted as T, and the other called the “Hypothesis”, denoted as H, the process of textual entailment is to decide whether or not the meaning of H can be logically inferred from the meaning of T. In this study, different semantic, lexical and vector-based similarity metrics are used as features for different machine learning classifiers to take the entailment decision. We also considered two machine translation evaluation metrics, namely BLEU and METEOR, as similarity metrics for this task. We carried out the experiments on the datasets released in the shared tasks on textual entailment organized in RTE-1, RTE-2, and RTE-3. We experimented with different feature combinations. The best accuracies were obtained on different feature combinations by different classifiers. The best classification accuracies obtained by our system on the RTE-1, RTE-2 and RTE-3 datasets are 55.91%, 58.88% and 63.38% respectively. Features based on MT evaluation metrics alone produced the best classification accuracies of 53.9%, 59.3%, and 62.8% on the RTE-1, RTE-2, and RTE-3 datasets respectively.

Tanik Saikh, Sudip Kumar Naskar, Chandan Giri, Sivaji Bandyopadhyay

Machine Translation and Multilingualism


Translation Induction on Indian Language Corpora Using Translingual Themes from Other Languages

Identifying translations from comparable corpora is a well-known problem with several applications, e.g. dictionary creation in resource-scarce languages. Scarcity of high quality corpora, especially in Indian languages, makes this problem hard, e.g. state-of-the-art techniques achieve a mean reciprocal rank (MRR) of 0.66 for English-Italian, and a mere 0.187 for Telugu-Kannada. There exist comparable corpora in many Indian languages with other “auxiliary” languages. We observe that translations have many topically related words in common in the auxiliary language. To model this, we define the notion of a translingual theme, a set of topically related words from auxiliary language corpora, and present a probabilistic framework for translation induction. Extensive experiments on 35 comparable corpora using English and French as auxiliary languages show that this approach can yield dramatic improvements in performance (e.g. MRR improves by 124% to 0.419 for Telugu-Kannada). A user study on a system for cross-lingual Wikipedia title suggestion that uses our approach shows a 20% improvement in the quality of titles suggested.

Goutham Tholpadi, Chiranjib Bhattacharyya, Shirish Shevade
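The abstract above reports results as mean reciprocal rank (MRR). As a minimal sketch of how that metric is usually computed, assuming each source word comes with a ranked list of candidate translations and a set of gold translations, one might write the following. The toy words and the function names are hypothetical and are not taken from the paper.

```python
def mean_reciprocal_rank(ranked_candidates, gold):
    """Average of 1/rank of the first correct candidate per query.

    ranked_candidates: dict mapping a query word to its ranked candidate list.
    gold: dict mapping a query word to the set of correct translations.
    A query with no correct candidate contributes 0.
    """
    total = 0.0
    for query, candidates in ranked_candidates.items():
        for rank, candidate in enumerate(candidates, start=1):
            if candidate in gold.get(query, set()):
                total += 1.0 / rank
                break
    return total / len(ranked_candidates)


def relative_improvement(old, new):
    """Relative change from old to new, e.g. 1.24 means +124%."""
    return (new - old) / old


# Hypothetical toy data: correct answers at rank 1, rank 2, and not found.
ranked = {"w1": ["a", "b"], "w2": ["c", "d"], "w3": ["e", "f"]}
gold = {"w1": {"a"}, "w2": {"d"}, "w3": {"z"}}

print(mean_reciprocal_rank(ranked, gold))   # (1 + 1/2 + 0) / 3
print(relative_improvement(0.187, 0.419))   # figures quoted in the abstract
```

The second call checks the arithmetic in the abstract: going from an MRR of 0.187 to 0.419 is a relative improvement of about 124%.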

English-Arabic Statistical Machine Translation: State of the Art

This paper presents the state of the art of the statistical methods that enhance English to Arabic (En-Ar) Machine Translation (MT). First, the paper gives a brief history of machine translation and clarifies the obstacles it faced, since exploring the history shows how research can develop new ideas. Second, the paper discusses the Statistical Machine Translation (SMT) method as an effective state-of-the-art approach in the MT field. Moreover, it presents the SMT pipeline in brief and explores the En-Ar MT enhancements that have been applied by processing both sides of the parallel corpus before, after and within the pipeline. The paper explores Arabic linguistic challenges in MT, such as orthographic, morphological and syntactic issues. The purpose of surveying only the En-Ar translation direction in SMT is to help transfer knowledge and science to the Arabic language and spread the information to all who are interested in the Arabic language.

Sara Ebrahim, Doaa Hegazy, Mostafa G. M. Mostafa, Samhaa R. El-Beltagy

Mining Parallel Resources for Machine Translation from Comparable Corpora

Good performance of Statistical Machine Translation (SMT) is usually achieved with huge parallel bilingual training corpora, because the translations of words or phrases are computed based on bilingual data. However, in the case of low-resource language pairs such as English-Bengali, the performance is affected by an insufficient amount of bilingual training data. Recently, comparable corpora have become widely considered as valuable resources for machine translation. Though very few cases of sub-sentential level parallelism are found between two comparable documents, there are still potential parallel phrases in comparable corpora. Mining parallel data from comparable corpora is a promising approach to collect more parallel training data for SMT. In this paper, we propose an automatic alignment of English-Bengali comparable sentences from comparable documents. We use a novel textual entailment method and distributional semantics for text similarity. Subsequently, we apply a template-based phrase extraction technique to align parallel phrases from comparable sentence pairs. The effectiveness of our approach is demonstrated by using parallel phrases as additional training examples for an English-Bengali phrase-based SMT system. Our system achieves significant improvement in terms of translation quality over the baseline system.

Santanu Pal, Partha Pakray, Alexander Gelbukh, Josef van Genabith

Statistical Machine Translation from and into Morphologically Rich and Low Resourced Languages

In this paper, we consider the challenging problem of automatic machine translation between a language pair which is both morphologically rich and low resourced: Sinhala and Tamil. We build a phrase-based Statistical Machine Translation (SMT) system and attempt to enhance it by unsupervised morphological analysis. When translating across this pair of languages, morphological changes result in large numbers of out-of-vocabulary (OOV) terms between training and test sets, leading to reduced BLEU scores in evaluation. This early work shows that unsupervised morphological analysis using the Morfessor algorithm, which extracts morpheme-like units, is able to significantly reduce the OOV problem and helps to improve translation.

Randil Pushpananda, Ruvan Weerasinghe, Mahesan Niranjan

Adaptive Tuning for Statistical Machine Translation (AdapT)

In statistical machine translation systems, it is a common practice to use one set of weighting parameters in scoring the candidate translations from a source language to a target language. In this paper, we challenge the assumption that only one set of weights is sufficient to pick the best candidate translation for all source language sentences. We propose a new technique that generates a different set of weights for each input sentence. Our technique outperforms the popular tuning algorithm MERT on different datasets using different language pairs.

Mohamed A. Zahran, Ahmed Y. Tawfik

A Hybrid Approach for Word Alignment with Statistical Modeling and Chunker

This paper presents a hybrid approach to improve word alignment with Statistical Modeling and Chunker for the English-Hindi language pair. We first apply the standard word alignment technique to get an approximate alignment. The source and target language sentences are divided into chunks. The approximate word alignment is then used to align the chunks. The aligned chunks are then used to improve the original word alignment.

The statistical model used here is IBM Model 1. A CRF Chunker is used to break the English sentences into chunks. A shallow parser is used to break Hindi sentences into chunks. This paper demonstrates an increase in F-measure of approximately 7% and a reduction in Alignment Error Rate (AER) of approximately 7% in comparison to the performance of IBM Model 1 for word alignment. The experiments in this paper are based on the TDIL corpus of 1000 sentences.

Jyoti Srivastava, Sudip Sanyal
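The abstract above reports F-measure and Alignment Error Rate (AER). The sketch below shows one common way to compute both from sets of alignment links, using the widely used definition with "sure" and "possible" gold links. The abstract does not say which variant the authors used, so the formula choice, the function name, and the toy links are assumptions made for illustration only.

```python
def alignment_scores(predicted, sure, possible=None):
    """Precision, recall, F-measure and AER for a set of alignment links.

    predicted: set of (source_index, target_index) links from the aligner.
    sure: set of gold links annotated as certain.
    possible: optional set of additional gold links annotated as possible;
              sure links are always treated as possible too.
    """
    if not predicted or not sure:
        raise ValueError("predicted and sure link sets must be non-empty")
    possible = sure | (possible or set())
    hits_sure = len(predicted & sure)
    hits_possible = len(predicted & possible)
    precision = hits_possible / len(predicted)
    recall = hits_sure / len(sure)
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    aer = 1 - (hits_sure + hits_possible) / (len(predicted) + len(sure))
    return precision, recall, f_measure, aer


# Hypothetical links for a three-word sentence pair: two of three are correct.
predicted = {(0, 0), (1, 1), (2, 3)}
sure = {(0, 0), (1, 1), (2, 2)}

print(alignment_scores(predicted, sure))
```

When the gold standard contains only sure links, as in this toy example, AER reduces to one minus the F-measure, which is why the two figures tend to move together.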

Improving Bilingual Search Performance Using Compact Full-Text Indices

Machine Translation tasks must tackle the ever-increasing sizes of parallel corpora, requiring space and time efficient solutions to support them. Several approaches were developed based on full-text indices, such as suffix arrays, with important time and space achievements. However, for supporting bilingual tasks, the search time efficiency of such indices can be improved using an extra layer for the text alignment. Additionally, their space requirements can be significantly reduced using more compact indices. We propose a search procedure on top of a compact bilingual framework that improves bilingual search response time, while having a space efficient representation of aligned parallel corpora.

Jorge Costa, Luís Gomes, Gabriel P. Lopes, Luís M. S. Russo

Neutralizing the Effect of Translation Shifts on Automatic Machine Translation Evaluation

State-of-the-art automatic Machine Translation [MT] evaluation is based on the idea that the closer MT output is to Human Translation [HT], the higher its quality. Thus, automatic evaluation is typically approached by measuring some sort of similarity between machine and human translations. Most widely used evaluation systems calculate similarity at surface level, for example, by computing the number of shared word n-grams. The correlation between automatic and manual evaluation scores at sentence level is still not satisfactory. One of the main reasons is that metrics under-score acceptable candidate translations due to their inability to tackle lexical and syntactic variation between possible translation options. Acceptable differences between candidate and reference translations are frequently due to optional translation shifts. It is common practice in HT to paraphrase what could be viewed as a close version of the source text in order to adapt it to target language use. When a reference translation contains such changes, using it as the only point of comparison is less informative, as the differences are not indicative of MT errors. To alleviate this problem, we design a paraphrase generation system based on a set of rules that model prototypical optional shifts that may have been applied by human translators. Applying the rules to the available human reference, the system generates additional references in a principled and controlled way. We show how using linguistic rules for the generation of additional references neutralizes the negative effect of optional translation shifts on n-gram-based MT evaluation.

Marina Fomicheva, Núria Bel, Iria da Cunha

Arabic Transliteration of Romanized Tunisian Dialect Text: A Preliminary Investigation

In this paper, we describe the process of converting Tunisian Dialect text that is written in Latin script (also called Arabizi) into Arabic script following the CODA orthography convention for Dialectal Arabic. Our input consists of messages and comments taken from SMS, social networks and broadcast videos. The language used in social media and SMS messaging is characterized by the use of informal and non-standard vocabulary such as repeated letters for emphasis, typos, non-standard abbreviations, and nonlinguistic content, such as emoticons. There is a high degree of variation in spelling in Arabic dialects due to the lack of widely supported orthographic standards in both Arabic and Latin scripts. In the context of natural language processing, transliterating from Arabizi to Arabic script is a necessary step since most recently available tools for processing Arabic Dialects expect Arabic script input.

Abir Masmoudi, Nizar Habash, Mariem Ellouze, Yannick Estève, Lamia Hadrich Belguith

Cross-Dialectal Arabic Processing

In this paper, we present an Arabic multi-dialect study including dialects from both the Maghreb and the Middle East, which we compare to Modern Standard Arabic (MSA). Three dialects from the Maghreb are covered by this study, two from Algeria and one from Tunisia, along with two dialects from the Middle East (Syria and Palestine). The resources, which were built from scratch, have led to a multi-dialect parallel collection. Furthermore, this collection has been aligned by hand with an MSA corpus. We conducted several analytical studies in order to understand the relationship between these vernacular languages. For this, we studied the closeness between all the pairs of dialects and MSA in terms of Hellinger distance. We also performed a dialect identification experiment. This experiment showed that neighbouring dialects, as expected, tend to be confused, making their identification difficult. Because the Arabic dialects differ from one region to another, which makes communication between people difficult, we conducted cross-lingual machine translation between all the pairs of dialects and also with MSA. Several interesting conclusions have been drawn from this experiment.

Salima Harrat, Karima Meftouh, Mourad Abbas, Salma Jamoussi, Motaz Saad, Kamel Smaili

Language Set Identification in Noisy Synthetic Multilingual Documents

In this paper, we reconsider the problem of language identification of multilingual documents. Automated language identification algorithms have been improving steadily from the seventies until recent years. The current state-of-the-art language identifiers are quite efficient even with only a few characters and this gives us enough reason to again evaluate the possibility to use existing language identifiers for monolingual text to detect the language set of a multilingual document. We are using a previously developed language identifier for monolingual documents with the multilingual documents from the WikipediaMulti dataset published in a recent study. Our method outperforms previous methods tested with the same data, achieving an F-score of 97.6 when classifying between 44 languages.

Tommi Jauhiainen, Krister Lindén, Heidi Jauhiainen

Feature Analysis for Native Language Identification

In this study we investigate the role of different features for the task of native language identification. For this purpose, we compile a learner corpus based on a subset of the EF Cambridge Open Language Database - EFCAMDAT [10] developed at the University of Cambridge in collaboration with EF Education. The features we are taking into consideration include character n-grams, positional token frequencies, part of speech n-grams, function words, shell nouns and a set of annotated errors. Last but not least, we examine whether the essays of English learners that share the same mother tongue can be distinguished based on their country of origin.

Sergiu Nisioi

