
2018 | Book

Computational Linguistics and Intelligent Text Processing

17th International Conference, CICLing 2016, Konya, Turkey, April 3–9, 2016, Revised Selected Papers, Part II


About this book

The two-volume set LNCS 9623 + 9624 constitutes revised selected papers from the CICLing 2016 conference which took place in Konya, Turkey, in April 2016.

The total of 89 papers presented in the two volumes was carefully reviewed and selected from 298 submissions. The book also contains 4 invited papers and a memorial paper on Adam Kilgarriff’s Legacy to Computational Linguistics.

The papers are organized in the following topical sections:

Part I: In memoriam of Adam Kilgarriff; general formalisms; embeddings, language modeling, and sequence labeling; lexical resources and terminology extraction; morphology and part-of-speech tagging; syntax and chunking; named entity recognition; word sense disambiguation and anaphora resolution; semantics, discourse, and dialog.

Part II: machine translation and multilingualism; sentiment analysis, opinion mining, subjectivity, and social media; text classification and categorization; information extraction; and applications.

Table of Contents

Frontmatter

Machine Translation and Multilingualism

Frontmatter
Enabling Medical Translation for Low-Resource Languages

We present research towards bridging the language gap between migrant workers in Qatar and medical staff. In particular, we present the first steps towards the development of a real-world Hindi-English machine translation system for doctor-patient communication. As this is a low-resource language pair, especially for speech and for the medical domain, our initial focus has been on gathering suitable training data from various sources. We applied a variety of methods ranging from fully automatic extraction from the Web to manual annotation of test data. Moreover, we developed a method for automatically augmenting the training data with synthetically generated variants, which yielded a very sizable improvement of more than 3 BLEU points absolute.

Ahmad Musleh, Nadir Durrani, Irina Temnikova, Preslav Nakov, Stephan Vogel, Osama Alsaad
Combining Phrase and Neural-Based Machine Translation: What Worked and Did Not

Phrase-based machine translation assumes that all words are at the same distance and translates them using feature functions that approximate the probability at different levels. On the other hand, neural machine translation infers word embeddings and translates these word vectors using a neural model. At the moment, both approaches co-exist and are being intensively investigated. This paper, to the best of our knowledge, is the first work that both compares and combines these two systems by: using the phrase-based output to solve unknown words in the neural machine translation output; using the neural alignment in the phrase-based system; comparing how the popular strategy of pre-reordering affects both systems; and combining both translation outputs. Improvements are achieved in Catalan-to-Spanish and German-to-English.

Marta R. Costa-jussà, José A. R. Fonollosa
Combining Machine Translated Sentence Chunks from Multiple MT Systems

This paper presents a hybrid machine translation (HMT) system that performs syntactic analysis to acquire phrases of source sentences, translates the phrases using multiple online machine translation (MT) system application program interfaces (APIs), and generates output by combining the translated chunks to obtain the best possible translation. The aim of this study is to improve the translation quality of English-Latvian texts over each of the individual MT APIs. The selection of the best translation hypothesis is done by calculating the perplexity of each hypothesis using an n-gram language model. The result is a phrase-based multi-system machine translation system that improves MT output compared to the individual online MT systems. The proposed approach shows improvements of up to +1.48 BLEU points and −0.015 in TER compared to the baselines and related research.

Matīss Rikters, Inguna Skadiņa
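As a rough illustration of the hypothesis-selection step described in the abstract above, the hedged sketch below scores candidate chunk translations with a simple add-alpha smoothed bigram language model and keeps the lowest-perplexity one. The bigram order, smoothing scheme and function names are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def train_bigram_lm(corpus, alpha=0.1):
    """Train an add-alpha smoothed bigram LM from tokenized sentences (illustrative)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    vocab = len(unigrams)
    def logprob(prev, word):
        return math.log((bigrams[(prev, word)] + alpha) /
                        (unigrams[prev] + alpha * vocab))
    return logprob

def perplexity(logprob, sent):
    toks = ["<s>"] + sent + ["</s>"]
    lp = sum(logprob(p, w) for p, w in zip(toks, toks[1:]))
    return math.exp(-lp / (len(toks) - 1))

def select_hypothesis(logprob, hypotheses):
    """Pick the chunk translation with the lowest LM perplexity."""
    return min(hypotheses, key=lambda h: perplexity(logprob, h))

# toy usage
lm = train_bigram_lm([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(select_hypothesis(lm, [["the", "cat", "sat"], ["cat", "the", "sat"]]))
```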
Forest to String Based Statistical Machine Translation with Hybrid Word Alignments

Forest to String Based Statistical Machine Translation (FSBSMT) is a forest-based tree sequence to string translation model for syntax-based statistical machine translation. The model automatically learns tree sequence to string translation rules from a given word alignment estimated on a source-side-parsed bilingual parallel corpus. This paper presents a hybrid method which combines different word alignment methods and integrates them into an FSBSMT system. The hybrid word alignment provides the most informative alignment links to the FSBSMT system. We show that hybrid word alignment integrated into various experimental settings of FSBSMT provides considerable improvement over state-of-the-art Hierarchical Phrase-based SMT (HPBSMT). The research also demonstrates that additional integration of Named Entities (NEs), their translations and Example Based Machine Translation (EBMT) phrases (all extracted from the bilingual parallel training data) into the system brings further considerable performance improvements over the hybrid FSBSMT system. We apply our hybrid model to a distant language pair, English–Bengali. The proposed system achieves a 78.5% relative (9.84 BLEU points absolute) improvement over the baseline HPBSMT.

Santanu Pal, Sudip Kumar Naskar, Josef van Genabith
Instant Translation Model Adaptation by Translating Unseen Words in Continuous Vector Space

In statistical machine translation (SMT), differences between the domains of the training and test data result in poor translations. Although there have been many studies on domain adaptation of language models and translation models, most require supervised in-domain language resources such as parallel corpora for training and tuning the models. The necessity of supervised data has made such methods difficult to adapt to practical SMT systems. We thus propose a novel method that adapts translation models without in-domain parallel corpora. Our method infers translation candidates of unseen words by nearest-neighbor search after projecting their vector-based semantic representations to the semantic space of the target language. In our experiment on out-of-domain translation from Japanese to English, our method improved the BLEU score by 0.5–1.5 points.

Shonosuke Ishiwatari, Naoki Yoshinaga, Masashi Toyoda, Masaru Kitsuregawa
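The projection-and-search idea in the abstract above can be sketched as follows: learn a linear map from the source to the target embedding space on a small seed dictionary, project an unseen word, and return its nearest target-language neighbours. The least-squares mapping and all function names are illustrative assumptions; the paper's actual projection method may differ.

```python
import numpy as np

def learn_projection(src_vecs, tgt_vecs, seed_pairs):
    """Least-squares linear map from source to target embedding space.
    src_vecs / tgt_vecs: dicts mapping words to 1-D numpy vectors."""
    X = np.vstack([src_vecs[s] for s, t in seed_pairs])
    Y = np.vstack([tgt_vecs[t] for s, t in seed_pairs])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def translation_candidates(word, W, src_vecs, tgt_vecs, k=5):
    """Project an unseen source word and return its k nearest target words
    by cosine similarity (nearest-neighbor search in the target space)."""
    v = src_vecs[word] @ W
    v = v / np.linalg.norm(v)
    scored = []
    for t, tv in tgt_vecs.items():
        scored.append((float(v @ (tv / np.linalg.norm(tv))), t))
    return [t for _, t in sorted(scored, reverse=True)[:k]]
```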
Fast-Syntax-Matching-Based Japanese-Chinese Limited Machine Translation

Limited machine translation (LMT) is an unliterate form of automatic translation based on a bilingual dictionary and a sentence bank, and the related algorithms can be widely used in natural language processing applications. This paper addresses the Japanese-Chinese LMT problem, proposes two syntactic hypotheses about the Japanese language, and designs a fast-syntax-matching-based Japanese-Chinese (FSMJC) LMT algorithm. In this algorithm, the fast syntax matching function, a modified version of the Levenshtein function, approximates syntactic similarity by efficiently calculating the formal similarity between two Japanese sentences. The experimental results show that the FSMJC LMT algorithm achieves favorable performance with greatly reduced time costs, and that our two syntactic hypotheses are effective on Japanese text.

Wuying Liu, Lin Wang, Xing Zhang
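Since the paper's fast syntax matching function is described as a modified Levenshtein function, the sketch below shows how a plain edit distance can be turned into a [0, 1] formal-similarity score; the authors' specific modifications are not reproduced here, so this is only an illustrative stand-in.

```python
def levenshtein(a, b):
    """Standard edit distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def formal_similarity(s1, s2):
    """Map edit distance to a [0, 1] similarity used as a syntax-matching proxy."""
    if not s1 and not s2:
        return 1.0
    return 1.0 - levenshtein(s1, s2) / max(len(s1), len(s2))

print(formal_similarity("猫が好きです", "犬が好きです"))  # high formal similarity
```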
A Classifier-Based Preordering Approach for English-Vietnamese Statistical Machine Translation

Reordering is a problem of essential importance for phrase-based statistical machine translation (SMT). In this paper, we propose an approach to automatically learn reordering rules as a preprocessing step, based on a dependency parser, in phrase-based statistical machine translation from English to Vietnamese. We use dependency parsing and rules extracted by training feature-rich discriminative classifiers to reorder source-side sentences. We evaluated our approach on English-Vietnamese machine translation tasks and showed that it outperforms the baseline phrase-based SMT system.

Viet Hong Tran, Huyen Thuong Vu, Vinh Van Nguyen, Minh Le Nguyen
Quality Estimation for English-Hungarian Machine Translation Systems with Optimized Semantic Features

Quality estimation at run-time for machine translation systems is an important task. The standard automatic evaluation methods that use reference translations cannot evaluate MT results in real-time and the correlation between the results of these methods and that of human evaluation is very low in the case of translations from English to Hungarian. The new method to solve this problem is called quality estimation, which addresses the task by estimating the quality of translations as a prediction task for which features are extracted from the source and translated sentences only. In this study, we implement quality estimation for English-Hungarian. First, a corpus is created, which contains Hungarian human judgements. Using these human evaluation scores, different quality estimation models are described, evaluated and optimized. We created a corpus for English-Hungarian quality estimation and we developed 27 new semantic features using WordNet and word embedding models, then we created feature sets optimized for Hungarian, which produced better results than the baseline feature set.

Zijian Győző Yang, László János Laki, Borbála Siklósi
Genetic-Based Decoder for Statistical Machine Translation

We propose a new decoding algorithm for the machine translation process. This approach is based on an evolutionary algorithm. We hope that this new method will constitute an alternative to the Moses decoder, which is based on a beam search algorithm, whereas the one we propose is based on the optimisation of a complete solution. The results achieved are very encouraging in terms of evaluation measures, and the proposed translations themselves are well formed.

Douib Ameur, Langlois David, Smaïli Kamel
Bilingual Contexts from Comparable Corpora to Mine for Translations of Collocations

Due to the limited availability of parallel data in many languages, we propose a methodology that benefits from comparable corpora to find translation equivalents for collocations (as a specific type of difficult-to-translate multi-word expressions). Finding translations is known to be more difficult for collocations than for words. We propose a method based on bilingual context extraction and build a word (distributional) representation model drawing on these bilingual contexts (bilingual English-Spanish contexts in our case). We show that the bilingual context construction is effective for the task of translation equivalent learning and that our method outperforms a simplified distributional similarity baseline in finding translation equivalents.

Shiva Taslimipoor, Ruslan Mitkov, Gloria Corpas Pastor, Afsaneh Fazly
Bi-text Alignment of Movie Subtitles for Spoken English-Arabic Statistical Machine Translation

We describe efforts towards getting better resources for English-Arabic machine translation of spoken text. In particular, we look at movie subtitles as a unique, rich resource, as subtitles in one language often get translated into other languages. Movie subtitles are not new as a resource and have been explored in previous research; however, here we create a much larger bi-text (the biggest to date), and we further generate better quality alignment for it. Given the subtitles for the same movie in different languages, a key problem is how to align them at the fragment level. Typically, this is done using length-based alignment, but for movie subtitles, there is also time information. Here we exploit this information to develop an original algorithm that outperforms the current best subtitle alignment tool, subalign. The evaluation results show that adding our bi-text to the IWSLT training bi-text yields an improvement of over two BLEU points absolute.

Fahad Al-Obaidli, Stephen Cox, Preslav Nakov
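To make the use of subtitle timing concrete, here is a hedged sketch that greedily pairs fragments whose time intervals overlap most; the paper's actual alignment algorithm is more sophisticated, and all names and thresholds below are illustrative assumptions.

```python
def overlap(a, b):
    """Temporal overlap in seconds between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def align_by_time(src_subs, tgt_subs, min_overlap=0.5):
    """Greedy alignment: each source subtitle is paired with the target
    subtitle whose time span overlaps it most (if above a threshold).
    src_subs / tgt_subs: lists of (start_sec, end_sec, text) tuples."""
    pairs = []
    for s_start, s_end, s_text in src_subs:
        best, best_ov = None, min_overlap
        for t_start, t_end, t_text in tgt_subs:
            ov = overlap((s_start, s_end), (t_start, t_end))
            if ov > best_ov:
                best, best_ov = t_text, ov
        if best is not None:
            pairs.append((s_text, best))
    return pairs

# toy usage: fragments with near-identical timestamps get paired
print(align_by_time([(1.0, 3.5, "Hello there.")],
                    [(1.1, 3.4, "مرحبا.")]))
```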
A Parallel Corpus of Translationese

We describe a set of bilingual English-French and English-German parallel corpora in which the direction of translation is accurately and reliably annotated. The corpora are diverse, consisting of parliamentary proceedings, literary works, transcriptions of TED talks and political commentary. They will be instrumental for research on translationese and its applications to (human and machine) translation; specifically, they can be used for the task of translationese identification, a research direction that has enjoyed growing interest in recent years. To validate the quality and reliability of the corpora, we replicated previous results of supervised and unsupervised identification of translationese, and further extended the experiments to additional datasets and languages.

Ella Rabinovich, Shuly Wintner, Ofek Luis Lewinsohn
A Low Dimensionality Representation for Language Variety Identification

Language variety identification aims at labelling texts in a native language (e.g. Spanish, Portuguese, English) with its specific variation (e.g. Argentina, Chile, Mexico, Peru, Spain; Brazil, Portugal; UK, US). In this work we propose a low dimensionality representation (LDR) to address this task with five different varieties of Spanish: Argentina, Chile, Mexico, Peru and Spain. We compare our LDR method with common state-of-the-art representations and show an increase in accuracy of ~35%. Furthermore, we compare LDR with two reference distributed representation models. Experimental results show competitive performance while dramatically reducing the dimensionality—and increasing the big data suitability—to only 6 features per variety. Additionally, we analyse the behaviour of the employed machine learning algorithms and the most discriminating features. Finally, we employ an alternative dataset to test the robustness of our low dimensionality representation with another set of similar languages.

Francisco Rangel, Marc Franco-Salvador, Paolo Rosso
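As a loose illustration of how a handful of features per variety can be derived, the sketch below computes per-class term weights and summarizes a document by a few statistics of those weights. The specific weighting function and the exact six features used in the paper are not reproduced here; everything in this sketch is an assumption made for illustration only.

```python
import numpy as np
from collections import Counter, defaultdict

def class_term_weights(docs, labels):
    """Weight of a term for a class: its relative frequency within that class
    (an illustrative choice; the paper defines its own weighting)."""
    counts = defaultdict(Counter)
    for doc, lab in zip(docs, labels):
        counts[lab].update(doc.lower().split())
    weights = {}
    for lab, c in counts.items():
        total = sum(c.values())
        weights[lab] = {w: f / total for w, f in c.items()}
    return weights

def low_dim_features(doc, weights):
    """Represent a document by a few statistics (mean, std, min, max) of its
    term weights per class, yielding a handful of features per variety."""
    feats = []
    toks = doc.lower().split()
    for lab in sorted(weights):
        vals = [weights[lab].get(t, 0.0) for t in toks] or [0.0]
        feats.extend([np.mean(vals), np.std(vals), np.min(vals), np.max(vals)])
    return np.array(feats)
```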

Sentiment Analysis, Opinion Mining, Subjectivity, and Social Media

Frontmatter
Towards Empathetic Human-Robot Interactions

Since the late 1990s, when speech companies began providing their customer-service software in the market, people have gotten used to speaking to machines. As people interact more often with voice- and gesture-controlled machines, they expect the machines to recognize different emotions, and understand other high-level communication features such as humor, sarcasm and intention. In order to make such communication possible, the machines need an empathy module: a software system that can extract emotions from human speech and behavior and can decide the correct response of the robot. Although research on empathetic robots is still at an early stage, current methods involve using signal processing techniques, sentiment analysis and machine learning algorithms to make robots that can ‘understand’ human emotion. Other aspects of human-robot interaction include facial expression and gesture recognition, as well as robot movement to convey emotion and intent. We propose Zara the Supergirl as a prototype system of empathetic robots. It is a software-based virtual android, with an animated cartoon character to present itself on the screen. She will get ‘smarter’ and more empathetic through machine learning algorithms, by gathering more data and learning from it. In this paper, we present our work so far in the areas of deep learning for emotion and sentiment recognition, as well as humor recognition. We hope to explore the future direction of android development and how it can help improve people’s lives.

Pascale Fung, Dario Bertero, Yan Wan, Anik Dey, Ricky Ho Yin Chan, Farhad Bin Siddique, Yang Yang, Chien-Sheng Wu, Ruixi Lin
Extracting Aspect Specific Sentiment Expressions Implying Negative Opinions

Subjective expression extraction is a central problem in fine-grained sentiment analysis. Most existing works focus on generic subjective expression extraction as opposed to aspect-specific opinion phrase extraction. Given the ever-growing product reviews domain, extracting aspect-specific opinion phrases is important as it yields the key product issues that are often mentioned via phrases (e.g., “signal fades very quickly,” “had to flash the firmware often”). In this paper, we solve the problem using a combination of generative and discriminative modeling. The generative model performs first-level processing, facilitating (1) discovery of potential head aspects containing issues, (2) generation of a labeled dataset of issue phrases, and (3) feeding latent semantic features to subsequent discriminative modeling. We then employ discriminative large-margin and sequence modeling with pivot features for issue sentence classification and issue phrase boundary extraction. Experimental results using real-world reviews from Amazon.com demonstrate the effectiveness of the proposed approach.

Arjun Mukherjee
Aspect Terms Extraction of Arabic Dialects for Opinion Mining Using Conditional Random Fields

While English opinion mining has been studied extensively, Arabic fine-grained opinion mining has not received much attention. This paper looks at employing conditional random fields as a supervised method to extract aspect terms, which can then be employed for fine-grained opinion mining. Despite the lack of Arabic dialect NLP tools, which limited the amount of improvement that could be added to the algorithm, our analysis shows a level of precision and recall comparable to what has been achieved for English.

Alawya Alawami
Large Scale Authorship Attribution of Online Reviews

Traditional authorship attribution methods focus on the scenario of a limited number of authors writing long pieces of text. These methods are engineered to work on a small number of authors and generally do not scale well to a corpus of online reviews where the candidate set of authors is large. However, attribution of online reviews is important as they are replete with deception and spam. We evaluate a new large-scale approach for predicting authorship via the task of verification on online reviews. Our evaluation considers the largest number of possible candidate authors seen to date. Our results show that multiple verification models can be successfully combined to associate reviews with their correct author more than 78% of the time. We propose that our approach can be used to slow down or deter deceptive reviews in the wild.

Prasha Shrestha, Arjun Mukherjee, Thamar Solorio
Discovering Correspondence of Sentiment Words and Aspects

Extracting aspects and sentiments is a key problem in sentiment analysis. Existing models rely on joint modeling with supervised aspect and sentiment switching. This paper explores unsupervised models by exploiting a novel angle – correspondence of sentiments with aspects via topic modeling under two views. The idea is to split documents into two views and model the topic correspondence across the two views. We propose two new models that work on a set of document pairs (documents with two views) to discover their corresponding topics. Experimental results show that the proposed approach significantly outperforms strong baselines.

Geli Fei, Zhiyuan (Brett) Chen, Arjun Mukherjee, Bing Liu
Aspect Based Sentiment Analysis: Category Detection and Sentiment Classification for Hindi

E-commerce markets in developing countries (e.g. India) have recently witnessed a tremendous amount of user interest. Product reviews are now being generated daily in huge amounts. Classifying the sentiment expressed in a user-generated text/review into certain categories of interest, for example positive or negative, is famously known as sentiment analysis, whereas aspect-based sentiment analysis (ABSA) deals with the sentiment classification of a review towards some aspects, attributes or features. In this paper we assess the challenges and provide a benchmark setup for aspect category detection and sentiment classification for Hindi. An aspect category can be seen as the generalization of various aspects that are discussed in a review. To the best of our knowledge, this is the very first attempt at such a task involving any Indian language. The key contributions of the present work are two-fold, viz. providing a benchmark platform by creating an annotated dataset for aspect category detection and sentiment classification, and developing supervised approaches for these two tasks that can be treated as baseline models for further research.

Md Shad Akhtar, Asif Ekbal, Pushpak Bhattacharyya
A New Emotional Vector Representation for Sentiment Analysis

With the advent of Web 2.0, social networks (like Twitter and Facebook) offer users a different writing style that is close to SMS language. This language is characterized by the presence of emotion symbols (emoticons, acronyms and exclamation words). They often manifest the sentiments expressed in the comments and bring an important contextual value for determining the general sentiment of the text. Moreover, these emotion symbols are considered multilingual and universal symbols. This fact has inspired our research in the area of automatic sentiment classification. In this paper, we present a new vector representation of text which can faithfully convey the sentimental orientation of the text, based on the emotion symbols. We use Support Vector Machines to show that our emotional vector representation significantly improves accuracy on the sentiment analysis problem compared with the well-known bag-of-words vector representation, using a dataset derived from Facebook.

Hanen Ameur, Salma Jamoussi, Abdelmajid Ben Hamadou
Cascading Classifiers for Twitter Sentiment Analysis with Emotion Lexicons

Many different attempts have been made to determine sentiment polarity in tweets, using emotion lexicons and different NLP techniques with machine learning. In this paper we focus on using emotion lexicons and machine learning only, avoiding the use of additional NLP techniques. We present a scheme that is able to outperform other systems that use both natural language processing and distributional semantics. Our proposal consists of using a cascading classifier on lexicon features to improve accuracy. We evaluate our results on the TASS 2015 corpus, reaching an accuracy only 0.07 below the top-ranked system for task 1 (3 levels, whole test corpus). The cascading method we implemented consists of using the results of a first-stage classification with Multinomial Naïve Bayes as additional columns for a second-stage classification using a Naïve Bayes Tree classifier with feature selection. We tested at least 30 different classifiers and this combination yielded the best results.

Hiram Calvo, Omar Juárez Gambino
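A minimal sketch of the cascading idea, using scikit-learn: the stage-1 Multinomial Naïve Bayes class probabilities are appended as extra columns before training a stage-2 classifier. A plain decision tree stands in for the paper's Naïve Bayes Tree, and the feature matrices are assumed to be dense, non-negative lexicon features; all of this is illustrative rather than the authors' exact setup.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

def cascade_fit_predict(X_train, y_train, X_test):
    """Two-stage cascade: stage-1 class probabilities are appended as extra
    feature columns before training the stage-2 classifier.
    X_train / X_test: dense, non-negative feature matrices (e.g. lexicon counts)."""
    stage1 = MultinomialNB().fit(X_train, y_train)
    train_ext = np.hstack([X_train, stage1.predict_proba(X_train)])
    test_ext = np.hstack([X_test, stage1.predict_proba(X_test)])
    # The paper uses a Naive Bayes Tree (NBTree) with feature selection as
    # stage 2; a plain decision tree stands in for it here.
    stage2 = DecisionTreeClassifier(max_depth=10).fit(train_ext, y_train)
    return stage2.predict(test_ext)
```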
A Multilevel Approach to Sentiment Analysis of Figurative Language in Twitter

A commendable amount of work has been attempted in the field of sentiment analysis, or opinion mining, from natural language texts and Twitter texts. One of the main goals in such tasks is to assign polarities (positive or negative) to a piece of text. At the same time, one of the important as well as difficult issues is how to assign the degree of positivity or negativity to certain texts. The answer becomes more complex when we perform a similar task on figurative language texts collected from Twitter. Figurative language devices such as irony and sarcasm contain an intentional secondary or extended meaning hidden within the expressions. In this paper we present a novel approach to identify the degree of sentiment (fine-grained on an 11-point scale) for figurative language texts. We used several semantic features such as sentiment and intensifiers, and we introduced sentiment abruptness, which measures the variation of sentiment from positive to negative or vice versa. We trained our systems at multiple levels, achieving a maximum cosine similarity of 0.823 and a minimum mean squared error of 2.170.

Braja Gopal Patra, Soumadeep Mazumdar, Dipankar Das, Paolo Rosso, Sivaji Bandyopadhyay
Determining Sentiment in Citation Text and Analyzing Its Impact on the Proposed Ranking Index

Whenever human beings interact with each other, they exchange or express opinions, emotions and sentiments. These opinions can be expressed in text, speech or images. Analysis of these sentiments is one of the popular research areas of present-day researchers. Sentiment analysis, also known as opinion mining, tries to identify or classify these sentiments or opinions into two broad categories: positive and negative. Much work on sentiment analysis has been done on social media conversations, blog posts, newspaper articles and various narrative texts. However, when it comes to identifying emotions in scientific papers, researchers face difficulties due to the implicit and hidden nature of the opinions or emotions. As citation instances are considered inherently positive in emotion, popular ranking and indexing paradigms often neglect the opinion present while citing. Therefore, in the present paper, we deployed a system of citation sentiment analysis to achieve three major objectives. First, we identified sentiments in the citation text and assigned a score to each of the instances, using a supervised classifier for this purpose. Secondly, we proposed a new index (we shall refer to it hereafter as the M-index) which takes into account both quantitative and qualitative factors while scoring a paper. Finally, we developed a ranking of research papers based on the M-index. We have also shown the impact of the M-index on the ranking of scientific papers.

Souvick Ghosh, Dipankar Das, Tanmoy Chakraborty
Combining Lexical Features and a Supervised Learning Approach for Arabic Sentiment Analysis

The importance of building sentiment analysis tools for Arabic social media has been recognized during the past couple of years, especially with the rapid increase in the number of Arabic social media users. One of the main difficulties in tackling this problem is that text within social media is mostly colloquial, with many dialects being used within social media platforms. In this paper, we present a set of features that were integrated with a machine learning based sentiment analysis model and applied on Egyptian, Saudi, Levantine, and MSA Arabic social media datasets. Many of the proposed features were derived through the use of an Arabic sentiment lexicon. The model also uses emoticon-based features, as well as input-text-related features such as the number of segments within the text, the length of the text, whether the text ends with a question mark or not, etc. We show that the presented features result in increased accuracy across six of the seven benchmarked datasets we experimented with. Since the developed model outperforms all existing Arabic sentiment analysis systems that have publicly available datasets, we can state that this model represents the state of the art in Arabic sentiment analysis.

Samhaa R. El-Beltagy, Talaat Khalil, Amal Halaby, Muhammad Hammad
Sentiment Analysis in Arabic Twitter Posts Using Supervised Methods with Combined Features

With the huge amount of daily generated social network posts, reviews, ratings, recommendations and other forms of online expression, Web 2.0 has turned into a crucial opinion-rich resource. Since others’ opinions seem to be determinant when making a decision at both the individual and the organizational level, several research efforts are currently looking into sentiment analysis. In this paper, we deal with sentiment analysis in Arabic written Twitter posts. Our proposed approach leverages a rich set of multilevel features such as syntactic, surface-form, tweet-specific and linguistically motivated features. Sentiment features are also applied, being mainly inferred from both novel general-purpose as well as tweet-specific sentiment lexicons for Arabic words. Several supervised classification algorithms (Support Vector Machines, Naive Bayes, Decision Tree and Random Forest) were applied on our data, focusing on modern standard Arabic (MSA) tweets. The experimental results using the proposed resources and methods indicate high performance levels given the challenge imposed by the particularities of the Arabic language.

Rihab Bouchlaghem, Aymen Elkhelifi, Rim Faiz
Interactions Between Term Weighting and Feature Selection Methods on the Sentiment Analysis of Turkish Reviews

Term weighting methods assign appropriate weights to the terms in a document so that more important terms receive higher weights for the text representation. In this study, we consider four term weighting and three feature selection methods and investigate how these term weighting methods respond to the reduced text representation. We conduct experiments on five Turkish review datasets so that we can establish baselines and compare the performance of these term weighting methods. We also test these methods on English reviews so that we can identify their differences from the Turkish reviews. We show that both the tf and tp weighting methods are the best for the Turkish reviews, while tp is the best for the English reviews. When feature selection is applied, the tf*idf method with DFD and χ2 has the highest accuracies for the Turkish reviews, while the tf*idf and tp methods with χ2 have the best performance for the English reviews.

Tuba Parlar, Selma Ayşe Özel, Fei Song
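Two of the weighting/selection combinations studied above (tf*idf and term presence, each with χ2 feature selection) can be approximated with scikit-learn pipelines as sketched below; the classifier, the number of selected features and other settings are illustrative choices, not those of the paper.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# tf*idf weighting followed by chi-squared feature selection (illustrative settings).
tfidf_chi2_pipeline = Pipeline([
    ("counts", CountVectorizer()),          # raw term frequencies (tf)
    ("tfidf", TfidfTransformer()),          # tf*idf weighting
    ("select", SelectKBest(chi2, k=1000)),  # keep the 1000 highest-scoring terms
    ("clf", LinearSVC()),
])

# term presence (tp) variant: binary occurrence instead of counts.
tp_chi2_pipeline = Pipeline([
    ("presence", CountVectorizer(binary=True)),
    ("select", SelectKBest(chi2, k=1000)),
    ("clf", LinearSVC()),
])

# usage: tfidf_chi2_pipeline.fit(train_texts, train_labels)
#        predictions = tfidf_chi2_pipeline.predict(test_texts)
```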
Developing a Concept-Level Knowledge Base for Sentiment Analysis in Singlish

In this paper, we present Singlish SenticNet, a concept-level knowledge base for sentiment analysis that associates multiword expressions to a set of emotion labels and a polarity value. Unlike many other sentiment analysis resources, SenticNet is not built by manually labeling pieces of knowledge coming from general NLP resources such as WordNet or DBPedia. Instead, it is automatically constructed by applying graph-mining and multi-dimensional scaling techniques on the affective common-sense knowledge collected from three different sources. This knowledge is represented redundantly at three levels: semantic network, matrix, and vector space. Subsequently, the concepts are labeled by emotions and polarity through the ensemble application of spreading activation, neural networks and an emotion categorization model.

Rajiv Bajpai, Danyuan Ho, Erik Cambria
Using Syntactic and Semantic Features for Classifying Modal Values in the Portuguese Language

This paper presents a study in a field poorly explored for the Portuguese language – modality and its automatic tagging. Our main goal was to find a set of attributes for the creation of automatic taggers with improved performance over the bag-of-words (bow) approach. The performance was measured using precision, recall and F1. Because it is a relatively unexplored field, the study covers the creation of the corpus (composed of eleven verbs), the use of a parser to extract syntactic and semantic information from the sentences, and a machine learning approach to identify modality values. Based on three different sets of attributes – from the trigger itself, the trigger’s path (from the parse tree) and the context – the system creates a tagger for each verb, achieving (for almost every verb) an improvement in F1 when compared to the traditional bow approach.

João Sequeira, Teresa Gonçalves, Paulo Quaresma, Amália Mendes, Iris Hendrickx
Detecting the Likely Causes Behind the Emotion Spikes of Influential Twitter Users

Understanding the causes of spikes in the emotion flow of influential social media users is a key component when analyzing the diffusion and adoption of opinions and trends. Hence, in this work we focus on detecting the likely reasons or causes of spikes within influential Twitter users’ emotion flow. To achieve this, once an emotion spike is identified we use linguistic and statistical analyses on the tweets surrounding the spike in order to reveal the spike’s likely explanations or causes in the form of keyphrases. Experimental evaluation on emotion flow visualization, emotion spikes identification and likely cause extraction for several influential Twitter users shows that our method is effective for pinpointing interesting insights behind the causes of the emotion fluctuation. Implications of our work are highlighted by relating emotion flow spikes to real-world events and by the transversal application of our technique to other types of timestamped text.

Calkin Suero Montero, Hatem Haddad, Maxim Mozgovoy, Chedi Bechikh Ali
Age Identification of Twitter Users: Classification Methods and Sociolinguistic Analysis

In this article, we address the problem of age identification of Twitter users based on their online texts. We used a set of text mining, sociolinguistic-based and content-related text features, and we evaluated a number of well-known and widely used machine learning algorithms for classification, in order to examine their appropriateness for this task. The experimental results showed that the Random Forest algorithm offered superior performance, achieving an accuracy of 61%. We ranked the classification features by their informativeness, using the ReliefF algorithm, and we analyzed the results in terms of sociolinguistic principles of age-related linguistic variation.

Vasiliki Simaki, Iosif Mporas, Vasileios Megalooikonomou
Mining of Social Networks from Literary Texts of Resource Poor Languages

We describe our work on the automatic identification of social events and the mining of social networks from literary texts in Tamil. Tamil belongs to the Dravidian language family and is a morphologically rich language. It is also a resource-poor language; sophisticated resources for document processing, such as parsers and phrase structure tree taggers, are not available. In our work we have used shallow parsing for document processing. Conditional Random Fields (CRFs), a machine learning technique, are used for automatic identification of social events. We have obtained an F-measure of 62% on social event identification. Social networks are mined by forming triads of the actors in the social events. The social networks are evaluated using a graph comparison technique: the system-generated social networks are compared with the gold network. We have obtained a very encouraging similarity score of 0.75.

Pattabhi R. K. Rao, Sobha Lalitha Devi
Collecting and Annotating Indian Social Media Code-Mixed Corpora

The pervasiveness of social media in the present digital era has empowered the ‘netizens’ to be more creative and interactive, and to generate content using free language forms that often are closer to spoken language and hence show phenomena previously mainly analysed in speech. One such phenomenon is code-mixing, which occurs when multilingual persons switch freely between the languages they have in common. Code-mixing presents many new challenges for language processing and the paper discusses some of them, taking as a starting point the problems of collecting and annotating three corpora of code-mixed Indian social media text: one corpus with English-Bengali Twitter messages and two corpora containing English-Hindi Twitter and Facebook messages, respectively. We present statistics of these corpora, discuss part-of-speech tagging of the corpora using both a coarse-grained and a fine-grained tag set, and compare their complexity to several other code-mixed corpora based on a Code-Mixing Index.

Anupam Jamatia, Björn Gambäck, Amitava Das
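For reference, a commonly cited formulation of the utterance-level Code-Mixing Index (following Das and Gambäck) is sketched below; whether the paper uses exactly this variant is an assumption, and the tag names in the example are illustrative.

```python
from collections import Counter

def code_mixing_index(lang_tags):
    """Utterance-level Code-Mixing Index (one common formulation):
    CMI = 100 * (1 - max_i(w_i) / (n - u)),
    where w_i counts tokens tagged with language i, n is the total number of
    tokens and u the number of language-independent tokens (tag 'univ')."""
    n = len(lang_tags)
    u = sum(1 for t in lang_tags if t == "univ")
    if n == u:          # no language-specific tokens -> no mixing
        return 0.0
    counts = Counter(t for t in lang_tags if t != "univ")
    return 100.0 * (1.0 - max(counts.values()) / (n - u))

# e.g. a tweet with 5 English tokens, 3 Hindi tokens and 2 hashtags/URLs
print(code_mixing_index(["en"] * 5 + ["hi"] * 3 + ["univ"] * 2))  # 37.5
```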
Turkish Normalization Lexicon for Social Media

Social media has its own ever-growing language and distinct characteristics. Although social media has been shown to be of great utility to research studies, the varying quality of written texts degrades the performance of existing NLP tools. Normalization of texts, transforming them from informal to well-written texts, appears to be a reasonable preprocessing step for adapting tools trained on different domains to social media. In this study, we compile the first Turkish normalization lexicon, which sheds light on the kinds of lexical variations observed in social media texts. A graphical representation acquired from a text corpus is used to model contextual similarities between normalization equivalences, and the lexicon is automatically generated by performing random walks on this graph. The underlying framework not only enables different lexicons to be generated from the same corpus but also produces lexicons that are tuned to specific genres. Evaluation studies demonstrate the effectiveness of the induced lexicon in normalizing Turkish texts.

Seniz Demir, Murat Tan, Berkay Topcu
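A hedged sketch of lexicon induction by random walks on a word-context graph: short walks start at a noisy token, and the canonical words they reach most often become normalization candidates. The graph construction, walk lengths and function names are illustrative assumptions rather than the paper's implementation.

```python
import random
from collections import Counter, defaultdict

def build_graph(pairs):
    """Bipartite graph: tokens <-> the contexts they occur in.
    pairs: iterable of (token, context) tuples, e.g. ('slm', 'left_right')."""
    graph = defaultdict(set)
    for tok, ctx in pairs:
        graph[("tok", tok)].add(("ctx", ctx))
        graph[("ctx", ctx)].add(("tok", tok))
    return graph

def normalization_candidates(graph, noisy, canonical_vocab,
                             walks=2000, max_steps=4, seed=0):
    """Run short random walks from a noisy token; canonical words reached
    most often are proposed as normalization candidates."""
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(walks):
        node = ("tok", noisy)
        for _ in range(max_steps):
            neighbours = list(graph[node])
            if not neighbours:
                break
            node = rng.choice(neighbours)
            kind, word = node
            if kind == "tok" and word in canonical_vocab and word != noisy:
                hits[word] += 1
                break
    return hits.most_common(5)
```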

Text Classification and Categorization

Frontmatter
Introducing Semantics in Short Text Classification

To overcome the short text classification issues due to shortness and sparseness, an enrichment process is classically proposed: topics (word clusters) are extracted from external knowledge sources using Latent Dirichlet Allocation, and all the words associated with topics that encompass the short text's words are added to the initial short text content. We propose (i) an explicit representation of a two-level enrichment method in which the enrichment is considered either with respect to each word in the text or with respect to the global semantic meaning of the short text, and (ii) a new kind of semantic Random Forest in which semantic relations between features are taken into account at the node level rather than at the tree level, as recently proposed in the literature to avoid potential tree correlation. We demonstrate that our enrichment method is valid not only for Random Forest based methods but also for other methods like MaxEnt, SVM and Naive Bayes.

Ameni Bouaziz, Célia da Costa Pereira, Christel Dartigues-Pallez, Frédéric Precioso
Topics and Label Propagation: Best of Both Worlds for Weakly Supervised Text Classification

We propose a Label Propagation based algorithm for weakly supervised text classification. We construct a graph where each document is represented by a node and edge weights represent similarities among the documents. Additionally, we discover underlying topics using Latent Dirichlet Allocation (LDA) and enrich the document graph by including the topics in the form of additional nodes. The edge weight between a topic and a text document represents the level of “affinity” between them. Our approach does not require document-level labelling; instead it expects manual labels only for topic nodes. This significantly minimizes the level of supervision needed, as only a few topics are observed to be enough for achieving sufficiently high accuracy. The Label Propagation algorithm is employed on this enriched graph to propagate labels among the nodes. Our approach combines the advantages of Label Propagation (through document-document similarities) and Topic Modelling (for minimal but smart supervision). We demonstrate the effectiveness of our approach on various datasets and compare it with state-of-the-art weakly supervised text classification approaches.

Sachin Pawar, Nitin Ramrakhiyani, Swapnil Hingmire, Girish K. Palshikar
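The propagation step over the topic-enriched graph can be sketched as standard iterative label propagation with the (labeled) topic nodes clamped; the matrix formulation, parameter names and iteration count below are illustrative, not the authors' exact algorithm.

```python
import numpy as np

def propagate_labels(W, Y_init, labeled_mask, n_iter=100):
    """Iterative label propagation on a weighted graph.
    W            : (n, n) symmetric non-negative similarity matrix
                   (document-document and document-topic edges).
    Y_init       : (n, n_classes) one-hot labels; only rows of labeled
                   (topic) nodes are meaningful.
    labeled_mask : boolean array marking topic nodes whose labels are clamped."""
    # Row-normalize W into a transition matrix.
    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    Y = Y_init.copy().astype(float)
    for _ in range(n_iter):
        Y = P @ Y                               # spread labels to neighbours
        Y[labeled_mask] = Y_init[labeled_mask]  # clamp the supervised topic nodes
    return Y.argmax(axis=1)                     # predicted class per node
```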
Deep Neural Networks for Czech Multi-label Document Classification

This paper is focused on automatic multi-label document classification of Czech text documents. Current approaches usually use some pre-processing, which can have a negative impact (loss of information, additional implementation work, etc.). Therefore, we would like to omit it and use deep neural networks that learn from simple features. This choice was motivated by their successful usage in many other machine learning fields. Two different networks are compared: the first one is a standard multi-layer perceptron, while the second one is a popular convolutional network. The experiments on a Czech newspaper corpus show that both networks significantly outperform a baseline method which uses a rich set of features with a maximum entropy classifier. We have also shown that the convolutional network gives the best results.

Ladislav Lenc, Pavel Král
Turkish Document Classification with Coarse-Grained Semantic Matrix

In this paper, we present a novel method for document classification that uses a semantic matrix representation of Turkish sentences, concentrating on sentence phrases and their concepts in the text. Our model has been designed to find the phrases in a sentence, identify their relations with specific concepts, and represent the sentences as a coarse-grained semantic matrix. Predicate features and the semantic class type are also added to the coarse-grained semantic matrix representation. The highest success rate in Turkish document classification (97.12) is obtained by adding the coarse-grained semantic matrix representation to the data that held the previous best result in earlier studies on Turkish document classification.

İlknur Dönmez, Eşref Adalı
Supervised Topic Models for Diagnosis Code Assignment to Discharge Summaries

Mining medical data has gained significant interest in recent years thanks to advances in the data mining and machine learning fields. In this work, we focus on a challenging issue in medical data mining: automatic diagnosis code assignment to discharge summaries, i.e., characterizing a patient’s hospital stay (diseases, symptoms, treatments, etc.) with a set of codes usually derived from the International Classification of Diseases (ICD). We cast the problem as a machine learning task and we experiment with some recent approaches based on probabilistic topic models. We demonstrate the efficiency of these models in terms of high predictive scores and ease of result interpretation. As such, we show how topic models enable gaining insights into this field and provide new research opportunities for possible improvements.

Mohamed Dermouche, Julien Velcin, Rémi Flicoteaux, Sylvie Chevret, Namik Taright

Information Extraction

Frontmatter
Identity and Granularity of Events in Text

In this paper we describe a method to detect event descriptions in different news articles and to model the semantics of events and their components using RDF representations. We compare these descriptions to solve a cross-document event coreference task. Our component approach to event semantics defines identity and granularity of events at different levels. It performs close to state-of-the-art approaches on the cross-document event coreference task, while outperforming other works when assuming similar quality of event detection. We demonstrate how granularity and identity are interconnected and we discuss how semantic anomaly could be used to define differences between coreference, subevent and topical relations.

Piek Vossen, Agata Cybulska
An Informativeness Approach to Open IE Evaluation

Open Information Extraction (OIE) systems extract relational tuples from text without requiring the relations of interest to be specified in advance. Systems perform well on widely used metrics such as precision and yield, but a close look at their output shows a general lack of informativeness in facts deemed correct. We propose a new evaluation protocol, based on question answering, that is closer to text understanding and end-user needs. Extracted information is judged upon its capacity to automatically answer questions about the source text. As a showcase for our protocol, we devise a small corpus of question/answer pairs, and evaluate available state-of-the-art OIE systems on it. Performance-wise, our results are in line with previous findings. Furthermore, we are able to estimate recall for the task, which is novel. We distribute our annotated data and automatic evaluation program.

William Léchelle, Philippe Langlais
End-to-End Relation Extraction Using Markov Logic Networks

The task of end-to-end relation extraction consists of two sub-tasks: (i) identifying entity mentions along with their types and (ii) recognizing semantic relations among the entity mention pairs. It has been shown that for better performance, it is necessary to address these two sub-tasks jointly [13, 22]. We propose an approach for simultaneous extraction of entity mentions and relations in a sentence, by using inference in Markov Logic Networks (MLN) [21]. We learn three different classifiers: (i) local entity classifier, (ii) local relation classifier and (iii) “pipeline” relation classifier which uses predictions of the local entity classifier. Predictions of these classifiers may be inconsistent with each other. We represent these predictions along with some domain knowledge using weighted first-order logic rules in an MLN and perform joint inference over the MLN to obtain a global output with minimum inconsistencies. Experiments on the ACE (Automatic Content Extraction) 2004 dataset demonstrate that our approach of joint extraction using MLNs outperforms the baselines of individual classifiers. Our end-to-end relation extraction performance is better than 2 out of 3 previous results reported on the ACE 2004 dataset.

Sachin Pawar, Pushpak Bhattacharya, Girish K. Palshikar
Knowledge Extraction with NooJ Using a Syntactico-Semantic Approach for the Arabic Utterances Understanding

With the advancement of the NLP field, knowledge extraction has become an interesting research topic. Indeed, the need for improvement through NLP techniques has also become apparent. Hence, in the general context of the construction of an Arabic touristic corpus equivalent to those of the European projects MEDIA and LUNA, and due to the lack of Arabic electronic resources, we had the opportunity to expand the EL-DicAr of [11] with knowledge hinging on Touristic Information and Hotel Reservations (TIHR). Thus, in the same manner as [11], we have developed local grammars for the recognition of essential knowledge in our field of study. This greatly facilitates the subsequent work of understanding user utterances when interacting with a dialogue system.

Chahira Lhioui, Anis Zouaghi, Mounir Zrigui
Adapting TimeML to Basque: Event Annotation

In this paper we present an event annotation effort following EusTimeML, a temporal mark-up language for Basque based on TimeML. We first describe events and their main ontological and grammatical features, basing our analysis on Basque grammars and the TimeML classification of events. Annotation guidelines have been created to address event information annotation for Basque, and an annotation experiment has been conducted. A first round served to evaluate the preliminary guidelines, and decisions on event annotation were taken according to the annotations and inter-annotator agreement results; a guideline tuning period then followed. In the second round, we created a manually annotated gold standard corpus for event annotation in Basque. The event analysis and annotation experiment are part of a complete temporal information analysis and corpus creation work.

Begoña Altuna, María Jesús Aranzabe, Arantza Díaz de Ilarraza

Applications

Frontmatter
Deeper Summarisation: The Second Time Around
An Overview and Some Practical Suggestions

This paper advocates deeper summarisation methods: methods that are closer to text understanding; methods that manipulate intermediate semantic representations. As a field, we are not yet in a position to create these representations perfectly, but I still believe that now is a good time to be a bit more ambitious again in our goals for summarisation. I think that a summariser should be able to provide some form of explanation for the summary it just created; and if we want those types of summarisers, we will have to start manipulating semantic representations. Considering the state of the art in NLP in 2016, I believe that the field is ready for a second attempt at going deeper in summarisation. We NLP folk have come a long way since the days of early AI research. Twenty-five years of statistical research in NLP have given us more robust, more informative processing of many aspects of semantics – such as semantic similarity and relatedness between words (and maybe larger things), semantic role labelling, co-reference resolution, and sentiment detection. Now, with these new tools under our belt, we can try again to create the right kind of intermediate representations for summarisation, and then do something exciting with them. Of course, exactly how is a very big question. In this opinion paper, I will bring forward some suggestions, by taking a second look at historical summarisation models from the era of Strong AI. These may have been over-ambitious back then, but people still talk about them now because of their explanatory power: they make statements about which meaning units in a text are always important, and why. I will discuss two 1980s models for text understanding and summarisation (Wendy Lehnert’s Plot Units, and Kintsch and van Dijk’s memory-restricted discourse structure), both of which have recently been revived by their first modern implementations. The implementation of Plot Unit-style affect analysis is by Goyal et al. (2013), while the KvD implementation is by my student Yimai Fang, using a new corpus of language learner texts (Fang and Teufel 2014). Looking at those systems, I will argue that even an imperfect deeper summariser is exciting news.

Simone Teufel
Tracing Language Variation for Romanian

This paper illustrates a pilot study on two collections of publications, written from the middle of the 19th century onwards in two countries, Romania and the Republic of Moldavia. The corpus includes articles from the most important Romanian and Bessarabian publications, categorized into three periods: 1840–1917, 1918–1940, and 1941–1991. The research conducted on these resources focuses on the lexical evolution of words. We use a machine learning approach to explore the patterns that govern the lexical differences between two lexicons. The model is used for automatically correlating different forms of a word. The approach is suitable for bootstrapping, in order to increase the quantity and quality of the training data. The presented approach is language independent. By using the contemporary language as a pivot, the data is analyzed and compared from various perspectives.

Daniela Gîfu, Radu Simionescu
Aoidos: A System for the Automatic Scansion of Poetry Written in Portuguese

Scansion is the activity of determining the patterns that give verses their poetic rhythm. In Portuguese, this means discovering the number of syllables that the verses in a poem have and fitting all verses to this measure, while attempting to pronounce syllables so that an adequate stress pattern is produced. This article presents Aoidos, a rule-based system that takes a poem written in the Portuguese language and performs scansion automatically, further providing an analysis of rhymes. The system works by making a phonetic transcription of a poem, determining the number of poetic syllables that the verses in the poem should have, fitting all the verses according to this measure and looking for verses that rhyme. Experiments show that the system attains a high accuracy rate (above 98%).

Adiel Mittmann, Aldo von Wangenheim, Alckmar Luiz dos Santos
Backmatter
Metadata
Title
Computational Linguistics and Intelligent Text Processing
Editor
Dr. Alexander Gelbukh
Copyright Year
2018
Electronic ISBN
978-3-319-75487-1
Print ISBN
978-3-319-75486-4
DOI
https://doi.org/10.1007/978-3-319-75487-1
