About this Book

This two-volume set, consisting of LNCS 7816 and LNCS 7817, constitutes the thoroughly refereed proceedings of the 14th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2013, held on Samos, Greece, in March 2013. The total of 91 contributions presented was carefully reviewed and selected for inclusion in the proceedings. The papers are organized in topical sections named: general techniques; lexical resources; morphology and tokenization; syntax and named entity recognition; word sense disambiguation and coreference resolution; semantics and discourse; sentiment, polarity, subjectivity, and opinion; machine translation and multilingualism; text mining, information extraction, and information retrieval; text summarization; stylometry and text simplification; and applications.



Sentiment, Polarity, Emotion, Subjectivity, and Opinion

Damping Sentiment Analysis in Online Communication: Discussions, Monologs and Dialogs

Sentiment analysis programs are now sometimes used to detect patterns of sentiment use over time in online communication and to help automated systems interact better with users. Nevertheless, it seems that no previously published study has assessed whether the position of individual texts within ongoing communication can be exploited to help detect their sentiments. This article assesses apparent sentiment anomalies in ongoing communication – texts assigned a sentiment strength significantly different from the average of the previous texts – to see whether their classification can be improved. The results suggest that a damping procedure to reduce sudden large changes in sentiment can improve classification accuracy, but that the optimal procedure will depend on the type of texts processed.

Mike Thelwall, Kevan Buckley, George Paltoglou, Marcin Skowron, David Garcia, Stephane Gobron, Junghyun Ahn, Arvid Kappas, Dennis Küster, Janusz A. Holyst
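
The damping idea described above can be sketched as follows. This is a minimal illustration, not the paper's actual procedure: the `max_jump` threshold and the strategy of pulling an anomalous score one step toward the running mean are invented for the example.

```python
def damp(scores, max_jump=1):
    """Limit sudden changes in a sequence of sentiment scores: when a
    new text's score differs from the running mean of previous texts
    by more than max_jump, pull it one step toward that mean."""
    damped = []
    for s in scores:
        if damped:
            mean_prev = sum(damped) / len(damped)
            if abs(s - mean_prev) > max_jump:
                # apparent anomaly: damp the jump
                s = s - max_jump if s > mean_prev else s + max_jump
        damped.append(s)
    return damped
```

For example, a sudden jump from sentiment strength 2 to 5 would be damped to 4, while a stable sequence passes through unchanged.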

Optimal Feature Selection for Sentiment Analysis

Sentiment Analysis (SA) research has increased tremendously in recent times. Sentiment analysis deals with methods that automatically process text content and extract the opinions of users. In this paper, […] are extracted from the text, and composite features are created from them. Part-of-Speech (POS) based features, namely adjectives and adverbs, are also extracted. Information Gain (IG) and Minimum Redundancy Maximum Relevancy (mRMR) feature selection methods are used to extract prominent features. Further, the effect of various feature sets on sentiment classification is investigated using machine learning methods. The effects of the different categories of features are investigated on four standard datasets, i.e. the movie review and product (book, DVD, and electronics) review datasets. Experimental results show that composite features created from prominent features of […] perform better than other features for sentiment classification, that mRMR is a better feature selection method than IG for sentiment classification, and that the Boolean Multinomial Naïve Bayes (BMNB) algorithm outperforms the Support Vector Machine (SVM) classifier for sentiment analysis in terms of both accuracy and execution time.

Basant Agarwal, Namita Mittal
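
The Information Gain criterion used for feature selection here can be computed as the reduction in class entropy given the presence or absence of a term. A minimal sketch (toy data; the paper works with full review datasets):

```python
from math import log2

def entropy(probs):
    """Shannon entropy of a probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

def info_gain(docs, labels, term):
    """Information Gain of a term for sentiment classification.
    docs: list of token sets; labels: parallel list of class labels."""
    n = len(docs)
    base = entropy([labels.count(c) / n for c in set(labels)])
    cond = 0.0
    for present in (True, False):
        subset = [l for d, l in zip(docs, labels) if (term in d) == present]
        if subset:
            cond += len(subset) / n * entropy(
                [subset.count(c) / len(subset) for c in set(subset)])
    return base - cond
```

A term that perfectly separates positive from negative documents gets IG equal to the full class entropy; a term occurring in no document gets IG 0.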

Measuring the Effect of Discourse Structure on Sentiment Analysis

The aim of this paper is twofold: measuring the effect of discourse structure when assessing the overall opinion of a document and analyzing to what extent these effects depend on the corpus genre. Using Segmented Discourse Representation Theory as our formal framework, we propose several strategies to compute the overall rating. Our results show that discourse-based strategies lead to better scores in terms of accuracy and Pearson’s correlation than state-of-the-art approaches.

Baptiste Chardon, Farah Benamara, Yannick Mathieu, Vladimir Popescu, Nicholas Asher

Lost in Translation: Viability of Machine Translation for Cross Language Sentiment Analysis

Recently there has been a lot of interest in Cross-Language Sentiment Analysis (CLSA), which uses Machine Translation (MT) to facilitate Sentiment Analysis in resource-deprived languages. The idea is to use the annotated resources of one language (say, L1) for performing Sentiment Analysis in another language (say, L2) which does not have annotated resources. The success of such a scheme crucially depends on the availability of an MT system between L1 and L2. We argue that such a strategy ignores the fact that a Machine Translation system is much more demanding in terms of resources than a Sentiment Analysis engine. Moreover, these approaches fail to take into account the divergence in the expression of sentiments across languages. We provide strong experimental evidence that even the best of such systems do not outperform a system trained using only a few polarity-annotated documents in the target language. Having a very large number of documents in L1 also does not help, because most Machine Learning approaches converge (or reach a plateau) after a certain training size (as demonstrated by our results). Based on our study, we take the stand that languages which have a genuine need for a Sentiment Analysis engine should focus on collecting a few polarity-annotated documents in their language instead of relying on CLSA.

Balamurali A.R., Mitesh M. Khapra, Pushpak Bhattacharyya

An Enhanced Semantic Tree Kernel for Sentiment Polarity Classification

Sentiment analysis has gained a lot of attention in recent years, mainly due to the many practical applications it supports and a growing demand for such applications. This growing demand is supported by an increasing amount and availability of opinionated online information, mainly due to the proliferation and popularity of social media. The majority of work in sentiment analysis considers the polarity of word terms rather than the polarity of specific senses of the word in context. However, there has been an increased effort in distinguishing between different senses of a word as well as their different opinion-related properties. Syntactic parse trees are a widely used natural language processing construct that has been effectively employed for text classification tasks. This paper proposes a novel methodology for extending syntactic parse trees, based on word sense disambiguation and context-specific opinion-related features. We evaluate the methodology on three publicly available corpora, employing the sub-set tree kernel as a similarity function in a support vector machine. We also evaluate the effectiveness of several publicly available sense-specific sentiment lexicons. Experimental results show that all our extended parse tree representations surpass the baseline performance for every measure and across all corpora, and compare well with other state-of-the-art techniques.

Luis A. Trindade, Hui Wang, William Blackburn, Niall Rooney

Combining Supervised and Unsupervised Polarity Classification for non-English Reviews

Two main approaches are used to detect the sentiment polarity of reviews. Supervised methods apply machine learning algorithms when training data are provided, while unsupervised methods are usually applied when linguistic resources are available but training data are not. Each has its own advantages and disadvantages, and for this reason we propose the use of meta-classifiers that combine both in order to classify the polarity of reviews. Firstly, the non-English corpus is translated into English in order to take advantage of English linguistic resources. Then, two machine learning models are generated over the two corpora (original and translated), and an unsupervised technique is applied to the translated version only. Finally, the three models are combined with a voting algorithm. Several experiments have been carried out using Spanish and Arabic corpora, showing that the proposed combination approach achieves better results than those obtained by using the methods separately.

José M. Perea-Ortega, Eugenio Martínez-Cámara, María-Teresa Martín-Valdivia, L. Alfonso Ureña-López

Word Polarity Detection Using a Multilingual Approach

Determining polarity of words is an important task in sentiment analysis with applications in several areas such as text categorization and review analysis. In this paper, we propose a multilingual approach for word polarity detection. We construct a word relatedness graph by using the relations in WordNet of a given language. We extend the graph by connecting the WordNets of different languages with the help of the Inter-Lingual-Index based on English WordNet. We develop a semi-automated procedure to produce a set of positive and negative seed words for foreign languages by using a set of English seed words. To identify the polarity of unlabeled words, we propose a method based on random walk model with commute time metric as proximity measure. We evaluate our multilingual approach for English and Turkish and show that it leads to improvement in performance for both languages.

Cüneyd Murad Özsert, Arzucan Özgür
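
The paper's method is a random walk with the commute-time metric over the multilingual WordNet graph. As a rough intuition only, here is a much-simplified label-propagation analogue; the toy graph and seed words are invented for illustration and the actual commute-time computation is not reproduced:

```python
def propagate_polarity(graph, pos_seeds, neg_seeds, iters=20):
    """Toy stand-in for graph-based polarity detection: iteratively
    average neighbours' polarity scores, with seed words clamped
    to +1 (positive) or -1 (negative)."""
    score = {w: 0.0 for w in graph}
    for w in pos_seeds:
        score[w] = 1.0
    for w in neg_seeds:
        score[w] = -1.0
    for _ in range(iters):
        for w, nbrs in graph.items():
            if w in pos_seeds or w in neg_seeds:
                continue  # seeds keep their labels
            if nbrs:
                score[w] = sum(score[n] for n in nbrs) / len(nbrs)
    return score
```

Unlabeled words connected (directly or via the inter-lingual links) to positive seeds drift toward +1, and those near negative seeds toward -1.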

Mining Automatic Speech Transcripts for the Retrieval of Problematic Calls

In order to assure and to improve the quality of service, call center operators need to automatically identify the problematic calls in the mass of information flowing through the call center. Our method to select and rank those critical conversations uses linguistic text mining to detect sentiment markers on French automatic speech transcripts. The markers’ weight and orientation are used to calculate the semantic orientation of the speech turns. The course of a conversation can then be graphically represented with positive and negative curves. We have established and evaluated on a manually annotated corpus three heuristics for the automatic selection of problematic conversations. Two proved to be very useful and complementary for the retrieval of conversations having segments with anger and tension. Their precision is high enough for use in real world systems and the ranking evaluated by mean precision follows the usual relevance behavior of a search engine.

Frederik Cailliau, Ariane Cavet
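
The turn-level semantic orientation described above can be sketched as a weighted sum of sentiment markers found in each speech turn. The marker lexicon below is a tiny invented example, not the authors' French resource:

```python
# Hypothetical sentiment markers with weight and orientation.
MARKERS = {"merci": +1.0, "problème": -1.5, "inadmissible": -2.0}

def turn_orientation(turn):
    """Sum the weights of the sentiment markers found in one speech turn."""
    return sum(w for m, w in MARKERS.items() if m in turn.lower())

def conversation_curve(turns):
    """Orientation per turn; plotting this gives the positive/negative
    curves used to represent the course of a conversation."""
    return [turn_orientation(t) for t in turns]
```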

Cross-Lingual Projections vs. Corpora Extracted Subjectivity Lexicons for Less-Resourced Languages

Subjectivity tagging is a prior step for sentiment annotation. Both machine learning based approaches and linguistic knowledge based ones profit from using subjectivity lexicons. However, most of these kinds of resources are often available only for English or other major languages. This work analyses two strategies for building subjectivity lexicons in an automatic way: by projecting existing subjectivity lexicons from English to a new language, and building subjectivity lexicons from corpora. We evaluate which of the strategies performs best for the task of building a subjectivity lexicon for a less-resourced language (Basque). The lexicons are evaluated in an extrinsic manner by classifying subjective and objective text units belonging to various domains, at document- or sentence-level. A manual intrinsic evaluation is also provided which consists of evaluating the correctness of the words included in the created lexicons.

Xabier Saralegi, Iñaki San Vicente, Irati Ugarteburu

Predicting Subjectivity Orientation of Online Forum Threads

Online forums contain huge amounts of valuable information in the form of discussions between forum users. The topics of discussion can be subjective, seeking the opinions of other users on some issue, or non-subjective, seeking factual answers to specific questions. Internet users search these forums for different types of information such as opinions, evaluations, speculations, facts, etc. Hence, knowing the subjectivity orientation of forum threads would improve information search in online forums. In this paper, we study methods to analyze the subjectivity of online forum threads. We build binary classifiers on textual features extracted from thread content to classify threads as subjective or non-subjective. We demonstrate the effectiveness of our methods on two popular online forums.

Prakhar Biyani, Cornelia Caragea, Prasenjit Mitra

Distant Supervision for Emotion Classification with Discrete Binary Values

In this paper, we present an experiment to identify emotions in tweets. Unlike previous studies, which typically use the six basic emotion classes defined by Ekman, we classify emotions according to the set of eight basic emotions defined by Plutchik (Plutchik's "wheel of emotions"). This allows us to treat the inherently multi-class problem of emotion classification as a binary problem for four opposing emotion pairs. Our approach applies distant supervision, which has been shown to be an effective way to overcome the need for a large set of manually labeled data to produce accurate classifiers. We build on previous work by treating not only emoticons and hashtags but also emoji, which are increasingly used in social media, as an alternative to explicit, manual labels. Since these labels may be noisy, we first perform an experiment to investigate the correspondence among particular labels of different types assumed to be indicative of the same emotion. We then test and compare the accuracy of independent binary classifiers for each of Plutchik's four binary emotion pairs trained with different combinations of label types. Our best performing classifiers achieve accuracies between 75% and 91%, depending on the emotion pair; these classifiers can be combined to emulate a single multi-label classifier for Plutchik's eight emotions that achieves accuracies superior to those reported in previous multi-way classification studies.

Jared Suttles, Nancy Ide
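
The distant-supervision step can be sketched as follows: a tweet receives a noisy emotion label from the emoticons, hashtags, or emoji it contains, and the markers are then removed so a classifier must learn from the remaining text. The marker sets below are invented examples for one opposing pair (joy/sadness), not the paper's actual mapping:

```python
# Hypothetical markers for one of Plutchik's opposing emotion pairs.
JOY_MARKERS = {":)", "😂", "#happy"}
SADNESS_MARKERS = {":(", "😢", "#sad"}

def distant_label(tweet):
    """Assign a noisy joy/sadness label from markers, then strip the
    markers so the classifier cannot simply memorise them.
    Returns (label_or_None, cleaned_text)."""
    toks = tweet.split()
    has_joy = any(t in JOY_MARKERS for t in toks)
    has_sad = any(t in SADNESS_MARKERS for t in toks)
    if has_joy == has_sad:  # no markers, or conflicting markers
        return None, tweet
    label = "joy" if has_joy else "sadness"
    text = " ".join(t for t in toks if t not in JOY_MARKERS | SADNESS_MARKERS)
    return label, text
```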

Using Google n-Grams to Expand Word-Emotion Association Lexicon

We present an approach to automatically generate a word-emotion lexicon based on a smaller human-annotated lexicon. To identify the feelings associated with a target word (a word being considered for inclusion in the lexicon), our proposed approach uses the frequencies, counts, or unique words around it within the trigrams from the Google n-gram corpus. The approach was tuned using, as a training lexicon, a subset of the National Research Council of Canada (NRC) word-emotion association lexicon, and applied to generate new lexicons of 18,000 words. We present six different lexicons generated in different ways using the frequencies, counts, or unique words extracted from the n-gram corpus. Finally, we evaluate our approach by testing each generated lexicon against a human-annotated lexicon to classify feelings from affective text, and demonstrate that the larger generated lexicons perform better than the human-annotated one.

Jessica Perrie, Aminul Islam, Evangelos Milios, Vlado Keselj
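
One way the trigram evidence can be used is to score a target word by how much of its trigram mass it shares with seed words for an emotion. This is a simplified sketch with invented data, not one of the paper's six lexicon-generation variants:

```python
def emotion_score(target, emotion_seeds, trigrams):
    """Score a target word by the fraction of its trigram occurrences
    that also contain a seed word for the emotion.
    trigrams: list of ((w1, w2, w3), count) pairs, standing in for
    Google n-gram corpus entries."""
    hits = total = 0
    for words, count in trigrams:
        if target in words:
            total += count
            if any(s in words for s in emotion_seeds):
                hits += count
    return hits / total if total else 0.0
```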

A Joint Prediction Model for Multiple Emotions Analysis in Sentences

In this study, we propose a scheme for recognizing people's multiple emotions in Chinese sentences. Compared to previous studies, which focused on single-emotion analysis of texts, our work can better reflect people's inner thoughts by predicting all the possible emotions. We first predict the multiple emotions of words with a CRF model, which avoids the restrictions of traditional emotion lexicons with limited resources and restricted context information. Instead of voting on emotions directly, we perform a probabilistic merge of the output words' multi-emotion distributions to jointly predict the sentence emotions, under the assumption that the emotions of the contained words and of the sentence are statistically consistent. As a comparison, we also employ SVM and LGR classifiers to predict each entry of the multiple emotions through a problem-transformation method. Finally, we combine the joint probabilities of the multiple emotions of a sentence generated by the CRF-based merge model and the transformed LGR model, which proved to give the best recognition of sentence-level multiple emotions in our experiments.

Yunong Wu, Kenji Kita, Kazuyuki Matsumoto, Xin Kang

Evaluating the Impact of Syntax and Semantics on Emotion Recognition from Text

In this paper, we systematically analyze the effect of incorporating different levels of syntactic and semantic information on the accuracy of emotion recognition from text. We carry out the evaluation in a supervised learning framework, and employ tree kernel functions as an intuitive and effective way to generate different feature spaces based on structured representations of the input data. We compare three different formalisms to encode syntactic information enriched with semantic features. These features are obtained from hand-annotated resources as well as distributional models. For the experiments, we use three datasets annotated according to the same set of emotions. Our analysis indicates that shallow syntactic information can positively interact with semantic features. In addition, we show that the three datasets can hardly be combined to learn more robust models, due to inherent differences in the linguistic properties of the texts or in the annotation.

Gözde Özbal, Daniele Pighin

Chinese Emotion Lexicon Developing via Multi-lingual Lexical Resources Integration

This paper proposes an automatic approach to building a Chinese emotion lexicon based on WordNet-Affect, a widely used English emotion lexicon resource developed on WordNet. The approach consists of three steps, namely translation, filtering, and extension. Initially, all English words in WordNet-Affect synsets are translated into Chinese words. Thereafter, with the help of a Chinese synonym dictionary (Tongyici Cilin), we build a bilingual undirected graph for each emotion category and propose a graph-based algorithm to filter out all non-emotion words introduced by the translation procedure. Finally, the Chinese emotion lexicons are obtained by expanding them with synonyms representing similar emotions. The results show that the generated lexicons are a reliable resource for analyzing emotions in Chinese text.

Jun Xu, Ruifeng Xu, Yanzhen Zheng, Qin Lu, Kai-Fai Wong, Xiaolong Wang

N-Gram-Based Recognition of Threatening Tweets

In this paper, we investigate to what degree it is possible to recognize threats in Dutch tweets. We attempt threat recognition on the basis of the single tweet alone (without further context), using only very simple recognition features, namely n-grams. We present two different methods of n-gram-based recognition, one based on manually constructed n-gram patterns and the other on machine-learned patterns. Our evaluation is not restricted to precision and recall scores, but also looks into the difference in yield between the two methods, considering both their combination and means that may help refine each method individually.

Nelleke Oostdijk, Hans van Halteren
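
The manually constructed pattern approach can be sketched as a small set of hand-written n-gram patterns matched against each tweet in isolation. The two Dutch patterns below are invented examples in the spirit of the method, not the authors' actual pattern set:

```python
import re

# Hypothetical hand-written threat patterns for Dutch tweets.
THREAT_PATTERNS = [
    re.compile(r"\bik ga (je|jou) \w+en\b"),  # "I am going to ... you"
    re.compile(r"\bje gaat eraan\b"),         # "you are done for"
]

def is_threat(tweet):
    """Flag a single tweet (no further context) as threatening if any
    hand-written n-gram pattern matches."""
    t = tweet.lower()
    return any(p.search(t) for p in THREAT_PATTERNS)
```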

Distinguishing the Popularity between Topics: A System for Up-to-Date Opinion Retrieval and Mining in the Web

The constantly increasing amount of opinionated text found on the Web has had a significant impact on the development of sentiment analysis. So far, the majority of the comparative studies in this field focus on analyzing fixed (offline) collections from certain domains, genres, or topics. In this paper, we present an online system for opinion mining and retrieval that is able to discover up-to-date web pages on given topics using focused crawling agents, extract opinionated textual parts from web pages, and estimate their polarity using opinion mining agents. The evaluation of the system on real-world case studies demonstrates that it is appropriate for comparing opinions between topics, since it provides useful indications of relative popularity based on a relatively small number of web pages. Moreover, it can produce genre-aware results of opinion retrieval, a valuable option for decision-makers.

Nikolaos Pappas, Georgios Katsimpras, Efstathios Stamatatos

Machine Translation and Multilingualism

No Free Lunch in Factored Phrase-Based Machine Translation

Factored models have been successfully used in many language pairs to improve translation quality in various aspects. In this work, we analyze this paradigm in an attempt at automating the search for well-performing machine translation systems. We examine the space of possible factored systems, concluding that a fully automatic search for good configurations is not feasible. We demonstrate that even if results of automatic evaluation are available, guiding the search is difficult due to small differences between systems, which are further blurred by randomness in tuning. We describe a heuristic for estimating the complexity of factored models. Finally, we discuss the possibilities of a “semi-automatic” exploration of the space in several directions and evaluate the obtained systems.

Aleš Tamchyna, Ondřej Bojar

Domain Adaptation in Statistical Machine Translation Using Comparable Corpora: Case Study for English Latvian IT Localisation

In recent years, statistical machine translation (SMT) has received much attention from language technology researchers, and it is increasingly applied not only to widely used language pairs but also to under-resourced languages. However, under-resourced languages and narrow domains face the problem of insufficient parallel data for building SMT systems of reasonable quality for practical applications. In this paper we show how broad-domain SMT systems can be successfully tailored to narrow domains using data extracted from strongly comparable corpora. We describe our experiments on adapting a baseline English-Latvian SMT system trained on publicly available parallel data (mostly legal texts) to the information technology domain by adding data extracted from in-domain comparable corpora. In addition to comparative human evaluation, the adapted SMT system was also evaluated in a real-life localisation scenario. Application of comparable corpora provides significant improvements, increasing human translation productivity by 13.6% while maintaining an acceptable quality of translation.

Mārcis Pinnis, Inguna Skadiņa, Andrejs Vasiļjevs

Assessing the Accuracy of Discourse Connective Translations: Validation of an Automatic Metric

Automatic metrics for the evaluation of machine translation (MT) compute scores that characterize globally certain aspects of MT quality such as adequacy and fluency. This paper introduces a reference-based metric that is focused on a particular class of function words, namely discourse connectives, of particular importance for text structuring, and rather challenging for MT. To measure the accuracy of connective translation (ACT), the metric relies on automatic word-level alignment between a source sentence and respectively the reference and candidate translations, along with other heuristics for comparing translations of discourse connectives. Using a dictionary of equivalents, the translations are scored automatically, or, for better precision, semi-automatically. The precision of the ACT metric is assessed by human judges on sample data for English/French and English/Arabic translations: the ACT scores are on average within 2% of human scores. The ACT metric is then applied to several commercial and research MT systems, providing an assessment of their performance on discourse connectives.

Najeh Hajlaoui, Andrei Popescu-Belis

An Empirical Study on Word Segmentation for Chinese Machine Translation

Word segmentation has been shown to be helpful for Chinese-to-English machine translation (MT), yet the way different segmentation strategies affect MT is poorly understood. In this paper, we focus on comparing different segmentation strategies in terms of machine translation quality. Our empirical study covers both English-to-Chinese and Chinese-to-English translation for the first time. Our results show that the necessity of word segmentation depends on the translation direction. After comparing two types of segmentation strategies with associated linguistic resources, we demonstrate that optimizing segmentation itself does not guarantee better MT performance, and that the choice of segmentation strategy is not the key to improving MT. Instead, we discover that linguistic resources such as segmented corpora, or the dictionaries that segmentation tools rely on, actually determine how word segmentation affects machine translation. Based on these findings, we propose an empirical approach that directly optimizes the dictionary used by the word segmenter with respect to the MT task, providing a BLEU score improvement of 1.30.

Hai Zhao, Masao Utiyama, Eiichiro Sumita, Bao-Liang Lu
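
For readers unfamiliar with dictionary-based Chinese word segmentation, the classic forward maximum-matching baseline illustrates why the dictionary itself is so decisive: the segmenter can only produce words the dictionary contains. A minimal sketch (toy dictionary; the paper compares much richer strategies):

```python
def max_match(text, dictionary):
    """Greedy forward maximum matching: at each position, take the
    longest dictionary word starting there, falling back to a single
    character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try longest span first
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words
```

Adding or removing a single dictionary entry changes the segmentation, which is exactly the lever the paper's MT-oriented dictionary optimization exploits.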

Class-Based Language Models for Chinese-English Parallel Corpus

This paper addresses using novel class-based language models on parallel corpora, focusing specifically on the English and Chinese languages. We find that the perplexity of Chinese is generally much higher than that of English and discuss the possible reasons. We demonstrate the relative effectiveness of using class-based models over the modified Kneser-Ney trigram model for our task. We also introduce a rare-events clustering and a polynomial discounting mechanism, which is shown to improve results. Our experimental results on parallel corpora indicate that the improvement due to classes is similar for English and Chinese. This suggests that class-based language models should be used for both languages.

Junfei Guo, Juan Liu, Michael Walsh, Helmut Schmid

Building a Bilingual Dictionary from a Japanese-Chinese Patent Corpus

In this paper, we propose an automatic method to build a bilingual dictionary from a Japanese-Chinese parallel corpus. The proposed method uses character similarity between Japanese and Chinese and a statistical machine translation (SMT) framework in a cascading manner. The first step extracts word translation pairs from the parallel corpus based on similarity between Japanese kanji characters (Chinese characters used in Japanese writing) and simplified Chinese characters. The second step trains phrase tables using two different SMT training tools, then extracts the common word translation pairs. The third step trains an SMT system using the word translation pairs obtained in the first and second steps. According to the experimental results, the proposed method yields 59.3% to 92.1% accuracy in the extracted word translation pairs, depending on the cascading step.

Keiji Yasuda, Eiichiro Sumita
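
The character-similarity idea in the first step can be sketched as a set-overlap measure between the characters of a Japanese word and a Chinese word. This is a simplification: a real system would first map kanji to their simplified variants, which is omitted here.

```python
def char_similarity(ja_word, zh_word):
    """Dice coefficient over the character sets of a Japanese word and
    a Chinese word, as a toy proxy for kanji/simplified-character
    similarity."""
    a, b = set(ja_word), set(zh_word)
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))
```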

A Diagnostic Evaluation Approach for English to Hindi MT Using Linguistic Checkpoints and Error Rates

This paper addresses diagnostic evaluation of machine translation (MT) systems for Indian languages, English to Hindi MT in particular, assessing the performance of MT systems on relevant linguistic phenomena (checkpoints). We use the diagnostic evaluation tool DELiC4MT to analyze the performance of MT systems on various PoS categories (e.g. nouns, verbs). The current system supports only word level checkpoints which might not be as helpful in evaluating the translation quality as compared to using checkpoints at phrase level and checkpoints that deal with named entities (NE), inflections, word order, etc. We therefore suggest phrase level checkpoints and NEs as additional checkpoints for DELiC4MT. We further use Hjerson to evaluate checkpoints based on word order and inflections that are relevant for evaluation of MT with Hindi as the target language. The experiments conducted using Hjerson generate overall (document level) error counts and error rates for five error classes (inflectional errors, reordering errors, missing words, extra words, and lexical errors) to take into account the evaluation based on word order and inflections. The effectiveness of the approaches was tested on five English to Hindi MT systems.

Renu Balyan, Sudip Kumar Naskar, Antonio Toral, Niladri Chatterjee

Leveraging Arabic-English Bilingual Corpora with Crowd Sourcing-Based Annotation for Arabic-Hebrew SMT

Recent studies in the Statistical Machine Translation (SMT) paradigm have focused on developing foreign-language-to-English translation systems. However, as SMT systems have matured, there is much demand for translation from one foreign language to another. Unfortunately, parallel training corpora for a pair of morphologically complex foreign languages like Arabic and Hebrew are very scarce. This paper uses active learning-based data selection and a crowdsourcing technique, Amazon Mechanical Turk, to create Arabic-Hebrew parallel corpora. It then explores two different techniques to build an Arabic-Hebrew SMT system. The first involves the traditional cascading of two SMT systems using English as a pivot language. The second approach is training a direct Arabic-Hebrew SMT system using sentence pivoting. Finally, we use a phrase generalization approach to further improve our performance.

Manish Gaurav, Guruprasad Saikumar, Amit Srivastava, Premkumar Natarajan, Shankar Ananthakrishnan, Spyros Matsoukas

Automatic and Human Evaluation on English-Croatian Legislative Test Set

This paper presents work on the manual and automatic evaluation of the online available machine translation (MT) service Google Translate, for the English-Croatian language pair in legislation and general domains. The experimental study is conducted on the test set of 200 sentences in total. Human evaluation is performed by native speakers, using the criteria of fluency and adequacy, and it is enriched by error analysis. Automatic evaluation is performed on a single reference set by using the following metrics: BLEU, NIST, F-measure and WER. The influence of lowercasing, tokenization and punctuation is discussed. Pearson’s correlation between automatic metrics is given, as well as correlation between the two criteria, fluency and adequacy, and automatic metrics.

Marija Brkić, Sanja Seljan, Tomislav Vičić
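
Of the automatic metrics listed above, WER is the simplest to make concrete: it is the word-level Levenshtein distance between hypothesis and reference, normalized by reference length. A minimal sketch (single reference, whitespace tokenization):

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance (substitutions,
    insertions, deletions) divided by the reference length."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)
```

This also makes the paper's point about tokenization and lowercasing tangible: any preprocessing applied before `split()` directly changes the score.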

Text Mining, Information Extraction, and Information Retrieval

Enhancing Search: Events and Their Discourse Context

Event-based search systems have become of increasing interest. This paper provides an overview of recent advances in event-based text mining, with an emphasis on biomedical text. We focus particularly on the enrichment of events with information relating to their interpretation according to surrounding textual and discourse contexts. We describe our annotation scheme used to capture this information at the event level, report on the corpora that have so far been enriched according to this scheme and provide details of our experiments to recognise this information automatically.

Sophia Ananiadou, Paul Thompson, Raheel Nawaz

Distributional Term Representations for Short-Text Categorization

Every day, millions of short texts are generated, for which effective tools for organization and retrieval are required. Because of the tiny length of these documents and their extremely sparse representations, the direct application of standard text categorization methods is not effective. In this work we propose using distributional term representations (DTRs) for short-text categorization. DTRs represent terms by means of contextual information, given by document occurrence and term co-occurrence statistics. Therefore, they allow us to develop enriched document representations that help to overcome, to some extent, the small-length and high-sparsity issues. We report experimental results on three challenging collections, using a variety of classification methods. These results show that the use of DTRs is beneficial for improving classification performance in short-text categorization.

Juan Manuel Cabrera, Hugo Jair Escalante, Manuel Montes-y-Gómez
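
The co-occurrence flavour of a DTR can be sketched as follows: each term is represented by the counts of terms appearing near it in a reference collection, and a short text is then enriched by summing its terms' context vectors. A toy illustration, not the paper's exact weighting scheme:

```python
from collections import Counter

def term_context_vectors(docs, window=2):
    """Distributional term representation: describe each term by the
    counts of terms co-occurring within a +/-window context."""
    vecs = {}
    for doc in docs:
        toks = doc.split()
        for i, t in enumerate(toks):
            ctx = toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]
            vecs.setdefault(t, Counter()).update(ctx)
    return vecs

def doc_representation(doc, vecs):
    """Enrich a short text by summing its terms' context vectors,
    mitigating the sparsity of the direct bag-of-words vector."""
    rep = Counter()
    for t in doc.split():
        rep.update(vecs.get(t, Counter()))
    return rep
```

Even a one-word "document" thus receives a non-trivial representation drawn from the contexts its word appeared in.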

Learning Bayesian Network Using Parse Trees for Extraction of Protein-Protein Interaction

Extraction of protein-protein interactions from scientific papers is a relevant task in the biomedical field. Machine learning-based methods, such as kernel-based ones, represent the state of the art in this task. Many efforts have focused on obtaining new types of kernels in order to employ syntactic information, such as parse trees, to extract interactions from sentences. These methods have reached the best performance on this task. Nevertheless, parse trees have not been exploited by other machine learning-based methods such as Bayesian networks. The advantage of using Bayesian networks is that we can exploit the structure of the parse trees to learn the Bayesian network structure, i.e., the parse trees provide the random variables and also the possible relations among them. Here we use syntactic relations as causal dependences between variables. Hence, our proposed method learns a Bayesian network from parse trees. The evaluation was carried out over five protein-protein interaction benchmark corpora. Results show that our method is competitive in comparison with state-of-the-art methods.

Pedro Nelson Shiguihara-Juárez, Alneu de Andrade Lopes

A Model for Information Extraction in Portuguese Based on Text Patterns

This paper proposes an information extraction model that identifies text patterns representing relations between two entities. Given a set of entity pairs representing a specific relation, it is possible to find text patterns expressing that relation within sentences from documents containing those entities. Once the text patterns are identified, it is possible to extract a complementary entity, provided the first entity of the relation and the related text patterns. Pattern selection relies on regular expressions, frequency, and the identification of less relevant words. Modern search engine APIs and HTML parsers are used to retrieve and parse web pages in real time, eliminating the need for a pre-established corpus. The retrieval of document counts within a timeframe is also used to aid the selection of the extracted entities.

Tiago Luis Bonamigo, Renata Vieira
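The pattern-mining idea (collect the text spans between known entity pairs, rank them by frequency, then reuse the frequent patterns to extract a complementary entity) can be sketched with stdlib regular expressions. All names and the 40-character span limit below are illustrative assumptions, not the paper's method:

```python
import re
from collections import Counter

def mine_patterns(seed_pairs, sentences):
    """Collect the text between known entity pairs, ranked by frequency."""
    patterns = Counter()
    for e1, e2 in seed_pairs:
        rx = re.compile(re.escape(e1) + r"\s+(.{1,40}?)\s+" + re.escape(e2))
        for s in sentences:
            for m in rx.finditer(s):
                patterns[m.group(1)] += 1
    return patterns

def extract_entity(e1, pattern, sentence):
    """Given the first entity and a learned pattern, extract the complement."""
    m = re.search(re.escape(e1) + r"\s+" + re.escape(pattern) + r"\s+(\w+)",
                  sentence)
    return m.group(1) if m else None
```

In practice a real system would also filter stop-word-only patterns and query a search engine for candidate sentences, as the abstract notes.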

A Study on Query Expansion Based on Topic Distributions of Retrieved Documents

This paper describes a new relevance feedback (RF) method that uses latent topic information extracted from the target documents. In the method, we extract latent topics of the target documents by means of latent Dirichlet allocation (LDA) and expand the initial query with the topic distributions of the documents retrieved by the first search. We conduct retrieval experiments with the proposed method and confirm that it is especially useful when the precision of the first search is low. Furthermore, we discuss the respective cases in which RF based on latent topic information and RF based on surface information, i.e., word frequency, work well.

Midori Serizawa, Ichiro Kobayashi
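Assuming the LDA topic distributions have already been estimated (running LDA itself is omitted here), the expansion step can be sketched as follows; the dictionary-based representation and function names are illustrative, not the authors' exact formulation:

```python
from collections import defaultdict

def expand_query(query_terms, doc_topic_dists, topic_words, n_words=2):
    """Average the topic distributions of the top retrieved documents and
    append the highest-probability words of the dominant topic."""
    avg = defaultdict(float)
    for dist in doc_topic_dists:
        for topic, w in dist.items():
            avg[topic] += w / len(doc_topic_dists)
    dominant = max(avg, key=avg.get)
    # add topic words that are not already in the query
    expansion = [w for w in topic_words[dominant]
                 if w not in query_terms][:n_words]
    return list(query_terms) + expansion
```

When the first search is imprecise, the averaged topic distribution still tends to concentrate on the intended topic, which is the situation where the abstract reports the method is most useful.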

Link Analysis for Representing and Retrieving Legal Information

Legal texts come in a great variety: laws, rules, statutes, etc. An important feature of these documents is that they are strongly linked, since they include references from one part to another. This makes them difficult to consult, because satisfying an information request requires gathering several references and rulings from a single text, and even from other texts. The goal of this work is to help in the process of consulting legal rulings by retrieving them from a request expressed as a question in natural language. For this, a formal model based on a weighted, undirected graph is proposed: nodes represent the articles that make up each document, and edges represent references between articles and their degree of similarity. A given question is added to the graph, and by combining a shortest-path algorithm with edge-weight analysis, a ranked list of articles is obtained. To evaluate the performance of the proposed model, we gathered 8,987 rulings and evaluated the answers to 40 test questions as correct, incorrect, or partial; a lawyer validated the answers. We compared results with other systems such as Lucene and JIRS (Java Information Retrieval System).

Alfredo López Monroy, Hiram Calvo, Alexander Gelbukh, Georgina García Pacheco
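The ranking step can be sketched with a stdlib Dijkstra search, under the assumption that edge weights encode dissimilarity (e.g., one minus similarity), so a shorter path means a closer article. The graph encoding below is illustrative, not the authors' exact model:

```python
import heapq

def rank_articles(graph, question_node):
    """Dijkstra from the question node over a weighted graph given as
    {node: [(neighbor, weight), ...]}; articles are ranked by increasing
    path cost, so the cheapest-to-reach article comes first."""
    dist = {question_node: 0.0}
    heap = [(0.0, question_node)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return sorted((n for n in dist if n != question_node), key=dist.get)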

Text Summarization

Discursive Sentence Compression

This paper presents a method for automatic summarization that deletes intra-sentence discourse segments. First, each sentence is divided into elementary discourse units; then, the less informative segments are deleted. To analyze the results, we set up an annotation campaign, through which we found interesting aspects of discourse-segment deletion as an alternative to the sentence compression task. Results show that the degree of disagreement in determining the optimal compressed sentence is high and increases with the complexity of the sentence; however, there is some agreement on the decision to delete discourse segments. The informativeness of each segment is calculated using textual energy, a method that has shown good results in automatic summarization.

Alejandro Molina, Juan-Manuel Torres-Moreno, Eric SanJuan, Iria da Cunha, Gerardo Eugenio Sierra Martínez

A Knowledge Induced Graph-Theoretical Model for Extract and Abstract Single Document Summarization

Summarization provides the major topics or theme of a document in a limited number of words. An extract summary depends on extracted sentences, while in an abstract summary each summary sentence may contain concise information from multiple sentences. The major factors affecting summary quality are: (1) the handling of noisy or less important terms in the document, (2) the use of the information content of terms (each term may have a different level of importance in the document), and (3) the identification of the appropriate thematic facts to form the summary. To reduce the effect of noisy terms and to exploit the information content of terms, we introduce a graph-theoretical model populated with the semantic and statistical importance of terms. Next, we introduce the concept of a weighted minimum vertex cover, which helps us identify the most representative and thematic facts in the document. Additionally, to generate the abstract summary, we introduce a vertex-constrained shortest-path technique, which uses minimum-vertex-cover information as a valuable resource. Our experimental results on the DUC-2001 and DUC-2002 datasets show that our system performs better than baseline systems.

Niraj Kumar, Kannan Srinathan, Vasudeva Varma
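A standard greedy heuristic for weighted minimum vertex cover (repeatedly take the vertex that covers the most remaining edges per unit weight) can be sketched as follows. This is the textbook heuristic, not necessarily the authors' exact formulation over term graphs:

```python
def weighted_vertex_cover(edges, weight):
    """Greedy weighted minimum vertex cover.

    edges:  a set of (u, v) pairs; weight: {vertex: cost}.
    Repeatedly picks the vertex with the best cost-per-covered-edge
    ratio until every edge has at least one endpoint in the cover.
    """
    cover = set()
    uncovered = set(edges)
    while uncovered:
        degree = {}
        for u, v in uncovered:
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
        best = min(degree, key=lambda x: weight[x] / degree[x])
        cover.add(best)
        uncovered = {e for e in uncovered if best not in e}
    return cover
```

In the summarization setting, light vertices with many incident edges (well-connected, cheap-to-include terms) are preferred, which matches the intuition of covering the document's content with few representative terms.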

Hierarchical Clustering in Improving Microblog Stream Summarization

Microblogging has seen a massive increase in use over the past couple of years. According to recent statistics, Twitter (the most popular microblogging platform) receives over 500 million posts per day. To help users manage this information overload, or to assess the full information potential of microblogging streams, a few summarization algorithms have been proposed. However, they are designed to work on a stream of posts filtered on a particular keyword, whereas most streams suffer from noise or contain posts referring to more than one topic. As a result, the generated summary is incomplete or even meaningless. We approach the problem of summarizing a stream by adding a layer of text clustering before the summarization step. We first identify the events users are talking about in the stream, group posts by event, and then cluster each group hierarchically. We show how generating an agglomerative hierarchical cluster tree over the posts before applying a summarization algorithm improves the quality of the summary.

Andrei Olariu
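The hierarchical step can be sketched with a small stdlib agglomerative clusterer over token sets. Single-linkage merging with Jaccard similarity and the stopping threshold are illustrative choices, not necessarily the paper's:

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b)

def agglomerative(posts, threshold=0.3):
    """Merge the two most similar clusters until no pair of clusters
    reaches the similarity threshold (single linkage)."""
    clusters = [[p] for p in posts]

    def sim(c1, c2):
        return max(jaccard(set(a.split()), set(b.split()))
                   for a in c1 for b in c2)

    while len(clusters) > 1:
        i, j = max(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: sim(clusters[ij[0]], clusters[ij[1]]))
        if sim(clusters[i], clusters[j]) < threshold:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```

Each resulting cluster groups posts about one event, so the downstream summarizer sees a coherent set of posts rather than a mixed stream.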

Summary Evaluation: Together We Stand NPowER-ed

Summary evaluation has been a distinct domain of research for several years. Human summary evaluation appears to be a high-level cognitive process and is thus difficult to reproduce. Even though several automatic evaluation methods correlate well with human evaluations over systems, we fail to obtain equivalent results when judging individual summaries. In this work, we propose the NPowER evaluation method, based on machine learning and a set of methods from the family of "n-gram graph"-based summary evaluation methods. First, we show that the combined, optimized use of the evaluation methods outperforms the individual ones. Second, we compare the proposed method to a combination of ROUGE metrics. Third, based on the results of feature selection, we study and discuss what could make future evaluation measures better. We show that we can easily provide per-summary evaluations that are far superior to the existing performance of evaluation systems, and we examine different measures under a unified view.

George Giannakopoulos, Vangelis Karkaletsis

Stylometry and Text Simplification

Explanation in Computational Stylometry

Computational stylometry, as in authorship attribution or profiling, has large application potential in diverse areas: literary science, forensics, language psychology, sociolinguistics, even medical diagnosis. Yet many of the basic research questions of this field are not studied systematically, or at all. In this paper we go into these problems and suggest that a reinterpretation of current and historical methods within the framework and methodology of machine learning for natural language processing would be helpful. We also argue for more research attention to explanation in computational stylometry, as opposed to purely quantitative evaluation measures, and propose a strategy for data collection and analysis to achieve progress in the field. We also introduce a fairly new application of computational stylometry in internet security.

Walter Daelemans

The Use of Orthogonal Similarity Relations in the Prediction of Authorship

Recent work on Authorship Attribution (AA) proposes the use of meta characteristics to train author models. The meta characteristics are orthogonal sets of similarity relations between the features from the different candidate authors. In that approach, the features are grouped and processed separately according to the type of information they encode, the so-called linguistic modalities. For instance, the syntactic, stylistic, and semantic features are each considered a different modality, as they represent different aspects of the texts. The assumption is that the independent extraction of meta characteristics results in more informative feature vectors, which in turn yield higher accuracies. In this paper we set out to study the empirical value of this modality-specific processing. We experimented with different ways of generating the meta characteristics on different data sets with different numbers of authors and genres. Our results show that extracting the meta characteristics from features split by their linguistic dimension yields a consistent improvement in prediction accuracy.

Upendra Sapkota, Thamar Solorio, Manuel Montes-y-Gómez, Paolo Rosso

ERNESTA: A Sentence Simplification Tool for Children’s Stories in Italian

We present ERNESTA (Enhanced Readability through a Novel Event-based Simplification Tool), the first sentence simplification system for Italian, specifically developed to improve the comprehension of factual events in stories for children with low reading skills. The system performs two basic actions: first, it analyzes a text by resolving anaphoras (including null pronouns), so as to make all implicit information explicit; then, it simplifies the story sentence by sentence at the syntactic level, producing simple statements in the present tense about the factual events described in the story. Our simplification strategy is driven by psycholinguistic principles and targets children aged 7 to 11 with text comprehension difficulties. The evaluation shows that our approach achieves promising results. Furthermore, ERNESTA could be exploited in different tasks, for instance in the generation of educational games and reading comprehension tests.

Gianni Barlacchi, Sara Tonelli

Automatic Text Simplification in Spanish: A Comparative Evaluation of Complementing Modules

In this paper we present two components of an automatic text simplification system for Spanish, aimed at making news articles more accessible to readers with cognitive disabilities. Our system in its current state consists of a rule-based lexical transformation component and a module for syntactic simplification. We evaluate the two components separately and as a whole, with a view to determining the level of simplification and the preservation of meaning and grammaticality. In order to test the readability level pre- and post-simplification, we apply seven readability measures for Spanish to four versions of a set of randomly chosen news articles: the original texts, the output of the lexical transformations, the output of the syntactic simplification, and the output of both system components combined. To test whether the simplification output is grammatically correct and semantically adequate, we ask human annotators to grade pairs of original and simplified sentences according to these two criteria. Our results suggest that both components of our system produce simpler output than the original, and that grammaticality and meaning preservation are positively rated by the annotators.

Biljana Drndarević, Sanja Štajner, Stefan Bott, Susana Bautista, Horacio Saggion

The Impact of Lexical Simplification by Verbal Paraphrases for People with and without Dyslexia

Text simplification is the process of transforming a text into an equivalent that is easier to read and to understand while preserving its meaning for a target population. One such population that could benefit from text simplification is people with dyslexia. One alternative for text simplification is the use of verbal paraphrases; among the most common verbal paraphrase pairs are those composed of a lexical verb (to hug) and a support verb plus a noun collocation (to give a hug). This paper explores how Spanish verbal paraphrases impact the readability and the comprehension of people with and without dyslexia. For the selection of pairs of verbal paraphrases we used a database, a linguistic resource composed of more than 3,600 verbal paraphrases. To measure the impact on reading performance and understandability, we performed an eye-tracking study including comprehension questionnaires, based on a group of 46 participants: 23 with confirmed dyslexia and a control group of 23. We did not find significant effects; thus, tools that perform this kind of paraphrasing automatically might not have a large effect on people with dyslexia, and other kinds of text simplification might be needed to benefit their readability and understandability.

Luz Rello, Ricardo Baeza-Yates, Horacio Saggion

Detecting Apposition for Text Simplification in Basque

In this paper we present a study on apposition in Basque and a tool to automatically identify and detect these structures. Detecting and encoding such structures is necessary for advanced NLP applications; in our case, we plan to use the apposition detector in our automatic text simplification system. The detector applies a grammar created using the Constraint Grammar formalism, based, among other things, on morphological features and linguistic information obtained from a named entity recogniser. We present the evaluation of this grammar and, based on an error analysis, propose a method to improve the results. We also combine our results with those obtained by a mention detection system to further improve performance.

Itziar Gonzalez-Dios, María Jesús Aranzabe, Arantza Díaz de Ilarraza, Ander Soraluze


Automation of Linguistic Creativitas for Adslogia

In this paper, we propose a computational approach to automate the generation of neologisms by adding Latin suffixes to English words or homophonic puns. This approach takes into account both semantic appropriateness and sound pleasantness of words. Our analysis of the generated neologisms provides interesting clues for understanding which technologies can successfully be exploited for the task, and the results of the evaluation show that the system that we developed can be a useful tool for supporting the generation of creative names.

Gözde Özbal, Carlo Strapparava

Allongos: Longitudinal Alignment for the Genetic Study of Writers’ Drafts

We present Allongos, a procedure capable of aligning multiple drafts for genetic text analysis purposes. To our knowledge, this is the first time a complete alignment is attempted on the longitudinal axis in addition to the textual axis, i.e., all drafts that lead to the production of a text are consistently aligned together, taking word shifts into account. We propose a practical interface where differences between successive drafts are highlighted, giving the user control over the drafts to be displayed and automatically adapting the display to the current selection. Our experiments show that our approach is both fast and accurate.

Adrien Lardilleux, Serge Fleury, Georgeta Cislaru

A Combined Method Based on Stochastic and Linguistic Paradigm for the Understanding of Arabic Spontaneous Utterances

ASTI is an Arabic spoken language understanding (SLU) system that carries out two relatively opposed kinds of analysis. It is designed for the tourism domain, to tell travellers about things that interest them. Following a dual approach, the system combines a stochastic approach with a probabilistic context-free grammar (PCFG), a rule-based approach. This paper provides a detailed description of the ASTI system as well as results compared with several international ones. The observed error rates suggest that our combined approach stands comparison with concept spotters on larger application domains.

Chahira Lhioui, Anis Zouaghi, Mounir Zrigui

Evidence in Automatic Error Correction Improves Learners’ English Skill

Mastering proper article usage in English is known to pose an extreme challenge to non-native speakers whose first languages have no concept of articles. Although developing correction methods for article usage has long challenged researchers, current methods do not correct articles perfectly. In addition, these methods do not teach proper article usage, so they are of limited use to language learners. In this paper, we discuss the necessity of presenting evidence for corrections of English article usage, and we demonstrate the effectiveness of this approach for improving the writing skills of English learners.

Jiro Umezawa, Junta Mizuno, Naoaki Okazaki, Kentaro Inui

