
About this book

The two volumes LNCS 9041 and 9042 constitute the proceedings of the 16th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2015, held in Cairo, Egypt, in April 2015.

The total of 95 full papers presented was carefully reviewed and selected from 329 submissions. They were organized in topical sections on grammar formalisms and lexical resources; morphology and chunking; syntax and parsing; anaphora resolution and word sense disambiguation; semantics and dialogue; machine translation and multilingualism; sentiment analysis and emotion detection; opinion mining and social network analysis; natural language generation and text summarization; information retrieval, question answering, and information extraction; text classification; speech processing; and applications.

Table of Contents

Frontmatter

Erratum: Aspect-Based Sentiment Analysis Using Tree Kernel Based Relation Extraction

In the originally published version, the 12th reference was wrong. It should read as follows:

12. Nguyen, T.H., Shirai, K.: Text classification of technical papers based on text segmentation. In: Métais, E., Meziane, F., Saraee, M., Sugumaran, V., Vadera, S. (eds.) NLDB 2013. LNCS, vol. 7934, pp. 278-284. Springer, Heidelberg (2013)

Thien Hai Nguyen, Kiyoaki Shirai

Sentiment Analysis and Emotion Detection

Frontmatter

The CLSA Model: A Novel Framework for Concept-Level Sentiment Analysis

Hitherto, sentiment analysis has been mainly based on algorithms relying on the textual representation of online reviews and microblogging posts. Such algorithms are very good at retrieving texts, splitting them into parts, checking the spelling, and counting their words. But when it comes to interpreting sentences and extracting opinionated information, their capabilities are known to be very limited. Current approaches to sentiment analysis are mainly based on supervised techniques relying on manually labeled samples, such as movie or product reviews, where the overall positive or negative attitude was explicitly indicated. However, opinions do not occur only at the document level, nor are they limited to a single valence or target. Contrary or complementary attitudes toward the same topic or multiple topics can be present across the span of a review. In order to overcome this and many other issues related to sentiment analysis, we propose a novel framework, termed the concept-level sentiment analysis (CLSA) model, which takes into account all the natural-language-processing tasks necessary for extracting opinionated information from text, namely: microtext analysis, semantic parsing, subjectivity detection, anaphora resolution, sarcasm detection, topic spotting, aspect extraction, and polarity detection.

Erik Cambria, Soujanya Poria, Federica Bisio, Rajiv Bajpai, Iti Chaturvedi

Building Large Arabic Multi-domain Resources for Sentiment Analysis

While there has been recent progress in the area of Arabic Sentiment Analysis, most of the resources in this area are either of limited size, domain specific or not publicly available. In this paper, we address this problem by generating large multi-domain datasets for Sentiment Analysis in Arabic. The datasets were scraped from different reviewing websites and consist of a total of 33K annotated reviews for movies, hotels, restaurants and products. Moreover, we build multi-domain lexicons from the generated datasets. Different experiments have been carried out to validate the usefulness of the datasets and the generated lexicons for the task of sentiment classification. From the experimental results, we highlight some useful insights addressing the best performing classifiers and feature representation methods, the effect of introducing lexicon-based features, and factors affecting the accuracy of sentiment classification in general. All the datasets, experiment code and results have been made publicly available for scientific purposes.

Hady ElSahar, Samhaa R. El-Beltagy

Learning Ranked Sentiment Lexicons

In contrast to classic retrieval, where users search for factual information, opinion retrieval deals with the search for subjective information. A major challenge in opinion retrieval is the informal style of writing and the use of domain-specific jargon to describe the opinion targets. In this paper, we present an automatic method to learn a space model for opinion retrieval. Our approach is a generative model that learns sentiment word distributions by embedding multi-level relevance judgments in the estimation of the model parameters. The model is learned using online Variational Inference, a recently published method that can learn from streaming data and can scale to very large datasets. Opinion retrieval and classification experiments on two large datasets with 703,000 movie reviews and 189,000 hotel reviews showed that the proposed method outperforms the baselines while using a significantly lower dimensional lexicon than other methods.

Filipa Peleja, João Magalhães

Modelling Public Sentiment in Twitter: Using Linguistic Patterns to Enhance Supervised Learning

This paper describes a Twitter sentiment analysis system that classifies a tweet as positive or negative based on its overall tweet-level polarity. Supervised learning classifiers often misclassify tweets containing conjunctions such as “but” and conditionals such as “if”, due to their special linguistic characteristics. These classifiers also assign a decision score very close to the decision boundary for a large number of tweets, which suggests that they are simply unsure instead of being completely wrong about these tweets. To counter these two challenges, this paper proposes a system that enhances supervised learning for polarity classification by leveraging linguistic rules and sentic computing resources. The proposed method is evaluated on two publicly available Twitter corpora to illustrate its effectiveness.

Prerna Chikersal, Soujanya Poria, Erik Cambria, Alexander Gelbukh, Chng Eng Siong

Trending Sentiment-Topic Detection on Twitter

Twitter plays a significant role in information diffusion and has evolved into an important information resource as well as a news feed. People wonder and care about what is happening on Twitter and what news it brings at every moment. However, with such a huge amount of data, it is impossible to tell manually and in time which topics are trending, which makes real-time topic detection attractive and significant. Furthermore, Twitter provides a platform for sharing opinions and expressing sentiment about events, news, products, etc. Users tend to say what they really think on Twitter, which makes it a valuable source of opinions. Nevertheless, most work on trending topic detection fails to take sentiment into consideration. This work is based on a non-parametric supervised real-time trending topic detection model with sentiment features. Experiments show that our model successfully detects trending sentiment topics in the shortest time. After combining multiple features, e.g. tweet volume and user volume, it demonstrates impressive effectiveness with 82.3% recall and surpasses all the competitors.

Baolin Peng, Jing Li, Junwen Chen, Xu Han, Ruifeng Xu, Kam-Fai Wong

EmoTwitter – A Fine-Grained Visualization System for Identifying Enduring Sentiments in Tweets

Traditionally, work on sentiment analysis focuses on detecting the positive and negative attributes of sentiments. To broaden the scope, we introduce the concept of enduring sentiments based on psychological descriptions of sentiments as enduring emotional dispositions that have formed over time. To help us identify the enduring sentiments, we present a fine-grained functional visualization system, EmoTwitter, that takes tweets written over a period of time as input for analysis. Adopting a lexicon-based approach, the system identifies Plutchik’s eight emotion categories and shows them over the time period in which the tweets were written. The enduring sentiment patterns of like and dislike are then calculated over the time period using the flow of the emotion categories. The potential impact and usefulness of our system are highlighted during a user-based evaluation. Moreover, the new concept and technique introduced in this paper for extracting enduring sentiments from text show great potential, for instance, in business decision making.

Myriam Munezero, Calkin Suero Montero, Maxim Mozgovoy, Erkki Sutinen

Feature Selection for Twitter Sentiment Analysis: An Experimental Study

Feature selection is an important problem for any pattern classification task. In this paper, we developed an ensemble of two Maximum Entropy classifiers for Twitter sentiment analysis: one for subjectivity and the other for polarity classification. Our ensemble employs surface-form, semantic and sentiment features. The classification complexity of this ensemble of linear models is linear with respect to the number of features. Our goal is to select a compact feature subset from the exhaustive list of extracted features in order to reduce the computational complexity without sacrificing the classification accuracy. We evaluate the performance on two benchmark datasets, CrowdScale and SemEval. Our selected 20K features have shown very similar results in subjectivity classification to the NRC state-of-the-art system with 4 million features that ranked first in the 2013 SemEval competition. Also, our selected features have shown a relative performance gain in the ensemble classification over the baseline of unigram and bigram features of 9.9% on CrowdScale and 11.9% on SemEval.

Riham Mansour, Mohamed Farouk Abdel Hady, Eman Hosam, Hani Amr, Ahmed Ashour

An Iterative Emotion Classification Approach for Microblogs

The typical emotion classification approach adopts one-step single-label classification using intra-sentence features such as unigrams, bigrams and emotion words. However, a single-label classifier with intra-sentence features cannot ensure good performance for short microblog text, which has flexible expressions. To address this problem, this paper proposes an iterative multi-label emotion classification approach for microblogs by incorporating intra-sentence features, as well as sentence and document contextual information. Based on the prediction of the base classifier with intra-sentence features, the iterative approach updates the prediction by further incorporating both sentence and document contextual information until the classification results converge. Experimental results obtained by three different multi-label classifiers on the NLP&CC 2013 Chinese microblog emotion classification bakeoff dataset demonstrate the effectiveness of our iterative emotion classification approach.

Ruifeng Xu, Zhaoyu Wang, Jun Xu, Junwen Chen, Qin Lu, Kam-Fai Wong

Aspect-Based Sentiment Analysis Using Tree Kernel Based Relation Extraction

We present an application of kernel methods for extracting the relation between an aspect of an entity and an opinion word from text. Two tree kernels based on the constituent tree and dependency tree were applied for aspect-opinion relation extraction. In addition, we developed a new kernel by combining these two tree kernels. We also proposed a new model for sentiment analysis on aspects. Our model can identify the polarity of a given aspect based on the aspect-opinion relation extraction. It outperformed the model without relation extraction by 5.8% in accuracy and 4.6% in F-measure.

Thien Hai Nguyen, Kiyoaki Shirai

Text Integrity Assessment: Sentiment Profile vs Rhetoric Structure

We formulate the problem of text integrity assessment as learning the discourse structure of text given a dataset of texts with high integrity and low integrity. We use two approaches to formalizing the discourse structures, sentiment profile and rhetoric structures, relying on a sentence-level sentiment classifier and rhetoric structure parsers respectively. To learn discourse structures, we use the graph-based nearest neighbor approach, which allows for explicit feature engineering, and also SVM tree kernel-based learning. Both learning approaches operate on graphs (parse thickets), which are sets of parse trees whose nodes carry either additional labels for sentiments or additional arcs for rhetoric relations between different sentences. Evaluation in the domain of valid vs invalid customer complaints (those with argumentation flow, non-cohesive, indicating a bad mood of a complainant) shows the stronger contribution of rhetoric structure information in comparison with the sentiment profile information. Both learning approaches demonstrated that discourse structure as obtained by an RST parser is sufficient to conduct the text integrity assessment. At the same time, the sentiment profile-based approach shows much weaker results and also does not strongly complement the rhetoric structure ones.

Boris Galitsky, Dmitry Ilvovsky, Sergey O. Kuznetsov

Sentiment Classification with Graph Sparsity Regularization

Text representation is a preprocessing step in building a classifier for sentiment analysis. But in the vector space model (VSM) or bag-of-features (BOF) model, features are treated as independent of each other when learning a classifier model. In this paper, we first explore the text graph structure, which can represent the structural features in natural language text. In contrast to the BOF model, by directly embedding the features into a graph, we propose a graph sparsity regularization method which can make use of the graph-embedded features. Our proposed method can encourage a sparse model with a small number of features connected by a set of paths. The experiments on sentiment classification demonstrate that our proposed method achieves better results compared with other methods. Qualitative discussion also shows that our proposed method with graph-based representation is interpretable and effective in the sentiment classification task.

Xin-Yu Dai, Chuan Cheng, Shujian Huang, Jiajun Chen

Detecting Emotion Stimuli in Emotion-Bearing Sentences

Emotion, a pervasive aspect of human experience, has long been of interest to social and behavioural sciences. It is now the subject of multi-disciplinary research also in computational linguistics. Emotion recognition, studied in the area of sentiment analysis, has focused on detecting the expressed emotion. A related challenging question, why the experiencer feels that emotion, has, to date, received very little attention. The task is difficult and there are no annotated English resources. FrameNet refers to the person, event or state of affairs which evokes the emotional response in the experiencer as emotion stimulus. We automatically build a dataset annotated with both the emotion and the stimulus using FrameNet’s emotions-directed frame. We address the problem as information extraction: we build a CRF learner, a sequential learning model, to detect the emotion stimulus spans in emotion-bearing sentences. We show that our model significantly outperforms all the baselines.

Diman Ghazi, Diana Inkpen, Stan Szpakowicz

Sentiment-Bearing New Words Mining: Exploiting Emoticons and Latent Polarities

New words and new senses are produced quickly and are used widely in micro blogs, so to automatically extract new words and predict their semantic orientations is vital to sentiment analysis in micro blogs. This paper proposes Extractor and PolarityAssigner to tackle this task in an unsupervised manner. Extractor is a pattern-based method which extracts sentiment-bearing words from large-scale raw micro blog corpus, where the main task is to eliminate the huge ambiguities in the un-segmented raw texts. PolarityAssigner predicts the semantic orientations of words by exploiting emoticons and latent polarities, using a LDA model which treats each sentiment-bearing word as a document and each co-occurring emoticon as a word in that document. The experimental results are promising: many new sentiment-bearing words are extracted and are given proper semantic orientations with a relatively high precision, and the automatically extracted sentiment lexicon improves the performance of sentiment analysis on an open opinion mining task in micro blog corpus.

Fei Wang, Yunfang Wu
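The pseudo-document idea in the abstract above (each sentiment-bearing word is a document, each co-occurring emoticon is a word in it) can be illustrated with a minimal sketch of how such input for a topic model might be assembled. Everything here is hypothetical: the sample posts, the emoticon set, the candidate words and the function name are invented for illustration, whitespace tokenization is assumed (the paper works on un-segmented text), and the LDA step itself is not shown.

```python
from collections import Counter, defaultdict

# Hypothetical emoticon inventory; a real system would use a far larger set.
EMOTICONS = {":)", ":(", ":D"}

def build_pseudo_documents(posts, candidate_words):
    """Map each candidate word to a bag of the emoticons it co-occurs with."""
    docs = defaultdict(Counter)
    for post in posts:
        tokens = post.split()  # assumption: whitespace tokenization
        emoticons = [t for t in tokens if t in EMOTICONS]
        # Every candidate word in the post collects the post's emoticons.
        for word in set(tokens) & candidate_words:
            docs[word].update(emoticons)
    return docs

posts = ["awesome movie :) :D", "lame ending :(", "awesome cast :)"]
docs = build_pseudo_documents(posts, {"awesome", "lame"})
print(dict(docs["awesome"]))
```

Each resulting counter could then be passed to a topic model as one document, with the inferred topics read as latent polarities.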

Identifying Temporal Information and Tracking Sentiment in Cancer Patients’ Interviews

Time is an essential component for the analysis of medical data, and the sentiment beneath the temporal information is intrinsically connected with the medical reasoning tasks. The present paper introduces the problem of identifying temporal information as well as tracking of the sentiments/emotions according to the temporal situations from the interviews of cancer patients. A supervised method has been used to identify the medical events using a list of temporal words along with various syntactic and semantic features. We also analyzed the sentiments of the patients with respect to the time-bins with the help of dependency based sentiment analysis techniques and several Sentiment lexicons. We have achieved the maximum accuracy of 75.38% and 65.06% in identifying the temporal and sentiment information, respectively.

Braja Gopal Patra, Nilabjya Ghosh, Dipankar Das, Sivaji Bandyopadhyay

Using Stylometric Features for Sentiment Classification

This paper is a comparative study of text feature extraction methods in statistical learning of sentiment classification. Feature extraction is one of the most important steps in classification systems. We compare stylometry with the TF-IDF and Delta TF-IDF baseline methods in sentiment classification. Stylometry is a research area of Linguistics that uses statistical techniques to analyze literary style. In order to assess the viability of stylometry, we create a corpus of product reviews from the most traditional online service in Portuguese, namely, Buscapé. We gathered 2000 reviews about smartphones. We use three classifiers, Support Vector Machine (SVM), Naive Bayes, and J48, to evaluate whether stylometry has higher accuracy than the TF-IDF and Delta TF-IDF methods in sentiment classification. The best results were obtained with the SVM classifier: 82.75% accuracy with stylometry, 72.62% with Delta TF-IDF, and 56.25% with TF-IDF. The results show that stylometry is a quite feasible method for sentiment classification, outperforming the accuracy of the baseline methods. We emphasize that the approach used has promising results.

Rafael T. Anchiêta, Francisco Assis Ricarte Neto, Rogério Figueiredo de Sousa, Raimundo Santos Moura

Opinion Mining and Social Network Analysis

Frontmatter

Automated Linguistic Personalization of Targeted Marketing Messages Mining User-Generated Text on Social Media

Personalizing marketing messages for specific audience segments is vital for increasing user engagement with advertisements, but it becomes very resource-intensive when the marketer has to deal with multiple segments, products or campaigns. In this research, we take the first steps towards automating message personalization by algorithmically inserting adjectives and adverbs that have been found to evoke positive sentiment in specific audience segments, into basic versions of ad messages. First, we build language models representative of linguistic styles from user-generated textual content on social media for each segment. Next, we mine product-specific adjectives and adverbs from content associated with positive sentiment. Finally, we insert extracted words into the basic version using the language models to enrich the message for each target segment, after statistically checking in-context readability. Decreased cross-entropy values from the basic to the transformed messages show that we are able to approach the linguistic style of the target segments. Crowdsourced experiments verify that our personalized messages are almost indistinguishable from similar human compositions. Social network data processed for this research has been made publicly available for community use.

Rishiraj Saha Roy, Aishwarya Padmakumar, Guna Prasaad Jeganathan, Ponnurangam Kumaraguru
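The cross-entropy comparison mentioned in the abstract above can be illustrated with a small sketch. It assumes an add-one-smoothed unigram model, which is a simplification chosen here for brevity and not necessarily the language model used by the authors; the toy segment corpus, the two messages and the function names are hypothetical.

```python
import math
from collections import Counter

def train_unigram(corpus_tokens, vocab):
    """Add-one-smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def cross_entropy(tokens, model):
    """Average negative log2 probability per token (bits per token)."""
    return -sum(math.log2(model[t]) for t in tokens) / len(tokens)

# Hypothetical text written by one audience segment.
segment_corpus = "great amazing deal great phone".split()
basic = "buy this phone".split()
personalized = "buy this great phone".split()

vocab = set(segment_corpus) | set(basic) | set(personalized)
model = train_unigram(segment_corpus, vocab)

print(cross_entropy(basic, model), cross_entropy(personalized, model))
```

In this toy setup the message with the inserted segment-typical adjective scores a lower cross-entropy under the segment model, which is the direction of change the abstract reports.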

Inferring Aspect-Specific Opinion Structure in Product Reviews Using Co-training

Opinions expressed about a particular subject are often nuanced: a person may have both negative and positive opinions about different aspects of the subject of interest, and these aspect-specific opinions can be independent of the overall opinion. Being able to identify, collect, and count these nuanced opinions in a large set of data offers more insight into the strengths and weaknesses of competing products and services than does aggregating overall ratings. We contribute a new confidence-based co-training algorithm that can identify product aspects and sentiments expressed about such aspects. Our algorithm offers better precision than existing methods, and handles previously unseen language well. We show competitive results on a set of opinionated sentences about laptops and restaurants from a SemEval-2014 Task 4 challenge.

Dave Carter, Diana Inkpen

Summarizing Customer Reviews through Aspects and Contexts

This study leverages the syntactic, semantic and contextual features of online hotel and restaurant reviews to extract information aspects and summarize them into meaningful feature groups. We have designed a set of syntactic rules to extract aspects and their descriptors. Further, we test the precision of a modified algorithm for clustering aspects into closely related feature groups, on a dataset provided by Yelp.com. Our method uses a combination of semantic similarity methods- distributional similarity, co-occurrence and knowledge base based similarity, and performs better than two state-of-the-art approaches. It is shown that opinion words and the context provided by them can prove to be good features for measuring the semantic similarity and relationship of their product features. Our approach successfully generates thematic aspect groups about food quality, décor and service quality.

Prakhar Gupta, Sandeep Kumar, Kokil Jaidka

An Approach for Intention Mining of Complex Comparative Opinion Why Type Questions Asked on Product Review Sites

Opinion why-questions require answers to include reasons, elaborations, and explanations for the users’ sentiments expressed in the questions. Sentiment analysis has recently been used for answering why-type opinion questions. Existing research addresses simple why-type questions that describe a single product. In real life, there could be complex why-type questions describing multiple products (as observed in comparative sentences) across multiple sentences. Consider, for example, the question, “I need mobile with good camera and nice sound quality. Why should I go for buying Nokia over Samsung?” Here Nokia is the main focus for the questioner, who shows a positive intention to buy that mobile. This creates a natural requirement for systems to identify the product which is the centre of attention for the questioner and the intention of the questioner towards it. We address such complex questions and propose an approach to perform intention mining of the questioner by determining the sentiment polarity of the questioner towards the main focused product. We conduct experiments which obtain better results compared to existing baseline systems.

Amit Mishra, Sanjay Kumar Jain

TRUPI: Twitter Recommendation Based on Users’ Personal Interests

Twitter has emerged as one of the most powerful micro-blogging services for real-time sharing of information on the web. The large volume of posts on many topics is overwhelming to Twitter users who might be interested in only a few topics. To this end, we propose TRUPI, a personalized recommendation system for the timelines of Twitter users where tweets are ranked by the user’s personal interests. The proposed system combines the user's social features and interactions as well as the history of her tweets' content to determine her interests. The system captures the users’ interests dynamically by modeling them as a time variant in different topics to accommodate the change of these interests over time. More specifically, we combine a set of machine learning and natural language processing techniques to analyze the topics of the various tweets posted on the user’s timeline and rank them based on her dynamically detected interests. Our extensive performance evaluation on a publicly available dataset demonstrates the effectiveness of TRUPI and shows that it outperforms the competitive state of the art by 25% on nDCG@25, and 14% on MAP.

Hicham G. Elmongui, Riham Mansour, Hader Morsy, Shaymaa Khater, Ahmed El-Sharkasy, Rania Ibrahim
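For readers unfamiliar with the nDCG@k metric cited in the abstract above, the following is a minimal sketch of one common formulation (linear gain, log2 discount). It is not taken from the paper: the function names and the sample relevance grades are hypothetical, and the ideal ranking is computed from the same list, which assumes every relevant item appears in it.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items of a ranking."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance of tweets in the order a system ranked them.
ranking = [1, 3, 0, 2]
print(ndcg_at_k(ranking, k=4))
```

A perfectly ordered ranking yields 1.0, and placing highly relevant items lower in the list reduces the score.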

Detection of Opinion Spam with Character n-grams

In this paper we consider the detection of opinion spam as a stylistic classification task because, given a particular domain, the deceptive and truthful opinions are similar in content but differ in the way opinions are written (style). In particular, we propose using character n-grams as features since they have been shown to capture lexical content as well as stylistic information. We evaluated our approach on a standard corpus composed of 1600 hotel reviews, considering positive and negative reviews. We compared the results obtained with character n-grams against those obtained with word n-grams. Moreover, we evaluated the effectiveness of character n-grams when decreasing the training set size in order to simulate real training conditions. The results obtained show that character n-grams are good features for the detection of opinion spam; they seem to capture the content of deceptive opinions and the writing style of the deceiver better than word n-grams. In particular, results show an improvement of 2.3% and 2.1% over the word-based representations in the detection of positive and negative deceptive opinions respectively. Furthermore, character n-grams make it possible to obtain good performance even with a very small training corpus. Using only 25% of the training set, a Naïve Bayes classifier showed F1 values up to 0.80 for both opinion polarities.

Donato Hernández Fusilier, Manuel Montes-y-Gómez, Paolo Rosso, Rafael Guzmán Cabrera
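As an illustration of the character n-gram features discussed in the abstract above, here is a minimal extraction sketch. The function name, the choice of n = 3 and the sample sentence are assumptions made for this example; the paper's preprocessing and classifier setup are not reproduced.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count every overlapping character n-gram, spaces included."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

features = char_ngrams("the room", n=3)
print(sorted(features))
```

Because the windows slide over spaces and punctuation as well as letters, the resulting counts reflect both word fragments and habits of spacing or punctuation, which is why such features can carry stylistic as well as lexical information.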

Content-Based Recommender System Enriched with Wordnet Synsets

Content-based recommender systems can overcome many problems related to collaborative filtering systems, such as the new-item issue. However, to make accurate recommendations, content-based recommenders require an adequate amount of content, and external knowledge sources are used to augment the content. In this paper, we use Wordnet synsets to enrich a content-based joke recommender system. Experiments have shown that content-based recommenders using K-nearest neighbors perform better than collaborative filtering, particularly when synsets are used.

Haifa Alharthi, Diana Inkpen

Active Learning Based Weak Supervision for Textual Survey Response Classification

Analysing textual responses to open-ended survey questions has been one of the challenging applications for NLP. Such unstructured text data is a rich data source of subjective opinions about a specific topic or entity; but it is not amenable to quick and comprehensive analysis. Survey coding is the process of categorizing such text responses using a pre-specified hierarchy of classes (often called a code-frame). In this paper, we identify the factors constraining the automation approaches to this problem and observe that a completely supervised learning approach is not feasible in practice. We then present details of our approach which uses multi-label text classification as a first step without requiring labeled training data. This is followed by the second step of active learning based verification of survey response categorization done in first step. This weak supervision using active learning helps us to optimize the human involvement as well as to adapt the process for different domains. Efficacy of our method is established using the high agreement with real-life, manually annotated benchmark data.

Sangameshwar Patil, B. Ravindran

Detecting and Disambiguating Locations Mentioned in Twitter Messages

Detecting the location entities mentioned in Twitter messages is useful in text mining for business, marketing or defence applications. Therefore, techniques for extracting the location entities from the Twitter textual content are needed. In this work, we approach this task in a similar manner to the Named Entity Recognition (NER) task focused only on locations, but we address a deeper task: classifying the detected locations into names of cities, provinces/states, and countries. We approach the task in a novel way, consisting of two stages. In the first stage, we train Conditional Random Fields (CRF) models with various sets of features; we collected and annotated our own dataset for training and testing. In the second stage, we resolve cases where more than one place has the same name. We propose a set of heuristics for choosing the correct physical location in these cases. We report good evaluation results for both tasks.

Diana Inkpen, Ji Liu, Atefeh Farzindar, Farzaneh Kazemi, Diman Ghazi

Natural Language Generation and Text Summarization

Frontmatter

Satisfying Poetry Properties Using Constraint Handling Rules

Poetry is one of the most interesting and complex natural language generation (NLG) systems because a text needs to simultaneously satisfy three properties to be considered a poem; namely grammaticality (grammatical structure and syntax), poeticness (poetic structure) and meaningfulness (semantic content). In this paper we show how the declarative approach enabled by the high-level constraint programming language Constraint Handling Rules (CHR) can be applied to satisfy the three properties while generating poems. The developed automatic poetry generation system generates a poem by incrementally selecting its words through a step-wise pruning of a customised lexicon according to the grammaticality, poeticness and meaningfulness constraints.

Alia El Bolock, Slim Abdennadher

A Multi-strategy Approach for Lexicalizing Linked Open Data

This paper aims at exploiting Linked Data for generating natural text, often referred to as lexicalization. We propose a framework that can generate patterns which can be used to lexicalize Linked Data triples. Linked Data is structured knowledge organized in the form of triples consisting of a subject, a predicate and an object. We use DBpedia as the Linked Data source, which is not only free but is currently the fastest growing data source organized as Linked Data. The proposed framework utilizes Open Information Extraction (OpenIE) to extract relations from natural text, and these relations are then aligned with triples to identify lexicalization patterns. We also exploit lexical semantic resources which encode knowledge on lexical, semantic and syntactic information about entities. Our framework uses VerbNet and WordNet as semantic resources. The extracted patterns are ranked and categorized based on the DBpedia ontology class hierarchy. The pattern collection is then sorted based on the score assigned and stored in an index embedded database for use in the framework as well as a future lexical resource. The framework was evaluated for syntactic accuracy and validity by measuring the Mean Reciprocal Rank (MRR) of the first correct pattern. The results indicated that the framework can achieve 70.36% accuracy and an MRR value of 0.72 for five DBpedia ontology classes, generating 101 accurate lexicalization patterns.

Rivindu Perera, Parma Nand
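The Mean Reciprocal Rank measure named in the abstract above can be sketched as follows. This shows the standard definition only; the function name and the sample correctness judgments are hypothetical and do not come from the paper's evaluation.

```python
def mean_reciprocal_rank(judgments):
    """judgments: one list of booleans per query, in ranked order.

    Each query contributes 1/rank of its first correct item,
    or 0 if no item in its list is correct.
    """
    total = 0.0
    for ranked in judgments:
        for rank, correct in enumerate(ranked, start=1):
            if correct:
                total += 1.0 / rank
                break
    return total / len(judgments)

# Hypothetical judgments for the ranked patterns of three triples.
sample = [[True, False], [False, True], [False, False]]
print(mean_reciprocal_rank(sample))
```

Here the first correct pattern sits at ranks 1 and 2 for the first two triples and is missing for the third, so the reciprocal ranks 1, 0.5 and 0 average to 0.5.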

A Dialogue System for Telugu, a Resource-Poor Language

A dialogue system is a computer system which is designed to converse with human beings in natural language (NL). A lot of work has been done to develop dialogue systems in regional languages. This paper presents an approach to building a dialogue system for resource-poor languages. The approach comprises two parts, namely Data Management and Query Processing. Data Management deals with storing the data in a particular format which helps in easy and quick retrieval of requested information. Query Processing deals with producing a relevant system response for a user query. Our model can handle code-mixed queries, which are very common in Indian languages, and also handles context, which is a major challenge in dialogue systems. It also handles spelling mistakes and a few grammatical errors. The model is domain and language independent. As there is no automated evaluation tool available for dialogue systems, we opted for human evaluation of our system, which was developed for the Telugu language over the ‘Tourist places of Hyderabad’ domain. Five people evaluated our system and the results are reported in the paper.

Mullapudi Ch. Sravanthi, Kuncham Prathyusha, Radhika Mamidi

Anti-Summaries: Enhancing Graph-Based Techniques for Summary Extraction with Sentiment Polarity

We propose an unsupervised model to extract two types of summaries (positive, and negative) per document based on sentiment polarity. Our model builds a weighted polar digraph from the text, then evolves recursively until some desired properties converge. It can be seen as an enhanced variant of TextRank type algorithms working with non-polar text graphs. Each positive, negative, and objective opinion has some impact on the other if they are semantically related or placed close in the document.

Our experiments cover several interesting scenarios. In the case of a single-author news article, we notice a significant overlap between the anti-summary (focusing on negatively polarized sentences) and the summary. For a transcript of a debate or a talk show, an anti-summary represents the disagreement of the participants on the stated topic(s), whereas the summary becomes the collection of positive feedback. In this case, the anti-summary tends to be disjoint from the regular summary. Overall, our experiments show that our model can be used with TextRank to enhance the quality of the extractive summarization process.

Fahmida Hamid, Paul Tarau

A Two-Level Keyphrase Extraction Approach

In this paper, we present a new two-level approach to extract KeyPhrases from textual documents. Our approach relies on a linguistic analysis to extract candidate KeyPhrases and a statistical analysis to rank and filter the final KeyPhrases. We evaluated our approach on three publicly available corpora with documents of varying lengths, domains and languages, including English and French. We obtained improvements in Precision, Recall and F-measure. Our results indicate that our approach is independent of the length, the domain and the language.

Chedi Bechikh Ali, Rui Wang, Hatem Haddad

Information Retrieval, Question Answering, and Information Extraction

Frontmatter

Conceptual Search for Arabic Web Content

The main reason for adopting Semantic Web technology in information retrieval is to improve retrieval performance. A semantic search-based system is characterized by locating web content that is semantically related to the query’s concepts rather than relying on exact matching with keywords in queries. There is a growing interest in Arabic web content worldwide due to its importance for culture, political aspects, strategic location, and economics. Arabic is linguistically rich across all levels, which makes the effective search of Arabic text a challenge. In the literature, research that addresses searching Arabic web content using Semantic Web technology is still insufficient compared to Arabic’s actual importance as a language. In this research, we propose an Arabic semantic search approach that is applied to Arabic web content. This approach is based on the Vector Space Model (VSM), which has proved its success, and much research has focused on improving its traditional version. Our approach uses the Universal WordNet to build a rich concept-space index instead of the traditional term-space index. This index is used to enable Semantic VSM capabilities. Moreover, we introduce a new incidence measurement to calculate the semantic significance degree of a concept in a document, which fits our model better than the traditional term frequency. Furthermore, for the purpose of determining the semantic similarity of two vectors, we introduce a new formula for calculating the semantic weight of a concept. Because documents are indexed by their topics and classified semantically, we were able to search Arabic documents effectively. The experimental results in terms of Precision, Recall and F-measure have shown a change in performance from 77%, 56%, and 63% to 71%, 96%, and 81%, respectively.

Aya M. Al-Zoghby, Khaled Shaalan

Experiments with Query Expansion for Entity Finding

Query expansion techniques have proved to have an impact on retrieval performance across many retrieval tasks. This paper reports research on query expansion in the entity finding domain. We used a number of methods for query formulation including thesaurus-based, relevance feedback, and exploiting NLP structure. We incorporated the query expansion component as part of our entity finding pipeline and report the results of the aforementioned models on the CERC collection.

Fawaz Alarfaj, Udo Kruschwitz, Chris Fox

Mixed Language Arabic-English Information Retrieval

For many non-English languages in developing countries (such as Arabic), text switching/mixing (e.g. between Arabic and English) is very prevalent, especially in scientific domains, because most technical terms are borrowed from English and/or they are neither included in the native (non-English) languages nor have a precise translation/transliteration in these native languages. This makes it difficult to search only in a non-English (native) language because either non-English-speaking users, such as Arabic speakers, are not able to express terminology in their native languages or the concepts need to be expanded using context. This results in mixed queries and documents in the non-English speaking world (the Arabic world in particular). Mixed-language querying is a challenging problem and has not attracted major attention in the IR community. Current search engines and traditional CLIR systems do not handle mixed-language querying adequately and do not exploit this natural human tendency. This paper attempts to address the problem of mixed querying in CLIR. It proposes a mixed-language (language-aware) IR solution, in the form of a cross-lingual re-weighting model, in which mixed queries are used to retrieve the most relevant documents, regardless of their languages. For the purpose of conducting the experiments, a new multilingual and mixed Arabic-English corpus in the computer science domain was created. Test results showed that the proposed cross-lingual re-weighting model could yield statistically significantly better results with respect to mixed-language queries, and it achieved more than 94% of monolingual baseline effectiveness.

Mohammed Mustafa, Hussein Suleman

Improving Cross Language Information Retrieval Using Corpus Based Query Suggestion Approach

Users seeking information may not find relevant information pertaining to their information need in a specific language. The information may be available in a language different from their own, but users may not know that language. Thus users may experience difficulty in accessing the information present in different languages. Since the retrieval process depends on the translation of the user query, there are many issues in getting the right translation of the user query. For a pair of languages chosen by a user, only resources such as an incomplete dictionary or an inaccurate machine translation system may exist. These resources may be insufficient to map the query terms in one language to their equivalent terms in another language. Also, for a given query, there might exist multiple correct translations. The underlying corpus evidence may suggest a clue to select a probable set of translations that could eventually yield better information retrieval. In this paper, we present a cross language information retrieval approach to effectively retrieve information present in a language other than the language of the user query using a corpus-driven query suggestion approach. The idea is to utilize the corpus-based evidence of one language to improve the retrieval and re-ranking of news documents in another language. We use FIRE corpora - Tamil and English news collections - in our experiments and illustrate the effectiveness of the proposed cross language information retrieval approach.

Rajendra Prasath, Sudeshna Sarkar, Philip O’Reilly

Search Personalization via Aggregation of Multidimensional Evidence About User Interests

A core aspect of search personalization is inferring the user’s search interests. Different approaches may consider different aspects of user information and may have different interpretations of the notion of interest. This may lead to learning disparate characteristics of a user. Although search engines collect a variety of information about their users, the following question remains unanswered: to what extent can personalized search systems harness these information sources to capture multiple views of the user’s interests, and adapt the search accordingly? To answer this question, this paper proposes a hybrid approach for search personalization. The advantage of this approach is that it can flexibly combine multiple sources of user information, and incorporate multiple aspects of user interests. Experimental results demonstrate the effectiveness of the proposed approach for search personalization.

Yu Xu, M. Rami Ghorab, Séamus Lawless

Question Analysis for a Closed Domain Question Answering System

This study describes and evaluates the techniques we developed for the question analysis module of a closed domain Question Answering (QA) system that is intended for high-school students to support their education. Question analysis, which involves analyzing the questions to extract the necessary information for determining what is being asked and how to approach answering it, is one of the most crucial components of a QA system. Therefore, we propose novel methods for two major problems in question analysis, namely focus extraction and question classification, based on integrating a rule-based and a Hidden Markov Model (HMM) based sequence classification approach, both of which make use of the dependency relations among the words in the question. Comparisons of these solutions with baseline models are also provided. This study also offers a manually collected and annotated gold standard data set for further research in this area.

Caner Derici, Kerem Çelik, Ekrem Kutbay, Yiğit Aydın, Tunga Güngör, Arzucan Özgür, Günizi Kartal

Information Extraction with Active Learning: A Case Study in Legal Text

Active learning has been successfully applied to a number of NLP tasks. In this paper, we present a study on Information Extraction for natural language licenses that need to be translated to RDF. The final purpose of our work is to automatically extract, from a natural language document specifying a certain license, a machine-readable description of the terms of use and reuse identified in that license. This task presents some peculiarities that make it especially interesting to study: highly repetitive text, few annotated or unannotated examples available, and very high precision needed.

In this paper we compare different active learning settings for this particular application. We show that the most straightforward approach to instance selection, uncertainty sampling, does not perform well in this setting, performing even worse than passive learning. Density-based methods are the usual alternative to uncertainty sampling in contexts with very few labelled instances. We show that we can obtain a similar effect to that of density-based methods using uncertainty sampling, by just reversing the ranking criterion and choosing the most certain instead of the most uncertain instances.

Cristian Cardellino, Serena Villata, Laura Alonso Alemany, Elena Cabrio
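
The reversed ranking criterion described in the abstract above can be sketched as follows. This is a minimal illustration that assumes confidence is measured as the highest class probability of each unlabelled instance; the function name, the confidence measure and the toy pool are assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def rank_for_annotation(
    probabilities: Sequence[Sequence[float]], most_certain: bool = False
) -> List[int]:
    """Order unlabelled instances for annotation by classifier confidence.

    Confidence is taken to be the highest class probability (an assumption).
    most_certain=False gives classic uncertainty sampling (least confident first);
    most_certain=True reverses the criterion (most confident first).
    """
    confidence = [max(p) for p in probabilities]
    return sorted(
        range(len(probabilities)),
        key=lambda i: confidence[i],
        reverse=most_certain,
    )


# Hypothetical class probabilities for three unlabelled instances.
pool = [[0.90, 0.10], [0.55, 0.45], [0.70, 0.30]]
uncertain_first = rank_for_annotation(pool)                   # [1, 2, 0]
certain_first = rank_for_annotation(pool, most_certain=True)  # [0, 2, 1]
print(uncertain_first, certain_first)
```

The only difference between the two settings is the sort direction, which is why the reversal is cheap to try on top of an existing uncertainty sampling loop.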

Text Classification

Frontmatter

Term Network Approach for Transductive Classification

Transductive classification is a useful way to classify texts when just a few labeled examples are available. Transductive classification algorithms rely on term frequency to directly classify texts represented in the vector space model or to build networks and perform label propagation. Related terms tend to belong to the same class, and this information can be used to assign relevance scores of terms for classes and consequently the labels of documents. In this paper we propose the use of term networks to model term relations and perform transductive classification. In order to do so, we propose (i) different ways to generate term networks, (ii) how to assign initial relevance scores for terms, (iii) how to propagate the relevance scores among terms, and (iv) how to use the relevance scores of terms in order to classify documents. We demonstrate that transductive classification based on term networks can surpass the accuracies obtained by transductive classification considering texts represented in other types of networks or the vector space model, or even the accuracies obtained by inductive classification. We also demonstrate that we can decrease the size of term networks through feature selection while keeping classification accuracy and decreasing computational complexity.

Rafael Geraldeli Rossi, Solange Oliveira Rezende, Alneu de Andrade Lopes

Calculation of Textual Similarity Using Semantic Relatedness Functions

Semantic similarity between two sentences is concerned with measuring how much two sentences share the same or related meaning. Two methods in the literature for measuring sentence similarity are cosine similarity and overall similarity. In this work we investigate whether it is possible to improve the performance of these methods by integrating different word-level semantic relatedness methods. Four different word relatedness methods are compared using four different data sets compiled from different domains, providing a testbed covering a varied range of writing expressions to challenge the selected methods. Results show that a corpus-based word semantic similarity function significantly outperforms a WordNet-based word semantic similarity function in sentence similarity methods. Moreover, we propose a new sentence similarity measure by modifying an existing method, called overall similarity, which incorporates word order and lexical similarity. Furthermore, the results show that the proposed method significantly improves the performance of the overall method. All the selected methods are tested and compared with other state-of-the-art methods.

Ammar Riadh Kairaldeen, Gonenc Ercan

Confidence Measure for Czech Document Classification

This paper deals with automatic document classification in the context of a real application for the Czech News Agency (ČTK). The accuracy of our classifier is high; however, it is still important to improve the classification results. The main goal of this paper is thus to propose novel confidence measure approaches in order to detect and remove incorrectly classified samples. Two proposed methods are based on the posterior class probability and the third one is a supervised approach which uses another classifier to determine whether the result is correct. The methods are evaluated on a Czech newspaper corpus. We experimentally show that it is beneficial to integrate the novel approaches into the document classification task because they significantly improve the classification accuracy.

Pavel Král, Ladislav Lenc
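
One simple way to use a posterior class probability as a confidence measure is to reject predictions whose top posterior falls below a threshold. The sketch below illustrates only that general idea; the abstract does not specify the two posterior-based methods, so the thresholding rule, the function name, the threshold value and the toy posteriors are all assumptions.

```python
from typing import Dict, List, Tuple

Prediction = Tuple[str, Dict[str, float]]  # (document id, posterior per class)


def filter_by_posterior(
    predictions: List[Prediction], threshold: float = 0.8
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Split predictions into accepted and rejected by top posterior probability."""
    accepted, rejected = [], []
    for doc_id, posteriors in predictions:
        label = max(posteriors, key=posteriors.get)  # most probable class
        target = accepted if posteriors[label] >= threshold else rejected
        target.append((doc_id, label))
    return accepted, rejected


# Hypothetical classifier output for two documents.
predictions = [
    ("d1", {"sport": 0.95, "politics": 0.05}),
    ("d2", {"sport": 0.55, "politics": 0.45}),
]
accepted, rejected = filter_by_posterior(predictions)
print(accepted, rejected)
```

Rejected documents could then be routed to manual review instead of being published with a possibly wrong category.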

An Approach to Tweets Categorization by Using Machine Learning Classifiers in Oil Business

The rapid growth in social media data has motivated the development of a real-time framework to understand and extract the meaning of the data. Text categorization is a well-known method for understanding text. Text categorization can be applied in many forms, such as authorship detection and text mining, by extracting useful information from documents to sort a set of documents automatically into predefined categories. Here, we propose a method for classifying those who posted the tweets into categories. The task is performed by extracting key features from tweets and subjecting them to a machine learning classifier. The research shows that this multi-class classification task is very difficult, in particular the building of a domain-independent machine learning classifier. Our problem specifically concerned tweets about oil companies, most of which were noisy enough to affect the accuracy. The analytical technique used here provided structured and valuable information for oil companies.

Hanaa Aldahawi, Stuart Allen

Speech Processing

Frontmatter

Long-Distance Continuous Space Language Modeling for Speech Recognition

The n-gram language models have been the most frequently used language models for a long time, as they are easy to build and require minimal effort for integration in different NLP applications. Despite their popularity, n-gram models suffer from several drawbacks, such as their limited ability to generalize to words unseen in the training data, their limited adaptability to new domains, and their focus only on short-distance word relations. To overcome the problems of the n-gram models, continuous parameter space LMs were introduced. In these models, words are treated as vectors of real numbers rather than as discrete entities. As a result, semantic relationships between words can be quantified and integrated into the model. Infrequent words are modeled using more frequent ones that are semantically similar. In this paper we present a long-distance continuous language model based on latent semantic analysis (LSA). In the LSA framework, the word-document co-occurrence matrix is commonly used to tell how many times a word occurs in a certain document. The word-word co-occurrence matrix has also been used in many previous studies. In this research, we introduce a different representation for the text corpus by proposing long-distance word co-occurrence matrices. These matrices represent the long-range co-occurrences between different words at different distances in the corpus. By applying LSA to these matrices, words in the vocabulary are moved to the continuous vector space. We represent each word with a continuous vector that keeps the word order and position in the sentences. We use tied-mixture HMM modeling (TM-HMM) to robustly estimate the LM parameters and word probabilities. Experiments on the Arabic Gigaword corpus show improvements in perplexity and speech recognition results compared to the conventional n-gram.

Mohamed Talaat, Sherif Abdou, Mahmoud Shoman
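
To make the idea of a distance-specific word co-occurrence matrix concrete, the sketch below counts how often one word is followed by another exactly `distance` positions later in the same sentence. This reading of "long-distance co-occurrence" is an assumption, as are the function name, the sparse `Counter` representation and the toy sentence; the paper's actual matrices and the subsequent LSA step are not reproduced here.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def distance_cooccurrence(
    sentences: Iterable[List[str]], distance: int
) -> Counter:
    """Count ordered word pairs (w1, w2) where w2 occurs `distance` tokens after w1."""
    counts: Counter = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - distance):
            pair: Tuple[str, str] = (tokens[i], tokens[i + distance])
            counts[pair] += 1
    return counts


# Hypothetical one-sentence corpus.
corpus = [["the", "cat", "sat", "on", "the", "mat"]]
counts_d2 = distance_cooccurrence(corpus, distance=2)
counts_d4 = distance_cooccurrence(corpus, distance=4)
print(counts_d2[("the", "sat")], counts_d4[("cat", "mat")])
```

One such sparse matrix would be built per distance; a dimensionality reduction step such as a truncated SVD could then be applied to each, under the same assumptions.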

A Supervised Phrase Selection Strategy for Phonetically Balanced Standard Yorùbá Corpus

This paper presents a scheme for the development of a speech corpus for Standard Yorùbá (SY). The problem addressed is the non-availability of phonetically balanced corpora in most resource-scarce languages such as SY. The proposed solution hinges on the development and implementation of supervised phrase selection using a Rule-Based Corpus Optimization Model (RBCOM) to obtain a phonetically balanced SY corpus. This was in turn compared with the random phrase selection procedure. The concept of Exploitative Data Analysis (EDA), which is premised on frequency distribution models, was further deployed to evaluate the distribution of allophones of selected phrases. The goodness of fit of the frequency distributions was studied using Kolmogorov-Smirnov, Anderson-Darling and Chi-Squared tests, while comparative studies were carried out among other techniques. The sample skewness result was used to establish the normality behavior of the data. The results obtained confirmed the efficacy of the supervised phrase selection against the random phrase selection.

Adeyanju Sosimi, Tunde Adegbola, Omotayo Fakinlede

Semantic Role Labeling of Speech Transcripts

Speech data has been established as an extremely rich and important source of information. However, we still lack suitable methods for the semantic annotation of speech that has been transcribed by automated speech recognition (ASR) systems. For instance, the semantic role labeling (SRL) task for ASR data is still an unsolved problem, and the achieved results are significantly lower than with regular text data. SRL for ASR data is a difficult and complex task due to the absence of sentence boundaries and punctuation, as well as grammar errors, wrongly transcribed words, and word deletions and insertions. In this paper we propose a novel approach to SRL for ASR data based on the following idea: (1) combine evidence from different segmentations of the ASR data, (2) jointly select a good segmentation, (3) label it with the semantics of PropBank roles. Experiments with the OntoNotes corpus show improvements compared to the state-of-the-art SRL systems on the ASR data. As an additional contribution, we semi-automatically align the predicates found in the ASR data with the predicates in the gold standard data of OntoNotes, which is a difficult and challenging task, but the result can serve as gold standard alignments for future research.

Niraj Shrestha, Ivan Vulić, Marie-Francine Moens

Latent Topic Model Based Representations for a Robust Theme Identification of Highly Imperfect Automatic Transcriptions

Speech analytics suffer from poor automatic transcription quality. To tackle this difficulty, one solution consists in mapping transcriptions into a space of hidden topics. This abstract representation makes it possible to work around drawbacks of the ASR process. A well-known and commonly used one is the topic-based representation from Latent Dirichlet Allocation (LDA). During the LDA learning process, the distribution of words in each topic is estimated automatically. Nonetheless, in the context of a classification task, the LDA model does not take into account the targeted classes. The supervised Latent Dirichlet Allocation (sLDA) model overcomes this weakness by considering the class, as a response, as well as the document content itself. In this paper, we propose to compare these two classical topic-based representations of a dialogue (LDA and sLDA) with a new one based not only on the dialogue content itself (words), but also on the theme related to the dialogue. This original Author-topic Latent Variables (ATLV) representation is based on the Author-topic (AT) model. The effectiveness of the proposed ATLV representation is evaluated on a classification task from automatic dialogue transcriptions of the Paris Transportation customer service calls. Experiments confirmed that this ATLV approach outperforms the LDA and sLDA approaches by far, with substantial gains of 7.3 and 5.8 points, respectively, in terms of correctly labeled conversations.

Mohamed Morchid, Richard Dufour, Georges Linarès, Youssef Hamadi

Probabilistic Approach for Detection of Vocal Pathologies in the Arabic Speech

There are different methods for vocal pathology detection. These methods usually have three steps: feature extraction, feature reduction and speech classification. The first and second steps present obstacles to attaining high performance and accuracy in the classification system [20]. Indeed, feature reduction can cause a loss of data. In this paper, we present an initial study of Arabic speech classification based on a probabilistic approach and the distance between reference speeches and the speech to classify. The first step in our approach is dedicated to generating a standard distance (phonetic distance) between different healthy speech bases. In the second step we determine the distance between the speech to classify and reference speeches (a phonetic model specific to the speaker and a reference phonetic model). By comparing these two distances (the distance between the speech to classify and reference speeches, and the standard distance) in the third step, we can classify the input speech as healthy or pathological. The proposed method is able to classify Arabic speeches with an accuracy of 96.25%, and we attain 100% by concatenating falsely classified sequences. Results of our method provide insights that can guide biologists and computer scientists in designing high-performance systems for vocal pathology detection.

Naim Terbeh, Mohsen Maraoui, Mounir Zrigui

Applications

Frontmatter

Clustering Relevant Terms and Identifying Types of Statements in Clinical Records

The automatic processing of clinical documents created in clinical settings has become a focus of research in natural language processing. However, standard tools developed for general texts are not applicable or perform poorly on this type of document, especially in the case of less-resourced languages. In order to be able to create a formal representation of knowledge in the clinical records, a normalized representation of concepts needs to be defined. This can be done by mapping each record to an external ontology or other semantic resources. In the case of languages where no such resources exist, it is reasonable to create a representational schema from the texts themselves. In this paper, we show that, based on the pairwise distributional similarities of words and multiword terms, a conceptual hierarchy can be built from the raw documents. In order to create the hierarchy, we applied an agglomerative clustering algorithm to the most frequent terms. Having such an initial system of knowledge extracted from the documents, a domain expert can then check the results and build a system of concepts that is in accordance with the documents the system is applied to. Moreover, we propose a method for classifying various types of statements and parts of clinical documents by annotating the texts with cluster identifiers and extracting relevant patterns.

Borbála Siklósi

Medical Entities Tagging Using Distant Learning

A semantic tagger that aims to detect relevant entities in medical documents and tag them with their appropriate semantic class is presented. In the experiments described in this paper, the tagset consists of the six most frequent classes in the SNOMED-CT taxonomy (SN). The system uses six binary classifiers, and two combination mechanisms are presented for combining the results of the binary classifiers. Learning the classifiers is performed using three widely used knowledge sources, including one domain-restricted and two domain-independent resources. The system obtains state-of-the-art results.

Jorge Vivaldi, Horacio Rodríguez

Identification of Original Document by Using Textual Similarities

When two documents share similar content, either accidentally or intentionally, it is unknown in most cases which of the two is the original source of the content. This knowledge can be crucial in order to charge someone with or acquit someone of plagiarism, to establish the provenance of a document or, in the case of sensitive information, to make sure that you can rely on the source of the information. Our system identifies the original document by using the idea that pieces of text written by the same author have a higher resemblance to each other than to those written by different authors. Given two pairs of documents with shared content, our system compares the shared part with the remaining text in both of the documents by treating them as bags of words. For cases when there is no reference text by one of the authors to compare against, our system makes predictions based on the similarity of the shared content to just one of the documents.

Prasha Shrestha, Thamar Solorio
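
The comparison of shared content against each document's remaining text can be sketched with bag-of-words vectors. The abstract does not name the similarity function, so cosine similarity is used here purely as an assumption; the function names, the whitespace tokenization and the toy texts are likewise hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def likely_original(shared: str, rest_a: str, rest_b: str) -> str:
    """Return "A" or "B": the document whose remaining text best matches the shared part."""
    def bag(text: str) -> Counter:
        return Counter(text.lower().split())  # naive whitespace tokenization

    sim_a = cosine(bag(shared), bag(rest_a))
    sim_b = cosine(bag(shared), bag(rest_b))
    return "A" if sim_a >= sim_b else "B"


# Hypothetical shared passage and the remaining text of two documents.
guess = likely_original(
    shared="the quick brown fox jumps",
    rest_a="a quick brown dog and a quick fox",
    rest_b="stock prices fell sharply today",
)
print(guess)
```

A realistic system would need richer stylistic features than raw word counts; the sketch only shows the shape of the comparison.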

Kalema: Digitizing Arabic Content for Accessibility Purposes Using Crowdsourcing

In this paper, we present “Kalema”, a system for digitizing scanned Arabic documents for the visually impaired so that they can be converted to audio format or Braille. This is done through a game with a purpose (GWAP), which offers a simple, challenging game that helps attract many volunteers for this cause. We show how such a tedious task can be achieved accurately and easily through the use of crowdsourcing.

Gasser Akila, Mohamed El-Menisy, Omar Khaled, Nada Sharaf, Nada Tarhony, Slim Abdennadher

An Enhanced Technique for Offline Arabic Handwritten Words Segmentation

The accuracy of handwritten word segmentation is essential for the recognition results; however, it is an extremely complex task. In this work, an enhanced technique for Arabic handwriting segmentation is proposed. This technique is based on a recent technique, which is dubbed the base technique in this work. It has two main stages: over-segmentation and neural-validation. Although the base technique gives promising results, it still suffers from many drawbacks, such as missed and bad segmentation points (SPs). To alleviate these problems, two enhancements have been integrated in the first stage: word to sub-word segmentation and thinned word restoration. Additionally, in the neural-validation stage an enhanced area concatenation technique is utilized to handle the segmentation of complex characters such as س. Both techniques were evaluated using the IFN/ENIT database. The results show that the bad and missed SPs have been significantly reduced and the overall performance of the system has increased.

Roqyiah M. Abdeen, Ahmed Afifi, Ashraf B. El-Sisi

Backmatter

Further Information