
2018 | Book

Natural Language Processing and Information Systems

23rd International Conference on Applications of Natural Language to Information Systems, NLDB 2018, Paris, France, June 13-15, 2018, Proceedings

Edited by: Max Silberztein, Faten Atigui, Elena Kornyshova, Elisabeth Métais, Farid Meziane

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 23rd International Conference on Applications of Natural Language to Information Systems, NLDB 2018, held in Paris, France, in June 2018.

The 18 full papers, 26 short papers, and 9 poster papers presented were carefully reviewed and selected from 99 submissions. The papers are organized in the following topical sections: Opinion Mining and Sentiment Analysis in Social Media; Semantics-Based Models and Applications; Neural Networks Based Approaches; Ontology Engineering; NLP; Text Similarities and Plagiarism Detection; Text Classification; Information Mining; Recommendation Systems; Translation and Foreign Language Querying; Software Requirement and Checking.

Table of Contents

Frontmatter

Opinion Mining and Sentiment Analysis in Social Media

Frontmatter
A Proposal for Book Oriented Aspect Based Sentiment Analysis: Comparison over Domains

Aspect-based sentiment analysis (ABSA) deals with extracting opinions at a fine-grained level from texts, providing very useful information for companies that want to know what people think about them or their products. Most of the systems developed in this field are based on supervised machine learning techniques and need a large amount of annotated data; nevertheless, few resources are available because of their high cost of preparation. In this paper we present an analysis of a recently published dataset, covering different subtasks: aspect extraction, category detection, and sentiment analysis. It contains book reviews published on Amazon, a new domain of application in the ABSA literature. The annotation process and its characteristics are described, as well as a comparison with other datasets. This paper focuses on this comparison, addressing the different subtasks and analyzing their performance and properties.

Tamara Álvarez-López, Milagros Fernández-Gavilanes, Enrique Costa-Montenegro, Patrice Bellot
Stance Evolution and Twitter Interactions in an Italian Political Debate

The number of communications and messages generated by users on social media platforms has progressively increased in recent years. Therefore, the issue of developing automated systems for a deep analysis of user-generated content and interactions is becoming increasingly relevant. In particular, when we focus on the domain of online political debates, interest in the automatic classification of users' stance towards a given entity, such as a controversial topic or a politician, within a polarized debate is growing significantly. In this paper we propose a new model for stance detection in Twitter, where authors' messages are not considered in isolation, but in a diachronic perspective, shedding light on users' opinion shift dynamics along the temporal axis. Moreover, different types of social network community, based on retweet, quote, and reply relations, were analyzed in order to extract network-based features to be included in our stance detection model. The model has been trained and evaluated on a corpus of Italian tweets where users were discussing a highly polarized debate in Italy, i.e. the 2016 referendum on the reform of the Italian Constitution. The development of a new annotated corpus for stance is described. Analysis and classification experiments show that network-based features help in detecting stance and confirm the importance of modeling stance in a diachronic perspective.

Mirko Lai, Viviana Patti, Giancarlo Ruffo, Paolo Rosso
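To make the idea of network-based features concrete, here is a minimal sketch (in Python, not the authors' implementation) that derives community and degree features from retweet relations; the toy edge list, the community-detection algorithm and the feature set are all assumptions for illustration.

```python
# Illustrative sketch: simple network-based features from retweet relations.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# (user_who_retweets, user_being_retweeted) pairs -- invented toy data
retweets = [("ana", "bruno"), ("carla", "bruno"), ("dario", "elena"),
            ("fabio", "elena"), ("ana", "carla")]

G = nx.Graph()
G.add_edges_from(retweets)

# Communities in the retweet graph; membership becomes a categorical feature.
communities = list(greedy_modularity_communities(G))
community_of = {u: i for i, members in enumerate(communities) for u in members}

def network_features(user):
    """Features that could complement textual ones in a stance classifier."""
    return {
        "community": community_of.get(user, -1),
        "degree": G.degree(user) if user in G else 0,
    }

print(network_features("ana"))
```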
Identifying and Classifying Influencers in Twitter only with Textual Information

Online Reputation Management systems aim at identifying and classifying Twitter influencers because of their importance for brands. Current methods mainly rely on metrics provided by Twitter such as followers, retweets, etc. In this work we follow the research initiated at RepLab 2014, but rely only on the textual content of tweets. Moreover, we propose a workflow to identify influencers and classify them into an interest group from a reputation point of view, in addition to the classification proposed at RepLab. We evaluate two families of classifiers that do not require feature engineering, namely deep learning classifiers and traditional classifiers with embeddings. Additionally, we use two baselines: a simple language model classifier and the "majority class" classifier. Experiments show that most of our methods outperform the results reported at RepLab 2014, especially the proposed Low Dimensionality Statistical Embedding.

Victoria Nebot, Francisco Rangel, Rafael Berlanga, Paolo Rosso
Twitter Sentiment Analysis Experiments Using Word Embeddings on Datasets of Various Scales

Sentiment analysis is a popular research topic in social media analysis and natural language processing. In this paper, we present the details and evaluation results of our Twitter sentiment analysis experiments, which are based on word embedding vectors such as word2vec and doc2vec, using an ANN classifier. In these experiments, we utilized two publicly available sentiment analysis datasets and four smaller datasets derived from them, in addition to a publicly available vector model trained over 400 million tweets. The evaluation results are accompanied by discussions and future research directions based on the current study. One of the main conclusions drawn from the experiments is that filtering out the emoticons in the tweets could be a facilitating factor for sentiment analysis on tweets.

Yusuf Arslan, Dilek Küçük, Aysenur Birturk
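A minimal sketch of the general pipeline described above: emoticon filtering, averaged word vectors as tweet features, and a small feed-forward network as classifier. The embedding table, regex and toy tweets are invented stand-ins; the paper uses word2vec/doc2vec models trained over hundreds of millions of tweets.

```python
# Illustrative sketch: emoticon filtering + averaged word vectors + ANN.
import re
import numpy as np
from sklearn.neural_network import MLPClassifier

EMOTICON = re.compile(r"[:;=8][\-o\*']?[\)\]\(\[dDpP/\\]")

def strip_emoticons(text):
    return EMOTICON.sub(" ", text)

rng = np.random.default_rng(0)
vocab = ["good", "bad", "great", "awful", "movie", "day"]
embeddings = {w: rng.normal(size=50) for w in vocab}   # stand-in for word2vec

def tweet_vector(text):
    tokens = strip_emoticons(text.lower()).split()
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

tweets = ["good movie :)", "awful day :(", "great movie", "bad bad day"]
labels = [1, 0, 1, 0]
X = np.vstack([tweet_vector(t) for t in tweets])

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X, labels)
print(clf.predict(X))
```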
A Deep Learning Approach for Sentiment Analysis Applied to Hotel’s Reviews

Sentiment analysis is an active area of research and has presented promising results. There are several modeling approaches capable of performing classification with good accuracy. However, no approach performs well in all contexts, and the nature of the corpus used can exert a great influence. This paper describes research that presents a convolutional neural network approach to sentiment analysis applied to hotel reviews and compares it with models previously executed on the same corpus.

Joana Gabriela Ribeiro de Souza, Alcione de Paiva Oliveira, Guidson Coelho de Andrade, Alexandra Moreira
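As an illustration of the kind of convolutional classifier described above, here is a minimal text-CNN sketch in PyTorch; the vocabulary size, filter sizes and other hyper-parameters are assumptions, not the configuration used by the authors.

```python
# Minimal text-CNN sketch: embedding -> parallel 1D convolutions -> max pooling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, num_filters=32,
                 kernel_sizes=(3, 4, 5), num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

model = TextCNN(vocab_size=1000)
dummy_batch = torch.randint(0, 1000, (8, 40))      # 8 reviews, 40 tokens each
print(model(dummy_batch).shape)                    # torch.Size([8, 2])
```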
Automatic Identification and Classification of Misogynistic Language on Twitter

Hate speech may take different forms in online social media. Most of the investigations in the literature are focused on detecting abusive language in discussions about ethnicity, religion, gender identity and sexual orientation. In this paper, we address the problem of automatic detection and categorization of misogynous language in online social media. The main contribution of this paper is two-fold: (1) a corpus of misogynous tweets, labelled from different perspectives, and (2) an exploratory investigation of NLP features and ML models for detecting and classifying misogynistic language.

Maria Anzovino, Elisabetta Fersini, Paolo Rosso
Assessing the Effectiveness of Affective Lexicons for Depression Classification

Affective lexicons have been commonly used as lexical features for depression classification, but their effectiveness is relatively unexplored in the literature. In this paper, we investigate the effectiveness of three popular affective lexicons in the task of depression classification. We also develop two lexical feature engineering strategies for incorporating those lexicons into a supervised classifier. The effectiveness of the different lexicons and feature engineering strategies is evaluated on a depression dataset collected from LiveJournal.

Noor Fazilla Abd Yusof, Chenghua Lin, Frank Guerin
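One simple lexicon-based feature-engineering strategy can be sketched as follows: count (normalized) hits against each affective lexicon and feed the counts to a supervised classifier. The tiny lexicons, posts and labels below are invented placeholders, not the resources or the LiveJournal data used in the paper.

```python
# Illustrative sketch: lexicon hit counts as features for a supervised classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

lexicons = {                       # stand-ins for full affective lexicons
    "negative_affect": {"sad", "hopeless", "tired", "empty"},
    "positive_affect": {"happy", "excited", "grateful"},
}

def lexicon_features(text):
    tokens = text.lower().split()
    return [sum(t in lex for t in tokens) / max(len(tokens), 1)
            for lex in lexicons.values()]

posts = ["feeling sad and empty again", "so happy and grateful today",
         "tired hopeless sad", "excited for the weekend"]
labels = [1, 0, 1, 0]              # 1 = depression-related (toy labels)

X = np.array([lexicon_features(p) for p in posts])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```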

Semantics-Based Models and Applications

Frontmatter
Toward Human-Like Robot Learning

We present an implemented robotic system that learns elements of its semantic and episodic memory through language interaction with its human users. This human-like learning can happen because the robot can extract, represent and reason over the meaning of the user’s natural language utterances. The application domain is collaborative assembly of flatpack furniture. This work facilitates a bi-directional grounding of implicit robotic skills in explicit ontological and episodic knowledge and of ontological symbols in the real-world actions by the robot. In so doing, this work provides an example of successful integration of robotic and cognitive architectures.

Sergei Nirenburg, Marjorie McShane, Stephen Beale, Peter Wood, Brian Scassellati, Olivier Magnin, Alessandro Roncone
Identifying Argumentative Paragraphs: Towards Automatic Assessment of Argumentation in Theses

Academic revision by instructors is a critical step of the writing process, revealing deficiencies such as a lack of argumentation to the students. Argumentation is needed to communicate ideas clearly and to convince the reader of the stated claims. This paper presents three models to identify argumentative paragraphs in different sections (Problem Statement, Justification, and Conclusion) of theses and to determine their level of argumentation. The task is achieved using machine learning techniques with lexical features. The models were derived from an annotated collection of student writings, which served for training. We performed experiments to evaluate argumentative paragraph identification in these sections, reaching encouraging results compared to previously proposed approaches. Several feature configurations and learning algorithms were tested to reach the models. We applied the models in a web-based argument assessment system to provide access to students and instructors.

Jesús Miguel García-Gorrostieta, Aurelio López-López
Semantic Mapping of Security Events to Known Attack Patterns

In order to provide cyber environment security, analysts need to analyze a large number of security events on a daily basis and take proper actions to alert their clients of potential threats. The increasing cyber traffic drives a need for a system to assist security analysts to relate security events to known attack patterns. This paper describes the enhancement of an existing Intrusion Detection System (IDS) with the automatic mapping of snort alert messages to known attack patterns. The approach relies on pre-clustering snort messages before computing their similarity to known attack patterns in Common Attack Pattern Enumeration and Classification (CAPEC). The system has been deployed in our partner company and when evaluated against the recommendations of two security analysts, achieved an f-measure of 64.57%.

Xiao Ma, Elnaz Davoodi, Leila Kosseim, Nicandro Scarabeo
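The overall idea, pre-clustering alert messages and then matching each cluster to the most similar attack-pattern description, can be sketched as below; the alert texts, CAPEC snippets, cluster count and TF-IDF representation are assumptions for illustration, not the deployed system.

```python
# Illustrative sketch: cluster alert messages, then map clusters to CAPEC entries.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

alerts = ["ET SCAN nmap TCP port scan detected",
          "ET SCAN suspicious port sweep from external host",
          "SQL injection attempt in HTTP request parameter",
          "union select detected in URI"]
capec = {"CAPEC-300": "port scanning to identify open services",
         "CAPEC-66": "SQL injection through user-controlled input"}

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(alerts + list(capec.values()))
X_alerts, X_capec = X[:len(alerts)], X[len(alerts):]

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_alerts)
for cluster_id in set(clusters):
    idx = [i for i, c in enumerate(clusters) if c == cluster_id]
    centroid = np.asarray(X_alerts[idx].mean(axis=0))
    sims = cosine_similarity(centroid, X_capec)[0]
    best = list(capec)[sims.argmax()]
    print(f"cluster {cluster_id} -> {best}")
```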

Neural Networks Based Approaches

Frontmatter
Accommodating Phonetic Word Variations Through Generated Confusion Pairs for Hinglish Handwritten Text Recognition

On-line handwriting recognition has seen major strides in the past years, especially with the advent of deep learning techniques. Recent work has seen the usage of deep networks for sequential classification in the unconstrained handwriting recognition task. However, the recognition of the "Hinglish" language faces various unseen problems. Hinglish is a portmanteau of Hindi and English, involving frequent code-switching between the two languages. Millions of Indians use Hinglish as a primary mode of communication, especially across social media. However, being a colloquial language, Hinglish does not have a fixed rule set for spelling and grammar. Auto-correction is an unsuitable solution, as there is no single correct form of a word and all the phonetic variations are valid. Unlike keyboards, which have an inherent advantage here, handwritten text recognition also has to overcome the issue of mis-recognizing similar-looking characters. We propose a comprehensive solution to the problem of recognizing words with phonetic spelling variations. To our knowledge, no work has been done to date to recognize Hinglish handwritten text. Our proposed solution shows a character recognition accuracy of 94% and a word recognition accuracy of 72%, correctly recognizing the multiple phonetic variations of any given word.

Soumyajit Mitra, Vikrant Singh, Pragya Paramita Sahu, Viswanath Veera, Shankar M. Venkatesan
Arabic Question Classification Using Support Vector Machines and Convolutional Neural Networks

Question Classification is an important task in Question Answering Systems and Information Retrieval, among other NLP systems. Given a question, the aim of Question Classification is to find the correct type of answer for it. The focus of this paper is on Arabic question classification. We present a novel approach that combines a Support Vector Machine (SVM) and a Convolutional Neural Network (CNN). This method works in two stages: in the first stage, we identify the coarse/main question class using an SVM model; in the second stage, for each coarse question class returned by the SVM model, a CNN model is used to predict the subclass (finer class) of the main class. The performed tests have shown that our approach to Arabic question classification yields very promising results.

Asma Aouichat, Mohamed Seghir Hadj Ameur, Ahmed Geussoum
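The two-stage routing scheme can be sketched as follows; for brevity a logistic regression stands in for the CNN of the second stage, and the example questions (in English rather than Arabic) are invented.

```python
# Illustrative sketch: coarse class with a linear SVM, fine class with a
# per-coarse-class second-stage model (stand-in for the paper's CNN).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train = [("who discovered penicillin", "HUMAN", "HUMAN:individual"),
         ("who are the members of the UN security council", "HUMAN", "HUMAN:group"),
         ("where is the Eiffel Tower", "LOCATION", "LOCATION:city"),
         ("where does the Nile start", "LOCATION", "LOCATION:other")]
questions, coarse, fine = zip(*train)

coarse_clf = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(questions, coarse)

fine_clfs = {}
for c in set(coarse):
    idx = [i for i, lab in enumerate(coarse) if lab == c]
    fine_clfs[c] = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(
        [questions[i] for i in idx], [fine[i] for i in idx])

def classify(question):
    c = coarse_clf.predict([question])[0]
    return c, fine_clfs[c].predict([question])[0]

print(classify("where is the Louvre"))
```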
A Supervised Learning Approach for ICU Mortality Prediction Based on Unstructured Electrocardiogram Text Reports

Extracting patient data documented in text-based clinical records into a structured form is a predominantly manual process, both time and cost-intensive. Moreover, structured patient records often fail to effectively capture the nuances of patient-specific observations noted in doctors' unstructured clinical notes and diagnostic reports. Automated techniques that utilize such unstructured text reports for modeling useful clinical information for supporting predictive analytics applications can thus be highly beneficial. In this paper, we propose a neural network based method for predicting the mortality risk of ICU patients using unstructured Electrocardiogram (ECG) text reports. Word2Vec word embedding models were adopted for vectorizing and modeling textual features extracted from the patients' reports. An unsupervised data cleansing technique for identification and removal of anomalous data/special cases was designed for optimizing the patient data representation. Further, a neural network model based on the Extreme Learning Machine architecture was proposed for mortality prediction. ECG text reports available in the MIMIC-III dataset were used for experimental validation. The proposed model, when benchmarked against four standard ICU severity scoring methods, outperformed all of them by 10–13% in terms of prediction accuracy.

Gokul S. Krishnan, S. Sowmya Kamath
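A minimal Extreme Learning Machine sketch, a random hidden layer followed by a closed-form least-squares output layer, is shown below; random feature vectors stand in for the Word2Vec report representations used in the paper, and all dimensions are assumptions.

```python
# Minimal ELM sketch: random hidden projection + pseudo-inverse output weights.
import numpy as np

class ELM:
    def __init__(self, n_hidden=64, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)           # random hidden activations
        self.beta = np.linalg.pinv(H) @ y          # closed-form output weights
        return self

    def predict(self, X):
        H = np.tanh(X @ self.W + self.b)
        return (H @ self.beta > 0.5).astype(int)

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 100))              # e.g. averaged word vectors
y_train = (X_train[:, 0] > 0).astype(int)          # toy mortality labels
model = ELM().fit(X_train, y_train)
print((model.predict(X_train) == y_train).mean())
```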
Multimodal Language Independent App Classification Using Images and Text

There are a number of methods for classification of mobile apps, but most of them rely on a fixed set of app categories and text descriptions associated with the apps. Often, one may need to classify apps into a different taxonomy and might have limited app usage data for the purpose. In this paper, we present an app classification system that uses object detection and recognition in images associated with apps, along with text based metadata of the apps, to generate a more accurate classification for a given app according to a given taxonomy. Our image based approach can, in principle, complement any existing text based approach for app classification. We train a fast RCNN to learn the coordinates of bounding boxes in an app image for effective object detection, as well as labels for the objects. We then use the detected objects in the app images in an ensemble with a text based system that uses a hierarchical supervised active learning pipeline based on uncertainty sampling for generating the training samples for a classifier. Using the ensemble, we are able to obtain better classification accuracy than if either the text or the image system is used on its own.

Kushal Singla, Niloy Mukherjee, Joy Bose
T2S: An Encoder-Decoder Model for Topic-Based Natural Language Generation

Natural language generation (NLG) plays a critical role in various natural language processing (NLP) applications, and topics provide a powerful tool for understanding natural language. We propose a novel topic-based NLG model which can generate topic-coherent sentences given a single topic or a combination of topics. The model is an extension of the recurrent encoder-decoder framework that introduces a global topic embedding matrix. Experimental results show that our encoder can not only transform a source sentence into a representative topic distribution, which gives a better interpretation of the source sentence, but also generate topic-coherent and diversified sentences given different topic distributions without any text-level input.

Wenjie Ou, Chaotao Chen, Jiangtao Ren

Ontology Engineering

Frontmatter
Ontology Development Through Concept Map and Text Analytics: The Case of Automotive Safety Ontology

Ontology development is an expensive and time-consuming process. The development of real-world organizational ontology-based knowledge management systems is still in its early stages. Some existing ontologies with simple tuples and properties are not designed for domain-specific requirements, or do not utilize existing knowledge from organizational databases or documents. Here we propose a concept map approach that first semi-automatically creates a keyword list of detailed-level entities/concepts by applying natural language processing, including word dependency and POS tagging. This list can then be used to extract entities/concepts for the same domain. The approach is applied to the automotive safety domain. The results are further mapped to an existing ontology and aggregated to form a concept map. We implement our approach in KNIME with the Stanford NLP parser and generate a concept map from an automotive safety complaint dataset. The final results expand the existing ontology and also bridge the gap between ontologies and real-world organizational ontology-based knowledge management systems.

Zirun Qi, Vijayan Sugumaran
A Fuzzy-Based Approach for Representing and Reasoning on Imprecise Time Intervals in Fuzzy-OWL 2 Ontology

Representing and reasoning on imprecise temporal information is a common requirement in the field of the Semantic Web. Many works exist to represent and reason on precise temporal information in OWL; however, to the best of our knowledge, none of these works is devoted to representing and reasoning on imprecise time intervals. To address this problem, we propose a fuzzy-based approach for representing and reasoning on imprecise time intervals in ontologies. Our approach is based on fuzzy set theory and fuzzy tools, and is modeled in Fuzzy-OWL 2. The 4D-fluents approach is extended with new fuzzy components in order to represent imprecise time intervals and qualitative fuzzy interval relations. Allen's interval algebra is extended in order to compare imprecise time intervals in a fuzzy, gradual, personalized way. Inferences are done via a set of Mamdani IF-THEN rules.

Fatma Ghorbel, Fayçal Hamdi, Elisabeth Métais, Nebrasse Ellouze, Faiez Gargouri
Assessing the Impact of Single and Pairwise Slot Constraints in a Factor Graph Model for Template-Based Information Extraction

Template-based information extraction generalizes over standard token-level binary relation extraction in the sense that it attempts to fill a complex template comprising multiple slots on the basis of information given in a text. In the approach presented in this paper, templates and possible fillers are defined by a given ontology. The information extraction task consists in filling these slots within a template with previously recognized entities or literal values. We cast the task as a structure prediction problem and propose a joint probabilistic model based on factor graphs to account for the interdependence in slot assignments. Inference is implemented as a heuristic building on Markov chain Monte Carlo sampling. As our main contribution, we investigate the impact of soft constraints modeled as single slot factors which measure preferences of individual slots for ranges of fillers, as well as pairwise slot factors modeling the compatibility between fillers of two slots. Instead of relying on expert knowledge to acquire such soft constraints, in our approach they are directly captured in the model and learned from training data. We show that both types of factors are effective in improving information extraction on a real-world data set of full-text papers from the biomedical domain. Pairwise factors are shown to particularly improve the performance of our extraction model by up to +0.43 points in precision, leading to an F1 score of 0.90 for individual templates.

Hendrik ter Horst, Matthias Hartung, Roman Klinger, Nicole Brazda, Hans Werner Müller, Philipp Cimiano

NLP

Frontmatter
Processing Medical Binary Questions in Standard Arabic Using NooJ

Nowadays, the medical domain has a high volume of electronic documents. Exploiting this large quantity of data makes the search for specific information complex and time-consuming. This difficulty has prompted the development of new, adapted search tools, such as question-answering systems. Indeed, this type of system allows a user to ask a question in natural language and automatically identifies a specific answer instead of a set of documents deemed pertinent, as is the case with search engines. For this purpose, we are developing a question-answering system based on a linguistic approach, which uses the linguistic engine of NooJ to formalize automatic recognition rules and then applies them to a dynamic corpus composed of Arabic medical journalistic articles. In this paper, we present a method for analyzing medical binary questions. The question asked by the user is analyzed by applying a cascade of morpho-syntactic resources: linguistic patterns (grammars) allow us to annotate the question and its semantic features and to extract the focus and topic of the question. We start with the implementation of the rules which identify and annotate the various medical entities. The named entity recognizer (NER) is able to find references to people, places, organizations, diseases and viruses as targets for extracting the correct answer for the user. The NER is embedded in our question-answering system in order to identify the answer and delimit the potential justification sequence. The precision and recall show that the current results are encouraging, and the approach could be extended to more types of questions beyond binary questions.

Essia Bessaies, Slim Mesfar, Henda Ben Ghzela
Verbal Multi-Word Expressions in Yiddish

Verbal Multi-Word Expressions (VMWEs) are very common in many languages. They include, among others, the following types: Verb-Particle Constructions (VPC) (e.g. get around), Light-Verb Constructions (LVC) (e.g. make a decision), and idioms (ID) (e.g. break a leg). In this paper, we present a new dataset for supervised learning of VMWEs written in Yiddish. The dataset was manually collected and annotated from a web resource. It contains a set of positive examples of VMWEs and a set of non-VMWE examples. While the dataset can be used for training supervised algorithms, the positive examples can be used as seeds in unsupervised bootstrapping algorithms. Moreover, we analyze the lexical properties of VMWEs written in Yiddish by classifying them into six categories: VPC, LVC, ID, Inherently Pronominal Verb (IPronV), Inherently Prepositional Verb (IPrepV), and other (OTH). The analysis suggests some interesting features of VMWEs for exploration. This dataset is a first step towards automatic identification of VMWEs written in Yiddish, which is important for natural language understanding, generation and translation systems.

Chaya Liebeskind, Yaakov HaCohen-Kerner
Stochastic Approach to Aspect Tracking

In this investigation, we discuss aspect tracking, i.e., how to identify and track the storylines of document topics. Since a huge amount of fragmentary information arises, it is hard to see by hand what the fragments mean and how they evolve within topics. Here we attack this kind of problem by means of stochastic models. Our basic idea is to consider state transitions as the internal structure of stories based on an HMM, and to extract several storylines as aspects of topics by probabilistic likelihood. We utilize KL divergence to screen topics.

Maoto Inoue, Takao Miura
Transducer Cascade to Parse Arabic Corpora

Arabic parsing is an important task in several NLP applications. Indeed, to obtain a robust, efficient and extensible parser treating several phenomena, several issues (i.e., ambiguity and embedded structures) must be resolved. In this context, we build an Arabic parser based on a deep linguistic study, carried out with a new vision that allows dividing the problem, and on a transducer cascade implemented in the NooJ linguistic platform. This parser is accomplished through our designed dictionaries, morphological grammars and transducers recognizing different sentence forms. The constructed parser is applied to two test corpora containing more than 5900 sentences with different structures. The parser outputs are XML-annotated sentences. To evaluate the obtained results, we calculated the precision, recall and f-measure values and compared them with those obtained by a recursive transducer parser. The calculated measure values show that these results are encouraging.

Nadia Ghezaiel Hammouda, Roua Torjmen, Kais Haddar
Multi-Word Expressions Annotations Effect in Document Classification Task

Document classification is a necessary task for most Natural Language Processing tools, since it organizes document content in a helpful and meaningful way. The main concern of this paper is to investigate the impact of using multi-words for text representation on the performance of the text classification task. Two text classification strategies are proposed in order to observe the robustness of each of them. First, we give a literature review of existing linguistic resources for the Arabic language. Secondly, we present a classification method based on domain candidate simple terms; these terms are automatically extracted from multiple specialized corpora depending on their frequency of appearance. Then, we present a detailed description of a classification method based on a multi-word expressions dictionary: CompounDic, an Arabic multi-word expressions dictionary, is used to automatically annotate multi-word expressions and their variations in text. Finally, we carried out a series of experiments on classifying specialized text based on simple words and multi-word expressions for comparison purposes. Our experiments show that the use of multi-word expression annotations enhances the text classification results.

Dhekra Najar, Slim Mesfar, Henda Ben Ghezela
RDST: A Rule-Based Decision Support Tool

Given the adverse health effects of antibiotics and chemical drugs, medical herbalism has seen a resurgence of interest in recent years. Medicinal plants are capable of treating disease and improving wellbeing, frequently without significant side effects. This paper presents a rule-based decision support tool aimed at helping users identify appropriate medicinal plants according to their symptoms, taking into account the contraindications of each plant. The tool is based on IF-THEN rules, dictionaries and transducers. It permits the identification of the appropriate medicinal plants and the recognition of medicinal plant properties, and it incorporates user feedback to refine its results. Dictionaries and transducers are implemented in the NooJ linguistic platform and applied in a JAVA application with the command-line program noojapply. Experiments with the rule-based decision support tool show interesting results. Performance is satisfactory, since our tool could act as a consultant. Furthermore, the functionality can be extended to other medicinal plants with the aim of treating the whole body health system.

Sondes Dardour, Héla Fehri
Mention Clustering to Improve Portuguese Semantic Coreference Resolution

This paper evaluates the impact that different clustering techniques may have on grouping referential mentions on rule-based coreference resolution systems. As a result, we show that our approach outperforms commonly applied methods.

Evandro Fonseca, Aline Vanin, Renata Vieira
Resource Creation for Training and Testing of Normalisation Systems for Konkani-English Code-Mixed Social Media Text

Code-Mixing is the mixing of two or more languages or language varieties in speech. Apart from the inherent linguistic complexity, the analysis of code-mixed content poses complex challenges owing to the presence of spelling variations and non-adherence to a formal grammar. However, for any downstream Natural Language Processing task, tools that are able to process and analyze code-mixed social media data are required. Currently there is a lack of publicly available resources for code-mixed Konkani-English social media data, while the amount of such text is increasing every day. The lack of a standard dataset to evaluate these systems makes it difficult to make any meaningful comparisons of their relative accuracies. In this paper, we describe the methodology for the creation of a normalisation dataset for Konkani-English Code-Mixed Social Media Text (CMST). We believe that this dataset will prove useful not only for the evaluation and training of normalisation systems but will also help in the linguistic analysis of the process of normalising Indian languages from native scripts to Roman. Normalisation refers to the process of writing the text of one language using the script of another language whereby the sound of the text is preserved as far as possible [3].

Akshata Phadte
A TF-IDF and Co-occurrence Based Approach for Events Extraction from Arabic News Corpus

Event extraction is a common task for different applications such as text summarization and information retrieval. In this work, we propose a TF-IDF based approach for extracting keywords from the titles of Arabic news articles. These keywords then serve to extract the main events for each month using a Part-of-Speech (POS) co-occurrence based approach. Precision values are computed by matching the extracted events against another news website. Results show that the approach's performance depends on the category, and that it performs well for domain-specific categories such as economy.

Amina Chouigui, Oussama Ben Khiroun, Bilel Elayeb
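The first step, TF-IDF keyword extraction from titles grouped by month, can be sketched as follows; the titles are invented and in English rather than Arabic for readability.

```python
# Illustrative sketch: top TF-IDF keywords per month from news titles.
from sklearn.feature_extraction.text import TfidfVectorizer

titles_by_month = {
    "2017-01": ["central bank raises interest rates",
                "stock market reacts to interest rate decision"],
    "2017-02": ["national football team wins qualifier",
                "football fans celebrate historic win"],
}

corpus = [" ".join(titles) for titles in titles_by_month.values()]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()

for month, row in zip(titles_by_month, tfidf.toarray()):
    top = [terms[i] for i in row.argsort()[::-1][:3]]
    print(month, top)                     # top keywords per month
```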
English Text Parsing by Means of Error Correcting Automaton

The article considers developing an effective, flexible model for describing syntactic structures of natural language. A model of an augmented transition network in automaton form is chosen as the basis. This automaton performs the sentence analysis algorithm using a forward error-detection pass and a backward error-correction pass. The automaton finds an optimal variant of error corrections using a technique similar to the Viterbi decoding algorithm for error-correcting convolutional codes. As a result, an effective tool for natural language parsing is developed.

Oleksandr Marchenko, Anatoly Anisimov, Igor Zavadskyi, Egor Melnikov
Annotating Relations Between Named Entities with Crowdsourcing

In this paper, we describe how the CrowdFlower platform was used to build an annotated corpus for Relation Extraction. The obtained data provides information on the relations between named entities in Portuguese texts.

Sandra Collovini, Bolivar Pereira, Henrique D. P. dos Santos, Renata Vieira
Automatic Detection of Negated Findings with NooJ: First Results

The objective of this study is to develop a methodology for the automatic detection of negated findings in radiological reports which takes into account semantic and syntactic descriptions, as well as morphological and syntactic analysis rules. In order to achieve this goal, a series of rules for processing lexical and syntactic information was elaborated. This required the development of an electronic dictionary of medical terminology and a computerized grammar. The computational framework was built with NooJ, free software developed by Silberztein which provides various utilities for treating natural language. Results show that the detection of negated findings improves if lexical-grammatical information is added.

Walter Koza, Mirian Muñoz, Natalia Rivas, Ninoska Godoy, Darío Filippo, Viviana Cotik, Vanesa Stricker, Ricardo Martínez
Part-of-Speech Tagger for Konkani-English Code-Mixed Social Media Text

In this paper, we propose efficient and less resource-intensive strategies for tagging Konkani-English code-mixed social media text, which presents several challenges compared to tagging normal text. Part-of-Speech tagging is a primary and important step for many Natural Language Processing applications. This paper reports work on annotating code-mixed Konkani-English data collected from the social media site Facebook, consisting of more than four thousand posts, and on developing automatic Part-of-Speech taggers for this corpus. Part-of-Speech tagging is treated as a classification problem, and we use different classifiers such as CRFs and SVMs with different combinations of features.

Akshata Phadte, Radhiya Arsekar
Word2vec for Arabic Word Sense Disambiguation

Word embeddings, where words are represented as vectors in a continuous space, have recently attracted much attention in natural language processing tasks due to their ability to capture semantic and syntactic relations between words from a huge amount of text. In this work, we focus on how word embeddings can be used in Arabic word sense disambiguation (WSD).

Rim Laatar, Chafik Aloulou, Lamia Hadrich Belghuith
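A common embedding-based disambiguation recipe, comparing the averaged context vector with a vector for each sense gloss, can be sketched as follows; the tiny embedding table, glosses and English example are placeholders, not the Arabic resources studied in the paper.

```python
# Illustrative sketch: pick the sense whose gloss vector is closest to the context.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["bank", "river", "water", "money", "loan", "deposit", "shore"]
embeddings = {w: rng.normal(size=50) for w in vocab}   # stand-in for word2vec

def avg_vector(words):
    vecs = [embeddings[w] for w in words if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

senses = {"bank_finance": ["money", "loan", "deposit"],
          "bank_river": ["river", "water", "shore"]}

context = ["he", "opened", "a", "deposit", "at", "the", "bank"]
ctx_vec = avg_vector(context)
best = max(senses, key=lambda s: cosine(ctx_vec, avg_vector(senses[s])))
print(best)
```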

Text Similarities and Plagiarism Detection

Frontmatter
HYPLAG: Hybrid Arabic Text Plagiarism Detection System

Plagiarism is defined as the literary theft of paragraphs or sentences from an unreferenced source. This unauthorized behavior is a real problem that targets scientific research. This paper proposes a Hybrid Arabic Plagiarism Detection System (HYPLAG). The HYPLAG approach combines corpus-based and knowledge-based approaches by utilizing an Arabic semantic resource (Arabic WordNet). A preliminary study on texts from undergraduate students was conducted to understand their behavior and the patterns used in plagiarism. The results of the study show that students apply different techniques to plagiarize sentences, and also show changes to sentence components (verbs, nouns, and adjectives). HYPLAG was evaluated on the ExAraPlagDet-2015 dataset against several other approaches that participated in the AraPlagDet PAN@FIRE shared task on extrinsic Arabic plagiarism detection, obtaining higher performance (F-score 89% vs. 84% obtained by the best performing system at AraPlagDet) with less computational time.

Bilal Ghanem, Labib Arafeh, Paolo Rosso, Fernando Sánchez-Vega
On the Semantic Similarity of Disease Mentions in MEDLINE and Twitter

Social media mining is becoming an important technique to track the spread of infectious diseases and to understand the specific needs of people affected by a medical condition. A common approach is to select a variety of synonyms for a disease derived from the scientific literature and then retrieve social media posts for subsequent analysis. With this paper, we question the underlying assumption that user-generated text always makes use of such names, or assigns them the same meaning as in the scientific literature. We analyze the most frequently used concepts in MEDLINE for semantic similarity to their Twitter use, and compare their normalized entropies and cosine similarities based on a simple distributional model. We find that diseases are referred to in semantically different ways in both corpora, a difference that increases in inverse proportion to the frequency of the synonym and to the commonness of the disease or condition. These results imply that, when sampling social media for disease-related micro-blogs, query expressions must be carefully chosen, and even more so for rarely mentioned diseases or conditions.

Camilo Thorne, Roman Klinger
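The two corpus-comparison measures mentioned above, cosine similarity between a term's co-occurrence vectors in the two corpora and the normalized entropy of its context distribution, can be sketched as follows; the toy counts stand in for real MEDLINE and Twitter co-occurrence statistics.

```python
# Illustrative sketch: cosine similarity and normalized entropy of a term's
# co-occurrence profile in two corpora.
import numpy as np

context_words = ["patient", "treatment", "lol", "feel", "study"]
# co-occurrence counts of the term "migraine" with each context word
medline_counts = np.array([40.0, 35.0, 0.0, 2.0, 30.0])
twitter_counts = np.array([3.0, 1.0, 25.0, 30.0, 0.0])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def normalized_entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum() / np.log2(len(counts)))

print("cosine:", round(cosine(medline_counts, twitter_counts), 3))
print("H_norm MEDLINE:", round(normalized_entropy(medline_counts), 3))
print("H_norm Twitter:", round(normalized_entropy(twitter_counts), 3))
```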
Gemedoc: A Text Similarity Annotation Platform

We present Gemedoc, a platform for text similarity annotation based on the spatial and thematic dimensions. To this end, a two-step annotation protocol was designed to assess the similarity between two documents: (1) identification of salient features according to the two analysis dimensions; (2) similarity assessment according to a 4-degree scale. Ultimately, the labeled data retrieved from different corpora could be used as a benchmark for text-mining applications.

Jacques Fize, Mathieu Roche, Maguelonne Teisseire

Text Classification

Frontmatter
Addressing Unseen Word Problem in Text Classification

The word-based Deep Neural Network (DNN) approach to text classification suffers performance issues due to a limited vocabulary. Character-based Convolutional Neural Network (CNN) models were proposed by researchers to address this issue. However, character-based models do not inherently capture the sequential relationship of words in texts. Hence, there is scope for further improvement by addressing the unseen word problem through a character model while maintaining the sequential context through a word-based model. In this work, we propose methods to combine both character- and word-based models for efficient text classification. The methods are evaluated on several benchmark datasets and compared against state-of-the-art results.

Promod Yenigalla, Sibsambhu Kar, Chirag Singh, Ajay Nagar, Gaurav Mathur
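One simple way to combine word-level and character-level views, illustrating why character features help with unseen or misspelled words, is a feature union of word and character n-grams; this uses TF-IDF features rather than the neural models of the paper and is only an assumption-laden sketch.

```python
# Illustrative sketch: union of word-level and character-level features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression

features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
])

texts = ["the battery life is excellent", "screen broke after one week",
         "excellent camera and batery", "terrible screeen, do not buy"]
labels = [1, 0, 1, 0]

clf = make_pipeline(features, LogisticRegression()).fit(texts, labels)
# misspelled words ("batery", "screeen") are unseen at the word level but
# still share character n-grams with their correct forms
print(clf.predict(["batery drains fast", "grreat screen"]))
```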
United We Stand: Using Multiple Strategies for Topic Labeling

Topic labeling aims at providing a sound, possibly multi-word, label that depicts a topic drawn from a topic model. This is of the utmost practical interest in order to quickly grasp a topic's informational content; the usual ranked list of words that maximizes a topic presents limitations for this task. In this paper, we introduce three new unsupervised n-gram topic labelers that achieve results comparable to existing unsupervised topic labelers while following different assumptions. We demonstrate that combining topic labelers, even only two, makes it possible to reach a 64% improvement with respect to single topic labeler approaches, and therefore opens research in that direction. Finally, we introduce a fourth topic labeler that extracts representative sentences, using Dirichlet smoothing to add contextual information. This sentence-based labeler provides strong surrogate candidates when n-gram topic labelers fall short of providing relevant labels, leading to up to 94% topic coverage.

Antoine Gourru, Julien Velcin, Mathieu Roche, Christophe Gravier, Pascal Poncelet
Evaluating Thesaurus-Based Topic Models

In this paper, we study thesaurus-based topic models and evaluate them from the point of view of topic coherence. Thesaurus-based topic models enhance the scores of related terms found in the same text, which means that the model encourages these terms to be assigned to the same topics. We evaluate various variants of such models. First, we carry out a manual evaluation of the obtained topics. Second, we study the possibility of using the collected manual data for evaluating new variants of thesaurus-based models, propose a method, and select its best parameters in cross-validation. Third, we apply the created evaluation method to estimate the influence of word frequencies on adding thesaurus relations for generating coherent topic models.

Natalia Loukachevitch, Kirill Ivanov
Classifying Companies by Industry Using Word Embeddings

This contribution investigates whether companies cluster together according to their field of industry using word embeddings, in particular word2vec models trained on general news text. We explore to what extent this can be utilised for identifying company-industry affiliations automatically. We present an experiment in which we test seven different classification methods on four different word2vec models trained on a 600-million-word corpus from the Guardian newspaper. For training and testing our classifiers we obtained company-industry assignments from the DBpedia knowledge base for those companies occurring in both the news corpus and DBpedia. The majority of the 28 scrutinized classification paradigms display F1 scores near 80%, with some exceeding this threshold. We found differences across industries, with some industries appearing to be more distinctly defined, while others are less clearly delineated from neighbouring fields. To test the robustness of our approach we conducted a field test, identifying candidate companies absent from DBpedia with a named-entity recognizer and establishing ground truth on company and industry status manually through web search. We found classifier performance to be less reliable in the field test and of varying quality across industries, with precision-at-25 values ranging from 16% to 88%, depending on the industry. In summary, the presented approach showed some promise, but also some limitations, and may in its current form only be robust enough for semi-automated classification.

Martin Lamby, Daniel Isemann
Towards Ontology-Based Training-Less Multi-label Text Classification

In the under-explored research area of multi-label text classification, a substantial amount of research on adapting and transforming traditional classifiers to directly handle multi-label datasets has taken place. The performance of traditional statistical and probabilistic classifiers suffers from the high dimensionality of the feature space, training overhead and label imbalance. In this work, we propose a novel ontology-based approach for training-less multi-label text classification. We transform the classification task into a graph matching problem by developing a shallow domain ontology to be used as a training-less classifier. Thereby, we overcome the challenges of feature engineering and label imbalance faced by traditional methods. Our intensive experiments, using the EUR-Lex dataset, show that our method provides performance comparable to state-of-the-art techniques in terms of macro F1-score.

Wael Alkhatib, Saba Sabrin, Svenja Neitzel, Christoph Rensing
Overview of Uni-modal and Multi-modal Representations for Classification Tasks

Classification is one of the most fundamental tasks in data mining and machine learning. It is being applied in an increasing number of fields, e.g. filtering, identification, information retrieval, information extraction, and similarity detection. A basic and necessary condition for the success of a classification task is the proper representation of the information it wishes to classify. Classification is needed in domains that are based on uni-modal representations such as text, images, audio, and speech, as well as in domains that are based on multi-modal representations. This paper aims to provide a short review on the developing area of multi-modal representations for classification with emphasis on state-of-the-art systems in this area. Firstly, fundamentals of uni-modal representations are given. Secondly, an overview of multi-modal representations is given. Thirdly, various related systems using multi-modal representations and the datasets used by them are briefly summarized with a comparative summary of these systems.

Aryeh Wiesen, Yaakov HaCohen-Kerner

Information Mining

Frontmatter
Classification of Intangible Social Innovation Concepts

In the social sciences, similarly to other fields, there is exponential growth of literature and textual data that people are no longer able to cope with in a systematic manner. In many areas there is a need to catalogue knowledge and phenomena. However, social science concepts and phenomena are complex, and in many cases there is a dispute in the field between conflicting definitions. In this paper we present a method that catalogues the complex and disputed concept of social innovation by applying text mining and machine learning techniques. Recognition of social innovations is performed by decomposing a definition into several more specific criteria (social objectives, social actor interactions, outputs and innovativeness). For each of these criteria, a machine learning-based classifier is created that checks whether a given text satisfies the criterion. The criteria can be successfully classified with an F1-score of 0.83–0.86. The presented method is flexible, since it allows combining criteria at a later stage in order to build and analyse the definition of choice.

Nikola Milosevic, Abdullah Gok, Goran Nenadic
An Unsupervised Approach for Cause-Effect Relation Extraction from Biomedical Text

Identification of cause-effect (CE) relation mentions, along with their arguments, is crucial for creating a scientific knowledge base. Linguistically complex constructs are used to express CE relations in text, mainly using generic causative (causal) verbs (cause, lead, result, etc.). We observe that some generic verbs have a domain-specific causative sense (inhibit, express) and some domains have altogether new causative verbs (down-regulate). Not every mention of a generic causative verb (e.g., lead) indicates a CE relation mention. We propose a linguistically-oriented unsupervised iterative co-discovery approach to identify domain-specific causative verbs, starting from a small set of seed causative verbs and an unlabeled corpus. We use known causative verbs to extract CE arguments, and use known CE arguments to discover causative verbs (hence co-discovery). Since causes and effects are typically agents, events, actions, or conditions, we use WordNet hypernym categories to identify suitable CE arguments. PMI is used to measure linguistic associations between a causative verb and its argument. Once we have a list of domain-specific causative verbs, we use it to extract CE relation mentions from a given corpus in an unsupervised manner, filtering out non-causative uses of a causative verb using a WordNet hypernym check of its arguments. Our approach extracts 256 domain-specific causative verbs from 10,000 PubMed abstracts of leukemia papers, and outperforms several baselines for extracting intra-sentence CE relation mentions.

Raksha Sharma, Girish Palshikar, Sachin Pawar
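The PMI association between a candidate causative verb and an argument can be sketched as follows; the counts are invented, whereas in the paper they would come from dependency-parsed PubMed abstracts.

```python
# Illustrative sketch: PMI between a causative verb and a candidate argument.
import math

total_pairs = 10_000
verb_count = {"inhibit": 120, "lead": 300}
arg_count = {"apoptosis": 80, "meeting": 60}
pair_count = {("inhibit", "apoptosis"): 25, ("lead", "meeting"): 3}

def pmi(verb, arg):
    p_pair = pair_count.get((verb, arg), 0) / total_pairs
    p_verb = verb_count[verb] / total_pairs
    p_arg = arg_count[arg] / total_pairs
    return math.log2(p_pair / (p_verb * p_arg)) if p_pair > 0 else float("-inf")

print("PMI(inhibit, apoptosis) =", round(pmi("inhibit", "apoptosis"), 2))
print("PMI(lead, meeting)      =", round(pmi("lead", "meeting"), 2))
```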
A Supervised Learning to Rank Approach for Dependency Based Concept Extraction and Repository Based Boosting for Domain Text Indexing

In conventional information retrieval systems, keywords extracted from documents are indexed and used for retrieval. Since the same information can be represented by different keywords, this hinders the extraction of relevant documents. Concept-based indexing and retrieval, which semantically identifies similar documents, overcomes this problem by mapping document phrases to a domain repository. In this paper, the problem of extracting and ranking concepts, i.e. key phrases, from domain-oriented text is explored. This paper ranks the concepts (key phrases) of a document based not only on statistical and cue phrases but also on the dependency relations in which the candidate concept occurs. For each candidate, a vector is formed with the phrase weight and the dependency relations. The features used to score the phrases in the vectors, for re-ranking and for weighing the vector corresponding to the candidate, are cue features (presence in title, abstract), C-value in the case of multi-words, frequency of occurrence and the type of dependency relation. The ranking process utilizes RankingSVM to rank the candidate concepts based on the feature vectors. In addition, to make the ranking domain-sensitive and to determine the domain relevance of the candidate concepts, they are fully or partially matched with the domain repository. Based on the depth of the concept and the presence of parents and siblings, the domain-relevant concepts are boosted up the order. The results indicate that the use of the dependency-based context vector and the domain repository provides substantial enhancement in the key phrase extraction task compared with other methods.

U. K. Naadan, T. V. Geetha, U. Kanimozhi, D. Manjula, R. Viswapriya, C. Karthik
[Demo] Integration of Text- and Web-Mining Results in EpidVis

New and emerging infectious diseases are an increasing threat to countries due to globalisation, the movement of passengers and international trade. In order to discover articles of potential importance to infectious disease emergence, it is important to mine the Web with an accurate vocabulary. In this paper, we present a new methodology that combines text-mining results and a visualisation approach in order to discover associations between hosts and symptoms related to emerging infectious disease outbreaks.

Samiha Fadloun, Arnaud Sallaberry, Alizé Mercier, Elena Arsevska, Pascal Poncelet, Mathieu Roche

Recommendation Systems

Frontmatter
Silent Day Detection on Microblog Data

Microblog has become an increasingly popular information source for users to get updates about the world. Given the rapid growth of the microblog data, users are often interested in getting daily (or even hourly) updates about a certain topic. Existing studies on microblog retrieval mainly focused on how to rank results based on their relevance, but little attention has been paid to whether we should return any results to search users. This paper studies the problem of silent day detection. Specifically, given a query and a set of tweets collected over a certain time period (such as a day), we need to determine whether the set contains any relevant tweets of the query. If not, this day is referred to as a silent day. Silent day detection enables us to not overwhelm users with non-relevant tweets. We formulate the problem as a classification problem, and propose two types of new features based on using collective information from query terms. Experiment results over TREC collections show that these new features are more effective in detecting silent days than previously proposed ones.

Kuang Lu, Hui Fang
Smart Entertainment - A Critiquing Based Dialog System for Eliciting User Preferences and Making Recommendations

We present a critiquing-based dialog system that can make media content recommendations to users by eliciting information through active exploration of user preferences for item attributes. The system and user communicate through a natural language mixed-initiative conversational interface in which the system guides the user to a specific choice. During the conversation, the system presents the user with several options and analyzes the responses or "critiques". The system starts with general recommendations or relevant candidates and refines these as it learns more about the user's preferences in subsequent iterations. The choices made by the user, and the textual feedback/reviews that can optionally be provided, are used to infer a user preference model for the item.

Roshni R. Ramnani, Shubhashis Sengupta, Tirupal Rao Ravilla, Sumitraj Ganapat Patil

Translation and Foreign Language Querying

Frontmatter
Cross-Language Text Summarization Using Sentence and Multi-Sentence Compression

Cross-Language Automatic Text Summarization produces a summary in a language different from the language of the source documents. In this paper, we propose a French-to-English cross-lingual summarization framework that analyzes the information in both languages to identify the most relevant sentences. In order to generate more informative cross-lingual summaries, we introduce the use of chunks and two compression methods at the sentence and multi-sentence levels. Experimental results on the MultiLing 2011 dataset show that our framework improves the results obtained by state-of-the-art approaches according to ROUGE metrics.

Elvys Linhares Pontes, Stéphane Huet, Juan-Manuel Torres-Moreno, Andréa Carneiro Linhares
Arabic Readability Assessment for Foreign Language Learners

Reading in a foreign language is a difficult task, especially if the texts presented to readers are chosen without taking into account the reader's skill level. Foreign language learners need to be presented with reading material suitable to their reading capacities. A basic tool for determining if a text is appropriate to a reader's level is the assessment of its readability, a measure that aims to represent the human capacities required to comprehend a given text. Readability prediction for a text is an important aspect in the process of teaching and learning, for reading in a foreign language as well as in one's native language, and continues to be a central area of research and practice. In this paper, we present our approach to readability assessment for Modern Standard Arabic (MSA) as a foreign language. Readability prediction is carried out using the Global Language Online Support System (GLOSS) corpus, which was developed for independent learners to improve their foreign language skills and was annotated with the Interagency Language Roundtable (ILR) scale. In this study, we introduce a frequency dictionary, which was developed to calculate frequency-based features. The approach gives results that surpass state-of-the-art results for Arabic.

Naoual Nassiri, Abdelhak Lakhouaja, Violetta Cavalli-Sforza
String Kernels for Polarity Classification: A Study Across Different Languages

The polarity classification task has the objective of automatically deciding whether a subjective text is positive or negative. A cross-domain setting implies the use of different domains for training and testing. Recently, string kernels, a method which does not employ domain adaptation techniques, have been proposed for this task. In this work, we analyse the performance of this method across four different languages: English, German, French and Japanese. Experimental results show the strong potential of this approach independently of the language.

Rosa M. Giménez-Pérez, Marc Franco-Salvador, Paolo Rosso
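To make the string-kernel idea concrete, here is a minimal character p-spectrum kernel used with a precomputed-kernel SVM; the kernel variant (p = 3), the toy reviews and the labels are assumptions, and the paper evaluates more elaborate string kernels in a cross-domain setting.

```python
# Illustrative sketch: character p-spectrum string kernel + precomputed-kernel SVM.
import numpy as np
from collections import Counter
from sklearn.svm import SVC

def spectrum(text, p=3):
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))

def kernel_matrix(texts_a, texts_b, p=3):
    spec_a = [spectrum(t, p) for t in texts_a]
    spec_b = [spectrum(t, p) for t in texts_b]
    K = np.zeros((len(spec_a), len(spec_b)))
    for i, sa in enumerate(spec_a):
        for j, sb in enumerate(spec_b):
            K[i, j] = sum(cnt * sb.get(ngram, 0) for ngram, cnt in sa.items())
    return K

train = ["what a wonderful film", "truly awful and boring",
         "wonderful acting, loved it", "boring plot, awful pacing"]
y = [1, 0, 1, 0]
test = ["a wonderful and moving story", "awful, simply awful"]

clf = SVC(kernel="precomputed").fit(kernel_matrix(train, train), y)
print(clf.predict(kernel_matrix(test, train)))
```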
Integration of Neural Machine Translation Systems for Formatting-Rich Document Translation

In this paper, we present our work on integrating neural machine translation systems in the document translation workflow of the cloud-based machine translation platform Tilde MT. We describe the functionality of the translation workflow and provide examples for formatting-rich document translation.

Mārcis Pinnis, Raivis Skadiņš, Valters Šics, Toms Miks

Software Requirement and Checking

Frontmatter
Using k-Means for Redundancy and Inconsistency Detection: Application to Industrial Requirements

Requirements are usually "hand-written" and suffer from several problems such as redundancy and inconsistency. These problems between requirements or sets of requirements negatively impact the success of final products. Manually processing these issues requires too much time and is very costly. In this paper we propose to automatically handle redundancy and inconsistency issues with a classification approach. The main contribution of this paper is the use of the k-means algorithm for redundancy and inconsistency detection in a new context, namely the Requirements Engineering context. We also introduce a preprocessing step based on Natural Language Processing techniques in order to see the impact of the latter on the k-means results. We use Part-Of-Speech (POS) tagging and noun chunking in order to detect technical business terms associated with the requirements documents that we analyze. We experiment with this approach on real industrial datasets. The results show the efficiency of the k-means clustering algorithm, especially with the preprocessing.

Manel Mezghani, Juyeon Kang, Florence Sèdes
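A hedged sketch of the k-means idea: requirements are reduced to noun chunks (a rough stand-in for the POS/chunking preprocessing), vectorised with TF-IDF, clustered, and requirements sharing a cluster are flagged as redundancy candidates. The example requirements, the cluster count and the use of spaCy are assumptions, not the industrial setup of the paper.

```python
# Illustrative sketch: noun-chunk preprocessing + k-means over TF-IDF vectors.
# Requires the spaCy model: python -m spacy download en_core_web_sm
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

nlp = spacy.load("en_core_web_sm")

def noun_chunk_text(requirement):
    doc = nlp(requirement)
    return " ".join(chunk.text for chunk in doc.noun_chunks)

requirements = [
    "The system shall log every failed login attempt.",
    "Each unsuccessful login attempt must be recorded by the system.",
    "The interface shall display the current battery level.",
    "Battery status shall be shown on the main screen.",
]

X = TfidfVectorizer().fit_transform([noun_chunk_text(r) for r in requirements])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for cluster in set(labels):
    group = [r for r, l in zip(requirements, labels) if l == cluster]
    if len(group) > 1:
        print("possible redundancy candidates:", group)
```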
How to Deal with Inaccurate Service Descriptions in On-The-Fly Computing: Open Challenges

The vision of On-The-Fly Computing is the automatic composition of existing software services. Based on natural language software descriptions, end users will receive compositions tailored to their needs. For this reason, the quality of the initial software service description strongly determines whether a software composition really meets the expectations of end users. In this paper, we expose open NLP challenges that need to be faced for service composition in On-The-Fly Computing.

Frederik S. Bäumer, Michaela Geierhos
Backmatter
Metadata
Title
Natural Language Processing and Information Systems
Edited by
Max Silberztein
Faten Atigui
Elena Kornyshova
Elisabeth Métais
Farid Meziane
Copyright year
2018
Electronic ISBN
978-3-319-91947-8
Print ISBN
978-3-319-91946-1
DOI
https://doi.org/10.1007/978-3-319-91947-8