
2011 | Book

Natural Language Processing and Information Systems

16th International Conference on Applications of Natural Language to Information Systems, NLDB 2011, Alicante, Spain, June 28-30, 2011. Proceedings

Edited by: Rafael Muñoz, Andrés Montoyo, Elisabeth Métais

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 16th International Conference on Applications of Natural Language to Information Systems, held in Alicante, Spain, in June 2011. The 11 revised full papers and 11 revised short papers presented together with 23 poster papers, 1 invited talk and 6 papers of the NLDB 2011 doctoral symposium were carefully reviewed and selected from 74 submissions. The papers address all aspects of Natural Language Processing and related areas, and present current research on topics such as natural language in conceptual modeling, NL interfaces for database querying/retrieval, NL-based integration of systems, large-scale online linguistic resources, applications of computational linguistics in information systems, management of textual databases, NL on data warehouses and data mining, NLP applications, as well as NL and ubiquitous computing.

Table of Contents

Frontmatter

Invited Talks


In the internet era we face the recurring problem of information overload, creating a major need for computer tools able to constantly distill, interpret, and organize the textual information available online. Over the last decade, a number of technological advances in the Natural Language Processing field have made it possible to analyze, discover, and machine-interpret huge volumes of online textual information. One clear example of these advances is the now ubiquitous presence of automatic translation tools on the Web. However, other NLP technologies still have a fundamental role to play in enabling access to textual information. In this talk I will present an overview of research in the area of Natural Language Processing aimed at facilitating access to textual information online. I will review the role of text summarization, question answering, and information extraction in making textual information more accessible. I will also discuss the issue of access to textual information for people with learning disabilities and present the ongoing work in this area within the Simplex project, which aims to produce a text simplification system that facilitates easy access to textual information.

Horacio Saggion

Full Papers

COMPENDIUM: A Text Summarization System for Generating Abstracts of Research Papers

This paper presents compendium, a text summarization system which has achieved good results in extractive summarization. Our main goal in this research is to extend it, suggesting a new approach for generating abstractive-oriented summaries of research papers. We conduct a preliminary analysis comparing the extractive version of compendium ($\textsc{compendium}_{E}$) with the new abstractive-oriented approach ($\textsc{compendium}_{E-A}$). The final summaries are evaluated according to three criteria (content, topic, and user satisfaction) and, from the results obtained, we conclude that compendium is appropriate for producing summaries of research papers automatically, going beyond the simple selection of sentences.

Elena Lloret, María Teresa Romá-Ferri, Manuel Palomar
Automatic Generation of Semantic Features and Lexical Relations Using OWL Ontologies

Semantic features are theoretical units of meaning-holding components used for representing word meaning. These features play a vital role in determining the kind of lexical relation that exists between words in a language. Although such a model of meaning representation has numerous applications in various fields, the manual derivation of semantic features is a cumbersome and time-consuming task. We aim to alleviate this process by developing an automated semantic feature extraction system based on ontological models. Such an approach provides explicit word meaning representation and enables the computation of lexical relations such as synonymy and antonymy.

This paper describes the design and implementation of a prototype system for automatically deriving componential formulae and computing lexical relations between words from a given OWL ontology. The system has been tested on a number of ontologies, both English and Arabic. Results of the evaluation indicate that the system was able to provide the necessary componential formulae for highly axiomatized ontologies. With regard to computing lexical relations, the system performs better when predicting antonyms, with an average precision of 40% and an average recall of 75%. We have also found a strong relation between ontology expressivity and system performance.

Maha Al-Yahya, Hend Al-Khalifa, Alia Bahanshal, Iman Al-Oudah
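The componential-analysis idea is concrete enough for a quick sketch. Below is a minimal, illustrative Python version in which word meanings are hand-coded signed feature sets; the features and the one-flipped-feature antonymy rule are textbook componential analysis, not the paper's OWL-derived formulae:

```python
# Componential-formula sketch: represent word meaning as signed semantic
# features and derive lexical relations from them -- synonyms share all
# features, antonyms differ in exactly one feature's polarity.
# The feature sets are invented; the paper derives them from OWL axioms.
formulae = {
    "man":   {"human": +1, "adult": +1, "male": +1},
    "woman": {"human": +1, "adult": +1, "male": -1},
    "boy":   {"human": +1, "adult": -1, "male": +1},
}

def relation(w1, w2):
    f1, f2 = formulae[w1], formulae[w2]
    if f1.keys() != f2.keys():
        return "unrelated"
    diffs = [k for k in f1 if f1[k] != f2[k]]
    if not diffs:
        return "synonymy"
    if len(diffs) == 1:
        return "antonymy"
    return "other"

print(relation("man", "woman"))  # antonymy (differ only in 'male')
print(relation("man", "boy"))    # antonymy (differ only in 'adult')
```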
EmotiNet: A Knowledge Base for Emotion Detection in Text Built on the Appraisal Theories

The automatic detection of emotions is a difficult task in Artificial Intelligence. In the field of Natural Language Processing, the challenge of automatically detecting emotion from text has been tackled from many perspectives. Nonetheless, the majority of approaches have considered only the word level. Because emotion is most of the time expressed not through specific words but by evoking situations that have a commonsense affective meaning, the performance of existing systems is low. This article presents the EmotiNet knowledge base – a resource for the detection of emotion from text based on commonsense knowledge about concepts, their interaction, and their affective consequences. The core of the resource is built from a set of self-reported affective situations and extended with external sources of commonsense knowledge on emotion-triggering concepts. The results of preliminary evaluations show that the approach is appropriate for capturing and storing the structure and semantics of real situations and predicting the emotional responses triggered by actions presented in text.

Alexandra Balahur, Jesús M. Hermida, Andrés Montoyo, Rafael Muñoz
Querying Linked Data Using Semantic Relatedness: A Vocabulary Independent Approach

Linked Data brings the promise of incorporating a new dimension to the Web, where the availability of Web-scale data can determine a paradigmatic transformation of the Web and its applications. However, together with its opportunities, Linked Data brings inherent challenges in the way users and applications consume the available data. Users consuming Linked Data on the Web, or on corporate intranets, should be able to search and query data spread over a potentially large number of heterogeneous, complex and distributed datasets. Ideally, a query mechanism for Linked Data should abstract users from the representation of data. This work investigates a vocabulary-independent natural language query mechanism for Linked Data, using an approach based on the combination of entity search, a Wikipedia-based semantic relatedness measure and spreading activation. The combination of these three elements in a query mechanism for Linked Data is a new contribution in the space. Wikipedia-based relatedness measures address the limitations of existing works, which are based on WordNet similarity measures or term expansion. Experimental results using the query mechanism to answer 50 natural language queries over DBpedia achieved a mean reciprocal rank of 61.4%, an average precision of 48.7% and an average recall of 57.2%, answering 70% of the queries.

André Freitas, João Gabriel Oliveira, Seán O’Riain, Edward Curry, João Carlos Pereira da Silva
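To make the query mechanism more tangible, here is a toy sketch of the spreading-activation step the abstract mentions; the graph, edge weights, decay, and threshold are all invented for illustration and stand in for the paper's entity-search and Wikipedia-relatedness machinery:

```python
# Minimal spreading-activation sketch over a toy Linked Data graph.
# Nodes and weights are illustrative; the paper combines this with
# entity search and a Wikipedia-based relatedness measure.
from collections import defaultdict

# graph[node] = list of (neighbor, semantic_relatedness) edges
graph = {
    "dbpedia:Berlin": [("dbpedia:Germany", 0.9), ("dbpedia:City", 0.6)],
    "dbpedia:Germany": [("dbpedia:Europe", 0.8)],
    "dbpedia:City": [],
    "dbpedia:Europe": [],
}

def spread(seeds, decay=0.5, threshold=0.1):
    """Propagate activation from seed entities along weighted edges."""
    activation = defaultdict(float)
    frontier = dict(seeds)  # node -> initial activation
    while frontier:
        next_frontier = {}
        for node, act in frontier.items():
            activation[node] += act
            for neighbor, weight in graph.get(node, []):
                out = act * weight * decay
                if out > threshold:  # prune weak activations
                    next_frontier[neighbor] = next_frontier.get(neighbor, 0.0) + out
        frontier = next_frontier
    return sorted(activation.items(), key=lambda kv: -kv[1])

print(spread({"dbpedia:Berlin": 1.0}))
```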
Extracting Explicit and Implicit Causal Relations from Sparse, Domain-Specific Texts

Various supervised algorithms for mining causal relations from large corpora exist. These algorithms have focused on relations explicitly expressed with causal verbs, e.g. “to cause”. However, the challenges of extracting causal relations from domain-specific texts have been overlooked. Domain-specific texts are rife with causal relations that are implicitly expressed using verbal and non-verbal patterns, e.g. “reduce”, “drop in”, “due to”. Also, readily available resources to support supervised algorithms are nonexistent in most domains. To address these challenges, we present a novel approach to causal relation extraction. Our approach is minimally supervised, alleviating the need for annotated data. Also, it identifies both explicit and implicit causal relations. Evaluation results revealed that our technique achieves state-of-the-art performance in extracting causal relations from domain-specific, sparse texts. The results also indicate that many of the domain-specific relations were unclassifiable in existing taxonomies of causality.

Ashwin Ittoo, Gosse Bouma
Topics Inference by Weighted Mutual Information Measures Computed from Structured Corpus

This paper proposes a novel topic inference framework built on the scalability and adaptability of mutual information (MI) techniques. The framework is designed to systematically construct a more robust language model (LM) for topic-oriented search terms in the domain of electronic programming guides (EPG) for broadcast TV programs. The topic inference system identifies the most relevant topics implied by a search term, based on a simplified MI-based classifier trained on a highly structured XML-based text corpus derived from continuously updated EPG data feeds. The proposed framework is evaluated against a set of EPG-specific queries from a large user population, collected from a real-world web-based IR system. The MI-based topic inference system achieves 98% recall and 82% precision on the test set.

Harry Chang
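As a rough illustration of an MI-based topic classifier of the kind described, the sketch below scores topics by summed pointwise mutual information estimated from toy (term, topic) counts; the counts, terms, and topics are invented, not EPG data:

```python
# Toy mutual-information topic classifier: score each topic by summing
# pointwise mutual information between query terms and the topic,
# estimated from (term, topic) co-occurrence counts.
import math
from collections import Counter

pair_counts = Counter({
    ("soccer", "sports"): 50, ("match", "sports"): 30,
    ("soccer", "news"): 2,    ("election", "news"): 40,
})
term_counts = Counter()
topic_counts = Counter()
for (term, topic), c in pair_counts.items():
    term_counts[term] += c
    topic_counts[topic] += c
total = sum(pair_counts.values())

def pmi(term, topic):
    joint = pair_counts[(term, topic)]
    if joint == 0:
        return 0.0
    return math.log((joint / total) /
                    ((term_counts[term] / total) * (topic_counts[topic] / total)))

def infer_topic(query_terms):
    scores = {t: sum(pmi(w, t) for w in query_terms if w in term_counts)
              for t in topic_counts}
    return max(scores, key=scores.get)

print(infer_topic(["soccer", "match"]))  # -> 'sports'
```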
Improving Subtree-Based Question Classification Classifiers with Word-Cluster Models

Question classification has been recognized as a very important step for many natural language applications (e.g., question answering). Subtree mining has been shown [10] to be helpful for the question classification problem: the authors empirically showed that subtree features obtained by subtree mining were able to improve the performance of question classification for boosting and maximum entropy models. In this paper, our first goal is to investigate whether subtree mining features are useful for structured support vector machines. Secondly, to make the proposed models more robust, we incorporate subtree features with word-cluster models obtained from a large collection of text documents. Experimental results show that the use of word-cluster models with subtree mining can significantly improve the performance of the proposed question classification models.

Le Minh Nguyen, Akira Shimazu
Data-Driven Approach Based on Semantic Roles for Recognizing Temporal Expressions and Events in Chinese

This paper addresses the automatic recognition of temporal expressions and events in Chinese. For this language, these tasks are still in an exploratory stage and high-performance approaches are needed. Recently, in the TempEval-2 evaluation exercise, corpora annotated in TimeML were released for different languages including Chinese; however, no systems were evaluated in this language. We present a data-driven approach for addressing these tasks in Chinese, TIRSemZH, which uses semantic roles, in addition to morphosyntactic information, as features. The performance achieved by TIRSemZH over the TempEval-2 Chinese data (85% F1) is comparable to the state of the art for other languages. Therefore, the method can be used to develop high-performance temporal processing systems, which are currently not available for Chinese. Furthermore, the results verify that when semantic roles are applied, the performance of a baseline based only on morphosyntax improves. This supports and extends the conclusions reached by related works for other languages.

Hector Llorens, Estela Saquete, Borja Navarro, Liu Li, Zhongshi He
Information Retrieval Techniques for Corpus Filtering Applied to External Plagiarism Detection

We present a set of approaches for corpus filtering in the context of external plagiarism detection. Producing filtered sets, and hence limiting the problem’s search space, can improve performance and is used today in many real-world applications such as web search engines. For document plagiarism detection, the database of documents to match a suspicious candidate against is potentially very large, so it becomes advisable to apply filtered-set generation techniques. The approaches we have implemented include information retrieval methods and a document similarity measure based on a variant of tf-idf. Furthermore, we perform textual comparisons as well as a semantic similarity analysis in order to capture higher levels of obfuscation.

Daniel Micol, Óscar Ferrández, Rafael Muñoz
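A minimal sketch of the filtering step, using scikit-learn's standard tf-idf and cosine similarity as a stand-in for the paper's tf-idf variant (the corpus and k below are illustrative):

```python
# Corpus-filtering sketch: rank candidate source documents by cosine
# similarity of tf-idf vectors against the suspicious document, and keep
# only the top-k for the (expensive) detailed plagiarism comparison.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["the original source text about plagiarism detection",
          "an unrelated document about cooking recipes",
          "another candidate source discussing detection of reuse"]
suspicious = "the suspicious text that may reuse the original source"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)
query_vec = vectorizer.transform([suspicious])

scores = cosine_similarity(query_vec, doc_matrix).ravel()
top_k = scores.argsort()[::-1][:2]  # filtered set: 2 most similar docs
print([(corpus[i][:30], round(float(scores[i]), 3)) for i in top_k])
```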
Word Sense Disambiguation: A Graph-Based Approach Using N-Cliques Partitioning Technique

This paper presents a new approach to solving semantic ambiguity using an adaptation of the Cliques Partitioning Technique to N distance. This new approach is able to identify sets of strongly related senses using a multidimensional graph based on different resources: WordNet Domains, SUMO and WordNet Affects. As a result, each clique contains relevant information used to extract the correct sense of each word. In order to evaluate our approach, several experiments have been conducted using the data set of the “English All Words” task of Senseval-2, obtaining promising results.

Yoan Gutiérrez, Sonia Vázquez, Andrés Montoyo
OntoFIS as a NLP Resource in the Drug-Therapy Domain: Design Issues and Solutions Applied

In the Health domain, and specifically in the drug-therapy domain, several semantically annotated informational resources are under development in order to improve access to information for different types of users. One of the existing development lines is oriented to reusing the effort spent on the design of existing resources on the Web and obtaining knowledge-based resources for natural language processing (NLP) tasks. In this line, OntoFIS was designed as an NLP resource aimed at filling the gap of multilingual knowledge-based resources within the domain. The design process used for building OntoFIS merges the best approaches of several ontology design methodologies. However, given the characteristics of the drug-therapy domain, whose knowledge needs are very precise, the process of formalising the domain knowledge led to a set of issues. This paper discusses the main issues found and the solutions analysed and applied in each case.

María Teresa Romá-Ferri, Jesús M. Hermida, Manuel Palomar

Short Papers

Exploiting Unlabeled Data for Question Classification

In this paper, we introduce a kernel-based approach to question classification. We employed a kernel function based on latent semantic information acquired from Wikipedia. This kernel allows the inclusion of external semantic knowledge in the supervised learning process. We obtained a highly effective question classifier by combining this knowledge with a bag-of-words approach by means of composite kernels. As the semantic information is acquired from unlabeled text, our system can be easily adapted to different languages and domains. We tested it on a parallel corpus of English and Spanish questions.

David Tomás, Claudio Giuliano
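Composite kernels of the kind the abstract describes can be sketched in a few lines: compute two Gram matrices, add them, and train an SVM with a precomputed kernel. Here truncated SVD over tf-idf stands in for the paper's Wikipedia-derived latent semantic kernel; the questions and labels are toy data:

```python
# Composite-kernel sketch: sum a bag-of-words kernel and a latent
# semantic kernel, then train an SVM on the combined Gram matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC

questions = ["who wrote hamlet", "where is paris",
             "who painted guernica", "where is the nile"]
labels = ["person", "location", "person", "location"]

X = TfidfVectorizer().fit_transform(questions)
K_bow = (X @ X.T).toarray()                      # linear BoW kernel
Z = TruncatedSVD(n_components=2).fit_transform(X)
K_sem = Z @ Z.T                                  # latent semantic kernel

K = K_bow + K_sem                                # composite kernel
clf = SVC(kernel="precomputed").fit(K, labels)
print(clf.predict(K))  # training-set predictions, just to show the API
```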
A System for Adaptive Information Extraction from Highly Informal Text

We present a first version of ado, a system for Adaptive Data Organization, that is, information extraction from highly informal text: short text messages, classified ads, tweets, etc. It is built on a modular architecture that transparently integrates off-the-shelf NLP tools, general string-processing procedures, machine learning, and processes tailored to a domain.

The system is called adaptive because it implements a semi-supervised approach: knowledge resources are initially built by hand and then updated automatically by feeds from the corpus. This allows ado to adapt to rapidly changing user-generated language.

In order to estimate the impact of future developments, we have carried out an indicative evaluation of the system with a small corpus of Spanish classified advertisements in the real estate domain. This evaluation shows that tokenization and chunking can be resolved well by simple techniques, but normalization and morphosyntactic and semantic tagging require either more complex techniques or a bigger training corpus.

Laura Alonso i Alemany, Rafael Carrascosa
Pythia: Compositional Meaning Construction for Ontology-Based Question Answering on the Semantic Web

In this paper we present Pythia, an ontology-based question answering system. It compositionally constructs meaning representations using a vocabulary aligned to the vocabulary of a given ontology. In doing so it relies on a deep linguistic analysis, which allows it to construct formal queries even for complex natural language questions (e.g., involving quantification and superlatives).

Christina Unger, Philipp Cimiano
‘twazn me!!! ;(’ Automatic Authorship Analysis of Micro-Blogging Messages

In this paper we propose a set of stylistic markers for automatically attributing authorship to micro-blogging messages. The proposed markers include highly personal and idiosyncratic editing options, such as ‘emoticons’, interjections, punctuation, abbreviations and other low-level features. We evaluate the ability of these features to help discriminate the authorship of Twitter messages among three authors. For that purpose, we train SVM classifiers to learn stylometric models for each author based on different combinations of the groups of stylistic features that we propose. Results show relatively good performance in attributing authorship of micro-blogging messages (F = 0.63) using this set of features, even when training the classifiers with as few as 60 examples from each author (F = 0.54). Additionally, we conclude that emoticons are the most discriminating features in these groups.

Rui Sousa Silva, Gustavo Laboreiro, Luís Sarmento, Tim Grant, Eugénio Oliveira, Belinda Maia
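A toy version of the pipeline: extract a few of the proposed low-level stylistic features and train a linear SVM. The exact feature set, regexes, and messages below are illustrative assumptions, not the paper's feature inventory:

```python
# Stylometric authorship sketch: simple low-level features (emoticons,
# punctuation, abbreviations, capitalization) fed to a linear SVM.
import re
import numpy as np
from sklearn.svm import LinearSVC

def stylometric_features(msg):
    return [
        len(re.findall(r"[:;]-?[()DP]", msg)),              # emoticons
        msg.count("!") + msg.count("?"),                     # punctuation
        len(re.findall(r"\b(lol|omg|btw)\b", msg, re.I)),    # abbreviations
        sum(ch.isupper() for ch in msg) / max(len(msg), 1),  # caps ratio
    ]

messages = ["omg lol that was great!!! :)", "I agree with the report.",
            "btw see you there ;)", "The meeting is at noon?"]
authors = ["a", "b", "a", "b"]

X = np.array([stylometric_features(m) for m in messages])
clf = LinearSVC().fit(X, authors)
print(clf.predict(X))
```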
Opinion Classification Techniques Applied to a Spanish Corpus

Sentiment analysis is a new and challenging task related to Text Mining and Natural Language Processing. Although there is some current work, most of it focuses only on English texts. Web pages, information and opinions on the Internet are increasing every day, and English is not the only language used to write them. Other languages like Spanish are increasingly present, so we have carried out experiments over a Spanish film-review corpus. In this paper we present several experiments using five classification algorithms (SVM, Naïve Bayes, BBR, KNN, C4.5). The results obtained are very promising and encourage us to continue investigating along this line.

Eugenio Martínez-Cámara, M. Teresa Martín-Valdivia, L. Alfonso Ureña-López
Prosody Analysis of Thai Emotion Utterances

Emotion speech synthesis is the most important process for generating natural utterances in a text-to-speech system. Interjection utterances in the Thai language are employed to express a number of emotions. This paper presents a study of the prosody parameters of interjection utterances clipped from Thai utterances in movies. The Thai emotional utterances from various movies have been analyzed and classified into 8 emotional types: neutral, anger, happiness, sadness, fear, pleasantness, unpleasantness and surprise. The classification of prosodic features is based on fundamental frequency (F0), intensity and duration. This paper compares the prosodic features of the Thai language with those of other languages including English, Italian, French, Spanish and Arabic. The comparison results show that there are significant differences in prosodic features for each emotion in each language. Therefore, the quality of a text-to-speech system depends on the prosodic analysis of each language.

Sukanya Yimngam, Wichian Premchaisawadi, Worapoj Kreesuradej
Repurposing Social Tagging Data for Extraction of Domain-Level Concepts

The World Wide Web, the world’s largest resource for information, has evolved from organizing information using controlled, top-down taxonomies to a bottom-up approach that emphasizes assigning meaning to data via mechanisms such as the Social Web (Web 2.0). Tagging adds meta-data (weak semantics) to the content available on the web. This research investigates the potential for repurposing this layer of meta-data. We propose a multi-phase approach that exploits user-defined tags to identify and extract domain-level concepts. We operationalize this approach and assess its feasibility by applying it to a publicly available tag repository. The paper describes insights gained from implementing and applying the heuristics contained in the approach, as well as challenges and implications of repurposing tags for the extraction of domain-level concepts.

Sandeep Purao, Veda C. Storey, Vijayan Sugumaran, Jordi Conesa, Julià Minguillón, Joan Casas
Ontology-Guided Approach to Feature-Based Opinion Mining

The boom of the Social Web has had a tremendous impact on a number of different research topics. In particular, the possibility to extract various kinds of added-value, informational elements from users’ opinions has attracted researchers from the information retrieval and computational linguistics fields. However, current approaches to so-called opinion mining suffer from a series of drawbacks. In this paper we propose an innovative methodology for opinion mining that brings together traditional natural language processing techniques with sentiment analysis processes and Semantic Web technologies. The main goals of this methodology are to improve feature-based opinion mining by employing ontologies in the selection of features and to provide a new method for sentiment analysis based on vector analysis. The preliminary experimental results seem promising when compared against traditional approaches.

Isidro Peñalver-Martínez, Rafael Valencia-García, Francisco García-Sánchez
A Natural Language Interface for Data Warehouse Question Answering

Business Intelligence (BI) aims at providing methods and tools that lead to quick decisions from trusted data. Such advanced tools require some technical knowledge of how to formulate queries. We propose a natural language (NL) interface for a Data Warehouse-based Question Answering system, which allows users to pose questions expressed in natural language. The proposed system is fully automated, resulting in a low Total Cost of Ownership. We aim to demonstrate the importance of identifying already existing semantics and using Text Mining techniques on the Web to move toward the users’ needs.

Nicolas Kuchmann-Beauger, Marie-Aude Aufaure
Graph-Based Bilingual Sentence Alignment from Large Scale Web Pages

Sentence alignment is an enabling technology that extracts masses of bilingual corpora automatically from the vast and ever-growing set of Web pages. In this paper, we propose a novel graph-based sentence alignment approach. Compared with existing approaches, ours is more resistant to the noise and structural diversity of Web pages because it considers sentence structural features. We formulate sentence alignment as a matching problem between nodes (bilingual sentences) of a bipartite graph. The maximum-weighted bipartite graph matching algorithm is applied to sentence alignment for globally optimal matching. Moreover, sentence merging and aligned-sentence pattern detection are used to deal with, respectively, the many-to-many matching issue and the low probability assigned to aligned sentences with few mutually translated words. We achieve precision over 85% and recall over 80% on manually annotated data, and 1 million aligned sentence pairs with over 82% accuracy are extracted from 0.8 million bilingual pages.

Yihe Zhu, Haofen Wang, Xixiu Ouyang, Yong Yu
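The global-matching step can be sketched with an off-the-shelf assignment solver: treat sentences as nodes of a bipartite graph and find the maximum-weight matching. The similarity matrix and threshold below are invented; the paper scores pairs with translation and structural features and additionally handles many-to-many cases by sentence merging:

```python
# Global sentence-alignment sketch via maximum-weight bipartite matching.
# scipy's linear_sum_assignment minimizes cost, so we negate the scores.
import numpy as np
from scipy.optimize import linear_sum_assignment

# sim[i][j] = alignment score between source sentence i, target sentence j
sim = np.array([[0.9, 0.1, 0.2],
                [0.2, 0.8, 0.3],
                [0.1, 0.3, 0.7]])

rows, cols = linear_sum_assignment(-sim)  # maximize total similarity
for i, j in zip(rows, cols):
    if sim[i, j] > 0.5:  # keep only confident alignments
        print(f"src[{i}] <-> tgt[{j}] score={sim[i, j]}")
```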

Posters

On Specifying Requirements Using a Semantically Controlled Representation

Requirements are typically specified in a natural language (NL) such as English and then analyzed by analysts and developers to generate a formal software design/model. However, English is ambiguous, and requirements specified in English can result in erroneous and absurd software designs. We propose a semantically controlled representation based on SBVR for specifying requirements. The SBVR-based controlled representation not only can result in accurate and consistent software models but is also machine-processable, because SBVR has a pure mathematical foundation. We also introduce a Java-based implementation of the presented approach as a proof of concept.

Imran Sarwar Bajwa, M. Asif Naeem
An Unsupervised Method to Improve Spanish Stemmer

We evaluate the effectiveness of using our edit-distance algorithm to improve an unsupervised, language-independent stemming method. The main idea is to create morphological families by automatically grouping words using our distance; based on that grouping, we perform a stemming process. We evaluated the capacity of the edit-distance algorithm in the task of word clustering and the ability of our method to generate the correct stem for Spanish. A good result (98% precision) for the creation of morphological families and a remarkable 99.85% of correct stemming were obtained.

Antonio Fernández, Josval Díaz, Yoan Gutiérrez, Rafael Muñoz
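A minimal sketch of the grouping-then-stemming idea, using difflib's similarity ratio as a stand-in for the authors' edit-distance measure (the threshold and word list are illustrative):

```python
# Morphological-family sketch: greedily group words whose string
# similarity exceeds a threshold, then take the longest common prefix
# of each family as its stem.
import os
from difflib import SequenceMatcher

def edit_similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def group_families(words, threshold=0.7):
    families = []
    for w in words:
        for fam in families:
            if edit_similarity(w, fam[0]) >= threshold:
                fam.append(w)
                break
        else:
            families.append([w])
    return families

words = ["niño", "niña", "niños", "cantar", "cantaba", "cantando"]
for fam in group_families(words):
    stem = os.path.commonprefix(fam)
    print(stem, "<-", fam)
```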
Using Semantic Classes as Document Keywords

Keyphrases are mainly words that capture the main topics of a document. We think that semantic classes can be used as keyphrases for a text. We have developed a semantic class–based WSD system that can tag the words of a text with their semantic class. A method is developed to compare the semantic classes of the words of a text with the correct ones based on statistical measures. We find that the evaluation of semantic classes considered as keyphrases is very close to 100% in most cases.

Rubén Izquierdo, Armando Suárez, German Rigau
Automatic Term Identification by User Profile for Document Categorisation in Medline

We show how term extraction methods such as AMTEx and MMTx can be used for the automatic categorisation of medical documents by user profile (novice users and experts). This is achieved by mapping document terms to external lexical resources such as WordNet and MeSH (the medical thesaurus of the NLM).

Angelos Hliaoutakis, Euripides G. M. Petrakis
AZOM: A Persian Structured Text Summarizer

In this paper we propose a summarization approach, nicknamed AZOM, that combines statistical and conceptual properties of text and, with regard to document structure, extracts a summary of the text. AZOM is also capable of summarizing unstructured documents. The proposed approach is tailored to the Persian language but can easily be applied to other languages. The empirical results show comparatively superior results to common structured-text summarizers, as well as to existing Persian text summarizers.

Azadeh Zamanifar, Omid Kashefi
POS Tagging in Amazighe Using Support Vector Machines and Conditional Random Fields

The aim of this paper is to present the first Amazighe POS tagger. Very few linguistic resources have been developed so far for Amazighe, and we believe that the development of a POS tagging tool is the first step needed for automatic text processing. The data used have been manually collected and annotated. We have used state-of-the-art supervised machine learning approaches to build our POS-tagging models. The obtained accuracy reached 92.58%, and we have used the 10-fold technique to further validate our results.

Mohamed Outahajala, Yassine Benajiba, Paolo Rosso, Lahbib Zenkouar
Towards Ontology-Driven End-User Composition of Personalized Mobile Services

With mobile devices being an integral part of the daily life of millions of users having little or no ICT knowledge, mobile services are being developed to save them from difficult or tedious tasks without compromising their needs. Starting with a number of real life scenarios we have been working towards supporting end-users in managing their services in an efficient and user-friendly manner. We observe that these scenarios consist of sub-tasks that can be solved with collaborative service units. Therefore, a composition of such service units will serve the needs of the end-user for the complete scenario. We envisage that a visual formalism and tools can be developed to support these end-users in creating such service compositions. Moreover, methodologies and middleware can significantly reduce the complexity of developing composite services. This paper focuses on the role of ontologies within that context. Ontologies can assist the users in selecting appropriate services and setting composition parameters within a composition tool. For our prototype demonstration system we target the open source Android cellphone architecture supporting a number of different runtime platforms.

Rune Sætre, Mohammad Ullah Khan, Erlend Stav, Alfredo Perez Fernandez, Peter Herrmann, Jon Atle Gulla
Style Analysis of Academic Writing

This paper presents an approach that performs a style analysis of academic writing in terms of formal voice, readability and scientific language. Our intention is to analyze academic writing style as feedback for authors and editors. The extracted features of a document collection are used to create Self-Organizing Maps, which are the interim results used to generate reports in our Full Automatic Paper Analysis System (Fapas). To evaluate this method, the system has to solve different tasks to verify the informative value of the generated maps and reports.

Thomas Scholz, Stefan Conrad
Towards the Detection of Cross-Language Source Code Reuse

The Internet has made available huge amounts of information, including source code. Source code repositories and, in general, programming-related websites facilitate its reuse. In this work, we propose a simple approach to the detection of cross-language source code reuse, a barely investigated problem. Our preliminary experiments, based on character n-gram comparison, show that considering different sections of the code (i.e., comments, code, reserved words, etc.) leads to different results. Considering three programming languages (C++, Java, and Python), the best result is obtained when comments are discarded and the entire source code is considered.

Enrique Flores, Alberto Barrón-Cedeño, Paolo Rosso, Lidia Moreno
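The character n-gram comparison is easy to sketch: strip comments, build character 3-gram profiles, and compare them with cosine similarity. The regexes and code snippets below are simplistic illustrations, not the paper's exact preprocessing:

```python
# Cross-language code-reuse sketch: compare two source files as bags of
# character 3-grams with cosine similarity, stripping comments first
# (the setting the paper found best).
import re
from collections import Counter
from math import sqrt

def strip_comments(code):
    code = re.sub(r"//.*|#.*", "", code)               # line comments
    return re.sub(r"/\*.*?\*/", "", code, flags=re.S)  # block comments

def ngram_profile(code, n=3):
    code = re.sub(r"\s+", " ", code.lower())
    return Counter(code[i:i + n] for i in range(len(code) - n + 1))

def cosine(p, q):
    dot = sum(c * q[g] for g, c in p.items())
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0

java = "int sum(int[] a){int s=0;for(int x:a)s+=x;return s;} // sum"
python = "def sum_list(a):\n    s = 0\n    for x in a:\n        s += x\n    return s"
print(cosine(ngram_profile(strip_comments(java)),
             ngram_profile(strip_comments(python))))
```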
Effectively Mining Wikipedia for Clustering Multilingual Documents

This paper presents Multilingual Document Clustering (MDC) using Wikipedia on comparable corpora. In particular, we utilized the cross-lingual links, categories, outlinks and Infobox information present in Wikipedia to enrich the document representation. We used the Bisecting k-means algorithm for clustering multilingual documents based on document similarities. Experiments are conducted using the English and Hindi Wikipedia. We considered the English and Hindi datasets provided by FIRE’10 for the ad-hoc cross-lingual document retrieval task on Indian languages. No language-specific tools are used, which makes the proposed approach easily extendable to other languages. The system is evaluated using F-score and Purity measures, and the results obtained are encouraging.

N. Kiran Kumar, G. S. K. Santosh, Vasudeva Varma
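Bisecting k-means itself is simple to sketch: repeatedly split the largest cluster with 2-means until k clusters remain. The random vectors below stand in for the Wikipedia-enriched document representations used in the paper:

```python
# Bisecting k-means sketch: split the largest cluster in two until
# k clusters remain.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)
        labels = KMeans(n_clusters=2, random_state=seed, n_init=10).fit_predict(X[idx])
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters

rng = np.random.default_rng(0)
X = rng.random((12, 5))  # 12 "documents", 5 features
for c in bisecting_kmeans(X, k=3):
    print(c)
```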
A Comparative Study of Classifier Combination Methods Applied to NLP Tasks

There are many classification tools that can be used for various NLP tasks, although none of them can be considered the best of all, since each one has a particular list of virtues and defects. Combination methods can serve both to maximize the strengths of the base classifiers and to reduce the errors caused by their defects, improving the results in terms of accuracy. We present a comparative study of the most relevant combination methods, which shows that combination seems to be a robust and reliable way of improving our results.

Fernando Enríquez, José A. Troyano, Fermín L. Cruz, F. Javier Ortega
TOES: A Taxonomy-Based Opinion Extraction System

Feature-based opinion extraction is a task related to opinion mining and information extraction which consists of automatically extracting feature-level representations of opinions from subjective texts. In recent years, some researchers have proposed domain-independent solutions to this task. Most of them identify the feature being reviewed by a set of words from the text. Instead, we propose a domain-adaptable opinion extraction system based on feature taxonomies (a semantic representation of the opinable parts and attributes of an object) that extracts feature-level opinions and maps them into the taxonomy. The opinions thus obtained can be easily aggregated for summarization and visualization. In order to increase the precision and recall of the extraction system, we define a set of domain-specific resources which capture valuable knowledge about how people express opinions on each feature from the taxonomy for a given domain. These resources are automatically induced from a set of annotated documents. The modular design of our architecture allows building either domain-specific or domain-independent opinion extraction systems. According to the experimental results, using the domain-specific resources leads to far better precision and recall, at the expense of some manual effort.

Fermín L. Cruz, José A. Troyano, F. Javier Ortega, Fernando Enríquez
Combining Textual Content and Hyperlinks in Web Spam Detection

In this work, we tackle the problem of spam detection on the Web. Spam web pages have become a problem for Web search engines due to the negative effects this phenomenon can have on their retrieval results. Our approach is based on a random-walk algorithm that obtains a ranking of pages according to their relevance and their spam likelihood. We introduce the novelty of taking into account the content of the web pages to characterize the web graph and to obtain an a priori estimation of the spam likelihood of each page. Our graph-based algorithm computes two scores for each node in the graph. Intuitively, these values represent how bad or good (spam-like or not) a web page is, according to its textual content and its relations in the graph. Our experiments show that the proposed technique outperforms other link-based techniques for spam detection.

F. Javier Ortega, Craig Macdonald, José A. Troyano, Fermín L. Cruz, Fernando Enríquez
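In the spirit of the described algorithm (though not its exact formulation), the sketch below propagates a "bad" and a "good" score from content-based priors along a toy link graph, PageRank-style; the graph, priors, and damping factor are all invented:

```python
# Two-score propagation sketch: spread spam-likelihood ("bad") and
# legitimacy ("good") scores from content-based priors over the links.
links = {"a": ["b"], "b": ["c"], "c": ["a"], "spam1": ["a"], "spam2": ["spam1"]}
prior_bad = {"a": 0.1, "b": 0.1, "c": 0.1, "spam1": 0.9, "spam2": 0.9}

def propagate(prior, damping=0.85, iters=20):
    score = dict(prior)
    for _ in range(iters):
        new = {}
        for page in score:
            inlinks = [p for p, outs in links.items() if page in outs]
            flow = sum(score[p] / len(links[p]) for p in inlinks)
            new[page] = (1 - damping) * prior[page] + damping * flow
        score = new
    return score

bad = propagate(prior_bad)
good = propagate({p: 1 - v for p, v in prior_bad.items()})
for page in links:
    print(page, "spam-like" if bad[page] > good[page] else "ok")
```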
Map-Based Filters for Fuzzy Entities in Geographical Information Retrieval

Many users employ vague geographical expressions to query Information Retrieval systems. These fuzzy entities appear neither in gazetteers nor in geographical databases. Searches such as “Ski resorts in north-central Spain” or “Restaurants near the Teatro Real of Madrid” often do not get the expected results, mainly due to the difficulty of disambiguating expressions like “north of” or “near”. This paper presents a first approach to dealing with this kind of fuzzy expression, with the aim of improving the coverage and accuracy of traditional Information Retrieval systems. Our approach is based on the use of raster images as geographic filters to determine the relevance of documents depending on the location of the places referenced in them.

Fernando S. Peregrino, David Tomás, Fernando Llopis
DDIExtractor: A Web-Based Java Tool for Extracting Drug-Drug Interactions from Biomedical Texts

A drug-drug interaction (DDI) occurs when one drug influences the level or activity of another drug. The detection of DDIs is an important research area in patient safety, since these interactions can become very dangerous and increase health care costs. Although there are several databases and web tools providing information on DDIs to patients and health-care professionals, these resources are not comprehensive, because many DDIs are reported only in the biomedical literature. This paper presents DDIExtractor, the first tool for detecting drug-drug interactions in biomedical texts. The tool allows users to search by keywords in the Medline 2010 baseline database and then detect drugs and DDIs in any retrieved document.

Daniel Sánchez-Cisneros, Isabel Segura-Bedmar, Paloma Martínez
Geo-Textual Relevance Ranking to Improve a Text-Based Retrieval for Geographic Queries

Geographic Information Retrieval is an active and growing research area that focuses on the retrieval of textual documents according to geographic criteria of relevance. In this work, we propose a reranking function for these systems that combines the retrieval status value calculated by the information retrieval engine with the geographical similarity between the document and the query. The obtained results show that the proposed ranking function always outperforms text-based baseline approaches.

José M. Perea-Ortega, Miguel A. García-Cumbreras, L. Alfonso Ureña-López, Manuel García-Vega
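The reranking function can be as simple as a weighted linear combination of the retrieval status value and a geographic similarity score. The sketch below assumes both scores are normalized to [0,1]; the weight alpha and the example values are illustrative (the paper evaluates concrete geo-similarity functions):

```python
# Reranking sketch: combine the text engine's retrieval status value
# (rsv) with a geographic similarity score via a weighted sum.
def rerank(results, alpha=0.7):
    """results: list of (doc_id, rsv, geo_sim), scores in [0,1]."""
    scored = [(doc, alpha * rsv + (1 - alpha) * geo) for doc, rsv, geo in results]
    return sorted(scored, key=lambda x: -x[1])

results = [("d1", 0.9, 0.1), ("d2", 0.6, 0.9), ("d3", 0.4, 0.8)]
print(rerank(results))  # d2 overtakes d1 thanks to its geo score
```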
BioOntoVerb Framework: Integrating Top Level Ontologies and Semantic Roles to Populate Biomedical Ontologies

Ontology population is a knowledge acquisition activity that relies on (semi-)automatic methods to transform unstructured, semi-structured and structured data sources into instance data. A semantic role is a relationship between a syntactic constituent and a predicate that defines the role of a verbal argument in the event expressed by the verb. In this work, we describe a framework in which top-level ontologies that define the basic semantic relations in biomedical domains are mapped onto semantic role labeling resources in order to develop a tool for ontology population from biomedical natural language text. This framework has been validated using an ontology extracted from the GENIA corpus.

Juana María Ruiz-Martínez, Rafael Valencia-García, Rodrigo Martínez-Béjar
Treo: Best-Effort Natural Language Queries over Linked Data

Linked Data promises an unprecedented availability of data on the Web. However, this vision comes with the associated challenges of querying highly heterogeneous and distributed data. In order to query Linked Data on the Web today, end-users need to be aware of which datasets potentially contain the data and of the data model behind these datasets. This query paradigm, deeply attached to the traditional perspective of structured queries over databases, does not suit the heterogeneity and scale of the Web, where it is impractical for data consumers to have an a priori understanding of the structure and location of available datasets. This work describes Treo, a best-effort natural language query mechanism for Linked Data, which focuses on the problem of bridging the semantic gap between end-user natural language queries and Linked Datasets.

André Freitas, João Gabriel Oliveira, Seán O’Riain, Edward Curry, João Carlos Pereira da Silva
Evaluating EmotiBlog Robustness for Sentiment Analysis Tasks

EmotiBlog is a corpus labelled with the homonymous annotation schema, designed for detecting subjectivity in new textual genres. Preliminary research demonstrated its relevance as a Machine Learning resource for detecting opinionated data. In this paper we compare EmotiBlog with the JRC corpus in order to check the robustness of the EmotiBlog annotation. For this research we concentrate on its coarse-grained labels. We carry out extensive ML experimentation, also including lexical resources. The results obtained are similar to those obtained with the JRC corpus, demonstrating the validity of EmotiBlog as a resource for the SA task.

Javi Fernández, Ester Boldrini, José Manuel Gómez, Patricio Martínez-Barco
Syntax-Motivated Context Windows of Morpho-Lexical Features for Recognizing Time and Event Expressions in Natural Language

We present an analysis of morpho-lexical features for learning SVM models that recognize TimeML time and event expressions. Over the TempEval-2 data, we evaluate the features word, lemma, and PoS in isolation, in static context windows of different sizes, and in the syntax-motivated dynamic-context windows defined in this paper. The results show that word, lemma, and PoS introduce complementary advantages and their combination achieves the best performance; this performance is improved using context, and, with dynamic context, timex recognition reaches state-of-the-art performance. Although more complex approaches improve efficacy, morpho-lexical features can be obtained more efficiently and show reasonable efficacy.

Hector Llorens, Estela Saquete, Borja Navarro
Tourist Face: A Contents System Based on Concepts of Freebase for Access to the Cultural-Tourist Information

In more and more application areas, large collections of digitized multimedia information are gathered and have to be maintained (e.g., in tourism, medicine, etc.). Therefore, there is an increasing demand for tools and techniques supporting the management and use of digital multimedia data. Furthermore, new large collections of data become available every day. In this paper we present Tourist Face, a system aimed at integrating text analysis techniques into the paradigm of multimedia information, specifically tourist multimedia information.

Particularly relevant components in its development are Freebase, a large collaborative knowledge base, and the General Architecture for Text Engineering (GATE), a system for text processing. The platform architecture has been built with scalability in mind, with the following objectives: to allow the integration of different natural language processing techniques, to expand the sources from which information extraction can be performed, and to ease the integration of new user interfaces.

Rafael Muñoz Gil, Fernando Aparicio, Manuel de Buenaga, Diego Gachet, Enrique Puertas, Ignacio Giráldez, Ma Cruz Gaya
Person Name Discrimination in the Dossier–GPLSI at the University of Alicante

We present Dossier–GPLSI, a system for the automatic generation of press dossiers for organizations. News items are downloaded from online newspapers and automatically classified. We specifically describe a module for the discrimination of person names. Three different approaches are analyzed and evaluated, each one using a different kind of information: semantic information, domain information and statistical evidence. We demonstrate that this module achieves very good performance and can be integrated into the Dossier–GPLSI system.

Isabel Moreno, Rubén Izquierdo, Paloma Moreda
MarUja: Virtual Assistant Prototype for the Computing Service Catalogue of the University of Jaén

The information and web services that many organizations offer through their web pages are increasing every day. This makes navigation and access to information increasingly complex for visitors to these web pages, so it is necessary to facilitate these tasks for users. In this paper we present a prototype of a Virtual Assistant, which is the result of applying a methodology for developing and deploying this kind of system at minimum cost.

Eugenio Martínez-Cámara, L. Alfonso Ureña-López, José M. Perea-Ortega

Doctoral Symposium

Processing Amazighe Language

Amazighe is a language spoken by millions of people, mainly in North Africa; however, it suffers from a scarcity of resources. The aim of this PhD thesis is to contribute elementary resources and tools for processing this language. Toward this goal, we have produced an annotated corpus of ~20k tokens and trained two sequence classification models using Support Vector Machines (SVMs) and Conditional Random Fields (CRFs). We have used the 10-fold technique to evaluate our approach. Results show that the performance of SVMs and CRFs is very comparable; however, CRFs outperformed SVMs on the 10-fold average (88.66% vs. 88.27%). As future steps, we plan to use semi-supervised techniques to accelerate part-of-speech (POS) annotation and increase accuracy, and afterwards to approach base-phrase chunking.

Mohamed Outahajala
How to Extract Arabic Definitions from the Web? Arabic Definition Question Answering System

The Web is the richest information resource available to users, but the issue is how to obtain precise and exact information easily and quickly. Classic information retrieval systems such as Web search engines can only return snippets and links to Web pages according to a user query, and it is the role of the user to sift through these results and identify the appropriate information. In this paper, we propose processing the results returned by Web search engines to return the appropriate information for a user question. The proposed solution is integrated in an Arabic definition question answering system called ‘DefArabicQA’. The experiment was carried out using 140 Arabic definition questions and 2360 snippets returned by various Web search engines. The results obtained so far are very encouraging and can be improved further in the future.

Omar Trigui
Answer Validation through Textual Entailment

We present ongoing research work on an Answer Validation (AV) system based on Textual Entailment and Question Answering. A number of answer validation modules have been developed based on Textual Entailment, Named Entity Recognition, question-answer type analysis, a chunk boundary module and a syntactic similarity module. These answer validation modules have been integrated using a voting technique. We combine the question and the answer into the Hypothesis (H) and use the Supporting Text as the Text (T) to identify the entailment relation as either “VALIDATED” or “REJECTED”. The important features in the lexical Textual Entailment module are WordNet-based unigram match, bigram match and skip-gram. In the syntactic similarity module, the important features used are subject-subject comparison, subject-verb comparison, object-verb comparison and cross subject-verb comparison. The precision, recall and f-score of the integrated AV system on the AVE 2008 English annotated test set are 0.66, 0.65 and 0.65 respectively, which outperforms the best performing system at AVE 2008 in terms of f-score.

Partha Pakray
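The lexical entailment features lend themselves to a compact sketch: unigram, bigram, and skip-gram overlap between hypothesis and text, combined by majority vote. The thresholds and voting rule below are illustrative simplifications; the full system also uses WordNet, NER, chunk-boundary, and syntactic modules:

```python
# Lexical-entailment feature sketch: n-gram and skip-gram overlap
# between hypothesis (question + answer) and supporting text, combined
# with a toy majority-vote decision.
from itertools import combinations

def ngrams(tokens, n):
    return set(zip(*(tokens[i:] for i in range(n))))

def skipgrams(tokens):
    return set(combinations(tokens, 2))  # ordered word pairs with gaps

def overlap(a, b):
    return len(a & b) / len(a) if a else 0.0

def validate(hypothesis, text, threshold=0.5):
    h, t = hypothesis.lower().split(), text.lower().split()
    votes = [overlap(set(h), set(t)) > threshold,
             overlap(ngrams(h, 2), ngrams(t, 2)) > threshold,
             overlap(skipgrams(h), skipgrams(t)) > threshold]
    return "VALIDATED" if sum(votes) >= 2 else "REJECTED"

print(validate("everest is the highest mountain",
               "mount everest is the highest mountain on earth"))
```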
Analyzing Text Data for Opinion Mining

Opinion mining has become a hot topic at the crossroads of information retrieval and computational linguistics. In this paper, we propose to study two key research problems in designing an opinion mining system: the entity-related opinion detection problem and the sentiment analysis problem. For the entity-related opinion detection problem, we want to use sophisticated statistical models, e.g., probabilistic topic models and statistical rule generation methods, to achieve better performance than existing baselines. For the sentiment analysis problem, we have proposed a novel HL-SOT approach and reported its feasibility in an academic publication. Since the kernel classifier utilized in the HL-SOT approach is a linear function, we are working on developing a multi-layer neural network kernel algorithm that results in a non-linear classifier and is expected to improve the performance of the original HL-SOT approach to sentiment analysis.

Wei Wei
On the Improvement of Passage Retrieval in Arabic Question/Answering (Q/A) Systems

The development of advanced Information Retrieval (IR) applications is of particular priority in the context of the Arabic language. In this PhD thesis, our aim is to improve the performance of Arabic Question/Answering (Q/A) systems. We propose an approach composed of three levels. Through experiments conducted on a set of 2,264 translated CLEF and TREC questions, we have shown that the accuracy, the Mean Reciprocal Rank (MRR) and the number of answered questions are enhanced using a Query Expansion (QE) module based on Arabic WordNet (AWN) at the first level and a structure-based Passage Retrieval (PR) module at the second level. In order to evaluate the impact of the AWN coverage on performance, we have automatically extended its content in terms of Named Entities (NEs), nouns and verbs. The next step consists in developing a semantic reasoning process based on Conceptual Graphs (CGs) as the third level.

Lahsen Abouenour
Ontology Extension and Population: An Approach for the Pharmacotherapeutic Domain

For several years, ontologies have been seen as a solution for sharing and reusing knowledge between humans and machines. An ontology is a snapshot of knowledge at the moment of its creation. Nevertheless, in order to keep an ontology useful over time, it must be extended and maintained regularly, especially in the pharmacotherapeutic domain, since drug therapy needs up-to-date and reliable information. Unfortunately, systematically updating an ontology is an arduous and tedious task that becomes a bottleneck. To limit this obstacle, we need methods that expedite the process of extension and population. This proposal aims at designing and validating a method able to extract, from a corpus of summaries of product characteristics and a pharmacotherapeutic ontology, the relevant knowledge to be added to the ontology.

Jorge Cruanes
Backmatter
Metadata
Title
Natural Language Processing and Information Systems
Edited by
Rafael Muñoz
Andrés Montoyo
Elisabeth Métais
Copyright Year
2011
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-22327-3
Print ISBN
978-3-642-22326-6
DOI
https://doi.org/10.1007/978-3-642-22327-3