About this book

This book constitutes the refereed proceedings of the 19th International Conference on Applications of Natural Language to Information Systems, NLDB 2014, held in Montpellier, France, in June 2014. The 13 long papers, 8 short papers, 14 poster papers, and 7 demo papers presented together with 2 invited talks in this volume were carefully reviewed and selected from 73 submissions. The papers cover the following topics: syntactic, lexical and semantic analysis; information extraction; information retrieval and sentiment analysis and social networks.

Table of contents

Frontmatter

Syntactic, Lexical and Semantic Analysis

Using Wiktionary to Build an Italian Part-of-Speech Tagger

While there has been a lot of progress in Natural Language Processing (NLP), many basic resources are still missing for many languages, including Italian, especially resources that are free for both research and commercial use. One of these basic resources is a Part-of-Speech tagger, a first processing step in many NLP applications. We describe a weakly-supervised, fast, free and reasonably accurate part-of-speech tagger for the Italian language, created by mining words and their part-of-speech tags from Wiktionary. We have integrated the tagger in Pattern, a freely available Python toolkit. We believe that our approach is general enough to be applied to other languages as well.

Tom De Smedt, Fabio Marfia, Matteo Matteucci, Walter Daelemans
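The abstract above describes mining words and their part-of-speech tags from Wiktionary. A minimal sketch of the lexicon-lookup idea follows. It is not the authors' tagger and not the Pattern API: the function names, the toy entries, the tag labels and the fallback to NOUN for unknown words are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_lexicon(entries):
    """Map each word to its most frequent tag among (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in entries:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_tokens(tokens, lexicon, default="NOUN"):
    """Tag each token by lexicon lookup; unknown words get the default tag (assumption)."""
    return [(token, lexicon.get(token.lower(), default)) for token in tokens]

# Hypothetical toy entries standing in for mined (word, tag) pairs.
entries = [
    ("il", "DET"), ("gatto", "NOUN"), ("dorme", "VERB"),
    ("la", "DET"), ("la", "PRON"), ("la", "DET"),
]
lexicon = build_lexicon(entries)
tagged = tag_tokens(["Il", "gatto", "dorme"], lexicon)
```

A lookup table like this is fast and needs no hand-annotated corpus, which matches the weakly-supervised setting of the abstract; a practical tagger would also need to resolve ambiguous words from context.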

Enhancing Multilingual Biomedical Terminologies via Machine Translation from Parallel Corpora

Creating and maintaining terminologies by human experts is known to be a resource-expensive task. We here report on efforts to computationally support this process by treating term acquisition as a machine translation-guided classification problem capitalizing on parallel multilingual corpora. Experiments are described for French, German, Spanish and Dutch parts of a multilingual biomedical terminology, for which we generated 18k, 23k, 19k and 12k new terms and synonyms, respectively; about one half relate to concepts that have not been lexically labeled before. Based on expert assessment of a sample of the novel German segment about 80% of these newly acquired terms were judged as linguistically correct and bio-medically reasonable additions to the terminology.

Johannes Hellrich, Udo Hahn

A Distributional Semantics Approach for Selective Reasoning on Commonsense Graph Knowledge Bases

Tasks such as question answering and semantic search are dependent on the ability of querying & reasoning over large-scale commonsense knowledge bases (KBs). However, dealing with commonsense data demands coping with problems such as the increase in schema complexity, semantic inconsistency, incompleteness and scalability. This paper proposes a selective graph navigation mechanism based on a distributional relational semantic model which can be applied to querying & reasoning over heterogeneous knowledge bases (KBs). The approach can be used for approximative reasoning, querying and associational knowledge discovery. In this paper we focus on commonsense reasoning as the main motivational scenario for the approach. The approach focuses on addressing the following problems: (i) providing a semantic selection mechanism for facts which are relevant and meaningful in a specific reasoning & querying context and (ii) allowing coping with information incompleteness in large KBs. The approach is evaluated using ConceptNet as a commonsense KB, and achieved high selectivity, high scalability and high accuracy in the selection of meaningful navigational paths. Distributional semantics is also used as a principled mechanism to cope with information incompleteness.

André Freitas, João Carlos Pereira da Silva, Edward Curry, Paul Buitelaar

Ontology Translation: A Case Study on Translating the Gene Ontology from English to German

For many researchers, the purpose of ontologies is sharing data. This sharing is facilitated when ontologies are available in multiple languages, but inhibited when an ontology is only available in a single language. Ontologies should be accessible to people in multiple languages, since multilingualism is inevitable in any scientific work. Due to resource scarcity, most ontologies of the biomedical domain are available only in English at present. We present techniques to translate Gene Ontology terms from English to German using DBPedia, the Google Translate API for isolated terms, and the Google Translate API for terms in sentential context. Average fluency scores for the three methods were 4.0, 4.4, and 4.5, respectively. Average adequacy scores were 4.0, 4.9, and 4.9.

Negacy D. Hailu, K. Bretonnel Cohen, Lawrence E. Hunter

Crowdsourcing Word-Color Associations

In Natural Language Processing and semantic analysis in particular, color information may be important for processing textual information. Knowing what colors are generally associated with terms by people is valuable. We explore how crowdsourcing through a game with a purpose (GWAP) can be an adequate strategy to collect such lexico-semantic data.

Mathieu Lafourcade, Nathalie Le Brun, Virginie Zampa

On the Semantic Representation and Extraction of Complex Category Descriptors

Natural language descriptors used for categorizations are present from folksonomies to ontologies. While some descriptors are composed of simple expressions, other descriptors have complex compositional patterns (e.g. ‘French Senators Of The Second Empire’, ‘Churches Destroyed In The Great Fire Of London And Not Rebuilt’). As conceptual models get more complex and decentralized, more content is transferred to unstructured natural language descriptors, increasing the terminological variation, reducing the conceptual integration and the structure level of the model. This work describes a representation for complex natural language category descriptors (NLCDs). In the representation, complex categories are decomposed into a graph of primitive concepts, supporting their interlinking and semantic interpretation. A category extractor is built and the quality of its extraction under the proposed representation model is evaluated.

André Freitas, Rafael Vieira, Edward Curry, Danilo Carvalho, João Carlos Pereira da Silva

Semantic and Syntactic Model of Natural Language Based on Tensor Factorization

A method of developing a structural model of natural language syntax and semantics is proposed. Factorization of lexical combinability arrays obtained from text corpora generates linguistic databases that are used for natural language semantic and syntactic analyses.

Anatoly Anisimov, Oleksandr Marchenko, Volodymyr Taranukha, Taras Vozniuk

How to Populate Ontologies: Computational Linguistics Applied to the Cultural Heritage Domain

The Cultural Heritage (CH) domain poses critical challenges for the application of Natural Language Processing (NLP) and ontology population (OP) techniques. CH embraces a wide range of content, variable by type and properties and semantically interlinked with other domains. This paper presents ongoing research on language treatment based on the Lexicon-Grammar (LG) approach for improving knowledge management in the CH domain. We intend to show how our language formalization technique can be applied for both processing and populating a domain ontology.

Maria Pia di Buono, Mario Monteleone, Annibale Elia

Fine-Grained POS Tagging of Spoken Tunisian Dialect Corpora

Arabic Dialects (AD) have recently begun to receive more attention from the speech science and technology communities. The use of dialects in language technologies will help improve the development process and the usability of applications such as speech recognition, speech comprehension, or speech synthesis. However, AD face a lack of resources compared to Modern Standard Arabic (MSA). This paper deals with the problem of tagging an AD: the Tunisian Dialect (TD). We present a method for building a fine-grained Part-of-Speech (POS) tagger for the TD. This method consists in adapting an MSA POS tagger by generating a TD training corpus from an MSA corpus using a bilingual MSA-TD lexicon. The evaluation of the TD tagger on a corpus of text transcriptions achieved an accuracy of 78.5%.

Rahma Boujelbane, Mariem Mallek, Mariem Ellouze, Lamia Hadrich Belguith

Information Extraction

Applying Dependency Relations to Definition Extraction

Definition Extraction (DE) is the task of automatically identifying definitional knowledge in naturally-occurring text. This task has applications in ontology generation, glossary creation or question answering. Although the traditional approach to DE has been based on hand-crafted pattern-matching rules, recent methods incorporate learning algorithms in order to classify sentences as definitional or non-definitional. This paper presents a supervised approach to Definition Extraction in which only syntactic features derived from dependency relations are used. We model the problem as a classification task where each sentence has to be classified as definitional or non-definitional. We compare our results with two well-known approaches: first, a supervised method based on Word-Class Lattices and second, an unsupervised approach based on mining recurrent patterns. Our competitive results suggest that syntactic information alone can contribute substantially to the development and improvement of DE systems.

Luis Espinosa-Anke, Horacio Saggion

Entity Linking for Open Information Extraction

Open domain information extraction (OIE) projects like Nell or ReVerb are often impaired by a schema-poor structure. This severely limits their application domain in spite of having web-scale coverage. In this work we try to disambiguate an OIE fact by referring its terms to unique instances from a structured knowledge base, DBpedia in our case. We propose a method which exploits the frequency information and the semantic relatedness of all probable candidate pairs. We show that our combined linking method outperforms a strong baseline.

Arnab Dutta, Michael Schuhmacher

A New Method of Extracting Structured Meanings from Natural Language Texts and Its Application

An original method of developing the algorithms of semantic-syntactic analysis of texts in natural language (NL) is set forth. It expands the method proposed by V.A. Fomichov in the monograph published by Springer in 2010. For building semantic representations, the class of SK-languages is used. The input texts may come from broad and practically interesting sublanguages of at least English, French, German, and Russian. The final part of the paper describes an application of the elaborated method to the design of an NL-interface to an action-based software system. The developed NL-interface NLC-1 (Natural Language Commander - Version 1) is implemented with the help of the functional programming language Haskell.

Vladimir A. Fomichov, Alexander A. Razorenov

A Survey of Multilingual Event Extraction from Text

The ability to process multilingual texts is important for event extraction systems, because it not only completes the picture of an event, but also improves algorithm performance. The present paper is a partial overview of the systems that cover this functionality. We focus on language-specific event type identification methods. Obtaining and organizing this knowledge is important for our further experiments on mono- and multilingual detection of socio-political events.

Vera Danilova, Mikhail Alexandrov, Xavier Blanco

Information Retrieval

Nordic Music Genre Classification Using Song Lyrics

Lyrics-based music genre classification is still understudied within the music information retrieval community. The existing approaches reported in the literature only deal with lyrics in the English language. Thus, it is necessary to evaluate whether the standard text classification techniques are suitable for lyrics in languages other than English. More precisely, in this work we are interested in analyzing which approach gives better results: a language-dependent approach using stemming and stopword removal or a language-independent approach using n-grams. To perform the experiments we have created the Nordic music genre lyrics database. The analysis of the experimental results shows that using a language-independent approach with the n-gram representation is better than using a language-dependent approach with stemming. Additional experiments using stylistic features were also performed. The analysis of these additional experiments has shown that using stylistic features combined with the other approaches improves the classification results.

Adriano A. de Lima, Rodrigo M. Nunes, Rafael P. Ribeiro, Carlos N. Silla
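The language-independent n-gram representation mentioned in the abstract above can be illustrated with a short sketch. It assumes character n-grams with lowercasing and whitespace normalisation; the abstract does not say whether the authors used character or word n-grams, nor which n, so the function name `char_ngrams` and its defaults are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    normalised = " ".join(text.lower().split())
    return Counter(normalised[i:i + n] for i in range(len(normalised) - n + 1))

# The same function works for any language: no stemmer or stopword list is needed.
bigrams = char_ngrams("La la", n=2)
```

Because the features come straight from the character stream, no per-language stemmer or stopword list has to be built, which is the practical appeal of the language-independent approach.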

Infographics Retrieval: A New Methodology

Information graphics, such as bar charts and line graphs, are a rich knowledge source that should be accessible to users. However, techniques that have been effective for document or image retrieval are inadequate for the retrieval of such graphics. We present and evaluate a new methodology that hypothesizes information needs from user queries and retrieves infographics based on how well the inherent structure and intended message of the graphics satisfy the query information needs.

Zhuo Li, Sandra Carberry, Hui Fang, Kathleen F. McCoy, Kelly Peterson

A Joint Topic Viewpoint Model for Contention Analysis

This work proposes an unsupervised Joint Topic Viewpoint model (JTV) with the objective of further improving the quality of opinion mining in contentious text. The conceived JTV is designed to learn the hidden features of arguing expressions. The learning task is geared towards the automatic detection and clustering of these expressions according to the latent topics they confer and the embedded viewpoints they voice. Experiments are conducted on three types of contentious documents: polls, online debates and editorials. Qualitative and quantitative evaluations of the model's output confirm the ability of JTV to handle different types of contentious issues. Moreover, analysis of the preliminary experimental results shows the ability of the proposed model to automatically and accurately detect recurrent patterns of arguing expressions.

Amine Trabelsi, Osmar R. Zaïane

Identification of Multi-Focal Questions in Question and Answer Reports

A significant amount of business and scientific data is collected via question and answer reports. However, these reports often suffer from various data quality issues. In many cases, questionnaires contain a number of questions that require multiple answers, which we argue can be a potential source of problems that may lead to poor-quality answers. This paper introduces multi-focal questions and proposes a model for identifying them. The model consists of three phases: question pre-processing, feature engineering and question classification. We use six types of features: lexical/surface features, Part-of-Speech, readability, question structure, wording and placement features, question response type and format features and question focus. A comparative study of three different machine learning algorithms (Bayes Net, Decision Tree and Support Vector Machine) is performed on a dataset of 150 questions obtained from the Carbon Disclosure Project, achieving an accuracy of 91%.

Mona Mohamed Zaki Ali, Goran Nenadic, Babis Theodoulidis

Improving Arabic Texts Morphological Disambiguation Using a Possibilistic Classifier

Morphological ambiguity is an important problem that has been studied through different approaches. We investigate, in this paper, some classification methods to disambiguate Arabic morphological features of non-vocalized texts. A possibilistic approach is improved and proposed to handle imperfect training and test datasets. We introduce a data transformation method to convert the imperfect dataset to a perfect one. We compare the disambiguation results of classification approaches to results given by the possibilistic classifier dealing with imperfection context.

Raja Ayed, Ibrahim Bounhas, Bilel Elayeb, Narjès Bellamine Ben Saoud, Fabrice Evrard

Forecasting Euro/Dollar Rate with Forex News

In this paper we build classifiers of texts reflecting the opinions of currency market analysts about the euro/dollar rate. The classifiers use various combinations of classes: growth, fall, constancy, not-growth, not-fall. The process includes term selection based on a criterion of word specificity and model selection using the technique of inductive modeling. We briefly describe our tools for these procedures. In the experiments we evaluate the quality of the classifiers and their sensitivity to the term list. The results proved to be positive and therefore the proposed approach can be a useful addition to the existing quantitative methods. The work has a practical orientation.

Olexiy Koshulko, Mikhail Alexandrov, Vera Danilova

Towards the Improvement of Topic Priority Assignment Using Various Topic Detection Methods for E-reputation Monitoring on Twitter

Topic priority assignment is defined in RepLab-2013 as labelling a topic according to its level of priority (alert, mildly important or unimportant) in order to highlight topics requiring immediate attention for online reputation monitoring. Although they are strongly linked, topic detection and priority assignment have been previously treated as separate tasks. We study the impact of integrating topic detection outputs in the process of topic priority assignment.

Jean-Valère Cossu, Benjamin Bigot, Ludovic Bonnefoy, Grégory Senay

Complex Question Answering: Homogeneous or Heterogeneous, Which Ensemble Is Better?

This paper applies homogeneous and heterogeneous ensembles to perform the complex question answering task. For the homogeneous ensemble, we employ Support Vector Machines (SVM) as the learning algorithm and use a Cross-Validation Committees (CVC) approach to form several base models. We use SVM, Hidden Markov Models (HMM), Conditional Random Fields (CRF), and Maximum Entropy (MaxEnt) techniques to build different base models for the heterogeneous ensemble. Experimental analyses demonstrate that both ensemble methods outperform conventional systems and that the heterogeneous ensemble is better.

Yllias Chali, Sadid A. Hasan, Mustapha Mojahid
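The Cross-Validation Committees idea named in the abstract above (one base model per held-out fold, combined at prediction time) can be sketched as follows. The paper uses SVMs as the base learner; this sketch swaps in a toy one-dimensional threshold learner so that it stays self-contained, and the function names, the majority-vote combination and the data are assumptions for illustration only.

```python
from collections import Counter

def cvc_train(data, k, train_fn):
    """Train k base models, each on the data with one of k folds held out."""
    folds = [data[i::k] for i in range(k)]
    models = []
    for held_out in range(k):
        subset = [x for i, fold in enumerate(folds) if i != held_out for x in fold]
        models.append(train_fn(subset))
    return models

def vote(models, x):
    """Combine base models by majority vote (assumed combination rule)."""
    return Counter(model(x) for model in models).most_common(1)[0][0]

def train_threshold(samples):
    """Toy stand-in learner: threshold midway between the two class means."""
    pos = [v for v, y in samples if y == 1]
    neg = [v for v, y in samples if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda v: 1 if v > cut else 0

# Hypothetical toy data: (feature value, label).
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.8, 1), (0.9, 1), (1.0, 1)]
models = cvc_train(data, k=3, train_fn=train_threshold)
```

Each base model sees a slightly different training subset, so the committee members differ even though they share one learning algorithm; that is what makes the ensemble homogeneous.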

Towards the Design of User Friendly Search Engines for Software Projects

Current work proposes a linguistic approach for supporting the identification of User requirements and Software Specifications. We introduce an NLP-based tool, PYTHIA, that serves as a search engine capable of handling software engineering terminology, aiming to close the loop between the end-user and the software developer. It is an ontology-based question answering system that employs semantic analysis as well as external (both generic use and domain-specific) dictionaries in order to handle term disambiguation, as posed in user defined queries.

Rafaila Grigoriou, Andreas L. Symeonidis

Towards a New Standard Arabic Test Collection for Mono- and Cross-Language Information Retrieval

We propose in this paper a new standard Arabic test collection for mono- and cross-language Information Retrieval (CLIR). To do this, we exploit the “Hadith” texts and we provide a portal for sampling and evaluation of Hadiths’ results listed in both Arabic and English versions. The new standard Arabic test collection, called “Kunuz”, will promote and restart the development of Arabic monolingual retrieval and CLIR systems, which has been blocked since the earlier TREC-2001 and TREC-2002 editions.

Oussama Ben Khiroun, Raja Ayed, Bilel Elayeb, Ibrahim Bounhas, Narjès Bellamine Ben Saoud, Fabrice Evrard

Social Networks, Sentiment Analysis, and other Natural Language Analysis

Sentiment Analysis Techniques for Positive Language Development

With the growing availability and popularity of opinion-rich resources such as on-line review sites and personal blogs, the use of information technologies to seek out and understand the opinions of others has increased significantly. This paper presents Posimed, a sentiment assessment approach that focuses on verbal language using information technologies for Spanish. We describe how Posimed combines natural language technologies for Spanish and expert domain knowledge to extract relevant sentiment and attitude information units from conversations between people (from interviews, coaching sessions, etc.) and supports the programmes that positivity training experts provide in order to develop the Positivity competence. We have evaluated Posimed both in a quantitative and a qualitative way and these evaluations show that Posimed provides an accurate analysis (73%) and reduces by 80% the time needed for the same job when it is performed manually by the domain expert.

Izaskun Fernandez, Yolanda Lekuona, Ruben Ferreira, Santiago Fernández, Aitor Arnaiz

Exploiting Wikipedia for Entity Name Disambiguation in Tweets

Social media repositories serve as a significant source of evidence when extracting information related to the reputation of a particular entity (e.g., a particular politician, singer or company). Reputation management experts are in need of automated methods for mining the social media repositories (in particular Twitter) to monitor the reputation of a particular entity. A quite significant research challenge related to the above issue is to disambiguate tweets with respect to entity names. To address this issue in this paper we use “context phrases” in a tweet and Wikipedia disambiguated articles for a particular entity in a random forest classifier. Furthermore, we also utilize the concept of “relatedness” between tweet and entity using the Wikipedia category-article structure that captures the amount of discussion present inside a tweet related to an entity. The experimental evaluations show a significant improvement over the baseline and comparable performance with other systems representing strong performance given that we restrict ourselves to features extracted from Wikipedia.

Muhammad Atif Qureshi, Colm O’Riordan, Gabriella Pasi

From Treebank Conversion to Automatic Dependency Parsing for Vietnamese

This paper presents a new conversion method to automatically transform a constituent-based Vietnamese Treebank into dependency trees. On a dependency Treebank created according to our new approach, we examine two state-of-the-art dependency parsers: the MSTParser and the MaltParser. Experiments show that the MSTParser outperforms the MaltParser. To the best of our knowledge, we report the highest performances published to date in the task of dependency parsing for Vietnamese. Particularly, on gold standard POS tags, we get an unlabeled attachment score of 79.08% and a labeled attachment score of 71.66%.

Dat Quoc Nguyen, Dai Quoc Nguyen, Son Bao Pham, Phuong-Thai Nguyen, Minh Le Nguyen

Sentiment Analysis in Twitter for Spanish

This paper describes an SVM approach for Sentiment Analysis (SA) in Twitter for Spanish. This task was part of the TASS2013 workshop, which is a framework for SA that is focused on the Spanish language. We describe the approach used, and we present an experimental comparison of the approaches presented by the different teams that took part in the competition. We also describe the improvements that were added to our system after our participation in the competition. With these improvements, we obtained an accuracy of 62.88% and 70.25% on the SA test set for the 5-level and 3-level tasks respectively. To our knowledge, these results are the best results published until now for the SA tasks of the TASS2013 workshop.

Ferran Pla, Lluís-F. Hurtado

Cross-Domain Sentiment Analysis Using Spanish Opinionated Words

A common issue in most NLP tasks is the lack of linguistic resources in languages other than English. This paper describes a new corpus for Sentiment Analysis composed of hotel reviews written in Spanish. We use the corpus to carry out a set of experiments for unsupervised polarity detection using different lexicons. In addition, we check how well the lists of opinionated words adapt to a domain. The obtained results are very promising and encourage us to continue investigating along this line.

M. Dolores Molina-González, Eugenio Martínez-Cámara, M. Teresa Martín-Valdivia, L. Alfonso Ureña-López

Real-Time Summarization of Scheduled Soccer Games from Twitter Stream

This paper presents the real-time summarization of scheduled soccer games from the Twitter stream. During events, many messages (tweets) are sent describing and expressing opinions about the game. The proposed approach shrinks the stream of tweets in real-time, and consists of two main steps: (i) the sub-event detection step, which determines if something new has occurred, and (ii) the tweet selection step, which picks a few representative tweets to describe each sub-event. We compare the automatic summaries generated for some of the soccer games of the Brazilian, Spanish and English leagues (2013-2014) with the live reports offered by the ESPN! Sports, Globo Esporte, Yahoo! Sports and livematch.Com web sites. The results show that the proposed approach is efficient and can produce real-time summarization with good quality.

Ahmed A. A. Esmin, Rômulo S. C. Júnior, Wagner S. Santos, Cássio O. Botaro, Thiago P. Nobre

Focus Definition and Extraction of Opinion Attitude Questions

In Question Answering Systems (QAS), Question Analysis is an important task that consists in general in identifying the semantic type of the question and extracting the question focus. In this context, and as part of a framework aiming to implement an Arabic opinion QAS for political debates, this paper addresses the problem of defining the focus of opinion attitude questions and proposes an approach for extracting it. The proposed approach is based on semi-automatically constructed lexico-syntactic patterns. Evaluation results are considered very encouraging with an average precision of around 87.37%.

Amine Bayoudhi, Hatem Ghorbel, Lamia Hadrich Belguith

Implicit Feature Extraction for Sentiment Analysis in Consumer Reviews

With the increasing popularity of aspect-level sentiment analysis, where sentiment is attributed to the actual aspects, or features, on which it is uttered, much attention is given to the problem of detecting these features. While most aspects appear as literal words, some are instead implied by the choice of words. With research in aspect detection advancing, we shift our focus to the less researched group of implicit features. By leveraging the co-occurrence between a set of known implicit features and notional words, we are able to predict the implicit feature based on the choice of words in a sentence. Using two different types of consumer reviews (product reviews and restaurant reviews), an F1-measure of 38% and 64% is obtained on these data sets, respectively.

Kim Schouten, Flavius Frasincar
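The co-occurrence idea in the abstract above, predicting an implicit feature from the words of a sentence, can be sketched as follows. This is a minimal illustration and not the authors' method: the scoring function (average ratio of feature-word co-occurrence to word frequency), the function names and the toy training sentences are all assumptions.

```python
from collections import Counter, defaultdict

def train(sentences):
    """Count, per implicit feature, how often each word co-occurs with it.

    sentences: list of (tokens, implicit_feature) pairs.
    """
    cooc = defaultdict(Counter)
    word_freq = Counter()
    for tokens, feature in sentences:
        for word in set(tokens):
            cooc[feature][word] += 1
            word_freq[word] += 1
    return cooc, word_freq

def predict(tokens, cooc, word_freq):
    """Return the feature with the highest average co-occurrence ratio, or None."""
    words = set(tokens)
    best, best_score = None, 0.0
    for feature, counts in cooc.items():
        ratios = [counts[w] / word_freq[w] for w in words if w in word_freq]
        score = sum(ratios) / max(len(words), 1)
        if score > best_score:
            best, best_score = feature, score
    return best

# Hypothetical annotated sentences: tokens plus the implied feature.
training = [
    (["too", "expensive"], "price"),
    (["cheap", "deal"], "price"),
    (["tiny", "screen"], "size"),
    (["too", "tiny"], "size"),
]
cooc, word_freq = train(training)
```

With these toy counts, `predict(["really", "expensive"], cooc, word_freq)` selects `"price"`, and a sentence containing no known word yields `None`; a real system would likely add a score threshold so that weak evidence does not trigger a prediction.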

Towards Creation of Linguistic Resources for Bilingual Sentiment Analysis of Twitter Data

This paper presents an approach towards bilingual sentiment analysis of tweets. Social networks, being the most advanced and popular communication medium, can help in designing better government and business strategies. A number of reported studies use data from social networks; however, most of them are based on the English language. In this research, we have focused on sentiment analysis of a bilingual dataset (English and Roman Urdu) on a topic of national interest (General Elections). Our experiments produced encouraging results, with the sentiment strength of 76% of tweets classified correctly. We have also created a bilingual lexicon that stores the sentiment strength of English and Roman Urdu terms. Our lexicon is available at: https://sites.google.com/a/mcs.edu.pk/codteem/biling_senti

Iqra Javed, Hammad Afzal, Awais Majeed, Behram Khan

Demonstration Papers

Integrating Linguistic and World Knowledge for Domain-Adaptable Natural Language Interfaces

Nowadays, natural language interfaces (NLIs) are in strong demand on various smart devices, from wearable devices, cell phones and televisions to vehicles. Domain adaptation has become one of the major challenges in supporting applications across different domains. In this paper, we propose a framework of domain-adaptable NLIs to integrate linguistic knowledge and world knowledge. Given a knowledge base of a target domain and the function definition of a target smart device, the corresponding NLI system is developed under the framework. In the experiments, we demonstrate a Chinese NLI system for a video on demand (VOD) service.

Hen-Hsen Huang, Chang-Sheng Yu, Huan-Yuan Chen, Hsin-Hsi Chen, Po-Ching Lee, Chun-Hsun Chen

Speeding Up Multilingual Grammar Development by Exploiting Linked Data to Generate Pre-terminal Rules

The development of grammars, e.g. for spoken dialog systems, is a time- and effort-intensive process. The crafting of rules that list all relevant instances of a non-terminal, e.g. Greek cities or Automobile companies, possibly in multiple languages, is especially costly. In order to automate and speed up the generation of multilingual terminal lists, we present a tool that uses linked data sources such as DBpedia in order to retrieve all entities that satisfy a relevant semantic restriction. We briefly describe the architecture of the system and explain how it can be used by means of an online web service.

Sebastian Walter, Christina Unger, Philipp Cimiano

Senterritoire for Spatio-Temporal Representation of Opinions

In previous work, the method called Opiland (OPinion mIning from LAND-use planning documents) has been proposed in order to semi-automatically mine opinions in specialized contexts. In this article, we present the associated Senterritoire viewer developed to dynamically represent, in time and space, opinions extracted by Opiland.

Mohammad Amin Farvardin, Eric Kergosien

Mining Twitter for Suicide Prevention

Automatically detecting suicidal people in social networks is a real social issue. In France, suicide attempts are an economic burden with strong socio-economic consequences. In this paper, we describe a complete process to automatically collect suspect tweets according to a vocabulary of topics that suicidal persons tend to talk about. We automatically capture tweets indicating risky suicidal behaviour based on simple classification methods. An interface for psychiatrists has been implemented to enable them to consult suspect tweets and the profiles associated with these tweets. The method has been validated on real datasets. The early feedback from psychiatrists is encouraging and makes it possible to consider a personalised response according to the estimated level of risk.

Amayas Abboute, Yasser Boudjeriou, Gilles Entringer, Jérôme Azé, Sandra Bringay, Pascal Poncelet

The Semantic Measures Library: Assessing Semantic Similarity from Knowledge Representation Analysis

Semantic similarity and relatedness are cornerstones of numerous treatments in which lexical units (e.g., terms, documents), concepts or instances have to be compared from texts or knowledge representation analysis. These semantic measures are central for NLP, information retrieval, sentiment analysis and approximate reasoning, to mention a few. In response to the lack of efficient and generic software solutions dedicated to knowledge-based semantic measures, i.e. those which rely on the analysis of semantic graphs and ontologies, this paper presents the Semantic Measures Library (SML), an extensive and efficient Java library dedicated to the computation and analysis of these measures. The SML can be used with a large diversity of knowledge representations, e.g., WordNet, SKOS thesaurus, RDF(S) and OWL ontologies. We also present the SML-Toolkit, a command-line program which gives (non-programmers) access to several functionalities of the SML, e.g. to compute semantic similarities. Website: http://www.semantic-measures-library.org

Sébastien Harispe, Sylvie Ranwez, Stefan Janaqi, Jacky Montmain

Mobile Intelligent Virtual Agent with Translation Functionality

Virtual agents are a powerful means for human-computer interaction. In this demo paper, we describe a new scenario for a mobile virtual agent that, in addition to general social intelligence, can perform translation tasks. We present the design and development of the intelligent virtual agent that translates phrases and sentences from English into French, Russian, and Spanish. Initial evaluation results show that the possibility to translate phrases and short utterances is useful and interesting for the user.

Inguna Skadiņa, Inese Vīra, Jānis Teseļskis, Raivis Skadiņš

A Tool for Theme Identification in RDF Graphs

An increasing number of RDF datasets are published on the Web. A user willing to use these datasets will first have to explore them in order to determine which information is relevant for their own needs. To facilitate this exploration, we present a system which provides a thematic view of a given RDF dataset, making it easier to target the relevant resources and properties. Our system combines a density-based graph clustering algorithm with semantic clustering criteria in order to identify clusters, each one corresponding to a theme. In this paper, we will give an overview of our approach for theme identification and we will present our system along with a scenario illustrating its main features.

Hanane Ouksili, Zoubida Kedad, Stéphane Lopes

Backmatter
