2013 | Book

Flexible Query Answering Systems

10th International Conference, FQAS 2013, Granada, Spain, September 18-20, 2013. Proceedings

Editors: Henrik Legind Larsen, Maria J. Martin-Bautista, María Amparo Vila, Troels Andreasen, Henning Christiansen

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science

About this book

This book constitutes the refereed proceedings of the 10th International Conference on Flexible Query Answering Systems, FQAS 2013, held in Granada, Spain, in September 2013. The 59 full papers included in this volume were carefully reviewed and selected from numerous submissions. The papers are organized in a general session track and a parallel special session track. The general track covers the following topics: query-answering systems; semantic technology; patterns and classification; personalization and recommender systems; searching and ranking; and Web and human-computer interaction. The special track covers some specific and typically newer fields, namely: environmental scanning for strategic early warning; generating linguistic descriptions of data; advances in fuzzy querying and fuzzy databases: theory and applications; fusion and ensemble techniques for online learning on data streams; and intelligent information extraction from texts.

Table of Contents

Frontmatter

Query-Answering Systems

Conceptual Pathway Querying of Natural Logic Knowledge Bases from Text Bases

We describe a framework affording computation of conceptual pathways between a pair of terms presented as a query to a text database. In this framework, information is extracted from text sentences and becomes represented in natural logic, which is a form of logic coming much closer to natural language than predicate logic. Natural logic accommodates a variety of scientific parlance, ontologies and domain models. It also supports a semantic net or graph view of the knowledge base. This admits computation of relationships between concepts simultaneously through pathfinding in the knowledge base graph and deductive inference with the stored assertions. We envisage use of the developed pathway functionality, e.g., within bio-, pharma-, and medical sciences for calculating bio-pathways and causal chains.

Troels Andreasen, Henrik Bulskov, Jørgen Fischer Nilsson, Per Anker Jensen, Tine Lassen
Query Rewriting for an Incremental Search in Heterogeneous Linked Data Sources

Nowadays, the number of linked data sources available on the Web is considerable. In this scenario, users are interested in frameworks that help them query those heterogeneous data sources in a friendly way, sparing them the technical details related to the heterogeneity and variety of the data sources. With this aim, we present a system that implements an innovative query approach that obtains results to user queries in an incremental way. It sequentially accesses different datasets, expressed with possibly different vocabularies. Our approach enriches previous answers each time a different dataset is accessed. Mapping axioms between datasets are used for rewriting the original query and thus obtaining new queries expressed with terms in the vocabulary of the target dataset. These rewritten queries may be semantically equivalent, or they could incur a certain semantic loss; in this case, an estimation of the loss of information incurred is presented.

Ana I. Torre-Bastida, Jesús Bermúdez, Arantza Illarramendi, Eduardo Mena, Marta González
FILT – Filtering Indexed Lucene Triples – A SPARQL Filter Query Processing Engine

The Resource Description Framework (RDF) is the W3C recommended standard for data on the semantic web, while the SPARQL Protocol and RDF Query Language (SPARQL) is the query language that retrieves RDF triples. RDF data often contain valuable information that can only be queried through filter functions. The SPARQL query language for RDF can include filter clauses in order to define specific data criteria, such as full-text searches, numerical filtering, and constraints and relationships between data resources. However, the downside of executing SPARQL filter queries is their frequently slow execution times. This paper presents FILT (Filtering Indexed Lucene Triples), a SPARQL filter query-processing engine for conventional triplestores, built on top of the Apache Lucene framework for storing and retrieving indexed documents and compatible with unmodified SPARQL queries. The objective of FILT is to decrease the execution time of SPARQL filter queries. This aspect was evaluated by benchmarking FILT against the Joseki triplestore on two different use cases: SPARQL regular-expression filtering in medical data, and SPARQL numerical/logical filtering of geo-coordinates of geographical locations.

Magnus Stuhr, Csaba Veres
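
To make the two filtering styles named above concrete, here is a minimal sketch, independent of FILT itself, of a SPARQL FILTER combining a regular-expression condition with a numerical one, run with the Python rdflib library; the tiny graph and the ex: property names are invented for the example.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")
g = Graph()
# A toy dataset of two places with a name and a latitude.
g.add((EX.p1, RDF.type, EX.Place))
g.add((EX.p1, EX.name, Literal("Granada")))
g.add((EX.p1, EX.lat, Literal(37.18)))
g.add((EX.p2, RDF.type, EX.Place))
g.add((EX.p2, EX.name, Literal("Bergen")))
g.add((EX.p2, EX.lat, Literal(60.39)))

# One FILTER clause combining a regex (full-text style) condition
# with a numerical constraint, as in the two use cases above.
q = """
PREFIX ex: <http://example.org/>
SELECT ?name ?lat WHERE {
  ?s a ex:Place ; ex:name ?name ; ex:lat ?lat .
  FILTER (regex(str(?name), "^gr", "i") && ?lat > 30.0)
}
"""
for row in g.query(q):
    print(row.name, row.lat)  # -> Granada 37.18
```
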
Improving Range Query Result Size Estimation Based on a New Optimal Histogram

Many commercial relational Database Management Systems (DBMSs) maintain histograms to approximate the distribution of values in relation attributes and, based on them, estimate query result sizes. A histogram approximates the distribution by grouping data into buckets. The estimation errors resulting from the loss of information during the grouping process affect the accuracy of the decision, made by query optimizers, about choosing the most economical evaluation plan for a query. Faced with this challenging problem, many histogram-based estimation techniques, including the equi-depth, v-optimal, max-diff and compressed histograms, have contributed to approximating the cost of a query evaluation plan. But most of the time the obtained estimates carry considerable error. Motivated by the fact that inaccurate estimations can lead to wrong decisions, we propose in this paper an efficient algorithm, called Compressed-V2, for accurate histogram construction. Both theoretical analysis and experiments on a benchmark data set show the promising results obtained with the proposed algorithm. We think that this algorithm will contribute significantly to solving the problem of Multi-Query Optimization (MQO) resulting from query interactions, especially in Relational Data Warehouses (RDW), which represent the ideal environment in which complex OLAP queries interact with each other.

Wissem Labbadi, Jalel Akaichi
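
For orientation, a minimal sketch of the equi-depth baseline mentioned above and of the uniform-spread estimate a histogram yields for a range query (the Compressed-V2 algorithm itself is not reproduced here):

```python
import numpy as np

def equi_depth_histogram(values, n_buckets):
    # Bucket edges placed at quantiles, so every bucket holds
    # roughly the same number of values (equal "depth").
    edges = np.quantile(values, np.linspace(0.0, 1.0, n_buckets + 1))
    counts, _ = np.histogram(values, bins=edges)
    return edges, counts

def estimate_range_size(edges, counts, lo, hi):
    # Assume values are spread uniformly inside each bucket and
    # credit each bucket proportionally to its overlap with [lo, hi].
    est = 0.0
    for i, count in enumerate(counts):
        left, right = edges[i], edges[i + 1]
        width = right - left
        overlap = max(0.0, min(hi, right) - max(lo, left))
        if width > 0:
            est += count * overlap / width
    return est

rng = np.random.default_rng(0)
values = rng.normal(50, 15, 10_000)
edges, counts = equi_depth_histogram(values, 20)
print("estimated:", estimate_range_size(edges, counts, 40, 60))
print("actual:   ", int(((values >= 40) & (values <= 60)).sum()))
```
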
Has FQAS Something to Say on Taking Good Care of Our Elders?

The increasing population of elders in the near future, and their expectations for an independent, safe and in-place living, require new practical systems and technologies to fulfil their demands in sustainable ways. This paper presents our own reflection on the great relevance of FQAS' main topics for recent developments in the context of Home Assistance. We show how those developments employ several techniques from the FQAS conference scope, with the aim of encouraging researchers to test their systems in this field.

María Ros, Miguel Molina-Solana, Miguel Delgado
Question Answering System for Dialogues: A New Taxonomy of Opinion Questions

Question analysis is an important task in Question Answering Systems (QAS). To perform this task, the system must procure fine-grained information about the question types. This information is defined by the question taxonomy. In the literature, factual question taxonomies have been the object of many research works. However, opinion question taxonomies have not received the same attention because they are more complicated. Besides, most QAS focus on monological texts, while dialogues have rarely been explored by information retrieval tools. In this paper, we investigate the use of dialogue data as an information source for opinion QAS. Hence, we propose a new opinion question taxonomy in the context of an Arabic QAS for political debates, and we then propose an approach to classify these questions. The obtained results were relevant, with a precision of around 91.13% for the classification of the opinion classes.

Amine Bayoudhi, Hatem Ghorbel, Lamia Hadrich Belguith
R/quest: A Question Answering System

In this paper, we discuss our novel, open-domain question answering (Q/A) system, R/quest. We use web page snippets from Google™ to extract short paragraphs that become candidate answers. We performed an evaluation that showed, on average, 1.4 times higher recall and a slightly higher precision by using a question expansion method. We have modified the Cosine Coefficient Similarity Measure to take into account the rank position of a candidate answer and its length. This produces an effective ranking scheme. We have a new question refinement method that improves recall. We further enhanced performance by adding a Boolean NOT operator. R/quest on average provides an answer within the top 2 to 3 paragraphs shown to the user. We consider this to be a considerable advance over search engines that provide millions of ranked web pages which must be searched manually to find the information needed.

Joan Morrissey, Ruoxuan Zhao
Answering Questions by Means of Causal Sentences

The aim of this paper is to introduce a set of algorithms able to configure an automatic answer to a proposed question. This procedure has two main steps. The first one focuses on the extraction, filtering and selection of those causal sentences that could carry information relevant to the answer. The second one focuses on the composition of a suitable answer from the information obtained in the previous step.

C. Puente, E. Garrido, J. A. Olivas
Ontology-Based Question Analysis Method

Question analysis is a central component of Question Answering systems. In this paper we propose a new method for question analysis based on ontologies (QAnalOnto). QAnalOnto relies on four main components: (1) lexical and syntactic analysis, (2) question graph construction, (3) query reformulation and (4) search for similar questions. Our contribution consists of the representation of generic structures of questions and results by using typed attributed graphs, and of the integration of domain ontologies and lexico-syntactic patterns for query reformulation. Some preliminary tests have shown that the proposed method improves the quality of the retrieved documents and the search for similar previous questions.

Ghada Besbes, Hajer Baazaoui-Zghal, Antonio Moreno
Fuzzy Multidimensional Modelling for Flexible Querying of Learning Object Repositories

The goal of this research is to design a fuzzy multidimensional model to manage learning object repositories. This model will provide the required elements to develop an intelligent system for information retrieval on learning object repositories based on OLAP multidimensional modeling and soft computing tools. It will handle the uncertainty of this data through a flexible approach.

Gloria Appelgren Lara, Miguel Delgado, Nicolás Marín

Environmental Scanning for Strategic Early Warning

Using Formal Concept Analysis to Detect and Monitor Organised Crime

This paper describes some possible uses of Formal Concept Analysis in the detection and monitoring of Organised Crime. After describing FCA and its mathematical basis, the paper suggests, with some simple examples, ways in which FCA and some of its related disciplines can be applied to this problem domain. In particular, the paper proposes FCA-based approaches for finding multiple instances of an activity associated with Organised Crime, finding dependencies between Organised Crime attributes, and finding new indicators of Organised Crime from the analysis of existing data. The paper concludes by suggesting that these approaches will culminate in the creation and implementation of an Organised Crime ‘threat score card’, as part of an overall environmental scanning system that is being developed by the new European ePOOLICE project.

Simon Andrews, Babak Akhgar, Simeon Yates, Alex Stedmon, Laurence Hirsch
Analysis of Semantic Networks Using Complex Networks Concepts

In this paper we perform a preliminary analysis of semantic networks to determine the most important terms that could be used to optimize a summarization task. In our experiments, we measure how the properties of a semantic network change, when the terms in the network are removed. Our preliminary results indicate that this approach provides good results on the semantic network analyzed in this paper.

Daniel Ortiz-Arroyo
Detecting Anomalous and Exceptional Behaviour on Credit Data by Means of Association Rules

Association rule mining is a data mining technique for extracting useful knowledge from databases. Recently, some approaches have been developed for mining novel kinds of useful information, such as peculiarities, infrequent rules, and exception or anomalous rules. The common feature of these proposals is the low support of such rules. Therefore, efficient algorithms for extracting them are needed.

The aim of this paper is threefold. First, it reviews a previous formulation of exception and anomalous rules, focusing on their semantics and definition. Second, we propose efficient algorithms for mining such types of rules. Third, we apply them to the case of detecting anomalous and exceptional behaviours on credit data.

Miguel Delgado, Maria J. Martin-Bautista, M. Dolores Ruiz, Daniel Sánchez
Issues of Security and Informational Privacy in Relation to an Environmental Scanning System for Fighting Organized Crime

This paper clarifies privacy challenges related to the EU project ePOOLICE, which aims at developing a particular kind of open source information filtering system, namely a so-called environmental scanning system, for fighting organized crime by improving law enforcement agencies' opportunities for strategic proactive planning in response to emerging organized crime threats. The environmental scanning is carried out on public online data streams, focusing on modus operandi and crime trends, not on individuals. Hence, ethical and technical issues related to societal security and potential privacy infringements in public online contexts are discussed in order to safeguard privacy throughout the system design process.

Anne Gerdes, Henrik Legind Larsen, Jacobo Rouces

Semantic Technology

Algorithmic Semantics for Processing Pronominal Verbal Phrases

The formal language of acyclic recursion ${L^{\lambda}_{ar}}$ (FLAR) has a distinctive algorithmic expressiveness which, in addition to computational fundamentals, provides representation of underspecified semantic information. Semantic ambiguities and underspecification of information expressed by human language are problematic for computational semantics, and for natural language processing in general. Pronominal and elliptical expressions in human languages are ubiquitous and major contributors to underspecification in language and other information processing. We demonstrate the capacity of the type theory of ${L^{\lambda}_{ar}}$ for computational semantic underspecification by representing interactions between reflexives, non-reflexive pronominals, and VP ellipses with type-theoretic recursion terms. We present a class of semantic underspecification that propagates and manifests itself in question-answering interactions. The paper introduces a technique for incremental presentation of question-answer interaction.

Roussanka Loukanova
Improving the Understandability of OLAP Queries by Semantic Interpretations

Methods that provide managers with elaborated information, making the results of queries over OLAP systems more comprehensible, are required every day. This problem is relatively recent, owing to the huge amount of information such systems store, and so far few proposals face this issue; they are mainly focused on presenting the information to the user in a comprehensible language (natural language). Here we go further and introduce a new mathematical formalism, the Semantic Interpretations, to supply the user not only with understandable responses, but also with semantically meaningful results.

Carlos Molina, Belen Prados-Suárez, Miguel Prados de Reyes, Carmen Peña Yañez
Semantic Interpretation of Intermediate Quantifiers and Their Syllogisms

This paper is a contribution to the formal theory of intermediate quantifiers (linguistic expressions such as most, few, almost all, a lot of, many, a great deal of, a large part of, a small part of). The latter concept was informally introduced by P. L. Peterson in his book and formalized in the frame of higher-order fuzzy logic by V. Novák. The main goal of this paper is to demonstrate how our theory works in an intended model. We will also show how the validity of generalized intermediate syllogisms can be semantically verified.

Petra Murinová, Vilém Novák
Ranking Images Using Customized Fuzzy Dominant Color Descriptors

In this paper we describe an approach for defining customized color descriptors for image retrieval. In particular, a customized fuzzy dominant color descriptor is proposed on the basis of a finite collection of fuzzy colors designed specifically for a certain user. Fuzzy colors modeling the semantics of a color name are defined as fuzzy subsets of colors on an ordinary color space, filling the semantic gap between the color representation in computers and the subjective human perception. The design of fuzzy colors is based on a collection of color names and corresponding crisp representatives provided by the user. The descriptor is defined as a fuzzy set over the customized fuzzy colors (i.e. a level-2 fuzzy set), taking into account the imprecise concept that is modelled, in which membership degrees represent the dominance of each color. The dominance of each fuzzy color is calculated on the basis of a fuzzy quantifier representing the notion of dominance, and a fuzzy histogram representing as a fuzzy quantity the percentage of pixels that match each fuzzy color. The obtained descriptor can be employed in a wide range of applications. We illustrate its usefulness with a particular application in image retrieval.

J. M. Soto-Hidalgo, J. Chamorro-Martínez, P. Martínez-Jiménez, Daniel Sánchez
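
An illustrative reading of the dominance computation described above, not the authors' implementation: a fuzzy histogram accumulates per-pixel memberships in each fuzzy colour, and a fuzzy quantifier turns that percentage into a dominance degree. The membership function and quantifier shapes are invented for the example.

```python
def fuzzy_histogram(pixels, fuzzy_colors):
    """fuzzy_colors: dict name -> membership function over RGB triples."""
    n = len(pixels)
    return {name: sum(mu(p) for p in pixels) / n
            for name, mu in fuzzy_colors.items()}

def most(x, a=0.2, b=0.5):
    # Piecewise-linear fuzzy quantifier "most": 0 below a, 1 above b.
    return min(1.0, max(0.0, (x - a) / (b - a)))

def reddish(p):
    # Toy fuzzy colour: membership grows with the share of red.
    r, g, b = p
    s = r + g + b + 1e-9
    return max(0.0, min(1.0, 2.0 * r / s - 0.5))

pixels = [(200, 30, 30), (180, 60, 40), (20, 20, 200), (210, 40, 35)]
hist = fuzzy_histogram(pixels, {"reddish": reddish})
dominance = {c: most(h) for c, h in hist.items()}
print(hist, dominance)
```
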

Generating Linguistic Descriptions of Data

Linguistic Descriptions: Their Structure and Applications

In this paper, we provide a brief survey of the main theoretical and conceptual principles of methods that use so-called linguistic descriptions and thus belong to the broad area of methods encapsulated under the term modeling with words. The theoretical frame is fuzzy natural logic, an extension of mathematical fuzzy logic consisting of several constituents. In this paper, we deal with the formal logical theory of evaluative linguistic expressions and the related concepts of linguistic description and perception-based logical deduction. Furthermore, we mention some applications and highlight two of them: forecasting and linguistic analysis of time series, and linguistic association mining.

Vilém Novák, Martin Štěpnička, Jiří Kupka
Landscapes Description Using Linguistic Summaries and a Two-Dimensional Cellular Automaton

Cellular automata models are used in ecology since they permit integrating space, ecological processes and stochasticity in a single predictive framework. The complex nature of modeling (spatial) ecological processes has made linguistic summaries difficult to use within traditional cellular automata models. This paper deals with the development of a computational system capable of generating linguistic summaries from the data provided by a cellular automaton, and shows two proposals that can be used for this purpose. We build our system by combining techniques from Zadeh's Computational Theory of Perceptions with ideas from State Machine Theory. This paper discusses how linguistic descriptions may be integrated into cellular automata models and then demonstrates the use of our approach in the development of a prototype capable of providing a linguistic description of ecological phenomena.

Francisco P. Romero, Juan Moreno-García
Comparing fβ-Optimal with Distance-Based Merge Functions

Merge functions informally combine information from a certain universe into a solution over that same universe. This typically results in a, preferably optimal, summarization. In previous research, merge functions over sets have been looked into extensively. A specific case concerns sets that allow elements to appear more than once: multisets. In this paper we compare two types of merge functions over multisets against each other. We examine both their general properties and their practical usability in a real-world application.

Daan Van Britsom, Antoon Bronselaer, Guy De Tré
Flexible Querying with Linguistic F-Cube Factory

In this paper a new tool which allows flexible querying of multidimensional databases is presented. Linguistic F-Cube Factory is based on the use of natural language when querying multidimensional data cubes to obtain linguistic results. Natural language is one of the best ways of presenting results to human users, as it is their inherent way of communication. Data warehouses take advantage of the multidimensional data model in order to store large amounts of data that users can manage and query by means of OLAP operations. They are thus a context where the development of a linguistic querying tool is of special interest.

R. Castillo-Ortega, Nicolás Marín, Daniel Sánchez, Carlos Molina
Mathematical Morphology Tools to Evaluate Periodic Linguistic Summaries

This paper considers the task of establishing periodic linguistic summaries of the form "Regularly, the data take high values", enriched with an estimation of the period and a linguistic formulation. Within the framework of methods that address this task by testing whether the dataset contains regularly spaced groups of high and low values of approximately constant size, it proposes a mathematical morphology (MM) approach based on watershed. It compares the proposed approach to other MM methods in an experimental study based on artificial data with different forms and noise types.

Gilles Moyse, Marie-Jeanne Lesot, Bernadette Bouchon-Meunier
Automatic Generation of Textual Short-Term Weather Forecasts on Real Prediction Data

In this paper we present a computational method which produces textual short-term weather forecasts for every municipality in Galicia (NW Spain), using the real data provided by the Galician Meteorology Agency (MeteoGalicia). The approach is based on soft computing methods and strategies for the linguistic description of data and for Natural Language Generation. The obtained results have been thoroughly validated by expert meteorologists, which ensures that in the near future the method can be improved and released as a real service offering custom forecasts to a wide public.

A. Ramos-Soto, A. Bugarin, S. Barro, J. Taboada
Increasing the Granularity Degree in Linguistic Descriptions of Quasi-periodic Phenomena

In previous works, we have developed some computational models of quasi-periodic phenomena based on Fuzzy Finite State Machines. Here, we extend this work to allow designers to obtain detailed linguistic descriptions of relevant amplitude and temporal changes. We include several examples that will help to understand and use this new resource for linguistic description of complex phenomena.

Daniel Sanchez-Valdes, Gracian Trivino
A Model-Based Multilingual Natural Language Parser — Implementing Chomsky’s X-bar Theory in ModelCC

Natural language support is a powerful feature that enhances user interaction with query systems. NLP requires dealing with ambiguities. Traditional probabilistic parsers provide a convenient means for disambiguation. However, they incorrigibly return wrong sequences of tokens, they impose hard constraints on the way lexical and syntactic ambiguities can be resolved, and they are limited in the mechanisms they allow for taking context into account. In comparison, model-based parser generators allow for flexible constraint specification and reference resolution, which facilitates the consideration of context. In this paper, we explain how the ModelCC model-based parser generator supports statistical language models and arbitrary probability estimators. Then, we present the ModelCC implementation of a natural language parser based on the syntax of most Romance and Germanic languages. This natural language parser can be instantiated for a specific language by connecting it with a thesaurus (for lexical analysis), a linguistic corpus (for syntax-driven disambiguation), and an ontology or semantic database (for semantics-driven disambiguation).

Luis Quesada, Fernando Berzal, Juan-Carlos Cubero

Patterns and Classification

Correlated Trends: A New Representation for Imperfect and Large Dataseries

The computational representation of dataseries is a task of growing interest nowadays. However, as these data are often imperfect, new representation models are required to handle them effectively. This work presents Frequent Correlated Trends, our proposal for representing uncertain and imprecise multivariate dataseries. Such a model can be applied to any domain where dataseries contain patterns that recur in similar, but not identical, shape. We describe here the model representation and an associated learning algorithm.

Miguel Delgado, Waldo Fajardo, Miguel Molina-Solana
Arc-Based Soft XML Pattern Matching

The internet is undoubtedly the biggest data source ever, with tons of data from different sources following different formats. One of the main challenges in computer science is how to make data sharing and exchange between these sources possible; in other words, how to develop a system that can deal with all these differences in data representation and extract useful knowledge from them. Since XML is the de facto standard for representing data on the internet, XML query matching has gained much popularity recently. In this paper we present new types of fuzzy arc matching that can match a pattern arc to a schema arc as long as the corresponding parent and child nodes are there and have reachability between them. Experimental results show that the proposed approach provides better results than previous works.

Mohammedsharaf Alzebdi, Panagiotis Chountas, Krassimir Atanassov
Discrimination of the Micro Electrode Recordings for STN Localization during DBS Surgery in Parkinson’s Patients

During deep brain stimulation (DBS) treatment of Parkinson's disease, the target of the surgery is a small (9 x 7 x 4 mm) structure placed deep within the brain, called the Subthalamic Nucleus (STN). It is morphologically similar to the surrounding tissue and as such poorly visible in CT or MRI. The goal of the surgery is the permanent, precise placement of the stimulating electrode within the target nucleus. Precision is extremely important, as wrong placement of the stimulating electrode may lead to serious mood disturbances. To obtain the exact location of the STN, intraoperative stereotactic navigation is used. A set of 3 to 5 parallel micro electrodes is inserted into the brain and advanced in measured steps towards the expected location of the nucleus. At each step the electrodes record the activity of the surrounding neural tissue. Because the STN has a distinct physiology, the signals recorded within it also display specific features. It is therefore possible to provide analytical methods targeted at the detection of those STN-specific characteristics. Based on such methods, this paper presents clustering and classification approaches for discriminating the micro electrode recordings coming from the STN. Application of these methods during the neurosurgical procedure might lessen the risk of medical complications and might also shorten the part of the surgery during which the patient must, of necessity, remain awake.

Konrad Ciecierski, Zbigniew W. Raś, Andrzej W. Przybyszewski
Image Classification Based on 2D Feature Motifs

The classification of raw data often involves the problem of selecting the appropriate set of features to represent the input data. In general, various features can be extracted from the input dataset, but only some of them are actually relevant for the classification process. Since relevant features are often unknown in real-world problems, many candidate features are usually introduced. This degrades both the speed and the predictive accuracy of the classifier due to the presence of redundancy in the candidate feature set.

In this paper, we study the capability of a special class of motifs previously introduced in the literature, i.e. 2D irredundant motifs, when they are exploited as features for image classification. This class of motifs has proved powerful in capturing the relevant information of digital images, also achieving good performance for image compression. We embed such 2D feature motifs in a bag-of-words model, and then exploit a K-nearest-neighbour classifier for the classification step. Preliminary results obtained on both a benchmark image dataset and a video frames dataset are promising.

Angelo Furfaro, Maria Carmela Groccia, Simona E. Rombo

Advances in Fuzzy Querying and Fuzzy Databases: Theory and Applications

Wildfire Susceptibility Maps Flexible Querying and Answering

Forecasting natural disasters, such as wildfires or floods, is a mandatory activity to reduce the level of risk and damage to people, properties and infrastructures. Since estimating the susceptibility to a given phenomenon in real time is computationally onerous, susceptibility maps are usually pre-computed. Techniques are thus needed to efficiently query such maps, in order to retrieve the most plausible scenario for the current situation. We propose a flexible querying and answering framework by which the operator in charge of managing an ongoing disaster can retrieve the list of susceptibility maps in decreasing order of satisfaction with respect to the query conditions. The operator can also describe trends of the conditions that are related to environmental parameters, assessing what happens if a dynamic parameter is increasing or decreasing in value.

Paolo Arcaini, Gloria Bordogna, Simone Sterlacchini
Enhancing Flexible Querying Using Criterion Trees

Traditional query languages like SQL and OQL use a so-called WHERE clause to extract only those database records that fulfil a specified condition. Conditions can be simple or composed of conditions connected through logical operators. Flexible querying approaches, among others, generalized this concept by allowing more flexible user preferences both in the specification of the simple conditions (through the use of fuzzy sets) and in the specification of the logical aggregation (through the use of weights). In this paper, we study and propose a new technique to further enhance the use of weights by working with so-called criterion trees. Next to better facilities for specifying flexible queries, criterion trees also allow for a more general aggregation approach. In the paper we illustrate and discuss how LSP basic aggregation operators can be used in criterion trees.

Guy De Tré, Jozo Dujmović, Joachim Nielandt, Antoon Bronselaer
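
As a rough illustration of the aggregation such trees perform, the following sketch combines leaf satisfaction degrees with a weighted power mean, the family underlying LSP's basic graded conjunction/disjunction operators; the weights and exponent are invented for the example.

```python
def weighted_power_mean(degrees, weights, r):
    """Weighted power mean of satisfaction degrees in [0, 1].

    weights are assumed to sum to 1; r < 1 behaves and-like
    (simultaneity), r > 1 or-like (replaceability), r = 1 is neutral.
    """
    if r == 0:  # geometric mean as the limit case
        prod = 1.0
        for d, w in zip(degrees, weights):
            prod *= d ** w
        return prod
    return sum(w * d ** r for d, w in zip(degrees, weights)) ** (1.0 / r)

# Inner node: "price is cheap" (weight 0.7) combined and-like with
# "distance is close" (weight 0.3), using exponent r = 0.5.
price_deg, dist_deg = 0.9, 0.4
print(weighted_power_mean([price_deg, dist_deg], [0.7, 0.3], r=0.5))
```
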
A Possibilistic Logic Approach to Conditional Preference Queries

The paper presents a new approach to database preference queries, where preferences are represented in the style of possibilistic logic, using symbolic weights. The symbolic weights may be processed without any numerical assignment of priority; still, it is possible to introduce a partial ordering among them if necessary. On this basis, four methods with increasing discriminating power for ranking the answers to conjunctive queries are proposed. The approach is compared to different lines of research in preference queries, including skyline-based methods and fuzzy-set-based queries. With the four proposed ranking methods, the first group of best answers is made of non-dominated items. The purely qualitative nature of the approach avoids the commensurability requirement on elementary evaluations underlying the fuzzy logic methods.

Didier Dubois, Henri Prade, Fayçal Touazi
Bipolar Conjunctive Query Evaluation for Ontology Based Database Querying

In the wake of the flexible querying system designed in [21], which allows the expression of user preferences as bipolar conditions of type "and if possible" over relational databases and ontologies, we detail in this paper the user query evaluation process, under the extension of the logical framework to bipolarity of type "or else" [15,14]. Queries addressed to our system are bipolar conjunctive queries made of bipolar atoms, and their evaluation relies on a three-step algorithm: (i) an atom substitution process that details how bipolar subsumption axioms defined in the bipolar ontology are used, (ii) a query derivation process which derives from each atom substitution a complementary query, and (iii) a translation process that translates the obtained set of queries into bipolar SQLf statements, subsequently evaluated over a bipolar relational database.

Nouredine Tamani, Ludovic Liétard, Daniel Rocacher
Bipolar Querying of Valid-Time Intervals Subject to Uncertainty

Databases model parts of reality by containing data representing properties of real-world objects or concepts. Often, some of these properties are time-related. Thus, databases often contain data representing time-related information. However, as they may be produced by humans, such data or information may contain imperfections like uncertainties. An important purpose of databases is to allow their data to be queried, to allow access to the information these data represent. Users may do this using queries, in which they describe their preferences concerning the data they are (not) interested in. Because users may have both positive and negative such preferences, they may want to query databases in a bipolar way. Such preferences may also have a temporal nature, but, traditionally, temporal query conditions are handled specifically. In this paper, a novel technique is presented to query a valid-time relation containing uncertain valid-time data in a bipolar way, which allows the query to have a single bipolar temporal query condition.

Christophe Billiet, José Enrique Pons, Olga Pons, Guy De Tré
Declarative Fuzzy Linguistic Queries on Relational Databases

In this paper we propose a declarative method to formulate fuzzy linguistic queries on Relational Database Management Systems, that is, flexible queries containing linguistic terms associated with the attributes of a table of a relational database. To this end, we adapt techniques originating from a proximity-based Logic Programming language called Bousi~Prolog.

Clemente Rubio-Manzano, Pascual Julián-Iranzo, Esteban Salazar-Santis, Eduardo San Martín-Villarroel
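
To illustrate the flavour of such queries (this is not Bousi~Prolog), a linguistic term can be modelled by a trapezoidal membership function and used to filter the rows of a relational table by fulfilment degree; the table and the term "young" are invented for the example.

```python
def trapezoid(x, a, b, c, d):
    # Membership is 0 outside [a, d], 1 on [b, c], linear on the slopes.
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def young(age):
    return trapezoid(age, 18, 20, 30, 40)

employees = [("Ann", 24), ("Bo", 35), ("Cid", 52)]
# Rough analogue of: SELECT name FROM employees WHERE age IS young
# keeping rows whose fulfilment degree reaches 0.5.
answers = [(name, young(age)) for name, age in employees
           if young(age) >= 0.5]
print(answers)  # [('Ann', 1.0), ('Bo', 0.5)]
```
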
Finding Similar Objects in Relational Databases — An Association-Based Fuzzy Approach

This paper deals with the issue of extending the scope of a user query in order to retrieve objects which are similar to its “strict answers”. The approach proposed exploits associations between database items, corresponding, e.g., to the presence of foreign keys in the database schema. Fuzzy concepts such as typicality, similarity and linguistic quantifiers are at the heart of the approach and make it possible to obtain a ranked list of similar answers.

Olivier Pivert, Grégory Smits, Hélène Jaudoin
M2LFGP: Mining Gradual Patterns over Fuzzy Multiple Levels

Data are often described at several levels of granularity. For instance, data concerning purchased fruits can be categorized by criteria such as size, weight or color. When dealing with data from the real world, such categories can hardly be defined in a crisp manner: some fruits may belong both to the small and to the medium-sized fruits. Data mining methods have been proposed to deal with such data, in order to benefit from the several levels when extracting relevant patterns. The challenge is to discover patterns that are not too general (as they would not contain relevant novel information) while remaining typical (as detailed data do not embed general and representative information). In this paper, we focus on the extraction of gradual patterns in the context of hierarchical data. Gradual patterns describe the covariation of attributes, such as "the bigger, the more expensive". As our proposal increases the number of combinations to be considered, since all levels must be explored, we propose a parallel implementation in order to decrease the execution time.

Yogi S. Aryadinata, Arnaud Castelltort, Anne Laurent, Michel Sala
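
One common way to score a gradual pattern such as "the bigger, the more expensive" is the fraction of ordered object pairs that are concordant with every attribute variation; the single-level sketch below illustrates that scoring (the paper's fuzzy multi-level and parallel machinery is not reproduced).

```python
from itertools import permutations

def gradual_support(rows, attrs):
    """rows: list of dicts; attrs: list of (attribute, '+'|'-') variations."""
    def concordant(x, y):
        # The pair (x, y) respects every stated variation.
        return all(x[a] < y[a] if s == '+' else x[a] > y[a]
                   for a, s in attrs)
    pairs = list(permutations(rows, 2))
    return sum(concordant(x, y) for x, y in pairs) / len(pairs)

fruits = [{"size": 1, "price": 2}, {"size": 2, "price": 3},
          {"size": 3, "price": 5}, {"size": 4, "price": 4}]
# "the bigger, the more expensive"
print(gradual_support(fruits, [("size", "+"), ("price", "+")]))
```
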
Building a Fuzzy Valid Time Support Module on a Fuzzy Object-Relational Database

In this work we present the implementation of a Fuzzy Valid Time Support Module on top of a Fuzzy Object-Relational Database System, based on a model to deal with imprecision in valid-time databases. The integration of these modules allows performing queries that combine fuzzy valid-time constraints with fuzzy predicates. Both modules can be deployed on Oracle Relational Database Management System 10.2 and higher. The module implements the mechanisms that overload the SQL statements Insert, Update, Delete and Select to allow fuzzy temporal handling. The described implementation supports the crisp valid-time model as a particular case of the fuzzy valid-time support provided.

Carlos D. Barranco, Juan Miguel Medina, José Enrique Pons, Olga Pons

Personalization and Recommender Systems

Predictors of Users’ Willingness to Personalize Web Search

Personalized Web search offers a promising solution to the task of user-tailored information-seeking; however, one of the reasons why it is not widely adopted by users is due to privacy concerns. Over the past few years social networking services (SNS) have re-shaped the traditional paradigm of information-seeking. People now tend to simultaneously make use of both Web search engines and social networking services when faced with an information need. In this paper, using data gathered in a user survey, we present an analysis of the correlation between the users’ willingness to personalize Web search and their social network usage patterns. The participants’ responses to the survey questions enabled us to use a regression model for identifying the relationship between SNS variables and willingness to personalize Web search. We also performed a follow-up user survey for use in a support vector machine (SVM) based prediction framework. The prediction results lead to the observation that SNS features such as a user’s demographic factors (such as age, gender, location), a user’s presence or absence on Twitter and Google+, amount of activity on Twitter and Google+ along with the user’s tendency to ask questions on social networks are significant predictors in characterising users who would be willing to opt for personalized Web search results.

Arjumand Younus, Colm O’Riordan, Gabriella Pasi
Semantic-Based Recommendation of Nutrition Diets for the Elderly from Agroalimentary Thesauri

The wealth of information on nutrition and healthy diets across the web, in health web magazines or forums, often confuses users in several ways. Reliability and completeness of the information, as well as extracting only the relevant part, become critical issues, especially for certain groups of people such as the elderly. Likewise, heterogeneous information representation without clear semantics hinders knowledge sharing and enrichment. This paper introduces a method to compute the semantic similarity between foods, used in NutElCare, an ontology-based recommender system capable of collecting and representing relevant nutritional information from expert sources in order to provide adequate nutrition tips for the elderly. The knowledge base of NutElCare is an OWL ontology built from the AGROVOC FAO thesaurus.

Vanesa Espín, María V. Hurtado, Manuel Noguera, Kawtar Benghazi
Enhancing Recommender System with Linked Open Data

In this paper, we present an innovative method that uses Linked Open Data (LOD) to improve content-based recommender systems. We have selected the domain of secondhand bookshops, where recommending is extraordinarily difficult because of the high ratio of objects to users, the lack of significant attributes and the small number of identical items in stock. These difficulties prevent us from successfully applying both collaborative and common content-based recommenders. We have queried the Czech language mutation of DBpedia in order to obtain additional attributes of objects (books) and reveal nontrivial connections between them. Our approach is general and can be applied to other domains as well. Experiments show that enhancing a recommender system with LOD can significantly improve its results in terms of object similarity computation and top-k object recommendation. The main drawback hindering the widespread use of such systems is probably the missing data about a considerable portion of objects, which can however vary across domains and improve over time.

Ladislav Peska, Peter Vojtas
Making Structured Data Searchable via Natural Language Generation with an Application to ESG Data

Relational databases are used to store structured data, which is typically accessed using report builders based on SQL queries. To search, forms need to be understood and filled out, which demands a high cognitive load. Due to the success of Web search engines, users have become acquainted with the easier mechanism of natural language search for accessing unstructured data. However, such keyword-based search methods are not easily applicable to structured data, especially where structured records contain non-textual content such as numbers.

We present a method to make structured data, including numeric data, searchable with a Web search engine-like keyword search access mechanism. Our method is based on the creation of surrogate text documents using Natural Language Generation (NLG) methods that can then be retrieved by off-the-shelf search methods.

We demonstrate that this method is effective by applying it, in a federated scenario, to two real-life sized databases: a proprietary database comprising corporate Environmental, Social and Governance (ESG) data, and a public-domain environmental pollution database. Our evaluation includes speed and index size investigations, and indicates the effectiveness (P@1 = 84%, P@5 = 92%) and practicality of the method.

Jochen L. Leidner, Darya Kamkova

Searching and Ranking

On Top-k Retrieval for a Family of Non-monotonic Ranking Functions

We present a top-k algorithm to retrieve tuples according to the order provided by a not-necessarily-monotone ranking function that belongs to a novel family of functions. The conditions imposed on the ranking functions are related to the values where the maximum score is achieved.

Nicolás Madrid, Umberto Straccia
Using a Stack Decoder for Structured Search

We describe a novel and flexible method that translates free-text queries to structured queries for filling out web forms. This can benefit searching in web databases which only allow access to their information through complex web forms. We introduce boosting and discounting heuristics, and use the constraints imposed by a web form to find a solution both efficiently and effectively. Our method is more efficient and shows improved performance over a baseline system.

Kien Tjin-Kam-Jet, Dolf Trieschnigg, Djoerd Hiemstra
On Cosine and Tanimoto Near Duplicates Search among Vectors with Domains Consisting of Zero, a Positive Number and a Negative Number

The cosine and Tanimoto similarity measures are widely applied in information retrieval, text and Web mining, data cleaning, chemistry and bio-informatics for finding similar objects, their clustering and classification. Recently, a few very efficient methods were offered to deal with the problem of lossless determination of such objects, especially in large and very high-dimensional data sets. They typically relate to objects that can be represented by (weighted) binary vectors. In this paper, we offer methods suitable for searching vectors with domains consisting of zero, a positive number and a negative number; that is, being a generalization of weighted binary vectors. Our results are not worse than their existing analogs offered for (weighted) binary vectors.

Marzena Kryszkiewicz
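
For reference, the two measures in their standard vector form, here applied to vectors whose components come from a domain of the kind named in the title, e.g. {-1, 0, 2}; this is background, not the paper's search method.

```python
import math

def cosine(x, y):
    # cos(x, y) = <x, y> / (|x| * |y|)
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

def tanimoto(x, y):
    # T(x, y) = <x, y> / (|x|^2 + |y|^2 - <x, y>)
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

u, v = [0, 2, -1, 2], [2, 2, 0, -1]
print(cosine(u, v))    # 2/9  ~ 0.222
print(tanimoto(u, v))  # 2/16 = 0.125
```
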
L2RLab: Integrated Experimenter Environment for Learning to Rank

L2RLab is a development environment that lets us integrate all the stages needed to develop, evaluate, compare and analyze the performance of new learning-to-rank models. It contains tools for individual and multiple pre-processing of data collections, and it also lets us study the influence of the features on the ranking, perform format conversion (e.g., to Weka's .ARFF) and visualize the data. The software facilitates the comparison between two or more methods, taking as parameters the performance achieved in the ranking, and also includes functionality for statistical analysis of the query-level precision of a proposed algorithm with respect to those referenced in the literature. The study of the behavior of the learning curves of the different methods is another feature of the tool. L2RLab is programmed in Java and is designed for extensibility, so the addition of new functionality is an easy task. L2RLab has an easy-to-use interface that avoids reprogramming applications for our experiments. Basically, L2RLab is structured in two main modules: the visual application and a framework that facilitates the inclusion of new algorithms and performance measures developed by the researcher.

Óscar J. Alejo, Juan M. Fernández-Luna, Juan F. Huete, Eleazar Moreno-Cerrud

Fusion and Ensemble Techniques for Online Learning on Data Streams

Heuristic Classifier Chains for Multi-label Classification

Multi-label classification, in contrast to conventional classification, assumes that each data instance may be associated with more than one label simultaneously. Multi-label learning methods take advantage of dependencies between labels, but this implies greater computational complexity of learning.

The paper considers the Classifier Chain multi-label classification method, which in its original form is fast but assumes a fixed order of labels in the chain. This leads to the propagation of inference errors down the chain. On the other hand, the recent Bayes-optimal method, Probabilistic Classifier Chain, overcomes this drawback but is computationally intractable. In order to find a trade-off solution, a novel heuristic approach for finding an appropriate label order in the chain is presented. It is demonstrated that the method obtains competitive overall accuracy and is also tractable for higher-dimensional data.

Tomasz Kajdanowicz, Przemyslaw Kazienko
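
A minimal sketch of a plain classifier chain with one fixed label order, using scikit-learn's off-the-shelf implementation; the paper's contribution, a heuristic for choosing that order, would replace the hard-coded order list.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain

X, Y = make_multilabel_classification(n_samples=300, n_classes=5,
                                      n_labels=3, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

# Each link predicts one label, seeing the features plus the labels
# predicted earlier in the chain; `order` fixes that sequence.
chain = ClassifierChain(LogisticRegression(max_iter=1000),
                        order=[0, 1, 2, 3, 4], random_state=0)
chain.fit(X_tr, Y_tr)
print("subset accuracy:", chain.score(X_te, Y_te))
```
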
Weighting Component Models by Predicting from Data Streams Using Ensembles of Genetic Fuzzy Systems

Our recently proposed method to predict from a data stream of real estate sales transactions, based on ensembles of genetic fuzzy systems, was extended to include the weighting of component models. The method consists in incrementally expanding an ensemble with models built over successive chunks of a data stream. The prices of residential premises predicted by aged component models for current data are updated according to a trend function reflecting the changes of the market. The impact of different techniques of weighting component models on the accuracy of an ensemble is compared in the paper. Three weighting techniques were proposed: proportional to the estimated accuracy of a model, proportional to its time of ageing, and dependent on property market fluctuations.

Bogdan Trawiński, Tadeusz Lasota, Magdalena Smętek, Grzegorz Trawiński
Weighted Aging Classifier Ensemble for the Incremental Drifted Data Streams

Evolving systems have recently been the focus of intense research because, for most real problems, the parameters of the decision task should adapt to new conditions. In classification such a problem is usually called concept drift. The paper deals with data stream classification where we assume that the concept drift is sudden but its rapidity is limited. To deal with this problem we propose a new algorithm called Weighted Aging Ensemble (WAE), which is able to adapt to changes of the classification model parameters. The method is inspired by the well-known Accuracy Weighted Ensemble (AWE) algorithm, which allows changing the line-up of a classifier ensemble, but the proposed method includes two important modifications: (i) classifier weights depend on the individual classifier accuracies and the time they have been in the ensemble, (ii) individual classifiers are chosen for the ensemble on the basis of a non-pairwise diversity measure. The proposed method was evaluated in computer experiments carried out on the SEA dataset. The obtained results encourage us to continue the work on the proposed concept.

Michał Woźniak, Andrzej Kasprzak, Piotr Cal
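
The weighting idea can be sketched as follows; the exact combination of accuracy and ensemble age used here is an illustrative choice, not the formula from the paper.

```python
def wae_weights(accuracies, ages, decay=0.9):
    """Weight each member by its recent accuracy, discounted by the
    number of chunks it has spent in the ensemble (its age)."""
    raw = [acc * (decay ** age) for acc, age in zip(accuracies, ages)]
    total = sum(raw)
    return [w / total for w in raw]

# A fresh accurate member outweighs an older one of equal accuracy.
print(wae_weights(accuracies=[0.80, 0.75, 0.90], ages=[3, 1, 0]))
```
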
An Analysis of Change Trends by Predicting from a Data Stream Using Neural Networks

A method to predict from a data stream of real estate sales transactions, based on ensembles of artificial neural networks, is proposed. The approach consists in incrementally expanding an ensemble with models built over successive chunks of a data stream. The prices of residential premises predicted by aged component models for current data are updated according to a trend function reflecting the changes of the market. The impact of different trend functions on the accuracy of the ensemble neural models was investigated in the paper. The results indicate that it is necessary to select correcting functions appropriate to the nature of market changes.

Zbigniew Telec, Tadeusz Lasota, Bogdan Trawiński, Grzegorz Trawiński

Web and Human-Computer Interaction

An Autocompletion Mechanism for Enriched Keyword Queries to RDF Data Sources

This article introduces a novel keyword query paradigm for end users in order to retrieve precise answers from semantic data sources. Contrary to existing approaches, connectors corresponding to linking words or verbal structures from natural languages are used inside queries to specify the meaning of each keyword, thus leading to a complete and explicit definition of the intent of the search. An example of such a query is "name of person at the head of company and author of article about 'business intelligence'". In order to help users formulate such connected keyword queries and to translate them into SPARQL, an interactive mechanism based on autocompletion has been developed, which is presented in this article.

Grégory Smits, Olivier Pivert, Hélène Jaudoin, François Paulus
Querying Sentiment Development over Time

A new language is introduced for describing hypotheses about fluctuations of measurable properties in streams of timestamped data; as a prime example, we consider trends of emotions in the constantly flowing stream of Twitter messages. The language, called EmoEpisodes, has a precise semantics that measures how well a hypothesis characterizes a given time interval; the semantics is parameterized so it can be adjusted to different views of the data. EmoEpisodes is extended to a query language with variables standing for unknown topics and emotions, and the query-answering mechanism returns instantiations for topics and emotions as well as the time intervals that provide the largest deflections in this measurement. Experiments are performed on a selection of Twitter data to demonstrate the usefulness of the approach.

Troels Andreasen, Henning Christiansen, Christian Theil Have
SuDoC: Semi-unsupervised Classification of Text Document Opinions Using a Few Labeled Examples and Clustering

The presented novel procedure, named SuDoC (Semi-unsupervised Document Classification), provides an alternative to standard clustering techniques when it is necessary to separate a very large set of textual instances into groups that represent the text-document semantics. Unlike conventional clustering, SuDoC proceeds from an initial small set of typical specimens that can be created manually and which provide the necessary bias for generating appropriate classes. SuDoC starts with a higher number of generated clusters and, to avoid over-fitting, iteratively decreases their quantity, increasing the generality of the resulting classification. The unlabeled instances are automatically labeled according to their similarity to the defined labeled samples, thus reaching higher classification accuracy. The results of the presented strengthened clustering procedure are demonstrated on a real-world data set of hotel guests' unstructured reviews written in natural language.

František Dařena, Jan Žižka
Efficient Visualization of Folksonomies Based on "Intersectors"

Social bookmarking systems have recently received increasing attention in both academic and industrial communities. This success is owed to their ease of use, which relies on a simple intuitive process allowing users to label diverse resources with freely chosen keywords, aka tags. The obtained collections are known under the nickname of folksonomy. In this paper, we introduce a new approach dedicated to the visualization of large folksonomies, based on the "intersecting" minimal transversals. The main thrust of such an approach is the proposal of a reduced set of "key" nodes of the folksonomy from which the remaining nodes can be faithfully retrieved. Thus, the user can navigate the folksonomy through a folding/unfolding process.

A. Mouakher, S. Heymann, S. Ben Yahia, B. Le Grand

Intelligent Information Extraction from Texts

Ukrainian WordNet: Creation and Filling

This paper deals with the process of developing a lexical semantic database for the Ukrainian language, UkrWordNet. The architecture of the developed system is described in detail. The data storage structure and the mechanisms of access to knowledge are reviewed, along with the internal logic of the system and some key software modules. The article is also concerned with the research and development of automated techniques for the replenishment and extension of the UkrWordNet Semantic Network.

Anatoly Anisimov, Oleksandr Marchenko, Andrey Nikonenko, Elena Porkhun, Volodymyr Taranukha
Linguistic Patterns for Encyclopaedic Information Extraction

Information extraction has almost always focused on extracting retrievable data from a text; approaches that manage to extract elaborated information have seldom been devised. Through the use of an interlingua-type, language-independent content representation, the semantic relations of the content can be used to search a set of information concerning a particular entity. This way, a person asking a question to find out something about a city or a person, for example, would have to know no more than the name used to run a search. This approach is very promising, as the person asking the question does not have to know what type of information he or she can request from a documentary source. Our work targets the goal of, given a user's query, providing a complete report about the topic or event, composed of what we consider encyclopaedic knowledge. We describe the origins of this research and the procedure followed, as well as an illustrative case of this ongoing research.

Jesús Cardeñosa, Miguel Ángel de la Villa, Carolina Gallardo
Contextualization and Personalization of Queries to Knowledge Bases Using Spreading Activation

Most taxonomies and thesauri offer their users a huge amount of structured data. However, this volume of data is often excessive and thus does not fulfill the needs of users who are trying to find specific information related to a certain concept. While there are techniques that may partially alleviate this problem (e.g. visual representation of the data), some of the effects of the information overload persist. This paper proposes a four-step mechanism for personalization and knowledge extraction, derived from the information about users' activities stored in their profiles. More precisely, the system extracts contextualization from the users' profiles by using a spreading activation algorithm. The preliminary results of this approach are presented in the paper.

Ana B. Pelegrina, Maria J. Martin-Bautista, Pamela Faber
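
A minimal sketch of spreading activation over a toy concept graph, included to make the mechanism concrete; the graph, decay factor and threshold are invented, and the paper's four-step mechanism is not reproduced.

```python
def spread_activation(graph, seeds, decay=0.5, threshold=0.05):
    """graph: dict node -> list of neighbours; seeds: dict node -> activation.

    Activation flows from seed concepts to their neighbours, shrinking
    by `decay` at each hop and stopping below `threshold`."""
    activation = dict(seeds)
    frontier = dict(seeds)
    while frontier:
        next_frontier = {}
        for node, act in frontier.items():
            out = graph.get(node, [])
            if not out:
                continue
            share = act * decay / len(out)
            if share < threshold:
                continue  # too weak to keep spreading
            for nb in out:
                activation[nb] = activation.get(nb, 0.0) + share
                next_frontier[nb] = next_frontier.get(nb, 0.0) + share
        frontier = next_frontier
    return activation

thesaurus = {"glacier": ["ice", "erosion"], "ice": ["water"],
             "erosion": ["sediment"], "water": []}
print(spread_activation(thesaurus, {"glacier": 1.0}))
```
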
Utilizing Annotated Wikipedia Article Titles to Improve a Rule-Based Named Entity Recognizer for Turkish

Named entity recognition is one of the information extraction tasks; it aims to identify named entities such as person/location/organization names, along with some numeric and temporal expressions, in free natural language texts. In this study, we target named entity recognition from Turkish texts, on which information extraction research is considerably rare compared to other well-studied languages. The effects of utilizing annotated Wikipedia article titles to enrich the lexical resources of a rule-based named entity recognizer for Turkish are discussed after evaluating the enriched named entity recognizer against its initial version. The evaluation results demonstrate that the presented extension improves the recognition performance on different text genres, particularly on the historical and financial news text sets for which the initial recognizer was not engineered. The current study is significant as it is the first to address the utilization of Wikipedia articles as an information source to improve named entity recognition on Turkish texts.

Dilek Küçük
Backmatter
Metadata
Title: Flexible Query Answering Systems
Editors: Henrik Legind Larsen, Maria J. Martin-Bautista, María Amparo Vila, Troels Andreasen, Henning Christiansen
Copyright Year: 2013
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-642-40769-7
Print ISBN: 978-3-642-40768-0
DOI: https://doi.org/10.1007/978-3-642-40769-7
