
2016 | Book

Database and Expert Systems Applications

27th International Conference, DEXA 2016, Porto, Portugal, September 5-8, 2016, Proceedings, Part II


About this book

This two-volume set, LNCS 9827 and LNCS 9828, constitutes the refereed proceedings of the 27th International Conference on Database and Expert Systems Applications, DEXA 2016, held in Porto, Portugal, in September 2016.

The 39 revised full papers presented together with 29 short papers were carefully reviewed and selected from 137 submissions. The papers discuss a range of topics including: Temporal, Spatial, and High Dimensional Databases; Data Mining; Authenticity, Privacy, Security, and Trust; Data Clustering; Distributed and Big Data Processing; Decision Support Systems, and Learning; Data Streams; Data Integration, and Interoperability; Semantic Web, and Data Semantics; Social Networks, and Network Analysis; Linked Data; Data Analysis; NoSQL, NewSQL; Multimedia Data; Personal Information Management; Semantic Web and Ontologies; Database and Information System Architectures; Query Answering and Optimization; Information Retrieval, and Keyword Search; Data Modelling, and Uncertainty.

Table of Contents

Frontmatter
Erratum to: Aging Locality Awareness in Cost Estimation for Database Query Optimization
Chihiro Kato, Yuto Hayamizu, Kazuo Goda, Masaru Kitsuregawa

Social Networks, and Network Analysis

Frontmatter
A Preference-Driven Database Approach to Reciprocal User Recommendations in Online Social Networks

Online Social Networks (OSN) are frequently used to find people with common interests, though such functionality is often based on mechanisms such as friends-of-friends that do not perform well for real-life interactions. We demonstrate an integrated database-driven recommendation approach that determines reciprocal user matches, an important feature for reducing the risk of rejection. Similarity is computed in a data-adaptive way based on dimensions such as homophily, propinquity, and recommendation context. By representing dimensions as unique preference database queries, user models can be created intuitively and evaluated directly on datasets. Query results serve as input for a reciprocal recommendation process that handles various similarity measures. Performance benchmarks conducted with data from a commercial outdoor platform demonstrate the applicability to real-life tasks.

Florian Wenzel, Werner Kießling
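
The reciprocity idea lends itself to a small sketch (not the authors' algorithm; in the paper, directional scores come from preference database queries). One common way to make a match "reciprocal" is to combine the two directional scores symmetrically, e.g. with a harmonic mean, so a match is strong only when both sides favor each other. All names below are hypothetical.

```python
# Hypothetical sketch of reciprocal matching: combine two directional
# preference scores so a match is strong only if both users favor each
# other. The harmonic mean heavily penalizes one-sided interest.

def reciprocal_score(score_a_to_b: float, score_b_to_a: float) -> float:
    if score_a_to_b == 0 or score_b_to_a == 0:
        return 0.0
    return 2 * score_a_to_b * score_b_to_a / (score_a_to_b + score_b_to_a)

def recommend(user, candidates, score):
    """Rank candidates for `user` by reciprocal match quality.
    `score(u, v)` is any directional similarity measure."""
    return sorted(
        candidates,
        key=lambda c: reciprocal_score(score(user, c), score(c, user)),
        reverse=True,
    )
```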
Community Detection in Multi-relational Bibliographic Networks

In this paper, we introduce a community detection approach for heterogeneous multi-relational networks that incorporates the multiple types of objects and relationships found in bibliographic networks. The proposed approach first constructs a relational context family (RCF) to represent the different objects and relations in the multi-relational bibliographic network, using Relational Concept Analysis (RCA) methods; it then exploits this RCF for community detection. Experiments performed on a dataset of academic publications from the Computer Science domain confirm the effectiveness of our proposal and open promising research directions.

Soumaya Guesmi, Chiraz Trabelsi, Chiraz Latiri
Quality Prediction in Collaborative Platforms: A Generic Approach by Heterogeneous Graphs

Since anyone can enrich, or indeed impoverish, crowd-sourced content, there is a crucial need to continuously improve automatic content quality assessment tools. Structural analysis methods developed for such quality prediction purposes generally handle a limited or manually fixed number of families of nodes and relations. This lack of genericity prevents existing algorithms from adapting to platform evolutions. In this work, we propose a generic and adaptable algorithm, called HSQ, that generalises various state-of-the-art models and allows the consideration of graphs defined by an arbitrary number of node semantics. Evaluations performed over two representative crowd-sourcing platforms, Wikipedia and Stack Exchange, show that considering additional node semantics and relations improves the performance of state-of-the-art approaches.

Baptiste de La Robertie, Yoann Pitarch, Olivier Teste
Analyzing Relationships of Listed Companies with Stock Prices and News Articles

To support decision making in our business and personal lives, we propose an integrated model of stock prices and, based on it, a novel method for analyzing the relationships of listed companies. In our integrated model, the stock price of a listed company consists of three factors: market-index, sector-index and pure-price. By utilizing this model, we can reduce the effects of the market and sectors, and analyze the relationships between companies on the basis of their business performance by comparing their pure-prices. Experiments using a newly collected data set validated the proposed integrated model and methods.

Satoshi Baba, Qiang Ma

Linked Data

Frontmatter
Approximate Semantic Matching over Linked Data Streams

In the Internet of Things (IoT), data can be generated by all kinds of smart things. In this context, enabling machines to process and understand such data is critical. Semantic Web technologies, such as Linked Data, provide an effective and machine-understandable way to represent IoT data for further processing. Matching Linked Data streams semantically based on text similarity is a challenging issue, as text similarity computation is time-consuming. In this paper, we present a hashing-based approximate approach to efficiently match Linked Data streams with users' needs. We use the Resource Description Framework (RDF) to represent IoT data and adopt triple patterns as user queries to describe users' data needs. We then apply locality-sensitive hashing techniques to transform semantic data into numerical values to support efficient matching between data and user queries, and design a modified k nearest neighbors (kNN) algorithm to speed up the matching process. The experimental results show that our approach is up to five times faster than traditional methods and achieves high precision and recall.

Yongrui Qin, Lina Yao, Quan Z. Sheng
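
To fix the intuition behind locality-sensitive hashing over triple text, here is a minimal MinHash sketch (a standard LSH scheme for Jaccard similarity); the paper's exact LSH family and kNN modification are not reproduced, and all parameters below are illustrative.

```python
# Minimal MinHash sketch: stream triples and registered triple-pattern
# queries that share many terms get similar signatures, so cheap signature
# comparison can replace exact text-similarity computation.
import hashlib

NUM_HASHES = 32

def minhash(tokens: set[str]) -> tuple[int, ...]:
    """One minimum per seeded hash function over the token set."""
    return tuple(
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in tokens)
        for seed in range(NUM_HASHES)
    )

def estimated_jaccard(sig_a, sig_b) -> float:
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

# Usage: only near matches found via signatures are verified exactly.
q = minhash({"sensor", "temperature", "room", "12"})
t = minhash({"sensor", "temperature", "office", "12"})
print(estimated_jaccard(q, t))
```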
A Mapping-Based Method to Query MongoDB Documents with SPARQL

Accessing legacy data as virtual RDF stores is a key issue in the building of the Web of Data. In recent years, the MongoDB database has become a popular actor in the NoSQL market, making it a significant potential contributor to the Web of Linked Data. Therefore, in this paper we address the question of how to access arbitrary MongoDB documents with SPARQL. We propose a two-step method: (i) translate a SPARQL query into a pivot abstract query under MongoDB-to-RDF mappings represented in the xR2RML language, then (ii) translate the pivot query into a concrete MongoDB query. We elaborate on the discrepancy between the expressiveness of SPARQL and the MongoDB query language, and show that we can always come up with a rewriting that produces all correct answers.

Franck Michel, Catherine Faron-Zucker, Johan Montagnat
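
A toy illustration of the two-step flavor: a single SPARQL triple pattern is rewritten into a concrete MongoDB filter under a mapping. The mapping and query shapes here are hypothetical and far simpler than xR2RML.

```python
# Hypothetical mapping: predicate IRI -> document field in one collection.
MAPPING = {
    "http://xmlns.com/foaf/0.1/name": "name",
    "http://xmlns.com/foaf/0.1/mbox": "email",
}

def translate_triple_pattern(subject, predicate, obj):
    """?s <pred> "literal"  ->  MongoDB find() filter on the mapped field."""
    field = MAPPING[predicate]
    if obj.startswith("?"):           # unbound object: field must exist
        return {field: {"$exists": True}}
    return {field: obj}               # bound literal: equality filter

# e.g. SPARQL pattern  ?s foaf:name "Alice"  becomes:
print(translate_triple_pattern("?s", "http://xmlns.com/foaf/0.1/name", "Alice"))
# {'name': 'Alice'}
```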
Incremental Maintenance of Materialized SPARQL-Based Linkset Views

In the Linked Data field, data publishers frequently materialize linksets between two different datasets using link discovery tools. To create a linkset, such tools typically execute linkage rules that retrieve data from the underlying datasets and apply matching predicates to create the links, in an often complex process. Moreover, such tools do not support linkset maintenance when the datasets are updated. A simple but costly strategy to keep linksets up to date would be to fully re-materialize them from time to time. This paper presents an alternative strategy, called incremental, for maintaining linksets, based on the idea that one should re-compute only the links that involve the updated resources. The paper discusses the incremental strategy in detail, outlines an implementation and describes an experiment comparing the performance of the incremental strategy with full re-materialization of linksets.

Elisa S. Menendez, Marco A. Casanova, Vânia M. P. Vidal, Bernardo P. Nunes, Giseli Rabello Lopes, Luiz A. P. Paes Leme

Data Analysis

Frontmatter
Aggregate Reverse Rank Queries

Recently, reverse rank queries have attracted significant research interest. They have real-life applicability, such as in marketing analysis and product placement. Reverse k-ranks queries return the users (preferences) who favor a given product more than other people do. This helps manufacturers find potential buyers even for an unpopular product. Just as the cable television industry often bundles channels, manufacturers are also willing to offer several products for sale as one combined product for marketing purposes. Unfortunately, current reverse rank queries, including reverse k-ranks queries, only consider a single product. To address this limitation, we propose aggregate reverse rank queries to find matching user preferences for a set of products. To resolve this query more efficiently, we pre-process the preference set and determine its upper and lower bounds. Combining these bounds with the query set, we propose and implement the tree pruning method (TPM) and the double-tree method (DTM). Theoretical analysis and experimental results demonstrate the efficacy of the proposed methods.

Yuyang Dong, Hanxiong Chen, Kazutaka Furuse, Hiroyuki Kitagawa
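
For readers unfamiliar with the primitive being generalized, here is a brute-force sketch of a plain reverse k-ranks query for a single product; the aggregate variant extends the idea to a query set (e.g. by scoring its aggregate vector), and the TPM/DTM pruning is not shown. The higher-score-is-better convention is our assumption.

```python
# Brute-force reverse k-ranks: find the k preference vectors under which
# the target product ranks best among all products.
import numpy as np

def reverse_k_ranks(prefs, products, target, k):
    """prefs: (m, d) user weight vectors; products: (n, d); target: (d,).
    Assumes a higher dot product means a more preferred product."""
    results = []
    for i, w in enumerate(prefs):
        target_score = w @ target
        rank = 1 + int(np.sum(products @ w > target_score))
        results.append((rank, i))
    results.sort()
    return [user for _, user in results[:k]]   # users ranking target best
```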
Abstract-Concrete Relationship Analysis of News Events Based on a 5W Representation Model

In a follow-up news article, descriptions of previous news events may be abbreviated or summarized. This makes news articles difficult to understand if the reader has no knowledge of the previous events. In such a case, providing concrete and detailed descriptions is helpful. In this paper, we propose a five-element model covering who, what, whom, when, and where (5W), together with an extraction method with completion functionality. With this model, a news event is represented using these 5Ws. To discover abstract and concrete descriptions of a given event, we propose the novel concept of abstractiveness based on this model. The abstractiveness of an event description is defined by the difficulty of imagining and identifying that event. Currently, we estimate the abstractiveness of an event by considering the abstraction levels and comprehensiveness of its 5Ws in identifying that event. We also propose a method for estimating the abstractiveness of an event and analyzing the abstract-concrete relationships between news events based on the 5W model. The experimental results indicate that our model, concept, and method are effective for extracting concrete event descriptions.

Shintaro Horie, Keisuke Kiritoshi, Qiang Ma
Detecting Maximum Inclusion Dependencies without Candidate Generation

Inclusion dependencies (INDs) within and across databases are an important relationship for many applications in data integration, schema (re-)design, integrity checking, and query optimization. Existing techniques for detecting all INDs need to generate IND candidates and test their validity in the given data instance. The major disadvantage of this approach is the exponentially growing number of data accesses, in terms of both SQL queries and I/O operations. We introduce Mind₂, a new approach for detecting n-ary INDs (n > 1) without any candidate generation. Mind₂ implements a new characterization of maximum INDs developed in this paper. This characterization is based on set operations defined on certain metadata, which Mind₂ generates by accessing the database only twice per valid unary IND. Thus, Mind₂ eliminates the exponential number of data accesses needed by existing approaches. Furthermore, experiments show that Mind₂ is significantly more scalable than hypergraph-based approaches.

Nuhad Shaabani, Christoph Meinel
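
To fix the terminology, here is the unary base case as simple set containment; this is the classical test whose exponential n-ary generalization Mind₂ avoids, not the paper's algorithm itself.

```python
# Unary inclusion dependencies (A ⊆ B between columns) by set containment.
from itertools import permutations

def unary_inds(columns: dict[str, set]) -> list[tuple[str, str]]:
    """columns: column name -> set of its values. Returns pairs (A, B)
    such that every value of A also appears in B (A ⊆ B, A != B)."""
    return [
        (a, b)
        for a, b in permutations(columns, 2)
        if columns[a] <= columns[b]          # subset test
    ]

cols = {"orders.cust_id": {1, 2, 3}, "customers.id": {1, 2, 3, 4}}
print(unary_inds(cols))   # [('orders.cust_id', 'customers.id')]
```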

NoSQL, NewSQL

Frontmatter
Footprint Reduction and Uniqueness Enforcement with Hash Indices in SAP HANA

Databases commonly use multi-column indices for composite keys that concatenate attribute values for fast entity retrieval. For real-world applications, such concatenated composite keys contribute significantly to overall space consumption, which is particularly expensive for main memory-resident databases. We present an integer-based hash representation of the actual values that reduces the overall memory footprint of a system while maintaining the level of performance. We analyzed the performance impact as well as the memory footprint reduction of hash-based indices in SAP HANA in a real-world enterprise database setting. For a live production SAP ERP system, the introduction of hash-based primary key indices alone reduces the entire memory footprint by 10% with comparable performance.

Martin Faust, Martin Boissier, Marvin Keller, David Schwalb, Holger Bischoff, Katrin Eisenreich, Franz Färber, Hasso Plattner
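
A hypothetical sketch of the footprint idea: store a fixed-width integer hash of the composite key in the index instead of the concatenated key string, and verify candidate rows on lookup (hash collisions make verification necessary). SAP HANA's actual index structures differ.

```python
import zlib

class HashKeyIndex:
    """Index maps a 32-bit hash of the composite key to candidate row ids."""
    def __init__(self):
        self._buckets: dict[int, list[int]] = {}

    @staticmethod
    def _hash(key_parts: tuple) -> int:
        return zlib.crc32("|".join(map(str, key_parts)).encode())

    def insert(self, key_parts: tuple, row_id: int):
        self._buckets.setdefault(self._hash(key_parts), []).append(row_id)

    def lookup(self, key_parts: tuple, rows) -> list[int]:
        """Return row ids whose stored key really equals key_parts;
        rows maps row id -> its composite key tuple."""
        candidates = self._buckets.get(self._hash(key_parts), [])
        return [r for r in candidates if rows[r] == key_parts]
```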
Benchmarking Replication in Cassandra and MongoDB NoSQL Datastores

The proliferation of Web 2.0 applications has increased the volume, velocity, and variety of data sources, which have exceeded the limitations and expected use cases of traditional relational DBMSs. Cloud-serving NoSQL data stores address these concerns and provide replication mechanisms to ensure fault tolerance, high availability, and improved scalability. In this paper, we empirically explore the impact of replication on the performance of the Cassandra and MongoDB NoSQL datastores. We evaluate the impact of replication in comparison to non-replicated clusters of equal size hosted on a private cloud environment. Our benchmarking experiments cover read- and write-heavy workloads subject to different access distributions and tunable consistency levels. Our results demonstrate that replication must be taken into consideration in empirical and modelling studies in order to achieve an accurate evaluation of the performance of these datastores.

Gerard Haughian, Rasha Osman, William J. Knottenbelt
τJSchema: A Framework for Managing Temporal JSON-Based NoSQL Databases

Although NoSQL databases are claimed to be schemaless, several NoSQL database vendors have chosen JSON as an agile data representation format and provide a JSON-based API or query facility to simplify the life of application developers. While many applications require the management of temporal data, the JSON Schema language lacks explicit support for time-varying data. In this paper, towards a systematic approach to the management of temporal data in NoSQL databases, we propose a framework called Temporal JSON Schema (τJSchema), inspired by the τXSchema framework defined for XML data. τJSchema allows defining a temporal JSON schema from a conventional JSON schema and a set of temporal logical and physical characteristics. Our framework guarantees logical and physical data independence for temporal schemas and provides a low-impact solution, since it requires neither modifications of existing JSON documents nor extensions to the JSON format, the JSON Schema language, or related tools and languages.

Safa Brahmia, Zouhaier Brahmia, Fabio Grandi, Rafik Bouaziz

Multimedia Data

Frontmatter
Enhancing Similarity Search Throughput by Dynamic Query Reordering

Vast amounts of multimedia data are being created nowadays, and they can often only be searched by content since no search metadata are available for them. To make content search efficient, similarity indexing structures based on the metric-space model can be used. In our work, we focus on a scenario where similarity search is used in the context of stream processing. In particular, there is a potentially infinite sequence (stream) of query objects, and a query needs to be executed for each of them. The goal is to maximize the throughput of processed queries while maintaining an acceptable delay. We propose an approach based on dynamic reordering of the incoming queries combined with caching of recent results. We were able to achieve up to 3.7 times higher throughput compared to the base case where no reordering or caching is used.

Filip Nalepa, Michal Batko, Pavel Zezula
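
A minimal sketch of the throughput mechanism described above, under our own simplifications: buffer incoming queries, serve cache hits immediately, and reorder the buffered rest so similar queries run back-to-back, subject to a batch bound standing in for the delay constraint. The scheduling policy and similarity measure are placeholders, not the paper's.

```python
from collections import OrderedDict

CACHE_SIZE = 1000

def process_stream(queries, execute, similarity, max_batch=50):
    """Yield (query, result) pairs; `execute` runs a query,
    `similarity(a, b)` scores how alike two queries are."""
    cache: OrderedDict = OrderedDict()        # LRU of recent results
    buffer = []
    for q in queries:
        if q in cache:
            cache.move_to_end(q)
            yield q, cache[q]                 # answered from cache
            continue
        buffer.append(q)
        if len(buffer) >= max_batch:          # delay bound reached
            anchor = buffer[0]                # oldest query anchors the order
            buffer.sort(key=lambda x: -similarity(anchor, x))
            for qq in buffer:
                result = execute(qq)
                cache[qq] = result
                if len(cache) > CACHE_SIZE:
                    cache.popitem(last=False)
                yield qq, result
            buffer = []
```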
Creating a Music Recommendation and Streaming Application for Android

Music recommendation and streaming services have grown exponentially since the introduction of smartphones. Despite the large number of systems, they all currently face several issues. One is the cold start, where a new user cannot receive recommendations until the system learns their tastes. They also lack the context awareness needed to make truly personalised recommendations. This paper introduces a new recommendation and streaming application for Android, individual Personalised Music (iPMusic), which is specifically designed to address these issues. We examine the effectiveness of iPMusic based on real-world user feedback, which shows positive results.

Elliot Jenkins, Yanyan Yang
A Score Fusion Method Using a Mixture Copula

In this paper, we propose a score fusion method using a mixture copula that can capture complex dependencies between multiple relevance scores in order to improve the effectiveness of information retrieval. Combining multiple relevance scores has been shown to be more effective than using a single score. Widely used score fusion methods are linear combination and learning to rank. Linear combination cannot capture the non-linear dependency of multiple scores, while learning to rank yields models that are difficult to interpret. These problems can be solved by using a copula, a statistical framework that can capture non-linear dependencies and also provide an interpretable model. Although some studies apply copulas to score fusion and demonstrate their effectiveness, these methods employ a unimodal copula, making it difficult to capture complex dependencies. Therefore, we introduce a new score fusion method that uses a mixture copula to handle complicated dependencies between scores, and we evaluate the accuracy of the proposed method. Experiments on ClueWeb'09, a large-scale document set, show that in some cases our proposed method significantly outperforms linear combination and other existing methods that use a unimodal copula.

Takuya Komatsuda, Atsushi Keyaki, Jun Miyazaki
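
In copula terms, the construction referred to above combines several base copulas convexly; the standard form (our notation, not necessarily the paper's exact parameterization) is:

```latex
% A mixture copula is a convex combination of base copulas C_k
% (e.g. Clayton, Gumbel, Frank), each capturing a different
% dependence shape between the score distributions:
C(u_1,\dots,u_d) \;=\; \sum_{k=1}^{K} \pi_k \, C_k(u_1,\dots,u_d),
\qquad \pi_k \ge 0,\quad \sum_{k=1}^{K} \pi_k = 1 .
```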

Personal Information Management

Frontmatter
Axiomatic Term-Based Personalized Query Expansion Using Bookmarking System

This paper tackles the problem of pinpointing relevant information in a social network for Personalized Information Retrieval (PIR). We start from the premise that user profiles must be filtered so that profile-based queries outperform non-profile-based ones. The formal Profile Query Expansion Constraint is then defined. We fix a specific integration of the profile and a probabilistic matching framework that satisfies the defined constraint. Experiments are conducted on the Bibsonomy corpus. Our findings show that even simple query-based profile adaptation is effective for Personalized Information Retrieval.

Philippe Mulhem, Nawal Ould Amer, Mathias Géry
A Relevance-Focused Search Application for Personalised Ranking Model

Previous research has shown that users' profiles can be exploited, by employing their implicit feedback for query expansion through conceptual search, to index documents. Several successful approaches leading to improved accuracy of personalised search results have been proposed. This paper extends existing approaches and combines keyword-based and semantic-based features to provide further evidence for a relevance-focused search application for a Personalised Ranking Model (PRM). A description of the hybridisation of these approaches is provided, and various issues arising in the context of computing the similarity between users' profiles are discussed. Compared to a traditional search system, the superiority of our approach lies in pushing significantly relevant documents to the top of the ranked lists. The results were empirically confirmed through human subjects who conducted several real-life Web searches.

Al Sharji Safiya, Martin Beer, Elizabeth Uruchurtu
Aggregated Search over Personal Process Description Graph

People share the various processes of their daily lives online in natural language form (e.g., cooking recipes, “how-to guides” in eHow). We refer to these as personal process descriptions. Previously, we proposed the Personal Process Description Graph (PPDG) to represent personal process descriptions concretely as graphs, along with query processing techniques that conduct exact as well as similarity search over PPDGs. However, both techniques fail if no single personal process description satisfies all constraints of a query. In this paper, we propose a new approach, built on our previous query techniques, to query personal process descriptions by aggregation: composing fragments from different PPDGs to produce an answer. We formally define the PPDG Aggregated Search and present a general framework for performing aggregated searches over PPDGs. Comprehensive experiments demonstrate the efficiency and scalability of our techniques.

Jing Ouyang Hsu, Hye-young Paik, Liming Zhan, Anne H. H. Ngu
Inferring Lurkers’ Gender by Their Interest Tags

Gender prediction has attracted great research interest due to its potential applications, such as personalized search, targeted advertising and recommendation. Most existing studies rely on content texts to build the feature vector. However, social media contain a large number of lurkers who do not post any messages, and stylistic or syntactic features cannot be extracted for these users since no content information is available. In this paper, we present a novel framework to infer lurkers' gender from their interest tags. This task is extremely challenging because each user has only a few (usually fewer than 10) diverse tags. To solve this problem, we first select a few tags and classify them into conceptual classes according to social and psycholinguistic characteristics. We then enlarge the conceptual classes using an association-mining-based method. Finally, we use the conceptual classes to condense the users' feature space. We conduct experiments on a real data set extracted from Sina Weibo. Experimental results demonstrate that our proposed approach is quite effective in predicting lurkers' gender.

Peisong Zhu, Tieyun Qian, Zhenni You, Xuhui Li

Semantic Web and Ontologies

Frontmatter
Data Access Based on Faceted Queries over Ontologies

We propose a method for generating and evaluating faceted queries over ontology-enhanced distributed graph databases. A user, who may only vaguely know the domain ontology, starts with a set of keywords. An initial faceted query is then automatically generated, and the user is guided in the interactive modification and refinement of successively created faceted queries. We provide the theoretical foundation for this way of constructing faceted queries and translating them into first-order monadic positive existential queries.

Tadeusz Pankowski, Grażyna Brzykcy
Incremental and Directed Rule-Based Inference on RDFS

The Semantic Web contributes to the elicitation of knowledge from data and leverages implicit knowledge through reasoning algorithms. The dynamic nature of the Web pushes current batch reasoning solutions, which provide the best scalability so far, to evolve towards incremental reasoning. This paradigm enables reasoners to handle new data as they arrive. In this paper we introduce Slider-p, an efficient incremental reasoner designed to handle streaming, expanding data with a growing background knowledge base. The directed reasoning implemented in Slider-p makes it possible to influence the order of inferred triples. This feature, to the best of our knowledge novel in the state of the art, enables Slider-p's behavior to be adapted to best answer queries while the reasoning process is still running. It natively supports ρdf and RDFS, and its architecture allows it to be extended to more complex fragments with minimal effort. Our experiments show that it is able to influence the order of the inferred triples, prioritizing the inference of selected kinds of triples.

Jules Chevalier, Julien Subercaze, Christophe Gravier, Frédérique Laforest
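
To illustrate the incremental loop (not Slider-p itself, which handles full ρdf/RDFS with directed, ordered inference), here is a tiny forward-chaining sketch for one RDFS rule, the transitivity of rdfs:subClassOf: when a new triple arrives, its consequences are derived and enqueued immediately.

```python
SUBCLASS = "rdfs:subClassOf"

def insert(triple, store: set):
    """Add one triple and everything it newly entails (one rule only)."""
    pending = [triple]
    while pending:
        s, p, o = pending.pop()
        if (s, p, o) in store:
            continue
        store.add((s, p, o))
        if p == SUBCLASS:
            # (s ⊑ o) and (o ⊑ x) entail (s ⊑ x); (x ⊑ s) entails (x ⊑ o)
            pending += [(s, SUBCLASS, x) for (a, q, x) in store
                        if q == SUBCLASS and a == o]
            pending += [(x, SUBCLASS, o) for (x, q, a) in store
                        if q == SUBCLASS and a == s]

store: set = set()
insert(("Cat", SUBCLASS, "Mammal"), store)
insert(("Mammal", SUBCLASS, "Animal"), store)
print(("Cat", SUBCLASS, "Animal") in store)   # True: derived incrementally
```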
Top-k Matching Queries for Filter-Based Profile Matching in Knowledge Bases

Finding the best-matching job offers for a candidate profile, or the best candidate profiles for a particular job offer, constitutes the most common and most relevant type of query in the Human Resources (HR) sector. Technically, this requires investigating top-k queries on top of knowledge bases and relational databases. In this paper we propose a top-k query algorithm on relational databases that produces effective and efficient results. The approach considers the partial order of matching relations between job and candidate profiles together with an efficient design of the data involved. In particular, the focus on a single relation, the matching relation, is crucial to meeting these expectations.

Alejandra Lorena Paoletti, Jorge Martinez-Gil, Klaus-Dieter Schewe
FETA: Federated QuEry TrAcking for Linked Data

Following the principles of Linked Data (LD), data providers are producing thousands of interlinked datasets in multiple domains, including life science, government, social networking, media and publications. Federated query engines allow data consumers to query several datasets through a federation of SPARQL endpoints. However, data providers only receive the subqueries resulting from the decomposition of the original federated query. Consequently, they do not know how their data are crossed with other datasets of the federation. In this paper, we propose FETA, a Federated quEry TrAcking system for LD. We consider that data providers collaborate by sharing their query logs. From a federated log, FETA infers Basic Graph Patterns (BGPs) of joined triple patterns executed among endpoints. We experimented with FETA using logs produced by FedBench queries executed with the Anapsid and FedX federated query engines. Experiments show that FETA is able to infer BGPs of joined triple patterns with good precision and recall.

Georges Nassopoulos, Patricia Serrano-Alvarado, Pascal Molli, Emmanuel Desmontils

Database and Information System Architectures

Frontmatter
Dynamic Power-Aware Disk Storage Management in Database Servers

Energy consumption has become a first-class optimization goal in the design and implementation of data-intensive computing systems. This is particularly true in the design of database management systems (DBMS), which have been found to be the major energy consumer in the software stack of modern data centers. Among all database components, the storage system is the most power-hungry element. In this paper, we present our research on designing a power-aware data storage system. To tackle the limitations of previous work, we introduce a dynamic power management (DPM) optimization model that minimizes the power consumption of the disk-based storage system while satisfying given performance requirements. It dynamically determines the state of each disk and plans inter-disk fragment migration to achieve a desirable balance between power consumption and query response time. We evaluate the proposed idea by running simulations using several synthetic workloads based on popular TPC benchmarks.

Peyman Behzadnia, Wei Yuan, Bo Zeng, Yi-Cheng Tu, Xiaorui Wang
FR-Index: A Multi-dimensional Indexing Framework for Switch-Centric Data Centers

Data centers occupy a decisive position in the business of data management and data analysis. To improve the efficiency of data retrieval in a data center, we propose a distributed multi-dimensional indexing framework for switch-centric data centers with tree-like topologies. Taking Fat-Tree, a typical switch-centric data center topology, as a representative, we design FR-Index, a two-layer indexing scheme that takes full advantage of the Fat-Tree topology and R-tree indexing technology. In the lower layer, each server indexes its local data with an R-tree, while in the upper layer a distributed global index provides an overview of the whole data set. To improve the efficiency of query processing, we also provide techniques to reduce the dimensionality of the index. Experiments on Amazon's EC2 show that our indexing scheme is scalable, efficient and lightweight, and can significantly improve the efficiency of query processing.

Yatao Zhang, Jialiang Cao, Xiaofeng Gao, Guihai Chen
Unsupervised Learning for Detecting Refactoring Opportunities in Service-Oriented Applications

Service-Oriented Computing (SOC) has been widely used for building distributed and enterprise-wide software applications. One major problem with this kind of application is growth: as the size and complexity of applications increase, the probability of code duplication increases, among other refactoring issues. This paper proposes an unsupervised learning approach to assist software developers in detecting refactoring opportunities in service-oriented applications. The approach gathers non-refactored Web Service Description Language (WSDL) documents and applies clustering and visualization techniques to deliver a list of refactoring suggestions with which to start the refactoring process. We evaluated our approach on two real-life case studies, using internal validity criteria for clustering quality.

Guillermo Rodríguez, Álvaro Soria, Alfredo Teyseyre, Luis Berdun, Marcelo Campo
A Survey on Visual Query Systems in the Web Era

As more and more collections of data become available on the web, non-expert users demand easy ways to retrieve data from these collections. One solution is the so-called Visual Query Systems (VQS), in which queries are represented visually and users do not have to understand query languages such as SQL or XQuery. In 1996, a paper by Catarci reviewed the Visual Query Systems available up to that year. In this paper, we review VQSs from 1997 until now and try to determine whether they have been the solution for non-expert users. The short answer is no, because very few systems have in fact been used in real environments or as commercial tools. We have also gathered basic features of VQSs, such as the visual representation adopted to present the reality of interest and the visual representation adopted to express queries.

Jorge Lloret-Gazo

Query Answering and Optimization

Frontmatter
Query Similarity for Approximate Query Answering

Query rewriting in heterogeneous environments assumes mappings that are complete. In reality, and especially in the Big Data era, it is rarely the case that such complete sets of mappings exist between sources; the presence of partial mappings is the norm rather than the exception. So, practically, existing rewriting algorithms fail in the majority of cases. The solution is to approximate original queries with others that can be answered by the existing mappings. Approximate queries bear some similarity to the original ones in terms of structure and semantics. In this paper we investigate the notion of such query similarity and introduce the use of query similarity functions to this end. We also present a methodology for constructing such functions. We employ exemplary similarity functions created with the proposed methodology in recent algorithms for approximate query answering and show experimental results on the influence of the similarity function on the efficiency of the algorithms.

Verena Kantere
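
One way to make "query similarity" concrete, as a simplification of our own and not the paper's methodology, is a weighted Jaccard overlap of the structural elements of two conjunctive queries; an approximate rewriting algorithm could then prefer the answerable query that maximizes this score.

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def query_similarity(q1: dict, q2: dict,
                     weights=(0.4, 0.4, 0.2)) -> float:
    """q1, q2: {'relations': set, 'joins': set, 'filters': set}.
    Weighted Jaccard over the queries' structural elements."""
    w_rel, w_join, w_filt = weights
    return (w_rel  * jaccard(q1["relations"], q2["relations"])
          + w_join * jaccard(q1["joins"],     q2["joins"])
          + w_filt * jaccard(q1["filters"],   q2["filters"]))
```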
Generalized Maximal Consistent Answers in P2P Deductive Databases

The paper provides a contribution to computing consistent answers to logic queries in a P2P environment. Each peer joining a P2P system imports data from its neighbors by using a set of mapping rules, i.e. a set of semantic correspondences to a set of peers belonging to the same environment. By using mapping rules, as soon as it enters the system a peer can participate and access all data available in its neighborhood, and through its neighborhood it becomes accessible to all the other peers in the system. The declarative semantics of a P2P system is defined in terms of minimal weak models, under which each peer uses its mapping rules to import minimal sets of mapping atoms that allow its local integrity constraints to be satisfied. The contribution of the present paper consists in extending the classical notion of consistent answer by allowing the presence of partially defined atoms, i.e. atoms with “unknown” values caused by tuples in different minimal weak models that disagree on the value of one or more attributes. The basic proposal is the following: in the presence of alternative minimal weak models, the choice is to extract the minimal consistent portion of the information they all hold, i.e. the information on which the minimal weak models agree. Therefore, true information is that “supported” by all minimal weak models, i.e. the set of atoms which maximizes the information shared by the minimal weak models.

Luciano Caroprese, Ester Zumpano
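
The agreement idea can be sketched concretely (an illustrative simplification, not the paper's semantics): given the same relation in several minimal weak models, keep each tuple that all models contain and mark an attribute as unknown wherever the models disagree on its value.

```python
def generalized_answer(models: list[dict]) -> dict:
    """models: each maps a tuple key to its attribute tuple.
    Returns key -> attributes, with '?' where models disagree."""
    shared_keys = set.intersection(*(set(m) for m in models))
    answer = {}
    for key in shared_keys:
        versions = [m[key] for m in models]
        answer[key] = tuple(
            vals[0] if len(set(vals)) == 1 else "?"   # agreement or unknown
            for vals in zip(*versions)
        )
    return answer

# Two minimal weak models disagree on Bob's city:
m1 = {"bob": ("Bob", "Rome")}
m2 = {"bob": ("Bob", "Milan")}
print(generalized_answer([m1, m2]))   # {'bob': ('Bob', '?')}
```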
Computing Range Skyline Query on Uncertain Dimension

A user sometimes prefers not to be restricted when querying for information. Querying within a range of search often provides a different perspective to the user, as opposed to a rigid search. Computing a skyline within a given range would be easy on a traditional dataset; the challenge arises when the dataset being queried consists of both atomic values and continuous ranges of values. For a set of objects with an uncertain dimension, a skyline with a range query [q_j : q_j'] on that uncertain dimension returns the objects that are not dominated by any other object within the range query. A method is proposed to determine such objects and answer skyline queries that satisfy the range query. The correctness of the method is proven through comparisons with two naïve methods that strictly reject and loosely accept, respectively, objects that intersect with the range query.

Nurul Husna Mohd Saad, Hamidah Ibrahim, Fatimah Sidi, Razali Yaakob, Ali Amer Alwan
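
A small sketch of the ingredients, under our own assumptions (smaller values preferred; a naïve dominance test on certain dimensions): an object whose uncertain attribute is an interval is first checked for overlap with the query range, and the usual skyline dominance test is applied to the surviving candidates.

```python
def overlaps(interval, q_range):
    """Does the uncertain interval [lo, hi] intersect [q, q']?"""
    lo, hi = interval
    q, q_prime = q_range
    return hi >= q and lo <= q_prime        # otherwise: strict reject

def dominates(a, b):
    """a, b: tuples of certain values (smaller assumed better)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def range_skyline(objects, q_range):
    """objects: list of (certain_dims_tuple, uncertain_interval)."""
    candidates = [o for o in objects if overlaps(o[1], q_range)]
    return [o for o in candidates
            if not any(dominates(p[0], o[0])
                       for p in candidates if p is not o)]
```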
Aging Locality Awareness in Cost Estimation for Database Query Optimization

Insertions, updates and deletions eventually deteriorate the structural efficiency of database storage and then cause performance degradation. This phenomenon is called “aging.” In real-world database systems, aging often exhibits strong locality because of the inherent skewness of data access; specifically, the cost of I/O operations is not uniform throughout the storage space. Query execution cost is thus potentially influenced by aging. However, conventional query optimizers do not consider this aging locality, so at times they cannot accurately estimate the cost of query execution plans. In this paper, we propose a novel cost estimation method with the key capability of accurately accounting for aging phenomena, even when they are incurred non-uniformly. Our experiment on PostgreSQL and TPC-H data sets showed that the proposed method can accurately estimate query execution cost even when it is influenced by aging.

Chihiro Kato, Yuto Hayamizu, Kazuo Goda, Masaru Kitsuregawa

Information Retrieval, and Keyword Search

Frontmatter
Constructing Data Graphs for Keyword Search

Data graphs are convenient for supporting keyword search that takes into account available semantic structure and not just textual relevance. However, the problem of constructing data graphs that facilitate both efficiency and effectiveness of the underlying system has hardly been addressed. A conceptual model for this task is proposed. Principles for constructing good data graphs are explained. A transformation for generating data graphs from XML is developed.

Konstantin Golenberg, Yehoshua Sagiv
Generating Pseudo Search History Data in the Absence of Real Search History

Previous studies in the Information Retrieval literature have shown that users' search history can be leveraged to improve current search results. However, sometimes we have little to no search history available. In such cases, it would be helpful to obtain data similar to search history data. One way of doing this is by simulating previous search interactions. In the present study, we focus on generating simulated “related queries” that can serve as an additional source of information about the current search [1]. Assuming that users reformulate their queries by leveraging some of the terms and key phrases they find in ranked documents during their search, we propose simple models for generating such related queries.

Ashraf Bah, Ben Carterette
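
In the spirit of the stated assumption, a simple generator might mix terms from the original query with frequent terms of the top-ranked documents; the parameters and weighting below are our guesses, not the paper's models.

```python
import random
from collections import Counter

def pseudo_related_queries(query_terms, top_docs,
                           n_queries=5, keep=2, add=2):
    """query_terms: list of terms; top_docs: list of token lists for the
    highest-ranked documents. Returns n_queries simulated reformulations."""
    doc_terms = Counter(t for doc in top_docs for t in doc
                        if t not in query_terms)
    common = [t for t, _ in doc_terms.most_common(20)]
    queries = []
    for _ in range(n_queries):
        kept = random.sample(query_terms, min(keep, len(query_terms)))
        added = random.sample(common, min(add, len(common)))
        queries.append(kept + added)   # mix of original and document terms
    return queries
```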
Variable-Chromosome-Length Genetic Algorithm for Time Series Discretization

The symbolic aggregate approximation method (SAX) is a widely known dimensionality reduction technique for time series data. SAX assumes that normalized time series have a highly Gaussian distribution and, based on this assumption, uses statistical lookup tables to determine the locations of the breakpoints on which SAX is based. In a previous work, we showed how this assumption oversimplifies the problem, which may result in high classification errors, and we proposed an alternative approach based on genetic algorithms to determine the breakpoint locations. We also showed how this alternative approach boosts the performance of the original SAX. However, our earlier method had the same drawback as the original SAX: it could only determine the locations of the breakpoints, not the corresponding alphabet size, which had to be input by the user. We therefore had to run the optimization process as many times as the range of the alphabet size; moreover, performing the optimization in two steps can cause overfitting. The novelty of the present work is twofold: first, we extend a version of genetic algorithms that uses chromosomes of different lengths; second, we apply this variable-chromosome-length genetic algorithm to the problem at hand to determine the number of breakpoints and their locations simultaneously, so that the optimization process runs only once. This speeds up the training stage and also avoids overfitting. The experiments we conducted on a variety of datasets give promising results.

Muhammad Marwan Muhammad Fuad
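
To show what the breakpoints are for, here is standard SAX discretization with a given breakpoint list: the z-normalized series is averaged over fixed-width frames (PAA), and each frame mean is mapped to a symbol by the interval it falls into. The paper's contribution, learning the breakpoints and their number with a variable-chromosome-length GA instead of Gaussian lookup tables, is not reproduced here.

```python
import bisect
import numpy as np

def sax(series, n_frames, breakpoints, alphabet="abcdefgh"):
    x = (series - series.mean()) / series.std()        # z-normalize
    frames = np.array_split(x, n_frames)               # PAA frames
    means = [f.mean() for f in frames]
    # bisect returns the interval index, i.e. the symbol for each mean
    return "".join(alphabet[bisect.bisect(breakpoints, m)] for m in means)

ts = np.sin(np.linspace(0, 6.28, 128))
print(sax(ts, 8, breakpoints=[-0.67, 0.0, 0.67]))      # 4-symbol alphabet
```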
Approximate Temporal Aggregation with Nearby Coalescing

Temporal aggregation is an important query operation in temporal databases. Although the general forms of temporal aggregation have been well researched, new applications such as online calendaring systems call for new kinds of temporal aggregation. In this paper, we study approximate temporal aggregation with nearby coalescing, which we call NSTA. NSTA improves instant temporal aggregation by coalescing nearby (not necessarily adjacent) intervals to produce more compact and concise aggregate results. We introduce the notion of coalescibility and, based on it, develop efficient algorithms to compute coalesced aggregates. We evaluate the proposed methods experimentally and verify their feasibility.

Kai Cheng
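
The core operation can be sketched in a few lines: merge time intervals whose gap is within a tolerance (adjacent or merely close), yielding fewer, coarser aggregate intervals. The tolerance and the aggregate-combination rule (here simply summing) are placeholders for the paper's coalescibility criteria.

```python
def coalesce_nearby(intervals, delta):
    """intervals: list of (start, end, value), sorted by start.
    Merge intervals whose gap to the previous one is at most delta."""
    merged = [list(intervals[0])]
    for start, end, value in intervals[1:]:
        last = merged[-1]
        if start - last[1] <= delta:       # nearby: gap within tolerance
            last[1] = max(last[1], end)
            last[2] += value               # combine aggregate values
        else:
            merged.append([start, end, value])
    return [tuple(m) for m in merged]

print(coalesce_nearby([(1, 3, 5), (4, 6, 2), (20, 22, 1)], delta=2))
# [(1, 6, 7), (20, 22, 1)]
```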

Data Modelling, and Uncertainty

Frontmatter
A Data Model for Determining Weather’s Impact on Travel Time

Accurately estimating travel times in road networks is a complex task because travel times depend on factors such as the weather. In this paper, we present a generic model for integrating weather data with GPS data to improve the accuracy of estimated travel times. First, we present a data model for storing and map-matching GPS data and integrating it with detailed weather data. The model is generic in the sense that it can be used anywhere GPS data and weather data are available. Next, we analyze the correlation between travel time and the weather classes dry, fog, rain, and snow, along with wind's impact on travel time. Using a data set of 1.6 billion GPS records collected from 10,560 vehicles over a 5-year period across all of Denmark, we show that snow can increase travel time by up to 27% and strong headwind by up to 19% (compared to dry, calm weather). This clearly shows that accurate travel time estimation requires knowledge of the weather.

Ove Andersen, Kristian Torp
Simplify the Design of XML Schemas by Type Dependencies

In XML Schema, the type definition mechanism is responsible both for defining types and for passing contextual information, which may cause design problems such as artificial types. This paper proposes a new kind of XML schema, called a Type Dependencies (TD) schema, to separate these two tasks. A TD schema consists of two parts: a set of type dependencies, responsible for passing contextual information, and a complete competition grammar, responsible only for defining types. It can help users design better schemas more easily, since TD schemas avoid the problems related to the Element Declarations Consistent (EDC) rule and artificial types. Furthermore, TD schemas are more expressive than XML Schema and satisfy the semantic concept of 1-pass preorder typing, which makes them more suitable for streaming data.

Jia Liu, Husheng Liao
An Efficient Initialization Method for Probabilistic Relational Databases

Probabilistic relational databases play an important role in uncertain data management. Informally, a probabilistic database is a probability distribution over a set of deterministic databases (namely, possible worlds). Existing initialization methods that transform the possible-worlds representation into our chosen representation make the formulae of tuples very long. We propose an efficient initialization method by providing an equation that generates simplified tuple formulae. The experimental study shows that the proposed method greatly simplifies the formulae of tuples without additional time cost, and subsequent queries benefit from the simplified formulae.

Hong Zhu, Caicai Zhang, Zhongsheng Cao
Backmatter
Metadata
Title
Database and Expert Systems Applications
Editors
Sven Hartmann
Hui Ma
Copyright Year
2016
Electronic ISBN
978-3-319-44406-2
Print ISBN
978-3-319-44405-5
DOI
https://doi.org/10.1007/978-3-319-44406-2
