2017 | Book

Database and Expert Systems Applications

28th International Conference, DEXA 2017, Lyon, France, August 28-31, 2017, Proceedings, Part II

Editors: Djamal Benslimane, Ernesto Damiani, William I. Grosky, Abdelkader Hameurlain, Amit Sheth, Roland R. Wagner

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science

About this book

This two-volume set, LNCS 10438 and LNCS 10439, constitutes the refereed proceedings of the 28th International Conference on Database and Expert Systems Applications, DEXA 2017, held in Lyon, France, in August 2017.

The 37 revised full papers presented together with 40 short papers were carefully reviewed and selected from 166 submissions. The papers discuss a range of topics including: Semantic Web and Semantics; Graph Matching; Data Modeling, Data Abstraction, and Uncertainty; Preferences and Query Optimization; Data Integration and RDF Matching; Security and Privacy; Web Search; Data Clustering; Top-K and Skyline Queries; Data Mining and Big Data; Service Computing; Continuous and Temporal Data, and Continuous Query Language; Text Processing and Semantic Search; Indexing and Concurrency Control Methods; Data Warehouse and Data Stream Warehouse; Data Mining and Machine Learning; Recommender Systems and Query Recommendation; Graph Algorithms; Semantic Clustering and Data Classification.

Table of Contents

Frontmatter

Security and Privacy (II)

Frontmatter
Generating k-Anonymous Microdata by Fuzzy Possibilistic Clustering

Collecting, releasing and sharing microdata about individuals is needed in some domains to support research initiatives that aim to create new valuable knowledge by means of data mining and analysis tools. Individuals' anonymity must therefore be sought to guarantee their privacy prior to publication. k-Anonymity by microaggregation is a widely accepted model for data anonymization. It consists in de-associating the relationship between the identity of data subjects, i.e. individuals, and their confidential information. However, this method shows limits when dealing with real datasets, which are characterized by a large number of attributes and the presence of noisy data. Decreasing the information loss during the anonymization process is thus a compelling task. This paper aims to address this challenge. To do so, we propose a microaggregation algorithm called Micro-PFSOM, based on fuzzy possibilistic clustering. The main thrust of this algorithm stands in applying a hybrid anonymization process.

Balkis Abidi, Sadok Ben Yahia
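
As a concrete illustration of the microaggregation model described above, here is a minimal MDAV-style sketch in Python: records are greedily grouped with their nearest neighbours into clusters of at least k, and quasi-identifiers are replaced by the group centroid. This is a generic crisp baseline, not the paper's fuzzy possibilistic Micro-PFSOM algorithm.

```python
# Minimal MDAV-style k-anonymous microaggregation (crisp baseline sketch).
import numpy as np

def microaggregate(records: np.ndarray, k: int) -> np.ndarray:
    anonymized = records.astype(float)
    remaining = list(range(len(records)))
    groups = []
    while len(remaining) >= k:
        centroid = records[remaining].mean(axis=0)
        # Seed a group with the record farthest from the current centroid.
        seed = max(remaining, key=lambda i: np.linalg.norm(records[i] - centroid))
        group = sorted(remaining,
                       key=lambda i: np.linalg.norm(records[i] - records[seed]))[:k]
        groups.append(group)
        remaining = [i for i in remaining if i not in group]
    if remaining and groups:
        groups[-1].extend(remaining)  # keep every group at size >= k
    for group in groups:
        # Replacing quasi-identifiers by the centroid is where information loss occurs.
        anonymized[group] = records[group].mean(axis=0)
    return anonymized
```
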
Lightweight Privacy-Preserving Task Assignment in Skill-Aware Crowdsourcing

Crowdsourcing platforms dedicated to work are used by a growing number of individuals and organizations, for tasks that are increasingly diverse and complex and that require very specific skills. The resulting highly detailed worker profiles enable high-quality task assignments but may disclose a large amount of personal information to the central platform (e.g., personal preferences, availabilities, wealth, occupations), jeopardizing workers' privacy. In this paper, we propose a lightweight approach to protect workers' privacy against the platform within the current crowdsourcing task assignment process. Our approach (1) satisfies differential privacy by letting each worker locally perturb her profile before sending it to the platform, and (2) copes with the resulting perturbation by leveraging a taxonomy defined on workers' profiles. We overview this approach below, explaining the lightweight upgrades to be brought to the participants. We have also shown formally (in the full version of this paper [1]) that our approach satisfies differential privacy, and empirically, through experiments performed on various synthetic datasets, that it is a promising research track for coping with realistic cost and quality requirements.

Louis Béziaud, Tristan Allard, David Gross-Amblard
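
The local perturbation step that the abstract describes can be illustrated with randomized response, a standard local differential privacy primitive; the paper's actual mechanism and its taxonomy-based correction are detailed in its full version [1]. A minimal sketch over a binary skill vector:

```python
# Randomized response over a binary skill profile: keep each bit with
# probability e^eps / (1 + e^eps), flip it otherwise (epsilon-LDP per bit).
import math
import random

def perturb_profile(profile, epsilon):
    p_keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return [bit if random.random() < p_keep else 1 - bit for bit in profile]

noisy = perturb_profile([1, 0, 0, 1, 1], epsilon=1.0)  # sent to the platform
```
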
Clustering Heuristics for Efficient t-closeness Anonymisation

Anonymisation based on t-closeness is a privacy-preserving method of publishing micro-data that is safe from skewness and similarity attacks. The t-closeness privacy requirement for publishing micro-data requires that the distance between the distribution of a sensitive attribute in an equivalence class and the distribution of sensitive attributes in the whole micro-data set be no greater than a threshold value t. An equivalence class is a set of records that are similar with respect to certain identifying attributes (quasi-identifiers), and a micro-data set is said to be t-close when all such equivalence classes satisfy t-closeness. However, the t-closeness anonymisation problem is NP-hard. As a performance-efficient alternative, we propose a t-clustering algorithm with an average time complexity of $$O(m^{2} \log n)$$ where n and m are the number of tuples and attributes, respectively. We address privacy disclosures by using heuristics based on noise addition to distort the anonymised datasets while minimising information loss. Our experiments indicate that our proposed algorithm is time efficient and practically scalable.

Anne V. D. M. Kayem, Christoph Meinel
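
The t-closeness requirement above is easy to state in code. The following sketch checks one equivalence class, with total variation distance standing in for the distance measure (the t-closeness literature commonly uses the Earth Mover's Distance):

```python
# Check t-closeness of one equivalence class against the full micro-data set.
from collections import Counter

def is_t_close(class_values, all_values, t):
    p, q = Counter(class_values), Counter(all_values)
    n_c, n_all = len(class_values), len(all_values)
    # Total variation distance between the two empirical distributions.
    distance = 0.5 * sum(abs(p[v] / n_c - q[v] / n_all) for v in set(p) | set(q))
    return distance <= t

print(is_t_close(["flu", "flu"], ["flu", "flu", "hiv", "cold"], t=0.5))
```
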

Service Computing

Frontmatter
A QoS-Aware Web Service Composition Approach Based on Genetic Programming and Graph Databases

A Web service can be thought of as a software module designed to accomplish specific tasks over the Internet. Web services are very popular, as they encourage code reuse as opposed to re-implementing already existing functionality. The process of combining multiple Web services is known as Web service composition. Previous attempts at automatically generating compositions have made use of genetic programming to optimize compositions, or introduced databases to keep track of relationships between services. This paper presents an approach that combines these two ideas, generating new compositions based on information stored in a graph database and then optimising their quality using genetic programming. Experiments were conducted comparing the performance of the newly proposed approach against that of existing works. Results show that the new approach executes faster than the previously proposed works, though it does not always reach the same solution quality as the compositions produced by them. Despite this, the experiments demonstrate that the fundamental idea of combining graph databases and genetic programming for Web service composition is feasible and a promising area of investigation.

Alexandre Sawczuk da Silva, Ewan Moshi, Hui Ma, Sven Hartmann
Combining Web-Service and Rule-Based Systems to Implement a Reliable DRG-Solution

Diagnosis-Related Groups (DRG) is a Patient Classification System that classifies inpatient stays into well-known groups in order to estimate the appropriate fees for refunding hospitals. Each DRG solution has its own grouping approach: using data gathered from the Hospital Information System (HIS), it assigns to an inpatient stay the appropriate group, called a DRG. In this paper, using both Web service technology and rule-based expert systems, we develop a rule-based system as a service for handling the grouping solution. This approach lets hospitals and public/private stakeholders avoid stringent software requirements on their local IT infrastructure by reusing a shared and distributed DRG solution. Moreover, combining these two technologies enhances knowledge maintenance and improves its reusability.

Idir Amine Amarouche, Lydia Rabia, Tayeb Kenaza
Usage-Aware Service Identification for Architecture Migration of Object-Oriented Systems to SoA

Organisations currently migrate the architecture of their traditional systems to service-oriented systems. Since these systems usually serve many clients, system usage can vary considerably. The evolution of a service-oriented system is facilitated if the offered services are specific to each group of clients. However, state-of-the-art service-identification approaches do not consider system usage. Thus, we propose an online process that dynamically re-identifies services as new method traces of system usage arrive. A preliminary evaluation of our process on real-world case studies shows that it is highly effective at identifying usage-aware services.

Dionysis Athanasopoulos

Continuous and Temporal Data, and Continuous Query Language

Frontmatter
On Computing Temporal Aggregates over Null Time Intervals

In a temporal database, each data tuple is accompanied by a time interval during which its attribute values are valid. In this paper, we consider null time intervals, that is, time intervals not intersected by any time interval in the temporal database. We deal with the problem of computing temporal aggregates over null time intervals with length constraints. By interval folding, we transform the problem into aggregates over stabbing groups, i.e., maximal stabbing interval sets. We describe the detailed algorithms and report experimental results.

Kai Cheng
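
To make the setting concrete, the following sketch computes null time intervals, i.e., gaps intersected by no stored interval, under a minimum-length constraint, via a sort-and-merge sweep; it illustrates the problem, not the paper's interval-folding algorithm:

```python
# Null time intervals: gaps between merged stored intervals, filtered by length.
def null_intervals(intervals, min_len=0):
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # extend the current run
        else:
            merged.append([start, end])              # start a new run
    gaps = ((a[1], b[0]) for a, b in zip(merged, merged[1:]))
    return [(s, e) for s, e in gaps if e - s >= min_len]

print(null_intervals([(1, 3), (2, 4), (7, 9), (11, 12)], min_len=2))
# -> [(4, 7), (9, 11)]
```
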
A Continuous Query Language for Stream-Based Artifacts

Artifacts not only combine data and process into the same semantic entities that evolve according to precise state-based lifecycles, but also provide casual users with a simple and intuitive framework for modeling, executing, and managing business processes. Despite the surge in interest, artifacts still lack effective languages and tools to define, manipulate and interrogate their instances. Most existing artifact languages and notations focus only on modeling and executing artifact-based processes in which data have discrete values, and fail to handle artifacts whose data include continuous data streams. In this paper, we present the Continuous Artifact Query Language (CAQL) for modeling stream-based artifacts and semantically manipulating and interrogating their instances.

Maroun Abi Assaf, Youakim Badr, Youssef Amghar
Past Indeterminacy in Data Warehouse Design

Traditional data warehouse design methods do not fully address some important challenges, particularly temporal ones. Among these, past indeterminacy is not handled systematically and uniformly. Furthermore, most methods published until now present transformation approaches through examples rather than general and systematic transformation rules. As a result, real-world applications require manual adaptations and implementations, which hinders scalability and long-term maintenance and increases the risk of inconsistency. This article extends the Unified Bitemporal Historicization Framework with a set of specifications and a deterministic process that defines simple steps for transforming a non-historical database schema into a historical schema allowing data evolution and traceability, including past and future indeterminacy. The primary aim of this work is to help data warehouse designers model historicized schemas based on a sound theory ensuring sound temporal semantics, data integrity and query expressiveness.

Christina Khnaisser, Luc Lavoie, Anita Burgun, Jean-François Ethier

Text Processing and Semantic Search

Frontmatter
A Case for Term Weighting Using a Dictionary on GPUs

This paper demonstrates a fast method of Okapi BM25 term weighting on graphics processing units (GPUs) for information retrieval, combining a GPU-based dictionary using a succinct data structure with data-parallel primitives. The main problem with handling documents on GPUs is processing variable-length strings, such as the documents themselves and words. Processing variable sizes of data causes many idle cores, i.e., load imbalances in threads, due to the single instruction multiple data (SIMD) nature of the GPU architecture. Our term weighting method is carefully composed of efficient data-parallel primitives to avoid load imbalance. Additionally, we implemented a high-performance compressed dictionary on GPUs. As words are converted into identifiers (IDs) with this dictionary, costly string comparisons can be avoided. Our experimental results revealed that the proposed method of term weighting on GPUs performed up to $$5\times $$ faster than the MapReduce-based one on multi-core CPUs.

Toshiaki Wakatsuki, Atsushi Keyaki, Jun Miyazaki
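
For reference, the Okapi BM25 weight being computed has a simple closed form; the paper's contribution lies in mapping this computation onto GPU data-parallel primitives, which the sketch below does not attempt:

```python
# Okapi BM25 weight of a term in a document (standard formulation).
import math

def bm25_weight(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)  # length normalization
    return idf * tf * (k1 + 1) / (tf + norm)
```
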
Process and Tool-Support to Collaboratively Formalize Statutory Texts by Executable Models

This paper describes a collaborative modeling environment to support the analysis and interpretation of statutory texts, i.e., laws. The paper performs a case study on formalizing the product liability act and proposes a data-centric process that enables the formalization of laws. The implemented application integrates state-of-the-art text mining technologies to assist legal experts during the formalization of norms by automatically detecting key concepts within legal texts, e.g., legal definitions, obligations, etc. The work at hand elaborates on the implementation of a data science environment and describes key requirements, a reference architecture and a collaborative user interface.

Bernhard Waltl, Thomas Reschenhofer, Florian Matthes
A Bayesian Approach for Semantic Search Based on DAG-Shaped Ontologies

Semantic search has great potential to help users make choices, since it appears to outperform traditional keyword-based approaches. This paper presents an ontology-based semantic search method, referred to as influential SemSim (i-SemSim), which relies on a Bayesian probabilistic approach for weighting the reference ontology. The Bayesian approach is promising when the reference ontology is organized as a Directed Acyclic Graph (DAG). In the proposed method, the similarity between a user request and semantically annotated resources is evaluated. The user request, as well as each annotated resource, is represented by a set of concepts of the reference ontology. Our experimental results show that adopting the Bayesian method for weighting DAG-based reference ontologies allows i-SemSim to outperform the most representative methods selected from the literature.

Anna Formica, Michele Missikoff, Elaheh Pourabbas, Francesco Taglino

Indexing and Concurrency Control Methods

Frontmatter
Indexing Multiple-Instance Objects

As an actively investigated topic in machine learning, Multiple-Instance Learning (MIL) has many proposed solutions, including supervised and unsupervised methods. We introduce an indexing technique supporting efficient queries on Multiple-Instance (MI) objects. Our technique has a dynamic structure that supports efficient insertions and deletions, and is based on an effective similarity measure for MI objects. Some MIL approaches have proposed similarity measures for MI objects, but these either do not use all available information or are time-consuming. In this paper, we use two joint-Gaussian-based measures for MIL, Joint Gaussian Similarity (JGS) and Joint Gaussian Distance (JGD). They are based on intuitive definitions and take all the information into account while being robust to noise. For JGS, we propose the Instance-based Index for querying MI objects. For JGD, metric trees can be used directly as the index because of its metric properties. Extensive experimental evaluations on various synthetic and real-world data sets demonstrate the effectiveness and efficiency of the similarity measures and the performance of the corresponding index structures.

Linfei Zhou, Wei Ye, Zhen Wang, Claudia Plant, Christian Böhm
Novel Indexing Strategy and Similarity Measures for Gaussian Mixture Models

Efficient similarity search for data with complex structures is a challenging task in many modern data mining applications, such as image retrieval, speaker recognition and stock market analysis. A common way to model such data objects is Gaussian Mixture Models, which can approximate arbitrary distributions in a concise way. To facilitate efficient queries, indexes are essential. However, due to differing numbers of components in Gaussian Mixture Models, existing index methods tend to break down in performance. In this paper we propose a novel technique, Normalized Transformation, that reorganizes the index structure to account for different numbers of components in Gaussian Mixture Models. In addition, Normalized Transformation enables us to derive a set of similarity measures, on the basis of existing ones, that have closed-form expressions. Extensive experiments demonstrate the effectiveness of the proposed technique for Gaussian component-based indexing and the performance of the novel similarity measures for clustering and classification.

Linfei Zhou, Wei Ye, Bianca Wackersreuther, Claudia Plant, Christian Böhm
A Concurrency Control Protocol that Selects Accessible Replicated Pages to Avoid Latch Collisions for B-Trees in Manycore Environments

In recent years, microprocessor vendors aiming for dramatic performance improvements have introduced manycore processors with over 100 cores on a single chip. To take advantage of this in database and storage systems, B-trees and their concurrency control must reduce the number of latch collisions and interactions among the cores. Concurrency control methods such as physiological partitioning (PLP), which assigns cores to partitions in a value-range partitioning, have been studied. These methods perform effectively for nearly static and uniform workloads on multicore processors. However, their performance deteriorates significantly if there is major restructuring of B-trees in response to skew and changing workloads. The manycore approach has a high likelihood of causing workload skew, given the lower power of each core, with an accompanying severe degradation in performance. This issue is critical for database and storage systems, which demand consistently high performance even under dynamic workloads. To address this problem, we propose an efficient new concurrency control method suitable for manycore processor platforms, called the selecting accessible replicated pages (SARP) B-tree concurrency control method. SARP achieves consistently high performance with robustness against workload skew by distributing the workload to many cores on manycore processors, while reducing latch collisions and interactions among the cores. By applying parallel B-trees to shared-everything environments, SARP selects execution cores and access paths that distribute workloads widely to cores with appropriate processor characteristics. Experimental results using a Linux server with an Intel Xeon Phi manycore processor demonstrated that the proposed system achieves 44 times the throughput of PLP in the maximum-skew case and maintains 66% of the throughput for uniform workloads.

Tomohiro Yoshihara, Haruo Yokota
Non-order-preserving Index for Encrypted Database Management System

Data confidentiality is a concern in the Database-as-a-Service (DBaaS) model. An encrypted database management system (EDBMS) addresses this concern by having the data owner (DO) encrypt its private data before storing it in the database hosted by a third-party service provider (SP). Indexing encrypted data at the SP is not straightforward. Most existing indexing methods are either order-preserving or require the DO to be involved in query computation. An order-preserving index is vulnerable to inference analysis, while having the DO compute queries defeats the purpose of the DBaaS model, which is to delegate the DO's database work to the SP. We developed a non-order-preserving indexing method that does not require the DO's involvement in query processing at the SP. Our empirical study shows that our indexing method can reduce selection processing cost by an order of magnitude compared to the case without the index.

Wai-Kit Wong, Kwok-Wai Wong, Ho-Yin Yue, David W. Cheung

Data Warehouse and Data Stream Warehouse

Frontmatter
A Variety-Sensitive ETL Processes

Nowadays, small, medium and large companies need advanced data integration techniques, supported by tools, to analyse data in order to deliver real-time alerts, trigger automated actions, etc. In a context of rapidly changing technology, these techniques have to consider two main issues: (a) the variety of the huge number of data sources (e.g. traditional, semantic, and graph databases) and (b) the variety of storage platforms, where a data integration system may have several stores, each hosting a particular type of data. These issues directly impact the efficiency and the deployment flexibility of ETL (Extract, Transform, Load) processes. In this paper, we consider both issues. Firstly, thanks to Model Driven Engineering, we give a generic representation of the different types of data sources; this genericity allows overloading the ETL operators. To show the benefit of this genericity, several examples of instantiation are described, covering relational, semantic and graph databases. Secondly, a Web-service-driven approach for orchestrating the ETL flows is given. Thirdly, we present a fusion procedure that merges the sets of heterogeneous instances, deployed according to their preferred stores. Finally, our findings are validated through a proof-of-concept tool using the LUBM benchmark and the YAGO $$\mathcal {KB}$$, deployed in Oracle RDF Semantic Graph 12c.

Nabila Berkani, Ladjel Bellatreche
Integrating the R Language Runtime System with a Data Stream Warehouse

Computing mathematical functions or machine learning models on data streams is difficult: a popular approach is to use the R language. Unfortunately, R has important limitations: a dynamic runtime system incompatible with a DBMS, a limit set by available RAM, and no data management capabilities. On the other hand, SQL is well established for writing queries and managing data, but it is inadequate for mathematical computations. With that motivation in mind, we present a system that enables analysis in R on a time window, where the DBMS continuously inserts new records and propagates updates to materialized views. We explain the low-level integration enabling fast in-RAM data transfer between the DBMS query process and the R runtime. Our system enables analytic calls in both directions: (1) R calling SQL to evaluate streaming queries, transferring output streaming tables and analyzing them with R operators and functions in the R runtime; (2) SQL calling R, to exploit R mathematical operators and models, computed in a streaming fashion inside the DBMS. We discuss analytic examples illustrating calls in both directions. We show experimentally that our system achieves streaming speed in data transfer.

Carlos Ordonez, Theodore Johnson, Simon Urbanek, Vladislav Shkapenyuk, Divesh Srivastava
Cleaning Out Web Spam by Entropy-Based Cascade Outlier Detection

Web spam refers to Web pages where tricks are played to mislead search engines into ranking them higher than they really deserve. It causes huge damage to e-commerce and Web users, and threatens Web security, so combating Web spam is an urgent task. In this paper, Web quality and semantic measurements are integrated with content and link features to construct a more representative characteristic set. A cascade detection mechanism based on an entropy-based outlier mining (EOM) algorithm is proposed. The mechanism consists of three stages with different feature groups. Experiments on WEBSPAM-UK2007 show that the quality and semantic features can effectively improve detection, and that the EOM algorithm outperforms many classic classification algorithms on unbalanced data. The cascade detection mechanism can clean out more spam.

Sha Wei, Yan Zhu
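
The basic quantity behind entropy-based outlier mining is easy to sketch: the Shannon entropy of a discretized feature, with observations whose removal most reduces entropy being outlier candidates. A minimal illustration (not the EOM algorithm itself):

```python
# Shannon entropy of one discretized feature.
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

print(entropy(["a", "a", "a", "b"]))  # low entropy: the data is mostly one value
```
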
Logical Schema for Data Warehouse on Column-Oriented NoSQL Databases

Column-oriented NoSQL systems propose a flexible and highly denormalized data schema that facilitates data warehouse scalability. However, implementing data warehouses with NoSQL databases is a challenging task, as it involves a distributed data management policy on multi-node clusters. Indeed, in column-oriented NoSQL systems, query performance can be improved by careful data grouping. In this paper, we present a method that uses clustering techniques, in particular k-means, to derive a good grouping of columns into column families from existing fact and dimension tables. To validate our method, we adopt the TPC-DS data benchmark. We have conducted several experiments to examine the benefits of clustering techniques for the creation of column families in a column-oriented NoSQL HBase database on the Hadoop platform. Our experiments suggest that defining a good data grouping on the HBase database during the implementation of a data warehouse significantly increases the performance of decision-support queries.

Mohamed Boussahoua, Omar Boussaid, Fadila Bentayeb
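
A minimal sketch of the core idea, grouping columns into column families with k-means over a column-by-query usage matrix; the matrix, column names and number of clusters are hypothetical, not the paper's exact feature encoding:

```python
# Group columns into HBase column families by workload co-access (sketch).
import numpy as np
from sklearn.cluster import KMeans

columns = ["price", "quantity", "store", "date", "customer"]
usage = np.array([[1, 1, 0, 0],   # price:    read by queries q1, q2
                  [1, 1, 0, 0],   # quantity: read by queries q1, q2
                  [0, 0, 1, 1],   # store:    read by queries q3, q4
                  [0, 1, 1, 1],   # date
                  [0, 0, 1, 0]])  # customer
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(usage)
families = {}
for col, fam in zip(columns, labels):
    families.setdefault(f"cf{fam}", []).append(col)
print(families)  # columns queried together end up in the same family
```
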

Data Mining and Machine Learning

Frontmatter
PALM: A Parallel Mining Algorithm for Extracting Maximal Frequent Conceptual Links from Social Networks

Numerous methods have been proposed to perform clustering on social networks. While significant work has been carried out on the design of new approaches able to search for various kinds of clusters, a major challenge concerns the scalability of these approaches. Indeed, given the mass of data that can now be collected from online social networks, particularly from social platforms, it is important to have efficient methods for exploring and analyzing these very large amounts of data. One recent social network clustering approach is the extraction of conceptual links, which performs link clustering by exploiting both the structure of the network and the attributes of nodes to identify strong links between groups of nodes in which nodes share common attributes. In this paper, we focus on optimizing the search for conceptual links. In particular, we propose PALM, a parallel algorithm that aims to improve the efficiency of the extraction by simultaneously exploring several areas of the search space. For this purpose, we begin by demonstrating that the solution space forms a concept lattice. Then, we propose an approach that explores the branches of the lattice in parallel while reducing the search space based on various properties of conceptual links. We demonstrate the efficiency of the algorithm by comparing its performance with the original extraction approach. The results show a significant gain in computation time.

Erick Stattner, Reynald Eugenie, Martine Collard
Learning Interactions from Web Service Logs

Web services are typically involved in various types of interaction during their lifespan. They may participate as components in more complex services (composition) or replace unavailable services (substitution). Identifying the invocations that are part of the same interaction relationship and the nature of these relationships provides support for mashup developers. In this paper, we propose a novel approach for discovering composition and substitution relationships from service logs. We introduce a technique to correlate events that are part of the same relationship. We use association rule algorithms to determine the most frequent item-sets of correlated events. We infer composition and substitution relationships from these item-sets and derive a multi-relation network of Web services. Experiments show that 80% of the interaction relationships can be learned with 70% precision.

Hamza Labbaci, Brahim Medjahed, Youcef Aklouf
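
The correlation-then-mining step can be illustrated with a toy frequent-pair count over sets of correlated invocation events; the event names and threshold below are hypothetical, and a real pipeline would use a full association rule algorithm such as Apriori:

```python
# Toy frequent-pair mining over correlated service-invocation events.
from collections import Counter
from itertools import combinations

sessions = [{"invoke:S1", "invoke:S2"}, {"invoke:S1", "invoke:S2"},
            {"invoke:S1", "invoke:S3"}]
pairs = Counter(p for s in sessions for p in combinations(sorted(s), 2))
min_support = 2
frequent = {p: c for p, c in pairs.items() if c >= min_support}
print(frequent)  # {('invoke:S1', 'invoke:S2'): 2} -> candidate composition
```
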
Data Driven Generation of Synthetic Data with Support Vector Data Description

We propose a method to generate synthetic data by Support Vector Data Description, a variant of the Support Vector Machine for the one-class classification problem. Our method assumes that the observed data are samples of a random variable that satisfies an unknown membership decision function, which is learned by Support Vector Data Description from the training data. Using the learned membership decision function, we perform rejection sampling: firstly, we generate a random data point; secondly, we test the data point against the membership decision function; lastly, if the data point fails the test, we repeat from the first step. However, in some cases the rejection sampling approach runs slowly. Therefore, we also propose another approach, which uses a heuristic to find a good starting point and then performs gradient descent to gradually move the data point inside the positive region boundary while maintaining the randomness of the generated data. This approach runs noticeably faster than rejection sampling when the latter is slow.

Fajrian Yunus, Ashish Dandekar, Stéphane Bressan
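
The rejection sampling loop described above translates almost directly into code. In this sketch, scikit-learn's OneClassSVM stands in for the learned membership decision function (with an RBF kernel, the one-class SVM is closely related to SVDD); the paper's faster gradient-descent variant is not shown:

```python
# Rejection sampling against a learned one-class decision function.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 2))                        # the observed data
model = OneClassSVM(kernel="rbf", nu=0.1).fit(train)

lo, hi = train.min(axis=0), train.max(axis=0)
synthetic = []
while len(synthetic) < 50:
    candidate = rng.uniform(lo, hi)                      # step 1: random point
    if model.predict(candidate.reshape(1, -1))[0] == 1:  # step 2: membership test
        synthetic.append(candidate)                      # accept, else repeat
```
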
Conversion Rate Estimation in Online Advertising via Exploring Potential Impact of Creative

As a key criterion for measuring ad performance, CVR quantitatively describes the proportion of users who take a desirable action (such as purchasing an item, adding to a cart, adding favorite items, etc.) on given ads in the ads ecosystem. It is therefore critical for allocating ad budgets and increasing advertisers' profits. Focusing on improving the accuracy of CVR prediction in online advertising, this paper first analyzes and reveals the correlation between the creatives associated with ads and CVR, which most state-of-the-art methods in this literature exclude. Furthermore, we propose a novel LR+ model that utilizes the potential impact of creatives on predicting CVR. Experimental results and analysis on two public real-world datasets (the REC-TMALL dataset and the Taobao Clothes Matching dataset) validate the effectiveness of the proposed LR+ and demonstrate that it outperforms typical models (e.g., LR, GBDT and linear SVR) in terms of root mean square error (RMSE).

Junxiang Jiang, Huiqiang Jiang

Recommender Systems and Query Recommendation

Frontmatter
A NoSQL Data-Based Personalized Recommendation System for C2C e-Commerce

With the considerable development of customer-to-customer (C2C) e-commerce in recent years, there is strong demand for an effective recommendation system that suggests suitable websites for users to sell their items with some specified needs. Nonetheless, e-commerce recommendation systems are mostly designed for business-to-customer (B2C) websites, where the systems offer consumers products that they might like to buy. Almost no related research focuses on choosing selling sites for target items. In this paper, we introduce an approach that recommends selling websites based upon an item's description, category, and desired selling price. This approach employs NoSQL data-based machine learning techniques for building and training topic models and classification models. The trained models can then be used to rank the websites dynamically with respect to user needs. Experimental results with real-world datasets from Vietnamese C2C websites demonstrate the effectiveness of our proposed method.

Tran Khanh Dang, An Khuong Vo, Josef Küng
Recommending Diverse and Personalized Travel Packages

The success of recommender systems has made them the focus of a massive research effort in both industry and academia. The aim of most recommender systems is to identify a ranked list of items that are likely to be of interest to users. However, there are several applications, such as trip planning, where users are interested in package recommendations, i.e., collections of items. In this paper, we consider the problem of recommending the top-k packages to the active user, where each package consists of a set of points of interest (POIs) and is subject to user-specified constraints (time, price, etc.). We formally define the problem of top-k composite recommendations and present our approach, inspired by composite retrieval. We introduce a scoring function and propose a ranking algorithm for solving the top-k packages problem, taking into account the preferences of the user, the diversity of the recommended packages, and the popularity of the POIs. An experimental evaluation of our proposal, using data crawled from Tripadvisor, demonstrates its quality and its ability to provide relevant and diverse package recommendations.

Idir Benouaret, Dominique Lenne
Association Rule Based Approach to Improve Diversity of Query Recommendations

Query recommendation (QR) supports search engines in providing alternative queries as recommendations using similarity-based approaches. In the literature, orthogonal query recommendation (OQR) has been proposed to add diversity to QR when the user does not formulate proper queries; OQR uses a dissimilarity measure to recommend completely different queries. In this paper, we propose a QR approach that combines association rules, diverse patterns, and an unbalanced concept hierarchy of search terms. We conceptualize association-rule-based QR and order the rules based on confidence and diversity; the highest-ranked rules are then used to provide the recommendations. Experimental results on the real-world AOL click-through dataset show that diverse QRs improve performance significantly.

M. Kumara Swamy, P. Krishna Reddy, Subhash Bhalla
How to Find the Best Rated Items on a Likert Scale and How Many Ratings Are Enough

The collection and exploitation of ratings from users are modern pillars of collaborative filtering. The Likert scale is a psychometric quantifier of ratings popular among electronic commerce sites. In this paper, we consider the tasks of collecting Likert scale ratings of items and of finding the n-k best-rated items, i.e., the n items that are most likely to be among the top-k in a ranking constructed from these ratings. We devise an algorithm, Pundit, that computes the n-k best-rated items. Pundit uses the probability-generating function constructed from the Likert scale responses to avoid the combinatorial exploration of the possible outcomes and to compute the result efficiently. In practice, selection of the best-rated items meets the major obstacle of rating scarcity. We propose an approach that learns from the available data how many ratings are enough to meet a prescribed error. We empirically validate with real datasets the effectiveness of our method for recommending the collection of additional ratings.

Qing Liu, Debabrota Basu, Shruti Goel, Talel Abdessalem, Stéphane Bressan
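
The probability-generating-function trick can be shown in miniature: the distribution of a sum of independent Likert ratings is the product of their PGFs, computable by repeated convolution instead of enumerating outcomes. A sketch under the assumption of independent per-rating distributions:

```python
# Distribution of a sum of Likert ratings via PGF multiplication (convolution).
import numpy as np

def sum_distribution(rating_dists):
    """Each element: a length-5 probability vector over scores 1..5."""
    total = np.array([1.0])
    for p in rating_dists:
        total = np.convolve(total, p)   # multiply the generating functions
    # total[j] = P(sum of scores == j + number_of_ratings)
    return total

uniform = np.full(5, 0.2)
print(sum_distribution([uniform] * 3))  # distribution of the sum of 3 ratings
```
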

Graph Algorithms

Frontmatter
Itinerary Planning with Category Constraints Using a Probabilistic Approach

We propose a probabilistic approach for finding approximate solutions to rooted orienteering problems with category constraints. The basic idea is to select nodes from the input graph according to a probability distribution considering properties such as the reward of a node, the attractiveness of its neighborhood, its visiting time, and its proximity to the direct route from source to destination. In this way, we reduce the size of the input considerably, resulting in a much faster execution time. Surprisingly, the quality of the generated solutions does not suffer significantly compared to the optimal ones. We illustrate the effectiveness of our approach with an experimental evaluation also including real-world data sets.

Paolo Bolzoni, Fabio Persia, Sven Helmer
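
A minimal sketch of the biased node selection: candidate POIs are sampled with probability proportional to a score mixing reward and detour cost. The weighting scheme and field names are illustrative, not the paper's exact distribution:

```python
# Sample candidate POIs biased towards high reward and low detour (sketch).
import random

def sample_pois(pois, count, alpha=1.0, beta=1.0):
    """pois: dicts with 'reward' and 'detour' (extra time vs. the direct route)."""
    weights = [p["reward"] ** alpha / (1.0 + p["detour"]) ** beta for p in pois]
    return random.choices(pois, weights=weights, k=count)  # with replacement

candidates = [{"id": 1, "reward": 5, "detour": 2.0},
              {"id": 2, "reward": 2, "detour": 0.5}]
print(sample_pois(candidates, count=1))
```
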
Minimizing Negative Influence in Social Networks: A Graph OLAP Based Approach

Users of social networks (SN) interact continuously and are easily influenced. Research on influence analysis in SNs pays much attention to positive influence. The few works on negative influence minimization restrict their analysis to the degree of influence between users, ignoring the context and, more importantly, users' opinions, which may reflect positive or negative sentiment. None of these works performs an OLAP-style analysis of influence that would enable a multidimensional and multilevel view of the network. In this paper, we propose a graph-OLAP-based influence analysis framework that offers a more complete, flexible and dynamic analysis. We then apply it to minimize negative influence.

Zakia Challal, Omar Boussaid, Kamel Boukhalfa
Hypergraph Drawing by Force-Directed Placement

We propose a family of algorithms that transform a hypergraph drawing problem into a graph drawing problem and leverage force-directed graph drawing algorithms in order to draw hypergraphs. We propose and discuss a number of criteria to evaluate the quality of the drawings from the points of view of aesthetics and of visualization and analytics. We empirically and comparatively evaluate the quality of the drawings based on these criteria on both synthetic and real data sets. Experiments reveal that the algorithms are generally effective and the drawings generated are aesthetically pleasing.

Naheed Anjum Arafat, Stéphane Bressan
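
One standard transformation of this kind is the star expansion, sketched below: each hyperedge becomes a dummy node linked to its members, and a force-directed layout is run on the resulting ordinary graph. This shows the general recipe, not necessarily the specific family of algorithms proposed in the paper:

```python
# Star expansion of a hypergraph followed by a force-directed layout.
import networkx as nx

hyperedges = {"e1": {"a", "b", "c"}, "e2": {"c", "d"}, "e3": {"a", "d", "e"}}
G = nx.Graph()
for name, members in hyperedges.items():
    dummy = f"edge:{name}"                    # one dummy node per hyperedge
    G.add_edges_from((dummy, v) for v in members)
pos = nx.spring_layout(G, seed=42)            # Fruchterman-Reingold forces
# Draw member nodes at pos[v]; render each hyperedge as the star (or a hull)
# around its dummy node's position.
```
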
A Fast Heuristic for Finding Near-Optimal Groups for Vehicle Platooning in Road Networks

Grouping vehicles into platoons with short inter-vehicle distances can reduce fuel consumption by up to $$21\%$$ [1] and improve the capacity of roads. The Vehicle Platooning Problem deals with route planning to form platoons and thereby minimize the overall costs of all considered vehicles. This article focuses on finding groups with the greatest fuel-saving potential in acceptable time. We propose a novel spatial grouping algorithm to determine near-optimal groups. Our approach uses fast geometrical heuristics, consisting of a direction comparison, a modified geometric-median calculation, and a comparison of intersection areas between pairs of vehicles. For evaluation, a same-start unlimited ILP (SSU ILP) is used to solve the Vehicle Platooning Problem optimally. Driving in the platoons found saves roughly $$2\%$$ to $$3\%$$ of fuel on average compared to the sum of the vehicles' individual shortest paths. The algorithm is tested in simulations on randomly created vehicles on different graphs with 5,000, 10,000 and 40,000 edges and roughly half as many nodes. The performance is evaluated and the results are compared to the total possible amount of savings.

Dietrich Steinmetz, Gerrit Burmester, Sven Hartmann

Semantic Clustering and Data Classification

Frontmatter
F-SED: Feature-Centric Social Event Detection

In the context of social media, existing works offer social-event-based organization of multimedia objects (e.g., photos, videos) by mainly considering spatio-temporal data, while neglecting other user-related information (e.g., people, user interests). In this paper we propose an automated, extensible, and incremental Feature-centric Social Event Detection (F-SED) approach, based on Formal Concept Analysis (FCA), to organize shared multimedia objects on social media platforms and sharing applications. F-SED simultaneously considers various event features (e.g., temporal, geographical, and social, i.e., user-related), and uses them to detect different feature-centric events (e.g., user-centric, location-centric). Our experimental results show that detection accuracy is improved when, besides spatio-temporal information, other features, such as social ones, are considered. We also show that the performance of our prototype is quasi-linear in most cases.

Elio Mansour, Gilbert Tekli, Philippe Arnould, Richard Chbeir, Yudith Cardinale
Generating Fake but Realistic Headlines Using Deep Neural Networks

Social media platforms such as Twitter and Facebook implement filters to detect fake news as they foresee their transition from social media platforms to primary sources of news. The robustness of such filters lies in the variety and the quality of the data used to train them. There is, therefore, a need for a tool that automatically generates fake but realistic news. In this paper, we propose a deep learning model that automatically generates news headlines. The model is trained with a corpus of existing headlines from different topics. Once trained, the model generates a fake but realistic headline given a seed and a topic. For example, given the seed “Kim Jong Un” and the topic “Business”, the model generates the headline “kim jong un says climate change is already making money”. In order to better capture and leverage the syntactic structure of headlines for the task of synthetic headline generation, we extend the Contextual Long Short Term Memory architecture proposed by Ghosh et al. to also learn a part-of-speech model. We empirically and comparatively evaluate the performance of the proposed model on a real corpus of headlines. We compare our proposed approach and its variants using Long Short Term Memory and Gated Recurrent Units as the building blocks. We evaluate and compare the topical coherence of the generated headlines using a state-of-the-art classifier. We also evaluate the quality of the generated headlines using a machine translation quality metric, and their novelty using a metric we propose for this purpose. We show that the proposed model is practical and competitively efficient and effective.

Ashish Dandekar, Remmy A. M. Zen, Stéphane Bressan
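
For orientation, a generic word-level LSTM language model, the kind of baseline the paper builds on, can be defined in a few lines of Keras; the paper's Contextual LSTM with its part-of-speech extension is richer, and the sizes below are illustrative:

```python
# Generic word-level LSTM language model (baseline sketch, illustrative sizes).
import tensorflow as tf

vocab_size, embed_dim = 20000, 128
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    tf.keras.layers.LSTM(256),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
# Generation: encode a seed (e.g. "kim jong un"), predict the next-word
# distribution, sample a word, append it, and repeat up to a length limit.
```
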
Qualitative AHP Method Based on Multiple Criteria Levels Under Group of Experts

In this paper, we consider a multi-criteria group decision making problem. We propose a novel version of the Analytic Hierarchy Process under belief function theory. The presented approach uses groups of experts to express their assessments regarding the evaluation criteria and the alternatives. It also handles more complex multi-criteria decision problems with multiple criteria levels. The presented method is illustrated with examples.

Amel Ennaceur, Zied Elouedi, Eric Lefevre
Exploit Label Embeddings for Enhancing Network Classification

Learning representations for networks has aroused great research interest in recent years. Existing approaches embed vertices into a low-dimensional continuous space that encodes local or global network structure. While these methods show improvements over traditional representations, they do not utilize label information until the learned embeddings are used to train a classifier; that is, the representation learning process is separated from the labels and thus unsupervised. In this paper, we propose a novel method that learns the embeddings for vertices under the supervision of labels. The key idea is to regard the label as part of the context of a vertex. More specifically, we attach a true or virtual label node to each training or test sample, and update the embeddings for vertices and labels to maximize the probability of both the neighbors and their labels in the context. We conduct extensive experiments on three real datasets. Results demonstrate that our method outperforms state-of-the-art approaches by a large margin.

Yiqi Chen, Tieyun Qian, Ming Zhong, Xuhui Li
Backmatter
Metadata
Title
Database and Expert Systems Applications
Editors
Djamal Benslimane
Ernesto Damiani
William I. Grosky
Abdelkader Hameurlain
Amit Sheth
Roland R. Wagner
Copyright Year
2017
Electronic ISBN
978-3-319-64471-4
Print ISBN
978-3-319-64470-7
DOI
https://doi.org/10.1007/978-3-319-64471-4
