
2014 | Book

Data Warehousing and Knowledge Discovery

16th International Conference, DaWaK 2014, Munich, Germany, September 2-4, 2014. Proceedings

Edited by: Ladjel Bellatreche, Mukesh K. Mohania

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 16th International Conference on Data Warehousing and Knowledge Discovery, DaWaK 2014, held in Munich, Germany, in September 2014 in conjunction with DEXA 2014. The 34 revised full papers and 8 short papers presented were carefully reviewed and selected from 109 submissions. The papers are organized in topical sections on modeling and ETL; ontology-based data warehouses; advanced data warehouses and OLAP; uncertainty; preferences and recommendation; query performance and HPC; cube and OLAP; optimization; classification; social networks and recommendation systems; knowledge data discovery; industrial applications; mining and processing data stream; and mining and similarity.

Table of Contents

Frontmatter

Modeling and ETL

An Approach on ETL Attached Data Quality Management

This contribution introduces an approach to ETL-attached Data Quality Management by means of an autonomous Data Quality Monitoring System. The Data Quality Monitor can be attached (via lightweight connectors) to already implemented ETL processes; it quantifies data quality and can, for instance, suggest measures if the quality of a particular data package falls below a certain limit. Furthermore, the long-term vision of this approach is to correct corrupted data (semi-)automatically according to user-defined Data Quality Rules. The Data Quality Monitor is attached to an ETL process by defining "snapshot points", where data samples to be validated are collected, and by introducing "approval points", where an ETL process can be interrupted in case of corrupted input data. As the Data Quality Monitor is an autonomous module that is attached to, rather than embedded into, ETL processes, this approach supports the division of work between ETL developers and dedicated data quality engineers.
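
A rough Python sketch of the snapshot/approval idea may help; the function names, sample size, and quality rule below are illustrative assumptions, not taken from the paper.

```python
from typing import Callable, Iterable

def snapshot_point(rows: Iterable[dict], sample_size: int = 100) -> list:
    """Collect a small sample of the data package at a snapshot point."""
    sample = []
    for i, row in enumerate(rows):
        if i >= sample_size:
            break
        sample.append(row)
    return sample

def approval_point(sample: list, quality_rule: Callable[[dict], bool],
                   min_quality: float = 0.95) -> None:
    """Interrupt the ETL process if the share of rows passing the rule is too low."""
    if not sample:
        return
    valid = sum(1 for row in sample if quality_rule(row))
    quality = valid / len(sample)
    if quality < min_quality:
        raise RuntimeError(f"ETL interrupted: data quality {quality:.0%} below limit")

# Hypothetical rule: a customer record must carry a non-empty email address.
rows = [{"email": "a@example.com"}, {"email": ""}, {"email": "b@example.com"}]
approval_point(snapshot_point(rows), lambda r: bool(r.get("email")), min_quality=0.5)
```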

Christian Lettner, Reinhard Stumptner, Karl-Heinz Bokesch
Quality Measures for ETL Processes

ETL processes play an increasingly important role in supporting modern business operations. These business processes are centred around artifacts with high variability and diverse lifecycles, which correspond to key business entities. The apparent complexity of these activities has been examined through the prism of Business Process Management, mainly focusing on functional requirements and performance optimization. However, the quality dimension has not yet been thoroughly investigated, and a more human-centric approach is needed to bring these processes closer to business users' requirements. In this paper we take a first step in this direction by defining a sound model for ETL process quality characteristics and quantitative measures for each characteristic, based on existing literature. Our model shows dependencies among quality characteristics and can provide the basis for subsequent analysis using Goal Modeling techniques.

Vasileios Theodorou, Alberto Abelló, Wolfgang Lehner
A Logical Model for Multiversion Data Warehouses

Data warehouse systems integrate data from heterogeneous sources. These sources are autonomous in nature and change independently of a data warehouse. Owing to changes in data sources, the content and the schema of a data warehouse may need to be changed for accurate decision making. Slowly changing dimensions and temporal data warehouses are the available solutions for managing changes in the content of the data warehouse. Multiversion data warehouses are capable of managing changes in content and structure simultaneously; however, they are relatively complex and not easy to implement. In this paper, we present a logical model of a multiversion data warehouse that is capable of handling schema changes independently of changes in the content. We also introduce a new hybrid table version approach to implement the multiversion data warehouse.

Waqas Ahmed, Esteban Zimányi, Robert Wrembel

Ontology-Based Data Warehouses

OntoWarehousing – Multidimensional Design Supported by a Foundational Ontology: A Temporal Perspective

The choice of information representation is extremely important for fulfilling analysis requirements, making the modelling task fundamental in the Business Intelligence (BI) lifecycle. The semantic expressiveness of multidimensional (MD) design is an issue that has been studied for some years now. Nevertheless, the lack of conceptualization constructs from real-world phenomena in MD design is still a challenge. This paper presents an ontological approach for the derivation of MD schemas, using categories from a foundational ontology (FO) to analyse the data source domains as a well-founded ontology. The approach is exemplified through a real scenario of the Brazilian electrical system, supporting the joint exploration of electrical disturbance data and their possible repercussion in the news.

João Moreira, Kelli Cordeiro, Maria Luiza Campos, Marcos Borges
Modeling and Querying Data Warehouses on the Semantic Web Using QB4OLAP

The web is changing the way in which data warehouses are designed and exploited. Nowadays, for many data analysis tasks, data contained in a conventional data warehouse may not suffice, and external data sources, like the web, can provide useful multidimensional information. Also, large repositories of semantically annotated data are becoming available on the web, opening new opportunities for enhancing current decision-support systems. Representation of multidimensional data via semantic web standards is crucial to achieving this goal. In this paper we extend the QB4OLAP RDF vocabulary to represent balanced, recursive, and ragged hierarchies. We also present a set of rules to obtain a QB4OLAP representation of a conceptual multidimensional model, and a procedure to populate the result from a relational implementation of the multidimensional model. We conclude the paper by showing how complex real-world OLAP queries expressed in SPARQL can be posed to the resulting QB4OLAP model.

Lorena Etcheverry, Alejandro Vaisman, Esteban Zimányi
Extending Drill-Down through Semantic Reasoning on Indicator Formulas

Performance indicators are calculated by composition of more basic pieces of information and/or aggregated along a number of different dimensions. The multidimensional model is not able to take into account the compound nature of an indicator. In this work, we propose a semantic multidimensional model in which indicators are formally described together with the mathematical formulas needed for their computation. By exploiting the formal representation of formulas, an extended drill-down operator is defined that is capable of expanding an indicator into its components, enabling a novel mode of data exploration. Effectiveness and efficiency are briefly discussed on a prototype introduced as a proof of concept.

Claudia Diamantini, Domenico Potena, Emanuele Storti

Advanced Data Warehouses and OLAP

An OLAP-Based Approach to Modeling and Querying Granular Temporal Trends

Data warehouses contain valuable information for decision-making purposes; they can be queried and visualised with Online Analytical Processing (OLAP) tools. They contain time-related information, and thus representing and reasoning on temporal data is important both to guarantee the efficacy and quality of decision-making processes and to detect any emergency situation as soon as possible. Several proposals deal with temporal data models and query languages for data warehouses, allowing one to use different time granularities both when storing and when querying data. In this paper we focus on two aspects pertaining to temporal data in data warehouses, namely temporal patterns and temporal granularities. We first motivate the need for discovering granular trends in an OLAP context. Then, we propose a model for analyzing granular temporal trends in time series by taking advantage of the hierarchical structure of the time dimension.

Alberto Sabaini, Esteban Zimányi, Carlo Combi
Real-Time Data Warehousing: A Rewrite/Merge Approach

This paper focuses on Real-Time Data Warehousing systems, a relevant class of Data Warehouses whose main requirement is executing classical data warehousing operations (e.g., loading, aggregation, indexing, OLAP query answering, and so forth) under real-time constraints. This makes classical DW architectures unsuitable for this goal and lays the basis for a novel research area with a tight relationship to emerging Cloud architectures. Inspired by this motivation, in this paper we propose a novel framework for supporting Real-Time Data Warehousing that makes use of a rewrite/merge approach. We also provide an extensive experimental campaign that confirms the benefits deriving from our framework.

Alfredo Cuzzocrea, Nickerson Ferreira, Pedro Furtado
Towards Next Generation BI Systems: The Analytical Metadata Challenge

Next generation Business Intelligence (BI) systems require integration of heterogeneous data sources and a strong user-centric orientation. Both needs entail machine-processable metadata to enable automation and allow end users to gain access to relevant data for their decision making processes. Although evidently needed, there is no clear picture about the necessary metadata artifacts, especially considering user support requirements. Therefore, we propose a comprehensive metadata framework to support the user assistance activities and their automation in the context of next generation BI systems. This framework is based on the findings of a survey of current user-centric approaches mainly focusing on query recommendation assistance. Finally, we discuss the benefits of the framework and present the plans for future work.

Jovan Varga, Oscar Romero, Torben Bach Pedersen, Christian Thomsen

Uncertainty

Mining Fuzzy Contextual Preferences

Recent research work on preference mining has focused on the development of methods for mining a preference model from preference data following a crisp pairwise representation. In this representation, the user has two options regarding a pair of objects u and v: either he/she prefers u to v or v to u. In this article, we propose FuzzyPrefMiner, a method for extracting fuzzy contextual preference models from fuzzy preference data, characterized by the fact that, given two objects u and v, the user has a spectrum of options according to his or her degree of preference on u and v. Accordingly, the mined preference model is fuzzy, in the sense that it is capable of predicting, given two new objects u and v, the degree of preference the user would assign to these objects. The efficiency of FuzzyPrefMiner is analysed through a series of experiments on real datasets.

Sandra de Amo, Juliete A. Ramos Costa
BLIMP: A Compact Tree Structure for Uncertain Frequent Pattern Mining

Tree structures (e.g., UF-trees, UFP-trees) corresponding to many existing uncertain frequent pattern mining algorithms can be large. Other tree structures for handling uncertain data may achieve compactness at the expense of loose upper bounds on expected supports. To solve this problem, we propose a compact tree structure that captures uncertain data with tighter upper bounds than the aforementioned tree structures. The corresponding algorithm mines frequent patterns from this compact tree structure. Experimental results show the compactness of our tree structure and the tightness of the upper bounds on expected supports provided by our uncertain frequent pattern mining algorithm.
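
For context, the standard expected-support computation for uncertain transaction data (the quantity the upper bounds above approximate) can be sketched as follows; the example data and item probabilities are invented for illustration.

```python
from math import prod

# Each transaction maps an item to its existential probability.
transactions = [
    {"a": 0.9, "b": 0.8},
    {"a": 0.5, "b": 0.4, "c": 0.7},
    {"b": 0.6},
]

def expected_support(itemset, db):
    """Sum over transactions of the product of the itemset's item probabilities."""
    return sum(prod(t[i] for i in itemset)
               for t in db if all(i in t for i in itemset))

print(expected_support({"a", "b"}, transactions))  # 0.9*0.8 + 0.5*0.4 = 0.92
```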

Carson Kai-Sang Leung, Richard Kyle MacKinnon
Discovering Statistically Significant Co-location Rules in Datasets with Extended Spatial Objects

Co-location rule mining is one of the tasks of spatial data mining; it focuses on the detection of sets of spatial features that show spatial associations. Most previous methods are based on transaction-free apriori-like algorithms that depend on user-defined thresholds and are designed for boolean data points. Due to the absence of a clear notion of transactions, it is nontrivial to use association rule mining techniques to tackle the co-location rule mining problem. To address these difficulties, a transactionization approach was recently proposed, designed to mine datasets with extended spatial objects; a statistical test is used instead of global thresholds to detect significant co-location rules. One major shortcoming of this work is that it limits the size of the antecedent of co-location rules to at most three features, so the algorithm is difficult to scale up. In this paper we introduce a new algorithm that fully exploits the property of statistical significance to detect more general co-location rules. We apply our algorithm to real datasets from the National Pollutant Release Inventory (NPRI). A classifier is also proposed to help evaluate the discovered co-location rules.

Jundong Li, Osmar R. Zaïane, Alvaro Osornio-Vargas

Preferences and Recommendation

Discovering Community Preference Influence Network by Social Network Opinion Posts Mining

The popularity of posts, topics, and opinions on social media websites and the influence ability of users can be discovered by analyzing the responses of users (e.g., likes/dislikes, comments, ratings). Existing web opinion mining systems such as OpinionMiner are based on opinion-text similarity scoring of users' review texts and product ratings to generate database tables of features, functions, and opinions, mined through classification to identify arriving opinions as positive or negative on user-service networks or interest networks (e.g., Amazon.com). These systems are not directly applicable to user-user networks or friendship networks (e.g., Facebook.com), since they do not consider multiple posts on multiple products, users' relationships (such as influence), or diverse posts and comments. This paper proposes a new influence network (IN) generation algorithm (Opinion-Based IN: OBIN) based on opinion mining of friendship networks. OBIN mines opinions using an extended OpinionMiner that considers multiple posts and relationships (influences) between users. The approach includes a frequent pattern mining algorithm for determining community (positive or negative) preferences for a given product, used as input to standard influence maximization algorithms such as CELF for target marketing.

Tamanna S. Mumu, Christie I. Ezeife
Computing Hierarchical Skyline Queries “On-the-Fly” in a Data Warehouse

Skyline queries represent a powerful tool for multidimensional data analysis and decision aid. When the dimensions are conflicting, skyline queries return the best compromises associated with these dimensions. Many studies have focused on the extraction of skyline points in the context of multidimensional databases but, to the best of our knowledge, none of them has investigated skyline queries when data are structured along multiple and hierarchical dimensions. This article proposes a new method that extends skyline queries to multiple hierarchical dimensions. Our proposal, HSky (Hierarchical Skyline Queries), allows the user to navigate along the dimension hierarchies (i.e., specialize/generalize) while ensuring an efficient online computation of the associated skyline.
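
For readers unfamiliar with skylines, here is a minimal sketch of the classic, non-hierarchical skyline (Pareto-optimal set) that HSky generalizes; the example data is invented.

```python
def dominates(a, b):
    """a dominates b: at least as good on every dimension, strictly better on one
    (smaller values are better here)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Example: (price, distance) pairs, both to be minimized.
print(skyline([(100, 5), (80, 10), (120, 3), (110, 6)]))
# -> [(100, 5), (80, 10), (120, 3)]   ((110, 6) is dominated by (100, 5))
```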

Tassadit Bouadi, Marie-Odile Cordier, René Quiniou
Impact of Content Novelty on the Accuracy of a Group Recommender System

A group recommender system is designed for contexts in which more than one person is involved in the recommendation process. For some types of content (like movies) it is advisable to recommend an item only if it has not yet been consumed by most of the group: it would be trivial and of little value to recommend an item for which a large part of the group has already expressed a preference. This paper studies the impact of content novelty on the accuracy of a group recommender system by introducing a constraint on the percentage of the group for which the recommended content has to be novel. A comparative analysis over different values of this percentage and for groups of different sizes was validated through statistical tests, in order to evaluate when the difference in accuracy values is significant. Experimental results, analyzed and discussed in depth, show that recommending novel content significantly affects performance only for small groups and only when the content has to be novel for the majority of the group.

Ludovico Boratto, Salvatore Carta

Query Performance and HPC

Optimizing Queue-Based Semi-Stream Joins with Indexed Master Data

In Data Stream Management Systems (DSMS), semi-stream processing has become a popular area of research due to the high demand from applications for up-to-date information (e.g., in real-time data warehousing). A common operation in stream processing is joining an incoming stream with disk-based master data, also known as a semi-stream join. This join typically works under the constraint of limited main memory, which is generally not large enough to hold the whole disk-based master data. Many semi-stream joins use a queue of stream tuples to amortize disk access to the master data, and use an index to allow directed access to master data, avoiding the loading of unnecessary master data. In such a situation the question arises which master data partitions should be accessed, as any stream tuple from the queue could serve as a lookup element for accessing the master data index. Existing algorithms use simple safe and correct strategies, but are not optimal in the sense of maximizing the join service rate. In this paper we analyze strategies for selecting an appropriate lookup element, particularly for skewed stream data. We show that a good selection strategy can improve the performance of a semi-stream join significantly, both for synthetic and real data sets with known skewed distributions.
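
A hedged sketch of one plausible selection strategy for skewed streams (not necessarily one studied in the paper): pick the join key that occurs most often in the queue, so that a single master-data partition load services many queued tuples. All names and structures below are assumptions for illustration.

```python
from collections import Counter, deque

# Queue of stream tuples waiting to be joined: (join_key, payload).
queue = deque([("k1", 1), ("k2", 2), ("k1", 3), ("k1", 4), ("k3", 5)])

# Master data reachable through an index keyed on the join attribute.
master_index = {"k1": {"desc": "row 1"}, "k2": {"desc": "row 2"}, "k3": {"desc": "row 3"}}

def pick_lookup_key(q):
    """Greedy strategy: the key occurring most often in the queue."""
    return Counter(key for key, _ in q).most_common(1)[0][0]

def service(q, index, key):
    """Load one master-data partition via the index and join all matching queued tuples."""
    partition = index[key]
    joined = [(payload, partition) for k, payload in q if k == key]
    remaining = deque(t for t in q if t[0] != key)
    return joined, remaining

joined, queue = service(queue, master_index, pick_lookup_key(queue))
print(len(joined))  # 3 tuples serviced with a single partition access
```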

M. Asif Naeem, Gerald Weber, Christof Lutteroth, Gillian Dobbie
Parallel Co-clustering with Augmented Matrices Algorithm with Map-Reduce

Co-clustering with augmented matrices (CCAM) [11] is a two-way clustering algorithm that considers dyadic data (e.g., two types of objects) and other correlation data (e.g., objects and their attributes) simultaneously. CCAM was developed to outperform other state-of-the-art algorithms in certain real-world recommendation tasks [12]. However, incorporating multiple correlation data involves a heavy scalability demand. In this paper, we show how the parallel co-clustering with augmented matrices (PCCAM) algorithm can be designed on the Map-Reduce framework. The experimental work shows that the input format, the number of blocks, and the number of reducers can greatly affect the overall performance.

Meng-Lun Wu, Chia-Hui Chang
Processing OLAP Queries over an Encrypted Data Warehouse Stored in the Cloud

Several studies deal with mechanisms for processing transactional queries over encrypted data. However, little attention has been devoted to determine how a data warehouse (DW) hosted in a cloud should be encrypted to enable analytical queries processing. In this article, we present a novel method for encrypting a DW and show performance results of this DW implementation. Moreover, an OLAP system based on the proposed encryption method was developed and performance tests were conducted to validate our system in terms of query processing performance. Results showed that the overhead caused by the proposed encryption method decreased when the proposed system was scaled out and compared to a non-encrypted dataset (46.62% with one node and 9.47% with 16 nodes). Also, the computation of aggregates and data groupings over encrypted data in the server produced performance gains (from 84.67% to 93.95%) when compared to their executions in the client, after decryption.

Claudivan Cruz Lopes, Valéria Cesário Times, Stan Matwin, Ricardo Rodrigues Ciferri, Cristina Dutra de Aguiar Ciferri

Cube and OLAP

Reducing Multidimensional Data

Our aim is to elaborate a multidimensional database reduction process that specifies an aggregated schema applicable over a period of time while retaining useful data for decision support. Firstly, we describe a multidimensional database schema composed of a set of states. Each state is defined as a star schema composed of one fact and its related dimensions. Each reduced state is defined through reduction operators. Secondly, we describe our experiments and discuss their results. Evaluating our solution implies executing different requests in various contexts: an unreduced single fact table, an unreduced relational star schema, a reduced star schema, or a reduced snowflake schema. We show that queries are evaluated more efficiently within a reduced star schema.

Faten Atigui, Franck Ravat, Jiefu Song, Gilles Zurfluh
Towards an OLAP Environment for Column-Oriented Data Warehouses

Column-oriented database systems offer decision-makers the most appropriate model for data warehouse storage. However, in the absence of on-line analytical operators, the only, very costly, way of constructing OLAP cubes involves using the UNION operator over group-by queries in order to obtain all the GROUP BY combinations required to compute the OLAP cube. To solve this problem, in this article we propose a new aggregation operator, called C-CUBE (Columnar-CUBE), which allows data cubes to be computed over column-oriented data warehouses. We implemented the C-CUBE operator within the column-oriented DBMS MonetDB and conducted experiments on the SSBM benchmark (Star Schema Benchmark). We show that C-CUBE reduces OLAP cube computation times by up to 60% compared with the SQL Server CUBE operator on a 1 TB warehouse.
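
To illustrate the costly baseline the abstract refers to, the sketch below computes a cube as the union of one GROUP BY aggregation per subset of dimensions; it is a generic Python illustration, not the C-CUBE operator.

```python
from itertools import combinations
from collections import defaultdict

def cube(rows, dimensions, measure):
    """One aggregation per subset of dimensions -- 2^n GROUP BYs, UNIONed together."""
    result = {}
    for k in range(len(dimensions) + 1):
        for group in combinations(dimensions, k):
            agg = defaultdict(float)
            for row in rows:
                agg[tuple(row[d] for d in group)] += row[measure]
            result[group] = dict(agg)
    return result

rows = [{"region": "EU", "year": 2014, "sales": 10},
        {"region": "EU", "year": 2013, "sales": 7},
        {"region": "US", "year": 2014, "sales": 5}]
print(cube(rows, ["region", "year"], "sales")[("region",)])
# {('EU',): 17.0, ('US',): 5.0}
```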

Khaled Dehdouh, Fadila Bentayeb, Omar Boussaid, Nadia Kabachi
Interval OLAP: Analyzing Interval Data

The ability to analyze data organized as sequences of events or intervals has become important for today's applications, since such data have become ubiquitous. In this paper we propose a formal model and briefly discuss a prototypical implementation for processing interval data in an OLAP style. The fundamental constructs of the formal model include: events, intervals, sequences of intervals, dimensions, dimension hierarchies, dimension members, and an iCube. The model supports: (1) defining multiple sets of intervals over sequential data, (2) defining measures computed from both events and intervals, and (3) analyzing the measures in the context set up by dimensions.

Christian Koncilia, Tadeusz Morzy, Robert Wrembel, Johann Eder

Optimization

Improving the Processing of DW Star-Queries under Concurrent Query Workloads

Currently, Data Warehouse (DW) analyses are extensively used not only for strategic business decisions by a few, but also for feedback to a wider audience and for daily operational decisions. As a result, there is an increase in the number of aggregation star-queries that are concurrently submitted. Although such queries require similar processing patterns, they stress the database engine's ability to deliver timely execution, because each query executes independently of the others (query-at-a-time processing model). Recently, there has been increasing interest in approaches that cooperate to manage large numbers of concurrent aggregation star-queries. We proposed SPIN in a previous paper [1]. It is a data processing model that shares data and computation in order to handle large concurrent query loads, and its data organization provides almost constant and predictable execution times for all submitted queries. It has a data reader that reads data in a circular loop, placing it in a pipeline, before it is processed by branches that combine common processing computations. SPIN is IO dependent, i.e., a query is only answered after a full circular loop, even though tuples and similar predicates have been evaluated in the past. In this paper we propose a data processing approach that uses a set of bitsets, built on-the-fly, to significantly reduce the query processing time, the tuple evaluation cost, and the number of predicates and tuples evaluated, without sacrificing its predictability features. The data read from storage is reduced to the minimum needed by the current query load.

João Pedro Costa, Pedro Furtado
Building Smart Cubes for Reliable and Faster Access to Data

In data warehousing, selecting a subset of views for materialization has been widely employed as a way to reduce the query evaluation time for real-time OLAP queries. However, materialization of a large number of views may be counterproductive and may exceed storage thresholds, especially when considering very large data warehouses. Thus, an important concern is to find the best set of views to materialize in order to guarantee acceptable query response times. It further follows that the best set of views may change as the query histories evolve. To address these issues, we introduce the Smart Cube algorithm, which combines vertical partitioning, partial materialization, and dynamic computation. In our approach, we partition the search space into fragments and proceed to select the optimal subset of fragments to materialize. We dynamically adapt the set of materialized views that we store, based on query histories. The experimental evaluation of our Smart Cube algorithm shows that our work compares favorably with the state of the art. The results indicate that our algorithm materializes a smaller number of views than other techniques, while yielding fast query response times.

Daniel K. Antwi, Herna L. Viktor
Optimizing Reaction and Processing Times in Automotive Industry’s Quality Management
A Data Mining Approach

The manufacturing industry has come to recognize the potential of the data it generates as an information source for quality management departments to detect potential problems in production as early and as accurately as possible. This is essential for reducing warranty costs and ensuring customer satisfaction. One of the greatest challenges in quality management is that the amount of data produced during the development and manufacturing process and in the after-sales market grows rapidly. Thus, the need for automated detection of meaningful information arises. This work focuses on enhancing quality management by applying data mining approaches and introduces: (i) a meta model for data integration; (ii) a novel company-internal analysis method which uses statistics and data mining to process the data in its entirety to find interesting, concealed information; and (iii) the application Q-AURA (quality - abnormality and cause analysis), an implementation of the concepts for an industrial partner in the automotive industry.

Thomas Leitner, Christina Feilmayr, Wolfram Wöß

Classification

Using Closed n-set Patterns for Spatio-Temporal Classification

Today, huge volumes of sensor data are collected from many different sources. One of the most crucial data mining tasks for this data is the ability to predict and classify data in order to anticipate trends or failures and take adequate steps. While the initial data might be of limited interest in itself, the use of additional information, e.g., latent attributes, spatio-temporal details, etc., can add significant value and interestingness. In this paper we present a classification approach, called Closed n-set Spatio-Temporal Classification (CnSC), which is based on the use of latent attributes, pattern mining, and classification model construction. As the amount of generated patterns is huge, we employ a scalable NoSQL-based graph database for efficient storage and retrieval. By considering hierarchies in the latent attributes, we define pattern and context similarity scores. The classification model for a specific context is constructed by aggregating the most similar patterns. The presented approach, CnSC, is evaluated on a real dataset and shows competitive results compared with other prediction strategies.

S. Samulevičius, Y. Pitarch, T. B. Pedersen
Short Text Classification Using Semantic Random Forest

Using traditional Random Forests for short text classification reveals a performance degradation compared to using them for standard texts. Shortness, sparseness, and lack of contextual information in short texts are the reasons for this degradation. Existing solutions to overcome these issues are mainly based on data enrichment. However, data enrichment can also introduce noise. We propose a new approach that combines data enrichment with the introduction of semantics into Random Forests. Each short text is enriched with data semantically similar to its words. These data come from an external source of knowledge distributed into topics by means of the Latent Dirichlet Allocation model. The learning process in Random Forests is adapted to consider semantic relations between words while building the trees. Tests performed on search snippets using the new method showed significant improvements in classification: accuracy increased by 34% compared to traditional Random Forests and by 20% compared to MaxEnt.

Ameni Bouaziz, Christel Dartigues-Pallez, Célia da Costa Pereira, Frédéric Precioso, Patrick Lloret
3-D MRI Brain Scan Classification Using A Point Series Based Representation

This paper presents a procedure for the classification of 3-D objects in Magnetic Resonance Imaging (MRI) brain scan volumes. More specifically the classification of the left and right ventricles of the brain according to whether they feature epilepsy or not. The main contributions of the paper are two point series representation techniques to support the above: (i) Disc-based and (ii) Spoke-based. The proposed methods were evaluated using Support Vector Machine (SVM) and K-Nearest Neighbour (KNN) classifiers. The first required a feature space representation which was generated using Hough signature extraction. The second required some distance measure; the “warping path” distance generated using Dynamic Time Warping (DTW) curve comparison was used for this purpose. An epilepsy dataset used for evaluation purposes comprised 210 3-D MRI brain scans of which 105 were from “healthy” people and 105 from “epilepsy patients”. The results indicated that the proposed process can successfully be used to classify objects within 3-D data volumes.

Akadej Udomchaiporn, Frans Coenen, Marta García-Fiñana, Vanessa Sluming

Social Networks and Recommendation Systems

Mining Interesting “Following” Patterns from Social Networks

Over the past few years, social network sites (e.g., Facebook, Twitter, Weibo) have become very popular. These sites are used for sharing knowledge and information among users. Nowadays, it is not unusual for a user to have many friends (e.g., hundreds or even thousands of friends) in these social networks. In general, social networks consist of social entities that are linked by some interdependency, such as friendship. As social networks keep growing, it is useful for a user to find frequently followed groups of social entities in the networks so that he or she can follow the same groups. In this paper, we propose (i) a space-efficient bitwise data structure to capture interdependency among social entities and (ii) a time-efficient data mining algorithm that makes the best use of our proposed data structure to discover groups of friends who are frequently followed by social entities in the social networks. Evaluation results show the efficiency of our data structure and mining algorithm.
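
As a loose illustration of the bitwise idea (the paper's actual structure may differ), followee sets can be encoded as bit vectors so that support counting for a group of followed entities reduces to a bitwise AND; the users and entities below are invented.

```python
# Followee sets as bit vectors: bit i is set if the user follows entity i.
followees = {
    "alice": 0b0111,   # follows entities 0, 1, 2
    "bob":   0b0110,   # follows entities 1, 2
    "carol": 0b1110,   # follows entities 1, 2, 3
}

def support(group_mask: int) -> int:
    """How many users follow every entity in the group encoded by group_mask."""
    return sum(1 for mask in followees.values() if mask & group_mask == group_mask)

print(support(0b0110))  # entities {1, 2} are followed together by all 3 users
```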

Fan Jiang, Carson Kai-Sang Leung
A Novel Approach Using Context-Based Measure for Matching Large Scale Ontologies

Identifying alignments between ontologies has become a central knowledge engineering activity. In ontology matching, the same word placed in different textual contexts assumes completely different meanings. This paper proposes an algorithm for ontology alignment named XMap++ (eXtensible Mapping), applied in an ontology mapping context. In XMap++, the measurement of lexical similarity in ontology matching is performed using synsets defined in WordNet. In our approach, the similarity between two entities of different ontologies is evaluated not only by investigating the semantics of the entities' names, but also by taking into account the context through which the effective meaning is described. We provide experimental results that measure the accuracy of our algorithm based on our participation with two versions (XMapSig and XMapGen) in the OAEI 2013 campaign.

Warith Eddine Djeddi, Mohamed Tarek Khadir
ActMiner: Discovering Location-Specific Activities from Community-Authored Reviews

Location-specific community-authored reviews are a useful resource for discovering location-specific activities and developing various location-aware activity recommendation applications. Existing works on activity discovery have mostly utilized body-worn sensors, images, or human GPS traces and discovered generalized activities that do not convey any location-specific knowledge. Moreover, many of the discovered activities are irrelevant and redundant and hence significantly affect the performance of a location-aware activity recommender system. In this paper, we propose a three-phase Discover-Filter-Merge solution, namely ActMiner, to infer location-specific, relevant, and non-redundant activities from community-authored reviews. The proposed solution uses Dependency-aware, Category-aware, and Sense-aware approaches in three sequential phases to accomplish its objective. Experimental results on two real-world data sets show that the accuracy and correctness of ActMiner are better than those of existing approaches.

Sahisnu Mazumder, Dhaval Patel, Sameep Mehta

Knowledge Data Discovery

A Scalable Algorithm for Banded Pattern Mining in Multi-dimensional Zero-One Data

A banded pattern in “zero-one” high dimensional data is one where all the dimensions can be organized in such a way that the “ones” are arranged along the leading diagonal across the dimensions. Rearranging zero-one data so as to feature bandedness allows for the identification of hidden information and enhances the operation of many data mining algorithms that work with zero-one data. In this paper an effective ND banding algorithm, the ND-BPM algorithm, is presented together with a full evaluation of its operation. To illustrate the utility of the banded pattern concept a case study using the GB Cattle movement database is also presented.

Fatimah B. Abdullahi, Frans Coenen, Russell Martin
Approximation of Frequent Itemset Border by Computing Approximate Minimal Hypergraph Transversals

In this paper, we present a new approach to approximate the negative border and the positive border of frequent itemsets. This approach is based on the transition from one border to the other by computing the minimal transversals of a hypergraph. We also propose a new method to compute approximate minimal hypergraph transversals based on hypergraph reduction. The experiments performed on different data sets show that our propositions for approximating frequent itemset borders produce good results.

Nicolas Durand, Mohamed Quafafou
Clustering Based on Sequential Multi-Objective Games

We propose a novel approach for data clustering based on sequential multi-objective multi-act games (ClusSMOG). It automatically determines the number of clusters and optimises simultaneously the inertia and the connectivity objectives. The approach consists of three structured steps. The first step identifies initial clusters and calculates a set of conflict-clusters. In the second step, for each conflict-cluster, we construct a sequence of multi-objective multi-act sequential two-player games. In the third step, we develop a sequential two-player game between each cluster representative and its nearest neighbour. For each game, payoff functions corresponding to the objectives were defined. We use a backward induction method to calculate Nash equilibrium for each game. Experimental results confirm the effectiveness of the proposed approach over state-of-the-art clustering algorithms.

Imen Heloulou, Mohammed Said Radjef, Mohand Tahar Kechadi

Industrial Applications

Opening up Data Analysis for Medical Health Services: Cancer Survival Analysis with CARESS

Dealing with cancer is one of the big challenges of the German healthcare system. Originally, efforts regarding the analysis of cancer data focused on the detection of spatial clusters of cancer incidences. Nowadays, the emphasis also incorporates complex health services research and quality assurance. In 2013, a law was enacted in Germany forcing the spatially all-encompassing expansion of clinical cancer registries, each of them covering a commuting area of about 1 to 2 million inhabitants [1]. Guidelines for a unified evaluation of data are currently in development, and it is very probable that these guidelines will demand the execution of comparative survival analyses.

In this paper, we present how the CARLOS Epidemiological and Statistical Data Exploration System (CARESS), a sophisticated data warehouse system that is used by epidemiological cancer registries (ECRs) in several German federal states, opens up data analysis for a wider audience. We show that by applying the principles of integration and abstraction, CARESS copes with the challenges posed by the diversity of the cancer registry landscape in Germany. Survival estimates are calculated by the software package periodR seamlessly integrated in CARESS. We also discuss several performance optimizations for survival estimation, and illustrate the feasibility of our approach by an experiment on cancer survival estimation performance and by an example on the application of cancer survival analysis with CARESS.

David Korfkamp, Stefan Gudenkauf, Martin Rohde, Eunice Sirri, Joachim Kieschke, H. -Jürgen Appelrath
“Real-time” Instance Selection for Biomedical Data Classification

Computer-based medical systems play a very important role in medical applications because they can strongly support the physicians in the decision making process. Several existing methods infer a classification function from labeled training data. The large amount of data nowadays available, although collected from high quality sources, usually contain irrelevant, redundant, or noisy information, suggesting that not all the training instances are useful for the classification task. To address this issue, we present here an instance selection method that, different from the existing approaches, selects in “real-time” a subset of instances from the original training set on the basis of the information derived from each test instance to be classified. We apply our method to seven public benchmark datasets, showing that the recognition performances are improved. We will also discuss how method parameters affect the experimental results.

Chongsheng Zhang, Roberto D’Ambrosio, Paolo Soda
Mining Churning Factors in Indian Telecommunication Sector Using Social Media Analytics

In this paper we address the problem of churning in the telecommunication sector in the Indian context. Churning becomes a challenging problem for telecom industries especially when the subscriber base almost reaches saturation level, and it directly affects the revenue of telecom companies. A proper analysis of factors affecting churning can help telecom service providers reduce churning, satisfy their customers, and possibly design new products to reduce churning. We use social media analytics, in particular Twitter feeds, to obtain the opinions of users. The main contribution of the paper is to demonstrate the feasibility of data mining tools, in particular association rules, for determining factors affecting churning.

Nitish Varshney, S. K. Gupta

Mining and Processing Data Stream

Drift Detector for Memory-Constrained Environments

Current approaches to drift detection assume that stable memory consumption with slight variations with each stream is suitable for all programs. This is not always the case and there are situations where small variations in memory are undesirable such as drift detectors on medical vital sign monitoring systems. Under these circumstances, it is not sufficient to have a memory use that is predictable on average, but instead memory use must be fixed. To detect drift using fixed memory in a stream, we propose DualWin: a technique that keeps two samples of controllable size, one is stored in a sliding window, which represents the most recent stream elements, and the other is stored in a reservoir, which uses reservoir sampling to maintain an image of the stream since the previous drift was detected. Through experimentation, we find that DualWin obtains a rate of true positive detection which is comparable to ADWIN2, a rate of false positive detection which is much lower, an execution time which is faster, and a fixed memory consumption.
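
A sketch of the fixed-memory bookkeeping this suggests, using the standard reservoir-sampling update; the window sizes, class names, and the drift test itself are assumptions, not the paper's implementation.

```python
import random
from collections import deque

class FixedMemorySample:
    """Sliding window of recent elements plus a fixed-size reservoir sample of the
    stream since the last detected drift; both use constant memory."""

    def __init__(self, window_size: int, reservoir_size: int):
        self.window = deque(maxlen=window_size)
        self.reservoir = []
        self.reservoir_size = reservoir_size
        self.seen = 0

    def add(self, x):
        self.window.append(x)
        self.seen += 1
        if len(self.reservoir) < self.reservoir_size:
            self.reservoir.append(x)
        else:
            j = random.randrange(self.seen)      # classic reservoir-sampling step
            if j < self.reservoir_size:
                self.reservoir[j] = x

    def reset_reservoir(self):
        """Call after a drift is signalled so the reservoir tracks the new concept."""
        self.reservoir, self.seen = [], 0

s = FixedMemorySample(window_size=100, reservoir_size=50)
for value in range(1000):
    s.add(value)
print(len(s.window), len(s.reservoir))  # 100 50
```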

Timothy D. Robinson, David Tse Jung Huang, Yun Sing Koh, Gillian Dobbie
Mean Shift Clustering Algorithm for Data with Missing Values

Missing values in data are common in real-world applications, and there are several methods that deal with this problem. In this research we developed a new version of the mean shift clustering algorithm that deals with datasets with missing values. We use a weighted distance function, defined in our previous work, that handles missing values. To compute the distance between two points that may have attributes with missing values, only the mean and the variance of the distribution of the attribute are required; thus, after they have been computed, the distance can be computed in O(1). Furthermore, we use this distance to derive a formula for computing the mean shift vector for each data point, showing that the mean shift runtime complexity is the same as that of the Euclidean mean shift. We experimented on six standard numerical datasets from different fields. On these datasets we simulated missing values and compared the performance of the mean shift clustering algorithm using our distance and the suggested mean shift vector to three other basic methods. Our experiments show that mean shift using our distance function outperforms mean shift using other methods for dealing with missing values.
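
As a hedged illustration of why only the attribute mean and variance are needed (the authors' exact weighted distance is defined in their earlier work): if a missing value is modeled by the attribute's distribution, the expected squared distance to a known value x is (x - mu)^2 + var, an O(1) computation.

```python
def coord_sq_dist(x, y, mu, var):
    """Squared distance contribution of one attribute; None marks a missing value."""
    if x is not None and y is not None:
        return (x - y) ** 2
    if x is not None:                      # y missing: E[(x - Y)^2] = (x - mu)^2 + var
        return (x - mu) ** 2 + var
    if y is not None:                      # x missing, symmetric case
        return (y - mu) ** 2 + var
    return 2 * var                         # both missing (independent draws)

print(coord_sq_dist(1.0, None, mu=0.5, var=0.25))  # 0.5
```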

Loai AbdAllah, Ilan Shimshoni
Mining Recurrent Concepts in Data Streams Using the Discrete Fourier Transform

In this research we address the problem of capturing recurring concepts in a data stream environment. Recurrence capture enables the re-use of previously learned classifiers without the need for re-learning while providing for better accuracy during the concept recurrence interval. We capture concepts by applying the Discrete Fourier Transform (DFT) to Decision Tree classifiers to obtain highly compressed versions of the trees at concept drift points in the stream and store such trees in a repository for future use. Our empirical results on real world and synthetic data exhibiting varying degrees of recurrence show that the Fourier compressed trees are more robust to noise and are able to capture recurring concepts with higher precision than a meta learning approach that chooses to re-use classifiers in their originally occurring form.

Sakthithasan Sripirakas, Russel Pears

Mining and Similarity

Semi-Supervised Learning to Support the Exploration of Association Rules

In recent years, many approaches for post-processing association rules have been proposed. Automatic approaches are simple to use, but they do not consider users' subjectivity. In contrast, approaches that consider subjectivity need an explicit description of the users' knowledge and/or interests, requiring considerable time from the user. Looking at the problem from another perspective, post-processing can be seen as a classification task, in which the user labels some rules as interesting [I] or not interesting [NI], for example, in order to propagate these labels to the other, unlabeled rules. This work presents a framework for post-processing association rules that uses semi-supervised learning in which: (a) the user is constantly directed to the [I] patterns of the domain, minimizing the exploration effort by reducing the exploration space, since his or her knowledge and/or interests are iteratively propagated; and (b) the users' subjectivity is considered without using any formalism, making the task simpler.

Veronica Oliveira de Carvalho, Renan de Padua, Solange Oliveira Rezende
Parameter-Free Extended Edit Distance

The edit distance is the most famous distance to compute the similarity between two strings of characters. The main drawback of the edit distance is that it is based on local procedures which reflect only a local view of similarity. To remedy this problem we presented in a previous work the extended edit distance, which adds a global view of similarity between two strings. However, the extended edit distance includes a parameter whose computation requires a long training time. In this paper we present a new extension of the edit distance which is parameter-free. We compare the performance of the new extension to that of the extended edit distance and we show how they both perform very similarly.
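
For context, the classic edit distance that both extensions build on is the standard dynamic program over insertions, deletions, and substitutions; the sketch below is that baseline, not the parameter-free extension proposed in the paper.

```python
def edit_distance(s: str, t: str) -> int:
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```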

Muhammad Marwan Muhammad Fuad
VGEN: Fast Vertical Mining of Sequential Generator Patterns

Sequential pattern mining is a popular data mining task with wide applications. However, the set of all sequential patterns can be very large. To discover fewer but more representative patterns, several compact representations of sequential patterns have been studied. The set of sequential generators is one of the most popular representations. It was shown to provide higher accuracy for classification than using all or only closed sequential patterns. Furthermore, mining generators is a key step in several other data mining tasks such as sequential rule generation. However, mining generators is computationally expensive. To address this issue, we propose a novel mining algorithm named VGEN (Vertical sequential GENerator miner). An experimental study on five real datasets shows that VGEN is up to two orders of magnitude faster than the state-of-the-art algorithms for sequential generator mining.

Philippe Fournier-Viger, Antonio Gomariz, Michal Šebek, Martin Hlosta
Backmatter
Metadata
Title
Data Warehousing and Knowledge Discovery
Edited by
Ladjel Bellatreche
Mukesh K. Mohania
Copyright Year
2014
Publisher
Springer International Publishing
Electronic ISBN
978-3-319-10160-6
Print ISBN
978-3-319-10159-0
DOI
https://doi.org/10.1007/978-3-319-10160-6