2013 | Book

Data Warehousing and Knowledge Discovery

15th International Conference, DaWaK 2013, Prague, Czech Republic, August 26-29, 2013. Proceedings

Editors: Ladjel Bellatreche, Mukesh K. Mohania

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 15th International Conference on Data Warehousing and Knowledge Discovery, DaWaK 2013, held in Prague, Czech Republic, in August 2013.

The 24 revised full papers and 8 short papers presented were carefully reviewed and selected from 89 submissions. The papers are organized in topical sections on modeling and ETL, query optimization and parallelism, spatial data warehouses and applications, text mining and OLAP, recommendation and prediction, data mining optimization and machine learning techniques, mining and processing data streams, clustering and data mining applications, social network and graph mining, and event sequence and Web mining.

Table of Contents

Frontmatter

Modeling and ETL

An Analytics-Aware Conceptual Model for Evolving Graphs

Graphs are ubiquitous data structures commonly used to represent highly connected data. Many real-world applications, such as social and biological networks, are modeled as graphs. To answer the surge in graph data management needs, many graph database solutions have been developed. These databases are commonly classified as NoSQL graph databases, and they provide better support for graph data management than their relational counterparts. However, each of these databases implements its own operational graph data model, and these models differ among products. Further, there is no commonly agreed-upon conceptual model for graph databases.

In this paper, we introduce a novel conceptual model for graph databases. The aim of our model is to provide analysts with a set of simple, well-defined, and adaptable conceptual components to perform rich analysis tasks. These components take into account the evolving aspect of the graph. Our model is analytics-oriented, flexible and incremental, enabling analysis over evolving graph data. The proposed model provides a typing mechanism for the underlying graph, and formally defines the minimal set of data structures and operators needed to analyze the graph.

Amine Ghrab, Sabri Skhiri, Salim Jouili, Esteban Zimányi
Declarative Expression and Optimization of Data-Intensive Flows

Data-intensive analytic flows, such as populating a data warehouse or analyzing a click stream at runtime, are very common in modern business intelligence scenarios. Current state-of-the-art data flow management techniques rely on the users to specify the flow structure without performing automated optimization of that structure. In this work, we introduce a declarative way to specify flows, which is based on annotated descriptions of the output schema of each flow activity. We show that our approach is adequate to capture both a wide range of arbitrary data transformations, which cannot be supported by traditional relational operators, and the precedence constraints between the various stages in the flow. Moreover, we show that we can express the flows as annotated queries and thus apply precedence-aware query optimization algorithms. We propose an approach to optimizing linear conceptual data flows by producing a parallel execution plan, and our evaluation results show that we can speed up flow execution by up to an order of magnitude compared to existing techniques.

Georgia Kougka, Anastasios Gounaris
Tabular Web Data: Schema Discovery and Integration

Web data such as web tables, lists, and data records from a wide variety of domains can be combined for different purposes, such as querying for information and creating example data sets. Locating and extracting tabular web data, and discovering and integrating its schemas, are important for effectively combining, querying, and presenting it in a uniform format. We focus on schema generation and integration for both a static and a dynamic framework. We contribute algorithms for generating individual schemas from extracted tabular web data and for integrating the generated schemas. Our approach is novel because it contributes functionality not previously addressed: it accommodates both the static and dynamic frameworks, different kinds of web data types, schema discovery and unification, and table integration.

Prudhvi Janga, Karen C. Davis

Query Optimization and Parallelism

Uncoupled MapReduce: A Balanced and Efficient Data Transfer Model

In the MapReduce model, reduce tasks need to fetch the output data of map tasks in a “pull” manner. However, reduce tasks that are occupying reduce slots cannot start to compute until all the corresponding map tasks are completed. This forms a dependency between map and reduce tasks, which we call the coupled relationship in this paper. The coupled relationship leads to two problems: reduce slot hoarding and underutilized network bandwidth. We propose an uncoupled intermediate data transfer model to address these problems. Three core techniques, including weighted mapping, data pushing, and partial data backup, are introduced and applied in Apache Hadoop, the mainstream open-source implementation of the MapReduce model. This work has been put into practice at Baidu, the largest search engine company in China. A real-world application for web data processing shows that our model can improve system throughput by 29.5%, reduce the total wall time by 22.8%, and provide a weighted wall time acceleration of 26.3%. Moreover, the implementation of this model is transparent to user jobs and compatible with the original Hadoop.

Jie Zhang, Maosen Sun, Jian Lin, Li Zha
Efficient Evaluation of Ad-Hoc Range Aggregates

θ-MDA is a flexible and efficient operator for complex ad-hoc multi-dimensional aggregation queries. It separates the specification of aggregation groups for which aggregate values are computed (base table b) from the specification of aggregation tuples from which aggregate values are computed. Aggregation tuples are subsets of the detail table r and are defined by a general θ-condition. θ-MDA requires one scan of r, during which the aggregates are incrementally updated.

In this paper, we propose a two-step evaluation strategy for θ-MDA to optimize the computation of ad-hoc range aggregates by reducing them to point aggregates. The first step scans r and computes point aggregates as a partial intermediate result x̃, which can be done efficiently. The second step combines the point aggregates into the final aggregates. This transformation significantly reduces the number of incremental updates to aggregates and reduces the runtime from $\mathcal{O}(|{\bf r}|\cdot|{\bf b}|)$ to $\mathcal{O}(|{\bf r}|)$, provided that $|{\bf b}| < \sqrt{|{\bf r}|}$ and |x̃| ≈ |b|, which is common for OLAP. An empirical evaluation confirms the analytical results and shows the effectiveness of our optimization: range queries are evaluated with almost the same efficiency as point queries.
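The general idea of reducing range aggregates to point aggregates can be sketched in a few lines. The following is an illustrative example for range-SUM over a one-dimensional key (our own sketch under simplified assumptions, not the authors' θ-MDA operator): one scan builds the point aggregates, and prefix sums then answer each range as a difference of two lookups.

```python
# Illustrative sketch: answer range-SUM aggregates via point aggregates.
# Step 1 scans the detail table r once; step 2 combines point aggregates
# with prefix sums, so each range needs O(1) work instead of a scan.

from collections import defaultdict
from itertools import accumulate

def range_sums(r, ranges, key_domain):
    # Step 1: one scan of r -> point aggregates (the intermediate result x~).
    point = defaultdict(int)
    for key, value in r:
        point[key] += value
    # Step 2: prefix sums over the ordered key domain; a range [lo, hi]
    # becomes prefix[hi] - prefix[lo - 1].
    prefix = dict(zip(key_domain, accumulate(point[k] for k in key_domain)))
    results = []
    for lo, hi in ranges:  # inclusive ranges over keys in key_domain
        total = prefix[hi] - (prefix[lo - 1] if lo - 1 in prefix else 0)
        results.append(total)
    return results

r = [(1, 10), (2, 20), (2, 5), (4, 7)]               # (key, measure) tuples
print(range_sums(r, [(1, 2), (2, 4)], [1, 2, 3, 4]))  # [35, 32]
```

The point-aggregate table plays the role of x̃: it is as small as the base table, which is why the runtime no longer depends on the number of range groups.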

Christian Ammendola, Michael H. Böhlen, Johann Gamper
SPIN: Concurrent Workload Scaling over Data Warehouses

Increasingly, data warehouse (DW) analyses are being used not only for strategic business decisions but also as valuable tools in daily operational decisions. As a consequence, a large number of queries are submitted concurrently, stressing the database engine's ability to handle such query workloads without severely degrading query response times. The query-at-a-time model of common database engines, where each query is executed independently and competes for the same resources, is inefficient for handling large DWs and does not provide the expected performance and scalability when processing large numbers of concurrent queries. However, the query workload, which is mainly composed of aggregation star queries, frequently has to process similar data and perform similar computations. While materialized views can help in this matter, their usefulness is limited to queries and query patterns that are known in advance. Existing proposals on data and processing sharing suffer from memory limitations, reduced scalability, and unpredictable execution times when applied to large DWs, particularly those with large dimensions. We present SPIN, a data and processing sharing model that delivers predictable execution times for aggregation star queries even in the presence of large concurrent query loads, without the memory and scalability limitations of existing approaches. We used the TPC-H benchmark to experimentally evaluate SPIN.

João Pedro Costa, Pedro Furtado

Spatial Data Warehouses and Applications

Lily: A Geo-Enhanced Library for Location Intelligence

Location intelligence is a set of tools and techniques for integrating a geographical dimension into BI platforms, aimed at enhancing their capability to monitor and interpret business events. Though most commercial data warehouse tools have implemented spatial extensions to support GIS integration, the user experience with spatial data is still mostly limited to the visualization of maps labeled with numerical indicators. To overcome this limit we developed Lily, a geo-enhanced library that adds true location intelligence capabilities to existing BI platforms. Lily provides end-users with a highly interactive interface that seamlessly achieves a bidirectional integration between the BI and geospatial worlds, so as to enable advanced analytical features that truly take the spatial dimension into account. In this paper we describe Lily from a functional and architectural point of view, and show an example where Lily is coupled with the Oracle Suite for location intelligence in the field of telecommunications.

Matteo Golfarelli, Marco Mantovani, Federico Ravaldi, Stefano Rizzi
Discovering Co-location Patterns in Datasets with Extended Spatial Objects

Co-location mining is one of the tasks of spatial data mining, which focuses on the detection of sets of spatial features frequently located in close proximity to each other. Previous work is based on transaction-free apriori-like algorithms. The approach we propose is based on a grid transactionization of geographic space and is designed to mine datasets with extended spatial objects. A statistical test is used instead of global thresholds to detect significant co-location patterns.

Aibek Adilmagambetov, Osmar R. Zaiane, Alvaro Osornio-Vargas
SOLAP4epidemiologist: A Spatial Data Warehousing Application in Epidemiology Domain

During the last decades, the Provinces of Naples and Caserta, in the Campania Region, experienced a dreadful increase in the level of pollution as an effect of documented practices of illegal waste dumping and burning. In the same period, an abnormal increase in deaths due to cancer was registered. Up to now, no impact of waste treatment on human health has been scientifically proven, but neither has it been excluded.

We believe that the availability of simple-to-use analytics tools may be of great help to epidemiologists in managing and querying the huge amount of heterogeneous data available for epidemiologic purposes.

This paper presents an innovative decision support application, SOLAP4epidemiologist, based on Spatial Data Warehousing technologies (Spatial ETL, GIS, Spatial OLAP), able to integrate structured and geo-referenced data coming from different sources and to investigate them by means of a user-friendly GUI, using statistical charts and map representations.

Assuntina Cembalo, Michele Ferrucci, Francesca Maria Pisano, Gianpaolo Pigliasco

Text Mining and OLAP

Document Difficulty Framework for Semi-automatic Text Classification

Text Classification systems are able to deal with large datasets, spending less time and incurring lower human cost compared with manual classification. This is achieved, however, at the expense of quality. Semi-Automatic Text Classification (SATC) aims to achieve high quality with minimum human effort by ranking the documents according to their estimated certainty of being correctly classified. This paper introduces the Document Difficulty Framework (DDF), a unification of different strategies to estimate document certainty, and its application to SATC. DDF exploits the scores and thresholds computed by any given classifier. Different metrics are obtained by changing the parameters of the three levels the framework relies upon: how to measure the confidence for each document-class pair (evidence), which classes to observe (class), and how to aggregate this knowledge (aggregation). Experiments show that DDF metrics consistently achieve high error reduction with large portions of the collection being automatically classified. Furthermore, DDF outperforms all SATC methods reported in the literature.
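The core SATC workflow described above can be sketched generically (our own illustration, not DDF itself): rank documents by an estimated certainty, here the margin between the two highest class scores, auto-classify the most certain fraction, and route the rest to human annotators.

```python
# Illustrative sketch of semi-automatic classification by certainty ranking.
# The certainty estimate here (top-two score margin) is one simple choice;
# DDF generalizes over many such evidence/class/aggregation combinations.

def split_by_certainty(scored_docs, auto_fraction=0.5):
    """scored_docs: list of (doc_id, {class: score}) pairs."""
    def margin(scores):
        top = sorted(scores.values(), reverse=True)
        return top[0] - (top[1] if len(top) > 1 else 0.0)
    # Most certain documents first.
    ranked = sorted(scored_docs, key=lambda d: margin(d[1]), reverse=True)
    cut = int(len(ranked) * auto_fraction)
    return ranked[:cut], ranked[cut:]   # (auto-classified, sent to humans)

docs = [("d1", {"spam": 0.9, "ham": 0.1}),
        ("d2", {"spam": 0.55, "ham": 0.45}),
        ("d3", {"spam": 0.2, "ham": 0.8}),
        ("d4", {"spam": 0.51, "ham": 0.49})]
auto, manual = split_by_certainty(docs, 0.5)
print([d for d, _ in auto])  # ['d1', 'd3']
```

Raising `auto_fraction` trades human effort against the risk of auto-classifying low-margin documents incorrectly, which is exactly the trade-off SATC methods are evaluated on.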

Miguel Martinez-Alvarez, Alejandro Bellogin, Thomas Roelleke
Knowledge-Free Table Summarization

Considering relational tables as the object of analysis, methods to summarize them can give the analyst a starting point for exploring the data. Typically, table summarization aims at producing an informative data summary through the use of metadata supplied by attribute taxonomies. Nevertheless, such hierarchical knowledge is not always available, and may be inadequate even when it exists. To overcome these limitations, we propose a new framework, named cTabSum, to automatically generate attribute value taxonomies and perform table summarization directly based on a table's own content. Our innovative approach takes a relational table as input and proceeds in two steps. First, a taxonomy for each attribute is extracted. Second, a new table summarization algorithm exploits the automatically generated taxonomies. An information-theoretic measure is used to guide the summarization process. Alongside the new algorithm we also develop a prototype. Interestingly, our prototype incorporates some additional features to help the user become familiar with the data: (i) the summarized table produced by cTabSum can be used as a recommended starting point to browse the data; (ii) some very easy-to-understand charts visualize how the taxonomies were built; and (iii) standard OLAP operators, i.e. drill-down and roll-up, have been implemented to ease navigation within the data set. In addition, we supply an objective evaluation of our table summarization strategy over real data.

Dino Ienco, Yoann Pitarch, Pascal Poncelet, Maguelonne Teisseire

Recommendation and Prediction

Predicting Your Next OLAP Query Based on Recent Analytical Sessions

In Business Intelligence systems, users interact with data warehouses by formulating OLAP queries aimed at exploring multidimensional data cubes. Being able to predict the most likely next queries would provide a way to recommend interesting queries to users on the one hand, and could improve the efficiency of OLAP sessions on the other. In particular, query recommendation would proactively guide users in data exploration and improve the quality of their interactive experience. In this paper, we propose a framework to predict the most likely next query and recommend it to the user. Our framework relies on a probabilistic user behavior model built by analyzing previous OLAP sessions and exploiting a query similarity metric. To gain insight into the recommendation precision and the parameters it depends on, we evaluate our approach using different quality assessments.

Marie-Aude Aufaure, Nicolas Kuchmann-Beauger, Patrick Marcel, Stefano Rizzi, Yves Vanrompay
Where Will You Go? Mobile Data Mining for Next Place Prediction

The technological advances in smartphones and their widespread use have resulted in the large volume and varied types of mobile data we have today. Location prediction through mobile data mining leverages such big data in applications such as traffic planning, location-based advertising, and intelligent resource allocation, as well as in recommender services including the popular Apple Siri or Google Now. This paper focuses on the challenging problem of predicting the next location of a mobile user given data on his or her current location. In this work, we propose NextLocation, a personalised mobile data mining framework that uses not only spatial and temporal data but also other contextual data such as accelerometer, Bluetooth, and call/SMS logs. In addition, the proposed framework represents a new paradigm for privacy-preserving next place prediction, as the mobile phone data is not shared without user permission. Experiments have been performed using data from the Nokia Mobile Data Challenge (MDC). The results on MDC data show large variability in predictive accuracy, of about 17%, across users. For example, irregular users are very difficult to predict, while for more regular users it is possible to achieve more than 80% accuracy. To the best of our knowledge, our approach achieves the highest predictive accuracy when compared with existing results.

João Bártolo Gomes, Clifton Phua, Shonali Krishnaswamy
An Extended Local Hierarchical Classifier for Prediction of Protein and Gene Functions

Gene function prediction and protein function prediction are complex classification problems where the functional classes are structured according to a predefined hierarchy. To solve these problems, we propose an extended local hierarchical Naive Bayes classifier, where a binary classifier is built for each class in the hierarchy. The extension to conventional local approaches is that each classifier considers both the parent and child classes of the current class. We have evaluated the proposed approach on eight protein function and ten gene function hierarchical classification datasets. The proposed approach achieved somewhat better predictive accuracies than a global hierarchical Naive Bayes classifier.

Luiz Henrique de Campos Merschmann, Alex Alves Freitas

Data Mining Optimization and Machine Learning Techniques

Supervised Dimensionality Reduction via Nonlinear Target Estimation

Dimensionality reduction is a crucial ingredient of machine learning and data mining, boosting classification accuracy through the isolation of patterns via the omission of noise. Recent studies have shown that dimensionality reduction can benefit from label information, via a joint estimation of predictors and target variables from a low-rank representation. Inspired by this, we propose a novel dimensionality reduction method which simultaneously reconstructs the predictors using matrix factorization and estimates the target variable via a dual-form maximum margin classifier in the latent space. The joint optimization function is learned through a coordinate descent algorithm with stochastic updates. Finally, empirical results demonstrate the superiority of the proposed method compared to both classification in the original space (no reduction) and classification after unsupervised reduction.

Josif Grabocka, Lucas Drumond, Lars Schmidt-Thieme
Concurrent Execution of Data Mining Queries for Spatial Collocation Pattern Discovery

In spatial databases, Collocation Pattern Discovery is a very important data mining technique. It consists in searching for types of spatial objects that are frequently located together. Due to high requirements for CPU, memory or storage space, such data mining queries are often executed at times of low user activity. Multiple users or even the same user experimenting with different parameters can define many queries during the working hours that are executed, e.g., at off-peak night-time hours. Given a set of multiple spatial data mining queries, a data mining system may take advantage of potential overlapping of the queried datasets. In this paper we present a new method for concurrent processing of multiple spatial collocation pattern discovery queries. The aim of our new algorithm is to improve processing times by reducing the number of searches for neighboring objects, which is a crucial step for the identification of collocation patterns.

Pawel Boinski, Maciej Zakrzewicz
Depth-First Traversal over a Mirrored Space for Non-redundant Discriminative Itemsets

Discriminative pattern mining is known under the names of subgroup discovery, contrast set mining, emerging pattern mining, etc., and has been intensively studied for the last 15 years. Building on the sophisticated techniques developed so far (e.g. branch-and-bound search, minimum support raising, and redundancy elimination including the use of closed patterns), this paper proposes an efficient exact algorithm for finding top-k discriminative patterns that are not redundant and would be of value at a later step in prediction or knowledge discovery. The proposed algorithm is unique in that it conducts depth-first search over enumeration trees that are a mirrored form of conventional ones; by this design we can keep the list of candidate top-k patterns compact during the search and, consequently, the minimum support threshold high. Experimental results with datasets from the UCI Machine Learning Repository clearly show the efficiency of the proposed algorithm.

Yoshitaka Kameya, Hiroki Asaoka

Mining and Processing Data Streams

Stream Mining of Frequent Patterns from Delayed Batches of Uncertain Data

Streams of data can be continuously generated by sensors in various real-life applications such as environment surveillance. Partially due to the inherent limitations of the sensors, data in these streams can be uncertain. To discover useful knowledge in the form of frequent patterns from streams of uncertain data, a few algorithms have been developed. They mostly use the sliding window model for processing and mining data streams. However, for some applications, other stream processing models such as the time-fading model are more appropriate. Moreover, batches of data in the stream may be delayed and may not arrive in the intended order. In this paper, we propose mining algorithms that use the time-fading model to mine frequent patterns when batches in streams of uncertain data are delayed and arrive out of order.
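The time-fading model with out-of-order batches can be illustrated with a small sketch (ours, not the authors' algorithm): if every occurrence in a batch with timestamp t is weighted by d^(now − t), and counts are stored relative to a fixed time origin, then a delayed batch can be merged later and still receive the weight of its original timestamp.

```python
# Illustrative sketch of the time-fading model with delayed batches.
# Counts are kept relative to t = 0, so merging order does not matter:
# a batch from time t always contributes weight d**(now - t) in the end.

from collections import defaultdict

class FadingCounts:
    def __init__(self, decay=0.5):  # fading factor d, 0 < d <= 1 (illustrative)
        self.decay = decay
        self.counts = defaultdict(float)  # weights expressed at reference t = 0
        self.now = 0
    def add_batch(self, batch, t_batch):
        # Weight by decay**(-t_batch) relative to t = 0, so arrival order
        # is irrelevant; only the batch's own timestamp matters.
        for item in batch:
            self.counts[item] += self.decay ** (-t_batch)
        self.now = max(self.now, t_batch)
    def faded_count(self, item):
        # Rescale to the current time: a batch from time t ends up with d**(now - t).
        return self.counts[item] * self.decay ** self.now

fc = FadingCounts()
fc.add_batch(["a", "b"], t_batch=0)
fc.add_batch(["a"], t_batch=2)   # batch for t = 2 arrives first
fc.add_batch(["a"], t_batch=1)   # delayed batch for t = 1 arrives late
print(fc.faded_count("a"))       # 0.25*1 + 0.5*1 + 1*1 = 1.75
```

A production system would periodically re-base the reference time to avoid the `decay ** (-t)` terms overflowing for large timestamps; the sketch omits that for clarity.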

Fan Jiang, Carson Kai-Sang Leung
Fast Causal Network Inference over Event Streams

This paper addresses causal inference and modeling over event streams where data have high throughput and are unbounded. The availability of large amounts of data along with the high data throughput presents several new challenges related to causal modeling, such as the need for fast causal inference operations while ensuring consistent and valid results. There is no existing work specifically for such a streaming environment. We meet the challenges by introducing a time-centric causal inference strategy that leverages temporal precedence information to decrease the number of conditional independence tests required to establish the dependencies between the variables in a causal network. Dependency and temporal precedence of cause over effect are the two properties of a causal relationship. We also present the Temporal Network Inference algorithm to model the temporal precedence relations as a temporal network. Then, we propose the Fast Causal Network Inference algorithm for faster learning of the causal network using the temporal network. Experiments using synthetic and real datasets demonstrate the efficacy of the proposed algorithms.

Saurav Acharya, Byung Suk Lee
SSCJ: A Semi-Stream Cache Join Using a Front-Stage Cache Module

Semi-stream processing has become an emerging area of research in the field of data stream management. One common operation in semi-stream processing is joining a stream with disk-based master data using a join operator. This join operator typically works under limited main memory, which is generally not large enough to hold the whole disk-based master data. Recently, a number of semi-stream join algorithms have been proposed in the literature to achieve optimal performance, but there is still room for improvement. In this paper we propose a novel Semi-Stream Cache Join (SSCJ) using a front-stage cache module. The algorithm takes advantage of skewed distributions, and we present results for Zipfian distributions of the type that appear in many applications. We compare the performance of SSCJ against a well-known related join algorithm, HYBRIDJOIN (Hybrid Join). We also provide a cost model for our approach and validate it with experiments.
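A quick back-of-the-envelope calculation (ours, not from the paper) shows why a small front-stage cache pays off under Zipfian skew: with exponent s = 1, a handful of the hottest master-data keys already carries a large share of the stream tuples.

```python
# Illustrative computation: fraction of a Zipf(s) distribution over n keys
# that falls on the `top` hottest keys. Under skew, caching a tiny fraction
# of the master data serves a disproportionate share of stream probes.

def zipf_mass(top, n, s=1.0):
    weights = [1.0 / k ** s for k in range(1, n + 1)]
    return sum(weights[:top]) / sum(weights)

# Caching just 10 of 1000 master-data keys covers roughly 39% of the stream;
# 100 keys (10% of the master data) cover roughly 69%.
print(round(zipf_mass(10, 1000), 2))
print(round(zipf_mass(100, 1000), 2))
```

This is the intuition behind routing stream tuples through a cache of hot keys first and falling back to the disk-based join (e.g. a HYBRIDJOIN-style operator) only for cache misses.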

M. Asif Naeem, Gerald Weber, Gillian Dobbie, Christof Lutteroth

Clustering and Data Mining Applications

Metrics to Support the Evaluation of Association Rule Clustering

Many topics related to association mining have received attention in the research community, especially those focused on the discovery of interesting knowledge. A promising approach related to this topic is the application of clustering in the pre-processing step to help the user find the relevant associative patterns of the domain. In this paper, we propose nine metrics to support the evaluation of this kind of approach. The metrics are important since they provide criteria to: (a) analyze the methodologies, (b) identify their positive and negative aspects, (c) carry out comparisons among them and, therefore, (d) help users select the most suitable solution for their problems. Experiments are presented that show how the metrics can be used and demonstrate their usefulness.

Veronica Oliveira de Carvalho, Fabiano Fernandes dos Santos, Solange Oliveira Rezende
An Automatic Email Management Approach Using Data Mining Techniques

Email mining offers a solution to the email overload problem by automatically placing emails into meaningful, similar groups based on email subject and contents. Existing email mining systems such as BuzzTrack do not consider the semantic similarity between email contents, and when a large number of email messages is clustered into a single folder the email overload problem persists. The goal of this paper is to solve the email overload problem by semantically structuring the user's email: automatically organizing email into folders and sub-folders using a data mining clustering technique, and extracting important terms from the created folders using an Apriori-based method for folder identification. This paper proposes a system named AEMS for automatic folder and sub-folder creation and subsequent indexing of the created folders. For the AEMS module, a novel approach named Semantic non-parametric K-Means++ clustering is proposed for folder creation. Experiments show the effectiveness and efficiency of the proposed techniques on large volumes of email data.
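The clustering step above builds on K-Means++. As background, here is a plain sketch of the standard K-Means++ seeding rule (not the authors' "Semantic non-parametric" variant): each new centre is drawn with probability proportional to the squared distance to the nearest centre chosen so far, which spreads the initial centres out.

```python
# Standard K-Means++ seeding (illustrative background, not the paper's variant).

import random

def kmeanspp_seeds(points, k, rng=random.Random(0)):
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # Squared distance of each point to its closest existing seed.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, s)) for s in seeds)
              for p in points]
        total = sum(d2)
        # Sample the next seed proportionally to d2: points far from all
        # current seeds are preferred.
        r, acc = rng.random() * total, 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                seeds.append(p)
                break
    return seeds

# Two well-separated groups; the second seed lands far from the first.
points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
print(len(kmeanspp_seeds(points, 2)))  # 2
```

In an email setting, `points` would be vector representations of messages (e.g. term vectors), and the seeded clusters become candidate folders.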

Gunjan Soni, C. I. Ezeife
A Data Mining-Based Wind Power Forecasting Method: Results for Wind Power Plants in Turkey

With the huge technological and industrial developments of recent years, the electricity demand of all countries has been increasing day by day. In order to supply their electricity needs, countries have been looking for ways to benefit efficiently from their renewable energy sources, and wind energy is an important and ubiquitous renewable energy source. However, due to wind's discontinuous and unstable characteristics, a reliable wind forecasting system is crucial not only for transmission system operators but also for wind power plant (WPP) owners. This paper presents a reliable forecasting method based on data mining approaches. The method uses numerical weather predictions and past power measurements of the WPPs as input, and produces hourly short-term wind power forecasts for the WPPs over a time period of 48 hours. The method has been tested in the Wind Power Monitoring and Forecast Center (RİTM) project of Turkey for a duration of six months on 14 WPPs. The proposed model achieves better accuracy than other well-known forecasting models for seven of the WPPs selected for the testing procedure by the General Directorate of Renewable Energy in Turkey.

Mehmet Barış Özkan, Dilek Küçük, Erman Terciyanlı, Serkan Buhan, Turan Demirci, Pinar Karagoz
Disease Occurrence Prediction Based on Spatio-temporal Characterization – A Mesoscale Study for Knowledge and Pattern Discovery

Many statistical modeling and simulation-based methods have been proposed to detect, simulate, and predict disease in space and time. However, these models either make use of an unprecedented number of domain-related parameters, or suffer from issues of global generality and local overfitting. In this work, a methodology is proposed for the detection, simulation, and prediction of disease in a region based on an aggregated idea of spatio-temporal zoning. Specifically, a novel method is developed to model meso-scale processes by capturing spatio-temporal correlation, leveraging both potential dependencies among local coefficients and potential incongruities within global models, for prediction of trends in the context of disease management. The method is able to infer causality, and to simulate the spread given initial sources derived primarily from the spatio-temporal history of the disease. An illustrative case study of Salmonellosis in the USA is presented to demonstrate the utility of the method. The predicted trends mimic the observed event data better than the standard methodology, even though the magnitude predictions need to be improved. Such a methodology can help prioritize the decision-making process for better risk assessment and management, including for disease outbreaks.

Vipul Raheja, K. S. Rajan
Population Estimation Mining Using Satellite Imagery

Many countries around the world regularly collect census data, which provides statistical information about populations to support decision-making processes. However, traditional approaches to the collection of census data are both expensive and time consuming. The analysis of high-resolution satellite imagery provides a useful alternative for collecting census data which is significantly cheaper than traditional methods, although less accurate. This paper describes a technique for mining satellite imagery to extract census information, founded on the use of classification techniques coupled with a graph-based representation of the relevant imagery. The fundamental idea is to build a classifier that can label households according to “family size”, which can then be used to collect census data. To act as a focus for the work, training data obtained from villages lying some 300 km to the northwest of Addis Ababa in Ethiopia was used. The nature of each household in the segmented training data was captured using a tree-based representation. Each tree-represented household had a “family size” class label associated with it. This data was then used to build a classifier that can predict household sizes according to the nature of the tree-based structure.

Kwankamon Dittakan, Frans Coenen, Rob Christley, Maya Wardeh

Social Network and Graph Mining

GMAC: A Seed-Insensitive Approach to Local Community Detection

Local community detection aims at finding a community structure starting from a seed (i.e., a given vertex) in a network without global information, such as in online social networks that are too large and dynamic to ever be known fully. Nonetheless, the existing approaches to local community detection are usually sensitive to seeds, i.e., some seeds may cause some true communities to be missed. In this paper, we present a seed-insensitive method called GMAC for local community detection. It estimates the similarity between vertices by investigating the vertices' neighborhoods, and reveals a local community by simultaneously maximizing its internal similarity and minimizing its external similarity. Extensive experimental results on both synthetic and real-world data sets verify the effectiveness of our GMAC algorithm.
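The general shape of seed-based local community detection can be sketched with a simple greedy baseline (our own illustration; GMAC itself optimizes a neighbourhood-similarity objective rather than this edge-ratio score): starting from the seed, repeatedly add the neighbour that most improves the ratio of internal to boundary edges, and stop when no addition helps.

```python
# Greedy seed-expansion baseline for local community detection (illustrative).
# score(C) = internal edge endpoints / all edge endpoints touching C;
# we grow C from the seed while this ratio keeps improving.

def local_community(graph, seed):
    community = {seed}
    def score(comm):
        internal = external = 0
        for v in comm:
            for u in graph[v]:
                if u in comm:
                    internal += 1   # each internal edge counted once per endpoint
                else:
                    external += 1
        return internal / (internal + external) if internal + external else 0.0
    improved = True
    while improved:
        improved = False
        frontier = {u for v in community for u in graph[v]} - community
        best, best_score = None, score(community)
        for u in frontier:
            s = score(community | {u})
            if s > best_score:
                best, best_score = u, s
        if best is not None:
            community.add(best)
            improved = True
    return community

# Two triangles joined by a single bridge edge (0-1-2 and 3-4-5, bridge 2-3).
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(sorted(local_community(graph, 0)))  # [0, 1, 2]
```

The seed-sensitivity problem the paper targets is visible even here: greedy expansion can stall at a locally good but globally wrong community depending on which vertex it starts from.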

Lianhang Ma, Hao Huang, Qinming He, Kevin Chiew, Jianan Wu, Yanzhe Che
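The general local-expansion idea behind seed-based methods like GMAC can be sketched as follows — this is an illustrative greedy expansion over an edge-count score, not the actual GMAC algorithm, which works on neighborhood similarity:

```python
# Illustrative sketch of seed-based local community detection (NOT the
# GMAC algorithm): starting from a seed vertex, greedily add the
# frontier vertex that best improves the ratio of internal to external
# edges, stopping when no candidate improves it.

def local_community(graph, seed):
    """graph: dict vertex -> set of neighbours; returns a set of vertices."""
    community = {seed}

    def score(comm):
        internal = sum(1 for v in comm for u in graph[v] if u in comm) / 2
        external = sum(1 for v in comm for u in graph[v] if u not in comm)
        return internal / (internal + external) if internal + external else 0.0

    while True:
        frontier = {u for v in community for u in graph[v]} - community
        best = max(frontier, key=lambda u: score(community | {u}), default=None)
        if best is None or score(community | {best}) <= score(community):
            return community
        community.add(best)

g = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},   # first triangle community
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},   # second triangle community
}
print(sorted(local_community(g, 0)))  # → [0, 1, 2]
```

The seed-sensitivity problem the abstract addresses shows up exactly here: a naive greedy expansion can return different communities depending on where it starts.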
Finding Maximal Overlapping Communities

Social networks play an important role in everyday life, and much current research concentrates on detecting communities within them. Traditionally, most community detection algorithms focus on detecting disjoint communities; however, there is a need for overlapping community detection, and in recent years there have been several attempts at it. Most of these techniques concentrate on simply detecting such communities; none tries to detect the maximal set of these communities, which gives more stability. In this paper we propose a new method called Maximal-DSHRINK that detects the maximal set of overlapping communities within a social network. We show that the maximal set provides better quality in terms of modularity gain.

Eileen H. -C. Wei, Yun Sing Koh, Gillian Dobbie
Minimal Vertex Unique Labelled Subgraph Mining

This paper introduces the concept of Vertex Unique Labelled Subgraph Mining (VULSM), a specialised form of subgraph mining. A VULS is a subgraph defined by a set of edge labels that has a unique vertex labelling associated with it. A minimal VULS is then a VULS that is not a supergraph of any other VULS. The application considered in this paper, for evaluation purposes, is error prediction with respect to sheet metal forming. The Minimum BFS Right-most Extension Unique Subgraph Mining (Min-BFS-REUSM) algorithm is introduced for identifying minimal VULS using a Breadth First Search (BFS) strategy.

Wen Yu, Frans Coenen, Michele Zito, Subhieh El Salhi
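The “unique vertex labelling” property can be illustrated in a drastically simplified single-vertex case — the paper's VULS are general subgraphs, whereas this sketch only asks which incident edge-label sets determine a unique vertex label:

```python
# Simplified, one-vertex illustration of the "vertex unique labelling"
# idea: an edge-label set is a (trivial) VULS pattern if every vertex
# incident to exactly that set of edge labels carries the same vertex
# label. The paper's patterns are full subgraphs; this is not the
# Min-BFS-REUSM algorithm.

def vuls_patterns(vertices, edges):
    """vertices: {v: label}; edges: [(u, v, edge_label)].
    Returns edge-label sets that determine a unique vertex label."""
    incident = {v: set() for v in vertices}
    for u, v, lab in edges:
        incident[u].add(lab)
        incident[v].add(lab)
    seen = {}
    for v, labs in incident.items():
        seen.setdefault(frozenset(labs), set()).add(vertices[v])
    return {labs for labs, vlabels in seen.items() if len(vlabels) == 1}

verts = {1: "x", 2: "x", 3: "y", 4: "y"}
edges = [(1, 2, "a"), (3, 4, "b"), (2, 3, "c")]
print(vuls_patterns(verts, edges))  # all four incident sets are unique here
```

Minimality (a VULS that is not a supergraph of any other VULS) would then be checked by discarding patterns that contain a smaller pattern, which is what the BFS right-most extension strategy organises efficiently.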

Event Sequence and Web Mining

A Hybrid Graph-Based Method for Concept Rule Discovery

In this paper we present a hybrid graph-based concept discovery method. Concept rule discovery aims at finding the definition of a specific concept in terms of relations involving background knowledge. The proposed method combines path-finding-based and substructure-based approaches. It is distinguished from other state-of-the-art graph-based approaches in that the data is initially stored in a relational database, and the concept descriptors are induced while the graph is constructed. In this approach, besides serving as the representation framework, the graph structure is also utilized to guide the concept induction process. A set of experiments conducted on data sets from different learning problems shows that the proposed approach yields promising results, in terms of running time and accuracy, compared to state-of-the-art methods.

Alev Mutlu, Pinar Karagoz
Applications of Concurrent Access Patterns in Web Usage Mining

This paper builds on earlier data mining and modelling research that proposed the discovery of novel structural relation patterns, applying the approach to web usage mining. The focus here is on concurrent access patterns (CAP), where an overarching framework sets out the methodology for post-processing web access patterns. Data pre-processing, pattern discovery and pattern analysis proceed in association with access pattern mining, CAP mining and CAP modelling. Pruning and selection of access patterns take place as necessary, allowing further CAP mining and modelling in the search for the most interesting concurrent access patterns. It is shown that higher-level CAPs can be modelled in a way that brings greater structure to the process of knowledge discovery. Experiments with real-world datasets highlight the applicability of the approach to web navigation.

Jing Lu, Malcolm Keech, Cuiqing Wang
Cell-at-a-Time Approach to Lazy Evaluation of Dimensional Aggregations

We present a lazy evaluation technique for computing summarized information from dimensional databases; it works well even with a very large number of dimensions. Whereas the traditional approach has been to preprocess analysis models from which the user selects the data of interest, in our approach only the cells required by the user are calculated, using a cell-by-cell computation strategy.

Peter Thanisch, Jyrki Nummenmaa, Tapio Niemi, Marko Niinimäki
MAF: A Method for Detecting Temporal Associations from Multiple Event Sequences

In this paper, we propose a two-phase method, called Multivariate Association Finder (MAF), to mine temporal associations in multiple event sequences. It is assumed that a set of event sequences, where each event has an id and an occurrence time, is collected from an application. Our work is motivated by the observation that many associated events in multiple temporal sequences do not occur concurrently but sequentially. In an empirical study, we apply our method to two different application domains. First, we use MAF to detect multivariate motifs in multiple time series data. Existing approaches all assume that the univariate elements of a multivariate motif occur synchronously; the experimental results on both synthetic and real data sets show that our method finds both synchronous and non-synchronous multivariate motifs. Second, we apply our method to mine frequent episodes from event streams. Current methods often ask users to provide possible lengths of frequent episodes, whereas the results on neuronal spike simulation data show that MAF automatically detects episodes with variable time delays.

Han Liang
Backmatter
Metadata
Title: Data Warehousing and Knowledge Discovery
Editors: Ladjel Bellatreche, Mukesh K. Mohania
Copyright Year: 2013
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-642-40131-2
Print ISBN: 978-3-642-40130-5
DOI: https://doi.org/10.1007/978-3-642-40131-2
