Top

2014 | Book

Read chapter Read first chapter

New Frontiers in Mining Complex Patterns

Second International Workshop, NFMCP 2013, Held in Conjunction with ECML-PKDD 2013, Prague, Czech Republic, September 27, 2013, Revised Selected Papers

Editors: Annalisa Appice, Michelangelo Ceci, Corrado Loglisci, Giuseppe Manco, Elio Masciari, Zbigniew W. Ras

Publisher: Springer International Publishing

Book Series : Lecture Notes in Computer Science

Part of: Springer Professional "Wirtschaft+Technik" , Springer Professional "Technik" , Springer Professional "Wirtschaft"

About this book

This book constitutes the thoroughly refereed post-conference proceedings of the Second International Workshop on New Frontiers in Mining Complex Patterns, NFMCP 2013, held in conjunction with ECML/PKDD 2013 in Prague, Czech Republic, in September 2013. The 16 revised full papers were carefully reviewed and selected from numerous submissions. The papers are organized in topical sections on data streams and time series analysis, classification, clustering and pattern discovery, graphs, networks and relational data, machine learning and music data.

Frontmatter

Data Streams and Time Series Analysis

Frontmatter

Parameter Estimation and Pattern Validation in Flock Mining

Abstract

Due to the diffusion of location-aware devices and location-based services, it is now possible to analyse the digital trajectories of human mobility through the use of mining algorithms. However, in most cases, these algorithms come with little support for the analyst to actually use them in real world applications. In particular, means for understanding how to choose the proper parameters are missing. This work improves the state-of-the-art of mobility data analysis by providing an experimental study on the use of data-driven parameter estimation measures for mining flock patterns along with a validation procedure to measure the quality of these extracted patterns. Experiments were conducted on two real world datasets, one dealing with pedestrian movements in a recreational park and the other with car movements in a coastal area. The study has shown promising results for estimating suitable values for parameters for flock patterns as well as defining meaningful quantitative measures for assessing the quality of extracted flock patterns. It has also provided a sound basis to envisage a formal framework for parameter evaluation and pattern validation in the near future, since the advent of more complex pattern algorithms will require the use of a larger number of parameters.

Rebecca Ong, Mirco Nanni, Chiara Renso, Monica Wachowicz, Dino Pedreschi

Feature Extraction over Multiple Representations for Time Series Classification

Abstract

We suggest a simple yet effective and parameter-free feature construction process for time series classification. Our process is decomposed in three steps: (i) we transform original data into several simple representations; (ii) on each representation, we apply a coclustering method; (iii) we use coclustering results to build new features for time series. It results in a new transactional (i.e. object-attribute oriented) data set, made of time series identifiers described by features related to the various generated representations. We show that a Selective Naive Bayes classifier on this new data set is highly competitive when compared with state-of-the-art times series classification methods while highlighting interpretable and class relevant patterns.

Dominique Gay, Romain Guigourès, Marc Boullé, Fabrice Clérot

A Classification Based Scoring Function for Continuous Time Bayesian Network Classifiers

Abstract

Continuous time Bayesian network classifiers are designed for analyzing multivariate streaming data when time duration of events matters. New continuous time Bayesian network classifiers are introduced while their conditional log-likelihood scoring function is developed. A learning algorithm, combining conditional log-likelihood with Bayesian parameter estimation is developed. Classification accuracies achieved on synthetic data by continuous time and dynamic Bayesian network classifiers are compared. Results show that conditional log-likelihood scoring combined with Bayesian parameter estimation outperforms marginal log-likelihood scoring in terms of classification accuracy. Continuous time Bayesian network classifiers are applied to post-stroke rehabilitation.

Daniele Codecasa, Fabio Stella

Trajectory Data Pattern Mining

Abstract

In this paper, we study the problem of mining for frequent trajectories, which is crucial in many application scenarios, such as vehicle traffic management, hand-off in cellular networks, supply chain management. We approach this problem as that of mining for frequent sequential patterns. Our approach consists of a partitioning strategy for incoming streams of trajectories in order to reduce the trajectory size and represent trajectories as strings. We mine frequent trajectories using a sliding windows approach combined with a counting algorithm that allows us to promptly update the frequency of patterns. In order to make counting really efficient, we represent frequent trajectories by prime numbers, whereby the Chinese reminder theorem can then be used to expedite the computation.

Elio Masciari, Gao Shi, Carlo Zaniolo

Process Mining to Forecast the Future of Running Cases

Abstract

Processes are everywhere in our daily lives. More and more information about executions of processes are recorded in event logs by several information systems. Process mining techniques are used to analyze historic information hidden in event logs and to provide surprising insights for managers, system developers, auditors, and end users. While existing process mining techniques mainly analyze full process instances (cases), this paper extends the analysis to running cases, which have not yet completed. For running cases, process mining can be used to notify future events. This forecasting ability can provide insights for check conformance and support decision making. This paper details a process mining approach, which uses predictive clustering to equip an execution scenario with a prediction model. This model accounts for recent events of running cases to predict the characteristics of future events. Several tests with benchmark logs investigate the viability of the proposed approach.

Sonja Pravilovic, Annalisa Appice, Donato Malerba

Classification, Clustering and Pattern Discovery

Frontmatter

A Hybrid Distance-Based Method and Support Vector Machines for Emotional Speech Detection

Abstract

We describe a novel methodology that is applicable in the detection of emotions from speech signals. The methodology is useful if we can safely ignore sequence information since it constructs static feature vectors to represent a sequence of values; this is the case of the current application. In the initial feature extraction part, the speech signals are cut into 3 speech segments according to relative time interval process. The speech segments are processed and described using 988 acoustic features. Our proposed methodology consists of two steps. The first step constructs emotion models using principal component analysis and it computes distances of the observations to each emotion models. The distance values from the previous step are used to train a support vector machine classifier that can identify the affective content of a speech signal. We note that our method is not only applicable for speech signal, it can also be used to analyse other data of similar nature. The proposed method is tested using four emotional databases. Results showed competitive performance yielding an average accuracy of at least 80 % on three databases for the detection of basic types of emotion.

Vladimer Kobayashi

Methods for the Efficient Discovery of Large Item-Indexable Sequential Patterns

Abstract

An increasingly relevant set of tasks, such as the discovery of biclusters with order-preserving properties, can be mapped as a sequential pattern mining problem on data with item-indexable properties. An item-indexable database, typically observed in biomedical domains, does not allow item repetitions per sequence and is commonly dense. Although multiple methods have been proposed for the efficient discovery of sequential patterns, their performance rapidly degrades over item-indexable databases. The target tasks for these databases benefit from lengthy patterns and tolerate local mismatches. However, existing methods that consider noise relaxations to increase the average short length of sequential patterns scale poorly, aggravating the yet critical efficiency. In this work, we first propose a new sequential pattern mining method, IndexSpan, which is able to mine sequential patterns over item-indexable databases with heightened efficiency. Second, we propose a pattern-merging procedure, MergeIndexBic, to efficiently discover lengthy noise-tolerant sequential patterns. The superior performance of IndexSpan and MergeIndexBic against competitive alternatives is demonstrated on both synthetic and real datasets.

Rui Henriques, Cláudia Antunes, Sara C. Madeira

Mining Frequent Partite Episodes with Partwise Constraints

Abstract

In this paper, we study the problem of efficiently mining frequent partite episodes that satisfy partwise constraints from an input event sequence. Through our constraints, we can extract episodes related to events and their precedent-subsequent relations, on which we focus, in a short time. This improves the efficiency of data mining using trial and error processes. A partite episode of length \(k\) is of the form \(P = \langle P_1,\ldots ,P_k\rangle \) for sets \(P_i \; (1 \le i \le k)\) of events. We call \(P_i\) a part of \(P\) for every \(1 \le i \le k\). We introduce the partwise constraints for partite episodes \(P\), which consists of shape and pattern constraints. A shape constraint specifies the size of each part of \(P\) and the length of \(P\). A pattern constraint specifies subsets of each part of \(P\). We then present a backtracking algorithm that finds all of the frequent partite episodes satisfying a partwise constraint from an input event sequence. By theoretical analysis, we show that the algorithm runs in output polynomial time and polynomial space for the total input size. In the experiment, we show that our proposed algorithm is much faster than existing algorithms for mining partite episodes on an artificial and a real-world datasets.

Takashi Katoh, Shin-ichiro Tago, Tatsuya Asai, Hiroaki Morikawa, Junichi Shigezumi, Hiroya Inakoshi

Structure Determination and Estimation of Hierarchical Archimedean Copulas Based on Kendall Correlation Matrix

Abstract

An estimation method for the copula of a continuous multivariate distribution is proposed. A popular class of copulas, namely the class of hierarchical Archimedean copulas, is considered. The proposed method is based on the close relationship of the copula structure and the values of Kendall’s tau computed on all its bivariate margins. A generalized measure based on Kendall’s tau adapted for purposes of the estimation is introduced. A simple algorithm implementing the method is provided and its effectiveness is shown in several experiments including its comparison to other available methods. The results show that the proposed method can be regarded as a suitable alternative to existing methods in the terms of goodness of fit and computational efficiency.

Jan Górecki, Martin Holeňa

ReliefF for Hierarchical Multi-label Classification

Abstract

In machine learning, the data available for analysis is becoming more complex both in terms of high-dimensionality and the way it is structured. This emphasises the need for developing machine learning algorithms that are able to tackle both the high-dimensionality and the complex structure of the data. Our work in this paper, focuses on extending a feature ranking algorithm that can be used as a filter method for a specific type of structured data. More specifically, we adapt the RReliefF algorithm for regression, for the task of hierarchical multi-label classification (HMC). We evaluate this algorithm experimentally in a filter-like setting by employing ensembles of predictive clustering trees for HMC as a classifier. In the experimental evaluation, we consider datasets from two prominent domains for HMC - functional genomics and image annotation. The results show that HMC-ReliefF can identify the relevant features present in the data and produces a ranking where they are placed among the top ranked ones.

Ivica Slavkov, Jana Karcheska, Dragi Kocev, Slobodan Kalajdziski, Sašo Džeroski

The Use of the Label Hierarchy in Hierarchical Multi-label Classification Improves Performance

Abstract

We address the task of learning models for predicting structured outputs. We consider both global and local approaches to the prediction of structured outputs, the former based on a single model that predicts the entire output structure and the latter based on a collection of models, each predicting a component of the output structure. More specifically, we compare local and global approaches in terms of predictive performance, learning time and model complexity. Moreover, we discuss the interpretability of the obtained models. We evaluate the predictive performance of the considered approaches on six case studies from three domains: ecological modelling, text classification and image classification. Finally, we identify the properties of the tasks at hand that lead to the differences in performance.

Jurica Levatić, Dragi Kocev, Sašo Džeroski

Graphs, Networks and Relational Data

Frontmatter

Agwan: A Generative Model for Labelled, Weighted Graphs

Abstract

Real-world graphs or networks tend to exhibit a well-known set of properties, such as heavy-tailed degree distributions, clustering and community formation. Much effort has been directed into creating realistic and tractable models for unlabelled graphs, which has yielded insights into graph structure and evolution. Recently, attention has moved to creating models for labelled graphs: many real-world graphs are labelled with both discrete and numeric attributes. In this paper, we present Agwan (Attribute Graphs: Weighted and Numeric), a generative model for random graphs with discrete labels and weighted edges. The model is easily generalised to edges labelled with an arbitrary number of numeric attributes. We include algorithms for fitting the parameters of the Agwan model to real-world graphs and for generating random graphs from the model. Using real-world directed and undirected graphs as input, we compare our approach to state-of-the-art random labelled graph generators and draw conclusions about the contribution of discrete vertex labels and edge weights to graph structure.

Michael Davis, Weiru Liu, Paul Miller, Ruth F. Hunter, Frank Kee

Thresholding of Semantic Similarity Networks Using a Spectral Graph-Based Technique

Abstract

The functional similarity among terms of an ontology is evaluated by using Semantic Similarity Measures (SSM). In computational biology, biological entities such as genes or proteins are usually annotated with terms extracted from Gene Ontology (GO) and the most common application is to find the similarity or dissimilarity among two entities through the application of SSMs to their annotations. More recently, the extensive application of SSMs yielded to the Semantic Similarity Networks (SSNs). SSNs are edge-weighted graphs where the nodes are concepts (e.g. proteins) and each edge has an associated weight that represents the semantic similarity among related pairs of nodes. Community detection algorithms that analyse SSNs, such as protein complexes prediction or motif extraction, may reveal clusters of functionally associated proteins. Because SSNs have a high number of arcs with low weight, likened to noise, the application of classical clustering algorithms on raw networks exhibits low performance. To improve the performance of such algorithms, a possible approach is to simplify the structure of SSNs through a preprocessing step able to delete arcs likened to noise. Thus we propose a novel preprocessing strategy to simplify SSNs based on an hybrid global-local thresholding approach based on spectral graph theory. As proof of concept we demonstrate that community detection algorithms applied to filtered (thresholded) networks, have better performances in terms of biological relevance of the results, with respect to the use of raw unfiltered networks.

Pietro Hiram Guzzi, Pierangelo Veltri, Mario Cannataro

A Relational Unsupervised Approach to Author Identification

Abstract

In the last decades speaking and writing habits have changed. Many works faced the author identification task by exploiting frequency-based approaches, numeric techniques or writing style analysis. Following the last approach we propose a technique for author identification based on First-Order Logic. Specifically, we translate the complex data represented by natural language text to complex (relational) patterns that represent the writing style of an author. Then, we model an author as the result of clustering the relational descriptions associated to the sentences. The underlying idea is that such a model can express the typical way in which an author composes the sentences in his writings. So, if we can map such writing habits from the unknown-author model to the known-author model, we can conclude that the author is the same. Preliminary results are promising and the approach seems viable in real contexts since it does not need a training phase and performs well also with short texts.

Fabio Leuzzi, Stefano Ferilli, Fulvio Rotella

Machine Learning and Music Data

Frontmatter

From Personalized to Hierarchically Structured Classifiers for Retrieving Music by Mood

Abstract

With the increased amount of music that is available to the average user, either online or through their own collection, there is a need to develop new ways to organize and retrieve music. We propose a system by which we develop a set of personalized emotion classifiers, one for each emotion in a set of 16 and a set unique to each user. We train a set of emotion classifiers using feature data extracted from audio which has been tagged with a set of emotions by volunteers. We then develop SVM, kNN, Random Forest, and C4.5 tree based classifiers for each emotion and determine the best classification algorithm. We then compare our personalized emotion classifiers to a set of non-personalized classifiers. Finally, we present a method for efficiently developing personalized classifiers based on hierarchical clustering.

Amanda Cohen Mostafavi, Zbigniew W. Raś, Alicja A. Wieczorkowska

Mining Audio Data for Multiple Instrument Recognition in Classical Music

Abstract

This paper addresses the problem of identification of multiple musical instruments in polyphonic recordings of classical music. A set of binary random forests was used as a classifier, and each random forest was trained to recognize the target class of sounds. Training data were prepared in two versions, one based on single sounds and their mixes, and the other containing also sound frames taken from classical music recordings. The experiments on identification of multiple instrument sounds in recordings are presented, and their results are discussed in this paper.

Elżbieta Kubera, Alicja A. Wieczorkowska

Backmatter

Title: New Frontiers in Mining Complex Patterns
Editors: Annalisa Appice
Michelangelo Ceci
Corrado Loglisci
Giuseppe Manco
Elio Masciari
Zbigniew W. Ras
Publisher: Springer International Publishing
Electronic ISBN: 978-3-319-08407-7
Print ISBN: 978-3-319-08406-0
DOI: https://doi.org/10.1007/978-3-319-08407-7

Springer Professional

New Frontiers in Mining Complex Patterns

Second International Workshop, NFMCP 2013, Held in Conjunction with ECML-PKDD 2013, Prague, Czech Republic, September 27, 2013, Revised Selected Papers

About this book

Table of Contents

Frontmatter

Data Streams and Time Series Analysis

Frontmatter

Parameter Estimation and Pattern Validation in Flock Mining

Feature Extraction over Multiple Representations for Time Series Classification

A Classification Based Scoring Function for Continuous Time Bayesian Network Classifiers

Trajectory Data Pattern Mining

Process Mining to Forecast the Future of Running Cases

Classification, Clustering and Pattern Discovery

Frontmatter

A Hybrid Distance-Based Method and Support Vector Machines for Emotional Speech Detection

Methods for the Efficient Discovery of Large Item-Indexable Sequential Patterns

Mining Frequent Partite Episodes with Partwise Constraints

Structure Determination and Estimation of Hierarchical Archimedean Copulas Based on Kendall Correlation Matrix

ReliefF for Hierarchical Multi-label Classification

The Use of the Label Hierarchy in Hierarchical Multi-label Classification Improves Performance

Graphs, Networks and Relational Data

Frontmatter

Agwan: A Generative Model for Labelled, Weighted Graphs

Thresholding of Semantic Similarity Networks Using a Spectral Graph-Based Technique

A Relational Unsupervised Approach to Author Identification

Machine Learning and Music Data

Frontmatter

From Personalized to Hierarchically Structured Classifiers for Retrieving Music by Mood

Mining Audio Data for Multiple Instrument Recognition in Classical Music

Backmatter

Premium Partner