
About this book

This book constitutes the refereed proceedings of the 11th International Conference on Intelligent Data Analysis, IDA 2012, held in Helsinki, Finland, in October 2012. The 32 revised full papers presented together with 3 invited papers were carefully reviewed and selected from 88 submissions. All current aspects of intelligent data analysis are addressed, including intelligent support for modeling and analyzing data from complex, dynamical systems. The papers focus on novel applications of IDA techniques to, e.g., networked digital information systems; novel modes of data acquisition and the associated issues; robustness and scalability issues of intelligent data analysis techniques; and visualization and dissemination of results.



Invited Papers

Over-Fitting in Model Selection and Its Avoidance

Over-fitting is a ubiquitous problem in machine learning, and a variety of techniques to avoid over-fitting the training sample have proven highly effective, including early stopping, regularization, and ensemble methods. However, while over-fitting in training is widely appreciated and its avoidance now a standard element of best practice, over-fitting can also occur in model selection. This form of over-fitting can significantly degrade generalization performance, but has thus far received little attention. For example the kernel and regularization parameters of a support vector machine are often tuned by optimizing a cross-validation based model selection criterion. However the cross-validation estimate of generalization performance will inevitably have a finite variance, such that its minimizer depends on the particular sample on which it is evaluated, and this will generally differ from the minimizer of the true generalization error. Therefore if the cross-validation error is aggressively minimized, generalization performance may be substantially degraded. In general, the smaller the amount of data available, the higher the variance of the model selection criterion, and hence the more likely over-fitting in model selection will be a significant problem. Similarly, the more hyper-parameters to be tuned in model selection, the more easily the variance of the model selection criterion can be exploited, which again increases the likelihood of over-fitting in model selection.

Over-fitting in model selection is empirically demonstrated to pose a substantial pitfall in the application of kernel learning methods and Gaussian process classifiers. Furthermore, evaluation of machine learning methods can easily be significantly biased unless the evaluation protocol properly accounts for this type of over-fitting. Fortunately the common solutions to avoiding over-fitting in training also appear to be effective in avoiding over-fitting in model selection. Three examples are presented based on regularization of the model selection criterion, early stopping in model selection and minimizing the number of hyper-parameters to be tuned during model selection.

Gavin C. Cawley
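
The optimism that comes from minimizing a noisy model selection criterion can be illustrated with a small simulation. This is a minimal sketch and not taken from the paper: it assumes every candidate model has the same true error rate, so any apparent winner is picked purely by chance. The function name `selection_bias` and all parameter values are hypothetical.

```python
import random
import statistics


def selection_bias(n_candidates, n_samples, true_error=0.2, trials=300, seed=0):
    """Average gap between the true error and the lowest noisy estimate.

    Assumed toy setup: all candidates share the same true error, and each
    validation estimate is the error rate over n_samples Bernoulli draws.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        # One noisy validation estimate per candidate hyper-parameter setting.
        estimates = [
            sum(rng.random() < true_error for _ in range(n_samples)) / n_samples
            for _ in range(n_candidates)
        ]
        # Selecting the minimum exploits the noise, so the gap is positive.
        gaps.append(true_error - min(estimates))
    return statistics.mean(gaps)


if __name__ == "__main__":
    for candidates, samples in [(2, 50), (20, 50), (20, 500)]:
        print(candidates, samples, round(selection_bias(candidates, samples), 3))
```

In this toy setting the gap grows with the number of candidates and shrinks with the validation sample size, which mirrors the two trends described in the abstract.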

Intelligent Data Analysis of Human Genetic Data

The last two decades have witnessed impressive developments in the technology of genotyping and sequencing. Thousands of human DNA samples have been genotyped at increasing densities or sequenced in full using next generation DNA sequencing technology. The challenge is now to equip computational scientists with the right tools to go beyond mining genetic data to discover small gold nuggets and build models that can decipher the mechanism linking genotypes to phenotypes and can be used to identify subjects at risk for disease. We will discuss different approaches to model genetic data, and emphasize the need of blending a deep understanding of study design, with statistical modeling techniques and intelligent data approaches to make analysis feasible and results interpretable and useful.

Paola Sebastiani

Queries for Data Analysis

If we view data as a set of queries with an answer, what would a model be? In this paper we explore this question. The motivation is that there are more and more kinds of data that have to be analysed: data of such a diverse nature that it is not easy to define precisely what data analysis actually is. Since all these different types of data share one characteristic – they can be queried – it seems natural to base a notion of data analysis on this characteristic.

The discussion in this paper is preliminary at best. There is no attempt made to connect the basic ideas to other – well known – foundations of data analysis. Rather, it just explores some simple consequences of its central tenet: data is a set of queries with their answer.

Arno Siebes

Selected Contributions

Parallel Data Mining Revisited. Better, Not Faster

In this paper we argue that parallel and/or distributed compute resources can be used differently: instead of focusing on speeding up algorithms, we propose to focus on improving accuracy. In a nutshell, the goal is to tune data mining algorithms to produce better results in the same time rather than producing similar results a lot faster. We discuss a number of generic ways of tuning data mining algorithms and elaborate on two prominent examples in more detail. A series of exemplary experiments is used to illustrate the effect such use of parallel resources can have.

Zaenal Akbar, Violeta N. Ivanova, Michael R. Berthold

Weighting Features for Partition around Medoids Using the Minkowski Metric

In this paper we introduce the Minkowski weighted partition around medoids algorithm (MW-PAM). This extends the popular partition around medoids algorithm (PAM) by automatically assigning K weights to each feature in a dataset, where K is the number of clusters. Our approach utilizes the within-cluster variance of features to calculate the weights and uses the Minkowski metric.

We show through many experiments that MW-PAM, particularly when initialized with the Build algorithm (also using the Minkowski metric), is superior to other medoid-based algorithms in terms of both accuracy and identification of irrelevant features.

Renato Cordeiro de Amorim, Trevor Fenner
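
The abstract does not give formulas, so the following is only a sketch of how cluster-specific feature weights and a weighted Minkowski distance might fit together. The weight formula is an assumption borrowed from the Minkowski weighted K-Means literature (a feature's weight falls as its within-cluster dispersion rises); the function names and the small `eps` constant are hypothetical, and the sketch requires an exponent `p` greater than 1.

```python
def feature_weights(cluster_points, medoid, p, eps=1e-9):
    """Per-feature weights for one cluster (assumed formula, requires p > 1)."""
    n_features = len(medoid)
    # Dispersion of each feature: sum of |x_v - m_v|^p over the cluster.
    # eps is a hypothetical guard against division by zero.
    dispersion = [
        sum(abs(x[v] - medoid[v]) ** p for x in cluster_points) + eps
        for v in range(n_features)
    ]
    exponent = 1.0 / (p - 1.0)
    # Features with low dispersion relative to the others get high weight.
    return [
        1.0 / sum((dispersion[v] / dispersion[u]) ** exponent
                  for u in range(n_features))
        for v in range(n_features)
    ]


def weighted_minkowski(x, medoid, weights, p):
    """Weighted Minkowski distance raised to the power p."""
    return sum((w ** p) * abs(a - b) ** p for a, b, w in zip(x, medoid, weights))


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 10.0), (2.0, -10.0)]
    medoid = (1.0, 0.0)
    w = feature_weights(points, medoid, p=2)
    print(w, weighted_minkowski((0.0, 0.0), medoid, w, p=2))
```

With this formula the weights of a cluster sum to one, and the noisy second feature in the example receives a much smaller weight than the first.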

On Initializations for the Minkowski Weighted K-Means

Minkowski Weighted K-Means is a variant of K-Means set in the Minkowski space, automatically computing weights for features at each cluster. As a variant of K-Means, its accuracy heavily depends on the initial centroids fed to it. In this paper we discuss our experiments comparing six initializations, random and five other initializations in the Minkowski space, in terms of their accuracy, processing time, and the recovery of the Minkowski exponent p.

We have found that the Ward method in the Minkowski space tends to outperform other initializations, with the exception of low-dimensional Gaussian Models with noise features. In these, a modified version of intelligent K-Means excels.

Renato Cordeiro de Amorim, Peter Komisarczuk

A Skew-t-Normal Multi-level Reduced-Rank Functional PCA Model for the Analysis of Replicated Genomics Time Course Data

Modelling replicated genomics time series data sets is challenging for two key reasons. Firstly, they exhibit two distinct levels of variation — the between-transcript and, nested within that, the between-replicate. Secondly, the typical assumption of normality rarely holds. Standard practice in light of these issues is to simply treat each transcript independently which greatly simplifies the modelling approach, reduces the computational burden and nevertheless appears to yield good results. We have set out to improve upon this, and in this article we present a multi-level reduced-rank functional PCA model that more accurately reflects the biological reality of these replicated genomics data sets, retains a degree of computational efficiency and enables us to carry out dimensionality reduction.

Maurice Berk, Giovanni Montana

Methodological Foundation for Sign Language 3D Motion Trajectory Analysis

Current research in computer recognition of sign language aims to recognize signs from video content. Most existing studies of sign language recognition from video-based scenes use classical learning approaches because of their acceptable results. HMMs, neural networks, matching techniques and fuzzy classifiers are widely used in video recognition with large training data. To date, there has been considerable progress in the field of animation generation. These tools help improve access to information and services for deaf individuals with a low literacy level. They rely mainly on the 3D content standard X3D for their sign language animation. Sign animations are therefore becoming common. However, in this new field, few works try to apply classical learning techniques to sign language recognition from 3D-based content. Most studies rely on the positions or rotations of virtual agent articulations as training data for classifiers or matching techniques. Unfortunately, existing animation generation software uses different 3D virtual agent content, so articulation positions or rotations differ from one system to another. Consequently, this recognition method is not efficient.

In this paper, we propose a methodological foundation for future research on recognizing signs from any sign language 3D content. Our new approach aims to provide a method that is invariant to changes in sign position, based on 3D motion trajectory analysis. Our recognition experiments were based on 900 ASL signs, using a Microsoft Kinect sensor to manipulate our X3D virtual agent. We successfully recognized 887 isolated signs, a 98.5% recognition rate, with a recognition response time of 0.3 seconds.

Mehrez Boulares, Mohamed Jemni

Assembly Detection in Continuous Neural Spike Train Data

Since Hebb’s work on the organization of the brain [16], finding cell assemblies in neural spike trains has become a vibrant field of research. As modern multi-electrode techniques make it possible to record the electrical potentials of many neurons in parallel, there is an increasing need for efficient and reliable algorithms to identify assemblies as expressed by synchronous spiking activity. We present a method that is able to cope with two core challenges of this complex task: temporal imprecision (spikes are not perfectly aligned across the spike trains) and selective participation (neurons in an ensemble do not all contribute a spike to all synchronous spiking events). Our approach is based on modeling spikes by influence regions of a user-specified width around the exact spike times and a clustering-like grouping of similar spike trains.

Christian Braune, Christian Borgelt, Sonja Grün
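
The idea of replacing exact spike times by influence regions can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: each spike at time t is assumed to cover the interval [t - width/2, t + width/2], and the similarity of two spike trains is assumed to be the overlap of their regions divided by the length of their union. All function names are hypothetical.

```python
def influence_regions(spike_times, width):
    """Merge the intervals [t - width/2, t + width/2] into disjoint regions."""
    regions = []
    for t in sorted(spike_times):
        start, end = t - width / 2, t + width / 2
        if regions and start <= regions[-1][1]:
            # Overlaps the previous region: extend it.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return regions


def total_length(regions):
    return sum(end - start for start, end in regions)


def overlap_length(a, b):
    """Total length of the intersection of two sorted, disjoint region lists."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever region ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def similarity(train_a, train_b, width):
    """Assumed measure: overlap of influence regions divided by their union."""
    a = influence_regions(train_a, width)
    b = influence_regions(train_b, width)
    inter = overlap_length(a, b)
    union = total_length(a) + total_length(b) - inter
    return inter / union if union > 0 else 0.0
```

Two spikes that are slightly misaligned still overlap partially, so temporal imprecision lowers the similarity gradually instead of removing the coincidence altogether. Such a pairwise similarity could then feed a standard clustering step.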

Online Techniques for Dealing with Concept Drift in Process Mining

Concept drift is an important concern for any data analysis scenario involving temporally ordered data. In the last decade, process mining arose as a discipline that uses the logs of information systems in order to mine, analyze and enhance the process dimension. There is very little work dealing with concept drift in process mining. In this paper we present the first online mechanism for detecting and managing concept drift, which is based on abstract interpretation and sequential sampling, together with recent learning techniques on data streams.

Josep Carmona, Ricard Gavaldà

An Evolutionary Based Clustering Algorithm Applied to Data Compression for Industrial Systems

In this paper, in order to address the well-known ‘sensitivity’ problems associated with K-means clustering, a real-coded Genetic Algorithm (GA) is incorporated into K-means clustering. The result of the hybridisation is an enhanced search algorithm obtained by incorporating the local search capability rendered by the hill-climbing optimisation with the global search ability provided by GAs. The proposed algorithm has been compared with other clustering algorithms under the same category using an artificial data set and a benchmark problem. Results show, in all cases, that the proposed algorithm outperforms its counterparts in terms of global search capability. Moreover, the scalability of the proposed algorithm to high-dimensional problems featuring a large number of data points has been validated using an application to compress field data sets from sub-15MW industrial gas turbines during commissioning. Such compressed field data is expected to result in more efficient and more accurate sensor fault detection.

Jun Chen, Mahdi Mahfouf, Chris Bingham, Yu Zhang, Zhijing Yang, Michael Gallimore

Multi-label LeGo — Enhancing Multi-label Classifiers with Local Patterns

The straightforward approach to multi-label classification is based on decomposition, which essentially treats all labels independently and ignores interactions between labels. We propose to enhance multi-label classifiers with features constructed from local patterns representing explicitly such interdependencies. An Exceptional Model Mining instance is employed to find local patterns representing parts of the data where the conditional dependence relations between the labels are exceptional. We construct binary features from these patterns that can be interpreted as partial solutions to local complexities in the data. These features are then used as input for multi-label classifiers. We experimentally show that using such constructed features can improve the classification performance of decompositive multi-label learning techniques.

Wouter Duivesteijn, Eneldo Loza Mencía, Johannes Fürnkranz, Arno Knobbe

Discriminative Dimensionality Reduction Mappings

Discriminative dimensionality reduction aims at a low dimensional, usually nonlinear representation of given data such that information as specified by auxiliary discriminative labeling is presented as accurately as possible. This paper centers around two open problems connected to this question: (i) how to evaluate discriminative dimensionality reduction quantitatively? (ii) how to arrive at explicit nonlinear discriminative dimensionality reduction mappings? Based on recent work for the unsupervised case, we propose an evaluation measure and an explicit discriminative dimensionality reduction mapping using the Fisher information.

Andrej Gisbrecht, Daniela Hofmann, Barbara Hammer

Finding Interesting Contexts for Explaining Deviations in Bus Trip Duration Using Distribution Rules

In this paper we study the deviation of bus trip duration and its causes. Deviations are obtained by comparing scheduled times against actual trip duration and are either delays or early arrivals. We use distribution rules, a kind of association rules that may have continuous distributions on the consequent. Distribution rules allow the systematic identification of particular conditions, which we call contexts, under which the distribution of trip time deviations differs significantly from the overall deviation distribution. After identifying specific causes of delay the bus company operational managers can make adjustments to the timetables increasing punctuality without disrupting the service.

Alípio M. Jorge, João Mendes-Moreira, Jorge Freire de Sousa, Carlos Soares, Paulo J. Azevedo
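
One way to check whether the deviation distribution under a context differs from the overall distribution is a two-sample Kolmogorov-Smirnov statistic. The sketch below is an assumption for illustration; the abstract does not say which test the authors use, and the sample data and names are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


if __name__ == "__main__":
    # Hypothetical trip time deviations in minutes (positive means delay).
    overall = [-2, -1, 0, 0, 1, 1, 2, 3, 4, 5]
    rainy_rush_hour = [3, 4, 5, 6, 8]
    print(ks_statistic(overall, rainy_rush_hour))
```

A context whose statistic is large relative to a chosen significance threshold would be a candidate for the kind of interesting context described above.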

Curve Fitting for Short Time Series Data from High Throughput Experiments with Correction for Biological Variation

Modern high-throughput technologies like microarray, mass spectrometry or next generation sequencing enable biologists to measure cell products like metabolites, peptides, proteins or mRNA. With the advance of the technologies there are more and more experiments that do not only compare the cell products under two or more specific conditions, but also track them over time. These experiments usually yield short time series for a large number of cell products, but with only a few replicates. The noise in the measurements, but also the often strong biological variation of the replicates makes a coherent analysis of such data difficult. In this paper, we focus on methods to correct measurement errors or deviations caused by biological variation in terms of a time shift, different reaction speed and different reaction intensity for replicates. We propose a regression model that can estimate corresponding parameters that can be used to correct the data and to obtain better results in the further analysis.

Frank Klawonn, Nada Abidi, Evelin Berger, Lothar Jänsch

Formalizing Complex Prior Information to Quantify Subjective Interestingness of Frequent Pattern Sets

In this paper, we are concerned with the problem of modelling prior information of a data miner about the data, with the purpose of quantifying subjective interestingness of patterns. Recent results have achieved this for the specific case of prior expectations on the row and column marginals, based on the Maximum Entropy principle [2,9]. In the current paper, we extend these ideas to make them applicable to more general prior information, such as knowledge of frequencies of itemsets, a cluster structure in the data, or the presence of dense areas in the database. As in [2,9], we show how information theory can be used to quantify subjective interestingness against this model, in particular the subjective interestingness of tile patterns [3]. Our method presents an efficient, flexible, and rigorous alternative to the randomization approach presented in [5]. We demonstrate our method by searching for interesting patterns in real-life data with respect to various realistic types of prior information.

Kleanthis-Nikolaos Kontonasios, Tijl DeBie

MCut: A Thresholding Strategy for Multi-label Classification

Multi-label classification is a frequent task in machine learning, notably in text categorization. When binary classifiers are not suited, an alternative is to use a multiclass classifier that provides a score per category for each document, and then to apply a thresholding strategy to select the set of categories that must be assigned to the document. The common thresholding strategies, such as the RCut, PCut and SCut methods, need a training step to determine the value of the threshold. To overcome this limitation, we propose a new strategy, called MCut, which automatically estimates a value for the threshold. This method does not have to be trained and does not need any parametrization. Experiments performed on two textual corpora, the XML Mining 2009 and RCV1 collections, show that the results of the MCut strategy are on par with the state of the art, while MCut is easy to implement and parameter free.

Christine Largeron, Christophe Moulin, Mathias Géry
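
The abstract does not spell out how MCut estimates the threshold. A plausible reading, assumed here, is that the scores of a document are sorted in decreasing order and the threshold is placed in the middle of the largest gap between two successive scores. The function name `mcut` and the example scores are hypothetical.

```python
def mcut(scores):
    """Pick a per-document threshold at the largest gap between sorted scores.

    scores maps each category to the classifier score for one document.
    Returns the threshold and the categories scoring above it.
    """
    if len(scores) < 2:
        raise ValueError("need at least two category scores")
    ranked = sorted(scores.values(), reverse=True)
    # Differences between successive scores in decreasing order.
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    threshold = (ranked[cut] + ranked[cut + 1]) / 2
    selected = sorted(c for c, s in scores.items() if s > threshold)
    return threshold, selected


if __name__ == "__main__":
    doc_scores = {"sports": 0.9, "politics": 0.85, "art": 0.3, "science": 0.1}
    print(mcut(doc_scores))
```

For the example scores the largest gap lies between 0.85 and 0.3, so the threshold is 0.575 and the document receives the labels "politics" and "sports". No training data or tuning parameter is involved, which matches the property the abstract emphasizes.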

Improving Tag Recommendation Using Few Associations

Collaborative tagging services allow users to freely assign tags to resources. As the large majority of users enters only very few tags, good tag recommendation can vastly improve the usability of tags for techniques such as searching, indexing, and clustering. Previous research has shown that accurate recommendation can be achieved by using conditional probabilities computed from tag associations. The main problem, however, is that enormous amounts of associations are needed for optimal recommendation.

We argue and demonstrate that pattern selection techniques can improve tag recommendation by giving a very favourable balance between accuracy and computational demand. That is, few associations are chosen to act as information source for recommendation, providing high-quality recommendation and good scalability at the same time.

We provide a proof-of-concept using an off-the-shelf pattern selection method based on the Minimum Description Length principle. Experiments on data from Delicious, LastFM and YouTube show that our proposed methodology works well: applying pattern selection gives a very favourable trade-off between runtime and recommendation quality.

Matthijs van Leeuwen, Diyah Puspitaningrum
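
Recommendation from conditional probabilities over tag associations can be sketched with pairwise co-occurrence counts. This is a simplified illustration, not the authors' system: it uses only pairs of tags, sums the estimated probabilities over the tags a user has already entered, and leaves out the pattern selection step entirely. The function names and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_counts(tagged_resources):
    """Count single tags and ordered tag pairs over all resources."""
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in tagged_resources:
        unique = set(tags)
        tag_counts.update(unique)
        for a, b in combinations(sorted(unique), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    return tag_counts, pair_counts


def recommend(given_tags, tag_counts, pair_counts, top_n=3):
    """Rank candidate tags by the summed estimates of P(candidate | given tag)."""
    scores = Counter()
    for (a, b), n_ab in pair_counts.items():
        if a in given_tags and b not in given_tags:
            scores[b] += n_ab / tag_counts[a]
    return [tag for tag, _ in scores.most_common(top_n)]


if __name__ == "__main__":
    resources = [
        ["python", "code", "tutorial"],
        ["python", "code"],
        ["python", "snake"],
        ["music", "rock"],
    ]
    tag_counts, pair_counts = build_counts(resources)
    print(recommend({"python"}, tag_counts, pair_counts, top_n=1))
```

The number of stored pairs grows quickly with the tag vocabulary, which is the scalability problem that selecting a small set of associations is meant to address.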

Identifying Anomalous Social Contexts from Mobile Proximity Data Using Binomial Mixture Models

Mobile proximity information provides a rich and detailed view into the social interactions of mobile phone users, allowing novel empirical studies of human behavior and context-aware applications. In this study, we apply a statistical anomaly detection method based on multivariate binomial mixture models to mobile proximity data from 106 users. The method detects days when a person’s social context is unexpected, and it provides a clustering of days based on the contexts. We present a detailed analysis regarding one user, identifying days with anomalous contexts, and potential reasons for the anomalies. We also study the overall anomalousness of people’s social contexts. This analysis reveals a clear weekly oscillation in the predictability of the contexts and a weekend-like behavior on public holidays.

Eric Malmi, Juha Raitio, Oskar Kohonen, Krista Lagus, Timo Honkela

Constrained Clustering Using SAT

Constrained clustering - finding clusters that satisfy user-specified constraints - aims at providing more relevant clusters by adding constraints enforcing required properties. Leveraging the recent progress in declarative and constraint-based pattern mining, we propose an effective constraint-clustering approach handling a large set of constraints which are described by a generic constraint-based language. Starting from an initial solution, queries can easily be refined in order to focus on more interesting clustering solutions. We show how each constraint (and query) is encoded in SAT and solved by taking advantage of several features of SAT solvers. Experiments performed using a SAT solver on several datasets from the UCI repository show the feasibility and the advantages of our approach.

Jean-Philippe Métivier, Patrice Boizumault, Bruno Crémilleux, Mehdi Khiari, Samir Loudni

Two-Stage Approach for Electricity Consumption Forecasting in Public Buildings

Many preprocessing and prediction techniques have been used for large-scale electricity load forecasting. However, small-scale prediction, such as in the case of public buildings, has received little attention. This field presents certain specific features. The most distinctive one is that consumption is extremely influenced by the activity in the building. For that reason, a suitable approach to predict the next 24-hour consumption profiles is presented in this paper. First, the features that influence the consumption are processed and selected. These environmental variables are used to cluster the consumption profiles in subsets of similar behavior using neural gas. A direct forecasting approach based on Support Vector Regression (SVR) is applied to each cluster to enhance the prediction. The input vector is selected from a set of past values. The approach is validated on teaching and research buildings at the University of León.

Antonio Morán, Miguel A. Prada, Serafín Alonso, Pablo Barrientos, Juan J. Fuertes, Manuel Domínguez, Ignacio Díaz

Online Predictive Model for Taxi Services

In recent years, both companies and researchers have been exploring intelligent data analysis to increase the profitability of the taxi industry. Intelligent systems for online taxi dispatching and time-saving route finding have been built to do so. In this paper, we propose a novel methodology to produce online predictions regarding the spatial distribution of passenger demand throughout taxi stand networks. We have done so by combining two well-known short-term time series forecast models: time-varying Poisson models and ARIMA models. Our tests were performed using data gathered over a period of 6 months and collected from 63 taxi stands within the city of Porto, Portugal. Our results demonstrate that this model is a major contribution to driver mobility intelligence: 78% of the 253,745 demanded taxi services were correctly forecast within a 30-minute horizon.

Luís Moreira-Matias, João Gama, Michel Ferreira, João Mendes-Moreira, Luís Damas

Analysis of Noisy Biosignals for Musical Performance

Biosignal sensors are now small, affordable, and wireless. We desire to include such sensors (e.g. heart rate, respiration, acceleration) in a live musical performance, which sets requirements on the reliability and variability of the data. Unfortunately the raw signals from such devices are unable to meet these requirements. We contribute our solutions for overcoming the shortcomings of these sensors in two parts. The first is an online data processing and analysis system, including on-line generative models that describe the signals but add consistency. The second is the end-to-end system for capturing wireless signal data for the analysis system and integrating the resulting output into a popular digital audio workstation in a very flexible manner conducive to live performance. We also explore the role of “analysis supervisor”—a member of the performing act who ensures that the results of biosignal analysis fall within the desired ranges to contribute to the music effectively.

Joonas Paalasmaa, David J. Murphy, Ove Holmqvist

Information Theoretic Approach to Improve Performance of Networked Control Systems

Networked control systems (NCS) could be utilised in several industrial applications. However, the variable time delays introduced by the network impair NCS performance and can even make the controlled process unstable. Model-based adaptive controllers can be used to mitigate the delay problems. This calls for an efficient approach to the on-line analysis of the measurements used to update the controller state in NCS. The paper introduces a new adaptive Model Predictive Controller (MPC) capable of compensating for variations in measurement and actuating delays. Weighting factors for delayed measurements and actuators are adjusted based on a normalised version of mutual information that is calculated using a procedure described in the paper. The method outperforms other, more common metrics.

Marko Paavola, Mika Ruusunen, Aki Sorsa, Kauko Leiviskä

Learning Pattern Graphs for Multivariate Temporal Pattern Retrieval

We propose a two-phased approach to learn pattern graphs, a powerful pattern language for complex, multivariate temporal data, which is capable of reflecting more aspects of temporal patterns than earlier proposals. The first phase aims at increasing the understandability of the graph by finding common substructures, thereby helping the second phase to specialize the graph learned so far to discriminate against undesired situations. The usefulness is shown on data from the automobile industry and the libras data set by taking the accuracy and the knowledge gain of the learned graphs into account.

Sebastian Peter, Frank Höppner, Michael R. Berthold

GeT_Move: An Efficient and Unifying Spatio-temporal Pattern Mining Algorithm for Moving Objects

Recent improvements in positioning technology have led to massive amounts of moving object data. A crucial task is to find the moving objects that travel together. Usually, they are called spatio-temporal patterns. Due to the emergence of many different kinds of spatio-temporal patterns in recent years, different approaches have been proposed to extract them. However, each approach only focuses on mining a specific kind of pattern. Mining and managing patterns with such a large number of algorithms is both painstaking and time consuming. Additionally, we have to execute these algorithms again whenever new data are added to the existing database. To address these issues, we first redefine spatio-temporal patterns in the itemset context. Secondly, we propose a unifying approach, named GeT_Move, using a frequent closed itemset-based spatio-temporal pattern-mining algorithm to mine and manage different spatio-temporal patterns. GeT_Move is implemented in two versions, GeT_Move and Incremental GeT_Move. Experiments are performed on real and synthetic datasets and the results show that our approaches are very effective and outperform existing algorithms in terms of efficiency.

Phan Nhat Hai, Pascal Poncelet, Maguelonne Teisseire

Fuzzy Frequent Pattern Mining in Spike Trains

We present a framework for characterizing spike (and spike-train) synchrony in parallel neuronal spike trains that is based on identifying spikes with what we call influence maps: real-valued functions describing an influence region around the corresponding spike times within which possibly graded synchrony with other spikes is defined. We formalize two models of synchrony in this framework: the bin-based model (the almost exclusively applied model in the literature) and a novel, alternative model based on a continuous, graded notion of synchrony, aimed at overcoming the drawbacks of the bin-based model. We study the task of identifying frequent (and synchronous) neuronal patterns from parallel spike trains in our framework, formalized as an instance of what we call the fuzzy frequent pattern mining problem (a generalization of standard frequent pattern mining) and briefly evaluate our synchrony models on this task.

David Picado Muiño, Iván Castro León, Christian Borgelt

Mass Scale Modeling and Simulation of the Air-Interface Load in 3G Radio Access Networks

This paper outlines the approach developed together with the Radio Network Strategy & Design Department of a large telecom operator in order to forecast the Air-Interface load in their 3G network, which is used for planning network upgrades and budgeting purposes. It is based on large scale intelligent data analysis and modeling at the level of thousands of individual radio cells resulting in 100,000 models. It has been embedded into a scenario simulation framework that is used by end users not experienced in data mining for studying and simulating the behavior of this complex networked system.

Dejan Radosavljevik, Peter van der Putten, Kim Kyllesbech Larsen

Batch-Incremental versus Instance-Incremental Learning in Dynamic and Evolving Data

Many real world problems involve the challenging context of data streams, where classifiers must be incremental: able to learn from a theoretically-infinite stream of examples using limited time and memory, while being able to predict at any point. Two approaches dominate the literature: batch-incremental methods that gather examples in batches to train models; and instance-incremental methods that learn from each example as it arrives. Typically, papers in the literature choose one of these approaches, but provide insufficient evidence or references to justify their choice. We provide a first in-depth analysis comparing both approaches, including how they adapt to concept drift, and an extensive empirical study to compare several different versions of each approach. Our results reveal the respective advantages and disadvantages of the methods, which we discuss in detail.

Jesse Read, Albert Bifet, Bernhard Pfahringer, Geoff Holmes
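
The difference between the two update schemes can be shown with a pair of toy majority-class learners. This is a sketch for illustration only: the learners, the stream and the evaluation loop are hypothetical and are not among the methods compared in the paper.

```python
from collections import Counter


class InstanceIncrementalMajority:
    """Updates its class counts after every single example."""

    def __init__(self):
        self.counts = Counter()

    def predict(self):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, label):
        self.counts[label] += 1


class BatchIncrementalMajority:
    """Buffers examples and rebuilds its model once per full batch."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []
        self.model = None

    def predict(self):
        return self.model

    def learn(self, label):
        self.buffer.append(label)
        if len(self.buffer) == self.batch_size:
            # Train a fresh model on the batch only, then discard the batch.
            self.model = Counter(self.buffer).most_common(1)[0][0]
            self.buffer = []


def prequential_accuracy(learner, stream):
    """Test-then-train evaluation: predict each example before learning it."""
    correct = 0
    for label in stream:
        if learner.predict() == label:
            correct += 1
        learner.learn(label)
    return correct / len(stream)


if __name__ == "__main__":
    stream = ["a"] * 50 + ["b"] * 50  # abrupt concept drift halfway through
    print(prequential_accuracy(InstanceIncrementalMajority(), stream))
    print(prequential_accuracy(BatchIncrementalMajority(10), stream))
```

On this stream the instance-incremental learner scores 0.49 and the batch-incremental learner 0.8, because rebuilding from the latest batch forgets the old concept. This says nothing about the paper's findings; a real instance-incremental method would typically include its own forgetting or drift detection mechanism.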

Applying Piecewise Approximation in Perceptron Training of Conditional Random Fields

We show that the recently proposed piecewise approximation approach can benefit conditional random field estimation using the structured perceptron algorithm. We present experiments on the noun-phrase chunking task on the CoNLL-2000 corpus. The results show that, compared to standard training, applying the piecewise approach during model estimation may yield not only savings in training time but also improved model performance on the test set due to added model regularization.

Teemu Ruokolainen

Patch-Based Data Analysis Using Linear-Projection Diffusion

To process massive high-dimensional datasets, we utilize the underlying assumption that data on a manifold is approximately linear in sufficiently small patches (or neighborhoods of points) that are sampled with sufficient density from the manifold. Under this assumption, each patch can be represented by a tangent space of the manifold in its area and the tangential point of this tangent space. We use these tangent spaces, and the relations between them, to extend the scalar relations that are used by many kernel methods to matrix relations, which can encompass multidimensional similarities between local neighborhoods of points on the manifold. The properties of the presented construction are explored and its spectral decomposition is utilized to embed the patches of the manifold into a tensor space in which the relations between them are revealed. We present two applications that utilize the patch-to-tensor embedding framework: data classification and data clustering for image segmentation.

Moshe Salhov, Guy Wolf, Amir Averbuch, Pekka Neittaanmäki
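
The idea of representing a patch by a tangent space can be sketched with local PCA, a common way to estimate tangent spaces from sampled points. This is an assumption for illustration, not necessarily the estimator used in the paper; the function name `local_tangent_basis` and its parameters are hypothetical.

```python
import numpy as np


def local_tangent_basis(points, center_index, n_neighbors, dim):
    """Estimate an orthonormal tangent basis at one point via local PCA.

    points:       array-like of shape (n_points, ambient_dim)
    center_index: index of the point whose patch is analysed
    n_neighbors:  patch size (the center point counts as its own neighbor)
    dim:          assumed intrinsic dimension of the manifold
    """
    pts = np.asarray(points, dtype=float)
    center = pts[center_index]
    # Patch = the n_neighbors points closest to the center (Euclidean distance).
    dists = np.linalg.norm(pts - center, axis=1)
    patch = pts[np.argsort(dists)[:n_neighbors]]
    # Principal directions of the centered patch approximate the tangent space.
    centered = patch - patch.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:dim].T  # columns are basis vectors, shape (ambient_dim, dim)
```

For points sampled from a straight line in the plane, the single returned basis vector is parallel to the line (up to sign). Relations between two patches could then be built from the bases of both patches, which is where the matrix-valued similarities described above would enter.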

Dictionary Construction for Patch-to-Tensor Embedding

Incorporating matrix relations, which can encompass multidimensional similarities between local neighborhoods of points on the manifold, can improve kernel-based data analysis. However, using multidimensional similarities results in a larger kernel, and hence the computational cost of the corresponding spectral decomposition increases dramatically. In this paper, we propose a dictionary construction to approximate the kernel in this case and its respective embedding. The proposed dictionary construction is demonstrated on a relevant example of a super kernel that is based on the diffusion maps kernel together with linear-projection operators between tangent spaces of the manifold.

Moshe Salhov, Guy Wolf, Amit Bermanis, Amir Averbuch, Pekka Neittaanmäki

Where Are We Going? Predicting the Evolution of Individuals

When searching for patterns in data streams, we come across perennial (dynamic) objects that evolve over time. These objects are encountered repeatedly, each time with a different definition and values. Examples are (a) companies registered at a stock exchange and reporting their progress at the end of each year, and (b) students whose performance is evaluated at the end of each semester. On such data, domain experts also pose questions on how the individual objects will evolve: would it be beneficial to invest in a given company, given the company’s individual performance thus far and the drift experienced in the model? Or, how will a given student perform next year, given the performance variations observed thus far? While there is much research on how models evolve/change over time [Ntoutsi et al., 2011a], little is done to predict the change of individual objects when the states are not known a priori. In this work, we propose a framework that learns the clusters to which the objects belong at each moment, uses them as ad hoc states in a state-transition graph, and then learns a mixture model of Markov Chains, which predicts the next most likely state/cluster per object. We report on our evaluation on synthetic and real datasets.

Zaigham Faraz Siddiqui, Márcia Oliveira, João Gama, Myra Spiliopoulou
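
The prediction step can be sketched in a simplified form. The example below fits a single first-order Markov chain over cluster labels rather than the mixture of Markov Chains described in the abstract, and it assumes the cluster label of each object at each time point is already known. The function names `fit_transitions` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def fit_transitions(sequences):
    """Count observed cluster-to-cluster transitions.

    sequences: one list of cluster labels per object, ordered in time.
    Returns a mapping state -> Counter of next states.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, state):
    """Return the most frequently observed successor of `state`, or None."""
    if state not in counts:
        return None  # state never seen with a successor
    return counts[state].most_common(1)[0][0]


# Three objects observed over three time points (labels are cluster ids).
histories = [[0, 1, 2], [0, 1, 1], [2, 1, 2]]
model = fit_transitions(histories)
```

With these histories, `predict_next(model, 1)` returns `2`, because cluster 1 was followed by cluster 2 twice and by itself once. A mixture model would instead maintain several such chains and weight them per object.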

Use of General Purpose GPU Programming to Enhance the Classification of Leukaemia Blast Cells in Blood Smear Images

Leukaemia is a life-threatening form of cancer that causes an uncontrollable increase in the production of malformed white blood cells, termed blasts, inhibiting the body’s ability to fight infection. Given the variety of leukaemia types and the disease’s nature, prompt diagnosis is essential for the choice of appropriate, timely patient treatment. Currently, however, the diagnostic process is time-consuming and laborious. To address this issue, we propose a methodology, based on an existing system, for automated blast detection and diagnosis from a set of blood smear images, utilising a mixture of image processing, cellular automata, heuristic search and classification techniques. Our system builds upon this work by employing General Purpose Graphical Processing Unit programming to shorten execution times. Additionally, we utilise an enhanced ellipse-fitting algorithm for blast cell detection, yielding more information from captured cells. We show that the methodology is efficient, producing highly accurate classification results.

Stefan Skrobanski, Stelios Pavlidis, Waidah Ismail, Rosline Hassan, Steve Counsell, Stephen Swift

Intelligent Data Analysis by a Home-Use Human Monitoring Robot

In this paper, we argue that a home-use autonomous mobile robot is a platform for a new kind of Intelligent Data Analysis (IDA). Recent advances in hardware and software for robotics have enabled us to construct a small yet powerful autonomous mobile robot from low-cost components. Such a robot is able to perform machine learning and data mining in the real world for a long period, which opens a new avenue for IDA. This paper improves and studies one of our monitoring robots in detail to reveal promising directions and challenges inherent in the new kind of IDA.

Shinsuke Sugaya, Daisuke Takayama, Asuki Kouno, Einoshin Suzuki

Sleep Musicalization: Automatic Music Composition from Sleep Measurements

We introduce data musicalization as a novel approach to aid analysis and understanding of sleep measurement data. Data musicalization is the process of automatically composing novel music, with given data used to guide the process. We present Sleep Musicalization, a methodology that reads a signal from a state-of-the-art mattress sensor, uses highly non-trivial data analysis methods to measure sleep from the signal, and then composes music from the measurements. As a result, Sleep Musicalization produces music that reflects the user’s sleep during a night and complements visualizations of sleep measurements. The ultimate goal is to help users improve their sleep and well-being. For practical use and later evaluation of the methodology, we have built a public web service for users of the sleep sensors.

Aurora Tulilaulu, Joonas Paalasmaa, Mikko Waris, Hannu Toivonen

Engine Parameter Outlier Detection: Verification by Simulating PID Controllers Generated by Genetic Algorithm

We propose a method for engine configuration diagnostics based on clustering of engine parameters. The method is tested by simulating PID controller parameters that are generated and selected using a genetic algorithm. The parameter analysis is based on a state-of-the-art method that uses multivariate extreme value statistics for outlier detection. This method is modified by replacing its Gaussian mixture model with a variational mixture model, which automatically determines the number of Gaussian kernels.

Joni Vesterback, Vladimir Bochko, Mika Ruohonen, Jarmo Alander, Andreas Bäck, Martin Nylund, Allan Dal, Fredrik Östman

Unit Operational Pattern Analysis and Forecasting Using EMD and SSA for Industrial Systems

This paper studies operational pattern analysis and forecasting for industrial systems. To analyze the global change pattern, a novel methodology for extracting the underlying trends of signals is proposed, which is based on the sum of chosen intrinsic mode functions (IMFs) obtained via empirical mode decomposition (EMD). An adaptive strategy for selecting the appropriate IMFs to form the trend is proposed. Then, to forecast the change of the trend, Singular Spectrum Analysis (SSA) is applied. Results from experimental trials on an industrial turbine system show that the proposed methodology provides a convenient and effective mechanism for forecasting the trend of the operational pattern. It can therefore support flexible maintenance scheduling, in place of traditional calendar-based maintenance.

Zhijing Yang, Chris Bingham, Wing-Kuen Ling, Yu Zhang, Michael Gallimore, Jill Stewart
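
As a rough illustration of the SSA component, the sketch below shows only the decomposition and reconstruction steps of basic SSA (trajectory matrix, SVD, diagonal averaging). It covers neither the forecasting recurrence nor the EMD-based trend extraction described in the abstract. The function name `ssa_reconstruct` and its parameters are hypothetical.

```python
import numpy as np


def ssa_reconstruct(series, window, rank=1):
    """Reconstruct a series from its leading `rank` SSA components.

    series: 1-D sequence of floats
    window: embedding window length L, with 1 <= L <= len(series)
    rank:   number of leading singular components to keep
    """
    x = np.asarray(series, dtype=float)
    n = x.size
    k = n - window + 1
    # Trajectory (Hankel) matrix: column j holds x[j : j + window].
    traj = np.column_stack([x[j:j + window] for j in range(k)])
    u, s, vt = np.linalg.svd(traj, full_matrices=False)
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank]
    # Diagonal averaging: entry (i, j) contributes to time index i + j.
    total = np.zeros(n)
    counts = np.zeros(n)
    for i in range(window):
        for j in range(k):
            total[i + j] += approx[i, j]
            counts[i + j] += 1
    return total / counts
```

Keeping only the leading components gives a smoothed version of the signal, while keeping all of them reproduces the input exactly. Which components to keep, and how to extend the reconstructed series into the future, are the parts the paper addresses and this sketch leaves out.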

