Skip to main content
main-content

Über dieses Buch

This book constitutes the thoroughly refereed post-conference proceedings of the Third International Workshop on New Frontiers in Mining Complex Patterns, NFMCP 2014, held in conjunction with ECML-PKDD 2014 in Nancy, France, in September 2014. The 13 revised full papers presented were carefully reviewed and selected from numerous submissions. They illustrate advanced data mining techniques which preserve the informative richness of complex data and allow for efficient and effective identification of complex information units present in such data. The papers are organized in the following sections: classification and regression; clustering; data streams and sequences; applications.

Inhaltsverzeichnis

Frontmatter

Classification and Regression

Frontmatter

Semi-supervised Learning for Multi-target Regression

Abstract
The most common machine learning approach is supervised learning, which uses labeled data for building predictive models. However, in many practical problems, the availability of annotated data is limited due to the expensive, tedious and time-consuming annotation procedure. At the same, unlabeled data can be easily available in large amounts. This is especially pronounced for predictive modelling problems with a structured output space and complex labels.
Semi-supervised learning (SSL) aims to use unlabeled data as an additional source of information in order to build better predictive models than can be learned from labeled data alone. The majority of work in SSL considers the simple tasks of classification and regression where the output space consists of a single variable. Much less work has been done on SSL for structured output prediction.
In this study, we address the task of multi-target regression (MTR), a type of structured output prediction, where the output space consists of multiple numerical values. Our main objective is to investigate whether we can improve over supervised methods for MTR by using unlabeled data. We use ensembles of predictive clustering trees in a self-training fashion: the most reliable predictions (passing a reliability threshold) on unlabeled data are iteratively used to re-train the model. We use the variance of the ensemble models’ predictions as an indicator of the reliability of predictions. Our results provide a proof-of-concept: The use of unlabeled data improves the predictive performance of ensembles for multi-target regression, but further efforts are needed to automatically select the optimal threshold for the reliability of predictions.
Jurica Levatić, Michelangelo Ceci, Dragi Kocev, Sašo Džeroski

Evaluation of Different Data-Derived Label Hierarchies in Multi-label Classification

Abstract
Motivated by an increasing number of new applications, the research community is devoting an increasing amount of attention to the task of multi-label classification (MLC). Many different approaches to solving multi-label classification problems have been recently developed. Recent empirical studies have comprehensively evaluated many of these approaches on many datasets using different evaluation measures. The studies have indicated that the predictive performance and efficiency of the approaches could be improved by using data derived (artificial) hierarchies, in the learning and prediction phases. In this paper, we compare different clustering algorithms for constructing the label hierarchies (in a data-driven manner), in multi-label classification. We consider flat label sets and construct the label hierarchies from the label sets that appear in the annotations of the training data by using four different clustering algorithms (balanced \(k\)-means, agglomerative clustering with single and complete linkage and predictive clustering trees). The hierarchies are then used in conjunction with global hierarchical multi-label classification (HMC) approaches. The results from the statistical and experimental evaluation reveal that the data-derived label hierarchies used in conjunction with global HMC methods greatly improve the performance of MLC methods. Additionally, multi-branch hierarchies appear much more suitable for the global HMC approaches as compared to the binary hierarchies.
Gjorgji Madjarov, Ivica Dimitrovski, Dejan Gjorgjevikj, Sašo Džeroski

Clustering

Frontmatter

Predicting Negative Side Effects of Surgeries Through Clustering

Abstract
Prescribed treatments to patients often result in side effects that may not be known beforehand. Side effects analysis research focuses on specific treatments and targets small groups of patients. In previous work, we presented methods for extracting treatment effects from the Florida State Inpatient Databases (SID), which contain over 2.5 million visit discharges from 1.5 million patients. We classified these effects into positive, neutral, and negative effects. In addition, we proposed an approach for clustering patients based on negative side effects and analyzed them. As an extension to this work, We believe that a system identifying the cluster membership of a patient prior to applying the procedure is highly beneficial. In this paper, we extended our work and introduced a new approach for predicting patients’ negative side effects before applying a given meta-action (or procedure). We propose a system that measures the similarity of a new patient to existing clusters, and makes a personalized decision on the patient’s most likely negative side effects. We further evaluate our system using SID, which is part of the Healthcare Cost and Utilization Project (HCUP). Our experiments validated our approach and produced desired results.
Ayman Hajja, Hakim Touati, Zbigniew W. Raś, James Studnicki, Alicja A. Wieczorkowska

Parallel Multicut Segmentation via Dual Decomposition

Abstract
We propose a new outer relaxation of the multicut polytope, along with a dual decomposition approach for correlation clustering and multicut segmentation, for general graphs. Each subproblem is a minimum \(st\)-cut problem and can thus be solved efficiently. An optimal reparameterization is found using subgradients and affords a new characterization of the basic LP relaxation of the multicut problem, as well as informed decoding heuristics. The algorithm we propose for solving the problem distributes the computation and is amenable to a parallel implementation.
Julian Yarkony, Thorsten Beier, Pierre Baldi, Fred A. Hamprecht

Learning from Imbalanced Data Using Ensemble Methods and Cluster-Based Undersampling

Abstract
Imbalanced data, where the number of instances of one class is much higher than the others, are frequent in many domains such as fraud detection, telecommunications management, oil spill detection, and text classification. Traditional classifiers do not perform well when considering data that are susceptible to both within-class and between-class imbalances. In this paper, we propose the ClusFirstClass algorithm that employs cluster analysis to aid classifiers when aiming to build accurate models against such imbalanced datasets. In order to work with balanced classes, all minority instances are used together with the same number of majority instances. To further reduce the impact of within-class imbalance, majority instances are clustered into different groups and at least one instance is selected from each cluster. Experimental results demonstrate that our proposed ClusFirstClass algorithm yields promising results compared to the state-of-the art classification approaches, when evaluated against a number of highly imbalanced datasets.
Parinaz Sobhani, Herna Viktor, Stan Matwin

Data Streams and Sequences

Frontmatter

Prequential AUC for Classifier Evaluation and Drift Detection in Evolving Data Streams

Abstract
Detecting and adapting to concept drifts make learning data stream classifiers a difficult task. It becomes even more complex when the distribution of classes in the stream is imbalanced. Currently, proper assessment of classifiers for such data is still a challenge, as existing evaluation measures either do not take into account class imbalance or are unable to indicate class ratio changes in time. In this paper, we advocate the use of the area under the ROC curve (AUC) in imbalanced data stream settings and propose an efficient incremental algorithm that uses a sorted tree structure with a sliding window to compute AUC using constant time and memory. Additionally, we experimentally verify that this algorithm is capable of correctly evaluating classifiers on imbalanced streams and can be used as a basis for detecting changes in class definitions and imbalance ratio.
Dariusz Brzezinski, Jerzy Stefanowski

Mining Positional Data Streams

Abstract
We study frequent pattern mining from positional data streams. Existing approaches require discretised data to identify atomic events and are not applicable in our continuous setting. We propose an efficient trajectory-based preprocessing to identify similar movements and a distributed pattern mining algorithm to identify frequent trajectories. We empirically evaluate all parts of the processing pipeline.
Jens Haase, Ulf Brefeld

Visualization for Streaming Telecommunications Networks

Abstract
Regular services in telecommunications produce massive volumes of relational data. In this work the data produced in telecommunications is seen as a streaming network, where clients are the nodes and phone calls are the edges. Visualization techniques are required for exploratory data analysis and event detection. In social network visualization and analysis the goal is to get more information from the data taking into account actors at the individual level. Previous methods relied on aggregating communities, k-Core decompositions and matrix feature representations to visualize and analyse the massive network data. Our contribution is a group visualization and analysis technique of influential actors in the network by sampling the full network with a top-k representation of the network data stream.
Rui Sarmento, Mário Cordeiro, João Gama

Temporal Dependency Detection Between Interval-Based Event Sequences

Abstract
We present a new approach to mine dependencies between sequences of interval-based events that link two events if they occur in a similar manner, one being often followed by the other one in the data. The proposed technique is robust to temporal variability of events and determines the most appropriate time intervals whose validity is assessed by a \(\chi ^2\) test. TEDDY algorithm, TEmporal Dependency DiscoverY, prunes the search space while certifying the discovery of all valid and significant temporal dependencies. We present a real-world case study of balance bicycles into the Bike Sharing System of Lyon.
Marc Plantevit, Vasile-Marian Scuturici, Céline Robardet

Applications

Frontmatter

Discovering Behavioural Patterns in Knowledge-Intensive Collaborative Processes

Abstract
Domains like emergency management, health care, or research and innovation development, are characterized by the execution of so-called knowledge-intensive processes. Such processes are typically highly uncertain, with little or no structure; consequently, classical process discovery techniques, aimed at extracting complete process schemas from execution logs, usually provide a limited support in analysing these processes. As a remedy, in the present work we propose a methodology aimed at extracting relevant subprocesses, representing meaningful collaboration behavioural patterns. We consider a real case study regarding the development of research activities, to test the approach and compare its results with the outcome of classical process discovery techniques.
Claudia Diamantini, Laura Genga, Domenico Potena, Emanuele Storti

Learning Complex Activity Preconditions in Process Mining

Abstract
The availability of automatic support may sometimes determine the successful accomplishment of a process. Such a support can be provided if a model of the intended process is available. Many real-world process models are very complex. Additionally, their components might be associated to conditions that determine whether they are to be carried out or not. These conditions may be in turn very complex, involving sequential relationships that take into account the past history of the current process execution. In this landscape, writing and setting up manually the process models and conditions might be infeasible, and even standard Machine Learning approaches may be unable to infer them.
This paper presents a First-Order Logic-based approach to learn complex process models extended with conditions. It combines two powerful Inductive Logic Programming systems. The overall system was exploited to learn the daily routines of the user of a smart environment, for predicting his needs and comparing the actual situation with the expected one. In addition to proving the efficiency and effectiveness of the system, the outcomes show that complex, human-readable and interesting preconditions can be learned for the tasks involved in the process.
Stefano Ferilli, Berardina De Carolis, Floriana Esposito

Location Prediction of Mobile Phone Users Using Apriori-Based Sequence Mining with Multiple Support Thresholds

Abstract
Due to the increasing use of mobile phones and their increasing capabilities, huge amount of usage and location data can be collected. Location prediction is an important task for mobile phone operators and smart city administrations to provide better services and recommendations. In this work, we propose a sequence mining based approach for location prediction of mobile phone users. More specifically, we present a modified Apriori-based sequence mining algorithm for the next location prediction, which involves use of multiple support thresholds for different levels of pattern generation process. The proposed algorithm involves a new support definition, as well. We have analyzed the behaviour of the algorithm under the change of threshold through experimental evaluation and the experiments indicate improvement in comparison to conventional Apriori-based algorithm.
Ilkcan Keles, Mert Ozer, I. Hakki Toroslu, Pinar Karagoz

Pitch-Related Identification of Instruments in Classical Music Recordings

Abstract
Identification of particular voices in polyphonic and polytimbral music is a task often performed by musicians in their everyday life. However, the automation of this task is very challenging, because of high complexity of audio data. Usually additional information is supplied, and the results are far from satisfactory. In this paper, we focus on classical music recordings, without requiring the user to submit additional information. Our goal is to identify musical instruments playing in short audio frames of polyphonic recordings of classical music. Additionally, we extract pitches (or pitch ranges) which combined with instrument information can be used in score-following and audio alignment, see e.g. [9, 20], or in works towards automatic score extraction, which are a motivation behind this work. Also, since instrument timbre changes with pitch, separate classifiers are trained for various pitch ranges for each instrument. Four instruments are investigated, representing stringed and wind instruments. The influence of adding harmonic (pitch-based) features to the feature set on the results is also investigated. Random forests are applied as a classification tool, and the results are presented and discussed.
Elżbieta Kubera, Alicja A. Wieczorkowska

Backmatter

Weitere Informationen

Premium Partner

    Bildnachweise