
2012 | Book

Discovery Science

15th International Conference, DS 2012, Lyon, France, October 29-31, 2012. Proceedings

Edited by: Jean-Gabriel Ganascia, Philippe Lenca, Jean-Marc Petit

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 15th International Conference on Discovery Science, DS 2012, held in Lyon, France, in October 2012. The 22 papers presented in this volume were carefully reviewed and selected from 46 submissions. The field of discovery science aims at inducing and validating new scientific hypotheses from data. The scope of this conference includes the development and analysis of methods for automatic scientific knowledge discovery, machine learning, intelligent data analysis, theory of learning, tools for supporting the human process of discovery in science, as well as their application to knowledge discovery.

Table of Contents

Frontmatter

Invited Talks

Declarative Modeling for Machine Learning and Data Mining
Abstract
Despite the popularity of machine learning and data mining today, it remains challenging to develop applications and software that incorporate machine learning or data mining techniques. This is because machine learning and data mining have focused on developing high-performance algorithms for solving particular tasks rather than on developing general principles and techniques. I propose to alleviate these problems by applying the constraint programming methodology to machine learning and data mining and to specify machine learning and data mining problems as constraint satisfaction and optimization problems. What is essential is that the user be provided with a way to declaratively specify what the machine learning or data mining problem is rather than having to outline how the solution should be computed. This corresponds to a model + solver-based approach to machine learning and data mining, in which the user specifies the problem in a high-level modeling language and the system automatically transforms such models into a format that can be used by a solver to efficiently generate a solution. This should be much easier for the user than having to implement or adapt an algorithm that computes a particular solution to a specific problem. Throughout the talk, I shall use illustrations from our work on constraint programming for itemset mining and probabilistic programming.
Luc De Raedt
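The "declare what, not how" idea from the abstract above can be made concrete with a toy sketch (not the speaker's system): the user states a frequency constraint, and a deliberately naive generic "solver" enumerates all itemsets satisfying it. The transactions and threshold below are invented for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Declarative-style spec: return every itemset whose support meets
    min_support; the 'solver' below is a naive brute-force enumerator."""
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # "Solver": enumerate candidate itemsets and keep those satisfying
    # the declared frequency constraint.
    result = []
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = frozenset(candidate)
            if support(s) >= min_support:
                result.append((s, support(s)))
    return result

# Toy data (illustrative only)
transactions = [frozenset(t) for t in
                [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]]
for itemset, sup in frequent_itemsets(transactions, min_support=3):
    print(sorted(itemset), sup)
```

A real model + solver system would compile such a declarative specification into an efficient constraint program rather than enumerating candidates exhaustively.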
Recent Developments in Pattern Mining
Abstract
Pattern mining is one of the most researched topics in the data mining community. Literally hundreds of algorithms for efficiently enumerating all frequent itemsets have been proposed. These exhaustive algorithms, however, all suffer from the pattern explosion problem: depending on the minimal support threshold, even for moderately sized databases, millions of patterns may be generated. Although this problem is by now well recognized in the pattern mining community, it has not yet been solved satisfactorily. In my talk I will give an overview of the different approaches that have been proposed to alleviate this problem. As a first step, constraint-based mining and condensed representations such as the closed itemsets and the non-derivable itemsets were introduced. These methods, however, still produce too many, often redundant, results. More recently, promising methods based on the minimum description length principle, information theory, and statistical models have been introduced. We show the respective advantages and disadvantages of these approaches and their connections, and illustrate their usefulness on real-life data. After this overview we move from itemsets to more complex patterns, such as sequences and graphs. Even though these extensions seem trivial at first, they turn out to be quite challenging. I will end my talk with an overview of what I consider to be important open questions in this fascinating research area.
Toon Calders

Tutorial

Exploring Sequential Data
Abstract
The tutorial is devoted to categorical sequence data describing, for instance, the successive purchases of customers, working states of devices, visited web pages, or professional careers. Addressed topics include the rendering of state and event sequences, longitudinal characteristics of sequences, measuring pairwise dissimilarities, and dissimilarity-based analysis of sequence data such as clustering, representative sequences, and regression trees. The tutorial also provides a short introduction to the practice of sequence analysis with the TraMineR R package.
Gilbert Ritschard
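The dissimilarity-based analyses mentioned in the tutorial abstract rest on pairwise distances between sequences, such as optimal matching. A minimal Python sketch of such an edit distance is given below, assuming unit indel cost and a simple 0/2 substitution cost; the toy career sequences are invented (TraMineR itself is an R package).

```python
def optimal_matching(seq_a, seq_b, indel=1.0, sub_cost=None):
    """Optimal matching (edit) distance between two state sequences,
    computed by dynamic programming."""
    if sub_cost is None:
        sub_cost = lambda x, y: 0.0 if x == y else 2.0
    n, m = len(seq_a), len(seq_b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + indel,                              # deletion
                          d[i][j - 1] + indel,                              # insertion
                          d[i - 1][j - 1] + sub_cost(seq_a[i - 1], seq_b[j - 1]))
    return d[n][m]

# Two toy professional-career sequences (S=school, E=employed, U=unemployed)
print(optimal_matching("SSEEEU", "SSEEEE"))   # small distance
print(optimal_matching("SSEEEU", "UUUUSS"))   # larger distance
```

The resulting distance matrix is what clustering, representative-sequence extraction, and dissimilarity-based regression trees then operate on.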

Regular Papers

Large Scale Spectral Clustering Using Resistance Distance and Spielman-Teng Solvers
Abstract
The promise of spectral clustering is that it can help detect complex shapes and intrinsic manifold structure in large and high-dimensional spaces. The price for this promise is the computational cost of O(n^3) for computing the eigen-decomposition of the graph Laplacian matrix, so far a necessary subroutine for spectral clustering. In this paper we bypass the eigen-decomposition of the original Laplacian matrix by leveraging the recently introduced Spielman and Teng near-linear time solver for systems of linear equations, together with random projection. Experiments on several synthetic and real datasets show that the proposed approach has better clustering quality and is faster than the state-of-the-art approximate spectral clustering methods.
Nguyen Lu Dang Khoa, Sanjay Chawla
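For context, here is a minimal numpy sketch of the standard spectral clustering pipeline whose eigen-decomposition step (the O(n^3) bottleneck) the paper seeks to bypass. The affinity graph, bandwidth, and simple k-means are illustrative, not the authors' method.

```python
import numpy as np

def spectral_clustering(W, k, seed=0):
    """Standard spectral clustering on a symmetric affinity matrix W:
    eigen-decompose the normalized Laplacian (the costly step the paper
    avoids), then run a tiny k-means on the spectral embedding."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_sym = np.eye(len(W)) - d_inv_sqrt @ W @ d_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L_sym)          # O(n^3) bottleneck
    U = eigvecs[:, :k]                                # k smallest eigenvectors
    U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)

    # Tiny k-means (Lloyd's algorithm) on the embedding.
    rng = np.random.default_rng(seed)
    centers = U[rng.choice(len(U), k, replace=False)]
    for _ in range(50):
        labels = np.argmin(((U[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = U[labels == c].mean(axis=0)
    return labels

# Two noisy blobs turned into a Gaussian affinity graph (illustrative only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
W = np.exp(-np.linalg.norm(X[:, None] - X[None], axis=-1) ** 2)
print(spectral_clustering(W, k=2))
```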
Prediction of Quantiles by Statistical Learning and Application to GDP Forecasting
Abstract
In this paper, we tackle the problem of prediction and confidence intervals for time series using a statistical learning approach and quantile loss functions. First, we show that the Gibbs estimator is able to predict as well as the best predictor in a given family for a wide set of loss functions. In particular, using the quantile loss function of [1], this allows us to build confidence intervals. We apply these results to the problem of prediction and confidence regions for French Gross Domestic Product (GDP) growth, with promising results.
Pierre Alquier, Xiaoyin Li
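A small sketch of the pinball (quantile) loss that underlies such quantile prediction: a pair of predictors trained for a low and a high quantile level yields a prediction interval. The values below are made up for illustration, not the paper's GDP data.

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Quantile (pinball) loss at level tau: under-prediction is penalized
    with weight tau, over-prediction with weight (1 - tau)."""
    diff = y_true - y_pred
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

# Illustrative growth-like values (synthetic): a 90% interval could be
# read off from predictors trained with tau = 0.05 and tau = 0.95.
y = np.array([0.4, 0.6, -0.2, 0.3])
y_hat = np.array([0.3, 0.5, 0.0, 0.2])
print(pinball_loss(y, y_hat, tau=0.95))
print(pinball_loss(y, y_hat, tau=0.05))
```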
Policy Search in a Space of Simple Closed-form Formulas: Towards Interpretability of Reinforcement Learning
Abstract
In this paper, we address the problem of computing interpretable solutions to reinforcement learning (RL) problems. To this end, we propose a search algorithm over a space of simple closed-form formulas that are used to rank actions. We formalize the search for a high-performance policy as a multi-armed bandit problem where each arm corresponds to a candidate policy canonically represented by its shortest formula-based representation. Experiments, conducted on standard benchmarks, show that this approach manages to determine both efficient and interpretable solutions.
Francis Maes, Raphael Fonteneau, Louis Wehenkel, Damien Ernst
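A minimal sketch of the bandit view described in the abstract above: each arm is a candidate policy (here, invented closed-form scoring formulas), and UCB1 allocates evaluation rollouts among them. The formulas, the stand-in evaluation function, and the reward values are purely illustrative, not the paper's benchmarks.

```python
import math, random

# Hypothetical candidate policies, each ranking actions by a closed-form score.
candidate_formulas = {
    "greedy":     lambda q, n: q,
    "optimistic": lambda q, n: q + 1.0 / (n + 1),
    "sqrt_bonus": lambda q, n: q + math.sqrt(1.0 / (n + 1)),
}

def evaluate(name):
    """Stand-in for running the policy on an RL benchmark: returns a noisy
    reward whose mean differs per formula (purely illustrative)."""
    base = {"greedy": 0.5, "optimistic": 0.7, "sqrt_bonus": 0.65}[name]
    return base + random.gauss(0, 0.1)

def ucb1(arms, budget=300):
    """UCB1 over candidate policies: play each arm once, then repeatedly
    pick the arm with the highest upper confidence index."""
    counts = {a: 0 for a in arms}
    sums = {a: 0.0 for a in arms}
    for t in range(1, budget + 1):
        if t <= len(arms):
            arm = list(arms)[t - 1]
        else:
            arm = max(arms, key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        counts[arm] += 1
        sums[arm] += evaluate(arm)
    return max(arms, key=lambda a: sums[a] / counts[a])

random.seed(0)
print("selected formula:", ucb1(candidate_formulas))
```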
Towards Finding Relational Redescriptions
Abstract
This paper introduces relational redescription mining, that is, the task of finding two structurally different patterns that describe nearly the same set of object tuples in a relational dataset. By extending redescription mining beyond propositional and real-valued attributes, it provides a powerful tool to match different relational descriptions of the same concept. As a first step towards solving this general task, we introduce an efficient algorithm that mines one description of a given binary concept. A set of graph patterns is built from frequent path patterns connecting example pairs. Experiments in the domain of explaining kinship terms show that this approach can produce complex descriptions that match explanations by domain experts, while being much faster than a direct relational query mining approach.
Esther Galbrun, Angelika Kimmig
Descriptive Modeling of Systemic Banking Crises
Abstract
The topic of this work is the detection of connections between occurrences of systemic banking crises and the values of socio-economic indicators in the three years preceding the outbreak of a crisis. For this task we have used the list of banking crises in the period 1976-2007 prepared by the International Monetary Fund, which we have connected with publicly available World Bank data. The analysis uses a machine learning methodology based on subgroup discovery. The main result is that demographic indicators are detected as the most relevant. Foremost among them is the percentage of the total population in the age group 15-64 years. This indicator is present in both developed models and expresses a condition related to a high number of crisis outbreaks. In the first model it is combined with the indicator of annual growth of money and quasi money, while in the second model it is combined with the indicator of male life expectancy. The second model is especially interesting because it includes a decreasing or very low positive trend in the active population in the period before the onset of the crises. The importance of the result lies in the fact that such situations may be expected in the near future in many developed and developing economies.
Dragan Gamberger, Dražen Lučanin, Tomislav Šmuc
A Trim Distance between Positions in Nucleotide Sequences
Abstract
In this paper, we first introduce a label-based closest-neighbor trimming method that trims a phylogenetic tree built from nucleotide sequences according to the labels of its leaves. Next, by replacing the leaf labels (the indices of the nucleotide sequences) with the nucleotides occurring at a given position, we formulate a trim distance between two positions in nucleotide sequences as the LCA-preserving distance between the phylogenetic trees trimmed according to the nucleotides occurring at those positions. Finally, we apply the trim distance to compare pandemic influenza A (H1N1) viruses with non-pandemic ones.
Shunsuke Makino, Takaharu Shimada, Kouichi Hirata, Kouki Yonezawa, Kimihito Ito
Data Squashing for HSV Subimages by an Autonomous Mobile Robot
Abstract
In this paper, we propose a data index structure that is constructed by a small autonomous mobile robot so that it can manage the millions of subimages it takes during a navigation lasting dozens of minutes. The subimages are managed according to a similarity measure between pairs of subimages, which is based on a method for quantizing HSV colors. The data index structure is inspired by the CF tree of BIRCH, an early work in data squashing, though care and new ideas were necessary because the bins of HSV colors are highly correlated. We also propose an application for peculiar subimage detection by the robot, which exploits the data index structure for the current image and another one for all images in its navigation. Experiments conducted in a private office of about 25 m² proved the feasibility of the data index structure and the effectiveness of the peculiar subimage detection.
Einoshin Suzuki, Emi Matsumoto, Asuki Kouno
Cohesive Co-evolution Patterns in Dynamic Attributed Graphs
Abstract
We focus on the discovery of interesting patterns in dynamic attributed graphs. To this end, we define the novel problem of mining cohesive co-evolution patterns. Briefly speaking, cohesive co-evolution patterns are tri-sets of vertices, timestamps, and signed attributes that describe the local co-evolution of similar vertices at several timestamps according to a set of signed attributes expressing attribute trends. We design the first algorithm to mine the complete set of cohesive co-evolution patterns in a dynamic graph. Experiments performed on both synthetic and real-world datasets demonstrate that our algorithm enables the discovery of relevant patterns in feasible time.
Elise Desmier, Marc Plantevit, Céline Robardet, Jean-François Boulicaut
Efficient Redundancy Reduced Subgroup Discovery via Quadratic Programming
Abstract
Subgroup discovery is a task at the intersection of predictive and descriptive induction, aiming at identifying subgroups that have the most unusual statistical (distributional) characteristics with respect to a property of interest. Although a great deal of work has been devoted to the topic, one remaining problem concerns the redundancy of subgroup descriptions, which often effectively convey very similar information. In this paper, we propose a quadratic programming based approach to reduce the amount of redundancy in the subgroup rules. Experimental results on 12 datasets show that the resulting subgroups are in fact less redundant compared to standard methods. In addition, our experiments show that the computational costs are significantly lower than those of the other methods compared in the paper.
Rui Li, Stefan Kramer
HCAC: Semi-supervised Hierarchical Clustering Using Confidence-Based Active Learning
Abstract
Despite its importance, hierarchical clustering has been little explored in semi-supervised settings. In this paper, we address the problem of semi-supervised hierarchical clustering by using an active learning solution with cluster-level constraints. This active learning approach is based on a new concept of merge confidence in agglomerative clustering. When there is low confidence in a cluster merge, the user is queried and provides a cluster-level constraint. The proposed method is compared with an unsupervised algorithm (average-link) and two state-of-the-art semi-supervised algorithms (pairwise constraints and Constrained Complete-Link). Results show that our algorithm tends to be better than the two semi-supervised algorithms and can achieve a significant improvement when compared to the unsupervised algorithm. Our approach is particularly useful when the number of clusters is high, which is the case in many real problems.
Bruno M. Nogueira, Alípio M. Jorge, Solange O. Rezende
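One way to read the merge-confidence idea in the abstract above: at each agglomerative step, the confidence of the proposed merge can be taken as the gap between the best and the second-best merge distances, and a small gap triggers a user query. The average-link distance, toy data, and thresholding below are a sketch under that assumption, not the authors' exact definition.

```python
import numpy as np

def merge_confidence(clusters, X):
    """Return the closest cluster pair under average linkage and the gap to
    the second-closest pair (a small gap suggests an ambiguous merge)."""
    def avg_link(a, b):
        return np.mean([np.linalg.norm(X[i] - X[j]) for i in a for j in b])

    pairs = [(avg_link(clusters[p], clusters[q]), p, q)
             for p in range(len(clusters)) for q in range(p + 1, len(clusters))]
    pairs.sort()
    best, second = pairs[0], pairs[1]
    return best, second[0] - best[0]

# Toy data: two nearby points and one far away.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
clusters = [[0], [1], [2]]
best, gap = merge_confidence(clusters, X)
print("proposed merge:", best[1:], "confidence gap:", round(gap, 3))
# In an HCAC-style loop, a low gap would prompt the user for a cluster-level constraint.
```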
LF-CARS: A Loose Fragment-Based Consensus Clustering Algorithm with a Robust Similarity
Abstract
Consensus clustering combines multiple clustering results without accessing the original data. It can be used to obtain a clustering result from multiple data sources or to improve the robustness of a clustering result. In this paper, we propose a novel definition of the similarity between points and clusters that clearly represents whether a point should join or leave a cluster. With this definition of similarity, we design an iterative process that can determine the number of clusters automatically. In addition, we introduce the concept of a loose fragment, an improvement on the clustering fragment, into our method for speed-up. The experimental results show that our algorithm achieves good performance on both artificial and real data.
Bi-Ru Dai, Chih-Heng Chung
Fast Approximation Algorithm for the 1-Median Problem
Abstract
We present a fast approximation algorithm for the 1-median problem. Our algorithm can be applied to metric undirected graphs with node weights. Given a node v, our algorithm repeatedly finds a better node by making use of the shortest path tree of the previous node. We empirically show that our algorithm runs much faster and has a better approximation ratio than a sophisticated existing method called DTZ, and we demonstrate its effectiveness through experiments.
Koji Tabata, Atsuyoshi Nakamura, Mineichi Kudo
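For reference, a brute-force sketch of the 1-median objective the paper approximates: the node minimizing the sum of node-weighted shortest-path distances to all other nodes. The toy graph and weights are illustrative; the paper's algorithm avoids computing all-pairs distances like this.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def one_median_exact(adj, node_weight):
    """Brute-force 1-median: for each candidate node v, sum w(u) * d(v, u)
    over all nodes u, and return the minimizing node."""
    dist = shortest_path(adj, directed=False)   # all-pairs shortest paths
    costs = dist @ node_weight                  # weighted cost of each candidate
    return int(np.argmin(costs)), float(np.min(costs))

# Toy weighted path graph 0 - 1 - 2 - 3 (entries are edge lengths; 0 = no edge).
adj = np.array([[0, 2, 0, 0],
                [2, 0, 1, 0],
                [0, 1, 0, 4],
                [0, 0, 4, 0]], dtype=float)
node_weight = np.array([1.0, 1.0, 1.0, 3.0])
print(one_median_exact(adj, node_weight))       # -> best node and its cost
```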
Online Co-regularized Algorithms
Abstract
We propose an online co-regularized learning algorithm for classification and regression tasks. We demonstrate that by sequentially co-regularizing prediction functions on unlabeled data points, our algorithm provides improved performance in comparison to supervised methods on several UCI benchmarks and a real-world natural language processing dataset. The presented algorithm is particularly applicable to learning tasks where large amounts of (unlabeled) data are available for training. We also provide an easy to set up and use Python implementation of our algorithm.
Tom de Ruijter, Evgeni Tsivtsivadze, Tom Heskes
Fast Progressive Training of Mixture Models for Model Selection
Abstract
Finite mixture models are flexible models with varied uses such as density estimation, clustering, classification, modeling heterogeneity, model averaging, and handling missing data. One of the prerequisites of using mixture models is a priori knowledge of the number of mixture components, so that the Expectation Maximization (EM) algorithm can learn the maximum likelihood parameters of the mixture model. However, the number of mixing components is often unknown, and determining it has been a central problem in mixture modelling. Thus, mixture modelling is often a two-stage process of determining the number of mixture components and then estimating the parameters of the mixture model. This paper proposes a fast, search-based model selection algorithm for mixture models using progressive merging of mixture components. The paper also proposes a data-driven, fast approximation of the Kullback-Leibler (KL) divergence as a criterion to merge the mixture components. The proposed methodology is used in mixture modelling of two chromosomal aberration datasets, showing that model selection is efficient and effective.
Prem Raj Adhikari, Jaakko Hollmén
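A small sketch of a KL-based merge criterion for mixture components (the paper uses its own fast, data-driven approximation): the closed-form KL divergence between two univariate Gaussians, symmetrised, with the closest pair proposed for merging. The component parameters are illustrative.

```python
import math

def kl_gauss(mu1, var1, mu2, var2):
    """KL( N(mu1, var1) || N(mu2, var2) ) in closed form."""
    return 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def closest_pair(components):
    """Propose the pair of components with the smallest symmetrised KL divergence."""
    best = None
    for i in range(len(components)):
        for j in range(i + 1, len(components)):
            (m1, v1), (m2, v2) = components[i], components[j]
            d = kl_gauss(m1, v1, m2, v2) + kl_gauss(m2, v2, m1, v1)
            if best is None or d < best[0]:
                best = (d, i, j)
    return best

# Three toy (mean, variance) components: the first two are nearly identical
# and should be proposed for merging first.
components = [(0.0, 1.0), (0.2, 1.1), (5.0, 0.5)]
print(closest_pair(components))
```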
Including Spatial Relations and Scales within Sequential Pattern Extraction
Abstract
Georeferenced databases contain a huge volume of temporal and spatial data and are notably used in environmental analysis. Several works address the problem of mining such data, but none are able to take into account the richness of the data, especially their spatial and temporal dimensions. In this paper, we focus on the extraction of a new kind of spatio-temporal pattern that considers the relationships between spatial objects and geographical scales. We propose an algorithm, STR_PrefixGrowth, which can be applied to huge amounts of data. The proposed method is evaluated on hydrological data collected on the Saône watershed during the last 19 years. Our experiments highlight the contribution of our approach with respect to existing methods.
Mickaël Fabrègue, Agnès Braud, Sandra Bringay, Florence Le Ber, Maguelonne Teisseire
Predicting Ramp Events with a Stream-Based HMM Framework
Abstract
The motivation for this work is the study and prediction of wind ramp events occurring in a large-scale wind farm located in the US Midwest. In this paper we introduce the SHRED framework, a stream-based model that continuously learns a discrete HMM from wind power and wind speed measurements. We use a supervised learning algorithm to learn the HMM parameters from discretized data, where ramp events are HMM states and discretized wind speed data are HMM observations. The discretization of the historical data is obtained by running the SAX algorithm over the first-order variations in the original signal. SHRED updates the HMM using the most recent historical data and includes a forgetting mechanism to model the natural time dependence in wind patterns. To forecast ramp events we use recent wind speed forecasts and the Viterbi algorithm, which incrementally finds the most probable ramp event to occur.
We compare the SHRED framework against a persistence baseline in predicting ramp events over short time horizons, ranging from 30 to 90 minutes. SHRED consistently exhibits more accurate and cost-effective results than the baseline.
Carlos Abreu Ferreira, João Gama, Vítor Santos Costa, Vladimiro Miranda, Audun Botterud
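A minimal sketch of the Viterbi decoding step used to recover the most probable ramp-state sequence from discretized observations. The two ramp states, the SAX-like observation symbols, and all probabilities below are invented for illustration, not SHRED's learned parameters.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most probable hidden-state path for an observation sequence
    (log-space dynamic programming with backtracking)."""
    n_states, T = len(start_p), len(obs)
    logp = np.full((T, n_states), -np.inf)
    back = np.zeros((T, n_states), dtype=int)
    logp[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            scores = logp[t - 1] + np.log(trans_p[:, s])
            back[t, s] = int(np.argmax(scores))
            logp[t, s] = scores[back[t, s]] + np.log(emit_p[s, obs[t]])
    path = [int(np.argmax(logp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

# Hypothetical setup: states 0 = "no ramp", 1 = "ramp"; observations are
# SAX-like symbols 0/1/2 for low / medium / high wind-speed variation.
start_p = np.array([0.8, 0.2])
trans_p = np.array([[0.9, 0.1], [0.3, 0.7]])
emit_p = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 0, 2, 2, 1, 0], start_p, trans_p, emit_p))
```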
Burst Detection in a Sequence of Tweets Based on Information Diffusion Model
Abstract
We propose a method for detecting the period in which a burst of information diffusion took place from observed diffusion sequence data over a social network, and report the results obtained by applying it to real Twitter data. We assume a generic information diffusion model in which the time delay associated with the diffusion follows the exponential distribution, and the burst is directly reflected in changes to the time delay parameter of the distribution (the inverse of the average time delay). The shape of the parameter change is approximated by a series of step functions, and the problem of detecting the change points and finding the values of the parameter is formulated as an optimization problem of maximizing the likelihood of generating the observed diffusion sequence. The time complexity of the search is almost proportional to the number of observed data points (possible change points), making it very efficient. We apply the method to real Twitter data from the 2011 Tohoku earthquake and tsunami, and show that the proposed method is far more efficient than a naive method that adopts exhaustive search, and more accurate than a simple greedy method. Two interesting discoveries are that a burst period between two change points detected by the proposed method tends to contain massive homogeneous tweets on a specific topic even if the observed diffusion sequence consists of heterogeneous tweets on various topics, and that assuming the information diffusion path is a line-shaped tree can give a good approximation of the maximum likelihood estimator when the actual diffusion path is not known.
Kazumi Saito, Kouzou Ohara, Masahiro Kimura, Hiroshi Motoda
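A small sketch of the likelihood computation behind this kind of change-point detection, under the stated exponential-delay assumption: within a segment, the maximum-likelihood rate is the inverse mean delay, and a candidate change point can be scored by its gain in log-likelihood. The delays below are synthetic, and the single-change-point search is only an illustration of the objective, not the paper's efficient algorithm.

```python
import math

def exp_loglik(delays):
    """Maximized log-likelihood of i.i.d. exponential delays on one segment
    (rate MLE = n / sum(delays))."""
    n, s = len(delays), sum(delays)
    rate = n / s
    return n * math.log(rate) - rate * s   # equals n * (log(n / s) - 1)

def best_single_change_point(delays):
    """Score every split point by its log-likelihood gain over one segment."""
    base = exp_loglik(delays)
    return max(((exp_loglik(delays[:k]) + exp_loglik(delays[k:]) - base, k)
                for k in range(2, len(delays) - 1)), key=lambda x: x[0])

# Synthetic diffusion delays: slow spreading, then a burst of fast retweets.
delays = [5.0, 4.2, 6.1, 5.5, 0.3, 0.2, 0.4, 0.25, 0.35]
print(best_single_change_point(delays))   # (gain, change-point index)
```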
Error-Correcting Output Codes as a Transformation from Multi-Class to Multi-Label Prediction
Abstract
In this paper, we reinterpret error-correcting output codes (ECOCs) as a framework for converting multi-class classification problems into multi-label prediction problems. Different well-known multi-label learning approaches can be mapped onto particular ways of dealing with the original multi-class problem. For example, the label powerset approach obviously constitutes the inverse transformation from multi-label back to multi-class, whereas binary relevance learning may be viewed as the conventional way of dealing with ECOCs, in which each classifier is learned independently of the others. Consequently, we evaluate whether alternative choices for solving the multi-label problem may result in improved performance. This question is interesting because it is not clear whether approaches that do not treat the bits of the code words independently have sufficient error-correcting properties. Our results indicate that a slight but consistent advantage can be obtained with the use of multi-label methods, in particular when longer codes are employed.
Johannes Fürnkranz, Sang-Hyeun Park
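A minimal sketch of the ECOC view: each class gets a binary code word, each column defines one binary prediction problem (or, in the multi-label reading, one label), and decoding picks the class whose code word is nearest in Hamming distance to the predicted bit vector. The code matrix below is illustrative, not one from the paper.

```python
import numpy as np

# Illustrative 4-class code matrix: rows = classes, columns = binary problems.
code_matrix = np.array([[0, 0, 1, 1, 0, 1],
                        [0, 1, 0, 1, 1, 0],
                        [1, 0, 0, 0, 1, 1],
                        [1, 1, 1, 0, 0, 0]])

def encode(y):
    """Multi-class label -> multi-label bit vector (one 'label' per column)."""
    return code_matrix[y]

def decode(predicted_bits):
    """Predicted bit vector -> class with the nearest code word (Hamming)."""
    distances = (code_matrix != predicted_bits).sum(axis=1)
    return int(np.argmin(distances))

print(encode(2))                                # the bits the base learners must predict
print(decode(np.array([1, 0, 0, 1, 1, 1])))     # one bit flipped, still decodes to class 2
```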
An Assessment on Loan Performance from Combined Quantitative and Qualitative Data in XML
Abstract
The intensifying need to incorporate knowledge extracted from qualitative information into banks' lending decisions has been recognized in recent times, particularly for micro lenders. In this study, multi-faceted credit information is captured in an integrated form using XML to facilitate the discovery of knowledge models encompassing a broad range of credit risk related aspects. The quantitative and qualitative credit data obtained from the industry partner describe existing lender profiles. Experiments are performed to discover classification models for performing or non-performing lenders in one problem setting, and for the duration of payment delay in another. The results are compared with a common credit risk prediction setting where qualitative data are excluded. The findings confirm the role of domain experts' knowledge as well as qualitative information in loan performance assessment, and describe a number of rules indicating refinements of the banks' lending policy requirements.
Novita Ikasari, Fedja Hadzic
Structural Change Pattern Mining Based on Constrained Maximal k-Plex Search
Abstract
In this paper we discuss the problem of mining structural change patterns. Given a pair of graphs before and after some change, a structural change pattern is extracted as a vertex set X which is a pseudo-independent set before the change but a pseudo-clique after the change. In order to detect more interesting patterns of this kind, X is additionally required to have many outgoing edges before the change but few outgoing edges after the change. We formalize such an X as a maximal k-plex in a combined graph defined from the given graphs. An effective algorithm for extracting these patterns is designed as a constrained maximal k-plex enumerator with a pruning mechanism based on right candidate control. Our experimental results show an example of a structural change pattern actually detected. Moreover, they show that the pruning mechanism and the use of the combined graph are very effective for efficient computation.
Yoshiaki Okubo, Makoto Haraguchi, Etsuji Tomita
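For reference, a small sketch of the k-plex condition such patterns must satisfy: in a k-plex, every vertex of the set may be non-adjacent to at most k - 1 other members (so a 1-plex is a clique). The toy graph is illustrative; the paper's contribution is the constrained enumeration, not this check.

```python
def is_k_plex(vertices, edges, k):
    """Check whether `vertices` forms a k-plex: every member must be
    adjacent to at least |vertices| - k other members of the set."""
    vset = set(vertices)
    adj = {v: set() for v in vset}
    for u, w in edges:
        if u in vset and w in vset:
            adj[u].add(w)
            adj[w].add(u)
    return all(len(adj[v]) >= len(vset) - k for v in vset)

# Toy graph: {1, 2, 3, 4} misses only the edge (3, 4), so it is a 2-plex
# but not a clique.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4)]
print(is_k_plex([1, 2, 3, 4], edges, k=1))   # False: not a clique
print(is_k_plex([1, 2, 3, 4], edges, k=2))   # True
```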
Enhancing Patent Expertise through Automatic Matching with Scientific Papers
Abstract
This paper focuses on a subtask of the QUAERO research program, a major innovative research project related to the automatic processing of multimedia and multilingual content. The objective discussed in this article is to propose a new method for the classification of scientific papers, developed in the context of an international patent classification plan related to the same field. The practical purpose of this work is to provide an assistance tool to experts in their task of evaluating the originality and novelty of a patent, by offering them the most relevant scientific citations. This issue raises new challenges in categorization research, as the patent classification plan is not directly adapted to the structure of scientific documents, classes overlap strongly in their citing and cited topics, and there is not always a balanced distribution of the available examples within the different learning classes. As a solution to this problem, we propose to apply an improved K-nearest-neighbors (KNN) algorithm based on the exploitation of association rules occurring between the index terms of the documents and those of the patent classes. Using a reference dataset of patents belonging to the field of pharmacology, on the one hand, and a bibliographic dataset of the same field drawn from the Medline collection, on the other hand, we show that this new approach, which combines the advantages of numerical and symbolic approaches, considerably improves categorization performance compared to the usual categorization methods.
Kafil Hajlaoui, Pascal Cuxac, Jean-Charles Lamirel, Claire François
Soft Threshold Constraints for Pattern Mining
Abstract
Constraint-based pattern discovery is at the core of numerous data mining tasks. Patterns are extracted with respect to a given set of constraints (frequency, closedness, size, etc.). In practice, many constraints require threshold values whose choice is often arbitrary. This difficulty is even harder when several thresholds are required and have to be combined. Moreover, patterns barely missing a threshold will not be extracted even though they may be relevant. In this paper, using Constraint Programming, we propose a method to integrate soft threshold constraints into the pattern discovery process. We show the relevance and the efficiency of our approach through a case study in chemoinformatics on discovering toxicophores.
Willy Ugarte, Patrice Boizumault, Samir Loudni, Bruno Crémilleux
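A minimal sketch of the softened-threshold idea, outside any CP solver: instead of a hard frequency threshold, a pattern barely missing it is kept but charged a violation cost, and only patterns whose violation stays within a tolerance survive. The supports, threshold, and tolerance below are illustrative.

```python
def soft_frequency_filter(patterns, support, min_support, tolerance):
    """Keep a pattern if its support violation (min_support - support, when
    positive) stays within `tolerance`; report the violation as a cost."""
    kept = []
    for p in patterns:
        violation = max(0, min_support - support[p])
        if violation <= tolerance:
            kept.append((p, support[p], violation))
    return kept

# Toy supports: with a hard threshold of 10 the pattern "bc" (support 9)
# would be lost; a soft threshold with tolerance 2 keeps it with cost 1.
support = {"ab": 12, "bc": 9, "cd": 6}
print(soft_frequency_filter(["ab", "bc", "cd"], support, min_support=10, tolerance=2))
```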
Backmatter
Metadata
Title
Discovery Science
Edited by
Jean-Gabriel Ganascia
Philippe Lenca
Jean-Marc Petit
Copyright year
2012
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-33492-4
Print ISBN
978-3-642-33491-7
DOI
https://doi.org/10.1007/978-3-642-33492-4
