main-content

## Über dieses Buch

This book constitutes the refereed conference proceedings of the 14th International Conference on Intelligent Data Analysis, which was held in October 2015 in Saint Étienne. France. The 29 revised full papers were carefully reviewed and selected from 65 submissions. The traditional focus of the IDA symposium series is on end-to-end intelligent support for data analysis. The symposium aims to provide a forum for inspiring research contributions that might be considered preliminary in other leading conferences and journals, but that have a potentially dramatic impact. To facilitate this, IDA 2015 will feature two tracks: a regular "Proceedings" track, as well as a "Horizon" track for early-stage research of potentially ground-breaking nature.

## Inhaltsverzeichnis

### Data Analytics and Optimisation for Assessing a Ride Sharing System

Abstract
Ride-sharing schemes attempt to reduce road traffic by matching prospective passengers to drivers with spare seats in their cars. To be successful, such schemes require a critical mass of drivers and passengers. In current deployed implementations, the possible matches are based on heuristics, rather than real route times or distances. In some cases, the heuristics propose infeasible matches; in others, feasible matches are omitted. Poor ride matching is likely to deter participants from using the system. We develop a constraint-based model for acceptable ride matches which incorporates route plans and time windows. Through data analytics on a history of advertised schedules and agreed shared trips, we infer parameters for this model that account for 90 % of agreed trips. By applying the inferred model to the advertised schedules, we demonstrate that there is an imbalance between riders and passengers. We assess the potential benefits of persuading existing drivers to switch to becoming passengers if appropriate matches can be found, by solving the inferred model with and without switching. We demonstrate that flexible participation has the potential to reduce the number of unmatched participants by up to 80 %.
Vincent Armant, John Horan, Nahid Mabub, Kenneth N. Brown

### Constraint-Based Querying for Bayesian Network Exploration

Abstract
Understanding the knowledge that resides in a Bayesian network can be hard, certainly when a large network is to be used for the first time, or when the network is complex or has just been updated. Tools to assist users in the analysis of Bayesian networks can help. In this paper, we introduce a novel general framework and tool for answering exploratory queries over Bayesian networks. The framework is inspired by queries from the constraint-based mining literature designed for the exploratory analysis of data. Adapted to Bayesian networks, these queries specify a set of constraints on explanations of interest, where an explanation is an assignment to a subset of variables in a network. Characteristic for the methodology is that it searches over different subsets of the explanations, corresponding to different marginalizations. A general purpose framework, based on principles of constraint programming, data mining and knowledge compilation, is used to answer all possible queries. This CP4BN framework employs a rich set of constraints and is able to emulate a range of existing queries from both the Bayesian network and the constraint-based data mining literature.
Behrouz Babaki, Tias Guns, Siegfried Nijssen, Luc De Raedt

### Efficient Model Selection for Regularized Classification by Exploiting Unlabeled Data

Abstract
Hyper-parameter tuning is a resource-intensive task when optimizing classification models. The commonly used k-fold cross validation can become intractable in large scale settings when a classifier has to learn billions of parameters. At the same time, in real-world, one often encounters multi-class classification scenarios with only a few labeled examples; model selection approaches often offer little improvement in such cases and the default values of learners are used. We propose bounds for classification on accuracy and macro measures (precision, recall, F1) that motivate efficient schemes for model selection and can benefit from the existence of unlabeled data. We demonstrate the advantages of those schemes by comparing them with k-fold cross validation and hold-out estimation in the setting of large scale classification.
Georgios Balikas, Ioannis Partalas, Eric Gaussier, Rohit Babbar, Massih-Reza Amini

### Segregation Discovery in a Social Network of Companies

Abstract
We introduce a framework for a data-driven analysis of segregation of minority groups in social networks, and challenge it on a complex scenario. The framework builds on quantitative measures of segregation, called segregation indexes, proposed in the social science literature. The segregation discovery problem consists of searching sub-graphs and sub-groups for which a reference segregation index is above a minimum threshold. A search algorithm is devised that solves the segregation problem. The framework is challenged on the analysis of segregation of social groups in the boards of directors of the real and large network of Italian companies connected through shared directors.
Alessandro Baroni, Salvatore Ruggieri

### A First-Order-Logic Based Model for Grounded Language Learning

Abstract
Much is still unknown about how children learn language, but it is clear that they perform “grounded” language learning: they learn the grammar and vocabulary not just from examples of sentences, but from examples of sentences in a particular context. Grounded language learning has been the subject of much research. Most of this work focuses on particular aspects, such as constructing semantic parsers, or on particular types of applications. In this paper, we take a broader view that includes an aspect that has received little attention until now: learning the meaning of phrases from phrase/context pairs in which the phrase’s meaning is not explicitly represented. We propose a simple model for this task that uses first-order logic representations for contexts and meanings, including a simple incremental learning algorithm. We experimentally demonstrate that the proposed model can explain the gradual learning of simple concepts and language structure, and that it can easily be used for interpretation, generation, and translation of phrases.
Leonor Becerra-Bonache, Hendrik Blockeel, María Galván, François Jacquenet

### A Parallel Distributed Processing Algorithm for Image Feature Extraction

Abstract
We present a new parallel algorithm for image feature extraction. which uses a distance function based on the LZ-complexity of the string representation of the two images. An input image is represented by a feature vector whose components are the distance values between its parts (sub-images) and a set of prototypes. The algorithm is highly scalable and computes these values in parallel. It is implemented on a massively parallel graphics processing unit (GPU) with several thousands of cores which yields a three order of magnitude reduction in time for processing the images. Given a corpus of input images the algorithm produces labeled cases that can be used by any supervised or unsupervised learning algorithm to learn image classification or image clustering. A main advantage is the lack of need for any image processing or image analysis; the user only once defines image-features through a simple basic process of choosing a few small images that serve as prototypes. Results for several image classification problems are presented.
Alexander Belousov, Joel Ratsaby

### Modeling Concept Drift: A Probabilistic Graphical Model Based Approach

Abstract
An often used approach for detecting and adapting to concept drift when doing classification is to treat the data as i.i.d. and use changes in classification accuracy as an indication of concept drift. In this paper, we take a different perspective and propose a framework, based on probabilistic graphical models, that explicitly represents concept drift using latent variables. To ensure efficient inference and learning, we resort to a variational Bayes inference scheme. As a proof of concept, we demonstrate and analyze the proposed framework using synthetic data sets as well as a real financial data set from a Spanish bank.
Hanen Borchani, Ana M. Martínez, Andrés R. Masegosa, Helge Langseth, Thomas D. Nielsen, Antonio Salmerón, Antonio Fernández, Anders L. Madsen, Ramón Sáez

### Diversity-Driven Widening of Hierarchical Agglomerative Clustering

Abstract
In this paper we show that diversity-driven widening, the parallel exploration of the model space with focus on developing diverse models, can improve hierarchical agglomerative clustering. Depending on the selected linkage method, the model that is found through the widened search achieves a better silhouette coefficient than its sequentially built counterpart.
Alexander Fillbrunn, Michael R. Berthold

### Batch Steepest-Descent-Mildest-Ascent for Interactive Maximum Margin Clustering

Abstract
The maximum margin clustering principle extends support vector machines to unsupervised scenarios. We present a variant of this clustering scheme that can be used in the context of interactive clustering scenarios. In particular, our approach permits the class ratios to be manually defined by the user during the fitting process. Our framework can be used at early stages of the data mining process when no or very little information is given about the true clusters and class ratios. One of the key contributions is an adapted steepest-descent-mildest-ascent optimization scheme that can be used to fine-tune maximum margin clustering solutions in an interactive manner. We demonstrate the applicability of our approach in the context of remote sensing and astronomy with training sets consisting of hundreds of thousands of patterns.
Fabian Gieseke, Tapio Pahikkala, Tom Heskes

### Time Series Classification with Representation Ensembles

Abstract
Time series has attracted much attention in recent years, with thousands of methods for diverse tasks such as classification, clustering, prediction, and anomaly detection. Among all these tasks, classification is likely the most prominent task, accounting for most of the applications and attention from the research community. However, in spite of the huge number of methods available, there is a significant body of empirical evidence indicating that the 1-nearest neighbor algorithm ($$1$$-NN) in the time domain is “extremely difficult to beat”. In this paper, we evaluate the use of different data representations in time series classification. Our work is motivated by methods used in related areas such as signal processing and music retrieval. In these areas, a change of representation frequently reveals features that are not apparent in the original data representation. Our approach consists of using different representations such as frequency, wavelets, and autocorrelation to transform the time series into alternative decision spaces. A classifier is then used to provide a classification for each test time series in the alternative domain. We investigate how features provided in different domains can help in time series classification. We also experiment with different ensembles to investigate if the data representations are a good source of diversity for time series classification. Our extensive experimental evaluation approaches the issue of combining sets of representations and ensemble strategies, resulting in over 300 ensemble configurations.
Rafael Giusti, Diego F. Silva, Gustavo E. A. P. A. Batista

### Simultaneous Clustering and Model Selection for Multinomial Distribution: A Comparative Study

Abstract
In this paper, we study different discrete data clustering methods, which use the Model-Based Clustering (MBC) framework with the Multinomial distribution. Our study comprises several relevant issues, such as initialization, model estimation and model selection. Additionally, we propose a novel MBC method by efficiently combining the partitional and hierarchical clustering techniques. We conduct experiments on both synthetic and real data and evaluate the methods using accuracy, stability and computation time. Our study identifies appropriate strategies to be used for discrete data analysis with the MBC methods. Moreover, our proposed method is very competitive w.r.t. clustering accuracy and better w.r.t. stability and computation time.
Md. Abul Hasnat, Julien Velcin, Stéphane Bonnevay, Julien Jacques

### On Binary Reduction of Large-Scale Multiclass Classification Problems

Abstract
In the context of large-scale problems, traditional multiclass classification approaches have to deal with class imbalancement and complexity issues which make them inoperative in some extreme cases. In this paper we study a transformation that reduces the initial multiclass classification of examples into a binary classification of pairs of examples and classes. We present generalization error bounds that exhibit the interdependency between the pairs of examples and which recover known results on binary classification with i.i.d. data. We show the efficiency of the deduced algorithm compared to state-of-the-art multiclass classification strategies on two large-scale document collections especially in the interesting case where the number of classes becomes very large.
Bikash Joshi, Massih-Reza Amini, Ioannis Partalas, Liva Ralaivola, Nicolas Usunier, Eric Gaussier

### Probabilistic Active Learning in Datastreams

Abstract
In recent years, stream-based active learning has become an intensively investigated research topic. In this work, we propose a new algorithm for stream-based active learning that decides immediately whether to acquire a label (selective sampling). To this purpose, we extend our pool-based Probabilistic Active Learning framework into a framework for streams. In particular, we complement the notion of usefulness within a topological space (“spatial usefulness”) with the concept of “temporal usefulness”. To actively select the instances, for which labels must be acquired, we introduce the Balanced Incremental Quantile Filter (BIQF), an algorithm that assesses the usefulness of instances in a sliding window, ensuring that the predefined budget restrictions will be met within a given tolerance window. We compare our approach to other active learning approaches for streams and show the competitiveness of our method.
Daniel Kottke, Georg Krempl, Myra Spiliopoulou

### Implicitly Constrained Semi-supervised Least Squares Classification

Abstract
We introduce a novel semi-supervised version of the least squares classifier. This implicitly constrained least squares (ICLS) classifier minimizes the squared loss on the labeled data among the set of parameters implied by all possible labelings of the unlabeled data. Unlike other discriminative semi-supervised methods, our approach does not introduce explicit additional assumptions into the objective function, but leverages implicit assumptions already present in the choice of the supervised least squares classifier. We show this approach can be formulated as a quadratic programming problem and its solution can be found using a simple gradient descent procedure. We prove that, in a certain way, our method never leads to performance worse than the supervised classifier. Experimental results corroborate this theoretical result in the multidimensional case on benchmark datasets, also in terms of the error rate.
Jesse H. Krijthe, Marco Loog

### Diagonal Co-clustering Algorithm for Document-Word Partitioning

Abstract
We propose a novel diagonal co-clustering algorithm built upon the double Kmeans to address the problem of document-word co-clustering. At each iteration, the proposed algorithm seeks for a diagonal block structure of the data by minimizing a criterion based on the variance within and the centroid effect. In addition to be easy-to-interpret and efficient on sparse binary and continuous data, Diagonal Double Kmeans (DDKM) is also faster than other state-of-the art clustering algorithms. We illustrate our contribution using real datasets commonly used in document clustering.

### I-Louvain: An Attributed Graph Clustering Method

Abstract
Modularity allows to estimate the quality of a partition into communities of a graph composed of highly inter-connected vertices. In this article, we introduce a complementary measure, based on inertia, and specially conceived to evaluate the quality of a partition based on real attributes describing the vertices. We propose also I-Louvain, a graph nodes clustering method which uses our criterion, combined with Newman’s modularity, in order to detect communities in attributed graph where real attributes are associated with the vertices. Our experiments show that combining the relational information with the attributes allows to detect the communities more efficiently than using only one type of information. In addition, our method is more robust to data degradation.
David Combe, Christine Largeron, Mathias Géry, Előd Egyed-Zsigmond

### Class-Based Outlier Detection: Staying Zombies or Awaiting for Resurrection?

Abstract
This paper addresses the task of finding outliers within each class in the context of supervised classification problems. Class-based outliers are cases that deviate too much with respect to the cases of the same class. We introduce a novel method for outlier detection in labelled data based on Random Forests and compare it with existing methods both on artificial and real-world data. We show that it is competitive with the existing methods and sometimes gives more intuitive results. We also provide an overview for outlier detection in labelled data. The main contribution are two methods for class-based outlier description and interpretation.
Leona Nezvalová, Luboš Popelínský, Luis Torgo, Karel Vaculík

### Using Metalearning for Prediction of Taxi Trip Duration Using Different Granularity Levels

Abstract
Trip duration is an important metric for the management of taxi companies, as it affects operational efficiency, driver satisfaction and, above all, customer satisfaction. In particular, the ability to predict trip duration in advance can be very useful for allocating taxis to stands and finding the best route for trips. A data mining approach can be used to generate models for trip time prediction. In fact, given the amount of data available, different models can be generated for different taxis. Given the difference between the data collected by different taxis, the best model for each one can be obtained with different algorithms and/or parameter settings. However, finding the configuration that generates the best model for each taxi is computationally very expensive. In this paper, we propose the use of metalearning to address the problem of selecting the algorithm that generates the model with the most accurate predictions for each taxi. The approach is tested on data collected in the Drive-In project. Our results show that metalearning can help to select the algorithm with the best accuracy.

### Using Entropy as a Measure of Acceptance for Multi-label Classification

Abstract
Multi-label classifiers allow us to predict the state of a set of responses using a single model. A multi-label model is able to make use of the correlation between the labels to potentially increase the accuracy of its prediction. Critical applications of multi-label classifiers (such as medical diagnoses) require that the system’s confidence in prediction also be provided with the multi-label prediction. The specialist then uses the measure of confidence to assess whether to accept the system’s prediction. Probabilistic multi-label classification provides a categorical distribution over the set of responses, allowing us to observe the distribution, select the most probable response, and obtain an indication of confidence by the shape of the distribution. In this article, we examine if normalised entropy, a parameter of the probabilistic multi-label response distribution, correlates with the accuracy of the prediction and therefore can be used to gauge confidence in the system’s prediction. We found that for all three methods examined on each data set, the accuracy increases for the majority of the observations where the normalised entropy threshold decreases, showing that we can use normalised entropy to gauge a systems confidence, and hence use it as a measure of acceptance.
Laurence A. F. Park, Simeon Simoff

### Investigation of Node Deletion Techniques for Clustering Applications of Growing Self Organizing Maps

Abstract
Self Organizing Maps (SOM) are widely used in data mining and high-dimensional data visualization due to its unsupervised nature and robustness. Growing Self Organizing Maps (GSOM) is a variant of SOM algorithm which allows nodes to be grown so that it can represent the input space better. Without using a fixed 2D grid like SOM, GSOM starts with four nodes and keeps track of the quantization error in each node. New nodes are grown from an existing node if its error value exceeds a pre-defined threshold. Ability of the GSOM algorithm to represent input space accurately is vital to extend its applicability to a wider spectrum of problems. This ability can be improved by identifying nodes that represent low probability regions in the input space and removing them periodically from the map. This will improve the homogeneity and completeness of the final clustering result. A new extension to GSOM algorithm based on node deletion is proposed in this paper as a solution to this problem. Furthermore, two new algorithms inspired by cache replacement policies are presented. First algorithm is based on Adaptive Replacement Cache (ARC) and maintains two separate Least Recently Used (LRU) lists of the nodes. Second algorithm is built on Frequency Based Replacement policy (FBR) and maintains a single LRU list. These algorithms consider both recent and frequent trends in the GSOM grid before deciding on the nodes to be deleted. The experiments conducted suggest that the FBR based method for node deletion outperforms the standard algorithm and other existing node deletion methods.
Thilina Rathnayake, Maheshakya Wijewardena, Thimal Kempitiya, Kevin Rathnasekara, Thushan Ganegedara, Amal S. Perera, Damminda Alahakoon

### Exploratory Topic Modeling with Distributional Semantics

Abstract
As we continue to collect and store textual data in a multitude of domains, we are regularly confronted with material whose largely unknown thematic structure we want to uncover. With unsupervised, exploratory analysis, no prior knowledge about the content is required and highly open-ended tasks can be supported. In the past few years, probabilistic topic modeling has emerged as a popular approach to this problem. Nevertheless, the representation of the latent topics as aggregations of semi-coherent terms limits their interpretability and level of detail.
This paper presents an alternative approach to topic modeling that maps topics as a network for exploration, based on distributional semantics using learned word vectors. From the granular level of terms and their semantic similarity relations global topic structures emerge as clustered regions and gradients of concepts. Moreover, the paper discusses the visual interactive representation of the topic map, which plays an important role in supporting its exploration.
(Topic mapping code and demo available at http://​samuel.​ronnqvist.​fi/​topicMap/​)
Samuel Rönnqvist

### Assigning Geo-relevance of Sentiments Mined from Location-Based Social Media Posts

Abstract
Broad adoption of smartphones has increased the number of posts generated while people are going about their daily lives. Many of these posts are related to the location where that post is generated. Being able to infer a person’s sentiment toward a given location would be a boon to market researchers. The large percentage of system-generated content in these posts posed difficulties for calculating sentiment and assigning that sentiment to the location associated with the post. Consequently our proposed system implements a sequence of text cleaning functions which was completed with a naive Bayes classifier to determine if a post was more or less likely to be associated with an individual’s present location. The system was tested on set of nearly 30,000 posts from Foursquare that had been cross-posted to Twitter which resulted in reasonable precision but with a large number of posts discarded.
Randall Sanborn, Michael Farmer, Syagnik Banerjee

### Continuous and Discrete Deep Classifiers for Data Integration

Abstract
Data representation in a lower dimension is needed in applications, where information comes from multiple high dimensional sources. A final compact model has to be interpreted by human experts, and interpretation of a classifier whose weights are discrete is much more straightforward. In this contribution, we propose a novel approach, called Deep Kernel Dimensionality Reduction which is designed for learning layers of new compact data representations simultaneously. We show by experiments on standard and on real large-scale biomedical data sets that the proposed method embeds data in a new compact meaningful representation, and leads to a lower classification error compared to the state-of-the-art methods. We also consider some state-of-the art deep learners and their corresponding discrete classifiers. We illustrate by our experiments that although purely discrete models do not always perform better than real-valued classifiers, the trade-off between the model accuracy and the interpretability is quite reasonable.
Nataliya Sokolovska, Salwa Rizkalla, Karine Clément, Jean-Daniel Zucker

### A Bayesian Approach for Identifying Multivariate Differences Between Groups

Abstract
We present a novel approach to the problem of detecting multivariate statistical differences across groups of data. The need to compare data in a multivariate manner arises naturally in observational studies, randomized trials, comparative effectiveness research, abnormality and anomaly detection scenarios, and other application areas. In such comparisons, it is of interest to identify statistical differences across the groups being compared. The approach we present in this paper addresses this issue by constructing statistical models that describe the groups being compared and using a decomposable Bayesian Dirichlet score of the models to identify variables that behave statistically differently between the groups. In our evaluation, the new method performed significantly better than logistic lasso regression in indentifying differences in a variety of datasets under a variety of conditions.
Yuriy Sverchkov, Gregory F. Cooper

### Automatically Discovering Offensive Patterns in Soccer Match Data

Abstract
In recent years, many professional sports clubs have adopted camera-based tracking technology that captures the location of both the players and the ball at a high frequency. Nevertheless, the valuable information that is hidden in these performance data is rarely used in their decision-making process. What is missing are the computational methods to analyze these data in great depth. This paper addresses the task of automatically discovering patterns in offensive strategies in professional soccer matches. To address this task, we propose an inductive logic programming approach that can easily deal with the relational structure of the data. An experimental study shows the utility of our approach.
Jan Van Haaren, Vladimir Dzyuba, Siebe Hannosset, Jesse Davis

### Fast Algorithm Selection Using Learning Curves

Abstract
One of the challenges in Machine Learning to find a classifier and parameter settings that work well on a given dataset. Evaluating all possible combinations typically takes too much time, hence many solutions have been proposed that attempt to predict which classifiers are most promising to try. As the first recommended classifier is not always the correct choice, multiple recommendations should be made, making this a ranking problem rather than a classification problem. Even though this is a well studied problem, there is currently no good way of evaluating such rankings. We advocate the use of Loss Time Curves, as used in the optimization literature. These visualize the amount of budget (time) needed to converge to a acceptable solution. We also investigate a method that utilizes the measured performances of classifiers on small samples of data to make such recommendation, and adapt it so that it works well in Loss Time space. Experimental results show that this method converges extremely fast to an acceptable solution.
Jan N. van Rijn, Salisu Mamman Abdulrahman, Pavel Brazdil, Joaquin Vanschoren

### Optimally Weighted Cluster Kriging for Big Data Regression

Abstract
In business and academia we are continuously trying to model and analyze complex processes in order to gain insight and optimize. One of the most popular modeling algorithms is Kriging, or Gaussian Processes. A major bottleneck with Kriging is the amount of processing time of at least $$O(n^3)$$ and memory required $$O(n^2)$$ when applying this algorithm on medium to big data sets. With big data sets, that are more and more available these days, Kriging is not computationally feasible. As a solution to this problem we introduce a hybrid approach in which a number of Kriging models built on disjoint subsets of the data are properly weighted for the predictions. The proposed model is both in processing time and memory much more efficient than standard Global Kriging and performs equally well in terms of accuracy. The proposed algorithm is better scalable, and well suited for parallelization.
Bas van Stein, Hao Wang, Wojtek Kowalczyk, Thomas Bäck, Michael Emmerich

### Slower Can Be Faster: The iRetis Incremental Model Tree Learner

Abstract
Incremental learning is useful for processing streaming data, where data elements are produced at a high rate and cannot be stored. An incremental learner typically updates its model with each new instance that arrives. To avoid skipped instances, the model update must finish before the next element arrives, so it should be fast. However, there can be a trade-off between the efficiency of the update and how many updates are needed to get a good model. We investigate this trade-off in the context of model trees. We compare FIMT, a state-of-the-art incremental model tree learner developed for streaming data, with two alternative methods that use a more expensive update method. We find that for data with relatively low (but still realistic) dimensionality, the most expensive method often yields the best learning curve: the system converges faster to a smaller and more accurate model tree.
Denny Verbeeck, Hendrik Blockeel

### VoQs: A Web Application for Visualization of Questionnaire Surveys

Abstract
This paper is motivated by analyzing questionnaire data that collected from patients who suffer from an orphan disease. In order to decrease misdiagnoses and shorten the diagnosis time for people who have not been diagnosed yet but already have a long history of health problems, a research project about using questionnaire mode and data analysis methods to predetermine orphan disease has been set up and questionnaires were designed based on experiences from patients who already have a diagnosis.
The main focus of this work is to visualize answering patterns that characterize patients with a specific orphan disease, which questions are most useful to distinguish between certain orphan diseases and how well an answering pattern of a specific patient fits to the general pattern of those patients who share the same disease.
We borrow from the concept of sequence logos, commonly used in genetics to visualize the conservation of nucleotides in a strand of DNA or RNA. Instead of nucleotides, we have possible answers from a question.
Our proposed visualization techniques are not limited to questionnaires on orphan diseases but also can be applied to any questionnaire survey with closed-ended questions for which we are interested in answering characteristics of different groups.
Xiaowei Zhang, Frank Klawonn, Lorenz Grigull, Werner Lechner

### Backmatter

Weitere Informationen