2015 | Book

Statistical Learning and Data Sciences

Third International Symposium, SLDS 2015, Egham, UK, April 20-23, 2015, Proceedings


About this book

This book constitutes the refereed proceedings of the Third International Symposium on Statistical Learning and Data Sciences, SLDS 2015, held in Egham, Surrey, UK, April 2015.

The 36 revised full papers presented together with 2 invited papers were carefully reviewed and selected from 59 submissions. The papers are organized in topical sections on statistical learning and its applications, conformal prediction and its applications, new frontiers in data analysis for nuclear fusion, and geometric data analysis.

Table of Contents

Frontmatter

Invited Papers

Frontmatter
Learning with Intelligent Teacher: Similarity Control and Knowledge Transfer
In memory of Alexey Chervonenkis

This paper introduces an advanced setting of the machine learning problem in which an Intelligent Teacher is involved. During the training stage, the Intelligent Teacher provides the Student with information that contains, along with the classification of each example, additional privileged information (explanation) about this example. The paper describes two mechanisms that can be used to significantly accelerate the speed of the Student's training: (1) correction of the Student's concepts of similarity between examples, and (2) direct Teacher-Student knowledge transfer.

Vladimir Vapnik, Rauf Izmailov
Statistical Inference Problems and Their Rigorous Solutions
In memory of Alexey Chervonenkis

This paper presents direct settings and rigorous solutions of Statistical Inference problems. It shows that rigorous solutions require solving ill-posed Fredholm integral equations of the first kind in the situation where not only the right-hand side of the equation is an approximation, but the operator in the equation is also defined approximately. Using the Stefanyuk-Vapnik theory for solving such operator equations, constructive methods of empirical inference are introduced. These methods are based on a new concept called the $$V$$-matrix. This matrix captures geometric properties of the observation data that are ignored by classical statistical methods.

Vladimir Vapnik, Rauf Izmailov

Statistical Learning and Its Applications

Frontmatter
Feature Mapping Through Maximization of the Atomic Interclass Distances

We discuss a way of implementing feature mapping for classification problems by expressing the given data through a set of functions comprising a mixture of convex functions. In this way, a pattern's potential of belonging to a certain class is mapped in a way that promotes interclass separation, data visualization and understanding of the problem's mechanics. In terms of enhancing separation, the algorithm can be used in two ways: to construct problem features to feed a classification algorithm, or to detect a subset of problem attributes that could be safely ignored. In terms of problem understanding, the algorithm can be used to construct a low-dimensional feature mapping that makes problem visualization possible. The whole approach is based on the derivation of an optimization objective which is solved with a genetic algorithm. The algorithm was tested on various datasets and is successful in providing improved evaluation results. Specifically, for the Wisconsin breast cancer problem the algorithm achieves a generalization success rate of 98%, while for the Pima Indian diabetes problem it provides a generalization success rate of 82%.

Savvas Karatsiolis, Christos N. Schizas
Adaptive Design of Experiments for Sobol Indices Estimation Based on Quadratic Metamodel

Sensitivity analysis aims to identify which input parameters of a given mathematical model are the most important. One of the well-known sensitivity metrics is the Sobol sensitivity index. There are a number of approaches to Sobol indices estimation. In general, these approaches can be divided into two groups: Monte Carlo methods and methods based on metamodeling. Monte Carlo methods have a well-established mathematical apparatus and statistical properties; however, they require a lot of model runs. Methods based on metamodeling reduce the required number of model runs, but may be difficult to analyse. In this work, we focus on the metamodeling approach to Sobol indices estimation and, particularly, on the initial step of this approach: the design of experiments. Based on the concept of D-optimality, we propose a method for constructing an adaptive experimental design that is effective for calculating Sobol indices from a quadratic metamodel. A comparison of the proposed design of experiments with other methods is performed.
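
The abstract contrasts Monte Carlo estimation with metamodel-based estimation of Sobol indices. As a point of reference only (the paper's adaptive D-optimal design is not reproduced here), the following is a minimal pick-freeze Monte Carlo sketch of first-order indices using the Jansen estimator; `model` and `sampler` are hypothetical stand-ins for the simulator and its input distribution.

```python
import numpy as np

def first_order_sobol(model, sampler, n=100_000, seed=0):
    """Pick-freeze Monte Carlo estimate of first-order Sobol indices (Jansen estimator)."""
    rng = np.random.default_rng(seed)
    A, B = sampler(n, rng), sampler(n, rng)      # two independent (n, d) input samples
    yA, yB = model(A), model(B)
    var_y = np.var(np.concatenate([yA, yB]))
    s = np.empty(A.shape[1])
    for i in range(A.shape[1]):
        ABi = A.copy()
        ABi[:, i] = B[:, i]                      # A with column i taken from B
        # E[Var(Y | X_i)] ~ mean((yB - y_ABi)^2) / 2, since B and ABi share only column i
        s[i] = 1.0 - 0.5 * np.mean((yB - model(ABi)) ** 2) / var_y
    return s

# Toy check: for Y = X1 + 2*X2 with independent standard normal inputs,
# the first-order indices are 0.2 and 0.8.
sampler = lambda n, rng: rng.standard_normal((n, 2))
print(first_order_sobol(lambda X: X[:, 0] + 2 * X[:, 1], sampler))
```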

Evgeny Burnaev, Ivan Panin
GoldenEye++: A Closer Look into the Black Box

Models with high predictive performance are often opaque, i.e., they do not allow for direct interpretation, and are hence of limited value when the goal is to understand the reasoning behind predictions. A recently proposed algorithm, GoldenEye, allows detection of groups of interacting variables exploited by a model. We employed this technique in conjunction with random forests generated from data obtained from electronic patient records for the task of detecting adverse drug events (ADEs). We propose a refined version of the GoldenEye algorithm, called GoldenEye++, utilizing a more sensitive grouping metric. An empirical investigation comparing the two algorithms on 27 datasets related to detecting ADEs shows that the new version of the algorithm in several cases finds groups of medically relevant interacting attributes, corresponding to prescribed drugs, undetected by the previous version. This suggests that the GoldenEye++ algorithm can be a useful tool for finding novel (adverse) drug interactions.

Andreas Henelius, Kai Puolamäki, Isak Karlsson, Jing Zhao, Lars Asker, Henrik Boström, Panagiotis Papapetrou
Gaussian Process Regression for Structured Data Sets

Approximation algorithms are widely used in many engineering problems. To obtain a data set for approximation, a factorial design of experiments is often used. In such cases the size of the data set can be very large. Therefore, one of the most popular approximation algorithms, Gaussian process regression, can hardly be applied due to its computational complexity. In this paper a new approach to Gaussian process regression in the case of a factorial design of experiments is proposed. It allows exact inference to be computed efficiently and handles large multidimensional and anisotropic data sets.
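
The abstract does not spell out the algorithm, but exact GP inference on a full factorial grid is usually made tractable by exploiting the Kronecker structure of the covariance matrix of a product kernel. The sketch below illustrates that generic trick (not necessarily the authors' exact formulation): it solves (K1 kron K2 + noise*I) alpha = y using only the eigendecompositions of the small per-factor matrices.

```python
import numpy as np

def kron_gp_alpha(K1, K2, y_grid, noise=1e-2):
    """Solve (K1 kron K2 + noise*I) alpha = vec(y_grid) without forming the full matrix.

    K1: (n1, n1) kernel matrix over the first factor's levels,
    K2: (n2, n2) kernel matrix over the second factor's levels,
    y_grid: (n1, n2) responses observed on the full factorial grid.
    """
    w1, Q1 = np.linalg.eigh(K1)
    w2, Q2 = np.linalg.eigh(K2)
    # Rotate the observations into the joint eigenbasis: (Q1 kron Q2)^T vec(Y) = Q1^T Y Q2.
    V = Q1.T @ y_grid @ Q2
    # Eigenvalues of the Kronecker product are all pairwise products w1_i * w2_j.
    V = V / (np.outer(w1, w2) + noise)
    # Rotate back: (Q1 kron Q2) vec(V) = Q1 V Q2^T.
    return Q1 @ V @ Q2.T        # alpha, reshaped on the grid

# Usage: with a product kernel k, K1 = k(levels of factor 1), K2 = k(levels of factor 2);
# the posterior mean at a new point (a, b) is sum_ij k(a, x1_i) * k(b, x2_j) * alpha[i, j].
```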

Mikhail Belyaev, Evgeny Burnaev, Yermek Kapushev
Adaptive Design of Experiments Based on Gaussian Processes

We consider a problem of adaptive design of experiments for Gaussian process regression. We introduce a Bayesian framework, which provides theoretical justification for some well-know heuristic criteria from the literature and also gives an opportunity to derive some new criteria. We also perform testing of methods in question on a big set of multidimensional functions.

Evgeny Burnaev, Maxim Panov
Forests of Randomized Shapelet Trees

Shapelets have recently been proposed for data series classification, due to their ability to capture phase independent and local information. Decision trees based on shapelets have been shown to provide not only interpretable models, but also, in many cases, state-of-the-art predictive performance. Shapelet discovery is, however, computationally costly, and although several techniques for speeding up this task have been proposed, the computational cost is still in many cases prohibitive. In this work, an ensemble-based method, referred to as Random Shapelet Forest (RSF), is proposed, which builds on the success of the random forest algorithm, and which is shown to have a lower computational complexity than the original shapelet tree learning algorithm. An extensive empirical investigation shows that the algorithm provides competitive predictive performance and that a proposed way of calculating importance scores can be used to successfully identify influential regions.
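
The core primitive behind shapelet trees is the distance from a series to a candidate shapelet, i.e. the minimum Euclidean distance over all alignments. The sketch below is a generic illustration of a random shapelet transform, not the authors' full RSF algorithm: it samples random subsequences and turns each series into a feature vector that any standard ensemble learner could consume.

```python
import numpy as np

def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between a shapelet and any window of the series."""
    L = len(shapelet)
    windows = np.lib.stride_tricks.sliding_window_view(series, L)
    return np.sqrt(((windows - shapelet) ** 2).sum(axis=1)).min()

def random_shapelet_features(X, n_shapelets=50, min_len=5, max_len=20, seed=0):
    """X: (n_series, series_len) array; returns an (n_series, n_shapelets) feature matrix."""
    rng = np.random.default_rng(seed)
    shapelets = []
    for _ in range(n_shapelets):
        L = int(rng.integers(min_len, max_len + 1))
        row = rng.integers(len(X))
        start = rng.integers(X.shape[1] - L + 1)
        shapelets.append(X[row, start:start + L])
    return np.array([[shapelet_distance(s, sh) for sh in shapelets] for s in X])
```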

Isak Karlsson, Panagiotis Papapetrou, Henrik Boström
Aggregation of Adaptive Forecasting Algorithms Under Asymmetric Loss Function

The paper deals with applying the strong aggregating algorithm to games with an asymmetric loss function. A particular example of such games is the problem of time series forecasting, where the specific losses from under-forecasting and over-forecasting may vary considerably. We use the aggregating algorithm for building compositions of adaptive forecasting algorithms. The paper specifies sufficient conditions under which a composition based on the aggregating algorithm performs as well as the best of the experts. As a result, we find a theoretical bound for the loss process of a given composition under an asymmetric loss function. Finally, we compare the composition based on the aggregating algorithm to other well-known compositions in experiments with real data.
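
The abstract builds on Vovk's aggregating algorithm; the substitution function needed for a specific asymmetric loss is not reproduced here. The sketch below shows only the underlying exponentially weighted mixing of expert forecasts, with the loss left as a parameter (a hypothetical asymmetric pinball loss is used as an example).

```python
import numpy as np

def pinball_loss(y, f, tau=0.8):
    """Asymmetric loss: under-forecasting is penalised more than over-forecasting for tau > 0.5."""
    return np.where(y >= f, tau * (y - f), (1 - tau) * (f - y))

def aggregate_forecasts(expert_preds, outcomes, loss=pinball_loss, eta=1.0):
    """expert_preds: (T, n_experts) array; returns the mixture forecast at each step.

    Weights are proportional to exp(-eta * cumulative loss); the mixture here is a plain
    weighted mean, a simplification of the aggregating algorithm's substitution step."""
    T, n = expert_preds.shape
    cum_loss = np.zeros(n)
    mixture = np.empty(T)
    for t in range(T):
        w = np.exp(-eta * (cum_loss - cum_loss.min()))   # shift for numerical stability
        w /= w.sum()
        mixture[t] = w @ expert_preds[t]
        cum_loss += loss(outcomes[t], expert_preds[t])
    return mixture
```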

Alexey Romanenko
Visualization and Analysis of Multiple Time Series by Beanplot PCA

Beanplot time series have been introduced by the authors as an aggregated data representation, in terms of peculiar symbolic data, for dealing with large temporal datasets. In the presence of multiple beanplot time series it can be very interesting for interpretative aims to find useful syntheses. Here we propose an extension, based on PCA, of the previous approach to multiple beanplot time series. We show the usefulness of our proposal in the context of the analysis of different financial markets.

Carlo Drago, Carlo Natale Lauro, Germana Scepi
Recursive SVM Based on TEDA

A new method for incremental learning of an SVM model, incorporating the recently proposed TEDA approach, is presented. The method updates the widely renowned incremental SVM approach and introduces new TEDA and RDE kernels which are learnable and capable of adapting to the data. The slack variables are also adaptive and depend on each point's 'importance', combining outlier detection with SVM slack variables to deal with misclassifications. Some suggestions on evolving systems based on SVM are also provided. Image recognition examples are given as a 'proof of concept' for the method.

Dmitry Kangin, Plamen Angelov
RDE with Forgetting: An Approximate Solution for Large Values of $$k$$ with an Application to Fault Detection Problems

Recursive density estimation (RDE) is a very powerful metric, based on a kernel function, used to detect outliers in an n-dimensional data set. Since it is calculated recursively, it is a very attractive solution for on-line and real-time applications. However, in its original formulation the equation defined for density calculation is considerably conservative, which may not be suitable for applications that require a fast response to dynamic changes in the process. For on-line applications, the value of k, which represents the index of the data sample, may increase indefinitely and, since the mean update equation depends directly on the number of samples read so far, the influence of a new data sample may be nearly insignificant when k is large. In practice, this characteristic creates a stationary scenario that may not be adequate for fault detection applications, for example. In order to overcome this problem, we propose a new approach to RDE that retains its recursive characteristics. This new approach, called RDE with forgetting, introduces the concepts of a moving mean and a forgetting factor, detailed in the next sections. The proposal is tested and validated on a well-known real-data fault detection benchmark, but can be generalized to other problems.
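
For reference, the recursive density of a point x is commonly written as D = 1 / (1 + ||x - mu||^2 + S - ||mu||^2), with mu the running mean of the data and S the running mean of the squared norms. The sketch below is only a minimal illustration of the forgetting idea described above, replacing the global means with exponentially weighted ones; the exact update used in the paper may differ.

```python
import numpy as np

class RDEWithForgetting:
    """Recursive density estimation with an exponential forgetting factor."""

    def __init__(self, forgetting=0.95):
        self.lam = forgetting      # closer to 1 = longer memory
        self.mu = None             # running (moving) mean of the samples
        self.sq = 0.0              # running mean of the squared norms

    def update(self, x):
        x = np.asarray(x, dtype=float)
        if self.mu is None:
            self.mu, self.sq = x.copy(), float(x @ x)
        else:
            self.mu = self.lam * self.mu + (1 - self.lam) * x
            self.sq = self.lam * self.sq + (1 - self.lam) * float(x @ x)
        # Density is close to 1 for points near the (recent) mean and small for outliers.
        diff = x - self.mu
        return 1.0 / (1.0 + float(diff @ diff) + self.sq - float(self.mu @ self.mu))
```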

Clauber Gomes Bezerra, Bruno Sielly Jales Costa, Luiz Affonso Guedes, Plamen Parvanov Angelov
Sit-to-Stand Movement Recognition Using Kinect

This paper examines the application of machine-learning techniques to human movement data in order to recognise and compare movements made by different people. Data from an experimental set-up using a sit-to-stand movement are first collected using the Microsoft Kinect input sensor, then normalized and subsequently compared using the assigned labels for correct and incorrect movements. We show that attributes can be extracted from the time series produced by the Kinect sensor using a dynamic time-warping technique. The extracted attributes are then fed to a random forest algorithm, to recognise anomalous behaviour in time series of joint measurements over the whole movement. For comparison, the k-Nearest Neighbours algorithm is also used on the same attributes with good results. Both methods’ results are compared using Multi-Dimensional Scaling for clustering visualisation.
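
As a reference for the attribute-extraction step mentioned above, here is a minimal dynamic time warping distance between two 1-D joint-measurement series; the actual Kinect preprocessing and the attributes used in the paper are not reproduced.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance between two 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

# e.g. dtw_distance(recorded_joint_angle, reference_joint_angle) could serve as one attribute
# per joint, fed to a random forest or k-NN classifier as described above.
```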

Erik Acorn, Nikos Dipsis, Tamar Pincus, Kostas Stathis
Additive Regularization of Topic Models for Topic Selection and Sparse Factorization

Probabilistic topic modeling of text collections is a powerful tool for statistical text analysis. Determining the optimal number of topics remains a challenging problem in topic modeling. We propose a simple entropy regularization for topic selection in terms of Additive Regularization of Topic Models (ARTM), a multicriteria approach for combining regularizers. The entropy regularization gradually eliminates insignificant and linearly dependent topics. This process converges to the correct value on semi-real data. On real text collections it can be combined with sparsing, smoothing and decorrelation regularizers to produce a sequence of models with different numbers of well interpretable topics.

Konstantin Vorontsov, Anna Potapenko, Alexander Plavin
Social Web-Based Anxiety Index’s Predictive Information on S&P 500 Revisited

There has been increasing interest recently in examining the possible relationships between emotions expressed online and stock markets. Most of the previous studies claiming that emotions have predictive influence on the stock market do so by developing various machine learning predictive models, but do not validate their claims rigorously by analysing the statistical significance of their findings. In turn, the few works that attempt to statistically validate such claims suffer from important limitations in their statistical approaches. In particular, stock market data exhibit erratic volatility, and this time-varying volatility makes any possible relationship between these variables non-linear, which tends to statistically invalidate linear approaches. Our work tackles these limitations and extends linear frameworks by proposing a new non-linear statistical approach that accounts for non-linearity and heteroscedasticity.

Rapheal Olaniyan, Daniel Stamate, Doina Logofatu
Exploring the Link Between Gene Expression and Protein Binding by Integrating mRNA Microarray and ChIP-Seq Data

ChIP-sequencing experiments are routinely used to study genome-wide chromatin marks. Due to the high-cost and complexity associated with this technology, it is of great interest to investigate whether the low-cost option of microarray experiments can be used in combination with ChIP-seq experiments. Most integrative analyses do not consider important features of ChIP-seq data, such as spatial dependencies and ChIP-efficiencies. In this paper, we address these issues by applying a Markov random field model to ChIP-seq data on the protein Brd4, for which both ChIP-seq and microarray data are available on the same biological conditions. We investigate the correlation between the enrichment probabilities around transcription start sites, estimated by the Markov model, and microarray gene expression values. Our preliminary results suggest that binding of the protein is associated with lower gene expression, but differential binding across different conditions does not show an association with differential expression of the associated genes.

Mohsina Mahmuda Ferdous, Veronica Vinciotti, Xiaohui Liu, Paul Wilson
Evolving Smart URL Filter in a Zone-Based Policy Firewall for Detecting Algorithmically Generated Malicious Domains

The Domain Generation Algorithm (DGA) has evolved into one of the most dangerous and "undetectable" digital security deception methods. The complexity of this approach (combined with the intricate function of fast-flux "botnet" networks) results in an extremely risky threat which is hard to trace and, in most cases, should be treated as a zero-day vulnerability. This kind of combined attack is responsible for malware distribution and for the infection of Information Systems. Moreover, it is related to illegal actions such as money-mule recruitment sites, phishing websites, illicit online pharmacies, extreme or illegal adult content sites, malicious browser-exploit sites and web traps for distributing viruses. Traditional digital security mechanisms face such vulnerabilities in a conventional manner: they often create false alarms and fail to forecast the attacks. This paper proposes an innovative, fast and accurate evolving Smart URL Filter (eSURLF) in a Zone-based Policy Firewall (ZFW) which uses evolving Spiking Neural Networks (eSNN) for detecting algorithmically generated malicious domain names.

Konstantinos Demertzis, Lazaros Iliadis
Lattice-Theoretic Approach to Version Spaces in Qualitative Decision Making

We present a lattice-theoretic approach to version spaces in multicriteria preference learning and discuss some complexity aspects. In particular, we show that the description of version spaces in the preference model based on the Sugeno integral is an NP-hard problem, even for simple instances.

Miguel Couceiro, Tamás Waldhauser

Conformal Prediction and Its Applications

Frontmatter
A Comparison of Three Implementations of Multi-Label Conformal Prediction

The calibration property of Multi-Label Learning (MLL) has not been well studied. Because of the excellent calibration property of Conformal Predictors (CP), it is valuable to achieve calibrated MLL prediction via CP. Three practical implementations of Multi-Label Conformal Predictors (MLCP) can be established: Instance Reproduction MLCP (IR-MLCP), Binary Relevance MLCP (BR-MLCP) and Power Set MLCP (PS-MLCP). Experimental results on benchmark datasets show that all three MLCP methods possess the calibration property. Comparatively speaking, BR-MLCP performs better than the other two in terms of prediction efficiency and computational cost.

Huazhen Wang, Xin Liu, Ilia Nouretdinov, Zhiyuan Luo
Modifications to p-Values of Conformal Predictors

The original definition of a p-value in a conformal predictor can sometimes lead to overly conservative prediction regions when the number of training or calibration examples is small. The situation can be improved by using a modification to define an approximate p-value. Two modified p-values are presented that converge to the original p-value as the number of training or calibration examples goes to infinity.

Numerical experiments empirically support the use of a p-value we call the interpolated p-value for conformal prediction. The interpolated p-value seems to produce prediction sets whose error rate corresponds well to the prescribed significance level.
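
For context, the standard smoothed conformal p-value that these modifications refine is simply the (randomised) fraction of calibration nonconformity scores at least as large as the test score. A minimal sketch:

```python
import numpy as np

def conformal_p_value(calibration_scores, test_score, seed=None):
    """Standard smoothed conformal p-value from calibration nonconformity scores.

    With few calibration examples this discrete p-value is coarse, which is the
    situation the modified/interpolated p-values above are designed to improve."""
    a = np.asarray(calibration_scores)
    tau = np.random.default_rng(seed).uniform()       # smoothing for exact validity
    greater = np.sum(a > test_score)
    equal = np.sum(a == test_score) + 1                # the test example counts as equal
    return (greater + tau * equal) / (len(a) + 1)
```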

Lars Carlsson, Ernst Ahlberg, Henrik Boström, Ulf Johansson, Henrik Linusson
Cross-Conformal Prediction with Ridge Regression

Cross-Conformal Prediction (CCP) is a recently proposed approach for overcoming the computational inefficiency problem of Conformal Prediction (CP) without sacrificing as much informational efficiency as Inductive Conformal Prediction (ICP). In effect CCP is a hybrid approach combining the ideas of cross-validation and ICP. In the case of classification the predictions of CCP have been shown to be empirically valid and more informationally efficient than those of the ICP. This paper introduces CCP in the regression setting and examines its empirical validity and informational efficiency compared to that of the original CP and ICP when combined with Ridge Regression.
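
A simplified sketch in the spirit of CCP, assuming ridge regression as the underlying model: each fold serves once as a calibration set for a model trained on the remaining folds, and the pooled out-of-fold residuals yield a prediction interval. The paper's exact way of combining per-fold information may differ from this pooled-residual shortcut.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def cross_conformal_interval(X, y, x_new, alpha=0.1, n_folds=5, ridge_lambda=1.0):
    """Prediction interval for x_new by pooling out-of-fold absolute residuals."""
    scores, preds_new = [], []
    for train_idx, cal_idx in KFold(n_folds, shuffle=True, random_state=0).split(X):
        model = Ridge(alpha=ridge_lambda).fit(X[train_idx], y[train_idx])
        scores.extend(np.abs(y[cal_idx] - model.predict(X[cal_idx])))  # nonconformity scores
        preds_new.append(model.predict(x_new.reshape(1, -1))[0])
    q = np.quantile(scores, 1 - alpha, method="higher")   # conservative calibration quantile
    centre = np.mean(preds_new)                           # average the fold models' predictions
    return centre - q, centre + q
```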

Harris Papadopoulos
Handling Small Calibration Sets in Mondrian Inductive Conformal Regressors

In inductive conformal prediction, calibration sets must contain an adequate number of instances to support the chosen confidence level. This problem is particularly prevalent when using Mondrian inductive conformal prediction, where the input space is partitioned into independently valid prediction regions. In this study, Mondrian conformal regressors, in the form of regression trees, are used to investigate two problematic aspects of small calibration sets. If there are too few calibration instances to support the significance level, we suggest using either extrapolation or altering the model. In situations where the desired significance level is between two calibration instances, the standard procedure is to choose the more nonconforming one, thus guaranteeing validity, but producing conservative conformal predictors. The suggested solution is to use interpolation between calibration instances. All proposed techniques are empirically evaluated and compared to the standard approach on 30 benchmark data sets. The results show that while extrapolation often results in invalid models, interpolation works extremely well and provides increased efficiency with preserved empirical validity.
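
The interpolation idea can be illustrated directly on the calibration quantile: with few calibration scores, stepping up to the more nonconforming instance is valid but conservative, whereas linear interpolation between the two neighbouring scores gives tighter regions. A small sketch (the scores below are made up):

```python
import numpy as np

calibration_scores = np.array([0.4, 0.9, 1.3, 2.1, 3.0])   # e.g. |y - y_hat| on a tiny calibration set
significance = 0.1                                          # 90% prediction intervals

# Standard (conservative) choice: step up to the next calibration instance.
q_conservative = np.quantile(calibration_scores, 1 - significance, method="higher")
# Interpolated choice: linear interpolation between the two neighbouring instances.
q_interpolated = np.quantile(calibration_scores, 1 - significance, method="linear")

print(q_conservative, q_interpolated)   # 3.0 vs 2.64 -> tighter intervals with interpolation
```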

Ulf Johansson, Ernst Ahlberg, Henrik Boström, Lars Carlsson, Henrik Linusson, Cecilia Sönströd
Conformal Anomaly Detection of Trajectories with a Multi-class Hierarchy

The paper investigates the problem of anomaly detection in the maritime trajectory surveillance domain. Conformal predictors in this paper are used as a basis for anomaly detection. A multi-class hierarchy framework is presented for different class representations. Experiments are conducted with data taken from shipping vessel trajectories using data obtained through AIS (Automatic Identification System) broadcasts and the results are discussed.

James Smith, Ilia Nouretdinov, Rachel Craddock, Charles Offer, Alexander Gammerman
Model Selection Using Efficiency of Conformal Predictors

The Conformal Prediction framework guarantees error calibration in the online setting, but its practical usefulness in real-world problems is affected by its efficiency, i.e. the size of the prediction region. Narrow prediction regions that maintain validity would be the most useful conformal predictors. In this work, we use the efficiency of conformal predictors as a measure to perform model selection in classifiers. We pose this objective as an optimization problem on the model parameters, and test this approach with the k-Nearest Neighbour classifier. Our results on the USPS and other standard datasets show promise in this approach.

Ritvik Jaiswal, Vineeth N. Balasubramanian
Confidence Sets for Classification

Conformal predictors, introduced by [13], serve to build prediction intervals by exploiting a notion of conformity of the new data point with previously observed data. In the classification problem, a conformal predictor may respond to the problem of classification with reject option. In the present paper, we propose a novel method of construction of confidence sets, inspired both by conformal prediction and by classification with reject option. An important aspect of these confidence sets is that, when there are several observations to label, they control the proportion of the data we want to label. Moreover, we introduce a notion of risk adapted to classification with reject option. We show that for this risk, the confidence set risk converges to the risk of the confidence set based on the Bayes classifier.

Christophe Denis, Mohamed Hebiri
Conformal Clustering and Its Application to Botnet Traffic

The paper describes an application of a novel clustering technique based on Conformal Predictors. Unlike traditional clustering methods, this technique makes it possible to control the number of objects that are left outside of any cluster by setting a required confidence level. The paper considers a multi-class unsupervised learning problem, and the developed technique is applied to bot-generated network traffic. An extended set of features describing the bot traffic is presented and the results are discussed.

Giovanni Cherubin, Ilia Nouretdinov, Alexander Gammerman, Roberto Jordaney, Zhi Wang, Davide Papini, Lorenzo Cavallaro
Interpretation of Conformal Prediction Classification Models

We present a method for interpretation of conformal prediction models. The discrete gradient of the largest p-value is calculated with respect to object space. A criterion is applied to identify the most important component of the gradient and the corresponding part of the object is visualized.

The method is exemplified with data from drug discovery relating chemical compounds to mutagenicity. Furthermore, a comparison is made to already established important subgraphs with respect to mutagenicity and this initial assessment shows very useful results with respect to interpretation of a conformal predictor.

Ernst Ahlberg, Ola Spjuth, Catrin Hasselgren, Lars Carlsson

New Frontiers in Data Analysis for Nuclear Fusion

Frontmatter
Confinement Regime Identification Using Artificial Intelligence Methods

The L-H transition is a remarkable self-organization phenomenon that occurs in Magnetically Confined Nuclear Fusion (MCNF) devices. For research purposes, it is relevant to create models able to determine the confinement regime the plasma is in by using, out of the wide number of signals measured in each discharge, just a reduced number of them. It is also desirable to reach a general model, applicable not only to one device but to all of them. From a data-driven modelling point of view, this implies the careful (and ideally automatic) selection of the signals related to the phenomenon in order to feed them into an equation able to determine the confinement mode. Using a supervised machine learning method also requires the tuning of some internal parameters. This is an optimization problem, tackled in this study with Genetic Algorithms (GAs). The results prove that reliable and universal laws describing the L-H transition with more than ~98.60% classification accuracy can be attained using only 3 input signals.

G. A. Rattá, J. Vega
How to Handle Error Bars in Symbolic Regression for Data Mining in Scientific Applications

Symbolic regression via genetic programming has become a very useful tool for the exploration of large databases for scientific purposes. The technique allows testing hundreds of thousands of mathematical models to find the one most adequate to describe the phenomenon under study, given the data available. In this paper, a major refinement is described, which allows handling the problem of error bars. In particular, it is shown how using the geodesic distance on Gaussian manifolds as the fitness function allows the uncertainties in the data to be taken into account from the beginning of the data analysis process. To exemplify the importance of this development, the proposed methodological improvement has been applied to a set of synthetic data and the results have been compared with more traditional solutions.

A. Murari, E. Peluso, M. Gelfusa, M. Lungaroni, P. Gaudio
Applying Forecasting to Fusion Databases

This manuscript describes the application of four forecasting methods to predict future magnitudes of plasma signals during a discharge. One application of the forecasting could be to provide signal magnitudes in advance in order to detect, in real time, previously known patterns such as plasma instabilities. The forecasting was implemented with four different prediction techniques drawn from classical and machine learning approaches. The results show that the predictions can reach a high level of accuracy and precision; in fact, over 95% of predictions match the real magnitudes in most signals.

Gonzalo Farias, Sebastián Dormido-Canto, Jesús Vega, Norman Díaz
Computationally Efficient Five-Class Image Classifier Based on Venn Predictors

This article shows the computational efficiency of an image classifier based on a Venn predictor with the nearest centroid taxonomy. It has been applied to the automatic classification of the images acquired by the Thomson Scattering diagnostic of the TJ-II stellarator. The Haar wavelet transform is used to reduce the image dimensionality. The average time per image to classify 1144 examples (in an on-line learning setting) is 0.166 ms. The classification of the last image takes 187 ms.

J. Vega, S. Dormido-Canto, F. Martínez, I. Pastor, M. C. Rodríguez
SOM and Feature Weights Based Method for Dimensionality Reduction in Large Gauss Linear Models

Discovering the most important variables is a crucial step for accelerating model building without losing the potential predictive power of the data. In many practical problems it is necessary to discover the dependent variables and the ones that are redundant. In this paper an automatic method for discovering the most important signals or characteristics for building data-driven models is presented. This method was developed with very high-dimensional input spaces in mind, where many variables are independent but many others are combinations of the independent ones. The method is based on the SOM neural network and a feature-weighting method very similar to Linear Discriminant Analysis (LDA), with some modifications.

Fernando Pavón, Jesús Vega, Sebastián Dormido Canto

Geometric Data Analysis

Frontmatter
Assigning Objects to Classes of a Euclidean Ascending Hierarchical Clustering

In a Euclidean ascending hierarchical clustering (AHC, Ward's method), the usual method for allocating a supplementary object to a cluster is based on the geometric distance from the object-point to the barycenter of the cluster. The main drawback of this method is that it does not take into consideration that clusters differ as regards weights, shapes and dispersions. Neither does it take into account successive dichotomies of the hierarchy of classes. This is why we propose a new ranking rule adapted to geometric data analysis that takes the shape of clusters into account. From a set of supplementary objects, we propose a strategy for assigning these objects to clusters stemming from an AHC. The idea is to assign supplementary objects at the local level of a node to one of its two successors until a cluster of the partition under study is reached. We define a criterion based on the ratio of Mahalanobis distances from the object-point to the barycenters of the two clusters that make up the node.

We first introduce the principle of the method, and we apply it to a barometric survey carried out by CEVIPOF on various components of trust among French citizens. We compare the evolution of clusters of individuals between 2009 and 2012, then 2013.
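
A minimal sketch of the node-level assignment rule described above, under the obvious assumption that each successor cluster is summarised by its barycentre and covariance matrix: the supplementary object is sent to whichever successor has the smaller Mahalanobis distance (equivalently, the ratio of the two distances is compared to 1).

```python
import numpy as np

def mahalanobis(x, cluster_points):
    """Mahalanobis distance from x to the barycentre of a cluster of points (rows)."""
    mu = cluster_points.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(cluster_points, rowvar=False))
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

def assign_to_successor(x, left_points, right_points):
    """Send a supplementary object to the successor whose shape-aware distance is smaller."""
    ratio = mahalanobis(x, left_points) / mahalanobis(x, right_points)
    return "left" if ratio < 1.0 else "right"

# Repeating this choice from the root node downwards descends the hierarchy until a
# cluster of the partition under study is reached.
```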

Brigitte Le Roux, Frédérik Cassor
The Structure of Argument: Semantic Mapping of US Supreme Court Cases

We semantically map out the flow of the narrative involved in a United States Supreme Court case. Our objective is not only the static analysis of semantics but, more so, the trajectory of argument. This includes consideration of those who are involved, the Justices and the Attorneys. We therefore study the flow of argument. Geometrical (metric, latent semantic) and topological (ultrametric, hierarchical) analyses are used in our analytics.

Fionn Murtagh, Mohsen Farid
Supporting Data Analytics for Smart Cities: An Overview of Data Models and Topology

An overview of data models suitable for smart cities is given. CityGML and $$G$$-maps implicitly model the underlying combinatorial structure, whereas topological databases make this structure explicit. This combinatorial structure is the basis for topological queries, and topological consistency of such data models allows for correct answers to topological queries. A precise definition of topological consistency in the two-dimensional case is given and an application to data models is discussed.

Patrick E. Bradley
Manifold Learning in Regression Tasks

The paper presents a new geometrically motivated method for non-linear regression based on manifold learning techniques. The regression problem is to construct a predictive function which estimates an unknown smooth mapping $$f$$ from $$q$$-dimensional inputs to $$m$$-dimensional outputs, based on a training data set consisting of given 'input-output' pairs. The unknown mapping $$f$$ determines a $$q$$-dimensional manifold $$M(f)$$ consisting of all the 'input-output' vectors, which is embedded in $$(q+m)$$-dimensional space and covered by a single chart; the training data set determines a sample from this manifold. Modern manifold learning methods allow constructing an estimator $$M^*$$ from the manifold-valued sample which accurately approximates the manifold. The proposed method, called Manifold Learning Regression (MLR), finds the predictive function $$f_{MLR}$$ that ensures the equality $$M(f_{MLR}) = M^*$$. MLR simultaneously estimates the $$m \times q$$ Jacobian matrix of the mapping $$f$$.

Alexander Bernstein, Alexander Kuleshov, Yury Yanovich
Random Projection Towards the Baire Metric for High Dimensional Clustering

For high-dimensional clustering and proximity finding, also referred to as high-dimension, low-sample-size data, we use random projection with the following principle. Given that close-to-orthogonal projections occur with greater probability than orthogonal ones, we can use the rank-order sensitivity of the projected values. Our Baire-metric, divisive hierarchical clustering has linear computation time.
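
One simple way to instantiate the pipeline sketched above, under stated assumptions (a single 1-D random projection and decimal digits as the Baire prefix): project, rescale to [0, 1), and bucket points by the shared prefix of their digit expansion, which takes a single linear pass. Points sharing a longer prefix are closer in the Baire (longest-common-prefix) metric.

```python
import numpy as np
from collections import defaultdict

def baire_clusters(X, prefix_len=3, seed=0):
    """Cluster rows of X by the first `prefix_len` decimal digits of a 1-D random projection."""
    rng = np.random.default_rng(seed)
    direction = rng.standard_normal(X.shape[1])
    proj = X @ direction
    # Rescale to [0, 1) so the digits of the expansion are comparable across points.
    proj = (proj - proj.min()) / (proj.max() - proj.min() + 1e-12)
    clusters = defaultdict(list)
    for i, v in enumerate(proj):
        key = f"{int(v * 10**prefix_len):0{prefix_len}d}"   # truncated first digits after "0."
        clusters[key].append(i)
    return clusters
```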

Fionn Murtagh, Pedro Contreras
Optimal Coding for Discrete Random Vector

Based on the notion of mutual information between the components of a discrete random vector, we construct, for data reduction purposes, an optimal quantization of the support of its probability measure. More precisely, we propose a simultaneous discretization of the whole set of components of the discrete random vector which takes into account, as much as possible, the stochastic dependence between them. Computational aspects and an example are presented.

Bernard Colin, Jules de Tibeiro, François Dubeau
Backmatter
Metadata
Title
Statistical Learning and Data Sciences
Editors
Alexander Gammerman
Vladimir Vovk
Harris Papadopoulos
Copyright Year
2015
Electronic ISBN
978-3-319-17091-6
Print ISBN
978-3-319-17090-9
DOI
https://doi.org/10.1007/978-3-319-17091-6