
2016 | Book

Challenges in Computational Statistics and Data Mining


About this book

This volume contains nineteen research papers belonging to the areas of computational statistics, data mining, and their applications. Those papers, all written specifically for this volume, are their authors' contributions to honour and celebrate Professor Jacek Koronacki on the occasion of his 70th birthday. The book's related and often interconnected topics represent Jacek Koronacki's research interests and their evolution. They also clearly indicate how close the areas of computational statistics and data mining are.

Table of Contents

Frontmatter
Evolutionary Computation for Real-World Problems
Abstract
In this paper we discuss three topics that are present in real-world optimization but are often neglected in academic research in the evolutionary computation community. First, problems that are combinations of several interacting sub-problems (so-called multi-component problems) are common in many real-world applications and deserve more attention from the research community. Second, research on optimisation algorithms that focus the search on the edges of the feasible regions of the search space is important, as in many real-world problems high-quality solutions usually lie on the boundary between the feasible and infeasible parts of the search space. Third, finding bottlenecks and the best possible investment in real-world processes are important topics that are also of interest in real-world optimization. In this chapter we discuss application opportunities for evolutionary computation methods in these three areas.
Mohammad Reza Bonyadi, Zbigniew Michalewicz
Selection of Significant Features Using Monte Carlo Feature Selection
Abstract
Feature selection methods identify subsets of features in large datasets. Such methods have become popular in data-intensive areas, and performing feature selection prior to model construction may reduce the computational cost and improve the model quality. Monte Carlo Feature Selection (MCFS) is a feature selection method aimed at finding features to use for classification. Here we suggest a strategy for computing the significance of a feature under MCFS using a z-test. We used simulated data with both informative and random features, and compared the z-test with a permutation test and with the test implemented in the MCFS software. The z-test had higher agreement with the permutation test than the built-in test did. Furthermore, it avoided a bias related to the distribution of feature values that may have affected the built-in test. In conclusion, the suggested method has the potential to improve feature selection using MCFS.
Susanne Bornelöv, Jan Komorowski
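The z-test idea lends itself to a compact illustration. The sketch below is a simplification, not the MCFS implementation: tree-based importances averaged over random subsamples stand in for MCFS relative importance scores, and pure-noise features provide the null distribution against which a feature's score is z-tested.

```python
import numpy as np
from scipy import stats
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n, s = 300, 200                         # sample size, Monte Carlo repetitions
y = rng.integers(0, 2, n)
informative = y + rng.normal(0, 1.0, n)            # one weakly informative feature
X = np.column_stack([informative] + [rng.normal(size=n) for _ in range(9)])

def mc_importances(X, y, s, rng):
    """Mean impurity-based importance over s random 66% subsamples."""
    imps = np.zeros((s, X.shape[1]))
    for i in range(s):
        idx = rng.choice(len(y), size=int(0.66 * len(y)), replace=False)
        tree = DecisionTreeClassifier(max_depth=3, random_state=i)
        imps[i] = tree.fit(X[idx], y[idx]).feature_importances_
    return imps.mean(axis=0)

imp = mc_importances(X, y, s, rng)
null = imp[1:]                          # features 1..9 are pure noise
z = (imp[0] - null.mean()) / null.std(ddof=1)
print(f"z = {z:.2f}, one-sided p = {stats.norm.sf(z):.4f}")
```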
ADX Algorithm for Supervised Classification
Abstract
In this paper, the final version of the rule-based classifier ADX is presented. ADX is an algorithm for inductive learning and for the subsequent classification of objects. As is typical for rule systems, its knowledge representation is easy for a human to understand. The advantage of the ADX algorithm is that the rules are not overly complicated, and for most real datasets the learning time increases linearly with the size of the dataset. The novel elements in this work are a new method for selecting the final ruleset in ADX and the classification mechanism. The algorithm's performance is illustrated by a series of experiments on a suitably designed set of artificial data.
Michał Dramiński
Estimation of Entropy from Subword Complexity
Abstract
Subword complexity is a function that describes how many different substrings of a given length are contained in a given string. In this paper, two estimators of block entropy are proposed, based on the profile of subword complexity. The first estimator works well only for IID processes with uniform probabilities. The second estimator provides a lower bound of block entropy for any strictly stationary process with the distributions of blocks skewed towards less probable values. Using this estimator, some estimates of block entropy for natural language are obtained, confirming earlier hypotheses.
Łukasz Dębowski
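To make the notion concrete, the toy sketch below computes the subword complexity profile f(k), the number of distinct length-k substrings of a random binary string, and prints the crude bound log2 f(k) / k on the block entropy rate. The chapter's estimators are considerably more refined than this.

```python
import math
import random

def subword_complexity(s: str, k: int) -> int:
    """f(k): number of distinct substrings of length k."""
    return len({s[i:i + k] for i in range(len(s) - k + 1)})

random.seed(1)
text = "".join(random.choice("ab") for _ in range(100_000))  # IID uniform source

for k in range(1, 12):
    f_k = subword_complexity(text, k)
    print(f"k={k:2d}  f(k)={f_k:6d}  log2 f(k) / k = {math.log2(f_k) / k:.3f}")
# The true entropy rate is 1 bit/symbol; log2 f(k)/k stays near 1 until k
# grows so large that not all 2**k blocks can occur in a string this short.
```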
Exact Rate of Convergence of Kernel-Based Classification Rule
Abstract
A binary classification problem is considered, in which the a posteriori probability is estimated by a nonparametric kernel regression estimate with a naive kernel. The excess error probability of the corresponding plug-in classification rule, relative to the error probability of the Bayes decision, is studied by decomposing it into an approximation error and an estimation error. A general formula is derived for the approximation error. Under a weak margin condition and various smoothness conditions, tight upper bounds on the approximation error are presented. A general expression for the estimation error is obtained via a Berry-Esseen type central limit theorem.
Maik Döring, László Györfi, Harro Walk
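A minimal sketch of the plug-in rule under study, assuming synthetic data and an arbitrary tie-break when the window is empty: the posterior is estimated by averaging labels within a ball of radius h (the naive kernel), and the decision thresholds the estimate at 1/2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 2000, 0.2
X = rng.uniform(-1, 1, size=(n, 2))
eta = 1 / (1 + np.exp(-4 * X[:, 0]))                # assumed true posterior
Y = (rng.uniform(size=n) < eta).astype(int)

def plug_in_classify(x, X, Y, h):
    """Average labels in the ball of radius h (naive kernel), threshold at 1/2."""
    in_window = np.linalg.norm(X - x, axis=1) <= h
    if not in_window.any():
        return 0                                    # arbitrary decision: no data
    return int(Y[in_window].mean() > 0.5)

x0 = np.array([0.3, -0.1])
print("decision at x0:", plug_in_classify(x0, X, Y, h))
```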
Compound Bipolar Queries: A Step Towards an Enhanced Human Consistency and Human Friendliness
Abstract
Database querying is a basic capability for making use of databases, which are omnipresent and huge. A crucial problem is how to make it possible for an ordinary human user to properly express his or her intentions and preferences as to what should be searched for. Since natural language, with its inherent imprecision, is the only fully natural human means of communication and articulation, the use of traditional querying tools based on binary logic is difficult. Fuzzy logic can come to the rescue, notably fuzzy logic with linguistic quantifiers. Such queries, proposed by Kacprzyk and Ziółkowski [24] and Kacprzyk et al. [25], have offered much in this context and will also be used here. In a search for further solutions in this direction, the concept of a bipolar query was proposed by Dubois and Prade [13], followed by a fuzzy bipolar query due to Zadrożny [36] (cf. Zadrożny and Kacprzyk [40]) involving negative and positive information, notably meant as required and desired conditions. A natural solution consisting in combining these two ideas was proposed conceptually by Kacprzyk and Zadrożny [22] and termed a compound bipolar query. In this paper we further extend this concept, mainly by exploring some additional aggregation-related aspects of bipolar queries which include fuzzy queries with linguistic quantifiers.
Janusz Kacprzyk, Sławomir Zadrożny
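As a rough illustration of the required/desired semantics, the sketch below evaluates a bipolar query "cheap AND possibly near the centre" over a toy table, using one fuzzification of the "and possibly" operator discussed in this literature: match(t) = min(C(t), max(P(t), 1 - max_s min(C(s), P(s)))). The membership functions and data are invented for the example.

```python
def cheap(price):                 # required condition C (illustrative)
    return max(0.0, min(1.0, (300 - price) / 150))

def near_centre(km):              # desired condition P (illustrative)
    return max(0.0, min(1.0, (3 - km) / 2))

flats = [("A", 180, 4.0), ("B", 220, 1.5), ("C", 290, 0.5)]  # name, price, km

c = {name: cheap(price) for name, price, _ in flats}
p = {name: near_centre(km) for name, _, km in flats}

# Can both conditions be met simultaneously by some record?
feasible = max(min(c[name], p[name]) for name, _, _ in flats)

# "C and possibly P": P matters only to the extent both can be satisfied.
for name, _, _ in flats:
    score = min(c[name], max(p[name], 1 - feasible))
    print(name, round(score, 3))
```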
Process Inspection by Attributes Using Predicted Data
Abstract
SPC procedures for process inspection by attributes are usually designed under the assumption of directly observed quality data. However, in many practical cases such direct observation is very costly or even hardly possible. For example, in the production of pharmaceuticals, costly and time-consuming chemical analyses are required to determine a product's quality, even in the simplest case when we only have to decide whether the inspected product conforms to quality requirements. The situation is even more difficult when quality tests are destructive and long-lasting, as in the case of reliability testing. In such situations we try to predict the actual quality of inspected items from predictors whose values are easily measured. In this paper we consider a situation in which traditional prediction models based on the assumption of a multivariate normal distribution cannot be applied. Instead, we propose to use some prediction techniques known from data mining. In this study, which has an introductory character and is mainly based on the results of computer simulations, we show how the use of popular data mining techniques, such as binary regression, linear discriminant analysis, and decision trees, may influence the results of process inspection.
Olgierd Hryniewicz
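The following sketch shows the basic substitution the chapter studies, on synthetic data: a classifier trained on cheap predictors replaces the costly direct quality test, and the inspection statistic is computed from predicted conformance. All quantities here are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 3))                        # cheap measurements
latent = 1.2 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 0.7, n)
conforming = (latent > -0.5).astype(int)           # true (costly) quality

X_tr, X_te, y_tr, y_te = train_test_split(X, conforming, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)         # predicts conformance

# Predicted fraction nonconforming in an inspected lot of 100 items:
lot = X_te[:100]
p_hat = 1 - clf.predict(lot).mean()
print(f"predicted fraction nonconforming: {p_hat:.3f}")
# An attribute chart rule would then be applied to this predicted count;
# misclassification distorts the chart's properties, which is what the
# chapter's simulations examine.
```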
Székely Regularization for Uplift Modeling
Abstract
Uplift modeling is a subfield of machine learning concerned with predicting the causal effect of an action at the level of individuals. This is achieved by using two training sets: a treatment set, containing objects which have been subjected to an action, and a control set, containing objects on which the action has not been performed. An uplift model then predicts the difference between the conditional success probabilities in the two groups. Uplift modeling is best applied to training sets obtained from randomized controlled trials, but such experiments are not always possible, in which case treatment assignment is often biased. In this paper we present a modification of Uplift Support Vector Machines which makes them less sensitive to such a bias. This is achieved by including in the model formulation an additional term which penalizes models that score the treatment and control groups differently. We call the technique Székely regularization since it is based on the energy distance proposed by Székely and Rizzo. An optimization algorithm based on stochastic gradient descent techniques has also been developed. We demonstrate experimentally that the proposed regularization term does indeed produce uplift models which are less sensitive to biased treatment assignment.
Szymon Jaroszewicz, Łukasz Zaniewicz
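The regularization ingredient is easy to state in code. The sketch below computes the sample energy distance of Székely and Rizzo between one-dimensional model scores on the treatment and control groups; in the paper's formulation a term of this kind, weighted by a tuning constant, is added to the uplift SVM objective. The scores here are synthetic.

```python
import numpy as np

def energy_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Sample energy distance 2 E|X-Y| - E|X-X'| - E|Y-Y'| for 1-D scores."""
    xy = np.abs(x[:, None] - y[None, :]).mean()
    xx = np.abs(x[:, None] - x[None, :]).mean()
    yy = np.abs(y[:, None] - y[None, :]).mean()
    return 2 * xy - xx - yy

rng = np.random.default_rng(0)
scores_treatment = rng.normal(0.3, 1.0, 400)   # model scores, treatment group
scores_control = rng.normal(0.0, 1.0, 500)     # model scores, control group
print(f"energy distance: {energy_distance(scores_treatment, scores_control):.4f}")
# A large value means the model scores the two groups differently; adding
# lambda * energy_distance to the loss penalizes exactly that.
```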
Dominance-Based Rough Set Approach to Multiple Criteria Ranking with Sorting-Specific Preference Information
Abstract
A novel multiple criteria decision aiding method is proposed that delivers a recommendation characteristic of ranking problems but employs preference information typical of sorting problems. The method belongs to the category of ordinal regression methods: it starts with preference information provided by the Decision Maker (DM) in terms of decision examples, and then builds a preference model that reproduces these exemplary decisions. Ordinal regression is analogous to inductive learning of a model that is true in the closed world of the data it comes from. The sorting examples show an assignment of some alternatives to pre-defined and ordered quality classes. Although this preference information is purely ordinal, the number of quality classes separating two assigned alternatives is meaningful as an ordinal intensity of preference. Using an adaptation of the Dominance-based Rough Set Approach (DRSA), the method builds from this information a decision rule preference model. This model is then applied to the considered set of alternatives to rank them from best to worst. The decision rule preference model resulting from DRSA is able to represent the preference information about the ordinal intensity of preference without converting this information to a cardinal scale. Moreover, the decision rules can be interpreted straightforwardly by the DM, facilitating her understanding of the feedback between the preference information and the preference model. An illustrative case study performed in this paper supports this claim.
Miłosz Kadziński, Roman Słowiński, Marcin Szeląg
On Things Not Seen
Abstract
Some statistical observations are frequently dismissed as "marginal" or even "oddities," but they are far from such. On the contrary, they provide insights that lead to a better understanding of mechanisms which logically should exist but for which evidence is missing. We consider three case studies of probabilistic models in evolution, genetics and cancer. First, ascertainment bias in evolutionary genetics, which arises when a comparison between two or more species is based on genetic markers discovered in one of these species. Second, quasistationarity, i.e., probabilistic equilibria arising conditionally on non-absorption; since evolution is also the history of extinctions (which are absorptions), this is a valid field of study. Third, inference concerning unobservable events in cancer, such as the appearance of the first malignant cell or the first micrometastasis; the topic is vital for the public health of aging societies. We try to adhere to mathematical rigor but avoid professional jargon, with emphasis on the wider context.
Marek Kimmel
Network Capacity Bound for Personalized Bipartite PageRank
Abstract
In this paper a novel notion of Bipartite PageRank is introduced and the limits of authority flow in bipartite graphs are investigated. As a starting point, we simplify the proof of a theorem on personalized random walks in unimodal graphs that is fundamental to graph node clustering. Building on this, we generalize the theorem to bipartite graphs.
Mieczysław A. Kłopotek, Sławomir T. Wierzchoń, Robert A. Kłopotek, Elżbieta A. Kłopotek
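A minimal sketch of a personalized random walk on a bipartite graph, using our own simplified notation rather than the chapter's definitions: the walker alternates sides through the biadjacency matrix and teleports back to a seed node with probability 1 - alpha.

```python
import numpy as np

B = np.array([[1, 1, 0],        # biadjacency: 4 left nodes x 3 right nodes
              [1, 0, 1],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)

P_lr = B / B.sum(axis=1, keepdims=True)       # left -> right transitions
P_rl = (B / B.sum(axis=0, keepdims=True)).T   # right -> left transitions

alpha, seed = 0.8, np.array([1.0, 0, 0, 0])   # restart at left node 0
r = seed.copy()
for _ in range(100):
    # One full step crosses to the right side and back, then teleports.
    r = (1 - alpha) * seed + alpha * (r @ P_lr @ P_rl)
r /= r.sum()
print("personalized ranks (left side):", np.round(r, 3))
```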
Dependence Factor as a Rule Evaluation Measure
Abstract
The certainty factor and lift are known evaluation measures of association rules. Nevertheless, they do not guarantee an accurate evaluation of the strength of the dependence between a rule's constituents. In particular, even if there is the strongest possible positive or negative dependence between a rule's constituents X and Y, these measures may take values quite close to those indicating independence of X and Y. Recently, we have proposed a new measure called the dependence factor to overcome this drawback. Unlike in the case of the certainty factor, when defining the dependence factor we took into account the fact that for a given rule \(X \rightarrow Y\), the minimal conditional probability of the occurrence of Y given X may be greater than 0, while its maximal possible value may be less than 1. In this paper, we first recall the definitions and properties of all three measures. Then, we examine the dependence factor from the point of view of an interestingness measure, and we examine the relationship between the dependence factor for X and Y and those for \(\bar{X}\) and Y, X and \(\bar{Y}\), and \(\bar{X}\) and \(\bar{Y}\), respectively. As a result, we obtain a number of new properties of the dependence factor.
Marzena Kryszkiewicz
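The contrast among the measures can be seen in a few lines of code. The formulas below follow our reading of the definitions (the dependence factor normalizes the deviation from independence by the maximal deviation achievable given the marginals); the example reproduces the abstract's point that lift can sit near 1 even under the strongest dependence the marginals allow.

```python
def lift(p_xy, p_x, p_y):
    return p_xy / (p_x * p_y)

def certainty_factor(p_xy, p_x, p_y):
    p_y_given_x = p_xy / p_x
    if p_y_given_x >= p_y:
        return (p_y_given_x - p_y) / (1 - p_y)
    return (p_y_given_x - p_y) / p_y

def dependence_factor(p_xy, p_x, p_y):
    dev = p_xy - p_x * p_y                  # deviation from independence
    if dev >= 0:
        max_p_xy = min(p_x, p_y)            # largest joint prob. possible
        return dev / (max_p_xy - p_x * p_y)
    min_p_xy = max(p_x + p_y - 1, 0.0)      # smallest joint prob. possible
    return dev / (p_x * p_y - min_p_xy)

# Strongest possible positive dependence given these marginals:
p_x, p_y, p_xy = 0.9, 0.9, 0.9
print(lift(p_xy, p_x, p_y))                 # ~1.11: looks almost independent
print(dependence_factor(p_xy, p_x, p_y))    # 1.0: maximal dependence
```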
Recent Results on Nonparametric Quantile Estimation in a Simulation Model
Abstract
We present recent results on nonparametric estimation of a quantile of the distribution of Y given by a simulation model \(Y=m(X)\), where \(m: \mathbb {R}^d\rightarrow \mathbb {R}\) is a function which is costly to compute and X is an \(\mathbb {R}^d\)-valued random variable with a given density. We argue that an importance sampling quantile estimate of m(X), based on a suitable estimate \(m_n\) of m, achieves a better rate of convergence than the estimate based on order statistics alone. Similar results are given for Robbins-Monro type recursive importance sampling and for quantile estimation based on a surrogate model.
Adam Krzyżak
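The sketch below contrasts the plain order-statistic quantile estimate with a surrogate-based one, on a made-up stand-in for a costly function: a cheap fitted model m_n is evaluated on a large Monte Carlo sample where m itself could not be. The chapter's importance sampling refinements are not shown.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

def m(x):                                   # pretend this is costly to compute
    return np.sin(3 * x[..., 0]) + x[..., 1] ** 2

q = 0.95
X_small = rng.normal(size=(200, 2))         # all we can afford of m itself
q_order_stat = np.quantile(m(X_small), q)   # plain order-statistic estimate

# Fit a cheap surrogate on the small design, then use a large MC sample.
m_n = KNeighborsRegressor(n_neighbors=5).fit(X_small, m(X_small))
X_large = rng.normal(size=(200_000, 2))
q_surrogate = np.quantile(m_n.predict(X_large), q)

print(f"order statistics: {q_order_stat:.3f}  surrogate: {q_surrogate:.3f}")
```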
Adaptive Monte Carlo Maximum Likelihood
Abstract
We consider Monte Carlo approximations to the maximum likelihood estimator in models with intractable norming constants. This paper deals with adaptive Monte Carlo algorithms, which adjust control parameters in the course of the simulation. We examine the asymptotics of adaptive importance sampling and of a new algorithm which uses resampling and MCMC. This algorithm is designed to reduce problems with the degeneracy of importance weights. Our analysis is based on martingale limit theorems. We also describe how adaptive maximization algorithms of Newton-Raphson type can be combined with the resampling techniques. The paper includes the results of a small-scale simulation study in which we compare the performance of adaptive and non-adaptive Monte Carlo maximum likelihood algorithms.
Błażej Miasojedow, Wojciech Niemiro, Jan Palczewski, Wojciech Rejchel
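A non-adaptive baseline is easy to sketch. Below, the norming constant of the toy model p_theta(x) proportional to exp(theta*x - x**4) is approximated by importance sampling from a fixed N(0,1) instrumental density, and the Monte Carlo likelihood is maximized; the chapter's adaptive algorithms would additionally adjust the instrumental distribution (or resample with MCMC) during the run. Model and constants are assumptions for the example.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)

theta_true = 1.0
def sample_model(theta, size):
    """Rejection sampler; the density is effectively supported on [-3, 3]."""
    out = []
    while len(out) < size:
        x = rng.uniform(-3, 3, 1000)
        accept = rng.uniform(0, 1, 1000) < np.exp(theta * x - x**4 - 1.0)
        out.extend(x[accept])
    return np.array(out[:size])

data = sample_model(theta_true, 500)        # synthetic observations

xs = rng.normal(size=100_000)               # instrumental sample, g = N(0,1)
log_g = stats.norm.logpdf(xs)

def neg_log_lik(theta):
    # log Z(theta) ~= log mean( exp(theta*x - x**4) / g(x) ), via log-sum-exp
    log_w = theta * xs - xs**4 - log_g
    log_z = np.log(np.mean(np.exp(log_w - log_w.max()))) + log_w.max()
    return -(theta * data.mean() - log_z)

theta_hat = optimize.minimize_scalar(neg_log_lik, bounds=(-3, 3),
                                     method="bounded").x
print(f"theta_hat = {theta_hat:.3f} (true {theta_true})")
```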
What Do We Choose When We Err? Model Selection and Testing for Misspecified Logistic Regression Revisited
Abstract
The problem of fitting logistic regression to a binary model, allowing for misspecification of the response function, is reconsidered. We introduce a two-stage procedure which consists in first ordering predictors with respect to the deviances of the models with the predictor in question omitted, and then choosing the minimizer of a Generalized Information Criterion in the resulting nested family of models. This allows a large number of potential predictors to be considered, in contrast to an exhaustive method. We prove that the procedure consistently chooses the model \(t^{*}\) which is closest in the averaged Kullback-Leibler sense to the true binary model t. We then consider the interplay between t and \(t^{*}\) and prove that for a monotone response function, when there is genuine dependence of the response on the predictors, \(t^{*}\) is necessarily nonempty. This implies the consistency of a deviance test of significance under misspecification. For a class of distributions of predictors, including the normal family, Ruud's result asserts that \(t^{*}=t\). Numerical experiments reveal that for normally distributed predictors the probability of correct selection and the power of the deviance test depend monotonically on Ruud's proportionality constant \(\eta \).
Jan Mielniczuk, Paweł Teisseyre
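The two-stage procedure itself is short enough to sketch. The code below uses deviances from statsmodels logistic fits, with a BIC-like penalty c = log(n) standing in for the Generalized Information Criterion; the data-generating model is an assumption for the example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 500, 8
X = rng.normal(size=(n, p))
true_logit = 1.5 * X[:, 0] - 1.0 * X[:, 2]     # only predictors 0 and 2 matter
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-true_logit))).astype(int)

def deviance(cols):
    """Deviance of a logistic fit on the given predictor columns."""
    cols = list(cols)
    design = sm.add_constant(X[:, cols]) if cols else np.ones((n, 1))
    return -2 * sm.Logit(y, design).fit(disp=0).llf

# Stage 1: rank predictors; the one whose omission hurts most comes first.
ranking = sorted(range(p),
                 key=lambda j: deviance(k for k in range(p) if k != j),
                 reverse=True)

# Stage 2: minimize GIC = deviance + c * (model dimension) over the
# nested family defined by the ranking, with the BIC-like c = log(n).
c = np.log(n)
gic_value, best_k = min((deviance(ranking[:k]) + c * (k + 1), k)
                        for k in range(p + 1))
print("selected predictors:", sorted(ranking[:best_k]))
```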
Semiparametric Inference in Identification of Block-Oriented Systems
Abstract
In this paper, we give a semiparametric statistics perspective on the problem of identification of a class of nonlinear dynamic systems. We present a framework for the identification of so-called block-oriented systems, which can be represented by finite-dimensional parameters and an infinite-dimensional set of nonlinear characteristics that typically run through a nonparametric class of univariate functions. We consider systems which are expressible exactly in this form as well as the case when they are approximative models. In the latter case, we derive projections, that is, solutions which minimize the mean \(L_2\) error. The chief benefit of such an approach is that it makes classical nonparametric estimates amenable to the incorporation of constraints and able to overcome the celebrated curse of dimensionality and system complexity. The developed methodology is explained by examining semiparametric versions of popular block-oriented structures, i.e., Hammerstein, Wiener, and parallel systems.
Mirosław Pawlak
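For the simplest of these structures, a Hammerstein system, the sketch below simulates y_t = sum_i b_i f(u_{t-i}) + noise and recovers the nonlinearity f up to an affine transformation by a naive-kernel regression of y_t on u_t, which is valid for i.i.d. inputs. The full semiparametric treatment of the linear parameters b is omitted, and all system components are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20_000
u = rng.uniform(-2, 2, T)                       # i.i.d. input signal
f = lambda x: np.tanh(2 * x)                    # "unknown" static nonlinearity
b = np.array([1.0, 0.6, 0.3])                   # linear FIR dynamics
y = np.convolve(f(u), b)[:T] + rng.normal(0, 0.1, T)

def naive_kernel_regression(x0, u, y, h=0.1):
    """Nadaraya-Watson estimate with a boxcar (naive) kernel of bandwidth h."""
    w = np.abs(u - x0) <= h
    return y[w].mean()

grid = np.linspace(-1.5, 1.5, 7)
est = np.array([naive_kernel_regression(x, u, y) for x in grid])
# E[y_t | u_t = x] = b[0] * f(x) + const; here E f(U) = 0, so const = 0.
print(np.round(est, 3))
print(np.round(b[0] * f(grid), 3))
```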
Dealing with Data Difficulty Factors While Learning from Imbalanced Data
Abstract
Learning from imbalanced data is still one of the most challenging tasks in machine learning and data mining. We discuss the following data difficulty factors which deteriorate classification performance: decomposition of the minority class into rare sub-concepts, overlapping of classes, and the presence of different types of examples. New experimental studies showing the influence of these factors on classifiers are presented. The paper also includes a critical discussion of methods for identifying these factors in real-world data. Finally, open research issues are stated.
Jerzy Stefanowski
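One widely used identification method in this line of research labels each minority example by the class make-up of its k = 5 nearest neighbours. The thresholds below follow the commonly used scheme (safe: 4-5 same-class neighbours, borderline: 2-3, rare: 1, outlier: 0); treat them and the synthetic data as assumptions of the sketch.

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import NearestNeighbors

def minority_types(X, y, minority_label, k=5):
    """Type of each minority example, from its k-NN class make-up."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    types = {}
    for i in np.where(y == minority_label)[0]:
        _, idx = nn.kneighbors(X[i:i + 1])
        same = (y[idx[0][1:]] == minority_label).sum()   # skip the point itself
        types[i] = ("safe" if same >= 4 else
                    "borderline" if same >= 2 else
                    "rare" if same == 1 else "outlier")
    return types

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),       # majority class
               rng.normal(2.5, 0.3, (15, 2)),    # compact minority sub-concept
               rng.normal(0, 1, (5, 2))])        # scattered minority examples
y = np.array([0] * 200 + [1] * 20)
print(Counter(minority_types(X, y, minority_label=1).values()))
```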
Personal Privacy Protection in Time of Big Data
Abstract
Personal privacy protection is increasingly becoming a matter of protecting privacy in electronic data. It is also becoming a showcase of the advantages and challenges of the Big Data phenomenon. The accumulation of massive data volumes, combined with the development of intelligent Data Mining algorithms, allows ever more data to be analysed and linked. Unintended consequences of Big Data analytics include an increased risk of discovering new information about individuals. There are several approaches to protecting the privacy of individuals in large data sets, privacy-preserving Data Mining being one example. In this paper, we discuss content-aware prevention of data leaks. We concentrate on the protection of personal health information (PHI), arguably the most vulnerable type of personal information. The paper discusses the applied methods and the challenges which arise when we want to keep health information private. PHI leak prevention on the Web and on online social networks is our case study.
Marina Sokolova, Stan Matwin
Data Based Modeling
Abstract
Statisticians spend a great deal of time coming up with tests that are frequently useless in practice and then proving the asymptotic optimality of these tests under conditions that do not actually exist. There is another approach: we can use the data to build models. This is the goal of Tukey's "Exploratory Data Analysis." In this paper I examine alternatives to the Neoclassical Analysis of the stock market, still the dominant view in schools of business, the data notwithstanding. I then give a brief analysis of the breakdown of America's public health service in stopping the progression of the AIDS epidemic, and demonstrate that simply closing down the gay bathhouses would have prevented AIDS from progressing from an endemic disease to a full-blown epidemic which has already killed more Americans than died in the two world wars.
James R. Thompson
Metadata
Title
Challenges in Computational Statistics and Data Mining
Edited by
Stan Matwin
Jan Mielniczuk
Copyright Year
2016
Electronic ISBN
978-3-319-18781-5
Print ISBN
978-3-319-18780-8
DOI
https://doi.org/10.1007/978-3-319-18781-5