Information Sciences

Volume 560, June 2021, Pages 504-527

Methodically unified procedures for a conditional approach to outlier detection, clustering, and classification

https://doi.org/10.1016/j.ins.2020.08.122

Abstract

The subject of this study is three fundamental procedures of contemporary data analysis: outlier detection, clustering, and classification. These issues are considered in a conditional approach – introducing specific (e.g., current) values of conditioning factors into the model allows, in practice, a significantly more precise description of the reality under research. The same methodology is used for all three of the above tasks, which considerably facilitates the interpretation, potential modification, and practical application of the investigated material. The use of nonparametric methods frees the procedures under investigation from assumptions about the distribution of the considered dataset. This paper contains a complete set of formulas that allow easy implementation of the presented material in real-world problems.

Introduction

Consider the basic tasks:

  • (a)

    outlier detection – indication of elements that are considerably different from the others [1],

  • (b)

    clustering – grouping of elements in relatively homogeneous subsets [49],

  • (c)

    classification – assignment of the tested element to one of the distinguished classes [8].

They are fundamental problems in the majority of practical issues of data analysis and data mining [2], [14]. Moreover, in many applications, the above procedures (a)-(c) constitute a skeleton for the whole concept of the solution being designed. Thus, outlier detection allows the data to be cleaned of atypical elements, which are frequently laden with gross errors and, in consequence, introduce false information, or which have slight significance or no useful meaning. Clustering enables the division of data into subsets of similar elements, owing to which further analysis, conducted separately on these relatively homogeneous subsets, becomes more efficient or even possible at all. In turn, classification makes it possible to assign an element of interest to one of the homogeneous classes defined earlier (sometimes by clustering) and, e.g., to apply to this element a previously prepared action suitable for that class. Of course, each practical issue requires additional activities and an individualized approach [24], [25], but the procedures outlined in points (a)-(c) often constitute an essential framework of the solution.

In many practical tasks, the information possessed in the form of data can be made considerably more precise by measuring and introducing the appropriate values of factors that have a significant impact on the phenomenon under investigation. As an illustrative example, consider the current outdoor temperature T in the problem of consumer demand for electrical energy. The model of such demand should depend, among other things, on the temperature T, which is of primary importance in this case. After the current value of T is measured and entered into the model, the model can become significantly more precise. The particular elements of the dataset on which the model is built should then carry more significance the closer the temperature at which they were measured is to the current value T. A similar role is played in medical matters by the patient's age – the closer the age at which data were obtained from other individuals is to the age of the patient under examination, the greater the importance those data should have. This situation fits a probabilistic conditional approach [5], [46]. The basic attributes, called describing attributes, then depend on the conditioning attributes, whose measured and introduced concrete values can provide meaningfully precise information concerning the considered object. Such an approach, applied to the outlier detection, clustering, and classification problems, constitutes the subject of the research presented in this paper.
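In kernel terms (anticipating Section 2), such conditioning amounts to weighting each dataset element according to how close its recorded conditioning value is to the introduced current value. The sketch below is a minimal illustration under our own assumptions (a Gaussian kernel and a single scalar conditioning attribute); the paper's precise conditional estimator is defined later.

```python
import numpy as np

def conditioning_weights(w_star, w_data, h_w):
    """Significance weights induced by a conditioning attribute.

    w_star : the introduced (e.g., current) value of the conditioning
             attribute, such as the outdoor temperature T
    w_data : (m,) values of the conditioning attribute recorded for the
             m dataset elements
    h_w    : positive smoothing parameter for the conditioning attribute
    """
    u = (w_star - np.asarray(w_data, dtype=float)) / h_w
    k = np.exp(-0.5 * u**2)      # Gaussian kernel (an assumed form)
    return k / k.sum()           # elements measured near w_star weigh more
```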

To define the characteristics of the data, the nonparametric methodology of kernel estimators [16], [48] is used; it frees the investigated procedures from assumptions about the distributions that characterize both the describing and the conditioning attributes. This concept will be used for all three procedures (a)-(c). The homogeneity of the methodology obtained in this way may prove very valuable in practical applications – it simplifies understanding, interpretation, implementation, and adaptation to individual research circumstances, which is worth particular emphasis in view of complex contemporary applications [36].

Thus, Section 2 outlines the mathematical preliminaries: kernel estimators. The procedures for outlier detection, clustering, and classification in the conditional approach investigated here are described in Sections 3, 4, and 5, respectively. Section 6 presents the results of empirical tests, first for illustrative, artificially generated data (Section 6.1), and then for sociological and environmental problems that concern generosity (Section 6.2), fertility (Section 6.3), and pollution in Kraków (Section 6.4). Additional comments are given in the final Section 7.

The research that concerns the basic unconditional case is summarized in the text [17] and in detail in the following papers: outlier detection [28], [29], clustering [19], [22], and classification [26], [27]. These concepts are extended in the present paper to a conditional approach, which, as illustrated in Section 6, can be very valuable in many practical tasks of engineering, economy and marketing, natural sciences, and others. The preliminary version of this paper was presented as [23].

Section snippets

Mathematical preliminaries: kernel estimators

Consider the m-element set of n-dimensional vectors with continuous attributes:

$$x_1, x_2, \ldots, x_m \in \mathbb{R}^n. \quad (1)$$

The kernel estimator $\hat{f} \colon \mathbb{R}^n \to [0,\infty)$ of the density of the distribution of dataset (1) is defined as

$$\hat{f}(x) = \frac{1}{m} \sum_{i=1}^{m} K(x, x_i, h),$$

where, after separation into coordinates, $x = [x_1, x_2, \ldots, x_n]^T$, $x_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,n}]^T$ for $i = 1, 2, \ldots, m$, and $h = [h_1, h_2, \ldots, h_n]^T$, while the positive constants $h_j$ are the so-called smoothing parameters. The kernel $K$ will hereinafter be used in the product form

$$K(x, x_i, h) = \prod_{j=1}^{n} \frac{1}{h_j} K_j\!\left(\frac{x_j - x_{i,j}}{h_j}\right),$$

whereas the one-dimensional kernels $K_j \colon \mathbb{R} \to [0,\infty)$ …
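For illustration, the formulas above translate directly into code. The following sketch assumes Gaussian one-dimensional kernels $K_j$ and leaves the choice of the smoothing parameters $h_j$ to the caller; both the kernel form and any rule for selecting $h_j$ are our assumptions, since the paper treats these choices separately.

```python
import numpy as np

def product_kernel_density(x, data, h):
    """Kernel density estimate f_hat(x) with a product-form kernel.

    x    : (n,) query point
    data : (m, n) dataset x_1, ..., x_m
    h    : (n,) positive smoothing parameters h_1, ..., h_n
    """
    u = (np.asarray(x) - data) / h                    # (m, n) scaled differences
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)    # Gaussian K_j (assumed)
    return float(np.mean(np.prod(k / h, axis=1)))     # average of product kernels
```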

Detection of outliers (atypical elements)

The task of detecting atypical elements is one of the fundamental problems of contemporary data analysis [1], especially in the preliminary phase of processing. The occurrence of such elements is mostly associated with gross errors that corrupt some of the elements in the set being considered. They are then eliminated or corrected. In marketing issues, however, they can characterize existing items that are so rare that their service becomes uneconomical. In the framework of another point of view, less …
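A simple way to operationalize this idea with the kernel estimator of Section 2 is to flag as atypical those elements whose estimated density is lowest. The sketch below (our assumption, not necessarily the exact criterion derived in this section) marks a fixed share r of the dataset, reusing product_kernel_density from the sketch in Section 2.

```python
import numpy as np

def detect_outliers(data, h, r=0.05):
    """Mark as atypical the elements lying in the lowest-density regions.

    r : assumed share of the dataset to be flagged as outliers.
    Returns a boolean mask of length m (True = atypical element).
    """
    densities = np.array([product_kernel_density(x, data, h) for x in data])
    return densities <= np.quantile(densities, r)   # lowest-density elements
```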

Clustering

Clustering [49] is the second basic task considered in this paper. It can be placed between classical data analysis, where the research goal has already been specified, and exploratory data analysis, in which the aim of future investigations is unknown a priori. In the first case, clustering can be applied for classification, however without fixed patterns. The second option treats it as a division of the explored data into a few groups, each comprising elements that are similar to each other …
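In the unconditional case, the authors' Complete Gradient Clustering Algorithm [19], [22] shifts each element uphill along the gradient of the kernel density estimate and groups together elements that converge to the same local maximum. The following mean-shift-style sketch conveys the idea under our own assumptions (a Gaussian kernel, a fixed iteration budget, and an ad hoc grouping radius); it is not the paper's exact algorithm.

```python
import numpy as np

def gradient_clustering(data, h, steps=100, tol=1e-4):
    """Gradient (mean-shift-style) clustering on a kernel density estimate.

    Every element climbs toward a local maximum of the estimated density;
    elements whose limit points (nearly) coincide form one cluster.
    Returns integer cluster labels of length m.
    """
    data = np.asarray(data, dtype=float)
    points = data.copy()
    for _ in range(steps):
        shifted = np.empty_like(points)
        for i, x in enumerate(points):
            u = (x - data) / h
            w = np.exp(-0.5 * np.sum(u**2, axis=1))            # kernel weights
            shifted[i] = (w[:, None] * data).sum(axis=0) / w.sum()
        moved = np.max(np.abs(shifted - points))
        points = shifted
        if moved < tol:                                        # trajectories settled
            break
    labels = np.full(len(points), -1, dtype=int)
    k = 0
    for i in range(len(points)):
        if labels[i] < 0:                                      # open a new cluster
            near = np.linalg.norm(points - points[i], axis=1) < 10 * tol
            labels[near & (labels < 0)] = k
            k += 1
    return labels
```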

Classification

Consider conditional patterns, analogous to the set (13), which characterize J assumed classes. The sizes $m_1, m_2, \ldots, m_J$ should be proportional to the shares of the specific classes within the investigated datasets. The aim of classification [8] is to assign the tested element

$$y \in \mathbb{R}^{n_Y + n_W}$$

to one of the classes. The issue of classification is considerably better conditioned than the tasks of outlier detection (Section 3) and clustering (Section 4). Indicating the patterns (46)–(48) here …
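A kernel Bayes rule consistent with this setup (our sketch of the unconditional core; the conditional version in the paper additionally weights the patterns via the conditioning attributes) assigns the tested element to the class whose pattern yields the largest size-weighted density estimate, again reusing product_kernel_density from the earlier sketch.

```python
import numpy as np

def classify(y, patterns, h):
    """Assign y to one of J classes via kernel density estimates.

    patterns : list of J arrays; the j-th has shape (m_j, n) and holds the
               pattern of class j, with m_j proportional to the class share
    Returns the index of the winning class.
    """
    scores = [len(p) * product_kernel_density(y, p, h) for p in patterns]
    return int(np.argmax(scores))   # largest m_j * f_hat_j(y) wins
```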

Verification of the correctness and characteristics of the results

The investigated procedures have undergone detailed empirical study, first using artificially generated data, which allows for illustrative characterization and comparison of the specific features of the results obtained (Section 6.1), and second by means of benchmarks from sociology (Sections 6.2 and 6.3, concerning generosity and fertility) and measurements that concern pollution in Kraków (Section 6.4).

Additional comments

The first issue that requires commentary is assuring the desired dataset size. With the current automation of metrological systems, this aspect is often not a problem; however, in many tasks, the number of records is objectively limited (e.g., the number of countries concerned in the generosity and fertility research of Sections 6.2 and 6.3). Too small a dataset could result in accidental aspects of …

Summary

This paper presents methodologically uniform material for three fundamental procedures of data analysis: outlier (atypical element) detection, clustering, and classification in the conditional approach. A correct definition of the conditioning factors and their values can considerably increase the possibilities of data analysis, in particular the interpretation of the results. In turn, owing to this uniformity and the ease of interpretation of the universal mathematical apparatus applied here, the presented …

CRediT authorship contribution statement

Piotr Kulczycki: Conceptualization, Methodology, Validation, Formal analysis, Writing, Supervision. Krystian Franus: Software, Resources, Investigation, Visualization.

Acknowledgments

This work was partially supported by the Systems Research Institute of the Polish Academy of Sciences, as well as the Faculty of Physics and Applied Computer Science of the AGH University of Science and Technology.

References (51)

  • G. Casella et al., Statistical Inference (2002).
  • M. Charytanowicz et al., Application of Complete Gradient Clustering Algorithm for analysis of wildlife spatial distribution, Ecol. Indicators (2018).
  • R.O. Duda et al., Pattern Classification (2001).
  • K. Fukunaga et al., The estimation of the gradient of a density function, with applications in pattern recognition, IEEE Trans. Inform. Theory (1975).
  • Gapminder (2020), Gapminder Foundation, https://www.gapminder.org/data/, access 22 June...
  • A. Hinneburg et al., DENCLUE 2.0: Fast Clustering Based on Kernel Density Estimation.
  • V. Hautamaki et al., Outlier detection using k-nearest neighbour graph.
  • P. Kulczycki, Estymatory jądrowe w analizie systemowej [Kernel Estimators in Systems Analysis], WNT, ...
  • P. Kulczycki, Methodically unified procedures for outlier detection, clustering and classification.
  • P. Kulczycki, Parametric identification for robust control, Automatic Control, Robotics (2020).
  • P. Kulczycki et al., A complete gradient clustering algorithm formed with kernel estimators, Int. J. Appl. Math. Computer Sci. (2010).
  • P. Kulczycki et al., An algorithm for conditional multidimensional parameter identification with asymmetric and correlated losses of under- and overestimations, J. Stat. Comput. Simul. (2016).
  • P. Kulczycki et al., The complete gradient clustering algorithm: properties in practical applications, J. Appl. Stat. (2012).
  • P. Kulczycki et al., Outlier Detection, Clustering, and Classification – Unified Algorithms for Conditional Approach, Proceedings of the 12th International Conference on Computational Collective Intelligence, Da Nang, Vietnam, 30 November – 3 December (2020).