Methodically unified procedures for a conditional approach to outlier detection, clustering, and classification
Introduction
Consider the basic tasks:
- (a)
outlier detection – indication of elements that are considerably different from the others [1],
- (b)
clustering – grouping of elements in relatively homogeneous subsets [49],
- (c)
classification – assignment of the tested element to one of the distinguished classes [8].
In many practical tasks, the information contained in the data can be made considerably more precise by measuring and introducing the values of factors that have a significant impact on the phenomenon under investigation. Consider, as an illustrative example, the current outdoor temperature in the problem of consumer demand for electrical energy. The model of such demand should depend, among other things, on the temperature, which is of primary importance in this case. After its current value has been measured and entered into the model, the model can become significantly more precise. The particular elements of the dataset on which the model is built should then carry the more significance, the closer the temperature at which they were measured lies to the current value. A similar role is played in medical matters by the patient's age: the closer the ages at which the data were obtained from individuals, the greater the mutual relevance of those data. This situation fits a probabilistic conditional approach [5], [46]. Here the basic attributes, called describing attributes, depend on conditioning attributes, whose measured concrete values can provide meaningfully more precise information concerning the considered object. Such an approach, applied to outlier detection, clustering, and classification problems, constitutes the subject of the research presented in this paper.
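The weighting idea described above can be sketched numerically: elements of the dataset gain significance according to how close their conditioning attribute (here, temperature) lies to the currently measured value. The Gaussian kernel, the variable names, and the bandwidth value below are illustrative assumptions, not the paper's concrete procedure.

```python
import numpy as np

def conditioning_weights(w, w_star, h):
    """Weights of dataset elements given a measured conditioning value w_star.

    Elements whose conditioning attribute w_i (e.g. temperature at which the
    record was obtained) lies close to w_star receive larger weight; a
    Gaussian kernel with bandwidth h is assumed here purely for illustration.
    """
    k = np.exp(-0.5 * ((np.asarray(w, float) - w_star) / h) ** 2)  # kernel values
    return k / k.sum()                                             # normalize to sum to 1

# demand records measured at various outdoor temperatures
temps = np.array([-5.0, 0.0, 8.0, 15.0, 21.0])
w = conditioning_weights(temps, w_star=14.0, h=3.0)
# the record measured at 15 °C receives the largest weight
```

Such normalized weights can then multiply the contributions of the individual dataset elements in whatever model is being built.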
To characterize the data, the nonparametric methodology of kernel estimators [16], [48] is used, which frees the investigated procedures from assumptions concerning the distributions of both the describing and conditioning attributes. This concept will be used for all three procedures (a)-(c). The homogeneity of the methodology obtained in this way may turn out to be very valuable in practical applications – it simplifies understanding, interpretation, implementation, and adaptation to individual research circumstances, which is worth particular emphasis, especially in view of complex contemporary applications [36].
Thus, Section 2 outlines the mathematical preliminaries: kernel estimators. The procedures investigated here for outlier detection, clustering, and classification in the conditional approach are described in Sections 3, 4, and 5, respectively. Section 6 presents the results of empirical tests, first for illustrative, artificially generated data (Section 6.1), and then for sociological and environmental problems concerning generosity (Section 6.2), fertility (Section 6.3), and pollution in Kraków (Section 6.4). Additional comments are given in the final Section 7.
The research concerning the basic unconditional case is summarized in [17] and described in detail in the following papers: outlier detection [28], [29]; clustering [19], [22]; and classification [26], [27]. These concepts are extended in the present paper to a conditional approach, which, as illustrated in Section 6, can be very valuable in many practical tasks of engineering, economy and marketing, the natural sciences, and others. A preliminary version of this paper was presented as [23].
Mathematical preliminaries: kernel estimators
Consider the $m$-element set of $n$-dimensional vectors with continuous attributes
$$x_1, x_2, \ldots, x_m \in \mathbb{R}^n. \tag{1}$$
The kernel estimator of the density of the dataset (1) distribution is defined as
$$\hat f(x) = \frac{1}{m \, h_1 h_2 \cdots h_n} \sum_{i=1}^{m} K\!\left(\frac{x - x_i}{h}\right),$$
where, after separation into coordinates, we have
$$\frac{x - x_i}{h} = \left[\frac{x^{(1)} - x_i^{(1)}}{h_1}, \; \frac{x^{(2)} - x_i^{(2)}}{h_2}, \; \ldots, \; \frac{x^{(n)} - x_i^{(n)}}{h_n}\right]^{\mathrm{T}},$$
while the positive constants $h_1, h_2, \ldots, h_n$ are the so-called smoothing parameters. The kernel $K$ will be used hereinafter in the product form
$$K(x) = K_1\big(x^{(1)}\big) \, K_2\big(x^{(2)}\big) \cdots K_n\big(x^{(n)}\big),$$
whereas the one-dimensional kernels
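A minimal numerical sketch of a product-form kernel estimator of this kind, assuming Gaussian one-dimensional kernels and given smoothing parameters (the paper leaves the choice of kernels and smoothing parameters to the cited methodology):

```python
import numpy as np

def kernel_estimator(x, data, h):
    """Kernel density estimator with a product (Gaussian) kernel.

    data : (m, n) array of m elements with n continuous attributes
    h    : length-n vector of positive smoothing parameters
    x    : length-n point at which the density is evaluated
    """
    data, h, x = np.asarray(data, float), np.asarray(h, float), np.asarray(x, float)
    m = data.shape[0]
    u = (x - data) / h                                # (m, n) coordinate-wise scaled distances
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)    # one-dimensional Gaussian kernels
    return k.prod(axis=1).sum() / (m * h.prod())      # product form, averaged over elements

# density of a two-dimensional standard normal sample at the origin
data = np.random.default_rng(0).normal(size=(500, 2))
f = kernel_estimator([0.0, 0.0], data, h=[0.5, 0.5])
```

The same routine evaluated over the conditioning attributes alone is what makes the conditional variants of the three procedures possible.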
Detection of outliers (atypical elements)
The task of detecting atypical elements is one of the fundamental problems of contemporary data analysis [1], especially in the preliminary phase of processing. The occurrence of such elements is mostly associated with gross errors that affect some of the elements of the set under consideration; they are then eliminated or corrected. In marketing issues, however, they can characterize existing items that are so rare that servicing them becomes uneconomical. From another point of view, less
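The density-based rule underlying such detection can be illustrated as follows: an element is treated as atypical when the kernel density estimate at its location is unusually small. The one-dimensional Gaussian kernel and the quantile threshold `r` below are assumptions for illustration and do not reproduce the paper's conditional procedure in detail.

```python
import numpy as np

def detect_outliers(data, h, r=0.05):
    """Density-based atypical-element detection (one-dimensional sketch).

    An element is flagged as atypical when the kernel density estimate at
    its location falls below the r-quantile of the densities computed at
    all elements; r is the assumed share of atypical elements.
    """
    data = np.asarray(data, float)
    u = (data[:, None] - data[None, :]) / h
    dens = np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2.0 * np.pi))
    return dens < np.quantile(dens, r)

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 100), [12.0]])  # one gross error at 12
mask = detect_outliers(x, h=0.5, r=0.02)
```

The gross error at 12 lies far from the bulk of the sample, so its density estimate is the smallest and it is flagged.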
Clustering
Clustering [49] is the second basic task considered in this paper. It lies between classical data analysis, where the research goal has already been specified, and exploratory data analysis, in which the aim of future investigations is unknown a priori. In the first case, clustering can be applied for classification, albeit without fixed patterns. The second option treats it as a division of the explored data into a few groups, each comprising elements that are similar to each other,
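Density-gradient clustering of the kind referred to above can be illustrated with a mean-shift style sketch: each element climbs the gradient of the kernel density estimate toward a local mode, and elements reaching the same mode form one cluster. The Gaussian kernel, the fixed iteration count, and the mode-grouping tolerance are illustrative assumptions; the Complete Gradient Clustering Algorithm of [19], [22] differs in detail.

```python
import numpy as np

def mean_shift_clusters(data, h, steps=100, tol=1e-3):
    """Gradient (mean-shift style) clustering sketch with a Gaussian kernel."""
    data = np.asarray(data, float)
    x = data.copy()
    for _ in range(steps):
        u = (x[:, None, :] - data[None, :, :]) / h
        w = np.exp(-0.5 * (u**2).sum(axis=2))          # kernel weights, shape (m, m)
        x = (w[:, :, None] * data[None, :, :]).sum(axis=1) / w.sum(axis=1)[:, None]
    # group elements whose limiting modes lie within tol of each other
    labels, modes = np.full(len(x), -1, dtype=int), []
    for i, xi in enumerate(x):
        for j, mode in enumerate(modes):
            if np.linalg.norm(xi - mode) < tol:
                labels[i] = j
                break
        else:
            modes.append(xi)
            labels[i] = len(modes) - 1
    return labels

rng = np.random.default_rng(0)
blob1 = rng.normal([0.0, 0.0], 0.5, size=(40, 2))
blob2 = rng.normal([8.0, 8.0], 0.5, size=(40, 2))
labels = mean_shift_clusters(np.vstack([blob1, blob2]), h=1.0)
```

Note that the number of clusters is not fixed in advance: it emerges from the modes of the density estimate, and therefore depends on the smoothing parameter.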
Classification
Consider conditional patterns, analogous to the set (13), which characterize the assumed classes. Their sizes should be proportional to the shares of the specific classes within the investigated datasets. The aim of classification [8] is to assign the tested element to one of the classes. The classification problem is considerably better conditioned than the tasks of outlier detection (Section 3) and clustering (Section 4). Indicating the patterns (46)–(48) here
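A kernel-based Bayes classification rule of this kind can be sketched as follows: the tested element is assigned to the class whose pattern yields the largest summed kernel values, which (for a common kernel and smoothing parameter) is proportional to the class size times its density estimate. The Gaussian kernel and the shared bandwidth below are illustrative assumptions.

```python
import numpy as np

def classify(x, patterns, h):
    """Kernel Bayes classification sketch.

    patterns : list of (m_k, n) arrays, one per class, with sizes m_k
               proportional to the class shares
    Returns the index of the class maximizing m_k * f_k(x); since the
    normalizing constants are common to all classes, the summed kernel
    values suffice.
    """
    x = np.asarray(x, float)
    scores = []
    for p in patterns:
        u = (x - np.asarray(p, float)) / h
        scores.append(np.exp(-0.5 * (u**2).sum(axis=1)).sum())
    return int(np.argmax(scores))

rng = np.random.default_rng(2)
pat0 = rng.normal([0.0, 0.0], 1.0, size=(60, 2))   # class 0 pattern
pat1 = rng.normal([5.0, 5.0], 1.0, size=(30, 2))   # class 1 pattern
assigned = classify([0.5, -0.2], [pat0, pat1], h=1.0)
```

Keeping the pattern sizes proportional to the class shares, as the text requires, is what makes the summed kernel values a valid stand-in for the Bayes rule.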
Verification of the correctness and characteristics of the results
The investigated procedures have undergone detailed empirical study, first using artificially generated data, which allows for illustrative characterization and comparison of the specific features of the obtained results (Section 6.1), and second by means of sociological benchmarks concerning generosity (Section 6.2) and fertility (Section 6.3), as well as measurements of pollution in Kraków (Section 6.4).
Additional comments
The first issue that requires commentary is ensuring the desired dataset size. With the current automation of metrological systems, this aspect is often not a problem; however, in many tasks the number of records is objectively limited (e.g., the number of countries concerned in the generosity and fertility research of Sections 6.2 and 6.3). Too small a dataset could result in accidental aspects of
Summary
This paper presents methodologically uniform material for three fundamental procedures of data analysis: outlier (atypical element) detection, clustering, and classification in the conditional approach. A correct definition of the conditioning factors and values can considerably broaden the possibilities of data analysis, in particular the interpretation of the results. In turn, owing to this uniformity and the ease of interpretation of the universal mathematical apparatus applied here, the presented
CRediT authorship contribution statement
Piotr Kulczycki: Conceptualization, Methodology, Validation, Formal analysis, Writing, Supervision. Krystian Franus: Software, Resources, Investigation, Visualization.
Acknowledgments
This work was partially supported by the Systems Research Institute of the Polish Academy of Sciences, as well as the Faculty of Physics and Applied Computer Science of the AGH University of Science and Technology.
References (51)
- et al., An Evaluation of Utilizing Geometric Features for Wheat Grain Classification using X-ray Images, Comput. Electron. Agric. (2018)
- et al., A new kernel density estimator based on the minimum entropy of data set, Inform. Sci. (2019)
- Kernel density estimation based sampling for imbalanced class distribution, Inform. Sci. (2020)
- et al., Conditional parameter identification with different losses of under- and overestimation, Appl. Math. Modell. (2013)
- et al., Identification of atypical elements by transforming task to supervised form with fuzzy and intuitionistic fuzzy evaluations, Appl. Soft Comput. (2017)
- et al., Class-specific attribute value weighting for Naive Bayes, Inform. Sci. (2020)
- Outlier Analysis (2013)
- Data Mining (2015)
- et al., Detection of Spatial Outlier by Using Improved Z-Score Test
- Categorical Data Analysis (2002)
- Statistical Inference
- Application of Complete Gradient Clustering Algorithm for analysis of wildlife spatial distribution, Ecol. Indicators
- Pattern Classification
- The estimation of the gradient of a density function, with applications in pattern recognition, IEEE Trans. Inform. Theory
- DENCLUE 2.0: Fast Clustering Based on Kernel Density Estimation
- Outlier detection using k-nearest neighbour graph
- Methodically unified procedures for outlier detection, clustering and classification
- Parametric identification for robust control, Automatic Control, Robotics
- A complete gradient clustering algorithm formed with kernel estimators, Int. J. Appl. Math. Comput. Sci.
- An algorithm for conditional multidimensional parameter identification with asymmetric and correlated losses of under- and overestimations, J. Stat. Comput. Simul.
- The complete gradient clustering algorithm: properties in practical applications, J. Appl. Stat.
- Outlier Detection, Clustering, and Classification – Unified Algorithms for Conditional Approach, Proceedings of the 12th International Conference on Computational Collective Intelligence, Da Nang, Vietnam, 30 November – 3 December