Applied Soft Computing

Volume 60, November 2017, Pages 623-633

Identification of atypical elements by transforming task to supervised form with fuzzy and intuitionistic fuzzy evaluations

https://doi.org/10.1016/j.asoc.2017.06.024

Highlights

  • A ready-to-use procedure for identifying atypical (rare) elements is proposed.

  • The concept does not depend on the distribution characterizing the data set.

  • The method enables the effective generation of fuzzy and intuitionistic evaluations.

  • The share of atypical elements is defined by a single parameter.

  • The task can be transformed into a supervised form, which allows classification methods to be applied.

Abstract

The subject of this paper is a procedure for the identification (detection, discovery) of atypical elements, understood as elements that occur rarely. The procedure produces a rating of whether an examined observation should be classed as atypical, given in the classic two-valued form (deterministic, sharp) as well as in fuzzy or intuitionistic fuzzy form. Moreover, the task of identifying atypical elements, unsupervised in its basic formulation, can as a result of the procedure be brought to a supervised form, which allows the well-developed and diverse methods of supervised classification to be used. The investigated method is independent of the distribution of the population and enables the detection of atypical elements not only in the tails but also, e.g. for multimodal distributions with widely separated components, potentially inside the distribution. The procedure is presented in a ready-to-use form and does not require laborious research or literature study. Its correct functioning has been examined in practical medical problems.

Introduction

The task of identifying atypical elements is one of the fundamental problems of contemporary data analysis [1]. Its significance is growing, particularly because information is now commonly measured, transferred, collected and processed automatically, which removes human perceptiveness and judgment from the detection of potential anomalies.

The occurrence of atypical elements can be interpreted in two ways. The first, and more popular, associates them with gross errors affecting some elements of the set being considered. They are then eliminated or corrected. In this case the identification of atypical elements can be termed detection, a word that generally connotes negative occurrences. In the second interpretation, less common yet more constructive, atypical elements represent unconventional phenomena, exceptional items and new trends. They then provide exceptionally valuable information, and stimulate nontrivial behaviors and innovative thinking. In order to cover this case, it is worth replacing the notion of “detection” with the more neutral “identification”, as is done throughout this text.

There is no single definition of atypical elements. The most general is that they are observations originating from a distribution other than that of the remaining population. However, this view does not help to recognize them in a specific dataset. This definition is most often refined, through the classic notion of “outliers”, into a distance-based concept indicating those elements furthest from the majority. This paper applies the frequency approach, whereby atypical elements are rare, i.e. the probability of their appearance is low. Thus, we can identify atypical observations not only on the peripheries of the population but, in the case of multimodal distributions with widely spread segments, also those lying between these segments, even if close to the center of the set (see Fig. 1 later).
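The frequency approach can be illustrated with a short calculation. The sketch below evaluates the density of the bimodal mixture used later in the numerical verification, N(−3, 1) with share 40% and N(3, 1) with share 60%, at a point between the modes and at a point on the periphery. The function names and the two test points are chosen for this illustration only and do not come from the paper.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_pdf(x):
    """Density of the mixture 0.4 * N(-3, 1) + 0.6 * N(3, 1)."""
    return 0.4 * normal_pdf(x, -3.0, 1.0) + 0.6 * normal_pdf(x, 3.0, 1.0)

center = mixture_pdf(0.0)      # point lying between the two modes
periphery = mixture_pdf(-5.0)  # point in the left tail of the population

print(f"density at  0: {center:.4f}")
print(f"density at -5: {periphery:.4f}")
```

The density at 0 is about 0.0044, while at −5 it is about 0.0216. Under the frequency approach the central point is therefore the rarer of the two, although a distance-based notion of outliers would rank the peripheral point as more extreme.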

A detailed review of notions and methods associated with atypical observations can be found in the classic monographs [2], [3] as well as the survey paper [4]. Their identification is applied in practice across a wide range of disciplines. In medical tasks, results deviating from standards may indicate dangers, illness or pathologies; in technology they reveal faults in a dynamic system under supervision; in archeology, a different origin of artefacts; in banking, attempted fraud. Atypical elements can also indicate threats to public order, meteorological anomalies, earthquakes, changes in climate and ecological dangers.

As mentioned before, the subject of this paper is the identification of elements atypical in the sense of rare occurrence in the population. Using a representative set of data, we select the regions of lowest distribution density in such a way that the total probability of an observation appearing in these regions equals an assumed value, e.g. 0.01, 0.05 or 0.1. Elements belonging to these regions will be treated as atypical (rare). An evaluation of whether the tested element should be termed atypical can be given in the classic two-valued form (deterministic, sharp) as well as in fuzzy [5] and intuitionistic fuzzy [6] form. The procedure is built on the method of nonparametric kernel estimators [7], [8], which makes it independent of the distribution characterizing the population under consideration. The material is ready to use without laborious research. Its easy and illustrative interpretation is particularly valuable.
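The selection of low-density regions described above can be sketched in one dimension as follows. This is a minimal illustration, not the paper's exact algorithm: the Gaussian kernel, the normal-reference bandwidth rule, the simple order-statistic quantile and all function names are assumptions made for the example.

```python
import math
import random

def kde(x, sample, h):
    """Gaussian kernel estimate of the density at x (one-dimensional)."""
    norm = len(sample) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample) / norm

def rule_of_thumb_bandwidth(sample):
    """Normal-reference bandwidth (an assumed, simple choice)."""
    m = len(sample)
    mean = sum(sample) / m
    std = math.sqrt(sum((v - mean) ** 2 for v in sample) / (m - 1))
    return 1.06 * std * m ** (-0.2)

def density_threshold(sample, h, r):
    """Empirical quantile of order r of the density values at the sample points."""
    z = sorted(kde(xi, sample, h) for xi in sample)
    k = max(math.ceil(r * len(z)) - 1, 0)
    return z[k]

def is_atypical(x, sample, h, q_r):
    """Sharp (two-valued) rating: atypical if the estimated density is low."""
    return kde(x, sample, h) <= q_r

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(500)]
h = rule_of_thumb_bandwidth(sample)
q_r = density_threshold(sample, h, r=0.05)   # assumed share of atypical elements
flags = [is_atypical(xi, sample, h, q_r) for xi in sample]
print(sum(flags), "of", len(sample), "elements rated atypical")
```

Because the threshold is a quantile of the estimated density values, roughly a share r of the representative elements falls below it, whatever the shape of the underlying distribution.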

Section 2 presents the methodology of statistical kernel estimators. The basic form of the procedure for identifying atypical, i.e. rarely occurring, elements is then described in Section 3. Because the task is poorly conditioned, mainly owing to the naturally very low number of elements considered atypical, Section 4 considerably improves the quality of the procedure by significantly enlarging the set of elements representative for the population. Next, in Section 5, equal-sized patterns of atypical and typical elements are generated. These form the basis for the effective creation of fuzzy and intuitionistic fuzzy assessments, also for disadvantageous parameter values, and for the convenient application of any well-developed classification method, chosen according to the researcher's preferences and the specifics of the task under investigation. In this way the inherently unsupervised task of identifying (detecting) atypical elements (outliers) is brought to the much more convenient supervised problem of classification with equal-sized patterns. In Section 6 the functioning of the procedure is verified using artificially generated illustrative data, and in Section 7 using medical data. The paper closes with Section 8, a summary containing a detailed sequence of steps for applying the method investigated here.

A preliminary version of this work was partially presented in the publications [9], [10].

Nonparametric kernel estimators

In the presented method, the characteristics of a data set are defined using the nonparametric methodology of kernel estimators. It is distribution-free, i.e. no preliminary assumptions concerning the type of distribution are required. A broad description can be found in the monographs [7], [8]. Example applications to data analysis tasks are described in the publications [11], [12], [13], [14], [15]; see also [16], [17].
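For orientation, the kernel estimator of a density is usually written in the textbook form below. It is given here as an assumption, since the definition in this excerpt is truncated and the paper's own notation may differ: m denotes the sample size, n the dimension, h > 0 a smoothing parameter (bandwidth) and K a kernel function integrating to one.

```latex
\hat{f}(x) = \frac{1}{m h^{n}} \sum_{i=1}^{m} K\!\left(\frac{x - x_i}{h}\right),
\qquad \int_{\mathbb{R}^{n}} K(y)\, dy = 1
```

Each sample element $x_i$ contributes one copy of the kernel centered at that element, so regions containing few elements receive a low estimated density.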

Let the n-dimensional continuous random variable X

Basic version of procedure

The basic idea of the presented procedure for identification of atypical elements stems from the significance test proposed in the work [18]. Thanks to the application of nonparametric methods it is unnecessary to introduce arbitrary assumptions concerning distribution type for an examined population.

Let the set of elements representative for the population be given:

$x_1, x_2, \ldots, x_m$. (10)

Treat these elements as realizations of the n-dimensional continuous random variable X with distribution having

Extended pattern of population

Although, from a theoretical point of view, the procedure presented in the previous section seems complete, when the values r are applied in practice (see conditions (13) and especially (14)) and the size m is not large, the estimator of the quantile $\hat{q}_r$ is burdened with a large error, due to the low number of elements $z_i$ smaller than the estimated value. To counteract this, the data set will be extended by generating additional elements with distribution identical to that characterizing the
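One common way to generate additional elements from a distribution described by a kernel estimator is to pick a random sample element and perturb it with kernel-distributed noise scaled by the smoothing parameter. The sketch below shows this for a one-dimensional Gaussian kernel. It is an assumption about how such an extension might be done, since the paper's own generation scheme is not visible in this excerpt; `extend_sample` and its parameters are hypothetical names.

```python
import random

def extend_sample(sample, h, size, rng):
    """Draw `size` new points from a Gaussian kernel estimator built on `sample`.

    Each new point is a randomly chosen sample element plus Gaussian noise
    with standard deviation h (the assumed smoothing parameter).
    """
    return [rng.choice(sample) + h * rng.gauss(0.0, 1.0) for _ in range(size)]

rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
extended = sample + extend_sample(sample, h=0.3, size=2000, rng=rng)
print(len(sample), "->", len(extended), "elements")
```

With the larger set, more density values lie below a low-order quantile, which is the effect the extension aims for.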

Equal-sized patterns of atypical and typical elements; fuzzy and intuitionistic fuzzy evaluations

Let us consider set (10) introduced in Section 3, consisting of elements representative for an investigated population, and potentially extended as described in Section 4. Its subset comprising those observations $x_i$ for which condition (19) is fulfilled can be treated as a pattern of atypical elements. Denote it thus:

$x_1^{at}, x_2^{at}, \ldots, x_{m_{at}}^{at}$.

Similarly, the set of observations for which the opposite inequality (20) is true may be considered as a pattern of typical

Numerical verification

This section presents the results of numerical verification, which confirmed the correct functioning of the procedure for identifying atypical elements. Results obtained for real data taken from medicine are described in the subsequent Section 7.

Consider therefore the one-dimensional case, where the distribution characterizing the data in set (10) is bimodal with the following normal (Gauss) components and shares:

$N(-3, 1)$ 40%, $N(3, 1)$ 60%.

Table 1 shows results achieved with the

Experimental verification

Laboratory research is a fundamental factor of contemporary medical practice and the most important source of information for making correct medical decisions. This section describes the implementation of the procedure worked out for identifying atypical elements, using experimental data from biochemical blood tests concerning plasma component analysis or, to be more precise, concentration of electrolytes: glucose, potassium and sodium. The data used below originates from the National Health

Summary

This paper deals with a procedure for identifying atypical elements constructed using nonparametric methods of mathematical statistics, which makes the investigated concept independent of the distribution characterizing the data set under analysis. Atypical elements are understood to be rarely occurring. The sensitivity of the procedure is defined by a single parameter, interpreted as the share of atypical elements in the population. The text contains a complete formula for the algorithm, without need for additional

References (26)

  • L.A. Zadeh, Fuzzy sets, J. Inf. Control (1965)
  • P. Kulczycki et al., Conditional parameter identification with different losses of under- and overestimation, Appl. Math. Modell. (2013)
  • C.C. Aggarwal, Data Mining (2015)
  • C.C. Aggarwal, Outlier Analysis (2013)
  • V. Barnett et al., Outliers in Statistical Data (1994)
  • V.-J. Hodge et al., A survey of outlier detection methodologies, Artif. Intell. Rev. (2004)
  • K. Atanassov, Intuitionistic Fuzzy Sets. Theory and Applications (1999)
  • P. Kulczycki, Estymatory jądrowe w analizie systemowej (2005)
  • M. Wand et al., Kernel Smoothing (1995)
  • P. Kulczycki et al., Detection of atypical elements with fuzzy and intuitionistic fuzzy evaluation
  • P. Kulczycki et al., Detection of Atypical Elements by Transforming Task to Supervised Form
  • P. Kulczycki et al., An algorithm for conditional multidimensional parameter identification with asymmetric and correlated losses of under- and overestimations, J. Stat. Comput. Simul. (2016)
  • P. Kulczycki et al., The complete gradient clustering algorithm: properties in practical applications, J. Appl. Stat. (2012)