A new multi-objective differential evolution approach for simultaneous clustering and feature selection

https://doi.org/10.1016/j.engappai.2019.103307

Abstract

Today’s real-world data mostly involves incomplete, inconsistent, and/or irrelevant information, which hinders its transformation into an understandable format. To deal with such issues, data preprocessing is a proven discipline in data mining. One of the typical tasks in data preprocessing, feature selection aims to reduce the dimensionality of the data and thereby facilitates further processing. Feature selection is widely used to enhance the performance of supervised learning algorithms (e.g., classification) but is rarely used in unsupervised tasks (e.g., clustering). This paper introduces a new multi-objective differential evolution approach that finds relatively homogeneous clusters, without prior knowledge of the number of clusters, using only a subset of all available features in the data. To assess the effectiveness of the introduced approach, several experiments are conducted on a variety of real-world and synthetic benchmarks using a variety of clustering approaches. The analyses across several different criteria suggest that our method can significantly improve clustering performance while reducing dimensionality at the same time.

Introduction

Nowadays, vast amounts of data can be stored with little effort thanks to developments in technological tools such as the World Wide Web, network services, and database storage systems. On the other hand, due to this massive data growth, extracting knowledge with a data mining method becomes more difficult. The quality of the knowledge extracted from data depends not only on the data mining method but also on the quality of the data itself: the higher the quality and suitability of the data, the better the knowledge extraction process is expected to be. Unfortunately, data may involve not only required and necessary information but also noise, inconsistency, missing values, and a huge number of samples and features. To deal with such issues, data preprocessing has proven to be a major and essential step in preparing the data for further data mining processes.

One of the critical tasks in data preprocessing, dimensionality reduction is commonly used to alleviate the curse of dimensionality, “which is a serious problem as it will impede the operation of most data mining algorithms as the computational cost rises” (García et al., 2016). Dimensionality reduction approaches fall mainly into two categories: space transformation and feature selection. Space transformation tries to transform the original feature set into a small number of projections; representative examples are principal component analysis, linear discriminant analysis, and factor analysis (Ekbal and Saha, 2015). Feature selection is the process of eliminating as many irrelevant and redundant features as possible. The goal is to obtain, from all available features, a feature subset that is far more beneficial to further processing. In other words, feature selection aims to eliminate irrelevant and redundant features which may introduce undesired correlations into the subsequent (learning) process. Moreover, the subsequent learning process becomes faster and requires less storage. Unlike space transformation, feature selection preserves the original features and is thereby treated as more general. Feature selection has been widely used to enhance the performance of supervised tasks (e.g., prediction, classification, regression) in data mining and machine learning (Anifowose and Abdulraheem, 2011, Anifowose et al., 2013, Anifowose et al., 2014, Anifowose et al., 2016, Anifowose, 2018). In particular, feature selection approaches have brought significant improvements in both the learning performance and the computational efficiency of classification algorithms. Such improvements are particularly observed when feature selection is wrapped around evolutionary computation (EC) techniques, which are known to be powerful global search methods (Xue et al., 2013).
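As a concrete illustration of the filter end of the feature-selection spectrum (this snippet is ours, not drawn from the cited works), a minimal sketch ranks features by their absolute Pearson correlation with a supervised target and keeps the top k:

```python
import numpy as np

def select_top_k_features(X, y, k):
    """Simple filter-style feature selection: score each feature by its
    absolute Pearson correlation with the target, keep the k best indices."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 2] + 0.1 * rng.normal(size=200)  # only feature 2 is relevant
print(select_top_k_features(X, y, 1))           # feature index 2 ranks first
```

Wrapper methods, including the EC-based ones discussed above, instead evaluate candidate subsets with the learning algorithm itself, which is costlier but typically more accurate.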

Despite its popularity in classification, feature selection has rarely been used in unsupervised tasks such as clustering. Clustering is the process of dividing data instances into clusters (or groups) such that the instances within a cluster are highly similar to each other, whereas the instances from different clusters are maximally dissimilar. Representative measures to quantify the similarity between a pair of instances are the Euclidean, Mahalanobis, Cityblock, Cosine, Correlation, Hamming, and Jaccard distances (Hancer and Karaboga, 2017). When computing the distance between data instances, each feature is typically given equal importance. Unfortunately, such a computation may lead to the curse of dimensionality in clustering high-dimensional data. It is also reported in Chakraborty and Das (2018) that the performance of well-known clustering algorithms, such as K-means and its variants (MacQueen, 1967), may deteriorate as the number of features grows. Another significant and long-standing issue in clustering is the automatic evolution of the number of clusters, since today’s real-world data mostly does not come with this information. It is therefore necessary to integrate automatic cluster evolution with automated feature selection, referred to as simultaneous clustering and feature selection. This problem has only recently come into consideration, and there exist only a few attempts to address it in the literature, where it has been treated as a single-objective problem using EC techniques such as a niching genetic algorithm (GA) (Shang et al., 2007) and particle swarm optimization (PSO) (Lensen et al., 2016). In these EC-based simultaneous clustering and feature selection approaches, all features are considered during the calculation of similarity measures even when they are not selected, in order to prevent feature selection from introducing a bias towards lower cluster numbers. In other words, feature selection is carried out independently of clustering; thus, it is not possible to determine whether the selected feature subset is suitable for the partitions obtained by clustering. It can therefore be inferred that simultaneous clustering and feature selection remains an open issue for researchers.
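The dependence of clustering on the selected subset can be made concrete: when a binary feature mask restricts the distance computation, the similarity between two instances, and hence the resulting partition, changes with that mask. A minimal sketch with Euclidean distance (our illustration; the mask and values below are hypothetical):

```python
import numpy as np

def masked_euclidean(a, b, mask):
    """Euclidean distance computed only over the selected features."""
    sel = np.asarray(mask, dtype=bool)
    return float(np.linalg.norm(a[sel] - b[sel]))

a = np.array([0.0, 0.0, 10.0])
b = np.array([3.0, 4.0, -10.0])
print(masked_euclidean(a, b, [1, 1, 0]))  # 5.0: the noisy third feature is ignored
print(masked_euclidean(a, b, [1, 1, 1]))  # ~20.6: it dominates the full-space distance
```

With equal feature weights, a single irrelevant high-variance feature can dominate the distance, which is precisely the curse-of-dimensionality effect noted above.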

In this paper, we aim to deal with issues concerning clustering that attract researchers’ attention by developing a new simultaneous clustering and feature selection approach, with the expectation of enhancing the quality of cluster partitions. To achieve this, we propose a variable-string length based multi-objective differential evolution approach. Specifically, we will investigate:

  • the performance of the proposed approach versus representative partitional approaches on real-world datasets,

  • the performance of the proposed approach versus representative partitional approaches on synthetic datasets, and

  • the performance of the proposed approach versus several different clustering approaches.

The rest of the paper is organized as follows. Section 2 outlines differential evolution, cluster analysis, and recent related works. Section 3 introduces the proposed algorithm. Section 4 defines the experimental design. Section 5 presents the results with discussions. Finally, Section 6 states conclusions with future trends.

Section snippets

Background

This section first describes the basic differential evolution algorithm. It then explains the fundamentals of cluster analysis. Finally, it surveys recent works on simultaneous clustering and feature selection.

Proposed clustering and feature selection approach

In this section, we describe the newly proposed multi-objective differential evolution approach (referred to as MODE-CFS) for simultaneous clustering and feature selection. It should be noted that, instead of using an existing variant of multi-objective differential evolution, we develop a new multi-objective variable-string-length differential evolution to perform clustering and feature selection simultaneously.
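The exact MODE-CFS representation is not reproduced in this snippet, but the idea of a variable-string-length encoding can be sketched as follows: each individual carries a variable number of candidate cluster centres together with a binary feature-selection mask, so that variation operators can change both the number of clusters and the active features. All names and shapes below are our assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(42)

def random_individual(n_features, k_max):
    """One hypothetical variable-length individual for simultaneous
    clustering and feature selection: a set of candidate cluster centres
    (the string length varies with the cluster count) plus a binary mask
    marking the selected features. This is an illustrative encoding,
    not the paper's exact MODE-CFS representation."""
    k = int(rng.integers(2, k_max + 1))                 # active cluster count
    centres = rng.uniform(size=(k, n_features))         # candidate centres
    feature_mask = rng.integers(0, 2, size=n_features)  # 1 = feature selected
    return centres, feature_mask

centres, mask = random_individual(n_features=4, k_max=6)
print(centres.shape[0], mask.tolist())
```

Under such an encoding, a multi-objective DE can trade off, for example, cluster compactness against the number of selected features across the population.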

Experimental design

The performance of the proposed simultaneous clustering and feature selection approach is evaluated by comparing it with a number of representative partitional clustering approaches on a variety of real-world and synthetic datasets using representative evaluation criteria. Due to its non-deterministic mechanism, each approach is carried out for 30 independent runs, and the mean of each evaluation criterion is calculated across all runs. To further analyze the performance of the MODE-CFS

Results and discussions

The experimental results are presented in Table 2, Table 3 for the real-world and synthetic datasets, respectively. Each table shows the mean values of the feature subset size and the number of clusters obtained by each method, as well as the average clustering performance in terms of a variety of evaluation criteria. The external evaluation criteria should be maximized, while the internal evaluation criteria, as well as the number of features, should be minimized. For each of the approaches used for
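To make the direction of the criteria concrete: an external criterion compares the obtained partition with ground-truth labels and is to be maximized. A minimal sketch of one such criterion, the Rand index, computed over all instance pairs (our illustration; the paper's exact criteria may differ):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """External criterion (higher is better): fraction of instance pairs on
    which two partitions agree (same-cluster vs. different-cluster)."""
    agree = sum((a1 == a2) == (b1 == b2)
                for (a1, b1), (a2, b2) in combinations(zip(labels_a, labels_b), 2))
    n_pairs = len(labels_a) * (len(labels_a) - 1) // 2
    return agree / n_pairs

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: identical up to relabelling
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # ~0.33: partitions mostly disagree
```

Internal criteria, by contrast, score compactness and separation from the data alone, which is why they point in the opposite direction (lower is better) in the tables.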

Conclusions

Multi-objective clustering is not an easy task due to the following issues. First, the components of the multi-objective fitness function should be designed or selected according to the considered multi-objective framework. For example, one internal index which properly works in a multi-objective evolutionary framework may not have a positive effect on clustering when used with another multi-objective evolutionary framework. Second, the selection of a solution from the Pareto front is also a

References (51)

  • Saha, S., et al. Simultaneous feature selection and symmetry based clustering using multiobjective framework. Appl. Soft Comput. (2015)
  • Shang, W., et al. A novel feature selection algorithm for text categorization. Expert Syst. Appl. (2007)
  • Szilágyi, S.M., et al. A fast hierarchical clustering algorithm for large-scale protein sequence data sets. Comput. Biol. Med. (2014)
  • Zhao, Q., et al. WB-index: A sum-of-squares based index for cluster validity. Data Knowl. Eng. (2014)
  • Anifowose, F. Data-driven approach to handling high-dimensional data input space using feature selection-based hybrid...
  • Anifowose, F., et al. A least-square-driven functional networks type-2 fuzzy logic hybrid model for efficient petroleum reservoir properties prediction. Neural Comput. Appl. (2013)
  • Ankerst, M., et al. OPTICS: Ordering points to identify the clustering structure
  • Bache, K., et al. UCI machine learning repository (2013)
  • Calinski, R., et al. A dendrite method for cluster analysis. Commun. Stat. - Theory Methods (1974)
  • Comaniciu, D., et al. Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell. (2002)
  • Davies, D., et al. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. (1979)
  • Dunn, J. Well separated clusters and optimal fuzzy partitions. J. Cybern. (1974)
  • Dutta, D., Dutta, P., Sil, J. Simultaneous feature selection and clustering for categorical features using multi...
  • Dutta, D., Dutta, P., Sil, J. Simultaneous continuous feature selection and K clustering by Multi Objective Genetic...
  • Fowlkes, E.B., et al. A method for comparing two hierarchical clusterings. J. Amer. Statist. Assoc. (1983)

    No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2019.103307.
