Elsevier

Pattern Recognition

Volume 53, May 2016, Pages 25-35

Weighted Multi-view Clustering with Feature Selection

https://doi.org/10.1016/j.patcog.2015.12.007

Highlights

  • This paper proposes a new multi-view data clustering algorithm.

  • The new method considers both view weighting and feature weighting.

  • An EM-like method is designed to obtain a locally optimal solution.

  • Extensive experiments have been conducted to show its effectiveness.

Abstract

In recent years, combining multiple sources or views of a dataset has become a popular way to improve clustering accuracy. Because different views are different representations of the same set of instances, information from multiple views can be used simultaneously to improve on the clustering results obtainable from the limited information in a single view. Previous studies mainly focus on the relationships between distinct data views, which yields some improvement over single-view clustering. However, for high-dimensional data, where each view is itself of high dimensionality, feature selection is also necessary to further improve the clustering results. To address this need, this paper proposes a novel algorithm termed Weighted Multi-view Clustering with Feature Selection (WMCFS) that simultaneously performs multi-view data clustering and feature selection. Two weighting schemes are designed that respectively weight the views of data points and the feature representations in each view, so that the best view and the most representative feature space in each view can be selected for clustering. Experiments on real-world datasets validate the effectiveness of the proposed method.

Introduction

Clustering is one of the most important methods for exploring the underlying (cluster) structure of data [1]. The basic idea is to partition a set of data objects according to some criterion so that similar objects are grouped into the same cluster and dissimilar objects are separated into different clusters. To achieve this goal, clustering is usually conducted by maximizing the intra-cluster similarity and the inter-cluster dissimilarity. After several decades of development, a number of clustering algorithms have been proposed [1], such as k-means clustering [2], spectral clustering [3], kernel-based clustering [4], graph-based clustering [5] and hierarchical clustering [6].
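As one concrete instance of such a criterion, the standard k-means objective minimizes the total squared distance between each object and the mean of its cluster. The notation below (clusters C_c, means \mu_c, objects x_i) is introduced here for illustration and is not taken from the paper.

```latex
\min_{C_1,\dots,C_k} \; \sum_{c=1}^{k} \sum_{x_i \in C_c} \lVert x_i - \mu_c \rVert_2^2,
\qquad
\mu_c = \frac{1}{|C_c|} \sum_{x_i \in C_c} x_i
```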

With the development of hardware technology, a huge amount of multi-view data with various representations has been generated in real-world applications [7], [8], [9], [10], [11], [12], [13], [14]. For example, in web clustering, different types of data, such as images, videos, hyperlinks and texts, can be taken into consideration, as they are different views of web pages (as shown in Fig. 1). In multi-view data, different views are different representations of the same set of instances. Combining multiple views or sources of the same set of instances to obtain better clustering performance is a significant research challenge. Existing clustering algorithms designed for single-source data cannot be applied directly to data consisting of multiple views or various representations, as such data often differ greatly from traditional single-source data. Data in different views or sources are generally not directly comparable, because their dimensions and semantic representations differ.
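A minimal sketch of how such multi-view data might be held in memory, assuming each view is a NumPy array whose rows are the same instances. The view names, sizes and helper functions are hypothetical and chosen only to show that views share instances but not dimensions or scales.

```python
import numpy as np

def make_toy_views(n_samples=6, seed=0):
    """Build a hypothetical two-view dataset: same instances, different feature spaces."""
    rng = np.random.default_rng(seed)
    return {
        "text": rng.random((n_samples, 5)),           # e.g. 5 term-frequency features
        "image": 100.0 * rng.random((n_samples, 3)),  # e.g. 3 features on a much larger scale
    }

def check_aligned(views):
    """Return the shared number of instances, or raise if the views disagree."""
    counts = {X.shape[0] for X in views.values()}
    if len(counts) != 1:
        raise ValueError("all views must describe the same set of instances")
    return counts.pop()

def standardize_each_view(views):
    """Z-score every feature within its own view so that scales become comparable."""
    out = {}
    for name, X in views.items():
        std = X.std(axis=0)
        std[std == 0] = 1.0  # avoid division by zero for constant features
        out[name] = (X - X.mean(axis=0)) / std
    return out

views = make_toy_views()
n = check_aligned(views)
scaled = standardize_each_view(views)
```

Per-view standardization only removes differences in scale; it does not make the views semantically comparable, which is why a dedicated multi-view method is still needed.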

In addition, some views may be of high dimensionality, which leads to high computational complexity and possibly low clustering accuracy. For example, in biomedicine, different types of information can be collected for a patient, including magnetic resonance images, cerebrospinal fluid test data, blood test data, protein expression data, and genetic data, each of which is taken as a distinct view of the patient data. Some of these views may be high-dimensional and therefore costly to process, yet for some specific views only a portion of the features is needed to improve the clustering results. In other words, feature selection can both simplify the computation and help to obtain an accurate data model in data clustering [15], [13], [16].

In order to solve this problem, we propose a novel algorithm, termed Weighted Multi-view Clustering with Feature Selection (WMCFS), which simultaneously performs multi-view data clustering and feature selection. A global objective function is proposed that takes both multi-view learning and feature selection into consideration in the process of data clustering. In the global objective function, two weighting schemes are designed that respectively weight the views of data points and the feature representations in each view, so that the best view and the most representative feature space in each view can be selected for clustering. To solve the objective function, we design an EM (Expectation Maximization)-like iteration, which converges to acceptable clustering results. Experiments on real-world datasets validate the effectiveness of the proposed method.
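The general idea of alternating between cluster assignment and weight updates can be sketched as follows. This is not the WMCFS objective or update rule, which this excerpt does not state. It is a generic weighted k-means variant in which view weights and per-view feature weights each sum to one and are updated in closed form. The function name and the exponents `view_exp` and `feat_exp` are hypothetical, as is the toy data.

```python
import numpy as np

def weighted_multiview_kmeans(views, k, view_exp=2.0, feat_exp=2.0, n_iter=20, seed=0):
    """Illustrative alternating scheme with view and feature weights (assumed form)."""
    if view_exp <= 1.0 or feat_exp <= 1.0:
        raise ValueError("exponents must exceed 1 for the closed-form weight updates")
    rng = np.random.default_rng(seed)
    n = views[0].shape[0]
    labels = rng.integers(0, k, size=n)                 # random initial assignment
    view_w = np.full(len(views), 1.0 / len(views))      # uniform view weights
    feat_w = [np.full(X.shape[1], 1.0 / X.shape[1]) for X in views]
    eps = 1e-12

    for _ in range(n_iter):
        # Step 1: per-view centroids from the current assignment.
        centroids = []
        for X in views:
            C = np.empty((k, X.shape[1]))
            for c in range(k):
                members = X[labels == c]
                # An empty cluster is re-seeded with a randomly chosen point.
                C[c] = members.mean(axis=0) if len(members) else X[rng.integers(n)]
            centroids.append(C)

        # Step 2: reassign points using the combined weighted squared distance.
        dist = np.zeros((n, k))
        for v, X in enumerate(views):
            sq = (X[:, None, :] - centroids[v][None, :, :]) ** 2   # shape (n, k, d_v)
            dist += view_w[v] ** view_exp * (sq * feat_w[v] ** feat_exp).sum(axis=2)
        labels = dist.argmin(axis=1)

        # Step 3: feature weights, inversely related to within-cluster dispersion.
        view_cost = np.empty(len(views))
        for v, X in enumerate(views):
            per_feat = ((X - centroids[v][labels]) ** 2).sum(axis=0) + eps
            inv = per_feat ** (-1.0 / (feat_exp - 1.0))
            feat_w[v] = inv / inv.sum()
            view_cost[v] = (feat_w[v] ** feat_exp * per_feat).sum()

        # Step 4: view weights, inversely related to each view's weighted cost.
        inv = (view_cost + eps) ** (-1.0 / (view_exp - 1.0))
        view_w = inv / inv.sum()

    return labels, view_w, feat_w

# Toy data: view_a has 2 informative and 2 noisy features, view_b is pure noise.
rng = np.random.default_rng(1)
signal = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
view_a = np.hstack([signal, rng.normal(0, 1, (40, 2))])
view_b = rng.normal(0, 1, (40, 3))
labels, view_w, feat_w = weighted_multiview_kmeans([view_a, view_b], k=2)
```

In this sketch a view or feature with smaller within-cluster dispersion receives a larger weight, which is one common way to let the data decide which views and features matter. The exponents control how sharply the weights concentrate.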

The rest of the paper is organized as follows. Section 2 briefly reviews previous work on multi-view data clustering. The proposed WMCFS algorithm and its foundations are described in detail in Section 3. To demonstrate the performance of the algorithm, we have conducted extensive experiments, the results of which are reported in Section 4. The conclusion is drawn in Section 5.

Related work

For clustering multi-view or multi-source datasets, some algorithms have been proposed recently which take different factors into consideration, e.g. the differences and relationships between data from various views. Most of the earlier methods extend the traditional single-source clustering algorithms to the multi-view situation by simply minimizing the disagreement between different views, i.e., by minimizing the difference of the clustering results generated from different views. Two early

Weighted Multi-view Clustering with Feature Selection

To make this paper clear, Table 1 summarizes the symbols used in this paper.

Experimental results

In order to demonstrate the effectiveness of the proposed method, extensive experiments have been conducted on three real-world datasets. We first analyze the sensitivity of the performance to the two parameters p and β. Then, several state-of-the-art multi-view clustering methods are run and compared with the proposed method, which shows the significant improvement achieved by our method. For experimental purposes, we only perform the parameter analysis on two of the three datasets and

Conclusion

In this paper, we have proposed a novel multi-view clustering method, termed Weighted Multi-view Clustering with Feature Selection (WMCFS), which simultaneously performs feature selection and multi-view data clustering. A global objective function is proposed that takes both multi-view learning and feature selection into consideration in the process of data clustering. In the global objective function, two weighting schemes are designed that respectively weight the views of data

Conflict of interest

There is no conflict of interest.

Acknowledgments

This work was supported by NSFC (61173084 and 61502543), CCF-Tencent Open Research Fund (CCF-TencentRAGR20140112), the PhD Start-up Fund of Natural Science Foundation of Guangdong Province, China (2014A030310180), and Guangdong Natural Science Funds for Distinguished Young Scholar (No. 16050000051). The authors would like to thank the associate editor and reviewers for their comments, which were very helpful in improving the manuscript.

Yu-Meng Xu received her master's degree in 2015 from Sun Yat-sen University, China. Her research interest is data clustering.

References (33)

  • Z. Wu et al.

    An optimal graph theoretic approach to data clustering: theory and its application to image segmentation

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1993)
  • E. Eaton et al.

    Multi-view constrained clustering with an incomplete mapping between views

    Knowl. Inf. Syst.

    (2014)
  • E. Taralova, F. De la Torre, M. Hebert, Source constrained clustering, in: Proceedings of the 2011 IEEE International...
  • L. Huang et al.

    Co-learned multi-view spectral clustering for face recognition based on image sets

    IEEE Signal Process. Lett.

    (2014)
  • X. Wang, B. Qian, I. Davidson, Improving document clustering using automated machine translation, in: Proceedings of...
  • M. Fang et al.

    Multi-source transfer learning based on label shared subspace

    Pattern Recognit. Lett.

    (2014)

    Chang-Dong Wang received his Ph.D. degree in computer science in 2013 from Sun Yat-sen University, China. He is currently an assistant professor at the School of Mobile Information Engineering, Sun Yat-sen University. His current research interests include machine learning and pattern recognition, especially data clustering and its applications. He has published over 30 scientific papers in international journals and conferences such as IEEE TPAMI, IEEE TKDE, IEEE TSMC-C, Pattern Recognition, Knowledge and Information Systems, Neurocomputing, ICDM and SDM. His ICDM 2010 paper won the Honorable Mention for Best Research Paper Awards. He won the 2012 Microsoft Research Fellowship Nomination Award. He was awarded the 2015 Chinese Association for Artificial Intelligence (CAAI) Outstanding Dissertation.

    Jian-Huang Lai received his M.Sc. degree in applied mathematics in 1989 and his Ph.D. in mathematics in 1999 from Sun Yat-sen University, China. He joined Sun Yat-sen University in 1989 as an assistant professor, where he is currently a professor with the Department of Automation of the School of Information Science and Technology and dean of the School of Information Science and Technology. His current research interests are in the areas of digital image processing, pattern recognition, multimedia communication, and wavelets and their applications. He has published over 150 scientific papers in international journals and conferences on image processing and pattern recognition, e.g. IEEE TPAMI, IEEE TKDE, IEEE TNN, IEEE TIP, IEEE TSMC (Part B), Pattern Recognition, ICCV, CVPR and ICDM. Lai serves as a standing member of the Image and Graphics Association of China and also serves as a standing director of the Image and Graphics Association of Guangdong.
