Neurocomputing

Volume 266, 29 November 2017, Pages 458-464

Brief papers
Orthogonal extreme learning machine for image classification

https://doi.org/10.1016/j.neucom.2017.05.058

Abstract

Extreme learning machine (ELM) is an emerging learning algorithm for generalized single hidden layer feedforward neural networks, in which the parameters of the hidden units are randomly generated so that the output weights can be calculated analytically. From the hidden layer to the output layer, ELM essentially learns the output weight matrix through a least squares regression formulation, which can be used for both classification/regression and dimensionality reduction. In this paper, we impose an orthogonal constraint on the output weight matrix and thereby formulate an orthogonal extreme learning machine (OELM) model, which produces orthogonal basis functions and can have more locality preserving power from the ELM feature space to the output layer than ELM. Since the locality preserving ability is potentially related to the discriminating power, OELM is expected to have more discriminating power than ELM. Considering that the number of hidden units is usually greater than the number of classes, we propose an effective method to optimize the OELM objective by solving an orthogonal Procrustes problem. Experiments that pairwise compare OELM with ELM on three widely used image data sets show the effectiveness of learning an orthogonal mapping, especially when only limited training samples are given.

Introduction

Extreme learning machine [1] has proved to be an efficient and effective method for training single hidden layer feedforward neural networks (SLFNs), providing a unified framework for both multi-class classification and regression tasks. The basic ELM model can simply be seen as a random feature mapping followed by least squares regression. The main contribution of ELM to general SLFNs is that the parameters of the hidden units, including the input weights between the input layer and the hidden layer as well as the biases of the hidden units, can be randomly generated, which allows the output weights between the hidden layer and the output layer to be determined analytically. This greatly alleviates the burden of weight tuning incurred by the widely used back-propagation algorithms and thus guarantees the fast learning speed of ELM. Although the mathematical formulation of ELM is simple, the universal approximation capability [2], [3] of SLFNs is still retained. Furthermore, the rationality of the randomly generated input weights and biases was analyzed in some recently published studies [4], [5]. ELM bridges many types of SLFNs such as feedforward networks (e.g., sigmoid networks), RBF networks, SVM (which can be considered a special type of SLFN) and polynomial networks, and shows that different learning algorithms are not needed for different SLFNs as far as universal approximation and classification capabilities are concerned [2], [6], [7]. Further, ELM theories and philosophy show that some earlier learning theories, such as ridge regression theory, Bartlett’s neural network generalization performance theory and SVM’s maximal margin, are actually consistent in machine learning [8], [9]. Inspired by deep learning yet different from it, hierarchical models that use ELM as the building block do not require intensive tuning of hidden layers and hidden units and still achieve impressive performance [10], [11]. Owing to the success of ELM in diverse applications, ELM research has become a hotspot in the machine learning community, with many studies conducted on theoretical investigation [4], [5], model improvements [11], [12] and applications [13], [14]. Some recent progress is briefly reviewed in [15], [16].

From the hidden layer to the output layer, ELM essentially learns the output weight matrix through a least squares regression formulation. Accordingly, many approaches have been proposed to perform discriminant analysis based on least squares regression. The central task is to find a proper transformation matrix that minimizes the sum-of-squares error function, which is then used for dimensionality reduction or classification. Xiang et al. proposed a framework of discriminative least squares regression for multiclass classification, whose idea is to utilize ε-dragging to enlarge the distance between samples from different classes [17]. Similar work by Zhang et al. aims to directly learn regression targets from data, which can evaluate the classification error better than conventional predefined regression targets [18]. In most cases, ELM is viewed as a classifier in which the hidden layer representation (the ELM feature space) is projected to the output layer (the label space); we therefore expect to learn a proper output weight matrix (transformation matrix) that makes ELM more effective for classification. To this end, many efforts have been made to impose different properties on the output weight matrix. Peng et al. proposed to enhance the label consistency property of ELM and formulated the graph regularized extreme learning machine, which shows excellent performance in face recognition [19] and EEG-based emotion recognition [20]. Shi et al. introduced elastic net regularization into ELM, which can simultaneously bring sparsity to the output weight matrix and avoid the singularity problem [21]. Among the existing strategies, the orthogonal constraint on the transformation matrix has been widely employed in both subspace learning and least squares-based classification, showing excellent performance in both situations. Cai et al. proposed the orthogonal locality preserving projection (OLPP) method, which produces orthogonal basis functions and can have more locality preserving power than LPP [22]. Since it has been shown that the locality preserving power is directly related to the discriminating power, OLPP obtains better performance than LPP [23]. In [24], Nie et al. showed that orthogonal least squares discriminant analysis is better than the basic counterpart without the orthogonal constraint. Similar work performed feature extraction based on orthogonal least squares regression [25]. Motivated by these studies, in this paper we propose to learn an orthogonal output weight matrix from the hidden layer to the output layer, with the expectation that the transformation matrix under the orthogonal constraint can preserve more structural information between these two layers and thus have more discriminating power for classification.

The remainder of this paper is organized as follows. Section 2 gives a brief description of the basic ELM model. The model formulation, optimization method, convergence as well as computational complexity of the proposed OELM are detailed in Section 3. Experimental studies are conducted in Section 4 to show the effectiveness of OELM. Section 5 concludes the whole paper.

Section snippets

Extreme learning machine

Suppose we have $n$ labeled training samples $\{(\mathbf{x}_i, \mathbf{y}_i), i=1,\dots,n\}$, where each sample $\mathbf{x}_i \in \mathbb{R}^{d\times 1}$ and its corresponding label vector $\mathbf{y}_i \in \mathbb{R}^{1\times c}$ ($c$ is the number of classes). If $\mathbf{x}_i$ is labeled as class $p$, then the $p$th element of $\mathbf{y}_i$ is 1 and the other elements of $\mathbf{y}_i$ are 0. Consider an SLFN with input weight matrix $\mathbf{A}\in\mathbb{R}^{d\times k}$, hidden bias vector $\mathbf{b}\in\mathbb{R}^{k\times 1}$ and output weight matrix $\mathbf{W}\in\mathbb{R}^{k\times m}$, where $k$ is the number of hidden units. For an input vector $\mathbf{x}$, the output of this SLFN can be represented as $f(\mathbf{x})=\sum_{i=1}^{k} G_i(\mathbf{x}, \mathbf{a}_i, b_i)\cdot \mathbf{w}_i$, where $\mathbf{a}_i$ and $b_i$ are the randomly generated input weight vector and bias of the $i$th hidden unit and $\mathbf{w}_i$ is the $i$th row of $\mathbf{W}$.
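To make the two steps above concrete, here is a minimal sketch (not the authors' code; the function names and the ℓ2-regularized variant of the least squares solution are assumptions) of basic ELM training in NumPy: a random sigmoid hidden layer followed by the closed-form output weight solution.

```python
import numpy as np

def elm_train(X, Y, k, reg=1e-3, seed=0):
    """Basic regularized ELM: random hidden layer + least squares output weights.

    X: (n, d) training samples, Y: (n, c) one-hot targets,
    k: number of hidden units, reg: l2 regularization strength (assumption).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = rng.standard_normal((d, k))          # random input weights
    b = rng.standard_normal(k)               # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))   # sigmoid hidden layer output, (n, k)
    # Closed-form output weights: W = (H^T H + reg*I)^{-1} H^T Y
    W = np.linalg.solve(H.T @ H + reg * np.eye(k), H.T @ Y)
    return A, b, W

def elm_predict(X, A, b, W):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return np.argmax(H @ W, axis=1)          # predicted class indices
```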

Model formulation and optimization

By introducing the orthogonal constraint, we formulate the objective of orthogonal ELM as $\min_{\mathbf{W}^{\mathrm{T}}\mathbf{W}=\mathbf{I}} \|\mathbf{H}\mathbf{W}-\mathbf{Y}\|^2$, where $\mathbf{H}\in\mathbb{R}^{n\times k}$, $\mathbf{W}\in\mathbb{R}^{k\times c}$ and $\mathbf{Y}\in\mathbb{R}^{n\times c}$. Under the orthogonal constraint, the data will be projected onto an orthogonal subspace where the metric structure of the data can be preserved. Some properties of OELM will be discussed in detail in Section 3.3. This section focuses on its model formulation as well as the optimization method.

Since $k > c$, objective (7) is an unbalanced orthogonal Procrustes problem
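Only the outline of the optimization is visible in this snippet, so the following is a hedged sketch of one standard alternating scheme for the unbalanced case, not necessarily the exact algorithm of Section 3: complete $\mathbf{W}$ to a full orthogonal basis, solve a balanced orthogonal Procrustes step in closed form via an SVD, and repeat; each step cannot increase the fitting error $\|\mathbf{H}\mathbf{W}-\mathbf{Y}\|$.

```python
import numpy as np

def oelm_output_weights(H, Y, n_iter=30, tol=1e-6, seed=0):
    """Sketch of an alternating SVD scheme for min ||H W - Y||^2 s.t. W^T W = I,
    with H (n, k), Y (n, c) and k > c (an unbalanced orthogonal Procrustes problem).
    """
    rng = np.random.default_rng(seed)
    n, k = H.shape
    c = Y.shape[1]
    # Start from a random k x c matrix with orthonormal columns.
    W, _ = np.linalg.qr(rng.standard_normal((k, c)))
    prev = np.inf
    for _ in range(n_iter):
        # Complete W to a full orthogonal basis; Z spans the orthogonal complement.
        Q_full, _ = np.linalg.qr(W, mode='complete')   # (k, k)
        Z = Q_full[:, c:]                              # (k, k - c)
        # Balanced Procrustes step: min_P ||H P - [Y, H Z]||^2 over orthogonal P,
        # solved in closed form by the SVD of H^T [Y, H Z].
        M = H.T @ np.hstack([Y, H @ Z])
        U, _, Vt = np.linalg.svd(M)
        P = U @ Vt
        W = P[:, :c]
        err = np.linalg.norm(H @ W - Y)
        if abs(prev - err) < tol:
            break
        prev = err
    return W
```

The balanced step uses the classical result that $\min_{\mathbf{P}^{\mathrm{T}}\mathbf{P}=\mathbf{I}} \|\mathbf{H}\mathbf{P}-\mathbf{T}\|^2$ is attained at $\mathbf{P}=\mathbf{U}\mathbf{V}^{\mathrm{T}}$, where $\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\mathrm{T}}$ is the SVD of $\mathbf{H}^{\mathrm{T}}\mathbf{T}$.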

Experimental settings and datasets

In this section, we conduct pairwise comparisons between OELM and ELM (with ℓ2 regularization) to show the effectiveness of the orthogonal constraint on the output weight matrix. The activation function for ELM is the ‘sigmoid’ function, and the number of hidden neurons is set to three times the input dimension of the data. The regularization parameter for ELM is searched from the candidate set $\{2^{-10}, 2^{-9}, \dots, 2^{10}\}$.
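For reference, a small hypothetical configuration helper reflecting these settings (all names are assumptions, not the authors' code) could look like this:

```python
import numpy as np

def experiment_config(X_train):
    """Hypothetical settings mirroring the description above: sigmoid activation,
    a hidden layer three times the input dimension, and an l2 regularization
    parameter for the ELM baseline searched over 2^{-10}, ..., 2^{10}."""
    d = X_train.shape[1]                                   # input dimension
    return {
        "activation": lambda z: 1.0 / (1.0 + np.exp(-z)),  # sigmoid
        "n_hidden": 3 * d,                                 # three times the input dimension
        "reg_grid": [2.0 ** p for p in range(-10, 11)],    # ELM regularization candidates
    }
```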

The properties of the three image data sets used in our experiments are described as follows:

  • UMIST.

Conclusion

In this paper, we proposed a new ELM model, termed OELM, in which the output weight matrix is enforced to be orthogonal. The main contributions of this work lie in three aspects: (1) formulating the objective of OELM and analyzing its effectiveness from the perspective of discriminant analysis; (2) presenting an effective iterative procedure to optimize the OELM objective by solving a balanced orthogonal Procrustes problem via singular value decomposition; (3) demonstrating the effectiveness of OELM through pairwise comparisons with ELM on three widely used image data sets.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (61602140, 61671193, 61402143), the Science and Technology Program of Zhejiang Province (2017C33049), the Natural Science Foundation of Zhejiang Province (LQ14F020012), the Jiangsu Key Laboratory of Image and Video Understanding for Social Safety, Nanjing University of Science and Technology (30916014107) and the Guangxi High School Key Laboratory of Complex System and Computational Intelligence (2016CSCI04).


References (25)

  • Huang G.-B. et al., Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Netw. (2006)

  • Zhang R. et al., Universal approximation of extreme learning machine with adaptive growth of hidden nodes, IEEE Trans. Neural Netw. Learn. Syst. (2012)
Yong Peng received the B.S. degree from Hefei New Star Research Institute of Applied Technology, the M.S. degree from the Graduate University of Chinese Academy of Sciences, and the Ph.D. degree from Shanghai Jiao Tong University, all in computer science, in 2006, 2010, and 2015, respectively. From September 2012 to August 2014, he was a visiting Ph.D. student in the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor. He joined the School of Computer Science and Technology, Hangzhou Dianzi University, as an Assistant Professor in June 2015, where he is currently a Research Associate Professor. He was awarded the President Scholarship of the Chinese Academy of Sciences in 2009 and the National Scholarship for Graduate Students of the Ministry of Education in 2012. His research interests include machine learning, pattern recognition, and brain-computer interfaces.

Wanzeng Kong received his bachelor's and Ph.D. degrees from the Electrical Engineering Department, Zhejiang University, Hangzhou, China, in 2003 and 2008, respectively. He is currently a professor and vice dean of the College of Computer Science, Hangzhou Dianzi University, Hangzhou, China. From November 2012 to November 2013, Dr. Kong was a visiting research associate in the Department of Biomedical Engineering, University of Minnesota, Twin Cities, USA. His research interests include cognitive computing, pattern recognition and BCI-based electronic systems. Dr. Kong is also a member of IEEE, ACM, and CCF.

Bing Yang received her Ph.D. degree in Computer Science from Zhejiang University in 2013 and then joined the School of Computer Science and Technology, Hangzhou Dianzi University, where she is now an associate professor. Her main research interests include computer vision and machine learning.
