Elsevier

Neurocomputing

Volume 388, 7 May 2020, Pages 34-44

Unimodal regularized neuron stick-breaking for ordinal classification

https://doi.org/10.1016/j.neucom.2020.01.025

Abstract

This paper targets ordinal regression/classification, whose objective is to learn a rule that predicts labels from a discrete but ordered set. For instance, classification for medical diagnosis usually involves inherently ordered labels corresponding to the level of health risk. Previous multi-task classifiers on ordinal data often use several binary classification branches to compute a series of cumulative probabilities. However, these cumulative probabilities are not guaranteed to be monotonically decreasing, and this design introduces a large number of hyper-parameters that must be fine-tuned manually. This paper aims to eliminate, or at least largely reduce, the effects of these problems. We propose a simple yet efficient way to rephrase the output layer of a conventional deep neural network. In addition, to alleviate the effects of label noise in ordinal datasets, we propose a unimodal label regularization strategy, which also explicitly encourages the class predictions to concentrate on classes near the ground truth. We show that our methods achieve state-of-the-art accuracy on medical diagnosis tasks (e.g., the Diabetic Retinopathy and Ultrasound Breast datasets) as well as face age prediction (e.g., Adience face and MORPH Album II) with very little additional cost.

Introduction

Recently, ordinal regression/classification has received much attention in the recognition community. It aims to determine the discrete label of a pattern on an ordinal scale, where the natural order of the labels (e.g., 1, 2, 3) reflects the order of the ranks [56].

For instance, the classes of a medical image usually represent health risk levels, which are inherently ordered. Diabetic Retinopathy (DR) diagnosis involves five levels: no DR (1), mild DR (2), moderate DR (3), severe DR (4), and proliferative DR (5) [1]. The Breast Imaging-Reporting and Data System (BIRADS) also includes five diagnostic labels: 1-healthy, 2-benign, 3-probably benign, 4-may contain malignancy, and 5-probably contains malignancy [2], [3]. Similar ordinal labeling systems for the liver (LIRADS), gynecology (GIRADS), and colonography (CRADS) were established soon afterward [4].

Of course, ordinal data are not unique to medical image classification [55]. Other examples of ordinal labels include the age of a person [5], [6], facial expression intensity [7], aesthetic quality [8], the star rating of a movie [9], etc.; such problems are traditionally referred to as ordinal regression tasks [10].

Recent advances in deep neural networks (DNNs) for natural image tasks have prompted a surge of interest in adapting them to several applications [2], [4], [11]. However, some of the special characteristics of ordinal data have, in our opinion, not been efficiently explored.

Two of the most straightforward approaches for ordinal data either cast it as a multi-class classification problem [12] and optimize the cross-entropy (CE) loss, or treat it as a metric regression problem [13] and minimize the absolute/squared error loss (i.e., MAE/MSE). The former (Fig. 1(a)) assumes that the classes are independent of each other, which entirely fails to exploit the inherent ordering of the labels. The latter (Fig. 1(c)) treats the discrete labels as continuous numerical values in which adjacent classes are equally distant. This assumption violates the non-stationary property of many image-related tasks, easily resulting in over-fitting [14].
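To make the contrast concrete, here is a minimal sketch (ours, not the paper's code) of the two baseline losses; note that the CE loss below is blind to how far a wrong prediction lies from the true class:

```python
import numpy as np

def cross_entropy(probs, label):
    """Multi-class CE loss: only the probability of the ground-truth
    class matters, so mass placed on a distant class costs no more
    than mass placed on an adjacent one."""
    return -np.log(probs[label])

def mae(pred, label):
    """Metric regression loss: treats the discrete labels as equally
    spaced real values."""
    return abs(pred - label)

# With ground-truth class 2, CE cannot tell these two mistakes apart:
near_miss = np.array([0.0, 0.6, 0.4, 0.0, 0.0])   # mass on a neighbour
far_miss  = np.array([0.6, 0.0, 0.4, 0.0, 0.0])   # mass on a distant class
assert cross_entropy(near_miss, 2) == cross_entropy(far_miss, 2)
```

The MAE side has the opposite failure: it does distinguish near from far mistakes, but only by assuming the classes sit at equal distances on a line.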

Recently, better results were achieved via N−1 binary classification sub-tasks (Fig. 1(b)) using sigmoid outputs with the MSE loss [10] or softmax outputs with the CE loss [3], [4], [15], [16], when the class label has N levels. We can transform the N levels into a series of labels of length N−1: the first class becomes [0, …, 0], the second class [1, 0, …, 0], the third class [1, 1, 0, …, 0], and so forth. The sub-branches in Fig. 1(b) compute the cumulative probability p(y>i|x), where i indexes the class.1 Given the cumulative probabilities, it is trivial to obtain the corresponding discrete probabilities p(y=i|x) via subtraction. These techniques are closely related to their non-deep counterparts [17], [18]. However, the cumulative probabilities p(y>1|x), …, p(y>N−1|x) are computed by several branches independently and therefore cannot be guaranteed to be monotonically decreasing. As a result, the p(y=i|x) are not guaranteed to be strictly positive, which leads to poor learning efficiency in the early stage of training. Moreover, N−1 weights need to be manually fine-tuned to balance the CE losses of the branches.
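The encoding and subtraction steps above can be sketched as follows (our illustration, with 0-based class indices; the function names are ours). The last call shows the failure mode just described: when the branch outputs are not monotone, subtraction yields a negative "probability":

```python
import numpy as np

def ordinal_encode(label, num_classes):
    """Encode class `label` (0-based) as N-1 binary targets, where
    target k answers "is y > k?": class 0 -> [0,...,0],
    class 1 -> [1,0,...,0], class 2 -> [1,1,0,...,0], and so on."""
    return (np.arange(num_classes - 1) < label).astype(float)

def cumulative_to_discrete(cum):
    """Recover p(y = i | x) from the cumulative p(y > i | x) by
    subtraction, padding with p(y > -1) = 1 and p(y > N-1) = 0."""
    cum = np.concatenate([[1.0], cum, [0.0]])
    return cum[:-1] - cum[1:]

# Monotone branch outputs give a valid distribution...
print(cumulative_to_discrete(np.array([0.9, 0.6, 0.3, 0.1])))
# ...but independent branches may violate monotonicity, producing a
# negative "probability" for class 1:
print(cumulative_to_discrete(np.array([0.5, 0.7, 0.3, 0.1])))
```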

Besides, under the one-hot target encoding, the CE loss −log(p(y=l|x)) essentially only cares about the ground-truth class l. [19] argues that misclassifying an adult as a baby is more severe than misclassifying an adult as a teenager, even if the probability assigned to the adult class is the same in both cases. The authors of [20], [21], [22] propose to use a single output neuron to compute a parameter of a unimodal distribution, strictly requiring that p(y=i|x) follow a Poisson or Binomial distribution, but this suffers from a lack of control over the variance [22]. Since the peak (as well as the mean and variance) of a Poisson distribution is tied to the designated λ, we cannot place the peak at the first or last class, and the variance becomes very high when the peak is needed at a late class.
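To see the variance problem concretely, here is a small sketch (ours) of a truncated, renormalized Poisson target over N classes; because a single parameter λ controls both the mode and the spread, pushing the peak toward a late class necessarily widens the distribution:

```python
import math

def poisson_target(lam, num_classes):
    """Poisson pmf exp(-lam) * lam^k / k!, truncated to the first
    `num_classes` values of k and renormalized to sum to 1."""
    pmf = [math.exp(-lam) * lam ** k / math.factorial(k)
           for k in range(num_classes)]
    total = sum(pmf)
    return [p / total for p in pmf]

def variance(pmf):
    """Variance of the class index under a discrete distribution."""
    mean = sum(i * p for i, p in enumerate(pmf))
    return sum(p * (i - mean) ** 2 for i, p in enumerate(pmf))

# A peak near the last of 8 classes comes with a much larger spread
# than a peak near the first class:
print(variance(poisson_target(0.5, 8)), variance(poisson_target(6.0, 8)))
```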

Furthermore, the quality of ordinal labels makes this problem even more challenging. For example, the agreement rate among radiologists for malignancy is usually below 80% [23], [24], which results in noisily labeled datasets [25]. Although the distinction between adjacent labels is often unclear, a well-trained annotator is far more likely to mislabel a Severe DR (4) sample as Moderate DR (3) than as No DR (1). Label smoothing is a general remedy for label noise: it reduces the 100% probability of the one-hot distribution and spreads it over all classes following a uniform distribution, under the assumption that the noise stems from random error. In medical diagnosis, however, there is a prior that a doctor or well-trained annotator is more likely to mislabel an image to a neighboring risk level, so the uniform distribution may not be an ideal model for this kind of label noise. Instead, we propose to model it with a unimodal distribution. We further show that smoothing the label with a unimodal distribution explicitly accounts for the relative similarity of ordinal classes and yields a smaller loss when the predicted probabilities are distributed closer to the ground-truth class.
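As an illustration (our sketch; the paper's exact regularization form may differ), a one-hot target can be mixed with an exponentially decaying unimodal distribution centred on the ground-truth class, so that neighboring classes receive more of the residual mass than distant ones, in contrast to uniform label smoothing:

```python
import numpy as np

def unimodal_smooth(label, num_classes, tau=1.0, eps=0.1):
    """Mix a one-hot target with a unimodal distribution whose mass
    decays exponentially with the distance |i - label|. `tau` controls
    the decay and `eps` the smoothing strength (illustrative
    hyper-parameters, not the paper's exact ones)."""
    dist = np.exp(-np.abs(np.arange(num_classes) - label) / tau)
    dist /= dist.sum()
    one_hot = np.eye(num_classes)[label]
    return (1.0 - eps) * one_hot + eps * dist

# Smoothing the Severe DR class (index 3 of 5): Moderate DR (index 2)
# keeps more mass than No DR (index 0).
print(unimodal_smooth(3, 5))
```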

In this paper, we propose to address the issues discussed above. Preliminary versions of the concepts in this paper were published in the BioImage Computing workshop at the 2018 European Conference on Computer Vision [26]. Here, we extend those basic concepts in the following ways:

  • (1)

    We design a novel unimodal regularization strategy that smooths the target label from a one-hot distribution to a unimodal distribution. It not only alleviates the effects of label noise in ordinal datasets, but also explicitly regularizes the structure of the label space, encouraging the label predictions to concentrate near the ground-truth class.

  • (2)

    We also investigate the possible combination of unimodal regularization with conventional models as well as the proposed neuron stick-breaking, which can improve the performance without sophisticated network design.

  • (3)

    We conduct all experiments using the new architecture, test on more general and challenging benchmarks, and give more comprehensive ablation studies.

In summary, this paper makes the following contributions.

  • (1)

    We rephrase the conventional softmax-based output layer into a neuron stick-breaking formulation, which guarantees that the cumulative probabilities are monotonically decreasing and requires no hyper-parameters to balance the branches of a multi-task learning framework.

  • (2)

    The unimodal label smoothing not only incorporates prior knowledge about noisy ordinal data, but also explicitly encodes a structured relationship between neighboring classes, penalizing less when the inter-class distance is smaller.

  • (3)

    Extensive evaluations on the Diabetic Retinopathy, Ultrasound BIRADS, Adience face, and MORPH Album II age datasets demonstrate that our method outperforms many state-of-the-art approaches on medical diagnosis as well as face age prediction tasks.


Ordinal regression

The conventional ordinal regression approaches can be classified into three categories, i.e., naive, binary decomposition, and threshold methods [27], [28], [29]. Following the development of deep learning [30], [31], several works have targeted ordinal data. The authors of [3], [26] propose a multi-task learning framework. However, the probabilities of the individual classes are not guaranteed to be positive, which may hurt training, especially in the early stage. Besides, there are N−1

Neuron stick-breaking for ordinal regression

In the stick-breaking approach, we define a stick of unit length on [0, 1] and sequentially break off parts of the stick [33]. The lengths of the resulting pieces represent the discrete probabilities of the classes.

When we make a breaking manipulation, the stick is separated into two parts with lengths σ(η1) and 1−σ(η1), respectively, whose lengths represent the probabilities of the two classes. Then, we further break the remaining part 1−σ(η1) by deciding how much of the
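The full stick-breaking mapping from N−1 output neurons to N class probabilities can be sketched as follows (our illustration): by construction every piece is non-negative and the pieces sum to one, so the implied cumulative probabilities are automatically monotone.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stick_breaking_probs(etas):
    """Map N-1 real-valued neurons to N class probabilities by
    sequentially breaking a unit stick: class i takes a sigmoid(eta_i)
    fraction of whatever stick remains, and the last class gets the
    leftover. The result is non-negative and sums to 1 by construction."""
    remaining = 1.0
    probs = []
    for eta in etas:
        frac = sigmoid(eta)
        probs.append(remaining * frac)
        remaining *= (1.0 - frac)
    probs.append(remaining)  # last class gets the leftover stick
    return np.array(probs)

# Zero logits halve the remaining stick at every break:
print(stick_breaking_probs([0.0, 0.0, 0.0]))
```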

Numerical experiments

In this section, we present implementation details and experimental results on the Diabetic Retinopathy, Ultrasound BIRADS, Adience face, and MORPH Album II age datasets. To demonstrate the effectiveness of each design choice and their combinations, we provide a series of elaborate ablation studies along with the standard measures. For a fair comparison, we choose backbone neural networks similar to those in previous works. We adjust the last layer and softmax normalization to our neuron stick-breaking

Conclusions

We have introduced the stick-breaking process for the DNN-based ordinal regression problem. By reformulating the neurons of the last layer and the softmax function, we not only fully exploit the ordinal property of the class labels, but also guarantee that the cumulative probabilities are monotonically decreasing. Targeting the noisy-label problem in many datasets, we propose unimodal label regularization, which has several attractive characteristics. We also show how these approaches offer improved

Declaration of Competing Interest

None.

Acknowledgment

This work was supported in part by the National Natural Science Foundation 61627819 and 61727818, Hong Kong Government General Research Fund GRF152202/14E, PolyU Central Research Grant G-YBJW, Youth Innovation Promotion Association, CAS (2017264), Innovative Foundation of CIOMP, CAS (Y586320150).


References (64)

  • X. Li et al.

    Lung nodule malignancy prediction using multi-task convolutional neural network

    Medical Imaging 2017: Computer-Aided Diagnosis

    (2017)
  • A.S. Al-Shannaq et al.

    Comprehensive analysis of the literature for age estimation from facial images

    IEEE Access

    (2019)
  • R.R. Atallah et al.

    Face recognition and age estimation implications of changes in facial features: a critical review study

    IEEE Access

    (2018)
  • R. Zhao et al.

    Facial expression intensity estimation using ordinal information

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • Y. Koren et al.

    OrdRec: an ordinal model for predicting personalized item rating distributions

    Proceedings of the Fifth ACM Conference on Recommender Systems

    (2011)
  • Z. Niu et al.

    Ordinal regression with multiple output CNN for age estimation

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • A.E. Gentry et al.

    Penalized ordinal regression methods for predicting stage of cancer in high-dimensional covariate spaces

    Cancer Inform.

    (2015)
  • X. Geng et al.

    Automatic age estimation based on facial aging patterns

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2007)
  • Y. Fu et al.

    Human age estimation with regression on discriminative aging manifold

    IEEE Trans. Multimed.

    (2008)
  • K.-Y. Chang et al.

    Ordinal hyperplanes ranker with cost sensitivities for age estimation

    Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2011)
  • H. Fu et al.

    Deep ordinal regression network for monocular depth estimation

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2018)
  • S. Chen et al.

    Using Ranking-CNN for age estimation

    Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2017)
  • J. Cheng et al.

    A neural network approach to ordinal regression

    Proceedings of IEEE International Joint Conference on Neural Networks, IJCNN (IEEE World Congress on Computational Intelligence)

    (2008)
  • E. Frank et al.

    A simple approach to ordinal classification

    Proceedings of European Conference on Machine Learning

    (2001)
  • L. Hou et al.

    Squared earth mover's distance loss for training deep neural networks on ordered classes

    Proceedings of NIPS Workshop

    (2017)
  • C. Beckham, C. Pal, A Simple Squared-error Reformulation for Ordinal Classification, arXiv preprint arXiv:1612.00775...
  • J.F.P. da Costa et al.

    The unimodal model for the classification of ordinal data

    Neural Netw.

    (2008)
  • C. Beckham, C. Pal, Unimodal Probability Distributions for Deep Ordinal Classification, arXiv preprint arXiv:1705.05278...
  • R.M. Nishikawa et al.

    Agreement between radiologists' interpretations of screening mammograms

    Proceedings of International Workshop on Digital Mammography

    (2016)
  • A.J. Salazar et al.

    Reliability of the BI-RADS final assessment categories and management recommendations in a telemammography context

    J. Am. College Radiol.

    (2017)
  • B. Du et al.

    Robust graph-based semisupervised learning for noisy labeled data via maximum correntropy criterion

    IEEE Trans. Cybern.

    (2018)
  • X. Liu et al.

    Ordinal regression with neuron stick-breaking for medical diagnosis

    Proceedings of European Conference on Computer Vision

    (2018)

    Xiaofeng Liu is a research fellow at Harvard Medical School, Harvard University. He was a jointly supervised PhD student at Carnegie Mellon University and the University of Chinese Academy of Sciences. Before that, he received the B.Eng. degree in automation and the B.A. degree in communication from the University of Science and Technology of China in 2014. He was a research assistant at MSRA and Facebook, and a recipient of the Best Paper award of the IEEE International Conference on Identity, Security and Behavior Analysis 2018. His research interests include image processing, computer vision, and pattern recognition. He is a reviewer for CVPR, ICCV, ECCV, TPAMI, PR, etc.

    Fangfang Fan is a research fellow at Harvard University. She received her Ph.D. degree from Huazhong University of Science and Technology in 2013. Her current research interests include emotion regulation and mental health as well as neural electrophysiology signal processing.

    Lingsheng Kong received his bachelor's degree from the University of Science and Technology of China in 2007 and his PhD in optical engineering from the University of the CAS in 2012. He is currently an associate professor at CIOMP, CAS. His current research interests include optical engineering, imaging, and image processing.

    Zhihui Diao received his bachelor's degree from Harbin Engineering University in 2010 and his PhD from the University of the CAS in 2015. He is currently an associate professor with CIOMP, CAS. His current research interests include optical engineering, imaging, and image processing.

    Wanqing Xie, who holds a PhD in Electronic Physics, is a visiting scholar at Harvard Medical School / Beth Israel Deaconess Medical Center. She also serves as an assistant professor at the University of Science and Technology of China (USTC). She is interested in neuroscience, biomedical signal processing, biostatistics, and machine learning.

    Jun Lu is an Associate Professor of Neurology, Harvard Medical School. He received his MD from the Fourth Military Medical University in 1984, MS from the Institute of Space Medico-Engineering in 1988, and PhD from Texas A&M University in 1994.

    Jane You received the Ph.D. degree from La Trobe University, Melbourne, VIC, Australia, in 1992. She is currently a Professor with the Department of Computing, Hong Kong Polytechnic University, Hong Kong, and the Chair of the Department Research Committee. She has researched extensively in the fields of image processing, medical imaging, computer-aided diagnosis, and pattern recognition. She has been a Principal Investigator for one ITF project, three GRF projects, and many other joint grants since she joined PolyU in 1998. Prof. You was a recipient of three awards, including Hong Kong Government Industrial Awards, the Special Prize and Gold Medal with Jury's Commendation at the 39th International Exhibition of Inventions of Geneva in 2011 for her work on retinal imaging, and the Second Place in an international competition (the SPIE Medical Imaging 2009 Retinopathy Online Challenge, ROC2009). Her research output on retinal imaging has successfully led to technology transfer with clinical applications. She is an Associate Editor of Pattern Recognition and other journals.

    1

    Xiaofeng Liu and Fangfang Fan contributed equally to this article.
