Unimodal regularized neuron stick-breaking for ordinal classification
Introduction
Ordinal regression/classification has recently received much attention in the recognition community. It aims to determine the discrete label of a pattern on an ordinal scale, where the natural order of the labels (e.g., 1, 2, 3) reflects the order of the ranks [56].
For instance, the classes of a medical image usually represent health-risk levels, which are inherently ordered. Diabetic Retinopathy (DR) diagnosis involves five levels: no DR (1), mild DR (2), moderate DR (3), severe DR (4) and proliferative DR (5) [1]. The Breast Imaging-Reporting and Data System (BIRADS) also includes five diagnostic labels: 1-healthy, 2-benign, 3-probably benign, 4-may contain malignant and 5-probably contains malignant [2], [3]. Similar ordinal labeling systems for liver (LIRADS), gynecology (GIRADS) and colonography (CRADS) were established soon afterward [4].
Of course, ordinal data are not unique to medical image classification [55]. Other examples of ordinal labels include the age of a person [5], [6], facial expression intensity [7], aesthetics [8], the star rating of a movie [9], etc.; such problems are traditionally referred to as ordinal regression tasks [10].
Recent advances in deep neural networks (DNNs) for natural image tasks have prompted a surge of interest in adapting them to several applications [2], [4], [11]. However, some of the special characteristics of ordinal data have, in our opinion, not been efficiently explored.
Two of the most straightforward approaches to ordinal data either cast it as a multi-class classification problem [12] and optimize the cross-entropy (CE) loss, or treat it as a metric regression problem [13] and minimize the absolute/squared error loss (i.e., MAE/MSE). The former (Fig. 1(a)) assumes that the classes are independent of each other, which entirely fails to exploit the inherent ordering of the labels. The latter (Fig. 1(c)) treats the discrete labels as continuous numerical values in which adjacent classes are equally distant. This assumption violates the non-stationary property of many image-related tasks and easily results in over-fitting [14].
Recently, better results were achieved via binary classification sub-tasks (Fig. 1(b)), using a sigmoid output with the MSE loss [10] or a softmax output with the CE loss [3], [4], [15], [16]. Given N levels as the class label, we can transform each label into a series of N−1 binary labels: the first class is [0, … ,0], followed by the second class [1, … ,0], the third class [1,1, … ,0] and so forth. The sub-branches in Fig. 1(b) calculate the cumulative probabilities, where i indexes the class. With the cumulative probabilities, it is then trivial to define the corresponding discrete probabilities via subtraction. These techniques are closely related to their non-deep counterparts [17], [18]. However, the cumulative probabilities are calculated by several branches independently and therefore cannot be guaranteed to be monotonically decreasing. As a result, the discrete probabilities obtained by subtraction are not guaranteed to be positive, which leads to poor learning efficiency in the early stage of training. Moreover, weights need to be manually fine-tuned to balance the CE loss of each branch.
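The encoding and subtraction steps described above can be sketched in plain Python (a minimal illustration with hypothetical helper names, assuming N = 5 levels and 0-indexed labels):

```python
def ordinal_targets(label, num_classes):
    """Encode a 0-indexed class label as N-1 cumulative binary targets.

    Bit i is 1 iff label > i, i.e. it encodes the cumulative event y > i."""
    return [1.0 if label > i else 0.0 for i in range(num_classes - 1)]

def discrete_from_cumulative(cum):
    """Recover discrete class probabilities from cumulative ones by subtraction.

    cum[i] approximates P(y > i); pad with P(y > -1) = 1 and P(y > N-1) = 0."""
    padded = [1.0] + list(cum) + [0.0]
    return [padded[i] - padded[i + 1] for i in range(len(padded) - 1)]

# N = 5 levels: class 0 -> [0,0,0,0], class 2 -> [1,1,0,0], ...
print(ordinal_targets(2, 5))  # [1.0, 1.0, 0.0, 0.0]

# A monotonically decreasing cumulative vector yields valid probabilities.
p = discrete_from_cumulative([0.9, 0.7, 0.3, 0.1])
print(p)  # approximately [0.1, 0.2, 0.4, 0.2, 0.1], summing to 1

# If the cumulative probabilities are NOT monotone (independent branches),
# the subtraction can produce negative "probabilities":
print(discrete_from_cumulative([0.6, 0.8, 0.3, 0.1]))  # contains a negative entry
```

This makes the monotonicity problem concrete: nothing in independently trained branches forbids the second, non-monotone vector, yet its subtraction result is not a valid distribution.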
Besides, under the one-hot target label encoding, the CE loss essentially cares only about the ground-truth class l. [19] argues that misclassifying an adult as a baby is more severe than misclassifying the adult as a teenager, even if the probabilities assigned to the adult class are the same in both cases. The authors of [20], [21], [22] propose using a single output neuron to calculate a parameter of a unimodal distribution, strictly requiring that the output follow a Poisson or Binomial distribution, but this approach lacks the ability to control the variance [22]. Since the peak (as well as the mean and variance) of a Poisson distribution is determined by a single designated λ, we cannot assign the peak to the first or last class, and the variance is very high when we need the peak at the later classes.
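The variance limitation can be checked numerically. The sketch below (an illustration, not the authors' implementation) computes a Poisson pmf over ten classes and shows that pushing the mode toward later classes necessarily spreads the probability mass:

```python
import math

def poisson_pmf(lam, num_classes):
    """Poisson probabilities for k = 0 .. num_classes-1 (not renormalized)."""
    return [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(num_classes)]

for lam in (1.5, 6.5):
    pmf = poisson_pmf(lam, 10)
    mode = max(range(10), key=lambda k: pmf[k])
    # Probability mass concentrated within one class of the mode.
    near = sum(pmf[k] for k in range(max(0, mode - 1), mode + 2))
    print(f"lambda={lam}: mode={mode}, mass within +/-1 of mode={near:.2f}")
```

With λ = 1.5 the mode is class 1 and roughly 80% of the mass lies within one class of it; with λ = 6.5 the mode moves to class 6 but the nearby mass drops to roughly 45%, since the variance of a Poisson distribution equals λ and cannot be tuned independently of the peak.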
Furthermore, the quality of ordinal labels makes this problem even more challenging. For example, the agreement rate of radiologists for malignancy is usually less than 80% [23], [24], which results in noisily labeled datasets [25]. Although the distinction between adjacent labels is often unclear, a well-trained annotator is far more likely to mislabel a Severe DR (4) sample as Moderate DR (3) than as No DR (1). Label smoothing is a general method for handling label noise: it reduces the 100% probability of the one-hot distribution and redistributes it over all classes following a uniform distribution, assuming the noise is caused by random error. In medical diagnosis, however, there is a prior that a doctor or well-trained annotator is more likely to mislabel an image as a neighboring risk level. Therefore, the uniform distribution may not be an ideal choice for modeling this kind of label noise. Instead, we propose to model it with a unimodal distribution. We further show that smoothing the label with a unimodal distribution explicitly considers the relative similarity of ordinal data and incurs a smaller loss when the predicted probabilities are distributed more closely around the ground-truth class.
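One plausible instantiation of such unimodal smoothing is sketched below, using an exponential decay over the absolute class distance; the decay parameter `tau` and the exact functional form are illustrative assumptions, not necessarily the formulation used in the paper:

```python
import math

def unimodal_smooth(label, num_classes, eps=0.1, tau=1.0):
    """Smooth a one-hot target toward a unimodal distribution centered on the
    ground-truth class, rather than the uniform distribution of standard
    label smoothing. `tau` (hypothetical) controls how fast mass decays
    with distance from the true class."""
    # Unimodal weights decay exponentially with distance from the label.
    w = [math.exp(-abs(k - label) / tau) for k in range(num_classes)]
    z = sum(w)
    unimodal = [v / z for v in w]
    onehot = [1.0 if k == label else 0.0 for k in range(num_classes)]
    # Mix: keep (1 - eps) of the one-hot mass, spread eps unimodally.
    return [(1 - eps) * o + eps * u for o, u in zip(onehot, unimodal)]

target = unimodal_smooth(3, 5)
print([round(t, 3) for t in target])
# The result still peaks at the ground-truth class, but neighboring classes
# receive strictly more mass than distant ones.
```

Under a cross-entropy loss against such a target, a prediction concentrated near the true class is penalized less than one of equal confidence placed far away, which is exactly the ordinal prior described above.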
In this paper, we propose to address the issues discussed above. A preliminary version of the concepts in this paper was published in the BioImage Computing workshop at the 2018 European Conference on Computer Vision [26]. Here, we extend those basic concepts in the following ways:
- (1) We design a novel unimodal regularization strategy that smooths the target label from a one-hot distribution to a unimodal distribution. It not only alleviates the effects of label noise in ordinal datasets, but also explicitly regularizes the structure of the label space, encouraging the predicted labels to distribute closely around the ground-truth class.
- (2) We investigate the possible combinations of unimodal regularization with conventional models as well as with the proposed neuron stick-breaking, which improves performance without sophisticated network design.
- (3) We conduct all experiments using the new architecture, test on more general and challenging benchmarks, and provide more comprehensive ablation studies.
In summary, this paper makes the following contributions.
- (1) We rephrase the conventional softmax-based output layer into a neuron stick-breaking formulation that guarantees the cumulative probabilities are monotonically decreasing and needs no hyperparameters to balance the branches in a multi-task learning framework.
- (2) The unimodal label smoothing not only incorporates the prior knowledge about noisy ordinal data, but also explicitly encodes a structured relationship between neighboring classes, penalizing less when the inter-class distances are smaller.
- (3) Extensive evaluations on the Diabetic Retinopathy, Ultrasound BIRADS, Adience face and MORPH Album II age datasets demonstrate that our method outperforms many state-of-the-art approaches on medical diagnosis as well as face age prediction tasks.
Ordinal regression
Conventional ordinal regression approaches can be classified into three categories, i.e., naive, binary decomposition and threshold methods [27], [28], [29]. Following the development of deep learning [30], [31], several works have been proposed to target ordinal data. The authors of [3], [26] propose a multi-task learning framework. However, the probabilities of each class are not guaranteed to be positive, which may hurt the training, especially in the early stage. Besides, there are
Neuron stick-breaking for ordinal regression
In the stick-breaking approach, we define a stick of unit length between [0,1], and sequentially break off parts of the stick [33]. The lengths of the generated pieces represent the discrete probabilities of the classes.
When we make a breaking manipulation, the stick is separated into two parts with lengths σ(η1) and 1−σ(η1), respectively. Their lengths represent the probabilities of the two classes. Then, we further break the remaining part by deciding how much of the
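The sequential breaking can be sketched as follows (a minimal illustration, assuming N−1 unconstrained logits η for N classes and a sigmoid break fraction as in the text above):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def stick_breaking(etas):
    """Map N-1 unconstrained logits to N discrete class probabilities.

    At each step, the remaining stick is split: a fraction sigmoid(eta)
    becomes the current class probability, and the rest carries on. The
    cumulative probability P(y > i) is a product of factors in (0, 1),
    hence monotonically decreasing and positive by construction."""
    probs, remaining = [], 1.0
    for eta in etas:
        p = sigmoid(eta) * remaining
        probs.append(p)
        remaining -= p
    probs.append(remaining)  # the last class gets what is left of the stick
    return probs

p = stick_breaking([0.3, -0.5, 1.2, 0.0])  # 4 logits -> 5 classes
print([round(v, 3) for v in p])
# All entries are strictly positive and sum to 1, for any real-valued logits.
```

In contrast to independent per-branch cumulative outputs, no combination of logits can produce a negative class probability here, which removes the need for extra constraints or branch-balancing weights during training.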
Numerical experiments
In this section, we show implementation details and experimental results on the Diabetic Retinopathy, Ultrasound BIRADS, Adience face and MORPH Album II age datasets. To demonstrate the effectiveness of each design choice and their combinations, we give a series of elaborate ablation studies along with the standard measures. For a fair comparison, we choose similar backbone neural networks as in previous works. We adjust the last layer and softmax normalization to our neuron stick-breaking
Conclusions
We have introduced the stick-breaking process for the DNN-based ordinal regression problem. By reformulating the neurons of the last layer and the softmax function, we not only fully consider the ordinal property of the class labels, but also guarantee that the cumulative probabilities are monotonically decreasing. Targeting the noisy-label problem in many datasets, we propose unimodal label regularization, which has several attractive characteristics. We also show how these approaches offer improved
Declaration of Competing Interest
None.
Acknowledgment
This work was supported in part by the National Natural Science Foundation 61627819 and 61727818, Hong Kong Government General Research Fund GRF152202/14E, PolyU Central Research Grant G-YBJW, Youth Innovation Promotion Association, CAS (2017264), Innovative Foundation of CIOMP, CAS (Y586320150).
References (64)
- et al., Modelling ordinal relations with SVMs: an application to objective aesthetic evaluation of breast cancer conservative treatment, Neural Netw. (2005)
- et al., A convex formulation for multiple ordinal output classification, Pattern Recognit. (2019)
- et al., The ordinal relation preserving binary codes, Pattern Recognit. (2015)
- et al., Iterative privileged learning, IEEE Trans. Neural Netw. Learn. Syst. (2019)
- et al., Stick-breaking construction for the Indian buffet process, Artificial Intelligence and Statistics (2007)
- et al., Age estimation using trainable Gabor wavelet layers in a convolutional neural network, Proceedings of 2019 IEEE International Conference on Image Processing (ICIP) (2019)
- et al., Conservative Wasserstein training for pose estimation
- et al., Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs, JAMA (2016)
- K.J. Geras, S. Wolfson, Y. Shen, S. Kim, L. Moy, K. Cho, High-resolution Breast Cancer Screening with Multi-view Deep...
- V. Ratner, Y. Shoshan, T. Kachman, Learning Multiple Non-mutually-exclusive Tasks for Improved Classification of...
- Lung nodule malignancy prediction using multi-task convolutional neural network, Medical Imaging 2017: Computer-Aided Diagnosis
- Comprehensive analysis of the literature for age estimation from facial images, IEEE Access
- Face recognition and age estimation implications of changes in facial features: a critical review study, IEEE Access
- Facial expression intensity estimation using ordinal information, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- OrdRec: an ordinal model for predicting personalized item rating distributions, Proceedings of the Fifth ACM Conference on Recommender Systems
- Ordinal regression with multiple output CNN for age estimation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Penalized ordinal regression methods for predicting stage of cancer in high-dimensional covariate spaces, Cancer Inform.
- Automatic age estimation based on facial aging patterns, IEEE Trans. Pattern Anal. Mach. Intell.
- Human age estimation with regression on discriminative aging manifold, IEEE Trans. Multimed.
- Ordinal hyperplanes ranker with cost sensitivities for age estimation, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- Deep ordinal regression network for monocular depth estimation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Using Ranking-CNN for age estimation, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- A neural network approach to ordinal regression, Proceedings of IEEE International Joint Conference on Neural Networks, IJCNN (IEEE World Congress on Computational Intelligence)
- A simple approach to ordinal classification, Proceedings of European Conference on Machine Learning
- Squared Earth Mover's Distance loss for training deep neural networks on ordered classes, Proceedings of NIPS Workshop
- The unimodal model for the classification of ordinal data, Neural Netw.
- Agreement between radiologists' interpretations of screening mammograms, Proceedings of International Workshop on Digital Mammography
- Reliability of the BI-RADS final assessment categories and management recommendations in a telemammography context, J. Am. College Radiol.
- Robust graph-based semisupervised learning for noisy labeled data via maximum correntropy criterion, IEEE Trans. Cybern.
- Ordinal regression with neuron stick-breaking for medical diagnosis, Proceedings of European Conference on Computer Vision
Xiaofeng Liu is a research fellow at Harvard Medical School, Harvard University. He was a jointly supervised PhD student at Carnegie Mellon University and the University of Chinese Academy of Sciences. Before that, he received the B.Eng. degree in automation and the B.A. degree in communication from the University of Science and Technology of China in 2014. He was a research assistant at MSRA and Facebook. He was a recipient of the Best Paper award of the IEEE International Conference on Identity, Security and Behavior Analysis 2018. His research interests include image processing, computer vision, and pattern recognition. He is a reviewer for CVPR, ICCV, ECCV, TPAMI, PR, etc.
Fangfang Fan is a research fellow at Harvard University. She received her Ph.D. degree from Huazhong University of Science and Technology in 2013. Her current research interests include emotion regulation and mental health as well as neural electrophysiology signal processing.
Lingsheng Kong received his bachelor's degree from the University of Science and Technology of China in 2007 and his PhD in optical engineering from the University of the CAS in 2012. He is currently an associate professor at CIOMP, CAS. His current research interests include optical engineering, imaging and image processing.
Zhihui Diao received his bachelor's degree from Harbin Engineering University in 2010 and his PhD from the University of the CAS in 2015. He is currently an associate professor at CIOMP, CAS. His current research interests include optical engineering, imaging and image processing.
Wanqing Xie, who holds a PhD in Electronic Physics, is a visiting scholar at Harvard Medical School / Beth Israel Deaconess Medical Center. She also serves as an assistant professor at the University of Science and Technology of China (USTC). She is interested in neuroscience, biomedical signal processing, biostatistics, and machine learning.
Jun Lu is an Associate Professor of Neurology, Harvard Medical School. He received his MD from the Fourth Military Medical University in 1984, MS from the Institute of Space Medico-Engineering in 1988, and PhD from Texas A&M University in 1994.
Jane You received the Ph.D. degree from La Trobe University, Melbourne, VIC, Australia, in 1992. She is currently a Professor with the Department of Computing, Hong Kong Polytechnic University, Hong Kong, and the Chair of the Department Research Committee. She has researched extensively in the fields of image processing, medical imaging, computer-aided diagnosis, and pattern recognition. She has been a Principal Investigator for one ITF project, three GRF projects, and many other joint grants since she joined PolyU in 1998. Prof. You was a recipient of three awards including the Hong Kong Government Industrial Awards, the Special Prize and Gold Medal with Jury's Commendation at the 39th International Exhibition of Inventions of Geneva in 2011 for her work on retinal imaging, and the Second Place in an international competition (SPIE Medical Imaging 2009 Retinopathy Online Challenge, ROC2009). Her research output on retinal imaging has successfully led to technology transfer with clinical applications. She is an Associate Editor of Pattern Recognition and other journals.
1. Xiaofeng Liu and Fangfang Fan contributed equally to this article.