Mitigating Class-Boundary Label Uncertainty to Reduce Both Model Bias and Variance

Abstract
The study of model bias and variance with respect to decision boundaries is critically important in supervised learning and artificial intelligence. The two generally trade off: fine-tuning a classification model's decision boundary to accommodate more boundary training samples (i.e., higher model complexity) may improve training accuracy (i.e., lower bias) but hurt generalization to unseen data (i.e., higher variance). As long as attention is restricted to boundary fine-tuning and model complexity, it is difficult to reduce bias and variance simultaneously. To overcome this dilemma, we take a different perspective and investigate a new approach to handling inaccuracy and uncertainty in training labels, which are inevitable in many applications where labels are conceptual entities and labeling is performed by human annotators. Label uncertainty can undermine classification: extending a boundary to accommodate an inaccurately labeled point increases both bias and variance. Our method reduces both by estimating the pointwise label uncertainty of the training set and adjusting the training sample weights accordingly, down-weighting samples with high uncertainty and up-weighting those with low uncertainty. Uncertain samples thus contribute less to the objective function of the model's learning algorithm and exert less pull on the decision boundary. In a real-world physical activity recognition case study whose data present many labeling challenges, we show that this approach improves model performance and reduces model variance.
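The reweighting idea described above can be sketched in a few lines. This is an illustrative toy, not the paper's algorithm: the k-nearest-neighbor label-disagreement estimator, the `1 - uncertainty` weighting rule, and the Gaussian toy data are all assumptions made for the example.

```python
import numpy as np

# Illustrative sketch only -- not the paper's method. Pointwise label
# uncertainty is approximated by the fraction of a sample's k nearest
# neighbors that carry a different label; (1 - uncertainty) then serves as a
# per-sample weight in a weighted logistic regression.

def label_uncertainty(X, y, k=10):
    """Fraction of each point's k nearest neighbors with a disagreeing label."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # pairwise sq. distances
    np.fill_diagonal(d2, np.inf)                              # exclude the point itself
    idx = np.argsort(d2, axis=1)[:, :k]                       # k nearest neighbors
    return (y[idx] != y[:, None]).mean(axis=1)                # local disagreement rate

def fit_weighted_logreg(X, y, w, lr=0.5, steps=500):
    """Gradient descent on the sample-weighted logistic loss."""
    Xb = np.hstack([X, np.ones((len(X), 1))])                 # append bias column
    theta = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ theta))                 # predicted probabilities
        theta -= lr * Xb.T @ (w * (p - y)) / w.sum()          # weighted log-loss gradient
    return theta

# Toy two-class data: overlapping Gaussian blobs, so labels near the class
# boundary are inherently uncertain.
rng = np.random.default_rng(0)
n = 200
X = np.vstack([rng.normal(-1.0, 1.0, (n, 2)), rng.normal(1.0, 1.0, (n, 2))])
y = np.array([0] * n + [1] * n)

u = label_uncertainty(X, y, k=10)   # highest near the boundary
w = 1.0 - u                         # uncertain samples weighted down
theta = fit_weighted_logreg(X, y.astype(float), w)
```

Because boundary points accumulate neighbors from the opposite class, they receive low weights and exert less pull on the fitted boundary than interior points do.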