Published in: Journal of Classification 1/2023

16-02-2023

Classification Trees with Mismeasured Responses

Authors: Liqun Diao, Grace Y. Yi


Abstract

Classification trees are a popular machine learning tool for studying a variety of problems, including prediction, inference, risk factor identification, and risk group classification. Classification trees are typically developed under the assumption that the response and covariate variables are accurately measured. This assumption, however, is often violated in practice, and ignoring mismeasurement commonly yields invalid analysis results. In this paper, we study the impact of mismeasured responses on the performance of standard classification trees and propose a novel classification tree algorithm for mismeasured responses. Our study is directed to settings with binary responses that are subject to mismeasurement. To address the effects of mismeasured responses, we modify the decision rules that are valid for tree building in mismeasurement-free settings by introducing new measures of node impurity and misclassification cost. To characterize the magnitude of mismeasurement in responses, we consider two data scenarios. In the first scenario, the mismeasurement rates are known, either from previous studies of the same nature or set by researchers who wish to conduct sensitivity analyses assessing the impact of mismeasured responses. In the second scenario, the mismeasurement rates are unknown and are estimated from a validation dataset that contains both accurate and error-prone measurements of the responses. We conduct a variety of simulation studies to assess the performance of the proposed classification tree algorithm in comparison to the usual classification tree algorithms, which ignore response mismeasurement. The results demonstrate that ignoring response mismeasurement can yield seriously erroneous conclusions and that the proposed method performs well once the mismeasurement effects are accommodated. To illustrate the usage of the proposed method, we analyze data arising from the National Health and Nutrition Examination Surveys (NHANES) by conducting sensitivity analyses to assess how classification results may be affected by different misclassification costs.
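The general idea sketched in the abstract, correcting a node's class proportion for known mismeasurement rates before computing its impurity, and estimating those rates from a validation sample when they are unknown, can be illustrated as follows. This is a minimal sketch of the underlying identity for binary responses, not the authors' exact algorithm; the names `pi01`, `pi10`, and the choice of Gini impurity are our own illustrative assumptions.

```python
import numpy as np

def corrected_proportion(p_obs, pi01, pi10):
    """Recover p = P(Y = 1) from the observed proportion of Y* = 1.

    pi01 = P(Y* = 1 | Y = 0)  (false-positive rate)
    pi10 = P(Y* = 0 | Y = 1)  (false-negative rate)
    Uses the identity P(Y* = 1) = (1 - pi10) * p + pi01 * (1 - p),
    so p = (P(Y* = 1) - pi01) / (1 - pi01 - pi10).
    """
    p = (p_obs - pi01) / (1.0 - pi01 - pi10)
    return float(np.clip(p, 0.0, 1.0))  # truncate to a valid probability

def corrected_gini(y_star, pi01, pi10):
    """Gini impurity of a node based on the mismeasurement-corrected proportion."""
    p = corrected_proportion(np.mean(y_star), pi01, pi10)
    return 2.0 * p * (1.0 - p)

def estimate_rates(y_true, y_star):
    """Scenario 2: estimate the mismeasurement rates from a validation
    sample containing both the true response y_true and the error-prone
    surrogate y_star."""
    y_true, y_star = np.asarray(y_true), np.asarray(y_star)
    pi01 = np.mean(y_star[y_true == 0] == 1)  # false positives among true 0s
    pi10 = np.mean(y_star[y_true == 1] == 0)  # false negatives among true 1s
    return pi01, pi10
```

With symmetric rates of 0.1 and an observed proportion of 0.5, the correction leaves the proportion unchanged at 0.5; with asymmetric rates the corrected value can differ substantially from the observed one, which is what drives the altered splitting decisions.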


Metadata
Title
Classification Trees with Mismeasured Responses
Authors
Liqun Diao
Grace Y. Yi
Publication date
16-02-2023
Publisher
Springer US
Published in
Journal of Classification / Issue 1/2023
Print ISSN: 0176-4268
Electronic ISSN: 1432-1343
DOI
https://doi.org/10.1007/s00357-023-09430-6
