Pattern Analysis and Applications

14-09-2022 | Theoretical Advances

Threshold prediction for detecting rare positive samples using a meta-learner

Authors: Hossein Ghaderi Zefrehi, Ghazaal Sheikhi, Hakan Altınçay

Published in: Pattern Analysis and Applications | Issue 1/2023

Abstract

Threshold-moving is one of several techniques used to correct the bias of binary classifiers towards the majority class. In this approach, the decision threshold is adjusted to detect the minority class at the cost of increased misclassification of the majority class. In practice, selecting a good threshold by cross-validation on the training data is not feasible in some problems because there are only a few minority samples. This study addresses the threshold estimation problem for rare positive samples by building a meta-learner for threshold prediction. Novel meta-features are proposed to quantify the imbalance characteristics of the datasets and the patterns among the prediction scores. A random forest-based threshold prediction model is constructed using these meta-features, which are extracted from the score space of external data. The resulting models are then used to estimate the optimal thresholds for previously unseen datasets. The random forest-based meta-learner, which employs an implicitly selected subset of the proposed meta-features and encodes information from multiple external sources in the form of different trees, is evaluated on 52 imbalanced datasets. In the first set of experiments, the best-fitting thresholds are computed for SVM and logistic regression classifiers trained on the original imbalanced training sets. The experiments are then repeated using ensembles of multiple learners, each trained on a different balanced dataset. The proposed approach is observed to provide better F-scores than alternative threshold-moving and balancing techniques.
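To make the idea of threshold-moving concrete, the following is a minimal sketch and not the authors' method. It assumes a classifier has already produced a score per sample and that labelled scores are available, then scans candidate thresholds for the one that maximizes the F-score. The function names `f1_score` and `best_f1_threshold` and the toy scores are hypothetical and chosen only for illustration. The abstract's point is that this kind of direct search is unreliable when minority samples are rare, which is why the paper predicts the threshold with a meta-learner instead.

```python
def f1_score(labels, predictions):
    """F1 for the positive class (label 1); returns 0.0 when undefined."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(scores, labels):
    """Try each distinct score as a threshold (positive when score >= threshold)
    and return the first threshold with the highest F1, together with that F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(labels, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical toy data: 8 majority (0) and 2 minority (1) samples whose
# scores all fall below the default threshold of 0.5.
scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.45, 0.40, 0.48]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Default threshold: no sample is predicted positive.
default_preds = [1 if s >= 0.5 else 0 for s in scores]
default_f1 = f1_score(labels, default_preds)

# Moved threshold: both positives are detected at the cost of one false alarm.
moved_t, moved_f1 = best_f1_threshold(scores, labels)
print(default_f1, moved_t, moved_f1)
```

On this toy data the lowered threshold recovers both minority samples while misclassifying one majority sample, which is the trade-off the abstract describes. With only two positive samples, the selected threshold depends heavily on where those two scores happen to fall, which illustrates why estimating it from the training data alone can be fragile.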

Title: Threshold prediction for detecting rare positive samples using a meta-learner
Authors: Hossein Ghaderi Zefrehi, Ghazaal Sheikhi, Hakan Altınçay
Publication date: 14-09-2022
Publisher: Springer London
Published in: Pattern Analysis and Applications / Issue 1/2023
Print ISSN: 1433-7541
Electronic ISSN: 1433-755X
DOI: https://doi.org/10.1007/s10044-022-01103-1