
Neural Computing and Applications

1 May 2013 | Original Article

Improving the precision-recall trade-off in undersampling-based binary text categorization using unanimity rule

Authors: Zafer Erenel, Hakan Altınçay

Published in: Neural Computing and Applications | Special Issue 1/2013


Abstract

In binary text categorization, documents are generally distributed unevenly over the two classes, and resampling approaches have been shown to improve F₁ scores in this setting. The improvement is mainly due to a gain in recall, whereas precision may deteriorate. Since precision is the primary concern in some applications, achieving higher F₁ scores with a desired level of trade-off between precision and recall is important. In this study, we present an analytical comparison between the unanimity and majority voting rules. It is shown that the unanimity rule can provide better F₁ scores than majority voting when an ensemble of high-recall but low-precision classifiers is considered. Category-based undersampling is then proposed to generate high-recall members. Experiments conducted on three datasets show that superior F₁ scores can be achieved compared to the support vector machine (SVM)-based baseline system and voting over a random undersampling-based ensemble.
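The comparison between the two voting rules can be illustrated with a minimal sketch. It is not the authors' derivation: it assumes that the ensemble members vote independently and share the same true positive rate and false positive rate, and the function names and the numeric values (`tpr`, `fpr`, `prior`, `n`) are hypothetical choices made only for illustration.

```python
from math import comb


def vote_rate(rate, n, rule):
    """Probability that an ensemble of n members outputs 'positive' when each
    member independently votes positive with probability `rate`.
    Independence is an assumption of this sketch."""
    if rule == "unanimity":
        # Positive only if every member votes positive.
        return rate ** n
    if rule == "majority":
        # Positive if more than half of the members vote positive.
        need = n // 2 + 1
        return sum(
            comb(n, k) * rate ** k * (1 - rate) ** (n - k)
            for k in range(need, n + 1)
        )
    raise ValueError(f"unknown rule: {rule}")


def precision_recall_f1(tpr, fpr, prior):
    """Precision, recall and F1 from the true positive rate, the false
    positive rate and the prior probability of the positive class."""
    tp = prior * tpr
    fp = (1 - prior) * fpr
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tpr
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


def ensemble_scores(tpr, fpr, prior, n, rule):
    """Scores of the combined ensemble under the given voting rule."""
    return precision_recall_f1(
        vote_rate(tpr, n, rule), vote_rate(fpr, n, rule), prior
    )


if __name__ == "__main__":
    # Hypothetical high-recall, low-precision members on an imbalanced task.
    for rule in ("unanimity", "majority"):
        p, r, f1 = ensemble_scores(tpr=0.95, fpr=0.30, prior=0.05, n=5, rule=rule)
        print(f"{rule:10s} precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

With these illustrative values, unanimity gives up some recall but suppresses false positives much more strongly than majority voting, so its precision and F₁ come out higher. Real ensemble members are correlated, so the actual gain would be smaller than this independence-based sketch suggests.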



Title: Improving the precision-recall trade-off in undersampling-based binary text categorization using unanimity rule
Authors: Zafer Erenel
Hakan Altınçay
Publication date: 1 May 2013
Publisher: Springer-Verlag
Published in: Neural Computing and Applications / Special Issue 1/2013
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI: https://doi.org/10.1007/s00521-012-1056-5
