Published in: Neural Computing and Applications 9/2019

17-02-2018 | Original Article

Effective use of 2-termsets by discarding redundant member terms in bag-of-words representation

Authors: Dima Badawi, Hakan Altınçay

Abstract

Recent studies have demonstrated the potential of termsets for enriching the conventional bag-of-words representation of electronic documents by forming composite feature vectors. In this approach, some member terms may become redundant because they are strongly correlated with the corresponding termsets. Moreover, the co-occurrence of terms may be more informative than their individual appearance. In such cases, the member terms should be removed to avoid the curse of dimensionality during model generation. This study first addresses the elimination of member terms that become redundant when 2-termsets are employed, and develops two novel algorithms for this purpose. The proposed algorithms evaluate the relative discriminative powers of, and correlations between, member terms and their corresponding 2-termsets. As a third approach, the redundancies of all terms are evaluated when 2-termsets are used, and the terms most correlated with the 2-termsets are discarded. Simulations conducted on five benchmark datasets verify the importance of eliminating redundant terms and the effectiveness of the proposed algorithms.
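The core idea can be sketched in a few lines: add a binary co-occurrence feature for a 2-termset, then discard any member term whose presence is strongly correlated with that feature. This is a minimal illustration only, not the authors' algorithms; the choice of the phi coefficient as the correlation measure and the 0.8 threshold are assumptions for the sketch.

```python
from math import sqrt

def termset_feature(docs, t1, t2):
    """Binary 2-termset feature: 1 if both member terms occur in the document."""
    return [int(d.get(t1, 0) and d.get(t2, 0)) for d in docs]

def phi_correlation(x, y):
    """Phi (Pearson) correlation between two binary feature vectors."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    denom = sqrt(sx * (n - sx) * sy * (n - sy))
    return 0.0 if denom == 0 else (n * sxy - sx * sy) / denom

def prune_members(docs, t1, t2, threshold=0.8):
    """Keep only the member terms of (t1, t2) that remain informative
    once the 2-termset feature is added; the rest are redundant."""
    ts = termset_feature(docs, t1, t2)
    kept = []
    for t in (t1, t2):
        member = [d.get(t, 0) for d in docs]
        if phi_correlation(member, ts) < threshold:
            kept.append(t)
    return kept
```

For example, if "oil" occurs only in documents where "price" also occurs, its presence vector coincides with the 2-termset feature (correlation 1.0), so "oil" would be discarded while "price" is kept.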


Metadata
Title
Effective use of 2-termsets by discarding redundant member terms in bag-of-words representation
Authors
Dima Badawi
Hakan Altınçay
Publication date
17-02-2018
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 9/2019
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-018-3371-y
