Published in: Neural Computing and Applications 9/2019

17-02-2018 | Original Article

Effective use of 2-termsets by discarding redundant member terms in bag-of-words representation

Authors: Dima Badawi, Hakan Altınçay

Abstract

Recent studies have demonstrated the potential of termsets for enriching the conventional bag-of-words representation of electronic documents by forming composite feature vectors. In this approach, some member terms may become redundant because they are strongly correlated with the corresponding termsets. Moreover, the co-occurrence of terms may be more informative than their individual appearance. In such cases, the member terms should be removed to avoid the curse of dimensionality during model generation. This study first addresses the elimination of member terms that become redundant when 2-termsets are employed, and develops two novel algorithms for this purpose. The proposed algorithms evaluate the relative discriminative powers of, and correlations between, member terms and their corresponding 2-termsets. As a third approach, the redundancies of all terms are evaluated when 2-termsets are used, and the terms most correlated with the 2-termsets are discarded. Simulations conducted on five benchmark datasets verify the importance of eliminating redundant terms and the effectiveness of the proposed algorithms.
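The core idea can be sketched in a few lines: add a binary co-occurrence feature for a 2-termset, then discard any member term whose presence is strongly correlated with that feature. This is a minimal illustration only, not the authors' algorithms; the choice of the phi coefficient as the correlation measure and the 0.8 threshold are assumptions for the sketch.

```python
from math import sqrt

def termset_feature(docs, t1, t2):
    """Binary 2-termset feature: 1 if both member terms occur in the document."""
    return [int(d.get(t1, 0) and d.get(t2, 0)) for d in docs]

def phi_correlation(x, y):
    """Phi (Pearson) correlation between two binary feature vectors."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    denom = sqrt(sx * (n - sx) * sy * (n - sy))
    return 0.0 if denom == 0 else (n * sxy - sx * sy) / denom

def prune_members(docs, t1, t2, threshold=0.8):
    """Keep only the member terms of (t1, t2) that remain informative
    once the 2-termset feature is added; the rest are redundant."""
    ts = termset_feature(docs, t1, t2)
    kept = []
    for t in (t1, t2):
        member = [d.get(t, 0) for d in docs]
        if phi_correlation(member, ts) < threshold:
            kept.append(t)
    return kept
```

For example, if "oil" occurs only in documents where "price" also occurs, its presence vector coincides with the 2-termset feature (correlation 1.0), so "oil" would be discarded while "price" is kept.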


Metadata
Title
Effective use of 2-termsets by discarding redundant member terms in bag-of-words representation
Authors
Dima Badawi
Hakan Altınçay
Publication date
17-02-2018
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 9/2019
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-018-3371-y
