A two-stage feature selection method for text categorization by using category correlation degree and latent semantic indexing

Wang, Fei; Li, Cai-hong; Wang, Jing-shan; Xu, Jiao; Li, Lian

doi:10.1007/s12204-015-1586-y

Published: 29 January 2015

Volume 20, pages 44–50 (2015)

Journal of Shanghai Jiaotong University (Science)

Fei Wang (王飞)¹, Cai-hong Li (李彩虹)¹, Jing-shan Wang (王景山)¹, Jiao Xu (徐娇)¹ & Lian Li (李廉)¹

Abstract

To improve the accuracy of text categorization and reduce the dimension of the feature space, this paper proposes a two-stage feature selection method based on a novel category correlation degree (CCD) method and latent semantic indexing (LSI). In the first stage, the CCD method ranks each feature by its importance for classification and selects the most effective features; it is more effective than traditional feature selection methods. Document representation, however, requires a high-dimensional feature space and does not take into account the semantic relations between features, which leads to poor categorization accuracy. In the second stage, the LSI method therefore replaces individual terms with statistically derived conceptual indices, constructing a new semantic space that captures the important correlations between features and reduces the dimension of the feature space. The experimental results show that our method effectively reduces the dimension of the text vector and improves the performance of text categorization.
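
The two stages described in the abstract can be illustrated with a minimal sketch. The abstract does not give the CCD formula, so the first stage below uses a chi-square score as a stand-in ranking criterion; it is not the paper's CCD. The second stage performs LSI as a truncated singular value decomposition of the reduced document-term matrix. The function names `chi2_scores` and `two_stage_reduce`, the parameters `n_terms` and `n_concepts`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def chi2_scores(X, y):
    """Stand-in stage-1 score (NOT the paper's CCD): for each term, the
    largest chi-square statistic between term presence and any class."""
    present = (X > 0).astype(float)            # documents x terms, 1 if term occurs
    n_docs = X.shape[0]
    scores = np.zeros(X.shape[1])
    for label in np.unique(y):
        in_c = (y == label)
        a = present[in_c].sum(axis=0)          # term present, document in class
        b = present[~in_c].sum(axis=0)         # term present, document outside class
        c = in_c.sum() - a                     # term absent, document in class
        d = (~in_c).sum() - b                  # term absent, document outside class
        num = n_docs * (a * d - b * c) ** 2
        den = (a + b) * (c + d) * (a + c) * (b + d)
        chi = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
        scores = np.maximum(scores, chi)
    return scores

def two_stage_reduce(X, y, n_terms, n_concepts):
    """Stage 1: keep the n_terms highest-scoring terms.
    Stage 2: project the reduced matrix onto n_concepts LSI dimensions."""
    keep = np.argsort(chi2_scores(X, y))[::-1][:n_terms]
    X_sel = X[:, keep]
    # LSI: rows of Vt are the latent "concept" directions in term space.
    _, s, Vt = np.linalg.svd(X_sel, full_matrices=False)
    k = min(n_concepts, len(s))
    return keep, X_sel @ Vt[:k].T

# Toy document-term counts: 6 documents, 5 terms, 2 classes (hypothetical data).
X = np.array([[2, 1, 0, 0, 1],
              [1, 2, 0, 0, 1],
              [3, 1, 0, 1, 1],
              [0, 0, 2, 1, 1],
              [0, 0, 1, 2, 1],
              [0, 1, 3, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

keep, Z = two_stage_reduce(X, y, n_terms=2, n_concepts=1)
print("selected term columns:", sorted(keep.tolist()))
print("reduced representation shape:", Z.shape)
```

In this sketch the last term occurs in every document, so it carries no class information and receives a score of zero in the first stage; the second stage then compresses the surviving terms into a single latent dimension. Any other ranking criterion could be substituted for `chi2_scores` without changing the second stage.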

Author information

Authors and Affiliations

School of Information Science & Engineering, Lanzhou University, Lanzhou, 730000, China
Fei Wang (王飞), Cai-hong Li (李彩虹), Jing-shan Wang (王景山), Jiao Xu (徐娇) & Lian Li (李廉)

Corresponding author

Correspondence to Cai-hong Li (李彩虹).

Additional information

Foundation item: the National Natural Science Foundation of China (Nos. 61073193 and 61300230), the Key Science and Technology Foundation of Gansu Province (No. 1102FKDA010), the Natural Science Foundation of Gansu Province (No. 1107RJZA188), and the Science and Technology Support Program of Gansu Province (No. 1104GKCA037)

About this article

Cite this article

Wang, F., Li, Ch., Wang, Js. et al. A two-stage feature selection method for text categorization by using category correlation degree and latent semantic indexing. J. Shanghai Jiaotong Univ. (Sci.) 20, 44–50 (2015). https://doi.org/10.1007/s12204-015-1586-y

Received: 10 January 2014
Published: 29 January 2015
Issue Date: February 2015
DOI: https://doi.org/10.1007/s12204-015-1586-y

CLC number

TP 391.1
