Published in: Journal of Intelligent Information Systems 3/2015

01.12.2015

Optimizing text classification through efficient feature selection based on quality metric

Authors: Jean-Charles Lamirel, Pascal Cuxac, Aneesh Sreevallabh Chivukula, Kafil Hajlaoui


Abstract

Feature maximization is a cluster quality metric that favors clusters with maximum feature representation with regard to their associated data. In this paper we show that a simple adaptation of this metric provides a highly efficient feature selection and feature contrasting model in the context of supervised classification. The method is evaluated on different types of textual datasets. The paper shows that the proposed method yields a very significant performance increase compared to state-of-the-art methods in all the studied cases, even when a single bag-of-words model is exploited for data description. Interestingly, the most significant performance gain is obtained for the classification of highly unbalanced, highly multidimensional and noisy data with a high degree of similarity between the classes.


Footnotes
1
Since feature recall is equivalent to the conditional probability P(g|p) and feature precision is equivalent to the conditional probability P(p|g), this strategy can be classified as an expectation-maximization approach with respect to the original definition given by Dempster et al. (1977). The harmonic mean gives more influence to the lower of the two values when combining feature recall and feature precision.
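
As an illustration of this combination, the following is a minimal sketch (not the authors' implementation) of a feature F-measure, i.e. the harmonic mean of feature recall P(g|p) and feature precision P(p|g), computed from a matrix of non-negative feature weights and class labels; the function name and the numerical details are illustrative assumptions.

```python
import numpy as np

def feature_f_measure(X, y):
    """Harmonic mean of feature recall P(g|p) and feature precision P(p|g).

    X: dense (n_samples, n_features) array of non-negative feature weights
       (e.g. term frequencies); y: (n_samples,) array of class labels.
    Returns an (n_classes, n_features) array of feature F-measures.
    """
    classes = np.unique(y)
    # W[c, j]: total weight of feature j inside class c
    W = np.vstack([X[y == c].sum(axis=0) for c in classes])
    eps = 1e-12                                              # avoid division by zero
    recall = W / (W.sum(axis=0, keepdims=True) + eps)        # ~ P(g|p)
    precision = W / (W.sum(axis=1, keepdims=True) + eps)     # ~ P(p|g)
    return 2.0 * recall * precision / (recall + precision + eps)
```

A simple (assumed) selection rule would then keep the features whose best F-measure over the classes exceeds the average F-measure of all features; the exact selection and contrasting scheme used by the authors is described in the body of the paper, which is not reproduced on this page.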
 
2
See Section 4 for more details on the usual weighting schemes applied to textual data.
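
Section 4 is not reproduced on this page; as a generic reminder, a common weighting scheme for textual data is TF-IDF (cf. Salton and Buckley 1988 in the references below). The snippet below is a minimal, illustrative sketch using scikit-learn's TfidfVectorizer; the documents and parameters are made up and are not those used in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative documents only; the paper's corpora are not reproduced here.
docs = [
    "optical coherence tomography of the retina",
    "patent claims on semiconductor laser devices",
]

# Standard TF-IDF bag-of-words weighting; parameters are library defaults,
# not the settings used in the paper.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(docs)          # sparse (n_docs, n_terms) matrix
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```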
 
3
The QUAERO project was initiated to meet the multimedia content analysis requirements of consumers and professionals facing the rapid increase of accessible digital information. This collaborative research and development project focuses on the automatic extraction of information and on the analysis, classification and usage of digital multimedia content for professionals and consumers. One specific subtask of the project is to develop automatic patent validation tools.
 
9
In terms of active variables (see Section 3 for details).
 
10
The computation is performed under Linux on a laptop equipped with an Intel® Pentium® B970 CPU at 2.3 GHz and 8 GB of standard memory.
 
References
Aha, D., & Kibler, D. (1991). Instance-based learning algorithms. Machine Learning, 6, 37–66.
Attik, M., Lamirel, J.-C., & Al Shehabi, S. (2006). Clustering analysis for data with multiple labels. In Proceedings of the IASTED international conference on databases and applications (DBA). Innsbruck.
Bolón-Canedo, V., Sánchez-Maroño, N., & Alonso-Betanzos, A. (2012). A review of feature selection methods on synthetic data. Knowledge and Information Systems, 1–37.
Breiman, L., Friedman, J.H., Olshen, R.A., & Stone, C.J. (1984). Classification and regression trees. Belmont: Wadsworth International Group.
Chawla, N.V., Bowyer, K.W., Hall, L.O., & Kegelmeyer, W.P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
Daviet, H. (2009). Class-Add, une procédure de sélection de variables basée sur une troncature k-additive de l'information mutuelle et sur une classification ascendante hiérarchique en pré-traitement. PhD thesis, Université de Nantes, France.
Dempster, A.P., Laird, N.M., & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.
Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, 3, 1289–1305.
Good, P. (2006). Resampling methods (3rd edn.). Birkhäuser.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1), 389–422.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3, 1157–1182.
Hall, M.A., & Smith, L.A. (1999). Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper. In Proceedings of the 12th international Florida artificial intelligence research society conference (pp. 235–239). AAAI Press.
Hajlaoui, K., Cuxac, P., Lamirel, J.-C., & Francois, C. (2012). Enhancing patent expertise through automatic matching with scientific papers. Discovery Science, LNCS, 7569, 299–312.
Lang, K. (1995). Learning to filter netnews. In Proceedings of the 12th international conference on machine learning (pp. 331–339).
Kohavi, R., & John, G.H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1–2), 273–324.
Kononenko, I. (1994). Estimating attributes: analysis and extensions of RELIEF. In European conference on machine learning (pp. 171–182).
Ladha, L., & Deepa, T. (2011). Feature selection methods and algorithms. International Journal on Computer Science and Engineering, 3(5), 1787–1797.
Lallich, S., & Rakotomalala, R. (2000). Fast feature selection using partial correlation for multi-valued attributes. In D.A. Zighed, J. Komorowski, & J. Zytkow (Eds.), Principles of data mining and knowledge discovery (pp. 221–231). Lecture Notes in Computer Science, vol. 1910. Berlin, Heidelberg: Springer.
Lamirel, J.-C., Al Shehabi, S., Francois, C., & Hoffmann, M. (2004). New classification quality estimators for analysis of documentary information: application to patent analysis and web mapping. Scientometrics, 60(3).
Lamirel, J.-C., & Ta, A.P. (2008). Combination of hyperbolic visualization and graph-based approach for organizing data analysis results: an application to social network analysis. In Proceedings of the 4th international conference on webometrics, informetrics and scientometrics and 9th COLLNET meeting. Berlin.
Lamirel, J.-C., Ghribi, M., & Cuxac, P. (2010). Unsupervised recall and precision measures: a step towards new efficient clustering quality indexes. In Proceedings of the 19th international conference on computational statistics (COMPSTAT 2010). Paris.
Lamirel, J.-C., Mall, R., Cuxac, P., & Safi, G. (2011). Variations to incremental growing neural gas algorithm based on label maximization. In Proceedings of IJCNN 2011. San Jose.
Lamirel, J.-C. (2012). A new approach for automatizing the analysis of research topics dynamics: application to optoelectronics research. Scientometrics, 93, 151–166.
Mejía-Lavalle, M., Sucar, E., & Arroyo, G. (2006). Feature selection with a perceptron neural net. In Feature selection for data mining: interfacing machine learning and statistics.
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559–572.
Platt, J. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schoelkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods: support vector learning. MIT Press.
Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.
Quinlan, R. (1993). C4.5: Programs for machine learning. San Mateo: Morgan Kaufmann.
Salton, G. (1971). Automatic processing of foreign language documents. Englewood Cliffs: Prentice-Hall.
Salton, G., & Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513–523.
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the international conference on new methods in language processing.
Witten, I.H., & Frank, E. (2005). Data mining: practical machine learning tools and techniques. Morgan Kaufmann.
Yu, L., & Liu, H. (2003). Feature selection for high-dimensional data: a fast correlation-based filter solution. In ICML 2003 (pp. 856–863). Washington.
Zhang, T., & Oles, F.J. (2001). Text categorization based on regularized linear classification methods. Information Retrieval, 4(1), 5–31.
Metadata
Title
Optimizing text classification through efficient feature selection based on quality metric
Authors
Jean-Charles Lamirel
Pascal Cuxac
Aneesh Sreevallabh Chivukula
Kafil Hajlaoui
Publication date
01.12.2015
Publisher
Springer US
Published in
Journal of Intelligent Information Systems / Issue 3/2015
Print ISSN: 0925-9902
Electronic ISSN: 1573-7675
DOI
https://doi.org/10.1007/s10844-014-0317-4
