2017 | Original Paper | Book Chapter

A Novel Approach to Feature Selection Based on Quality Estimation Metrics

Written by: Jean-Charles Lamirel, Pascal Cuxac, Kafil Hajlaoui

Published in: Advances in Knowledge Discovery and Management

Publisher: Springer International Publishing

Abstract

Feature maximization (F-max) is an unbiased quality estimation metric for unsupervised classification (clustering) that favours clusters with a maximal feature F-measure value. In this article we show that adapting this metric to the framework of supervised classification allows efficient feature selection and feature contrasting to be performed. We evaluate the method on different types of textual data. In this context, we demonstrate that the technique significantly improves the performance of classification methods compared with state-of-the-art feature selection techniques, notably when classifying unbalanced, highly multidimensional and noisy textual data gathered in similar classes.
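
The abstract does not define the feature F-measure. As a rough illustration only, the sketch below assumes the form used in the authors' related publications: the feature recall of feature f in class c is the weight of f in c divided by the weight of f over all classes, the feature precision is the weight of f in c divided by the weight of all features in c, and the F-measure is their harmonic mean. The function name `feature_f_measure` and the input layout (one feature-to-weight dictionary per document) are hypothetical choices made for this example, not the chapter's implementation.

```python
from collections import defaultdict


def feature_f_measure(docs, labels):
    """Compute (recall, precision, F-measure) per (class, feature) pair.

    docs:   list of dicts mapping a feature name to a non-negative weight.
    labels: list of class labels, one per document.
    The definitions of recall and precision are assumptions (see lead-in).
    """
    class_feature = defaultdict(float)  # weight of feature f summed over class c
    feature_total = defaultdict(float)  # weight of feature f summed over all documents
    class_total = defaultdict(float)    # weight of all features summed over class c

    for doc, label in zip(docs, labels):
        for feat, weight in doc.items():
            class_feature[(label, feat)] += weight
            feature_total[feat] += weight
            class_total[label] += weight

    scores = {}
    for (label, feat), weight in class_feature.items():
        if weight == 0:
            continue  # feature carries no weight in this class; skip to avoid 0/0
        recall = weight / feature_total[feat]
        precision = weight / class_total[label]
        f_measure = 2 * recall * precision / (recall + precision)
        scores[(label, feat)] = (recall, precision, f_measure)
    return scores


# Toy data: feature "a" dominates class "x", feature "b" dominates class "y".
docs = [{"a": 2}, {"a": 2}, {"b": 1}, {"a": 1, "b": 3}]
labels = ["x", "x", "y", "y"]
scores = feature_f_measure(docs, labels)
```

Under these assumed definitions, a feature scores highly for a class only when it is both concentrated in that class (high recall) and prominent within it (high precision). A selection step could then keep, for each class, the features whose F-measure exceeds some threshold; the thresholding rule itself is not given in the abstract.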


Metadata
Title
A Novel Approach to Feature Selection Based on Quality Estimation Metrics
Written by
Jean-Charles Lamirel
Pascal Cuxac
Kafil Hajlaoui
Copyright year
2017
DOI
https://doi.org/10.1007/978-3-319-45763-5_7