2017 | Original Paper | Book Chapter

A Novel Approach to Feature Selection Based on Quality Estimation Metrics

Authors: Jean-Charles Lamirel, Pascal Cuxac, Kafil Hajlaoui

Published in: Advances in Knowledge Discovery and Management

Publisher: Springer International Publishing

Abstract

Feature maximization (F-max) is an unbiased quality estimation metric for unsupervised classification (clustering) that favours clusters with a maximal feature F-measure value. In this article we show that adapting this metric to supervised classification allows efficient feature selection and feature contrasting to be performed. We evaluate the method on different types of textual data. In this context, we demonstrate that the technique significantly improves the performance of classification methods compared with state-of-the-art feature selection techniques, notably for the classification of unbalanced, highly multidimensional and noisy textual data gathered in similar classes.
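
The abstract does not spell out the formulas, so the following is only a minimal sketch of how a feature F-measure and a selection rule of this kind might look. It assumes the definition commonly given for feature maximization: feature recall is the share of a feature's total weight that falls in a class, feature predominance is the share of a class's total weight carried by that feature, and the feature F-measure is their harmonic mean. The selection rule (keep a feature for a class when its score exceeds both that feature's mean over classes and the global mean) is likewise an assumption, and the names `feature_f_measure` and `select_features` and the toy data are hypothetical. Feature contrasting is not covered.

```python
from typing import Dict, List, Set


def feature_f_measure(X: List[List[float]], y: List[int]) -> Dict[int, List[float]]:
    """Per-class feature F-measure (assumed definition, see lead-in)."""
    n_features = len(X[0])
    classes = sorted(set(y))
    # class_sum[c][f] = total weight of feature f over the documents of class c
    class_sum = {c: [0.0] * n_features for c in classes}
    for row, label in zip(X, y):
        for f, weight in enumerate(row):
            class_sum[label][f] += weight

    scores: Dict[int, List[float]] = {}
    for c in classes:
        class_total = sum(class_sum[c])
        per_feature = []
        for f in range(n_features):
            feature_total = sum(class_sum[k][f] for k in classes)
            recall = class_sum[c][f] / feature_total if feature_total else 0.0
            predominance = class_sum[c][f] / class_total if class_total else 0.0
            denom = recall + predominance
            # Harmonic mean of recall and predominance; 0 when both are 0.
            per_feature.append(2 * recall * predominance / denom if denom else 0.0)
        scores[c] = per_feature
    return scores


def select_features(ff: Dict[int, List[float]]) -> Dict[int, Set[int]]:
    """Assumed rule: keep f for class c if ff[c][f] beats both averages."""
    classes = list(ff)
    n_features = len(ff[classes[0]])
    feature_mean = [
        sum(ff[c][f] for c in classes) / len(classes) for f in range(n_features)
    ]
    global_mean = sum(feature_mean) / n_features
    return {
        c: {
            f
            for f in range(n_features)
            if ff[c][f] > feature_mean[f] and ff[c][f] > global_mean
        }
        for c in classes
    }


# Toy term-weight matrix (hypothetical): 4 documents, 3 features, 2 classes.
X = [[3, 0, 1], [2, 0, 1], [0, 4, 1], [0, 3, 1]]
y = [0, 0, 1, 1]
ff = feature_f_measure(X, y)
selected = select_features(ff)
print(selected)  # {0: {0}, 1: {1}}
```

In this toy example the third feature occurs equally in both classes, so it is selected for neither class, whereas each of the first two features is retained only for the class in which it is concentrated.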

Title: A Novel Approach to Feature Selection Based on Quality Estimation Metrics
Authors: Jean-Charles Lamirel, Pascal Cuxac, Kafil Hajlaoui
Publisher: Springer International Publishing
Book: Advances in Knowledge Discovery and Management
Print ISBN: 978-3-319-45762-8
Electronic ISBN: 978-3-319-45763-5
Copyright year: 2017
DOI: https://doi.org/10.1007/978-3-319-45763-5_7
