Skip to main content
Erschienen in: Journal of Classification 2/2020

11.07.2019

An Ensemble Feature Ranking Algorithm for Clustering Analysis

verfasst von: Jaehong Yu, Hua Zhong, Seoung Bum Kim

Erschienen in: Journal of Classification | Ausgabe 2/2020

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Feature ranking is a widely used feature selection method. It uses importance scores to evaluate features and selects those with high scores. Conventional unsupervised feature ranking methods do not consider the information on cluster structures; therefore, these methods may be unable to select the relevant features for clustering analysis. To address this limitation, we propose a feature ranking algorithm based on silhouette decomposition. The proposed algorithm calculates the ensemble importance scores by decomposing the average silhouette widths of random subspaces. By doing so, the contribution of a feature in generating cluster structures can be represented more clearly. Experiments on different benchmark data sets examined the properties of the proposed algorithm and compared it with the existing ensemble-based feature ranking methods. The experiments demonstrated that the proposed algorithm outperformed its existing counterparts.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Literatur
Zurück zum Zitat Andrews, J. L., & Mcnicholas, P. D. (2014). Variable selection for clustering and classification. Journal of Classification, 31(2), 136–153.MathSciNetMATH Andrews, J. L., & Mcnicholas, P. D. (2014). Variable selection for clustering and classification. Journal of Classification, 31(2), 136–153.MathSciNetMATH
Zurück zum Zitat Arbelaitz, O., Gurrutxaga, I., Muguerrza, J., Pèrez, J. M., & Perona, I. (2013). An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1), 243–256. Arbelaitz, O., Gurrutxaga, I., Muguerrza, J., Pèrez, J. M., & Perona, I. (2013). An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1), 243–256.
Zurück zum Zitat Arthur, D., and Vassilvitskii, S. (2007). “k-means++: the advantages of careful seeding”, in Proceedings of the 18th annual ACM-SIAM symposium on discrete algorithms, 2007, pp. 1027–1035. Arthur, D., and Vassilvitskii, S. (2007). “k-means++: the advantages of careful seeding”, in Proceedings of the 18th annual ACM-SIAM symposium on discrete algorithms, 2007, pp. 1027–1035.
Zurück zum Zitat Ayad, H. G., & Kamel, M. S. (2008). Cumulative voting consensus method for partitions with variable number of clusters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(1), 160–173. Ayad, H. G., & Kamel, M. S. (2008). Cumulative voting consensus method for partitions with variable number of clusters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(1), 160–173.
Zurück zum Zitat Boutsidis, C., Drineas, P., and Mahoney, M.W. (2009), “Unsupervised feature selection for the k-means clustering problem”, in Proceedings of the Advances in Neural Information Processing Systems, pp. 153–161. Boutsidis, C., Drineas, P., and Mahoney, M.W. (2009), “Unsupervised feature selection for the k-means clustering problem”, in Proceedings of the Advances in Neural Information Processing Systems, pp. 153–161.
Zurück zum Zitat Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.MATH Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.MATH
Zurück zum Zitat Caliński, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics-theory and Methods, 3(1), 1–27.MathSciNetMATH Caliński, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics-theory and Methods, 3(1), 1–27.MathSciNetMATH
Zurück zum Zitat Chiang, M. M. T., & Mirkin, B. (2010). Intelligent choice of the number of clusters in k-means clustering: an experimental study with different cluster spreads. Journal of Classification, 27(1), 3–40.MathSciNetMATH Chiang, M. M. T., & Mirkin, B. (2010). Intelligent choice of the number of clusters in k-means clustering: an experimental study with different cluster spreads. Journal of Classification, 27(1), 3–40.MathSciNetMATH
Zurück zum Zitat de Amorim, R. C. (2016). A survey on feature weighting based k-means algorithms. Journal of Classification, 33(2), 210–242.MathSciNetMATH de Amorim, R. C. (2016). A survey on feature weighting based k-means algorithms. Journal of Classification, 33(2), 210–242.MathSciNetMATH
Zurück zum Zitat de Amorim, R. C., & Mirkin, B. (2012). Minkowski metric, feature weighting and anomalous cluster initializing in k-means clustering. Pattern Recognition, 45(3), 1061–1075. de Amorim, R. C., & Mirkin, B. (2012). Minkowski metric, feature weighting and anomalous cluster initializing in k-means clustering. Pattern Recognition, 45(3), 1061–1075.
Zurück zum Zitat de Amorim, R. C., Makarenkov, V., & Mirkin, B. (2016). A-Wardpβ: Effective hierarchical clustering using the Minkowski metric and a fast k-means initialization. Information Sciences, 370, 343–354. de Amorim, R. C., Makarenkov, V., & Mirkin, B. (2016). A-Ward: Effective hierarchical clustering using the Minkowski metric and a fast k-means initialization. Information Sciences, 370, 343–354.
Zurück zum Zitat de Amorim, R. C., Shestakov, A., Mirkin, B., & Makarenkov, V. (2017). The Minkowski central partition as a pointer to a suitable distance exponent and consensus partitioning. Pattern Recognition, 67, 62–72. de Amorim, R. C., Shestakov, A., Mirkin, B., & Makarenkov, V. (2017). The Minkowski central partition as a pointer to a suitable distance exponent and consensus partitioning. Pattern Recognition, 67, 62–72.
Zurück zum Zitat Dy, J. G., & Brodley, C. E. (2004). Feature selection for unsupervised learning. Journal of Machine Learning Research, 5, 845–889.MathSciNetMATH Dy, J. G., & Brodley, C. E. (2004). Feature selection for unsupervised learning. Journal of Machine Learning Research, 5, 845–889.MathSciNetMATH
Zurück zum Zitat Elghazel, H., & Aussem, A. (2015). Unsupervised feature selection with ensemble learning. Machine Learning, 98(1–2), 157–180.MathSciNetMATH Elghazel, H., & Aussem, A. (2015). Unsupervised feature selection with ensemble learning. Machine Learning, 98(1–2), 157–180.MathSciNetMATH
Zurück zum Zitat Fred, A. L., & Jain, A. K. (2005). Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 835–850. Fred, A. L., & Jain, A. K. (2005). Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 835–850.
Zurück zum Zitat Guerra, L., Robles, V., Bielza, C., & Larrañaga, P. (2012). A comparison of clustering quality indices using outliers and noise. Intelligent Data Analysis, 16(4), 703–715. Guerra, L., Robles, V., Bielza, C., & Larrañaga, P. (2012). A comparison of clustering quality indices using outliers and noise. Intelligent Data Analysis, 16(4), 703–715.
Zurück zum Zitat Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.MATH Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.MATH
Zurück zum Zitat Handl, J., & Knowles, J. (2006). Feature subset selection in unsupervised learning via multiobjective optimization. International Journal of Computational Intelligence Research, 2(3), 217–238.MathSciNet Handl, J., & Knowles, J. (2006). Feature subset selection in unsupervised learning via multiobjective optimization. International Journal of Computational Intelligence Research, 2(3), 217–238.MathSciNet
Zurück zum Zitat Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society: Series C: Applied Statistics, 28(1), 100–108.MATH Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society: Series C: Applied Statistics, 28(1), 100–108.MATH
Zurück zum Zitat HE, X., CAI, D., and NIYOGI, P. (2005), “Laplacian score for feature selection”, in Proceedings of the Advances in Neural Information Processing Systems, pp. 507–514. HE, X., CAI, D., and NIYOGI, P. (2005), “Laplacian score for feature selection”, in Proceedings of the Advances in Neural Information Processing Systems, pp. 507–514.
Zurück zum Zitat Herrero, J., Dìaz-uriarte, R., & Dopazo, J. (2003). Gene expression data preprocessing. Bioinformatics, 19(5), 655–656. Herrero, J., Dìaz-uriarte, R., & Dopazo, J. (2003). Gene expression data preprocessing. Bioinformatics, 19(5), 655–656.
Zurück zum Zitat Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832–844. Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832–844.
Zurück zum Zitat Hong, Y., Kwong, S., Chang, Y., & Ren, Q. (2008a). Consensus unsupervised feature ranking from multiple views. Pattern Recognition Letters, 29(5), 595–602. Hong, Y., Kwong, S., Chang, Y., & Ren, Q. (2008a). Consensus unsupervised feature ranking from multiple views. Pattern Recognition Letters, 29(5), 595–602.
Zurück zum Zitat Hong, Y., Kwong, S., Chang, Y., & Ren, Q. (2008b). Unsupervised feature selection using clustering ensembles and population based incremental learning algorithm. Pattern Recognition, 41(9), 2742–2756.MATH Hong, Y., Kwong, S., Chang, Y., & Ren, Q. (2008b). Unsupervised feature selection using clustering ensembles and population based incremental learning algorithm. Pattern Recognition, 41(9), 2742–2756.MATH
Zurück zum Zitat Huang, J. Z., Ng, M. K., Rong, H., & LI, Z. (2005). Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5), 657–668. Huang, J. Z., Ng, M. K., Rong, H., & LI, Z. (2005). Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5), 657–668.
Zurück zum Zitat Hubrert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218. Hubrert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Zurück zum Zitat Iam-on, N., Boongoen, T., Garrett, S., & Price, C. (2011). A link-based approach to the cluster ensemble problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12), 2396–2409. Iam-on, N., Boongoen, T., Garrett, S., & Price, C. (2011). A link-based approach to the cluster ensemble problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12), 2396–2409.
Zurück zum Zitat Ketchen, D. J., Jr., & Shook, C. L. (1996). The application of cluster analysis in strategic management research: an analysis and critique. Strategic Management Journal, 17, 441–458. Ketchen, D. J., Jr., & Shook, C. L. (1996). The application of cluster analysis in strategic management research: an analysis and critique. Strategic Management Journal, 17, 441–458.
Zurück zum Zitat Kim, S. B., & Rattakorn, P. (2011). Unsupervised feature selection using weighted principal components. Expert Systems with Applications, 38(5), 5704–5710. Kim, S. B., & Rattakorn, P. (2011). Unsupervised feature selection using weighted principal components. Expert Systems with Applications, 38(5), 5704–5710.
Zurück zum Zitat Kim, E. Y., Kim, S. Y., Ashlock, D., & Nam, D. (2009). MULTI-K: Accurate classification of microarray subtypes using ensemble k-means clustering. BMC Bioinformatics, 10(1), 260. Kim, E. Y., Kim, S. Y., Ashlock, D., & Nam, D. (2009). MULTI-K: Accurate classification of microarray subtypes using ensemble k-means clustering. BMC Bioinformatics, 10(1), 260.
Zurück zum Zitat Kuncheva, L. I., & Vetrov, D. P. (2006). Evaluation of stability of k-means cluster ensembles with respect to random initialization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11), 1798–1808. Kuncheva, L. I., & Vetrov, D. P. (2006). Evaluation of stability of k-means cluster ensembles with respect to random initialization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11), 1798–1808.
Zurück zum Zitat Lai, C., Reinders, M. J., & Wessels, L. (2006). Random subspace method for multivariate feature selection. Pattern Recognition Letters, 27(10), 1067–1076. Lai, C., Reinders, M. J., & Wessels, L. (2006). Random subspace method for multivariate feature selection. Pattern Recognition Letters, 27(10), 1067–1076.
Zurück zum Zitat Li, F., Zhang, Z., & Jin, C. (2016). Feature selection with partition differentiation entropy for large-scale data sets. Information Sciences, 329, 690–700.MATH Li, F., Zhang, Z., & Jin, C. (2016). Feature selection with partition differentiation entropy for large-scale data sets. Information Sciences, 329, 690–700.MATH
Zurück zum Zitat Liu, Y., Li, Z., Xiong, H., Gao, X., and Wu, J. (2010), “Understanding of internal clustering validation measures”, in Proceedings of IEEE 10th International Conference on Data Mining (ICDM), pp. 911–916. Liu, Y., Li, Z., Xiong, H., Gao, X., and Wu, J. (2010), “Understanding of internal clustering validation measures”, in Proceedings of IEEE 10th International Conference on Data Mining (ICDM), pp. 911–916.
Zurück zum Zitat MacQueen, J. (1967), “Some methods for classification and analysis of multivariate observations”, in Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, 1(14), pp. 281–297. MacQueen, J. (1967), “Some methods for classification and analysis of multivariate observations”, in Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, 1(14), pp. 281–297.
Zurück zum Zitat Makarenkov, V., & Legendre, P. (2001). Optimal variable weighting for ultrametric and additive trees and K-means partitioning: methods and software. Journal of Classification, 18(2), 245–271.MathSciNetMATH Makarenkov, V., & Legendre, P. (2001). Optimal variable weighting for ultrametric and additive trees and K-means partitioning: methods and software. Journal of Classification, 18(2), 245–271.MathSciNetMATH
Zurück zum Zitat Mitra, P., Murthy, C. A., & Pal, S. K. (2002). Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 301–312. Mitra, P., Murthy, C. A., & Pal, S. K. (2002). Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 301–312.
Zurück zum Zitat Oehmen, C., & Nieplocha, J. (2006). ScalaBLAST: a scalable implementation of BLAST for high-performance data-intensive bioinformatics analysis. IEEE Transactions on Parallel and Distributed Systems, 17(8), 740–749. Oehmen, C., & Nieplocha, J. (2006). ScalaBLAST: a scalable implementation of BLAST for high-performance data-intensive bioinformatics analysis. IEEE Transactions on Parallel and Distributed Systems, 17(8), 740–749.
Zurück zum Zitat Panday, D., de Amorim, R. C., & Lane, P. (2018). Feature weighting as a tool for unsupervised feature selection. Information Processing Letters, 129, 44–52.MathSciNetMATH Panday, D., de Amorim, R. C., & Lane, P. (2018). Feature weighting as a tool for unsupervised feature selection. Information Processing Letters, 129, 44–52.MathSciNetMATH
Zurück zum Zitat Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.MATH Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.MATH
Zurück zum Zitat Steinley, D., & Brusco, M. J. (2007). Initializing k-means batch clustering: a critical evaluation of several techniques. Journal of Classification, 24(1), 99–121.MathSciNetMATH Steinley, D., & Brusco, M. J. (2007). Initializing k-means batch clustering: a critical evaluation of several techniques. Journal of Classification, 24(1), 99–121.MathSciNetMATH
Zurück zum Zitat Tan, P. N., Steinbach, M., & Kumar, V. (2006). Introduction to data mining. Boston: Addison-Wesley. Tan, P. N., Steinbach, M., & Kumar, V. (2006). Introduction to data mining. Boston: Addison-Wesley.
Zurück zum Zitat Vendramin, L., Campello, R. J., & Hruschka, E. R. (2010). Relative clustering validity criteria: a comparative overview. Statistical Analysis and Data Mining: the ASA Data Science Journal, 3(4), 209–235.MathSciNet Vendramin, L., Campello, R. J., & Hruschka, E. R. (2010). Relative clustering validity criteria: a comparative overview. Statistical Analysis and Data Mining: the ASA Data Science Journal, 3(4), 209–235.MathSciNet
Zurück zum Zitat Xu, R. F., & Lee, S. J. (2015). Dimensionality reduction by feature clustering for regression problems. Information Sciences, 299, 42–57.MathSciNetMATH Xu, R. F., & Lee, S. J. (2015). Dimensionality reduction by feature clustering for regression problems. Information Sciences, 299, 42–57.MathSciNetMATH
Zurück zum Zitat Yang, C., Wan, B., and Gao, X. (2006), “Effectivity of Internal Validation Techniques for Gene Clustering”, in Proceedings of International Symposium on Biological and Medical Data Analysis, pp. 49–59. Yang, C., Wan, B., and Gao, X. (2006), “Effectivity of Internal Validation Techniques for Gene Clustering”, in Proceedings of International Symposium on Biological and Medical Data Analysis, pp. 49–59.
Zurück zum Zitat Yu, J., & Kim, S. B. (2016). A density-based Noisy graph partitioning algorithm. Neurocomputing, 175, 473–491. Yu, J., & Kim, S. B. (2016). A density-based Noisy graph partitioning algorithm. Neurocomputing, 175, 473–491.
Zurück zum Zitat Yu, Z., Wang, D., You, J., Wong, H. S., Wu, S., Zhang, J., & Han, G. (2016). Progressive subspace ensemble learning. Pattern Recognition, 60, 692–705. Yu, Z., Wang, D., You, J., Wong, H. S., Wu, S., Zhang, J., & Han, G. (2016). Progressive subspace ensemble learning. Pattern Recognition, 60, 692–705.
Zurück zum Zitat Zhang, S., Wong, H. S., Shen, Y., & Xie, D. (2012). A new unsupervised feature ranking method for gene expression data based on consensus affinity. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(4), 1257–1263. Zhang, S., Wong, H. S., Shen, Y., & Xie, D. (2012). A new unsupervised feature ranking method for gene expression data based on consensus affinity. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(4), 1257–1263.
Zurück zum Zitat Zhong, C., Yue, X., Zhang, Z., & Lei, J. (2015). A clustering ensemble: Two-level refined co-association matrix with path-based transformation. Pattern Recognition, 48(8), 2699–2709.MATH Zhong, C., Yue, X., Zhang, Z., & Lei, J. (2015). A clustering ensemble: Two-level refined co-association matrix with path-based transformation. Pattern Recognition, 48(8), 2699–2709.MATH
Metadaten
Titel
An Ensemble Feature Ranking Algorithm for Clustering Analysis
verfasst von
Jaehong Yu
Hua Zhong
Seoung Bum Kim
Publikationsdatum
11.07.2019
Verlag
Springer US
Erschienen in
Journal of Classification / Ausgabe 2/2020
Print ISSN: 0176-4268
Elektronische ISSN: 1432-1343
DOI
https://doi.org/10.1007/s00357-019-09330-8

Weitere Artikel der Ausgabe 2/2020

Journal of Classification 2/2020 Zur Ausgabe

Premium Partner