Skip to main content

2017 | OriginalPaper | Buchkapitel

Probabilistic Leverage Scores for Parallelized Unsupervised Feature Selection

verfasst von : Bruno Ordozgoiti, Sandra Gómez Canaval, Alberto Mozo

Erschienen in: Advances in Computational Intelligence

Verlag: Springer International Publishing

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Dimensionality reduction is often crucial for the application of machine learning and data mining. Feature selection methods can be employed for this purpose, with the advantage of preserving interpretability. There exist unsupervised feature selection methods based on matrix factorization algorithms, which can help choose the most informative features in terms of approximation error. Randomized methods have been proposed recently to provide better theoretical guarantees and better approximation errors than their deterministic counterparts, but their computational costs can be significant when dealing with big, high dimensional data sets. Some existing randomized and deterministic approaches require the computation of the singular value decomposition in \(O(mn\min (m,n))\) time (for m samples and n features) for providing leverage scores. This compromises their applicability to domains of even moderately high dimensionality. In this paper we propose the use of Probabilistic PCA to compute the leverage scores in O(mnk) time, enabling the applicability of some of these randomized methods to large, high-dimensional data sets. We show that using this approach, we can rapidly provide an approximation of the leverage scores that is works well in this context. In addition, we offer a parallelized version over the emerging Resilient Distributed Datasets paradigm (RDD) on Apache Spark, making it horizontally scalable for enormous numbers of data instances. We validate the performance of our approach on different data sets comprised of real-world and synthetic data.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
If the cluster has enough memory, we recommend caching this RDD, since it is used in two different operations.
 
Literatur
1.
Zurück zum Zitat Boutsidis, C., Mahoney, M.W., Drineas, P.: An improved approximation algorithm for the column subset selection problem. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 968–977. Society for Industrial and Applied Mathematics (2009) Boutsidis, C., Mahoney, M.W., Drineas, P.: An improved approximation algorithm for the column subset selection problem. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 968–977. Society for Industrial and Applied Mathematics (2009)
2.
Zurück zum Zitat Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)CrossRef Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)CrossRef
3.
Zurück zum Zitat Deshpande, A., Rademacher, L., Vempala, S., Wang, G.: Matrix approximation and projective clustering via volume sampling. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1117–1126. Society for Industrial and Applied Mathematics (2006) Deshpande, A., Rademacher, L., Vempala, S., Wang, G.: Matrix approximation and projective clustering via volume sampling. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1117–1126. Society for Industrial and Applied Mathematics (2006)
4.
Zurück zum Zitat Farahat, A.K., Elgohary, A., Ghodsi, A., Kamel, M.S.: Distributed column subset selection on mapreduce. In: IEEE 13th International Conference on Data Mining (ICDM), pp. 171–180. IEEE (2013) Farahat, A.K., Elgohary, A., Ghodsi, A., Kamel, M.S.: Distributed column subset selection on mapreduce. In: IEEE 13th International Conference on Data Mining (ICDM), pp. 171–180. IEEE (2013)
5.
Zurück zum Zitat Frieze, A., Kannan, R., Vempala, S.: Fast monte-carlo algorithms for finding low-rank approximations. J. ACM (JACM) 51(6), 1025–1041 (2004)MathSciNetCrossRefMATH Frieze, A., Kannan, R., Vempala, S.: Fast monte-carlo algorithms for finding low-rank approximations. J. ACM (JACM) 51(6), 1025–1041 (2004)MathSciNetCrossRefMATH
6.
Zurück zum Zitat He, Q., Cheng, X., Zhuang, F., Shi, Z.: Parallel feature selection using positive approximation based on mapreduce. In: 11th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp. 397–402. IEEE (2014) He, Q., Cheng, X., Zhuang, F., Shi, Z.: Parallel feature selection using positive approximation based on mapreduce. In: 11th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp. 397–402. IEEE (2014)
7.
Zurück zum Zitat Jolliffe, I.T.: Discarding variables in a principal component analysis. I: artificial data. Appl. Stat. 21, 160–173 (1972)MathSciNetCrossRef Jolliffe, I.T.: Discarding variables in a principal component analysis. I: artificial data. Appl. Stat. 21, 160–173 (1972)MathSciNetCrossRef
8.
Zurück zum Zitat Ordozgoiti, B., Canaval, S.G., Mozo, A.: Parallelized unsupervised feature selection for large-scale network traffic analysis. In: Proceedings of the ESANN (2016) Ordozgoiti, B., Canaval, S.G., Mozo, A.: Parallelized unsupervised feature selection for large-scale network traffic analysis. In: Proceedings of the ESANN (2016)
9.
Zurück zum Zitat Ordozgoiti, B., Canaval, S.G., Mozo, A.: A fast iterative algorithm for improved unsupervised feature selection. In: IEEE 16th International Conference on Data Mining (ICDM), pp. 390–399. IEEE (2016) Ordozgoiti, B., Canaval, S.G., Mozo, A.: A fast iterative algorithm for improved unsupervised feature selection. In: IEEE 16th International Conference on Data Mining (ICDM), pp. 390–399. IEEE (2016)
10.
Zurück zum Zitat Papailiopoulos, D., Kyrillidis, A., Boutsidis, C.: Provable deterministic leverage score sampling. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 997–1006. ACM (2014) Papailiopoulos, D., Kyrillidis, A., Boutsidis, C.: Provable deterministic leverage score sampling. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 997–1006. ACM (2014)
11.
Zurück zum Zitat Singh, S., Kubica, J., Larsen, S., Sorokina, D.: Parallel large scale feature selection for logistic regression. In: SDM, pp. 1172–1183. SIAM (2009) Singh, S., Kubica, J., Larsen, S., Sorokina, D.: Parallel large scale feature selection for logistic regression. In: SDM, pp. 1172–1183. SIAM (2009)
12.
Zurück zum Zitat Sun, Z., Li, Z.: Data intensive parallel feature selection method study. In: International Joint Conference on Neural Networks (IJCNN), pp. 2256–2262. IEEE (2014) Sun, Z., Li, Z.: Data intensive parallel feature selection method study. In: International Joint Conference on Neural Networks (IJCNN), pp. 2256–2262. IEEE (2014)
13.
14.
Zurück zum Zitat Wold, S., Esbensen, K., Geladi, P.: Principal component analysis. Chemometr. Intell. Lab. Syst. 2(1), 37–52 (1987)CrossRef Wold, S., Esbensen, K., Geladi, P.: Principal component analysis. Chemometr. Intell. Lab. Syst. 2(1), 37–52 (1987)CrossRef
15.
Zurück zum Zitat Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, p. 2. USENIX Assoc. (2012) Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, p. 2. USENIX Assoc. (2012)
16.
Zurück zum Zitat Zhao, Z., Zhang, R., Cox, J., Duling, D., Sarle, W.: Massively parallel feature selection: an approach based on variance preservation. Mach. Learn. 92(1), 195–220 (2013)MathSciNetCrossRefMATH Zhao, Z., Zhang, R., Cox, J., Duling, D., Sarle, W.: Massively parallel feature selection: an approach based on variance preservation. Mach. Learn. 92(1), 195–220 (2013)MathSciNetCrossRefMATH
Metadaten
Titel
Probabilistic Leverage Scores for Parallelized Unsupervised Feature Selection
verfasst von
Bruno Ordozgoiti
Sandra Gómez Canaval
Alberto Mozo
Copyright-Jahr
2017
DOI
https://doi.org/10.1007/978-3-319-59147-6_61