
01-10-2015 | Regular Paper

Greedy column subset selection for large-scale data sets

Authors: Ahmed K. Farahat, Ahmed Elgohary, Ali Ghodsi, Mohamed S. Kamel

Published in: Knowledge and Information Systems | Issue 1/2015

Abstract

In today’s information systems, the availability of massive amounts of data necessitates the development of fast and accurate algorithms to summarize these data and represent them in a succinct format. One crucial problem in big data analytics is the selection of representative instances from large and massively distributed data, which is formally known as the Column Subset Selection problem. The solution to this problem enables data analysts to gain insight into the data and explore its hidden structure. The selected instances can also be used for data preprocessing tasks such as learning a low-dimensional embedding of the data points or computing a low-rank approximation of the corresponding matrix. This paper presents a fast and accurate greedy algorithm for large-scale column subset selection. The algorithm minimizes an objective function, which measures the reconstruction error of the data matrix based on the subset of selected columns. The paper first presents a centralized greedy algorithm for column subset selection, which depends on a novel recursive formula for calculating the reconstruction error of the data matrix. The paper then presents a MapReduce algorithm, which selects a few representative columns from a matrix whose columns are massively distributed across several commodity machines. The algorithm first learns a concise representation of all columns using random projection, and it then solves a generalized column subset selection problem at each machine in which a subset of columns is selected from the sub-matrix on that machine such that the reconstruction error of the concise representation is minimized. The paper demonstrates the effectiveness and efficiency of the proposed algorithm through an empirical evaluation on benchmark data sets.
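
As a rough illustration of the greedy criterion described above (a minimal sketch, not the paper's implementation; the function name and structure are ours), the following Python code repeatedly picks the column of the current residual \(E = A - P_{\mathcal{S}}A\) that maximizes \(\Vert E^T e_j \Vert^2 / \Vert e_j \Vert^2\), which equals the reduction in reconstruction error obtained by adding column \(j\), and then deflates the residual. The paper's algorithm computes the same criterion with recursive update formulas instead of recomputing the residual at every step.

```python
import numpy as np

def greedy_css(A, k, eps=1e-12):
    """Naive greedy column subset selection (illustrative sketch).

    At each step, pick the column whose addition most reduces the
    Frobenius reconstruction error ||A - P_S A||_F^2, then project
    that column out of the residual. The paper replaces the explicit
    residual recomputation here with recursive update formulas.
    """
    E = np.array(A, dtype=float)         # residual E = A - P_S A (initially A)
    selected = []
    for _ in range(k):
        norms = np.sum(E * E, axis=0)    # ||e_j||^2 for each residual column
        G = E.T @ E                      # Gram matrix of the residual
        gains = np.sum(G * G, axis=0) / np.maximum(norms, eps)
        gains[selected] = -np.inf        # never reselect a column
        j = int(np.argmax(gains))
        selected.append(j)
        e = E[:, [j]]
        E -= (e @ (e.T @ E)) / max(norms[j], eps)   # deflate the residual
    return selected

# Usage: select 10 columns and measure the reconstruction error.
A = np.random.default_rng(0).normal(size=(100, 500))
S = greedy_css(A, 10)
C = A[:, S]
err = np.linalg.norm(A - C @ np.linalg.pinv(C) @ A, 'fro') ** 2
```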


Footnotes
1
\(\Vert A \Vert_F^2 = \mathrm{trace}\left( A^T A \right)\).
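A quick numerical sanity check of this identity (illustrative only):

```python
import numpy as np

# ||A||_F^2 equals trace(A^T A) for any real matrix A.
A = np.random.default_rng(0).normal(size=(5, 3))
assert np.isclose(np.linalg.norm(A, 'fro') ** 2, np.trace(A.T @ A))
```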
 
2
The in-memory summation can also be replaced by a MapReduce combiner [15].
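The combiner pattern amounts to pre-aggregating each mapper's partial sums in memory before the shuffle; a generic Python sketch of the idea (not the paper's code):

```python
from collections import defaultdict

def map_with_in_memory_summation(records):
    """Instead of emitting one (key, value) pair per input record and
    summing everything at the reducer, pre-aggregate partial sums in
    memory and emit one pair per key. A MapReduce combiner achieves
    the same reduction in shuffled data at the framework level."""
    partial = defaultdict(float)
    for key, value in records:
        partial[key] += value          # in-memory summation
    return sorted(partial.items())     # one emitted pair per key
```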
 
3
The data sets Reuters-21578, MNIST-4K, PIE-20 and YaleB-38 are available in MAT format at: http://www.cad.zju.edu.cn/home/dengcai/Data/data.html. PIE-20 is a subset of PIE-32x32 with the images of the first 20 persons.
 
6
The CSS algorithm of Boutsidis et al. [3] was not included in the comparison as its implementation is not available.
 
7
Revision: 5.13.4.7.
 
8
In the implemented code, the efficient recursive formulas in Section 4 of [31] are used to implement the update of QR decomposition and the swapping criterion.
 
9
In [4] (a newer version of [6]), Boutsidis et al. suggested the use of the SRRQR algorithm [31, Algorithm 4] for the deterministic phase. Although the SRRQR algorithm achieves the theoretical guarantee presented in [6], the MATLAB \(qr\) function is used in the conducted experiments, as it is much faster and achieves comparable accuracy on the data sets used in the experiments.
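For reference, the column selection performed by MATLAB's column-pivoted \(qr\) has a direct SciPy analogue; a minimal sketch (function name ours), which returns the indices of the first \(k\) pivot columns:

```python
import numpy as np
from scipy.linalg import qr

def pivoted_qr_css(A, k):
    """Deterministic CSS via column-pivoted QR: the pivot order greedily
    ranks columns by the norm of their component outside the span of the
    columns already chosen. SRRQR [31] adds extra column swaps on top of
    this to strengthen the rank-revealing guarantee."""
    _, _, piv = qr(A, mode='economic', pivoting=True)
    return piv[:k]

A = np.random.default_rng(0).normal(size=(50, 200))
print(pivoted_qr_css(A, 5))
```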
 
10
For the MNIST-4K data set, the range of \(l/n\) values is smaller since the rank of the matrix is very low (i.e., less than the number of pixels).
 
11
The qr and SRRQR methods both depend on the MATLAB qr function. For the document data sets, the MATLAB qr function takes a very long time compared with the other methods, so their results are not reported in the shown figures.
 
12
The MATLAB functions anova1 and multcompare were used.
Table 2
The relative accuracy of the best performing CSS methods for the Reuters-21578, Reviews and LA1 data sets

Reuters-21578
   Method            \(l/n=5\,\%\)              \(l/n=9\,\%\)              \(l/n=13\,\%\)             \(l/n=17\,\%\)
   HybridCSS         \(66.34 \pm 00.90\)        \(70.52 \pm 00.81\)        \(73.84 \pm 00.39\)        \(76.11 \pm 00.55\)
   ApproxSVD (a)     \(70.59 \pm 00.65\)        \(74.76 \pm 00.66\)        \(77.86 \pm 00.32\)        \(80.11 \pm 00.31\)
   GreedyCSS (b)     \(70.86 \pm 00.65\)        \(74.96 \pm 00.65\)        \(77.96 \pm 00.32\)        \(80.28 \pm 00.31\)
   RndGreedyCSS (c)  \(69.98 \pm 00.61\)        \(74.41 \pm 00.67\)        \(77.58 \pm 00.35\)        \(80.01 \pm 00.34\)
   F-statistic       86.99                      90.36                      331.1                      267.52
   p value           1.50E\(-16\)               8.10E\(-17\)               2.90E\(-26\)               1.20E\(-24\)
   Tukey’s test      \(a=b\), \(a=c\), \(b>c\)  \(a=b\), \(a=c\), \(b=c\)  \(a=b\), \(a=c\), \(b=c\)  \(a=b\), \(a=c\), \(b=c\)

Reviews
   Method            \(l/n=5\,\%\)              \(l/n=9\,\%\)              \(l/n=13\,\%\)             \(l/n=17\,\%\)
   HybridCSS         \(35.44 \pm 03.01\)        \(35.22 \pm 01.30\)        \(33.73 \pm 00.84\)        \(29.02 \pm 00.43\)
   ApproxSVD (a)     \(46.28 \pm 01.86\)        \(40.48 \pm 01.44\)        \(36.64 \pm 00.76\)        \(32.75 \pm 00.41\)
   GreedyCSS (b)     \(47.88 \pm 01.80\)        \(41.80 \pm 01.41\)        \(37.56 \pm 00.75\)        \(33.59 \pm 00.40\)
   RndGreedyCSS (c)  \(45.64 \pm 01.86\)        \(39.67 \pm 01.50\)        \(35.79 \pm 00.88\)        \(31.98 \pm 00.52\)
   F-statistic       66.7                       40.83                      41.08                      203.01
   p value           9.00E\(-15\)               1.10E\(-11\)               1.00E\(-11\)               1.30E\(-22\)
   Tukey’s test      \(a=b\), \(a=c\), \(b=c\)  \(a=b\), \(a=c\), \(b>c\)  \(a=b\), \(a=c\), \(b>c\)  \(a<b\), \(a>c\), \(b>c\)

LA1
   Method            \(l/n=5\,\%\)              \(l/n=9\,\%\)              \(l/n=13\,\%\)             \(l/n=17\,\%\)
   HybridCSS         \(31.53 \pm 01.79\)        \(34.22 \pm 01.05\)        \(33.86 \pm 01.28\)        \(31.08 \pm 00.75\)
   ApproxSVD (a)     \(43.09 \pm 00.88\)        \(40.34 \pm 00.85\)        \(37.35 \pm 01.03\)        \(34.52 \pm 00.79\)
   GreedyCSS (b)     \(44.84 \pm 00.85\)        \(41.88 \pm 00.83\)        \(38.18 \pm 01.02\)        \(35.50 \pm 00.78\)
   RndGreedyCSS (c)  \(41.28 \pm 01.16\)        \(39.30 \pm 01.15\)        \(36.07 \pm 01.13\)        \(33.52 \pm 00.92\)
   F-statistic       234.34                     114.16                     28.38                      54.23
   p value           1.10E\(-23\)               1.90E\(-18\)               1.40E\(-09\)               2.00E\(-13\)
   Tukey’s test      \(a<b\), \(a>c\), \(b>c\)  \(a<b\), \(a=c\), \(b>c\)  \(a=b\), \(a=c\), \(b>c\)  \(a<b\), \(a>c\), \(b>c\)

For each sub-column, the analysis of variance (ANOVA) procedure is conducted to test the statistical significance of the results. The F-statistic, p value, and results of Tukey’s tests between the three best performing methods are reported.
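
The MATLAB anova1/multcompare procedure referenced in footnote 12 can be reproduced in Python roughly as follows; the per-run scores below are synthetic placeholders drawn to match the reported mean and standard deviation for Reuters-21578 at \(l/n=5\,\%\), since the raw per-run measurements are not reproduced on this page:

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
runs = 20
a = rng.normal(70.59, 0.65, runs)   # ApproxSVD    (a) -- placeholder scores
b = rng.normal(70.86, 0.65, runs)   # GreedyCSS    (b)
c = rng.normal(69.98, 0.61, runs)   # RndGreedyCSS (c)

F, p = f_oneway(a, b, c)            # one-way ANOVA, as with MATLAB's anova1
print(f"F = {F:.2f}, p = {p:.2e}")

scores = np.concatenate([a, b, c])
groups = ['a'] * runs + ['b'] * runs + ['c'] * runs
print(pairwise_tukeyhsd(scores, groups))   # Tukey's test, as with multcompare
```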
Table 3
The relative accuracy of the best performing CSS methods for the MNIST-4K, PIE-20 and YaleB-38 data sets

MNIST-4K
   Method            \(l/n=1\,\%\)              \(l/n=5\,\%\)              \(l/n=9\,\%\)              \(l/n=13\,\%\)
   HybridCSS         \(23.41 \pm 03.33\)        \(21.97 \pm 02.18\)        \(20.60 \pm 02.09\)        \(-72.06 \pm 17.25\)
   ApproxSVD (a)     \(39.79 \pm 01.16\)        \(29.36 \pm 01.66\)        \(34.59 \pm 01.63\)        \(55.56 \pm 04.17\)
   GreedyCSS (b)     \(39.91 \pm 01.16\)        \(29.47 \pm 01.65\)        \(36.09 \pm 01.59\)        \(56.36 \pm 04.09\)
   RndGreedyCSS (c)  \(28.84 \pm 02.25\)        \(24.11 \pm 01.72\)        \(27.90 \pm 02.11\)        \(51.62 \pm 04.07\)
   F-statistic       143.63                     43.39                      143.88                     460.6
   p value           4.40E\(-20\)               4.80E\(-12\)               4.20E\(-20\)               9.30E\(-29\)
   Tukey’s test      \(a=b\), \(a>c\), \(b>c\)  \(a=b\), \(a>c\), \(b>c\)  \(a=b\), \(a>c\), \(b>c\)  \(a=b\), \(a=c\), \(b=c\)

PIE-20
   Method            \(l/n=5\,\%\)              \(l/n=9\,\%\)              \(l/n=13\,\%\)             \(l/n=17\,\%\)
   HybridCSS         \(43.67 \pm 01.77\)        \(43.65 \pm 01.86\)        \(39.77 \pm 01.37\)        \(36.55 \pm 01.95\)
   ApproxSVD (a)     \(46.79 \pm 01.22\)        \(46.16 \pm 01.75\)        \(45.07 \pm 01.20\)        \(45.03 \pm 01.95\)
   GreedyCSS (b)     \(46.98 \pm 01.22\)        \(46.53 \pm 01.74\)        \(44.82 \pm 01.20\)        \(45.08 \pm 01.95\)
   RndGreedyCSS (c)  \(44.16 \pm 01.57\)        \(44.11 \pm 01.92\)        \(43.38 \pm 01.20\)        \(43.79 \pm 02.03\)
   F-statistic       13.88                      6.31                       38.41                      43.1
   p value           3.60E\(-06\)               1.50E\(-03\)               2.60E\(-11\)               5.30E\(-12\)
   Tukey’s test      \(a=b\), \(a>c\), \(b>c\)  \(a=b\), \(a=c\), \(b>c\)  \(a=b\), \(a>c\), \(b=c\)  \(a=b\), \(a=c\), \(b=c\)

YaleB-38
   Method            \(l/n=5\,\%\)              \(l/n=9\,\%\)              \(l/n=13\,\%\)             \(l/n=17\,\%\)
   HybridCSS         \(33.83 \pm 02.87\)        \(37.11 \pm 01.28\)        \(35.20 \pm 02.54\)        \(30.60 \pm 01.69\)
   ApproxSVD (a)     \(40.71 \pm 02.12\)        \(39.70 \pm 01.09\)        \(37.88 \pm 02.22\)        \(36.92 \pm 01.33\)
   GreedyCSS (b)     \(40.01 \pm 02.14\)        \(39.59 \pm 01.09\)        \(38.14 \pm 02.21\)        \(36.96 \pm 01.33\)
   RndGreedyCSS (c)  \(35.82 \pm 02.25\)        \(37.17 \pm 01.32\)        \(36.29 \pm 02.35\)        \(35.19 \pm 01.34\)
   F-statistic       19.56                      14.56                      3.53                       43.87
   p value           1.10E\(-07\)               2.30E\(-06\)               2.40E\(-02\)               4.10E\(-12\)
   Tukey’s test      \(a=b\), \(a>c\), \(b>c\)  \(a=b\), \(a>c\), \(b>c\)  \(a=b\), \(a=c\), \(b=c\)  \(a=b\), \(a>c\), \(b>c\)

For each sub-column, the analysis of variance (ANOVA) procedure is conducted to test the statistical significance of the results. The F-statistic, p value, and results of Tukey’s tests between the three best performing methods are reported.
 
13
Amazon Elastic Compute Cloud (EC2): http://aws.amazon.com/ec2.
 
14
Mahout is an Apache project for implementing Machine Learning algorithms on Hadoop. See http://mahout.apache.org/.
 
Literature
1. Achlioptas D (2003) Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J Comput Syst Sci 66(4):671–687
2. Bischof C, Quintana-Ortí G (1998) Computing rank-revealing QR factorizations of dense matrices. ACM Trans Math Softw 24(2):226–253
3. Boutsidis C, Drineas P, Magdon-Ismail M (2011) Near optimal column-based matrix reconstruction. In: Proceedings of the 52nd annual IEEE symposium on foundations of computer science (FOCS’11), pp 305–314
4. Boutsidis C, Mahoney MW, Drineas P (2008a) An improved approximation algorithm for the column subset selection problem. CoRR abs/0812.4293
5. Boutsidis C, Mahoney MW, Drineas P (2008b) Unsupervised feature selection for principal components analysis. In: Li Y, Liu B, Sarawagi S (eds) Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’08). ACM, New York, pp 61–69
6. Boutsidis C, Mahoney MW, Drineas P (2009) An improved approximation algorithm for the column subset selection problem. In: Proceedings of the 20th annual ACM-SIAM symposium on discrete algorithms (SODA’09), pp 968–977
7. Boutsidis C, Sun J, Anerousis N (2008) Clustered subset selection and its applications on IT service metrics. In: Proceedings of the 17th ACM conference on information and knowledge management (CIKM’08), pp 599–608
8. Cai D, Zhang C, He X (2010) Unsupervised feature selection for multi-cluster data. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’10). ACM, New York, NY, pp 333–342
9. Çivril A, Magdon-Ismail M (2008) Deterministic sparse column based matrix reconstruction via greedy approximation of SVD. In: Proceedings of the 19th international symposium on algorithms and computation (ISAAC’08). Springer, New York, pp 414–423
10.
11.
12. Chen W-Y, Song Y, Bai H, Lin C-J, Chang E (2011) Parallel spectral clustering in distributed systems. IEEE Trans Pattern Anal Mach Intell 33(3):568–586
13. Cui Y, Dy J (2008) Orthogonal principal feature selection. In: The sparse optimization and variable selection workshop at the international conference on machine learning (ICML)
14.
15. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
16. Deerwester S, Dumais S, Furnas G, Landauer T, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inform Sci Technol 41(6):391–407
17. Deshpande A, Rademacher L (2010) Efficient volume sampling for row/column subset selection. In: Proceedings of the 51st annual IEEE symposium on foundations of computer science (FOCS’10), pp 329–338
18. Deshpande A, Rademacher L, Vempala S, Wang G (2006a) Matrix approximation and projective clustering via volume sampling. Theory Comput 2(1):225–247
19. Deshpande A, Rademacher L, Vempala S, Wang G (2006b) Matrix approximation and projective clustering via volume sampling. In: Proceedings of the 17th annual ACM-SIAM symposium on discrete algorithms (SODA’06). ACM, New York, NY, pp 1117–1126
20. Drineas P, Frieze A, Kannan R, Vempala S, Vinay V (2004) Clustering large graphs via the singular value decomposition. Mach Learn 56(1–3):9–33
21. Drineas P, Kannan R, Mahoney M (2007) Fast Monte Carlo algorithms for matrices II: computing a low-rank approximation to a matrix. SIAM J Comput 36(1):158–183
22. Drineas P, Mahoney M, Muthukrishnan S (2006) Subspace sampling and relative-error matrix approximation: column-based methods. In: Approximation, randomization, and combinatorial optimization. Algorithms and techniques. Springer, Berlin, pp 316–326
23. Elgohary A, Farahat AK, Kamel MS, Karray F (2013) Embed and conquer: scalable embeddings for kernel k-means on MapReduce. CoRR abs/1311.2334
24. Elsayed T, Lin J, Oard DW (2008) Pairwise document similarity in large collections with MapReduce. In: Proceedings of the 46th annual meeting of the association for computational linguistics on human language technologies: short papers (HLT’08), pp 265–268
25. Ene A, Im S, Moseley B (2011) Fast clustering using MapReduce. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’11), pp 681–689
26. Farahat A, Elgohary A, Ghodsi A, Kamel M (2013) Distributed column subset selection on MapReduce. In: Proceedings of the 13th IEEE international conference on data mining (ICDM’13), pp 171–180
27. Farahat AK, Ghodsi A, Kamel MS (2011) An efficient greedy method for unsupervised feature selection. In: Proceedings of the 11th IEEE international conference on data mining (ICDM’11), pp 161–170
28. Farahat AK, Ghodsi A, Kamel MS (2013) Efficient greedy feature selection for unsupervised learning. Knowl Inf Syst 35(2):285–310
29. Frieze A, Kannan R, Vempala S (1998) Fast Monte-Carlo algorithms for finding low-rank approximations. In: Proceedings of the 39th annual IEEE symposium on foundations of computer science (FOCS’98), pp 370–378
30. Golub G, Van Loan C (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore
31. Gu M, Eisenstat SC (1996) Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM J Sci Comput 17(4):848–869
32. Guruswami V, Sinop AK (2012) Optimal column-based low-rank matrix reconstruction. In: Proceedings of the 23rd annual ACM-SIAM symposium on discrete algorithms (SODA’12), pp 1207–1214
33. Halko N, Martinsson P-G, Shkolnisky Y, Tygert M (2011) An algorithm for the principal component analysis of large data sets. SIAM J Sci Comput 33(5):2580–2594
34. He X, Cai D, Niyogi P (2005) Laplacian score for feature selection. In: Advances in neural information processing systems 18 (NIPS’05). MIT Press, Cambridge, MA
35. He X, Yan S, Hu Y, Niyogi P, Zhang H (2005) Face recognition using Laplacianfaces. IEEE Trans Pattern Anal Mach Intell 27(3):328–340
36. Hogg RV, Ledolter J (1987) Engineering statistics, vol 358. MacMillan, New York
37. Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall Inc, Upper Saddle River, NJ
38. Jolliffe I (2002) Principal component analysis, 2nd edn. Springer, New York
39. Kang U, Tsourakakis C, Appel A, Faloutsos C, Leskovec J (2008) HADI: fast diameter estimation and mining in massive graphs with Hadoop. Technical report CMU-ML-08-117
40. Karloff H, Suri S, Vassilvitskii S (2010) A model of computation for MapReduce. In: Proceedings of the 21st annual ACM-SIAM symposium on discrete algorithms (SODA’10), pp 938–948
41. Karypis G (2003) CLUTO—a clustering toolkit. Technical report #02-017, Department of Computer Science, University of Minnesota
42. Kaufman L, Rousseeuw P (1987) Clustering by means of medoids. Technical report, Department of Mathematics and Informatics, Technische Hogeschool, Delft (Netherlands)
43. Lee K, Ho J, Kriegman D (2005) Acquiring linear subspaces for face recognition under variable lighting. IEEE Trans Pattern Anal Mach Intell 27(5):684–698
44. Lewis D (1999) Reuters-21578 text categorization test collection distribution 1.0
45. Lewis DD, Yang Y, Rose TG, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res 5:361–397
46. Li P, Hastie TJ, Church KW (2006) Very sparse random projections. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’06), pp 287–296
47. Lu Y, Cohen I, Zhou X, Tian Q (2007) Feature selection using principal feature analysis. In: Proceedings of the 15th international conference on multimedia. ACM, New York, NY, pp 301–304
48. Lütkepohl H (1996) Handbook of matrices. Wiley, New Jersey
49. Masaeli M, Yan Y, Cui Y, Fung G, Dy J (2010) Convex principal feature selection. In: Proceedings of the SIAM international conference on data mining (SDM), pp 619–628
50. Meng X, Mahoney M (2013) Robust regression on MapReduce. In: Proceedings of the 30th international conference on machine learning (ICML’13), pp 888–896
51. Mitra P, Murthy C, Pal S (2002) Unsupervised feature selection using feature similarity. IEEE Trans Pattern Anal Mach Intell 24(3):301–312
52.
53. Sim T, Baker S, Bsat M (2003) The CMU pose, illumination, and expression database. IEEE Trans Pattern Anal Mach Intell 25(12):1615–1618
54. Singh S, Kubica J, Larsen S, Sorokina D (2009) Parallel large scale feature selection for logistic regression. In: Proceedings of the SIAM international conference on data mining, pp 1171–1182
55. Torralba A, Fergus R, Freeman W (2008) 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Trans Pattern Anal Mach Intell 30(11):1958–1970
56. White T (2009) Hadoop: the definitive guide, 1st edn. O’Reilly Media Inc, Sebastopol
57. Wolf L, Shashua A (2005) Feature selection for unsupervised and supervised inference: the emergence of sparsity in a weight-based approach. J Mach Learn Res 6:1855–1887
58. Xiang J, Guo C, Aboulnaga A (2013) Scalable maximum clique computation using MapReduce. In: Proceedings of the 29th IEEE international conference on data engineering (ICDE’13), pp 74–85
59. Zhao Z, Liu H (2007) Spectral feature selection for supervised and unsupervised learning. In: Proceedings of the 24th international conference on machine learning (ICML’07). ACM, New York, NY, pp 1151–1157
Metadata
Title
Greedy column subset selection for large-scale data sets
Authors
Ahmed K. Farahat
Ahmed Elgohary
Ali Ghodsi
Mohamed S. Kamel
Publication date
01-10-2015
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 1/2015
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-014-0801-8
