
01-10-2015 | Regular Paper

Greedy column subset selection for large-scale data sets

Authors: Ahmed K. Farahat, Ahmed Elgohary, Ali Ghodsi, Mohamed S. Kamel

Published in: Knowledge and Information Systems | Issue 1/2015

Abstract

In today’s information systems, the availability of massive amounts of data necessitates the development of fast and accurate algorithms to summarize these data and represent them in a succinct format. One crucial problem in big data analytics is the selection of representative instances from large and massively distributed data, which is formally known as the Column Subset Selection problem. The solution to this problem enables data analysts to gain insight into the data and explore its hidden structure. The selected instances can also be used for data preprocessing tasks such as learning a low-dimensional embedding of the data points or computing a low-rank approximation of the corresponding matrix. This paper presents a fast and accurate greedy algorithm for large-scale column subset selection. The algorithm minimizes an objective function, which measures the reconstruction error of the data matrix based on the subset of selected columns. The paper first presents a centralized greedy algorithm for column subset selection, which depends on a novel recursive formula for calculating the reconstruction error of the data matrix. The paper then presents a MapReduce algorithm, which selects a few representative columns from a matrix whose columns are massively distributed across several commodity machines. The algorithm first learns a concise representation of all columns using random projection, and it then solves a generalized column subset selection problem at each machine in which a subset of columns is selected from the sub-matrix on that machine such that the reconstruction error of the concise representation is minimized. The paper demonstrates the effectiveness and efficiency of the proposed algorithm through an empirical evaluation on benchmark data sets.
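
As a rough illustration of the greedy criterion described above (a minimal sketch, not the paper's implementation; the function name and structure are ours), the following Python code repeatedly picks the column of the current residual \(E = A - P_{\mathcal{S}}A\) that maximizes \(\Vert E^T e_j \Vert^2 / \Vert e_j \Vert^2\), which equals the reduction in reconstruction error obtained by adding column \(j\), and then deflates the residual. The paper's algorithm computes the same criterion with recursive update formulas instead of recomputing the residual at every step.

```python
import numpy as np

def greedy_css(A, k, eps=1e-12):
    """Naive greedy column subset selection (illustrative sketch).

    At each step, pick the column whose addition most reduces the
    Frobenius reconstruction error ||A - P_S A||_F^2, then project
    that column out of the residual. The paper replaces the explicit
    residual recomputation here with recursive update formulas.
    """
    E = np.array(A, dtype=float)         # residual E = A - P_S A (initially A)
    selected = []
    for _ in range(k):
        norms = np.sum(E * E, axis=0)    # ||e_j||^2 for each residual column
        G = E.T @ E                      # Gram matrix of the residual
        gains = np.sum(G * G, axis=0) / np.maximum(norms, eps)
        gains[selected] = -np.inf        # never reselect a column
        j = int(np.argmax(gains))
        selected.append(j)
        e = E[:, [j]]
        E -= (e @ (e.T @ E)) / max(norms[j], eps)   # deflate the residual
    return selected

# Usage: select 10 columns and measure the reconstruction error.
A = np.random.default_rng(0).normal(size=(100, 500))
S = greedy_css(A, 10)
C = A[:, S]
err = np.linalg.norm(A - C @ np.linalg.pinv(C) @ A, 'fro') ** 2
```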


Footnotes
1
\(\Vert A \Vert_F^2 = \mathrm{trace}\left( A^T A \right)\).
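A quick numerical sanity check of this identity (illustrative only):

```python
import numpy as np

# ||A||_F^2 equals trace(A^T A) for any real matrix A.
A = np.random.default_rng(0).normal(size=(5, 3))
assert np.isclose(np.linalg.norm(A, 'fro') ** 2, np.trace(A.T @ A))
```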
 
2
The in-memory summation can also be replaced by a MapReduce combiner [15].
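The combiner pattern amounts to pre-aggregating each mapper's partial sums in memory before the shuffle; a generic Python sketch of the idea (not the paper's code):

```python
from collections import defaultdict

def map_with_in_memory_summation(records):
    """Instead of emitting one (key, value) pair per input record and
    summing everything at the reducer, pre-aggregate partial sums in
    memory and emit one pair per key. A MapReduce combiner achieves
    the same reduction in shuffled data at the framework level."""
    partial = defaultdict(float)
    for key, value in records:
        partial[key] += value          # in-memory summation
    return sorted(partial.items())     # one emitted pair per key
```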
 
3
The data sets Reuters-21578, MNIST-4K, PIE-20 and YaleB-38 are available in MAT format at: http://www.cad.zju.edu.cn/home/dengcai/Data/data.html. PIE-20 is a subset of PIE-32x32 with the images of the first 20 persons.
 
6
The CSS algorithm of Boutsidis et al. [3] was not included in the comparison as its implementation is not available.
 
7
Revision: 5.13.4.7.
 
8
In the implemented code, the efficient recursive formulas in Section 4 of [31] are used to implement the update of QR decomposition and the swapping criterion.
 
9
In [4] (a newer version of [6]), Boutsidis et al. suggested the use of the SRRQR algorithm [31, Algorithm 4] for the deterministic phase. Although the SRRQR algorithm achieves the theoretical guarantee presented in [6], the MATLAB \(qr\) function is used in the conducted experiments, as it is much faster and achieves comparable accuracy on the data sets used in the experiments.
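For reference, the column selection performed by MATLAB's column-pivoted \(qr\) has a direct SciPy analogue; a minimal sketch (function name ours), which returns the indices of the first \(k\) pivot columns:

```python
import numpy as np
from scipy.linalg import qr

def pivoted_qr_css(A, k):
    """Deterministic CSS via column-pivoted QR: the pivot order greedily
    ranks columns by the norm of their component outside the span of the
    columns already chosen. SRRQR [31] adds extra column swaps on top of
    this to strengthen the rank-revealing guarantee."""
    _, _, piv = qr(A, mode='economic', pivoting=True)
    return piv[:k]

A = np.random.default_rng(0).normal(size=(50, 200))
print(pivoted_qr_css(A, 5))
```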
 
10
For the MNIST-4K data set, the range of \(l/n\) values is smaller since the rank of the matrix is very low (i.e., less than the number of pixels).
 
11
The qr and SRRQR methods both depend on the MATLAB qr function. For the document data sets, the MATLAB qr function takes a very long time compared with the other methods, so their results are not reported in the shown figures.
 
12
The MATLAB functions anova1 and multcompare were used.
Table 2
The relative accuracy of the best performing CSS methods for the Reuters-21578, Reviews and LA1 data sets

Reuters-21578
   Method            \(l/n=5\,\%\)              \(l/n=9\,\%\)              \(l/n=13\,\%\)             \(l/n=17\,\%\)
   HybridCSS         \(66.34 \pm 00.90\)        \(70.52 \pm 00.81\)        \(73.84 \pm 00.39\)        \(76.11 \pm 00.55\)
   ApproxSVD (a)     \(70.59 \pm 00.65\)        \(74.76 \pm 00.66\)        \(77.86 \pm 00.32\)        \(80.11 \pm 00.31\)
   GreedyCSS (b)     \(70.86 \pm 00.65\)        \(74.96 \pm 00.65\)        \(77.96 \pm 00.32\)        \(80.28 \pm 00.31\)
   RndGreedyCSS (c)  \(69.98 \pm 00.61\)        \(74.41 \pm 00.67\)        \(77.58 \pm 00.35\)        \(80.01 \pm 00.34\)
   F-statistic       86.99                      90.36                      331.1                      267.52
   p value           1.50E\(-16\)               8.10E\(-17\)               2.90E\(-26\)               1.20E\(-24\)
   Tukey’s test      \(a=b\), \(a=c\), \(b>c\)  \(a=b\), \(a=c\), \(b=c\)  \(a=b\), \(a=c\), \(b=c\)  \(a=b\), \(a=c\), \(b=c\)

Reviews
   Method            \(l/n=5\,\%\)              \(l/n=9\,\%\)              \(l/n=13\,\%\)             \(l/n=17\,\%\)
   HybridCSS         \(35.44 \pm 03.01\)        \(35.22 \pm 01.30\)        \(33.73 \pm 00.84\)        \(29.02 \pm 00.43\)
   ApproxSVD (a)     \(46.28 \pm 01.86\)        \(40.48 \pm 01.44\)        \(36.64 \pm 00.76\)        \(32.75 \pm 00.41\)
   GreedyCSS (b)     \(47.88 \pm 01.80\)        \(41.80 \pm 01.41\)        \(37.56 \pm 00.75\)        \(33.59 \pm 00.40\)
   RndGreedyCSS (c)  \(45.64 \pm 01.86\)        \(39.67 \pm 01.50\)        \(35.79 \pm 00.88\)        \(31.98 \pm 00.52\)
   F-statistic       66.7                       40.83                      41.08                      203.01
   p value           9.00E\(-15\)               1.10E\(-11\)               1.00E\(-11\)               1.30E\(-22\)
   Tukey’s test      \(a=b\), \(a=c\), \(b=c\)  \(a=b\), \(a=c\), \(b>c\)  \(a=b\), \(a=c\), \(b>c\)  \(a<b\), \(a>c\), \(b>c\)

LA1
   Method            \(l/n=5\,\%\)              \(l/n=9\,\%\)              \(l/n=13\,\%\)             \(l/n=17\,\%\)
   HybridCSS         \(31.53 \pm 01.79\)        \(34.22 \pm 01.05\)        \(33.86 \pm 01.28\)        \(31.08 \pm 00.75\)
   ApproxSVD (a)     \(43.09 \pm 00.88\)        \(40.34 \pm 00.85\)        \(37.35 \pm 01.03\)        \(34.52 \pm 00.79\)
   GreedyCSS (b)     \(44.84 \pm 00.85\)        \(41.88 \pm 00.83\)        \(38.18 \pm 01.02\)        \(35.50 \pm 00.78\)
   RndGreedyCSS (c)  \(41.28 \pm 01.16\)        \(39.30 \pm 01.15\)        \(36.07 \pm 01.13\)        \(33.52 \pm 00.92\)
   F-statistic       234.34                     114.16                     28.38                      54.23
   p value           1.10E\(-23\)               1.90E\(-18\)               1.40E\(-09\)               2.00E\(-13\)
   Tukey’s test      \(a<b\), \(a>c\), \(b>c\)  \(a<b\), \(a=c\), \(b>c\)  \(a=b\), \(a=c\), \(b>c\)  \(a<b\), \(a>c\), \(b>c\)

For each sub-column, the analysis of variance (ANOVA) procedure is conducted to test the statistical significance of the results. The F-statistic, p value, and results of Tukey’s tests between the three best performing methods are reported.
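
The MATLAB anova1/multcompare procedure referenced in footnote 12 can be reproduced in Python roughly as follows; the per-run scores below are synthetic placeholders drawn to match the reported mean and standard deviation for Reuters-21578 at \(l/n=5\,\%\), since the raw per-run measurements are not reproduced on this page:

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
runs = 20
a = rng.normal(70.59, 0.65, runs)   # ApproxSVD    (a) -- placeholder scores
b = rng.normal(70.86, 0.65, runs)   # GreedyCSS    (b)
c = rng.normal(69.98, 0.61, runs)   # RndGreedyCSS (c)

F, p = f_oneway(a, b, c)            # one-way ANOVA, as with MATLAB's anova1
print(f"F = {F:.2f}, p = {p:.2e}")

scores = np.concatenate([a, b, c])
groups = ['a'] * runs + ['b'] * runs + ['c'] * runs
print(pairwise_tukeyhsd(scores, groups))   # Tukey's test, as with multcompare
```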
Table 3
The relative accuracy of the best performing CSS methods for the MNIST-4K, PIE-20 and YaleB-38 data sets

MNIST-4K
   Method            \(l/n=1\,\%\)              \(l/n=5\,\%\)              \(l/n=9\,\%\)              \(l/n=13\,\%\)
   HybridCSS         \(23.41 \pm 03.33\)        \(21.97 \pm 02.18\)        \(20.60 \pm 02.09\)        \(-72.06 \pm 17.25\)
   ApproxSVD (a)     \(39.79 \pm 01.16\)        \(29.36 \pm 01.66\)        \(34.59 \pm 01.63\)        \(55.56 \pm 04.17\)
   GreedyCSS (b)     \(39.91 \pm 01.16\)        \(29.47 \pm 01.65\)        \(36.09 \pm 01.59\)        \(56.36 \pm 04.09\)
   RndGreedyCSS (c)  \(28.84 \pm 02.25\)        \(24.11 \pm 01.72\)        \(27.90 \pm 02.11\)        \(51.62 \pm 04.07\)
   F-statistic       143.63                     43.39                      143.88                     460.6
   p value           4.40E\(-20\)               4.80E\(-12\)               4.20E\(-20\)               9.30E\(-29\)
   Tukey’s test      \(a=b\), \(a>c\), \(b>c\)  \(a=b\), \(a>c\), \(b>c\)  \(a=b\), \(a>c\), \(b>c\)  \(a=b\), \(a=c\), \(b=c\)

PIE-20
   Method            \(l/n=5\,\%\)              \(l/n=9\,\%\)              \(l/n=13\,\%\)             \(l/n=17\,\%\)
   HybridCSS         \(43.67 \pm 01.77\)        \(43.65 \pm 01.86\)        \(39.77 \pm 01.37\)        \(36.55 \pm 01.95\)
   ApproxSVD (a)     \(46.79 \pm 01.22\)        \(46.16 \pm 01.75\)        \(45.07 \pm 01.20\)        \(45.03 \pm 01.95\)
   GreedyCSS (b)     \(46.98 \pm 01.22\)        \(46.53 \pm 01.74\)        \(44.82 \pm 01.20\)        \(45.08 \pm 01.95\)
   RndGreedyCSS (c)  \(44.16 \pm 01.57\)        \(44.11 \pm 01.92\)        \(43.38 \pm 01.20\)        \(43.79 \pm 02.03\)
   F-statistic       13.88                      6.31                       38.41                      43.1
   p value           3.60E\(-06\)               1.50E\(-03\)               2.60E\(-11\)               5.30E\(-12\)
   Tukey’s test      \(a=b\), \(a>c\), \(b>c\)  \(a=b\), \(a=c\), \(b>c\)  \(a=b\), \(a>c\), \(b=c\)  \(a=b\), \(a=c\), \(b=c\)

YaleB-38
   Method            \(l/n=5\,\%\)              \(l/n=9\,\%\)              \(l/n=13\,\%\)             \(l/n=17\,\%\)
   HybridCSS         \(33.83 \pm 02.87\)        \(37.11 \pm 01.28\)        \(35.20 \pm 02.54\)        \(30.60 \pm 01.69\)
   ApproxSVD (a)     \(40.71 \pm 02.12\)        \(39.70 \pm 01.09\)        \(37.88 \pm 02.22\)        \(36.92 \pm 01.33\)
   GreedyCSS (b)     \(40.01 \pm 02.14\)        \(39.59 \pm 01.09\)        \(38.14 \pm 02.21\)        \(36.96 \pm 01.33\)
   RndGreedyCSS (c)  \(35.82 \pm 02.25\)        \(37.17 \pm 01.32\)        \(36.29 \pm 02.35\)        \(35.19 \pm 01.34\)
   F-statistic       19.56                      14.56                      3.53                       43.87
   p value           1.10E\(-07\)               2.30E\(-06\)               2.40E\(-02\)               4.10E\(-12\)
   Tukey’s test      \(a=b\), \(a>c\), \(b>c\)  \(a=b\), \(a>c\), \(b>c\)  \(a=b\), \(a=c\), \(b=c\)  \(a=b\), \(a>c\), \(b>c\)

For each sub-column, the analysis of variance (ANOVA) procedure is conducted to test the statistical significance of the results. The F-statistic, p value, and results of Tukey’s tests between the three best performing methods are reported.
 
13
Amazon Elastic Compute Cloud (EC2): http://aws.amazon.com/ec2.
 
14
Mahout is an Apache project for implementing Machine Learning algorithms on Hadoop. See http://mahout.apache.org/.
 
Literature
1. Achlioptas D (2003) Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J Comput Syst Sci 66(4):671–687
2. Bischof C, Quintana-Ortí G (1998) Computing rank-revealing QR factorizations of dense matrices. ACM Trans Math Softw 24(2):226–253
3. Boutsidis C, Drineas P, Magdon-Ismail M (2011) Near optimal column-based matrix reconstruction. In: Proceedings of the 52nd annual IEEE symposium on foundations of computer science (FOCS’11), pp 305–314
4. Boutsidis C, Mahoney MW, Drineas P (2008a) An improved approximation algorithm for the column subset selection problem. CoRR abs/0812.4293
5. Boutsidis C, Mahoney MW, Drineas P (2008b) Unsupervised feature selection for principal components analysis. In: Li Y, Liu B, Sarawagi S (eds) Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’08). ACM, New York, pp 61–69
6. Boutsidis C, Mahoney MW, Drineas P (2009) An improved approximation algorithm for the column subset selection problem. In: Proceedings of the 20th annual ACM-SIAM symposium on discrete algorithms (SODA’09), pp 968–977
7. Boutsidis C, Sun J, Anerousis N (2008) Clustered subset selection and its applications on IT service metrics. In: Proceedings of the 17th ACM conference on information and knowledge management (CIKM’08), pp 599–608
8. Cai D, Zhang C, He X (2010) Unsupervised feature selection for multi-cluster data. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’10). ACM, New York, NY, pp 333–342
9. Çivril A, Magdon-Ismail M (2008) Deterministic sparse column based matrix reconstruction via greedy approximation of SVD. In: Proceedings of the 19th international symposium on algorithms and computation (ISAAC’08). Springer, New York, pp 414–423
10.
11.
12. Chen W-Y, Song Y, Bai H, Lin C-J, Chang E (2011) Parallel spectral clustering in distributed systems. IEEE Trans Pattern Anal Mach Intell 33(3):568–586
13. Cui Y, Dy J (2008) Orthogonal principal feature selection. In: The sparse optimization and variable selection workshop at the international conference on machine learning (ICML)
14.
15. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
16. Deerwester S, Dumais S, Furnas G, Landauer T, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inform Sci Technol 41(6):391–407
17. Deshpande A, Rademacher L (2010) Efficient volume sampling for row/column subset selection. In: Proceedings of the 51st annual IEEE symposium on foundations of computer science (FOCS’10), pp 329–338
18. Deshpande A, Rademacher L, Vempala S, Wang G (2006a) Matrix approximation and projective clustering via volume sampling. Theory Comput 2(1):225–247
19. Deshpande A, Rademacher L, Vempala S, Wang G (2006b) Matrix approximation and projective clustering via volume sampling. In: Proceedings of the 17th annual ACM-SIAM symposium on discrete algorithms (SODA’06). ACM, New York, NY, pp 1117–1126
20. Drineas P, Frieze A, Kannan R, Vempala S, Vinay V (2004) Clustering large graphs via the singular value decomposition. Mach Learn 56(1–3):9–33
21. Drineas P, Kannan R, Mahoney M (2007) Fast Monte Carlo algorithms for matrices II: computing a low-rank approximation to a matrix. SIAM J Comput 36(1):158–183
22. Drineas P, Mahoney M, Muthukrishnan S (2006) Subspace sampling and relative-error matrix approximation: column-based methods. In: Approximation, randomization, and combinatorial optimization. Algorithms and techniques. Springer, Berlin, pp 316–326
23. Elgohary A, Farahat AK, Kamel MS, Karray F (2013) Embed and conquer: scalable embeddings for kernel k-means on MapReduce. CoRR abs/1311.2334
24. Elsayed T, Lin J, Oard DW (2008) Pairwise document similarity in large collections with MapReduce. In: Proceedings of the 46th annual meeting of the association for computational linguistics on human language technologies: short papers (HLT’08), pp 265–268
25. Ene A, Im S, Moseley B (2011) Fast clustering using MapReduce. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’11), pp 681–689
26. Farahat A, Elgohary A, Ghodsi A, Kamel M (2013) Distributed column subset selection on MapReduce. In: Proceedings of the 13th IEEE international conference on data mining (ICDM’13), pp 171–180
27. Farahat AK, Ghodsi A, Kamel MS (2011) An efficient greedy method for unsupervised feature selection. In: Proceedings of the 11th IEEE international conference on data mining (ICDM’11), pp 161–170
28. Farahat AK, Ghodsi A, Kamel MS (2013) Efficient greedy feature selection for unsupervised learning. Knowl Inf Syst 35(2):285–310
29. Frieze A, Kannan R, Vempala S (1998) Fast Monte-Carlo algorithms for finding low-rank approximations. In: Proceedings of the 39th annual IEEE symposium on foundations of computer science (FOCS’98), pp 370–378
30. Golub G, Van Loan C (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore
31. Gu M, Eisenstat SC (1996) Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM J Sci Comput 17(4):848–869
32. Guruswami V, Sinop AK (2012) Optimal column-based low-rank matrix reconstruction. In: Proceedings of the 23rd annual ACM-SIAM symposium on discrete algorithms (SODA’12), pp 1207–1214
33. Halko N, Martinsson P-G, Shkolnisky Y, Tygert M (2011) An algorithm for the principal component analysis of large data sets. SIAM J Sci Comput 33(5):2580–2594
34. He X, Cai D, Niyogi P (2005) Laplacian score for feature selection. In: Advances in neural information processing systems 18 (NIPS’05). MIT Press, Cambridge, MA
35. He X, Yan S, Hu Y, Niyogi P, Zhang H (2005) Face recognition using Laplacianfaces. IEEE Trans Pattern Anal Mach Intell 27(3):328–340
36. Hogg RV, Ledolter J (1987) Engineering statistics, vol 358. MacMillan, New York
37. Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall Inc, Upper Saddle River, NJ
38. Jolliffe I (2002) Principal component analysis, 2nd edn. Springer, New York
39. Kang U, Tsourakakis C, Appel A, Faloutsos C, Leskovec J (2008) HADI: fast diameter estimation and mining in massive graphs with Hadoop. Technical report CMU-ML-08-117
40. Karloff H, Suri S, Vassilvitskii S (2010) A model of computation for MapReduce. In: Proceedings of the 21st annual ACM-SIAM symposium on discrete algorithms (SODA’10), pp 938–948
41. Karypis G (2003) CLUTO—a clustering toolkit. Technical report #02-017, Department of Computer Science, University of Minnesota
42. Kaufman L, Rousseeuw P (1987) Clustering by means of medoids. Technical report, Department of Mathematics and Informatics, Technische Hogeschool, Delft (Netherlands)
43. Lee K, Ho J, Kriegman D (2005) Acquiring linear subspaces for face recognition under variable lighting. IEEE Trans Pattern Anal Mach Intell 27(5):684–698
44. Lewis D (1999) Reuters-21578 text categorization test collection distribution 1.0
45. Lewis DD, Yang Y, Rose TG, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res 5:361–397
46. Li P, Hastie TJ, Church KW (2006) Very sparse random projections. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’06), pp 287–296
47. Lu Y, Cohen I, Zhou X, Tian Q (2007) Feature selection using principal feature analysis. In: Proceedings of the 15th international conference on multimedia. ACM, New York, NY, pp 301–304
48. Lütkepohl H (1996) Handbook of matrices. Wiley, New Jersey
49. Masaeli M, Yan Y, Cui Y, Fung G, Dy J (2010) Convex principal feature selection. In: Proceedings of the SIAM international conference on data mining (SDM), pp 619–628
50. Meng X, Mahoney M (2013) Robust regression on MapReduce. In: Proceedings of the 30th international conference on machine learning (ICML’13), pp 888–896
51. Mitra P, Murthy C, Pal S (2002) Unsupervised feature selection using feature similarity. IEEE Trans Pattern Anal Mach Intell 24(3):301–312
52.
53. Sim T, Baker S, Bsat M (2003) The CMU pose, illumination, and expression database. IEEE Trans Pattern Anal Mach Intell 25(12):1615–1618
54. Singh S, Kubica J, Larsen S, Sorokina D (2009) Parallel large scale feature selection for logistic regression. In: Proceedings of the SIAM international conference on data mining, pp 1171–1182
55. Torralba A, Fergus R, Freeman W (2008) 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Trans Pattern Anal Mach Intell 30(11):1958–1970
56. White T (2009) Hadoop: the definitive guide, 1st edn. O’Reilly Media Inc, Sebastopol
57. Wolf L, Shashua A (2005) Feature selection for unsupervised and supervised inference: the emergence of sparsity in a weight-based approach. J Mach Learn Res 6:1855–1887
58. Xiang J, Guo C, Aboulnaga A (2013) Scalable maximum clique computation using MapReduce. In: Proceedings of the 29th IEEE international conference on data engineering (ICDE’13), pp 74–85
59. Zhao Z, Liu H (2007) Spectral feature selection for supervised and unsupervised learning. In: Proceedings of the 24th international conference on machine learning (ICML’07). ACM, New York, NY, pp 1151–1157
Metadata
Title
Greedy column subset selection for large-scale data sets
Authors
Ahmed K. Farahat
Ahmed Elgohary
Ali Ghodsi
Mohamed S. Kamel
Publication date
01-10-2015
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 1/2015
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-014-0801-8
