ABSTRACT
How can one summarize a massive data set "on the fly", i.e., without even having seen it in its entirety? In this paper, we address the problem of extracting representative elements from a large stream of data. I.e., we would like to select a subset of say k data points from the stream that are most representative according to some objective function. Many natural notions of "representativeness" satisfy submodularity, an intuitive notion of diminishing returns. Thus, such problems can be reduced to maximizing a submodular set function subject to a cardinality constraint. Classical approaches to submodular maximization require full access to the data set. We develop the first efficient streaming algorithm with constant factor 1/2-ε approximation guarantee to the optimum solution, requiring only a single pass through the data, and memory independent of data size. In our experiments, we extensively evaluate the effectiveness of our approach on several applications, including training large-scale kernel methods and exemplar-based clustering, on millions of data points. We observe that our streaming method, while achieving practically the same utility value, runs about 100 times faster than previous work.
Supplemental Material
- Census1990, UCI machine learning repository, 2010.Google Scholar
- Yahoo! academic relations. r6a, yahoo! front page today module user click log dataset, version 1.0, 2012.Google Scholar
- A. Badanidiyuru and J. Vondrak. Fast algorithms for maximizing submodular functions. In SODA, 2014. Google ScholarDigital Library
- M. Bateni, M. Hajiaghayi, and M. Zadimoghaddam. Submodular secretary problem and extensions. In APPROX-RANDOM, pages 39--52, 2010. Google ScholarDigital Library
- G. E. Blelloch, R. Peng, and K. Tangwongsan. Linear-work greedy parallel approximate set cover and variants. In SPAA, 2011. Google ScholarDigital Library
- F. Chierichetti, R. Kumar, and A. Tomkins. Max-cover in map-reduce. In WWW, 2010. Google ScholarDigital Library
- A. Dasgupta, R. Kumar, and S. Ravi. Summarization through submodularity and dispersion. In ACL, 2013.Google Scholar
- D. Dueck and B. J. Frey. Non-metric affinity propagation for unsupervised image categorization. In ICCV, 2007.Google ScholarCross Ref
- K. El-Arini and C. Guestrin. Beyond keyword search: Discovering relevant scientific literature. In KDD, 2011. Google ScholarDigital Library
- K. El-Arini, G. Veda, D. Shahaf, and C. Guestrin. Turning down the noise in the blogosphere. In KDD, 2009. Google ScholarDigital Library
- U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM, 1998. Google ScholarDigital Library
- M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy. Mining data streams: a review. SIGMOD Record, 34(2):18--26, 2005. Google ScholarDigital Library
- J. Gillenwater, A. Kulesza, and B. Taskar. Near-optimal map inference for determinantal point processes. In NIPS, pages 2744--2752, 2012.Google Scholar
- R. Gomes and A. Krause. Budgeted nonparametric learning from data streams. In ICML, 2010.Google ScholarDigital Library
- A. Gupta, A. Roth, G. Schoenebeck, and K. Talwar. Constrained non-monotone submodular maximization: Offline and secretary algorithms. In WINE, 2010. Google ScholarDigital Library
- L. Kaufman and P. J. Rousseeuw. Finding groups in data: an introduction to cluster analysis, volume 344. Wiley-Interscience, 2009.Google Scholar
- D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of in uence through a social network. In Proceedings of the ninth ACM SIGKDD, 2003. Google ScholarDigital Library
- A. Krause and D. Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press, 2013.Google Scholar
- A. Krause and C. Guestrin. Near-optimal nonmyopic value of information in graphical models. In UAI, 2005.Google ScholarDigital Library
- R. Kumar, B. Moseley, S. Vassilvitskii, and A. Vattani. Fast greedy algorithms in mapreduce and streaming. In SPAA, 2013. Google ScholarDigital Library
- S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii. Filtering: a method for solving graph problems in mapreduce. In SPAA, 2011. Google ScholarDigital Library
- J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In KDD, 2007. Google ScholarDigital Library
- H. Lin and J. Bilmes. A class of submodular functions for document summarization. In NAACL/HLT, 2011. Google ScholarDigital Library
- M. Minoux. Accelerated greedy algorithms for maximizing submodular set functions. Optimization Techniques, LNCS, pages 234--243, 1978.Google Scholar
- B. Mirzasoleiman, A. Karbasi, R. Sarkar, and A. Krause. Distributed submodular maximization: Identifying representative elements in massive data. In Neural Information Processing Systems (NIPS), 2013.Google Scholar
- G. L. Nemhauser and L. A. Wolsey. Best algorithms for approximating the maximum of a submodular set function. Math. Oper. Research, 1978.Google ScholarDigital Library
- G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions - I. Mathematical Programming, 1978.Google ScholarDigital Library
- C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). 2006. Google ScholarDigital Library
- C. Reed and Z. Ghahramini. Scaling the indian buffet process via submodular maximization. In ICML, 2013.Google Scholar
- M. G. Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. ACM Transactions on Knowledge Discovery from Data, 5(4):21:1--21:37, 2012. Google ScholarDigital Library
- B. Scholkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001. Google ScholarDigital Library
- M. Seeger. Greedy forward selection in the informative vector machine. Technical report, University of California, Berkeley, 2004.Google Scholar
- R. Sipos, A. Swaminathan, P. Shivaswamy, and T. Joachims. Temporal corpus summarization using submodular word coverage. In CIKM, 2012. Google ScholarDigital Library
- A. Tsanas, M. A. Little, P. E. McSharry, and L. O. Ramig. Enhanced classical dysphonia measures and sparse regression for telemonitoring of parkinson's disease progression. In ICASSP, 2010.Google ScholarCross Ref
- J. Vitter. Random sampling with a reservoir. ACM Trans. Mathematical Software, 11(1):37--57, 1985. Google ScholarDigital Library
Index Terms
- Streaming submodular maximization: massive data summarization on the fly
Recommendations
A Multi-pass Streaming Algorithm for Regularized Submodular Maximization
Combinatorial Optimization and ApplicationsAbstractIn this paper, we consider a problem of maximizing regularized submodular functions with a k-cardinality constraint under streaming fashion. In the model, the utility function is expressed as the difference between a non-negative ...
Multi-Pass Streaming Algorithms for Monotone Submodular Function Maximization
AbstractWe consider maximizing a monotone submodular function under a cardinality constraint or a knapsack constraint in the streaming setting. In particular, the elements arrive sequentially and at any point of time, the algorithm has access to only a ...
Streaming non-monotone submodular maximization: personalized video summarization on the fly
AAAI'18/IAAI'18/EAAI'18: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial IntelligenceThe need for real time analysis of rapidly producing data streams (e.g., video and image streams) motivated the design of streaming algorithms that can efficiently extract and summarize useful information from massive data "on the fly". Such problems can ...
Comments