research-article

Streaming submodular maximization: massive data summarization on the fly

Authors:
Ashwinkumar Badanidiyuru

Cornell University, Ithaca, NY, USA

Cornell University, Ithaca, NY, USA
View Profile

,
Baharan Mirzasoleiman

ETH Zurich, Zurich, Switzerland

ETH Zurich, Zurich, Switzerland
View Profile

,
Amin Karbasi

ETH Zurich, Zurich, Switzerland

ETH Zurich, Zurich, Switzerland
View Profile

,
Andreas Krause

ETH Zurich, Zurich, Switzerland

ETH Zurich, Zurich, Switzerland
View Profile

KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data miningAugust 2014Pages 671–680https://doi.org/10.1145/2623330.2623637

Published:24 August 2014Publication History

KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 671–680

ABSTRACT

How can one summarize a massive data set "on the fly", i.e., without even having seen it in its entirety? In this paper, we address the problem of extracting representative elements from a large stream of data. I.e., we would like to select a subset of say k data points from the stream that are most representative according to some objective function. Many natural notions of "representativeness" satisfy submodularity, an intuitive notion of diminishing returns. Thus, such problems can be reduced to maximizing a submodular set function subject to a cardinality constraint. Classical approaches to submodular maximization require full access to the data set. We develop the first efficient streaming algorithm with constant factor 1/2-ε approximation guarantee to the optimum solution, requiring only a single pass through the data, and memory independent of data size. In our experiments, we extensively evaluate the effectiveness of our approach on several applications, including training large-scale kernel methods and exemplar-based clustering, on millions of data points. We observe that our streaming method, while achieving practically the same utility value, runs about 100 times faster than previous work.

Supplemental Material

p671-sidebyside.mp4

mp4

236.3 MB

Download

References

Census1990, UCI machine learning repository, 2010.Google Scholar
Yahoo! academic relations. r6a, yahoo! front page today module user click log dataset, version 1.0, 2012.Google Scholar
A. Badanidiyuru and J. Vondrak. Fast algorithms for maximizing submodular functions. In SODA, 2014. Google ScholarDigital Library
M. Bateni, M. Hajiaghayi, and M. Zadimoghaddam. Submodular secretary problem and extensions. In APPROX-RANDOM, pages 39--52, 2010. Google ScholarDigital Library
G. E. Blelloch, R. Peng, and K. Tangwongsan. Linear-work greedy parallel approximate set cover and variants. In SPAA, 2011. Google ScholarDigital Library
F. Chierichetti, R. Kumar, and A. Tomkins. Max-cover in map-reduce. In WWW, 2010. Google ScholarDigital Library
A. Dasgupta, R. Kumar, and S. Ravi. Summarization through submodularity and dispersion. In ACL, 2013.Google Scholar
D. Dueck and B. J. Frey. Non-metric affinity propagation for unsupervised image categorization. In ICCV, 2007.Google ScholarCross Ref
K. El-Arini and C. Guestrin. Beyond keyword search: Discovering relevant scientific literature. In KDD, 2011. Google ScholarDigital Library
K. El-Arini, G. Veda, D. Shahaf, and C. Guestrin. Turning down the noise in the blogosphere. In KDD, 2009. Google ScholarDigital Library
U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM, 1998. Google ScholarDigital Library
M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy. Mining data streams: a review. SIGMOD Record, 34(2):18--26, 2005. Google ScholarDigital Library
J. Gillenwater, A. Kulesza, and B. Taskar. Near-optimal map inference for determinantal point processes. In NIPS, pages 2744--2752, 2012.Google Scholar
R. Gomes and A. Krause. Budgeted nonparametric learning from data streams. In ICML, 2010.Google ScholarDigital Library
A. Gupta, A. Roth, G. Schoenebeck, and K. Talwar. Constrained non-monotone submodular maximization: Offline and secretary algorithms. In WINE, 2010. Google ScholarDigital Library
L. Kaufman and P. J. Rousseeuw. Finding groups in data: an introduction to cluster analysis, volume 344. Wiley-Interscience, 2009.Google Scholar
D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of in uence through a social network. In Proceedings of the ninth ACM SIGKDD, 2003. Google ScholarDigital Library
A. Krause and D. Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press, 2013.Google Scholar
A. Krause and C. Guestrin. Near-optimal nonmyopic value of information in graphical models. In UAI, 2005.Google ScholarDigital Library
R. Kumar, B. Moseley, S. Vassilvitskii, and A. Vattani. Fast greedy algorithms in mapreduce and streaming. In SPAA, 2013. Google ScholarDigital Library
S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii. Filtering: a method for solving graph problems in mapreduce. In SPAA, 2011. Google ScholarDigital Library
J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In KDD, 2007. Google ScholarDigital Library
H. Lin and J. Bilmes. A class of submodular functions for document summarization. In NAACL/HLT, 2011. Google ScholarDigital Library
M. Minoux. Accelerated greedy algorithms for maximizing submodular set functions. Optimization Techniques, LNCS, pages 234--243, 1978.Google Scholar
B. Mirzasoleiman, A. Karbasi, R. Sarkar, and A. Krause. Distributed submodular maximization: Identifying representative elements in massive data. In Neural Information Processing Systems (NIPS), 2013.Google Scholar
G. L. Nemhauser and L. A. Wolsey. Best algorithms for approximating the maximum of a submodular set function. Math. Oper. Research, 1978.Google ScholarDigital Library
G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions - I. Mathematical Programming, 1978.Google ScholarDigital Library
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). 2006. Google ScholarDigital Library
C. Reed and Z. Ghahramini. Scaling the indian buffet process via submodular maximization. In ICML, 2013.Google Scholar
M. G. Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. ACM Transactions on Knowledge Discovery from Data, 5(4):21:1--21:37, 2012. Google ScholarDigital Library
B. Scholkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001. Google ScholarDigital Library
M. Seeger. Greedy forward selection in the informative vector machine. Technical report, University of California, Berkeley, 2004.Google Scholar
R. Sipos, A. Swaminathan, P. Shivaswamy, and T. Joachims. Temporal corpus summarization using submodular word coverage. In CIKM, 2012. Google ScholarDigital Library
A. Tsanas, M. A. Little, P. E. McSharry, and L. O. Ramig. Enhanced classical dysphonia measures and sparse regression for telemonitoring of parkinson's disease progression. In ICASSP, 2010.Google ScholarCross Ref
J. Vitter. Random sampling with a reservoir. ACM Trans. Mathematical Software, 11(1):37--57, 1985. Google ScholarDigital Library

Index Terms

Streaming submodular maximization: massive data summarization on the fly
1. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

A Multi-pass Streaming Algorithm for Regularized Submodular Maximization
Combinatorial Optimization and Applications
Abstract
In this paper, we consider a problem of maximizing regularized submodular functions with a k-cardinality constraint under streaming fashion. In the model, the utility function $f (\cdot) = g (\cdot) - ℓ (\cdot)$ is expressed as the difference between a non-negative ... $\frac{\sqrt{}}{\sqrt{}}$
Read More
Multi-Pass Streaming Algorithms for Monotone Submodular Function Maximization
Abstract
We consider maximizing a monotone submodular function under a cardinality constraint or a knapsack constraint in the streaming setting. In particular, the elements arrive sequentially and at any point of time, the algorithm has access to only a ... $^{}^{}$ $^{}^{}$
Read More
Streaming non-monotone submodular maximization: personalized video summarization on the fly
AAAI'18/IAAI'18/EAAI'18: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence

The need for real time analysis of rapidly producing data streams (e.g., video and image streams) motivated the design of streaming algorithms that can efficiently extract and summarize useful information from massive data "on the fly". Such problems can ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2014
2028 pages
ISBN:9781450329569
DOI:10.1145/2623330
General Chairs:
Sofus Macskassy
Facebook
,
Claudia Perlich
Dstillery
,
Program Chairs:
Jure Leskovec
Stanford University
,
Wei Wang
UCLA
,
Rayid Ghani
University of Chicago
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 August 2014
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
streaming algorithms
submodular functions
Qualifiers
- research-article
Conference

Acceptance Rates
KDD '14 Paper Acceptance Rate151of1,036submissions,15%Overall Acceptance Rate1,133of8,635submissions,13%
More
Upcoming Conference
KDD '24

Sponsor:

sigkdd

sigkdd

The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 198
  Total Citations
  View Citations
- 1,700
  Total Downloads
- Downloads (Last 12 months)174
- Downloads (Last 6 weeks)9
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Streaming submodular maximization: massive data summarization on the fly

KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

A Multi-pass Streaming Algorithm for Regularized Submodular Maximization

Multi-Pass Streaming Algorithms for Monotone Submodular Function Maximization

Streaming non-monotone submodular maximization: personalized video summarization on the fly