skip to main content
10.1145/2623330.2623637acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

Streaming submodular maximization: massive data summarization on the fly

Published:24 August 2014Publication History

ABSTRACT

How can one summarize a massive data set "on the fly", i.e., without even having seen it in its entirety? In this paper, we address the problem of extracting representative elements from a large stream of data. I.e., we would like to select a subset of say k data points from the stream that are most representative according to some objective function. Many natural notions of "representativeness" satisfy submodularity, an intuitive notion of diminishing returns. Thus, such problems can be reduced to maximizing a submodular set function subject to a cardinality constraint. Classical approaches to submodular maximization require full access to the data set. We develop the first efficient streaming algorithm with constant factor 1/2-ε approximation guarantee to the optimum solution, requiring only a single pass through the data, and memory independent of data size. In our experiments, we extensively evaluate the effectiveness of our approach on several applications, including training large-scale kernel methods and exemplar-based clustering, on millions of data points. We observe that our streaming method, while achieving practically the same utility value, runs about 100 times faster than previous work.

Skip Supplemental Material Section

Supplemental Material

p671-sidebyside.mp4

mp4

236.3 MB

References

  1. Census1990, UCI machine learning repository, 2010.Google ScholarGoogle Scholar
  2. Yahoo! academic relations. r6a, yahoo! front page today module user click log dataset, version 1.0, 2012.Google ScholarGoogle Scholar
  3. A. Badanidiyuru and J. Vondrak. Fast algorithms for maximizing submodular functions. In SODA, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. M. Bateni, M. Hajiaghayi, and M. Zadimoghaddam. Submodular secretary problem and extensions. In APPROX-RANDOM, pages 39--52, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. G. E. Blelloch, R. Peng, and K. Tangwongsan. Linear-work greedy parallel approximate set cover and variants. In SPAA, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. F. Chierichetti, R. Kumar, and A. Tomkins. Max-cover in map-reduce. In WWW, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. A. Dasgupta, R. Kumar, and S. Ravi. Summarization through submodularity and dispersion. In ACL, 2013.Google ScholarGoogle Scholar
  8. D. Dueck and B. J. Frey. Non-metric affinity propagation for unsupervised image categorization. In ICCV, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  9. K. El-Arini and C. Guestrin. Beyond keyword search: Discovering relevant scientific literature. In KDD, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. K. El-Arini, G. Veda, D. Shahaf, and C. Guestrin. Turning down the noise in the blogosphere. In KDD, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy. Mining data streams: a review. SIGMOD Record, 34(2):18--26, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. J. Gillenwater, A. Kulesza, and B. Taskar. Near-optimal map inference for determinantal point processes. In NIPS, pages 2744--2752, 2012.Google ScholarGoogle Scholar
  14. R. Gomes and A. Krause. Budgeted nonparametric learning from data streams. In ICML, 2010.Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. A. Gupta, A. Roth, G. Schoenebeck, and K. Talwar. Constrained non-monotone submodular maximization: Offline and secretary algorithms. In WINE, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. L. Kaufman and P. J. Rousseeuw. Finding groups in data: an introduction to cluster analysis, volume 344. Wiley-Interscience, 2009.Google ScholarGoogle Scholar
  17. D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of in uence through a social network. In Proceedings of the ninth ACM SIGKDD, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. A. Krause and D. Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press, 2013.Google ScholarGoogle Scholar
  19. A. Krause and C. Guestrin. Near-optimal nonmyopic value of information in graphical models. In UAI, 2005.Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. R. Kumar, B. Moseley, S. Vassilvitskii, and A. Vattani. Fast greedy algorithms in mapreduce and streaming. In SPAA, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii. Filtering: a method for solving graph problems in mapreduce. In SPAA, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In KDD, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. H. Lin and J. Bilmes. A class of submodular functions for document summarization. In NAACL/HLT, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. M. Minoux. Accelerated greedy algorithms for maximizing submodular set functions. Optimization Techniques, LNCS, pages 234--243, 1978.Google ScholarGoogle Scholar
  25. B. Mirzasoleiman, A. Karbasi, R. Sarkar, and A. Krause. Distributed submodular maximization: Identifying representative elements in massive data. In Neural Information Processing Systems (NIPS), 2013.Google ScholarGoogle Scholar
  26. G. L. Nemhauser and L. A. Wolsey. Best algorithms for approximating the maximum of a submodular set function. Math. Oper. Research, 1978.Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions - I. Mathematical Programming, 1978.Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. C. Reed and Z. Ghahramini. Scaling the indian buffet process via submodular maximization. In ICML, 2013.Google ScholarGoogle Scholar
  30. M. G. Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. ACM Transactions on Knowledge Discovery from Data, 5(4):21:1--21:37, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. B. Scholkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. M. Seeger. Greedy forward selection in the informative vector machine. Technical report, University of California, Berkeley, 2004.Google ScholarGoogle Scholar
  33. R. Sipos, A. Swaminathan, P. Shivaswamy, and T. Joachims. Temporal corpus summarization using submodular word coverage. In CIKM, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. A. Tsanas, M. A. Little, P. E. McSharry, and L. O. Ramig. Enhanced classical dysphonia measures and sparse regression for telemonitoring of parkinson's disease progression. In ICASSP, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  35. J. Vitter. Random sampling with a reservoir. ACM Trans. Mathematical Software, 11(1):37--57, 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Streaming submodular maximization: massive data summarization on the fly

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining
      August 2014
      2028 pages
      ISBN:9781450329569
      DOI:10.1145/2623330

      Copyright © 2014 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 24 August 2014

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      KDD '14 Paper Acceptance Rate151of1,036submissions,15%Overall Acceptance Rate1,133of8,635submissions,13%

      Upcoming Conference

      KDD '24

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader