research-article

Fast Greedy Algorithms in MapReduce and Streaming

Authors:
Ravi Kumar

Google, CA

Google, CA
View Profile

,
Benjamin Moseley

Washington University in St. Louis, MO

Washington University in St. Louis, MO
View Profile

,
Sergei Vassilvitskii

Google, CA

Google, CA
View Profile

,
Andrea Vattani

University of California at San Diego

University of California at San Diego
View Profile

Authors Info & Claims

ACM Transactions on Parallel Computing Volume 2 Issue 3Article No.: 14pp 1–22https://doi.org/10.1145/2809814

Published:08 September 2015Publication History

ACM Transactions on Parallel Computing

Abstract

Greedy algorithms are practitioners’ best friends—they are intuitive, are simple to implement, and often lead to very good solutions. However, implementing greedy algorithms in a distributed setting is challenging since the greedy choice is inherently sequential, and it is not clear how to take advantage of the extra processing power.

Our main result is a powerful sampling technique that aids in parallelization of sequential algorithms. Armed with this primitive, we then adapt a broad class of greedy algorithms to the MapReduce paradigm; this class includes maximum cover and submodular maximization subject to p-system constraint problems. Our method yields efficient algorithms that run in a logarithmic number of rounds while obtaining solutions that are arbitrarily close to those produced by the standard sequential greedy algorithm. We begin with algorithms for modular maximization subject to a matroid constraint and then extend this approach to obtain approximation algorithms for submodular maximization subject to knapsack or p-system constraints.

References

R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong. 2009. Diversifying search results. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining (WSDM’09). ACM, New York, NY, 5--14. DOI:http://dx.doi.org/10.1145/1498759.1498766 Google ScholarDigital Library
A. Arasu, S. Chaudhuri, and R. Kaushik. 2009. Learning string transformations from examples. PVLDB 2, 1 (2009), 514--525. Google ScholarDigital Library
M. Babaioff, N. Immorlica, and R. Kleinberg. 2007. Matroids, secretary problems, and online mechanisms. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’07), New Orleans, LA, January 7-9, 2007. 434--443. Google ScholarDigital Library
B. Bahmani, R. Kumar, and S. Vassilvitskii. 2012a. Densest subgraph in streaming and Map-Reduce. PVLDB 5, 1 (2012), 454--465. Google ScholarDigital Library
B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii. 2012b. Scalable K-Means++. PVLDB 5, 7 (2012), 622--633. Google ScholarDigital Library
Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar. 2004. An information statistics approach to data stream and communication complexity. J. Comput. Syst. Sci. 68, 4 (2004), 702--732. Google ScholarDigital Library
M. H. Bateni, M. T. Hajiaghayi, and M. Zadimoghaddam. 2013. Submodular secretary problem and extensions. ACM Trans. Algorithms 9, 4 (2013), 32. Google ScholarDigital Library
B. Berger, J. Rompel, and P. W. Shor. 1994. Efficient NC algorithms for set cover with applications to learning and geometry. J. Comput. Syst. Sci. 49, 3 (1994), 454--477. Google ScholarDigital Library
G. E. Blelloch, R. Peng, and K. Tangwongsan. 2011. Linear-work greedy parallel approximate set cover and variants. In Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’11), San Jose, CA, June 4-6, 2011 (Co-located with FCRC 2011). 23--32. Google ScholarDigital Library
G. Călinescu, C. Chekuri, M. Pál, and J. Vondrák. 2011. Maximizing a monotone submodular function subject to a matroid constraint. SIAM J. Comput. 40, 6 (2011), 1740--1766. Google ScholarDigital Library
G. Capannini, F. M. Nardini, R. Perego, and F. Silvestri. 2011. Efficient diversification of web search results. PVLDB 4, 7 (2011), 451--459. Google ScholarDigital Library
A. Chakrabarti, S. Khot, and X. Sun. 2003. Near-optimal lower bounds on the multi-party communication complexity of set disjointness. In 18th Annual IEEE Conference on Computational Complexity (Complexity 2003), Aarhus, Denmark, July 7-10, 2003. 107--117.Google Scholar
M. Charikar. 2000. Greedy approximation algorithms for finding dense components in a graph. In Proceedings of the 3rd International Workshop on Approximation Algorithms for Combinatorial Optimization (APPROX’00), Saarbrücken, Germany, September 5-8, 2000. 84--95. Google ScholarDigital Library
W. Chen, Y. Wang, and S. Yang. 2009. Efficient influence maximization in social networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, June 28 - July 1, 2009. 199--208. Google ScholarDigital Library
F. Chierichetti, R. Kumar, and A. Tomkins. 2010. Max-cover in map-reduce. In Proceedings of the 19th International Conference on World Wide Web (WWW’10), Raleigh, NC, April 26-30, 2010. 231--240. Google ScholarDigital Library
G. Cormode, H. J. Karloff, and A. Wirth. 2010. Set cover algorithms for very large datasets. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM’10), Toronto, Ontario, Canada, October 26-30, 2010. 479--488. Google ScholarDigital Library
B. C. Dean, M. X. Goemans, and J. Vondrák. 2005. Adaptivity and approximation for stochastic packing problems. In Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’05), Vancouver, British Columbia, Canada, January 23-25, 2005. 395--404. Google ScholarDigital Library
J. Dean and S. Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In 6th Symposium on Operating System Design and Implementation (OSDI’04), San Francisco, CA, December 6-8, 2004. 137--150. Google ScholarDigital Library
E. W. Dijkstra. 1959. A note on two problems in connexion with graphs. Numer. Math. 1, 1 (1959), 269--271. Google ScholarDigital Library
J. Edmonds. 1971. Matroids and the greedy algorithm. Math. Program. 1 (1971), 126--136.Google ScholarDigital Library
A. Ene, S. Im, and B. Moseley. 2011. Fast clustering using MapReduce. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, August 21-24, 2011. 681--689. Google ScholarDigital Library
U. Feige. 1998. A threshold of Ln N for approximating set cover. J. ACM 45, 4 (July 1998), 634--652. DOI:http://dx.doi.org/10.1145/285055.285059 Google ScholarDigital Library
M. L. Fisher, G. L. Nemhauser, and L. A. Wolsey. 1978. An analysis of approximation for maximizing submodular set functions II. Math. Prog. Study 8 (1978), 73--87.Google ScholarCross Ref
L. R. Ford and D. R. Fulkerson. 1956. Maximal flow through a network. Can. J. Math. 8 (1956), 399--404.Google ScholarCross Ref
D. Gale and L. S. Shapley. 1962. College admissions and the stability of marriage. Am. Math. Monthly 69, 1 (1962), 9--15. http://www.jstor.org/stable/2312726.Google ScholarCross Ref
A. Goel, S. Guha, and K. Munagala. 2006. Asking the right questions: Model-driven optimization using probes. In Proceedings of the 25th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Chicago, IL, June 26-28, 2006. 203--212. Google ScholarDigital Library
M. T. Goodrich, N. Sitchinava, and Q. Zhang. 2011. Sorting, searching, and simulation in the MapReduce framework. In Proceedings of the 22nd International Symposium on Algorithms and Computation (ISAAC’11), Yokohama, Japan, December 5-8, 2011. 374--383. Google ScholarDigital Library
A. Goyal, F. Bonchi, and L. V. S. Lakshmana. 2012. A data-based approach to social influence maximization. PVLDB 5, 1 (2012). Google ScholarDigital Library
A. Gupta, A. Roth, G. Schoenebeck, and K. Talwar. 2010. Constrained non-monotone submodular maximization: Offline and secretary algorithms. In Proceedings of the 6th International Workshop on Internet and Network Economics (WINE’10), Stanford, CA, December 13-17, 2010. 246--257. Google ScholarDigital Library
E. Hazan, S. Safra, and O. Schwartz. 2006. On the complexity of approximating k-set packing. Comput. Complexity 15, 1 (2006), 20--39. Google ScholarDigital Library
D. A. Huffman. 1952. A method for the construction of minimum-redundancy codes. Proceedings of the Institute of Radio Engineers 40, 9 (1952), 1098--1101.Google Scholar
T. A. Jenkyns. 1976. The efficacy of the “greedy” algorithm. In Proceedings of 7th South Eastern Conference on Combinatorics, Graph Theory and Computing. 341--350.Google Scholar
H. J. Karloff, S. Suri, and S. Vassilvitskii. 2010. A model of computation for MapReduce. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’10), Austin, Texas, January 17-19, 2010. 938--948. Google ScholarDigital Library
D. Kempe, J. M. Kleinberg, and É. Tardos. 2003. Maximizing the spread of influence through a social network. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, August 24-27, 2003. 137--146. Google ScholarDigital Library
B. Korte and D. Hausmann. 1978. An analysis of the greedy heuristic for independence systems. Ann. Discrete Math. 2 (1978), 65--74.Google ScholarCross Ref
R. Kumar, B. Moseley, S. Vassilvitskii, and A. Vattani. 2013. Fast greedy algorithms in mapreduce and streaming. In 25th ACM Symposium on Parallelism in Algorithms and Architectures, (SPAA’13), Montreal, QC, Canada, July 23-25, 2013. 1--10. Google ScholarDigital Library
S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii. 2011. Filtering: A method for solving graph problems in MapReduce. In Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’11), San Jose, CA, June 4-6, 2011 (Co-located with FCRC’11). 85--94. Google ScholarDigital Library
J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. M. VanBriesen, and N. S. Glance. 2007. Cost-effective outbreak detection in networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, August 12-15, 2007. 420--429. Google ScholarDigital Library
B. Mirzasoleiman, A. Karbasi, R. Sarkar, and A. Krause. 2013. Distributed submodular maximization: Identifying representative elements in massive data. In Neural Information Processing Systems (NIPS).Google Scholar
S. Muthukrishnan. 2005. Data streams: Algorithms and applications. Foundations Trends Theor. Comput. Sci. 1, 2 (2005). Google ScholarDigital Library
G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. 1978. An analysis of approximation for maximizing submodular set functions I. Math. Prog. 14 (1978), 265--294.Google ScholarDigital Library
A. Pietracaprina, G. Pucci, M. Riondato, F. Silvestri, and E. Upfal. 2012. Space-round tradeoffs for MapReduce computations. In International Conference on Supercomputing (ICS’12), Venice, Italy, June 25-29, 2012. 235--244. Google ScholarDigital Library
L. Rokach and O. Maimon. 2005. Top-down induction of decision trees classifiers - a survey. IEEE Tran. Syst. Man Cybernetics Part C 35, 4 (2005), 476--487. Google ScholarDigital Library
B. Saha and L. Getoor. 2009. On maximum coverage in the streaming model & application to multi-topic blog-watch. In Proceedings of the SIAM International Conference on Data Mining (SDM’’09), Sparks, Nevada, April 30-May 2, 2009. 697--708.Google Scholar
A. D. Sarma, A. Lall, D. Nanongkai, R. J. Lipton, and J. (Jim) Xu. 2011. Representative skylines using threshold-based preference distributions. In Proceedings of the 27th International Conference on Data Engineering (ICDE’11), Hannover, Germany, April 11-16, 2011. 387--398. Google ScholarDigital Library
A. Schrijver. 2002. Combinatorial Optimization. Vol. B. Springer.Google Scholar
L. G. Valiant. 1990. A bridging model for parallel computation. Commun. ACM 33, 8 (1990), 103--111. Google ScholarDigital Library
J. Vondrak. 2007. Submodularity in Combinatorial Optimization. Ph.D. Dissertation. Charles University, Prague.Google Scholar
C. Wang, W. Chen, and Y. Wang. 2012. Scalable influence maximization for independent cascade model in large-scale social networks. Data Min. Knowl. Discov. 25, 3 (2012), 545--576.Google ScholarCross Ref

Index Terms

Fast Greedy Algorithms in MapReduce and Streaming
1. Theory of computation
  1. Design and analysis of algorithms

Recommendations

Fast greedy algorithms in mapreduce and streaming
SPAA '13: Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures

Greedy algorithms are practitioners' best friends - they are intuitive, simple to implement, and often lead to very good solutions. However, implementing greedy algorithms in a distributed setting is challenging since the greedy choice is inherently ...
Read More
Randomized greedy algorithms for covering problems
GECCO '18: Proceedings of the Genetic and Evolutionary Computation Conference

Greedy algorithms provide a fast and often also effective solution to many combinatorial optimization problems. However, it is well known that they sometimes lead to low quality solutions on certain instances. In this paper, we explore the use of ...
Read More
Greedy by Chance - Stochastic Greedy Algorithms
ICAS '10: Proceedings of the 2010 Sixth International Conference on Autonomic and Autonomous Systems

For many complex combinatorial optimization problems, obtaining good solutions quickly is of value either by itself or as part of an exact algorithm. Greedy algorithms to obtain such solutions are known for many problems. In this paper we present ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on Parallel Computing Volume 2, Issue 3
Special Issue for SPAA 2013
October 2015
196 pages
ISSN:2329-4949
EISSN:2329-4957
DOI:10.1145/2821462
Editor:
Phillip B. Gibbons
Intel Labs, Pittsburgh, USA
Issue’s Table of Contents
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 8 September 2015
- Accepted: 1 May 2015
- Revised: 1 April 2015
- Received: 1 November 2013
Published in topc Volume 2, Issue 3

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Distributed computing
MapReduce
algorithm analysis
approximation algorithms
greedy algorithms
submodular function
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 70
  Total Citations
  View Citations
- 977
  Total Downloads
- Downloads (Last 12 months)59
- Downloads (Last 6 weeks)6
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Fast Greedy Algorithms in MapReduce and Streaming

ACM Transactions on Parallel Computing

Abstract

References

Cited By

Index Terms

Recommendations

Fast greedy algorithms in mapreduce and streaming

Randomized greedy algorithms for covering problems

Greedy by Chance - Stochastic Greedy Algorithms