research-article

On distributing symmetric streaming computations

Authors:
Jon Feldman

Google Inc., New York, NY

Google Inc., New York, NY
View Profile

,
S. Muthukrishnan

Google Inc., New York, NY

Google Inc., New York, NY
View Profile

,
Anastasios Sidiropoulos

Toyota Technological Institute at Chicago, Chicago, IL

Toyota Technological Institute at Chicago, Chicago, IL
View Profile

,
Cliff Stein

Columbia University, New York, NY

Columbia University, New York, NY
View Profile

,
Zoya Svitkina

University of Alberta, Alberta, Canada

University of Alberta, Alberta, Canada
View Profile

Authors Info & Claims

ACM Transactions on Algorithms Volume 6 Issue 4Article No.: 66pp 1–19https://doi.org/10.1145/1824777.1824786

Published:03 September 2010Publication History

ACM Transactions on Algorithms

Abstract

A common approach for dealing with large datasets is to stream over the input in one pass, and perform computations using sublinear resources. For truly massive datasets, however, even making a single pass over the data is prohibitive. Therefore, streaming computations must be distributed over many machines. In practice, obtaining significant speedups using distributed computation has numerous challenges including synchronization, load balancing, overcoming processor failures, and data distribution. Successful systems in practice such as Google's MapReduce and Apache's Hadoop address these problems by only allowing a certain class of highly distributable tasks defined by local computations that can be applied in any order to the input.

The fundamental question that arises is: How does the class of computational tasks supported by these systems differ from the class for which streaming solutions exist?

We introduce a simple algorithmic model for massive, unordered, distributed (mud) computation, as implemented by these systems. We show that in principle, mud algorithms are equivalent in power to symmetric streaming algorithms. More precisely, we show that any symmetric (order-invariant) function that can be computed by a streaming algorithm can also be computed by a mud algorithm, with comparable space and communication complexity. Our simulation uses Savitch's theorem and therefore has superpolynomial time complexity. We extend our simulation result to some natural classes of approximate and randomized streaming algorithms. We also give negative results, using communication complexity arguments to prove that extensions to private randomness, promise problems, and indeterminate functions are impossible. We also introduce an extension of the mud model to multiple keys and multiple rounds.

References

Alon, N., Matias, Y., and Szegedy, M. 1996. The space complexity of approximating the frequency moments. In Proceedings of the Symposium on Theory of Computing, 20--29. Google ScholarDigital Library
Babai, L., Gal, A., Kimmel, P., and Lokam, S. 1996. Simultaneous messages and communication. Tech. rep., University of Chicago.Google Scholar
Babai, L. and Kimmel, P. G. 1997. Randomized simultaneous messages: Solution of a problem of Yao in communication complexity. In Computational Complexity. 239. Google ScholarDigital Library
Bialecki, A., Cafarella, M., Cutting, D., and O'Malley, O. 2005. Hadoop: A framework for running applications on large clusters built of commodity hardware. http://lucene.apache.org/hadoop/.Google Scholar
Broder, A., Charikar, M., Frieze, A., and Mitzenmacher, M. 2000. Min-Wise independent permutations. J. Comput. Syst. Sci. 60, 3, 630--659. Google ScholarDigital Library
Datar, M. and Muthukrishnan, S. 2002. Estimating rarity and similarity over data stream windows. In Proceedings of the Annual European Symposium on Algorithms (ESA). 323--334. Google ScholarDigital Library
Dean, J. and Ghemawat, S. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI'04). Google ScholarDigital Library
Greenlaw, R., Hoover, H. J., and Ruzzo, W. L. 1995. Limits to Parallel Computation: P-Completeness Theory. Oxford University Press. Google ScholarDigital Library
Henzinger, M., Raghavan, P., and Rajagopalan, S. 1998. Computing on data streams. Tech. note 1998-011, Digital Systems Research Center, Palo Alto, CA.Google Scholar
Indyk, P. 2006. Stable distributions, pseudorandom generators, embeddings, and data stream computation. J. ACM, 307--323. Google ScholarDigital Library
McGregor, A. Open problems in data streams research. http://www.cse.iitk.ac.in/users/sganguly/data-stream-probs.pdf.Google Scholar
Muthukrishnan, S. 2005. Data streams: Algorithms and applications. In Foundations and Trends in Theoretical Computer Science. Google ScholarDigital Library
Naor, J. and Naor, M. 1993. Small-Bias probability spaces: Efficient constructions and applications. SIAM J. Comput. 22, 4, 838--856. Google ScholarDigital Library
Nath, S., Gibbons, P. B., Seshan, S., and Anderson, Z. 2008. Synopsis diffusion for robust aggregation in sensor networks. ACM Trans. Sensor. Netw. 4, 2, 1--40. Google ScholarDigital Library
Newman, I. and Szegedy, M. 1996. Public vs. private coin flips in one round communication games (extended abstract). In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC). 561--570. Google ScholarDigital Library
Pike, R., Dorward, S., Griesemer, R., and Quinlan, S. 2005. Interpreting the data: Parallel analysis with sawzall. Sci. Program. J. 13, 4, 227--298. Google ScholarDigital Library
Pudlak, P., Rodl, V., and Sgall, J. 1994. Boolean circuits, tensor ranks and communication complexity. SIAM J. Comput. 26, 3, 605--633. Google ScholarDigital Library
Savitch, W. 1973. Maze recognizing automata and nondeterministic tape complexity. J. Comput. Syst. Sci. 7, 4, 389--403. Google ScholarDigital Library

Index Terms

On distributing symmetric streaming computations
1. Theory of computation
  1. Design and analysis of algorithms

Recommendations

On distributing symmetric streaming computations
SODA '08: Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms

A common approach for dealing with large data sets is to stream over the input in one pass, and perform computations using sublinear resources. For truly massive data sets, however, even making a single pass over the data is prohibitive. Therefore, ...
Read More
Distributing frequency-dependent data stream computations
CATS '09: Proceedings of the Fifteenth Australasian Symposium on Computing: The Australasian Theory - Volume 94

Data stream computations in domains such as internet applications are often performed in a highly distributed fashion in order to save time. An example is the class of applications that use the Google Mapreduce framework of scalable distributed ...
Read More
SWEclat: a frequent itemset mining algorithm over streaming data using Spark Streaming
Abstract
Finding frequent itemsets in a continuous streaming data is an important data mining task which is widely used in network monitoring, Internet of Things data analysis and so on. In the era of big data, it is necessary to develop a distributed ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Algorithms Volume 6, Issue 4
August 2010
308 pages
ISSN:1549-6325
EISSN:1549-6333
DOI:10.1145/1824777
Issue’s Table of Contents

Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 3 September 2010
- Accepted: 1 September 2009
- Received: 1 August 2009
Published in talg Volume 6, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Streaming
distributed
distributed computations
mapreduce
symmetric
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 32
  Total Citations
  View Citations
- 618
  Total Downloads
- Downloads (Last 12 months)15
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

On distributing symmetric streaming computations

ACM Transactions on Algorithms

Abstract

References

Cited By

Index Terms

Recommendations

On distributing symmetric streaming computations

Distributing frequency-dependent data stream computations

SWEclat: a frequent itemset mining algorithm over streaming data using Spark Streaming

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

On distributing symmetric streaming computations

ACM Transactions on Algorithms

Abstract

References

Cited By

Index Terms

Recommendations

On distributing symmetric streaming computations

Distributing frequency-dependent data stream computations

SWEclat: a frequent itemset mining algorithm over streaming data using Spark Streaming

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media