Abstract
A common approach for dealing with large datasets is to stream over the input in one pass, and perform computations using sublinear resources. For truly massive datasets, however, even making a single pass over the data is prohibitive. Therefore, streaming computations must be distributed over many machines. In practice, obtaining significant speedups using distributed computation has numerous challenges including synchronization, load balancing, overcoming processor failures, and data distribution. Successful systems in practice such as Google's MapReduce and Apache's Hadoop address these problems by only allowing a certain class of highly distributable tasks defined by local computations that can be applied in any order to the input.
The fundamental question that arises is: How does the class of computational tasks supported by these systems differ from the class for which streaming solutions exist?
We introduce a simple algorithmic model for massive, unordered, distributed (mud) computation, as implemented by these systems. We show that in principle, mud algorithms are equivalent in power to symmetric streaming algorithms. More precisely, we show that any symmetric (order-invariant) function that can be computed by a streaming algorithm can also be computed by a mud algorithm, with comparable space and communication complexity. Our simulation uses Savitch's theorem and therefore has superpolynomial time complexity. We extend our simulation result to some natural classes of approximate and randomized streaming algorithms. We also give negative results, using communication complexity arguments to prove that extensions to private randomness, promise problems, and indeterminate functions are impossible. We also introduce an extension of the mud model to multiple keys and multiple rounds.
- Alon, N., Matias, Y., and Szegedy, M. 1996. The space complexity of approximating the frequency moments. In Proceedings of the Symposium on Theory of Computing, 20--29. Google ScholarDigital Library
- Babai, L., Gal, A., Kimmel, P., and Lokam, S. 1996. Simultaneous messages and communication. Tech. rep., University of Chicago.Google Scholar
- Babai, L. and Kimmel, P. G. 1997. Randomized simultaneous messages: Solution of a problem of Yao in communication complexity. In Computational Complexity. 239. Google ScholarDigital Library
- Bialecki, A., Cafarella, M., Cutting, D., and O'Malley, O. 2005. Hadoop: A framework for running applications on large clusters built of commodity hardware. http://lucene.apache.org/hadoop/.Google Scholar
- Broder, A., Charikar, M., Frieze, A., and Mitzenmacher, M. 2000. Min-Wise independent permutations. J. Comput. Syst. Sci. 60, 3, 630--659. Google ScholarDigital Library
- Datar, M. and Muthukrishnan, S. 2002. Estimating rarity and similarity over data stream windows. In Proceedings of the Annual European Symposium on Algorithms (ESA). 323--334. Google ScholarDigital Library
- Dean, J. and Ghemawat, S. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI'04). Google ScholarDigital Library
- Greenlaw, R., Hoover, H. J., and Ruzzo, W. L. 1995. Limits to Parallel Computation: P-Completeness Theory. Oxford University Press. Google ScholarDigital Library
- Henzinger, M., Raghavan, P., and Rajagopalan, S. 1998. Computing on data streams. Tech. note 1998-011, Digital Systems Research Center, Palo Alto, CA.Google Scholar
- Indyk, P. 2006. Stable distributions, pseudorandom generators, embeddings, and data stream computation. J. ACM, 307--323. Google ScholarDigital Library
- McGregor, A. Open problems in data streams research. http://www.cse.iitk.ac.in/users/sganguly/data-stream-probs.pdf.Google Scholar
- Muthukrishnan, S. 2005. Data streams: Algorithms and applications. In Foundations and Trends in Theoretical Computer Science. Google ScholarDigital Library
- Naor, J. and Naor, M. 1993. Small-Bias probability spaces: Efficient constructions and applications. SIAM J. Comput. 22, 4, 838--856. Google ScholarDigital Library
- Nath, S., Gibbons, P. B., Seshan, S., and Anderson, Z. 2008. Synopsis diffusion for robust aggregation in sensor networks. ACM Trans. Sensor. Netw. 4, 2, 1--40. Google ScholarDigital Library
- Newman, I. and Szegedy, M. 1996. Public vs. private coin flips in one round communication games (extended abstract). In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC). 561--570. Google ScholarDigital Library
- Pike, R., Dorward, S., Griesemer, R., and Quinlan, S. 2005. Interpreting the data: Parallel analysis with sawzall. Sci. Program. J. 13, 4, 227--298. Google ScholarDigital Library
- Pudlak, P., Rodl, V., and Sgall, J. 1994. Boolean circuits, tensor ranks and communication complexity. SIAM J. Comput. 26, 3, 605--633. Google ScholarDigital Library
- Savitch, W. 1973. Maze recognizing automata and nondeterministic tape complexity. J. Comput. Syst. Sci. 7, 4, 389--403. Google ScholarDigital Library
Index Terms
- On distributing symmetric streaming computations
Recommendations
On distributing symmetric streaming computations
SODA '08: Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithmsA common approach for dealing with large data sets is to stream over the input in one pass, and perform computations using sublinear resources. For truly massive data sets, however, even making a single pass over the data is prohibitive. Therefore, ...
Distributing frequency-dependent data stream computations
CATS '09: Proceedings of the Fifteenth Australasian Symposium on Computing: The Australasian Theory - Volume 94Data stream computations in domains such as internet applications are often performed in a highly distributed fashion in order to save time. An example is the class of applications that use the Google Mapreduce framework of scalable distributed ...
SWEclat: a frequent itemset mining algorithm over streaming data using Spark Streaming
AbstractFinding frequent itemsets in a continuous streaming data is an important data mining task which is widely used in network monitoring, Internet of Things data analysis and so on. In the era of big data, it is necessary to develop a distributed ...
Comments