skip to main content
research-article

On distributing symmetric streaming computations

Published:03 September 2010Publication History
Skip Abstract Section

Abstract

A common approach for dealing with large datasets is to stream over the input in one pass, and perform computations using sublinear resources. For truly massive datasets, however, even making a single pass over the data is prohibitive. Therefore, streaming computations must be distributed over many machines. In practice, obtaining significant speedups using distributed computation has numerous challenges including synchronization, load balancing, overcoming processor failures, and data distribution. Successful systems in practice such as Google's MapReduce and Apache's Hadoop address these problems by only allowing a certain class of highly distributable tasks defined by local computations that can be applied in any order to the input.

The fundamental question that arises is: How does the class of computational tasks supported by these systems differ from the class for which streaming solutions exist?

We introduce a simple algorithmic model for massive, unordered, distributed (mud) computation, as implemented by these systems. We show that in principle, mud algorithms are equivalent in power to symmetric streaming algorithms. More precisely, we show that any symmetric (order-invariant) function that can be computed by a streaming algorithm can also be computed by a mud algorithm, with comparable space and communication complexity. Our simulation uses Savitch's theorem and therefore has superpolynomial time complexity. We extend our simulation result to some natural classes of approximate and randomized streaming algorithms. We also give negative results, using communication complexity arguments to prove that extensions to private randomness, promise problems, and indeterminate functions are impossible. We also introduce an extension of the mud model to multiple keys and multiple rounds.

References

  1. Alon, N., Matias, Y., and Szegedy, M. 1996. The space complexity of approximating the frequency moments. In Proceedings of the Symposium on Theory of Computing, 20--29. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Babai, L., Gal, A., Kimmel, P., and Lokam, S. 1996. Simultaneous messages and communication. Tech. rep., University of Chicago.Google ScholarGoogle Scholar
  3. Babai, L. and Kimmel, P. G. 1997. Randomized simultaneous messages: Solution of a problem of Yao in communication complexity. In Computational Complexity. 239. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Bialecki, A., Cafarella, M., Cutting, D., and O'Malley, O. 2005. Hadoop: A framework for running applications on large clusters built of commodity hardware. http://lucene.apache.org/hadoop/.Google ScholarGoogle Scholar
  5. Broder, A., Charikar, M., Frieze, A., and Mitzenmacher, M. 2000. Min-Wise independent permutations. J. Comput. Syst. Sci. 60, 3, 630--659. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Datar, M. and Muthukrishnan, S. 2002. Estimating rarity and similarity over data stream windows. In Proceedings of the Annual European Symposium on Algorithms (ESA). 323--334. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Dean, J. and Ghemawat, S. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI'04). Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Greenlaw, R., Hoover, H. J., and Ruzzo, W. L. 1995. Limits to Parallel Computation: P-Completeness Theory. Oxford University Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Henzinger, M., Raghavan, P., and Rajagopalan, S. 1998. Computing on data streams. Tech. note 1998-011, Digital Systems Research Center, Palo Alto, CA.Google ScholarGoogle Scholar
  10. Indyk, P. 2006. Stable distributions, pseudorandom generators, embeddings, and data stream computation. J. ACM, 307--323. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. McGregor, A. Open problems in data streams research. http://www.cse.iitk.ac.in/users/sganguly/data-stream-probs.pdf.Google ScholarGoogle Scholar
  12. Muthukrishnan, S. 2005. Data streams: Algorithms and applications. In Foundations and Trends in Theoretical Computer Science. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Naor, J. and Naor, M. 1993. Small-Bias probability spaces: Efficient constructions and applications. SIAM J. Comput. 22, 4, 838--856. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Nath, S., Gibbons, P. B., Seshan, S., and Anderson, Z. 2008. Synopsis diffusion for robust aggregation in sensor networks. ACM Trans. Sensor. Netw. 4, 2, 1--40. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Newman, I. and Szegedy, M. 1996. Public vs. private coin flips in one round communication games (extended abstract). In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC). 561--570. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Pike, R., Dorward, S., Griesemer, R., and Quinlan, S. 2005. Interpreting the data: Parallel analysis with sawzall. Sci. Program. J. 13, 4, 227--298. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Pudlak, P., Rodl, V., and Sgall, J. 1994. Boolean circuits, tensor ranks and communication complexity. SIAM J. Comput. 26, 3, 605--633. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Savitch, W. 1973. Maze recognizing automata and nondeterministic tape complexity. J. Comput. Syst. Sci. 7, 4, 389--403. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. On distributing symmetric streaming computations

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Transactions on Algorithms
      ACM Transactions on Algorithms  Volume 6, Issue 4
      August 2010
      308 pages
      ISSN:1549-6325
      EISSN:1549-6333
      DOI:10.1145/1824777
      Issue’s Table of Contents

      Copyright © 2010 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 3 September 2010
      • Accepted: 1 September 2009
      • Received: 1 August 2009
      Published in talg Volume 6, Issue 4

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article
      • Research
      • Refereed

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader