skip to main content
research-article

ApproxHadoop: Bringing Approximations to MapReduce Frameworks

Published:14 March 2015Publication History
Skip Abstract Section

Abstract

We propose and evaluate a framework for creating and running approximation-enabled MapReduce programs. Specifically, we propose approximation mechanisms that fit naturally into the MapReduce paradigm, including input data sampling, task dropping, and accepting and running a precise and a user-defined approximate version of the MapReduce code. We then show how to leverage statistical theories to compute error bounds for popular classes of MapReduce programs when approximating with input data sampling and/or task dropping. We implement the proposed mechanisms and error bound estimations in a prototype system called ApproxHadoop. Our evaluation uses MapReduce applications from different domains, including data analytics, scientific computing, video encoding, and machine learning. Our results show that ApproxHadoop can significantly reduce application execution time and/or energy consumption when the user is willing to tolerate small errors. For example, ApproxHadoop can reduce runtimes by up to 32x when the user can tolerate an error of 1% with 95% confidence. We conclude that our framework and system can make approximation easily accessible to many application domains using the MapReduce model.

References

  1. Apache Hadoop. http://hadoop.apache.org.Google ScholarGoogle Scholar
  2. Apache Mahout. http://mahout.apache.org.Google ScholarGoogle Scholar
  3. Apache Nutch. http://nutch.apache.org.Google ScholarGoogle Scholar
  4. S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In Proceedings of the European Conference on Computer Systems (EuroSys), 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. G. Ananthanarayanan, M. Hung, X. Ren, I. Stoica, A. Wierman, and M. Yu. GRASS: Trimming Stragglers in Approximation Analytics. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. W. Baek and T. M. Chilimbi. Green: A Framework for Supporting Energy-Conscious Programming using Controlled Approximation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. S. Bhat, J. Borgstrom, A. D. Gordon, and C. Russo. Deriving Probability Density Functions from Probabilistic Functional Programs. In Proceedings of the International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A Comparison of Join Algorithms for Log Processing in MapReduce. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. J. Bornholt, T. Mytkowicz, and K. S. McKinley. Uncertain : A First-Order Type for Uncertain Data. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. S. Chaudhuri, G. Das, and V. Narasayya. Optimized Stratified Sampling for Approximate Query Processing. ACM Transactions on Database Systems (TODS), 32(2), 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Coles. An Introduction to Statistical Modeling of Extreme Values. Springer, 2001.Google ScholarGoogle ScholarCross RefCross Ref
  12. T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. MapReduce Online. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the Symposium on Operating Systems Design and Implementation (OSDI), 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. A. Doucet, S. Godsill, and C. Andrieu. On Sequential Monte Carlo Sampling Methods for Bayesian Filtering. Statistics and Computing, 10(3), 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. J. Ekanayake, S. Pallickara, and G. Fox. MapReduce for Data Intensive Scientific Analyses. In Proceedings of the IEEE International Conference on e-Science (e-Science), 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Z. Fadika, E. Dede, M. Govindaraju, and L. Ramakrishnan. Adapting MapReduce for HPC environments. In Proceedings of the International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. M. N. Garofalakis and P. B. Gibbons. Approximate Query Processing: Taming the TeraBytes. In Proceedings of the International Conference on Very Large Databases (VLDB), 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. I. Goiri, K. Le, J. Guitart, J. Torres, and R. Bianchini. Intelligent Placement of Datacenters for Internet Services. In Proceedings of the International Conference on Distributed Computing Systems (ICDCS), 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. I. Goiri, R. Bianchini, S. Nagarakatte, and T. D. Nguyen. ApproxHadoop: Bringing Approximations to MapReduce Frameworks. Technical Report DCS-TR-709, Department of Computer Science, Rutgers University, 2014.Google ScholarGoogle Scholar
  20. P. J. Haas, J. F. Naughton, S. Seshadri, and L. Stokes. Sampling-Based Estimation of the Number of Distinct Values of an Attribute. In Proceedings of the International Conference on Very Large Databases (VLDB), 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online Aggregation. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. H. Hoffmann, S. Sidiroglou, M. Carbin, S. Misailovic, A. Agarwal, and M. Rinard. Dynamic Knobs for Responsive Power-Aware Computing. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. O. Kiselyov and C.-C. Shan. Embedded Probabilistic Programming. In Proceedings of the IFIP TC 2 Working Conference on Domain-Specific Languages (DSL), 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. J. Lin. Cloud9: A Hadoop Toolkit for Working with Big Data. http://lintool.github.io/Cloud9.Google ScholarGoogle Scholar
  25. J. W. Liu, W.-K. Shih, K.-J. Lin, R. Bettati, and J.-Y. Chung. Imprecise Computations. Proceedings of the IEEE, 82(1), 1994.Google ScholarGoogle Scholar
  26. S. Liu and W. Q. Meeker. Statistical Methods for Estimating the Minimum Thickness Along a Pipeline. Technometrics, 2014.Google ScholarGoogle Scholar
  27. S. Lohr. Sampling: Design and Analysis. Cengage Learning, 2009.Google ScholarGoogle Scholar
  28. T. Minka, J. Winn, J. Guiver, S. Webster, Y. Zaykov, B. Yangel, A. Spengler, and J. Bronskill. Infer.NET 2.6. Microsoft Research Cambridge, 2014. http://research.microsoft.com/infernet.Google ScholarGoogle Scholar
  29. S. Misailovic, S. Sidiroglou, H. Hoffmann, and M. Rinard. Quality of Service Profiling. In Proceedings of the ACM/IEEE International Conference on Software Engineering (ICSE), 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. S. Misailovic, D. M. Roy, and M. C. Rinard. Probabilistically Accurate Program Transformations. In Proceedings of the International Static Analysis Symposium (SAS), 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. S. Misailovic, S. Sidiroglou, H. Hoffmann, M. Carbin, A. Agarwal, and M. Rinard. Code Perforation: Automatically and Dynamically Trading Accuracy for Performance and Power, 2014. http://groups.csail.mit.edu/cag/codeperf/.Google ScholarGoogle Scholar
  32. L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab, 1999.Google ScholarGoogle Scholar
  33. N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online Aggregation for Large MapReduce Jobs. Proceedings of the VLDB Endowment (PVLDB), 4(11), 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. A. Pfeffer. A General Importance Sampling Algorithm for Probabilistic Programs. Technical Report TR-12-07, Harvard University, 2007.Google ScholarGoogle Scholar
  35. M. Rinard. Probabilistic Accuracy Bounds for Fault-tolerant Computations That Discard Tasks. In Proceedings of the Annual International Conference on Supercomputing (ICS), 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. M. Riondato, J. A. DeBrabant, R. Fonseca, and E. Upfal. PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. M. Samadi, J. Lee, A. Jamshidi, A. Hormati, and S. Mahlke. SAGE: Self-Tuning Approximation for Graphics Engines. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman. EnerJ: Approximate Data Types for Safe and General Low-Power Computation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. A. Sampson, J. Nelson, K. Strauss, and L. Ceze. Approximate Storage in Solid-State Memories. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. A. Sampson, P. Panchekha, T. Mytkowicz, K. S. McKinley, D. Grossman, and L. Ceze. Expressing and Verifying Probabilistic Assertions. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. S. Sidiroglou-Douskos, S. Misailovic, H. Hoffmann, and M. Rinard. Managing Performance vs. Accuracy Trade-offs with Loop Perforation. In Proceedings of the Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. L. Sidirourgos, M. L. Kersten, and P. A. Boncz. SciBORQ: Scientific data management with Bounds On Runtime and Quality. In Proceedings of the Conference on Innovative Data Systems Research (CIDR), 2011.Google ScholarGoogle Scholar
  43. J. Slauson and Q. Wan. Approximate Hadoop, 2012. http://www.joshslauson.com/pdf/cs736_project.pdf.Google ScholarGoogle Scholar
  44. A. Verma, N. Zea, B. Cho, I. Gupta, and R. H. Campbell. Breaking the MapReduce Stage Barrier. In Proceedings of the IEEE International Conference on Cluster Computing (Cluster), 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Wikipedia. Wikipedia Database, 2014. http://en.wikipedia.org/wiki/Wikipedia_database.Google ScholarGoogle Scholar
  46. Wikipedia. Wikimedia Downloads, 2014. http://dumps.wikimedia.org.Google ScholarGoogle Scholar
  47. D. Wingate, A. Stuhlmueller, and N. D. Goodman. Lightweight Implementations of Probabilistic Programming Languages Via Transformational Compilation. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.Google ScholarGoogle Scholar
  48. M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. ApproxHadoop: Bringing Approximations to MapReduce Frameworks

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          • Published in

            cover image ACM SIGARCH Computer Architecture News
            ACM SIGARCH Computer Architecture News  Volume 43, Issue 1
            ASPLOS'15
            March 2015
            676 pages
            ISSN:0163-5964
            DOI:10.1145/2786763
            Issue’s Table of Contents
            • cover image ACM Conferences
              ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
              March 2015
              720 pages
              ISBN:9781450328357
              DOI:10.1145/2694344

            Copyright © 2015 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 14 March 2015

            Check for updates

            Qualifiers

            • research-article

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader