research-article

ApproxHadoop: Bringing Approximations to MapReduce Frameworks

Authors:
Inigo Goiri

Microsoft Research, Redmond, WA, USA

Microsoft Research, Redmond, WA, USA
View Profile

,
Ricardo Bianchini

Microsoft Research, Redmond, WA, USA

Microsoft Research, Redmond, WA, USA
View Profile

,
Santosh Nagarakatte

Rutgers University, New Brunswick, NJ, USA

Rutgers University, New Brunswick, NJ, USA
View Profile

,
Thu D. Nguyen

Rutgers University, New Brunswick, NJ, USA

Rutgers University, New Brunswick, NJ, USA
View Profile

Authors Info & Claims

ACM SIGARCH Computer Architecture News Volume 43 Issue 1March 2015pp 383–397https://doi.org/10.1145/2786763.2694351

Published:14 March 2015Publication History

ACM SIGARCH Computer Architecture News

Abstract

We propose and evaluate a framework for creating and running approximation-enabled MapReduce programs. Specifically, we propose approximation mechanisms that fit naturally into the MapReduce paradigm, including input data sampling, task dropping, and accepting and running a precise and a user-defined approximate version of the MapReduce code. We then show how to leverage statistical theories to compute error bounds for popular classes of MapReduce programs when approximating with input data sampling and/or task dropping. We implement the proposed mechanisms and error bound estimations in a prototype system called ApproxHadoop. Our evaluation uses MapReduce applications from different domains, including data analytics, scientific computing, video encoding, and machine learning. Our results show that ApproxHadoop can significantly reduce application execution time and/or energy consumption when the user is willing to tolerate small errors. For example, ApproxHadoop can reduce runtimes by up to 32x when the user can tolerate an error of 1% with 95% confidence. We conclude that our framework and system can make approximation easily accessible to many application domains using the MapReduce model.

References

Apache Hadoop. http://hadoop.apache.org.Google Scholar
Apache Mahout. http://mahout.apache.org.Google Scholar
Apache Nutch. http://nutch.apache.org.Google Scholar
S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In Proceedings of the European Conference on Computer Systems (EuroSys), 2013. Google ScholarDigital Library
G. Ananthanarayanan, M. Hung, X. Ren, I. Stoica, A. Wierman, and M. Yu. GRASS: Trimming Stragglers in Approximation Analytics. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2014. Google ScholarDigital Library
W. Baek and T. M. Chilimbi. Green: A Framework for Supporting Energy-Conscious Programming using Controlled Approximation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2010. Google ScholarDigital Library
S. Bhat, J. Borgstrom, A. D. Gordon, and C. Russo. Deriving Probability Density Functions from Probabilistic Functional Programs. In Proceedings of the International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), 2013. Google ScholarDigital Library
S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A Comparison of Join Algorithms for Log Processing in MapReduce. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2010. Google ScholarDigital Library
J. Bornholt, T. Mytkowicz, and K. S. McKinley. Uncertain : A First-Order Type for Uncertain Data. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014. Google ScholarDigital Library
S. Chaudhuri, G. Das, and V. Narasayya. Optimized Stratified Sampling for Approximate Query Processing. ACM Transactions on Database Systems (TODS), 32(2), 2007. Google ScholarDigital Library
S. Coles. An Introduction to Statistical Modeling of Extreme Values. Springer, 2001.Google ScholarCross Ref
T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. MapReduce Online. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010. Google ScholarDigital Library
J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the Symposium on Operating Systems Design and Implementation (OSDI), 2004. Google ScholarDigital Library
A. Doucet, S. Godsill, and C. Andrieu. On Sequential Monte Carlo Sampling Methods for Bayesian Filtering. Statistics and Computing, 10(3), 2000. Google ScholarDigital Library
J. Ekanayake, S. Pallickara, and G. Fox. MapReduce for Data Intensive Scientific Analyses. In Proceedings of the IEEE International Conference on e-Science (e-Science), 2008. Google ScholarDigital Library
Z. Fadika, E. Dede, M. Govindaraju, and L. Ramakrishnan. Adapting MapReduce for HPC environments. In Proceedings of the International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2011. Google ScholarDigital Library
M. N. Garofalakis and P. B. Gibbons. Approximate Query Processing: Taming the TeraBytes. In Proceedings of the International Conference on Very Large Databases (VLDB), 2001. Google ScholarDigital Library
I. Goiri, K. Le, J. Guitart, J. Torres, and R. Bianchini. Intelligent Placement of Datacenters for Internet Services. In Proceedings of the International Conference on Distributed Computing Systems (ICDCS), 2011. Google ScholarDigital Library
I. Goiri, R. Bianchini, S. Nagarakatte, and T. D. Nguyen. ApproxHadoop: Bringing Approximations to MapReduce Frameworks. Technical Report DCS-TR-709, Department of Computer Science, Rutgers University, 2014.Google Scholar
P. J. Haas, J. F. Naughton, S. Seshadri, and L. Stokes. Sampling-Based Estimation of the Number of Distinct Values of an Attribute. In Proceedings of the International Conference on Very Large Databases (VLDB), 1995. Google ScholarDigital Library
J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online Aggregation. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 1997. Google ScholarDigital Library
H. Hoffmann, S. Sidiroglou, M. Carbin, S. Misailovic, A. Agarwal, and M. Rinard. Dynamic Knobs for Responsive Power-Aware Computing. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2011. Google ScholarDigital Library
O. Kiselyov and C.-C. Shan. Embedded Probabilistic Programming. In Proceedings of the IFIP TC 2 Working Conference on Domain-Specific Languages (DSL), 2009. Google ScholarDigital Library
J. Lin. Cloud9: A Hadoop Toolkit for Working with Big Data. http://lintool.github.io/Cloud9.Google Scholar
J. W. Liu, W.-K. Shih, K.-J. Lin, R. Bettati, and J.-Y. Chung. Imprecise Computations. Proceedings of the IEEE, 82(1), 1994.Google Scholar
S. Liu and W. Q. Meeker. Statistical Methods for Estimating the Minimum Thickness Along a Pipeline. Technometrics, 2014.Google Scholar
S. Lohr. Sampling: Design and Analysis. Cengage Learning, 2009.Google Scholar
T. Minka, J. Winn, J. Guiver, S. Webster, Y. Zaykov, B. Yangel, A. Spengler, and J. Bronskill. Infer.NET 2.6. Microsoft Research Cambridge, 2014. http://research.microsoft.com/infernet.Google Scholar
S. Misailovic, S. Sidiroglou, H. Hoffmann, and M. Rinard. Quality of Service Profiling. In Proceedings of the ACM/IEEE International Conference on Software Engineering (ICSE), 2010. Google ScholarDigital Library
S. Misailovic, D. M. Roy, and M. C. Rinard. Probabilistically Accurate Program Transformations. In Proceedings of the International Static Analysis Symposium (SAS), 2011. Google ScholarDigital Library
S. Misailovic, S. Sidiroglou, H. Hoffmann, M. Carbin, A. Agarwal, and M. Rinard. Code Perforation: Automatically and Dynamically Trading Accuracy for Performance and Power, 2014. http://groups.csail.mit.edu/cag/codeperf/.Google Scholar
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab, 1999.Google Scholar
N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online Aggregation for Large MapReduce Jobs. Proceedings of the VLDB Endowment (PVLDB), 4(11), 2011.Google ScholarDigital Library
A. Pfeffer. A General Importance Sampling Algorithm for Probabilistic Programs. Technical Report TR-12-07, Harvard University, 2007.Google Scholar
M. Rinard. Probabilistic Accuracy Bounds for Fault-tolerant Computations That Discard Tasks. In Proceedings of the Annual International Conference on Supercomputing (ICS), 2006. Google ScholarDigital Library
M. Riondato, J. A. DeBrabant, R. Fonseca, and E. Upfal. PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), 2012. Google ScholarDigital Library
M. Samadi, J. Lee, A. Jamshidi, A. Hormati, and S. Mahlke. SAGE: Self-Tuning Approximation for Graphics Engines. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), 2013. Google ScholarDigital Library
A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman. EnerJ: Approximate Data Types for Safe and General Low-Power Computation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2011. Google ScholarDigital Library
A. Sampson, J. Nelson, K. Strauss, and L. Ceze. Approximate Storage in Solid-State Memories. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), 2013. Google ScholarDigital Library
A. Sampson, P. Panchekha, T. Mytkowicz, K. S. McKinley, D. Grossman, and L. Ceze. Expressing and Verifying Probabilistic Assertions. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2014. Google ScholarDigital Library
S. Sidiroglou-Douskos, S. Misailovic, H. Hoffmann, and M. Rinard. Managing Performance vs. Accuracy Trade-offs with Loop Perforation. In Proceedings of the Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), 2011. Google ScholarDigital Library
L. Sidirourgos, M. L. Kersten, and P. A. Boncz. SciBORQ: Scientific data management with Bounds On Runtime and Quality. In Proceedings of the Conference on Innovative Data Systems Research (CIDR), 2011.Google Scholar
J. Slauson and Q. Wan. Approximate Hadoop, 2012. http://www.joshslauson.com/pdf/cs736_project.pdf.Google Scholar
A. Verma, N. Zea, B. Cho, I. Gupta, and R. H. Campbell. Breaking the MapReduce Stage Barrier. In Proceedings of the IEEE International Conference on Cluster Computing (Cluster), 2010. Google ScholarDigital Library
Wikipedia. Wikipedia Database, 2014. http://en.wikipedia.org/wiki/Wikipedia_database.Google Scholar
Wikipedia. Wikimedia Downloads, 2014. http://dumps.wikimedia.org.Google Scholar
D. Wingate, A. Stuhlmueller, and N. D. Goodman. Lightweight Implementations of Probabilistic Programming Languages Via Transformational Compilation. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.Google Scholar
M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2008. Google ScholarDigital Library

Index Terms

ApproxHadoop: Bringing Approximations to MapReduce Frameworks

Recommendations

ApproxHadoop: Bringing Approximations to MapReduce Frameworks
ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems

We propose and evaluate a framework for creating and running approximation-enabled MapReduce programs. Specifically, we propose approximation mechanisms that fit naturally into the MapReduce paradigm, including input data sampling, task dropping, and ...
Read More
ApproxHadoop: Bringing Approximations to MapReduce Frameworks
ASPLOS '15

We propose and evaluate a framework for creating and running approximation-enabled MapReduce programs. Specifically, we propose approximation mechanisms that fit naturally into the MapReduce paradigm, including input data sampling, task dropping, and ...
Read More
Design and Development of a Medical Big Data Processing System Based on Hadoop

Secondary use of medical big data is increasingly popular in healthcare services and clinical research. Understanding the logic behind medical big data demonstrates tendencies in hospital information technology and shows great significance for hospital ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM SIGARCH Computer Architecture News Volume 43, Issue 1
ASPLOS'15
March 2015
676 pages
ISSN:0163-5964
DOI:10.1145/2786763
Editor:
Doug DeGroot
acm dot org
Issue’s Table of Contents
ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
March 2015
720 pages
ISBN:9781450328357
DOI:10.1145/2694344
General Chairs:
Ozcan Ozturk
Bilkent University, Turkey
,
Kemal Ebcioglu
Global Supercomputing, USA
,
Program Chair:
Sandhya Dwarkadas
University of Rochester, USA
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 14 March 2015
Check for updates
Author Tags
MapReduce
approximation
extreme value theory
multi-stage sampling
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 131
  Total Citations
  View Citations
- 881
  Total Downloads
- Downloads (Last 12 months)48
- Downloads (Last 6 weeks)7
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

ApproxHadoop: Bringing Approximations to MapReduce Frameworks

ACM SIGARCH Computer Architecture News

Abstract

References

Cited By

Index Terms

Recommendations

ApproxHadoop: Bringing Approximations to MapReduce Frameworks

ApproxHadoop: Bringing Approximations to MapReduce Frameworks

Design and Development of a Medical Big Data Processing System Based on Hadoop