article

PREDIcT: towards predicting the runtime of large scale iterative analytics

Authors:
Adrian Daniel Popescu

Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland

Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
View Profile

,
Andrey Balmin

IBM Almaden Research Center, San Jose, CA and GraphSQL, Mountain View, CA

IBM Almaden Research Center, San Jose, CA and GraphSQL, Mountain View, CA
View Profile

,
Vuk Ercegovac

IBM Almaden Research Center, San Jose, CA and Google Inc.

IBM Almaden Research Center, San Jose, CA and Google Inc.
View Profile

,
Anastasia Ailamaki

Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland

Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
View Profile

Proceedings of the VLDB Endowment Volume 6 Issue 14pp 1678–1689https://doi.org/10.14778/2556549.2556553

Published:01 September 2013Publication History

Proceedings of the VLDB Endowment

Abstract

Machine learning algorithms are widely used today for analytical tasks such as data cleaning, data categorization, or data filtering. At the same time, the rise of social media motivates recent uptake in large scale graph processing. Both categories of algorithms are dominated by iterative subtasks, i.e., processing steps which are executed repetitively until a convergence condition is met. Optimizing cluster resource allocations among multiple workloads of iterative algorithms motivates the need for estimating their runtime, which in turn requires: i) predicting the number of iterations, and ii) predicting the processing time of each iteration. As both parameters depend on the characteristics of the dataset and on the convergence function, estimating their values before execution is difficult.

This paper proposes PREDIcT, an experimental methodology for predicting the runtime of iterative algorithms. PREDIcT uses sample runs for capturing the algorithm's convergence trend and per-iteration key input features that are well correlated with the actual processing requirements of the complete input dataset. Using this combination of characteristics we predict the runtime of iterative algorithms, including algorithms with very different runtime patterns among subsequent iterations. Our experimental evaluation of multiple algorithms on scale-free graphs shows a relative prediction error of 10%-30% for predicting runtime, including algorithms with up to 100× runtime variability among consecutive iterations.

References

F. N. Afrati, V. Borkar, M. Carey, N. Polyzotis, and J. D. Ullman. Map-Reduce Extensions and Recursive Queries. In EDBT, pages 1-8, 2011. Google Scholar
M. Ahmad, S. Duan, A. Aboulnaga, and S. Babu. Interaction-aware Prediction of Business Intelligence Workload Completion Times. In ICDE, pages 413-416, 2010.Google Scholar
D. Arthur and S. Vassilvitskii. How Slow is the k-means Method? In SCG, pages 144-153, 2006. Google Scholar
F. Bancilhon and R. Ramakrishnan. An Amateur's Introduction to Recursive Query Processing Strategies. In SIGMOD, pages 16-52, 1986. Google Scholar
Y. Bu, V. R. Borkar, M. J. Carey, J. Rosen, N. Polyzotis, T. Condie, M. Weimer, and R. Ramakrishnan. Scaling Datalog for Machine Learning on Big Data. CoRR, abs/1203.0160, 2012.Google Scholar
Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient Iterative Data Processing on Large Clusters. PVLDB, 3:285-296, 2010. Google Scholar
M. Cha, H. Haddadi, F. Benevenuto, and K. P. Gummadi. Measuring User Influence in Twitter: The Million Follower Fallacy. In ICWSM, 2010.Google Scholar
S. Chaudhuri, R. Kaushik, and R. Ramamurthy. When Can We Trust Progress Estimators for SQL Queries? In SIGMOD, pages 575-586, 2005. Google Scholar
S. Chaudhuri, R. Motwani, and V. Narasayya. Random Sampling for Histogram Construction: How Much is Enough? In SIGMOD, pages 436-447, 1998. Google Scholar
J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. MAD Skills: New Analysis Practices for Big Data. PVLDB, 2(2):1481-1492, 2009. Google Scholar
A. Deshpande, Z. Ives, and V. Raman. Adaptive query processing. Found. Trends databases, 1(1):1-140, 2007. Google Scholar
J. Duggan, U. Cetintemel, O. Papaemmanouil, and E. Upfal. Performance Prediction for Concurrent Database Workloads. In SIGMOD, pages 337-348, 2011. Google Scholar
S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl. Spinning Fast Iterative Data Flows. PVLDB, 5(11):1268-1279, 2012. Google Scholar
A. Ganapathi, H. Kuno, U. Dayal, J. L. Wiener, A. Fox, M. Jordan, and D. Patterson. Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning. In ICDE, pages 592-603, 2009. Google Scholar
M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou. Walking in Facebook: A Case Study of Unbiased Sampling of OSNs. In INFOCOM, pages 2498-2506, 2010. Google Scholar
S. Har-Peled and B. Sadri. How Fast is the k-means Method? In SODA, pages 185-202, 2005. Google Scholar
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction, Second Edition. Springer, 2008.Google Scholar
H. Herodotou and S. Babu. Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs. PVLDB, 4(11):1111-1122, 2011.Google Scholar
H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A Self-tuning System for Big Data Analytics. In CIDR, pages 261-272, 2011.Google Scholar
U. Kang, C. E. Tsourakakis, and C. Faloutsos. PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations. In ICDM, pages 229-238, 2009. Google Scholar
Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis. Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing. In EuroSys, pages 169-182, 2013. Google Scholar
A. N. Langville and C. D. Meyer. Deeper Inside PageRank. Internet Mathematics, 1(3):335-380, 2003.Google Scholar
J. Leskovec and C. Faloutsos. Sampling from Large Graphs. In KDD, pages 631-636, 2006. Google Scholar
J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Statistical Properties of Community Structure in Large Social and Information Networks. In WWW, pages 695-704, 2008. Google Scholar
J. Li, A. C. König, V. Narasayya, and S. Chaudhuri. Robust Estimation of Resource Consumption for SQL Queries Using Statistical Techniques. PVLDB, 5(11):1555-1566, 2012. Google Scholar
Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB, 5(8):716-727, 2012. Google Scholar
G. Luo, J. F. Naughton, and P. S. Yu. Multiquery SQL Progress Indicators. In EDBT, pages 921-941, 2006. Google Scholar
G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A System for Large-Scale Graph Processing. In SIGMOD, pages 135-146, 2010. Google Scholar
K. Morton, M. Balazinska, and D. Grossman. ParaTimer: A Progress Indicator for MapReduce DAGs. In SIGMOD, pages 507-518, 2010. Google Scholar
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66, Stanford InfoLab, 1999.Google Scholar
A. D. Popescu, A. Balmin, V. Ercegovac, and A. Ailamaki. Towards Predicting the Runtime of Iterative Analytics with PREDIcT. Technical Report 187356, EPFL, 2013.Google Scholar
A. D. Popescu, V. Ercegovac, A. Balmin, M. Branco, and A. Ailamaki. Same Queries, Different Data: Can We Predict Runtime Performance? In ICDE Workshops, pages 275-280, 2012. Google Scholar
M. Stillger, G. M. Lohman, V. Markl, and M. Kandil. LEO - DB2's LEarning Optimizer. In VLDB, pages 19-28, 2001. Google Scholar
L. G. Valiant. A Bridging Model for Parallel Computation. CACM, 33(8):103-111, 1990. Google Scholar
S. van Dongen. A Cluster Algorithm for Graphs. Technical Report INS-R0010, CWI, 2000. Google Scholar
A. Verma, L. Cherkasova, and R. H. Campbell. ARIA: Automatic Resource Inference and Allocation for MapReduce Environments. In ICAC, pages 235-244, 2011. Google Scholar
G. Wang, W. Xie, A. J. Demers, and J. Gehrke. Asynchronous Large-Scale Graph Processing Made Easy. In CIDR, 2013.Google Scholar
J. Wolf, D. Rajan, K. Hildrum, R. Khandekar, V. Kumar, S. Parekh, K.-L. Wu, and A. Balmin. FLEX: A Slot Allocation Scheduling Optimizer for MapReduce Workloads. In Middleware, pages 1-20, 2010. Google Scholar
M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets. In HotCloud, 2010. Google Scholar

Recommendations

SARNA-Predict: Accuracy Improvement of RNA Secondary Structure Prediction Using Permutation-Based Simulated Annealing

Ribonucleic acid (RNA), a single-stranded linear molecule, is essential to all biological systems. Different regions of the same RNA strand will fold together via base pair interactions to make intricate secondary and tertiary structures that guide ...
Read More
Using Historical Data to Predict Application Runtimes on Backfilling Parallel Systems
PDP '10: Proceedings of the 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing

We present in this paper a novel method to predict application runtimes on backfilling parallel systems. The method is based on mining historical data to obtain important parameters. These parameters are then applied to predict the runtime of future ...
Read More
Semismooth Karush-Kuhn-Tucker Equations and Convergence Analysis of Newton and Quasi-Newton Methods for Solving these Equations

There are several forms of systems of nonsmooth equations which are equivalent to the Karush-Kuhn-Tucker KKT system of a nonlinearly constrained optimization problem NLP. If the NLP is twice continuously differentiable and the Hessian functions of its ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

Proceedings of the VLDB Endowment Volume 6, Issue 14
September 2013
384 pages
ISSN:2150-8097
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 September 2013
Published in pvldb Volume 6, Issue 14
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 19
  Total Citations
  View Citations
- 217
  Total Downloads
- Downloads (Last 12 months)3
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

PREDIcT: towards predicting the runtime of large scale iterative analytics

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

SARNA-Predict: Accuracy Improvement of RNA Secondary Structure Prediction Using Permutation-Based Simulated Annealing

Using Historical Data to Predict Application Runtimes on Backfilling Parallel Systems

Semismooth Karush-Kuhn-Tucker Equations and Convergence Analysis of Newton and Quasi-Newton Methods for Solving these Equations

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

PREDIcT: towards predicting the runtime of large scale iterative analytics

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

SARNA-Predict: Accuracy Improvement of RNA Secondary Structure Prediction Using Permutation-Based Simulated Annealing

Using Historical Data to Predict Application Runtimes on Backfilling Parallel Systems

Semismooth Karush-Kuhn-Tucker Equations and Convergence Analysis of Newton and Quasi-Newton Methods for Solving these Equations

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media