Abstract
Machine learning algorithms are widely used today for analytical tasks such as data cleaning, data categorization, or data filtering. At the same time, the rise of social media motivates recent uptake in large scale graph processing. Both categories of algorithms are dominated by iterative subtasks, i.e., processing steps which are executed repetitively until a convergence condition is met. Optimizing cluster resource allocations among multiple workloads of iterative algorithms motivates the need for estimating their runtime, which in turn requires: i) predicting the number of iterations, and ii) predicting the processing time of each iteration. As both parameters depend on the characteristics of the dataset and on the convergence function, estimating their values before execution is difficult.
This paper proposes PREDIcT, an experimental methodology for predicting the runtime of iterative algorithms. PREDIcT uses sample runs for capturing the algorithm's convergence trend and per-iteration key input features that are well correlated with the actual processing requirements of the complete input dataset. Using this combination of characteristics we predict the runtime of iterative algorithms, including algorithms with very different runtime patterns among subsequent iterations. Our experimental evaluation of multiple algorithms on scale-free graphs shows a relative prediction error of 10%-30% for predicting runtime, including algorithms with up to 100× runtime variability among consecutive iterations.
- F. N. Afrati, V. Borkar, M. Carey, N. Polyzotis, and J. D. Ullman. Map-Reduce Extensions and Recursive Queries. In EDBT, pages 1-8, 2011. Google Scholar
- M. Ahmad, S. Duan, A. Aboulnaga, and S. Babu. Interaction-aware Prediction of Business Intelligence Workload Completion Times. In ICDE, pages 413-416, 2010.Google Scholar
- D. Arthur and S. Vassilvitskii. How Slow is the k-means Method? In SCG, pages 144-153, 2006. Google Scholar
- F. Bancilhon and R. Ramakrishnan. An Amateur's Introduction to Recursive Query Processing Strategies. In SIGMOD, pages 16-52, 1986. Google Scholar
- Y. Bu, V. R. Borkar, M. J. Carey, J. Rosen, N. Polyzotis, T. Condie, M. Weimer, and R. Ramakrishnan. Scaling Datalog for Machine Learning on Big Data. CoRR, abs/1203.0160, 2012.Google Scholar
- Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient Iterative Data Processing on Large Clusters. PVLDB, 3:285-296, 2010. Google Scholar
- M. Cha, H. Haddadi, F. Benevenuto, and K. P. Gummadi. Measuring User Influence in Twitter: The Million Follower Fallacy. In ICWSM, 2010.Google Scholar
- S. Chaudhuri, R. Kaushik, and R. Ramamurthy. When Can We Trust Progress Estimators for SQL Queries? In SIGMOD, pages 575-586, 2005. Google Scholar
- S. Chaudhuri, R. Motwani, and V. Narasayya. Random Sampling for Histogram Construction: How Much is Enough? In SIGMOD, pages 436-447, 1998. Google Scholar
- J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. MAD Skills: New Analysis Practices for Big Data. PVLDB, 2(2):1481-1492, 2009. Google Scholar
- A. Deshpande, Z. Ives, and V. Raman. Adaptive query processing. Found. Trends databases, 1(1):1-140, 2007. Google Scholar
- J. Duggan, U. Cetintemel, O. Papaemmanouil, and E. Upfal. Performance Prediction for Concurrent Database Workloads. In SIGMOD, pages 337-348, 2011. Google Scholar
- S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl. Spinning Fast Iterative Data Flows. PVLDB, 5(11):1268-1279, 2012. Google Scholar
- A. Ganapathi, H. Kuno, U. Dayal, J. L. Wiener, A. Fox, M. Jordan, and D. Patterson. Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning. In ICDE, pages 592-603, 2009. Google Scholar
- M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou. Walking in Facebook: A Case Study of Unbiased Sampling of OSNs. In INFOCOM, pages 2498-2506, 2010. Google Scholar
- S. Har-Peled and B. Sadri. How Fast is the k-means Method? In SODA, pages 185-202, 2005. Google Scholar
- T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction, Second Edition. Springer, 2008.Google Scholar
- H. Herodotou and S. Babu. Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs. PVLDB, 4(11):1111-1122, 2011.Google Scholar
- H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A Self-tuning System for Big Data Analytics. In CIDR, pages 261-272, 2011.Google Scholar
- U. Kang, C. E. Tsourakakis, and C. Faloutsos. PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations. In ICDM, pages 229-238, 2009. Google Scholar
- Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis. Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing. In EuroSys, pages 169-182, 2013. Google Scholar
- A. N. Langville and C. D. Meyer. Deeper Inside PageRank. Internet Mathematics, 1(3):335-380, 2003.Google Scholar
- J. Leskovec and C. Faloutsos. Sampling from Large Graphs. In KDD, pages 631-636, 2006. Google Scholar
- J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Statistical Properties of Community Structure in Large Social and Information Networks. In WWW, pages 695-704, 2008. Google Scholar
- J. Li, A. C. König, V. Narasayya, and S. Chaudhuri. Robust Estimation of Resource Consumption for SQL Queries Using Statistical Techniques. PVLDB, 5(11):1555-1566, 2012. Google Scholar
- Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB, 5(8):716-727, 2012. Google Scholar
- G. Luo, J. F. Naughton, and P. S. Yu. Multiquery SQL Progress Indicators. In EDBT, pages 921-941, 2006. Google Scholar
- G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A System for Large-Scale Graph Processing. In SIGMOD, pages 135-146, 2010. Google Scholar
- K. Morton, M. Balazinska, and D. Grossman. ParaTimer: A Progress Indicator for MapReduce DAGs. In SIGMOD, pages 507-518, 2010. Google Scholar
- L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66, Stanford InfoLab, 1999.Google Scholar
- A. D. Popescu, A. Balmin, V. Ercegovac, and A. Ailamaki. Towards Predicting the Runtime of Iterative Analytics with PREDIcT. Technical Report 187356, EPFL, 2013.Google Scholar
- A. D. Popescu, V. Ercegovac, A. Balmin, M. Branco, and A. Ailamaki. Same Queries, Different Data: Can We Predict Runtime Performance? In ICDE Workshops, pages 275-280, 2012. Google Scholar
- M. Stillger, G. M. Lohman, V. Markl, and M. Kandil. LEO - DB2's LEarning Optimizer. In VLDB, pages 19-28, 2001. Google Scholar
- L. G. Valiant. A Bridging Model for Parallel Computation. CACM, 33(8):103-111, 1990. Google Scholar
- S. van Dongen. A Cluster Algorithm for Graphs. Technical Report INS-R0010, CWI, 2000. Google Scholar
- A. Verma, L. Cherkasova, and R. H. Campbell. ARIA: Automatic Resource Inference and Allocation for MapReduce Environments. In ICAC, pages 235-244, 2011. Google Scholar
- G. Wang, W. Xie, A. J. Demers, and J. Gehrke. Asynchronous Large-Scale Graph Processing Made Easy. In CIDR, 2013.Google Scholar
- J. Wolf, D. Rajan, K. Hildrum, R. Khandekar, V. Kumar, S. Parekh, K.-L. Wu, and A. Balmin. FLEX: A Slot Allocation Scheduling Optimizer for MapReduce Workloads. In Middleware, pages 1-20, 2010. Google Scholar
- M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets. In HotCloud, 2010. Google Scholar
Recommendations
SARNA-Predict: Accuracy Improvement of RNA Secondary Structure Prediction Using Permutation-Based Simulated Annealing
Ribonucleic acid (RNA), a single-stranded linear molecule, is essential to all biological systems. Different regions of the same RNA strand will fold together via base pair interactions to make intricate secondary and tertiary structures that guide ...
Using Historical Data to Predict Application Runtimes on Backfilling Parallel Systems
PDP '10: Proceedings of the 2010 18th Euromicro Conference on Parallel, Distributed and Network-based ProcessingWe present in this paper a novel method to predict application runtimes on backfilling parallel systems. The method is based on mining historical data to obtain important parameters. These parameters are then applied to predict the runtime of future ...
Semismooth Karush-Kuhn-Tucker Equations and Convergence Analysis of Newton and Quasi-Newton Methods for Solving these Equations
There are several forms of systems of nonsmooth equations which are equivalent to the Karush-Kuhn-Tucker KKT system of a nonlinearly constrained optimization problem NLP. If the NLP is twice continuously differentiable and the Hessian functions of its ...
Comments