skip to main content
article

PREDIcT: towards predicting the runtime of large scale iterative analytics

Authors Info & Claims
Published:01 September 2013Publication History
Skip Abstract Section

Abstract

Machine learning algorithms are widely used today for analytical tasks such as data cleaning, data categorization, or data filtering. At the same time, the rise of social media motivates recent uptake in large scale graph processing. Both categories of algorithms are dominated by iterative subtasks, i.e., processing steps which are executed repetitively until a convergence condition is met. Optimizing cluster resource allocations among multiple workloads of iterative algorithms motivates the need for estimating their runtime, which in turn requires: i) predicting the number of iterations, and ii) predicting the processing time of each iteration. As both parameters depend on the characteristics of the dataset and on the convergence function, estimating their values before execution is difficult.

This paper proposes PREDIcT, an experimental methodology for predicting the runtime of iterative algorithms. PREDIcT uses sample runs for capturing the algorithm's convergence trend and per-iteration key input features that are well correlated with the actual processing requirements of the complete input dataset. Using this combination of characteristics we predict the runtime of iterative algorithms, including algorithms with very different runtime patterns among subsequent iterations. Our experimental evaluation of multiple algorithms on scale-free graphs shows a relative prediction error of 10%-30% for predicting runtime, including algorithms with up to 100× runtime variability among consecutive iterations.

References

  1. F. N. Afrati, V. Borkar, M. Carey, N. Polyzotis, and J. D. Ullman. Map-Reduce Extensions and Recursive Queries. In EDBT, pages 1-8, 2011. Google ScholarGoogle Scholar
  2. M. Ahmad, S. Duan, A. Aboulnaga, and S. Babu. Interaction-aware Prediction of Business Intelligence Workload Completion Times. In ICDE, pages 413-416, 2010.Google ScholarGoogle Scholar
  3. D. Arthur and S. Vassilvitskii. How Slow is the k-means Method? In SCG, pages 144-153, 2006. Google ScholarGoogle Scholar
  4. F. Bancilhon and R. Ramakrishnan. An Amateur's Introduction to Recursive Query Processing Strategies. In SIGMOD, pages 16-52, 1986. Google ScholarGoogle Scholar
  5. Y. Bu, V. R. Borkar, M. J. Carey, J. Rosen, N. Polyzotis, T. Condie, M. Weimer, and R. Ramakrishnan. Scaling Datalog for Machine Learning on Big Data. CoRR, abs/1203.0160, 2012.Google ScholarGoogle Scholar
  6. Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient Iterative Data Processing on Large Clusters. PVLDB, 3:285-296, 2010. Google ScholarGoogle Scholar
  7. M. Cha, H. Haddadi, F. Benevenuto, and K. P. Gummadi. Measuring User Influence in Twitter: The Million Follower Fallacy. In ICWSM, 2010.Google ScholarGoogle Scholar
  8. S. Chaudhuri, R. Kaushik, and R. Ramamurthy. When Can We Trust Progress Estimators for SQL Queries? In SIGMOD, pages 575-586, 2005. Google ScholarGoogle Scholar
  9. S. Chaudhuri, R. Motwani, and V. Narasayya. Random Sampling for Histogram Construction: How Much is Enough? In SIGMOD, pages 436-447, 1998. Google ScholarGoogle Scholar
  10. J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. MAD Skills: New Analysis Practices for Big Data. PVLDB, 2(2):1481-1492, 2009. Google ScholarGoogle Scholar
  11. A. Deshpande, Z. Ives, and V. Raman. Adaptive query processing. Found. Trends databases, 1(1):1-140, 2007. Google ScholarGoogle Scholar
  12. J. Duggan, U. Cetintemel, O. Papaemmanouil, and E. Upfal. Performance Prediction for Concurrent Database Workloads. In SIGMOD, pages 337-348, 2011. Google ScholarGoogle Scholar
  13. S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl. Spinning Fast Iterative Data Flows. PVLDB, 5(11):1268-1279, 2012. Google ScholarGoogle Scholar
  14. A. Ganapathi, H. Kuno, U. Dayal, J. L. Wiener, A. Fox, M. Jordan, and D. Patterson. Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning. In ICDE, pages 592-603, 2009. Google ScholarGoogle Scholar
  15. M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou. Walking in Facebook: A Case Study of Unbiased Sampling of OSNs. In INFOCOM, pages 2498-2506, 2010. Google ScholarGoogle Scholar
  16. S. Har-Peled and B. Sadri. How Fast is the k-means Method? In SODA, pages 185-202, 2005. Google ScholarGoogle Scholar
  17. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction, Second Edition. Springer, 2008.Google ScholarGoogle Scholar
  18. H. Herodotou and S. Babu. Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs. PVLDB, 4(11):1111-1122, 2011.Google ScholarGoogle Scholar
  19. H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A Self-tuning System for Big Data Analytics. In CIDR, pages 261-272, 2011.Google ScholarGoogle Scholar
  20. U. Kang, C. E. Tsourakakis, and C. Faloutsos. PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations. In ICDM, pages 229-238, 2009. Google ScholarGoogle Scholar
  21. Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis. Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing. In EuroSys, pages 169-182, 2013. Google ScholarGoogle Scholar
  22. A. N. Langville and C. D. Meyer. Deeper Inside PageRank. Internet Mathematics, 1(3):335-380, 2003.Google ScholarGoogle Scholar
  23. J. Leskovec and C. Faloutsos. Sampling from Large Graphs. In KDD, pages 631-636, 2006. Google ScholarGoogle Scholar
  24. J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Statistical Properties of Community Structure in Large Social and Information Networks. In WWW, pages 695-704, 2008. Google ScholarGoogle Scholar
  25. J. Li, A. C. König, V. Narasayya, and S. Chaudhuri. Robust Estimation of Resource Consumption for SQL Queries Using Statistical Techniques. PVLDB, 5(11):1555-1566, 2012. Google ScholarGoogle Scholar
  26. Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB, 5(8):716-727, 2012. Google ScholarGoogle Scholar
  27. G. Luo, J. F. Naughton, and P. S. Yu. Multiquery SQL Progress Indicators. In EDBT, pages 921-941, 2006. Google ScholarGoogle Scholar
  28. G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A System for Large-Scale Graph Processing. In SIGMOD, pages 135-146, 2010. Google ScholarGoogle Scholar
  29. K. Morton, M. Balazinska, and D. Grossman. ParaTimer: A Progress Indicator for MapReduce DAGs. In SIGMOD, pages 507-518, 2010. Google ScholarGoogle Scholar
  30. L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66, Stanford InfoLab, 1999.Google ScholarGoogle Scholar
  31. A. D. Popescu, A. Balmin, V. Ercegovac, and A. Ailamaki. Towards Predicting the Runtime of Iterative Analytics with PREDIcT. Technical Report 187356, EPFL, 2013.Google ScholarGoogle Scholar
  32. A. D. Popescu, V. Ercegovac, A. Balmin, M. Branco, and A. Ailamaki. Same Queries, Different Data: Can We Predict Runtime Performance? In ICDE Workshops, pages 275-280, 2012. Google ScholarGoogle Scholar
  33. M. Stillger, G. M. Lohman, V. Markl, and M. Kandil. LEO - DB2's LEarning Optimizer. In VLDB, pages 19-28, 2001. Google ScholarGoogle Scholar
  34. L. G. Valiant. A Bridging Model for Parallel Computation. CACM, 33(8):103-111, 1990. Google ScholarGoogle Scholar
  35. S. van Dongen. A Cluster Algorithm for Graphs. Technical Report INS-R0010, CWI, 2000. Google ScholarGoogle Scholar
  36. A. Verma, L. Cherkasova, and R. H. Campbell. ARIA: Automatic Resource Inference and Allocation for MapReduce Environments. In ICAC, pages 235-244, 2011. Google ScholarGoogle Scholar
  37. G. Wang, W. Xie, A. J. Demers, and J. Gehrke. Asynchronous Large-Scale Graph Processing Made Easy. In CIDR, 2013.Google ScholarGoogle Scholar
  38. J. Wolf, D. Rajan, K. Hildrum, R. Khandekar, V. Kumar, S. Parekh, K.-L. Wu, and A. Balmin. FLEX: A Slot Allocation Scheduling Optimizer for MapReduce Workloads. In Middleware, pages 1-20, 2010. Google ScholarGoogle Scholar
  39. M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets. In HotCloud, 2010. Google ScholarGoogle Scholar

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 6, Issue 14
    September 2013
    384 pages

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 September 2013
    Published in pvldb Volume 6, Issue 14

    Qualifiers

    • article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader