Abstract
Motivated by streaming and data analytics scenarios where many queries operate on the same data and perform similar computations, we propose program consolidation for merging multiple user-defined functions (UDFs) that operate on the same input. Program consolidation exploits common computations between UDFs to generate an equivalent optimized function whose execution cost is often much smaller (and never greater) than the sum of the costs of executing each function individually. We present a sound consolidation calculus and an effective algorithm for consolidating multiple UDFs. Our approach is purely static and uses symbolic SMT-based techniques to identify shared or redundant computations. We have implemented the proposed technique on top of the Naiad data processing system. Our experiments show that our algorithm dramatically improves overall job completion time when executing user-defined filters that operate on the same data and perform similar computations.
- Linq project,. URL http://msdn.microsoft.com/en-us/library/vstudio/bb397926.aspx.Google Scholar
- Msdn linq documentation,. URL http://msdn.microsoft.com/en-us/library/bb669073(v=vs.110).aspx.Google Scholar
- Storm project. URL http://storm-project.net.Google Scholar
- Wordcount example. URL https://github.com/MicrosoftResearchSVC/Naiad/blob/release_0.2/Examples/.Google Scholar
- R.-J. Back. A calculus of refinements for program derivations. Acta Inf., 25(6):593--624, 1988. Google ScholarDigital Library
- N. Bruno, S. Agarwal, S. Kandula, B. Shi, M. Wu, and J. Zhou. Recurring job optimization in scope. SIGMOD, pages 805--806, 2012. Google ScholarDigital Library
- S. Chaudhuri and K. Shim. Optimization of queries with user-defined predicates. ACM Trans. Database Syst., 24(2):177--228, 1999. Google ScholarDigital Library
- D. Coutts, R. Leshchinskiy, and D. Stewart. Stream fusion: From lists to streams to nothing at all. ICFP, pages 315--326, 2007. Google ScholarDigital Library
- L. De Moura and N. Bjørner. Z3: An efficient smt solver. In TACAS, pages 337--340. Springer, 2008. Google ScholarDigital Library
- E. W. Dijkstra. A Discipline of Programming. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1976. Google ScholarDigital Library
- S. Finkelstein. Common expression analysis in database applications. SIGMOD '82, pages 235--245, 1982. Google ScholarDigital Library
- G. Giannikis, G. Alonso, and D. Kossmann. SharedDB: killing one thousand queries with one stone. VLDB, 5(6):526--537, Feb. 2012. Google ScholarDigital Library
- A. Gill, J. Launchbury, and S. L. Peyton Jones. A short cut to deforestation. FPCA '93, pages 223--232, 1993. Google ScholarDigital Library
- P. K. Gunda, L. Ravindranath, C. A. Thekkath, Y. Yu, and L. Zhuang. Nectar: automatic management of data and computation in datacenters. OSDI'10, pages 1--8, 2010. Google ScholarDigital Library
- M. Haghighat and C. D. Polychronopoulos. Symbolic analysis: A basis for parallelization, optimization, and scheduling of programs. In LCPC, pages 567--585, 1993. Google ScholarDigital Library
- E. C. R. Hehner. Formalization of time and space. Formal Asp. Comput., 10(3):290--306, 1998.Google ScholarCross Ref
- F. Hueske, M. Peters, M. J. Sax, A. Rheinländer, R. Bergmann, A. Krettek, and K. Tzoumas. Opening the black boxes in data flow optimization. Proc. VLDB Endow., 5(11):1256--1267, July 2012. Google ScholarDigital Library
- N. D. Jones, C. K. Gomard, and P. Sestoft. Partial Evaluation and Automatic Program Generation. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1993. ISBN 0-13-020249-5. Google ScholarDigital Library
- K. Krikellas, S. Viglas, and M. Cintra. Generating code for holistic query evaluation. In ICDE, pages 613--624, 2010.Google ScholarCross Ref
- E. Meijer, M. M. Fokkinga, and R. Paterson. Functional programming with bananas, lenses, envelopes and barbed wire. In FPCA, pages 124--144, 1991. Google ScholarDigital Library
- C. Morgan. Programming from Specifications. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1990. ISBN 0-13-726225-6. Google ScholarDigital Library
- K. Munagala, U. Srivastava, and J. Widom. Optimization of continuous queries with shared expensive filters. PODS'07, pages 215--224. Google ScholarDigital Library
- D. G. Murray, M. Isard, and Y. Yu. Steno: automatic optimization of declarative queries. PLDI '11, pages 121--131, 2011. Google ScholarDigital Library
- D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: a timely dataflow system. In SOSP, pages 439--455, 2013. Google ScholarDigital Library
- L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed stream computing platform. ICDMW '10, pages 170--177, 2010. Google ScholarDigital Library
- C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD, pages 1099--1110, 2008. Google ScholarDigital Library
- J. Park and A. Segev. Using common subexpressions to optimize multiple queries. In ICDE, pages 311--319, 1988. Google ScholarDigital Library
- G. Press. Scaling users, not data: Sisense new take on machine learning and crowdsourcing. URL http://aka.ms/Fzo7sh.Google Scholar
- W. Pugh. The Omega Test: a fast and practical integer programming algorithm for dependence analysis. CACM, 8:4--13, 1992.Google Scholar
- T. K. Sellis. Multiple-query optimization. ACM Trans. Database Syst., 13(1):23--52, Mar. 1988. Google ScholarDigital Library
- Y. N. Silva, P. Larson, and J. Zhou. Exploiting common subexpressions for cloud query processing. ICDE, pages 1337--1348, 2012. Google ScholarDigital Library
- A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: a warehousing solution over a map-reduce framework. VLDB, 2(2):1626--1629, 2009. Google ScholarDigital Library
- S. D. Viglas. Just-in-time compilation for sql query processing. Proc. VLDB Endow., 6(11):1190--1191, Aug. 2013. Google ScholarDigital Library
- Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. Gunda, and J. Currey. Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. OSDI'08, pages 1--14. Google ScholarDigital Library
- J. Zhang, H. Zhou, R. Chen, X. Fan, Z. Guo, H. Lin, J. Y. Li, W. Lin, J. Zhou, and L. Zhou. Optimizing data shuffling in data-parallel computation by understanding user-defined functions. NSDI'12, pages 22--22. Google ScholarDigital Library
- J. Zhou, P.-A. Larson, J.-C. Freytag, and W. Lehner. Efficient exploitation of similar subexpressions for query processing. SIGMOD '07, pages 533--544, 2007. Google ScholarDigital Library
Index Terms
- Consolidation of queries with user-defined functions
Recommendations
Consolidation of queries with user-defined functions
PLDI '14: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and ImplementationMotivated by streaming and data analytics scenarios where many queries operate on the same data and perform similar computations, we propose program consolidation for merging multiple user-defined functions (UDFs) that operate on the same input. Program ...
Performance Analysis for Pareto-Optimal Green Consolidation Based on Virtual Machines Live Migration
Huge energy requirement of cloud data centers is prime concern. Dynamic Virtual Machine VM consolidation based on VM live migration to switched-off or put some of the under-loaded host Physical Machines PMs into a low power consumption mode can ...
Workload-aware database monitoring and consolidation
SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of dataIn most enterprises, databases are deployed on dedicated database servers. Often, these servers are underutilized much of the time. For example, in traces from almost 200 production servers from different organizations, we see an average CPU utilization ...
Comments