Abstract
This article presents Codelet Extractor and REplayer (CERE), an open-source framework for code isolation. CERE finds and extracts the hotspots of an application as isolated fragments of code, called codelets. Codelets can be modified, compiled, run, and measured independently from the original application. Code isolation reduces benchmarking cost and allows piecewise optimization of an application. Unlike previous approaches, CERE isolates code at the compiler Intermediate Representation (IR) level. CERE is therefore language agnostic and supports many input languages, such as C, C++, Fortran, and D. CERE automatically detects codelet invocations that share the same performance behavior, then selects a reduced set of representative codelets and invocations that is much faster to replay yet still accurately captures the behavior of the original application. In addition, CERE supports recompiling and retargeting the extracted codelets, so it can be used for cross-architecture performance prediction or piecewise code optimization. On the SPEC 2006 FP benchmarks, CERE codelets cover 90.9% and accurately replay 66.3% of the execution time. We use CERE codelets in a realistic study to evaluate three different architectures on the NAS benchmarks. CERE accurately estimates each architecture's performance and is 7.3× to 46.6× cheaper than running the full benchmarks.
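The reduction described above rests on a simple idea: if many invocations of a codelet behave alike, replaying one representative per behavior class (weighted by class size) is enough to estimate the full execution time. The sketch below illustrates that idea with a naive greedy clustering over per-invocation cycle counts; it is not CERE's actual algorithm or API, and all names and the tolerance threshold are hypothetical.

```python
# Illustrative sketch only (not CERE's implementation): cluster codelet
# invocations whose timings are within a relative tolerance, keep one
# representative per cluster, and weight it by the cluster size.

def select_representatives(invocation_cycles, tolerance=0.1):
    """Greedily group timings that differ from a cluster representative
    by less than `tolerance` (relative). Returns (representative, count)
    pairs; only the representatives need to be replayed."""
    clusters = []  # list of (representative_cycles, invocation_count)
    for cycles in invocation_cycles:
        for i, (rep, count) in enumerate(clusters):
            if abs(cycles - rep) <= tolerance * rep:
                clusters[i] = (rep, count + 1)
                break
        else:
            clusters.append((cycles, 1))
    return clusters

def predicted_total(clusters):
    """Estimate whole-application cost from the weighted representatives."""
    return sum(rep * count for rep, count in clusters)

# Example: six invocations with two behaviors (cold first call, then steady state).
timings = [1000, 510, 505, 500, 495, 505]
reps = select_representatives(timings)
estimate = predicted_total(reps)  # replay 2 invocations instead of 6
```

Here only two invocations are replayed instead of six; the replay-cost savings reported in the abstract come from the same kind of reduction applied per codelet across the whole application.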
Supplemental Material
Slide deck associated with this paper (available for download).