Abstract
This article presents Codelet Extractor and REplayer (CERE), an open-source framework for code isolation. CERE finds and extracts the hotspots of an application as isolated fragments of code, called codelets. Codelets can be modified, compiled, run, and measured independently from the original application. Code isolation reduces benchmarking cost and allows piecewise optimization of an application. Unlike previous approaches, CERE isolates code at the compiler Intermediate Representation (IR) level. CERE is therefore language agnostic and supports many input languages, such as C, C++, Fortran, and D. CERE automatically detects codelet invocations that share the same performance behavior, then selects a reduced set of representative codelets and invocations that is much faster to replay yet still accurately captures the behavior of the original application. In addition, CERE supports recompiling and retargeting the extracted codelets, so it can be used for cross-architecture performance prediction or piecewise code optimization. On the SPEC 2006 FP benchmarks, CERE codelets cover 90.9% and accurately replay 66.3% of the execution time. We use CERE codelets in a realistic study to evaluate three different architectures on the NAS benchmarks. CERE accurately estimates each architecture's performance and is 7.3× to 46.6× cheaper than running the full benchmarks.
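The reduction described above rests on a simple idea: if many invocations of a codelet behave alike, replaying one representative per behavior class (weighted by class size) is enough to estimate the full execution time. The sketch below illustrates that idea with a naive greedy clustering over per-invocation cycle counts; it is not CERE's actual algorithm or API, and all names and the tolerance threshold are hypothetical.

```python
# Illustrative sketch only (not CERE's implementation): cluster codelet
# invocations whose timings are within a relative tolerance, keep one
# representative per cluster, and weight it by the cluster size.

def select_representatives(invocation_cycles, tolerance=0.1):
    """Greedily group timings that differ from a cluster representative
    by less than `tolerance` (relative). Returns (representative, count)
    pairs; only the representatives need to be replayed."""
    clusters = []  # list of (representative_cycles, invocation_count)
    for cycles in invocation_cycles:
        for i, (rep, count) in enumerate(clusters):
            if abs(cycles - rep) <= tolerance * rep:
                clusters[i] = (rep, count + 1)
                break
        else:
            clusters.append((cycles, 1))
    return clusters

def predicted_total(clusters):
    """Estimate whole-application cost from the weighted representatives."""
    return sum(rep * count for rep, count in clusters)

# Example: six invocations with two behaviors (cold first call, then steady state).
timings = [1000, 510, 505, 500, 495, 505]
reps = select_representatives(timings)
estimate = predicted_total(reps)  # replay 2 invocations instead of 6
```

Here only two invocations are replayed instead of six; the replay-cost savings reported in the abstract come from the same kind of reduction applied per codelet across the whole application.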
Supplemental Material
Slide deck associated with this paper (available for download).