
CERE: LLVM-Based Codelet Extractor and REplayer for Piecewise Benchmarking and Optimization

Published: 16 April 2015

Abstract

This article presents Codelet Extractor and REplayer (CERE), an open-source framework for code isolation. CERE finds and extracts the hotspots of an application as isolated fragments of code, called codelets. Codelets can be modified, compiled, run, and measured independently from the original application. Code isolation reduces benchmarking cost and enables piecewise optimization of an application. Unlike previous approaches, CERE isolates code at the compiler Intermediate Representation (IR) level; it is therefore language agnostic and supports many input languages, such as C, C++, Fortran, and D. CERE automatically detects codelet invocations that have the same performance behavior. It then selects a reduced set of representative codelets and invocations that is much faster to replay yet still accurately captures the behavior of the original application. In addition, CERE supports recompiling and retargeting the extracted codelets, so it can be used for cross-architecture performance prediction or piecewise code optimization. On the SPEC 2006 FP benchmarks, CERE codelets cover 90.9% of the execution time and accurately replay 66.3% of it. We use CERE codelets in a realistic study to evaluate three different architectures on the NAS benchmarks. CERE accurately estimates each architecture's performance and is 7.3× to 46.6× cheaper than running the full benchmarks.
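The reduction described above rests on a simple idea: invocations of the same codelet that exhibit the same performance behavior need not all be replayed; one representative per behavior class, weighted by the class population, suffices to extrapolate the total execution time. The sketch below is an illustrative toy model of that idea, not CERE's actual implementation (CERE operates on LLVM IR and uses its own clustering); the tolerance-based greedy grouping and the example cycle counts are assumptions made for the example.

```python
def cluster_invocations(times, tol=0.1):
    """Greedily group invocation times that lie within `tol` relative
    distance of a cluster's representative (its first member)."""
    clusters = []  # list of (representative_time, member_times)
    for t in times:
        for rep, members in clusters:
            if abs(t - rep) <= tol * rep:
                members.append(t)
                break
        else:
            clusters.append((t, [t]))
    return clusters

def extrapolate(clusters):
    """Estimate total time by replaying only one invocation per
    cluster and weighting it by the cluster's population."""
    return sum(rep * len(members) for rep, members in clusters)

# Hypothetical per-invocation cycle counts for one codelet.
times = [100, 102, 98, 250, 255, 101, 99]
clusters = cluster_invocations(times)   # two behavior classes
estimate = extrapolate(clusters)        # replay 2 instead of 7 invocations
actual = sum(times)
```

Here only two representative invocations are replayed instead of seven, and the extrapolated total stays within the clustering tolerance of the actual total; the replay-cost savings grow with the number of invocations per behavior class.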




• Published in

  ACM Transactions on Architecture and Code Optimization, Volume 12, Issue 1
  April 2015, 201 pages
  ISSN: 1544-3566
  EISSN: 1544-3973
  DOI: 10.1145/2744295

            Copyright © 2015 ACM


            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 16 April 2015
            • Revised: 1 January 2015
            • Accepted: 1 January 2015
            • Received: 1 October 2014


            Qualifiers

            • research-article
            • Research
            • Refereed
