ABSTRACT
Developers often use a virtual platform to develop software before the hardware is available. For software optimization, it is important to profile the cache misses of applications in a realistic operating environment on the virtual platform. In the multicore era, however, coherence cache misses are hard to simulate at high speed. In this paper, we propose a hardware-accelerated architecture for simulating the cache misses of a multicore system. We implement the cache miss simulator on top of a virtual platform with an FPGA, so that users can profile their software as if it were running on the multicore system. The evaluation shows that the throughput reaches 65 MB of trace log per second when the FPGA runs at 100 MHz, with about 570,000 logic elements occupied to simulate four L1 caches and one L2 cache for a multicore system with 4 virtual CPUs. The system achieves a speedup of 1.6 to 2 times over the popular cache miss simulator Dinero IV. Note that Dinero IV does less work and does not support coherence cache misses in a multicore system. The evaluation results show that a hardware-accelerated architecture based on an FPGA offers a clear advantage in speeding up cache miss simulation for multicore systems.
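To make the term "coherence cache miss" concrete, the following is a minimal software sketch of trace-driven simulation that tells ordinary misses apart from misses caused by another CPU's write. It is not the paper's FPGA design. The trace format `(cpu, op, addr)`, the direct-mapped private L1 caches, the write-invalidate policy, and the constants `LINE_BYTES` and `NUM_SETS` are all assumptions chosen for illustration, and the L2 cache is left out.

```python
from collections import Counter

LINE_BYTES = 32  # assumed cache line size in bytes
NUM_SETS = 4     # assumed number of slots in each direct-mapped L1
NUM_CPUS = 4     # the abstract's configuration uses 4 virtual CPUs


def simulate(trace):
    """Count hits, ordinary misses, and coherence misses per CPU.

    trace: iterable of (cpu, op, addr) tuples, where op is "R" or "W".
    Returns a Counter keyed by (cpu, kind).
    """
    # caches[cpu][index] is the line number resident in that slot, or None.
    caches = [[None] * NUM_SETS for _ in range(NUM_CPUS)]
    # invalidated[cpu][index] is the line last removed from that slot by
    # another CPU's write, or None if the slot was not invalidated.
    invalidated = [[None] * NUM_SETS for _ in range(NUM_CPUS)]
    stats = Counter()

    for cpu, op, addr in trace:
        line = addr // LINE_BYTES
        index = line % NUM_SETS

        if caches[cpu][index] == line:
            stats[(cpu, "hit")] += 1
        else:
            # The miss counts as a coherence miss only if this exact line
            # was invalidated here and nothing has refilled the slot since.
            if invalidated[cpu][index] == line:
                stats[(cpu, "coherence_miss")] += 1
            else:
                stats[(cpu, "miss")] += 1
            caches[cpu][index] = line
            invalidated[cpu][index] = None

        if op == "W":
            # Write-invalidate: drop every other CPU's copy of the line.
            for other in range(NUM_CPUS):
                if other != cpu and caches[other][index] == line:
                    caches[other][index] = None
                    invalidated[other][index] = line

    return stats


if __name__ == "__main__":
    demo_trace = [
        (0, "R", 0x100),  # CPU 0 loads the line: ordinary miss
        (1, "R", 0x100),  # CPU 1 loads the same line: ordinary miss
        (0, "W", 0x100),  # CPU 0 writes: hit, and CPU 1's copy is invalidated
        (1, "R", 0x100),  # CPU 1 reloads: coherence miss
        (0, "R", 0x100),  # CPU 0 still holds the line: hit
    ]
    for key, count in sorted(simulate(demo_trace).items()):
        print(key, count)
```

A single-cache simulator would report the fourth access as a hit, because it cannot see the write from the other CPU. Capturing that interaction requires keeping the caches in step with one another on every write.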
Index Terms
- Hardware-accelerated cache simulation for multicore by FPGA