research-article

Hardware-accelerated cache simulation for multicore by FPGA

Authors:
Shih-Hao Hung, National Taiwan University, Taipei, Taiwan
Yi-Mo Ho, National Taiwan University, Taipei, Taiwan
Chih-Wei Yeh, National Taiwan University, Taipei, Taiwan
Cheng-Yueh Liu, National Taiwan University, Taipei, Taiwan
Chen-Pang Lee, National Taiwan University, Taipei, Taiwan

RACS '18: Proceedings of the 2018 Conference on Research in Adaptive and Convergent Systems, October 2018, Pages 231–236
https://doi.org/10.1145/3264746.3264766

Published: 09 October 2018

ABSTRACT

Developers often use a virtual platform to develop software before the hardware is available. For software optimization, it is important to profile the cache misses of applications in a realistic operating environment on the virtual platform. In the multicore era, however, coherence cache misses are hard to simulate at high speed. In this paper, we propose a hardware-accelerated architecture that simulates the cache misses of a multicore system, and we implement the cache miss simulator with an FPGA on top of a virtual platform. Users can profile their software as if it were running on the multicore system. In our evaluation, the simulator processes 65 MB of trace log per second with the FPGA running at 100 MHz, and it occupies about 570,000 logic elements to simulate four L1 caches and one L2 cache for a multicore system with 4 virtual CPUs. The system achieves a speedup of 1.6 to 2 times over the popular cache miss simulator Dinero IV, even though Dinero IV does less work and does not support coherence cache misses in multicore systems. The results show that a hardware-accelerated architecture on an FPGA offers a clear advantage in speeding up cache miss simulation for multicore systems.
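
To make the notion of a coherence cache miss concrete, the following is a minimal software sketch of trace-driven simulation for private L1 caches that share one L2 cache. It is not the FPGA design described in the abstract. The trace format `(cpu, op, addr)`, the 64-byte line size, the fully associative LRU caches, and the write-invalidate policy are all assumptions chosen for illustration, and the names `Cache` and `simulate` are hypothetical.

```python
from collections import OrderedDict

LINE_BYTES = 64  # assumed cache-line size (not stated in the abstract)


class Cache:
    """Fully associative LRU cache of line addresses (illustrative only)."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()

    def lookup(self, line):
        """Return True on a hit and mark the line as most recently used."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return True
        return False

    def fill(self, line):
        """Insert a line, evicting the least recently used one if full."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)
        self.lines[line] = True

    def invalidate(self, line):
        """Drop a line; return True if it was present."""
        return self.lines.pop(line, None) is not None


def simulate(trace, num_cpus=4, l1_lines=4, l2_lines=16):
    """Count L1, L2, and coherence misses for a (cpu, op, addr) trace.

    Assumed policy: a write by one CPU invalidates the line in every
    other CPU's L1. A later miss on such a line counts as a coherence miss.
    """
    l1 = [Cache(l1_lines) for _ in range(num_cpus)]
    l2 = Cache(l2_lines)
    invalidated = [set() for _ in range(num_cpus)]
    stats = {"l1_miss": 0, "l2_miss": 0, "coherence_miss": 0}

    for cpu, op, addr in trace:
        line = addr // LINE_BYTES
        if not l1[cpu].lookup(line):
            stats["l1_miss"] += 1
            if line in invalidated[cpu]:
                stats["coherence_miss"] += 1
                invalidated[cpu].discard(line)
            if not l2.lookup(line):
                stats["l2_miss"] += 1
                l2.fill(line)
            l1[cpu].fill(line)
        if op == "W":
            for other in range(num_cpus):
                if other != cpu and l1[other].invalidate(line):
                    invalidated[other].add(line)
    return stats


# CPU 0 reads a line, CPU 1 reads and then writes it, CPU 0 reads it again.
demo = [(0, "R", 0x100), (1, "R", 0x100), (1, "W", 0x100), (0, "R", 0x100)]
print(simulate(demo))  # {'l1_miss': 3, 'l2_miss': 1, 'coherence_miss': 1}
```

In the demo trace, the last read by CPU 0 misses only because CPU 1's write invalidated the line, which is the kind of miss a uniprocessor simulator cannot report.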

Index Terms

Hardware-accelerated cache simulation for multicore by FPGA
1. Applied computing
  1. Computers in other domains
    1. Personal computers and PC applications
      1. Microcomputers
2. Computing methodologies
  1. Modeling and simulation

Published in
RACS '18: Proceedings of the 2018 Conference on Research in Adaptive and Convergent Systems
October 2018
355 pages
ISBN: 9781450358859
DOI: 10.1145/3264746
Conference Chair: Chih-Cheng Hung, Kennesaw State University
General Chair: Lamjed Ben Said, University of Tunis, Tunisia
Copyright © 2018 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 9 October 2018
Author Tags
FPGA
cache simulation
multicore
Acceptance Rates
Overall Acceptance Rate: 393 of 1,581 submissions, 25%