International Journal of Parallel Programming

Published: 01-02-2015

BADCO: Behavioral Application-Dependent Superscalar Core Models

Authors: Ricardo A. Velásquez, Pierre Michaud, André Seznec

Published in: International Journal of Parallel Programming | Issue 1/2015

Abstract

Microarchitecture research and development rely heavily on simulators. The ideal simulator would be simple and easy to develop, and it would be precise, accurate, and very fast. No such simulator exists, so microarchitects use different kinds of simulators at different stages of processor development, depending on whether accuracy or simulation speed matters most. Approximate microarchitecture models, which trade accuracy for simulation speed, are very useful for research and design space exploration, provided the loss of accuracy remains acceptable. Behavioral superscalar core modeling is one way to trade accuracy for simulation speed when the focus of the study is not the core itself. In this approach, a superscalar core is viewed as a black box emitting requests to the uncore at certain times. A behavioral core model can be connected to a detailed uncore model. Behavioral core models are built from detailed simulations. Once the time to build the model is amortized, significant simulation speedups can be obtained. We describe and study a new method for defining behavioral models for modern superscalar cores. The proposed Behavioral Application-Dependent Superscalar Core model, BADCO, predicts the execution time of a thread running on a superscalar core with an error below 10% in most cases. We show that BADCO is qualitatively accurate: it predicts how performance changes when the uncore changes. The simulation speedups we obtained are typically between one and two orders of magnitude.

Lee et al. used SimpleScalar sim-outorder [1] for their experiments.

If an L1 miss Y is data-dependent on a delayed L1 hit which is waiting for a cache line requested by a previous L1 miss X, then Y is considered data-dependent on X [2].
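The rule above can be sketched as a small resolution step. This is only an illustration, not the authors' implementation: the `Access` record, its field names, and the `miss_dependences` helper are hypothetical, and the sketch assumes each access lists the earlier accesses it is directly data-dependent on.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Access:
    """Hypothetical record for one L1 access in a trace."""
    name: str
    kind: str                            # "miss" or "delayed_hit"
    pending_miss: Optional[str] = None   # delayed hit: the miss whose line it waits for
    depends_on: List[str] = field(default_factory=list)  # direct data dependences


def miss_dependences(accesses):
    """Map each L1 miss to the L1 misses it is considered data-dependent on.

    A dependence on a delayed hit is redirected to the miss that the
    delayed hit is waiting for, as described in the note above.
    """
    by_name = {a.name: a for a in accesses}
    result = {}
    for a in accesses:
        if a.kind != "miss":
            continue
        deps = set()
        for d in a.depends_on:
            src = by_name[d]
            if src.kind == "miss":
                deps.add(src.name)
            elif src.kind == "delayed_hit" and src.pending_miss is not None:
                deps.add(src.pending_miss)
        result[a.name] = sorted(deps)
    return result


trace = [
    Access("X", "miss"),
    Access("H", "delayed_hit", pending_miss="X"),
    Access("Y", "miss", depends_on=["H"]),
]
deps = miss_dependences(trace)
```

Here `Y` depends directly only on the delayed hit `H`, but `H` waits for the line requested by `X`, so `Y` ends up recorded as dependent on `X`.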

The only exception is an L1 store miss, because store requests are processed at commit on x86 architectures.

The fetch stall models the pipeline flush on real architectures.

Feedback mechanisms in the prefetcher throttle prefetch requests when needed. For example, the prefetch controller can monitor the occupancy of the MSHR and the utilization of the bus to decide whether to issue prefetch requests.
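A throttling decision of this kind might look like the following sketch. The function name, its parameters, and both threshold values are assumptions made for illustration; the note does not say which thresholds or policy the simulated prefetcher actually uses.

```python
def should_issue_prefetch(mshr_used, mshr_size, bus_utilization,
                          mshr_threshold=0.75, bus_threshold=0.80):
    """Return True if a prefetch may be issued under the current load.

    mshr_used / mshr_size gives the MSHR occupancy; bus_utilization is a
    fraction in [0, 1]. Both thresholds are hypothetical tuning knobs.
    """
    mshr_occupancy = mshr_used / mshr_size
    return mshr_occupancy < mshr_threshold and bus_utilization < bus_threshold
```

With these illustrative defaults, a prefetch is suppressed as soon as either resource is heavily loaded, which leaves MSHR entries and bus bandwidth available for demand misses.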

We attach a DL1 miss request to the first \(\mu \)op (load or store) accessing that cache line. We attach a DL1 prefetch to the \(\mu \)op triggering the prefetch. We attach a write-back request to the same \(\mu \)op to which the request causing the write-back is attached.

The Zesto model implements next-line prefetching for the instructions, but does not pipeline the instruction misses. Node fetching mimics this behavior.

For instance, the Zesto simulator allocates an MSHR entry for each delayed hit (i.e., hits on a pending miss). Our BADCO machine does the same in our experiments. This is why we simulate an unlimited MSHR for generating trace TL, so as to capture all potential delayed hits in the trace.
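The effect of the MSHR size on which delayed hits are observed can be sketched with a toy model. The `MSHR` class below is hypothetical and far simpler than Zesto's; it only assumes, as stated above, that each miss and each delayed hit occupies one entry.

```python
class MSHR:
    """Toy MSHR in which misses and delayed hits each take one entry."""

    def __init__(self, capacity=None):
        self.capacity = capacity   # None models an unlimited MSHR
        self.entries = []          # list of (line_addr, kind)

    def access(self, line_addr):
        """Handle an access that missed in L1.

        Returns "miss", "delayed_hit", or "stall" when no entry is free.
        """
        if self.capacity is not None and len(self.entries) >= self.capacity:
            return "stall"
        pending = any(addr == line_addr for addr, _ in self.entries)
        kind = "delayed_hit" if pending else "miss"
        self.entries.append((line_addr, kind))
        return kind

    def fill(self, line_addr):
        """The line arrived: release every entry waiting for it."""
        self.entries = [(a, k) for a, k in self.entries if a != line_addr]


unlimited = MSHR()
seen_unlimited = [unlimited.access(a) for a in (0x40, 0x40, 0x40)]

limited = MSHR(capacity=2)
seen_limited = [limited.access(a) for a in (0x40, 0x40, 0x40)]
```

In this sketch the unlimited MSHR records both delayed hits on line `0x40`, while the two-entry MSHR stalls on the third access, so a trace generated with it would miss a potential delayed hit.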

Each request to the uncore is attached to a single \(\mu \)op.

D(X) is null (0) when there is no request \(\mu \)op whose CT is less than the IT of X. This happens only at the beginning of the trace.
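The null case can be illustrated with a short lookup. This sketch makes several assumptions that the note does not confirm: CT and IT are treated as plain per-\(\mu \)op timestamps, request \(\mu \)ops are numbered from 1 so that 0 can stand for null, and among the eligible request \(\mu \)ops the one with the largest CT is chosen. The `find_dependency` helper is hypothetical.

```python
def find_dependency(it_x, request_uops):
    """Return the index of D(X), or 0 if no request uop qualifies.

    it_x is the IT of X; request_uops is a list of (index, ct) pairs with
    1-based indices. Picking the largest CT below it_x is an assumption.
    """
    candidates = [(ct, idx) for idx, ct in request_uops if ct < it_x]
    if not candidates:
        return 0   # null: no request uop has a CT below the IT of X
    return max(candidates)[1]
```

Early in a trace, every recorded CT can still be larger than the IT of X, which is the situation where the lookup returns 0.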

The uncore simulator was extracted from Zesto.

1. Austin, T., Larson, E., Ernst, D.: SimpleScalar: an infrastructure for computer system modeling. IEEE Comput. 35(2), 59–67 (2002). http://www.simplescalar.com/

2. Chen, X.E., Aamodt, T.M.: Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs. In: Proceedings of the 41st International Symposium on Microarchitecture (2008)

3. Cho, S., Demetriades, S., Evans, S., Jin, L., Lee, H., Lee, K., Moeng, M.: TPTS: a novel framework for very fast manycore processor architecture simulation. In: Proceedings of the 37th International Conference on Parallel Processing (2008)

4. Durbhakula, M., Pai, V.S., Adve, S.: Improving the accuracy vs. speed tradeoff for simulating shared-memory multiprocessors with ILP processors. In: Proceedings of the 5th International Symposium on High-Performance Computer Architecture (1999)

5. Eyerman, S., Eeckhout, L., Karkhanis, T., Smith, J.E.: A mechanistic performance model for superscalar out-of-order processors. ACM Trans. Comput. Syst. 27(2) (2009)

6. Eyerman, S., Smith, J.E., Eeckhout, L.: Characterizing the branch misprediction penalty. In: Proceedings of the International Symposium on Performance Analysis of Systems and Software (2011)

7. Fields, B.A., Bodik, R., Hill, M.D., Newburn, C.J.: Using interaction costs for microarchitectural bottleneck analysis. In: Proceedings of the 36th International Symposium on Microarchitecture (2003)

8. Fields, B., Rubin, S., Bodik, R.: Focusing processor policies via critical-path prediction. In: Proceedings of the 28th International Symposium on Computer Architecture (2001)

9. Genbrugge, D., Eyerman, S., Eeckhout, L.: Interval simulation: raising the level of abstraction in architectural simulation. In: Proceedings of the 16th International Symposium on High-Performance Computer Architecture (2010)

10. Goldschmidt, S.R., Hennessy, J.L.: The accuracy of trace-driven simulations of multiprocessors. In: Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (1993)

11. İpek, E., McKee, S., Caruana, R., de Supinski, B., Schulz, M.: Efficiently Exploring Architectural Design Spaces Via Predictive Modeling, vol. 40. ACM (2006)

12. Joseph, P., Vaswani, K., Thazhuthaveetil, M.: Construction and use of linear regression models for processor performance analysis. In: High-Performance Computer Architecture, 2006. The Twelfth International Symposium on, pp. 99–108. IEEE (2006)

13. Kanaujia, S., Papazian, I.E., Chamberlain, J., Baxter, J.: FastMP: a multi-core simulation methodology. In: Workshop on Modeling, Benchmarking and Simulation (2006)

14. Karkhanis, T.S., Smith, J.E.: A first-order superscalar processor model. In: Proceedings of the 31st International Symposium on Computer Architecture (2004)

15. Lee, K., Cho, S.: In-N-Out: reproducing out-of-order superscalar processor behavior from reduced in-order traces. In: Proceedings of the IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS) (2011)

16. Lee, K., Evans, S., Cho, S.: Accurately approximating superscalar processor performance from traces. In: Proceedings of the International Symposium on Performance Analysis of Systems and Software (2009)

17. Li, Y., Lee, B., Brooks, D., Hu, Z., Skadron, K.: CMP design space exploration subject to physical constraints. In: Proceedings of the 12th International Symposium on High Performance Computer Architecture (2006)

18. Loh, G., Subramaniam, S., Xie, Y.: Zesto: a cycle-level simulator for highly detailed microarchitecture exploration. In: Proceedings of the International Symposium on Performance Analysis of Systems and Software (2009)

19. Loh, G.: A time-stamping algorithm for efficient performance estimation of superscalar processors. In: Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (2001)

20. Moses, J., Illikkal, R., Iyer, R., Huggahalli, R., Newell, D.: ASPEN: towards effective simulation of threads & engines in evolving platforms. In: Proceedings of the 12th IEEE/ACM International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (2004)

21. Mutlu, O., Kim, H., Armstrong, D., Patt, Y.: Understanding the effects of wrong-path memory references on processor performance. In: Proceedings of the 3rd Workshop on Memory Performance Issues: In Conjunction with the 31st International Symposium on Computer Architecture, pp. 56–64. ACM (2004)

22. Noonburg, D.B., Shen, J.P.: Theoretical modeling of superscalar processor performance. In: Proceedings of the 27th International Symposium on Microarchitecture (1994)

23. Rico, A., Duran, A., Cabarcas, F., Etsion, Y., Ramirez, A., Valero, M.: Trace-driven simulation of multithreaded applications. In: Proceedings of the International Symposium on Performance Analysis of Systems and Software (2011)

24. Ryckbosch, F., Polfliet, S., Eeckhout, L.: Fast, accurate, and validated full-system software simulation on x86 hardware. IEEE Micro 30(6), 46–56 (2010)

25. Sendag, R., Yilmazer, A., Yi, J., Uht, A.: Quantifying and reducing the effects of wrong-path memory references in cache-coherent multiprocessor systems. In: Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International, pp. 10-pp. IEEE (2006)

26. Sorin, D.J., Pai, V.S., Adve, S.V., Vernon, M.K., Wood, D.A.: Analytic evaluation of shared-memory systems with ILP processors. In: Proceedings of the 25th International Symposium on Computer Architecture (1998)

27. Zhao, L., Iyer, R., Moses, J., Illikkal, R., Makineni, S., Newell, D.: Exploring large-scale CMP architectures using ManySim. IEEE Micro 27(4), 21–33 (2007)

Title: BADCO: Behavioral Application-Dependent Superscalar Core Models
Authors: Ricardo A. Velásquez, Pierre Michaud, André Seznec
Publication date: 01-02-2015
Publisher: Springer US
Published in: International Journal of Parallel Programming / Issue 1/2015
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI: https://doi.org/10.1007/s10766-013-0278-1
