Abstract
Simulation is a key tool for computer architecture research. Cycle-accurate simulators in particular are essential for microarchitecture exploration and detailed design decisions, but they are slow and therefore unsuitable for simulating large-scale architectures, a use they were never intended for. Moreover, microarchitecture design decisions are irrelevant, or even misleading, in early processor design stages and high-level explorations. For such studies, one can therefore raise the abstraction level of the simulated architecture, and also that of the application, which does not necessarily have to be represented as an instruction stream.
In this paper we define a set of application abstraction levels and show how they are employed in TaskSim, a multi-core architecture simulator, to provide several architecture modeling abstractions and to simulate large-scale architectures with hundreds of cores. We compare the simulation speed of these abstraction levels with that of existing simulation tools, and also evaluate their utility and accuracy. Our simulations show that a very high-level abstraction, which may be even faster than native execution, is useful for scalability studies of parallel applications, and that simulating only explicit memory transfers yields accurate results for architectures using non-coherent scratchpad memories, with just a 25x slowdown compared to native execution. Furthermore, we revisit trace memory simulation techniques, which are more abstract than instruction-by-instruction simulation and provide an 18x simulation speedup.
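To make the idea of a very high-level application abstraction concrete, the following is a minimal sketch, not TaskSim's implementation. It assumes the application is reduced to a list of independent tasks with known durations, ignores task dependencies and memory contention, and uses greedy list scheduling to estimate how execution time scales with the number of cores. The function names `simulate_makespan` and `speedup_curve` are hypothetical.

```python
import heapq


def simulate_makespan(task_durations, num_cores):
    """Estimate total execution time of independent tasks on num_cores cores.

    Assumption: each task is dispatched, in order, to the core that
    becomes free first (greedy list scheduling). No dependencies or
    memory effects are modeled.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    # Min-heap holding the time at which each core becomes free.
    core_free_at = [0.0] * num_cores
    heapq.heapify(core_free_at)
    for duration in task_durations:
        start = heapq.heappop(core_free_at)  # earliest available core
        heapq.heappush(core_free_at, start + duration)
    # The run ends when the last core finishes its work.
    return max(core_free_at)


def speedup_curve(task_durations, core_counts):
    """Return {cores: speedup over one core} for each requested core count."""
    serial = simulate_makespan(task_durations, 1)
    return {n: serial / simulate_makespan(task_durations, n) for n in core_counts}


if __name__ == "__main__":
    durations = [4, 3, 2, 1]  # hypothetical task durations, arbitrary time units
    for cores, speedup in speedup_curve(durations, [1, 2, 4]).items():
        print(f"{cores} cores: speedup {speedup:.2f}")
```

Because such a model only processes one event per task rather than one per instruction, its cost grows with the number of tasks, which is why an abstraction at this level can plausibly run faster than the application itself.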
Index Terms
- On the simulation of large-scale architectures using multiple application abstraction levels