ABSTRACT
Performance of multithreaded applications is limited by a variety of bottlenecks, e.g. critical sections, barriers and slow pipeline stages. These bottlenecks serialize execution, waste valuable execution cycles, and limit scalability of applications. This paper proposes Bottleneck Identification and Scheduling in Multithreaded Applications (BIS), a cooperative software-hardware mechanism to identify and accelerate the most critical bottlenecks. BIS identifies which bottlenecks are likely to reduce performance by measuring the number of cycles threads have to wait for each bottleneck, and accelerates those bottlenecks using one or more fast cores on an Asymmetric Chip Multi-Processor (ACMP). Unlike previous work that targets specific bottlenecks, BIS can identify and accelerate bottlenecks regardless of their type. We compare BIS to four previous approaches and show that it outperforms the best of them by 15% on average. BIS' performance improvement increases as the number of cores and the number of fast cores in the system increase.
- MySQL database engine 5.0.1. http://www.mysql.com.Google Scholar
- SQLite database engine version 3.5.8. 2008.Google Scholar
- SysBench: a system performance benchmark v0.4.8. http://sysbench.sourceforge.net.Google Scholar
- G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS, 1967. Google ScholarDigital Library
- M. Annavaram, E. Grochowski, and J. Shen. Mitigating Amdahl's law through EPI throttling. In ISCA, 2005. Google ScholarDigital Library
- D. H. Bailey et al. The NAS parallel benchmarks. Technical Report RNR-94-007, NASA Ames Research Center, 1994.Google Scholar
- A. Bhattacharjee and M. Martonosi. Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors. In ISCA, 2009. Google ScholarDigital Library
- Q. Cai, J. González, R. Rakvic, G. Magklis, P. Chaparro, and A. González. Meeting points: using thread criticality to adapt multicore hardware to parallel regions. In PACT, 2008. Google ScholarDigital Library
- E. Ebrahimi, R. Miftakhutdinov, C. Fallin, C. J. Lee, J. A. Joao, O. Mutlu, and Y. N. Patt. Parallel application memory scheduling. In MICRO, 2011. Google ScholarDigital Library
- B. Fields, R. Bodik, and M. Hill. Slack: maximizing performance under technological constraints. In ISCA, 2002. Google ScholarDigital Library
- B. Fields, S. Rubin, and R. Bodik. Focusing processor policies via critical-path prediction. In ISCA, 2001. Google ScholarDigital Library
- M. Herlihy and J. E. B. Moss. Transactional memory: architectural support for lock-free data structures. In ISCA, 1993. Google ScholarDigital Library
- M. D. Hill and M. R. Marty. Amdahl's law in Multicore Era. Technical Report CS-TR-2007--1593, Univ. of Wisconsin, 2007.Google Scholar
- J. K. Hollingsworth. An online computation of critical path profiling. In SIGMETRICS Symposium on Parallel and Distributed Tools, 1996. Google ScholarDigital Library
- A. H. Hormati, Y. Choi, M. Kudlur, R. Rabbah, T. Mudge, and S. Mahlke. Flextream: Adaptive compilation of streaming applications for heterogeneous architectures. In PACT, 2009. Google ScholarDigital Library
- Intel. Source code for Intel threading building blocks.Google Scholar
- Intel. IA-32 Intel Architecture Software Dev. Guide, 2008.Google Scholar
- Intel. Getting Started with Intel Parallel Studio, 2011.Google Scholar
- D. Koufaty, D. Reddy, and S. Hahn. Bias scheduling in heterogeneous multi-core architectures. In EuroSys, 2010. Google ScholarDigital Library
- H. Kredel. Source code for traveling salesman problem (tsp). http://krum.rz.uni-mannheim.de/ba-pp-2007/java/index.html.Google Scholar
- R. Kumar, D. Tullsen, N. Jouppi, and P. Ranganathan. Heterogeneous chip multiprocessors. IEEE Computer, 2005. Google ScholarDigital Library
- N. B. Lakshminarayana, J. Lee, and H. Kim. Age based scheduling for asymmetric multiprocessors. In SC, 2009. Google ScholarDigital Library
- T. Li, A. R. Lebeck, and D. J. Sorin. Quantifying instruction criticality for shared memory multiprocessors. In SPAA, 2003. Google ScholarDigital Library
- J. F. Martínez and J. Torrellas. Speculative synchronization: applying thread-level speculation to explicitly parallel applications. In ASPLOS, 2002.Google ScholarDigital Library
- J. M. Mellor-Crummey and M. L. Scott. Synchronization without contention. In ASPLOS, 1991. Google ScholarDigital Library
- T. Morad, U. C. Weiser, A. Kolodny, M. Valero, and E. Ayguade. Performance, power efficiency and scalability of asymmetric cluster chip multiprocessors. Computer Architecture Letters, 2006. Google ScholarDigital Library
- R. Narayanan, B. Ozisikyilmaz, J. Zambreno, G. Memik, and A. Choudhary. MineBench: A benchmark suite for data mining workloads. In IISWC, 2006.Google ScholarCross Ref
- Y. Nishitani, K. Negishi, H. Ohta, and E. Nunohiro. Implementation and evaluation of OpenMP for Hitachi SR8000. In ISHPC, 2000. Google ScholarDigital Library
- R. Rajwar and J. Goodman. Speculative lock elision: Enabling highly concurrent multithreaded execution. In MICRO, 2001. Google ScholarDigital Library
- R. Rajwar and J. R. Goodman. Transactional lock-free execution of lock-based programs. In ASPLOS, 2002. Google ScholarDigital Library
- L. Rudolph and Z. Segall. Dynamic decentralized cache schemes for MIMD parallel processors. In ISCA, 1984. Google ScholarDigital Library
- J. C. Saez, M. Prieto, A. Fedorova, and S. Blagodurov. A comprehensive scheduler for asymmetric multicore systems. In EuroSys, 2010. Google ScholarDigital Library
- The Standard Performance Evaluation Corporation. Welcome to SPEC. http://www.specbench.org/.Google Scholar
- M. A. Suleman. An Asymmetric Multi-core Architecture for Efficiently Accelerating Critical Paths in Multithreaded Programs. PhD thesis, UT-Austin, 2010.Google Scholar
- M. A. Suleman, O. Mutlu, J. A. Joao, Khubaib, and Y. N. Patt. Data marshaling for multi-core architectures. ISCA, 2010. Google ScholarDigital Library
- M. A. Suleman, O. Mutlu, M. K. Qureshi, and Y. N. Patt. Accelerating critical section execution with asymmetric multi-core architectures. ASPLOS, 2009. Google ScholarDigital Library
- M. A. Suleman, M. K. Qureshi, Khubaib, and Y. N. Patt. Feedback-directed pipeline parallelism. In PACT, 2010. Google ScholarDigital Library
- Tornado Web Server. Source code. http://tornado.sourceforge.net/, 2008.Google Scholar
- M. Waldvogel, G. Varghese, J. Turner, and B. Plattner. Scalable high speed IP routing lookups. In SIGCOMM, 1997. Google ScholarDigital Library
- P. Zhao and J. N. Amaral. Ablego: a function outlining and partial inlining framework. Softw. Pract. Exper., 37(5):465--491, 2007. Google ScholarDigital Library
Index Terms
- Bottleneck identification and scheduling in multithreaded applications
Recommendations
Bottleneck identification and scheduling in multithreaded applications
ASPLOS '12Performance of multithreaded applications is limited by a variety of bottlenecks, e.g. critical sections, barriers and slow pipeline stages. These bottlenecks serialize execution, waste valuable execution cycles, and limit scalability of applications. ...
Bottleneck identification and scheduling in multithreaded applications
ASPLOS '12Performance of multithreaded applications is limited by a variety of bottlenecks, e.g. critical sections, barriers and slow pipeline stages. These bottlenecks serialize execution, waste valuable execution cycles, and limit scalability of applications. ...
Utility-based acceleration of multithreaded applications on asymmetric CMPs
ICSA '13Asymmetric Chip Multiprocessors (ACMPs) are becoming a reality. ACMPs can speed up parallel applications if they can identify and accelerate code segments that are critical for performance. Proposals already exist for using coarse-grained thread ...
Comments