research-article

Bottleneck identification and scheduling in multithreaded applications

Authors:
José A. Joao

The University of Texas at Austin, Austin, TX, USA

The University of Texas at Austin, Austin, TX, USA
View Profile

,
M. Aater Suleman

Calxeda Inc., Austin, TX, USA

Calxeda Inc., Austin, TX, USA
View Profile

,
Onur Mutlu

Carnegie Mellon University, Pittsburgh, PA, USA

Carnegie Mellon University, Pittsburgh, PA, USA
View Profile

,
Yale N. Patt

The University of Texas at Austin, Austin, TX, USA

The University of Texas at Austin, Austin, TX, USA
View Profile

ASPLOS XVII: Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating SystemsMarch 2012Pages 223–234https://doi.org/10.1145/2150976.2151001

Published:03 March 2012Publication History

ASPLOS XVII: Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems

Pages 223–234

ABSTRACT

Performance of multithreaded applications is limited by a variety of bottlenecks, e.g. critical sections, barriers and slow pipeline stages. These bottlenecks serialize execution, waste valuable execution cycles, and limit scalability of applications. This paper proposes Bottleneck Identification and Scheduling in Multithreaded Applications (BIS), a cooperative software-hardware mechanism to identify and accelerate the most critical bottlenecks. BIS identifies which bottlenecks are likely to reduce performance by measuring the number of cycles threads have to wait for each bottleneck, and accelerates those bottlenecks using one or more fast cores on an Asymmetric Chip Multi-Processor (ACMP). Unlike previous work that targets specific bottlenecks, BIS can identify and accelerate bottlenecks regardless of their type. We compare BIS to four previous approaches and show that it outperforms the best of them by 15% on average. BIS' performance improvement increases as the number of cores and the number of fast cores in the system increase.

References

MySQL database engine 5.0.1. http://www.mysql.com.Google Scholar
SQLite database engine version 3.5.8. 2008.Google Scholar
SysBench: a system performance benchmark v0.4.8. http://sysbench.sourceforge.net.Google Scholar
G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS, 1967. Google ScholarDigital Library
M. Annavaram, E. Grochowski, and J. Shen. Mitigating Amdahl's law through EPI throttling. In ISCA, 2005. Google ScholarDigital Library
D. H. Bailey et al. The NAS parallel benchmarks. Technical Report RNR-94-007, NASA Ames Research Center, 1994.Google Scholar
A. Bhattacharjee and M. Martonosi. Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors. In ISCA, 2009. Google ScholarDigital Library
Q. Cai, J. González, R. Rakvic, G. Magklis, P. Chaparro, and A. González. Meeting points: using thread criticality to adapt multicore hardware to parallel regions. In PACT, 2008. Google ScholarDigital Library
E. Ebrahimi, R. Miftakhutdinov, C. Fallin, C. J. Lee, J. A. Joao, O. Mutlu, and Y. N. Patt. Parallel application memory scheduling. In MICRO, 2011. Google ScholarDigital Library
B. Fields, R. Bodik, and M. Hill. Slack: maximizing performance under technological constraints. In ISCA, 2002. Google ScholarDigital Library
B. Fields, S. Rubin, and R. Bodik. Focusing processor policies via critical-path prediction. In ISCA, 2001. Google ScholarDigital Library
M. Herlihy and J. E. B. Moss. Transactional memory: architectural support for lock-free data structures. In ISCA, 1993. Google ScholarDigital Library
M. D. Hill and M. R. Marty. Amdahl's law in Multicore Era. Technical Report CS-TR-2007--1593, Univ. of Wisconsin, 2007.Google Scholar
J. K. Hollingsworth. An online computation of critical path profiling. In SIGMETRICS Symposium on Parallel and Distributed Tools, 1996. Google ScholarDigital Library
A. H. Hormati, Y. Choi, M. Kudlur, R. Rabbah, T. Mudge, and S. Mahlke. Flextream: Adaptive compilation of streaming applications for heterogeneous architectures. In PACT, 2009. Google ScholarDigital Library
Intel. Source code for Intel threading building blocks.Google Scholar
Intel. IA-32 Intel Architecture Software Dev. Guide, 2008.Google Scholar
Intel. Getting Started with Intel Parallel Studio, 2011.Google Scholar
D. Koufaty, D. Reddy, and S. Hahn. Bias scheduling in heterogeneous multi-core architectures. In EuroSys, 2010. Google ScholarDigital Library
H. Kredel. Source code for traveling salesman problem (tsp). http://krum.rz.uni-mannheim.de/ba-pp-2007/java/index.html.Google Scholar
R. Kumar, D. Tullsen, N. Jouppi, and P. Ranganathan. Heterogeneous chip multiprocessors. IEEE Computer, 2005. Google ScholarDigital Library
N. B. Lakshminarayana, J. Lee, and H. Kim. Age based scheduling for asymmetric multiprocessors. In SC, 2009. Google ScholarDigital Library
T. Li, A. R. Lebeck, and D. J. Sorin. Quantifying instruction criticality for shared memory multiprocessors. In SPAA, 2003. Google ScholarDigital Library
J. F. Martínez and J. Torrellas. Speculative synchronization: applying thread-level speculation to explicitly parallel applications. In ASPLOS, 2002.Google ScholarDigital Library
J. M. Mellor-Crummey and M. L. Scott. Synchronization without contention. In ASPLOS, 1991. Google ScholarDigital Library
T. Morad, U. C. Weiser, A. Kolodny, M. Valero, and E. Ayguade. Performance, power efficiency and scalability of asymmetric cluster chip multiprocessors. Computer Architecture Letters, 2006. Google ScholarDigital Library
R. Narayanan, B. Ozisikyilmaz, J. Zambreno, G. Memik, and A. Choudhary. MineBench: A benchmark suite for data mining workloads. In IISWC, 2006.Google ScholarCross Ref
Y. Nishitani, K. Negishi, H. Ohta, and E. Nunohiro. Implementation and evaluation of OpenMP for Hitachi SR8000. In ISHPC, 2000. Google ScholarDigital Library
R. Rajwar and J. Goodman. Speculative lock elision: Enabling highly concurrent multithreaded execution. In MICRO, 2001. Google ScholarDigital Library
R. Rajwar and J. R. Goodman. Transactional lock-free execution of lock-based programs. In ASPLOS, 2002. Google ScholarDigital Library
L. Rudolph and Z. Segall. Dynamic decentralized cache schemes for MIMD parallel processors. In ISCA, 1984. Google ScholarDigital Library
J. C. Saez, M. Prieto, A. Fedorova, and S. Blagodurov. A comprehensive scheduler for asymmetric multicore systems. In EuroSys, 2010. Google ScholarDigital Library
The Standard Performance Evaluation Corporation. Welcome to SPEC. http://www.specbench.org/.Google Scholar
M. A. Suleman. An Asymmetric Multi-core Architecture for Efficiently Accelerating Critical Paths in Multithreaded Programs. PhD thesis, UT-Austin, 2010.Google Scholar
M. A. Suleman, O. Mutlu, J. A. Joao, Khubaib, and Y. N. Patt. Data marshaling for multi-core architectures. ISCA, 2010. Google ScholarDigital Library
M. A. Suleman, O. Mutlu, M. K. Qureshi, and Y. N. Patt. Accelerating critical section execution with asymmetric multi-core architectures. ASPLOS, 2009. Google ScholarDigital Library
M. A. Suleman, M. K. Qureshi, Khubaib, and Y. N. Patt. Feedback-directed pipeline parallelism. In PACT, 2010. Google ScholarDigital Library
Tornado Web Server. Source code. http://tornado.sourceforge.net/, 2008.Google Scholar
M. Waldvogel, G. Varghese, J. Turner, and B. Plattner. Scalable high speed IP routing lookups. In SIGCOMM, 1997. Google ScholarDigital Library
P. Zhao and J. N. Amaral. Ablego: a function outlining and partial inlining framework. Softw. Pract. Exper., 37(5):465--491, 2007. Google ScholarDigital Library

Index Terms

Bottleneck identification and scheduling in multithreaded applications
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data

Recommendations

Bottleneck identification and scheduling in multithreaded applications
ASPLOS '12

Performance of multithreaded applications is limited by a variety of bottlenecks, e.g. critical sections, barriers and slow pipeline stages. These bottlenecks serialize execution, waste valuable execution cycles, and limit scalability of applications. ...
Read More
Bottleneck identification and scheduling in multithreaded applications
ASPLOS '12

Performance of multithreaded applications is limited by a variety of bottlenecks, e.g. critical sections, barriers and slow pipeline stages. These bottlenecks serialize execution, waste valuable execution cycles, and limit scalability of applications. ...
Read More
Utility-based acceleration of multithreaded applications on asymmetric CMPs
ICSA '13

Asymmetric Chip Multiprocessors (ACMPs) are becoming a reality. ACMPs can speed up parallel applications if they can identify and accelerate code segments that are critical for performance. Proposals already exist for using coarse-grained thread ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
ASPLOS XVII: Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
March 2012
476 pages
ISBN:9781450307598
DOI:10.1145/2150976
General Chair:
Tim Harris
Microsoft Research
,
Program Chair:
Michael L. Scott
University of Rochester
ACM SIGARCH Computer Architecture News Volume 40, Issue 1
ASPLOS '12
March 2012
453 pages
ISSN:0163-5964
DOI:10.1145/2189750
Issue’s Table of Contents
ACM SIGPLAN Notices Volume 47, Issue 4
ASPLOS '12
April 2012
453 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2248487
Issue’s Table of Contents
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 3 March 2012
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
asymmetric CMPs
barriers
critical sections
heterogeneous CMPs
multicore
pipeline parallelism
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate535of2,713submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 128
  Total Citations
  View Citations
- 1,447
  Total Downloads
- Downloads (Last 12 months)64
- Downloads (Last 6 weeks)16
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Bottleneck identification and scheduling in multithreaded applications

ASPLOS XVII: Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems

ABSTRACT

References

Cited By

Index Terms

Recommendations

Bottleneck identification and scheduling in multithreaded applications

Bottleneck identification and scheduling in multithreaded applications

Utility-based acceleration of multithreaded applications on asymmetric CMPs