nach oben

The Journal of Supercomputing

Erschienen in:

13.05.2023

Reducing branch divergence to speed up parallel execution of unit testing on GPUs

verfasst von: Taghreed Bagies, Wei Le, Jeremy Sheaffer, Ali Jannesari

Erschienen in: The Journal of Supercomputing | Ausgabe 16/2023

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Software testing is an essential phase in the software development life cycle. One of the important types of software testing is unit testing and its execution is time-consuming and costly. Using parallelization to speed up the testing execution is beneficial and productive for programmers. To parallelize test execution, researchers can use GPU machines. In GPU applications, multiple threads execute in parallel within a group known as a warp. Branch divergence affects the performance of a warp negatively when some threads run a branch, and the other threads are idle waiting for the first set of threads to finish their execution. In this paper, we propose a novel algorithm to minimize branch divergence when testing an application on a GPU. We arrange test inputs based on the warp size of a GPU machine. Test inputs that have similar control flow paths are grouped within the same warp executing in parallel. Thus, the branch divergence is minimized per warp. We validate and evaluate our algorithm on six benchmarks (57 programs in total). Our approach accelerates the testing execution by up to 3.8x and improves the warp execution efficiency by up to 15x.

Vorheriger Artikel Fast algorithm for parallel solving inversion of large scale small matrices based on GPU

Nächster Artikel Long- and short-term collaborative attention networks for sequential recommendation

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

https://github.com/tbagies/GPU-BranchDivergence.

Yaneva V, Rajan A, Dubach C (2017) Compiler-assisted test acceleration on gpus for embedded software. In: Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis. ISSTA 2017, pp. 35–45. ACM, New York, NY, USA. https://doi.org/10.1145/3092703.3092720

Harrold MJ (2000) Testing: A roadmap. In: Proceedings of the Conference on The Future of Software Engineering. ICSE ’00, pp. 61–72. ACM, New York, NY, USA. https://doi.org/10.1145/336512.336532

Shete N, Jadhav A (2014) An empirical study of test cases in software testing. In: International Conference on Information Communication and Embedded Systems (ICICES2014), pp. 1–5. https://doi.org/10.1109/ICICES.2014.7033883

Sommerville I (2015) Software Engineering, 10th edn. Pearson, ???

Rothermel G, Untch RH, Chu C, Harrold MJ (2001) Prioritizing test cases for regression testing. IEEE Trans Software Eng 27(10):929–948. https://doi.org/10.1109/32.962562CrossRef

Gambi A, Kappler S, Lampel J, Zeller A (2017) Cut: Automatic unit testing in the cloud. In: Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis. ISSTA 2017, pp. 364–367. ACM, New York, NY, USA

Kappler S (2016) Finding and breaking test dependencies to speed up test execution. In: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. FSE 2016, pp. 1136–1138. ACM, New York, NY, USA. https://doi.org/10.1145/2950290.2983974

Liu C-H, Chen S-L, Chen W-K (2017) Cost-benefit evaluation on parallel execution for improving test efficiency over cloud. In: 2017 International Conference on Applied System Innovation (ICASI), pp. 199–202. https://doi.org/10.1109/ICASI.2017.7988384

Oriol M, Ullah F (2010) Yeti on the cloud. In: 2010 Third International Conference on Software Testing, Verification, and Validation Workshops, pp. 434–437. https://doi.org/10.1109/ICSTW.2010.68

10.

Parveen T, Tilley S, Daley N, Morales P (2009) Towards a distributed execution framework for junit test cases. In: 2009 IEEE International Conference on Software Maintenance, pp. 425–428. https://doi.org/10.1109/ICSM.2009.5306292

11.

Gambi A, Gorla A, Zeller A (2017) O!snap: Cost-efficient testing in the cloud. In: 2017 IEEE International Conference on Software Testing, Verification and Validation (ICST), pp. 454–459. https://doi.org/10.1109/ICST.2017.51

12.

von Hof V, Fuchs A (2018) Automatic scalable parallel test case execution. introducing the münster distributed test case runner for java (midstr). In: Proceedings of the 33rd Annual ACM Symposium on Applied Computing, pp. 1062–1064

13.

Koong C-S, Shih C-H, Wu C-C, Hsiung P-A (2013) The architecture of parallelized cloud-based automatic testing system. In: 2013 Seventh International Conference on Complex, Intelligent, and Software Intensive Systems, pp. 467–470. https://doi.org/10.1109/CISIS.2013.85

14.

Duarte A, Cirne W, Brasileiro F, Machado P (2006) Gridunit: software testing on the grid. In: Proceedings of the 28th International Conference on Software Engineering, pp. 779–782

15.

Rajan A, Sharma S, Schrammel P, Kroening D (2014) Accelerated test execution using gpus. In: Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering. ASE ’14, pp. 97–102. ACM, New York, NY, USA. https://doi.org/10.1145/2642937.2642957

16.

Han TD, Abdelrahman TS (2011) Reducing branch divergence in gpu programs. In: Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units. GPGPU-4, pp. 3–138. ACM, New York, NY, USA. https://doi.org/10.1145/1964179.1964184

17.

Zhang EZ, Jiang Y, Guo Z, Shen X (2010) Streamlining gpu applications on the fly: Thread divergence elimination through runtime thread-data remapping. In: Proceedings of the 24th ACM International Conference on Supercomputing. ICS ’10, pp. 115–126. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/1810085.1810104

18.

Yu Z, Eeckhout L, Xu C (2016) Thread similarity matrix: Visualizing branch divergence in gpgpu programs. In: 2016 45th International Conference on Parallel Processing (ICPP), pp. 179–184

19.

Coutinho B, Sampaio D, Pereira FMQ, Meira Jr W (2011) Divergence analysis and optimizations. In: 2011 International Conference on Parallel Architectures and Compilation Techniques, pp. 320–329. IEEE

20.

Sampaio D, Martins R, Collange S, Pereira FMQ (2012) Divergence analysis with affine constraints. In: 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing, pp. 67–74

21.

Kerr A, Diamos G, Yalamanchili S (2009) A characterization and analysis of ptx kernels, pp. 3–12. https://doi.org/10.1109/IISWC.2009.5306801

22.

Sartori J, Kumar R (2013) Branch and data herding: reducing control and memory divergence for error-tolerant gpu applications. IEEE Trans Multimedia 15(2):279–290CrossRef

23.

Vespa LVL (2018) Unraveling the divergence of gpu threads. In: 2018 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 1398–1403

24.

Chakroun I, Mezmaz M, Melab N, Bendjoudi A (2013) Reducing thread divergence in a gpu-accelerated branch-and-bound algorithm. Concurr Comput Pract Exp 25(8):1121–1136CrossRef

25.

Li Y, Liu R (2016) High throughput gpu polar decoder. In: 2016 2nd IEEE International Conference on Computer and Communications (ICCC), pp. 1123–1127

26.

Carrillo S, Siegel J, Li X (2009) A control-structure splitting optimization for gpgpu. In: Proceedings of the 6th ACM Conference on Computing Frontiers, pp. 147–150

27.

Reissmann N, Falch TL, Bjørnseth BA, Bahmann H, Meyer JC, Jahre M (2016) Efficient control flow restructuring for gpus. In: 2016 International Conference on High Performance Computing Simulation (HPCS), pp. 48–57

28.

Anantpur J, Govindarajan R (2014) Taming control divergence in gpus through control flow linearization. In: International Conference on Compiler Construction, pp. 133–153. Springer

29.

Zone ND (2021) CUDA Toolkit Documentation. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture. Accessed on 23 June 2021

30.

Gupta P (2020) CUDA Refresher: The CUDA Programming Model. https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/. Accessed on 27 June 2021

31.

Han TD, Abdelrahman TS (2011) hicuda: High-level gpgpu programming. IEEE Trans Parallel Distrib Syst 22(1):78–90. https://doi.org/10.1109/TPDS.2010.62CrossRef

32.

Lin Y, Grover V (2012) Using CUDA warp-level primitives. https://devblogs.nvidia.com/using-cuda-warp-level-primitives/. Accessed on 14 Oct 2018

33.

Workshop V (2019) Introduction to GPGPU and CUDA programming: thread divergence. https://cvw.cac.cornell.edu/gpu/thread_div. Accessed on 20 Aug 2019

34.

Srivastava A, Thiagarajan J (2002) Effectively prioritizing tests in development environment. In: Proceedings of the 2002 ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 97–106

35.

Wong WE, Horgan JR, London S, Agrawal H (1997) A study of effective regression testing in practice. In: Proceedings The Eighth International Symposium on Software Reliability Engineering. pp 264–274

36.

Beller M, Gousios G, Panichella A, Zaidman A (2015) When, how, and why developers (do not) test in their ides. In: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ESEC/FSE 2015, pp. 179–190. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/2786805.2786843

37.

Rothermel G, Untch RH, Chu C, Harrold MJ (1999) Test case prioritization: An empirical study. In: Proceedings IEEE International Conference on Software Maintenance-1999 (ICSM’99).’Software Maintenance for Business Change’(Cat. No. 99CB36360), pp. 179–188. IEEE

38.

Zhang S, Jalali D, Wuttke J, Muşlu K, Lam W, Ernst MD, Notkin D (2014) Empirically revisiting the test independence assumption. In: Proceedings of the 2014 International Symposium on Software Testing and Analysis, pp. 385–396

39.

Lam W, Zhang S, Ernst MD (2015) When tests collide: evaluating and coping with the impact of test dependence. University of Washington Department of Computer Science and Engineering, Tech, Rep

40.

Schwahn O, Coppik N, Winter S, Suri N (2019) Assessing the state and improving the art of parallel testing for c. In: Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 123–133

41.

Hu H, Jiang C-H, Ye F, Cai K-Y, Huang D, Yau SS (2010) A parallel implementation strategy of adaptive testing. In: 2010 IEEE 34th Annual Computer Software and Applications Conference Workshops, pp. 214–219. https://doi.org/10.1109/COMPSACW.2010.44

42.

Misailovic S, Milicevic A, Petrovic N, Khurshid S, Marinov D (2007) Parallel test generation and execution with korat. In: Proceedings of the the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, pp. 135–144

43.

Siddiqui JH, Khurshid S (2009) Pkorat: Parallel generation of structurally complex test inputs. In: 2009 International Conference on Software Testing Verification and Validation, pp. 250–259. https://doi.org/10.1109/ICST.2009.48

44.

Fung WWL, Sham I, Yuan G, Aamodt TM (2007) Dynamic warp formation and scheduling for efficient gpu control flow. In: 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007), pp. 407–420

45.

Brunie N, Collange S, Diamos G (2012) Simultaneous branch and warp interweaving for sustained gpu performance. In: 2012 39th Annual International Symposium on Computer Architecture (ISCA), pp. 49–60

46.

Rhu M, Erez M (2012) Capri: prediction of compaction-adequacy for handling control-divergence in gpgpu architectures. ACM SIGARCH Comput Arch News 40(3):61–71CrossRef

47.

Rhu M, Erez M (2013) Maximizing simd resource utilization in gpgpus with simd lane permutation. In: Proceedings of the 40th Annual International Symposium on Computer Architecture. pp. 356–367

48.

Fung WWL, Aamodt TM (2011) Thread block compaction for efficient simt control flow. In: Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture. HPCA ’11, pp. 25–36. IEEE Computer Society, USA

49.

Li B, Wei J, Guo W, Sun J (2015) Improving simd utilization with thread-lane shuffled compaction in gpgpu. Chin J Electron 24:684–688. https://doi.org/10.1049/cje.2015.10.004CrossRef

50.

Yang H, Chen S, Wan J, Xu X (2015) Divergent branch threads compaction for efficient simd control flow. Chin J Electron 24(2):288–294CrossRef

51.

Narasiman V, Shebanow M, Lee CJ, Miftakhutdinov R, Mutlu O, Patt YN (2011) Improving gpu performance via large warps and two-level warp scheduling. In: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-44, pp. 308–317. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/2155620.2155656

52.

Meng J, Tarjan D, Skadron K (2010) Dynamic warp subdivision for integrated branch and memory divergence tolerance. In: Proceedings of the 37th Annual International Symposium on Computer Architecture. pp. 235–246

53.

Tarjan D, Meng J, Skadron K (2009) Increasing memory miss tolerance for simd cores. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. pp. 1–11

54.

Emmery C (2017) Euclidean vs. cosine distance. https://cmry.github.io/notes/euclidean-v-cosine. Accessed on 27 June 2021

55.

Ladd JR (2020) Understanding and using common similarity measures for text analysis. https://programminghistorian.org/en/lessons/common-similarity-measures. Accessed on 11 Jul 2021

56.

Nvidia (2018) cuda-c-programming-guide. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications__technical-specifications-per-compute-capability. Accessed on 28 Jul 2018

57.

die.net (2011) clock_gettime(3); Linux man page. https://linux.die.net/man/3/clock_gettime. Accessed on 16 Oct 2018

58.

Yang C-T, Huang C-L, Lin C-F (2011) Hybrid cuda, openmp and mpi parallel programming on multicore gpu clusters. Comput Phys Commun. 182(1):266–269. https://doi.org/10.1016/j.cpc.2010.06.035. Computer Physics Communications Special Edition for Conference on Computational Physics Kaohsiung, Taiwan, Dec 15–19, 2009

59.

Harris M (2011) How to Implement Performance Metrics in CUDA C/C++. https://devblogs.nvidia.com/how-implement-performance-metrics-cuda-cc/. Accessed on 16 Oct 2018

60.

NVIDIA (2012) NVIDIA CUDA C Programming Guide. https://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf. Accessed on 26 Jan 2019

61.

Pouchet LN (2016) PolyBench/C 3.2. http://polybench.sourceforge.net. Accessed on 14 Oct 2018

62.

Kalliamvakou E, Damian D, Blincoe K, Singer L, German DM (2015) Open source-style collaborative development practices in commercial projects using github. In: Proceedings of the 37th International Conference on Software Engineering: Volume 1. ICSE ’15, pp. 574–585. IEEE Press, Piscataway, NJ, USA. http://dl.acm.org/citation.cfm?id=2818754.2818825

63.

Quijada M (2014) image-manipulation-in-c. https://github.com/mauryquijada/image-manipulation-in-c.git. Accessed on 29 Jul 2018

64.

Yerburgh E (2017) c-sorting-algorithms. https://github.com/eddyerburgh/c-sorting-algorithms/tree/master/algorithms. Accessed on 17 Aug 2019

65.

Felipe L (2018) VAR-solutions. https://github.com/luizok/GraphAlgorithms/blob/master/graphalgs.c. Accessed on 14 Jan 2020

66.

Varshney R (2018) VAR-solutions. https://github.com/VAR-solutions/Algorithms/tree/dev/Dynamic%20Programming. Accessed on 14 Jan 2020

67.

Corporation N (2019) Nsight Compute CLI. https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html. Accessed on 4 Sep 2019

68.

NVIDIA (2018) NVIDIACUDA Toolkit Documentation. https://docs.nvidia.com/cuda/profiler-users-guide/index.html#metrics-reference. Accessed on 14 Oct 2018

69.

Corporation N (2019) Nsight Compute CLI-5.3. Metric Comparison. https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#nvprof-metric-comparison. Accessed on 4 Sept 2019

70.

Family SP (2021) Sonargraph-Architect. https://www.hello2morrow.com/products/sonargraph/architect9. Accessed on 16 Nov 2021

71.

Whitehead N, Fit-Florea A (2011) Precision & performance: floating point and ieee 754 compliance for nvidia gpus. rn (A+ B) 21(1):18749–19424

72.

NVIDIA (2017) NVIDIA TESLA V100 GPU ARCHITECTURE. https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf. Accessed on 28 Apr 2023

73.

Repository S-aI: SIR Usage Information. https://sir.csc.ncsu.edu/portal/usage.php. Accessed on 3 Apr 2019

74.

Bagies T, Jannesari A (2021) An empirical study of parallelizing test execution using cuda unified memory and openmp gpu offloading. In: 2021 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pp. 271–278. https://doi.org/10.1109/ICSTW52544.2021.00052

75.

Zhang L (2018) Hybrid regression test selection. In: 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), pp. 199–209

76.

Marijan D, Liaaen M (2018) Practical selective regression testing with effective redundancy in interleaved tests. In: 2018 IEEE/ACM 40th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP), pp. 153–162

Titel: Reducing branch divergence to speed up parallel execution of unit testing on GPUs
verfasst von: Taghreed Bagies
Wei Le
Jeremy Sheaffer
Ali Jannesari
Publikationsdatum: 13.05.2023
Verlag: Springer US
Erschienen in: The Journal of Supercomputing / Ausgabe 16/2023
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-023-05375-0

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Springer Professional "Wirtschaft+Technik"

Weitere Artikel der Ausgabe 16/2023

Network accelerator for parallel discrete event simulations

Efficient frameworks for statistical seizure detection and prediction

AHP evaluation of rigorous and agile IT service design-building phases-workflows in data centers

Identify spatio-temporal properties of network traffic by model checking

SiMAIM: identifying sockpuppets and puppetmasters on a single forum-oriented social media site

CALYOLOv4: lightweight YOLOv4 target detection based on coordinated attention

Premium Partner