Published in: The Journal of Supercomputing 4/2021

04.09.2020

Bayesian neural networks at scale: a performance analysis and pruning study

Authors: Himanshu Sharma, Elise Jennings



Abstract

Bayesian neural networks (BNNs) are a promising method of obtaining statistical uncertainties for neural network predictions, but they come with a higher computational overhead that can limit their practical usage. This work explores the use of high-performance computing with distributed training to address the challenges of training BNNs at scale. We present a performance and scalability comparison of training the VGG-16 and Resnet-18 models on a Cray-XC40 cluster. We demonstrate that network pruning can speed up inference without accuracy loss and provide an open-source software package, BPrune, to automate this pruning. For certain models we find that pruning up to 80% of the network results in only a 7.0% loss in accuracy. With the development of new hardware accelerators for deep learning, BNNs are of considerable interest for benchmarking performance. This analysis of training a BNN at scale outlines the limitations and benefits compared to a conventional neural network.
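For context on the overhead mentioned above, the following is a minimal sketch, not the authors' VGG-16 or Resnet-18 implementation, of a small Bayesian convolutional block built with the TensorFlow Probability Flipout layers cited in the references. Each Flipout layer learns a mean and scale for every weight and registers a KL divergence term, which is the source of the extra compute and memory cost relative to a conventional CNN. The layer sizes and input shape below are hypothetical.

import tensorflow as tf
import tensorflow_probability as tfp

model = tf.keras.Sequential([
    tfp.layers.Convolution2DFlipout(64, kernel_size=3, padding="same",
                                    activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tfp.layers.DenseFlipout(10),
])

# A forward pass on a dummy CIFAR-10-sized batch builds the layers and registers
# one KL divergence term per Bayesian layer in model.losses; during training this
# term is added to the negative log-likelihood to form the variational objective.
logits = model(tf.zeros([8, 32, 32, 3]))
kl = tf.reduce_sum(model.losses)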


Footnotes
1
The communication efficiency is calculated as the ratio of communication time (MPI_WTIME) to elapsed time (which includes MPI_INIT and MPI_FINALIZE). For a 16-node run, the efficiencies of the BNN VGG and Resnet models are 86.81% and 87.59%, respectively, while those of the CNN VGG and Resnet models are 80.26% and 88.59%, respectively. For a 128-node run, the BNN VGG and Resnet communication efficiencies are 91.15% and 94.91%, while the CNN VGG and Resnet efficiencies are 86.99% and 89.11%, respectively.
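As an illustration of this metric, here is a minimal sketch using mpi4py; the timing hooks, buffer, and loop are hypothetical stand-ins and not taken from the paper's Cray training setup. It accumulates time spent in communication calls with MPI_WTIME and divides by the total elapsed time.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
t_start = MPI.Wtime()              # wall-clock reference taken just after MPI init
comm_time = 0.0

grads = np.random.rand(1_000_000)  # stand-in for a flattened gradient buffer
for step in range(100):            # stand-in for training steps
    # ... local forward/backward computation would happen here ...
    t0 = MPI.Wtime()
    comm.Allreduce(MPI.IN_PLACE, grads, op=MPI.SUM)   # gradient aggregation
    comm_time += MPI.Wtime() - t0

elapsed = MPI.Wtime() - t_start
if comm.rank == 0:
    print(f"communication efficiency = {100.0 * comm_time / elapsed:.2f}%")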
 
2
The configuration of the GPU system used at ALCF is as follows: 8x Tesla V100 GPUs with 128 GB total GPU memory, dual 20-core Intel Xeon E5-2698 v4 CPUs at 2.2 GHz, 40,960 NVIDIA CUDA cores, 5120 NVIDIA Tensor cores, 512 GB of 2133 MHz DDR4 LRDIMM system memory, 4x 1.92 TB SSD in RAID 0 for storage, and dual 10 GbE plus 4x IB EDR networking.
 
4
The ratio for the Gaussian prior over weights can be calculated simply as \(|\mu| / \sigma\) [6]. The initial BPrune release supports Gaussian priors; other choices of distribution will be supported in future releases.
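A minimal NumPy sketch of this criterion follows (the function and variable names are hypothetical; BPrune automates this pruning for trained models). Weights whose \(|\mu|/\sigma\) ratio falls below a chosen threshold are masked to zero.

import numpy as np

def prune_mask(mu, sigma, threshold):
    """Return a 0/1 mask keeping weights whose |mu|/sigma exceeds `threshold`."""
    snr = np.abs(mu) / sigma
    return (snr > threshold).astype(mu.dtype)

# Example with randomly generated posterior means and standard deviations
# for a single layer.
mu = np.random.randn(512, 256)
sigma = 0.1 + 0.5 * np.random.rand(512, 256)
mask = prune_mask(mu, sigma, threshold=1.0)
print(f"fraction of weights pruned: {1.0 - mask.mean():.2%}")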
 
References
1.
Neal RM (1995) Bayesian learning for neural networks. Technical report
2.
Williams C (1996) Computing with infinite networks. In: Advances in neural information processing systems, vol 9. MIT Press, Cambridge, pp 295–301
3.
MacKay DJC (1992) A practical Bayesian framework for backpropagation networks. Neural Comput 4(3):448–472
4.
Hinton G, Van Camp D (1993) Keeping neural networks simple by minimizing the description length of the weights. In: Proceedings of the 6th Annual ACM Conference on Computational Learning Theory. Citeseer
5.
Barber D, Bishop CM (1998) Ensemble learning in Bayesian neural networks. NATO ASI Ser Ser F Comput Syst Sci 168:215–237
6.
Graves A (2011) Practical variational inference for neural networks. In: Advances in neural information processing systems, pp 2348–2356
7.
Hoffman MD, Blei DM, Wang C, Paisley J (2013) Stochastic variational inference. J Mach Learn Res 14(1):1303–1347
8.
10.
Rezende DJ, Mohamed S, Wierstra D (2014) Stochastic backpropagation and approximate inference in deep generative models
11.
Titsias M, Lázaro-Gredilla M (2014) Doubly stochastic variational Bayes for non-conjugate inference. In: International Conference on Machine Learning, pp 1971–1979
12.
13.
Shridhar K, Laumann F, Liwicki M (2019) A comprehensive guide to Bayesian convolutional neural network with variational inference. arXiv preprint arXiv:1901.02731
14.
Hinton GE, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov RR (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580
15.
Gal Y, Ghahramani Z (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: International Conference on Machine Learning, pp 1050–1059
16.
Tran D, Dusenberry M, van der Wilk M, Hafner D (2019) Bayesian layers: a module for neural network uncertainty. In: Advances in neural information processing systems, pp 14633–14645
17.
Shazeer N, Cheng Y, Parmar N, Tran D, Vaswani A, Koanantakool P, Hawkins P, Lee H, Hong M, Young C et al (2018) Mesh-tensorflow: deep learning for supercomputers. In: Advances in neural information processing systems, pp 10414–10423
18.
Wen Y, Vicol P, Ba J, Tran D, Grosse R (2018) Flipout: efficient pseudo-independent weight perturbations on mini-batches. arXiv preprint arXiv:1803.04386
19.
Tsyplikhin A (2019) Graphcore delivers 26x performance gains for finance customers
22.
Baydin AG, Shao L, Bhimji W, Heinrich L, Meadows L, Liu J, Munk A, Naderiparizi S, Gram-Hansen B, Louppe G et al (2019) Etalumis: bringing probabilistic programming to scientific simulators at scale. arXiv preprint arXiv:1907.03382
23.
Viebke A, Memeti S, Pllana S, Abraham A (2019) Chaos: a parallelization scheme for training convolutional neural networks on Intel Xeon Phi. J Supercomput 75(1):197–227
24.
ALCF (2019/2020) Xc40 machine overview. Technical report
26.
27.
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
28.
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
29.
Bowman SR, Vilnis L, Vinyals O, Dai AM, Jozefowicz R, Bengio S (2015) Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349
30.
Higgins I, Matthey L, Pal A, Burgess C, Glorot X, Botvinick M, Mohamed S, Lerchner A (2017) beta-vae: learning basic visual concepts with a constrained variational framework. ICLR 2(5):6
31.
32.
Liu X, Gao J, Celikyilmaz A, Carin L et al (2019) Cyclical annealing schedule: a simple approach to mitigating kl vanishing. arXiv preprint arXiv:1903.10145
34.
Naesseth CA, Ruiz FJR, Linderman SW, Blei DM (2017) Reparameterization gradients through acceptance–rejection sampling algorithms. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017
35.
Krizhevsky A et al (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer
37.
Loosli G, Canu S, Bottou L (2007) Training invariant support vector machines using selective sampling
38.
39.
Blei DM, Kucukelbir A, McAuliffe JD (2017) Variational inference: a review for statisticians. University of California, Berkeley
40.
Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, Brubaker M, Guo J, Li P, Riddell A (2017) Stan: a probabilistic programming language. J Stat Softw 76(1)
41.
Stan Development Team et al (2017) PyStan: the Python interface to Stan. Version 2.16.0.0
42.
Bingham E, Chen JP, Jankowiak M, Obermeyer F, Pradhan N, Karaletsos T, Singh R, Szerlip P, Horsfall P, Goodman ND (2019) Pyro: deep universal probabilistic programming. J Mach Learn Res 20(1):973–978
43.
Cusumano-Towner MF, Saad FA, Lew AK, Mansinghka VK (2019) Gen: a general-purpose probabilistic programming system with programmable inference. In: Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2019. ACM, New York, pp 221–236
44.
Dillon JV, Langmore I, Tran D, Brevdo E, Vasudevan S, Moore D, Patton B, Alemi A, Hoffman M, Saurous RA (2017) Tensorflow distributions. arXiv preprint arXiv:1711.10604
46.
Laanait N, Romero J, Yin J, Young MT, Treichler S, Starchenko V, Borisevich A, Sergeev A, Matheson M (2019) Exascale deep learning for scientific inverse problems. arXiv preprint arXiv:1909.11150
48.
LeCun Y, Denker JS, Solla SA (1990) Optimal brain damage. In: Advances in neural information processing systems, pp 598–605
49.
Giles CL, Omlin CW (1994) Pruning recurrent neural networks for improved generalization performance. IEEE Trans Neural Netw 5(5):848–851
50.
Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mané D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viégas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from www.tensorflow.org. Accessed June 2019
Metadata
Title
Bayesian neural networks at scale: a performance analysis and pruning study
Authors
Himanshu Sharma
Elise Jennings
Publication date
04.09.2020
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 4/2021
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-020-03401-z
