12.12.2023 | Research

Audio super-resolution via vision transformer

Authors: Simona Nisticò, Luigi Palopoli, Adele Pia Romano

Published in: Journal of Intelligent Information Systems


Abstract

Audio super-resolution refers to techniques that improve the quality of audio signals, usually through bandwidth extension methods, whereby audio enhancement is obtained by expanding the phase and the spectrogram of the input audio traces. These techniques are therefore highly significant in all cases where audio traces lack relevant parts of the audible spectrum. Often, the given input signal contains the low-band frequencies (the easiest to capture with low-quality recording equipment), whereas the high band must be generated. In this paper, we illustrate the techniques implemented in a system for bandwidth extension that works on musical tracks and generates the high-band frequencies starting from the low-band ones. The system, called ViT Super-resolution (\(\textit{ViT-SR}\)), features an architecture based on a Generative Adversarial Network and a Vision Transformer model. In particular, two versions of the architecture, which work on different input frequency ranges, are presented. The experiments reported in the paper demonstrate the effectiveness of our approach: the high-band signal of an audio file can be faithfully reconstructed with only its low-band spectrum available as input, including the harmonics occurring in the audio tracks, which are usually difficult to generate synthetically and which contribute significantly to the final perceived sound quality.
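To make the task concrete, here is a minimal sketch (our illustration, not the authors' code) of how a low-band input and a high-band target can be derived from a full-band track. It uses the real librosa library, but the sample rate, FFT size, cutoff bin, and the helper name split_bands are all illustrative assumptions.

```python
# Illustrative sketch of the bandwidth-extension setup (not the authors' code).
# Sample rate, FFT size, and cutoff are assumptions chosen for a 2x task.
import numpy as np
import librosa

SR = 16000                    # full-band sample rate (assumption)
N_FFT = 512                   # STFT window size (assumption)
CUTOFF_BIN = N_FFT // 4 + 1   # bins 0..128 span 0..SR/4, the lower half of the spectrum

def split_bands(path):
    """Return (low, high) magnitude spectrograms: the low band is the
    model input, the high band is what a system like ViT-SR must generate."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    spec = np.abs(librosa.stft(y, n_fft=N_FFT))  # shape: (1 + N_FFT//2, frames)
    return spec[:CUTOFF_BIN], spec[CUTOFF_BIN:]
```

In this framing, the GAN generator is trained to map the low-band spectrogram to an estimate of the high-band one; as the abstract notes, the phase must also be expanded to resynthesize a time-domain signal.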


Footnotes
1
We would like to thank one of the anonymous reviewers for pointing out this method to us.
 
2
The source code for \(\textit{ViT-SR Small}\) and \(\textit{ViT-SR}\) is freely available at https://github.com/simona-nistico/ViT-SR.
 
Metadata
Title
Audio super-resolution via vision transformer
Authors
Simona Nisticò
Luigi Palopoli
Adele Pia Romano
Publication date
12.12.2023
Publisher
Springer US
Published in
Journal of Intelligent Information Systems
Print ISSN: 0925-9902
Electronic ISSN: 1573-7675
DOI
https://doi.org/10.1007/s10844-023-00833-w