12.12.2023 | Research

Audio super-resolution via vision transformer

Authors: Simona Nisticò, Luigi Palopoli, Adele Pia Romano

Published in: Journal of Intelligent Information Systems


Abstract

Audio super-resolution refers to techniques that improve the quality of audio signals, usually through bandwidth extension methods, whereby audio enhancement is obtained by expanding the phase and the spectrogram of the input audio traces. These techniques are therefore highly significant in all cases where audio traces lack relevant parts of the audible spectrum. Often, the given input signal contains the low-band frequencies (the easiest to capture with low-quality recording equipment), whereas the high band must be generated. In this paper, we illustrate the techniques implemented in a system for bandwidth extension that works on musical tracks and generates the high-band frequencies starting from the low-band ones. The system, called ViT Super-resolution (\(\textit{ViT-SR}\)), features an architecture based on a Generative Adversarial Network and a Vision Transformer model. In particular, two versions of the architecture, which work on different input frequency ranges, are presented. The experiments reported in the paper demonstrate the effectiveness of our approach: the high-band signal of an audio file can be faithfully reconstructed with only its low-band spectrum available as input, including the harmonics occurring in the audio tracks, which are usually difficult to generate synthetically and which contribute significantly to the final perceived sound quality.
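To make the task concrete, here is a minimal sketch (our illustration, not the authors' code) of how a low-band input and a high-band target can be derived from a full-band track. It uses the real librosa library, but the sample rate, FFT size, cutoff bin, and the helper name split_bands are all illustrative assumptions.

```python
# Illustrative sketch of the bandwidth-extension setup (not the authors' code).
# Sample rate, FFT size, and cutoff are assumptions chosen for a 2x task.
import numpy as np
import librosa

SR = 16000                    # full-band sample rate (assumption)
N_FFT = 512                   # STFT window size (assumption)
CUTOFF_BIN = N_FFT // 4 + 1   # bins 0..128 span 0..SR/4, the lower half of the spectrum

def split_bands(path):
    """Return (low, high) magnitude spectrograms: the low band is the
    model input, the high band is what a system like ViT-SR must generate."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    spec = np.abs(librosa.stft(y, n_fft=N_FFT))  # shape: (1 + N_FFT//2, frames)
    return spec[:CUTOFF_BIN], spec[CUTOFF_BIN:]
```

In this framing, the GAN generator is trained to map the low-band spectrogram to an estimate of the high-band one; as the abstract notes, the phase must also be expanded to resynthesize a time-domain signal.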


Footnotes
1
We would like to thank one of the anonymous reviewers for pointing out this method to us.
 
2
The source code for \(\textit{ViT-SR Small}\) and \(\textit{ViT-SR}\) is freely available at https://github.com/simona-nistico/ViT-SR.
 
Metadata
Title
Audio super-resolution via vision transformer
Authors
Simona Nisticò
Luigi Palopoli
Adele Pia Romano
Publication date
12.12.2023
Publisher
Springer US
Published in
Journal of Intelligent Information Systems
Print ISSN: 0925-9902
Electronic ISSN: 1573-7675
DOI
https://doi.org/10.1007/s10844-023-00833-w