
2020 | OriginalPaper | Chapter

Audio Interval Retrieval Using Convolutional Neural Networks

Authors: Ievgeniia Kuzminykh, Dan Shevchuk, Stavros Shiaeles, Bogdan Ghita

Published in: Internet of Things, Smart Spaces, and Next Generation Networks and Systems

Publisher: Springer International Publishing


Abstract

Modern streaming services increasingly label videos based on their visual or audio content, typically using AI and ML technologies to enable searching by natural-language keywords and video descriptions. Prior research has produced a number of successful speech-to-text solutions for human speech; this article instead investigates possible solutions for retrieving sound events based on a natural language query, and estimates how effective and accurate they are. In this study, we specifically focus on the YamNet, AlexNet, and ResNet-50 pre-trained models, which automatically classify audio samples, via their mel spectrograms, into a number of predefined classes. These classes can represent sounds associated with actions within a video fragment. Two tests are conducted to evaluate the performance of the models on two separate problems: audio classification and interval retrieval based on a natural language query. Results show that the benchmarked models perform comparably, with YamNet slightly outperforming the other two. YamNet classified single fixed-size audio samples with 92.7% accuracy and 68.75% precision, while its average accuracy on interval retrieval was 71.62% and its precision 41.95%. The investigated method may be embedded into an automated event-marking architecture for streaming services.
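The pipeline the abstract describes — slicing audio into fixed-size windows, extracting mel-spectrogram features, classifying each window, and mapping matching labels back to time intervals — can be sketched in plain NumPy. This is not the authors' code: the filterbank parameters, the `retrieve_intervals` helper, and the stub classifier used below are illustrative assumptions; in the paper the classifiers are the pre-trained YamNet, AlexNet, and ResNet-50 models.

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr, fmin=0.0, fmax=None):
    """Triangular mel filters mapping an |FFT|^2 spectrum onto n_mels bands."""
    fmax = fmax or sr / 2
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    freqs = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    bins = np.floor((n_fft + 1) * freqs / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                      # rising slope
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                      # falling slope
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def log_mel_spectrogram(signal, sr, n_fft=512, hop=256, n_mels=40):
    """Frame the signal, Hann-window it, take the power spectrum per frame,
    and project onto the mel filterbank (log-compressed)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    return np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)

def retrieve_intervals(signal, sr, classify, win_s=1.0, query="dog_bark"):
    """Classify each fixed-size window; merge consecutive windows whose
    predicted label matches the natural-language query into intervals."""
    win = int(win_s * sr)
    hits = []
    for start in range(0, len(signal) - win + 1, win):
        feats = log_mel_spectrogram(signal[start:start + win], sr)
        if classify(feats) == query:
            hits.append((start / sr, (start + win) / sr))
    merged = []
    for s, e in hits:                              # join back-to-back windows
        if merged and abs(s - merged[-1][1]) < 1e-9:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged
```

With a toy signal (one second of silence followed by two seconds of a 440 Hz tone) and a trivial energy-based stand-in for the CNN classifier, `retrieve_intervals` returns the single interval `(1.0, 3.0)`; in the evaluated system, the per-window label would instead come from the CNN's predicted class.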


Metadata
Copyright Year: 2020
DOI: https://doi.org/10.1007/978-3-030-65726-0_21