
2020 | OriginalPaper | Chapter

Audio Interval Retrieval Using Convolutional Neural Networks

Authors: Ievgeniia Kuzminykh, Dan Shevchuk, Stavros Shiaeles, Bogdan Ghita

Published in: Internet of Things, Smart Spaces, and Next Generation Networks and Systems

Publisher: Springer International Publishing


Abstract

Modern streaming services increasingly label videos based on their visual or audio content, typically using AI and ML technologies to enable searching by natural-language keywords and video descriptions. Prior research has produced a number of successful speech-to-text solutions for human speech; this article instead investigates possible solutions for retrieving sound events based on a natural language query, and estimates how effective and accurate they are. In this study, we specifically focus on the YamNet, AlexNet, and ResNet-50 pre-trained models, which automatically classify audio samples, via their mel spectrograms, into a number of predefined classes. These classes can represent sounds associated with actions within a video fragment. Two tests are conducted to evaluate the performance of the models on two separate problems: audio classification and interval retrieval based on a natural language query. Results show that the benchmarked models perform comparably, with YamNet slightly outperforming the other two. YamNet classified single fixed-size audio samples with 92.7% accuracy and 68.75% precision, while its average accuracy on interval retrieval was 71.62% and its precision 41.95%. The investigated method may be embedded into an automated event-marking architecture for streaming services.
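The pipeline the abstract describes — slicing audio into fixed-size windows, extracting mel-spectrogram features, classifying each window, and mapping matching labels back to time intervals — can be sketched in plain NumPy. This is not the authors' code: the filterbank parameters, the `retrieve_intervals` helper, and the stub classifier used below are illustrative assumptions; in the paper the classifiers are the pre-trained YamNet, AlexNet, and ResNet-50 models.

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr, fmin=0.0, fmax=None):
    """Triangular mel filters mapping an |FFT|^2 spectrum onto n_mels bands."""
    fmax = fmax or sr / 2
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    freqs = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    bins = np.floor((n_fft + 1) * freqs / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                      # rising slope
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                      # falling slope
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def log_mel_spectrogram(signal, sr, n_fft=512, hop=256, n_mels=40):
    """Frame the signal, Hann-window it, take the power spectrum per frame,
    and project onto the mel filterbank (log-compressed)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    return np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)

def retrieve_intervals(signal, sr, classify, win_s=1.0, query="dog_bark"):
    """Classify each fixed-size window; merge consecutive windows whose
    predicted label matches the natural-language query into intervals."""
    win = int(win_s * sr)
    hits = []
    for start in range(0, len(signal) - win + 1, win):
        feats = log_mel_spectrogram(signal[start:start + win], sr)
        if classify(feats) == query:
            hits.append((start / sr, (start + win) / sr))
    merged = []
    for s, e in hits:                              # join back-to-back windows
        if merged and abs(s - merged[-1][1]) < 1e-9:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged
```

With a toy signal (one second of silence followed by two seconds of a 440 Hz tone) and a trivial energy-based stand-in for the CNN classifier, `retrieve_intervals` returns the single interval `(1.0, 3.0)`; in the evaluated system, the per-window label would instead come from the CNN's predicted class.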


Metadata
Copyright Year: 2020
DOI: https://doi.org/10.1007/978-3-030-65726-0_21