Published in: Artificial Intelligence Review 1/2024

01.01.2024

Acoustic-based LEGO recognition using attention-based convolutional neural networks

Authors: Van-Thuan Tran, Chia-Yang Wu, Wei-Ho Tsai


Abstract

This work investigates the classification of LEGO types using deep learning-based audio classification. The investigation rests on the following assumption: if objects of the same shape fall freely from a certain height and hit a fixed plane, their impact sounds will be very similar, so objects of one type can be distinguished from the others. Applying this idea to LEGO recognition, we collect the impact sounds of 200 LEGO objects dropped from a height of about 30 cm onto a designated plane, and we design a CNN-based recognition system that processes an impact sound to determine which LEGO type it belongs to. Recognizing that a falling LEGO piece produces a main impact sound (i.e., the sound at the moment of first impact) followed by several subsequent sounds, we examine whether considering only the first impact sound or the whole sound sequence yields better classification accuracy. We propose a compact two-dimensional CNN model, named LegoNet, which applies a frame-level attention module to the input spectrogram and uses time-distributed fully-connected layers. Our experiments show that free-fall impact sounds can be used for accurate object recognition and that the proposed LegoNet, despite its much smaller size, achieves better accuracy and robustness than baseline models. Using the whole sequence of impact sounds is also more informative for LEGO classification than considering only the first impact sound. Moreover, we find that utilizing data of specific object postures helps improve the classifier's performance when training data are small. The proposed approach can serve as an extra module in intelligent agents or object classification systems that require a rich understanding of the surrounding physical world.
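The abstract does not spell out LegoNet's exact layer configuration, so the following is a minimal PyTorch sketch of the two ingredients it names: a frame-level attention module applied to the input spectrogram, and time-distributed fully-connected layers. The names (FrameAttention, LegoNetSketch), the layer sizes, the 64-mel input, and the mean-over-time pooling are all illustrative assumptions, not the authors' published architecture; only the 200-class output follows the abstract.

```python
# Hypothetical sketch, NOT the published LegoNet: a compact 2D CNN that
# (a) re-weights spectrogram time frames with a learned attention score and
# (b) applies the same fully-connected layer to every frame (time-distributed)
# before pooling over time. All sizes below are illustrative assumptions.
import torch
import torch.nn as nn


class FrameAttention(nn.Module):
    """Scores each time frame of the spectrogram and re-weights it, letting
    the network emphasize the impact onset over quieter trailing frames."""

    def __init__(self, n_mels: int):
        super().__init__()
        self.score = nn.Linear(n_mels, 1)  # one scalar score per frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mels, time) -> frames: (batch, time, n_mels)
        frames = x.squeeze(1).transpose(1, 2)
        weights = torch.softmax(self.score(frames), dim=1)  # (batch, time, 1)
        weighted = frames * weights                          # re-weight frames
        return weighted.transpose(1, 2).unsqueeze(1)         # back to (B,1,M,T)


class LegoNetSketch(nn.Module):
    def __init__(self, n_mels: int = 64, n_classes: int = 200):
        super().__init__()
        self.attention = FrameAttention(n_mels)
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Time-distributed fully-connected layer: applied independently to
        # each remaining time step, then averaged over time.
        self.frame_fc = nn.Linear(32 * (n_mels // 4), 128)
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.attention(x)                             # (B, 1, M, T)
        x = self.conv(x)                                  # (B, C, M/4, T/4)
        b, c, m, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * m)    # (B, T', C*M')
        x = torch.relu(self.frame_fc(x))                  # per-frame FC
        x = x.mean(dim=1)                                 # pool over time
        return self.classifier(x)


# Example: one batch of 64-mel spectrograms with 128 time frames.
logits = LegoNetSketch()(torch.randn(8, 1, 64, 128))
print(logits.shape)  # torch.Size([8, 200])
```

The design intuition mirrors the abstract: attention over frames lets the model weight the informative first-impact frames against the subsequent bounce sounds, while the time-distributed layer extracts the same per-frame representation across the whole sequence before a single pooled decision.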


Metadata
Title
Acoustic-based LEGO recognition using attention-based convolutional neural networks
Authors
Van-Thuan Tran
Chia-Yang Wu
Wei-Ho Tsai
Publication date
01.01.2024
Publisher
Springer Netherlands
Published in
Artificial Intelligence Review / Issue 1/2024
Print ISSN: 0269-2821
Electronic ISSN: 1573-7462
DOI
https://doi.org/10.1007/s10462-023-10625-x
