Skip to main content
Erschienen in:
Buchtitelbild

2018 | OriginalPaper | Buchkapitel

Automatic Recognition of Sound Categories from Their Vocal Imitation Using Audio Primitives Automatically Found by SI-PLCA and HMM

verfasst von : Enrico Marchetto, Geoffroy Peeters

Erschienen in: Music Technology with Swing

Verlag: Springer International Publishing

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

In this paper we study the automatic recognition of sound categories (such as fridge, mixers or sawing sounds) from their vocal imitations. Vocal imitations are made of a succession over time of sounds produced using vocal mechanisms that can largely differ from the ones used in speech. We develop here a recognition approach inspired by automatic-speech-recognition systems, with an acoustic model (that maps the audio signal to a set of probability over “phonemes”) and a language model (that represents the expected succession of “phonemes” for each sound category). Since we do not know what are the underlying “phonemes” of vocal imitations we propose to automatically estimate them using Shift-Invariant Probabilistic Latent Component Analysis (SI-PLCA) applied to a dataset of vocal imitations. The kernel distributions of the SI-PLCA are considered as the “phonemes” of vocal imitation and its impulse distributions are used to compute the emission probabilities of the states of a set of Hidden Markov Models (HMMs). To evaluate our proposal, we test it for a task of automatically recognizing 12 sound categories from their vocal imitations.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
2
In our CQT, harmonics are spaced by 18 bins, so the kernels size in frequency has to be at least 18 to exploit the shift-invariance.
 
3
One third of the data is used for testing, the remaining for training; each third is used in turns for testing.
 
4
The same subject can not appear simultaneously in the training and testing set.
 
Literatur
1.
Zurück zum Zitat Baldan, S., Delle Monache, S., Rocchesso, D.: The sound design toolkit. Softw. X 6, 255–260 (2017) Baldan, S., Delle Monache, S., Rocchesso, D.: The sound design toolkit. Softw. X 6, 255–260 (2017)
2.
Zurück zum Zitat Brown, J.C.: Calculation of a constant Q spectral transform. J. Acoust. Soc. Am. 89(1), 425–434 (1991)CrossRef Brown, J.C.: Calculation of a constant Q spectral transform. J. Acoust. Soc. Am. 89(1), 425–434 (1991)CrossRef
3.
Zurück zum Zitat Houix, O., Monache, S.D., Lachambre, H., Bevilacqua, F., Rocchesso, D., Lemaitre, G.: Innovative tools for sound sketching combining vocalizations and gestures. In: Proceedings of the Audio Mostly 2016, pp. 12–10. ACM (2016) Houix, O., Monache, S.D., Lachambre, H., Bevilacqua, F., Rocchesso, D., Lemaitre, G.: Innovative tools for sound sketching combining vocalizations and gestures. In: Proceedings of the Audio Mostly 2016, pp. 12–10. ACM (2016)
4.
Zurück zum Zitat Ishihara, K., Nakatani, T., Ogata, T., Okuno, H.G.: Automatic sound-imitation word recognition from environmental sounds focusing on ambiguity problem in determining phonemes. In: Zhang, C., W. Guesgen, H., Yeap, W.-K. (eds.) PRICAI 2004. LNCS (LNAI), vol. 3157, pp. 909–918. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-28633-2_96CrossRef Ishihara, K., Nakatani, T., Ogata, T., Okuno, H.G.: Automatic sound-imitation word recognition from environmental sounds focusing on ambiguity problem in determining phonemes. In: Zhang, C., W. Guesgen, H., Yeap, W.-K. (eds.) PRICAI 2004. LNCS (LNAI), vol. 3157, pp. 909–918. Springer, Heidelberg (2004). https://​doi.​org/​10.​1007/​978-3-540-28633-2_​96CrossRef
5.
Zurück zum Zitat Juang, B.H., Rabiner, L.R.: Automatic speech recognition-a brief history of the technology development. Georgia Institute of Technology. Atlanta Rutgers University and the University of California 1:67 (2005) Juang, B.H., Rabiner, L.R.: Automatic speech recognition-a brief history of the technology development. Georgia Institute of Technology. Atlanta Rutgers University and the University of California 1:67 (2005)
6.
Zurück zum Zitat Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791 (1999)CrossRef Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791 (1999)CrossRef
7.
Zurück zum Zitat Lemaitre, G., Dessein, A., Aura, K., Susini, P.: Do vocal imitations enable the identification of the imitated sounds. In: Proceedings of the 8th Annual Auditory Perception, Cognition and Action Meeting (APCAM 2009), Boston, MA (2009) Lemaitre, G., Dessein, A., Aura, K., Susini, P.: Do vocal imitations enable the identification of the imitated sounds. In: Proceedings of the 8th Annual Auditory Perception, Cognition and Action Meeting (APCAM 2009), Boston, MA (2009)
8.
Zurück zum Zitat Lemaitre, G., Houix, O., Voisin, F., Misdariis, N., Susini, P.: Vocal imitations of non-vocal sounds. PLoS ONE 11(12), e0168167 (2016). Public Library of ScienceCrossRef Lemaitre, G., Houix, O., Voisin, F., Misdariis, N., Susini, P.: Vocal imitations of non-vocal sounds. PLoS ONE 11(12), e0168167 (2016). Public Library of ScienceCrossRef
10.
Zurück zum Zitat Marchetto, E., Peeters, G.: A set of audio features for the morphological description of vocal imitations. In: Proceedings of DAFx (2015) Marchetto, E., Peeters, G.: A set of audio features for the morphological description of vocal imitations. In: Proceedings of DAFx (2015)
12.
Zurück zum Zitat Peeters, G., Deruty, E.: Sound indexing using morphological description. IEEE Trans. Audio Speech Lang. Process. 18(3), 675–687 (2010)CrossRef Peeters, G., Deruty, E.: Sound indexing using morphological description. IEEE Trans. Audio Speech Lang. Process. 18(3), 675–687 (2010)CrossRef
13.
Zurück zum Zitat Rabiner, L.R.: A tutorial on hidden markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)CrossRef Rabiner, L.R.: A tutorial on hidden markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)CrossRef
14.
Zurück zum Zitat Rabiner, L.R., Juang, B.H.: Fundamentals of speech recognition (1993) Rabiner, L.R., Juang, B.H.: Fundamentals of speech recognition (1993)
15.
Zurück zum Zitat Ricard, J., Herrera, P.: Morphological sound description: computational model and usability evaluation. In: Audio Engineering Society Convention 116 (2004) Ricard, J., Herrera, P.: Morphological sound description: computational model and usability evaluation. In: Audio Engineering Society Convention 116 (2004)
16.
Zurück zum Zitat Saon, G., Chien, J.T.: Large-vocabulary continuous speech recognition systems: a look at some recent advances. IEEE Sig. Process. Mag. 29(6), 18–33 (2012)CrossRef Saon, G., Chien, J.T.: Large-vocabulary continuous speech recognition systems: a look at some recent advances. IEEE Sig. Process. Mag. 29(6), 18–33 (2012)CrossRef
17.
Zurück zum Zitat Schaeffer, P.: Traité des objets musicaux. Le Seuil (1966) Schaeffer, P.: Traité des objets musicaux. Le Seuil (1966)
18.
Zurück zum Zitat Schörkhuber, C., Klapuri, A., Holighaus, N., Dörfler, M.: A Matlab toolbox for efficient perfect reconstruction time-frequency transforms with log-frequency resolution. In: Audio Engineering Society Conference: 53rd International Conference: Semantic Audio, January 2014. http://www.aes.org/e-lib/browse.cfm?elib=17112 Schörkhuber, C., Klapuri, A., Holighaus, N., Dörfler, M.: A Matlab toolbox for efficient perfect reconstruction time-frequency transforms with log-frequency resolution. In: Audio Engineering Society Conference: 53rd International Conference: Semantic Audio, January 2014. http://​www.​aes.​org/​e-lib/​browse.​cfm?​elib=​17112
20.
Zurück zum Zitat Smaragdis, P., Raj, B.: Shift-invariant probabilistic latent component analysis. Technical report, MERL (2007) Smaragdis, P., Raj, B.: Shift-invariant probabilistic latent component analysis. Technical report, MERL (2007)
21.
Zurück zum Zitat Sundaram, S., Narayanan, S.: Vector-based representation and clustering of audio using onomatopoeia words. In: Proceedings of AAAI (2006) Sundaram, S., Narayanan, S.: Vector-based representation and clustering of audio using onomatopoeia words. In: Proceedings of AAAI (2006)
22.
Zurück zum Zitat Sundaram, S., Narayanan, S.: Classification of sound clips by two schemes: using onomatopoeia and semantic labels. In: 2008 IEEE International Conference on Multimedia and Expo, pp. 1341–1344. IEEE (2008) Sundaram, S., Narayanan, S.: Classification of sound clips by two schemes: using onomatopoeia and semantic labels. In: 2008 IEEE International Conference on Multimedia and Expo, pp. 1341–1344. IEEE (2008)
23.
Zurück zum Zitat Velasco, G.A., Holighaus, N., Dörfler, M., Grill, T.: Constructing an invertible constant-Q transform with non-stationary Gabor frames. In: Proceedings of DAFx, Paris, pp. 93–99 (2011) Velasco, G.A., Holighaus, N., Dörfler, M., Grill, T.: Constructing an invertible constant-Q transform with non-stationary Gabor frames. In: Proceedings of DAFx, Paris, pp. 93–99 (2011)
Metadaten
Titel
Automatic Recognition of Sound Categories from Their Vocal Imitation Using Audio Primitives Automatically Found by SI-PLCA and HMM
verfasst von
Enrico Marchetto
Geoffroy Peeters
Copyright-Jahr
2018
DOI
https://doi.org/10.1007/978-3-030-01692-0_1

Neuer Inhalt