Published in: International Journal of Speech Technology 1/2022

12.01.2022

Acoustic domain mismatch compensation in bird audio detection

By: Tiantian Tang, Yanhua Long, Yijie Li, Jiaen Liang



Abstract

Detecting bird calls in audio is an important task for automatic wildlife monitoring, as well as for citizen science and audio library management. This paper presents front-end acoustic enhancement techniques to handle the acoustic domain mismatch problem in bird audio detection. A time-domain cross-condition data augmentation (TCDA) method is first proposed to enhance the domain coverage of a fixed training dataset. Then, to suppress the distortion caused by stationary noise and to enhance transient events, we investigate per-channel energy normalization (PCEN), which automatically controls the gain of every subband of the mel-frequency spectrogram. Furthermore, harmonic-percussive source separation is investigated to extract robust percussive representations of bird calls and alleviate the acoustic mismatch. Our experiments are performed on the Bird Audio Detection task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events 2018. Extensive results show that the proposed TCDA yields a relative 5.02% AUC improvement under mismatched conditions. On the cross-domain test set, the proposed robust percussive features (RPFs), and the same RPFs combined with PCEN, significantly improve on the baseline with conventional log mel-spectrogram features, from 81.79% AUC to 84.46% and 88.68%, respectively. Moreover, we find that combining different front-end features can further improve system performance.
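The two standard front-end operations named in the abstract can be sketched in a few lines of NumPy/SciPy. This is a generic illustration of PCEN (first-order IIR smoothing per subband, then adaptive gain control and root compression) and of median-filtering harmonic/percussive separation, not the authors' exact configuration; all parameter values (`s`, `alpha`, `delta`, `r`, the median kernel size) are illustrative defaults, not taken from the paper.

```python
import numpy as np
from scipy.ndimage import median_filter


def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization of a mel spectrogram.

    E: (n_mels, n_frames) array of non-negative mel energies.
    A first-order IIR filter M smooths each subband over time;
    dividing by (eps + M)**alpha acts as per-band automatic gain
    control, and (x + delta)**r - delta**r compresses the result,
    which flattens stationary noise while keeping transient events.
    """
    M = np.empty_like(E)
    M[:, 0] = E[:, 0]
    for t in range(1, E.shape[1]):
        M[:, t] = (1 - s) * M[:, t - 1] + s * E[:, t]
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r


def percussive_mask(S, kernel=17, eps=1e-12):
    """Soft percussive mask via median filtering of a magnitude
    spectrogram S with shape (n_freqs, n_frames).

    Harmonic components are smooth along time, percussive (transient)
    components are smooth along frequency, so median-filtering in each
    direction separates them; the Wiener-style ratio yields a mask in
    [0, 1] that can be applied to S to keep the percussive part.
    """
    H = median_filter(S, size=(1, kernel))  # smooth along time -> harmonic
    P = median_filter(S, size=(kernel, 1))  # smooth along freq -> percussive
    return P ** 2 / (H ** 2 + P ** 2 + eps)
```

A percussive representation of a recording would then be obtained by multiplying the spectrogram by the mask before (optionally) applying PCEN, mirroring the RPF and RPF+PCEN pipelines compared in the abstract.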


Metadata
Title
Acoustic domain mismatch compensation in bird audio detection
Authors
Tiantian Tang
Yanhua Long
Yijie Li
Jiaen Liang
Publication date
12.01.2022
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 1/2022
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-022-09957-w
