Published in: Pattern Analysis and Applications 3/2023

20.06.2023 | Industrial and Commercial Application

Multi-level distance embedding learning for robust acoustic scene classification with unseen devices

Authors: Gang Jiang, Zhongchen Ma, Qirong Mao, Jianming Zhang


Abstract

Acoustic scene classification (ASC) aims to identify the scene in which a piece of audio was recorded. In practice, ASC must handle audio from a variety of recording devices, including devices that never appeared during the training phase. Audio recorded by different devices, and especially by unseen devices, differs in sampling rate, amplitude, data distribution, and other properties. These differences can strongly interfere with the feature learning process of CNNs and degrade the performance of the ASC model. To learn high-level features that are less susceptible to device differences from handcrafted features that still carry device information, we propose an ASC method based on a multi-level distance embedding space, called multi-level distance embedding learning (MDEL). Acoustic scene categories form a hierarchy, from the three coarse-grained categories of indoor, outdoor, and transportation down to more fine-grained categories; this hierarchy corresponds to a similarity relation between categories of different granularity. MDEL exploits this hierarchical similarity between acoustic scene classes to construct an embedding space containing multi-level distances. During learning, the model is guided to focus on features shared within the same scene class and to learn a high-level representation that is more robust to the device, thereby improving robustness to data from unseen devices. Our method was evaluated on the audio dataset provided by the DCASE2020 Challenge for Task 1a, where overall classification accuracy improved by 1.2%; for audio data from unseen devices, classification accuracy improved by 2.3%.
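The multi-level distance idea described above can be illustrated with a small sketch: embeddings of clips from the same fine-grained scene should lie closer together than embeddings of clips that share only a coarse category (indoor, outdoor, transportation), which in turn should lie closer than clips from different coarse categories. The function below is a hypothetical two-level hinge loss written for illustration; the function name, the margin values, and the squared-Euclidean distance are assumptions, not the authors' implementation.

```python
import numpy as np

def multilevel_margin_loss(anchor, pos_fine, pos_coarse, neg,
                           m_fine=0.2, m_coarse=0.4):
    """Two-level hinge loss enforcing the distance ordering
    d(same fine class) < d(same coarse class) < d(different coarse class).

    anchor     : embedding of the anchor clip (e.g. 'metro')
    pos_fine   : embedding from the same fine class (another 'metro' clip)
    pos_coarse : same coarse class only (e.g. 'bus', also transportation)
    neg        : different coarse class (e.g. 'park', outdoor)
    """
    d = lambda a, b: float(np.sum((a - b) ** 2))  # squared Euclidean distance
    d_fine = d(anchor, pos_fine)
    d_coarse = d(anchor, pos_coarse)
    d_neg = d(anchor, neg)
    # Level 1: fine-class pairs must beat coarse-class pairs by m_fine.
    level1 = max(0.0, d_fine - d_coarse + m_fine)
    # Level 2: coarse-class pairs must beat cross-category pairs by m_coarse.
    level2 = max(0.0, d_coarse - d_neg + m_coarse)
    return level1 + level2
```

When both orderings hold with their margins, the loss is zero; a violation at either level contributes a positive hinge term, pulling same-scene clips (regardless of recording device) together in the embedding space.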


Metadata
Title
Multi-level distance embedding learning for robust acoustic scene classification with unseen devices
Authors
Gang Jiang
Zhongchen Ma
Qirong Mao
Jianming Zhang
Publication date
20.06.2023
Publisher
Springer London
Published in
Pattern Analysis and Applications / Issue 3/2023
Print ISSN: 1433-7541
Electronic ISSN: 1433-755X
DOI
https://doi.org/10.1007/s10044-023-01172-w
