
Open Access 01-02-2024

Spectro Temporal Fusion with CLSTM-Autoencoder based approach for Anomalous Sound Detection

Authors: S. Chandrakala, Akhilandeswari Pidikiti, P. V. N. Sai Mahathi

Published in: Neural Processing Letters | Issue 1/2024

Abstract

Deep learning models have proved efficient for complex learning tasks. Anomalous sound detection is one such complex task, for which self-supervised deep architectures are emerging. Self-supervised deep models efficiently capture the underlying structure of data. Self-supervised anomalous sound detection attempts to distinguish between normal sounds and unidentified anomalous sounds. With appropriate autoencoders, reconstruction-error based decision making is effective for anomaly detection in domains such as computer vision. Auditory-image (spectrogram) based representations of sound signals are commonly used in sound event detection. We propose a convolutional long short-term memory (CLSTM) autoencoder based approach for anomalous sound detection. In this approach, we explore the fusion of spectral and temporal features to model the characteristics of normal sounds in the presence of noise. The proposed approach is evaluated using the MIMII dataset and the DCASE 2020 Challenge Task 2 anomalous sound detection dataset. Experiments reveal significant improvement over state-of-the-art approaches.

1 Introduction

The task of anomalous sound detection (ASD) is challenging and demanding in monitoring applications such as industrial machines, animal care, and public surveillance [1]. Compared to visual surveillance, audio sensors offer several advantages: (i) audio data acquisition and basic processing are much less expensive in terms of memory and computational load, (ii) microphones offer omnidirectional coverage, (iii) they are insensitive to luminosity and harsh weather conditions, and (iv) audio cues can supplement video cues in cases where anomalous events or objects are occluded. Neural network-based techniques have been widely employed to address the ASD problem as deep learning models have grown in popularity [2, 3]. Using data from normal sound occurrences only, the self-supervised anomaly detection approach trains autoencoders (AEs) to learn the patterns of normal sound events and to spot abnormalities.
Acoustic characteristics of a sound signal can also be interpreted as a time-frequency representation in the form of an auditory image. AEs with convolutional layers (CAEs) are commonly used for image reconstruction. To capture the sequential patterns in normal sound signals, we propose a convolutional long short-term memory autoencoder (CLSTM-AE) based approach for anomalous sound detection. With its encoder-decoder paradigm, the CLSTM autoencoder is well suited to sequential data. The trained encoder can be used to obtain compressed representations of speech, text, and other sequential data. The CLSTM-AE can recall and exploit long-term dependencies over lengthy input sequences with the help of internal memory. The decoder model that reconstructs the input sequence uses these learnt representations as input. The reconstruction error serves as the basic measure of the model's efficacy. During the training phase, the CLSTM-AE learns network parameters that lower the reconstruction error, or anomaly score, of the sound data. During testing, anomalous sounds are not compressed properly and therefore yield high anomaly scores (reconstruction errors). Optimal thresholding then identifies anomalous sounds.
In addition to the spectrogram image input, we explore a CNN-based temporal-gram network (TgramNet) to learn temporal (Tgram) features from normal sound data. The log-Mel spectrogram (Sgram) and the Tgram are fused to form a Mel-Spectro-tempogram, which is then used to train the CLSTM-AE. The rest of this paper is structured as follows: state-of-the-art deep learning based approaches for anomalous sound detection are reviewed in Sect. 2. The proposed CLSTM-AE based approach is presented in Sect. 3. The experimental study in Sect. 4 covers ablation studies of the proposed approach on the DCASE 2020 Task 2 dataset and performance analysis against state-of-the-art deep model based approaches.
2 Related Work

A wide range of deep anomaly detection algorithms have been introduced, showing noticeably superior performance over traditional machine learning based anomaly detection in difficult detection problems [4–6]. In Koizumi et al. [7], a Variational Autoencoder (VAE) is employed, which imposes a normal distribution on the latent layer of the VAE. In addition to preventing overfitting, this can help in data reconstruction. Acoustic input was transformed into auditory images by convolutional layers in the AE. The authors used the MIMII dataset, which includes normal and abnormal sounds of four machine types (fans, pumps, slide rails, and valves).
In Purohit et al. [8], the Neyman–Pearson lemma is used to define the loss function of the AE, and ASD is cast as a statistical hypothesis test in which anomalous sound is simulated. The authors conclude that their method improves ASD performance measures at a low false positive rate. The drawback of this approach is that it simulates aberrant sound via costly rejection sampling. In Purohit et al. [8], WaveNet is used to predict sound data rather than generating it, and the prediction error is used to calculate the anomaly score. Experiments were carried out on a real-world dataset recorded in a subway station. According to the experimental findings, WaveNet marginally boosts an ASD system's effectiveness in comparison to AE-based methods; however, WaveNet incurs a higher computational cost.
A recent study using the evaluation data of DCASE 2020 Task 2 covered (1) ensemble approaches using "outlier exposure" (OE) and "inlier modelling" (IM) based detectors, and (2) approaches using IM-based detection on features discovered during a machine-identification task [9]. The Group Masked Autoencoder for Density Estimation (GMADE) is a unique density estimation-based anomaly detector built as an ensemble with a self-supervised classification-based anomaly detector [10]. Another approach [11] uses audio representations based on Gammatones together with convolutional autoencoders (both unsupervised and semi-supervised); these architectures significantly outperform the baseline results.
According to a technical report [12] summarizing performance on Task 2 of the DCASE 2020 challenge, several submissions followed self-supervised learning based techniques to detect abnormal sounds. Training used only samples of normal machine operation. These techniques employ deep autoencoders that leverage mel-spectrogram sound features and dense and convolutional architectures. The dense and convolutional AEs, trained and tested on six types of machine operation, produced results that were competitive with, and even better than, the baseline technique given in the DCASE challenge. The baseline consists of a dense autoencoder (AE) with three layers of 128 units in each of the encoder and decoder components and a latent space of 8 units, all with ReLU activations.
Another approach [8] uses a long short-term memory (LSTM) autoencoder with support vector machines, taking into account the fact that, because anomalies occur infrequently in real-time industrial environments, there are far fewer aberrant samples than normal samples. A one-class SVM receives the compressed representations of the input data produced by the autoencoder model. The model is trained on normal sound data, and both normal and anomalous data are used to test the learned model.
Another work proposed a dense, convolutional autoencoder framework [8]. The mel-spectrograms of the audio data were used as input. Compared to classification based on raw audio data, the mel-spectrogram image helps emphasise spatial features and hence leads to improved classification outcomes. The MIMII dataset and a portion of the ToyADMOS dataset were used to train this model.
A deep learning system called MAABL (MobileNetV2 with enhanced attention block), which combines deep convolutional neural networks with attention blocks, is presented in [13]. The neural network combines an augmented attention block with the inverted residual block of MobileNetV2. Attention blocks help the network emphasise the most important features during training. The MIMII and ToyADMOS datasets were used for experimentation. In comparison to the basic MobileNetV2 architecture and a few other frameworks that combined convolutional LSTM and attention modules, the proposed framework displayed promising results.
One of the main challenges in anomaly detection is high-dimensional data. The combination of an anomaly detection mechanism and a feature extractor model is explored in [14]. The authors proposed a hybrid model that combines Deep Belief Networks (DBN) and a one-class SVM. The one-class SVM is trained on the features that the DBN framework has learned from the input data. The reported results are on par with a deep autoencoder in terms of performance.
A similarity function for one-shot anomaly detection of sounds was put forth in [7]. The proposed method is called the SPecific anomaly IDentifiER network (SPIDERnet). Because their similarity function was based on the naive mean squared error between the input data and a memorised spectrogram, previous memory-based one-shot learning methods were only able to detect short anomalous sounds. SPIDERnet is a fusion of attention mechanisms for capturing time-frequency stretching and neural network based feature extractors for similarity measurement. The ToyADMOS and MIMII datasets were used for experiments. In comparison to other approaches used for industrial anomaly identification [15–17], better performance is reported in this work.

3 Spectro Temporal Fusion with CLSTM-Autoencoder

Self-supervised (one-class) deep model based anomaly detection approaches train autoencoders (AEs) using data of normal sounds only, to learn the patterns of normal sound events. Convolutional AEs (CAEs) are commonly used for image reconstruction. To capture the sequential patterns in normal sound signals, we propose a convolutional long short-term memory autoencoder (CLSTM-AE) based approach for anomalous sound detection. Spectrograms are commonly used with CAEs for sound anomaly detection. In addition to the spectrogram image input, we explore a CNN-based temporal-gram network (TgramNet) to learn temporal features from normal sound data.

3.1 Spectro Temporal Fusion

The log-Mel spectrogram (Sgram) and the temporal features of the TgramNet are combined to form a Mel-Spectro-tempogram, which is then used to train the CLSTM-AE as shown in Fig. 1. Mel-spectrogram (Sgram) images and temporal-gram (Tgram) images are created separately from the input data, which is in the form of time-series data. Raw audio data is transformed into the log-Mel spectrogram by applying the fast Fourier transform over overlapped windows of the raw audio and mapping the resulting frequency bins onto the Mel scale.
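For concreteness, a minimal sketch of the Sgram computation using librosa is shown below. The window size, hop length, and number of Mel bins are assumptions made for illustration, since the exact front-end parameters are not listed here; the dataset audio is sampled at 16 kHz.

```python
import librosa
import numpy as np

def log_mel_sgram(path, sr=16000, n_fft=1024, hop_length=512, n_mels=128):
    """Log-Mel spectrogram via overlapped-window FFT processing.
    The FFT size, hop length, and Mel-bin count are assumed values."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, frames)
```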
To compensate for the loss of anomaly-related information in the log-Mel spectrogram, a CNN-based network (TgramNet) is utilised to extract temporal information from the sound events. First, a 1D large-kernel convolution is applied, with the kernel size, channel number, and stride set to the window size, number of Mel bins, and hop length of the log-Mel spectrogram, respectively. Then, CNN block operations are applied three times. A CNN block comprises a layer normalization, a leaky ReLU activation, and a 1D convolution with a small kernel size. The dimension of the Tgram temporal features is not changed by the CNN block operations.
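A PyTorch sketch of this structure follows. It is an illustration under the stated constraints; the concrete values (128 Mel bins, 1024-sample window, 512-sample hop, kernel size 3 and leaky-ReLU slope in the CNN blocks) are assumptions rather than figures taken from the paper.

```python
import torch
import torch.nn as nn

class TgramNet(nn.Module):
    """Sketch of the CNN-based temporal feature extractor (TgramNet):
    a large-kernel 1D convolution whose kernel size, channel count, and
    stride match the log-Mel window size, Mel-bin count, and hop length,
    followed by three CNN blocks that keep the feature dimension fixed."""
    def __init__(self, n_mels=128, win_len=1024, hop_len=512, n_blocks=3):
        super().__init__()
        self.large_conv = nn.Conv1d(1, n_mels, kernel_size=win_len,
                                    stride=hop_len, padding=win_len // 2)
        self.norms = nn.ModuleList(nn.LayerNorm(n_mels) for _ in range(n_blocks))
        self.convs = nn.ModuleList(
            nn.Conv1d(n_mels, n_mels, kernel_size=3, padding=1)
            for _ in range(n_blocks))
        self.act = nn.LeakyReLU(0.1)

    def forward(self, wav):                     # wav: (batch, 1, samples)
        x = self.large_conv(wav)                # (batch, n_mels, frames)
        for norm, conv in zip(self.norms, self.convs):
            # Layer norm -> leaky ReLU -> small-kernel 1D convolution.
            x = norm(x.transpose(1, 2)).transpose(1, 2)
            x = conv(self.act(x))
        return x                                # Tgram: (batch, n_mels, frames)
```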
Features in the form of Tgram and mel-spectrogram images are then fused by way of image fusion, which combines the data from two or more images of the same scene to produce an image more detailed than either part alone. Image fusion is used in compression, identification, denoising, and many other digital image tasks. Fusion is carried out by applying the discrete wavelet transform (DWT), which divides the input into several scale components using the wavelet algorithm; the data can then be examined at each scale. The sub-band structure of the DWT makes the transform and the associated computations fast. The images are partitioned into components over a range of coefficients and frequencies, which are then used to increase the resolution of the fused image. Fused images are obtained as shown in Fig. 2, and sequences of such fused images are fed as input to the CLSTM-AE.
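A minimal single-level DWT fusion sketch using PyWavelets is given below. The wavelet choice, decomposition depth, and fusion rule (average the approximation sub-band, keep the larger-magnitude detail coefficients) are assumptions, since they are not specified above; the two input images must share the same shape.

```python
import numpy as np
import pywt

def dwt_fuse(sgram, tgram, wavelet="haar"):
    """Fuse the log-Mel spectrogram and the Tgram image into a
    Mel-Spectro-tempogram via a one-level 2D discrete wavelet transform."""
    cA1, (cH1, cV1, cD1) = pywt.dwt2(sgram, wavelet)
    cA2, (cH2, cV2, cD2) = pywt.dwt2(tgram, wavelet)
    take_max = lambda a, b: np.where(np.abs(a) >= np.abs(b), a, b)
    fused_coeffs = ((cA1 + cA2) / 2.0,              # average approximation band
                    (take_max(cH1, cH2),            # max-magnitude detail bands
                     take_max(cV1, cV2),
                     take_max(cD1, cD2)))
    return pywt.idwt2(fused_coeffs, wavelet)        # fused image
```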

3.2 CLSTM-AutoEncoder Based Approach

The proposed architecture stacks blocks of 2D CNN, 1D CNN, and LSTM layers in both the encoder and the decoder, as shown in Fig. 3. First, multiple filters in the 2D CNN layers extract spatial features. A max-pooling layer then condenses the features extracted by the 2D CNN layer, and its output serves as input to the subsequent layers of the design. The 1D CNN layer reduces the size of the previously extracted features.
In addition to learning complicated temporal dynamics within the input sequences, the CLSTM-AE has the ability to utilise long-term dependencies over lengthy input sequences with the help of internal memory. The sequential input is represented as a fixed-dimensional vector in the CLSTM-AE model's hidden state output. The decoder model that reconstructs the input sequence then uses this representation as input. The reconstruction error serves as the basic measure of the model's efficacy. During the training phase, the CLSTM-AE learns the network parameters that lower the reconstruction error, or anomaly score, of the sound data. Anomalies produce high anomaly scores (reconstruction errors) during testing since they are not compressed appropriately. Anomalies are identified through optimal thresholding on the anomaly scores.
Our model uses the CLSTM-AE in a self-supervised manner (trained using normal sound data only) to learn representations of the dataset. To produce a compressed feature vector, the encoder block encodes the input data, and the dimensions are lowered to 128, 64, 32, and 16 after the encoder's first, second, third, and fourth layers, respectively. The layers in the decoder block are arranged in the opposite order, with dimensions increasing through 16, 32, 64, and 128 at the first, second, third, and fourth decoder levels, respectively. The decoder block then reconstructs the input. The final layer of the decoder block is linked to a fully connected layer that creates the output feature vector. Since the model is trained using only normal data, the reconstruction error for normal data will be smaller than for anomalous data. This behaviour enables recognising anomalous data, because the associated error value will be substantially greater.
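A PyTorch sketch of this encoder-decoder layout is shown below. The LSTM layer widths (128, 64, 32, 16 and their mirror image) follow the text; the 2D/1D kernel sizes, channel counts, and pooling configuration are assumptions made only to obtain a runnable illustration.

```python
import torch
import torch.nn as nn

class CLSTMAE(nn.Module):
    """Sketch of the CLSTM autoencoder: 2D CNN + max-pooling front end,
    a 1D CNN that shrinks the feature size, stacked LSTM encoder/decoder
    layers (128 -> 64 -> 32 -> 16 and back), and a final dense layer."""
    def __init__(self, n_mels=128):
        super().__init__()
        self.cnn2d = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)))                   # pool over frequency only
        self.cnn1d = nn.Conv1d(16 * (n_mels // 2), 128, kernel_size=3, padding=1)
        enc, dec = [128, 64, 32, 16], [16, 32, 64, 128]
        self.encoder = nn.ModuleList(
            nn.LSTM(i, o, batch_first=True) for i, o in zip([128] + enc[:-1], enc))
        self.decoder = nn.ModuleList(
            nn.LSTM(i, o, batch_first=True) for i, o in zip([16] + dec[:-1], dec))
        self.out = nn.Linear(128, n_mels)           # reconstruct each time frame

    def forward(self, x):                           # x: (batch, 1, n_mels, frames)
        f = self.cnn2d(x)                           # (batch, 16, n_mels/2, frames)
        f = self.cnn1d(f.flatten(1, 2))             # (batch, 128, frames)
        f = f.transpose(1, 2)                       # (batch, frames, 128)
        for lstm in self.encoder:                   # dims: 128 -> 64 -> 32 -> 16
            f, _ = lstm(f)
        for lstm in self.decoder:                   # dims: 16 -> 32 -> 64 -> 128
            f, _ = lstm(f)
        return self.out(f)                          # (batch, frames, n_mels)
```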

4 Experimental Studies

The proposed approach is evaluated on DCASE 2020 Challenge Task 2 (unsupervised anomalous sound detection for machine condition monitoring). The dataset includes normal and abnormal sounds of six types of machines, namely ToyCar, ToyConveyor, Fan, Pump, Slide rail, and Valve. Each sound clip is of 10 s duration and includes environmental noise. In unsupervised anomalous sound detection tasks, it is assumed that anomaly detectors are trained entirely on normal sound clips, excluding anomalies, yet after training are capable of accurately recognising anomalies.
The training data (normal sounds) from the development and supplementary datasets of Task 2 are used as the training set in the experiments, while the test data (normal and anomalous sounds) from the development dataset are used for evaluation. Initial experiments were conducted using the mean squared error (MSE). However, after further analysis and experimentation on other datasets, we concluded that MSE is not a promising error metric, as it lacks local sensitivity: it ignores the spatial organization of the images. We therefore used the proximal sensitive error [25], a locally sensitive error function that also considers relative positions within the spectrograms. For illustration, we used spectrograms of a few samples of normal and abnormal sounds from the ToyCar dataset; the error values are plotted in Figs. 4 and 5. Based on our analysis of the average error values over all spectrograms of normal and abnormal sounds, we fixed a threshold of 0.8 for classification: samples with reconstruction error below 0.8 are classified as normal, and those above as anomalous. We studied the performance of the proposed approach and the SCR-LSTM (stacked convolutional residual LSTM) approach for three types of input, namely spectrogram, Tgram, and Spectro-tempogram (STgram), on the ToyCar dataset; the STgram-based CLSTM-AE approach significantly improves performance, as shown in Table 1. The epoch-versus-training-loss curve for training on the ToyCar dataset is shown in Fig. 6.
Table 1
Performance with three forms of input on the ToyCar dataset

Model with input                            Accuracy (%)
Mel-Spectrogram with SCR-LSTM [12]          67.54
Mel-Spectrogram with CLSTM-AE               69.13
Temporal gram with SCR-LSTM [12]            55.33
Temporal gram with CLSTM-AE                 65.33
Mel-Spectro-tempogram with SCR-LSTM [12]    90.62
Mel-Spectro-tempogram with CLSTM-AE         94.54
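The scoring-and-thresholding step can be sketched as follows, assuming a trained model that returns a reconstruction with the same shape as its input. Plain MSE is used for brevity; as noted above, the reported results use the locally sensitive error instead.

```python
import torch

@torch.no_grad()
def classify_clips(model, clips, threshold=0.8):
    """One reconstruction-error score per clip; scores above the fixed
    0.8 threshold are classified as anomalous, the rest as normal."""
    recon = model(clips)
    scores = ((recon - clips) ** 2).flatten(start_dim=1).mean(dim=1)
    return scores, scores > threshold       # True -> anomalous
```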
In addition to the DCASE 2020 Task 2 data, the MIMII (Malfunctioning Industrial Machine Investigation and Inspection) dataset is used to verify the robustness of the proposed approach. It consists of data from four kinds of industrial machines: Pump, Valve, Slide rail, and Fan. The data was recorded with a microphone placed at a distance of 10 cm. Each machine category consists of audio signals at three signal-to-noise ratios (−6 dB, 0 dB, and 6 dB), sampled at 16 kHz; the AUC results of the proposed approach are shown in Table 2. A comparison of the AUC results with two baseline methods, AE and IDNN [24], is presented in Table 3. The interpolating deep neural network (IDNN) model is trained to forecast a given time frame of a representation from the surrounding frames, which leads to slightly better performance than the basic autoencoder.
Table 2
AUC results of the proposed STgram with CLSTM-AE approach on the MIMII dataset

Input        −6 dB   0 dB    6 dB
Fan          87.23   84.02   88.11
Pump         83.65   81.59   79.87
Valve        82.23   80.32   82.94
Slide Rail   88.79   91.84   87.69
Table 3
Comparison of AUC results with baseline methods on the MIMII dataset

Model             Fan    Pump   Valve   Slide rail
AE [24]           66.2   72.9   66.3    85.5
IDNN [24]         67.7   73.7   84.5    86.6
STgram+CLSTM-AE   87.2   83.6   82.2    88.8
Studies were carried out on the DCASE Task 2 dataset using a few state-of-the-art methods, and the performance comparison over the six machine types is shown in Table 4. ANP [23], the Attentive Neural Process, is an encoder/decoder architecture that encodes each element in the context set along with the observed values to predict the output. It is trained to estimate conditionally independent Gaussian parameters for each element of the target set by attending to context points at nearby coordinates. It gives reasonable results for a few machine types such as ToyCar and Slide rail; for the other sounds, however, it gives poor results.
Table 4
Comparison of the STgram+CLSTM-AE approach with other models on the DCASE 2020 Task 2 dataset (AUC, %)

Framework                        Toy car   Toy conveyor   Fan     Pump    Slide rail   Valve
Conv AE [12]                     69.12     60.03          52.63   60.96   76.20        53.10
ANP [23]                         70.1      60.1           48.0    56.9    85.4         43.5
SCR-LSTM [12]                    69.13     66.79          65.83   72.89   84.76        66.28
Semi-supervised AE [12]          87.27     90.35          78.63   80.33   78.94        80.94
Classification based ASD [18]    82.79     80.66          85.60   82.42   65.84        56.22
Dense AE [12]                    80.79     76.43          72.03   73.06   87.08        72.16
GroupMADE AE [18]                80.51     76.03          70.10   75.68   93.29        89.68
Glow aff. [21]                   80.1      61.0           49.6    65.7    87.8         77.7
MobileNet V2 [18]                87.66     69.71          80.19   82.53   95.27        88.65
ResNet [18]                      88.69     65.04          78.87   83.50   90.49        86.24
IDCAE [22]                       91.25     72.23          81.82   88.17   86.49        84.59
STgram+CLSTM-AE                  94.54     96.12          89.96   84.80   92.94        84.23
SCR-LSTM [12], which consists of stacked two- and one-dimensional CNNs with residual long short-term memory (LSTM), is not very promising for detecting anomalies. The Group-Masked Autoencoder based Density Estimation (Group-MADE) approach yields good results for sound anomaly detection [18], performing slightly better than the Dense AE based approach. Glow aff. [21] is based on exact likelihood estimation using normalizing flows. The model is trained to assign higher likelihood to sounds of the target machine than to sounds of other machines of the same machine type; this mitigates the out-of-distribution detection issue in which the likelihood is affected by the smoothness of the data.
The MobileNetV2 architecture [19] uses sequences of two or more identical layers; every layer in the same sequence has the same number of output channels, and all spatial convolutions use 3×3 kernels. MobileNetV2 detected anomalous sounds of the Slide rail and Valve machine types better than the other methods. The ResNet-50 model consists of 5 stages, each containing convolution and identity blocks; both block types have 3 convolution layers [20]. ResNet provided performance comparable to MobileNetV2. IDCAE [22] is an adaptive class-conditioned autoencoder that is well suited to open-set recognition problems; it provides reasonably good results in comparison with the other models. The proposed CLSTM-AE with STgram input significantly and consistently improved the anomaly detection performance across all six machine types.

5 Conclusion

We have proposed a spectro-tempogram with CLSTM-AE based approach for anomalous sound detection. Fusion of the temporal features learnt using TgramNet with the spectrogram features enhances the quality of the input to the CLSTM-AE model. The proposed model was validated on the DCASE Challenge Task 2 dataset, which comprises sound data from six types of machines, namely ToyCar, ToyConveyor, Fan, Pump, Slide rail, and Valve. The model significantly improved performance over several baseline and state-of-the-art deep models.

Declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.
The authors have consented to the submission of the manuscript to the journal.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Literature
1. Mnasri Z, Rovetta S, Masulli F (2022) Anomalous sound event detection: a survey of machine learning based methods and applications. Multimed Tools Appl 81(4):5537–5586
2. Hojjati H, Armanfard N (2022) Self-supervised acoustic anomaly detection via contrastive learning. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 3253–3257
3. Chen H, Song Y, Dai LR, McLoughlin I, Liu L (2022) Self-supervised representation learning for unsupervised anomalous sound detection under domain shift. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 471–475
4. Jing L, Tian Y (2020) Self-supervised visual feature learning with deep neural networks: a survey. IEEE Trans Pattern Anal Mach Intell 43(11):4037–4058
5. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 4489–4497
6. Müller R, Ritz F, Illium S, Linnhoff-Popien C (2020) Acoustic anomaly detection for machine sounds based on image transfer learning. arXiv:2006.03429
7. Koizumi Y, Yasuda M, Murata S, Saito S, Uematsu H, Harada N (2020) SPIDERnet: attention network for one-shot anomaly detection in sounds. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 281–285
8. Purohit H, Tanabe R, Ichige K, Endo T, Nikaido Y, Suefusa K, Kawaguchi Y (2019) MIMII dataset: sound dataset for malfunctioning industrial machine investigation and inspection. arXiv:1909.09347
9. Kawaguchi Y, Imoto K, Koizumi Y, Harada N, Niizumi D, Dohi K, Tanabe R, Purohit H, Endo T (2021) Description and discussion on DCASE 2021 challenge task 2: unsupervised anomalous sound detection for machine condition monitoring under domain shifted conditions. arXiv:2106.04492
10. Perez-Castanos S, Naranjo-Alcazar J, Zuccarello P, Cobos M (2020) Anomalous sound detection using unsupervised and semi-supervised autoencoders and gammatone audio representation. arXiv:2006.15321
11. Pang G, Shen C, Cao L, Hengel AVD (2021) Deep learning for anomaly detection: a review. ACM Comput Surv 54(2):1–38
12. Ribeiro A, Matos LM, Pereira PJ, Nunes EC, Ferreira AL, Cortez P, Pilastri A (2020) Deep dense and convolutional autoencoders for unsupervised anomaly detection in machine condition sounds. arXiv:2006.10417
13. Tan J, Oyekan J (2021) Attention augmented convolutional neural network for acoustics based machine state estimation. Appl Soft Comput 110:107630
14. Mobtahej P, Zhang X, Hamidi M, Zhang J (2021) Deep learning-based anomaly detection for compressors using audio data. In: 2021 annual reliability and maintainability symposium (RAMS), pp 1–7
15. Zhang A, Li S, Cui Y, Yang W, Dong R, Hu J (2019) Limited data rolling bearing fault diagnosis with few-shot learning. IEEE Access 7:110895–110904
16. Wen L, Gao L, Li X (2019) A new snapshot ensemble convolutional neural network for fault diagnosis. IEEE Access 7:32037–32047
17. Koizumi Y, Murata S, Harada N, Saito S, Uematsu H (2019) SNIPER: few-shot learning for anomaly detection to minimize false-negative rate with ensured true-positive rate. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 915–919
18. Giri R, Tenneti SV, Helwani K, Cheng F, Isik U, Krishnaswamy A (2020) Unsupervised anomalous sound detection using self-supervised classification and group masked autoencoder for density estimation. DCASE 2020 Challenge, Tech. Rep.
19. Howard A, Zhmoginov A, Chen L-C, Sandler M, Zhu M (2018) Inverted residuals and linear bottlenecks: mobile networks for classification, detection and segmentation. In: CVPR
20. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
21. Dohi K, Endo T, Purohit H, Tanabe R, Kawaguchi Y (2021) Flow-based self-supervised density estimation for anomalous sound detection. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 336–340
22. Daniluk P, Goździewski M, Kapka S, Kośmider M (2020) Ensemble of auto-encoder based and WaveNet-like systems for unsupervised anomaly detection. DCASE 2020 Challenge, Tech. Rep.
23. Wichern G, Chakrabarty A, Wang ZQ, Le Roux J (2021) Anomalous sound detection using attentive neural processes. In: IEEE workshop on applications of signal processing to audio and acoustics (WASPAA), pp 186–190
24. Suefusa K, Nishida T, Purohit H, Tanabe R, Endo T, Kawaguchi Y (2020) Anomalous sound detection based on interpolation deep neural network. In: Proc. ICASSP, pp 271–275