
Open Access 01-02-2024

Spectro Temporal Fusion with CLSTM-Autoencoder based approach for Anomalous Sound Detection

Authors: S. Chandrakala, Akhilandeswari Pidikiti, P. V. N. Sai Mahathi

Published in: Neural Processing Letters | Issue 1/2024

Abstract

Deep learning models have proved efficient for complex learning tasks. Anomalous sound detection is one such complex task, for which self-supervised deep architectures are emerging. Self-supervised deep models efficiently capture the underlying structure of data. Self-supervised anomalous sound detection attempts to distinguish between normal sounds and unidentified anomalous sounds. With appropriate autoencoders, reconstruction-error based decision making is effective for anomaly detection in domains such as computer vision. Auditory-image (spectrogram) based representations of sound signals are commonly used in sound event detection. We propose a convolutional long short-term memory (CLSTM) autoencoder based approach for anomalous sound detection. In this approach, we explore the fusion of spectral and temporal features to model the characteristics of normal sounds in the presence of noise. The proposed approach is evaluated using the MIMII dataset and the DCASE 2020 Challenge Task 2 anomalous sound detection dataset. Experiments reveal significant improvement over state-of-the-art approaches.

1 Introduction

The task of anomalous sound detection (ASD) is challenging and demanding in monitoring applications such as industrial machines, animal care, and public surveillance [1]. Compared to visual surveillance, audio sensors offer several advantages: (i) audio data acquisition and basic processing are much less expensive in terms of memory and computational load, (ii) microphones offer omnidirectional coverage, (iii) they are insensitive to luminosity and harsh weather conditions, and (iv) audio cues can supplement video cues in cases where anomalous events or objects are occluded. Neural network-based techniques have been widely employed to address the ASD problem as deep learning models have grown in popularity [2, 3]. Using data from normal sound occurrences only, the self-supervised anomaly detection approach trains autoencoders (AEs) to learn the patterns of normal sound events and to spot abnormalities.
Acoustic characteristics of a sound signal can also be interpreted as a time-frequency representation in the form of an auditory image. AEs with convolutional layers (CAEs) are commonly used for image reconstruction. To capture the sequential patterns in normal sound signals, we propose a convolutional long short-term memory autoencoder (CLSTM-AE) based approach for anomalous sound detection. With its encoder-decoder paradigm, the CLSTM autoencoder is well suited to sequential data. The trained encoder can be used to obtain compressed representations of speech, text, and other sequential data. The CLSTM-AE can recall and exploit long-term dependencies over lengthy input sequences with the help of internal memory. The decoder model that reconstructs the input sequence uses these learnt representations as input. The reconstruction error serves as the basic measure of the model's efficacy. During the training phase, the CLSTM-AE learns network parameters that lower the reconstruction error, or anomaly score, of the sound data. During testing, anomalous sounds are not compressed properly and therefore yield high anomaly scores (reconstruction errors). Optimal thresholding then identifies anomalous sounds.
In addition to the spectrogram image input, we explore a CNN-based temporal-gram network (TgramNet) to learn temporal (Tgram) features from normal sound data. The log-Mel spectrogram (Sgram) and the Tgram are fused to form a Mel-Spectro-tempogram, which is then used to train the CLSTM-AE. The rest of this paper is structured as follows: state-of-the-art deep learning based approaches for anomalous sound detection are reviewed in Sect. 2. The proposed CLSTM-AE based approach is presented in Sect. 3. The experimental study in Sect. 4 covers ablation studies of the proposed approach on the DCASE 2020 Task 2 dataset and performance analysis against state-of-the-art deep model based approaches.
2 Related Work

A wide range of deep anomaly detection algorithms have been introduced, showing noticeably superior performance over traditional machine learning based anomaly detection in difficult detection problems [4–6]. In Koizumi et al. [7], a Variational Autoencoder (VAE) is employed, which imposes a normal distribution on the latent layer of the VAE. In addition to preventing overfitting, this can help in data reconstruction. Acoustic input was transformed into auditory images by convolutional layers in the AE. The authors used the MIMII dataset, which includes normal and abnormal sounds of four machine types (fans, pumps, slide rails, and valves).
In Purohit et al. [8], the Neyman–Pearson lemma is used to define the loss function of the AE, and ASD is cast as a statistical hypothesis test in which anomalous sound is simulated. The authors conclude that their method improves ASD performance measures at a low false positive rate. The drawback of this approach is that it simulates aberrant sound via costly rejection sampling. In Purohit et al. [8], WaveNet is used to predict sound data rather than generating it, and the prediction error is used to calculate the anomaly score. Experiments were carried out on a real-world dataset recorded in a subway station. According to the experimental findings, WaveNet marginally boosts an ASD system's effectiveness in comparison to AE-based methods; however, WaveNet incurs a higher computational cost.
A recent study using the evaluation data of DCASE 2020 Task 2 covered (1) ensemble approaches using "outlier exposure" (OE) and "inlier modelling" (IM) based detectors, and (2) approaches using IM-based detection on features discovered during a machine-identification task [9]. The Group Masked Autoencoder for Density Estimation (GMADE) is a unique density estimation-based anomaly detector built as an ensemble with a self-supervised classification-based anomaly detector [10]. Another approach [11] uses audio representations based on Gammatones together with convolutional autoencoders (both unsupervised and semi-supervised); these architectures significantly outperform the baseline results.
According to a technical report [12] summarizing performance on Task 2 of the DCASE 2020 challenge, several submissions followed self-supervised learning based techniques to detect abnormal sounds. Training used only samples of normal machine operation. These techniques employ deep autoencoders that leverage mel-spectrogram sound features and dense and convolutional architectures. The dense and convolutional AEs, trained and tested on six types of machine operation, produced results that were competitive with, and even better than, the baseline technique given in the DCASE challenge. The baseline consists of a dense autoencoder (AE) with three layers of 128 units in each of the encoder and decoder components and a latent space of 8 units, all with ReLU activations.
Another approach [8] uses a long short-term memory (LSTM) autoencoder with support vector machines, taking into account the fact that, because anomalies occur infrequently in real-time industrial environments, there are far fewer aberrant samples than normal samples. A one-class SVM receives the compressed representations of the input data produced by the autoencoder model. The model is trained on normal sound data, and both normal and anomalous data are used to test the learned model.
Another work proposed a dense, convolutional autoencoder framework [8]. The mel-spectrograms of the audio data were used as input. Compared to classification based on raw audio data, the mel-spectrogram image helps emphasise spatial features and hence leads to improved classification outcomes. The MIMII dataset and a portion of the ToyADMOS dataset were used to train this model.
A deep learning system called MAABL (MobileNetV2 with enhanced attention block), which combines deep convolutional neural networks with attention blocks, is presented in [13]. The neural network combines an augmented attention block with the inverted residual block of MobileNetV2. Attention blocks help the network emphasise the most important features during training. The MIMII and ToyADMOS datasets were used for experimentation. In comparison to the basic MobileNetV2 architecture and a few other frameworks that combined convolutional LSTM and attention modules, the proposed framework displayed promising results.
One of the main challenges in anomaly detection is high-dimensional data. The combination of an anomaly detection mechanism and a feature extractor model is explored in [14]. The authors proposed a hybrid model that combines Deep Belief Networks (DBN) and a one-class SVM. The one-class SVM is trained on the features that the DBN framework has learned from the input data. The reported results are on par with a deep autoencoder in terms of performance.
A similarity function for one-shot anomaly detection of sounds was put forth in [7]. The proposed method is called the SPecific anomaly IDentifiER network (SPIDERnet). Because their similarity function was based on the naive mean squared error between the input data and a memorised spectrogram, previous memory-based one-shot learning methods were only able to detect short anomalous sounds. SPIDERnet is a fusion of attention mechanisms for capturing time-frequency stretching and neural network based feature extractors for similarity measurement. The ToyADMOS and MIMII datasets were used for experiments. In comparison to other approaches used for industrial anomaly identification [15–17], better performance is reported in this work.

3 Spectro Temporal Fusion with CLSTM-Autoencoder

Self-supervised (one-class) deep model based anomaly detection approaches train autoencoders (AEs) using data of normal sounds only, to learn the patterns of normal sound events. Convolutional AEs (CAEs) are commonly used for image reconstruction. To capture the sequential patterns in normal sound signals, we propose a convolutional long short-term memory autoencoder (CLSTM-AE) based approach for anomalous sound detection. Spectrograms are commonly used with CAEs for sound anomaly detection. In addition to the spectrogram image input, we explore a CNN-based temporal-gram network (TgramNet) to learn temporal features from normal sound data.

3.1 Spectro Temporal Fusion

The log-Mel spectrogram (Sgram) and the temporal features of the TgramNet are combined to form a Mel-Spectro-tempogram, which is then used to train the CLSTM-AE as shown in Fig. 1. Mel-spectrogram (Sgram) images and temporal-gram (Tgram) images are created separately from the input data, which is in the form of time-series data. Raw audio data is transformed into the log-Mel spectrogram by applying the fast Fourier transform over overlapped windows of the raw audio and mapping the resulting frequency bins onto the Mel scale.
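For concreteness, a minimal sketch of the Sgram computation using librosa is shown below. The window size, hop length, and number of Mel bins are assumptions made for illustration, since the exact front-end parameters are not listed here; the dataset audio is sampled at 16 kHz.

```python
import librosa
import numpy as np

def log_mel_sgram(path, sr=16000, n_fft=1024, hop_length=512, n_mels=128):
    """Log-Mel spectrogram via overlapped-window FFT processing.
    The FFT size, hop length, and Mel-bin count are assumed values."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, frames)
```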
To compensate for the loss of anomaly-related information in the log-Mel spectrogram, a CNN-based network (TgramNet) is utilised to extract temporal information from the sound events. First, a 1D large-kernel convolution is applied, with the kernel size, channel number, and stride set to the window size, number of Mel bins, and hop length of the log-Mel spectrogram, respectively. Then, CNN block operations are applied three times. A CNN block comprises a layer normalization, a leaky ReLU activation, and a 1D convolution with a small kernel size. The dimension of the Tgram temporal features is not changed by the CNN block operations.
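A PyTorch sketch of this structure follows. It is an illustration under the stated constraints; the concrete values (128 Mel bins, 1024-sample window, 512-sample hop, kernel size 3 and leaky-ReLU slope in the CNN blocks) are assumptions rather than figures taken from the paper.

```python
import torch
import torch.nn as nn

class TgramNet(nn.Module):
    """Sketch of the CNN-based temporal feature extractor (TgramNet):
    a large-kernel 1D convolution whose kernel size, channel count, and
    stride match the log-Mel window size, Mel-bin count, and hop length,
    followed by three CNN blocks that keep the feature dimension fixed."""
    def __init__(self, n_mels=128, win_len=1024, hop_len=512, n_blocks=3):
        super().__init__()
        self.large_conv = nn.Conv1d(1, n_mels, kernel_size=win_len,
                                    stride=hop_len, padding=win_len // 2)
        self.norms = nn.ModuleList(nn.LayerNorm(n_mels) for _ in range(n_blocks))
        self.convs = nn.ModuleList(
            nn.Conv1d(n_mels, n_mels, kernel_size=3, padding=1)
            for _ in range(n_blocks))
        self.act = nn.LeakyReLU(0.1)

    def forward(self, wav):                     # wav: (batch, 1, samples)
        x = self.large_conv(wav)                # (batch, n_mels, frames)
        for norm, conv in zip(self.norms, self.convs):
            # Layer norm -> leaky ReLU -> small-kernel 1D convolution.
            x = norm(x.transpose(1, 2)).transpose(1, 2)
            x = conv(self.act(x))
        return x                                # Tgram: (batch, n_mels, frames)
```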
Features in the form of Tgram and mel-spectrogram images are then fused by way of image fusion, which combines the data from two or more images of the same scene to produce an image more detailed than either part alone. Image fusion is used in compression, identification, denoising, and many other digital image tasks. Fusion is carried out by applying the discrete wavelet transform (DWT), which divides the input into several scale components using the wavelet algorithm; the data can then be examined at each scale. The sub-band structure of the DWT makes the transform and the associated computations fast. The images are partitioned into components over a range of coefficients and frequencies, which are then used to increase the resolution of the fused image. Fused images are obtained as shown in Fig. 2, and sequences of such fused images are fed as input to the CLSTM-AE.
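A minimal single-level DWT fusion sketch using PyWavelets is given below. The wavelet choice, decomposition depth, and fusion rule (average the approximation sub-band, keep the larger-magnitude detail coefficients) are assumptions, since they are not specified above; the two input images must share the same shape.

```python
import numpy as np
import pywt

def dwt_fuse(sgram, tgram, wavelet="haar"):
    """Fuse the log-Mel spectrogram and the Tgram image into a
    Mel-Spectro-tempogram via a one-level 2D discrete wavelet transform."""
    cA1, (cH1, cV1, cD1) = pywt.dwt2(sgram, wavelet)
    cA2, (cH2, cV2, cD2) = pywt.dwt2(tgram, wavelet)
    take_max = lambda a, b: np.where(np.abs(a) >= np.abs(b), a, b)
    fused_coeffs = ((cA1 + cA2) / 2.0,              # average approximation band
                    (take_max(cH1, cH2),            # max-magnitude detail bands
                     take_max(cV1, cV2),
                     take_max(cD1, cD2)))
    return pywt.idwt2(fused_coeffs, wavelet)        # fused image
```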

3.2 CLSTM-AutoEncoder Based Approach

The proposed architecture stacks blocks of 2D CNN, 1D CNN, and LSTM layers in both the encoder and the decoder, as shown in Fig. 3. First, multiple filters in the 2D CNN layers extract spatial features. A max-pooling layer then condenses the features extracted by the 2D CNN layer, and its output serves as input to the subsequent layers of the design. The 1D CNN layer reduces the size of the previously extracted features.
In addition to learning complicated temporal dynamics within the input sequences, the CLSTM-AE has the ability to utilise long-term dependencies over lengthy input sequences with the help of internal memory. The sequential input is represented as a fixed-dimensional vector in the CLSTM-AE model's hidden state output. The decoder model that reconstructs the input sequence then uses this representation as input. The reconstruction error serves as the basic measure of the model's efficacy. During the training phase, the CLSTM-AE learns the network parameters that lower the reconstruction error, or anomaly score, of the sound data. Anomalies produce high anomaly scores (reconstruction errors) during testing since they are not compressed appropriately. Anomalies are identified through optimal thresholding on the anomaly scores.
Our model uses the CLSTM-AE in a self-supervised manner (trained using normal sound data only) to learn representations of the dataset. To produce a compressed feature vector, the encoder block encodes the input data, and the dimensions are lowered to 128, 64, 32, and 16 after the encoder's first, second, third, and fourth layers, respectively. The layers in the decoder block are arranged in the opposite order, with dimensions increasing through 16, 32, 64, and 128 at the first, second, third, and fourth decoder levels, respectively. The decoder block then reconstructs the input. The final layer of the decoder block is linked to a fully connected layer that creates the output feature vector. Since the model is trained using only normal data, the reconstruction error for normal data will be smaller than for anomalous data. This behaviour enables recognising anomalous data, because the associated error value will be substantially greater.
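A PyTorch sketch of this encoder-decoder layout is shown below. The LSTM layer widths (128, 64, 32, 16 and their mirror image) follow the text; the 2D/1D kernel sizes, channel counts, and pooling configuration are assumptions made only to obtain a runnable illustration.

```python
import torch
import torch.nn as nn

class CLSTMAE(nn.Module):
    """Sketch of the CLSTM autoencoder: 2D CNN + max-pooling front end,
    a 1D CNN that shrinks the feature size, stacked LSTM encoder/decoder
    layers (128 -> 64 -> 32 -> 16 and back), and a final dense layer."""
    def __init__(self, n_mels=128):
        super().__init__()
        self.cnn2d = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)))                   # pool over frequency only
        self.cnn1d = nn.Conv1d(16 * (n_mels // 2), 128, kernel_size=3, padding=1)
        enc, dec = [128, 64, 32, 16], [16, 32, 64, 128]
        self.encoder = nn.ModuleList(
            nn.LSTM(i, o, batch_first=True) for i, o in zip([128] + enc[:-1], enc))
        self.decoder = nn.ModuleList(
            nn.LSTM(i, o, batch_first=True) for i, o in zip([16] + dec[:-1], dec))
        self.out = nn.Linear(128, n_mels)           # reconstruct each time frame

    def forward(self, x):                           # x: (batch, 1, n_mels, frames)
        f = self.cnn2d(x)                           # (batch, 16, n_mels/2, frames)
        f = self.cnn1d(f.flatten(1, 2))             # (batch, 128, frames)
        f = f.transpose(1, 2)                       # (batch, frames, 128)
        for lstm in self.encoder:                   # dims: 128 -> 64 -> 32 -> 16
            f, _ = lstm(f)
        for lstm in self.decoder:                   # dims: 16 -> 32 -> 64 -> 128
            f, _ = lstm(f)
        return self.out(f)                          # (batch, frames, n_mels)
```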

4 Experimental Studies

The proposed approach is evaluated on DCASE 2020 Challenge Task 2 (unsupervised anomalous sound detection for machine condition monitoring). The dataset includes normal and abnormal sounds of six types of machines, namely ToyCar, ToyConveyor, Fan, Pump, Slide rail, and Valve. Each sound clip is of 10 s duration and includes environmental noise. In unsupervised anomalous sound detection tasks, it is assumed that anomaly detectors are trained entirely on normal sound clips, excluding anomalies, yet after training are capable of accurately recognising anomalies.
The training data (normal sounds) from the development and supplementary datasets of Task 2 are used as the training set in the experiments, while the test data (normal and anomalous sounds) from the development dataset are used for evaluation. Initial experiments were conducted using the mean squared error (MSE). However, after further analysis and experimentation on other datasets, we concluded that MSE is not a promising error metric, as it lacks local sensitivity: it ignores the spatial organization of the images. We therefore used the proximal sensitive error [25], a locally sensitive error function that also considers relative positions within the spectrograms. For illustration, we used spectrograms of a few samples of normal and abnormal sounds from the ToyCar dataset; the error values are plotted in Figs. 4 and 5. Based on our analysis of the average error values over all spectrograms of normal and abnormal sounds, we fixed a threshold of 0.8 for classification: samples with reconstruction error below 0.8 are classified as normal, and those above as anomalous. We studied the performance of the proposed approach and the SCR-LSTM (stacked convolutional residual LSTM) approach for three types of input, namely spectrogram, Tgram, and Spectro-tempogram (STgram), on the ToyCar dataset; the STgram-based CLSTM-AE approach significantly improves performance, as shown in Table 1. The epoch-versus-training-loss curve for training on the ToyCar dataset is shown in Fig. 6.
Table 1
Performance with three forms of input on the ToyCar dataset

Model with input                            Accuracy (%)
Mel-Spectrogram with SCR-LSTM [12]          67.54
Mel-Spectrogram with CLSTM-AE               69.13
Temporal gram with SCR-LSTM [12]            55.33
Temporal gram with CLSTM-AE                 65.33
Mel-Spectro-tempogram with SCR-LSTM [12]    90.62
Mel-Spectro-tempogram with CLSTM-AE         94.54
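The scoring-and-thresholding step can be sketched as follows, assuming a trained model that returns a reconstruction with the same shape as its input. Plain MSE is used for brevity; as noted above, the reported results use the locally sensitive error instead.

```python
import torch

@torch.no_grad()
def classify_clips(model, clips, threshold=0.8):
    """One reconstruction-error score per clip; scores above the fixed
    0.8 threshold are classified as anomalous, the rest as normal."""
    recon = model(clips)
    scores = ((recon - clips) ** 2).flatten(start_dim=1).mean(dim=1)
    return scores, scores > threshold       # True -> anomalous
```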
In addition to the DCASE 2020 Task 2 data, the MIMII (Malfunctioning Industrial Machine Investigation and Inspection) dataset is used to verify the robustness of the proposed approach. It consists of data from four kinds of industrial machines: Pump, Valve, Slide rail, and Fan. The data was recorded with a microphone placed at a distance of 10 cm. Each machine category consists of audio signals at three signal-to-noise ratios (−6 dB, 0 dB, and 6 dB), sampled at 16 kHz; the AUC results of the proposed approach are shown in Table 2. A comparison of the AUC results with two baseline methods, AE and IDNN [24], is presented in Table 3. The interpolating deep neural network (IDNN) model is trained to forecast a given time frame of a representation from the surrounding frames, which leads to slightly better performance than the basic autoencoder.
Table 2
AUC results of the proposed STgram with CLSTM-AE approach on the MIMII dataset

Input        −6 dB   0 dB    6 dB
Fan          87.23   84.02   88.11
Pump         83.65   81.59   79.87
Valve        82.23   80.32   82.94
Slide Rail   88.79   91.84   87.69
Table 3
Comparison of AUC results with baseline methods on the MIMII dataset

Model             Fan    Pump   Valve   Slide rail
AE [24]           66.2   72.9   66.3    85.5
IDNN [24]         67.7   73.7   84.5    86.6
STgram+CLSTM-AE   87.2   83.6   82.2    88.8
Studies were carried out on the DCASE Task 2 dataset using a few state-of-the-art methods, and the performance comparison over the six machine types is shown in Table 4. ANP [23], the Attentive Neural Process, is an encoder/decoder architecture that encodes each element in the context set along with the observed values to predict the output. It is trained to estimate conditionally independent Gaussian parameters for each element of the target set by attending to context points at nearby coordinates. It gives reasonable results for a few machine types such as ToyCar and Slide rail; for the other sounds, however, it gives poor results.
Table 4
Comparison of the STgram+CLSTM-AE approach with other models on the DCASE 2020 Task 2 dataset (AUC, %)

Framework                        Toy car   Toy conveyor   Fan     Pump    Slide rail   Valve
Conv AE [12]                     69.12     60.03          52.63   60.96   76.20        53.10
ANP [23]                         70.1      60.1           48.0    56.9    85.4         43.5
SCR-LSTM [12]                    69.13     66.79          65.83   72.89   84.76        66.28
Semi-supervised AE [12]          87.27     90.35          78.63   80.33   78.94        80.94
Classification based ASD [18]    82.79     80.66          85.60   82.42   65.84        56.22
Dense AE [12]                    80.79     76.43          72.03   73.06   87.08        72.16
GroupMADE AE [18]                80.51     76.03          70.10   75.68   93.29        89.68
Glow aff. [21]                   80.1      61.0           49.6    65.7    87.8         77.7
MobileNet V2 [18]                87.66     69.71          80.19   82.53   95.27        88.65
ResNet [18]                      88.69     65.04          78.87   83.50   90.49        86.24
IDCAE [22]                       91.25     72.23          81.82   88.17   86.49        84.59
STgram+CLSTM-AE                  94.54     96.12          89.96   84.80   92.94        84.23
SCR-LSTM [12], which consists of stacked two- and one-dimensional CNNs with residual long short-term memory (LSTM), is not very promising for detecting anomalies. The Group-Masked Autoencoder based Density Estimation (Group-MADE) approach yields good results for sound anomaly detection [18], performing slightly better than the Dense AE based approach. Glow aff. [21] is based on exact likelihood estimation using normalizing flows. The model is trained to assign higher likelihood to sounds of the target machine than to sounds of other machines of the same machine type; this mitigates the out-of-distribution detection issue in which the likelihood is affected by the smoothness of the data.
The MobileNetV2 architecture [19] uses sequences of two or more identical layers; every layer in the same sequence has the same number of output channels, and all spatial convolutions use 3×3 kernels. MobileNetV2 detected anomalous sounds of the Slide rail and Valve machine types better than the other methods. The ResNet-50 model consists of 5 stages, each containing convolution and identity blocks; both block types have 3 convolution layers [20]. ResNet provided performance comparable to MobileNetV2. IDCAE [22] is an adaptive class-conditioned autoencoder that is well suited to open-set recognition problems; it provides reasonably good results in comparison with the other models. The proposed CLSTM-AE with STgram input significantly and consistently improved the anomaly detection performance across all six machine types.

5 Conclusion

We have proposed a spectro-tempogram with CLSTM-AE based approach for anomalous sound detection. Fusion of the temporal features learnt using TgramNet with the spectrogram features enhances the quality of the input to the CLSTM-AE model. The proposed model was validated on the DCASE Challenge Task 2 dataset, which comprises sound data from six types of machines, namely ToyCar, ToyConveyor, Fan, Pump, Slide rail, and Valve. The model significantly improved performance over several baseline and state-of-the-art deep models.

Declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.
The authors have consented to the submission of the manuscript to the journal.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Literature
1. Mnasri Z, Rovetta S, Masulli F (2022) Anomalous sound event detection: a survey of machine learning based methods and applications. Multimed Tools Appl 81(4):5537–5586
2. Hojjati H, Armanfard N (2022) Self-supervised acoustic anomaly detection via contrastive learning. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 3253–3257
3. Chen H, Song Y, Dai LR, McLoughlin I, Liu L (2022) Self-supervised representation learning for unsupervised anomalous sound detection under domain shift. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 471–475
4. Jing L, Tian Y (2020) Self-supervised visual feature learning with deep neural networks: a survey. IEEE Trans Pattern Anal Mach Intell 43(11):4037–4058
5. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 4489–4497
6. Müller R, Ritz F, Illium S, Linnhoff-Popien C (2020) Acoustic anomaly detection for machine sounds based on image transfer learning. arXiv:2006.03429
7. Koizumi Y, Yasuda M, Murata S, Saito S, Uematsu H, Harada N (2020) SPIDERnet: attention network for one-shot anomaly detection in sounds. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 281–285
8. Purohit H, Tanabe R, Ichige K, Endo T, Nikaido Y, Suefusa K, Kawaguchi Y (2019) MIMII dataset: sound dataset for malfunctioning industrial machine investigation and inspection. arXiv:1909.09347
9. Kawaguchi Y, Imoto K, Koizumi Y, Harada N, Niizumi D, Dohi K, Tanabe R, Purohit H, Endo T (2021) Description and discussion on DCASE 2021 challenge task 2: unsupervised anomalous sound detection for machine condition monitoring under domain shifted conditions. arXiv:2106.04492
10. Perez-Castanos S, Naranjo-Alcazar J, Zuccarello P, Cobos M (2020) Anomalous sound detection using unsupervised and semi-supervised autoencoders and gammatone audio representation. arXiv:2006.15321
11. Pang G, Shen C, Cao L, Hengel AVD (2021) Deep learning for anomaly detection: a review. ACM Comput Surv 54(2):1–38
12. Ribeiro A, Matos LM, Pereira PJ, Nunes EC, Ferreira AL, Cortez P, Pilastri A (2020) Deep dense and convolutional autoencoders for unsupervised anomaly detection in machine condition sounds. arXiv:2006.10417
13. Tan J, Oyekan J (2021) Attention augmented convolutional neural network for acoustics based machine state estimation. Appl Soft Comput 110:107630
14. Mobtahej P, Zhang X, Hamidi M, Zhang J (2021) Deep learning-based anomaly detection for compressors using audio data. In: 2021 annual reliability and maintainability symposium (RAMS), pp 1–7
15. Zhang A, Li S, Cui Y, Yang W, Dong R, Hu J (2019) Limited data rolling bearing fault diagnosis with few-shot learning. IEEE Access 7:110895–110904
16. Wen L, Gao L, Li X (2019) A new snapshot ensemble convolutional neural network for fault diagnosis. IEEE Access 7:32037–32047
17. Koizumi Y, Murata S, Harada N, Saito S, Uematsu H (2019) SNIPER: few-shot learning for anomaly detection to minimize false-negative rate with ensured true-positive rate. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 915–919
18. Giri R, Tenneti SV, Helwani K, Cheng F, Isik U, Krishnaswamy A (2020) Unsupervised anomalous sound detection using self-supervised classification and group masked autoencoder for density estimation. DCASE 2020 Challenge, Tech. Rep.
19. Howard A, Zhmoginov A, Chen L-C, Sandler M, Zhu M (2018) Inverted residuals and linear bottlenecks: mobile networks for classification, detection and segmentation. In: CVPR
20. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
21. Dohi K, Endo T, Purohit H, Tanabe R, Kawaguchi Y (2021) Flow-based self-supervised density estimation for anomalous sound detection. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 336–340
22. Daniluk P, Goździewski M, Kapka S, Kośmider M (2020) Ensemble of auto-encoder based and WaveNet-like systems for unsupervised anomaly detection. DCASE 2020 Challenge, Tech. Rep.
23. Wichern G, Chakrabarty A, Wang ZQ, Le Roux J (2021) Anomalous sound detection using attentive neural processes. In: IEEE workshop on applications of signal processing to audio and acoustics (WASPAA), pp 186–190
24. Suefusa K, Nishida T, Purohit H, Tanabe R, Endo T, Kawaguchi Y (2020) Anomalous sound detection based on interpolation deep neural network. In: Proc. ICASSP, pp 271–275