Elsevier

Speech Communication

Volume 85, December 2016, Pages 53-70

ILMSAF based speech enhancement with DNN and noise classification

https://doi.org/10.1016/j.specom.2016.10.008

Highlights

  • An adaptive coefficient of filter’s parameters is introduced into conventional least mean square adaptive filtering.

  • A Deep Belief Network is used to model the relationship between the noise type, the SNR of the noise and the adaptive coefficient.

  • Deep Neural Network based noise classification is employed to improve the generalization ability.

  • The groupings of noise types have distinct effects on the performance of noise classification.

  • The subjective and objective quality of the enhanced speech is significantly improved, especially at low signal-to-noise ratios.

Abstract

In order to improve the performance of speech enhancement algorithms in low Signal-to-Noise Ratio (SNR) complex noise environments, a novel Improved Least Mean Square Adaptive Filtering (ILMSAF) based speech enhancement algorithm with Deep Neural Network (DNN) and noise classification is proposed. An adaptive coefficient of the filter's parameters is introduced into conventional Least Mean Square Adaptive Filtering (LMSAF). First, the adaptive coefficient is estimated by a Deep Belief Network (DBN). Then, the enhanced speech is obtained by ILMSAF. In addition, in order to make the presented approach suitable for various kinds of noise environments, a new noise classification method based on DNN is presented. According to the result of noise classification, the corresponding ILMSAF model is selected in the enhancement process. Performance test results under ITU-T G.160 show that the proposed algorithm achieves significant improvements in various subjective and objective speech quality measures over the Wiener filtering based speech enhancement approach with Weighted Denoising Auto-encoder and noise classification.

Introduction

Many types of noise are ubiquitous in real environments, and speech signals for human communication are seriously polluted by them. Therefore, it is necessary to introduce a speech enhancement model that can remove the influence of noise and improve the quality of voice communication.

Speech enhancement is one of the most challenging issues in the field of speech signal processing. It is used in mobile speech communication, the design of digital hearing aids and anti-jamming speech recognition (Xu et al., 2015). For digital hearing aids, noise has always been a big problem for wearers. Especially in low Signal-to-Noise Ratio (SNR) environments, it is very difficult for a hearing aid to remove the different kinds of noise, and in some complex noise environments wearers feel uncomfortable because of the residual noise (Deepa et al., 2012). Over the past decades, a large number of speech enhancement algorithms have been proposed. In 1979, a spectral subtraction method based on the Fourier Transform (FT) was presented by Boll (1979). It is still the most easily implemented speech enhancement algorithm, but a large amount of musical noise remains. In the same year, a Wiener filtering method was proposed by Lim and Oppenheim (1978). Its residual noise is not musical but similar to white noise; however, its noise-removal ability is limited and it is non-ideal for nonlinear noise. The Least Mean Square Adaptive Filtering (LMSAF) based speech enhancement approach offers better filtering performance than the Wiener filter and the Kalman filter. Meanwhile, it needs no a priori knowledge and can adapt to the external environment by self-learning. But this approach has some disadvantages, including slow convergence, strong sensitivity to nonstationary noise and a trade-off between convergence speed and stability (Siddappaji, 2014; Gupta et al., 2015). In 1989, the Hidden Markov Model (HMM) was first introduced into the field of speech enhancement by Ephraim et al. (1989). It largely solves the musical noise produced by traditional speech enhancement algorithms, but its computational cost is very high. In 1995, a signal subspace based speech enhancement algorithm was developed by Ephraim and Van Trees (1995). It removes the background noise of noisy speech well, and the quality and intelligibility of the enhanced speech are greatly improved, but the algorithm's complexity is high. Besides, the wavelet transform is also used in the field of speech enhancement. It has good time-frequency analysis characteristics and overcomes the fixed-resolution limitation of the short-time Fourier transform: the signal can be analyzed at multiple scales and resolutions. However, this method is relatively complex and is not suitable for real-time mobile speech communication or digital hearing aids (Li, 2008, 2009, 2012).
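The conventional LMS adaptive filtering referenced above can be sketched as follows. This is a minimal illustrative implementation of the standard LMS weight update (the function name, filter order and step size are our own choices, not taken from the paper), shown here identifying a known FIR system from white-noise input:

```python
import numpy as np

def lms_filter(x, d, order=8, mu=0.01):
    """Conventional LMS adaptive filter: estimate d[n] from the last
    `order` samples of x, updating the weights by the rule
    w <- w + mu * e[n] * x_vec (stochastic gradient descent on the
    instantaneous squared error)."""
    w = np.zeros(order)
    y = np.zeros(len(x))            # filter output
    e = np.zeros(len(x))            # error signal
    for n in range(order - 1, len(x)):
        x_vec = x[n - order + 1:n + 1][::-1]  # [x[n], x[n-1], ...]
        y[n] = w @ x_vec
        e[n] = d[n] - y[n]                    # estimation error
        w = w + mu * e[n] * x_vec             # LMS weight update
    return y, e, w

# Toy check: identify a known 3-tap FIR system from white-noise input.
rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
h = np.array([0.5, -0.3, 0.2])
d = np.convolve(x, h)[:len(x)]
y, e, w = lms_filter(x, d, order=3, mu=0.05)
```

The step size `mu` controls the convergence/stability trade-off the text mentions: larger values converge faster but can become unstable, which motivates the adaptive coefficient introduced later in the paper.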

The disadvantages of traditional speech enhancement algorithms include: a large amount of residual noise, even musical noise; the details of speech being largely destroyed in low-SNR speech enhancement; and great difficulty in processing nonstationary noise. In recent years, the Deep Neural Network (DNN) has also been introduced to speech enhancement (Hinton et al., 2006; Chen et al., 2014; Xu et al., 2014). A DNN can be seen as a fine noise reduction filter. Its nonlinear characteristics make it easier to describe the complex nonlinear relationship between noise and speech, and its strong learning ability lets it, like a person, remember different noise models and the nonlinear relationship between noise and speech, which suppresses nonstationary noise well (Xu et al., 2015). The Denoising Auto-encoder (DA) is a two-layered Artificial Neural Network (ANN) model. In 2012, the Stacked DA (SDA), constructed from a series of DAs, was applied to image denoising and inpainting by Xie et al. (2012). In the same year, a Recurrent Neural Network (RNN) based noise reduction method was developed by Maas et al. (2012); likewise, this RNN is also composed of a series of DAs. In 2013, a deep Denoising Auto-encoder based speech enhancement approach was proposed by Lu et al. (2013), and ideal ratio mask estimation using deep neural networks was developed by Narayanan (2013). In 2014, a Wiener filtering based speech enhancement method with Weighted Denoising Auto-encoder (WDA) and noise classification was presented by Xia (2014). In the same year, a binaural deep neural network for robust speech enhancement was developed by Jiang (2014), and the ideal binary mask method was also used for speech enhancement by Sun et al. (2014). In 2015, a regression approach to speech enhancement based on deep neural networks was proposed by Xu et al. (2015). In the same year, Tian Gao developed an improved deep neural network based speech enhancement method for low SNR environments (Gao et al., 2015), and H. W. Tseng proposed a new classification-based approach for speech enhancement with deep neural networks (Tseng et al., 2015). However, the performance of the above DNN-based speech enhancement algorithms is non-ideal in low SNR environments.

Focusing on some shortcomings of existing speech enhancement algorithms, such as non-ideal performance in low SNR environments, poor adaptability to various types of noise environments and the difficulty of processing nonstationary noise, a novel Improved Least Mean Square Adaptive Filtering (ILMSAF) based speech enhancement algorithm with DNN and noise classification is proposed. The LMSAF based speech enhancement approach offers better filtering performance than the Wiener filter and the Kalman filter; meanwhile, it needs no a priori knowledge and can adapt to the external environment by self-learning. DNN has strong nonlinear processing ability, which can well describe the complex nonlinear relationship between noise and speech; this advantage enables DNN to process nonstationary noise effectively. Different kinds of noise signals have different characteristics and affect speech differently, so the filter parameters differ when noisy speech of different noise types is enhanced by the filter. In order to make the enhancement algorithm suitable for different types of noise environments, noise environments should be classified, and the corresponding parameters of LMSAF trained for each noise type. Therefore, for LMSAF based speech enhancement, a noise classification algorithm should be introduced: according to the result of noise classification, the corresponding LMSAF model is selected for speech enhancement. In the proposed algorithm, considering the fact that the filter parameters play a crucial role in filtering performance, an adaptive coefficient is introduced into the traditional LMSAF model so that the filter removes noise better in the current noise environment. The new model is called ILMSAF. First, the adaptive coefficient of the filter's parameters is estimated by a Deep Belief Network (DBN). Then, the enhanced speech is obtained by ILMSAF. In order to make the presented algorithm suitable for various kinds of noise environments, a new noise classification algorithm based on DNN is presented. According to the result of noise classification, the corresponding ILMSAF is chosen to remove the noise of noisy speech.
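The adaptive-coefficient idea can be sketched as a single filter update step. Here `ilmsaf_step` and the way `beta` enters the update are our illustrative assumptions: the paper estimates the coefficient with a trained DBN per noise condition, while this sketch simply takes it as an argument.

```python
import numpy as np

def ilmsaf_step(w, x_vec, d_n, mu, beta):
    """One hypothetical ILMSAF update: `beta` is the adaptive coefficient
    that scales the filter's parameter update. In the paper it would be
    predicted by a trained DBN from the classified noise type and SNR;
    here it is an argument, and beta = 1 recovers conventional LMSAF."""
    y_n = float(w @ x_vec)                 # filter output
    e_n = d_n - y_n                        # estimation error
    w_new = w + beta * mu * e_n * x_vec    # scaled LMS weight update
    return w_new, y_n, e_n
```

A per-frame enhancement loop would call this step repeatedly, looking up `beta` once per frame from the noise-classification result, so the effective adaptation speed tracks the current noise environment.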

The performance test results under ITU-T G.160 show that the proposed algorithm achieves significant improvements in various objective speech quality measures over the Wiener filtering based speech enhancement approach with WDA and noise classification (Xia, 2014).

Section snippets

ILMSAF based speech enhancement algorithm with DNN and noise classification

The block diagram of ILMSAF based speech enhancement with DNN and noise classification is shown in Fig. 1.

In the proposed algorithm, noisy speech s(n) is first framed and windowed. Then the maximum of the short-time autocorrelation function and the spectral variance of the noisy speech are extracted. For Voice Activity Detection (VAD), these features form 2-dimensional feature vectors that are input to a BP Neural Network (BPNN), which is optimized by a Genetic Algorithm. According to
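The two frame-level VAD features described above could be extracted roughly as follows; the frame length, Hamming windowing and normalization details are assumptions of this sketch, not specifics taken from the paper:

```python
import numpy as np

def vad_features(frame):
    """2-D VAD feature vector for one frame: (1) the maximum of the
    normalized short-time autocorrelation (excluding lag 0), which is
    high for periodic voiced speech, and (2) the variance of the
    magnitude spectrum, which separates peaky speech spectra from
    flatter noise spectra."""
    frame = frame * np.hamming(len(frame))                  # windowing
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    ac_max = ac[1:].max() / (ac[0] + 1e-12)                 # skip lag 0
    spec = np.abs(np.fft.rfft(frame))
    return np.array([ac_max, spec.var()])

# A periodic (voiced-like) frame scores a higher autocorrelation peak
# than a white-noise frame.
voiced = np.sin(2 * np.pi * np.arange(256) / 32)
noise = np.random.default_rng(1).standard_normal(256)
```

In a full front end, one such vector per frame would feed the GA-optimized BPNN that makes the speech/non-speech decision.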

The effect of the noise signal's grouping on the performance of noise classification model

According to the discussion in Section 2.2.2, the 15 types of noise signals are first roughly divided into 3 categories by an LVQNN, each category including 5 noise types. Second, each category is further divided into its 5 types by a BPNN. This DNN based noise classification is adopted in the proposed algorithm rather than BPNN based classification alone, because for 15 types of noise signals the MFCC and ∆MFCC of some noise types are similar, which makes it difficult to
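The coarse-to-fine decision described in this snippet can be sketched with stand-in classifiers. `classify_noise` and the nearest-centroid stage models below are hypothetical placeholders for the paper's trained LVQNN (stage 1) and BPNN (stage 2):

```python
import numpy as np

def classify_noise(features, coarse_model, fine_models):
    """Hypothetical coarse-to-fine noise classification: a first-stage
    model picks one of 3 coarse categories, then that category's
    second-stage model picks one of its 5 noise types, giving a global
    index in 0..14."""
    category = coarse_model(features)             # 0, 1 or 2
    noise_type = fine_models[category](features)  # 0..4 within category
    return category * 5 + noise_type

# Toy stand-ins: nearest-centroid "models" on random centroids, in place
# of trained LVQ and BP networks operating on MFCC / delta-MFCC vectors.
rng = np.random.default_rng(0)
coarse_c = rng.standard_normal((3, 12))           # assumed 12-D features
fine_c = rng.standard_normal((3, 5, 12))

coarse = lambda f: int(np.argmin(np.linalg.norm(coarse_c - f, axis=1)))
fines = [lambda f, c=c: int(np.argmin(np.linalg.norm(fine_c[c] - f, axis=1)))
         for c in range(3)]

label = classify_noise(rng.standard_normal(12), coarse, fines)
```

The point of the two-stage layout is that each fine classifier only has to separate 5 acoustically related noise types, rather than all 15 at once.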

Performance evaluation

The results of the performance evaluation, including the VAD test, the noise classification test and the objective speech quality test, are summarized and discussed in this section. The objective speech quality test is conducted under the ITU-T G.160 standard (ITU-T, 2008). Meanwhile, a subjective speech quality test is also adopted.

Conclusions

A novel ILMSAF based speech enhancement algorithm with DNN and noise classification is proposed. An adaptive coefficient of the filter's parameters is introduced into conventional LMSAF; the novel model is called ILMSAF. First, the adaptive coefficient is estimated by a Deep Belief Network (DBN). Then, the enhanced speech is obtained by ILMSAF. In order to make the presented algorithm suitable for various kinds of noise environments, a new noise classification algorithm based

Acknowledgements

This work was supported by the Scientific Research Program of Beijing Municipal Commission of Education 2015 (No. KM201510005007).

References (34)

  • Geoffrey Hinton et al.

    A fast learning algorithm for deep belief nets

    Neural Comput.

    (2006)
  • Hua Yuming et al.

    Deep belief networks and deep learning

  • ITU-T, 2001. ITU-T P.862, perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech...
  • ITU-T, 2008. ITU-T G.160, voice enhancement devices. Int. Telecommun. Union (ITU), series...
  • Jiang Yi et al.

    Binaural deep neural network for robust speech enhancement

  • Li Ru-wei et al.

    A speech endpoint detection algorithm based on band-partitioning spectral entropy and spectral energy

    J. Beijing Univ. Technol.

    (2007)
  • Li Ru-wei et al.

Speech enhancement using adaptive threshold based on bi-orthogonal wavelet packet decomposition

    Chin. J. Sci. Instrum.

    (2008)