Elsevier

Speech Communication

Volume 53, Issue 2, February 2011, Pages 154-174

Enhancement of noisy speech by temporal and spectral processing

https://doi.org/10.1016/j.specom.2010.08.011

Abstract

This paper presents a noisy speech enhancement method by combining linear prediction (LP) residual weighting in the time domain and spectral processing in the frequency domain to provide better noise suppression as well as better enhancement in the speech regions. The noisy speech is initially processed by the excitation source (LP residual) based temporal processing that involves identifying and enhancing the excitation source based speech-specific features present at the gross and fine temporal levels. The gross level features are identified by estimating the following speech parameters: sum of the peaks in the discrete Fourier transform (DFT) spectrum, smoothed Hilbert envelope of the LP residual and modulation spectrum values, all from the noisy speech signal. The fine level features are identified using the knowledge of the instants of significant excitation. A weight function is derived from the gross and fine weight functions to obtain the temporally processed speech signal. The temporally processed speech is further subjected to spectral domain processing. Spectral processing involves estimation and removal of degrading components, and also identification and enhancement of speech-specific spectral components. The proposed method is evaluated using different objective and subjective quality measures. The quality measures show that the proposed combined temporal and spectral processing method provides better enhancement, compared to either temporal or spectral processing alone.

Research highlights

► The potential of combining LP residual weighting based temporal processing and harmonics enhancement based spectral processing is demonstrated for noisy speech enhancement.
► A set of speech-specific features is proposed for the gross level detection of high SNR speech regions.
► A method to determine the instants of significant excitation in noisy speech is proposed.
► A new spectral enhancement technique is proposed to enhance the voiced regions of noisy speech.

Introduction

The problem of enhancing noisy speech has received considerable attention, and a variety of methods have been proposed in the literature. The available noisy speech enhancement methods may be broadly classified into two categories, namely, spectral and temporal domain enhancement methods. Generally, the spectral domain processing methods can be classified into two main areas: nonparametric and statistical model-based speech enhancement (Shao and Chang, 2007; Loizou, 2007). The nonparametric methods remove an estimate of the distortion from the noisy features; examples are subtractive-type algorithms (Boll, 1979; Berouti et al., 1979; Kim et al., 2000; Kamath and Loizou, 2002; Yamashita and Shimamura, 2005; Yang and Fu, 2005; Lu, 2007) and wavelet denoising (Donoho, 1995; Chang et al., 2007; Senapati et al., 2008). Statistical model-based speech enhancement (Ephraim and Malah, 1984; Ephraim and Malah, 1985; Marzinzik and Kollmeier, 2002; Martin, 2005; Chen and Loizou, 2005; Chen and Loizou, 2007) utilizes a parametric model of the signal generation process. The spectral subtraction algorithm is the oldest one proposed for noise reduction (Boll, 1979). Spectral subtraction is performed by subtracting the average magnitude of the noise spectrum from the magnitude spectrum of the noisy speech to estimate the magnitude of the enhanced speech spectrum. The noise spectrum is estimated by averaging short-term magnitude spectra of the non-speech segments. One of the serious drawbacks of this method is that it produces musical noise in the enhanced speech. This noise arises from randomly spaced peaks in the time-frequency plane, which are caused by the deviation of the estimated noise spectrum from the instantaneous noise spectrum (Seok and Bae, 1999). Several modifications of the spectral subtraction approach have been proposed to reduce the effect of musical noise (Loizou, 2007).

One of the most popular spectral domain methods for noisy speech enhancement is the minimum mean square error (MMSE) short time spectral amplitude (STSA) estimation algorithm of Ephraim and Malah (1984). This algorithm is based on a Gaussian statistical model. Accordingly, the coefficients of the short time Fourier transform (STFT) of speech and noise are modelled as statistically independent Gaussian random variables. The aim is to enhance degraded speech by minimizing the mean squared error between the STSA of the clean speech and that of the enhanced speech. This optimality gives very good results in practice, with a noticeable reduction in musical noise. A number of Ephraim–Malah filters based on non-Gaussian modelling, such as Gamma modelling (Marzinzik and Kollmeier, 2002) and Laplacian modelling (Chen and Loizou, 2007), have also been proposed to improve the performance.
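The basic spectral subtraction procedure described above (average noise magnitude subtracted from the noisy magnitude spectrum, frame by frame) can be illustrated with a minimal Python sketch. It is not the implementation of any of the cited works. The function name, the frame length, the hop size and the spectral floor are illustrative assumptions, the floor is a simple addition that keeps magnitudes non-negative, and a separate noise-only segment is assumed to be available.

```python
import numpy as np

def spectral_subtraction(noisy, noise_only, frame_len=256, hop=128, floor=0.01):
    """Magnitude spectral subtraction with overlap-add resynthesis (sketch)."""
    window = np.hanning(frame_len)

    def frames(x):
        # Split x into overlapping, windowed frames.
        n = 1 + (len(x) - frame_len) // hop
        return np.stack([x[i * hop:i * hop + frame_len] * window for i in range(n)])

    # Average magnitude spectrum over frames assumed to contain only noise.
    noise_mag = np.abs(np.fft.rfft(frames(noise_only), axis=1)).mean(axis=0)

    spec = np.fft.rfft(frames(noisy), axis=1)
    mag, phase = np.abs(spec), np.angle(spec)

    # Subtract the noise estimate; the floor (an assumption) avoids
    # negative magnitudes.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)

    # Resynthesize with the noisy phase and overlap-add the frames.
    out_frames = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    for i, f in enumerate(out_frames):
        out[i * hop:i * hop + frame_len] += f * window
        norm[i * hop:i * hop + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)
```

Bins in which the instantaneous noise magnitude exceeds the average estimate survive the subtraction as isolated peaks, which is the origin of the musical noise mentioned above.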

Yegnanarayana et al. (1999) proposed an enhancement method that exploits the characteristics of the excitation source signal, such as the linear prediction (LP) residual. The basis for this approach is that human beings perceive speech by capturing features from the high signal-to-noise ratio (SNR) regions and then extrapolating them to the low SNR regions (Yegnanarayana et al., 1999). Accordingly, the approach for speech enhancement is to identify the high SNR regions in the noisy speech and enhance them relative to the low SNR regions, without causing significant distortion in the enhanced speech. A weight function is derived for the residual signal that reduces the energy in the low SNR regions relative to the high SNR regions of the noisy signal. The residual signal samples are multiplied by the weight function, and the weighted LP residual is used to excite the time-varying all-pole filter derived from the noisy speech to generate the enhanced speech. Jin and Scordilis (2006) proposed a speech enhancement algorithm similar to that of Yegnanarayana et al. (1999). It differs from the former residual weighting scheme in that the weights on the LP residual are derived from a constrained optimization criterion. Yegnanarayana et al. (2002) exploited the coherently added Hilbert envelope (HE) for LP residual reconstruction. The HE has a large amplitude at the instants of significant excitation, which makes it a good indicator of glottal closure (GC), where an excitation pulse takes place. Therefore, applying the HE to the LP residual as a weighting function emphasizes the pulse train structure of voiced speech, which leads to an enhanced LP residual signal.
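The residual weighting chain described above (inverse filtering, multiplication by a weight function, all-pole resynthesis) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the method of the cited works: a single set of LP coefficients is used for the whole segment instead of a time-varying filter, the function names are hypothetical, and the weight function is supplied by the caller rather than derived from the signal.

```python
import numpy as np

def lp_coefficients(x, order=10):
    """Autocorrelation-method LP: a[k] such that x[n] ~ sum_k a[k] x[n-k]."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)  # small ridge for numerical safety
    return np.linalg.solve(R, r[1:order + 1])

def lp_residual(x, a):
    """Inverse filtering: e[n] = x[n] - sum_k a[k] x[n-k]."""
    return np.convolve(x, np.concatenate(([1.0], -a)))[:len(x)]

def lp_synthesis(e, a):
    """All-pole filtering: y[n] = e[n] + sum_k a[k] y[n-k]."""
    p = len(a)
    y = np.zeros(len(e) + p)  # p leading zeros act as the initial state
    for n in range(len(e)):
        y[n + p] = e[n] + np.dot(a, y[n:n + p][::-1])
    return y[p:]

def weighted_resynthesis(x, weight, order=10):
    """Weight the LP residual sample by sample and resynthesize."""
    a = lp_coefficients(x, order)
    return lp_synthesis(weight * lp_residual(x, a), a)
```

With a weight of one everywhere the segment is reconstructed unchanged, so any change in the output is due to the weight function alone.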

As mentioned, most of the enhancement methods process degraded speech in either the temporal or the spectral domain. The scope of this work is to highlight and demonstrate the merits of combined temporal and spectral processing methods for processing noisy speech. In most spectral domain methods, more emphasis is given to suppressing the noise components by estimating the noise characteristics from the degraded speech. The merit of this approach is its effectiveness for noise removal. However, information about the noise needs to be estimated continuously, particularly in non-stationary environments where the noise characteristics keep changing. Alternatively, the temporal processing methods that use the characteristics of the excitation source primarily aim at emphasizing the high SNR regions of noisy speech. Therefore, no explicit knowledge of the characteristics of the background noise is required. The limitation of the temporal processing methods is that the level of removal of degradation achieved may not be as significant as in the case of spectral domain methods. Thus, the integration of these two approaches may lead to better suppression of degradation and also to enhancement of the high SNR speech regions. This may lead to improved performance compared to either temporal processing or spectral processing alone. Further, from the speech production point of view, the temporal and spectral processing methods use independent information from the noisy speech. It will therefore be interesting to study whether they exploit different information for processing. If so, they can be suitably combined to develop robust methods for speech enhancement. Motivated by these observations, this work proposes a method for the enhancement of noisy speech by combined temporal and spectral processing to provide better noise suppression and also better enhancement in the speech regions.

The various steps involved in the proposed noisy speech enhancement method are illustrated in Fig. 1. The temporal processing involves identifying and enhancing the speech-specific features present at the gross and fine temporal levels. The main objective of the gross level processing is to identify and enhance the speech components at the level of sound units (100–300 ms). In this paper, a method is proposed for detecting high SNR regions using the sum of the ten largest peaks in the discrete Fourier transform (DFT) spectrum, the smoothed HE of the LP residual, and the modulation spectrum values. The objective of the fine level processing is to identify and enhance the speech-specific features at the subsegmental (2–3 ms) level. It is based on the fact that the significant excitation of the vocal tract takes place at the instants of glottal closure and at the onset of events like burst, frication and aspiration. Depending on the nature of the degradation, the LP residual signal will have many other random peaks in addition to the original instants of significant excitation. The temporal processing method identifies the original instants of significant excitation and emphasizes the region around them to obtain the enhanced speech. For fine level processing, a method is proposed in this paper to identify the instants of significant excitation from the noisy speech. The proposed method involves the following: (i) sinusoidal analysis of the noisy speech, and (ii) convolution of the HE of the LP residual of the speech obtained from sinusoidal analysis with the first order Gaussian differentiator (FOGD). Finally, the gross and fine level features are combined to derive a weight function for the excitation source signal, which emphasizes the excitation around the instants of significant excitation and deemphasizes the random peaks of background noise.

The enhanced excitation signal is used to excite the time-varying all-pole filter derived from the noisy speech to generate the temporally processed speech. The temporally processed speech signal is further subjected to spectral processing. The spectral processing is based on the fact that the spectral values of the degraded speech contain both speech and degrading components. The spectral components of the degradation are therefore estimated and removed. Further, the perceptually important spectral peaks are identified and enhanced. Accordingly, spectral processing is performed in two stages in this work: attenuation of the spectral characteristics of the background noise and enhancement of the speech-specific spectral features. In the first stage, the spectral characteristics of the background noise are estimated and attenuated using conventional spectral processing methods based on spectral subtraction or MMSE estimators. In the second stage, the regions around the pitch and its harmonics are enhanced by estimating the pitch from the enhanced excitation source signal.
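Step (ii) of the fine level processing described above, the convolution of a Hilbert envelope with the FOGD, can be illustrated with a minimal Python sketch. Several choices here are assumptions rather than details taken from the paper: the kernel length and standard deviation are placeholder values, the FOGD is taken to be the derivative of a Gaussian window, and candidate instants are picked at the positive-to-negative zero crossings of the convolved output, which correspond to peaks of the Gaussian-smoothed envelope under this sign convention. The sinusoidal analysis and the LP residual computation are not shown.

```python
import numpy as np

def hilbert_envelope(x):
    """Magnitude of the analytic signal, computed through the FFT."""
    n = len(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * h))

def fogd(length=81, sigma=12.0):
    """First order Gaussian differentiator: derivative of a Gaussian window.
    The length and sigma are placeholder values (assumptions)."""
    n = np.arange(length) - (length - 1) / 2.0
    return -n / sigma ** 2 * np.exp(-0.5 * (n / sigma) ** 2)

def candidate_instants(envelope, kernel):
    """Convolve the envelope with the kernel and return the sample indices
    just after each positive-to-negative zero crossing, plus the evidence."""
    evidence = np.convolve(envelope, kernel, mode="same")
    crossings = np.where((evidence[:-1] > 0) & (evidence[1:] <= 0))[0]
    return crossings + 1, evidence
```

For example, `candidate_instants(hilbert_envelope(residual), fogd())` returns candidate locations for a residual-like array `residual`. Because the Gaussian smoothing suppresses closely spaced random peaks, a wider kernel trades time resolution for robustness.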

The rest of the paper is organized as follows: Section 2 discusses the temporal processing of the noisy speech signal. The spectral domain processing of the noisy speech signal is described in Section 3. Various experimental studies and objective quality measures performed on the individual and the combined processing methods are described in Section 4. A summary and conclusions of this study, with the scope for future work, are given in Section 5.

Section snippets

Gross level temporal processing

The high SNR regions at the gross level are identified by using the sum of the ten largest peaks in the DFT spectrum, the smoothed HE of the LP residual and the modulation spectrum values of the noisy speech signal (Krishnamoorthy and Prasanna, 2009). The gross weight function used in this work is similar to voice activity detection (VAD). One can use the existing VAD methods in place of the gross level weight function. On the other hand, we would like to emphasize that in this work we have investigated alternate

Spectral processing of noisy speech

The temporally processed speech sounds perceptually enhanced. This is mainly due to the enhancement of speech-specific features in the noisy speech signal, which include the high SNR regions at the gross level and the regions around the instants of significant excitation. The enhancement is achieved by multiplying the LP residual of the noisy speech signal by the weight function. Even though the speech-specific features are emphasized in the temporally processed speech, the noise suppression is minimal mainly

Experimental results and performance evaluation

The proposed speech enhancement method is evaluated using the composite objective quality measures, which have a high degree of correlation with subjective quality (Hu and Loizou, 2006; Hu and Loizou, 2008). These measures evaluate the quality of enhanced speech along three dimensions: signal distortion, noise distortion, and overall quality. The resulting objective scores lie between 1 and 5, like the mean opinion score (MOS). This measure rates the quality of the enhanced speech on three

Summary and conclusions

In this paper, a combined temporal and spectral processing (TSP) method is proposed for speech degraded by background noise. The method emphasizes the high SNR regions in the temporal domain, and eliminates the degradation and enhances the speech-specific components in the spectral domain. The main objective of this study is to show that the combined TSP method gives relatively better performance compared to temporal or spectral processing alone. The enhancement of noisy speech is achieved in two stages, namely, temporal enhancement

References (46)

  • Berouti, M., Schwartz, R., Makhoul, J., 1979. Enhancement of speech corrupted by acoustic noise. In: Proc. IEEE...
  • Boll, S.F., 1979. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process.
  • Chen, B., Loizou, P., 2005. Speech enhancement using a MMSE short time spectral amplitude estimator with Laplacian...
  • Deller, J.R., et al., 1993. Discrete Time Processing of Speech Signals.
  • Donoho, D.L., 1995. De-noising by soft-thresholding. IEEE Trans. Inf. Theory.
  • Ephraim, Y., et al., 1984. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process.
  • Ephraim, Y., et al., 1985. Speech enhancement using a minimum mean square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process.
  • Greenberg, S., Kingsbury, B.E.D., 1997. The modulation spectrogram: In pursuit of an invariant representation of...
  • Hu, Y., Loizou, P.C., 2006. Evaluation of objective measures for speech enhancement. In: Proc. Interspeech,...
  • Hu, Y., et al., 2008. Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio, Speech, Lang. Process.
  • Kamath, S., Loizou, P., 2002. A multi-band spectral subtraction method for enhancing speech corrupted by colored noise....
  • Kim, W., et al., 2000. Spectral subtraction based on phonetic dependency and masking effects. IEEE Proc. Vision, Image Signal Process.
  • Krishnamoorthy, P., Prasanna, S.R.M., 2008. Temporal and spectral processing of degraded speech. In: IEEE Proc....