Enhancement of noisy speech by temporal and spectral processing
Research highlights
► The potential of combining LP residual weighting based temporal processing with harmonics enhancement based spectral processing is demonstrated for noisy speech enhancement.
► A set of speech-specific features is proposed for the gross level detection of high SNR speech regions.
► A method to determine the instants of significant excitation in noisy speech is proposed.
► A new spectral enhancement technique is proposed to enhance the voiced regions of noisy speech.
Introduction
The problem of enhancing noisy speech has received considerable attention, and a variety of methods have been proposed in the literature. The available methods may be broadly classified into two categories, namely, spectral domain and temporal domain enhancement methods. The spectral domain processing methods can in turn be classified into two main areas: nonparametric and statistical-model-based speech enhancement (Shao and Chang, 2007, Loizou, 2007). The nonparametric methods remove an estimate of the distortion from the noisy features; examples are subtractive-type algorithms (Boll, 1979, Berouti et al., 1979, Kim et al., 2000, Kamath and Loizou, 2002, Yamashita and Shimamura, 2005, Yang and Fu, 2005, Lu, 2007) and wavelet denoising (Donoho, 1995, Chang et al., 2007, Senapati et al., 2008). The statistical-model-based methods (Ephraim and Malah, 1984, Ephraim and Malah, 1985, Marzinzik and Kollmeier, 2002, Martin, 2005, Chen and Loizou, 2005, Chen and Loizou, 2007) utilize a parametric model of the signal generation process.
The spectral subtraction algorithm is one of the oldest algorithms proposed for noise reduction (Boll, 1979). It subtracts the average magnitude of the noise spectrum from the magnitude spectrum of the noisy speech to estimate the magnitude of the enhanced speech spectrum. The noise spectrum is estimated by averaging short-term magnitude spectra of the non-speech segments. A serious drawback of this method is that it produces musical noise in the enhanced speech. This noise arises from randomly spaced peaks in the time-frequency plane, caused by the deviation of the estimated noise spectrum from the instantaneous noise spectrum (Seok and Bae, 1999). Several modifications of the spectral subtraction approach have been proposed to reduce the effect of musical noise (Loizou, 2007).
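The basic magnitude subtraction step described above can be illustrated with a minimal sketch for a single frame. This is not the implementation of any cited method: the function names and the `floor` parameter are assumptions introduced here for illustration, and the noisy phase is reused for resynthesis as a common simplification.

```python
import numpy as np

def estimate_noise_magnitude(noise_frames):
    """Average the short-term magnitude spectra of non-speech frames.

    noise_frames: 2-D array with one non-speech frame per row.
    """
    return np.mean(np.abs(np.fft.rfft(noise_frames, axis=1)), axis=0)

def spectral_subtract(frame, noise_mag, floor=0.0):
    """Subtract the average noise magnitude from one noisy frame.

    floor is a hypothetical spectral-floor factor (not from the text):
    each bin is kept at no less than floor * noisy magnitude.
    """
    spectrum = np.fft.rfft(frame)
    noisy_mag = np.abs(spectrum)
    # Clip negative differences; the isolated peaks that survive the
    # subtraction are the source of musical noise.
    enhanced_mag = np.maximum(noisy_mag - noise_mag, floor * noisy_mag)
    # Reuse the noisy phase for resynthesis.
    enhanced = enhanced_mag * np.exp(1j * np.angle(spectrum))
    return np.fft.irfft(enhanced, n=len(frame))
```

In practice the frames would be windowed and overlap-added; that bookkeeping is left out to keep the subtraction step visible.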
One of the most popular spectral domain methods for noisy speech enhancement is the minimum mean square error (MMSE) short time spectral amplitude (STSA) estimator of Ephraim and Malah (1984). This algorithm is based on a Gaussian statistical model, in which the coefficients of the short time Fourier transform (STFT) of speech and noise are modelled as statistically independent Gaussian random variables. The aim is to enhance degraded speech by minimizing the mean squared error between the STSA of the clean speech and that of the enhanced speech. This estimator gives very good results in practice, with a noticeable reduction in musical noise. A number of Ephraim–Malah type filters based on non-Gaussian models, such as Gamma modelling (Marzinzik and Kollmeier, 2002) and Laplacian modelling (Chen and Loizou, 2007), have also been proposed to improve the performance.
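For reference, the MMSE criterion described above can be written out explicitly. The notation is an assumption of this note rather than taken from the text: $Y_k$ is the noisy STFT coefficient in bin $k$ with amplitude $R_k$, $A_k$ is the clean speech amplitude, $\xi_k$ and $\gamma_k$ are the a priori and a posteriori SNRs, and $I_0$, $I_1$ are modified Bessel functions of order zero and one. Under the Gaussian model, the estimator of Ephraim and Malah (1984) takes the form:

```latex
\hat{A}_k = \mathrm{E}\{A_k \mid Y_k\}
          = \frac{\sqrt{\pi}}{2}\,\frac{\sqrt{v_k}}{\gamma_k}\,
            \exp\!\left(-\frac{v_k}{2}\right)
            \left[(1+v_k)\,I_0\!\left(\frac{v_k}{2}\right)
                  + v_k\,I_1\!\left(\frac{v_k}{2}\right)\right] R_k,
\qquad
v_k = \frac{\xi_k}{1+\xi_k}\,\gamma_k .
```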
Yegnanarayana et al. (1999) proposed an enhancement method that exploits the characteristics of the excitation source signal, represented by the linear prediction (LP) residual. The basis for this approach is that human beings perceive speech by capturing features from the high signal-to-noise ratio (SNR) regions and then extrapolating these features into the low SNR regions (Yegnanarayana et al., 1999). Accordingly, the approach is to identify the high SNR regions in the noisy speech and enhance them relative to the low SNR regions, without causing significant distortion in the enhanced speech. A weight function is derived for the residual signal that reduces the energy in the low SNR regions relative to the high SNR regions of the noisy signal. The residual samples are multiplied by the weight function, and the weighted LP residual is used to excite the time-varying all-pole filter derived from the noisy speech to generate the enhanced speech. Jin and Scordilis (2006) proposed a similar speech enhancement algorithm; it differs from the former residual weighting scheme in that the weights on the LP residual are derived from a constrained optimization criterion. Yegnanarayana et al. (2002) exploited the coherently added Hilbert envelope (HE) for LP residual reconstruction. The HE has a large amplitude at the instant of significant excitation, which makes it a good indicator of glottal closure (GC), where an excitation pulse takes place. Applying the HE to the LP residual as a weighting function therefore emphasizes the pulse train structure of voiced speech, which leads to an enhanced LP residual signal.
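The residual weighting chain described above (inverse filtering, weighting, resynthesis through the all-pole filter) can be sketched for one frame as follows. This is a simplified illustration under stated assumptions, not the cited authors' implementation: the autocorrelation method, the LP order of 10, the small regularisation term, and all function names are choices made here, and the weight function is simply passed in rather than derived.

```python
import numpy as np

def lp_coefficients(frame, order=10):
    """Autocorrelation-method LP: a[k] such that s[n] ~ sum_k a[k] * s[n-k]."""
    n = len(frame)
    # Autocorrelation at lags 0..order.
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)  # small regularisation (an assumption)
    return np.linalg.solve(R, r[1:order + 1])

def lp_residual(frame, a):
    """Inverse filtering: e[n] = s[n] - sum_k a[k] * s[n-k]."""
    inverse_filter = np.concatenate(([1.0], -a))
    return np.convolve(frame, inverse_filter)[:len(frame)]

def synthesize(excitation, a):
    """Excite the all-pole filter 1/A(z) with the given excitation."""
    order = len(a)
    out = np.zeros(len(excitation))
    for n in range(len(excitation)):
        k = min(order, n)
        # Past outputs out[n-1], ..., out[n-k] paired with a[0], ..., a[k-1].
        out[n] = excitation[n] + np.dot(a[:k], out[n - k:n][::-1])
    return out

def enhance_frame(noisy_frame, weight, order=10):
    """Weight the LP residual and resynthesize with the noisy-speech filter."""
    a = lp_coefficients(noisy_frame, order)
    residual = lp_residual(noisy_frame, a)
    return synthesize(weight * residual, a)
```

With a weight of all ones the frame is reconstructed unchanged, which is a convenient sanity check before plugging in an actual weight function.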
As mentioned, most enhancement methods process degraded speech in either the temporal or the spectral domain. The scope of this work is to highlight and demonstrate the merits of combined temporal and spectral processing of noisy speech. Most spectral domain methods place the emphasis on suppressing the noise components by estimating the noise characteristics from the degraded speech. The merit of this approach is its effectiveness for noise removal. However, information about the noise needs to be estimated continuously, particularly in non-stationary environments where the noise characteristics keep changing. Alternatively, the temporal processing methods that use the characteristics of the excitation source primarily aim at emphasizing the high SNR regions of noisy speech, so no explicit knowledge of the background noise characteristics is required. The limitation of the temporal processing methods is that the level of noise removal achieved may not be as significant as with spectral methods. Integrating the two approaches may therefore lead to better suppression of degradation together with enhancement of the high SNR speech regions, and hence to improved performance compared to either temporal or spectral processing alone. Further, from the speech production point of view, the temporal and spectral processing methods use independent information from the noisy speech. It is therefore of interest to study whether they exploit different information for processing. If so, they can be suitably combined to develop robust methods for speech enhancement. Motivated by these observations, this work proposes a method for the enhancement of noisy speech by combined temporal and spectral processing, to provide better noise suppression and better enhancement in the speech regions.
The various steps involved in the proposed noisy speech enhancement method are illustrated in Fig. 1. The temporal processing involves identifying and enhancing the speech-specific features present at the gross and fine temporal levels. The main objective of the gross level processing is to identify and enhance the speech components at the level of sound units (100–300 ms). In this paper, a method is proposed for detecting high SNR regions using the sum of the ten largest peaks in the discrete Fourier transform (DFT) spectrum, the smoothed HE of the LP residual, and the modulation spectrum values. The objective of the fine level processing is to identify and enhance the speech-specific features at the subsegmental (2–3 ms) level. It is based on the fact that the significant excitation of the vocal tract takes place at the instants of glottal closure and at the onset of events like burst, frication and aspiration. Depending on the nature of the degradation, the LP residual signal will have many other random peaks in addition to the original instants of significant excitation. The temporal processing method identifies the original instants of significant excitation and emphasizes the region around them to obtain the enhanced speech. For fine level processing, this paper proposes a method to identify the instants of significant excitation from the noisy speech. The proposed method involves the following: (i) sinusoidal analysis of the noisy speech, and (ii) convolving the HE of the LP residual of the speech obtained from sinusoidal analysis with the first order Gaussian differentiator (FOGD). Finally, the gross and fine level features are combined to derive a weight function for the excitation source signal, which emphasizes the excitation around the instants of significant excitation and deemphasizes the random peaks of background noise.
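Two of the building blocks named above, the sum of the largest DFT peaks and the convolution of a Hilbert envelope with the FOGD, can be sketched as follows. This is an illustrative sketch only: the function names, the FOGD length and standard deviation, the definition of a peak as a local maximum, and the rule of picking positive-to-negative zero crossings of the convolved signal are all assumptions made here, and the sinusoidal analysis stage is omitted. The zero-crossing rule follows from the fact that convolving with the derivative of a Gaussian gives the derivative of the Gaussian-smoothed envelope, which changes sign from positive to negative at each maximum.

```python
import numpy as np

def top_peak_sum(frame, num_peaks=10):
    """Gross-level feature: sum of the largest local maxima of the DFT magnitude."""
    mag = np.abs(np.fft.rfft(frame))
    interior = mag[1:-1]
    is_peak = (interior > mag[:-2]) & (interior >= mag[2:])
    peaks = np.sort(interior[is_peak])
    return float(np.sum(peaks[-num_peaks:]))

def hilbert_envelope(x):
    """Magnitude of the analytic signal, computed with the FFT."""
    n = len(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    gain[1:(n + 1) // 2] = 2.0  # positive-frequency bins
    if n % 2 == 0:
        gain[n // 2] = 1.0      # Nyquist bin
    return np.abs(np.fft.ifft(np.fft.fft(x) * gain))

def fogd(length=41, sigma=5.0):
    """First order Gaussian differentiator: derivative of a Gaussian window."""
    n = np.arange(length) - (length - 1) / 2.0
    return -n / sigma ** 2 * np.exp(-n ** 2 / (2.0 * sigma ** 2))

def excitation_instants(envelope, length=41, sigma=5.0):
    """Convolve the envelope with the FOGD; return the evidence signal and
    the indices of its positive-to-negative zero crossings."""
    evidence = np.convolve(envelope, fogd(length, sigma), mode="same")
    crossings = np.where((evidence[:-1] > 0) & (evidence[1:] <= 0))[0] + 1
    return evidence, crossings
```

In the setting described above, `envelope` would be `hilbert_envelope` applied to the LP residual, and an odd `length` keeps the FOGD centred so the detected instants are not shifted.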
The enhanced excitation signal is used to excite the time-varying all-pole filter derived from the noisy speech to generate the temporally processed speech. The temporally processed speech signal is further subjected to spectral processing. The spectral processing is based on the fact that the spectral values of the degraded speech contain both speech and degrading components. The spectral components of the degradation are therefore estimated and removed. Further, perceptually important spectral peaks are identified and enhanced. Accordingly, in this work spectral processing is performed in two stages: attenuation of the spectral characteristics of the background noise, and enhancement of speech-specific spectral features. In the first stage, the spectral characteristics of the background noise are estimated and attenuated using conventional spectral processing methods based on spectral subtraction or MMSE estimators. In the second stage, the regions around the pitch and its harmonics are enhanced by estimating the pitch from the enhanced excitation source signal.
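The second stage can be pictured as a frequency-domain gain that is raised around multiples of the pitch. The sketch below is only one possible reading of that idea, not the method of the paper: the pitch `f0` is assumed to be already known, and the function names, the `boost` factor and the `half_width_hz` band width are hypothetical parameters introduced for illustration.

```python
import numpy as np

def harmonic_gain(n_fft, fs, f0, boost=2.0, half_width_hz=20.0):
    """Gain per rfft bin: `boost` near multiples of f0, 1.0 elsewhere."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    gain = np.ones_like(freqs)
    # Harmonic frequencies f0, 2*f0, ... below the Nyquist frequency.
    for harmonic in np.arange(f0, fs / 2.0, f0):
        gain[np.abs(freqs - harmonic) <= half_width_hz] = boost
    return gain

def enhance_harmonics(frame, fs, f0, boost=2.0, half_width_hz=20.0):
    """Apply the harmonic gain to one frame and return the time signal."""
    spectrum = np.fft.rfft(frame)
    gain = harmonic_gain(len(frame), fs, f0, boost, half_width_hz)
    return np.fft.irfft(spectrum * gain, n=len(frame))
```

For example, `enhance_harmonics(frame, fs=8000, f0=200.0)` would raise the bins within 20 Hz of 200 Hz, 400 Hz and so on, while leaving the regions between harmonics unchanged.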
The rest of the paper is organized as follows: Section 2 discusses the temporal processing of the noisy speech signal. The spectral domain processing of the noisy speech signal is described in Section 3. The experimental studies and objective quality evaluation of the individual and combined processing methods are described in Section 4. A summary and conclusions of this study, with scope for future work, are given in Section 5.
Section snippets
Gross level temporal processing
The high SNR regions at the gross level are identified using the sum of the ten largest peaks in the DFT spectrum, the smoothed HE of the LP residual and the modulation spectrum values of the noisy speech signal (Krishnamoorthy and Prasanna, 2009). The gross weight function used in this work is similar to voice activity detection (VAD). One can use existing VAD methods in place of the gross level weight function. On the other hand, we would like to emphasize that in this work we have investigated alternate
Spectral processing of noisy speech
The temporally processed speech sounds perceptually enhanced. This is mainly due to the enhancement of speech-specific features in the noisy speech signal, namely the high SNR regions at the gross level and the regions around the instants of significant excitation. This is achieved by multiplying the LP residual of the noisy speech signal by the weight function. Even though the speech-specific features are emphasized in the temporally processed speech, the noise suppression is minimal mainly
Experimental results and performance evaluation
The proposed speech enhancement method is evaluated using the composite objective quality measures that have a high degree of correlation with subjective quality (Hu and Loizou, 2006, Hu and Loizou, 2008). This measure evaluates the quality of enhanced speech along three dimensions: signal distortion, noise distortion, and overall quality. The resulting objective scores lie between 1 and 5, like the mean opinion score (MOS). This measure rates the quality of the enhanced speech on three
Summary and conclusions
In this paper, a combined temporal and spectral processing (TSP) method is proposed for speech degraded by background noise. It emphasizes high SNR regions in the temporal domain, and eliminates the degradation and enhances the speech-specific components in the spectral domain. The main objective of this study is to show that the combined TSP method gives relatively better performance compared to temporal or spectral processing alone. The enhancement of noisy speech is achieved in two stages, namely, temporal enhancement
References (46)
- et al. Multiple statistical models for soft decision in noisy speech enhancement. Pattern Recognition (2007)
- et al. A Laplacian-based MMSE estimator for speech enhancement. Speech Comm. (2007)
- et al. Subjective comparison and evaluation of speech enhancement algorithms. Speech Comm. (2007)
- et al. Speech enhancement by residual domain constrained optimization. Speech Comm. (2006)
- Reduction of musical residual noise for speech enhancement using masking properties and optimal smoothing. Adv. Pattern Recognition Lett. (2007)
- et al. Speech enhancement by joint statistical characterization in the Log Gabor Wavelet domain. Speech Comm. (2008)
- et al. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Comm. (1993)
- et al. Speech enhancement using linear prediction residual. Speech Comm. (1999)
- et al. Speech database development at MIT: TIMIT and beyond. Speech Comm. (1990)
- et al. Epoch extraction from linear prediction residual for identification of closed glottis interval. IEEE Trans. Acoust. Speech Signal Process. (1979)
- Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process.
- Discrete Time Processing of Speech Signals
- De-noising by soft-thresholding. IEEE Trans. Inf. Theory
- Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process.
- Speech enhancement using a minimum mean square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process.
- Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio, Speech, Lang. Process.
- Spectral subtraction based on phonetic dependency and masking effects. IEEE Proc. Vision, Image Signal Process.