Enhancement of noisy speech by temporal and spectral processing

doi:10.1016/j.specom.2010.08.011

Speech Communication

Volume 53, Issue 2, February 2011, Pages 154-174

https://doi.org/10.1016/j.specom.2010.08.011

Abstract

This paper presents a noisy speech enhancement method that combines linear prediction (LP) residual weighting in the time domain with spectral processing in the frequency domain, to provide better noise suppression as well as better enhancement in the speech regions. The noisy speech is first processed by excitation source (LP residual) based temporal processing, which involves identifying and enhancing the excitation source based speech-specific features present at the gross and fine temporal levels. The gross level features are identified by estimating the following parameters from the noisy speech signal: the sum of the peaks in the discrete Fourier transform (DFT) spectrum, the smoothed Hilbert envelope of the LP residual, and the modulation spectrum values. The fine level features are identified using knowledge of the instants of significant excitation. A weight function is derived from the gross and fine weight functions to obtain the temporally processed speech signal. The temporally processed speech is then subjected to spectral domain processing, which involves estimation and removal of degrading components, and also identification and enhancement of speech-specific spectral components. The proposed method is evaluated using different objective and subjective quality measures. The quality measures show that the proposed combined temporal and spectral processing method provides better enhancement than either temporal or spectral processing alone.

Research highlights

► The potential of combining LP residual weighting based temporal processing and harmonics enhancement based spectral processing is demonstrated for noisy speech enhancement.
► A set of speech-specific features is proposed for the gross level detection of high SNR speech regions.
► A method to determine the instants of significant excitation in noisy speech is proposed.
► A new spectral enhancement technique is proposed to enhance the voiced regions of noisy speech.

Introduction

The problem of enhancing noisy speech has received considerable attention, and a variety of methods have been proposed in the literature. The available methods may be broadly classified into two categories, namely, spectral and temporal domain enhancement methods. The spectral domain processing methods can in turn be classified into two main areas: nonparametric and statistical model-based speech enhancement (Shao and Chang, 2007; Loizou, 2007). The nonparametric methods remove an estimate of the distortion from the noisy features; examples are subtractive-type algorithms (Boll, 1979; Berouti et al., 1979; Kim et al., 2000; Kamath and Loizou, 2002; Yamashita and Shimamura, 2005; Yang and Fu, 2005; Lu, 2007) and wavelet denoising (Donoho, 1995; Chang et al., 2007; Senapati et al., 2008). Statistical model-based speech enhancement (Ephraim and Malah, 1984; Ephraim and Malah, 1985; Marzinzik and Kollmeier, 2002; Martin, 2005; Chen and Loizou, 2005; Chen and Loizou, 2007) utilizes a parametric model of the signal generation process. The spectral subtraction algorithm is the oldest one proposed for noise reduction (Boll, 1979). It subtracts the average magnitude of the noise spectrum from the spectrum of the noisy speech to estimate the magnitude of the enhanced speech spectrum. The noise spectrum is estimated by averaging short-term magnitude spectra of the non-speech segments. One serious drawback of this method is that it produces musical noise in the enhanced speech. This noise arises from randomly spaced peaks in the time-frequency plane, caused by the deviation of the estimated noise spectrum from the instantaneous noise spectrum (Seok and Bae, 1999). Several modifications of the spectral subtraction approach have been proposed to reduce the effect of musical noise (Loizou, 2007).

One of the most popular spectral domain enhancement methods is the minimum mean square error (MMSE) short-time spectral amplitude (STSA) estimator of Ephraim and Malah (1984). This algorithm is based on a Gaussian statistical model: the short-time Fourier transform (STFT) coefficients of speech and noise are modelled as statistically independent Gaussian random variables. The aim is to enhance degraded speech by minimizing the mean squared error between the STSA of the clean speech and that of the enhanced speech. This estimator gives very good results in practice, with a noticeable reduction in musical noise. A number of Ephraim–Malah type filters based on non-Gaussian models, such as Gamma modelling (Marzinzik and Kollmeier, 2002) and Laplacian modelling (Chen and Loizou, 2007), have also been proposed to improve performance.
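
The spectral subtraction rule described above can be sketched in a few lines. This is a minimal illustration, not the exact algorithm of Boll (1979): the function names, the frame length, the periodic Hann window with 50% overlap, and the spectral floor `floor` are all assumptions made for the example, and the noise-only segment is assumed to be given.

```python
import numpy as np


def split_frames(x, frame_len, hop, window):
    """Slice x into overlapping, windowed frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] * window
                     for i in range(n_frames)])


def spectral_subtraction(noisy, noise_only, frame_len=256, floor=0.01):
    """Magnitude spectral subtraction with overlap-add resynthesis."""
    hop = frame_len // 2
    # Periodic Hann window: copies shifted by half a frame sum to one.
    window = np.hanning(frame_len + 1)[:-1]

    # Noise estimate: average short-term magnitude spectrum of non-speech.
    noise_frames = split_frames(noise_only, frame_len, hop, window)
    noise_mag = np.abs(np.fft.rfft(noise_frames, axis=1)).mean(axis=0)

    spec = np.fft.rfft(split_frames(noisy, frame_len, hop, window), axis=1)
    mag, phase = np.abs(spec), np.angle(spec)

    # Subtract the noise magnitude; the floor (an assumption of this sketch)
    # keeps the result from going negative.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)

    # Resynthesize with the noisy phase and overlap-add the frames.
    frames = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(len(noisy))
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
    return out
```

Bins in which the instantaneous noise magnitude exceeds the average survive the subtraction as isolated peaks, which is the source of the musical noise mentioned above. The first and last half frames are covered by a single window and therefore come out tapered.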

Yegnanarayana et al. proposed an enhancement method that exploits the characteristics of the excitation source signal, such as the linear prediction (LP) residual (Yegnanarayana et al., 1999). The basis for this approach is that human beings perceive speech by capturing features from the high signal-to-noise ratio (SNR) regions and then extrapolating them into the low SNR regions (Yegnanarayana et al., 1999). Accordingly, the approach is to identify the high SNR regions in the noisy speech and enhance them relative to the low SNR regions, without causing significant distortion in the enhanced speech. A weight function is derived for the residual signal that reduces the energy in the low SNR regions relative to the high SNR regions of the noisy signal. The residual samples are multiplied by the weight function, and the weighted LP residual is used to excite the time-varying all-pole filter derived from the noisy speech to generate the enhanced speech. Jin and Scordilis (2006) proposed a speech enhancement algorithm similar to that of Yegnanarayana et al. (1999). It differs from the earlier residual weighting scheme in that the weights on the LP residual are derived from a constrained optimization criterion. Yegnanarayana et al. (2002) exploited a coherently added Hilbert envelope (HE) for LP residual reconstruction. The HE has a large amplitude at the instants of significant excitation, which makes it a good indicator of glottal closure (GC), where an excitation pulse takes place. Applying the HE to the LP residual as a weighting function therefore emphasizes the pulse train structure of voiced speech, which leads to an enhanced LP residual signal.
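
The residual weighting chain described above (LP analysis, inverse filtering, weighting, all-pole resynthesis) can be sketched for a single short frame as follows. This is an illustrative sketch under stated assumptions, not the method of any of the cited papers: the LP order, the function names, and in particular the weight mapping in `weight_residual` (Hilbert envelope scaled into `[floor, 1]`) are hypothetical choices, and a real system would process successive frames with a time-varying filter.

```python
import numpy as np


def lp_coefficients(x, order=10):
    """Autocorrelation-method LP analysis; returns [1, a_1, ..., a_order]."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    # Toeplitz normal equations, lightly regularised for numerical safety.
    R = r[lags] + (1e-9 * r[0] + 1e-12) * np.eye(order)
    return np.concatenate(([1.0], np.linalg.solve(R, -r[1:])))


def lp_residual(x, a):
    """Inverse-filter x with A(z) to obtain the LP residual."""
    return np.convolve(x, a)[:len(x)]


def all_pole_synthesis(excitation, a):
    """Filter the excitation through 1/A(z), starting from zero state."""
    order = len(a) - 1
    y = np.zeros(len(excitation) + order)  # leading zeros hold the initial state
    for n in range(len(excitation)):
        past = y[n:n + order][::-1]        # y[n-1], ..., y[n-order]
        y[n + order] = excitation[n] - np.dot(a[1:], past)
    return y[order:]


def hilbert_envelope(x):
    """Magnitude of the analytic signal, computed with the FFT."""
    n = len(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    gain[1:(n + 1) // 2] = 2.0             # double the positive frequencies
    if n % 2 == 0:
        gain[n // 2] = 1.0                 # keep the Nyquist bin once
    return np.abs(np.fft.ifft(np.fft.fft(x) * gain))


def weight_residual(residual, floor=0.2):
    """Hypothetical weighting: Hilbert envelope scaled into [floor, 1]."""
    env = hilbert_envelope(residual)
    weights = floor + (1.0 - floor) * env / (env.max() + 1e-12)
    return residual * weights, weights


def enhance_frame(x, order=10):
    """LP analysis, residual weighting, and resynthesis for one frame."""
    a = lp_coefficients(x, order)
    weighted, _ = weight_residual(lp_residual(x, a))
    return all_pole_synthesis(weighted, a)
```

With all weights equal to one, `all_pole_synthesis(lp_residual(x, a), a)` reproduces the input frame, so any change in the output comes from the weight function alone.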

As mentioned, most enhancement methods process degraded speech in either the temporal or the spectral domain. The scope of this work is to highlight and demonstrate the merits of combined temporal and spectral processing of noisy speech. In most spectral domain methods, the emphasis is on suppressing the noise components by estimating the noise characteristics from the degraded speech. The merit of this approach is its effectiveness for noise removal. However, information about the noise needs to be estimated continuously, particularly in non-stationary environments where the noise characteristics keep changing. The temporal processing methods, in contrast, use the characteristics of the excitation source and primarily aim at emphasizing the high SNR regions of noisy speech. Therefore no explicit knowledge of the characteristics of the background noise is required. The limitation of the temporal processing methods is that the level of degradation removal achieved may not be as significant as with spectral methods. Thus the integration of these two approaches may lead to better suppression of degradation and also to enhancement of the high SNR speech regions, and hence to improved performance compared to either temporal or spectral processing alone. Further, from the speech production point of view, the temporal and spectral processing methods use independent information from the noisy speech. It is therefore interesting to study whether they exploit different information for processing. If so, they can be suitably combined to develop robust methods for speech enhancement. Motivated by these observations, this work proposes a method for the enhancement of noisy speech by combined temporal and spectral processing, to provide better noise suppression and also better enhancement in the speech regions.

The various steps involved in the proposed noisy speech enhancement method are illustrated in Fig. 1. The temporal processing involves identifying and enhancing the speech-specific features present at the gross and fine temporal levels. The main objective of the gross level processing is to identify and enhance the speech components at the sound unit (100–300 ms) level. In this paper, a method is proposed for detecting high SNR regions using the sum of the ten largest peaks in the discrete Fourier transform (DFT) spectrum, the smoothed HE of the LP residual, and the modulation spectrum values. The objective of the fine level processing is to identify and enhance the speech-specific features at the subsegmental (2–3 ms) level. It is based on the fact that the significant excitation of the vocal tract takes place at the instants of glottal closure and at the onset of events like burst, frication and aspiration. Depending on the nature of the degradation, the LP residual signal will have many other random peaks in addition to the original instants of significant excitation. The temporal processing method identifies the original instants of significant excitation and emphasizes the region around them to obtain the enhanced speech. For fine level processing, this paper proposes a method to identify the instants of significant excitation from the noisy speech. The proposed method involves (i) sinusoidal analysis of the noisy speech, and (ii) convolution of the HE of the LP residual of the speech obtained from sinusoidal analysis with the first order Gaussian differentiator (FOGD). Finally, the gross and fine level features are combined to derive a weight function for the excitation source signal, which emphasizes the excitation around the instants of significant excitation and deemphasizes the random peaks of the background noise.
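
Step (ii) above can be illustrated with a short sketch that convolves a given Hilbert envelope with a first order Gaussian differentiator. The kernel length, `sigma`, the function names, and the rule for picking instants (positive-to-negative zero crossings of the convolved output) are assumptions made for this example; the text above does not specify them, and the sinusoidal analysis of step (i) is not shown.

```python
import numpy as np


def fogd_kernel(half_len=40, sigma=12.0):
    """Sampled derivative of a Gaussian (antisymmetric about its centre)."""
    n = np.arange(-half_len, half_len + 1, dtype=float)
    return -n / sigma ** 2 * np.exp(-0.5 * (n / sigma) ** 2)


def epoch_candidates(envelope, kernel):
    """Return candidate instants and the FOGD-filtered evidence signal.

    Convolving with the Gaussian derivative gives a smoothed slope of the
    envelope, which changes sign from positive to negative at each peak.
    The envelope is assumed to be longer than the kernel.
    """
    evidence = np.convolve(envelope, kernel, mode="same")
    falling = (evidence[:-1] > 0) & (evidence[1:] <= 0)
    return np.flatnonzero(falling) + 1, evidence
```

For an idealised envelope made of isolated unit pulses, the candidates coincide with the pulse locations; with a real noisy envelope, the Gaussian width controls how strongly the random peaks are smoothed out before the zero crossings are picked.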
The enhanced excitation signal is used to excite the time-varying all-pole filter derived from the noisy speech to generate the temporally processed speech. The temporally processed speech signal is further subjected to spectral processing. The spectral processing is based on the fact that the spectral values of the degraded speech contain both speech and degrading components. The spectral components of the degradation are therefore estimated and removed. Further, the perceptually important spectral peaks are identified and enhanced. Accordingly, in this work spectral processing is performed in two stages: attenuation of the spectral characteristics of the background noise and enhancement of the speech-specific spectral features. In the first stage, the spectral characteristics of the background noise are estimated and attenuated using conventional spectral processing methods based on spectral subtraction or MMSE estimators. In the second stage, the regions around the pitch and its harmonics are enhanced, with the pitch estimated from the enhanced excitation source signal.

The rest of the paper is organized as follows. Section 2 discusses the temporal processing of the noisy speech signal. The spectral domain processing of the noisy speech signal is described in Section 3. The experimental studies and the objective quality measures computed for the individual and the combined processing methods are described in Section 4. A summary and the conclusions of this study, with scope for future work, are given in Section 5.

Section snippets

Gross level temporal processing

The high SNR regions at the gross level are identified using the sum of the ten largest peaks in the DFT spectrum, the smoothed HE of the LP residual, and the modulation spectrum values of the noisy speech signal (Krishnamoorthy and Prasanna, 2009). The gross weight function used in this work is similar to voice activity detection (VAD). One can use existing VAD methods in place of the gross level weight function. On the other hand, we would like to emphasize that in this work we have investigated alternate
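
The first of the three gross-level features named above, the sum of the ten largest peaks in the DFT spectrum, can be sketched per frame as below. The frame length, hop, Hamming window, FFT size, the reading of "peaks" as local maxima of the magnitude spectrum, and the normalisation by the maximum are all assumptions of this example; the smoothed HE and modulation spectrum features, and the way the three are combined into a weight function, are not shown.

```python
import numpy as np


def sum_of_largest_peaks(frame, n_peaks=10, n_fft=512):
    """Sum of the n_peaks largest local maxima of the DFT magnitude."""
    mag = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=n_fft))
    inner = mag[1:-1]
    is_peak = (inner > mag[:-2]) & (inner >= mag[2:])
    return np.sort(inner[is_peak])[-n_peaks:].sum()


def gross_peak_feature(signal, frame_len=160, hop=80):
    """Frame-wise peak sum, normalised by its maximum over the signal."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    values = np.array([
        sum_of_largest_peaks(signal[i * hop:i * hop + frame_len])
        for i in range(n_frames)
    ])
    return values / (values.max() + 1e-12)
```

Frames dominated by voiced speech concentrate energy in a few strong spectral peaks, so this value tends to be larger there than in noise-only frames, where the energy is spread over many small peaks.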

Spectral processing of noisy speech

The temporally processed speech sounds perceptually enhanced. This is mainly due to the enhancement of the speech-specific features in the noisy speech signal, namely the high SNR regions at the gross level and the regions around the instants of significant excitation. This is achieved by multiplying the LP residual of the noisy speech signal by the weight function. Even though the speech-specific features are emphasized in the temporally processed speech, the noise suppression is minimal mainly

Experimental results and performance evaluation

The proposed speech enhancement method is evaluated using the composite objective quality measures, which have a high degree of correlation with subjective quality (Hu and Loizou, 2006; Hu and Loizou, 2008). This measure evaluates the quality of the enhanced speech along three dimensions: signal distortion, noise distortion, and overall quality. The resulting objective scores lie between 1 and 5, like the mean opinion score (MOS). This measure rates the quality of the enhanced speech on three

Summary and conclusions

In this paper, a combined temporal and spectral processing (TSP) method is proposed for speech degraded by background noise. It emphasizes the high SNR regions in the temporal domain, and removes the degradation and enhances the speech-specific components in the spectral domain. The main objective of this study is to show that the combined TSP method gives better performance than temporal or spectral processing alone. The enhancement of noisy speech is achieved in two stages, namely, temporal enhancement

References (46)

J.-H. Chang et al., 2007. Multiple statistical models for soft decision in noisy speech enhancement. Pattern Recognition.
B. Chen et al., 2007. A Laplacian-based MMSE estimator for speech enhancement. Speech Comm.
Y. Hu et al., 2007. Subjective comparison and evaluation of speech enhancement algorithms. Speech Comm.
W. Jin et al., 2006. Speech enhancement by residual domain constrained optimization. Speech Comm.
C.-T. Lu, 2007. Reduction of musical residual noise for speech enhancement using masking properties and optimal smoothing. Adv. Pattern Recognition Lett.
S. Senapati et al., 2008. Speech enhancement by joint statistical characterization in the Log Gabor Wavelet domain. Speech Comm.
A. Varga et al., 1993. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Comm.
B. Yegnanarayana et al., 1999. Speech enhancement using linear prediction residual. Speech Comm.
V. Zue et al., 1990. Speech database development at MIT: TIMIT and beyond. Speech Comm.
T. Ananthapadmanabha et al., 1979. Epoch extraction from linear prediction residual for identification of closed glottis interval. IEEE Trans. Acoust. Speech Signal Process.
Berouti, M., Schwartz, R., Makhoul, J., 1979. Enhancement of speech corrupted by acoustic noise. In: Proc. IEEE...
S.F. Boll, 1979. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process.
Chen, B., Loizou, P., 2005. Speech enhancement using a MMSE short time spectral amplitude estimator with Laplacian...
J.R. Deller et al., 1993. Discrete Time Processing of Speech Signals.
D.L. Donoho, 1995. De-noising by soft-thresholding. IEEE Trans. Inf. Theory.
Y. Ephraim et al., 1984. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process.
Y. Ephraim et al., 1985. Speech enhancement using a minimum mean square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process.
Greenberg, S., Kingsbury, B.E.D., 1997. The modulation spectrogram: In pursuit of an invariant representation of...
Hu, Y., Loizou, P.C., 2006. Evaluation of objective measures for speech enhancement. In: Proc. Interspeech,...
Y. Hu et al., 2008. Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio, Speech, Lang. Process.
Kamath, S., Loizou, P., 2002. A multi-band spectral subtraction method for enhancing speech corrupted by colored noise....
W. Kim et al., 2000. Spectral subtraction based on phonetic dependency and masking effects. IEEE Proc. Vision, Image Signal Process.
Krishnamoorthy, P., Prasanna, S.R.M., 2008. Temporal and spectral processing of degraded speech. In: IEEE Proc....
