
2015 | Original Paper | Book Chapter

9. Maximum A Posteriori Spectral Estimation with Source Log-Spectral Priors for Multichannel Speech Enhancement

Written by: Yasuaki Iwata, Tomohiro Nakatani, Takuya Yoshioka, Masakiyo Fujimoto, Hirofumi Saito

Published in: Speech and Audio Processing for Coding, Enhancement and Recognition

Publisher: Springer New York


Abstract

When speech signals are captured in real acoustical environments, the captured signals are distorted by certain types of interference, such as ambient noise, reverberation, and extraneous speakers’ utterances. There are two important approaches to speech enhancement that reduce such interference in the captured signals. One approach is based on the spatial features of the signals, such as direction of arrival and acoustic transfer functions, and enhances speech using multichannel audio signal processing. The other approach is based on speech spectral models that represent the probability density function of the speech spectra, and it enhances speech by distinguishing between speech and noise based on the spectral models. In this chapter, we propose a new approach that integrates the above two approaches. The proposed approach uses the spatial and spectral features of signals in a complementary manner to achieve reliable and accurate speech enhancement. The approach can be applied to various speech enhancement problems, including denoising, dereverberation, and blind source separation (BSS). In particular, in this chapter, we focus on applying the approach to BSS. We show experimentally that the proposed integration can improve the performance of BSS compared with a conventional approach.
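The abstract's core idea, combining a multichannel likelihood with a source log-spectral prior, can be illustrated with a minimal one-dimensional sketch. This is not the chapter's actual algorithm: we assume an exponential (power-domain) likelihood for the observed power given source-plus-noise power and a Gaussian prior on the source log-spectrum, and maximize the posterior over a grid. The function name and parameters are illustrative.

```python
import numpy as np

def map_log_spectrum(y_power, noise_var, prior_mean, prior_var, grid=None):
    """MAP estimate of a single source log-spectral coefficient (sketch).

    Assumed observation model: y_power is exponentially distributed with
    mean exp(g) + noise_var, where g is the source log-spectrum.
    Assumed prior: Gaussian on g with mean prior_mean, variance prior_var.
    """
    if grid is None:
        grid = np.linspace(prior_mean - 5, prior_mean + 5, 2001)
    total = np.exp(grid) + noise_var
    # log-likelihood of an exponential power-domain observation
    loglik = -np.log(total) - y_power / total
    # Gaussian log-prior on the log-spectrum
    logprior = -0.5 * (grid - prior_mean) ** 2 / prior_var
    return grid[np.argmax(loglik + logprior)]
```

With a broad prior the estimate follows the observation; as the prior variance shrinks, the estimate is pulled toward the prior mean, which is the mechanism by which spectral priors regularize spatially derived estimates.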


Footnotes
1
As noted later, despite this assumption, this scenario can represent a situation with long reverberation, and can be used for achieving dereverberation.
 
2
If we interpret the ATFs from \(s_{t}\) to \(z_{t}^{(m)}\) also as part of the interference, we may formulate speech enhancement that estimates \(s_{t}\). This is beyond the scope of this chapter.
 
3
The same model can be used to represent ambient noise, for example, as in [10]. The formulation of MLSE for denoising and its extension to MAPSE can be found in [12]. For MLSE-based dereverberation with the long-term linear prediction approach, the generative model of the interference can be defined in the following form [10, 11, 16]:
$$p(\mathbf{a}_{n,f}\vert \theta _{f}) = \delta (\mathbf{a}_{n,f} -\mathbf{r}_{n,f}(\theta _{f})), \qquad (9.16)$$
where δ(⋅) is the Dirac delta function, and \(\mathbf{r}_{n,f}(\theta _{f}) = [r_{n,f}^{(1)}(\theta _{f}),r_{n,f}^{(2)}(\theta _{f}),\ldots,r_{n,f}^{(M)}(\theta _{f})]^{T}\) is the spatial vector of the interference signal, namely the late reverberation signal. The model parameter set \(\theta _{f}\) comprises the prediction coefficients, and in MLSE-based dereverberation the late reverberation \(r_{n,f}^{(m)}(\theta _{f})\) is modeled as the inner product of a vector of prediction coefficients and a vector of past captured signals. In [11], it was shown that MLSE-based dereverberation can be extended to MAPSE-based dereverberation using the technique discussed in this chapter.
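The inner-product model of late reverberation described above can be sketched for a single channel and frequency bin as follows. This is a simplified illustration of delayed linear prediction, not the chapter's implementation; the function names, the prediction order, and the prediction delay are illustrative assumptions.

```python
import numpy as np

def late_reverb_estimate(x_past, c):
    """Late reverberation at one time-frequency point, modeled as the
    inner product of prediction coefficients c with past captured
    samples x_past (most recent first)."""
    return np.dot(c, x_past)

def dereverberate(x, c, delay=2):
    """Subtract the predicted late reverberation from a captured signal
    x for one channel/frequency bin (sketch; prediction order len(c),
    prediction delay `delay` frames)."""
    K = len(c)
    y = x.astype(complex).copy()
    for n in range(delay + K - 1, len(x)):
        # past frames x[n-delay], x[n-delay-1], ..., x[n-delay-K+1]
        past = x[n - delay - K + 1 : n - delay + 1][::-1]
        y[n] = x[n] - late_reverb_estimate(past, c)
    return y
```

If the captured signal is generated by exactly this prediction model, subtracting the predicted late reverberation recovers the direct signal; in practice the coefficients themselves must be estimated, which is where the MLSE/MAPSE formulations enter.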
 
References
1. J. Benesty, S. Makino, J. Chen (eds.), Speech Enhancement (Signals and Communication Technology) (Springer, Berlin, 2005)
2. C.M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics) (Springer, New York, 2010)
3. A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. Ser. B Methodol. 39, 1–38 (1977)
4. N.Q.K. Duong, E. Vincent, R. Gribonval, Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Trans. Audio Speech Lang. Process. 18(7), 1830–1840 (2010)
5. M. Fujimoto, T. Nakatani, Model-based noise suppression using unsupervised estimation of hidden Markov model for non-stationary noise, in Proceedings of INTERSPEECH 2013 (2013), pp. 2982–2986
6. S. Gannot, M. Moonen, Subspace methods for multimicrophone speech dereverberation. EURASIP J. Adv. Signal Process. 2003(11), 1074–1090 (2003)
7. J.F. Gemmeke, T. Virtanen, A. Hurmalainen, Exemplar-based sparse representations for noise robust automatic speech recognition. IEEE Trans. Audio Speech Lang. Process. 19(7), 2067–2080 (2011)
8. S. Haykin, Adaptive Filter Theory, 5th edn. (Prentice Hall, Englewood Cliffs, 2013)
9. K. Iso, S. Araki, S. Makino, T. Nakatani, H. Sawada, T. Yamada, A. Nakamura, Blind source separation of mixed speech in a high reverberation environment, in Proceedings of 3rd Joint Workshop on Hands-free Speech Communication and Microphone Array (HSCMA-2011) (2011), pp. 36–39
10. N. Ito, S. Araki, T. Nakatani, Probabilistic integration of diffuse noise suppression and dereverberation, in Proceedings of IEEE ICASSP-2014 (2014), pp. 5204–5208
11. Y. Iwata, T. Nakatani, Introduction of speech log-spectral priors into dereverberation based on Itakura-Saito distance minimization, in Proceedings of IEEE ICASSP-2012 (2012), pp. 245–248
12. Y. Iwata, T. Nakatani, M. Fujimoto, T. Yoshioka, H. Saito, MAP spectral estimation of speech using log-spectral prior for noise reduction (in Japanese), in Proceedings of Autumn-2012 Meeting of the Acoustical Society of Japan (2012), pp. 795–798
13. Y. Izumi, N. Ono, S. Sagayama, Sparseness-based 2ch BSS using the EM algorithm in reverberant environment, in Proceedings of IEEE WASPAA-2007 (2007), pp. 147–150
14. P.C. Loizou, Speech Enhancement: Theory and Practice, 2nd edn. (CRC Press, Boca Raton, 2013)
15. P.J. Moreno, B. Raj, R.M. Stern, A vector Taylor series approach for environment-independent speech recognition, in Proceedings of IEEE ICASSP-1996, vol. 2 (1996), pp. 733–736
16. T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, B.H. Juang, Speech dereverberation based on variance-normalized delayed linear prediction. IEEE Trans. Audio Speech Lang. Process. 18(7), 1717–1731 (2010)
17. A. Ogawa, K. Kinoshita, T. Hori, T. Nakatani, A. Nakamura, Fast segment search for corpus-based speech enhancement based on speech recognition technology, in Proceedings of IEEE ICASSP-2014 (2014), pp. 1576–1580
18. D. Pearce, H.G. Hirsch, The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions, in Proceedings of INTERSPEECH-2000, vol. 2000 (2000), pp. 29–32
19. R. Martin, Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process. 9(5), 504–512 (2001)
20. S.J. Rennie, J.R. Hershey, P.A. Olsen, Single-channel multitalker speech recognition. IEEE SP Mag. 27(6), 66–80 (2010)
21. H. Sawada, S. Araki, R. Mukai, S. Makino, Grouping separated frequency components by estimating propagation model parameters in frequency-domain blind source separation. IEEE Trans. Audio Speech Lang. Process. 15(5), 1592–1604 (2007)
22. H. Sawada, S. Araki, S. Makino, Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment. IEEE Trans. Audio Speech Lang. Process. 19(3), 516–527 (2011)
23. M. Seltzer, D. Yu, Y. Wang, An investigation of deep neural networks for noise robust speech recognition, in Proceedings of IEEE ICASSP-2013 (2013), pp. 7398–7402
24. M. Souden, J. Chen, J. Benesty, S. Affes, An integrated solution for online multichannel noise tracking and reduction. IEEE Trans. Audio Speech Lang. Process. 19, 2159–2169 (2011)
25. M. Togami, Y. Kawaguchi, R. Takeda, Y. Obuchi, N. Nukaga, Optimized speech dereverberation from probabilistic perspective for time varying acoustic transfer function. IEEE Trans. Audio Speech Lang. Process. 21(7), 1369–1380 (2013)
26. O. Yilmaz, S. Rickard, Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Signal Process. 52(7), 1830–1847 (2004)
27. T. Yoshioka, T. Nakatani, M. Miyoshi, H.G. Okuno, Blind separation and dereverberation of speech mixtures by joint optimization. IEEE Trans. Audio Speech Lang. Process. 19(1), 69–84 (2011)
28. E. Vincent, H. Sawada, P. Bofill, S. Makino, J. Rosca, First stereo audio source separation evaluation campaign: data, algorithms and results, in Proceedings of International Conference on Independent Component Analysis (ICA) (2007), pp. 552–559
Metadata
Title
Maximum A Posteriori Spectral Estimation with Source Log-Spectral Priors for Multichannel Speech Enhancement
Written by
Yasuaki Iwata
Tomohiro Nakatani
Takuya Yoshioka
Masakiyo Fujimoto
Hirofumi Saito
Copyright Year
2015
Publisher
Springer New York
DOI
https://doi.org/10.1007/978-1-4939-1456-2_9
