Published in: International Journal of Machine Learning and Cybernetics 7/2019

01-08-2018 | Original Article

A voice activity detection algorithm in spectro-temporal domain using sparse representation

Authors: Mohadese Eshaghi, Farbod Razzazi, Alireza Behrad

Abstract

This paper describes a new algorithm for voice activity detection (VAD) based on sparse representation in the spectro-temporal domain. The audio classification algorithm uses multi-scale spectro-temporal modulation features extracted with an auditory cortex model. The key concept in sparse representation is that any speech fragment can be represented as a linear combination of a small number of exemplar speech tokens. The algorithm first transforms the speech signal into the spectro-temporal domain, decomposing it into auditory-based features at multiple scales of temporal and spectral resolution. Each frame is then divided into several sub-cubes in the new domain, and speech is detected from the sparse representation of these sub-cubes. Simulation results illustrate the effectiveness of the proposed VAD algorithm. The achieved performance is 90.11% and 91.75% at −5 dB SNR in white and car noise, respectively, outperforming most state-of-the-art VAD algorithms.
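The sparse-representation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the sparse solver, the dictionaries, or the decision rule, so the choices here are assumptions made for illustration only. The sketch uses plain matching pursuit as the solver, tiny hand-made "speech" and "noise" dictionaries, a comparison of residual energies as the per-sub-cube decision, and a majority vote over sub-cubes as the per-frame decision. All function and variable names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit Euclidean norm (zero vectors are returned unchanged)."""
    n = math.sqrt(dot(v, v))
    return [a / n for a in v] if n > 0 else list(v)

def matching_pursuit(y, atoms, n_nonzero):
    """Greedy sparse coding: approximate y using at most n_nonzero unit-norm atoms."""
    residual = list(y)
    coeffs = [0.0] * len(atoms)
    for _ in range(n_nonzero):
        corrs = [dot(residual, a) for a in atoms]
        k = max(range(len(atoms)), key=lambda i: abs(corrs[i]))
        if abs(corrs[k]) < 1e-12:
            break  # no atom explains what is left of the signal
        coeffs[k] += corrs[k]
        residual = [r - corrs[k] * a for r, a in zip(residual, atoms[k])]
    return coeffs, residual

def residual_energy_ratio(y, atoms, n_nonzero=2):
    """Fraction of the energy of y left unexplained by the sparse approximation."""
    _, residual = matching_pursuit(y, atoms, n_nonzero)
    energy = dot(y, y)
    return dot(residual, residual) / energy if energy > 0 else 1.0

def is_speech(feature, speech_atoms, noise_atoms, n_nonzero=2):
    """Assumed rule: speech if the speech dictionary fits better than the noise one."""
    return (residual_energy_ratio(feature, speech_atoms, n_nonzero)
            < residual_energy_ratio(feature, noise_atoms, n_nonzero))

def frame_is_speech(sub_cubes, speech_atoms, noise_atoms, n_nonzero=2):
    """Assumed rule: a frame is speech if most of its sub-cubes are classified as speech."""
    votes = sum(is_speech(c, speech_atoms, noise_atoms, n_nonzero) for c in sub_cubes)
    return votes * 2 > len(sub_cubes)

# Toy dictionaries (illustrative only): each atom stands in for a flattened sub-cube exemplar.
SPEECH_ATOMS = [normalize([1, 1, 0, 0]), normalize([1, -1, 0, 0])]
NOISE_ATOMS = [normalize([0, 0, 1, 1]), normalize([0, 0, 1, -1])]

speech_like = [0.9, 0.4, 0.05, 0.0]
noise_like = [0.02, 0.0, 0.7, 0.6]

print(is_speech(speech_like, SPEECH_ATOMS, NOISE_ATOMS))
print(is_speech(noise_like, SPEECH_ATOMS, NOISE_ATOMS))
print(frame_is_speech([speech_like, speech_like, noise_like], SPEECH_ATOMS, NOISE_ATOMS))
```

In a real system the feature vectors would come from the spectro-temporal (auditory cortex model) analysis and the dictionaries would be learned or sampled from training data. Both steps are outside the scope of this sketch.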

Metadata
Title
A voice activity detection algorithm in spectro-temporal domain using sparse representation
Authors
Mohadese Eshaghi
Farbod Razzazi
Alireza Behrad
Publication date
01-08-2018
Publisher
Springer Berlin Heidelberg
Published in
International Journal of Machine Learning and Cybernetics / Issue 7/2019
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-018-0856-z
