
CNN-based speech segments endpoints detection framework using short-time signal energy features

Original Research
International Journal of Information Technology

Abstract

The quality of speech recognition systems has improved, with focus shifting from short-utterance scenarios such as voice assistants and voice search to extended-utterance settings such as voice input and meeting transcription. In short-utterance setups, speech endpointing plays a crucial role in perceived latency and user experience. In extended-utterance settings, the primary goal is to generate well-formatted, highly readable transcriptions that can replace keyboard typing for tasks such as writing e-mails or text documents; punctuation and capitalization become as crucial as recognition errors. For long utterances, valuable processing time, bandwidth, and other resources can be conserved by disregarding unnecessary portions of the audio signal, ultimately enhancing system throughput. In this study, we develop a framework called Speech Segments Endpoint Detection, which utilizes short-time signal energy features, a simple Mel-spectrogram, and a hybrid Convolutional Neural Network-Bidirectional Long Short-Term Memory (CNN-BiLSTM) model for classification. We evaluated the CNN-BiLSTM classifier on a 35-h audio dataset comprising 16 h of speech and 19 h of audio containing music and noise, split into training and validation sets in an 80:20 ratio. The model attained an accuracy of 98.67% on the training set and 93.62% on the validation set.
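The abstract names two feature types the framework consumes: frame-wise short-time signal energy and a simple Mel-spectrogram. The paper's exact frame and hop sizes are not stated in the abstract, so the following is a minimal illustrative sketch using librosa, assuming 25 ms frames with a 10 ms hop at 16 kHz and a hypothetical input file `utterance.wav`.

```python
# Sketch of the two feature types named in the abstract: frame-wise
# short-time energy and a log Mel-spectrogram. Frame/hop sizes are
# illustrative assumptions, not values reported by the paper.
import numpy as np
import librosa

def short_time_energy(y, frame_length=400, hop_length=160):
    """Sum of squared samples per frame (25 ms frames, 10 ms hop at 16 kHz)."""
    frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)
    return np.sum(frames ** 2, axis=0)

def mel_features(y, sr=16000, n_mels=64, hop_length=160):
    """Log-scaled Mel-spectrogram of shape (n_mels, n_frames)."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)
    return librosa.power_to_db(mel, ref=np.max)

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file
energy = short_time_energy(y)
mel = mel_features(y, sr)
```

Likewise, the classifier is described only as a hybrid CNN-BiLSTM. Below is a minimal Keras sketch of that general shape: convolutional layers summarize local time-frequency patterns in a fixed-length Mel-spectrogram segment, a bidirectional LSTM models the resulting frame sequence, and a sigmoid head labels the segment as speech or non-speech. All layer sizes and the fixed segment length are assumptions, not the authors' reported topology.

```python
# Hedged sketch of a hybrid CNN-BiLSTM segment classifier; layer sizes,
# segment length, and optimizer settings are assumptions for illustration.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_bilstm(n_mels=64, n_frames=128):
    inp = layers.Input(shape=(n_mels, n_frames, 1))  # Mel-spectrogram "image"
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    # Make time the sequence axis, folding frequency and channels together.
    x = layers.Permute((2, 1, 3))(x)  # (time, freq, channels)
    x = layers.Reshape((n_frames // 4, (n_mels // 4) * 64))(x)
    x = layers.Bidirectional(layers.LSTM(64))(x)
    out = layers.Dense(1, activation="sigmoid")(x)  # speech vs. non-speech
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn_bilstm()
model.summary()
```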


Availability of supporting data

The dataset used is publicly available on an open-source platform and can be accessed at http://www.openslr.org/17/
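The link points to OpenSLR resource 17, the MUSAN music, speech, and noise corpus. As one hedged illustration of the 80:20 train/validation split described in the abstract, the snippet below builds labeled file lists from the corpus's conventional speech/, music/, and noise/ directories (local layout assumed, not verified here) and splits them with scikit-learn.

```python
# Illustrative 80:20 split over a local extraction of the OpenSLR-17
# (MUSAN-style) corpus; the directory layout is an assumption.
from pathlib import Path
from sklearn.model_selection import train_test_split

root = Path("musan")  # assumed local extraction of the OpenSLR-17 archive
speech = [(p, 1) for p in (root / "speech").rglob("*.wav")]
nonspeech = [(p, 0) for p in (root / "music").rglob("*.wav")] + \
            [(p, 0) for p in (root / "noise").rglob("*.wav")]

files = speech + nonspeech
labels = [label for _, label in files]
train, val = train_test_split(files, test_size=0.2, random_state=42,
                              stratify=labels)
print(f"{len(train)} training files, {len(val)} validation files")
```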


Acknowledgements

The authors gratefully acknowledge the assistance from the Faculty of Computer Sciences, Baba Ghulam Shah Badshah University.

Funding

Not Applicable.

Author information


Contributions

GA: Conceived and designed the study, collected and analyzed the data, and wrote the manuscript. AAL: Contributed to the study design, data analysis, and critically revised the manuscript.

Corresponding author

Correspondence to Aadil Ahmad Lawaye.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests, financial or otherwise, that could have influenced the research or its outcomes.

Ethical approval

This research did not involve any studies with animal or human participants, nor did it take place in any private or protected areas. No specific permissions were required for this study.

Consent to participate

Written informed consent was obtained from all participants involved in this study. They were provided with detailed information regarding the purpose of the study, potential risks and benefits, and their rights to withdraw at any time without any consequences.

Consent to publish

Participants in this study provided consent for the publication of anonymized data and findings. Any identifiable information has been appropriately masked or removed to ensure confidentiality.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ahmed, G., Lawaye, A.A. CNN-based speech segments endpoints detection framework using short-time signal energy features. Int. j. inf. tecnol. 15, 4179–4191 (2023). https://doi.org/10.1007/s41870-023-01466-6

