
CNN-based speech segments endpoints detection framework using short-time signal energy features

Original Research
International Journal of Information Technology

Abstract

The quality of speech recognition systems has improved, with focus shifting from short-utterance scenarios such as voice assistants and voice search to extended-utterance settings such as voice input and meeting transcription. In short-utterance setups, speech endpointing plays a crucial role in perceived latency and user experience. In extended-utterance settings, the primary goal is to generate well-formatted, highly readable transcriptions that can replace keyboard typing for tasks such as writing e-mails or text documents; punctuation and capitalization become as crucial as recognition errors. For long utterances, valuable processing time, bandwidth, and other resources can be conserved by disregarding unnecessary portions of the audio signal, ultimately enhancing system throughput. In this study, we develop a framework called Speech Segments Endpoint Detection, which utilizes short-time signal energy features, a simple Mel-spectrogram, and a hybrid Convolutional Neural Network-Bidirectional Long Short-Term Memory (CNN-BiLSTM) model for classification. We evaluated the CNN-BiLSTM classifier on a 35-h audio dataset comprising 16 h of speech and 19 h of audio containing music and noise, split into training and validation sets in an 80:20 ratio. The model attained an accuracy of 98.67% on the training set and 93.62% on the validation set.
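The abstract names two feature types the framework consumes: frame-wise short-time signal energy and a simple Mel-spectrogram. The paper's exact frame and hop sizes are not stated in the abstract, so the following is a minimal illustrative sketch using librosa, assuming 25 ms frames with a 10 ms hop at 16 kHz and a hypothetical input file `utterance.wav`.

```python
# Sketch of the two feature types named in the abstract: frame-wise
# short-time energy and a log Mel-spectrogram. Frame/hop sizes are
# illustrative assumptions, not values reported by the paper.
import numpy as np
import librosa

def short_time_energy(y, frame_length=400, hop_length=160):
    """Sum of squared samples per frame (25 ms frames, 10 ms hop at 16 kHz)."""
    frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)
    return np.sum(frames ** 2, axis=0)

def mel_features(y, sr=16000, n_mels=64, hop_length=160):
    """Log-scaled Mel-spectrogram of shape (n_mels, n_frames)."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)
    return librosa.power_to_db(mel, ref=np.max)

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file
energy = short_time_energy(y)
mel = mel_features(y, sr)
```

Likewise, the classifier is described only as a hybrid CNN-BiLSTM. Below is a minimal Keras sketch of that general shape: convolutional layers summarize local time-frequency patterns in a fixed-length Mel-spectrogram segment, a bidirectional LSTM models the resulting frame sequence, and a sigmoid head labels the segment as speech or non-speech. All layer sizes and the fixed segment length are assumptions, not the authors' reported topology.

```python
# Hedged sketch of a hybrid CNN-BiLSTM segment classifier; layer sizes,
# segment length, and optimizer settings are assumptions for illustration.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_bilstm(n_mels=64, n_frames=128):
    inp = layers.Input(shape=(n_mels, n_frames, 1))  # Mel-spectrogram "image"
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    # Make time the sequence axis, folding frequency and channels together.
    x = layers.Permute((2, 1, 3))(x)  # (time, freq, channels)
    x = layers.Reshape((n_frames // 4, (n_mels // 4) * 64))(x)
    x = layers.Bidirectional(layers.LSTM(64))(x)
    out = layers.Dense(1, activation="sigmoid")(x)  # speech vs. non-speech
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn_bilstm()
model.summary()
```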


Availability of supporting data

The dataset used is publicly available on an open-source platform and can be accessed at http://www.openslr.org/17/
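The link points to OpenSLR resource 17, the MUSAN music, speech, and noise corpus. As one hedged illustration of the 80:20 train/validation split described in the abstract, the snippet below builds labeled file lists from the corpus's conventional speech/, music/, and noise/ directories (local layout assumed, not verified here) and splits them with scikit-learn.

```python
# Illustrative 80:20 split over a local extraction of the OpenSLR-17
# (MUSAN-style) corpus; the directory layout is an assumption.
from pathlib import Path
from sklearn.model_selection import train_test_split

root = Path("musan")  # assumed local extraction of the OpenSLR-17 archive
speech = [(p, 1) for p in (root / "speech").rglob("*.wav")]
nonspeech = [(p, 0) for p in (root / "music").rglob("*.wav")] + \
            [(p, 0) for p in (root / "noise").rglob("*.wav")]

files = speech + nonspeech
labels = [label for _, label in files]
train, val = train_test_split(files, test_size=0.2, random_state=42,
                              stratify=labels)
print(f"{len(train)} training files, {len(val)} validation files")
```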


Acknowledgements

The authors gratefully acknowledge the assistance from the Faculty of Computer Sciences, Baba Ghulam Shah Badshah University.

Funding

Not Applicable.

Author information


Contributions

GA: Conceived and designed the study, collected and analyzed the data, and wrote the manuscript. AAL: Contributed to the study design, data analysis, and critically revised the manuscript.

Corresponding author

Correspondence to Aadil Ahmad Lawaye.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests, financial or otherwise, that could have influenced the research or its outcomes.

Ethical approval

This research did not involve any studies with animal or human participants, nor did it take place in any private or protected areas. No specific permissions were required for this study.

Consent to participate

Written informed consent was obtained from all participants involved in this study. They were provided with detailed information regarding the purpose of the study, potential risks and benefits, and their rights to withdraw at any time without any consequences.

Consent to publish

Participants in this study provided consent for the publication of anonymized data and findings. Any identifiable information has been appropriately masked or removed to ensure confidentiality.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ahmed, G., Lawaye, A.A. CNN-based speech segments endpoints detection framework using short-time signal energy features. Int. j. inf. tecnol. 15, 4179–4191 (2023). https://doi.org/10.1007/s41870-023-01466-6

