2021 | OriginalPaper | Chapter

Improving Word Recognition in Speech Transcriptions by Decision-Level Fusion of Stemming and Two-Way Phoneme Pruning

Authors: Sunakshi Mehra, Seba Susan

Published in: Advanced Computing

Publisher: Springer Singapore

Abstract

We introduce an unsupervised approach for correcting highly imperfect speech transcriptions, based on a decision-level fusion of stemming and two-way phoneme pruning. Transcripts are obtained from videos by extracting the audio with the FFmpeg framework and converting the audio to a text transcript with the Google API. The benchmark LRW dataset contains 500 word categories with 50 videos per class in MP4 format; each video is 1.16 s long (29 frames), and the target word appears in the middle of the video. Our approach improves on the baseline accuracy of 9.34% by applying stemming, phoneme extraction, filtering, and pruning. Applying a stemming algorithm to the text transcript raises word recognition accuracy to 23.34%. To convert words to phonemes we use the Carnegie Mellon University (CMU) pronouncing dictionary, which provides a phonetic mapping of English words to their pronunciations. We propose a two-way phoneme pruning that comprises two non-sequential steps: 1) filtering and pruning of the phonemes containing vowels and plosives, and 2) filtering and pruning of the phonemes containing vowels and fricatives. Fusing the decisions of stemming and two-way phoneme pruning at the decision level further improves the word recognition rate, up to 32.96%.
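For concreteness, the sketch below illustrates how the pipeline described in the abstract could be assembled: stemming, CMU pronouncing dictionary lookups, the two pruning branches (vowels plus plosives, vowels plus fricatives), and a simple OR-style decision-level fusion. The use of NLTK's cmudict and PorterStemmer, the ARPAbet consonant groupings, the reading of "pruning" as retaining vowels plus one consonant class, and the fusion rule are all illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch (assumptions noted above): stemming, CMUdict phoneme lookup,
# two-way phoneme pruning, and OR-style decision-level fusion.
import nltk
from nltk.corpus import cmudict
from nltk.stem import PorterStemmer

nltk.download("cmudict", quiet=True)
PRON = cmudict.dict()          # word -> list of ARPAbet pronunciations
STEMMER = PorterStemmer()

VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
          "EY", "IH", "IY", "OW", "OY", "UH", "UW"}
PLOSIVES = {"P", "B", "T", "D", "K", "G"}
FRICATIVES = {"F", "V", "TH", "DH", "S", "Z", "SH", "ZH", "HH"}

def phonemes(word):
    """First CMUdict pronunciation of a word, with stress digits stripped."""
    prons = PRON.get(word.lower(), [])
    return [p.rstrip("012") for p in prons[0]] if prons else []

def prune(phones, consonant_class):
    """Keep only the vowels and one consonant class (plosives or fricatives)."""
    return [p for p in phones if p in VOWELS or p in consonant_class]

def fused_match(transcribed, target):
    """Decision-level fusion: accept the target word if the stemmed forms
    agree, or if either pruned phoneme sequence matches."""
    stem_ok = STEMMER.stem(transcribed.lower()) == STEMMER.stem(target.lower())
    plosive_ok = prune(phonemes(transcribed), PLOSIVES) == prune(phonemes(target), PLOSIVES)
    fricative_ok = prune(phonemes(transcribed), FRICATIVES) == prune(phonemes(target), FRICATIVES)
    return stem_ok or plosive_ok or fricative_ok

# Example: a plural transcript word still matches its class label.
print(fused_match("benefits", "benefit"))   # True (stems agree)
```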

Metadata
Title
Improving Word Recognition in Speech Transcriptions by Decision-Level Fusion of Stemming and Two-Way Phoneme Pruning
Authors
Sunakshi Mehra
Seba Susan
Copyright Year
2021
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-16-0401-0_19
