Published in: Journal on Multimodal User Interfaces 4/2018

01 August 2018 | Original Paper

Multimodal speech recognition: increasing accuracy using high speed video data

Authors: Denis Ivanko, Alexey Karpov, Dmitrii Fedotov, Irina Kipyatkova, Dmitry Ryumin, Dmitriy Ivanko, Wolfgang Minker, Milos Zelezny

Abstract

To date, multimodal speech recognition systems based on the joint processing of audio and video signals show significantly better results than their unimodal counterparts. In general, researchers divide the audio–visual speech recognition problem into two parts: first, extracting the most informative features from each modality, and second, fusing the two modalities as effectively as possible. Ultimately, this leads to an improvement in speech recognition accuracy. Almost all modern studies follow this approach with video data recorded at the standard speed of 25 frames per second. The choice of this recording speed is easily explained, since the vast majority of existing audio–visual databases were recorded at this rate. It should be noted, however, that 25 frames per second is a worldwide standard in many areas and was never specifically derived for speech recognition tasks. The main purpose of this study is to investigate the effect of high-speed video data (up to 200 frames per second) on speech recognition accuracy, and to find out whether a high-speed video camera makes speech recognition systems more robust to acoustic noise. To this end, we recorded a database of audio–visual Russian speech with high-speed video recordings, consisting of recordings of 20 speakers, each pronouncing 200 phrases of continuous Russian speech. Experiments performed on this database showed an improvement in absolute speech recognition rate of up to 3.10%. We also show that using the high-speed camera at 200 fps achieves better recognition results under various acoustically noisy conditions (signal-to-noise ratio varied between 40 and 0 dB) with different types of noise (e.g. white noise, babble noise).
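
The noisy test conditions described above (white or babble noise at SNRs from 40 dB down to 0 dB) correspond to a standard signal-mixing step: the noise is scaled so that 10·log10(P_speech/P_noise) equals the target SNR. The following Python/NumPy snippet is a minimal illustrative sketch of how such material can be generated; the function name `mix_at_snr` and the synthetic signals are our own assumptions for illustration, not the authors' actual experimental pipeline.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech` so that the mixture has the requested SNR.

    SNR(dB) = 10 * log10(P_speech / P_noise), with P the mean signal power.
    """
    # Loop (tile) the noise if it is shorter than the speech segment.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    p_speech = np.mean(speech.astype(np.float64) ** 2)
    p_noise = np.mean(noise.astype(np.float64) ** 2)

    # Scale the noise so its power becomes p_speech / 10**(snr_db / 10).
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# Example: degrade a stand-in "clean" signal with white noise at SNR
# levels in the range mentioned in the abstract (40 dB down to 0 dB).
rng = np.random.default_rng(0)
clean = rng.standard_normal(16_000)   # placeholder for 1 s of 16 kHz speech
white = rng.standard_normal(16_000)
noisy = {snr: mix_at_snr(clean, white, snr) for snr in (40, 30, 20, 10, 0)}
```

With real data, actual speech and babble-noise recordings would replace the synthetic arrays; the mixing step itself is unchanged.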

Metadata
Title
Multimodal speech recognition: increasing accuracy using high speed video data
Authors
Denis Ivanko
Alexey Karpov
Dmitrii Fedotov
Irina Kipyatkova
Dmitry Ryumin
Dmitriy Ivanko
Wolfgang Minker
Milos Zelezny
Publication date
01 August 2018
Publisher
Springer International Publishing
Published in
Journal on Multimodal User Interfaces / Issue 4/2018
Print ISSN: 1783-7677
Electronic ISSN: 1783-8738
DOI
https://doi.org/10.1007/s12193-018-0267-1
