Published in: International Journal of Speech Technology 2/2014

01-06-2014

Recent developments in spoken term detection: a survey

Authors: Anupam Mandal, K. R. Prasanna Kumar, Pabitra Mitra


Abstract

Spoken term detection (STD) provides an efficient means for content-based indexing of speech. However, achieving high detection performance and higher speed, detecting out-of-vocabulary (OOV) words, and performing STD on low-resource languages remain major research challenges. This paper provides a comprehensive survey of the important approaches to STD and of how they address these challenges. The review classifies the approaches, highlights their advantages and limitations, and discusses their context of usage. It also analyzes the approaches in terms of detection accuracy, storage requirements, and execution time. The paper summarizes the tools and speech corpora used in the different approaches. Finally, it concludes with future research directions in this area.

Metadata
Title
Recent developments in spoken term detection: a survey
Authors
Anupam Mandal
K. R. Prasanna Kumar
Pabitra Mitra
Publication date
01-06-2014
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 2/2014
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-013-9217-1
