Published in: International Journal of Speech Technology 1/2019

25.01.2019

Improved dynamic match phone lattice search for Persian spoken term detection system in online and offline applications

Authors: Shima Tabibian, Ahmad Akbari, Babak Nasersharif

Abstract

Spoken term detection (STD) refers to discovering all occurrences of a given term in a set of speech utterances. One well-known approach to STD is phone lattice search (PLS), which produces a phone-based lattice of the speech utterances. Since the accuracy of the phone recognizer affects the accuracy of the STD system, the PLS approach uses the minimum edit distance (MED) measure to compensate for phone recognizer errors. While this measure increases the detection rate, it also raises the false alarm rate. In this paper, we take the PLS approach as the baseline and add Viterbi scoring and the Jaro-Winkler similarity measure in order to decrease the false alarm rate. Since the proposed approach uses more techniques than the baseline, the search speed may decrease. To overcome this problem, we use lattice pruning in online applications and indexing techniques, such as a depth-first search algorithm, in offline applications to increase the search speed. We report experimental results for monophone-based and triphone-based STD systems. The results indicate that the triphone-based STD system improves performance by about 2% in comparison with the monophone-based STD system. Moreover, with triphone-based models, the proposed approach, which combines the MED measure, Viterbi scores and the Jaro-Winkler similarity measure, improves accuracy by about 17% over the method that uses the MED measure alone.
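
To make the two string measures named in the abstract concrete, the following is a minimal Python sketch of a unit-cost minimum edit distance and of the standard Jaro-Winkler similarity, both applied to phone sequences. It is an illustration only and not the authors' implementation: the function names, the unit edit costs, the prefix scale of 0.1 and the prefix cap of 4 are assumptions (the last two are the commonly used Winkler defaults), and the abstract does not say how the paper weights edits or combines these measures with Viterbi scores.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences (unit costs assumed)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # delete pa
                            curr[j - 1] + 1,      # insert pb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaro(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaro similarity in [0, 1]; returns 0.0 if either sequence is empty."""
    if not a or not b:
        return 0.0
    # Two symbols match only if they are equal and close enough in position.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, pa in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == pa:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched symbols that appear in a different order.
    a_seq = [p for p, m in zip(a, a_matched) if m]
    b_seq = [p for p, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a)
            + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: Sequence[str], b: Sequence[str],
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for sequences that share a common prefix."""
    sim = jaro(a, b)
    prefix = 0
    for pa, pb in zip(a, b):
        if pa != pb or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1.0 - sim)


if __name__ == "__main__":
    # Hypothetical phone strings: a query term and a lattice path with one error.
    target = ["s", "a", "l", "a", "m"]
    hypothesis = ["s", "a", "l", "o", "m"]
    print(edit_distance(target, hypothesis))
    print(round(jaro_winkler(target, hypothesis), 3))
```

The edit distance counts how many phone errors must be tolerated to accept a lattice path, which is why loosening it raises both detections and false alarms. A similarity score such as Jaro-Winkler gives a graded value that can be thresholded to reject weak matches.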

Metadata
Title
Improved dynamic match phone lattice search for Persian spoken term detection system in online and offline applications
Authors
Shima Tabibian
Ahmad Akbari
Babak Nasersharif
Publication date
25.01.2019
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 1/2019
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-019-09594-w
