Similarity measurement between speech signals aims to calculate the degree of similarity from acoustic features, a task that has received growing interest because of the large volumes of multimedia information now being processed. However, dynamic properties of speech signals, such as varying silence segments and the time warping factor, make this measurement challenging. This manuscript extends our earlier research on adaptive-framing-based similarity measurement between speech signals using a Kalman filter. Silence removal is enhanced by integrating multiple features for detecting voiced and unvoiced speech segments. The adaptive frame size measurement is improved by using the acceleration/deceleration phenomenon of linear object motion. A dominant feature set represents the speech signals, together with pre-calculated model parameters that are set by offline tuning of the Kalman filter. Performance is evaluated on additional datasets to assess the impact of the proposed model and the silence removal approach on time-warped speech similarity measurement. Detailed statistical results indicate an overall accuracy improvement from 91 to 98%, demonstrating the advantage of the extended approach over our previous work on time-warped continuous speech similarity measurement.
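The abstract does not say which features drive the silence removal step, so the following is only a minimal sketch under stated assumptions: it uses short-time energy and zero-crossing rate, two features commonly paired for voiced/unvoiced detection, and it is not the authors' method. The function names, the frame length of 160 samples, and all thresholds are hypothetical values chosen for illustration.

```python
import math


def frame_features(frame):
    """Return (short-time energy, zero-crossing rate) for one frame."""
    energy = sum(x * x for x in frame) / len(frame)
    # Count sign changes between consecutive samples.
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / (len(frame) - 1)
    return energy, zcr


def remove_silence(signal, frame_len=160, energy_thr=1e-3,
                   zcr_thr=0.3, noise_floor=1e-5):
    """Keep frames that look voiced or unvoiced; drop the rest as silence.

    All thresholds are illustrative assumptions, not values from the paper.
    Trailing samples that do not fill a whole frame are discarded.
    """
    kept = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy, zcr = frame_features(frame)
        voiced = energy >= energy_thr                        # high energy
        unvoiced = zcr >= zcr_thr and energy >= noise_floor  # noisy, low energy
        if voiced or unvoiced:
            kept.extend(frame)
    return kept


# Usage: a 200 Hz tone sampled at 8 kHz, padded with digital silence.
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(320)]
signal = silence + tone + silence
trimmed = remove_silence(signal)
```

Combining two features lets low-energy but rapidly oscillating frames (typical of unvoiced sounds such as fricatives) survive, while an energy floor keeps near-zero background noise from being mistaken for speech. A real system would tune these thresholds per recording condition.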
- Adaptive framing based similarity measurement between time warped speech signals using Kalman filter
- Springer US