2021 | OriginalPaper | Chapter

Identification of Scandinavian Languages from Speech Using Bottleneck Features and X-Vectors

Authors: Petr Cerva, Lukas Mateju, Frantisek Kynych, Jindrich Zdansky, Jan Nouza

Published in: Text, Speech, and Dialogue

Publisher: Springer International Publishing

Abstract

This work deals with the identification of the three main Scandinavian languages (Swedish, Danish, and Norwegian) from spoken data. For this purpose, various state-of-the-art approaches are adopted, compared, and combined, including i-vectors, deep neural networks (DNNs), bottleneck features (BTNs), and x-vectors. The best resulting approaches take advantage of multilingual BTNs and identify the target languages in 5 s speech segments with an error rate of around 1%. They are therefore suited to practical applications such as systems for transcribing Scandinavian TV and radio programs, in which different speakers may use any of the target languages. Within the identification of Norwegian, we also focus on the unexplored sub-task of distinguishing between Bokmål and Nynorsk. Our results show that this problem is much harder, since the two language variants are acoustically very similar: the best error rate achieved in this case is around 20%.
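
The segment-level evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each 5 s segment has already been mapped to a fixed-length embedding (such as an x-vector) and scores it by cosine similarity against per-language mean embeddings. This back end, the function names, and the toy vectors are assumptions made for illustration only; they are not the classifier or the data used in the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def centroids(train):
    """Average the training embeddings of each language (hypothetical back end)."""
    result = {}
    for lang, vectors in train.items():
        dim = len(vectors[0])
        result[lang] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return result

def identify(embedding, cents):
    """Return the language whose centroid is most similar to the segment embedding."""
    return max(cents, key=lambda lang: cosine(embedding, cents[lang]))

def error_rate(test, cents):
    """Fraction of test segments assigned to the wrong language."""
    errors = sum(1 for emb, label in test if identify(emb, cents) != label)
    return errors / len(test)

# Toy 3-dimensional "embeddings"; real x-vectors have hundreds of dimensions.
train = {
    "swedish": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "danish": [[0.1, 1.0, 0.0], [0.2, 0.9, 0.1]],
    "norwegian": [[0.0, 0.1, 1.0], [0.1, 0.2, 0.9]],
}
test = [
    ([0.8, 0.1, 0.1], "swedish"),
    ([0.1, 0.8, 0.2], "danish"),
    ([0.1, 0.1, 0.7], "norwegian"),
    ([0.6, 0.7, 0.0], "swedish"),  # deliberately ambiguous segment
]

cents = centroids(train)
rate = error_rate(test, cents)
print(f"error rate: {rate:.2%}")
```

On this toy data the ambiguous fourth segment is assigned to Danish, so one of the four segments is misclassified. The same counting of wrongly labeled segments underlies figures such as the roughly 1% and 20% error rates quoted in the abstract, although the systems producing them are far more elaborate.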
Metadata
Copyright Year
2021
DOI
https://doi.org/10.1007/978-3-030-83527-9_31
