
2018 | Original Paper | Book chapter

First Insight into the Processing of the Language Consulting Center Data

Written by: Zbyněk Zajíc, Lucie Zajícová, Josef V. Psutka, Petr Salajka, Jaromír Novotný, Aleš Pražák, Luděk Müller

Published in: Speech and Computer

Publisher: Springer International Publishing


Abstract

In this paper, we describe the initial stages of the project “Access to a Linguistically Structured Database of Enquiries from the Language Consulting Center”. The project aims to improve access to the large archive of mostly telephone conversations collected continuously by the Institute of the Czech Language. The main goal is to open up the unique Czech data acquired from queries to the Language Consulting Center and to build a semi-automatic system that facilitates searching and categorizing these queries. For this purpose, an Automatic Speech Recognizer (ASR) and language processing methods are being designed. Unlike common speech, the vocabulary used in such queries contains many unusual words (e.g. linguistic terms). To train the ASR system, it is necessary to manually transcribe a large amount of speech data, identify the appropriate vocabulary, and obtain relevant text for language modeling. We describe the proposed telephone system for recording new data and the baseline speech recognition results on these data. We also describe the first topic detection experiments on these data, which aimed to discover what the data contain and how to preprocess them.
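To make the vocabulary identification step more concrete, the following is a minimal sketch, not the authors' method. It assumes that manual transcripts are available as plain strings and that a general-purpose word list exists; the function names `tokenize` and `domain_terms`, the `min_count` threshold, and the toy data are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of Unicode letters as tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def domain_terms(transcripts, general_vocab, min_count=2):
    """Return words that occur at least min_count times in the transcripts
    but are missing from the general vocabulary (assumed to be lowercase).

    The result is ordered by descending frequency, then alphabetically.
    """
    counts = Counter(tok for text in transcripts for tok in tokenize(text))
    candidates = [
        word for word, count in counts.items()
        if count >= min_count and word not in general_vocab
    ]
    return sorted(candidates, key=lambda word: (-counts[word], word))


if __name__ == "__main__":
    # Toy data standing in for transcribed enquiries (hypothetical).
    transcripts = [
        "Is the attribute restrictive or nonrestrictive here?",
        "A nonrestrictive attribute is set off by a comma.",
        "Where do I put the comma?",
    ]
    general_vocab = {"is", "the", "or", "here", "a", "set", "off", "by",
                     "where", "do", "i", "put"}
    print(domain_terms(transcripts, general_vocab))
```

Words returned by such a filter would be candidates for addition to the recognizer's lexicon; in practice the threshold and the reference word list would need tuning against real transcripts.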


Metadata
Title
First Insight into the Processing of the Language Consulting Center Data
Authors
Zbyněk Zajíc
Lucie Zajícová
Josef V. Psutka
Petr Salajka
Jaromír Novotný
Aleš Pražák
Luděk Müller
Copyright year
2018
DOI
https://doi.org/10.1007/978-3-319-99579-3_79