Skip to main content
Erschienen in: Journal of Combinatorial Optimization 4/2019

21.09.2018

Speech corpora subset selection based on time-continuous utterances features

verfasst von: Luobing Dong, Qiumin Guo, Weili Wu

Erschienen in: Journal of Combinatorial Optimization | Ausgabe 4/2019

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

An extremely large corpus with rich acoustic properties is very useful for training new speech recognition and semantic analysis models. However, it also brings some troubles, because the complexity of the acoustic model training usually depends on the size of the corpora. In this paper, we propose a corpora subset selection method considering data contributions from time-continuous utterances and multi-label constraints that are not limited to single-scale metrics. Our goal is to extract a sufficiently rich subset from large corpora under certain meaningful constraints. In addition, taking into account the uniform coverage of the target subset and its internal property, we design a constrained subset selection algorithm. Specifically, a fast subset selection algorithm is designed by introducing n-grams models. Experiments are implemented based on very large real speech corpora database and validate the effectiveness of our method.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literatur
Zurück zum Zitat Banko M, Brill E (2001) Scaling to very very large corpora for natural language disambiguation. In: Proceedings of the 39th annual meeting on association for computational linguistics—ACL’01. Toulouse, France, pp 26–33 Banko M, Brill E (2001) Scaling to very very large corpora for natural language disambiguation. In: Proceedings of the 39th annual meeting on association for computational linguistics—ACL’01. Toulouse, France, pp 26–33
Zurück zum Zitat Boleda G et al (2006) CUCWeb: a Catalan corpus built from the Web. In: Wac’06 processing of the 2nd international workshop on web as corpus. April. Trento, Italy, pp 19–26 Boleda G et al (2006) CUCWeb: a Catalan corpus built from the Web. In: Wac’06 processing of the 2nd international workshop on web as corpus. April. Trento, Italy, pp 19–26
Zurück zum Zitat Braunschweiler N, Buchholz S (2011) Automatic sentence selection from speech corpora including diverse speech for improved HMM-TTS synthesis quality. In: Proceedings of the annual conference of the international speech communication association, Interspeech. August. Florence, Italy, pp 1821–1824 Braunschweiler N, Buchholz S (2011) Automatic sentence selection from speech corpora including diverse speech for improved HMM-TTS synthesis quality. In: Proceedings of the annual conference of the international speech communication association, Interspeech. August. Florence, Italy, pp 1821–1824
Zurück zum Zitat Brown PF et al (1992) Class-based n-gram models of natural language. Comput Linguist 4(18):467–479 Brown PF et al (1992) Class-based n-gram models of natural language. Comput Linguist 4(18):467–479
Zurück zum Zitat Clarke CLA et al (2002) The impact of corpus size on question answering performance. In: Proceedings of the 25th annual international ACM SIGIR conference on research development on information retrieval, pp 369–370 Clarke CLA et al (2002) The impact of corpus size on question answering performance. In: Proceedings of the 25th annual international ACM SIGIR conference on research development on information retrieval, pp 369–370
Zurück zum Zitat Curran JR, Osborne M (2002) A very very large corpus doesn’t always yield reliable estimates. In: Proceedings of the 6th conference on natural language learning—COLING-02. Vol. 20. Stroudsburg, PA, USA, pp 1–6 Curran JR, Osborne M (2002) A very very large corpus doesn’t always yield reliable estimates. In: Proceedings of the 6th conference on natural language learning—COLING-02. Vol. 20. Stroudsburg, PA, USA, pp 1–6
Zurück zum Zitat Drouin P (2004) Detection of domain specific terminology using corpora comparison. In: Proceedings of the 4th international conference on language resources and evaluation. Lisbon, Portugal, pp 79–82 Drouin P (2004) Detection of domain specific terminology using corpora comparison. In: Proceedings of the 4th international conference on language resources and evaluation. Lisbon, Portugal, pp 79–82
Zurück zum Zitat Fujishige S (2005) Submodular functions and optimization, vol 58. C. Elsevier, Amsterdam, pp 315–363MATH Fujishige S (2005) Submodular functions and optimization, vol 58. C. Elsevier, Amsterdam, pp 315–363MATH
Zurück zum Zitat Glavas G, Ponzetto SP (2017) Dual tensor model for detecting asymmetric lexico-semantic relations. In: Proceedings of the 2017 conference on empirical methods in natural language processing. September. Copenhagen, Denmark, pp 1757–1767 Glavas G, Ponzetto SP (2017) Dual tensor model for detecting asymmetric lexico-semantic relations. In: Proceedings of the 2017 conference on empirical methods in natural language processing. September. Copenhagen, Denmark, pp 1757–1767
Zurück zum Zitat Gómez-Adorno H et al (2018) Document embeddings learned on various types of n-grams for cross-topic authorship attribution. In: Computing September, pp 1–16 Gómez-Adorno H et al (2018) Document embeddings learned on various types of n-grams for cross-topic authorship attribution. In: Computing September, pp 1–16
Zurück zum Zitat King S, Bartels C, Bilmes J (2005) SVitchboard 1: small vocabulary tasks from switchboard 1. In: Ninth European conference on speech communication and technology. Lisbon, Portugal, pp 2–5 King S, Bartels C, Bilmes J (2005) SVitchboard 1: small vocabulary tasks from switchboard 1. In: Ninth European conference on speech communication and technology. Lisbon, Portugal, pp 2–5
Zurück zum Zitat Kumar VV, Satyanarayana N (2017) Probability of semantic similarity and N-grams pattern learning for data classification. In: Global journal of computer science and technology, pp 1–5 Kumar VV, Satyanarayana N (2017) Probability of semantic similarity and N-grams pattern learning for data classification. In: Global journal of computer science and technology, pp 1–5
Zurück zum Zitat Lin H, Bilmes J (2011) Optimal selection of limited vocabulary speech corpora. In: Proceedings of the annual conference of the international speech communication association, interspeech, Florence, Italy, pp 1489–1492 Lin H, Bilmes J (2011) Optimal selection of limited vocabulary speech corpora. In: Proceedings of the annual conference of the international speech communication association, interspeech, Florence, Italy, pp 1489–1492
Zurück zum Zitat Liu Y et al (2017) SVitchboard II and FiSVer I: high-quality limited-complexity corpora of conversational English speech. In: Proceedings of the annual conference of the international speech communication association, interspeech, vol 42, pp 122–142 Liu Y et al (2017) SVitchboard II and FiSVer I: high-quality limited-complexity corpora of conversational English speech. In: Proceedings of the annual conference of the international speech communication association, interspeech, vol 42, pp 122–142
Zurück zum Zitat Matthew S (2018) An extensible schema for building large weakly-labeled semantic corpora. Proced Comput Sci 128:65–71CrossRef Matthew S (2018) An extensible schema for building large weakly-labeled semantic corpora. Proced Comput Sci 128:65–71CrossRef
Zurück zum Zitat McDonald G, Macdonald C, Ounis I (1999) Finding parts in very large corpora, vol June, College Park, pp 57–64 McDonald G, Macdonald C, Ounis I (1999) Finding parts in very large corpora, vol June, College Park, pp 57–64
Zurück zum Zitat Ogren PV et al (2006) Building and evaluating annotated corpora for medical NLP systems. In: AMIA annual symposium proceedings/AMIA symposium. AMIA symposium 36.2003, p 1050 Ogren PV et al (2006) Building and evaluating annotated corpora for medical NLP systems. In: AMIA annual symposium proceedings/AMIA symposium. AMIA symposium 36.2003, p 1050
Zurück zum Zitat Peris Álvaro, Chinea-Rios Mara, Casacuberta Francisco (2017) Neural networks classifier for data selection in statistical machine translation. Prague Bull Math Linguist 108(1):283–294CrossRef Peris Álvaro, Chinea-Rios Mara, Casacuberta Francisco (2017) Neural networks classifier for data selection in statistical machine translation. Prague Bull Math Linguist 108(1):283–294CrossRef
Zurück zum Zitat Richmond K, Hoole P, King S (2011) Announcing the electromagnetic articulography (day 1) subset of the mngu0 articulatory corpus. In: Proceedings of the annual conference of the international speech communication association, interspeech. August. Florence, Italy, pp 1505–1508 Richmond K, Hoole P, King S (2011) Announcing the electromagnetic articulography (day 1) subset of the mngu0 articulatory corpus. In: Proceedings of the annual conference of the international speech communication association, interspeech. August. Florence, Italy, pp 1505–1508
Zurück zum Zitat Schwenk H, Gauvain J-L (2005) Training neural network language models on very large corpora. In: Proceedings of the conference on human language technology and empirical methods in natural language processing—HLT’05. Vancouver, B.C., Canada, pp 201–208 Schwenk H, Gauvain J-L (2005) Training neural network language models on very large corpora. In: Proceedings of the conference on human language technology and empirical methods in natural language processing—HLT’05. Vancouver, B.C., Canada, pp 201–208
Zurück zum Zitat Walter L, Radauer A, Moehrle MG (2017) The beauty of brimstone butterfly: novelty of 290 patents identified by near environment analysis based on text mining. Scientometrics 111(1):103–115CrossRef Walter L, Radauer A, Moehrle MG (2017) The beauty of brimstone butterfly: novelty of 290 patents identified by near environment analysis based on text mining. Scientometrics 111(1):103–115CrossRef
Metadaten
Titel
Speech corpora subset selection based on time-continuous utterances features
verfasst von
Luobing Dong
Qiumin Guo
Weili Wu
Publikationsdatum
21.09.2018
Verlag
Springer US
Erschienen in
Journal of Combinatorial Optimization / Ausgabe 4/2019
Print ISSN: 1382-6905
Elektronische ISSN: 1573-2886
DOI
https://doi.org/10.1007/s10878-018-0350-2

Weitere Artikel der Ausgabe 4/2019

Journal of Combinatorial Optimization 4/2019 Zur Ausgabe