General framework for mining, processing and storing large amounts of electronic texts for language modeling purposes

Švec, Jan; Lehečka, Jan; Ircing, Pavel; Skorkovská, Lucie; Pražák, Aleš; Vavruška, Jan; Stanislav, Petr; Hoidekr, Jan

doi:10.1007/s10579-013-9246-z


Original Paper
Published: 24 July 2013

Language Resources and Evaluation, Volume 48, pages 227–248 (2014)


Abstract

The paper describes a general framework for mining large amounts of text data from a defined set of Web pages. The acquired data are meant to constitute a corpus for training robust and reliable language models, so the framework also needs to incorporate algorithms for appropriate text processing and duplicate detection in order to secure the quality and consistency of the data. As we expect the resulting corpus to be very large, we have also implemented topic detection algorithms that allow us to automatically select subcorpora for domain-specific language models. The description of the framework architecture and the implemented algorithms is complemented with a detailed evaluation section. It analyses the basic properties of the gathered Czech corpus, which contains more than one billion text tokens collected using the described framework, shows the results of the topic detection methods, and finally describes the design and outcomes of the automatic speech recognition experiments with domain-specific language models estimated from the collected data.


Notes

Note that the number of occurrences of the word “Havel” is divided by a factor of ten so that it is on a scale comparable with the other two examples.
For example, the Unicode standard defines a dedicated character for the ligature “fi”. Such ligatures are replaced with the sequence of characters “f” and “i”.
http://www.cs.hmc.edu/~geoff/ispell.html.
We considered using longer token sequences, but because the processed documents are typically rather short (545 words on average), using higher-order n-grams resulted in severe data sparsity.
Note that assuming the topics are known before the actual broadcast is not unrealistic: the main “themes” of each debate are published on the broadcaster's website beforehand.
Please note that even though our decoder can handle a lexicon with up to one million words (which makes it one of the world’s best in this respect), it is still not able to accommodate all the words occurring in our corpora, not even just those that occurred at least five times (see Fig. 5).
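
The ligature substitution mentioned in the second note can be illustrated with a minimal sketch. It is not the authors' implementation: the table name LIGATURES and the function expand_ligatures are hypothetical, and the sketch assumes that only the Latin ligatures of the Unicode Alphabetic Presentation Forms block (U+FB00 to U+FB06) need to be expanded.

```python
# Hypothetical illustration of ligature expansion; not taken from the paper.
# Maps the Latin ligature characters U+FB00..U+FB06 to plain letter sequences.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature, mapped to "st" here by assumption
    "\ufb06": "st",
}

# str.maketrans accepts a dict whose keys are single characters.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace each ligature character with its plain letter sequence."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(expand_ligatures("\ufb01nal o\ufb03ce"))
```

An explicit table keeps the substitution limited to the listed characters. A broader alternative would be unicodedata.normalize("NFKC", text), which also rewrites many other compatibility characters and may therefore change more of the text than intended.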


Acknowledgements

This work has been supported by a grant of the University of West Bohemia, project No. SGS-2010-054, and by the Grant Agency of the Czech Republic, project No. GAČR P103/12/G084. Access to the MetaCentrum computing facilities, provided under the programme Projects of Large Infrastructure for Research, Development, and Innovations LM2010005 funded by the Ministry of Education, Youth, and Sports of the Czech Republic, is appreciated.

Author information

Authors and Affiliations

Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia, Univerzitní 8, 306 14, Plzeň, Czech Republic
Jan Švec, Jan Lehečka, Pavel Ircing, Lucie Skorkovská, Aleš Pražák, Jan Vavruška, Petr Stanislav & Jan Hoidekr


Corresponding author

Correspondence to Pavel Ircing.


About this article

Cite this article

Švec, J., Lehečka, J., Ircing, P. et al. General framework for mining, processing and storing large amounts of electronic texts for language modeling purposes. Lang Resources & Evaluation 48, 227–248 (2014). https://doi.org/10.1007/s10579-013-9246-z


Issue Date: June 2014
