Abstract
For more than 5,000 years, humans have communicated using some form of written language. Many scholars argue that the advent of written language contributed to the development of societies because it enabled knowledge to be passed to future generations without considerable loss of information or ambiguity. Today, an estimated 7,000 languages are in use, but the majority of these have no written form; in fact, there are no reliable estimates of how many written languages exist today. There are three main families of written languages: Afro-Asiatic, Indo-European, and Turkic. These families are defined by historical family trees. However, with the amount of data available today, one can start to classify languages using regularities extracted from text corpora. This paper focuses on regularities of 10 languages from these families. To find features for these languages we use (1) Heaps’ law, which models the number of distinct words in a corpus as a function of the total number of words in the same corpus, and (2) structural properties of networks built from word co-occurrence in large corpora for different languages. Using clustering approaches, we show that, despite differences accumulated over years of use in separate countries, the resulting clusters still seem to respect some of the historical organization of the families.
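The two kinds of features named in the abstract can be illustrated with a minimal sketch. Heaps’ law is usually written as V(n) = K·n^β, where V(n) is the number of distinct words among the first n tokens. The code below is not the authors' implementation: the function names, the whitespace-style token list, the log-log least-squares fit, and the choice to link only adjacent words are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def heaps_curve(tokens):
    """Return (n, V(n)) pairs: distinct words seen after the first n tokens."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((n, len(seen)))
    return curve


def fit_heaps(curve):
    """Estimate (K, beta) in V(n) = K * n**beta.

    Assumption: ordinary least squares on log V = log K + beta * log n.
    Needs at least two points with different n.
    """
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


def cooccurrence_network(tokens):
    """Build an undirected weighted word network.

    Assumption: two words are linked when they are adjacent in the text;
    the edge weight counts how often that happens. Self-loops are skipped.
    """
    edges = defaultdict(int)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            edges[tuple(sorted((a, b)))] += 1
    return dict(edges)


def average_degree(edges):
    """Mean number of distinct neighbours per word (2E / N)."""
    nodes = {word for pair in edges for word in pair}
    return 2 * len(edges) / len(nodes) if nodes else 0.0


if __name__ == "__main__":
    text = "the cat sat on the mat and the dog sat on the cat"
    tokens = text.split()
    k, beta = fit_heaps(heaps_curve(tokens))
    network = cooccurrence_network(tokens)
    print(f"K={k:.3f} beta={beta:.3f} avg_degree={average_degree(network):.3f}")
```

In a setup like this, each language would be summarized by a small feature vector (for example K, β, and a few network statistics), and those vectors would then be passed to a clustering method. Which network measures and which clustering algorithm to use are left open here.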
Notes
- 1.
- 2. We again decided to show the charts only for the case with the Heaps’ parameters due to space restrictions.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Al Rozz, Y., Hamoodat, H., Menezes, R. (2017). Characterization of Written Languages Using Structural Features from Common Corpora. In: Gonçalves, B., Menezes, R., Sinatra, R., Zlatic, V. (eds) Complex Networks VIII. CompleNet 2017. Springer Proceedings in Complexity. Springer, Cham. https://doi.org/10.1007/978-3-319-54241-6_14
DOI: https://doi.org/10.1007/978-3-319-54241-6_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54240-9
Online ISBN: 978-3-319-54241-6
eBook Packages: Physics and Astronomy (R0)