Characterization of Written Languages Using Structural Features from Common Corpora

  • Conference paper
Complex Networks VIII (CompleNet 2017)

Part of the book series: Springer Proceedings in Complexity (SPCOM)

Abstract

For more than 5,000 years, we have been communicating using some form of written language. For many scholars, the advent of written language contributed to the development of societies because it enabled knowledge to be passed to future generations without considerable loss of information or ambiguity. Today, it is estimated that we use about 7,000 languages to communicate, but the majority of these do not have a written form; in fact, there are no reliable estimates of how many written languages exist today. There are three main families of written languages: Afro-Asiatic, Indo-European, and Turkic. These families are based on historical family trees. However, with the amount of data available today, one can start looking at language classification using regularities extracted from corpora of text. This paper focuses on regularities of 10 languages from the mentioned families. To find features for these languages, we use (1) Heaps’ law, which models the number of distinct words in a corpus as a function of the total number of words in the same corpus, and (2) structural properties of networks created from word co-occurrence in large corpora for different languages. Using clustering approaches, we show that, despite differences arising from years of use in separate countries, the clustering still seems to respect some historical organization of families.
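The two feature sources named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the Heaps' law form V(n) = K·n^β is fitted here by ordinary least squares in log-log space (an assumed fitting method), and the co-occurrence network links only immediately adjacent words within a sentence (an assumed window, since the abstract does not specify one).

```python
import math
from collections import defaultdict


def heaps_curve(tokens):
    """Return (n, V(n)) pairs: distinct words seen after the first n tokens."""
    seen = set()
    curve = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((n, len(seen)))
    return curve


def fit_heaps(curve):
    """Fit log V = log K + beta * log n by least squares; return (K, beta).

    Assumed fitting method for illustration only.
    """
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(my - beta * mx)
    return k, beta


def cooccurrence_network(sentences):
    """Undirected graph linking words that are adjacent in a sentence.

    Assumed window of one word; words with no neighbour are not added.
    """
    adj = defaultdict(set)
    for sent in sentences:
        for a, b in zip(sent, sent[1:]):
            if a != b:  # skip self-loops from repeated words
                adj[a].add(b)
                adj[b].add(a)
    return adj


def network_features(adj):
    """A few simple structural features of the graph."""
    n_nodes = len(adj)
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    avg_degree = 2 * n_edges / n_nodes if n_nodes else 0.0
    return {"nodes": n_nodes, "edges": n_edges, "avg_degree": avg_degree}
```

Under these assumptions, each language would be summarized by a small feature vector (for example K, β, and the network measures), which could then be passed to any standard clustering method.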

Notes

  1. https://www.ethnologue.com.

  2. We again decided to show the charts only for the case with the Heaps’ parameters due to space restrictions.

Author information

Corresponding author

Correspondence to Younis Al Rozz.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Al Rozz, Y., Hamoodat, H., Menezes, R. (2017). Characterization of Written Languages Using Structural Features from Common Corpora. In: Gonçalves, B., Menezes, R., Sinatra, R., Zlatic, V. (eds) Complex Networks VIII. CompleNet 2017. Springer Proceedings in Complexity. Springer, Cham. https://doi.org/10.1007/978-3-319-54241-6_14