2021 | OriginalPaper | Chapter

Automatic Understanding of Code Mixed Social Media Text: A State of the Art

Authors: Neetika, Vishal Goyal, Simpel Rani

Published in: Advances in Information Communication Technology and Computing

Publisher: Springer Singapore

Abstract

Social media content is often described as noisy or informal text because of its zigzag conversational patterns. People do not always write in Unicode; rather, they mix multiple languages. Hence, the processing of code mixed data poses computational challenges. Over the past decades, social media content and its analysis have gained momentum worldwide. In parallel, the pace of research on Indian languages is also commendable. In India, social media users come from different religions, regions, subdivisions and cultures. The main aim of this paper is to shed light on the work done on Indian languages in the context of code mixed social media text. Research in this field has passed through various milestones, from basic natural language processing tasks to deep learning. This paper focuses on the work done on Indian languages with respect to language identification, normalization and POS tagging. It discusses the tools, techniques and corpora used by researchers for different Indian languages. In the digital age, an abundance of tools and APIs is available for extracting code mixed text. Still, there is a paucity of public data available for analysis. The need of the hour appears to be a move toward deep learning and wider public availability of code mixed corpora.

Metadata
Title
Automatic Understanding of Code Mixed Social Media Text: A State of the Art
Authors
Neetika
Vishal Goyal
Simpel Rani
Copyright Year
2021
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-15-5421-6_10