

2021 | OriginalPaper | Chapter

Automatic Understanding of Code Mixed Social Media Text: A State of the Art

Authors: Neetika, Vishal Goyal, Simpel Rani

Published in: Advances in Information Communication Technology and Computing

Publisher: Springer Singapore


Abstract

Social media content is often described as noisy or informal text because of its zigzag conversational patterns. People do not always use Unicode; rather, they mix multiple languages. Hence, processing code-mixed data poses computational challenges. Over the past decades, social media content and its analysis have gained momentum worldwide. In parallel, the pace of research on Indian languages has been commendable. In India, social media users come from different religions, regions, subdivisions and cultures. The main aim of this paper is to shed light on the work done on Indian languages in the context of code-mixed social media text. Research in this field has passed through various milestones, from basic natural language processing tasks to deep learning. This paper focuses on work on Indian languages with respect to language identification, normalization and POS tagging. An effort has been made to discuss the tools, techniques and corpora used by researchers for different Indian languages. In the digital age, there is an abundance of tools and APIs for extracting code-mixed text. Still, there is a paucity of publicly available data for analysis. The need of the hour appears to be a move toward deep learning and wider public availability of code-mixed corpora.



Title: Automatic Understanding of Code Mixed Social Media Text: A State of the Art
Authors: Neetika, Vishal Goyal, Simpel Rani
Publisher: Springer Singapore
Book: Advances in Information Communication Technology and Computing
Print ISBN: 978-981-15-5420-9

Electronic ISBN: 978-981-15-5421-6

Copyright Year: 2021
DOI: https://doi.org/10.1007/978-981-15-5421-6_10
