
Neural Processing Letters

Publication date: 15-07-2022

Author Profiling in Code-Mixed WhatsApp Messages Using Stacked Convolution Networks and Contextualized Embedding Based Text Augmentation

Authors: V. Sharmila Devi, S. Kannimuthu

Published in: Neural Processing Letters | Issue 1/2023


Abstract

People increasingly communicate through social media rather than traditional phone calls and SMS, and WhatsApp is one of the most popular messaging applications in India. Author profiling is the task of identifying the demographic features of authors from their writing on social media. It is useful in applications such as forensics, security and marketing, and it helps to identify fake profiles. This paper focuses on identifying socio-demographic traits of authors, namely gender, age group, marital status and education status, by analysing their WhatsApp messages in code-mixed Tamil. Although many studies have addressed author profiling in English and other resource-rich languages, research on Indian languages is still nascent. This study is the first author profiling task for code-mixed Tamil on WhatsApp. As part of the study, we created a benchmark WhatsApp dataset in code-mixed Tamil for developing the author profiling system. We propose a stacked Convolutional Neural Network (CNN) combined with k-max pooling and a Bidirectional Long Short-Term Memory (BiLSTM) network to enhance the classification performance of the CNN. Multiple experiments were conducted to demonstrate the effectiveness of the proposed model, including comparisons against existing models under diverse parameter settings. We also incorporate focal loss and contextual-embedding-based data augmentation to handle data imbalance. The proposed model outperforms state-of-the-art deep learning models.
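Two of the building blocks named in the abstract, k-max pooling and focal loss, can be illustrated in a few lines. The following is a minimal NumPy sketch and not the authors' implementation: the function names, array shapes and the default `gamma=2.0` are assumptions made for illustration, and the paper's actual architecture and hyperparameters are not given on this page.

```python
import numpy as np


def k_max_pooling(features, k):
    """Keep the k largest values per channel, preserving sequence order.

    features: array of shape (time, channels).
    Returns an array of shape (k, channels).
    """
    # Indices of the k largest activations in each channel (unordered).
    top_idx = np.argpartition(features, -k, axis=0)[-k:]
    # Sort the indices so the kept values stay in their original order.
    top_idx = np.sort(top_idx, axis=0)
    return np.take_along_axis(features, top_idx, axis=0)


def focal_loss(probs, labels, gamma=2.0, alpha=None):
    """Multi-class focal loss: mean of -alpha_t * (1 - p_t)^gamma * log(p_t).

    probs:  (n_samples, n_classes) predicted class probabilities.
    labels: (n_samples,) integer class ids.
    alpha:  optional per-class weights of shape (n_classes,) (assumed form).
    """
    eps = 1e-12  # guards against log(0)
    # Probability assigned to the true class of each sample.
    p_t = probs[np.arange(len(labels)), labels]
    weights = np.ones_like(p_t) if alpha is None else np.asarray(alpha)[labels]
    # The (1 - p_t)^gamma factor down-weights easy, well-classified samples.
    return float(np.mean(-weights * (1.0 - p_t) ** gamma * np.log(p_t + eps)))


if __name__ == "__main__":
    feats = np.array([[1.0, 9.0], [5.0, 2.0], [3.0, 7.0], [4.0, 0.0]])
    print(k_max_pooling(feats, k=2))

    probs = np.array([[0.9, 0.1], [0.2, 0.8]])
    labels = np.array([0, 1])
    print(focal_loss(probs, labels))
```

With `gamma=0` and no `alpha`, the focal loss above reduces to ordinary cross-entropy. Larger `gamma` values shift the training signal towards hard examples, which is why the loss is commonly used for imbalanced classes.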



Title: Author Profiling in Code-Mixed WhatsApp Messages Using Stacked Convolution Networks and Contextualized Embedding Based Text Augmentation
Authors: V. Sharmila Devi, S. Kannimuthu
Publication date: 15-07-2022
Publisher: Springer US
Published in: Neural Processing Letters / Issue 1/2023
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI: https://doi.org/10.1007/s11063-022-10898-3
