Published in: Neural Processing Letters 1/2023

15-07-2022

Author Profiling in Code-Mixed WhatsApp Messages Using Stacked Convolution Networks and Contextualized Embedding Based Text Augmentation

Authors: V. Sharmila Devi, S. Kannimuthu


Abstract

Social media is increasingly replacing traditional phone calls and SMS as a means of communication, and WhatsApp is one of the most popular messaging applications in India. Author profiling is the identification of demographic features of authors on social media. It is useful in applications such as forensics, security and marketing, and it helps to identify fake profiles. This paper focuses on identifying socio-demographic traits of authors, namely gender, age group, marital status and education status, by analysing their WhatsApp messages in code-mixed Tamil. Although many studies have addressed author profiling in English and other resource-rich languages, research on Indian languages is still nascent. This study is the first author profiling task for code-mixed Tamil on WhatsApp. As part of the study, we created a benchmark WhatsApp dataset in code-mixed Tamil for developing the author profiling system. We propose a stacked Convolutional Neural Network (CNN) combined with k-max pooling and Bidirectional Long Short-Term Memory (BiLSTM) to enhance the classification performance of the CNN. Multiple experiments were conducted to demonstrate the effectiveness of the proposed model, including comparisons against existing models under diverse parameter settings. We also incorporated focal loss and contextual-embedding-based data augmentation to handle data imbalance. The proposed model outperforms state-of-the-art deep learning models.
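Two of the building blocks named in the abstract, k-max pooling and focal loss, can be illustrated with a short sketch. The abstract does not give the authors' formulations, so the code below assumes the commonly used definitions: k-max pooling keeps the k largest activations of a feature map in their original order, and focal loss scales the cross-entropy term by (1 - p_t)^gamma. The function names, the defaults for `gamma` and `alpha`, and the toy inputs are illustrative assumptions, not values from the paper.

```python
import math


def k_max_pooling(sequence, k):
    """Keep the k largest values of one feature map, in their original order."""
    # Indices of the k largest activations (assumed definition of k-max pooling).
    top = sorted(range(len(sequence)), key=lambda i: sequence[i], reverse=True)[:k]
    # Re-sort the indices so the surviving values keep their sequence order.
    return [sequence[i] for i in sorted(top)]


def focal_loss(probs, labels, gamma=2.0, alpha=1.0):
    """Mean focal loss: -alpha * (1 - p_t)**gamma * log(p_t).

    probs  : list of per-class probability lists, one per example
    labels : list of true class indices
    gamma, alpha : hypothetical defaults, not taken from the paper
    """
    total = 0.0
    for row, label in zip(probs, labels):
        # Probability assigned to the true class, clipped to avoid log(0).
        p_t = min(max(row[label], 1e-7), 1.0)
        total += -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(labels)


if __name__ == "__main__":
    print(k_max_pooling([1, 5, 3, 4], 2))    # [5, 4]
    print(focal_loss([[0.9, 0.1]], [0]))     # confidently correct example
    print(focal_loss([[0.3, 0.7]], [0]))     # misclassified example
```

With `gamma=0` and `alpha=1` the loss reduces to ordinary cross-entropy. Larger `gamma` values shrink the contribution of examples the model already classifies confidently, which is why focal loss is often used when some classes are much rarer than others.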


Metadata
Title
Author Profiling in Code-Mixed WhatsApp Messages Using Stacked Convolution Networks and Contextualized Embedding Based Text Augmentation
Authors
V. Sharmila Devi
S. Kannimuthu
Publication date
15-07-2022
Publisher
Springer US
Published in
Neural Processing Letters / Issue 1/2023
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI
https://doi.org/10.1007/s11063-022-10898-3
