Published in: Neural Processing Letters 1/2023

15 July 2022

Author Profiling in Code-Mixed WhatsApp Messages Using Stacked Convolution Networks and Contextualized Embedding Based Text Augmentation

Authors: V. Sharmila Devi, S. Kannimuthu


Abstract

People increasingly communicate through social media rather than traditional phone calls and SMS, and WhatsApp is one of the most popular messaging applications in India. Author profiling is the identification of the demographic features of authors on social media. It is useful in applications such as forensics, security, and marketing, and it helps to identify fake profiles. This paper focuses on identifying socio-demographic author traits (gender, age group, marital status, and education status) by analysing WhatsApp messages in code-mixed Tamil. Although many studies have addressed author profiling in English and other resource-rich languages, research on Indian languages is still nascent. This study is the first author profiling task for code-mixed Tamil on WhatsApp. As part of this study, we created a benchmark WhatsApp dataset in code-mixed Tamil for developing the author profiling system. We propose a stacked convolutional neural network (CNN) combined with k-max pooling and Bidirectional Long Short-Term Memory (BiLSTM) to enhance the classification performance of the CNN. Multiple experiments were conducted to demonstrate the effectiveness of the proposed model, including comparisons against existing models under diverse parameter settings. We also incorporate focal loss and contextual-embedding-based data augmentation to handle data imbalance. The proposed model outperforms the state-of-the-art deep learning models it was compared against.
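
As a rough illustration of two building blocks named in the abstract, the sketch below implements k-max pooling (keeping the k largest activations of a feature map in their original order) and the standard focal loss, FL(p_t) = -alpha (1 - p_t)^gamma log(p_t). It is plain Python and not the authors' implementation: the function names, the defaults gamma = 2.0 and alpha = 1.0, and the toy inputs are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def k_max_pooling(values: Sequence[float], k: int) -> List[float]:
    """Keep the k largest activations, preserving their original order."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Indices of the k largest values; ties go to the earlier position.
    top = sorted(range(len(values)), key=lambda i: (-values[i], i))[:k]
    # Re-sort the selected indices so the output keeps sequence order.
    return [values[i] for i in sorted(top)]


def focal_loss(probs: Sequence[float], target: int,
               gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for one example, given predicted class probabilities."""
    # Clamp the true-class probability to avoid log(0).
    p_t = min(max(probs[target], 1e-12), 1.0)
    # (1 - p_t)^gamma down-weights examples the model already gets right.
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# Toy feature map (hypothetical values): the three largest activations
# are returned in the order they occur in the sequence.
pooled = k_max_pooling([0.1, 0.9, 0.3, 0.7, 0.2], k=3)  # [0.9, 0.3, 0.7]

# A confidently correct prediction is penalised far less than a wrong one.
easy = focal_loss([0.9, 0.1], target=0)
hard = focal_loss([0.1, 0.9], target=0)
```

With gamma = 0 the focal loss reduces to ordinary cross-entropy. Larger gamma values shrink the loss of well-classified examples, so harder examples, which often belong to minority classes, contribute more to training.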

Metadata
Title
Author Profiling in Code-Mixed WhatsApp Messages Using Stacked Convolution Networks and Contextualized Embedding Based Text Augmentation
Authors
V. Sharmila Devi
S. Kannimuthu
Publication date
15 July 2022
Publisher
Springer US
Published in
Neural Processing Letters, Issue 1/2023
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI
https://doi.org/10.1007/s11063-022-10898-3
