Published in: Arabian Journal for Science and Engineering 8/2022

2 December 2021 | Research Article-Computer Engineering and Computer Science

Improving Phrase Chunking by using Contextualized Word Embeddings for a Morphologically Rich Language

Authors: Toqeer Ehsan, Javairia Khalid, Saadia Ambreen, Asad Mustafa, Sarmad Hussain


Abstract

Phrase chunking is an important task in various natural language processing (NLP) applications. This paper presents a neural phrase chunker for Urdu that uses trained contextualized word representations. The work also produces an annotated corpus, labeled with IOB (inside-outside-begin) tags. Comprehensive guidelines have been developed for four phrase types: noun phrase (NP), verb phrase (VP), post-positional phrase (PP) and prepositional phrase (PRP). The annotated text has been checked automatically for completeness and correctness, and inter-annotator agreement has been calculated on a ten percent reference corpus. A neural chunker has been developed and trained on the annotated corpus. The chunker is based on long short-term memory (LSTM) networks. Transfer learning has been employed to improve the chunking results: context-free (Word2Vec) and contextualized (ELMo) word representations have been trained for that purpose. The chunker achieved an F-score of 94.9 when trained using the third layer of ELMo embeddings.
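To make the IOB scheme and the reported F-score concrete, the following is a minimal sketch, not the authors' implementation. It assumes tags are written as `B-NP`, `I-NP`, `O` and so on (the abstract names the IOB scheme and the phrase types but not the exact tag strings), and it assumes the F-score is computed over whole chunks, which is a common convention but is not stated in the abstract. The function names `iob_to_chunks` and `chunk_f1` are hypothetical.

```python
def iob_to_chunks(tags):
    """Convert IOB tags into (label, start, end) spans, with end exclusive.

    Assumed tag format: "B-<TYPE>", "I-<TYPE>" or "O".
    """
    chunks = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition("-")
        # Close the open chunk when the current tag does not continue it.
        if label is not None and not (prefix == "I" and kind == label):
            chunks.append((label, start, i))
            start, label = None, None
        # Open a new chunk on B-, or on a stray I- with no open chunk.
        if prefix == "B" or (prefix == "I" and label is None):
            start, label = i, kind
    if label is not None:
        chunks.append((label, start, len(tags)))
    return chunks


def chunk_f1(gold_tags, pred_tags):
    """Chunk-level F1: a predicted chunk counts only if label and span both match."""
    gold = set(iob_to_chunks(gold_tags))
    pred = set(iob_to_chunks(pred_tags))
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative tag sequences (made up, not taken from the corpus).
gold = ["B-NP", "I-NP", "B-PP", "O", "B-VP", "I-VP"]
pred = ["B-NP", "I-NP", "B-PP", "O", "B-VP", "O"]
print(iob_to_chunks(gold))
print(chunk_f1(gold, pred))
```

In this sketch the predicted verb phrase is one token too short, so it does not count as correct even though its first token is tagged correctly; exact-span matching of this kind is stricter than per-token accuracy.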


Metadata
Title
Improving Phrase Chunking by using Contextualized Word Embeddings for a Morphologically Rich Language
Authors
Toqeer Ehsan
Javairia Khalid
Saadia Ambreen
Asad Mustafa
Sarmad Hussain
Publication date
2 December 2021
Publisher
Springer Berlin Heidelberg
Published in
Arabian Journal for Science and Engineering / Issue 8/2022
Print ISSN: 2193-567X
Electronic ISSN: 2191-4281
DOI
https://doi.org/10.1007/s13369-021-06343-7
