Published in: Knowledge and Information Systems 6/2020

21.12.2019 | Regular Paper

English–Arabic collocation extraction to enhance Arabic collocation identification

Written by: Chiraz Ben Othmane Zribi

Abstract

Bilingual collocation extraction can improve the performance of monolingual extraction. This is especially true for the English–Arabic pair, where it helps overcome the difficulties of Arabic collocation extraction. In this paper, we present two novel approaches, one for extracting monolingual collocations and one for extracting bilingual collocations. The monolingual approach is hybrid, based on linguistic patterns and statistical measures. During statistical filtering, we propose combining vector-based measures with different association measures via a voting procedure. The bilingual approach capitalizes on several cues: position, frequency, cross-language correspondence between POS patterns, distribution, and translation. It enhances monolingual collocation extraction by considering more than collocation equivalents with a direct translation: it can validate unconfirmed collocations because they translate confirmed ones. The results show, in particular, how the extraction of Arabic collocations can be improved by extracting English–Arabic ones: the precision of Arabic collocation extraction rose from about 86% to 96%.
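
The idea of combining association measures through a vote can be illustrated with a minimal Python sketch. It is not the paper's implementation: the three measures (PMI, Dice, t-score), the thresholds, the two-out-of-three rule, the toy English corpus, and all function names are assumptions chosen for illustration. The vector-based measures and the linguistic pattern filtering mentioned in the abstract are left out.

```python
import math
from collections import Counter


def association_scores(pair, unigrams, bigrams, n_tokens):
    """Score one observed candidate pair with three association measures."""
    w1, w2 = pair
    f12 = bigrams[pair]                # observed co-occurrence count (must be > 0)
    f1, f2 = unigrams[w1], unigrams[w2]
    expected = f1 * f2 / n_tokens      # expected count if the words were independent
    return {
        "pmi": math.log2(f12 / expected),
        "dice": 2 * f12 / (f1 + f2),
        "t_score": (f12 - expected) / math.sqrt(f12),
    }


def passes_vote(scores, thresholds, min_votes=2):
    """Each measure casts one vote when its score reaches its threshold."""
    votes = sum(scores[name] >= limit for name, limit in thresholds.items())
    return votes >= min_votes


# Toy corpus and thresholds: illustrative assumptions, not values from the paper.
tokens = "strong tea strong tea strong tea the tea the man strong man the cup".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
THRESHOLDS = {"pmi": 1.0, "dice": 0.6, "t_score": 1.0}


def is_collocation(pair):
    """Accept a pair when at least two of the three measures vote for it."""
    scores = association_scores(pair, unigrams, bigrams, len(tokens))
    return passes_vote(scores, THRESHOLDS)


for candidate in [("strong", "tea"), ("the", "tea"), ("the", "cup")]:
    print(candidate, is_collocation(candidate))
```

In this toy setting, the pair ("the", "cup") occurs only once and gets a high PMI, but it fails the other two measures and is rejected. This is the kind of single-measure bias a vote is meant to dampen.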

Footnotes
1
In this section, we use MWE and collocation interchangeably, since some researchers do not differentiate between them.
 
Metadata
Title
English–Arabic collocation extraction to enhance Arabic collocation identification
Written by
Chiraz Ben Othmane Zribi
Publication date
21.12.2019
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 6/2020
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-019-01428-0
