2016 | Original Paper | Book Chapter

Integration of a Segmentation Tool for Arabic Corpora in NooJ Platform to Build an Automatic Annotation Tool

Written by: Nadia Ghezaiel Hammouda, Kais Haddar

Published in: Automatic Processing of Natural-Language Electronic Texts with NooJ

Publisher: Springer International Publishing

Abstract

Automatic annotation of Arabic corpora plays an important role in many Natural Language Processing (NLP) applications. In this context, we are interested in the automatic annotation of Arabic corpora using a set of transducers implemented in the NooJ platform. To achieve this aim, the annotation phase must be preceded by a segmentation phase. This segmentation phase, on the one hand, reduces the complexity of the analysis and, on the other hand, improves the functionality of the NooJ platform. For the annotation phase, we identify the different types of lexical ambiguity and propose an appropriate set of rules to handle them. We then evaluate this phase on a test corpus with the NooJ platform. The results obtained are promising and can be improved by adding further rules and heuristics.

Metadata
Title
Integration of a Segmentation Tool for Arabic Corpora in NooJ Platform to Build an Automatic Annotation Tool
Written by
Nadia Ghezaiel Hammouda
Kais Haddar
Copyright Year
2016
DOI
https://doi.org/10.1007/978-3-319-55002-2_8