Published in: International Journal of Speech Technology 2/2016

9 June 2015 | Special Issue Article

Semantic similarity based approach for reducing Arabic texts dimensionality

Author: Arafat Awajan

Abstract

An efficient method is introduced to represent large Arabic texts in a comparatively smaller size without losing significant information. The proposed method uses distributional semantics to build a word-context matrix that represents the distribution of words across contexts, and to transform the text into a vector space model (VSM) representation based on word semantic similarity. The linguistic features of the Arabic language, together with semantic information extracted from lexical-semantic resources such as Arabic WordNet and named-entity gazetteers, are used to improve the text representation. Distributional similarity measures are used to capture the semantic similarity of words and to create clusters of similar and related words. The experiments show that the proposed method reduces the size of the text representation by about 27% compared with the stem-based VSM and by about 50% compared with the traditional bag-of-words model. The results also show that the amount of dimension reduction depends on the size and shape of the windows of analysis as well as on the content of the text.
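
The general idea of the abstract (a word-context matrix, a distributional similarity measure, and clusters of similar words used as reduced dimensions) can be illustrated with a minimal sketch. This is not the paper's implementation: the English placeholder tokens, the symmetric window of one word, the cosine measure, the greedy threshold clustering, and all function names are assumptions chosen for illustration. The Arabic linguistic features, Arabic WordNet, and gazetteers mentioned above are left out.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=1):
    """Build a sparse word-context matrix: for each word, count the
    words seen within `window` positions of it in the same sentence."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def cluster_words(vectors, threshold=0.8):
    """Greedy one-pass clustering (an assumed, simplistic scheme): a word
    joins the first cluster whose first member is similar enough."""
    clusters = []
    for word in sorted(vectors):
        for cluster in clusters:
            if cosine(vectors[word], vectors[cluster[0]]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


def reduced_representation(sentences, clusters):
    """Represent the text by cluster frequencies instead of word frequencies."""
    label = {word: cluster[0] for cluster in clusters for word in cluster}
    return Counter(label[t] for tokens in sentences for t in tokens)


# Toy input; placeholder tokens stand in for preprocessed Arabic stems.
sentences = [["the", "cat", "drinks", "milk"],
             ["the", "dog", "drinks", "milk"]]
vectors = context_vectors(sentences, window=1)
clusters = cluster_words(vectors, threshold=0.8)
reduced = reduced_representation(sentences, clusters)
print(clusters)
print(reduced)
```

On this toy input, "cat" and "dog" occur in identical contexts and are merged, so the five distinct words collapse into four dimensions. The window size and the threshold are free parameters here, which mirrors the abstract's remark that the amount of reduction depends on the windows of analysis and on the text itself.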

Metadata
Title
Semantic similarity based approach for reducing Arabic texts dimensionality
Author
Arafat Awajan
Publication date
9 June 2015
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 2/2016
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-015-9284-6
