2016 | OriginalPaper | Chapter

On Cross-Script Information Retrieval

Authors: Nada Naji, James Allan

Published in: Advances in Information Retrieval

Publisher: Springer International Publishing

Abstract

We address the problem of cross-script retrieval in the context of a microblog system such as Twitter. Specifically, we explore methods for using native Arabic script queries to retrieve Arabic tweets written in a Roman script known as Arabizi. For example, a query for “كتاب” would not match “kitab” even though an Arabic reader would see them as the same word. Moreover, because Arabizi text contains no Arabic script, automatic language identification methods fail to recognize it as Arabic and label it as English, Polish, or the like. We propose a cross-script retrieval system using automatic rule-based mapping and statistical selection of transliteration keywords. We show that our system can achieve effective cross-script retrieval with minimal knowledge of the target language and without the need to rely on external translation or transliteration tools or lexica. With minimal human annotation, our technique can be applied to other languages, such as Hindi and Greek, that are likewise commonly written in a Roman character set.
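The two stages named in the abstract, rule-based mapping followed by statistical selection, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' actual system: the letter table `LETTER_RULES`, the optional short-vowel slot between consonants, the function names `candidates` and `select_keywords`, and the use of raw corpus frequency as the selection statistic are all hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical letter rules (Arabic letter -> possible Roman renderings).
# Illustrative only; this is not the mapping table used in the chapter.
LETTER_RULES = {
    "\u0643": ["k"],        # kaf
    "\u062a": ["t"],        # ta
    "\u0627": ["a", "aa"],  # alif
    "\u0628": ["b"],        # ba
}
# Arabic script usually omits short vowels, so a Roman rendering may
# insert one (or none) between two consonants. Assumed inventory:
SHORT_VOWELS = ["", "a", "e", "i", "o", "u"]
# Letters treated here as long vowels: alif, waw, ya.
LONG_VOWEL_LETTERS = {"\u0627", "\u0648", "\u064a"}


def candidates(word):
    """Rule-based mapping: enumerate Roman-script candidates for a word.

    Raises KeyError for letters missing from LETTER_RULES.
    """
    slots = []
    for i, ch in enumerate(word):
        slots.append(LETTER_RULES[ch])
        nxt = word[i + 1] if i + 1 < len(word) else None
        # Allow an optional short vowel only between two consonants.
        if (nxt is not None
                and ch not in LONG_VOWEL_LETTERS
                and nxt not in LONG_VOWEL_LETTERS):
            slots.append(SHORT_VOWELS)
    return {"".join(parts) for parts in product(*slots)}


def select_keywords(word, corpus_tokens, top_k=3):
    """Statistical selection: keep the candidates that occur in the
    corpus, ranked by frequency (ties broken alphabetically)."""
    counts = Counter(corpus_tokens)
    seen = [(c, counts[c]) for c in candidates(word) if counts[c] > 0]
    seen.sort(key=lambda pair: (-pair[1], pair[0]))
    return [c for c, _ in seen[:top_k]]


# Toy Roman-script corpus (made up for this example).
corpus = "el kitab da 7elw ktab kitab gedid".split()
query = "\u0643\u062a\u0627\u0628"  # the Arabic-script query from the abstract
print(select_keywords(query, corpus))  # prints ['kitab', 'ktab']
```

In this sketch the rules deliberately over-generate (twelve candidates for the four-letter query), and the corpus statistics prune the list down to spellings that writers actually use, which is one plausible reading of how a rule-based mapping and a statistical selection step could complement each other.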

Metadata
Title
On Cross-Script Information Retrieval
Authors
Nada Naji
James Allan
Copyright Year
2016
Publisher
Springer International Publishing
DOI
https://doi.org/10.1007/978-3-319-30671-1_70