Published in: Discover Computing 6/2007

01-12-2007

Searching strategies for the Bulgarian language

Author: Jacques Savoy


Abstract

This paper reports on the underlying IR problems encountered when indexing and searching documents written in the Bulgarian language. For this language we propose a general light stemmer and demonstrate that it can be quite effective, producing a significantly better mean average precision (MAP), around +34%, than an approach that applies no stemming. We implement the GL2 model derived from the Divergence from Randomness paradigm and find its retrieval effectiveness better than that of other probabilistic, vector-space and language models. The resulting MAP is about 50% better than that of the classical tf-idf approach. Moreover, increasing the query size from T to TD enhances the MAP by around 10%. To compare the retrieval effectiveness of our suggested stopword list and the light stemmer developed for the Bulgarian language, we conduct a set of experiments with another stopword list and with a more complex and aggressive stemmer. The results tend to indicate that there is no statistically significant difference between these variants and our suggested approach. This paper also evaluates other indexing strategies, such as 4-gram indexing and indexing based on the automatic decompounding of compound words. Finally, we analyze certain queries to discover why the suggested word-based approach produced poor results for them when indexing Bulgarian documents.
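Since the abstract states all of its comparisons as relative changes in MAP, a minimal Python sketch of how such a figure is typically computed may help. It uses the standard non-interpolated definition of average precision. The function names, document identifiers and MAP values are hypothetical illustrations and are not taken from the paper.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision for a single query.

    `ranked` is the system's result list (best first) and `relevant`
    is the set of documents judged relevant for that query.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision measured at the rank of each relevant document.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Iterable[Tuple[Sequence[str], Set[str]]]
) -> float:
    """Mean of the per-query average precision values."""
    runs = list(runs)
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def relative_change(baseline: float, variant: float) -> float:
    """Percentage change of `variant` over `baseline`."""
    return 100.0 * (variant - baseline) / baseline


# Toy data (hypothetical, not from the paper): two queries.
toy_runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
toy_map = mean_average_precision(toy_runs)

# Hypothetical MAP values chosen only to illustrate a +34% change.
improvement = relative_change(baseline=0.200, variant=0.268)
```

With this definition, a statement such as "around +34%" describes a relative change between two MAP values, not a difference of 34 percentage points.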

Metadata
Title
Searching strategies for the Bulgarian language
Author
Jacques Savoy
Publication date
01-12-2007
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 6/2007
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-007-9033-9
