Published in: Discover Computing 4-5/2007

1 October 2007

An empirical study of tokenization strategies for biomedical information retrieval

Authors: Jing Jiang, ChengXiang Zhai

Abstract

Because biological names vary greatly in biomedical text, appropriate tokenization is an important preprocessing step for biomedical information retrieval. Despite its importance, there has been little work evaluating tokenization strategies for biomedical text. In this work, we conducted a careful, systematic evaluation of a set of tokenization heuristics on all the available TREC biomedical text collections for ad hoc document retrieval, using two representative retrieval methods and a pseudo-relevance feedback method. We also studied the effect of stemming and stop word removal on retrieval performance. As expected, our experimental results show that tokenization can significantly affect retrieval accuracy; appropriate tokenization can improve performance by up to 96%, measured by mean average precision (MAP). In particular, we show that different query types require different tokenization heuristics, that stemming is effective only for certain queries, and that stop word removal in general does not improve retrieval performance on biomedical text.
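
For readers unfamiliar with the evaluation measure named above, the following is a minimal sketch of how mean average precision and a relative improvement in percent are commonly computed. It uses the standard textbook definition of MAP, not code from the paper; the function names, the document IDs, and the toy inputs in the usage note are all hypothetical.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant doc IDs."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision measured at the rank of each relevant document.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of per-query average precision over all judged queries."""
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative change of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline
```

With a toy ranking `["d1", "d2", "d3", "d4"]` and relevant set `{"d1", "d3"}`, `average_precision` averages the precision at ranks 1 and 3. Comparing the MAP of two runs that differ only in tokenization with `relative_improvement` gives a percentage of the kind reported in the abstract.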

Metadata
Title
An empirical study of tokenization strategies for biomedical information retrieval
Authors
Jing Jiang
ChengXiang Zhai
Publication date
1 October 2007
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 4-5/2007
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-007-9027-7
