Abstract
We report on Korean monolingual, Chinese-Korean English-as-pivot bilingual, and Chinese-English bilingual CLIR experiments in the NTCIR-4 environment, using MT software augmented with Web-based entity-oriented translation as resources. Simple stemming helps improve bigram indexing for Korean retrieval. For word indexing, keeping only nouns is preferable. Web-based translation reduces the number of terms left untranslated by MT and substantially improves CLIR results. Translation concatenation consistently improves CLIR effectiveness, and combining the retrieval lists from bigram and word indexing is also helpful. A method to disambiguate multiple MT outputs using a log likelihood ratio threshold was also tested. Depending on the query type (title or description), the indexing (bigram only or a retrieval combination), and the evaluation (relaxed or rigid), direct bilingual CLIR returned an average precision of 71--79% (English-Korean) and 76--84% (Chinese-English) of the corresponding Korean-Korean and English-English monolingual results. Using English as a pivot in Chinese-Korean CLIR provides about 55--65% of the effectiveness of Korean monolingual retrieval. Entity/terminology translation at the pivot language stage accounts for a large portion of this deficiency. A topic with a comparatively poor Chinese-English bilingual result will not necessarily continue to under-perform at the Korean retrieval level after the further transitive translation into Korean.
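To make two of the techniques named in the abstract concrete, the following is a minimal sketch of character bigram indexing and of combining a bigram-based and a word-based retrieval list. The abstract does not say how the lists were combined, so the min-max normalization, the weighted linear sum, the weight `alpha`, and the function names `char_bigrams`, `min_max`, and `combine_lists` are all illustrative assumptions rather than the authors' method.

```python
from collections import defaultdict


def char_bigrams(text):
    """Split text on whitespace and return overlapping character bigrams.

    Tokens shorter than two characters are kept as they are.
    """
    grams = []
    for token in text.split():
        if len(token) < 2:
            grams.append(token)
        else:
            grams.extend(token[i:i + 2] for i in range(len(token) - 1))
    return grams


def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine_lists(bigram_scores, word_scores, alpha=0.5):
    """Fuse two retrieval lists by a weighted sum of normalized scores.

    alpha weights the bigram list and (1 - alpha) the word list; this
    weighting scheme is an assumption made for illustration only.
    """
    fused = defaultdict(float)
    if bigram_scores:
        for doc, s in min_max(bigram_scores).items():
            fused[doc] += alpha * s
    if word_scores:
        for doc, s in min_max(word_scores).items():
            fused[doc] += (1 - alpha) * s
    # Highest fused score first.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    print(char_bigrams("한국어 검색"))
    print(combine_lists({"d1": 2.0, "d2": 1.0}, {"d2": 3.0, "d3": 1.0}, alpha=0.4))
```

A document that appears in only one list contributes nothing from the other, which is one simple way to treat missing scores; other conventions are possible.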
Index Terms
- Rich results from poor resources: NTCIR-4 monolingual and cross-lingual retrieval of Korean texts using Chinese and English