
Rich results from poor resources: NTCIR-4 monolingual and cross-lingual retrieval of Korean texts using Chinese and English

Published: 01 June 2005

Abstract

We report on Korean monolingual, Chinese-Korean English-as-pivot bilingual, and Chinese-English bilingual CLIR experiments in the NTCIR-4 environment, using MT software augmented with Web-based entity-oriented translation as resources. Simple stemming helps improve bigram indexing for Korean retrieval. For word indexing, keeping only nouns is preferable. Web-based translation reduces the number of terms left untranslated after MT and substantially improves CLIR results. Translation concatenation consistently improves CLIR effectiveness, and combining the retrieval lists from bigram and word indexing is also helpful. A method to disambiguate multiple MT outputs using a log likelihood ratio threshold was tested. Depending on whether title or description queries, bigram-only retrieval or a retrieval combination, and relaxed or rigid evaluation are used, direct bilingual CLIR returned an average precision of 71-79% (English-Korean) and 76-84% (Chinese-English) of the corresponding Korean-Korean and English-English monolingual results. Using English as a pivot in Chinese-Korean CLIR provides about 55-65% of the effectiveness of Korean monolingual retrieval. Entity/terminology translation at the pivot language stage accounts for a large portion of this deficiency. A topic with a comparatively poor Chinese-English bilingual result will not necessarily continue to underperform at the Korean retrieval level after the further transitive translation into Korean.
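The abstract mentions two indexing ideas that a short example may make more concrete: overlapping character bigram indexing, and combining the retrieval lists produced by bigram and word indexing. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names `char_bigrams` and `combine_lists`, the max-score normalization, and the equal default `weight` are all hypothetical choices made for illustration, and the abstract does not say how the two lists were actually combined.

```python
from collections import defaultdict


def char_bigrams(text):
    """Return overlapping character bigrams for each whitespace-separated token.

    Single-character tokens are kept as they are so they remain searchable.
    """
    grams = []
    for token in text.split():
        if len(token) == 1:
            grams.append(token)
        else:
            grams.extend(token[i:i + 2] for i in range(len(token) - 1))
    return grams


def combine_lists(bigram_scores, word_scores, weight=0.5):
    """Linearly combine two retrieval lists given as {doc_id: score} dicts.

    Assumption: each list is normalized by its own maximum score before
    mixing; `weight` is the share given to the bigram list.
    """
    def normalize(scores):
        top = max(scores.values(), default=0.0)
        return {doc: (s / top if top > 0 else 0.0) for doc, s in scores.items()}

    combined = defaultdict(float)
    for doc, s in normalize(bigram_scores).items():
        combined[doc] += weight * s
    for doc, s in normalize(word_scores).items():
        combined[doc] += (1.0 - weight) * s
    # Highest combined score first; ties broken by document id.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    print(char_bigrams("정보검색 시스템"))
    print(combine_lists({"d1": 2.0, "d2": 1.0}, {"d2": 4.0, "d3": 2.0}))
```

A document that appears in both lists accumulates score from each, which is one simple way a combined list can rank it above documents found by only one index.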


        Reviews

        Apostolos N Papadopoulos

        Cross-lingual information retrieval (CLIR) is considered very important, since it makes information retrieval systems more flexible. This work presents an interesting approach to information retrieval in collections of Korean documents. Both monolingual and cross-lingual approaches are studied. For cross-lingual IR involving Korean and Chinese, English is used as the pivot, or intermediate, language. Two key issues are investigated. The first is the effectiveness of IR for Korean document collections without extensive Korean text processing requirements. A similar approach has been followed for Chinese, with very good results. The second issue is the effectiveness of using English as an intermediate (pivot) language for cross-lingual IR between two Asian languages (for example, Korean and Chinese). Since English is spoken extensively worldwide, there is a good chance that translation resources exist between language X and English, and between English and language Y. Therefore, the two Asian languages X and Y can be connected through the pivot language (English). The experimental results support the following findings: bilingual CLIR from English to Korean achieves 71 to 79 percent of the average precision of Korean-to-Korean retrieval; bilingual CLIR from Chinese to English achieves 76 to 84 percent of that of English-to-English retrieval; and, using English as the pivot language, Chinese-to-Korean CLIR achieves 55 to 65 percent of that of Korean-only retrieval. Developers of Asian language search engines, as well as developers of special-purpose IR systems for Asian languages, will benefit from this research.

        Online Computing Reviews Service
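The pivot idea described in the review, connecting language X to language Y through English, can be illustrated by composing two translation lookups. This is only a sketch under assumptions: the paper relies on MT software and Web-based translation rather than dictionaries, and the function name `pivot_translate` and the tiny word lists below are hypothetical toy data chosen for illustration.

```python
def pivot_translate(query_terms, src_to_pivot, pivot_to_tgt):
    """Translate query terms from a source to a target language via a pivot.

    Both resources are assumed to be dicts mapping a term to a list of
    candidate translations. Returns (translated_terms, untranslated_terms).
    """
    translated, untranslated = [], []
    for term in query_terms:
        pivots = src_to_pivot.get(term)
        if not pivots:
            # No source-to-pivot entry: the term is lost at the first stage.
            untranslated.append(term)
            continue
        targets = [t for p in pivots for t in pivot_to_tgt.get(p, [])]
        if targets:
            translated.extend(targets)
        else:
            # The pivot term exists but has no target-language entry.
            untranslated.append(term)
    return translated, untranslated


if __name__ == "__main__":
    # Toy Chinese-English and English-Korean resources (illustrative only).
    zh_en = {"经济": ["economy"], "选举": ["election"]}
    en_ko = {"economy": ["경제"]}
    print(pivot_translate(["经济", "选举", "奥运"], zh_en, en_ko))
```

In this toy run a term can be lost at either stage, which is the kind of compounding loss that makes pivot translation harder than direct translation.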
