
Rich results from poor resources: NTCIR-4 monolingual and cross-lingual retrieval of Korean texts using Chinese and English

Authors:
Kui Lam Kwok, Sora Choi, Norbert Dinstl

Queens College, City University of New York, Flushing, NY

ACM Transactions on Asian Language Information Processing, Volume 4, Issue 2, pp. 136–162. https://doi.org/10.1145/1105696.1105700

Published: 01 June 2005

Abstract

We report on Korean monolingual, Chinese-Korean English-as-pivot bilingual, and Chinese-English bilingual CLIR experiments in the NTCIR-4 environment, using MT software augmented with Web-based entity-oriented translation as resources. Simple stemming helps improve bigram indexing for Korean retrieval. For word indexing, keeping nouns only is preferable. Web-based translation reduces the untranslated terms left over after MT and substantially improves CLIR results. Translation concatenation is found to consistently improve CLIR effectiveness, and combining retrieval lists from bigram and word indexing is also helpful. A method to disambiguate multiple MT outputs using a log likelihood ratio threshold was tested. Depending on the query type (title or description), the indexing used (bigram only or a retrieval combination), and the evaluation (relaxed or rigid), direct bilingual CLIR returned an average precision of 71–79% (English-Korean) and 76–84% (Chinese-English) of the corresponding Korean-Korean and English-English monolingual results. Using English as a pivot in Chinese-Korean CLIR provides about 55–65% of the effectiveness of Korean monolingual retrieval. Entity/terminology translation at the pivot language stage accounts for a large portion of this deficiency. A topic with a comparatively poor Chinese-English bilingual result will not necessarily continue to underperform at the Korean retrieval level after the further transitive translation into Korean.
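Two of the techniques named in the abstract, overlapping bigram indexing and the combination of retrieval lists, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `char_bigrams` and `combine_runs`, the min-max score normalization, and the equal default weighting are all assumptions made for illustration, and the stemming step mentioned in the abstract is omitted.

```python
from collections import defaultdict


def char_bigrams(text):
    """Return overlapping character bigrams for each whitespace-delimited token."""
    grams = []
    for token in text.split():
        if len(token) == 1:
            grams.append(token)  # keep single-character tokens unchanged
        else:
            grams.extend(token[i:i + 2] for i in range(len(token) - 1))
    return grams


def combine_runs(run_a, run_b, weight_a=0.5):
    """Linearly combine two {doc_id: score} runs after min-max normalization.

    The normalization and the weighting scheme are assumptions of this sketch.
    """
    def normalize(run):
        if not run:
            return {}
        lo, hi = min(run.values()), max(run.values())
        span = hi - lo
        # If all scores are equal, map every document to 1.0.
        return {d: ((s - lo) / span if span else 1.0) for d, s in run.items()}

    fused = defaultdict(float)
    for doc, score in normalize(run_a).items():
        fused[doc] += weight_a * score
    for doc, score in normalize(run_b).items():
        fused[doc] += (1.0 - weight_a) * score
    # Highest fused score first; ties are broken by document id.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

Assuming precomposed Hangul syllables, `char_bigrams("한국어")` yields the two overlapping bigrams `한국` and `국어`, so no word segmentation is needed before indexing. In `combine_runs`, one run would come from a bigram index and the other from a word index; a document retrieved by only one of them still receives that run's weighted contribution.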

Index Terms

Information systems

Reviews

Reviewer: Apostolos N Papadopoulos

Cross-lingual information retrieval (CLIR) is important because it makes information retrieval systems more flexible. This work presents an interesting approach to retrieval in collections of Korean documents, studying both monolingual and cross-lingual settings. For cross-lingual IR involving Korean and Chinese, English is used as the pivot, or intermediate, language.

Two key issues are investigated. The first is the effectiveness of IR for Korean document collections without extensive Korean text processing requirements. A similar approach has been followed for Chinese, with very good results. The second is the effectiveness of using English as a pivot for cross-lingual IR between two Asian languages (for example, Korean and Chinese). Since English is spoken extensively worldwide, there is a good chance that translation resources exist between language X and English, and between English and language Y. The two Asian languages X and Y can therefore be connected through English.

The experimental results support the following findings: bilingual English-to-Korean CLIR achieves 71 to 79 percent of the average precision of Korean-to-Korean retrieval; bilingual Chinese-to-English CLIR achieves 76 to 84 percent of that of English-to-English retrieval; and Chinese-to-Korean CLIR with English as the pivot achieves 55 to 65 percent of the effectiveness of Korean-only retrieval.

Developers of Asian language search engines, as well as developers of special-purpose IR systems for Asian languages, will benefit from this research.

Online Computing Reviews Service
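The pivot scheme described in the review, and the way the reported percentages relate cross-lingual to monolingual results, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the system evaluated in the article: the dictionary-lookup translation, the function names `pivot_translate` and `relative_effectiveness`, and the handling of untranslatable terms are all hypothetical, whereas the article itself relies on MT software and Web-based entity translation.

```python
def pivot_translate(query_terms, src_to_pivot, pivot_to_tgt):
    """Translate terms source -> pivot -> target via two lookup tables.

    Returns the target-language terms plus the source terms that could not
    be carried through both stages (an assumption of this sketch).
    """
    translated, lost = [], []
    for term in query_terms:
        targets = []
        for pivot_term in src_to_pivot.get(term, []):
            targets.extend(pivot_to_tgt.get(pivot_term, []))
        if targets:
            translated.extend(targets)
        else:
            lost.append(term)  # dropped at the pivot or at the target stage
    return translated, lost


def relative_effectiveness(clir_avg_precision, mono_avg_precision):
    """Cross-lingual average precision as a percentage of the monolingual baseline."""
    if mono_avg_precision <= 0:
        raise ValueError("monolingual baseline must be positive")
    return 100.0 * clir_avg_precision / mono_avg_precision
```

A term that has no entry in either table ends up in the `lost` list, which mirrors the abstract's observation that entity and terminology translation at the pivot stage accounts for much of the effectiveness gap. `relative_effectiveness` is the ratio behind figures such as "55 to 65 percent of Korean-only retrieval".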

Published in

ACM Transactions on Asian Language Information Processing Volume 4, Issue 2
June 2005
179 pages
ISSN:1530-0226
EISSN:1558-3430
DOI:10.1145/1105696

Copyright © 2005 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Chinese-English-Korean pivot CLIR
Chinese-Korean CLIR
Web-based entity-oriented translation
bigram indexing
translation disambiguation
Qualifiers
- article