Cross-lingual C*ST*RD: English access to Hindi information

Authors:
Anton Leuski, Chin-Yew Lin, Liang Zhou, Ulrich Germann, Franz Josef Och, Eduard Hovy

Information Sciences Institute, University of Southern California

ACM Transactions on Asian Language Information Processing, Volume 2, Issue 3, pp 245–269
https://doi.org/10.1145/979872.979877

Published: 01 September 2003

Abstract

We present C*ST*RD, a cross-language information delivery system that supports cross-language information retrieval, information space visualization and navigation, machine translation, and text summarization of single documents and clusters of documents. C*ST*RD was assembled and trained within one month as part of DARPA's Surprise Language Exercise, which selected Hindi, a heretofore unstudied language, as the source language. Given the brief time, we could not create deep Hindi capabilities for all the modules, but instead experimented with combining shallow Hindi capabilities, or even English-only modules, into one integrated system. Various possible configurations, with different tradeoffs in processing speed and ease of use, enable the rapid deployment of C*ST*RD to new languages under various conditions.

Index Terms

Cross-lingual C*ST*RD: English access to Hindi information
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval

Published in

ACM Transactions on Asian Language Information Processing Volume 2, Issue 3
September 2003
132 pages
ISSN: 1530-0226
EISSN: 1558-3430
DOI: 10.1145/979872

Copyright © 2003 ACM
Publisher
Association for Computing Machinery
New York, NY, United States

Author Tags
Cross-language information retrieval
Hindi-to-English machine translation
headline generation
information retrieval and information space navigation
single- and multi-document text summarization