Abstract
We present C*ST*RD, a cross-language information delivery system that supports cross-language information retrieval, information space visualization and navigation, machine translation, and text summarization of single documents and clusters of documents. C*ST*RD was assembled and trained within one month in the context of DARPA's Surprise Language Exercise, which selected as its source language a heretofore unstudied language, Hindi. Given the brief time, we could not create deep Hindi capabilities for all the modules; instead, we experimented with combining shallow Hindi capabilities, or even English-only modules, into one integrated system. Various possible configurations, with different tradeoffs in processing speed and ease of use, enable the rapid deployment of C*ST*RD to new languages under various conditions.
Index Terms
- Cross-lingual C*ST*RD: English access to Hindi information