Cross-lingual C*ST*RD: English access to Hindi information

Published: 01 September 2003

Abstract

We present C*ST*RD, a cross-language information delivery system that supports cross-language information retrieval, information space visualization and navigation, machine translation, and text summarization of single documents and clusters of documents. C*ST*RD was assembled and trained within one month as part of DARPA's Surprise Language Exercise, which selected a heretofore unstudied language, Hindi, as the source language. Given the brief time, we could not create deep Hindi capabilities for all the modules; instead, we experimented with combining shallow Hindi capabilities, or even English-only modules, into one integrated system. Various possible configurations, with different tradeoffs in processing speed and ease of use, enable the rapid deployment of C*ST*RD to new languages under various conditions.
