ABSTRACT
This paper describes our submission to the FIRE 2014 Shared Task on Transliterated Search. The shared task features two subtasks: Query Word Labeling and Mixed-script Ad hoc Retrieval for Hindi Song Lyrics.
Query Word Labeling involves token-level language identification of the words in code-mixed queries, followed by back-transliteration of the identified Indian-language words into their native scripts. We developed letter-based language models for token-level language identification and a structured perceptron model for back-transliteration of Indic words.
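As an illustration of letter-based language identification, the sketch below trains one character trigram model per language and labels a token with the language whose model scores it highest. It is a minimal sketch, not the submitted system: the class `CharNgramLM`, the function `label_token`, the add-one smoothing, the trigram order, and the toy word lists are all assumptions made for this example.

```python
import math
from collections import Counter


class CharNgramLM:
    """Letter-level n-gram model with add-one smoothing (an assumed choice)."""

    def __init__(self, words, n=3):
        self.n = n
        self.ngrams = Counter()    # counts of full n-grams
        self.contexts = Counter()  # counts of (n-1)-letter histories
        self.vocab = set()         # letters seen in training, plus padding
        for word in words:
            padded = self._pad(word)
            self.vocab.update(padded)
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                self.ngrams[gram] += 1
                self.contexts[gram[:-1]] += 1

    def _pad(self, word):
        # "^" marks the word start and "$" the word end.
        return "^" * (self.n - 1) + word.lower() + "$"

    def log_prob(self, word):
        padded = self._pad(word)
        v = len(self.vocab) + 1  # +1 reserves mass for unseen letters
        total = 0.0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            total += math.log(
                (self.ngrams[gram] + 1) / (self.contexts[gram[:-1]] + v)
            )
        return total


def label_token(token, models):
    """Return the language whose model gives the token the highest score."""
    return max(models, key=lambda lang: models[lang].log_prob(token))


# Toy training lists (hypothetical; a real system needs far larger lexicons).
models = {
    "en": CharNgramLM(["the", "song", "lyrics", "love", "music", "heart", "night", "with"]),
    "hi": CharNgramLM(["pyaar", "dil", "tera", "mera", "zindagi", "kabhi", "tum", "hai"]),
}

for token in "the dil".split():
    print(token, label_token(token, models))
```

Because every model scores the same token, the compared log-probabilities cover the same number of trigrams and no length normalization is needed.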
The second subtask, Mixed-script Ad hoc Retrieval for Hindi Song Lyrics, is to retrieve a ranked list of songs from a corpus of Hindi song lyrics, given an input query in Devanagari or transliterated Roman script. We used edit-distance-based query expansion and language modeling, followed by relevance-based reranking, to retrieve the Hindi song lyrics relevant to a given query.
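Edit-distance-based query expansion can be sketched as follows: each query term is expanded with the vocabulary words that lie within a small Levenshtein distance of it, which captures spelling variants of transliterated words. This is a minimal sketch under stated assumptions; the function names, the threshold `max_distance=1`, and the toy vocabulary are hypothetical and are not taken from the submitted system.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def expand_term(term, vocabulary, max_distance=1):
    """Return vocabulary words within max_distance edits of the term."""
    return sorted(w for w in vocabulary if edit_distance(term, w) <= max_distance)


def expand_query(query, vocabulary, max_distance=1):
    """Map each whitespace-separated query term to its expansion candidates."""
    return {t: expand_term(t, vocabulary, max_distance)
            for t in query.lower().split()}


# Toy vocabulary of Roman-script spellings (hypothetical).
vocabulary = ["pyaar", "pyar", "piya", "dil"]
print(expand_query("pyar dil", vocabulary))
```

A fixed threshold is the simplest design; a threshold that grows with term length is a common alternative when short words would otherwise match too many candidates.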