ABSTRACT
Recent work shows that noisy text normalization can be treated as a machine translation (MT) problem, with convincing results. Supervised MT approaches train an MT model on parallel noisy-regular data, while unsupervised models learn the translation probabilities in alternative ways and try to mimic the MT-based approach. The supervised approaches suffer from data annotation and domain adaptation difficulties, whereas the unsupervised models lack a holistic approach that caters to all types of noise. In this paper, we propose an algorithm that artificially generates noisy text, in a controlled way, from any regular English text. We see this as an alternative to the unsupervised approaches that retains the advantages of a parallel-corpus-based MT approach. We generate parallel noisy text from two widely used regular English datasets and test the MT-based approach for text normalization. We also try semi-supervised approaches, exploring different ways of using the generated noisy text to improve an MT approach based on a manually annotated parallel corpus. We present an extensive analysis comparing our approaches with both supervised and unsupervised approaches.