DOI: 10.1145/2034617.2034622
research-article

Experiments with artificially generated noise for cleansing noisy text

Published: 17 September 2011

ABSTRACT

Recent work shows that noisy text normalization can be treated as a machine translation (MT) problem, with convincing results. Supervised MT approaches train an MT model on noisy-regular parallel data, while unsupervised models learn the translation probabilities in alternative ways and try to mimic the MT-based approach. The supervised approaches suffer from data annotation and domain adaptation difficulties, whereas the unsupervised models lack a holistic approach that caters to all types of noise. In this paper, we propose an algorithm to artificially generate noisy text, in a controlled way, from any regular English text. We see this approach as an alternative to the unsupervised approaches that retains the advantages of a parallel-corpus-based MT approach. We generate parallel noisy text from two widely used regular English datasets and test the MT-based approach for text normalization. We also try semi-supervised approaches to explore different ways of using the generated noisy text to improve the MT approach based on a manually annotated parallel corpus. We present an extensive analysis comparing our approaches with both the supervised and the unsupervised approaches.
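To make the idea of controlled noise generation concrete, the following is a minimal sketch and not the paper's algorithm, which the abstract does not specify. It assumes two illustrative noise operations (a small word-substitution table and inner-vowel dropping) and a single `noise_rate` parameter that controls how many tokens are corrupted. The names `SUBSTITUTIONS`, `drop_inner_vowels`, `add_noise`, and `make_parallel_corpus` are hypothetical.

```python
import random

# Hypothetical substitution table; the abstract does not list the actual noise rules.
SUBSTITUTIONS = {"you": "u", "are": "r", "to": "2", "for": "4", "see": "c"}
VOWELS = set("aeiou")


def drop_inner_vowels(word):
    """Remove vowels except in the first and last position; short words are kept."""
    if len(word) <= 3:
        return word
    inner = "".join(ch for ch in word[1:-1] if ch.lower() not in VOWELS)
    return word[0] + inner + word[-1]


def add_noise(sentence, noise_rate, rng):
    """Corrupt each whitespace-separated token with probability `noise_rate`."""
    noisy = []
    for token in sentence.split():
        if rng.random() < noise_rate:
            # Prefer a table substitution; otherwise fall back to vowel dropping.
            token = SUBSTITUTIONS.get(token.lower(), drop_inner_vowels(token))
        noisy.append(token)
    return " ".join(noisy)


def make_parallel_corpus(sentences, noise_rate=0.5, seed=0):
    """Return (noisy, regular) sentence pairs, reproducible for a given seed."""
    rng = random.Random(seed)
    return [(add_noise(s, noise_rate, rng), s) for s in sentences]


if __name__ == "__main__":
    for noisy, clean in make_parallel_corpus(["see you tomorrow"], noise_rate=1.0):
        print(noisy, "<-", clean)
```

Pairs produced this way could serve as the source (noisy) and target (regular) sides of a training corpus for an MT-based normalizer. Varying `noise_rate` is one simple way to control how heavily the text is corrupted.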


Published in

MOCR_AND '11: Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data
September 2011
144 pages
ISBN: 9781450306850
DOI: 10.1145/2034617

Copyright © 2011 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States
