
Experiments with artificially generated noise for cleansing noisy text

Authors:
Phani Gadde, Language Technologies Research Centre, IIIT-Hyderabad, India
Rahul Goutam, Language Technologies Research Centre, IIIT-Hyderabad, India
Rakshit Shah, Language Technologies Research Centre, IIIT-Hyderabad, India
Hemanth Sagar Bayyarapu, Language Technologies Research Centre, IIIT-Hyderabad, India
L. V. Subramaniam, IBM Research, New Delhi, India

MOCR_AND '11: Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data
September 2011, Article No.: 4, Pages 1–8
https://doi.org/10.1145/2034617.2034622

Published: 17 September 2011

ABSTRACT

Recent work shows that noisy text normalization can be treated as a machine translation (MT) problem, with convincing results. Supervised MT approaches train an MT model on noisy-regular parallel data, while unsupervised models learn the translation probabilities in alternative ways and try to mimic the MT-based approach. The supervised approaches suffer from data annotation and domain adaptation difficulties, whereas the unsupervised models lack a holistic approach that caters to all types of noise. In this paper, we propose an algorithm to artificially generate noisy text, in a controlled way, from any regular English text. We see this as an alternative to the unsupervised approaches that retains the advantages of a parallel-corpus-based MT approach. We generate parallel noisy text from two widely used regular English datasets and test the MT-based approach for text normalization. We also try semi-supervised approaches that use the generated noisy text to improve the MT approach based on a manually annotated parallel corpus. We present an extensive analysis comparing our approaches with both supervised and unsupervised approaches.
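The abstract does not spell out the noise-generation algorithm, so the following is only a minimal sketch of the general idea: corrupt clean English text at a controllable rate and keep each noisy sentence paired with its clean source. The abbreviation table, the vowel-dropping rule, the `noise_rate` parameter, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import random

# Hypothetical abbreviation table; illustrative only, not from the paper.
ABBREVIATIONS = {"you": "u", "are": "r", "to": "2", "for": "4", "great": "gr8"}
VOWELS = set("aeiou")


def drop_inner_vowels(word):
    """Remove vowels after the first character; words under 4 letters are kept."""
    if len(word) < 4:
        return word
    return word[0] + "".join(c for c in word[1:] if c.lower() not in VOWELS)


def noisify_word(word):
    """Apply one illustrative noise operation to a single word."""
    lower = word.lower()
    if lower in ABBREVIATIONS:
        return ABBREVIATIONS[lower]
    return drop_inner_vowels(word)


def noisify_sentence(sentence, noise_rate, rng):
    """Corrupt each whitespace-separated token with probability noise_rate."""
    return " ".join(
        noisify_word(tok) if rng.random() < noise_rate else tok
        for tok in sentence.split()
    )


def build_parallel_corpus(sentences, noise_rate=0.3, seed=0):
    """Return (noisy, clean) sentence pairs; the seed makes runs repeatable."""
    rng = random.Random(seed)
    return [(noisify_sentence(s, noise_rate, rng), s) for s in sentences]
```

With `noise_rate=1.0` every token is passed through a noise operation, and with `noise_rate=0.0` the text is returned unchanged. Intermediate values give a corpus whose noise level can be tuned. The resulting pairs could then be written out as aligned source and target files for training an MT system.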

Index Terms

Experiments with artificially generated noise for cleansing noisy text
1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Optical character recognition
2. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources

Published in
MOCR_AND '11: Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data
September 2011
144 pages
ISBN: 9781450306850
DOI: 10.1145/2034617
Conference Chairs: Lipika Dey (India), Venu Govindaraju (USA), Daniel Lopresti (USA), Prem Natarajan (USA), Christoph Ringlstetter (Germany), Shourya Roy (India)
Copyright © 2011 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
NLP
artificial noise
cleansing noisy text
generated noise
machine translation
noisy text analytics
Qualifiers
- research-article