ABSTRACT
The rapid proliferation of microblogs such as Twitter has made available a vast quantity of written text that contains useful information for NLP tasks. However, tweets are so noisy that standard NLP tools perform poorly on them. In this paper, we present a statistical truecaser for tweets that uses a 3-gram language model built from truecased newswire texts and tweets. Our truecasing method improves performance on named entity recognition and part-of-speech tagging tasks.
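To make the idea of language-model truecasing concrete, the following is a minimal sketch and not the implementation described in the paper. It assumes a tiny in-memory corpus of already truecased, tokenized sentences, interpolates 3-gram, 2-gram and 1-gram relative frequencies with fixed weights, and decodes greedily from left to right. The class name `TrigramTruecaser`, the interpolation weights, and the smoothing constant `alpha` are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


class TrigramTruecaser:
    """Toy truecaser: picks the casing variant favoured by a 3-gram model."""

    def __init__(self, alpha=0.1, weights=(0.6, 0.3, 0.1)):
        self.alpha = alpha              # additive smoothing for unigrams (assumed value)
        self.weights = weights          # trigram, bigram, unigram weights (assumed values)
        self.variants = defaultdict(Counter)  # lowercase form -> observed surface forms
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()
        self.hist1, self.hist2 = Counter(), Counter()  # history counts for bi/trigrams
        self.total = 0

    def train(self, sentences):
        """Collect n-gram counts from truecased, tokenized sentences."""
        for tokens in sentences:
            padded = ["<s>", "<s>"] + list(tokens)
            for i in range(2, len(padded)):
                u, v, w = padded[i - 2], padded[i - 1], padded[i]
                self.variants[w.lower()][w] += 1
                self.uni[w] += 1
                self.bi[(v, w)] += 1
                self.tri[(u, v, w)] += 1
                self.hist1[v] += 1
                self.hist2[(u, v)] += 1
                self.total += 1

    def _prob(self, u, v, w):
        """Interpolated probability of w given the two previous tokens."""
        l3, l2, l1 = self.weights
        vocab = len(self.uni) + 1  # one extra slot for unseen words
        p1 = (self.uni[w] + self.alpha) / (self.total + self.alpha * vocab)
        p2 = self.bi[(v, w)] / self.hist1[v] if self.hist1[v] else 0.0
        p3 = self.tri[(u, v, w)] / self.hist2[(u, v)] if self.hist2[(u, v)] else 0.0
        return l3 * p3 + l2 * p2 + l1 * p1

    def truecase(self, tokens):
        """Greedy decoding: choose the best casing for each token in turn."""
        out = ["<s>", "<s>"]
        for tok in tokens:
            # Tokens never seen in training are left unchanged.
            candidates = list(self.variants.get(tok.lower(), {tok: 1}))
            best = max(candidates, key=lambda c: self._prob(out[-2], out[-1], c))
            out.append(best)
        return out[2:]
```

Greedy decoding keeps the sketch short. A fuller decoder would search over whole casing sequences, for example with Viterbi or beam search, so that a later word can influence the casing of an earlier one. Tweet-specific tokens such as hashtags and @-mentions would also need their own handling, which this sketch omits.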
Index Terms
- ResToRinG CaPitaLiZaTion in #TweeTs