2018 | OriginalPaper | Book chapter

Turkish Normalization Lexicon for Social Media

Authors: Seniz Demir, Murat Tan, Berkay Topcu

Published in: Computational Linguistics and Intelligent Text Processing

Publisher: Springer International Publishing


Abstract

Social media has its own ever-growing language and distinct characteristics. Although social media has been shown to be of great utility to research studies, the varying quality of written texts degrades the performance of existing NLP tools. Text normalization, which transforms informal text into well-written text, appears to be a reasonable preprocessing step for adapting tools trained on other domains to social media. In this study, we compile the first Turkish normalization lexicon, which sheds light on the kinds of lexical variation observed in social media texts. A graph representation acquired from a text corpus is used to model contextual similarities between normalization equivalences, and the lexicon is generated automatically by performing random walks on this graph. The underlying framework not only enables different lexicons to be generated from the same corpus but also produces lexicons that are tuned to specific genres. Evaluation studies demonstrated the effectiveness of the induced lexicon in normalizing Turkish texts.
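The idea of inducing a lexicon by random walks over a contextual-similarity graph can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a bipartite graph whose two sides are words and their (left, right) neighbor contexts, a toy corpus, and a walk that stops at the first in-vocabulary word it reaches. The function names, the parameters `walks` and `max_hops`, and the example sentences are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def build_bipartite_graph(sentences):
    """Link every word to the (left, right) contexts it occurs in (assumed design)."""
    word_to_ctx = defaultdict(Counter)
    ctx_to_word = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(1, len(tokens) - 1):
            ctx = (tokens[i - 1], tokens[i + 1])
            word_to_ctx[tokens[i]][ctx] += 1
            ctx_to_word[ctx][tokens[i]] += 1
    return word_to_ctx, ctx_to_word


def weighted_choice(counter, rng):
    """Pick one key with probability proportional to its count."""
    items = list(counter.keys())
    weights = list(counter.values())
    return rng.choices(items, weights=weights, k=1)[0]


def random_walk_candidates(word_to_ctx, ctx_to_word, start, vocabulary,
                           walks=200, max_hops=4, seed=0):
    """Count how often walks from `start` end at an in-vocabulary word."""
    rng = random.Random(seed)
    hits = Counter()
    if start not in word_to_ctx:
        return hits
    for _ in range(walks):
        word = start
        for _ in range(max_hops):
            # Each hop alternates sides: word -> context -> word.
            ctx = weighted_choice(word_to_ctx[word], rng)
            word = weighted_choice(ctx_to_word[ctx], rng)
            if word in vocabulary:
                hits[word] += 1
                break
    return hits


# Toy corpus (hypothetical): "slm" and "gzl" are the noisy forms.
sentences = [
    "herkese selam olsun",
    "herkese slm olsun",
    "herkese merhaba olsun",
    "çok güzel olmuş",
    "çok gzl olmuş",
]
vocabulary = {"herkese", "selam", "olsun", "merhaba", "çok", "güzel", "olmuş"}

word_to_ctx, ctx_to_word = build_bipartite_graph(sentences)
candidates = random_walk_candidates(word_to_ctx, ctx_to_word, "slm", vocabulary)
print(candidates.most_common())
```

In this toy setting, contextual similarity alone cannot separate “selam” from “merhaba” for the noisy form “slm”, which suggests why a lexical similarity check is a natural complement to the walk.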



Since the graph representation models contextual similarities between individual words, OOV words that consist of more than one word due to omitted spaces (e.g., “şarkısözü”, which is actually “şarkı sözü” {lyrics}) are removed manually. Automatic handling of these cases is left as future work.

twitter4j.org.

http://www.kemik.yildiz.edu.tr/?id=28.

Punctuation characters are omitted when identifying n-gram sequences.

https://github.com/ahmetaa/zemberek-nlp.

Stop words (e.g., “ve” {and}) and very frequent words (e.g., “bir” {one}) were observed to have higher degrees.

In a bipartite graph, a step cannot be taken between nodes of the same partition.

More than two trials could be made in order to reduce the effect of randomness.

Handling these cases is left as future work.

In our evaluations, an edit distance of 2 was used.
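As an illustration of the edit distance threshold mentioned in the note above, here is a minimal Python sketch of Levenshtein distance with a cutoff of 2. The note does not say how the threshold is applied, so using it to filter candidate normalizations is an assumption, and the function names `edit_distance` and `within_edit_distance` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def within_edit_distance(noisy, candidates, max_distance=2):
    """Keep candidates no more than max_distance edits away (assumed use of the threshold)."""
    return [c for c in candidates if edit_distance(noisy, c) <= max_distance]


# Hypothetical candidates for the noisy form "slm".
kept = within_edit_distance("slm", ["selam", "merhaba"])
print(kept)
```

A threshold of this kind would drop candidates that share a context with the noisy form but are lexically unrelated to it.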


Title: Turkish Normalization Lexicon for Social Media
Authors: Seniz Demir, Murat Tan, Berkay Topcu
Publisher: Springer International Publishing
Book: Computational Linguistics and Intelligent Text Processing
Print ISBN: 978-3-319-75486-4
Electronic ISBN: 978-3-319-75487-1
Copyright year: 2018
DOI: https://doi.org/10.1007/978-3-319-75487-1_33
