2018 | Original Paper | Book Chapter

Turkish Normalization Lexicon for Social Media

Written by: Seniz Demir, Murat Tan, Berkay Topcu

Published in: Computational Linguistics and Intelligent Text Processing

Publisher: Springer International Publishing

Abstract

Social media has its own ever-growing language and distinct characteristics. Although social media has been shown to be of great utility to research studies, the varying quality of written text degrades the performance of existing NLP tools. Text normalization, which transforms informal text into well-written text, is a reasonable preprocessing step for adapting tools trained on other domains to social media. In this study, we compile the first Turkish normalization lexicon, which sheds light on the kinds of lexical variation observed in social media texts. A graph representation built from a text corpus models contextual similarities between normalization equivalences, and the lexicon is generated automatically by performing random walks on this graph. The underlying framework not only enables different lexicons to be generated from the same corpus but also produces lexicons tuned to specific genres. Evaluation studies demonstrated the effectiveness of the induced lexicon in normalizing Turkish texts.
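The abstract gives only the outline of the method, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes a bipartite graph (as footnote 7 indicates) whose two node sets are words and masked n-gram contexts, edge weights equal to co-occurrence counts, and a walk that stops at the first in-vocabulary word. The function names, the trigram window, and the walk parameters are all illustrative assumptions.

```python
import random
from collections import Counter, defaultdict


def build_bipartite_graph(sentences, n=3):
    """Link each word to the n-gram contexts it occurs in (center word masked).

    Assumption: contexts are n-grams with the center token replaced by "_".
    """
    word_to_ctx = defaultdict(Counter)
    ctx_to_word = defaultdict(Counter)
    half = n // 2
    for tokens in sentences:
        for i in range(half, len(tokens) - half):
            left = tuple(tokens[i - half:i])
            right = tuple(tokens[i + 1:i + half + 1])
            context = left + ("_",) + right
            word_to_ctx[tokens[i]][context] += 1
            ctx_to_word[context][tokens[i]] += 1
    return word_to_ctx, ctx_to_word


def weighted_choice(counter, rng):
    """Pick a key with probability proportional to its count."""
    keys = list(counter.keys())
    weights = [counter[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def random_walk_candidates(noisy, word_to_ctx, ctx_to_word, vocabulary,
                           walks=200, max_steps=4, seed=0):
    """Count how often walks starting at `noisy` end on an in-vocabulary word.

    Each step goes word -> context -> word, so a walk never moves between
    two nodes on the same side of the bipartite graph.
    """
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(walks):
        word = noisy
        for _ in range(max_steps):
            if word not in word_to_ctx:
                break
            context = weighted_choice(word_to_ctx[word], rng)
            word = weighted_choice(ctx_to_word[context], rng)
            if word in vocabulary:
                hits[word] += 1
                break
    return hits
```

Words that are reached most often from a noisy token are the strongest normalization candidates under this sketch; a real lexicon would additionally filter or rescore them, for instance by string similarity.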


Footnotes
1
Since the graph representation models contextual similarities between individual words, OOV tokens that merge more than one word because of omitted spaces (e.g., “şarkısözü”, which is in fact “şarkı sözü” {lyrics}) are removed manually. Automatic handling of these cases is left as future work.
 
4
The punctuation characters are omitted while identifying n-gram sequences.
 
6
Stop words (e.g., “ve” {and}) and very frequent words (e.g., “bir” {one}) were observed to have higher degrees.
 
7
In a bipartite graph, a step cannot be taken between two nodes in the same partition.
 
8
More than two trials could be made in order to reduce the effect of randomness.
 
9
Handling these cases is left as future work.
 
10
In our evaluations, an edit distance of 2 was used.
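Footnote 10 states only the threshold value, not how it is applied. One plausible reading, offered here as an assumption, is that a candidate is accepted only if its Levenshtein distance to the noisy token is at most 2. The sketch below illustrates that reading; `levenshtein` and `within_threshold` are hypothetical helper names, not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def within_threshold(noisy: str, candidate: str, max_distance: int = 2) -> bool:
    """Assumed filter: keep candidates at most `max_distance` edits away."""
    return levenshtein(noisy, candidate) <= max_distance
```

Under this assumption, a pair such as “cok” and “çok” (one substitution) passes the filter, while candidates that merely share a context but differ widely in spelling are rejected.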
 
Metadata
Title
Turkish Normalization Lexicon for Social Media
Written by
Seniz Demir
Murat Tan
Berkay Topcu
Copyright year
2018
DOI
https://doi.org/10.1007/978-3-319-75487-1_33