2015 | Original Paper | Book Chapter

New Word Detection and Tagging on Chinese Twitter Stream

Authors: Yuzhi Liang, Pengcheng Yin, S. M. Yiu

Published in: Big Data Analytics and Knowledge Discovery

Publisher: Springer International Publishing

Abstract

Twitter has become one of the critical channels for disseminating up-to-date information. Because the volume of tweets can be huge, an automatic system for analyzing them is desirable. The obstacle is that Twitter users often invent new words using non-standard rules, and these words appear in bursts within a short period of time. Existing new word detection methods cannot identify them effectively, and even when new words are identified, their meanings are difficult to understand. In this paper, we focus on Chinese Twitter, where the absence of natural word delimiters in a sentence makes the problem more difficult. To solve it, we derive an unsupervised new word detection framework that does not rely on training data. We then introduce automatic tagging for new word annotation, which tags each new word with known words according to our proposed tagging algorithm.

Here, 15 is an empirically chosen value; it could instead be estimated from statistical features such as the mean and standard deviation of the frequencies of all character sequences.
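As a rough illustration of how such a value might be derived from frequency statistics, the sketch below computes a cutoff as the mean plus a multiple of the standard deviation. The function name `frequency_threshold`, the multiplier `k`, and the sample counts are assumptions made for this example; the excerpt does not state which formula, if any, the authors would use.

```python
from statistics import mean, pstdev


def frequency_threshold(frequencies, k=1.0):
    """Return mean + k * (population) standard deviation of the frequencies.

    Both the formula and the default k are illustrative assumptions,
    not values taken from the paper.
    """
    if not frequencies:
        raise ValueError("frequencies must be non-empty")
    return mean(frequencies) + k * pstdev(frequencies)


# Hypothetical frequency counts of candidate character sequences.
counts = [2, 4, 4, 4, 5, 5, 7, 9]
threshold = frequency_threshold(counts, k=1.0)
print(threshold)
```

A larger `k` would make the cutoff stricter; choosing it would still require validation against data.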

TF-IDF is a numerical statistic that indicates the importance of a given word in a corpus. The score is TF \(\times \) IDF, where TF (term frequency) is a normalized term count and IDF (inverse document frequency) is inversely related to the proportion of documents in the corpus that contain \(w_i\).
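The TF-IDF score described above can be sketched as follows. This is a minimal version under stated assumptions: TF is the word count divided by document length, and IDF is the natural logarithm of the number of documents divided by the number of documents containing the word. The excerpt does not specify the exact normalization or logarithm the authors use, and the function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(word, doc, corpus):
    """TF-IDF of `word` in `doc`, where `corpus` is a list of token lists.

    TF  = count of word in doc / number of tokens in doc   (assumed form)
    IDF = ln(number of docs / number of docs containing word) (assumed form)
    """
    if not doc:
        return 0.0
    tf = Counter(doc)[word] / len(doc)
    df = sum(1 for d in corpus if word in d)
    if df == 0:
        return 0.0  # word never occurs in the corpus
    return tf * math.log(len(corpus) / df)


# Toy corpus of already-segmented documents (placeholder tokens).
corpus = [["a", "b", "a"], ["b", "c"], ["c", "d"]]
score = tf_idf("a", corpus[0], corpus)
print(score)
```

With this definition, a word that occurs in every document receives a score of zero, which matches the intuition that ubiquitous words carry little distinguishing information.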

\(Sim_{ccs} = \frac{Sim_{rawccs}-Min_{rawccs}}{Max_{rawccs}-Min_{rawccs}}\).
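The formula above is a min-max normalization that maps raw similarity scores onto the range from 0 to 1. A minimal sketch follows; the function name `min_max_normalize` and the handling of the case where all raw scores are equal (returning zeros) are assumptions for this example, since the excerpt does not cover that case.

```python
def min_max_normalize(raw_scores):
    """Rescale scores to [0, 1] via (s - min) / (max - min)."""
    lo, hi = min(raw_scores), max(raw_scores)
    if hi == lo:
        # Assumption: with no spread, map every score to 0.0
        # to avoid division by zero.
        return [0.0 for _ in raw_scores]
    return [(s - lo) / (hi - lo) for s in raw_scores]


# Hypothetical raw similarity scores.
normalized = min_max_normalize([2.0, 4.0, 6.0])
print(normalized)
```

After normalization, the smallest raw score maps to 0 and the largest to 1, which makes scores from different measures comparable on a common scale.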

Title: New Word Detection and Tagging on Chinese Twitter Stream
Authors: Yuzhi Liang, Pengcheng Yin, S. M. Yiu
Publisher: Springer International Publishing
Book: Big Data Analytics and Knowledge Discovery
Print ISBN: 978-3-319-22728-3
Electronic ISBN: 978-3-319-22729-0
Copyright year: 2015
DOI: https://doi.org/10.1007/978-3-319-22729-0_24
