2016 | OriginalPaper | Chapter

LiCord: Language Independent Content Word Finder

Authors : Md-Mizanur Rahoman, Tetsuya Nasukawa, Hiroshi Kanayama, Ryutaro Ichise

Published in: Hybrid Artificial Intelligent Systems

Publisher: Springer International Publishing

Abstract

Content Words (CWs) are important segments of text. In text mining, they are used for various purposes such as topic identification, document summarization, and question answering. Usually, identifying CWs requires various language-dependent tools. However, such tools are not available for many languages, and developing them for all languages is costly. On the other hand, because of the recent growth of text content in various languages, language-independent text mining has great potential. To mine text automatically, finding CWs without language-specific tools is a requirement. In this research, we devise a framework that classifies text segments as CWs in a language-independent way. We identify some structural features that relate text segments to CWs. We compute the features over a large text corpus and apply machine learning-based classification to classify the segments as CWs. The proposed framework uses only a large text corpus and some training examples; apart from these, it does not require any language-specific tool. We conducted experiments with our framework on three languages, English, Vietnamese, and Indonesian, and found that it works with more than 83% accuracy.
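
The pipeline described in the abstract (corpus-derived structural features followed by supervised classification) can be illustrated with a minimal sketch. The abstract does not name the features or the classifier, so everything specific here is an assumption made for illustration only: whitespace tokenization, the three features (relative frequency and left/right neighbor diversity), the nearest-centroid classifier, the toy corpus, and the function names `structural_features`, `train_centroids`, and `classify`. It should not be read as the authors' method.

```python
from collections import Counter, defaultdict


def structural_features(corpus_sentences):
    """Map each word to (relative frequency, left diversity, right diversity).

    Assumed, illustrative features: diversity is the number of distinct
    neighboring words divided by the word's own frequency.
    """
    freq = Counter()
    left = defaultdict(set)
    right = defaultdict(set)
    total = 0
    for sentence in corpus_sentences:
        tokens = sentence.lower().split()  # assumption: whitespace tokenization
        for i, tok in enumerate(tokens):
            freq[tok] += 1
            total += 1
            if i > 0:
                left[tok].add(tokens[i - 1])
            if i + 1 < len(tokens):
                right[tok].add(tokens[i + 1])
    return {
        w: (freq[w] / total, len(left[w]) / freq[w], len(right[w]) / freq[w])
        for w in freq
    }


def train_centroids(features, labeled_words):
    """Average the feature vectors of the training examples per label."""
    sums = {}
    counts = Counter()
    for word, label in labeled_words.items():
        vec = features[word]
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, value in enumerate(vec):
            acc[j] += value
        counts[label] += 1
    return {
        label: tuple(s / counts[label] for s in acc)
        for label, acc in sums.items()
    }


def classify(vec, centroids):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sqdist(vec, centroids[label]))


# Toy data, hypothetical and far too small for real use.
toy_corpus = ["the cat sat on the mat", "the dog sat on the rug"]
features = structural_features(toy_corpus)
centroids = train_centroids(
    features,
    {"the": "function", "on": "function", "cat": "content", "dog": "content"},
)
print(classify(features["sat"], centroids))
```

Nothing in this sketch depends on a tagger, parser, or lexicon; it needs only raw text and a handful of labeled examples, which is the property the abstract emphasizes. A realistic setup would use a much larger corpus and a stronger classifier.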

In what follows, we use n-gram(s) to mean word n-gram(s).

http://www2.fs.u-bunkyo.ac.jp/~gilner/wordlists.html#functionwords.

https://translate.google.com/.

http://www.speech.sri.com/projects/srilm/.

More accurately, the DBpedia annotator. DBpedia works as a structured version of Wikipedia; it can be found at http://dbpedia.org/about/.

http://cogcomp.cs.illinois.edu/page/demo_view/Wikifier (for the example of GoogleChina), and http://dbpediaspotlight.github.io/demo/, respectively.

http://nlp.stanford.edu/software/lex-parser.shtml.

Title: LiCord: Language Independent Content Word Finder
Authors: Md-Mizanur Rahoman, Tetsuya Nasukawa, Hiroshi Kanayama, Ryutaro Ichise
Publisher: Springer International Publishing
Book: Hybrid Artificial Intelligent Systems
Print ISBN: 978-3-319-32033-5

Electronic ISBN: 978-3-319-32034-2

Copyright Year: 2016
DOI: https://doi.org/10.1007/978-3-319-32034-2_4
