Published in: Wireless Personal Communications 4/2018

8 February 2018

A Sample Extension Method Based on Wikipedia and Its Application in Text Classification

Authors: Wenhao Zhu, Yiting Liu, Guannan Hu, Jianyue Ni, Zhiguo Lu


Abstract

Text classification is a topic in natural language processing that is particularly useful for Internet information processing. Methods based on supervised learning require a large number of manually annotated training samples. Annotating training samples is time-consuming, and classifier performance relies heavily on the quality of those samples. This paper presents a text classification method based on sample extension. The extension is based on the correlation between the labeled sample data and the concepts in Wikipedia. Using the rich link relationships between concepts, we select appropriate articles from Wikipedia to expand the training sample set. By drawing on the large number of semantically rich concept pages in Wikipedia, together with the links that relate different pages, our approach improves the performance and generalization of the classifier. Experiments demonstrate that the proposed method performs better than both supervised and semi-supervised methods.
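The abstract does not state which correlation measure, thresholds, or link weighting the authors use, so the following is only a minimal sketch of the general idea of sample extension, not the paper's algorithm. It assumes bag-of-words cosine similarity as the correlation measure and a fixed bonus for each linked page that independently matches the same label. All names (`extend_samples`, `threshold`, `link_bonus`) and the toy data are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def extend_samples(labeled, articles, links, threshold=0.3, link_bonus=0.1):
    """Return the labeled samples plus selected concept pages.

    labeled:  list of (text, label) pairs
    articles: dict title -> text (stand-in for Wikipedia concept pages)
    links:    dict title -> set of titles that the page links to
    threshold and link_bonus are assumed values, not taken from the paper.
    """
    # First pass: closest labeled sample (and its label) for every page.
    base = {}
    for title, text in articles.items():
        vec = bow(text)
        best_text, best_label = max(labeled, key=lambda s: cosine(vec, bow(s[0])))
        base[title] = (best_label, cosine(vec, bow(best_text)))

    # Second pass: linked pages that pass the threshold on their own
    # and carry the same label raise a page's score.
    extended = list(labeled)
    for title, (label, score) in base.items():
        support = sum(
            1
            for other in links.get(title, ())
            if other in base and base[other][0] == label and base[other][1] >= threshold
        )
        if score + link_bonus * support >= threshold:
            extended.append((articles[title], label))
    return extended


# Toy data, invented for illustration only.
labeled = [
    ("football match goal team", "sports"),
    ("stock market shares price", "finance"),
]
articles = {
    "Soccer": "football team goal league",
    "Striker": "player who scores goal",
    "Bond": "market price of debt shares",
    "Cooking": "recipe oven bake flour",
}
links = {"Striker": {"Soccer"}}

extended = extend_samples(labeled, articles, links)
```

In this toy run, the "Striker" page falls just below the similarity threshold on its own and is admitted only because it links to "Soccer", which matches the same label. That is the role the abstract attributes to link relationships. The extended set can then be passed to any standard classifier in place of the original labeled samples.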


Metadata
Title
A Sample Extension Method Based on Wikipedia and Its Application in Text Classification
Authors
Wenhao Zhu
Yiting Liu
Guannan Hu
Jianyue Ni
Zhiguo Lu
Publication date
8 February 2018
Publisher
Springer US
Published in
Wireless Personal Communications / Issue 4/2018
Print ISSN: 0929-6212
Electronic ISSN: 1572-834X
DOI
https://doi.org/10.1007/s11277-018-5416-z
