Published in: Soft Computing 18/2018

09-03-2018 | Focus

Wikipedia-based hybrid document representation for textual news classification

Authors: Marcos Antonio Mouriño-García, Roberto Pérez-Rodríguez, Luis Anido-Rifón, Manuel Vilares-Ferro

Abstract

The sheer number of news items published every day makes automating their classification a worthwhile task. The common approach consists of representing news items by the frequency of the words they contain and using supervised learning algorithms to train a classifier. This bag-of-words (BoW) approach is oblivious to three aspects of natural language: synonymy, polysemy, and multiword terms. More sophisticated representations based on concepts (units of meaning) have been proposed, following the intuition that document representations that better capture the semantics of text will lead to higher performance in automatic classification tasks. In practice, however, the BoW representation has proven to be very strong when classifying news items, with several studies reporting that it outperforms different ‘flavours’ of bag of concepts (BoC). In this paper, we propose a hybrid classifier that enriches the traditional BoW representation with concepts extracted from text, leveraging Wikipedia as background knowledge for the semantic analysis of text (WikiBoC). We benchmarked the proposed classifier, comparing it with BoW and several BoC approaches: Latent Dirichlet Allocation (LDA), Explicit Semantic Analysis, and word embeddings (doc2vec). We used two corpora: the well-known Reuters-21578, composed of newswire items, and a new corpus created specifically for this study, the Reuters-27000. Results show that (1) the performance of concept-based classifiers is very sensitive to the corpus used, being higher on the more “concept-friendly” Reuters-27000; (2) the proposed Hybrid-WikiBoC approach offers performance increases over BoW of up to 4.12% and 49.35% when classifying the Reuters-21578 and Reuters-27000 corpora, respectively; and (3) in terms of average performance, the proposed Hybrid-WikiBoC outperforms all the other classifiers, achieving a performance increase of 15.56% over the best state-of-the-art approach (LDA) for the largest training sequence.
Results indicate that concepts extracted with the help of Wikipedia add useful information that improves classification performance for news items.
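
The hybrid idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `CONCEPT_LEXICON`, `bag_of_concepts`, and `hybrid_features`, the `w:`/`c:` feature prefixes, and the greedy phrase matching are all assumptions made for illustration, and the toy lexicon merely stands in for a Wikipedia-based semantic annotator. The sketch shows how word counts and concept counts might be merged into a single feature vector, so that synonyms and multiword terms map to shared concept features while the original words are retained.

```python
import re
from collections import Counter

# Hypothetical toy lexicon mapping surface phrases to concept identifiers.
# It stands in for a Wikipedia-based annotator, which is not reproduced here.
CONCEPT_LEXICON = {
    "interest rate": "Interest_rate",
    "interest rates": "Interest_rate",
    "federal reserve": "Federal_Reserve",
    "the fed": "Federal_Reserve",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_concepts(tokens, lexicon=CONCEPT_LEXICON, max_len=3):
    """Count concepts using greedy longest-phrase matching against the lexicon."""
    concepts = Counter()
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                concepts[lexicon[phrase]] += 1
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts at this token
    return concepts


def hybrid_features(text):
    """Merge word counts and concept counts into one prefixed feature bag."""
    tokens = tokenize(text)
    features = Counter()
    for word, count in Counter(tokens).items():
        features["w:" + word] = count
    for concept, count in bag_of_concepts(tokens).items():
        features["c:" + concept] = count
    return features


doc_a = hybrid_features("The Fed raised interest rates.")
doc_b = hybrid_features("Federal Reserve lifts the interest rate.")
shared_concepts = {k for k in doc_a if k.startswith("c:")} & set(doc_b)
```

In this toy example the two sentences share only one word feature (`w:the` and `w:interest` aside, the key terms differ), yet both receive the same two concept features. The resulting feature bags could then be vectorized and passed to any supervised classifier.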

Metadata
Title
Wikipedia-based hybrid document representation for textual news classification
Authors
Marcos Antonio Mouriño-García
Roberto Pérez-Rodríguez
Luis Anido-Rifón
Manuel Vilares-Ferro
Publication date
09-03-2018
Publisher
Springer Berlin Heidelberg
Published in
Soft Computing / Issue 18/2018
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI
https://doi.org/10.1007/s00500-018-3101-5
