Top

Soft Computing

Published in:

09-03-2018 | Focus

Wikipedia-based hybrid document representation for textual news classification

Authors: Marcos Antonio Mouriño-García, Roberto Pérez-Rodríguez, Luis Anido-Rifón, Manuel Vilares-Ferro

Published in: Soft Computing | Issue 18/2018

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

The sheer amount of news items that are published every day makes worth the task of automating their classification. The common approach consists in representing news items by the frequency of the words they contain and using supervised learning algorithms to train a classifier. This bag-of-words (BoW) approach is oblivious to three aspects of natural language: synonymy, polysemy, and multiword terms. More sophisticated representations based on concepts—or units of meaning—have been proposed, following the intuition that document representations that better capture the semantics of text will lead to higher performance in automatic classification tasks. The reality is that, when classifying news items, the BoW representation has proven to be really strong, with several studies reporting it to perform above different ‘flavours’ of bag of concepts (BoC). In this paper, we propose a hybrid classifier that enriches the traditional BoW representation with concepts extracted from text—leveraging Wikipedia as background knowledge for the semantic analysis of text (WikiBoC). We benchmarked the proposed classifier, comparing it with BoW and several BoC approaches: Latent Dirichlet Allocation (LDA), Explicit Semantic Analysis, and word embeddings (doc2vec). We used two corpora: the well-known Reuters-21578, composed of newswire items, and a new corpus created ex professo for this study: the Reuters-27000. Results show that (1) the performance of concept-based classifiers is very sensitive to the corpus used, being higher in the more “concept-friendly” Reuters-27000; (2) the Hybrid-WikiBoC approach proposed offers performance increases over BoW up to 4.12 and 49.35% when classifying Reuters-21578 and Reuters-27000 corpora, respectively; and (3) for average performance, the proposed Hybrid-WikiBoC outperforms all the other classifiers, achieving a performance increase of 15.56% over the best state-of-the-art approach (LDA) for the largest training sequence. Results indicate that concepts extracted with the help of Wikipedia add useful information that improves classification performance for news items.

previous article Elephant search algorithm applied to data clustering

next article Adaptable multi-phase rules over the infrequent class

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

http://scikit-learn.org/stable/ holds documentation and usage examples of SVM, RF, and NB classification algorithms.

The source code is freely available at https://github.com/faraday/wikiprep-esa.

https://radimrehurek.com/gensim/.

https://github.com/jhlau/doc2vec.

Models are freely available at http://www.itec-sde.net/doc2vec_reuters27000_reuters21578_models.zip.

Arif MH, Li J, Iqbal M, Liu K (2017) Sentiment analysis and spam detection in short informal text using learning classifier systems. Soft Comput 1–11. https://doi.org/10.1007/s00500-017-2729-x

Bekkerman R, El-Yaniv R, Tishby N, Winter Y (2003) Distributional word clusters vs. words for text categorization. J Mach Learn Res 3:1183–1208MATH

Bengio Y, Ducharme R, Vincent P, Jauvin C (2003) A neural probabilistic language model. J Mach Learn Res 3:1137–1155MATH

Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3:993–1022MATH

Breiman L (2001) Random forests. Mach Learn 45(1):5–32CrossRefMATH

Cai L, Hofmann T (2003) Text categorization by boosting automatically extracted concepts. In: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval. ACM, pp 182–189

Chang MW, Ratinov LA, Roth D, Srikumar V (2008) Importance of semantic representation: dataless classification. AAAI 2:830–835

Colace F, De Santo M, Greco L, Napoletano P (2014) Text classification using a few labeled examples. Comput Hum Behav 30:689–697CrossRef

De Smet W, Tang J, Moens MF (2011) Knowledge transfer across multilingual corpora via latent topics. Adv Knowl Discov Data Min 549–560. https://doi.org/10.1007/978-3-642-20841-6_45

Deerwester SC, Dumais ST, Landauer TK, Furnas GW, Harshman RA (1990) Indexing by latent semantic analysis. JAsIs 41(6):391–407CrossRef

Egozi O, Markovitch S, Gabrilovich E (2011) Concept-based information retrieval using explicit semantic analysis. ACM Trans Inf Syst (TOIS) 29(2):8CrossRef

Elberrichi Z, Rahmoun A, Bentaallah MA (2008) Using wordnet for text categorization. Int Arab J Inf Technol 5(1):16–24

Gabrilovich E, Markovitch S (2007) Computing semantic relatedness using wikipedia-based explicit semantic analysis. IJCAI 7:1606–1611

Gabrilovich E, Markovitch S (2009) Wikipedia-based semantic interpretation for natural language processing. J Artif Intell Res 34:443–498CrossRefMATH

Hearst MA, Dumais ST, Osman E, Platt J, Scholkopf B (1998) Support vector machines. IEEE Intell Syst Appl 13(4):18–28CrossRef

Huang A, Milne D, Frank E, Witten IH (2009) Clustering documents using a wikipedia-based concept representation. In: Advances in knowledge discovery and data mining. Springer, pp 628–636

Huang L, Milne D, Frank E, Witten IH (2012) Learning a concept-based document similarity measure. J Am Soc Inform Sci Technol 63(8):1593–1608CrossRef

Jadhav BR, Mahajan M, GHR CEM W, (2016) Dual sentiment analysis using adaboost algorithm sentiment analysis. Int J Eng Sci 6(6):7641–7645

Jiang M, Cao J-Z (2016) Positive-unlabeled learning for pupylation sites prediction. BioMed Res Int 2016:4525786 https://doi.org/10.1155/2016/4525786

Jin P, Zhang Y, Chen X, Xia Y (2016) Bag-of-embeddings for text classification. Int Jt Conf Artif Intell 25:2824–2830

Khan A, Baharudin B, Lee LH, Khan K (2010) A review of machine learning algorithms for text-documents classification. J Adv Inf Technol 1(1):4–20

Kim HK, Kim M (2016) Model-induced term-weighting schemes for text classification. Appl Intell 45(1):30–43CrossRef

Kim H, Howland P, Park H (2005) Dimension reduction in text classification with support vector machines. J Mach Learn Res 6:37–53MathSciNetMATH

King RD, Feng C, Sutherland A (1995) Statlog: comparison of classification algorithms on large real-world problems. Appl Artif Intell Int J 9(3):289–333CrossRef

Kozielski S, Mrozek D, Kasprowski P, Kostrzewa D et al (2015) Beyond databases, architectures and structures. Springer, BerlinCrossRef

Lau JH, Baldwin T (2016) An empirical evaluation of doc2vec with practical insights into document embedding generation. ACL 2016:78

Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: Proceedings of the 31st international conference on machine learning (ICML-14), pp 1188–1196

Lewis DD (1998) Naive (bayes) at forty: the independence assumption in information retrieval. In: Machine learning: ECML-98. Springer, pp 4–15

Li J, Fong S, Zhuang Y, Khoury R (2016) Hierarchical classification in text mining for sentiment analysis of online news. Soft Comput 20(9):3411–3420CrossRef

Manimala K, David IG, Selvi K (2015) A novel data selection technique using fuzzy c-means clustering to enhance svm-based power quality classification. Soft Comput 19(11):3123–3144CrossRef

Mekala D, Gupta V, Karnick H (2016) Text classification with sparse composite document vectors. arXiv preprint arXiv:1612.06778

Mihalcea R, Corley C, Strapparava C et al (2006) Corpus-based and knowledge-based measures of text semantic similarity. AAAI 6:775–780

Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781

Milne D, Witten IH (2013) An open-source toolkit for mining wikipedia. Artif Intell 194:222–239MathSciNetCrossRef

Ming ZY, Chua TS (2015) Resolving polysemy and pseudonymity in entity linking with comprehensive name and context modeling. Inf Sci 307:18–38CrossRef

Mogadala A, Rettinger A (2016) Bilingual word embeddings from parallel and non-parallel corpora for cross-language text classification. In: Proceedings of NAACL-HLT, pp 692–702

Moise G, Vladoiu M, Constantinescu Z (2014) Maseco: a multi-agent system for evaluation and classification of oers and ocw based on quality criteria. In: E-Learning paradigms and applications. Springer, pp 185–227

Mouriño García MA, Pérez Rodríguez R, Anido Rifón LE (2015) Biomedical literature classification using encyclopedic knowledge: a wikipedia-based bag-of-concepts approach. Peer J 3:e1279CrossRef

Mouriño-García M, Pérez-Rodríguez R, Anido-Rifón L, Gómez-Carballa M (2016a) Bag-of-concepts document representation for bayesian text classification. In: 2016 IEEE international conference on computer and information technology (CIT). IEEE, pp 281–288

Mouriño García MA, Pérez Rodríguez R, Anido Rifón L (2016) Reuters 27000 corpus. URL http://dx.doi.org/10.17632/3cw44dk29f.2

Mouriño-García MA, Pérez-Rodríguez R, Anido-Rifón L (2017) Wikipedia-based cross-language text classification. Inf Sci 406–407:12–28. https://doi.org/10.1016/j.ins.2017.04.024 CrossRef

Nezreg H, Lehbab H, Belbachir H (2014) Conceptual representation using wordnet for text categorization. Int J Comput Commun Eng 3(1):27CrossRef

Ni X, Sun JT, Hu J, Chen Z (2011) Cross lingual text classification by mining multilingual topics from wikipedia. In: Proceedings of the fourth ACM international conference on Web search and data mining. ACM, pp 375–384

Nigam K, McCallum AK, Thrun S, Mitchell T (2000) Text classification from labeled and unlabeled documents using em. Mach Learn 39(2–3):103–134CrossRefMATH

Pavlinek M, Podgorelec V (2017) Text classification method based on self-training and lda topic models. Expert Syst Appl 80:83–93CrossRef

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830MathSciNetMATH

Porter MF (1980) An algorithm for suffix stripping. Program 14(3):130–137CrossRef

Rehurek R, Sojka P (2010) Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks, Citeseer

Rodrigues F, Lourenco M, Ribeiro B, Pereira FC (2017) Learning supervised topic models for classification and regression from crowds. IEEE Trans Pattern Anal Mach Intell 39(12):2409–2422CrossRef

Rose T, Stevenson M, Whitehead M (2002) The reuters corpus volume 1-from yesterday’s news to tomorrow’s language resources. LREC 2:827–832

Roul RK, Asthana SR, Kumar G (2017) Study on suitability and importance of multilayer extreme learning machine for classification of text data. Soft Comput 21(15):4239–4256CrossRef

Sahlgren M, Cöster R (2004) Using bag-of-concepts to improve the performance of support vector machines in text categorization. In: Proceedings of the 20th international conference on Computational Linguistics. Association for Computational Linguistics, p 487

Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv (CSUR) 34(1):1–47CrossRef

Selamat A, Yanagimoto H, Omatu S (2002) Web news classification using neural networks based on pca. In: SICE 2002. Proceedings of the 41st SICE annual conference, vol 4. IEEE, pp 2389–2394

Settles B (1994) Active learning literature survey. Mach Learn 15(2):201–221

Singh A, Chhillar SK (2017) News category classification using distinctive bag of words and ann classifier. Int J Emerg Res Manag Technol 6(6):311–317CrossRef

Stock WG (2010) Concepts and semantic relations in information science. J Am Soc Inform Sci Technol 61(10):1951–1969CrossRef

Van TP, Thanh TM (2017) Vietnamese news classification based on bow with keywords extraction and neural network. In: 2017 21st Asia Pacific symposium on intelligent and evolutionary systems (IES). IEEE, pp 43–48

Vulić I, De Smet W, Tang J, Moens MF (2015) Probabilistic topic modeling in multilingual settings: an overview of its methodology and applications. Inf Process Manag 51(1):111–147CrossRef

Wang P, Hu J, Zeng HJ, Chen Z (2009) Using wikipedia knowledge to improve text classification. Knowl Inf Syst 19(3):265–281CrossRef

Wenliang C, Xingzhi C, Huizhen W, Jingbo Z, Tianshun Y (2004) Automatic word clustering for text categorization using global information. In: Asia information retrieval symposium. Springer, pp 1–11

Yao D, Bi J, Huang J, Zhu J (2015) A word distributed representation based framework for large-scale short text classification. In: 2015 international joint conference on neural networks (IJCNN)

Yousif SA, Samawi VW, Elkabani I, Zantout R (2015) The effect of combining different semantic relations on arabic text classification. World Comput Sci Inf Technol J 5(1):12–118

Zhang H (2004) The optimality of naive bayes. AA 1(2):3

Title: Wikipedia-based hybrid document representation for textual news classification
Authors: Marcos Antonio Mouriño-García
Roberto Pérez-Rodríguez
Luis Anido-Rifón
Manuel Vilares-Ferro
Publication date: 09-03-2018
Publisher: Springer Berlin Heidelberg
Published in: Soft Computing / Issue 18/2018
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI: https://doi.org/10.1007/s00500-018-3101-5

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Other articles of this Issue 18/2018

A novel differential evolution algorithm with a self-adaptation parameter control method by differential evolution

Optimization of thermal parameters in a double pipe heat exchanger with a twisted tape using response surface methodology

Biclustering with a quantum annealer

Semantic Twitter sentiment analysis based on a fuzzy thesaurus

Handling dropout probability estimation in convolution neural networks using meta-heuristics

Fractional-order PID control of a chopper-fed DC motor drive using a novel firefly algorithm with dynamic control mechanism

Premium Partner