Elsevier

Neurocomputing

Volume 55, Issues 3–4, October 2003, Pages 665-679

LVQ for text categorization using a multilingual linguistic resource

https://doi.org/10.1016/S0925-2312(02)00633-1

Abstract

Neural learning has been used effectively in natural language processing tasks. In particular, the Widrow–Hoff and Kivinen–Warmuth exponentiated gradient algorithms, both based on neural learning rules, have been used in text categorization, improving on the results obtained by the well-known Rocchio algorithm. The high performance of competitive learning algorithms, recently applied to information retrieval problems, leads us to use them for text categorization. This paper presents a multilingual categorization system based on neural learning, using the polyglot Bible in Spanish and English as the training collection. The method we suggest uses the LVQ algorithm to build a classifier that learns the multilingual training collection. We have performed experiments with the four algorithms, which show that the ideas we describe are promising and worth further investigation.

Introduction

Nowadays, about 90% of companies’ information is found in text format [4]. We can find text in documents, manuals, reports, circulars, e-mails and also in Web pages. It is estimated that the amount of text available in Web pages is about a terabyte [1]. Therefore, it is essential to manage this enormous amount of information automatically by using natural language processing (NLP) techniques.

One of the main tasks of NLP is the automatic analysis of content, which can be defined as a group of techniques to examine information objects and provide subsequent access to them [34]. Text categorization is a very interesting task for NLP, which consists of assigning one or more pre-existing categories to each document [17]. Text categorization has been the focus of much research in recent years, and most categorization systems use collections of training documents to predict the categories of new documents [35], [26], [36]. With the increasing availability of electronically accessible information, the use of linguistic resources in text categorization is receiving more and more attention. These resources include training corpora and lexical databases. Training corpora, such as Reuters-21578 [17], Ohsumed [9] and TREC [30], are manually labelled document collections. Lexical databases, such as WordNet [22], EuroWordNet [32] and EDR [37], are repositories that accumulate information about the lexical items of one or several languages and have been integrated successfully in text categorization [31], [7].

One of the most widely used algorithms in text categorization is the Rocchio algorithm [25]. It is a very simple algorithm with which good results have been obtained [29], [7]. However, recent work by Lewis et al. [18] shows that two machine learning algorithms, Widrow–Hoff (WH) [33] and Kivinen–Warmuth (KW) [13], are more effective than the widely used Rocchio algorithm in several categorization and routing tasks.

Competitive learning algorithms based on the Kohonen model, in turn, have been used successfully in NLP tasks [21], [16]. The Kohonen model [14] has two variants: the Self-Organizing Map (SOM) and Learning Vector Quantization (LVQ). Although both use competitive learning, the main difference is that SOM learning is unsupervised, while LVQ learning is supervised.
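To make the idea of supervised competitive learning concrete, the following is a minimal sketch of the basic LVQ1 rule as it is commonly described: the prototype nearest to a training vector wins, and it is pulled toward the vector when their labels agree and pushed away otherwise. This excerpt does not specify which LVQ variant, distance measure or learning-rate schedule the paper uses, so the squared Euclidean distance, the linearly decaying rate and all function names below are assumptions made for illustration only.

```python
def nearest(prototypes, x):
    """Index of the prototype closest to x (squared Euclidean distance, an assumption)."""
    def dist2(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(prototypes)), key=lambda i: dist2(prototypes[i][0]))

def lvq1_step(prototypes, x, label, alpha):
    """One LVQ1 update: only the winning prototype moves."""
    i = nearest(prototypes, x)
    w, w_label = prototypes[i]
    # Attract the winner if its class matches the sample, repel it otherwise.
    sign = 1.0 if w_label == label else -1.0
    prototypes[i] = ([wi + sign * alpha * (xi - wi) for wi, xi in zip(w, x)], w_label)
    return i

def train_lvq1(prototypes, samples, epochs=20, alpha0=0.3):
    """Repeated passes over the samples with a linearly decaying rate (hypothetical schedule)."""
    for epoch in range(epochs):
        alpha = alpha0 * (1.0 - epoch / epochs)
        for x, label in samples:
            lvq1_step(prototypes, x, label, alpha)
    return prototypes

def classify(prototypes, x):
    """Assign the class of the nearest prototype."""
    return prototypes[nearest(prototypes, x)][1]

# Toy two-class data; each prototype is a (weight vector, class label) pair.
samples = [([1.0, 0.0], "A"), ([0.9, 0.1], "A"), ([0.0, 1.0], "B"), ([0.1, 0.9], "B")]
prototypes = train_lvq1([([0.6, 0.4], "A"), ([0.4, 0.6], "B")], samples)
```

An unsupervised SOM would update the winner (and its neighbours) toward every input regardless of label; the sign flip driven by the class label is what makes this rule supervised.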

Although this type of neural network is highly versatile, which allows us to classify all types of information from literary [10] to economic [12], the Kohonen model has two limitations. Firstly, the learning process is long and hard, and secondly, it is necessary to repeat the whole learning process to learn new data.

This paper proposes the use of a competitive learning algorithm to learn from a collection of training documents for text categorization. Since the task calls for supervised learning, we have chosen the Kohonen LVQ algorithm. In order to test the effectiveness of the LVQ algorithm, we have carried out experiments with a resource that is widely distributed, freely available and translated into all languages: the Bible. We have compared the Rocchio, WH, KW and LVQ algorithms. The results obtained show that the algorithms based on neural learning are more effective than Rocchio's algorithm, and that the LVQ algorithm, which uses a competitive learning rule, obtains the best accuracy.

Research on information retrieval (IR) relies on several models, such as the vector space model (VSM), the probabilistic model and the boolean model. In this paper, we will use the VSM, which is considered an effective model in the IR field [28].

The rest of the paper is organized as follows. Firstly, the text categorization (TC) task is presented, with a brief introduction to the VSM. Secondly, the polyglot Bible is described as a multilingual linguistic resource for TC. Then, the four algorithms used in the experiments are described. In Section 5, the experiments carried out to evaluate the algorithms' performance are detailed. The results of this evaluation are described in Section 6, closing with the conclusions and future lines of work.

Text categorization with VSM

VSM was originally developed for IR, although it can also be used in other tasks such as TC [3]. The aim of VSM for IR is to represent a natural language expression as a weight vector of terms, where each weight measures the importance of the term in the natural language expression, which can be either a document or a query. The semantic proximity between documents and queries is calculated as the cosine of the angle formed by their vectors.
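As an illustration of the cosine measure described above, here is a minimal sketch that represents texts as sparse term-weight vectors and compares them. The weighting scheme is not given in this excerpt, so raw term frequency is used purely as a stand-in, and the names `term_weights` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def term_weights(text):
    """Raw term-frequency weights; an assumed stand-in for the paper's weighting scheme."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse weight vectors (term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_u * norm_v)

doc = term_weights("in the beginning god created the heaven and the earth")
query = term_weights("god created the earth")
unrelated = term_weights("neural networks learn prototypes")

print(cosine(doc, query))      # shared terms give a similarity between 0 and 1
print(cosine(doc, unrelated))  # no shared terms give 0.0
```

Because the cosine depends only on the angle between the vectors, a long document and a short query can still score as close when they share the same dominant terms.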

In an analogous way, we can consider in TC that a document belongs

Parallel corpora as multilingual resource for text categorization

Parallel corpora are a multilingual resource, increasingly available on the Internet [8], which are being used in NLP for different tasks such as disambiguation [5], IR [23], automatic translation [20], etc.

The Polyglot Bible [24] is one of these resources, with some especially attractive characteristics: it is freely available, it is translated into all languages, its translations are practically perfect and it is divided into verses.

We propose in this paper a text classifier in Spanish and English, based on a

Training algorithms for text categorization

Although there are many algorithms used in text classification [7], [19], we have selected the Rocchio, WH and KW algorithms because they have been used in many other studies [3], [31], [18] and therefore we can contrast their efficiency with that of the LVQ algorithm.
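For orientation, the three baseline algorithms can be sketched in their commonly cited textbook forms for learning a linear classifier for a single category. This excerpt does not give the exact formulations or parameter values used in the experiments, so the code below is an assumption-laden sketch: the function names, the `beta` and `gamma` defaults, and the simplification of Rocchio to a zero initial vector are all illustrative choices rather than the authors' settings.

```python
import math

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def rocchio(positives, negatives, beta=16.0, gamma=4.0):
    """Batch prototype: weighted mean of positives minus weighted mean of negatives,
    with negative components clipped to zero (beta and gamma are illustrative)."""
    dim = len(positives[0])
    proto = []
    for j in range(dim):
        pos = sum(x[j] for x in positives) / len(positives)
        neg = sum(x[j] for x in negatives) / len(negatives) if negatives else 0.0
        proto.append(max(0.0, beta * pos - gamma * neg))
    return proto

def widrow_hoff_step(w, x, y, eta):
    """Additive online update: gradient step on the squared error (w.x - y)^2."""
    err = dot(w, x) - y
    return [wi - 2.0 * eta * err * xi for wi, xi in zip(w, x)]

def exponentiated_gradient_step(w, x, y, eta):
    """Multiplicative online update; weights stay positive and are renormalized to sum to 1."""
    err = dot(w, x) - y
    unnorm = [wi * math.exp(-2.0 * eta * err * xi) for wi, xi in zip(w, x)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# One update of each online rule on a toy positive example (y = 1).
w_wh = widrow_hoff_step([0.0, 0.0], [1.0, 0.0], 1.0, eta=0.25)
w_eg = exponentiated_gradient_step([0.5, 0.5], [1.0, 0.0], 1.0, eta=0.5)
```

The contrast worth noting is that Rocchio computes its prototype in one batch pass, whereas WH and KW revise the weight vector one training document at a time, additively in the first case and multiplicatively in the second.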

Corpus generation

In order to compare the four algorithms, we have used a group of files generated from the bilingual Bible. As the Bible is divided into books, chapters and verses, we have created a division which we have adjusted to our experiments. We have generated a total of 1189 training documents and 1189 evaluation documents, each of them belonging to one of the 66 possible classes, that is, each Bible book defines a different class.

For each book and for each chapter, we have created two files: one file

Results

In order to evaluate the effectiveness of algorithms, two widely used measurements in TC have been used: microaveraging precision and macroaveraging precision [29], [19]. Microaveraging precision is defined as follows:

$$P_{\mathrm{microavg}} = \frac{tdc}{tdc + tdi},$$

where tdc is the number of documents correctly classified and tdi is the number of documents wrongly classified. Macroaveraging precision is calculated as the average of microaveraging precisions of all categories:

$$P_{\mathrm{macroavg}} = \frac{\sum_{k=1}^{K} P_{\mathrm{microavg},k}}{K}.$$
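The two measures defined above can be computed as in the following sketch. It assumes single-label documents and, for the per-category figures, groups documents by their true category; the excerpt does not state how the per-category counts are formed, so that grouping and the function names are assumptions.

```python
from collections import defaultdict

def micro_precision(true_labels, predicted):
    """tdc / (tdc + tdi) over the whole evaluation collection."""
    correct = sum(t == p for t, p in zip(true_labels, predicted))
    return correct / len(true_labels)

def macro_precision(true_labels, predicted):
    """Average of the per-category precisions, each category weighted equally."""
    per_cat = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for t, p in zip(true_labels, predicted):
        per_cat[t][1] += 1
        if t == p:
            per_cat[t][0] += 1
    return sum(c / n for c, n in per_cat.values()) / len(per_cat)

# Toy example: a large category classified perfectly, a small one missed entirely.
true_labels = ["Genesis", "Genesis", "Genesis", "Jude"]
predicted = ["Genesis", "Genesis", "Genesis", "Genesis"]

print(micro_precision(true_labels, predicted))  # 0.75
print(macro_precision(true_labels, predicted))  # 0.5
```

In this toy case the micro figure is dominated by the large category, while the macro figure drops because the single-document category counts as much as the large one.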

Macroaveraging gives

Conclusions and future works

This paper has presented a multilingual categorization system, using a neural learning competitive algorithm.

A direct evaluation of our categorization method based on competitive learning has been carried out, obtaining better performance than other algorithms that have been used successfully in categorization. In fact, our method improves by about 27.77% in comparison with the others.

We have also displayed an automatic evaluation method of categorization that allows us

Acknowledgements

The authors wish to thank the anonymous reviewers for their helpful comments on an earlier version of this paper.

M. Teresa Martín-Valdivia is Assistant Professor in the Department of Computer Science at Jaén University (Spain). She received an M.S. degree in Computer Science from the University of Granada in 1992. Her research interests include the application of neural networks, information retrieval and text categorization.

References (37)

  • J. Kivinen et al., Exponentiated gradient versus gradient descent for linear predictors, Inform. Computat. (1997)
  • T. Kohonen, The self-organizing map, Neurocomputing (1998)
  • D. Merkl, Text classification with self-organizing maps: some lessons learned, Neurocomputing (1998)
  • R. Baeza-Yates et al., Modern Information Retrieval (1999)
  • C. Buckley, Implementation of the smart information retrieval system, Technical Report 85-686, Cornell University,...
  • M. Buenaga, J.M.B. Gómez, Díaz, Using WordNet to complement training information in text categorization, in:...
  • Corp. Oracle., Managing text with Oracle8(tm) context cartridge, in: An Oracle Technical White Paper,...
  • M.W. Davis, On the effective use of large parallel corpora in cross language text retrieval
  • W. Frakes et al., Information Retrieval: Data Structures and Algorithms (1992)
  • J.M. Gómez, M. de Buenaga, L.A. Ureña, M.T. Martín, M. García, Integrating Lexical knowledge in learning based text...
  • G. Grefenstette, Cross-Lingual Information Retrieval (1998)
  • W. Hersh, C. Buckley, T.J. Leone, D. Hickman, Oshumed: an interactive retrieval evaluation a new large text collection...
  • T. Honkela, V. Pulkki, T. Kohonen, Contextual relations of words in Grimm tales, analysed by self-organizing map, in:...
  • D. Hull, G. Grefenstette, Experiments in multilingual information retrieval, in: Proceedings of ACM, SIGIR’96, Zurich,...
  • S. Kaski, T. Kohonen, Exploratory data analysis by the self-organizing map: structures of welfare and poverty in the...
  • T. Kohonen, Self-Organization and Associative Memory (1995)
  • T. Kohonen et al., Self organization of massive document collection, IEEE Trans. Neural Networks (2000)
  • D.D. Lewis, Representation and learning in information retrieval, Ph.D. Thesis, Department of Computer and Information...


Manuel García Vega is Assistant Professor in the Department of Computer Science at Jaén University (Spain). He received an M.S. degree in Computer Science from the University of Granada in 1991. His current research includes Word Sense Disambiguation, Text Categorization, Information Retrieval and Management Systems of Natural Language Processing.

L. Alfonso Ureña López is Assistant Professor in the Department of Computer Science at Jaén University (Spain). He received an M.S. degree in Computer Science from the University of Granada in 1991, and a Ph.D. in Computer Science from the Software Engineering Department of Granada University in 2000. His Ph.D. thesis was a winner of the 2001 Awards of the Spanish Society for Natural Language Processing. His current research includes Word Sense Disambiguation, Information Retrieval, Text Categorization and Management Systems of Natural Language Processing and Human Computer Interaction. Dr. Ureña is Director of the Computer Science Department at the University of Jaén and Director of the Research Group of Intelligent Systems. He is author or co-author of more than 40 scientific publications. He serves as a technical reviewer for several journals and on the Program Committee of some major conferences.
