2010 | Book

Inductive Inference for Large Scale Text Classification

Kernel Approaches and Techniques

Authors: Catarina Silva, Bernardete Ribeiro

Publisher: Springer Berlin Heidelberg

Book Series: Studies in Computational Intelligence

Part of: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

Text classification is becoming a crucial task for analysts in many areas. In the last few decades, the production of textual documents in digital form has increased exponentially. These documents range from web pages to scientific papers and include emails, news and books. Despite the widespread use of digital texts, handling them is inherently difficult: the large amount of data needed to represent them and the subjectivity of classification complicate matters.

This book gives a concise view of how to use kernel approaches for inductive inference in large scale text classification; it presents a series of new techniques to enhance, scale and distribute text classification tasks. It is not intended to be a comprehensive survey of the state of the art in the whole field of text classification. Its purpose is less ambitious and more practical: to explain and illustrate some of the important methods used in this field, in particular kernel approaches and techniques.

Frontmatter

Fundamentals

Frontmatter

Background on Text Classification

Abstract

In this chapter, background material for studying text classification problems is presented along with the notation used throughout the book. After the problem is described, a summary of typical applications is given and document representation issues are introduced, followed by commonly used pre-processing steps, including dimensionality reduction. Next, state-of-the-art classifiers for text classification are briefly reviewed together with current achievements, followed by some widely accepted performance evaluation metrics and benchmarks.

To determine the influence and relative importance of pre-processing methods on text classification performance, an empirical study was carried out to compare dimensionality reduction techniques, using standard learning machines and benchmarks. The results and analysis of this study are reported and, finally, conclusions on the relative success of the various pre-processing, learning and evaluation approaches are presented.

Catarina Silva, Bernardete Ribeiro

Kernel Machines for Text Classification

Abstract

This chapter details the concept of kernel methods and presents the foundations of two paradigmatic techniques: support vector machines and relevance vector machines. Both are introduced from a text classification perspective, and then results and comparisons of their application to benchmark corpora are presented.

Catarina Silva, Bernardete Ribeiro

Approaches and Techniques

Frontmatter

Enhancing SVMs for Text Classification

Abstract

The previous chapter introduced kernel-based techniques and their baseline application to text classification. In this chapter we develop and explore learning techniques that integrate knowledge in the classification task to improve the performance of support vector machines (SVMs) in text classification applications.

The introduction of unlabeled data in the learning stage is investigated. With the deluge of digital text data, unlabeled texts are ubiquitous. Whether it is the Internet, email servers, database files or plain file systems, the sources for digital texts are countless. However, such texts are usually unlabeled, and their labeling is mostly manual and costly. Therefore, a research field on the study and use of these unlabeled texts has been emerging. The potential of using several learning machines organized in a committee is also exploited. Knowing that there is no unique classifier that suits all situations, the focus is on using the diversity of classifiers to enhance performance.

Catarina Silva, Bernardete Ribeiro

Scaling RVMs for Text Classification

Abstract

In the previous chapter we investigated learning techniques to improve support vector machines’ (SVMs) performance in text classification. We turn our attention in this chapter to relevance vector machines (RVMs) and their application to text classification. RVMs’ probabilistic Bayesian nature allows both predictive distributions on testing instances and model-based selection that yields a parsimonious solution. However, scaling up the algorithm is not viable in most digital information processing applications.

Catarina Silva, Bernardete Ribeiro

Distributing Text Classification in Grid Environments

Abstract

The previous chapters looked at several ways to improve the performance of support vector machines (SVMs) and relevance vector machines (RVMs) in text classification applications.

Most data mining problems nowadays face two great challenges. First, the volume of digital data available is growing massively in almost all application areas. Second, state-of-the-art learning machines are becoming increasingly demanding in terms of computing power. This chapter establishes a high-performance distributed computing environment model where the learning techniques proposed in the previous chapters are efficiently deployed and tested on large scale corpora.

Catarina Silva, Bernardete Ribeiro

Framework for Text Classification

Abstract

The previous chapters presented a number of novel techniques to tackle a variety of problems encountered in real-world text classification settings. The common underlying thread has been the integration of knowledge in the inference of inductive learning models without penalizing processing time. This chapter unifies the main topics of this book into a framework. An inductive inference-based text classification framework will provide basic generic tools that are appropriate for a broad range of applications. New research trends in text classification are highlighted towards the end. We will focus on the particular developments in kernel methods triggered by new problems in text mining and on how to extract useful knowledge by mining relationships between data. We include a few promising research directions that are likely to expand in the future.

Catarina Silva, Bernardete Ribeiro

Backmatter

Title: Inductive Inference for Large Scale Text Classification
Authors: Catarina Silva, Bernardete Ribeiro
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-642-04533-2
Print ISBN: 978-3-642-04532-5
DOI: https://doi.org/10.1007/978-3-642-04533-2
