2018 | Book

Machine Learning for Text

Author: Dr. Charu C. Aggarwal

Publisher: Springer International Publishing

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

Text analytics is a field that lies at the interface of information retrieval, machine learning, and natural language processing, and this textbook carefully covers a coherently organized framework drawn from these intersecting topics. The chapters of this textbook are organized into three categories:

- Basic algorithms: Chapters 1 through 7 discuss the classical algorithms for machine learning from text such as preprocessing, similarity computation, topic modeling, matrix factorization, clustering, classification, regression, and ensemble analysis.

- Domain-sensitive mining: Chapters 8 and 9 discuss the learning methods from text when combined with different domains such as multimedia and the Web. The problem of information retrieval and Web search is also discussed in the context of its relationship with ranking and machine learning methods.

- Sequence-centric mining: Chapters 10 through 14 discuss various sequence-centric and natural language applications, such as feature engineering, neural language models, deep learning, text summarization, information extraction, opinion mining, text segmentation, and event detection.

This textbook covers machine learning topics for text in detail. Since the coverage is extensive, multiple courses can be offered from the same book, depending on course level. Even though the presentation is text-centric, Chapters 3 to 7 cover machine learning algorithms that are often used in domains beyond text data. Therefore, the book can be used to offer courses not just in text analytics but also from the broader perspective of machine learning (with text as a backdrop).

This textbook targets graduate students in computer science, as well as researchers, professors, and industrial practitioners working in these related fields. This textbook is accompanied by a solution manual for classroom teaching.

Table of Contents

Frontmatter

Chapter 1. Machine Learning for Text: An Introduction

Abstract

The extraction of useful insights from text with various types of statistical algorithms is referred to as text mining, text analytics, or machine learning from text. The choice of terminology largely depends on the base community of the practitioner. This book will use these terms interchangeably. Text analytics has become increasingly popular in recent years because of the ubiquity of text data on the Web, social networks, emails, digital libraries, and chat sites.

Charu C. Aggarwal

Chapter 2. Text Preparation and Similarity Computation

Abstract

Text data is often found in highly unstructured environments, and is frequently created by human participants. In many cases, text is embedded within Web documents, which are contaminated with elements such as HyperText Markup Language (HTML) tags, misspellings, ambiguous words, and so on. Furthermore, a single Web page may contain multiple blocks, most of which might be advertisements or other unrelated content.
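
The markup-removal step mentioned in this abstract can be sketched as follows. This is a minimal illustration using Python's standard `html.parser` module, not the book's own procedure; the class name `TextExtractor`, the function `html_to_tokens`, and the sample page are hypothetical, and the sketch does not address misspellings, ambiguous words, or advertisement blocks.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the content of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []       # text fragments found outside skipped tags
        self.skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_tokens(html):
    """Strip markup, lowercase the text, and split it on whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts).lower().split()


# Hypothetical sample page, used only for illustration.
page = (
    "<html><body><h1>Text Mining</h1>"
    "<script>var x = 1;</script>"
    "<p>Clean <b>this</b> page.</p></body></html>"
)
tokens = html_to_tokens(page)
print(tokens)
```

Whitespace splitting leaves punctuation attached to tokens (for example `page.`), so a real pipeline would normally add a tokenization and normalization step after this one.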

Charu C. Aggarwal

Chapter 3. Matrix Factorization and Topic Modeling

Abstract

Most document collections are defined by document-term matrices in which the rows (or columns) are highly correlated with one another. These correlations can be leveraged to create a low-dimensional representation of the data, and this process is referred to as dimensionality reduction.
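
A small sketch can make the link between correlated rows and a low-dimensional representation concrete. The toy matrix below is an assumption built for illustration: every row is a multiple of the same term pattern, so one coordinate per document is enough. Power iteration stands in here for a full factorization method; it is a swapped-in simplification and not necessarily the technique the chapter uses.

```python
import math

# Toy document-term matrix (rows = documents, columns = terms).
# Assumption: every row is a multiple of [1, 2, 0], so the rows are
# perfectly correlated and the matrix has rank 1.
D = [
    [2.0, 4.0, 0.0],
    [1.0, 2.0, 0.0],
    [3.0, 6.0, 0.0],
]


def top_direction(matrix, iterations=50):
    """Power iteration on D^T D: dominant unit direction in term space."""
    n_docs, n_terms = len(matrix), len(matrix[0])
    v = [1.0] * n_terms
    for _ in range(iterations):
        # scores = D v (one number per document)
        scores = [sum(row[j] * v[j] for j in range(n_terms)) for row in matrix]
        # w = D^T scores (back in term space)
        w = [sum(matrix[i][j] * scores[i] for i in range(n_docs))
             for j in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


v = top_direction(D)

# One-dimensional representation: a single coordinate per document.
reduced = [sum(row[j] * v[j] for j in range(len(v))) for row in D]

# Map the coordinates back to term space to measure what was lost.
reconstructed = [[c * x for x in v] for c in reduced]
max_error = max(
    abs(D[i][j] - reconstructed[i][j])
    for i in range(len(D))
    for j in range(len(D[0]))
)
print(reduced, max_error)
```

Because the toy rows are exact multiples of one another, the reconstruction error is zero up to rounding. On real corpora the correlations are only approximate, so a few dimensions would retain most, but not all, of the information.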

Charu C. Aggarwal

Chapter 4. Text Clustering

Abstract

The problem of text clustering is that of partitioning a corpus into groups of similar documents. Clustering is an unsupervised learning application because no data-driven guidance is provided about specific types of groups (e.g., sports, politics, and so on) with the use of training data.

Charu C. Aggarwal

Chapter 5. Text Classification: Basic Models

Abstract

In classification, the corpus is partitioned into classes that are typically defined by application-specific criteria. Therefore, training examples are provided that associate data points with labels indicating their class membership. For example, the training examples extracted from a news portal on political matters might attach one of three labels to each of the documents, such as “senate,” “congress,” and “legislation.”

Charu C. Aggarwal

Chapter 6. Linear Classification and Regression for Text

Abstract

Linear models for classification and regression express the dependent variable (or class variable) as a linear function of the independent variables (or feature variables). Specifically, consider the case in which \(y_{i}\) is the dependent variable of the \(i\)th document, and \(\overline{X_{i}} = (x_{i1} \ldots x_{id})\) are the \(d\)-dimensional feature variables of this document. In the case of text, these feature variables correspond to the term frequencies of a lexicon with \(d\) terms. The value of \(y_{i}\) is a numerical quantity in the case of regression, and it is a binary value drawn from \(\{-1, +1\}\) in the case of classification.
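
The linear form described in this abstract can be sketched in a few lines. The weight vector `W`, the three-term lexicon, and the two documents are hypothetical values chosen for illustration; the abstract does not specify them. The sketch also assumes no bias term and maps a score of exactly zero to +1, which is an arbitrary choice.

```python
def linear_score(weights, features):
    """Dot product between the weight vector and the term frequencies."""
    return sum(w * x for w, x in zip(weights, features))


def predict_regression(weights, features):
    """Regression: the prediction is the numerical score itself."""
    return linear_score(weights, features)


def predict_class(weights, features):
    """Classification: the prediction is the sign of the score, in {-1, +1}."""
    return 1 if linear_score(weights, features) >= 0 else -1


# Hypothetical lexicon with d = 3 terms and hypothetical weights.
lexicon = ["goal", "election", "match"]
W = [1.5, -2.0, 1.0]

# Term-frequency vectors of two hypothetical documents.
X_i = [2, 0, 1]
X_j = [0, 3, 0]

print(predict_regression(W, X_i), predict_class(W, X_i))
print(predict_regression(W, X_j), predict_class(W, X_j))
```

The two prediction functions share the same score and differ only in how it is read: as a numerical quantity for regression, or through its sign for classification.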

Charu C. Aggarwal

Chapter 7. Classifier Performance and Evaluation

Abstract

Among all machine learning problems, classification is the most well studied and has the largest number of solution methodologies. This embarrassment of riches also leads to the natural problems of model selection and evaluation.

Charu C. Aggarwal

Chapter 8. Joint Text Mining with Heterogeneous Data

Abstract

In Web and social media networks, the text documents are often associated with nodes. For example, the Web can be viewed as a graph in which each node contains a Web page and also connects to other nodes via hyperlinks. Similarly, a social network is a friendship graph of user-to-user linkages in which each node contains the textual posting activity of the user.
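
The graph view described here, with text stored at the nodes and hyperlinks as edges, can be represented with a plain dictionary. The page names, texts, and links below are hypothetical. Pooling a page's terms with those of the pages it links to is one simple way to combine the two signals, chosen for illustration and not taken from the chapter.

```python
from collections import Counter

# Each node holds its text and its outgoing hyperlinks (hypothetical pages).
pages = {
    "home": {"text": "machine learning for text", "links": ["about", "blog"]},
    "about": {"text": "about the author", "links": ["home"]},
    "blog": {"text": "text mining and machine learning notes",
             "links": ["home", "about"]},
}


def in_degree(graph):
    """Number of incoming hyperlinks per node."""
    degree = {node: 0 for node in graph}
    for data in graph.values():
        for target in data["links"]:
            degree[target] += 1
    return degree


def neighborhood_terms(graph, node):
    """Term counts of a node's own text pooled with the pages it links to."""
    counts = Counter(graph[node]["text"].split())
    for neighbor in graph[node]["links"]:
        counts.update(graph[neighbor]["text"].split())
    return counts


print(in_degree(pages))
print(neighborhood_terms(pages, "home").most_common(3))
```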

Charu C. Aggarwal

Chapter 9. Information Retrieval and Search Engines

Abstract

Information retrieval is the process of satisfying user information needs that are expressed as textual queries. Search engines represent a Web-specific example of the information retrieval paradigm. The problem of Web search has many additional challenges, such as the collection of Web resources, the organization of these resources, and the use of hyperlinks to aid the search. Whereas traditional information retrieval only uses the content of documents to retrieve results of queries, the Web requires stronger mechanisms for quality control because of its open nature. Furthermore, Web documents contain significant meta-information and zoned text, such as title, author, or anchor text, which can be leveraged to improve retrieval accuracy.

Charu C. Aggarwal

Chapter 10. Text Sequence Modeling and Deep Learning

Abstract

Much of the discussion in the previous chapters has focused on a bag-of-words representation of text. While the bag-of-words representation is sufficient in many practical applications, there are cases in which the sequential aspects of text become more important.
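
A short sketch shows what the bag-of-words representation discards. The two sentences are hypothetical examples: they contain exactly the same words, so their bags are identical, yet their meanings differ. Comparing adjacent word pairs (bigrams) is used here only as the simplest order-aware contrast, not as the sequence model the chapter develops.

```python
from collections import Counter


def bag_of_words(text):
    """Word counts; all ordering information is discarded."""
    return Counter(text.lower().split())


def bigrams(text):
    """Adjacent word pairs, which retain local ordering."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


a = "the dog bit the man"
b = "the man bit the dog"

same_bag = bag_of_words(a) == bag_of_words(b)
same_bigrams = bigrams(a) == bigrams(b)
print(same_bag, same_bigrams)
```

The bags match while the bigram sequences do not, which is the kind of case where the sequential aspects of text matter.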

Charu C. Aggarwal

Chapter 11. Text Summarization

Abstract

“Less is more.”—Ludwig Mies van der Rohe

Charu C. Aggarwal

Chapter 12. Information Extraction

Abstract

In its most basic form, text is a sequence of tokens, which is not annotated with the properties of these tokens. The goal of information extraction is to discover specific types of useful properties of these tokens and their interrelationships.

Charu C. Aggarwal

Chapter 13. Opinion Mining and Sentiment Analysis

Abstract

The recent proliferation of social media has enabled users to post views about entities, individuals, events, and topics in a variety of formal and informal settings. Examples of such settings include reviews, forums, social media posts, blogs, and discussion boards. The problem of opinion mining and sentiment analysis is defined as the computational analytics associated with such text.

Charu C. Aggarwal

Chapter 14. Text Segmentation and Event Detection

Abstract

“To improve is to change; to be perfect is to change often.”—Winston Churchill

Charu C. Aggarwal

Backmatter

Title: Machine Learning for Text
Author: Dr. Charu C. Aggarwal
Publisher: Springer International Publishing
Electronic ISBN: 978-3-319-73531-3
Print ISBN: 978-3-319-73530-6
DOI: https://doi.org/10.1007/978-3-319-73531-3