2018 | Book

Machine Learning for Text

About this book

Text analytics is a field that lies at the interface of information retrieval, machine learning, and natural language processing, and this textbook carefully covers a coherently organized framework drawn from these intersecting topics. The chapters of this textbook are organized into three categories:

- Basic algorithms: Chapters 1 through 7 discuss the classical algorithms for machine learning from text such as preprocessing, similarity computation, topic modeling, matrix factorization, clustering, classification, regression, and ensemble analysis.

- Domain-sensitive mining: Chapters 8 and 9 discuss the learning methods from text when combined with different domains such as multimedia and the Web. The problem of information retrieval and Web search is also discussed in the context of its relationship with ranking and machine learning methods.

- Sequence-centric mining: Chapters 10 through 14 discuss various sequence-centric and natural language applications, such as feature engineering, neural language models, deep learning, text summarization, information extraction, opinion mining, text segmentation, and event detection.

This textbook covers machine learning topics for text in detail. Since the coverage is extensive, multiple courses can be offered from the same book, depending on course level. Even though the presentation is text-centric, Chapters 3 to 7 cover machine learning algorithms that are often used in domains beyond text data. Therefore, the book can be used to offer courses not just in text analytics but also from the broader perspective of machine learning (with text as a backdrop).

This textbook targets graduate students in computer science, as well as researchers, professors, and industrial practitioners working in these related fields. This textbook is accompanied by a solution manual for classroom teaching.

Table of Contents

Frontmatter
Chapter 1. Machine Learning for Text: An Introduction
Abstract
The extraction of useful insights from text with various types of statistical algorithms is referred to as text mining, text analytics, or machine learning from text. The choice of terminology largely depends on the base community of the practitioner. This book will use these terms interchangeably. Text analytics has become increasingly popular in recent years because of the ubiquity of text data on the Web, social networks, emails, digital libraries, and chat sites.
Charu C. Aggarwal
Chapter 2. Text Preparation and Similarity Computation
Abstract
Text data is often found in highly unstructured environments, and is frequently created by human participants. In many cases, text is embedded within Web documents, which are contaminated with elements such as HyperText Markup Language (HTML) tags, misspellings, ambiguous words, and so on. Furthermore, a single Web page may contain multiple blocks, most of which might be advertisements or other unrelated content.
Charu C. Aggarwal
Chapter 3. Matrix Factorization and Topic Modeling
Abstract
Most document collections are defined by document-term matrices in which the rows (or columns) are highly correlated with one another. These correlations can be leveraged to create a low-dimensional representation of the data, and this process is referred to as dimensionality reduction.
Charu C. Aggarwal
Chapter 4. Text Clustering
Abstract
The problem of text clustering is that of partitioning a corpus into groups of similar documents. Clustering is an unsupervised learning application because no data-driven guidance is provided about specific types of groups (e.g., sports, politics, and so on) with the use of training data.
Charu C. Aggarwal
Chapter 5. Text Classification: Basic Models
Abstract
In classification, the corpus is partitioned into classes that are typically defined by application-specific criteria. Therefore, training examples are provided that associate data points with labels indicating their class membership. For example, the training examples extracted from a news portal on political matters might attach one of three labels to each document, such as “senate,” “congress,” and “legislation.”
Charu C. Aggarwal
Chapter 6. Linear Classification and Regression for Text
Abstract
Linear models for classification and regression express the dependent variable (or class variable) as a linear function of the independent variables (or feature variables). Specifically, consider the case in which \(y_{i}\) is the dependent variable of the \(i\)th document, and \(\overline{X_{i}} = (x_{i1}\ldots x_{id})\) are the \(d\)-dimensional feature variables of this document. In the case of text, these feature variables correspond to the term frequencies of a lexicon with \(d\) terms. The value of \(y_{i}\) is a numerical quantity in the case of regression, and it is a binary value drawn from \(\{-1, +1\}\) in the case of classification.
Charu C. Aggarwal
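As a rough illustration of the linear model described in the Chapter 6 abstract, the following Python sketch computes a linear function of term frequencies and thresholds it to a label in {-1, +1}. It is only a minimal sketch under stated assumptions: the function names linear_score and classify, the bias term, the threshold at zero, the three-term lexicon, and all numeric values are hypothetical and are not taken from the book.

```python
from typing import Sequence


def linear_score(w: Sequence[float], x: Sequence[float], b: float = 0.0) -> float:
    """Return the linear function w . x + b of the d feature variables.

    The bias b is an assumption of this sketch; the abstract only says
    that the dependent variable is a linear function of the features.
    """
    if len(w) != len(x):
        raise ValueError("weight and feature vectors must share the same length d")
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def classify(w: Sequence[float], x: Sequence[float], b: float = 0.0) -> int:
    """Map the linear score to a binary label drawn from {-1, +1}."""
    return 1 if linear_score(w, x, b) >= 0 else -1


# Hypothetical lexicon with d = 3 terms; weights and bias are made-up values.
weights = [0.5, -1.0, 0.25]
bias = -0.5

# Term-frequency vectors of two toy documents (also made up).
doc_a = [3, 0, 2]
doc_b = [0, 2, 1]

score_a = linear_score(weights, doc_a, bias)  # regression-style numeric output
score_b = linear_score(weights, doc_b, bias)
label_a = classify(weights, doc_a, bias)      # classification-style output
label_b = classify(weights, doc_b, bias)
```

In the regression setting the numeric score itself plays the role of the dependent variable, whereas in the classification setting only its sign is used.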
Chapter 7. Classifier Performance and Evaluation
Abstract
Among all machine learning problems, classification is the most well studied and has the largest number of solution methodologies. This embarrassment of riches also leads to the natural problems of model selection and evaluation.
Charu C. Aggarwal
Chapter 8. Joint Text Mining with Heterogeneous Data
Abstract
In Web and social media networks, the text documents are often associated with nodes. For example, the Web can be viewed as a graph in which each node contains a Web page and also connects to other nodes via hyperlinks. Similarly, a social network is a friendship graph of user-to-user linkages in which each node contains the textual posting activity of the user.
Charu C. Aggarwal
Chapter 9. Information Retrieval and Search Engines
Abstract
Information retrieval is the process of satisfying user information needs that are expressed as textual queries. Search engines represent a Web-specific example of the information retrieval paradigm. The problem of Web search has many additional challenges, such as the collection of Web resources, the organization of these resources, and the use of hyperlinks to aid the search. Whereas traditional information retrieval only uses the content of documents to retrieve results of queries, the Web requires stronger mechanisms for quality control because of its open nature. Furthermore, Web documents contain significant meta-information and zoned text, such as title, author, or anchor text, which can be leveraged to improve retrieval accuracy.
Charu C. Aggarwal
Chapter 10. Text Sequence Modeling and Deep Learning
Abstract
Much of the discussion in the previous chapters has focused on a bag-of-words representation of text. While the bag-of-words representation is sufficient in many practical applications, there are cases in which the sequential aspects of text become more important.
Charu C. Aggarwal
Chapter 11. Text Summarization
Abstract
“Less is more.”—Ludwig Mies van der Rohe
Charu C. Aggarwal
Chapter 12. Information Extraction
Abstract
In its most basic form, text is a sequence of tokens, which is not annotated with the properties of these tokens. The goal of information extraction is to discover specific types of useful properties of these tokens and their interrelationships.
Charu C. Aggarwal
Chapter 13. Opinion Mining and Sentiment Analysis
Abstract
The recent proliferation of social media has enabled users to post views about entities, individuals, events, and topics in a variety of formal and informal settings. Examples of such settings include reviews, forums, social media posts, blogs, and discussion boards. The problem of opinion mining and sentiment analysis is defined as the computational analytics associated with such text.
Charu C. Aggarwal
Chapter 14. Text Segmentation and Event Detection
Abstract
“To improve is to change; to be perfect is to change often.”—Winston Churchill
Charu C. Aggarwal
Backmatter
Metadata
Title
Machine Learning for Text
Written by
Dr. Charu C. Aggarwal
Copyright year
2018
Electronic ISBN
978-3-319-73531-3
Print ISBN
978-3-319-73530-6
DOI
https://doi.org/10.1007/978-3-319-73531-3