
Machine Learning for Text

  • 2018
  • Book

About this book

Text analytics is a field that lies at the interface of information retrieval, machine learning, and natural language processing, and this textbook carefully covers a coherently organized framework drawn from these intersecting topics. The chapters of this textbook are organized into three categories:

- Basic algorithms: Chapters 1 through 7 discuss the classical algorithms for machine learning from text such as preprocessing, similarity computation, topic modeling, matrix factorization, clustering, classification, regression, and ensemble analysis.

- Domain-sensitive mining: Chapters 8 and 9 discuss the learning methods from text when combined with different domains such as multimedia and the Web. The problem of information retrieval and Web search is also discussed in the context of its relationship with ranking and machine learning methods.

- Sequence-centric mining: Chapters 10 through 14 discuss various sequence-centric and natural language applications, such as feature engineering, neural language models, deep learning, text summarization, information extraction, opinion mining, text segmentation, and event detection.

This textbook covers machine learning topics for text in detail. Since the coverage is extensive, multiple courses can be offered from the same book, depending on course level. Even though the presentation is text-centric, Chapters 3 to 7 cover machine learning algorithms that are often used in domains beyond text data. Therefore, the book can be used to offer courses not just in text analytics but also from the broader perspective of machine learning (with text as a backdrop).

This textbook targets graduate students in computer science, as well as researchers, professors, and industrial practitioners working in these related fields. It is accompanied by a solution manual for classroom teaching.

Table of contents

  1. Frontmatter

  2. Chapter 1. Machine Learning for Text: An Introduction

    Charu C. Aggarwal
    Abstract
    The extraction of useful insights from text with various types of statistical algorithms is referred to as text mining, text analytics, or machine learning from text. The choice of terminology largely depends on the base community of the practitioner. This book will use these terms interchangeably. Text analytics has become increasingly popular in recent years because of the ubiquity of text data on the Web, social networks, emails, digital libraries, and chat sites.
  3. Chapter 2. Text Preparation and Similarity Computation

    Charu C. Aggarwal
    Abstract
    Text data is often found in highly unstructured environments, and is frequently created by human participants. In many cases, text is embedded within Web documents, which are contaminated with elements such as HyperText Markup Language (HTML) tags, misspellings, ambiguous words, and so on. Furthermore, a single Web page may contain multiple blocks, most of which might be advertisements or other unrelated content.
  4. Chapter 3. Matrix Factorization and Topic Modeling

    Charu C. Aggarwal
    Abstract
    Most document collections are defined by document-term matrices in which the rows (or columns) are highly correlated with one another. These correlations can be leveraged to create a low-dimensional representation of the data, and this process is referred to as dimensionality reduction.
  5. Chapter 4. Text Clustering

    Charu C. Aggarwal
    Abstract
    The problem of text clustering is that of partitioning a corpus into groups of similar documents. Clustering is an unsupervised learning application because no data-driven guidance is provided about specific types of groups (e.g., sports, politics, and so on) with the use of training data.
  6. Chapter 5. Text Classification: Basic Models

    Charu C. Aggarwal
    Abstract
    In classification, the corpus is partitioned into classes that are typically defined by application-specific criteria. Therefore, training examples are provided that associate data points with labels indicating their class membership. For example, the training examples extracted from a news portal on political matters might attach one of three labels to each of the documents, such as “senate,” “congress,” and “legislation.”
  7. Chapter 6. Linear Classification and Regression for Text

    Charu C. Aggarwal
    Abstract
    Linear models for classification and regression express the dependent variable (or class variable) as a linear function of the independent variables (or feature variables). Specifically, consider the case in which \(y_{i}\) is the dependent variable of the \(i\)th document, and \(\overline{X_{i}} = (x_{i1}\ldots x_{id})\) are the \(d\)-dimensional feature variables of this document. In the case of text, these feature variables correspond to the term frequencies of a lexicon with \(d\) terms. The value of \(y_{i}\) is a numerical quantity in the case of regression, and it is a binary value drawn from \(\{-1, +1\}\) in the case of classification.
  8. Chapter 7. Classifier Performance and Evaluation

    Charu C. Aggarwal
    Abstract
    Among all machine learning problems, classification is the most well studied, and has the largest number of solution methodologies. This embarrassment of riches also leads to the natural problems of model selection and evaluation.
  9. Chapter 8. Joint Text Mining with Heterogeneous Data

    Charu C. Aggarwal
    Abstract
    In Web and social media networks, the text documents are often associated with nodes. For example, the Web can be viewed as a graph in which each node contains a Web page and also connects to other nodes via hyperlinks. Similarly, a social network is a friendship graph of user-to-user linkages in which each node contains the textual posting activity of the user.
  10. Chapter 9. Information Retrieval and Search Engines

    Charu C. Aggarwal
    Abstract
    Information retrieval is the process of satisfying user information needs that are expressed as textual queries. Search engines represent a Web-specific example of the information retrieval paradigm. The problem of Web search has many additional challenges, such as the collection of Web resources, the organization of these resources, and the use of hyperlinks to aid the search. Whereas traditional information retrieval only uses the content of documents to retrieve results of queries, the Web requires stronger mechanisms for quality control because of its open nature. Furthermore, Web documents contain significant meta-information and zoned text, such as title, author, or anchor text, which can be leveraged to improve retrieval accuracy.
  11. Chapter 10. Text Sequence Modeling and Deep Learning

    Charu C. Aggarwal
    Abstract
    Much of the discussion in the previous chapters has focused on a bag-of-words representation of text. While the bag-of-words representation is sufficient in many practical applications, there are cases in which the sequential aspects of text become more important.
  12. Chapter 11. Text Summarization

    Charu C. Aggarwal
    Abstract
    “Less is more.”—Ludwig Mies van der Rohe
  13. Chapter 12. Information Extraction

    Charu C. Aggarwal
    Abstract
    In its most basic form, text is a sequence of tokens, which is not annotated with the properties of these tokens. The goal of information extraction is to discover specific types of useful properties of these tokens and their interrelationships.
  14. Chapter 13. Opinion Mining and Sentiment Analysis

    Charu C. Aggarwal
    Abstract
    The recent proliferation of social media has enabled users to post views about entities, individuals, events, and topics in a variety of formal and informal settings. Examples of such settings include reviews, forums, social media posts, blogs, and discussion boards. The problem of opinion mining and sentiment analysis is defined as the computational analytics associated with such text.
  15. Chapter 14. Text Segmentation and Event Detection

    Charu C. Aggarwal
    Abstract
    “To improve is to change; to be perfect is to change often.”—Winston Churchill
  16. Backmatter
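
The linear models described in the Chapter 6 abstract above can be illustrated with a minimal sketch. It assumes a tiny four-term lexicon, hand-picked weights, whitespace tokenization, and a tie-breaking rule that maps a score of exactly zero to +1. None of these come from the book; the function names are hypothetical and exist only for illustration.

```python
from collections import Counter


def term_frequencies(text, lexicon):
    """Map a document to its d-dimensional term-frequency vector.

    Tokenization is a plain lowercase whitespace split (an assumption
    made for brevity, not a recommendation).
    """
    counts = Counter(text.lower().split())
    return [counts[term] for term in lexicon]


def linear_score(weights, x, bias=0.0):
    """Linear function of the feature variables: w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict_class(weights, x, bias=0.0):
    """Binary label in {-1, +1}; a score of exactly 0 maps to +1 (assumption)."""
    return 1 if linear_score(weights, x, bias) >= 0 else -1


# Hypothetical lexicon with d = 4 terms and hand-picked, untrained weights.
lexicon = ["goal", "match", "vote", "senate"]
weights = [1.0, 0.5, -1.0, -1.5]

politics_doc = "The senate vote followed a long senate debate"
sports_doc = "A late goal and another goal decided the match"

x_politics = term_frequencies(politics_doc, lexicon)
x_sports = term_frequencies(sports_doc, lexicon)

# Regression reads the score directly; classification thresholds it.
print(x_politics, linear_score(weights, x_politics), predict_class(weights, x_politics))
print(x_sports, linear_score(weights, x_sports), predict_class(weights, x_sports))
```

In regression the raw score plays the role of the numerical \(y_{i}\), while in classification only its sign is used. In practice the weights would be learned from labeled training documents rather than set by hand.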

Title
Machine Learning for Text
Written by
Dr. Charu C. Aggarwal
Copyright year
2018
Electronic ISBN
978-3-319-73531-3
Print ISBN
978-3-319-73530-6
DOI
https://doi.org/10.1007/978-3-319-73531-3

