
2021 | Book

Low Resource Social Media Text Mining

Authors: Shriphani Palakodety, Ashiqur R. KhudaBukhsh, Guha Jayachandran

Publisher: Springer Singapore

Book Series: SpringerBriefs in Computer Science

About this book

This book focuses on methods that are unsupervised or require minimal supervision, a vital property in the low-resource domain. Over the past few years, rapid growth in Internet access across the globe has resulted in an explosion of user-generated text content on social media platforms. This effect is especially pronounced in linguistically diverse areas of the world like South Asia, where over 400 million people regularly access social media platforms. YouTube, Facebook, and Twitter report a monthly active user base in excess of 200 million from this region. Natural language processing (NLP) research and publicly available resources such as models and corpora prioritize Web content authored primarily by a Western user base. Such content is authored in English by a user base fluent in the language and can be processed by a broad range of off-the-shelf NLP tools. In contrast, text from linguistically diverse regions features high levels of multilinguality, code-switching, and varied language skill levels. Resources like corpora and models are also scarce. Due to these factors, new methods are needed to process such text.

This book is designed for NLP practitioners well versed in recent advances in the field but unfamiliar with the landscape of low-resource multilingual NLP. The contents of this book introduce the various challenges associated with social media content, quantify these issues, and provide solutions and intuition. When possible, the methods discussed are evaluated on real-world social media data sets to emphasize their robustness to the noisy nature of the social media environment.

On completion of the book, the reader will be well versed in the complexity of text mining in multilingual, low-resource environments; will be aware of a broad set of off-the-shelf tools that can be applied to various problems; and will be able to conduct sophisticated analyses of such text.

Table of Contents

Frontmatter

Chapter 1. Introduction

Abstract

We motivate this book with statistics on the growth of social media platforms in various communities across the globe and provide an outline of the included methods.

Shriphani Palakodety, Ashiqur R. KhudaBukhsh, Guha Jayachandran

Chapter 2. The Problem Setting

Abstract

We first introduce the domain of this book: low-resource social media text. The domain encompasses some of the most used languages in the world and a wide variety of tasks and applications. We explore the socio-technical conditions that lead to such text and how they influence expression online. Examples and statistics are provided from various social media data sets and recent research. We then cover attempts to bridge the resource gap between world languages like English and low-resource languages. Special attention is given to the various data acquisition strategies employed by researchers. This chapter will help NLP practitioners understand the importance of analyzing the low-resource components of corpora from various societies, how ignoring them can skew results, and how to go about addressing them. It also provides a broad set of examples and statistics to reinforce the importance of low-resource social media text mining.

Shriphani Palakodety, Ashiqur R. KhudaBukhsh, Guha Jayachandran

Chapter 3. A Rapid Tour of NLP

Abstract

In this chapter, we briefly review the NLP methods utilized in this book. For readers who want more depth, several highly regarded recent texts (Eisenstein, Adaptive Computation and Machine Learning series, MIT Press, 2019 [9]; Goldberg, Synthesis Lectures on Human Language Technologies 10(1):1-309, 2017 [11]) provide a thorough and rigorous grounding in NLP. We discuss static and contextual word and document embeddings, and their applications. We then look at polyglot training in static and contextual embeddings.

Shriphani Palakodety, Ashiqur R. KhudaBukhsh, Guha Jayachandran

Chapter 4. Language Identification

Abstract

We introduce the language identification problem, a vital component of a multilingual text analysis pipeline. We discuss the document-level and word-level formulations of the language identification task, briefly discuss supervised solutions, and then present low-supervision methods based on polyglot training that are highly applicable in low-resource settings. We then discuss code mixing, a linguistic phenomenon common among bilingual and multilingual speakers. We extend our language identification methods to model code mixing and measure the extent of English-Hindi code mixing in various social media data sets.

Shriphani Palakodety, Ashiqur R. KhudaBukhsh, Guha Jayachandran
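
To make the idea of "measuring the extent of code mixing" from the Chapter 4 abstract concrete, here is a minimal sketch. It assumes a word-level language identifier has already tagged each token; the tag names (`hi`, `en`, `other`) and the ratio itself are illustrative assumptions, not necessarily the measure used in the book.

```python
from collections import Counter


def code_mixing_ratio(labels, neutral="other"):
    """Fraction of language-tagged tokens that fall outside the dominant language.

    `labels` holds one (hypothetical) language tag per token. Tokens tagged
    `neutral` (punctuation, numbers, names) are ignored.
    """
    counts = Counter(label for label in labels if label != neutral)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no language-bearing tokens, so nothing is mixed
    dominant = max(counts.values())
    return 1.0 - dominant / total


# Toy word-level tags for a six-token Hindi-English comment (assumed data).
labels = ["hi", "hi", "en", "hi", "other", "en"]
ratio = code_mixing_ratio(labels)
```

A monolingual comment scores 0.0, and a comment split evenly between two languages scores 0.5. Averaging the ratio over a corpus gives one rough corpus-level estimate of mixing.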

Chapter 5. Low Resource Machine Translation

Abstract

We discuss the burgeoning field of unsupervised machine translation, where words and phrases are translated between languages without any parallel corpora. We cover popular methods and their applications to low-resource settings. We further investigate the application of polyglot training to this field and present promising new directions for unsupervised machine translation.

Shriphani Palakodety, Ashiqur R. KhudaBukhsh, Guha Jayachandran
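
As a rough illustration of word translation without parallel corpora, as described in the Chapter 5 abstract, the sketch below assumes that embeddings for both languages already live in one shared space (for example, because they were trained together on a mixed-language corpus). It then translates a word by picking the nearest target-language word. The vectors, the vocabulary, and the `translate` helper are all hypothetical; this is one simple lookup strategy, not the book's method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def translate(word, source_vecs, target_vecs):
    """Return the target-language word closest to `word` in the shared space."""
    query = source_vecs[word]
    return max(target_vecs, key=lambda t: cosine(query, target_vecs[t]))


# Toy 2-d embeddings (assumed values) for romanized Hindi and English words.
hindi_vecs = {"paani": [0.9, 0.1], "kitaab": [0.1, 0.9]}
english_vecs = {"water": [1.0, 0.0], "book": [0.0, 1.0]}

guess = translate("paani", hindi_vecs, english_vecs)
```

Restricting the candidate set to target-language words is the key design choice here: without it, the nearest neighbor of a word is often another word from the same language.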

Chapter 6. Semantic Sampling

Abstract

A variety of tasks involving social media text require mining rare samples. In text classification, information retrieval, and other NLP tasks, working with very skewed or imbalanced data sets poses many challenges. In such settings, training data sets can be rapidly bootstrapped using highly targeted sampling strategies. This chapter draws on work in active learning, semantic similarity, and sampling strategies to address a variety of social media text mining tasks. The topics involved are particularly well suited for social media analysis. Most tasks surrounding user-generated social media text, such as content moderation and recommendations, involve rapid model construction in real time in response to real-world events. The methods discussed allow task-specific data sets and models to be constructed rapidly, often using just a handful of initial samples. We then explore extensions that sample across languages, allowing powerful pipelines that can transfer resources from well-resourced languages to their low-resource counterparts.

Shriphani Palakodety, Ashiqur R. KhudaBukhsh, Guha Jayachandran
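
The bootstrapping idea in the Chapter 6 abstract, growing a data set from a handful of initial samples, can be sketched as a nearest-neighbor search in embedding space. The example assumes each document already has an embedding vector; the vectors, document ids, and the `sample_nearest` helper are hypothetical, and the book's own sampling strategies may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sample_nearest(seeds, pool, k):
    """Pick the k pool documents most similar to any seed embedding.

    `seeds` is a list of vectors for the few labeled examples;
    `pool` maps document ids to vectors for the unlabeled corpus.
    """
    scored = [
        (max(cosine(vec, seed) for seed in seeds), doc_id)
        for doc_id, vec in pool.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# One seed example and a four-document unlabeled pool (assumed toy data).
seeds = [[1.0, 0.0]]
pool = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [0.7, 0.7], "d": [-1.0, 0.0]}
picked = sample_nearest(seeds, pool, 2)
```

In a rare-class setting, the selected documents would go to an annotator, and the newly labeled positives would be added to `seeds` for the next round. If the embeddings are multilingual, the same lookup can surface candidates in a different language from the seeds.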

Title: Low Resource Social Media Text Mining
Authors: Shriphani Palakodety, Ashiqur R. KhudaBukhsh, Guha Jayachandran
Copyright Year: 2021
Publisher: Springer Singapore
Electronic ISBN: 978-981-16-5625-5
Print ISBN: 978-981-16-5624-8
DOI: https://doi.org/10.1007/978-981-16-5625-5
