
2019 | Book

Text Analytics with Python

A Practitioner's Guide to Natural Language Processing

About This Book

Leverage Natural Language Processing (NLP) in Python and learn how to set up your own robust environment for performing text analytics. This second edition has gone through a major revamp and introduces several significant changes and new topics based on the recent trends in NLP.

You’ll see how to use the latest state-of-the-art NLP frameworks in Python, coupled with machine learning and deep learning models, to solve real case studies such as supervised sentiment analysis. Start by reviewing Python for NLP fundamentals on strings and text data and move on to engineering representation methods for text data, including both traditional statistical models and newer deep learning-based embedding models. Improved techniques and new methods around parsing and processing text are discussed as well.

Text summarization and topic models have been overhauled so the book showcases how to build, tune, and interpret topic models in the context of an interesting dataset of NIPS conference papers. Additionally, the book covers text similarity techniques with a real-world example of movie recommenders, along with sentiment analysis using supervised and unsupervised techniques.

There is also a chapter dedicated to semantic analysis where you’ll see how to build your own named entity recognition (NER) system from scratch. While the overall structure of the book remains the same, the entire code base, modules, and chapters have been updated to the latest Python 3.x release.

What You'll Learn

• Understand NLP and text syntax, semantics and structure
• Discover text cleaning and feature engineering
• Review text classification and text clustering
• Assess text summarization and topic models
• Study deep learning for NLP

Who This Book Is For

IT professionals, data analysts, developers, linguistic experts, data scientists, engineers, and anyone with a keen interest in linguistics, analytics, and generating insights from textual data.

Table of Contents

Frontmatter
Chapter 1. Natural Language Processing Basics
Abstract
We have ushered in the age of Big Data, where organizations and businesses are having difficulty managing all the data generated by various systems, processes, and transactions. However, the term Big Data is misused a lot due to the vague definition of the 3Vs of data—volume, variety, and velocity. It is sometimes difficult to quantify what data is “big”. Some might think a billion records in a database is “Big Data,” but that number seems small compared to the petabytes of data being generated by various sensors or by social media. One common characteristic is the large volume of unstructured textual data that’s present across all organizations, irrespective of their domain. As an example, we have vast amounts of data in the form of tweets, status messages, hash tags, articles, blogs, wikis, and much more on social media. Even retail and ecommerce stores generate a lot of textual data, from new product information and metadata to customer reviews and feedback.
Dipanjan Sarkar
Chapter 2. Python for Natural Language Processing
Abstract
In the previous chapter, we took a journey into the world of natural language processing and explored several interesting concepts and domains associated with it. We now have a better understanding of the entire scope surrounding natural language processing, linguistics, and text analytics. As you may recall, we also got our first taste of running Python code to look at the essentials of processing and understanding text. We also looked at ways to access and use text corpora resources with the help of the NLTK framework. In this chapter, we look at why Python is the language of choice for natural language processing (NLP), set up a robust Python environment, take a hands-on approach to understanding the essentials of string and text processing, manipulation, and transformation, and conclude by looking at some of the important libraries and frameworks associated with NLP and text analytics. This chapter aims to provide a quick refresher for getting started with Python and NLP.
Dipanjan Sarkar
Chapter 3. Processing and Understanding Text
Abstract
In the previous chapters, we caught a glimpse of the entire natural language processing and text analytics landscape, with essential terminology and concepts. We were also introduced to the Python programming language, its essential constructs and syntax, and learned how to work with strings to manage textual data. To perform complex operations on text with machine learning or deep learning algorithms, you need to process and parse textual data into formats that are easier to interpret. All machine learning algorithms, be they supervised or unsupervised techniques, work with input features, which are numeric in nature. While this is a separate topic under feature engineering, which we shall explore in detail in the next chapter, to get to that step, you will need to clean, normalize, and preprocess the initial textual data.
Dipanjan Sarkar
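
As a rough illustration of the cleaning and normalization step described in the Chapter 3 abstract, here is a minimal sketch using only the Python standard library. It is not taken from the book: the function name `normalize_text` and the tiny `STOPWORDS` set are hypothetical, chosen only to show the idea of lowercasing, stripping punctuation, and dropping stopwords before any feature engineering.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list (assumption for illustration only).
STOPWORDS = {"the", "a", "an", "is", "and", "of", "to", "in"}

def normalize_text(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    # Remove every character listed in string.punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on runs of whitespace; empty strings are filtered out below.
    tokens = re.split(r"\s+", text.strip())
    return [tok for tok in tokens if tok and tok not in STOPWORDS]

print(normalize_text("The quick, brown fox is jumping!"))
# → ['quick', 'brown', 'fox', 'jumping']
```

A real pipeline would typically rely on a library tokenizer and a fuller stopword list; the sketch only shows the shape of the preprocessing step.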
Chapter 4. Feature Engineering for Text Representation
Abstract
In the previous chapters, we saw how to understand, process, and wrangle text data. However, all machine learning or deep learning models are limited because they cannot understand text data directly and they only understand numeric representations of features as inputs. In this chapter, we look at how to work with text data, which is definitely one of the most abundant sources of unstructured data. Text data usually consists of documents that can represent words, sentences, or even paragraphs of free-flowing text. The inherent lack of structure (no neatly formatted data columns!) and noisy nature of textual data makes it harder for machine learning methods to directly work on raw text data. Hence, in this chapter, we follow a hands-on approach to exploring some of the most popular and effective strategies for extracting meaningful features from text data. These features can then be used to represent text efficiently, which can be further leveraged in building machine learning or deep learning models easily to solve complex tasks.
Dipanjan Sarkar
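
To make concrete the point in the Chapter 4 abstract that models need numeric representations of text, the following sketch builds a simple bag-of-words count vector for each document. It is an assumption-laden toy, not the book's code: the helper names `build_vocabulary` and `bag_of_words` are hypothetical, tokenization is a plain whitespace split, and the three sample documents are made up for the example.

```python
from collections import Counter

def build_vocabulary(docs):
    """Collect the sorted set of lowercase whitespace-separated words."""
    return sorted({word for doc in docs for word in doc.lower().split()})

def bag_of_words(doc, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]

# Made-up sample corpus (assumption for illustration only).
docs = [
    "the sky is blue",
    "the sun is bright",
    "the sun in the sky is bright",
]

vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)
# → ['blue', 'bright', 'in', 'is', 'sky', 'sun', 'the']
print(vectors[2])
# → [0, 1, 1, 1, 1, 1, 2]
```

Each document becomes a fixed-length list of integers, which is the kind of numeric input a machine learning model can consume directly.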
Chapter 5. Text Classification
Abstract
Learning to process and understand text is one of the first, yet most essential, steps on the journey to getting meaningful insights from textual data. While it is important to understand language syntax, structure, and semantics, that understanding is not sufficient on its own to derive useful patterns and insights and get maximum use out of vast volumes of text data. Knowledge of language processing, coupled with concepts from artificial intelligence, machine learning, and deep learning, helps in building intelligent systems that can leverage text data and solve real-world practical problems that benefit businesses and enterprises.
Dipanjan Sarkar
Chapter 6. Text Summarization and Topic Models
Abstract
We have come quite a long way in our journey through the world of text analytics and natural language processing. You have seen how to process and annotate textual data for various applications. We also looked at state-of-the-art text representation methods with feature engineering. We also ventured into the world of machine learning and built our own multi-class text classification system by leveraging various feature extraction techniques and supervised machine learning algorithms. In this chapter, we tackle a slightly different problem in the world of text analytics—information summarization.
Dipanjan Sarkar
Chapter 7. Text Similarity and Clustering
Abstract
In the previous chapters, we covered several techniques to analyze text and extract interesting insights. We looked at supervised machine learning techniques, which are used to categorize text documents into several assumed categories. Unsupervised techniques like topic models and document summarization were also covered, which involved trying to retrieve key themes and information from large text documents and corpora.
Dipanjan Sarkar
Chapter 8. Semantic Analysis
Abstract
Natural language understanding has gained significant importance in the last decade with the advent of machine learning and further advances like deep learning and artificial intelligence. Computers, or machines in general, can be programmed to learn specific things or perform specific operations. However, the key limitation is their inability to perceive, understand, and comprehend things like humans do. With the resurgence in popularity of neural networks and advances made in computer architecture, we now have deep learning and artificial intelligence evolving at a rapid pace and we have been engineering machines into learning, perceiving, understanding, and performing actions on their own. You may have seen or heard several of these efforts in the form of self-driving cars, computers beating experienced players in their own games like Chess and Go, and more recently chatbots. So far, we have looked at various computational, language processing, and machine learning techniques to classify, cluster, and summarize text. We also developed certain methods and programs to analyze and understand text syntax and structure. This chapter deals with methods that try to answer the question, "Can we analyze and understand the meaning and sentiment behind a body of text?"
Dipanjan Sarkar
Chapter 9. Sentiment Analysis
Abstract
In this chapter, we cover one of the most interesting and widely used aspects pertaining to natural language processing (NLP), text analytics, and machine learning. The problem at hand is sentiment analysis or opinion mining, where we want to analyze some textual documents and predict their sentiment or opinion based on the content of these documents. Sentiment analysis is perhaps one of the most popular applications of natural language processing and text analytics, with a vast number of websites, books, and tutorials on this subject. Sentiment analysis seems to work best on subjective text, where people express opinions, feelings, and their mood. From a real-world industry standpoint, sentiment analysis is widely used to analyze corporate surveys, feedback surveys, social media data, and reviews for movies, places, commodities, and many more. The idea is to analyze the reactions of people about a specific entity and take insightful actions based on their sentiments.
Dipanjan Sarkar
Chapter 10. The Promise of Deep Learning
Abstract
The focus of this book has been primarily to get you up to speed on essential techniques in natural language processing, so covering detailed applications leveraging deep learning for NLP is out of the current scope. However, we have still tried to depict some interesting applications of NLP throughout the book, including Chapter 4, where we covered interesting methods around word embeddings using deep learning methods like Word2Vec, GloVe, and FastText and Chapter 5, where we built text classification models using deep learning. The intent of this chapter is to talk a fair bit about the recent advancements made in the field of NLP with the help of deep learning and the promise it holds toward building better models, solving more complex problems and helping us build better and more intelligent systems.
Dipanjan Sarkar
Backmatter
Metadata
Title
Text Analytics with Python
Author
Dipanjan Sarkar
Copyright year
2019
Publisher
Apress
Electronic ISBN
978-1-4842-4354-1
Print ISBN
978-1-4842-4353-4
DOI
https://doi.org/10.1007/978-1-4842-4354-1
