2015 | Book

Advanced Applications of Natural Language Processing for Performing Information Extraction


About this book

This book explains how to create information extraction (IE) applications that are able to tap the vast amount of relevant information available in natural language sources: Internet pages, official documents such as laws and regulations, books and newspapers, and the social web. Readers are introduced to the problem of IE and its current challenges and limitations, supported with examples. The book discusses the need to fill the gap between documents, data, and people, and provides a broad overview of the technology supporting IE. The authors present a generic architecture for developing systems that are able to learn how to extract relevant information from natural language documents, and illustrate how to implement working systems using state-of-the-art and freely available software tools. The book also discusses concrete applications illustrating IE uses.

· Provides an overview of state-of-the-art technology in information extraction (IE), discussing achievements and limitations for the software developer and providing references for specialized literature in the area

· Presents a comprehensive list of freely available, high quality software for several subtasks of IE and for several natural languages

· Describes a generic architecture that can learn how to extract information for a given application domain

Table of Contents

Frontmatter
Chapter 1. Introduction
Abstract
Chapter 1 introduces the problem of extracting information from unstructured natural language documents, which is becoming more and more relevant in our “document society”. Despite the many useful applications that the information in these documents can enable, it is increasingly hard to obtain the desired information. A major problem is that many of the documents are in formats that are not directly usable by humans or machines. There is a need to create ways to extract relevant information from the vast amount of natural language sources.
The chapter then briefly presents background information on semantics, knowledge representation, and Natural Language Processing to support the presentation of the area of Information Extraction [IE, “the analysis of unstructured text in order to extract information about pre-specified types of events, entities or relationships, such as the relationship between disease and genes or disease and food items; in so doing value and insight are added to the data.” (Text mining of web-based medical content, Berlin, p 50)], its challenges, different approaches, and general architecture. This architecture is organized as a processing pipeline including domain-independent components (tokenization, morphological analysis, part-of-speech tagging, syntactic parsing) and domain-specific IE components (named entity recognition and co-reference resolution, relation identification, and information fusion, among others).
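For illustration only, the following is a minimal Python sketch of such a pipeline: domain-independent preprocessing (tokenization, tagging, parsing) followed by a deliberately naive domain-specific step that reads a simple subject-verb-object relation off the parse tree. The spaCy library and the example sentence are assumptions made here, not the tools or data used in the book.

```python
# Minimal sketch, assuming spaCy and its small English model are installed
# (pip install spacy && python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")   # one call runs tokenizer, tagger, parser and NER
doc = nlp("Aspirin reduces the risk of heart attacks in many patients.")

# Naive domain-specific step: extract a subject-verb-object relation from the parse.
for token in doc:
    if token.dep_ == "ROOT" and token.pos_ == "VERB":
        subjects = [w for w in token.lefts if w.dep_ in ("nsubj", "nsubjpass")]
        objects = [w for w in token.rights if w.dep_ in ("dobj", "obj", "attr")]
        if subjects and objects:
            print(subjects[0].text, token.lemma_, objects[0].text)  # Aspirin reduce risk
```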
Mário Rodrigues, António Teixeira
Chapter 2. Data Gathering, Preparation and Enrichment
Abstract
This chapter presents the domain-independent part of the general architecture of Information Extraction (IE) systems. This first part aims at preparing documents by applying several Natural Language Processing tasks that enrich the documents with morphological and syntactic information. This is done in successive processing steps that start by making the contents uniform and end by identifying the roles of the words and how they are arranged.
The most common steps are described here: sentence boundary detection, tokenization, part-of-speech tagging, and syntactic parsing. The description includes information on a selection of relevant tools available to implement each step.
The chapter ends with the presentation of three very representative software suites that make it easier to integrate the several steps described.
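As an illustration of these preprocessing steps, below is a minimal sketch using Stanford's freely available stanza package; it is one possible choice among such suites, not necessarily one of the tools used in the book.

```python
# Rough sketch of the domain-independent preprocessing steps, assuming stanza is
# installed (pip install stanza); models are downloaded on first run.
import stanza

stanza.download("en", verbose=False)
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse", verbose=False)

text = "Laws are published daily. Extracting their key facts by hand is slow."
doc = nlp(text)
for i, sentence in enumerate(doc.sentences, start=1):   # sentence boundary detection
    print(f"sentence {i}")
    for word in sentence.words:                          # tokenization
        # part-of-speech tag, lemma, and syntactic head from dependency parsing
        print(f"  {word.text:<12} {word.upos:<6} {word.lemma:<12} "
              f"head={word.head} rel={word.deprel}")
```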
Mário Rodrigues, António Teixeira
Chapter 3. Identifying Things, Relations, and Semantizing Data
Abstract
This chapter concludes the description of the generic pipelined architecture of Information Extraction (IE) systems by presenting its domain-dependent part.
After preparation and enrichment, the documents' contents are characterized and ready to be processed to locate and extract information. This chapter explains how this can be done, addressing both the extraction of entities and of the relations between them.
Identifying entities mentioned in texts is a pervasive task in IE. It is called Named Entity Recognition (NER) and seeks to locate and classify textual mentions that refer to specific types of entities, such as persons, organizations, addresses, and dates.
The chapter also addresses how to store the extracted information and how to take advantage of semantics to improve the extraction process, presenting the basis of Ontology-Based Information Extraction (OBIE) systems.
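The sketch below illustrates the combination of both ideas: named entities are recognized and then stored as RDF triples so they can later be linked to an ontology. The stanza and rdflib packages and the ex: vocabulary are illustrative choices, not the book's implementation.

```python
# Sketch only: NER with stanza, results stored as RDF triples with rdflib.
import stanza
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

stanza.download("en", verbose=False)
nlp = stanza.Pipeline("en", processors="tokenize,ner", verbose=False)

EX = Namespace("http://example.org/ie#")   # made-up vocabulary for illustration
g = Graph()
g.bind("ex", EX)

doc = nlp("Tim Berners-Lee founded the World Wide Web Consortium in 1994.")
for ent in doc.ents:
    node = EX[ent.text.replace(" ", "_")]
    g.add((node, RDF.type, EX[ent.type.title()]))   # e.g. ex:Person, ex:Org, ex:Date
    g.add((node, RDFS.label, Literal(ent.text)))

print(g.serialize(format="turtle"))
```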
Mário Rodrigues, António Teixeira
Chapter 4. Extracting Relevant Information Using a Given Semantic
Abstract
This chapter presents an example of a software architecture, developed by the authors, for performing Ontology-Based Information Extraction (OBIE) using an arbitrary ontology. The goal of the architecture is to allow the deployment of applications for arbitrary domains without reprogramming the system. For that, human operators define the semantics of the application and provide some examples of ontology concepts in target texts; the system then learns how to extract information according to the defined ontology.
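A minimal sketch of the "operator defines the semantics" step might read the target entity types and relations directly from the supplied ontology, as below; the file name and the use of rdflib are assumptions made for illustration, not the book's implementation.

```python
# Sketch: derive extraction targets from an arbitrary ontology instead of hard-coding them.
from rdflib import Graph
from rdflib.namespace import RDF, OWL

g = Graph()
# Hypothetical file name; an RDF/XML serialization of the application ontology is assumed.
g.parse("application_ontology.owl", format="xml")

entity_types = sorted(str(c) for c in g.subjects(RDF.type, OWL.Class))
relations = sorted(str(p) for p in g.subjects(RDF.type, OWL.ObjectProperty))

print("entity types the system should learn to extract:", entity_types)
print("relations the system should learn to extract:", relations)
```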
An instantiation of the proposed architecture using freely available, high-performance software tools is also presented. The instantiation processes texts in Portuguese, a natural language that was not the original target for most of the tools, showing and discussing how to prepare tools for languages other than those supported out of the box.
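For the language side of the instantiation, the sketch below shows one way to prepare a freely available pipeline for Portuguese, here using Stanford's stanza package, which ships Universal Dependencies models for Portuguese; the actual tools adapted in the book may differ.

```python
# Sketch: building a Portuguese preprocessing pipeline with stanza.
import stanza

stanza.download("pt", verbose=False)      # Portuguese models, fetched on first run
nlp = stanza.Pipeline(
    "pt",
    processors="tokenize,mwt,pos,lemma,depparse",  # mwt splits contractions such as "da" -> "de a"
    verbose=False,
)

doc = nlp("A Câmara Municipal aprovou o orçamento para 2015.")
for word in doc.sentences[0].words:
    print(word.text, word.upos, word.deprel)
```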
Mário Rodrigues, António Teixeira
Chapter 5. Application Examples
Abstract
This chapter presents two concrete application examples. The first example is a tutorial that is easy to replicate (almost) without requiring computer programming skills. It elaborates on extracting information useful in a wide range of scenarios: the detection of people, organizations, and dates. It shows how to extract information from a Wikipedia page. Most of the system is implemented using the Stanford CoreNLP suite.
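A rough Python approximation of this first example is sketched below. The book's tutorial uses the Stanford CoreNLP suite; here Stanford's stanza package stands in for it, and the MediaWiki API call and the page title are illustrative assumptions.

```python
# Sketch: fetch plain text of a Wikipedia page, then keep person/organization/date mentions.
import requests
import stanza

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "query", "prop": "extracts", "explaintext": 1,
            "format": "json", "titles": "Tim Berners-Lee"},   # page title chosen arbitrarily
    timeout=30,
)
pages = resp.json()["query"]["pages"]
text = next(iter(pages.values()))["extract"][:2000]   # keep the demo small

stanza.download("en", verbose=False)
nlp = stanza.Pipeline("en", processors="tokenize,ner", verbose=False)
doc = nlp(text)

wanted = {"PERSON", "ORG", "DATE"}
for ent in doc.ents:
    if ent.type in wanted:
        print(f"{ent.type:<7} {ent.text}")
```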
The second example is more complex and instantiates the OBIE architecture presented in the previous chapter using software tools from different sources that need to be adapted to work together. The application is related to electronic government and processes publicly available documents of municipalities. This second example targets contents written in a natural language not often supported out of the box: Portuguese.
Mário Rodrigues, António Teixeira
Chapter 6. Conclusion
Abstract
This book discussed the need to provide formal structures for contents originally created in unstructured formats using natural language. The volume of relevant information in such formats increases every day as people use the Internet to communicate and as organizations create and publish documentation. These contents often lack formalized markup because marking up contents manually can be a time-consuming and error-prone task that requires specialized knowledge. The objective of information extraction is to analyze these contents and produce fixed-format, unambiguous, and formal representations of them, including the identification of the entities involved and the relations established among them.
Mário Rodrigues, António Teixeira
Backmatter
Metadata
Title
Advanced Applications of Natural Language Processing for Performing Information Extraction
Authors
Mário Rodrigues
António Teixeira
Copyright Year
2015
Electronic ISBN
978-3-319-15563-0
Print ISBN
978-3-319-15562-3
DOI
https://doi.org/10.1007/978-3-319-15563-0
