2005 | Book

Text Mining

Predictive Methods for Analyzing Unstructured Information

Authors: Sholom M. Weiss, Nitin Indurkhya, Tong Zhang, Fred J. Damerau

Publisher: Springer New York


About this book

Data mining is a mature technology. The prediction problem, looking for predictive patterns in data, has been widely studied. Strong methods are available to the practitioner. These methods process structured numerical information, where uniform measurements are taken over a sample of data. Text is often described as unstructured information. So, it would seem, text and numerical data are different, requiring different methods. Or are they? In our view, a prediction problem can be solved by the same methods, whether the data are structured numerical measurements or unstructured text. Text and documents can be transformed into measured values, such as the presence or absence of words, and the same methods that have proven successful for predictive data mining can be applied to text. Yet, there are key differences. Evaluation techniques must be adapted to the chronological order of publication and to alternative measures of error. Because the data are documents, more specialized analytical methods may be preferred for text. Moreover, the methods must be modified to accommodate very high dimensions: tens of thousands of words and documents. Still, the central themes are similar.

Table of contents

Frontmatter
1. Overview of Text Mining
Abstract
Do you have a shortage of data? Not very likely. A consequence of the pervasive use of computers is that most data originate in digital form. If we trade a stock or write a book or buy a product online, these events evolve electronically. Since so many paper transactions are now in paperless digital form, lots of “big” data are available for further analysis.
Sholom M. Weiss, Nitin Indurkhya, Tong Zhang, Fred J. Damerau
2. From Textual Information to Numerical Vectors
Abstract
To mine text, we first need to process it into a form that data-mining procedures can use. As mentioned in the previous chapter, this typically involves generating features in a spreadsheet format. Classical data mining looks at highly structured data. Our spreadsheet model is the embodiment of a representation that is supportive of predictive modeling. In some ways, predictive text mining is simpler and more restrictive than open-ended data mining. Because predictive mining methods are so highly developed, most time spent on data-mining projects is for data preparation. We say that text mining is unstructured because it is very far from the spreadsheet model that we need to process data for prediction. Yet, the transformation of data from text to the spreadsheet model can be highly methodical, and we have a carefully organized procedure to fill in the cells of the spreadsheet. First, of course, we have to determine the nature of the columns (i.e., the features) of the spreadsheet. Some useful features are easy to obtain (e.g., a word as it occurs in text) and some are much more difficult (e.g., the grammatical function of a word in a sentence such as subject, object, etc.). In this chapter, we will discuss how to obtain the kinds of features commonly generated from text.
Sholom M. Weiss, Nitin Indurkhya, Tong Zhang, Fred J. Damerau
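The transformation described in the Chapter 2 abstract can be illustrated with a small sketch. This is not the book's procedure: the tokenizer, the function names `tokenize` and `build_spreadsheet`, and the two sample documents are assumptions made for illustration. Each document becomes one row, and each word in the collection becomes one column holding 1 (present) or 0 (absent).

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (an assumed, very simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_spreadsheet(documents):
    """Return (vocabulary, rows): one row per document, one 0/1 column per word."""
    tokenized = [set(tokenize(doc)) for doc in documents]
    # The columns are all distinct words seen in the collection, in sorted order.
    vocabulary = sorted(set().union(*tokenized))
    rows = [[1 if word in tokens else 0 for word in vocabulary] for tokens in tokenized]
    return vocabulary, rows


# Hypothetical two-document collection.
docs = ["The stock price rose.", "The book sold online."]
vocabulary, rows = build_spreadsheet(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Word presence is only the easiest kind of feature to obtain; features such as the grammatical function of a word, which the abstract calls much more difficult, are outside the scope of this sketch.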
3. Using Text for Prediction
Abstract
The words prediction and forecast conjure up images of momentous decisions and complex processes fraught with inaccuracies. From a statistical perspective, it’s a straightforward problem that has a solution. Of course, the solution may not always be very good. The problem presents itself as in Figure 3.1. Given a sample of examples of past experience, we project to new examples. If the future is similar to the past, we may have an opportunity to make accurate predictions. An example of such a situation is where one tries to predict the future share price of a company based on historical records of the company’s share price and other measures of its performance.
Sholom M. Weiss, Nitin Indurkhya, Tong Zhang, Fred J. Damerau
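The idea in the Chapter 3 abstract of projecting from past examples to new ones can be sketched with a nearest-neighbor rule. This is one simple technique chosen for illustration, not necessarily a method the book recommends; the Jaccard similarity, the function names, and the labeled sample sentences are all assumptions.

```python
import re


def words(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Share of words common to both sets, relative to all words in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict(new_text, labeled_examples):
    """Return the label of the past example most similar to new_text."""
    new_words = words(new_text)
    _, best_label = max(
        labeled_examples,
        key=lambda example: jaccard(new_words, words(example[0])),
    )
    return best_label


# Hypothetical past experience: (text, label) pairs.
past = [
    ("profits rose and shares climbed", "up"),
    ("losses widened and shares fell", "down"),
]
print(predict("shares climbed after profits rose", past))
```

The prediction is only as good as the resemblance between new text and the stored examples, which matches the abstract's caveat that accuracy depends on the future being similar to the past.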
4. Information Retrieval and Text Mining
Abstract
What is the principal computer specialty for processing documents and text? Many experts would respond “Information Retrieval.” The task of information retrieval, or IR as its practitioners call it, is to retrieve relevant documents in response to a query. Figure 4.1 illustrates the objectives of information retrieval of documents, where (a) a general description is given of the query, (b) the document collection is searched, and (c) a subset of relevant documents is returned.
Sholom M. Weiss, Nitin Indurkhya, Tong Zhang, Fred J. Damerau
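The three steps named in the Chapter 4 abstract (describe the query, search the collection, return a relevant subset) can be sketched as follows. Ranking by cosine similarity over raw word counts is an assumption made here for illustration, as are the function names and the sample collection; the abstract does not specify a scoring method.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Count lowercase alphabetic words in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, collection, top_k=2):
    """Return indices of up to top_k documents with nonzero similarity, best first."""
    q = term_counts(query)
    scored = [(cosine(q, term_counts(doc)), i) for i, doc in enumerate(collection)]
    # Sort by descending score; ties fall back to the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in scored if score > 0][:top_k]


# Hypothetical document collection.
collection = [
    "stock prices rose sharply",
    "a new book on text mining",
    "mining companies report stock gains",
]
print(retrieve("text mining book", collection))
```

Documents that share no words with the query are dropped rather than ranked last, so the result is a subset of the collection and not a reordering of it.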
5. Finding Structure in a Document Collection
Abstract
Prediction methods look at stored examples with correct answers and project answers for new examples. One would expect that if we cannot obtain answers for the training examples, then the process cannot be completed. Given a collection of documents, we have no problem transforming the unstructured set of words for each document into a structured spreadsheet. But the last column also must be filled in. In Figure 5.1, we see a spreadsheet, a list of labels, and the spreadsheet column containing the labeled answers. Someone must compose a list of potential labels. Given the list, someone assigns labels to the documents. Sometimes label assignment can be automated, such as the label that a company’s stock price has risen. In most instances, such as topic assignment to newswire articles, the assignment of labels is done by humans, and this can be a tedious and expensive task. Is there any way to assign labels automatically to a document collection? We will discuss this task. Not only will the labels be assigned, but the list of labels will also be determined automatically. Because such key information is missing from the problem description, our expectations for accurate predictive performance should be reduced from standard prediction applications with labeled data.
Sholom M. Weiss, Nitin Indurkhya, Tong Zhang, Fred J. Damerau
6. Looking for Information in Documents
Abstract
An important research area for natural language processing and text mining is the extraction and formatting of information from unstructured text. One can think of the end goal of information extraction in terms of filling templates codifying the extracted information. In this chapter, we shall describe information extraction from this perspective and some machine-learning methods that can be used to solve this problem.
Sholom M. Weiss, Nitin Indurkhya, Tong Zhang, Fred J. Damerau
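The template-filling view of information extraction in the Chapter 6 abstract can be made concrete with a small sketch. It uses a hand-written regular expression as a stand-in for the machine-learning methods the chapter describes; the acquisition template, its slot names (`buyer`, `target`, `price`), and the sample sentence are all hypothetical.

```python
import re

# One rule for one hypothetical template: "<buyer> acquired <target> for <price>".
# Names are assumed to be runs of capitalized words, which is a crude heuristic:
# any capitalized word directly before the buyer would be absorbed into the slot.
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"
PATTERN = re.compile(
    rf"(?P<buyer>{NAME}) acquired (?P<target>{NAME}) for "
    r"(?P<price>\$\d+(?:\.\d+)? (?:million|billion))"
)


def fill_template(sentence):
    """Return the filled template as a dict, or None if the rule does not match."""
    match = PATTERN.search(sentence)
    return match.groupdict() if match else None


print(fill_template("Acme Corp acquired Widget Inc for $5 million."))
print(fill_template("It rained today."))
```

A learned extractor would replace the fixed pattern with a model trained on annotated text, but the output would have the same shape: a template whose slots are filled from unstructured input.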
7. Case Studies
Abstract
Our approach to text mining is motivated by practical applications. However, the design and development of prediction methods often take place in a controlled scientific environment that simulates the real world. This is necessary for comparative analyses and also for unraveling the pieces of the puzzle that constitute a prediction problem. The question remains as to the appropriateness of these methods for practical use. Unlike laboratory environments, the real world is less readily controlled. Methods may need to be combined and adapted to the task at hand. User interface issues should be addressed. Practical considerations, such as resource limitations, must be acknowledged.
Sholom M. Weiss, Nitin Indurkhya, Tong Zhang, Fred J. Damerau
8. Emerging Directions
Abstract
Our principal objective is predictive text mining. The wealth of research literature for text mining encompasses a much wider range of topics than is presented here. At the same time, the research literature deals with each of these topics in great depth, describing many alternatives to the approaches that we have selected. Our description is not a comprehensive review of the field. We have used our judgment in selecting the basic areas of interest and the fundamental concepts that can lead to practical results. For example, thousands of papers have been written on classification methods. We picked our favorites for text mining.
Sholom M. Weiss, Nitin Indurkhya, Tong Zhang, Fred J. Damerau
Backmatter
Metadata
Title
Text Mining
Authors
Sholom M. Weiss
Nitin Indurkhya
Tong Zhang
Fred J. Damerau
Copyright year
2005
Publisher
Springer New York
Electronic ISBN
978-0-387-34555-0
Print ISBN
978-0-387-95433-2
DOI
https://doi.org/10.1007/978-0-387-34555-0