Skip to main content

2010 | Buch

Data Mining and Knowledge Discovery via Logic-Based Methods

Theory, Algorithms, and Applications

insite
SUCHEN

Über dieses Buch

The importance of having ef cient and effective methods for data mining and kn- ledge discovery (DM&KD), to which the present book is devoted, grows every day and numerous such methods have been developed in recent decades. There exists a great variety of different settings for the main problem studied by data mining and knowledge discovery, and it seems that a very popular one is formulated in terms of binary attributes. In this setting, states of nature of the application area under consideration are described by Boolean vectors de ned on some attributes. That is, by data points de ned in the Boolean space of the attributes. It is postulated that there exists a partition of this space into two classes, which should be inferred as patterns on the attributes when only several data points are known, the so-called positive and negative training examples. The main problem in DM&KD is de ned as nding rules for recognizing (cl- sifying) new data points of unknown class, i. e. , deciding which of them are positive and which are negative. In other words, to infer the binary value of one more attribute, called the goal or class attribute. To solve this problem, some methods have been suggested which construct a Boolean function separating the two given sets of positive and negative training data points.

Inhaltsverzeichnis

Frontmatter

Algorithmic Issues

Frontmatter
Chapter 1. Introduction
Abstract
Data mining and knowledge discovery is a family of computational methods that aim at collecting and analyzing data related to the function of a system of interest for the purpose of gaining a better understanding of it. This system of interest might be artificial or natural. According to the Merriam-Webster online dictionary the term system is derived from the Greek terms syn (plus, with, along with, together, at the same time) and istanai (to cause to stand) and it means a complex entity which is comprised of other more elementary entities which in turn may be comprised of other even more elementary entities and so on. All these entities are somehow interconnected with each other and form a unified whole (the system). Thus, all these entities are related to each other and their collective operation is of interest to the analyst, hence the need to employ data mining and knowledge discovery (DM&KD) methods. Some illustrative examples of various systems are given in the next section.
Evangelos Triantaphyllou
Chapter 2. Inferring a Boolean Function from Positive and Negative Examples
Abstract
A central problem in data mining is how to analyze observations grouped into two categories and infer some key patterns that may be implied by these observations. As discussed in Chapter 1, these observations describe different states of nature of the system or phenomenon of interest to the analyst.
Evangelos Triantaphyllou
Chapter 3. A Revised Branch-and-Bound Approach for Inferring a Boolean Function from Examples
Abstract
This chapter discusses a revised branch-and-bound (B&B) algorithm for inferring a single clause (in CNF or DNF) from two disjoint sets of binary training examples. This algorithm is an extension of the B&B algorithm described in the previous chapter. Now the states of the search space are described by using more information and this seems to be critical in leading to good search results faster. This chapter is based on the developments first presented in [ Triantaphyllou, 1994].
Evangelos Triantaphyllou
Chapter 4. Some Fast Heuristics for Inferring a Boolean Function from Examples
Abstract
The previous two chapters discussed the development and key mathematical properties of some branch-and-bound (B&B) branch-and-bound (B&B) approaches for inferring a Boolean function in the form of a compact (i.e., with as few clauses as possible) CNF or DNF expression from two collections of disjoint examples. As was described in Chapters 2 and 3, the B&B approaches may take a long time to run (actually, they are of exponential time complexity).
Evangelos Triantaphyllou
Chapter 5. An Approach to Guided Learning of Boolean Functions
Abstract
In most of the previous treatments it was assumed that somehow we have available two disjoint sets of training data described by binary vectors, that is, the collections of the positive and negative examples. Then the problem was how to infer a Boolean function that “fits these data.” In other words, a Boolean function in CNF or DNF form that satisfies the requirements of the positive and negative examples as described in Chapters 2 and 3. It is hoped at this point that the inferred Boolean function will accurately classify all remaining examples not included in the currently available positive and negative examples.
Evangelos Triantaphyllou
Chapter 6. An Incremental Learning Algorithm for Inferring Boolean Functions
Abstract
The previous chapter studied the guided learning problem. In that setting, the analyst has the option to select which unclassified example to send to the oracle for classification and use that information to improve the understanding of the system under consideration. When the new example would unveil the need for an update, one had to use all the existing training examples, plus the newly classified example, to infer a new (and hopefully more accurate) pattern in the form of a Boolean function or other data mining model.
Evangelos Triantaphyllou
Chapter 7. A Duality Relationship Between Boolean Functions in CNF and DNF Derivable from the Same Training Examples
Abstract
This chapter discusses a useful relationship between the CNF and DNF forms of the Boolean functions derivable from the same training data. This relationship can benefit approaches which attempt to solve large Boolean function inference problems and use either the CNF or the DNF form in representing a Boolean function.
Evangelos Triantaphyllou
Chapter 8. The Rejectability Graph of Two Sets of Examples
Abstract
This chapter is based on the findings presented in [Triantaphyllou and Soyster, 1996] and presents the motivation and definition of a special graph which can be easily derived from positive and negative examples. To understand the motivation for introducing this graph, consider a situation with n = 5 attributes.
Evangelos Triantaphyllou

Application Issues

Frontmatter
Chapter 9. The Reliability Issue in Data Mining: The Case of Computer-Aided Breast Cancer Diagnosis
Abstract
Almost any use of a data mining and knowledge discovery method on a data set requires some discussion on the accuracy of the extracted model on some test data. This accuracy can be a general description of how well the extracted model classifies test data. Some studies split this accuracy rate into two rates: the false-positive and false-negative rates. This distinction might be more appropriate for most real-life applications. For instance, it is one thing to wrongly diagnose a benign tumor as malignant than the other way around. Related are some of the discussions in Sections 1.3.4, 4.5, and 11.6.
Evangelos Triantaphyllou
Chapter 10. Data Mining and Knowledge Discovery by Means of Monotone Boolean Functions
Abstract
In all previous discussions the problem was how to infer a general Boolean function based on some training examples. Such a Boolean function can be completely inferred if all possible binary examples (states) in the space of the attributes are used for training. Thus, one may never be 100% certain about the validity of the inferred knowledge when the number of training examples is less than 2 n . The situation is different, however, if one deals with the inference of systems that exhibit monotonic behavior. The developments presented in this chapter are based on the award-winning doctoral work of Vetle I. Torvik and in particular on the research results first published in [ Torvik and Triantaphyllou, 2002; 2003; 2004; 2006].
Evangelos Triantaphyllou
Chapter 11. Some Application Issues of Monotone Boolean Functions
Abstract
The property of monotonicity has many applications. Its attractive mathematical advantages in inferring a model of the system of interest with high accuracy make the search for this property in data and its consecutive algorithmic exploitation, to be of high potential in data mining and knowledge discovery applications. The following developments are based on the work described in [ Kovalerchuk, Vityaev, and Triantaphyllou, 1996] and [ Kovalerchuk, Triantaphyllou, and Vityaev, 1995].
Evangelos Triantaphyllou
Chapter 12. Mining of Association Rules
Abstract
Mining of association rules from databases has attracted great interest because of its potentially very useful applications. Association rules are derived from a type of analysis that extracts information from coincidence [ Blaxton and Westphal, 1998]. Sometimes called market basket analysis, this methodology allows a data analyst to discover correlations, or co-occurrences of transactional events. In the classic example, consider the items contained in a customer’s shopping cart on any one trip to a grocery store. Chances are that the customer’s own shopping patterns tend to be internally consistent, and that he/she tends to buy certain items on certain days. There might be many examples of pairs of items that are likely to be purchased together. This is the kind of information the store manager could use to make decisions about where to place items in the store so as to increase sales. This information can be expressed in the form of association rules. Such information may have tremendous potential on the marketing of new or existing products. This is the kind of approach used by many enterprises (such as Amazon.com for instance) to recommend new or existing products to their customers. Mining of association rules is applicable to many more domains [ Bayardo, et al., 1999]. This chapter is based on the results discussed in [ Yilmaz, et al., 2003].
Evangelos Triantaphyllou
Chapter 13. Data Mining of Text Documents
Abstract
This chapter investigates the problem of classifying text documents into two disjoint classes. It does so by employing a data mining approach based on the OCAT algorithm. This chapter is based on the work discussed in [ Nieto Sanchez, Triantaphyllou, and Kraft, 2002]. In the present setting two sample sets of training examples (text documents) are assumed to be available. An approach is developed that uses indexing terms to form patterns of logical expressions (Boolean functions) that next are used to classify new text documents (which are of unknown class). This is a typical case of supervised “crisp” classification.
Evangelos Triantaphyllou
Chapter 14. First Case Study: Predicting Muscle Fatigue from EMG Signals
Abstract
Most of the previous chapters discussed some application issues on a number of areas. This chapter discusses a case study in detail. The emphasis is on some comparative issues with other data mining techniques that do not use logic-based approaches. This chapter also provides a link to the data used in this study.
Evangelos Triantaphyllou
Chapter 15. Second Case Study: Inference of Diagnostic Rules for Breast Cancer
Abstract
For this case study we used a data set that described a number of clinical cases of breast cancer diagnoses. The data were divided into two disjoint sets of malignant and benign cases. We applied the OCAT approach, as it is embedded in the RA1 heuristic (see also Chapter 4), after the data were transformed into binary ones according to the method described in Section 2.2. The following sections describe the data and inferred diagnostic rules in more detail.
Evangelos Triantaphyllou
Chapter 16. A Fuzzy Logic Approach to Attribute Formalization: Analysis of Lobulation for Breast Cancer Diagnosis
Abstract
In many data mining and knowledge discovery applications a critical task is how to define the values of the various attributes that the analyst believes may be of significance. For easily quantifiable attributes (such as, age, weight, cost, etc.) this task is a rather straightforward one as it involves simple measurements and expressing the results in terms of some units. For other attributes, however, this task may not be a simple one. This is the case when some of the data are fuzzy. For instance, although in common language one often uses terms such as “small,” “large,” “round,” “tall,” and so on, these terms may mean different concepts to different people or to the same person at different times.
Evangelos Triantaphyllou
Chapter 17. Conclusions
Abstract
Each of the previous chapters ends with a section with some concluding remarks tailored to the contents of the particular chapter. This section provides some comprehensive concluding remarks. As was mentioned earlier, there are many approaches to data mining and knowledge discovery from data sets. Such approaches include neural networks, closest neighbor methods, and various statistical methods. However, such approaches may have some severe limitations for a number of reasons.
Evangelos Triantaphyllou
Backmatter
Metadaten
Titel
Data Mining and Knowledge Discovery via Logic-Based Methods
verfasst von
Evangelos Triantaphyllou
Copyright-Jahr
2010
Verlag
Springer US
Electronic ISBN
978-1-4419-1630-3
Print ISBN
978-1-4419-1629-7
DOI
https://doi.org/10.1007/978-1-4419-1630-3