Skip to main content
main-content

Über dieses Buch

Data mining includes a wide range of activities such as classification, clustering, similarity analysis, summarization, association rule and sequential pattern discovery, and so forth. The book focuses on the last two previously listed activities. It provides a unified presentation of algorithms for association rule and sequential pattern discovery. For both mining problems, the presentation relies on the lattice structure of the search space. All algorithms are built as processes running on this structure. Proving their properties takes advantage of the mathematical properties of the structure. Part of the motivation for writing this book was postgraduate teaching. One of the main intentions was to make the book a suitable support for the clear exposition of problems and algorithms as well as a sound base for further discussion and investigation. Since the book only assumes elementary mathematical knowledge in the domains of lattices, combinatorial optimization, probability calculus, and statistics, it is fit for use by undergraduate students as well. The algorithms are described in a C-like pseudo programming language. The computations are shown in great detail. This makes the book also fit for use by implementers: computer scientists in many domains as well as industry engineers.

Inhaltsverzeichnis

Frontmatter

Chapter 1. Introduction

Abstract
Recent advances in data collection and storage technologies have made it possible for companies (e.g., bar-code technology), administrative agencies (e.g., census data), and scientific laboratories (e.g., molecule databases in chemistry or biology) to keep vast amounts of data relating to their activities. At the same time, the availability of cheap computing power has made automatic extraction of structured knowledge from these data feasible. Such an activity is referred to as data mining. More recently, the advent in the marketplace of cheap high performance (gigabit level) communication switches is even placing cheap parallel data mining within the reach of the majority. Data mining includes such activities as classification, clustering, similarity analysis, summarization, association rule and sequential pattern discovery, and so forth. The book focuses on the development of sequential and parallel algorithms for association rule and sequential pattern discovery.
Jean-Marc Adamo

Chapter 2. Search Space Partition-Based Rule Mining

Abstract
In this chapter, we define the Boolean association rules and give a formulation of the problem of mining for such rules. Next, we devise sequential and parallel algorithms, based on search space partitioning, that solve this problem. In Section 2.1, we describe the basic objects involved in the problem of mining for association rules. We briefly discuss the relationship between the support/confidence measures and probability. We also discuss the relationship between the association rules and the logical implication. The procedure proposed in [S96] is considered to solve the mining problem. It consists of the following steps: (1) enumerate all frequent attribute sets, and (2) derive all association rules from this set. In Section 2.2, following and extending ideas proposed in [Z98], we identify the search space attached to the mining problem to the powerset 2 A , where A is the set of attributes involved. We define a family of equivalence relations over 2 A that are used in Section 2.3 to derive a partitioning procedure. The procedure is based on recursive bisection and consists of progressively splitting the powerset into finer and finer equivalence classes. Section 2.4 shows that the classes can be processed independently of one another. A join operator, stable with respect to the class partitioning, is proposed for this purpose. The join operator is next used within two partition-based algorithms for frequent attribute set enumeration: a sequential algorithm (Section 2.5) and a parallel one (Section 2.6). The latter requires both initial and dynamic load balancing. These issues are discussed in Section 2.6 where load balancing algorithms are given. Once the set of frequent attribute sets has been found, this set is processed in order to discover all association rules. This is the topic of Section 2.7 where two algorithms (a sequential algorithm and a parallel one) are described.
Jean-Marc Adamo

Chapter 3. Apriori and Other Algorithms

Abstract
In this chapter, we review the main rule-mining algorithms proposed in the literature. We start with two early algorithms known as AIS [AIS93] and SETM [HS95]. Next, we present two important algorithms known as Apriori and AprioriTid [S96], [ASR94]. We finally complete the chapter with two algorithms recently proposed in [PCY97] and [BMUT97] . We essentially focus here on cas enumeration since all mining algorithms only differ on performing this task (all share the same procedure for rule generation that was presented in Chapter 2).
Jean-Marc Adamo

Chapter 4. Mining for Rules over Attribute Taxonomies

Abstract
The algorithms presented so far deal with association rules based on a flat set of attributes. In most applications, the attribute set is given together with a set of taxonomies. The taxonomies reflect arbitrary individual or collective views according to which the set of attributes is hierarchically organized. For instance, Figure 4.1 shows a sample taxonomy defining hierarchies for grocery items (attributes of basket data).
Jean-Marc Adamo

Chapter 5. Constraint-Based Rule Mining

Abstract
In practice, users may only be interested in subsets of associations containing attributes satisfying given Boolean conditions also called Boolean constraints. This chapter, which builds upon the previous chapter, deals with solving the problem of mining for association rules in the presence of such constraints. Taxonomies may be present and constraints may contain both terminal and nonterminal attributes. A set of Boolean constraints can be identified with a Boolean function. In the first section, we define the syntax and semantics of Boolean functions. In the second section we review the notion of prime implicant. The prime implicants are the basic building blocks of Boolean functions. Any Boolean function can be identified with the set of its prime implicants (often, only identified with a subset of it, since the set of prime implicants is, in general, redundant). Each prime implicant defines a sublattice in 2 A . In the last section, we take advantage of the sublattices attached to the prime implicants to devise a sequential and a parallel algorithm solving the problem of mining for association rules under Boolean constraints. The algorithms derive from the ones developed in Chapter 4. Cas enumeration takes advantage of the sublattices to discard all those cass that do not meet the given constraints or cannot be expected to lead to cass meeting these constraints.
Jean-Marc Adamo

Chapter 6. Data Partition-Based Rule Mining

Abstract
So far, the attribute support has been assumed to fit into main memory. This might no longer be the case for large dense databases (e.g., census data or the corpus of text documents). Such databases may contain attributes (e.g., the sex attribute in census data) that have very large support. As suggested in [SON95], solving such a difficulty can be achieved by partitioning the database in addition to partitioning the search space as it was done in Chapter 2. It should be noted that the algorithms described in this chapter are not proposed as an alternative to those presented in Chapters 2, 4, and 5. Instead, they should be considered as new extensions. Data partitioning offers a new source of parallelism that comes in addition to search space partition-based parallelism. Data partitioning can be used within Procedure 2.6.2.2 that computes the starting sets or within Procedure 2.6.3.1 that performs full enumeration as well. For the purpose of simplicity, we confine ourselves to developing the new algorithms upon the basic enumeration procedures found in Chapter 2 (Procedures 2.5.1, 2.6.2.2, and 2.6.3.1). Accommodating these algorithms to make them work with all extensions developed in Chapters 4 and 5 is straightforward. This chapter contains two sections. Section 6.1 presents a probabilistic method for data partitioning while Section 6.2 describes data partition-based algorithms for rule mining.
Jean-Marc Adamo

Chapter 7. Mining for Rules with Categorical and Metric Attributes

Abstract
This chapter focuses on mining for association rules with categorical and metric attributes. Categorical attributes are similar to Boolean ones except that they can take on several discrete values instead of two. A categorical attribute can easily be transformed into a set of Boolean attributes. For instance, the categorical attribute (a, {1, 2, 3, 4}) can be transformed into the following set of pseudo-Boolean attributes: {(a_1, {0, 1}), (a_2, {0, 1}), (a_3, {0, 1}), (a_4, {0, 1})} such that a_i = 0 if §Ñ ≠ i and a_i = 1 if a = i. A metric attribute is one whose domain of values is a metric space, that is (see [B48], p. X for instance), a set M endowed with a distance δ satisfying the properties:
1.
δ(e, e) = 0 for any e in M,
 
2.
δ(e1, e2) > 0 for any pair (e1, e2) such that e1 ≠ e2,
 
3.
δ(e1, e2) = δ(e2, e1) for any pair (e1, e2),
 
4.
δ(e1, e2) + δ(e2, e3) ≥ δ(e1, e3) for any triple (e1, e2, e3).
 
Jean-Marc Adamo

Chapter 8. Optimizing Rules with Quantitative Attributes

Abstract
The purpose of mining over categorical and metric attributes, as described in Chapter 7, relies on exhaustive enumeration. There is another way of drawing useful information from quantitative association rules leading to optimization problems such as the following one.
Jean-Marc Adamo

Chapter 9. Beyond Support-Confidence Framework

Abstract
The standard support-confidence framework suffers from a few weaknesses. Indeed, some of the rules generated with these measures may have poor predictive ability, which means the measures do not perfectly account for the semantics of directed associations. Besides, mining large dense databases with the standard algorithms generally leads to combinatorial explosion, making the approach impracticable. This chapter describes new measures aimed at improving the rule predictive ability and algorithms aimed at limiting combinatorial explosion. In Section 9.1 we present a criticism of the support-confidence framework and we propose a new measure to substitute for confidence. The new measure, so-called conviction, is shown to produce association rules with much better predictive ability. Algorithms for conviction-based rule generation and pruning are proposed. Limiting the complexity of the cas-enumeration procedure is the topic of the next section (Section 9.2). First, the complexity of the search space is drastically reduced. This is achieved by limiting the rule-mining problem to one in which the consequent of the rules is fixed (input of the mining algorithm). Next, in order to reduce the complexity of the search procedure, testing for confidence/conviction is shifted to the cas-enumeration process itself. New one-step sequential and parallel rule-mining algorithms are proposed as an alternative to the classical two-step algorithms. A new measure of rule improvement is also proposed, which yields an efficient improvement-based pruning algorithm. Finally, a new paradigm is presented in Section 9.3 as an alternative to association rule mining: correlated attribute sets are mined instead of association rules. The paradigm relies on a new measure called collective strength. The measure is presented and analyzed, and new efficient sequential and parallel algorithms based on it are developed.
Jean-Marc Adamo

Chapter 10. Search Space Partition-Based Sequential Pattern Mining

Abstract
In this chapter, we give a formulation of the problem of mining for sequential patterns in a Boolean database. Next, we devise sequential and parallel algorithms, based on search space partitioning, that solve this problem. In Section 10.1, we describe the basic objects involved in the problem of mining for sequential patterns. In Section 10.2, following and extending ideas proposed in [Z98], we identify the search space with the powerset (2 A )P, where A is the set of attributes involved and P is the length of the largest sequential pattern that can be discovered. In the next section (Section 10.3), we define a family of equivalence relations over (2 A )P that is used in Section 10.4 to derive a partitioning procedure. The procedure is based on recursive bisection and consists of progressively splitting (2 A )P into finer and finer equivalence classes. Section 10.5 is dedicated to the development of algorithms for sequential pattern enumeration. First, three complementary join operators, stable with respect to the class partitioning, are defined and analyzed. Next, these operators are used within two search space partition-based algorithms for frequent sequential pattern enumeration: a sequential algorithm (Section 10.5.3) and a parallel one (Section 10.5.4). The latter requires both initial and dynamic load balancing. These issues are discussed in Section 10.5, where load balancing algorithms are given.
Jean-Marc Adamo

Backmatter

Weitere Informationen