1 Introduction
Customer | Purchase items
---|---
Jack | (video:1/2), (food:1)
Mary | (clothing:1), (video:1/3), (book:2/3)
\(\mathcal W \) | Tuples in \(\mathcal W \) | Prob.
---|---|---
\(w_1\) | {food}; {clothing} | 1/9
\(w_2\) | {food}; {clothing, video} | 1/18
\(w_3\) | {food}; {clothing, book} | 2/9
\(w_4\) | {food}; {clothing, book, video} | 1/9
\(w_5\) | {food, video}; {clothing} | 1/9
\(w_6\) | {food, video}; {clothing, video} | 1/18
\(w_7\) | {food, video}; {clothing, book} | 2/9
\(w_8\) | {food, video}; {clothing, book, video} | 1/9
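The probabilities in the table above can be reproduced by enumerating all possible worlds of the example database. The sketch below assumes the attribute uncertainty model, where each item occurs in its tuple independently with the stated probability:

```python
from itertools import product

# Example database from the table above (attribute uncertainty):
# each tuple contains an item independently with the given probability.
db = {
    "Jack": {"video": 1/2, "food": 1.0},
    "Mary": {"clothing": 1.0, "video": 1/3, "book": 2/3},
}

def possible_worlds(db):
    """Enumerate every possible world and its probability."""
    per_customer = []
    for customer, item_probs in db.items():
        items = list(item_probs)
        outcomes = []
        # Each subset of the customer's items is one outcome for this tuple.
        for mask in product([0, 1], repeat=len(items)):
            itemset = frozenset(it for it, keep in zip(items, mask) if keep)
            p = 1.0
            for it, keep in zip(items, mask):
                p *= item_probs[it] if keep else 1 - item_probs[it]
            if p > 0:
                outcomes.append((itemset, p))
        per_customer.append(outcomes)
    # A world fixes one outcome per tuple; tuples are independent.
    for combo in product(*per_customer):
        world = tuple(s for s, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob

worlds = dict(possible_worlds(db))
```

For instance, \(w_1\) = ({food}; {clothing}) gets probability \(1/2 \cdot 2/3 \cdot 1/3 = 1/9\), matching the table.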
- A generalized model-based approach to approximately compute PFIs efficiently and effectively;
- An in-depth study of three s-pmf approximation models based on standard probability models;
- A more efficient method to verify a threshold-based PFI;
- Techniques to enhance the performance of threshold- and rank-based PFI discovery algorithms, for both attribute and tuple uncertainty models; and
- An extensive experimental evaluation of the proposed methods, in terms of efficiency and effectiveness, performed on real and synthetic datasets.
2 Related work
3 Problem definition
3.1 Attribute and tuple uncertainty
Notation | Description
---|---
\(D\) | An uncertain database of \(n\) tuples
\(n\) | The number of tuples contained in \(D\)
\(V\) | The set of items that appear in \(D\)
\(v\) | An item, where \(v \in V\)
\(t_j\) | The \(j\)-th tuple with a set of items, where \(j=1,\ldots ,n\)
\(\mathcal W \) | The set of all possible worlds
\(w_j\) | A possible world \(w_j \in \mathcal W\)
\(I\) | An itemset, where \(I \subseteq V\)
\(minsup\) | An integer in \((0, n]\)
\(minprob\) | A real value in \((0, 1]\), used in threshold-based PFI
\(k\) | An integer greater than zero, used in rank-based PFI
\(Pr^I(i)\) | Support probability (i.e., the probability that \(I\) has a support count of \(i\))
\(Pr_{freq}(I)\) | Frequentness probability of \(I\)
\(p_j^I\) | \(Pr(I \subseteq t_j)\)
\(\mu _I\) | Expected value of \(X^I\)
\((\mu _I)_l\) | Expected value of \(X^I\), computed over the first \(l\) tuples
3.2 Probabilistic frequent itemsets (PFI)
- \(I\) is a threshold-based PFI if its frequentness probability is larger than some threshold [39]. Formally, given a real value \(minprob \in (0,1]\), \(I\) is a threshold-based PFI if \(Pr_{freq}(I) \ge minprob\). We call \(minprob\) the frequentness probability threshold.
- \(I\) is a rank-based PFI if its frequentness probability satisfies some ranking criterion. The top-\(k\) PFI, proposed in [39], belongs to this class. Given an integer \(k > 0\), \(I\) is a top-\(k\) PFI if \(Pr_{freq}(I)\) is at least the \(k\)-th highest among all itemsets. We focus on top-\(k\) PFIs in this paper.
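Both definitions rest on the frequentness probability \(Pr_{freq}(I)\). As a point of reference for the approximations discussed next, the exact value can be computed from the per-tuple probabilities \(p_j^I = Pr(I \subseteq t_j)\) by dynamic programming over the support pmf; the sketch below is a minimal illustration, not the optimized algorithm of [39]:

```python
def support_pmf(p):
    """Exact s-pmf of itemset I: pmf[i] = Pr^I(i), the probability that I has
    support count i, given p[j] = Pr(I subset-of t_j) for each tuple j."""
    pmf = [1.0]
    for pj in p:
        nxt = [0.0] * (len(pmf) + 1)
        for i, q in enumerate(pmf):
            nxt[i] += q * (1 - pj)   # tuple j does not contain I
            nxt[i + 1] += q * pj     # tuple j contains I
        pmf = nxt
    return pmf

def frequentness_probability(p, minsup):
    """Pr_freq(I) = Pr(support(I) >= minsup)."""
    return sum(support_pmf(p)[minsup:])

def is_threshold_pfi(p, minsup, minprob):
    return frequentness_probability(p, minsup) >= minprob
```

This costs \(O(n^2)\) time per itemset, which is exactly why the model-based approximations of Section 4 pay off.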
4 Approximation of S-pmf
- For attribute uncertainty,
$$\begin{aligned} Pr(I \subseteq t_j) = \prod _{v \in I} Pr(v \in t_j) \end{aligned}$$(3)
- For tuple uncertainty,
$$\begin{aligned} Pr(I \subseteq t_j) = \left\{ \begin{array}{ll} Pr(t_j) & \text{if } I \subseteq t_j\\ 0 & \text{otherwise} \end{array} \right. \end{aligned}$$(4)
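A minimal sketch of Eqs. 3 and 4; the dictionary- and set-based tuple representations are assumptions made for illustration:

```python
def p_attribute(I, item_probs):
    """Attribute uncertainty (Eq. 3): item_probs maps item v -> Pr(v in t_j);
    items occur in the tuple independently."""
    prob = 1.0
    for v in I:
        prob *= item_probs.get(v, 0.0)
    return prob

def p_tuple(I, tuple_items, tuple_prob):
    """Tuple uncertainty (Eq. 4): the whole tuple exists with probability
    tuple_prob; I is contained only if the tuple covers all items of I."""
    return tuple_prob if set(I) <= set(tuple_items) else 0.0
```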
4.1 Approximation by expected support
4.2 Poisson distribution-based approximation
4.3 Normal distribution-based approximation
4.4 Discussion
5 Threshold-based PFI mining
5.1 PFI testing
- Compute the parameters \(\mu _I\) (and \(\sigma ^{2}_{I}\) for the normal case).
- Derive the probability \(Pr_{freq}(I) = 1 - p\text{-}value(minsup - \epsilon ,\ support(I))\), given \(minsup\) and the respective approximation model of \(support(I)\).
- If \(Pr_{freq}(I) > minprob\), return \(I\) as a threshold-based PFI; otherwise, conclude that \(I\) is not a PFI.
5.2 Improving the PFI testing process for the Poisson approximation
- Step 1. Find a real number \((\mu )_m\) satisfying the equation
$$\begin{aligned} minprob = 1 - F_{p/n}(minsup - 1, (\mu )_m), \end{aligned}$$(16)
where \(F_{p/n}\) denotes the cdf of the Poisson distribution. Thanks to Theorem 2, this equation can be solved efficiently by standard numerical methods, and the computation has to be performed only once.
- Step 3. If \((\mu _I)_i \ge (\mu )_m\), we conclude that \(I\) is a PFI. Otherwise, \(I\) cannot be a PFI.
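Step 1 reduces to a one-dimensional root-finding problem: since \(1 - F_{p/n}(minsup-1, \mu)\) increases monotonically in \(\mu\), simple bisection suffices. The sketch below makes the upper bracket bound an arbitrary assumption:

```python
import math

def poisson_cdf(k, mu):
    """F(k, mu) = Pr(X <= k) for X ~ Poisson(mu)."""
    term = total = math.exp(-mu)
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total

def mu_m(minsup, minprob, hi=1e6, tol=1e-9):
    """Solve minprob = 1 - F(minsup - 1, mu) for mu by bisection (Step 1).
    1 - F is monotonically increasing in mu, so bisection converges.
    hi is an assumed upper bound on the solution."""
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if 1 - poisson_cdf(minsup - 1, mid) < minprob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Step 3 then compares the running expected support \((\mu_I)_i\) against the precomputed \((\mu)_m\), avoiding any per-itemset probability computation.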
5.3 Case study: the Apriori algorithm
6 Rank-based PFI mining
6.1 Candidate itemset generation
6.2 Candidate itemset testing
Iteration | Answer | Priority queue (\(Q\))
---|---|---
\(0\) | \(\emptyset \) | {{food}, {clothing}}
\(1\) | {food} | {{clothing}}
\(2\) | {clothing} | \(\emptyset \)
7 Results
7.1 Results on threshold-based PFI mining
We compare five algorithms: (1) DP, the Apriori algorithm used in [39]; (2) Expected, the modified Apriori algorithm that uses the expected support only [12]; (3) Poisson, the modified Apriori algorithm that uses the Poisson approximation, employing the PFI testing method (Sect. 5.1); (4) Normal, the modified Apriori algorithm that uses the normal approximation, also employing the PFI testing method (Sect. 5.1); and (5) MBP, the algorithm that uses the Poisson approximation together with the improved PFI testing method (Sect. 5.2).

Since Expected, Poisson, and Normal each approximate the exact s-pmf, we first examine their respective accuracy with respect to DP, which yields PFIs based on exact frequentness probabilities. Here, we use the standard recall and precision measures [7], which quantify the numbers of false negatives and false positives, respectively. Specifically, let MB \(\in \) {Expected, Poisson, Normal} be one of the model-based approximation approaches, let \(F_{DP}\) be the set of PFIs generated by DP, and let \(F_{MB}\) be the set of PFIs produced by MB. Then, the recall and the precision of MB, relative to DP, can be defined as follows:
$$\begin{aligned} recall = \frac{|F_{DP} \cap F_{MB}|}{|F_{DP}|} \end{aligned}$$(21)
$$\begin{aligned} precision = \frac{|F_{DP} \cap F_{MB}|}{|F_{MB}|} \end{aligned}$$(22)

The table below reports the recall and precision values of the MB approaches for a wide range of \(minsup\), \(n\), and \(minprob\) values. As we can see, the precision and recall values are generally very high; hence, the PFIs returned by the MB approaches are highly similar to those returned by DP. In particular, the Expected approach generally yields the worst results, with precision and recall values below 95 % in some settings. The Poisson approach performs significantly better in these experiments; yet in some settings it reports false hits, while in others it produces false dismissals. The Normal approach is the most notable in this experiment: it never had a false dismissal, nor did it report a single false hit. This observation also held for further experiments in this line, which are not depicted here. In the next set of experiments, we investigate the effectiveness of the MB approaches.
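The recall and precision measures can be computed directly from the two result sets; the sketch below assumes the PFIs are represented as Python sets:

```python
def recall_precision(f_dp, f_mb):
    """Recall and precision of a model-based result set F_MB relative to the
    exact result set F_DP: hits / |F_DP| and hits / |F_MB|, respectively."""
    hits = len(f_dp & f_mb)
    return hits / len(f_dp), hits / len(f_mb)
```

A false dismissal lowers recall (a PFI in \(F_{DP}\) is missing from \(F_{MB}\)); a false hit lowers precision.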
(a) Expected approach

\(minsup/n\) | 0.1 | 0.2 | 0.3 | 0.4 | 0.5
---|---|---|---|---|---
Recall | 0.99 | 0.98 | 1 | 1 | 1
Precision | 1 | 1 | 1 | 1 | 1

Recall & precision vs. \(minsup\)

\(minprob\) | 0.1 | 0.3 | 0.5 | 0.7 | 0.9
---|---|---|---|---|---
Recall | 0.975 | 0.975 | 1 | 1 | 1
Precision | 1 | 1 | 1 | 0.975 | 0.941

Recall & precision vs. \(minprob\)

\(n\) | 1k | 5k | 10k | 50k | 100k
---|---|---|---|---|---
Recall | 0.937 | 0.986 | 0.983 | 1 | 1
Precision | 0.969 | 1 | 0.992 | 1 | 1

Recall & precision vs. \(n\)

(b) Normal approach

\(minsup/n\) | 0.1 | 0.2 | 0.3 | 0.4 | 0.5
---|---|---|---|---|---
Recall | 1 | 1 | 1 | 1 | 1
Precision | 1 | 1 | 1 | 1 | 1

Recall & precision vs. \(minsup\)

\(minprob\) | 0.1 | 0.3 | 0.5 | 0.7 | 0.9
---|---|---|---|---|---
Recall | 1 | 1 | 1 | 1 | 1
Precision | 1 | 1 | 1 | 1 | 1

Recall & precision vs. \(minprob\)

\(n\) | 1k | 5k | 10k | 50k | 100k
---|---|---|---|---|---
Recall | 1 | 1 | 1 | 1 | 1
Precision | 1 | 1 | 1 | 1 | 1

Recall & precision vs. \(n\)

(c) Poisson approach

\(minsup/n\) | 0.1 | 0.2 | 0.3 | 0.4 | 0.5
---|---|---|---|---|---
Recall | 1 | 1 | 1 | 1 | 1
Precision | 1 | 0.992 | 1 | 1 | 1

Recall & precision vs. \(minsup\)

\(minprob\) | 0.1 | 0.3 | 0.5 | 0.7 | 0.9
---|---|---|---|---|---
Recall | 1 | 1 | 1 | 0.983 | 0.985
Precision | 0.986 | 1 | 0.985 | 1 | 1

Recall & precision vs. \(minprob\)

\(n\) | 1k | 5k | 10k | 50k | 100k
---|---|---|---|---|---
Recall | 1 | 1 | 1 | 1 | 1
Precision | 0.989 | 1 | 0.992 | 1 | 1

Recall & precision vs. \(n\)
We further compare the approximation quality of the Poisson and the Normal approach for a variety of settings. In particular, we scaled both the dataset size and the size of the itemsets whose pmf is to be approximated. Since, in this setting, the probabilities of individual items are uniformly sampled in the \([0,1]\) interval, and since, by the assumption of item independence, it holds that \(P(I)=\prod _{i\in I}P(i)\), a large itemset \(I\) implies smaller existence probabilities. In Fig. 4a, it can be seen that for itemsets of size one (i.e., large probability values), the pmf obtained by the Poisson approximation is too shallow; that is, for support values close to \(\mu _I\), the exact pmf is underestimated, while for support values far from \(\mu _I\), it is overestimated. Figure 4d, g shows that this situation does not improve for the Poisson approximation. However, this shortcoming can be explained: since the Poisson approximation has only one parameter \(\mu _I\) and no variance parameter \(\sigma ^2_I\), it cannot distinguish between a set of five transactions with occurrence probabilities of \(1\) and five million transactions with occurrence probabilities of \(10^{-6}\), since in both scenarios it holds that \(\mu _I=5\cdot 1=5\cdot 10^6\cdot 10^{-6}=5\). Clearly, the variance is much greater in the latter case, and so are the tails of the corresponding exact pmf. Since the Poisson distribution is the limiting distribution of rare events, it implicitly assumes the latter case, i.e., very many transactions, each having a very low probability. In contrast, the normal distribution can adjust to these different situations through its \(\sigma ^2_I\) parameter. Figure 4b, e, h shows the same experiments for itemsets consisting of two items, that is, for lower probabilities \(Pr(I\subseteq t_j)\). It can be observed that with lower probabilities, the error of the Poisson approximation decreases, while (as we will see later, in Table 7) the quality of the normal approximation decreases. Finally, for itemsets of size \(5\), the Poisson approximation comes very close to the exact pmf.
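The variance argument above can be illustrated numerically. The two scenarios below share \(\mu_I = 9\) but differ widely in variance; the specific probability values are arbitrary toy choices:

```python
import math

def exact_pmf(p):
    """Exact s-pmf (Poisson binomial) via dynamic programming."""
    pmf = [1.0]
    for pj in p:
        nxt = [0.0] * (len(pmf) + 1)
        for i, q in enumerate(pmf):
            nxt[i] += q * (1 - pj)
            nxt[i + 1] += q * pj
        pmf = nxt
    return pmf

def poisson_pmf(i, mu):
    # Log-space evaluation avoids overflow for large i.
    return math.exp(-mu + i * math.log(mu) - math.lgamma(i + 1))

def normal_pmf(i, mu, sigma2):
    return math.exp(-(i - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def total_error(p, approx):
    """Sum of absolute pmf differences over the exact support range."""
    exact = exact_pmf(p)
    return sum(abs(exact[i] - approx(i)) for i in range(len(exact)))

results = {}
for name, p in (("few certain", [0.9] * 10), ("many rare", [0.009] * 1000)):
    mu = sum(p)
    sigma2 = sum(x * (1 - x) for x in p)
    results[name] = (total_error(p, lambda i: poisson_pmf(i, mu)),
                     total_error(p, lambda i: normal_pmf(i, mu, sigma2)))
# The Poisson error is large for the few-certain-tuples scenario and small
# for the many-rare-tuples scenario, mirroring the discussion above.
```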
\(n\) | Model | # Itemsets: 1 | 2 | 5
---|---|---|---|---
10 | Normal | 0.020 | 0.078 | 0.180
 | Poisson | 0.526 | 0.283 | 0.077
100 | Normal | 0.0003 | 0.0020 | 0.0134
 | Poisson | 0.0515 | 0.0283 | 0.0052
1,000 | Normal | 2.39E-6 | 6.12E-6 | 0.0004
 | Poisson | 5.24E-3 | 2.85E-3 | 0.0007
The table above shows, for each model-based approach MB \(\in \) {Normal, Poisson}, the distance to the exact pmf computed by the DP approach.

MB vs. DP. Next, we compare the performance (in log scale) of the model-based approaches (MB) and the dynamic-programming-based DP. First, we compare the runtimes of the three MB approaches to each other. The result is shown in Fig. 7. It can be seen that the approach using expected support is faster than the other model-based approaches by a factor of about two, since it only has to perform a single value comparison (\(\mu _I\ge minsup\)). The normal and the Poisson approximation take about the same time to compute. While in our Java experiments (shown here) the Poisson approximation slightly outperforms the normal approximation, our experiments in R (not depicted here) show the opposite. In summary, the Poisson and normal approximations take about the same time to evaluate, up to possibly implementation-specific differences.
The following results show the runtime of the MB approaches, which corresponds to the runtime of the normal and Poisson approximations; for the runtime of the expected approach, simply divide the result by two, per the observations made in Fig. 7.
MB is about two orders of magnitude faster than DP, over a wide range of minsup values. This is because MB does not compute exact frequentness probabilities as DP does; instead, MB only computes the \(\mu _I\) values, which can be obtained faster. We also notice that the running times of both algorithms decrease with higher minsup. This is explained by Fig. 8b, which shows that the number of PFIs generated by the two algorithms, \(|PFI|\), decreases as minsup increases; thus, the time required to compute the frequentness probabilities of these itemsets decreases. We can also see that \(|PFI|\) is almost the same for the two algorithms, reflecting that the results returned by MB closely resemble those of DP.
We also compare MB and DP (in log scale) over different minprob values. Their execution times drop by about 6 % when minprob changes from 0.1 to 0.9. We see that MB is faster than DP: for instance, at \(minprob=0.5\), MB needs 0.3 s while DP requires 118 s, an almost 400-fold performance improvement.

MB vs. MBP. We then examine the benefit of using the improved PFI testing method (MBP) over the basic one (MB). Figure 9a shows that MBP runs faster than MB over different minsup values; for instance, when \(minsup = 0.5\), MBP achieves about 25 % performance improvement. Moreover, as minsup increases, the performance gap widens. This can be explained by Fig. 9b, which presents the fraction of the database scanned by the two algorithms. When minsup increases, MBP examines a smaller fraction of the database; for instance, at \(minsup=0.5\), MBP scans only about 80 % of the database. This reduces the I/O cost and the effort for interpreting the retrieved tuples. Thus, MBP performs better than MB.
Both MB and MBP scale well with \(n\). The performance gap between MB/MBP and DP also increases with \(n\). At \(n = 20k\), MB and DP need 0.62 and 657.7 s, respectively; at \(n = 100k\), MB finishes in 3.1 s while DP spends 10 h. Hence, the scalability of our approaches is clearly better than that of DP.
Distribution | Mean | Standard deviation
---|---|---
\(G_1\) | 0.8 | 0.125
\(G_2\) (default) | 0.5 | 0.125
\(G_3\) | 0.5 | \(\sqrt{1/12}\approx 0.289\)
\(G_4\) | 0.5 | 0.5
\(Un\) | 0.5 | \(\sqrt{1/12}\approx 0.289\)
Both MB and MBP perform better than DP over different distributions. All algorithms run comparatively slower on \(G_1\), because \(G_1\) has a high mean (0.8) and a low standard deviation (0.125), which generates high existential probability values; as a result, many candidates and PFIs are generated. Also note that \(G_3\) and \(Un\), which have the same mean and standard deviation, yield similar performance. Lastly, we found that the precision and the recall of MB and MBP over these distributions are the same and close to 1. Hence, the PFIs retrieved by our methods closely resemble those returned by DP.

7.2 Results on rank-based PFI mining
We compare the basic rank-based PFI mining algorithm (called top-k) and our enhanced algorithm (called E-top-k). We first measure the accuracy of E-top-k relative to that of top-k, using formulas similar to Eqs. 21 and 22. Table 9 shows the recall and the precision of E-top-k for a wide range of \(n\), \(k\), and \(minsup\) values. We observe that the recall values are equal to the precision values; this is because the numbers of PFIs returned by top-k and E-top-k are the same. We can also see that the recall and precision values are always higher than 98 %. Hence, E-top-k can accurately return top-\(k\) PFIs.
Recall and precision of E-top-k:

\(n\) | 1k | 5k | 10k | 50k
---|---|---|---|---
Recall & precision | 1 | 1 | 1 | 1

Recall & precision vs. \(n\)

\(k\) | 1 | 10 | 50 | 100
---|---|---|---|---
Recall & precision | 1 | 1 | 0.98 | 0.99

Recall & precision vs. \(k\)

\(minsup/n\) | 0.1 | 0.2 | 0.3 | 0.4
---|---|---|---|---
Recall & precision | 1 | 1 | 1 | 1

Recall & precision vs. \(minsup\)
E-top-k also runs much faster than top-k. When \(n=10k\), for example, E-top-k finishes in only 0.2 s, an almost 20,000-fold improvement over top-k, which completes in 1.1 h. In Fig. 11a, we see that the runtime of top-k increases faster than that of E-top-k as the database grows. In Fig. 11b, observe that E-top-k is about four orders of magnitude faster than top-k. In Fig. 11c, as \(minsup\) increases, top-k needs more time, but the runtime of E-top-k increases only slightly. This is because (1) fewer candidates are produced by our generation step and (2) the testing step is significantly improved by our model-based methods.
7.3 Other experiments
We compare DP, TODIS, MB, and MBP in Fig. 12a. TODIS [38] was proposed to retrieve threshold-based PFIs from tuple-uncertain databases. The figure shows that both MB and MBP perform much better than DP and TODIS, under different minsup values. For example, when \(minsup = 0.3\), MB needs 1.6 s, but DP and TODIS complete in 581 and 81 s, respectively. In Fig. 12b, we also see that E-top-k consistently outperforms top-k by about two orders of magnitude. This means that our approach, which avoids computing frequentness probabilities, also works well for the tuple uncertainty model.
Figure 13a compares MB, MBP, and DP for the attribute uncertainty model. We found that MB and MBP still outperform DP. Figure 13b tests the performance of top-k and E-top-k for the tuple uncertainty model. Again, E-top-k runs faster than top-k, by around two orders of magnitude. Thus, our approach also works well for this dataset.