1 Introduction
2 Uplift modeling
Fundamental Problem of Causal Inference. For every individual only one of the two outcomes can ever be observed: either the outcome after the individual has been subjected to the action (was treated) or the outcome when the individual has not been subjected to it (was a control case), never both.
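Because only one outcome is observed per individual, uplift must be estimated by comparing treatment and control groups rather than individuals. A minimal sketch of the two-model ("double classifier") approach discussed later in the paper, using scikit-learn decision trees as stand-ins for J4.8 (the data and parameters below are illustrative, not the paper's):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: X features, t treatment indicator, y binary outcome.
n = 2000
X = rng.normal(size=(n, 4))
t = rng.integers(0, 2, size=n)
# The action helps only when the first feature is positive (illustrative).
p = 0.3 + 0.25 * t * (X[:, 0] > 0)
y = rng.binomial(1, p)

# Two-model approach: fit separate classifiers on treatment and control...
m_t = DecisionTreeClassifier(max_depth=3).fit(X[t == 1], y[t == 1])
m_c = DecisionTreeClassifier(max_depth=3).fit(X[t == 0], y[t == 0])

# ...and predict uplift as the difference of success probabilities.
uplift = m_t.predict_proba(X)[:, 1] - m_c.predict_proba(X)[:, 1]
```

The predicted uplift lies in \([-1, 1]\); positive values indicate individuals for whom the action is expected to help.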
2.1 Related work
2.2 Notation
2.3 Current uplift modeling algorithms
2.4 Ensemble methods for uplift modeling
3 Bagging and random forests for uplift modeling
3.1 Base learners
Weka package. This is a version of the well-known C4.5 learner and is not discussed here in detail; see Quinlan (1992) and Witten and Frank (2005).

3.2 Bagging of uplift models
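Bagging for uplift follows the classical recipe: train each member on a bootstrap sample and average the members' uplift predictions. A hedged sketch under that assumption, with a two-model pair of scikit-learn trees as the stand-in base learner (the paper's actual base learners are E-divergence uplift trees and J4.8; function names here are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_two_model(X, t, y):
    """Stand-in base learner: a pair of trees (treatment, control)."""
    m_t = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X[t == 1], y[t == 1])
    m_c = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X[t == 0], y[t == 0])
    return m_t, m_c

def bagged_uplift(X, t, y, n_members=25, seed=0):
    rng = np.random.default_rng(seed)
    members = []
    n = len(X)
    for _ in range(n_members):
        # Bootstrap: draw n records with replacement.
        idx = rng.integers(0, n, size=n)
        # Guard: a resample could miss one of the groups on tiny data.
        if len(set(t[idx])) < 2:
            continue
        members.append(fit_two_model(X[idx], t[idx], y[idx]))
    return members

def predict_uplift(members, X):
    # Average the members' uplift estimates.
    preds = [m_t.predict_proba(X)[:, 1] - m_c.predict_proba(X)[:, 1]
             for m_t, m_c in members]
    return np.mean(preds, axis=0)
```

Averaging over bootstrap replicates is what reduces variance; the base learner can be swapped for any uplift model with a `predict_uplift`-style interface.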
3.3 Random forests for uplift modeling
Weka's RandomTree classifier to construct members of the ensemble. Unfortunately, the RandomTree class uses a slightly different splitting criterion than the J4.8 tree which we use in bagged double classifiers: the former uses raw entropy gain, while the latter uses entropy gain ratio, i.e., the gain divided by the entropy of the test itself. Moreover, J4.8 uses heuristics to eliminate tests with very low entropies; see Quinlan (1992) for details. This makes comparison of bagged double classifiers with Double Uplift Random Forests more difficult, but we chose not to modify the implementations of the Weka tree learners, since they are a standard used by the community and neither criterion is uniformly better than the other.

3.4 Theoretical properties
4 Experimental evaluation
4.1 Benchmark datasets for uplift modeling
Hillstrom visit) and using only the women's merchandise group (dataset called Hillstrom visit w.). The women's group was chosen because the campaign on this group was, overall, much more effective.

| Dataset | Source | #Records: treatment | #Records: control | #Records: total | #Attributes |
|---|---|---|---|---|---|
| Hillstrom visit | MineThatData blog (Hillstrom 2008) | 42,694 | 21,306 | 64,000 | 8 |
| Hillstrom visit w. | MineThatData blog (Hillstrom 2008) | 21,306 | 21,306 | 42,612 | 8 |
| BMT cgvh | Pintilie (2006) | 49 | 51 | 100 | 4 |
| BMT agvh | Pintilie (2006) | 49 | 51 | 100 | 4 |
| Tamoxifen | Pintilie (2006) | 321 | 320 | 641 | 10 |
| Pbc | R, survival package | 158 | 154 | 312 | 20 |
| Bladder | R, survival package | 38 | 47 | 85 | 6 |
| Cgd | R, survival package | 65 | 63 | 128 | 10 |
| Colon death | R, survival package | 614 | 315 | 929 | 14 |
| Colon recurrence | R, survival package | 614 | 315 | 929 | 14 |
| Veteran | R, survival package | 69 | 68 | 137 | 9 |
| Burn | R, KMsurv package | 84 | 70 | 154 | 17 |
| Hodg | R, KMsurv package | 16 | 27 | 43 | 7 |
R package for statistical computing. The first medical dataset available with Pintilie (2006) is the Bone Marrow Transplant (BMT) data on patients who received one of two types of bone marrow transplant: taken from the pelvic bone (used as the control group, since this was the procedure commonly used at the time the data was collected) or from the peripheral blood (a novel approach, used as the treatment group in this paper). The peripheral blood transplant is easier on the donor but may result in a higher rate of rejection in the recipient. The goal of using an uplift model is to pick a group of patients for whom the alternative therapy is applicable without the increased risk. There are two target variables representing the occurrence of the chronic (cgvh) and acute (agvh) graft versus host disease. We ignore the survival nature of the data and simply treat nonoccurrence as the successful outcome. There are only three randomization time variables: the type and extent of the disease, and the patient's age. Although the BMT dataset does not, strictly speaking, include a control group, uplift modeling can still be applied: the role of the control group is played by one of the treatments, and the method allows for selection of patients to whom an alternative treatment should be applied.

The second dataset, Tamoxifen, contains data on treatment of breast cancer with the drug tamoxifen. The control group received tamoxifen alone and the treatment group tamoxifen combined with radiotherapy. We model the target variable stat, describing whether the patient was alive at the time of the last follow-up. The dataset contains six variables: size of the tumor, histology, hormone receptor level, haemoglobin level, patient's age, and a binary variable set to true if auxiliary node dissection was done. Details can be found in Pintilie (2006).

The remaining clinical trial datasets come from the survival and KMsurv packages of the R statistical computing system. We will discuss them in less detail since full descriptions are easily accessible online. First, the datasets available in the survival package. The pbc dataset comes from the Mayo Clinic study of primary biliary cirrhosis (PBC) of the liver conducted between 1974 and 1984 and includes data on 312 patients who participated in a randomized controlled trial of the drug D-penicillamine (the control group received placebo). We assumed death before the endpoint of the study to be the negative outcome, and a patient receiving a transplant or being censored to be the positive outcome. The bladder cancer dataset (bladder) contains information on 85 subjects who received either the thiotepa drug or placebo. For each patient it is reported whether recurrence occurred during four periods of time. We assumed patients with at least one recurrence to be the negative cases and those without any recurrence to be the positive cases. The cgd dataset comes from a placebo-controlled trial of gamma interferon in chronic granulomatous disease (CGD) and contains complete information on the time to first serious infection observed through the end of the study. Since each patient eventually developed an infection, we considered those who did so in less than 180 days to be negative cases and the remaining ones positive cases. The colon data comes from a trial of adjuvant chemotherapy for colon cancer. There are two types of treatment, which we merged into a single treatment group; the control group received placebo. We analyzed two target attributes, 'death' and 'recurrence or death', with the resulting datasets called colon death and colon recurrence, respectively. The veteran data comes from a randomized trial of two treatment regimens for lung cancer on 137 patients. For uplift analysis, survival time is omitted and patients alive up to the end of the study constitute the positive examples. Two datasets come from the KMsurv package. The burn dataset has 154 rows describing infections suffered by patients who underwent burns. The treatment group was subject to body cleansing and the control group to routine bathing. Occurrence of Staphylococcus aureus infection was the negative outcome. Finally, the hodg dataset describes 43 patients who underwent an allogeneic graft or an autologous graft (control group) as a lymphoma treatment. Those who died by the end of the study constitute the negative examples.

For each of the remaining benchmark datasets, an artificial treatment/control split was created using a condition which either resembles an actual treatment (e.g. Steroid = 'YES' in the hepatitis dataset) or splits the data evenly into two groups. Details are given in Table 2, taken from Rzepakowski and Jaroszewicz (2012). The first column contains the dataset name and the second provides the condition used to select records for the treatment group; the remaining records formed the control. A further postprocessing step removed attributes strongly correlated with the split itself; ideally, the division into treatment and control groups should be independent of all predictive attributes, but this is possible only in a controlled experiment. A simple heuristic was used for this purpose.

| Dataset | Treatment/control split condition | #Removed attributes / #original attributes |
|---|---|---|
| Australian | a1 = '1' | 2/14 |
| Breast-cancer | Menopause = 'PREMENO' | 2/9 |
| Credit-a | a7 \(\ne\) 'V' | 3/15 |
| Dermatology | Exocytosis \(\le\) 1 | 16/34 |
| Diabetes | Insu \(>\) 79.8 | 2/8 |
| Heart-c | Sex = 'MALE' | 2/13 |
| Hepatitis | Steroid = 'YES' | 1/19 |
| Labor | Education-allowance = 'YES' | 4/16 |
| Liver-disorders | Drinks \(<\) 2 | 2/6 |
| Primary-tumor | Sex = 'MALE' | 2/17 |
| Splice | Attribute1 \(\in\) {'A', 'G'} | 2/61 |
| Winequal-red | Sulfur dioxide \(<\) 46.47 | 2/11 |
| Winequal-white | Sulfur dioxide \(<\) 138.36 | 3/11 |
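The split-and-clean procedure behind Table 2 can be sketched as follows. The correlation heuristic here (dropping numeric attributes whose correlation with the group indicator exceeds a threshold) is one plausible reading, since the paper's exact heuristic is not reproduced in this excerpt; the toy dataframe and threshold are illustrative:

```python
import numpy as np
import pandas as pd

def split_treatment_control(df, condition, corr_threshold=0.3):
    """Split df into artificial treatment/control groups by a condition,
    then drop attributes strongly correlated with the split itself."""
    treated = condition(df)                    # boolean Series
    dropped = []
    for col in df.columns:
        x = pd.to_numeric(df[col], errors="coerce")
        if x.notna().all():
            c = np.corrcoef(x, treated.astype(float))[0, 1]
            if np.isfinite(c) and abs(c) > corr_threshold:
                dropped.append(col)
    kept = df.drop(columns=dropped)
    return kept[treated], kept[~treated], dropped

# Toy example mimicking the liver-disorders condition "Drinks < 2"
# (illustrative data, not one of the paper's datasets):
toy = pd.DataFrame({"drinks": [1, 3, 0, 5, 2, 4],
                    "age":    [30, 41, 25, 52, 37, 44]})
treat, ctrl, dropped = split_treatment_control(toy, lambda d: d["drinks"] < 2)
```

As expected, the attribute defining the split (`drinks`) is itself strongly correlated with the group indicator and gets removed, which is exactly why Table 2 reports removed-attribute counts.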
4.2 Evaluating uplift models
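The full description of uplift curves and AUUC is not reproduced in this excerpt. As a rough sketch, assuming the common construction in which, for each targeted fraction, the top-scored portion of each group is selected and the difference in success rates is scaled to the whole population (the function names and the trapezoidal integration are this sketch's choices, not necessarily the paper's exact definitions):

```python
import numpy as np

def uplift_curve(scores_t, y_t, scores_c, y_c, points=101):
    """Sketch of an uplift curve: for each targeted fraction p, take the
    top-p fraction of each group by model score and compute the difference
    in success rates, scaled to the common population size N."""
    N = len(y_t) + len(y_c)
    ord_t = np.argsort(-scores_t)
    ord_c = np.argsort(-scores_c)
    fracs = np.linspace(0, 1, points)
    gains = []
    for p in fracs:
        k_t = int(round(p * len(y_t)))
        k_c = int(round(p * len(y_c)))
        rate_t = y_t[ord_t[:k_t]].mean() if k_t else 0.0
        rate_c = y_c[ord_c[:k_c]].mean() if k_c else 0.0
        # Net gain at targeted fraction p, scaled to the whole population.
        gains.append((rate_t - rate_c) * p * N)
    return fracs, np.array(gains)

def auuc(fracs, gains):
    # Area under the uplift curve via the trapezoidal rule.
    return float(np.sum((gains[1:] + gains[:-1]) / 2.0 * np.diff(fracs)))
```

A model is useful to the extent that its curve lies above the straight line obtained by targeting a random subset; the AUUC summarizes this in a single number, as in the tables below.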
4.3 Experimental setup
- Bagged uplift trees. Bagged ensembles of E-divergence based unpruned uplift decision trees (see Sect. 3.1).
- Bagged double J4.8 trees. Bagged ensembles of double classifiers based on unpruned J4.8 models from Weka.
- Uplift Random Forests. Bagged ensembles of randomized E-divergence based uplift decision trees built using the algorithm in Fig. 2.
- Double Uplift Random Forests. Bagged ensembles of double classifiers based on randomized trees from Weka.
4.4 Illustrative examples
winequality_white dataset. The base model, a double classifier based on a J4.8 tree, achieved only a modest improvement over targeting a randomly selected subset of the population: by targeting about 80 % of the database according to the base model's selection, we obtain a net gain just 3 % higher than if we indiscriminately applied the action to all objects. In contrast, when targeting 70 % of the population selected using a Double Uplift Random Forest, the difference grows to almost 10 %. The area under the uplift curve for the Random Forest model is more than three times larger than for the base model (whether pruned or unpruned). Similar improvements were achieved for the liver_disorders dataset by applying bagging to uplift decision trees based on the E-divergence test selection criterion.

The Tamoxifen dataset is another interesting example; here the base model is practically useless, as its performance is almost identical to random selection of the target group. Applying bagging improved performance significantly: by targeting about 70 % of patients with the drug and radiotherapy and the remaining 30 % with the drug only, we would (apart from reducing the number of people subject to radiotherapy and its side effects) achieve, overall, better results than if the combined treatment was administered to all patients. Similar gains are visible for the chronic graft versus host disease in the BMT dataset. Using an uplift model, we could target almost 75 % of patients with the alternative, milder therapy while actually achieving a lower incidence of side effects. Note that the overall impact of the alternative therapy is negative in this context, but this seems to be due to only about a quarter of the patients, for whom it gives particularly bad results.

The Hillstrom dataset is also worth noting. The gains are not as spectacular as in the previous cases, but still, the application of bagging resulted in about a 10 % increase in the AUUC over a single pruned uplift tree and about a 20 % increase over a single unpruned tree.

Another interesting case is the credit_a dataset, where the Uplift Random Forest performs exceptionally well. The chart requires a comment: the base model makes predictions which are actually worse than random selection, while an ensemble with just one member performs much better. This is unexpected, since the single member tree was built, due to bootstrapping, on a smaller sample than the full model. The same effect was seen when bagging was applied to E-divergence based uplift decision trees on this dataset. To understand this result we examined the generated trees. When the base model was used, in almost all of the 128 random train/test splits the test in the root of the tree was based on the A6 attribute, which takes 14 different values; this resulted in quick training data fragmentation and poor overall performance. When the same tree construction algorithm was applied to a bootstrap sample taken from the original dataset, tests in the root were almost never based on this attribute, resulting in much better trees. The good performance of one-member ensembles thus turned out to be a counterintuitive side effect of the test selection criterion proposed in Rzepakowski and Jaroszewicz (2010). As can be seen in the charts presented in the next section, this phenomenon also occurs (less strongly) for other datasets as well as for the J4.8 decision trees. To visualize the real gains resulting from forming larger ensembles, we have included the curves for one-model ensembles in all the charts in Fig. 3.

4.5 Performance evaluation of uplift ensembles
With few exceptions (Tamoxifen, veteran and hepatitis), forming larger ensembles improves performance for both bagging and Uplift Random Forests, sometimes dramatically so. For the cgd, bladder, colon death, colon recurrence, breast_cancer, diabetes, heart_c, liver_disorders, splice, and both winequality datasets the gains over the base models were especially large, with Areas Under the Uplift Curves doubling or even tripling. For the Hillstrom visit dataset the performance of the ensemble increased steadily as more members were added but fell just short of surpassing the pruned base model. Note that when only the women's merchandise offer was considered (see also Fig. 3), bagging brought a significant improvement in performance over the base model. The loss of performance on the Tamoxifen and veteran datasets is most probably due to poor base models.

| Dataset | Unpruned E-div. tree | 1001 bagged E-div. trees | Uplift Rand. Forest (1001) | Double J4.8 classif. | 1001 bagged double J4.8 | Double uplift Rand. forest |
|---|---|---|---|---|---|---|
| BMT agvh | 1.97 ± 4.76 | 2.25 ± 4.79 | 2.77 ± 4.58 | 0.55 ± 2.82 | 0.74 ± 5.08 | 3.92 ± 4.78 |
| BMT cgvh | 2.20 ± 4.50 | 2.95 ± 4.17 | 3.24 ± 4.29 | 2.75 ± 3.23 | 5.78 ± 4.39* | 4.69 ± 4.57* |
| Hillstrom visit | 0.35 ± 0.17** | 0.38 ± 0.17** | 0.28 ± 0.16* | 0.17 ± 0.18 | 0.06 ± 0.16 | −0.00 ± 0.17 |
| Hillstrom visit w. | 0.62 ± 0.19** | 0.73 ± 0.18** | 0.63 ± 0.19** | 0.45 ± 0.22** | 0.32 ± 0.21* | 0.23 ± 0.22* |
| Tamoxifen | −0.10 ± 1.46 | −0.23 ± 1.17 | −0.40 ± 1.12 | −0.15 ± 1.42 | 0.25 ± 1.27 | −0.31 ± 1.27 |
| Burn | 6.02 ± 3.17* | 4.36 ± 4.38 | 4.47 ± 4.91 | 2.37 ± 4.10 | 2.54 ± 4.64 | 2.49 ± 4.64 |
| Hodg | 2.39 ± 7.67 | 6.81 ± 8.88 | 6.98 ± 8.66 | 7.80 ± 9.24 | 9.75 ± 8.60* | 9.72 ± 8.67* |
| Bladder | 0.53 ± 5.21 | 1.09 ± 5.69 | 1.08 ± 5.69 | 0.09 ± 4.88 | 1.86 ± 6.08 | 1.63 ± 6.10 |
| Cgd | 0.99 ± 2.23 | 2.44 ± 2.73 | 2.95 ± 2.65* | 2.83 ± 2.07* | 1.67 ± 2.40 | 2.74 ± 2.17* |
| Colon death | 0.18 ± 1.50 | 0.71 ± 1.28 | 0.88 ± 1.27 | 0.72 ± 1.30 | 0.59 ± 1.46 | 0.73 ± 1.08 |
| Colon recur. | 0.83 ± 2.11 | 1.48 ± 1.78 | 1.19 ± 1.73 | 0.81 ± 2.05 | 1.83 ± 2.12 | 1.57 ± 2.19 |
| Pbc | 0.82 ± 3.42 | 0.68 ± 2.92 | 0.57 ± 2.90 | 0.08 ± 3.34 | −0.16 ± 2.93 | −0.30 ± 3.00 |
| Veteran | −0.87 ± 2.90 | −1.45 ± 3.00 | −1.56 ± 2.93 | −0.30 ± 1.97 | −2.52 ± 2.15 | −0.81 ± 2.31 |
| Australian | −0.72 ± 2.60 | 0.60 ± 2.31 | 1.16 ± 2.17 | 1.00 ± 2.65 | 1.04 ± 2.23 | −0.39 ± 2.18 |
| Breast_cancer | 0.84 ± 2.82 | 1.96 ± 2.76 | 2.33 ± 2.74 | 1.46 ± 3.35 | 2.51 ± 3.09 | 2.62 ± 2.72 |
| Credit_a | −3.06 ± 2.39 | 4.73 ± 2.23** | 6.26 ± 1.93** | 0.86 ± 2.50 | 0.34 ± 2.09 | −3.55 ± 1.91 |
| Dermatology | 6.28 ± 1.97** | 7.37 ± 1.41** | 8.09 ± 1.01** | 5.44 ± 2.31** | 7.43 ± 1.54** | 7.92 ± 1.29** |
| Diabetes | 1.69 ± 2.36 | 2.83 ± 2.15* | 2.68 ± 2.14* | 1.19 ± 2.57 | 2.17 ± 2.34 | 2.41 ± 2.33* |
| Heart_c | 2.05 ± 3.22 | 3.32 ± 3.39 | 3.64 ± 3.33* | 2.34 ± 3.50 | 4.19 ± 3.29* | 4.62 ± 3.42* |
| Hepatitis | 0.56 ± 4.28 | 0.14 ± 3.74 | 0.06 ± 3.56 | 1.32 ± 4.87 | 0.24 ± 4.10 | 0.16 ± 3.98 |
| Labor | −4.72 ± 6.47 | −0.01 ± 8.69 | 0.00 ± 8.39 | −0.96 ± 8.13 | 0.27 ± 8.29 | −4.40 ± 5.72 |
| Liver_disorders | 1.12 ± 3.48 | 3.60 ± 3.10* | 3.60 ± 3.06* | 1.09 ± 3.40 | 3.55 ± 3.22* | 3.32 ± 3.07* |
| Splice | 5.04 ± 0.95** | 8.13 ± 0.88** | 8.15 ± 0.79** | 0.76 ± 0.79 | 3.45 ± 1.25** | 7.78 ± 0.95** |
| Winequality_red | 4.61 ± 1.58** | 8.33 ± 1.38** | 8.14 ± 1.38** | 3.52 ± 1.65** | 8.22 ± 1.55** | 9.81 ± 1.49** |
| Winequality_white | 4.51 ± 0.95** | 7.53 ± 0.72** | 7.30 ± 0.74** | 3.35 ± 1.05** | 9.28 ± 0.76** | 10.77 ± 0.76** |
5 Analysis of ensemble diversity
5.1 Bagged double classifiers
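This section reports correlations between the predictions of ensemble members: lower correlation means higher diversity. A minimal sketch of how such a number can be computed, as the average pairwise Pearson correlation over a (hypothetical) matrix of member predictions:

```python
import numpy as np

def mean_pairwise_correlation(preds):
    """preds: (n_members, n_samples) array of each member's predictions.
    Returns the average Pearson correlation over all member pairs."""
    corr = np.corrcoef(preds)          # (n_members, n_members) matrix
    m = corr.shape[0]
    iu = np.triu_indices(m, k=1)       # upper triangle, excluding diagonal
    return corr[iu].mean()

# Hypothetical example: 3 members scoring 5 test cases.
preds = np.array([[0.1, 0.4, 0.2, 0.9, 0.3],
                  [0.2, 0.5, 0.1, 0.8, 0.4],
                  [0.9, 0.1, 0.8, 0.2, 0.7]])
print(round(mean_pairwise_correlation(preds), 3))
```

Values near 1 indicate members that agree almost everywhere (little to gain from averaging); values near 0 or below indicate the diversity that makes bagging effective.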
On the splice dataset, for example, the correlation of predictions made by the uplift ensemble members is just \(0.179\), even though individual J4.8 trees make highly correlated predictions (coefficient equal to \(0.852\)). Very large differences are also visible for the australian and credit_a datasets, and large ones for colon recurrence, diabetes, heart_c, winequality_red, and winequality_white. Note (see Fig. 5) that for all of those datasets adding more members dramatically improved the performance of bagged double J4.8 classifiers, eventually doubling or tripling the AUUC of the ensemble.

5.2 Bagged E-divergence based uplift decision trees
5.3 Bagging versus random forests
| Dataset | Bagging forest | Random forest |
|---|---|---|
| *Real* | | |
| BMT agvh | **2.11** | 2.10 |
| BMT cgvh | 1.74 | **2.11** |
| Hillstrom visit | **0.27** | 0.16 |
| Hillstrom visit w. | **0.55** | 0.30 |
| Tamoxifen | **0.08** | −0.03 |
| Burn | **6.02** | 3.26 |
| Hodg | 2.39 | **3.37** |
| Bladder | **0.53** | 0.49 |
| Cgd | 0.99 | **1.31** |
| Colon death | **0.18** | 0.12 |
| Colon recurrence | **0.83** | 0.55 |
| Pbc | **0.82** | 0.04 |
| Veteran | **−0.86** | −0.90 |
| *Artificial* | | |
| Australian | 0.28 | **0.43** |
| Breast_cancer | 0.97 | **1.17** |
| Credit_a | 2.11 | **2.23** |
| Dermatology | **5.89** | 5.05 |
| Diabetes | 0.84 | **1.10** |
| Heart_c | **2.00** | 1.70 |
| Hepatitis | **0.58** | 0.32 |
| Labor | −1.13 | **−1.08** |
| Liver_disorders | **1.96** | 1.33 |
| Splice | **4.04** | 3.73 |
| Winequality_red | **3.81** | 3.47 |
| Winequality_white | **3.76** | 3.56 |

The better value in each row is shown in bold.
A notable example is the Hillstrom visit w. dataset, where adding more members to the Random Forest produced higher gains than it did for bagging (due to higher diversity); but since the individual randomized trees were significantly worse, the overall performance of bagging was better. Similar results were obtained by Segal for classical regression (Segal 2004).