
Open Access 2018 | Original Paper | Book Chapter

3. Testing Joint Conditional Independence of Categorical Random Variables with a Standard Log-Likelihood Ratio Test

Author: Helmut Schaeben

Published in: Handbook of Mathematical Geosciences

Publisher: Springer International Publishing


Abstract

While tests for pairwise conditional independence of random variables have been devised, testing joint conditional independence of several random variables seems to be a challenge in general. Restriction to categorical random variables implies in particular that their common distribution may initially be thought of as a contingency table, and then in terms of a log-linear model. Thus, the Hammersley–Clifford theorem applies and provides insight into the factorization of the log-linear model corresponding to assumptions of independence or conditional independence. Such assumptions simplify the full joint log-linear model, and in turn any conditional distribution. If the log-linear model corresponding to the assumption of joint conditional independence given the conditioning variable is not sufficiently large to explain some data according to a standard log-likelihood ratio test, its null-hypothesis of joint conditional independence may be rejected with respect to some significance level. Enlarging the log-linear model by some product terms of variables and running the log-likelihood ratio test on different models may provide insight into which variables lack conditional independence. Since the joint distribution determines any conditional distribution, the series of tests eventually provides insight into which variables and product terms a proper logistic regression model should comprise.

3.1 Introduction

Conditional independence is a probabilistic approach to causality (Suppes 1970; Dawid 1979, 2004, 2007; Spohn 1980, 1994; Pearl 2009; Chalak and White 2012), whereas, for instance, correlation is obviously not, as it is a symmetric relationship. Features of conditional independence are:
  • Conditionally independent random variables are conditionally uncorrelated.
  • Conditionally independent random variables may be significantly correlated or not.
  • Independence does not imply conditional independence and vice versa.
  • Pairwise conditional independence does not imply joint conditional independence.
Statistical tests for pairwise conditional independence of random variables have been devised, e.g., Bergsma (2004), Su and White (2007), Su and White (2008), Song (2009), Bergsma (2010), Huang (2010), Zhang et al. (2011), Bouezmarni et al. (2012), Györfi and Walk (2012), Doran et al. (2014), Ramsey (2014), Huang et al. (2016). Testing joint conditional independence of several random variables, however, seems to be a challenge in general. For the special case of dichotomous variables, the “omnibus test” (Bonham-Carter 1994) and the “new omnibus test” (Agterberg and Cheng 2002) have been suggested.
Weak conditional independence of random variables was introduced in Wong and Butz (1999), and elaborated on in Butz and Sanscartier (2002). Extended conditional independence has recently been introduced in Constantinou and Dawid (2015). The definition of weak conditional independence given in Cheng (2015) refers to conditionally independent random events, and rephrases conditional independence in terms of ratios of conditional probabilities rather than conditional probabilities themselves, to avoid distinguishing between conditional independence given a conditioning event and conditional independence given its complement. This definition becomes irrelevant when proceeding from elementary probabilities of events to probabilities of random variables, and to the general definition of conditionally independent random variables.
Conditional independence is an issue in a Bayesian approach to estimating posterior (conditional) probabilities of a dichotomous random target variable in terms of weights-of-evidence (Good 1950, 1960, 1985). In turn, conditional independence is the major mathematical assumption of potential modeling with weights of evidence, cf. (Bonham-Carter et al. 1989; Agterberg and Cheng 2002; Schaeben 2014b), e.g., applied to prospectivity modeling of mineral deposits. The method requires a training dataset laid out in regular cells (pixels, voxels) of equal physical size representing the support of probabilities. The sum of posterior probabilities over all cells should equal the sum of the target variable over all cells. Deviations indicate a violation of the assumption of conditional independence, and are used as the statistic of a test (Agterberg and Cheng 2002) which involves a normality assumption. Curiously, ArcSDM calculates so-called normalized probabilities, i.e., posterior probabilities rescaled so that the overall measure of conditional independence is satisfied (ESRI 2018); of course, this trick does not fix any problem. Violation of the assumption of conditional independence not only corrupts the posterior (conditional) probabilities estimated with weights of evidence, but also their ranks, cf. (Schaeben 2014b), which is worse. Thus, the method of weights-of-evidence requires the mathematical modeling assumption of conditional independence to yield reasonable predictions. However, conditional independence is an issue with respect to logistic regression, too.

3.2 From Contingency Tables to Log-Linear Models

A comprehensive exposition of log-linear models is Christensen (1997). Let \({\varvec{ Z}}\) be a random vector of categorical random variables \(\mathsf Z_\ell , \ell =0,\ldots ,m\), i.e., \({\varvec{ Z}} = (\mathsf Z_0, \mathsf Z_1, \ldots , \mathsf Z_m)^{\mathsf {T}}\). It is completely characterized by its distribution
$$\begin{aligned} p_{\kappa } = P_{{\varvec{ Z}}} ({\varvec{s}}_{\kappa }) = P({\varvec{ Z}} = {\varvec{s}}_{\kappa }) = P \left( (\mathsf Z_0,\ldots ,\mathsf Z_m) = (s_{k_0}, \ldots , s_{k_m}) \right) \end{aligned}$$
with the multi-index \(\kappa = (k_0, \ldots , k_m)\), where \(s_{k_\ell }\) with \(k_\ell = 1,\ldots ,K_\ell \) denotes all possible categories of the categorical random variable \(\mathsf Z_\ell , \ell =0,\ldots ,m\). Since it is assumed that there is a total of \(K_\ell \) different categories with \(P_{\mathsf Z_\ell }(s_{k_\ell }) > 0\), there is a total of \(\prod _{\ell =0}^m K_\ell \) different categorical states for \(\varvec{ Z} = \bigotimes _{\ell =0}^m \mathsf Z_\ell \).
The distribution of a categorical random vector may initially be thought of as being provided by contingency tables. More conveniently, the distribution of a categorical random vector \({\varvec{ Z}}\) can generally be written in terms of a log-linear model as
$$\begin{aligned} \log p_{\kappa } = \sum _{\kappa '} w_{\kappa '} \; f_{{\varvec{ Z}}}^{\kappa '} ({\varvec{z}}) \end{aligned}$$
with the indicator functions
$$\begin{aligned} f_{{\varvec{ Z}}}^{\kappa '} ({\varvec{z}}) = {\left\{ \begin{array}{ll} 1 &{} \text {if } {\varvec{z}} = {\varvec{s}}_{\kappa '}, \\ 0 &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
so that \(w_{\kappa } = \log p_{\kappa }\).
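To make the indicator representation concrete, the following minimal Python sketch (with a fabricated contingency table) passes from cell counts to the saturated log-linear form, in which each weight \(w_{\kappa }\) is simply the logarithm of the corresponding cell probability:
```python
import numpy as np

# Fabricated 2x2x2 contingency table of counts for (Z0, Z1, Z2);
# all cells positive, so that all logarithms exist.
counts = np.array([[[20, 12], [9, 6]],
                   [[10, 8], [7, 28]]], dtype=float)

p = counts / counts.sum()   # cell probabilities p_kappa
w = np.log(p)               # weights of the saturated log-linear model

# f^kappa'(z) is the indicator of cell kappa', so the sum over kappa'
# of w_kappa' f^kappa'(z) collapses to w_kappa at z = s_kappa.
kappa = (1, 0, 1)
indicator = np.zeros_like(p)
indicator[kappa] = 1.0
assert np.isclose((w * indicator).sum(), np.log(p[kappa]))
print("log p_kappa =", (w * indicator).sum())
```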

3.3 Independence, Conditional Independence of Random Variables

If the random variables \(\mathsf Z_\ell , \ell =1,\ldots ,m\), are independent, then the joint probability of any subset of random variables \(\mathsf Z_{\ell }\) can be factorized into the product of the individual probabilities, i.e.,
$$\begin{aligned} P_{ \bigotimes _{\ell \in M} \mathsf Z_\ell } = \bigotimes _{\ell \in M} P_{\mathsf Z_\ell }, \end{aligned}$$
where M denotes any non-empty subset of the set \(\{1,\ldots ,m \}\). In particular,
$$\begin{aligned} P_{\varvec{ Z}} = P_{ \bigotimes _{\ell =1}^m \mathsf Z_\ell } = \bigotimes _{\ell =1}^m P_{\mathsf Z_\ell }. \end{aligned}$$
If the random variables \(\mathsf Z_\ell , \ell =1,\ldots ,m\), are conditionally independent given \(\mathsf Z_0\), then the joint conditional probability of any subset of random variables \(\mathsf Z_\ell \) given \(\mathsf Z_0\) can be factorized into the product of the individual conditional probabilities, i.e.,
$$\begin{aligned} P_{ \bigotimes _{\ell \in M} \mathsf Z_\ell \mid \mathsf Z_0 } = \bigotimes _{\ell \in M} P_{\mathsf Z_\ell \mid \mathsf Z_0}, \end{aligned}$$
(3.1)
and in particular
$$\begin{aligned} P_{ \bigotimes _{\ell =1}^m \mathsf Z_\ell \mid \mathsf Z_0 } = \bigotimes _{\ell =1}^m P_{\mathsf Z_\ell \mid \mathsf Z_0}. \end{aligned}$$
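As a numerical illustration of Eq. (3.1), the following Python sketch (with fabricated conditional distributions) builds a joint distribution that is conditionally independent given \(\mathsf Z_0\) by construction and verifies the factorization of the joint conditional probability:
```python
import numpy as np

rng = np.random.default_rng(1)

# Fabricated distributions: Z1 (3 categories) and Z2 (2 categories)
# are conditionally independent given Z0 by construction.
p_z0 = np.array([0.4, 0.6])                    # P(Z0)
p_z1_given_z0 = rng.dirichlet(np.ones(3), 2)   # rows: P(Z1 | Z0 = i)
p_z2_given_z0 = rng.dirichlet(np.ones(2), 2)   # rows: P(Z2 | Z0 = i)

# Joint distribution P(Z0, Z1, Z2) under conditional independence.
joint = np.einsum('i,ij,ik->ijk', p_z0, p_z1_given_z0, p_z2_given_z0)

# Eq. (3.1): P(Z1, Z2 | Z0) factorizes into P(Z1 | Z0) P(Z2 | Z0).
cond_joint = joint / joint.sum(axis=(1, 2), keepdims=True)
factorized = np.einsum('ij,ik->ijk', p_z1_given_z0, p_z2_given_z0)
assert np.allclose(cond_joint, factorized)
print("factorization of Eq. (3.1) verified")
```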

3.4 Logistic Regression, and Its Special Case of Weights-of-Evidence

The conditional expectation of a dichotomous random target variable \(\mathsf Z_0\) given an m-variate random predictor vector \(\varvec{ Z} = (\mathsf Z_1, \ldots , \mathsf Z_m)^{\mathsf {T}}\) is equal to a conditional probability, i.e.,
$$\begin{aligned} \mathrm {E}(\mathsf Z_0 \mid \varvec{ Z}) = P (\mathsf Z_0 = 1 \mid \varvec{ Z}). \end{aligned}$$
Then the ordinary logistic regression model (without interaction terms), neglecting the error term, yields
$$\begin{aligned} \mathrm {logit} P(\mathsf Z_0 = 1 \mid \varvec{ Z}) = \beta _0 + \varvec{\beta }^{\mathsf {T}} \varvec{ Z}, \beta _0 \in \mathbb R, \varvec{\beta } \in \mathbb R^m. \end{aligned}$$
It can be rewritten in terms of a probability as
$$\begin{aligned} P \left( \mathsf Z_0 = 1 \mid \varvec{ Z} \right) = \varLambda \left( \beta _0 + \varvec{\beta }^{\mathsf {T}} \varvec{ Z} \right) , \end{aligned}$$
where \(\varLambda \) denotes the logistic function. The logistic regression model with interaction terms reads in terms of a logit transformed probability
$$\begin{aligned} \mathrm {logit} P(\mathsf Z_0 = 1 \mid \varvec{ Z}) = \beta _0 + \sum _\ell \beta _\ell \mathsf Z_\ell + \sum _{\ell _i, \ldots , \ell _j} \beta _{\ell _i, \ldots , \ell _j} \mathsf Z_{\ell _i} \cdots \mathsf Z_{\ell _j}, \end{aligned}$$
(3.2)
and in terms of a probability
$$\begin{aligned} P \left( \mathsf Z_0 = 1 \mid \varvec{ Z} \right) = \varLambda \left( \beta _0 + \sum _\ell \beta _\ell \mathsf Z_\ell + \sum _{\ell _i, \ldots , \ell _j} \beta _{\ell _i, \ldots , \ell _j} \mathsf Z_{\ell _i} \cdots \mathsf Z_{\ell _j} \right) . \end{aligned}$$
If all predictor variables are dichotomous and conditionally independent given the target variable, then the parameters of the ordinary logistic regression model simplify to
$$\begin{aligned} \beta _0 = \mathrm {logit}P(\mathsf Z_0=1) + W^{(0)}, \quad \beta _\ell = C_\ell , \ell =1,\ldots ,m, \end{aligned}$$
with contrasts
$$\begin{aligned} C_\ell = W_{\ell }^{(1)} - W_{\ell }^{(0)}, \ell = 1,\ldots , m, \end{aligned}$$
defined as differences of weights of evidence
$$\begin{aligned} W_{\ell }^{(1)} = \ln {\frac{P(\mathsf Z_\ell = 1 \mid \mathsf Z_0 = 1 )}{P(\mathsf Z_\ell = 1 \mid \mathsf Z_0 = 0 )}}, \quad W_{\ell }^{(0)} = \ln {\frac{P(\mathsf Z_\ell = 0 \mid \mathsf Z_0 = 1 )}{P(\mathsf Z_\ell = 0 \mid \mathsf Z_0 = 0 )}}, \end{aligned}$$
and with \(W^{(0)} = \sum _{\ell =1}^m W_\ell ^{(0)}\) provided all conditional probabilities are different from 0 (Schaeben 2014b). Obviously the model parameters become independent of one another, and can be estimated by mere counting. This special case of a logistic regression model is usually referred to as the method of “weights-of-evidence”. In turn, the canonical generalization of Bayesian weights-of-evidence is logistic regression.
That weights of evidence \(W_\ell \) agree with the logistic regression parameters \(\beta _\ell \) in case of joint conditional independence becomes obvious when recalling
$$\begin{aligned} C_\ell= & {} W_{\ell }^{(1)} - W_{\ell }^{(0)} \\= & {} \ln {\frac{P(\mathsf Z_\ell = 1 \mid \mathsf Z_0 = 1 )}{P(\mathsf Z_\ell = 1 \mid \mathsf Z_0 = 0 )}} - \ln {\frac{P(\mathsf Z_\ell = 0 \mid \mathsf Z_0 = 1 )}{P(\mathsf Z_\ell = 0 \mid \mathsf Z_0 = 0 )}} \\= & {} \ln \left( \frac{\mathrm {O}(\mathsf Z_0 = 1 \mid \mathsf Z_\ell = 1)}{\mathrm {O}(\mathsf Z_0 = 1 \mid \mathsf Z_\ell = 0)} \right) = \beta _\ell , \end{aligned}$$
which is the log odds ratio, the usual interpretation of \(\beta _\ell \) (Hosmer and Lemeshow 2000).
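A minimal Python sketch of this counting estimation follows; the data are fabricated, and the helper name weights_of_evidence is ours, not part of any weights-of-evidence software:
```python
import numpy as np

def weights_of_evidence(z_pred, z0):
    """W^(1), W^(0) and contrast C for one dichotomous predictor,
    estimated by mere counting; all conditional probabilities are
    assumed to be strictly positive."""
    z_pred, z0 = np.asarray(z_pred), np.asarray(z0)
    p1 = [z_pred[z0 == t].mean() for t in (0, 1)]   # P(Z_l = 1 | Z0 = t)
    w1 = np.log(p1[1] / p1[0])
    w0 = np.log((1 - p1[1]) / (1 - p1[0]))
    return w1, w0, w1 - w0                          # C_l = W^(1) - W^(0)

# Fabricated training cells.
rng = np.random.default_rng(0)
z0 = rng.integers(0, 2, 10_000)
z1 = (rng.random(10_000) < np.where(z0 == 1, 0.7, 0.3)).astype(int)
w1, w0, c = weights_of_evidence(z1, z0)
print(f"W1 = {w1:.3f}, W0 = {w0:.3f}, contrast C = {c:.3f}")
```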
If \(\varvec{ Z}\) comprises m dichotomous predictor variables \(\mathsf Z_\ell , \ell =1,\ldots ,m\), there are \(2^m\) possible different realizations \(\varvec{z}_k, k=1,\ldots , 2^m\), of \(\varvec{ Z}\). Then
$$\begin{aligned} \sum _{i=1}^n \widehat{P} \bigl ( \mathsf Z_0=1 \mid \varvec{ Z} = \varvec{z} \left( i \right) \bigr )= & {} \sum _{k=1}^{2^m} \widehat{P}(\mathsf Z_0=1 \mid \varvec{ Z} = \varvec{z}_k) \; H(\varvec{ Z} = \varvec{z}_k) \\= & {} \sum _{k=1}^{2^m} \widehat{P}(\mathsf Z_0=1 \mid \varvec{ Z} = \varvec{z}_k) \; n \, \widehat{P}(\varvec{ Z} = \varvec{z}_k) \\= & {} n \widehat{P}(\mathsf Z_0=1) = \sum _{i=1}^n z_0(i), \end{aligned}$$
where the last equation is an application of the law of total probability. It is a constitutive equation for estimating the parameters of a logistic regression model and always holds for fitted logistic regression models. With respect to weights-of-evidence, the test statistic of the so-called “new omnibus test” of conditional independence (Agterberg and Cheng 2002) is
$$\begin{aligned} t = \sum _{i=1}^n \left( \widehat{P} \left( \mathsf Z_0=1 \mid \varvec{ Z} = \varvec{z} \left( i \right) \right) - z_0(i) \right) \end{aligned}$$
and should not be too large if conditional independence is to be reasonably assumed.
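A sketch of the statistic, assuming fitted posterior probabilities per cell are already available (the numbers below are fabricated):
```python
import numpy as np

def new_omnibus_statistic(p_hat, z0):
    """Sum of fitted posterior probabilities minus sum of target values
    (Agterberg and Cheng 2002); values far from 0 speak against
    conditional independence."""
    return float(np.sum(np.asarray(p_hat) - np.asarray(z0)))

z0 = np.array([1, 0, 0, 1, 0])                 # observed target per cell
p_hat = np.array([0.8, 0.1, 0.2, 0.7, 0.2])    # hypothetical WofE output
print(f"t = {new_omnibus_statistic(p_hat, z0):.2f}")
```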

3.5 Hammersley–Clifford Theorem

Rephrasing the proper statement (Lauritzen 1996) casually, the Hammersley–Clifford theorem states that a probability distribution with a positive density satisfies one of the Markov properties with respect to an undirected graph G if and only if its density can be factorized over the cliques of the graph. Since the distribution of a categorical random vector can be represented in terms of a log-linear model, the Hammersley–Clifford theorem applies. Given \((m+1)\) random variables \(\mathsf Z_0, \dots , \mathsf Z_m\), there is a total of \(\left( {\begin{array}{c}m+1\\ \ell +1\end{array}}\right) \) different product terms each involving \((\ell +1)\) variables, \(\ell =0,\ldots ,m\), summing to a total of \(\sum _{\ell =0}^{m} \left( {\begin{array}{c}m+1\\ \ell +1\end{array}}\right) = 2^{m+1}-1\) different terms. Thus there is a total of \((m+1)\) single-variable terms, and a total of \(2^{m+1}-(m+2)\) multi-variable terms.
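The counting argument can be checked mechanically; this short Python sketch enumerates one product term per non-empty subset of the \((m+1)\) variables:
```python
from itertools import combinations

m = 3  # predictors Z1,...,Zm plus the target Z0
variables = [f"Z{i}" for i in range(m + 1)]

# One product term per non-empty subset of the (m+1) variables.
terms = [c for l in range(1, m + 2) for c in combinations(variables, l)]
single = [t for t in terms if len(t) == 1]
multi = [t for t in terms if len(t) > 1]

assert len(terms) == 2 ** (m + 1) - 1           # all terms
assert len(single) == m + 1                     # single-variable terms
assert len(multi) == 2 ** (m + 1) - (m + 2)     # multi-variable terms
print(len(terms), "terms, e.g.", terms[:5])
```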
The full log-linear model encompasses all terms and reads
$$\begin{aligned} \log p_{\kappa } = \sum _{\ell =0}^{m} \; \sum _{\alpha \in C_{\ell +1}^{m+1}} \phi _{\kappa (\alpha )}, \end{aligned}$$
(3.3)
where \(\alpha \in C_{\ell +1}^{m+1}\) denotes an \((\ell +1)\)-combination of the set \(\{1, \ldots , m+1 \} \subset {\mathbb N}\), and \(\kappa (\alpha ) = ( k_{i_1}, \ldots , k_{i_{\ell +1}} )\) denotes a multi-index with \((\ell +1)\) entries \(k_{i_\ell } = 1,\ldots ,K_{i_\ell }\), for \(\ell =0,\ldots ,m\). The random vector \({\varvec{ Z}}_{\kappa (\alpha )}\) is the product of any tuple of \((\ell +1)\) components of \(\varvec{ Z}\), the total number of which is \(\left( {\begin{array}{c}m+1\\ \ell +1\end{array}}\right) \).
Assumptions of independence or conditional independence simplify the distribution of \({\varvec{ Z}}\), i.e., its full log-linear model, considerably. Assuming independence of all its components \(\mathsf Z_\ell , \ell =0,\ldots ,m\), the log-linear model simplifies to
$$\begin{aligned} \log p_{\kappa } = \sum _{\ell =0}^{m} \phi _{k_\ell }, \end{aligned}$$
(3.4)
where \(\phi _{k_\ell } = \log p_{k_\ell }\).
Assuming joint conditional independence of all components \(\mathsf Z_\ell , \ell =1,\ldots ,m\), given \(\mathsf Z_0\), the log-linear model, Eq. (3.3), simplifies according to Eq. (3.1) to
$$\begin{aligned} \log p_{\kappa } = \phi _{k_0} + \sum _{\ell =1}^{m} \phi _{k_\ell } + \sum _{\ell =1}^{m} \phi _{k_0, k_\ell }. \end{aligned}$$
(3.5)
Thus the latter model, Eq. (3.5), assuming conditional independence differs from the model for independence, Eq. (3.4), in the additional product terms \(\mathsf Z_0 \otimes \mathsf Z_\ell , \ell =1,\ldots ,m\).
Any violation of joint conditional independence given \(\mathsf Z_0\) results in additional cliques of the graph and in additional product terms. Assuming that conditional independence given \(\mathsf Z_0\) does not hold for a particular subset \(\mathsf Z_{\ell _1}, \ldots , \mathsf Z_{\ell _k}\) of variables \(\mathsf Z_\ell \) enlarges the log-linear model of Eq. (3.5) by additional terms referring to \(\mathsf Z_0 \otimes \bigotimes _{\ell =1}^k \bigotimes _{\ell _i \in C_{\ell }^k} \mathsf Z_{\ell _i}\) and \(\bigotimes _{\ell =1}^k \bigotimes _{\ell _i \in C_{\ell }^k} \mathsf Z_{\ell _i}\), respectively.

3.6 Testing Joint Conditional Independence of Categorical Random Variables

The statistic of the likelihood ratio test (Neyman and Pearson 1933; Casella and Berger 2001) is the ratio of the maximized likelihood of a restricted model to the maximized likelihood of the full model. The assumption of the likelihood ratio test concerns the choice of the family of model distributions.
The null-hypothesis is that a given log-linear model is sufficiently large to represent the joint distribution. If the random variables are categorical, the full log-linear model is always sufficiently large, as was explicitly shown above. More interesting are tests of whether a smaller log-linear model is sufficiently large. Testing the null-hypothesis that a log-linear model encompassing all one-variable terms and only those two-variable terms involving \(\mathsf Z_0\) is sufficiently large provides a test of conditional independence of all \(\mathsf Z_\ell , \ell =1,\ldots ,m\), given \(\mathsf Z_0\), because this log-linear model is sufficiently large in case of conditional independence given \(\mathsf Z_0\). Thus, a reasonable rejection of the initial null-hypothesis implies a reasonable rejection of the assumption of conditional independence given \(\mathsf Z_0\).
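A Python sketch of this test follows. Since the log-linear model of conditional independence given \(\mathsf Z_0\) is decomposable, its maximum-likelihood fitted cell counts have the closed form \(n(k_0) \prod _\ell n(k_0,k_\ell )/n(k_0)\), so the statistic \(G^2 = 2 \sum \mathrm {obs} \, \log (\mathrm {obs}/\mathrm {fitted})\) is available without iterative fitting; the function name and the example table are our own illustration, not taken from the chapter:
```python
import numpy as np
from scipy.special import xlogy
from scipy.stats import chi2

def ci_loglik_ratio_test(table):
    """G^2 test of the null-hypothesis that Z1,...,Zm are jointly
    conditionally independent given Z0 (axis 0 of `table`).
    All margins n(k0) and n(k0, k_l) are assumed positive."""
    obs = np.asarray(table, dtype=float)
    m = obs.ndim - 1
    k0 = obs.shape[0]
    n0 = obs.sum(axis=tuple(range(1, m + 1)))        # margins n(k0)
    # Closed-form fitted counts of the decomposable CI model:
    # n(k0) * prod_l [ n(k0, k_l) / n(k0) ]
    fitted = n0.reshape((k0,) + (1,) * m)
    for l in range(1, m + 1):
        axes = tuple(a for a in range(1, m + 1) if a != l)
        marg = obs.sum(axis=axes)                    # margins n(k0, k_l)
        shape = [1] * (m + 1)
        shape[0], shape[l] = k0, obs.shape[l]
        fitted = fitted * (marg / n0[:, None]).reshape(shape)
    g2 = 2.0 * float(xlogy(obs, obs / fitted).sum())
    # degrees of freedom: saturated minus CI-model parameters
    df = (obs.size - 1) - ((k0 - 1) + sum(k0 * (s - 1) for s in obs.shape[1:]))
    return g2, df, chi2.sf(g2, df)

# Hypothetical 2x2x2 table of counts, indexed (Z0, Z1, Z2).
g2, df, p = ci_loglik_ratio_test([[[30, 10], [12, 4]], [[8, 16], [10, 24]]])
print(f"G2 = {g2:.3f}, df = {df}, p-value = {p:.3f}")
```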

3.7 Conditional Distribution, Logistic Regression

Since the joint distribution determines all marginal and conditional distributions, the conditional distribution
$$\begin{aligned} P_{\mathsf Z_0 \mid \bigotimes _{\ell =1}^m \mathsf Z_\ell } = \frac{P_{\bigotimes _{\ell =0}^m \mathsf Z_\ell }}{P_{\bigotimes _{\ell =1}^m \mathsf Z_\ell }} \end{aligned}$$
(3.6)
is explicitly given here by
$$\begin{aligned} \frac{P_{\bigotimes _{\ell =0}^m \mathsf Z_\ell }(s_{k_0}, \dots , s_{k_m})}{P_{\bigotimes _{\ell =1}^m \mathsf Z_\ell }(s_{k_1}, \dots , s_{k_m})} = \frac{P_{\bigotimes _{\ell =0}^m \mathsf Z_\ell }(s_{k_0}, \dots , s_{k_m})}{\sum _{k_0=1}^{K_0} P_{\bigotimes _{\ell =0}^m \mathsf Z_\ell }(s_{k_0}, s_{k_1}, \dots , s_{k_m})}. \end{aligned}$$
Assuming independence, Eq. (3.6) immediately reveals
$$\begin{aligned} P_{\mathsf Z_0 \mid \bigotimes _{\ell =1}^m \mathsf Z_\ell } = P_{\mathsf Z_0}. \end{aligned}$$
Assuming conditional independence of all \(\mathsf Z_\ell , \ell =1,\ldots ,m\), given \(\mathsf Z_0\), and further that \(\mathsf Z_0\) is dichotomous, then
$$\begin{aligned} P_{\mathsf Z_0 \mid \bigotimes _{\ell =1}^m \mathsf Z_\ell } (1 \mid s_{k_1}, \dots , s_{k_m}) = \frac{P_{\bigotimes _{\ell =0}^m \mathsf Z_\ell }(1, s_{k_1}, \dots , s_{k_m})}{\sum _{i =0}^1 P_{\bigotimes _{\ell =0}^m \mathsf Z_\ell }(i, s_{k_1}, \dots , s_{k_m})} \end{aligned}$$
(3.7)
with
$$\begin{aligned} P_{\bigotimes _{\ell =0}^m \mathsf Z_\ell }(1, s_{k_1}, \dots , s_{k_m}) = \exp \left( \phi _{1} + \sum _{\ell =1}^m \phi _{k_\ell } + \sum _{\ell =1}^m \phi _{1, k_\ell } \right) \end{aligned}$$
and
$$\begin{aligned} \sum _{i=0}^1 P_{\bigotimes _{\ell =0}^m \mathsf Z_\ell }(i, s_{k_1}, \dots , s_{k_m}) = \sum _{i =0}^1 \exp \left( \phi _{i} + \sum _{\ell =1}^m \phi _{k_\ell } + \sum _{\ell =1}^m \phi _{i, k_\ell } \right) . \end{aligned}$$
Thus,
$$\begin{aligned} P_{\mathsf Z_0 \mid \bigotimes _{\ell =1}^m \mathsf Z_\ell } (1 \mid s_{k_1}, \dots , s_{k_m}) = \frac{\exp \left( \phi _{1} + \sum _{\ell =1}^m \phi _{1, k_\ell } \right) }{\exp \left( \phi _{0} + \sum _{\ell =1}^m \phi _{0, k_\ell } \right) + \exp \left( \phi _{1} + \sum _{\ell =1}^m \phi _{1, k_\ell } \right) } = \varLambda \left( (\phi _{1} - \phi _{0}) + \sum _{\ell =1}^m ( \phi _{1, k_\ell } - \phi _{0, k_\ell } ) \right) , \end{aligned}$$
since the common factor \(\exp ( \sum _{\ell =1}^m \phi _{k_\ell } )\) cancels.
Finally,
$$\begin{aligned} P_{\mathsf Z_0 \mid \bigotimes _{\ell =1}^m \mathsf Z_\ell } = \varLambda \Big ( \beta _0 + \sum _{\ell =1}^m \beta _\ell \mathsf Z_\ell \Big ), \end{aligned}$$
which is obviously logistic regression
$$\begin{aligned} \mathrm {logit} P_{\mathsf Z_0 \mid \bigotimes _{\ell =1}^m \mathsf Z_\ell } = \beta _0 + \sum _{\ell =1}^m \beta _\ell \mathsf Z_\ell . \end{aligned}$$
(3.8)
It should be noted that additional product terms in the joint probability \(P_{\bigotimes _{\ell =0}^m \mathsf Z_\ell }\) on the right hand side of Eq. (3.7) of the form \(\bigotimes _{\ell =1}^k \bigotimes _{\ell _i \in C_{\ell }^k} \mathsf Z_{\ell _i}\), involving \(\mathsf Z_\ell , \ell =1,\ldots ,m\), only, i.e., not involving \(\mathsf Z_0\), would not affect the form of the conditional probability, Eq. (3.8). Additional product terms of the form \(\mathsf Z_0 \otimes \bigotimes _{\ell =1}^k \bigotimes _{\ell _i \in C_{\ell }^k} \mathsf Z_{\ell _i}\), i.e., involving \(\mathsf Z_0\), result in a logistic regression model with interaction terms, Eq. (3.2).
Ordinary logistic regression is optimum if the joint probability of the (dichotomous) target variable and the predictor variables is of log-linear form and all predictor variables are jointly conditionally independent given the target variable; in particular, it is optimum if the predictor variables are categorical and jointly conditionally independent given the target variable (Schaeben 2014a). Logistic regression with interaction terms is optimum if the joint probability is of log-linear form and the interaction terms correspond to the lacking conditional independence given the target variable; for categorical predictor variables, interaction terms can compensate exactly for any lack of conditional independence (Schaeben 2014a).
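The correspondence can be illustrated numerically. The sketch below (fabricated data; a plain Newton-Raphson fit stands in for a library routine) generates dichotomous predictors that are conditionally independent given the target by construction, fits the ordinary logistic regression model, and compares the estimated \(\beta _\ell \) with the count-based contrasts \(C_\ell \); the two approximately agree, as derived in Sect. 3.4:
```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Plain Newton-Raphson fit of an intercept-plus-linear logistic model."""
    X = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        beta += np.linalg.solve((X * w[:, None]).T @ X, X.T @ (y - p))
    return beta

def contrast(zl, z0):
    """Count-based contrast C_l = W^(1) - W^(0) of Sect. 3.4."""
    w1 = np.log(zl[z0 == 1].mean() / zl[z0 == 0].mean())
    w0 = np.log((1 - zl[z0 == 1]).mean() / (1 - zl[z0 == 0]).mean())
    return w1 - w0

rng = np.random.default_rng(2)
n = 100_000
z0 = rng.integers(0, 2, n)
# Z1, Z2 conditionally independent given Z0 by construction.
z1 = (rng.random(n) < np.where(z0 == 1, 0.8, 0.4)).astype(int)
z2 = (rng.random(n) < np.where(z0 == 1, 0.3, 0.6)).astype(int)

beta = fit_logistic(np.column_stack([z1, z2]), z0)
print("beta :", np.round(beta[1:], 3))
print("C_l  :", np.round([contrast(z1, z0), contrast(z2, z0)], 3))
```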

3.8 Practical Applications

The practical application of the log-likelihood ratio test of joint conditional independence generally includes the following steps (a sketch in code follows the list):
  • test the null-hypothesis that the full log-linear model is sufficiently large to represent the joint probability of all predictor variables and the target variable;
  • if the first null-hypothesis is not reasonably rejected, test the null-hypotheses that smaller log-linear models are sufficiently large; in particular,
  • test the null-hypothesis that the log-linear model without any interaction term is sufficiently large;
  • if this final null-hypothesis is rejected, then the predictor variables must not be assumed to be jointly conditionally independent given the target variable.
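A sketch of this workflow in Python, fitting the log-linear models as Poisson regressions on cell counts with statsmodels: the fabricated 2×2×2 counts, the formula strings, and the pair of models compared are our own illustration. The saturated model plays the role of the full model, and the model with only \(\mathsf Z_0\)-predictor product terms corresponds to joint conditional independence given \(\mathsf Z_0\):
```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Hypothetical cell counts for dichotomous Z0, Z1, Z2 (one row per cell).
cells = pd.DataFrame(
    [(z0, z1, z2) for z0 in (0, 1) for z1 in (0, 1) for z2 in (0, 1)],
    columns=["Z0", "Z1", "Z2"])
cells["n"] = [30, 10, 12, 4, 8, 16, 10, 24]

def fit(formula):
    """Log-linear model fitted as a Poisson regression on the counts."""
    return smf.glm(formula, data=cells, family=sm.families.Poisson()).fit()

saturated = fit("n ~ Z0 * Z1 * Z2")                  # full log-linear model
ci_model = fit("n ~ Z0 + Z1 + Z2 + Z0:Z1 + Z0:Z2")   # joint CI given Z0

g2 = 2.0 * (saturated.llf - ci_model.llf)            # log-likelihood ratio
df = saturated.df_model - ci_model.df_model
print(f"G2 = {g2:.3f}, df = {df}, p-value = {chi2.sf(g2, df):.3f}")
```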

3.8.1 Practical Application with Fabricated Indicator Data

3.8.1.1 The Data Set BRY

The data set bry is derived from the example given at https://en.wikipedia.org/wiki/Conditional_independence. Initially it comprises three random events B, R, Y, denoting the subsets of the set of all 49 pixels which are blue, red or yellow, with given probabilities \(P(B) = \tfrac{18}{49} = 0.367, P(R) = \tfrac{16}{49} = 0.326, P(Y) = \tfrac{12}{49} = 0.244\). The random events B, R, Y are distinguished from their corresponding random indicator variables \(\mathsf B, \mathsf R, \mathsf Y\), defined as usual, e.g.,
$$\begin{aligned} \mathsf B = \mathbb {1}_B, \quad \text {i.e.,} \quad \mathsf B(\omega ) = {\left\{ \begin{array}{ll} 1 &{} \text {if } \omega \in B, \\ 0 &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
where \(\mathbb {1}\) denotes the indicator function. They are assigned to pixels of a \(7 \times 7\) digital map image, Fig. 3.1.
It should be noted that in this example any spatial references are solely owed to the purpose of visualization as map images, and that the test itself does not take any spatial references or spatially induced dependences into account.
Checking independence according to its definition in reference to random events, the figures
$$\begin{aligned} P(B \cap R) = 0.122, \quad P(B) \; P(R) = 0.119 \end{aligned}$$
indicate that the random events B and R are not independent. However, the deviation is small.
Next, conditional independence is checked in terms of its definition referring to random events. Since conditional independence of the random events B and R given Y does not imply conditional independence of the random events B and R given the complement \(\complement Y\), two checks are required. The results are
$$\begin{aligned} P(B \cap R \mid Y)= & {} \frac{1}{6} = P(B \mid Y) \; P(R \mid Y) \\ P(B \cap R \mid \complement Y)= & {} \frac{4}{37} \not = {\Bigl (\frac{12}{37}\Bigr )}^2 = P(B \mid \complement Y) \; P(R \mid \complement Y), \end{aligned}$$
and indicate that the random events B and R are conditionally independent given the random event Y, but that they are not conditionally independent given the complement \(\complement Y\). It should be noted, though, that the deviation of the joint conditional probability from the product of the two individual conditional probabilities, expressed as their ratio, is only 1.027. In fact, the events B and R are conditionally independent given either Y or \(\complement Y\) if one white pixel, e.g. pixel (1,7) with \(\mathsf B = \mathsf R = \mathsf Y = 0\), is omitted.
Generalizing the view to random variables \(\mathsf B, \mathsf R, \mathsf Y\) and their unique joint realization as shown in Fig. 3.1, Pearson’s \(\chi ^2\) test with Yates’ continuity correction of the null-hypothesis of independence of the random variables \(\mathsf B\) and \(\mathsf R\) given the data returns a p-value of 1 indicating that the null-hypothesis cannot reasonably be rejected.
The likelihood ratio test is applied with respect to the log-linear distribution corresponding to the null-hypothesis of conditional independence and results in a p-value of 0.996 indicating that the null-hypothesis cannot reasonably be rejected.
Thus, given the data, the tests suggest that the random variables \(\mathsf B\) and \(\mathsf R\) are independent and conditionally independent given the random variable \(\mathsf Y\).
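The checks of this subsection are easily reproduced from cell counts consistent with the probabilities quoted above (a hypothetical reconstruction, since Fig. 3.1 is not shown here); exact rational arithmetic avoids rounding artifacts:
```python
from fractions import Fraction as F

# Cell counts consistent with the probabilities quoted above
# (hypothetical reconstruction of the 49-pixel bry image).
n = 49
n_Y, n_BY, n_RY, n_BRY = 12, 6, 4, 2   # within the yellow event Y
n_B, n_R, n_BR = 18, 16, 6             # overall

# Conditional independence given Y: P(B & R | Y) = P(B | Y) P(R | Y).
print(F(n_BRY, n_Y) == F(n_BY, n_Y) * F(n_RY, n_Y))    # True: 1/6 = 1/6

# ... but not given the complement of Y.
lhs = F(n_BR - n_BRY, n - n_Y)                         # 4/37
rhs = F(n_B - n_BY, n - n_Y) * F(n_R - n_RY, n - n_Y)  # (12/37)^2
print(lhs == rhs, float(lhs / rhs))                    # False, ratio ~1.027
```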

3.8.1.2 The Data Set SCCI

The next data set scci comprises three random events \(B_1,B_2,T\) with given probabilities \(P(B_1) = P(B_2) = P(T) = \tfrac{7}{49} = 0.142\). They are assigned to pixels of a \(7 \times 7\) digital map image, Fig. 3.2.
Checking independence according to its definition for random events, the figures
$$\begin{aligned} P(B_1 \cap B_2) = 0.102, \quad P(B_1) \; P(B_2) = 0.020 \end{aligned}$$
indicate that the random events \(B_1\) and \(B_2\) are not independent.
Next, conditional independence is checked in terms of its definition referring to random events. Since conditional independence of the random events \(B_1\) and \(B_2\) given T does not imply conditional independence of the random events \(B_1\) and \(B_2\) given \(\complement T\), two checks are required. The results are
$$\begin{aligned} P(B_1 \cap B_2 \mid T) = 0.714 \not =&0.734 = P(B_1 \mid T) \; P(B_2 \mid T) \\ P(B_1 \cap B_2 \mid \complement T) = 0 \not =&0.0005 = P(B_1 \mid \complement T) \; P(B_2 \mid \complement T), \end{aligned}$$
and indicate that the random events \(B_1\) and \(B_2\) are neither conditionally independent given the random event T nor given the complement \(\complement T\).
Testing the null-hypothesis of independence of the random variables \(\mathsf B_1\) and \(\mathsf B_2\) with Pearson's \(\chi ^2\) test with Yates' continuity correction given the data returns a p-value practically equal to 0, indicating that the null-hypothesis should be rejected.
The likelihood ratio test is applied with respect to the log-linear distribution corresponding to the null-hypothesis of conditional independence and results in a p-value of 0.825 indicating that the null-hypothesis cannot reasonably be rejected.
Thus, given the data the tests imply that the random variables \(\mathsf B_1\) and \(\mathsf B_2\) are not independent but conditionally independent given the random variable \(\mathsf T\).

3.9 Discussion and Conclusions

Since pairwise conditional independence does not imply joint conditional independence, the \(\chi ^2\)-test (Bonham-Carter 1994) of independence given \(\mathsf Z_0=1\) does not apply to checking the modeling assumption of weights-of-evidence. The disadvantage of both the “omnibus” test (Bonham-Carter 1994) and the “new omnibus” test (Agterberg and Cheng 2002) is twofold. First, they involve an assumption of normality which itself should be subject to a test. Second, weights-of-evidence has to be applied to calculate the test statistic, which is the sum of all predicted conditional probabilities within the training data set. If the test actually suggests rejection of the null-hypothesis of conditional independence, the user learns only then that the application of weights-of-evidence was not mathematically justified to predict the conditional probabilities. The standard likelihood ratio test suggested here resolves both shortcomings.

Acknowledgements

The author would like to thank Prof. Juanjo Egozcue, UPC Barcelona, Spain, and Prof. K. Gerald van den Boogaart, HIF, Germany, for emphatic and instructive discussions of conditional independence.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
References
Agterberg FP, Cheng Q (2002) Conditional independence test for weights-of-evidence modeling. Nat Resour Res 11:249–255
Bergsma WP (2004) Testing conditional independence for continuous random variables. Eurandom Report Vol. 2004048, Eindhoven
Bonham-Carter GF (1994) Geographic information systems for geoscientists: modeling with GIS. Elsevier, Pergamon
Bonham-Carter GF, Agterberg FP, Wright DF (1989) Weights of evidence modeling: a new approach to mapping mineral potential. In: Agterberg FP, Bonham-Carter GF (eds) Statistical applications in the earth sciences, Geological Survey of Canada, Paper 89-9, pp 171–183
Bouezmarni T, Rombouts JVK, Taamouti A (2012) Nonparametric copula-based test for conditional independence with applications to Granger causality. J Bus Econ Stat 30:275–287
Butz CJ, Sanscartier MJ (2002) Properties of weak conditional independence. In: Alpigini JJ, Peters JF, Skowron A, Zhong N (eds) Rough sets and current trends in computing: third international conference, RSCTC 2002, Malvern, PA, USA, 14–16 October 2002. Lecture Notes in Computer Science, vol 2475. Springer, Berlin, pp 349–356
Casella G, Berger RL (2001) Statistical inference, 2nd edn. Duxbury Thomson Learning
Chalak K, White H (2012) Causality, conditional independence, and graphical separation in settable systems. Neural Comput 24:1611–1668
Cheng Q (2015) BoostWofE: a new sequential weights of evidence model reducing the effect of conditional dependency. Math Geosci 47:591–621
Christensen R (1997) Log-linear models and logistic regression, 2nd edn. Springer, Berlin
Dawid AP (1979) Conditional independence in statistical theory. J R Stat Soc B 41:1–31
Dawid AP (2004) Probability, causality and the empirical world: a Bayes-de Finetti-Popper-Borel synthesis. Stat Sci 19:44–57
Dawid AP (2007) Fundamentals of statistical causality. Research Report 279, Department of Statistical Science, University College London
Doran G, Muandet K, Zhang K, Schölkopf B (2014) A permutation-based kernel conditional independence test. In: Proceedings of UAI
ESRI: ArcSDM 3.1 user guide, spatial data modeller 3 extension for ArcMap 9.1. file:///C|/arcgis/ArcSDM/Documentation/sdmrspns.htm
Good IJ (1950) Probability and the weighing of evidence. Griffin, London
Good IJ (1960) Weight of evidence, corroboration, explanatory power, information and the utility of experiments. J R Stat Soc B 22:319–331
Good IJ (1985) Weight of evidence: a brief survey. Bayesian Stat 2:249–270
Györfi L, Walk H (2012) Strongly consistent nonparametric tests of conditional independence. Stat Probab Lett 82:1145–1150
Hosmer DW, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley, New Jersey
Huang T-M (2010) Testing conditional independence using maximal nonlinear conditional correlation. Ann Stat 38:2047–2091
Huang M, Sun Y, White H (2016) A flexible nonparametric test for conditional independence. Econom Theory 32:1434–1482
Lauritzen SL (1996) Graphical models. Clarendon Press, Oxford
Neyman J, Pearson ES (1933) On the problem of the most efficient tests of statistical hypotheses. Philos Trans R Soc Lond A 231:289–337
Pearl J (2009) Causality: models, reasoning, and inference, 2nd edn. Cambridge University Press, New York
Schaeben H (2014a) A mathematical view of weights-of-evidence, conditional independence, and logistic regression in terms of Markov random fields. Math Geosci 46:691–709
Schaeben H (2014b) Potential modeling: conditional independence matters. Int J Geomath 5:99–116
Song K (2009) Testing conditional independence via Rosenblatt transforms. Ann Stat 37:4011–4045
Spohn W (1980) Stochastic independence, causal independence, and shieldability. J Philos Log 9:73–99
Spohn W (1994) On the properties of conditional independence. In: Suppes P, Humphreys P (eds) Scientific philosopher, vol 1. Probability and probabilistic causality. Kluwer, Dordrecht, pp 173–194
Su L, White H (2007) A consistent characteristic function-based test for conditional independence. J Econom 141:807–834
Su L, White H (2008) A nonparametric Hellinger metric test for conditional independence. Econom Theory 24:829–864
Suppes P (1970) A probabilistic theory of causality. North-Holland, Amsterdam
Wong SKM, Butz CJ (1999) Contextual weak independence in Bayesian networks. In: Fifteenth conference on uncertainty in artificial intelligence, pp 670–679
Zhang K, Peters J, Janzing D, Schölkopf B (2011) Kernel-based conditional independence test and application in causal discovery. In: Cozman FG, Pfeffer A (eds) Proceedings of the 27th conference on uncertainty in artificial intelligence (UAI 2011), Barcelona, Spain, July 14–17, 2011. AUAI Press, Corvallis, OR, pp 804–813
Metadata
Title: Testing Joint Conditional Independence of Categorical Random Variables with a Standard Log-Likelihood Ratio Test
Author: Helmut Schaeben
Copyright Year: 2018
DOI: https://doi.org/10.1007/978-3-319-78999-6_3