We pursue a threefold purpose in this paper. First, we suggest a Kullback-Leibler formulation for developing a statistics and making discriminative projection for case-control studies, based on which existing typical methods are revisited and then further extended to matrix-variate counterparts. Second, we propose a bi-linear matrix form, based on which multivariate discriminative analysis and logistic, Cox, and linear mixed regression are extended into their matrix-variate counterparts. Third, we systematically address the necessity, feasibility, and methodology of integrative hypothesis tests (IHT) from the complementarity of model-based test and boundary-based test (BBT) in the data (D)-space, statistics (S)-space, and probability (P)-space. We elaborate four IHT components (modelling, comparison, classification, and assurance) and summarise four IHT types in the D-space. Then, we extend the existing efforts on multivariate tests to BBTs in the S-space. Particularly, we extend the classic univariate one-tail z-test to the multivariate ones, which is then applied to a multivariate sample-pairing delta (SPD) test for detecting a collective inclining dominance. Also, we propose a SPD discriminative analysis that extends this SPD test. Moreover, we propose a multivariate bi-test that tests the classic null and also a null about the inference reliability due to test space complexity, including a further development of Fisher combination. Finally, we suggest possible applications for gene expression biomarkers and exome-sequencing-based joint single-nucleotide variant (SNV) detection.
Notes
Competing interests
The author declares that he has no competing interests.
Background
Typically, multivariate statistical analysis and related machine-learning studies consider a basic sampling unit in a vector xt. Though an entire data set may be regarded as given in a format of matrix that consists of x1,⋯,xN as the columns, each statistics is computed from an assembly of vector samples and featured by vector inner product as a basic modelling unit.
Nowadays, not only rapid developments of data acquisition techniques (DePristo et al. 2011; Koboldt et al. 2013) demand that data with a matrix Xt as shown in Figure 1 as a basic sampling unit be considered, but also ever-increasing computing ability makes such a demand possible. One typical field that longs for such demands is featured by image-based tasks, of which a basic sampling unit is naturally a matrix though traditional studies consider sample vectors to simplify computation. However, this simplification will miss some useful structural information, e.g. considering the rows of Xt as independent and identically distributed (i.i.d.) samples will miss the dependence cross rows. Also, recent efforts on big-data analyses eagerly demand statistical approaches for matrix-variate-based data analysis.
×
Advertisement
Another field that demands matrix-variate-based analyses is computational biology or particularly computational genomics. Typically, expression profiles of basic units (e.g. gene, miRNA, lncRNA) are analysed via vector samples (e.g. via rows or columns of expression matrix) (Simon et al. 2003). Advanced studies also examine expression profiles under different conditions (Ji et al. 2009; Persson et al. 2011) and across different time points (Bar-Joseph et al. 2012) and thus demand that sampling units in matrix format or even a high-dimensional array are considered. In a genome-wide association study or exome-sequencing analysis (DePristo et al. 2011; Gibson 2012; Purcell et al. 2007), though a majority of methods is still featured by vector-variate analysis, there are already some efforts made on matrix-variate-based data analysis.
In the rest of this paper, we start at providing a background and review on the related topics and methods, including the following:
Two-sample test and Hotelling statistics.
Logistic regression, Wald test, and Rao’s score.
Discriminative analyses and integrative hypothesis tests (IHT).
Cox model and linear mixed model
Then, we pursuit a threefold purpose as follows: (1) A Kullback-Leibler-divergence-based formulation for developing statistics and discriminative criterion for the case-control studies, based on which existing typical methods are revisited and extended to their matrix-variate counterparts. (2) A bi-linear matrix form, based on which discriminative analysis, logistic regression, Cox model, and linear mixed model are extended into their matrix-variate counterparts. (3) A systematic investigation of the necessity, feasibility, and implementing methods of IHT from the perspective of model-based test (MBT) versus boundary-based test (BBT) in the three levels of space, namely the data sample space (D-space), the statistics space (S-space), and probability space (P-space).
More specifically, the above third one consists of the following:
The complementarity of MBT versus BBT in the D-space, the basic IHT components (modelling, comparison, classification, and assurance), and four types of IHT.
Bayesian Ying Yang (BYY)-harmony-learning-based IHT formulation for coordinately optimising the performances of task A, task B, and task C in the D-space.
The MBT vs BBT perspective in the S-space, especially extensions of the existing efforts on the integration of multiple statistics to the S-space BBT, with the help of dependence decoupling.
A S-space BBT-based extension of univariate one-tail z-test for testing the null of multivariate zero mean, which is then applied to multivariate sample-pairing delta (SPD) test for detecting a collective inclining dominance.
A SPD discriminative analysis that not only improves the multivariate SPD test but also further extends it to matrix-variate ones.
A multivariate bi-test on both the classic null and also a null about test reliability by controlling the testing complexity, including a further development of the Fisher combination.
Advertisement
Finally, we discuss several possible IHT applications for expression-profile-based biomarker finding and exome-sequencing-based joint single-nucleotide variant (SNV) detection.
Hypothesis tests for case-control studies
Most efforts in computational genomics and generally computational biology involve case-control studies. For a case-control study, we are given two populations of vector-variate samples Xω={xt,ω,t=1,⋯,Nω},ω=0,1, where the one with ω=1 is called the case population while the one with ω=0 is called the control population. The task of a hypothesis test is examining a rejection of the following null assumption:
for which a statistics is computed from the samples to test the opposite assumption H1 that there is a significant difference between the two populations.
A typical example is testing whether H0 breaks on two populations of samples from a multivariate Gaussian distribution G(x|c,Σ) with the mean vector c and the covariance matrix Σ, with help from the following Hotelling statistics (Hotelling 1931):
where N=N0+N1, and c1,c0 are the mean vectors of the case and control populations, respectively. Also, the covariance matrix is assumed to be Σ=Σ0=Σ1.
Generally, we evaluate the difference between two populations based on population modelling by a parametric model q(x|θ), that is, firstly modelling each population of samples and then evaluating the overall difference between two resulted models. The performance is measured by the p value that describes the false alarm probability of judging that H0 by Equation (1) significantly breaks. Such efforts are usually referred as model-based tests or sometimes called model comparison or class comparison (Simon et al. 2003).
Another typical example is logistic regression. Rewriting the above two populations of samples into a set of paired samples {xt,ωt},t=1,⋯,N with ωt=1 and ωt=0 indicating the sample xt from the case and control population, respectively. We let ωt be regressed by xt in the following conditional probability:
which cannot be analytically solved due to the nonlinearity of s(r) and are usually handled by a gradient-based iterative algorithm (Hosmer et al. 2013). The test of the null assumption by Equation (1) becomes testing the null assumption:
where w is a subset of θ. It is typically made by either the Wald test or the Score test (Engle 1984), both of which are computed from one or both of the following statistics:
as a testing statistics that has an asymptotic distribution of \({\chi ^{2}_{k}}\), where k is the number of constraints imposed by the null hypothesis. It degenerates to \({\chi ^{2}_{1}}\) when w consists of only one parameter.
This logistic regression examines the difference between two populations via firstly building up a hyperplane boundary and then tests Equation (5) that directly aims at whether the boundary depends on variables in consideration.
Discriminative analyses and integrative tests
Other than directly aiming at the boundary, a different aspect of logistic regression is that we can use p(ωt|xt,θ) by Equation (3) to classify each sample by:
Equivalently, the same result comes from the hyperplane boundary ζt=0 with ζt given in Equation (3) such that samples are classified into its two sides. The outcome is the following decomposition:
That is, the case set X1 is separated into a subset \(X^{(1)}_{1}\) with unchanged labels and a subset \(X^{(0)}_{1}\) of samples that are relabelled as control samples, and similarly, the control set X0 into \(X^{(0)}_{0}\) with unchanged labels and \(X^{(1)}_{0}\) relabelled as case samples.
Actually, seeking a hyperplane boundary is the goal of linear discriminative analyses (LDA). One classic example is the Fisher discriminative analysis (FDA). For separating samples of two populations, the FDA seeks a projection yt=wTxt to map each vector xt into a univariate yt such that:
On the one-dimensional yt, it follows from Equation (2) that \(T^{2}=\frac {N_{0}N_{1}}{N} J_{y}\) and that FDA is equivalent to seeking a direction w along which two populations differ mostly.
On a small size of samples, the resulted w by FDA may suffer the well-known overfitting problem, for which efforts have been made on learning a linear boundary in the literature of machine learning. One classical method is the support vector machine (SVM) (Suykens and Vandewalle 1999; Suykens et al. 2002).
Widely adopted in the studies of pattern classification and machine learning, the performance of discriminative analyses is typically measured by the misclassification rate of Equation (10), featuring the separation or overlap of two populations around the boundary and reflecting the confusing chance incurred by a decision or prediction (sometimes called class prediction (Simon et al. 2003)).
The performance of discriminative analyses may also be measured by T2 that considers the separation of two populations of yt=wTxt. Monotonically varying with T2, the p value may be obtained by a univariate t-test. Here, the performance is measured by only considering the salient difference between two populations along the normal direction of the boundary, instead of considering the overall difference in the entire space as addressed after Equation (2).
Alternatively, see Equation (31) in (Xu 2013a), the performance of discriminative analyses may be also measured by a statistics that jointly considers the separating boundary and its outcome by Equation (10).
Since there are different choices for evaluating the difference between two populations, we are motivated to examine whether they can be integrated for a better evaluation. The name of IHT was previously advocated in (Xu 2013a, 2013b) for a joint consideration of the misclassification rate and the p value about the overall difference. This paper will further proceed along this direction.
Cox regression and linear mixed model
Survival analyses consider the relation of the observed time yt that a subject t passes before some event occurs to one or more covariates in xt that may be associated with yt. The Cox model for survival analysis (Cox and Oakes 1984) describes the hazard ratio as follows:
Again, we can test H0 by Equation (5) with the Wald test by Equation (7) or Rao’s score test by Equation (8), with help getting Δ(w),I(w) still by Equation (6) but with L given by the above partial likelihood L(w).
Actually, the core part yt=wTxt of Equations (3) and (13) is also the core part of the classic multivariate linear regression yt=wTxt+et with w estimated by minimising \(\sum _{t} \textbf {e}_{t}^{2}\).
Denoting y=[ y1,⋯,yN]T, e=[ e1,⋯,eN]T, and X=[x1,⋯,xN]T, we may rewrite yt=wTxt+et into y=Xw+e as a degenerated case of the following linear mixed model (Demidenko 2013) :
where Z is a design matrix and f is a random effect vector. We may use the existing methods to estimate w,K,R (Demidenko 2013) and then test w=0 via the Wald test by Equation (7) or Rao’s score test by Equation (8) but with the likelihood L replaced by:
$$\begin{array}{@{}rcl@{}} L=G(\textbf{y}-X\textbf{w}|\textbf{0}, Z KZ^{T}+ R). \end{array} $$
(16)
Moreover, an N×1 vector y may be further extended to a N×m matrix with one dependent variable extended to m-dependent variables. Accordingly, w,f,e are extended to d×m matrices. As a result, we have:
where F=[f1,⋯,fm], and E=[e1,⋯,em]. One typical case is that f1,⋯,fm are mutually i.i.d. with each fi∼G(fi|0,K). Also, e1,⋯,em are i.i.d. with each ei∼G(ei|0,R).
From inner product to bi-linear form
In many studies of multivariate statistical analysis and machine learning, a basic sampling unit is a vector \(\boldsymbol {x}_{t}=\left [\!{x}_{t}^{(1)},\cdots, {x}_{t}^{(d)}\right ]^{T}\), and the basic computing operation is the inner product wTxt that is linear with respect to the elements of xt and also of w. Though wTxt becomes XW in Equation (17), it actually consists of a set of vector inner products in parallel.
Efforts have been made in (Xu 2013a, 2013b) to extend this inner product to get a matrix-variate discriminative analysis. Considering that a basic sampling unit is a matrix Xt as shown in Figure 1, the inner product is extended into a bi-linear form:
which is quadratic with respect to w(i) and v(j) but still linear with respect to the elements of Xt and is featured by two consecutive layers of inner products. Similarly, we may also have \(\boldsymbol {w}^{T}X_{t}\textbf {v}=\textbf {v}^{T}\textbf {x}_{t}^{w}\) and \(\boldsymbol {x}_{t}^{w}={X_{t}^{T}}\textbf {w}\). We call such a matrix-variate-based basic-computing operation a bi-linear form. This bi-linear form leads us to matrix-variate LDA and factor analyses in (Xu 2013a, 2013b). Also, using matrix normal distribution, the implementations are made by the Bayesian Ying Yang harmony learning (Xu 1995, 2015).
To get further insight, we directly extend the vector inner product into the following matrix format:
which is still linear with respect to the elements of Xt but unable be decomposed into two inner products, where vec[ O] denotes the vectorisation of a matrix O.
Comparing Equations (18) and (19), we observe that the bi-linear form can be regarded as constrained in the following structure :
$$\begin{array}{@{}rcl@{}} o^{(i,j)} = w^{(i)}v^{(j)}, \ or \ O=\textbf{w}\textbf{v}^{T}. \end{array} $$
(20)
That is, the weighting along the rows of Xt is unrelated to one along the columns of Xt. It significantly reduces the number of free parameters of o(i,j) from md into m+d for w(i) and v(j), which is favourable because we usually have a small-size N for a given sample set N. However, it also suffers the limitation of being applicable only to the cases where the dependence across rows of Xt is not related to one along the columns of Xt. To extend such a limitation, further generalisations of bi-linear matrix forms will be proposed in Equation (40).
Methods
KL statistics and matrix-variate tests
Given the case and control samples Xω={xt,ω,t=1,⋯,Nω and ω=0,1} from a parametric family q(x|θ), all the unknown parts of the true value θ∗ are estimated under H0 by Equation (1), e.g. by the maximum likelihood from X0∪X1. Also, we estimate \(\hat \theta \) from X1 and test whether H0 breaks by the following formulation (see Equation (36) in (Xu 2012a)):
with X1 from q(x|θ1) and X0 from q(x|θ0). We estimate θ1 from the case samples X1 and θ0 from the control samples X0 by either the maximum likelihood or other learning principles, and test H0 by the following case-control formula:
which directly measures the discrepancy between the case population and control population and provides a general formulation for model-based tests. In contrast, sKL by Equation (21) indirectly considers the difference of the case population from the pool of both populations under H0.
For the special case that q(x|θ)=G(x|c,Σ), sKL by Equation (21) and sKL by Equation (23) are equivalent with merely a slight difference of a constant scale, resulting in:
It relates to the Hotelling statistics by Equation (2) via \( T^{2} =2\frac {N_{0}N_{1}}{N_{0}+N_{1}}s_{\textit {KL}}\), i.e. the Hotelling statistics is covered as a special case of the general formulation by Equation (21).
The equivalence no longer exists when we consider other examples of q(x|θ1) and q(x|θ0). Because the case population reflects an abnormal situation and thus has a distribution that is quite different from the control population; q(x|θ1) may come from a parametric family that is different from the one of q(x|θ0). For an example, we may consider a Gaussian for the control samples while a mixture of two Gaussians for the case samples.
In addition to testing c0=c1 as considered by the Hotelling statistics, we may use sKL by Equation (23) to develop statistics for other null hypotheses of the type \({\theta ^{s}_{0}}= {\theta ^{s}_{1}}\). For examples, \({\theta ^{s}_{i}}\) could be a covariance Σi.
Generally, we may use sKL by Equation (21) to develop a statistics for testing a general relation given by a vector equation h(θ)=0 that consists of one or several joint equations, for which we estimate θ0 from samples of X0∪X1 subject to the constraint h(θ)=0 and estimate θ1 from only the case samples X1 without the constraint. The above type \({\theta ^{s}_{0}}= {\theta ^{s}_{1}}\) is a special case \(h(\theta)={\theta ^{s}_{0}}- {\theta ^{s}_{1}}=0\). Also, the equality may be extended to several subsets \(\{{\theta ^{s}_{i}}\}\) that are equal to each other, with each \({\theta ^{s}_{i}}\) to be either of the mean vector ci or a covariance Σi. Even the simplest case θs=0, θs⊆θ has been widely studied. For examples, θs could be the variances for the variance analyses or w=0 in Equation (5) for logistic regression and Cox regression.
Not only Equation (21) provides a general formulation of developing a statistics for a composite test, but also a bird view of the existing statistics for further understanding, improvements, and extensions.
Simply with each vector x replaced by a matrix X, we can extend Equations (21) and (23) to consider matrix-variate samples. Without losing generality, we focus on Equation (23) and get:
where a matrix Ω describes the cross-column dependence of the matrix variate X, and a matrix Σ describes the cross-row dependence of X. This matrix distribution is equivalent to a multivariate Gaussian distribution G(vec(X)|vec(C),Σ⊗Ω), where ⊗ denotes the Kronecker product.
With each sample Xt,ω from \(N\left (X|C^{x}_{\omega },\Omega ^{x}_{\omega },\Sigma ^{x}_{\omega }\right)\) under the assumption:
as the matrix-variate counterpart of Equation (24), where parameters are typically estimated by the maximum likelihood principle (Xu, 2015).
Generally, with help of Equation (25), we may also develop statistics for distributions other than matrix normal distributions.
Model-based two-sample tests
The tests for H0 by Equation (22) are featured by comparing the difference between two parametric models q(x|θ1) and q(x|θ0) on the entire domain of x. Its basis is modelling the case population by q(x|θ1) with its parameter θ1 estimated from X1 and modelling the control population by q(x|θ0) with its parameter θ0 estimated from X0. Thus, these tests are called model-based two-sample tests or model-based tests in short wherever there is no confusion caused.
Typically, a statistics s is considered to measure the difference between two models. The bigger the value s is, the larger the difference is. We reject H0 when s takes a large enough value s∗, while the false positive probability of this rejection is called the p value.
Usually, how to get a statistics s from samples is task-dependent. It is typically a function of the first- and second-order statistics that are random variables directly obtained from samples of populations, e.g. see the Hotelling statistics by Equation (2). Equation (23) provides a general perspective of getting such a statistics sKL, covering not only the first- and second-order statistics but also ones beyond.
Actually, Equation (23) can be further generalised. Adding in the priorities α1,α0 for q(x|θ1) and q(x|θ0), we have:
from which we observe how an overall difference is structured from the statistics on individual differences. For KLsum, the role of anti-dispersion difference δα,Σ is cancelled while the position difference δc is averaged. For KLdif, the role of δα,Σ is summed up while the position difference δc is cancelled. In other words, the roles of KLsum and KLdif are complementary. According to the nature of tasks, we may use either of them separately or the both of them jointly.
The performance of examining H0 by Equation (22) is typically evaluated via the p value, which depends on not only how p is approximately estimated but also how well q(x|θ0) models X0 and q(x|θ1) models X1. A poor modelling makes the resulted p unreliable. Thus, the performance evaluation should also consider its corresponding modelling error or generally the likelihood:
The modelling error depends not only on what type of model is used but also on an appropriate model complexity. Using a model with a big model complexity can lead to an over-optimistic result, i.e. suffering an over-fitting problem. To remedy it, we need to consider either an average of modelling errors on training and testing samples (e.g. by cross validation (Stone 1974)) or approximated generalisation error by one of the model-selection criterion (e.g. BIC (Schwarz 1978)).
Jointly, model-based two-sample tests involve two tasks, that is, the first two tasks summarised in Table 1. Task A is a typical topic of machine learning, from which those existing studies can be adopted, while task B is a typical topic of a statistical test, with its corresponding εB being a nonnegative measure that monotonically decreases towards zero as s tends towards a large value.
Table 1
Four Tasks of Integrative Hypothesis Tests
Tasks
Description
Task A (modelling)
estimate θω such that q(x|θω) models the corresponding population of samples, with the performance evaluated by its corresponding εA, e.g., the average error or generalisation error.
Task B (comparison)
develop a statistics s based on the resulted models to test H0 by Equation (22), with the performance evaluated by its corresponding εB that measures the difference between two populations, e.g., the p-value.
Task C (classification)
classify each sample to either ω=1 or 0, with the performance evaluated by its corresponding εC, e.g., either the rate of incorrect classification by Equation (44) or alternatively the corresponding p-value obtained by a test based on a statistics by Equation (47).
Task D (assurance)
test whether a reliable separating boundary exists between the two populations of samples, with the performance evaluated by its corresponding εD.
It is an open challenge to integrate εA and εB into one objective to optimise because of lacking investigations on how to combine them. A preliminary study has been made empirically with the help of the 2D scattering plots of εA versus εB as illustrated in Figure 2. Each scattering point denotes a performance pair (εA,εB), associated with one miRNA on the samples for gene expression. Those points located near the origin (e.g. those in the orange colour) act as the interested candidate points.
×
Matrix-variate discriminative analysis
As addressed around Equation (11), the classic FDA seeks a projection yt=wTxt to maximize Jy. Moreover, it follows from the bi-linear form by Equation (18) that a matrix-variate discriminative analysis is obtained by:
with a m×ms matrix V and a d×ds matrix W. It degenerates back to Equation (18) when ms=1,ds=1. Mapping into one variable yt may lose too much discriminative information. Instead, Equation (35) maps Xt into either of a size-reduced matrix, a column vector, or a row vector according to practical problems, e.g. from not only genomics data in genetic biology but also image or table data in various tasks of big data analyses.
With Xt replaced by Yt, equations from Equations (25) to (29) are directly applicable. If Xt comes from an MND, Yt comes from an MND too. Accordingly, Equation (33) becomes:
Each \(y_{t}^{(k,\ell)}\) above and the bi-linear form by Equation (18) suffer the limitation discussed after Equation (20), which is relaxed with v(j) replaced by \(v^{(j)}_{i}\) or v(j,ℓ) replaced by \(v^{(j,\ell)}_{i}\), i.e. adding another dimension by a subscript i.
Focusing on the former, we extend Equation (20) into:
$$\begin{array}{@{}rcl@{}} o^{(i,j)} = w^{(i)}v^{(j)}_{i}, \cr v^{(j)}_{i} \text{is subject to a constraint, e.g. one of} \cr \left\{\begin{array}{ll} \sum_{j=1}^{d} v^{(j)}_{i}=1, &\text{Choice (a)}, \\ \text{from a Gaussian density}, &\text{Choice (b)}, \\ \text{from a Laplace density}, &\text{Choice (c)}. \end{array}\right. \end{array} $$
This solution does not relate to w, and thus, the job is done after getting w∗ by Equation (34).
Also, we may update V by a gradient-based approach via ∇VJ(w,V). Practically, a regularisation may be added on J(w,v) and J(w,V) via Gaussian priories on w,v, and V. Alternatively, we may make sparse learning via Laplace priories on w,v, and V.
Being a complementary to model-based two-sample tests that considers H0 by Equation (22) from an overall perspective of populations, we may also perform the classification task in Table 1 to evaluate the goodness of the decomposition by Equation (10), measured by another quantity εC, e.g. the following rate of incorrect classification
where ξ could be either of xt and Xt or the corresponding projections yt and Yt. Mapping samples into the projections helps to reduce the dimension of xt and Xt for tackling the overfitting difficulty of task A in Table 1, especially when the size of samples is not large enough. Also, it facilitates visualisation of two populations in a low dimension (especially below 3D dimension) such that classification is made with human interaction.
Boundary-based tests
Actually, the FDA by Equation (11) finds w that defines the normal direction of the best discriminative hyperplane, as shown in Figure 3. In addition to Equation (45), the hyperplane often acts as a separating boundary as follows:
$$\begin{array}{@{}rcl@{}} g(x,\textbf{w})=\textbf{w}^{T}\textbf{x}+w_{0}=\textbf{w}^{T}(\textbf{x}-\mu)=0,\\ \text{by which \(\boldsymbol{x}\) is classified into} \\ \left\{\begin{array}{ll} \text{a case sample}, & if \ g(\textbf{x},\textbf{w})>0, \\ \text{a control sample}, & if \ g(\textbf{x},\textbf{w})\le 0. \end{array} \right. \end{array} $$
(46)
×
That is, it performs task C to get the decomposition by Equation (10) on which we may directly get the measure εC by Equation (44).
Alternatively, testing Equation (1) may be made by the following statistics from Equation (10):
There are also two other choices in Table 2. Choice (1) is a model-based test for task B from the perspective of one-dimensional samples of yt=wTxt. Focusing on a most discriminative direction, this test puts attention only on salient differences. As to be addressed later in Table 3, the test can be made together with testing H0 by Equation (5) such that the rest of the entire sample space is taken into consideration.
Table 2
Two Boundary based tests for Task B
Type
Description
(1)
on the projected samples of yt=wTxt, we use the one dimensional case of Equation (24) or the Welch’s t-test to test Equation (1) merely along the normal direction of the boundary.
(2)
measuring the distances of samples from a separating boundary, we consider \( s_{B}=\frac {\sum _{{\mathbf {x}}\in X^{(1)}_{1}\cup X^{(0)}_{0} } | \frac {{\mathbf {w}}^{T}({\mathbf {x}}-{\mathbf {c}}_0)}{\Vert {\mathbf {w}} \Vert }|{\!~\!}^{q}}{ \sum _{{\mathbf {x}}\in X^{(1)}_{0}\cup X^{(0)}_{1} } | \frac {{\mathbf {w}}^{T}({\mathbf {x}}-{\mathbf {c}}_0) }{\Vert {\mathbf {w}} \Vert }|{\!~\!}^q+\gamma _{B}}, \ q \ge 0. \) with q=2 for the square distance, q=1 for the Euclidean one.
Table 3
Four Types of Integrative Hypothesis Tests
Types
Description
Type-1 (model based IHT)
For Task A, each of two populations is modelled by a parametric model, with εA measured by the negative log-likelihood by Equation (32) or its extension to generalisation error. For Task B, a model based test is made to compare the difference between two parametric models, with εB by the corresponding p-value. For Task C, we get the classification by Equation (45), with εC by Equation (44) or the p-value by a BBT via a statistics obtained from Equation (10).
Type-2 (boundary based IHT)
A separating boundary is modelled by a hyperplane with its normal w, based on which Task D is handled by a boundary existence test by Equation (5) with εD measured by the corresponding p-value. For Task C we get the classification by Equation (46) with εC by Equation (44) or alternatively the corresponding p-value obtained by Equation (47), and for Task B we get the p-value by one of two BBT choices in Table 2.
Type-3 (mixing IHT)
Mix the above two types with two populations and their separating boundary all in parametric models. A basic one uses εA,εB from Type-1 and εC,εD from Type-2. The other uses εC,εD from Type-2 while εA,εB are modified by Equation (58).
Type-4 (Ying-Yang IHT)
Instead of mixing, the parametric models are jointly learned for two populations of samples and their separating boundary. One example is the BYY harmony learning based formulation to be introduced after Equation (60).
Choice (2) in Table 2 provides a statistics for task B on samples without dimension reduction. The statistics sB comes from considering that samples of \(X^{(1)}_{1}, \ X^{(0)}_{0} \) should be distant from the boundary (as illustrated by two blue arrows in Figure 3) while samples of \( X^{(1)}_{0}, \ X^{(0)}_{1}\) should not be far from this boundary (see two red arrows). Actually, sB is a special case of the ones given by Equations (26) and (30) in (Xu 2013a). The only difference is that γB>0 is added here to trade off the contribution from \(X^{(1)}_{0}\cup X^{(0)}_{1}\).
Both two choices in Table 2 are based on the boundary (i.e. either Equation (10) or yt=wTxt) and thus are called boundary-based two-sample tests or BBT in short. Different choices of BBT are also coupled with how w is obtained; see some examples outlined in Table 4.
Table 4
Some choices for obtaining w
Choice
Description
(a)
get w via FDA by Equation (11), as addressed in the previous subsection.
(b)
estimate w by maximizing L by Equation (4), as to be addressed in the next subsection.
(c)
get w as the normal direction of a separating hyperplane by one of machine learning approaches, e.g., support vector machine (SVM) (Cortes and Vapnik 1995; Suykens et al. 2002).
Replacing Equation (11) with the matrix-variate FDA by Equation (33), we get the projection yt=wTXtv column by column along the direction w and row by row along the direction v. With every appearance of x replaced by \(\boldsymbol {x}_{t}^{v}=X_{t}\mathbf {v}\), all the above studies directly apply. Similarly, we may also consider the dual representation \(y_{t}= \mathbf {v}^{T}\mathbf {x}_{t}^{w} \) with \(\boldsymbol {x}_{t}^{w} ={X}_{t}^{T}\mathbf {w}\) to get a linear separating boundary featured by v. It follows from Equations (19) and (20) that w and v jointly form a linear boundary by vec[ O] to separate samples of vec[ Xt].
Furthermore, extension can be made on the generalised bi-linear form via Equation (40) and Equation (41), with each x replaced by \(\boldsymbol {x}_{t}^{v} \) given in Equation (40).
Extensions can be also made on the generalised bi-linear form by Equation (35). Samples of two populations are projected into a dimension-reduced matrix Yt=VTXtW, and then, a matrix-variate Hotelling test can be made by Equation (28) with Xt replaced by Yt and the subscript x replaced by y, where the matrices W,V actually take the roles of the boundary.
Matrix-variate logistic regression
Testing H0 by Equation (5) has been widely studied in the literature of logistic regression. Actually, the role of this w is the same as the one in Equation (46), i.e. a discriminative boundary that separates every sample into either ω=1 or ω=0. Thus, the choices in Table 4 can be cross-utilised for a mutual benefit, e.g. getting w via FDA by Equation (11) is relatively easy to compute and thus provides an initialization for estimating w by Equation (4), while the advantage of Equation (3) over FDA is that dummy or design variables may be taken into consideration for learning w, e.g. we extend ζt=yt+c in Equation (3) into:
where ξt consists of dummy variables. Moreover, random effects may also be added, in a way similar to that of the linear mixed model by Equation (15).
Testing H0 by Equation (5) is typically handled with the Wald test by Equation (7) or Rao’s score test by Equation (8), for which the score vector and the information matrix are given as follows (Pan et al. 2014):
Being different from the BBT addressed in the previous subsection, testing H0 by Equation (5) directly aims at whether a boundary w exists. Such a test is thus named boundary existence test. It is widely known as a test for regression analyses. Also, we may regard it as a two-sample test that is complementary to the BBT choice (1) in Table 2. The two tests jointly cover the entire space of samples.
The boundary existence test actually tackles another essential problem of discriminative analysis, namely, task D in Table 1. Given two populations with a finite sample size, it is not difficult to draw a boundary to separate them if there is no restriction on the complexity of the boundary. However, a boundary with a high complexity will be unreliable to separate new samples that come randomly from the same populations. To be reliable, the boundary should have an appropriate complexity too. It follows from Equation (45) that an optimal separating boundary is related to the models q(x|θ1) and q(x|θ0). In other words, appropriate boundary complexity is related to an appropriate model boundary complexity. Thus, task D and task A in Table 1 are coupled.
Typically, we consider a linear boundary because of its simple complexity. In the literature of pattern recognition (Cortes and Vapnik 1995; Cover 1965) efforts on whether samples of two populations are linearly separable by a hyperplane or a maximum-margin hyperplane can be regarded as examples related to task D in Table 1.
Next, we proceed to consider matrix-variate logistic regression. Putting the case and control samples into a paired set {Xt,ωt},t=1,⋯,N, we extend Equation (3) with the inner product yt=wTxt to be replaced by the bi-linear form by Equation (18) or its extension by Equation (40).
Given V, the above studies directly apply when \(\boldsymbol {x}_{t}^{\textbf {v}}\) in Equation (40) replaces xt in Equations 3, 4, 7, and 8. The task of learning w,V can be made via the matrix-variate FDA by Equations (34) or (42).
Alternatively, we may estimate w,V via the maximum likelihood L by Equation (4) with the advantage of taking the effect of covariates into consideration. With −L written as J(w,V), we get it solved by Equation (37) with w replaced by W, e.g. implemented by the following gradient-based updating (Hosmer et al. 2013):
for the bi-linear form by Equation (18) simply with v replacing w in Equations 6, 7, 8, and 49. Similarly, extension may also be made to test H0: vi=0,∀i.
Moreover, we may also apply Equation (21) to develop a statistics as follows:
with p(ωt|xt,θ) given by Equation (3), where θ∗ is estimated via maximising L by Equation (4) under H0 by Equation (5) and \(\hat \theta \) is estimated via maximising L by Equation (4) without H0.
Similarly, we may get a matrix-variate Cox regression with the inner product wTxt in Equation (13) replaced by the bi-linear form by Equation (18) or its extension Equation (40). Accordingly, we test the H0 by Equation (5) and the H0 by Equation (51), using the Wald test with Equation (7) or Rao’s score by Equation (8) with Δ(w),I(w) computed from Equation (6) but L given by the partial likelihood L(w).
Furthermore, the univariate yt can be extended into a vector or matrix Yt. One typical example is a bi-linear regression of Yt by Equation (35), that is we consider:
which may be again handled by Equation (37) with w replaced by W.
It can be observed that Equation (53) is an extension of Equation (17) with F=0. On the other hand, we may extend Equation (17) into a bi-linear extension as follows:
as a bi-linear mixed model extended from Equation (15).
Integrative hypothesis test
Discriminative analysis and testing of H0 by Equation (1) are made from either a model-based perspective (e.g. performing task A and task B in Table 1) or a boundary-based perspective (e.g. performing task C and task D in Table 1). Moreover, all the four tasks are associated with another problem called feature selection, that is, selecting a number of elements in x to form a subset xf such that one or more of the four tasks achieves a good enough performance.
In the existing efforts, each of four tasks has been studied individually, with each having its strength and limited coverage. However, performances of these tasks are coupled, and thus, a best set of features for one task may not be necessarily the best for the others.
The complementary nature of task B and task C was preliminarily discussed in Section VI in (Xu 2012a), where a model-based test for task B is named as A-test (a test in the observed data domain) and a boundary-based test for task C is named as I-test (a test in the inner representation domain). Under the name of IHT, good performances of task B and task C are demanded jointly (Xu 2013a, 2013b). This paper further extends IHT to include task A and task D.
We start at jointly optimising the performances of task B and task C. Its necessity and feasibility are empirically justified, with help of the 2D scattering plots of εB by the p value for measuring the performance of task B and εC by the misclassification rate for measuring the performance of task C. A small εB indicates a big difference between q(x|θ0) and q(x|θ1) from an overall perspective, and a small εC indicates a well classification of samples from a separating boundary perspective. Illustrated in Figure 4 are two examples obtained from one empirical study.
×
As indicated by the blue vertical dashed line in Figure 4, there are many miRNAs that share a same small p value εB but can take different values of misclassification εC in a big range. Also, as indicated by the blue horizontal dashed line in Figure 4, there could be multiple miRNAs that take a same misclassification but take different p values. In other words, though the performance of one task is optimised, the performance of the other can still be poor. Thus, we need to jointly seek the good performances of both the tasks, i.e. IHT is necessary. On the other hand, it is observable from the red dots within the blue circle in Figure 4 that there are indeed a few scattering points with each taking both a small p value εB and a small misclassification εC, i.e. it is also feasible to achieve the goal of IHT too.
Such a 2D plot’s evaluation provides a tool for better joint performances of task B and task C, by which we may interactively observe the configuration of scattering points and locate the candidate points that are nearest to the origin of the coordinate space.
Extensions can be further made to a joint evaluation of the IHT performance with task A and task D also included, such that the strengths of different tests and methods are integrated in a rather systemic way, for which we address four types of IHT in Table 3.
From the model-based perspective, the first type is an extension of the one addressed in Figure 2, with εC added in to get a 3D plots for a joint evaluation of εA, εB, and εC. Instead of Equation (45), we may get εC by some nonparametric classifiers, e.g. the classic kNN classifier and the kernel classifiers (Williams 2003). Moreover, we are unable to handle task D because the boundary involved here does not have an explicit expression to be tested.
From the boundary-based perspective, the second type considers samples jointly by a separating boundary and projected samples, evaluated by εD for the existence of boundary, εC for the misclassification by the boundary, and εB for measuring the difference of two populations either along the normal direction of the boundary or according to the sample deviations from the boundary. Again, we may use a 3D plots for a joint evaluation of εB, εC, and εD. However, it is difficult to handle task A merely based on the boundary.
The type of mix-modelled IHT combines the above two types to avoid the weak points of each type. Two typical examples are listed in Table 3. One picks εA,εB from type (1) and εC,εD from type (2) for a joint evaluation. The other modifies εA,εB by taking the outcome by Equation (10) of the boundary in consideration, with the original estimated θ0 and θ1 replaced by the following maximum likelihood estimation:
Even better, we may estimate each θω by the maximum likelihood on the entire set X of samples but with the likelihood of each sample weighted by its corresponding posteriori p(ω|sample) by Equation (3).
BYY-harmony-learning-based formulation
The 2D plots and 3D plots only provides a preliminary tool for IHT, we need further studies on not only appropriate combinations of multiple p values and misclassification rates but also simultaneous optimisation of multiple measures. For the latter purpose, the mix-modelled IHT in Table 3 is further extended via iteratively learning θ0 and θ1 by Equation (58) to update the models \(q\left (X^{(0)}_{0} | \theta _{0}\right), q\left (X^{(1)}_{1}| \theta _{1}\right)\) and also re-estimating the boundary w, e.g. by a FDA method based on the updated models.
Leaving the task D for a future study, in the sequel, we further understand the task of learning the models from a perspective of learning a Ying machine and the task of learning the boundary from a perspective of learning a Yang machine, which leads to a BYY-harmony-learning-based formulation for IHT.
We start from revisiting Equation (29) from an IHT perspective. From α1q(x|θ1)=q(x|θ)−α0q(x|θ0), we consider the task B by the following measure:
from which we observe that a large KL10 comes from a large L1 that reflects a good modelling of α1q(x|θ1) (i.e. a good performance of task A) and a small confusion error \(e^{c}_{0,1}+e^{c}_{1,0}\) that is closely related to a small misclassification (i.e. a good performance of task C). In other words, three tasks are coordinately optimised.
However, a good modelling on the control samples has not been taken in the consideration of KL10, which may be further improved by considering:
From this KLsum, we need to get θω,ω=0,1 by the ML learning. In other words, KLsum merely takes a role of evaluating the performances of task B and task C, but do not have a port to accommodate samples for estimating θω,ω=0,1. Favourably, such a port is provided in the BYY harmony learning such that task A, task B, and task C are all jointly implemented.
Firstly, proposed in (Xu 1995) and systematically developed in the past two decades, the BYY harmony learning on typical structures leads to new model selection criteria, new techniques for implementing learning regularisation, and developing a class of algorithms that implement automatic model selection during parameter learning. Readers are referred to (Xu 2010, 2012b, 2015) for the latest introduction about the BYY harmony learning.
Briefly, a BYY system consists of a Yang machine and Ying machine corresponding to two types of decomposition, namely, Yang p(R|X)p(X) and Ying q(X|R)q(R), respectively. The data X is regarded as generated from its inner representation R that consists of latent variables Y and parameters θ. The harmony measure is mathematically expressed as follows:
Maximising this H(p||q) makes this Ying Yang pair not only best matched but also have the least complexity. Such an ability can also be further observed from several perspectives (see Section 4.1 in (Xu 2010)).
where p(x) provides a port to accommodate samples \(\{\mathbf {x}_{t}\}^{N}_{t=1}\) via an empirical \(p(\mathbf {x})=\frac {1}{N}\sum _{t} \delta (\mathbf {x}-\mathbf {x}_{t})\) with δ(x) being the Dirac delta, which thus makes it possible to estimate θω,ω=0,1 via maximising H(p||q).
Approximately considering p(x)≈q(x|θ), \(e^{H}_{0,1}+e^{H}_{1,0}\approx e^{c}_{0,1}+e^{c}_{1,0}\), and \({L_{1}^{H}}+{L_{0}^{H}}\approx L_{1} +L_{0}\), we observe that H(p||q) shares a nature similar to KLsum in Equation (59), while a difference is that the modelling part \({L_{1}^{H}}+{L_{0}^{H}}\) is provided with a port p(x) to accommodate samples such that task A can be performed via maximising H(p||q) without a need of separately estimating θω by the ML learning.
For q(x|θ)=G(x|c,Σ), we implement the maximisation of H(p||q) to estimate θω by directly adopting the semi-supervised BYY harmony learning for Gaussian mixture given in (Xu 2015), i.e. its algorithm 9, by which the performances of task A, task B, and task C are coordinated. Moreover, H(p||q) can be extended into its matrix-variate counterpart. Particularly, algorithm 9 in (Xu 2015) can be extended into the algorithm ?? given below for learning \(\alpha _{\omega }N\left (X|C^{x}_{\omega },\Omega ^{x}_{\omega },\Sigma ^{x}_{\omega }\right)\).
During implementation of the above algorithm, not only task A is performed but also task C can be simply handled in the Yang step by checking whether \(\phantom {\dot {i}\!}p_{1| \textbf {x}_{t}} \ge p_{0| \textbf {x}_{t}} \) to classify each sample into the case or control. Also, task B can be made after learning by putting the resulted parameters into sKL=KL10 or sKL=KLsum to get the corresponding p value.
Last but not least, considering semi-supervised learning, we also propose an improved procedure in Table 5 for training, testing, and validating on a small size of samples.
Table 5
Semi-supervised testing and validating
Issues
Description
Issue-1
Estimate the parameters by semi-supervised learning on the training set, from which we get the corresponding p-value p and a classifier. Using this classifier on the training set and the testing set, it follows from Equation (44) that we get \(\varepsilon _{C}^{tr}\) and \(\varepsilon _{C}^{te}\). This is what we traditionally get.
Issue-2
Lump the training samples and testing samples together, and estimate the parameters by semi-supervised learning on the lumped set, we also get the corresponding \(\tilde {p}\), \(\tilde {\varepsilon }_{C}^{tr}\) and \(\tilde {\varepsilon }_{C}^{te}\).
Issue-3
\(\tilde {p}\) is actually more reliable than p because testing samples are used for regularising parameter estimation. This \(\tilde {p}\) is also different from the traditional compounded p-value because the label information of testing samples have not been compounded.
Issue-4
Without using the label information of testing samples, \(\tilde {\varepsilon }_{C}^{te}\) shares the concept same as \(\varepsilon _{C}^{te}\), but is actually more reliable because of regularization.
Issue-5
Merging the training set and testing set to get a big training set and treating the validating set as a new testing set, which actually extends this procedure to improve the validation.
Integrating p values, inferring rejection domain, and S-space boundary-based tests
Each IHT type in Table 3 involves more than one measure, which incurs for the problem about how different measures are jointly evaluated. Though 2D or 3D plots provide a possible joint evaluation, how to appropriately scale each measure is still a challenging issue. In general, we need to integrate multiple measures into a scalar index based on which the joint performance can be evaluated, which relates closely to efforts made on combing multiple classifiers (Xu and Amari 2008; Xu et al. 1992b) and evidence combination (Barnett 2008).
For an IHT task, the final scalar index is typically the p value. When multiple measures are all in the p values, what we encounter becomes the task of p value combination, e.g. by the Fisher combination (Fisher 1948).
In Table 3, εB and εD are already given in p values. But εA is usually measured by a square error or negative log-likelihood, and εC is measured by a misclassification rate. Alternatively, εC may be given in a p value via the statistics in Equation (47). Let s=−εA or generally s=−ε for a monotonic measure ε≥0 that prefers values close to zero, we may get the corresponding p value with help of the permutation method.
However, p value combination has a weak point. Each p value is merely a positive number that indicates the false alarm probability, losing certain useful information already. Under the term meta-analysis (Evangelou and Ioannidis 2013), efforts have been made by transforming p values into multiple Z statistics such that the missing information is added in without or with help of information directly from data (Zaykin 2011).
Actually, the Hotelling T2 statistics by Equation (24) and getting a statistics by Equation (21) may also be regarded as examples that get an integrated statistics sf. Generally, a multivariate hypothesis test may also be regarded as an integration of multiple univariate hypothesis tests.
Typically, an integrated statistics sf=g(s,Ψ)≥0 comes from s=[s(1),⋯,s(d)] such that sf≥0 monotonically increases as the situation differs far from H0, where each s(i) comes from one univariate hypothesis test (e.g. s=c1−c1 in the Hotelling T2 statistics) with a set Ψ of parameters shaping the integration (e.g. the covariance Σ in the Hotelling T2 statistics). The set Ψ is specified without or with help of information obtained directly from input data. A critical value \({\tilde s}_{f}\) is computed from the original pair of the sample set X0,X1. Then, the false alarm probability \(p(s_{f}>{\tilde s}_{f}|H_{0})\) is obtained as the p value, where and hereafter p(·|H0) denotes under the condition that H0 is satisfied.
However, choices for such a sf=g(s,Ψ) are very limited in the existing studies, mostly in a quadratic form such as Hotelling statistics, Rao’s score by Equation (8), and the Wald test by Equation (7). This is equivalent to approximately regarding s(1),⋯,s(d) from a multivariate Gaussian distribution, while other distributions are seldom studied yet.
Instead of seeking an integrated statistics sf, we directly seek the domain \(\Gamma (\boldsymbol {\tilde s})\) of rejecting H0 in the space of s based on a critical vector \(\boldsymbol {\tilde s}\) as follows:
where \(\boldsymbol {\tilde s}_{{X}_{1||0}}=I_{\textit {nf}}(X_{0}||X_{1})\) means that \(\boldsymbol {\tilde s}\) is inferred from the given sample set X0,X1 by an inferring method Inf, and the subscript X1||0 is used as the abbreviation of X1||X0, which will be used whenever its omission will not cause confusion.
Then, test is made by checking the probability that s falls in \(\Gamma (\boldsymbol {\tilde s})\) under H0, that is:
We estimate the p value by a permutation test. That is, we get a new pair of sample sets \(X_{0}^{\pi }, X_{1}^{\pi }\) from X0,X1 by a permutation π that shuffles each label ω of xt,ω and then we obtain:
where #S denotes the cardinality of a set S, the subscript \(X^{\pi }_{1|0}\) is used as the abbreviation of \(X_{0}^{\pi }||X_{1}^{\pi }\), and Π consists of a large enough set of permutations made by either enumeration or random shuffling, including that π=empty denotes the sample pair X0,X1.
Recalling the classic studies of getting an integrated statistics sf, we observe that \({\tilde s}_{f}=g(\boldsymbol {s},\Psi)\) actually define a closed shell or boundary that divides the space of multivariate statistics s (shortly S-space) into two parts, with its inside as the acceptance domain and its outside as the rejection domain \(\Gamma (\boldsymbol {\tilde s})\). For example, the acceptance domain obtained by both the Hotelling statistics and Rao’s score by Equation (8) is a hyper-elliptic volume. We may further extend a hyper-elliptic volume to a bounded volume in another shape. Actually, a bounded acceptance domain corresponds a probabilistic modelling by a single-mode distribution. Thus, the corresponding tests are called S-space model-based tests.
On the other hand, we have also a S-space boundary based test (BBT) as summarised in Table 6. It should not be confused with the BBTs in the space of input data (shortly D-space), as those previously addressed in Tables 2 and 3, as well as in Figure 3. Those are two-sample tests with the boundary for separating two populations in the D-space while the S-space BBTs may correspond to any tests in the D-space.
Table 6
S-space boundary based test (BBT)
Step
Description
(1)
infer \(\tilde {\mathbf {s}}=I_{\textit {nf}}(X_{0}|| X_{1})\) in the multidimensional space of statistics s, where \(\tilde {\mathbf {s}}_{{X}_{1||0}}=I_{\textit {nf}}(X_{0}||X_{1})\) means that \(\tilde {\mathbf {s}}\) is inferred from the given sample set X0,X1 by an inferring method Inf, and the subscript X1||0 is used as the abbreviation of X1||X0, which will be used whenever its omission will not cause confusion.
(2)
use \(\tilde {\mathbf {s}}\) to design an unbounded boundary that divides the space of statistics s into two separated and unbounded half-spaces.
(3)
let the one that does not contain the origin 0 as the rejection domain \(\Gamma (\tilde {\mathbf {s}})\), with the corresponding boundary side named as the R-side. The other one is the acceptance domain.
(4)
tend to reject H0 as s deviates from the R-side of boundary with a nonzero distance. The larger the distance is, the more seriously H0 breaks.
Also, integration can be made by considering the complementarity of S-space BBTs and S-space model-based tests, via combining \(\Gamma (\boldsymbol {\tilde s})\) and the acceptance domains, obtained from not only the above complementary aspects, but also different sources, e.g. a bottom-up source from univariate tests on input data and a top-down source inversely transformed from the p values via a meta-analysis (Evangelou and Ioannidis 2013). Also, based on the resulted \(\Gamma (\boldsymbol {\tilde s})\), an easy computing expression \(s_{f}=g(\boldsymbol {s},\Gamma (\boldsymbol {\tilde s}))\) may be obtained to get an asymptotic distribution \(p(s_{f}|\Gamma (\boldsymbol {\tilde s}))\) for a fast estimation of the p value, see examples given after Equation (70).
S-space BBT for the multivariate zero mean
Testing H0 by Equation (1) for the case-control studies can be formulated into testing whether a multivariate statistics s=[s(1),⋯,s(d)] takes a point far away from the origin of the multidimensional space. One example is a two-sample test that examines the following null:
by the Hotelling T2 statistics. The second example is the Wald testing statistics by Equation (7), and another example will be given in the next subsection.
In the existing studies, such a test is typically made via either the \({\chi ^{2}_{k}}\) statistics or Hotelling’s T2 statistics. Also, Rao’s score by Equation (8) is such a type of statistics. As addressed in the previous subsection, they are all featured by an integrated statistics sf≥0 that monotonically increases as s deviates away from the origin and belong to the S-space model-based tests. Also, all these tests may be regarded as extensions of one typical univariate two-tail test (e.g. by t2 test), that is, a univariate statistics s deviates away from the origin s=0 via the value |s|.
The counterpart of a univariate two-tail test is a univariate one-tail test that examines how far s deviates from (−∞,0], i.e. testing the statement s≤0. When either rejecting s≤0 or rejecting s≥0 happens, we reject H0:s=0. Even when the statement s≤0 is not rejected, there are still chances that H0:s=0 will be rejected.
Typical studies of univariate one-tail tests include the one-tailed t-test and one-tailed z-test. However, we are not clear what are their counterparts in multivariate tests. As addressed above, Hotelling’s T2 test can be regarded as a multivariate counterpart of a two-tailed test.
The S-space BBT given in Table 6 actually provides a road to extend univariate one-tail tests to multivariate ones. Observing univariate one-tail tests from the perspective of S-space BBT, we see that \({ \tilde s}=I_{\textit {nf}}(X_{0}|| X_{1})\) is actually a boundary point that results in:
Given \({ \tilde s}\) and thus \(\Gamma ({ \tilde s})\), any s obtained from the case-control samples under H0 may cause a false alarm if s falls in \(\Gamma ({ \tilde s})\), which happens in a probability \(p(s \in \Gamma ({ \tilde s}) | H_{0})\), i.e. the p value by the inference \({ \tilde s}\). If it is small enough, the statement \(s \notin \Gamma ({ \tilde s}) \) will be rejected, which implies that s=0 or H0 by Equation (1) is rejected.
We further consider a statistics s in the multidimensional space from the perspective of S-space BBT given in Table 6 (2). We start by observing an orthant of the Rd space featured by \(\text {sign}(\boldsymbol {\tilde s}) =\left [\text {sign}\left ({ \tilde s^{(1)}}\right), \dots, \text {sign}\left ({ \tilde s}^{(d)}\right)\right ]^{T}\) and consider one separating boundary, as illustrated in Figure 5A. Such a boundary is equivalent to the following decomposition:
where each \(\Gamma ({ \tilde s}^{(i)})\) is given by Equation (67) for computing \(p\left (\textbf {s}^{(i)}\in \Gamma \left ({ \tilde s}^{(i)}\right) |H_{0}\right)\). This actually provides an example that extends a one-tail univariate hypothesis test to a vector-variate one.
In implementation, it is not easy to get the factorization of \(p(\boldsymbol {s}\in \Gamma (\boldsymbol {\tilde s}) |H_{0})\) by Equation (68). Instead, we approximately consider to remove the second-order dependence by the following decorrelation:
and U is a d×m matrix with its columns consisting of the eigenvectors of Σπ such that Λu=UTΣπU.
Another issue is that only those major components in Equation (68) are useful while some components are not only useless but also disturbing, especially when we consider a limited size of samples. To do so, one may consider that the columns of the matrix U consist of the eigenvectors of Σπ corresponding to the m-largest diagonal elements of Λu. Such an implementation of Equation (69) is typically called principal component analysis (PCA). How to decide an appropriate number of components is a model selection task (Tu and Xu 2011, 2012; Xu 2011). Moreover, one novel direction for this task will be addressed later in thip paper between Equation (91) and Equation (99). Actually, Equation (69) only applies to remove the second-order dependence. One may further consider non-Gaussian factor analysis (NFA) and binary factor analysis (BFA) to remove dependencies among non-Gaussian components (Tu and Xu (2014); Xu (2003, 2009) and also Section 5 in Xu (2012b)).
Simply, we use the notation \(\boldsymbol {\tilde s}=I_{\textit {nf}}(X_{0}|| X_{1})\) to denote a procedure to obtain such major components and then use this \(\boldsymbol {\tilde s}\) to get a separating boundary and its corresponding \(\Gamma (\boldsymbol {\tilde s})\). Illustrated in Figure 5 are three examples as follows:
Choice (a) is illustrated in Figure 5A same as the one in Equation (68) with each \(\Gamma ({ \tilde s}^{(i)})\) given by Equation (67). As illustrated in Figure 5B, each of two other choices is a half space bounded by a plane and on the side away from the origin. Choice (b) is more suitable to the case after using Equation (69) in choice (b). Except for the degenerated cases that the normal direction of the hyperplane becomes in parallel to one of the coordinate axis, choice (b) and choice (c) will approximately describe a certain dependence across the components of s.
After using Equation (69) to make the statistics s become an m-dimensional vector with the second-order dependence removed, we may observe that the scope of \(\Gamma (\boldsymbol {\tilde s})\) becomes narrowed as m reduces. When m=1, the scope of \(\Gamma (\boldsymbol {\tilde s})\) is narrowed to a one-tail test along the axis of only one component.
In implementation, we obtain \(p(\boldsymbol {s}\in \Gamma (\boldsymbol {\tilde s}) |H_{0})\) by Equation (64) via the permutation by Equation (65). Also, choice (b) and choice (c) may be understood from getting an integrated statistics as follows:
Approximately, sw comes from a normal distribution with the mean μw and the variance sw, based on which we can make a one univariate test.
SPD test and SPD discriminative analysis
Proposed in (Xu 2013a), the SPD method firstly examines the delta δ(x,y) by pairing every case sample x∈X1 and every control sample y∈X0 and then summarises such deltas as follows:
where \(c_{\omega }, \sigma ^{2}_{\omega }, \alpha _{\omega }\) is the sample mean, variance, and proportion of the samples in Xω, respectively, and rxy is the mutual correlation between x and y.
The above example can be extended to the case that both x,y are vectors with:
which is equivalent to testing the difference of two sample means. To find the collective inclining structure, we classify δ(x,y) into three groups by x>y,x=y,x<y and get the following decomposition:
with D(X1||0)<0 indicating that there is a collective inclining dominance (i.e. the representations of cases are bigger than the ones of controls), D(X1||0)<0 indicating a reversed dominance, and D(X1||0)=0 indicating no dominance.
Recalling Equation (66), it follows from \({\tilde s}=D(X_{1||0})= c_{1}-c_{0}\) that D(X1||0) is approximated from a normal distribution. Thus, the above collective inclining dominance can be tested by the one-tailed t-test and one-tailed z-test addressed in the previous subsections. We may get the mean \( \mu \left (X_{1||0}^{\pi }\right)\) and the variance \(\sigma ^{2}\left (X_{1||0}^{\pi }\right)\) from \(\left \{ D(X_{1||0}^{\pi }, \pi \in \Pi \right \}\) and then approximately compute the p value by a univariate one-tail z-test.
with each D(i)(X1||0) by Equation (76). The task is detecting whether there is a collective inclining dominance, i.e. whether s deviates far away from the origin such that H0 by Equation (1) breaks. The task can be handled by the S-space BBT in Table 6 as a multivariate extension of a one-tail univariate hypothesis test, following the method introduced from Equations (68) to (71) given previously.
Also, we may consider this multivariate SPD study from a perspective similar to the FDA by Equation (11). When x,y are the d-dimensional vectors, we extend Equation (74) into:
where ρ(u)=[ρ(u(1),⋯,ρ(u(d)]T and ρ(u) is the same as the one in Equation (74). That is, the difference x−y is projected onto a most reasonable direction w. In the simplest case ρ(u)=u, we get δ(x,y)=(x−y)Tw given in Equation (72) and thus leads to sw=wTs in Equation (71) as follows:
Without losing generality, we consider that the components of s are mutually independent, e.g. obtaining a second-order independence by Equation (69). Then, we seek how to choose an appropriate w.
Under H0, we expect that \( \mathbf {s}_{\mathbf {w}}^{\pi }=D_{\mathbf {w}}\left (X_{1||0}^{\pi }\right), \pi \in \Pi \) varies around its mean that is typically zero according to Equation (75), that is, we expect that the following standard deviation of \(\boldsymbol {s}_{\mathbf {w}}^{\pi }\) is minimised:
Also, we expect that sw best preserves discriminative information underlying X1,X0, for which we maximise |sw|. We apply a bootstrapping method to enhance the reliability by maximising:
by which the solution of w=[w(1),…,w(d)]T is reached at one vertex, i.e. w(i) takes either a(i) or b(i). Particularly, when Ω consists of only one pair X1,X0, the above maximisation leads to choice (b) in Equation (70) if we let −a(i)=b(i)=1 and to choice (c) if we let −a(i)=b(i)=|D(i)(X1||0)|.
with its solution given by the eigenvector that corresponds to the largest eigenvalue of \(\Sigma ^{\phi } =\sum _{{\omega } \in \Omega } \mathbf {s}^{\phi }\mathbf {s}^{\phi \ T} \).
Integrating Equations (80) and (81), we consider to maximise ργ(w) with \(\sigma _{\pi }^{\gamma }(\mathbf {w})\) minimised simultaneously or subject to a constraint \( \sigma _{\pi }^{\gamma }(\mathbf {w})\le \text {constant}\).
with its solution given by the eigenvector that corresponds to the largest eigenvalue of \(\Sigma _{\pi }^{-0.5}\Sigma ^{\phi } \Sigma _{\pi }^{-0.5}\).
Furthermore, we proceed to consider that each D(i)(X1||0) in Equation (79) is not a simple difference by Equation (76) but the following 1×2 row vector:
where v=[v(1),v(2)]T and Δx−y is a d×2 matrix with the i-th row being [x(i),−y(i)]. It follows from Equation (72) that the above Equation (87) leads D(i)(X1||0) to:
Given v as fixed, the study from Equations (79) and (84) applies directly for us to get w.
Given w as fixed, \( \mathbf {w}^{T}D_{M}\left (X_{1||0}^{\pi }\right)={D_{c}^{T}}\left (X_{1||0}\right)\) becomes a two-dimensional row vector and, it follows from Equation (89) that we have \(\boldsymbol {s}_{\mathbf {w}} =\mathbf {v}^{T}{D_{c}^{T}}\left (X_{1||0}\right)\) in the same form as Equation (79). With v in the place of w and Dc(X1||0) in the place of s, similarly, the study from Equations (79) and (84) applies directly for us to get v. Generally, we iteratively update v with a fixed w and update w with a fixed v, for a number of circles getting converged. Still, whether such an alternative iterating procedure can converge is an open issue that demands further investigation.
The p values and testing complexity control
Recalling Equation (64) and Table 6, based on a given sample pair X1||0=X0∥X1, we get a statistics vector \(\boldsymbol {\tilde s}_{{X}_{1||0}}=I_{\textit {nf}}(X_{0}|| X_{1})\) and a rejection domain \(\Gamma =\Gamma \left (\boldsymbol {\tilde s}_{{X}_{1||0}}\right)\) by the inferring method Inf. Then, we compute the following false alarm probability:
as the p value. This concept is the same as the one used in the conventional literature where X1||0 and Inf are usually implied but not spelled out.
Being different from those studies considering a univariate statistics, the p value by a multidimensional statistics vector s highly depends on the dimension m of this vector or the complexity of the testing space. Given a limited sample size, the p value by Equation (90) will reduce as the value of m increases, causing a phenomenon similar to the overfitting problem in the studies of machine learning and statistical modelling. In other words, we encounter a ‘dimension curse’ in hypothesis testing too. Therefore, we need to appropriately control the complexity of testing space, i.e. selecting one appropriate m.
Given a criterion J(m), the problem of selecting a best subset is a typical problem of feature selection. Generally, it involves an exhaustive evaluation of all the combinations of m features (i.e. m components of s) and all the possible values of m, which is a NP hard problem. Usually, the branch and bound policy (Narendra and Fukunaga 1977; Somol et al. 2004) and the best first strategy are used to save computing cost (Xu et al. 1988). In this paper, we only consider one simple selection strategy that evaluates the components of s incrementally one by one.
To facilitate it, we perform Equation (69) to make the components of s become decorrelated and start to pick one component that corresponds to the smallest value of a given criterion J(m). Then, we successively add in one component such that J(m) gets a bigger drop further and so on and so forth until no further reduction is caused. Finally, the selected components form the resulted feature set with a size m∗.
For this purpose, using the p value by Equation (90) as J(m) does not work well because of its tendency of reducing as m increases, resulting in one m∗ that is usually much bigger than the appropriate one. Instead, we consider another false alarm probability as follows:
which is obtained on all the possible sets of \(X^{\pi }_{1||0}\) that come under H0 instead of merely on a given pair X1||0.
Though this probability is useless to judge whether X1||0 contains enough information to reject H0, it reflects how the complexity of testing space affects a background portion of the false alarm probability. Actually, it reflects an inverse of the effective volume of the support that the statistics s locates. As m increases, the volume increases exponentially, and thus, p(s∈Γ| Inf,H0) will reduce negative-exponentially. Such an exponentially decreasing tendency is also contained in p(s∈Γ| Inf,X1||0,H0) for the same reason, which affects the accuracy of the estimated p value.
To reduce this background disturbance, we consider Equations (90) and (91) jointly by the following a posteriori version of the p value:
where and hereafter ¬H0 denotes rejecting H0. The denominator aims at cancelling out the disturbing portion in the numerator, such that \(\phantom {\dot {i}\!}{pp}_{{X}_{1||0}}\) provides not only a better estimation of false alarm probability of rejecting H0 but also a better criterion J(m) for selecting a best subset of the components of s and thus inferring one appropriate m∗.
Instead of directly handling the above integral, we get a large set Π of sample pairs \(X_{1}^{\pi }, X_{0}^{\pi }\), with each pair \(X_{1}^{\pi }, X_{0}^{\pi }\) resulted from a permutation of X0 and X1. Using every pair \(X_{1}^{\pi }, X_{0}^{\pi } \) to infer \(I_{\textit {nf}}\left (X^{\pi }_{0}|| X^{\pi }_{1}\right)= \boldsymbol {\tilde s}_{X^{\pi }_{1|0}}\), we get a set of p values as follows:
We observe that the pp value has two factors. One is \(pp^{o}_{{X}_{1||0}}\) that describes the proportion of the pairs of \(X_{1}^{\pi }, X_{0}^{\pi }\) with the corresponding \(p_{X^{\pi }_{1|0}}\le p_{X_{1|0}}\), that is, on each of these pairs we should also reject H0 if we reject H0 on X1||0. In other words, \(pp^{o}_{{X}_{1||0}}\) reflects the information of relative difference contained in PΠ. The other factor \(\phantom {\dot {i}\!}{rp}_{{X}_{1||0}}\) is the ratio of the average false alarm probability per pair over the disturbing background per pair, reflecting the strength of discriminative information contained in PΠ.
In implementation, we may use \( {rp}_{{X}_{1||0}}\phantom {\dot {i}\!}\) to make an initial screening. When \( {rp}_{{X}_{1||0}}>1\phantom {\dot {i}\!}\), inference is nonsense and no further computing should be made. Generally, \( {rp}_{{X}_{1||0}}\phantom {\dot {i}\!}\) will be much smaller than 1, and thus, \(\phantom {\dot {i}\!}{pp}_{{X}_{1||0}}\) will be much smaller, while \(pp^{o}_{{X}_{1||0}}\) provides a worst case upper bound of \(\phantom {\dot {i}\!}{pp}_{{X}_{1||0}}\).
We should observe \({pp}_{{X}_{1||0}}\phantom {\dot {i}\!}\), \(pp^{o}_{{X}_{1||0}}\), and \( {rp}_{{X}_{1||0}}\phantom {\dot {i}\!}\) at not only one same value of m but also an appropriate m∗. In addition to using \({pp}_{{X}_{1||0}}\phantom {\dot {i}\!}\) by Equation (93) as J(m) for making an incremental selection, we may also consider \(pp^{o}_{{X}_{1||0}}\) or \( {rp}_{{X}_{1||0}}\phantom {\dot {i}\!}\) as J(m), resulting in \( m^{\ast }_{o} \) or \(m^{\ast }_{\textit {rp}}\). Also, it follows from some mathematical derivation that we have \( m^{\ast } \ge m^{\ast }_{\textit {rp}} \ge m^{\ast }_{o}\) with \(m^{\ast }_{o}\) being a most conservative lower bound. We will be more confident when all these values are identical or not different too much. Moreover, further insights can be obtained from the following considerations.
On one side, we desire that the exponentially decreasing tendency contained in p(s∈Γ| Inf,X1||0,H0) is removed via the normalisation by p(s∈Γ| Inf,H0) such that \({pp}_{{X}_{1||0}}\phantom {\dot {i}\!}\) in Equation (93) will no longer have such a decreasing tendency. With \(p_{X^{\pi }_{1|0}}=p(\textbf {s}\in \Gamma |\ I_{\textit {nf}}, X_{1||0}^{\pi }, H_{0})\) in Equation (92) replaced by \({pp}_{{X}_{1||0}}\phantom {\dot {i}\!}\), \(pp^{o}_{{X}_{1||0}}\), and \( {rp}_{{X}_{1||0}}\phantom {\dot {i}\!}\), we may turn PΠ into its counterparts Ppp, \(\phantom {\dot {i}\!}P_{pp^{o}}\), and Prp. We compute not only the varying curve for each of \({pp}_{{X}_{1||0}}\phantom {\dot {i}\!}\), \(pp^{o}_{{X}_{1||0}}\), and \( {rp}_{{X}_{1||0}}\phantom {\dot {i}\!}\) as m increases, but also the varying curve of the mean of the elements in each of Ppp, \(\phantom {\dot {i}\!}P_{pp^{o}}\), and Prp as m increases. Then, we compare each curve with its corresponding mean curve and desire that the mean curve is as flat as possible or at least flat around m∗.
On the other side, desiring a flat mean curve is not a sole principle. W also desire that the discriminative information should be kept in each of \({pp}_{{X}_{1||0}}\phantom {\dot {i}\!}\), \(pp^{o}_{{X}_{1||0}}\), and \( {rp}_{{X}_{1||0}}\phantom {\dot {i}\!}\) as much as possible. Observing the factorization \(\phantom {\dot {i}\!}{pp}_{{X}_{1||0}}= pp^{o}_{{X}_{1||0}} {rp}_{{X}_{1||0}}\phantom {\dot {i}\!}\) in Equation (93), the strength of discriminative information is contained in \( {rp}_{{X}_{1||0}}\phantom {\dot {i}\!}\) with an exponentially decreasing tendency that is supposed to be mutually cancelled out by the denominator and the numerator but perhaps not completely, while the discriminative information of relative difference is contained in \(pp^{o}_{{X}_{1||0}}\) and kept unchanged as long as every inequality between \(\phantom {\dot {i}\!}p_{X^{\pi }_{1|0}}\) and \(\phantom {\dot {i}\!}p_{X_{1|0}}\) remains unchanged.
Bi-test, twin p values, and P-space BBT
Putting the above two sides together, we observe that a S-space multivariate test is actually a bi-test that tests H0 together with the following hypothesis:
$$\begin{array}{@{}rcl@{}} I_{0}\ : \text{ the inference is not reliable.} \end{array} $$
(94)
We examine a decision that both H0 and I0 are rejected, featured with two p values.
As addressed after Equation (91), the multivariate statistics s inferred by Inf suffers a systematic bias that will make Inf unreliable. This unreliability varies with the dimension m that takes an important role in Inf. Though corrected by the denominator in Equation (93), there are still some residuals that will not be completely cancelled out, the effect of which still varies with m and reduces the reliability of Inf. The test I0 is formulated for this reliability via controlling an appropriate m∗ and a level of false alarm probability of rejecting I0.
One should notice the difference between testing H0 and testing I0. Testing H0 examines only the input, while testing I0 examines both the input and the performance of testing H0. The inference Inf gets X1||0 as the input and the outcomes \(p_{{X}_{1||0}}, {pp}_{{X}_{1||0}}\phantom {\dot {i}\!}\), \(pp^{o}_{{X}_{1||0}}\), and \( {rp}_{{X}_{1||0}}\phantom {\dot {i}\!}\). Using \(\phantom {\dot {i}\!}o_{{X}_{1||0}}\) to denote anyone of these indices, regarding Inf as reliable on X1||0 actually implies that it should also be regarded as reliable on any pair \(X_{1}^{\pi }, X_{0}^{\pi }\) with the corresponding \(o_{{X}^{\pi }_{1||0}}\) being smaller than \(\phantom {\dot {i}\!}o_{{X}_{1||0}}\). Thus, the false alarm probability of rejecting I0 is computed by \(p\left (o_{{X}^{\pi }_{1||0}}\le o_{{X}_{1||0}}|\neg H_{0}, H_{0}\right)\).
Interestingly, some mathematical derivation shows that letting \(o_{{X}_{1||0}}\phantom {\dot {i}\!}\) to be anyone of \(p_{{X}_{1||0}}, {pp}_{{X}_{1||0}}\phantom {\dot {i}\!}\), \(pp^{o}_{{X}_{1||0}}\), and \( {rp}_{{X}_{1||0}}\phantom {\dot {i}\!}\) will always result in the same false alarm probability as follows:
where and hereafter ¬I0 denotes rejecting I0. Reflecting the discriminative information of relative difference, this p value of rejecting I0 will be not affected as long as the exponentially decreasing tendency will not change every inequality between \(p_{X^{\pi }_{1|0}}\) and \(p_{X_{1|0}}\).
As summarised in Table 7, a multivariate test is actually a bi-test that tests not only the classic null but also a null about the ‘dimension curse’. The rejection of H0 is controlled by a given level α. If \(\phantom {\dot {i}\!}{pp}_{{X}_{1||0}}\ge \alpha \), H0 will not be rejected, and thus, there is no need to test I0. Accordingly, Equation (93) for the p value of rejecting I0 is also modified in Table 7. The bi-test is implemented with or without using stochastic simulation. Table 7 (2) outlines those previously addressed points for implementation via stochastic simulation, while Table 7 (3) outlines an alternative implementation that does not need stochastic simulation.
Table 7
Multivariate Bi-test and Implementations
Type
Description
Test bi-hypotheses and twin p-values
test H0
whether the case-control populations are different, by an inference Inf in the space of multivariate statistics s based on samples from the two populations. H0 is rejected if \(\phantom {\dot {i}\!}{pp}_{{X}_{1||0}}\!\le \! \alpha \), where the false alarm probability \(\phantom {\dot {i}\!}{pp}_{{X}_{1||0}}\,=\,pp^{o}_{{X}_{1||0}} {rp}_{{X}_{1||0}}\) is given by Equation (93) and α is a prespecified level.
test I0
whether the dimension m of s is appropriate such that Inf is reliable, with the p-value given by \( p(\neg I_{0} |\neg H_{0}, H_{0}) = p({pp}_{X^{\pi }_{1|0}}\le \alpha |{pp}_{{X}_{1||0}}< \alpha, H_0), \) which is not smaller than \(pp^{o}_{{X}_{1||0}}\) that reflects the relative discriminative information among \( {pp}_{{X}_{1||0}}\phantom {\dot {i}\!}\) while ignoring \({rp}_{{X}_{1||0}}\phantom {\dot {i}\!}\) that reflects the strength of discriminative information.
Bi-text Implementations
Stochastic way
(a) Make the components of s decorrelated by Equation (69). (b) Get \(p({\mathbf {s}}\in \Gamma |\ I_{\textit {nf}}, X_{1||0}^{\pi }, H_0)=p({\mathbf {s}}\in \Gamma (\tilde {\bf {s}}) |H_0)\) by Equation (68) with \(\Gamma (\tilde {\bf s})\) taking one of three choices in Equation (70), and then getting PΠ by Equation (92). (c) Get \( {pp}_{{X}_{1||0}}, pp^{o}_{{X}_{1||0}}, {rp}_{{X}_{1||0}}\phantom {\dot {i}\!}\) by Equation (93) and then getting p(¬I0|¬H0,H0) as above. (d) Using \(pp^{o}_{{X}_{1||0}}\) or p(¬I0|¬H0,H0) as J(m) to infer an appropriate \(m^{*}_{o}\) and select the \(m^{*}_{o}\) best components of s.
Nonstochastic way
(a) Make the components of s decorrelated by Equation (69). (b) Get {pi} with each p-value pi obatined by an univariate test. (c) Get \( pp^{o}_{{X}_{1||0}}\) by Equation (99) and \({rp}_{{X}_{1||0}}\phantom {\dot {i}\!}\) by Equation (97) with \(p_{X_{1|0}}= \prod _{i} p_{i}\), as well as getting p(¬I0|¬H0,H0) as above. (d) The same as the above (2)(d).
This alternative comes from considering Γ in the choice (a) of Equation (70) by which we have:
where the extra components of s will contribute a constant factor \(\prod _{i > m^{\ast }}\delta _{i}\) that will be cancelled out via the denominator and the numerator in Equation (93).
In such a case, we may get \(\phantom {\dot {i}\!}{rp}_{{X}_{1||0}}={\mu _{\Gamma }}/{\mu }\) without stochastic simulation. First, we have \(\mu =\prod _{i} \mu ^{(i)}\). Each \({ \tilde s}^{(i)}\) under H0 is a random variable with a zero mean, and its corresponding false alarm probability pi is uniformly distributed over [ 0,0.5]. Thus, we get μ(i)=1/4. Second, we also get \(\phantom {\dot {i}\!}\mu _{\Gamma }\le p_{X_{1|0}}\) by letting \( p(\textbf {s}\in \Gamma |\ I_{\textit {nf}}, X_{1||0}^{\pi }, H_{0}) \le p_{X_{1|0}}\) for each π∈ΠΓ to be approximated by its upper bound \(p_{X_{1|0}}\). Putting the two together, we have:
Next, \(pp^{o}_{{X}_{1||0}}\) is also considered without stochastic simulation. From Equation (95), we have \( p\left (\neg I_{0} |\neg H_{0}, H_{0}\right) =pp^{o}_{{X}_{1||0}} =p\left (\prod _{i}p_{i}^{\pi }\le \prod _{i}p_{i}| H_{0}\right) =p\left (\prod _{i}\left (p_{i}^{\pi }\right)^{2}\le \prod _{i}{p_{i}^{2}} |H_{0}\right)\), which leads us to the well-known Fisher combination (Fisher 1948) that makes a test on the false alarm probabilities {pi} by the following combination:
This link provides new insights from two perspectives. On one perspective, we may adopt the Fisher combination approach to estimate \(pp^{o}_{{X}_{1||0}}\) as follows:
Together with Equation (97), we get \(\phantom {\dot {i}\!}{pp}_{{X}_{1||0}}=pp^{o}_{{X}_{1||0}} {rp}_{{X}_{1||0}}\) for testing both H0 and I0 without stochastic simulation via permutation.
On the other perspective, we observe that the traditional p value pF of the Fisher combination is actually the false alarm probability by Equation (95), only reflecting the discriminative information of relative difference between \(\prod _{i} p_{i}^{\pi }\) and \( \prod _{i} p_{i} \) but ignoring the strength of discriminative information contained in \( \prod _{i} p_{i}\). In other words, the Fisher combination just provides a half story for combining {pi}, and we can use the formulation \( {pp}_{{X}_{1||0}}=pp^{o}_{{X}_{1||0}} {rp}_{{X}_{1||0}}\phantom {\dot {i}\!}\) to complete the whole story, using \(pp^{o}_{{X}_{1||0}}\) by Equation (99) and \( {rp}_{{X}_{1||0}}\phantom {\dot {i}\!}\) by Equation (97) with \(p_{X_{1|0}}= \prod _{i} p_{i}\).
The last but not least, one should notice that the p value of testing H0 measures the chances in the S-space (i.e. the space of multivariate statistics), and the p value of testing I0 measures an event in the P-space (i.e. the space of false alarm probabilities). In other words, testing H0 involves a S-space BBT while testing I0 involves a P-space BBT.
Discussions
Gene expression analyses
Gene expression analyses take important roles in bioinformatics and computational genetics. Expression profiles are featured by data matrix with its row indicating expressions of different samples t=1,⋯,N while its column consisting of expressions i=1,⋯,m from different genes, miRNAs, and lncRNAs.
In recent years, developments of data acquisition techniques lead us to consider expression profiles in a cubic or even a high-dimension array. As illustrated in Figure 1, one additional dimension j=1,⋯,d is added for examining expressions under different conditions (Ji et al. 2009; Persson et al. 2011) and across different time points (Bar-Joseph et al. 2012). For examples, current cancer studies consider each basic unit (i.e. a gene, a miRNA, a lncRNA) in paired expressions of normal and tumour tissue from the same individual, that is, each individual is featured at least by a 2×d matrix Xt. Generally, each example Xt is a m×d matrix. In Table 7, we suggest a list of topics for such matrix-variate-based applications.
Typically, the number d of rows (i.e. gene, miRNA, and lnclRNA) is huge, while the sample size n is small. It is difficult and also unreliable to consider the entire m×d matrix as a sample Xt. Instead, we pick k- tuple out of m rows to form a m×k matrix as a sample Xt. Without losing generality, we focus on that each sample Xt is a 2×k matrix from paired expressions of normal tissue and tumour tissue.
In the existing studies, there are two types of efforts for dealing with such format of samples. The first one reduces each sample \(X_{t}=\left [x_{t}^{(i,j)}\right ], i=1,2; j=1,\cdots, k\) into a 1×k matrix \(x_{t}=\left [x_{t}^{(1)}, \cdots, x_{t}^{(k)}\right ]\) for multivariate hypothesis test. A typical reduction is given by:
The second type of efforts is a paired difference test, e.g. a paired t-test when k=1 and paired Hotelling’s square test when k≥2. In Table 8, comparative empirical IHT studies are suggested on the samples of Xt in a 2×k matrix versus in a 1×k vector.
Table 8
Several IHT Applications
IHT types
Applications
Model based and Mix-modelled
(a) Starting at the case that Xt is degenerated into an 1×2 matrix, we conduct the Hotelling test by Equation (2) and its extension KLsum in Equation (31), in comparison with both univariate t-test and a paired t-test. (b) For the general case with k≥2, we conduct a matrix-variate test by Equation (28), as well as by the matrix-variate counterparts of KL1,0, KLsum, and KLsum∗, in comparison with not only the Hotelling’s T-square test on the k dimensional vector xt obtained from Equation (100) but also the paired Hotelling’s T-square test on 2×k matrix-variate samples of Xt. (c) Considering each sample Xt in a 2×k matrix, we investigate the bi-linear discriminant analysis by Equations 18, 33, and 34, in comparison with the classic FDA by Equation (11) on the k dimensional vector xt obtained from Equation (100). (d) Investigate the generalised bi-linear discriminant analysis by Equations 40, 41, and 34. For simplicity, we get vi,i=1,⋯,d by Equation (43) and then solve w by Equation (34). When k becomes too big, we further regularise the learning of vi by minimising \( J_y=\frac {\alpha _{0} \sigma _{0}^{y\ 2} +\alpha _{1}\sigma _{1}^{y\ 2}} {(c^{y}_{0} -c^{y}_{1})^{2}}+ \sum _{i=1}^{m} \gamma _{i} \sum _{j=1}^{d} |u_{i}^{(j)} |^{q}, \) with q=2 for Tikhonov, q=1 for sparse learning.
Boundary based and Mix-modelled
(a) Consider a logistic regression by Equation (3) with w in one of the ways given in Table 4, we test Equation (5) by the Rao’s score Equation (8), and get εC by Equation (44), and εB by the p-value with one of choices in Table 2. (b) Extend all the above studies on Equation (3) with yt=wTxt replaced by the bi-linear form Equation (18). (c) Make a survival analysis via the Cox regression by Equation (13) in comparison with its bi-linear extension by Equations (18) or (40). Again, IHT is made by εD, εC, and εB in a way similar to the above.
BYY harmony
(a) Use either Algorithm 9 in Ref. (Xu, 2015) to get α(i),c(i), Σ(i),i=0,1 or Algorithm ?? to get α(i),C(i),Σ(i),Ω(i),i=0,1 for model based IHT. (b) Perform the procedure given in Table 5 for training, testing and validating in a small size of samples.
Exome sequencing analyses
The case-control study is also a major problem in a genome-wide association study (GWAS) or exome-sequencing analysis (DePristo et al. 2011; Purcell et al. 2007). Typically, a digit score (i.e. 0,1,2) is assigned to a Single Nucleotide Polymorphism (SNP) allele per site and per individual. In such a representation, each sample is univariate when each site is considered one by one. One variate two-sample test takes a fundamental role for detecting a single SNP in the GWAS, e.g. the PLINK provides one widely used tool box (Purcell et al. 2007).
Moreover, each sample can be a vector when multiple sites are considered jointly. Recently, there have been ever-increasing efforts on finding multiple SNVs jointly (DePristo et al. 2011; Derkach et al. 2013; Evangelou and Ioannidis 2013; Lin et al. 2014; Liu et al. 2014; Pan et al. 2014). Also, we may test whether there is a collective inclining dominance of the representations of case samples over the ones of control samples, or vice versa, with help of the method proposed from Equations (79) and (84), as well as the extension introduced around Equations (87) and (89).
Alternatively, we may also consider a SNP allele per site and per individual with δ(x,y) in Equation (75) replaced by one 3×3 matrix \(\Delta (x,y)=\left [ \delta _{x-y}^{(i.j)}\right ]\) with:
It follows from Equation (72) that we get D(X1||0) to be also a 3×3 matrix as a collective measure, which may be further examined to test whether two populations differ significantly. We may visualise the matrix by plotting them in two 2D histograms and observe their configurations.
Conclusions
Statistical analyses for case-control studies have been addressed rather comprehensively. First, a Kullback-Leibler divergence-based formulation is suggested to develop testing statistics and discriminative criterion for the case-control studies. Based on this formulation, typical existing methods are revisited, and their matrix-variate counterparts are developed. Second, a bi-linear matrix form is proposed to obtain the matrix-variate counterparts from existing multivariate statistical analyses, such as discriminative analysis, logistic regression, Cox model, and linear mixed model. Third, the necessity and feasibility of integrative hypothesis tests (IHT) are addressed from the complementarity of BMTs and BBTs in the D-space, together with empirical illustration. Moreover, four basic components of IHT are elaborated, and four IHT types are summarised according to how the components are integrated. Then, in the space of multiple statistics (shortly S-space), the S-space BBT is proposed to perform BBT based on an unbounded boundary, with the help of information-preserved decoupling. Moreover, a S-space BBT-based extension of univariate one-tail z-test is developed to test the null of multivariate zero mean and then applied to a multivariate SPD test for detecting a collective inclining dominance for the case-control studies. Also, a SPD discriminative analysis is proposed with this multivariate SPD test improved and extended to matrix-variate ones. Furthermore, a multivariate bi-test is proposed to test not only the classic null but also a null about inference reliability due to the complexity of testing space, including a new insight on and a further development of the Fisher combination. Finally, possible applications have been suggested for expression-profile-based biomarker finding and exome-sequencing-based joint SNV detection.
Acknowledgements
This work was supported by a CUHK Direct grant project 4055025 and by the Zhi-Yuan chair professorship by Shanghai Jiao Tong University.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Competing interests
The author declares that he has no competing interests.