1 Introduction

This paper is about estimation and inference with the model

$$\begin{aligned} g(\mu _i) = \mathbf{A}_i {\varvec{\theta }} + \sum _j f_j(z_{ji}) + \sum _k m_k(x_{ki}), \quad Y_i \sim \mathrm{EF}(\mu _{i},\phi ), \end{aligned}$$
(1)

where \(Y_i\) is a univariate response variable with mean \(\mu _i\) arising from an exponential family distribution with scale parameter \(\phi \) (or at least with a mean-variance relationship known to within a scale parameter), \(g\) is a known smooth monotonic link function, \(\mathbf A\) is a model matrix with \(i\)th row \(\mathbf{A}_i\), \(\varvec{\theta }\) is a vector of unknown parameters, \(f_j\) is an unknown smooth function of predictor variable \(z_j\) and \(m_k\) is an unknown shape constrained smooth function of predictor variable \(x_k\). The predictors \(z_j\) and \(x_k\) may be vector valued.

It is the shape constraints on the \(m_k\) that differentiate this model from a standard generalized additive model (GAM). In many studies it is natural to assume that the relationship between a response variable and one or more predictors obeys certain shape restrictions. For example, the growth of children over time and dose-response curves in medicine are known to be monotonic. The relationships between daily mortality and air pollution concentration, and between body mass index and incidence of heart disease, are other examples requiring shape restrictions. Unconstrained models may be too flexible and give implausible or uninterpretable results.

Here we develop a general framework for shape constrained generalized additive models (SCAM), covering estimation, smoothness selection and interval estimation, and also allowing for model comparison. The aim is to make SCAMs as routine to use as conventional unconstrained GAMs. To do this we build on the established framework for generalized additive modelling covered, for example, in Wood (2006a). Model smooth terms are represented using spline type penalized basis function expansions; given smoothing parameter values, model coefficients are estimated by maximum penalized likelihood, achieved by an inner iteratively reweighted least squares type algorithm; smoothing parameters are estimated by outer optimization of a GCV or AIC criterion. Interval estimation is achieved by taking a Bayesian view of the smoothing process, and model comparison can be achieved using AIC, for example.

This paper supplies the novel components required to make this basic strategy work, namely

  1. We propose shape constrained P-splines (SCOP-splines), based on a novel mildly non-linear extension of the P-splines of Eilers and Marx (1996), with novel discrete penalties. These allow a variety of shape constraints for one- and multi-dimensional smooths. From a computational viewpoint, they ensure that the penalized likelihood and the GCV/AIC scores are smooth with respect to the model coefficients and smoothing parameters, allowing the development of efficient and stable model estimation methods.

  2. We develop stable computational schemes for estimating the model coefficients and smoothing parameters, able to deal with the ill-conditioning that can affect even unconstrained GAM fits (Wood 2004, 2008), while retaining computational efficiency. The extra non-linearity induced by the use of SCOP-splines does not allow the unconstrained GAM methods to be re-used or simply modified; substantially new algorithms are required instead.

  3. We provide simulation free approximate Bayesian confidence intervals for the SCOP-spline model components in this setting.

The bulk of this paper concentrates on these new developments, covering standard results on unconstrained GAMs only tersely. We refer the reader to Wood (2006a) for a more complete coverage of this background. Technical details and extensive comparative testing are provided in online supplementary material.

To understand the motivation for our approach, note that it is not difficult to construct shape constrained spline-like smoothers by subjecting the spline coefficients to linear inequality constraints (Ramsay 1988; Wood 1994; Zhang 2004; Kelly and Rice 1990; Meyer 2012). However, this approach leads to methodological problems in estimating the smoothing parameters of the spline. The use of linear inequality constraints makes it difficult to optimize standard smoothness selection criteria, such as AIC and GCV, with respect to multiple smoothing parameters. The difficulty arises because the derivatives of these criteria change discontinuously as constraints enter or leave the active set. This leads to failure of the derivative based optimization schemes that are essential for efficient computation when there are many smoothing parameters to optimize. SCOP-splines circumvent this problem.

Other procedures based on B-splines were proposed by He and Shi (1998), Bollaerts et al. (2006), Rousson (2008), and Wang and Meyer (2011). Meyer (2012) presented a cone projection method for estimating penalized B-splines with monotonicity or convexity constraints and proposed a GCV based test for checking the shape constrained assumptions. Monotonic regression within the Bayesian framework has been considered by Lang and Brezger (2004), Holmes and Heard (2003), Dunson and Neelon (2003), and Dunson (2005). In spite of their diversity, these existing approaches also lack the ability to compute smoothing parameters efficiently in a multiple-smooth context. In addition, to our knowledge, except for the bivariate constrained P-spline introduced by Bollaerts et al. (2006), multi-dimensional smooths under shape constraints on either all or a selection of the covariates have not yet been presented in the literature.

The remainder of the paper is structured as follows. The next section introduces SCOP-splines. Section 3.1 shows how SCAMs can be represented for estimation. A penalized likelihood maximization method for SCAM coefficient estimation is discussed in Sect. 3.2. Section 3.3 investigates the selection of multiple smoothing parameters. Interval estimation of the component smooth functions of the model is considered in Sect. 3.4. A simulation study is presented in Sect. 4 while Sect. 5 demonstrates applications of SCAM to two epidemiological examples.

2 SCOP-splines

2.1 B-spline background

In the smoothing literature B-splines are a common choice for the basis functions because of their smooth interpolation property, flexibility, and local support. The properties of B-splines are thoroughly discussed in De Boor (1978). Eilers and Marx (1996) combined B-spline basis functions with discrete penalties on the basis coefficients to produce the popular ‘P-spline’ smoothers. Li and Ruppert (2008) established the corresponding asymptotic theory: the rate of convergence of the penalized spline to a smooth function depends on the order of the difference penalty, but not on the degree of the B-spline basis or the number of knots, provided that the number of knots grows with the number of data and the function is twice continuously differentiable. Ruppert (2002) and Li and Ruppert (2008) showed that the choice of the basis dimension is not critical but should be above some minimal level which depends on the spline degree. Asymptotic properties of P-splines were also studied in Kauermann et al. (2009) and Claeskens et al. (2009). Here we propose to build on the P-spline idea to produce SCOP-splines.

2.2 One-dimensional case

The basic idea is most easily introduced by considering the construction of a monotonically increasing smooth, \(m\), using a B-spline basis. Specifically let

$$\begin{aligned} m(x) = \sum _{j=1}^q \gamma _j B_j (x), \end{aligned}$$

where \(q\) is the number of basis functions, the \(B_j\) are B-spline basis functions of at least second order for representing smooth functions over the interval \([a,b]\), based on equally spaced knots, and the \(\gamma _j\) are the spline coefficients.

It is well known that a sufficient condition for \(m^\prime (x)\ge 0\) over \([a,b]\) is that \(\gamma _{j} \ge \gamma _{j-1}~\forall ~j\) (see Supplementary material, S.1, for details). In the case of quadratic splines this condition is necessary. It is easy to see that this condition could be imposed by re-parameterizing, so that

$$\begin{aligned} {\varvec{\gamma }} = {\varvec{\varSigma }} \tilde{{\varvec{\beta }}}, \end{aligned}$$

where \({\varvec{\beta }}= \left[ \beta _1, \beta _2, \ldots , \beta _{q}\right] ^{\mathrm{T}}\) and \(\tilde{{\varvec{\beta }}} = \left[ \beta _1,\exp (\beta _2), \ldots , \exp (\beta _q)\right] ^{\mathrm{T}}\), while \(\varSigma _{ij} = 0\) if \(i<j\) and \(\varSigma _{ij}=1\) if \(i \ge j\).

So if \(\mathbf{m} = [m(x_1),m(x_2), \ldots ,m(x_n)]^{\mathrm{T}}\) is the vector of \(m\) values at the observed points \(x_i\), and \(\mathbf{X}\) is the matrix such that \(X_{ij} = B_j(x_i)\), then we have

$$\begin{aligned} \mathbf{m} = \mathbf{X}{\varvec{\varSigma }} \tilde{\varvec{\beta }}. \end{aligned}$$
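To make the construction concrete, the following R sketch (illustrative only, and not the scam package internals; the basis dimension, knot range and coefficient values are arbitrary choices) builds \({\varvec{\varSigma }}\) and checks that the resulting smooth is monotone:

```r
## A minimal sketch of the re-parameterization above: build Sigma, map the
## working parameters beta to non-decreasing spline coefficients
## gamma = Sigma %*% beta_tilde, and evaluate m(x).
library(splines)

q   <- 10                                  # number of B-spline basis functions
ord <- 4                                   # cubic B-splines (order 4)
x   <- seq(0, 1, length.out = 200)
h   <- 1 / (q - ord + 1)                   # equal knot spacing over [0, 1]
knots <- seq(-(ord - 1) * h, by = h, length.out = q + ord)
X   <- splineDesign(knots, x, ord = ord)   # n x q model matrix, X[i, j] = B_j(x_i)

Sigma <- matrix(0, q, q)
Sigma[lower.tri(Sigma, diag = TRUE)] <- 1  # Sigma_ij = 1 for i >= j, 0 otherwise

beta       <- rnorm(q)                     # unconstrained working parameters
beta_tilde <- c(beta[1], exp(beta[-1]))    # first element free, the rest positive
gamma      <- drop(Sigma %*% beta_tilde)   # gamma_j is non-decreasing in j
m          <- drop(X %*% gamma)            # = X %*% Sigma %*% beta_tilde

all(diff(gamma) >= 0)                      # TRUE: coefficients satisfy the condition
all(diff(m) >= -1e-12)                     # the evaluated smooth is non-decreasing
```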

2.2.1 Smoothing

In a smoothing context we would also like to have a penalty on \(m(x)\) which can be used to control its ‘wiggliness’. Eilers and Marx (1996) introduced the notion of directly penalizing differences in the basis coefficients of a B-spline basis, which is used with a relatively large \(q\) to avoid underfitting. We can adapt this idea here. For \(j>1\) our \(\beta _j\) are log differences in the \(\gamma _j\). We therefore propose penalizing the squared differences between adjacent \(\beta _j\), starting from \(\beta _2\), using the penalty \(\Vert \mathbf{D}{\varvec{\beta }}\Vert ^2\) where \(\mathbf D\) is the \((q-2)\times q\) matrix that is all zero except that \(D_{i,i+1}=-D_{i,i+2}=1\) for \(i=1,\ldots , q-2\). The penalty is zeroed when all the \(\beta _j\) after \(\beta _1\) are equal, so that the \(\gamma _j\) form a uniformly increasing sequence and \(m(x)\) is an increasing straight line (see Fig. 1). As a result our penalty shares with a second order P-spline penalty the basic feature of ‘smoothing towards a straight line’, but in a manner that is computationally convenient for constrained smoothing.
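A short R sketch of the penalty, assuming the same basis dimension \(q\) as in the previous sketch (the test coefficients are illustrative):

```r
## Sketch of the SCOP-spline penalty: D is the (q-2) x q matrix with
## D[i, i+1] = 1 and D[i, i+2] = -1, so ||D beta||^2 penalizes squared
## differences of beta_2, ..., beta_q while leaving beta_1 unpenalized.
q <- 10
D <- matrix(0, q - 2, q)
for (i in 1:(q - 2)) { D[i, i + 1] <- 1; D[i, i + 2] <- -1 }
S <- crossprod(D)                            # penalty matrix t(D) %*% D

## The penalty is zero when beta_2 = ... = beta_q, i.e. when the gamma_j
## increase in equal steps and m(x) is an increasing straight line.
beta_line <- c(0.3, rep(log(0.5), q - 1))    # equal increments exp(beta_j) = 0.5
sum((D %*% beta_line)^2)                     # 0 (up to rounding error)
```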

Fig. 1 Illustration of the SCOP-splines for five values of the smoothing parameter: \(\lambda _{1} = 10^{-4}\) (long dashed curve), \(\lambda _{2} = 0.005\) (short dashed curve), \(\lambda _{3} = 0.01\) (dotted curve), \(\lambda _{4} = 0.1\) (dot-dashed curve), and \(\lambda _{5} = 100\) (two dashed curve). The true curve is represented as a solid line and dots show the simulated data. Twenty five B-spline basis functions of the third order were used

It might be asked whether penalization is necessary at all, given the restrictions imposed by the shape constraints. Figure 2 provides an illustration of what the penalty achieves. Even with the shape constraint, the unpenalized estimated curve shows a good deal of spurious variation that the penalty removes.

Fig. 2 Illustration of the SCOP-splines: un-penalized (long dashed curve, \(\lambda = 0\)), penalized (dotted curve, \(\lambda = 10^{-4}\)), and the true curve (solid line). Despite a monotonicity constraint, the un-penalized curve shows spurious detail that the penalty can remove

2.2.2 Identifiability, basis dimension

If we were interested solely in smoothing one-dimensional Gaussian data then \({\varvec{\beta }}\) would be chosen to minimize

$$\begin{aligned} \Vert \mathbf{y} - \mathbf{X}{\varvec{\varSigma }} \tilde{\varvec{\beta }}\Vert ^2 + \lambda \Vert \mathbf{D}{\varvec{\beta }}\Vert ^2, \end{aligned}$$

where \(\lambda \) is a smoothing parameter controlling the trade-off between smoothness and fidelity to the response data \(\mathbf y\). Here, however, we are interested in the basis and penalty in order to be able to embed the shape constrained smooth \(m(x)\) in a larger model. This requires an additional constraint on \(m(x)\) to achieve identifiability, avoiding confounding with the intercept of the model in which it is embedded. A convenient way to do this is to use centering constraints on the model matrix columns, i.e. the sum of the values of the smooth is set to zero, \(\sum _{i=1}^nm(x_i)=0\), or equivalently \(\mathbf{1}^T\mathbf{X}\varvec{\varSigma }\tilde{\varvec{\beta }}=0\).
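For illustration, the objective above can be minimized with a general-purpose optimizer rather than the Newton method of Sect. 3.2; the sketch below assumes the objects x, X, Sigma, D and q from the earlier sketches, an arbitrary simulated monotone truth, and a fixed value of \(\lambda \):

```r
## Illustrative penalized least squares fit of a single monotone smooth by
## optim (BFGS); this is a sketch only, not the estimation method of Sect. 3.2.
set.seed(1)
f_true <- 3 * plogis(20 * (x - 0.5))            # a monotone increasing truth
y <- f_true + rnorm(length(x), sd = 0.3)

penalized_ss <- function(beta, lambda) {
  beta_tilde <- c(beta[1], exp(beta[-1]))
  fit <- drop(X %*% (Sigma %*% beta_tilde))
  sum((y - fit)^2) + lambda * sum((D %*% beta)^2)
}
opt <- optim(rep(0, q), penalized_ss, lambda = 0.1, method = "BFGS")
beta_hat <- opt$par
m_hat <- drop(X %*% (Sigma %*% c(beta_hat[1], exp(beta_hat[-1]))))  # monotone fit
```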

As with any penalized regression spline approach, the choice of the basis dimension, \(q\), is not crucial, but it should be generous enough to avoid oversmoothing/underfitting (Ruppert 2002; Li and Ruppert 2008). Ruppert (2002) suggested algorithms for basis dimension selection by minimizing GCV over a set of specified values of \(q\), while Kauermann and Opsomer (2011) proposed an equivalent likelihood based scheme.

This simple monotonically increasing smooth can be extended to a variety of shape constrained functions, including decreasing, convex/concave, increasing/decreasing and concave, and increasing/decreasing and convex; the difference between alternative shape constraints lies in the form of the matrices \(\varvec{\varSigma }\) and \(\mathbf D\). Table 1 details eight possibilities, while Supplementary material, S.2, provides the corresponding derivations.

Table 1 Univariate shape constrained smooths

2.3 Multi-dimensional SCOP-splines

Using the concept of tensor product spline bases, it is possible to build up smooths of multiple covariates under monotonicity constraints, where monotonicity may be assumed on either all or a selection of the covariates. In this section the construction of a multivariable smooth, \(m(x_1,x_2,\ldots ,x_p),\) with monotonically increasing constraints along all covariates is first considered, followed by a discussion of monotonicity along only a single direction.

2.3.1 Tensor product basis

Consider \(p\) B-spline bases of dimensions \(q_{1}, q_{2}, \ldots , q_{p}\) for representing marginal smooth functions, each of a single covariate

$$\begin{aligned} f_{1}(x_{1})&= \sum \limits _{k_{1}=1}^{q_{1}}\alpha ^{1}_{k_{1}}B_{k_{1}}(x_{1}), \quad f_{2}(x_{2})= \sum \limits _{k_{2}=1}^{q_{2}}\alpha ^{2}_{k_{2}}B_{k_{2}}(x_{2}), \ldots ,\\ f_{p}(x_{p})&= \sum \limits _{k_{p}=1}^{q_{p}}\alpha ^{p}_{k_{p}}B_{k_{p}}(x_{p}), \end{aligned}$$

where the \(B_{k_{j}}(x_{j}),\) \(j=1,\ldots ,p,\) are B-spline basis functions, and the \(\alpha ^j_{k_{j}}\) are spline coefficients. Then, following Wood (2006a), the multivariate smooth can be represented by allowing the spline coefficients of each marginal smooth to vary smoothly with the next covariate, starting from the first marginal smooth. Denoting \(B_{k_{1}\ldots k_{p}}(x_{1},\ldots ,x_{p})=B_{k_{1}}(x_{1})\cdot \ldots \cdot B_{k_{p}}(x_{p}),\) the smooth of \(p\) covariates may be written as follows

$$\begin{aligned} m(x_{1},\ldots ,x_{p})=\sum \limits _{k_{1}=1}^{q_{1}}\ldots \sum \limits _{k_{p}=1}^{q_{p}}B_{k_{1}\ldots k_{p}} (x_{1},\ldots ,x_{p})\gamma _{k_{1}\ldots k_{p}}, \end{aligned}$$

where \(\gamma _{k_{1}\ldots k_{p}}\) are unknown coefficients.

So if \(\mathbf{X}\) is the matrix whose \(i\)th row is \(\mathbf{X}_{i}=\mathbf{X}_{1i}\otimes \mathbf{X}_{2i}\otimes \cdots \otimes \mathbf{X}_{pi},\) where \(\otimes \) denotes a Kronecker product and \(\mathbf{X}_{ji}\) is the \(i\)th row of the \(j\)th marginal model matrix, and \(\varvec{\gamma }=(\gamma _{11\ldots 1},\ldots ,\gamma _{k_{1}k_{2}\ldots k_{p}},\ldots ,\gamma _{q_{1}q_{2}\ldots q_{p}})^{T},\) then

$$\begin{aligned} \mathbf{m}=\mathbf{X}\varvec{\gamma }. \end{aligned}$$
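A short R sketch of the row-wise Kronecker construction for \(p=2\), with the marginal model matrices X1 and X2 assumed to have been built as in Sect. 2.2:

```r
## Sketch of the tensor product model matrix for p = 2: row i of X is the
## Kronecker product of the i-th rows of the marginal model matrices X1, X2.
row_kronecker <- function(X1, X2) {
  n <- nrow(X1)
  X <- matrix(0, n, ncol(X1) * ncol(X2))
  for (i in 1:n) X[i, ] <- kronecker(X1[i, ], X2[i, ])
  X
}
## usage: X <- row_kronecker(X1, X2); m <- X %*% gamma, with gamma of length q1 * q2
```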

2.3.2 Constraints

By extending the univariate case one can see that a sufficient condition for \({\partial m(x_{1},\ldots ,x_{p})}/{\partial x_{j}} \ge 0\) is \(\gamma _{k_{1}\ldots k_{j}\ldots k_{p}}\ge \gamma _{k_{1}\ldots (k_{j}-1)\ldots k_{p}}.\) To impose these conditions the re-parametrization \({\varvec{\gamma }} = {\varvec{\varSigma }} \tilde{\varvec{\beta }}\) is proposed, where

$$\begin{aligned} \tilde{\varvec{\beta }}= \left[ \beta _{11\ldots 1},\exp (\beta _{11\ldots 2}), \ldots , \exp (\beta _{k_1\ldots k_p}),\ldots ,\exp (\beta _{q_1\ldots q_p})\right] ^{\mathrm{T}}, \end{aligned}$$

and \(\varvec{\varSigma }= \varvec{\varSigma }_{1} \otimes \varvec{\varSigma }_{2} \otimes \cdots \otimes \varvec{\varSigma }_{p}.\) The elements of each \(\varvec{\varSigma }_j\) are the same as for the univariate monotonically increasing smooth (see Table 1). For a multivariate function that is monotonically decreasing in all covariates, \(\varvec{\varSigma }= \left[ \mathbf{1}:\varvec{\varSigma }'_{(,-1)}\right] ,\) where \(\varvec{\varSigma }' = -\varvec{\varSigma }_{1} \otimes \varvec{\varSigma }_{2} \otimes \cdots \otimes \varvec{\varSigma }_{p},\) that is, \(\varvec{\varSigma }\) is the matrix \(\varvec{\varSigma }'\) with its first column replaced by a column of ones.
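The following sketch (with illustrative marginal dimensions) builds \({\varvec{\varSigma }}\) for a bivariate smooth that is increasing in both covariates, and the corresponding matrix for the doubly decreasing case:

```r
## Sketch of the constraint matrix for a bivariate smooth (p = 2) that is
## monotonically increasing in both covariates, and its decreasing counterpart.
make_Sigma <- function(q) {                       # univariate Sigma of Sect. 2.2
  S <- matrix(0, q, q); S[lower.tri(S, diag = TRUE)] <- 1; S
}
q1 <- 5; q2 <- 4
Sigma_incr <- kronecker(make_Sigma(q1), make_Sigma(q2))   # double increasing

Sigma_prime <- -Sigma_incr                        # Sigma' = -Sigma_1 %x% Sigma_2
Sigma_decr  <- cbind(1, Sigma_prime[, -1])        # first column replaced by ones
```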

To satisfy conditions for a monotonically increasing or decreasing smooth with respect to only one covariate the following re-parameterizations are suggested:

  1. For a single monotonically increasing constraint along the \(x_{j}\) direction: let \(\varvec{\varSigma }_{j}\) be defined as previously and let \(\mathbf{I}_{s}\) be an identity matrix of size \(q_{s},\,s\ne j;\) then

    $$\begin{aligned} \varvec{\varSigma }= \mathbf{I}_{1}\otimes \cdots \otimes \varvec{\varSigma }_{j} \otimes \cdots \otimes \mathbf{I}_{p}, \end{aligned}$$

    and \({\varvec{\gamma }} = {\varvec{\varSigma }} \tilde{\varvec{\beta }},\) where \(\tilde{\varvec{\beta }}\) is a vector containing a mixture of un-exponentiated and exponentiated coefficients, with \(\tilde{\beta }_{k_1\ldots k_j\ldots k_p}=\exp (\beta _{k_1\ldots k_j\ldots k_p})\) when \(k_j\ne 1.\)

  2. For a single monotonically decreasing constraint along the \(x_{j}\) direction: the re-parametrization is the same as above except for the matrix \(\varvec{\varSigma }_{j},\) which is as for a univariate smooth with a monotonically decreasing constraint (see Table 1).

By analogy it is not difficult to construct tensor products with monotonicity constraints along any number of covariates.

2.3.3 Penalties

For controlling the level of smoothing, the penalty introduced in Sect. 2 can be extended. For multiple monotonicity the penalties may be written as

$$\begin{aligned} \fancyscript{P}= \lambda _{1}\varvec{\beta }^{T}\mathbf{S}_{1}\varvec{\beta }+ \lambda _{2}\varvec{\beta }^{T}\mathbf{S}_{2}\varvec{\beta }+\cdots +\lambda _{p}\varvec{\beta }^{T}\mathbf{S}_{p}\varvec{\beta }, \end{aligned}$$

where \(\mathbf{S}_{j}=\mathbf{D}_{j}^{\mathrm{T}}\mathbf{D}_{j}\) and \(\mathbf{D}_{j}=\mathbf{I}_{1}\otimes \mathbf{I}_{2}\otimes \cdots \otimes \mathbf{D}_{mj}\otimes \cdots \otimes \mathbf{I}_{p}.\) \(\mathbf{D}_{mj}\) is as \(\mathbf D\) in Table 1 for a monotone smooth. Penalties for single monotonicity along \(x_{j}\) are

$$\begin{aligned} \fancyscript{P}= \lambda _{1}\varvec{\beta }^{T}\tilde{\mathbf{S}}_{1}\varvec{\beta }+\cdots +\lambda _{j}\varvec{\beta }^{T}\mathbf{S}_{j}\varvec{\beta }+\cdots + \lambda _{p}\varvec{\beta }^{T}\tilde{\mathbf{S}}_{p}\varvec{\beta }, \end{aligned}$$

where \(\mathbf{S}_j\) is defined as above. The penalty matrices \(\tilde{\mathbf{S}}_i,\) \(i\ne j,\) in the unconstrained directions can be constructed using the marginal penalty approach described in Wood (2006a). The degree of smoothness in the unconstrained directions can be controlled by the second-order difference penalties applied to the non-exponentiated coefficients, and by the first-order difference penalties for the exponentiated coefficients. As in the univariate case, these penalties keep the parameter estimates close to each other, resulting in similar increments in the coefficients of marginal smooths. When \(\lambda _{j}\rightarrow \infty \) such penalization results in straight lines for marginal curves.
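A sketch of the corresponding penalty matrices for the doubly monotone bivariate case, using the difference matrix of Sect. 2.2.1 (dimensions illustrative):

```r
## Sketch of the penalty matrices for a double monotone tensor smooth (p = 2):
## D_1 = D_m1 %x% I_2 and D_2 = I_1 %x% D_m2, with D_m as in Sect. 2.2.1.
make_D <- function(q) {
  D <- matrix(0, q - 2, q)
  for (i in 1:(q - 2)) { D[i, i + 1] <- 1; D[i, i + 2] <- -1 }
  D
}
q1 <- 5; q2 <- 4
D1 <- kronecker(make_D(q1), diag(q2))      # differences along x1
D2 <- kronecker(diag(q1), make_D(q2))      # differences along x2
S1 <- crossprod(D1)                        # penalty lambda_1 * t(beta) %*% S1 %*% beta
S2 <- crossprod(D2)                        # penalty lambda_2 * t(beta) %*% S2 %*% beta
```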

3 SCAM

3.1 SCAM representation

To represent (1) for computation we now choose basis expansions, penalties and identifiability constraints for all the unconstrained \(f_j\), as described in detail in Wood (2006a), for example. This allows \(\sum _j f_j(z_{ji})\) to be replaced by \(\mathbf{F}_i {\varvec{\gamma }}\), where \(\mathbf F\) is a model matrix determined by the basis functions and the constraints, and \(\varvec{\gamma }\) is a vector of coefficients to be estimated. The penalties on the \(f_j\) are quadratic in \(\varvec{\gamma }\).

Each shape constrained term \(m_k\) is represented by a model matrix of the form \(\mathbf{X}{\varvec{\varSigma }}\) and corresponding coefficient vector. Identifiability constraints are absorbed by the column centering constraints. The model matrices for all the \(m_k\) are then combined so that we can write

$$\begin{aligned} \sum _k m_k(x_{ki}) = \mathbf{M}_i \tilde{\varvec{\beta }}, \end{aligned}$$

where \(\mathbf M\) is a model matrix and \(\tilde{\varvec{\beta }}\) is a vector containing a mixture of model coefficients (\(\beta _i\)) and exponentiated model coefficients \(\left( \exp (\beta _i)\right) \). The penalties in this case are quadratic in the coefficients \({\varvec{\beta }}\) (not in the \(\tilde{\varvec{\beta }}\)).

So (1) becomes

$$\begin{aligned} g(\mu _i) = \mathbf{A}_i {\varvec{\theta }} + \mathbf{F}_i{\varvec{\gamma }} + \mathbf{M}_i \tilde{\varvec{\beta }},\quad Y_i \sim \mathrm{EF}(\mu _i,\phi ). \end{aligned}$$

For fitting purposes we may as well combine the model matrices column-wise into one model matrix \(\mathbf{X}\), and write the model as

$$\begin{aligned} g(\mu _i) = \mathbf{X}_i \tilde{\varvec{\beta }}, \end{aligned}$$
(2)

where \(\tilde{\varvec{\beta }}\) has been enlarged to now contain \(\varvec{\theta }\), \(\varvec{\gamma }\) and the original \(\tilde{\varvec{\beta }}\). Similarly there is a corresponding expanded model coefficient vector \({\varvec{\beta }}\) containing \(\varvec{\theta }\), \(\varvec{\gamma }\) and the original \({\varvec{\beta }}\). The penalties on the terms have the general form \({\varvec{\beta }}^{\mathrm{T}}\mathbf{S}_\lambda {\varvec{\beta }}\) where \(\mathbf{S}_\lambda = \sum _k \lambda _k \mathbf{S}_k\), and the \(\mathbf{S}_k\) are the original penalty matrices expanded with zeros everywhere except for the elements which correspond to the coefficients of the \(k\)th smooth.
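As an illustration of this zero-padding, the following sketch embeds a single term's penalty into the full coefficient space; the offsets and dimensions are hypothetical:

```r
## Sketch of expanding a term-wise penalty matrix with zeros so that it acts
## only on the coefficients of the k-th smooth within the full vector beta.
embed_penalty <- function(S_term, first, p_total) {
  S <- matrix(0, p_total, p_total)
  idx <- first:(first + ncol(S_term) - 1)
  S[idx, idx] <- S_term
  S
}
## e.g. with 3 parametric coefficients followed by a q = 10 SCOP-spline term:
## S_1 <- embed_penalty(crossprod(D), first = 4, p_total = 3 + 10)
## S_lambda <- lambda_1 * S_1 + lambda_2 * S_2 + ...
```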

3.2 SCAM coefficient estimation

Now consider the estimation of \({\varvec{\beta }}\) given values for the smoothing parameters \(\varvec{\lambda }\). The exponential family chosen determines the form of the log likelihood \(l({\varvec{\beta }})\) of the model, and to control the degree of model smoothness we seek to maximize its penalized version

$$\begin{aligned} l_p ({\varvec{\beta }}) = l({\varvec{\beta }}) - {\varvec{\beta }}^{\mathrm{T}}\mathbf{S}_\lambda {\varvec{\beta }}/2. \end{aligned}$$

However, the non-linear dependence of \(\mathbf{X}\tilde{\varvec{\beta }}\) on \({\varvec{\beta }}\) makes this more difficult than in the case of unconstrained GAMs. In particular we found that optimization via Fisher scoring caused convergence problems for some models, and we therefore use a full Newton approach. The special structure of the model means that it is possible to work entirely in terms of a matrix square root of the Hessian of \(l\) when applying Newton’s method, thereby improving the numerical stability of the computations, so we also adopt this refinement. Also, since SCAM is very much within GAM theory, the same convergence issues might arise as in the case of GAM/GLM fitting (Wood 2006a). In particular, the likelihood might not be uni-modal and the process may converge to different estimates depending on the starting values of the fitting process. However, if the initial values are reasonably selected then major convergence issues are unlikely; the first step of the following algorithm supplies such initial values.

Let \(V(\mu )\) be the variance function for the model’s exponential family distribution, and define

$$\begin{aligned} \alpha (\mu _i) = 1 + (y_i - \mu _i) \left\{ \frac{V^\prime (\mu _i)}{V(\mu _i)} + \frac{g^{\prime \prime }(\mu _i)}{g^\prime (\mu _i)} \right\} . \end{aligned}$$

Penalized likelihood maximization is then achieved as follows:

  1. To obtain an initial estimate of \( {\varvec{\beta }}\), minimize \(\Vert g(\mathbf{y}) - \mathbf{X}\tilde{\varvec{\beta }}\Vert ^2 + \tilde{\varvec{\beta }}^{\mathrm{T}}\mathbf{S}_\lambda \tilde{\varvec{\beta }}\) w.r.t. \(\tilde{\varvec{\beta }}\), subject to linear inequality constraints ensuring that \(\tilde{\beta }_j >0 \) whenever \(\tilde{\beta }_j = \exp (\beta _j)\). This is a standard quadratic programming (QP) problem. (If necessary \(\mathbf y\) is adjusted slightly to avoid infinite \(g(\mathbf{y})\).)

  2. Set \(k = 0\) and repeat steps 3–11 to convergence.

  3. Evaluate \(z_i = (y_i - \mu _i) g^\prime (\mu _i)/\alpha (\mu _i)\) and \(w_i = \omega _i \alpha (\mu _i)/ \{V(\mu _i)g^{\prime 2}(\mu _i)\},\) using the current estimate of \(\mu _i\) (the \(\omega _i\) being any prior weights).

  4. Evaluate vectors \(\tilde{\mathbf{w}} = |\mathbf{w}|\) and \(\tilde{\mathbf{z}}\) where \(\tilde{z}_i = \mathrm{sign}(w_i)z_i\).

  5. Evaluate the diagonal matrix \(\mathbf C\) such that \(C_{jj} = 1\) if \(\tilde{\beta }_j = \beta _j\), and \(C_{jj} = \exp (\beta _j)\) otherwise.

  6. Evaluate the diagonal matrix \(\mathbf E\) such that \(E_{jj} = 0\) if \(\tilde{\beta }_j = \beta _j\), and \(E_{jj} =\sum _{i=1}^n w_ig^\prime (\mu _i) [\mathbf{XC}]_{ij}(y_i-\mu _i)/\alpha (\mu _i)\) otherwise.

  7. Let \(\mathbf{I}^-\) be the diagonal matrix such that \(I^{-}_{ii} = 1\) if \(w_i<0\) and \(I^{-}_{ii}=0 \) otherwise.

  8. Letting \(\tilde{\mathbf{W}}\) denote diag\((\tilde{\mathbf{w}})\), form the QR decomposition

    $$\begin{aligned} \left[ \begin{array}{c} \sqrt{\tilde{\mathbf{W}}} \mathbf{X}\mathbf{C} \\ \mathbf{B} \end{array}\right] = \mathbf{QR}, \end{aligned}$$

    where \(\mathbf{B}\) is any matrix square root such that \(\mathbf{B}^{\mathrm{T}}\mathbf{B} = \mathbf{S}_\lambda \).

  9. Letting \(\mathbf{Q}_1\) denote the first \(n\) rows of \(\mathbf Q\), form the symmetric eigen-decomposition

    $$\begin{aligned} \mathbf{Q}_1 ^{\mathrm{T}}\mathbf{I}^- \mathbf{Q}_1 + \mathbf{R}^{-\mathrm T}\mathbf{E} \mathbf{R}^{-1} = \mathbf{U }{\varvec{\varLambda }} \mathbf{U}^{\mathrm{T}}. \end{aligned}$$

  10. Hence define \(\mathbf{P} = \mathbf{R}^{-1} \mathbf{U}(\mathbf{I} - {\varvec{\varLambda }})^{-1/2}\) and \(\mathbf{K} = \mathbf{Q}_1 \mathbf{U} (\mathbf{I} - {\varvec{\varLambda }})^{-1/2}\).

  11. Update the estimate of \({\varvec{\beta }}\) as \({\varvec{\beta }}^{[k+1]} = {\varvec{\beta }}^{[k]} + \mathbf{P K}^{\mathrm{T}}\sqrt{\tilde{\mathbf{W}}} \tilde{\mathbf{z}} - \mathbf{PP}^{\mathrm{T}}\mathbf{S}_\lambda {\varvec{\beta }}^{[k]}\) and increment \(k\).

The algorithm is derived in Appendix 1; it has several similarities to a standard penalized IRLS scheme for penalized GLM estimation. However, the more complicated structure results from the need to use full Newton, rather than Fisher scoring, while at the same time avoiding computation of the full Hessian, which would approximately square the condition number of the update computations.
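To fix ideas, here is a minimal R sketch of one update (steps 3–11) for the simplest possible case: Gaussian response, identity link and unit prior weights, so that \(\alpha (\mu _i)=1\), \(w_i=1\), \(z_i=y_i-\mu _i\) and \(\mathbf{I}^-=\mathbf{0}\). It is illustrative only, is not the scam package implementation, assumes a full column rank model matrix, and omits the refinements discussed below.

```r
## One Newton update for the Gaussian/identity case of the algorithm above.
newton_step <- function(beta, y, X, S_lambda, expon) {
  ## expon: logical vector, TRUE where beta_tilde_j = exp(beta_j)
  beta_tilde <- ifelse(expon, exp(beta), beta)
  mu <- drop(X %*% beta_tilde)                    # identity link
  z  <- y - mu                                    # working residuals (alpha = 1)
  C  <- diag(ifelse(expon, exp(beta), 1))         # step 5: d beta_tilde / d beta
  XC <- X %*% C
  ## step 6: E_jj = sum_i [XC]_ij (y_i - mu_i) for exponentiated coefficients
  E  <- diag(ifelse(expon, colSums(XC * z), 0))
  ## step 8: B is any square root with t(B) %*% B = S_lambda
  es <- eigen(S_lambda, symmetric = TRUE)
  B  <- sqrt(pmax(es$values, 0)) * t(es$vectors)
  qrx  <- qr(rbind(XC, B))                        # here W-tilde is the identity
  Q1   <- qr.Q(qrx)[1:nrow(X), , drop = FALSE]
  Rinv <- backsolve(qr.R(qrx), diag(ncol(X)))
  ## step 9: with I^- = 0 only the R^{-T} E R^{-1} term remains
  eg  <- eigen(t(Rinv) %*% E %*% Rinv, symmetric = TRUE)
  scl <- 1 / sqrt(1 - eg$values)    # needs eigenvalues < 1, else a Fisher step
  P <- Rinv %*% eg$vectors %*% diag(scl)          # step 10
  K <- Q1 %*% eg$vectors %*% diag(scl)
  ## step 11: the update of beta (sqrt(W-tilde) = I, z-tilde = z here)
  beta + drop(P %*% (t(K) %*% z)) - drop(P %*% (t(P) %*% (S_lambda %*% beta)))
}
```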

Two refinements of the basic iteration may be required.

  1. If the Hessian of the log likelihood is indefinite then step 10 will fail, because some \(\varLambda _{ii}\) will exceed 1. In this case a Fisher update step must be substituted, by setting \(\alpha (\mu _i) = 1\).

  2. There is considerable scope for identifiability issues to hamper computation. In common with unconstrained GAMs, flexible SCAMs with highly correlated covariates can display co-linearity problems between model coefficients, which require careful handling in order to ensure numerical stability of the estimation algorithms. An additional issue is that the non-linear constraints mean that parameters can be poorly identified on flat sections of a fitted curve, where \(\beta _j\) is simply ‘very negative’, but the data contain no information on how negative. So steps must be taken to deal with unidentifiable parameters. One approach is to work directly with the QR decomposition to calculate which coefficients are unidentifiable at each iteration and to drop these, but a simpler strategy substitutes a singular value decomposition for the \(\mathbf R\) factor at step 8 if it is rank deficient, so that

    $$\begin{aligned} \mathbf{R}=\varvec{{\fancyscript{U}}} \mathbf{DV}^{\mathrm{T}}. \end{aligned}$$

    Then we set \(\varvec{{\fancyscript{Q}}} = \mathbf{Q}\varvec{{\fancyscript{U}}},\) \(\varvec{{\fancyscript{R}}}=\mathbf{DV}^T,\) \(\mathbf{Q}_1\) becomes the first \(n\) rows of \(\varvec{{\fancyscript{Q}}},\) and everything proceeds as before, except for the inversion of \(\mathbf R\). We now substitute the pseudoinverse \(\mathbf{R}^- = \mathbf{VD}^-\), where the diagonal matrix \(\mathbf{D}^-\) is such that \(D_{jj}^- = 0 \) if the singular value \(D_{jj}\) is ‘too small’, but otherwise \( D^-_{jj} = 1/D_{jj}\). ‘Too small’ is judged relative to the largest singular value \(D_{11}\) multiplied by some power (in the range 0.5 to 1) of the machine precision. If all parameters are numerically identifiable then the pseudo-inverse is just the inverse.

3.3 SCAM smoothing parameter estimation

We propose to estimate the smoothing parameter vector \(\varvec{\lambda }\) by optimizing a prediction error criterion such as AIC (Akaike 1973) or GCV (Craven and Wahba 1979). The model deviance is defined in the standard way as

$$\begin{aligned} D(\hat{\varvec{\beta }}) = 2 \{ l_\mathrm{max} - l(\hat{\varvec{\beta }})\} \phi , \end{aligned}$$

where \(l_\mathrm{max}\) is the saturated log likelihood. When the scale parameter is known we find \(\varvec{\lambda }\) which minimizes \({\fancyscript{V}}_u = D(\hat{\varvec{\beta }}) + 2 \phi \gamma \tau ,\) where \(\tau \) is the effective degrees of freedom (edf) of the model. \(\gamma \) is a parameter that in most cases has the value of \(1,\) but is sometimes increased above 1 to obtain smoother models [see Kim and Gu (2004)]. When the scale parameter is unknown we find \(\varvec{\lambda }\) minimizing the GCV score, \( {\fancyscript{V}}_g = {n D(\hat{\varvec{\beta }})}/{(n - \gamma \tau )^2}. \) For both criteria the dependence on \(\varvec{\lambda }\) is via the dependence of \(\tau \) and \(\hat{\varvec{\beta }}\) on \(\varvec{\lambda }\) [see Hastie and Tibshirani (1990) and Wood (2008) for further details].

The edf can be found, following Meyer and Woodroofe (2000), as

$$\begin{aligned} \tau = \sum _{i=1}^n \frac{\partial \hat{\mu }_i}{\partial y_i} = \mathrm{tr}(\mathbf{KK}^{\mathrm{T}}\mathbf{L}^+), \end{aligned}$$
(3)

where \(\mathbf{L}^+\) is the diagonal matrix such that

$$\begin{aligned} L^+_{ii} = \left\{ \begin{array}{ll} \alpha (\mu _i)^{-1}, &{} \text { if } w_i \ge 0 \\ - \alpha (\mu _i)^{-1}, &{}\text { otherwise}. \end{array} \right. \end{aligned}$$

Details are provided in Appendix 2.
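Continuing the Gaussian identity-link sketch, where all \(w_i\ge 0\) and \(\alpha (\mu _i)=1\) so that \(\mathbf{L}^+\) is the identity, the edf and GCV score can be computed as follows (illustrative only; \(\mathbf K\) is the matrix formed at step 10 of the converged fit):

```r
## Effective degrees of freedom and GCV score for the Gaussian/identity sketch.
edf_and_gcv <- function(K, y, mu, gamma_pen = 1) {
  n   <- length(y)
  tau <- sum(K^2)                  # tr(K %*% t(K)) = sum of squared entries
  D   <- sum((y - mu)^2)           # Gaussian deviance: residual sum of squares
  gcv <- n * D / (n - gamma_pen * tau)^2
  c(edf = tau, gcv = gcv)
}
```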

Optimization of the \({\fancyscript{V}}_*\) w.r.t. \({\varvec{\rho }} = \log ({\varvec{\lambda }})\) can be achieved by a quasi-Newton method. Each trial \(\varvec{\rho }\) vector requires a Sect. 3.2 iteration to find the corresponding \(\hat{\varvec{\beta }}\) so that the criterion can be evaluated. In addition the first derivative vector of \({\fancyscript{V}}_*\) w.r.t. \(\varvec{\rho }\) is required, which in turn requires \(\partial \hat{\varvec{\beta }}/{\partial \varvec{\rho } }\) and \(\partial \tau /\partial {\varvec{\rho }}\).

As demonstrated in Supplementary material, S.3, implicit differentiation can be used to obtain

$$\begin{aligned} \frac{\partial \hat{\varvec{\beta }}}{\partial \rho _k} = - \lambda _k \mathbf{PP}^{\mathrm{T}}\mathbf{S}_k \hat{\varvec{\beta }}. \end{aligned}$$

The derivatives of \(D\) and \(\tau \) then follow, as S.4 (Supplementary material) shows in tedious detail.

3.4 Interval estimation

Having obtained estimates \(\hat{\varvec{\beta }}\) and \(\hat{\varvec{\lambda }}\), we have point estimates for the component smooth functions of the model, but it is usually desirable to obtain interval estimates for these functions as well. To facilitate the computation of such intervals we seek distributional results for the \(\tilde{\varvec{\beta }}\), i.e. for the coefficients on which the estimated functions depend linearly.

Here we adopt the Bayesian approach to interval estimation pioneered in Wahba (1983), but following Silverman’s (1985) formulation. Such intervals are appealing following Nychka’s (1988) analysis showing that they have good frequentist properties by virtue of accounting for both sampling variability and smoothing bias. Specifically, we view the smoothness penalty as equivalent to an improper prior distribution on the model coefficients

$$\begin{aligned} {\varvec{\beta }}\sim N(\mathbf{0}, \mathbf{S}_\lambda ^- /(2 \phi )), \end{aligned}$$

where \(\mathbf{S}_\lambda ^-\) is the Moore–Penrose pseudoinverse of \(\mathbf{S}_\lambda =\sum _k \lambda _k \mathbf{S}_k\). In conjunction with the model likelihood, Bayes theorem then leads to the approximate result

$$\begin{aligned} \tilde{\varvec{\beta }}| \mathbf{y} \sim N(\hat{\tilde{\varvec{\beta }}}, \mathbf{V}_{\tilde{\varvec{\beta }}}), \end{aligned}$$
(4)

where \(\mathbf{V}_{\tilde{\varvec{\beta }}} = \mathbf{C} (\mathbf{C}^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}\mathbf{W}\mathbf{XC} + \mathbf{S}_\lambda )^{-1} \mathbf{C}\phi \), and \(\mathbf W\) is the diagonal matrix of the \(w_i\) calculated with \(\alpha (\mu _i)=1\). Supplementary material, S.5, derives this result. The deviance or Pearson statistic divided by the effective residual degrees of freedom provides an estimate of \(\phi \), if required. To use the result we condition on the smoothing parameter estimates: despite this, the intervals display surprisingly good coverage properties (Marra and Wood 2012 provide a theoretical analysis which partly explains this).
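Continuing the same illustrative Gaussian identity-link sketch (so that \(\mathbf{W}=\mathbf{I}\)), approximate pointwise intervals for the fitted values could be computed as below; the function and argument names are hypothetical:

```r
## Approximate Bayesian covariance for beta_tilde and pointwise intervals for
## the fitted smooth, for the Gaussian/identity sketch (W = I, phi = sigma^2).
ci_sketch <- function(beta_hat, X, S_lambda, expon, phi, level = 0.95) {
  C  <- diag(ifelse(expon, exp(beta_hat), 1))
  XC <- X %*% C
  V  <- C %*% solve(crossprod(XC) + S_lambda) %*% C * phi   # result (4)
  beta_tilde <- ifelse(expon, exp(beta_hat), beta_hat)
  fit <- drop(X %*% beta_tilde)
  se  <- sqrt(rowSums((X %*% V) * X))                       # diag(X V X')
  zq  <- qnorm(1 - (1 - level) / 2)
  cbind(fit = fit, lower = fit - zq * se, upper = fit + zq * se)
}
```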

4 Simulated examples

4.1 Simulations: comparison with alternative methods

In this section performance is compared with an unconstrained GAM and with the QP approach to shape preserving smoothing (Wood 1994), using a simulated example of an additive model with a mixture of monotone and unconstrained smooth terms. All simulation studies and data applications were performed using the R packages "scam", which implements the proposed SCAM approach, and "mgcv" for the GAM and QP implementations. A more extensive simulation study is given in Supplementary material, S.6. In particular, the first subsection of S.6 presents a comparative study with the constrained P-spline regression of Bollaerts et al. (2006), the monotone piecewise quadratic splines of Meyer (2008), and the shape-restricted penalized B-splines of Meyer (2012), on simulated examples of univariate single smooth term models. Since these approaches showed no mean square error advantage over SCAM for the univariate models, direct grid search for multiple optimal smoothing parameters is computationally expensive, and, to the authors’ knowledge, R routines implementing these methods are not freely available, the comparisons for the multivariate and additive examples were performed only with the unconstrained GAM and the QP approach.

The following additive model is considered:

$$\begin{aligned} g(\mu _{i})=m_{1}(x_{1i})+f_{2}(x_{2i}), \quad \mathrm E (Y_{i})=\mu _{i}, \end{aligned}$$
(5)

where \(Y_{i}\) follows a \(\mathrm N (\mu _i,\sigma ^2)\) or \(\mathrm Poi (\mu _i)\) distribution. Figure 3 shows the true functions used for this study. Their analytical expressions are given in Supplementary material, S.4.

Fig. 3 Shape of the functions used for the simulation study

The covariate values, \(x_{1i}\) and \(x_{2i},\) were simulated from uniform distributions on \([-1,3]\) and \([-3,3]\) respectively. For the Gaussian data the values of \(\sigma \) were 0.05, 0.1, and 0.2, which gave signal to noise ratios of about 0.97, 0.88, and 0.65. For the Poisson model the noise level was controlled by multiplying \(g(\mu _i)\) by \(d\), taking values 0.5, 0.7, and 1.2, which resulted in signal to noise ratios of about 0.58, 0.84, and 0.99. For the SCAM implementation a cubic SCOP-spline of dimension 30 was used to represent the first, monotonic, smooth term and a cubic P-spline with \(q=15\) for the second, unconstrained, term. For the unconstrained GAM, P-splines with the same basis dimensions were used for both model components. The models were fitted by penalized likelihood maximization with the smoothing parameters selected using \({\fancyscript{V}}_g\) in the Gaussian case and \({\fancyscript{V}}_u\) in the Poisson case.
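For reference, a fit of model (5) with the packages named above might be specified as in the following hedged sketch; the data frame dat and its column names are hypothetical, and the basis codes ("mpi" for a monotone increasing SCOP-spline, "ps" for an unconstrained P-spline) follow the scam documentation:

```r
## Hypothetical data frame `dat` with columns y, x1 and x2, simulated as in (5).
library(scam)                     # loads mgcv as well
fit_scam <- scam(y ~ s(x1, k = 30, bs = "mpi") + s(x2, k = 15, bs = "ps"),
                 family = gaussian(), data = dat)
fit_gam  <- mgcv::gam(y ~ s(x1, k = 30, bs = "ps") + s(x2, k = 15, bs = "ps"),
                      family = gaussian(), data = dat)
## For the Poisson case, replace family = gaussian() by family = poisson();
## smoothing parameters are then selected by the UBRE/AIC-type criterion.
```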

For implementing the QP approach to the monotonicity constraint, we approximated the necessary and sufficient condition \(f'(x)\ge 0\) via the standard technique (Villalobos and Wahba 1987) of using a fine grid of linear constraints \((f'(x_i^*)\ge 0, i=1,\ldots , n),\) where the \(x_i^*\) are spread evenly through the range of \(x\) (strictly such constraints are necessary, but only sufficient as \(n \rightarrow \infty \); in practice we observed no violations of monotonicity). Cubic regression spline bases were used here, together with the integrated squared second order derivative of the smooth as the penalty. The model fit is obtained by solving the QP problem within a penalized IRLS loop, given \(\varvec{\lambda }\) chosen via GCV/UBRE from an unconstrained model fit. Cubic regression splines tend to have slightly better MSE performance than P-splines (Wood 2006a) and, moreover, the conditions built on finite differences are not only sufficient but also necessary for monotonicity, so this is a challenging test for SCAM. Three hundred replicates were produced for the Gaussian and Poisson distributions at each of the three levels of noise and for two sample sizes, 100 and 200, for each of the three approaches.

The simulation results for the Gaussian data are illustrated in Fig. 4. The results show that SCAM works better than the two alternative methods in terms of MSE performance. Note that GAM performed better than the QP approach in this case, but the difference in MSE between SCAM and GAM is much smaller than in the one-dimensional simulation studies shown in Supplementary material, S.6. It is also noticeable that GAM reconstructed the truth better than the QP method. The explanation may be that there is only one monotonic term, and both GAM and SCAM gave similar fits for the unconstrained term, \(f_{2};\) at lower noise levels GAM may also be able to reconstruct the monotone shape of \(m_{1}\) for some replicates. The results also suggest that SCAM works better than GAM at higher noise levels, which seems natural since at lower noise levels the shapes of the constrained terms can be captured even by the unconstrained GAM. The reduction in performance of the QP approach compared to GAM was due to the smoothing parameters being estimated from the unconstrained fit, which sometimes resulted in less smooth tails of the constrained smooth term than those of the unconstrained GAM. For the Poisson data with sample size \(n=100\) all three methods worked similarly, but with the increase in sample size SCAM outperformed the other two approaches (plots not shown). As in the Gaussian case the unconstrained GAM worked better than QP.

Fig. 4 MSE comparisons between SCAM (mg), GAM (g), and quadratic programming (qp) approaches for the Gaussian distribution for each of three noise levels. The upper panel illustrates the results for \(n=200,\) the lower for \(n=100.\) Boxplots show the distributions of differences in relative MSE between each alternative method and SCAM. 300 replicates were used. Relative MSE was calculated by dividing the MSE value by the average MSE of SCAM for the given case

The simulation studies show that SCAM may have practical advantages over the alternative methods considered. It is computationally slower than the GAM and QP approaches; however, GAM obviously cannot impose monotonicity, and the selection of the smoothing parameters for SCAM is well founded, in contrast to the ad hoc method used with QP of choosing \(\lambda \) from an unconstrained fit and then refitting subject to the constraints. Finally, the practical MSE performance of SCAM seems to be better than that of the alternatives considered here.

4.2 Coverage probabilities

The proposed Bayesian approach to confidence interval construction makes a number of key assumptions: (i) it uses a linear approximation for the exponentiated parameters and, in the case of non-Gaussian models, adopts large sample inference; (ii) the smoothing parameters are treated as fixed. The simulation example of the previous subsection is used to examine how these restrictions affect the performance of the confidence intervals, with realized coverage probability taken as the measure of performance. Supplementary material, S.7, presents two further examples for a more thorough assessment of confidence interval performance.

The simulation study of confidence interval performance was conducted in an analogous manner to Wood (2006b). Samples of sizes \(n=200\) and 500 were generated from (5) for the Gaussian and Poisson distributions; 500 replicates were produced for both distributions at each of three levels of noise and for the two sample sizes. For each replicate the realized coverage proportion was calculated as the proportion of the values of the true functions (at each of the covariate values) falling within the constructed confidence interval. Three confidence levels were considered: 90, 95, and 99 %. An overall mean coverage probability and its standard error were obtained from the 500 ‘across-the-function’ coverage proportions. The results of the study are presented in Fig. 5 for the Gaussian and Poisson models. The realized coverage probabilities are near the corresponding nominal values, and the larger sample size reduces the standard errors as expected. The results for the Poisson models are quite good, with the exception of the first monotone smooth, \(m_1(x_1)\), at the lowest signal strength, which may be explained by the fact that the optimal fit inclines toward a straight line model (Marra and Wood 2012).

Fig. 5 Realized coverage probabilities for confidence intervals from the SCAM simulation study of the first example, for normal and Poisson data for \(n=200\) (top panel) and \(n= 500\) (bottom panel). Three noise levels are used for each smooth term and for the overall model (“all”). The nominal coverage probabilities of 0.90, 0.95, and 0.99, are shown as horizontal dashed lines. ‘\(\circ \)’ indicates the average realized coverage probabilities over 500 replicate data sets. Vertical lines show twice standard error intervals of the mean coverage probabilities

5 Examples

This section presents the application of SCAM to two different data sets. The purpose of the first application is to investigate whether proximity to municipal incinerators in Great Britain is associated with increased risk of stomach cancer (Elliott et al. 1996; Shaddick et al. 2007). It is hypothesized that the risk of cancer is a decreasing function of distance from an incinerator. The second application uses data from the National Morbidity, Mortality, and Air Pollution Study (Peng and Welty 2004). The relationship between daily counts of mortality and short-term changes in air pollution concentrations is investigated. It is assumed that increases in concentrations of ozone, sulphur dioxide, and particulate matter will be associated with adverse health effects.

Incinerator data: Elliott et al. (1996) presented a large-scale study to investigate whether proximity to incinerators is associated with an increased risk of cancer. They analyzed data from 72 municipal solid waste incinerators in Great Britain and investigated the possibility of a decline in risk with distance from sources of pollution for a number of cancers. There was significant evidence for such a decline for stomach cancer, among several others. Data from one of those 72 sources, an incinerator located in the northeast of England, are analyzed using the SCAM approach in this section; for this incinerator Elliott et al. (1996) found a significant result indicating a monotone decreasing risk with distance.

The data are from 44 enumeration districts (census-defined administrative areas), EDs, whose geographical centroids lay within 7.5 km of the incinerator. The response variable, \(Y_{i},\) is the observed number of cases of stomach cancer in each enumeration district. Associated estimates of the expected number of cases, \(E_{i},\) used for risk determination via \(\mathtt{risk}_{i}=Y_{i}/E_{i}\), were calculated for each ED using national rates for the whole of Great Britain, standardized for age and sex. The two covariates are the distance (km), \(\mathtt{dist}_{i},\) from the incinerator and a deprivation score, the Carstairs score, \(\mathtt{cs}_{i}\).

It is assumed that the \(Y_{i}\) are independent Poisson variables, \(Y_{i} \sim \mathrm Poi (\mu _{i}),\) where \(\mu _{i}=\lambda _{i}E_{i}\) is the Poisson mean, \(E_{i}\) is the expected number of cases in area \(i\) and \(\lambda _{i}\) is the relative risk. Shaddick et al. (2007) proposed a model under which the effect of a covariate, e.g., distance, on cancer risk was linear through an exponential function, i.e. \(\lambda _{i}=\exp (\beta _{0}+\beta _{1}\mathtt{dist}_{i}).\) Since the risk of cancer might be expected to decrease with distance from the incinerator, in this paper a smooth monotonically decreasing function, \(m(\mathtt{dist}_{i}),\) is used to model its relationship with distance, \(\lambda _{i}=\exp \left\{ m(\mathtt{dist}_{i})\right\} .\) Hence the model can be represented as follows:

$$\begin{aligned}&\log (\lambda _{i})=m(\mathtt{dist}_{i}) \quad \Rightarrow \quad \log \left( \mu _{i}/E_{i}\right) =m(\mathtt{dist}_{i}) \quad \Rightarrow \\&\log (\mu _{i})=\log (E_{i})+m(\mathtt{dist}_{i}), \end{aligned}$$

which is a single-smooth Poisson regression model under a monotonicity constraint, where \(\log (E_{i})\) is treated as an offset (a variable with a coefficient fixed at \(1\)). Therefore, the SCAM approach can be applied to fit such a model. The Carstairs score is known to be a good predictor of cancer rates (Elliott et al. 1996; Shaddick et al. 2007), so its effect may also be included in the model. The following four models are considered for this application.

Model 1: \(\log \left\{ \mathrm E (Y_{i})\right\} =\log (E_{i})+m_{1}(\mathtt{dist}_{i}),\) \(m'_1(\mathtt{dist}_{i})<0.\) Model 2 is the same as model 1 but with \(m_{2}(\mathtt{cs}_{i})\) as its smooth term instead, with \(m'_{2}(\mathtt{cs}_{i})>0.\) Model 3 combines both smooths, while model 4 uses a bivariate function \(m_{3}(-\mathtt{dist}_{i},\mathtt{cs}_{i})\) subject to a double monotone increasing constraint. The univariate smooth terms were represented by third order SCOP-splines with \(q=15,\) while \(q_1=q_2=6\) were used for the bivariate SCOP-spline.
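As a hedged illustration of how models 1–3 might be specified with the scam package ("mpd" and "mpi" denoting monotone decreasing and increasing SCOP-splines), assuming a hypothetical data frame incin and that offsets and AIC extraction behave as for mgcv gam fits:

```r
## Hypothetical data frame `incin` with columns Y, E, dist and cs.
m1 <- scam(Y ~ s(dist, k = 15, bs = "mpd") + offset(log(E)),
           family = poisson, data = incin)                    # model 1
m2 <- scam(Y ~ s(cs, k = 15, bs = "mpi") + offset(log(E)),
           family = poisson, data = incin)                    # model 2
m3 <- scam(Y ~ s(dist, k = 15, bs = "mpd") + s(cs, k = 15, bs = "mpi") +
             offset(log(E)), family = poisson, data = incin)  # model 3
AIC(m1); AIC(m2); AIC(m3)   # assumes an AIC/logLik method as for mgcv gam fits
```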

Plots for assessing the suitability of model 1 are given in Supplementary material, S.8. For comparison, the first model was also fitted without the constraint. The estimated smooths and risk functions for both methods are illustrated in Fig. 6. The estimate of the cancer risk function was obtained as \(\hat{\mathtt{risk}}_{i}=\hat{\mu }_{i}/E_{i}= \exp \left\{ \hat{m}_{1}(\mathtt{dist}_{i})\right\} .\) Note that the unconstrained GAM resulted in a non-monotone smooth, which supports the SCAM approach. The AIC score allows us to compare models with and without shape constraints: the AIC values were 152.35 for GAM and 150.57 for SCAM, which favours the shape constrained model.

Fig. 6 The estimated smooth and cancer risk function for monotone and unconstrained versions of model 1 (incinerator data). a the estimated smooth of SCAM + \(95\,\%\) confidence interval; b the SCAM estimated risk as a function of distance; c the GAM estimated smooth + \(95\,\%\) confidence interval; d the GAM estimated risk as a function of distance. Points show the observed data. As noted in the text, AIC suggests that the shape constrained model (a, b) is better than the unconstrained version (c, d)

In model 2 the number of cases of stomach cancer is represented by a smooth function of the deprivation score. This function is assumed to be monotonically increasing since it was shown (Elliott et al. 1996) that, in general, people living closer to incinerators tend to be less affluent (low Carstairs score). The AIC value for this model was 155.59, whereas the unconstrained version gave AIC = 156.4, both of which were higher than for the previous model. The other three measures of model performance, \({\fancyscript{V}}_u,\) the adjusted \(r^{2},\) and the deviance explained, also gave slightly worse results than those seen in model 1.

Model 3 incorporates both covariates, dist and cs, assuming additive effects on the log scale. The estimated edf of \(m_{2}(\mathtt{cs})\) was about zero: this smooth term was insignificant in this model, with all its coefficients near zero, which can be explained by the high correlation between the two covariates. Considering a linear effect of the Carstairs score in place of the smooth function \(m_{2},\) as proposed in Shaddick et al. (2007), i.e. \(\log \left\{ \mathrm E (Y_{i})\right\} = \log (E_{i})+m_{1}(\mathtt{dist}_{i})+\beta \mathtt{cs}_{i},\) also resulted in an insignificant estimate of \(\beta .\)

The bivariate function, \(m_{3}(-\mathtt{dist}_{i},\mathtt{cs}_{i}),\) is considered in the last model. A perspective plot of the estimated smooth is shown in Fig. 7. This plot also supports the previous result, that the Carstairs score does not provide any additional information for modelling cancer risk when distance is included in the model: the estimated smooth has almost no increasing trend with respect to the second covariate. The measures of model performance, such as \({\fancyscript{V}}_u\), the adjusted \(r^{2},\) and the percentage of deviance explained, were not as good as for the simple model 1. The equivalent model without shape constraints resulted in an AIC of 157.35, whereas the AIC score for SCAM was 155.4. Hence the model selected as best by AIC is the simple shape constrained model which includes only distance.

Fig. 7 Perspective plot of the estimated bivariate smooth of model 4 (incinerator data)

Air pollution data: The second application investigates the relationship between non-accidental daily mortality and air pollution. The data were from the National Morbidity, Mortality, and Air Pollution Study (Peng and Welty 2004), which contains 5,114 daily measurements on different variables for 108 cities within the United States. A single-city (Chicago) study from these data was examined as an example in Wood (2006a). The response variable was the daily number of deaths in Chicago (death) for the years 1987–1994. Four explanatory variables were considered: average daily temperature (tempd), levels of ozone (o3median), levels of particulate matter (pm10median), and time. Since it might be expected that increased mortality will be associated with increased concentrations of air pollution, modelling with SCAM may prove useful.

Preliminary modelling and examination of the data showed that the mortality rate on a given day could be better predicted if aggregated air pollution levels and aggregated mean temperature were incorporated into the model, rather than the levels of pollution and temperature on the day in question (Wood 2006a). It was proposed that the aggregation should be the sum of each covariate (except time) over the current day and the three preceding days. Hence, the three aggregated predictors are as follows

$$\begin{aligned}&\mathtt{tmp}_{i}=\sum _{j=i-3}^{i}\mathtt{tempd}_{j}, \quad \mathtt{o3}_{i}=\sum _{j=i-3}^{i}\mathtt{o3median}_{j}, \\&\mathtt{pm10}_{i}=\sum _{j=i-3}^{i}\mathtt{pm10median}_{j}. \end{aligned}$$

Assuming that the observed numbers of daily deaths are independent Poisson random variables, the following additive model structure can be considered

Model 1: \( \log \left\{ \mathrm E (\mathtt{death}_{i})\right\} = f_{1}(\mathtt{time}_{i})+m_{2}(\mathtt{pm10}_{i})+ m_{3}(\mathtt{o3}_{i})+f_{4}(\mathtt{tmp}_{i}), \)

where monotonically increasing constraints are assumed on \(m_{2}\) and \(m_{3},\) since increased air pollution levels are expected to be associated with increases in mortality. The plots for assessing the suitability of this model, together with the plots of the smooth estimates, are given in Supplementary material, S.8. This model indicates that, although the effect of the ozone level has only one degree of freedom, it is positive and increasing. The rapid increase in the smooth of aggregated mean temperature can be explained by the four highest daily death rates occurring on four consecutive days of very high temperature, which also experienced high levels of ozone (Wood 2006a).
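A hedged sketch of how model 1 might be specified with the scam package, assuming a hypothetical data frame chicago holding the response and the aggregated covariates defined above, and that unconstrained terms may use standard mgcv bases; the basis dimensions are illustrative:

```r
## Hypothetical data frame `chicago` containing death, time and the aggregated
## covariates pm10, o3 and tmp from the display above.
m_air1 <- scam(death ~ s(time, k = 200, bs = "cr") + s(pm10, bs = "mpi") +
                 s(o3, bs = "mpi") + s(tmp, bs = "ps"),
               family = poisson, data = chicago)
summary(m_air1)
```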

Since the combination of high temperatures together with high levels of ozone might be expected to result in higher mortality, we consider a bivariate smooth of these predictors. The following model is now considered

Model 2: \(\log \left\{ \mathrm E (\mathtt{death}_{i})\right\} = f_{1}(\mathtt{time}_{i})+m_{2}(\mathtt{pm10}_{i}) +m_{3}(\mathtt{o3}_{i},\mathtt{tmp}_{i}), \)

where \(m_{2}(\mathtt{pm10}_{i})\) is a monotone increasing function and \(m_{3}(\mathtt{o3}_{i},\mathtt{tmp}_{i})\) is subject to single monotonicity along the first covariate. The diagnostic plots of this model showed a slight improvement in comparison to the first model (Supplementary material, S.8). The estimates of the univariate smooths and a perspective plot of the estimated bivariate smooth of model 2 are illustrated in Fig. 8. The second model also has a lower \({\fancyscript{V}}_u\) score, which suggests that model 2 is preferable.

Fig. 8 The estimates of the smooth terms of model 2 (air pollution data). A cubic regression spline was used for \(f_1\) with \(q=200,\) SCOP-spline of the third order with \(q=10\) for \(m_2,\) and bivariate SCOP-spline with the marginal basis dimensions \(q_1=q_2=10\) for \(m_3\)

The current approach has been applied to air pollution data for Chicago for demonstration purposes only. It would be of interest to apply the same model to other cities, to see whether the relationship between non-accidental mortality and air pollution can be described by the proposed SCAM in other locations.

6 Discussion

In this paper a framework for generalized additive modelling with a mixture of unconstrained and shape restricted smooth terms, SCAM, has been presented and evaluated on a range of simulated and real data sets. The motivation for this framework is to develop general methods for estimating SCAMs similar to those for a standard unconstrained GAM. SCAMs allow the inclusion of multiple unconstrained and shape constrained smooths of both univariate and multi-dimensional type, with the shape constrained terms represented by the proposed SCOP-splines. It should be mentioned that the shape constraints are imposed via a condition that is sufficient but not necessary for cubic and higher order splines. However, for cubic splines this condition is equivalent to that of Fritsch and Carlson (1980), who showed that the sufficient parameter space constitutes a substantial part of the necessary parameter space (see their Fig. 2, p. 242). Also, the sensitivity analysis of Brezger and Steiner (2008) on an empirical application supports the view that the sufficient condition is not highly restrictive.

Since a major challenge for any flexible regression method is its implementation in a computationally efficient and stable manner, numerically robust algorithms for model estimation have been presented. The main benefit of the procedure is that smoothing parameter selection is incorporated into the SCAM parameter estimation scheme, which also produces interval estimates at no additional cost. The approach has the \(O(nq^2) \) computational cost of standard penalized regression spline based GAM estimation, but typically involves 2–4 times as many \(O(nq^2)\) steps because of the additional non-linearities required for the monotonic terms, and the need to use quasi-Newton in place of full Newton optimization. However, in contrast to the ad hoc methods of choosing the smoothing parameters used in other approaches, smoothing parameter selection for SCAMs is well founded. It should also be mentioned that although the simulation free intervals proposed in this paper show good coverage probabilities, it might be of interest to see whether Bayesian confidence intervals derived from a posterior distribution simulated via MCMC would give better results.