1 Introduction

This paper is about estimation and inference with the model

$$\begin{aligned} g(\mu _i) = \mathbf{A}_i {\varvec{\theta }} + \sum _j f_j(z_{ji}) + \sum _k m_k(x_{ki}), \quad Y_i \sim \mathrm{EF}(\mu _{i},\phi ), \end{aligned}$$
(1)

where \(Y_i\) is a univariate response variable with mean \(\mu _i\) arising from an exponential family distribution with scale parameter \(\phi \) (or at least with a mean-variance relationship known to within a scale parameter), \(g\) is a known smooth monotonic link function, \(\mathbf A\) is a model matrix with \(i\)th row \(\mathbf{A}_i\), \(\varvec{\theta }\) is a vector of unknown parameters, \(f_j\) is an unknown smooth function of predictor variable \(z_j\) and \(m_k\) is an unknown shape constrained smooth function of predictor variable \(x_k\). The predictors \(z_j\) and \(x_k\) may be vector valued.

It is the shape constraints on the \(m_k\) that differentiate this model from a standard generalized additive model (GAM). In many studies it is natural to assume that the relationship between a response variable and one or more predictors obeys certain shape restrictions. For example, the growth of children over time and dose-response curves in medicine are known to be monotonic. The relationships between daily mortality and air pollution concentration, and between body mass index and incidence of heart disease, are other examples requiring shape restrictions. Unconstrained models may be too flexible and give implausible or uninterpretable results.

Here we develop a general framework for shape constrained generalized additive models (SCAM), covering estimation, smoothness selection and interval estimation, and also allowing for model comparison. The aim is to make SCAMs as routine to use as conventional unconstrained GAMs. To do this we build on the established framework for generalized additive modelling covered, for example, in Wood (2006a). Model smooth terms are represented using spline type penalized basis function expansions; given smoothing parameter values, model coefficients are estimated by maximum penalized likelihood, achieved by an inner iteratively reweighted least squares type algorithm; smoothing parameters are estimated by outer optimization of a GCV or AIC criterion. Interval estimation is achieved by taking a Bayesian view of the smoothing process, and model comparison can be achieved using AIC, for example.

This paper supplies the novel components required to make this basic strategy work, namely

  1. We propose shape constrained P-splines (SCOP-splines), based on a novel mildly non-linear extension of the P-splines of Eilers and Marx (1996), with novel discrete penalties. These allow a variety of shape constraints for one- and multi-dimensional smooths. From a computational viewpoint, they ensure that the penalized likelihood and the GCV/AIC scores are smooth with respect to the model coefficients and smoothing parameters, allowing the development of efficient and stable model estimation methods.

  2. We develop stable computational schemes for estimating the model coefficients and smoothing parameters, able to deal with the ill-conditioning that can affect even unconstrained GAM fits (Wood 2004, 2008), while retaining computational efficiency. The extra non-linearity induced by the use of SCOP-splines does not allow the unconstrained GAM methods to be re-used or simply modified; substantially new algorithms are required instead.

  3. We provide simulation free approximate Bayesian confidence intervals for the SCOP-spline model components in this setting.

The bulk of this paper concentrates on these new developments, covering standard results on unconstrained GAMs only tersely. We refer the reader to Wood (2006a) for a more complete coverage of this background. Technical details and extensive comparative testing are provided in online supplementary material.

To understand the motivation for our approach, note that it is not difficult to construct shape constrained spline-like smoothers by subjecting the spline coefficients to linear inequality constraints (Ramsay 1988; Wood 1994; Zhang 2004; Kelly and Rice 1990; Meyer 2012). However, this approach leads to methodological problems in estimating the smoothing parameters of the spline. The use of linear inequality constraints makes it difficult to optimize standard smoothness selection criteria, such as AIC and GCV, with respect to multiple smoothing parameters. The difficulty arises because the derivatives of these criteria change discontinuously as constraints enter or leave the active set. This leads to failure of the derivative based optimization schemes that are essential for efficient computation when there are many smoothing parameters to optimize. SCOP-splines circumvent this problem.

Other procedures based on B-splines were proposed by He and Shi (1998), Bollaerts et al. (2006), Rousson (2008), and Wang and Meyer (2011). Meyer (2012) presented a cone projection method for estimating penalized B-splines with monotonicity or convexity constraints and proposed a GCV based test for checking the shape constrained assumptions. Monotonic regression within the Bayesian framework has been considered by Lang and Brezger (2004), Holmes and Heard (2003), Dunson and Neelon (2003), and Dunson (2005). In spite of their diversity, these existing approaches also lack the ability to compute smoothing parameters efficiently in a multiple-smooth context. In addition, to our knowledge, except for the bivariate constrained P-spline introduced by Bollaerts et al. (2006), multi-dimensional smooths under shape constraints on either all or a selection of the covariates have not yet been presented in the literature.

The remainder of the paper is structured as follows. The next section introduces SCOP-splines. Section 3.1 shows how SCAMs can be represented for estimation. A penalized likelihood maximization method for SCAM coefficient estimation is discussed in Sect. 3.2. Section 3.3 investigates the selection of multiple smoothing parameters. Interval estimation of the component smooth functions of the model is considered in Sect. 3.4. A simulation study is presented in Sect. 4 while Sect. 5 demonstrates applications of SCAM to two epidemiological examples.

2 SCOP-splines

2.1 B-spline background

In the smoothing literature B-splines are a common choice for the basis functions because of their smooth interpolation property, flexibility, and local support. The properties of B-splines are thoroughly discussed in De Boor (1978). Eilers and Marx (1996) combined B-spline basis functions with discrete penalties on the basis coefficients to produce the popular ‘P-spline’ smoothers. Li and Ruppert (2008) established the corresponding asymptotic theory: the rate of convergence of the penalized spline to a smooth function depends on the order of the difference penalty, but not on the degree of the B-spline basis or the number of knots, provided that the number of knots grows with the number of data and the function is twice continuously differentiable. Ruppert (2002) and Li and Ruppert (2008) showed that the choice of the basis dimension is not critical but should be above some minimal level which depends on the spline degree. Asymptotic properties of P-splines were also studied in Kauermann et al. (2009) and Claeskens et al. (2009). Here we propose to build on the P-spline idea to produce SCOP-splines.

2.2 One-dimensional case

The basic idea is most easily introduced by considering the construction of a monotonically increasing smooth, \(m\), using a B-spline basis. Specifically let

$$\begin{aligned} m(x) = \sum _{j=1}^q \gamma _j B_j (x), \end{aligned}$$

where \(q\) is the number of basis functions, the \(B_j\) are B-spline basis functions of at least second order for representing smooth functions over the interval \([a,b]\), based on equally spaced knots, and the \(\gamma _j\) are the spline coefficients.

It is well known that a sufficient condition for \(m^\prime (x)\ge 0\) over \([a,b]\) is that \(\gamma _{j} \ge \gamma _{j-1}~\forall ~j\) (see Supplementary material, S.1, for details). In the case of quadratic splines this condition is necessary. It is easy to see that this condition could be imposed by re-parameterizing, so that

$$\begin{aligned} {\varvec{\gamma }} = {\varvec{\varSigma }} \tilde{{\varvec{\beta }}}, \end{aligned}$$

where \({\varvec{\beta }}= \left[ \beta _1, \beta _2, \ldots , \beta _{q}\right] ^{\mathrm{T}}\) and \(\tilde{{\varvec{\beta }}} = \left[ \beta _1,\exp (\beta _2), \ldots , \exp (\beta _q)\right] ^{\mathrm{T}}\), while \(\varSigma _{ij} = 0\) if \(i<j\) and \(\varSigma _{ij}=1\) if \(i \ge j\).

So if \(\mathbf{m} = [m(x_1),m(x_2), \ldots ,m(x_n)]^{\mathrm{T}}\) is the vector of \(m\) values at the observed points \(x_i\), and \(\mathbf{X}\) is the matrix such that \(X_{ij} = B_j(x_i)\), then we have

$$\begin{aligned} \mathbf{m} = \mathbf{X}{\varvec{\varSigma }} \tilde{\varvec{\beta }}. \end{aligned}$$
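To make the construction concrete, the following R sketch (illustrative only, and not the scam package internals; the basis dimension, knot range and coefficient values are arbitrary choices) builds \({\varvec{\varSigma }}\) and checks that the resulting smooth is monotone:

```r
## A minimal sketch of the re-parameterization above: build Sigma, map the
## working parameters beta to non-decreasing spline coefficients
## gamma = Sigma %*% beta_tilde, and evaluate m(x).
library(splines)

q   <- 10                                  # number of B-spline basis functions
ord <- 4                                   # cubic B-splines (order 4)
x   <- seq(0, 1, length.out = 200)
h   <- 1 / (q - ord + 1)                   # equal knot spacing over [0, 1]
knots <- seq(-(ord - 1) * h, by = h, length.out = q + ord)
X   <- splineDesign(knots, x, ord = ord)   # n x q model matrix, X[i, j] = B_j(x_i)

Sigma <- matrix(0, q, q)
Sigma[lower.tri(Sigma, diag = TRUE)] <- 1  # Sigma_ij = 1 for i >= j, 0 otherwise

beta       <- rnorm(q)                     # unconstrained working parameters
beta_tilde <- c(beta[1], exp(beta[-1]))    # first element free, the rest positive
gamma      <- drop(Sigma %*% beta_tilde)   # gamma_j is non-decreasing in j
m          <- drop(X %*% gamma)            # = X %*% Sigma %*% beta_tilde

all(diff(gamma) >= 0)                      # TRUE: coefficients satisfy the condition
all(diff(m) >= -1e-12)                     # the evaluated smooth is non-decreasing
```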

2.2.1 Smoothing

In a smoothing context we would also like to have a penalty on \(m(x)\) which can be used to control its ‘wiggliness’. Eilers and Marx (1996) introduced the notion of directly penalizing differences in the basis coefficients of a B-spline basis, which is used with a relatively large \(q\) to avoid underfitting. We can adapt this idea here. For \(j>1\) our \(\beta _j\) are log differences in the \(\gamma _j\). We therefore propose penalizing the squared differences between adjacent \(\beta _j\), starting from \(\beta _2\), using the penalty \(\Vert \mathbf{D}{\varvec{\beta }}\Vert ^2\) where \(\mathbf D\) is the \((q-2)\times q\) matrix that is all zero except that \(D_{i,i+1}=-D_{i,i+2}=1\) for \(i=1,\ldots , q-2\). The penalty is zeroed when all the \(\beta _j\) after \(\beta _1\) are equal, so that the \(\gamma _j\) form a uniformly increasing sequence and \(m(x)\) is an increasing straight line (see Fig. 1). As a result our penalty shares with a second order P-spline penalty the basic feature of ‘smoothing towards a straight line’, but in a manner that is computationally convenient for constrained smoothing.
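A short R sketch of the penalty, assuming the same basis dimension \(q\) as in the previous sketch (the test coefficients are illustrative):

```r
## Sketch of the SCOP-spline penalty: D is the (q-2) x q matrix with
## D[i, i+1] = 1 and D[i, i+2] = -1, so ||D beta||^2 penalizes squared
## differences of beta_2, ..., beta_q while leaving beta_1 unpenalized.
q <- 10
D <- matrix(0, q - 2, q)
for (i in 1:(q - 2)) { D[i, i + 1] <- 1; D[i, i + 2] <- -1 }
S <- crossprod(D)                            # penalty matrix t(D) %*% D

## The penalty is zero when beta_2 = ... = beta_q, i.e. when the gamma_j
## increase in equal steps and m(x) is an increasing straight line.
beta_line <- c(0.3, rep(log(0.5), q - 1))    # equal increments exp(beta_j) = 0.5
sum((D %*% beta_line)^2)                     # 0 (up to rounding error)
```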

Fig. 1 Illustration of the SCOP-splines for five values of the smoothing parameter: \(\lambda _{1} = 10^{-4}\) (long dashed curve), \(\lambda _{2} = 0.005\) (short dashed curve), \(\lambda _{3} = 0.01\) (dotted curve), \(\lambda _{4} = 0.1\) (dot-dashed curve), and \(\lambda _{5} = 100\) (two dashed curve). The true curve is represented as a solid line and dots show the simulated data. Twenty five B-spline basis functions of the third order were used

It might be asked whether penalization is necessary at all, given the restrictions imposed by the shape constraints. Figure 2 provides an illustration of what the penalty achieves. Even with the shape constraint, the unpenalized estimated curve shows a good deal of spurious variation that the penalty removes.

Fig. 2 Illustration of the SCOP-splines: un-penalized (long dashed curve, \(\lambda = 0\)), penalized (dotted curve, \(\lambda = 10^{-4}\)), and the true curve (solid line). Despite a monotonicity constraint, the un-penalized curve shows spurious detail that the penalty can remove

2.2.2 Identifiability, basis dimension

If we were interested solely in smoothing one-dimensional Gaussian data then \({\varvec{\beta }}\) would be chosen to minimize

$$\begin{aligned} \Vert \mathbf{y} - \mathbf{X}{\varvec{\varSigma }} \tilde{\varvec{\beta }}\Vert ^2 + \lambda \Vert \mathbf{D}{\varvec{\beta }}\Vert ^2, \end{aligned}$$

where \(\lambda \) is a smoothing parameter controlling the trade-off between smoothness and fidelity to the response data \(\mathbf y\). Here, however, we are interested in the basis and penalty in order to be able to embed the shape constrained smooth \(m(x)\) in a larger model. This requires an additional constraint on \(m(x)\) to achieve identifiability, avoiding confounding with the intercept of the model in which it is embedded. A convenient way to do this is to use centering constraints on the model matrix columns, i.e. the sum of the values of the smooth is set to zero, \(\sum _{i=1}^nm(x_i)=0\), or equivalently \(\mathbf{1}^T\mathbf{X}\varvec{\varSigma }\tilde{\varvec{\beta }}=0\).
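For illustration, the objective above can be minimized with a general-purpose optimizer rather than the Newton method of Sect. 3.2; the sketch below assumes the objects x, X, Sigma, D and q from the earlier sketches, an arbitrary simulated monotone truth, and a fixed value of \(\lambda \):

```r
## Illustrative penalized least squares fit of a single monotone smooth by
## optim (BFGS); this is a sketch only, not the estimation method of Sect. 3.2.
set.seed(1)
f_true <- 3 * plogis(20 * (x - 0.5))            # a monotone increasing truth
y <- f_true + rnorm(length(x), sd = 0.3)

penalized_ss <- function(beta, lambda) {
  beta_tilde <- c(beta[1], exp(beta[-1]))
  fit <- drop(X %*% (Sigma %*% beta_tilde))
  sum((y - fit)^2) + lambda * sum((D %*% beta)^2)
}
opt <- optim(rep(0, q), penalized_ss, lambda = 0.1, method = "BFGS")
beta_hat <- opt$par
m_hat <- drop(X %*% (Sigma %*% c(beta_hat[1], exp(beta_hat[-1]))))  # monotone fit
```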

As with any penalized regression spline approach, the choice of the basis dimension, \(q\), is not crucial, but it should be generous enough to avoid oversmoothing/underfitting (Ruppert 2002; Li and Ruppert 2008). Ruppert (2002) suggested algorithms for basis dimension selection by minimizing GCV over a set of specified values of \(q\), while Kauermann and Opsomer (2011) proposed an equivalent likelihood based scheme.

This simple monotonically increasing smooth can be extended to a variety of shape constrained functions, including decreasing, convex/concave, increasing/decreasing and concave, and increasing/decreasing and convex; the difference between alternative shape constraints lies in the form of the matrices \(\varvec{\varSigma }\) and \(\mathbf D\). Table 1 details eight possibilities, while Supplementary material, S.2, provides the corresponding derivations.

Table 1 Univariate shape constrained smooths

2.3 Multi-dimensional SCOP-splines

Using the concept of tensor product spline bases, it is possible to build up smooths of multiple covariates under monotonicity constraints, where monotonicity may be assumed on either all or a selection of the covariates. In this section the construction of a multivariable smooth, \(m(x_1,x_2,\ldots ,x_p),\) with monotonically increasing constraints along all covariates is first considered, followed by a discussion of monotonicity along only a single direction.

2.3.1 Tensor product basis

Consider \(p\) B-spline bases of dimensions \(q_{1}, q_{2}, \ldots , q_{p}\) for representing marginal smooth functions, each of a single covariate

$$\begin{aligned} f_{1}(x_{1})&= \sum \limits _{k_{1}=1}^{q_{1}}\alpha ^{1}_{k_{1}}B_{k_{1}}(x_{1}), \quad f_{2}(x_{2})= \sum \limits _{k_{2}=1}^{q_{2}}\alpha ^{2}_{k_{2}}B_{k_{2}}(x_{2}), \ldots ,\\ f_{p}(x_{p})&= \sum \limits _{k_{p}=1}^{q_{p}}\alpha ^{p}_{k_{p}}B_{k_{p}}(x_{p}), \end{aligned}$$

where the \(B_{k_{j}}(x_{j}),\) \(j=1,\ldots ,p,\) are B-spline basis functions, and the \(\alpha ^j_{k_{j}}\) are spline coefficients. Then, following Wood (2006a), the multivariate smooth can be represented by allowing the spline coefficients of each marginal smooth to vary smoothly with the next covariate, starting from the first marginal smooth. Denoting \(B_{k_{1}\ldots k_{p}}(x_{1},\ldots ,x_{p})=B_{k_{1}}(x_{1})\cdot \ldots \cdot B_{k_{p}}(x_{p}),\) the smooth of \(p\) covariates may be written as follows

$$\begin{aligned} m(x_{1},\ldots ,x_{p})=\sum \limits _{k_{1}=1}^{q_{1}}\ldots \sum \limits _{k_{p}=1}^{q_{p}}B_{k_{1}\ldots k_{p}} (x_{1},\ldots ,x_{p})\gamma _{k_{1}\ldots k_{p}}, \end{aligned}$$

where \(\gamma _{k_{1}\ldots k_{p}}\) are unknown coefficients.

So if \(\mathbf{X}\) is the matrix whose \(i\)th row is \(\mathbf{X}_{i}=\mathbf{X}_{1i}\otimes \mathbf{X}_{2i}\otimes \cdots \otimes \mathbf{X}_{pi},\) where \(\otimes \) denotes a Kronecker product and \(\mathbf{X}_{ji}\) is the \(i\)th row of the \(j\)th marginal model matrix, and \(\varvec{\gamma }=(\gamma _{11\ldots 1},\ldots ,\gamma _{k_{1}k_{2}\ldots k_{p}},\ldots ,\gamma _{q_{1}q_{2}\ldots q_{p}})^{T},\) then

$$\begin{aligned} \mathbf{m}=\mathbf{X}\varvec{\gamma }. \end{aligned}$$
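A short R sketch of the row-wise Kronecker construction for \(p=2\), with the marginal model matrices X1 and X2 assumed to have been built as in Sect. 2.2:

```r
## Sketch of the tensor product model matrix for p = 2: row i of X is the
## Kronecker product of the i-th rows of the marginal model matrices X1, X2.
row_kronecker <- function(X1, X2) {
  n <- nrow(X1)
  X <- matrix(0, n, ncol(X1) * ncol(X2))
  for (i in 1:n) X[i, ] <- kronecker(X1[i, ], X2[i, ])
  X
}
## usage: X <- row_kronecker(X1, X2); m <- X %*% gamma, with gamma of length q1 * q2
```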

2.3.2 Constraints

By extending the univariate case one can see that a sufficient condition for \({\partial m(x_{1},\ldots ,x_{p})}/{\partial x_{j}} \ge 0\) is \(\gamma _{k_{1}\ldots k_{j}\ldots k_{p}}\ge \gamma _{k_{1}\ldots (k_{j}-1)\ldots k_{p}}.\) To impose these conditions the re-parametrization \({\varvec{\gamma }} = {\varvec{\varSigma }} \tilde{\varvec{\beta }}\) is proposed, where

$$\begin{aligned} \tilde{\varvec{\beta }}= \left[ \beta _{11\ldots 1},\exp (\beta _{11\ldots 2}), \ldots , \exp (\beta _{k_1\ldots k_p}),\ldots ,\exp (\beta _{q_1\ldots q_p})\right] ^{\mathrm{T}}, \end{aligned}$$

and \(\varvec{\varSigma }= \varvec{\varSigma }_{1} \otimes \varvec{\varSigma }_{2} \otimes \cdots \otimes \varvec{\varSigma }_{p}.\) The elements of each \(\varvec{\varSigma }_j\) are the same as for the univariate monotonically increasing smooth (see Table 1). For a multivariate function that is monotonically decreasing in all covariates, \(\varvec{\varSigma }= \left[ \mathbf{1}:\varvec{\varSigma }'_{(,-1)}\right] ,\) where \(\varvec{\varSigma }' = -\varvec{\varSigma }_{1} \otimes \varvec{\varSigma }_{2} \otimes \cdots \otimes \varvec{\varSigma }_{p},\) that is, \(\varvec{\varSigma }\) is the matrix \(\varvec{\varSigma }'\) with its first column replaced by a column of ones.
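The following sketch (with illustrative marginal dimensions) builds \({\varvec{\varSigma }}\) for a bivariate smooth that is increasing in both covariates, and the corresponding matrix for the doubly decreasing case:

```r
## Sketch of the constraint matrix for a bivariate smooth (p = 2) that is
## monotonically increasing in both covariates, and its decreasing counterpart.
make_Sigma <- function(q) {                       # univariate Sigma of Sect. 2.2
  S <- matrix(0, q, q); S[lower.tri(S, diag = TRUE)] <- 1; S
}
q1 <- 5; q2 <- 4
Sigma_incr <- kronecker(make_Sigma(q1), make_Sigma(q2))   # double increasing

Sigma_prime <- -Sigma_incr                        # Sigma' = -Sigma_1 %x% Sigma_2
Sigma_decr  <- cbind(1, Sigma_prime[, -1])        # first column replaced by ones
```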

To satisfy conditions for a monotonically increasing or decreasing smooth with respect to only one covariate the following re-parameterizations are suggested:

  1. For a single monotonically increasing constraint along the \(x_{j}\) direction: let \(\varvec{\varSigma }_{j}\) be defined as previously and let \(\mathbf{I}_{s}\) be an identity matrix of size \(q_{s},\,s\ne j;\) then

    $$\begin{aligned} \varvec{\varSigma }= \mathbf{I}_{1}\otimes \cdots \otimes \varvec{\varSigma }_{j} \otimes \cdots \otimes \mathbf{I}_{p}, \end{aligned}$$

    and \({\varvec{\gamma }} = {\varvec{\varSigma }} \tilde{\varvec{\beta }},\) where \(\tilde{\varvec{\beta }}\) is a vector containing a mixture of un-exponentiated and exponentiated coefficients, with \(\tilde{\beta }_{k_1\ldots k_j\ldots k_p}=\exp (\beta _{k_1\ldots k_j\ldots k_p})\) when \(k_j\ne 1.\)

  2. For a single monotonically decreasing constraint along the \(x_{j}\) direction: the re-parametrization is the same as above except for the matrix \(\varvec{\varSigma }_{j},\) which is as for a univariate smooth with a monotonically decreasing constraint (see Table 1).

By analogy it is not difficult to construct tensor products with monotonicity constraints along any number of covariates.

2.3.3 Penalties

For controlling the level of smoothing, the penalty introduced in Sect. 2 can be extended. For multiple monotonicity the penalties may be written as

$$\begin{aligned} \fancyscript{P}= \lambda _{1}\varvec{\beta }^{T}\mathbf{S}_{1}\varvec{\beta }+ \lambda _{2}\varvec{\beta }^{T}\mathbf{S}_{2}\varvec{\beta }+\cdots +\lambda _{p}\varvec{\beta }^{T}\mathbf{S}_{p}\varvec{\beta }, \end{aligned}$$

where \(\mathbf{S}_{j}=\mathbf{D}_{j}^{\mathrm{T}}\mathbf{D}_{j}\) and \(\mathbf{D}_{j}=\mathbf{I}_{1}\otimes \mathbf{I}_{2}\otimes \cdots \otimes \mathbf{D}_{mj}\otimes \cdots \otimes \mathbf{I}_{p}.\) \(\mathbf{D}_{mj}\) is as \(\mathbf D\) in Table 1 for a monotone smooth. Penalties for single monotonicity along \(x_{j}\) are

$$\begin{aligned} \fancyscript{P}= \lambda _{1}\varvec{\beta }^{T}\tilde{\mathbf{S}}_{1}\varvec{\beta }+\cdots +\lambda _{j}\varvec{\beta }^{T}\mathbf{S}_{j}\varvec{\beta }+\cdots + \lambda _{p}\varvec{\beta }^{T}\tilde{\mathbf{S}}_{p}\varvec{\beta }, \end{aligned}$$

where \(\mathbf{S}_j\) is defined as above. The penalty matrices \(\tilde{\mathbf{S}}_i,\) \(i\ne j,\) in the unconstrained directions can be constructed using the marginal penalty approach described in Wood (2006a). The degree of smoothness in the unconstrained directions can be controlled by the second-order difference penalties applied to the non-exponentiated coefficients, and by the first-order difference penalties for the exponentiated coefficients. As in the univariate case, these penalties keep the parameter estimates close to each other, resulting in similar increments in the coefficients of marginal smooths. When \(\lambda _{j}\rightarrow \infty \) such penalization results in straight lines for marginal curves.
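A sketch of the corresponding penalty matrices for the doubly monotone bivariate case, using the difference matrix of Sect. 2.2.1 (dimensions illustrative):

```r
## Sketch of the penalty matrices for a double monotone tensor smooth (p = 2):
## D_1 = D_m1 %x% I_2 and D_2 = I_1 %x% D_m2, with D_m as in Sect. 2.2.1.
make_D <- function(q) {
  D <- matrix(0, q - 2, q)
  for (i in 1:(q - 2)) { D[i, i + 1] <- 1; D[i, i + 2] <- -1 }
  D
}
q1 <- 5; q2 <- 4
D1 <- kronecker(make_D(q1), diag(q2))      # differences along x1
D2 <- kronecker(diag(q1), make_D(q2))      # differences along x2
S1 <- crossprod(D1)                        # penalty lambda_1 * t(beta) %*% S1 %*% beta
S2 <- crossprod(D2)                        # penalty lambda_2 * t(beta) %*% S2 %*% beta
```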

3 SCAM

3.1 SCAM representation

To represent (1) for computation we now choose basis expansions, penalties and identifiability constraints for all the unconstrained \(f_j\), as described in detail in Wood (2006a), for example. This allows \(\sum _j f_j(z_{ji})\) to be replaced by \(\mathbf{F}_i {\varvec{\gamma }}\), where \(\mathbf F\) is a model matrix determined by the basis functions and the constraints, and \(\varvec{\gamma }\) is a vector of coefficients to be estimated. The penalties on the \(f_j\) are quadratic in \(\varvec{\gamma }\).

Each shape constrained term \(m_k\) is represented by a model matrix of the form \(\mathbf{X}{\varvec{\varSigma }}\) and corresponding coefficient vector. Identifiability constraints are absorbed by the column centering constraints. The model matrices for all the \(m_k\) are then combined so that we can write

$$\begin{aligned} \sum _k m_k(x_{ki}) = \mathbf{M}_i \tilde{\varvec{\beta }}, \end{aligned}$$

where \(\mathbf M\) is a model matrix and \(\tilde{\varvec{\beta }}\) is a vector containing a mixture of model coefficients (\(\beta _i\)) and exponentiated model coefficients \(\left( \exp (\beta _i)\right) \). The penalties in this case are quadratic in the coefficients \({\varvec{\beta }}\) (not in the \(\tilde{\varvec{\beta }}\)).

So (1) becomes

$$\begin{aligned} g(\mu _i) = \mathbf{A}_i {\varvec{\theta }} + \mathbf{F}_i{\varvec{\gamma }} + \mathbf{M}_i \tilde{\varvec{\beta }},\quad Y_i \sim \mathrm{EF}(\mu _i,\phi ). \end{aligned}$$

For fitting purposes we may as well combine the model matrices column-wise into one model matrix \(\mathbf{X}\), and write the model as

$$\begin{aligned} g(\mu _i) = \mathbf{X}_i \tilde{\varvec{\beta }}, \end{aligned}$$
(2)

where \(\tilde{\varvec{\beta }}\) has been enlarged to now contain \(\varvec{\theta }\), \(\varvec{\gamma }\) and the original \(\tilde{\varvec{\beta }}\). Similarly there is a corresponding expanded model coefficient vector \({\varvec{\beta }}\) containing \(\varvec{\theta }\), \(\varvec{\gamma }\) and the original \({\varvec{\beta }}\). The penalties on the terms have the general form \({\varvec{\beta }}^{\mathrm{T}}\mathbf{S}_\lambda {\varvec{\beta }}\) where \(\mathbf{S}_\lambda = \sum _k \lambda _k \mathbf{S}_k\), and the \(\mathbf{S}_k\) are the original penalty matrices expanded with zeros everywhere except for the elements which correspond to the coefficients of the \(k\)th smooth.
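As an illustration of this zero-padding, the following sketch embeds a single term's penalty into the full coefficient space; the offsets and dimensions are hypothetical:

```r
## Sketch of expanding a term-wise penalty matrix with zeros so that it acts
## only on the coefficients of the k-th smooth within the full vector beta.
embed_penalty <- function(S_term, first, p_total) {
  S <- matrix(0, p_total, p_total)
  idx <- first:(first + ncol(S_term) - 1)
  S[idx, idx] <- S_term
  S
}
## e.g. with 3 parametric coefficients followed by a q = 10 SCOP-spline term:
## S_1 <- embed_penalty(crossprod(D), first = 4, p_total = 3 + 10)
## S_lambda <- lambda_1 * S_1 + lambda_2 * S_2 + ...
```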

3.2 SCAM coefficient estimation

Now consider the estimation of \({\varvec{\beta }}\) given values for the smoothing parameters \(\varvec{\lambda }\). The exponential family chosen determines the form of the log likelihood \(l({\varvec{\beta }})\) of the model, and to control the degree of model smoothness we seek to maximize its penalized version

$$\begin{aligned} l_p ({\varvec{\beta }}) = l({\varvec{\beta }}) - {\varvec{\beta }}^{\mathrm{T}}\mathbf{S}_\lambda {\varvec{\beta }}/2. \end{aligned}$$

However, the non-linear dependence of \(\mathbf{X}\tilde{\varvec{\beta }}\) on \({\varvec{\beta }}\) makes this more difficult than in the case of unconstrained GAMs. In particular we found that optimization via Fisher scoring caused convergence problems for some models, and we therefore use a full Newton approach. The special structure of the model means that it is possible to work entirely in terms of a matrix square root of the Hessian of \(l\) when applying Newton’s method, thereby improving the numerical stability of the computations, so we also adopt this refinement. Also, since SCAM is very much within GAM theory, the same convergence issues might arise as in the case of GAM/GLM fitting (Wood 2006a). In particular, the likelihood might not be uni-modal and the process may converge to different estimates depending on the starting values of the fitting process. However, if the initial values are reasonably selected then major convergence issues are unlikely; the first step of the following algorithm supplies such initial values.

Let \(V(\mu )\) be the variance function for the model’s exponential family distribution, and define

$$\begin{aligned} \alpha (\mu _i) = 1 + (y_i - \mu _i) \left\{ \frac{V^\prime (\mu _i)}{V(\mu _i)} + \frac{g^{\prime \prime }(\mu _i)}{g^\prime (\mu _i)} \right\} . \end{aligned}$$

Penalized likelihood maximization is then achieved as follows:

  1. To obtain an initial estimate of \( {\varvec{\beta }}\), minimize \(\Vert g(\mathbf{y}) - \mathbf{X}\tilde{\varvec{\beta }}\Vert ^2 + \tilde{\varvec{\beta }}^{\mathrm{T}}\mathbf{S}_\lambda \tilde{\varvec{\beta }}\) w.r.t. \(\tilde{\varvec{\beta }}\), subject to linear inequality constraints ensuring that \(\tilde{\beta }_j >0 \) whenever \(\tilde{\beta }_j = \exp (\beta _j)\). This is a standard quadratic programming (QP) problem. (If necessary \(\mathbf y\) is adjusted slightly to avoid infinite \(g(\mathbf{y})\).)

  2. Set \(k = 0\) and repeat steps 3–11 to convergence.

  3. Evaluate \(z_i = (y_i - \mu _i) g^\prime (\mu _i)/\alpha (\mu _i)\) and \(w_i = \omega _i \alpha (\mu _i)/ \{V(\mu _i)g^{\prime 2}(\mu _i)\},\) using the current estimate of \(\mu _i\) (the \(\omega _i\) being any prior weights).

  4. Evaluate vectors \(\tilde{\mathbf{w}} = |\mathbf{w}|\) and \(\tilde{\mathbf{z}}\) where \(\tilde{z}_i = \mathrm{sign}(w_i)z_i\).

  5. Evaluate the diagonal matrix \(\mathbf C\) such that \(C_{jj} = 1\) if \(\tilde{\beta }_j = \beta _j\), and \(C_{jj} = \exp (\beta _j)\) otherwise.

  6. Evaluate the diagonal matrix \(\mathbf E\) such that \(E_{jj} = 0\) if \(\tilde{\beta }_j = \beta _j\), and \(E_{jj} =\sum _{i=1}^n w_ig^\prime (\mu _i) [\mathbf{XC}]_{ij}(y_i-\mu _i)/\alpha (\mu _i)\) otherwise.

  7. Let \(\mathbf{I}^-\) be the diagonal matrix such that \(I^{-}_{ii} = 1\) if \(w_i<0\) and \(I^{-}_{ii}=0 \) otherwise.

  8. Letting \(\tilde{\mathbf{W}}\) denote diag\((\tilde{\mathbf{w}})\), form the QR decomposition

    $$\begin{aligned} \left[ \begin{array}{c} \sqrt{\tilde{\mathbf{W}}} \mathbf{X}\mathbf{C} \\ \mathbf{B} \end{array}\right] = \mathbf{QR}, \end{aligned}$$

    where \(\mathbf{B}\) is any matrix square root such that \(\mathbf{B}^{\mathrm{T}}\mathbf{B} = \mathbf{S}_\lambda \).

  9. Letting \(\mathbf{Q}_1\) denote the first \(n\) rows of \(\mathbf Q\), form the symmetric eigen-decomposition

    $$\begin{aligned} \mathbf{Q}_1 ^{\mathrm{T}}\mathbf{I}^- \mathbf{Q}_1 + \mathbf{R}^{-\mathrm T}\mathbf{E} \mathbf{R}^{-1} = \mathbf{U }{\varvec{\varLambda }} \mathbf{U}^{\mathrm{T}}. \end{aligned}$$

  10. Hence define \(\mathbf{P} = \mathbf{R}^{-1} \mathbf{U}(\mathbf{I} - {\varvec{\varLambda }})^{-1/2}\) and \(\mathbf{K} = \mathbf{Q}_1 \mathbf{U} (\mathbf{I} - {\varvec{\varLambda }})^{-1/2}\).

  11. Update the estimate of \({\varvec{\beta }}\) as \({\varvec{\beta }}^{[k+1]} = {\varvec{\beta }}^{[k]} + \mathbf{P K}^{\mathrm{T}}\sqrt{\tilde{\mathbf{W}}} \tilde{\mathbf{z}} - \mathbf{PP}^{\mathrm{T}}\mathbf{S}_\lambda {\varvec{\beta }}^{[k]}\) and increment \(k\).

The algorithm is derived in Appendix 1; it has several similarities to a standard penalized IRLS scheme for penalized GLM estimation. However, the more complicated structure results from the need to use full Newton, rather than Fisher scoring, while at the same time avoiding computation of the full Hessian, which would approximately square the condition number of the update computations.
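To fix ideas, here is a minimal R sketch of one update (steps 3–11) for the simplest possible case: Gaussian response, identity link and unit prior weights, so that \(\alpha (\mu _i)=1\), \(w_i=1\), \(z_i=y_i-\mu _i\) and \(\mathbf{I}^-=\mathbf{0}\). It is illustrative only, is not the scam package implementation, assumes a full column rank model matrix, and omits the refinements discussed below.

```r
## One Newton update for the Gaussian/identity case of the algorithm above.
newton_step <- function(beta, y, X, S_lambda, expon) {
  ## expon: logical vector, TRUE where beta_tilde_j = exp(beta_j)
  beta_tilde <- ifelse(expon, exp(beta), beta)
  mu <- drop(X %*% beta_tilde)                    # identity link
  z  <- y - mu                                    # working residuals (alpha = 1)
  C  <- diag(ifelse(expon, exp(beta), 1))         # step 5: d beta_tilde / d beta
  XC <- X %*% C
  ## step 6: E_jj = sum_i [XC]_ij (y_i - mu_i) for exponentiated coefficients
  E  <- diag(ifelse(expon, colSums(XC * z), 0))
  ## step 8: B is any square root with t(B) %*% B = S_lambda
  es <- eigen(S_lambda, symmetric = TRUE)
  B  <- sqrt(pmax(es$values, 0)) * t(es$vectors)
  qrx  <- qr(rbind(XC, B))                        # here W-tilde is the identity
  Q1   <- qr.Q(qrx)[1:nrow(X), , drop = FALSE]
  Rinv <- backsolve(qr.R(qrx), diag(ncol(X)))
  ## step 9: with I^- = 0 only the R^{-T} E R^{-1} term remains
  eg  <- eigen(t(Rinv) %*% E %*% Rinv, symmetric = TRUE)
  scl <- 1 / sqrt(1 - eg$values)    # needs eigenvalues < 1, else a Fisher step
  P <- Rinv %*% eg$vectors %*% diag(scl)          # step 10
  K <- Q1 %*% eg$vectors %*% diag(scl)
  ## step 11: the update of beta (sqrt(W-tilde) = I, z-tilde = z here)
  beta + drop(P %*% (t(K) %*% z)) - drop(P %*% (t(P) %*% (S_lambda %*% beta)))
}
```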

Two refinements of the basic iteration may be required.

  1. If the Hessian of the log likelihood is indefinite then step 10 will fail, because some \(\varLambda _{ii}\) will exceed 1. In this case a Fisher update step must be substituted, by setting \(\alpha (\mu _i) = 1\).

  2. There is considerable scope for identifiability issues to hamper computation. In common with unconstrained GAMs, flexible SCAMs with highly correlated covariates can display co-linearity problems between model coefficients, which require careful handling in order to ensure numerical stability of the estimation algorithms. An additional issue is that the non-linear constraints mean that parameters can be poorly identified on flat sections of a fitted curve, where \(\beta _j\) is simply ‘very negative’, but the data contain no information on how negative. So steps must be taken to deal with unidentifiable parameters. One approach is to work directly with the QR decomposition to calculate which coefficients are unidentifiable at each iteration and to drop these, but a simpler strategy substitutes a singular value decomposition for the \(\mathbf R\) factor at step 8 if it is rank deficient, so that

    $$\begin{aligned} \mathbf{R}=\varvec{{\fancyscript{U}}} \mathbf{DV}^{\mathrm{T}}. \end{aligned}$$

    Then we set \(\varvec{{\fancyscript{Q}}} = \mathbf{Q}\varvec{{\fancyscript{U}}},\) \(\varvec{{\fancyscript{R}}}=\mathbf{DV}^T,\) \(\mathbf{Q}_1\) becomes the first \(n\) rows of \(\varvec{{\fancyscript{Q}}},\) and everything proceeds as before, except for the inversion of \(\mathbf R\). We now substitute the pseudoinverse \(\mathbf{R}^- = \mathbf{VD}^-\), where the diagonal matrix \(\mathbf{D}^-\) is such that \(D_{jj}^- = 0 \) if the singular value \(D_{jj}\) is ‘too small’, but otherwise \( D^-_{jj} = 1/D_{jj}\). ‘Too small’ is judged relative to the largest singular value \(D_{11}\) multiplied by some power (in the range 0.5 to 1) of the machine precision. If all parameters are numerically identifiable then the pseudo-inverse is just the inverse.

3.3 SCAM smoothing parameter estimation

We propose to estimate the smoothing parameter vector \(\varvec{\lambda }\) by optimizing a prediction error criterion such as AIC (Akaike 1973) or GCV (Craven and Wahba 1979). The model deviance is defined in the standard way as

$$\begin{aligned} D(\hat{\varvec{\beta }}) = 2 \{ l_\mathrm{max} - l(\hat{\varvec{\beta }})\} \phi , \end{aligned}$$

where \(l_\mathrm{max}\) is the saturated log likelihood. When the scale parameter is known we find \(\varvec{\lambda }\) which minimizes \({\fancyscript{V}}_u = D(\hat{\varvec{\beta }}) + 2 \phi \gamma \tau ,\) where \(\tau \) is the effective degrees of freedom (edf) of the model. \(\gamma \) is a parameter that in most cases has the value of \(1,\) but is sometimes increased above 1 to obtain smoother models [see Kim and Gu (2004)]. When the scale parameter is unknown we find \(\varvec{\lambda }\) minimizing the GCV score, \( {\fancyscript{V}}_g = {n D(\hat{\varvec{\beta }})}/{(n - \gamma \tau )^2}. \) For both criteria the dependence on \(\varvec{\lambda }\) is via the dependence of \(\tau \) and \(\hat{\varvec{\beta }}\) on \(\varvec{\lambda }\) [see Hastie and Tibshirani (1990) and Wood (2008) for further details].

The edf can be found, following Meyer and Woodroofe (2000), as

$$\begin{aligned} \tau = \sum _{i=1}^n \frac{\partial \hat{\mu }_i}{\partial y_i} = \mathrm{tr}(\mathbf{KK}^{\mathrm{T}}\mathbf{L}^+), \end{aligned}$$
(3)

where \(\mathbf{L}^+\) is the diagonal matrix such that

$$\begin{aligned} L^+_{ii} = \left\{ \begin{array}{ll} \alpha (\mu _i)^{-1}, &{} \text { if } w_i \ge 0 \\ - \alpha (\mu _i)^{-1}, &{}\text { otherwise}. \end{array} \right. \end{aligned}$$

Details are provided in Appendix 2.
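Continuing the Gaussian identity-link sketch, where all \(w_i\ge 0\) and \(\alpha (\mu _i)=1\) so that \(\mathbf{L}^+\) is the identity, the edf and GCV score can be computed as follows (illustrative only; \(\mathbf K\) is the matrix formed at step 10 of the converged fit):

```r
## Effective degrees of freedom and GCV score for the Gaussian/identity sketch.
edf_and_gcv <- function(K, y, mu, gamma_pen = 1) {
  n   <- length(y)
  tau <- sum(K^2)                  # tr(K %*% t(K)) = sum of squared entries
  D   <- sum((y - mu)^2)           # Gaussian deviance: residual sum of squares
  gcv <- n * D / (n - gamma_pen * tau)^2
  c(edf = tau, gcv = gcv)
}
```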

Optimization of the \({\fancyscript{V}}_*\) w.r.t. \({\varvec{\rho }} = \log ({\varvec{\lambda }})\) can be achieved by a quasi-Newton method. Each trial \(\varvec{\rho }\) vector requires a Sect. 3.2 iteration to find the corresponding \(\hat{\varvec{\beta }}\) so that the criterion can be evaluated. In addition the first derivative vector of \({\fancyscript{V}}_*\) w.r.t. \(\varvec{\rho }\) is required, which in turn requires \(\partial \hat{\varvec{\beta }}/{\partial \varvec{\rho } }\) and \(\partial \tau /\partial {\varvec{\rho }}\).

As demonstrated in Supplementary material, S.3, implicit differentiation can be used to obtain

$$\begin{aligned} \frac{\partial \hat{\varvec{\beta }}}{\partial \rho _k} = - \lambda _k \mathbf{PP}^{\mathrm{T}}\mathbf{S}_k \hat{\varvec{\beta }}. \end{aligned}$$

The derivatives of \(D\) and \(\tau \) then follow, as S.4 (Supplementary material) shows in tedious detail.

3.4 Interval estimation

Having obtained estimates \(\hat{\varvec{\beta }}\) and \(\hat{\varvec{\lambda }}\), we have point estimates for the component smooth functions of the model, but it is usually desirable to obtain interval estimates for these functions as well. To facilitate the computation of such intervals we seek distributional results for the \(\tilde{\varvec{\beta }}\), i.e. for the coefficients on which the estimated functions depend linearly.

Here we adopt the Bayesian approach to interval estimation pioneered in Wahba (1983), but following Silverman’s (1985) formulation. Such intervals are appealing following Nychka’s (1988) analysis showing that they have good frequentist properties by virtue of accounting for both sampling variability and smoothing bias. Specifically, we view the smoothness penalty as equivalent to an improper prior distribution on the model coefficients

$$\begin{aligned} {\varvec{\beta }}\sim N(\mathbf{0}, \mathbf{S}_\lambda ^- /(2 \phi )), \end{aligned}$$

where \(\mathbf{S}_\lambda ^-\) is the Moore–Penrose pseudoinverse of \(\mathbf{S}_\lambda =\sum _k \lambda _k \mathbf{S}_k\). In conjunction with the model likelihood, Bayes theorem then leads to the approximate result

$$\begin{aligned} \tilde{\varvec{\beta }}| \mathbf{y} \sim N(\hat{\tilde{\varvec{\beta }}}, \mathbf{V}_{\tilde{\varvec{\beta }}}), \end{aligned}$$
(4)

where \(\mathbf{V}_{\tilde{\varvec{\beta }}} = \mathbf{C} (\mathbf{C}^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}\mathbf{W}\mathbf{XC} + \mathbf{S}_\lambda )^{-1} \mathbf{C}\phi \), and \(\mathbf W\) is the diagonal matrix of the \(w_i\) calculated with \(\alpha (\mu _i)=1\). Supplementary material, S.5, derives this result. The deviance or Pearson statistic divided by the effective residual degrees of freedom provides an estimate of \(\phi \), if required. To use the result we condition on the smoothing parameter estimates: despite this, the intervals display surprisingly good coverage properties (Marra and Wood 2012 provide a theoretical analysis which partly explains this).
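Continuing the same illustrative Gaussian identity-link sketch (so that \(\mathbf{W}=\mathbf{I}\)), approximate pointwise intervals for the fitted values could be computed as below; the function and argument names are hypothetical:

```r
## Approximate Bayesian covariance for beta_tilde and pointwise intervals for
## the fitted smooth, for the Gaussian/identity sketch (W = I, phi = sigma^2).
ci_sketch <- function(beta_hat, X, S_lambda, expon, phi, level = 0.95) {
  C  <- diag(ifelse(expon, exp(beta_hat), 1))
  XC <- X %*% C
  V  <- C %*% solve(crossprod(XC) + S_lambda) %*% C * phi   # result (4)
  beta_tilde <- ifelse(expon, exp(beta_hat), beta_hat)
  fit <- drop(X %*% beta_tilde)
  se  <- sqrt(rowSums((X %*% V) * X))                       # diag(X V X')
  zq  <- qnorm(1 - (1 - level) / 2)
  cbind(fit = fit, lower = fit - zq * se, upper = fit + zq * se)
}
```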

4 Simulated examples

4.1 Simulations: comparison with alternative methods

In this section performance is compared with an unconstrained GAM and with the QP approach to shape preserving smoothing (Wood 1994), using a simulated example of an additive model with a mixture of monotone and unconstrained smooth terms. All simulation studies and data applications were performed using the R packages "scam", which implements the proposed SCAM approach, and "mgcv" for the GAM and QP implementations. A more extensive simulation study is given in Supplementary material, S.6. In particular, the first subsection of S.6 presents a comparative study with the constrained P-spline regression of Bollaerts et al. (2006), the monotone piecewise quadratic splines of Meyer (2008), and the shape-restricted penalized B-splines of Meyer (2012), on simulated examples of univariate single smooth term models. Since these approaches showed no mean square error advantage over SCAM for the univariate models, direct grid search for multiple optimal smoothing parameters is computationally expensive, and, to the authors’ knowledge, R routines implementing these methods are not freely available, the comparisons for the multivariate and additive examples were performed only with the unconstrained GAM and the QP approach.

The following additive model is considered:

$$\begin{aligned} g(\mu _{i})=m_{1}(x_{1i})+f_{2}(x_{2i}), \quad \mathrm E (Y_{i})=\mu _{i}, \end{aligned}$$
(5)

where \(Y_{i}\) follows a \(\mathrm N (\mu _i,\sigma ^2)\) or \(\mathrm Poi (\mu _i)\) distribution. Figure 3 shows the true functions used for this study. Their analytical expressions are given in Supplementary material, S.4.

Fig. 3 Shape of the functions used for the simulation study

The covariate values, \(x_{1i}\) and \(x_{2i},\) were simulated from uniform distributions on \([-1,3]\) and \([-3,3]\) respectively. For the Gaussian data the values of \(\sigma \) were 0.05, 0.1, and 0.2, which gave signal to noise ratios of about 0.97, 0.88, and 0.65. For the Poisson model the noise level was controlled by multiplying \(g(\mu _i)\) by \(d\), taking values 0.5, 0.7, and 1.2, which resulted in signal to noise ratios of about 0.58, 0.84, and 0.99. For the SCAM implementation a cubic SCOP-spline of dimension 30 was used to represent the first, monotonic, smooth term and a cubic P-spline with \(q=15\) for the second, unconstrained, term. For the unconstrained GAM, P-splines with the same basis dimensions were used for both model components. The models were fitted by penalized likelihood maximization with the smoothing parameters selected using \({\fancyscript{V}}_g\) in the Gaussian case and \({\fancyscript{V}}_u\) in the Poisson case.
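For reference, a fit of model (5) with the packages named above might be specified as in the following hedged sketch; the data frame dat and its column names are hypothetical, and the basis codes ("mpi" for a monotone increasing SCOP-spline, "ps" for an unconstrained P-spline) follow the scam documentation:

```r
## Hypothetical data frame `dat` with columns y, x1 and x2, simulated as in (5).
library(scam)                     # loads mgcv as well
fit_scam <- scam(y ~ s(x1, k = 30, bs = "mpi") + s(x2, k = 15, bs = "ps"),
                 family = gaussian(), data = dat)
fit_gam  <- mgcv::gam(y ~ s(x1, k = 30, bs = "ps") + s(x2, k = 15, bs = "ps"),
                      family = gaussian(), data = dat)
## For the Poisson case, replace family = gaussian() by family = poisson();
## smoothing parameters are then selected by the UBRE/AIC-type criterion.
```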

For implementing the QP approach to the monotonicity constraint, we approximated the necessary and sufficient condition \(f'(x)\ge 0\) via the standard technique (Villalobos and Wahba 1987) of using a fine grid of linear constraints \((f'(x_i^*)\ge 0, i=1,\ldots , n),\) where the \(x_i^*\) are spread evenly through the range of \(x\) (strictly such constraints are necessary, but only sufficient as \(n \rightarrow \infty \); in practice we observed no violations of monotonicity). Cubic regression spline bases were used here, together with the integrated squared second order derivative of the smooth as the penalty. The model fit is obtained by solving the QP problem within a penalized IRLS loop, given \(\varvec{\lambda }\) chosen via GCV/UBRE from an unconstrained model fit. Cubic regression splines tend to have slightly better MSE performance than P-splines (Wood 2006a) and, moreover, the conditions built on finite differences are not only sufficient but also necessary for monotonicity, so this is a challenging test for SCAM. Three hundred replicates were produced for the Gaussian and Poisson distributions at each of the three levels of noise and for two sample sizes, 100 and 200, for each of the three approaches.

The simulation results for the Gaussian data are illustrated in Fig. 4. The results show that SCAM works better than the two alternative methods in terms of MSE performance. Note that GAM performed better than the QP approach in this case, but the difference in MSE between SCAM and GAM is much smaller than in the one-dimensional simulation studies shown in Supplementary material, S.6. It is also noticeable that GAM reconstructed the truth better than the QP method. The explanation may be that there is only one monotonic term, and both GAM and SCAM gave similar fits for the unconstrained term, \(f_{2};\) at lower noise levels GAM may also be able to reconstruct the monotone shape of \(m_{1}\) for some replicates. The results also suggest that SCAM works better than GAM at higher noise levels, which seems natural since at lower noise levels the shapes of the constrained terms can be captured even by the unconstrained GAM. The reduction in performance of the QP approach compared to GAM was due to the smoothing parameters being estimated from the unconstrained fit, which sometimes resulted in less smooth tails of the constrained smooth term than those of the unconstrained GAM. For the Poisson data with sample size \(n=100\) all three methods worked similarly, but with the increase in sample size SCAM outperformed the other two approaches (plots not shown). As in the Gaussian case the unconstrained GAM worked better than QP.

Fig. 4 MSE comparisons between SCAM (mg), GAM (g), and quadratic programming (qp) approaches for the Gaussian distribution for each of three noise levels. The upper panel illustrates the results for \(n=200,\) the lower for \(n=100.\) Boxplots show the distributions of differences in relative MSE between each alternative method and SCAM. 300 replicates were used. Relative MSE was calculated by dividing the MSE value by the average MSE of SCAM for the given case

The simulation studies show that SCAM may have practical advantages over the alternative methods considered. It is computationally slower than the GAM and QP approaches; however, GAM obviously cannot impose monotonicity, and the selection of the smoothing parameters for SCAM is well founded, in contrast to the ad hoc method used with QP of choosing \(\lambda \) from an unconstrained fit and then refitting subject to the constraints. Finally, the practical MSE performance of SCAM seems to be better than that of the alternatives considered here.

4.2 Coverage probabilities

The proposed Bayesian approach to confidence interval construction makes a number of key assumptions: (i) it uses a linear approximation for the exponentiated parameters and, in the case of non-Gaussian models, adopts large sample inference; (ii) the smoothing parameters are treated as fixed. The simulation example of the previous subsection is used to examine how these restrictions affect the performance of the confidence intervals, with realized coverage probability taken as the measure of performance. Supplementary material, S.7, presents two further examples for a more thorough assessment of confidence interval performance.

The simulation study of confidence interval performance was conducted in an analogous manner to Wood (2006b). Samples of sizes \(n=200\) and 500 were generated from (5) for the Gaussian and Poisson distributions; 500 replicates were produced for both distributions at each of three levels of noise and for the two sample sizes. For each replicate the realized coverage proportion was calculated as the proportion of the values of the true functions (at each of the covariate values) falling within the constructed confidence interval. Three confidence levels were considered: 90, 95, and 99 %. An overall mean coverage probability and its standard error were obtained from the 500 ‘across-the-function’ coverage proportions. The results of the study are presented in Fig. 5 for the Gaussian and Poisson models. The realized coverage probabilities are near the corresponding nominal values, and the larger sample size reduces the standard errors as expected. The results for the Poisson models are quite good, with the exception of the first monotone smooth, \(m_1(x_1)\), at the lowest signal strength, which may be explained by the fact that the optimal fit inclines toward a straight line model (Marra and Wood 2012).

Fig. 5 Realized coverage probabilities for confidence intervals from the SCAM simulation study of the first example, for normal and Poisson data for \(n=200\) (top panel) and \(n= 500\) (bottom panel). Three noise levels are used for each smooth term and for the overall model (“all”). The nominal coverage probabilities of 0.90, 0.95, and 0.99, are shown as horizontal dashed lines. ‘\(\circ \)’ indicates the average realized coverage probabilities over 500 replicate data sets. Vertical lines show twice standard error intervals of the mean coverage probabilities

5 Examples

This section presents the application of SCAM to two different data sets. The purpose of the first application is to investigate whether proximity to municipal incinerators in Great Britain is associated with increased risk of stomach cancer (Elliott et al. 1996; Shaddick et al. 2007). It is hypothesized that the risk of cancer is a decreasing function of distance from an incinerator. The second application uses data from the National Morbidity, Mortality, and Air Pollution Study (Peng and Welty 2004). The relationship between daily counts of mortality and short-term changes in air pollution concentrations is investigated. It is assumed that increases in concentrations of ozone, sulphur dioxide, and particulate matter will be associated with adverse health effects.

Incinerator data: Elliott et al. (1996) presented a large-scale study to investigate whether proximity to incinerators is associated with an increased risk of cancer. They analyzed data from 72 municipal solid waste incinerators in Great Britain and investigated the possibility of a decline in risk with distance from sources of pollution for a number of cancers. There was significant evidence for such a decline for stomach cancer, among several others. Data from one of those 72 sources, an incinerator located in the northeast of England, are analyzed using the SCAM approach in this section; for this incinerator Elliott et al. (1996) found a significant result indicating a monotone decreasing risk with distance.

The data are from 44 enumeration districts (census-defined administrative areas), EDs, whose geographical centroids lay within 7.5 km of the incinerator. The response variable, \(Y_{i},\) is the observed number of cases of stomach cancer in each enumeration district. Associated estimates of the expected number of cases, \(E_{i},\) used for risk determination via \(\mathtt{risk}_{i}=Y_{i}/E_{i}\), were calculated for each ED using national rates for the whole of Great Britain, standardized for age and sex. The two covariates are the distance (km), \(\mathtt{dist}_{i},\) from the incinerator and a deprivation score, the Carstairs score, \(\mathtt{cs}_{i}\).

It is assumed that the \(Y_{i}\) are independent Poisson variables, \(Y_{i} \sim \mathrm Poi (\mu _{i}),\) where \(\mu _{i}=\lambda _{i}E_{i}\) is the Poisson mean, \(E_{i}\) is the expected number of cases in area \(i\) and \(\lambda _{i}\) is the relative risk. Shaddick et al. (2007) proposed a model under which the effect of a covariate, e.g., distance, on cancer risk was linear through an exponential function, i.e. \(\lambda _{i}=\exp (\beta _{0}+\beta _{1}\mathtt{dist}_{i}).\) Since the risk of cancer might be expected to decrease with distance from the incinerator, in this paper a smooth monotonically decreasing function, \(m(\mathtt{dist}_{i}),\) is used to model its relationship with distance, \(\lambda _{i}=\exp \left\{ m(\mathtt{dist}_{i})\right\} .\) Hence the model can be represented as follows:

$$\begin{aligned}&\log (\lambda _{i})=m(\mathtt{dist}_{i}) \quad \Rightarrow \quad \log \left( \mu _{i}/E_{i}\right) =m(\mathtt{dist}_{i}) \quad \Rightarrow \\&\log (\mu _{i})=\log (E_{i})+m(\mathtt{dist}_{i}), \end{aligned}$$

which is a single-smooth Poisson regression model under a monotonicity constraint, where \(\log (E_{i})\) is treated as an offset (a variable with a coefficient fixed at \(1\)). Therefore, the SCAM approach can be applied to fit such a model. The Carstairs score is known to be a good predictor of cancer rates (Elliott et al. 1996; Shaddick et al. 2007), so its effect may also be included in the model. The following four models are considered for this application.

Model 1: \(\log \left\{ \mathrm E (Y_{i})\right\} =\log (E_{i})+m_{1}(\mathtt{dist}_{i}),\) \(m'_1(\mathtt{dist}_{i})<0.\) Model 2 is the same as model 1 but with \(m_{2}(\mathtt{cs}_{i})\) as its smooth term instead, with \(m'_{2}(\mathtt{cs}_{i})>0.\) Model 3 combines both smooths, while model 4 uses a bivariate function \(m_{3}(-\mathtt{dist}_{i},\mathtt{cs}_{i})\) subject to a double monotone increasing constraint. The univariate smooth terms were represented by third order SCOP-splines with \(q=15,\) while \(q_1=q_2=6\) were used for the bivariate SCOP-spline.
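As a hedged illustration of how models 1–3 might be specified with the scam package ("mpd" and "mpi" denoting monotone decreasing and increasing SCOP-splines), assuming a hypothetical data frame incin and that offsets and AIC extraction behave as for mgcv gam fits:

```r
## Hypothetical data frame `incin` with columns Y, E, dist and cs.
m1 <- scam(Y ~ s(dist, k = 15, bs = "mpd") + offset(log(E)),
           family = poisson, data = incin)                    # model 1
m2 <- scam(Y ~ s(cs, k = 15, bs = "mpi") + offset(log(E)),
           family = poisson, data = incin)                    # model 2
m3 <- scam(Y ~ s(dist, k = 15, bs = "mpd") + s(cs, k = 15, bs = "mpi") +
             offset(log(E)), family = poisson, data = incin)  # model 3
AIC(m1); AIC(m2); AIC(m3)   # assumes an AIC/logLik method as for mgcv gam fits
```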

Plots for assessing the suitability of model 1 are given in Supplementary material, S.8. For comparison, the first model was also fitted without the constraint. The estimated smooths and risk functions for both methods are illustrated in Fig. 6. The estimate of the cancer risk function was obtained as \(\hat{\mathtt{risk}}_{i}=\hat{\mu }_{i}/E_{i}= \exp \left\{ \hat{m}_{1}(\mathtt{dist}_{i})\right\} .\) Note that the unconstrained GAM resulted in a non-monotone smooth, which supports the SCAM approach. The AIC score allows us to compare models with and without shape constraints: the AIC values were 152.35 for GAM and 150.57 for SCAM, which favours the shape constrained model.

Fig. 6 The estimated smooth and cancer risk function for monotone and unconstrained versions of model 1 (incinerator data). a the estimated smooth of SCAM + \(95\,\%\) confidence interval; b the SCAM estimated risk as a function of distance; c the GAM estimated smooth + \(95\,\%\) confidence interval; d the GAM estimated risk as a function of distance. Points show the observed data. As noted in the text, AIC suggests that the shape constrained model (a, b) is better than the unconstrained version (c, d)

In model 2 the number of cases of stomach cancer is represented by a smooth function of the deprivation score. This function is assumed to be monotonically increasing since it was shown (Elliott et al. 1996) that, in general, people living closer to incinerators tend to be less affluent (low Carstairs score). The AIC value for this model was 155.59, whereas the unconstrained version gave AIC = 156.4, both of which were higher than for the previous model. The other three measures of model performance, \({\fancyscript{V}}_u,\) the adjusted \(r^{2},\) and the deviance explained, also gave slightly worse results than those seen in model 1.

Model 3 incorporates both covariates, dist and cs, assuming additive effects on the log scale. The estimated edf of \(m_{2}(\mathtt{cs})\) was about zero: this smooth term was insignificant in this model, with all its coefficients near zero, which can be explained by the high correlation between the two covariates. Considering a linear effect of the Carstairs score in place of the smooth function \(m_{2},\) as proposed in Shaddick et al. (2007), i.e. \(\log \left\{ \mathrm E (Y_{i})\right\} = \log (E_{i})+m_{1}(\mathtt{dist}_{i})+\beta \mathtt{cs}_{i},\) also resulted in an insignificant estimate of \(\beta .\)

The bivariate function, \(m_{3}(-\mathtt{dist}_{i},\mathtt{cs}_{i}),\) is considered in the last model. A perspective plot of the estimated smooth is shown in Fig. 7. This plot also supports the previous result, that the Carstairs score does not provide any additional information for modelling cancer risk when distance is included in the model: the estimated smooth has almost no increasing trend with respect to the second covariate. The measures of model performance, such as \({\fancyscript{V}}_u\), the adjusted \(r^{2},\) and the percentage of deviance explained, were not as good as for the simple model 1. The equivalent model without shape constraints resulted in an AIC of 157.35, whereas the AIC score for SCAM was 155.4. Hence the model selected as best by AIC is the simple shape constrained model which includes only distance.

Fig. 7 Perspective plot of the estimated bivariate smooth of model 4 (incinerator data)

Air pollution data: The second application investigates the relationship between non-accidental daily mortality and air pollution. The data were from the National Morbidity, Mortality, and Air Pollution Study (Peng and Welty 2004), which contains 5,114 daily measurements on different variables for 108 cities within the United States. A single-city (Chicago) study from these data was examined as an example in Wood (2006a). The response variable was the daily number of deaths in Chicago (death) for the years 1987–1994. Four explanatory variables were considered: average daily temperature (tempd), levels of ozone (o3median), levels of particulate matter (pm10median), and time. Since it might be expected that increased mortality will be associated with increased concentrations of air pollution, modelling with SCAM may prove useful.

Preliminary modelling and examination of the data showed that the mortality rate on a given day could be better predicted if aggregated air pollution levels and aggregated mean temperature were incorporated into the model, rather than the levels of pollution and temperature on the day in question (Wood 2006a). It was proposed that the aggregation should be the sum of each covariate (except time) over the current day and the three preceding days. Hence, the three aggregated predictors are as follows

$$\begin{aligned}&\mathtt{tmp}_{i}=\sum _{j=i-3}^{i}\mathtt{tempd}_{j}, \quad \mathtt{o3}_{i}=\sum _{j=i-3}^{i}\mathtt{o3median}_{j}, \\&\mathtt{pm10}_{i}=\sum _{j=i-3}^{i}\mathtt{pm10median}_{j}. \end{aligned}$$

Assuming that the observed numbers of daily deaths are independent Poisson random variables, the following additive model structure can be considered

Model 1: \( \log \left\{ \mathrm E (\mathtt{death}_{i})\right\} = f_{1}(\mathtt{time}_{i})+m_{2}(\mathtt{pm10}_{i})+ m_{3}(\mathtt{o3}_{i})+f_{4}(\mathtt{tmp}_{i}), \)

where monotonically increasing constraints are assumed on \(m_{2}\) and \(m_{3},\) since increased air pollution levels are expected to be associated with increases in mortality. The plots for assessing the suitability of this model, together with the plots of the smooth estimates, are given in Supplementary material, S.8. This model indicates that, although the effect of the ozone level has only one degree of freedom, it is positive and increasing. The rapid increase in the smooth of aggregated mean temperature can be explained by the four highest daily death rates occurring on four consecutive days of very high temperature, which also experienced high levels of ozone (Wood 2006a).
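A hedged sketch of how model 1 might be specified with the scam package, assuming a hypothetical data frame chicago holding the response and the aggregated covariates defined above, and that unconstrained terms may use standard mgcv bases; the basis dimensions are illustrative:

```r
## Hypothetical data frame `chicago` containing death, time and the aggregated
## covariates pm10, o3 and tmp from the display above.
m_air1 <- scam(death ~ s(time, k = 200, bs = "cr") + s(pm10, bs = "mpi") +
                 s(o3, bs = "mpi") + s(tmp, bs = "ps"),
               family = poisson, data = chicago)
summary(m_air1)
```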

Since the combination of high temperatures together with high levels of ozone might be expected to result in higher mortality, we consider a bivariate smooth of these predictors. The following model is now considered

Model 2: \(\log \left\{ \mathrm E (\mathtt{death}_{i})\right\} = f_{1}(\mathtt{time}_{i})+m_{2}(\mathtt{pm10}_{i}) +m_{3}(\mathtt{o3}_{i},\mathtt{tmp}_{i}), \)

where \(m_{2}(\mathtt{pm10}_{i})\) is a monotone increasing function and \(m_{3}(\mathtt{o3}_{i},\mathtt{tmp}_{i})\) is subject to single monotonicity along the first covariate. The diagnostic plots of this model showed a slight improvement in comparison to the first model (Supplementary material, S.8). The estimates of the univariate smooths and a perspective plot of the estimated bivariate smooth of model 2 are illustrated in Fig. 8. The second model also has a lower \({\fancyscript{V}}_u\) score, which suggests that model 2 is preferable.

Fig. 8 The estimates of the smooth terms of model 2 (air pollution data). A cubic regression spline was used for \(f_1\) with \(q=200,\) SCOP-spline of the third order with \(q=10\) for \(m_2,\) and bivariate SCOP-spline with the marginal basis dimensions \(q_1=q_2=10\) for \(m_3\)

The current approach has been applied to air pollution data for Chicago for demonstration purposes only. It would be of interest to apply the same model to other cities, to see whether the relationship between non-accidental mortality and air pollution can be described by the proposed SCAM in other locations.

6 Discussion

In this paper a framework for generalized additive modelling with a mixture of unconstrained and shape restricted smooth terms, SCAM, has been presented and evaluated on a range of simulated and real data sets. The motivation for this framework is to develop general methods for estimating SCAMs similar to those for a standard unconstrained GAM. SCAMs allow the inclusion of multiple unconstrained and shape constrained smooths of both univariate and multi-dimensional type, with the shape constrained terms represented by the proposed SCOP-splines. It should be mentioned that the shape constraints are imposed via a condition that is sufficient but not necessary for cubic and higher order splines. However, for cubic splines this condition is equivalent to that of Fritsch and Carlson (1980), who showed that the sufficient parameter space constitutes a substantial part of the necessary parameter space (see their Fig. 2, p. 242). Also, the sensitivity analysis of Brezger and Steiner (2008) on an empirical application supports the view that the sufficient condition is not highly restrictive.

Since a major challenge for any flexible regression method is its implementation in a computationally efficient and stable manner, numerically robust algorithms for model estimation have been presented. The main benefit of the procedure is that smoothing parameter selection is incorporated into the SCAM parameter estimation scheme, which also produces interval estimates at no additional cost. The approach has the \(O(nq^2) \) computational cost of standard penalized regression spline based GAM estimation, but typically involves 2–4 times as many \(O(nq^2)\) steps because of the additional non-linearities required for the monotonic terms, and the need to use quasi-Newton in place of full Newton optimization. However, in contrast to the ad hoc methods of choosing the smoothing parameters used in other approaches, smoothing parameter selection for SCAMs is well founded. It should also be mentioned that although the simulation free intervals proposed in this paper show good coverage probabilities, it might be of interest to see whether Bayesian confidence intervals derived from a posterior distribution simulated via MCMC would give better results.