Following the cohort component approach, the number of Indigenous persons aged 5–9 in a given ILOC in 2011 will be the number who were aged 0–4 in 2006, minus deaths, plus net migration (including internal migration) and net changes associated with identification of Indigenous status. The data set-up is demonstrated in Fig. 1, where the arrows trace the cohort movement through time. Using \(j\) to index the age group categories, it can be seen that there are 16 such cohort progressions or flows in which the population (\(P\)) in 2011 (\(t\)) can be related to the population in the earlier age category in 2006 (\(t - 1\)):
$$\begin{aligned} P_{j-1, t-1} \rightarrow P_{jt}. \end{aligned}$$
(1)
With population data for males and females, this gives 32 flows observed for each ILOC. This can be treated as a multi-level regression framework in which there are 32 observations for each ILOC. Following the convention in the econometric multi-level modelling literature (Rabe-Hesketh and Skrondal 2008, p. 65), the entity or cluster is denoted by subscript \(i\) and the occasions providing repeat observations for that cluster by subscript \(j\). Hence, in this case communities are denoted by subscript \(i\) (\(i\) = 1–618) and the 2011 age groups by subscript \(j\) (\(j\) = 2–17). For convenience, the gender distinction is ignored for the purposes of setting out the model. From Eq. 1, a modelling framework can be developed which incorporates clustering at the ILOC level,
$$\begin{aligned} P_{ijt} = f(P_{i,j-1,t-1}). \end{aligned}$$
(2)
At this stage, the cohort model does not take into account any changes in population due to births, deaths, migration and changes in the identification of Indigenous individuals. However, clearly all of these components will affect observed population levels. In the empirical approach described more fully below, we have an explicit fertility model to account for the first, and we include death rates in the model to address the second. However, it is not clear how to address the final two components in the empirical strategy, and therefore we assume that these rates are relatively constant over time, such that when we model changes (see below) in population levels, such systematic biases in the population level counts are effectively “differenced out”.
However, modelling current population levels as a function of past levels, especially over a small number of time periods, raises several econometric concerns, such as endogeneity, non-stationarity and the spurious regression problem. Indeed, in a simple regression, the lagged dependent variable clearly exhibited signs of these issues, with the estimated coefficient being very close to unity in value and with an extremely high \(t\)-statistic. Thus, the preferred strategy is to model the change in population for each cohort. That is, the dependent variable becomes:
$$\begin{aligned} c_{ijt} = P_{ijt} - P_{i,j-1,t-1}. \end{aligned}$$
(3)
As such there are 32 observations of \(c_{ijt}\) for each ILOC (\(j = 2\) to 17 for males and females). Referring back to Fig. 1, the model estimates changes in the population of people aged 5–9 in 2011 from the number aged 0–4 in 2006; of people aged 10–14 in 2011 from the number aged 5–9 in 2006; and so on. However, it cannot provide estimates for the population aged 0–4 in 2011, as there is no younger cohort in 2006 to use as the baseline. To enable projections for the total populations by ILOC, a separate fertility model is developed to generate estimates of the number of males and females aged 0–4 in 2016, which we detail below.
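As a minimal sketch of how the dependent variable in Eq. 3 is built, the following computes the cohort changes for a single ILOC and gender, together with the lowest value each change could take. The counts and the function name `cohort_changes` are hypothetical and are used for illustration only; they are not taken from the Census data.

```python
# Hypothetical counts for one ILOC and one gender, by five-year age group.
# Index 0 is age 0-4, index 1 is age 5-9, and so on (illustrative numbers only).
pop_2006 = [12, 9, 7, 5, 0]
pop_2011 = [10, 11, 9, 4, 1]


def cohort_changes(prev, curr):
    """Return (change, lower_bound) for each cohort flow from group j-1 to group j."""
    flows = []
    for j in range(1, len(curr)):
        change = curr[j] - prev[j - 1]   # c_ij = P_{ij,t} - P_{i,j-1,t-1}
        lower_bound = -prev[j - 1]       # a cohort cannot lose more people than it had
        flows.append((change, lower_bound))
    return flows


flows = cohort_changes(pop_2006, pop_2011)
for change, bound in flows:
    print(f"change = {change:+d}, lower bound = {bound}")
```

Note that the youngest 2011 group (index 0) has no entry in `flows`, which mirrors the need for a separate fertility model.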
4.2 Statistical model
Stochastically then, population changes can be modelled as:
$$\begin{aligned} c_{ijt} = {\mathbf {x}}^{'}_{i,j,t-1} \ {\varvec{\beta }} + u_{i} + \lambda _{j} + \varepsilon _{ijt} \quad \text{where} \quad \varepsilon _{ijt} \sim N(0,\sigma ^{2}) \end{aligned}$$
(5)
and \({\mathbf {x}}_{i,j,t-1}\) represents a vector containing the independent variables (all defined in the base year 2006); \(\varvec{\beta }\) is a vector of coefficients to be estimated; \(u_{i}\) and \(\lambda _{j}\) denote the unobservable community-level (ILOC) and cohort effects, respectively. Lastly, \(\varepsilon _{ijt}\) denotes the usual errors in the model.
The above model (Eq. 5) can be estimated by simple linear regression. However, this is not appropriate for the “small numbers” model we are dealing with here, as the data will necessarily be censored in many instances because the dependent variable has an effective lower bound. That is, any population cannot decrease by more than the starting value. For example, consider an ILOC with five Indigenous males aged 20–24 in 2006. The change in the population from 2006 to 2011 can only range from \(-5\) upwards, given that the population of males aged 25–29 in 2011 cannot be less than zero.
Note also that if the population of males aged 25–29 in 2011 is also five, then there has been no change and the dependent variable \(c_{ijt}\) equals zero. This zero is a legitimate value indicating no change in the population, and indeed in any estimation, the expected value of \(c_{ijt}\) would (should) accordingly be close to zero. However, consider the case in which there are no individuals in a particular age-by-gender category in 2006, as is common for older age cohorts. In this case, the population can increase \((c_{ijt} > 0)\), but has a lower bound of zero, and the probability of observing zero change is much higher than for observations with a positive initial population. In such situations where there is a ‘latent’ potential change in the population that cannot be observed because of the effective lower bound, linear regression models will produce biased and inconsistent results, which will worsen with the extent of such censoring/boundary observations (Amemiya 1984; Greene 2003).
A preferable approach is to implement a Tobit model with varying censoring limits. As noted above, in the current context the limit is equal to the 2006 population for a specific gender-by-age group at a given ILOC. This differs from the usual Tobit model set-up, where lower (and/or upper) limits are usually fixed (commonly at zero) and the same for all observations in the sample.
To be clear, a standard Tobit model typically has a fixed lower (or upper) limit, often a lower bound at zero. The situation here is subtly, but importantly, different. With a starting population of \(X\) (with \(X\ge 0\)), the negative of this value provides an effective lower bound for the change in the next period: by definition, the change cannot be \(-Y\) where \(|Y|>X\), as it is not possible to have negative actual population levels. That is, while population changes can lie anywhere on the real number line, population levels must be \(\ge 0\). This latter condition has direct implications for the allowable range of the former. The situation is somewhat complicated here because in the cohort component model employed, as defined by Eqs. (1) and (2), the cohort necessarily ages from \(j-1\) to \(j\), as in Eq. (3). In this way, each \(c_{ijt}\) observation faces an effective, and binding, lower bound of not \(-P_{i,j,t-1}\), but \(-P_{i,j-1,t-1}\), due to the necessary ageing of the cohort. Thus, the usual fixed lower limit Tobit model has to be amended to an observation-varying one defined by the variable \(-P_{i,j-1,t-1}\).
The previous model (Eq. 5) can be updated to reflect this possible censoring as
$$\begin{aligned} c^*_{ijt}={\alpha }_i+ {\mathbf {x}}^{'}_{i,j,t-1} \ {\varvec{\beta }} + {\varepsilon }_{ijt} \quad \text{where} \quad {\ \varepsilon }_{ijt}\sim N\left( 0,{\sigma }^2\right) \end{aligned}$$
(6)
and \(c^*_{ijt}\) now denotes the latent underlying change in ILOC \(i\) for age group \(j\) at time \(t\) (2011). However, this cannot be fully observed because the change cannot be less than the negative of the cohort's 2006 population. In other words, there is lower tail censoring such that only \(c_{ijt}\) is observed. Hence,
$$\begin{aligned} \text{if} \ \ c^*_{ijt} < -P_{i,j-1,t-1}\quad \text{then} \quad c_{ijt} = -P_{i,j-1,t-1}. \end{aligned}$$
(7)
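The censoring rule in Eq. 7 can be sketched as a small function mapping a latent change to the observed change and a censoring flag. This is an illustrative sketch only: the function name `observe` and the example values are assumptions, with the first two examples based on the ILOC with five males aged 20–24 described earlier.

```python
def observe(latent_change, base_population):
    """Apply lower-tail censoring at -P_{i,j-1,t-1}.

    Returns (observed change, d), where d = 1 if the latent change is
    observed and d = 0 if it is censored at the limit.
    """
    limit = -base_population
    if latent_change <= limit:
        return limit, 0        # censored: the cohort has fallen to zero
    return latent_change, 1    # uncensored: the latent change is observed


# Five males aged 20-24 in 2006: the observed change cannot fall below -5.
print(observe(-7.3, 5))   # latent fall of 7.3 is recorded as -5
print(observe(-2.0, 5))   # latent fall of 2 is observed as is
# Empty cohort in 2006: the limit is zero, so only increases are observed.
print(observe(-0.4, 0))
print(observe(1.5, 0))
```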
In contrast to the standard Tobit model in which the lower limit is assumed to be a fixed value for all \(i\) and \(j\), the proposed framework uses the (negative of the) 2006 population as a varying lower limit for each observation. Like the linear model, the Tobit model can be estimated assuming either random or fixed effects, although the latter will suffer from the well-known incidental parameters problem if the dimension over which these are constant is ‘small’ (Greene 2012). Additionally, estimating fixed effects for a large number of ILOCs (over 600) is problematic. With regard to modelling cohort effects, these were incorporated into the explanatory variables. Hence, after extensive modelling, it was established that the Tobit model with random effects provided a better fit.
The data used for developing the model contain only one observation on each \(c_{ijt}\), i.e. the change from 2006 (\(t - 1\)) to 2011 (\(t\)). In this sense, the model is cross-sectional rather than longitudinal, and as such, the time subscript can be omitted. With 2016 Census data now available, further work is planned to move to a true multi-level panel structure that should provide more rigorous estimation of community-level unobserved effects. For now, the random-effects Tobit model can be expressed as
$$\begin{aligned} c^{*}_{ij} = {\mathbf {x}}^{'}_{ij} \ {\varvec{\beta }} + \varepsilon _{ij} + u_{i}. \end{aligned}$$
(8)
Following usual practice, the identifying assumption is that \(\varepsilon _{ij}\) and \(u_{i}\) are both normally distributed, with zero means and variances of \(\sigma ^{2}\) and \(\omega ^{2}\), respectively. The data are observed as \(c_{ij} = \text{max}(L_{ij},c^{*}_{ij})\) where \(L_{ij} = -P_{i,j-1}\). In this context, this is an example of lower tail censoring: the change in population over the next time period cannot be less than the negative of the number of people (in the preceding age group) currently residing in the ILOC. As per usual, the \(\varepsilon _{ij}\) is assumed to be uncorrelated across ILOCs. To derive the log likelihood function, the focus here is on the conditional density \(f(c_{ij}|u_{i})\). Let the dummy variable \(d_{ij} = 1\) indicate that \(c_{ij} > L_{ij}\). This is the uncensored case, and \(d_{ij} = 0\) for censored cases. Based on the above identifying assumptions, the conditional density of \(c_{ij}\) can then be expressed as
$$\begin{aligned} f(c_{ij}|u_{i}, \ d_{ij}=0) = P(c^{*}_{ij} \le L_{ij}|u_{i}) = \Phi \left( \frac{L_{ij} - {\mathbf {x}}^{'}_{ij} \ {\varvec{\beta }} - u_{i}}{\sigma } \right) \end{aligned}$$
for censored cases and
$$\begin{aligned} f(c_{ij}|u_{i}, \ d_{ij}=1) = \frac{1}{\sigma } \phi \left( \frac{c_{ij} - {\mathbf {x}}^{'}_{ij} \ \varvec{\beta } - u_{i}}{\sigma } \right) \end{aligned}$$
for uncensored cases (Greene 2012), where \(\Phi\) and \(\phi\), respectively, denote the c.d.f. and p.d.f. of the standardised normal distribution. Combining the above two cases,
$$\begin{aligned} f(c_{ij}|u_{i}) = [f(c_{ij}|u_{i}, \ d_{ij}=0)]^{1-d_{ij}} \times [f(c_{ij}|u_{i},d_{ij}=1)]^{d_{ij}}. \end{aligned}$$
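The two cases of the conditional density can be sketched in a few lines using only the Python standard library. This is an illustrative sketch rather than the estimation code used for the paper: the function names are hypothetical, and the linear index \({\mathbf {x}}^{'}_{ij}\varvec{\beta }\) is passed in directly as the number `xb`.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def cond_density(c, limit, xb, u, sigma):
    """f(c_ij | u_i): a probability mass if censored, a density otherwise."""
    d = 1 if c > limit else 0
    if d == 0:
        # Censored case: P(c* <= L | u)
        return norm_cdf((limit - xb - u) / sigma)
    # Uncensored case: (1 / sigma) * phi((c - x'b - u) / sigma)
    return norm_pdf((c - xb - u) / sigma) / sigma


# Censored observation whose linear index sits exactly at the limit.
print(cond_density(c=-5, limit=-5, xb=-5.0, u=0.0, sigma=2.0))
# Uncensored observation at the centre of its distribution.
print(cond_density(c=0.0, limit=-5, xb=0.0, u=0.0, sigma=1.0))
```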
Assuming independence, the joint density of all observations in a group can be expressed as
$$\begin{aligned} f(c_{i1},c_{i2},\dots ,c_{iT_{i}}|u_{i}) = \prod ^{T_{i}}_{j=1} f(c_{ij}|u_{i}). \end{aligned}$$
Based on the above results, and with \(T_{i}\) denoting the number of observations for ILOC \(i\), the log likelihood function of this model can be written as
$$\begin{aligned} \text{log} \ L = \sum ^{n}_{i = 1} \text{log} \left\{ \int ^{\infty }_{-\infty } \frac{1}{\omega \sqrt{2 \pi }} \text{exp}\left( -\frac{u_{i}^{2}}{2\omega ^{2}}\right) \prod ^{T_{i}}_{j=1} \left[ \Phi \left( \frac{L_{ij} - {\mathbf {x}}^{'}_{ij} \ \varvec{\beta } - u_{i}}{\sigma } \right) \right] ^{1-d_{ij}}\left[ \frac{1}{\sigma } \phi \left( \frac{c_{ij} - {\mathbf {x}}^{'}_{ij} \ \varvec{\beta } - u_{i}}{\sigma } \right) \right] ^{d_{ij}} du_{i} \right\}. \end{aligned}$$
Lastly, values of \(\varvec{\beta }, \sigma\) and \(\omega\) are found such that this function is maximised. The integrals can be computed using Gauss–Hermite quadrature (or by simulation) and the function can be maximised using standard nonlinear optimisation methods.
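To illustrate how Gauss–Hermite quadrature handles the integral over \(u_{i}\), the sketch below evaluates the log likelihood contribution of one ILOC. Substituting \(u_{i} = \sqrt{2}\,\omega x\) turns the normal-weighted integral into \(\frac{1}{\sqrt{\pi }}\int e^{-x^{2}} g(\sqrt{2}\,\omega x)\,dx\), which the rule approximates by a weighted sum. Several assumptions apply: a three-point rule is used only because its nodes and weights have simple closed forms (applied work would normally use many more points), the linear index values `xb` are supplied directly rather than estimated, and all names and data are hypothetical. This is not the Limdep/Nlogit implementation.

```python
import math

# Three-point Gauss-Hermite rule for the weight function exp(-x^2).
SQRT_PI = math.sqrt(math.pi)
GH_NODES = (-math.sqrt(1.5), 0.0, math.sqrt(1.5))
GH_WEIGHTS = (SQRT_PI / 6.0, 2.0 * SQRT_PI / 3.0, SQRT_PI / 6.0)


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def group_loglik(c, limits, xb, sigma, omega):
    """Log marginal likelihood for one ILOC, integrating out u_i by quadrature."""
    total = 0.0
    for x, w in zip(GH_NODES, GH_WEIGHTS):
        u = math.sqrt(2.0) * omega * x      # change of variables u = sqrt(2) * omega * x
        prod = 1.0
        for c_ij, l_ij, xb_ij in zip(c, limits, xb):
            if c_ij > l_ij:                 # uncensored observation (d_ij = 1)
                prod *= norm_pdf((c_ij - xb_ij - u) / sigma) / sigma
            else:                           # censored observation (d_ij = 0)
                prod *= norm_cdf((l_ij - xb_ij - u) / sigma)
        total += w * prod
    return math.log(total / SQRT_PI)


def loglik(groups, sigma, omega):
    """Sum of the group contributions; this is the function to be maximised."""
    return sum(
        group_loglik(g["c"], g["limits"], g["xb"], sigma, omega) for g in groups
    )


# Hypothetical ILOC with three cohort flows, the first censored at its limit.
example = {"c": [-5, 0.0, 2.0], "limits": [-5, -3, 0], "xb": [-4.0, 0.5, 1.0]}
print(loglik([example], sigma=2.0, omega=1.0))
```

In practice, `loglik` would be passed (with a negative sign) to a numerical optimiser over \(\varvec{\beta }\), \(\sigma\) and \(\omega\), with `xb` recomputed from the covariates at each step.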
We note that although the above statistical models are rather complex, they can be estimated routinely in standard commercial software. For example, here we used the Limdep/Nlogit version 6 package, although Stata version 16 could also be used. For these reasons, the suggested approach is easy to use and implement, and could therefore be widely applicable to any other research areas characterised by sparse populations.
In order to model change, several variables were considered. For example, a priori one can expect the initial population size of the ILOC to be a factor that affects change. In addition, age group, remoteness level, state and gender are also factors. Furthermore, interaction terms are included to allow for possible differential effects of age by ILOC size and remoteness. Previous studies have identified trends in which younger Indigenous people tend to move away from smaller, more remote communities into larger regional centres, while older people tend to move back to country (Biddle 2009). As such, we tested all available variables and various interaction terms to allow for a flexible specification. The results indicated that interaction terms were clearly preferred, since the majority of them were significant. For example, in the case of ILOC size and age group, 11 of the 15 possible interaction terms are significant.
A dummy variable was also included to indicate whether the community was nominated as a Territory Growth Town under the Northern Territory Government’s 2009 Working Future policy (Sanders 2010).
Also included in the model as a covariate are survival rates for each age group (taken from separately available ABS projections of the total Indigenous population by age). As expected, these rates are an important factor for modelling population change. The survival rates are calculated for each gender and age group. Each rate is the ratio of the number of individuals in age group \(j\) at time \(t\) to the number of individuals in age group \(j-1\) at time \(t-1\). The survival rates are close to unity for younger cohorts and decline to under 0.7 for cohorts beyond the age of 70 years. Not surprisingly, this variable is highly significant in the final model. Note that we assume the survival rates for each gender and age group are constant over the short to medium term. For further details on all the variables included in the model, please refer to Table 2. Descriptive statistics for all variables are provided in “Appendix 1”.
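The apparent survival rate described above is a simple ratio across adjacent age groups and adjacent Censuses. The sketch below illustrates the calculation for one gender; the counts are hypothetical and are not taken from the ABS projections, and the function name `survival_rates` is an assumption.

```python
def survival_rates(pop_prev, pop_curr):
    """Ratio of age group j at time t to age group j-1 at time t-1."""
    return [pop_curr[j] / pop_prev[j - 1] for j in range(1, len(pop_curr))]


# Hypothetical national counts for one gender by age group (illustrative only).
national_2006 = [1000, 800, 400, 200]
national_2011 = [1050, 990, 784, 272]

rates = survival_rates(national_2006, national_2011)
print([round(r, 2) for r in rates])
```

Because the ratio is taken from population estimates rather than death records, it is an apparent rate and is not mechanically bounded above by one.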
Table 2 Definitions of variables in the data set
Variable | Definition |
ILOC size | The natural logarithm of the resident total population of the ILOC in 2006 (including Indigenous, other Australians and those for whom Indigenous status is not stated) |
Remote, Very remote | Two mutually exclusive dummy variables indicating whether the ILOC is in ARIA level 4 (Remote) or ARIA level 5 (Very remote). The omitted or ‘reference’ category is outer regional (ARIA level 3) |
Victoria, Queensland, South Australia, Western Australia, Tasmania, Northern Territory | Six mutually exclusive dummy variables indicating the state or territory of the ILOC. New South Wales is the omitted category. There are no outer regional, Remote or Very remote ILOCs in the Australian Capital Territory |
Female | Dummy variable is equal to 1 if the observation is for a female cohort and is equal to 0 if it relates to a male cohort |
Growth town | Dummy variable taking on a value of 1 if the ILOC contains or corresponds to one of the communities nominated as growth towns under the Northern Territory Government’s Working Future policy announced in 2009. While the policy named 20 towns, one of these (Daguragu–Kalkarindji) falls across two ILOCs (Daguragu and Kalkarindji), meaning there are 21 ILOCs coded with a value of 1 |
Age 10–14, Age 15–19, ... Age 75–79, Age 80+ | Fifteen mutually exclusive dummy variables indicating the age of the cohort in 2011. The omitted category is age 5–9 |
Survival rate | Based on ABS Catalogue 3238.0, Estimated and projected Aboriginal and Torres Strait Islander population, Series B for Australia (ABS 2014). The ratio of the estimated population in each age cohort \(j\) in 2011 to the estimated population of age cohort \(j-1\) in 2006. This gives an age-specific apparent survival rate and is calculated separately for males and females |
ILOC size*age interaction terms | Fifteen separate variables are generated by interacting the continuous ILOC size variable with the age group dummies. The omitted age category is age 5–9. The coefficients on these variables indicate whether, within each specific age group, there is any further effect of community size in addition to the average effect of community size observed across all age cohorts |
Outer regional × age and Remote × age interaction terms | Thirty separate dummy variables generated by interacting the outer regional dummy variable (ARIA level 3) with age cohort and the Remote dummy variable (ARIA level 4) with age cohort. The omitted categories are Very remote (ARIA level 5) and Age 5–9. The coefficients on these variables indicate whether, within each specific age group, there is any further effect of remoteness in addition to the average effects observed across all age cohorts |