1 Introduction

Boosting is a successful technique for training classifier for both binary and multi-class classification (Freund and Schapire 1995; Schapire and Singer 1999). In this paper, our focus is on multiclass LogitBoost (Friedman et al. 1998), one of the popular boosting variants. Originally, LogitBoost was motivated by a statistical perspective (Friedman et al. 1998), where boosting algorithm consists of three key components: the loss, the function model, and the optimization algorithm. In the case of LogitBoost, these are the multi-class Logistic loss, the use of additive tree models, and a stage-wise optimization, respectively. For \(K\)-nary classification, LogitBoost directly learns a \(K\) dimensional vector as the classifier output, each component representing the confidence of predicting the corresponding class.

There are two important factors in LogitBoost’s settings. Firstly, an “invariant property” is implied by the Logistic loss, i.e., adding a constant to each component of the classifier output won’t change the loss value. Therefore, the classifier output minimizing the total loss is not unique, making the optimization procedure a bit difficult. Secondly, the Logistic loss produces a dense Hessian matrix, causing troubles when deriving the tree node split gain and node value fitting. It is challenging to design a tractable optimization algorithm that fully handles both these factors. Consequently, some simplification and/or approximation is needed.

In Friedman et al. (1998) the Hessian is diagonally approximated. In this way, the minimizer becomes unique, while the optimization—essentially a quadratic problem when using one step Newton—is substantially simplified. Consequently, at each Boosting iteration the tree model updating collapses to \(K\) independent Weighted Regression Tree fittings, each tree outputting a scalar.

Unfortunately, Friedman’s prescription turns out to have some drawbacks. The over simplified quadratic loss even doesn’t satisfy the invariant property, and is thus a very crude approximation. A later improvement, ABC-LogitBoost, is shown to outperform LogitBoost in terms of both classification accuracy and convergence rate (Li 2008, 2010b). This is due to ABC-LogitBoost’s careful handling of the above key problems of the LogitBoost setting. The invariant property is addressed by adding a sum-to-zero constraint to the output vector. To make the minimizer unique, at each iteration one variable is eliminated while the other \((K-1)\) are free, so that the loss is re-written with \((K-1)\) variables. At this point, the diagonal Hessian approximation is, again, adopted to permit \(K-1\) scalar trees being independently fitted for each of the \(K-1\) classes. The remaining class—called the base class—is recovered by taking minus the sum of the other \(K-1\) classes. The base class—i.e., the variable to be eliminated—is selected adaptively per iteration (or every several iterations), hence the acronym ABC (Adaptive Base Class). Note the diagonal Hessian approximation in ABC-LogitBoost is taken for the \((K-1)\) dimensional problem yielded by eliminating a redundant variable. Li (2008, 2010b) considers it a more refined approximation than that of original LogitBoost (Friedman et al. 1998).

In this paper, we propose two novel techniques to address the challenging aspects of the LogitBoost setting. In our approach, one vector tree is added per iteration. To make the minimizer unique, we allow a \(K\) dimensional sum-to-zero vector to be fitted for each tree node. This permits us to explicitly formulate the computation for both node split gain and node value fitting as a \(K\) dimensional constrained quadratic optimization, which arises as a subproblem in the inner loop for split seeking when fitting a new tree. To avoid the difficulty of a dense Hessian, we propose that for each of these subproblems, only two coordinates (i.e., two classes or a class pair) are adaptively selected for updating, hence we call the modified algorithm AOSO-LogitBoost (Adaptive One vS One). Figure 1 gives an overview of our approach. In Sect 4.4 we show that first order and second order approximation of loss reduction can be a good measure for the quality of selected class pair.

Fig. 1
figure 1

A newly added tree at some boosting iteration for a 3-class problem. a A class pair (shown in brackets) is selected for each tree node. For each internal node (filled), the pair is for computing split gain; For terminal nodes (unfilled), it is for node vector updating. b The feature space (the outer black box) is partitioned by the tree in (a) into regions \(\{R_1,R_2,R_3\}\). On each region only two coordinates are updated based on the corresponding class pair shown in (a)

Following the above formulation, ABC-LogitBoost (derived using a somewhat different framework than Li 2010b) can thus be shown to be a special case of AOSO-LogitBoost with a less flexible tree model. In Sect. 5 we compare the differences between the two approaches in detail and provide some intuition for AOSO’s improvement over ABC.

Both ABC or AOSO are carefully devised to address the difficulties of dense Hessian matrix arising in Logistic loss. In other words, the tree model learning would be easy if we were encountering diagonal Hessian matrix. Based on loss design in Primal/Dual view (Masnadi-Shirazi and Vasconcelos 2010; Reid and Williamson 2010), we show that the dense Hessian matrix essentially results from the sum-to-one constraint of class probability in Logistic loss. We thus investigate the possibility of diagonal Hessian matrix by removing such a sum-to-one constraint. The LogitBoost variant we produce does work. However, it is still inferior to AOSO LogitBoost (see Sect. 6 for detailed discussion). We therefore conclude that modifications of the original LogitBoost, such as AOSO and ABC, are necessary for efficient model learning since the dense Hessian matrix seems unavoidable.

The rest of this paper is organised as follows: In Sect. 2 we first formulate the problem setting for LogitBoost. In Sect. 3 we briefly review/discuss original/ABC-/AOSO- LogitBoost in regards of the interplay between the tree model and the optimisation procedure. In Sect. 4 we give the details of our approach. In Sect. 5 we compare our approach to (ABC-)LogitBoost. In Sect. 6 we show how to design a loss that produces diagonal Hessian matrices and assess its implications for model accuracy. In Sect. 7, experimental results in terms of classification errors and convergence rates are reported on a range of public datasets.

2 The problem setup

For \(K\)-class classification (\(K \ge 2\)), consider an \(N\) example training set \(\{\varvec{x}_i, y_i\}_{i=1}^N\) where \(\varvec{x}_i\) denotes a feature value and \(y_{i}\in \{1,\ldots ,K\}\) denotes a class label. From the training set a prediction function \(\varvec{F}(\varvec{x}) \in \mathbb {R}^K\) is learned. When clear from context, we omit the dependence on \(\varvec{x}\) and simply denote \(\varvec{F}(\varvec{x})\) by \(\varvec{F}\) (We will do the same to other related variables.). Given a test example with known \(\varvec{x}\) and unknown \(y\), we predict a class label by taking \(\hat{y} = \arg \max _{k} F_k, k=1,\ldots ,K\), where \(F_k\) is the \(k\)-th component of \(\varvec{F}\).

The function \(\varvec{F}\) is obtained by minimizing a target function on training data:

$$\begin{aligned} {\text {Loss}} = \sum _{i=1}^{N} L(y_i, \varvec{F}_i). \end{aligned}$$
(1)

Each \(L(y, \varvec{F})\) term is a loss for a single training example \((\varvec{x},y)\). We will make the loss concrete shortly.

To make the optimization of (1) feasible, a function model is needed to describe how \(\varvec{F}\) depends on \(\varvec{x}\). For example, linear model \(\varvec{F}= \varvec{W}^{T}\varvec{x}+ \varvec{b}\) is used in traditional Logistic regression, while Generalized Additive Model is adopted in LogitBoost where

$$\begin{aligned} \varvec{F}(\varvec{x}) = \textstyle \sum \limits _{m=1}^{M}\varvec{f}^{(m)}(\varvec{x}), \end{aligned}$$
(2)

where \(\varvec{f}^{(m)}(\varvec{x}) \in \mathbb {R}^K\) is further expressed by tree(s), as will be seen shortly.

Each \(\varvec{f}^{(m)}(\varvec{x})\), or simply \(\varvec{f}\), is learned by the so-called greedy stage-wise optimization. That is, at each Boosting iteration \(\varvec{f}^{(m)}\) is added only based on \(\varvec{F}= \sum _{j=1}^{m-1} \varvec{f}^{(j)}\).

In summary, the learning procedure, called training, consists of three ingredients: the loss, the  function model and the optimization algorithm. In what follows we discuss them in greater details.

2.1 The logistic loss

LogitBoost adopts the Logistic loss (Friedman et al. 1998), which is an implicit function of \(y\) and \(\varvec{F}\):

$$\begin{aligned} L(y, \varvec{F}) = -\sum _{k=1}^K r_k\log (p_k), \end{aligned}$$
(3)

where \(r_k = 1\) if \(k=y\) and 0 otherwise, \(\{p_k\}_{k=1}^K\) is connected to \(\varvec{F}\) via the so-called Link function:

$$\begin{aligned} p_k = \frac{ \exp (F_k) }{ \sum _{j=1}^K\exp (F_j) }. \end{aligned}$$
(4)

Therefore, \(\varvec{p}= (p_1,\ldots ,p_K)^T\) can be seen as probability estimate since \(p_k \geqslant 0\) and \(\sum _{k=1}^K p_k = 1\).

2.1.1 The invariant property

It is easy to verify that an “invariant property” holds for Logistic loss:

$$\begin{aligned} L(y,\varvec{F}) = L(y,\varvec{F}+ c\varvec{1}), \end{aligned}$$
(5)

where \(c\) is an arbitrary constant, \(\varvec{1}\) is \(K\)-dimensional all 1 vector. That is to say, the Logistic loss won’t change its value when moving \(\varvec{F}\) along a all-1 vector \(\varvec{1}\), which implies that \(\varvec{F}\) is “equivalent to” \(\varvec{F}+ c\varvec{1}\) because they both lead to the same loss (see Fig. 2b). Such an invariant property is, in turn, consistent with the multi-class prediction rule \(\arg \max _{k} F_k, k = 1,\ldots ,K\) which says adding a constant to each of \(F_k\) does not alter the prediction.

Fig. 2
figure 2

Multiclass loss \(L(y=1,\varvec{F})\) for \(K\)-nary problem. In this figure \(K = 2\), thus \(\varvec{F}\in \mathbb {R}^2\). The Logistic loss is defined in (3); The Exponential loss is defined in (10). Note the parallel contour plots in the case of Logistic loss

2.1.2 Derivative and quadratic approximation

It is difficult to directly optimize the Logistic loss. To permit numeric method, we need the first order and second order derivatives of example wise loss (3). Simple matrix calculus shows that the gradient \(\varvec{g}\in \mathbb {R}^K\) and Hessian \(\varvec{H}\in \mathbb {R}^{K \times K}\) w.r.t. \(\varvec{F}\) are:

$$\begin{aligned} \varvec{g}= -(\varvec{r}- \varvec{p}), \end{aligned}$$
(6)
$$\begin{aligned} \varvec{H}= \hat{\varvec{P}}- \varvec{p}\varvec{p}^T, \end{aligned}$$
(7)

where the response \(\varvec{r}= (r_1,\ldots ,r_K)^T\) and probability estimate \(\varvec{p}= (p_1,\ldots ,p_K)^T\) are as defined before, the diagonal matrix \(\hat{\varvec{P}}= \mathrm {diag}(p_1,\ldots ,p_K)\).

With \(\varvec{g}\) and \(\varvec{H}\) given above, we can write down the Taylor expansion of (3) at \(\varvec{F}\) up to second order:

$$\begin{aligned} q(\varvec{f}|\varvec{F}) = L(y,\varvec{F}) + \varvec{g}^T\varvec{f}+ \frac{1}{2}\varvec{f}^T\varvec{H}\varvec{f}, \end{aligned}$$
(8)

which locally approximates \(L(y,\varvec{F}+ \varvec{f})\) in quadratic sense. Note that in (8) \(\varvec{f}\) is the variable, while the “constants” \(\varvec{g}\) and \(\varvec{H}\) depend on \(\varvec{F}\). When clear from context, we also omit the dependence on \(\varvec{F}\) and simply write \(q(\varvec{f}|\varvec{F})\) as \(q(\varvec{f})\).

It is easy to verify that the invariant property carries over to \(q(\varvec{f})\):

$$\begin{aligned} q(\varvec{f}) = q(\varvec{f}+c\varvec{1}) \end{aligned}$$
(9)

by noting \(\varvec{1}^T \varvec{g}= 0\) and \(\varvec{1}^T \varvec{H}= \varvec{H}\varvec{1}= 0\).

2.1.3 \((K-1)\) Degrees of freedom and the sum-to-zero constraint

Due to the invariant property, the minimizer \(\varvec{f}^*\) of (8) is not unique since any \(\varvec{f}^* = c\varvec{1}\) would be a minimizer. To pin-down the value, we can add a constraint \(\varvec{1}^T \varvec{f}= 0\), which in effect restricts \(\varvec{f}\) to vary just in the linear subspace defined by \(\varvec{1}^T \varvec{f}= 0\). Obviously, now we need only \(K-1\) coordinates to express the vector living in the subspace, i.e., the degrees of freedom is \(K-1\).

In Sect. 4.2 we will discuss the rank of the Hessian matrix \(\varvec{H}\), which provides another perspective to why the minimizer of (8) is not unique.

Conceptually, the invariant property is primary, causing the minimizer being not unique; The sum-to-zero constraint is secondary, serving as a mathematical tool to make the minimizer unique.

2.1.4 More on \((K-1)\) degrees of freedom

We don’t claim that the invariant property be a necessity. In the literature, there exists other multiclass loss not satisfying this condition. In this case, it doesn’t need a sum-to-zero constraint and \(\varvec{f}\) has \(K\) degrees of freedom. For instance, consider the \(K\)-nary exponential loss for AdaBoost.MH:

$$\begin{aligned} L(y, \varvec{F}) = \sum _{k=1}^K exp(-y_k^* \varvec{F}_k), \end{aligned}$$
(10)

where \(y_k^* = +1\) if \(y_k = k\) and \(-1\) otherwise. See Fig. 2(b) for its graph. In general, \(L(y, \varvec{F}) \ne L(y, \varvec{F}+ c\varvec{1})\) for Exponential loss.

It is out of this paper’s scope to compare Exponential loss and Logistic loss in multiclass case. Instead, we focus on the Logistic loss and show how it is applied in LogitBoost. Also, in Sect. 6 we discuss a modified Logistic loss not satisfying the invariant property (but simplifying the Hessian and easing the quadratic solver) and show its degraded performance when comparing with original Logistic loss.

2.2 The tree model

As mentioned previously, \(\varvec{F}(\varvec{x}) = \sum _{m=1}^{M} \varvec{f}^{(m)}(\varvec{x})\) is additive tree model. However, the way each \(\varvec{f}(\varvec{x})\) (we have omitted the subscript \(m\) for simplicity and without confusion) being expressed by tree model is not unique.

In this paper, we adopt a single vector-valued tree. Further, we let it be a binary tree (i.e., only binary splits on internal node are allowed) with splits that are vertical to coordinate axis, as in Friedman et al. (1998) and Li (2010b). Formally,

$$\begin{aligned} \varvec{f}(\varvec{x}) = \sum _{j=1}^J \varvec{t}_{j} I(\varvec{x}\in R_j) \end{aligned}$$
(11)

where \(\{ R_j \}_{j=1}^J\) are a rectangular partition of the feature space, \(I(P)\) is the indicator function which is 1 when the predicate \(P\) is true and 0 otherwise, and each \(\varvec{t}_j \in \mathbb {R}^K\) is the node value associated with \(R_j\). See Fig. 1 for an illustration. The vector-valued tree model is also adopted in other Boosting implementation, say, Real AdaBoost.MH described in Kégl and Busa-Fekete (2009).

\(\varvec{f}(\varvec{x})\) can be also represented by \(K\) scalar regression trees as in the original LogitBoost of Friedman et al. (1998) and the Real AdaBoost.MH implementation of Friedman et al. (1998):

$$\begin{aligned} \varvec{f}(\varvec{x}) = \sum _{k=1}^K \sum _{j=1}^{J} t_{k,j} I(\varvec{x} \in R_{k,j}), \end{aligned}$$
(12)

where \(t_{k,j}\) is a scalar and each tree can have its own partition \(\{R_{k,j}\}_{j=1}^J\).

Finally, it is possible to express \(fv(\varvec{x})\) with just \(K-1\) trees by adding to \(\varvec{f}(\varvec{x})\) a sum-to-zero constraint, as is adopted in Li’s series work (Li 2008, 2009, 2010b).

2.3 Stage wise optimization

Formally, using \(\varvec{F}_{i}\) and \(\varvec{f}_i\) as shorthand for \(\varvec{F}(\varvec{x}_i)\) and \(\varvec{f}(\varvec{x}_i)\) respectively, the stage wise optimization is:

$$\begin{aligned} \varvec{f}^{(m)}(\varvec{x}) = \arg \min _{\varvec{f}(\varvec{x})} \sum _{i=1}^{N} L(y_i,\varvec{F}_i + \varvec{f}_i), \end{aligned}$$
(13)

where \(\varvec{f}(\varvec{x})\) is expressed by tree model as explained in last subsection. This procedure repeats \(M\) times with initial condition \(F=0\). Owing to its iterative nature, we only need to know how to solve (13) in order to implement the optimization.

In (13) it is difficult to directly work on the Logistic loss; numeric method must be relied on. Friedman et al. (1998) suggests to use one step Newton, i.e., in (13) replace example wise Logistic loss \(L()\) with its corresponding second order Taylor expansion:

$$\begin{aligned} \varvec{f}^{(m)}(\varvec{x}) = \arg \min _{\varvec{f}(\varvec{x})} \sum _{i=1}^{N} q(\varvec{f}_i | \varvec{F}_i). \end{aligned}$$
(14)

3 Interplay between tree model and optimization

In last section we have reviewed the problem setup. In particular, we made concrete the three ingredients in LogitBoost: the loss, the function model and the optimization algorithm are the Logistic loss, the additive tree model and the stage wise optimization, respectively. To train a LogitBoost classifier, now the only thing left unexplained is how to optimize the quadratic optimization (14), which is still a bit complicated.

Basically, \(\varvec{f}(\varvec{x})\) on a training set \(\{ \varvec{x}_i, y_i \}_{i=1}^N\) can be seen as an \(K \times N\) matrix, as in Fig. 3, where the \(i\)-th column represents the \(K\) values of \(\varvec{f}(\varvec{x}_i)\). However, the values in the matrix can not vary independently. Instead, the values must be subject to a tree model, which determines a type of partition on the matrix such that the elements falling into the same cell take a common, unique value. Thus the tree model has an impact on the optimization procedure. Things become even subtle when considering the invariant property of the Logistic loss. The original/ABC-/AOSO-LogitBoost makes different choices to address this tree model-optimization interplay, as will be briefly discussed in the rest of this section. The details are deferred to next section.

Fig. 3
figure 3

Viewing \(\varvec{f}(\varvec{x})\) a \(K \times N\) matrix on a training dataset. A \(J\)-leaf tree corresponds a \(J\)-cell partition on the matrix such that the values falling into the same cell must have a common value. In this figure, \(K = 3\), \(N = 7\) and \(J = 3\). The squares in dashed lines denote the matrix elements, while the solid lines denote the partitions. Note that each cell needs not be continuous, although we intentionally plot it a continuous one for illustrative purpose. a \(K\) scalar trees. Each tree fits a class (a row) and has its own partition. b \((K-1)\) scalar trees. Each tree fits a class (a row) and has its own partition. The values on the remaining class (i.e. the base class) is recovered by the other classes. c only one vector-valued tree. Any column belongs to only one cell. At each cell, only two classes (two rows) can vary (shaded squares) and the others keep zero values (blank squares)

3.1 Original LogitBoost

To solve problem (14), LogitBoost (Friedman et al. 1998) takes further simplification. In the example wise quadratic loss (8), the Hessian matrix is diagonally approximated. Thus (8) collapses into \(K\) independent 1-dimensional loss, i.e., (8) is written as the sum of \(K\) terms each involving just one component of \(\varvec{f}\):

$$\begin{aligned}&q(\varvec{f}|\varvec{F}) = \sum _{k=1}^K q_k (f_k|\varvec{F}), \nonumber \\&q_k (f_k|\varvec{F}) = g_kf_k + \frac{1}{2}h_{kk}f_k^2, \end{aligned}$$
(15)

where \(h_{kk}\) is the \(k\)-th diagonal elements in \(\varvec{H}\) and we have without confusion omitted the constants that are irrelevant to \(\varvec{f}\). Noting the \(N\)-additive form and substituting back (15), we can also collapse (14) into \(K\) 1-dimensional, independent minimization problems:

$$\begin{aligned} f_k^{(m)}(\varvec{x}) = \arg \min _{f_k(\varvec{x})} \sum _{i=1}^{N} q_k(\varvec{f}_{i,k} | \varvec{F}_i), \quad k = 1,\ldots ,K, \end{aligned}$$
(16)

where \(\varvec{f}_{i,k} = f_k(\varvec{x}_i)\) means the \(f_k\) value corresponding to the \(i\)-th training examples.

Then \(K\) scalar trees are grown, the \(k\)-th tree approximately minimizing the \(k\)-th problem. Note that each tree can have its own partition, as in Fig. 3(a).

To grow the scalar tree, Friedman et al. (1998) borrowed a traditional regression model in statistics, namely the Weighted Regression Tree. The formulation becomes: for the \(k\)-th problem, fit a Weighted Regression Tree on the training examples \(\{\varvec{x}_i,y_i\}_{i=1}^N\) with targets \(\{ -g_{i,k}/h_{i,kk} \}_{i=1}^N\) and weights \(\{ h_{i,kk} \}_{i=1}^N\), where (in a slightly abuse of notation) we use an additional subscript \(i\) for \(g_k\) and \(h_{kk}\) to denote they correspond to the \(i\)-th training example.

3.2 ABC-LogitBoost

LogitBoost adopts a rather crude approximation to (8), where the Hessian matrix \(\varvec{H}\in \mathbb {R}^{K \times K}\) is diagonally approximated. ABC-LogitBoost (Li 2010b) considers an intuitively more refined way. Recall that \((K-1)\) degrees of freedom suffices to express (8), i.e., (8) can be re-written with just \((K-1)\) variables by eliminating one redundant variable. Then the new \((K-1) \times (K-1)\) Hessian matrix is, again, diagonally approximated. Li (2010b) shows that this does make a difference, as what follows.

In Sect. 2.1.3 we have explained that adding a sum to zero constraint to (8) doesn’t change the problem and actually makes the minimizer unique. We can thus restrict \(\varvec{f}\) to vary in the subspace defined by \(\varvec{1}^T \varvec{f}= 0\). Consequently, it is convenient to use a new coordinate system \(\widetilde{\varvec{f}} \in \mathbb {R}^{K-1}\). In order to do this, Li (2010b) introduces a concept “base class”. Suppose the base class \(b = K\), then the new coordinates are that \(f_1 = \widetilde{f}_1,\ldots ,f_{K-1}=\widetilde{f}_{K-1}\), \(f_K = -(\widetilde{f}_1+\cdots +\widetilde{f}_{K-1})\). Writing them in compact matrix notation, we have:

$$\begin{aligned}&\varvec{f}= \varvec{A}\widetilde{\varvec{f}} \nonumber \\&\varvec{A}= \begin{bmatrix} \varvec{I} \\ -\varvec{1}\end{bmatrix}, \end{aligned}$$
(17)

where \(\varvec{A}\in \mathbb {R}^{K \times (K-1)}\), \(\varvec{I} \in \mathbb {R}^{(K-1) \times (K-1)}\) and \(\varvec{1}\) is \((K-1)\)-dim all one vector. Substituting it back to (8) we have:

$$\begin{aligned} \tilde{q} (\widetilde{\varvec{f}})&\triangleq q (\varvec{A}\widetilde{\varvec{f}}) \nonumber \\&= (\varvec{A}^T\varvec{g})^T \widetilde{\varvec{f}} + \frac{1}{2} \widetilde{\varvec{f}}^T \varvec{A}^T\varvec{H}\varvec{A}\widetilde{\varvec{f}} + C \nonumber \\&= \widetilde{\varvec{g}}^T \widetilde{\varvec{f}} + \frac{1}{2} \widetilde{\varvec{f}}^T \widetilde{\varvec{H}} \widetilde{\varvec{f}} + C \end{aligned}$$
(18)

where \(C\) is a irrelevant constant and we have let

$$\begin{aligned} \widetilde{\varvec{g}}&= \varvec{A}^T\varvec{g}\nonumber \\ \widetilde{\varvec{H}}&= \varvec{A}^T\varvec{H}\varvec{A}. \end{aligned}$$
(19)

Li (2010b) then diagonally approximates the \((K-1) \times (K-1)\) matrix \(\widetilde{\varvec{H}} = \varvec{A}^T\varvec{H}\varvec{A}\). After doing this, the rest optimization procedure is essentially the same with original LogitBoost: the loss (14) collapses to \((K-1)\) 1-dimensional, independent minimization problems; the \(k\)-th tree is fitted by a Weighted Regression Tree using the new targets and weights derived from \(\widetilde{\varvec{g}}\) and \(\widetilde{\varvec{H}}\) as are defined in (19).

Li (2010b) shows that the choice of \(b\) impacts on how well the diagonal approximation is. Two intuitive methods are proposed to select \(b\): (1) The “worst class”, i.e., the \(b\) having the biggest loss (before minimizing (14)) on that class is selected. (2) The “best class”, i.e., all possible \(b\), up to \(K\) choices, are tried and the one leading to lowest loss (after minimizing (14)) is selected.

3.3 AOSO-LogitBoost

In AOSO-LogitBoost, we also adopts the loss (18) with \(K-1\) free coordinates. However, we update only one coordinate a time. An equivalent formulation goes in the following. We add a sum-to-zero constraint \(\varvec{1}^T \varvec{f}= 0\) to the \(K\)-dim loss (8) and let only two coordinates of \(\varvec{f}\), say, \(f_r\) and \(f_s\), vary and the other \(K-2\) keep zero. Due to the sum-to-zero constraint, we further let \(f_r = +t\) and \(f_s = -t\) where \(t\) is a real number.

Obviously, the choice of \((r,s)\) impacts on the goodness of approximation. However, it is unlikely to select the best class pair \((r,s)\) for each training example. To permit the class pair selection as adaptive as possible, we adopt vector-valued tree model in AOSO-LogitBoost, i.e., any column in Fig. 3(c) can not be assigned to two cells. Then, for each node (i.e., each cell in the matrix as in Fig. 3c) we adaptively select the class pair \((r,s)\).

Superficially, AOSO is inferior to original/ABC- LogitBoost, because many of the \(\varvec{f}\) values in the matrix are untouched (zeros), as shown in Fig. 3(c). However, we should recall the “big picture”: the untouched values might hopefully receive better updating in later Boosting iterations. Consequently, AOSO is still “on average” more efficient than original/ABC- LogitBoost. We will further discuss this issue in Sect. 5.

4 The AOSO-LogitBoost algorithm

In this section we describe the details of AOSO-LogitBoost. Specifically, we will focus on how to build a tree in Boosting iteration. Some key ingredients of tree building improving previous techniques were firstly introduced by Li. However, we will re-derive them in our own language. The credits will be made clear when they are explained in the following.

4.1 Details of tree building

Solving (14) with the tree model (11) is equivalent to determining the parameters \(\{ \varvec{t}_j, R_j \}_{j=1}^J\) at the \(m\)-th iteration. In this subsection we will show how this problem reduces to solving a collection of quadratic subproblems for which we can use standard numerical methods-Block Coordinate Descent.Footnote 1 Also, we will show how the gradient and Hessian can be computed incrementally.

We begin by defining some notation for the node loss:

$$\begin{aligned} NodeLoss(\varvec{t};\mathcal {I}) = \sum _{i \in \mathcal {I}} q(\varvec{t}| \varvec{F}_i) \end{aligned}$$
(20)

where \(\mathcal {I}\) denotes the index set of the training examples on some either internal or terminal node (i.e., those falling into the corresponding region). Minimizing (20) allows us to obtain a set of nodes \(\{ \varvec{t}_j, R_j \}_{j=1}^J\) using the following procedures:

  1. 1.

    To obtain the values \(\varvec{t}_j\) for a given \(R_j\), we simply take the minimizer of (20):

    $$\begin{aligned} \varvec{t}_j = \arg \min _{\varvec{t}} NodeLoss(\varvec{t};\mathcal {I}_j), \end{aligned}$$
    (21)

    where \(\mathcal {I}_j\) denotes the index set for \(R_j\).

  2. 2.

    To obtain the partition \(\{ R_j \}_{j=1}^J\), we recursively perform binary splitting until there are \(J\)-terminal nodes.

The key to the second procedure is the computation of the node split gain. Suppose an internal node with \(n\) training examples (\(n=N\) for the root node), we fix on some feature and re-index all the \(n\) examples according to their sorted feature values. Now we need to find the index \(n'\) with \(1 < n' < n\) that maximizes the node gain defined as loss reduction after a division between the \(n'\)-th and \((n'+1)\)-th examples:

$$\begin{aligned} NodeGain(n') =&NodeLoss(\varvec{t}^*;\mathcal {I}) \\&- (NodeLoss(\varvec{t}_L^*;\mathcal {I}_L) + NodeLoss(\varvec{t}_R^*; \mathcal {I}_R)) \nonumber \end{aligned}$$
(22)

where \(\mathcal {I}=\{1,\ldots ,n\}\), \(\mathcal {I}_L=\{1,\ldots ,n'\}\) and \(\mathcal {I}_R=\{n'+1,\ldots ,n\}\); \(\varvec{t}^*\), \(\varvec{t}_L^*\) and \(\varvec{t}_R^*\) are the minimizers of (20) with index sets \(\mathcal {I}\), \(\mathcal {I}_L\) and \(\mathcal {I}_R\), respectively. Generally, this search applies to all features. The division yielding the largest value of (22) is then recorded to perform the actual node split.

Note that (22) arises in the context of an \(O(N\times D)\) outer loop, where \(D\) is number of features. However, a naïve summing of the losses for (20) incurs an additional \(O(N)\) factor in complexity, which finally results in an unacceptable \(O(N^{2}D)\) complexity for a single boosting iteration.

Fortunately, the gradient and Hessian can be incrementally computed. To see why, simply rewrite the (20) in standard quadratic form:

$$\begin{aligned} loss(\varvec{t};\mathcal {I}) = \overline{\varvec{g}}^T\varvec{t}+ \frac{1}{2} \varvec{t}^T \overline{\varvec{H}}\varvec{t}, \end{aligned}$$
(23)

where \(\overline{\varvec{g}}= -\sum _{i \in \mathcal {I}} \varvec{g}_i\), \(\overline{\varvec{H}}= \sum _{i \in \mathcal {I}} \varvec{H}_i\). Thanks to the additive form, both \(\overline{\varvec{g}}\) and \(\overline{\varvec{H}}\) can be incrementally/decrementally computed in constant time when the split searching proceeds from one training example to the next. Therefore, the computation of (23) eliminates the \(O(N)\) complexity in the naïve summing of losses.Footnote 2

4.2 Properties of approximated node loss

To minimise (23), we make use of some properties for (23) that can be exploited when finding a solution. First, the invariant property carries over to the node loss (23):

Property 1

\(loss(\varvec{t};\mathcal {I}) = loss(\varvec{t}+c\varvec{1};\mathcal {I})\).

Proof

This is obvious by noting the additive form. \(\square \)

For the Hessian \(\varvec{H}\), we have \(\mathrm {rank}(\varvec{H}) \le \mathrm {rank}(\varvec{H}_i)\) by noting the additive form in (23). In Li (2010b) it is shown that \(\det \varvec{H}_i = 0\) by brute-force determinant expansion. Here we give a stronger property:

Property 2

Each \(\varvec{H}_i\) is a positive semi-definite matrix such that (1) \(\mathrm {rank}(\varvec{H}_i) = \kappa -1\), where \(\kappa \) is the number of non-zero elements in \(\varvec{p}_i\); (2) \(\varvec{1}\) is the eigenvector for eigenvalue 0.

The proof can be found in Appendix 1.

The properties shown above indicate that (1) \(\varvec{H}\) is singular so that unconstrained Newton descent is not applicable here and (2) \(\mathrm {rank}(H)\) could be as high as \(K-1\), which prohibits the application of the standard fast quadratic solver designed for low rank Hessian. In the following we propose to address this problem via block coordinate descent, a technique that has been successfully used in training SVMs (Bottou and Lin 2007).

4.3 Block coordinate descent

For the variable \(\varvec{t}\) in (23), we only choose two coordinates, i.e., a class pair, to update while keeping the others fixed. We note that single coordinates cannot be updated independently due to the sum-to-zero constraint. Suppose we have chosen the \(r\)-th and the \(s\)-th coordinate (we explain precisely how to do so in the next section). Let \(t_{r} = t\) and \(t_{s} = -t\) be the free variables (such that \(t_r + t_s = 0\)) and \(t_k = 0\) for \(k \ne r\), \(k \ne s\). Plugging these into (23) yields an unconstrained one dimensional quadratic problem with regards to the scalar variable \(t\):

$$\begin{aligned} loss(t) = \bar{g}^{T}t + \frac{1}{2} \bar{h}t^{2} \end{aligned}$$
(24)

where the gradient and Hessian collapse to scalars:

$$\begin{aligned} \bar{g}= -\sum _{i \in I} \left( (r_{i,r} - p_{i,r}) - (r_{i,s} - p_{i,s}) \right) \end{aligned}$$
(25)
$$\begin{aligned} \bar{h}= \sum _{i \in I} \left( p_{i,r}(1-p_{i,r}) + p_{i,s}(1-p_{i,s}) + 2p_{i,r}p_{i,s} \right) , \end{aligned}$$
(26)

The derivatives (25) and (26) involving two classes were firstly given in Li (2008), where the class subscript \(r\), \(s\) are fixed for the nodes in a tree. In this work, we allow \(r\), \(s\) vary from node to node.

To this extent, we are able to obtain the analytical expression for the minimizer and minimum of (24):

$$\begin{aligned} t^{*} = \arg \min _{t} loss(t) = -\frac{\bar{g}}{\bar{h}} \end{aligned}$$
(27)
$$\begin{aligned} loss(t^{*}) = -\frac{\bar{g}^{2}}{2\bar{h}} \end{aligned}$$
(28)

by noting the non-negativity of (26).

Based on (27), node vector (21) can be approximated by \(\varvec{t}_j\) with components

$$\begin{aligned} t_{j,k} = \left\{ \begin{array}{ll} +(-\bar{g}/\bar{h}) &{} k = r \\ -(-\bar{g}/\bar{h}) &{} k = s \\ 0 &{} {\text {otherwise}} \end{array}\right. \end{aligned}$$
(29)

where \(g\) and \(h\) are computed using (25) and (26), respectively, with index set \(\mathcal {I}_{j}\). Based on (28), the node gain (22) can be approximated as

$$\begin{aligned} NodeGain(n') = \frac{\bar{g}_L^{2}}{2\bar{h}_L} + \frac{\bar{g}_R^{2}}{2\bar{h}_R} - \frac{\bar{g}^{2}}{2\bar{h}}, \end{aligned}$$
(30)

where \(\bar{g}\) (or \(\bar{g}_L\), \(\bar{g}_R\)) and \(\bar{h}\) (or \(\bar{h}_L\), \(\bar{h}_R\)) are computed by using (25) and (26) with index set \(\mathcal {I}\) (or \(\mathcal {I}_L\), \(\mathcal {I}_R\)). We note that (30) was firstly derived in Li (2010b) in a slightly different way.

4.4 Class pair selection

Bottou and Lin (2007) provide two methods for selecting \((r,s)\) for an SVM solver that we consider for our purposes. The first is based on a first-order approximation: let \(t_r\) and \(t_s\) be the free variables and fix the rest to \(0\). For a \(\varvec{t}\) with sufficiently small norm, let \(t_{r}=\epsilon \) and \(t_{s}=-\epsilon \) where \(\epsilon >0\) is some small enough constant. The first order approximation of (23) is then:

$$\begin{aligned} loss(\varvec{t}) \approx loss(\varvec{0}) + \overline{\varvec{g}}^{T}\varvec{t}= loss(\varvec{0}) - \epsilon (-\bar{g}_r - (-\bar{g}_s)), \end{aligned}$$
(31)

where we have denoted the vector \(\overline{\varvec{g}}= (\bar{g}_1,\ldots ,\bar{g}_r,\ldots ,\bar{g}_s,\ldots ,\bar{g}_K)^T\). It follows that the indices \(r\), \(s\) resulting in largest decrement to (31) are:

$$\begin{aligned}&r = \arg \max _{k} \left\{ -\bar{g}_{k} \right\} \nonumber \\&s = \arg \min _{k} \left\{ -\bar{g}_{k} \right\} . \end{aligned}$$
(32)

The second method, derived in a similar way, takes into account the second-order information:

$$\begin{aligned}&r = \arg \max _{k} \left\{ -\bar{g}_{k} \right\} \nonumber \\&s = \arg \max _{k} \left\{ \frac{(\bar{g}_{r}-\bar{g}_{k})^{2}}{\bar{h}_{rr}+\bar{h}_{kk}-2\bar{h}_{rk}} \right\} , \end{aligned}$$
(33)

where we have denoted the matrix \(\overline{\varvec{H}}= \{ \bar{h}_{ij}\}\).

Both methods are \(O(K)\) procedures that are better than the \(K\times (K-1)/2\) naïve enumeration of all possible pairs. However, in our implementation we find that (33) achieves better results for AOSO-LogitBoost.

The selection of a class pair \((r,s)\) here is somewhat similar to the selection of base class \(b\) in ABC Boost. Actually, Li proposed in Li (2008) that the “worst class” with the largest loss be selected as \(b\). Clearly, the max derivative is another indicator for how “worst” it is. In this sense, our class pair selection extends the base class idea and can be seen as a concrete implementation of the general “worst class” idea.

The pseudocode for AOSO-LogitBoost is given in Algorithm 1 and shows how all the above approximations are used together to iteratively find a model. The algorithm runs on the sample \(\{\varvec{x}_i,y_i\}_{i=1}^N\) for \(M\) rounds. In each round \(m\) it updates the values of \(\varvec{F}_i = \varvec{F}(\varvec{x}_i)\) for each of the instances \(\varvec{x}_i\) by first computing a rectangular partition of the feature space \(\{ R^m_j \}_{j=1}^J\) and the corresponding node values \(\{\varvec{t}^m_j\}_{j=1}^J\), and then incrementing \(\varvec{F}_i\) by \(\varvec{t}^m_j\) where \(j\) is the index of the region \(R^m_j\) such that \(\varvec{x}_i \in R^m_j\).

figure a

5 Comparison to (ABC-)LogitBoost

In this section we compare the derivations of LogitBoost and ABC-LogitBoost and provide some intuition for observed behaviours in the experiments in Sect. 7.

To solve (14), ABC-LogitBoost uses \(K-1\) independent trees defined by

$$\begin{aligned} f_k(x) = {\left\{ \begin{array}{ll} \sum _j t_{j,k}I(x \in R_{j,k}) &{}, k \ne b\\ -\sum _{l \ne b} f_l(x) &{}, k = b \end{array}\right. } \end{aligned}$$
(34)

where \(b\) is the base class. As explained in Sect. 3.2, \(b\) can be selected as either the “worst class” or the “best class”.

Li (2010b) gives how to compute the node value and the node split gain for building the \(k\)-th tree (\(k \ne b\)). Although derived from different motivation as ours, they are actually the same with (29) and (30) in this paper, where the class pair \((r,s)\) is replaced with \((k,b)\). We should not be surprised at this coincidence by noting that AOSO’s vector tree has only one freely altering coordinate on each node and thus “behaves” like a scalar tree. In this sense, AOSO and ABC is comparable. Actually, ABC can be viewed as a special form of AOSO with two differences: (1) For each tree, the class pair is fixed for every node in ABC, while it is selected adaptively in AOSO, and (2) \(K-1\) trees are added per iteration in ABC, while only one tree is added per iteration by AOSO.

It seems unappealing to add just one tree as in AOSO, since many \(\varvec{f}(\varvec{x})\) values are untouched (i.e.,, set to zeros, as illustrated in Fig. 3c); In the meanwhile, ABC would be better since it updates all the \(\varvec{f}(\varvec{x})\) values. Considering the Boosting context, we argue, however, that AOSO should still be preferred in an “on average” sense. After adding one tree in AOSO, the \(\varvec{F}(\varvec{x})\) values are updated and the gradient/Hessian is immediately recomputed for each training example, which impacts on how to build the tree at next iteration. Thus the \(\varvec{F}(\varvec{x})\) values might still receive good enough updating after several iterations, due to the adaptive class pair selection for every node at current iteration. In contrast, the \(K-1\) trees in ABC use the same set of gradients and Hessians, which are not recomputed until adding all the \(K-1\) trees.

Therefore, it is fair to compare ABC and AOSO in regards of number of trees, rather than number of iterations. AOSO’s “on average” better performance is confirmed by the experiments in Sect. 7.2.

To evaluate whether the adaptive class pair selection is critical, we considered a variant of AOSO-LogitBoost that adopts a fixed class pair selection. Specifically, we still add one tree per iteration, but select a single class pair root node and let it be fixed for all other nodes, which is very similar to ABC’s choice. This variant was tried but unfortunately, degraded performance was observed so the results are not reported here.

From the above analysis, we believe that AOSO-LogitBoost’s more flexible model obtained from the adaptive split selection (as well as its immediate model updating after adding one tree per iteration) is what contributes to its improvement over ABC.

6 Sum-to-one probability and dense Hessian matrix

As in previous discussion, both ABC or AOSO improve on the original LogitBoost method by dealing with dense Hessian matrix due to the Logistic loss as given in (3). An immediate question is that whether we can derive an effective alternative surrogate loss that has a diagonal Hessian matrix “by design”—i.e., can we define a modified Logistic loss that guarantees a diagonal Hessian matrix? In this section, we show how the original multi-class Logistic loss can be derived from a maximum entropy argument via convex duality, in a manner similar to derivations of boosting updates by Shen and Li (2010), Shen and Hao (2011), Lafferty (1999), and Kivinen and Warmuth (1999) and results connecting entropy and loss by Masnadi-Shirazi and Vasconcelos (2010) and Reid and Williamson (2010).

In contrast to earlier work, our analysis focuses on the role of the constraint in defining the loss and the effect it has on the form of its gradient and Hessian. In particular, we’ll see that the original Logistic loss’s dense Hessian matrix essentially results from sum-to-one constraint on class probabilities. Moreover, we are able to obtain a diagonal Hessian matrix by dropping this constraint when deriving an alternative to the Logistic loss from the same maximum entropy formulation. By doing so we show that the optimization (i.e., via Newton descent) becomes straightforward, however lower classification accuracy and slower convergence rate are observed for the new loss. Therefore, we argue the techniques used by ABC/AOSO seem necessary for dealing with the dense Hessian matrix for the original, more effective, Logistic loss.

6.1 Logistic loss in primal/dual view

We now examine the duality between entropy and loss in a manner similar to that of the more general treatment of Masnadi-Shirazi and Vasconcelos (2010); Reid and Williamson (2010). By starting with a “trivial” maximum entropy problem, we show how consideration of its dual problem recovers a “composite representation” Reid and Williamson (2010) of a loss function in terms of a loss defined on the simplex and corresponding link function. In case of Shannon entropy, we show that such a construction results in the original Logistic loss function. Appendix 2 describes the matrix calculus definitions and conventions we adopt, mainly from Magnus and Neudecker (2007).

For each feature vector \(\varvec{x}\), let \(\varvec{\eta }(\varvec{x}) = Pr(\varvec{y}|\varvec{x}) \in \Delta ^K\) be the true conditional probability for the \(K\) classes where \(\Delta ^K = \{ \varvec{p}\in [0,1]^K : \sum _{k=1}^K p_k = 1 \}\) denotes the \(K\)-simplex. Let \(\mathsf {H}: \Delta ^K \rightarrow \mathbb {R}\) be an entropy function—a concave function such that \(\mathsf {H}({\mathbf {e}}^k) = 0\) for each vertex \({\mathbf {e}}^k\) of \(\Delta ^K\).

Given some set of probabilities \(S \subset \Delta ^K\)—typically defined by a set of constraints—the MaxEnt criterion (Jaynes 1957) advises choosing the \(\varvec{p}\in S\) such that \(\mathsf {H}(\varvec{p})\) is maximized. We consider a “trivial” instance of the MaxEnt problem in which the set \(S\) contains only the single point \(\varvec{\eta }\), i.e., \(S = \{ \varvec{\eta }\}\), yielding the following problem:

$$\begin{aligned} {\text {max}}&\quad \mathsf {H}(\varvec{p}) \nonumber \\ s.t.&\quad \varvec{p}- \varvec{\eta }= 0 \\&\quad \varvec{1}^T\varvec{p}- \varvec{1}= 0 \nonumber , \end{aligned}$$
(35)

where the argument \(\varvec{x}\) to \(\varvec{\eta }()\), \(\varvec{p}()\) has been omitted. Despite its apparent triviality, the Lagrangian of this problem has an interesting form:

$$\begin{aligned} \mathcal {L}(\varvec{p},\varvec{F},\lambda ) = \mathsf {H}(\varvec{p}) + \varvec{F}^T(\varvec{p}-\varvec{\eta }) + \lambda (\varvec{1}^T\varvec{p}- \varvec{1}), \end{aligned}$$
(36)

where \(\varvec{F}\in \mathbb {R}^K\), \(\lambda \in \mathbb {R}\) are the so-called dual variables. The dual function is obtained by maximizing the Lagrangian over the domain of the primal variable \(\varvec{p}\). By setting to zero the derivative of (36) w.r.t. \(\varvec{p}\), we can implicitly express \(\varvec{p}\) as a function of \(\varvec{F}\) and \(\lambda \) via

$$\begin{aligned} \nabla \mathsf {H}(\varvec{p}) + \lambda \varvec{1}+ \varvec{F}= 0. \end{aligned}$$
(37)

This gives us the dual function \(g\) which depends only on \(\varvec{F}\), \(\lambda \):

$$\begin{aligned} g(\varvec{F},\lambda ) = \mathsf {H}(\varvec{p}) + \varvec{F}^T(\varvec{p}-\varvec{\eta }) + \lambda (\varvec{1}^T\varvec{p}- \varvec{1}), \end{aligned}$$
(38)

where \(\varvec{p}= \varvec{p}(\varvec{F},\lambda )\) as per (37), and the unconstrained convex dual problem

$$\begin{aligned} {\text {min}}&\quad g(\varvec{F}, \lambda ). \\ \end{aligned}$$

Now further eliminating the dual variable \(\lambda \) in (38) by taking derivative of (38) w.r.t. \(\lambda \) as

$$\begin{aligned} \mathsf {D}_{\lambda } g(\varvec{F},\lambda )&= (\mathsf {D}_{\lambda }\mathsf {H}(\varvec{p}))(\mathsf {D}_{\lambda }\varvec{p}) + \varvec{F}^T(\mathsf {D}_{\lambda }\varvec{p}) + \lambda \varvec{1}^T(\mathsf {D}_{\lambda }\varvec{p}) + (\varvec{1}^T\varvec{p}- \varvec{1}) \\&= (\mathsf {D}_{\lambda }\mathsf {H}(\varvec{p}) + \lambda \varvec{1}^T + \varvec{F}^T)(\mathsf {D}_{\lambda }\varvec{p}) + \varvec{1}^T\varvec{p}- 1 \\&= \varvec{1}^T\varvec{p}- 1. \end{aligned}$$

and setting to zero, we obtain the “partial” dual function:

$$\begin{aligned} \ell (\varvec{F}) = \mathsf {H}(\varvec{p}) + \varvec{F}^T(\varvec{p}-\varvec{\eta }), \end{aligned}$$
(39)

where \(\varvec{p}= \varvec{p}(\varvec{F}) \in \mathbb {R}^K\), \(\lambda = \lambda (\varvec{F}) \in \mathbb {R}\) are given by the \((K+1) \times (K+1)\) equation array

$$\begin{aligned} {\left\{ \begin{array}{ll} \nabla \mathsf {H}(\varvec{p}) + \lambda \varvec{1}+ \varvec{F}&{}= 0 \\ \varvec{1}^T\varvec{p}- 1 &{}= 0. \end{array}\right. } \end{aligned}$$
(40)

In Machine Learning, (39) and (40) are precisely the “loss function” and the “link function”, respectively (e.g., the Logistic loss (3) and Logistic link (4)). Training boosting classifier is essentially a numerical optimization procedure, where the gradient and Hessian of the loss (39) are needed. We give them as follows:

Theorem 1

The gradient of (39) is \(\nabla \ell (\varvec{F}) = \varvec{p}- \varvec{\eta }\).

See Appendix 2 for its proof. Note that the gradient is precisely the left-hand-side of the constraint in primal (35), which is a common result in primal-dual theory in convex optimization (Bertsekas 1982).

Theorem 2

The Hessian of (39) is \(\nabla ^2\ell (\varvec{F}) = -A^{-1} + \frac{1}{\varvec{1}^TA^{-1}\varvec{1}}A^{-1}\varvec{1}\varvec{1}^TA^{-1}\) , where the shorthand \(A = \nabla ^2\mathsf {H}(\varvec{p})\).

See Appendix 2 for its proof.

In both of the above two theorems, the dependence on \(\varvec{F}\) is implicit where \(p\) is given by the Link function (40).

To this extent, we can show some concrete results in case of Logistic settings. The starting point is that we adopt Shannon entropy in (35) such that \(\mathsf {H}(\varvec{p}) = -\sum _{k=1}^K p_k \log p_k\). Thus its gradient and Hessian are \(\nabla \mathsf {H}(\varvec{p}) = -(1+\log p_1,\ldots ,1+\log p_K)^T\), \(\nabla ^2\mathsf {H}(\varvec{p}) = -\mathrm {diag}(1/p_1,\ldots ,1/p_K)\), respectively. With these two expressions, it follows that the link function by solving equations (40) is precisely the Logistic link in (4). Furthermore, according to Theorem 1 and 2 the gradient and Hessian for the loss (39) are \(\nabla \ell = \varvec{p}- \varvec{\eta }\) and \(\nabla ^2 \ell = \mathrm {diag}(p_1,\ldots ,p_K) - \varvec{p}\varvec{p}^T\), which are respectively no more than (6) and (7) by noting that \(\varvec{\eta }\) reduces to 1-of-\(K\) response \(\varvec{r}\) when taking value on the vertex of the probability simplex. These are shown in the left half of Table  1.

Table 1 The Link, gradient/Hessian for the loss with or without the sum-to-one constraint

6.2 Diagonal Hessian matrix by dropping the sum-to-one constraint

In last subsection we derived the LogitBoost via a duality argument. An immediate observation is that the sum-to-one constraint \(\varvec{1}^T\varvec{p}- 1 = 0\) in (35) seems redundant, since it is already guaranteed by the constraint \(\varvec{p}- \varvec{\eta }= 0\) where \(\varvec{\eta }\) itself is sum-to-one. This means that the primal problem can be rewritten as:

$$\begin{aligned} {\text {max}}&\quad \mathsf {H}(\varvec{p}) \end{aligned}$$
(41)
$$\begin{aligned} s.t.&\quad \varvec{p}- \varvec{\eta }= 0 \end{aligned}$$
(42)

Deriving its dual form, we obtain the corresponding loss

$$\begin{aligned} \ell (\varvec{F}) = \mathsf {H}(\varvec{p}) + \varvec{F}^T(\varvec{p}-\varvec{\eta }), \end{aligned}$$
(43)

and link

$$\begin{aligned} \nabla \mathsf {H}(\varvec{p}) + \varvec{F}&= 0 \end{aligned}$$
(44)

Still, with \(\mathsf {H}(\varvec{p}) = -\sum _{k=1}^K p_k \log p_k\) we can obtain the link, the gradient and Hessian for the loss, as shown in the right half of Table 1. One attractive feature of this alternative derivation is the diagonal Hessian matrix that is yielded. When calculating the node value or node gain, we can obtain the precise Newton Step and Newton Decrement since the inversion of Hessian is much easier to compute. Consequently, no effort (e.g., base class selection in ABC or class pair selection in AOSO) is needed to deal with the difficulty of dense Hessian and all classes can be updated for each terminal node.

6.3 Degraded performance

We refer to the modified LogitBoost without sum-to-one constraint as “K-LogitBoost”, where all the \(K\) classes are updated at each terminal node. We compare K-LogitBoost with AOSO-LogitBoost as in Fig. 4. Noting that only 2 classes are updated at each terminal node in AOSO, we scale the horizontal axis (i.e., the number of iterations) by a factor \(2/K\) for AOSO to make it a fair comparison. As can be seen in the results, a slower convergence than AOSO is incurred by K LogitBoost with regard to both test error and L2 norm of gradient, although K LogitBoost is still competitive with AOSO LogitBoost. More comparative results are provided in Sect. 7.3.

Fig. 4
figure 4

Comparison between K LogitBoost and AOSO LogitBoost on dataset M-Basic. a How the L2 norm of gradient \(||\varvec{p}- \varvec{\eta }||_2^2\) decreases when iteration proceeds. b Test error v.s. iteration

We provide an intuitive explanation for this phenomenon. At convergence, the gradient \(\varvec{p}- \varvec{\eta }\) vanishes, and consequently, \(\varvec{p}\) satisfies the sum-to-one constraint since \(\varvec{\eta }\) is a probability. For the original Logistic loss setting, the apparently redundant sum-to-one constraint for \(\varvec{p}\) in (35) enforces that the primal variable \(\varvec{p}\) approaches \(\varvec{\eta }\) on the plane \(\varvec{1}^T\varvec{p}- 1= 0\) containing the simplex \(\Delta ^K\). In contrast, for K LogitBoost \(\varvec{p}\) may reside outside that plane during the optimization procedure. The latter optimization is intuitively slower since the L2 norm \(||\varvec{p}- \varvec{\eta }||_2^2\) never increases when \(\varvec{p}-\varvec{\eta }\) is projected onto a plane.

7 Experiments

In this section we compare AOSO-LogitBoost with ABC-LogitBoost, which was shown to outperform original LogitBoost in Li’s experiments (Li 2010b, 2009a). We test AOSO on all the datasets used in Li (2010b, 2009a), as listed in Table 2. In the top section are UCI datasets and in the bottom are Mnist datasets with many variations (see Li (2010a) for detailed descriptions).Footnote 3

Table 2 Datasets used in our experiments

To exhaust the learning ability of (ABC-)LogitBoost, Li let the boosting stop when either the training loss is small (implemented as \(\le \)10\(^{-16}\)) or a maximum number of iterations, \(M\), is reached. Test errors at last iteration are simply reported since no obvious over-fitting is observed. By default, \(M=10,000\), while for those large datasets (Covertype290k, Poker525k, Pokder275k, Poker150k, Poker100k) \(M=5,000\). We adopt the same criteria, except that our maximum iterations \(M_{AOSO} = (K-1) \times M_{ABC}\), where \(K\) is the number of classes. Note that only one tree is added at each iteration in AOSO, while \(K-1\) are added in ABC. Thus, this correction compares the same maximum number of trees for both AOSO and ABC.

The most important tuning parameters in LogitBoost are the number of terminal nodes \(J\), and the shrinkage factor \(v\). In (Li 2010b, 2009a), Li reported results of (ABC-)LogitBoost for a number of \(J\)-\(v\) combinations. We report the corresponding results for AOSO-LogitBoost for the same combinations. In the following, we intend to show that for nearly all \(J\)\(v\) combinations, AOSO-LogitBoost has lower classification error and faster convergence rates than ABC-LogitBoost.

7.1 Classification errors

7.1.1 Summary

In Table 3 we summarize the results for all datasets. Li (2010b) reports that ABC-LogitBoost is insensitive to \(J\), \(v\) on all datasets except for Poker25kT1 and Poker25kT2. Therefore, Li reports classification errors for ABC simply with \(J=20\) and \(v=0.1\), except that on Poker25kT1 and Poker25kT2 errors are reported by using each other’s test set as a validation set. Based on the same criteria we report AOSO in the middle panel of Table 3 where the test errors as well as the improvement relative to ABC are given. In the right panel of Table 3 we provide the comparison for the best results achieved over all \(J\)\(v\) combinations when the corresponding results for ABC are available in Li (2010b) or Li (2009a).

Table 3 Summary of test classification errors

We also tested the statistical significance between AOSO and ABC. We assume the classification error rate is subject to some Binomial distribution. Let \(z\) denote the number of errors and \(n\) the number of tests, then the estimate of error rate \(\hat{p}=z/n\) and its variance is \(\hat{p}(1-\hat{p})/n\). Subsequently, we approximate the Binomial distribution by a Gaussian distribution and perform a hypothesis test. The \(p\) values are reported in Table 3.

7.1.2 Comparisons with SVM and deep learning

For some problems, we note LogitBoost (both ABC and AOSO) outperforms other state-of-the-art classifier such as SVM or Deep Learning.

On dataset Poker, Li (2009a) reports that linear SVM works poorly (the test error rate is about \(40\,\%\)), while ABC-LogitBoost performs far better (i.e. \({<}10\,\%\) on Poker25kT1 and Poker25kT2). AOSO-LogitBoost proposed in this paper has even lower test error than ABC-LogitBoost, see Table 3.

On those variants of Mnist specially designed for testing various Deep Learning algorithms, Li reported ABC’s results by simply setting \(J=20\), \(v=0.1\) and concluded that ABC are comparable to or better than Deep Learning. We provide AOSO’s results in Table 4.

Table 4 Summary of test error rates for (ABC- and AOSO-)LogitBoost and deep learning algorithms on variants of Mnist

7.1.3 Detailed results

We provide one-on-one comparison between ABC and AOSO over a number of \(J\)\(v\) combinations, as follows.

Mnist10k, M-Image, Letter4k and Letter2k:   For these four datasets, classification errors are reported in Li (2010b) with every combination of \(J \in \{ 4,6,8,10,12,14,16,18,20,24,30,40,50 \}\) and \(v \in \{ 0.04,0.06,0.08,0.1 \}\). The comparison with AOSO-LogitBoost is listed in Tables 5, 6, 7 and 8.

Table 5 Test classification errors on Mnist10k
Table 6 Test classification errors on M-Image
Table 7 Test classification errors on Letter4k
Table 8 Test classification errors on Letter2k

Poker25kT1 and Poker25kT2:   These are the only two datasets on which ABC-MARTFootnote 4 outperforms ABC-LogitBoost in experiments by Li (2010b). Thus we cite the results for both ABC-MART and ABC-LogitBoost in Tables 9 and 10, with \(J \in \{ 4,6,8,10,12,14,16,18,20 \}\) and \(v \in \{ 0.04,0.06,0.08,0.1 \}\). The comparison with AOSO-LogitBoost is also listed. Unlike on previous datasets, AOSO-LogitBoost is a bit sensitive to parameters, which is also observed for ABC-MART and ABC-LogitBoost by Li (2010b).

Table 9 Test classification errors on Poker25kT1
Table 10 Test classification errors on Poker25kT2

Letter, Pendigits, Zipcode, Isolet and Optdigits:   For these five datasets, classification errors are reported by Li (2010b) with every combination of \(J \in \{ 4,6,8,10,12,14,16,18,20 \}\) and \(v \in \{ 0.04,0.06,0.08,0.1 \}\) (except that \(v \in \{0.06,0.1\}\) for Isolet). The comparison with AOSO-LogitBoost is listed in Tables 11, 12, 13, 14 and 15.

Table 11 Test classification errors on Letter
Table 12 Test classification errors on Pendigits
Table 13 Test classification errors on Zipcode
Table 14 Test classification errors on Isolet
Table 15 Test classification errors on Optdigits

7.2 Convergence rate

Recall that we stop the boosting procedure if either the maximum number of iterations is reached or the training loss is small (i.e. the loss (1) \(\le 10^{-16}\)). The fewer trees added when boosting stops, the faster the convergence and the lower the time cost for either training or testing. We compare AOSO with ABC in terms of the number of trees added when boosting stops for the results of ABC available in Li (2010b, 2009a). Note that simply comparing number of boosting iterations is unfair to AOSO, since at each iteration only one tree is added in AOSO and \(K-1\) in ABC.

Results are shown in Table 16 and Table 17. Except for when \(J\)\(v\) is too small, or particularly difficult datasets where both ABC and AOSO reach maximum iterations, we found that trees needed in AOSO are typically only 50–\(80\%\) of those in ABC.

Table 16 \(\#\)trees added when convergence on selected datasets. \(R\) stands for the ratio AOSO/ABC
Table 17 \(\#\)trees added when convergence on Mnist10k for a number of \(J\)\(v\) combinations

Figure 5 shows plots for test classification error vs. iterations in both ABC and AOSO and show that AOSO’s test error decreases faster. More plots for AOSO are given in Figs. 6 and 7. We only choose the datasets and parameters on which the plots of ABC are provided in Li (2010b).

Fig. 5
figure 5

Errors versus iterations on selected datasets and parameters. Top row ABC [copied from Li (2010b)]; Bottom row AOSO (horizontal axis scaled to compensate the \(K-1\) factor)

Fig. 6
figure 6

Errors versus iterations on Mnist10k. \(J \in \{4,6,8,10,12,14,16,18,24,30,40,50 \}\)

Fig. 7
figure 7

Errors versus iterations on selected datasets and parameters

7.3 Comparison between K-LogitBoost and AOSO-LogitBoost

In Fig. 8 we provide comparison between K-LogitBoost and AOSO-LogitBoost on several datasets. The parameters for both algorithms are \(J = 20\), \(v = 0.1\). As can be seen, K-LogitBoost converges slower than AOSO-LogitBoost, confirming our arguments in Sect. 6.3.

Fig. 8
figure 8

Comparison between K-LogitBoost and AOSO-LogitBoost on datasets Optdigits, Pendigits, Zipcode, Letter2k, Mnist10k, M-Image with parameters \(J = 20\) and \(v = 0.1\): how the testing error decreases with iterations

8 Conclusions

We present an improved LogitBoost, namely AOSO-LogitBoost, for multi-class classification. Compared with ABC-LogitBoost, our experiments suggest that our adaptive class pair selection technique results in lower classification error and faster convergence rates.