1 Introduction
Consistency is an asymptotic property of learning algorithms as the sample size tends to infinity, and it plays a central role in statistical learning theory. Although the sample size is always finite in practical problems, the consistency of a learning algorithm guarantees that, by using more samples, the underlying distribution can be reconstructed more accurately. Since the concept of consistency was first proposed by Vapnik and Chervonenkis [27-29], the consistency of various learning algorithms has been extensively explored in statistical learning and machine learning.
Depending on the setting of the learning machine, consistency can be categorized into the following types. The consistency of the empirical risk minimization (ERM) method [29] is a classical type: the loss function minimizing the empirical risk is used to approximate the loss function minimizing the true risk. For example, Chen et al. [5] studied the consistency of the ERM method based on convex losses for multi-class classification problems. Brownlees et al. [4] investigated performance bounds for heavy-tailed losses from the viewpoint of ERM consistency. Berner et al. [1] analyzed the generalization error of the ERM method over deep artificial neural network hypotheses. Xu et al. [30] proposed a general framework for statistical learning with group invariance and studied its consistency. Fisher consistency [10] strengthens the property of unbiasedness for the parameters of functions: the parameters are estimated directly, and the estimated values are used to approximate their true values. For instance, Liu [17] established the Fisher consistency theory for different loss functions of multi-category support vector machine (SVM) algorithms. Fathony et al. [8] proposed an adversarial bipartite matching algorithm with both computational efficiency and Fisher consistency.
In addition to these two types, universal consistency is another typical type, which measures the consistency of learning algorithms based on the structural risk minimization (SRM) method. Indeed, the ERM method involves only the empirical risk and no regularization term. In many practical situations, however, a regularization term is included in order to control the generalization ability of the learning machine. To this end, Vapnik [26] proposed the SRM method to balance the empirical risk on the training data against the generalization ability of the learning machine. Later, the concept of universal consistency was introduced to demonstrate the consistency of many learning algorithms based on the SRM method. For example, Steinwart [25] showed the universal consistency of SVMs and their variants in a unified framework. Liu et al. [16] showed that the extreme learning machine (ELM) is universally consistent for radial basis function networks, and pointed out how to select optimal kernel functions in ELM applications. Dumpert and Christmann [7] addressed the universal consistency of localized kernel-based methods. Gyorfi et al. [12] established universal consistency results for the nearest-neighbor prototype algorithm in the multi-class classification setting, and also derived convergence rates based on universal consistency. In summary, universal consistency has been studied deeply in many problem settings. The present paper focuses on universal consistency for binary classification problems.
SVMs are powerful tools for binary classification problems. The key idea is to construct two parallel hyper-planes such that the positive and negative classes are well separated, and then to maximize the margin between the two hyper-planes, which amounts to minimizing the regularization term. SVMs have been widely applied to many practical problems, such as text classification [15], face recognition [23] and bioinformatics [9]. Despite these successes, SVMs still face a difficulty: they are best suited to small-sample problems, and the computational cost becomes very expensive for large-scale problems.
To reduce the computational cost of SVMs, many extensions have been proposed and studied. For instance, the generalized eigenvalue proximal support vector machine (GEPSVM) [18] constructs two non-parallel hyper-planes such that each hyper-plane is closest to one of the two classes and as far as possible from the other. Based on SVMs and GEPSVM, Jayadeva et al. [13] established another extension, the twin support vector machine (TWSVM). The main idea of TWSVM is similar to that of GEPSVM, while the formulation is entirely different: TWSVM solves a pair of quadratic programming problems (QPPs), and the formulation of each QPP is similar to that of SVMs, except that only one class of the training samples appears in the constraints of each QPP. As a result, the computational cost of TWSVM is roughly one fourth of that of SVMs.
Similar to SVMs, TWSVM has many variants for binary classification. For instance, the smooth TWSVM algorithm [14] approximates the plus function \((x)_{+}\) by a smoothing function \(\rho (x,\eta )\) when solving the QPPs, so that the learnt classifier is smoother. The twin bounded SVM (TBSVM) algorithm [21] adds a regularization term to TWSVM according to the SRM principle, thus improving classification performance. The weighted linear loss TWSVM algorithm [22] adjusts the impact of each point on the hyper-plane by assigning weights to the slack variables. The least squares TBSVM algorithm based on the L1-norm distance metric [32] first formulates a least squares version of TBSVM and then replaces the L2-norm with the L1-norm to enhance robustness. Besides, there are many other algorithms based on the same idea as TWSVM, such as the robust TWSVM algorithm [19], the margin-based TWSVM with unity norm hyper-planes [20], and the fuzzy TWSVM algorithm [11]. Although these TWSVM variants are well formulated and have been applied to various practical problems, their universal consistency has not yet been studied.
In this study, we address the universal consistency of TWSVM and its variants. Since it is cumbersome to analyze the universal consistency of the TWSVM variants one by one, we do so under a general framework. However, no unified framework for all TWSVM variants exists in the literature, so we first construct one. Because not all variants are based on the same idea, it is very difficult to find a framework that fits every variant; we therefore construct a general framework for most of the variants based on the idea of TWSVM, and formulate it as a general optimization problem. Concretely, the optimization problem consists of two minimization problems, each containing two terms: the first measures the average loss over both the positive and negative data, and the second is a regularization term for maximizing some margin.
With the general framework in hand, we then study the universal consistency of the general optimization problem. We first introduce the definition of universal consistency for this problem, and then show under what conditions it holds. When introducing the definition, the risk \(R_{P}(f)\) and Bayes risk \(R_{P}\) for the general optimization problem are involved and are therefore redefined here. To establish universal consistency under the stated conditions, an assertion is needed and is proposed. Since definitions such as the regularized \(L_1\)-risk \(R_{1, P, c_{1}}^{reg}(f_{1})\) and the regularized \(L_2\)-risk \(R_{2, P, c_{2}}^{reg}(f_{2})\) are essential for describing the assertion, they are given in detail before it. The assertion is then built around a pair of concentration inequalities. Under three different conditions, based respectively on the covering number, the localized covering number, and stability, three different pairs of concentration inequalities are derived, yielding three different universal consistency results.
The rest of the paper is organized as follows: Section
2 gives the preliminaries. Section
3 derives the general framework for most variants of TWSVM and formulates it as a general optimization problem. Section
4 introduces the definitions of risk, Bayes risk and universal consistency for this optimization problem and proposes an assertion to theoretically support the universal consistency. Section
5 presents the theoretical results about the assertion. Finally, Sect.
6 concludes this paper.
2 Preliminaries
Here, we give some notations and concepts that will be used in the following sections. Denote
\(\mathbb {R}=(-\infty , +\infty )\),
\(\mathbb {R}^{+}=[0, +\infty )\). Suppose
X is a compact metric space, and
\(k: X \times X \mapsto \mathbb {R}\) is a positive semi-definite kernel. We define a quantity
$$\begin{aligned} K = \sup \big \{ \sqrt{k(x,x)} : x \in X \big \}. \end{aligned}$$
Let
H be the reproducing kernel Hilbert space (RKHS) with respect to kernel
k. Recall that there is a mapping
\(\varPhi : X \mapsto H\) satisfying the reproducing property, that is,
$$\begin{aligned} k(x_{1},x_{2}) = <\varPhi (x_{1}), \varPhi (x_{2})>,\ \ x_{1},x_{2} \in X. \end{aligned}$$
Suppose the kernel
k is continuous, then every element of
H is also continuous on
X. In this situation, there is a mapping
\(I: H \mapsto \mathcal {C}(X)\), which continuously embeds the RKHS
H into the space of all continuous functions
\(\mathcal {C}(X)\) via $$\begin{aligned} (If)(x)= <f, \varPhi (x)>_{H}, \ \ x \in X, \ f \in H. \end{aligned}$$
We say
k is a universal kernel, if the mapping
I has a dense image in \(\mathcal {C}(X)\).
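As a concrete illustration of these notions, the following sketch uses the Gaussian RBF kernel (a standard example of a universal kernel on compact sets) and checks the reproducing property numerically for an element of the RKHS expressed as a finite kernel expansion; the data points and coefficients are hypothetical.

```python
import numpy as np

def gauss_kernel(x1, x2, gamma=0.5):
    # Gaussian RBF kernel k(x1, x2) = exp(-gamma * ||x1 - x2||^2)
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(5, 3))   # hypothetical points x_j in X
alpha = rng.normal(size=5)          # coefficients of f in span{Phi(x_j)}

# An element of the RKHS H: f = sum_j alpha_j * k(., x_j)
def f(x):
    return sum(a * gauss_kernel(xj, x) for a, xj in zip(alpha, anchors))

# Reproducing property: <f, Phi(x)>_H = sum_j alpha_j <Phi(x_j), Phi(x)>_H
#                                     = sum_j alpha_j k(x_j, x), which is f(x)
x = rng.normal(size=3)
inner = sum(a * gauss_kernel(xj, x) for a, xj in zip(alpha, anchors))
assert np.isclose(inner, f(x))

# K = sup_x sqrt(k(x, x)); for the Gaussian kernel k(x, x) = 1, so K = 1
assert np.isclose(gauss_kernel(x, x), 1.0)
```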
Let
\(\varOmega : \mathbb {R}^{+} \times \mathbb {R}^{+} \rightarrow \mathbb {R}^{+}\) be a non-decreasing function, denoted by
\(\varOmega (c, t)\), that is continuous at 0 with respect to the variable
c and unbounded as the variable
t tends to infinity. We introduce the following definition [
25] to explain in which case,
\(\varOmega\) is a regularization function.
For any given loss function
\(L: Y \times \mathbb {R} \times \mathbb {R}^{+} \rightarrow \mathbb {R}^{+}\), denote
$$\begin{aligned} C(\alpha , t, \lambda )= \alpha L(1, t, \lambda ) + (1-\alpha ) L(-1, t, \lambda ) \end{aligned},$$
where
\(\alpha \in [0, 1], t \in \mathbb {R}, \lambda \in \mathbb {R}^{+}\). Let
$$\begin{aligned}&t_{\alpha } = \arg \min _{t \in \mathbb {R}} C(\alpha , t, \lambda ), \\&M(\alpha , \lambda ) = C(\alpha , t_{\alpha }, \lambda ). \end{aligned}$$
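As a numerical illustration of the quantities \(t_{\alpha }\) and \(M(\alpha , \lambda )\), the following sketch assumes the hinge loss \(L(y, t, \lambda ) = \max (0, 1 - yt)\) (with \(\lambda\) unused) and locates the minimizer by a simple grid search; the grid and helper names are our own.

```python
import numpy as np

def hinge(y, t, lam=None):
    # hinge loss; lam is unused here but kept to match the signature L(y, t, lambda)
    return max(0.0, 1.0 - y * t)

def C(alpha, t, lam=None):
    # inner risk C(alpha, t, lambda) = alpha L(1, t, lambda) + (1-alpha) L(-1, t, lambda)
    return alpha * hinge(1, t, lam) + (1 - alpha) * hinge(-1, t, lam)

def t_alpha_and_M(alpha, grid=np.linspace(-3, 3, 6001)):
    # grid search for t_alpha = argmin_t C(alpha, t) and M(alpha) = C(alpha, t_alpha)
    vals = [C(alpha, t) for t in grid]
    i = int(np.argmin(vals))
    return grid[i], vals[i]

t_star, M = t_alpha_and_M(0.7)
# for the hinge loss and alpha > 1/2 the inner risk is minimized at t = 1,
# giving M(alpha) = 2 * (1 - alpha)
assert abs(t_star - 1.0) < 1e-3 and abs(M - 0.6) < 1e-3
```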
Then
L is an admissible loss function [
25], defined as follows:
3 A general framework for twin support vector machines
Denote by
\(X \subseteq \mathbb {R}^{d}\) the input space for instances and by
\(Y = \{-1, 1\}\) the output space for labels. Let
S be the training set belonging to the space
\(X \times Y\) with
m samples
\((x_{i}, y_{i}), i = 1, \ldots , m\). Suppose
S consists of
\(m_1\) positive samples relabeled by
\((x_i^{1}, 1)\),
\(i=1, \ldots , m_1\), and
\(m_2\) negative samples relabeled by
\((x_j^{2}, -1)\),
\(j=1, \ldots , m_2\) with
\(m = m_{1} + m_{2}\). Assume the data of the training set
S are sampled from an unknown distribution
P on
\(X \times Y\), and they are independent and identically distributed (
i.i.d.). Denote matrices
A,
B and
D as follows
$$\begin{aligned} A^{T} =(x^{1}_{1}, \ldots , x^{1}_{m_{1}}), \ B^{T} =(x^{2}_{1}, \ldots , x^{2}_{m_{2}}), \ D^{T} = (A^{T}, B^{T}). \end{aligned}$$
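The stacking of the sample matrices can be sketched in NumPy with hypothetical toy samples (rows of A are the positive samples \(x_{i}^{1}\), rows of B the negative samples \(x_{j}^{2}\)):

```python
import numpy as np

# hypothetical toy data: m1 = 2 positive and m2 = 3 negative samples in R^2
A = np.array([[1.0, 2.0], [2.0, 3.0]])                  # rows are x_i^1
B = np.array([[-1.0, 0.0], [-2.0, 1.0], [0.0, -1.0]])   # rows are x_j^2
D = np.vstack([A, B])                                   # D^T = (A^T, B^T)

assert D.shape == (A.shape[0] + B.shape[0], 2)
```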
We give a general framework to cover most of the variants of TWSVM, which is formulated as follows:
$$\begin{aligned}&\min _{f_{1} \in H} \sum _{i=1}^{m_{1}} \ell _{1} (1, f_{1}(x_{i}^{1})) + \sum _{j=1}^{m_{2}} \ell _{2} (-1, f_{1}(x_{j}^{2}), \lambda _{1}) \nonumber \\&\quad + \varOmega ^{*}(c_{1}, ||f_{1}||_{H}), \end{aligned}$$
(1)
$$\begin{aligned}&\min _{f_{2} \in H} \sum _{j=1}^{m_{2}} \ell _{1} (-1, f_{2}(x_{j}^{2})) + \sum _{i=1}^{m_{1}} \ell _{2} (1, f_{2}(x_{i}^{1}), \lambda _{2}) \nonumber \\&\quad + \varOmega ^{*}(c_{2}, ||f_{2}||_{H}), \end{aligned}$$
(2)
where
\(c_{1}\),
\(c_{2}\),
\(\lambda _{1}\),
\(\lambda _{2}\) are all trade-off parameters,
\(f_{1}\),
\(f_{2}\) are the hyper-planes corresponding to the positive and negative classes, respectively, and
H is an RKHS. Here,
\(\ell _{1}\) is one loss function measuring the squared distance from the data of one class to the corresponding hyper-plane.
\(\ell _{2}\) is another loss function measuring the slack variable such that the distance from the data of the other class to the same hyper-plane is no smaller than 1.
\(\varOmega ^{*}\) is a regularization term for maximizing some margin.
Note that in the minimization problem Eq. (
1),
\(\ell _{1}\) measures the loss for one positive sample and
\(\ell _{2}\) measures the loss for one negative sample. The sum of the first two terms is the total loss over all the positive and negative samples. Now we want to combine the two loss functions into one revised loss function that measures the loss of any positive or negative sample with respect to the hyper-plane
\(f_{1}\). Given a function
p(
y)
$$\begin{aligned} p(y) = \left\{ \begin{aligned} 1,\,&\quad y=1, \\ 0,&\quad y=-1, \end{aligned} \right. \end{aligned}$$
the revised loss function
\(L_{1}\) can be defined as
$$\begin{aligned} \begin{aligned} L_{1}(y, f_{1}(x), \lambda _{1})&= p(y)\ell _{1} (y, f_{1}(x)) \\&\quad +(1 - p(y))\ell _{2}(y, f_{1}(x), \lambda _{1}) \\&= \left\{ \begin{array}{ll} \ell _{1} (1, f_{1}(x)),&\quad y=1, \\ \ell _{2}(-1, f_{1}(x), \lambda _{1}),&\quad y=-1. \end{array} \right. \end{aligned} \end{aligned}$$
Analogously, a revised loss function
\(L_{2}\) is defined as
$$\begin{aligned} \begin{aligned} L_{2}(y, f_{2}(x), \lambda _{2})&= (1 - p(y))\ell _{1} (y, f_{2}(x))\\&\quad + p(y)\ell _{2}(y, f_{2}(x), \lambda _{2}) \\&= \left\{ \begin{array}{ll} \ell _{2} (1, f_{2}(x), \lambda _{2}),\,&\quad y=1, \\ \ell _{1}(-1, f_{2}(x)),&\quad y=-1, \end{array} \right. \end{aligned} \end{aligned}$$
in the minimization problem Eq. (
2). It measures the loss of any positive or negative sample with respect to hyper-plane
\(f_{2}\). Therefore, the general framework can be rewritten as
$$\begin{aligned}&\min _{f_{1} \in H} \sum _{i=1}^{m} L_{1} (y_{i}, f_{1}(x_{i}), \lambda _{1}) + \varOmega ^{*}(c_{1}, ||f_{1}||_{H}), \\&\min _{f_{2} \in H} \sum _{i=1}^{m} L_{2} (y_{i}, f_{2}(x_{i}), \lambda _{2}) + \varOmega ^{*}(c_{2}, ||f_{2}||_{H}). \end{aligned}$$
Obviously, it is equivalent to the optimization problem
$$\begin{aligned} \begin{aligned}&\min _{f_{1} \in H} \frac{1}{m}\sum _{i=1}^{m} L_{1} (y_{i}, f_{1}(x_{i}), \lambda _{1}) + \varOmega (c_{1}, ||f_{1}||_{H}), \\&\min _{f_{2} \in H} \frac{1}{m}\sum _{i=1}^{m} L_{2} (y_{i}, f_{2}(x_{i}), \lambda _{2}) + \varOmega (c_{2}, ||f_{2}||_{H}), \end{aligned} \end{aligned}$$
(3)
where
\(\varOmega (\cdot , \cdot ) = \frac{1}{m}\varOmega ^{*}(\cdot , \cdot )\) is also a regularization function. To sum up, this optimization problem Eq. (
3) is a unified framework for most of the TWSVM variants considered in this paper.
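To make the unified objective of Eq. (3) concrete, the following sketch evaluates the first minimization problem for a linear \(f_{1}\), under illustrative choices in the spirit of TWSVM: a squared loss for \(\ell _{1}\), a hinge-type loss for \(\ell _{2}\), and \(\varOmega (c, t) = ct^{2}\). The data, function names and parameter values are all hypothetical.

```python
import numpy as np

# illustrative TWSVM-style choices: ell_1 is a squared loss for the class
# "owning" the hyper-plane, ell_2 a hinge-type slack penalty for the other class
def ell1(y, fx):
    return fx ** 2                        # squared distance to hyper-plane f = 0

def ell2(y, fx, lam):
    return lam * max(0.0, 1.0 - y * fx)   # hinge-type slack penalty

def p(y):
    return 1 if y == 1 else 0

def L1(y, fx, lam):
    # revised loss combining ell_1 and ell_2 through p(y)
    return p(y) * ell1(y, fx) + (1 - p(y)) * ell2(y, fx, lam)

def objective1(w1, b1, X, y, lam1, c1):
    # empirical objective of the first problem in Eq. (3), with
    # Omega(c, t) = c t^2 and ||f_1||_H approximated by ||w_1|| (linear kernel)
    f1 = X @ w1 + b1
    risk = np.mean([L1(yi, fi, lam1) for yi, fi in zip(y, f1)])
    return risk + c1 * np.dot(w1, w1)

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
y = np.where(X[:, 0] > 0, 1, -1)
val = objective1(np.array([1.0, 0.0]), 0.0, X, y, lam1=1.0, c1=0.1)
assert val >= 0.0
```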
Note that in the linear case, the two non-parallel hyper-planes are constructed as
\(f_1(x) = x^{T} w_{1} + b_{1}\) for positive samples and
\(f_2(x) = x^{T} w_{2} + b_{2}\) for negative samples, respectively. Similarly, in the nonlinear case, they are formulated as
\(f_{1}(x) = k(x^{T}, D^{T}) u_{1} + b_{1}\) for positive samples and
\(f_{2}(x) = k(x^{T}, D^{T}) u_{2} + b_{2}\) for negative samples, where
k is a kernel function. The final classifier
\(f: X \rightarrow \mathbb {R}\) for both linear and non-linear cases could be expressed as
\(f(x) = |f_{2}(x)| - |f_{1}(x)|\). Given a new sample
x, the predicted label of
x is
$$\begin{aligned} y = \left\{ \begin{aligned} 1,&\quad f(x) > 0, \\ -1,&\quad {\mathrm{otherwise}}. \end{aligned} \right. \end{aligned}$$
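The decision rule above (assign a sample to the class whose hyper-plane it lies nearer to) can be sketched as follows; the two linear hyper-planes used here are hypothetical toy examples.

```python
import numpy as np

def predict(x, f1, f2):
    # decision rule of the paper: f(x) = |f2(x)| - |f1(x)|, label 1 iff f(x) > 0
    fx = abs(f2(x)) - abs(f1(x))
    return 1 if fx > 0 else -1

# hypothetical linear hyper-planes: f1 vanishes near the positive class,
# f2 vanishes near the negative class
w1, b1 = np.array([1.0, 0.0]), -1.0
w2, b2 = np.array([1.0, 0.0]), 1.0
f1 = lambda x: float(x @ w1 + b1)
f2 = lambda x: float(x @ w2 + b2)

# a point close to the plane f1 = 0 (i.e. x[0] = 1) is labeled positive
assert predict(np.array([1.1, 0.3]), f1, f2) == 1
# a point close to the plane f2 = 0 (i.e. x[0] = -1) is labeled negative
assert predict(np.array([-0.9, 0.5]), f1, f2) == -1
```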
Let
\(\xi\),
\(\eta\) be two slack variables, and
\(e_{1}\),
\(e_{2}\) be two vectors whose elements are all 1’s and whose dimensions are
\(m_{1}\) and
\(m_{2}\), respectively. Below, we give several examples that can be expressed by the unified framework.
5 Theoretical results
Here, we investigate the assertion and use several theorems to establish the validity of each of its steps. Three theorems are first given to prove Steps 1–3 of the assertion, respectively. In the next three subsections, three pairs of concentration inequalities are derived for Step 4 under different conditions, based on the covering number, the localized covering number and stability, respectively. Theorems are given to show that universal consistency is valid under the different concentration inequalities. The proofs follow the idea in [
25]. Owing to space limitations, the detailed proofs of the theorems and lemmas are provided in the supplementary material.
Let
k be a positive semi-definite kernel,
\(L_{1}, L_{2}\) be admissible loss functions, and
\(\varOmega\) be a regularization function. Given the parameters
\(\lambda _{1}, \lambda _{2}\), define for
\(c_{1}, c_{2} > 0\) that
$$\begin{aligned}&\delta _{c_{1}} = \sup \{ t: \varOmega (c_{1}, t) \le L_{1}(1, 0, \lambda _{1}) + L_{1}(-1, 0, \lambda _{1}) \}, \\&\delta _{c_{2}} = \sup \{ t: \varOmega (c_{2}, t) \le L_{2}(1, 0, \lambda _{2}) + L_{2}(-1, 0, \lambda _{2}) \}, \\&L_{1, c_{1}} = {L_{1}}_{| Y \times [-\delta _{c_{1}}K, \delta _{c_{1}}K] \times \lambda _{1}}, \ \ L_{2, c_{2}} = {L_{2}}_{| Y \times [-\delta _{c_{2}}K, \delta _{c_{2}}K] \times \lambda _{2}}. \end{aligned}$$
Note that
\(0< \delta _{c_{1}}, \delta _{c_{2}} < \infty\), and we have
$$\begin{aligned} \hat{\delta }_{1} = \inf \{ \delta _{c_{1}} : c_{1} \in (0, 1] \}> 0, \ \hat{\delta }_{2} = \inf \{ \delta _{c_{2}} : c_{2} \in (0, 1] \} > 0. \end{aligned}$$
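To make \(\delta _{c}\) concrete: for the regularization function \(\varOmega (c, t) = ct^{2}\) (considered later in Sect. 5.3) and a hinge-type loss with \(L(1, 0, \lambda ) + L(-1, 0, \lambda ) = 2\), one gets \(\delta _{c} = \sqrt{2/c}\) in closed form. The following bisection sketch (our own illustration, not part of the paper's proofs) confirms this numerically.

```python
import numpy as np

def delta_c(c, L_at_zero, omega=lambda c, t: c * t ** 2, t_max=1e6):
    # numerically evaluate delta_c = sup{ t : Omega(c, t) <= L(1,0)+L(-1,0) }
    # by bisection, assuming Omega is non-decreasing in t
    lo, hi = 0.0, t_max
    for _ in range(200):
        mid = (lo + hi) / 2
        if omega(c, mid) <= L_at_zero:
            lo = mid
        else:
            hi = mid
    return lo

# with the hinge loss, L(1, 0, lam) + L(-1, 0, lam) = 1 + 1 = 2, and for
# Omega(c, t) = c t^2 the supremum is attained at delta_c = sqrt(2 / c)
c = 0.5
assert abs(delta_c(c, 2.0) - np.sqrt(2.0 / c)) < 1e-6
```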
For the loss function
\(L_{i,c_{i}}, i=1,2\), denote by
\(|\cdot |_{1}\) the supremum
$$\begin{aligned}&|L_{i,c_{i}}|_{1} = \sup \left\{ \frac{|L_{i}(y,t^{'},\lambda _{i})- L_{i}(y,t^{''},\lambda _{i})|}{|t^{'}-t^{''}|}: y \in Y, t^{'},t^{''}\right. \\&\quad \left. \in [-\delta _{c_{i}}K, \delta _{c_{i}}K], t^{'} \ne t^{''} \right\} . \end{aligned}$$
Theorem
1 corresponds to Step 1 of the assertion, and ensures the existence of two elements
\(f_{1,P,c_{1}}\) and
\(f_{2,P,c_{2}}\) which minimize the regularized
\(L_1\)-risk and regularized
\(L_2\)-risk, respectively. Here,
\(\delta _{c_{1}}\) and
\(\delta _{c_{2}}\) are two critical quantities and give upper bounds on the norm of the solutions to the optimization problem Eq. (
3), respectively.
Lemma
1 converts the minimal
\(L_{1}\)-risk and minimal
\(L_{2}\)-risk to the expectations of
\(M_{1}\) and
\(M_{2}\), respectively. It is needed in the proof of Step 2.
Theorem
2 corresponds to Step 2 of the assertion, and guarantees that the minimal
\(L_1\)-risk and minimal
\(L_2\)-risk could be achieved at
\(f_{1,P,c_{1}}\) and
\(f_{2,P,c_{2}}\), respectively, with
\(c_{1}\),
\(c_{2}\) tending to 0.
Theorem
3 corresponds to Step 3 of the assertion, and shows that the implication
$$\begin{aligned} \left\{ \begin{array}{ll} \lim _{m \rightarrow \infty } R_{1,P} (f_{1,m}) = R_{1,P} \\ \lim _{m \rightarrow \infty } R_{2,P} (f_{2,m}) = R_{2,P} \end{array} \right. \Longrightarrow \lim _{m \rightarrow \infty } R_{P} (f_{m}) = R_{P} \end{aligned}$$
is valid for all sequences of measurable functions
\(f_{1,m}\),
\(f_{2,m}\) and
\(f_{m}(\cdot ) = |f_{2,m}(\cdot )| - |f_{1,m}(\cdot )|\).
5.1 Universal consistency based on covering number
Now we turn to Step 4 of the assertion, and seek a pair of concentration inequalities based on the covering number [
33].
Instead of the covering number itself, its logarithm is employed more frequently, denoted as \(\mathcal {H}((\mathcal {M}, d), \epsilon ) = \ln \mathcal {N}((\mathcal {M}, d), \epsilon )\).
In addition, we have to measure the continuity of a function. Given a loss function
\(L_{1}\), the modulus and inverted modulus of continuity [
2] of the function are expressed as
\(w(L_{1}, \delta )\) and
\(w^{-1}(L_{1}, \epsilon )\), respectively,
$$\begin{aligned}&w(L_{1}, \delta ) = \sup _{y \in Y, \lambda _{1} \in \mathbb {R}^{+}, t^{'}, t^{''} \in \mathbb {R}, |t^{'}-t^{''}| \le \delta } |L_{1}(y, t^{'}, \lambda _{1}) - L_{1}(y, t^{''}, \lambda _{1})|, \\&w^{-1}(L_{1}, \epsilon ) = \sup \{ \delta >0 : w(L_{1}, \delta ) \le \epsilon \}. \end{aligned}$$
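For a concrete loss, these quantities are easy to evaluate: the hinge loss is 1-Lipschitz in t, so \(w(L, \delta ) = \delta\) and, inverting, \(w^{-1}(L, \epsilon ) = \epsilon\). The following sketch checks the modulus empirically on a grid (the grid and tolerances are our own choices).

```python
import numpy as np

def hinge(y, t):
    return max(0.0, 1.0 - y * t)

def modulus(delta, ts=np.linspace(-3.0, 3.0, 301)):
    # empirical modulus of continuity w(L, delta) of the hinge loss,
    # approximated over all grid pairs (t', t'') with |t' - t''| <= delta
    best = 0.0
    for y in (-1, 1):
        for t1 in ts:
            for t2 in ts:
                if abs(t1 - t2) <= delta + 1e-12:
                    best = max(best, abs(hinge(y, t1) - hinge(y, t2)))
    return best

# the hinge loss is 1-Lipschitz in t, hence w(L, delta) = delta
assert abs(modulus(0.5) - 0.5) < 1e-6
```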
With these definitions, we begin to formulate the pair of concentration inequalities for the optimization problem Eq. (
3) by the following lemma, and establish the consistency result by the following theorem.
Note that each sample \((x_{i}, y_{i}) \in S \subseteq X \times Y, i= 1, \ldots , m\) follows the probability distribution P; hence \(((x_{1}, y_{1}), (x_{2}, y_{2}), \ldots , (x_{m}, y_{m})) \in (X \times Y)^{m}\) follows the product distribution \(P^{m}\).
Lemma
2 and Theorem
4 correspond to Step 4 of the assertion. By virtue of the covering numbers of
\(\delta _{c_{1}}I\) and
\(\delta _{c_{2}}I\), the pair of concentration inequalities is derived first. Based on them, the universal consistency of TWSVMs is then established. Next, we give an example to explain the case in detail.
5.2 Universal consistency based on localized covering number
Sometimes it is preferable to derive the pair of concentration inequalities from the localized covering number rather than the covering number. Given a function set
\(\mathcal {F} = \{ f: X \mapsto \mathbb {R} \}\), the localized covering number of
\(\mathcal {F}\) is
$$\begin{aligned} \mathcal {N}(\mathcal {F}, m, \epsilon ) = \sup \{ \mathcal {N}((\mathcal {F}_{| X_{0}}, \ell _{\infty }^{|X_{0}|}), \epsilon ) : X_{0} \subset X, |X_{0}| \le m \}, \end{aligned}$$
for any
\(\epsilon > 0\) and
\(m \ge 1\), where
\(\ell _{\infty }^{|X_{0}|}\) is the space
\(\mathbb {R}^{|X_{0}|}\) with the maximum norm, and
\(\mathcal {F}_{| X_{0}} = \{ f_{| X_{0}} : f \in \mathcal {F} \}\) could be regarded as a subset of
\(\ell _{\infty }^{|X_{0}|}\). The logarithm of localized covering number is
\(\mathcal {H}(\mathcal {F}, m, \epsilon ) = \ln \mathcal {N}(\mathcal {F}, m, \epsilon )\). Now we start to obtain another pair of concentration inequalities for the optimization problem Eq. (
3) according to the following lemma, and derive the universal consistency by the following theorem.
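The localized covering number defined above can be computed by brute force for a tiny finite function class; the sketch below uses a greedy internal \(\epsilon\)-cover, which gives an upper bound on the covering number (the toy class and inputs are hypothetical).

```python
import itertools
import numpy as np

def internal_cover_size(points, eps):
    # greedy internal eps-cover in the max norm: an upper bound
    # on the covering number N((M, d), eps) of a finite point set
    pts = [np.asarray(p, dtype=float) for p in points]
    centers = []
    for p in pts:
        if all(np.max(np.abs(p - c)) > eps for c in centers):
            centers.append(p)
    return len(centers)

def localized_covering_number(F, X, m, eps):
    # N(F, m, eps): worst case over subsets X0 of X with |X0| <= m of the
    # covering number of F restricted to X0, in the max norm
    best = 0
    for size in range(1, m + 1):
        for X0 in itertools.combinations(X, size):
            restricted = [[f(x) for x in X0] for f in F]
            best = max(best, internal_cover_size(restricted, eps))
    return best

# toy class: three linear functions a * x on five input points
F = [lambda x, a=a: a * x for a in (0.0, 0.5, 1.0)]
X = [-1.0, -0.5, 0.0, 0.5, 1.0]
n = localized_covering_number(F, X, m=2, eps=0.1)
assert 1 <= n <= len(F)
```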
Lemma
3 presents another pair of concentration inequalities based on the localized covering numbers of
\(\delta _{c_{1}}I\) and
\(\delta _{c_{2}}I\), and Theorem
5 discusses the conditions under which universal consistency is valid for TWSVMs. Thus, Lemma
3 and Theorem
5 establish Step 4 of the assertion. In the following, an example is given to illustrate the universal consistency.
5.3 Universal consistency based on stability
For practical problems, the case for convex loss functions and for the regularization function
\(\varOmega (c,t) = ct^{2}\) is often considered for the optimization problem Eq. (
3). In this case, a stable classifier is always employed. Here, the property for stability [
3] is redefined as follows:
In Lemma
4, the classifier of the optimization problem Eq. (
3) is shown to be stable for the convex loss functions
\(L_{1}\),
\(L_{2}\) and for the regularization function
\(\varOmega (c,t) = ct^{2}\). Below, we verify Step 4 of the assertion.
In Lemma
5, a pair of concentration inequalities is derived under the condition that the classifier is stable. With these inequalities, the universal consistency of the optimization problem Eq. (
3) is then obtained in Theorem
6.