Journal of University of Chinese Academy of Sciences, 2019, Vol. 36, Issue (4): 449-460
Targeted local angle-based multi-category support vector machine
KANG Wenjia1, LIN Wenhui2, ZHANG Sanguo1     
1. School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China;
2. Technology Research Institute, Aisino Corporation, Beijing 100195, China
Abstract: The support vector machine (SVM) is one of the most concise and efficient classification methods in machine learning. Traditional SVMs mainly handle binary classification problems by maximizing the smallest margin. However, real-world data are much more complicated. On the one hand, the label set usually has more than two categories, so SVMs need to be generalized to solve multi-category problems reasonably. On the other hand, there may exist one special variable, such as age in the biosciences, that should be singled out so that its effect on the final results is preserved separately from the other variables. We call such a special variable the targeted variable. In this work, taking both aspects into consideration, the targeted local angle-based multi-category support vector machine (TLAMSVM) is proposed. The new model not only solves multi-category problems but also pays special attention to the targeted variable. Moreover, TLAMSVM handles multi-classification within the angle-based framework, which provides a better interpretation from the geometrical viewpoint, and it uses a local smoothing method to pool the information of the targeted variable. To validate the classification performance of the TLAMSVM model, we apply it to both simulated and real data sets and obtain the expected results in the numerical experiments.
Keywords: local smoothing    multi-category support vector machine    angle-based maximum margin classification framework    
Angle-based varying-coefficient multi-category support vector machine
KANG Wenjia1, LIN Wenhui2, ZHANG Sanguo1
1. School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China;
2. Technology Research Institute, Aisino Corporation, Beijing 100195, China
Abstract: As a classic classification algorithm in machine learning, the support vector machine has long been favored by data scientists. Whether the data are linearly separable or not, the traditional support vector machine solves binary classification problems well: given the samples, it obtains the optimal decision boundary by maximizing the smallest margin and thus predicts the class of a new sample. Real-world data, however, are more complicated. On the one hand, the data usually have more than two categories, and many excellent multi-category support vector machine algorithms have appeared in recent years. On the other hand, the feature set of data from different fields may contain a relatively special variable (called the targeted variable), which needs to be singled out and treated specially so as to preserve its important influence on the final classification result. Taking both aspects into account, we propose the targeted local angle-based multi-category support vector machine (TLAMSVM) to solve multi-category problems with a targeted variable. It performs multi-category classification within the angle-based maximum margin classification framework, which has a better geometric interpretation, and introduces the varying-coefficient model, handling the influence of the targeted variable by choosing an appropriate local smoothing function. We apply the angle-based varying-coefficient multi-category support vector machine to simulated and real data sets. The numerical results show that, compared with multi-category support vector machines that do not use the varying-coefficient idea or the angle-based multi-category framework, the TLAMSVM model achieves higher prediction accuracy.
Keywords: local smoothing    multi-category support vector machine    angle-based maximum margin classification framework

Machine learning, as a new research orientation, plays a significant role in the age of big data. A widely used classification method in machine learning is the support vector machine (SVM)[1]. It is a classic margin-based classifier, which seeks the most appropriate separating hyperplane in the feature space and solves an optimization problem of the form (loss + regularization)[2]. In recent decades, the SVM and its derived algorithms have held an important place in many fields[3] on account of their high prediction accuracy and computational efficiency.

The typical SVM handles binary problems with one discriminant function. As for its theoretical properties, it has been proved to have good asymptotic convergence rates and Fisher consistency. Nevertheless, in realistic applications, multi-category problems are more common than binary ones. As an illustration, the diagnosis of cancer patients may be divided into several categories by measuring the grade of malignancy of cancer cells or tissues. Hence the multi-category support vector machine (MSVM) was proposed. Apart from conventional methods such as one-versus-one (OvO) and one-versus-rest (OvR)[4-5], which train a sequence of binary SVMs, there are many simultaneous multi-classification methods, which consider all k classes in a single optimization, learn k fitting functions, and make the classification decision based on a specific prediction rule, such as Crammer and Singer[6-7]. In principle, (k-1) functions should be adequate for simultaneous multi-classification methods, since binary classifiers need only one function, and the classification function space is usually reduced to (k-1) dimensions by adding a sum-to-zero constraint, $\sum_{j=1}^{k} f_{j}=0 $. However, this constraint is inefficient and requires extra computation. Zhang and Liu[8] proved that the number of classification functions can still be reduced to k-1 even without the sum-to-zero constraint. From the geometric viewpoint, the large margin in MSVM is not as readily comprehensible as that in binary SVMs. To address this problem, Zhang and Liu[9] proposed the angle-based large-margin classification technique. It requires k-1 classification functions but no sum-to-zero constraint. When this technique is applied to SVMs, the only drawback is the lack of Fisher consistency. In 2016, a modified method, the reinforced angle-based multi-category support vector machine (RAMSVM)[10], was proposed. RAMSVM is Fisher consistent and achieves good results in numerical experiments.

Apart from increasing the number of categories an SVM can handle, studying the interrelation of covariates is also a worthwhile direction. For example, the SVM is a popular algorithm in disease research for predicting the disease risk status of patients. However, Wei et al.[11] noted that some variables by themselves might contribute little to prediction yet play significant roles when coordinated with other correlated variables. Accordingly, Ladicky and Torr[12] proposed a locally linear SVM that smooths over the entire feature variable space. Since smoothing does not work well in a high-dimensional feature space, Chen et al.[13] developed the targeted local smoothing kernel weighted support vector machine (KSVM), which only requires the data to be plentiful in the targeted dimension. It obtains an age-dependent prediction rule for discriminating disease risk by optimizing a kernel weighted problem and performs quite well on binary classification problems. It is therefore worth applying such a local weighting tool to MSVM.

In this study, we make two main contributions. Firstly, we combine the angle-based multi-category classification framework of the RAMSVM model and the targeted local smoothing method of the KSVM model, and propose a new MSVM model, the targeted local angle-based multi-category support vector machine (TLAMSVM), which offers better classification predictions for certain data sets. Secondly, the TLAMSVM model provides a more detailed analysis of a sample according to its value of the targeted variable. Before introducing the TLAMSVM model, some fundamental knowledge of the RAMSVM and KSVM models, which form the basis of the TLAMSVM model, is described in section 1. In section 2, the detailed formulation of the TLAMSVM model is presented, followed by its computational procedure based on the coordinate descent method. In section 3, we report experimental results on simulated and real data sets and give some illustrations to demonstrate the strengths of the TLAMSVM model. Finally, we conclude the article with a discussion in section 4.

1 Background and related work
1.1 Reinforced angle-based multi-category support vector machine(RAMSVM)

For a k-class data set, the angle-based large-margin classification framework defines a simplex with k vertices W1, W2, …, Wk in the (k-1)-dimensional space, one vertex corresponding to each class,

$ W_{j}=\left\{\begin{array}{ll}{(k-1)^{-\frac{1}{2}} \boldsymbol{1}, } & {j=1, } \\ {-\frac{1+\sqrt{k}}{(k-1)^{3 / 2}} \boldsymbol{1}+\sqrt{\frac{k}{k-1}} e_{j-1}, } & {2 \leqslant j \leqslant k, }\end{array}\right. $

where Wj represents the jth vertex, j=1, 2, …, k, 1 is the vector with every element equal to 1, and ej is the unit vector whose elements are all 0 except for the jth one, with Wj, 1, $ e_{j} \in \mathbb{R}^{k-1}$. It can be verified that ‖Wj‖=1 and that the angle between any two distinct vertex vectors is the same. Each observation x is mapped to f(x)=(f1(x), …, fk-1(x))∈$ \mathbb{R}^{k-1}$. Consequently, there are k angles ∠(f(x), Wj), and we regard argminj∈{1, …, k}∠(f(x), Wj) as the predicted category. This prediction rule can equivalently be written as $\hat{y}(x)=\operatorname{argmax}_{j \in\{1, \cdots, k\}}\left\langle f(x), W_{j}\right\rangle $.
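To make the construction concrete, the following sketch (our own illustration in Python/numpy, not code from the paper; the function names are hypothetical) builds the simplex vertices and applies the angle-based prediction rule.

```python
import numpy as np

def simplex_vertices(k):
    """Construct the k simplex vertices W_1,...,W_k in R^(k-1) used by the
    angle-based framework; each has unit norm and pairwise equal angles."""
    W = np.zeros((k, k - 1))
    W[0] = (k - 1) ** (-0.5) * np.ones(k - 1)
    for j in range(2, k + 1):
        e = np.zeros(k - 1)
        e[j - 2] = 1.0                      # e_{j-1} in the paper's notation
        W[j - 1] = -(1 + np.sqrt(k)) / (k - 1) ** 1.5 * np.ones(k - 1) \
                   + np.sqrt(k / (k - 1)) * e
    return W

def angle_based_predict(f_x, W):
    """Assign the class whose vertex has the largest inner product with f(x),
    i.e. the smallest angle to f(x); classes are labelled 1..k."""
    return int(np.argmax(W @ f_x)) + 1

# quick check of the stated properties for k = 4
W = simplex_vertices(4)
print(np.allclose(np.linalg.norm(W, axis=1), 1.0))   # all vertices have unit norm
print(np.round(W @ W.T, 3))                          # equal off-diagonal inner products
```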

Following the general form of large-margin classification algorithms, we add a penalty term J(·) to the loss function, so the optimization problem becomes

$ {\min\limits_{f \in F}}\frac{1}{n}\sum\limits_{i = 1}^n l \left( {\left\langle {f\left( {{x_i}} \right), {W_{{y_i}}}} \right\rangle } \right) + \lambda J(f), $ (1)

where l(·) is a loss function, which measures the cost of wrong predictions. Nevertheless, a typical loss function such as the hinge loss l(u)=[1-u]+ is not Fisher consistent. Inspired by Lee et al.[7], RAMSVM uses a convex combination of loss functions to overcome this drawback. Therefore (1) becomes

$ \begin{array}{*{20}{l}} {{\min \limits_{f \in F}}\frac{1}{n}\sum\limits_{i = 1}^n {\left\{ {(1 - \gamma )\sum\limits_{j \ne {y_i}} {{{\left[ {1 + \left\langle {f\left( {{x_i}} \right), {W_j}} \right\rangle } \right]}_ + }} } \right.} }\\ {\quad \left. { + \gamma {{\left[ {(k - 1) - \left\langle {f\left( {{x_i}} \right), {W_{{y_i}}}} \right\rangle } \right]}_ + }} \right\} + \lambda J(f), } \end{array} $ (2)

where γ∈[0, 1] is the convex combination parameter. It has been proved that the loss function of the RAMSVM model is Fisher consistent when γ∈[0, 0.5][10]. In addition, when γ=0.5, RAMSVM is stable and almost reaches the optimal classification accuracy. Therefore we set γ=0.5 by default in the remainder of this work.
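As a small illustration (ours, not the authors' code), the reinforced loss inside (2) can be evaluated for a single observation as follows, reusing a k×(k-1) vertex matrix W such as the one constructed above.

```python
import numpy as np

def ramsvm_loss(f_x, y, W, gamma=0.5):
    """Reinforced angle-based loss of (2) for one observation (y in 1..k):
    a convex combination of hinge terms on the inner products <f(x), W_j>."""
    k = W.shape[0]
    inner = W @ f_x                                   # <f(x), W_j>, j = 1..k
    wrong = np.delete(inner, y - 1)                   # terms with j != y
    return (1 - gamma) * np.sum(np.maximum(0.0, 1.0 + wrong)) \
        + gamma * max(0.0, (k - 1) - inner[y - 1])
```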

1.2 Kernel weighted support vector machine(KSVM)

Taking binary medical data as an example, let Y denote the label in {1, -1}, which indicates whether a subject is at disease risk or not. Let "age" be the targeted variable U, and let the variable X contain the other features. The purpose now is to train a classifier that depends on the age information. Firstly, we consider modifying the typical discriminant score as follows:

$ \alpha(U)+\boldsymbol{X}^{\mathrm{T}} \boldsymbol{\beta}(U), $ (3)

where α(U) is a baseline function without a specified parametric form, and β(U) is a vector whose elements are the coefficients of X and vary with age. It is natural to expect that subjects with similar ages and similar features have similar disease risk, so β(U) should vary smoothly as the age U changes smoothly. Some features may provide more information for classification at older ages while others do the reverse; from this perspective, one gains a better understanding of each physical sign. After evaluating (3), a subject with a negative score is classified as risk-free, and one with a positive score is assigned to the risk group.
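A minimal sketch (ours) of the age-varying discriminant score (3) and the sign decision that follows; `alpha_hat` and `beta_hat` stand for hypothetical fitted coefficient functions.

```python
import numpy as np

def ksvm_decision(x, u, alpha_hat, beta_hat):
    """Evaluate the age-dependent score alpha(u) + x^T beta(u) of (3);
    a positive score is assigned to the risk group (label 1),
    a negative score to the risk-free group (label -1).
    alpha_hat and beta_hat are hypothetical fitted coefficient functions of u."""
    score = alpha_hat(u) + np.asarray(x) @ np.asarray(beta_hat(u))
    return 1 if score > 0 else -1
```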

After choosing an appropriate loss function, the targeted local smoothing kernel weighted support vector machine (KSVM) model trains a classifier via solving

$ \begin{array}{*{20}{c}} {\mathop {\min }\limits_{\alpha \left( {{u_0}} \right), \beta \left( {{u_0}} \right)} \sum\limits_i {{K_{{h_n}}}} \left( {{U_i} - {u_0}} \right) \cdot }\\ {l\left\{ {{Y_i}, {X_i};\alpha \left( {{u_0}} \right), \beta \left( {{u_0}} \right)} \right\} + \lambda \left\| {\beta \left( {{u_0}} \right)} \right\|, } \end{array} $

where Khn(·) is a kernel density function, which pools the information of subjects whose ages lie around u0 within a certain bandwidth hn. In other words, the KSVM model trains a separate SVM at each u0, with the ith sample weighted by Khn(Ui-u0).

After obtaining the estimated coefficients $ \hat{\alpha}(u)$ and $\hat{\boldsymbol{\beta}}(u) $, the KSVM model predicts the disease risk status of any new subject with age u and features x via $\hat d(\mathit{\boldsymbol{x}}, u) = {\mathop{\rm sign}\nolimits} \left\{ {\widehat \alpha (u) + {\mathit{\boldsymbol{x}}^{\rm{T}}}\mathit{\boldsymbol{\widehat \beta }}(u)} \right\} $.
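To make the local smoothing step concrete, here is a short sketch (ours, using a Gaussian kernel as one admissible choice) of the weights Khn(Ui-u0) that KSVM attaches to the subjects before fitting the SVM at u0.

```python
import numpy as np

def gaussian_weights(U, u0, h):
    """Kernel weights K_h(U_i - u0): subjects whose targeted variable is close
    to u0 receive large weight, distant subjects receive weight near zero."""
    z = (np.asarray(U) - u0) / h
    return np.exp(-0.5 * z ** 2) / (h * np.sqrt(2.0 * np.pi))

# Conceptually, KSVM then solves a weighted SVM at each u0:
#   min_{alpha(u0), beta(u0)}  sum_i w_i * loss(Y_i, X_i) + lambda * ||beta(u0)||,
# where w_i = gaussian_weights(U, u0, h)[i].
```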

2 Methodology and computation

In this section we first describe how the multi-category local smoothing kernel weighted optimization problem of the TLAMSVM model is designed. Then we list the main computational steps of the TLAMSVM model using the coordinate descent method. We take the linearly separable case as an example, since the non-separable case can be handled with the kernel trick.

2.1 Targeted local angle-based multi-category support vector machine(TLAMSVM)

Assume that there are n observations, each of which has p+1 features, one of which is the targeted variable U. For simplicity, the intercept is included in X. Denote the feature values of the ith observation by Xi=(1, Xi(1), Xi(2), …, Xi(p))T, and let X(j)=(X1(j), X2(j), …, Xn(j)) denote the values of the jth feature over all observations. The sample matrix X of dimension (p+1)×n can then be written as

$ \mathit{\boldsymbol{X}} = \left( {{X_1}, {X_2}, \cdots , {X_n}} \right) = {\left( {1, {X^{(1)}}, {X^{(2)}}, \cdots , {X^{(p)}}} \right)^{\rm{T}}}. $ (4)

Assume that each component of the classification function vector at U=u0 is

$ {f_q}\left( {{u_0}, \mathit{\boldsymbol{X}}} \right) = {\mathit{\boldsymbol{X}}^{\rm{T}}}{\mathit{\boldsymbol{\beta }}_q}\left( {{u_0}} \right)\;\;\;(q = 1, \cdots , k - 1), $ (5)

where βq(u0)=(αq(u0), βq(1)(u0), βq(2)(u0), …, βq(p)(u0))T. We use the convex combination loss in the model. Hence, the original optimization problem of the TLAMSVM model with targeted variable U is

$ \begin{array}{*{20}{l}} {\mathop {\min }\limits_{f \in F} \sum\limits_i {{K_{{h_n}}}} \left( {{U_i} - {u_0}} \right)\left\{ {(1 - \gamma )\sum\limits_{j \ne {y_i}} {{{\left[ {1 + \left\langle {f\left( {{u_0}, {X_i}} \right), {W_j}} \right\rangle } \right]}_ + }} } \right.}\\ {\quad \left. { + \gamma {{\left[ {(k - 1) - \left\langle {f\left( {{u_0}, {X_i}} \right), {W_{{y_i}}}} \right\rangle } \right]}_ + }} \right\} + \frac{\lambda }{2}J(f), } \end{array} $ (6)

where {(1-γ)∑j≠yi[1+〈f(u0, Xi), Wj〉]++γ[(k-1)-〈f(u0, Xi), Wyi〉]+} is the loss function of our model, and the weight before the loss of each observation, Khn(Ui-u0), is a smoothing kernel function with bandwidth hn, as in the KSVM model. To handle the more general situation, we introduce the concept of a "soft margin" and allow some samples to violate the constraints

$ \begin{array}{c}{-\left\langle f\left(u_{0}, X_{i}\right), W_{j}\right\rangle \geqslant 1 \text { and } 2-k+} \\ {\left\langle f\left(u_{0}, X_{i}\right), W_{y_{i}}\right\rangle \geqslant 1}, \end{array} $

that is, we introduce slack variables ξi, j (ξi, j≥0) into the constraints and obtain

$ \begin{array}{c}{-\left\langle f\left(u_{0}, X_{i}\right), W_{j}\right\rangle \geqslant 1-\xi_{i, j}, \left(i=1, \cdots, n, j \neq y_{i}\right)}, \\ {2-k+\left\langle f\left(u_{0}, X_{i}\right), W_{y_{i}}\right\rangle \geqslant 1-\xi_{i, y_{i}}}, \\ {(i=1, \cdots, n).}\end{array} $ (7)

We set the penalty in (6) as $J=\sum_{q=1}^{k-1} \boldsymbol{\beta}_{q}^{\mathrm{T}}\left(u_{0}\right) \boldsymbol{\beta}_{q}\left(u_{0}\right) $, plug (7) into (6), and after some simplification obtain

$ \begin{array}{*{20}{c}} {\mathop {\min }\limits_{{\mathit{\boldsymbol{\beta }}_q}\left( {{u_0}} \right), {\xi _{i, j}}} \frac{{n\lambda }}{2}\sum\limits_{q = 1}^{k - 1} {\mathit{\boldsymbol{\beta }}_q^{\rm{T}}} \left( {{u_0}} \right){\mathit{\boldsymbol{\beta }}_q}\left( {{u_0}} \right) + }\\ {\sum\limits_i {{K_{{h_n}}}} \left( {{U_i} - {u_0}} \right)\left[ {\gamma {\xi _{i, {y_i}}} + (1 - \gamma )\sum\limits_{j \ne {y_i}} {{\xi _{i, j}}} } \right]}\\ {\rm{s}}{\rm{.t}}{\rm{.}}\left\{ {\begin{array}{*{20}{l}} {{\xi _{i, j}} \ge 0\;(i = 1, \cdots , n, j = 1, \cdots , k), }\\ {{\xi _{i, {y_i}}} + \left\langle {f\left( {{u_0}, {X_i}} \right), {W_{{y_i}}}} \right\rangle - (k - 1) \ge 0\;(i = 1, \cdots , n), }\\ {{\xi _{i, j}} - \left\langle {f\left( {{u_0}, {X_i}} \right), {W_j}} \right\rangle - 1 \ge 0\;\left( {i = 1, \cdots , n, j \ne {y_i}} \right).} \end{array}} \right. \end{array} $ (8)

Obviously, (8) is a constrained optimization problem, whose Lagrangian function could be defined as

$ \begin{aligned} L=& \frac{n \lambda}{2} \sum\limits_{q=1}^{k-1} \boldsymbol{\beta}_{q}^{\mathrm{T}}\left(u_{0}\right) \boldsymbol{\beta}_{q}\left(u_{0}\right)+\\ & \sum\limits_{i=1}^{n} K_{h_{n}}\left(U_{i}-u_{0}\right)\left[\gamma \xi_{i, y_{i}}+(1-\gamma) \sum\limits_{j \neq y_{i}} \xi_{i, j}\right]-\\ & \sum\limits_{i=1}^{n} \alpha_{i, y_{i}}\left[\xi_{i, y_{i}}+\left\langle f\left(u_{0}, X_{i}\right), W_{y_{i}}\right\rangle-(k-1)\right]-\\ & \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{k} \tau_{i, j} \xi_{i, j}-\sum\limits_{i=1}^{n} \sum\limits_{j \neq y_{i}} \alpha_{i, j}\\ &\left[\xi_{i, j}-\left\langle f\left(u_{0}, X_{i}\right), W_{j}\right\rangle- 1\right]. \end{aligned} $

Here αi, j and τi, j are Lagrange multipliers. After simplification, we get

$ \begin{aligned} L=& \frac{n \lambda}{2} \sum\limits_{q=1}^{k-1} \boldsymbol{\beta}_{q}^{\mathrm{T}}\left(u_{0}\right) \boldsymbol{\beta}_{q}\left(u_{0}\right)+\\ & \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{k}\left[A_{i, j}-\tau_{i, j}-\alpha_{i, j}\right] \xi_{i, j}+\sum\limits_{i=1}^{n} \alpha_{i, y_{i}}(k-1)+\\ &{\sum\limits_{i = 1}^n {\sum\limits_{j \ne {y_i}} {{\alpha _{i, j}}} } - \sum\limits_{i = 1}^n {{\alpha _{i, {y_i}}}} \left\langle {f\left( {{u_0}, {X_i}} \right), {W_{{y_i}}}} \right\rangle + }\\ &{\sum\limits_{i = 1}^n {\sum\limits_{j \ne {y_i}} {{\alpha _{i, j}}} } \left\langle {f\left( {{u_0}, {X_i}} \right), {W_j}} \right\rangle , } \end{aligned} $ (9)

where Ai, j=Khn(Ui-u0)[γI(j=yi)+(1-γ)I(jyi)].

Denote the last two terms of (9) by L1 and L2,

$ \begin{aligned} L_{1} &=-\sum\limits_{i=1}^{n} \alpha_{i, y_{i}}\left\langle f\left(u_{0}, X_{i}\right), W_{y_{i}}\right\rangle \\ &=-\sum\limits_{i=1}^{n} \sum\limits_{q=1}^{k-1} \alpha_{i, y_{i}}\left(\boldsymbol{X}_{i}^{\mathrm{T}} \boldsymbol{\beta}_{q}\left(u_{0}\right)\right) W_{y_{i}, q} \\ L_{2} &=\sum\limits_{i=1}^{n} \sum\limits_{j \neq y_{i}} \alpha_{i, j}\left\langle f\left(u_{0}, X_{i}\right), W_{j}\right\rangle \\ &=\sum\limits_{i=1}^{n} \sum\limits_{q=1}^{k-1} \sum\limits_{j \neq y_{i}} \alpha_{i, j}\left(\boldsymbol{X}_{i}^{\mathrm{T}} \boldsymbol{\beta}_{q}\left(u_{0}\right)\right) W_{j, q} .\end{aligned} $

Take partial derivatives of L with respect to βq(u0) and ξi, j, then let the partial derivatives be zero,

$ \begin{array}{*{20}{l}} {\frac{{\partial L}}{{\partial {\xi _{i, j}}}} = {A_{i, j}} - {\tau _{i, j}} - {\alpha _{i, j}} = 0}\\ {(i = 1, \cdots , n;j = 1, \cdots , k), } \end{array} $ (10)
$ \begin{array}{*{20}{c}} {\frac{{\partial L}}{{\partial {\mathit{\boldsymbol{\beta }}_q}\left( {{u_0}} \right)}} = n\lambda {\mathit{\boldsymbol{\beta }}_q}\left( {{u_0}} \right) - \sum\limits_{i = 1}^n {{\alpha _{i, {y_i}}}} {W_{{y_i}, q}}{X_i} + }\\ {\sum\limits_{i = 1}^n {\sum\limits_{j \ne {y_i}} {{\alpha _{i, j}}} } {W_{j, q}}{X_i} = 0.} \end{array} $ (11)

From (11) one obtains

$ {\mathit{\boldsymbol{\beta }}_q}\left( {{u_0}} \right) = \frac{1}{{n\lambda }}\left[ {\sum\limits_{i = 1}^n {{\alpha _{i, {y_i}}}} {W_{{y_i}, q}}{X_i} - \sum\limits_{i = 1}^n {\sum\limits_{j \ne {y_i}} {{\alpha _{i, j}}} } {W_{j, q}}{X_i}} \right] $ (12)

and, rearranging (12),

$ \sum\limits_{i = 1}^n {\left( {\sum\limits_{j \ne {y_i}} {{\alpha _{i, j}}} {W_{j, q}} - {\alpha _{i, {y_i}}}{W_{{y_i}, q}}} \right)} X_i^{\rm{T}} = - n\lambda \mathit{\boldsymbol{\beta }}_q^{\rm{T}}\left( {{u_0}} \right). $

We therefore get

$ \begin{array}{*{20}{c}} {{L_1} + {L_2} = \sum\limits_{q = 1}^{k - 1} {\sum\limits_{i = 1}^n {\left( {\mathit{\boldsymbol{X}}_i^{\rm{T}}{\mathit{\boldsymbol{\beta }}_q}\left( {{u_0}} \right)} \right)} } \left[ {\sum\limits_{j \ne {y_i}} {{\alpha _{i, j}}} {W_{j, q}} - } \right.}\\ {{\alpha _{i, {y_i}}}{W_{{y_i}, q}}] = - n\lambda \sum\limits_{q = 1}^{k - 1} {\mathit{\boldsymbol{\beta }}_q^{\rm{T}}} \left( {{u_0}} \right){\mathit{\boldsymbol{\beta }}_q}\left( {{u_0}} \right).} \end{array} $ (13)

Plugging (10), (11), and (13) into (9), we simplify the Lagrangian function as

$ L=-\frac{n \lambda}{2} \sum\limits_{q=1}^{k-1} \boldsymbol{\beta}_{q}^{\mathrm{T}}\left(u_{0}\right) \boldsymbol{\beta}_{q}\left(u_{0}\right)+\sum\limits_{i=1}^{n} \alpha_{i, y_{i}}(k-1)+\sum\limits_{i=1}^{n} \sum\limits_{j \neq y_{i}} \alpha_{i, j}. $ (14)

Plugging (12) into (14) and maximizing the Lagrangian function with respect to the multipliers, we obtain the equivalent dual problem

$ \begin{array}{*{20}{c}} {\mathop {\min }\limits_{{\alpha _{i, j}}} \frac{1}{{2n\lambda }}\sum\limits_{q = 1}^{k - 1} {{{\left[ {\sum\limits_{i = 1}^n {{\alpha _{i, {y_i}}}} {W_{{y_i}, q}}{X_i} - \sum\limits_{i = 1}^n {\sum\limits_{j \ne {y_i}} {{\alpha _{i, j}}} } {W_{j, q}}{X_i}} \right]}^{\rm{T}}}} }\\ {\left[ {\sum\limits_{i = 1}^n {{\alpha _{i, {y_i}}}} {W_{{y_i}, q}}{X_i} - \sum\limits_{i = 1}^n {\sum\limits_{j \ne {y_i}} {{\alpha _{i, j}}} } {W_{j, q}}{X_i}} \right] - }\\ {\sum\limits_{i = 1}^n {{\alpha _{i, {y_i}}}} (k - 1) - \sum\limits_{i = 1}^n {\sum\limits_{j \ne {y_i}} {{\alpha _{i, j}}} } }\\ {{\rm{ s}}{\rm{.t}}{\rm{.}}\;\;\;0 \le {\alpha _{i, j}} \le {A_{i, j}}.} \end{array} $ (15)
2.2 Computation using coordinate descent method

The coordinate descent (CD) method is a useful tool for solving large-scale problems. It is mainly used for convex optimization problems. It treats each dimension of the solution vector of the optimization problem as a coordinate and selects one coordinate at a time for iterative optimization. The iterative process consists of an outer loop and an inner loop. In each outer loop, each inner-loop optimization can be considered as a single-variable sub-optimization problem, so only one scalar needs to be optimized at a time.

Assume that $ \boldsymbol{w} \in \mathbb{R}^{n}$ is the solution vector of an optimization problem and that a total of k iterations are performed. Beginning with the initialized w0, we obtain w1, w2, …, wk in turn. Regarding each iteration as an outer loop, each outer loop iterates over the dimensions of w=[w1, w2, …, wn]T, so each outer loop contains n inner loops. The ith outer loop and jth inner loop generate wi, j,

$ \boldsymbol{w}^{i, j}=\left[w_{1}^{i+1}, \cdots, w_{j-1}^{i+1}, w_{j}^{i}, \cdots, w_{n}^{i}\right]^{\mathrm{T}}. $

Notice that wi, 1=wi and wi, n+1=wi+1. We update wi, j to wi, j+1 by solving the single-variable subproblem

$ \begin{array}{*{20}{c}} {\mathop {\min }\limits_z f\left( {w_1^{i + 1}, \cdots , w_{j - 1}^{i + 1}, w_j^i + z, w_{j + 1}^i, \cdots , w_n^i} \right) \equiv }\\ {\mathop {\min }\limits_z f\left( {{w^{i, j}} + z{\mathit{\boldsymbol{e}}_j}} \right), } \end{array} $

where ej=[0, …, 0, 1, 0, …, 0]T has a 1 in the jth position. In the inner loop, many tools can be used, such as the Newton method and CMLS.
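The following sketch (ours, not the paper's implementation) illustrates the outer/inner loop structure just described on a generic objective, using a simple grid search for each one-dimensional subproblem in place of the Newton or CMLS updates.

```python
import numpy as np

def coordinate_descent(f, w0, n_outer=50, step_grid=np.linspace(-1.0, 1.0, 201)):
    """Generic coordinate descent: each outer loop sweeps over the coordinates,
    and each inner loop solves min_z f(w + z * e_j) over a small grid of z."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(n_outer):                 # outer loops
        for j in range(w.size):              # inner loops, one per coordinate
            vals = [f(w + z * np.eye(w.size)[j]) for z in step_grid]
            w[j] += step_grid[int(np.argmin(vals))]
    return w

# example: minimize a simple separable quadratic
f = lambda w: (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 0.5) ** 2
print(coordinate_descent(f, np.zeros(2)))   # approaches (1, -0.5)
```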

Now we present the computational procedure of TLAMSVM using coordinate descent method.

Step 1. Initialize the solution vector $ \hat{\boldsymbol{B}}_{j}=\boldsymbol{\mathit{0}}$.

Step 2. Assume that the estimate in the mth iteration is $ \left(\hat{\boldsymbol{B}}_{1}^{(m)}, \cdots, \hat{\boldsymbol{B}}_{k-1}^{(m)}\right)$, each component of which is

$ \begin{aligned} \hat{\boldsymbol{B}}_{j}^{(m)}=&\left(\hat{\alpha}_{j}^{(m)}\left(u_{0}\right), \hat{\boldsymbol{\beta}}_{j}^{(m), (1)}\left(u_{0}\right)\right., \\ & \hat{\boldsymbol{\beta}}_{j}^{(m), (2)}\left(u_{0}\right), \cdots, \hat{\boldsymbol{\beta}}_{j}^{(m), (p)}\left(u_{0}\right) )^{\mathrm{T}}.\end{aligned} $

Fix $\left(\hat{\boldsymbol{B}}_{1}^{(m)}, \cdots, \hat{\boldsymbol{B}}_{j-1}^{(m)}, \hat{\boldsymbol{B}}_{j}^{(m-1)}, \hat{\boldsymbol{B}}_{j+1}^{(m-1)}, \cdots, \hat{\boldsymbol{B}}_{k-1}^{(m-1)}\right) $. After updating we get new $ \hat{\boldsymbol{B}}_{j}^{(m)}$.

Therefore, in the (m+1)th step, for i=1, …, n, j=1, …, k-1, and l=1, …, p, we set

$ {G_{i, j}}(z) = \sum\limits_{q < l} {{X_{i, q}}} \hat B_j^{(m + 1), (q)} + z{X_{i, l}} + \sum\limits_{q > l} {{X_{i, q}}} \hat B_j^{(m), (q)}, $

where $\hat B_j^{(m + 1), (q)} $ is the qth element of $\hat{\boldsymbol{B}}_{j}^{(m+1)} $, q=1, …, p+1, and Gi, j(z) is a function of $z \in \mathbb{R} $. Hence we have

$ \begin{array}{c}{F_{i, j, -l}(z)=\left(\boldsymbol{X}_{i}^{\mathrm{T}} \hat{\boldsymbol{B}}_{1}^{(m+1)}, \cdots, \boldsymbol{X}_{i}^{\mathrm{T}} \hat{\boldsymbol{B}}_{j-1}^{(m+1)}\right.}, \\ {G_{i, j}(z), \boldsymbol{X}_{i}^{\mathrm{T}} \hat{\boldsymbol{B}}_{j+1}^{(m)}, \cdots, \boldsymbol{X}_{i}^{\mathrm{T}} \hat{\boldsymbol{B}}_{k-1}^{(m)} )^{\mathrm{T}}}.\end{array} $

We update $ \hat B_j^{(m + 1), (l)}$ by solving (16) below. It is a single-variable subproblem, and many standard methods can be used to solve it,

$ \hat{B}_{j}^{(m+1), (l)}=\operatorname{argmin}_{z}\left[\lambda z^{2}+\frac{1}{n} \Sigma_{i=1}^{n} l\left\{\left\langle F_{i, j, -l}(z), W_{y_{i}}\right\rangle\right\}\right]. $ (16)

Step 3. Repeat the above steps until all estimates converge.

After obtaining $ \hat{\boldsymbol{\beta}}_{q}(u)=\left(\hat{\alpha}_{q}(u), \hat{\beta}_{q}^{(1)}(u), \cdots, \hat{\beta}_{q}^{(p)}(u)\right)^{\mathrm{T}}$, we compute each component of the classification function vector as

$ \begin{aligned} f_{q}(u, \boldsymbol{X})=& \boldsymbol{X}^{\mathrm{T}} \hat{\boldsymbol{\beta}}_{q}(u)=\left(1, X^{(1)}, X^{(2)}, \cdots, X^{(p)}\right)\cdot \\ &\left(\hat{\alpha}_{q}(u), \hat{\boldsymbol{\beta}}_{q}^{(1)}(u), \cdots, \hat{\boldsymbol{\beta}}_{q}^{(p)}(u)\right)^{\mathrm{T}}. \end{aligned} $

The classification function vector is

$ f(u, \boldsymbol{X})=\left(f_{1}(u, \boldsymbol{X}), f_{2}(u, \boldsymbol{X}), \cdots, f_{k-1}(u, \boldsymbol{X})\right). $

Therefore, any new sample (u′, X′) is mapped to f(u′, X′), which forms k angles ∠(f(u′, X′), Wj) with the vertex vectors. Then we assign its label as

$ \hat y = {{\mathop{\rm argmin}\nolimits} _{j \in \{ 1, \cdots , k\} }}\angle \left( {f\left( {{u^\prime }, {X^\prime }} \right), {W_j}} \right). $
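A short sketch (ours) of this prediction step; `beta_hat_u` stands for the hypothetical list of fitted coefficient vectors β̂1(u′), …, β̂k-1(u′), and W is the k×(k-1) vertex matrix from section 1.1.

```python
import numpy as np

def tlamsvm_predict(x_new, beta_hat_u, W):
    """Classify a new sample: form f(u', X') = (X'^T beta_1(u'), ..., X'^T beta_{k-1}(u'))
    and pick the class whose simplex vertex makes the smallest angle with it."""
    x_aug = np.concatenate(([1.0], np.asarray(x_new, dtype=float)))  # prepend intercept, as in (4)
    f = np.array([x_aug @ np.asarray(b) for b in beta_hat_u])        # beta_hat_u: (k-1) vectors at u'
    cosines = (W @ f) / (np.linalg.norm(W, axis=1) * np.linalg.norm(f) + 1e-12)
    return int(np.argmax(cosines)) + 1       # smallest angle = largest cosine
```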
3 Numerical study

In this section, we use numerical experiments to verify the effectiveness of the TLAMSVM model and choose the f1 score to measure its classification results. The f1 score is a common criterion for binary classification problems. In this paper, we generalize the f1 score to multi-category cases: we compute the f1 score for each class by viewing the current class as positive and the rest as negative, and the average of all class-wise f1 scores is the final result for the classifier.
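A brief sketch (ours) of this one-vs-rest averaging; it computes the same quantity as the macro-averaged f1 score in standard libraries.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Average the binary f1 score over classes, treating each class in turn
    as positive and all remaining classes as negative."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall > 0 else 0.0)
    return float(np.mean(scores))
```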

It is worth noting that when dealing with non-linear problems, the TLAMSVM model needs both a Mercer kernel function K(·) and a local smoothing kernel function Khn(·). The difference between them is clear: the former is used to construct non-linear decision boundaries, while the latter is used to aggregate information near u0. In the numerical experiments, we use the linear, Gaussian, and polynomial kernels as the Mercer kernel function, and the uniform, Gaussian, and Epanechnikov (ep for short) kernel functions as the local smoothing kernel function.
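For reference, a small sketch (ours) of the three local smoothing kernel functions named above, written in their standard textbook forms.

```python
import numpy as np

def smoothing_kernel(t, kind="ep"):
    """Local smoothing kernels K(t); the weight of sample i is K((U_i - u0) / h) / h."""
    t = np.asarray(t, dtype=float)
    if kind == "uniform":
        return 0.5 * (np.abs(t) <= 1)
    if kind == "gaussian":
        return np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)
    if kind == "ep":                     # Epanechnikov
        return 0.75 * (1.0 - t ** 2) * (np.abs(t) <= 1)
    raise ValueError("unknown kernel: " + kind)
```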

3.1 Simulation results

We generate two simulated data sets, S1 and S2. S1 has 1 000 samples with a 3-class label and 3 features, and S2 has 1 500 samples with a 5-class label and 7 features. We carry out 10 simulation runs for each setting, and the final results are averaged values. We generate the targeted variable Ui from U(0, 1) for both S1 and S2, and the label of S1 is defined by (17). For S2 we use the same method to generate the label; due to space limitations, the formula is not listed in this article.

$ Y_{i}=\left\{\begin{array}{ll}{1} & {\text { if } U_{i}^{2}+U_{i}-1+\epsilon_{i} \leqslant U_{\frac{1}{3}}} \\ {2} & {\text { if } U_{\frac{1}{3}} \leqslant U_{i}^{2}+U_{i}-1+\epsilon_{i} \leqslant U_{\frac{2}{3}}} \\ {3} & {\text { if } U_{i}^{2}+U_{i}-1+\epsilon_{i}>U_{\frac{2}{3}}}\end{array}\right. $ (17)

where $U_{\frac{1}{3}} $ and $U_{\frac{2}{3}} $ are the tertiles of U and ϵi~N(0, 0.1). Given Yi and Ui, we generate the features of S1, Xi=(Xi1, Xi2)T, from

$ {\mathit{\boldsymbol{X}}_i}|{Y_i}, {U_i}\sim MVN\left\{ {\beta \left( {{U_i}} \right){Y_i}, {\sigma ^2}I} \right\} $

where $\boldsymbol{\beta}(u)=\left(\sin (4 \pi u), 2 \mathrm{e}^{-20(u-0.5)^{2}}\right)^{\mathrm{T}}$.

As for S2, we generate the features Xi=(Xi1, Xi2, Xi3, Xi4, Xi5, Xi6)T with σ=2 from

$ \begin{aligned} {\mathit{\boldsymbol{X}}_i} &\sim MVN\left( {{\mu ^{(k)}}\boldsymbol{\eta} \left( {{U_i}} \right){I_{\{{X_i} \in {\rm{ class\ }}k\}}}, {\sigma ^2}I} \right), \\ \boldsymbol{\eta}\left(U_{i}\right)&=\left(2 U_{i}, U_{i}^{2}+2 U_{i}-1, \cos \left(2 \pi U_{i}\right), \right.\\ & \quad \left. 4 \mathrm{e}^{\left(U_{i}-0.3\right)^{2}}, \ln \left(\frac{1}{U_{i}^{2}}\right), \sin \left(3 \pi U_{i}\right) \right)^{\mathrm{T}}. \end{aligned} $ (18)
$ \begin{array}{c}{\mu^{(1)}=(5, 5, 0, 0, 0, 0), \mu^{(2)}=(0, 5, 5, 0, 0, 0)} \\ {\mu^{(3)}=(0, 0, 5, 5, 0, 0), \mu^{(4)}=(0, 0, 0, 5, 5, 0)} \\ {\mu^{(5)}=(0, 0, 0, 0, 5, 5)}\end{array} $
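As an illustration of the simulation design, the sketch below (ours) generates a data set with the structure of S1; the noise N(0, 0.1) is read as having variance 0.1, and σ=2 is borrowed from the S2 setting, since the paper does not state σ for S1 explicitly.

```python
import numpy as np

def generate_s1(n=1000, sigma=2.0, seed=0):
    """Simulate data with the structure of S1: a uniform targeted variable U,
    a 3-class label from (17), and features from the varying-coefficient
    Gaussian model with beta(u) = (sin(4*pi*u), 2*exp(-20*(u-0.5)^2))."""
    rng = np.random.default_rng(seed)
    U = rng.uniform(0.0, 1.0, n)
    eps = rng.normal(0.0, np.sqrt(0.1), n)            # N(0, 0.1) read as variance 0.1
    score = U ** 2 + U - 1.0 + eps
    q1, q2 = np.quantile(U, [1.0 / 3.0, 2.0 / 3.0])   # tertiles of U
    Y = np.where(score <= q1, 1, np.where(score <= q2, 2, 3))
    beta = np.column_stack((np.sin(4 * np.pi * U),
                            2.0 * np.exp(-20.0 * (U - 0.5) ** 2)))
    X = beta * Y[:, None] + sigma * rng.standard_normal((n, 2))   # sigma is an assumed value
    return U, X, Y
```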

We mainly compare the performance of three models: KMSVM, RAMSVM, and TLAMSVM. KMSVM is a multi-category generalization of KSVM that combines the OvO multi-classification framework with the KSVM method, while RAMSVM does not use the varying-coefficient model and regards the targeted variable as an ordinary feature. Both TLAMSVM and RAMSVM use the angle-based multi-category framework. For each model, we utilize 5-fold cross-validation to determine the parameters λ and hn.

Table 1 and Table 2 record the performance of the RAMSVM and TLAMSVM models on the simulated data sets S1 and S2.

Table 1 Comparison of f1 score between RAMSVM and TLAMSVM on data set S1

Table 2 Comparison of f1 score between RAMSVM and TLAMSVM on data set S2

Since the RAMSVM model does not need a local smoothing kernel function to build many MSVMs according to the targeted variable, it has an advantage in computation time. However, in terms of predictive accuracy, the TLAMSVM model still performs better regardless of which Mercer kernel or local smoothing kernel function is used.

Table 3 and Table 4 record the f1 scores and training times of the KMSVM and TLAMSVM models on the simulated data sets S1 and S2.

Table 3 Comparison of f1 score between KMSVM and TLAMSVM on data sets S1 and S2

Table 4 Comparison of f1 score between KMSVM and TLAMSVM on data set S2

The TLAMSVM model provides better predictions than the KMSVM model with the same Mercer or local smoothing kernel function and a similar training time. The values in Table 5 are the average prediction results of the RAMSVM and TLAMSVM models on S1.

Table 5 Comparison between RAMSVM and TLAMSVM on data set S1

Here we compare three cases with different feature spaces: Xi1 only, Xi2 only, and both of them. We find that the f1 score of the TLAMSVM model is much larger than that of the RAMSVM model when a single feature variable is used, so the TLAMSVM model does not rely on the number of feature variables as much as the RAMSVM model does. Likewise, the TLAMSVM model obtains better results with any kernel function in all three cases.

3.2 Real data analysis

In this subsection, we demonstrate the classification performance of the TLAMSVM model on two real data sets, the Cervical Cancer data and the Protein Localization Sites data, denoted by D1 and D2. Both are from the University of California Irvine machine learning repository. The Cervical Cancer data were collected at Hospital Universitario de Caracas from 858 patients and include demographic information, patients' habits, and their historical medical records. The Protein Localization Sites data come from research on an expert system that uses information about the amino acid sequence and source origin to predict protein localization sites. It has 1 484 samples and 10 categories.

3.2.1 Application to Cervical Cancer data set

Cervical cancer remains one of the leading causes of death among patients with malignant gynecologic disease. It can be caused by a number of factors, for instance, viral infection, sexual behavior, the number of pregnancies, and smoking habits. From the viewpoint of computer-aided diagnosis, there are several methods for detecting pre-cancerous cervical lesions. The four most frequently used methods in examination are colposcopy using acetic acid (Hinselmann), colposcopy using Lugol iodine (Schiller), cytology, and biopsy[14]. The data set records the results of these four tests (Hinselmann, Schiller, cytology, and biopsy). From these four test results, we artificially generate a response variable y∈{1, 2, 3} that reflects the confidence of the diagnosis: when all four test results are normal, y=1; when any one or two of the test results are normal, y=2; and y=3 for the remaining situations. We thus obtain a 3-category data set.

After some common data preprocessing procedures, we obtain a data set without missing values or outliers. We reduce the number of variables in the model from 31 to 12 by calculating the conditional information entropy and information gain of each variable, and we regard the feature "age" as the targeted variable in this data set. Because the data are imbalanced, we use a combination of undersampling and oversampling techniques (ENN+SMOTE) as a balancing tool to process the training data. The principle of ENN is to remove samples whose class differs from that of the majority of their k nearest neighbors. SMOTE[15] is a relatively basic oversampling method that reduces the difference in size between categories: for each sample of the minority class, connect it to its k nearest neighbors of the same class, which gives k line segments; generate one sample uniformly at random on each segment, and thereby obtain k new samples of the minority class.
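The SMOTE interpolation step described above can be sketched as follows (a simplified version of ours; in practice one would typically use a library implementation such as imbalanced-learn's SMOTE).

```python
import numpy as np

def smote_oversample(X_min, k=5, seed=0):
    """Simplified SMOTE: for each minority sample, take its k nearest minority
    neighbours and generate one synthetic point uniformly on each connecting segment."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    synthetic = []
    for i in range(X_min.shape[0]):
        neighbours = np.argsort(dists[i])[1:k + 1]     # skip the point itself
        for j in neighbours:
            lam = rng.uniform(0.0, 1.0)                # random point on the segment
            synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```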

Table 6 gives the numerical results of the TLAMSVM and RAMSVM models with different Mercer kernel functions and local smoothing kernel functions on the Cervical Cancer data set. It can be seen that, with the same local smoothing kernel function, the polynomial Mercer kernel yields the best classification accuracy. With the same Mercer kernel function, the Epanechnikov and Gaussian local kernel functions have relatively better prediction performance. Comparing average accuracies, the average prediction performance of the TLAMSVM model is slightly better than that of the RAMSVM model.

Table 6 Summary of accuracy of TLAMSVM and RAMSVM on Cervical Cancer data set

We can obtain the corresponding estimated coefficient β(u) for any given value of the targeted variable u. Figure 1 is a line chart of the coefficients in the first dimension; it shows how each of the coefficients corresponding to 5 feature variables varies as "age" changes.



Fig. 1 β1(u) vs. the targeted variable (age) on D1

From Fig. 1 we can see that the influence of the feature "Smokes" is small when age is between 20 and 40 but increases considerably at older ages. This means that, for young patients, the information about their smoking habit is not as important as the other variables, whereas for middle-aged and older patients, the smoking habit offers more information for classification.

Figure 2 describes the inner products between the classification function vector of a sample and each simplex vertex (〈f(x), Wj〉, j=1, 2, 3) as the targeted variable U takes different values.


solid: 〈f(x), W1〉, dashed: 〈f(x), W2〉, dotted: 〈f(x), W3〉.
Fig. 2 Predictions for the three given samples on D1

From section 2 we know that a sample is classified to the category with the maximal inner product value, so at any fixed age, the class corresponding to the top line is the prediction result. Fig. 2 shows that different values of the targeted variable affect the final classification results of the 3 given samples. The three subgraphs give the prediction details of three samples, which belong to class 1, class 2, and class 3, respectively; the ages of these samples are 29, 52, and 51, in sequence.

Firstly, we find that the TLAMSVM model gives the correct predictions according to the prediction rule. Secondly, for identical feature variables, different values of the targeted variable affect the classification of the sample. That is to say, with the other physiological characteristics fixed, assuming that the patient is at different ages leads to different diagnosis results. Take the third sample as an example: the patient is 51 years old and the true label is 3, which means the patient is at high risk of cervical cancer. However, we would reach an entirely different judgement if we regarded the same physiological characteristic values as coming from a patient of another age. A patient aged from 20 to 30 would be diagnosed as class 1 (no cervical cancer), and a patient over 75 would be diagnosed as class 2 (not confirmed). Thus the TLAMSVM model can accurately capture the essential information of a sample by studying the relationship between the targeted variable and the other feature variables, and thereby provide better predictions.

3.2.2 Application to Protein Localization Sites data set

The Protein Localization Sites data set has 8 features, which are the scores of several methods that measure information about the protein or amino acid sequence. It has 10 classes representing the different protein localization sites, such as CYT (cytosolic or cytoskeletal), NUC (nuclear), and MIT (mitochondrial). We use the RAMSVM and TLAMSVM models to train classifiers on these data.

Figure 3 is based on D2 (Protein Localization Sites data).


solid: feature 'gvh', dashed: feature 'alm', dotted: feature 'mit', dashed-dotted: feature 'nuc'.
Fig. 3 Estimated values for the coefficients of the first three dimensions on D2

And we set "mcg"(McGeoch's method for signal sequence recognition) as targeted variable. Because of limitations of space, Fig. 3 only demonstrates the coefficient β(u) in the first, second and third dimension. And these subgraphs are enough to show the variability of coefficients in model TLAMSVM.

Figure 4 is the inner product graph for a certain sample, whose feature "mcg" equals 0.45 and whose true label is class 1. From Fig. 4, we see that the TLAMSVM model gives a correct prediction (class 1) for this sample, and the classification result would be different if the "mcg" value of this sample were different.



Fig. 4 Predictions for a given sample under different targeted variables (mcg) on D2

Table 7 records the performance of the TLAMSVM and RAMSVM models on the Protein Localization Sites data set. From Table 7 we draw the same conclusion as for the Cervical Cancer data: the classification accuracy of the TLAMSVM model is better than that of the RAMSVM model, consistent with the simulation results, and the model together with its prediction rule gives consistent classification predictions.

Table 7 Summary of accuracy of TLAMSVM and RAMSVM on Protein Localization Sites dataset
4 Discussion

In this study, the TLAMSVM model is proposed to solve multi-category classification problems and to make better predictions for different samples. In the numerical experiment section, we first predict disease risk and show its dependency on age. Building an age-specific predictive model helps to study the appropriate time of intervention and to find potential features that support individualized treatment for a certain disease; the fitted coefficient β(u) describes the age sensitivity of each feature variable with respect to disease risk. We then obtain good results on the Protein Localization Sites data and make more "personalized" predictions for samples with different "mcg" values. Finally, the simulation results and real data analysis show that different choices of the Mercer kernel and smoothing kernel functions may lead to minor differences in prediction accuracy.

We consider feature variables whose effects depend on the targeted variable, but it is easy to ignore features with stable effects. Thus, in future work we will add some features with invariant effects to the initial classification function, as in α(U)+XTβ(U)+ZTγ.

References
[1]
Cortes C, Vapnik V. Support vector networks[J]. Machine Learning, 1995, 20(3): 273-297.
[2]
Yau G X, Zhang C. Multi-category angle-based classifier refit[EB/OL]. (2016-07-19)[2017-08-12]. https://arxiv.org/abs/1607.05709.
[3]
Moguerza J M, Muñoz A. Support vector machines with applications[J]. Statistical Science, 2006, 21(3): 322-336. DOI:10.1214/088342306000000493
[4]
Allwein E L, Schapire R E, Singer Y. Reducing multiclass to binary: a unifying approach for margin classifiers[C]//Proceedings of the International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann, 2000: 9-16.
[5]
Hastie T, Tibshirani R. Classification by pairwise coupling[C]//Conference on Advances in Neural Information Processing Systems. MIT Press, 1998: 507-513.
[6]
Crammer K, Singer Y. On the algorithmic implementation of multiclass kernel-based vector machines[J]. J Machine Learning Res, 2001, 2(2): 265-292.
[7]
Lee Y, Lin Y, Wahba G. Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data[J]. Journal of the American Statistical Association, 2004, 99(465): 67-81. DOI:10.1198/016214504000000098
[8]
Zhang C, Liu Y, Wu Z. On the effect and remedies of shrinkage on classification probability estimation[J]. American Statistician, 2013, 67(3): 134-142. DOI:10.1080/00031305.2013.817356
[9]
Zhang C, Liu Y. Multicategory angle-based large-margin classification[J]. Biometrika, 2014, 101(3): 625-640.
[10]
Zhang C, Liu Y, Wang J, et al. Reinforced angle-based multicategory support vector machines[J]. Journal of Computational and Graphical Statistics, 2016, 25(3): 806-825. DOI:10.1080/10618600.2015.1043010
[11]
Wei Z, Wang K, Qu H Q, et al. From disease association to risk assessment: an optimistic view from genome-wide association studies on type 1 diabetes[J]. PLoS Genetics, 2009, 5(10): e1000678. DOI:10.1371/journal.pgen.1000678
[12]
Ladicky L, Torr P H S. Locally linear support vector machines[C]//Proceedings of the 28th International Conference on Machine Learning. Omnipress, 2011: 985-992.
[13]
Chen T, Wang Y, Chen H, et al. Targeted local support vector machine for age-dependent classification[J]. Journal of the American Statistical Association, 2014, 109(507): 1174-1187. DOI:10.1080/01621459.2014.881743
[14]
Fernandes K, Cardoso J S, Fernandes J. Transfer learning with partial observability applied to cervical cancer screening[C]//Pattern Recognition and Image Analysis. 2017: 243-250.
[15]
Chawla N V, Bowyer K W, Hall L O, et al. SMOTE: synthetic minority over-sampling technique[J]. Journal of Artificial Intelligence Research, 2002, 16(1): 321-357.