2. Technology Research Institute, Aisino Corporation, Beijing 100195, China
Machine learning, as a new research orientation, plays a significant role in the age of big data. A widely used classification method in machine learning is the support vector machine (SVM)[1]. It is a classic margin-based classifier, which seeks the most appropriate separating hyperplane in the feature space and solves an optimization problem of the form (loss + regularization)[2]. In recent decades, SVM and its derived algorithms have held an important place in many fields[3] on account of their excellent prediction accuracy and computational efficiency.
The typical SVM is designed for binary problems with one discriminant function. As for its theoretical properties, it has been proved to have good asymptotic convergence rates and Fisher consistency. Nevertheless, in realistic applications, multi-category problems are more common than binary ones. As an illustration, the diagnosis of cancer patients may be divided into several categories by measuring the malignancy grade of cancer cells or tissues. Hence the multi-category support vector machine (MSVM) was proposed. Apart from conventional methods such as one-versus-one (OvO) and one-versus-rest (OvR)[4-5], which train a sequence of binary SVMs, there are also many simultaneous multi-classification methods, which consider k classes in a single optimization, learn k fitting functions, and make the classification decision based on a specific prediction rule, such as Crammer and Singer[6-7]. However, since a binary classifier needs only one function, (k-1) functions should be adequate for simultaneous multi-classification, and the classification function space can be reduced to (k-1) dimensions by adding a sum-to-zero constraint. The angle-based classification framework[9-10] achieves the same reduction without an explicit sum-to-zero constraint by mapping the k classes to the vertices of a simplex in a (k-1)-dimensional space, as described in section 1.1.
Apart from extending SVM to more categories, the interrelation among covariates is also a direction worth studying. For example, SVM is a popular algorithm in disease research for predicting the disease risk status of patients. However, Wei et al.[11] noted that some variables may contribute little to prediction by themselves yet play significant roles when coordinated with other correlated variables. Ladicky and Torr[12] therefore proposed applying a local linear smoothing SVM over the entire feature variable space. Since smoothing is hard to carry out successfully in a high-dimensional feature space, Chen et al.[13] developed a targeted local smoothing kernel weighted support vector machine (KSVM), which only requires the data to be plentiful in the targeted dimension. By optimizing a kernel weighted problem, it obtains an age-dependent prediction rule to discriminate disease risk and achieves quite good results on binary classification problems. It is worth applying such a local weighting tool to MSVM.
In this study, we make two main contributions. First, we combine the angle-based multi-category classification framework of the RAMSVM model with the targeted local smoothing method of the KSVM model, and propose a new MSVM model, the targeted local angle-based multi-category support vector machine (TLAMSVM), which offers better classification prediction for certain data sets. Second, the TLAMSVM model provides a more detailed analysis for a sample according to the value of its targeted variable. Before introducing the TLAMSVM model, some fundamental knowledge of the RAMSVM and KSVM models, which form the basis of TLAMSVM, is described in section 1. In section 2, the TLAMSVM model is presented in detail, followed by its computational procedure based on the coordinate descent method. In section 3, we report experimental results on two simulated and two real data sets and give illustrations to demonstrate the strengths of the TLAMSVM model. Finally, we conclude the article with a discussion in section 4.
1 Background and related work
1.1 Reinforced angle-based multi-category support vector machine (RAMSVM)
For a k-class data set, the angle-based large-margin classification framework defines a simplex with k vertices W1, W2, …, Wk in a (k-1)-dimensional space, each vertex corresponding to one class,
$ W_{j}=\left\{\begin{array}{ll}{(k-1)^{-\frac{1}{2}} \boldsymbol{1}, } & {j=1} \\ {-\frac{1+\sqrt{k}}{(k-1)^{3 / 2}} \boldsymbol{1}+\sqrt{\frac{k}{k-1}} \boldsymbol{e}_{j-1}, } & {2 \leqslant j \leqslant k, }\end{array}\right. $
where Wj represents the jth vertex, j=1, 2, …, k, 1 represents the vector whose elements are all 1, and ej represents the unit vector whose elements are all 0 except for the jth one; Wj, 1, and ej are all (k-1)-dimensional vectors.
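The vertex construction above is easy to implement. As a small illustration (not from the original paper), the following NumPy sketch builds W1, …, Wk and numerically checks that every vertex has unit norm and that any two distinct vertices have the same inner product -1/(k-1), so all pairwise angles are equal.

```python
import numpy as np

def simplex_vertices(k):
    """Construct the k simplex vertices W_1, ..., W_k in R^(k-1) defined above."""
    W = np.zeros((k, k - 1))
    W[0] = (k - 1) ** (-0.5) * np.ones(k - 1)           # W_1 = (k-1)^(-1/2) * 1
    for j in range(1, k):                                # W_j for j = 2, ..., k
        W[j] = -(1 + np.sqrt(k)) / (k - 1) ** 1.5 * np.ones(k - 1)
        W[j, j - 1] += np.sqrt(k / (k - 1))              # + sqrt(k/(k-1)) * e_{j-1}
    return W

W = simplex_vertices(4)
# Each vertex has unit length, and any two distinct vertices have
# inner product -1/(k-1), so all pairwise angles between vertices are equal.
print(np.round(W @ W.T, 6))
```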
According to the regular form of a large-margin classification algorithm, we add a penalty term J(·) to the loss function. So the optimization problem becomes
$ {\min\limits_{f \in F}}\frac{1}{n}\sum\limits_{i = 1}^n l \left( {\left\langle {f\left( {{x_i}} \right), {W_{{y_i}}}} \right\rangle } \right) + \lambda J(f), $  (1)
here l(·) is a loss function, which measures the cost of wrong predictions. Nevertheless, a typical loss function such as the hinge loss l(u)=[1-u]+ is not Fisher consistent. Inspired by Lee et al.[7], RAMSVM uses a convex combination of loss functions to overcome this drawback. Therefore (1) becomes
$ \begin{array}{l} \mathop {\min }\limits_{f \in F} \frac{1}{n}\sum\limits_{i = 1}^n \Big\{ (1 - \gamma )\sum\limits_{j \ne y_i} \left[ 1 + \left\langle f\left( x_i \right), W_j \right\rangle \right]_+ + \\ \quad \gamma \left[ (k - 1) - \left\langle f\left( x_i \right), W_{y_i} \right\rangle \right]_+ \Big\} + \lambda J(f), \end{array} $  (2)
here γ∈[0, 1] is the convex combination parameter. It has been proved that, when γ∈[0, 0.5], the loss function of the RAMSVM model is Fisher consistent[10]; moreover, when γ=0.5, RAMSVM is stable and almost reaches the optimal classification accuracy. Therefore we set γ=0.5 by default in the rest of this work.
1.2 Kernel weighted support vector machine (KSVM)
Take binary medical data as an example: let Y denote the set of labels {1, -1}, which indicates whether a subject is at risk of the disease or not, let "age" be the targeted variable U, and let the variable X contain the other features. The purpose is to train a classifier that depends on the age information. First, we modify the typical discriminant score as follows:
$ \alpha(U)+\boldsymbol{X}^{\mathrm{T}} \boldsymbol{\beta}(U), $  (3)
where α(U) is a baseline without a specific functional form, and β(U) is a vector whose elements are the coefficients of X and vary with age. It is natural to expect that subjects with similar ages and similar other features have similar disease risk, so as age U changes smoothly, β(U) should vary smoothly too. Some features may provide more information for classification at older ages while others do the reverse; from this perspective, people can better understand each physical sign. Then, after calculating the value in (3), a subject with a negative score is classified as risk-free and one with a positive score is classified into the risk group.
After choosing an appropriate loss function, the targeted local smoothing kernel weighted support vector machine (KSVM) model trains a classifier by solving
$ \mathop {\min }\limits_{\alpha \left( u_0 \right), \beta \left( u_0 \right)} \sum\limits_i K_{h_n}\left( U_i - u_0 \right) \, l\left\{ Y_i, X_i;\alpha \left( u_0 \right), \beta \left( u_0 \right) \right\} + \lambda \left\| \beta \left( u_0 \right) \right\|, $
where Khn(·) is a kernel density function that pools information from subjects whose ages lie around u0 within a certain bandwidth hn. In other words, the KSVM model trains a separate SVM at each u0, and the weight of the ith sample is Khn(Ui-u0).
After obtaining the estimated coefficients α̂(u0) and β̂(u0), a new subject with age u0 is classified according to the sign of the estimated discriminant score α̂(u0)+XTβ̂(u0).
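To make the local weighting idea concrete, here is a rough sketch (not the authors' implementation) that trains one binary SVM at a chosen age u0, weighting each subject by a Gaussian smoothing kernel Khn(Ui-u0); the toy data, the bandwidth, and the helper names are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_kernel_weights(U, u0, h):
    """K_h(U_i - u0): Gaussian smoothing kernel weights with bandwidth h."""
    return np.exp(-0.5 * ((U - u0) / h) ** 2) / (h * np.sqrt(2 * np.pi))

def fit_local_svm(X, y, U, u0, h, lam=1.0):
    """Train one binary SVM at the target value u0, weighting each
    sample by how close its targeted variable U_i is to u0."""
    w = gaussian_kernel_weights(U, u0, h)
    clf = SVC(kernel="linear", C=1.0 / lam)
    clf.fit(X, y, sample_weight=w)
    return clf

# toy data: 200 subjects, 3 features, ages in [20, 80], labels in {-1, 1}
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
U = rng.uniform(20, 80, size=200)
y = np.sign(X[:, 0] + 0.02 * (U - 50) * X[:, 1] + 0.1 * rng.normal(size=200))

clf_40 = fit_local_svm(X, y, U, u0=40.0, h=5.0)   # classifier for age around 40
print(clf_40.predict(X[:5]))
```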
In this section we first explain how the multi-category local smoothing kernel weighted optimization problem of the TLAMSVM model is designed, and then outline the main computational steps of the TLAMSVM model under the coordinate descent method. We take the linearly separable case as an example; for the non-separable case, the kernel trick can be applied.
2.1 Targeted local angle-based multi-category support vector machine (TLAMSVM)
Assume that there are n observations, each of which has p+1 features, one of which is the targeted variable U. For simplicity, the intercept is included in X, and the feature values of the ith observation are denoted Xi=(1, Xi(1), Xi(2), …, Xi(p))T, with X(j)=(X1(j), X2(j), …, Xn(j)) representing the values of the jth feature over all observations. Therefore the sample matrix X of size (p+1)×n can be written as
$ \mathit{\boldsymbol{X}} = \left( {{X_1}, {X_2}, \cdots , {X_n}} \right) = {\left( {1, {X^{(1)}}, {X^{(2)}}, \cdots , {X^{(p)}}} \right)^{\rm{T}}}. $  (4)
Assume the classification function vector at U=u0 is
$ {f_q}\left( {{u_0}, \mathit{\boldsymbol{X}}} \right) = {\mathit{\boldsymbol{X}}^{\rm{T}}}{\mathit{\boldsymbol{\beta }}_q}\left( {{u_0}} \right)\;\;\;(q = 1, \cdots , k - 1), $  (5)
where βq(u0)=(αq(u0), βq(1)(u0), βq(2)(u0), …, βq(p)(u0))T. We use the convex combination loss in the model. Hence, the original optimization problem of the TLAMSVM model with targeted variable U is
$ \begin{array}{l} \mathop {\min }\limits_{f \in F} \sum\limits_i {K_{h_n}}\left( U_i - u_0 \right)\Big\{ (1 - \gamma )\sum\limits_{j \ne y_i} \left[ 1 + \left\langle f\left( u_0, X_i \right), W_j \right\rangle \right]_+ + \\ \quad \gamma\left[(k-1)-\left\langle f\left(u_{0}, X_{i}\right), W_{y_{i}}\right\rangle\right]_{+} \Big\}+\frac{\lambda}{2} J(f), \end{array} $  (6)
where {(1-γ)∑j≠yi[1+〈f(u0, Xi), Wj〉]++γ[(k-1)-〈f(u0, Xi), Wyi〉]+} is the loss function of our model, and the weight before the loss of each observation, Khn(Ui-u0), is a smoothing kernel function with bandwidth hn, as in the KSVM model. To handle the more general situation, we introduce the concept of "soft margin" and allow some samples to violate the constraints
$ \begin{array}{c}{-\left\langle f\left(u_{0}, X_{i}\right), W_{j}\right\rangle \geqslant 1 \text { and } 2-k+} \\ {\left\langle f\left(u_{0}, X_{i}\right), W_{y_{i}}\right\rangle \geqslant 1}, \end{array} $
that is, we introduce slack variables ξi, j (ξi, j≥0) into the constraints and obtain
$ \begin{array}{c}{-\left\langle f\left(u_{0}, X_{i}\right), W_{j}\right\rangle \geqslant 1-\xi_{i, j}\;\left(i=1, \cdots, n, j \neq y_{i}\right)}, \\ {2-k+\left\langle f\left(u_{0}, X_{i}\right), W_{y_{i}}\right\rangle \geqslant 1-\xi_{i, y_i}\;(i=1, \cdots, n).}\end{array} $  (7)
We take the penalty J(f) in (6) to be the sum of squared coefficient norms, so the optimization problem can be written in the constrained form
$ \begin{array}{c} \mathop {\min }\limits_{{\mathit{\boldsymbol{\beta }}_q}\left( {{u_0}} \right), {\xi _{i, j}}} \frac{{n\lambda }}{2}\sum\limits_{q = 1}^{k - 1} \mathit{\boldsymbol{\beta }}_q^{\rm{T}}\left( {{u_0}} \right){\mathit{\boldsymbol{\beta }}_q}\left( {{u_0}} \right) + \sum\limits_i {K_{h_n}}\left( U_i - u_0 \right)\left[ \gamma \xi_{i, y_i} + (1 - \gamma )\sum\limits_{j \ne y_i} \xi_{i, j} \right] \\ {\rm{s}}{\rm{.t}}{\rm{.}}\left\{ {\begin{array}{l} {{\xi _{i, j}} \ge 0\;(i = 1, \cdots , n, j = 1, \cdots , k)}\\ {{\xi _{i, {y_i}}} + \left\langle {f\left( {{u_0}, {X_i}} \right), {W_{{y_i}}}} \right\rangle - (k - 1) \ge 0\;(i = 1, \cdots , n)}\\ {{\xi _{i, j}} - \left\langle {f\left( {{u_0}, {X_i}} \right), {W_j}} \right\rangle - 1 \ge 0\;\left( {i = 1, \cdots , n, j \ne {y_i}} \right)} \end{array}} \right. \end{array} $  (8)
Obviously, (8) is a constrained optimization problem, whose Lagrangian function could be defined as
$ \begin{aligned} L=& \frac{n \lambda}{2} \sum\limits_{q=1}^{k-1} \boldsymbol{\beta}_{q}^{\mathrm{T}}\left(u_{0}\right) \boldsymbol{\beta}_{q}\left(u_{0}\right)+\\ & \sum\limits_{i=1}^{n} K_{h_{n}}\left(U_{i}-u_{0}\right)\left[\gamma \xi_{i, y_{i}}+(1-\gamma) \sum\limits_{j \neq y_{i}} \xi_{i, j}\right]-\\ & \sum\limits_{i=1}^{n} \alpha_{i, y_{i}}\left[\xi_{i, y_{i}}+\left\langle f\left(u_{0}, X_{i}\right), W_{y_i}\right\rangle-(k-1)\right]-\\ & \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{k} \tau_{i, j} \xi_{i, j}-\sum\limits_{i=1}^{n} \sum\limits_{j \neq y_{i}} \alpha_{i, j}\left[\xi_{i, j}-\left\langle f\left(u_{0}, X_{i}\right), W_{j}\right\rangle- 1\right]. \end{aligned} $
Here αi, j and τi, j are Lagrangian multipliers. After simplification, we get
$ \begin{aligned} L=& \frac{n \lambda}{2} \sum\limits_{q=1}^{k-1} \boldsymbol{\beta}_{q}^{\mathrm{T}}\left(u_{0}\right) \boldsymbol{\beta}_{q}\left(u_{0}\right)+\\ & \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{k}\left[A_{i, j}-\tau_{i, j}-\alpha_{i, j}\right] \xi_{i, j}+(k-1)\sum\limits_{i=1}^{n} \alpha_{i, y_{i}}+\\ & \sum\limits_{i = 1}^n \sum\limits_{j \ne y_i} \alpha_{i, j} - \sum\limits_{i = 1}^n \alpha_{i, y_i} \left\langle f\left( u_0, X_i \right), W_{y_i} \right\rangle + \\ & \sum\limits_{i = 1}^n \sum\limits_{j \ne y_i} \alpha_{i, j} \left\langle f\left( u_0, X_i \right), W_j \right\rangle , \end{aligned} $  (9)
where Ai, j=Khn(Ui-u0)[γI(j=yi)+(1-γ)I(j≠yi)].
Denote the last two terms of (9) by L1 and L2:
$ \begin{aligned} L_{1} &=-\sum\limits_{i=1}^{n} \alpha_{i, y_{i}}\left\langle f\left(u_{0}, X_{i}\right), W_{y_{i}}\right\rangle \\ &=-\sum\limits_{i=1}^{n} \sum\limits_{q=1}^{k-1} \alpha_{i, y_{i}}\left(\boldsymbol{X}_{i}^{\mathrm{T}} \boldsymbol{\beta}_{q}\left(u_{0}\right)\right) W_{y_{i}, q}, \\ L_{2} &=\sum\limits_{i=1}^{n} \sum\limits_{j \neq y_{i}} \alpha_{i, j}\left\langle f\left(u_{0}, X_{i}\right), W_{j}\right\rangle \\ &=\sum\limits_{i=1}^{n} \sum\limits_{q=1}^{k-1} \sum\limits_{j \neq y_{i}} \alpha_{i, j}\left(\boldsymbol{X}_{i}^{\mathrm{T}} \boldsymbol{\beta}_{q}\left(u_{0}\right)\right) W_{j, q} .\end{aligned} $
Take partial derivatives of L with respect to βq(u0) and ξi, j, then let the partial derivatives be zero,
$ \frac{\partial L}{\partial \xi_{i, j}} = A_{i, j} - \tau_{i, j} - \alpha_{i, j} = 0 \quad (i = 1, \cdots , n;\; j = 1, \cdots , k), $  (10)
$ \frac{\partial L}{\partial \boldsymbol{\beta}_q\left(u_0\right)} = n\lambda \boldsymbol{\beta}_q\left(u_0\right) - \sum\limits_{i = 1}^n \alpha_{i, y_i} W_{y_i, q} X_i + \sum\limits_{i = 1}^n \sum\limits_{j \ne y_i} \alpha_{i, j} W_{j, q} X_i = 0. $  (11)
From (11) one obtains
$ \boldsymbol{\beta}_q\left(u_0\right) = \frac{1}{n\lambda}\left[ \sum\limits_{i = 1}^n \alpha_{i, y_i} W_{y_i, q} X_i - \sum\limits_{i = 1}^n \sum\limits_{j \ne y_i} \alpha_{i, j} W_{j, q} X_i \right], $  (12)
and hence
$ \sum\limits_{i = 1}^n \left( \sum\limits_{j \ne y_i} \alpha_{i, j} W_{j, q} - \alpha_{i, y_i} W_{y_i, q} \right) \boldsymbol{X}_i^{\rm{T}} = -n\lambda \boldsymbol{\beta}_q^{\rm{T}}\left(u_0\right). $
We get
$ L_1 + L_2 = \sum\limits_{q = 1}^{k - 1} \sum\limits_{i = 1}^n \left( \boldsymbol{X}_i^{\rm{T}} \boldsymbol{\beta}_q\left(u_0\right) \right) \left[ \sum\limits_{j \ne y_i} \alpha_{i, j} W_{j, q} - \alpha_{i, y_i} W_{y_i, q} \right] = -n\lambda \sum\limits_{q = 1}^{k - 1} \boldsymbol{\beta}_q^{\rm{T}}\left(u_0\right) \boldsymbol{\beta}_q\left(u_0\right). $  (13)
Plugging (10), (11), and (13) into (9), we simplify the Lagrangian function as
$ L=-\frac{n \lambda}{2} \sum\limits_{q=1}^{k-1} \boldsymbol{\beta}_{q}^{\mathrm{T}}\left(u_{0}\right) \boldsymbol{\beta}_{q}\left(u_{0}\right)+(k-1)\sum\limits_{i=1}^{n} \alpha_{i, y_{i}}+\sum\limits_{i=1}^{n} \sum\limits_{j \neq y_{i}} \alpha_{i, j}. $  (14)
Plugging (12) into (14), maximizing the Lagrangian function is equivalent to the dual problem
$ \begin{array}{c} \mathop {\min }\limits_{{\alpha _{i, j}}} \frac{1}{{2n\lambda }}\sum\limits_{q = 1}^{k - 1} {\left[ \sum\limits_{i = 1}^n \alpha_{i, y_i} W_{y_i, q} X_i - \sum\limits_{i = 1}^n \sum\limits_{j \ne y_i} \alpha_{i, j} W_{j, q} X_i \right]^{\rm{T}}} \left[ \sum\limits_{i = 1}^n \alpha_{i, y_i} W_{y_i, q} X_i - \sum\limits_{i = 1}^n \sum\limits_{j \ne y_i} \alpha_{i, j} W_{j, q} X_i \right] - \\ (k - 1)\sum\limits_{i = 1}^n \alpha_{i, y_i} - \sum\limits_{i = 1}^n \sum\limits_{j \ne y_i} \alpha_{i, j} \\ {\rm{ s}}{\rm{.t}}{\rm{.}}\;\;\;0 \le \alpha_{i, j} \le A_{i, j}. \end{array} $  (15)
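Because (15) has only box constraints 0 ≤ αi,j ≤ Ai,j, it can in principle be handed to any bound-constrained optimizer, with βq(u0) recovered from (12) afterwards. The following NumPy/SciPy sketch illustrates that route for the linear case; the Gaussian form of the smoothing weights and all variable names are our own assumptions, and the coordinate descent procedure described next is the approach actually used in this paper.

```python
import numpy as np
from scipy.optimize import minimize

def solve_tlamsvm_dual(Xmat, y, U, u0, W, lam, h, gamma=0.5):
    """Rough solver for the box-constrained dual (15) at a single u0.
    Xmat: n x (p+1) rows X_i (intercept included); y: labels in {0, ..., k-1};
    W: k x (k-1) simplex vertices. Returns the (p+1) x (k-1) matrix of beta_q(u0)."""
    n, k = Xmat.shape[0], W.shape[0]
    Kw = np.exp(-0.5 * ((U - u0) / h) ** 2)              # smoothing weights (Gaussian assumption)
    is_own = np.zeros((n, k)); is_own[np.arange(n), y] = 1.0
    A = Kw[:, None] * (gamma * is_own + (1 - gamma) * (1 - is_own))  # upper bounds A_{i,j}
    S = 2 * is_own - 1                                   # +1 if j = y_i, else -1
    lin = (k - 1) * is_own + (1 - is_own)                # coefficients of the linear term

    def obj_grad(a):
        alpha = a.reshape(n, k)
        V = Xmat.T @ ((S * alpha) @ W)                   # columns of V are the vectors v_q
        val = np.sum(V ** 2) / (2 * n * lam) - np.sum(lin * alpha)
        grad = S * ((Xmat @ V) @ W.T) / (n * lam) - lin
        return val, grad.ravel()

    bounds = [(0.0, a) for a in A.ravel()]
    res = minimize(obj_grad, np.zeros(n * k), jac=True,
                   method="L-BFGS-B", bounds=bounds)
    alpha = res.x.reshape(n, k)
    B = Xmat.T @ ((S * alpha) @ W) / (n * lam)           # beta_q(u0) from (12), column q
    return B
```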
The coordinate descent (CD) method is a powerful tool for large-scale data problems and is mainly used to solve convex optimization problems. It treats each dimension of the solution vector as a coordinate and selects one coordinate at a time for iterative optimization. The iterative process consists of an outer loop and an inner loop: in each outer loop, the inner loop optimization can be viewed as a sequence of single-variable sub-problems, so only one scalar needs to be optimized at a time.
Assume that the objective function is f(w) with solution vector w=(w1, …, wn)T, and that wi is the iterate at the ith outer loop. Define
$ \boldsymbol{w}^{i, j}=\left[w_{1}^{i+1}, \cdots, w_{j-1}^{i+1}, w_{j}^{i}, \cdots, w_{n}^{i}\right]^{\mathrm{T}}. $
Notice that wi, 1=wi and wi, n+1=wi+1. Update wi, j into wi, j+1 by solving the single-variable subproblem
$ \mathop {\min }\limits_z f\left( {w_1^{i + 1}, \cdots , w_{j - 1}^{i + 1}, w_j^i + z, w_{j + 1}^i, \cdots , w_n^i} \right) \equiv \mathop {\min }\limits_z f\left( {{w^{i, j}} + z{\mathit{\boldsymbol{e}}_j}} \right), $
where ej=[0, …, 0, 1, 0, …, 0]T has a 1 in the jth position. For the inner loop, many tools can be used, such as the Newton method and CMLS.
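The following minimal sketch (our own toy example, not part of the TLAMSVM algorithm itself) shows the outer/inner loop structure described above on a simple smooth convex objective, using one Newton step per coordinate as the inner solver.

```python
import numpy as np

def coordinate_descent(grad_j, hess_j, w0, n_outer=50, tol=1e-8):
    """Generic coordinate descent: one approximate Newton step per coordinate.
    grad_j(w, j) / hess_j(w, j): first and second partial derivatives in coordinate j."""
    w = w0.copy()
    for _ in range(n_outer):                  # outer loop: full sweeps over coordinates
        max_step = 0.0
        for j in range(len(w)):               # inner loop: one coordinate at a time
            z = -grad_j(w, j) / hess_j(w, j)  # single-variable Newton step
            w[j] += z
            max_step = max(max_step, abs(z))
        if max_step < tol:
            break
    return w

# example objective f(w) = 0.5 * ||Aw - b||^2 (smooth and convex)
rng = np.random.default_rng(1)
A, b = rng.normal(size=(30, 5)), rng.normal(size=30)
g = lambda w, j: A[:, j] @ (A @ w - b)
h = lambda w, j: A[:, j] @ A[:, j]
w_hat = coordinate_descent(g, h, np.zeros(5))
print(np.round(w_hat, 3))
```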
Now we present the computational procedure of TLAMSVM using coordinate descent method.
Step 1. Initialize the solution vectors B̂j(0)(u0) for j=1, …, k-1.
Step 2. Assume the estimated value in the mth iteration is
$ \hat{\boldsymbol{B}}_{j}^{(m)}=\left(\hat{\alpha}_{j}^{(m)}\left(u_{0}\right), \hat{\boldsymbol{\beta}}_{j}^{(m), (1)}\left(u_{0}\right), \hat{\boldsymbol{\beta}}_{j}^{(m), (2)}\left(u_{0}\right), \cdots, \hat{\boldsymbol{\beta}}_{j}^{(m), (p)}\left(u_{0}\right)\right)^{\mathrm{T}}. $
Fix all the other coordinates and update one element of the solution vector at a time. Therefore, in the (m+1)th step, for i=1, …, n, j=1, …, k-1, l=1, …, p, we set
$ {G_{i, j}}(z) = \sum\limits_{q < l} {{X_{i, q}}} \hat B_j^{(m + 1), (q)} + z{X_{i, l}} + \sum\limits_{q > l} {{X_{i, q}}} \hat B_j^{(m), (q)}, $
where Xi, q denotes the qth element of Xi. The classification function vector evaluated at Xi, with the lth coefficient of its jth component replaced by the candidate value z, is
$ \begin{array}{c}{F_{i, j, -l}(z)=\left(\boldsymbol{X}_{i}^{\mathrm{T}} \hat{\boldsymbol{B}}_{1}^{(m+1)}, \cdots, \boldsymbol{X}_{i}^{\mathrm{T}} \hat{\boldsymbol{B}}_{j-1}^{(m+1)}, \right.}\\ {\left. G_{i, j}(z), \boldsymbol{X}_{i}^{\mathrm{T}} \hat{\boldsymbol{B}}_{j+1}^{(m)}, \cdots, \boldsymbol{X}_{i}^{\mathrm{T}} \hat{\boldsymbol{B}}_{k-1}^{(m)} \right)^{\mathrm{T}}}.\end{array} $
Then the lth element of B̂j(m+1) is updated by solving the single-variable problem
$ \hat{B}_{j}^{(m+1), (l)}=\operatorname{argmin}_{z}\left[\lambda z^{2}+\frac{1}{n} \sum\limits_{i=1}^{n} l\left\{\left\langle F_{i, j, -l}(z), W_{y_{i}}\right\rangle\right\}\right]. $  (16)
Step 3. Repeat the above until all estimated values converge.
After obtaining the estimated coefficients β̂q(u) for each value u of the targeted variable, the qth classification function is
$ \begin{aligned} f_{q}(u, \boldsymbol{X})=& \boldsymbol{X}^{\mathrm{T}} \hat{\boldsymbol{\beta}}_{q}(u)=\left(1, X^{(1)}, X^{(2)}, \cdots, X^{(p)}\right)\cdot \\ &\left(\hat{\alpha}_{q}(u), \hat{\boldsymbol{\beta}}_{q}^{(1)}(u), \cdots, \hat{\boldsymbol{\beta}}_{q}^{(p)}(u)\right)^{\mathrm{T}}. \end{aligned} $
The classification function vector is
$ f(u, \boldsymbol{X})=\left(f_{1}(u, \boldsymbol{X}), f_{2}(u, \boldsymbol{X}), \cdots, f_{k-1}(u, \boldsymbol{X})\right). $
Therefore, any new sample (u′, X′) is mapped to f(u′, X′), which forms k angles ∠(f(u′, X′), Wj) with the vertex vectors. Then we assign its label as
$ \hat y = {{\mathop{\rm argmin}\nolimits} _{j \in \{ 1, \cdots , k\} }}\angle \left( {f\left( {{u^\prime }, {X^\prime }} \right), {W_j}} \right). $
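As a small illustration of this prediction rule (assuming the simplex vertices W constructed as in section 1.1 and an already-estimated coefficient matrix), the sketch below maps a new sample to f(u′, X′) and returns the class whose vertex forms the smallest angle; since every Wj has unit norm, this is simply the class with the largest inner product.

```python
import numpy as np

def predict_label(x_new, B_hat, W):
    """x_new: feature vector with intercept, shape (p+1,);
    B_hat: estimated coefficients beta_q(u') stacked as columns, shape (p+1, k-1);
    W: simplex vertices, shape (k, k-1). Returns a label in {1, ..., k}."""
    f = x_new @ B_hat        # classification function vector f(u', X')
    scores = W @ f           # <f, W_j>; smallest angle <=> largest inner product
    return int(np.argmax(scores)) + 1
```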
In this section, we use numerical experiments to verify the effectiveness of the TLAMSVM model and choose the F1 score to measure its classification performance. The F1 score is a common criterion for binary classification problems; here we generalize it to the multi-category case. In detail, we compute an F1 score for each class by viewing the current class as positive and the rest as negative, and the average of all per-class F1 scores is the final result for the classifier.
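This per-class averaging corresponds to the macro-averaged F1 score; for instance, with scikit-learn it can be computed directly as in the short sketch below (the toy labels are placeholders).

```python
from sklearn.metrics import f1_score

y_true = [1, 2, 3, 2, 1, 3, 3, 2]
y_pred = [1, 2, 3, 3, 1, 3, 2, 2]
# one F1 per class (current class as positive, the rest as negative), then averaged
print(f1_score(y_true, y_pred, average="macro"))
```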
It is worth noting that, when dealing with non-linear problems, the TLAMSVM model needs both a Mercer kernel function K(·) and a local smoothing kernel function Khn(·). The difference is clear: the former is used to construct non-linear decision boundaries, while the latter is used to aggregate information near u0. In the numerical experiments, we use the linear, Gaussian, and polynomial kernels as Mercer kernel functions, and the uniform, Gaussian, and Epanechnikov (abbreviated "ep") kernels as local smoothing kernel functions.
3.1 Simulation results
We generate two simulated data sets, S1 and S2. S1 has 1 000 samples with a 3-class label and 3 features; S2 has 1 500 samples with a 5-class label and 7 features. We carry out 10 simulation runs for each setting, and the final results are averaged values. For both S1 and S2, the targeted variable Ui is generated from U(0, 1). The label of S1 is defined as in (17); the labels of S2 are generated in the same way, and the formula is omitted owing to limited space.
$ Y_{i}=\left\{\begin{array}{ll}{1, } & {\text { if } U_{i}^{2}+U_{i}-1+\epsilon_{i} \leqslant U_{\frac{1}{3}}} \\ {2, } & {\text { if } U_{\frac{1}{3}} < U_{i}^{2}+U_{i}-1+\epsilon_{i} \leqslant U_{\frac{2}{3}}} \\ {3, } & {\text { if } U_{i}^{2}+U_{i}-1+\epsilon_{i}>U_{\frac{2}{3}}}\end{array}\right. $  (17)
here εi is a random error term, and U1/3 and U2/3 denote the 1/3 and 2/3 quantiles of Ui2+Ui-1+εi, so that the three classes are roughly balanced. Given the label, the features are generated from
$ {\mathit{\boldsymbol{X}}_i}|{Y_i}, {U_i}\sim MVN\left\{ {\beta \left( {{U_i}} \right){Y_i}, {\sigma ^2}I} \right\}, $
where β(u)=(sin(4πu), 2exp{-20(u-0.5)^2})T.
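A possible way to generate S1 under these settings is sketched below; since the error distribution of εi and the noise scale σ are not fully specified here, standard normal errors and σ = 1 are used as placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 1000, 1.0                                    # sigma is a placeholder assumption
U = rng.uniform(0, 1, size=n)
eps = rng.normal(size=n)                                # assumed error distribution
score = U ** 2 + U - 1 + eps
q1, q2 = np.quantile(score, [1 / 3, 2 / 3])             # U_{1/3}, U_{2/3} as empirical quantiles
Y = 1 + (score > q1).astype(int) + (score > q2).astype(int)   # labels in {1, 2, 3}

beta_u = np.column_stack([np.sin(4 * np.pi * U),        # beta(U_i) as given above
                          2 * np.exp(-20 * (U - 0.5) ** 2)])
X = beta_u * Y[:, None] + sigma * rng.normal(size=(n, 2))     # X_i | Y_i, U_i ~ MVN(beta(U_i) Y_i, sigma^2 I)
print(np.bincount(Y)[1:])                               # class sizes are roughly balanced
```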
As for S2, we generate the features Xi=(Xi1, Xi2, Xi3, Xi4, Xi5, Xi6)T from (18) with σ=2,
$ \begin{aligned} \boldsymbol{X}_i \sim\ & MVN\left( \mu^{(k)} \eta\left( U_i \right) I_{\{X_i \in \text{class } k\}}, \sigma^2 I \right), \\ \boldsymbol{\eta}\left(U_{i}\right)=&\left(2 U_{i}, U_{i}^{2}+2 U_{i}-1, \cos \left(2 \pi U_{i}\right), 4 \mathrm{e}^{\left(U_{i}-0.3\right)^{2}}, \ln \left(\frac{1}{U_{i}^{2}}\right), \sin \left(3 \pi U_{i}\right)\right)^{\mathrm{T}}, \end{aligned} $  (18)
where
$ \begin{array}{c}{\mu^{(1)}=(5, 5, 0, 0, 0, 0), \mu^{(2)}=(0, 5, 5, 0, 0, 0), } \\ {\mu^{(3)}=(0, 0, 5, 5, 0, 0), \mu^{(4)}=(0, 0, 0, 5, 5, 0), } \\ {\mu^{(5)}=(0, 0, 0, 0, 5, 5).}\end{array} $
We mainly compare three models: KMSVM, RAMSVM, and TLAMSVM. KMSVM is a multi-category generalization of KSVM that combines the OvO multi-classification framework with the KSVM method. RAMSVM does not use a varying-coefficient model and regards the targeted variable as a common feature. Both TLAMSVM and RAMSVM use the angle-based multi-category framework. In each model, we use 5-fold cross-validation to determine the parameters λ and hn.
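The parameter search can be organized as a plain grid over (λ, hn) with 5-fold cross-validation, as in the sketch below; fit_tlamsvm and score_tlamsvm are hypothetical stand-ins for the training and evaluation routines described above.

```python
import numpy as np
from sklearn.model_selection import KFold

def select_parameters(X, y, U, lambdas, bandwidths, fit_tlamsvm, score_tlamsvm):
    """5-fold CV over the grid of (lambda, h_n); returns the best pair."""
    best, best_score = None, -np.inf
    for lam in lambdas:
        for h in bandwidths:
            scores = []
            for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
                model = fit_tlamsvm(X[tr], y[tr], U[tr], lam=lam, h=h)
                scores.append(score_tlamsvm(model, X[te], y[te], U[te]))
            if np.mean(scores) > best_score:
                best, best_score = (lam, h), np.mean(scores)
    return best
```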
Table 1 and Table 2 record the results of the RAMSVM and TLAMSVM models on the simulated data S1 and S2.
Since the RAMSVM model does not need a local smoothing kernel function to build many MSVMs according to the targeted variable, it has an advantage in computation time. But in terms of predictive accuracy, the TLAMSVM model performs better regardless of which Mercer kernel or local smoothing kernel function is used.
Table 3 and Table 4 record the F1 scores and training times of the KMSVM and TLAMSVM models on the simulated data S1 and S2.
The TLAMSVM model provides better predictions than the KMSVM model with the same Mercer or local smoothing kernel function and similar training time. The values in Table 5 are the average prediction results of the RAMSVM and TLAMSVM models on S1.
Here we compare three cases with different feature spaces: Xi1 only, Xi2 only, and both. We find that the F1 score of the TLAMSVM model is much larger than that of the RAMSVM model when a single feature variable is used, so the TLAMSVM model does not rely on the number of feature variables as much as the RAMSVM model does. Likewise, the TLAMSVM model yields better results with any kernel function in all three cases.
3.2 Real data analysis
In this subsection, we demonstrate the classification performance of the TLAMSVM model on two real data sets, the Cervical Cancer data and the Protein Localization Sites data, denoted D1 and D2, respectively. Both are from the University of California, Irvine (UCI) Machine Learning Repository. The Cervical Cancer data were collected at Hospital Universitario de Caracas from 858 patients and include demographic information, patients' habits, and their historic medical records. The Protein Localization Sites data come from research on an expert system that uses information from the amino acid sequence and source origin to predict the localization sites of proteins; the data set has 1 484 samples and 10 categories.
3.2.1 Application to the Cervical Cancer data set
Cervical cancer remains one of the leading causes of death among patients with malignant gynecologic disease. It can be caused by a number of factors, for instance, viral infection, the number of sexual partners, the number of pregnancies, and smoking habits. From the viewpoint of computer-aided diagnosis, there are several methods for detecting pre-cancerous cervical lesions. The four most frequently used examination methods are colposcopy with acetic acid (Hinselmann), colposcopy with Lugol iodine (Schiller), cytology, and biopsy[14]. The data set records the results of these four tests (Hinselmann, Schiller, cytology, and biopsy). From these four test results, we artificially generate a response variable y∈{1, 2, 3} that reflects the confidence of the diagnosis: when all four test results are normal, y=1; when any one or two of the test results are normal, y=2; and y=3 for the remaining situations. We thus obtain a 3-category data set.
After some common data preprocessing steps, we obtain a data set without missing data or outliers. We reduce the number of variables in the model from 31 to 12 by calculating the conditional information entropy and information gain of each variable, and we take the feature "age" as the targeted variable in this data set. Because the class distribution is imbalanced, we use a combination of undersampling and oversampling techniques (ENN+SMOTE) to balance the training data. The principle of ENN is to remove samples whose class differs from the class of the majority of their k nearest neighbors. SMOTE[15] is a relatively basic oversampling method that reduces the size differences between categories: for each minority-class sample, it connects the sample to its k nearest neighbors of the same class, which yields k line segments, and randomly generates one synthetic sample on each segment, producing k new minority-class samples.
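If the imbalanced-learn package is available, the ENN-then-SMOTE balancing described above might be carried out roughly as follows (the toy training set and the neighbor counts are placeholder assumptions).

```python
import numpy as np
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
# toy imbalanced training set: class sizes 300 / 40 / 20
X_train = np.vstack([rng.normal(0, 1, (300, 5)),
                     rng.normal(2, 1, (40, 5)),
                     rng.normal(4, 1, (20, 5))])
y_train = np.repeat([1, 2, 3], [300, 40, 20])

# ENN: drop samples whose class disagrees with the majority of their nearest neighbours
X_enn, y_enn = EditedNearestNeighbours(n_neighbors=3).fit_resample(X_train, y_train)
# SMOTE: synthesize new minority samples along segments to same-class neighbours
X_bal, y_bal = SMOTE(k_neighbors=3, random_state=0).fit_resample(X_enn, y_enn)
print(np.bincount(y_bal)[1:])
```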
Table 6 gives the numerical results of the TLAMSVM and RAMSVM models with different Mercer kernel functions and local smoothing kernel functions on the cervical cancer data set. It can be seen that, with the same local smoothing kernel function, the polynomial Mercer kernel gives the best classification accuracy, and with the same Mercer kernel function, the Epanechnikov and Gaussian local kernel functions give relatively better prediction performance. Comparing the average accuracy, the prediction performance of the TLAMSVM model is slightly better than that of the RAMSVM model.
For any given value u of the targeted variable, we can obtain the corresponding estimated coefficients β(u). Figure 1 is a line chart of the coefficients in the first dimension; it shows how the coefficients of 5 feature variables vary as "age" changes.
Fig. 1 β1(u) vs. the targeted variable (age) on D1
From Fig. 1 we can see that the influence of the feature "Smokes" is small when age is between 20 and 40 but increases considerably as age becomes larger. This means that for young patients the information about smoking habits is not as important as the other variables, whereas for middle-aged patients the smoking habit offers more information for classification.
Figure 2 shows the inner products between the classification function vector of a sample and each simplex vertex, 〈f(x), Wj〉 (j=1, 2, 3), as the targeted variable U takes different values.
Fig. 2 Predictions for the three given samples on D1 (solid: 〈f(x), W1〉, dashed: 〈f(x), W2〉, dotted: 〈f(x), W3〉)
From section 2 we know that a sample is classified to the category with the maximal inner product value, so at any fixed age the class corresponding to the top line is the prediction result. Figure 2 shows that different values of the targeted variable affect the final classification results of the 3 given samples. The three subgraphs give the prediction details of three samples belonging to class-1, class-2, and class-3, respectively; the ages of these samples are 29, 52, and 51, in that order.
First, we find that the TLAMSVM model gives the correct predictions according to the prediction rule. Second, for identical feature variables, different values of the targeted variable can change the decision for that sample. That is to say, with the other physiological characteristics fixed, assuming the patient is at a different age could lead to a different diagnosis. Take the third sample as an example: the patient is 51 years old and the true label is 3, which means the patient is at high risk of suffering from cervical cancer. However, we would make an entirely different judgement if we regarded the same physiological characteristics as information from a patient of another age: a patient aged from 20 to 30 would be diagnosed as class-1 (no cervical cancer), and a patient older than 75 would be diagnosed as class-2 (not confirmed). Thus the TLAMSVM model can accurately capture the essential information of a sample by studying the relationship between the targeted variable and the other feature variables, and thereby provide better predictions.
3.2.2 Application to the Protein Localization Sites data set
The Protein Localization Sites data set has 8 features, which are scores produced by several methods that measure information about the protein or amino acid sequence. It has 10 classes representing the different localization sites of proteins, such as CYT (cytosolic or cytoskeletal), NUC (nuclear), and MIT (mitochondrial). We train the RAMSVM and TLAMSVM models on these data to obtain classifiers.
Figure 3 is based on D2 (Protein Localization Sites data).
Fig. 3 Estimated values for the coefficients of the first three dimensions on D2 (solid: feature 'gvh', dashed: feature 'alm', dotted: feature 'mit', dash-dotted: feature 'nuc')
And we set "mcg"(McGeoch's method for signal sequence recognition) as targeted variable. Because of limitations of space, Fig. 3 only demonstrates the coefficient β(u) in the first, second and third dimension. And these subgraphs are enough to show the variability of coefficients in model TLAMSVM.
Figure 4 is the inner product graph for a certain sample, whose feature "mcg" equals 0.45 and whose true label is class-1. From Fig. 4 we know that the TLAMSVM model gives a correct prediction (class-1) for this sample, and that the classification result would be different if the "mcg" value of this sample were different.
Fig. 4 Predictions for a given sample under different targeted variables (mcg) on D2
Table 7 records the performance of the TLAMSVM and RAMSVM models on the Protein Localization Sites data set. From Table 7 we can draw the same conclusion as for the Cervical Cancer data: the classification accuracy of the TLAMSVM model is better than that of the RAMSVM model, consistent with the simulation results, and the predictions obtained from the model agree with its prediction rule.
In this study, the TLAMSVM model is proposed to solve multi-category classification problems and to provide better predictions for different samples. In the numerical experiments, we first predict disease risk and show its dependency on age. Building an age-specific predictive model helps to study the appropriate time for intervention and to find potential features that support individualized treatment for a certain disease; the fitted coefficients β(u) describe the age sensitivity of each feature variable to disease risk. We then obtain good results on the protein localization sites data and make more "personalized" predictions for samples with different "mcg" values. Finally, the simulation results and real data analysis show that different choices of Mercer kernel and smoothing kernel functions lead only to minor differences in prediction accuracy.
We have considered feature variables whose effects depend on the targeted variable, but features with stable effects are easily overlooked in this setting. Thus, in future work we will add some features with invariant effects to the classification function, such as α(U)+XTβ(U)+ZTγ.
[1] Cortes C, Vapnik V. Support-vector networks[J]. Machine Learning, 1995, 20(3): 273-297.
[2] Yau G X, Zhang C. Multi-category angle-based classifier refit[EB/OL]. (2016-07-19)[2017-08-12]. https://arxiv.org/abs/1607.05709.
[3] Moguerza J M, Muñoz A. Support vector machines with applications[J]. Statistical Science, 2006, 21(3): 322-336. DOI:10.1214/088342306000000493.
[4] Allwein E L, Schapire R E, Singer Y. Reducing multiclass to binary: a unifying approach for margin classifiers[C]//Proceedings of the International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann, 2000: 9-16.
[5] Hastie T, Tibshirani R. Classification by pairwise coupling[C]//Advances in Neural Information Processing Systems. MIT Press, 1998: 507-513.
[6] Crammer K, Singer Y. On the algorithmic implementation of multiclass kernel-based vector machines[J]. Journal of Machine Learning Research, 2001, 2: 265-292.
[7] Lee Y, Lin Y, Wahba G. Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data[J]. Journal of the American Statistical Association, 2004, 99(465): 67-81. DOI:10.1198/016214504000000098.
[8] Zhang C, Liu Y, Wu Z. On the effect and remedies of shrinkage on classification probability estimation[J]. The American Statistician, 2013, 67(3): 134-142. DOI:10.1080/00031305.2013.817356.
[9] Zhang C, Liu Y. Multicategory angle-based large-margin classification[J]. Biometrika, 2014, 101(3): 625-640.
[10] Zhang C, Liu Y, Wang J, et al. Reinforced angle-based multicategory support vector machines[J]. Journal of Computational and Graphical Statistics, 2016, 25(3): 806-825. DOI:10.1080/10618600.2015.1043010.
[11] Wei Z, Wang K, Qu H Q, et al. From disease association to risk assessment: an optimistic view from genome-wide association studies on type 1 diabetes[J]. PLoS Genetics, 2009, 5(10): e1000678. DOI:10.1371/journal.pgen.1000678.
[12] Ladicky L, Torr P H S. Locally linear support vector machines[C]//Proceedings of the 28th International Conference on Machine Learning. Omnipress, 2011: 985-992.
[13] Chen T, Wang Y, Chen H, et al. Targeted local support vector machine for age-dependent classification[J]. Journal of the American Statistical Association, 2014, 109(507): 1174-1187. DOI:10.1080/01621459.2014.881743.
[14] Fernandes K, Cardoso J S, Fernandes J. Transfer learning with partial observability applied to cervical cancer screening[C]//Iberian Conference on Pattern Recognition and Image Analysis. Springer, 2017: 243-250.
[15] Chawla N V, Bowyer K W, Hall L O, et al. SMOTE: synthetic minority over-sampling technique[J]. Journal of Artificial Intelligence Research, 2002, 16: 321-357.