The breakdown point is one of the basic tools for measuring the robustness of statistical estimation. It was first defined by Hampel[1] and depends on the specific distribution of the observed data. A distribution-free, finite-sample definition of the breakdown point was proposed in Ref.[2]. Intuitively, the breakdown point is the maximum proportion of outliers that a given sample may contain without spoiling the estimator completely. Obviously, a higher breakdown point indicates a more robust estimator. The concept has been widely applied to location, scale, and regression models to describe resistance to outliers.
The study of the breakdown point has made great progress over the past 40 years. In linear regression models, it is well known that the breakdown point of the least squares method is 0, meaning that a single outlier may have an arbitrarily large effect on the estimator. Robust estimators with the high breakdown point of 0.5 have been developed, e.g., the MM-estimators[3], the τ-estimators[4], and the least median of squares (LMS) and least trimmed squares (LTS) estimators[5]. The breakdown properties of some nonlinear regression estimators were also investigated, e.g., in Ref.[6]. For logistic regression models, the breakdown behavior of the maximum likelihood estimator (MLE) is quite different from that in the linear model. Cox and Snell[7] argued that the commonly used MLE in logistic regression models is not robust. Christmann[8] showed that high breakdown point estimators do not exist in logistic regression models unless specific conditions hold, such as p≤n. A similar statement was made in Ref.[9] for the case where some observations of the raw data set are replaced. Neykov and Muller[10] considered the finite-sample breakdown point of trimmed likelihood estimators in generalized linear models. Khan et al.[11] adopted the least trimmed absolute (LTA) estimator for logistic model fitting; LTA is a robust estimator with a high breakdown point and outperforms M-estimators under a certain degree of contamination. Croux et al.[12] proved that the classical MLE in logistic regression models always stays bounded when outliers are added to the raw data set. However, all of the aforementioned methods require low dimensionality, i.e., n>p.
Due to the rapid development of science and technology and the variety of data collection methods, high-dimensional data have become increasingly common and have attracted great attention. Regularization is a powerful tool for high-dimensional problems: it can force the model to be sparse, low-rank, smooth, and so on. Accordingly, many penalized estimators have been established to select useful variables in high-dimensional linear models. Breiman[13] proposed the nonnegative garrote for subset regression. Tibshirani introduced the lasso in Ref.[14], and the LAD-lasso was discussed in Ref.[15]. Xie and Huang[16] applied the smoothly clipped absolute deviation (SCAD) penalty to achieve sparsity in high-dimensional linear models. The breakdown point of penalized regression is therefore becoming more important. Alfons et al.[17] studied several sparse estimators in linear models: the breakdown point of sparse LTS was verified to be (n-h+1)/n, where h is the initial guess of the number of non-contaminated observations, while both the lasso and the LAD-lasso have a breakdown point of 1/n. Various penalized logistic regression estimators have also been proposed as alternatives to the non-robust MLE in logistic models, but the breakdown point of penalized logistic regression under high dimensionality has not been discussed yet. It can neither be derived from the results for penalized linear models nor obtained by generalizing the results for unpenalized logistic regression models.
We aim to compute the breakdown point of the penalized logistic regression estimator when there are outliers in the observations. Traditional methods are often limited to n>p, but collected data sets nowadays are often large and sophisticated, so in this paper we consider not only n>p but also n<p. We add different penalty terms to the original loss function while the model design can be corrupted by outliers, in either the replacement or the addition setting. It is verified that the explosive breakdown point of the penalized MLE in the logistic regression model attains its maximal value. More precisely, we still obtain bounded regression coefficients even if the contamination rate of the observed data reaches 50%; our simulation studies illustrate this property. This is in sharp contrast to classical estimators in linear regression models, which can become arbitrarily large with only one outlier. The upper bound of the implosive breakdown point of the logistic regression estimator is also computed. Moreover, our method is not limited to logistic models but also applies to probit models.
The rest of this article is organized as follows. In Section 1, the replacement and addition explosive breakdown points of the penalized logistic regression estimator are obtained, and the replacement implosive breakdown point of the penalized MLE is also derived. Section 2 discusses a fast convergent iterative algorithm, Section 3 presents simulation studies, and Section 4 concludes.
We introduce some necessary notation. For a vector β=(β1, …, βp)T∈ℝp, we write ‖β‖1 for its L1 norm and ‖β‖2 for its Euclidean norm.
We start with a brief introduction of the logistic regression model. Denote by X=(Xij)1≤i≤p, 1≤j≤n=(X1, …, Xn) the matrix of predictor variables, where n is the number of observations and p the number of variables. Let Y=(Y1, …, Yn)T be the vector of responses, where Yi is either 0 or 1, i=1, …, n.
The logistic regression model is given by
$ {\mu _i}: = p({Y_i} = 1|{\mathit{\boldsymbol{X}}_i}) = {F^{ - 1}}({\theta _i}), $ | (1) |
where F(x)=logit(x):=log{x/(1-x)}, θi=α+XiTβ, α and β=(β1, …, βp)T are the unknown regression parameters. If F(·)=Φ-1(·), where Φ(·) is the cumulative distribution function (CDF) of the standard normal distribution, then model (1) becomes the probit model.
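To make the two links concrete, the following minimal Python sketch evaluates μi for both models; the function names are chosen here only for illustration and are not part of the paper.

```python
import numpy as np
from scipy.stats import norm

def inverse_logit(theta):
    """Logistic model: mu_i = F^{-1}(theta_i) with F = logit."""
    return 1.0 / (1.0 + np.exp(-theta))

def inverse_probit(theta):
    """Probit model: mu_i = Phi(theta_i), the standard normal CDF."""
    return norm.cdf(theta)
```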
M-estimation of the parameters is often carried out by minimizing a certain loss function; however, the quadratic loss is not suitable for classification problems. Let γ=(γ1, …, γp+1)T=(α; β). Given the observed sample Z=(Z1, …, Zn), where Zi=(Xi; Yi), i=1, …, n, common choices are the following.
Quadratic loss: l(γ, Zi)=(Yi-θi)2/2;
Deviance loss:
$ l(\mathit{\boldsymbol{\gamma }},{\mathit{\boldsymbol{Z}}_i}) = - 2\{ {Y_i}{\rm{log}}({\mu _i}) + (1 - {Y_i}){\rm{log}}(1 - {\mu _i})\} ; $ |
Exponential loss: l(γ, Zi)=exp{-(Yi-0.5)θi}.
In this paper, we use the Bernoulli deviance loss, which is standard in exponential dispersion models and generalized linear models:
$ L(\boldsymbol{\gamma}, \boldsymbol{Z}) = -\frac{2}{n}\sum\limits_{i = 1}^{n}\{Y_i\log(\mu_i) + (1 - Y_i)\log(1 - \mu_i)\}. $ | (2) |
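As a small, hedged illustration of how (2) is evaluated in practice, the Python sketch below computes the average Bernoulli deviance for the logit link; the clipping constant eps is an added numerical safeguard, not part of the formula, and X is stored p-by-n as in the notation above.

```python
import numpy as np

def deviance_loss(alpha, beta, X, Y, eps=1e-12):
    """Average Bernoulli deviance (2); X is p-by-n, Y is a 0/1 vector of length n."""
    theta = alpha + X.T @ beta               # theta_i = alpha + X_i^T beta
    mu = 1.0 / (1.0 + np.exp(-theta))        # logistic inverse link
    mu = np.clip(mu, eps, 1.0 - eps)         # guard against log(0)
    return -2.0 / len(Y) * np.sum(Y * np.log(mu) + (1.0 - Y) * np.log(1.0 - mu))
```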
Here, the regression estimator obtained by minimizing (2) is the MLE. For the binary logistic regression model, there are three possible data patterns: complete separation, quasi-complete separation, and overlap. Albert and Anderson[18] verified that the MLE of the logistic regression model does not exist when the data are completely or quasi-completely separated, while in the overlap situation the MLE exists and is unique. A similar result was discussed in Ref.[19]. Thus, we concentrate on the overlap situation here. On the other hand, regularization methods generally consist of a loss function and a penalty term and are commonly used to obtain well-behaved solutions to over-parameterized estimation problems. Fan and Li[20] noted that a good penalty function results in an estimator with unbiasedness, sparsity, and continuity, and asserted that bounded penalty functions can reduce bias and yield continuous solutions. Therefore, we add a penalty term Pλ(β) to the loss function. The regression estimator is then defined as
$ \mathit{\pmb{\hat \gamma }} = \mathop {{\rm{argmin}}}\limits_\mathit{\boldsymbol{\gamma }} \{ L(\mathit{\boldsymbol{\gamma }},\mathit{\boldsymbol{Z}}) + {P_\lambda }(\mathit{\boldsymbol{\beta }})\} : = \mathop {{\rm{ argmin }}}\limits_\gamma Q(\mathit{\boldsymbol{\gamma }},\mathit{\boldsymbol{Z}}), $ | (3) |
where Q(γ, Z) is the objective function. Since the intercept term has no corresponding feature and mainly fits the overall shift of the data, it is not penalized.
The components of Pλ(β) are additive, i.e., Pλ(β)=Pλ(|β1|)+…+Pλ(|βp|). We consider the following three penalty functions (a code sketch of all three follows their definitions).
L1 penalty function,
$ {P_\lambda }(|{\beta _j}|) = \lambda |{\beta _j}|. $ | (4) |
L2 penalty function,
$ {P_\lambda }(|{\beta _j}|) = \lambda \beta _j^2/2. $ | (5) |
SCAD penalty function,
$ P_\lambda(|\beta_j|) = \left\{ {\begin{array}{*{20}{l}} {\lambda|\beta_j|,}&{|\beta_j| \le \lambda,}\\ {\dfrac{-(|\beta_j|^2 - 2a\lambda|\beta_j| + \lambda^2)}{2(a - 1)},}&{\lambda < |\beta_j| \le a\lambda,}\\ {(a + 1)\lambda^2/2,}&{|\beta_j| > a\lambda,} \end{array}} \right. $ | (6) |
where a>2 and λ>0 are tuning parameters.
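For reference, the three penalties (4)-(6) translate directly into code; the sketch below is a plain transcription of the formulas, with lam standing for λ and the default a=3.7 chosen only as a common illustrative value.

```python
import numpy as np

def penalty_l1(beta, lam):
    """L1 penalty (4), summed over the components of beta."""
    return lam * np.sum(np.abs(beta))

def penalty_l2(beta, lam):
    """L2 penalty (5), summed over the components of beta."""
    return 0.5 * lam * np.sum(beta ** 2)

def penalty_scad(beta, lam, a=3.7):
    """SCAD penalty (6), applied componentwise and summed."""
    b = np.abs(beta)
    small = lam * b
    middle = -(b ** 2 - 2 * a * lam * b + lam ** 2) / (2 * (a - 1))
    large = (a + 1) * lam ** 2 / 2
    return np.sum(np.where(b <= lam, small, np.where(b <= a * lam, middle, large)))
```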
To study the robustness of an estimator, Donoho and Huber[2] introduced two finite-sample versions of the breakdown point: the replacement breakdown point and the addition breakdown point. Roughly speaking, the value is the smallest proportion of contamination that can drive the estimator's Euclidean norm to infinity or to zero, i.e., make the estimator completely unreliable. Rousseeuw and Leroy[21] argued that the breakdown point cannot exceed 50%: intuitively, if more than half of the data are contaminated, we cannot distinguish the original distribution from the contaminating distribution.
1.1 Explosive breakdown point
When the Euclidean norm of the estimator can be driven to infinity by contamination, the estimator is said to break down by explosion. The replacement and addition explosive breakdown points are defined as
$ {\varepsilon _\infty }(\mathit{\pmb{\hat \gamma }},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}) = {\rm{min}}\left\{ {\frac{m}{n}:\mathop {{\rm{sup}}}\limits_{{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}} {{\left\| {\mathit{\pmb{\hat \gamma }}({{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}})} \right\|}_2} = \infty } \right\}, $ |
$ {\varepsilon _\infty }(\mathit{\pmb{\hat \gamma }},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{add}}}}) = {\rm{min}}\left\{ {\frac{m}{{n + m}}:\mathop {{\rm{sup}}}\limits_{{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{add}}}}} {{\left\| {\mathit{\pmb{\hat \gamma }}({{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{add}}}})} \right\|}_2} = \infty } \right\}, $ |
where Z̃rep denotes a contaminated sample obtained by replacing m of the n original observations with arbitrary outliers, and Z̃add denotes a sample obtained by adding m arbitrary outliers to the original data.
For penalized MLE in logistic models, we have the following theorem.
Theorem 1.1 For model (1), considering (3), the penalized MLE satisfies
$ {\varepsilon _\infty }(\mathit{\pmb{\hat \gamma }},{\mathit{\boldsymbol{\tilde Z}}^{{\rm{rep}}}}) = 0.5. $ |
Proof The L1 norm of a vector β is denoted by ‖β‖1 and the Euclidean norm by ‖β‖2. Since the two norms are topologically equivalent, there exists a constant c with 0<c≤1 such that ‖β‖1≥c‖β‖2 (in fact c=1 suffices). Replace m of the observations by m outliers; then
$ Q(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}^{\rm rep}) = -\frac{2}{n}\sum\limits_{i = 1}^{n}\{\tilde{Y}_i\log(\mu_i) + (1 - \tilde{Y}_i)\log(1 - \mu_i)\} + P_\lambda(\boldsymbol{\beta}). $ |
Consider γ=0, i.e., γi=0 for i=1, …, p+1; then μi=0.5 for i=1, …, n, and the objective function becomes
$ Q(\mathit{\pmb{0}},{\mathit{\boldsymbol{\tilde Z}}^{{\rm{rep}}}}) = 2{\rm{log}}{\kern 1pt} {\kern 1pt} 2. $ |
For the L1 penalty (4), consider any β with ‖β‖2≥M:=(2log 2+1)/(cλ); then
$ Q(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}^{\rm rep}) > P_\lambda(\boldsymbol{\beta}) = \lambda\|\boldsymbol{\beta}\|_1 \ge c\lambda\|\boldsymbol{\beta}\|_2 \ge 2\log 2 + 1 > Q(\pmb{0}, \tilde{\boldsymbol{Z}}^{\rm rep}). $ |
For the L2 penalty (5), consider any β with ‖β‖2≥M:=2(2log 2+1)/λ; then
$ Q(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}^{\rm rep}) > P_\lambda(\boldsymbol{\beta}) = \frac{\lambda}{2}\|\boldsymbol{\beta}\|_2^2 \ge 2\log 2 + 1 > Q(\pmb{0}, \tilde{\boldsymbol{Z}}^{\rm rep}). $ |
For the SCAD penalty (6), Pλ(|βj|) is a constant for sufficiently large |βj|. Consider any β with ‖β‖2≥M:=aλ√p, so that its largest component satisfies |βj|≥‖β‖2/√p≥aλ; then
$ Q(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}^{\rm rep}) > P_\lambda(\boldsymbol{\beta}) \ge \frac{(a + 1)\lambda^2}{2} > 2\log 2 + 1 > Q(\pmb{0}, \tilde{\boldsymbol{Z}}^{\rm rep}). $ |
Thus, for all γ with ‖γ‖2≥‖β‖2≥M>0, the objective function at γ exceeds its value at γ=0, so the minimizer of (3) stays bounded no matter how many observations are replaced or how the outliers are chosen. Since the breakdown point cannot exceed 0.5[21], the replacement explosive breakdown point of the penalized MLE equals 0.5.
Note that the constant M depends only on p and the tuning parameters, not on n or on the contaminated observations, so Theorem 1.1 holds both when n>p and when n<p. The deviance loss was used in the proof of Theorem 1.1; if it were replaced by the exponential loss, the argument would be similar and the conclusion the same.
As mentioned above, there are two types of finite-sample breakdown point. Zuo[22] gave quantitative relationships between the replacement and addition breakdown points for a large class of estimators whose breakdown points do not depend on the configuration of X. One might think that adding outliers has a different effect from replacing some of the raw observations with outliers. However, it can be seen from the proof of Theorem 1.1 that it does not matter whether arbitrary observations are replaced or added. If m outliers are added to the raw data set, we obtain the following theorem.
Theorem 1.2 For model (1), considering (3), the penalized MLE satisfies
$ {\varepsilon _\infty }(\mathit{\pmb{\hat \gamma }},{\mathit{\boldsymbol{\tilde Z}}^{{\rm{add}}}}) = 0.5. $ |
The proof is similar to that of Theorem 1.1.
Theorems 1.1 and 1.2 show that the penalized MLE is very robust in binary logistic regression models, and the results also carry over to multi-class models. For logistic regression with binary data, we have thus verified that the explosive breakdown point of the penalized MLE attains the maximal value of 0.5.
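A quick numerical check of this robustness, offered only as a hedged illustration, is to flip an increasing fraction of the responses and verify that the penalized coefficients stay bounded; the sketch below uses scikit-learn's L2-penalized logistic regression as a stand-in for estimator (3) rather than the algorithm developed later in this paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))                       # rows are observations here
beta_true = np.array([0.5, -0.6, 1.0, -0.8, 0.1])
mu = 1.0 / (1.0 + np.exp(-(0.2 + X @ beta_true)))
Y = rng.binomial(1, mu)

for frac in (0.0, 0.25, 0.5):                     # contamination fraction
    Yc = Y.copy()
    m = int(frac * n)
    Yc[:m] = 1 - Yc[:m]                           # replace m responses by outliers
    fit = LogisticRegression(penalty="l2", C=1.0).fit(X, Yc)
    print(frac, np.linalg.norm(fit.coef_))        # the coefficient norm stays bounded
```

In line with Theorems 1.1 and 1.2, the printed norms are expected to remain moderate even at 50% contamination.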
1.2 Implosive breakdown point
Another way for an estimator to become meaningless is that ‖β̂‖2 is driven to zero, so that the fitted model no longer reflects any relationship between the predictors and the response. The replacement implosive breakdown point is defined as
$ {\varepsilon _0}(\mathit{\pmb{\hat \beta }},{\mathit{\boldsymbol{\tilde Z}}^{{\rm{rep}}}}) = {\rm{min}}\left\{ {\frac{m}{n}:\mathop {{\rm{inf}}}\limits_{{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}} {{\left\| {\mathit{\pmb{\hat \beta }}({{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}})} \right\|}_2} = 0} \right\}. $ | (7) |
The following theorem shows that ‖β̂‖2 can be driven to zero when 2p observations of the raw sample are replaced by outliers.
Theorem 1.3 In model (1), considering (3), the penalized MLE satisfies
$ {\varepsilon _0}(\mathit{\pmb{\hat \beta }},{\mathit{\boldsymbol{\tilde Z}}^{{\rm{rep}}}}) < \frac{{2p}}{n}. $ |
Proof Let δ>0, M=max{‖X1‖2, …, ‖Xn‖2}, τ=log(2^n-1), N=τ/δ, and C=√p(M+2N). Replace 2p observations of the raw sample by the outliers with predictor Cek and responses 0 and 1, respectively, for k=1, …, p, where ek denotes the kth standard basis vector. We show that the objective at any γ with ‖β‖2>δ exceeds its value at γ=0.
First, suppose that |α+CekTβ|>N‖β‖2 for some k; then |α+CekTβ|>N‖β‖2>Nδ.
Case one: α+CekTβ>Nδ=τ. Considering the replaced observation with predictor Cek and response 0, we have
$ Q(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}^{\rm rep}) > \frac{1}{n}l(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}_{k0}^{\rm rep}) = \frac{2}{n}\log(1 + {\rm e}^{\alpha + C\boldsymbol{e}_k^{\rm T}\boldsymbol{\beta}}) > \frac{2}{n}\log(1 + {\rm e}^{\tau}) = Q(\pmb{0}, \tilde{\boldsymbol{Z}}^{\rm rep}). $ |
Case two: α+CekTβ<-Nδ=-τ. Considering the replaced observation with predictor Cek and response 1, we have
$ Q(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}^{\rm rep}) > \frac{1}{n}l(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}_{k1}^{\rm rep}) = \frac{2}{n}\log(1 + {\rm e}^{-\alpha - C\boldsymbol{e}_k^{\rm T}\boldsymbol{\beta}}) > \frac{2}{n}\log(1 + {\rm e}^{\tau}) = Q(\pmb{0}, \tilde{\boldsymbol{Z}}^{\rm rep}). $ |
Second, suppose that |α+CekTβ|≤N‖β‖2 for every k. Denote by jm the index such that |βjm|=max{|β1|, …, |βp|}. Since ‖β‖2>δ, we have |βjm|≥‖β‖2/√p.
Case one: βjm>0. By the case assumption, α+CejmTβ=α+Cβjm≤N‖β‖2, so α≤N‖β‖2-Cβjm≤N‖β‖2-C‖β‖2/√p≤-(M+N)‖β‖2 by the choice of C. Hence, for any observation with predictor Xi,
$ \alpha + \boldsymbol{X}_i^{\rm T}\boldsymbol{\beta} \le \alpha + \|\boldsymbol{X}_i\|_2\|\boldsymbol{\beta}\|_2 \le -(M + N)\|\boldsymbol{\beta}\|_2 + M\|\boldsymbol{\beta}\|_2 = -N\|\boldsymbol{\beta}\|_2 < -N\delta = -\tau. $ |
Thus, for an observation with response Yi=1,
$ Q(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}^{\rm rep}) > \frac{1}{n}l(\boldsymbol{\gamma}, \boldsymbol{Z}_{i1}) = \frac{2}{n}\log(1 + {\rm e}^{-\alpha - \boldsymbol{X}_i^{\rm T}\boldsymbol{\beta}}) > \frac{2}{n}\log(1 + {\rm e}^{\tau}) = Q(\pmb{0}, \tilde{\boldsymbol{Z}}^{\rm rep}). $ |
Case two: βjm<0. By the case assumption, α+CejmTβ=α+Cβjm≥-N‖β‖2, so α≥-N‖β‖2-Cβjm≥-N‖β‖2+C‖β‖2/√p≥(M+N)‖β‖2. Hence, for any observation with predictor Xi,
$ \alpha + \boldsymbol{X}_i^{\rm T}\boldsymbol{\beta} > \alpha - \|\boldsymbol{X}_i\|_2\|\boldsymbol{\beta}\|_2 > (M + N)\|\boldsymbol{\beta}\|_2 - M\|\boldsymbol{\beta}\|_2 = N\|\boldsymbol{\beta}\|_2 > N\delta = \tau. $ |
Thus, for an observation with response Yi=0,
$ Q(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}^{\rm rep}) > \frac{1}{n}l(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}_{i0}) = \frac{2}{n}\log(1 + {\rm e}^{\alpha + \boldsymbol{X}_i^{\rm T}\boldsymbol{\beta}}) > \frac{2}{n}\log(1 + {\rm e}^{\tau}) = Q(\pmb{0}, \tilde{\boldsymbol{Z}}^{\rm rep}). $ |
In conclusion, for all γ with ‖γ‖2≥‖β‖2>δ, the objective function at γ exceeds its value at γ=0. Hence the minimizer of (3) satisfies ‖β̂‖2≤δ; since δ>0 is arbitrary, replacing these 2p observations drives ‖β̂‖2 to zero, which proves the theorem.
Theorem 1.3 implies that the number of contaminated observations needed to drive the slope estimator to zero does not exceed twice the number of explanatory variables, which shows the non-robustness of the estimator in this direction. Moreover, the standard error of a regression estimator does not explode when the Euclidean norm of the estimator tends to zero, so this kind of breakdown is more difficult to detect. Croux et al.[12] proved that the unpenalized MLE of the logistic regression model breaks down to zero when several outliers are added to the data set; here we treat the penalized MLE in the replacement setting, and our result is comparable.
2 Algorithm
In this section, we concentrate on the computation of (3). Approaches applicable to linear models may be computationally infeasible for nonlinear regression. In our model the objective function is nonlinear and the normal equations are transcendental, so the problem can only be solved by numerical rather than algebraic methods.
There are several fast algorithms for the numerical computation. Efron et al.[23] introduced least angle regression. Balakrishnan and Madigan[24] presented an online algorithm and the MP algorithm for learning L1-penalized logistic regression models; Wu and Lange[25] used the coordinate descent (CD) algorithm to solve lasso regressions. Friedman et al.[26] derived a generalized CD algorithm for convex optimization problems. CD was also employed in Ref.[27], where it was shown to have computational advantages. The block CD algorithm was proved to have a linear convergence rate in Ref.[28], and Saha and Tewari[29] established a nonasymptotic O(1/k) convergence rate for cyclic CD, where k is the iteration counter. Since this algorithm has many excellent properties, we adopt it in this paper.
An approximate solution can be calculated by a convergent iterative algorithm. We transform the deviance loss into an approximate squared loss through a Taylor expansion, so that the regression coefficients can be computed more conveniently. Note that for multi-class problems the Taylor expansion of the objective function becomes more complex; in such situations one may consider other numerical methods, such as Newton's method or gradient methods.
An initial estimator is necessary for the Taylor expansion: the closer the initial value is to the true parameter, the faster the iteration converges and the more accurate the result. After a large number of experiments we found that γ0=0 is a good choice. For a given initial estimator γ0=(α0; β0), let θi0=α0+XiTβ0, i=1, …, n. We expand the loss function at θi0,
$ \begin{array}{l} L(\boldsymbol{\gamma}, \boldsymbol{Z}) = \frac{1}{n}\sum\limits_{i = 1}^{n} l(\boldsymbol{\gamma}, \boldsymbol{Z}_i) \\ \approx \frac{1}{n}\sum\limits_{i = 1}^{n}\left\{ l(\boldsymbol{\gamma}_0, \boldsymbol{Z}_i) + \left.\frac{\partial l(\boldsymbol{\gamma}, \boldsymbol{Z}_i)}{\partial\theta_i}\right|_{\theta_{i0}}(\theta_i - \theta_{i0}) + \frac{1}{2}\left.\frac{\partial^2 l(\boldsymbol{\gamma}, \boldsymbol{Z}_i)}{\partial\theta_i^2}\right|_{\theta_{i0}}(\theta_i - \theta_{i0})^2 \right\} \\ \approx \frac{1}{2n}\sum\limits_{i = 1}^{n} l''_{\theta_{i0}}\left\{\theta_{i0} - \frac{l'_{\theta_{i0}}}{l''_{\theta_{i0}}} - \theta_i\right\}^2, \end{array} $ |
where l′θi0 and l″θi0 denote the first and second derivatives of l(γ, Zi) with respect to θi evaluated at θi0, and terms not depending on θi are dropped.
If the loss function is the deviance loss, then l(γ, Zi)=-2{yilogμi+(1-yi)log(1-μi)}. For the logistic and probit models the distribution functions are known, so their first- and second-order derivatives are easy to obtain.
For logit model, we have
$ {l_{{\theta _i}}^\prime = \frac{{\partial l(\mathit{\boldsymbol{\gamma }},{\mathit{\boldsymbol{Z}}_i})}}{{\partial {\theta _i}}} = 2({\mu _i} - {y_i}),} $ |
$ {l_{{\theta _i}}^{\prime \prime } = \frac{{{\partial ^2}l(\mathit{\boldsymbol{\gamma }},{\mathit{\boldsymbol{Z}}_i})}}{{\partial \theta _i^2}} = 2{\mu _i}(1 - {\mu _i}).} $ |
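In code, these two derivatives and the working response θi0-l′/l″ used in the quadratic approximation take the following form; this is a sketch for the logit link only, with variable names chosen here.

```python
import numpy as np

def logit_derivatives(theta0, y):
    """First and second derivatives of the deviance loss at theta0 (logit link)."""
    mu = 1.0 / (1.0 + np.exp(-theta0))
    l1 = 2.0 * (mu - y)             # l'_{theta_{i0}}
    l2 = 2.0 * mu * (1.0 - mu)      # l''_{theta_{i0}}
    return l1, l2

def working_response(theta0, y):
    """z_i = theta_{i0} - l'/l'', the target of the quadratic approximation."""
    l1, l2 = logit_derivatives(theta0, y)
    return theta0 - l1 / l2
```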
And for probit model, we have
$ l_{{\theta _i}}^\prime = 2\frac{{{\mu _i}\mu _i^\prime - {y_i}\mu _i^\prime }}{{{\mu _i}(1 - {\mu _i})}}, $ |
$ l''_{\theta_i} = 2\frac{(\mu'_i\mu'_i + \mu_i\mu''_i - y_i\mu''_i)(\mu_i - \mu_i^2)}{\mu_i^2(1 - \mu_i)^2} - 2\frac{(\mu_i\mu'_i - y_i\mu'_i)(\mu'_i - 2\mu_i\mu'_i)}{\mu_i^2(1 - \mu_i)^2}, $ |
where μ′i and μ″i denote the first and second derivatives of μi=Φ(θi) with respect to θi, namely μ′i=ϕ(θi) and μ″i=-θiϕ(θi), with ϕ(·) the standard normal density.
Otherwise, if the loss function is the exponential loss l(γ, Zi)=exp{-(yi-0.5)θi}, it does not involve μi, so regardless of the link function we have
$ {l_{{\theta _i}}^\prime =  - ({y_i} - 0.5){{\rm{e}}^{ - ({y_i} - 0.5){\theta _i}}},} $ |
$ {l_{{\theta _i}}^{\prime \prime } = \frac{1}{4}{{\rm{e}}^{ - ({y_i} - 0.5){\theta _i}}}.} $ |
To locate the minimizer, we set the first-order partial derivatives of the penalized objective function with respect to all parameters to zero and solve the resulting system of equations.
$ \left\{ {\begin{array}{*{20}{l}} {\partial Q(\mathit{\boldsymbol{\gamma }},\mathit{\boldsymbol{Z}})/\partial \alpha = 0,}\\ {\partial Q(\mathit{\boldsymbol{\gamma }},\mathit{\boldsymbol{Z}})/\partial {\beta _j} = 0,\quad j = 1, \cdots ,p.} \end{array}} \right. $ |
After the Taylor expansion, the loss part of the objective function consists of squared terms, so its derivatives are easy to obtain. For the penalty term, if the first derivative cannot be taken directly, as with the L1 penalty, we apply the soft-thresholding method first introduced by Donoho and Johnstone[30]. We then apply the CD algorithm; at the kth iteration,
$ \beta _1^{(k)} \in \mathop{{\rm{arg\,min}}}\limits_{\beta_1} Q\{ ({\alpha ^{(k - 1)}},\beta _1,\beta _2^{(k - 1)}, \cdots ,\beta _p^{(k - 1)}),\mathit{\boldsymbol{Z}}\} , $ |
$ \beta _2^{(k)} \in \mathop{{\rm{arg\,min}}}\limits_{\beta_2} Q\{ ({\alpha ^{(k - 1)}},\beta _1^{(k)},\beta _2,\beta _3^{(k - 1)}, \cdots ,\beta _p^{(k - 1)}),\mathit{\boldsymbol{Z}}\} , $ |
$ \begin{array}{l} \cdots \\ \beta _p^{(k)} \in \mathop{{\rm{arg\,min}}}\limits_{\beta_p} Q\{ ({\alpha ^{(k - 1)}},\beta _1^{(k)}, \cdots ,\beta _{p - 1}^{(k)},\beta _p),\mathit{\boldsymbol{Z}}\} , \end{array} $ |
$ {\alpha ^{(k)}} \in \mathop{{\rm{arg\,min}}}\limits_{\alpha} Q\{ (\alpha ,\beta _1^{(k)},\beta _2^{(k)}, \cdots ,\beta _p^{(k)}),\mathit{\boldsymbol{Z}}\} . $ |
For the given initial value γ0, each βj, j=1, …, p, has the following explicit solution.
For L1 penalization,
$ {\mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\frown$}} \over \beta } _j} = \frac{{S\{ \sum\nolimits_{i = 1}^n {l_{{\theta _i}0}^{\prime \prime }} {x_{ij}}({\theta _{i0}} - \frac{{l_{{\theta _{i0}}}^\prime }}{{l_{{\theta _{i0}}}^{\prime \prime }}} - y_{i0}^{ - j}),\lambda \} }}{{\sum\nolimits_{i = 1}^n {l_{{\theta _i}0}^{\prime \prime }} x_{ij}^2}}. $ | (8) |
where yi0-j=θi0-xijβ0j and S(·, ·) is the soft-thresholding operator
$ S(\mu ,\lambda ) = \left\{ {\begin{array}{*{20}{l}} {\mu - \lambda ,}&{{\rm{if}}\ \mu > \lambda ,}\\ {0,}&{{\rm{if}}\ |\mu | \le \lambda ,}\\ {\mu + \lambda ,}&{{\rm{if}}\ \mu < - \lambda .} \end{array}} \right. $ |
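The operator S(μ, λ) has a standard one-line vectorized implementation, sketched below; it is generic code, not specific to this paper.

```python
import numpy as np

def soft_threshold(u, lam):
    """S(u, lam): shrink u toward zero by lam; exactly zero on [-lam, lam]."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)
```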
For L2 penalization,
$ {\mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\frown$}} \over \beta } _j} = \frac{{\sum\nolimits_{i = 1}^n {l_{{\theta _i}0}^{\prime \prime }} {x_{ij}}({\theta _{i0}} - \frac{{l_{{\theta _{i0}}}^\prime }}{{l_{{\theta _{i0}}}^{\prime \prime }}} - y_{i0}^{ - j})}}{{\sum\nolimits_{i = 1}^n {l_{{\theta _i}0}^{\prime \prime }} x_{ij}^2 + 1}}. $ | (9) |
With β̂ obtained, the intercept estimate is
$ \hat \alpha = \frac{{\sum\nolimits_{i = 1}^n {l_{{\theta _{i0}}}^{\prime \prime }} ({y_i} - {x_i}\mathit{\pmb{\hat \beta }})}}{{\sum\nolimits_{i = 1}^n {l_{{\theta _{i0}}}^{\prime \prime }} }}. $ | (10) |
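Putting the pieces together, one cycle of the coordinate updates (8) and (10) for the L1 penalty can be sketched as follows. It relies on the logit_derivatives and soft_threshold helpers sketched above, stores X as n-by-p for convenience, and is only an illustration of the update formulas, not the authors' MATLAB implementation.

```python
import numpy as np

def cd_pass_l1(alpha, beta, X, y, lam):
    """One cycle of updates (8) and (10), expanding the loss at the current fit.

    beta must be a float array; it is updated in place coordinate by coordinate.
    """
    n, p = X.shape
    theta0 = alpha + X @ beta                       # theta_{i0} at the current estimate
    l1, l2 = logit_derivatives(theta0, y)
    z = theta0 - l1 / l2                            # working response
    for j in range(p):
        r = z - (theta0 - X[:, j] * beta[j])        # z_i - y_{i0}^{-j}
        beta[j] = soft_threshold(np.sum(l2 * X[:, j] * r), lam) / np.sum(l2 * X[:, j] ** 2)
    alpha = np.sum(l2 * (y - X @ beta)) / np.sum(l2)  # intercept update (10)
    return alpha, beta
```

Iterating such passes until a stopping rule such as (11) below is met yields the penalized estimate; the L2 update (9) is obtained by replacing the soft-thresholding step with the corresponding ridge-type formula.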
In practical applications, the stationary point obtained in this way is the desired minimizer, and thanks to the fast convergence of the CD method the convergent regression estimator is easy to compute.
3 Simulation
In this section, simulation studies on artificial data sets with different combinations of n and p are presented. We construct two data configurations, covering both the case n>p and the case n<p.
The first data configuration uses n=200 and p=5, n/2, or n-20, corresponding to low-dimensional data sets. Assume that X follows the multivariate normal distribution X~N(0, ∑), where the covariance matrix ∑=(∑ij)1≤i, j≤p is given by ∑ij=0.5|i-j|, which induces multicollinearity among the predictor variables. Using the coefficient vector γ with α=0.2, β1=0.5, β2=-0.6, β3=1, β4=-0.8, β5=0.1, and βj=0 for j∈{6, …, p}, the response variable Y is generated according to model (1).
The second configuration represents a moderately high-dimensional case: n=100 observations are generated and p is n+10, 2n, or 4n. The predictor variables follow N(0, ∑) with ∑ij=0.1|i-j|. For the regression parameters we take α=0.2, β1=-0.7, β2=1, β3=-0.3, β4=0.5, β5=-0.6, and the other slope coefficients βj=0. Y is again generated according to model (1).
Denote by k the contamination proportion, k=0, 25%, or 50%. For each data set, we consider the following kinds of contamination (a code sketch follows the list).
1) No contamination;
2) Vertical outliers: a proportion k of the responses in model (1) is changed, i.e., some Yi become 1-Yi;
3) Leverage points: same as in 2), but in addition the predictor variables of the k contaminated observations are drawn from N(μ, ∑) instead of N(0, ∑), where μ=(20, …, 20)T.
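For concreteness, the first configuration and the two contamination mechanisms described above can be generated as in the following sketch; the settings mirror the text, while the code itself is only an illustration and not the authors' MATLAB script.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 200, 5, 0.25                                   # sample size, dimension, contamination rate
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)  # rows are observations
alpha, beta = 0.2, np.array([0.5, -0.6, 1.0, -0.8, 0.1])
mu = 1.0 / (1.0 + np.exp(-(alpha + X @ beta)))
Y = rng.binomial(1, mu)                                  # clean responses from model (1)

m = int(k * n)
idx = rng.choice(n, m, replace=False)                    # observations to contaminate
Y_vert = Y.copy()
Y_vert[idx] = 1 - Y_vert[idx]                            # 2) vertical outliers: flip responses
X_lev = X.copy()
X_lev[idx] = rng.multivariate_normal(20 * np.ones(p), Sigma, size=m)  # 3) leverage points
Y_lev = Y_vert.copy()                                    # leverage points also keep the flipped responses
```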
The shrinkage parameter λ is unknown in advance. The larger λ is, the heavier the penalty: more coefficients are compressed to 0, which leads to greater bias; if λ is too small, however, no variable selection is achieved. There are many ways to choose an optimal value, such as the Bayesian information criterion (BIC), the Akaike information criterion (AIC), and cross validation (CV). In this paper we report the mean squared error (MSE) of the estimates and select λ by cross validation.
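As a hedged illustration of the cross-validation step (only a stand-in for the selection procedure actually used with estimator (3)), scikit-learn's cross-validated penalized logistic regression searches a grid of penalty strengths, where the parameter C plays the role of an inverse penalty level:

```python
from sklearn.linear_model import LogisticRegressionCV

# 5-fold cross validation over 10 candidate penalty strengths (C is roughly an inverse lambda)
cv_fit = LogisticRegressionCV(Cs=10, cv=5, penalty="l1", solver="saga", max_iter=5000)
cv_fit.fit(X, Y)            # X, Y as generated in the previous sketch
print(cv_fit.C_)            # selected inverse penalty strength
```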
Several termination rules are widely adopted for iterative algorithms, such as the function-descent criterion, the distance criterion, and the gradient criterion. In the present article we use the distance criterion: for each cycle of the coordinate descent algorithm, the stopping criterion is
$ \mathop {{\rm{max}}}\limits_{j = 1, \cdots ,p + 1} |\hat \gamma _j^{{\rm{ old }}} - \hat \gamma _j^{{\rm{ new }}}| < {10^{ - 3}}. $ | (11) |
The maximum number of iterations is 1 000, and each regression coefficient is required to remain of practical magnitude, e.g., |γ̂j|<10^3 for j=1, …, p+1.
Given a parametric statistical model (sample space and family of sampling distributions), a decision space, and a loss function, it is natural to specify a decision for each point of the sample space. We use the decision rule
$ {Y_i} = 1{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\rm{if}}{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} p({Y_i} = 1|{\mathit{\boldsymbol{X}}_i}) > 0.5. $ | (12) |
Here, 0.5 is chosen as the threshold since it is common practice; in applications different thresholds may be selected. If higher precision for the positive class is required, a larger threshold can be chosen; if higher recall is required, a smaller threshold should be used.
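The decision rule (12) and the misclassification rate MR reported in the tables can be computed as in this short sketch; the 0.5 threshold follows the text and the variable names follow the earlier sketches (X stored n-by-p).

```python
import numpy as np

def misclassification_rate(alpha, beta, X, Y, threshold=0.5):
    """Classify by rule (12) with a given threshold and compare with the observed Y."""
    mu = 1.0 / (1.0 + np.exp(-(alpha + X @ beta)))
    Y_hat = (mu > threshold).astype(int)
    return np.mean(Y_hat != Y)
```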
We contaminated the raw data to different degrees and with different types of outliers. Partial simulation results are reported below; all simulations were performed in MATLAB 2016b.
Fig. 1 shows one of the cross-validation diagrams (cross-validation result for n=200 and p=100).
Tables 1 and 2 present the results for the logistic regression model with n>p and n<p, respectively, averaged over 500 simulation runs. Here MR denotes the misclassification rate.
As shown in Table 1, when there are vertical outliers in the data set, the regression coefficients always stay bounded regardless of the degree of this first kind of contamination. Likewise, Table 2 shows that the regression coefficients remain bounded no matter how many leverage points the data set contains. In general, penalized logistic regression is a very robust approach, echoing the theoretical results of Theorems 1.1 and 1.2.
We then change the link function to the probit link. Table 3 shows the performance of probit regression, averaged over 500 simulation runs.
Similarly, the MSE in Table 3 stays bounded, which means the regression estimator remains bounded. Besides the similarities between logistic and probit regression, earlier studies found that logistic regression coefficients are about 1.8 times as large as probit coefficients on the same data. Although the coefficients differ, the classification results are essentially similar, and the interpretation of the coefficients, the evaluation of the model, and hypothesis testing are all analogous.
Furthermore, we find that the choice of loss function has an almost negligible effect on the prediction results, which may explain why the impact of different loss functions is rarely discussed for classification models.
4 Discussion
This paper shows that the penalized MLE has a high explosive breakdown point in binary logistic regression models. More precisely, we still obtain bounded regression coefficients even if the contamination rate of the observations reaches 50%, a property confirmed by our simulation studies. In both the logistic and the probit model, the regression estimators are robust on contaminated data sets. We also give an upper bound for the implosive breakdown point of the slope parameter.
Theorem 1.3 gives an upper bound for the implosive breakdown point of the slope parameter. For a robust estimator, however, the larger the breakdown point the better, so it is natural to ask whether the implosive breakdown point admits a larger lower bound when the penalty term or the loss function is changed. For example, the sparse LTS method is quite robust in linear models; can we borrow the idea of trimming? As is well known, the SCAD penalty satisfies the mathematical conditions for sparsity, so using this penalty may lead to further gains. We will pay more attention to these questions in future research.
When there is only slight multicollinearity among the explanatory variables, the MLE is consistent and asymptotically normal as n→+∞ under weak assumptions; see Fahrmeir and Kaufmann[31]. As the sample size increases, the standard error of the parameter estimates gradually decreases. After a penalty term is added, these properties may change. Hypothesis testing therefore requires further study, not only of the breakdown point but also of the standard error.
[1] Hampel F R. A general qualitative definition of robustness[J]. The Annals of Mathematical Statistics, 1971, 42: 1887-1896. DOI:10.1214/aoms/1177693235
[2] Donoho D, Huber P J. The notion of breakdown point[C]//Bickel P J, Doksum K A, Hodges J L. A Festschrift for Erich Lehmann. Belmont: Wadsworth, 1983: 157-184.
[3] Yohai V J. High breakdown point and high efficiency robust estimates for regression[J]. The Annals of Statistics, 1987, 15: 642-656. DOI:10.1214/aos/1176350366
[4] Yohai V J, Zamar R H. High breakdown point estimators of regression by means of the minimization of an efficient scale[J]. Journal of the American Statistical Association, 1988, 83: 406-413. DOI:10.1080/01621459.1988.10478611
[5] Rousseeuw P J. Least median of squares regression[J]. Journal of the American Statistical Association, 1984, 79: 871-880. DOI:10.1080/01621459.1984.10477105
[6] Sakata S, White H. An alternative definition of finite-sample breakdown point with application to regression model estimators[J]. Journal of the American Statistical Association, 1995, 90: 1099-1106.
[7] Cox D R, Snell E J. The analysis of binary data[M]. London: Methuen, 1970.
[8] Christmann A. Least median of weighted squares in logistic regression with large strata[J]. Biometrika, 1994, 81: 413-417. DOI:10.1093/biomet/81.2.413
[9] Künsch H R, Stefanski L A, Carroll R J. Conditionally unbiased bounded-influence estimation in general regression models, with applications to generalized linear models[J]. Journal of the American Statistical Association, 1989, 84: 460-466.
[10] Neykov N M, Müller C H. Breakdown point and computation of trimmed likelihood estimators in generalized linear models[C]//Dutter R, Filzmoser P, Gather U, et al. Developments in Robust Statistics. Berlin: Springer-Verlag, 2003: 277-286.
[11] Khan D M, Ihtisham S, Ali A, et al. An efficient and high breakdown estimation procedure for nonlinear regression models[J]. Pakistan Journal of Statistics, 2017, 33: 223-236.
[12] Croux C, Flandre C, Haesbroeck G. The breakdown behavior of the maximum likelihood estimator in the logistic regression model[J]. Statistics & Probability Letters, 2002, 60: 377-386.
[13] Breiman L. Better subset regression using the nonnegative garrote[J]. Technometrics, 1995, 37: 373-384. DOI:10.1080/00401706.1995.10484371
[14] Tibshirani R. Regression shrinkage and selection via the lasso[J]. Journal of the Royal Statistical Society: Series B, 1996, 58: 267-288.
[15] Wang H, Li G, Jiang G. Robust regression shrinkage and consistent variable selection through the LAD-lasso[J]. Journal of Business and Economic Statistics, 2007, 25: 347-355. DOI:10.1198/073500106000000251
[16] Xie H, Huang J. SCAD-penalized regression in high-dimensional partially linear models[J]. The Annals of Statistics, 2009, 37: 673-696. DOI:10.1214/07-AOS580
[17] Alfons A, Croux C, Gelper S. Sparse least trimmed squares regression for analyzing high-dimensional large data sets[J]. The Annals of Applied Statistics, 2013, 7: 226-248. DOI:10.1214/12-AOAS575
[18] Albert A, Anderson J A. On the existence of maximum likelihood estimates in logistic regression models[J]. Biometrika, 1984, 71: 1-10. DOI:10.1093/biomet/71.1.1
[19] Silvapulle M J. On the existence of maximum likelihood estimators for the binomial response models[J]. Journal of the Royal Statistical Society: Series B, 1981, 43: 310-313.
[20] Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties[J]. Journal of the American Statistical Association, 2001, 96: 1348-1360. DOI:10.1198/016214501753382273
[21] Rousseeuw P J, Leroy A M. Robust regression and outlier detection[M]. New York: Wiley-Interscience, 1987.
[22] Zuo Y. Some quantitative relationships between two types of finite-sample breakdown point[J]. Statistics & Probability Letters, 2001, 51: 369-375.
[23] Efron B, Hastie T, Johnstone I, et al. Least angle regression[J]. The Annals of Statistics, 2004, 32: 407-451. DOI:10.1214/009053604000000067
[24] Balakrishnan S, Madigan D. Algorithms for sparse linear classifiers in the massive data setting[J]. Journal of Machine Learning Research, 2008, 9: 313-337.
[25] Wu T T, Lange K. Coordinate descent algorithms for lasso penalized regression[J]. The Annals of Applied Statistics, 2008, 2: 224-244. DOI:10.1214/07-AOAS147
[26] Friedman J, Hastie T, Höfling H, et al. Pathwise coordinate optimization[J]. The Annals of Applied Statistics, 2007, 1: 302-332. DOI:10.1214/07-AOAS131
[27] Zhang C, Zhang Z, Chai Y. Penalized Bregman divergence estimation via coordinate descent[J]. Journal of the Iranian Statistical Society, 2011, 2: 125-140.
[28] Luo Z Q, Tseng P. Error bounds and convergence analysis of feasible descent methods: a general approach[J]. Annals of Operations Research, 1993, 46: 157-178.
[29] Saha A, Tewari A. On the nonasymptotic convergence of cyclic coordinate descent methods[J]. SIAM Journal on Optimization, 2013, 23: 576-601. DOI:10.1137/110840054
[30] Donoho D L, Johnstone I M. Ideal spatial adaptation by wavelet shrinkage[J]. Biometrika, 1994, 81: 425-455. DOI:10.1093/biomet/81.3.425
[31] Fahrmeir L, Kaufmann H. Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models[J]. The Annals of Statistics, 1985, 13: 342-368. DOI:10.1214/aos/1176346597