The breakdown point is one of the basic tools for measuring the robustness of statistical estimation. It was first defined by Hampel[1] and depends on the specific distribution of the observed data. A distribution-free, finite-sample definition of the breakdown point was proposed in Ref.[2]. Intuitively, the breakdown point is the maximum proportion of outliers that a given sample may contain without spoiling the estimator completely. Obviously, a higher breakdown point indicates a more robust estimator. The concept has been widely applied to location, scale, and regression models to describe resistance to outliers.
The study of the breakdown point has made great progress over the past 40 years. In linear regression models, it is well known that the breakdown point of the least squares method is 0, meaning that a single outlier may have an arbitrarily large effect on the estimator. Robust estimators with the high breakdown point of 0.5 have been developed, e.g., the MM-estimators[3], the τ-estimators[4], and the least median of squares (LMS) and least trimmed squares (LTS) estimators[5]. The breakdown properties of some nonlinear regression estimators were also investigated, e.g., in Ref.[6]. For logistic regression models, the breakdown behavior of the maximum likelihood estimator (MLE) is quite different from that in the linear model. Cox and Snell[7] argued that the commonly used MLE in logistic regression models is not robust. Christmann[8] showed that high breakdown point estimators do not exist in logistic regression models unless specific conditions hold, such as p≤n. A similar statement was made in Ref.[9] for the case where some observations of the raw data set are replaced. Neykov and Muller[10] considered the finite-sample breakdown point of trimmed likelihood estimators in generalized linear models. Khan et al.[11] adopted the least trimmed absolute (LTA) estimator for logistic model fitting; LTA is a robust estimator with a high breakdown point and outperforms M-estimators under a certain degree of contamination. Croux et al.[12] proved that the classical MLE in logistic regression models always stays bounded when outliers are added to the raw data set. However, all of the aforementioned methods require low dimensionality, i.e., n>p.
Due to the rapid development of science and technology and the variety of data collection methods, high-dimensional data have become increasingly common and have attracted great attention. Regularization is a powerful tool for high-dimensional problems: it can force the model to be sparse, low-rank, smooth, and so on. Accordingly, many penalized estimators have been established to select useful variables in high-dimensional linear models. Breiman[13] proposed the nonnegative garrote for subset regression. Tibshirani introduced the lasso in Ref.[14], and the LAD-lasso was discussed in Ref.[15]. Xie and Huang[16] applied the smoothly clipped absolute deviation (SCAD) penalty to achieve sparsity in high-dimensional linear models. The breakdown point of penalized regression is therefore becoming more important. Alfons et al.[17] studied several sparse estimators in linear models: the breakdown point of sparse LTS was verified to be (n-h+1)/n, where h is the initial guess of the number of non-contaminated observations, while both the lasso and the LAD-lasso have a breakdown point of 1/n. Various penalized logistic regression estimators have also been proposed as alternatives to the non-robust MLE in logistic models, but the breakdown point of penalized logistic regression under high dimensionality has not been discussed yet. It can neither be derived from the results for penalized linear models nor obtained by generalizing the results for unpenalized logistic regression models.
We aim to compute the breakdown point of the penalized logistic regression estimator when there are outliers in the observations. Traditional methods are often limited to n>p, but collected data sets nowadays are often large and sophisticated, so in this paper we consider not only n>p but also n<p. We add different penalty terms to the original loss function while the model design can be corrupted by outliers, in either the replacement or the addition setting. It is verified that the explosive breakdown point of the penalized MLE in the logistic regression model attains its maximal value. More precisely, we still obtain bounded regression coefficients even if the contamination rate of the observed data reaches 50%; our simulation studies illustrate this property. This is in sharp contrast to classical estimators in linear regression models, which can become arbitrarily large with only one outlier. The upper bound of the implosive breakdown point of the logistic regression estimator is also computed. Moreover, our method is not limited to logistic models but also applies to probit models.
The rest of this article is organized as follows. In Section 1, the replacement and addition explosive breakdown points of the penalized logistic regression estimator are obtained, and the replacement implosive breakdown point of the penalized MLE is also derived. Section 2 discusses a fast convergent iterative algorithm, Section 3 presents simulation studies, and Section 4 concludes.
We introduce some necessary notation. For a vector β=(β1, …, βp)T∈ℝp, we write ‖β‖1 for its L1 norm and ‖β‖2 for its Euclidean norm.
We start with a brief introduction of the logistic regression model. Denote by X=(Xij)1≤i≤p, 1≤j≤n=(X1, …, Xn) the matrix of predictor variables, where n is the number of observations and p the number of variables. Let Y=(Y1, …, Yn)T be the vector of responses, where Yi is either 0 or 1, i=1, …, n.
The logistic regression model is given by
$ {\mu _i}: = p({Y_i} = 1|{\mathit{\boldsymbol{X}}_i}) = {F^{ - 1}}({\theta _i}), $ | (1) |
where F(x)=logit(x):=log{x/(1-x)}, θi=α+XiTβ, α and β=(β1, …, βp)T are the unknown regression parameters. If F(·)=Φ-1(·), where Φ(·) is the cumulative distribution function (CDF) of the standard normal distribution, then model (1) becomes the probit model.
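To make the two links concrete, the following minimal Python sketch evaluates μi for both models; the function names are chosen here only for illustration and are not part of the paper.

```python
import numpy as np
from scipy.stats import norm

def inverse_logit(theta):
    """Logistic model: mu_i = F^{-1}(theta_i) with F = logit."""
    return 1.0 / (1.0 + np.exp(-theta))

def inverse_probit(theta):
    """Probit model: mu_i = Phi(theta_i), the standard normal CDF."""
    return norm.cdf(theta)
```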
M-estimation of the parameters is often carried out by minimizing a certain loss function; however, the quadratic loss is not suitable for classification problems. Let γ=(γ1, …, γp+1)T=(α; β). Given the observed sample Z=(Z1, …, Zn), where Zi=(Xi; Yi), i=1, …, n, common choices are the following.
Quadratic loss: l(γ, Zi)=(Yi-θi)2/2;
Deviance loss:
$ l(\mathit{\boldsymbol{\gamma }},{\mathit{\boldsymbol{Z}}_i}) = - 2\{ {Y_i}{\rm{log}}({\mu _i}) + (1 - {Y_i}){\rm{log}}(1 - {\mu _i})\} ; $ |
Exponential loss: l(γ, Zi)=exp{-(Yi-0.5)θi}.
In this paper, we use the Bernoulli deviance loss, which is standard in exponential dispersion models and generalized linear models:
$ L(\boldsymbol{\gamma}, \boldsymbol{Z}) = -\frac{2}{n}\sum\limits_{i = 1}^{n}\{Y_i\log(\mu_i) + (1 - Y_i)\log(1 - \mu_i)\}. $ | (2) |
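As a small, hedged illustration of how (2) is evaluated in practice, the Python sketch below computes the average Bernoulli deviance for the logit link; the clipping constant eps is an added numerical safeguard, not part of the formula, and X is stored p-by-n as in the notation above.

```python
import numpy as np

def deviance_loss(alpha, beta, X, Y, eps=1e-12):
    """Average Bernoulli deviance (2); X is p-by-n, Y is a 0/1 vector of length n."""
    theta = alpha + X.T @ beta               # theta_i = alpha + X_i^T beta
    mu = 1.0 / (1.0 + np.exp(-theta))        # logistic inverse link
    mu = np.clip(mu, eps, 1.0 - eps)         # guard against log(0)
    return -2.0 / len(Y) * np.sum(Y * np.log(mu) + (1.0 - Y) * np.log(1.0 - mu))
```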
Here, the regression estimator obtained by minimizing (2) is the MLE. For the binary logistic regression model, there are three possible data patterns: complete separation, quasi-complete separation, and overlap. Albert and Anderson[18] verified that the MLE of the logistic regression model does not exist when the data are completely or quasi-completely separated, while in the overlap situation the MLE exists and is unique. A similar result was discussed in Ref.[19]. Thus, we concentrate on the overlap situation here. On the other hand, regularization methods generally consist of a loss function and a penalty term and are commonly used to obtain well-behaved solutions to over-parameterized estimation problems. Fan and Li[20] noted that a good penalty function results in an estimator with unbiasedness, sparsity, and continuity, and asserted that bounded penalty functions can reduce bias and yield continuous solutions. Therefore, we add a penalty term Pλ(β) to the loss function. The regression estimator is then defined as
$ \mathit{\pmb{\hat \gamma }} = \mathop {{\rm{argmin}}}\limits_\mathit{\boldsymbol{\gamma }} \{ L(\mathit{\boldsymbol{\gamma }},\mathit{\boldsymbol{Z}}) + {P_\lambda }(\mathit{\boldsymbol{\beta }})\} : = \mathop {{\rm{ argmin }}}\limits_\gamma Q(\mathit{\boldsymbol{\gamma }},\mathit{\boldsymbol{Z}}), $ | (3) |
where Q(γ, Z) is the objective function. Since the intercept term has no corresponding feature and mainly fits the overall shift of the data, it is not penalized.
The components of Pλ(β) are additive, i.e., Pλ(β)=Pλ(|β1|)+…+Pλ(|βp|). We consider the following three penalty functions (a code sketch of all three follows their definitions).
L1 penalty function,
$ {P_\lambda }(|{\beta _j}|) = \lambda |{\beta _j}|. $ | (4) |
L2 penalty function,
$ {P_\lambda }(|{\beta _j}|) = \lambda \beta _j^2/2. $ | (5) |
SCAD penalty function,
$ P_\lambda(|\beta_j|) = \left\{ {\begin{array}{*{20}{l}} {\lambda|\beta_j|,}&{|\beta_j| \le \lambda,}\\ {\dfrac{-(|\beta_j|^2 - 2a\lambda|\beta_j| + \lambda^2)}{2(a - 1)},}&{\lambda < |\beta_j| \le a\lambda,}\\ {(a + 1)\lambda^2/2,}&{|\beta_j| > a\lambda,} \end{array}} \right. $ | (6) |
where a>2 and λ>0 are tuning parameters.
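For reference, the three penalties (4)-(6) translate directly into code; the sketch below is a plain transcription of the formulas, with lam standing for λ and the default a=3.7 chosen only as a common illustrative value.

```python
import numpy as np

def penalty_l1(beta, lam):
    """L1 penalty (4), summed over the components of beta."""
    return lam * np.sum(np.abs(beta))

def penalty_l2(beta, lam):
    """L2 penalty (5), summed over the components of beta."""
    return 0.5 * lam * np.sum(beta ** 2)

def penalty_scad(beta, lam, a=3.7):
    """SCAD penalty (6), applied componentwise and summed."""
    b = np.abs(beta)
    small = lam * b
    middle = -(b ** 2 - 2 * a * lam * b + lam ** 2) / (2 * (a - 1))
    large = (a + 1) * lam ** 2 / 2
    return np.sum(np.where(b <= lam, small, np.where(b <= a * lam, middle, large)))
```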
To study the robustness of an estimator, Donoho and Huber[2] introduced two finite-sample versions of the breakdown point: the replacement breakdown point and the addition breakdown point. Roughly speaking, the value is the smallest proportion of contamination that can drive the estimator's Euclidean norm to infinity or to zero, i.e., make the estimator completely unreliable. Rousseeuw and Leroy[21] argued that the breakdown point cannot exceed 50%: intuitively, if more than half of the data are contaminated, we cannot distinguish the original distribution from the contaminating distribution.
1.1 Explosive breakdown point
When the Euclidean norm of the estimator can be driven to infinity by contamination, the estimator is said to break down by explosion. The replacement and addition explosive breakdown points are defined as
$ {\varepsilon _\infty }(\mathit{\pmb{\hat \gamma }},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}) = {\rm{min}}\left\{ {\frac{m}{n}:\mathop {{\rm{sup}}}\limits_{{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}} {{\left\| {\mathit{\pmb{\hat \gamma }}({{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}})} \right\|}_2} = \infty } \right\}, $ |
$ {\varepsilon _\infty }(\mathit{\pmb{\hat \gamma }},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{add}}}}) = {\rm{min}}\left\{ {\frac{m}{{n + m}}:\mathop {{\rm{sup}}}\limits_{{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{add}}}}} {{\left\| {\mathit{\pmb{\hat \gamma }}({{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{add}}}})} \right\|}_2} = \infty } \right\}, $ |
where Z̃rep denotes a contaminated sample obtained by replacing m of the n original observations with arbitrary outliers, and Z̃add denotes a sample obtained by adding m arbitrary outliers to the original data.
For penalized MLE in logistic models, we have the following theorem.
Theorem 1.1 For model (1), considering (3), the penalized MLE satisfies
$ {\varepsilon _\infty }(\mathit{\pmb{\hat \gamma }},{\mathit{\boldsymbol{\tilde Z}}^{{\rm{rep}}}}) = 0.5. $ |
Proof The L1 norm of a vector β is denoted by ‖β‖1 and the Euclidean norm by ‖β‖2. Since the two norms are topologically equivalent, there exists a constant c with 0<c≤1 such that ‖β‖1≥c‖β‖2 (in fact c=1 suffices). Replace m of the observations by m outliers; then
$ Q(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}^{\rm rep}) = -\frac{2}{n}\sum\limits_{i = 1}^{n}\{\tilde{Y}_i\log(\mu_i) + (1 - \tilde{Y}_i)\log(1 - \mu_i)\} + P_\lambda(\boldsymbol{\beta}). $ |
Consider γ=0, i.e., γi=0 for i=1, …, p+1; then μi=0.5 for i=1, …, n, and the objective function becomes
$ Q(\mathit{\pmb{0}},{\mathit{\boldsymbol{\tilde Z}}^{{\rm{rep}}}}) = 2{\rm{log}}{\kern 1pt} {\kern 1pt} 2. $ |
For the L1 penalty (4), consider any β with ‖β‖2≥M:=(2log 2+1)/(cλ); then
$ Q(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}^{\rm rep}) > P_\lambda(\boldsymbol{\beta}) = \lambda\|\boldsymbol{\beta}\|_1 \ge c\lambda\|\boldsymbol{\beta}\|_2 \ge 2\log 2 + 1 > Q(\pmb{0}, \tilde{\boldsymbol{Z}}^{\rm rep}). $ |
For the L2 penalty (5), consider any β with ‖β‖2≥M:=2(2log 2+1)/λ; then
$ Q(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}^{\rm rep}) > P_\lambda(\boldsymbol{\beta}) = \frac{\lambda}{2}\|\boldsymbol{\beta}\|_2^2 \ge 2\log 2 + 1 > Q(\pmb{0}, \tilde{\boldsymbol{Z}}^{\rm rep}). $ |
For the SCAD penalty (6), Pλ(|βj|) is a constant for sufficiently large |βj|. Consider any β with ‖β‖2≥M:=aλ√p, so that its largest component satisfies |βj|≥‖β‖2/√p≥aλ; then
$ Q(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}^{\rm rep}) > P_\lambda(\boldsymbol{\beta}) \ge \frac{(a + 1)\lambda^2}{2} > 2\log 2 + 1 > Q(\pmb{0}, \tilde{\boldsymbol{Z}}^{\rm rep}). $ |
Thus, for all γ with ‖γ‖2≥‖β‖2≥M>0, the objective function at γ exceeds its value at γ=0, so the minimizer of (3) stays bounded no matter how many observations are replaced or how the outliers are chosen. Since the breakdown point cannot exceed 0.5[21], the replacement explosive breakdown point of the penalized MLE equals 0.5.
Note that the constant M depends only on p and the tuning parameters, not on n or on the contaminated observations, so Theorem 1.1 holds both when n>p and when n<p. The deviance loss was used in the proof of Theorem 1.1; if it were replaced by the exponential loss, the argument would be similar and the conclusion the same.
As mentioned above, there are two types of finite-sample breakdown point. Zuo[22] gave quantitative relationships between the replacement and addition breakdown points for a large class of estimators whose breakdown points do not depend on the configuration of X. One might think that adding outliers has a different effect from replacing some of the raw observations with outliers. However, it can be seen from the proof of Theorem 1.1 that it does not matter whether arbitrary observations are replaced or added. If m outliers are added to the raw data set, we obtain the following theorem.
Theorem 1.2 For model (1), considering (3), the penalized MLE satisfies
$ {\varepsilon _\infty }(\mathit{\pmb{\hat \gamma }},{\mathit{\boldsymbol{\tilde Z}}^{{\rm{add}}}}) = 0.5. $ |
The proof is similar to that of Theorem 1.1.
Theorems 1.1 and 1.2 show that the penalized MLE is very robust in binary logistic regression models, and the results also carry over to multi-class models. For logistic regression with binary data, we have thus verified that the explosive breakdown point of the penalized MLE attains the maximal value of 0.5.
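A quick numerical check of this robustness, offered only as a hedged illustration, is to flip an increasing fraction of the responses and verify that the penalized coefficients stay bounded; the sketch below uses scikit-learn's L2-penalized logistic regression as a stand-in for estimator (3) rather than the algorithm developed later in this paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))                       # rows are observations here
beta_true = np.array([0.5, -0.6, 1.0, -0.8, 0.1])
mu = 1.0 / (1.0 + np.exp(-(0.2 + X @ beta_true)))
Y = rng.binomial(1, mu)

for frac in (0.0, 0.25, 0.5):                     # contamination fraction
    Yc = Y.copy()
    m = int(frac * n)
    Yc[:m] = 1 - Yc[:m]                           # replace m responses by outliers
    fit = LogisticRegression(penalty="l2", C=1.0).fit(X, Yc)
    print(frac, np.linalg.norm(fit.coef_))        # the coefficient norm stays bounded
```

In line with Theorems 1.1 and 1.2, the printed norms are expected to remain moderate even at 50% contamination.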
1.2 Implosive breakdown point
Another way for an estimator to become meaningless is that ‖β̂‖2 is driven to zero, so that the fitted model no longer reflects any relationship between the predictors and the response. The replacement implosive breakdown point is defined as
$ {\varepsilon _0}(\mathit{\pmb{\hat \beta }},{\mathit{\boldsymbol{\tilde Z}}^{{\rm{rep}}}}) = {\rm{min}}\left\{ {\frac{m}{n}:\mathop {{\rm{inf}}}\limits_{{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}} {{\left\| {\mathit{\pmb{\hat \beta }}({{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}})} \right\|}_2} = 0} \right\}. $ | (7) |
The following theorem shows that ‖β̂‖2 can be driven to zero when 2p observations of the raw sample are replaced by outliers.
Theorem 1.3 In model (1), considering (3), the penalized MLE satisfies
$ {\varepsilon _0}(\mathit{\pmb{\hat \beta }},{\mathit{\boldsymbol{\tilde Z}}^{{\rm{rep}}}}) < \frac{{2p}}{n}. $ |
Proof Let δ>0, M=max{‖X1‖2, …, ‖Xn‖2}, τ=log(2^n-1), N=τ/δ, and C=√p(M+2N). Replace 2p observations of the raw sample by the outliers with predictor Cek and responses 0 and 1, respectively, for k=1, …, p, where ek denotes the kth standard basis vector. We show that the objective at any γ with ‖β‖2>δ exceeds its value at γ=0.
First, suppose that |α+CekTβ|>N‖β‖2 for some k; then |α+CekTβ|>N‖β‖2>Nδ.
Case one: α+CekTβ>Nδ=τ. Considering the replaced observation with predictor Cek and response 0, we have
$ Q(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}^{\rm rep}) > \frac{1}{n}l(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}_{k0}^{\rm rep}) = \frac{2}{n}\log(1 + {\rm e}^{\alpha + C\boldsymbol{e}_k^{\rm T}\boldsymbol{\beta}}) > \frac{2}{n}\log(1 + {\rm e}^{\tau}) = Q(\pmb{0}, \tilde{\boldsymbol{Z}}^{\rm rep}). $ |
Case two: α+CekTβ<-Nδ=-τ. Considering the replaced observation with predictor Cek and response 1, we have
$ Q(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}^{\rm rep}) > \frac{1}{n}l(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}_{k1}^{\rm rep}) = \frac{2}{n}\log(1 + {\rm e}^{-\alpha - C\boldsymbol{e}_k^{\rm T}\boldsymbol{\beta}}) > \frac{2}{n}\log(1 + {\rm e}^{\tau}) = Q(\pmb{0}, \tilde{\boldsymbol{Z}}^{\rm rep}). $ |
Second, suppose that |α+CekTβ|≤N‖β‖2 for every k. Denote by jm the index such that |βjm|=max{|β1|, …, |βp|}. Since ‖β‖2>δ, we have |βjm|≥‖β‖2/√p.
Case one: βjm>0. By the case assumption, α+CejmTβ=α+Cβjm≤N‖β‖2, so α≤N‖β‖2-Cβjm≤N‖β‖2-C‖β‖2/√p≤-(M+N)‖β‖2 by the choice of C. Hence, for any observation with predictor Xi,
$ \alpha + \boldsymbol{X}_i^{\rm T}\boldsymbol{\beta} \le \alpha + \|\boldsymbol{X}_i\|_2\|\boldsymbol{\beta}\|_2 \le -(M + N)\|\boldsymbol{\beta}\|_2 + M\|\boldsymbol{\beta}\|_2 = -N\|\boldsymbol{\beta}\|_2 < -N\delta = -\tau. $ |
Thus, for an observation with response Yi=1,
$ Q(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}^{\rm rep}) > \frac{1}{n}l(\boldsymbol{\gamma}, \boldsymbol{Z}_{i1}) = \frac{2}{n}\log(1 + {\rm e}^{-\alpha - \boldsymbol{X}_i^{\rm T}\boldsymbol{\beta}}) > \frac{2}{n}\log(1 + {\rm e}^{\tau}) = Q(\pmb{0}, \tilde{\boldsymbol{Z}}^{\rm rep}). $ |
Case two: βjm<0. By the case assumption, α+CejmTβ=α+Cβjm≥-N‖β‖2, so α≥-N‖β‖2-Cβjm≥-N‖β‖2+C‖β‖2/√p≥(M+N)‖β‖2. Hence, for any observation with predictor Xi,
$ \alpha + \boldsymbol{X}_i^{\rm T}\boldsymbol{\beta} > \alpha - \|\boldsymbol{X}_i\|_2\|\boldsymbol{\beta}\|_2 > (M + N)\|\boldsymbol{\beta}\|_2 - M\|\boldsymbol{\beta}\|_2 = N\|\boldsymbol{\beta}\|_2 > N\delta = \tau. $ |
Thus, for an observation with response Yi=0,
$ Q(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}^{\rm rep}) > \frac{1}{n}l(\boldsymbol{\gamma}, \tilde{\boldsymbol{Z}}_{i0}) = \frac{2}{n}\log(1 + {\rm e}^{\alpha + \boldsymbol{X}_i^{\rm T}\boldsymbol{\beta}}) > \frac{2}{n}\log(1 + {\rm e}^{\tau}) = Q(\pmb{0}, \tilde{\boldsymbol{Z}}^{\rm rep}). $ |
In conclusion, for all γ with ‖γ‖2≥‖β‖2>δ, the objective function at γ exceeds its value at γ=0. Hence the minimizer of (3) satisfies ‖β̂‖2≤δ; since δ>0 is arbitrary, replacing these 2p observations drives ‖β̂‖2 to zero, which proves the theorem.
Theorem 1.3 implies that the number of contaminated observations needed to drive the slope estimator to zero does not exceed twice the number of explanatory variables, which shows the non-robustness of the estimator in this direction. Moreover, the standard error of a regression estimator does not explode when the Euclidean norm of the estimator tends to zero, so this kind of breakdown is more difficult to detect. Croux et al.[12] proved that the unpenalized MLE of the logistic regression model breaks down to zero when several outliers are added to the data set; here we treat the penalized MLE in the replacement setting, and our result is comparable.
2 Algorithm
In this section, we concentrate on the computation of (3). Approaches applicable to linear models may be computationally infeasible for nonlinear regression. In our model the objective function is nonlinear and the normal equations are transcendental, so the problem can only be solved by numerical rather than algebraic methods.
There are several fast algorithms for the numerical computation. Efron et al.[23] introduced least angle regression. Balakrishnan and Madigan[24] presented an online algorithm and the MP algorithm for learning L1-penalized logistic regression models; Wu and Lange[25] used the coordinate descent (CD) algorithm to solve lasso regressions. Friedman et al.[26] derived a generalized CD algorithm for convex optimization problems. CD was also employed in Ref.[27], where it was shown to have computational advantages. The block CD algorithm was proved to have a linear convergence rate in Ref.[28], and Saha and Tewari[29] established a nonasymptotic O(1/k) convergence rate for cyclic CD, where k is the iteration counter. Since this algorithm has many excellent properties, we adopt it in this paper.
An approximate solution can be calculated by a convergent iterative algorithm. We transform the deviance loss into an approximate squared loss through a Taylor expansion, so that the regression coefficients can be computed more conveniently. Note that for multi-class problems the Taylor expansion of the objective function becomes more complex; in such situations one may consider other numerical methods, such as Newton's method or gradient methods.
An initial estimator is necessary for the Taylor expansion: the closer the initial value is to the true parameter, the faster the iteration converges and the more accurate the result. After a large number of experiments we found that γ0=0 is a good choice. For a given initial estimator γ0=(α0; β0), let θi0=α0+XiTβ0, i=1, …, n. We expand the loss function at θi0,
$ \begin{array}{l} L(\boldsymbol{\gamma}, \boldsymbol{Z}) = \frac{1}{n}\sum\limits_{i = 1}^{n} l(\boldsymbol{\gamma}, \boldsymbol{Z}_i) \\ \approx \frac{1}{n}\sum\limits_{i = 1}^{n}\left\{ l(\boldsymbol{\gamma}_0, \boldsymbol{Z}_i) + \left.\frac{\partial l(\boldsymbol{\gamma}, \boldsymbol{Z}_i)}{\partial\theta_i}\right|_{\theta_{i0}}(\theta_i - \theta_{i0}) + \frac{1}{2}\left.\frac{\partial^2 l(\boldsymbol{\gamma}, \boldsymbol{Z}_i)}{\partial\theta_i^2}\right|_{\theta_{i0}}(\theta_i - \theta_{i0})^2 \right\} \\ \approx \frac{1}{2n}\sum\limits_{i = 1}^{n} l''_{\theta_{i0}}\left\{\theta_{i0} - \frac{l'_{\theta_{i0}}}{l''_{\theta_{i0}}} - \theta_i\right\}^2, \end{array} $ |
where l′θi0 and l″θi0 denote the first and second derivatives of l(γ, Zi) with respect to θi evaluated at θi0, and terms not depending on θi are dropped.
If the loss function is the deviance loss, then l(γ, Zi)=-2{yilogμi+(1-yi)log(1-μi)}. For the logistic and probit models the distribution functions are known, so their first- and second-order derivatives are easy to obtain.
For logit model, we have
$ {l_{{\theta _i}}^\prime = \frac{{\partial l(\mathit{\boldsymbol{\gamma }},{\mathit{\boldsymbol{Z}}_i})}}{{\partial {\theta _i}}} = 2({\mu _i} - {y_i}),} $ |
$ {l_{{\theta _i}}^{\prime \prime } = \frac{{{\partial ^2}l(\mathit{\boldsymbol{\gamma }},{\mathit{\boldsymbol{Z}}_i})}}{{\partial \theta _i^2}} = 2{\mu _i}(1 - {\mu _i}).} $ |
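In code, these two derivatives and the working response θi0-l′/l″ used in the quadratic approximation take the following form; this is a sketch for the logit link only, with variable names chosen here.

```python
import numpy as np

def logit_derivatives(theta0, y):
    """First and second derivatives of the deviance loss at theta0 (logit link)."""
    mu = 1.0 / (1.0 + np.exp(-theta0))
    l1 = 2.0 * (mu - y)             # l'_{theta_{i0}}
    l2 = 2.0 * mu * (1.0 - mu)      # l''_{theta_{i0}}
    return l1, l2

def working_response(theta0, y):
    """z_i = theta_{i0} - l'/l'', the target of the quadratic approximation."""
    l1, l2 = logit_derivatives(theta0, y)
    return theta0 - l1 / l2
```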
And for probit model, we have
$ l_{{\theta _i}}^\prime = 2\frac{{{\mu _i}\mu _i^\prime - {y_i}\mu _i^\prime }}{{{\mu _i}(1 - {\mu _i})}}, $ |
$ l''_{\theta_i} = 2\frac{(\mu'_i\mu'_i + \mu_i\mu''_i - y_i\mu''_i)(\mu_i - \mu_i^2)}{\mu_i^2(1 - \mu_i)^2} - 2\frac{(\mu_i\mu'_i - y_i\mu'_i)(\mu'_i - 2\mu_i\mu'_i)}{\mu_i^2(1 - \mu_i)^2}, $ |
where μ′i and μ″i denote the first and second derivatives of μi=Φ(θi) with respect to θi, namely μ′i=ϕ(θi) and μ″i=-θiϕ(θi), with ϕ(·) the standard normal density.
Otherwise, if the loss function is the exponential loss l(γ, Zi)=exp{-(yi-0.5)θi}, it does not involve μi, so regardless of the link function we have
$ {l_{{\theta _i}}^\prime =  - ({y_i} - 0.5){{\rm{e}}^{ - ({y_i} - 0.5){\theta _i}}},} $ |
$ {l_{{\theta _i}}^{\prime \prime } = \frac{1}{4}{{\rm{e}}^{ - ({y_i} - 0.5){\theta _i}}}.} $ |
To locate the minimizer, we set the first-order partial derivatives of the penalized objective function with respect to all parameters to zero and solve the resulting system of equations.
$ \left\{ {\begin{array}{*{20}{l}} {\partial Q(\mathit{\boldsymbol{\gamma }},\mathit{\boldsymbol{Z}})/\partial \alpha = 0,}\\ {\partial Q(\mathit{\boldsymbol{\gamma }},\mathit{\boldsymbol{Z}})/\partial {\beta _j} = 0,\quad j = 1, \cdots ,p.} \end{array}} \right. $ |
After the Taylor expansion, the loss part of the objective function consists of squared terms, so its derivatives are easy to obtain. For the penalty term, if the first derivative cannot be taken directly, as with the L1 penalty, we apply the soft-thresholding method first introduced by Donoho and Johnstone[30]. We then apply the CD algorithm; at the kth iteration,
$ \beta _1^{(k)} \in \mathop{{\rm{arg\,min}}}\limits_{\beta_1} Q\{ ({\alpha ^{(k - 1)}},\beta _1,\beta _2^{(k - 1)}, \cdots ,\beta _p^{(k - 1)}),\mathit{\boldsymbol{Z}}\} , $ |
$ \beta _2^{(k)} \in \mathop{{\rm{arg\,min}}}\limits_{\beta_2} Q\{ ({\alpha ^{(k - 1)}},\beta _1^{(k)},\beta _2,\beta _3^{(k - 1)}, \cdots ,\beta _p^{(k - 1)}),\mathit{\boldsymbol{Z}}\} , $ |
$ \begin{array}{l} \cdots \\ \beta _p^{(k)} \in \mathop{{\rm{arg\,min}}}\limits_{\beta_p} Q\{ ({\alpha ^{(k - 1)}},\beta _1^{(k)}, \cdots ,\beta _{p - 1}^{(k)},\beta _p),\mathit{\boldsymbol{Z}}\} , \end{array} $ |
$ {\alpha ^{(k)}} \in \mathop{{\rm{arg\,min}}}\limits_{\alpha} Q\{ (\alpha ,\beta _1^{(k)},\beta _2^{(k)}, \cdots ,\beta _p^{(k)}),\mathit{\boldsymbol{Z}}\} . $ |
For the given initial value γ0, each βj, j=1, …, p, has the following explicit solution.
For L1 penalization,
$ {\mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\frown$}} \over \beta } _j} = \frac{{S\{ \sum\nolimits_{i = 1}^n {l_{{\theta _i}0}^{\prime \prime }} {x_{ij}}({\theta _{i0}} - \frac{{l_{{\theta _{i0}}}^\prime }}{{l_{{\theta _{i0}}}^{\prime \prime }}} - y_{i0}^{ - j}),\lambda \} }}{{\sum\nolimits_{i = 1}^n {l_{{\theta _i}0}^{\prime \prime }} x_{ij}^2}}. $ | (8) |
where yi0-j=θi0-xijβ0j and S(·, ·) is the soft-thresholding operator
$ S(\mu ,\lambda ) = \left\{ {\begin{array}{*{20}{l}} {\mu - \lambda ,}&{{\rm{if}}\ \mu > \lambda ,}\\ {0,}&{{\rm{if}}\ |\mu | \le \lambda ,}\\ {\mu + \lambda ,}&{{\rm{if}}\ \mu < - \lambda .} \end{array}} \right. $ |
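The operator S(μ, λ) has a standard one-line vectorized implementation, sketched below; it is generic code, not specific to this paper.

```python
import numpy as np

def soft_threshold(u, lam):
    """S(u, lam): shrink u toward zero by lam; exactly zero on [-lam, lam]."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)
```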
For L2 penalization,
$ {\mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\frown$}} \over \beta } _j} = \frac{{\sum\nolimits_{i = 1}^n {l_{{\theta _i}0}^{\prime \prime }} {x_{ij}}({\theta _{i0}} - \frac{{l_{{\theta _{i0}}}^\prime }}{{l_{{\theta _{i0}}}^{\prime \prime }}} - y_{i0}^{ - j})}}{{\sum\nolimits_{i = 1}^n {l_{{\theta _i}0}^{\prime \prime }} x_{ij}^2 + 1}}. $ | (9) |
With β̂ obtained, the intercept estimate is
$ \hat \alpha = \frac{{\sum\nolimits_{i = 1}^n {l_{{\theta _{i0}}}^{\prime \prime }} ({y_i} - {x_i}\mathit{\pmb{\hat \beta }})}}{{\sum\nolimits_{i = 1}^n {l_{{\theta _{i0}}}^{\prime \prime }} }}. $ | (10) |
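Putting the pieces together, one cycle of the coordinate updates (8) and (10) for the L1 penalty can be sketched as follows. It relies on the logit_derivatives and soft_threshold helpers sketched above, stores X as n-by-p for convenience, and is only an illustration of the update formulas, not the authors' MATLAB implementation.

```python
import numpy as np

def cd_pass_l1(alpha, beta, X, y, lam):
    """One cycle of updates (8) and (10), expanding the loss at the current fit.

    beta must be a float array; it is updated in place coordinate by coordinate.
    """
    n, p = X.shape
    theta0 = alpha + X @ beta                       # theta_{i0} at the current estimate
    l1, l2 = logit_derivatives(theta0, y)
    z = theta0 - l1 / l2                            # working response
    for j in range(p):
        r = z - (theta0 - X[:, j] * beta[j])        # z_i - y_{i0}^{-j}
        beta[j] = soft_threshold(np.sum(l2 * X[:, j] * r), lam) / np.sum(l2 * X[:, j] ** 2)
    alpha = np.sum(l2 * (y - X @ beta)) / np.sum(l2)  # intercept update (10)
    return alpha, beta
```

Iterating such passes until a stopping rule such as (11) below is met yields the penalized estimate; the L2 update (9) is obtained by replacing the soft-thresholding step with the corresponding ridge-type formula.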
In practical applications, the stationary point obtained in this way is the desired minimizer, and thanks to the fast convergence of the CD method the convergent regression estimator is easy to compute.
3 Simulation
In this section, simulation studies on artificial data sets with different combinations of n and p are presented. We construct two data configurations, covering both the case n>p and the case n<p.
The first data configuration uses n=200 and p=5, n/2, or n-20, corresponding to low-dimensional data sets. Assume that X follows the multivariate normal distribution X~N(0, ∑), where the covariance matrix ∑=(∑ij)1≤i, j≤p is given by ∑ij=0.5|i-j|, which induces multicollinearity among the predictor variables. Using the coefficient vector γ with α=0.2, β1=0.5, β2=-0.6, β3=1, β4=-0.8, β5=0.1, and βj=0 for j∈{6, …, p}, the response variable Y is generated according to model (1).
The second configuration represents a moderately high-dimensional case: n=100 observations are generated and p is n+10, 2n, or 4n. The predictor variables follow N(0, ∑) with ∑ij=0.1|i-j|. For the regression parameters we take α=0.2, β1=-0.7, β2=1, β3=-0.3, β4=0.5, β5=-0.6, and the other slope coefficients βj=0. Y is again generated according to model (1).
Denote by k the contamination proportion, k=0, 25%, or 50%. For each data set, we consider the following kinds of contamination (a code sketch follows the list).
1) No contamination;
2) Vertical outliers: a proportion k of the responses in model (1) is changed, i.e., some Yi become 1-Yi;
3) Leverage points: same as in 2), but in addition the predictor variables of the k contaminated observations are drawn from N(μ, ∑) instead of N(0, ∑), where μ=(20, …, 20)T.
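For concreteness, the first configuration and the two contamination mechanisms described above can be generated as in the following sketch; the settings mirror the text, while the code itself is only an illustration and not the authors' MATLAB script.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 200, 5, 0.25                                   # sample size, dimension, contamination rate
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)  # rows are observations
alpha, beta = 0.2, np.array([0.5, -0.6, 1.0, -0.8, 0.1])
mu = 1.0 / (1.0 + np.exp(-(alpha + X @ beta)))
Y = rng.binomial(1, mu)                                  # clean responses from model (1)

m = int(k * n)
idx = rng.choice(n, m, replace=False)                    # observations to contaminate
Y_vert = Y.copy()
Y_vert[idx] = 1 - Y_vert[idx]                            # 2) vertical outliers: flip responses
X_lev = X.copy()
X_lev[idx] = rng.multivariate_normal(20 * np.ones(p), Sigma, size=m)  # 3) leverage points
Y_lev = Y_vert.copy()                                    # leverage points also keep the flipped responses
```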
The shrinkage parameter λ is unknown in advance. The larger λ is, the heavier the penalty: more coefficients are compressed to 0, which leads to greater bias; if λ is too small, however, no variable selection is achieved. There are many ways to choose an optimal value, such as the Bayesian information criterion (BIC), the Akaike information criterion (AIC), and cross validation (CV). In this paper we report the mean squared error (MSE) of the estimates and select λ by cross validation.
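As a hedged illustration of the cross-validation step (only a stand-in for the selection procedure actually used with estimator (3)), scikit-learn's cross-validated penalized logistic regression searches a grid of penalty strengths, where the parameter C plays the role of an inverse penalty level:

```python
from sklearn.linear_model import LogisticRegressionCV

# 5-fold cross validation over 10 candidate penalty strengths (C is roughly an inverse lambda)
cv_fit = LogisticRegressionCV(Cs=10, cv=5, penalty="l1", solver="saga", max_iter=5000)
cv_fit.fit(X, Y)            # X, Y as generated in the previous sketch
print(cv_fit.C_)            # selected inverse penalty strength
```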
Several termination rules are widely adopted for iterative algorithms, such as the function-descent criterion, the distance criterion, and the gradient criterion. In the present article we use the distance criterion: for each cycle of the coordinate descent algorithm, the stopping criterion is
$ \mathop {{\rm{max}}}\limits_{j = 1, \cdots ,p + 1} |\hat \gamma _j^{{\rm{ old }}} - \hat \gamma _j^{{\rm{ new }}}| < {10^{ - 3}}. $ | (11) |
The maximum number of iterations is 1 000, and each regression coefficient is required to remain of practical magnitude, e.g., |γ̂j|<10^3 for j=1, …, p+1.
Given a parametric statistical model (sample space and family of sampling distributions), a decision space, and a loss function, it is natural to specify a decision for each point of the sample space. We use the decision rule
$ {Y_i} = 1{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\rm{if}}{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} p({Y_i} = 1|{\mathit{\boldsymbol{X}}_i}) > 0.5. $ | (12) |
Here, 0.5 is chosen as the threshold since it is common practice; in applications different thresholds may be selected. If higher precision for the positive class is required, a larger threshold can be chosen; if higher recall is required, a smaller threshold should be used.
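The decision rule (12) and the misclassification rate MR reported in the tables can be computed as in this short sketch; the 0.5 threshold follows the text and the variable names follow the earlier sketches (X stored n-by-p).

```python
import numpy as np

def misclassification_rate(alpha, beta, X, Y, threshold=0.5):
    """Classify by rule (12) with a given threshold and compare with the observed Y."""
    mu = 1.0 / (1.0 + np.exp(-(alpha + X @ beta)))
    Y_hat = (mu > threshold).astype(int)
    return np.mean(Y_hat != Y)
```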
We contaminated the raw data to different degrees and with different types of outliers. Partial simulation results are reported below; all simulations were performed in MATLAB 2016b.
Fig. 1 shows one of the cross-validation diagrams (cross-validation result for n=200 and p=100).
Tables 1 and 2 present the results for the logistic regression model with n>p and n<p, respectively, averaged over 500 simulation runs. Here MR denotes the misclassification rate.
As shown in Table 1, when there are vertical outliers in the data set, the regression coefficients always stay bounded regardless of the degree of this first kind of contamination. Likewise, Table 2 shows that the regression coefficients remain bounded no matter how many leverage points the data set contains. In general, penalized logistic regression is a very robust approach, echoing the theoretical results of Theorems 1.1 and 1.2.
We then change the link function to the probit link. Table 3 shows the performance of probit regression, averaged over 500 simulation runs.
Similarly, the MSE in Table 3 stays bounded, which means the regression estimator remains bounded. Besides the similarities between logistic and probit regression, earlier studies found that logistic regression coefficients are about 1.8 times as large as probit coefficients on the same data. Although the coefficients differ, the classification results are essentially similar, and the interpretation of the coefficients, the evaluation of the model, and hypothesis testing are all analogous.
Furthermore, we find that the choice of loss function has an almost negligible effect on the prediction results, which may explain why the impact of different loss functions is rarely discussed for classification models.
4 Discussion
This paper shows that the penalized MLE has a high explosive breakdown point in binary logistic regression models. More precisely, we still obtain bounded regression coefficients even if the contamination rate of the observations reaches 50%, a property confirmed by our simulation studies. In both the logistic and the probit model, the regression estimators are robust on contaminated data sets. We also give an upper bound for the implosive breakdown point of the slope parameter.
Theorem 1.3 gives an upper bound for the implosive breakdown point of the slope parameter. For a robust estimator, however, the larger the breakdown point the better, so it is natural to ask whether the implosive breakdown point admits a larger lower bound when the penalty term or the loss function is changed. For example, the sparse LTS method is quite robust in linear models; can we borrow the idea of trimming? As is well known, the SCAD penalty satisfies the mathematical conditions for sparsity, so using this penalty may lead to further gains. We will pay more attention to these questions in future research.
When there is only slight multicollinearity among the explanatory variables, the MLE is consistent and asymptotically normal as n→+∞ under weak assumptions; see Fahrmeir and Kaufmann[31]. As the sample size increases, the standard error of the parameter estimates gradually decreases. After a penalty term is added, these properties may change. Hypothesis testing therefore requires further study, not only of the breakdown point but also of the standard error.
[1] Hampel F R. A general qualitative definition of robustness[J]. The Annals of Mathematical Statistics, 1971, 42: 1887-1896. DOI:10.1214/aoms/1177693235
[2] Donoho D, Huber P J. The notion of breakdown point[C]//Bickel P J, Doksum K A, Hodges J L. A Festschrift for Erich Lehmann. Belmont: Wadsworth, 1983: 157-184.
[3] Yohai V J. High breakdown point and high efficiency robust estimates for regression[J]. The Annals of Statistics, 1987, 15: 642-656. DOI:10.1214/aos/1176350366
[4] Yohai V J, Zamar R H. High breakdown point estimators of regression by means of the minimization of an efficient scale[J]. Journal of the American Statistical Association, 1988, 83: 406-413. DOI:10.1080/01621459.1988.10478611
[5] Rousseeuw P J. Least median of squares regression[J]. Journal of the American Statistical Association, 1984, 79: 871-880. DOI:10.1080/01621459.1984.10477105
[6] Sakata S, White H. An alternative definition of finite-sample breakdown point with application to regression model estimators[J]. Journal of the American Statistical Association, 1995, 90: 1099-1106.
[7] Cox D R, Snell E J. The analysis of binary data[M]. London: Methuen, 1970.
[8] Christmann A. Least median of weighted squares in logistic regression with large strata[J]. Biometrika, 1994, 81: 413-417. DOI:10.1093/biomet/81.2.413
[9] Künsch H R, Stefanski L A, Carroll R J. Conditionally unbiased bounded-influence estimation in general regression models, with applications to generalized linear models[J]. Journal of the American Statistical Association, 1989, 84: 460-466.
[10] Neykov N M, Müller C H. Breakdown point and computation of trimmed likelihood estimators in generalized linear models[C]//Dutter R, Filzmoser P, Gather U, et al. Developments in Robust Statistics. Berlin: Springer-Verlag, 2003: 277-286.
[11] Khan D M, Ihtisham S, Ali A, et al. An efficient and high breakdown estimation procedure for nonlinear regression models[J]. Pakistan Journal of Statistics, 2017, 33: 223-236.
[12] Croux C, Flandre C, Haesbroeck G. The breakdown behavior of the maximum likelihood estimator in the logistic regression model[J]. Statistics & Probability Letters, 2002, 60: 377-386.
[13] Breiman L. Better subset regression using the nonnegative garrote[J]. Technometrics, 1995, 37: 373-384. DOI:10.1080/00401706.1995.10484371
[14] Tibshirani R. Regression shrinkage and selection via the lasso[J]. Journal of the Royal Statistical Society: Series B, 1996, 58: 267-288.
[15] Wang H, Li G, Jiang G. Robust regression shrinkage and consistent variable selection through the LAD-lasso[J]. Journal of Business and Economic Statistics, 2007, 25: 347-355. DOI:10.1198/073500106000000251
[16] Xie H, Huang J. SCAD-penalized regression in high-dimensional partially linear models[J]. The Annals of Statistics, 2009, 37: 673-696. DOI:10.1214/07-AOS580
[17] Alfons A, Croux C, Gelper S. Sparse least trimmed squares regression for analyzing high-dimensional large data sets[J]. The Annals of Applied Statistics, 2013, 7: 226-248. DOI:10.1214/12-AOAS575
[18] Albert A, Anderson J A. On the existence of maximum likelihood estimates in logistic regression models[J]. Biometrika, 1984, 71: 1-10. DOI:10.1093/biomet/71.1.1
[19] Silvapulle M J. On the existence of maximum likelihood estimators for the binomial response models[J]. Journal of the Royal Statistical Society: Series B, 1981, 43: 310-313.
[20] Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties[J]. Journal of the American Statistical Association, 2001, 96: 1348-1360. DOI:10.1198/016214501753382273
[21] Rousseeuw P J, Leroy A M. Robust regression and outlier detection[M]. New York: Wiley-Interscience, 1987.
[22] Zuo Y. Some quantitative relationships between two types of finite-sample breakdown point[J]. Statistics & Probability Letters, 2001, 51: 369-375.
[23] Efron B, Hastie T, Johnstone I, et al. Least angle regression[J]. The Annals of Statistics, 2004, 32: 407-451. DOI:10.1214/009053604000000067
[24] Balakrishnan S, Madigan D. Algorithms for sparse linear classifiers in the massive data setting[J]. Journal of Machine Learning Research, 2008, 9: 313-337.
[25] Wu T T, Lange K. Coordinate descent algorithms for lasso penalized regression[J]. The Annals of Applied Statistics, 2008, 2: 224-244. DOI:10.1214/07-AOAS147
[26] Friedman J, Hastie T, Höfling H, et al. Pathwise coordinate optimization[J]. The Annals of Applied Statistics, 2007, 1: 302-332. DOI:10.1214/07-AOAS131
[27] Zhang C, Zhang Z, Chai Y. Penalized Bregman divergence estimation via coordinate descent[J]. Journal of the Iranian Statistical Society, 2011, 2: 125-140.
[28] Luo Z Q, Tseng P. Error bounds and convergence analysis of feasible descent methods: a general approach[J]. Annals of Operations Research, 1993, 46: 157-178.
[29] Saha A, Tewari A. On the nonasymptotic convergence of cyclic coordinate descent methods[J]. SIAM Journal on Optimization, 2013, 23: 576-601. DOI:10.1137/110840054
[30] Donoho D L, Johnstone I M. Ideal spatial adaptation by wavelet shrinkage[J]. Biometrika, 1994, 81: 425-455. DOI:10.1093/biomet/81.3.425
[31] Fahrmeir L, Kaufmann H. Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models[J]. The Annals of Statistics, 1985, 13: 342-368. DOI:10.1214/aos/1176346597