2. Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100049, China;
3. Beijing Tongren Eye Center, Beijing Tongren Hospital, Capital Medical University, Beijing 100730, China
Suppose that X1, …, Xn∈ℝp are independent and identically distributed random samples with mean vector μ. We consider testing the hypothesis
$ H_0: \boldsymbol{\mu} = {\boldsymbol{\mu}}_0 \;\;{\rm{vs.}}\;\; H_1: \boldsymbol{\mu} \ne {\boldsymbol{\mu}}_0 $    (1)
under n < p. This is the so-called "large p, small n" paradigm. When p is fixed, and under the assumption of a normal distribution, the traditional method for testing (1) is Hotelling's T2 test. However, Hotelling's test is not defined when p > n because the sample covariance matrix is singular, which makes the high dimensional setting a challenge for the traditional method.
The challenge of testing (1) in the high dimensional setting has attracted many researchers. Ref.[1] constructed test statistics that avoid inverting the sample covariance matrix, but these statistics only apply when p/n→c∈(0, 1), i.e., the dimension must grow at the same rate as the sample size. Ref.[2] proposed a new test statistic that requires no explicit relationship between p and n. In practice, different components may have different scales, so scalar-invariance is an important property for a test statistic. Ref.[3], Ref.[4], and Ref.[5] constructed scalar-invariant test statistics under the assumption that p=o(n2). Ref.[6] proposed a scalar-invariant test that allows the dimension to be arbitrarily large, but their test is not location-shift invariant. However, under heavy-tailed distributions, which frequently arise in genomics and quantitative finance, the asymptotic properties of the above test statistics are not established, and as a result these tests tend to have unsatisfactory power. Under the assumption of elliptical distributions, Ref.[7] proposed a novel non-parametric test based on spatial signs, which is more powerful than the test in Ref.[2] for heavy-tailed multivariate distributions and has similar power for the multivariate normal distribution. But their test is not scalar-invariant. Ref.[8] proposed a novel scalar-invariant test based on multivariate signs, which is more powerful than the test in Ref.[5] for heavy-tailed multivariate distributions; their method assumes log(p)=o(n).
We propose a novel test for hypothesis (1) based on the signed-rank method, and our study has two main contributions. First, the proposed test statistic works for a wider class of distributions, because the signed-rank method only requires the distribution of the samples to be symmetric, and the statistic remains available when p is arbitrarily large. Second, we show that, under the null hypothesis, the proposed test statistic is asymptotically normal. Moreover, the simulation study shows that our method is scalar-invariant, robust, and more efficient without the assumption of elliptical distributions.
1 A signed-rank-based high dimensional test

1.1 The proposed test statistic

Suppose that Xi, i=1, …, n are independent and identically distributed random samples with dimension p. We denote by X(k)=(X1k, …, Xnk), k=1, …, p the sample of the k-th dimension, and let (r1k, …, rnk) be the ranks of (|X1k|, …, |Xnk|). To test hypothesis (1), we propose a test statistic based on the signed-rank vectors, which are defined as

Ui=diag{sign(Xi1), …, sign(Xip)}(ri1, …, rip)T,  i=1, 2, …, n.

Then, we consider the following U-statistic:
$ T_n = \frac{\sum\nolimits_{i \ne j}^{n} U_i^{\rm{T}} U_j}{2n(n-1)}. $    (2)
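For concreteness, a minimal sketch of computing the signed-rank vectors Ui and the statistic Tn from an n×p sample matrix in Python is given below; the function name is ours, and ties in |Xik| are ignored since condition A1 below rules them out.

```python
import numpy as np

def signed_rank_statistic(X):
    """Compute the signed-rank U-statistic T_n of Eq. (2) for an n x p sample matrix X."""
    n, p = X.shape
    # r_ik: rank of |X_ik| within the k-th column, taking values 1, ..., n
    ranks = np.argsort(np.argsort(np.abs(X), axis=0), axis=0) + 1
    # U_i = diag(sign(X_i1), ..., sign(X_ip)) (r_i1, ..., r_ip)^T, stored as the rows of U
    U = np.sign(X) * ranks
    # sum_{i != j} U_i^T U_j = ||sum_i U_i||^2 - sum_i ||U_i||^2
    col_sum = U.sum(axis=0)
    cross = col_sum @ col_sum - np.einsum('ij,ij->', U, U)
    return cross / (2.0 * n * (n - 1))
```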
Set si=(si1, …, sip)T with covariance matrix Σs>0, where sij=sign(Xij). To establish the asymptotic properties of the U-statistic under the null hypothesis, we need the following conditions:
A1. P(sij=1)=P(sij=-1)=1/2 for each i and j, and the distribution of |Xij| is continuous, so that |X1j|, …, |Xnj| contain no ties almost surely.
A2. tr(Σs4)=o(tr2(Σs2)).
Remark 1.1 Condition A1 is a necessary condition for the signed-rank test under the null hypothesis, and it indicates that the random samples have symmetric distributions. The first part of condition A1 implies E(sij)=0. The second part of condition A1 implies rij≠rkj for any i≠k and each j, so that (r1j, …, rnj) is a permutation of the elements of {1, …, n}. Condition A2 is similar to the condition used in Ref.[2], and it is a quite mild condition on the eigenvalues of Σs.
Under H0 and condition A1, it is easy to show that
$ E(T_n) = 0, $
and
$ {\rm{Var}}(T_n) = \frac{1}{2n(n-1)}{\rm{tr}}(\boldsymbol{\varSigma}_u^2), $
where Σu=E(U1U1T).
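As a heuristic check (treating the signed-rank vectors Ui as uncorrelated, mean-zero vectors with common covariance Σu, and ignoring the weak dependence introduced by the ranks, which the formal proof must handle), the variance formula follows by counting the non-vanishing cross moments: only ordered pairs with {i, j}={k, l} contribute, and each such pair gives E{(UiTUj)2}=tr(Σu2), so that

$ {\rm{Var}}(T_n) = \frac{1}{\{2n(n-1)\}^2}\sum\limits_{i \ne j}\sum\limits_{k \ne l} E(U_i^{\rm{T}}U_j\,U_k^{\rm{T}}U_l) = \frac{2n(n-1)\,{\rm{tr}}(\boldsymbol{\varSigma}_u^2)}{\{2n(n-1)\}^2} = \frac{{\rm{tr}}(\boldsymbol{\varSigma}_u^2)}{2n(n-1)}. $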
Theorem 1.1 in the following establishes the asymptotic normality of Tn.
Theorem 1.1 Suppose that H0 and conditions A1 and A2 hold. Then, as n→∞ and p→∞,

$ \frac{T_n}{\sqrt{{\rm{Var}}(T_n)}} = \frac{\sqrt{2n(n-1)}\,T_n}{\sqrt{{\rm{tr}}\left(\boldsymbol{\varSigma}_u^2\right)}} \stackrel{d}{\rightarrow} N(0, 1). $    (3)
Theorem 1.1 implies that we reject H0 at level α if Tn>zα{tr(Σu2)/(2n(n-1))}1/2, where zα is the upper α-quantile of N(0, 1). The proof of Theorem 1.1 is standard, so we omit the details; a detailed proof is available from the authors upon request.
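As an illustration, a minimal sketch of the resulting test decision in Python is given below. It assumes that the statistic Tn and an estimate of tr(Σu2) (for example, from the estimator of Section 1.2) are already available; the function name and arguments are our own.

```python
import numpy as np
from scipy.stats import norm

def sr_decision(T_n, tr_Sigma_u2_hat, n, alpha=0.05):
    """One-sided decision rule of Theorem 1.1: reject H0 when T_n is large."""
    se = np.sqrt(tr_Sigma_u2_hat / (2.0 * n * (n - 1)))  # plug-in estimate of sqrt(Var(T_n))
    z = T_n / se                                         # standardized statistic as in Eq. (3)
    p_value = norm.sf(z)                                 # upper-tail p-value
    return z, p_value, z > norm.ppf(1 - alpha)
```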
1.2 Computational issues

In practice, in order to estimate Var(Tn), an estimator of tr(Σu2) is needed. Similar to the estimator used by Ref.[2], we propose the following estimator:
where
When n and p are large, the computation of
Where
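As a hedged illustration only, one simple moment-type plug-in for tr(Σu2) in the spirit of Ref.[2], which is not necessarily identical to the estimator proposed above, is sketched below; the function name is ours.

```python
import numpy as np

def tr_sigma_u2_plugin(U):
    """Rough moment-type plug-in for tr(Sigma_u^2); U is the n x p matrix whose rows are the U_i."""
    n = U.shape[0]
    G = U @ U.T                # Gram matrix of inner products U_i^T U_j
    G2 = G ** 2
    np.fill_diagonal(G2, 0.0)  # drop the i = j terms
    # averages (U_i^T U_j)^2 over distinct pairs; E{(U_i^T U_j)^2} is close to tr(Sigma_u^2)
    # when the U_i behave like independent mean-zero vectors with covariance Sigma_u
    return G2.sum() / (n * (n - 1))
```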
We compare the performance of the proposed test (SR) with five alternatives: Ref.[1] (BS), Ref.[2] (CQ), Ref.[5] (SKK), Ref.[7] (WPL), and Ref.[8] (FZW). All the following simulations are replicated 1 000 times, and we set n=20, 50 and p=200, 1 000.
Example 1 We generate Xi from the p-variate normal distribution N(μ, Σ). Two choices of Σ are considered: 1) Σ1=R; 2) Σ2=D1/2RD1/2, where R=(σjk) with σjk=0.5|j-k| for 1≤j, k≤p, and D=diag{d1, …, dp} with dj=1I{1≤j≤p/4}+2I{p/4+1≤j≤p/2}+3I{p/2+1≤j≤3p/4}+4I{3p/4+1≤j≤p} for 1≤j≤p. Without loss of generality, we set μj=η for j=1, …, p, and
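A minimal sketch of the Example 1 data generation in Python is given below. The signal size η is left as a free parameter (its exact relation to the constant c used in the power settings is not reproduced here), p is assumed to be divisible by 4, and the function name is ours.

```python
import numpy as np

def example1_data(n, p, eta=0.0, scaled=False, rng=None):
    """Generate X_i ~ N(mu, Sigma) with Sigma1 = R or Sigma2 = D^{1/2} R D^{1/2} as in Example 1."""
    rng = np.random.default_rng() if rng is None else rng
    R = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # sigma_jk = 0.5^{|j-k|}
    if scaled:
        d = np.repeat([1.0, 2.0, 3.0, 4.0], p // 4)                   # blockwise scales d_j
        Sigma = np.sqrt(d)[:, None] * R * np.sqrt(d)[None, :]         # D^{1/2} R D^{1/2}
    else:
        Sigma = R
    mu = np.full(p, eta)                                              # mu_j = eta for all j
    return rng.multivariate_normal(mu, Sigma, size=n)
```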
Table 1 reports the performance of the six tests in Example 1. We can see that the power of SR is similar to those of BS, CQ, and WPL when Σ=Σ1, and higher than those of BS, CQ, and WPL when Σ=Σ2, which indicates that SR performs better when the scales of the components differ. For example, when (n, p)=(20, 200), Σ=Σ2, and c=0.1, the powers of SR, BS, CQ, and WPL are 0.547, 0.407, 0.420, and 0.394, respectively. We also observe that SR has higher power than SKK and FZW when p ≫ n, because SKK and FZW are constructed under assumptions that do not allow p to be much larger than n. For example, when (n, p)=(20, 1 000), Σ=Σ1, and c=0.15, the powers of SR, SKK, and FZW are 0.589, 0.413, and 0.347, respectively.
Example 2 In this example, Xi is generated from the p-variate t-distribution with 3 degrees of freedom. The settings of the mean vector μ and covariance Σ are the same as those in Example 1, and we select c=0.1 and 0.15 for μ to calculate the power.
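A minimal sketch of one standard way to draw correlated p-variate t3 samples (a Gaussian vector divided by the square root of an independent scaled chi-square variable) is shown below; it is our own illustration with an assumed function name, and Σ enters as the scale matrix.

```python
import numpy as np

def multivariate_t(n, p, Sigma, mu, df=3, rng=None):
    """Draw n samples from a p-variate t distribution with scale matrix Sigma and location mu."""
    rng = np.random.default_rng() if rng is None else rng
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)  # Gaussian part
    w = rng.chisquare(df, size=n) / df                       # independent chi-square mixing variable
    return mu + Z / np.sqrt(w)[:, None]                      # rows follow the multivariate t_df law
```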
Table 2 shows the simulation results of Example 2. We can see that SR has higher power than the other five tests in all settings. For example, when (n, p)=(50, 200), Σ=Σ1, and c=0.15, the power of SR is 0.773, while the powers of the other tests in this setting are 0.419, 0.538, 0.549, 0.577, and 0.610, respectively. Since the t-distribution is a common heavy-tailed distribution, these results indicate that SR is robust.
Example 3 In this example, Xi is generated from a p-variate Laplace distribution. We consider the same settings of the mean vector μ and covariance Σ as in Example 1. To calculate the power, we select c=0.1 and c=0.15 when n=20, and c=0.05 and c=0.075 when n=50.
Table 3 shows the performance of the six tests in Example 3. SR is more powerful than the other five tests in all settings. For example, when (n, p)=(20, 1 000), Σ=Σ2, and c=0.15, the powers of BS, CQ, SKK, WPL, FZW, and SR are 0.626, 0.615, 0.695, 0.653, 0.650, and 0.949, respectively. The Laplace distribution is not an elliptical distribution, and Table 3 shows that SR is more effective in this situation.
Example 4 In this example, we generate Xi from a mixed distribution. First, we generate Zij from the normal distribution for 1≤j≤2p/5, from the t distribution with 3 degrees of freedom for 2p/5+1≤j≤7p/10, and from the Laplace distribution for 7p/10+1≤j≤p, where all Zij have mean 0 and variance 1. Then we let Xi=ΓZi+μ, where Γ is a p×p matrix with ΓΓT=Σ and Zi=(Zi1, …, Zip)T. We consider the same settings of the mean vector μ and covariance Σ as in Example 1. To calculate the power, we select c=0.1 and c=0.15 when n=20, and c=0.05 and c=0.075 when n=50.
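A minimal sketch of the Example 4 construction (componentwise innovations standardized to variance 1 and then transformed by a square root of Σ) is given below; the choice of a Cholesky factor for Γ and the function name are our own assumptions.

```python
import numpy as np

def example4_data(n, p, Sigma, mu, rng=None):
    """Generate X_i = Gamma Z_i + mu with mixed normal / t3 / Laplace components of unit variance."""
    rng = np.random.default_rng() if rng is None else rng
    p1, p2 = 2 * p // 5, 7 * p // 10
    Z = np.empty((n, p))
    Z[:, :p1] = rng.standard_normal((n, p1))                             # N(0, 1) components
    Z[:, p1:p2] = rng.standard_t(3, size=(n, p2 - p1)) / np.sqrt(3.0)    # t3 scaled to variance 1
    Z[:, p2:] = rng.laplace(scale=1.0 / np.sqrt(2.0), size=(n, p - p2))  # Laplace with variance 1
    Gamma = np.linalg.cholesky(Sigma)                                    # one Gamma with Gamma Gamma^T = Sigma
    return Z @ Gamma.T + mu
```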
Table 4 reports the simulation results of Example 4. We can see that the power of SR is higher than those of the other five tests in all settings. For example, when (n, p)=(50, 1 000), Σ=Σ2, and c=0.075, the power of SR is 0.757, while the powers of the other tests in this setting are 0.214, 0.271, 0.548, 0.299, and 0.613, respectively. In practice, the variates usually have different distributions; hence, the results in Table 4 indicate that SR is expected to perform better in applications.
Moreover, we plot the empirical distributions of SR under the settings of the four examples and compare them with the standard normal distribution. Fig. 1 confirms the asymptotic normality of SR given in Theorem 1.1.
Fig. 1 Tn under the null hypothesis with four different distributions of X
In this section, we employ the proposed signed-rank-based method to study an ophthalmic data set collected by the Beijing Tongren Eye Center and Anyang Eye Hospital. We take the data of the fifth and sixth grades of a class and apply the proposed method to study whether the visual factors and their interactions with eye habits differ between the two grades.
First, we remove the visual factors and their interactions with eye habits that have more than 15% missing values, and impute the sample mean for the missing values of the remaining 945 factors. Then, we let Xi be the difference between the visual factors and their interactions with eye habits of the i-th student in the sixth grade and those in the fifth grade. We calculate the standard deviation of each dimension of X and show the distribution of these standard deviations in Fig. 2. The standard deviations clearly differ, so a scalar-invariant method is expected to perform better in the analysis of these data. Applying the proposed SR method, we obtain a p-value < 10-9, which illustrates that the visual factors and their interactions with eye habits are different in the two grades. The p-values obtained by the CQ, WPL, and FZW methods are 0.491 0, 0.491 3, and < 10-9, respectively. Because the standard deviations of the dimensions differ, the CQ and WPL methods are relatively ineffective, while the p-values obtained by the FZW and SR methods are small.
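A minimal sketch of the preprocessing described above is given below, assuming the measurements are stored in two pandas data frames grade5 and grade6 whose rows are aligned by student and whose columns are the factors (these frame and column conventions are hypothetical):

```python
import pandas as pd

def preprocess(grade5: pd.DataFrame, grade6: pd.DataFrame, max_missing=0.15):
    """Drop factors with too many missing values, mean-impute the rest, and take grade differences."""
    keep = [c for c in grade5.columns
            if grade5[c].isna().mean() <= max_missing and grade6[c].isna().mean() <= max_missing]
    g5 = grade5[keep].fillna(grade5[keep].mean())  # mean imputation within each grade
    g6 = grade6[keep].fillna(grade6[keep].mean())
    X = g6.to_numpy() - g5.to_numpy()              # X_i: sixth-grade minus fifth-grade values per student
    return X, X.std(axis=0, ddof=1)                # samples and per-dimension standard deviations
```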
Fig. 2 The distribution of the standard deviations
[1] Bai Z D, Saranadasa H. Effect of high dimension: by an example of a two sample problem[J]. Statistica Sinica, 1996, 6(2): 311-329.
[2] Chen S X, Qin Y L. A two sample test for high dimensional data with applications to gene-set testing[J]. Annals of Statistics, 2010, 38(2): 808-835. DOI:10.1214/09-AOS716
[3] Srivastava M S, Du M. A test for the mean vector with fewer observations than the dimension[J]. Journal of Multivariate Analysis, 2008, 99(3): 386-402. DOI:10.1016/j.jmva.2006.11.002
[4] Srivastava M S. A test for the mean vector with fewer observations than the dimension under non-normality[J]. Journal of Multivariate Analysis, 2009, 100(3): 518-532. DOI:10.1016/j.jmva.2008.06.006
[5] Srivastava M S, Katayama S, Kano Y. A two sample test in high dimensional data[J]. Journal of Multivariate Analysis, 2013, 114: 349-358. DOI:10.1016/j.jmva.2012.08.014
[6] Park J Y, Ayyala D N. A test for the mean vector in large dimension and small samples[J]. Journal of Statistical Planning and Inference, 2013, 143(5): 929-943. DOI:10.1016/j.jspi.2012.11.001
[7] Wang L, Peng B, Li R Z. A high-dimensional nonparametric multivariate test for mean vector[J]. Journal of the American Statistical Association, 2015, 110(512): 1658-1669. DOI:10.1080/01621459.2014.988215
[8] Feng L, Zou C L, Wang Z J. Multivariate-sign-based high-dimensional tests for the two-sample location problem[J]. Journal of the American Statistical Association, 2016, 111(514): 721-735. DOI:10.1080/01621459.2015.1035380