Journal of University of Chinese Academy of Sciences, 2022, Vol. 39, Issue (5): 586-592
Signed-rank-based test for high dimensional mean vector
LIU Yan1,2, LI Shiming3, ZHANG Sanguo1,2     
1. School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China;
2. Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100049, China;
3. Beijing Tongren Eye Center, Beijing Tongren Hospital, Capital Medical University, Beijing 100730, China
Abstract: This work is concerned with tests for one-sample mean vectors in high dimensional settings. Several recently proposed high dimensional tests for mean vectors are based on the assumption of an elliptical distribution. To cover a wider class of distributions, we propose a signed-rank-based test. The proposed test statistic is robust and scalar-invariant. Asymptotic properties of the test statistic are established. Numerical studies show that the proposed test controls the type I error well and is more efficient. We also apply the proposed method to an ophthalmic data set.
Keywords: high dimensional analysis    signed-rank    one-sample test    scalar-invariance    

Suppose that $X_1, \ldots, X_n \in {\mathbb{R}}^{p}$ are independent and identically distributed random samples with mean vector μ and covariance matrix Σ. Consider the following test

$ {H_0}:{\boldsymbol{\mu }} = {{\boldsymbol{\mu }} _0} \;{\rm{vs.}}\; H_{1}:{\boldsymbol{\mu }} \ne {{\boldsymbol{\mu }} _0}. $ (1)

under n < p. This is the so-called "large p, small n" paradigm. When p is fixed and the data are normally distributed, the traditional method for testing (1) is Hotelling's test statistic. However, Hotelling's test is not defined when p > n because the sample covariance matrix is singular, so the traditional method breaks down in the high dimensional situation.

The challenge of testing (1) in the high dimensional situation has attracted many researchers. Ref.[1] constructed a test statistic that avoids the inverse of the sample covariance matrix, but it can only be applied when p/n→c∈(0, 1), which means that the dimension must grow at the same rate as the sample size. Ref.[2] proposed a new test statistic that requires no direct relationship between p and n. In practice, different components may have different scales, so scalar-invariance is an important property of a test statistic. Ref.[3], Ref.[4] and Ref.[5] constructed scalar-invariant test statistics under the assumption that $p=o(n^2)$. Ref.[6] proposed a scalar-invariant test that allows the dimension to be arbitrarily large, but their test is not location shift invariant. Moreover, under heavy-tailed distributions, which frequently arise in genomics and quantitative finance, the asymptotic properties of the above test statistics are not established, and as a result these tests tend to have unsatisfactory power. Under the assumption of elliptical distributions, Ref.[7] proposed a novel non-parametric test based on spatial signs, which is more powerful than the test in Ref.[2] for heavy-tailed multivariate distributions and has similar power to the test in Ref.[2] for the multivariate normal distribution, but their test is not scalar-invariant. Ref.[8] proposed a novel scalar-invariant test based on multivariate signs, which is more powerful than the test in Ref.[5] for heavy-tailed multivariate distributions; their method requires log(p)=o(n).

We propose a novel test for hypothesis (1) based on the signed-rank method, and our study has two main contributions. Firstly, the proposed test statistic works for more distributions because the signed-rank method only requires the distribution of the samples to be symmetric, and the test statistic is available when p is arbitrarily large. Secondly, we show that, under the null hypothesis, the proposed test statistic is asymptotically normal. Moreover, the simulation study shows that our method is scalar-invariant and robust, and is more efficient without the assumption of elliptical distributions.

1 A signed-rank-based high dimensional test

1.1 The proposed test statistic

Suppose that Xi, i=1, …, n are independent and identically distributed random samples with dimension p. We denote by $X_{(k)}=(X_{1k}, \ldots, X_{nk})$, k=1, …, p, the sample of the k-th dimension, and let $(r_{1k}, \ldots, r_{nk})$ be the ranks of $(|X_{1k}|, \ldots, |X_{nk}|)$. To test hypothesis (1), we propose a test statistic based on signed-rank functions, which are defined as

$ U_i = {\rm diag}\{{\rm sign}(X_{i1}), \ldots, {\rm sign}(X_{ip})\}(r_{i1}, \ldots, r_{ip})^{\rm T}, \quad i = 1, 2, \ldots, n. $

Then, we consider the following U-statistic:

$ {T_n} = \frac{{\sum _{i \ne j}^nU_i^{\rm{T}}{U_j}}}{{2n(n - {\rm{ }}1)}}. $ (2)
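As a minimal sketch of how the signed-rank vectors and the statistic in (2) could be computed, the following standard-library Python assumes that the data are given as a list of n rows of length p and that, within each coordinate, there are no zeros and no ties among the absolute values. The function names `signed_rank_vectors` and `sr_statistic` are illustrative and do not come from the paper. The sum over i≠j is evaluated through the identity $\sum_{i \ne j} U_i^{\rm T}U_j = \|\sum_i U_i\|^2 - \sum_i \|U_i\|^2$, which avoids the double loop over samples.

```python
def signed_rank_vectors(X):
    """Return U_1, ..., U_n (as a list of rows) for an n x p data matrix X."""
    n, p = len(X), len(X[0])
    U = [[0] * p for _ in range(n)]
    for k in range(p):
        # Rank |X_1k|, ..., |X_nk| from 1 (smallest) to n (largest).
        order = sorted(range(n), key=lambda i: abs(X[i][k]))
        for rank, i in enumerate(order, start=1):
            # Signed rank: sign(X_ik) * r_ik (zeros are assumed absent).
            U[i][k] = rank if X[i][k] > 0 else -rank
    return U


def sr_statistic(U):
    """T_n = sum_{i != j} U_i^T U_j / (2 n (n - 1))."""
    n, p = len(U), len(U[0])
    col_sums = [sum(U[i][k] for i in range(n)) for k in range(p)]
    norm_of_sum = sum(s * s for s in col_sums)        # ||sum_i U_i||^2
    sum_of_norms = sum(u * u for row in U for u in row)  # sum_i ||U_i||^2
    return (norm_of_sum - sum_of_norms) / (2 * n * (n - 1))
```

Because signs and ranks within a coordinate are unchanged when that coordinate is multiplied by a positive constant, rescaling a column leaves every $U_i$, and hence $T_n$, unchanged.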

Set $s_i=(s_{i1}, \ldots, s_{ip})^{\rm T}$ with covariance matrix $\boldsymbol{\varSigma}_s>0$, where $s_{ij}={\rm sign}(X_{ij})$. To establish the asymptotic properties of the U-statistic under the null hypothesis, we need the following conditions:

A1. $P(s_{ij}=1)=P(s_{ij}=-1)=\frac{1}{2}$ for each i and j; $|X_{ij}| \ne |X_{kj}|$ for any i≠k and each j, and $X_{ij} \ne 0$ for each i and j; $s_{ij}$ and $r_{ij}$ are uncorrelated for each i and j.

A2. ${\rm tr}(\boldsymbol{\varSigma}_s^4)=o({\rm tr}^2(\boldsymbol{\varSigma}_s^2))$.

Remark 1.1  Condition A1 is a necessary condition for the signed-rank test under the null hypothesis, and it indicates that the random samples have symmetric distributions. Under the first part of condition A1, we have $E(s_{ij})=0$. Under the second part of condition A1, $r_{ij} \ne r_{kj}$ for any i≠k and each j, so that $(r_{1j}, \ldots, r_{nj})$ is a permutation of {1, …, n}. Condition A2 is similar to that used in Ref.[2], and it is a quite mild condition on the eigenvalues of $\boldsymbol{\varSigma}_s$.

Under H0 and condition A1, it is easy to show that

$ E({T_n}) = 0, $

and

$ Var({T_n}) = \frac{1}{{2n(n{\rm{ }} - 1)}}{\rm{tr}}({\boldsymbol{\varSigma }} _u^2), $

where $\boldsymbol{\varSigma}_u=E(U_1U_1^{\rm T})$.
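The variance expression can be checked by a short calculation. The sketch below treats $U_1, \ldots, U_n$ as independent with mean zero, which is a simplifying assumption made here for illustration only: the ranks within a coordinate form a permutation and are therefore not exactly independent across samples.

$ Var(T_n) = \frac{1}{4n^2(n-1)^2}\sum_{i \ne j}\sum_{i' \ne j'} E\left(U_i^{\rm T}U_j U_{i'}^{\rm T}U_{j'}\right) = \frac{2}{4n^2(n-1)^2}\sum_{i \ne j} E\left\{(U_i^{\rm T}U_j)^2\right\}, $

since only the terms with (i', j')=(i, j) or (i', j')=(j, i) have non-zero expectation. For i≠j,

$ E\left\{(U_i^{\rm T}U_j)^2\right\} = E\left(U_i^{\rm T}U_jU_j^{\rm T}U_i\right) = E\left(U_i^{\rm T}\boldsymbol{\varSigma}_u U_i\right) = {\rm tr}\left\{\boldsymbol{\varSigma}_u E(U_iU_i^{\rm T})\right\} = {\rm tr}(\boldsymbol{\varSigma}_u^2), $

so that $Var(T_n) = 2n(n-1){\rm tr}(\boldsymbol{\varSigma}_u^2)/\{4n^2(n-1)^2\} = {\rm tr}(\boldsymbol{\varSigma}_u^2)/\{2n(n-1)\}$.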

Theorem 1.1 in the following establishes the asymptotic normality of Tn.

Theorem 1.1  Under H0, suppose conditions A1 and A2 hold. Then, as n→∞ and p→∞,

$ \frac{T_n}{\sqrt{{\rm tr}\left(\boldsymbol{\varSigma}_u^2\right)/\{2 n(n-1)\}}} \stackrel{d}{\rightarrow} N(0, 1). $ (3)

Theorem 1.1 implies that we can reject H0 if $T_n>z_\alpha\left[{\rm tr}(\boldsymbol{\varSigma}_u^2)/\{2n(n-1)\}\right]^{1/2}$, that is, if $T_n$ exceeds $z_\alpha$ standard deviations under the variance expression given above, where $z_\alpha$ is the upper α-quantile of N(0, 1). The proof of Theorem 1.1 is conventional, so we omit the details; a detailed proof is available from the authors on request.
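For illustration, the rejection rule can be sketched as follows, standardising $T_n$ by the variance expression $Var(T_n)={\rm tr}(\boldsymbol{\varSigma}_u^2)/\{2n(n-1)\}$ stated before Theorem 1.1. The function name `sr_test` and its arguments are assumptions of this sketch; `tr_sigma_u2` stands for ${\rm tr}(\boldsymbol{\varSigma}_u^2)$ or an estimate of it.

```python
import math
from statistics import NormalDist


def sr_test(t_n, tr_sigma_u2, n, alpha=0.05):
    """One-sided test: reject H0 for large standardised T_n.

    Returns (z, p_value, reject).
    """
    var_tn = tr_sigma_u2 / (2 * n * (n - 1))   # Var(T_n) under H0
    z = t_n / math.sqrt(var_tn)                # standardised statistic
    p_value = 1.0 - NormalDist().cdf(z)        # upper-tail N(0, 1) probability
    return z, p_value, p_value < alpha
```

Rejecting when the p-value is below α is equivalent to comparing the standardised statistic with the upper α-quantile $z_\alpha$.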

1.2 Computational issue

In practice, in order to estimate Var(Tn), an estimator for ${\rm tr}(\boldsymbol{\varSigma}_u^2)$ is needed. Similar to the estimator used in Ref.[2], we propose the following estimator:

where ${\overline U _{(j, k)}}$ is the sample mean of U after excluding Uj and Uk, and ${\overline U _{(j)}}$ is the sample mean of U after excluding Uj. Then, we have the estimator of Var(Tn): $\widehat{Var(T_n)}=\widehat{{\rm tr}(\boldsymbol{\varSigma}_u^2)}/\{2n(n-1)\}$. Under H0, one can show that $\widehat{Var(T_n)}/Var(T_n)\xrightarrow{p}1$.

When n and p are large, the computation of $\widehat{{\rm tr}(\boldsymbol{\varSigma}_u^2)}$ is too expensive. To speed up the calculation, inspired by Ref.[7], we could transform $\widehat{{\rm tr}(\boldsymbol{\varSigma}_u^2)}$ to:

where ${\overline U ^ * } = \frac{1}{n-2}\sum\limits_{k=1}^n U_k$.
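As a simpler stand-in for the leave-out estimator described above (it is not the paper's estimator), note that if the $U_i$ are treated as independent with $E(U_i)=0$, then $E\{(U_j^{\rm T}U_k)^2\}={\rm tr}(\boldsymbol{\varSigma}_u^2)$ for j≠k, so the average of $(U_j^{\rm T}U_k)^2$ over all pairs gives a rough estimate. The sketch below, with the hypothetical name `tr_sigma_u2_simple`, implements this in O(n²p) time. It does no centring, so it is only meaningful under the null hypothesis.

```python
def tr_sigma_u2_simple(U):
    """Average of (U_j^T U_k)^2 over ordered pairs j != k.

    U is a list of n signed-rank vectors of equal length.
    Assumes E(U_i) = 0, so no leave-out centring is applied.
    """
    n = len(U)
    total = 0.0
    for j in range(n):
        for k in range(j + 1, n):
            dot = sum(a * b for a, b in zip(U[j], U[k]))
            total += 2 * dot * dot   # pairs (j, k) and (k, j) contribute equally
    return total / (n * (n - 1))
```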

2 Simulation study

We compare the performance of the proposed test (SR) with five alternatives: Ref.[1] (BS), Ref.[2] (CQ), Ref.[5] (SKK), Ref.[7] (WPL), and Ref.[8] (FZW). All the following simulations are replicated 1 000 times, and we set n=20, 50 and p=200, 1 000.

Example 1  We generate Xi from the p-variate normal distribution N(μ, Σ). Two different choices of Σ are considered: 1) $\boldsymbol{\varSigma}_1=R$; 2) $\boldsymbol{\varSigma}_2=D^{1/2}RD^{1/2}$, where $R=(\sigma_{jk})$ with $\sigma_{jk}=0.5^{|j-k|}$ for 1≤j, k≤p, and $D={\rm diag}\{d_1, \ldots, d_p\}$ with $d_j=1\cdot I\{1\le j\le p/4\}+2\cdot I\{p/4+1\le j\le p/2\}+3\cdot I\{p/2+1\le j\le 3p/4\}+4\cdot I\{3p/4+1\le j\le p\}$ for 1≤j≤p. Without loss of generality, we set μj=η for j=1, …, p, and $\eta=:\|\mu\|^2 / \sqrt{{\rm{tr}}\left({\boldsymbol{\varSigma}}^2\right)}=c$. To calculate the power, we set c=0.1 and 0.15 when n=20, and c=0.075 and 0.1 when n=50. The size is obtained by setting c=0.
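A sketch of how data for Example 1 might be generated with the Python standard library only. Two points are assumptions of this sketch rather than statements from the paper: an AR(1) recursion is used as one convenient way to obtain the correlation $0.5^{|j-k|}$, and η is taken to be the common value of the $\mu_j$ that solves $\|\mu\|^2/\sqrt{{\rm tr}(\boldsymbol{\varSigma}^2)}=c$, that is, $\eta=\{c\sqrt{{\rm tr}(\boldsymbol{\varSigma}^2)}/p\}^{1/2}$. The function names are illustrative, and p is assumed to be divisible by 4.

```python
import math
import random


def scale_d(j, p):
    """d_j for 1-based index j: 1, 2, 3, 4 on the four quarters of 1..p."""
    return min(4, (4 * (j - 1)) // p + 1)


def trace_sigma_sq(p, scaled):
    """tr(Sigma^2) = sum of squared entries of Sigma (Sigma is symmetric)."""
    d = [scale_d(j, p) if scaled else 1 for j in range(1, p + 1)]
    return sum(d[j] * d[k] * 0.25 ** abs(j - k)
               for j in range(p) for k in range(p))


def generate_example1(n, p, c, scaled=False, rng=None):
    """Draw n rows from N(mu, Sigma_1) (scaled=False) or N(mu, Sigma_2)."""
    rng = rng or random.Random(0)
    # Assumed reading of the signal size: p * eta^2 / sqrt(tr(Sigma^2)) = c.
    eta = math.sqrt(c * math.sqrt(trace_sigma_sq(p, scaled)) / p)
    X = []
    for _ in range(n):
        # AR(1) recursion gives unit variances and correlation 0.5^|j-k|.
        z = [rng.gauss(0.0, 1.0)]
        for _ in range(1, p):
            z.append(0.5 * z[-1] + math.sqrt(0.75) * rng.gauss(0.0, 1.0))
        X.append([eta + (math.sqrt(scale_d(j + 1, p)) if scaled else 1.0) * z[j]
                  for j in range(p)])
    return X
```

Setting `c=0` gives η=0 and therefore data under the null hypothesis, which is how the empirical size would be obtained in this sketch.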

Table 1 reports the performance of the six tests in Example 1. We can see that the power of SR is similar to those of BS, CQ and WPL when Σ=Σ1, and is higher than those of BS, CQ and WPL when Σ=Σ2. This indicates that SR performs better when the scales of different components differ. For example, when (n, p)=(20, 200), Σ=Σ2 and c=0.1, the powers of SR, BS, CQ, and WPL are 0.547, 0.407, 0.420, and 0.394, respectively. We also observe that SR has higher power than SKK and FZW when p≫n. The reason is that SKK and FZW rely on the assumption that p cannot be much larger than n. For example, when (n, p)=(20, 1 000), Σ=Σ1 and c=0.15, the powers of SR, SKK and FZW are 0.589, 0.413 and 0.347, respectively.

Table 1 The empirical size and power at the significance level of 5 % in Example 1

Example 2  In this example, Xi is generated from the p-variate t-distribution with 3 degrees of freedom. The settings of the mean vector μ and covariance Σ are the same as those in Example 1. We select c=0.1 and 0.15 for μ to calculate the power.

Table 2 shows the simulation results in Example 2. We can see that SR has higher power than the other five tests in all settings. For example, when (n, p)=(50, 200), Σ=Σ1 and c=0.15, the power of SR is 0.773, whereas the powers of the other tests in this setting are 0.419, 0.538, 0.549, 0.577, and 0.610, respectively. Since the t-distribution is a common heavy-tailed distribution, the results in this table indicate that SR is robust.

Table 2 The empirical size and power at the significance level of 5 % in Example 2

Table 3 The empirical size and power at the significance level of 5 % in Example 3

Example 3  In this example, Xi is generated from the p-variate Laplace distribution. We consider the same settings of the mean vector μ and covariance Σ as in Example 1. To calculate the power, we select c=0.1 and c=0.15 when n=20, and c=0.05 and c=0.075 when n=50.

Table 3 shows the performance of the six tests in Example 3. SR is more powerful than the other five tests in all settings. For example, when (n, p)=(20, 1 000), Σ=Σ2, and c=0.15, the powers of BS, CQ, SKK, WPL, FZW, and SR are 0.626, 0.615, 0.695, 0.653, 0.650, and 0.949, respectively. The Laplace distribution is not an elliptical distribution, and Table 3 shows that SR is more effective in this situation.

Example 4  In this example, we generate Xi from a mixed distribution. Firstly, we generate Zij from the normal distribution for 1≤j≤2p/5, from the t-distribution with 3 degrees of freedom for 2p/5+1≤j≤7p/10, and from the Laplace distribution for 7p/10+1≤j≤p, where all Zij have mean 0 and variance 1. Then we let Xi=ΓZi+μ, where Γ is a p×p matrix with $\Gamma\Gamma^{\rm T}=\boldsymbol{\varSigma}$, and $Z_i=(Z_{i1}, \ldots, Z_{ip})^{\rm T}$. We consider the same settings of the mean vector μ and covariance Σ as in Example 1. To calculate the power, we select c=0.1 and c=0.15 when n=20, and c=0.05 and c=0.075 when n=50.

Table 4 reports the simulation results in Example 4. We can see that the power of SR is higher than those of the other five tests in all settings. For example, when (n, p)=(50, 1 000), Σ=Σ2 and c=0.075, the power of SR is 0.757, whereas the powers of the other tests in this setting are 0.214, 0.271, 0.548, 0.299 and 0.613, respectively. In practice, the variates usually have different distributions. Hence, the results in Table 4 suggest that SR can be expected to perform better in applications.
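The mixed design of Example 4 could be sketched as follows, again with the standard library only. The standardisations are assumptions spelled out in the comments: a $t_3$ variable has variance 3 and is divided by $\sqrt{3}$, and a Laplace variable with scale b has variance $2b^2$, so $b=1/\sqrt{2}$. For the factor Γ, the sketch uses the AR(1) recursion, which corresponds to one particular Γ with $\Gamma\Gamma^{\rm T}=R$; the paper does not say which factor was used. The function names are hypothetical.

```python
import math
import random


def draw_mixed_z(p, rng):
    """One vector Z_i with mean-0, variance-1 coordinates of three types."""
    z = []
    for j in range(1, p + 1):
        if j <= 2 * p / 5:                       # normal block
            z.append(rng.gauss(0.0, 1.0))
        elif j <= 7 * p / 10:                    # t_3 block, Var(t_3) = 3
            chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3))
            t3 = rng.gauss(0.0, 1.0) / math.sqrt(chi2 / 3)
            z.append(t3 / math.sqrt(3.0))
        else:                                    # Laplace block, Var = 2 b^2 = 1
            b = 1 / math.sqrt(2.0)
            # Difference of two Exp(scale b) draws is Laplace(0, b).
            z.append(rng.expovariate(1 / b) - rng.expovariate(1 / b))
    return z


def apply_ar1_factor(z):
    """Map uncorrelated unit-variance z to x with Corr(x_j, x_k) = 0.5^|j-k|."""
    x = [z[0]]
    for zj in z[1:]:
        x.append(0.5 * x[-1] + math.sqrt(0.75) * zj)
    return x
```

A sample for the choice Σ=Σ1 would then be `apply_ar1_factor(draw_mixed_z(p, rng))` plus the mean vector μ, added coordinate-wise.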

Table 4 The empirical size and power at the significance level of 5 % in Example 4

Moreover, we plot the empirical distributions of SR under the settings of the four examples and compare them with the standard normal distribution. Fig. 1 confirms the asymptotic normality of SR given in Theorem 1.1.

Fig. 1 Tn under the null hypothesis with four different distributions of X
3 Real data application

In this section, we apply the proposed signed-rank-based method to an ophthalmic data set collected by the Beijing Tongren Eye Center and Anyang Eye Hospital. We take the data of the fifth and sixth grades of a class in the data set and apply the proposed method to study whether the visual factors and their interactions with eye habits differ between grades.

Firstly, we remove the visual factors and their interactions with eye habits whose proportion of missing values exceeds 15%, and impute the missing values of the remaining 945 factors with the sample mean. Then, we let Xi be the difference between the visual factors and their interactions with eye habits of the i-th student in the sixth grade and those in the fifth grade. We calculate the standard deviation of each dimension of X and show the distribution of the standard deviations in Fig. 2. These standard deviations differ, so scalar-invariant methods are expected to perform better in the analysis of this data set. Applying the proposed SR method, we obtain a p-value < $10^{-9}$, which indicates that the visual factors and their interactions with eye habits differ between grades. With the CQ, WPL and FZW methods, the p-values obtained are 0.4910, 0.4913 and < $10^{-9}$, respectively. Because the standard deviations of the dimensions differ, the CQ and WPL methods are relatively ineffective, while the p-values obtained through the FZW and SR methods are small.
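The preprocessing step (dropping factors with more than 15% missing values, then mean imputation) can be sketched as below. The data layout, a list of rows with `None` marking a missing value, and the function name `drop_and_impute` are assumptions of this sketch, since the paper does not describe the data format.

```python
def drop_and_impute(rows, max_missing=0.15):
    """Drop columns whose missing fraction exceeds max_missing,
    then fill remaining gaps with the column mean of observed values."""
    n = len(rows)
    kept_columns = []
    for k in range(len(rows[0])):
        col = [row[k] for row in rows]
        observed = [v for v in col if v is not None]
        if (n - len(observed)) / n > max_missing:
            continue                              # too many missing values
        mean = sum(observed) / len(observed)
        kept_columns.append([mean if v is None else v for v in col])
    # Transpose back to a list of rows.
    return [list(r) for r in zip(*kept_columns)]
```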

Fig. 2 The distribution of the standard deviations
References
[1] Bai Z D, Saranadasa H. Effect of high dimension: by an example of a two sample problem[J]. Statistica Sinica, 1996, 6(2): 311-329.
[2] Chen S X, Qin Y L. A two sample test for high dimensional data with applications to gene-set testing[J]. Annals of Statistics, 2010, 38(2): 808-835. DOI:10.1214/09-aos716
[3] Srivastava M S, Du M. A test for the mean vector with fewer observations than the dimension[J]. Journal of Multivariate Analysis, 2008, 99(3): 386-402. DOI:10.1016/j.jmva.2006.11.002
[4] Srivastava M S. A test for the mean vector with fewer observations than the dimension under non-normality[J]. Journal of Multivariate Analysis, 2009, 100(3): 518-532. DOI:10.1016/j.jmva.2008.06.006
[5] Srivastava M S, Katayama S, Kano Y. A two sample test in high dimensional data[J]. Journal of Multivariate Analysis, 2013, 114: 349-358. DOI:10.1016/j.jmva.2012.08.014
[6] Park J, Ayyala D N. A test for the mean vector in large dimension and small samples[J]. Journal of Statistical Planning and Inference, 2013, 143(5): 929-943. DOI:10.1016/j.jspi.2012.11.001
[7] Wang L, Peng B, Li R Z. A high-dimensional nonparametric multivariate test for mean vector[J]. Journal of the American Statistical Association, 2015, 110(512): 1658-1669. DOI:10.1080/01621459.2014.988215
[8] Feng L, Zou C L, Wang Z J. Multivariate-sign-based high-dimensional tests for the two-sample location problem[J]. Journal of the American Statistical Association, 2016, 111(514): 721-735. DOI:10.1080/01621459.2015.1035380