Statistics and Medical Data Analysis

Cite this article

WEI Zheng, ZHANG Rui. A note on measuring the non-linear dependence via sub-copula based regression[J]. Journal of Northwest University (Natural Science Edition), 2018, 48(1): 1-5. DOI: 10.16152/j.cnki.xdxbzr.2018-01-001.

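Corresponding author

ZHANG Rui, female, from Xi'an, Shaanxi; professor and doctoral supervisor at Northwest University; research interests include machine learning theory and algorithms, EEG and ECG data analysis, and neural population modeling.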

Article history

Received: 2017-10-11


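A discussion of a subcopula-based measure of nonlinear dependence

WEI Zheng¹, ZHANG Rui²

1. Department of Mathematics and Statistics, University of Maine, Orono, ME 04469-5752, USA;
2. Medical Big Data Research Center, Northwest University, Xi'an 710127, Shaanxi, China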

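Funding: National Natural Science Foundation of China (61473223); Shaanxi Province Industry-University-Research Collaborative Innovation Program (2017XT-016); Shaanxi Province Key Research and Development Program (2017ZDXM-GY-095)

About the author: WEI Zheng, male, assistant professor at the University of Maine, USA; research interests include applications of Bayesian methods to big data and the construction and application of asymmetric copulas.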

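Abstract: In recent years, research on statistical methods based on asymmetric dependence modeling has made rapid progress. A notable problem, however, is that the dependence measures most commonly used in applications, such as Pearson's correlation coefficient, Spearman's ρ and Kendall's τ, are not suitable for measuring non-linear relationships between the variables in a data set. This paper mainly reviews a new subcopula-based measure for describing asymmetric association in two-way contingency tables, which was proposed in (Wei and Kim, 2017). It also studies how the measure can be used to quantify non-linear association between continuous random variables. Notably, the procedure is non-parametric and assumes no parametric model for the random variables involved. Finally, the effectiveness of the method is illustrated with simulated data.

Keywords: non-linear measure; correlation coefficient; subcopula; two-way contingency table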

A note on measuring the non-linear dependence via sub-copula based regression

WEI Zheng¹, ZHANG Rui²

1. Department of Mathematics and Statistics, University of Maine, Orono, ME, 04469-5752, USA;
2. School of Mathematics, Northwest University, Xi'an 710127, China

Abstract: In recent years, the development of statistical methods for asymmetric dependence modeling has made exciting and rapid progress. However, the most prominent statistical measures of association in the literature, such as Pearson's correlation coefficient, Spearman's ρ and Kendall's τ, are not suitable when the data show a non-linear relationship between the variables. Motivated by this, we review a new subcopula-based measure of asymmetric association in a two-way contingency table, which was proposed by Wei and Kim (2017). We examine the use of the measure in detecting nonlinear association in data generated from continuous random variables. The procedure is non-parametric and assumes no parametric form for the associated random variables. It is illustrated through a simulation example.

Key words: asymmetric association; contingency table; regression; subcopula

1 Introduction

Modeling the dependence or association between two random variables plays an important role in the statistics literature, and there is an extensive body of work on measuring the strength of various kinds of association. The most prominent and popular measures of association between random variables include, but are not limited to, Pearson's correlation coefficient^[1], Spearman's ρ^[2] and Kendall's τ^[3].

However, these measures are not suitable if the data show a non-linear relationship between the variables. This is because Pearson's correlation coefficient r only measures the linear relationship between random variables, while Spearman's ρ and Kendall's τ are measures of monotonic relationships. As motivation, we provide a simple example below.

Example 1 Assume (X_i, Y_i), i=1, …, n, are independent and identically distributed (i.i.d.) random pairs with Y_i=5(X_i-0.5)²+σε_i, where X_i follows the uniform distribution on [0, 1] and ε_i, i=1, …, n, are i.i.d. standard normal random variables. Thus, in the true model considered here, the random variable Y is a non-linear (quadratic) function of X (but not vice versa). We simulate data for n=300 and n=1 000 and provide the scatter plots in Fig. 1.

Fig. 1 Scatterplots of bivariate data generated from the true model Y_i=5(X_i-0.5)² (solid line) for sample sizes n=300 (left) and n=1 000 (right)

The plots in Fig. 1 indicate a strong nonlinear relationship between the variables; however, Pearson's correlation coefficient r=0.021, Spearman's ρ=-0.008 and Kendall's τ=-0.004 for n=300, and r=-0.02, ρ=-0.003 and τ=-0.009 for n=1 000, are all very close to zero.
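The simulation in Example 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the noise level σ is not stated in the text, so `SIGMA = 0.1` is an assumption, as are the seed and the helper names (`simulate`, `ranks`, `pearson`, `spearman`, `kendall`). Only the standard library is used, and Kendall's τ is computed in its tau-a form, which is adequate here because continuous data have no ties. The exact values depend on the seed and on σ and will not reproduce the figures quoted above, but all three coefficients should come out close to zero.

```python
import math
import random

def simulate(n, sigma, seed):
    """Draw (X_i, Y_i) with X_i ~ U[0, 1] and Y_i = 5 (X_i - 0.5)^2 + sigma * eps_i."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    y = [5 * (xi - 0.5) ** 2 + sigma * rng.gauss(0, 1) for xi in x]
    return x, y

def ranks(values):
    """Ranks 1..n of the values (ties are not handled; the data are continuous)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    r = [0] * len(values)
    for position, k in enumerate(order, start=1):
        r[k] = position
    return r

def pearson(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n, s = len(x), 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return 2 * s / (n * (n - 1))

SIGMA = 0.1  # assumed value; the noise level is not given in the text

for n in (300, 1000):
    x, y = simulate(n, SIGMA, seed=2018)
    print(n, round(pearson(x, y), 3), round(spearman(x, y), 3),
          round(kendall(x, y), 3))
```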

Example 1 runs counter to the common belief that a linear relationship is more revealing about the underlying random variables than a nonlinear one: all three measures are close to zero even though Y is, up to noise, a function of X. We expect such problems to be even more severe and more difficult to detect for high-dimensional predictors.

In statistical association analysis, it is often assumed that the relevant random variables follow some parametric family of distributions. The parametric approach often suffers from shortcomings induced by misspecification of the parametric family. Additionally, even for the same type of association and the same marginal distributions, the associated joint distributions can be very different.

Therefore, a natural question is how we can determine or quantify the general association (linear or nonlinear) between random variables X and Y by utilizing a non-parametric approach. In this article, we review a new asymmetric dependence measure based on subcopula-based regression, which was proposed by Wei and Kim^[4], and focus on association analysis between continuous random variables, considering both the directionality and the nonlinearity of the interaction. The rest of the article is structured as follows. Section 2 briefly reviews the definition of the subcopula and the asymmetric dependence measure for contingency tables based on subcopula-based regression proposed in^[4] (Wei and Kim, 2017). Section 3 then examines the use of the measure for nonlinear association in data generated from continuous random variables. We end the article with a conclusion section.

2 Review of subcopulas and the subcopula-based measure of asymmetric association for contingency tables

In this section, we briefly review the definition of subcopulas and the asymmetric dependence measure for contingency tables based on subcopula-based regression, which was proposed in^[4] (Wei and Kim, 2017).

2.1 Review of subcopulas

Copulas are a tool for introducing and investigating the dependence between variables of interest. They are frequently applied in many areas, including finance^[5-8], insurance^[9] (Jaworski et al., 2010), risk management^[10] (Embrechts et al., 2002), health^[11] (Eluru et al., 2010) and environmental sciences^[12] (Zhang and Singh, 2007). Here, we briefly review the definitions of the bivariate subcopula and copula functions. For more details about copulas and dependence concepts in terms of subcopulas and copulas, see^[13-17] (Joe, 2014; Nelsen, 2006; Wei et al., 2015; Wei and Kim, 2017).

Definition 1^[14] (Nelsen, 2006) A bivariate subcopula (or 2-subcopula) is a function C^S: D₁×D₂ → [0, 1], where D₁ and D₂ are subsets of [0, 1] containing 0 and 1, satisfying the following properties:

(a) C^S is grounded, i.e., C^S(u, 0)=C^S(0, v)=0 for every u∈D₁ and v∈D₂;

(b) For every u∈D₁ and v∈D₂, C^S(1, v)=v and C^S(u, 1)=u;

(c) C^S is 2-increasing, i.e., for every rectangle J=[u₁, u₂]×[v₁, v₂] whose vertices lie in D₁×D₂,

vol C^S(J)=C^S(u₂, v₂)-C^S(u₂, v₁)-C^S(u₁, v₂)+C^S(u₁, v₁)≥0.

A bivariate copula (or 2-copula) C(u, ν) is a subcopula with D₁=D₂=[0, 1].

From Definition 1, a subcopula (copula) is an (unconditional) cumulative distribution function (CDF) on D₁×D₂ ([0, 1]²). Condition (b) indicates that a 2-copula C is a bivariate CDF with uniform margins. Let X and Y be two random variables (discrete or continuous) with joint CDF H(x, y) and marginal CDFs F(x) and G(y). By Sklar's theorem^[18] (Sklar, 1959), there exists a subcopula function C^S such that H(x, y)=C^S(F(x), G(y)). Furthermore, if X and Y are both continuous random variables, then there exists a unique copula C such that H(x, y)=C(F(x), G(y)).
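For two discrete variables, the relation H(x, y)=C^S(F(x), G(y)) can be made concrete: on the grid formed by the cumulative marginal proportions, the subcopula takes the values of the joint CDF. The sketch below builds these values for a small table of joint proportions and checks the three properties of Definition 1 numerically. The table `P` is a made-up example and the function names are hypothetical; this is an illustration under those assumptions, not part of the original paper.

```python
from itertools import accumulate

def subcopula_grid(p):
    """Return (u, v, C) with C[i][j] = C^S(u[i], v[j]) on the grid D1 x D2,
    where u[0] = v[0] = 0 and the last grid points equal 1."""
    n_rows, n_cols = len(p), len(p[0])
    row_marg = [sum(row) for row in p]
    col_marg = [sum(col) for col in zip(*p)]
    u = [0.0] + list(accumulate(row_marg))
    v = [0.0] + list(accumulate(col_marg))
    # C[i][j] is the joint CDF: the total mass in the first i rows and j columns.
    C = [[0.0] * (n_cols + 1) for _ in range(n_rows + 1)]
    for i in range(1, n_rows + 1):
        for j in range(1, n_cols + 1):
            C[i][j] = (C[i - 1][j] + C[i][j - 1] - C[i - 1][j - 1]
                       + p[i - 1][j - 1])
    return u, v, C

def is_subcopula(u, v, C, tol=1e-12):
    """Check properties (a)-(c) of Definition 1 on the grid, up to rounding."""
    rows, cols = range(len(u)), range(len(v))
    grounded = (all(abs(C[i][0]) <= tol for i in rows)
                and all(abs(C[0][j]) <= tol for j in cols))
    margins = (all(abs(C[i][-1] - u[i]) <= tol for i in rows)
               and all(abs(C[-1][j] - v[j]) <= tol for j in cols))
    # Checking adjacent grid cells is enough: the volume of any grid
    # rectangle is a sum of such cell volumes.
    two_increasing = all(
        C[i][j] - C[i][j - 1] - C[i - 1][j] + C[i - 1][j - 1] >= -tol
        for i in rows[1:] for j in cols[1:])
    return grounded and margins and two_increasing

# Hypothetical 2 x 3 table of joint cell proportions (sums to 1).
P = [[0.1, 0.2, 0.0],
     [0.3, 0.1, 0.3]]
u, v, C = subcopula_grid(P)
print(is_subcopula(u, v, C))
```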

2.2 Subcopula-based measure of asymmetric association for contingency tables

Wei and Kim (2017)^[4] proposed a new asymmetric association measure for contingency tables by utilizing subcopula-based regression. Suppose we have a two-way contingency table that cross-classifies n subjects/units according to I categories of the row variable and J categories of the column variable, and let the matrix of joint cell proportions in the I×J contingency table be P={p_ij}, where i=1, …, I, j=1, …, J and $\sum_{i=1}^{I}\sum_{j=1}^{J} p_{ij}=1$. Furthermore, let $p_{i\cdot}=\sum_{j=1}^{J} p_{ij}$ and $p_{\cdot j}=\sum_{i=1}^{I} p_{ij}$ denote the i-th row marginal relative frequency and the j-th column marginal relative frequency, respectively. Then the supports of the subcopula C^S associated with the table are given by

$ u_{i}=\sum_{s=1}^{i} p_{s\cdot} \quad \text{and} \quad v_{j}=\sum_{t=1}^{j} p_{\cdot t}. $

(1)

The marginal p.m.f.s and the joint p.m.f. of C^S associated with the table are, respectively, p₀(u_i)=p_i·, p₁(v_j)=p_·j and c^S(u_i, v_j)=p_ij. Furthermore, the conditional p.m.f.s are, respectively,

$ \begin{aligned} & p_{j|i}=c^{S}_{V|U}(v_{j}|u_{i})=\frac{c^{S}(u_{i}, v_{j})}{p_{0}(u_{i})}=\frac{p_{ij}}{p_{i\cdot}}, \\ & p_{i|j}=c^{S}_{U|V}(u_{i}|v_{j})=\frac{c^{S}(u_{i}, v_{j})}{p_{1}(v_{j})}=\frac{p_{ij}}{p_{\cdot j}}. \end{aligned} $

(2)

Then, one can quantitatively measure the asymmetric dependence between the two variables X and Y in an I×J contingency table.

Definition 2 (Wei and Kim, 2017)^[4] Given an I×J contingency table, the subcopula-based measures of asymmetric association of the column variable Y on the row variable X and of the row variable X on the column variable Y are defined, respectively, as follows:

$ \rho_{(X\to Y)}^{2}=\frac{\sum_{i=1}^{I}\left(\sum_{j=1}^{J} v_{j}p_{j|i}-\sum_{j=1}^{J} v_{j}p_{\cdot j}\right)^{2}p_{i\cdot}}{\sum_{j=1}^{J}\left(v_{j}-\sum_{t=1}^{J} v_{t}p_{\cdot t}\right)^{2}p_{\cdot j}}, $

and

$ \rho_{(Y\to X)}^{2}=\frac{\sum_{j=1}^{J}\left(\sum_{i=1}^{I} u_{i}p_{i|j}-\sum_{i=1}^{I} u_{i}p_{i\cdot}\right)^{2}p_{\cdot j}}{\sum_{i=1}^{I}\left(u_{i}-\sum_{s=1}^{I} u_{s}p_{s\cdot}\right)^{2}p_{i\cdot}}. $

(3)

The subcopula-based asymmetric association measure in Definition 2 has several nice properties. For example, it can identify nonlinear or asymmetric relations between the variables in a two-way contingency table. To be specific, if a random variable Y is a function of a random variable X almost surely, then ρ_(X→Y)²=1, and if X and Y are independent, then ρ_(X→Y)²=ρ_(Y→X)²=0 (for more details, see Proposition 3.2 in^[4] (Wei and Kim, 2017)). Motivated by this property, we apply the subcopula-based asymmetric association measure in Definition 2 to continuous random variables in the next section.
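Eq. (3) can be evaluated directly from a table of joint proportions. The following sketch is a plain transcription of Eqs. (1)-(3) under the assumption that each margin has at least two categories with positive mass (otherwise the denominator is zero). The function name and the two small tables are hypothetical and chosen only to illustrate the properties just stated: a table in which Y is a function of X but not vice versa, and a table with independent margins.

```python
from itertools import accumulate

def asymmetric_association(p):
    """Return (rho2_x_to_y, rho2_y_to_x) of Eq. (3) for joint cell
    proportions p[i][j]; rows index X, columns index Y."""
    n_rows, n_cols = len(p), len(p[0])
    row_marg = [sum(row) for row in p]            # p_{i.}
    col_marg = [sum(col) for col in zip(*p)]      # p_{.j}
    u = list(accumulate(row_marg))                # u_i in Eq. (1)
    v = list(accumulate(col_marg))                # v_j in Eq. (1)

    mean_u = sum(ui * pi for ui, pi in zip(u, row_marg))
    mean_v = sum(vj * pj for vj, pj in zip(v, col_marg))
    var_u = sum((ui - mean_u) ** 2 * pi for ui, pi in zip(u, row_marg))
    var_v = sum((vj - mean_v) ** 2 * pj for vj, pj in zip(v, col_marg))

    # Numerator of rho2_(X->Y): spread of the conditional means
    # sum_j v_j p_{j|i} around the overall mean, weighted by p_{i.}.
    num_xy = 0.0
    for i in range(n_rows):
        if row_marg[i] > 0:  # empty rows carry no mass
            cond = sum(v[j] * p[i][j] for j in range(n_cols)) / row_marg[i]
            num_xy += (cond - mean_v) ** 2 * row_marg[i]

    # Numerator of rho2_(Y->X): the same with the roles exchanged.
    num_yx = 0.0
    for j in range(n_cols):
        if col_marg[j] > 0:
            cond = sum(u[i] * p[i][j] for i in range(n_rows)) / col_marg[j]
            num_yx += (cond - mean_u) ** 2 * col_marg[j]

    return num_xy / var_v, num_yx / var_u

# Y is a function of X (each row has a single non-empty cell), but X is
# not a function of Y (the first column is shared by two rows).
functional = [[0.25, 0.0],
              [0.0, 0.5],
              [0.25, 0.0]]
# Independent margins: p_ij = p_i. * p_.j.
independent = [[r * c for c in (0.4, 0.6)] for r in (0.2, 0.3, 0.5)]

print(asymmetric_association(functional))
print(asymmetric_association(independent))
```

For the first table the X→Y value equals 1 while the Y→X value is small, which shows the asymmetry of the measure; for the second table both values vanish up to rounding error.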

3 An example for continuous random variables

The asymmetric association measure ρ² in Definition 2 measures the association between the discrete random variables associated with a two-way contingency table, without parametric assumptions on the joint distribution. In this section, we develop a procedure for applying the subcopula-based asymmetric association measure to continuous random variables and illustrate it using Example 1.

First, in order to apply the asymmetric association measure ρ² in Definition 2 to continuous random variables, the data must be quantized into categorical data. We can construct an I×I contingency table by classifying the n data points into I categories for each variable. For example, if we set I=5 for the data (n=300 and n=1 000) simulated in Example 1, we obtain the two contingency tables in Tab. 1. Note that for this example we classify each variable into I equal-width categories, i.e., [a, a+(b-a)/I), [a+(b-a)/I, a+2(b-a)/I), …, [a+(I-1)(b-a)/I, b], where b=max{x₁, …, x_n} and a=min{x₁, …, x_n} for X (b=max{y₁, …, y_n} and a=min{y₁, …, y_n} for Y).
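The quantization step can be sketched as follows. The function names and the five data points are hypothetical; the binning follows the equal-width rule described above, with the last interval closed on the right so that the maximum falls into category I. The sketch assumes the values are not all identical, so that the bin width is positive.

```python
def equal_width_bin(values, n_bins):
    """Map each value to a category index 0..n_bins-1 using n_bins
    equal-width intervals on [min(values), max(values)]."""
    a, b = min(values), max(values)
    width = (b - a) / n_bins  # assumes b > a
    # The maximum would land in index n_bins, so it is moved into the
    # last (right-closed) interval.
    return [min(int((x - a) / width), n_bins - 1) for x in values]

def contingency_table(x, y, n_bins):
    """Counts n_ij of the n_bins x n_bins table for paired data (x, y);
    rows are the categories of x, columns the categories of y."""
    rows = equal_width_bin(x, n_bins)
    cols = equal_width_bin(y, n_bins)
    table = [[0] * n_bins for _ in range(n_bins)]
    for i, j in zip(rows, cols):
        table[i][j] += 1
    return table

# Five made-up points, quantized into a 2 x 2 table.
x = [0.0, 0.1, 0.5, 0.9, 1.0]
y = [1.25, 0.8, 0.0, 0.8, 1.25]
for row in contingency_table(x, y, n_bins=2):
    print(row)
```

Dividing each count by n gives the cell proportions p_ij used in Eqs. (1)-(3).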

Tab. 1 Two-way contingency tables constructed from categorical data with I=5, quantized from the continuous data in Example 1

Second, based on the contingency table obtained in the first step for each number of categories I, the subcopula-based asymmetric association measures ρ_(X→Y)² and ρ_(Y→X)² in Eq. (3) can be estimated by using the estimators $\hat{\rho}_{(X\to Y)}^{2}$ and $\hat{\rho}_{(Y\to X)}^{2}$ given in Eq. (14) of^[4] (Wei and Kim, 2017).

For example, for the two contingency tables in Tab. 1, we obtain $\hat{\rho}_{(X\to Y)}^{2}$=0.608 and $\hat{\rho}_{(Y\to X)}^{2}$=0.009 for the data set with n=300, and $\hat{\rho}_{(X\to Y)}^{2}$=0.647 and $\hat{\rho}_{(Y\to X)}^{2}$=0.01 for the data set with n=1 000.

Third, we estimate $\hat{\rho}_{(X\to Y)}^{2}$ and $\hat{\rho}_{(Y\to X)}^{2}$ for each contingency table with I=2, 3, … categories, until the asymmetric association measure converges to a "true" value. For the simulated data in Example 1, $\hat{\rho}_{(Y\to X)}^{2}$ is very small and close to zero for both n=300 and n=1 000, so we focus on $\hat{\rho}_{(X\to Y)}^{2}$ below.

Figure 2 shows the values of the subcopula-based asymmetric association measure $\hat{\rho}_{(X\to Y)}^{2}$ with respect to the number of categories for sample sizes n=300 (left) and n=1 000 (right). We observe that $\hat{\rho}_{(X\to Y)}^{2}$ approaches a fixed value of approximately 0.88 as the number of categories increases, for both n=300 and n=1 000. Therefore, $\hat{\rho}_{(X\to Y)}^{2}$ successfully detects the nonlinear association between X and Y.

Fig. 2 The subcopula-based asymmetric association measure ρ_(X→Y)² with respect to the number of categories for sample sizes n=300 (left) and n=1 000 (right)
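The three steps of this section can be put together in a short, self-contained sketch. Two points are assumptions rather than the authors' method: the estimator in Eq. (14) of^[4] is not reproduced in this article, so the code uses the plug-in version of Eq. (3) with cell proportions n_ij/n, and the noise level `SIGMA = 0.1` and the seed are arbitrary. All function names are hypothetical. The printed values therefore need not match the figures reported above, but the X→Y estimate should be clearly larger than the Y→X estimate for every number of categories.

```python
import random
from itertools import accumulate

def simulate(n, sigma, seed):
    """Example 1: X ~ U[0, 1], Y = 5 (X - 0.5)^2 + sigma * eps."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    y = [5 * (xi - 0.5) ** 2 + sigma * rng.gauss(0, 1) for xi in x]
    return x, y

def bin_index(values, n_bins):
    """Equal-width category index 0..n_bins-1 on [min, max]."""
    a, b = min(values), max(values)
    width = (b - a) / n_bins
    return [min(int((x - a) / width), n_bins - 1) for x in values]

def cell_proportions(x, y, n_bins):
    """Step 1: quantize the data into an I x I table of proportions n_ij / n."""
    n = len(x)
    p = [[0.0] * n_bins for _ in range(n_bins)]
    for i, j in zip(bin_index(x, n_bins), bin_index(y, n_bins)):
        p[i][j] += 1.0 / n
    return p

def rho2(p):
    """Step 2: plug-in value of rho2_(X->Y) in Eq. (3); rows are X, columns Y."""
    row_marg = [sum(row) for row in p]
    col_marg = [sum(col) for col in zip(*p)]
    v = list(accumulate(col_marg))
    mean_v = sum(vj * pj for vj, pj in zip(v, col_marg))
    var_v = sum((vj - mean_v) ** 2 * pj for vj, pj in zip(v, col_marg))
    explained = 0.0
    for row, pi in zip(p, row_marg):
        if pi > 0:  # empty categories carry no mass
            cond_mean = sum(vj * pij for vj, pij in zip(v, row)) / pi
            explained += (cond_mean - mean_v) ** 2 * pi
    return explained / var_v

def transpose(p):
    return [list(col) for col in zip(*p)]

def association_path(x, y, max_bins):
    """Step 3: (I, rho2_x_to_y, rho2_y_to_x) for I = 2, ..., max_bins."""
    path = []
    for n_bins in range(2, max_bins + 1):
        p = cell_proportions(x, y, n_bins)
        # Exchanging rows and columns exchanges the roles of X and Y.
        path.append((n_bins, rho2(p), rho2(transpose(p))))
    return path

SIGMA = 0.1  # assumed value; the noise level is not given in the text
x, y = simulate(1000, SIGMA, seed=2018)
for n_bins, r_xy, r_yx in association_path(x, y, max_bins=20):
    print(n_bins, round(r_xy, 3), round(r_yx, 3))
```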

4 Conclusion

In this paper, we have reviewed a subcopula-based measure of asymmetric association for a two-way contingency table, which was proposed in (Wei and Kim, 2017)^[4]. We applied the measure to a data set with a non-linear relationship and showed that it can be used as a tool to detect non-linear association. The procedure was illustrated via a simulation example.

References

[1]	PEARSON K. Note on regression and inheritance in the case of two parents[J]. Proceedings of the Royal Society of London, 1895, 58: 240-242. DOI:10.1098/rspl.1895.0041
[2]	SPEARMAN C. The proof and measurement of association between two things[J]. The American Journal of Psychology, 1904, 15(1): 72-101. DOI:10.2307/1412159
[3]	KENDALL M G. A new measure of rank correlation[J]. Biometrika, 1938, 30(1/2): 81-93. DOI:10.2307/2332226
[4]	WEI Z, KIM D. Subcopula-based Measure of Asymmetric Association for Contingency Tables[J]. Statistics in Medicine, 2017, 36(24): 3875-3894. DOI:10.1002/sim.v36.24
[5]	PATTON A J. Modelling asymmetric exchange rate dependence[J]. International Economic Review, 2006, 47(2): 527-556. DOI:10.1111/iere.2006.47.issue-2
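[6]	ZHANG Yaoting. Copula technique and financial risk analysis[J]. Statistical Research, 2002, 4: 48-51. (in Chinese)
[7]	WU Zhenxiang, YE Wuyi, MIAO Baiqi. Risk analysis of foreign exchange portfolios based on copulas[J]. Chinese Journal of Management Science, 2004(4): 1-5. (in Chinese)
[8]	GONG Jinguo, SHI Daimin. Nonparametric inference for time-varying copula models[J]. The Journal of Quantitative & Technical Economics, 2011(7): 137-150. (in Chinese)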
[9]	JAWORSKI P, DURANTE F, HARDLE W K, et al. Copula Theory and Its Applications[M]. Berlin, Heidelberg: Springer, 2010.
[10]	EMBRECHTS P, MCNEIL A, STRAUMANN D. Correlation and Dependence in Risk Management:Properties and Pitfalls[M]. Risk Management: Value at Risk and Beyond, 2002: 176-223.
[11]	ELURU N, PALETI R, PENDYALA R, et al. Modeling injury severity of multiple occupants of vehicles: Copula-based multivariate approach[J]. Transportation Research Record: Journal of the Transportation Research Board, 2010(2165): 1-11.
[12]	ZHANG L, SINGH V P. Bivariate rainfall frequency distributions using archimedean copulas[J]. Journal of Hydrology, 2007, 332(1): 93-109.
[13]	JOE H. Dependence Modeling with Copulas[M]. London: Chapman & Hall, 2014.
[14]	NELSEN R B. An Introduction to Copulas(second edition)[M]. New York: Springer, 2006.
[15]	WEI Z, WANG T, NGUYEN P A. Multivariate dependence concepts through copulas[J]. International Journal of Approximate Reasoning, 2015, 65: 24-33. DOI:10.1016/j.ijar.2015.04.004
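[16]	GONG Jinguo, SHI Daimin. Large-sample properties of nonparametric estimation for time-varying copula models[J]. Journal of Zhejiang University (Science Edition), 2012(6): 630-634. (in Chinese)
[17]	GONG Jinguo, DENG Ruqiao. Statistical inference for time-varying C-Vine copula models: based on generalized autoregressive score theory[J]. Statistical Research, 2015(4): 97-103. (in Chinese)
[18]	SKLAR M. Fonctions de répartition à n dimensions et leurs marges[J]. Publications de l'Institut de Statistique de l'Université de Paris, 1959, 8: 229-231.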