2. Medical Big Data Research Center, Northwest University, Xi'an, Shaanxi 710127, China
2. School of Mathematics, Northwest University, Xi′an 710127, China
Modeling the dependence or association between two random variables plays an important role in the statistics literature, and there is an extensive body of work on measuring the strength of various kinds of association. The most prominent and popular measures of association between random variables include, but are not limited to, Pearson's correlation coefficient[1], Spearman's ρ[2] and Kendall's τ[3].
However, these measures are not suitable if the data show a non-linear relationship between the variables. This is because Pearson's correlation coefficient r only measures the linear relationship between random variables, while Spearman's ρ and Kendall's τ are measures of monotonic relationships. As a motivation, we provide a simple example below.
Example 1 Assume that $(X_i, Y_i)$, i=1, …, n, are independent and identically distributed (i.i.d.) random vectors with $Y_i=5(X_i-0.5)^2+\sigma\epsilon_i$, where $X_i$ follows the uniform distribution on [0, 1] and $\epsilon_i$, i=1, …, n, are i.i.d. standard normal random variables. Thus, the true model considered here is that the random variable Y is a non-linear (quadratic) function of X (but not vice versa). We simulate data for n=300 and n=1 000 and provide the scatter plots in Fig. 1.
Fig. 1 Scatterplots of bivariate data generated from the true model $Y_i=5(X_i-0.5)^2$ (solid line) for sample sizes n=300 (left) and n=1 000 (right)
The plots in Fig. 1 indicate a strong nonlinear relationship between the variables. However, Pearson's correlation coefficient r=0.021, Spearman's ρ=-0.008, and Kendall's τ=-0.004 for n=300, and r=-0.02, ρ=-0.003, and τ=-0.009 for n=1 000, are all very close to zero.
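The behaviour reported for Example 1 can be explored with a short simulation. The following is a minimal sketch using only the Python standard library; the noise level `sigma=0.1` and the seed are assumptions (the text does not state the value of σ), and the helper names are hypothetical, so the printed values will not match the reported ones digit for digit.

```python
import random

def simulate_example1(n, sigma=0.1, seed=0):
    """Draw x_i ~ U[0, 1] and y_i = 5 (x_i - 0.5)^2 + sigma * eps_i (sigma is assumed)."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [5.0 * (x - 0.5) ** 2 + sigma * rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

def pearson(xs, ys):
    """Pearson's correlation coefficient r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """Ranks 1..n; ties are ignored because the simulated data are continuous."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = float(rank)
    return r

def spearman(xs, ys):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

def kendall(xs, ys):
    """Kendall's tau (tau-a) by direct counting of concordant/discordant pairs."""
    n = len(xs)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
            s += (prod > 0) - (prod < 0)
    return 2.0 * s / (n * (n - 1))

for n in (300, 1000):
    xs, ys = simulate_example1(n)
    print(n, round(pearson(xs, ys), 3), round(spearman(xs, ys), 3), round(kendall(xs, ys), 3))
```

Because the quadratic curve is symmetric about x=0.5, all three coefficients should come out near zero for any reasonable seed, even though Y is almost a deterministic function of X.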
Example 1 contradicts the common belief that a linear relationship is more revealing about the underlying random variables than a nonlinear one: the dependence here is strong, yet all three measures are close to zero. We expect such problems to be even more severe and more difficult to detect for high-dimensional predictors.
In statistical association analysis, it is often assumed that the relevant random variables follow some parametric family of distributions. The parametric approach can suffer from misspecification of the parametric family. Additionally, even for the same type of association and the same marginal distributions, the associated joint distributions can be very different.
Therefore, a natural question is how to determine or quantify the general association (linear or nonlinear) between random variables X and Y by using a non-parametric approach. In this article, we review a new asymmetric dependence measure based on subcopula-based regression, which was proposed by Wei and Kim[4], and focus on association analysis between continuous random variables, considering both the directionality and the nonlinearity of the interaction. The rest of the article is structured as follows. Section 2 briefly reviews the definition of the subcopula and the asymmetric dependence measure for contingency tables based on subcopula-based regression proposed in[4] (Wei and Kim, 2017). Section 3 then examines the use of the measure for nonlinear association in data generated from continuous random variables. We end the article with a conclusion section.
2 Review on subcopula and subcopula-based measure of asymmetric association for contingency tables
In this section, we briefly review the definition of subcopulas and the asymmetric dependence measure for contingency tables based on subcopula-based regression, which was proposed in[4] (Wei and Kim, 2017).
2.1 Review on subcopula
A copula is a tool for introducing and investigating the dependence between variables of interest. Copulas are frequently applied in many areas, including finance[5-8], insurance[9] (Jaworski et al., 2010), risk management[10] (Embrechts et al., 2002), health[11] (Eluru et al., 2010) and environmental sciences[12] (Zhang and Singh, 2007). Here, we briefly review the definitions of the bivariate subcopula and copula functions. For more details about copulas and dependence concepts in terms of subcopulas and copulas, see[13-17] (Joe, 2014; Nelsen, 2006; Wei et al., 2015; Wei and Kim, 2017).
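Definition 1[14] (Nelsen, 2006) A bivariate subcopula (or 2-subcopula) is a function $C^S: D_1\times D_2\to[0, 1]$, where $D_1$ and $D_2$ are subsets of [0, 1] containing 0 and 1, that satisfies the following properties:
(a) $C^S$ is grounded, i.e., $C^S(u, 0)=C^S(0, v)=0$ for every $u\in D_1$, $v\in D_2$;
(b) For every $u\in D_1$, $v\in D_2$, $C^S(1, v)=v$ and $C^S(u, 1)=u$;
(c) $C^S$ is 2-increasing in the sense that, for any $u_1\le u_2$, $v_1\le v_2$ with $u_i\in D_1$, $v_i\in D_2$, i=1, 2,
$\mathrm{vol}_{C^S}(J)=C^S(u_2, v_2)-C^S(u_2, v_1)-C^S(u_1, v_2)+C^S(u_1, v_1)\ge 0$, where $J=[u_1, u_2]\times[v_1, v_2]$.
A bivariate copula (or 2-copula) C(u, v) is a subcopula with $D_1=D_2=[0, 1]$.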
From Definition 1, a subcopula (copula) is an (unconditional) cumulative distribution function (CDF) on $D_1\times D_2$ ($[0, 1]^2$). Condition (b) indicates that a 2-copula C is a bivariate CDF with uniform margins. Let X and Y be two random variables (discrete or continuous) with joint CDF H(x, y) and marginal CDFs F(x) and G(y). By Sklar's theorem[18] (Sklar, 1959), there exists a subcopula function $C^S$ such that $H(x, y)=C^S(F(x), G(y))$. Furthermore, if X and Y are both continuous random variables, then there exists a unique copula C such that H(x, y)=C(F(x), G(y)).
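Conditions (a)-(c) of Definition 1 can be checked numerically when the domain is a finite grid. The sketch below is an illustration only: the function name `is_subcopula`, the tolerance `tol` and the example grid are hypothetical choices, not part of the cited definition.

```python
from itertools import combinations

def is_subcopula(C, D1, D2, tol=1e-12):
    """Check conditions (a)-(c) of Definition 1 for C on the finite grid D1 x D2."""
    D1, D2 = sorted(D1), sorted(D2)
    # Both domains must contain 0 and 1.
    if D1[0] != 0 or D1[-1] != 1 or D2[0] != 0 or D2[-1] != 1:
        return False
    # (a) grounded: C(u, 0) = C(0, v) = 0.
    if any(abs(C(u, 0)) > tol for u in D1) or any(abs(C(0, v)) > tol for v in D2):
        return False
    # (b) margins: C(u, 1) = u and C(1, v) = v.
    if any(abs(C(u, 1) - u) > tol for u in D1) or any(abs(C(1, v) - v) > tol for v in D2):
        return False
    # (c) 2-increasing: every rectangle with corners on the grid has volume >= 0.
    for u1, u2 in combinations(D1, 2):
        for v1, v2 in combinations(D2, 2):
            vol = C(u2, v2) - C(u2, v1) - C(u1, v2) + C(u1, v1)
            if vol < -tol:
                return False
    return True

grid = [0, 0.25, 0.5, 0.75, 1]
print(is_subcopula(lambda u, v: u * v, grid, grid))      # independence: C(u, v) = uv
print(is_subcopula(lambda u, v: u * v * v, grid, grid))  # violates condition (b)
```

The check only inspects grid points, so it verifies the subcopula conditions on $D_1\times D_2$ and says nothing about values off the grid.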
2.2 Subcopula-based measure of asymmetric association for contingency tables
Wei and Kim (2017)[4] proposed a new asymmetric association measure for contingency tables by utilizing subcopula-based regression. Suppose we have a two-way contingency table that cross-classifies n subjects/units according to I categories of the row variable and J categories of the column variable, and let the matrix of joint cell proportions in the I×J contingency table be $P=\{p_{ij}\}$, where i=1, …, I, j=1, …, J. Let $p_{i\cdot}=\sum_{j=1}^{J}p_{ij}$ and $p_{\cdot j}=\sum_{i=1}^{I}p_{ij}$ denote the row and column marginal proportions, and define
| $ u_{i}=\sum\limits_{s=1}^{i}p_{s\cdot}\ \ \text{and}\ \ v_{j}=\sum\limits_{t=1}^{j}p_{\cdot t}. $ | (1) |
The marginal p.m.f.s and the joint p.m.f. $c$ of the subcopula $C^S$ associated with the table are, respectively, $p_0(u_i)=p_{i\cdot}$, $p_1(v_j)=p_{\cdot j}$, and $c(u_i, v_j)=p_{ij}$. Furthermore, the two conditional p.m.f.s are, respectively,
| $ \begin{align} & {{p}_{j|i}}={{c}_{V|U}}\left( {{v}_{j}}|{{u}_{i}} \right)=\frac{c\left( {{u}_{i}}, {{v}_{j}} \right)}{{{p}_{0}}\left( {{u}_{i}} \right)}=\frac{{{p}_{ij}}}{{{p}_{i\cdot }}} \\ & {{p}_{i|j}}={{c}_{U|V}}\left( {{u}_{i}}|{{v}_{j}} \right)=\frac{c\left( {{u}_{i}}, {{v}_{j}} \right)}{{{p}_{1}}\left( {{v}_{j}} \right)}=\frac{{{p}_{ij}}}{{{p}_{\cdot j}}}. \\ \end{align} $ | (2) |
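The quantities in Eqs. (1) and (2) are simple to compute from a table of counts. The following sketch assumes a hypothetical 2×3 table and a hypothetical helper name `table_quantities`; it also assumes every row and column has a positive marginal proportion, so that the conditional p.m.f.s are well defined.

```python
def table_quantities(counts):
    """Cell proportions, margins, cumulative margins (Eq. 1) and conditional p.m.f.s (Eq. 2)."""
    n = sum(sum(row) for row in counts)
    I, J = len(counts), len(counts[0])
    p = [[c / n for c in row] for row in counts]                # p_ij
    p_row = [sum(p[i]) for i in range(I)]                       # p_{i.}
    p_col = [sum(p[i][j] for i in range(I)) for j in range(J)]  # p_{.j}
    u = [sum(p_row[: i + 1]) for i in range(I)]                 # u_i in Eq. (1)
    v = [sum(p_col[: j + 1]) for j in range(J)]                 # v_j in Eq. (1)
    # Both conditional tables are indexed [i][j]; margins are assumed positive.
    p_j_given_i = [[p[i][j] / p_row[i] for j in range(J)] for i in range(I)]
    p_i_given_j = [[p[i][j] / p_col[j] for j in range(J)] for i in range(I)]
    return {"p": p, "u": u, "v": v,
            "p_j_given_i": p_j_given_i, "p_i_given_j": p_i_given_j}

# Hypothetical 2 x 3 table of counts, used only for illustration.
q = table_quantities([[10, 20, 10], [30, 20, 10]])
print(q["u"], q["v"])
print(q["p_j_given_i"][0])
```

Each row of `p_j_given_i` sums to one, and each column of `p_i_given_j` sums to one, which is a quick sanity check on Eq. (2).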
Then, one can quantitatively measure the asymmetric dependence between the two variables X and Y in an I×J contingency table.
Definition 2 (Wei and Kim, 2017)[4] Given an I×J contingency table, the subcopula-based measures of asymmetric association of the column variable Y on the row variable X, and of the row variable X on the column variable Y, are defined, respectively, as follows:
| $ \rho_{(X\to Y)}^{2}=\frac{\sum\limits_{i=1}^{I}\left(\sum\limits_{j=1}^{J}v_{j}p_{j|i}-\sum\limits_{j=1}^{J}v_{j}p_{\cdot j}\right)^{2}p_{i\cdot}}{\sum\limits_{j=1}^{J}\left(v_{j}-\sum\limits_{j=1}^{J}v_{j}p_{\cdot j}\right)^{2}p_{\cdot j}}, $ |
and
| $ \rho_{(Y\to X)}^{2}=\frac{\sum\limits_{j=1}^{J}\left(\sum\limits_{i=1}^{I}u_{i}p_{i|j}-\sum\limits_{i=1}^{I}u_{i}p_{i\cdot}\right)^{2}p_{\cdot j}}{\sum\limits_{i=1}^{I}\left(u_{i}-\sum\limits_{i=1}^{I}u_{i}p_{i\cdot}\right)^{2}p_{i\cdot}}. $ | (3) |
The subcopula-based asymmetric association measure in Definition 2 has several nice properties. For example, one can identify a nonlinear or asymmetric relation between the variables in a two-way contingency table. Specifically, if the random variable Y is a function of the random variable X almost surely, then $\rho_{(X\to Y)}^{2}=1$, and if X and Y are independent, then $\rho_{(X\to Y)}^{2}=\rho_{(Y\to X)}^{2}=0$ (for more details, see Proposition 3.2 in[4] (Wei and Kim, 2017)). Motivated by this property, we apply the subcopula-based asymmetric association measure in Definition 2 to continuous random variables in the next section.
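To make Eq. (3) and the two properties above concrete, the following sketch evaluates both measures from a table of counts. The function names and the two small tables are hypothetical; the code plugs observed proportions directly into Eq. (3) and assumes that at least two columns (respectively rows) have positive mass, so the denominators are nonzero.

```python
def rho2_x_to_y(counts):
    """rho^2_(X->Y) of Eq. (3) for an I x J table of counts (rows: X, columns: Y)."""
    n = sum(sum(row) for row in counts)
    I, J = len(counts), len(counts[0])
    p = [[c / n for c in row] for row in counts]
    p_row = [sum(row) for row in p]                             # p_{i.}
    p_col = [sum(p[i][j] for i in range(I)) for j in range(J)]  # p_{.j}
    v = [sum(p_col[: j + 1]) for j in range(J)]                 # v_j, Eq. (1)
    mean_v = sum(v[j] * p_col[j] for j in range(J))
    var_v = sum((v[j] - mean_v) ** 2 * p_col[j] for j in range(J))
    num = 0.0
    for i in range(I):
        if p_row[i] == 0:
            continue  # an empty row carries no weight in the numerator
        cond_mean = sum(v[j] * p[i][j] for j in range(J)) / p_row[i]
        num += (cond_mean - mean_v) ** 2 * p_row[i]
    return num / var_v

def rho2_y_to_x(counts):
    """rho^2_(Y->X): the same formula applied to the transposed table."""
    return rho2_x_to_y([list(col) for col in zip(*counts)])

# Hypothetical tables: in the first, each row has a single nonzero cell,
# so Y is a function of X (but X is not a function of Y).
functional = [[10, 0], [0, 20], [15, 0]]
independent = [[10, 20], [20, 40]]
print(rho2_x_to_y(functional), rho2_y_to_x(functional))
print(rho2_x_to_y(independent), rho2_y_to_x(independent))
```

For the first table the X→Y measure equals one while the Y→X measure is far below one, which illustrates the asymmetry; for the second table both measures vanish up to rounding.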
3 One example for continuous random variables
The asymmetric association measure $\rho^2$ in Definition 2 measures the association between the discrete random variables associated with a two-way contingency table, without parametric assumptions on the joint distribution. In this section, we develop a procedure for applying the subcopula-based asymmetric association measure to continuous random variables and illustrate it with Example 1, as follows.
First, in order to apply the asymmetric association measure $\rho^2$ in Definition 2 to continuous random variables, the data must be quantized into categorical data. We can construct an I×I contingency table by classifying the n data points into I categories for each variable. For example, if we set I=5 for the data (n=300 and n=1 000) simulated from Example 1, we obtain the two contingency tables in Tab. 1. Note that for this example, we can classify each variable into I equal-width categories, i.e., $[a, a+(b-a)/I)$, $[a+(b-a)/I, a+2(b-a)/I)$, …, $[a+(I-1)(b-a)/I, b]$, where $b=\max\{x_1, …, x_n\}$ for X ($b=\max\{y_1, …, y_n\}$ for Y) and $a=\min\{x_1, …, x_n\}$ for X ($a=\min\{y_1, …, y_n\}$ for Y).
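The quantization step can be sketched as follows. The helper names are hypothetical, and the simulated data use an assumed noise level `sigma=0.1` and an arbitrary seed, since the text does not report the value of σ; the resulting counts are therefore illustrative and are not the entries of Tab. 1.

```python
import random

def equal_width_bin(value, a, b, I):
    """Index (0..I-1) of the equal-width category of [a, b] containing value.

    Categories are half-open on the right, except the last one, which includes b.
    """
    if b == a:
        return 0
    k = int((value - a) / (b - a) * I)
    return min(k, I - 1)

def contingency_table(xs, ys, I):
    """I x I table of counts; rows are categories of X, columns are categories of Y."""
    ax, bx = min(xs), max(xs)
    ay, by = min(ys), max(ys)
    table = [[0] * I for _ in range(I)]
    for x, y in zip(xs, ys):
        table[equal_width_bin(x, ax, bx, I)][equal_width_bin(y, ay, by, I)] += 1
    return table

# Data in the spirit of Example 1 (sigma and the seed are assumptions).
rng = random.Random(0)
xs = [rng.random() for _ in range(300)]
ys = [5.0 * (x - 0.5) ** 2 + 0.1 * rng.gauss(0.0, 1.0) for x in xs]
for row in contingency_table(xs, ys, 5):
    print(row)
```

Equal-width categories follow the text; equal-frequency (quantile) categories would be an alternative design, but that choice is not discussed in the source.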
Tab. 1 Two-way contingency tables constructed based on categorical data with I=5 quantized from the continuous data in Example 1
Second, based on the contingency table associated with each number of categories I obtained in the first step, the subcopula-based asymmetric association measures $\rho_{(X\to Y)}^{2}$ and $\rho_{(Y\to X)}^{2}$ in Eq. (3) can be estimated by using the estimators
For example, for the two contingency tables given in Tab. 1, we obtain
Third, we estimate
Figure 2 shows the values of the subcopula based asymmetric association measure
Fig. 2 The subcopula-based asymmetric association measure $\rho_{X\to Y}^{2}$ with respect to the number of categories for sample sizes n=300 (left) and n=1 000 (right)
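The whole procedure (quantize, build the I×I table, evaluate Eq. (3)) can be sketched end to end for a range of category counts. Everything below is an illustration under stated assumptions: the noise level `sigma=0.1`, the seed, the range of I and the helper names are hypothetical, and observed cell proportions are plugged directly into Eq. (3), which may differ from the exact estimators used for Fig. 2.

```python
import random

def simulate_example1(n, sigma=0.1, seed=0):
    """x_i ~ U[0, 1], y_i = 5 (x_i - 0.5)^2 + sigma * eps_i (sigma is assumed)."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [5.0 * (x - 0.5) ** 2 + sigma * rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

def bin_index(value, a, b, I):
    """Equal-width category index in 0..I-1; the maximum falls in the last category."""
    return min(int((value - a) / (b - a) * I), I - 1)

def contingency_table(xs, ys, I):
    """I x I table of counts (rows: X categories, columns: Y categories)."""
    ax, bx, ay, by = min(xs), max(xs), min(ys), max(ys)
    table = [[0] * I for _ in range(I)]
    for x, y in zip(xs, ys):
        table[bin_index(x, ax, bx, I)][bin_index(y, ay, by, I)] += 1
    return table

def transpose(table):
    return [list(col) for col in zip(*table)]

def rho2(table):
    """Plug-in value of rho^2 (row variable -> column variable), following Eq. (3)."""
    n = sum(map(sum, table))
    I, J = len(table), len(table[0])
    p_row = [sum(row) / n for row in table]
    p_col = [sum(table[i][j] for i in range(I)) / n for j in range(J)]
    v = [sum(p_col[: j + 1]) for j in range(J)]
    mean_v = sum(v[j] * p_col[j] for j in range(J))
    var_v = sum((v[j] - mean_v) ** 2 * p_col[j] for j in range(J))
    num = 0.0
    for i in range(I):
        if p_row[i] > 0:  # skip empty categories
            cond = sum(v[j] * table[i][j] / n for j in range(J)) / p_row[i]
            num += (cond - mean_v) ** 2 * p_row[i]
    return num / var_v

xs, ys = simulate_example1(1000)
for I in range(2, 21):
    t = contingency_table(xs, ys, I)
    print(I, round(rho2(t), 3), round(rho2(transpose(t)), 3))
```

In this setting the X→Y value should stay well above the Y→X value across the range of I, which is the directional signal that the three classical coefficients miss.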
In this paper, we have reviewed a subcopula-based measure of asymmetric association for two-way contingency tables, which was proposed by Wei and Kim[4]. We applied the measure to a data set with a non-linear relationship and showed that it can be used as a tool to detect non-linear association. The procedure is illustrated via a simulation example.
[1] PEARSON K. Note on regression and inheritance in the case of two parents[J]. Proceedings of the Royal Society of London, 1895, 58: 240-242. DOI:10.1098/rspl.1895.0041
[2] SPEARMAN C. The proof and measurement of association between two things[J]. The American Journal of Psychology, 1904, 15(1): 72-101. DOI:10.2307/1412159
[3] KENDALL M G. A new measure of rank correlation[J]. Biometrika, 1938, 30(1/2): 81-93. DOI:10.2307/2332226
[4] WEI Z, KIM D. Subcopula-based measure of asymmetric association for contingency tables[J]. Statistics in Medicine, 2017, 36(24): 3875-3894. DOI:10.1002/sim.v36.24
[5] PATTON A J. Modelling asymmetric exchange rate dependence[J]. International Economic Review, 2006, 47(2): 527-556. DOI:10.1111/iere.2006.47.issue-2
[6] ZHANG Yaoting. Copula technique and financial risk analysis[J]. Statistical Research, 2002, 4: 48-51. (in Chinese)
[7] WU Zhenxiang, YE Wuyi, MIAO Baiqi. Risk analysis of foreign exchange portfolios based on copula[J]. Chinese Journal of Management Science, 2004(4): 1-5. (in Chinese)
[8] GONG Jinguo, SHI Daimin. Nonparametric inference for time-varying copula models[J]. The Journal of Quantitative & Technical Economics, 2011(7): 137-150. (in Chinese)
[9] JAWORSKI P, DURANTE F, HÄRDLE W K, et al. Copula Theory and Its Applications[M]. Berlin, Heidelberg: Springer, 2010.
[10] EMBRECHTS P, MCNEIL A, STRAUMANN D. Correlation and Dependence in Risk Management: Properties and Pitfalls[M]. Risk Management: Value at Risk and Beyond, 2002: 176-223.
[11] ELURU N, PALETI R, PENDYALA R, et al. Modeling injury severity of multiple occupants of vehicles: Copula-based multivariate approach[J]. Transportation Research Record: Journal of the Transportation Research Board, 2010(2165): 1-11.
[12] ZHANG L, SINGH V P. Bivariate rainfall frequency distributions using Archimedean copulas[J]. Journal of Hydrology, 2007, 332(1): 93-109.
[13] JOE H. Dependence Modeling with Copulas[M]. London: Chapman & Hall, 2014.
[14] NELSEN R B. An Introduction to Copulas (second edition)[M]. New York: Springer, 2006.
[15] WEI Z, WANG T, NGUYEN P A. Multivariate dependence concepts through copulas[J]. International Journal of Approximate Reasoning, 2015, 65: 24-33. DOI:10.1016/j.ijar.2015.04.004
[16] GONG Jinguo, SHI Daimin. Large-sample properties of nonparametric estimation for time-varying copula models[J]. Journal of Zhejiang University (Science Edition), 2012(6): 630-634. (in Chinese)
[17] GONG Jinguo, DENG Ruqiao. Statistical inference for time-varying C-Vine copula models: based on generalized autoregressive score theory[J]. Statistical Research, 2015(4): 97-103. (in Chinese)
[18] SKLAR M. Fonctions de répartition à n dimensions et leurs marges[J]. Publications de l'Institut de Statistique de l'Université de Paris, 1959, 8: 229-231.
2018, Vol. 48