Mathematical Proof of the Synthetic Running Correlation Coefficient and Its Ability to Reflect Temporal Variations in Correlation

Citation

ZHAO Jinping, CAO Yong, SHI Yanyue, et al. Mathematical Proof of the Synthetic Running Correlation Coefficient and Its Ability to Reflect Temporal Variations in Correlation[J]. Journal of Ocean University of China, 2021, 20(3): 562-572.

Corresponding author

ZHAO Jinping, Tel: 0086-532-66782096, E-mail: jpzhao@ouc.edu.cn.

History

Received November 5, 2020
revised January 20, 2021
accepted January 29, 2021



ZHAO Jinping¹⁾,²⁾, CAO Yong¹⁾, SHI Yanyue³⁾, and WANG Xin¹⁾

1) College of Oceanic and Atmospheric Sciences, Ocean University of China, Qingdao, 266100, China;
2) Physical Oceanography Laboratory, Ministry of Education, Qingdao, 266100, China;
3) School of Mathematical Sciences, Ocean University of China, Qingdao, 266100, China


Abstract: The running correlation coefficient (RCC) is useful for capturing temporal variations in correlations between two time series. The local running correlation coefficient (LRCC) is a widely used algorithm that directly applies the Pearson correlation to a time window. A new algorithm called synthetic running correlation coefficient (SRCC) was proposed in 2018 and proven to be reasonable and usable; however, this algorithm lacks a theoretical demonstration. In this paper, SRCC is proven theoretically. RCC is only meaningful when its values at different times can be compared. First, the global means are proven to be the unique standard quantities for comparison. SRCC is the only RCC that satisfies the comparability criterion. The relationship between LRCC and SRCC is derived using statistical methods, and SRCC is obtained by adding a constraint condition to the LRCC algorithm. Dividing the temporal fluctuations into high- and low-frequency signals reveals that LRCC only reflects the correlation of high-frequency signals; by contrast, SRCC reflects the correlations of high- and low-frequency signals simultaneously. Therefore, SRCC is the appropriate method for calculating RCCs.

Key words: running correlation coefficient; synthetic running correlation coefficient; time window; comparability; standard value

1 Introduction

Correlation describes the degree of consistency between two time series. The correlation coefficient (CC) is an important statistical quantity (Pearson, 1896) that reflects the overall correlation between two data series. Because knowledge of temporal variations in correlation may sometimes be useful, the running correlation coefficient (RCC) was proposed to reflect varying correlations (Kuznets, 1928).

In most cases, RCC simply applies the CC to a pair of data segments taken from the complete dataset. The length of the data segment is called the time window, and the window is moved stepwise to obtain the RCC (e.g., Kodera, 1993). RCC is a time series with values between −1.0 and +1.0. The RCC obtained by this method was called the local running correlation coefficient (LRCC) by Zhao et al. (2018). LRCC is widely used to study varying correlations between two time series, such as the correlations of the Arctic Oscillation and sea level pressure (Zhao et al., 2006), water transport in the Labrador Sea and the North Atlantic Oscillation (Varotsou et al., 2015), atmospheric circulation and air temperature (Hynčica and Huth, 2020), Australian rainfall and El Niño (Brown et al., 2016), solar variability and paleoclimate records (Turner et al., 2016), equatorial quasi-biennial oscillations and stratospheric temperatures (Kodera, 1993; Soukhearev, 1997), solar cycle (Salby et al., 1997), and solar UV irradiance (Elias and Zossi de Artigas, 2003).

Zhao et al. (2018) observed that the LRCC algorithm uses the mean values determined by the data within the time window, which means the mean values also vary with time. LRCC only reflects the correlation of the anomalies relative to the local means and does not reflect the correlation between the varying means. Therefore, the authors proposed a new algorithm for RCC called the synthetic running correlation coefficient (SRCC). SRCC reflects correlations for both anomalies and means by using the global means calculated for the whole dataset. The relationship between LRCC and SRCC can be derived to illustrate the consistency and differences of these two algorithms. Some authors, such as Zhao et al. (2019) and Ji and Zhao (2019), have obtained remarkable results using SRCC.

The definition, calculation method, application examples, and physical significance of SRCC were addressed by Zhao et al. (2018) in an effort to prove that the method is valid and credible. However, SRCC still lacks the support of mathematical theory. In the present study, the validity of SRCC is demonstrated theoretically. The analysis, based on geometric and physical significance, proves that SRCC is an appropriate method for measuring varying correlations. In Section 2, the background of the two RCCs is introduced. Comparability as the basic requirement for RCCs is then proposed in Section 3. A mathematical demonstration of SRCC is given in Section 4. Finally, the physical significance of SRCC is discussed in Section 5.

2 Background of Running Correlation Algorithms

All of the CCs discussed in this study are linear correlations; nonlinear correlations (e.g., Geng et al., 2018) are not discussed. A simple CC defined as the Pearson product-moment correlation coefficient (Pearson, 1896) was first introduced by Francis Galton (Galton, 1888) for linear correlation. The more common form of this CC was developed and applied by Karl Pearson (Pearson, 1938; Merrington et al., 1983). Consider two time series of equal data length N sampled at equal intervals:

$\left\{ \begin{array}{l} X = \{ {x_k}:k = 1, 2, \cdots, N\} \\ Y = \{ {y_k}:k = 1, 2, \cdots, N\} \\ \end{array} \right..$

(1)

The simple correlation coefficient R is written as follows:

$R = \frac{{\mathop \Sigma \limits_{k = 1}^N ({x_k} - \bar X)({y_k} - \bar Y)}}{{\sqrt {\sum\limits_{k = 1}^N {{{({x_k} - \bar X)}^2}} } \sqrt {\sum\limits_{k = 1}^N {{{({y_k} - \bar Y)}^2}} } }}, $

(2)

where

$\bar X = \frac{1}{N}\mathop \Sigma \limits_{k = 1}^N {x_k} \;\;{\rm{and}}\;\; \bar Y = \frac{1}{N}\mathop \Sigma \limits_{k = 1}^N {y_k}, $

(3)

are the means calculated based on all data; thus, these means are called 'global means'. R in Eq. (2) calculated from all of the data is referred to as the 'global CC'.
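As a minimal sketch of Eqs. (2) and (3), the global CC can be computed directly from its definition. The function name `global_cc` and the sample data are illustrative assumptions and do not come from the original study.

```python
import math

def global_cc(x, y):
    """Pearson correlation of two equal-length sequences, following Eq. (2)."""
    N = len(x)
    x_bar = sum(x) / N                      # global mean of X, Eq. (3)
    y_bar = sum(y) / N                      # global mean of Y, Eq. (3)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)       # undefined if either series is constant

# Hypothetical sample data: y is an exact linear function of x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
r_same = global_cc(x, y)                    # increasing linear relation
r_opposite = global_cc(x, [-v for v in y])  # decreasing linear relation
```

For exactly linear data the sketch returns +1 or −1, the two limits of the coefficient.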

RCC is a useful tool for understanding temporal variations in the correlation between two time series. The CC of two time series centered at i is:

${R_r}(i) = \frac{{\mathop \Sigma \limits_{k = i - n}^{i + n} ({x_k} - {{\bar X}_i})({y_k} - {{\bar Y}_i})}}{{\sqrt {\sum\limits_{k = i - n}^{i + n} {{{({x_k} - {{\bar X}_i})}^2}} } \sqrt {\sum\limits_{k = i - n}^{i + n} {{{({y_k} - {{\bar Y}_i})}^2}} } }}, \\ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;i = 1 + n, \cdots, N - n, $

(4)

where i∈[1+ n, N−n] and the time window is [i − n, i + n]; that is:

$\left\{ \begin{array}{l} {X_i} = \{ {x_k}:k = i - n, i - n + 1, \cdots, i + n - 1, i + n\} \\ {Y_i} = \{ {y_k}:k = i - n, i - n + 1, \cdots, i + n - 1, i + n\} \\ \end{array} \right..$

(5)

An RCC is obtained by moving the window center i. R_r(i) is called the LRCC to distinguish it from other RCCs. The means of LRCC are obtained from the data within the time window:

$\begin{array}{*{20}{c}} {{{\bar X}_i} = \frac{1}{{2n + 1}}\mathop \Sigma \limits_{k = i - n}^{i + n} {x_k}, }&{{{\bar Y}_i} = \frac{1}{{2n + 1}}\mathop \Sigma \limits_{k = i - n}^{i + n} {y_k}} \end{array}.$

(6)

Hereafter, these means are referred to as 'local means'.
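The LRCC of Eqs. (4) to (6) can be sketched as a loop over window centers, with the local means recomputed inside every window. The function name `lrcc`, the 0-based indexing, and the sample data are assumptions made for illustration only.

```python
import math

def lrcc(x, y, n):
    """Local running correlation, Eq. (4): one value per window center."""
    N = len(x)
    w = 2 * n + 1                            # number of points in a window
    out = []
    for i in range(n, N - n):                # 0-based centers; window is [i-n, i+n]
        xs, ys = x[i - n:i + n + 1], y[i - n:i + n + 1]
        x_loc = sum(xs) / w                  # local means, Eq. (6)
        y_loc = sum(ys) / w
        sxy = sum((a - x_loc) * (b - y_loc) for a, b in zip(xs, ys))
        sxx = sum((a - x_loc) ** 2 for a in xs)
        syy = sum((b - y_loc) ** 2 for b in ys)
        out.append(sxy / math.sqrt(sxx * syy))   # undefined for a constant window
    return out

# Hypothetical data: the series move together at first and oppositely later,
# while the level of both series jumps in the middle.
x = [0.0, 1.0, 0.0, 1.0, 5.0, 6.0, 5.0, 6.0]
y = [0.0, 1.0, 0.0, 1.0, 9.0, 8.0, 9.0, 8.0]
r = lrcc(x, y, n=1)
```

The result has N − 2n values. Because each window subtracts its own means, the jump in level between the first and second halves only influences the windows that straddle it.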

The algorithm for LRCC in Eq. (4) is a direct application of the definition of the global CC; it only restricts the data to the time window. The algorithm assumes that the definition used for the global CC can also be applied to the RCC, but no theoretical evidence has been presented that this direct application is reasonable. Zhao et al. (2018) indicated that the means in Eq. (6) also vary with time. LRCC reflects only the correlation between the two anomalies within the time window and does not capture the contributions of the two varying means. Some important signals contained in the means are therefore missing, which raises the question of whether LRCC remains statistically meaningful when variations in the local means are ignored. The RCCs in different time windows should be comparable to each other; however, R_r(i) is obtained only from the data within the time window and is independent of the data in other time windows. Thus, LRCCs in different windows lack common information and, therefore, are not comparable.

Zhao et al. (2018) identified this problem and proposed a new algorithm to calculate RCC:

${R_s}(i) = \frac{{\mathop \Sigma \limits_{k = i - n}^{i + n} ({x_k} - \bar X)({y_k} - \bar Y)}}{{\sqrt {\sum\limits_{k = i - n}^{i + n} {{{({x_k} - \bar X)}^2}} } \sqrt {\sum\limits_{k = i - n}^{i + n} {{{({y_k} - \bar Y)}^2}} } }}, \\ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;i = 1 + n, \cdots, N - n, $

(7)

where $\bar X$ and $\bar Y$ are the global means as defined by Eq. (3). This RCC is referred to as the SRCC. That R_s(i) ∈ [−1, 1] can easily be verified using the Cauchy inequality:

$\left| {\mathop \Sigma \limits_{k = i - n}^{i + n} ({x_k} - \bar X)({y_k} - \bar Y)} \right| \leqslant \mathop \Sigma \limits_{k = i - n}^{i + n} \left| {({x_k} - \bar X)({y_k} - \bar Y)} \right| \leqslant \\ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; \sqrt {\sum\limits_{k = i - n}^{i + n} {{{({x_k} - \bar X)}^2}} } \sqrt {\sum\limits_{k = i - n}^{i + n} {{{({y_k} - \bar Y)}^2}} } .$

(8)

Although SRCC was first proposed by Zhao et al. (2018), this algorithm has actually existed for a long time in the following form:

${R_s}^\prime (i) = \frac{{\mathop \Sigma \limits_{k = i - n}^{i + n} {x_k}{y_k}}}{{\sqrt {\sum\limits_{k = i - n}^{i + n} {{{({x_k})}^2}} } \sqrt {\sum\limits_{k = i - n}^{i + n} {{{({y_k})}^2}} } }}, {\rm{ }}i = 1 + n, \cdots, N - n, $

(9)

where the means of the two data series have been removed beforehand and the calculation in Eq. (9) does not include the means. This procedure is equivalent to adopting the global means. Therefore, the algorithm in Eq. (9) is equivalent to SRCC.
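The equivalence between Eq. (7) and Eq. (9) can be checked numerically. The sketch below is only an illustration: the function name `srcc`, its optional reference arguments `x_ref` and `y_ref`, and the sample data are hypothetical.

```python
import math

def srcc(x, y, n, x_ref=None, y_ref=None):
    """Synthetic running correlation, Eq. (7).

    The reference values default to the global means of the full series.
    """
    N = len(x)
    x_ref = sum(x) / N if x_ref is None else x_ref
    y_ref = sum(y) / N if y_ref is None else y_ref
    out = []
    for i in range(n, N - n):                # 0-based centers; window is [i-n, i+n]
        dx = [a - x_ref for a in x[i - n:i + n + 1]]
        dy = [b - y_ref for b in y[i - n:i + n + 1]]
        sxy = sum(a * b for a, b in zip(dx, dy))
        sxx = sum(a * a for a in dx)
        syy = sum(b * b for b in dy)
        out.append(sxy / math.sqrt(sxx * syy))
    return out

# Hypothetical sample data.
x = [3.0, 5.0, 4.0, 8.0, 7.0, 9.0, 6.0, 10.0]
y = [1.0, 2.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]

r_s = srcc(x, y, n=2)                        # Eq. (7)

# Eq. (9): remove the global means beforehand, then use raw products.
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
x0 = [a - x_bar for a in x]
y0 = [b - y_bar for b in y]
r_s_prime = srcc(x0, y0, n=2, x_ref=0.0, y_ref=0.0)
```

Both calls give the same series, and every value stays within [−1, 1], as Eq. (8) requires.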

Zhao et al. (2018) attempted to determine which RCC is better by proposing and adopting a criterion, namely, that the temporal average of the RCC should be close to the global CC. The average RCC is not exactly equal to the global CC because the two quantities are computed from different amounts of data, but the two values can be very close to each other. In general, the temporal average of SRCC is close to the global CC; by contrast, in most cases, the temporal average of LRCC cannot fulfill this criterion. Thus, according to the temporal average criterion, SRCC is better than LRCC for measuring running correlations.

3 Comparability of the RCC Values of Different Windows

An RCC is meaningful only when its values at different times are comparable with each other. For example, if the temperatures of two cities are compared, a large RCC should reflect a consistent variation, and a low one should reflect lower consistency. In this situation, the RCCs are comparable. The standard for each city, which should be an unchanged constant, must be established in advance to meet the requirement of comparability. In the above example, the mean air temperature of each city is used as the comparison standard for cold or warm events, and these events should differ between southern and northern cities. If the standard changes over time, for example, if different temperatures for summer and winter are selected as standards, the RCCs in winter and summer would not be comparable.

Mathematically, a physical component can be decomposed into standard and comparative quantities. The standard quantities should be two unchanged constants not involved in the comparison, and the comparison is conducted on the two comparative quantities (Burdun and Markov, 1972). The means of the temperatures are qualified standard quantities for indicating whether the environment is warmer or cooler at a given time. The comparison is applied to the anomalies of the temperature obtained by subtracting the standard quantities. Here we derive the standard quantities mathematically.

According to Eq. (2), the global means ($\bar X, \bar Y$) are the qualified standards for comparison. Therefore, the global CC of the two time series meets the above comparability requirements. When Eq. (2) is directly applied to a time window, the resultant local means (${\bar X_i}, {\bar Y_i}$) only use the data in the time window, as shown in Eq. (6). In general, the local means of windows i and j are different when i ≠ j, so LRCCs in different time windows use different standard quantities and, thus, do not meet the comparability requirement. According to the definition in Eq. (7), SRCC meets the requirement of comparability because the global means ($\bar X, \bar Y$) are constant standards.

Even if the global means ($\bar X, \bar Y$) are replaced by any real constant numbers (${\bar X_0}, {\bar Y_0}$), Eq. (7) holds true, which means the constant chosen as the standard is somewhat arbitrary, and the RCCs obtained using different standard values will differ. Thus, more physical constraints must be applied to obtain a unique RCC. Let us verify that (${\bar X_0}, {\bar Y_0}$) equals ($\bar X, \bar Y$) in SRCC.

Let (${\bar X_0}, {\bar Y_0}$) in Eq. (7) be any arbitrary values. Then, a new time series can be expressed as follows:

$\left\{ \begin{array}{l} {F_{xi}} = \frac{1}{{2n + 1}}\mathop \Sigma \limits_{k = i - n}^{i + n} ({x_k} - {{\bar X}_0}) \\ {F_{yi}} = \frac{1}{{2n + 1}}\mathop \Sigma \limits_{k = i - n}^{i + n} ({y_k} - {{\bar Y}_0}) \\ \end{array} \right., {\rm{ }}i = 1 + n, \cdots, N - n.$

(10)

The physical definitions of F_xi and F_yi are the average deviations relative to the standard values (${\bar X_0}, {\bar Y_0}$) in a time window, which vary with time and the values of (${\bar X_0}, {\bar Y_0}$).

That the means of F_xi and F_yi are equal to zero is proposed here as a new constraint:

$\left\{ \begin{array}{l} \frac{1}{N}\sum\limits_{i = 1}^N {{F_{xi}}} {\rm{ = }}\frac{1}{{N(2n + 1)}}\sum\limits_{i = 1}^N {\left[ {\mathop \Sigma \limits_{k = i - n}^{i + n} ({x_k} - {{\bar X}_0})} \right]} {\rm{ = }}0 \\ \frac{1}{N}\sum\limits_{i = 1}^N {{F_{yi}}} {\rm{ = }}\frac{1}{{N(2n + 1)}}\sum\limits_{i = 1}^N {\left[ {\mathop \Sigma \limits_{k = i - n}^{i + n} ({y_k} - {{\bar Y}_0})} \right]} {\rm{ = }}0 \\ \end{array} \right..$

(11)

The rationale for this constraint is that the averaged deviation should be equal to zero regardless of the chosen window half-width n. That is, the positive and negative parts of the averaged deviation should cancel exactly. Then, (${\bar X_0}, {\bar Y_0}$) can be uniquely determined:

$\left\{ \begin{array}{l} {{\bar X}_0} = \frac{1}{{N(2n + 1)}}\sum\limits_{i = 1}^N {\sum\limits_{k = i - n}^{i + n} {{x_k}} } \\ {{\bar Y}_0} = \frac{1}{{N(2n + 1)}}\sum\limits_{i = 1}^N {\sum\limits_{k = i - n}^{i + n} {{y_k}} } \\ \end{array} \right..$

(12)

The significance of Eq. (12) is that the standard values equal the averages of the local means in the whole data domain [1, N]. For clarity, the global means ($\bar X, \bar Y$) are incorporated into Eq. (12) so that the corresponding relations become:

$\left\{ \begin{array}{l} {{\bar X}_0} = \bar X - \frac{1}{{N(2n + 1)}}\sum\limits_{i = 1}^N {\left[ {(2n + 1){x_i} - \sum\limits_{k = i - n}^{i + n} {{x_k}} } \right]} \\ {{\bar Y}_0} = \bar Y - \frac{1}{{N(2n + 1)}}\sum\limits_{i = 1}^N {\left[ {(2n + 1){y_i} - \sum\limits_{k = i - n}^{i + n} {{y_k}} } \right]} \\ \end{array} \right..$

(13)

Notice that the range of k is [1, N] and that of i is [1+n, N−n] in Eq. (10). The range of i must be extended to [1, N]. Therefore, the time series X = {x_k: k = 1, 2, …, N} and Y = {y_k: k = 1, 2, …, N} must be extended beyond both limits:

$\left\{ \begin{array}{l} \{ \ldots, {x_{N - 1}}, {x_N}, {x_1}, \ldots, {x_{N - 1}}, {x_N}, {x_1}, \ldots \} \\ \{ \ldots, {y_{N - 1}}, {y_N}, {y_1}, \ldots, {y_{N - 1}}, {y_N}, {y_1}, \ldots \} \\ \end{array} \right..$

Although any extension could be applied because the extended data do not affect the SRCC calculation in Eq. (7), the extension must satisfy the condition that the mean of the extended data equals the mean of the original data series. This condition is necessary to adopt the same signal of the original data in the extended data. The ideal method is to perform a periodic extension, which is equivalent to a circular extension (e.g., Woods and Oneil, 1986). The general periodic extension is a sinusoidal extension (e.g., Huybrechs, 2010); it may also be a cosinoidal extension, such as the mirror extension of Zhao and Huang (2001).

Because the means are unchanged after data extension, the sums of the second terms on the right-hand side of Eq. (13) are always zero. Thus, the unique standard, which is different from the arbitrary standard, is:

$\left\{ \begin{array}{l} {{\bar X}_0}{\rm{ = }}\bar X \\ {{\bar Y}_0}{\rm{ = }}\bar Y \\ \end{array} \right..$

(14)

The global means ($\bar X, \bar Y$) are proven to be the unique standard quantities satisfying the condition of Eq. (11). Although the standards for comparison can be arbitrarily selected, only one pair of standards, namely the global means, satisfies the condition that the averaged deviations are equal to zero. Therefore, the SRCC algorithm given in Eq. (7) is the only RCC expression that satisfies Eq. (11).
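The step from Eq. (12) to Eq. (14) can be illustrated numerically: under a periodic (circular) extension, every sample falls in exactly 2n + 1 windows, so the average of the local means reduces to the global mean. The sketch below assumes a circular extension implemented with modular indexing; the function name and the sample data are hypothetical.

```python
def mean_of_local_means(x, n):
    """Right-hand side of Eq. (12) with the series extended periodically."""
    N = len(x)
    total = 0.0
    for i in range(N):                       # window centers cover the full range
        # The modulo wraps indices past either end: a circular extension.
        window_sum = sum(x[(i + j) % N] for j in range(-n, n + 1))
        total += window_sum / (2 * n + 1)    # local mean of window i
    return total / N

# Hypothetical sample data.
x = [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0, 2.0, 8.0]
global_mean = sum(x) / len(x)
x0_by_window = {n: mean_of_local_means(x, n) for n in (1, 2, 3)}
```

For every window half-width tried, the standard value from Eq. (12) coincides with the global mean up to rounding, which is the content of Eq. (14).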

For further comparison, an example with two randomly generated white noise data series, f₁(t) and f₂(t), for the time range 0 – 500 is shown in Figs. 1a and 1b. The global CC between the two white noise data series is zero, and the average values of LRCC and SRCC are 0.01 and 0.02, respectively. LRCC and SRCC for these two white noise data are quite similar, as shown in Figs. 1c and 1d, respectively.

Fig. 1 Two running correlation coefficients of a white noise data series. (a) and (b), Two series of white noise (red lines) and the local means (blue lines); (c), LRCC; (d), SRCC.

If a constant value is added to each series between t = 200 and t = 300, the two time series are defined as:

$\left\{ \begin{array}{l} {A_1}(t) = {f_1}(t) + {a_1} \\ {A_2}(t) = {f_2}(t) + {a_2} \\ \end{array} \right..$

(15)

The constants a₁ and a₂ are set as:

${a_1} = \left\{ {\begin{array}{*{20}{c}} 3&{200 \leqslant t \leqslant 300} \\ 0&{{\rm{other \;time}}} \end{array}} \right., {\rm{ }}{a_2} = \left\{ {\begin{array}{*{20}{c}} 2&{200 \leqslant t \leqslant 300} \\ 0&{{\rm{other \;time}}} \end{array}} \right..$

(16)

The new time series are shown in Figs. 2a and 2b. The global correlation coefficient is 0.548. RCCs in the time interval with non-zero constants may be expected to show high correlations. SRCC remarkably increases in the presence of these constants and shows an average value of 0.518. By comparison, LRCC changes minimally, as shown in Fig. 2c, with an average value of only 0.030. Therefore, whereas SRCC is a suitable metric that could reflect the expected running correlation, LRCC appears to lose important information.

Fig. 2 Two running correlation coefficients. (a) and (b), Two series defined by Eq. (15) (red lines) and the local means (blue lines); (c), LRCC; (d), SRCC.
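An experiment in the spirit of Eqs. (15) and (16) can be sketched as follows. This is not a reproduction of Fig. 2: the random seed, the unit noise variance, and the window half-width n = 10 are assumptions, so the averages will differ from the values quoted in the text.

```python
import math
import random

def running_cc(x, y, n, use_global_means):
    """SRCC (global means, Eq. 7) or LRCC (local means, Eq. 4) per window center."""
    N = len(x)
    gx, gy = sum(x) / N, sum(y) / N
    out = []
    for i in range(n, N - n):
        xs, ys = x[i - n:i + n + 1], y[i - n:i + n + 1]
        mx = gx if use_global_means else sum(xs) / len(xs)
        my = gy if use_global_means else sum(ys) / len(ys)
        sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        sxx = sum((a - mx) ** 2 for a in xs)
        syy = sum((b - my) ** 2 for b in ys)
        out.append(sxy / math.sqrt(sxx * syy))
    return out

rng = random.Random(0)                       # arbitrary seed, for repeatability only
t = range(501)                               # t = 0 .. 500
f1 = [rng.gauss(0.0, 1.0) for _ in t]        # white noise, unit variance (assumed)
f2 = [rng.gauss(0.0, 1.0) for _ in t]
# Eqs. (15)-(16): add 3 and 2, respectively, for 200 <= t <= 300.
a1 = [f + (3.0 if 200 <= k <= 300 else 0.0) for k, f in zip(t, f1)]
a2 = [f + (2.0 if 200 <= k <= 300 else 0.0) for k, f in zip(t, f2)]

n = 10                                       # window half-width (assumed)
lr = running_cc(a1, a2, n, use_global_means=False)
sr = running_cc(a1, a2, n, use_global_means=True)
mean_lrcc = sum(lr) / len(lr)
mean_srcc = sum(sr) / len(sr)
print(f"mean LRCC = {mean_lrcc:.3f}, mean SRCC = {mean_srcc:.3f}")
```

Under these assumptions the averaged SRCC should come out clearly larger than the averaged LRCC, because only SRCC registers the shared shift in level.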

Therefore, from the perspective of comparability, an invariant value must be chosen as the standard value. Because the local mean varies with time, LRCC is not a comparable CC. The global mean is a qualified and unique standard value, and SRCC is the unique RCC satisfying the comparability criterion.

4 Mathematical Difference Between LRCC and SRCC

Although SRCC has been proven to be a qualified RCC by Zhao et al. (2018) and a unique form of an RCC with comparability as demonstrated in Section 3, mathematically verifying that SRCC is a unique form of RCC based on the original geometric definition of statistical quantities remains necessary.

In the linear correlation framework, a linear correlation can be used to calculate the CC. Consider a pair of time series in Eq. (1) with data length N in a scatterplot in x–y space and draw a straight line through this cloud of points that approaches all of the points 'as closely as possible'.

If a + bx is used to estimate y and c + dy is used to estimate x, then the deviations of the two lines from the data are:

$\left\{ \begin{array}{l} Q(a, b) = \sum\limits_{k = 1}^N {{{[{y_k} - (a + b{x_k})]}^2}} \\ Q(c, d) = \sum\limits_{k = 1}^N {{{[{x_k} - (c + d{y_k})]}^2}} \\ \end{array} \right..$

(17)

Minimizing Q(a, b) and Q(c, d) by the least-squares method yields the slopes:

$\left\{ \begin{array}{l} b = \frac{{\sum\limits_{k = 1}^N {({x_k} - \bar X)({y_k} - \bar Y)} }}{{\sum\limits_{k = 1}^N {{{({x_k} - \bar X)}^2}} }} \\ d = \frac{{\sum\limits_{k = 1}^N {({x_k} - \bar X)({y_k} - \bar Y)} }}{{\sum\limits_{k = 1}^N {{{({y_k} - \bar Y)}^2}} }} \\ \end{array} \right..$

(18)

The global correlation coefficient R can be expressed as:

${R^2} = bd = \frac{{{{\left[ {\sum\limits_{k = 1}^N {({x_k} - \bar X)({y_k} - \bar Y)} } \right]}^2}}}{{\left[ {\sum\limits_{k = 1}^N {{{({x_k} - \bar X)}^2}} } \right]\left[ {\sum\limits_{k = 1}^N {{{({y_k} - \bar Y)}^2}} } \right]}}, $

(19)

which is identical in form to Eq. (2). Fig. 3 shows that the two empirical regression lines pass through the global means ($\bar X, \bar Y$).

Fig. 3 Geometric interpretation of the linear correlation coefficients (redrawn from Schmid, 1947). The shadow area represents the scatterplot of the data, and θ is the angle between the regression lines.

LRCC expresses the correlation of the data series in the time window [i − n, i + n] as shown in Eq. (5). The linear regression of the data is calculated using a similar method. Here, a' + b'x is used to estimate y, c' + d'y is used to estimate x, and the deviations of the two lines from the data are:

$\left\{ \begin{array}{l} {Q_i}(a', b') = \sum\limits_{k = i - n}^{i + n} {{{[{y_k} - (a' + b'{x_k})]}^2}} \\ {Q_i}(c', d') = \sum\limits_{k = i - n}^{i + n} {{{[{x_k} - (c' + d'{y_k})]}^2}} \\ \end{array} \right., $

(20)

where quantities marked with a prime (′) are constants related to the length of the time window. The least-squares method can be used to calculate b' and d' as follows:

$\left\{ \begin{array}{l} b' = \frac{{\sum\limits_{k = i - n}^{i + n} {({x_k} - {{\bar X}_i})({y_k} - {{\bar Y}_i})} }}{{\sum\limits_{k = i - n}^{i + n} {{{({x_k} - {{\bar X}_i})}^2}} }} \\ d' = \frac{{\sum\limits_{k = i - n}^{i + n} {({x_k} - {{\bar X}_i})({y_k} - {{\bar Y}_i})} }}{{\sum\limits_{k = i - n}^{i + n} {{{({y_k} - {{\bar Y}_i})}^2}} }} \\ \end{array} \right., $

(21)

where (${\bar X_i}, {\bar Y_i}$) is the local mean of the ith time window. Thus,

$R_r^2(i) = b'd' = \frac{{{{\left[ {\sum\limits_{k = i - n}^{i + n} {({x_k} - {{\bar X}_i})({y_k} - {{\bar Y}_i})} } \right]}^2}}}{{\sum\limits_{k = i - n}^{i + n} {{{({x_k} - {{\bar X}_i})}^2}} \sum\limits_{k = i - n}^{i + n} {{{({y_k} - {{\bar Y}_i})}^2}} }}, $

(22)

which is identical to the expression of LRCC in Eq. (4). Eq. (22) is obtained from the slopes of the lines fitted to the data in the window. In general, $({\bar X_{{t_1}}}, {\bar Y_{{t_1}}})$ ≠ $({\bar X_{{t_2}}}, {\bar Y_{{t_2}}})$ when t₁ ≠ t₂. Geometrically, the cross points (equal to the local means) of the regression lines for different time windows appear at different positions, as shown in Fig. 4a. Because a cross point corresponds to a standard and only RCCs in time windows with the same cross point are comparable, the values of LRCC at different i cannot be compared with each other.

Fig. 4 Geometric interpretation of two RCCs at different times t₁ and t₂. (a), LRCC with different means $({\bar X_{{t_1}}}, {\bar Y_{{t_1}}})$ and $({\bar X_{{t_2}}}, {\bar Y_{{t_2}}})$; (b), SRCC with the same means $(\bar X, \bar Y)$. The shadow area represents the scatterplot of the data.

Because the regression lines of the global correlation cross at the global means $(\bar X, \bar Y)$, this property may also be imposed as a constraint on the RCC in Eq. (20); that is, the empirical regression lines fitted in all time windows must pass through the point $(\bar X, \bar Y)$:

$\left\{ \begin{array}{l} \bar Y{\rm{ = }}a' + b'\bar X \\ \bar X{\rm{ = }}c' + d'\bar Y \\ \end{array} \right..$

(23)

Eq. (23) should be an additional condition for the deviation of the two regression lines in Eq. (20). Substituting $a' = \bar Y - b'\bar X, {\rm{ }}c' = \bar X - d'\bar Y$ into Eq. (20) yields:

$\left\{ \begin{array}{l} {Q_i}(b') = \sum\limits_{k = i - n}^{i + n} {{{[({y_k} - \bar Y) - b'({x_k} - \bar X)]}^2}} \\ {Q_i}(d') = \sum\limits_{k = i - n}^{i + n} {{{[({x_k} - \bar X) - d'({y_k} - \bar Y)]}^2}} \\ \end{array} \right..$

(24)

When the minimum Q_i(b') and Q_i(d') are calculated by the least-squares method:

$\left\{ \begin{array}{l} b'{\rm{ = }}\frac{{\sum\limits_{k = i - n}^{i + n} {({x_k} - \bar X)({y_k} - \bar Y)} }}{{\sum\limits_{k = i - n}^{i + n} {{{({x_k} - \bar X)}^2}} }} \\ d'{\rm{ = }}\frac{{\sum\limits_{k = i - n}^{i + n} {({x_k} - \bar X)({y_k} - \bar Y)} }}{{\sum\limits_{k = i - n}^{i + n} {{{({y_k} - \bar Y)}^2}} }} \\ \end{array} \right..$

(25)

Thus:

${R_s}^2(i) = b'd' = \frac{{{{\left[ {\sum\limits_{k = i - n}^{i + n} {({x_k} - \bar X)({y_k} - \bar Y)} } \right]}^2}}}{{\sum\limits_{k = i - n}^{i + n} {{{({x_k} - \bar X)}^2}} \sum\limits_{k = i - n}^{i + n} {{{({y_k} - \bar Y)}^2}} }}, $

(26)

which is the expression for SRCC.

Eq. (26) clearly shows that SRCC is simply obtained by adding a constraint condition, Eq. (23), when calculating LRCC. The geometric expression of SRCC is shown in Fig. 4b. The regression lines of the data in all time windows cross the global means $(\bar X, \bar Y)$. The significance of this result is that a unified frame of reference is established with $(\bar X, \bar Y)$, in which the variations in SRCC at different times can be effectively compared. This relationship also means that SRCC takes local and global information into account and, therefore, demonstrates the close connection of the local correlation with the global correlation.
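The constrained least-squares step in Eqs. (23) to (26) can be checked for a single window. In the sketch below, the function names, the sample data, and the chosen window are illustrative assumptions.

```python
import math

def constrained_fit(xs, ys, x_bar, y_bar):
    """Slopes of lines forced through (x_bar, y_bar), Eq. (25), and the SRCC."""
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((b - y_bar) ** 2 for b in ys)
    return sxy / sxx, sxy / syy, sxy / math.sqrt(sxx * syy)

def q_i(xs, ys, x_bar, y_bar, slope):
    """Residual sum of squares Q_i(b') from Eq. (24)."""
    return sum(((v - y_bar) - slope * (u - x_bar)) ** 2 for u, v in zip(xs, ys))

# Hypothetical sample data and global means.
x = [3.0, 5.0, 4.0, 8.0, 7.0, 9.0, 6.0, 10.0]
y = [1.0, 2.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)

i, n = 3, 2                                  # one window, 0-based center
xs, ys = x[i - n:i + n + 1], y[i - n:i + n + 1]
b, d, r_s = constrained_fit(xs, ys, x_bar, y_bar)

# b should minimize Q_i: nudging it in either direction must not lower Q_i.
q_best = q_i(xs, ys, x_bar, y_bar, b)
q_left = q_i(xs, ys, x_bar, y_bar, b - 0.01)
q_right = q_i(xs, ys, x_bar, y_bar, b + 0.01)
```

The product of the two constrained slopes equals the square of the SRCC for that window, matching Eq. (26).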

5 Physical Significance of the SRCC

The comparability and geometric consistency of SRCC were demonstrated in Sections 3 and 4. This section discusses the physical significance of SRCC. The following examples present the differences between LRCC and SRCC and explain the reasons behind these differences. Because the annual variation is generally the strongest signal in geoscience, all of the data series used in the following examples are smoothed by a 12-point running mean to filter out the annual signal.

1) Contribution of the means and the anomalies of SRCC

The relationship between SRCC R_s(i) and LRCC R_r(i) was simply expressed by Zhao et al. (2018) as follows:

${R_s}(i) = {R_r}(i)\cos {\gamma _x}\cos {\gamma _y} + \sin {\gamma _x}\sin {\gamma _y}, $

(27)

where:

$\left\{ \begin{array}{l} \cos {\gamma _x} = \frac{{{\sigma _{rx}}(i)}}{{\sqrt {[\sigma _{rx}^2(i) + {{({{\bar X}_i} - \bar X)}^2}]} }} \\ \sin {\gamma _x} = \frac{{{{\bar X}_i} - \bar X}}{{\sqrt {[\sigma _{rx}^2(i) + {{({{\bar X}_i} - \bar X)}^2}]} }} \\ \cos {\gamma _y} = \frac{{{\sigma _{ry}}(i)}}{{\sqrt {[\sigma _{ry}^2(i) + {{({{\bar Y}_i} - \bar Y)}^2}]} }} \\ \sin {\gamma _y} = \frac{{{{\bar Y}_i} - \bar Y}}{{\sqrt {[\sigma _{ry}^2(i) + {{({{\bar Y}_i} - \bar Y)}^2}]} }} \\ \end{array} \right., $

(28)

where ${\bar X_i} - \bar X$ and ${\bar Y_i} - \bar Y$ are the differences between the local and global means and σ_rx(i) and σ_ry(i) are the local standard deviations, whose squares (the local variances) are defined as follows:

$\left\{ \begin{array}{l} \sigma _{rx}^2(i) = \frac{1}{{2n + 1}}\sum\limits_{k = i - n}^{i + n} {{{({x_k} - {{\bar X}_i})}^2}} \\ \sigma _{ry}^2(i) = \frac{1}{{2n + 1}}\sum\limits_{k = i - n}^{i + n} {{{({y_k} - {{\bar Y}_i})}^2}} \\ \end{array} \right..$

(29)

Eq. (27) reveals that SRCC is a weighted combination of R_r(i) and 1. The weight of R_r(i) is cosγ_xcosγ_y (cosine-weight), and the weight of 1 is sinγ_xsinγ_y (sine-weight). A larger variance increases the cosine-weight, and a larger mean difference increases the magnitude of the sine-weight. In the extreme case where both mean differences are zero, the two correlation coefficients are equal; by contrast, when the variances of the anomalies approach zero, SRCC approaches sinγ_xsinγ_y, which equals +1 when the two mean differences have the same sign and −1 when they have opposite signs.
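The decomposition in Eqs. (27) to (29) can be verified for a single window by computing SRCC directly and rebuilding it from LRCC and the two weights. The function name, the sample data, and the chosen window are assumptions made for illustration.

```python
import math

def decomposition(xs, ys, x_bar, y_bar):
    """Return (R_s, R_r, cosine-weight, sine-weight) for one window."""
    w = len(xs)
    x_loc, y_loc = sum(xs) / w, sum(ys) / w
    var_x = sum((a - x_loc) ** 2 for a in xs) / w            # Eq. (29)
    var_y = sum((b - y_loc) ** 2 for b in ys) / w
    cov = sum((a - x_loc) * (b - y_loc) for a, b in zip(xs, ys)) / w
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    r_r = cov / (sd_x * sd_y)                                # LRCC, Eq. (4)
    dx, dy = x_loc - x_bar, y_loc - y_bar                    # local minus global mean
    hx, hy = math.hypot(sd_x, dx), math.hypot(sd_y, dy)      # denominators in Eq. (28)
    cos_w = (sd_x / hx) * (sd_y / hy)                        # cos(gamma_x) cos(gamma_y)
    sin_w = (dx / hx) * (dy / hy)                            # sin(gamma_x) sin(gamma_y)
    # SRCC computed directly from Eq. (7), for comparison.
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((b - y_bar) ** 2 for b in ys)
    r_s = sxy / math.sqrt(sxx * syy)
    return r_s, r_r, cos_w, sin_w

# Hypothetical sample data and one window.
x = [3.0, 5.0, 4.0, 8.0, 7.0, 9.0, 6.0, 10.0]
y = [1.0, 2.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
i, n = 3, 2
r_s, r_r, cos_w, sin_w = decomposition(x[i - n:i + n + 1], y[i - n:i + n + 1], x_bar, y_bar)
rebuilt = r_r * cos_w + sin_w                                # right-hand side of Eq. (27)
```

The rebuilt value agrees with the directly computed SRCC up to rounding error.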

As an example, the averaged air temperature anomalies for the North Atlantic at 2 m and 500 hPa and their means are shown in Figs. 5a and 5b. LRCC presents a high-frequency variation (Fig. 5c), whereas SRCC shows a predominantly positive running correlation (Fig. 5d). Figs. 5e and 5f reveal that the sine-weight is dominant in most time windows and that the cosine-weight is only apparent in some years. When the cosine-weight is apparent, SRCC becomes weak or opposite in sign; otherwise, it is strong and close to 1. When the variance is dominant, the anomalous variation is dominant, and the variation of the mean is negligible. When the mean difference is dominant, the variations in anomalies are not important. This example shows that SRCC reflects the combined effects of the variance and the mean difference simultaneously.

Fig. 5 Two running correlation coefficients between 2 m and 500 hPa air temperature anomalies averaged for the North Atlantic over the period 1980 – 2015. The 2 m and 500 hPa air temperature data are obtained from NCEP/NCAR Reanalysis 1. (a), 2 m temperature anomalies (red line) and the local mean (blue line); (b), 500 hPa temperature anomalies (red line) and the local mean (blue line); (c), LRCC R_r(t); (d), SRCC R_s(t); (e), cosine-weight cosγ_xcosγ_y; (f), sine-weight sinγ_xsinγ_y.

2) Contribution of low- and high-frequency signals

Besides the contributions of the means and anomalies, the presence of both low- and high-frequency signals is another factor that decisively impacts LRCC and SRCC. Although fluctuations may contain various frequencies, all of the frequencies considered in the present example can be roughly divided into two groups: one with a period shorter than the time window (high frequency) and the other with a period longer than the time window (low frequency). LRCC usually represents the correlation between high-frequency signals because the low-frequency signals included in the local means are removed from the calculations. By contrast, SRCC still considers the correlation of low-frequency signals. For example, the Arctic Oscillation Index (Fig. 6a) and the latent heat flux in the Greenland Sea (Fig. 6b) are compared. The appearances of LRCC (Fig. 6c) and SRCC (Fig. 6d) are quite similar, which means high-frequency signals dominate the data set. This result indicates that the latent heat flux responds more strongly to the high-frequency variations than to the low-frequency variations of the Arctic Oscillation. According to Eq. (27), the cosine-weight is dominant in this example. However, the LRCC and SRCC results of 2 m and 500 hPa air temperatures are quite different, as shown in Fig. 5. This finding indicates that the correlation of low-frequency signals is significant, with the sine-weight being dominant (Fig. 5f).

Fig. 6 Running correlation dominated by high frequency signals. (a), Arctic Oscillation Index (red line) and the local mean (blue line); (b), Latent heat flux (unit: W m⁻²) in the Greenland Sea (red line) and the local mean (blue line); (c), LRCC R_r(t); (d), SRCC R_s(t). The data of the Arctic oscillation index are obtained from the National Weather Service Climate Prediction Center of NOAA. The latent heat flux data are obtained from NCEP-DOE Reanalysis 2 from the National Centers of Environment Prediction.

Although SRCC includes the correlation of low-frequency signals, it does not filter out high-frequency signals. Therefore, LRCC includes only the correlation of high-frequency signals, while SRCC includes the correlations of both high- and low-frequency signals.
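The split into high- and low-frequency contributions can be illustrated with a synthetic pair of series that share a fast oscillation but carry a slow oscillation of opposite sign. All amplitudes, periods, and the window half-width below are assumptions chosen for illustration and are unrelated to the data sets in Figs. 5 to 7.

```python
import math

def running_cc(x, y, n, use_global_means):
    """SRCC (global means, Eq. 7) or LRCC (local means, Eq. 4) per window center."""
    N = len(x)
    gx, gy = sum(x) / N, sum(y) / N
    out = []
    for i in range(n, N - n):
        xs, ys = x[i - n:i + n + 1], y[i - n:i + n + 1]
        mx = gx if use_global_means else sum(xs) / len(xs)
        my = gy if use_global_means else sum(ys) / len(ys)
        sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        sxx = sum((a - mx) ** 2 for a in xs)
        syy = sum((b - my) ** 2 for b in ys)
        out.append(sxy / math.sqrt(sxx * syy))
    return out

N = 400
# Low frequency: period 400, much longer than the 13-point window.
slow = [3.0 * math.sin(2 * math.pi * k / 400) for k in range(N)]
# High frequency: period 5, shorter than the window.
fast = [math.sin(2 * math.pi * k / 5) for k in range(N)]
x = [f + s for f, s in zip(fast, slow)]      # fast signal plus slow signal
y = [f - s for f, s in zip(fast, slow)]      # same fast signal, reversed slow signal

n = 6                                        # window half-width (13 points)
lr = running_cc(x, y, n, use_global_means=False)
sr = running_cc(x, y, n, use_global_means=True)
peak = 100 - n                               # list index of the window centered at k = 100
```

In this construction LRCC stays positive in every window because it only registers the shared fast signal, whereas SRCC is negative around the crest of the slow oscillation (`sr[peak]`), where the opposite low-frequency signals dominate.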

The correlation of monthly sea level air pressures in Beijing and Guangzhou (Figs. 7a and 7b) is described here as another example. The appearances of LRCC (Fig. 7c) and SRCC (Fig. 7d) are quite different. LRCC reflects high-frequency features with mostly positive correlations, whereas SRCC shows negative correlations, which means low-frequency signals are dominant in the data set. In this example, the global CC is −0.755, consistent with the averaged SRCC. This result strongly exhibits the advantages of SRCC over LRCC. Indeed, in the present example, LRCC appears to have notable shortcomings.

Fig. 7 Running correlation dominated by low-frequency signals. (a), Monthly air pressure (unit: hPa) in Beijing (red line) and the local mean (blue line); (b), Monthly air pressure (unit: hPa) in Guangzhou (red line) and the local mean (blue line); (c), LRCC R_r(t); (d), SRCC R_s(t). The monthly air pressure data are obtained from the China Meteorological Data Service Center.

SRCC reveals the opposite-phase, low-frequency, seesaw-like oscillation between Beijing and Guangzhou. This phenomenon is fairly similar to the North Atlantic Oscillation, in which a seesaw-like oscillation appears in the surface air pressure difference between Iceland and the Azores. LRCC cannot detect this low-frequency phenomenon.

The following example presents another correlation of low-frequency phenomena revealed by SRCC. Beijing and New York are both located near 40°N. The central longitude of Beijing is 116°E, and that of New York is 74°W. Comparison of the surface temperatures of the two cities can help improve the understanding of their long-term variations. The monthly averaged temperature data for 1989 – 2017 are selected from the NCEP/NCAR Reanalysis 1, and a 12-point running average is adopted to eliminate seasonal variations. The variations in temperatures are shown in Figs. 8a and 8b. The temperature variations in the two cities are highly similar before 2009, and even their extremes occur in the same years. However, the temperature variations in the two cities are nearly opposite from 2009 to 2016. A temperature maximum in one city often coincides with a temperature minimum in the other. SRCC (Fig. 8d) reveals these characteristics well. Prior to 2009, positive correlations are dominant; after 2009, negative correlations are dominant. LRCC cannot accurately reflect this shift in correlation (Fig. 8c).
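The 12-point running average used to remove the seasonal cycle can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify how the window is aligned or how the record ends are treated, so the sketch keeps only the fully covered positions, and the function name `running_mean` and the synthetic monthly series are hypothetical.

```python
import numpy as np

def running_mean(series, points=12):
    """Unweighted running average over `points` consecutive samples.
    Only positions fully covered by the window are returned."""
    kernel = np.ones(points) / points
    return np.convolve(series, kernel, mode="valid")

# Hypothetical monthly temperatures: constant level + weak trend + annual cycle
months = np.arange(120)
seasonal = 10.0 * np.sin(2 * np.pi * months / 12)
monthly_temp = 15.0 + 0.01 * months + seasonal

smoothed = running_mean(monthly_temp, points=12)
print(len(monthly_temp), len(smoothed))
```

Because every 12-month window spans exactly one annual cycle, the seasonal term averages out and only the level and the slow trend remain in `smoothed`.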

Fig. 8 Running correlation of surface air temperatures in Beijing and New York. (a), Surface air temperature (unit: ℃) with a 12-point average in Beijing (red line) and the local mean (blue line); (b), Surface air temperature (unit: ℃) with a 12-point average in New York (red line) and the local mean (blue line); (c), LRCC R_r(t); (d), SRCC R_s(t). The surface air temperature data are obtained from NCEP/NCAR Reanalysis 1.

In this example, positive correlation is the normal situation and reflects the consistent low-frequency variation of global air temperature. The negative correlation obtained in 2009 – 2016 is an abnormality that can be explained by Arctic amplification (Francis and Vavrus, 2012). Arctic amplification has a sizable impact on the climate of the mid-latitudes, as seen in the severe winters experienced in New York (2009 – 2013) and Beijing (2014 – 2015), which are shown in Figs. 8a and 8b. Francis and Vavrus (2012) explained the occurrence of severe winters using Rossby wave theory. Specifically, the amplitude of the Rossby wave markedly increases as a result of Arctic warming, which allows the cold air in higher latitudes to flow out to the mid-latitude areas along the fronts. The negative correlation observed in 2009 – 2016 indicates that the cold air phenomenon occurs alternately in New York and Beijing.

Therefore, LRCC only reflects the correlation between high-frequency signals, whereas SRCC reflects the correlation of both high- and low-frequency signals. In particular, if the correlation of phenomena with low-frequency variations or long-term trends is to be studied, SRCC is the inevitable choice.

6 Discussion and Conclusions

A running correlation coefficient (RCC) is calculated by moving the time window to study temporal variations in the correlations of two time series. The local running correlation coefficient (LRCC) is obtained by the direct application of the general definition of a correlation coefficient to the data within a time window. However, LRCC only reflects the correlation between two anomalies within the time window and fails to reflect the contributions of two varying means. Thus, a new method called synthetic running correlation coefficient (SRCC) was proposed by Zhao et al. (2018), which is calculated using the means of all data (global means) instead of the varying local means. SRCC reflects the correlation between varying anomalies and varying means. However, as a recently proposed method, SRCC lacks the support of mathematical theory. In the present study, the validity of SRCC is demonstrated theoretically by considering the comparability of RCC values in different time windows.
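The difference between the two coefficients can be written compactly. The following is a sketch in notation chosen here for illustration, which may differ from the symbols used in the earlier sections: the sums run over the $n$ points of the window centered at time $t$, $\bar{x}_t$ and $\bar{y}_t$ denote the local means within that window, and $\bar{x}$ and $\bar{y}$ denote the global means. The SRCC form is assumed from the description above, namely that the global means replace the local means.

```latex
R_r(t) = \frac{\sum_{i}(x_i-\bar{x}_t)(y_i-\bar{y}_t)}
              {\sqrt{\sum_{i}(x_i-\bar{x}_t)^2\,\sum_{i}(y_i-\bar{y}_t)^2}},
\qquad
R_s(t) = \frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}
              {\sqrt{\sum_{i}(x_i-\bar{x})^2\,\sum_{i}(y_i-\bar{y})^2}}

% Splitting each deviation from the global mean into a local anomaly
% plus the offset of the local mean gives, for the numerator of R_s(t),
\sum_{i}(x_i-\bar{x})(y_i-\bar{y})
  = \sum_{i}(x_i-\bar{x}_t)(y_i-\bar{y}_t)
  + n\,(\bar{x}_t-\bar{x})(\bar{y}_t-\bar{y})
```

The first term on the right is the numerator of LRCC (the anomalies), and the second term carries the contribution of the varying local means, which is the part that LRCC discards.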

RCC is only meaningful when its values at different times can be compared. Thus, a pair of constants must be determined prior to the actual calculation as the standard for comparison. The unique standard quantities are demonstrated to be the global means. This result indicates that SRCCs of different time windows are comparable, but LRCCs are not. Comparability may also be expressed in the x-y geometric space of the two data series. The cross points of the fitted lines of LRCC at different times are found at different positions; by contrast, the cross points of SRCC are located at the same position. Therefore, the magnitudes of SRCC at different time windows are comparable, whereas those of LRCC are not.

In this study, the relationship between LRCC and SRCC was derived by statistical methods, and SRCC was obtained by adding a constraint condition to the LRCC algorithm. Specifically, the regression lines must cross at the center of all data, which is represented by the global means in the geometric space. This constraint condition provides the mathematical basis of the comparison standards. Thus, SRCC is proven to be the unique RCC satisfying the comparability criterion.

When the temporal fluctuations are divided into high-frequency (i.e., periods shorter than the time window) and low-frequency (i.e., periods longer than the time window) signals, some examples show that LRCC only reflects the correlation of high-frequency signals. By contrast, SRCC reflects the correlation of both high and low frequencies.

Our findings do not mean that previous results obtained using LRCC are questionable. Many studies have focused on the correlation between seasonal and sub-seasonal signals, which are high-frequency variations, and LRCC is, in fact, a good measure of the relevant correlation. Nevertheless, if the correlation of phenomena with low-frequency variations or long-term trends is to be studied, SRCC is the inevitable choice because it provides the complete information of the running correlation between various periods.

More importantly, SRCC embodies the physical fact that any piece of data is a part of the whole dataset. The global means are simply parameters that include the information of the whole data. SRCC establishes the connection between local and global variations via the global means. Therefore, SRCC is the correct approach to calculate RCC.

The dependence of SRCC on global means gives rise to a unique feature of SRCC. Alterations in a data domain result in changes in the global mean. Thus, SRCC varies for different data lengths even if the same data series are used.
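This dependence on record length can be checked numerically with a short sketch. It assumes the same form of SRCC as described above (window sums taken relative to the global means); the helper `window_cc`, the window position, and the synthetic trended series are hypothetical and serve only as an illustration.

```python
import numpy as np

def window_cc(x, y, start, window, x_ref=None, y_ref=None):
    """Correlation over x[start:start+window] and y[start:start+window].
    If no reference means are given, the local means are used (LRCC);
    passing the global means gives the assumed SRCC form."""
    xs, ys = x[start:start + window], y[start:start + window]
    x_ref = xs.mean() if x_ref is None else x_ref
    y_ref = ys.mean() if y_ref is None else y_ref
    dx, dy = xs - x_ref, ys - y_ref
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Hypothetical series sharing a linear trend, with unrelated fast oscillations
t = np.arange(400)
x = 0.01 * t + np.sin(0.7 * t)
y = 0.01 * t + np.cos(1.3 * t)

start, window = 50, 24
x_half, y_half = x[:200], y[:200]          # same data, shorter record

srcc_full = window_cc(x, y, start, window, x.mean(), y.mean())
srcc_short = window_cc(x_half, y_half, start, window, x_half.mean(), y_half.mean())
lrcc_full = window_cc(x, y, start, window)
lrcc_short = window_cc(x_half, y_half, start, window)

print("SRCC, full vs. truncated record:", srcc_full, srcc_short)
print("LRCC, full vs. truncated record:", lrcc_full, lrcc_short)
```

The window contains identical data in both cases, so LRCC is unchanged, whereas SRCC changes because truncating the record shifts the global means.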

Acknowledgements

This study was supported by the National Natural Science Foundation of China (Nos. 41976022, 41941012), and the Major Scientific and Technological Innovation Projects of Shandong Province (No. 2018SDKJ0104-1).

References

Brown, J. R., Hope, P., Gergis, J. and Henley, B. J., 2016. ENSO teleconnections with Australian rainfall in coupled model simulations of the last millennium. Climate Dynamics, 47(1-2): 79-93. DOI:10.1007/s00382-015-2824-6

Burdun, G. D. and Markov, B. N., 1972. Osnovy Metrologii (Fundamentals of Metrology). Izd-vo Standartov, Moscow: 196-206.

Elias, A. G. and Zossi de Artigas, M., 2003. A search for an association between the equatorial stratospheric QBO and solar UV irradiance. Geophysical Research Letters, 30: 1841.

Francis, J. A. and Vavrus, S. J., 2012. Evidence linking Arctic amplification to extreme weather in mid-latitudes. Geophysical Research Letters, 39: L06801. DOI:10.1029/2012GL051000

Galton, F., 1888. Correlations and their measurement, chiefly from anthropometric data. Proceedings of the Royal Society of London, 45: 135-145.

Geng, X., Zhang, W., Jin, F. F. and Stuecker, M. F., 2018. A new method for interpreting nonstationary running correlations and its application to the ENSO-EAWM relationship. Geophysical Research Letters, 45: 327-334. DOI:10.1002/2017GL076564

Huybrechs, D., 2010. On the Fourier extension of non-periodic functions. SIAM Journal on Numerical Analysis, 47(6): 4326-4355. DOI:10.1137/090752456

Hynčica, M. and Huth, R., 2020. Gridded versus station temperatures: Time evolution of relationships with atmospheric circulation. Journal of Geophysical Research: Atmospheres, 125: e2020JD033254. DOI:10.1029/2020JD033254

Ji, X. P. and Zhao, J. P., 2019. Transition periods between sea ice concentration and sea surface air temperature in the Arctic revealed by an abnormal running correlation. Journal of Ocean University of China, 18(3): 633-642. DOI:10.1007/s11802-019-3909-3

Kodera, K., 1993. Quasi-decadal modulation of the influence of the equatorial quasi-biennial oscillation on the north polar stratospheric temperatures. Journal of Geophysical Research: Atmospheres, 98: 7245-7250. DOI:10.1029/92JD02930

Kuznets, S., 1928. On moving correlation of time sequences. Journal of the American Statistical Association, 23(162): 121-136. DOI:10.1080/01621459.1928.10503005

Merrington, M., Blundell, B., Burrough, S., Golden, J., and Hogarth, J., 1983. A list of the papers and correspondence of Karl Pearson (1857 - 1936). Publications Office, University College London.

Pearson, E. S., 1938. Karl Pearson: An Appreciation of Some Aspects of His Life and Work. Cambridge University Press, Cambridge, 193-257.

Pearson, K., 1896. Mathematical contributions to the theory of evolution. - On a form of spurious correlation which may arise when indices are used in the measurement of organs. Proceedings of the Royal Society of London, 60(3): 489-498.

Salby, M., Callaghan, P. and Shea, D., 1997. Interdependence of the tropical and extratropical QBO: Relationship to the solar cycle versus a biennial oscillation in the stratosphere. Journal of Geophysical Research: Atmospheres, 102(D25): 29789-29798. DOI:10.1029/97JD02606

Schmid, J., 1947. The relationship between the coefficient of correlation and the angle included between regression lines. The Journal of Educational Research, 41(4): 311-313. DOI:10.1080/00220671.1947.10881608

Soukhearev, B., 1997. The sunspot cycle, the QBO, and the total ozone over northeastern Europe: A connection through the dynamics of stratospheric circulation. Annales Geophysicae, 15: 1595-1603. DOI:10.1007/s00585-997-1595-8

Turner, T. E., Swindles, G. T., Charman, D. J., Langdon, P. G., Morris, P. J., Booth, R. K., Parry, L. E. and Nichols, J. E., 2016. Solar cycles or random processes? Evaluating solar variability in Holocene climate records. Scientific Reports, 6: 23961. DOI:10.1038/srep23961

Varotsou, E., Jochumsen, K., Serra, N., Kieke, D. and Schneider, L., 2015. Interannual transport variability of upper Labrador Sea water at Flemish Cap. Journal of Geophysical Research: Oceans, 120: 5074-5089. DOI:10.1002/2015JC010705

Woods, J. W. and Oneil, S. D., 1986. Subband coding of images. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34: 1278-1288. DOI:10.1109/TASSP.1986.1164962

Zhao, J. P. and Huang, D. J., 2001. Mirror extending and circular spline function for empirical mode decomposition method. Journal of Zhejiang University (Science), 2(3): 247-252. DOI:10.1631/jzus.2001.0247

Zhao, J. P. and Drinkwater, K., 2014. Multiyear variation of the main heat flux components in the four basins of Nordic Seas. Periodical of Ocean University of China, 44(10): 9-19 (in Chinese with English abstract).

Zhao, J. P., Cao, Y. and Shi, J. X., 2006. Core region of Arctic oscillation and the main atmospheric events impact on the Arctic. Geophysical Research Letters, 33: L22708. DOI:10.1029/2006GL027590

Zhao, J. P., Cao, Y. and Wang, X., 2018. The physical significance of the synthetic running correlation coefficient and its applications in oceanic and atmospheric studies. Journal of Ocean University of China, 17(3): 451-460. DOI:10.1007/s11802-018-3798-x

Zhao, J. P., Drinkwater, K. and Wang, X., 2019. Positive and negative feedbacks related to the Arctic oscillation revealed by air-sea heat fluxes. Tellus A: Dynamic Meteorology and Oceanography, 71(1): 1-21.

Received: 2020-11-05; revised: 2021-01-20; accepted: 2021-01-29