Rank correlation under categorical confounding

Rank correlation is invariant to bijective marginal transformations, but it is not immune to confounding. Assuming a categorical confounding variable is observed, the author proposes weighted coefficients of correlation for continuous variables developed within a larger framework based on copulas. While the weighting is clear under the assumption that the dependence is the same within each group implied by the confounder, the author extends the Minimum Averaged Mean Squared Error (MAMSE) weights to borrow strength between groups when the dependence may vary across them. Asymptotic properties of the proposed coefficients are derived and simulations are used to assess their finite sample properties.


Introduction
Correlation may be used to determine the strength of the link between two continuous variables. Rank correlation is often preferred as it makes no assumption on the marginal distributions of the variables and estimate their dependence structure directly. Those rank statistics are however not immune to the effect of confounding variables, and data with an underlying categorical variable may display a false correlation that is somewhat akin to an ecological fallacy when the marginal distributions differ between the groups implied by this confounder.
To illustrate, let us generate random data that show a spurious correlation between height and salary. Figure 1 displays a sample of 150 men and 150 women where the height and salary are generated independently, but their distributions depend on the gender. While the distribution of the height is based on the tables from Mc Dowell et al. (2008), the salary is generated to match statistics for weekly earnings from the Bureau of Labor Statistics. We make no attempt here at determining whether wages are equitable, we merely use factual distributions within a simplified simulation. Although Spearman correlations are -0.004 and 0.027 for the men and women respectively, the correlation calculated from the pooled samples amounts to 0.137 (p-value 0.018) due to the differences in the marginal distributions. Failing to take gender into consideration thus leads to wrongly concluding that salary and height are positively linked.
Differences in the marginal distributions across the groups defined by the confounding variable can be accounted for by calculating the ranks in these groups rather than globally. As a consequence, the sample is split in m smaller samples, and coefficients from each of these groups may be combined with appropriate weights. Under the assumption that the groups share the same dependence structure, any weighting will yield an unbiased estimate, but we also consider the case where the dependence in each group could differ. Whenever the dependence structure in the groups are likely to be similar, using data from all the groups wisely could provide a favorable tradeoff between bias and variance. Rather than keeping only the much smaller sample from each group of interest, we then suggest to use an extension of the Minimum Averaged Mean Squared Error (MAMSE) weights of Plante (2008Plante ( , 2009a to borrow strength adaptively from the other groups. The dependence between two variables is best represented through the copula of their joint distribution. For continuous variables, the population value of rank correlation is a functional of that copula (see Genest and Nešlehová (2007) and Genest et al. (2014) for a descriptions of the challenges in the discrete case). While their original definitions are typically expressed as sums of ranks, coefficients of correlation based on ranks can also be rewritten as a functionnal of the empirical copula. We define MAMSE-weighted coefficients of correlation by replacing the empirical copulas in those alternative definitions with their MAMSE-weighted equivalent. In most cases, this is however equivalent to calcaulting a weighted sum of the coefficients of correlation.
Previous work on copula estimation in the presence of confounding variables includes Gijbels et al. (2011) and Veraverbeke et al. (2011) who use a form of kernel weighting based on a continuous confounder to estimate the marginal distributions as well as the copula underlying the data. Their approach is therefore based on similarities between the confounders, which is harder to define for a discrete variable. By comparison, the MAMSE weights are based on the similarities in the variables of interest between groups, which is possible because a discrete confounder provides a certain number of data for each level of the confounder. In this paper, we also have the notion of possibly homogeneous copulas with heterogeneous marginals which would not seem appropriate in the continuous setting treated by Gijbels et al. (2011) and Veraverbeke et al. (2011). The problem that we address requires a different approach than those proposed therein.
Background definitions and notation are provided in Section 2. Weights for empirical copulas are introduced in Section 3 as well as their theoretical properties. The same weights are used for coefficients of correlations based on ranks in Section 4 and convergence results are provided. Finally, Section 5 presents simulation results and a case study to illustrate the use of these weighted methods and explore their performance on finite samples. Technical proofs appear in the Appendix.

Background and notation
We assume that a discrete finite confounding variable is observed along with pdimensional continuous data of interest. For infinite variables, merging some values could offer a workaround, and if multiple discrete confounders are observed, they can be combined into one categorical variable through a cross-product. The complete sample is formed of independent variables, and is split in m different groups by the confounding variable. We use an index k to keep track of the simultaneously increasing sample sizes in the groups when studying asymptotic results. For any fixed k ∈ IN, we observe independent and identically distributed X i1 , . . . , T is a vector in p dimensions and F i are continuous. By the theorem of Sklar (1959), there exists a unique copula C i underlying the distribution F i such that be the ranks associated with the vectors X ij , j = 1, . . . , n ik .
For fixed i and , the list of values X i1 , . . . , X in ik is sorted and R k ij is the rank of X ij in that list. Since F i are continuous, ties cannot occur with probability 1.
The empirical copula,Ĉ ik (u) = (1/n ik ) . . . , u p ] T , uses ranks to estimate C i . The indicator variable 1(•) is equal to one if its argument is true and equal to 0 otherwise. The empirical copula puts a weight of 1/n ik on some points of an evenly spaced grid over [ 0, 1] p with exactly one such point in every (p − 1)-dimensional slice of the grid (rows and columns in 2 dimensions).
For bivariate data, coefficients of correlation based on ranks measure concordance of the data. The population values of the well-known Spearman's ρ and Kendall's τ are ρ = 12 uvdC (u, v) − 3 and τ = 4 C(u, v)dC(u, v) − 1 respectively, where C stands for the copula underlying the data. Substituting the empirical copula in these expressions leads to estimates that are asymptotically equivalent to the usual formulas (with n data having ranks (R i , S i ) to adopt a simpler more common notation) ρ n = −3(n + 1)/(n − 1) + 12{n(n + 1)(n − 1)} −1 n j=1 R i S i andτ n = (2{n(n − 1)} −1 1≤i<j≤n sign(R i − R j ) sign(S i − S j ). Both empirical coefficients are known to be asymptotically normal. Other measures of dependence such as Gini's γ (Nelsen 1999) or Blest's coefficients (Blest 2000;Genest and Plante 2003;Pinto da Costa and Soares 2005) are akin to Spearman's ρ as they adopt the form of the expectation of a polynomial.

Weights for mixtures of empirical copulas
Let λ k = [λ 1k , . . . , λ mk ] T be nonnegative weights such that m i=1 λ ik = 1 for all k ∈ N and let be a mixture of the empirical copulas based on the m available samples.
In this paper, inference must be made on the dependence between two or more variables (C i or a functional thereof ), conditional on the discrete confounding variable. The marginal distributions are therefore nuisance parameters. We look at two different situations where the dependence accross the groups is homogeneous or not. While it could be tempting to express equality of the dependence structure through coefficients of correlations, especially when it is the measure of interest, one has to remember that correlation does not fully determine dependence. Indeed, two sample can yield an equal correlation, say an equal Spearman's ρ, but come from different copulas. Even with equal sample sizes, the variance of the estimates in these two samples will differ since their theoretical value depends on the true underlying copula (see e.g. Ruymgaart et al. 1972), not on the value of ρ alone. We consider the situation where all groups have a common dependence structure, and as such, it makes sense to assume equal copulas, i.e. C 1 = · · · = C m , rather than a weaker equality of the coefficients. The assumption of homogeneous dependence should be tested when required. We use a resampling procedure for that purpose in the case study. The second situation is when C i differ between groups. Inference must then be made on each group individually since they do not have a common dependence structure. However, it is likely that although not equal, the dependence could be similar between many groups and we thus propose to use the MAMSE weights to borrow strength from other groups.

Homogeneous copulas: scalar weights
We first consider the paradigm where the m groups are assumed to share a common dependence structure, i.e. C 1 = · · · = C m = C. We allow for general scalar weights, but need Assumption 1 to ensure that each datum's contribution tends to 0 as k → ∞.
Assumption 1 We assume that lim sup k N k /n ik < ∞ for i = 1, . . . , m to ensure that all sample sizes increase at a similar rate. This also implies that A k = m i=1 λ 2 ik N k /n ik is finite for all k. Deheuvels (1979) shows that sup u∈[0,1] p |Ĉ ik (u) − C i (u)| → 0 almost surely as k → ∞ (since n ik → ∞ then). Similarly, the estimateĈ λ k (u) converges uniformly.
Let U i (u) be a p-dimensional centered Gaussian random field with covariance function where ∧ is the component-wise minimum. Such a random field is called a p-dimensional pinned C i -Brownian sheet. Early results by Fermanian et al. (2004) and Tsukahara (2005) revisited by Segers (2012) (with weaker assumptions) show that √ n ik {Ĉ ik (u) − C i (u)} converges weakly to such a Brownian sheet whose variance depends on C i and its partial first-order derivatives. Since each of theĈ ik (u) are defined on independent samples and are asymptotically normal, the asymptotic distribution ofĈ λ k (u) directly follows.

Theorem 2 The random variable
Remark 1 The choice of weights has an effect on the asymptotic distribution of the empirical copula. Simple calculus may be used to show that λ ik = n ik /N k minimizes A k , hence yielding the least variable estimateĈ λ k (u). This choice corresponds to allocating an equal weight to each datum and yields A k = 1. Since λ ik = n ik /N k are optimal weights, all numerical examples involving scalar weights hereafter will be based on that choice of weights.
Note that our asymptotic paradigm involves a fixed number of groups whose sample sizes increase to infinity. The convergence would not hold for an infinite number of small groups. For instance, a mixture based on infinitely many samples of size 10 will still haveĈ(1/20, 1/20) = 0. Therefore, the convergence could fail if we were to increase the number of categories defined by confounding variables as the sample size increases.
Unless there are practical reasons to assume homogeneity of the copulas across groups, testing that assumption would seem advisable. Rémillard and Scaillet (2009) propose a test of equality between two copulas, but we need a test for a general number of groups m. Bouzebda et al. (2011) develop mathematical results for the m-sample empirical copula process, but while their results could lead to tests of equality for m samples, we could not locate a numerical implementation of such tests nor any results showing their finite sample properties. In the case study, we rather use resampling techniques to test the homogeneity of the copulas. Some basic properties of the proposed resampling algorithm are explored, but the future development of tests for the equality of the copulas in m samples will certainly provide better alternatives as they become available.

Heterogeneous copulas: adaptive weights
The assumption of identical dependence structures across groups may not always be appropriate. The problem to solve then becomes the inference of one or many of the C i . For simplicity, we will assume that only Group 1 is of interest, but the methodology developed could be applied sequentially to other groups of interest.
By identifying one group of interest, we adopt a paradigm similar to Wang and Zidek (2005) for the weighted likelihood. In this context, adaptive weights can trade potential bias for reduced variance. We therefore extend the MAMSE weights of Plante (2008Plante ( , 2009a by replacing the empirical distribution functions in their definition with empirical copulas.
Looking for a tradeoff between bias and variance means that the variance must play a role in the objective function that will determine the weights. Let us define While the first term in P k (λ) measures bias, the summation plays the role of a penalty for the variance that fosters using data from all the groups rather than limiting the inference to the group of interest. Since the asymptotic variance of the empirical copula depends on the true copula C i (u) and its derivatives, we consider a very rough estimate thereof given by which corresponds to the only term of the asymptotic variance of an empirical copula that does not involve a derivative of C i . The value of λ minimizing the objective function P k (λ) defined in (1) with the substitution (2) is called the MAMSE weights and is denoted μ k . The algorithm for the MAMSE weights proposed by Plante (2008) implemented in the MAMSE R package can be used in the current context with copulas. Numerically, the integral is calculated on an evenly spaced grid with n p 1k points, or through Monte Carlo integration. Additional details specific to copulas may be found in Plante (2007).

The MAMSE weights have the property that
T be a possibly suboptimal choice of weights for P k , and let μ k denote the MAMSE weights, then This property is key in proving Theorem 3, which would hold for other adaptive weights that respect the same condition.

Theorem 3
We have uniform convergence of the MAMSE-weighted empirical copula: Note that the MAMSE weights display an irregular behaviour as k → ∞. AlthoughĈ μ k converges uniformly to the desired target, the rate of that convergence cannot be traced easily and the weights μ k may remain random for an arbitrarily large k if a mixture of the true C 2 , . . . , C m is identical to C 1 . This behaviour is observed and discussed with other versions of the MAMSE weights in Plante (2008Plante ( , 2009a. The study of the asymptotic distribution of √ N k Ĉ μ k − C 1 would require a description of the similarities between the C i , an endeavour that will not be undertaken in this paper. Simulations and bootstrap can be used instead to determine the critical values for a test of hypothesis.

Weighted coefficients of correlation
Many coefficients of correlation based on ranks including Spearman's ρ (but not Kendall's τ ) take the form is a continuous bounded function on [ 0, 1] 2 . The coefficients a n k → a and b n k → b as n k → ∞ are chosen to ensure thatκ ik ∈ [ −1, 1] for all sample sizes n ik with the values ±1 occurring only for perfect concordance or discordance. Coefficients of the formκ ik are asymptotically normal based on the results of Ruymgaart et al. (1972) and Ruymgaart (1974) (see also Genest and Plante (2003) for illustrations). Their variance can be derived from an expression that depends on the true copula underlying the data.

Homogeneous copulas
Assuming that C i = C, eachκ ik is normally distributed and it is clear that the random variable √ N k /A k κ λ k − κ converges weakly to a Normal variate with mean 0 and the same asymptotic variance as √ n ik κ ik − κ when k → ∞. Coefficients of correlation are often used as a test of independence. Suppose that the alternative hypothesis is expressed through a parameter θ for which θ = 0 yields independence. The theoretical value of κ is a function of θ and κ(0) = 0. The asymptotic relative efficiency (ARE) of the two tests represent the ratio of the sample sizes needed by both tests to achieve the same power. To illustrate, suppose that we compare a test of independence based onκ with one based on Spearman'sρ. We find from Lehmann (1998) κ is the asymptotic variance ofκ, and similarly forρ.
Remark 2 If the marginal distributions were not affected by the confounder, we could pool the N k data together to yield a test based on the usual estimateκ calculated on the whole dataset. In that case, ARE Tκ , Tκ λ k = lim k→∞ A k . Recall that A k = 1 when λ i = n ik /N k , which means that there is no loss of power asymptotically for using a weighted coefficient.
Let us also consider the estimateτ converges weakly to a Normal distribution under the assumption that the copulas of the m groups are equal.

Heterogeneous copulas
When copulas are not assumed equal across groups, the MAMSE weights may be used to define consistent coefficients of correlation. Recall that within this paradigm, the dependence of each group is assessed individually while borrowing strength from the other groups. To simplify presentation, only Group 1 is deemed of interest, but the same methodology could be applied sequentially to every group if needed.
One strength of the MAMSE weights is that almost no assumptions are made about the underlying distributions in the m groups, yet consistency is secured. Determining rates of convergence and the asymptotic distribution of MAMSE based statistics would however require much stronger assumptions about the relative shape of the distributions in the m groups. For testing and inference, we prefer to rely on resampling methods.

Remark 3 Consider the Fréchet family of copula from Example 5.3 in Nelsen
The empirical version of Kendall's τ can be written asτ ik = 4n(n − 1) −1 Ĉ ik (u)dĈ ik (u)− 1 + 4(n − 1) −1 , which shows that Kendall's τ is asymptotically equivalent to replacing the copula by its empirical counterpart in the population value of τ . We thus define a new statistic based on 4 To facilitate their interpretation, coefficients of correlation are usually built to have a null expectation under independence. In addition, under perfect negative or positive dependence, the coefficients take values -1 and 1 respectively. We define the asymptotically equivalent expression Although T k ij = 1 under perfect positive dependence, we rather get E T k ij = n ik n jk /N ijk 1/n jk − 1/n ik 2 under the assumption of independence and an unwieldy expression for negative dependence. In general, E T k ij is not the mid-point of the values of T k under perfect positive and negative dependence, except if n ik = n jk . As a consequence, even a linear transformation cannot make τ μ k = μ T k T k μ k fit the magical values of -1, 0 and 1 appropriately for finite samples, contrarily to the inconsistentτ μ k who preserves this property.
The asymptotic normality ofτ can be derived from the theory on U-statistics, but τ μ k does not fall within this paradigm. Resampling methods may be used for testing or establishing confidence intervals.

Simulations and case study
Simulations are used to explore the finite-sample performances of the proposed weighted methods under different scenarios. Scalar weights that are proportional to the sample sizes (λ) are considered as well as the adaptive MAMSE weights (μ). Note that the index k is dropped in this section where sample sizes are fixed. The R package MAMSE from the Comprehensive R Archive Network offers functions to compute the MAMSE weights as well as weighted coefficients of correlation.

Salary vs height
We first revisit the example presented in the introduction where salary and height are simulated as independent variables, but with different marginal parameters for men and women. Height is simulated as a Normal variable with mean 176.3 cm and standard deviation 11.38 cm for men, but mean 162.2 cm and standard deviation 11.15 for women. Those parameters are based on the tables from Mc Dowell et al. (2008). Salary is simulated based on quantiles or order 10, 25, 50, 75 and 90% (average value over quarters of 2009 for each quantile) for the usual weekly earnings of men and women as calculated by the Bureau of Labor Statistics (accessed online at http://www.bls.gov/webapps/legacy/ cpswktab5.htm). The salary are assumed to be uniformly distributed between the given quantiles. For the purpose of the simulation, it is assumed that the minimum salary is 0 and that the maximum salary equals twice the 90th quantile. No attempt is made here to study the possibility of wage inequity: we barely use the distributions of height and salary to illustrate the potential effect of a discrete confounder when marginal distributions are nuisance parameters. The simulation described is repeated 10,000 times. Figure 2 shows the p-values of tests of independence based on Spearman's rho. If we consider a 5% level for a test, ignoring potential differences between men and women leads to a 31.4% rejection rate. The weighted coefficientρ λ , however, provides and unbiased test with an observed 5.3% rejection rate and an histogram that approaches a uniform distribution as expected.

Case study: the Iris dataset
Consider now the famous Iris dataset from Fisher (1936). The variables are respectively sepal length, sepal width, petal length and petal width, all measured in centimeters, for 50 Iris Setosa, 50 Iris Versicolor and 50 Iris Virginica. Although a trained eye would not mistake them for one another, these three species of flowers are relatively similar in color and shape. Looking at the correlation between the measurements may give an idea of the geometry of the flowers. For instance, do the petals of larger specimen keep the same shape, which would translate in a highly positive correlation between their length and width. The descriptive analysis of the marginal distributions found in Fig. 3 shows that the species have different marginal characteristics. A correlation that does not take into consideration this confounding variable therefore presents a biased picture of reality. To fix ideas, Table 1 displays estimates of the Spearman correlation matrix for the whole data set (ρ) on the left, then the same matrix based only on the data of each group (ρ i ). While the three groups have generally similar types of correlations structures,ρ offers a spurious picture that includes negative correlations. Beyond the general picture, the correlations matrices of the three groups display clear discrepancies: it is unlikely that the assumption of homogeneous copulas could hold and support a preference forρ λ . As a matter of fact, a resampling test based on 10,000 bootstrap samples gave a p-value of 0. Details of the resamlping methods can be found in the next subsection. Sinceρ λ would offer a biased view, we are left with the MAMSE-weighted coefficients of correlation to estimate each ρ i separately, or to rely only on data from Group i to estimate ρ i .
To estimate the correlations for, say Iris Versicolor, one could calculate the MAMSE weights on the four dimensional data and combine theρ i matrices accordingly. These weights are determined based on the similarities of four-dimensional empirical copulas across the three groups and must therefore strike a global compromise. If for instance the dependence of petal width and length is very similar across groups, but correlations involving the sepals are much less akin, they will still all be combined with the same weights. An alternative approach is to consider every pair of variables, and to compute bivariate MAMSE weights for them. The adaptation to similarities is improved, but there is no guarantee that the resulting matrix is positive definite. If interest lays in the correlations rather than the correlation matrix, this may be a better option.
In order to evaluate the performance of the different methods, we now run a simulation where parameters obtained from the Iris dataset are assumed to be the "true model".  Whileρ i contain Spearman's coefficients for each each of the three species of iris, namely Setosa (ρ 1 ), Versicolor (ρ 2 ) and Virginica (ρ 3 ), the matrixρ contains Spearman's correlation for the 150 iris taken as a single dataset, hence ignoring marginal discrepancies Using the R package MVN, we find that Mardia's test does not reject multivariate normality of the three datasets from each species of iris. We will therefore draw simulated iris from a multivariate normal distribution with parameters equal to the estimated mean and covariance obtained from the original Iris dataset. Knowing the "real" distribution of the data will allow to evaluate the Mean Squared Error (MSE) of each estimate. We generate 10,000 samples of 150 iris, 50 of each species, and we compute Spearman's correlation based on the two types of MAMSE weights. While the "global" weights are based on the four-dimensional data, the "pairwise" weights are determined separately for each group of two variables, allowing for increased flexibility. Each of the three species of iris are considered, in turn, as the target distribution. Table 2 shows 100 MSE(ρ i )/MSE(ρ μ ), the relative MSE comparing a version of MAMSE to its competitor based solely on the group of interest. The relative MSE is reported for each pairwise correlation for both global and pairwise weights. For the global weights, the MSE of the correlation matrices corresponds to the average MSE for each coefficient of that matrix and is also reported as a relative measure.
We first note that the MAMSE weights provide improved performance in most cases. The estimation of the correlation for Iris Versicolor, for instance, is always better with a MAMSE-weighted correlation, and the pairwise approach is systematically best. For Iris Virginica, both MAMSE approaches seem acceptable since they provide improved performance everywhere, except for the correlation between petal length and petal width. At the other end of the spectrum, the MAMSE-weighted correlations sometimes show weaker performances as it is the case for Iris Setosa. Looking at Table 1 we may notice thatρ 1 is the correlation matrix that seems the most dissimilar to the other ones. While an infinite sample size would still guarantee an efficient estimate, there are cases where a loss of efficiency is observed for finite samples. Such observations are also made by Plante (2008)  The values listed are 100 MSE(ρ i )/MSE(ρ μ ) and are based on 10,000 repetitions. Each species of iris is in turn the target group. The MAMSE weights are calculated based on a global or pairwise strategy. Relative MSE are reported for each pairwise correlation, as well as for the correlation matrix in the case of global weights who describes in the univariate case how the MAMSE weights initially boost the performance for small samples, and provides equivalent performance for very large samples. In between, there is often a certain range for which the MAMSE approach does not offer an improved performance. Note also that although estimation for Iris Setosa was not improved by the contribution of the other two kinds of iris, using Iris Setosa to estimate the parameters of the other two types of iris did yield better performances. In the MAMSE objective function, any bias must be compensated by an equally reduced variance, but transformation of the copulas into other statistics may change the geometry of bias and variance. The MAMSE weights do not provide a uniformly more efficient approach, but overall, it seems to offer an appreciable gain.
In this example, we ran a simulation inspired from a real dataset. That approach could be considered by somebody who wonders how much improvement they could expect from the MAMSE approach: They could run a simulation on a model that mimics their own data.

Resampling technique for testing the homogeneity of copulas
To test homogeneity of the copulas, we need a nonparametric test for the equality of copulas in m groups. As mentioned in Section 3.1, Rémillard and Scaillet (2009) have developped a solution for m = 2 and the results from Bouzebda et al. (2011) for more groups have not been implemented numerically nor tested on finite samples. While further developments of such tests will certainly offer better options in the near future, we choose here to use a resampling method that we present next.
The test is based on a Cramér-Von-Misses type statistic, namely is a mixture based on all groups except for i. In our implementation for this paper, Monte Carlo integration (with 2000 random points) is used to evaluate the integrals in the Cramér-Von-Misses statistic. The resampling test follows these steps: 1. Calculate the ranks R k ij on the raw data (the ranks are taken within each groups) and rescale them by dividing by n i . This step gets rid of the (nuisance) marginal distributions. 2. Pool the rescaled ranks R k ij /n i into a single set. Under the null hypothesis of homogeneity of dependence, those groups of rescaled ranks all follow (approximately) the same copula common to the groups. 3. Generate bootstrap samples of size n 1k , . . . , n mk by drawing witout replacement from the pooled list of ranks. 4. Calculate the ranks in each bootstrap sample, then compute the Cramér-Von-Misses type statistic presented above. 5. Calculate the same Cramér-Von-Misses type statistic on the original data. If it is bigger than the 95% bootstrap quantile, then homogeneity is rejected. Alternatively, a p-value is obtained by taking the proportion of bootstrap samples yielding a statistic greater than or equal to the statistic computed on the original data.
To confirm that this test has reasonnable finite sample properties, we ran a small simulation for scenarios inspired from the Iris dataset. We simulated 3 groups of 50 data from the normal distributions described in the previous section. To represent homogeneity of the copulas, the parameters of Iris Setosa were used for the three groups in a first scenario. A second scenario used the parameters from the original dataset, thus different covariances in each group. In both cases, 1000 samples (of three times 50 iris) were generated and for each, 1000 bootstrap samples were used in the resampling method to determine a p-value for the test of homogeneity. Figure 4 shows the histograms of those p-values.
Under the null hypothesis, the test appears to be conservative with a histogram that displays too many small values. As a matter of fact, a 5% level test would have rejected the null with probability 0.120. Resampling from ranks creates ties which could explain that bias. The right panel of Fig. 4 shows that the test has reasonnable power in the context of the Iris dataset. The same 5% level test has a power of 0.757 under that heterogeneous scenario.

Homogeneous copula
This simulation is designed to measure the loss of efficiency that is suffered when using the proposed weighted methods. Throughout this section, the benchmark method is to pool all the data in a single set, an impossible endeavour with real data because of the confounding.
Theoretical results showed that the scalar-weighted rank statistics considered are unbiased and that their asymptotic variance is not affected by the splitting of the sample in m groups when optimal scalar weights are used, but is there a measurable loss on finite samples? The MAMSE-weighted statistics are consistent, but they could be biased on finite samples. How much do we lose for not assuming homogeneity of the dependence when that assumption is in fact true?
We generate data with homogeneous copulas from a Clayton distribution (Clayton 1978;Nelsen 1999) whose parameter is set to yield a Spearman's correlation of ρ ∈ {0.1, 0.5, 0.9}. A total of 5n data points are available as samples of equal sizes from  Fig. 4 Histograms of 1000 p-values of a resampling test for the homogeneity of the copulas. Simulated iris datasets are generated from two scenarios. While on the left panel, the three species of iris share a same copula, on the right panel, the three species are generated as a multivariate normal with parameters estimated from the three species of iris in the original dataset 5 groups. For each value of ρ and n ∈ {10, 20, 50}, 10,000 sets of samples are simulated and homogeneity of the copulas is assumed without being tested. Although it would not be possible to pool the data into a single set in a real case because of the nuisance marginal distributions, we use that situation as an unreachable benchmark. The estimates based on pooled data can be recognized by their lack of index (they are notedĈ,ρ andτ ).
To evaluate the precision of the weighted empirical copulaĈ λ , the upper section of Table 3 shows the ratio 100 |Ĉ(u) − C(u)|du/ |Ĉ λ (u) − C(u)|du, and similarly forĈ μ . The theoretical results aboutĈ λ mention that it is unbiased and has the same asymptotic variance asĈ. Surprisingly, this conservation of the efficiency is visible even for a samples as small as n = 10, and for all strengths of correlation. Estimating the MAMSE weights has a cost, so a smaller efficiency is expected forĈ μ , but the loss is fairly small.
Weighted coefficients of correlation are also calculated on the samples described above. Their performance measured by ratios such as 100 MSE(ρ)/MSE(ρ λ ) appear in the lower part of Table 3. Note that without the proposed methodology, the alternative would be to use only Group 1 for inference, which would yield a ratio of 20. Compared to that achievable benchmark, the weighted methods always provide an improvement. Whilê ρ λ is asymptotically efficient, its efficiency is not attained on small samples, but clearly increases as n increases. Remember that in a real life setting, the confounding makes it impossible to computeρ directly, so the loss of efficiency observed may be unavoidable. Although we did not compute its ARE explicitly,τ λ shows a behaviour similar toρ λ . This good behaviour is not surprising under homogeneous copulas sinceτ λ is then unbiased with a variance equal to that ofτ . The performance of the weighted coefficients of correlation seem to decrease as the correlation gains in strength. Splitting the dataset in multiple smaller samples reduced the variety of values that an empirical coefficient may achieve and this becomes more acute with larger correlations as most combinations of ranks become improbable. In general, using the MAMSE weights when the copulas are homogeneous decreases the performance, but we can observe that the loss is reasonable. The MAMSE weights are clearly offering a better performance while protecting against heterogeneity.
When considering Kendall's τ , the weighted avaverageτ λ performs best. In the homogeneous case, this coefficient could be considered. Even ifτ μ shows better performances than τ μ , we would still recommend the latter in the heterogeneous case given thatτ μ may not be consistent.

Tests of independence
Remark 2 mentions that there is no asymptotic loss of power in testing independence using a scalar-weighted version of Spearman's ρ. To obtain a more complete picture, we study the power of tests of independence based on different coefficients of correlation, including weighted coefficients with scalar and MAMSE weights. Five groups of equal size n = 20 are simulated from a Clayton copula under three scenarios where the parameter of the Clayton is matched to Spearman's ρ and expressed as such (even for simulations about τ ). The true correlation in Group i is therefore noted ρ i . Power graphs are plotted as a smoothed line based on 51 different values of ρ 1 , and for each of these values of the graph, 1000 repetitions of a test of independence are generated. The first scenario has homogeneous copulas (ρ 1 = ρ 2 = ρ 3 = ρ 4 = ρ 5 ) hence the scalar weights should be performing optimally. In the other scenarios, groups 2 to 5 have a fixed correlation while only ρ 1 varies according to the x-axis. In Scenario 2 ρ 2 = ρ 3 = ρ 4 = ρ 5 = 0.1, but in Scenario 3, ρ 2 = −0.4, ρ 3 = −0.2, ρ 4 = 0.2, ρ 5 = 0.4. To test independence with a sample of size n, Spearman's ρ is compared to a centered Normal with variance 1/(n − 1) and Kendall's τ to a centered Normal with variance (4n + 10)/{9n(n − 1)} (see e.g. Capéraà and Van Cutsem 1988). For weighted coefficients based on five groups with equal scalar weights, the same formulas are used with the total sample size. Tests based on the MAMSE-weighted coefficients are trickier as the weights depend not only on Group 1, but also on data from the other groups. In particular, it depends on data that are not covered by the tested hypothesis H 0 : ρ 1 = 0. To determine if one should reject the null or not, we therefore proceed with resampling techniques where a new sample is generated for Group 1 from the independence copula while sampling with replacement is applied to each of the other groups. To keep the computations manageable, each test is based on 400 bootstrap samples that are use to determine the standard error of the MAMSE-weighted coefficients of correlation. A Wald-type statistic is then used to test independence. Figure 5 shows the power of a test of independence based on different coefficients of correlation. The dashed lines show the power of the test based only on one group of size n. Even though confounding would make such an operation impossible in practice, the coefficients are also calculated on the whole dataset and the power of the corresponding tests are drawn as dotted lines for reference. The mixed (dashes and dots) lines show the power of a test based on a coefficient with scalar weights. The plain lines give the power of tests based on the MAMSE-weighted sum of coefficients. The display for Kendall's τ includes an additional curve with longer dashes for τ μ .
Under the homogeneous copulas scenario, the tests based on scalar weights offer almost the same power as those using the whole dataset directly, thus illustrating Remark 2 on finite samples. With heterogeneous copulas,ρ λ andτ λ are biased and inapplicable, but the MAMSE-based strategies offer good performances. In fact, the MAMSE-weighted coefficients seem to offer the best compromise. Under homogeneity of the copulas, they provide a test that is slightly less efficient thanρ λ orτ λ , but more efficient than the alternativesρ 1 orτ 1 . If the copulas are heterogeneous across groups, then the MAMSE-weighted coefficients offer a power on par with the next best alternative: the coefficient based only on the group of interest. By adapting to the data, the MAMSE weights offer a robust alternative that gets close to the best available option without needing to know the nature of the discrepancies between groups or lack thereof.

Fig. 5
Power of a test of independence based on different coefficient of correlations. The two columns of plots are respectively for estimates of Spearman's ρ and Kendall's τ , the rows correspond to different scenarios described with equations on the right. Equal samples of size n = 20 are drawn from five groups from a Clayton distribution with correlation ρ i . The null hypothesis is H 0 : ρ 1 = 0. The power is simulated with 1000 repetitions on 51 different values of ρ 1 to yield a curve that is then smoothed

Conclusion
Rank statistics are used to infer the dependence structure (copula or correlation) of a distribution without estimating its marginal distributions. The presence of a discrete confounding variable may yield spurious correlations if the marginal distributions vary across the groups implied by the confounder. If the dependence structure is homogeneous across those groups, a weighted sum of the empirical copulas (or coefficients of correlation) computed from each groups provides an unbiased and asymptotically efficient solution. For heterogeneous dependence structures, we propose an adapted version of the MAMSE weights that preserves consistency while letting the groups borrow strength from each others based on the similarities of their empirical copulas. Simulations and a case study have shown that the proposed weighting schemes for rank statistics allow to account for the confounding and that although they are not uniformly more performant, the MAMSE weights provide sizable improvement in the MSE for many cases.