# A new multivariate two-sample test using regular minimum-weight spanning subgraphs

- David M Ruth

**1**:22

**DOI:** 10.1186/s40488-014-0022-4

© Ruth; licensee Springer. 2014

**Received:** 27 February 2014

**Accepted:** 26 August 2014

**Published:** 4 November 2014

## Abstract

A new nonparametric test is proposed for the multivariate two-sample problem. Similar to Rosenbaum’s cross-match test, each observation is considered to be a vertex of a complete undirected weighted graph; interpoint distances are edge weights. A minimum-weight, *r*-regular subgraph is constructed, and the mean cross-count test statistic is equal to the number of edges in the subgraph containing one observation from the first group and one from the second, divided by *r*. Unequal distributions will tend to result in fewer edges that connect vertices between different groups. The mean cross-count test is sensitive to a wide range of distribution differences and has impressive power characteristics. We derive the first and second moments of the mean cross-count test, and note that simulation studies suggest this test statistic is asymptotically normal regardless of underlying data distributions. A small simulation study compares the power of the mean cross-count test to Hotelling’s *T*^{2} test and to the cross-match test. This new test is a more powerful generalization of Rosenbaum’s test (the cross-match test is the case *r* = 1) and constitutes a noteworthy addition to the class of multivariate, nonparametric two-sample tests.

### Keywords

Distribution-free test; Graph-theoretic procedure; Change point

## Background

### Objective

Consider *N* = *m* + *n* independent multivariate observations **Y**_{1},…, **Y**_{m} and **Y**_{m + 1},…, **Y**_{N}, where each **Y**_{i} is drawn from distribution *F* for 1≤*i*≤*m* and from distribution *G* for *m* + 1≤*i*≤*N*. The dimension of the observations does not depend on *N*. The covariates may be quantitative or categorical; there need only exist some function, *d*, that measures distance between observations. The null hypothesis is that *F* = *G*. The objective is a two-sample test that has little or no dependence on the underlying distribution of the data. Furthermore, this test should have sufficient power to be useful for applications.

### Motivation

We follow in the vein of graph-theoretic tests for homogeneity: Consider each observation to be a vertex of a complete, undirected, weighted graph, $\mathcal{G}$, and assign interpoint distances as edge weights. The distribution of these distances is sensitive to departures from homogeneity; Maa et al. ([1996]) prove that two distributions are identical if and only if the distributions of inter-point distances within and between observations sampled from the two populations are the same. Friedman and Rafsky ([1979], [1981]) fit a minimum spanning tree to $\mathcal{G}$ and count the number of edges in the tree that connect vertices from different groups to test whether the sampling distributions are the same. Schilling ([1986]), Henze ([1988]), and Hall and Tajvidi ([2002]) examine properties of nearest-neighbor subgraphs of $\mathcal{G}$ to test for homogeneity.

Suppose *N* is even. Find a minimum-weight non-bipartite matching on $\mathcal{G}$, which is the lowest-weight spanning subgraph for which the degree of each vertex with respect to the subgraph is one and which consists of *N*/2 non-adjacent edges. Rosenbaum’s cross-match statistic, *A*_{1}, counts the number of edges in the matching that include one vertex from each of the two groups. Under the null hypothesis of no group difference each vertex is equally likely to be paired with any other vertex. Rosenbaum ([2005]) shows that the exact null distribution of *A*_{1} is found by combinatorial argument to be

$$P\left({A}_{1}={a}_{1}\right)=\frac{{2}^{{a}_{1}}\left(N/2\right)!}{\binom{N}{m}\left(\frac{1}{2}\left(m-{a}_{1}\right)\right)!\left(\frac{1}{2}\left(n-{a}_{1}\right)\right)!\,{a}_{1}!}\tag{1}$$

for *a*_{1}∈{0, 2, …, min(*m*, *n*)} and *m* and *n* even, or *a*_{1}∈{1, 3, …, min(*m*, *n*)} and *m* and *n* odd; *P*(*A*_{1} = *a*_{1}) = 0 otherwise. In the denominator of (1), $\frac{1}{2}\left(m-{a}_{1}\right)$ is the number of edges in the matching where both vertices are in the group of size *m* and $\frac{1}{2}\left(n-{a}_{1}\right)$ is the number of edges in the matching where both vertices are in the group of size *n*.
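As a rough illustration of how the null distribution can be evaluated numerically, the sketch below assumes that (1) has the form 2^{a1} (N/2)! / [C(N, m) ((m − a1)/2)! ((n − a1)/2)! a1!]. The function names `cross_match_pmf` and `cross_match_p_value` are hypothetical and are not taken from any package.

```python
from math import comb, factorial


def cross_match_pmf(a1, m, n):
    """P(A_1 = a1) under the null, assuming the form of (1) stated above."""
    N = m + n
    if N % 2 != 0:
        raise ValueError("m + n must be even; add a pseudo-observation first")
    # a1 must lie in [0, min(m, n)] and share the parity of m (and of n).
    if a1 < 0 or a1 > min(m, n) or (m - a1) % 2 != 0:
        return 0.0
    pairs_within_m = (m - a1) // 2  # edges with both vertices in the group of size m
    pairs_within_n = (n - a1) // 2  # edges with both vertices in the group of size n
    numerator = 2 ** a1 * factorial(N // 2)
    denominator = (comb(N, m) * factorial(pairs_within_m)
                   * factorial(pairs_within_n) * factorial(a1))
    return numerator / denominator


def cross_match_p_value(a1_obs, m, n):
    """Lower-tail p-value P(A_1 <= a1_obs); small A_1 is evidence against the null."""
    return sum(cross_match_pmf(a, m, n) for a in range(a1_obs + 1))


# Example with two groups of size 10 and an observed cross-match count of 4.
p_example = cross_match_p_value(4, 10, 10)
total_mass = sum(cross_match_pmf(a, 10, 10) for a in range(11))
```

The lower tail is summed because the test rejects for small values of *A*_{1}.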

When the two groups are drawn from different distributions the number of within-group pairs tends to be higher than for the null case, so the null hypothesis of homogeneity is rejected if *A*_{1} is sufficiently small. For odd *N*, this procedure may be simply modified by introducing a pseudo-observation, **Y**_{0}, such that *d*(**Y**_{0}, **Y**_{i}) = 0 for all *i*∈{1, …, *N*}, and randomly assigning it to one of the two groups. Then find a minimum-weight non-bipartite matching on this resulting graph with *N* + 1 vertices and compute *A*_{1} with respect to observations **Y**_{0}, …, **Y**_{N}.

That the exact null distribution of *A*_{1} is known, regardless of the underlying data distribution, is a particularly attractive property for a multivariate two-sample test. Furthermore, the asymptotic normality of *A*_{1} facilitates testing for large-sample problems. However, the cross-match test has relatively low power. Since only a single non-bipartite matching is considered in this test, information contained in the proximity of many pairs of points is ignored. Friedman and Rafsky ([1979], [1981]) observe that the power of their single-tree test is enhanced by evaluating successive disjoint low-weight spanning trees. Similarly, Ruth and Koyak ([2011]) show that ensembles of disjoint low-weight non-bipartite matchings carry significant information regarding whether a distributional change occurs over a sequence of independent observations. A drawback associated with examining collections of such subgraphs is that null distributions are extremely difficult to determine. Mindful of this caveat, we offer an extension of the cross-match test which exploits the information contained in the distances between many pairs of points.

## Methods

### Illustrating example

Consider the bivariate sample of size *N* = 20 listed in Table 1 and displayed in Figure 1. The sample consists of independent observations in groups 1 (○) and 2 (△); observations within groups are identically distributed. For the purposes of this example, these data were simulated from distributions whose locations differ by one unit in each dimension. Figure 1 also shows the minimum-weight non-bipartite matching associated with this sample with respect to Euclidean distance. The present goal is to identify the distribution difference between these groups, making no assumptions about the underlying distributions.

**Table 1. Bivariate data for illustrating example**

Observation number | Group | Covariate 1 | Covariate 2 |
---|---|---|---|
1 | 1 | -0.323 | -1.389 |
2 | 1 | 1.020 | -2.078 |
3 | 1 | -0.269 | -1.020 |
4 | 1 | 0.296 | -0.144 |
5 | 1 | 0.602 | 1.021 |
6 | 1 | 0.814 | -0.508 |
7 | 1 | -0.475 | -0.690 |
8 | 1 | -0.079 | 1.360 |
9 | 1 | -0.228 | 0.926 |
10 | 1 | -0.481 | 1.958 |
11 | 2 | 1.269 | 1.275 |
12 | 2 | 0.954 | 2.133 |
13 | 2 | -0.103 | 2.763 |
14 | 2 | -0.581 | -0.428 |
15 | 2 | 2.367 | 0.222 |
16 | 2 | 0.980 | 1.870 |
17 | 2 | 0.494 | 1.981 |
18 | 2 | 0.293 | 0.236 |
19 | 2 | 1.535 | 0.981 |
20 | 2 | 1.993 | -0.120 |

The cross-match test is applicable here; for this example the value of the cross-match statistic is *A*_{1} = 4 with a corresponding p-value = 0.433. So, the cross-match test is insufficiently powerful to identify a distribution difference in this case. In the next section, we introduce an extension of the cross-match test that enhances test power significantly.

### The mean cross-count (MCC) test

Consider a sample of *N* observations forming a complete, undirected, weighted graph, $\mathcal{G}$. Rather than find a minimum-weight non-bipartite matching, we find a *minimum-weight r-regular spanning subgraph of* $\mathcal{G}$, where 1≤*r*≤*N*-2, denoted ${\mathcal{G}}_{r}^{*}$. That is, ${\mathcal{G}}_{r}^{*}$ is a subgraph of $\mathcal{G}$ with the following properties:

- a) Every vertex in $\mathcal{G}$ is also in ${\mathcal{G}}_{r}^{*}$.
- b) Every vertex in ${\mathcal{G}}_{r}^{*}$ has degree *r*.
- c) The total weight of all edges in ${\mathcal{G}}_{r}^{*}$ is the lowest among all subgraphs of $\mathcal{G}$ which satisfy properties (a) and (b).

An *r*-regular spanning subgraph of $\mathcal{G}$ is sometimes called an *r-factor of* $\mathcal{G}$. Note that ${\mathcal{G}}_{1}^{*}$ is the special case of a minimum-weight non-bipartite matching used by Rosenbaum ([2005]), and ${\mathcal{G}}_{N-1}^{*}$ is identical to $\mathcal{G}$. In practice, we are mainly interested in 2≤*r*≤*N*/2, although the theoretical details are not so constrained. Minimum-weight *r*-factors may be computed as follows: For any subgraph of $\mathcal{G}$, let *x*_{ij} be an indicator variable equal to 1 if the edge connecting vertices *i* and *j* is included in the subgraph and let *d*_{ij} be the distance between vertex *i* and vertex *j*. Then the edges of ${\mathcal{G}}_{r}^{*}$ solve the following combinatorial optimization problem:

$$\min \sum_{1\le i<j\le N}{d}_{ij}{x}_{ij}\quad \text{subject to}\quad \sum_{j=1}^{i-1}{x}_{ji}+\sum_{j=i+1}^{N}{x}_{ij}=r\ \text{ for } i=1,\dots ,N,\qquad {x}_{ij}\in \left\{0,1\right\}.$$

Anderson ([1972]) assures the existence of a solution for *r*≤*N*/2. Solutions for *r* > *N*/2 are guaranteed by the fact that the complement of an *r*-regular subgraph of $\mathcal{G}$ is an (*N*-1-*r*) -regular graph. For this paper, solutions are found in R using the package “lpSolve” for *N*≤400. For *N* > 400, solutions are found in R using the package “gurobi” due to the computational complexity of larger problems.
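To make the definition of a minimum-weight *r*-factor concrete, the following sketch finds one by exhaustive search over edge subsets. This is not the integer-programming approach used in the paper; it is only feasible for very small *N* and is meant purely as an illustration. The function name `min_weight_r_factor` and the six-point example are hypothetical.

```python
from itertools import combinations


def min_weight_r_factor(dist, r):
    """Exhaustively search for a minimum-weight r-regular spanning subgraph.

    dist is a symmetric N x N matrix of interpoint distances.
    Returns (edges, total_weight); edges are pairs (i, j) with i < j.
    """
    n_vertices = len(dist)
    if (n_vertices * r) % 2 != 0:
        raise ValueError("N * r must be even for an r-factor to exist")
    all_edges = list(combinations(range(n_vertices), 2))
    n_edges_needed = n_vertices * r // 2  # an r-regular graph has N*r/2 edges
    best_edges, best_weight = None, float("inf")
    for subset in combinations(all_edges, n_edges_needed):
        degree = [0] * n_vertices
        for i, j in subset:
            degree[i] += 1
            degree[j] += 1
        if any(d != r for d in degree):
            continue  # property (b) fails: some vertex does not have degree r
        weight = sum(dist[i][j] for i, j in subset)
        if weight < best_weight:
            best_edges, best_weight = subset, weight
    return best_edges, best_weight


# Hypothetical example: six points on a line in two well-separated clusters.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in positions] for a in positions]
edges_r2, weight_r2 = min_weight_r_factor(dist, 2)
edges_r1, weight_r1 = min_weight_r_factor(dist, 1)
```

In this example the minimum-weight 2-factor consists of one triangle inside each cluster, whereas the minimum-weight matching (*r* = 1) is forced to use one edge between the clusters because each cluster has an odd number of points.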

Similar to the cross-match test, we count the number of edges *A*_{r} in ${\mathcal{G}}_{r}^{*}$ that include a vertex from each group. We call *T*_{r} = *A*_{r}/*r* the *mean cross-count* (*MCC*) *statistic*. The idea here is that the number of within-group edges in ${\mathcal{G}}_{r}^{*}$ will be higher for cases of a distribution difference than for the null case. So, small values of *T*_{r} are evidence against the null hypothesis. Note that *T*_{1} = *A*_{1} is the cross-match statistic as before. One could use the total cross-count, *A*_{r}, as an equivalent test statistic; however, we choose to scale this value to give some notion of “average cross-count per vertex degree” (hence the name “mean cross-count”). For odd *N*, randomly introduce a pseudo-observation in the same manner as the *r* = 1 case discussed in Section 1.2.
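Given the edge list of an *r*-regular subgraph and the group label of each vertex, the statistic is a simple count. The sketch below shows the computation; the function name `mean_cross_count` and the two small example graphs are hypothetical.

```python
def mean_cross_count(edges, labels, r):
    """Return (A_r, T_r) for an r-regular subgraph.

    edges  : iterable of vertex pairs (i, j)
    labels : labels[i] is the group of vertex i
    r      : common degree of the subgraph
    """
    a_r = sum(1 for i, j in edges if labels[i] != labels[j])  # cross-group edges
    return a_r, a_r / r


labels = [1, 1, 1, 2, 2, 2]

# Two triangles, one per group: no edge crosses between the groups.
two_triangles = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]
a_tri, t_tri = mean_cross_count(two_triangles, labels, r=2)

# A single 6-cycle through all vertices: two edges cross between the groups.
six_cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5)]
a_cyc, t_cyc = mean_cross_count(six_cycle, labels, r=2)
```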

### Illustrating example, continued

For *r* = 3, the example data give *A*_{3} = 12⇒*T*_{3} = 4, so the test statistic value here is the same as Rosenbaum’s cross-match test statistic. A discussion of the distribution of *T*_{r} is in Section 3.1; for this example, we estimate the p-value for *T*_{r} by permutation test on the observation vertex labels. Using 10,000 permutations yields an estimated p-value = 0.146. While this is not enough evidence to conclude a group difference, the reduction in p-value relative to the *r* = 1 case (p-value = 0.433) suggests that considering minimum-weight *r*-factors for *r* > 1 may improve test power. In Section 3.2 we demonstrate significant power advantages that are realized for the MCC statistic.
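A permutation test on vertex labels can be sketched as follows. Because the minimum-weight *r*-factor depends only on the distances, it stays fixed while the labels are shuffled, and comparing *A*_{r} is equivalent to comparing *T*_{r}. The function names, the default seed, and the use of the plain proportion of permutations (with no add-one correction) are assumptions made for illustration.

```python
import random


def cross_count(edges, labels):
    """Number of edges joining vertices with different group labels."""
    return sum(1 for i, j in edges if labels[i] != labels[j])


def permutation_p_value(edges, labels, n_perm=10000, seed=0):
    """Estimate P(A_r <= observed A_r) by shuffling labels on a fixed subgraph."""
    rng = random.Random(seed)
    observed = cross_count(edges, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random reassignment that preserves group sizes
        if cross_count(edges, shuffled) <= observed:
            hits += 1
    return hits / n_perm


# Hypothetical example: a 2-factor made of two triangles, one per group.
example_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]
example_labels = [1, 1, 1, 2, 2, 2]
p_hat = permutation_p_value(example_edges, example_labels)
```

For this toy graph the exact permutation p-value is 2/20 = 0.1, since only 2 of the 20 possible label assignments place each group on its own triangle, so the estimate should land close to that value.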

## Results and discussion

### MCC moments and normal approximation

Assume *N* is even, adopting the convention that if the number of observations is odd then we will consider *N* to be the number of observations including a pseudo-observation as previously discussed. To find the mean and variance of *T*_{r} under the null hypothesis, we proceed as follows: Let $\mathcal{G}$ be the complete undirected graph (Z_{N}, *E*_{N}) where the vertex set Z_{N} consists of the indices 1, 2, …, *N* and the edge set consists of all *N*(*N* - 1)/2 pairs of vertices; by convention, write the pairs with smaller vertex first, so *E*_{N} = {(*i*, *j*) : 1≤*i* < *j*≤*N*}. Partition Z_{N} into two sets *S* and *T*, with |*S*| = *m* and |*T*| = *n*, so *m* + *n* = *N*. Denote ${E}_{N}^{\left(S,T\right)}$ as the set of all edges with one vertex in *S* and the other in *T*. Let *X*_{ij} be the random variable that indicates whether edge (*i*, *j*) is included in a minimum-weight *r*-regular subgraph, ${\mathcal{G}}_{r}^{*}$, with 1≤*r*≤*N*-2. By the *r*-regularity of ${\mathcal{G}}_{r}^{*}$, for each *i*∈Z_{N} we have $r=\sum_{j=1}^{i-1}{X}_{ji}+\sum_{j=i+1}^{N}{X}_{ij}$, and so

$$r=\sum_{j=1}^{i-1}P\left({X}_{ji}=1\right)+\sum_{j=i+1}^{N}P\left({X}_{ij}=1\right).$$

But under the null hypothesis, each edge is equally likely to be included in ${\mathcal{G}}_{r}^{*}$, so *r* = (*N*-1)*P*(*X*_{12} = 1). Therefore, for all (*i*, *j*)∈*E*_{N},

$$P\left({X}_{ij}=1\right)=\frac{r}{N-1}.$$

The cross-count, *A*_{r}, may be written

$${A}_{r}=\sum_{\left(i,j\right)\in {E}_{N}^{\left(S,T\right)}}{X}_{ij},$$

and since ${E}_{N}^{\left(S,T\right)}$ contains *mn* edges, $E\left[{A}_{r}\right]=\frac{mnr}{N-1}$ and $E\left[{T}_{r}\right]=\frac{mn}{N-1}$.

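Finding the variance of *T*_{r} is slightly more involved, since it requires the joint inclusion probabilities of pairs of edges (*k*, *l*) and (*i*, *j*), both for pairs that share a vertex and for pairs that do not.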

We note in particular that the first and second moment results in (7) and (16) match the results in Rosenbaum ([2005]) for the special case *r* = 1.
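The sketch below computes the null mean and variance of *T*_{r} by a counting argument that is not necessarily the derivation used in this paper. It assumes the null distribution may be treated as that of the cross-count when group labels are assigned at random to a fixed *r*-regular graph, and it sums the probabilities that both edges of an ordered pair are cross-group edges: *Nr*/2 pairs of identical edges, *Nr*(*r* − 1) ordered pairs sharing exactly one vertex, and the remaining pairs being disjoint. The function name `mcc_null_moments` is hypothetical.

```python
from fractions import Fraction


def mcc_null_moments(m, n, r):
    """Null mean and variance of T_r = A_r / r (label-permutation argument)."""
    N = m + n
    if N < 4:
        raise ValueError("need at least four vertices")
    # Probability that a single edge joins the two groups.
    p_edge = Fraction(2 * m * n, N * (N - 1))
    # Probability that two edges sharing one vertex are both cross-group.
    p_adjacent = Fraction(m * n, N * (N - 1))
    # Probability that two vertex-disjoint edges are both cross-group.
    p_disjoint = Fraction(4 * m * (m - 1) * n * (n - 1),
                          N * (N - 1) * (N - 2) * (N - 3))
    n_edges = Fraction(N * r, 2)              # edges in an r-regular graph
    n_adjacent = N * r * (r - 1)              # ordered pairs sharing one vertex
    n_disjoint = n_edges ** 2 - n_edges - n_adjacent  # ordered disjoint pairs
    mean_a = n_edges * p_edge
    second_moment_a = (n_edges * p_edge + n_adjacent * p_adjacent
                       + n_disjoint * p_disjoint)
    var_a = second_moment_a - mean_a ** 2
    return mean_a / r, var_a / r ** 2


mean_r1, var_r1 = mcc_null_moments(2, 2, 1)
mean_r2, var_r2 = mcc_null_moments(3, 3, 2)
mean_full, var_full = mcc_null_moments(2, 2, 3)
```

Under these assumptions the mean equals *mn*/(*N* − 1) for every *r*, and the variance vanishes at *r* = *N* − 1, where the subgraph is the complete graph and the cross-count is fixed.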

Simulation suggests that the null distribution of *T*_{r} is negatively skewed, but we conjecture that for sufficiently large *N* and possibly certain conditions on *r* this distribution is asymptotically normal, independent of distribution function *F*. Rosenbaum ([2005]) proves that *T*_{r} is asymptotically normal for *r* = 1 for any distribution function; proof of this conjecture for *r* > 1 remains an open problem. This conjecture is supported by the normal QQ-plots shown in Figures 3 and 4 for 1,000 simulated values of $\left({T}_{r}-E\left[{T}_{r}\right]\right)/\sqrt{\mathrm{Var}\left({T}_{r}\right)}$ at *r* = 5, 50 and *N* = 150, 600 with *m*/*n* = 1/2, under sampling from uniform distributions on [-1, 1]^{5} and [-1, 1]^{20}. For the smaller sample size (*N* = 150), negative skewness is stronger for lower dimension and for higher *r*, with *r* = 50 and Dim = 5 being the most strongly skewed case shown. For the larger sample size (*N* = 600), skewness effects appear to vanish for all but the *r* = 50 and Dim = 5 case, and even in this case skewness is vastly diminished compared to the smaller sample size. Other distribution families and other values of *m*/*n* produce similar results.

Future work remains to bound rejection region probabilities in terms of sample size, dimension, and choice of *r*. In the absence of such theoretical bounds, for practical purposes a permutation test on observation indices serves as a suitable method to estimate p-values for the MCC test in cases where a normal approximation cannot be justified.
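Where a normal approximation is judged acceptable, a lower-tail p-value follows from standardizing the observed statistic with its null mean and variance. The sketch below shows only that arithmetic; the function name `normal_approx_p_value` is hypothetical, and whether the approximation is adequate for a given *N* and *r* is exactly the open question discussed above.

```python
from math import erf, sqrt


def normal_approx_p_value(t_obs, null_mean, null_variance):
    """Lower-tail normal p-value for T_r; small T_r is evidence against the null."""
    z = (t_obs - null_mean) / sqrt(null_variance)
    # Standard normal CDF evaluated at z, written with the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


p_center = normal_approx_p_value(1.0, 1.0, 4.0)   # z = 0
p_tail = normal_approx_p_value(-1.96, 0.0, 1.0)   # z = -1.96
```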

### Small simulation study

We compare power characteristics of tests for two different location-shift scenarios. For each case, 1000 simulations are conducted for each shift in location parameter, group sizes are *m* = 20 and *n* = 40, and tests are conducted at significance level *α* = .05. Distances are Euclidean. Estimated power is shown for MCC tests with *r* = 1, 4, 10, and 30 (where *r* = 1 is the cross-match test), and the performance of these tests is compared directly to that of Hotelling’s *T*^{2} test. Critical values for the MCC test were estimated through simulation for this study. All simulations were performed in R.

For the first example, the smaller group is drawn from a multivariate normal distribution with mean vector **0**, identity covariance matrix, and dimension 5. The larger group is drawn from the same family, but the location vector of the second group is **Δ**, where **Δ** ranges in magnitude from 0 to 1.5 by increments of 0.3. Hotelling’s *T*^{2} test is known to be the uniformly most powerful invariant test for location shift under these conditions (Bilodeau and Brenner [1999]) and the exact power of the test is known for all location alternatives.

Moving from *r* = 1 to *r* = 4 substantially improves on the power of the cross-match test. As *r* continues to increase, MCC performance is even more impressive; the *r* = 30 = *N*/2 case performs nearly as well as Hotelling’s *T*^{2} test. Power estimates for cases *r* > *N*/2 are not shown. Not surprisingly, test power generally decreases as *r* increases beyond *N*/2 toward *N*-1; in the extreme case *T*_{N-1} takes the fixed value $\frac{mn}{N-1}$ and hence the MCC test with *r* = *N*-1 has power equal to zero against all alternatives.

For the second example, the first group is drawn from the multivariate log-normal distribution, where each of the 5 dimensions consists of independent, univariate log-normal draws with location parameter 0 and scale parameter 1. As before, the second group is drawn from the same family, but the location parameter vector for the second group is **Δ**, where the magnitude of **Δ** ranges from 0 to 1.5 by increments of 0.3 and each dimension of **Δ** takes equal value. The lognormal distribution is considered here to examine the effects of a skewed distribution on the tests in question. Since the underlying distributions are no longer multivariate normal, the power of Hotelling’s *T*^{2} test is estimated by simulation for this example.

Again, MCC power for *r* = 4 is much better than for *r* = 1. It is particularly noteworthy that for sufficiently large *r* the MCC test outperforms Hotelling’s *T*^{2} test.

## Conclusions

The mean cross-count test is a powerful, non-parametric multivariate two-sample test that is applicable to any case where a notion of distance between observations exists. While this paper considers only location shifts, other simulations show that the MCC test has power in a variety of alternative cases as well. A shortcoming of the MCC test is that the null distribution for *T*_{r} is not simple (and perhaps not possible) to compute for all *r* > 1 and is not exactly distribution-free in these cases; in contrast, the test upon which it is based, the cross-match test with *r* = 1, has a known distribution that is independent of the distribution being tested.

It is known that *T*_{1} is asymptotically normal, and while the mean and variance of *T*_{r} are derived herein and simulation suggests that the normal approximation for *T*_{r} is appropriate for sufficiently large *N* with *r* > 1, this property remains to be proven. This proof is part of ongoing work, as is sharpening the normal approximation via Edgeworth expansion based on higher moments of *T*_{r}. Likewise, finding useful criteria for choosing *r* is another area for future work. This choice is subject to competing factors: On the one hand, the power of *T*_{r} appears to improve as *r* increases to *N*/2 when group sizes are equal (i.e., *m* = *n* = *N*/2); therefore, *r* = *N*/2 seems a good choice for equal group sizes. On the other hand, the normal approximation appears to worsen as *r* increases; thus it may be desirable to restrict the size of *r* for this sake. Furthermore, an additional effect exists when group sizes are different. For example, assume *m* < *n*. If *r*≥*m*, then at least one edge in ${\mathcal{G}}_{r}^{*}$ contains a vertex from each group and contributes to the cross-count, increasing the value of *T*_{r}. This is true even if the two groups are very different. Since a higher cross-count weakens the evidence against a group difference, this consideration suggests choosing *r* < min(*m*, *n*). A similar effect exists for multimodal distributions, suggesting that the size of *r* might be restricted as the number of modes grows. In any case, the best choice of *r* in practice clearly depends upon application specifics.

## References

- Anderson I: Perfect matchings of a graph sufficient conditions for matchings. *Proc Edinburgh Math Soc* 1972, 18: 129–136. Ser. 2. 10.1017/S0013091500009809
- Bilodeau M, Brenner D: *Theory of Multivariate Statistics*. Springer, New York; 1999.
- Friedman J, Rafsky L: Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. *Ann Stat* 1979, 7: 697–717. 10.1214/aos/1176344722
- Friedman J, Rafsky L: Graphics for the multivariate two-sample problem. *JASA* 1981, 76: 277–287. 10.1080/01621459.1981.10477643
- Hall P, Tajvidi N: Permutation tests for equality of distributions in high-dimensional settings. *Biometrika* 2002, 89: 359–374. 10.1093/biomet/89.2.359
- Henze N: A multivariate two-sample test based on the number of nearest neighbor type coincidences. *Ann Stat* 1988, 16: 772–783. 10.1214/aos/1176350835
- Maa J, Pearl D, Bartoszynski R: Reducing multidimensional two-sample data to one-dimensional interpoint comparisons. *Ann Stat* 1996, 24: 1069–1074. 10.1214/aos/1032526956
- Rosenbaum P: An exact distribution-free test comparing two multivariate distributions based on adjacency. *JRSS B* 2005, 67: 515–530. 10.1111/j.1467-9868.2005.00513.x
- Ruth D, Koyak K: Nonparametric tests for homogeneity based on non-bipartite matching. *JASA* 2011, 106: 1615–1625. 10.1198/jasa.2011.tm10576
- Schilling M: Multivariate two-sample tests based on nearest neighbors. *JASA* 1986, 81: 799–806. 10.1080/01621459.1986.10478337

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.