A new multivariate two-sample test using regular minimum-weight spanning subgraphs
Journal of Statistical Distributions and Applications, volume 1, Article number: 22 (2014)
Abstract
A new nonparametric test is proposed for the multivariate two-sample problem. Similar to Rosenbaum’s cross-match test, each observation is considered to be a vertex of a complete undirected weighted graph; interpoint distances are edge weights. A minimum-weight, r-regular subgraph is constructed, and the mean cross-count test statistic is equal to the number of edges in the subgraph containing one observation from the first group and one from the second, divided by r. Unequal distributions will tend to result in fewer edges that connect vertices between different groups. The mean cross-count test is sensitive to a wide range of distribution differences and has impressive power characteristics. We derive the first and second moments of the mean cross-count test, and note that simulation studies suggest this test statistic is asymptotically normal regardless of underlying data distributions. A small simulation study compares the power of the mean cross-count test to Hotelling’s T^{2} test and to the cross-match test. This new test is a more powerful generalization of Rosenbaum’s test (the cross-match test is the case r = 1) and constitutes a noteworthy addition to the class of multivariate, nonparametric two-sample tests.
Background
Objective
Consider N = m + n independent multivariate observations Y_{1}, …, Y_{m} and Y_{m+1}, …, Y_{N}, where each Y_{i} is drawn from distribution F for 1 ≤ i ≤ m and from distribution G for m + 1 ≤ i ≤ N. The dimension of the observations does not depend on N. The covariates may be quantitative or categorical; there need only exist some function, d, that measures distance between observations. The null hypothesis is that F = G. The objective is a two-sample test that has little or no dependence on the underlying distribution of the data. Furthermore, this test should have sufficient power to be useful for applications.
Motivation
We follow in the vein of graph-theoretic tests for homogeneity: Consider each observation to be a vertex of a complete, undirected, weighted graph, $\mathcal{G}$, and assign interpoint distances as edge weights. The distribution of these distances is sensitive to departures from homogeneity; Maa et al. (1996) prove that two distributions are identical if and only if the distributions of interpoint distances within and between observations sampled from the two populations are the same. Friedman and Rafsky (1979, 1981) fit a minimum spanning tree to $\mathcal{G}$ and count the number of edges in the tree that connect vertices from different groups to test whether the sampling distributions are the same. Schilling (1986), Henze (1988), and Hall and Tajvidi (2002) examine properties of nearest-neighbor subgraphs of $\mathcal{G}$ to test for homogeneity.
Rosenbaum (2005) provides a novel approach to this problem: Suppose N is even. Find a minimum-weight nonbipartite matching on $\mathcal{G}$, which is the lowest-weight spanning subgraph for which the degree of each vertex with respect to the subgraph is one and which consists of N/2 nonadjacent edges. Rosenbaum’s cross-match statistic, A_{1}, counts the number of edges in the matching that include one vertex from each of the two groups. Under the null hypothesis of no group difference each vertex is equally likely to be paired with any other vertex. Rosenbaum (2005) shows by combinatorial argument that the exact null distribution of A_{1} is
$$P(A_1 = a_1) = \frac{2^{a_1}\,(N/2)!}{\binom{N}{m}\,\left(\frac{m-a_1}{2}\right)!\,\left(\frac{n-a_1}{2}\right)!\,a_1!} \tag{1}$$
for a_{1} ∈ {0, 2, …, min(m, n)} and m and n even, or a_{1} ∈ {1, 3, …, min(m, n)} and m and n odd; P(A_{1} = a_{1}) = 0 otherwise. In the denominator of (1), $\frac{1}{2}(m - a_1)$ is the number of edges in the matching where both vertices are in the group of size m and $\frac{1}{2}(n - a_1)$ is the number of edges in the matching where both vertices are in the group of size n.
When the two groups are drawn from different distributions the number of within-group pairs tends to be higher than for the null case, so the null hypothesis of homogeneity is rejected if A_{1} is sufficiently small. For odd N, this procedure may be simply modified by introducing a pseudo-observation, Y_{0}, such that d(Y_{0}, Y_{i}) = 0 for all i ∈ {1, …, N}, and randomly assigning it to one of the two groups. Then find a minimum-weight nonbipartite matching on this resulting graph with N + 1 vertices and compute A_{1} with respect to observations Y_{0}, …, Y_{N}.
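The odd-N adjustment can be sketched as a small helper that pads a distance matrix with a zero-distance pseudo-observation. This is a minimal illustration, not the author's implementation: the function name, the list-of-lists matrix layout, the group codes `(1, 2)`, and the equal-probability group assignment are all assumptions (the text says only that the pseudo-observation is assigned to a group at random).

```python
import random

def add_pseudo_observation(dist_matrix, labels, groups=(1, 2), rng=None):
    """Return copies of (dist_matrix, labels), padded if the sample size is odd.

    The pseudo-observation has distance 0 to every real observation and is
    assigned to one of `groups` with equal probability (an assumption).
    """
    rng = rng or random.Random()
    N = len(dist_matrix)
    if N % 2 == 0:
        # Even sample size: nothing to add, return plain copies.
        return [list(row) for row in dist_matrix], list(labels)
    # Append a zero column to every row, then a zero row for the new vertex.
    augmented = [list(row) + [0.0] for row in dist_matrix]
    augmented.append([0.0] * (N + 1))
    return augmented, list(labels) + [rng.choice(groups)]
```

The padded matrix can then be passed to whatever matching or r-factor routine is in use; the statistic is computed over all N + 1 vertices.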
That the exact null distribution of A_{1} is known, regardless of the underlying data distribution, is a particularly attractive property for a multivariate two-sample test. Furthermore, the asymptotic normality of A_{1} facilitates testing for large-sample problems. However, the cross-match test has relatively low power. Since only a single nonbipartite matching is considered in this test, information contained in the proximity of many pairs of points is ignored. Friedman and Rafsky (1979, 1981) observe that the power of their single-tree test is enhanced by evaluating successive disjoint low-weight spanning trees. Similarly, Ruth and Koyak (2011) show that ensembles of disjoint low-weight nonbipartite matchings carry significant information regarding whether a distributional change occurs over a sequence of independent observations. A drawback associated with examining collections of such subgraphs is that null distributions are extremely difficult to determine. Mindful of this caveat, we offer an extension of the cross-match test which exploits the information contained in the distances between many pairs of points.
Methods
Illustrating example
Consider the bivariate sample of size N = 20 listed in Table 1 and displayed in Figure 1. The sample consists of independent observations in groups 1 (○) and 2 (△); observations within groups are identically distributed. For the purposes of this example, these data were simulated from distributions whose locations differ by one unit in each dimension. Figure 1 also shows the minimum-weight nonbipartite matching associated with this sample with respect to Euclidean distance. The present goal is to identify the distribution difference between these groups, making no assumptions about the underlying distributions.
The cross-match test is applicable here; for this example the value of the cross-match statistic is A_{1} = 4 with a corresponding p-value of 0.433. So, the cross-match test is insufficiently powerful to identify a distribution difference in this case. In the next section, we introduce an extension of the cross-match test that enhances test power significantly.
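The quoted p-value can be checked against Rosenbaum's exact null distribution, P(A_1 = a_1) = 2^{a_1} (N/2)! / [C(N, m) ((m − a_1)/2)! ((n − a_1)/2)! a_1!]. The sketch below is illustrative only: the function names are hypothetical, and the group sizes m = n = 10 are an assumption, since Table 1 is not reproduced here and the text gives only N = 20.

```python
from math import comb, factorial

def crossmatch_pmf(a1, m, n):
    """Exact null probability P(A_1 = a1) for group sizes m and n (m + n even)."""
    N = m + n
    # Zero outside the support: a1 must share the parity of m (and n).
    if N % 2 or a1 < 0 or a1 > min(m, n) or (m - a1) % 2:
        return 0.0
    numerator = 2 ** a1 * factorial(N // 2)
    denominator = (comb(N, m) * factorial((m - a1) // 2)
                   * factorial((n - a1) // 2) * factorial(a1))
    return numerator / denominator

def crossmatch_pvalue(a1, m, n):
    """Lower-tail p-value P(A_1 <= a1); small A_1 is evidence against F = G."""
    return sum(crossmatch_pmf(a, m, n) for a in range(a1 + 1))

# Assumed group sizes m = n = 10 with the observed A_1 = 4.
print(round(crossmatch_pvalue(4, 10, 10), 3))  # 0.433
```

Under the assumed equal split the lower-tail probability agrees with the reported 0.433, which suggests that split is the one used in the example.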
The mean cross-count (MCC) test
As before, we assume an even number N of observations forming a complete, undirected, weighted graph, $\mathcal{G}$. Rather than find a minimum-weight nonbipartite matching, we find a minimum-weight r-regular spanning subgraph of $\mathcal{G}$, where 1 ≤ r ≤ N − 2, denoted ${\mathcal{G}}_{r}^{*}$. That is, ${\mathcal{G}}_{r}^{*}$ is a subgraph of $\mathcal{G}$ with the following properties:
a) Every vertex in $\mathcal{G}$ is also in ${\mathcal{G}}_{r}^{*}$.
b) Every vertex in ${\mathcal{G}}_{r}^{*}$ has degree r.
c) The total weight of all edges in ${\mathcal{G}}_{r}^{*}$ is the lowest among all subgraphs of $\mathcal{G}$ which satisfy properties (a) and (b).
In graph theory, an r-regular spanning subgraph of $\mathcal{G}$ is sometimes called an r-factor of $\mathcal{G}$. Note that ${\mathcal{G}}_{1}^{*}$ is the special case of a minimum-weight nonbipartite matching used by Rosenbaum (2005), and ${\mathcal{G}}_{N-1}^{*}$ is identical to $\mathcal{G}$. In practice, we are mainly interested in 2 ≤ r ≤ N/2, although the theoretical details are not so constrained. Minimum-weight r-factors may be computed as follows: For any subgraph of $\mathcal{G}$, let x_{ij} be an indicator variable equal to 1 if the edge connecting vertices i and j is included in the subgraph and let d_{ij} be the distance between vertex i and vertex j. Then the edges of ${\mathcal{G}}_{r}^{*}$ solve the following combinatorial optimization problem:
$$\min_{x} \sum_{1 \le i < j \le N} d_{ij}\,x_{ij} \quad \text{subject to} \quad \sum_{j < i} x_{ji} + \sum_{j > i} x_{ij} = r \ \text{ for all } i \in \{1, \dots, N\}, \qquad x_{ij} \in \{0, 1\}. \tag{2}$$
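To make the optimization problem concrete, the following sketch finds a minimum-weight r-factor by exhaustive search: it tries every set of Nr/2 edges and keeps the cheapest one in which every vertex has degree r. This is a swapped-in brute-force technique for illustration, not the integer-programming approach used in the paper, and it is only feasible for very small N. The function name and the toy data are hypothetical.

```python
from itertools import combinations
from math import dist

def min_weight_r_factor(points, r):
    """Brute-force minimum-weight r-factor under Euclidean distance (tiny N only)."""
    N = len(points)
    if (N * r) % 2 or not 1 <= r <= N - 1:
        raise ValueError("an r-factor needs N*r even and 1 <= r <= N-1")
    edges = list(combinations(range(N), 2))      # all pairs (i, j) with i < j
    best_weight, best_edges = float("inf"), None
    for subset in combinations(edges, N * r // 2):
        degree = [0] * N
        for i, j in subset:
            degree[i] += 1
            degree[j] += 1
        if all(d == r for d in degree):          # degree constraint of the problem
            weight = sum(dist(points[i], points[j]) for i, j in subset)
            if weight < best_weight:
                best_weight, best_edges = weight, subset
    return best_edges, best_weight

# Two well-separated clusters of three points; the 2-factor should be two triangles.
toy = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
edges, weight = min_weight_r_factor(toy, r=2)
print(edges)
```

For the toy data the search returns the two within-cluster triangles, so no selected edge joins the clusters, which is the behavior the test statistic is designed to detect.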
Anderson (1972) assures the existence of a solution for r ≤ N/2. Solutions for r > N/2 are guaranteed by the fact that the complement of an r-regular subgraph of $\mathcal{G}$ is an (N − 1 − r)-regular graph. For this paper, solutions are found in R using the package “lpSolve” for N ≤ 400. For N > 400, solutions are found in R using the package “gurobi” due to the computational complexity of larger problems.
Similar to the cross-match test, we count the number of edges A_{r} in ${\mathcal{G}}_{r}^{*}$ that include a vertex from each group. We call T_{r} = A_{r}/r the mean cross-count (MCC) statistic. The idea here is that the number of within-group edges in ${\mathcal{G}}_{r}^{*}$ will be higher for cases of a distribution difference than for the null case. So, small values of T_{r} are evidence against the null hypothesis. Note that T_{1} = A_{1} is the cross-match statistic as before. One could use the total cross-count, A_{r}, as an equivalent test statistic; however, we choose to scale this value to give some notion of “average cross-count per vertex degree” (hence the name “mean cross-count”). For odd N, randomly introduce a pseudo-observation in the same manner as the r = 1 case discussed in Section 1.2.
Illustrating example, continued
Figure 2 shows a minimum-weight 3-factor, ${\mathcal{G}}_{3}^{*}$, for the data in Table 1 with respect to Euclidean distance. Cross-group edges are shown with solid lines. For this case, A_{3} = 12, so T_{3} = 4, and the test statistic value here is the same as Rosenbaum’s cross-match test statistic. A discussion of the distribution of T_{r} is in Section 3.1; for this example, we estimate the p-value for T_{r} by permutation test on the observation vertex labels. Using 10,000 permutations yields an estimated p-value of 0.146. While not enough evidence to conclude a group difference, this reduction in p-value relative to the r = 1 case (p-value = 0.433) suggests that considering minimum-weight r-factors for r > 1 may improve test power. In Section 3.2 we demonstrate significant power advantages that are realized for the MCC statistic.
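The permutation estimate described above can be sketched as follows. Because the r-factor depends only on the interpoint distances, it can be held fixed while the group labels are shuffled. The function names, the fixed seed, and the six-vertex cycle used as a stand-in 2-factor are assumptions for illustration; the data of Table 1 are not reproduced here.

```python
import random

def cross_count(edges, labels):
    """Number of edges whose endpoints carry different group labels (A_r)."""
    return sum(1 for i, j in edges if labels[i] != labels[j])

def permutation_pvalue(edges, labels, r, n_perm=10_000, seed=0):
    """Estimate P(T_r <= observed) by shuffling labels on a fixed r-factor."""
    rng = random.Random(seed)
    observed = cross_count(edges, labels) / r
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)                    # random relabeling under the null
        if cross_count(edges, shuffled) / r <= observed:
            hits += 1
    return observed, hits / n_perm

# Stand-in 2-factor: the cycle 0-1-2-3-4-5-0 with three vertices per group.
cycle = [(i, (i + 1) % 6) for i in range(6)]
t_obs, p_hat = permutation_pvalue(cycle, [1, 1, 1, 2, 2, 2], r=2)
print(t_obs, p_hat)
```

For this toy graph the exact lower-tail probability is 6/20 = 0.3 (only the six contiguous labelings give a cross-count as small as 2), so the estimate should land close to that value.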
Results and discussion
MCC moments and normal approximation
For the following discussion we assume N is even, adopting the convention that if the number of observations is odd then we will consider N to be the number of observations including a pseudo-observation as previously discussed. To find the mean and variance of T_{r} under the null hypothesis, we proceed as follows: Let $\mathcal{G}$ be the complete undirected graph (Z_{N}, E_{N}) where the vertex set Z_{N} consists of the indices 1, 2, …, N and the edge set consists of all N(N − 1)/2 pairs of vertices; by convention, write the pairs with smaller vertex first, so E_{N} = {(i, j) : 1 ≤ i < j ≤ N}. Partition Z_{N} into two sets S and T, with |S| = m and |T| = n, so m + n = N. Denote ${E}_{N}^{(S,T)}$ as the set of all edges with one vertex in S and the other in T. Let X_{ij} be the random variable that indicates whether edge (i, j) is included in a minimum-weight r-regular subgraph, ${\mathcal{G}}_{r}^{*}$, with 1 ≤ r ≤ N − 2. By the r-regularity of ${\mathcal{G}}_{r}^{*}$, for each i ∈ Z_{N} we have $r = \sum_{j=1}^{i-1} X_{ji} + \sum_{j=i+1}^{N} X_{ij}$, and so
$$r = \sum_{j=1}^{i-1} P(X_{ji} = 1) + \sum_{j=i+1}^{N} P(X_{ij} = 1). \tag{3}$$
But under the null hypothesis, each edge is equally likely to be included in ${\mathcal{G}}_{r}^{*}$, so r = (N − 1)P(X_{12} = 1). Therefore, for all (i, j) ∈ E_{N}
$$P(X_{ij} = 1) = \frac{r}{N-1} \tag{4}$$
and
$$E[X_{ij}] = \frac{r}{N-1}. \tag{5}$$
The total cross-count, A_{r}, may be written
$$A_r = \sum_{(i,j) \in E_N^{(S,T)}} X_{ij}, \tag{6}$$
resulting in
$$E[T_r] = \frac{1}{r}\,E[A_r] = \frac{1}{r} \cdot mn \cdot \frac{r}{N-1} = \frac{mn}{N-1}. \tag{7}$$
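Finding the variance of T_{r} is slightly more involved. First take
$$\mathrm{Var}(T_r) = \frac{1}{r^2}\,\mathrm{Var}(A_r) = \frac{1}{r^2}\left[\sum_{(i,j) \in E_N^{(S,T)}} \mathrm{Var}(X_{ij}) + \mathop{\sum\sum}_{(i,j) \ne (k,l)} \mathrm{Cov}(X_{ij}, X_{kl})\right], \tag{8}$$
where the double sum runs over ordered pairs of distinct edges in ${E}_{N}^{(S,T)}$. The sum of variances is computed directly as
$$\sum_{(i,j) \in E_N^{(S,T)}} \mathrm{Var}(X_{ij}) = mn\,\frac{r}{N-1}\left(1 - \frac{r}{N-1}\right) = \frac{mn\,r\,(N-1-r)}{(N-1)^2}. \tag{9}$$
The sum of covariances may be partitioned into terms that include pairs of adjacent edges and terms that include disjoint (i.e., nonadjacent) edges:
$$\mathop{\sum\sum}_{(i,j) \ne (k,l)} \mathrm{Cov}(X_{ij}, X_{kl}) = \mathop{\sum\sum}_{\text{adjacent}} \mathrm{Cov}(X_{ij}, X_{kl}) + \mathop{\sum\sum}_{\text{disjoint}} \mathrm{Cov}(X_{ij}, X_{kl}). \tag{10}$$
For any two adjacent edges (k, l) and (i, j), an r-regular subgraph contains N r(r − 1)/2 of the N(N − 1)(N − 2)/2 adjacent edge pairs in $\mathcal{G}$, giving
$$\mathrm{Cov}(X_{ij}, X_{kl}) = \frac{r(r-1)}{(N-1)(N-2)} - \frac{r^2}{(N-1)^2} = -\frac{r\,(N-1-r)}{(N-1)^2 (N-2)}, \tag{11}$$
so, since ${E}_{N}^{(S,T)}$ contains mn(N − 2) ordered pairs of adjacent edges,
$$\mathop{\sum\sum}_{\text{adjacent}} \mathrm{Cov}(X_{ij}, X_{kl}) = -\frac{mn\,r\,(N-1-r)}{(N-1)^2}. \tag{12}$$
For any two disjoint edges (k, l) and (i, j), the same counting argument applied to the remaining edge pairs gives
$$\mathrm{Cov}(X_{ij}, X_{kl}) = \frac{r\,(Nr - 4r + 2)}{(N-1)(N-2)(N-3)} - \frac{r^2}{(N-1)^2} = \frac{2r\,(N-1-r)}{(N-1)^2 (N-2)(N-3)}. \tag{13}$$
So, since ${E}_{N}^{(S,T)}$ contains mn(m − 1)(n − 1) ordered pairs of disjoint edges,
$$\mathop{\sum\sum}_{\text{disjoint}} \mathrm{Cov}(X_{ij}, X_{kl}) = \frac{2mn\,(m-1)(n-1)\,r\,(N-1-r)}{(N-1)^2 (N-2)(N-3)}. \tag{14}$$
Combining terms yields
$$\mathrm{Var}(A_r) = \frac{2mn\,(m-1)(n-1)\,r\,(N-1-r)}{(N-1)^2 (N-2)(N-3)}. \tag{15}$$
Therefore,
$$\mathrm{Var}(T_r) = \frac{2mn\,(m-1)(n-1)\,(N-1-r)}{r\,(N-1)^2 (N-2)(N-3)}. \tag{16}$$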
We note in particular that the first and second moment results in (7) and (16) match the results in Rosenbaum (2005) for the special case r = 1.
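As a numerical sanity check on the moments, the sketch below evaluates the closed forms E[T_r] = mn/(N − 1) and Var(T_r) = 2mn(m − 1)(n − 1)(N − 1 − r) / [r(N − 1)^2(N − 2)(N − 3)], which follow from the equal-inclusion-probability argument above, and compares them with exact enumeration over every assignment of group labels to a fixed r-regular graph. The function names and the six-vertex cycle are hypothetical choices for illustration, and the enumeration is only practical for small N.

```python
from fractions import Fraction
from itertools import combinations

def mcc_moments(m, n, r):
    """Closed-form null mean and variance of T_r for group sizes m and n."""
    N = m + n
    mean = Fraction(m * n, N - 1)
    var = Fraction(2 * m * n * (m - 1) * (n - 1) * (N - 1 - r),
                   r * (N - 1) ** 2 * (N - 2) * (N - 3))
    return mean, var

def enumerated_moments(edges, N, m, r):
    """Exact mean and variance of T_r over all labelings of a fixed r-regular graph."""
    values = []
    for group in combinations(range(N), m):      # every choice of the first group
        members = set(group)
        a_r = sum(1 for i, j in edges if (i in members) != (j in members))
        values.append(Fraction(a_r, r))
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

# A 2-regular graph on six vertices (a single cycle), with m = n = 3.
cycle6 = [(i, (i + 1) % 6) for i in range(6)]
print(mcc_moments(3, 3, 2) == enumerated_moments(cycle6, 6, 3, 2))  # True
```

With r = 1 the variance expression reduces to 2mn(m − 1)(n − 1) / [(N − 3)(N − 1)^2], the cross-match variance, and with r = N − 1 it is zero, as expected for a statistic that takes a fixed value.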
Simulation suggests that the null distribution of T_{r} is negatively skewed, but that for sufficiently large N and possibly certain conditions on r this distribution is asymptotically normal, independent of distribution function F. Rosenbaum (2005) proves that T_{r} is asymptotically normal for r = 1 for any distribution function; proof of this conjecture for r > 1 remains an open problem. This conjecture is supported by the normal Q-Q plots shown in Figures 3 and 4 for 1,000 simulated values of $(T_r - E[T_r])/\sqrt{\mathrm{Var}(T_r)}$ at r = 5, 50 and N = 150, 600 with m/n = 1/2, under sampling from uniform distributions on [−1, 1]^{5} and [−1, 1]^{20}. For the smaller sample size (N = 150), negative skewness is stronger for lower dimension and for higher r, with r = 50 and Dim = 5 being the most strongly skewed case shown. For the larger sample size (N = 600), skewness effects appear to vanish for all but the r = 50 and Dim = 5 case, and even in this case skewness is vastly diminished compared to the smaller sample size. Other distribution families and other values of m/n produce similar results.
Future work remains to bound rejection region probabilities in terms of sample size, dimension, and choice of r. In the absence of such theoretical bounds, for practical purposes a permutation test on observation indices serves as a suitable method to estimate p-values for the MCC test in cases where a normal approximation cannot be justified.
Small simulation study
We compare power characteristics of tests for two different location-shift scenarios. For each case, 1,000 simulations are conducted for each shift in location parameter, group sizes are m = 20 and n = 40, and tests are conducted at significance level α = 0.05. Distances are Euclidean. Estimated power is shown for MCC tests with r = 1, 4, 10, and 30 (where r = 1 is the cross-match test), and the performance of these tests is compared directly to that of Hotelling’s T^{2} test. Critical values for the MCC test were estimated through simulation for this study. All simulations were performed in R.
For the first example, the smaller group is drawn from a multivariate normal distribution with mean vector 0, identity covariance matrix, and dimension 5. The larger group is drawn from the same family, but the location vector of the second group is Δ, where Δ ranges in magnitude from 0 to 1.5 by increments of 0.3. Hotelling’s T^{2} test is known to be the uniformly most powerful invariant test for location shift under these conditions (Bilodeau and Brenner 1999) and the exact power of the test is known for all location alternatives.
Figure 5 displays the estimated power results. We notice immediately that a modest increase from r = 1 to r = 4 substantially improves on the power of the cross-match test. As r continues to increase, MCC performance is even more impressive; the r = 30 = N/2 case performs nearly as well as Hotelling’s T^{2} test. Power estimates for cases r > N/2 are not shown. Not surprisingly, test power generally decreases as r increases beyond N/2 toward N − 1; in the extreme case T_{N−1} takes the fixed value $\frac{mn}{N-1}$ and hence the MCC test with r = N − 1 has power equal to zero against all alternatives.
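The degenerate case can be verified directly. When r = N − 1 the subgraph is the complete graph itself, so every between-group pair is an edge regardless of the data. The numbers below use the group sizes of this simulation study (m = 20, n = 40); the worked arithmetic is an illustration added here, not a result reported in the source.

```latex
A_{N-1} = mn
\quad\Longrightarrow\quad
T_{N-1} = \frac{A_{N-1}}{N-1} = \frac{mn}{N-1},
\qquad
\text{e.g. } m = 20,\ n = 40:\quad
T_{59} = \frac{20 \cdot 40}{59} = \frac{800}{59} \approx 13.56 .
```

Since this fixed value coincides with the null mean of T_{r}, the statistic can never fall into a lower-tail rejection region, which is why the test has no power at r = N − 1.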
For the second example, the first group is drawn from the multivariate lognormal distribution, where each of the 5 dimensions consists of independent, univariate lognormal draws with location parameter 0 and scale parameter 1. As before, the second group is drawn from the same family, but the location parameter vector for the second group is Δ, where the magnitude of Δ ranges from 0 to 1.5 by increments of 0.3 and each dimension of Δ takes equal value. The lognormal distribution is considered here to examine the effects of a skewed distribution on the tests in question. Since the underlying distributions are no longer multivariate normal, the power of Hotelling’s T^{2} test is estimated by simulation for this example.
Figure 6 displays the estimated power results. As before, we see that the power of the MCC test with r = 4 is much better than for r = 1. It is particularly noteworthy that for sufficiently large r the MCC test outperforms Hotelling’s T^{2} test.
Conclusions
The mean cross-count test is a powerful, nonparametric multivariate two-sample test that is applicable to any case where a notion of distance between observations exists. While this paper considers only location shifts, other simulations show that the MCC test has power in a variety of alternative cases as well. A shortcoming of the MCC test is that the null distribution for T_{r} is not simple (and perhaps not possible) to compute for all r > 1 and is not exactly distribution-free in these cases; in contrast, the test upon which it is based, the cross-match test with r = 1, has a known distribution that is independent of the distribution being tested.
It is known that T_{1} is asymptotically normal, and while the mean and variance of T_{r} are derived herein and simulation suggests that the normal approximation for T_{r} is appropriate for sufficiently large N with r > 1, this property remains to be proven. This proof is part of ongoing work, as is sharpening the normal approximation via Edgeworth expansion based on higher moments of T_{r}. Likewise, finding useful criteria for choosing r is another area for future work. This choice is subject to competing factors: On the one hand, the power of T_{r} appears to improve as r increases to N/2 when group sizes are equal (i.e., m = n = N/2); therefore, r = N/2 seems a good choice for equal group sizes. On the other hand, the normal approximation appears to worsen as r increases; thus it may be desirable to restrict the size of r for this sake. Furthermore, an additional effect exists when group sizes are different. For example, assume m < n. If r ≥ m, then at least one edge in ${\mathcal{G}}_{r}^{*}$ contains a vertex from each group and contributes to the cross-count, increasing the value of T_{r}. This is true even if the two groups are very different. Since a higher cross-count weakens the evidence against a group difference, this consideration suggests choosing r < min(m, n). A similar effect exists for multimodal distributions, suggesting that the size of r might be restricted as the number of modes grows. In any case, the best choice of r in practice clearly depends upon application specifics.
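The effect of choosing r ≥ m can be quantified with a simple counting bound, offered here as an illustration rather than a result from the source. Each of the m vertices in the smaller group has r neighbors in the subgraph, of which at most m − 1 can lie in its own group, so at least r − m + 1 of its edges must cross to the other group. Each cross-group edge has exactly one endpoint in the smaller group, so these edges are not double counted:

```latex
A_r \;\ge\; m\,(r - m + 1)
\quad\Longrightarrow\quad
T_r \;\ge\; \frac{m\,(r - m + 1)}{r}
\qquad (r \ge m,\ m < n).
```

For instance, with m = 20, n = 40, and r = 30, the bound gives A_{30} ≥ 220 and T_{30} ≥ 7.33 no matter how well separated the groups are, compared with a null mean of 800/59 ≈ 13.56. For r < m the bound is vacuous, which is consistent with the suggestion to choose r < min(m, n).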
References
1. Anderson I: Perfect matchings of a graph sufficient conditions for matchings. Proc Edinburgh Math Soc (Ser. 2) 1972, 18: 129–136. 10.1017/S0013091500009809
2. Bilodeau M, Brenner D: Theory of Multivariate Statistics. Springer, New York; 1999.
3. Friedman J, Rafsky L: Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. Ann Stat 1979, 7: 697–717. 10.1214/aos/1176344722
4. Friedman J, Rafsky L: Graphics for the multivariate two-sample problem. JASA 1981, 76: 277–287. 10.1080/01621459.1981.10477643
5. Hall P, Tajvidi N: Permutation tests for equality of distributions in high-dimensional settings. Biometrika 2002, 89: 359–374. 10.1093/biomet/89.2.359
6. Henze N: A multivariate two-sample test based on the number of nearest neighbor type coincidences. Ann Stat 1988, 16: 772–783. 10.1214/aos/1176350835
7. Maa J, Pearl D, Bartoszynski R: Reducing multidimensional two-sample data to one-dimensional interpoint comparisons. Ann Stat 1996, 24: 1069–1074. 10.1214/aos/1032526956
8. Rosenbaum P: An exact distribution-free test comparing two multivariate distributions based on adjacency. JRSS B 2005, 67: 515–530. 10.1111/j.1467-9868.2005.00513.x
9. Ruth D, Koyak R: Nonparametric tests for homogeneity based on nonbipartite matching. JASA 2011, 106: 1615–1625. 10.1198/jasa.2011.tm10576
10. Schilling M: Multivariate two-sample tests based on nearest neighbors. JASA 1986, 81: 799–806. 10.1080/01621459.1986.10478337
Competing interest
The author declares that he has no competing interests.
Keywords
 Distribution-free test
 Graph-theoretic procedure
 Change point