Open Access

A new multivariate two-sample test using regular minimum-weight spanning subgraphs

Journal of Statistical Distributions and Applications 2014, 1:22

https://doi.org/10.1186/s40488-014-0022-4

Received: 27 February 2014

Accepted: 26 August 2014

Published: 4 November 2014

Abstract

A new nonparametric test is proposed for the multivariate two-sample problem. Similar to Rosenbaum’s cross-match test, each observation is considered to be a vertex of a complete undirected weighted graph; interpoint distances are edge weights. A minimum-weight, r-regular subgraph is constructed, and the mean cross-count test statistic is equal to the number of edges in the subgraph containing one observation from the first group and one from the second, divided by r. Unequal distributions will tend to result in fewer edges that connect vertices between different groups. The mean cross-count test is sensitive to a wide range of distribution differences and has impressive power characteristics. We derive the first and second moments of the mean cross-count test, and note that simulation studies suggest this test statistic is asymptotically normal regardless of underlying data distributions. A small simulation study compares the power of the mean cross-count test to Hotelling’s $T^2$ test and to the cross-match test. This new test is a more powerful generalization of Rosenbaum’s test (the cross-match test is the case r = 1) and constitutes a noteworthy addition to the class of multivariate, nonparametric two-sample tests.

Keywords

Distribution-free test; Graph-theoretic procedure; Change point

Background

Objective

Consider N = m + n independent multivariate observations $Y_1, \ldots, Y_m$ and $Y_{m+1}, \ldots, Y_N$, where each $Y_i$ is drawn from distribution F for $1 \le i \le m$ and from distribution G for $m + 1 \le i \le N$. The dimension of the observations does not depend on N. The covariates may be quantitative or categorical; there need only exist some function, d, that measures distance between observations. The null hypothesis is that F = G. The objective is a two-sample test that has little or no dependence on the underlying distribution of the data. Furthermore, this test should have sufficient power to be useful for applications.

Motivation

We follow in the vein of graph-theoretic tests for homogeneity: Consider each observation to be a vertex of a complete, undirected, weighted graph, G , and assign interpoint distances as edge weights. The distribution of these distances is sensitive to departures from homogeneity; Maa et al. ([1996]) prove that two distributions are identical if and only if the distributions of inter-point distances within and between observations sampled from the two populations are the same. Friedman and Rafsky ([1979], [1981]) fit a minimum spanning tree to G and count the number of edges in the tree that connect vertices from different groups to test whether the sampling distributions are the same. Schilling ([1986]), Henze ([1988]), and Hall and Tajvidi ([2002]) examine properties of nearest-neighbor subgraphs of G to test for homogeneity.

Rosenbaum ([2005]) provides a novel approach to this problem: Suppose N is even. Find a minimum-weight non-bipartite matching on G, which is the lowest-weight spanning subgraph for which the degree of each vertex with respect to the subgraph is one and which consists of N/2 non-adjacent edges. Rosenbaum’s cross-match statistic, $A_1$, counts the number of edges in the matching that include one vertex from each of the two groups. Under the null hypothesis of no group difference each vertex is equally likely to be paired with any other vertex. Rosenbaum ([2005]) shows by a combinatorial argument that the exact null distribution of $A_1$ is

$$
P(A_1 = a_1) = \frac{2^{a_1}\,(N/2)!}{\binom{N}{m}\,\left(\frac{m - a_1}{2}\right)!\; a_1!\; \left(\frac{n - a_1}{2}\right)!} \tag{1}
$$

for $a_1 \in \{0, 2, \ldots, \min(m, n)\}$ when m and n are even, or $a_1 \in \{1, 3, \ldots, \min(m, n)\}$ when m and n are odd; $P(A_1 = a_1) = 0$ otherwise. In the denominator of (1), $(m - a_1)/2$ is the number of edges in the matching where both vertices are in the group of size m, and $(n - a_1)/2$ is the number of edges in the matching where both vertices are in the group of size n.
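As a quick numerical check of (1), the following Python sketch evaluates the exact null distribution and the lower-tail p-value used by the cross-match test. The function names `crossmatch_pmf` and `crossmatch_pvalue` are illustrative choices and do not come from the original paper.

```python
from math import comb, factorial


def crossmatch_pmf(a1, m, n):
    """Exact null probability P(A_1 = a1) from equation (1); N = m + n must be even."""
    N = m + n
    if N % 2 != 0:
        raise ValueError("N = m + n must be even")
    # a1 cannot exceed either group size and must have the same parity as m and n.
    if a1 < 0 or a1 > min(m, n) or (m - a1) % 2 != 0 or (n - a1) % 2 != 0:
        return 0.0
    numerator = 2 ** a1 * factorial(N // 2)
    denominator = (comb(N, m) * factorial((m - a1) // 2)
                   * factorial(a1) * factorial((n - a1) // 2))
    return numerator / denominator


def crossmatch_pvalue(a1_observed, m, n):
    """Lower-tail p-value P(A_1 <= a1_observed); small A_1 is evidence against F = G."""
    return sum(crossmatch_pmf(a, m, n) for a in range(a1_observed + 1))


print(round(crossmatch_pvalue(4, 10, 10), 3))  # 0.433
```

With m = n = 10 and an observed value of 4, the sketch gives 0.433, which agrees with the p-value reported for the illustrating example.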

When the two groups are drawn from different distributions the number of within-group pairs tends to be higher than for the null case, so the null hypothesis of homogeneity is rejected if $A_1$ is sufficiently small. For odd N, this procedure may be simply modified by introducing a pseudo-observation, $Y_0$, such that $d(Y_0, Y_i) = 0$ for all $i \in \{1, \ldots, N\}$, and randomly assigning it to one of the two groups. Then find a minimum-weight non-bipartite matching on the resulting graph with N + 1 vertices and compute $A_1$ with respect to observations $Y_0, \ldots, Y_N$.

That the exact null distribution of A1 is known, regardless of the underlying data distribution, is a particularly attractive property for a multivariate two-sample test. Furthermore, the asymptotic normality of A1 facilitates testing for large-sample problems. However, the cross-match test has relatively low power. Since only a single non-bipartite matching is considered in this test, information contained in the proximity of many pairs of points is ignored. Friedman and Rafsky ([1979], [1981]) observe that the power of their single-tree test is enhanced by evaluating successive disjoint low-weight spanning trees. Similarly, Ruth and Koyak ([2011]) show that ensembles of disjoint low-weight non-bipartite matchings carry significant information regarding whether a distributional change occurs over a sequence of independent observations. A drawback associated with examining collections of such subgraphs is that null distributions are extremely difficult to determine. Mindful of this caveat, we offer an extension of the cross-match test which exploits the information contained in the distances between many pairs of points.

Methods

Illustrating example

Consider the bivariate sample of size N = 20 listed in Table 1 and displayed in Figure 1. The sample consists of independent observations in groups 1 and 2; observations within groups are identically distributed. For the purposes of this example, these data were simulated from distributions whose locations differ by one unit in each dimension. Figure 1 also shows the minimum-weight non-bipartite matching associated with this sample with respect to Euclidean distance. The present goal is to identify the distribution difference between these groups, making no assumptions about the underlying distributions.
Table 1

Bivariate data for illustrating example

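| Observation number | Group | Covariate 1 | Covariate 2 |
|---|---|---|---|
| 1 | 1 | -0.323 | -1.389 |
| 2 | 1 | 1.020 | -2.078 |
| 3 | 1 | -0.269 | -1.020 |
| 4 | 1 | 0.296 | -0.144 |
| 5 | 1 | 0.602 | 1.021 |
| 6 | 1 | 0.814 | -0.508 |
| 7 | 1 | -0.475 | -0.690 |
| 8 | 1 | -0.079 | 1.360 |
| 9 | 1 | -0.228 | 0.926 |
| 10 | 1 | -0.481 | 1.958 |
| 11 | 2 | 1.269 | 1.275 |
| 12 | 2 | 0.954 | 2.133 |
| 13 | 2 | -0.103 | 2.763 |
| 14 | 2 | -0.581 | -0.428 |
| 15 | 2 | 2.367 | 0.222 |
| 16 | 2 | 0.980 | 1.870 |
| 17 | 2 | 0.494 | 1.981 |
| 18 | 2 | 0.293 | 0.236 |
| 19 | 2 | 1.535 | 0.981 |
| 20 | 2 | 1.993 | -0.120 |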

Figure 1

Bivariate data for illustrating example with optimal non-bipartite matching on groups 1 and 2 with m = n = 10. Cross-group pairs are connected by solid lines; within-group pairs by dotted lines.

The cross-match test is applicable here; for this example the value of the cross-match statistic is $A_1 = 4$, with a corresponding p-value of 0.433. The cross-match test is therefore insufficiently powerful to identify a distribution difference in this case. In the next section, we introduce an extension of the cross-match test that enhances test power significantly.

The mean cross-count (MCC) test

As before, we assume an even number N of observations forming a complete, undirected, weighted graph, G. Rather than find a minimum-weight non-bipartite matching, we find a minimum-weight r-regular spanning subgraph of G, denoted $G_r^*$, where $1 \le r \le N - 2$. That is, $G_r^*$ is a subgraph of G with the following properties:
  a) Every vertex in G is also in $G_r^*$.
  b) Every vertex in $G_r^*$ has degree r.
  c) The total weight of all edges in $G_r^*$ is the lowest among all subgraphs of G which satisfy properties (a) and (b).

In graph theory, an r-regular spanning subgraph of G is sometimes called an r-factor of G. Note that $G_1^*$ is the special case of a minimum-weight non-bipartite matching used by Rosenbaum ([2005]), and $G_{N-1}^*$ is identical to G. In practice, we are mainly interested in $2 \le r \le N/2$, although the theoretical details are not so constrained. Minimum-weight r-factors may be computed as follows: For any subgraph of G, let $x_{ij}$ be an indicator variable equal to 1 if the edge connecting vertices i and j is included in the subgraph and let $d_{ij}$ be the distance between vertex i and vertex j. Then the edges of $G_r^*$ solve the following combinatorial optimization problem:
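$$
\begin{aligned}
\min_{x}\quad & \sum_{j=2}^{N} \sum_{i=1}^{j-1} d_{ij}\, x_{ij} \\
\text{subject to}\quad & \sum_{i=1}^{k-1} x_{ik} + \sum_{j=k+1}^{N} x_{kj} = r, \quad \forall\, k \in \{1, \ldots, N\}, \\
& x_{ij} \in \{0, 1\}, \quad \forall\, j \in \{i+1, \ldots, N\},\; i \in \{1, \ldots, N-1\}.
\end{aligned} \tag{2}
$$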

Anderson ([1972]) assures the existence of a solution for $r \le N/2$. Solutions for r > N/2 are guaranteed by the fact that the complement of an r-regular subgraph of G is an (N-1-r)-regular graph. For this paper, solutions are found in R using the package “lpSolve” for $N \le 400$. For N > 400, solutions are found in R using the package “gurobi” due to the computational complexity of larger problems.
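To make problem (2) concrete, the sketch below solves it by exhaustive enumeration in Python. This is not the integer-programming approach used in the paper (lpSolve or gurobi in R); brute force is swapped in here only because it needs no solver, and it is practical only for very small N. The function name `min_weight_r_factor` and the six sample points are hypothetical.

```python
from itertools import combinations
import math


def min_weight_r_factor(points, r):
    """Brute-force minimum-weight r-regular spanning subgraph (tiny N only).

    Returns (edges, total_weight), where edges is a tuple of (i, j) pairs, i < j,
    and edge weights are Euclidean distances between the given points.
    """
    n_vertices = len(points)
    if (n_vertices * r) % 2 != 0:
        raise ValueError("N * r must be even for an r-factor to exist")
    all_edges = list(combinations(range(n_vertices), 2))
    weight = {(i, j): math.dist(points[i], points[j]) for i, j in all_edges}
    n_edges = n_vertices * r // 2  # every r-factor has exactly N * r / 2 edges
    best_edges, best_weight = None, math.inf
    for subset in combinations(all_edges, n_edges):
        degree = [0] * n_vertices
        for i, j in subset:
            degree[i] += 1
            degree[j] += 1
        if any(d != r for d in degree):
            continue  # violates the degree constraint in (2)
        total = sum(weight[e] for e in subset)
        if total < best_weight:
            best_edges, best_weight = subset, total
    return best_edges, best_weight


# Hypothetical data: two well-separated clusters of three points each.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
edges, total_weight = min_weight_r_factor(points, r=2)
print(edges)  # ((0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5))
```

For these points the minimum-weight 2-factor is the pair of within-cluster triangles, so no selected edge joins the two clusters.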

Similar to the cross-match test, we count the number of edges $A_r$ in $G_r^*$ that include a vertex from each group. We call $T_r = A_r / r$ the mean cross-count (MCC) statistic. The idea here is that the number of within-group edges in $G_r^*$ will be higher for cases of a distribution difference than for the null case. So, small values of $T_r$ are evidence against the null hypothesis. Note that $T_1 = A_1$ is the cross-match statistic as before. One could use the total cross-count, $A_r$, as an equivalent test statistic; however, we choose to scale this value to give some notion of “average cross-count per vertex degree” (hence the name “mean cross-count”). For odd N, randomly introduce a pseudo-observation in the same manner as the r = 1 case discussed in Section 1.2 (Motivation).
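Once an r-factor is available as an edge list, the statistic itself is a simple count. The following Python sketch illustrates this; the helper name `mean_cross_count` and the six-vertex example are hypothetical.

```python
def mean_cross_count(edges, group, r):
    """Return (A_r, T_r) for an r-regular subgraph given as a list of (i, j) edges.

    group[i] is the group label of vertex i.
    """
    cross = sum(1 for i, j in edges if group[i] != group[j])
    return cross, cross / r


# Hypothetical 2-factor on six vertices: the cycle 0-1-2-3-4-5-0.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5)]
group = [1, 1, 1, 2, 2, 2]
a_r, t_r = mean_cross_count(edges, group, r=2)
print(a_r, t_r)  # 2 1.0
```

Only the edges (2, 3) and (0, 5) join the two groups in this example, so the total cross-count is 2 and the mean cross-count is 1.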

Illustrating example, continued

Figure 2 shows a minimum-weight 3-factor, $G_3^*$, for the data in Table 1 with respect to Euclidean distance. Cross-group edges are shown with solid lines. For this case, $A_3 = 12$ and $T_3 = 4$, so the test statistic value here is the same as Rosenbaum’s cross-match test statistic. A discussion of the distribution of $T_r$ is in Section 3.1 (MCC moments and normal approximation); for this example, we estimate the p-value for $T_r$ by permutation test on the observation vertex labels. Using 10,000 permutations yields an estimated p-value of 0.146. While this is not enough evidence to conclude a group difference, the reduction in p-value relative to the r = 1 case (p-value = 0.433) suggests that considering minimum-weight r-factors for r > 1 may improve test power. In Section 3.2 (Small simulation study) we demonstrate significant power advantages that are realized for the MCC statistic.
Figure 2

Bivariate data for illustrating example with optimal 3-factor on groups 1 and 2 with m = n = 10. Cross-group pairs are connected by solid lines; within-group pairs by dotted lines.
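The permutation test on vertex labels mentioned above can be sketched in Python as follows. Because the r-factor depends only on the interpoint distances, it is computed once and held fixed while the group labels are shuffled. The function names, the add-one p-value estimator (a common convention that the paper does not specify), and the 12-vertex example graph are all assumptions made for illustration.

```python
import random


def cross_count(edges, group):
    """Number of edges whose endpoints carry different group labels (A_r)."""
    return sum(1 for i, j in edges if group[i] != group[j])


def permutation_pvalue(edges, group, n_perm=10_000, seed=0):
    """Estimate P(A_r <= observed) by shuffling labels over a fixed subgraph.

    Comparing A_r is equivalent to comparing T_r = A_r / r because r is fixed.
    """
    rng = random.Random(seed)
    observed = cross_count(edges, group)
    labels = list(group)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if cross_count(edges, labels) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one estimator (assumed convention)


# Hypothetical 2-factor: two disjoint 6-cycles, one inside each group.
cycle_a = [(i, (i + 1) % 6) for i in range(6)]
cycle_b = [(6 + i, 6 + (i + 1) % 6) for i in range(6)]
edges = cycle_a + cycle_b
group = [1] * 6 + [2] * 6
p_value = permutation_pvalue(edges, group, n_perm=2000)
print(p_value)
```

In this constructed example no edge joins the two groups, so the observed cross-count is zero and the estimated p-value is small.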

Results and discussion

MCC moments and normal approximation

For the following discussion we assume N is even, adopting the convention that if the number of observations is odd then we will consider N to be the number of observations including a pseudo-observation as previously discussed. To find the mean and variance of $T_r$ under the null hypothesis, we proceed as follows: Let G be the complete undirected graph $(Z_N, E_N)$, where the vertex set $Z_N$ consists of the indices 1, 2, …, N and the edge set $E_N$ consists of all N(N - 1)/2 pairs of vertices; by convention, write the pairs with smaller vertex first, so $E_N = \{(i, j) : 1 \le i < j \le N\}$. Partition $Z_N$ into two sets S and T, with |S| = m and |T| = n, so m + n = N. Denote by $E_N^{S,T}$ the set of all edges with one vertex in S and the other in T. Let $X_{ij}$ be the random variable that indicates whether edge (i, j) is included in a minimum-weight r-regular subgraph, $G_r^*$, with $1 \le r \le N - 2$. By the r-regularity of $G_r^*$, for each $i \in Z_N$ we have $r = \sum_{j=1}^{i-1} X_{ji} + \sum_{j=i+1}^{N} X_{ij}$, and so
$$
r = E(r) = E\left( \sum_{j=1}^{i-1} X_{ji} + \sum_{j=i+1}^{N} X_{ij} \right) = \sum_{j=1}^{i-1} E(X_{ji}) + \sum_{j=i+1}^{N} E(X_{ij}) = \sum_{j=1}^{i-1} P(X_{ji} = 1) + \sum_{j=i+1}^{N} P(X_{ij} = 1). \tag{3}
$$

But under the null hypothesis, each edge is equally likely to be included in $G_r^*$, so $r = (N - 1)\,P(X_{12} = 1)$. Therefore, for all $(i, j) \in E_N$,

$$
E(X_{ij}) = P(X_{ij} = 1) = \frac{r}{N - 1} \tag{4}
$$

and

$$
\mathrm{Var}(X_{ij}) = P(X_{ij} = 1)\,P(X_{ij} = 0) = \frac{r}{N - 1} - \left( \frac{r}{N - 1} \right)^2. \tag{5}
$$
The total cross-count, $A_r$, may be written

$$
A_r = \sum_{(i,j) \in E_N^{S,T}} X_{ij}, \tag{6}
$$

resulting in

$$
E(T_r) = \frac{1}{r} E(A_r) = \frac{1}{r} E\left( \sum_{(i,j) \in E_N^{S,T}} X_{ij} \right) = \frac{mn}{r} E(X_{ij}) = \frac{mn}{N - 1}. \tag{7}
$$
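Finding the variance of $T_r$ is slightly more involved. First take

$$
\mathrm{Var}(A_r) = \mathrm{Var}\left( \sum_{(i,j) \in E_N^{S,T}} X_{ij} \right) = \sum_{(i,j) \in E_N^{S,T}} \mathrm{Var}(X_{ij}) + \sum_{\substack{(i,j),\,(k,l) \in E_N^{S,T} \\ (i,j) \ne (k,l)}} \mathrm{Cov}(X_{ij}, X_{kl}). \tag{8}
$$

The sum of variances is computed directly as

$$
\sum_{(i,j) \in E_N^{S,T}} \mathrm{Var}(X_{ij}) = mn\,\mathrm{Var}(X_{ij}) = mn \left[ \frac{r}{N - 1} - \left( \frac{r}{N - 1} \right)^2 \right]. \tag{9}
$$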
The sum of covariances may be partitioned into terms that include pairs of adjacent edges and terms that include disjoint (i.e., non-adjacent) edges:
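$$
\sum_{\substack{i,k \in S;\; j,l \in T \\ (i,j) \ne (k,l)}} \mathrm{Cov}(X_{ij}, X_{kl}) = \sum_{i \in S} \sum_{\substack{j,l \in T \\ j \ne l}} \mathrm{Cov}(X_{ij}, X_{il}) + \sum_{\substack{i,k \in S \\ i \ne k}} \sum_{j \in T} \mathrm{Cov}(X_{ij}, X_{kj}) + \sum_{\substack{i,k \in S \\ i \ne k}} \sum_{\substack{j,l \in T \\ j \ne l}} \mathrm{Cov}(X_{ij}, X_{kl}). \tag{10}
$$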
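For any two adjacent edges (k, l) and (i, j),

$$
P(X_{ij} X_{kl} = 1) = P(X_{kl} = 1 \mid X_{ij} = 1)\,P(X_{ij} = 1) = \frac{(r - 1)\,r}{(N - 2)(N - 1)} = E(X_{ij} X_{kl}), \tag{11}
$$

so

$$
\sum_{i \in S} \sum_{\substack{j,l \in T \\ j \ne l}} \mathrm{Cov}(X_{ij}, X_{il}) + \sum_{\substack{i,k \in S \\ i \ne k}} \sum_{j \in T} \mathrm{Cov}(X_{ij}, X_{kj}) = \bigl[ mn(n - 1) + m(m - 1)n \bigr] \left[ \frac{(r - 1)\,r}{(N - 2)(N - 1)} - \left( \frac{r}{N - 1} \right)^2 \right] = -mn \left[ \frac{r}{N - 1} - \left( \frac{r}{N - 1} \right)^2 \right]. \tag{12}
$$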
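For any two disjoint edges (k, l) and (i, j),

$$
P(X_{ij} X_{kl} = 1) = P(X_{kl} = 1 \mid X_{ij} = 1)\,P(X_{ij} = 1) = \frac{r(N - 4) + 2}{(N - 3)(N - 2)} \cdot \frac{r}{N - 1} = E(X_{ij} X_{kl}), \tag{13}
$$

so

$$
\sum_{\substack{i,k \in S \\ i \ne k}} \sum_{\substack{j,l \in T \\ j \ne l}} \mathrm{Cov}(X_{ij}, X_{kl}) = m(m - 1)n(n - 1) \left[ \frac{r(N - 4) + 2}{(N - 3)(N - 2)} \cdot \frac{r}{N - 1} - \left( \frac{r}{N - 1} \right)^2 \right] = \frac{2m(m - 1)n(n - 1)(N - 1 - r)\,r}{(N - 3)(N - 2)(N - 1)^2}. \tag{14}
$$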
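Combining terms yields

$$
\begin{aligned}
\mathrm{Var}(A_r) &= \sum_{i \in S} \sum_{j \in T} \mathrm{Var}(X_{ij}) + \sum_{i \in S} \sum_{\substack{j,l \in T \\ j \ne l}} \mathrm{Cov}(X_{ij}, X_{il}) + \sum_{\substack{i,k \in S \\ i \ne k}} \sum_{j \in T} \mathrm{Cov}(X_{ij}, X_{kj}) + \sum_{\substack{i,k \in S \\ i \ne k}} \sum_{\substack{j,l \in T \\ j \ne l}} \mathrm{Cov}(X_{ij}, X_{kl}) \\
&= mn \left[ \frac{r}{N - 1} - \left( \frac{r}{N - 1} \right)^2 \right] - mn \left[ \frac{r}{N - 1} - \left( \frac{r}{N - 1} \right)^2 \right] + \frac{2m(m - 1)n(n - 1)(N - 1 - r)\,r}{(N - 3)(N - 2)(N - 1)^2} \\
&= \frac{2m(m - 1)n(n - 1)(N - 1 - r)\,r}{(N - 3)(N - 2)(N - 1)^2}.
\end{aligned} \tag{15}
$$

Therefore,

$$
\mathrm{Var}(T_r) = \frac{1}{r^2} \mathrm{Var}(A_r) = \frac{2m(m - 1)n(n - 1)(N - 1 - r)}{r\,(N - 3)(N - 2)(N - 1)^2}. \tag{16}
$$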

We note in particular that the first and second moment results in (7) and (16) match the results in Rosenbaum ([2005]) for the special case r = 1.
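The moment formulas (7) and (16) can be checked numerically on a small case. The Python sketch below fixes one r-regular graph and enumerates every assignment of m vertices to the first group, which is the label-permutation view of the null hypothesis; it then compares the enumerated mean and variance of $T_r$ with the closed forms. The circulant graph, the function names, and the choice N = 8, m = 3, r = 4 are assumptions made for illustration only.

```python
from fractions import Fraction
from itertools import combinations


def mcc_mean(m, n):
    """E(T_r) from equation (7)."""
    return Fraction(m * n, m + n - 1)


def mcc_variance(m, n, r):
    """Var(T_r) from equation (16)."""
    N = m + n
    return Fraction(2 * m * (m - 1) * n * (n - 1) * (N - 1 - r),
                    r * (N - 3) * (N - 2) * (N - 1) ** 2)


def circulant_edges(N, r):
    """Edges of an r-regular circulant graph on N vertices (r even, r < N)."""
    return {tuple(sorted((i, (i + k) % N)))
            for i in range(N) for k in range(1, r // 2 + 1)}


def enumerated_moments(edges, N, m, r):
    """Exact mean and variance of T_r over all C(N, m) labelings of a fixed graph."""
    values = []
    for subset in combinations(range(N), m):
        in_first_group = set(subset)
        a_r = sum(1 for i, j in edges
                  if (i in in_first_group) != (j in in_first_group))
        values.append(Fraction(a_r, r))
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


N, m, r = 8, 3, 4
mean, variance = enumerated_moments(circulant_edges(N, r), N, m, r)
print(mean, variance)                                # 15/7 6/49
print(mcc_mean(m, N - m), mcc_variance(m, N - m, r))  # 15/7 6/49
```

Exact rational arithmetic is used so that agreement between the enumeration and the formulas is an equality rather than a floating-point approximation.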

Simulation suggests that the null distribution of $T_r$ is negatively skewed, but that for sufficiently large N and possibly certain conditions on r this distribution is asymptotically normal, independent of distribution function F. Rosenbaum ([2005]) proves that $T_r$ is asymptotically normal for r = 1 for any distribution function; proof of this conjecture for r > 1 remains an open problem. This conjecture is supported by the normal QQ-plots shown in Figures 3 and 4 for 1,000 simulated values of $(T_r - E(T_r)) / \sqrt{\mathrm{Var}(T_r)}$ at r = 5, 50 and N = 150, 600 with m/n = 1/2, under sampling from uniform distributions on $[-1, 1]^5$ and $[-1, 1]^{20}$. For the smaller sample size (N = 150), negative skewness is stronger for lower dimension and for higher r, with r = 50 and Dim = 5 being the most strongly skewed case shown. For the larger sample size (N = 600), skewness effects appear to vanish for all but the r = 50 and Dim = 5 case, and even in this case skewness is vastly diminished compared to the smaller sample size. Other distribution families and other values of m/n produce similar results.
Figure 3

Normal QQ-plots of 1,000 simulated values of $T_r$ for m = 50, n = 100 and r = 5, 50. Panels (a) and (b) are from independent samples of Unif $[-1, 1]^5$ variates. Panels (c) and (d) are from independent samples of Unif $[-1, 1]^{20}$ variates.

Figure 4

Normal QQ-plots of 1,000 simulated values of $T_r$ for m = 200, n = 400 and r = 5, 50. Panels (a) and (b) are from independent samples of Unif $[-1, 1]^5$ variates. Panels (c) and (d) are from independent samples of Unif $[-1, 1]^{20}$ variates.
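When the normal approximation is judged adequate, a lower-tail p-value follows from standardizing $T_r$ with the moments in (7) and (16). The Python sketch below shows the calculation; the function name `mcc_normal_pvalue` is hypothetical, and the call reuses the illustrating example (m = n = 10, r = 3, $T_3 = 4$) only to show the mechanics, since N = 20 is far smaller than the sample sizes for which the approximation was examined.

```python
from math import sqrt
from statistics import NormalDist


def mcc_normal_pvalue(t_r, m, n, r):
    """Standardized MCC statistic and its lower-tail normal-approximation p-value."""
    N = m + n
    mean = m * n / (N - 1)                                   # equation (7)
    variance = (2 * m * (m - 1) * n * (n - 1) * (N - 1 - r)
                / (r * (N - 3) * (N - 2) * (N - 1) ** 2))    # equation (16)
    z = (t_r - mean) / sqrt(variance)
    return z, NormalDist().cdf(z)  # small T_r is evidence against F = G


z, p = mcc_normal_pvalue(4, 10, 10, 3)
print(round(z, 2))  # -1.43
```

The resulting normal-approximation p-value is smaller than the permutation estimate of 0.146 reported for the same example, which is in line with the caution that the approximation needs a sufficiently large N.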

Future work remains to bound rejection region probabilities in terms of sample size, dimension, and choice of r. In the absence of such theoretical bounds, for practical purposes a permutation test on observation indices serves as a suitable method to estimate p-values for the MCC test in cases where a normal approximation cannot be justified.

Small simulation study

We compare power characteristics of tests for two different location-shift scenarios. For each case, 1,000 simulations are conducted for each shift in location parameter, group sizes are m = 20 and n = 40, and tests are conducted at significance level α = 0.05. Distances are Euclidean. Estimated power is shown for MCC tests with r = 1, 4, 10, and 30 (where r = 1 is the cross-match test), and the performance of these tests is compared directly to that of Hotelling’s $T^2$ test. Critical values for the MCC test were estimated through simulation for this study. All simulations were performed in R.

For the first example, the smaller group is drawn from a multivariate normal distribution with mean vector 0, identity covariance matrix, and dimension 5. The larger group is drawn from the same family, but the location vector of the second group is Δ, where Δ ranges in magnitude from 0 to 1.5 by increments of 0.3. Hotelling’s $T^2$ test is known to be the uniformly most powerful invariant test for location shift under these conditions (Bilodeau and Brenner [1999]) and the exact power of the test is known for all location alternatives.

Figure 5 displays the estimated power results. We notice immediately that a modest increase from r = 1 to r = 4 substantially improves on the power of the cross-match test. As r continues to increase, MCC performance is even more impressive; the r = 30 = N/2 case performs nearly as well as Hotelling’s $T^2$ test. Power estimates for cases r > N/2 are not shown. Not surprisingly, test power generally decreases as r increases beyond N/2 toward N-1; in the extreme case $T_{N-1}$ takes the fixed value $mn/(N - 1)$ and hence the MCC test with r = N-1 has power equal to zero against all alternatives.
Figure 5

Power estimates for the MCC test at r = 1, 4, 10, and 30 and exact power for Hotelling’s $T^2$ statistic for 5-variate normal mean alternatives with m = 20 and n = 40. The horizontal dashed line is test level α = 0.05.

For the second example, the first group is drawn from the multivariate log-normal distribution, where each of the 5 dimensions consists of independent, univariate log-normal draws with location parameter 0 and scale parameter 1. As before, the second group is drawn from the same family, but the location parameter vector for the second group is Δ, where the magnitude of Δ ranges from 0 to 1.5 by increments of 0.3 and each dimension of Δ takes equal value. The log-normal distribution is considered here to examine the effects of a skewed distribution on the tests in question. Since the underlying distributions are no longer multivariate normal, the power of Hotelling’s $T^2$ test is estimated by simulation for this example.

Figure 6 displays the estimated power results. As before, we see that the power of the MCC test with r = 4 is much better than for r = 1. It is particularly noteworthy that for sufficiently large r the MCC test outperforms Hotelling’s $T^2$ test.
Figure 6

Power estimates for the MCC test at r = 1, 4, 10, and 30 and estimated power for Hotelling’s $T^2$ statistic for 5-variate log-normal location parameter alternatives with m = 20 and n = 40. The horizontal dashed line is test level α = 0.05.

Conclusions

The mean cross-count test is a powerful, non-parametric multivariate two-sample test that is applicable to any case where a notion of distance between observations exists. While this paper considers only location shifts, other simulations show that the MCC test has power in a variety of alternative cases as well. A shortcoming of the MCC test is that the null distribution for $T_r$ is not simple (and perhaps not possible) to compute for all r > 1 and is not exactly distribution-free in these cases; in contrast, the test upon which it is based, the cross-match test with r = 1, has a known distribution that is independent of the distribution being tested.

It is known that $T_1$ is asymptotically normal, and while the mean and variance of $T_r$ are derived herein and simulation suggests that the normal approximation for $T_r$ is appropriate for sufficiently large N with r > 1, this property remains to be proven. This proof is part of ongoing work, as is sharpening the normal approximation via Edgeworth expansion based on higher moments of $T_r$. Likewise, finding useful criteria for choosing r is another area for future work. This choice is subject to competing factors: On the one hand, the power of $T_r$ appears to improve as r increases to N/2 when group sizes are equal (i.e., m = n = N/2); therefore, r = N/2 seems a good choice for equal group sizes. On the other hand, the normal approximation appears to worsen as r increases; thus it may be desirable to restrict the size of r for this sake. Furthermore, an additional effect exists when group sizes are different. For example, assume m < n. If $r \ge m$, then at least one edge in $G_r^*$ contains a vertex from each group and contributes to the cross-count, increasing the value of $T_r$. This is true even if the two groups are very different. Since a higher cross-count weakens the evidence against a group difference, this consideration suggests choosing r < min(m, n). A similar effect exists for multimodal distributions, suggesting that the size of r might be restricted as the number of modes grows. In any case, the best choice of r in practice clearly depends upon application specifics.
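The forced cross-count described above for $r \ge m$ can be quantified with a short counting argument. The bound below is a sketch added for illustration and uses only the r-regularity of $G_r^*$; it is not part of the original analysis. Each of the m vertices in the smaller group has degree r but at most m - 1 neighbors within its own group, so it is incident to at least r - (m - 1) cross-group edges, and every cross-group edge has exactly one endpoint in the smaller group. Hence, for $r \ge m$,

$$
A_r \ge m\,(r - m + 1), \qquad T_r = \frac{A_r}{r} \ge \frac{m\,(r - m + 1)}{r}.
$$

For instance, with m = 5 and r = 6 this gives $A_6 \ge 10$ regardless of how different the two groups are, whereas for r < m the bound is vacuous.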

Declarations

Authors’ Affiliations

(1)
Department of Mathematics, United States Naval Academy

References

  1. Anderson I: Perfect matchings of a graph sufficient conditions for matchings. Proc Edinburgh Math Soc 1972, 18: 129–136. Ser. 2. doi:10.1017/S0013091500009809
  2. Bilodeau M, Brenner D: Theory of Multivariate Statistics. Springer, New York; 1999.
  3. Friedman J, Rafsky L: Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. Ann Stat 1979, 7: 697–717. doi:10.1214/aos/1176344722
  4. Friedman J, Rafsky L: Graphics for the multivariate two-sample problem. JASA 1981, 76: 277–287. doi:10.1080/01621459.1981.10477643
  5. Hall P, Tajvidi N: Permutation tests for equality of distributions in high-dimensional settings. Biometrika 2002, 89: 359–374. doi:10.1093/biomet/89.2.359
  6. Henze N: A multivariate two-sample test based on the number of nearest neighbor type coincidences. Ann Stat 1988, 16: 772–783. doi:10.1214/aos/1176350835
  7. Maa J, Pearl D, Bartoszynski R: Reducing multidimensional two-sample data to one-dimensional interpoint comparisons. Ann Stat 1996, 24: 1069–1074. doi:10.1214/aos/1032526956
  8. Rosenbaum P: An exact distribution-free test comparing two multivariate distributions based on adjacency. JRSS B 2005, 67: 515–530. doi:10.1111/j.1467-9868.2005.00513.x
  9. Ruth D, Koyak K: Nonparametric tests for homogeneity based on non-bipartite matching. JASA 2011, 106: 1615–1625. doi:10.1198/jasa.2011.tm10576
  10. Schilling M: Multivariate two-sample tests based on nearest neighbors. JASA 1986, 81: 799–806. doi:10.1080/01621459.1986.10478337

Copyright

© Ruth; licensee Springer. 2014

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.