Distributions associated with simultaneous multiple hypothesis testing

We develop the distribution for the number of hypotheses found to be statistically significant using the rule from Simes (Biometrika 73: 751–754, 1986) for controlling the family-wise error rate (FWER). We find the distribution of the number of statistically significant p-values under the null hypothesis and show this follows a normal distribution under the alternative. We propose a parametric distribution ΨI(·) to model the marginal distribution of p-values sampled from a mixture of null uniform and non-uniform distributions under different alternative hypotheses. The ΨI distribution is useful when there are many different alternative hypotheses and these are not individually well understood. We fit ΨI to data from three cancer studies and use it to illustrate the distribution of the number of notable hypotheses observed in these examples. We model dependence in sampled p-values using a latent variable. These methods can be combined to illustrate a power analysis in planning a larger study on the basis of a smaller pilot experiment.


Introduction
Much work in informatics is concerned with identifying and classifying statistically significant biological markers. In this work we develop methods for describing the distribution of the numbers of such events. Informatics methods often summarize experiments resulting in a large number of p-values, usually through multiple comparisons of gene expression data. Typically, the number of tests m, is much greater than the number of subjects, N. There are several important rules for identifying statistically significant pvalues while maintaining the significance level below a pre-specified level α (0 < α < 1). Benjamini (2010) provides a review of recent advances.
A commonly cited rule to control the FWER is the Bonferroni correction. Given a sample of ordered p-values p (1) ≤ p (2) ≤ · · · ≤ p (m) , the Bonferroni rule finds the smallest value of B = 0, 1, . . . , m − 1 for which (2020) 7:9 Page 2 of 17 to control the FWER ≤ α. A similar rule developed by Benjamini and Hochberg (1995) to maintain the false discovery rate (FDR) ≤ α finds the largest value of BH such that p (BH) < BH α/m . This reference shows procedures controlling the FWER also control the FDR, but procedures controlling FDR only control FWER in a weaker sense.
We will concentrate on the distribution of B and S in this report. We describe the probability distribution of B and S under null hypotheses where each p-value has an independent marginal uniform distribution as well as an approximating distribution under the alternative hypothesis with density function ψ I (p) expressible as a polynomial in log p of order I.
There has been limited research on parametric distributions for the p-values generated from data under a mixture of the null and different distributions under multiple alternative hypotheses. The mixed p-values are mainly modeled using non-parametric methods (Genovese and Wasserman 2004;Broberg 2005;Langaas et al. 2005;Tang et al. 2007) or alternatively, the p-values are converted into normal quantiles and modeled thereafter (Efron et al. 2001;Efron 2004;Jin and Cai 2007). Another common approach is to approximate the distribution of sampled p-values using a mixture of beta distributions (Pounds and Morris 2003;Broberg 2005;Tang et al. 2007). Other parametric models have been described by Kozoil and Tuckwell (1999); Genovese and Wasserman (2004); Zelterman (2017, 2019).
Of interest is the fraction π 0 of p-values sampled from the uniform distribution under the null hypothesis. Langaas et al. (2005) and Tang et al. (2007) suggest the estimated density of p-values at p = 1 be used to estimate the fraction π 0 . Estimating π 0 is of practical importance: The BH statistic controls the FDR no more than απ 0 . Consequently, Benjamini and Hochberg (2000) recommend we perform tests with significance level α/π 0 and still maintain the FDR below α. We found ψ I (1 |θ ) to be a useful estimator of π 0 in the examples of Section 5, whereθ denotes the maximum likelihood estimate.
The p-values are usually not independent. In microarray studies, for example, a small number of clusters of p-values in the same biological pathway may have high mutual correlations. Methods for modeling such dependencies are developed by Sun and Cai (2009), Friguet et al. (2009), and Wu (2008 for examples. In Section 2, we describe the probability distribution of S in (2) when the p i are independently sampled from an unspecified distribution . In Section 3 we examine p-values sampled from a uniform distribution under the null hypothesis. Section 4 provides elementary properties of the proposed distribution I . The parameters θ of I depend on the specific application and are estimated for two examples in Section 5. In Section 6, we model the distribution of dependent p-values using a latent variable. We combine these methods in Section 7 to illustrate approximate power in planning a proposed study. We provide mathematical details of Sections 2 and 3 in Appendix A. Appendix B examines the behavior of B and S under a close sequence of alternative hypotheses. Appendix C examines the parameter space for the I distribution.
Let p 1 , p 2 , . . . , p m denote m randomly sampled p−values with ordered values p (1) ≤ p (2) ≤ · · · ≤ p (m) . We will initially assume all p-values are independent and have the same distribution function denoted by (·) with corresponding density function ψ(·). In Section 6, we return to the assumption of independence. We propose a non-uniform approximation for in Section 4. If we follow the Bonferroni rule (1) then the distribution of the number B of statistically significant p-values at FWER ≤ α follows a binomial distribution with index m and probability parameter equal to (α/m).
In Appendix A we prove The p-values are typically sampled from a mixture of a uniform distribution under the null hypothesis and several distributions under different alternative hypotheses. Similarly, the distribution of S will be a mixture of a mass near zero and a normal distribution, described next. This mixture distribution is illustrated in Figs. 2 and 4 for two examples of Section 5.
Specifically, for values of S near zero we have To describe the behavior of S away from zero, begin with the quantile function −1 (i/(m + 1)) giving the approximate expected value of the order statistic p (i) . If m is large and i/m is not too close to either zero or one, then p (i) will be approximately normally distributed. In (2), S is the smallest value of k for which p (k+1) > (k + 1)α/m. This should occur for values of S with mean μ solving −1 ((μ + 1)/(m + 1)) = (μ + 1)α/m , or equivalently,

Yu and Zelterman Journal of Statistical Distributions and Applications
(2020) 7:9 Page 4 of 17 If we write S = mp (μ) /α for integer μ and use the large sample approximation to an order statistic, then the approximate variance of S is If the null (uniform) and alternative hypotheses are not very different from each other, then the solution to μ in (6) will be close to zero and Appendix B describes a different approximation to the behavior of B and S.

Behavior under the null hypothesis
Let us next examine the special case where all p-values are independently sampled under the null hypothesis. When the distribution of the p i are independent and marginally uniformly distributed then (3) and (5) are expressible as and in general, Details of the derivation of (7) appear in Appendix A.
Useful results can be obtained if we also assume the number of hypotheses m is large. The limiting distribution (7) of S, is for k = 0, 1, . . . . The probabilities in (8) sum to unity using equation (130) in Jolley (1961, p. 24). The mean of this distribution is α/(1 − α) and the variance is α/(1 − α) 3 . The distribution of S + 1 in (8) is known as the Borel distribution with applications in queueing theory (Tanner, 1961). Similarly, for large values of m, the number of identified p-values at FWER ≤ α for the Bonferroni criteria (1) will follow a Poisson distribution with mean α when sampling p-values under the null hypothesis.

Distributions for P−Values
We next propose a marginal distribution for p-values, independent of the choice of test statistic. We continue to assume the p-values are mutually independent and have the same marginal distributions. We must have concave (Genovese and Wasserman 2004;Sun and Cai 2009), otherwise the underlying test will have power smaller than its significance level for some α. Similarly, the corresponding density function ψ must be monotone decreasing. We next propose a flexible distribution for modeling the distribution of pvalues under alternative hypotheses.
Consider a distribution with a density function expressible as a polynomial in log p up to degree I = 0, 1, 2, . . .. The uniform (0-1) distribution is obtained for I = 0. The marginal density function we propose for p-values is Yu and Zelterman Journal of Statistical Distributions and Applications (2020) 7:9 Page 5 of 17 for real-valued parameters θ = {θ 1 , . . . , θ I } with I ≥ 1 where so the densities ψ I (p) integrate to one over 0 < p ≤ 1. Similarly, θ 0 is not an independent parameter. The corresponding cumulative distribution function is where β 0 = 1. The relationship between these parameters is linear: Throughout, we will interchangeably refer to either the θ or β parameterizations for simplicity.
These restrictions alone on θ 0 , θ 1 , and θ I are not sufficient to guarantee ψ I (p | θ ) is monotone decreasing or positive valued for all values of 0 ≤ p ≤ 1. The necessary conditions for achieving these properties are difficult to describe in general, but sufficient conditions are all θ i ≥ 0. Specific cases are examined in Appendix C for values of I up to I = 4. Models for larger values of I could be fitted by maximizing the penalized likelihood, such that ψ I (p | θ) is positive valued and monotone decreasing at the observed, sorted p-values.
In practice, the choice of I is found by fitting a sequence of models. Successive values of I represent nested models so twice the differences of the respective log-likelihoods will behave as χ 2 (1 df ) when the underlying additional parameter value is zero. In practice, we found I = 3 or 4 were adequate for the three examples in this work.
The ψ I density function is specially suited for modeling the marginal distribution of a uniform and a variety of non-uniform distributions for p-values. If each p i (i = 1, . . . , m ) is sampled from a different distribution with density function ψ I (p | θ i ), then the marginal density of all p i satisfies where θ is the arithmetic average of all θ i . A similar result holds if the values of I vary across distributions of p i . This mixing of distributions includes the uniform as a special case. Specifically, suppose 100π 0 −percent of the p-values are sampled from a uniform (0, 1) distribution (0 ≤ π 0 ≤ (2020) 7:9 Page 6 of 17 1) and the remaining 100(1−π 0 )−percent are sampled from ψ I (p | θ ). Then the marginal distribution has density function demonstrating π 0 is not identifiable in this model. Equations (13) and (14) illustrate the utility of ψ I in modeling p-values sampled from a mixture of the null hypothesis and different distributions under alternative hypotheses, yet retaining the same parametric distribution form. Donoho and Jin (2004) also describe the value of such a mixture of heterogeneous alternative hypotheses in multiple testing settings. Following Langaas et al. (2005); Tang et al. (2007) we use ψ I (p = 1 |θ) =θ 0 , the estimated density at p = 1, to estimate π 0 , the proportion of p-values sampled from the null hypothesis.

Two examples
For each of the examples in this work, we fitted the density function ψ I described in Section 4 and then used this model to examine the distribution of S given in (3). The fitted parameter valuesθ for these examples are given for successive values of I. We maximized the likelihoods using standard optimization routine nlm in R. This routine also provides estimates of the Hessian used to estimate standard errors of parameter estimates.
The evaluation of U k in (5) involves adding and subtracting many nearly equal values resulting in numerical instability. We computed U k using multiple precision arithmetic with the Rmpfr package in R (Maechler 2019). A third example will be introduced in Section 7, to illustrate estimation of power for multiple hypothesis testing problems.

Breast cancer
This microarray dataset was originally described by Hedenfalk et al. (2001) and also analyzed by Storey and Tibshirani (2003). These data summarize marker expressions of m = 3226 genes in seven women with the BRCA1 mutation and in eight women with the BRCA2 mutation. The objective was to determine differentially-expressed genes between these two groups. Earlier analyses used a two-sample t-test to compare the two groups for each gene, giving rise to m p-values. Efron (2004) and Jin and Cai (2007) model the z-scores corresponding to the p-values.
Fitted parameters are given in Table 1. The fitted model for I = 2 represents a big improvement over the model with I = 1 parameter. The model with I = 3 parameters has a modest improvement over the model with I = 2 and I = 4 demonstrates negligible change in the likelihood over I = 3. Fitted densities ψ I for I = 2 and 3 are plotted in Fig. 1 along with the observed data. There is only a small difference between the fitted models in this figure, and both exhibit a good fit to the data. Our estimate of π 0 given bŷ θ 0 is .65 for I = 2 and .62 for I = 3. An estimate of .67 for π 0 is described in Storey and Tibshirani (2003).
There are S = 29 statistically significant markers at FWER = .05 using the adjustment for multiplicity given in (2). The fitted distribution of S is displayed in Fig. 2 using  ψ 3 (· |θ ) . The mean of this fitted distribution is 22.75. The distribution in Fig. 2 appears as a mixture of a distribution concentrated near k = 0 and a left-truncated normal distribution with a local mode at 24. The observed value S = 29 is indicated in this figure. Yu and Zelterman Journal of Statistical Distributions and Applications (2020) 7:9 Page 7 of 17 The point mass at S = 0 is about 0.1 and values of S ≤ 3 account for about 20% of the distribution with I = 3 and fittedθ . This distribution is approximately a mixture of the distribution near zero and 80% of a normal with mean 26.1 and standard deviation 17.9 using (6).

The cancer genome atlas: lung cancer
This dataset contains the summary of an extensive database collected on tumors from N = 178 patients with squamous cell lung carcinoma. A full description of these data and Fig. 1 Histogram and fitted density function ψ I for the 3226 biomarker p-values from the breast cancer data with I = 2 (dashed line) and I = 3 (solid line). A histogram for these data also appears in Storey and Tibshirani (2003) (2020) 7:9 Page 8 of 17  Table 2. Distributions up to I = 4 showed statistically significant improvement in the log-likelihood but larger values of I failed to change it. The fitted density function ψ 4 (· |θ ) given in Fig. 3 demonstrates good agreement with the observed data. The estimateθ 0 of π 0 is about .70 for I = 4.
The fitted distribution of S given in (3) is plotted in Fig. 4. There is close agreement between the observed value (173), the mean (176.35) of the fitted distribution, and the local mode (177). As with Fig. 2, the fitted distribution of S appears as a mixture of a distribution concentrated near zero and a normal distribution. The local mode at zero gives a fitted Pr[ S ≤ 2 ] of .012. The density mass away from zero is approximately that of a normal distribution with mean 178.8 and standard deviation 39.1 using (6).

Sampling dependent P-values
In this section we describe a method for sampling of dependent p-values by conditioning on an unobservable, latent variable. Greater dependence among the p-values results (2020) 7:9 Page 9 of 17 in greater means and variances for the distribution of p-values. This behavior is also described by Owen (2005). Greater dependence also contributes to a larger point mass at zero. We will use the fitted breast cancer example of Section 5.1 to illustrate these methods. Let θ and denote I−tuples such that both θ + and θ − are valid parameters for the density ψ I described in Section 4. Let Y denote a Bernoulli random variable with parameter equal to 1/2. Conditional on the (unobservable) value of Y , assume all p-values are sampled from either ψ I ( · | θ + ) or ψ I ( · | θ − ). The marginal distribution of these exchangeable p-values is then ψ I (· | θ ) using (13).
To demonstrate the correlation among the p-values induced by this latent model, let Q 1 , Q 2 denote a random sample from ψ I , both with parameters either θ + or θ − , conditional on Y. The Q i are conditionally independent given Y and have marginal covariance where E(p | θ) is the expected value of ψ I ( p | θ ) calculated using (12). This covariance is never negative.
Continuing to sample in this fashion, we then have the marginal distribution As an illustration, we used θ =θ and = zσ whereθ andσ are the fitted parameters and their estimated standard errors respectively given in Table 1 for the breast cancer example with I = 3. The distributions given by (15) for z = 0, .25, .5, and .75 are plotted in Fig. 5. Summaries of these four distributions and the mutual correlations of the pvalues are given in Table 3. As we see in Fig Table 3 zero. Greater dependence results in a larger point mass at zero, as well as larger means and variances of S .

Power for planning studies
In this final section we describe how to plan for a larger project using data from a smaller pilot study. Huang et al. (2015) report on a study of N = 78 patients with lung cancer and examined m = 48, 803 markers to determine if any of these are related to patient survival. None of these markers were identified as statistically significant at α = .05 using the Bonferroni method. A link to their data appears in our References.  We examined their data and the parameter estimates for our fitted models ψ I appear in Table 4. We found the model with I = 3 provided the best fit and worked with that maximum likelihood estimateθ to model power. We estimate more than 90% of the pvalues were sampled from the null hypothesis in these data.
In order to describe power we will assume the magnitude of the effect, as measured by θ , is proportional to the square root of the subject sample size, as is often the case with parameters whose estimates are normally distributed. This assumption will also require values of θ to lie near the center of the valid parameter space and wouldn't be valid for extrapolating to extremely large sample sizes. That is, we computed power estimates in Table 5 setting θ = θ (N) = (N/78) 1/2θ where N is the proposed patient sample size and used = zθ in (15) to vary the dependence among p-values for values of z = 0, .4, and .8. A variety of sample sizes and correlations are summarized in Table 5. This table summarizes the power as the probability of identifying at least one marker with α = .05. The expected number of identified findings using S is also given in this table.
We estimate the published study by Huang et al. (2015) had about a 50% chance of detecting at least one marker with α = .05. Table 5 suggests increasing sample sizes from 78 to N ≥ 450 patients to achieve power greater than 80% under a model of independent sampling. Even small mutual correlations result in greater point masses at zero, reducing the power of detecting at least one statistically significant p-values. Another factor is the estimated high proportion of p-values sampled from the null hypothesis (π 0 = .908). Subsequent studies should restrict sampling to those markers showing promise in the pilot, as the case in Haynes et al. (2012).