
Mean and variance of ratios of proportions from categories of a multinomial distribution

Abstract

A ratio distribution is a probability distribution representing the ratio of two random variables, each usually having a known distribution. Results are currently available for cases where the random variables in the ratio follow (not necessarily the same) Gaussian, Cauchy, binomial or uniform distributions. In this paper we consider the case where the random variables in the ratio are binomial components of a joint multinomial distribution. We derive formulae for the mean and variance of this ratio distribution using a simple Taylor-series approach and also a more complex approach which uses a slight modification of the original ratio. We show that the more complex approach yields better results on simulated data. The presented results can be directly applied in the computation of confidence intervals for ratios of multinomial proportions.

AMS Subject Classification: 62E20

Introduction

Combinations of random variables (e.g., sums, products, ratios) regularly occur in many scientific areas. Particularly useful is the ratio of two random variables. For example, plant scientists use the ratio of leaf area to total plant weight (leaf area ratio) in plant growth analysis (Poorter and Garnier 1996), and geneticists use the ratio of the total genetic diversity distributed among populations to the total genetic diversity in the pooled populations as a measure of population differentiation (Culley et al. 2002). The ratio of two fluorescent signals has several applications in fluorescence microscopy, e.g., estimating the DNA sequence copy number as a function of chromosomal location (Piper et al. 1995), and many (dimensionless) ratios are employed in engineering (Mekic et al. 2012). In the case of categorical data (i.e., data from a binomial or multinomial distribution), there are numerous applications of ratios as well, e.g., in consumer preference studies, election poll results, quality control, epidemiology, and so on.

Formally, a ratio distribution is a probability distribution constructed as the distribution of the ratio of two random variables, each having another (known) distribution. More particularly, given two random variables Y1 and Y2, the distribution of the random variable Z formed as the ratio Z=Y1/Y2 is a ratio distribution. When using ratio distributions for theoretical and practical purposes, it is helpful to know their mean and variance, preferably in a computationally efficient form. In the case that Y1 and Y2 follow normal distributions with zero means, Z follows a Cauchy distribution (Geary 1930; Fieller 1932; Hinkley 1969; Korhonen and Narula 1989; Marsaglia 2006). Other authors have addressed ratios of binomial proportions (also known as relative risk) (Koopman 1984; Bonett and Price 2006; Price and Bonett 2008), ratios of uniform distributions (Sakamoto 1943), Student’s t distributions (Press 1969), Weibull and gamma distributions (Basu and Lochner 1971; Provost 1989; Nadarajah and Kotz 2006), beta distributions (Pham-Gia 2000), Laplace and Bessel distributions (Nadarajah 2005; Nadarajah and Kotz 2005) and others. General notes on the product and ratio of two (not necessarily normal) random variables can also be found in Frishman (1971) and Van Kempen and Van Vliet (2000).

In our paper, we consider a ratio involving two or more random variables that jointly have a multinomial distribution. This situation is similar to the relative risk, or risk ratio, which is the ratio of the probability of an event occurring (for example, developing a disease or being injured) in an exposed group to the probability of the event occurring in a comparison, non-exposed group. However, while the probabilities in the risk ratio are independent (in the sense that they describe two independent events in two independent groups), in our case the probabilities are tied together through the covariance between the multinomial categories. Such ratios serve as a common framework for opinion polls, statistical quality control, and consumer preference studies. Confidence intervals for these ratios, which can be easily calculated if the standard deviation is known, are especially important for applications. Nelson (1972) presented estimates, confidence intervals, and hypothesis tests for the ratio of two multinomial proportions in trinomial distributions. Piegorsch and Richwine (2001) examined some types of confidence intervals in the context of the analysis of genetic mutant spectra. Quesenberry and Hurst (1964) and Goodman (1965) explored methods for obtaining a set of simultaneous confidence intervals for the probabilities of a multinomial distribution. A comparison of the performance of various confidence intervals also appeared in Alghamdi (2015) and Aho and Bowyer (2015). To the best of our knowledge, however, there has been no analytical treatment of the ratio of multinomial proportions that includes derivations of formulae for the mean and variance of such a ratio.

A ratio between two or more random variables that jointly have a multinomial distribution also arises in the rapidly growing field of non-invasive prenatal testing of common fetal aneuploidies such as trisomy of chromosome 13, 18 or 21 (Chiu et al. 2008; Sehnert et al. 2011; Lau et al. 2012; Minarik et al. 2015). We are currently working on implementing this model in laboratory practice, and this paper provides the mathematical background for that work. In this paper, we discuss two solutions to the problem of the mean and variance of the said ratio. More particularly, we derive asymptotic formulae for the mean and variance of the random variable Z=Y1/Y2, where \(Y_{1}=\sum _{k\in I} X_{k}\) and \(Y_{2}=\sum _{k\in J} X_{k}\), with I,J⊂{1,...,r} and I∩J=∅, are sums of random variables X1,...,Xr which jointly have a multinomial distribution.

Solution by Taylor series

There is a simple approximate solution for the mean and variance of the ratio of multinomial proportions that can be derived using a Taylor series. Formally, let a set of random variables X1,...,Xr have the probability function

$$pr\left(X_{1}=x_{1},..., X_{r}=x_{r}\right) = \frac{n!}{\prod_{i=1}^{r}{x_{i}!}}\prod_{i=1}^{r}{p_{i}^{x_{i}}}, $$

where the \(x_{i}\) are non-negative integers such that \(\sum x_{i} = n\) and the \(p_{i}\) are constants with \(p_{i}>0\) and \(\sum p_{i}=1\). The joint distribution of X1,...,Xr is known as the multinomial distribution. Let \(u,v\in \{0,1\}^{r}\) be two binary vectors such that \(\sum u_{i}>0\), \(\sum v_{i}>0\) and \(u_{i}v_{i}=0\) for all i. We define

$$Z_{0} = \frac{X \cdot u}{X \cdot v}, $$

where · represents the scalar product and X=(X1,...,Xr). Without loss of generality, we will restrict our explorations to r=3 and Z0=X1/X2. This is possible because the choice vectors u and v select no common Xi; thus, the Xi can be grouped into three disjoint sets: 1) those selected by u, 2) those selected by v, and 3) all the others.
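As an illustration of this setup, the following sketch (in R, the language of the paper's supplementary script, though not the published script itself) draws one multinomial vector and forms Z0 from two selection vectors; the distribution parameters and the vectors u, v are ours, chosen only for illustration.

```r
# Minimal sketch: forming Z0 = (X . u) / (X . v) from one multinomial draw.
# All parameters below are illustrative, not taken from the paper.
set.seed(1)
p <- c(0.10, 0.15, 0.25, 0.30, 0.20)   # p_i > 0, sum(p) = 1, r = 5
n <- 100
u <- c(1, 1, 0, 0, 0)                  # numerator selects X1 + X2
v <- c(0, 0, 1, 0, 0)                  # denominator selects X3; u_i * v_i = 0

X  <- as.vector(rmultinom(1, size = n, prob = p))
Z0 <- sum(X * u) / sum(X * v)          # undefined if the denominator is 0

# Grouping the X_i into the three disjoint sets reduces the problem to r = 3:
X3 <- c(sum(X[u == 1]), sum(X[v == 1]), sum(X[u == 0 & v == 0]))
```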

Also, the reader will note that the ratio Z0=X1/X2 can be viewed as a ratio of absolute quantities as well as a ratio of fractions or probabilities because Z0=(X1/n)/(X2/n).

Before we proceed any further, observe that because of the possible zero in the denominator of Z0, there is no analytical solution for the mean and variance of the ratio Z0. A workaround for this problem is to approximate the ratio by a function that does not have a singularity. Let Z0=f(X1,X2)=X1/X2 be a function of two random variables. Then, with \(\mu =\left (\mu _{X_{1}}, \mu _{X_{2}}\right)\), we can use the Taylor series to approximate the function f as

$$\begin{array}{*{20}l} Z_{0}=f\left(X_{1},X_{2}\right) \approx&\ f(\mu) + \left(X_{1} - \mu_{X_{1}}\right)\frac{\partial f}{\partial X_{1}}(\mu) + \left(X_{2} - \mu_{X_{2}}\right)\frac{\partial f}{\partial X_{2}}(\mu) \\ &+ \frac{1}{2}\left(X_{1} - \mu_{X_{1}}\right)^{2}\frac{\partial^{2} f}{\partial X_{1}^{2}}(\mu) + \frac{1}{2}\left(X_{2} - \mu_{X_{2}}\right)^{2}\frac{\partial^{2} f}{\partial X_{2}^{2}}(\mu) \\ &+\left(X_{1} - \mu_{X_{1}}\right)\left(X_{2} - \mu_{X_{2}}\right)\frac{\partial^{2} f}{{\partial X_{1}}{\partial X_{2}}}(\mu), \end{array} $$

from which we have

$$ E(Z_{0}) \approx f(\mu) + \frac{1}{2}\frac{\partial^{2} f}{\partial X_{1}^{2}}(\mu)\sigma_{X_{1}}^{2} + \frac{1}{2}\frac{\partial^{2} f}{\partial X_{2}^{2}}(\mu)\sigma_{X_{2}}^{2} + \frac{\partial^{2} f}{{\partial X_{1}}{\partial X_{2}}}(\mu)\sigma_{X_{1},X_{2}}. $$
(1)

Since X1 and X2 are components of a random vector X=(X1,X2,X3) drawn from the multinomial distribution given by (n,p1,p2,p3), we have \(\mu _{X_{i}} = np_{i}\) and \(\sigma _{X_{i}}^{2}=np_{i}(1-p_{i})\) for i=1,2, and \(\sigma _{X_{1},X_{2}} = -np_{1}p_{2}\). It follows easily that

$$ E(Z_{0}) \approx \frac{p_{1}}{p_{2}} + \frac{1}{n}\left(\frac{p_{1}(1-p_{2})}{p_{2}^{2}} + \frac{p_{1}}{p_{2}}\right) = \frac{p_{1}}{p_{2}}\left(1 + \frac{1}{np_{2}}\right). $$
(2)
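For illustration (the numbers are ours, not from the paper), with n=50, p1=0.25 and p2=0.5 the approximation (2) gives

$$E(Z_{0}) \approx \frac{0.25}{0.5}\left(1 + \frac{1}{50\cdot 0.5}\right) = 0.5\cdot 1.04 = 0.52, $$

slightly above the plug-in value p1/p2=0.5, reflecting the positive O(1/n) bias of the ratio.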

For the variance, we use the simpler first-order approximation of f

$$f\left(X_{1},X_{2}\right) \approx f(\mu) + \left(X_{1} - \mu_{X_{1}}\right)\frac{\partial f}{\partial X_{1}}(\mu) + \left(X_{2} - \mu_{X_{2}}\right)\frac{\partial f}{\partial X_{2}}(\mu), $$

from which we have

$$ var(Z_{0}) \approx \frac{\partial f}{\partial X_{1}}(\mu)^{2}\sigma_{X_{1}}^{2} + \frac{\partial f}{\partial X_{2}}(\mu)^{2}\sigma_{X_{2}}^{2} + 2\frac{\partial f}{\partial X_{1}}(\mu)\frac{\partial f}{\partial X_{2}}(\mu)\sigma_{X_{1},X_{2}}, $$
(3)

and finally

$$ var(Z_{0}) \approx \frac{1}{n}\left(\frac{p_{1}(1-p_{1})}{p_{2}^{2}} + \frac{p_{1}^{2}(1-p_{2})}{p_{2}^{3}} + 2\frac{p_{1}^{2}}{p_{2}^{2}} \right) = \frac{1}{n}\left(\frac{p_{1}}{p_{2}}\right)^{2}\left(\frac{1}{p_{1}} + \frac{1}{p_{2}}\right). $$
(4)
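The approximations (2) and (4) are easy to check empirically. The following sketch (in R, mirroring the setup of Section 5; the sample size, seed and n are our illustrative choices) compares them with Monte Carlo estimates obtained from draws with X2≠0.

```r
# Sketch: Taylor-series approximations (2) and (4) vs. a Monte Carlo estimate.
# p1, p2, p3 follow the first simulation setting of Section 5; n and the seed
# are illustrative.
set.seed(2)
n <- 50; p1 <- 0.25; p2 <- 0.50; p3 <- 0.25
X    <- rmultinom(1e5, size = n, prob = c(p1, p2, p3))
keep <- X[2, ] > 0                    # drop draws with X2 = 0, as in Section 5
Z0   <- X[1, keep] / X[2, keep]

c(mean_sim    = mean(Z0),
  mean_taylor = (p1 / p2) * (1 + 1 / (n * p2)))               # Eq. (2)
c(var_sim     = var(Z0),
  var_taylor  = (1 / n) * (p1 / p2)^2 * (1 / p1 + 1 / p2))    # Eq. (4)
```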

Solution by a modified ratio

3.1 Definition

Let the symbols X, u, and v have the same meaning as in Section 2. We define a new random variable Z1 as

$$ Z_{1} = \frac{X \cdot u}{X \cdot v + 1}. $$
(5)

The +1 in the above definition serves to avoid a zero in the denominator, and thus resolves the singularity problem of Z0. For the same reasons as in Section 2, we will restrict our explorations to r=3 and Z1=X1/(X2+1).

3.2 Sample space

The sample space \(S_{Z_{1}}\subseteq \mathbb {Q}\) of the random variable Z1 is limited by the sample space S X of the multinomially distributed random vector X=(X1,X2,X3). Therefore, if X assumes values from the multinomial distribution given by (n,p1,p2,p3), then Z1 cannot assume all rational values of the form a/(b+1) with \(a, b\in \mathbb {N}\), but only those that satisfy a+b≤n and a,b≥0. Furthermore, values such as 2/2 and 4/4 are identical; therefore, different outcomes of the random vector X may correspond to the same outcome of Z1. In other words, each instance (a,b,c) of X corresponds to exactly one instance a/(b+1) of Z1, while an instance of Z1 may correspond to multiple instances of X.

Naturally, the probability of a particular value of Z1 can be determined by summing the probabilities of all (multinomial) vectors that are associated with this value. From this, it follows that if the initial multinomial probability distribution function of random vector X is

$$pr\left(X_{1}=a,X_{2}=b,X_{3}=c\right) = \left(\begin{array}{c} n\\ a,b,c \end{array}\right)p_{1}^{a}~p_{2}^{b}~p_{3}^{c}, $$

then the probability distribution function of random variable Z1 is

$$pr\left(Z_{1} = d\right) = \sum_{\substack{{a,b,c\in\{0,...,n\}}\\a+b+c=n\\a/(b+1)=d}} \left(\begin{array}{c} n\\ a,b,c \end{array}\right)p_{1}^{a}~p_{2}^{b}~p_{3}^{c}, $$

which can be rewritten as

$$pr\left(Z_{1} = d\right) =\sum_{b=0}^{n}\sum_{\substack{a=0\\a/(b+1)=d}}^{n-b}\left({n \atop b}\right)\left({n-b \atop a}\right) p_{1}^{a}~ p_{2}^{b}~ (1-p_{1}-p_{2})^{n-a-b}. $$
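This probability function is straightforward to tabulate for small n. The sketch below (in R; the parameters are illustrative, not from the paper) enumerates all outcomes (a,b,c), accumulates the probability of each value d=a/(b+1), and checks that the probabilities sum to one.

```r
# Sketch: tabulating pr(Z1 = d) by enumerating all (a, b, c) with a + b + c = n.
n <- 10; p <- c(0.25, 0.50, 0.25)
grid <- expand.grid(a = 0:n, b = 0:n)
grid <- grid[grid$a + grid$b <= n, ]
grid$prob <- mapply(function(a, b) dmultinom(c(a, b, n - a - b), size = n, prob = p),
                    grid$a, grid$b)
grid$d <- grid$a / (grid$b + 1)        # value of Z1 for this outcome of X
pmf <- tapply(grid$prob, grid$d, sum)  # probabilities of equal values are summed
sum(pmf)                               # equals 1 up to floating-point error
```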

3.3 Mean and variance

Now we can state the mean and variance of Z1. The proofs of the statements can be found in the Appendix.

Theorem 1

Let X=(X1,X2,X3) be a random vector from the multinomial distribution given by (n,p1,p2,p3). The expected value of the random variable Z1, given by (5), is

$$E(Z_{1})=\frac{p_{1}}{p_{2}}\left(1 - (1-p_{2})^{n}\right). $$
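Since Theorem 1 is an exact (non-asymptotic) statement, it can be verified directly by enumeration. A short sketch (in R; the parameters are illustrative):

```r
# Sketch: Theorem 1 checked by exact enumeration of all outcomes (a, b, c).
E_Z1_enum <- function(n, p1, p2) {
  total <- 0
  for (b in 0:n) for (a in 0:(n - b))
    total <- total + dmultinom(c(a, b, n - a - b), size = n,
                               prob = c(p1, p2, 1 - p1 - p2)) * a / (b + 1)
  total
}
n <- 20; p1 <- 0.25; p2 <- 0.5
c(enumerated = E_Z1_enum(n, p1, p2),
  theorem_1  = (p1 / p2) * (1 - (1 - p2)^n))   # the two values agree
```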

Theorem 2

Let X=(X1,X2,X3) be a random vector from the multinomial distribution given by (n,p1,p2,p3), where

$$n>\frac{1-p_{2}}{p_{2}}N + \frac{1-2p_{2}}{p_{2}} $$

for some natural non-zero N. The variance of the random variable Z1, given by (5), is

$$\begin{array}{*{20}l} var(Z_{1}) =&\ \left[\frac{p_{1}}{p_{2}(1-p_{2})}\right]^{2}\frac{\frac{1-p_{2}}{p_{1}}-2}{n+2} + \frac{p_{1}}{p_{2}(1-p_{2})}\frac{\frac{p_{1}}{1-p_{2}}-1}{n+1} \\ &+ \sum_{k=1}^{N} \frac{\left[\frac{p_{1}}{p_{2}(1-p_{2})}\right]^{2}}{\left({n+k+1 \atop k}\right)p_{2}^{k}}\left[1-\frac{k+2 - \frac{1-p_{2}}{p_{1}}}{n+k+2}\right] + O\left(\frac{1}{n^{N+1}}\right). \end{array} $$

Corollary 1

For N=1, the variance from Theorem 2 reduces to

$$var(Z_{1}) = \frac{1}{n}\left(\frac{p_{1}}{p_{2}}\right)^{2} \left(\frac{1}{p_{1}} + \frac{1}{p_{2}}\right) + O\left(\frac{1}{n^{2}}\right). $$

Observe that the formula for the variance is asymptotic in nature, and thus it may not work well for small n and certain configurations of p1, p2 and p3. See Section 5 for more details.
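To see how the asymptotic formula behaves for a concrete configuration, the sketch below (in R; the parameters are illustrative) evaluates the variance from Theorem 2 for chosen values of N and compares it with the exact variance of Z1 obtained by enumeration.

```r
# Sketch: asymptotic variance from Theorem 2 vs. the exact variance of Z1.
# Requires n > (1 - p2)/p2 * N + (1 - 2*p2)/p2 (condition of Theorem 2).
var_Z1_thm2 <- function(n, p1, p2, N) {
  q <- p1 / (p2 * (1 - p2))
  out <- q^2 * ((1 - p2) / p1 - 2) / (n + 2) +
         q   * (p1 / (1 - p2) - 1) / (n + 1)
  for (k in 1:N)
    out <- out + q^2 / (choose(n + k + 1, k) * p2^k) *
                 (1 - (k + 2 - (1 - p2) / p1) / (n + k + 2))
  out
}
var_Z1_exact <- function(n, p1, p2) {
  m1 <- 0; m2 <- 0
  for (b in 0:n) for (a in 0:(n - b)) {
    pr <- dmultinom(c(a, b, n - a - b), size = n, prob = c(p1, p2, 1 - p1 - p2))
    z  <- a / (b + 1)
    m1 <- m1 + pr * z; m2 <- m2 + pr * z^2
  }
  m2 - m1^2
}
n <- 50; p1 <- 0.25; p2 <- 0.5
c(exact   = var_Z1_exact(n, p1, p2),
  thm2_N1 = var_Z1_thm2(n, p1, p2, N = 1),
  thm2_N5 = var_Z1_thm2(n, p1, p2, N = 5))
```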

Approximate error of solution by a modified ratio

Let

$$Err = g(X_{1},X_{2}) = \frac{X_{1}}{X_{2}} - \frac{X_{1}}{X_{2}+1} = \frac{X_{1}}{X_{2}(X_{2}+1)} $$

be a function of two random variables expressing the difference between Z0 and Z1. Analogously to Eqs. (1)–(4) from Section 2, with g(X1,X2)=X1/[X2(X2+1)] in place of f, we have for the mean and variance of Err

$$\begin{array}{*{20}l} E(Err) &\approx \frac{p_{1}}{p_{2}(1+np_{2})} + \frac{p_{1}(1-p_{2})\left(1 + 3np_{2} + 3n^{2}p_{2}^{2}\right)}{np_{2}^{2}(1 + np_{2})^{3}} + \frac{p_{1}(1+2np_{2})}{np_{2}(1+np_{2})^{2}} \\ &= \frac{p_{1}\left[1 + 4np_{2} + (5-p_{2})n^{2}p_{2}^{2} + n^{3}p_{2}^{3}\right]}{np_{2}^{2}(1 + np_{2})^{3}}, \end{array} $$
(6)
$$\begin{array}{*{20}l} var(Err) &\approx \frac{(1 - p_{1}) p_{1}}{n p_{2}^{2} (1 + n p_{2})^{2}} + \frac{(1 - p_{2}) (p_{1} + 2 n p_{1} p_{2})^{2}}{n p_{2}^{3} (1 + n p_{2})^{4}} + \frac{2 p_{1}^{2} (1 + 2 n p_{2})}{n p_{2}^{2} (1 + n p_{2})^{3}} \\ &= \frac{p_{1} \left[p_{2} (1 + n p_{2})^{2} + p_{1} \left\{1 + 4 n p_{2} + (4 - p_{2}) n^{2} p_{2}^{2}\right\}\right]}{n p_{2}^{3} (1 + n p_{2})^{4}}. \end{array} $$
(7)

It follows from Eqs. (6) and (7) that Z1 is an asymptotically (n→∞) unbiased estimator of the ratio of multinomial proportions Z0. Moreover, Eqs. (6) and (7) can be used to correct the mean and variance of the modified ratio Z1 so that they better reflect the mean and variance of the original ratio Z0. Let \(Z_{1}^{cor} = Z_{1} + Err\) be a new random variable. Since the expected value is linear, we have directly

$$\begin{array}{*{20}l} E\left(Z_{1}^{cor}\right) &= E(Z_{1}) + E(Err) \approx \\ &\approx \frac{p_{1}}{p_{2}}\left(1 - (1-p_{2})^{n}\right) + \frac{p_{1}\left[1 + 4np_{2} + (5-p_{2})n^{2}p_{2}^{2} + n^{3}p_{2}^{3}\right]}{np_{2}^{2}(1 + np_{2})^{3}}. \end{array} $$
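In code, this corrected mean is a one-line combination of Theorem 1 and Eq. (6); the sketch below (in R; the parameters are illustrative) evaluates it.

```r
# Sketch: corrected mean E(Z1^cor) = E(Z1) + E(Err), i.e. Theorem 1 plus Eq. (6).
E_Z1_cor <- function(n, p1, p2) {
  (p1 / p2) * (1 - (1 - p2)^n) +                                   # Theorem 1
    p1 * (1 + 4 * n * p2 + (5 - p2) * n^2 * p2^2 + n^3 * p2^3) /
      (n * p2^2 * (1 + n * p2)^3)                                  # Eq. (6)
}
E_Z1_cor(50, 0.25, 0.5)   # to be compared with the sampled mean of Z0
```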

For the variance, we have

$$var(Z_{1}^{cor}) = var(Z_{1}) + var(Err) + 2cov(Z_{1},Err), $$

where

$$cov(Z_{1},Err) = E(Z_{1}\cdot Err) - E(Z_{1}) \cdot E(Err). $$

To approximate the value of \(E(Z_{1}\cdot Err) = E\left(X_{1}^{2}/\left[X_{2}(X_{2}+1)^{2}\right]\right)\), we use the Taylor series again, particularly Eq. (1). After some rearrangement, we get

$$\begin{array}{*{20}l} {}E\left(\frac{X_{1}^{2}}{X_{2}(X_{2}+1)^{2}}\right) \approx& \frac{n p_{1}^{2}}{p_{2} (1 + n p_{2})^{2}} + \frac{(1 - p_{1}) p_{1}}{p_{2} (1 + n p_{2})^{2}} \\ &+ \frac{p_{1}^{2} (1 - p_{2}) \left(1 + 4 n p_{2} + 6 n^{2} p_{2}^{2}\right)}{p_{2}^{2} (1 + n p_{2})^{4}} + \frac{2 p_{1}^{2} (1 + 3 n p_{2})}{p_{2} (1 + n p_{2})^{3}} \\ =& \frac{p_{1} \left[p_{2} (1 + n p_{2})^{2} + p_{1} \left\{1 + (5 + 2 p_{2}) n p_{2} + (8 - p_{2}) n^{2} p_{2}^{2} + n^{3} p_{2}^{3}\right\}\right]}{p_{2}^{2} (1 + n p_{2})^{4}} \end{array} $$

Thus, we can now easily calculate the value of \(var\left (Z_{1}^{cor}\right)\) (equation omitted due to its length). In the next section, we shall discuss numerical simulations and performance of the presented formulae.
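For completeness, the following sketch (in R) assembles \(var\left(Z_{1}^{cor}\right)\) numerically from Theorem 2, Eq. (7) and the approximation of E(Z1·Err) above; it reuses the var_Z1_thm2 helper from the sketch following Corollary 1, and the parameters are illustrative.

```r
# Sketch: var(Z1^cor) = var(Z1) + var(Err) + 2*cov(Z1, Err), with
# cov(Z1, Err) = E(Z1*Err) - E(Z1)*E(Err).
var_Z1_cor <- function(n, p1, p2, N = 5) {
  E_Z1  <- (p1 / p2) * (1 - (1 - p2)^n)                            # Theorem 1
  E_Err <- p1 * (1 + 4 * n * p2 + (5 - p2) * n^2 * p2^2 + n^3 * p2^3) /
           (n * p2^2 * (1 + n * p2)^3)                             # Eq. (6)
  v_Err <- p1 * (p2 * (1 + n * p2)^2 +
                 p1 * (1 + 4 * n * p2 + (4 - p2) * n^2 * p2^2)) /
           (n * p2^3 * (1 + n * p2)^4)                             # Eq. (7)
  E_Z1Err <- p1 * (p2 * (1 + n * p2)^2 +
                   p1 * (1 + (5 + 2 * p2) * n * p2 +
                         (8 - p2) * n^2 * p2^2 + n^3 * p2^3)) /
             (p2^2 * (1 + n * p2)^4)            # approximation of E(Z1 * Err)
  var_Z1_thm2(n, p1, p2, N) + v_Err + 2 * (E_Z1Err - E_Z1 * E_Err)
}
var_Z1_cor(50, 0.25, 0.5)   # to be compared with the sampled variance of Z0
```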

Numerical simulations

Numerical simulations were performed in the following way. We selected several multinomial distributions given by (n,p1,p2,p3), and for each such distribution we sampled \(10^{5}\) random vectors (X1,X2,X3). Vectors with X2=0 were counted (variable zeros) and omitted from further calculations; that is, they were not replaced by new random vectors. For the vectors with X2≠0, we calculated the ratios Z0=X1/X2, while the ratios Z1=X1/(X2+1) were calculated from all \(10^{5}\) sampled vectors. Thus, we obtained \(10^{5}-zeros\) values of Z0 and \(10^{5}\) values of Z1. From both sets we calculated the mean and variance of the sampled data. We compared these values with the predictions as described below.
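The protocol just described is compact enough to restate as code; the sketch below (in R, the language of Additional file 1, though not the published script itself) reproduces it for one setting, with our own choice of seed and n.

```r
# Sketch of the simulation protocol: 10^5 multinomial draws; draws with X2 = 0
# are counted and dropped for Z0, while Z1 is computed from all draws.
set.seed(3)
n <- 30; p <- c(0.25, 0.50, 0.25)      # one setting of the (n, p1, p2, p3) family
X     <- rmultinom(1e5, size = n, prob = p)
keep  <- X[2, ] > 0
zeros <- sum(!keep)
Z0 <- X[1, keep] / X[2, keep]
Z1 <- X[1, ] / (X[2, ] + 1)
c(zeros = zeros, mean_Z0 = mean(Z0), var_Z0 = var(Z0),
  mean_Z1 = mean(Z1), var_Z1 = var(Z1))
```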

For the mean, we compared the means of the two data sets with the Taylor-series solution given by Eq. (2) and with the modified ratio (MR) solution given by Theorem 1, with and without the correction given by Eq. (6).

For the variance, we compared the variances of the two data sets with the Taylor-series solution given by Eq. (4) and with the modified ratio solution given by Theorem 2, with and without the correction (the final formula for the corrected variance of the modified ratio is omitted due to its length; see Section 4 for calculation details). Note that for the variance given by Theorem 2, we considered the case N=5 so that its error \(O(1/n^{6})\) would not interfere with the correction.

Figure 1 shows the simulation results for the multinomial distribution given by (n=10,…,50,p1=0.25,p2=0.5,p3=0.25). The corrected modified ratio gives the best model of the mean and variance of Z0. Observe also that the uncorrected modified ratio is a very precise model of Z1.

Fig. 1

The simulation results based on the multinomial distribution given by (n,0.25,0.5,0.25), where n ranges from 10 to 50. The mean and variance of the original ratios Z0 (squares) as well as modified ratios Z1 (red circles) are compared with models: the Taylor-series model (solid line), the modified ratio model (dashed line), and the corrected modified ratio model (dash-dot line). Additionally, in the upper plot, there is also information about the variable zeros on the secondary right axis (dots). In this case, the modified ratio model outperforms the Taylor series model for Z0 data. Additionally, the uncorrected modified ratio model describes the Z1 data very well

In Fig. 2, when p2 and n are small, the discrepancy between the models and the data gets larger, although the corrected modified ratio still outperforms the Taylor-series approach. The uncorrected modified ratio is also a very good model of Z1.

Fig. 2

The simulation results based on the multinomial distribution given by (n,0.25,0.05,0.7), where n ranges from 120 to 300. For the use of symbols, see Fig. 1. Again, the modified ratio model outperforms the Taylor-series model for the Z0 data in this case, although the fit is not as close as in Fig. 1. The uncorrected modified ratio model still describes the Z1 data very well

Figures 3 and 4 further explore the limits of the presented models. In Fig. 3, we compared the performance of the variance models in three multinomial distributions (with decreasing values of p2) for various values of N from Theorem 2. Note that as N grows, the minimal value of n for which Theorem 2 holds grows as well; therefore, the variance models start from different values of n. It will be observed that all models have difficulty describing the initial part of the variance curve of the simulated data. However, one should keep in mind that the formula in Theorem 2 is only asymptotic.

Fig. 3

The simulation results based on three multinomial distributions and various values of N from Theorem 2. Displayed are the results for variance. The simulation data for original ratios Z0 (squares) are compared with models: the Taylor-series model (solid line) and the corrected modified ratio models with N=1 (dashed line), N=3 (dots), N=5 (dash-dot line). Observe that because of the condition on n in Theorem 2, the modified ratio models do not start at the same value of n for different N

Fig. 4

The results for mean for the data from Fig. 3. The simulation data for original ratios Z0 (squares) are compared with models: the Taylor-series model (solid line) and corrected modified ratio model (dashed line). Observe the uncorrected modified ratio model (dash-dot line) which exactly models the modified ratios Z1 (red circles) in all cases

In Fig. 4, we compared the models for mean on the same data as in Fig. 3. Again, for small values of n, the models fail to capture the real trend of the data. On a side note, the data for Z1 are very well described by the uncorrected modified ratio model from Theorem 1.

The supplemental material contains a script (Additional file 1) to generate similar plots for a user-specified multinomial distribution (n,p1,p2,p3) and a range of n. Given the results from the simulated data, we encourage the reader to use this script to check whether the formulae presented in this paper provide a good approximation of Z0 for his or her particular multinomial distribution.

Appendix

Proof of Theorem 1

Lemma 1

Let \(n\in \mathbb {N}\) and \(R\in \mathbb {R}\). Then it holds

$$\sum_{k=0}^{n}{\left({n \atop k}\right)R^{k}k} = nR\left(1+R\right)^{n-1}. $$

Proof

From \(\left ({n \atop k}\right)=\frac {n}{k}\left ({n-1 \atop k-1}\right)\) it directly follows that

$$\sum_{k=0}^{n}{\left({n \atop k}\right)R^{k}k} =nR\sum_{k=0}^{n-1}{\left({n-1 \atop k}\right) R^{k}} =nR(1+R)^{n-1}. $$

□
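For illustration, with n=3 and R=2 the identity reads

$$\sum_{k=0}^{3}{\left({3 \atop k}\right)2^{k}k} = 0 + 3\cdot 2\cdot 1 + 3\cdot 4\cdot 2 + 1\cdot 8\cdot 3 = 54 = 3\cdot 2\cdot (1+2)^{2}. $$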

Proof of Theorem 1

From the definition of the expected value we have

$$E(Z_{1}) = \sum_{d\in S_{Z_{1}}} pr(Z_{1} = d) \cdot d, $$

where \(S_{Z_{1}}\) is the sample space of Z1. By using

$$pr(Z_{1} = d) = \sum_{b=0}^{n}\sum_{\substack{a=0\\a/(b+1)=d}}^{n-b}\left({n \atop b}\right)\left({n-b \atop a}\right)p_{1}^{a}~ p_{2}^{b}~ (1-p_{1}-p_{2})^{n-a-b} $$

from Section 3.2, we can write

$$E(Z_{1}) = \sum_{d \in S_{Z_{1}}} \left(\sum_{b=0}^{n}\sum_{\substack{a=0\\a/(b+1)=d}}^{n-b}\left({n \atop b}\right)\left({n-b \atop a}\right)p_{1}^{a} p_{2}^{b} (1-p_{1}-p_{2})^{n-a-b}\right) d. $$

Furthermore, because \(\sum _{b=0}^{n}\sum _{a=0}^{n-b}\) enumerates all possible values of a random vector (X1,X2,X3)=(a,b,n−a−b) for the given n, it also enumerates all values of Z1 including their multiplicities (see Section 3.2). Thus, we can simplify the expression of E(Z1) into

$$E(Z_{1}) = \sum_{b=0}^{n}\sum_{a=0}^{n-b}\left({n \atop b}\right)\left({n-b \atop a}\right)p_{1}^{a} p_{2}^{b} (1-p_{1}-p_{2})^{n-a-b}\frac{a}{b+1}. $$

We rewrite this expression to separate the sums, thus obtaining

$$\begin{array}{*{20}l} E(Z_{1})=(1-p_{1}-p_{2})^{n} &\sum_{b=0}^{n}{\left({n \atop b}\right)\left(\frac{p_{2}}{1-p_{1}-p_{2}}\right)^{b}\frac{1}{b+1}}\cdot \\ \cdot&\sum_{a=0}^{n-b}{\left({n-b \atop a}\right)\left(\frac{p_{1}}{1-p_{1}-p_{2}}\right)^{a} a}. \end{array} $$
(8)

Using Lemma 1, we have for (8)

$$\begin{array}{*{20}l} \sum_{a=0}^{n-b}{\left({n-b \atop a}\right)\left(\frac{p_{1}}{1-p_{1}-p_{2}}\right)^{a} a} =(n-b)\frac{p_{1}}{1-p_{1}-p_{2}}\left(\frac{1-p_{2}}{1-p_{1}-p_{2}}\right)^{n-b-1}. \end{array} $$

By putting this back into E(Z1) and after some rearrangement of the terms, we get

$$\begin{array}{*{20}l} E(Z_{1}) = (1-p_{2})^{n} \left(\frac{p_{1}}{1-p_{2}}\right) \sum_{b=0}^{n}\left({n \atop b}\right)\left(\frac{p_{2}}{1-p_{2}}\right)^{b}\frac{n-b}{b+1}. \end{array} $$
(9)

We continue by splitting the following fraction into two terms

$$\frac{n-b}{b+1} = \frac{n+1}{b+1} - 1. $$

By this, the sum in (9) splits into two parts

$$E(Z_{1}) = A + B, $$

where

$$\begin{array}{*{20}l} A&=(1-p_{2})^{n} \left(\frac{p_{1}}{1-p_{2}}\right) \sum_{b=0}^{n}\left({n \atop b}\right)\left(\frac{p_{2}}{1-p_{2}}\right)^{b}\frac{n+1}{b+1},\\ B&=(1-p_{2})^{n} \left(\frac{p_{1}}{1-p_{2}}\right) \sum_{b=0}^{n}\left({n \atop b}\right)\left(\frac{p_{2}}{1-p_{2}}\right)^{b} (-1). \end{array} $$

With \(\left ({n \atop b}\right)\frac {n+1}{b+1}=\left ({n+1 \atop b+1}\right)\) and some rearrangement of the terms, we obtain

$$A=\frac{p_{1}}{p_{2}}\left(\frac{1}{1-p_{2}} - (1-p_{2})^{n}\right), $$

and a straightforward calculation of B yields

$$B=-\frac{p_{1}}{1-p_{2}}. $$

Finally, after putting A and B together, we get

$$E(Z_{1}) = A+B = \frac{p_{1}}{p_{2}} - \frac{p_{1}}{p_{2}}(1-p_{2})^{n} = \frac{p_{1}}{p_{2}}\left(1 - (1-p_{2})^{n}\right). $$

□

Proof of Theorem 2

The proof of Theorem 2 relies on a series of lemmas and corollaries. For better navigation through the proof, see the proof scheme in Fig. 5.

Fig. 5

Scheme of the proof of Theorem 2

Lemma 2

Let \(n\in \mathbb {N}\) and \(R\in \mathbb {R}\). Then it holds

$$\sum_{k=0}^{n}{\left({n \atop k}\right)R^{k}k^{2}} = n(n-1)R^{2}(1+R)^{n-2} + nR(1+R)^{n-1}. $$

Proof

From \(\left ({n \atop k}\right)=\frac {n}{k}\left ({n-1 \atop k-1}\right)\) and Lemma 1 it follows that

$$\begin{array}{*{20}l} \sum_{k=0}^{n}{\left({n \atop k}\right)R^{k}k^{2}} &= nR\sum_{k=0}^{n-1}{\left({n-1 \atop k}\right)R^{k}(k+1)} \\ &= nR\sum_{k=0}^{n-1}{\left({n-1 \atop k}\right)R^{k}k} + nR\sum_{k=0}^{n-1}{\left({n-1 \atop k}\right)R^{k}}\\ &= n(n-1)R^{2}(1+R)^{n-2} + nR(1+R)^{n-1}. \end{array} $$

□

Lemma 3

Let \(n\in \mathbb {N}\) and \(R\in \mathbb {R}\backslash \{0\}\). Then, for any non-negative integer N it holds

$$\sum_{b=1}^{n} \left({n \atop b}\right)\frac{R^{b}}{b} = \sum_{k=0}^{N} \left(A_{2k} - B_{2k}\right) + A_{2N+1}, $$

where

$$\begin{array}{*{20}l} A_{2k} &= \left(\prod_{i=1}^{k+1}\frac{1}{n+i}\right) \frac{k!} {R^{k+1}}\left(1+R\right)^{n+k+1},\\ B_{2k} &= \left(\prod_{i=1}^{k+1}\frac{1}{n+i}\right) \frac{k!} {R^{k+1}}\sum_{b=0}^{k+1}\left({n+k+1 \atop b}\right)R^{b},\\ A_{2k+1} &= \left(\prod_{i=1}^{k+1}\frac{1}{n+i}\right) \frac{(k+1)!}{R^{k+1}}\sum_{b=k+2}^{n+k+1}\left({n+k+1 \atop b}\right)\frac{R^{b}}{b - (k+1)}. \end{array} $$

Proof

By induction on N. Let N=0. Then, it follows

$$\sum_{b=1}^{n} \left({n \atop b}\right)\frac{R^{b}}{b} = \sum_{b=1}^{n} \left({n \atop b}\right)\frac{R^{b}}{b+1}\left(1 + \frac{1}{b}\right) = \sum_{b=1}^{n} \left({n \atop b}\right)\frac{R^{b}}{b+1} + \sum_{b=1}^{n} \left({n \atop b}\right)\frac{R^{b}}{b(b+1)}. $$

By using \(\frac {n+1}{k+1}\left ({n \atop k}\right)=\left ({n+1 \atop k+1}\right)\) and the binomial theorem, we can write

$$\begin{array}{*{20}l} \sum_{b=1}^{n} \left({n \atop b}\right)\frac{R^{b}}{b} &= \frac{1}{n+1}\frac{1}{R}\sum_{b=1}^{n} \left({n+1 \atop b+1}\right)R^{b+1} + \frac{1}{n+1}\frac{1}{R}\sum_{b=1}^{n} \left({n+1 \atop b+1}\right)\frac{R^{b+1}}{(b+1) - 1} \\ &=\frac{1}{n+1}\frac{1}{R}\sum_{b=2}^{n+1} \left({n+1 \atop b}\right)R^{b} + \frac{1}{n+1}\frac{1}{R}\sum_{b=2}^{n+1} \left({n+1 \atop b}\right)\frac{R^{b}}{b - 1} \\ &=A_{0} - B_{0} + A_{1}. \end{array} $$

The base of the induction holds. Assume that the lemma holds up to some natural N. We prove that it holds for N+1 as well. Consider the term A2N+1. We have

$$\begin{array}{*{20}l} A_{2N+1} &= \left(\prod_{i=1}^{N+1}\frac{1}{n+i}\right) \frac{(N+1)!}{R^{N+1}}\sum_{b=N+2}^{n+N+1}\left({n+N+1\atop b}\right)\frac{R^{b}}{b+1}\left(1 + \frac{N+2}{b-(N+1)}\right) \\ &= X_{1} + X_{2}, \end{array} $$

where

$$\begin{array}{*{20}l} X_{1} &= \left(\prod_{i=1}^{N+1}\frac{1}{n+i}\right) \frac{(N+1)!}{R^{N+1}}\sum_{b=N+2}^{n+N+1}\left({n+N+1 \atop b}\right)\frac{R^{b}}{b+1},\\ X_{2} &= \left(\prod_{i=1}^{N+1}\frac{1}{n+i}\right) \frac{(N+1)!}{R^{N+1}}\sum_{b=N+2}^{n+N+1}\left({n+N+1 \atop b}\right)\frac{R^{b}}{b+1}\frac{N+2}{b-(N+1)}. \end{array} $$

Furthermore, by the same trick with the binomial coefficient as above, we rewrite the terms X1 and X2 as

$$\begin{array}{*{20}l} {}X_{1} &= \left(\prod_{i=1}^{N+1}\frac{1}{n+i}\right) \frac{(N+1)!}{R^{N+1}} \frac{1}{n+N+2}\frac{1}{R} \sum_{b=N+2}^{n+N+1}\left({n+N+2 \atop b+1}\right)R^{b+1},\\ {}X_{2} &= \left(\prod_{i=1}^{N+1}\frac{1}{n+i}\right) \frac{(N+1)!}{R^{N+1}} \frac{1}{n+N+2}\frac{1}{R} \sum_{b=N+2}^{n+N+1}\left({n+N+2 \atop b+1}\right)\frac{R^{b+1}(N+2)}{(b+1) - 1 -(N+1)}. \end{array} $$

After some rearrangement, we finally get (again using the binomial theorem)

$$\begin{array}{*{20}l} X_{1} &= \left(\prod_{i=1}^{N+2}\frac{1}{n+i}\right) \frac{(N+1)!}{R^{N+2}} \sum_{b=N+3}^{n+N+2}\left({n+N+2 \atop b}\right)R^{b} = A_{2(N+1)} - B_{2(N+1)},\\ X_{2} &= \left(\prod_{i=1}^{N+2}\frac{1}{n+i}\right) \frac{(N+2)!}{R^{N+2}} \sum_{b=N+3}^{n+N+2}\left({n+N+2 \atop b}\right)\frac{R^{b}}{b -(N+2)} = A_{2(N+1) + 1}. \end{array} $$

□
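Because the statement of Lemma 3 is somewhat involved, a direct numerical check may be reassuring. The sketch below (in R; the values of n, R and N are illustrative) evaluates both sides of the identity; since the identity is exact, every choice of N returns the same value.

```r
# Sketch: numerical check of Lemma 3 (original n-version, not Remark 1).
lemma3_rhs <- function(n, R, N) {
  A2k  <- function(k) prod(1 / (n + 1:(k + 1))) * factorial(k) / R^(k + 1) *
                      (1 + R)^(n + k + 1)
  B2k  <- function(k) prod(1 / (n + 1:(k + 1))) * factorial(k) / R^(k + 1) *
                      sum(choose(n + k + 1, 0:(k + 1)) * R^(0:(k + 1)))
  A2k1 <- function(k) {                        # the term A_{2k+1}
    b <- (k + 2):(n + k + 1)
    prod(1 / (n + 1:(k + 1))) * factorial(k + 1) / R^(k + 1) *
      sum(choose(n + k + 1, b) * R^b / (b - (k + 1)))
  }
  sum(sapply(0:N, function(k) A2k(k) - B2k(k))) + A2k1(N)
}
n <- 12; R <- 0.7
c(lhs    = sum(choose(n, 1:n) * R^(1:n) / (1:n)),
  rhs_N0 = lemma3_rhs(n, R, 0),
  rhs_N3 = lemma3_rhs(n, R, 3))                # all three values coincide
```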

Remark 1

We will often use Lemma 3 with n+1 in place of n. Therefore, we restate Lemma 3 with this change. Let \(n\in \mathbb {N}\) and \(R\in \mathbb {R}\backslash \{0\}\). Then, for any non-negative integer N it holds

$$\sum_{b=1}^{n+1} \left({n+1 \atop b}\right)\frac{R^{b}}{b} = \sum_{k=0}^{N}\left(A_{2k} - B_{2k}\right) + A_{2N+1}, $$

where

$$\begin{array}{*{20}l} A_{2k} &= \left(\prod_{i=2}^{k+2}\frac{1}{n+i}\right) \frac{k!} {R^{k+1}}\left(1 + R\right)^{n+k+2},\\ B_{2k} &= \left(\prod_{i=2}^{k+2}\frac{1}{n+i}\right) \frac{k!} {R^{k+1}}\sum_{b=0}^{k+1}\left({n+k+2 \atop b}\right)R^{b},\\ A_{2k+1} &= \left(\prod_{i=2}^{k+2}\frac{1}{n+i}\right) \frac{(k+1)!}{R^{k+1}}\sum_{b=k+2}^{n+k+2}\left({n+k+2 \atop b}\right)\frac{R^{b}}{b - (k+1)}. \end{array} $$

Lemma 4

Let p1,p2∈(0,1) be some real constants. Let k,n be some non-zero natural numbers. Let A2k+1 be the term from Remark 1. Furthermore, let R=p2/(1−p2), and let

$$\begin{array}{*{20}l} A &= (n+1)n\left(\frac{p_{1}}{1-p_{2}}\right)^{2} + (n+1)\frac{p_{1}}{1-p_{2}},\\ D &= \frac{(1-p_{2})^{n}}{n+1}\frac{1-p_{2}}{p_{2}}. \end{array} $$

Then, for α∈[1,k+2], it holds

$$ADA_{2k+1} \leq \alpha\frac{n}{(k+2)\left({n+k+3 \atop k+2}\right)}\frac{p_{1}}{p_{2}^{k+3}(1-p_{2})}\left(\frac{p_{1}}{1-p_{2}} + \frac{1}{n}\right) = O\left(\frac{1}{n^{k+1}}\right). $$

Proof

First of all, for α∈[1,k+2] we have

$$A_{2k+1} = \alpha \left(\prod_{i=2}^{k+3}\frac{1}{n+i}\right) \frac{(k+1)!}{R^{k+2}} \sum_{b=k+3}^{n+k+3}\left({n+k+3 \atop b}\right)R^{b}. $$

This follows easily by applying the inequality

$$\frac{k+2}{b+1} \geq \frac{1}{b - (k+1)} \geq \frac{1}{b+1} $$

to the term A2k+1 from Remark 1, which holds for any natural b,k except for pairs b=k+1 (in our case b>k+1). We can see this by solving the inequality

$$\frac{1+x}{b+1}\geq\frac{1}{b - (k+1)} $$

for x. By this, we get an upper and a lower bound on the term A2k+1, which differ by a multiplicative constant k+2. Finally, the lemma follows by extending the summation over the index b in the term A2k+1 to the full range from 0 to n+k+3, applying the binomial theorem, and some simple rearrangement of the terms. The O bound follows from the fact that \(\left ({n \atop k}\right)\geq \left (\frac {n}{k}\right)^{k}\). □

Lemma 5

Let p1,p2∈(0,1) be some real constants. Let k,n be some non-zero natural numbers. Let A2k be the term from Remark 1. Furthermore, let R=p2/(1−p2), and let

$$\begin{array}{*{20}l} A &= (n+1)n\left(\frac{p_{1}}{1-p_{2}}\right)^{2} + (n+1)\frac{p_{1}}{1-p_{2}},\\ D &= \frac{(1-p_{2})^{n}}{n+1}\frac{1-p_{2}}{p_{2}}. \end{array} $$

Then, it holds

$$ADA_{2k} = \frac{\left(\frac{p_{1}}{p_{2}(1-p_{2})}\right)^{2}}{\left({n+k+1 \atop k}\right) p_{2}^{k}}\left(1 - \frac{k+2 - \frac{1-p_{2}}{p_{1}}}{n+k+2}\right). $$

Proof

The lemma follows easily by a straightforward multiplication of the terms A, D and A2k, and some rearrangement of the terms. □

The following lemma is an extension of one borrowed from Graham et al. (1994).

Lemma 6

Let 0<α<R/(1+R) for some real R>0. Then, it holds

$$\sum_{k\leq\alpha n} \left({n \atop k}\right) R^{k} = R^{m} 2^{nH(\alpha) - \frac{1}{2}\lg n + O(1)}, $$

where m=⌊αn⌋ and

$$H(\alpha) = \alpha\lg\frac{1}{\alpha} + (1-\alpha)\lg\frac{1}{1-\alpha}. $$

Proof

First of all, we have

$$\frac{\left({n \atop k-1}\right)R^{k-1}}{\left({n \atop k}\right)R^{k}} = \frac{k}{n-k+1}\frac{1}{R} \leq \frac{\alpha n}{n-\alpha n + 1}\frac{1}{R}< \frac{\alpha}{1-\alpha}\frac{1}{R}. $$

Let m=⌊αn⌋=αn−ε. It holds

$$\begin{array}{*{20}l} \left({n \atop m}\right)R^{m} < \sum_{k\leq m}\left({n \atop k}\right)R^{k} &< \left({n \atop m}\right)R^{m}\left(1 + \frac{\alpha}{1 - \alpha}\frac{1}{R} + \left(\frac{\alpha}{1-\alpha}\frac{1}{R}\right)^{2} + \ldots\right) \\ &= \left({n \atop m}\right)R^{m} \frac{(1-\alpha)R}{(1-\alpha)R - \alpha} \end{array} $$

because

$$\frac{\alpha}{1-\alpha}\frac{1}{R}<1, $$

which follows from α<R/(1+R). Thus,

$$\sum_{k\leq m}\left({n \atop k}\right)R^{k} = \left({n \atop m}\right)R^{m} O(1). $$

By Stirling’s approximation, we have

$$\begin{array}{*{20}l} {}\log \left({n \atop m}\right) &= -\frac{1}{2}\log n - (\alpha n - \epsilon)\log\left(\alpha - \frac{\epsilon}{n}\right) - \left((1-\alpha)n + \epsilon\right)\log\left(1-\alpha + \frac{\epsilon}{n}\right) + O(1) \\ &= -\frac{1}{2}\log n - n \alpha \log \alpha - n(1-\alpha)\log(1- \alpha) + O(1), \end{array} $$

and the lemma follows. □

Lemma 7

Let p1,p2∈(0,1) be some real constants. Let k,n be some non-zero natural numbers such that

$$n>\frac{1-p_{2}}{p_{2}}k + \frac{1-2p_{2}}{p_{2}}. $$

Let B2k be the term from Remark 1. Furthermore, let R=p2/(1−p2), and let

$$\begin{array}{*{20}l} A &= (n+1)n\left(\frac{p_{1}}{1-p_{2}}\right)^{2} + (n+1)\frac{p_{1}}{1-p_{2}},\\ D &= \frac{(1-p_{2})^{n}}{n+1}\frac{1-p_{2}}{p_{2}}. \end{array} $$

Then, it holds

$${}ADB_{2k} = n(1-p_{2})^{n} \frac{p_{1}}{p_{2}(1-p_{2})}\left(p_{1} + \frac{1-p_{2}}{n}\right)\frac{2^{k+1}O(1)}{(k+1)(n+k+2)^{\frac{1}{2}}} = O\left(n^{\frac{1}{2}}(1-p_{2})^{n}\right). $$

Proof

Let α=(k+1)/(n+k+2). One can easily verify that α<R/(1+R)=p2 because of the choice of n. Thus, we can apply Lemma 6 to the sum from the term B2k. From this, it follows that

$$ {}\sum_{b=0}^{k+1}\left({n+k+2 \atop b}\right)\left(\frac{p_{2}}{1-p_{2}}\right)^{b} = \left(\frac{p_{2}}{1-p_{2}}\right)^{k+1}2^{(n+k+2)H(\alpha) - \frac{1}{2}\lg (n+k+2) + O(1)}, $$
(10)

where

$$H(\alpha) = \alpha\lg\frac{1}{\alpha} + (1-\alpha)\lg\frac{1}{1-\alpha}. $$

Moreover, for H(α) we have

$$H(\alpha) = \frac{k+1}{n+k+2}\lg \left(\frac{2(n+k+2)}{k+1}\right) - O\left(\frac{1}{n^{2}}\right), $$

which follows from

$$\lg (1-\alpha) = -\sum_{i=1}^{\infty} \frac{\alpha^{i}}{i}. $$

Plugging this into (10), we get

$$\sum_{b=0}^{k+1}\left({n+k+2 \atop b}\right)\left(\frac{p_{2}}{1-p_{2}}\right)^{b} = \left(\frac{p_{2}}{1-p_{2}}\right)^{k+1} \frac{\left(\frac{2(n+k+2)}{k+1}\right)^{k+1}O(1)}{(n+k+2)^{\frac{1}{2}}}. $$

With this, we can write for the whole B2k term from Remark 1

$$ {}B_{2k} = \frac{\left(\frac{2(n+k+2)}{k+1}\right)^{k+1}O(1)}{\left({n+k+1 \atop k}\right)(n+k+2)^{\frac{3}{2}}} \leq \frac{2^{k+1}\left({n+k+2 \atop k+1}\right)O(1)}{\left({n+k+1 \atop k}\right)(n+k+2)^{\frac{3}{2}}} = \frac{2^{k+1}O(1)}{(k+1)(n+k+2)^{\frac{1}{2}}} $$
(11)

because \(\left (\frac {n}{k}\right)^{k}\leq \left ({n \atop k}\right)\). Similarly, with \(\left ({n \atop k}\right)<\left (\frac {ne}{k}\right)^{k}\), we have for B2k

$$B_{2k} \geq \frac{\left(\frac{2}{e}\right)^{k+1}O(1)}{(k+1)(n+k+2)^{\frac{1}{2}}}, $$

if we use

$$\left({n+k+1 \atop k}\right)(n+k+2)^{\frac{3}{2}}=\left({n+k+2 \atop k+1}\right)(k+1)(n+k+2)^{\frac{1}{2}}. $$

Thus, we have

$$B_{2k} = \frac{2^{k+1}O(1)}{(k+1)(n+k+2)^{\frac{1}{2}}}, $$

and the lemma easily follows by multiplying B2k by the term AD. □

Corollary 2

Let p1,p2∈(0,1) be some real constants. Let n,N be some non-zero natural numbers such that

$$n>\frac{1-p_{2}}{p_{2}}N + \frac{1-2p_{2}}{p_{2}}. $$

Let A2k,B2k, k=0,...,N, and A2N+1 be terms from Remark 1. Furthermore, let R=p2/(1−p2), and let

$$\begin{array}{*{20}l} A &= (n+1)n\left(\frac{p_{1}}{1-p_{2}}\right)^{2} + (n+1)\frac{p_{1}}{1-p_{2}},\\ D &= \frac{(1-p_{2})^{n}}{n+1}\frac{1-p_{2}}{p_{2}}. \end{array} $$

Then, it holds

$${}AD\sum_{b=1}^{n+1}\left({n+1 \atop b}\right)\left(\frac{p_{2}}{1-p_{2}}\right)^{b}\frac{1}{b} = \left(\frac{p_{1}}{p_{2}(1-p_{2})}\right)^{2}\sum_{k=0}^{N} \frac{1-\frac{k+2 - \frac{1-p_{2}}{p_{1}}}{n+k+2}}{\left({n+k+1 \atop k}\right)p_{2}^{k}} + O\left(\frac{1}{n^{N+1}}\right). $$

Proof

Follows from Lemmas 4, 5 and 7. □

Lemma 8

Let p1,p2∈(0,1) be some real constants and n some non-zero natural number. Let

$$\begin{array}{*{20}l} B &= (2n+1)\left(\frac{p_{1}}{1-p_{2}}\right)^{2} + \frac{p_{1}}{1-p_{2}},\\ D &= \frac{(1-p_{2})^{n}}{n+1}\frac{1-p_{2}}{p_{2}}. \end{array} $$

Then, it holds

$$\begin{array}{*{20}l} {}BD\sum_{b=1}^{n+1}\left({n+1 \atop b}\right)\left(\frac{p_{2}}{1-p_{2}}\right)^{b} &= 2\left(\frac{p_{1}}{1-p_{2}}\right)^{2}\frac{1}{p_{2}} + \frac{1}{n+1}\frac{p_{1}}{p_{2}(1-p_{2})}\left(1 - \frac{p_{1}}{1-p_{2}}\right) \\ &+ O\left((1-p_{2})^{n}\right). \end{array} $$

Proof

Straightforward by the binomial theorem. □

Lemma 9

Let p1,p2∈(0,1) be some real constants and n some non-zero natural number. Let

$$\begin{array}{*{20}l} C &= \left(\frac{p_{1}}{1-p_{2}}\right)^{2},\\ D &= \frac{(1-p_{2})^{n}}{n+1}\frac{1-p_{2}}{p_{2}}. \end{array} $$

Then, it holds

$$CD\sum_{b=1}^{n+1}\left({n+1 \atop b}\right)\left(\frac{p_{2}}{1-p_{2}}\right)^{b} b= \left(\frac{p_{1}}{1-p_{2}}\right)^{2}. $$

Proof

Straightforward by Lemma 1 and the binomial theorem. □

Proof of Theorem 2

The variance of the random variable Z1 can be calculated as

$$var(Z_{1})=E(Z_{1}^{2})-E^{2}(Z_{1}). $$

By Theorem 1, we have

$$E(Z_{1})=\frac{p_{1}}{p_{2}}\left(1-(1-p_{2})^{n}\right). $$

So, we only need to determine the value of \(E\left (Z_{1}^{2}\right)\). From the definition of the expected value, we have

$${}E\left(Z_{1}^{2}\right) = \sum_{b=0}^{n}\sum_{a=0}^{n-b}\left({n \atop b}\right)\left({n-b \atop a}\right)p_{1}^{a} p_{2}^{b} (1-p_{1}-p_{2})^{n-a-b}\left(\frac{a}{b+1}\right)^{2} =(1-p_{1}-p_{2})^{n} V_{1} V_{2}, $$

where

$$\begin{array}{*{20}l} V_{1} &= \sum_{b=0}^{n}{\left({n \atop b}\right)\left(\frac{p_{2}}{1-p_{1}-p_{2}}\right)^{b}\left(\frac{1}{b+1}\right)^{2}},\\ V_{2} &= \sum_{a=0}^{n-b}{\left({n-b \atop a}\right)\left(\frac{p_{1}}{1-p_{1}-p_{2}}\right)^{a} a^{2}}. \end{array} $$

By application of Lemma 2 to V2, we obtain

$$\begin{array}{*{20}l} E(Z_{1}^{2}) &= (1-p_{2})^{n} \sum_{b=0}^{n}{\left({n \atop b}\right)\left(\frac{p_{2}}{1-p_{2}}\right)^{b}\left(\frac{1}{b+1}\right)^{2}} W,\\ W&=(n-b)(n-b-1)\left(\frac{p_{1}}{1-p_{2}}\right)^{2} + (n-b)\frac{p_{1}}{1-p_{2}}. \end{array} $$

By using the equality

$$\left({n \atop b}\right)\left(\frac{1}{b+1}\right)^{2} = \left({n+1 \atop b+1}\right)\frac{1}{n+1}\frac{1}{b+1} $$

and adjustment of the summation borders, we get

$$\begin{array}{*{20}l} E\left(Z_{1}^{2}\right) &= \frac{(1-p_{2})^{n}}{n+1}\cdot\frac{1-p_{2}}{p_{2}} \cdot\sum_{b=1}^{n+1} {\left({n+1 \atop b}\right)\left(\frac{p_{2}}{1-p_{2}}\right)^{b}\frac{1}{b}} W,\\ W&=(n-b+1)(n-b)\left(\frac{p_{1}}{1-p_{2}}\right)^{2} + (n-b+1)\frac{p_{1}}{1-p_{2}}. \end{array} $$

Next, we split the term W according to powers of b, thus obtaining

$$W = A - Bb + Cb^{2}, $$

where

$$\begin{array}{*{20}l} A&= (n+1)n\left(\frac{p_{1}}{1-p_{2}}\right)^{2} + (n+1)\frac{p_{1}}{1-p_{2}},\\ B&= (2n+1)\left(\frac{p_{1}}{1-p_{2}}\right)^{2} + \frac{p_{1}}{1-p_{2}},\\ C&= \left(\frac{p_{1}}{1-p_{2}}\right)^{2}. \end{array} $$

If we set

$$D=\frac{(1-p_{2})^{n}}{n+1}\cdot\frac{1-p_{2}}{p_{2}}, $$

then we can write

$$E\left(Z_{1}^{2}\right) = D\sum_{b=1}^{n+1} \left({n+1 \atop b}\right) \left(\frac{p_{2}}{1-p_{2}}\right)^{b}\left(\frac{A}{b} - B + Cb\right) = S_{1} + S_{2} + S_{3}, $$

where

$$\begin{array}{*{20}l} S_{1} &= AD\sum_{b=1}^{n+1} \left({n+1 \atop b}\right) \left(\frac{p_{2}}{1-p_{2}}\right)^{b}\frac{1}{b},\\ S_{2} &= -BD\sum_{b=1}^{n+1} \left({n+1 \atop b}\right) \left(\frac{p_{2}}{1-p_{2}}\right)^{b},\\ S_{3} &= CD\sum_{b=1}^{n+1} \left({n+1 \atop b}\right) \left(\frac{p_{2}}{1-p_{2}}\right)^{b} b, \end{array} $$

and by Corollary 2 (S1) and Lemmas 8 (S2) and 9 (S3) we get

$$\begin{array}{*{20}l} E(Z_{1}^{2}) =&\ \sum_{k=0}^{N} \left(\frac{p_{1}}{p_{2}(1-p_{2})}\right)^{2} \frac{1}{\left({n+k+1 \atop k}\right)p_{2}^{k}}\left(1-\frac{k+2 - \frac{1-p_{2}}{p_{1}}}{n+k+2}\right) -\\ &- 2\left(\frac{p_{1}}{1-p_{2}}\right)^{2}\frac{1}{p_{2}} - \frac{1}{n+1}\frac{p_{1}}{p_{2}(1-p_{2})}\left(1 - \frac{p_{1}}{1-p_{2}}\right) \\ &+ \left(\frac{p_{1}}{1-p_{2}}\right)^{2} + O\left(\frac{1}{n^{N+1}}\right). \end{array} $$

The rest of the proof follows from adding the term −E2(Z1) to the derived expression for \(E\left (Z_{1}^{2}\right)\), separating the term for k=0 from the rest of the sum, and simple rearrangement of the resulting terms. □

References

  • Aho, K, Bowyer, RT: Confidence intervals for ratios of proportions: implications for selection ratios. Methods Ecol. Evol. 6(2), 121–132 (2015).


  • Alghamdi, N: Confidence intervals for ratios of multinomial proportions (2015). Master’s thesis, University of Nebraska at Omaha.

  • Basu, A, Lochner, RH: On the distribution of the ratio of two random variables having generalized life distributions. Technometrics. 13(2), 281–287 (1971).


  • Bonett, DG, Price, RM: Confidence intervals for a ratio of binomial proportions based on paired data. Stat. Med. 25(17), 3039–3047 (2006).


  • Chiu, RW, Chan, KA, Gao, Y, Lau, VY, Zheng, W, Leung, TY, Foo, CH, Xie, B, Tsui, NB, Lun, FM, et al: Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma. Proc. Natl. Acad. Sci. 105(51), 20458–20463 (2008).


  • Culley, TM, Wallace, LE, Gengler-Nowak, KM, Crawford, DJ: A comparison of two methods of calculating GST, a genetic measure of population differentiation. Am. J. Bot. 89(3), 460–465 (2002).


  • Fieller, E: The distribution of the index in a normal bivariate population. Biometrika. 24, 428–440 (1932).


  • Frishman, F: On the arithmetic means and variances of products and ratios of random variables (1971). Technical report, DTIC Document.

  • Geary, R: The frequency distribution of the quotient of two normal variates. J. R. Stat. Soc. 93(3), 442–446 (1930).


  • Goodman, LA: On simultaneous confidence intervals for multinomial proportions. Technometrics. 7(2), 247–254 (1965).


  • Graham, RL, Knuth, DE, Patashnik, O: Concrete Mathematics: A Foundation for Computer Science, 2nd edn, p. 492. Addison-Wesley Longman Publishing Co., Inc., Boston (1994). exercise 42.


  • Hinkley, DV: On the ratio of two correlated normal random variables. Biometrika. 56(3), 635–639 (1969).


  • Koopman, P: Confidence intervals for the ratio of two binomial proportions. Biometrics. 40, 513–517 (1984).


  • Korhonen, PJ, Narula, SC: The probability distribution of the ratio of the absolute values of two normal variables. J. Stat. Comput. Simul. 33(3), 173–182 (1989).


  • Lau, TK, Chen, F, Pan, X, Pooh, RK, Jiang, F, Li, Y, Jiang, H, Li, X, Chen, S, Zhang, X: Noninvasive prenatal diagnosis of common fetal chromosomal aneuploidies by maternal plasma DNA sequencing. J. Matern. Fetal Neonatal Med. 25(8), 1370–1374 (2012).


  • Marsaglia, G: Ratios of normal variables. J. Stat. Softw. 16(4), 1–10 (2006).


  • Mekic, E, Sekulovic, N, Bandjur, M, Stefanovic, M, Spalevic, P: The distribution of ratio of random variable and product of two random variables and its application in performance analysis of multi-hop relaying communications over fading channels. Przegl. Elektrotechniczny. 88(7A), 133–137 (2012).


  • Minarik, G, Repiska, G, Hyblova, M, Nagyova, E, Soltys, K, Budis, J, Duris, F, Sysak, R, Bujalkova, MG, Vlkova-Izrael, B, et al: Utilization of benchtop next generation sequencing platforms Ion Torrent PGM and MiSeq in noninvasive prenatal testing for chromosome 21 trisomy and testing of impact of in silico and physical size selection on its analytical performance. PLoS ONE. 10(12), 0144811 (2015).


  • Nadarajah, S: On the product and ratio of Laplace and Bessel random variables. J. Appl. Math. 2005(4), 393–402 (2005).


  • Nadarajah, S, Kotz, S: On the ratio of Pearson type VII and Bessel random variables. Adv. Decis. Sci. 2005(4), 191–199 (2005).


  • Nadarajah, S, Kotz, S: On the product and ratio of gamma and Weibull random variables. Econ. Theory. 22(2), 338–344 (2006).


  • Nelson, W: Statistical methods for the ratio of two multinomial proportions. Am. Stat. 26(3), 22–27 (1972).


  • Pham-Gia, T: Distributions of the ratios of independent beta variables and applications. Commun. Stat. Theory Methods. 29(12), 2693–2715 (2000).


  • Piegorsch, WW, Richwine, KA: Large-sample pairwise comparisons among multinomial proportions with an application to analysis of mutant spectra. J. Agric. Biol. Environ. Stat. 6(3), 305–325 (2001).


  • Piper, J, Rutovitz, D, Sudar, D, Kallioniemi, A, Kallioniemi, O-P, Waldman, FM, Gray, JW, Pinkel, D: Computer image analysis of comparative genomic hybridization. Cytometry. 19(1), 10–26 (1995).


  • Poorter, H, Garnier, E: Plant growth analysis: an evaluation of experimental design and computational methods. J. Exp. Bot. 47(9), 1343–1351 (1996).


  • Press, SJ: The t-ratio distribution. J. Am. Stat. Assoc. 64(325), 242–252 (1969).


  • Price, RM, Bonett, DG: Confidence intervals for a ratio of two independent binomial proportions. Stat. Med. 27(26), 5497–5508 (2008).


  • Provost, S: On the distribution of the ratio of powers of sums of gamma random variables. Pak. J. Stat. 5, 157–174 (1989).


  • Quesenberry, CP, Hurst, D: Large sample simultaneous confidence intervals for multinomial proportions. Technometrics. 6(2), 191–195 (1964).


  • Sakamoto, H: On the distributions of the product and the quotient of the independent and uniformly distributed random variables. Tohoku Math. J. First Ser. 49, 243–260 (1943).


  • Sehnert, AJ, Rhees, B, Comstock, D, de Feo, E, Heilek, G, Burke, J, Rava, RP: Optimal detection of fetal chromosomal abnormalities by massively parallel DNA sequencing of cell-free fetal DNA from maternal blood. Clin. Chem. 57(7), 1042–1049 (2011).


  • Van Kempen, G, Van Vliet, L: Mean and variance of ratio estimators used in fluorescence ratio imaging. Cytometry. 39(4), 300–305 (2000).



Acknowledgements

This contribution is the result of the implementation of the project REVOGENE - Research centre for molecular genetics (ITMS 26240220067), supported by the Research & Development Operational Programme funded by the European Regional Development Fund.

Author information



Contributions

All authors contributed equally to the research. FD wrote the manuscript. JG prepared the figures. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Frantisek Duris.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional file

Additional file 1

A script written in the R language to perform custom numerical simulations and produce graphical output. (R 10 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article


Cite this article

Duris, F., Gazdarica, J., Gazdaricova, I. et al. Mean and variance of ratios of proportions from categories of a multinomial distribution. J Stat Distrib App 5, 2 (2018). https://doi.org/10.1186/s40488-018-0083-x


