
A flexible multivariate model for high-dimensional correlated count data

Abstract

We propose a flexible multivariate stochastic model for over-dispersed count data. Our methodology is built upon mixed Poisson random vectors (Y1,…,Yd), where the {Yi} are conditionally independent Poisson random variables. The stochastic rates of the {Yi} are multivariate distributions with arbitrary non-negative margins linked by a copula function. We present basic properties of these mixed Poisson multivariate distributions and provide several examples. A particular case with geometric and negative binomial marginal distributions is studied in detail. We illustrate an application of our model by conducting a high-dimensional simulation motivated by RNA-sequencing data.

Introduction

As multivariate count data become increasingly common across many scientific disciplines, including economics, finance, geosciences, biology, marketing, and others, there is a growing need for flexible families of multivariate distributions with discrete margins. In particular, flexible models with correlated classical marginal distributions are in high demand in many different applied areas (see, e.g., Barbiero and Ferrari (2017); Madsen and Birkes (2013); Madsen and Dalthorp (2007); Nikoloulopoulos and Karlis (2009); Xiao (2017)). With this in mind, we propose a general method of constructing discrete multivariate distributions with certain common marginal distributions. One important example of this construction is a discrete multivariate model with correlated negative binomial (NB) components and arbitrary parameters. However, our approach is quite general and can produce families with different margins, going beyond the NB case.

One way to generate multivariate distributions with particular margins is an approach through copulas (see, e.g., Nelsen (2006)), and multivariate discrete distributions constructed through this method have been proposed in recent years (see, e.g., Barbiero and Ferrari (2017); Madsen and Birkes (2013); Nikoloulopoulos (2013); Nikoloulopoulos and Karlis (2009); Xiao (2017) and references therein). Recall that a copula is a cumulative distribution function (CDF) on [0,1]d, describing a random vector with standard uniform margins. Moreover, for any random vector X=(X1,…,Xd) with the joint CDF F and marginal CDFs Fi there is a copula function C(u1,…,ud) so that

$$ {}F(x_{1}, \ldots,x_{d}) = \mathbb{P}(X_{1}\leq x_{1}, \ldots,X_{d}\leq x_{d}) = C(F_{1}(x_{1}), \ldots, F_{d}(x_{d})), \,\,\, x_{i}\in \mathbb{R}, i=1,\ldots,d. $$
(1)

Further, for continuous distributions with marginal probability density functions (PDFs) fi(x)=Fi′(x), the copula function C is unique, and the joint PDF of the {Xi} is given by

$$ f(x_{1}, \ldots,x_{d}) = \left\{ \prod_{i=1}^{d} f_{i}(x_{i})\right\} c(F_{1}(x_{1}), \ldots, F_{d}(x_{d})), \,\,\, x_{i}\in \mathbb{R}, i=1,\ldots,d, $$
(2)

where the function c(u1,…,ud) is the PDF corresponding to the copula C(u1,…,ud). However, for discrete distributions, the copula is no longer unique and there is no analogue of (2) for calculating the relevant probabilities. Using this concept, one can define a random vector Y=(Y1,…,Yd) in \(\mathbb {R}^{d}\) with arbitrary marginal CDFs Fi viz.

$$ (Y_{1}, \ldots, Y_{d}) = \left(F_{1}^{-1}(U_{1}), \ldots,F_{d}^{-1}(U_{d})\right), $$
(3)

where U=(U1,…,Ud) is a random vector with standard uniform margins and the CDF given by

$$ {}F_{\mathbf{U}}(u_{1}, \ldots, u_{n}) = \mathbb{P}(U_{1}\leq u_{1}, \ldots, U_{d}\leq u_{d}) = C(u_{1}, \ldots, u_{d}), \,\,\, {(u_{1}, \ldots,u_{d})}^{\top} \in [0,1]^{d}, $$
(4)

with a particular copula C. While one can use any of the multitude of different copula functions in this construction, an approach based on Gaussian copula, known as NORTA (NORmal To Anything, see, e.g., Chen (2001); Song and Hsiao (1993)), is especially popular due to its flexibility, particularly in the case of discrete distributions (see, e.g., Barbiero and Ferrari (2017); Madsen and Birkes (2013); Nikoloulopoulos (2013)).
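
As a concrete illustration, the NORTA construction (3)-(4) takes only a few lines; the sketch below assumes numpy and scipy are available, and the negative binomial target margins and the correlation matrix R are arbitrary illustrative choices, not taken from the text.

```python
# Sketch of the NORTA construction (3)-(4): correlated standard uniforms
# from a Gaussian copula, pushed through inverse CDFs of the target margins.
import numpy as np
from scipy.stats import norm, nbinom

rng = np.random.default_rng(0)
R = np.array([[1.0, 0.6],
              [0.6, 1.0]])              # copula correlation matrix in (4)
L = np.linalg.cholesky(R)

n = 100_000
Z = rng.standard_normal((n, 2)) @ L.T   # Z ~ N_2(0, R)
U = norm.cdf(Z)                         # uniform margins with CDF C in (4)
# Step (3): componentwise inverse CDFs; here NB(5, 0.5) and NB(2, 0.3)
Y = np.column_stack([nbinom.ppf(U[:, 0], 5, 0.5),
                     nbinom.ppf(U[:, 1], 2, 0.3)])

mean0 = Y[:, 0].mean()          # target NB mean: r(1-p)/p = 5
corr = np.corrcoef(Y.T)[0, 1]   # positive dependence inherited from R
```

Note that, as discussed above, for discrete margins the copula of the resulting vector is not unique.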

While our approach involves copulas as well, they are applied to continuous multivariate distributions rather than discrete ones, which avoids the issue of non-uniqueness of the copula function. Additionally, compared with the direct approach (3), in our scheme the computation of relevant probabilities is straightforward. Our methodology is based on mixtures of Poisson distributions, which is a common way of obtaining discrete analogs of continuous distributions on the nonnegative reals with a particular stochastic interpretation. Indeed, discrete univariate mixed Poisson distributions have proven to be useful stochastic models in many scientific fields (see, e.g., Karlis and Xekalaki (2005), where one can find a comprehensive review of these distributions with over 30 particular examples). This construction can be described through a randomly stopped Poisson process. More precisely, let \(\{N(t), t\in \mathbb {R}_{+}\}\) be a homogeneous Poisson process with rate λ>0, so that the marginal distribution of N(t) is Poisson with parameter (mean) λt. Then, for any random variable T with CDF FT, supported on \(\mathbb {R}_{+}\), the quantity Y=N(T) is an integer-valued random variable, with distribution determined via a standard conditioning argument as follows:

$$ \mathbb{P}(Y=n) = \int_{\mathbb{R}_{+}} \frac{e^{-\lambda t}(\lambda t)^{n}}{n!}dF_{T}(t), \,\,\, n\in \mathbb{N}_{0}=\{0, 1, \ldots \}. $$
(5)

Many standard probability distributions on \(\mathbb {N}_{0}\) arise from this scheme. In particular, if T has a standard gamma distribution with shape parameter r>0, given by the PDF

$$ f(x) = \frac{1}{\Gamma(r)}x^{r-1}e^{-x}, \,\,\, x\in \mathbb{R}_{+}, $$
(6)

then Y=N(T) will have a NB distribution NB(r,p) with the probability mass function (PMF)

$$ \mathbb{P}(Y=n) = \frac{\Gamma(n+r)}{\Gamma(r)n!} p^{r}(1-p)^{n}, \,\,\, n\in \mathbb{N}_{0}, $$
(7)

where p=1/(1+λ) (see Section 3.2 in the Appendix). As the NB model is quite important across many applications and can be extended to more general stochastic processes (see, e.g., Kozubowski and Podgórski (2009)), it shall serve as a basic example of our approach.
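
The gamma-to-NB relation (5)-(7) is easy to verify by simulation; a minimal check, assuming numpy and scipy are available, with illustrative parameter values:

```python
# Monte Carlo check of (5)-(7): a rate-λ Poisson process stopped at a
# standard gamma time T with shape r yields Y ~ NB(r, p), p = 1/(1+λ).
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(1)
lam, r = 2.0, 3.0
p = 1.0 / (1.0 + lam)

n = 200_000
T = rng.gamma(shape=r, scale=1.0, size=n)  # gamma PDF (6)
Y = rng.poisson(lam * T)                   # Y = N(T), conditionally Poisson(λT)

# compare the empirical PMF with the NB PMF (7) at the first few points
pmf_hat = np.array([(Y == k).mean() for k in range(5)])
pmf_nb = nbinom.pmf(np.arange(5), r, p)
max_err = np.abs(pmf_hat - pmf_nb).max()
```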

An extension of this scheme to the multivariate case can be accomplished in two different ways, leading to mixed multivariate Poisson distributions of Kind (or Type) I and II in the terminology of Karlis and Xekalaki (2005). The former arises viz.

$$ \mathbf{Y} = (Y_{1}, \ldots, Y_{d}) = (N_{1}(T), \ldots, N_{d}(T)), $$
(8)

where the {Ni(·)} are Poisson processes with rates λi and T is, as before, a random variable on \(\mathbb {R}_{+}\), independent of the {Ni}. While in general the marginal distributions of (N1(t),…,Nd(t)) can be correlated multivariate Poisson (see Johnson et al. (1997)), we shall assume that the processes {Ni} are mutually independent. In this case, the joint probability generating function (PGF) of the {Yi} in (8) is of the form

$$ G(s_{1}, \ldots, s_{d}) = \mathbb{E} \left\{\prod_{i=1}^{d} s_{i}^{Y_{i}} \right\} = \phi_{T}\left(\sum_{i=1}^{d} \lambda_{i} - \sum_{i=1}^{d} \lambda_{i} s_{i} \right), \,\,\, (s_{1}, \ldots, s_{d})^{\top} \in [0,1]^{d}, $$
(9)

where ϕT is the Laplace transform (LT) of T, while the relevant probabilities can be conveniently expressed as

$$ \mathbb{P}(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^{d} g_{i}(y_{i})h(\mathbf{y}), \,\,\, \mathbf{y} = (y_{1}, \ldots, y_{d})^{\top} \in \mathbb{N}_{0}^{d}, $$
(10)

where the {gi} in (10) are the marginal PMFs of the {Yi}. As shown in the Appendix, the function h is of the form

$$ h(\mathbf{y}) = \frac{v_{T}\left(\sum_{i=1}^{d} y_{i}, \sum_{i=1}^{d} \lambda_{i}\right)}{\sum_{i=1}^{d} v_{T}(y_{i},\lambda_{i})}, \,\,\, \mathbf{y} = (y_{1}, \ldots, y_{d})^{\top} \in \mathbb{N}_{0}^{d}, $$
(11)

where

$$ v_{T}(y,\lambda) = \mathbb{E} \left\{e^{-\lambda T}T^{y} \right\}, \,\,\, \lambda, y \in \mathbb{R}_{+}. $$
(12)

In the case of gamma-distributed T, with shape parameter r>0 and unit scale, the functions v and h above can be evaluated explicitly (see the Appendix for details), and the above distribution is known in the literature as the negative multinomial distribution (see Chapter 36 of Johnson et al. (1997) and references therein). Since the marginal distributions in this case are NB, the distribution has also been termed multivariate NB. In cases where the function v(·,·) is not available explicitly, it can be easily evaluated numerically via Monte Carlo simulation.
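
For instance, when T is standard gamma with shape r, the function vT in (12) has the closed form Γ(y+r)/Γ(r)·(1+λ)−(y+r), which a generic Monte Carlo approximation reproduces. A small sketch, assuming numpy and scipy, with an arbitrary shape parameter:

```python
# v_T(y, λ) = E[exp(-λT) T^y] of (12): closed form for gamma T vs. Monte Carlo.
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(2)
r = 2.5                                   # gamma shape parameter (illustrative)

def v_gamma(y, lam):
    # exact value for T ~ Gamma(r, 1): Γ(y+r)/Γ(r) * (1+λ)^-(y+r)
    return np.exp(gammaln(y + r) - gammaln(r) - (y + r) * np.log1p(lam))

T = rng.gamma(shape=r, scale=1.0, size=1_000_000)

def v_mc(y, lam):
    # generic Monte Carlo approximation, usable for any mixing distribution
    return np.mean(np.exp(-lam * T) * T ** y)

err = max(abs(v_mc(y, lam) - v_gamma(y, lam))
          for y in (0, 1, 3) for lam in (0.5, 2.0))
```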

Our main focus will be a more flexible family of mixed Poisson distributions of Kind II, where each Poisson process \(\{N_{i}(t), \,\, t\in \mathbb {R}_{+}\}\) is randomly stopped at a different random variable Ti, leading to

$$ \mathbf{Y} = (Y_{1}, \ldots, Y_{d}) = (N_{1}(T_{1}), \ldots, N_{d}(T_{d})), $$
(13)

where the random vector T=(T1,…,Td) follows a multivariate distribution on \(\mathbb {R}_{+}^{d}\). A special case of this construction, with the {Ti} having correlated log-normal distributions, was recently proposed in Madsen and Dalthorp (2007), where the model was referred to as the lognormal-Poisson hierarchy (L-P model). While that particular model does not allow explicit forms for the marginal PMFs, it proved useful in applications. Our generalization, which we shall refer to as the T-Poisson hierarchy, allows T in (13) to have any continuous distribution on \(\mathbb {R}_{+}^{d}\), with margins not necessarily belonging to the same parametric family. As will be seen in the sequel, the joint PMF of this more general model can still be written as in (10), with an appropriate function h. In particular, we shall work with families of distributions of T described by marginal CDFs {Fi} and a copula function C(u1,…,ud). In this set-up, the PMF of Y, which is still of the form (10), can be expressed as

$$ \mathbb{P}(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^{d} g_{i}(y_{i})\mathbb{E} \left\{ c(F_{1}(X_{1}), \ldots, F_{d}(X_{d})) \right\}, \,\,\, y_{i}\in \mathbb{N}_{0}, \,\,\, i=1,\ldots,d, $$
(14)

where the gi are the marginal PMFs of the {Yi}, the function c(u1,…,ud) is the PDF corresponding to the copula C(u1,…,ud), and the {Xi} are independent random variables with certain distributions dependent on the {yi}. This expression, which is an analogue of (2) for discrete multivariate distributions defined through our scheme, provides a convenient way of computing the probabilities of these multivariate distributions. This computational aspect of our construction compares favorably with the cumbersome formula for the PMF (see, e.g., Proposition 1.1 in Nikoloulopoulos and Karlis (2009)) of the competing method defined via (3).

In what follows, we explore these ideas to provide a flexible multivariate modeling framework for dependent count data, emphasizing computationally convenient expressions and scalable algorithms for high-dimensional applications. We begin by showing how multivariate count data can be generated as mixtures of Poisson distributions by developing sequences of independent Poisson processes randomly stopped at an underlying continuous real-valued random variable T (a T-Poisson hierarchy). Then we show how our T-Poisson hierarchy scheme gives rise to computationally convenient joint PMFs and how particular choices of parameters/distributions can be used to construct well-known models such as the multivariate negative binomial. Next, we describe a scalable simulation algorithm using our construction and copula theory. Two examples are provided: a basic example producing a multivariate geometric distribution and an elaborate high-dimensional simulation study aiming to model and simulate RNA-sequencing data. We note that our modeling framework and computationally convenient formulas may facilitate novel data analysis strategies, but we do not take up that task in the current study. We conclude with an Appendix containing selected proofs of assertions made throughout.

Multivariate mixtures of Poisson distributions

Our goal is to produce a random vector Y=(Y1,…,Yd) with correlated mixed Poisson components. To this end, we start with a sequence of independent Poisson processes \(\{N_{i}(t), \,\, t\in \mathbb {R}_{+}\}\), i=1,…,d, where the rate of the process Ni(t) is λi. Next, we let T=(T1,…,Td) have a multivariate distribution on \(\mathbb {R}_{+}^{d}\) with the PDF fT(t). Then, we define

$$ \mathbf{Y} = (Y_{1}, \ldots, Y_{d}) = (N_{1}(T_{1}), \ldots,N_{d}(T_{d})). $$
(15)

In the terminology of Karlis and Xekalaki (2005), this is a special case of multivariate mixed Poisson distributions of Type II. Assuming that the {Ni(t)} are independent of T, by standard conditioning arguments (see Lemma 7 in the Appendix) we obtain

$$ \mathbb{P}(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^{d} \frac{\lambda_{i}^{y_{i}}}{y_{i}!}\int_{\mathbb{R}_{+}^{d}} e^{-\sum_{i=1}^{d}\lambda_{i}t_{i}} \prod_{i=1}^{d} t_{i}^{y_{i}} f_{\mathbf{T}}(\mathbf{t})d\mathbf{t}. $$
(16)

While in some cases one can obtain explicit expressions for the above joint probabilities, in general these have to be calculated numerically. The calculations can be facilitated by certain representations of these probabilities, discussed in the Appendix (see Lemmas 7 and 8).

This procedure is quite general, and leads to a multitude of multivariate discrete distributions. Flexible models allowing for marginal distributions of different types can be obtained by a popular approach with copulas. Assume that T has a continuous distribution on \(\mathbb {R}_{+}^{d}\) with marginal PDFs fi and CDFs Fi driven by a particular copula C(u1,…,ud), so that the joint CDF of the {Ti} is given by

$$F_{\mathbf{T}}(\mathbf{t}) = \mathbb{P}(T_{1}\leq t_{1}, \ldots,T_{d}\leq t_{d}) = C(F_{1}(t_{1}), \ldots, F_{d}(t_{d})), \,\,\, \mathbf{t} = (t_{1}, \ldots,t_{d})^{\top} \in \mathbb{R}_{+}^{d}. $$

Then according to (2), the joint PDF fT is of the form

$$ f_{\mathbf{T}}(\mathbf{t}) = \left\{ \prod_{i=1}^{d} f_{i}(t_{i})\right\} c(F_{1}(t_{1}), \ldots, F_{d}(t_{d})), \,\,\, \mathbf{t} = (t_{1},\ldots,t_{d})^{\top} \in \mathbb{R}_{+}^{d}, $$
(17)

where the function c(u1,…,ud) is the PDF corresponding to the copula CDF C(u1,…,ud). When we substitute (17) into (16), we get

$$ \mathbb{P}(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^{d} \frac{\lambda_{i}^{y_{i}}}{y_{i}!}\int_{\mathbb{R}_{+}^{d}} e^{-\sum_{i=1}^{d}\lambda_{i}t_{i}} \prod_{i=1}^{d} \left[t_{i}^{y_{i}} f_{i}(t_{i})\right] c(F_{1}(t_{1}), \ldots, F_{d}(t_{d})) d\mathbf{t}. $$
(18)

Using the results presented in the Appendix (see Lemma 7 in the Appendix), one can show that the marginal PMFs of the {Yi} are given by

$$ \mathbb{P}(Y_{i}=y) = \frac{\lambda_{i}^{y}}{y!}\mathbb{E} \left[ e^{-\lambda_{i}T_{i}}T_{i}^{y}\right] = \mathbb{E} \left[ f_{\lambda_{i}T_{i}}(W)\right], $$
(19)

where \(f_{\lambda _{i}T_{i}}(\cdot)\) is the PDF of λiTi and W has a standard gamma distribution with shape parameter y+1. With this notation, we can write (18) in the form

$$ \mathbb{P}(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^{d} \mathbb{P}(Y_{i}=y_{i}) \int_{\mathbb{R}_{+}^{d}} c(F_{1}(t_{1}), \ldots, F_{d}(t_{d})) g(\mathbf{t}|\mathbf{y}) d\mathbf{t}, $$
(20)

where the quantity g(t|y) in the above integral is the joint PDF of a multivariate distribution with independent margins,

$$ g(\mathbf{t}|\mathbf{y}) = \prod_{i=1}^{d} g_{i}(t_{i}|y_{i}) $$
(21)

with

$$ g_{i}(t|y) = \frac{t^{y}e^{-\lambda_{i} t} f_{i}(t)}{\mathbb{E} \left[ T_{i}^{y} e^{-\lambda_{i}T_{i}}\right]}, \,\,\, t\in \mathbb{R}_{+}. $$
(22)

Thus, the integral in (20) can be expressed as

$$ \int_{\mathbb{R}_{+}^{d}} c(F_{1}(t_{1}), \ldots, F_{d}(t_{d})) g(\mathbf{t}|\mathbf{y}) d\mathbf{t} =\mathbb{E} \left\{ c(F_{1}(X_{1}), \ldots, F_{d}(X_{d})) \right\}, $$
(23)

where X=(X1,…,Xd) has a multivariate distribution with independent components, governed by the PDF specified by (21) - (22). This leads to the following result.

Proposition 1

In the above setting, the joint probabilities (18) admit the representation

$$ \mathbb{P}(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^{d} \mathbb{P}(Y_{i}=y_{i}) \mathbb{E} \left\{ c(F_{1}(X_{1}), \ldots, F_{d}(X_{d})) \right\}, \,\,\, \mathbf{y} = (y_{1}, \ldots,y_{d})^{\top} \in \mathbb{N}_{0}^{d}, $$
(24)

where the marginal probabilities are given by (19) and the PDF of X=(X1,…,Xd) is given by (21) - (22).

Let us note that the joint moments of Y1,…,Yd exist whenever their counterparts for T1,…,Td are finite, in which case they can be evaluated by standard conditioning arguments. In particular, the mean and the covariance matrix of Y are related to their counterparts connected with T in a simple way, specified by Lemma 9 in the Appendix. It follows that \(\mathbb {E}Y_{i} = \lambda _{i} \mathbb {E} T_{i}\) and \(\mathbb {V} ar Y_{i} = \lambda _{i} \mathbb {E} T_{i} + \lambda _{i}^{2} \mathbb {V} ar T_{i}\), so the distributions of the {Yi} are always over-dispersed. Moreover, we have

$${\mathbb{C} ov(Y_{i}, Y_{j}) = \lambda_{i}\lambda_{j} \mathbb{C} ov(T_{i}, T_{j}),\,\,\, i\neq j,} $$

so that the correlation coefficient of Yi and Yj (if it exists) is related to that of Ti and Tj as follows:

$$ \rho_{Y_{i}, Y_{j}} = c_{i,j}\rho_{T_{i}, T_{j}},\,\,\, i\neq j, $$
(25)

where

$$ c_{i,j} = \frac{\sqrt{\lambda_{i}}\sqrt{\lambda_{j}}}{\sqrt{\lambda_{i} + \frac{\mathbb{E}(T_{i})}{\mathbb{V} ar(T_{i})}}\sqrt{\lambda_{j} + \frac{\mathbb{E}(T_{j})}{\mathbb{V} ar(T_{j})}}},\,\,\, i\neq j. $$
(26)

Remark 1

While in general the correlation can be positive as well as negative and admits the same range as its counterpart for Ti and Tj, the range of possible correlations of Yi and Yj can be further restricted if the margins are fixed. The maximum and minimum correlations can be deduced from (25)-(26) and the range of correlation corresponding to the joint distribution of Ti and Tj. The latter is given by the minimum and maximum correlations, corresponding to the lower and the upper Fréchet copulas,

$$ C_{L}(u_{1}, u_{2}) = \max\{u_{1}+u_{2}-1,0\}, \,\,\, C_{U}(u_{1}, u_{2}) = \min\{u_{1}, u_{2}\}, \,\,\, u_{1}, u_{2} \in [0, 1]. $$
(27)

The upper bound for the correlation is obtained when the distribution of (Ti,Tj) is driven by the upper Fréchet copula CU in (27), so that \(T_{i}\stackrel {d}{=}F_{i}^{-1} (U)\) and \(T_{j}\stackrel {d}{=}F_{j}^{-1}(U)\), where U is standard uniform and Fi(·), Fj(·) are the CDFs of Ti, Tj, respectively. Similarly, the lower bound for the correlation is obtained when the distribution of (Ti,Tj) is driven by the lower Fréchet copula CL in (27), where we have \(T_{i}\stackrel {d}{=}F_{i}^{-1} (U)\) and \(T_{j}\stackrel {d}{=}F_{j}^{-1}(1-U)\). While these correlation bounds are usually not available explicitly, they can be easily obtained by Monte-Carlo approximations via simulation from these (degenerate) probability distributions or by other standard approximate methods (see, e.g., Demirtas and Hedeker (2011), and references therein).

Remark 2

We note that when a bivariate random vector Y=(Y1,Y2) is defined via (15) and the distribution of the corresponding T=(T1,T2) is driven by one of the copulas in (27), then the distribution of T is not absolutely continuous and the above derivations leading to the PMF of Y need a modification. It can be shown that in this case the marginal distributions of the Yi are still given by (19), while the joint PMF of (Y1,Y2) is also as in (20) with d=2, but with the integral term replaced with

$$ \int_{0}^{1} g_{1}(u|y_{1})g_{2}(u|y_{2})du \,\,\, \text{and} \,\,\, \int_{0}^{1} g_{1}(u|y_{1})g_{2}(1-u|y_{2})du $$
(28)

under the upper and the lower Fréchet copula cases, respectively, where the gi(·|y) in (28) are PDFs on (0,1) given by

$$ g_{i}(u|y) = \frac{e^{-\lambda_{i}F_{i}^{-1}(u)}\left[F_{i}^{-1}(u)\right]^{y}}{\mathbb{E} \left[ e^{-\lambda_{i}T_{i}}T_{i}^{y}\right] }, \,\,\, u\in (0,1), y\in \mathbb{N}_{0}, i=1,2. $$
(29)

Again, while the integrals in (28) are rarely available explicitly, they can be easily approximated by Monte-Carlo simulations in order to compute the joint PMF of Y=(Y1,Y2). These two “extreme” distributional cases can also be used to derive the full range of values for the correlation of Y=(Y1,Y2) when the marginal distributions (19) are fixed, if needed.

2.1 Mixed Poisson distributions with NB margins

We now consider the case where the mixed Poisson marginal distributions of Y are NB, so that the marginal distributions of T are gamma (see Lemma 1 in the Appendix). Thus, we shall assume that the coordinates of the random vector T have univariate standard gamma distributions with shape parameters \(r_{i}\in \mathbb {R}_{+}\), i=1,…,d. Numerous multivariate gamma distributions have been developed over the years, and we could use any of them here. However, we follow the general approach based on copulas, discussed above. Thus, we assume that the dependence structure of T is governed by some copula function C(u1,…,ud), which admits the PDF c(u1,…,ud). In this case, the fi in (18) are given by (6) with r=ri, and the Fi are the corresponding CDFs. Here, the marginal PMFs of the {Yi} in (19) are given by

$$ \mathbb{P}(Y_{i} = y) = \frac{\Gamma(y+r_{i})}{\Gamma(r_{i})y!} p_{i}^{r_{i}} \left(1-p_{i}\right)^{y}, \,\,\, y\in \mathbb{N}_{0}, $$
(30)

where the NB probabilities are given by pi=1/(1+λi)∈(0,1) (so that λi=(1−pi)/pi>0). Further, the PDF of X in Proposition 1 is still given by (21), where the marginal PDFs gi(·|yi) now admit the explicit expressions

$$ g_{i}(t|y_{i}) = \frac{(1+\lambda_{i})^{y_{i}+r_{i}}}{\Gamma(y_{i}+r_{i})}t^{y_{i}+r_{i}-1}e^{-(1+\lambda_{i})t}, \,\,\, t\in \mathbb{R}_{+}. $$
(31)

We recognize these as gamma PDFs. Thus, in this special case of multivariate mixed Poisson distributions of Type II with NB marginal distributions, the random vector X in the representation (14) has a multivariate gamma distribution as well, but with independent margins. This fact is summarized in the result below.

Corollary 1

Let Y have a mixed Poisson distribution defined via (15), where the {Ni(·)} are independent Poisson processes with respective rates λi and T has a multivariate gamma distribution with standard gamma margins with shape parameters ri and CDFs Fi, governed by a copula PDF c(u). Then, the marginal PMFs of Y are given by (30) with pi=1/(1+λi)∈(0,1), and its joint PMF is given by (14), where X=(X1,…,Xd) has a multivariate gamma distribution with independent gamma marginal distributions of the {Xi}, with PDFs given by (31).

Remark 3

If the expectation in (14) does not admit an explicit form in terms of y1,…,yd, one can approximate its value via a straightforward Monte-Carlo approximation involving generation of independent gamma random variates {Xi}.
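
As a sketch of this approximation (with illustrative parameters, and using the Gaussian copula density introduced in the Simulation section below as a concrete choice of c), one can estimate the joint PMF in (14) by averaging the copula density over the gamma variates (31); numpy and scipy are assumed:

```python
# Monte Carlo evaluation of the joint PMF (14) with NB margins (30):
# average a Gaussian copula density over independent gamma X_i from (31).
import numpy as np
from scipy.stats import nbinom, norm, gamma as gamma_dist

rng = np.random.default_rng(3)
lam = np.array([1.0, 2.0])               # Poisson rates λ_i (illustrative)
r = np.array([2.0, 3.0])                 # gamma shapes r_i
p = 1.0 / (1.0 + lam)                    # NB probabilities in (30)
R = np.array([[1.0, 0.5], [0.5, 1.0]])   # Gaussian copula correlation
A = np.linalg.inv(R) - np.eye(2)
detR = np.linalg.det(R)

def copula_pdf(U):
    # Gaussian copula density, evaluated row-wise
    Z = norm.ppf(U)
    return np.exp(-0.5 * np.einsum('ni,ij,nj->n', Z, A, Z)) / np.sqrt(detR)

def joint_pmf(y, n_mc=20_000):
    # P(Y = y): product of NB margins times E[c(F_1(X_1), F_2(X_2))],
    # with X_i ~ Gamma(y_i + r_i, rate 1 + λ_i) as in (31)
    X = np.column_stack([rng.gamma(y[i] + r[i], 1.0 / (1.0 + lam[i]), n_mc)
                         for i in range(2)])
    U = np.column_stack([gamma_dist.cdf(X[:, i], r[i]) for i in range(2)])
    return nbinom.pmf(y, r, p).prod() * copula_pdf(U).mean()

# sanity check: the estimated PMF nearly sums to one over a large grid
total = sum(joint_pmf(np.array([y1, y2]))
            for y1 in range(13) for y2 in range(26))
```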

Let us note that since the {Ti} have standard gamma distributions with shape parameters ri, we have \(\mathbb {E}(T_{i}) = \mathbb {V} ar(T_{i}) = r_{i}\), and an application of Lemma 9 leads to the following result.

Proposition 2

Let Y have a mixed Poisson distribution defined via (15), where the {Ni(·)} are independent Poisson processes with respective rates λi and T has a multivariate gamma distribution with standard gamma margins with shape parameters ri and CDFs Fi, governed by a copula PDF c(u). Then, \(\mathbb {E}(\mathbf {Y}) = \mathbf {I}({\boldsymbol \lambda }) \mathbf {r}\), where r=(r1,…,rd) and I(λ) is a d×d diagonal matrix with the {λi} on the main diagonal. Moreover, the covariance matrix of Y is given by

$${\boldsymbol \Sigma}_{\mathbf{Y}} =\mathbf{I}({\boldsymbol \lambda}) \mathbf{I}(\mathbf{r}) + \mathbf{I}({\boldsymbol \lambda}) {\boldsymbol \Sigma}_{\mathbf{T}} \mathbf{I}({\boldsymbol \lambda})^{\top}, $$

where ΣT is the covariance matrix of T and I(r) is a d×d diagonal matrix with the {ri} on the main diagonal.

Remark 4

The correlation of Yi and Yj is still given by (25), where this time

$$c_{i,j} = \sqrt{\frac{\lambda_{i}}{1+\lambda_{i}}} \sqrt{\frac{\lambda_{j}}{1+\lambda_{j}}}, \,\,\, i\neq j, $$

since in (26) we have \(\mathbb {E}(T_{i}) = \mathbb {V} ar(T_{i})\). Let us note that while in principle the quantities ci,j can assume any value in (0,1) with appropriate choices of λi and λj, they are fixed for particular marginal NB distributions, since in this model the NB probabilities are given by pi=1/(1+λi). In terms of the latter, we have

$$c_{i,j} = \sqrt{1-p_{i}} \sqrt{1-p_{j}}, \,\,\, i\neq j. $$

These quantities, along with the full range of correlations for \(\rho _{T_{i}, T_{j}}\) in (25), can be used to obtain the upper and lower bounds for possible correlations of Yi and Yj in this model. We note that the possible range of \(\rho _{T_{i}, T_{j}}\) depends on the shape parameters ri and rj. If the {Ti} are exponential (so that ri=rj=1), then the upper limit of their correlation can be shown to be 1. However, the full range for the correlation of Ti and Tj is usually a proper subset of [−1,1], which can be approximated by Monte-Carlo simulations (see Remarks 1-2) or other approximate methods (see, e.g., Demirtas and Hedeker (2011)).
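
A small sketch of this Monte Carlo approximation, assuming numpy and scipy, with illustrative shapes and rates: the extreme correlations of (Ti,Tj) come from the Fréchet copulas (27) and transfer to Y through the factor ci,j in (25).

```python
# Monte Carlo approximation of the attainable correlation range of (Y_i, Y_j).
import numpy as np
from scipy.stats import gamma as gamma_dist

rng = np.random.default_rng(4)
r_i, r_j = 1.0, 4.0          # gamma shape parameters (illustrative)
lam_i, lam_j = 2.0, 3.0      # Poisson rates, so p = 1/(1+λ)
c_ij = np.sqrt(lam_i / (1 + lam_i)) * np.sqrt(lam_j / (1 + lam_j))

U = rng.uniform(size=500_000)
Ti = gamma_dist.ppf(U, r_i)
Tj_up = gamma_dist.ppf(U, r_j)        # upper Fréchet copula: comonotone pair
Tj_lo = gamma_dist.ppf(1 - U, r_j)    # lower Fréchet copula: antimonotone pair

rho_T_max = np.corrcoef(Ti, Tj_up)[0, 1]
rho_T_min = np.corrcoef(Ti, Tj_lo)[0, 1]
# transfer to the counts via (25): rho_Y = c_ij * rho_T
rho_Y_max, rho_Y_min = c_ij * rho_T_max, c_ij * rho_T_min
```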

2.2 Simulation

One particular way of defining this model, convenient for simulations, is by using the Gaussian copula to generate T. This is a very popular methodology due to its flexibility and the ease of simulating from the required multivariate normal distribution. The Gaussian copula is the one corresponding to a multivariate normal distribution with standard normal marginal distributions and covariance matrix R. Since the marginals are standard normal, R is also the correlation matrix. If FR is the CDF of such a multivariate normal distribution, then the corresponding Gaussian copula CR is defined through

$$F_{\mathbf{R}}(x_{1}, \ldots, x_{d}) = C_{\mathbf{R}}(\Phi(x_{1}), \ldots, \Phi(x_{d})), $$

where Φ(·) is the standard normal CDF. Note that the copula CR is simply the CDF of the random vector (Φ(X1),…,Φ(Xd)), where (X1,…,Xd)∼Nd(0,R). If the distribution is continuous (so that R is non-singular), the copula CR admits the PDF cR, given by

$$ c_{\mathbf{R}}(u_{1}, \ldots,u_{d}) = \frac{1}{|\mathbf{R}|^{1/2}} e^{-\frac{1}{2}(\Phi^{-1}(\mathbf{u}))^{T} (\mathbf{R}^{-1}-\mathbf{I}_{d}) \Phi^{-1}(\mathbf{u})},\,\,\, \mathbf{u}=(u_{1}, \ldots, u_{d})^{\top} \in [0,1]^{d}, $$
(32)

where Φ−1(u)=(Φ−1(u1),…,Φ−1(ud)) and Id is the d×d identity matrix. This cR will then be used in equations (20), (23), and (14). Simulation of a multivariate gamma T with margins Fi based on this copula is quite simple and involves the following steps:

  • (i) Generate X=(X1,…,Xd)∼Nd(0,R);

  • (ii) Transform X to U=(U1,…,Ud) via Ui=Φ(Xi), i=1,…,d;

  • (iii) Return T=(T1,…,Td), where \(T_{i}=F_{i}^{-1}(U_{i})\), i=1,…,d.
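
The three steps above, followed by the mixing step (15), can be sketched as follows; the rates, shapes, and R below are illustrative choices, and numpy and scipy are assumed.

```python
# Gaussian-copula simulation of gamma T, then the mixing step Y_i = N_i(T_i).
import numpy as np
from scipy.stats import norm, gamma as gamma_dist

rng = np.random.default_rng(5)
lam = np.array([1.0, 2.0, 0.5])     # Poisson rates λ_i (illustrative)
r = np.array([2.0, 1.0, 3.0])       # gamma shape parameters r_i
R = np.array([[1.0, 0.4, 0.2],
              [0.4, 1.0, 0.3],
              [0.2, 0.3, 1.0]])     # copula correlation matrix

n = 50_000
X = rng.multivariate_normal(np.zeros(3), R, size=n)  # step one: X ~ N_d(0, R)
U = norm.cdf(X)                                      # step two: U_i = Φ(X_i)
T = gamma_dist.ppf(U, r)                             # step three: T_i = F_i^{-1}(U_i)
Y = rng.poisson(lam * T)       # mixing step (15): conditionally Poisson(λ_i T_i)

# margins are NB(r_i, p_i) with p_i = 1/(1+λ_i), so E[Y_i] = λ_i r_i
mean_err = np.abs(Y.mean(axis=0) - lam * r).max()
corr01 = np.corrcoef(Y[:, 0], Y[:, 1])[0, 1]
```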

Remark 5

This strategy of using the Gaussian copula to generate multivariate distributions is indeed quite popular, and has become known in the literature as the NORTA (NORmal To Anything) method (see, e.g., Chen (2001); Song and Hsiao (1993)). This methodology has recently been used to generate multivariate discrete distributions; see, e.g., Barbiero and Ferrari (2017), Madsen and Birkes (2013), or Nikoloulopoulos (2013) and references therein. The standard approach discussed in these papers proceeds by simulating the vector U from the Gaussian copula following steps (i)-(ii) above and then transforming the coordinates of U directly via the inverse CDFs of the components of the target random vector Y=(Y1,…,Yd), which can be described as

  • (iii)’ Return Y=(Y1,…,Yd), where \(Y_{i}=G_{i}^{-1}(U_{i})\), i=1,…,d.

Here, the Gi are the CDFs of the Yi. If the distributions of the Yi are discrete (such as NB), the inverse CDF is defined in the standard way as

$$G^{-1}(u) = \inf\{y: G(y)\geq u \}.$$

The difference between our approach and the standard one described above lies in the final step, regardless of the particular copula c that is used. In the standard approach one first simulates U from c and then proceeds via (iii)’ above to get the target random vector Y (having a multivariate distribution with CDFs Gi). Our proposal, on the other hand, is to first generate T via step (iii) above and then obtain the target variable via (15). While our methodology involves an extra step compared with this direct method, it offers a simple way of calculating the joint probabilities, which is not available in the other approach. Additionally, our methodology offers a stochastic explanation of the resulting distributions through the mixing mechanism and its relation to the underlying Poisson processes, which is lacking in the somewhat artificial standard approach. Another advantage of the mixed Poisson approach is that it admits extensions to more general stochastic processes in the spirit of the NB process studied by Kozubowski and Podgórski (2009). Its disadvantage is that not all discrete marginal distributions can be obtained, only those that are mixed Poisson to begin with.

Remark 6

Let us note that the mixed Poisson approach to generating multivariate distributions was used in Madsen and Dalthorp (2007), where Y was obtained via (15) with standard Poisson processes and T=eX, with X being multivariate normal with mean μ=(μ1,…,μd) and covariance matrix Σ=[σi,j]. Since in this case the marginals of T have log-normal distributions, the authors referred to this construction as the lognormal-Poisson hierarchy. This can be seen as a special case of our scheme, where we have \(\lambda _{i}=e^{\mu _{i}}\) and the marginal CDFs of the Ti are of the form \(F_{i}(t) = \Phi (\log t/\sigma _{ii}^{1/2})\). The copula PDF of the {Ti} is the Gaussian copula (32), where R is the correlation matrix corresponding to Σ.

An important aspect of this problem is how to set the parameters of the underlying copula function so that the distribution of Y has given characteristics, such as the means and the covariances (and correlations). In the case where a Gaussian copula is used, this amounts to determining the correlation matrix R. This problem arises in the general scheme (i)-(iii) as well, and has been discussed in the literature (see, e.g., Barbiero and Ferrari (2017); Xiao (2017); Xiao and Zhou (2019)). Generally, there is no simple relation between R and the correlation matrix of T in (i)-(iii). However, other measures of association, such as Kendall’s τ or Spearman’s ρ, do transfer directly and may be preferable in our set-up. These issues will be the subject of further research.

Examples

We provide two examples. The first describes the T-Poisson hierarchy approach to constructing a multivariate geometric distribution. The second demonstrates how the T-Poisson hierarchy can be used to conduct a high-dimensional (d=1026) simulation study inspired by RNA-sequencing data, a challenging computational task.

3.1 Multivariate geometric distributions

Suppose that the random vector T in (15) has marginal standard exponential distributions, so that the marginal CDFs of the {Ti} are of the form

$$ F_{i}(t) = 1-e^{-t}, \,\,\, t\in \mathbb{R}_{+}. $$
(33)

In this case, the {Yi} have geometric distributions with parameters pi=1/(1+λi), so that

$$ \mathbb{P}(Y_{i}=y) = p_{i}(1-p_{i})^{y}, \,\,\, y\in \mathbb{N}_{0}. $$
(34)

One can then obtain a multitude of multivariate distributions with geometric margins by selecting various copulas for the underlying distribution of T. As an example, consider the case of the Farlie-Gumbel-Morgenstern (FGM) copula driven by a parameter θ∈[−1,1], given by

$$ C(\mathbf{u}) = \prod_{i=1}^{d} u_{i}\left(1 + \theta \prod_{i=1}^{d} (1-u_{i})\right), \,\,\, \mathbf{u} = (u_{1}, \ldots, u_{d})^{\top} \in [0,1]^{d}. $$
(35)

Consider the two-dimensional case d=2, where the PDF corresponding to (35) is of the form

$$ c(\mathbf{u}) = 1+\theta(1-2u_{1})(1-2u_{2}), \,\,\, \mathbf{u} = (u_{1}, u_{2})^{\top} \in [0,1]^{2}. $$
(36)

In this case, the random vector X=(X1,X2) in Corollary 1 has independent gamma margins (31) with shape parameters yi+1 and rate parameters 1+λi, i=1,2. Using this fact, coupled with (33), one can evaluate the expectation in (14), leading to

$$ \mathbb{E}\left\{c(F_{1}(X_{1}), F_{2}(X_{2})) \right\} = 1+\theta \left[1-2\left(\frac{1}{1+p_{1}}\right)^{y_{1}+1}\right] \left[1-2\left(\frac{1}{1+p_{2}}\right)^{y_{2}+1}\right]. $$
(37)

In view of Corollary 1, this leads to the following expression for the joint probabilities of the bivariate geometric distribution defined by our scheme via the FGM copula:

$$ \mathbb{P}(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^{2}p_{i}(1-p_{i})^{y_{i}}\left\{1+\theta \prod_{i=1}^{2}\left[1-2\left(\frac{1}{1+p_{i}}\right)^{y_{i}+1}\right]\right\}, \,\,\, \mathbf{y}=(y_{1}, y_{2})^{\top} \in \mathbb{N}_{0}^{2}. $$
(38)

We shall denote this distribution by GEO(p1,p2,θ). When θ=0, the {Yi} are independent geometric variables with parameters pi∈(0,1), i=1,2. Otherwise, Y1 and Y2 are correlated, with

$$ \mathbb{C} ov(Y_{1}, Y_{2}) = \frac{\theta}{4}\frac{1-p_{1}}{p_{1}}\frac{1-p_{2}}{p_{2}}, $$
(39)

as can be verified by routine, albeit tedious, algebra. In turn, the correlation of Y1,Y2 becomes

$$ \rho_{Y_{1}, Y_{2}} = \frac{\theta}{4}\sqrt{1-p_{1}}\sqrt{1-p_{2}}, $$
(40)

and can generally take any value in the range (−1/4,1/4).
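The covariance and correlation formulas (39)-(40) can be checked numerically by summing the joint PMF (38) over a truncated grid. The following sketch does this for illustrative parameter values p1=0.4, p2=0.5, θ=0.8 (not values from the paper), confirming that the truncated PMF sums to one, the marginal mean is geometric, and the implied covariance matches (39).

```python
import numpy as np

# Illustrative parameters for GEO(p1, p2, theta):
p1, p2, theta = 0.4, 0.5, 0.8
K = 200  # truncation level; the neglected tail mass is negligible here

y = np.arange(K)
g1 = p1 * (1 - p1) ** y                  # geometric marginal PMFs
g2 = p2 * (1 - p2) ** y
f1 = 1 - 2 * (1 / (1 + p1)) ** (y + 1)   # FGM adjustment factors, as in Eq. (37)
f2 = 1 - 2 * (1 / (1 + p2)) ** (y + 1)

# Joint PMF of Eq. (38) on the truncated grid.
joint = np.outer(g1, g2) * (1 + theta * np.outer(f1, f2))

total = joint.sum()
m1 = (y[:, None] * joint).sum()          # E(Y1)
m2 = (y[None, :] * joint).sum()          # E(Y2)
cov = (np.outer(y, y) * joint).sum() - m1 * m2
cov_theory = (theta / 4) * (1 - p1) / p1 * (1 - p2) / p2
```

Note that the FGM adjustment integrates out of each margin exactly, so m1 and m2 recover the geometric means (1−pi)/pi regardless of θ.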

3.2 Simulating RNA-seq data

This section describes how to simulate data using a T-Poisson hierarchy, aiming to replicate the structure of high-dimensional dependent count data. Indeed, simulating RNA-sequencing (RNA-seq) data is one of the primary motivating applications of the proposed methodology, which calls for scalable Monte Carlo methods for realistic multivariate simulation (see, for example, Schissler et al. (2018)).

The RNA-seq data generating process involves counting how often a particular messenger RNA (mRNA) is expressed in a biological sample. Since this is a counting process with no upper bound, many modeling approaches use discrete random variables with infinite support. The counts often exhibit over-dispersion, so the negative binomial arises as a sensible model for the expression levels (gene counts). Moreover, the counts are correlated (co-expressed) and cannot be assumed to behave independently. RNA-seq platforms quantify the entire transcriptome in one experimental run, resulting in high-dimensional data. In humans, this produces count data for over 20,000 genes (coding genomic regions), or even over 77,000 isoforms when alternatively spliced mRNAs are counted. This suggests that simulating high-dimensional multivariate NB vectors with heterogeneous marginals would be a useful tool in the development and evaluation of RNA-seq analytics.

To illustrate our proposed methodology on real data, we simulate RNA-sequencing data by producing random vectors generated from the Type II T-Poisson framework (as in Eq. (13)). Our goal is to replicate the structure of a breast cancer data set (BRCA: breast cancer invasive carcinoma data set from The Cancer Genome Atlas). For simplicity, we begin by filtering the 20,501 gene measurements from N=1212 patients’ tumor samples to retain the top 5% highest-expressing genes, resulting in d=1026 genes. All these genes exhibit over-dispersion, so we estimate the NB parameters (ri,pi), i=1,…,d, via the method of moments to determine the target marginal PMFs gi(yi). Notably, the \(\hat {p}_{i}'s\) are small, ranging in [3.934×10−6,1.217×10−2]. To complete the simulation algorithm inputs, we estimate the Pearson correlation matrix RY and set it as the target correlation.

With the simulation targets specified, we proceed to simulate B=10,000 random vectors Y =(Y1,…,Yd) with target Pearson correlation RY and marginal PMFs gi(yi) using a Type II T-Poisson hierarchy. Specifically, we first employ the direct Gaussian copula approach to generate B random vectors following a standard multivariate gamma distribution T with shape parameters ri equal to the target NB sizes and Pearson correlation matrix RT. Care must be taken when specifying R (refer to Eq. (32)): we employ Eq. (25) to compute the scaling factors ci,j and adjust the underlying correlations to ultimately match the target RY. Notably, of the 525,825 pairwise correlations among the 1026 genes, no scale factor was less than 0.9907, indicating that the model can produce essentially the entire range of possible correlations. Here we are satisfied with approximate matching of the specified gamma correlation and set R = RT in our Gaussian copula scheme (R denoting the specified multivariate Gaussian correlation matrix). Finally, we generate the desired random vectors with components Yi=Ni(Ti) by simulating Poisson counts with expected values μi=λi×Ti, for i=1,…,d (with \(\lambda _{i}=\frac {(1-p_{i})}{p_{i}}\)), and repeat this B=10,000 times.
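For readers wishing to reproduce a small-scale version of this pipeline, the following sketch (Python with NumPy/SciPy) illustrates the Gaussian copula → gamma → Poisson hierarchy with d=3. The NB targets and copula correlation matrix below are hypothetical, and, for simplicity, the Gaussian correlation is used directly without the scaling adjustment of Eq. (25).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical small-scale targets (the paper uses d = 1026 genes):
r = np.array([2.0, 5.0, 1.5])      # NB sizes = gamma shape parameters
p = np.array([0.2, 0.1, 0.3])      # NB probabilities
lam = (1 - p) / p                  # Poisson rate scalings
R = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.0]])    # specified Gaussian copula correlation

B = 50_000
Z = rng.multivariate_normal(np.zeros(3), R, size=B)   # correlated normals
U = stats.norm.cdf(Z)                                 # uniform margins
T = stats.gamma.ppf(U, a=r)                           # standard gamma margins
Y = rng.poisson(lam * T)                              # mixed Poisson counts

target_mean = r * (1 - p) / p      # NB marginal means
corr_Y = np.corrcoef(Y.T)
```

The simulated counts have the target NB marginal means, and the positive copula correlations carry through (attenuated) to the counts.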

Figure 1 shows the results of our simulation by comparing the specified target parameters (horizontal axes) with the corresponding quantities estimated from the simulated data (vertical axes). The evaluation shows that the simulated counts approximately match the target parameters and exhibit the full range of correlations estimated from the data. Utilizing 15 CPU threads on a MacBook Pro with a 2.4 GHz 8-core Intel Core i9 processor, the simulation completed in less than 30 seconds.

Fig. 1

The T-Poisson strategy produces simulated random vectors from a multivariate negative binomial (NB) distribution that replicate the estimated structure of an RNA-seq data set. The dashed red lines indicate equality between estimated parameters (vertical axes; derived from the simulated data) and the specified target parameters (horizontal axes)

Appendix

Gamma-Poisson mixtures

For the convenience of the reader, we include a short proof of the well-known fact that a Poisson distribution with a gamma-distributed rate is NB (see, e.g., Solomon (1983)).

Lemma 1

If \(\{N(t), t\in \mathbb {R}_{+}\}\) is a homogeneous Poisson process with rate λ=(1−p)/p>0 and T is an independent standard gamma variable with shape parameter r, then the randomly stopped process, Y=N(T), has a NB distribution NB(r,p) with the PMF (7).

Proof

Suppose that T has a standard gamma distribution with the PDF (6) and the corresponding CDF FT. When we substitute the latter into (5), we obtain

$$\mathbb{P}(Y=n) = \int_{\mathbb{R}_{+}} \frac{e^{-\lambda t}(\lambda t)^{n}}{n!}\frac{1}{\Gamma(r)}t^{r-1}e^{-t}dt. $$

After some algebra, this produces

$$\mathbb{P}(Y=n) = \frac{\Gamma(n+r)}{\Gamma(r)n!} \frac{\lambda^{n}}{(1+\lambda)^{n+r}} \int_{\mathbb{R}_{+}} \frac{(1+\lambda)^{n+r}}{\Gamma(n+r)}t^{n+r-1}e^{-t(1+\lambda)}dt. $$

Since the integrand above is the PDF of gamma distribution with shape n+r and scale 1+λ, the integral becomes 1 and we have

$$\mathbb{P}(Y=n) = \frac{\Gamma(n+r)}{\Gamma(r)n!} \left(\frac{1}{1+\lambda}\right)^{r} \left(\frac{\lambda}{1+\lambda}\right)^{n}, $$

which we recognize as the NB probability from (7) with p=(1+λ)−1. The result follows when we set λ=(1−p)/p in the above analysis. □
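Lemma 1 is easy to verify by simulation: draw a gamma rate, generate a Poisson count with that rate, and compare the empirical frequencies with the NB PMF. The sketch below uses SciPy, whose nbinom(n, p) parameterization matches (7) with n=r; the values r=3 and p=0.25 are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Illustrative NB target:
r, p = 3.0, 0.25
lam = (1 - p) / p          # Poisson rate, as in Lemma 1

# Gamma-mixed Poisson draws.
T = rng.gamma(shape=r, scale=1.0, size=300_000)
Y = rng.poisson(lam * T)

# Compare empirical frequencies with the NB(r, p) PMF.
y = np.arange(8)
emp_pmf = np.array([np.mean(Y == k) for k in y])
theo_pmf = stats.nbinom.pmf(y, r, p)
```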

Mixed multivariate Poisson distributions of Type I

Here we provide basic distributional facts about mixed multivariate Poisson distributions of Type I, which are the distributions of Y=(Y1,…,Yd)=(N1(T),…,Nd(T)), where the {Ni(·)} are independent Poisson processes with rates λi and T is a random variable on \(\mathbb {R}_{+}\), independent of the {Ni}.

Lemma 2

In the above setting, the PGF of Y is given by

$$G(\mathbf{s}) = \mathbb{E} \left\{\prod_{i=1}^{d} s_{i}^{Y_{i}} \right\} = \phi_{T}\left(\sum_{i=1}^{d} \lambda_{i} - \sum_{i=1}^{d} \lambda_{i} s_{i} \right), \,\,\, \mathbf{s} = (s_{1}, \ldots, s_{d})^{\top}\in [0,1]^{d}, $$

where ϕT is the LT of T.

Proof

By a standard conditioning argument, we have

$$ G(\mathbf{s}) = \mathbb{E} \left\{\prod_{i=1}^{d} s_{i}^{Y_{i}} \right\} = \int_{\mathbb{R}_{+}}\mathbb{E} \left\{\left.\prod_{i=1}^{d} s_{i}^{Y_{i}}\right|T=t \right\} dF_{T}(t). $$
(41)

Since given T=t the variables {Yi} are independent and Poisson distributed with means {λit}, respectively, we have

$$\mathbb{E} \left\{\left.\prod_{i=1}^{d} s_{i}^{Y_{i}}\right|T=t \right\} = \prod_{i=1}^{d} \mathbb{E} \left\{\left.s_{i}^{Y_{i}}\right|T=t \right\} = \prod_{i=1}^{d} e^{-\lambda_{i}t(1-s_{i})} = e^{-t\left(\sum_{i=1}^{d} \lambda_{i} - \sum_{i=1}^{d} \lambda_{i} s_{i} \right)}. $$

When we substitute the above into (41) we conclude that the PGF of Y is indeed of the form stated above. □

Remark 7

Note that in the one-dimensional case d=1, we recover the well-known formula for the PGF of Y=N(T),

$$ G(s) = \phi_{T}(\lambda(1-s)), \,\,\, s\in [0,1], $$
(42)

where λ>0 is the rate of the Poisson process \(\{N(t),\,\, t\in \mathbb {R}_{+}\}\). If we further assume that T is standard gamma distributed with shape parameter r>0, so that

$$\phi_{T}(t) = \left(\frac{1}{1+t}\right)^{r}, \,\,\, t\in \mathbb{R}_{+}, $$

and we take λ=(1−p)/p, we obtain

$$ G(s) = \left(\frac{p}{1-(1-p)s}\right)^{r}, \,\,\, s\in [0,1]. $$
(43)

We recognize this as the PGF of the NB distribution NB(r,p), as it should be according to Lemma 1. Similarly, the PGF of a d-dimensional mixed Poisson distribution with such a gamma distributed T takes on the form

$$G(\mathbf{s}) = \left(\frac{1}{Q-\sum_{i=1}^{d} P_{i} s_{i}}\right)^{r}, \,\,\, \mathbf{s} = (s_{1}, \ldots, s_{d})^{\top} \in [0,1]^{d}, $$

where Pi=λi and \(Q=1+\sum _{i=1}^{d} P_{i}\). This is a general form of the multivariate negative multinomial distribution (see Chapter 36 of Johnson et al. (1997)). Since the PGF of the marginal distribution of each Yi in this setting is of the form (43) with p=(1+λi)−1, all marginal distributions are NB. Due to this property, discrete multivariate distributions with the above PGFs have been termed multivariate NB distributions (for more details, see Johnson et al. (1997)).

Remark 8

Let us note that changing a scaling factor of the variable T in this model has the same effect as adjusting the rate parameters connected with the Poisson processes {Ni(·)}. Namely, it follows from Lemma 2 that if we let \(\tilde {T} = cT\) in the above setting, then we have the following equality in distribution:

$$ \left(N_{1}\left(\tilde{T}\right), \ldots, N_{d}\left(\tilde{T}\right)\right)^{\top}\stackrel{d}{=} \left(\tilde{N}_{1}(T), \ldots, \tilde{N}_{d}(T)\right)^{\top}, $$
(44)

where the \(\{\tilde {N}_{i}(\cdot)\}\) are independent Poisson processes with rates cλi, respectively. Thus, without loss of generality, we may assume that the scale parameter of the variable T in this model is set to unity.

Lemma 3

In the above setting, the PMF of Y is given by

$$\mathbb{P}(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^{d} g_{i}(y_{i})h(\mathbf{y}), \,\,\, \mathbf{y} = (y_{1}, \ldots, y_{d})^{\top} \in \mathbb{N}_{0}^{d}, $$

where

$$g_{i}(y) = \frac{\lambda_{i}^{y}}{y!}v_{T}(y,\lambda_{i}), \,\,\, y\in \mathbb{N}_{0}, $$

are the marginal PMFs of the {Yi},

$$v_{T}(y,\lambda) = \mathbb{E} \left\{T^{y} e^{-\lambda T} \right\}, \,\,\, \lambda, y \in \mathbb{R}_{+}, $$

and the function h is given by

$$h(\mathbf{y}) = \frac{v_{T}\left(\sum_{i=1}^{d} y_{i}, \sum_{i=1}^{d} \lambda_{i}\right)}{\prod_{i=1}^{d} v_{T}(y_{i},\lambda_{i})}, \,\,\, \mathbf{y} = (y_{1}, \ldots, y_{d})^{\top} \in \mathbb{N}_{0}^{d}. $$

Proof

Since given T=t the variables {Yi} are independent and Poisson distributed with means {λit}, respectively, a standard conditioning argument followed by some algebra gives

$$ {}\mathbb{P}(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^{d} \frac{\lambda_{i}^{y_{i}}}{y_{i}!} \int_{\mathbb{R}_{+}} e^{-t \sum_{i=1}^{d}\lambda_{i}}t^{\sum_{i=1}^{d} y_{i}} dF_{T}(t) = \left[ \prod_{i=1}^{d} \frac{\lambda_{i}^{y_{i}}}{y_{i}!}\right] \left[ v_{T}\left(\sum_{i=1}^{d} y_{i}, \sum_{i=1}^{d} \lambda_{i}\right)\right]. $$
(45)

Similarly, the marginal PMFs are given by

$$ \mathbb{P}(Y_{i} = y) = \frac{\lambda_{i}^{y}}{y!} \int_{\mathbb{R}_{+}} e^{-t \lambda_{i}}t^{y} dF_{T}(t) = \frac{\lambda_{i}^{y}}{y!} v_{T}\left(y, \lambda_{i}\right). $$
(46)

By combining (45) and (46), we obtain the result. □

Remark 9

Note that the joint PMF of Y can be also written as

$$ \mathbb{P}(\mathbf{Y} = \mathbf{y}) = v_{T}\left(\sum_{i=1}^{d} y_{i}, \sum_{i=1}^{d} \lambda_{i}\right) \prod_{i=1}^{d} \frac{\lambda_{i}^{y_{i}}}{y_{i}!}, $$
(47)

which is a convenient expression for approximating these probabilities by Monte Carlo simulations if the function vT(·,·) is not available explicitly and the random variable T is straightforward to simulate. We also note that whenever the marginal PMFs of Yi are explicit, then so is the function vT(·,·), which is clear from Lemma 3. For example, if T is standard gamma with shape parameter r, then we have

$$v_{T}(y,\lambda) = \frac{\Gamma(r+y)}{\Gamma(r)}\left(\frac{1}{1+\lambda}\right)^{r+y} = \frac{y!}{\lambda^{y}}\mathbb{P}(Y=y), \,\,\, \lambda, y \in \mathbb{R}_{+}, $$

where Y has a NB distribution with parameters r and p=1/(1+λ).
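This closed form offers a convenient test of the Monte Carlo strategy mentioned above: estimate vT(y,λ) by averaging T^y e^{−λT} over simulated draws of T and compare with the exact gamma expression. The argument values r=2.5, y=3, λ=1.2 below are arbitrary illustrative choices.

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(11)

# Illustrative arguments:
r, y, lam = 2.5, 3, 1.2

# Monte Carlo estimate of v_T(y, lam) = E[T^y e^{-lam T}] for standard gamma T.
T = rng.gamma(shape=r, scale=1.0, size=1_000_000)
v_mc = np.mean(T ** y * np.exp(-lam * T))

# Closed form from Remark 9: Gamma(r+y)/Gamma(r) * (1+lam)^{-(r+y)}.
v_exact = gamma(r + y) / gamma(r) * (1 + lam) ** (-(r + y))
```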

Next, we present an alternative expression for the joint probabilities P(Y=y), which provides a convenient formula for their computation whenever the variable T is difficult to simulate but its PDF is easy to compute. This representation involves a multinomial random vector N=(N1,…,Nd) with parameters n and p=(p1,…,pd), denoted by MUL(n,p), where \(n\in \mathbb {N}\) represents the number of trials, the {pi} represent event probabilities that sum up to one, and

$$ \mathbb{P}(\mathbf{N} = \mathbf{y}) = \frac{n!}{y_{1}! \cdots y_{d}!}p_{1}^{y_{1}} \cdots p_{d}^{y_{d}}, \,\,\, \mathbf{y} \in \left\{ \mathbf{k} = (k_{1}, \ldots, k_{d})^{\top} \in \mathbb{N}_{0}^{d}: \sum_{i=1}^{d} k_{i}=n\right\}. $$
(48)

Lemma 4

In the above setting, the PMF of Y is given by

$$ \mathbb{P}(\mathbf{Y} = \mathbf{y}) = \mathbb{P}(\mathbf{N}=\mathbf{y})\mathbb{E}\left(f_{\lambda T}(W) \right), \,\,\, \mathbf{y} = (y_{1}, \ldots, y_{d})^{\top} \in \mathbb{N}_{0}^{d}, $$
(49)

where \(\lambda =\sum _{i=1}^{d} \lambda _{i}\), N∼MUL(n,p) with \(n = \sum _{i=1}^{d}y_{i}\) and pi=λi/λ, the quantity fλT is the PDF of λT, and W has a standard gamma distribution with shape parameter n+1.

Proof

Proceeding as in the proof of Lemma 3, we obtain

$$ \mathbb{P}(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^{d} \frac{\lambda_{i}^{y_{i}}}{y_{i}!} \frac{n!}{\lambda^{n}}\frac{1}{\lambda} \int_{\mathbb{R}_{+}} \frac{\lambda^{n+1}}{n!}t^{(n+1)-1} e^{-\lambda t} f_{T}(t) dt. $$
(50)

Since the integrand is the product of fT(t) and the density of gamma random variable X with shape parameter n+1 and scale λ, we have

$$\frac{1}{\lambda} \int_{\mathbb{R}_{+}} \frac{\lambda^{n+1}}{n!}t^{(n+1)-1} e^{-\lambda t} f_{T}(t) dt = \mathbb{E}\left[ \frac{1}{\lambda} f_{T}(X)\right] = \mathbb{E}\left[ \frac{1}{\lambda} f_{T}\left(\frac{W}{\lambda}\right)\right], $$

where W=λX has standard gamma distribution with shape parameter n+1 (and scale 1). To conclude the result, observe that the expression

$$\prod_{i=1}^{d} \frac{\lambda_{i}^{y_{i}}}{y_{i}!} \frac{n!}{\lambda^{n}} $$

in (50) coincides with the multinomial probability (48) with pi=λi/λ, while

$$\frac{1}{\lambda} f_{T}\left(\frac{w}{\lambda}\right) = f_{\lambda T}(w). $$

□

Remark 10

Note that in the one-dimensional case d=1 the multinomial probability in (49) reduces to 1, and we obtain

$$ \mathbb{P}(Y=y) = \mathbb{E}\left(f_{\lambda T}(W) \right), \,\,\, y\in \mathbb{N}_{0}, $$
(51)

where \(Y\stackrel{d}{=}N(T)\), \(\{N(t),\,\, t\in \mathbb {R}_{+}\}\) is a Poisson process with rate λ, the quantity fλT is the PDF of λT, the variable W has a standard gamma distribution with shape parameter y+1, and T is independent of the Poisson process.
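As a concrete check of (51) in the gamma case, one can average the PDF of λT over draws of W∼Gamma(y+1,1) and compare with the NB probability from Lemma 1. The values of r, λ, and y below are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Illustrative values:
r, lam, y = 2.0, 3.0, 4
p = 1.0 / (1.0 + lam)

# W ~ Gamma(y+1, 1); for standard gamma T, lam*T ~ Gamma(r, scale=lam).
W = rng.gamma(shape=y + 1, scale=1.0, size=500_000)
est = np.mean(stats.gamma.pdf(W, a=r, scale=lam))

# Direct NB probability for comparison.
exact = stats.nbinom.pmf(y, r, p)
```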

Next, we present well-known results concerning the mean and the covariance structure of mixed multivariate Poisson distributions of Type I, which are easily derived through standard conditioning arguments. Generally, whenever the mean of T exists, then so does the mean of each Yi, and we have \(\mathbb {E}(Y_{i})=\lambda _{i} \mathbb {E}(T)\). Moreover, the variance of each Yi is finite whenever T has a finite second moment, in which case we have \(\mathbb {V} ar(Y_{i}) = \lambda _{i} \mathbb {E}(T) + \lambda _{i}^{2}\mathbb {V} ar(T)\). Thus, the variance of Yi exceeds its mean, and the distribution of Yi is over-dispersed. Finally, under the latter assumption, the covariance of Yi and Yj exists and equals \(\mathbb {C} ov(Y_{i}, Y_{j}) = \lambda _{i}\lambda _{j} \mathbb {V} ar(T)\). The result below summarizes these facts.

Lemma 5

In the above setting, the mean vector of Y exists whenever the mean of T is finite, in which case we have \(\mathbb {E}(\mathbf {Y}) = {\boldsymbol \lambda } \mathbb {E}(T)\), where λ=(λ1,…,λd). Moreover, if T has a finite second moment, then the covariance matrix of Y is well defined and is given by

$${\boldsymbol \Sigma} = \mathbb{E}(T) \mathbf{I}({\boldsymbol \lambda}) + \mathbb{V} ar(T){\boldsymbol \lambda}{\boldsymbol \lambda}^{\top}, $$

where I(λ) is a d×d diagonal matrix with the {λi} on the main diagonal.

Remark 11

By the above result, if it exists, the correlation coefficient of Yi and Yj is given by

$$\rho_{i,j} = \frac{\sqrt{\lambda_{i}}\sqrt{\lambda_{j}}}{\sqrt{\lambda_{i} + \frac{\mathbb{E}(T)}{\mathbb{V} ar(T)}}\sqrt{\lambda_{j} + \frac{\mathbb{E}(T)}{\mathbb{V} ar(T)}}}. $$

The correlation is always positive, and can generally take any value in the interval (0,1).
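Both Lemma 5 and the correlation formula above are straightforward to confirm by simulation in the gamma case, where E(T)=Var(T)=r for a standard gamma T. The shape r=2 and rates λ=(1,4) below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)

# Illustrative setting: standard gamma T, so E(T) = Var(T) = r.
r = 2.0
lam = np.array([1.0, 4.0])

T = rng.gamma(shape=r, scale=1.0, size=200_000)
Y = rng.poisson(np.outer(T, lam))   # shared T induces Type I dependence

# Lemma 5: Sigma = E(T) I(lam) + Var(T) lam lam^T.
cov_theory = r * np.diag(lam) + r * np.outer(lam, lam)
# Remark 11 with E(T)/Var(T) = 1:
rho_theory = np.sqrt(lam[0] * lam[1] / ((lam[0] + 1) * (lam[1] + 1)))

cov_emp = np.cov(Y.T)
rho_emp = np.corrcoef(Y.T)[0, 1]
```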

Mixed multivariate Poisson distributions of Type II

Here we provide basic distributional facts about mixed multivariate Poisson distributions of Type II, which are the distributions of Y=(Y1,…,Yd)=(N1(T1),…,Nd(Td)), where the {Ni(·)} are independent Poisson processes with rates λi and T=(T1,…,Td) is a random vector in \(\mathbb {R}_{+}^{d}\) with the PDF fT, independent of the {Ni}.

Lemma 6

In the above setting, the PGF of Y is given by

$$ G(\mathbf{s}) = \mathbb{E} \left\{\prod_{i=1}^{d} s_{i}^{Y_{i}} \right\} = \phi_{\mathbf{T}}(\mathbf{I}({\boldsymbol \lambda}) ({\boldsymbol 1} - {\boldsymbol s})), \,\,\, \mathbf{s} = (s_{1}, \ldots, s_{d})^{\top} \in [0,1]^{d}, $$
(52)

where ϕT is the LT of T, I(λ) is a d×d diagonal matrix with the {λi} on the main diagonal, and 1 is a d-dimensional column vector of 1s.

Proof

By a standard conditioning argument, we have

$$ G(\mathbf{s}) = \mathbb{E} \left\{\prod_{i=1}^{d} s_{i}^{Y_{i}} \right\} = \int_{\mathbb{R}_{+}^{d}}\mathbb{E} \left\{\left.\prod_{i=1}^{d} s_{i}^{Y_{i}}\right|\mathbf{T}=\mathbf{t} \right\} dF_{\mathbf{T}}(\mathbf{t}). $$
(53)

Since given T=t the variables {Yi} are independent and Poisson distributed with means {λiti}, respectively, we have

$$\mathbb{E} \left\{\left.\prod_{i=1}^{d} s_{i}^{Y_{i}}\right|\mathbf{T}=\mathbf{t} \right\} = \prod_{i=1}^{d} \mathbb{E} \left\{\left.s_{i}^{Y_{i}}\right|\mathbf{T}=\mathbf{t} \right\} = \prod_{i=1}^{d} e^{-\lambda_{i}t_{i}(1-s_{i})} = e^{-\mathbf{t}^{\top} \mathbf{I}({\boldsymbol \lambda}) ({\boldsymbol 1} - {\boldsymbol s})}. $$

When we substitute the above into (53) we conclude that the PGF of Y is as stated in the lemma. □

Remark 12

Note that the expression (52) generalizes (42) to the multivariate mixed Poisson case. Additionally, observe that if the components of T coincide, that is, Ti=T for i=1,…,d, we have

$$\phi_{\mathbf{T}}(\mathbf{t}) = \mathbb{E} \left(e^{-\mathbf{t}^{\top} \mathbf{T}}\right) = \mathbb{E} \left(e^{- (t_{1}+ \cdots + t_{d}) T}\right) = \phi_{T}(t_{1}+ \cdots + t_{d}), $$

and the PGF in (52) reduces to its counterpart provided in Lemma 2, as it should.

Remark 13

Let us note that changing scaling factors of the variables Ti in this model has the same effect as adjusting the rate parameters connected with the Poisson processes {Ni(·)}. Namely, it follows from Lemma 6 that if we let \(\tilde {T_{i}} = c_{i}T_{i}\) in the above setting, then we have the following equality in distribution:

$$ \left(N_{1}(\tilde{T_{1}}), \ldots, N_{d}(\tilde{T_{d}})\right)^{\top}\stackrel{d}{=} \left(\tilde{N}_{1}(T_{1}), \ldots, \tilde{N}_{d}(T_{d})\right)^{\top}, $$
(54)

where the \(\{\tilde {N}_{i}(\cdot)\}\) are independent Poisson processes with rates ciλi, respectively. Thus, without loss of generality, we may assume that the scale parameters of the variables Ti in this model are set to unity.

Next, we provide a convenient formula for the PMF of multivariate mixed Poisson distributions of Type II, which is an extension of that given in Lemma 3. To state the result, we extend the definition of the function vT described by (12) to vector-valued arguments and random vectors T in \(\mathbb {R}_{+}^{d}\). Namely, for a=(a1,…,ad), \(\mathbf {b} = (b_{1}, \ldots, b_{d})^{\top }\in \mathbb {R}_{+}^{d}\) we set

$$ \mathbf{a}^{\mathbf{b}} = \prod_{i=1}^{d} a_{i}^{b_{i}} $$
(55)

and define

$$ v_{\mathbf{T}}(\mathbf{y},{\boldsymbol \lambda}) = \mathbb{E} \left\{\mathbf{T}^{\mathbf{y}} e^{-{\boldsymbol \lambda}^{\top} \mathbf{T}} \right\}, \,\,\, {\boldsymbol \lambda}, \mathbf{y} \in \mathbb{R}_{+}^{d}. $$
(56)

Lemma 7

In the above setting, the PMF of Y is given by

$$\mathbb{P}(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^{d} g_{i}(y_{i})h(\mathbf{y}), \,\,\, \mathbf{y} = (y_{1}, \ldots, y_{d})^{\top} \in \mathbb{N}_{0}^{d}, $$

where

$$g_{i}(y) = \frac{\lambda_{i}^{y}}{y!}v_{T_{i}}(y,\lambda_{i}), \,\,\, y\in \mathbb{N}_{0}, $$

are the marginal PMFs of the {Yi} and the function h is given by

$$h(\mathbf{y}) = \frac{v_{\mathbf{T}} (\mathbf{y},{\boldsymbol \lambda})}{\prod_{i=1}^{d} v_{T_{i}}(y_{i},\lambda_{i})}, \,\,\, \mathbf{y} = (y_{1}, \ldots, y_{d})^{\top} \in \mathbb{N}_{0}^{d}. $$

Proof

By a standard conditioning argument, we have

$$ \mathbb{P}(\mathbf{Y} = \mathbf{y}) = \int_{\mathbb{R}_{+}^{d}} \mathbb{P}(N_{1}(T_{1})=y_{1}, \ldots, N_{d}(T_{d})=y_{d}|\mathbf{T} = \mathbf{t})f_{\mathbf{T}}(\mathbf{t})d\mathbf{t}, $$
(57)

where y=(y1,…,yd) and t=(t1,…,td). Further, by independence, we have

$$ \mathbb{P}(N_{1}(T_{1})=y_{1}, \ldots, N_{d}(T_{d})=y_{d}|\mathbf{T} = \mathbf{t}) = \prod_{i=1}^{d} \mathbb{P}(N_{i}(t_{i})=y_{i}). $$
(58)

Since the Ni(ti) are Poisson with parameters λiti, we have

$$ \mathbb{P}(N_{i}(t_{i})=y_{i}) = \frac{e^{-\lambda_{i} t_{i}}(\lambda_{i} t_{i})^{y_{i}}}{y_{i}!}, \,\,\, i=1,\ldots,d. $$
(59)

When we now substitute (58) - (59) into (57), then after some algebra we get

$$ \mathbb{P}(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^{d} \frac{\lambda_{i}^{y_{i}}}{y_{i}!}\int_{\mathbb{R}_{+}^{d}} e^{-\sum_{i=1}^{d}\lambda_{i}t_{i}} \prod_{i=1}^{d} t_{i}^{y_{i}} f_{\mathbf{T}}(\mathbf{t})d\mathbf{t} = \left[ \prod_{i=1}^{d} \frac{\lambda_{i}^{y_{i}}}{y_{i}!}\right] \left[ v_{\mathbf{T}} (\mathbf{y},{\boldsymbol \lambda})\right]. $$
(60)

Similarly, the marginal PMFs are given by

$$ \mathbb{P}(Y_{i} = y) = \frac{\lambda_{i}^{y}}{y!} \int_{\mathbb{R}_{+}} e^{-t \lambda_{i}}t^{y} dF_{T_{i}}(t) = \frac{\lambda_{i}^{y}}{y!} v_{T_{i}}\left(y, \lambda_{i}\right). $$
(61)

By combining (60) and (61), we obtain the result. □

We now present an alternative expression for the joint probabilities P(Y=y), which facilitates their computation if the random vector T is difficult to simulate but its PDF is readily available.

Lemma 8

In the above setting, the PMF of Y is given by

$$ \mathbb{P}(\mathbf{Y} = \mathbf{y}) = \mathbb{E}\left(f_{\mathbf{I}({\boldsymbol \lambda})\mathbf{T}}(\mathbf{W}) \right), \,\,\, \mathbf{y} = (y_{1}, \ldots, y_{d})^{\top} \in \mathbb{N}_{0}^{d}, $$
(62)

where the quantity fI(λ)T is the PDF of I(λ)T=(λ1T1,…,λdTd) and W=(W1,…,Wd) with mutually independent Wi having standard gamma distributions with shape parameters yi+1.

Proof

Proceeding as in the proof of Lemma 4, we obtain

$$ \mathbb{P}(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^{d} \frac{1}{\lambda_{i}} \int_{\mathbb{R}_{+}^{d}} \prod_{i=1}^{d} \left\{ \frac{\lambda_{i}^{y_{i}+1}}{y_{i}!} t_{i}^{(y_{i}+1)-1} e^{-\lambda_{i} t_{i}}\right\} f_{\mathbf{T}}(\mathbf{t}) d\mathbf{t}. $$
(63)

Note that the product under the integral above is the PDF of X=(X1,…,Xd), where the Xi are mutually independent gamma random variables with shape parameters yi+1 and scale parameters λi. This allows us to conclude that

$$\mathbb{P}(\mathbf{Y} = \mathbf{y}) = \mathbb{E} \left[ \prod_{i=1}^{d} \frac{1}{\lambda_{i}} f_{\mathbf{T}}(\mathbf{X}) \right] = \mathbb{E} \left[ \prod_{i=1}^{d} \frac{1}{\lambda_{i}} f_{\mathbf{T}}\left(\frac{W_{1}}{\lambda_{1}}, \ldots, \frac{W_{d}}{\lambda_{d}}\right)\right], $$

where W=(W1,…,Wd)=I(λ)X has independent standard gamma components with shape parameters yi+1. To conclude the result, observe that

$$\prod_{i=1}^{d} \frac{1}{\lambda_{i}} f_{\mathbf{T}}\left(\frac{W_{1}}{\lambda_{1}}, \ldots, \frac{W_{d}}{\lambda_{d}}\right) =f_{\mathbf{I}({\boldsymbol \lambda})\mathbf{T}}(\mathbf{W}). $$

□

Finally, let us summarize standard results concerning the mean and the covariance structure of mixed multivariate Poisson distributions of Type II, which parallel the results for Type I and are easily derived through standard conditioning arguments. Generally, whenever the means of the {Ti} exist, then so do the means of the {Yi}, and we have \(\mathbb {E}(Y_{i})=\lambda _{i} \mathbb {E}(T_{i})\). Similarly, the variance of each Yi is finite whenever Ti has a finite second moment, in which case we have \(\mathbb {V} ar(Y_{i}) = \lambda _{i} \mathbb {E}(T_{i}) + \lambda _{i}^{2}\mathbb {V} ar(T_{i})\). Again, the distribution of Yi is always over-dispersed. Finally, for any i≠j, the covariance of Yi and Yj exists and equals \(\mathbb {C} ov(Y_{i}, Y_{j}) = \lambda _{i}\lambda _{j} \mathbb {C} ov(T_{i}, T_{j})\) whenever the covariance of Ti and Tj exists. These facts are summarized in the result below.

Lemma 9

In the above setting, the mean vector of Y exists whenever the mean of T is finite, in which case we have \(\mathbb {E}(\mathbf {Y}) = \mathbf {I}({\boldsymbol \lambda }) \mathbb {E}(\mathbf {T})\), where λ=(λ1,…,λd) and I(λ) is a d×d diagonal matrix with the {λi} on the main diagonal. Moreover, if T has a finite covariance matrix ΣT then the covariance matrix of Y is well defined as well and is given by

$${\boldsymbol \Sigma}_{\mathbf{Y}} =\mathbf{I}({\boldsymbol \lambda}) \mathbf{I}(\mathbb{E}(\mathbf{T})) + \mathbf{I}({\boldsymbol \lambda}) {\boldsymbol \Sigma}_{\mathbf{T}} \mathbf{I}({\boldsymbol \lambda})^{\top}, $$

where \(\mathbf {I}(\mathbb {E}(\mathbf {T})) \) is a d×d diagonal matrix with the diagonal entries \(\{\mathbb {E}(T_{i})\}\).

Availability of data and materials

The results shown here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.

Code reproducing the BRCA data set and computational analyses is available from the corresponding author on reasonable request.

Abbreviations

BRCA:

Breast invasive carcinoma

CDF:

Cumulative distribution function

FGM:

Farlie-Gumbel-Morgenstern

L-P model:

lognormal-Poisson model

mRNA:

messenger ribonucleic acid

NB:

Negative binomial

NORTA:

NORmal To Anything

PDF:

Probability density function

PGF:

Probability generating function

PMF:

Probability mass function

RNA-seq:

RNA-sequencing

References

  • Barbiero, A., Ferrari, P. A.: An R package for the simulation of correlated discrete variables. Comm. Statist. Simul. Comput. 46(7), 5123–5140 (2017).


  • Chen, H.: Initialization for NORTA: Generation of random vectors with specified marginals and correlations. INFORMS J. Comput. 13(4), 257–360 (2001).


  • Clemen, R. T., Reilly, T.: Correlations and copulas for decision and risk analysis. Manag. Sci. 45, 208–224 (1999).


  • Demirtas, H., Hedeker, D.: A practical way for computing approximate lower and upper correlation bounds. Amer. Statist. 65(2), 104–109 (2011).


  • Johnson, N., Kotz, S., Balakrishnan, N.: Discrete Multivariate Distributions. Wiley, New York (1997).


  • Karlis, D., Xekalaki, E.: Mixed Poisson distributions. Intern. Statist. Rev. 73(1), 35–58 (2005).


  • Kozubowski, T. J., Podgórski, K.: Distributional properties of the negative binomial Lévy process. Probab. Math. Statist. 29, 43–71 (2009).


  • Madsen, L., Birkes, D.: Simulating dependent discrete data. J. Stat. Comput. Simul. 83(4), 677–691 (2013).


  • Madsen, L., Dalthorp, D.: Simulating correlated count data. Environ. Ecol. Stat. 14(2), 129–148 (2007).


  • Nelsen, R. B.: An Introduction to Copulas, 2nd edn. Springer, New York (2006).

  • Nikoloulopoulos, A. K.: Copula-based models for multivariate discrete response data. In: Copulae in Mathematical and Quantitative Finance, 231–249, Lect. Notes Stat., 213. Springer, Heidelberg (2013).


  • Nikoloulopoulos, A. K., Karlis, D.: Modeling multivariate count data using copulas. Comm. Statist. Simul. Comput. 39(1), 172–187 (2009).


  • Schissler, A. G., Piegorsch, W. W., Lussier, Y. A.: Testing for differentially expressed genetic pathways with single-subject N-of-1 data in the presence of inter-gene correlation. Stat. Methods Med. Res. 27(12), 3797–3813 (2018).


  • Solomon, D. L.: The spatial distribution of cabbage butterfly eggs. In: Roberts, H., Thompson, M. (eds.)Life Science Models Vol. 4, pp. 350–366. Springer-Verlag, New York (1983).


  • Song, W. T., Hsiao, L. -C.: Generation of autocorrelated random variables with a specified marginal distribution. In: Proceedings of 1993 Winter Simulation Conference - (WSC ’93), pp. 374–377, Los Angeles (1993). https://doi.org/10.1109/WSC.1993.718074.

  • Xiao, Q.: Generating correlated random vector involving discrete variables. Comm. Statist. Theory Methods. 46(4), 1594–1605 (2017).


  • Xiao, Q., Zhou, S.: Matching a correlation coefficient by a Gaussian copula. Comm. Statist. Theory Methods. 48(7), 1728–1747 (2019).



Acknowledgements

The authors thank the two reviewers for their comments, which helped improve the paper. We also thank Professors Walter W. Piegorsch and Edward J. Bedrick (University of Arizona) for helpful discussions.

Funding

Research reported in this publication was supported by MW-CTR-IN of the National Institutes of Health under award number 1U54GM104944.

Author information

Authors and Affiliations

Authors

Contributions

AGS and TJK conceived the study. TJK, AKP, and AGS developed the approach. ADK and AGS conducted the computational analyses. TJK, AKP, and AGS wrote the manuscript. TJK, AKP, AGS, and ADK revised the manuscript. All authors read and approved the final document.

Corresponding author

Correspondence to A. Grant Schissler.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Knudson, A.D., Kozubowski, T.J., Panorska, A.K. et al. A flexible multivariate model for high-dimensional correlated count data. J Stat Distrib App 8, 6 (2021). https://doi.org/10.1186/s40488-021-00119-y

