Open Access
A flexible multivariate model for high-dimensional correlated count data
Journal of Statistical Distributions and Applications volume 8, Article number: 6 (2021)
Abstract
We propose a flexible multivariate stochastic model for overdispersed count data. Our methodology is built upon mixed Poisson random vectors (Y_{1},…,Y_{d}), where the {Y_{i}} are conditionally independent Poisson random variables. The stochastic rates of the {Y_{i}} are multivariate distributions with arbitrary nonnegative margins linked by a copula function. We present basic properties of these mixed Poisson multivariate distributions and provide several examples. A particular case with geometric and negative binomial marginal distributions is studied in detail. We illustrate an application of our model by conducting a high-dimensional simulation motivated by RNA-sequencing data.
Introduction
As multivariate count data become increasingly common across many scientific disciplines, including economics, finance, geosciences, biology, marketing, and others, there is a growing need for flexible families of multivariate distributions with discrete margins. In particular, flexible models with correlated classical marginal distributions are in high demand in many different applied areas (see, e.g., Barbiero and Ferrari (2017); Madsen and Birkes (2013); Madsen and Dalthorp (2007); Nikoloulopoulos and Karlis (2009); Xiao (2017)). With this in mind, we propose a general method of constructing discrete multivariate distributions with certain common marginal distributions. One important example of this construction is a discrete multivariate model with correlated negative binomial (NB) components and arbitrary parameters. However, our approach is quite general and can produce families with different margins, going beyond the NB case.
One way to generate multivariate distributions with particular margins is an approach through copulas (see, e.g., Nelsen (2006)), and multivariate discrete distributions constructed through this method have been proposed in recent years (see, e.g., Barbiero and Ferrari (2017); Madsen and Birkes (2013); Nikoloulopoulos (2013); Nikoloulopoulos and Karlis (2009); Xiao (2017) and references therein). Recall that a copula is a cumulative distribution function (CDF) on [0,1]^{d}, describing a random vector with standard uniform margins. Moreover, for any random vector X=(X_{1},…,X_{d})^{⊤} with the joint CDF F and marginal CDFs F_{i} there is a copula function C(u_{1},…,u_{d}) so that

F(x_{1},…,x_{d}) = C(F_{1}(x_{1}),…,F_{d}(x_{d})). (1)
Further, for continuous distributions with marginal probability density functions (PDFs) \(f_{i}(x)=F_{i}'(x)\), the copula function C is unique, and the joint PDF of the {X_{i}} is given by

f(x_{1},…,x_{d}) = c(F_{1}(x_{1}),…,F_{d}(x_{d})) ∏_{i=1}^{d} f_{i}(x_{i}), (2)
where the function c(u_{1},…,u_{d}) is the PDF corresponding to the copula C(u_{1},…,u_{d}). However, for discrete distributions, the copula is no longer unique and there is no analogue of (2) for calculating the relevant probabilities. Using this concept, one can define a random vector Y=(Y_{1},…,Y_{d})^{⊤} in \(\mathbb {R}^{d}\) with arbitrary marginal CDFs F_{i} via

Y_{i} = F_{i}^{-1}(U_{i}), i=1,…,d, (3)
where U=(U_{1},…,U_{d})^{⊤} is a random vector with standard uniform margins and the CDF given by

P(U_{1}≤u_{1},…,U_{d}≤u_{d}) = C(u_{1},…,u_{d}), (4)
with a particular copula C. While one can use any of the multitude of different copula functions in this construction, an approach based on Gaussian copula, known as NORTA (NORmal To Anything, see, e.g., Chen (2001); Song and Hsiao (1993)), is especially popular due to its flexibility, particularly in the case of discrete distributions (see, e.g., Barbiero and Ferrari (2017); Madsen and Birkes (2013); Nikoloulopoulos (2013)).
While our approach involves copulas as well, here the copulas are connected with continuous multivariate distributions rather than discrete ones, which avoids the issues with non-uniqueness of the copula function. Additionally, compared with the direct approach (3), in our scheme the computation of relevant probabilities is straightforward. Our methodology is based on mixtures of Poisson distributions, which is a common way of obtaining discrete analogs of continuous distributions on nonnegative reals with a particular stochastic interpretation. Indeed, discrete univariate mixed Poisson distributions have proven to be useful stochastic models in many scientific fields (see, e.g., Karlis and Xekalaki (2005), where one can find a comprehensive review of these distributions with over 30 particular examples). This construction can be described through a randomly stopped Poisson process. More precisely, let \(\{N(t), t\in \mathbb {R}_{+}\}\) be a homogeneous Poisson process with rate λ>0, so that the marginal distribution of N(t) is Poisson with parameter (mean) λt. Then, for any random variable T with cumulative distribution function (CDF) F_{T}, supported on \(\mathbb {R}_{+}\), the quantity Y=N(T) is an integer-valued random variable, with distribution determined via a standard conditioning argument as follows:

P(Y=n) = ∫_{0}^{∞} [e^{-λt}(λt)^{n}/n!] dF_{T}(t), n=0,1,2,…. (5)
Many standard probability distributions on \(\mathbb {N}_{0}\) arise from this scheme. In particular, if T has a standard gamma distribution with shape parameter r>0, given by the PDF

f(t) = t^{r-1}e^{-t}/Γ(r), t>0, (6)
then Y=N(T) will have a NB distribution NB(r,p) with the probability mass function (PMF)

P(Y=y) = [Γ(r+y)/(Γ(r)y!)] p^{r}(1-p)^{y}, y=0,1,2,…, (7)
where p=1/(1+λ) (see Section 3.2 in the Appendix). As the NB model is quite important across many applications and can be extended to more general stochastic processes (see, e.g., Kozubowski and Podgórski (2009)), it shall serve as a basic example of our approach.
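This gamma-Poisson mixture is easy to check numerically. The sketch below (Python with NumPy/SciPy; the parameter values are illustrative) draws Y=N(T) with T standard gamma and compares the empirical PMF with the NB(r,p) PMF in (7):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
r, lam = 2.5, 3.0            # gamma shape and Poisson rate (illustrative)
p = 1.0 / (1.0 + lam)        # implied NB probability parameter

# Y = N(T): stop a rate-lam Poisson process at T ~ Gamma(r, 1)
T = rng.gamma(shape=r, size=200_000)
Y = rng.poisson(lam * T)

# Empirical PMF versus the NB(r, p) PMF
ks = np.arange(6)
emp = np.array([(Y == k).mean() for k in ks])
nb = stats.nbinom.pmf(ks, r, p)
print(np.abs(emp - nb).max())
```

The sample is also overdispersed (its variance exceeds its mean), as any nondegenerate mixed Poisson sample must be.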
An extension of this scheme to the multivariate case can be accomplished in two different ways, leading to mixed multivariate Poisson distributions of Kind (or Type) I and II in the terminology of Karlis and Xekalaki (2005). The former arises via

Y_{i} = N_{i}(T), i=1,…,d, (8)
where the {N_{i}(·)} are Poisson processes with rates λ_{i} and T is, as before, a random variable on \(\mathbb {R}_{+}\), independent of the {N_{i}}. While in general the marginal distributions of (N_{1}(t),…,N_{d}(t)) can be correlated multivariate Poisson (see, Johnson et al. (1997)), we shall assume that the processes {N_{i}} are mutually independent. In this case, the joint probability generating function (PGF) of the {Y_{i}} in (8) is of the form

G(s_{1},…,s_{d}) = E ∏_{i=1}^{d} s_{i}^{Y_{i}} = ϕ_{T}(∑_{i=1}^{d} λ_{i}(1-s_{i})), (9)
where ϕ_{T} is the Laplace transform (LT) of T, while the relevant probabilities can be conveniently expressed as

P(Y_{1}=y_{1},…,Y_{d}=y_{d}) = h(y_{1},…,y_{d}) ∏_{i=1}^{d} g_{i}(y_{i}), (10)
where the {g_{i}} in (10) are the marginal PMFs of the {Y_{i}}. As shown in the Appendix, the function h is of the form
where
In the case of gamma-distributed T, with shape parameter r>0 and unit scale, the functions v and h above can be evaluated explicitly (see the Appendix for details), and the above distribution is known in the literature as the multivariate negative multinomial distribution (see Chapter 36 of Johnson et al. (1997) and references therein). Since the marginal distributions in this case are NB, the distribution has also been termed multivariate NB. In cases where the function v(·,·) is not available explicitly, it can be easily evaluated numerically via Monte Carlo simulation.
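A Kind I mixture is particularly simple to simulate, since a single stopping time T is shared by all coordinates; with gamma T the result is negative multinomial, with NB margins and positive dependence. A minimal sketch (illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
r = 2.0                          # shared gamma shape
lams = np.array([1.0, 4.0])      # Poisson rates
n = 200_000

T = rng.gamma(shape=r, size=n)          # one stopping time per replicate
Y = rng.poisson(lams * T[:, None])      # Y_i = N_i(T), conditionally independent

# The shared T forces positive correlation between the coordinates
print(Y.mean(axis=0), np.corrcoef(Y.T)[0, 1])
```

Here E(Y_{i}) = λ_{i}r, Var(Y_{i}) = λ_{i}r + λ_{i}^{2}r, and Cov(Y_{1},Y_{2}) = λ_{1}λ_{2}r, so the correlation is approximately 0.63 for these values.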
Our main focus will be a more flexible family of mixed Poisson distributions of Kind II, where each Poisson process \(\{N_{i}(t), \,\, t\in \mathbb {R}_{+}\}\) is randomly stopped at a different random variable T_{i}, leading to

Y_{i} = N_{i}(T_{i}), i=1,…,d, (13)
where the random vector T=(T_{1},…,T_{d})^{⊤} follows a multivariate distribution on \(\mathbb {R}_{+}^{d}\). A special case of this construction with the {T_{i}} having correlated lognormal distributions was proposed in Madsen and Dalthorp (2007), where this model was referred to as a lognormal-Poisson hierarchy (LP model). While that particular model does not allow explicit forms for marginal PMFs, it proved useful for applications. Our generalization, which we shall refer to as a T-Poisson hierarchy, allows T in (13) to have any continuous distribution on \(\mathbb {R}_{+}^{d}\), with margins not necessarily belonging to the same parametric family. As will be seen in the sequel, the joint PMF of this more general model can still be written as in (10), with an appropriate function h. In particular, we shall work with families of distributions of T described by marginal CDFs {F_{i}} and a copula function C(u_{1},…,u_{d}). In this setup, the PMF of Y, which is still of the form (10), can be expressed as

P(Y_{1}=y_{1},…,Y_{d}=y_{d}) = E[c(F_{1}(X_{1}),…,F_{d}(X_{d}))] ∏_{i=1}^{d} g_{i}(y_{i}), (14)
where the g_{i} are the marginal PMFs of the {Y_{i}}, the function c(u_{1},…,u_{d}) is the PDF corresponding to the copula C(u_{1},…,u_{d}), and the {X_{i}} are independent random variables with certain distributions dependent on the {y_{i}}. This expression, which is an analogue of (2) for discrete multivariate distributions defined through our scheme, provides a convenient way of computing the probabilities of these multivariate distributions. This computational aspect of our construction compares favorably with the cumbersome formula for the PMF (see, e.g., Proposition 1.1 in Nikoloulopoulos and Karlis (2009)) of the competing method defined via (3).
In what follows, we explore these ideas to provide a flexible multivariate modeling framework for dependent count data, emphasizing computationally convenient expressions and scalable algorithms for high-dimensional applications. We begin by showing how multivariate count data can be generated as mixtures of Poisson distributions by developing sequences of independent Poisson processes randomly stopped at an underlying continuous real-valued random variable T (a T-Poisson hierarchy). Then we show how our T-Poisson hierarchy scheme gives rise to computationally convenient joint probability mass functions (PMFs) and how particular choices of parameters/distributions can be used to construct well-known models such as the multivariate negative binomial. Next, we describe a scalable simulation algorithm using our construction and copula theory. Two examples are provided: a basic example to produce a multivariate geometric distribution and an elaborate high-dimensional simulation study, aiming to model and simulate RNA-sequencing data. We note that our modeling framework and computationally convenient formulas may facilitate novel data analysis strategies, but we do not take up that task in the current study. We conclude with an Appendix containing selected proofs of assertions made throughout.
Multivariate mixtures of Poisson distributions
Our goal is to produce a random vector Y=(Y_{1},…,Y_{d})^{⊤} with correlated mixed Poisson components. To this end, we start with a sequence of independent Poisson processes \(\{N_{i}(t), \,\, t\in \mathbb {R}_{+}\}\), i=1,…,d, where the rate of the process N_{i}(t) is λ_{i}. Next, we let T=(T_{1},…,T_{d})^{⊤} have a multivariate distribution on \(\mathbb {R}_{+}^{d}\) with the PDF f_{T}(t). Then, we define

Y_{i} = N_{i}(T_{i}), i=1,…,d. (15)
In the terminology of Karlis and Xekalaki (2005), this is a special case of multivariate mixed Poisson distributions of Type II. Assuming that the {N_{i}(t)} are independent of T, by standard conditioning arguments (see Lemma 7 in the Appendix) we obtain

P(Y_{1}=y_{1},…,Y_{d}=y_{d}) = ∫_{\mathbb{R}_{+}^{d}} ∏_{i=1}^{d} [e^{-λ_{i}t_{i}}(λ_{i}t_{i})^{y_{i}}/y_{i}!] f_{T}(t_{1},…,t_{d}) dt_{1}⋯dt_{d}. (16)
While in some cases one can obtain explicit expressions for the above joint probabilities, in general these have to be calculated numerically. The calculations can be facilitated by certain representations of these probabilities, discussed in the Appendix (see Lemmas 7 and 8 in the Appendix).
This procedure is quite general, and leads to a multitude of multivariate discrete distributions. Flexible models allowing for marginal distributions of different types can be obtained by the popular approach with copulas. Assume that T has a continuous distribution on \(\mathbb {R}_{+}^{d}\) with marginal PDFs f_{i} and CDFs F_{i} driven by a particular copula C(u_{1},…,u_{d}), so that the joint CDF of the {T_{i}} is given by

F_{T}(t_{1},…,t_{d}) = C(F_{1}(t_{1}),…,F_{d}(t_{d})).
Then according to (2), the joint PDF f_{T} is of the form

f_{T}(t_{1},…,t_{d}) = c(F_{1}(t_{1}),…,F_{d}(t_{d})) ∏_{i=1}^{d} f_{i}(t_{i}), (17)
where the function c(u_{1},…,u_{d}) is the PDF corresponding to the copula CDF C(u_{1},…,u_{d}). When we substitute (17) into (16), we get

P(Y_{1}=y_{1},…,Y_{d}=y_{d}) = ∫_{\mathbb{R}_{+}^{d}} ∏_{i=1}^{d} [e^{-λ_{i}t_{i}}(λ_{i}t_{i})^{y_{i}}f_{i}(t_{i})/y_{i}!] c(F_{1}(t_{1}),…,F_{d}(t_{d})) dt_{1}⋯dt_{d}. (18)
Using the results presented in the Appendix (see Lemma 7 in the Appendix), one can show that the marginal PMFs of the {Y_{i}} are given by

g_{i}(y) = P(Y_{i}=y) = E[f_{λ_{i}T_{i}}(W)], y=0,1,2,…, (19)
where \(f_{\lambda _{i}T_{i}}(\cdot)\) is the PDF of λ_{i}T_{i} and W has a standard gamma distribution with shape parameter y+1. With this notation, we can write (18) in the form

P(Y_{1}=y_{1},…,Y_{d}=y_{d}) = ∏_{i=1}^{d} g_{i}(y_{i}) ∫_{\mathbb{R}_{+}^{d}} g(t|y) c(F_{1}(t_{1}),…,F_{d}(t_{d})) dt, (20)
where the quantity g(t|y) in the above integral is the joint PDF of a multivariate distribution with independent margins,

g(t|y) = ∏_{i=1}^{d} g_{i}(t_{i}|y_{i}), (21)
with

g_{i}(t|y_{i}) = e^{-λ_{i}t}(λ_{i}t)^{y_{i}}f_{i}(t)/(y_{i}! g_{i}(y_{i})), t>0. (22)
Thus, the integral in (20) can be expressed as

∫_{\mathbb{R}_{+}^{d}} g(t|y) c(F_{1}(t_{1}),…,F_{d}(t_{d})) dt = E[c(F_{1}(X_{1}),…,F_{d}(X_{d}))], (23)
where X=(X_{1},…,X_{d})^{⊤} has a multivariate distribution with independent components, governed by the PDF specified by (21)-(22). This leads to the following result.
Proposition 1
In the above setting, the joint probabilities (18) admit the representation

P(Y_{1}=y_{1},…,Y_{d}=y_{d}) = E[c(F_{1}(X_{1}),…,F_{d}(X_{d}))] ∏_{i=1}^{d} g_{i}(y_{i}), (24)
where the marginal probabilities are given by (19) and the PDF of X=(X_{1},…,X_{d})^{⊤} is given by (21)-(22).
Let us note that the joint moments of Y_{1},…,Y_{d} exist whenever their counterparts for T_{1},…,T_{d} are finite, in which case they can be evaluated by standard conditioning arguments. In particular, the mean and the covariance matrix of Y are related to their counterparts for T in a simple way, specified by Lemma 9 in the Appendix. It follows that \(\mathbb {E}Y_{i} = \lambda _{i} \mathbb {E} T_{i}\) and \(\mathrm{Var}(Y_{i}) = \lambda _{i} \mathbb {E} T_{i} + \lambda _{i}^{2} \mathrm{Var}(T_{i})\), so the distributions of the {Y_{i}} are always overdispersed. Moreover, we have

Cov(Y_{i},Y_{j}) = λ_{i}λ_{j} Cov(T_{i},T_{j}), i≠j,
so that the correlation coefficient of Y_{i} and Y_{j} (if it exists) is related to that of T_{i} and T_{j} as follows:

ρ_{Y_{i},Y_{j}} = c_{i,j} ρ_{T_{i},T_{j}}, (25)
where

c_{i,j} = [λ_{i}^{2}Var(T_{i})/(λ_{i}E(T_{i})+λ_{i}^{2}Var(T_{i}))]^{1/2} [λ_{j}^{2}Var(T_{j})/(λ_{j}E(T_{j})+λ_{j}^{2}Var(T_{j}))]^{1/2}. (26)
Remark 1
While in general the correlation can be positive as well as negative and admits the same range as its counterpart for T_{i} and T_{j}, the range of possible correlations of Y_{i} and Y_{j} can be further restricted if the margins are fixed. The maximum and minimum correlations can be deduced from (25)-(26) and the range of correlation corresponding to the joint distribution of T_{i} and T_{j}. The latter is provided by the minimum and the maximum correlations, corresponding to the lower and the upper Fréchet copulas,

C_{L}(u_{1},u_{2}) = max(u_{1}+u_{2}-1, 0), C_{U}(u_{1},u_{2}) = min(u_{1},u_{2}). (27)
The upper bound for the correlation is obtained when the distribution of (T_{i},T_{j}) is driven by the upper Fréchet copula C_{U} in (27), so that \(T_{i}\stackrel {d}{=}F_{i}^{-1} (U)\) and \(T_{j}\stackrel {d}{=}F_{j}^{-1}(U)\), where U is standard uniform and F_{i}(·), F_{j}(·) are the CDFs of T_{i}, T_{j}, respectively. Similarly, the lower bound for the correlation is obtained when the distribution of (T_{i},T_{j}) is driven by the lower Fréchet copula C_{L} in (27), where we have \(T_{i}\stackrel {d}{=}F_{i}^{-1} (U)\) and \(T_{j}\stackrel {d}{=}F_{j}^{-1}(1-U)\). While these correlation bounds are usually not available explicitly, they can be easily obtained by Monte Carlo approximation via simulation from these (degenerate) probability distributions or by other standard approximate methods (see, e.g., Demirtas and Hedeker (2011), and references therein).
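These Monte Carlo approximations take only a few lines. The sketch below (Python; the gamma margins for T_{i}, T_{j} and the rates are illustrative assumptions) estimates the attainable correlation bounds for (Y_{i},Y_{j}) by driving (T_{i},T_{j}) with the upper and lower Fréchet copulas:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
ri, rj = 2.0, 5.0        # gamma shape parameters (illustrative)
li, lj = 1.0, 2.0        # Poisson rates (illustrative)
U = rng.uniform(size=200_000)

def corr_y(Ti, Tj):
    # Conditionally independent Poisson counts given the stopping times
    Yi, Yj = rng.poisson(li * Ti), rng.poisson(lj * Tj)
    return np.corrcoef(Yi, Yj)[0, 1]

# Upper Frechet copula: comonotone T_i, T_j; lower: countermonotone
hi = corr_y(stats.gamma.ppf(U, ri), stats.gamma.ppf(U, rj))
lo = corr_y(stats.gamma.ppf(U, ri), stats.gamma.ppf(1.0 - U, rj))
print(lo, hi)
```

Any correlation between these two estimated endpoints is attainable with the fixed margins by a suitable copula choice.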
Remark 2
We note that when a bivariate random vector Y=(Y_{1},Y_{2})^{⊤} is defined via (15) and the distribution of the corresponding T=(T_{1},T_{2})^{⊤} is driven by one of the copulas in (27), then the distribution of T is not absolutely continuous and the above derivations leading to the PMF of Y need a modification. It can be shown that in this case the marginal distributions of the Y_{i} are still given by (19), while the joint PMF of (Y_{1},Y_{2})^{⊤} is also as in (20) with d=2, but with the integral term replaced with

∫_{0}^{1} g_{1}(u|y_{1})g_{2}(u|y_{2}) du or ∫_{0}^{1} g_{1}(u|y_{1})g_{2}(1-u|y_{2}) du, (28)
under the upper and the lower Fréchet copula cases, respectively, where the g_{i}(·|y) in (28) are PDFs on (0,1) given by

g_{i}(u|y) = e^{-λ_{i}F_{i}^{-1}(u)}(λ_{i}F_{i}^{-1}(u))^{y}/(y! g_{i}(y)), 0<u<1. (29)
Again, while the integrals in (28) are rarely available explicitly, they can easily be approximated by Monte Carlo simulation in order to compute the joint PMF of Y=(Y_{1},Y_{2})^{⊤}. These two “extreme” distributional cases can also be used to derive the full range of values for the correlation of Y=(Y_{1},Y_{2})^{⊤} when the marginal distributions (19) are fixed, if needed.
2.1 Mixed Poisson distributions with NB margins
We now consider the case where the mixed Poisson marginal distributions of Y are NB, so that the marginal distributions of T are gamma (see Lemma 1 in the Appendix). Thus, we shall assume that the coordinates of the random vector T have univariate standard gamma distributions with shape parameters \(r_{i}\in \mathbb {R}_{+}\), i=1,…,d. There have been numerous multivariate gamma distributions developed over the years, and we could use any of them here. However, we follow the general approach based on copulas, discussed above. Thus, we assume that the dependence structure of T is governed by some copula function C(u_{1},…,u_{d}), which admits the PDF c(u_{1},…,u_{d}). In this case, the f_{i} in (18) are given by (6) with r=r_{i}, and the F_{i} are the corresponding CDFs. Here, the marginal PMFs of the {Y_{i}} in (19) are given by

g_{i}(y) = [Γ(r_{i}+y)/(Γ(r_{i})y!)] p_{i}^{r_{i}}(1-p_{i})^{y}, y=0,1,2,…, (30)
where the NB probabilities are given by p_{i}=1/(1+λ_{i})∈(0,1) (so that λ_{i}=(1-p_{i})/p_{i}>0). Further, the PDF of X in Proposition 1 is still given by (21), where the marginal PDFs g_{i}(·|y_{i}) now admit the explicit expressions

g_{i}(t|y_{i}) = (1+λ_{i})^{y_{i}+r_{i}} t^{y_{i}+r_{i}-1} e^{-(1+λ_{i})t}/Γ(y_{i}+r_{i}), t>0. (31)
We recognize these as gamma PDFs with shape parameters y_{i}+r_{i} and rate parameters 1+λ_{i}. Thus, in this special case of multivariate mixed Poisson distributions of Type II with NB marginal distributions, the random vector X in the representation (14) has a multivariate gamma distribution as well, but with independent margins. This fact is summarized in the result below.
Corollary 1
Let Y have a mixed Poisson distribution defined via (15), where the {N_{i}(·)} are independent Poisson processes with respective rates λ_{i} and T has a multivariate gamma distribution with standard gamma margins with shape parameters r_{i} and CDFs F_{i}, governed by a copula PDF c(u). Then, the marginal PMFs of Y are given by (30) with p_{i}=1/(1+λ_{i})∈(0,1), and its joint PMF is given by (14), where X=(X_{1},…,X_{d})^{⊤} has a multivariate gamma distribution with independent gamma marginal distributions of the {X_{i}}, with PDFs given by (31).
Remark 3
If the expectation in (14) does not admit an explicit form in terms of the y_{1},…,y_{d}, one can approximate its value via a straightforward Monte Carlo approximation involving generation of independent gamma random variates {X_{i}}.
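As a concrete sketch of this approximation (Python; the shapes, rates, and the Gaussian copula with ρ=0.5 are illustrative assumptions), the joint PMF of a bivariate NB model is estimated as the product of the NB margins times a Monte Carlo estimate of the copula-density expectation, with the X_{i} drawn as independent gammas with shapes y_{i}+r_{i} and rates 1+λ_{i}, and is then checked against a direct simulation of the model:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
r = np.array([2.0, 3.0])       # gamma shapes, i.e., NB sizes (illustrative)
lam = np.array([1.0, 2.0])     # Poisson rates
p = 1.0 / (1.0 + lam)          # NB probabilities
R = np.array([[1.0, 0.5], [0.5, 1.0]])   # Gaussian copula correlation
A = np.linalg.inv(R) - np.eye(2)

def copula_pdf(u):
    # Gaussian copula density evaluated at the columns of u (2 x n)
    q = stats.norm.ppf(u)
    quad = np.einsum('in,ij,jn->n', q, A, q)
    return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(R))

def joint_pmf(y, n=200_000):
    # X_i ~ Gamma(shape y_i + r_i, rate 1 + lam_i), independent
    X = np.vstack([rng.gamma(y[i] + r[i], 1.0 / (1.0 + lam[i]), size=n)
                   for i in range(2)])
    u = stats.gamma.cdf(X, r[:, None])          # standard gamma marginal CDFs
    marg = stats.nbinom.pmf(y, r, p).prod()     # product of NB margins
    return marg * copula_pdf(u).mean()

# Check against a direct simulation of the mixed Poisson model
n = 400_000
Z = rng.multivariate_normal(np.zeros(2), R, size=n).T
T = stats.gamma.ppf(stats.norm.cdf(Z), r[:, None])
Y = rng.poisson(lam[:, None] * T)
est = joint_pmf(np.array([1, 2]))
emp = ((Y[0] == 1) & (Y[1] == 2)).mean()
print(est, emp)
```

The two numbers agree up to Monte Carlo error, illustrating how the representation turns joint-probability evaluation into an ordinary expectation.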
Let us note that since the {T_{i}} have standard gamma distributions with shape parameters r_{i}, we have \(\mathbb {E}(T_{i}) = \mathrm{Var}(T_{i}) = r_{i}\), and an application of Lemma 9 leads to the following result.
Proposition 2
Let Y have a mixed Poisson distribution defined via (15), where the {N_{i}(·)} are independent Poisson processes with respective rates λ_{i} and T has a multivariate gamma distribution with standard gamma margins with shape parameters r_{i} and CDFs F_{i}, governed by a copula PDF c(u). Then, \(\mathbb {E}(\mathbf {Y}) = \mathbf {I}({\boldsymbol \lambda }) \mathbf {r}\), where r=(r_{1},…,r_{d})^{⊤} and I(λ) is a d×d diagonal matrix with the {λ_{i}} on the main diagonal. Moreover, the covariance matrix of Y is given by

Σ_{Y} = I(λ)I(r) + I(λ)Σ_{T}I(λ),
where Σ_{T} is the covariance matrix of T and I(r) is a d×d diagonal matrix with the {r_{i}} on the main diagonal.
Remark 4
The correlation of Y_{i} and Y_{j} is still given by (25), where this time

c_{i,j} = [λ_{i}λ_{j}/((1+λ_{i})(1+λ_{j}))]^{1/2},
since in (26) we have \(\mathbb {E}(T_{i}) = \mathrm{Var}(T_{i})\). Let us note that while in principle the quantities c_{i,j} can assume any value in (0,1) when we choose appropriate λ_{i} and λ_{j}, they are fixed for particular marginal NB distributions, since in this model the NB probabilities are given by p_{i}=1/(1+λ_{i}). In terms of the latter, we have

c_{i,j} = [(1-p_{i})(1-p_{j})]^{1/2}.
These quantities, along with the full range of correlations for \(\rho _{T_{i}, T_{j}}\) in (25), can be used to obtain upper and lower bounds for the possible correlations of Y_{i} and Y_{j} in this model. We note that the possible range of \(\rho _{T_{i}, T_{j}}\) depends on the shape parameters r_{i} and r_{j}. If the {T_{i}} are exponential (so that r_{i}=r_{j}=1), then the upper limit of their correlation can be shown to be 1. However, the full range for the correlation of T_{i} and T_{j} is usually a proper subset of [−1,1], which can be approximated by Monte Carlo simulations (see Remarks 1-2) or other approximate methods (see, e.g., Demirtas and Hedeker (2011)).
2.2 Simulation
One particular way of defining this model, convenient for simulations, is by using the Gaussian copula to generate T. This is a very popular methodology due to its flexibility and the ease of simulating from the required multivariate normal distribution. The Gaussian copula is the one that corresponds to a multivariate normal distribution with standard normal marginal distributions and covariance matrix R. Since the marginals are standard normal, R is also the correlation matrix. If F_{R} is the CDF of such a multivariate normal distribution, then the corresponding Gaussian copula C_{R} is defined through

C_{R}(u_{1},…,u_{d}) = F_{R}(Φ^{-1}(u_{1}),…,Φ^{-1}(u_{d})), (32)
where Φ(·) is the standard normal CDF. Note that the copula C_{R} is simply the CDF of the random vector (Φ(X_{1}),…,Φ(X_{d}))^{⊤}, where (X_{1},…,X_{d})^{⊤}∼N_{d}(0,R). If the distribution is continuous (so that R is nonsingular), the copula C_{R} admits the PDF c_{R}, given by

c_{R}(u_{1},…,u_{d}) = |R|^{-1/2} exp(-(1/2) Φ^{-1}(u)^{⊤}(R^{-1}-I_{d})Φ^{-1}(u)),
where Φ^{−1}(u)=(Φ^{−1}(u_{1}),…,Φ^{−1}(u_{d}))^{⊤} and I_{d} is the d×d identity matrix. This c_{R} is then used in equations (20), (23), and (14). Simulation of a multivariate gamma T with margins F_{i} based on this copula is quite simple, and involves the following steps:

(i) Generate X=(X_{1},…,X_{d})^{⊤}∼N_{d}(0,R);

(ii) Transform X to U=(U_{1},…,U_{d})^{⊤} via U_{i}=Φ(X_{i}), i=1,…,d;

(iii) Return T=(T_{1},…,T_{d})^{⊤}, where \(T_{i}=F_{i}^{-1}(U_{i})\), i=1,…,d.
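Steps (i)-(iii) take only a few lines in practice. The following sketch (Python with NumPy/SciPy; the shape parameters and copula correlation matrix are illustrative assumptions) produces T with standard gamma margins and Gaussian-copula dependence:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
r = np.array([1.5, 2.0, 4.0])                  # target gamma shape parameters
R = np.array([[1.0, 0.4, 0.2],
              [0.4, 1.0, 0.3],
              [0.2, 0.3, 1.0]])                # Gaussian copula correlation

n = 100_000
X = rng.multivariate_normal(np.zeros(3), R, size=n)   # (i) draw N_d(0, R)
U = stats.norm.cdf(X)                                  # (ii) to uniform margins
T = stats.gamma.ppf(U, r)                              # (iii) to gamma margins

print(T.mean(axis=0))   # close to r, since a standard gamma has mean r_i
```

A final line `Y = rng.poisson(lam * T)` (for rates `lam`) would then complete the T-Poisson hierarchy of (15).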
Remark 5
This strategy of using a Gaussian copula to generate multivariate distributions is quite popular indeed, and it became known in the literature as the NORTA (NORmal To Anything) method (see, e.g., Chen (2001); Song and Hsiao (1993)). This methodology has recently been used to generate multivariate discrete distributions, see, e.g., Barbiero and Ferrari (2017), Madsen and Birkes (2013), or Nikoloulopoulos (2013) and references therein. The standard approach discussed in these papers proceeds by simulating the vector U from the Gaussian copula following steps (i)-(ii) above and then transforming the coordinates of U directly via the inverse CDFs of the components of the target random vector Y=(Y_{1},…,Y_{d})^{⊤}, which can be described as

(iii)′ Return Y=(Y_{1},…,Y_{d})^{⊤}, where \(Y_{i}=G_{i}^{-1}(U_{i})\), i=1,…,d.
Here, the G_{i} are the CDFs of the Y_{i}. If the distributions of the Y_{i} are discrete (such as NB), the inverse CDF is defined in the standard way as

\(G_{i}^{-1}(u) = \min \{y \in \mathbb{N}_{0}: G_{i}(y) \geq u\}\), u∈(0,1).
The difference between our approach and the one discussed in the literature above lies in the final step, regardless of the particular copula c that is used. In the standard approach, one first simulates U from c and then proceeds via (iii)′ above to get the target random vector Y (having a multivariate distribution with marginal CDFs G_{i}). Our proposal, in contrast, is to first generate T via step (iii) above and then obtain the target vector via (15). While our methodology involves an extra step compared with the direct method, it offers a simple way of calculating the joint probabilities, which is not available in the other approach. Additionally, our methodology offers a stochastic explanation of the resulting distributions via the mixing mechanism and its relation to the underlying Poisson processes, which is lacking in the somewhat artificial standard approach. Another advantage of the mixed Poisson approach is the possibility of extensions to more general stochastic processes in the spirit of the NB process studied by Kozubowski and Podgórski (2009). Its disadvantage is that not all discrete marginal distributions can be obtained, only those that are mixed Poisson to begin with.
Remark 6
Let us note that the mixed Poisson approach to generating multivariate distributions was used in Madsen and Dalthorp (2007), where Y was obtained via (15) with standard Poisson processes and T=e^{X}, with X multivariate normal with mean μ=(μ_{1},…,μ_{d})^{⊤} and covariance matrix Σ=[σ_{i,j}]. Since in this case the marginals of T have lognormal distributions, the authors referred to this construction as a lognormal-Poisson hierarchy. This can be seen as a special case of our scheme, with \(\phantom {\dot {i}\!}\lambda _{i}=e^{\mu _{i}}\) and the marginal CDFs of the T_{i} of the form \(F_{i}(t) = \Phi (\log t/\sigma _{ii}^{1/2})\). The copula PDF of the {T_{i}} is the Gaussian copula (32), where R is the correlation matrix corresponding to Σ.
An important aspect of this problem is how to set the parameters of the underlying copula function so that the distribution of Y has given characteristics, such as the means and the covariances (or correlations). When a Gaussian copula is used, this amounts to determining the correlation matrix R. This problem arises in the general scheme (i)-(iii) as well, and has been discussed in the literature (see, e.g., Barbiero and Ferrari (2017); Xiao (2017); Xiao and Zhou (2019)). Generally, there is no simple relation between R and the correlation matrix of T in (i)-(iii). However, other measures of association, such as Kendall's τ or Spearman's ρ, do transfer directly and may be preferable in our setup. These issues will be the subject of future research.
Examples
We provide two examples. The first example describes the T-Poisson hierarchy approach to construct a multivariate geometric distribution. Second, we demonstrate how the T-Poisson hierarchy can be used to conduct a high-dimensional (d=1026) simulation study inspired by RNA-sequencing data, a challenging computational task.
3.1 Multivariate geometric distributions
Suppose that the random vector T in (15) has marginal standard exponential distributions, so that the marginal CDFs of the {T_{i}} are of the form

F_{i}(t) = 1-e^{-t}, t>0, i=1,…,d. (33)
In this case, the {Y_{i}} have geometric distributions with parameters p_{i}=1/(1+λ_{i}), so that

g_{i}(y) = P(Y_{i}=y) = p_{i}(1-p_{i})^{y}, y=0,1,2,…. (34)
One can then obtain a multitude of multivariate distributions with geometric margins by selecting various copulas for the underlying distribution of T. As an example, consider the case of the Farlie-Gumbel-Morgenstern (FGM) copula driven by a parameter θ∈[−1,1], given by

C_{θ}(u_{1},…,u_{d}) = ∏_{i=1}^{d} u_{i} [1+θ∏_{i=1}^{d}(1-u_{i})]. (35)
Consider the two-dimensional case d=2, where the PDF corresponding to (35) is of the form

c_{θ}(u_{1},u_{2}) = 1+θ(1-2u_{1})(1-2u_{2}), 0<u_{1},u_{2}<1. (36)
In this case, the random vector X=(X_{1},X_{2})^{⊤} in Corollary 1 has independent gamma margins (31) with shape parameters y_{i}+1 and rate parameters 1+λ_{i}, i=1,2. Using this fact, coupled with (33), one can evaluate the expectation in (14), leading to

E[c_{θ}(F_{1}(X_{1}),F_{2}(X_{2}))] = 1+θ[2(1+p_{1})^{-(y_{1}+1)}-1][2(1+p_{2})^{-(y_{2}+1)}-1]. (37)
In view of Corollary 1, this leads to the following expression for the joint probabilities of the bivariate geometric distribution defined by our scheme via the FGM copula:

P(Y_{1}=y_{1},Y_{2}=y_{2}) = p_{1}(1-p_{1})^{y_{1}} p_{2}(1-p_{2})^{y_{2}} {1+θ[2(1+p_{1})^{-(y_{1}+1)}-1][2(1+p_{2})^{-(y_{2}+1)}-1]}, y_{1},y_{2}=0,1,2,…. (38)
We shall denote this distribution by GEO(p_{1},p_{2},θ). When θ=0, the {Y_{i}} are independent geometric variables with parameters p_{i}∈(0,1), i=1,2. Otherwise, Y_{1},Y_{2} are correlated, with

Cov(Y_{1},Y_{2}) = θ(1-p_{1})(1-p_{2})/(4p_{1}p_{2}),
as can be verified by routine, albeit tedious, algebra. In turn, the correlation of Y_{1},Y_{2} becomes

ρ_{Y_{1},Y_{2}} = (θ/4)[(1-p_{1})(1-p_{2})]^{1/2},
and can generally take any value in the range (−1/4,1/4).
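Sampling from GEO(p_{1},p_{2},θ) is straightforward by conditional inversion of the FGM copula, whose conditional CDF C_{2|1}(v|u) = v[1+θ(1−2u)(1−v)] can be inverted by solving a quadratic. A sketch (Python; the rates are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
theta = 0.8
lam1, lam2 = 2.0, 3.0            # illustrative rates; p_i = 1/(1 + lam_i)
n = 400_000

# FGM sampling by conditional inversion: solve
#   C_{2|1}(v | u) = v*(1 + a*(1 - v)) = w,   a = theta*(1 - 2u),
# a quadratic in v (take the root lying in [0, 1])
u = rng.uniform(size=n)
w = rng.uniform(size=n)
a = theta * (1.0 - 2.0 * u)
a_safe = np.where(np.abs(a) < 1e-12, 1.0, a)   # guard against a ~ 0
v = np.where(np.abs(a) < 1e-12, w,
             (1.0 + a - np.sqrt((1.0 + a) ** 2 - 4.0 * a * w)) / (2.0 * a_safe))

# Standard exponential margins for T, then conditionally Poisson counts
T1, T2 = -np.log(1.0 - u), -np.log(1.0 - v)
Y1, Y2 = rng.poisson(lam1 * T1), rng.poisson(lam2 * T2)
print(np.corrcoef(Y1, Y2)[0, 1])
```

For these values p_{1}=1/3 and p_{2}=1/4, so the empirical correlation should be near (0.8/4)[(2/3)(3/4)]^{1/2} ≈ 0.141, well inside (−1/4, 1/4).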
3.2 Simulating RNA-seq data
This section describes how to simulate data using a T-Poisson hierarchy, aiming to replicate the structure of high-dimensional dependent count data. In fact, simulating RNA-sequencing (RNA-seq) data is one of the primary motivating applications of the proposed methodology, which seeks scalable Monte Carlo methods for realistic multivariate simulation (for example, see Schissler et al. (2018)).
The RNA-seq data generating process involves counting how often a particular messenger RNA (mRNA) is expressed in a biological sample. Since this is a counting process with no upper bound, many modeling approaches use discrete random variables with infinite support. Often the counts exhibit overdispersion, and so the negative binomial arises as a sensible model for the expression levels (gene counts). Moreover, the counts are correlated (coexpressed) and cannot be assumed to behave independently. RNA-seq platforms quantify the entire transcriptome in one experimental run, resulting in high-dimensional data. In humans, this results in count data corresponding to over 20,000 genes (coding genomic regions), or even over 77,000 isoforms when alternatively spliced mRNA are counted. This suggests that simulating high-dimensional multivariate NB counts with heterogeneous marginals would be a useful tool in the development and evaluation of RNA-seq analytics.
As an illustration of our proposed methodology applied to real data, we seek to simulate RNA-sequencing data by producing simulated random vectors generated from the Type II T-Poisson framework (as in Eq. (13)). Our goal is to replicate the structure of a breast cancer data set (BRCA: breast cancer invasive carcinoma data set from The Cancer Genome Atlas). For simplicity, we begin by filtering to retain the top 5% highest-expressing genes of the 20,501 gene measurements from N=1212 patients' tumor samples, resulting in d=1026 genes. All these genes exhibit overdispersion, and so we proceed to estimate the NB parameters (r_{i},p_{i}), i=1,…,d, to determine the target marginal PMFs g_{i}(y_{i}) (via the method of moments). Notably, the \(\hat {p}_{i}\)'s are small, ranging in [3.934×10^{−6},1.217×10^{−2}]. To complete the simulation algorithm inputs, we estimate the Pearson correlation matrix R_{Y} and set that as the target correlation.
With the simulation targets specified, we proceed to simulate B=10,000 random vectors Y=(Y_{1},…,Y_{d})^{⊤} with target Pearson correlation R_{Y} and marginal PMFs g_{i}(y_{i}) using a T-Poisson hierarchy of Kind II. Specifically, we first employ the direct Gaussian copula approach to generate B random vectors following a multivariate gamma distribution T with standard gamma margins, shape parameters r_{i} equal to the target NB sizes, and Pearson correlation matrix R_{T}. Care must be taken when specifying R (refer to Eq. (32)): we employ Eq. (25) to compute the scaling factors c_{i,j} and adjust the underlying correlations to ultimately match the target R_{Y}. Notably, of the 525,825 pairwise correlations from the 1026 genes, no scale factor was less than 0.9907, indicating that the model can produce essentially the entire range of possible correlations. Here we are satisfied with approximate matching of the specified gamma correlation and set R=R_{T} in our Gaussian copula scheme (R indicating the specified multivariate Gaussian correlation matrix). Finally, we generate the desired random vectors Y_{i}=N_{i}(T_{i}) by simulating Poisson counts with expected value μ_{i}=λ_{i}×T_{i}, for i=1,…,d (with λ_{i}=(1-p_{i})/p_{i}), and repeat B=10,000 times.
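The method-of-moments step that sets the simulation targets can be sketched as follows (Python; the synthetic counts stand in for a single gene's expression, and the parameter values are illustrative). For a sample mean m and variance s² > m, the NB estimates are p̂ = m/s² and r̂ = m²/(s² − m):

```python
import numpy as np

rng = np.random.default_rng(7)

def nb_mom(counts):
    # Method-of-moments NB fit; valid only for overdispersed samples (s2 > m)
    m, s2 = counts.mean(), counts.var(ddof=1)
    return m * m / (s2 - m), m / s2   # (r_hat, p_hat)

# Synthetic overdispersed counts from the gamma-Poisson (NB) model
r_true, p_true = 3.0, 0.002           # small p, as for highly expressed genes
lam = (1.0 - p_true) / p_true
y = rng.poisson(lam * rng.gamma(r_true, size=50_000))
r_hat, p_hat = nb_mom(y)
print(r_hat, p_hat)
```

Applying `nb_mom` gene by gene yields the target sizes r_{i} and probabilities p_{i} that feed the copula-gamma-Poisson pipeline described above.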
Figure 1 shows the results of our simulation by comparing the specified target parameters (horizontal axes) with the corresponding quantities estimated from the simulated data (vertical axes). The evaluation shows that the simulated counts approximately match the target parameters and exhibit the full range of correlations estimated from the data. Utilizing 15 CPU threads on a MacBook Pro with a 2.4 GHz 8-core Intel Core i9 processor, the simulation completed in under 30 seconds.
Appendix
Gamma-Poisson mixtures
For the convenience of the reader, we include a short proof of the well-known fact that a Poisson distribution with a gamma-distributed parameter is NB (see, e.g., Solomon (1983)).
Lemma 1
If \(\{N(t), t\in \mathbb {R}_{+}\}\) is a homogeneous Poisson process with rate λ=(1−p)/p>0 and T is an independent standard gamma variable with shape parameter r, then the randomly stopped process, Y=N(T), has a NB distribution NB(r,p) with the PMF (7).
Proof
Suppose that T has a standard gamma distribution with the PDF (6) and the corresponding CDF F_{T}. When we substitute the latter into (5), we obtain

P(Y=n) = ∫_{0}^{∞} [e^{-λt}(λt)^{n}/n!] [t^{r-1}e^{-t}/Γ(r)] dt, n=0,1,2,….
After some algebra, this produces

P(Y=n) = [Γ(n+r)/(n!Γ(r))] [λ^{n}/(1+λ)^{n+r}] ∫_{0}^{∞} [(1+λ)^{n+r}/Γ(n+r)] t^{n+r-1}e^{-(1+λ)t} dt.
Since the integrand above is the PDF of a gamma distribution with shape n+r and rate 1+λ, the integral equals 1 and we have

P(Y=n) = [Γ(n+r)/(n!Γ(r))] (1/(1+λ))^{r} (λ/(1+λ))^{n}, n=0,1,2,…,
which we recognize as the NB probability from (7) with p=(1+λ)^{−1}. The result follows when we set λ=(1−p)/p in the above analysis. □
Mixed multivariate Poisson distributions of Type I
Here we provide basic distributional facts about mixed multivariate Poisson distributions of Type I, which are the distributions of Y=(Y_{1},…,Y_{d})^{⊤}=(N_{1}(T),…,N_{d}(T))^{⊤}, where the {N_{i}(·)} are independent Poisson processes with rates λ_{i} and T is a random variable on \(\mathbb {R}_{+}\), independent of the {N_{i}}.
Lemma 2
In the above setting, the PGF of Y is given by
where ϕ_{T} is the LT of T.
Proof
By a standard conditioning argument, we have
Since given T=t the variables {Y_{i}} are independent and Poisson distributed with means {λ_{i}t}, respectively, we have
When we substitute the above into (41) we conclude that the PGF of Y is indeed of the form stated above. □
Remark 7
Note that in the one-dimensional case d=1, we recover the well-known formula for the PGF of Y=N(T),
where λ>0 is the rate of the Poisson process \(\{N(t),\,\, t\in \mathbb {R}_{+}\}\). If we further assume that T is standard gamma distributed with shape parameter r>0, so that
and we take λ=(1−p)/p, we obtain
We recognize this as the PGF of the NB distribution NB(r,p), as it should be according to Lemma 1. Similarly, the PGF of a ddimensional mixed Poisson distribution with such a gamma distributed T takes on the form
where P_{i}=λ_{i} and \(Q=1+\sum _{i=1}^{d} P_{i}\). This is a general form of multivariate negative multinomial distribution (see Chapter 36 of Johnson et al. (1997)). Since the PGF of the marginal distributions of Y_{i} in this setting is of the form (43) with p=(1+λ_{i})^{−1}, all marginal distributions are NB. Due to this property, discrete multivariate distributions with the above PGFs have been termed multivariate NB distributions (for more details, see Johnson et al. (1997)).
Remark 8
Let us note that changing a scaling factor of the variable T in this model has the same effect as adjusting the rate parameters connected with the Poisson processes {N_{i}(·)}. Namely, it follows from Lemma 2 that if we let \(\tilde {T} = cT\) in the above setting, then we have the following equality in distribution:
where the \(\{\tilde {N}_{i}(\cdot)\}\) are independent Poisson processes with rates cλ_{i}, respectively. Thus, without loss of generality, we may assume that the scale parameter of the variable T in this model is set to unity.
Lemma 3
In the above setting, the PMF of Y is given by
where
are the marginal PMFs of the {Y_{i}},
and the function h is given by
Proof
Since given T=t the variables {Y_{i}} are independent and Poisson distributed with means {λ_{i}t}, respectively, by a standard conditioning argument, followed by some algebra, we have
Similarly, the marginal PMFs are given by
By combining (45) and (46), we obtain the result. □
Remark 9
Note that the joint PMF of Y can be also written as
which is a convenient expression for approximating these probabilities by Monte Carlo simulations if the function v_{T}(·,·) is not available explicitly and the random variable T is straightforward to simulate. We also note that whenever the marginal PMFs of Y_{i} are explicit, then so is the function v_{T}(·,·), which is clear from Lemma 3. For example, if T is standard gamma with shape parameter r, then we have
where Y has a NB distribution with parameters r and p=1/(1+λ).
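For example, with a gamma-distributed T the Monte Carlo recipe reads as follows. The rates λ_i, the shape r, and the evaluation point y are hypothetical, and the closed-form negative multinomial probability (obtained by integrating the gamma density against the product of Poisson PMFs, exactly as in the proof of Lemma 1) serves as a benchmark:

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

rng = np.random.default_rng(2)
lam = np.array([1.0, 2.0])      # hypothetical Poisson rates lambda_i
r = 2.5                         # shape of the standard gamma mixing variable T
y = np.array([1, 3])            # evaluate P(Y_1 = 1, Y_2 = 3)

# P(Y = y) = E[ prod_i PoissonPMF(y_i; lambda_i * T) ], averaged over draws of T.
T = rng.gamma(shape=r, size=500_000)
est = np.prod(stats.poisson.pmf(y, lam * T[:, None]), axis=1).mean()

# Closed-form benchmark: integrating out gamma T gives the negative
# multinomial probability with n = sum(y_i) and lambda = sum(lambda_i).
n, L = y.sum(), lam.sum()
exact = np.exp(gammaln(r + n) - gammaln(r) - gammaln(y + 1).sum()
               + (y * np.log(lam)).sum() - (r + n) * np.log(1 + L))
print(est, exact)
```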
Next, we present an alternative expression for the joint probabilities P(Y=y), which provides a convenient formula for their computation whenever the variable T is difficult to simulate but its PDF is easy to compute. This representation involves a multinomial random vector N=(N_{1},…,N_{d})^{⊤} with parameters n and p=(p_{1},…,p_{d})^{⊤}, denoted by MUL(n,p), where \(n\in \mathbb {N}\) represents the number of trials, the {p_{i}} represent event probabilities that sum up to one, and
Lemma 4
In the above setting, the PMF of Y is given by
where \(\lambda =\sum _{i=1}^{d} \lambda _{i}\), N∼MUL(n,p) with \(n = \sum _{i=1}^{d}y_{i}\) and p_{i}=λ_{i}/λ, the quantity f_{λT} is the PDF of λT, and W has standard gamma distribution with shape parameter n+1.
Proof
Proceeding as in the proof of Lemma 3, we obtain
Since the integrand is the product of f_{T}(t) and the density of gamma random variable X with shape parameter n+1 and scale λ, we have
where W=λX has standard gamma distribution with shape parameter n+1 (and scale 1). To conclude the result, observe that the expression
in (50) coincides with the multinomial probability (48) with p_{i}=λ_{i}/λ while
□
Remark 10
Note that in the one-dimensional case d=1 the multinomial probability in (49) reduces to 1, and we obtain
where \(Y \stackrel {d}{=} N(T)\), \(\{N(t),\,\, t\in \mathbb {R}_{+}\}\) is a Poisson process with rate λ, the quantity f_{λT} is the PDF of λT, the variable W has standard gamma distribution with shape parameter y+1, and T is independent of the Poisson process.
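This univariate identity can be checked numerically when T is standard gamma, since Lemma 1 then supplies an exact NB benchmark (the parameter values below are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
lam, r = 2.0, 3.0               # hypothetical Poisson rate and gamma shape of T
y = 2                           # evaluate P(Y = y)

# Identity: P(Y = y) = E[ f_{lam*T}(W) ] with W ~ Gamma(y + 1, scale 1).
# For T ~ Gamma(r, scale 1), lam*T ~ Gamma(r, scale lam) with explicit PDF.
W = rng.gamma(shape=y + 1, size=500_000)
est = stats.gamma.pdf(W, a=r, scale=lam).mean()

# Benchmark from Lemma 1: Y ~ NB(r, p) with p = 1/(1 + lam).
exact = stats.nbinom.pmf(y, r, 1 / (1 + lam))
print(est, exact)
```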
Finally, we present well-known results concerning the mean and the covariance structure of mixed multivariate Poisson distributions of Type I, which are easily derived through standard conditioning arguments. Generally, whenever the mean of T exists, then so does the mean of each Y_{i}, and we have \(\mathbb {E}(Y_{i})=\lambda _{i} \mathbb {E}(T)\). Moreover, the variance of each Y_{i} is finite whenever T has a finite second moment, in which case we have \(\mathbb {V} ar(Y_{i}) = \lambda _{i} \mathbb {E}(T) + \lambda _{i}^{2}\mathbb {V} ar(T)\). Thus, whenever \(\mathbb {V} ar(T)>0\), the variance of Y_{i} exceeds its mean, and the distribution of Y_{i} is overdispersed. Finally, under the latter assumption, the covariance of Y_{i} and Y_{j} exists and equals \(\mathbb {C} ov(Y_{i}, Y_{j}) = \lambda _{i}\lambda _{j} \mathbb {V} ar(T)\). The result below summarizes these facts.
Lemma 5
In the above setting, the mean vector of Y exists whenever the mean of T is finite, in which case we have \(\mathbb {E}(\mathbf {Y}) = {\boldsymbol \lambda } \mathbb {E}(T)\), where λ=(λ_{1},…,λ_{d})^{⊤}. Moreover, if T has a finite second moment, then the covariance matrix of Y is well defined and is given by
where I(λ) is a d×d diagonal matrix with the {λ_{i}} on the main diagonal.
Remark 11
By the above result, if it exists, the correlation coefficient of Y_{i} and Y_{j} is given by
The correlation is always positive and can fall anywhere in the interval (0,1).
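A quick simulation check of Lemma 5 for standard gamma T, for which E(T)=Var(T)=r (the rates and shape below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
lam = np.array([1.5, 4.0])       # hypothetical Poisson rates lambda_i
r = 2.0                          # T ~ Gamma(r): E(T) = Var(T) = r

B = 1_000_000
T = rng.gamma(shape=r, size=B)
Y = rng.poisson(lam * T[:, None])   # Type I: a single shared T

# Lemma 5 with E(T) = Var(T) = r:
#   E(Y_i)   = lam_i * r
#   Var(Y_i) = lam_i * r + lam_i**2 * r
#   Cov(Y_1, Y_2) = lam_1 * lam_2 * r
print(Y.mean(axis=0), lam * r)
print(Y.var(axis=0), lam * r + lam**2 * r)
print(np.cov(Y.T)[0, 1], lam[0] * lam[1] * r)
```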
Mixed multivariate Poisson distributions of Type II
Here we provide basic distributional facts about mixed multivariate Poisson distributions of Type II, which are the distributions of Y=(Y_{1},…,Y_{d})^{⊤}=(N_{1}(T_{1}),…,N_{d}(T_{d}))^{⊤}, where the {N_{i}(·)} are independent Poisson processes with rates λ_{i} and T=(T_{1},…,T_{d})^{⊤} is a random vector in \(\mathbb {R}_{+}^{d}\) with the PDF f_{T}, independent of the {N_{i}}.
Lemma 6
In the above setting, the PGF of Y is given by
where ϕ_{T} is the LT of T, I(λ) is a d×d diagonal matrix with the {λ_{i}} on the main diagonal, and 1 is a ddimensional column vector of 1s.
Proof
By a standard conditioning argument, we have
Since given T=t the variables {Y_{i}} are independent and Poisson distributed with means {λ_{i}t_{i}}, respectively, we have
When we substitute the above into (53) we conclude that the PGF of Y is as stated in the lemma. □
Remark 12
Note that the expression (52) is a generalization of (42) to the multivariate mixed Poisson setting. Additionally, observe that if the components of T coincide, that is, T_{i}=T for i=1,…,d, we have
and the PGF in (52) reduces to its counterpart provided in Lemma 2, as it should.
Remark 13
Let us note that changing scaling factors of the variables T_{i} in this model has the same effect as adjusting the rate parameters connected with the Poisson processes {N_{i}(·)}. Namely, it follows from Lemma 6 that if we let \(\tilde {T_{i}} = c_{i}T_{i}\) in the above setting, then we have the following equality in distribution:
where the \(\{\tilde {N}_{i}(\cdot)\}\) are independent Poisson processes with rates c_{i}λ_{i}, respectively. Thus, without loss of generality, we may assume that the scale parameters of the variables T_{i} in this model are set to unity.
Next, we provide a convenient formula for the PMF of multivariate mixed Poisson distributions of Type II, which is an extension of that given in Lemma 3. To state the result, we extend the definition of the function v_{T} described by (12) to vectorvalued arguments and random vectors T in \(\mathbb {R}_{+}^{d}\). Namely, for a=(a_{1},…,a_{d})^{⊤}, \(\mathbf {b} = (b_{1}, \ldots, b_{d})^{\top }\in \mathbb {R}_{+}^{d}\) we set
and define
Lemma 7
In the above setting, the PMF of Y is given by
where
are the marginal PMFs of the {Y_{i}} and the function h is given by
Proof
By a standard conditioning argument, we have
where y=(y_{1},…,y_{d})^{⊤} and t=(t_{1},…,t_{d})^{⊤}. Further, by independence, we have
Since the N_{i}(t_{i}) are Poisson with parameters λ_{i}t_{i}, we have
When we now substitute (58)–(59) into (57), then after some algebra we get
Similarly, the marginal PMFs are given by
By combining (60) and (61), we obtain the result. □
We now present an alternative expression for the joint probabilities P(Y=y), which facilitates their computation if the random vector T is difficult to simulate but its PDF is readily available.
Lemma 8
In the above setting, the PMF of Y is given by
where the quantity f_{I(λ)T} is the PDF of I(λ)T=(λ_{1}T_{1},…,λ_{d}T_{d})^{⊤} and W=(W_{1},…,W_{d})^{⊤} with mutually independent W_{i} having standard gamma distributions with shape parameters y_{i}+1.
Proof
Proceeding as in the proof of Lemma 4, we obtain
Note that the product under the integral above is the PDF of X=(X_{1},…,X_{d})^{⊤}, where the X_{i} are mutually independent gamma random variables with shape parameters y_{i}+1 and scale parameters λ_{i}. This allows us to conclude that
where W=(W_{1},…,W_{d})^{⊤}=I(λ)X has independent standard gamma components with shape parameters y_{i}+1. To conclude the result, observe that
□
Finally, let us summarize standard results concerning the mean and the covariance structure of mixed multivariate Poisson distributions of Type II, which parallel the results for Type I, and are easily derived through standard conditioning arguments. Generally, whenever the means of {T_{i}} exist then so do the means of the {Y_{i}}, and we have \(\mathbb {E}(Y_{i})=\lambda _{i} \mathbb {E}(T_{i})\). Similarly, the variance of each Y_{i} is finite whenever T_{i} has a finite second moment, in which case we have \(\mathbb {V} ar(Y_{i}) = \lambda _{i} \mathbb {E}(T_{i}) + \lambda _{i}^{2}\mathbb {V} ar(T_{i})\). Again, the distribution of Y_{i} is always overdispersed. Finally, for any i≠j, the covariance of Y_{i} and Y_{j} exists and equals \(\mathbb {C} ov(Y_{i}, Y_{j}) = \lambda _{i}\lambda _{j} \mathbb {C} ov(T_{i}, T_{j})\) whenever the covariance of T_{i} and T_{j} exists. These facts are summarized in the result below.
Lemma 9
In the above setting, the mean vector of Y exists whenever the mean of T is finite, in which case we have \(\mathbb {E}(\mathbf {Y}) = \mathbf {I}({\boldsymbol \lambda }) \mathbb {E}(\mathbf {T})\), where λ=(λ_{1},…,λ_{d})^{⊤} and I(λ) is a d×d diagonal matrix with the {λ_{i}} on the main diagonal. Moreover, if T has a finite covariance matrix Σ_{T} then the covariance matrix of Y is well defined as well and is given by
where \(\mathbf {I}(\mathbb {E}(\mathbf {T})) \) is a d×d diagonal matrix with the diagonal entries \(\{\mathbb {E}(T_{i})\}\).
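Lemma 9 can be checked with a simple dependent mixing vector built from a shared gamma shock. This additive construction is ours, chosen purely for illustration because it yields Cov(T_1,T_2) in closed form; all parameters are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)
lam = np.array([2.0, 3.0])       # hypothetical Poisson rates lambda_i
B = 1_000_000

# Dependent mixing vector: T_i = G0 + G_i with independent gamma parts,
# so Cov(T_1, T_2) = Var(G0) = a0 exactly (shapes a0, a1, a2; unit scale).
a0, a1, a2 = 1.0, 1.5, 0.5
G0 = rng.gamma(a0, size=B)
T = np.column_stack([G0 + rng.gamma(a1, size=B), G0 + rng.gamma(a2, size=B)])
Y = rng.poisson(lam * T)         # Type II: component-wise rates

# Lemma 9: E(Y_i) = lam_i * E(T_i), Cov(Y_1, Y_2) = lam_1 lam_2 Cov(T_1, T_2).
print(Y.mean(axis=0), lam * np.array([a0 + a1, a0 + a2]))
print(np.cov(Y.T)[0, 1], lam[0] * lam[1] * a0)
```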
Availability of data and materials
The results shown here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.
Code reproducing the BRCA data set and computational analyses is available from the corresponding author on reasonable request.
Abbreviations
BRCA: Breast invasive carcinoma
CDF: Cumulative distribution function
FGM: Farlie-Gumbel-Morgenstern
LP model: Lognormal-Poisson model
mRNA: Messenger ribonucleic acid
NB: Negative binomial
NORTA: NORmal To Anything
PDF: Probability density function
PGF: Probability generating function
PMF: Probability mass function
RNA-seq: RNA-sequencing
References
Barbiero, A., Ferrari, P. A.: An R package for the simulation of correlated discrete variables. Comm. Statist. Simul. Comput. 46(7), 5123–5140 (2017).
Chen, H.: Initialization for NORTA: Generation of random vectors with specified marginals and correlations. INFORMS J. Comput. 13(4), 257–360 (2001).
Clemen, R. T., Reilly, T.: Correlations and copulas for decision and risk analysis. Manag. Sci. 45, 208–224 (1999).
Demitras, H., Hedeker, D.: A practical way for computing approximate lower and upper correlation bounds. Amer. Statist. 65(2), 104–109 (2011).
Johnson, N., Kotz, S., Balakrishnan, N.: Discrete Multivariate Distributions. Wiley, New York (1997).
Karlis, D., Xekalaki, E.: Mixed Poisson distributions. Intern. Statist. Rev. 73(1), 35–58 (2005).
Kozubowski, T. J., Podgórski, K.: Distribution properties of the negative binomial Lévy process. Probab. Math. Statist. 29, 43–71 (2009).
Madsen, L., Birkes, D.: Simulating dependent discrete data. J. Stat. Comput. Simul. 83(4), 677–691 (2013).
Madsen, L., Dalthorp, D.: Simulating correlated count data. Environ. Ecol. Stat. 14(2), 129–148 (2007).
Nelsen, R. B.: An Introduction to Copulas, 2nd edn. Springer, New York (2006).
Nikoloulopoulos, A. K.: Copulabased models for multivariate discrete response data. In: Copulae in Mathematical and Quantitative Finance, 231–249, Lect. Notes Stat., 213. Springer, Heidelberg (2013).
Nikoloulopoulos, A. K., Karlis, D.: Modeling multivariate count data using copulas. Comm. Statist. Sim. Comput. 39(1), 172–187 (2009).
Schissler, A. G., Piegorsch, W. W., Lussier, Y. A.: Testing for differentially expressed genetic pathways with singlesubject Nof1 data in the presence of intergene correlation. Stat. Methods Med. Res. 27(12), 3797–3813 (2018).
Solomon, D. L.: The spatial distribution of cabbage butterfly eggs. In: Roberts, H., Thompson, M. (eds.)Life Science Models Vol. 4, pp. 350–366. SpringerVerlag, New York (1983).
Song, W. T., Hsiao, L. C.: Generation of autocorrelated random variables with a specified marginal distribution. In: Proceedings of 1993 Winter Simulation Conference  (WSC ’93), pp. 374–377, Los Angeles (1993). https://doi.org/10.1109/WSC.1993.718074.
Xiao, Q.: Generating correlated random vector involving discrete variables. Comm. Statist. Theory Methods. 46(4), 1594–1605 (2017).
Xiao, Q., Zhou, S.: Matching a correlation coefficient by a Gaussian copula. Comm. Statist. Theory Methods. 48(7), 1728–1747 (2019).
Acknowledgements
The authors thank the two reviewers for their comments that helped improve the paper. We also thank Professors Walter W. Piegorsch and Edward J. Bedrick (University of Arizona) for their helpful discussions.
Funding
Research reported in this publication was supported by MWCTRIN of the National Institutes of Health under award number 1U54GM104944.
Author information
Authors and Affiliations
Contributions
AGS and TJK conceived the study. TJK, AKP, and AGS developed the approach. ADK and AGS conducted the computational analyses. TJK, AKP, and AGS wrote the manuscript. TJK, AKP, AGS, and ADK revised the manuscript. All authors read and approved the final document.
Corresponding author
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Knudson, A.D., Kozubowski, T.J., Panorska, A.K. et al. A flexible multivariate model for high-dimensional correlated count data. J Stat Distrib App 8, 6 (2021). https://doi.org/10.1186/s40488-021-00119-y
Keywords
 Multivariate count data
 Copula
 Distribution theory
 Big data applications
 GammaPoisson hierarchy
 Mixed Poisson distribution
 Negative binomial distribution
 Highdimensional multivariate simulation
 RNAsequencing data
AMS Subject Classification
 62E10
 62E15
 62H05
 62H10
 62H30