Multivariate distributions of correlated binary variables generated by pair-copulas

Lin, Huihui; Chaganty, N. Rao

doi:10.1186/s40488-021-00118-z

Research
Open access
Published: 05 March 2021

Journal of Statistical Distributions and Applications volume 8, Article number: 4 (2021)

Abstract

Correlated binary data are prevalent in a wide range of scientific disciplines, including healthcare and medicine. The generalized estimating equations (GEEs) and the multivariate probit (MP) model are two of the popular methods for analyzing such data. However, both methods have some significant drawbacks. The GEEs may not have an underlying likelihood and the MP model may fail to generate a multivariate binary distribution with specified marginals and bivariate correlations. In this paper, we study multivariate binary distributions that are based on D-vine pair-copula models as a superior alternative to these methods. We elucidate the construction of these binary distributions in two and three dimensions with numerical examples. For higher dimensions, we provide a method of constructing a multidimensional binary distribution with specified marginals and equicorrelated correlation matrix. We present a real-life data analysis to illustrate the application of our results.

Introduction

In clinical trials and research studies in health care and medicine, the endpoint of the observed data most often consists of correlated binary observations. The generalized estimating equation (GEE), introduced by Liang and Zeger (1986), has been the common statistical tool for analyzing such data. However, this method has several drawbacks. One drawback is that it uses an ambiguously defined working correlation to model the dependence in the binary observations, which could lead to misleading conclusions (Sabo and Chaganty, 2010). Another drawback is that it is a non-likelihood approach, in the sense that it does not have an underlying joint distribution for the correlated binary observations. Other alternatives to GEEs for the analysis of correlated binary data are Markov chains (MCs) and multivariate probit (MP) models. A contrasting study of the first-order MC model and the MP model was presented by Yang and Chaganty (2014). They showed that both models are asymptotically efficient, and discussed situations where one is preferable over the other.

In recent years, due to their success in other disciplines, copulas have been used to develop likelihood-based methods as another alternative to GEEs. Some researchers have combined copulas with MC models. Escarela et al. (2009) used a Gaussian copula to construct conditional probabilities in Markov chain models in the context of longitudinal binary data. The copula-based bivariate probit models were generalized by Winkelmann (2012) by replacing the Gaussian distribution with Frank and Clayton copulas. Radice et al. (2016) introduced nonlinear regression models, where non-Gaussian copulas were used to deal with the dependence between binary responses. Smith et al. (2010) showed that longitudinal continuous data can be modeled by a D-vine pair-copula, and later extended their work to the discrete case using a Bayesian framework in Smith and Khaled (2012). A Gaussian copula model for integer-valued ARMA structured time series data with or without covariates was developed by Lennon (2016). Panagiotelis et al. (2017) introduced two algorithms for optimizing vine structure and pair-copula selection for discrete regular vine copulas. One of these algorithms uses a modified Akaike information criterion and the other uses predictive scores with cross-validation.

In this paper we study multivariate binary distributions generated by D-vine pair-copula models. These models are relatively easy to implement since they use only bivariate copulas, and flexible because they allow different types of bivariate copulas to model different types of dependence in the conditional distributions. We will see that the D-vine pair-copula model has some advantages over the MP model.

The organization of this paper is as follows. We first present a lucid description of the construction of bivariate and trivariate binary distributions using bivariate Gaussian, Clayton, Frank, and Gumbel copulas in “Construction of vine pair-copula binary distributions” section. We discuss comparisons between pair-copula models and the multivariate probit (MP) model in “Comparison of pair-copula and MP models” section, together with a numerical example where the vine pair-copula model overcomes the difficulties associated with the MP model. In “Extensions to four and higher dimensions” section we discuss extensions to four and higher dimensions and present a method of constructing a multivariate binary distribution with specified marginals and equicorrelated structure. In “Parameter estimation” section, we discuss parameter estimation by maximum likelihood for grouped data. “Data analysis” section contains an analysis of a real-life data set. We end the paper with some discussion in “Discussion” section.

Construction of vine pair-copula binary distributions

In this section, we illustrate methods of constructing multivariate binary distributions with specified correlations using vine pair-copula methods. The major advantage of these methods is that the multivariate distribution can be constructed using only bivariate copulas. The approach is computationally feasible, flexible, and can accommodate various types of dependence because the bivariate copula need not be the same for the various bivariate marginal and conditional distributions. We start with the simplest cases, the bivariate and trivariate distributions, and then show how they can be extended to higher dimensions, focusing on the special correlation structures useful in longitudinal and clustered binary data analysis.

Bivariate binary distributions

Consider first the case of two binary variables. Let Y=(Y₁, Y₂), where the subscripts 1 and 2 may indicate two sequential time points. Assume that E(Y_i)=p_i for i=1,2. According to the theorem of Sklar (1959), the joint CDF of Y using a copula function C is given by F(y₁,y₂)=C(F₁(y₁),F₂(y₂)), where F₁ and F₂ are the CDFs of the univariate binary distributions of Y₁ and Y₂ respectively. Following Panagiotelis et al. (2012), we can recover the joint probability mass function (PMF) of Y from the CDF as

$$\begin{array}{@{}rcl@{}} P(Y_{1}=y_{1},Y_{2}=y_{2})&\!\!=&\!\! C(F_{1}(y_{1}), F_{2}(y_{2});\; \theta)-C(F_{1}(y_{1}-1), F_{2}(y_{2});\; \theta) \cr && \!\!\! -C(F_{1}(y_{1}), F_{2}(y_{2}-1);\; \theta)+C(F_{1}(y_{1}-1), F_{2}(y_{2}-1);\; \theta). \cr &&\end{array} $$

(1)

The C(u₁,u₂; θ) could be any copula C_i,1≤i≤5, given in Table 1. The copula parameter θ is the correlation coefficient γ for the Gaussian copula, and it is α for the Clayton, Frank, and Gumbel copulas.

Table 1 Copula families of distribution functions


Selecting y_i=0,1 and noting that F_i(0)=P(Y_i=0)=1−p_i=q_i for i=1,2, Eq. (1) simplifies to the probabilities given in Table 2.

Table 2 PMF of bivariate binary variables
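
As a minimal sketch of Eq. (1), the snippet below builds the four cell probabilities of Table 2 by differencing a copula over the Bernoulli CDFs. It assumes the standard parametrization of the Frank copula, which may differ from the form listed in Table 1; the parameter value 2.0 and all helper names are hypothetical choices made for illustration.

```python
import math

def frank_copula(u, v, alpha):
    """Frank copula in its standard parametrization (alpha != 0).
    Assumption: Table 1 of the source may parametrize it differently."""
    num = math.expm1(-alpha * u) * math.expm1(-alpha * v)
    return -math.log1p(num / math.expm1(-alpha)) / alpha

def bernoulli_cdf(y, p):
    """CDF of a Bernoulli(p) variable evaluated at an integer y."""
    if y < 0:
        return 0.0
    return 1.0 - p if y == 0 else 1.0

def bivariate_pmf(p1, p2, copula):
    """Eq. (1): rectangle differences of the copula over the two CDFs."""
    pmf = {}
    for y1 in (0, 1):
        for y2 in (0, 1):
            a, a_prev = bernoulli_cdf(y1, p1), bernoulli_cdf(y1 - 1, p1)
            b, b_prev = bernoulli_cdf(y2, p2), bernoulli_cdf(y2 - 1, p2)
            pmf[(y1, y2)] = (copula(a, b) - copula(a_prev, b)
                             - copula(a, b_prev) + copula(a_prev, b_prev))
    return pmf

# Hypothetical inputs: marginal means 0.8 and 0.7, Frank parameter 2.0
pmf = bivariate_pmf(0.8, 0.7, lambda u, v: frank_copula(u, v, 2.0))
for cell, prob in sorted(pmf.items()):
    print(cell, round(prob, 4))
```

Passing the copula as a function keeps the differencing step independent of the copula family, so any other family can be substituted without changing `bivariate_pmf`.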


Given ρ=Corr(Y₁,Y₂), if we use the Gaussian copula C₁ in Table 2, the parameter θ=γ can be obtained by solving the equation

$$\begin{array}{@{}rcl@{}} \text{Corr}(Y_1,Y_2)=\rho=\frac{C_{1}(p_1,\, p_2;\,\gamma)-p_{1}\,p_2}{\sqrt{p_{1}\,q_{1}\,p_{2}\,q_2}}, \end{array} $$

(2)

since 1−q₁−q₂+C₁(q₁,q₂; γ)=C₁(p₁, p₂; γ).
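
The inversion of Eq. (2) can be sketched numerically as follows. This is an illustrative implementation, not the authors' code: the bivariate normal CDF is evaluated with a one-dimensional integral (Simpson's rule) using only the Python standard library, γ is found by bisection, and the function names are hypothetical. The inputs p₁=0.2, p₂=0.3, ρ=0.76 are taken from the example in “An advantage of the pair-copula method over the MP model” section.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def gaussian_copula(u, v, gamma, n=400):
    """Bivariate Gaussian copula C(u, v; gamma), 0 < u, v < 1, |gamma| < 1.
    Uses Phi2(h, k; g) = Phi(h)Phi(k) + (1/2pi) * integral over [0, asin(g)]
    of exp(-(h^2 + k^2 - 2hk sin t) / (2 cos^2 t)) dt, by Simpson's rule."""
    h, k = _STD_NORMAL.inv_cdf(u), _STD_NORMAL.inv_cdf(v)
    upper = math.asin(gamma)

    def integrand(t):
        c = math.cos(t)
        return math.exp(-(h * h + k * k - 2 * h * k * math.sin(t)) / (2 * c * c))

    step = upper / n  # n must be even for Simpson's rule
    total = integrand(0.0) + integrand(upper)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * integrand(i * step)
    return u * v + total * step / 3 / (2 * math.pi)

def implied_correlation(p1, p2, gamma):
    """Right-hand side of Eq. (2)."""
    c = gaussian_copula(p1, p2, gamma)
    return (c - p1 * p2) / math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))

def solve_gamma(p1, p2, rho, tol=1e-10):
    """Bisection on gamma; the implied correlation is increasing in gamma."""
    lo, hi = -0.9999, 0.9999
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if implied_correlation(p1, p2, mid) < rho:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

gamma = solve_gamma(0.2, 0.3, 0.76)
print(round(gamma, 4), round(implied_correlation(0.2, 0.3, gamma), 4))
```

Bisection is used here because the implied correlation is monotone in γ; if the target ρ lies outside the attainable range for the given marginals, the search simply converges to an endpoint of the bracket.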

Trivariate binary distributions

In this section we extend the pair-copula method to construct three dimensional binary distributions. Let Y=(Y₁, Y₂, Y₃) be a vector of three correlated binary random variables. Note that

$$\begin{array}{@{}rcl@{}} P(Y_1=y_{1},Y_{2}=y_{2},Y_{3}=y_{3})=P(Y_{2}=y_{2})*P(Y_{1}=y_{1},Y_{3}=y_{3}|Y_{2}=y_{2}). \end{array} $$

(3)

The above equation shows that the three-dimensional distribution can be obtained by constructing the bivariate conditional distribution of (Y₁, Y₃) given Y₂=y₂. To this end we introduce some notation. We first construct bivariate distributions for (Y₁, Y₂) and (Y₂, Y₃) as in Table 2, selecting bivariate copulas C₁₂(u₁, u₂) and C₂₃(u₁, u₂) from Table 1. For notational convenience we omit the copula parameter here and in some formulas later.

Let $q_{1|0}=P(Y_{1}=0|Y_{2}=0)=\frac {C_{12}(q_{1}, q_{2})}{q_{2}}$ and $p_{1|0}=1-q_{1|0}=1-\frac {C_{12}(q_{1}, q_{2})}{q_{2}}$. Thus Y₁|Y₂=0 is distributed as Bernoulli with mean p_1|0. Similarly, Y₃|Y₂=0 is distributed as Bernoulli with mean p_3|0, where $p_{3|0}=1-\frac {C_{23}(q_{2}, q_{3})}{q_{2}}$. We also have Y₁|Y₂=1 is Bernoulli with mean $p_{1|1}=1-\frac {q_{1}-C_{12}(q_{1}, q_{2})}{p_{2}}$, and Y₃|Y₂=1 is Bernoulli with mean $p_{3|1}=1-\frac {q_{3}-C_{23}(q_{2}, q_{3})}{p_{2}}$.

Table 3 shows the conditional distributions of (Y₁, Y₃)|Y₂=0 and (Y₁, Y₃)|Y₂=1. Finally, from Eq. (3) and using the conditional distributions we can get the joint trivariate PMF as given in Table 4.

Table 3 Conditional PMF of (Y₁, Y₃) given Y₂


Table 4 PMF of trivariate binary variables


We give six numerical examples to see that different copulas and parameter values give rise to different trivariate binary distributions. All these six distributions have the same marginal means p₁=0.8,p₂=0.7 and p₃=0.6. The choice of the copulas and parameter values are summarized in Table 5 for these six cases.

Table 5 Summary of parameter values for PMF of the trivariate binary variables


We give details of the calculations only for case 5; the others are done similarly. In this case we start with Table 2 using the Gaussian copula with γ₁₂=0.752, γ₂₃=0.607, and p=(0.8, 0.7, 0.6). The resulting PMFs of the bivariate binary variables are in Table 6.

Table 6 PMF of bivariate binary variables using Gaussian copula


Now, to construct the conditional PMFs of bivariate variables we need to get the marginal parameters of the conditional Bernoulli variables Y₁|Y₂=0 and Y₁|Y₂=1. These are

$$\begin{array}{@{}rcl@{}} q_{1|0}&=&\frac{P(Y_1=0,Y_2=0)}{q_2}=\frac{0.1517}{0.3}=0.5057,\\ q_{3|0}&=&\frac{P(Y_2=0,Y_3=0)}{q_2}=\frac{0.2097}{0.3}=0.6990,\\ q_{1|1}&=&\frac{P(Y_1=0,Y_2=1)}{p_2}=\frac{0.0483}{0.7}=0.0690,\\ q_{3|1}&=&\frac{P(Y_2=1,Y_3=0)}{p_2}=\frac{0.1903}{0.7}=0.2719. \end{array} $$

Then, p_1|0=1−0.5057=0.4943, p_3|0=1−0.6990=0.3010, p_1|1=1−0.0690=0.9310 and p_3|1=1−0.2719=0.7281. The Frank copula is used for the conditional distributions, with parameters $\alpha _{13|Y_{2}=0}=0.95, \alpha _{13|Y_{2}=1}=0.85$. The PMFs of the conditional bivariate binary variables are calculated according to Table 3, resulting in the values given in Table 7.

Table 7 Conditional PMF of bivariate binary variables generated by Frank copula


Since P(Y₁=y₁,Y₂=y₂,Y₃=y₃)=P(Y₂=y₂)∗P(Y₁=y₁,Y₃=y₃|Y₂=y₂), the last step is to multiply values in Table 7 by P(Y₂=y₂) to get the joint trivariate probability. For example, P(Y₁=0,Y₂=0,Y₃=0)=0.3782∗0.3=0.1135 or P(Y₁=1, Y₂=1, Y₃=0)=0.2475∗0.7=0.1732. The three dimensional joint binary distributions for the six cases are given in Table 8.
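
The steps of case 5 can be traced end to end with the sketch below. It is an illustration under stated assumptions rather than the authors' implementation: the Gaussian copula is evaluated by a standard-library numerical integral, the Frank copula is taken in its standard parametrization (an assumption about Table 1), and the function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def gaussian_copula(u, v, gamma, n=400):
    """Gaussian copula via Phi2 = Phi(h)Phi(k) + (1/2pi) * integral over
    [0, asin(gamma)] of exp(-(h^2 + k^2 - 2hk sin t) / (2 cos^2 t)) dt."""
    h, k = _STD_NORMAL.inv_cdf(u), _STD_NORMAL.inv_cdf(v)
    upper = math.asin(gamma)
    f = lambda t: math.exp(-(h * h + k * k - 2 * h * k * math.sin(t))
                           / (2 * math.cos(t) ** 2))
    step = upper / n  # Simpson's rule, n even
    total = f(0.0) + f(upper)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(i * step)
    return u * v + total * step / 3 / (2 * math.pi)

def frank_copula(u, v, alpha):
    """Frank copula, standard parametrization (assumed)."""
    num = math.expm1(-alpha * u) * math.expm1(-alpha * v)
    return -math.log1p(num / math.expm1(-alpha)) / alpha

def trivariate_pmf(p, c12, c23, c13_given):
    """Joint PMF from Eq. (3). c13_given[y2] is the copula of (Y1, Y3) | Y2 = y2."""
    q1, q2, q3 = (1.0 - x for x in p)
    p2 = p[1]
    c12_val, c23_val = c12(q1, q2), c23(q2, q3)
    # Conditional probabilities P(Y1 = 0 | Y2 = y2) and P(Y3 = 0 | Y2 = y2)
    q1_given = {0: c12_val / q2, 1: (q1 - c12_val) / p2}
    q3_given = {0: c23_val / q2, 1: (q3 - c23_val) / p2}
    pmf = {}
    for y2, weight in ((0, q2), (1, p2)):
        a, b = q1_given[y2], q3_given[y2]
        c = c13_given[y2](a, b)
        pmf[(0, y2, 0)] = weight * c
        pmf[(0, y2, 1)] = weight * (a - c)
        pmf[(1, y2, 0)] = weight * (b - c)
        pmf[(1, y2, 1)] = weight * (1.0 - a - b + c)
    return pmf

# Inputs of case 5 as stated in the text
pmf = trivariate_pmf(
    (0.8, 0.7, 0.6),
    lambda u, v: gaussian_copula(u, v, 0.752),
    lambda u, v: gaussian_copula(u, v, 0.607),
    {0: lambda u, v: frank_copula(u, v, 0.95),
     1: lambda u, v: frank_copula(u, v, 0.85)},
)
for cell, prob in sorted(pmf.items()):
    print(cell, round(prob, 4))
```

The printed cells for (0, 0, 0) and (1, 1, 0) can be compared with the two products worked out above; small differences in the last digit may arise because the text rounds intermediate values.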

Table 8 PMF of trivariate binary variables


Comparison of pair-copula and MP models

In this section, we will compare the probability mass functions generated by the pair-copula methods and by the multivariate probit model. We will see that even if we use bivariate Gaussian copulas, these two methods yield different probability mass functions. Furthermore, the pair-copula method is successful in cases where the MP model fails to generate a probability mass function with specified univariate marginals and correlations.

Multivariate probit (MP) model

Let Y=(Y₁,Y₂,…,Y_m) be a vector of binary random variables. The multivariate probit model assumes that associated with the vector Y there is a latent vector Z=(Z₁,Z₂,…,Z_m), which is distributed as multivariate normal (MVN), such that Y_t=1 if Z_t>0, and Y_t=0 if Z_t≤0. Assume Z_t=μ_t+ε_t, where ε=(ε₁,…,ε_m) is MVN (0, R). Then, p_t=P(Y_t=1)=P(Z_t>0)=P(μ_t+ε_t>0)=Φ(μ_t), and q_t=(1−p_t)=Φ(−μ_t). The joint PMF of Y=(Y₁,Y₂,…,Y_m) is given by

$$\begin{array}{@{}rcl@{}} P(Y_1=y_1,Y_2=y_2,\ldots,Y_m=y_m)= \int_{D_m}\cdots\int_{D_1}\frac{1}{(2\pi)^{\frac{m}{2}}|R|^{\frac{1}{2}}}\exp\left(-\frac{\boldsymbol{\epsilon}\; R^{-1}\;\boldsymbol{\epsilon}'}{2}\right) \;\; d\boldsymbol{\epsilon}, \end{array} $$

(4)

where D_t=(−∞,μ_t) if y_t=1, and D_t=(μ_t,∞) if y_t=0. For example, for m=3 we have

$$\begin{array}{@{}rcl@{}} P(Y_1=0,Y_2=0,Y_3=0)&=&\int_{\mu_1}^{\infty}\int_{\mu_2}^{\infty}\int_{\mu_3}^{\infty} \phi_{3}(\epsilon_1,\epsilon_2,\epsilon_{3}\, ;\, R)\;\;d\boldsymbol{\epsilon}\cr \cr &=&\Phi_{3}(-\mu_1,-\mu_2,-\mu_{3}\, ;\, R), \end{array} $$

where Φ₃(ε ; R) is the CDF of trivariate standard normal with correlation matrix R.

Distributions generated by pair-copula and MP models

Since the MP model relies on the Gaussian distribution, for a fair comparison we will use the Gaussian copulas for the bivariate and conditional distributions in the pair-copula construction. Consider the case of two dimensions. In this case, taking C₁₂ as the bivariate Gaussian copula, the PMF as given in Table 2 is P(Y₁=0,Y₂=0)=C₁₂(q₁,q₂;γ)=Φ₂(Φ⁻¹(q₁),Φ⁻¹(q₂);γ)=Φ₂(−μ₁,−μ₂;γ), which is identical to the probability under the MP model. Therefore, the probability distributions are the same for two dimensions. For three dimensions for the MP model, we have

$$\begin{array}{@{}rcl@{}} P(Y_1=0,Y_2=0,Y_3=0)&=\Phi_{3}(-\mu_1,-\mu_2,-\mu_{3}\, ;\, R). \end{array} $$

(5)

From Table 4, we see that for the pair-copula model

$$\begin{array}{@{}rcl@{}} P(Y_1=0,\;Y_2=0,\;Y_3=0)=q_{2}\; C_{13|0}(q_{1|0},\, q_{3|0}). \end{array} $$

(6)

With bivariate Gaussian copulas we have q_i=Φ(−μ_i) for i=1,2,3 and

$$\begin{array}{@{}rcl@{}} q_{1|0}=\frac{C_{12}(q_1,q_2)}{q_2}=\frac{\Phi_{2}(-\mu_1,-\mu_2;\,\gamma_{12})}{\Phi(-\mu_2)}=P(\epsilon_1<-\mu_1|\epsilon_2<-\mu_2), \\ q_{3|0}=\frac{C_{23}(q_2,q_3)}{q_2}=\frac{\Phi_{2}(-\mu_2,-\mu_3;\,\gamma_{23})}{\Phi(-\mu_2)}=P(\epsilon_3<-\mu_3|\epsilon_2<-\mu_2), \end{array} $$

(7)

where ε=(ε₁, ε₂, ε₃) is distributed as a standard trivariate normal with correlation matrix R=(γ_ij). The quantities q_1|0 and q_3|0 are the same as the corresponding values for the probit model. Taking C_13|0(u₁,u₂) as bivariate Gaussian copula with correlation γ_13|0, Eq. (6) is equivalent to

$$\begin{array}{@{}rcl@{}} P(Y_1=0,\;Y_2=0,\;Y_3=0) = \Phi(-\mu_2)\, \Phi_{2}\Bigl(\Phi^{-1}(q_{1|0}), \Phi^{-1}(q_{3|0}); \gamma_{13|0}\Bigr). \end{array} $$

(8)

In general, the quantity (8) is not equal to (5), since the parameter γ_13|0 can be any value in (−1, 1) and need not be related to R. Thus the PMF of the pair-copula model is different from that of the MP model.

An advantage of the pair-copula method over the MP model

In this section, we give an example to show that the pair-copula method is useful for constructing multivariate binary distributions with specified marginals and correlation structure in cases where the multivariate probit model breaks down. Assume the marginal means are given by the vector p=(0.2, 0.3, 0.2). For the equicorrelated structure the feasible range of the correlation parameter ρ is (−0.25, 0.7638); see Theorem 1 in Chaganty and Joe (2006). For the value ρ=0.76, the latent correlation matrix obtained by solving Eq. (2) for all pairs is

$$\begin{array}{@{}rcl@{}} R=\left[\begin{array}{ccc} 1&0.9853 &0.9411\\ 0.9853&1&0.9853\\ 0.9411&0.9853&1\\ \end{array}\right], \end{array} $$

which is not positive definite and thus the MP method does not give a PMF for the binary variables. However, the pair-copula method generates a PMF for the binary variables with specified marginal means and equicorrelated structure. The input values needed to calculate the PMF are listed in Table 9.
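
The failure of positive definiteness can be checked directly from the printed matrix. The sketch below applies Sylvester's criterion (a symmetric matrix is positive definite exactly when all leading principal minors are positive) to the entries as rounded in the text; the helper name is hypothetical.

```python
def leading_minors_3x3(R):
    """Leading principal minors of a 3x3 matrix (orders 1, 2, 3)."""
    m1 = R[0][0]
    m2 = R[0][0] * R[1][1] - R[0][1] * R[1][0]
    m3 = (R[0][0] * (R[1][1] * R[2][2] - R[1][2] * R[2][1])
          - R[0][1] * (R[1][0] * R[2][2] - R[1][2] * R[2][0])
          + R[0][2] * (R[1][0] * R[2][1] - R[1][1] * R[2][0]))
    return m1, m2, m3

# Latent correlation matrix as printed (four-decimal rounding)
R = [[1.0, 0.9853, 0.9411],
     [0.9853, 1.0, 0.9853],
     [0.9411, 0.9853, 1.0]]

minors = leading_minors_3x3(R)
is_positive_definite = all(m > 0 for m in minors)
print(minors, is_positive_definite)
```

With these rounded entries the determinant is slightly negative, so the matrix has a negative eigenvalue and cannot serve as the correlation matrix of the latent normal vector.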

Table 9 Input values for construction of the PMF of equicorrelated binary variables


Proceeding as in “Trivariate binary distributions” section, the resulting three dimensional distribution is given in Table 10. We can check that this distribution has the specified marginal means p₁=0.2,p₂=0.3, and p₃=0.2, and equicorrelated structure with ρ=0.76.

Table 10 Three dimensional distribution with specified marginals


Extensions to four and higher dimensions

The pair-copula method for three dimensions described in “Trivariate binary distributions” section can be extended to construct four or higher-dimensional multivariate binary distributions. The foundations of these higher dimensional extensions have been laid out in Joe (1996; 1997). In a pioneering work, Bedford and Cooke (2002) showed how to use graphical models consisting of vines with trees and edges. The edges in a given tree become the nodes of the next tree. The vine structures not only help in enumerating and organizing numerous decompositions of a multivariate distribution but also facilitate models for different types of dependence for the marginal and conditional distributions. In recent years several articles have been published in the literature on vine pair-copula models; see Kurowicka and Joe (2011). The papers by Min and Czado (2010), Gruber and Czado (2015), and Dalla Valle et al. (2018) discuss Bayesian inference for these vine pair-copula models. The lecture notes by Czado (2019) and the article by Brechmann and Schepsmeier (2013) discuss the practical implementation of vine copulas using the R software. The two most popular vines are the canonical C-vine and the drawable D-vine; see Czado (2019) and Joe (2014). An application of the C-vine for analyzing familial data is in Deng and Chaganty (2021). In this paper, we focus on the D-vine, which is a natural candidate for analyzing longitudinal data consisting of an ordered sequence of variables.

Figure 1 shows the nested tree structure of the D-vine for m variables. There are (m−1) trees and for the ith tree there are m−i+1 nodes, represented by rectangular boxes. In the case of m=4, the D-vine consists of 3 trees. For the pair-copula construction we will need bivariate copulas for the pairs (12), (23), and (34) in tree 1. Since we are dealing with binary variables, for tree 2 we will need two bivariate copulas for constructing the conditional distribution of (13|2) and another two for constructing (24|3). The final tree requires the construction of the conditional distribution of (14|23), which in turn requires four bivariate copulas for the four possible values of the conditioned variables 2 and 3. The joint PMF in four dimensions is given by

$$\begin{array}{@{}rcl@{}} P(Y_1=y_1,Y_2=y_2,Y_3=y_3, Y_4=y_4)&=&P(Y_1=y_1,Y_4=y_4|Y_2=y_2, Y_3=y_3)\cr &&\;\; \times\; P(Y_2=y_2, Y_3=y_3). \end{array} $$

(9)

The probability P(Y₁=y₁,Y₄=y₄|Y₂=y₂,Y₃=y₃) requires p_1|00=P(Y₁=1|Y₂=0,Y₃=0),p_1|01=P(Y₁=1|Y₂=0,Y₃=1),…, p_4|11=P(Y₄=1|Y₂=1,Y₃=1), which can be obtained from the bivariate and trivariate distributions constructed as in “Construction of vine pair-copula binary distributions” section.

Binary distributions with structured correlation matrices

To allow parsimonious modeling, multivariate binary distributions with structured correlation matrices are normally employed in the analysis of longitudinal or clustered binary data. The two most popular structured correlation matrices are autoregressive of order one (AR(1)) and equicorrelated. Yang and Chaganty (2014) have outlined a method of constructing a multivariate binary distribution with AR(1) structure, and here we focus on the equicorrelated structure.

In “An advantage of the pair-copula method over the MP model” section, we gave an example of a three-dimensional binary distribution with specified marginals and equicorrelated structure. The pair-copula method with bivariate Gaussian copulas can be used to generate higher-dimensional multivariate binary distributions with specified marginals and equicorrelated structure. This requires specification of the partial correlations Corr(Y₁,Y₃|Y₂),Corr(Y₂,Y₄|Y₃),..., Corr(Y_m−2,Y_m|Y_m−1) for tree 2; Corr(Y₁,Y₄|Y₂,Y₃),..., Corr(Y_m−3,Y_m|Y_m−2,Y_m−1) for tree 3;....; Corr(Y₁,Y_m|Y₂,...Y_m−1) for tree m−1. For binary variables, these partial correlations depend on the values of the conditional variables. To simplify matters we set Corr(Y_i,Y_i+k|Y_i+1,...Y_i+k−1)=ρ/(1+(k−1)ρ). The motivation for this assumption comes from the result that for equicorrelated structure, partial correlation ρ_{i,i+k|i+1,…,i+k−1} equals ρ/(1+(k−1)ρ) as shown in the Appendix. The corresponding parameter γ_{i,i+k|i+1,…,i+k−1} of the bivariate Gaussian copula can be obtained by solving Eq. (2) using the two conditional probabilities p_{i|i+1,…,i+k−1} and p_{i+k|i+1,…,i+k−1}. In the next section we give a numerical example to illustrate this method for dimension m=4.
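
As a small worked illustration of the simplifying assumption above, take m=4 and ρ=0.4, the value used in the numerical example. The target correlations for trees 1, 2 and 3 (k=1, 2, 3) would then be

$$\begin{array}{@{}rcl@{}} \text{Corr}(Y_{i},Y_{i+1})&=&0.4,\\ \text{Corr}(Y_{i},Y_{i+2}|Y_{i+1})&=&\frac{0.4}{1+0.4}\approx 0.2857,\\ \text{Corr}(Y_{1},Y_{4}|Y_{2},Y_{3})&=&\frac{0.4}{1+2(0.4)}\approx 0.2222. \end{array} $$

Each of these values is then converted to a Gaussian copula parameter through Eq. (2), using the appropriate conditional probabilities in place of p₁ and p₂.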

Numerical example of equicorrelated binary distribution

Assuming the marginal means are p=(0.26, 0.36, 0.25, 0.24), the feasible range of the correlation parameter ρ is (−0.3244, 0.7492) for the equicorrelated structure. With ρ=0.4, the distribution is calculated using the input values from Table 11 and is presented in Table 12.

Table 11 Input values for construction of the PMF of equicorrelated binary variables


Table 12 Four dimensional distribution with specified marginals and equicorrelated structure


We can check the marginal means of the distribution in Table 12 are P(Y₁=1)=0.26,P(Y₂=1)=0.36,P(Y₃=1)=0.25,P(Y₄=1)=0.24, and further the distribution has an equicorrelated structure with ρ=0.4.
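
The feasible ranges of ρ quoted for the two equicorrelated examples can be reproduced with the sketch below. It assumes that the range is the intersection of the pairwise correlation bounds for Bernoulli variables, expressed through the odds p/(1−p); this assumption matches the reported endpoints, but any further conditions of the cited theorem are not checked here, and the function name is hypothetical.

```python
import math
from itertools import combinations

def equicorrelation_range(p):
    """Intersection of pairwise correlation bounds for Bernoulli marginals p."""
    lo, hi = -1.0, 1.0
    for pi, pj in combinations(p, 2):
        odds_i, odds_j = pi / (1 - pi), pj / (1 - pj)
        # Pairwise lower bound: max(-sqrt(pi pj / (qi qj)), -sqrt(qi qj / (pi pj)))
        lo = max(lo, -math.sqrt(odds_i * odds_j), -1 / math.sqrt(odds_i * odds_j))
        # Pairwise upper bound: min(sqrt(pi qj / (qi pj)), sqrt(qi pj / (pi qj)))
        hi = min(hi, math.sqrt(odds_i / odds_j), math.sqrt(odds_j / odds_i))
    return lo, hi

lo3, hi3 = equicorrelation_range((0.2, 0.3, 0.2))
lo4, hi4 = equicorrelation_range((0.26, 0.36, 0.25, 0.24))
print(round(lo3, 4), round(hi3, 4))
print(round(lo4, 4), round(hi4, 4))
```

A common ρ must satisfy every pairwise bound at once, which is why the lower endpoint is the largest pairwise lower bound and the upper endpoint is the smallest pairwise upper bound.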

Parameter estimation

In this section, we discuss estimation of the parameters via maximum likelihood estimation (MLE) for the D-vine pair-copula model with bivariate Gaussian distributions. Suppose that there are n independent subjects, and there are m repeated binary observations on each subject. Thus the data consist of binary vectors y_i=(y_i1,y_i2,⋯,y_im) of dimension m. Let p_j be the marginal probability that y_ij=1, assumed to be the same for all i. There are 2^m possible combinations for y_i. For instance, when m=4, we have 16 combinations, that is, y_i=(0,0,0,0), or (0,0,0,1), or (0,0,1,0),⋯, or (1,1,1,1). The n observations can be grouped into 2^m counts. Let n₁ denote the number of (0,⋯,0) vectors, n₂ the number of (0,⋯,1) vectors, and so on, up to $n_{2^{m}}$ for the number of (1,⋯,1) vectors. With this notation, the loglikelihood, ℓ(θ), for the D-vine pair-copula model for a sample of n independent observations is given by

$$\begin{array}{@{}rcl@{}} \ell(\boldsymbol{\theta}) &\propto &n_{1} \log P(Y_{i1}=0,Y_{i2}=0,\ldots,Y_{im}=0)+n_{2} \log P(Y_{i1}=0,Y_{i2}=0,\ldots,Y_{im}=1)\\ & &\;\;\;\; +\ldots+n_{2^{m}} \log P(Y_{i1}=1,Y_{i2}=1,\ldots,Y_{im}=1), \end{array} $$

(10)

where the parameter θ consists of marginal probabilities and copula parameters that are functions of correlations between the binary variables. For instance, for the two-dimensional example shown in Table 2, the loglikelihood is

$$\begin{array}{@{}rcl@{}} \ell(\gamma_{12},p_{1},p_{2})&\propto &\!\! n_{1} \log (P(Y_{i1}=0,Y_{i2}=0))+n_{2} \log (P(Y_{i1}=0,Y_{i2}=1))\\ & & \! +n_{3} \log (P(Y_{i1}=1,Y_{i2}=0))+n_{4} \log (P(Y_{i1}=1,Y_{i2}=1))\\ &=& \!\! n_{1}\log (C_{1}(q_{1}, q_{2};\gamma_{12}))+n_{2}\log(q_{1}-C_{1}(q_{1}, q_{2};\gamma_{12}))\\ & &\! +n_{3} \log (q_{2}-C_{1}(q_{1}, q_{2};\gamma_{12}))+n_{4}\log(1-q_{1}-q_{2}+C_{1}(q_{1}, q_{2};\gamma_{12})),\\ \end{array} $$

(11)

where C₁ is the bivariate Gaussian copula. The maximum likelihood estimates of the parameters are obtained by maximizing (10) using the optimization routine “L-BFGS-B” by Byrd et al. (1995), which allows box constraints. The standard errors of the parameters are obtained from the Hessian matrix at the optimized values using the “Richardson” method of the function “hessian” in the R package “numDeriv” by Gilbert and Varadhan (2012).
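
The bivariate loglikelihood (11) can be sketched in a few lines. The example below is a simplified stand-in for the procedure described above, not a reproduction of it: instead of a joint L-BFGS-B search it fixes the marginal probabilities at the sample proportions and maximizes over γ₁₂ alone with a golden-section search. The cell counts are hypothetical, as are the function names, and the Gaussian copula is evaluated by a standard-library numerical integral.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def gaussian_copula(u, v, gamma, n=400):
    """Gaussian copula via Phi2 = Phi(h)Phi(k) + (1/2pi) * integral over
    [0, asin(gamma)] of exp(-(h^2 + k^2 - 2hk sin t) / (2 cos^2 t)) dt."""
    h, k = _STD_NORMAL.inv_cdf(u), _STD_NORMAL.inv_cdf(v)
    upper = math.asin(gamma)
    f = lambda t: math.exp(-(h * h + k * k - 2 * h * k * math.sin(t))
                           / (2 * math.cos(t) ** 2))
    step = upper / n  # Simpson's rule, n even
    total = f(0.0) + f(upper)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(i * step)
    return u * v + total * step / 3 / (2 * math.pi)

def loglik(gamma, p1, p2, counts):
    """Eq. (11); counts = (n1, n2, n3, n4) for cells (0,0), (0,1), (1,0), (1,1)."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    c = gaussian_copula(q1, q2, gamma)
    probs = (c, q1 - c, q2 - c, 1.0 - q1 - q2 + c)
    return sum(n_cell * math.log(pr) for n_cell, pr in zip(counts, probs))

def golden_section_max(f, lo, hi, tol=1e-7):
    """Maximize a unimodal function on [lo, hi]."""
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:
            b, d, fd = d, c, fc
            c = b - g * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd
            d = a + g * (b - a)
            fd = f(d)
    return 0.5 * (a + b)

counts = (30, 10, 15, 45)            # hypothetical grouped data
n = sum(counts)
p1_hat = (counts[2] + counts[3]) / n  # proportion with Y1 = 1
p2_hat = (counts[1] + counts[3]) / n  # proportion with Y2 = 1
gamma_hat = golden_section_max(
    lambda g: loglik(g, p1_hat, p2_hat, counts), -0.95, 0.95)
print(p1_hat, p2_hat, round(gamma_hat, 4))
```

In this two-dimensional case the model has as many free parameters as the 2×2 table has free cells, so the fitted value of C₁(q₁, q₂; γ₁₂) should match the observed proportion n₁/n; this gives a quick sanity check on the search. Standard errors would still require a numerical Hessian, which is not attempted here.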

Data analysis

Here we present a real-life data analysis to illustrate the application of the D-vine pair-copula with bivariate Gaussian distributions. We also compare the results with the MP model and the model that ignores the correlation between the variables.

Drug response data

These data were first reported by Grizzle et al. (1969). Here 46 subjects were treated with three drugs 1, 2 and 3, and their responses were recorded as 0 for unfavorable or 1 for favorable. For example, (0, 0, 0) stands for unfavorable responses to all three drugs. We assume the three binary responses are equicorrelated with correlation parameter ρ. The maximum likelihood estimates (MLE) of the marginal probabilities p₁, p₂ and p₃ and of ρ, together with standard errors (SE), are presented in Table 13.

Table 13 Parameter estimates and standard errors for the drug response data


The estimate of ρ is close to zero for both the MP and D-vine pair-copula models. The estimates and standard errors of the D-vine independent copula model are listed in the last two columns of Table 13. The D-vine independent copula model has the minimum AIC and seems to be a good choice for this data set.

Discussion

In recent years vine pair-copula models have become popular for analyzing dependent multivariate data. However, understanding and using these models for discrete data, and in particular for binary data, can pose a challenge to the practitioner. In this paper, we have illustrated the pair-copula construction of binary distributions in two and three dimensions in a way that makes it accessible to the practitioner. In three dimensions, using bivariate Gaussian copulas, we have shown that the probability mass function generated by the pair-copula differs from the mass function of the multivariate probit (MP) model. We gave a numerical example where the MP model fails but one is able to use the pair-copula method to generate a mass function with specified marginals and correlations. For four and higher dimensions we provide a method of constructing a multivariate binary distribution with specified marginals and equicorrelated structure using the D-vine pair-copula method. We discussed the maximum likelihood estimation of the parameters for grouped multivariate binary data and provided a real-life data analysis. Future work involves including covariates in these models.

Appendix

Consider the equicorrelated structure given by $R=(1-\rho)I_{m}+\rho \, e_{m}\,e_{m}^{T}$, with parameter ρ. Here I_m is the identity matrix of dimension m and e_m is an m×1 column vector of ones. From formula (2.19), page 40 in Joe (2014), the partial correlation is given by

$$\begin{array}{@{}rcl@{}} \rho_{1,m|2,...,m-1}=\frac{\rho-\rho^{2}\,e_{m-2}^{T}\, R_{11}^{-1}\, e_{m-2}} {{1-\rho^{2}\, \,e_{m-2}^{T}\, R_{11}^{-1}\, e_{m-2}}}\,, \end{array} $$

(12)

where $R_{11}=(1-\rho)I_{m-2}+\rho \, e_{m-2}\,e_{m-2}^{T}$. Using the formula in Example 4.1 of Chaganty (1997), we have

$$\begin{array}{@{}rcl@{}} R_{11}^{-1}=\frac{1}{1-\rho}\left[I_{m-2}-\frac{\rho}{1+(m-3)\rho}e_{m-2}\,e_{m-2}^{T}\right]. \end{array} $$

Since $e_{m-2}^{T}\,e_{m-2}=(m-2)$ we get

$$\begin{array}{@{}rcl@{}} e_{m-2}^{T}\, R_{11}^{-1}\,e_{m-2} &=&\frac{1}{1-\rho}\left((m-2)-\frac{\rho}{1+(m-3)\rho}(m-2)^{2}\right)\cr &=&\frac{(m-2)}{1-\rho}\left(1-\frac{\rho(m-2)}{1+(m-3)\rho}\right)\cr &=&\frac{(m-2)}{1+(m-3)\rho}. \end{array} $$

(13)

Substituting (13) in (12) and simplifying we get

$$\begin{array}{@{}rcl@{}} \rho_{1,m|2,...,m-1} &=&\frac{\rho(1+(m-3)\rho)-(m-2)\rho^{2}}{1+(m-3)\rho-(m-2)\rho^{2}}\cr &=&\frac{\rho-\rho^{2}}{m\rho(1-\rho)+(1-\rho)(1-2\rho)}\cr &=&\frac{\rho}{1+(m-2)\rho}. \end{array} $$

(14)

The constant (m−2) in the denominator of (14) represents the number of conditioning variables. More generally, for the equicorrelated structure the partial correlation ρ_{i,i+k|i+1,…,i+k−1}=ρ/(1+(k−1)ρ) for any 1≤i≤(m−k), 1≤k≤(m−1).
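
The closed form above can be cross-checked numerically with the usual recursion for partial correlations. The sketch below is not part of the original derivation: it relies on the fact that, under equicorrelation, all partial correlations with the same number of conditioning variables are equal, so adding one conditioning variable maps r to (r − r²)/(1 − r²). The function names are hypothetical.

```python
def partial_corr_by_recursion(rho, n_cond):
    """Partial correlation given n_cond other variables, built one
    conditioning variable at a time via r -> (r - r*r) / (1 - r*r)."""
    r = rho
    for _ in range(n_cond):
        r = (r - r * r) / (1 - r * r)
    return r

def partial_corr_closed_form(rho, n_cond):
    """Closed form rho / (1 + n_cond * rho), with n_cond = k - 1."""
    return rho / (1 + n_cond * rho)

rho = 0.4
for k in (1, 2, 3):
    print(k,
          round(partial_corr_by_recursion(rho, k - 1), 6),
          round(partial_corr_closed_form(rho, k - 1), 6))
```

Both columns should agree for every k, since each recursion step simplifies algebraically to r/(1+r).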

Availability of data and materials

Interested readers can contact the first author.

Abbreviations

AIC: Akaike information criterion
AR(1): Autoregressive of order one
CDF: Cumulative distribution function
GEE: Generalized estimating equations
MC: Markov chains
MLE: Maximum likelihood estimation
MP: Multivariate probit
MVN: Multivariate normal
PMF: Probability mass function
SE: Standard error

References

Bedford, T., Cooke, R. M.: Vines–a new graphical model for dependent random variables. Ann. Stat. 30(4), 1031–1068 (2002).
Article MathSciNet Google Scholar
Brechmann, E., Schepsmeier, U.: Modeling dependence with c- and d-vine copulas: the r package cdvine. J. Stat. Softw., 52 (2013). https://doi.org/10.18637/jss.v052.i03.
Byrd, R. H., Lu, P., Nocedal, J., Zhu, C.: A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. 16(5), 1190–1208 (1995).
Article MathSciNet Google Scholar
Chaganty, N. R.: An alternative approach to the analysis of longitudinal data via generalized estimating equations. J. Stat. Plan. Inf. 63(1), 39–54 (1997).
Article MathSciNet Google Scholar
Chaganty, N. R., Joe, H.: Range of correlation matrices for dependent bernoulli random variables. Biometrika. 93(1), 197–206 (2006).
Article MathSciNet Google Scholar
Czado, C.: Analyzing Dependent Data with Vine Copulas: A Practical Guide With R. Springer International Publishing, Lecture Notes in Statistics (2019).
Book Google Scholar
Dalla Valle, L., Leisen, F., Rossini, L.: Bayesian non-parametric conditional copula estimation of twin data. J. Roy. Stat. Soc. C: Appl. Stat. 67, 523–548 (2018).
Article MathSciNet Google Scholar
Deng, Y., Chaganty, N. R.: Pair-copula models for analyzing family data. J. Stat. Theory Pract. 15(1), 13 (2021).
Article MathSciNet Google Scholar
Escarela, G., Perez-Ruiz, L. C., Bowater, R. J.: A copula-based markov chain model for the analysis of binary longitudinal data. J. Appl. Stat. 36(6), 647–657 (2009).
Article MathSciNet Google Scholar
Gilbert, P., Varadhan, R.: numDeriv: Accurate Numerical Derivatives. R Package (2012). http://CRAN.R-project.org/package=numDeriv.
Grizzle, J. E., Starmer, C. F., Koch, G. G.: Analysis of categorical data by linear models. Biometrics, 489–504 (1969).
Gruber, L., Czado, C.: Sequential bayesian model selection of regular vine copulas. Bayesian Anal. 10, 937–963 (2015).
Article MathSciNet Google Scholar
Joe, H.: Families of m-variate distributions with given margins and m(m−1)/2 bivariate dependence parameters. Lecture Notes–Monograph Series, vol. 28. Institute of Mathematical Statistics, Hayward (1996).
Google Scholar
Joe, H.: Multivariate Models and Multivariate Dependence Concepts. Chapman & Hall/CRC, London (1997).
Book Google Scholar
Joe, H.: Dependence modeling with copulas. Chapman and Hall/CRC, London (2014).
Book Google Scholar
Kurowicka, D., Joe, H.: Dependence modeling: vine copula handbook. World scientific, Singapore (2011).
Google Scholar
Lennon, H.: Gaussian copula modelling for integer-valued time series. PhD thesis. The University of Manchester (United Kingdom) (2016).
Liang, K. Y., Zeger, S. L.: Longitudinal data analysis using generalized linear models. Biometrika. 73(1), 13–22 (1986).
Article MathSciNet Google Scholar
Min, A., Czado, C.: Bayesian inference for multivariate copulas using pair-copula constructions. J. Financ. Econ. 8, 511–546 (2010).
Google Scholar
Panagiotelis, A., Czado, C., Joe, H.: Pair copula constructions for multivariate discrete data. J. Am. Stat. Assoc. 107(499), 1063–1072 (2012).
Article MathSciNet Google Scholar
Panagiotelis, A., Czado, C., Joe, H., Stöber, J.: Model selection for discrete regular vine copulas. Comput. Stat. Data Anal. 106, 138–152 (2017).
Article MathSciNet Google Scholar
Radice, R., Marra, G., Wojtyś, M.: Copula regression spline models for binary outcomes. Stat. Comput. 26(5), 981–995 (2016).
Article MathSciNet Google Scholar
Sabo, R. T., Chaganty, N. R.: What can go wrong when ignoring correlation bounds in the use of generalized estimating equations. Stat. Med. 29(24), 2501–2507 (2010).
Article MathSciNet Google Scholar
Sklar, M.: Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Stat. Univ. Paris. 8, 229–231 (1959).
Smith, M., Min, A., Almeida, C., Czado, C.: Modeling longitudinal data using a pair-copula decomposition of serial dependence. J. Am. Stat. Assoc. 105(492), 1467–1479 (2010).
Smith, M. S., Khaled, M. A.: Estimation of copula models with discrete margins via Bayesian data augmentation. J. Am. Stat. Assoc. 107(497), 290–303 (2012).
Winkelmann, R.: Copula bivariate probit models: with an application to medical expenditures. Health Econ. 21(12), 1444–1455 (2012).
Yang, W., Chaganty, N. R.: A contrasting study of likelihood methods for the analysis of longitudinal binary data. Commun. Stat. Theory Methods. 43(14), 3027–3046 (2014).

Acknowledgements

We thank the associate editor and two referees whose constructive comments on an earlier version resulted in an improved presentation.

Funding

There is no funding support for the research work.

Author information

Authors and Affiliations

Department of Mathematics & Statistics, Old Dominion University, 2300 Elkhorn Avenue, Norfolk, VA 23529, USA
Huihui Lin & N. Rao Chaganty

Contributions

Both authors contributed equally to the work, and both read and approved the final manuscript.

Corresponding author

Correspondence to N. Rao Chaganty.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Lin, H., Chaganty, N.R. Multivariate distributions of correlated binary variables generated by pair-copulas. J Stat Distrib App 8, 4 (2021). https://doi.org/10.1186/s40488-021-00118-z

Received: 12 October 2020
Accepted: 16 February 2021
Published: 05 March 2021
DOI: https://doi.org/10.1186/s40488-021-00118-z
