Multivariate distributions of correlated binary variables generated by pair-copulas

Abstract

Correlated binary data are prevalent in a wide range of scientific disciplines, including healthcare and medicine. The generalized estimating equations (GEEs) and the multivariate probit (MP) model are two of the popular methods for analyzing such data. However, both methods have some significant drawbacks. The GEEs may not have an underlying likelihood and the MP model may fail to generate a multivariate binary distribution with specified marginals and bivariate correlations. In this paper, we study multivariate binary distributions that are based on D-vine pair-copula models as a superior alternative to these methods. We elucidate the construction of these binary distributions in two and three dimensions with numerical examples. For higher dimensions, we provide a method of constructing a multidimensional binary distribution with specified marginals and equicorrelated correlation matrix. We present a real-life data analysis to illustrate the application of our results.

Introduction

In clinical trials and research studies in health care and medicine, the endpoint of the observed data most often consists of correlated binary observations. The generalized estimating equation (GEE), introduced by Liang and Zeger (1986), has been the common statistical tool for analyzing such data. However, this method has several drawbacks. One drawback is that it uses an ambiguously defined working correlation to model the dependence in the binary observations, which could lead to misleading conclusions (Sabo and Chaganty, 2010). Another drawback is that it is a non-likelihood approach, in the sense that it does not have an underlying joint distribution for the correlated binary observations. Other alternatives to GEEs for the analysis of correlated binary data are Markov chain (MC) and multivariate probit (MP) models. A contrasting study of the first-order MC model and the MP model was presented by Yang and Chaganty (2014). They showed that both models are asymptotically efficient, and discussed situations where one is preferable over the other.

In recent years, due to their success in other disciplines, copulas have been used to develop likelihood-based methods as another alternative to GEEs. Some researchers have combined copulas with MC models. Escarela et al. (2009) have used Gaussian copula to construct conditional probabilities in Markov chain models in the context of longitudinal binary data. The copula-based bivariate probit models were generalized by Winkelmann (2012) replacing the Gaussian distribution by Frank and Clayton copulas. Radice et al. (2016) introduced nonlinear regression models, where non-Gaussian copulas were used to deal with the dependence between binary responses. Smith et al. (2010) showed that longitudinal continuous data can be modeled by D-vine pair-copula, and later extended their work to the discrete case using a Bayesian framework in Smith and Khaled (2012). A Gaussian copula model for integer-valued ARMA structured time series data with or without covariates was developed by Lennon (2016). Panagiotelis et al. (2017) introduced two algorithms for optimizing vine structure and pair-copula selection for discrete regular vine copulas. One of these algorithms uses a modified Akaike information criterion and the other uses predictive scores with cross-validation.

In this paper we study multivariate binary distributions generated by D-vine pair-copula models. These models are relatively easy to implement since they use only bivariate copulas, and flexible because they allow different types of bivariate copulas to model different types of dependence in the conditional distributions. We will see that the D-vine pair-copula model has some advantages over the MP model.

The organization of this paper is as follows. We first describe the construction of bivariate and trivariate binary distributions using bivariate Gaussian, Clayton, Frank, and Gumbel copulas in the “Construction of vine pair-copula binary distributions” section. We compare pair-copula models and the multivariate probit (MP) model in the “Comparison of pair-copula and MP models” section, together with a numerical example where the vine pair-copula model overcomes the difficulties associated with the MP model. In the “Extensions to four and higher dimensions” section we discuss extensions to four and higher dimensions and present a method of constructing a multivariate binary distribution with specified marginals and equicorrelated structure. In the “Parameter estimation” section, we discuss parameter estimation by maximum likelihood for grouped data. The “Data analysis” section contains an analysis of a real-life data set. We end the paper with some discussion in the “Discussion” section.

Construction of vine pair-copula binary distributions

In this section, we illustrate methods of constructing multivariate binary distributions with specified correlations using vine pair-copula methods. The major advantage of these methods is that the multivariate distribution can be constructed using only bivariate copulas. The method is computationally feasible, flexible, and can accommodate various types of dependence because the bivariate copula need not be the same for various bivariate marginal and conditional distributions. We start first with the simplest cases, bivariate and the trivariate distributions, and then show how they can be extended to higher dimensions focusing on the special correlation structures useful in longitudinal and clustered binary data analysis.

Bivariate binary distributions

Consider first the case of two binary variables. Let Y=(Y1, Y2), where the subscripts 1 and 2 may indicate two sequential time points. Assume that E(Yi)=pi for i=1,2. According to the theorem of Sklar (1959), the joint CDF of Y using a copula function C is given by F(y1,y2)=C(F1(y1),F2(y2)), where F1 and F2 are the CDFs of the univariate binary distributions of Y1 and Y2, respectively. Following Panagiotelis et al. (2012), we can recover the joint probability mass function (PMF) of Y from the CDF as

$$\begin{array}{@{}rcl@{}} P(Y_{1}=y_{1},Y_{2}=y_{2})&\!\!=&\!\! C(F_{1}(y_{1}), F_{2}(y_{2});\; \theta)-C(F_{1}(y_{1}-1), F_{2}(y_{2});\; \theta) \cr && \!\!\! -C(F_{1}(y_{1}), F_{2}(y_{2}-1);\; \theta)+C(F_{1}(y_{1}-1), F_{2}(y_{2}-1);\; \theta). \cr &&\end{array} $$
(1)

The copula C(u1,u2; θ) can be any of the copulas Ci, 1≤i≤5, given in Table 1. The copula parameter θ is the correlation coefficient γ for the Gaussian copula, and it is α for the Clayton, Frank, and Gumbel copulas.

Table 1 Copula families of distribution functions

Selecting yi=0,1 and noting that Fi(0)=P(Yi=0)=1−pi=qi for i=1,2, Eq. (1) simplifies to the probabilities given in Table 2.

Table 2 PMF of bivariate binary variables

Given ρ=Corr(Y1,Y2), if we use the Gaussian copula C1 in Table 2, the parameter θ=γ can be obtained by solving the equation

$$\begin{array}{@{}rcl@{}} \text{Corr}(Y_1,Y_2)=\rho=\frac{C_{1}(p_1,\, p_2;\,\gamma)-p_{1}\,p_2}{\sqrt{p_{1}\,q_{1}\,p_{2}\,q_2}}, \end{array} $$
(2)

since 1−q1−q2+C1(q1,q2; γ)=C1(p1, p2; γ).
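As an illustration of Table 2 and Eq. (2), the following Python sketch evaluates the bivariate PMF under a Gaussian copula and then recovers γ from a target correlation by bisection. It is a sketch under stated assumptions and not a definitive implementation: the helper names (gaussian_copula, bivariate_pmf, correlation, solve_gamma) are ours, the bivariate normal CDF is approximated by integrating the bivariate normal density over its correlation parameter with Simpson's rule (a standard identity, adequate when |γ| is not too close to 1), and the inputs p1=0.8, p2=0.7, γ=0.752 are the values used for case 5 in the trivariate examples.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

_N = NormalDist()  # standard normal, used for the inverse CDF


def gaussian_copula(u, v, gamma, steps=400):
    """Approximate C(u, v; gamma) = Phi2(Phi^{-1}(u), Phi^{-1}(v); gamma).

    Phi2(h, k; gamma) equals Phi(h) * Phi(k) plus the integral of the bivariate
    normal density, viewed as a function of its correlation, from 0 to gamma.
    The integral is approximated with composite Simpson's rule (steps is even).
    """
    h, k = _N.inv_cdf(u), _N.inv_cdf(v)

    def density(r):
        s = 1.0 - r * r
        return exp(-(h * h - 2.0 * r * h * k + k * k) / (2.0 * s)) / (2.0 * pi * sqrt(s))

    step = gamma / steps
    total = density(0.0) + density(gamma)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * step)
    return u * v + total * step / 3.0


def bivariate_pmf(p1, p2, gamma):
    """Cell probabilities of (Y1, Y2) obtained from Eq. (1) with y1, y2 in {0, 1}."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    c = gaussian_copula(q1, q2, gamma)
    return {(0, 0): c, (0, 1): q1 - c, (1, 0): q2 - c, (1, 1): 1.0 - q1 - q2 + c}


def correlation(p1, p2, gamma):
    """Right-hand side of Eq. (2), using P(Y1=1, Y2=1) = C1(p1, p2; gamma)."""
    p11 = bivariate_pmf(p1, p2, gamma)[(1, 1)]
    return (p11 - p1 * p2) / sqrt(p1 * (1.0 - p1) * p2 * (1.0 - p2))


def solve_gamma(p1, p2, rho, lo=-0.99, hi=0.99, tol=1e-10):
    """Bisection on gamma; the correlation in Eq. (2) increases with gamma."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if correlation(p1, p2, mid) < rho:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


pmf = bivariate_pmf(0.8, 0.7, 0.752)
rho = correlation(0.8, 0.7, 0.752)
gamma = solve_gamma(0.8, 0.7, rho)
print(pmf, rho, gamma)
```

With these inputs the (0,0) cell is about 0.1517, the value that enters the case 5 calculations, and solving Eq. (2) for the implied correlation returns γ=0.752 again.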

Trivariate binary distributions

In this section we extend the pair-copula method to construct three dimensional binary distributions. Let Y=(Y1, Y2, Y3) be a vector of three correlated binary random variables. Note that

$$\begin{array}{@{}rcl@{}} P(Y_1=y_{1},Y_{2}=y_{2},Y_{3}=y_{3})=P(Y_{2}=y_{2})\, P(Y_{1}=y_{1},Y_{3}=y_{3}|Y_{2}=y_{2}). \end{array} $$
(3)

The above equation shows that the three-dimensional distribution can be obtained by constructing the bivariate conditional distribution of (Y1, Y3) given Y2=y2. To this end we introduce some notation. We first construct bivariate distributions for (Y1, Y2) and (Y2, Y3) selecting bivariate copulas C12(u1, u2) and C23(u1, u2) from Table 2. For notational convenience we omit the copula parameter here and in some formulas later.

Let \(q_{1|0}=P(Y_{1}=0|Y_{2}=0)=\frac {C_{12}(q_{1}, q_{2})}{q_{2}}\) and \(p_{1|0}=1-q_{1|0}=1-\frac {C_{12}(q_{1}, q_{2})}{q_{2}}\). Thus Y1|Y2=0 is distributed as Bernoulli with mean p1|0. Similarly, Y3|Y2=0 is distributed as Bernoulli with mean p3|0, where \(p_{3|0}=1-\frac {C_{23}(q_{2}, q_{3})}{q_{2}}\). We also have Y1|Y2=1 is Bernoulli with mean \(p_{1|1}=1-\frac {q_{1}-C_{12}(q_{1}, q_{2})}{p_{2}}\), and Y3|Y2=1 is Bernoulli with mean \(p_{3|1}=1-\frac {q_{3}-C_{23}(q_{2}, q_{3})}{p_{2}}\).

Table 3 shows the conditional distributions of (Y1, Y3)|Y2=0 and (Y1, Y3)|Y2=1. Finally, from Eq. (3) and using the conditional distributions, we obtain the joint trivariate PMF given in Table 4.

Table 3 Conditional PMF of (Y1, Y3) given Y2
Table 4 PMF of trivariate binary variables

We give six numerical examples to show that different copulas and parameter values give rise to different trivariate binary distributions. All six distributions have the same marginal means p1=0.8, p2=0.7, and p3=0.6. The choices of copulas and parameter values for these six cases are summarized in Table 5.

Table 5 Summary of parameter values for PMF of the trivariate binary variables

We give details of the calculations only for case 5; the others are done similarly. In this case we start with Table 2 using the Gaussian copula with γ12=0.752, γ23=0.607, and p=(0.8, 0.7, 0.6). The resulting PMFs of the bivariate binary variables are in Table 6.

Table 6 PMF of bivariate binary variables using Gaussian copula

Now, to construct the conditional PMFs of bivariate variables we need to get the marginal parameters of the conditional Bernoulli variables Y1|Y2=0 and Y1|Y2=1. These are

$$\begin{array}{@{}rcl@{}} q_{1|0}&=&\frac{P(Y_1=0,Y_2=0)}{q_2}=\frac{0.1517}{0.3}=0.5057,\\ q_{3|0}&=&\frac{P(Y_2=0,Y_3=0)}{q_2}=\frac{0.2097}{0.3}=0.6990,\\ q_{1|1}&=&\frac{P(Y_1=0,Y_2=1)}{p_2}=\frac{0.0483}{0.7}=0.0690,\\ q_{3|1}&=&\frac{P(Y_2=1,Y_3=0)}{p_2}=\frac{0.1903}{0.7}=0.2719. \end{array} $$

Then p1|0=1−0.5057=0.4943, p3|0=1−0.6990=0.3010, p1|1=1−0.0690=0.9310, and p3|1=1−0.2719=0.7281. The Frank copula is used for the conditional distributions, with parameters \(\alpha _{13|Y_{2}=0}=0.95\) and \(\alpha _{13|Y_{2}=1}=0.85\). The PMFs of the conditional bivariate binary variables are calculated according to Table 3, resulting in the values given in Table 7.

Table 7 Conditional PMF of bivariate binary variables generated by Frank copula

Since P(Y1=y1,Y2=y2,Y3=y3)=P(Y2=y2)P(Y1=y1,Y3=y3|Y2=y2), the last step is to multiply the values in Table 7 by P(Y2=y2) to get the joint trivariate probabilities. For example, P(Y1=0,Y2=0,Y3=0)=0.3782×0.3=0.1135 and P(Y1=1, Y2=1, Y3=0)=0.2475×0.7=0.1732. The three-dimensional joint binary distributions for the six cases are given in Table 8.

Table 8 PMF of trivariate binary variables
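The case 5 calculation can be reproduced with a short Python sketch. It takes the tree 1 joint probabilities P(Y1=0,Y2=0)=0.1517 and P(Y2=0,Y3=0)=0.2097 from the text as inputs, and it assumes that the Frank copula of Table 1 has the standard form C(u,v;α)=−(1/α) log[1+(e^{−αu}−1)(e^{−αv}−1)/(e^{−α}−1)]. The function names frank_copula and trivariate_pmf are illustrative and not taken from the paper.

```python
from math import exp, log


def frank_copula(u, v, alpha):
    """Frank copula in its standard parameterization (assumed to match Table 1)."""
    num = (exp(-alpha * u) - 1.0) * (exp(-alpha * v) - 1.0)
    return -log(1.0 + num / (exp(-alpha) - 1.0)) / alpha


def trivariate_pmf(q1, q2, q3, c12, c23, alpha0, alpha1):
    """Joint PMF of (Y1, Y2, Y3) from Eq. (3).

    c12 = P(Y1=0, Y2=0) and c23 = P(Y2=0, Y3=0) come from tree 1;
    alpha0 and alpha1 are the Frank parameters given Y2=0 and Y2=1.
    """
    p2 = 1.0 - q2
    # For each value of Y2: (q_{1|y2}, q_{3|y2}, copula parameter, P(Y2=y2))
    cond = {
        0: (c12 / q2, c23 / q2, alpha0, q2),
        1: ((q1 - c12) / p2, (q3 - c23) / p2, alpha1, p2),
    }
    pmf = {}
    for y2, (qa, qb, alpha, weight) in cond.items():
        c = frank_copula(qa, qb, alpha)          # P(Y1=0, Y3=0 | Y2=y2)
        pmf[(0, y2, 0)] = weight * c
        pmf[(0, y2, 1)] = weight * (qa - c)
        pmf[(1, y2, 0)] = weight * (qb - c)
        pmf[(1, y2, 1)] = weight * (1.0 - qa - qb + c)
    return pmf


pmf = trivariate_pmf(q1=0.2, q2=0.3, q3=0.4, c12=0.1517, c23=0.2097,
                     alpha0=0.95, alpha1=0.85)
for cell in sorted(pmf):
    print(cell, round(pmf[cell], 4))
```

Up to rounding, the cells (0,0,0) and (1,1,0) agree with the values 0.1135 and 0.1732 quoted above, and the eight probabilities sum to one with P(Y1=1)=0.8.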

Comparison of pair-copula and MP models

In this section, we will compare the probability mass functions generated by the pair-copula methods and by the multivariate probit model. We will see that even if we use bivariate Gaussian copulas, these two methods yield different probability mass functions. Furthermore, the pair-copula method is successful in cases where the MP model fails to generate a probability mass function with specified univariate marginals and correlations.

Multivariate probit (MP) model

Let Y=(Y1,Y2,…,Ym) be a vector of binary random variables. The multivariate probit model assumes that associated with the vector Y there is a latent vector Z=(Z1,Z2,…,Zm), which is distributed as multivariate normal (MVN), such that Yt=1 if Zt>0, and Yt=0 if Zt≤0. Assume Zt=μt+εt, where ε=(ε1,…,εm) is MVN (0, R). Then, pt=P(Yt=1)=P(Zt>0)=P(μt+εt>0)=Φ(μt), and qt=(1−pt)=Φ(−μt). The joint PMF of Y=(Y1,Y2,…,Ym) is given by

$$\begin{array}{@{}rcl@{}} P(Y_1=y_1,Y_2=y_2,\ldots,Y_m=y_m)= \int_{D_m}...\int_{D_1}\frac{1}{(2\pi)^{\frac{m}{2}}|R|^{\frac{1}{2}}}\exp\left(-\frac{\boldmath{\epsilon}\; R^{-1}\;\boldmath{\epsilon}'}{2}\right) \;\; d\boldmath{\epsilon},\\ \end{array} $$
(4)

where Dt=(−∞, μt) if yt=1, and Dt=(μt, ∞) if yt=0. For example, for m=3 we have

$$\begin{array}{@{}rcl@{}} P(Y_1=0,Y_2=0,Y_3=0)&=&\int_{\mu_1}^{\infty}\int_{\mu_2}^{\infty}\int_{\mu_3}^{\infty} \phi_{3}(\epsilon_1,\epsilon_2,\epsilon_{3}\, ;\, R)\;\;d\boldmath{\epsilon}\cr \cr &=&\Phi_{3}(-\mu_1,-\mu_2,-\mu_{3}\, ;\, R), \end{array} $$

where Φ3(ε ; R) is the CDF of trivariate standard normal with correlation matrix R.

Distributions generated by pair-copula and MP models

Since the MP model relies on the Gaussian distribution, for a fair comparison we will use the Gaussian copulas for the bivariate and conditional distributions in the pair-copula construction. Consider the case of two dimensions. In this case, taking C12 as the bivariate Gaussian copula, the PMF as given in Table 2 is P(Y1=0,Y2=0)=C12(q1,q2;γ)=Φ2(Φ−1(q1),Φ−1(q2);γ)=Φ2(−μ1,−μ2;γ), which is identical to the probability under the MP model. Therefore, the probability distributions are the same for two dimensions. For three dimensions for the MP model, we have

$$\begin{array}{@{}rcl@{}} P(Y_1=0,Y_2=0,Y_3=0)&=\Phi_{3}(-\mu_1,-\mu_2,-\mu_{3}\, ;\, R). \end{array} $$
(5)

From Table 4, we see that for the pair-copula model

$$\begin{array}{@{}rcl@{}} P(Y_1=0,\;Y_2=0,\;Y_3=0)=q_{2}\; C_{13|0}(q_{1|0},\, q_{3|0}). \end{array} $$
(6)

With bivariate Gaussian copulas we have qi=Φ(−μi) for i=1,2,3 and

$$\begin{array}{@{}rcl@{}} q_{1|0}=\frac{C_{12}(q_1,q_2)}{q_2}=\frac{\Phi_{2}(-\mu_1,-\mu_2;\,\gamma_{12})}{\Phi(-\mu_2)}=P(\epsilon_1<-\mu_1|\epsilon_2<-\mu_2), \\ q_{3|0}=\frac{C_{23}(q_2,q_3)}{q_2}=\frac{\Phi_{2}(-\mu_2,-\mu_3;\,\gamma_{23})}{\Phi(-\mu_2)}=P(\epsilon_3<-\mu_3|\epsilon_2<-\mu_2), \end{array} $$
(7)

where ε=(ε1, ε2, ε3) is distributed as a standard trivariate normal with correlation matrix R=(γij). The quantities q1|0 and q3|0 are the same as the corresponding values for the probit model. Taking C13|0(u1,u2) as bivariate Gaussian copula with correlation γ13|0, Eq. (6) is equivalent to

$$\begin{array}{@{}rcl@{}} P(Y_1=0,\;Y_2=0,\;Y_3=0) = \Phi(-\mu_2)\, \Phi_{2}\Bigl(\Phi^{-1}(q_{1|0}), \Phi^{-1}(q_{3|0}); \gamma_{13|0}\Bigr). \end{array} $$
(8)

Clearly the quantity (8) is not equal to (5) in general, since the parameter γ13|0 can be any value in (−1, 1) and need not be related to R. Thus the PMF of the pair-copula model is different from that of the MP model.

An advantage of the pair-copula method over the MP model

In this section, we give an example to show that the pair-copula method can construct multivariate binary distributions with specified marginals and correlation structure in cases where the multivariate probit model breaks down. Assume that the marginal means are given by the vector p=(0.2, 0.3, 0.2). For the equicorrelated structure the feasible range of the correlation parameter ρ is (−0.25, 0.7638); see Theorem 1 in Chaganty and Joe (2006). For the value ρ=0.76, the latent correlation matrix obtained by solving Eq. (2) for all pairs is

$$\begin{array}{@{}rcl@{}} R=\left[\begin{array}{ccc} 1&0.9853 &0.9411\\ 0.9853&1&0.9853\\ 0.9411&0.9853&1\\ \end{array}\right], \end{array} $$

which is not positive definite and thus the MP method does not give a PMF for the binary variables. However, the pair-copula method generates a PMF for the binary variables with specified marginal means and equicorrelated structure. The input values needed to calculate the PMF are listed in Table 9.

Table 9 Input values for construction of the PMF of equicorrelated binary variables

Proceeding as in “Trivariate binary distributions” section, the resulting three dimensional distribution is given in Table 10. We can check that this distribution has the specified marginal means p1=0.2,p2=0.3, and p3=0.2, and equicorrelated structure with ρ=0.76.

Table 10 Three dimensional distribution with specified marginals
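The failure of the MP model in this example can be checked directly from the latent correlation matrix R displayed above. The sketch below applies Sylvester's criterion (a symmetric matrix is positive definite exactly when all of its leading principal minors are positive) to the rounded entries of R; the function name leading_minors is ours.

```python
def leading_minors(a):
    """Leading principal minors of a 3x3 matrix, for Sylvester's criterion."""
    m1 = a[0][0]
    m2 = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    m3 = (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
          - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
          + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    return m1, m2, m3


# Latent correlation matrix obtained from Eq. (2) for p=(0.2, 0.3, 0.2), rho=0.76
R = [[1.0, 0.9853, 0.9411],
     [0.9853, 1.0, 0.9853],
     [0.9411, 0.9853, 1.0]]

minors = leading_minors(R)
is_positive_definite = all(m > 0 for m in minors)
print(minors, is_positive_definite)
```

The first two minors are positive, but the determinant is slightly negative (about −3×10^{−5}), so R is not a valid correlation matrix for the latent normal vector.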

Extensions to four and higher dimensions

The pair-copula method for three dimensions described in the “Trivariate binary distributions” section can be extended to construct four or higher-dimensional multivariate binary distributions. The foundations of these higher-dimensional extensions were laid out in Joe (1996; 1997). In a pioneering work, Bedford and Cooke (2002) showed how to use graphical models consisting of vines with trees and edges. The edges in a given tree become the nodes of the next tree. The vine structures not only help in enumerating and organizing the numerous decompositions of a multivariate distribution but also facilitate models for different types of dependence for the marginal and conditional distributions. In recent years several articles have been published on vine pair-copula models; see Kurowicka and Joe (2011). The papers by Min and Czado (2010), Gruber and Czado (2015), and Dalla Valle et al. (2018) discuss Bayesian inference for these vine pair-copula models. The lecture notes by Czado (2019) and the paper by Brechmann and Schepsmeier (2013) discuss the practical implementation of vine copulas using the R software. The two most popular vines are the canonical C-vine and the drawable D-vine; see Czado (2019) and Joe (2014). An application of the C-vine for analyzing familial data is in Deng and Chaganty (2021). In this paper, we focus on the D-vine, which is a natural candidate for analyzing longitudinal data consisting of an ordered sequence of variables.

Figure 1 shows the nested tree structure of the D-vine for m variables. There are (m−1) trees, and the ith tree has m−i+1 nodes, represented by rectangular boxes. In the case of m=4, the D-vine consists of 3 trees. For the pair-copula construction we will need bivariate copulas for the pairs (12), (23), and (34) in tree 1. Since we are dealing with binary variables, for tree 2 we will need two bivariate copulas for constructing the conditional distribution of (13|2) and another two for constructing (24|3). The final tree requires the construction of the conditional distribution of (14|23), which in turn requires four bivariate copulas for the four possible values of the conditioning variables 2 and 3. The joint PMF in four dimensions is given by

$$\begin{array}{@{}rcl@{}} P(Y_1=y_1,Y_2=y_2,Y_3=y_3, Y_4=y_4)&=&P(Y_1=y_1,Y_4=y_4|Y_2=y_2, Y_3=y_3)\cr &&\;\; \times\; P(Y_2=y_2, Y_3=y_3), \end{array} $$
(9)
Fig. 1 D-vine structure of dimension m

The probability P(Y1=y1,Y4=y4|Y2=y2,Y3=y3) requires p1|00=P(Y1=1|Y2=0,Y3=0),p1|01=P(Y1=1|Y2=0,Y3=1),…, p4|11=P(Y4=1|Y2=1,Y3=1), which can be obtained from the bivariate and trivariate distributions constructed as in “Construction of vine pair-copula binary distributions” section.
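The bookkeeping described above (which pairs appear in each tree, and how many bivariate copulas each pair needs when the conditioning variables are binary) can be sketched in a few lines of Python. The function name dvine_pair_copulas is hypothetical; the count of 2^{(number of conditioning variables)} copulas per pair follows the description given for m=4.

```python
def dvine_pair_copulas(m):
    """List (tree, pair, conditioning set, number of copulas) for a binary D-vine."""
    rows = []
    for tree in range(1, m):                      # trees 1, ..., m-1
        for i in range(1, m - tree + 1):          # tree t has m - t edges
            j = i + tree
            conditioning = tuple(range(i + 1, j))
            # one bivariate copula per configuration of the binary conditioning variables
            rows.append((tree, (i, j), conditioning, 2 ** len(conditioning)))
    return rows


rows = dvine_pair_copulas(4)
total = sum(count for _, _, _, count in rows)
for tree, pair, conditioning, count in rows:
    print(f"tree {tree}: pair {pair} given {conditioning} needs {count} copula(s)")
print("total copulas:", total)
```

For m=4 this lists the pairs (12), (23), (34) with one copula each, (13|2) and (24|3) with two each, and (14|23) with four, for a total of 11 bivariate copulas.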

Binary distributions with structured correlation matrices

To allow parsimonious modeling, multivariate binary distributions with structured correlation matrices are normally employed in the analysis of longitudinal or clustered binary data. The two most popular structured correlation matrices are autoregressive of order one (AR(1)) and equicorrelated. Yang and Chaganty (2014) have outlined a method of constructing a multivariate binary distribution with AR(1) structure, and here we focus on the equicorrelated structure.

In “An advantage of the pair-copula method over the MP model” section, we gave an example of a three-dimensional binary distribution with specified marginals and equicorrelated structure. The pair-copula method with bivariate Gaussian copulas can be used to generate higher-dimensional multivariate binary distributions with specified marginals and equicorrelated structure. This requires specification of the partial correlations Corr(Y1,Y3|Y2),Corr(Y2,Y4|Y3),..., Corr(Ym−2,Ym|Ym−1) for tree 2; Corr(Y1,Y4|Y2,Y3),..., Corr(Ym−3,Ym|Ym−2,Ym−1) for tree 3;....; Corr(Y1,Ym|Y2,...Ym−1) for tree m−1. For binary variables, these partial correlations depend on the values of the conditional variables. To simplify matters we set Corr(Yi,Yi+k|Yi+1,...Yi+k−1)=ρ/(1+(k−1)ρ). The motivation for this assumption comes from the result that for equicorrelated structure, partial correlation ρi,i+k|i+1,…,i+k−1 equals ρ/(1+(k−1)ρ) as shown in the Appendix. The corresponding parameter γi,i+k|i+1,…,i+k−1 of the bivariate Gaussian copula can be obtained by solving Eq. (2) using the two conditional probabilities pi|i+1,…,i+k−1 and pi+k|i+1,…,i+k−1. In the next section we give a numerical example to illustrate this method for dimension m=4.
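As a small worked illustration of the simplifying assumption Corr(Yi,Yi+k|Yi+1,...,Yi+k−1)=ρ/(1+(k−1)ρ), the following Python sketch tabulates the target partial correlation for each tree of a D-vine. The function name partial_correlation_targets is ours, and the values m=4 and ρ=0.4 are chosen only for illustration.

```python
def partial_correlation_targets(m, rho):
    """Target correlation for tree k = 1, ..., m-1 under the equicorrelated structure.

    Tree k pairs Y_i with Y_{i+k} and conditions on the k-1 variables in between,
    so the target is rho / (1 + (k - 1) * rho).
    """
    return {k: rho / (1.0 + (k - 1) * rho) for k in range(1, m)}


targets = partial_correlation_targets(4, 0.4)
for k, value in targets.items():
    print(f"tree {k}: {value:.4f}")
```

Tree 1 keeps ρ itself, while trees 2 and 3 use 0.4/1.4 and 0.4/1.8; each target is then converted into a Gaussian copula parameter through Eq. (2) with the relevant conditional probabilities.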

Numerical example of equicorrelated binary distribution

Assuming the marginal means are p=(0.26, 0.36, 0.25, 0.24), the feasible range of the correlation parameter ρ is (−0.3244, 0.7492) for the equicorrelated structure. For ρ=0.4, the distribution is calculated from the input values in Table 11 and presented in Table 12.

Table 11 Input values for construction of the PMF of equicorrelated binary variables
Table 12 Four dimensional distribution with specified marginals and equicorrelated structure

We can check the marginal means of the distribution in Table 12 are P(Y1=1)=0.26,P(Y2=1)=0.36,P(Y3=1)=0.25,P(Y4=1)=0.24, and further the distribution has an equicorrelated structure with ρ=0.4.
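A check of this kind can be automated. The sketch below computes the marginal means and all pairwise correlations of a joint binary PMF stored as a dictionary keyed by outcome tuples. Since the entries of Table 12 are not reproduced here, the function is demonstrated on a small hypothetical three-variable PMF; the name margins_and_correlations is ours.

```python
from itertools import product
from math import sqrt


def margins_and_correlations(pmf, m):
    """Marginal means P(Yj=1) and pairwise correlations of a PMF on {0,1}^m."""
    means = [sum(p for y, p in pmf.items() if y[j] == 1) for j in range(m)]
    corr = {}
    for j in range(m):
        for k in range(j + 1, m):
            p11 = sum(p for y, p in pmf.items() if y[j] == 1 and y[k] == 1)
            sd = sqrt(means[j] * (1 - means[j]) * means[k] * (1 - means[k]))
            corr[(j + 1, k + 1)] = (p11 - means[j] * means[k]) / sd
    return means, corr


# Hypothetical example: (Y1, Y2) are correlated and Y3 is an independent fair coin.
base = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
pmf = {(a, b, c): base[(a, b)] * 0.5 for a, b, c in product((0, 1), repeat=3)}

means, corr = margins_and_correlations(pmf, 3)
print(means, corr)
```

Applied to a PMF such as the one in Table 12, the same function would be expected to return the specified marginal means and a common pairwise correlation.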

Parameter estimation

In this section, we discuss estimation of the parameters via maximum likelihood estimation (MLE) for the D-vine pair-copula model with bivariate Gaussian distributions. Suppose that there are n independent subjects, and there are m repeated binary observations on each subject. Thus the data consist of binary vectors yi=(yi1, yi2, …, yim) of dimension m. Let pj be the marginal probability of yij, assumed to be the same for all i. There are \(2^{m}\) possible combinations for yi. For instance, when m=4, we have 16 combinations, that is, yi=(0,0,0,0), or (0,0,0,1), or (0,0,1,0), …, or (1,1,1,1). The n observations can be grouped into \(2^{m}\) counts. Assume the number of (0,…,0) vectors is n1, the number of (0,…,1) vectors is n2, and so on, up to the number of (1,…,1) vectors, which is \(n_{2^{m}}\). Using this notation, the loglikelihood, ℓ(θ), for the D-vine pair-copula model for a sample of n independent observations is given by

$$\begin{array}{@{}rcl@{}} \ell(\boldmath{\theta}) &\propto &n_{1} \log P(Y_{i1}=0,Y_{i2}=0,\ldots,Y_{im}=0)+n_{2} \log P(Y_{i1}=0,Y_{i2}=0,\ldots,Y_{im}=1)\\ & &\;\;\;\; +\ldots+n_{2^{m}} \log P(Y_{i1}=1,Y_{i2}=1,\ldots,Y_{im}=1), \end{array} $$
(10)

where the parameter θ consists of the marginal probabilities and the copula parameters, which are functions of the correlations between the binary variables. For instance, for the two-dimensional example shown in Table 2, the loglikelihood is

$$\begin{array}{@{}rcl@{}} \ell(\gamma_{12},p_{1},p_{2})&\propto &\!\! n_{1} \log (P(Y_{i1}=0,Y_{i2}=0))+n_{2} \log (P(Y_{i1}=0,Y_{i2}=1))\\ & & \! +n_{3} \log (P(Y_{i1}=1,Y_{i2}=0))+n_{4} \log (P(Y_{i1}=1,Y_{i2}=1))\\ &=& \!\! n_{1}\log (C_{1}(q_{1}, q_{2};\gamma_{12}))+n_{2}\log(q_{1}-C_{1}(q_{1}, q_{2};\gamma_{12}))\\ & &\! +n_{3} \log (q_{2}-C_{1}(q_{1}, q_{2};\gamma_{12}))+n_{4}\log(1-q_{1}-q_{2}+C_{1}(q_{1}, q_{2};\gamma_{12})),\\ \end{array} $$
(11)

where C1 is the bivariate Gaussian copula. The maximum likelihood estimates of the parameters are obtained by maximizing (10) using the optimization routine “L-BFGS-B” of Byrd et al. (1995), which allows box constraints. The standard errors of the parameters are obtained from the Hessian matrix at the optimized values using the “Richardson” method of the function “hessian” in the R package “numDeriv” by Gilbert and Varadhan (2012).
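The grouped loglikelihood (11) is easy to evaluate numerically. The Python sketch below is a simplified stand-in for the estimation procedure and not the L-BFGS-B routine described above: the counts are hypothetical, the margins are fixed at their sample proportions, and only γ12 is maximized, using a one-dimensional ternary search. The Gaussian copula is approximated by integrating the bivariate normal density over its correlation parameter with Simpson's rule; the names gaussian_copula and loglik are ours.

```python
from math import exp, log, pi, sqrt
from statistics import NormalDist

_N = NormalDist()


def gaussian_copula(u, v, gamma, steps=400):
    """Approximate Phi2(Phi^{-1}(u), Phi^{-1}(v); gamma) by Simpson's rule."""
    h, k = _N.inv_cdf(u), _N.inv_cdf(v)

    def density(r):
        s = 1.0 - r * r
        return exp(-(h * h - 2.0 * r * h * k + k * k) / (2.0 * s)) / (2.0 * pi * sqrt(s))

    step = gamma / steps
    total = density(0.0) + density(gamma)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * step)
    return u * v + total * step / 3.0


def loglik(gamma, q1, q2, counts):
    """Grouped loglikelihood of Eq. (11), up to an additive constant."""
    c = gaussian_copula(q1, q2, gamma)
    cells = (c, q1 - c, q2 - c, 1.0 - q1 - q2 + c)   # (0,0), (0,1), (1,0), (1,1)
    return sum(n * log(max(p, 1e-300)) for n, p in zip(counts, cells))


counts = (30, 10, 15, 45)            # hypothetical n1, n2, n3, n4
n = sum(counts)
q1 = (counts[0] + counts[1]) / n     # sample proportion of Y1 = 0
q2 = (counts[0] + counts[2]) / n     # sample proportion of Y2 = 0

# Ternary search: with the margins fixed, the loglikelihood is unimodal in gamma.
lo, hi = -0.99, 0.99
for _ in range(60):
    m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
    if loglik(m1, q1, q2, counts) < loglik(m2, q1, q2, counts):
        lo = m1
    else:
        hi = m2
gamma_hat = 0.5 * (lo + hi)
fitted_c = gaussian_copula(q1, q2, gamma_hat)
print(gamma_hat, fitted_c)
```

In this two-dimensional case the model is saturated, so the fitted probability of the (0,0) cell matches the observed proportion n1/n. In higher dimensions all parameters would be optimized jointly, as described in the text.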

Data analysis

Here we present a real-life data analysis to illustrate the application of the D-vine pair-copula with bivariate Gaussian distributions. We also compare the results with the MP model and the model that ignores the correlation between the variables.

Drug response data

These data were first reported by Grizzle et al. (1969). Here 46 subjects were treated with three drugs, 1, 2 and 3, and their responses were recorded as 0 for unfavorable or 1 for favorable. For example, (0, 0, 0) stands for unfavorable responses to all three drugs. We assume the three binary responses are equicorrelated with correlation parameter ρ. The maximum likelihood estimates (MLE) of the marginal probabilities p1, p2 and p3 and of ρ, together with standard errors (SE), are presented in Table 13.

Table 13 Parameter estimates and standard errors for the drug response data

The estimate of ρ is close to zero for both the MP and D-vine pair-copula models. The estimates and standard errors of the D-vine independent copula model are listed in the last two columns of Table 13. The D-vine independent copula model has the minimum AIC and seems to be a good choice for these data.

Discussion

In recent years vine pair-copula models have become popular for analyzing dependent multivariate data. However, understanding and using these models for discrete data, and in particular for binary data, can pose a challenge to the practitioner. In this paper, we have illustrated the pair-copula construction of binary distributions in two and three dimensions in a way that makes it accessible to the practitioner. In three dimensions, using bivariate Gaussian copulas, we have shown that the probability mass function generated by the pair-copula differs from the mass function of the multivariate probit (MP) model. We gave a numerical example where the MP model fails but the pair-copula method is able to generate a mass function with specified marginals and correlations. For four and higher dimensions we provided a method of constructing a multivariate binary distribution with specified marginals and equicorrelated structure using the D-vine pair-copula method. We discussed the maximum likelihood estimation of the parameters for grouped multivariate binary data and provided a real-life data analysis. Future work involves including covariates in these models.

Appendix

Consider the equicorrelated structure given by \(R=(1-\rho)I_{m}+\rho \, e_{m}\,e_{m}^{T}\), with parameter ρ. Here Im is the identity matrix of dimension m and em is an m×1 column vector of ones. From formula (2.19), page 40 in Joe (2014), the partial correlation is given by

$$\begin{array}{@{}rcl@{}} \rho_{1,m|2,...,m-1}=\frac{\rho-\rho^{2}\,e_{m-2}^{T}\, R_{11}^{-1}\, e_{m-2}} {{1-\rho^{2}\, \,e_{m-2}^{T}\, R_{11}^{-1}\, e_{m-2}}}\,, \end{array} $$
(12)

where \(R_{11}=(1-\rho)I_{m-2}+\rho \, e_{m-2}\,e_{m-2}^{T}\). Using the formula in Example 4.1 of Chaganty (1997), we have

$$\begin{array}{@{}rcl@{}} R_{11}^{-1}=\frac{1}{1-\rho}\left[I_{m-2}-\frac{\rho}{1+(m-3)\rho}e_{m-2}\,e_{m-2}^{T}\right]. \end{array} $$

Since \(e_{m-2}^{T}\,e_{m-2}=(m-2)\) we get

$$\begin{array}{@{}rcl@{}} e_{m-2}^{T}\, R_{11}^{-1}\,e_{m-2} &=&\frac{1}{1-\rho}\left((m-2)-\frac{\rho}{1+(m-3)\rho}(m-2)^{2}\right)\cr &=&\frac{(m-2)}{1-\rho}\left(1-\frac{\rho(m-2)}{1+(m-3)\rho}\right)\cr &=&\frac{(m-2)}{1+(m-3)\rho}. \end{array} $$
(13)

Substituting (13) in (12) and simplifying, we get

$$\begin{array}{@{}rcl@{}} \rho_{1,m|2,...,m-1} &=&\frac{\rho(1+(m-3)\rho)-(m-2)\rho^{2}}{1+(m-3)\rho-(m-2)\rho^{2}}\cr &=&\frac{\rho-\rho^{2}}{m\rho(1-\rho)+(1-\rho)(1-2\rho)}\cr &=&\frac{\rho}{1+(m-2)\rho}. \end{array} $$
(14)

The constant (m−2) in the denominator of (14) represents the number of conditioning variables. More generally, for the equicorrelated structure the partial correlation ρi,i+k|i+1,…,i+k−1=ρ/(1+(k−1)ρ) for any 1≤i≤(m−k), 1≤k≤(m−1).
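The closed form (14) can be verified numerically. The following Python sketch evaluates the matrix expression (12) in exact rational arithmetic and compares it with ρ/(1+(m−2)ρ) for several values of m. The linear solver and the function names are ours, and ρ=2/5 is an arbitrary choice made for illustration.

```python
from fractions import Fraction


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [x / aug[col][col] for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def partial_corr_first_last(m, rho):
    """Eq. (12): partial correlation of variables 1 and m given 2, ..., m-1."""
    n = m - 2
    r11 = [[Fraction(1) if i == j else rho for j in range(n)] for i in range(n)]
    ones = [Fraction(1)] * n
    quad = sum(solve(r11, ones))      # e^T R11^{-1} e, as in Eq. (13)
    return (rho - rho**2 * quad) / (1 - rho**2 * quad)


rho = Fraction(2, 5)
checks = {m: partial_corr_first_last(m, rho) for m in range(3, 8)}
for m, value in checks.items():
    print(m, value, rho / (1 + (m - 2) * rho))
```

For every m in this range the two expressions agree exactly; for example, m=4 gives 2/9, which is 0.4/(1+2×0.4).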

Availability of data and materials

Interested readers can contact the first author.

Abbreviations

AIC: Akaike information criterion

AR(1): Autoregressive of order one

CDF: Cumulative distribution function

GEE: Generalized estimating equations

MC: Markov chains

MLE: Maximum likelihood estimation

MP: Multivariate probit

MVN: Multivariate normal

PMF: Probability mass function

SE: Standard error

References

  1. Bedford, T., Cooke, R. M.: Vines–a new graphical model for dependent random variables. Ann. Stat. 30(4), 1031–1068 (2002).

  2. Brechmann, E., Schepsmeier, U.: Modeling dependence with C- and D-vine copulas: the R package CDVine. J. Stat. Softw., 52 (2013). https://doi.org/10.18637/jss.v052.i03.

  3. Byrd, R. H., Lu, P., Nocedal, J., Zhu, C.: A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. 16(5), 1190–1208 (1995).

  4. Chaganty, N. R.: An alternative approach to the analysis of longitudinal data via generalized estimating equations. J. Stat. Plan. Inf. 63(1), 39–54 (1997).

  5. Chaganty, N. R., Joe, H.: Range of correlation matrices for dependent Bernoulli random variables. Biometrika. 93(1), 197–206 (2006).

  6. Czado, C.: Analyzing Dependent Data with Vine Copulas: A Practical Guide With R. Springer International Publishing, Lecture Notes in Statistics (2019).

  7. Dalla Valle, L., Leisen, F., Rossini, L.: Bayesian non-parametric conditional copula estimation of twin data. J. Roy. Stat. Soc. C: Appl. Stat. 67, 523–548 (2018).

  8. Deng, Y., Chaganty, N. R.: Pair-copula models for analyzing family data. J. Stat. Theory Pract. 15(1), 13 (2021).

  9. Escarela, G., Perez-Ruiz, L. C., Bowater, R. J.: A copula-based Markov chain model for the analysis of binary longitudinal data. J. Appl. Stat. 36(6), 647–657 (2009).

  10. Gilbert, P., Varadhan, R.: numDeriv: Accurate Numerical Derivatives. R Package (2012). http://CRAN.R-project.org/package=numDeriv.

  11. Grizzle, J. E., Starmer, C. F., Koch, G. G.: Analysis of categorical data by linear models. Biometrics, 489–504 (1969).

  12. Gruber, L., Czado, C.: Sequential Bayesian model selection of regular vine copulas. Bayesian Anal. 10, 937–963 (2015).

  13. Joe, H.: Families of m-variate distributions with given margins and m(m−1)/2 bivariate dependence parameters. Lecture Notes–Monograph Series, vol. 28. Institute of Mathematical Statistics, Hayward (1996).

  14. Joe, H.: Multivariate Models and Multivariate Dependence Concepts. Chapman & Hall/CRC, London (1997).

  15. Joe, H.: Dependence Modeling with Copulas. Chapman and Hall/CRC, London (2014).

  16. Kurowicka, D., Joe, H.: Dependence Modeling: Vine Copula Handbook. World Scientific, Singapore (2011).

  17. Lennon, H.: Gaussian copula modelling for integer-valued time series. PhD thesis. The University of Manchester (United Kingdom) (2016).

  18. Liang, K. Y., Zeger, S. L.: Longitudinal data analysis using generalized linear models. Biometrika. 73(1), 13–22 (1986).

  19. Min, A., Czado, C.: Bayesian inference for multivariate copulas using pair-copula constructions. J. Financ. Econ. 8, 511–546 (2010).

  20. Panagiotelis, A., Czado, C., Joe, H.: Pair copula constructions for multivariate discrete data. J. Am. Stat. Assoc. 107(499), 1063–1072 (2012).

  21. Panagiotelis, A., Czado, C., Joe, H., Stöber, J.: Model selection for discrete regular vine copulas. Comput. Stat. Data Anal. 106, 138–152 (2017).

  22. Radice, R., Marra, G., Wojtyś, M.: Copula regression spline models for binary outcomes. Stat. Comput. 26(5), 981–995 (2016).

  23. Sabo, R. T., Chaganty, N. R.: What can go wrong when ignoring correlation bounds in the use of generalized estimating equations. Stat. Med. 29(24), 2501–2507 (2010).

  24. Sklar, M.: Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Stat. Univ. Paris. 8, 229–231 (1959).

  25. Smith, M., Min, A., Almeida, C., Czado, C.: Modeling longitudinal data using a pair-copula decomposition of serial dependence. J. Am. Stat. Assoc. 105(492), 1467–1479 (2010).

  26. Smith, M. S., Khaled, M. A.: Estimation of copula models with discrete margins via Bayesian data augmentation. J. Am. Stat. Assoc. 107(497), 290–303 (2012).

  27. Winkelmann, R.: Copula bivariate probit models: with an application to medical expenditures. Health Econ. 21(12), 1444–1455 (2012).

  28. Yang, W., Chaganty, N. R.: A contrasting study of likelihood methods for the analysis of longitudinal binary data. Commun. Stat. Theory Methods. 43(14), 3027–3046 (2014).

Acknowledgements

We thank the associate editor and two referees whose constructive comments on an earlier version resulted in an improved presentation.

Funding

There is no funding support for the research work.

Author information

Contributions

All authors have contributed equally to the work. The author(s) read and approved the final manuscript.

Corresponding author

Correspondence to N. Rao Chaganty.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Lin, H., Chaganty, N.R. Multivariate distributions of correlated binary variables generated by pair-copulas. J Stat Distrib App 8, 4 (2021). https://doi.org/10.1186/s40488-021-00118-z

Keywords

  • D-vine
  • Multivariate binary distributions
  • Multivariate probit model
  • Pair-copulas