Multivariate distributions of correlated binary variables generated by pair-copulas

Correlated binary data are prevalent in a wide range of scientific disciplines, including healthcare and medicine. The generalized estimating equations (GEEs) and the multivariate probit (MP) model are two of the popular methods for analyzing such data. However, both methods have some significant drawbacks. The GEEs may not have an underlying likelihood and the MP model may fail to generate a multivariate binary distribution with specified marginals and bivariate correlations. In this paper, we study multivariate binary distributions that are based on D-vine pair-copula models as a superior alternative to these methods. We elucidate the construction of these binary distributions in two and three dimensions with numerical examples. For higher dimensions, we provide a method of constructing a multidimensional binary distribution with specified marginals and equicorrelated correlation matrix. We present a real-life data analysis to illustrate the application of our results.


Introduction
In clinical trials and research studies in health care and medicine, the endpoint of the observed data most often consists of correlated binary observations. The generalized estimating equation (GEE), introduced by Liang and Zeger (1986), has been the common statistical tool for analyzing such data. However, this method has several drawbacks. One of the drawbacks is that it uses an ambiguously defined working correlation to model the dependence in the binary observations, which could lead to misleading conclusions (Sabo and Chaganty, 2010). Another drawback it is a non-likelihood approach, in the sense, it does not have an underlying joint distribution for the correlated binary observations. Other alternatives to GEEs for the analysis of correlated binary data are Markov chains (MCs) and multivariate probit (MP) models. A contrasting study of the first order MC model and the MP model was presented by Yang and Chaganty (2014). They showed that both models are asymptotically efficient, and discussed situations where one is preferable over the other.
In recent years, due to their success in other disciplines, copulas have been used to develop likelihood-based methods as another alternative to GEEs. Some researchers have

Construction of vine pair-copula binary distributions
In this section, we illustrate methods of constructing multivariate binary distributions with specified correlations using vine pair-copula methods. The major advantage of these methods is that the multivariate distribution can be constructed using only bivariate copulas. The method is computationally feasible, flexible, and can accommodate various types of dependence because the bivariate copula need not be the same for various bivariate marginal and conditional distributions. We start first with the simplest cases, bivariate and the trivariate distributions, and then show how they can be extended to higher dimensions focusing on the special correlation structures useful in longitudinal and clustered binary data analysis.

Bivariate binary distributions
Consider first the case of two binary variables. Let Y = (Y 1 , Y 2 ), where the subscripts 1 and 2 possibly may indicate two sequential time points. Assume that E(Y i ) = p i for i = 1, 2. According to the theorem of Sklar (1959), the joint CDF of Y using a copula function (2021) 8:4 Page 3 of 14 Here 2 denotes the standard bivariate normal CDF with correlation coefficient γ and −1 is the inverse of the standard normal CDF C is given by F(y 1 , y 2 ) = C(F 1 (y 1 ), F 2 (y 2 )), where F 1 and F 2 are CDFs of the univariate binary distributions of Y 1 and Y 2 respectively. Following Panagiotelis et al. (2012), we can recover the joint probability mass function (PMF) of Y from the CDF as (1) The C(u 1 , u 2 ; θ) could be any copula C i , 1 ≤ i ≤ 5, given in Table 1. The copula parameter θ is the correlation coefficient γ for the Gaussian copula, and it is α for the Clayton, Frank, and Gumbel copulas.
Selecting y i = 0, 1 and noting that (1) simplifies to the probabilities given in Table 2.

Trivariate binary distributions
In this section we extend the pair-copula method to construct three dimensional binary distributions. Let Y = (Y 1 , Y 2 , Y 3 ) be a vector of three correlated binary random variables. Note that The above equation shows the three dimension distribution can be obtained by constructing the bivariate conditional distribution of (Y 1 , Y 3 ) given Y 2 = y 2 . To this end we introduce some notation. We first construct bivariate distributions for (Y 1 , Y 2 ) and (Y 2 , Y 3 ) selecting bivariate copulas C 12 (u 1 , u 2 ) and C 23 (u 1 , u 2 ) from Table 2. For notational convenience we omit the copula parameter here and in some formulas later.
We give six numerical examples to see that different copulas and parameter values give rise to different trivariate binary distributions. All these six distributions have the same marginal means p 1 = 0.8, p 2 = 0.7 and p 3 = 0.6. The choice of the copulas and parameter values are summarized in Table 5 for these six cases.
The three dimensional joint binary distributions for the six cases are given in Table 8.

Comparison of pair-copula and MP models
In this section, we will compare the probability mass functions generated by the paircopula methods and by the multivariate probit model. We will see that even if we use bivariate Gaussian copulas, these two methods yield different probability mass functions. Furthermore, the pair-copula method is successful in cases where the MP model fails to generate a probability mass function with specified univariate marginals and correlations.

Multivariate probit (MP) model
Let Y = (Y 1 , Y 2 , . . . , Y m ) be a vector of binary random variables. The multivariate probit model assumes that associated with the vector Y there is a latent vector Z = (Z 1 , Z 2 , . . . , Z m ), which is distributed as multivariate normal (MVN), such that Y t = 1 where 3 ( ; R) is the CDF of trivariate standard normal with correlation matrix R.

Distributions generated by pair-copula and MP models
Since the MP model relies on the Gaussian distribution, for a fair comparison we will use the Gaussian copulas for the bivariate and conditional distributions in the pair-copula construction. Consider the case of two dimensions. In this case, taking C 12 as the bivariate Gaussian copula, the PMF as given in Table 2 is P(Y 1 = 0, Y 2 = 0) = C 12 (q 1 , q 2 ; γ ) = 2 ( −1 (q 1 ), −1 (q 2 ); γ ) = 2 (−μ 1 , −μ 2 ; γ ), which is identical to the probability under the MP model. Therefore, the probability distributions are the same for two dimensions. For three dimensions for the MP model, we have From Table 4, we see that for the pair-copula model With bivariate Gaussian copulas we have q i = (−μ i ) for i = 1, 2, 3 and where = ( 1 , 2 , 3 ) is distributed as a standard trivariate normal with correlation matrix R = (γ ij ). The quantities q 1|0 and q 3|0 are the same as the corresponding values for the probit model. Taking C 13|0 (u 1 , u 2 ) as bivariate Gaussian copula with correlation γ 13|0 , Eq. (6) is equivalent to Clearly the quantity (8) is not equal to (5) since the parameter γ 13|0 can be any value in (−1, 1) and need not be related to R. Thus the PMF of the pair-coupla model is different from the MP model.

An advantage of the pair-copula method over the MP model
In this section, we give an example to show that the pair-copula method is useful to construct multivariate binary distributions with specified marginals and correlation structure in cases where the multivariate probit model breaks down. Let's assume the marginal means are given by the vector p = (0.2, 0.3, 0.2). For the equicorrelated structure the feasible range of the correlation parameter ρ is (− 0.25, 0.7638), see Theorem 1 in Chaganty and Joe (2006). For the value ρ = 0.76, the latent correlation matrix obtained by solving Eq. (2), for all pairs is which is not positive definite and thus the MP method does not give a PMF for the binary variables. However, the pair-copula method generates a PMF for the binary variables with specified marginal means and equicorrelated structure. The input values needed to calculate the PMF are listed in Table 9. Proceeding as in "Trivariate binary distributions" section, the resulting three dimensional distribution is given in Table 10. We can check that this distribution has the specified marginal means p 1 = 0.2, p 2 = 0.3, and p 3 = 0.2, and equicorrelated structure with ρ = 0.76.

Extensions to four and higher dimensions
The pair-copula method for three dimensions described in "Trivariate binary distributions" section can be extended to construct four or higher-dimensional multivariate binary distributions. The foundations of these higher dimensional extensions have been laid out in Joe (1996;. In a pioneering work, Bedford and Cooke (2002) showed how to use graphical models consisting of vines with trees and edges. The edges in a given tree become the nodes of the next tree. The vine structures not only help in enumerating and organizing numerous decompositions of a multivariate distribution but also facilitate models for different types of dependence for the marginal and conditional distributions. In recent years several articles have been published in the literature on vine pair-copula models, see Kurowicka and Joe (2011). The papers by Min and Czado (2010), Gruber and Czado (2015), and Dalla Valle et al. (2018) discuss Bayesian inference for these vine pair-copula models. The lecture notes by Czado (2019) and Brechmann and Schepsmeier (2013) discusses the practical implementation of vine copulas using the R software. The two most popular vines are the canonical C-vine and the drawable D-vine, see Czado (2019) and Joe (2014). An application of the C-vine for analyzing familial data is in Deng and Chaganty (2021). In this paper, we focus on the D-vine which is a natural candidate for analyzing longitudinal data which consists of an ordered sequence of variables. Figure 1 shows the nested tree structure of the D-vine for m variables. There are (m−1) trees and for the ith tree there are m − i + 1 nodes, represented by rectangular boxes. In the case of m = 4, the D-vine consists of 3 trees. For the pair-copula construction we will need bivariate copulas for the pairs (12), (23), and (34) in tree 1. Since we are dealing with binary variables, for tree 2 we will need two bivariate copulas for constructing the Table 10 Three dimensional distribution with specified marginals conditional distribution of (13|2) and another two for constructing (24|3). The final tree requires the construction of the conditional distribution of (14|23), which in turn requires four bivariate copulas for the four possible values of the conditioned variables 2 and 3. The joint PMF in four dimensions is given by The probability P( which can be obtained from the bivariate and trivariate distributions constructed as in "Construction of vine pair-copula binary distributions" section.

Binary distributions with structured correlation matrices
To allow parsimonious modeling, multivariate binary distributions with structured correlation matrices are normally employed in the analysis of longitudinal or clustered binary data. The two most popular structured correlation matrices are autoregressive of order one (AR(1)) and equicorrelated. Yang and Chaganty (2014) have outlined a method of constructing a multivariate binary distribution with AR(1) structure, and here we focus on the equicorrelated structure.
In "An advantage of the pair-copula method over the MP model" section, we gave an example of a three-dimensional binary distribution with specified marginals and equicorrelated structure. The pair-copula method with bivariate Gaussian copulas can be used to generate higher-dimensional multivariate binary distributions with specified marginals and equicorrelated structure. This requires specification of the partial correlations for tree 3; ....; Corr(Y 1 , Y m |Y 2 , ...Y m−1 ) for tree m − 1. For binary variables, these partial correlations depend on the values of the conditional variables. To simplify matters we set Corr(Y i , Y i+k |Y i+1 , ...Y i+k−1 ) = ρ/(1 + (k − 1)ρ). The motivation for this assumption comes from the result that for equicorrelated structure, partial correlation ρ i,i+k|i+1,...,i+k−1 equals ρ/(1 + (k − 1)ρ) as shown in the Appendix. The corresponding parameter γ i,i+k|i+1,...,i+k−1 of the bivariate Gaussian copula can be obtained by solving Eq. (2) using the two conditional probabilities p i|i+1,...,i+k−1 and p i+k|i+1,...,i+k−1 . In the next section we give a numerical example to illustrate this method for dimension m = 4.

Parameter estimation
In this section, we discuss estimation of the parameters via maximum likelihood estimation (MLE) for the D-vine pair-copula model with bivariate Gaussian distributions. Suppose that there are n independent subjects, and there are m repeated binary observations on each subject. Thus the data consists of binary vectors y i = (y i1 , y i2 , · · · , y im ) of dimension m. Let p j be the marginal probability of y ij assumed to be the same for all i. There are 2 m possible combinations for y i . For instance, when m = 4, we have 16 combinations, that is, y i = (0, 0, 0, 0), or (0, 0, 0, 1), or (0, 0, 1, 0), · · · , or (1, 1, 1, 1). The n observations can be grouped into 2 m counts. Assume the number of (0, · · · , 0) vectors is n 1 , the number of (0, · · · , 1) is n 2 , so on and so forth, the number of (1, · · · , 1) is n 2 m . Using these notations, the loglikelihood, (θ ), for D-vine pair-copula model for a sample of n independent observations is given by where the parameter θ consists of marginal probabilities and copula parameters that are functions of correlations between the binary variables. Take the two dimensional example shown in Table 2 for instance, the loglikelihood is =n 1 log(C 1 (q 1 , q 2 ; γ 12 )) + n 2 log(q 1 − C 1 (q 1 , q 2 ; γ 12 )) +n 3 log(q 2 − C 1 (q 1 , q 2 ; γ 12 )) + n 4 log(1 − q 1 − q 2 + C 1 (q 1 , q 2 ; γ 12 )), where C 1 is the bivariate Gaussian copula. The maximum likelihood estimates of the parameters are obtained by maximizing (10) using the optimization routine "L-BFGS-B" by Byrd et al. (1995) which allows box constraints. The standard errors of the parameters are obtained from the Hessian matrix at optimized values using "Richardson" method of the function "Hessian" in the R package "numDeriv" by Gilbert and Varadhan (2012).

Data analysis
Here we present a real-life data analysis to illustrate the application of the D-vine paircopula with bivariate Gaussian distributions. We also compare the results with the MP model and the model that ignores the correlation between the variables.

Drug response data
This data was first reported by Grizzle et al. (1969). Here 46 subjects were treated with three drugs 1, 2 and 3, and recorded their response as 0 for unfavorable or 1 for favorable. For example, (0, 0, 0) stands for unfavorable responses for all the three drugs. We assume the three binary responses are equicorrelated with correlation parameter ρ. The maximum likelihood estimates (MLE) of the marginal probabilities p 1 , p 2 and p 3 and ρ together with standard errors (SE) are presented in Table 13. The estimate of ρ is close to zero both for the MP and D-vine pair-copula models. The estimates and standard errors of D-vine independent copula model are listed at the last two columns of Table 13. The D-Vine independent copula model has the minimum AIC and seems to be a good choice for this data.

Discussion
In recent years vine pair-copula models have become popular for analyzing dependent multivariate data. However, understanding and using these models for discrete in particular for binary data can pose as a challenge to the practitioner. In this paper, we have illustrated the pair-copula construction of binary distributions in the case of two and three dimensions that make it easy for the practitioner. In three dimensions using bivariate Gaussian copula, we have shown that the probability mass function generated by the pair-copula differs from the mass function of the multivariate probit (MP) model. We gave a numerical example where the MP model fails but one is able to use the paircopula method to generate mass function with specified marginals and correlations. For four and higher dimensions we provide a method of constructing a multivariate binary distribution with specified marginals and equicorrelated structure using the D-vine paircopula method. We discussed the maximum likelihood estimation of the parameters for grouped multivariate binary data and provided a real-life data analysis. Future work involves including covariates in these models.