A new trivariate model for stochastic episodes
Journal of Statistical Distributions and Applications volume 8, Article number: 2 (2021)
Abstract
We study the joint distribution of stochastic events described by (X,Y,N), where N has a 1-inflated (or deflated) geometric distribution and X, Y are the sum and the maximum of N exponential random variables. Models with similar structure have been used in several areas of applications, including actuarial science, finance, and weather and climate, where such events naturally arise. We provide basic properties of this class of multivariate distributions of mixed type, and discuss their applications. Our results include marginal and conditional distributions, joint integral transforms, moments and related parameters, stochastic representations, estimation and testing. An example from finance illustrates the modeling potential of this new model.
Introduction
This paper introduces a new model for stochastic events such as growth/decline periods of a financial return, a flood, a drought, or a heat wave, among others. The model describes the duration N, the magnitude X and the peak value Y of such events through the random structure
$$ (X, Y, N) = \left(\sum_{i=1}^{N} X_{i}, \ \bigvee_{i=1}^{N} X_{i}, \ N\right), \tag{1} $$
where the {X_{i}}_{i≥1} are independent and identically distributed (IID) exponential random variables given by the probability density function (PDF)
$$ f(x) = \beta e^{-\beta x}, \quad x>0, \ \beta>0, \tag{2} $$
N is an integer-valued random variable, independent of the {X_{i}}, and \( \bigvee _{i = 1}^{N} X_{i}\) denotes the maximum of {X_{i}}_{i=1,…,N}. Events like this arise in many applications. A common process generating such observations is the Peaks-over-Threshold process, where we are interested in observations exceeding (or falling below) a threshold. For example, in finance, time periods with positive/negative log-returns give growth/decline periods for an asset. In climate/hydrology, a flood may be described as the stage of a stream exceeding a levee, a heat wave may be described as consecutive days when the maximum daily temperature exceeds a high threshold, and a deluge can be thought of as consecutive observations (daily, hourly, etc.) of precipitation exceeding a high threshold. In energy research, heating degree days are those days when the maximum daily temperature is below 64 degrees Fahrenheit (the threshold temperature varies by country and region). Drought may be considered to be a time period when the maximum precipitation (daily, annual, etc.) is below a seasonal low for a given region. Various bivariate and trivariate models for stochastic episodes/events with the structure of (1) were developed and applied in different fields in Arendarczyk et al. (2018a, b), Barreto-Souza (2012), Barreto-Souza and Silva (2019), Biondi et al. (2002, 2005, 2008), Kozubowski and Panorska (2005, 2008), and Kozubowski et al. (2008a, b, 2010, 2011). Most of the existing models assume that the underlying observations {X_{i}} are IID exponential or (dependent) Pareto variables, and thus cover the cases of light- and heavy-tailed processes. In the aforementioned work the duration N was modeled with a geometric distribution.
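To make the Peaks-over-Threshold construction concrete, the following minimal Python sketch extracts (X, Y, N) triples from a series. The function name `extract_episodes` is hypothetical, and taking the exceedances over the threshold as the X_{i} is an assumption made here for illustration, not a prescription from the paper.

```python
from typing import List, Tuple


def extract_episodes(values: List[float], threshold: float) -> List[Tuple[float, float, int]]:
    """Return one (X, Y, N) triple per maximal run of values above `threshold`.

    Assumption: the X_i of an episode are the exceedances v - threshold, so
    X is their sum (magnitude), Y their maximum (peak), N the run length.
    """
    episodes: List[Tuple[float, float, int]] = []
    run: List[float] = []
    for v in values:
        if v > threshold:
            run.append(v - threshold)  # exceedance over the threshold
        elif run:
            # the run just ended: summarize it and start over
            episodes.append((sum(run), max(run), len(run)))
            run = []
    if run:  # close a run that reaches the end of the series
        episodes.append((sum(run), max(run), len(run)))
    return episodes
```

For instance, with threshold 0 the series `[1.0, -1.0, 2.0, 0.5, -3.0, 0.25]` contains three episodes, of durations 1, 2 and 1. Episodes "below" a threshold can be handled by negating the series and the threshold.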
Our interest in extending the earlier models was motivated by the observation that, in processes such as financial asset returns, a great many events last exactly one time period. While the geometric distribution allows for values of one, in many processes events of duration 1 (above/below threshold) are either more or less common than the geometric model predicts. For example, a heat wave is often a single “hot day” (1-inflated), financial returns often switch daily from positive to negative (1-inflated), and heavy precipitation often lasts one day (1-inflated). In summary, while geometric durations work well for some applications, they do not work well for every application.
Thus, we developed an extension of the models with geometric duration that allows 1-inflation or 1-deflation, giving more flexibility in accounting for the frequency of events of duration one. Models for count variables such as duration are typically discrete, positive or nonnegative integer-valued random variables, such as geometric, negative binomial or Poisson. There are many works in the literature (medical, ecology, social science, actuarial science, etc.) describing count data with a very large number of zeros. The models used to account for the “excess” zeros fall into two general types: zero-inflated (ZI) or hurdle (H) models. We discuss and provide examples of applications for these models in Section 2. We also provide a representation for the ZI and H models via waiting times for the first success in independent Bernoulli trials with different probabilities of success.
This paper is organized as follows. Section 2 is devoted to the discussion of ZI and H models and introduces our model for the 1-inflated geometric distribution. Section 3 introduces our trivariate model and presents its basic properties. Section 4 provides information about marginal and conditional distributions for the trivariate vector. Section 5 is devoted to estimation and testing connected with the new model. An illustrative data example is given in Section 6. Selected proofs and auxiliary results are collected in the Supplementary Material.
Mixture models for duration
In this section we briefly discuss zero-inflated and hurdle models and introduce our (shifted hurdle) model for duration. As noted in the introduction, the two common ways of dealing with excess zeros in the literature are zero-inflated (ZI) (or zero-adjusted, zero-altered) and hurdle (H) models (see, e.g., Cameron and Trivedi 1998, 2005; Lambert 1992; Mullahy 1986, 1997; Panicha 2018; Zuur et al. 2009; Alshkaki 2016; and references therein). Both approaches account for a large number of zeros through mixture distributions with two components, a point mass at zero and a counting distribution, but they differ in the way that zeros can occur. In the ZI models, zero can occur as an outcome of either the point mass or the counting variable. In H models, on the other hand, zero can only occur as an outcome of the point mass, while the counting variable is truncated at zero.
Examples of zero-inflated or hurdle models used in the literature include applications in econometrics (see, e.g., Cameron and Trivedi 1998, 2005; Zeileis et al. 2008), ecology (Panicha 2018; Zuur et al. 2009), public health, epidemiology and bioinformatics (Hu et al. 2011; Zelterman 2004; Chipeta et al. 2014). There are also interesting applications in social science, criminology and actuarial science (Aryal 2011; Constantinescu et al. 2019; Iwunor 1995; Pandey and Tiwari 2011; Sharma and Landge 2013; Tüzen and Erbaş 2018). In ecology, ZI and H models are used to estimate population sizes via various capture-recapture type methods. In public health and epidemiology, these models are used to estimate the number of people sick with a given disease; in bioinformatics, ZI and H models serve to estimate the size of the population of drug users; and in criminology and social science, they are used to estimate the size of rural-urban migration, the size of homeless populations or of violators of a certain law, or the number of highway crashes (see, e.g., Famoye and Singh 2006; Iwunor 1995; Pandey and Tiwari 2011; Sharma and Landge 2013). In actuarial science, a zero-inflated, discrete, dependent-by-mixture Pareto distribution was used to model the probability of ruin in the compound binomial risk model in Constantinescu et al. (2019).
We now turn to the definitions of the ZI and H models. We start with the notation used in the rest of this paper: the set of nonnegative integers (including zero) shall be denoted by \(\mathbb {N}_{0}\), while \(\mathbb {N}\) shall stand for the set of natural numbers (excluding zero).
2.1 The zero-inflated model
The ZI model is a mixture of a point mass at zero and a counting random variable N. In practice, the latter is often chosen to follow a standard discrete distribution such as Poisson, geometric or negative binomial (see, e.g., Mullahy 1986; Lambert 1992). It is important to note that in the ZI model, the zeros may come from two different sources: the point mass or the count variable. The probability mass function (PMF) f_{ZI} of a zero-inflated random variable N_{ZI}, derived from a “base” discrete random variable N with the PMF f, is of the form
$$ f_{ZI}(n) = q I_{\{0\}}(n) + (1-q) f(n), \quad n\in \mathbb {N}_{0}, $$
where I_{A} is the indicator function of the set A.
Remark 1
The corresponding mixture representation, connecting the relevant random variables, is as follows:
$$ N_{ZI} \stackrel {d}{=} J N, $$
where J is a Bernoulli random variable with parameter 1−q, independent of N.
2.2 The hurdle model
The hurdle model is also a mixture, where the components are a point mass at zero and a counting “base” random variable N with the PMF f. However, the base random variable is truncated below at zero before mixing. Due to the truncation, the PMF f of N is converted to f_{T}, where the latter is the PMF of the conditional distribution of N given that N≥1,
$$ f_{T}(n) = \frac{f(n)}{1-f(0)}, \quad n\in \mathbb {N}. $$
Mixing this distribution with a point mass at zero leads to the hurdle distribution based on N, with the PMF of the form
$$ f_{H}(n) = q I_{\{0\}}(n) + (1-q) f_{T}(n) I_{\mathbb {N}}(n), \quad n\in \mathbb {N}_{0}. \tag{3} $$
Remark 2
Similarly to the ZI case, a random variable N_{H} with the PMF (3) admits the mixture representation of the form
$$ N_{H} \stackrel {d}{=} J N_{T}, $$
where J is a Bernoulli random variable with parameter 1−q, independent of N_{T}. Note that in the hurdle model the value of zero can only come from the Bernoulli trial J.
Both the ZI and H models have appeared in the literature. The practical convenience of the hurdle model comes from the ease of its estimation procedures compared with those for the ZI model. Since in this work the count random variable N represents the duration of an event, our N is always greater than or equal to one. Further, our data may show an unusual frequency of ones (not zeros). Thus, we use a hurdle-type model (shifted up by one) for the duration N. We discuss it in more detail below.
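The difference between the two mixtures is easy to see numerically. The sketch below is illustrative only: the helper names are hypothetical, the Poisson base distribution is just an example, and q denotes the weight of the point mass at zero (so that the Bernoulli variable J has parameter 1−q, as in Remarks 1 and 2).

```python
import math
from typing import Callable


def poisson_pmf(n: int, lam: float) -> float:
    """PMF of a Poisson base variable (an example choice of base distribution)."""
    return math.exp(-lam) * lam ** n / math.factorial(n)


def zi_pmf(n: int, q: float, base_pmf: Callable[[int], float]) -> float:
    """Zero-inflated PMF: zeros come from the point mass OR from the base variable."""
    return (q if n == 0 else 0.0) + (1.0 - q) * base_pmf(n)


def hurdle_pmf(n: int, q: float, base_pmf: Callable[[int], float]) -> float:
    """Hurdle PMF: zeros come only from the point mass; the base is truncated at zero."""
    if n == 0:
        return q
    return (1.0 - q) * base_pmf(n) / (1.0 - base_pmf(0))
```

With the same q, the hurdle model puts probability exactly q on zero, while the ZI model puts q + (1−q)f(0) there, which is why q is directly interpretable only in the hurdle case.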
2.3 A hurdle-type geometric distribution
We start with the definition of the geometric random variable we use in this work: a random variable with the PMF
$$ \mathbb {P}(N_{p} = n) = p(1-p)^{n-1}, \quad n\in \mathbb {N}, \tag{4} $$
will be referred to as a geometric random variable with parameter (probability of success) p, and denoted by \(N_{p} \sim \mathcal {GEO}(p)\). Note that this variable “starts” at one, and accounts for the number of trials until the first success in a series of IID Bernoulli trials with parameter p. Using this variable, we can define a hurdle-type model with the PMF of the form
$$ f(n) = \left\{ \begin{array}{ll} q, & n=0, \\ (1-q)p(1-p)^{n-1}, & n\in \mathbb {N}. \end{array}\right. \tag{5} $$
Next we define our counting variable N that represents the duration in the trivariate model (1). Namely, this will be the distribution given by (5) shifted up by one, with the PMF of the form
$$ f(n) = \left\{ \begin{array}{ll} q, & n=1, \\ (1-q)p(1-p)^{n-2}, & n\geq 2. \end{array}\right. \tag{6} $$
We shall denote this distribution by \(\mathcal {HGEO}(p,q)\), which stands for hurdle-geometric distribution, and write \(N \sim \mathcal {HGEO}(p, q)\) when the random variable N follows this distribution. Note that, depending on the relation between p and q, the \(\mathcal {HGEO}(p,q)\) model may inflate the number of ones (p<q) or deflate the number of ones (p>q) compared to the geometric distribution with probability of success p. The following result provides a useful stochastic representation of this distribution.
Proposition 1
If \(N \sim \mathcal {HGEO}(p,q)\) then
$$ N \stackrel {d}{=} 1 + I N_{p}, \tag{7} $$
where N_{p} is geometric with the PMF (4) and I is Bernoulli with parameter 1−q, independent of N_{p}.
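A short Python sketch may help fix ideas. It assumes that the HGEO PMF puts mass q on n=1 and mass (1−q)p(1−p)^{n−2} on n≥2, and that the representation in Proposition 1 reads N = 1 + I·N_{p} in distribution; both are our reading of the text, and the function names are hypothetical.

```python
import random


def hgeo_pmf(n: int, p: float, q: float) -> float:
    """PMF of HGEO(p, q) under the assumed form: q at n=1, geometric tail from n=2."""
    if n < 1:
        return 0.0
    if n == 1:
        return q
    return (1.0 - q) * p * (1.0 - p) ** (n - 2)


def sample_geometric(p: float, rng: random.Random) -> int:
    """Number of Bernoulli(p) trials up to and including the first success."""
    n = 1
    while rng.random() >= p:  # failure with probability 1 - p
        n += 1
    return n


def sample_hgeo(p: float, q: float, rng: random.Random) -> int:
    """Draw N = 1 + I * N_p, with I ~ Bernoulli(1 - q) independent of N_p."""
    indicator = 1 if rng.random() < 1.0 - q else 0
    return 1 + indicator * sample_geometric(p, rng)
```

Setting q equal to p in `hgeo_pmf` returns the ordinary geometric PMF p(1−p)^{n−1}, in line with the reduction to the geometric case noted in the text.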
It is easy to see that when p=q then the HGEO model (7) and its PMF (6) reduce to the geometric distribution with parameter p, and the PMF given by (4). It is interesting to compare the hurdle-type HGEO model above with one analogous to zero-inflation, and also built upon the geometric distribution. The PMF of the latter will be of the form
$$ f(n) = \left\{ \begin{array}{ll} q + (1-q)p, & n=1, \\ (1-q)p(1-p)^{n-1}, & n\geq 2. \end{array}\right. \tag{8} $$
Remark 3
Both of the models introduced above have interpretations as waiting times for the first success in a sequence of independent Bernoulli trials {I_{j}}. Namely, if the probabilities of success are given by \(\mathbb {P}(I_{1} = 1) = q\) and \(\mathbb {P}(I_{j} = 1) = p\) for j≥2 then the number of trials until the first success will have the HGEO distribution given by the PMF (6). On the other hand, if the probabilities of success are the same as above for j≥2 while for j=1 we have \(\mathbb {P}(I_{1} = 1) = q + (1 - q) p\), then the corresponding waiting time will have a distribution with the PMF (8). Because of this, it is clear that the first model is more flexible than the second: for the hurdle-type model, we have \(\mathbb {P}(I_{1}=0)=1-q\), which can fall anywhere in the unit interval, while the analogous probability for the second model is equal to (1−p)(1−q), which does not cover the entire unit interval as q changes in (0,1).
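The waiting-time interpretation can be written down directly. In this sketch (the function name is hypothetical), the only difference between the two models is the success probability of the first trial.

```python
def waiting_time_pmf(n: int, first: float, later: float) -> float:
    """P(first success occurs on trial n) when trial 1 succeeds with probability
    `first` and every later trial succeeds with probability `later`."""
    if n < 1:
        return 0.0
    if n == 1:
        return first
    # fail trial 1, fail trials 2..n-1, then succeed on trial n
    return (1.0 - first) * (1.0 - later) ** (n - 2) * later
```

Calling it with `first=q` corresponds to the hurdle-type model, and with `first=q + (1 - q) * p` to the zero-inflation analogue. In the second case the probability of failing the first trial is (1−p)(1−q), which stays below 1−p for every q, so for fixed p it cannot reach values close to one.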
Definition and basic properties of the new trivariate model
In this section we formally define the new distribution of (1) and derive its basic properties. Here, and elsewhere in the paper, the notation \(\mathcal {EXP}(\beta)\) stands for the exponential distribution with the PDF (2).
Definition 1
The random vector (X,Y,N) with the stochastic representation given in (1), where the {X_{i}} are IID exponential random variables with the PDF (2) and \(N \sim \mathcal {HGEO}(p, q)\) with the PMF (6), independent of the {X_{i}}, is said to have a generalized TETLG (GT) distribution, denoted by \(\mathcal {GT}(p, q, \beta)\).
We note that when p=q, then the variable N has a geometric distribution with parameter p, and the random vector (X,Y,N) above has the TETLG distribution studied in Kozubowski et al. (2011), where the name stands for Trivariate distribution with Exponential, Truncated Logistic and Geometric marginal distributions. Our construction provides a flexible generalization of the TETLG model that accounts for the excess of ones in the data.
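Definition 1 translates directly into a simulation recipe. The following is a minimal sketch, not the authors' code: `sample_gt` is a hypothetical name, and the duration is drawn as 1 with probability q and otherwise as 2 plus the number of failures before the first success in Bernoulli(p) trials, which is our reading of the HGEO distribution.

```python
import random
from typing import Tuple


def sample_gt(p: float, q: float, beta: float, rng: random.Random) -> Tuple[float, float, int]:
    """Draw one (X, Y, N) from GT(p, q, beta): sum, maximum and count of
    N IID exponential variables with rate beta, where N ~ HGEO(p, q)."""
    n = 1
    if rng.random() >= q:          # with probability 1 - q the episode lasts >= 2
        n = 2
        while rng.random() >= p:   # each extra period is added with probability 1 - p
            n += 1
    xs = [rng.expovariate(beta) for _ in range(n)]  # rate beta, mean 1 / beta
    return sum(xs), max(xs), n
```

Every draw satisfies Y ≤ X, with equality exactly when N=1, which reflects the mixed (partly singular) nature of the distribution.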
We now derive basic characteristics of the GT model, starting with its PDF. For this, we use the bivariate distribution of \(\left (\sum _{i=1}^{n} X_{i}, \bigvee _{i=1}^{n} X_{i}\right)\), developed in Qeadan et al. (2012), which is the conditional distribution of (X,Y) given N=n in the GT model. This distribution, referred to as the BGGE model in Qeadan et al. (2012), has the PDF of the form
where, for all \(n\in \mathbb {N}\),
The values of the joint PDF in (9) depend on which of the sectors S_{k} the input (x,y) belongs to, where
and
The support of this distribution is the set
The joint distribution of \((X, Y, N) \sim \mathcal {GT}(p, q, \beta)\) can now be derived via a standard hierarchical approach, where \(N \sim \mathcal {HGEO}(p, q)\) with the PMF f given by (6) and (X,Y)|N=n has the PDF f(x,y|n) given by (9), so that the joint PDF f(x,y,n) of (X,Y,N) is f(x,y,n)=f(x,y|n)f(n). This leads to the following result.
Proposition 2
The PDF of \((X, Y, N) \sim \mathcal {GT}(p, q, \beta)\) is given by
where the function H is defined in (10).
Remark 4
We note that the support of the GT distribution is the same as that of its special case, the TETLG distribution, which consists of the set \(\{(x,y,n):n\in \mathbb {N}, (x,y)\in A_{n}\}\). In analogy with the TETLG model, if n=1 and (x,y)∈S_{0}, so that the point (x,y,n) is in the set A_{1}×{1}, the joint PDF reduces to
$$ f(x,y,n) = q f_{1}(x,y,n), $$
where \(f_{1}(x,y,n) = \beta e^{-\beta x}I_{A_{1} \times \{1\}}(x,y,n). \) In other words, with probability q, the distribution is concentrated on the set A_{1}×{1}, and represents the random vector (E_{0},E_{0},1), where \(E_{0}\sim \mathcal {EXP}(\beta)\). However, when n≥2, the conditional distribution of (X,Y) given N=n is absolutely continuous with the PDF f(x,y|n) given in (9), in which case the joint PDF in (11) becomes
$$ f(x,y,n) = (1-q) f_{2}(x,y,n), $$
where f_{2}(x,y,n)=f(x,y|n)p(1−p)^{n−2}. Altogether, we have
$$ f(x,y,n) = q f_{1}(x,y,n) + (1-q) f_{2}(x,y,n), \tag{12} $$
showing that the GT distribution is a mixture of a degenerate distribution of (E_{0},E_{0},1) and a proper, trivariate distribution (with the PDF f_{2}), where the mixing probabilities correspond to the events N=1 and N≥2, respectively. This absolutely continuous component is the same as its analogue in the TETLG model.
The mixture representation (12) of the GT distribution can also be seen from the stochastic representation of this model, stated below.
Proposition 3
If \((X, Y, N) \sim \mathcal {GT}(p,q,\beta)\) then
where \(N\sim \mathcal {HGEO}(p,q)\) and the {E_{i}} are independent \(\mathcal {EXP}(\beta)\) variables, independent of N.
Next, we provide an alternative stochastic representation, involving a geometric variable N_{p} rather than the mixed geometric variable N, which is the part of the random vector (X,Y,N). Both of these representations are useful for deriving further properties of the GT distribution.
Proposition 4
If \((X, Y, N) \sim \mathcal {GT}(p, q, \beta)\) then
where all the variables on the right-hand side of (14) are mutually independent, \(N_{p} \sim \mathcal {GEO}(p)\), I is a Bernoulli random variable with parameter (1−q), and the {E_{i}} are independent \(\mathcal {EXP}(\beta)\) random variables.
Remark 5
We note that in the special case p=q, both of the above stochastic representations result in the TETLG distribution studied in Kozubowski et al. (2011), describing the random vector
where N_{p} and the {E_{i}} are as above. Further, it can be shown that in general the random vector \((X,Y, N)\sim \mathcal {GT}(p, q, \beta)\) can be directly related to a TETLG random vector \((\tilde {X}, \tilde {Y}, \tilde {N})\) viz. another stochastic representation, which involves the operation ⊕_{j} defined below. This operation acts componentwise on two vectors, x=(x_{1},…,x_{n}) and y=(y_{1},…,y_{n}), and returns another vector, denoted by x⊕_{j}y. The latter is obtained by adding the corresponding coordinates of x and y, with the exception of the jth coordinate, where we take x_{j}∨y_{j} (the maximum of x_{j} and y_{j}). Thus, we have
With this notation we have
where E_{0} and I are as above and the operation ⊕_{2} acts as the maximum of the second coordinates and the sum of the first and the third coordinates, so that
Using the above representations, we obtain the characteristic function (ChF) of the GT model, presented below.
Proposition 5
The characteristic function of \((X, Y, N) \sim \mathcal {GT}(p, q, \beta)\) is given by
When p=q, Eq. (17) yields the ChF of the TETLG model of Kozubowski et al. (2011). The joint moments of the GT model are presented below.
Proposition 6
If \((X, Y, N) \sim \mathcal {GT}(p, q, \beta)\) then
where the double summation is taken over all sets of nonnegative integers \(k_{1}, \dots, k_{n}\) and \(l_{1}, \dots, l_{n}\) that add up to k and l, respectively.
One can use the above result to obtain the mean vector and the covariance matrix of the GT distribution, which are presented below.
Proposition 7
If \((X, Y, N) \sim \mathcal {GT}(p, q, \beta)\), then
and the elements of the covariance matrix Σ=[σ_{i,j}] are as follows:
where the constant c_{p} in the expression for σ_{2,2} is given by
Marginal and conditional distributions
In this section we present the marginal and (selected) conditional distributions of the new trivariate GT distribution.
4.1 Bivariate margins
Here we discuss the three bivariate marginal distributions of (X,N),(Y,N), and (X,Y), starting with the joint distribution of (X,N).
4.1.1 The marginal distribution of X,N
In view of the stochastic representation in (1), the joint PDF of (X,N) can be derived through a standard conditioning argument using the fact that \(N \sim \mathcal {HGEO}(p, q)\) and, given N=n, the variable X is the sum of n IID exponential variables with parameter β>0. Thus, X|N=n has a gamma distribution \(\mathcal {GAM}(n, \beta)\), with the PDF given by:
$$ f(x|n) = \frac{\beta^{n}}{(n-1)!} x^{n-1} e^{-\beta x}, \quad x>0. \tag{20} $$
We obtain the joint PDF of (X,N) by multiplying the conditional PDF (20) by the marginal PMF of N given by (6), leading to the following result.
Proposition 8
If \((X,Y,N) \sim \mathcal {GT}(p, q,\beta)\) then the joint PDF of (X,N) is given by
Remark 6
Clearly, in the special case p=q we obtain the bivariate BEG distribution (see Kozubowski and Panorska 2005), describing the random vector
where \(N_{p}\sim \mathcal {GEO}(p)\) and the {E_{i}} are IID \(\mathcal {EXP}(\beta)\). Further, it can be seen from Proposition 4 or the relation (16) that in general the random vector (X,N) with the PDF (21) can be related to a BEG random vector \((\tilde {X}, \tilde {N})\) viz.
where \(E_{0} \sim \mathcal {EXP}(\beta)\), I is Bernoulli with parameter 1−q, and all the variables on the right-hand side of (22) are mutually independent.
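The conditioning argument behind the joint PDF of (X,N) can be sketched in a few lines of Python. This is an illustration under the assumption that the gamma PDF and the HGEO PMF take their standard forms, β^{n}x^{n−1}e^{−βx}/(n−1)! and (q at n=1, (1−q)p(1−p)^{n−2} for n≥2) respectively; all function names are hypothetical.

```python
import math


def hgeo_pmf(n: int, p: float, q: float) -> float:
    """Assumed HGEO(p, q) PMF: q at n=1, (1-q) p (1-p)^(n-2) for n >= 2."""
    if n < 1:
        return 0.0
    return q if n == 1 else (1.0 - q) * p * (1.0 - p) ** (n - 2)


def gamma_pdf(x: float, n: int, beta: float) -> float:
    """PDF of the sum of n IID exponential(beta) variables (integer shape n)."""
    return beta ** n * x ** (n - 1) * math.exp(-beta * x) / math.factorial(n - 1)


def joint_pdf_xn(x: float, n: int, p: float, q: float, beta: float) -> float:
    """Joint density of (X, N): conditional gamma density times the PMF of N."""
    if x <= 0.0 or n < 1:
        return 0.0
    return gamma_pdf(x, n, beta) * hgeo_pmf(n, p, q)
```

For n=1 the product reduces to qβe^{−βx}, the exponential component that carries the 1-inflation.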
4.1.2 The marginal distribution of Y,N
We now turn to the joint distribution of (Y,N). Here, given N=n, the variable Y is the maximum of n IID exponential random variables. Thus, it has a generalized exponential distribution with the PDF
$$ f(y|n) = n\beta e^{-\beta y}\left(1-e^{-\beta y}\right)^{n-1}, \quad y>0. \tag{23} $$
By proceeding as above, we obtain the joint PDF of (Y,N) as follows.
Proposition 9
If \((X,Y,N) \sim \mathcal {GT}(p, q,\beta)\) then the joint PDF of (Y,N) is given by
Remark 7
We note that in the special case p=q we obtain the bivariate BTLG distribution (see Kozubowski and Panorska 2008), describing the random vector
where N_{p} and the {E_{i}} are as above. Further, it can be seen from the relation (16) that in general the random vector (Y,N) with the PDF (24) can be related to a BTLG random vector \((\tilde {Y}, \tilde {N})\) viz.
where E_{0} and I are as above and the operation ⊕_{j} is given by (15), so the ⊕_{1} above acts as the maximum of the first coordinates and the sum of the second coordinates,
4.1.3 The marginal distribution of X,Y
Finally, we turn to the last of the three bivariate distributions, the distribution of (X,Y). Its PDF can be obtained in a standard way by adding up the trivariate PDF of the GT model (given in Proposition 2) across all the values of \(n\in \mathbb {N}\). The support of this new bivariate distribution is the set
Lengthy algebra produces the result below, which can be proven in the same way as Theorem 3.1 in Kozubowski et al. (2011).
Proposition 10
If \((X, Y, N) \sim \mathcal {GT}(p,q, \beta)\) then the marginal PDF of (X,Y) is given by
where \(g_{1}(x,y)=\beta e^{-\beta x} I_{A_{1}}(x,y)\) and
with
Remark 8
The result shows that the distribution of (X,Y) is a mixture of a degenerate distribution of the vector (E_{0},E_{0}) with \(E_{0}\sim \mathcal {EXP}(\beta)\), which is a singular part of the distribution (corresponding to the event N=1, which occurs with probability q) and an absolutely continuous component with the PDF g_{2} given by (27), supported on the set A (corresponding to the event N≥2, which occurs with probability 1−q). Similar interpretation applies to the special case of the BETL distribution discussed in Kozubowski et al. (), obtained here when p=q. In that case the above proposition yields the PDF of
where N_{p} and the {E_{i}} are as above. Further, it can be seen from the relation (16) that in general the random vector (X,Y) with the PDF (26) can be related to a BETL random vector \((\tilde {X}, \tilde {Y})\) viz.
where E_{0} and I are as above and the operation ⊕_{j} is given by (15), so that
4.2 Univariate margins
We now discuss the univariate margins. Since N has a hurdle-type generalized geometric distribution given by (6), we shall focus on the marginal distributions of X and Y, starting with X.
4.2.1 The marginal distribution of X
The PDF of X can be calculated in a straightforward way by summing up the joint PDF of X and N given by (21) across all the values of \(n\in \mathbb {N}\), leading to the result below.
Proposition 11
If \((X, Y, N) \sim \mathcal {GT}(p, q, \beta)\) then the PDF of X is
with the corresponding CDF of the form
Remark 9
The result shows that X is a generalized mixture of two exponential distributions, one with parameter β and another with parameter pβ. The term “generalized” signifies the fact that the two weights in (30), although they add up to one, are not necessarily restricted to the unit interval. It is worth noting that the distribution of X is also a proper mixture of two distributions, of which one is again exponential with parameter β while the other has a hypoexponential distribution, also known as the generalized Erlang distribution (see, e.g., Johnson et al. 1994), given by the PDF
The above is the PDF of X_{1}+X_{2}, where X_{1} is exponential with parameter β and X_{2} is exponential with parameter pβ, independent of X_{1} (the term “hypoexponential” describes convolutions of exponential variables with different parameters). While the above hypoexponential variable is also a generalized mixture of exponential distributions, the distribution of X is a proper mixture as its PDF can be expressed as
with g(·) given by (31). This representation can be obtained directly from Proposition 4, which shows that X is either equal to X_{1} (representing the exponential part E_{0}, with probability q) or X_{1}+X_{2}, where X_{2} is exponential with parameter pβ. This X_{2} corresponds to the first sum on the right-hand side of the representation (14), and the hypoexponential component X_{1}+X_{2} occurs with probability 1−q.
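As a quick consistency check on this proper-mixture representation (a worked step added here, using only the means 1/β and 1/(pβ) of the two exponential components), the mean of X follows directly:

$$ \mathbb {E}X = q\cdot \frac{1}{\beta } + (1-q)\left (\frac{1}{\beta } + \frac{1}{p\beta }\right) = \frac{1}{\beta }\left (1 + \frac{1-q}{p}\right) = \frac{1+p-q}{p\beta }. $$

This agrees with the product \(\mathbb {E}N\cdot \mathbb {E}X_{1}\) expected for a random sum with N independent of the summands, since a variable equal to 1 with probability q and to 1 plus a \(\mathcal {GEO}(p)\) variable with probability 1−q has mean 1+(1−q)/p.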
Remark 10
The above mixture representations of X provide a convenient tool in deriving basic characteristics of the distribution of X. For example, the moment generating function (MGF) of X is of the form
while the moments of X are given by
4.2.2 The marginal distribution of Y
The PDF of Y shown in the result below can be calculated as that of X, by summing up the joint PDF of Y and N given by (24) across all the values of \(n\in \mathbb {N}\).
Proposition 12
If \((X, Y, N) \sim \mathcal {GT}(p, q, \beta)\) then the PDF of Y is
while the CDF of Y is
Remark 11
The structure of the above CDF reveals that the distribution of Y is also a two component mixture. Indeed, Y can be thought of as either an exponential variable E_{0} with parameter β (with probability q) or the maximum of two independent variables, \(E_{0}\vee \tilde {Y}\), where
with N_{p} and {E_{i}} as above. The variable \(\tilde {Y}\) has a truncated logistic distribution on \(\mathbb {R}_{+}\), with the CDF
studied by Marshall and Olkin (1997). This can also be seen from the representation (29), showing that \(Y\stackrel {d}{=} E_{0}\vee I \tilde {Y}\).
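The two-component structure of Y can be checked numerically. The sketch below is an illustration, with hypothetical function names: it evaluates the CDF of Y once through the representation of Y as the maximum of E_{0} and \(I\tilde {Y}\), and once by summing F(y)^{n} over the duration distribution, assuming the HGEO PMF puts mass q on n=1 and (1−q)p(1−p)^{n−2} on n≥2.

```python
import math


def cdf_y(y: float, p: float, q: float, beta: float) -> float:
    """CDF of Y via Y = max(E_0, I * Y_tilde), with I ~ Bernoulli(1 - q)."""
    if y <= 0.0:
        return 0.0
    f0 = 1.0 - math.exp(-beta * y)              # CDF of E_0 ~ EXP(beta)
    f_tilde = p * f0 / (1.0 - (1.0 - p) * f0)   # CDF of the max of N_p exponentials
    return f0 * (q + (1.0 - q) * f_tilde)


def cdf_y_series(y: float, p: float, q: float, beta: float, terms: int = 2000) -> float:
    """Same CDF, computed as sum over n of P(N = n) * F(y)^n (truncated series)."""
    if y <= 0.0:
        return 0.0
    f0 = 1.0 - math.exp(-beta * y)
    total = q * f0                              # n = 1 term
    for n in range(2, terms + 2):
        total += (1.0 - q) * p * (1.0 - p) ** (n - 2) * f0 ** n
    return total
```

The closed form is the cheaper of the two; the series version is included only as an independent check of the algebra.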
4.3 Conditional distributions
Here we summarize basic facts concerning bivariate and univariate conditional distributions connected with the GT distribution. Since the results below are established by routine derivations involving ratios of the relevant PDFs, their elementary proofs are omitted.
4.3.1 Bivariate conditional distributions
Here we consider the three bivariate conditional distributions of (X,Y)|N=n, (X,N)|Y=y, and (Y,N)|X=x.
4.3.2 The distribution of X and Y given N=n
The conditional distribution of (X,Y) given \(N=n\in \mathbb {N}\) was studied in Qeadan et al. (2012), and is known as the BGGE(β,n) model. Its PDF, given by (9)-(10), provided the basis for our derivation of the GT PDF. In particular, the conditional distribution of X given N=n is gamma with the PDF given in (20), while the conditional distribution of Y given N=n is generalized exponential (see, e.g., Gupta and Kundu 2007) with the PDF given in (23).
4.3.3 The distribution of X and N given Y=y
Next, we consider the conditional PDF of (X,N) given Y=y>0, which turns out to be of the form
where the function H(x,y,n) is given by (10) and
We note that when n=1, this conditional PDF is nonzero only if x=y, in which case it takes on the value of q/v(y). We also note that when n≥2, the function H(x,y,n) will be nonzero only if x satisfies ky<x≤(k+1)y,k=1,…,n−1, in which case
4.3.4 The distribution of Y and N given X=x
Finally, we have the following expression for the PDF of (Y,N) given X=x>0:
where the function H(x,y,n) is given by (10) and
Again, when n=1 the PDF above is nonzero only if y=x, in which case it takes on the value of q(1−p)/u(x), and when n≥2 the function H(x,y,n) will be nonzero only if y satisfies x/(k+1)≤y<x/k,k=1,…,n−1, in which case we have (32).
4.3.5 Univariate conditional distributions
It turns out that all three univariate conditional distributions of X, Y, and N given the other two variables are the same as their counterparts in the special TETLG case (p=q) studied by Kozubowski et al. (2011). We present their formulas below for the convenience of the reader.
4.3.6 The distribution of X given Y=y, N=n
The PDF of the conditional distribution of X given \(Y=y>0, N=n\in \mathbb {N}\) is given by
Similarly to the cases discussed above, when n=1 the PDF is nonzero only if y=x, in which case it takes on the value of 1. In turn, when n≥2 the function H(x,y,n) will be nonzero only if x satisfies ky<x≤(1+k)y,k=1,…,n−1, in which case we have (32).
4.3.7 The distribution of Y given X=x, N=n
The conditional PDF of Y given \(X=x>0, N=n\in \mathbb {N}\) is of the form
Again, for n=1 we have f(y|x,n)=1 if y=x (and zero otherwise), while for n≥2 the function H(x,y,n) will be nonzero only if y satisfies x/(k+1)≤y<x/k,k=1,…,n−1, in which case we have (32). We also note that this particular distribution is parameter-free.
4.3.8 The distribution of N given X=x, Y=y
As in the TETLG case, the conditional distribution of N given X=x>0 and Y=y>0 reduces to a point mass at 1 when x=y. On the other hand, for \((x,y)\in S_{k}, k\in \mathbb {N}\), the PMF of this distribution is of the form
where W_{s} is defined in (28).
Estimation and testing
In this section we consider the problems of estimating the parameters of the GT model and testing the hypothesis that p=q (so that the GT model reduces to the TETLG model) based on a random sample (X_{1},Y_{1},N_{1}),…,(X_{k},Y_{k},N_{k}) from the \(\mathcal {GT}(p, q, \beta)\) distribution.
5.1 Maximum likelihood estimation
We start with the Fisher information matrix I(p,q,β) corresponding to the distribution of \((X, Y, N) \sim \mathcal {GT}(p, q, \beta)\). Routine calculations lead to
Next, we turn to parameter estimation via maximum likelihood. While the PDF of the GT model is rather complicated, fortunately the function H(x,y,n) is parameter-free and the derivation of the maximum likelihood estimators (MLEs) is straightforward. Indeed, the likelihood function can be written as
where C is parameter-free, k_{1} is the number of data points with N_{i}=1, and \(\overline {X}_{k}, \overline {N}_{k}\) are the sample means of the {X_{i}} and the {N_{i}}, respectively. Thus, the statistics \(k_{1}/k, \overline {X}_{k}\) and \(\overline {N}_{k}\) are jointly sufficient. We note that these statistics do not involve the values of {Y_{i}}_{i=1,…,k}. This is due to the fact that the conditional distribution of Y given X and N is parameter-free (see Section 4.3.7). Thus the values of N and X carry all the information necessary to estimate all the parameters. We also note that the likelihood function can be maximized with respect to each parameter separately from the other parameters, leading to explicit MLEs provided below.
Proposition 13
Let (X_{1},Y_{1},N_{1}),…,(X_{k},Y_{k},N_{k}) be IID observations from \(\mathcal {GT}(p, q, \beta)\) distribution such that \(\overline {X}_{k}>0, \overline {N}_{k}>1\). Then, there exist unique MLEs of the three parameters, given by
We note that since the distribution of X is absolutely continuous, we have \(\mathbb {P}(\overline {X}_{k}>0)=1\). Additionally, while it is possible to have \(\overline {N}_{k}=1\), which occurs only if all sample values are equal to 1, the probability of this event converges to zero as the sample size k goes to infinity. However, if such an event does occur, the MLEs of q and β still exist and are unique (with values of 1 and \(1/\overline {X}_{k}\), respectively) while the MLE of p is undefined.
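For readers who want to compute the estimators, here is a minimal Python sketch. The displayed formulas of Proposition 13 are not reproduced in this text, so the closed forms used below are an assumption: they are what one obtains by maximizing a likelihood proportional to q^{k_1}(1−q)^{k−k_1} p^{k−k_1}(1−p)^{ΣN_i−2k+k_1} β^{ΣN_i} e^{−βΣX_i}, which follows from the HGEO PMF (mass q at n=1, (1−q)p(1−p)^{n−2} for n≥2) and the gamma conditional law of X. The function name `gt_mle` is hypothetical.

```python
from typing import Optional, Sequence, Tuple


def gt_mle(xs: Sequence[float], ns: Sequence[int]) -> Tuple[Optional[float], float, float]:
    """Return (p_hat, q_hat, beta_hat) from observed magnitudes xs and durations ns.

    Assumed closed forms (derived here, not quoted from the paper):
      q_hat    = k1 / k                  share of episodes of duration one
      p_hat    = (k - k1) / (sum(N) - k) undefined (None) if every N_i equals 1
      beta_hat = mean(N) / mean(X)
    The peak values Y_i are not needed.
    """
    k = len(ns)
    if k == 0 or len(xs) != k:
        raise ValueError("xs and ns must be non-empty and of equal length")
    k1 = sum(1 for n in ns if n == 1)
    total_n = sum(ns)
    q_hat = k1 / k
    beta_hat = (total_n / k) / (sum(xs) / k)
    p_hat = None if k1 == k else (k - k1) / (total_n - k)
    return p_hat, q_hat, beta_hat
```

When all durations equal one the function returns `None` for p together with q̂=1 and β̂ equal to the reciprocal of the sample mean of X, matching the boundary case described above.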
Further, the estimators do not involve the values {Y_{i}}, and the estimator of β is exactly the same as its counterpart in the TETLG model of Kozubowski et al. (2011). Moreover, if only the data on the {X_{i},N_{i}} were available, and the values of the third variable were missing, we would still obtain exactly the same set of three estimators. In fact, the estimators of p and q are only dependent on the univariate observations of the {N_{i}}, and would be exactly the same if the rest of the data was missing, or if we only have information on the {Y_{i},N_{i}} while the {X_{i}} were missing. However, in the latter case, the estimator of β is no longer the same as above, and a numerical search is needed to find it. The estimators are also different from those above if we only worked with bivariate data on {X_{i},Y_{i}} or only univariate data involving either the {X_{i}} or the {Y_{i}}. In these three cases the underlying models are mixtures, and finding the estimators is not straightforward. We now turn to the asymptotic properties of the MLEs, where we have the following result.
Proposition 14
The vector MLE \((\hat {p}_{k}, \hat {q}_{k}, \hat {\beta }_{k})^{\top }\) given in Proposition 13 is

1.
Consistent;

2.
Asymptotically normal, that is, \(\sqrt {k}[(\hat {p}_{k}, \hat {q}_{k}, \hat {\beta }_{k})^{\top } - (p, q, \beta)^{\top }]\) converges in distribution to a trivariate normal distribution with zero mean vector and covariance matrix
$$ {\boldsymbol \Sigma}_{MLE} = \left[\begin{array}{ccc} \frac{p^{2}(1 - p)}{1 - q} & 0 & 0 \\ 0 & q(1 - q) & 0 \\ 0 & 0 & \frac{p\beta^{2}}{1 + p - q} \end{array}\right]; \tag{36} $$
3.
Asymptotically efficient, that is the asymptotic covariance matrix (36) coincides with the inverse of the Fisher information matrix (33).
Remark 12
The above result allows us to derive approximate (1−α)×100% confidence intervals for the parameters in the large-sample setting, leading to
where the quantities \(\hat {\sigma }_{ASY}(p), \hat {\sigma }_{ASY}(q)\), and \(\hat {\sigma }_{ASY}(\beta)\) are the square roots of the diagonal entries of the asymptotic covariance matrix (36) with the parameters replaced by their MLEs.
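As a rough numerical companion to Remark 12, the sketch below forms Wald-type intervals of the form estimate ± z_{α/2}·σ̂_ASY/√k from the diagonal entries of (36). The interval form, the function name `gt_confidence_intervals`, and the returned dictionary layout are assumptions made for illustration; the intervals are not clipped to the parameter space, and the calculation requires an estimate of q strictly below 1.

```python
from math import sqrt
from statistics import NormalDist

def gt_confidence_intervals(p_hat, q_hat, beta_hat, k, alpha=0.05):
    """Approximate (1 - alpha) intervals based on the covariance matrix (36)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # standard normal quantile
    # Diagonal of (36), with parameters replaced by their estimates.
    variances = {
        "p": p_hat ** 2 * (1 - p_hat) / (1 - q_hat),
        "q": q_hat * (1 - q_hat),
        "beta": p_hat * beta_hat ** 2 / (1 + p_hat - q_hat),
    }
    estimates = {"p": p_hat, "q": q_hat, "beta": beta_hat}
    intervals = {}
    for name, est in estimates.items():
        margin = z * sqrt(variances[name] / k)   # margin of error (ME)
        intervals[name] = (est - margin, est + margin)
    return intervals
```

The half-width `margin` corresponds to the margin of error that accompanies each estimate in a large-sample summary.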
5.2 Testing for p=q under the GT model
The objective of this section is to develop a likelihood ratio (LR) test for the null hypothesis H_{0}:p=q under the assumption that the data follow the \(\mathcal {GT}(p,q,\beta)\) distribution. Before setting up the test, let us consider the parameter space of this model, and its subspace corresponding to the null hypothesis. Clearly, we must have β>0 and p,q must belong to the unit interval. However, care is needed in regard to the boundary values of p and q in order to ensure that the parameterization is identifiable and the possible values of p and q are consistent with the results of estimation. With this in mind, we denote the vector parameter by θ=(θ_{1},θ_{2},θ_{3})^{⊤}, where θ_{1}=p,θ_{2}=q, and θ_{3}=β, and propose to set the general parameter space Θ as follows:
where
With this definition of the parameter subspace for p and q, all the boundary values with p=0 are excluded, regardless of q, and so are all the boundary values with q=1, with the exception of the “corner” of the unit square where we have p=q=1. Clearly, the null subset of Θ where p=q corresponds to the set
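With the above setup, we wish to test
$$ H_{0}: \boldsymbol{\theta} \in \Theta_{0} \quad \text{versus} \quad H_{1}: \boldsymbol{\theta} \in \Theta_{1}, \tag{37} $$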
where Θ_{1}=Θ−Θ_{0}. The classical LR test rejects the null hypothesis in (37) in favor of the alternative for large values of the LR test statistic
where L(·) is the likelihood function (of the full GT model) given by (34). The evaluation of the likelihood ratio test statistic (38) is straightforward. Indeed, the numerator in (38) is simply the value of the likelihood evaluated at the three MLEs given by (35), resulting in \(L(\hat {p}_{k}, \hat {q}_{k}, \hat {\beta }_{k})\), given explicitly. The denominator is given by \(L(\hat {p}^{0}_{k}, \hat {q}^{0}_{k}, \hat {\beta }^{0}_{k})\), where the triple \((\hat {p}^{0}_{k}, \hat {q}^{0}_{k}, \hat {\beta }^{0}_{k})\) is the value of the parameter θ that maximizes the likelihood over Θ_{0}. What this means in our case is that we set q=p in the likelihood function (34), resulting in
and subsequently maximize the function L_{0}(p,β) with respect to p∈(0,1] and β>0. We recognize the function in (39) as the likelihood based on the TETLG model of Kozubowski et al. (2011), which is known to be maximized by
Thus, the denominator in (38) becomes \(L_{0}(\hat {p}^{0}_{k}, \hat {\beta }^{0}_{k}) = L(\hat {p}^{0}_{k}, \hat {p}^{0}_{k}, \hat {\beta }^{0}_{k})\), and is also given explicitly. By putting these facts together, we arrive at the following result.
Proposition 15
The LR statistic (38) for testing the hypotheses in (37) based on a random sample of size k from a \(\mathcal {GT}(p,q,\beta)\) distribution is given by
where k_{1} is the number of sample values with \(N_{i}=1\), the estimators \(\hat {p}_{k}, \hat {q}_{k}, \hat {\beta }_{k}\) are given by (35), and \(\hat {p}^{0}_{k}, \hat {\beta }^{0}_{k}\) are given by (40).
Remark 13
We note that the LR statistic does not involve the values of {X_{i}} and {Y_{i}}. In fact, the exact same statistic comes up in connection with testing the hypotheses
$$ H_{0}: p = q \quad \text{versus} \quad H_{1}: p \neq q \tag{42} $$
in the context of the univariate \(\mathcal {HGEO}(p,q)\) distribution, based on a random sample N_{1},…,N_{k}. The PMF of this distribution is given in (6). Under the null hypothesis in (42) this 1-inflated geometric distribution reduces to the classical geometric distribution \(\mathcal {GEO}(p)\), given by the PMF (4).
By the standard large-sample theory, the quantity 2 logΛ has approximately a chi-square distribution when the sample size k is large, which helps to set up the critical region in practice.
Proposition 16
Let Λ be the LR test statistic (38), based on a random sample of size k from a \(\mathcal {GT}(p,q,\beta)\) distribution. Then, as k→∞, the quantity 2 logΛ converges in distribution to a chi-square random variable with 1 degree of freedom.
Remark 14
While the calculation of the LR test statistic or the quantity 2logΛ in practice is straightforward, some care is required when dealing with certain exceptional cases, where the ratios in (41) may seem to be undefined. Careful examination of the likelihood function, the relevant MLEs, and the LR statistic reveals five different cases, which can be described as follows:

1.
If all the values of N_{i} are 1 (so that k_{1}=k) then 2 logΛ=0,

2.
If all the values of N_{i} are 2 (so that k_{1}=0) then 2 logΛ=2k log4,

3.
If all the values of N_{i} are either 1 or 2, but they are not all the same, then
$$2 \log\Lambda = 2k\left[ \overline{N}_{k}\log(\overline{N}_{k}) + \frac{k_{1}}{k}\log\left(\frac{k_{1}}{k}\right) \right], $$ 
4.
If N_{i}≥2 for all i=1,…,k and at least one value is greater than 2 then
$$2\log\Lambda = 2k\left[(\overline{N}_{k} - 2)\log(\overline{N}_{k} - 2) - 2(\overline{N}_{k} - 1)\log(\overline{N}_{k} - 1) + \overline{N}_{k}\log(\overline{N}_{k}) \right], $$
5.
If at least one N_{i}=1 and at least one N_{i}>2 then
$$\begin{array}{rcl} 2\log\Lambda & = & 2k \left[ \frac{k_{1}}{k}\log\left(\frac{k_{1}}{k}\right) + 2\left(1 - \frac{k_{1}}{k}\right)\log\left(1 - \frac{k_{1}}{k}\right) - 2(\overline{N}_{k} - 1)\log(\overline{N}_{k} - 1) \right. \\ & & \left. + \left(\overline{N}_{k} - 1 - \left(1 - \frac{k_{1}}{k}\right)\right)\log\left(\overline{N}_{k} - 1 - \left(1 - \frac{k_{1}}{k}\right)\right) + \overline{N}_{k}\log(\overline{N}_{k}) \right]. \end{array} \tag{43} $$
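The five cases of Remark 14 can be handled by a single routine if one adopts the convention 0·log 0 = 0, under which the general expression (43) reduces to each of the special cases. The sketch below is one possible implementation under that convention; the helper names `xlogx` and `two_log_lambda` are hypothetical, and all ratios are built from integer counts to avoid small negative arguments caused by rounding.

```python
from math import log

def xlogx(t):
    """Return t*log(t), using the convention 0*log(0) = 0."""
    if t < 0:
        raise ValueError("argument must be non-negative")
    return t * log(t) if t > 0 else 0.0

def two_log_lambda(n):
    """Compute 2 log(Lambda) of (43) from the durations N_1, ..., N_k only."""
    k = len(n)
    if k == 0 or min(n) < 1:
        raise ValueError("durations must be a non-empty list of integers >= 1")
    total = sum(n)
    k1 = sum(1 for ni in n if ni == 1)
    a = k1 / k                        # k1/k
    b = (k - k1) / k                  # 1 - k1/k
    m1 = (total - k) / k              # N_bar - 1
    m2 = (total - 2 * k + k1) / k     # N_bar - 1 - (1 - k1/k)
    n_bar = total / k
    return 2 * k * (xlogx(a) + 2 * xlogx(b) - 2 * xlogx(m1)
                    + xlogx(m2) + xlogx(n_bar))
```

With this convention, a sample of all ones gives 0 and a sample of k twos gives 2k log 4, in agreement with the first two cases of the remark.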
In order to have a practical guide as to when one can use the limiting distribution as a reasonable approximation to the distribution of 2 logΛ, we performed a Monte Carlo study. Noting that the speed of convergence may depend on the true value of p, we simulated 10,000 samples of (varying) size k from the \(\mathcal {GEO}(p)\) distribution for selected values of p. We then found the smallest k for which the (empirical) distribution of 2 logΛ can be assumed to be \(\chi ^{2}_{1}\). We used the Kolmogorov-Smirnov goodness-of-fit test with significance level of 0.05 to assess whether the distribution of 2 logΛ can be reasonably considered to be \(\chi ^{2}_{1}\). We summarized the results of this simulation study in Table 1.
The simulation shows that when the true value of p is between 0.1 and 0.9, the sample size needed for the limiting distribution to be a good approximation for the distribution of 2 logΛ is below 100. However, as p approaches 0 or 1, the sample sizes required for a reasonable approximation grow. In particular, the sample sizes required for large values of p (close to 1) are much larger than those for small values of p (close to 0).
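A simulation of this kind can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not the procedure used for Table 1: geometric samples are drawn by inversion, the \(\chi ^{2}_{1}\) CDF is written as erf(√(x/2)), and the Kolmogorov-Smirnov decision uses the large-sample 5% critical value 1.358/√m in place of an exact test. Since 2 logΛ is a discrete statistic, this comparison is only approximate. All function names are hypothetical.

```python
import random
from math import erf, log, sqrt

def xlogx(t):
    """t*log(t) with the convention 0*log(0) = 0."""
    return t * log(t) if t > 0 else 0.0

def two_log_lambda(n):
    """2 log(Lambda) from durations only, following (43)."""
    k, total = len(n), sum(n)
    k1 = sum(1 for ni in n if ni == 1)
    return 2 * k * (xlogx(k1 / k) + 2 * xlogx((k - k1) / k)
                    - 2 * xlogx((total - k) / k)
                    + xlogx((total - 2 * k + k1) / k)
                    + xlogx(total / k))

def geometric(p, rng):
    """Draw from GEO(p) on {1, 2, ...} by inversion."""
    if p >= 1:
        return 1
    u = 1.0 - rng.random()            # u lies in (0, 1], so log(u) is finite
    return int(log(u) / log(1 - p)) + 1

def chi2_1_cdf(x):
    """CDF of the chi-square distribution with 1 degree of freedom."""
    return erf(sqrt(x / 2)) if x > 0 else 0.0

def ks_distance(values, cdf):
    """Kolmogorov-Smirnov distance between the empirical CDF and cdf."""
    xs = sorted(values)
    m = len(xs)
    return max(max(cdf(x) - i / m, (i + 1) / m - cdf(x))
               for i, x in enumerate(xs))

def simulate(p, k, reps=10_000, seed=0):
    """Return (KS distance, approximate 5% critical value) for sample size k."""
    rng = random.Random(seed)
    stats = [two_log_lambda([geometric(p, rng) for _ in range(k)])
             for _ in range(reps)]
    return ks_distance(stats, chi2_1_cdf), 1.358 / sqrt(reps)
```

Calling `simulate(p, k)` for increasing k and recording the first k at which the distance falls below the critical value mimics the search described above.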
An illustrative data example
In this section, we illustrate potential applications of the new GT model using S&P500 index return data.
The data were downloaded from the ‘Yahoo! Finance’ historical data archive. The initial data were the daily closing prices for the S&P500 index, covering the period from Dec 30, 1927 to April 17, 2020. These were converted to (n = 23,183) daily log-returns, i.e. natural logarithms of the ratios of the closing prices for two consecutive days. Finally, we converted these data to growth periods, that is, runs of consecutive days on which the closing price increases from one day to the next, so that the log-returns stay positive (5,540 growth periods). In this case, the N_{i} are the durations (in days) of the growth periods, while the X_{i} and the Y_{i} are the magnitude and the maximum daily return for the ith growth period. We call this data set S&P500.
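The conversion from closing prices to growth periods can be sketched as follows. This is a minimal illustration, with two assumptions that the text does not settle: a zero log-return ends a growth period, and a run still in progress at the end of the series is kept. The function name `growth_periods` is hypothetical.

```python
from math import log

def growth_periods(prices):
    """Return a list of (X, Y, N) triples, one per growth period.

    X is the sum of the log-returns in the period (magnitude),
    Y is the largest daily log-return (peak), and
    N is the number of days in the period (duration).
    """
    returns = [log(b / a) for a, b in zip(prices, prices[1:])]
    episodes, run = [], []
    for r in returns:
        if r > 0:
            run.append(r)             # the growth period continues
        elif run:
            episodes.append((sum(run), max(run), len(run)))
            run = []                  # a non-positive return closes the period
    if run:                           # assumption: keep a trailing open run
        episodes.append((sum(run), max(run), len(run)))
    return episodes
```

For instance, the prices `[100, 101, 103, 102, 104, 104, 105]` contain three growth periods with durations 2, 1 and 1.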
We estimated the parameters p, q, and β of the underlying \(\mathcal {GT}(p,q,\beta)\) model via maximum likelihood, using the results given in Proposition 13. The resulting estimates are shown in Table 2, along with estimated margins of error (ME) of the 95% (asymptotic) confidence intervals described in the remark following Proposition 14.
Note that in this case p<q, and so we have an example of data with underinflated ones. This may be a reflection of the fact that the S&P500 is a composite index, which is more stable, that is, less prone to changes from growth to decline, than an individual stock return. Further, we tested the hypotheses H_{0}:p=q versus H_{1}:p≠q at significance level 0.05 using the likelihood ratio test described in Section 5, and obtained the test statistic 2 logΛ=10.06, with an (approximate) p-value of 0.003. Thus, we rejected the null hypothesis and concluded that the durations come from the shifted hurdle model. Hence the GT model describes the growth episodes in our data better than the standard TETLG model (connected with p=q), which has been used before in similar settings (see Kozubowski et al. 2011).
Although we did not aim at a formal goodness of fit analysis, we wanted to present visual evidence of the reasonable fit of our model. We fitted the marginal distributions of X and Y, and the conditional distributions for X given N and Y given N, when N=1,2,3. The fit is illustrated in Fig. 1 below with QQ plots.
All QQ plots show reasonable to very good fit. Note that the fit of the marginal of maximum is particularly impressive, given that the values of the maxima of the episodes did not play any role in the estimation.
Next, we present the fit of the \(\mathcal {HGEO}(p, q)\) model for duration. Table 3 contains the observed frequencies/relative frequencies along with estimated model probabilities for the duration of S&P500 growth events.
The relative frequencies and model probabilities for our data are reasonably close, and we conclude that the fit of the \(\mathcal {HGEO}(p, q)\) model is quite good for this data set as well. We believe that the fit of the GT model to the S&P500 data will encourage wider use of this model for data sets with an excessive number of ones.
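Model probabilities of the kind compared with relative frequencies can be computed along the following lines. The PMF coded here, P(N=1)=q and P(N=n)=(1−q)p(1−p)^{n−2} for n≥2, is an assumed form of the hurdle geometric law: it is not quoted from (6), but it reduces to the classical geometric PMF when p=q. The function names and the pooling of the upper tail into a final cell are illustrative choices.

```python
def hgeo_pmf(n, p, q):
    """P(N = n) under the assumed hurdle geometric form."""
    if n < 1:
        return 0.0
    if n == 1:
        return q                       # probability of a one-day episode
    return (1 - q) * p * (1 - p) ** (n - 2)

def expected_counts(p, q, k, max_n):
    """Expected frequencies for durations 1, ..., max_n - 1 and 'max_n or more'."""
    counts = [k * hgeo_pmf(n, p, q) for n in range(1, max_n)]
    counts.append(k - sum(counts))     # pooled upper tail
    return counts
```

Dividing the output of `expected_counts` by k gives the model probabilities cell by cell.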
Availability of data and materials
All data is publicly available. Here is the link to ‘Yahoo! Finance’ historical archive: https://finance.yahoo.com/quote/%5EGSPC/history?p=%5EGSPC
Abbreviations
 The following abbreviations are used in this manuscript (in alphabetical order):

BGGE: Bivariate distribution with geometric and generalized exponential margins
CDF: Cumulative distribution function
ChF: Characteristic function
EXP: Exponential distribution
GEO: Geometric
GT: Generalized TETLG
H: Hurdle model
HGEO: Hurdle geometric distribution
IID: Independent and identically distributed
LR: Likelihood ratio
MLE: Maximum likelihood estimator
PDF: Probability density function
PMF: Probability mass function
TETLG: Trivariate distribution with exponential, truncated logistic and geometric margins
ZI: Zero-inflated distribution
References
Alshkaki, R. S. A.: On zero-one inflated geometric distribution. Internat. Res. J. Math. Eng. IT. 3(8), 10–21 (2016).
Arendarczyk, M., Kozubowski, T. J., Panorska, A. K.: A bivariate distribution with Lomax and geometric margins. J. Korean Statist. Soc. 47, 405–422 (2018a).
Arendarczyk, M., Kozubowski, T. J., Panorska, A. K.: The joint distribution of the sum and the maximum of dependent Pareto risks. J. Multivar. Anal. 167, 136–156 (2018b).
Aryal, T.: Inflated geometric distribution to study the distribution of rural out-migrants. J. Instit. Eng. 8(1), 266–268 (2011).
Barreto-Souza, W.: Bivariate gamma-geometric law and its induced Lévy process. J. Multivar. Anal. 109, 130–145 (2012).
Barreto-Souza, W., Silva, R. B.: A bivariate infinitely divisible law for modeling the magnitude and duration of monotone periods of log-returns. Statist. Neerlandica. 73, 211–233 (2019).
Biondi, F., Kozubowski, T. J., Panorska, A. K.: Stochastic modeling of regime shifts. Clim. Res. 23, 23–30 (2002).
Biondi, F., Kozubowski, T. J., Panorska, A. K.: A new model for quantifying climate episodes. Internat. J. Climatol. 25, 1253–1264 (2005).
Biondi, F., Kozubowski, T. J., Panorska, A. K., Saito, L: A new stochastic model of episode peak and duration for ecohydroclimatic applications. Ecol. Modell. 211, 383–395 (2008).
Cameron, A. C., Trivedi, P. K.: Regression Analysis of Count Data. Cambridge University Press, Cambridge (1998).
Cameron, A. C., Trivedi, P. K.: Microeconometrics: Methods and Applications. Cambridge University Press, Cambridge (2005).
Chipeta, M. G., Ngwira, B. M., Simoonga, C., Kazembe, L. N.: Zero adjusted models with applications to analyzing helminths count data. BMC Res Notes. 7, 856 (2014).
Constantinescu, C. D., Kozubowski, T. J., Qian, H. H.: Probability of ruin in discrete insurance risk model with dependent Pareto claims. Depend. Model. 7(1), 215–233 (2019).
Famoye, F., Singh, K. P.: Zero-inflated generalized Poisson regression model with an application to domestic violence data. J. Data Sci. 4, 117–130 (2006).
Gupta, D., Kundu, D.: Generalized exponential distribution: Existing results and some recent developments. J. Statist. Plann. Infer. 137(11), 3537–3547 (2007).
Hu, M. C., Pavlicova, M., Nunes, E. V.: Zero-inflated and hurdle models of count data with extra zeros: Examples from an HIV-risk reduction intervention trial. Am. J. Drug Alcohol Abuse. 37(5), 367–375 (2011).
Iwunor, C. C. O.: Estimation of parameters of the inflated geometric distribution for rural out-migration. Genus. 51(3/4), 253–260 (1995).
Johnson, N. L., Kotz, S., Balakrishnan, N.: Continuous Univariate Distributions, vol. 1, 2nd ed. Wiley, New York (1994).
Kozubowski, T. J., Panorska, A. K.: A mixed bivariate distribution with exponential and geometric marginals. J. Statist. Plann. Infer. 134, 501–520 (2005).
Kozubowski, T. J., Panorska, A. K.: A mixed bivariate distribution connected with geometric maxima of exponential variables. Comm. Statist. Theory Methods. 37, 2903–2923 (2008).
Kozubowski, T. J., Panorska, A. K., Biondi, F.: Mixed multivariate models for random sums and maxima. In: SenGupta, A. (ed.) Advances in Multivariate Statistical Methods, Statistical Science and Interdisciplinary Research, Vol. 4, pp. 145–171. World Scientific, Singapore (2008a).
Kozubowski, T. J., Panorska, A. K., Podgórski, K.: A bivariate Lévy process with negative binomial and gamma marginals. J. Multivar. Anal. 99, 1418–1437 (2008b).
Kozubowski, T. J., Panorska, A. K., Qeadan, F.: The distributions of the peak to average and peak to sum ratios under exponentiality. In: Wells, M. T., SenGupta, A. (eds.) Advances in Directional and Linear Statistics, Festschrift Volume for J.S. Rao, Physica-Verlag, Heidelberg, pp. 131–141 (2010).
Kozubowski, T. J., Panorska, A. K., Qeadan, F.: A new multivariate model involving geometric sums and maxima of exponentials. J. Statist. Plann. Infer. 141(7), 2353–2367 (2011).
Lambert, D.: Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics. 34, 1–14 (1992).
Marshall, A. W., Olkin, I.: A new method for adding a parameter to a family of distributions with application to the exponential and Weibull families. Biometrika. 84(3), 641–652 (1997).
Mullahy, J.: Specification and testing of some modified count data models. J. Econ. 33, 341–365 (1986).
Mullahy, J.: Heterogeneity, excess zeros, and the structure of count data models. J. Appl. Econ. 12(3), 337–350 (1997).
Pandey, H., Tiwari, R.: An inflated probability model for the rural out-migration. Recent Res. Sci. Tech. 3(7), 100–103 (2011).
Panicha, K.: Capture-recapture estimation and modelling for one-inflated count data. Dissertation, University of Southampton (2018).
Qeadan, F., Kozubowski, T. J., Panorska, A. K.: The joint distribution of the sum and the maximum of n i.i.d. exponential random variables. Comm. Statist. Theory Methods. 41(3), 544–569 (2012).
Sharma, A. K., Landge, V. S.: Zero inflated negative binomial for modeling heavy vehicle crash rate on Indian rural highway. Internat. J. Adv. Eng. Tech. 5(2), 292–301 (2013).
Tüzen, M. F., Erbaş, S.: A comparison of count data models with an application to daily cigarette consumption of young persons. Comm. Statist. Theory Methods. 47(23), 5825–5844 (2018).
Zeileis, A., Kleiber, C., Jackman, S.: Regression models for count data in R. J. Statist. Softw. 27(8), 1–25 (2008).
Zelterman, D.: Discrete Distributions, Applications in the Health Sciences. Wiley, New Jersey (2004).
Zuur, A. F., Ieno, E. N., Walker, N. J., Saveliev, A. A., Smith, G. M.: Mixed effects models and extensions in ecology with R. Statistics for Biology and Health. Springer Science and Business Media (2009). https://doi.org/10.1007/978038787458611.
Acknowledgements
The authors thank Ms. Ilaria Vinci for helpful discussions and help with checking tedious calculations. We also thank the two reviewers for their thoughtful and constructive comments, which helped us improve this manuscript.
Funding
NA.
Author information
Contributions
FZ is a PhD student, who derived the majority of the results, performed the simulations and data analysis, and produced all figures. AKP is his adviser, who reviewed all this work from inception to the end of the manuscript writing, and contributed to some of the results. TJK is a research collaborator who contributed to some of the results and to the revision of the paper. The authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
Supplementary material for “A new trivariate model for stochastic episodes”.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zuniga, F., Kozubowski, T.J. & Panorska, A.K. A new trivariate model for stochastic episodes. J Stat Distrib App 8, 2 (2021). https://doi.org/10.1186/s40488-021-00114-3
Keywords
 BEG model
 BGGE distribution
 BTLG distribution
 Extremes
 Financial data
 Geometric distribution
 Maximum likelihood estimation
 Random sum
 Stochastic representation
 TETLG model
 Zero-altered distribution
 Zero-inflated distribution
Classification codes
 60E05
 60G50
 60G70
 62E15
 62H05
 62H12
 62H15
 62P05
 62P15