The unifed distribution

Quijano Xacur, Oscar Alberto

doi:10.1186/s40488-019-0102-6

Research
Open access
Published: 05 November 2019


Journal of Statistical Distributions and Applications volume 6, Article number: 13 (2019)


Abstract

We introduce a new distribution with support on (0,1) called unifed. It can be used as the response distribution for a GLM and it is suitable for data aggregation. We make a comparison to the beta regression. A link to an R package for working with the unifed is provided.

Introduction

We introduce the unifed distribution. It is a continuous distribution with support on the interval (0,1). It can be characterized as the only exponential dispersion family containing the uniform distribution. This makes it suitable to be used as the response variable of a Generalized Linear Model (GLM).

An R package (see (R Core Team 2017) and (Quijano Xacur 2019b)) has been developed to work with this distribution. It is called unifed and contains functions for the density, distribution function, quantile function and random number generation. It also contains a family that can be used within the glm function of R. Additionally, the package provides Stan (Stan Development Team 2018) code for performing Bayesian analysis with the unifed, including a function for fitting Bayesian unifed GLMs. Information about the package and how to install it can be found at https://gitlab.com/oquijano/unifed.

This is not the only model for performing regression on the unit interval. The beta regression (see (Ferrari and Cribari-Neto 2004)) has existed for a while and it provides more flexible shapes than the unifed GLM. One appealing property of the unifed GLM is that it is suitable for data reduction while the beta regression is not. This is discussed in “On the difficulties of data aggregation for the beta regression” section.

This paper is divided into 4 sections. In “Exponential dispersion families and GLMs” section we review the definition and properties of exponential dispersion families and GLMs. “The unifed distribution” section defines the unifed distribution. In “An illustrative example” section we illustrate an application to an auto insurance claims example. “Comparison between the unifed GLM and the beta regression” section reviews the beta regression and underlines its differences from the unifed GLM.

Exponential dispersion families and GLMs

A reproductive Exponential Dispersion Family (EDF) is a set of distributions whose densities are given by

$$ f(y|\theta,\phi) = a(y,\phi)\exp\left(\frac{1}{\phi} \left\{y\theta - \kappa(\theta) \right\}\right),\qquad \theta\in\Theta, \phi\in\Phi. $$

(1)

θ and Θ are called the canonical parameter and canonical space, respectively, and ϕ is known as the dispersion parameter. For θ∈int(Θ) (here int stands for interior),

$$ \mathbb{E}[Y] = \dot{\kappa}(\theta)\qquad \text{and}\qquad \mathbb{V}[Y]=\phi\ddot{\kappa}(\theta), $$

(2)

where $\dot {\kappa }=\kappa ^{\prime }$ and $\ddot {\kappa }=\dot {\kappa }^{\prime }$. (Eq. 2) relates the mean and the variance of any EDF. This motivates the following definitions (see (Jørgensen 1997) or (Jørgensen 1992)).

Definition 1.

Given an exponential dispersion family, the mean domain of the family is defined as

$$ \Omega = \left\{\mu = \dot{\kappa}(\theta) : \theta \in \text{int}\left(\Theta\right) \right\}. $$

Definition 2.

The variance function of an EDF is defined as V:Ω→[0,∞) with

$$ \mathbf{V}(\mu) = (\ddot{\kappa} \circ \dot{\kappa}^{-1})(\mu). $$

Note that $\mathbb {V}[Y]=\phi \mathbf {V}(\mu)$. The support of the members of an EDF depends only on ϕ (and not on θ). For a given family, let C_ϕ be the convex support of any member of the family with dispersion parameter ϕ. We define the convex support of the family as

$$ C_{\Phi}=\bigcup_{\phi\in\Phi} C_{\phi} {.} $$

Definition 3.

The unit deviance function of an exponential dispersion family is defined as d:C_Φ×Ω→[0,∞) with

$$ d\left(y,\mu\right) = 2 \left[ \sup_{\theta\in\Theta}\{\theta y - \kappa (\theta)\} - y \dot{\kappa}^{-1}(\mu) + \kappa\big(\dot{\kappa}^{-1}(\mu)\big) \right]. $$

(3)

The unit deviance function allows (Eq. 1) to be reparametrized as

$$ f(y|\mu,\phi) = c(y,\phi)\exp\left(-\frac{1}{2\phi}d(y,\mu) \right). $$

(4)

This is known as the mean–value parametrization. When the canonical space Θ is open, the EDF is said to be regular. In this case C_Φ=Ω and (Eq. 3) is equivalent to

$$ d\left(y,\mu\right) = 2 \left[ y\{ \dot{\kappa}^{-1}(y) -\dot{\kappa}^{-1}(\mu) \} - \kappa\big(\dot{\kappa}^{-1}(y)\big) + \kappa\big(\dot{\kappa}^{-1}(\mu)\big) \right]. $$

(5)

Weights and data aggregation

In many applications it is useful to include a known positive weight to each observation. When this is done, the dispersion parameter is divided by the weight w, and (Eq. 1) and (Eq. 4) become respectively

$$\begin{array}{*{20}l} f(y|\theta,\phi) &= a(y,\phi/w)\exp\left(\frac{w}{\phi} \left\{y\theta - \kappa(\theta) \right\}\right),\quad \text{and}\\ f(y|\mu,\phi) &= c(y,\phi/w)\exp\left(-\frac{w}{2\phi}d(y,\mu) \right). \end{array} $$

(6)

There is a useful property of reproductive exponential dispersion families that allows for data aggregation. Jørgensen’s notation (from (Jørgensen 1997)) is very convenient to express this property: given a fixed exponential family, if Y has mean μ and density given by (Eq. 6), we say that it is ED(μ,ϕ/w) distributed. The property is then as follows: if Y₁,Y₂,⋯,Y_n are independent, and Y_i∼ED(μ,ϕ/w_i), then

$$ \bar{Y}=\frac{w_{1}Y_{1}+\cdots+w_{n}Y_{n}}{w_{+}}\sim ED(\mu,\phi/w_{+}),\qquad w_{+}=\sum_{i=1}^{n} w_{i}. $$

(7)
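The aggregation rule in (Eq. 7) can be sketched in a few lines of Python. This is only an illustration: the record layout `(key, y, w)` and the function name `aggregate` are assumptions made for the example, not part of any package mentioned in this paper.

```python
from collections import defaultdict


def aggregate(records):
    """Collapse (key, y, w) records into one (ybar, w_plus) pair per key.

    ybar is the weighted mean of Eq. 7 and w_plus is the total weight.
    """
    sums = defaultdict(lambda: [0.0, 0.0])  # key -> [sum of w*y, sum of w]
    for key, y, w in records:
        sums[key][0] += w * y
        sums[key][1] += w
    return {key: (wy / w_plus, w_plus) for key, (wy, w_plus) in sums.items()}


# Hypothetical records: the key stands for a combination of covariate values.
records = [
    (("F", 1), 0.20, 1.0),
    (("F", 1), 0.80, 3.0),
    (("M", 2), 0.50, 2.0),
]
print(aggregate(records))
```

Each aggregated row can then be treated as a single observation with weight w_+, which is what (Eq. 7) states.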

GLMs

In a GLM the response variable is assumed to follow an EDF with density

$$ f(y|\theta,\phi)=a(y,\phi)\exp\left(\frac{w}{\phi} \{y\theta-\kappa(\theta)\} \right). $$

(8)

Note that ϕ in (Eq. 1) corresponds to ϕ/w in (Eq. 8), which implies that the mean and variance can be expressed as $\mu =\kappa ^{\prime }(\theta)$ and $\sigma ^{2}=\phi \kappa ^{\prime \prime }(\theta)/w$, respectively. Here w≥0 is known as the weight. In applications w is usually known and ϕ needs to be estimated. It is further assumed that there is a vector of explanatory variables, also known as covariates, x=(x₁⋯x_p)^T, a vector of coefficients $\boldsymbol {\mathcal {B}}=(\mathcal {B}_{0}~\mathcal {B}_{1} \cdots \mathcal {B}_{p})^{T}$ and a function g known as the link function such that

$$ g(\mu)=\mathcal{B}_{0}+x_{1}\mathcal{B}_{1}+\cdots+x_{p}\mathcal{B}_{p}. $$

(9)

It is useful for further developments to express the canonical parameter θ in terms of the coefficients. Since $\mu = \kappa ^{\prime }(\theta) \equiv \dot {\kappa }(\theta)$ then:

$$\begin{array}{*{20}l} (g \circ \dot{\kappa}) (\theta) &= \mathcal{B}_{0}+x_{1}\mathcal{B}_{1}+\cdots+x_{p}\mathcal{B}_{p} \\ \theta&=(g \circ \dot{\kappa})^{-1}(\mathcal{B}_{0}+x_{1}\mathcal{B}_{1}+\cdots+x_{p}\mathcal{B}_{p}). \end{array} $$

(10)

The population can be divided into different classes according to the values of the explanatory variables. Thus, given a sample, we can group together all the observations that share the same values of the explanatory variables and aggregate them using (Eq. 7). It is important to mention that with this grouping there is no loss of information for estimating the mean since $\bar {Y}$ is a sufficient statistic for θ (but not for ϕ, thus some information is lost for the estimation of ϕ). In this sense we say that GLMs are suitable for data aggregation. At the end of “An illustrative example” section we illustrate this property with real data for a unifed GLM.

Possibly after aggregating, let m be the number of classes and θ∈Θ^m, where Θ^m={θ=(θ₁⋯θ_m)^T:θ₁,…,θ_m∈Θ} is the set of all possible values of the vector θ. The density of the sample can be expressed as

$$ f(\boldsymbol{y} |\boldsymbol{\theta},\phi) = A(\boldsymbol{y},\phi) \exp \left(\frac {\boldsymbol{y}^{T}W\boldsymbol{\theta} - \boldsymbol{1}^{T} W \boldsymbol{\kappa} (\boldsymbol{\theta})} {\phi} \right),\qquad \boldsymbol{y}\in\mathbb{R}^{m}, $$

(11)

where κ(θ)=(κ(θ₁)⋯κ(θ_m))^T, W=diag(w₁,⋯,w_m), with w_i being the sum of all the weights in the i-th class, 1=(1⋯1)^T and $A(\boldsymbol {y},\phi) = \prod _{i=1}^{m} a\big (y_{i}, \frac {\phi }{w_{i}}\big)$.

It is useful to reparameterize (Eq. 11) in terms of the mean vector μ instead of θ. Using the mean value parametrization (this is (Eq. 4) but substituting ϕ for ϕ/w), (Eq. 11) can be reparameterized as

$$ f(\boldsymbol{y}|\boldsymbol{\mu},\phi) = C(\boldsymbol{y},\phi)\exp \left(-\frac{1}{2\phi}D(\boldsymbol{y},\boldsymbol{\mu}) \right), $$

(12)

where $C(\boldsymbol {y},\phi)=\prod _{i=1}^{m}c(y_{i},\frac {\phi }{w_{i}})$, and D:Ω^m×Ω^m→[0,∞) with

$$ D(\boldsymbol{y},\boldsymbol{\mu})=\sum_{i=1}^{m}w_{i}d(y_{i},\mu_{i}), $$

(13)

Ω^m={(μ₁⋯μ_m)^T:μ₁,…,μ_m∈Ω}. D is called the deviance of the model. Note that finding the maximum likelihood estimator of $\boldsymbol {\mathcal {B}}$ is equivalent to finding what value of $\boldsymbol {\mathcal {B}}$ minimizes the deviance. For further details about the use and properties of the deviance see (Jørgensen 1992).

The unifed distribution

The unifed family is the Exponential Dispersion Family (EDF) generated by the uniform distribution (see Chapters 2 and 3 of (Jørgensen 1997) to see how an EDF can be generated from a moment generating function). We created the R package unifed (see (Quijano Xacur 2019b)) that includes functions to work with the unifed. In this section we make references to some functions in the package and we use this font format for those references.

To express the density of the unifed distribution we need the density of the sum of n independent uniform(0,1) random variables. This corresponds to the Irwin-Hall distribution (see (Johnson et al. 1995)) and its density function is

$$ h(y;n) = \frac{1}{(n-1)!}\sum_{k=0}^{\lfloor{y}\rfloor} (-1)^{k} \binom{n}{k} (y-k)^{n-1},\qquad y\in[0,n],n\in\mathbb{N}{.} $$

(14)

The canonical and index spaces of the unifed family are $\Theta = \mathbb {R}$ and $\Phi = \left \{1,\frac {1}{2},\frac {1}{3},\frac {1}{4}\ldots \right \} $, and the cumulant generator is

$$ \kappa(\theta)=\left\{ \begin{array}{ll} \log\left(\frac{e^{\theta}-1}{\theta}\right)& \text{if }\theta\neq 0\\ 0 & \text{if }\theta=0 \end{array} \right.. $$

(15)

The density of a unifed distribution with canonical parameter θ and dispersion parameter ϕ is

$$ f(x;\theta,\phi) = \frac{h(x/\phi;1/\phi)}{\phi}\exp\left(\frac{x\theta - \kappa(\theta)}{\phi}\right), $$

(16)

where h and κ are as in (Eq. 14) and (Eq. 15), respectively, and $x\in [0,1],\theta \in \mathbb {R}, \phi \in \left \{1,\frac {1}{2},\frac {1}{3},\ldots \right \}$. We denote the unifed distribution with canonical parameter θ and dispersion parameter ϕ by unifed(θ,ϕ).

The unifed package does not contain an implementation of (Eq. 16). This is because we did not find a numerically stable way to compute h. To show this, the package includes the function dirwin.hall that computes h. Table 1 shows the results we get by calling this function with n set to 50 and varying the values of y. The changes of sign indicate that a float overflow is happening.

Table 1 Float overflow of the Irwin-Hall implementation

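The alternating sum in (Eq. 14) combines large terms of opposite sign, which is a common source of inaccuracy in floating point. One possible workaround, which is not part of the unifed package and is offered here only as a sketch, is to evaluate the sum in exact rational arithmetic. The function names below are hypothetical, and the exact version trades speed for correctness.

```python
from fractions import Fraction
from math import comb, factorial, floor


def irwin_hall_pdf_float(y, n):
    """Eq. 14 evaluated term by term in double precision."""
    if y < 0 or y > n:
        return 0.0
    total = 0.0
    for k in range(floor(y) + 1):
        total += (-1) ** k * comb(n, k) * (y - k) ** (n - 1)
    return total / factorial(n - 1)


def irwin_hall_pdf_exact(y, n):
    """Eq. 14 in exact rational arithmetic; y is converted to a Fraction."""
    y = Fraction(y)
    if y < 0 or y > n:
        return Fraction(0)
    total = Fraction(0)
    for k in range(floor(y) + 1):
        total += (-1) ** k * comb(n, k) * (y - k) ** (n - 1)
    return total / factorial(n - 1)


# Compare both versions for n = 50 at a few points (strings give exact inputs).
for y in ("25", "45", "49.5"):
    print(y, irwin_hall_pdf_float(float(y), 50), float(irwin_hall_pdf_exact(y, 50)))
```

Since the Irwin-Hall density is symmetric about n/2, the exact value at y=49.5 for n=50 must equal the value at y=0.5, which gives a simple check of the exact version.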

In the package, the name unifed distribution refers to the one-parameter special case of (Eq. 16) with ϕ=1, which we denote by unifed(θ). In this case the density simplifies to

$$ f(x;\theta) = \left\{ \begin{array}{ll} \frac{\theta}{e^{\theta} - 1} e^{x \theta} & \text{if}\ \theta \neq 0\\ 1 & \text{if}\ \theta = 0 \end{array} \right. \quad \text{for}\ x \in (0,1). $$

(17)
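Integrating (Eq. 17) from 0 to x gives the distribution function (e^{xθ}−1)/(e^{θ}−1) for θ≠0, which can be inverted in closed form. The Python sketch below uses this to write a density, distribution function, quantile function and inverse-transform sampler. It is an independent illustration with hypothetical function names, it does not reproduce the R package code, and it makes no attempt to handle very large values of |θ|.

```python
import math
import random


def unifed_pdf(x, theta):
    """Density of Eq. 17; zero outside (0, 1)."""
    if not 0.0 < x < 1.0:
        return 0.0
    if theta == 0.0:
        return 1.0
    return theta * math.exp(x * theta) / math.expm1(theta)


def unifed_cdf(x, theta):
    """Integral of Eq. 17 from 0 to x."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    if theta == 0.0:
        return x
    return math.expm1(x * theta) / math.expm1(theta)


def unifed_quantile(p, theta):
    """Inverse of unifed_cdf for p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if theta == 0.0:
        return p
    return math.log1p(p * math.expm1(theta)) / theta


def unifed_sample(n, theta, seed=None):
    """Inverse-transform sampling: push uniform draws through the quantile."""
    rng = random.Random(seed)
    return [unifed_quantile(rng.random(), theta) for _ in range(n)]


print(unifed_cdf(unifed_quantile(0.25, 2.0), 2.0))  # round trip, close to 0.25
```

Using `expm1` and `log1p` keeps the formulas accurate when θ is close to zero, where e^{θ}−1 would otherwise lose digits.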

The functions dunifed, punifed, qunifed and runifed give the density, distribution function, quantile function and random generator, respectively, of this simplified version. The mean and variance of each element of the family are given by

$$\begin{array}{*{20}l} \mathbb{E}[X]&= \dot{\kappa}(\theta) = \left\{ \begin{array}{ll} \frac{(\theta-1)e^{\theta} + 1}{\theta(e^{\theta} -1)} & \text{if }\theta \neq 0\\ \frac{1}{2} & \text{if }\theta=0 \end{array}\right. {,} \qquad \end{array} $$

(18)

$$\begin{array}{*{20}l} \mathbb{V}[X]&= \ddot{\kappa}(\theta)=\left\{ \begin{array}{ll} \frac{ e^{2\theta} - (\theta^{2}+2)e^{\theta} + 1} {\theta^{2} (e^{\theta}-1)^{2}} & \text{if }\theta \neq 0 \\ \frac{1}{12} & \text{if }\theta=0 \end{array} \right., \end{array} $$

(19)

where $\dot {\kappa }$ and $\ddot {\kappa }$ are the first and second derivatives of κ, respectively. We have not been able to find an analytical expression for the inverse function $\dot {\kappa }^{-1}$. Thus, it has not been possible either to find analytical expressions for the variance function and unit deviance of the unifed. Nevertheless, the unifed package contains the function unifed.kappa.prime.inverse that uses the Newton-Raphson method to implement the inverse of $\dot {\kappa }$. This allows us to get a numerical solution for the variance function by using the relation $\mathbf {V}(\mu) = \ddot {\kappa } (\dot {\kappa }^{-1}(\mu))$. This is implemented in the function unifed.varf.

Similarly, since the unifed is a regular EDF (see Chapter 2 of (Jørgensen 1997)), we can compute the unit deviance by using the relation

$$ d(y,\mu)=2\left[y\{\dot{\kappa}^{-1}(y)-\dot{\kappa}^{-1}(\mu)\}-\kappa(\dot{\kappa}^{-1}(y))+\kappa(\dot{\kappa}^{-1}(\mu))\right]. $$

(20)

The function unifed.unit.deviance computes the unit deviance using (Eq. 20). As mentioned in “Exponential dispersion families and GLMs” section, the unit deviance can be used to reparametrize the distribution in terms of its mean and dispersion parameter. We denote by $\text {unifed}^{\ast }(\mu,\phi)$ the unifed distribution with mean μ and dispersion parameter ϕ, and when ϕ=1 we simply write $\text {unifed}^{\ast }(\mu)$.
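The numerical route described above (invert $\dot {\kappa }$ by Newton-Raphson, then plug the result into the variance function and (Eq. 20)) can be sketched in Python as follows. This is not the code of the unifed package: the function names, the Taylor expansions used near θ=0, the starting point θ=0 and the tolerances are all choices made for this illustration. The derivatives are written as 1/(1−e^{−θ})−1/θ and 1/θ²−e^{−θ}/(1−e^{−θ})², which are algebraically equivalent forms of the first and second derivatives of κ in (Eq. 15).

```python
import math


def kappa(theta):
    """Cumulant generator of Eq. 15, arranged to avoid overflow."""
    if abs(theta) < 1e-2:  # Taylor series around theta = 0
        return theta / 2 + theta**2 / 24 - theta**4 / 2880
    if theta > 0:
        return theta + math.log(-math.expm1(-theta) / theta)
    return math.log(math.expm1(theta) / theta)


def kappa_prime(theta):
    """First derivative of kappa, i.e. the mean of unifed(theta)."""
    if abs(theta) < 1e-2:
        return 0.5 + theta / 12 - theta**3 / 720 + theta**5 / 30240
    if theta < 0:  # kappa' - 1/2 is an odd function
        return 1.0 - kappa_prime(-theta)
    return -1.0 / math.expm1(-theta) - 1.0 / theta


def kappa_double_prime(theta):
    """Second derivative of kappa, i.e. the variance of unifed(theta)."""
    t = abs(theta)  # kappa'' is an even function
    if t < 1e-2:
        return 1 / 12 - t**2 / 240 + t**4 / 6048
    e = math.exp(-t)
    return 1.0 / t**2 - e / (1.0 - e) ** 2


def kappa_prime_inverse(mu, tol=1e-10, max_iter=200):
    """Solve kappa_prime(theta) = mu for 0 < mu < 1 by Newton-Raphson."""
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must lie in (0, 1)")
    theta = 0.0
    for _ in range(max_iter):
        step = (kappa_prime(theta) - mu) / kappa_double_prime(theta)
        theta -= step
        if abs(step) <= tol * (1.0 + abs(theta)):
            return theta
    raise RuntimeError("Newton-Raphson did not converge")


def unifed_variance_function(mu):
    """V(mu) = kappa''(kappa'^{-1}(mu))."""
    return kappa_double_prime(kappa_prime_inverse(mu))


def unifed_unit_deviance(y, mu):
    """Unit deviance of Eq. 20 for y and mu in (0, 1)."""
    ty, tm = kappa_prime_inverse(y), kappa_prime_inverse(mu)
    return 2.0 * (y * (ty - tm) - kappa(ty) + kappa(tm))


for mu in (0.1, 0.5, 0.9):
    print(mu, kappa_prime_inverse(mu), unifed_variance_function(mu))
```

Means very close to 0 or 1 correspond to very large |θ|, where the inversion becomes ill-conditioned; the sketch is only meant for moderate values of μ.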

Figure 1 shows plots of the unifed distribution for different values of its mean. We can see that except for μ=0.5, it is always monotone. For μ<0.5 it is strictly decreasing and the mode is at zero. For μ>0.5 it is strictly increasing and the mode is at one. The R code used for producing this plot can be found in (Quijano Xacur 2018).

Maximum likelihood estimation

Suppose you have an independent and identically distributed sample X₁,…,X_n coming from a unifed(θ) distribution and you want to compute the maximum likelihood estimator (mle) $\hat {\theta }$ of θ. The derivative of the log-likelihood function is given by

$$\begin{array}{*{20}l} \ell^{\prime}(\theta|X_{1},\ldots,X_{n})&= n \frac{(1-\theta)e^{\theta} - 1}{\theta(e^{\theta} - 1)} + \sum_{i=1}^{n} X_{i} \\ &= - n \dot{\kappa}(\theta) + \sum_{i=1}^{n} X_{i}. \end{array} $$

Making the expression above equal to zero and solving for θ, the mle for θ is given by

$$ \hat{\theta} = \dot{\kappa}^{-1}\left(\bar{X}\right), $$

(21)

where $\bar {X}=\sum _{i=1}^{n}X_{i} / n$. The function unifed.mle in the unifed R package computes the mle using (Eq. 21).

It is possible to use the unifed distribution as the response distribution of a GLM. In this case, ϕ must be fixed to one and the weight of each class is the number of observations in the class. The mle $\hat {\mathcal {B}}$ of the regression coefficients can be found using iterative weighted least squares. Section 2.5 of (McCullagh and Nelder 1989) shows that this method works for any response distribution whose density can be expressed as (Eq. 8). Thus, the method also works for the unifed. The unifed R package (Quijano Xacur 2019b) provides the function unifed that returns a family object that can be used inside the glm function.
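As a rough illustration of iterative weighted least squares for a unifed GLM, the following Python sketch fits the coefficients with working weights w(dμ/dη)²/V(μ) and working response η+(y−μ)/(dμ/dη). Several things here are assumptions made for the example rather than facts about the R package: the logit link, the function names, the starting values (μ=y, which requires 0<y<1) and the stopping rule. The helper functions for the derivatives of κ are included so that the block runs on its own.

```python
import math


def kappa_prime(t):
    """Mean of unifed(theta); kappa' - 1/2 is odd, which avoids overflow."""
    if abs(t) < 1e-2:
        return 0.5 + t / 12 - t**3 / 720 + t**5 / 30240
    return 1.0 - kappa_prime(-t) if t < 0 else -1.0 / math.expm1(-t) - 1.0 / t


def kappa_double_prime(t):
    """Variance of unifed(theta); an even function of theta."""
    t = abs(t)
    if t < 1e-2:
        return 1 / 12 - t**2 / 240 + t**4 / 6048
    e = math.exp(-t)
    return 1.0 / t**2 - e / (1.0 - e) ** 2


def variance_function(mu):
    """V(mu) = kappa''(kappa'^{-1}(mu)), inverting kappa' by Newton-Raphson."""
    t = 0.0
    for _ in range(200):
        step = (kappa_prime(t) - mu) / kappa_double_prime(t)
        t -= step
        if abs(step) <= 1e-10 * (1.0 + abs(t)):
            break
    return kappa_double_prime(t)


def solve(a, b):
    """Solve the small system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c in range(i, n + 1):
                m[r][c] -= f * m[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def fit_unifed_glm(X, y, w, n_iter=50, tol=1e-10):
    """IRLS with a logit link; rows of X must include the intercept column."""
    n, p = len(y), len(X[0])
    eta = [math.log(yi / (1.0 - yi)) for yi in y]  # start from mu = y
    beta = None
    for _ in range(n_iter):
        mu = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        dmu = [m * (1.0 - m) for m in mu]  # d mu / d eta for the logit link
        wk = [wi * d * d / variance_function(m) for wi, d, m in zip(w, dmu, mu)]
        z = [e + (yi - m) / d for e, yi, m, d in zip(eta, y, mu, dmu)]
        xtwx = [[sum(wk[i] * X[i][j] * X[i][k] for i in range(n))
                 for k in range(p)] for j in range(p)]
        xtwz = [sum(wk[i] * X[i][j] * z[i] for i in range(n)) for j in range(p)]
        new_beta = solve(xtwx, xtwz)
        eta = [sum(b * x for b, x in zip(new_beta, row)) for row in X]
        if beta is not None and max(abs(a - b) for a, b in zip(new_beta, beta)) < tol:
            return new_beta
        beta = new_beta
    return beta


# Raw observations (weight 1 each) versus the same data aggregated by class.
raw = fit_unifed_glm([[1, 0], [1, 0], [1, 1], [1, 1], [1, 1]],
                     [0.2, 0.4, 0.6, 0.7, 0.8], [1.0] * 5)
agg = fit_unifed_glm([[1, 0], [1, 1]], [0.3, 0.7], [2.0, 3.0])
print(raw, agg)
```

In this small example each class is replaced by its mean with weight equal to the class size, following (Eq. 7), and the two fits are expected to agree up to the tolerance of the iteration.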

An illustrative example

In this section we apply a unifed GLM to a publicly available dataset. The data appears in (de Jong and Heller 2008). It is based on 67,856 one–year auto insurance policies from 2004 or 2005. The dataset can be downloaded from the companion site of the book (see (de Jong and Heller 2008)). Table 2 shows the description of the variables as provided at the website.

Table 2 Vehicle insurance variables


We are interested in modeling the exposure, which is the proportion of the year during which the insurance policy is in force for a given client. We use gender, agecat, area and veh_age as the explanatory variables.

The R code used to obtain the results that follow can be found in (Quijano Xacur 2019a).

The data was aggregated using (Eq. 7) and a unifed GLM was fit to it. Table 3 (exported from R using the package xtable (Dahl et al. 2018)) shows the summary provided by the glm function of R. We see that all the variables included have at least one significant class.

Table 3 Summary of Unifed GLM


A χ² test for goodness of fit is commonly used for GLMs. The null hypothesis is that the data is distributed according to the fitted GLM. Under the null hypothesis, the residual deviance reported at the bottom of Table 3 follows a χ² distribution with 273 degrees of freedom. The p-value for this example is $\mathbb {P}(\chi _{273}^{2}\ge 297.86)=0.14$. A caveat of this test is that the χ² distribution of the residual deviance is an asymptotic result that holds as the smallest weight among all classes goes to infinity (see (Jørgensen 1992, Section 3.6)). The smallest observed weight here is 4 and it corresponds to the class with gender=F, agecat=6, area=F and veh_age=1. Therefore the χ² test for this example is not reliable.

Figure 2 shows the deviance residuals of this model. It suggests a good fit since they do not show any apparent pattern.

Verifying data aggregation:

We now fit the same model as in the previous section but without aggregating the data. Table 4 shows the summary of the model from R. The code used to generate this table can be found in (Quijano Xacur 2019a).

Table 4 Summary of Unifed GLM without Data Aggregation


By comparing Tables 3 and 4 one can see that the estimated coefficients are the same in both cases. Thus, even though the deviances of the two models differ, they give the same mle for the coefficients. This illustrates what we mean by data aggregation.

Comparison between the unifed GLM and the beta regression

The beta regression (Ferrari and Cribari-Neto 2004) is a versatile model for applications with a response variable on the unit interval. Moreover, the well documented R package betareg (Cribari-Neto and Zeileis 2010) makes it a practical tool in many applications.

The beta regression

The density of the beta distribution contains a large variety of shapes. In (Ferrari and Cribari-Neto 2004) the beta density is reparameterized as

$$ f(y) = \frac{\Gamma(\phi)}{\Gamma(\mu \phi) \Gamma((1 - \mu) \phi) } y^{\mu\phi -1} (1 - y)^{(1-\mu)\phi-1 },\qquad 0 < y < 1, $$

(22)

with 0<μ<1 and ϕ>0, and the distribution is denoted by $\mathcal {B}(\mu,\phi)$. Under this parametrization, if $Y \sim \mathcal {B}(\mu,\phi)$, the mean and variance are

$$ \mathbb{E}[Y] = \mu \quad \text{and} \quad \mathbb{V}[Y]=\frac{\mu (1-\mu)}{1+\phi}. $$

(23)
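The parametrization in (Eq. 22) corresponds to the usual beta shape parameters μϕ and (1−μ)ϕ. The short Python sketch below, with hypothetical function names, evaluates the density on the log scale and returns the moments of (Eq. 23), so that the two can be compared numerically.

```python
import math


def beta_shapes(mu, phi):
    """Map the (mu, phi) parametrization of Eq. 22 to the usual shape parameters."""
    return mu * phi, (1.0 - mu) * phi


def beta_pdf(y, mu, phi):
    """Density of Eq. 22 for 0 < y < 1, evaluated on the log scale."""
    a, b = beta_shapes(mu, phi)
    log_norm = math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log1p(-y))


def beta_mean_var(mu, phi):
    """Mean and variance of Eq. 23."""
    return mu, mu * (1.0 - mu) / (1.0 + phi)


# Midpoint-rule check of the moments for mu = 0.3 and phi = 10.
m = 20000
grid = [(i + 0.5) / m for i in range(m)]
dens = [beta_pdf(y, 0.3, 10.0) for y in grid]
mean = sum(y * d for y, d in zip(grid, dens)) / m
var = sum((y - mean) ** 2 * d for y, d in zip(grid, dens)) / m
print(mean, var, beta_mean_var(0.3, 10.0))
```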

Here ϕ is called the precision parameter of the distribution. In the beta regression model it is assumed that the response variable is a vector Y=(Y₁,…,Y_m), in which $Y_{i}\sim \mathcal {B}(\mu _{i},\phi)$ for i=1,…,m. The $Y_{i}$'s are assumed to be independent of each other. The explanatory variables are incorporated into the model through the relation

$$ g(\mu_{i}) = \boldsymbol{x_{i}}^{T}\boldsymbol{\mathcal{B}}, $$

where $\boldsymbol {\mathcal {B}}$ is a vector of parameters and x_i is a vector of regressors. $g:(0,1)\rightarrow \mathbb {R}$ is invertible and is called the link function.

This model was then generalized in (Simas et al. 2010) to allow the precision parameter to vary among classes, in a similar way to the double generalized linear models (see (Smyth and Verbyla 1999)). More specifically, in this case the response vector Y=(Y₁,…,Y_m) is such that $Y_{i}\sim \mathcal {B}(\mu _{i},\phi _{i})$, independently and

$$\begin{array}{*{20}l} g_{1}(\mu_{i}) &= \boldsymbol{x_{i}}^{T}\boldsymbol{\mathcal{B}}, \\ g_{2}(\phi_{i}) &= \boldsymbol{z_{i}}^{T}\boldsymbol{\gamma}, \end{array} $$

where $\boldsymbol {\mathcal {B}}$ and γ are regression coefficients.

These regression models offer great flexibility when the response variable lies in the interval (0,1), and both are implemented in the R package betareg (see (R Core Team 2017) and (Cribari-Neto and Zeileis 2010)).

The beta distribution is not an EDF and therefore the beta regression is not a GLM. Nevertheless, the parametrization chosen by the authors of the model, along with (Eq. 23), gives it a similar look and feel.

On the difficulties of data aggregation for the beta regression

Data aggregation gives a practical advantage when working with large datasets. For GLMs this is straightforward due to two properties of $\bar {Y}$ in (Eq. 7):

$\bar {Y}$ is a sufficient statistic for μ
The distribution of $\bar {Y}$ belongs to the same family as the Y_i’s in (Eq. 7).

We do not know of any statistic with these two properties for the beta distribution. For instance, let Y₁,…,Y_n be an i.i.d. sample from a $\mathcal {B}(\mu,\phi)$ distribution. The joint likelihood function of this sample is then

$$ f(\boldsymbol{y}) = \left(\frac{\Gamma(\phi)}{\Gamma(\mu \phi) \Gamma((1 - \mu) \phi) }\right)^{n} \left(\prod_{i=1}^{n}y_{i}\right)^{\mu \phi -1} \left(\prod_{i=1}^{n}(1 - y_{i})\right)^{(1-\mu)\phi-1 }, $$

where y=(y₁,…,y_n). This density can be rearranged as follows

$$ f(\boldsymbol{y}) = \left(\frac{\Gamma(\phi)}{\Gamma(\mu \phi) \Gamma((1 - \mu) \phi) }\right)^{n} \left[\prod_{i=1}^{n}\frac{(1-y_{i})^{\phi-1}}{ y_{i}}\right] \left(\prod_{i=1}^{n} \frac{y_{i}}{1-y_{i}} \right)^{\mu\phi} $$

The factorization theorem (see (Hogg et al. 2019, Chapter 7)) implies that $T=\prod _{i=1}^{n} \frac {y_{i}}{1-y_{i}}$ is sufficient for μ. Now, the distribution of T, which is not beta, would be needed to use T for data aggregation. In other words, a regression model whose response distribution is a family that includes the distribution of T for every n would need to be developed.
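The role of T can be checked numerically. In the Python sketch below (function names and sample values are made up for the illustration), two different samples of the same size share the same value of T, and for a fixed ϕ the difference in log-likelihood between two values of μ is the same for both samples, even though their sample means differ.

```python
import math


def beta_loglik(ys, mu, phi):
    """Log-likelihood of an i.i.d. sample under the density of Eq. 22."""
    a, b = mu * phi, (1.0 - mu) * phi
    log_norm = math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
    return sum(log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log1p(-y)
               for y in ys)


def log_odds_product(ys):
    """log T, where T is the product of y_i / (1 - y_i)."""
    return sum(math.log(y / (1.0 - y)) for y in ys)


def loglik_difference(ys, mu1, mu2, phi):
    """Log-likelihood difference between two values of mu at a fixed phi."""
    return beta_loglik(ys, mu1, phi) - beta_loglik(ys, mu2, phi)


sample_a = [2 / 3, 3 / 4]  # odds 2 and 3, so T = 6
sample_b = [1 / 2, 6 / 7]  # odds 1 and 6, so T = 6

print(log_odds_product(sample_a), log_odds_product(sample_b))
print(loglik_difference(sample_a, 0.4, 0.7, 5.0),
      loglik_difference(sample_b, 0.4, 0.7, 5.0))
print(sum(sample_a) / 2, sum(sample_b) / 2)  # the sample means differ
```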

Differences between the unifed GLM and the beta regression

The unifed density does not have the variety of shapes that the beta density has. To see this, compare the shapes shown in Figure 1 with the shapes for the beta distribution shown in Figure 3 (Quijano Xacur 2019a). Thus, the beta regression is able to adapt to more shapes than a unifed GLM and even more so if regressors are used for the dispersion parameter.

In those cases where a beta regression and a unifed GLM give a similarly good fit, the parsimony principle suggests picking the unifed GLM, since it has one fewer parameter: the dispersion parameter is known for the unifed GLM.

From a numerical point of view, the unifed GLM has the advantage that it is possible to use (Eq. 7) for data reduction. This is a practical advantage when dealing with large datasets, especially if simulations of the response vector need to be performed.

Conclusion

This paper introduced a new distribution called unifed. It is the Exponential Dispersion Family generated by the uniform distribution. It allows a GLM to be fitted for responses on the unit interval (0,1). An R package for working with this distribution is provided.

We made a comparison to the beta regression, which is another regression model for responses on the unit interval. It provides more flexible shapes and therefore it can give better fit than a unifed GLM in many situations. In contrast, the unifed GLM is suitable for data aggregation which is a practical advantage when working with large datasets.

An application using publicly available data was presented.

Availability of data and material

The data used for the example in this article is publicly available and it can be downloaded from www.businessandeconomics.mq.edu.au/our_departments/Applied_Finance_and_Actuarial_Studies/acst_docs/glms_for_insurance_data/data/car.csv.

Abbreviations

EDF: Exponential dispersion family
GLM: Generalized linear model
mle: Maximum likelihood estimator

References

Cribari-Neto, F., Zeileis, A.: Beta regression in R. J. Stat. Softw. 34(2), 1–24 (2010).
Dahl, D. B., Scott, D., Roosen, C., Magnusson, A., Swinton, J.: Xtable: Export Tables to LaTeX or HTML. R package version 1.8-3. (2018). https://CRAN.R-project.org/package=xtable. Accessed Mar 2019.
de Jong, P., Heller, G. Z.: Generalized Linear Models for Insurance Data. Cambridge University Press (2008). Companion website: http://www.acst.mq.edu.au/GLMsforInsuranceData. https://doi.org/10.1017/CBO9780511755408.
Ferrari, S., Cribari-Neto, F.: Beta regression for modelling rates and proportions. J. Appl. Stat. 31(7), 799–815 (2004). https://doi.org/10.1080/0266476042000214501.
Hogg, R. V., McKean, J. W., Craig, A. T.: Introduction to Mathematical Statistics. 8th. Pearson, Boston (2019).
Johnson, N. L., Kotz, S., Balakrishnan, N.: Continuous Univariate Distributions, Vol. 2. Wiley & Sons, New York (1995).
Jørgensen, B.: The Theory of Exponential Dispersion Models and Analysis of Deviance. Instituto de Matemática Pura e Aplicada, (IMPA), Brazil (1992).
Jørgensen, B.: The Theory of Dispersion Models. Chapman & Hall, London (1997).
McCullagh, P., Nelder, J. A.: Generalized Linear Models. 2nd. Chapman and Hall, London New York (1989).
Quijano Xacur, O. A.: Beta Density Plot. Code Snippet (2019). https://gitlab.com/oquijano/unifed/snippets/1880287. Accessed Jul 2019.
Quijano Xacur, O. A.: Unifed Density Plot. Code Snippet (2018). https://gitlab.com/oquijano/unifed/snippets/1786224. Accessed Jul 2019.
Quijano Xacur, O. A.: Vehicle Insurance Example. Code Snippet (2019a). https://gitlab.com/oquijano/unifed/snippets/1786226. Accessed Jul 2019.
Quijano Xacur, O. A.: unifed. R package version 1.1.0 (2019b). https://CRAN.R-project.org/package=unifed. Accessed Jul 2019.
R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2017). https://www.R-project.org/.
Simas, A. B., Barreto-Souza, W., Rocha, A. V.: Improved estimators for a general class of beta regression models. Comput. Stat. Data Anal. 54(2), 348–366 (2010). https://doi.org/10.1016/j.csda.2009.08.017.
Smyth, G. K., Verbyla, A. P.: Double generalized linear models: approximate REML and diagnostics. Proceedings of the 14th International Workshop on Statistical Modelling, 66–80 (1999). https://pdfs.semanticscholar.org/3fd5/fb7ee7e6991d0e6e2f50dacc80283a4701b1.pdf.
Stan Development Team: RStan: the R interface to Stan. R package version 2.18.2 (2018). http://mc-stan.org/.

Acknowledgements

Not applicable.

Funding

Not applicable.

Author information

Authors and Affiliations

Concordia University, Montreal, Canada
Oscar Alberto Quijano Xacur


Contributions

All contributions were made by the author of the article, Oscar Alberto Quijano Xacur.

Corresponding author

Correspondence to Oscar Alberto Quijano Xacur.

Ethics declarations

Competing interests

The author declares that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Cite this article

Quijano Xacur, O. The unifed distribution. J Stat Distrib App 6, 13 (2019). https://doi.org/10.1186/s40488-019-0102-6


Received: 23 May 2019
Accepted: 24 September 2019
Published: 05 November 2019
DOI: https://doi.org/10.1186/s40488-019-0102-6
