The Unifed Distribution

We introduce a new distribution with support on (0,1) called unifed. It can be used as the response distribution for a GLM and it is suitable for data aggregation. We make a comparison to the beta regression. A link to an R package for working with the unifed is provided.


Introduction
We introduce the unifed distribution. It is a continuous distribution with support on the interval (0,1). It can be characterized as the only exponential dispersion family containing the uniform distribution. This makes it suitable to be used as the response variable of a Generalized Linear Model (GLM).
An R (see [10] and [12]) package has been developed to work with this distribution. It is called unifed and contains functions for the density, distribution, quantiles and random generator of this distribution. A family that can be used within the glm function of R is also provided. Information about the package and how to install it can be found at https://gitlab.com/oquijano/unifed.
This is not the only model for performing regression on the unit interval. The beta regression (see [4]) has existed for a while and it provides more flexible shapes than the unifed GLM. One appealing property of the unifed GLM is that it is suitable for data reduction while the beta regression is not. This is discussed in Section 4.2.
This paper is divided into four sections. In Section 1 we review the definition and properties of exponential dispersion families and GLMs. Section 2 defines the unifed distribution. In Section 3 we illustrate an application to an auto insurance claims example. Section 4 reviews the beta regression and underlines its differences from the unifed GLM.

Exponential Dispersion Families and GLMs
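A reproductive Exponential Dispersion Family (EDF) is a set of distributions whose densities are given by
$$f(y;\theta,\phi) = a(y,\phi)\exp\left\{\frac{\theta y-\kappa(\theta)}{\phi}\right\},\qquad \theta\in\Theta\subseteq\mathbb{R},\ \phi\in\Phi\subseteq\mathbb{R}^{+}. \tag{1}$$
θ and Θ are called the canonical parameter and canonical space, respectively, and φ is known as the dispersion parameter. For θ ∈ int(Θ) (here int stands for interior), the mean and variance are
$$E[Y]=\dot\kappa(\theta),\qquad \mathrm{Var}(Y)=\phi\,\ddot\kappa(\theta),$$
where $\dot\kappa = \kappa'$ and $\ddot\kappa = \kappa''$. This motivates the following definitions (see [8] or [7]).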
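Definition 1.1. Given an exponential dispersion family, the mean domain of the family is defined as
$$\Omega = \{\mu = \dot\kappa(\theta) : \theta\in\mathrm{int}(\Theta)\}.$$
Another important property is that the support of the distribution only depends on φ (and not on θ). For a given family, let $C_\phi$ be the convex support of any member of the family with dispersion parameter φ. We define the convex support of the family as
$$C_\Phi = \bigcup_{\phi\in\Phi} C_\phi.$$
Definition 1.2. The unit deviance function of an exponential dispersion family is defined as $d : C_\Phi\times\Omega\to\mathbb{R}$,
$$d(y,\mu) = 2\left[\sup_{\theta\in\Theta}\{\theta y-\kappa(\theta)\} - y\,\dot\kappa^{-1}(\mu) + \kappa(\dot\kappa^{-1}(\mu))\right]. \tag{3}$$
The unit deviance function allows us to reparametrize (1) as
$$f(y;\mu,\phi) = c(y,\phi)\exp\left\{-\frac{d(y,\mu)}{2\phi}\right\},\qquad \mu\in\Omega,\ \phi\in\Phi, \tag{4}$$
where $c(y,\phi) = a(y,\phi)\exp\{\sup_{\theta\in\Theta}[\theta y-\kappa(\theta)]/\phi\}$ does not depend on µ. This is known as the mean-value parametrization. When the canonical space Θ is open, the EDF is said to be regular. In this case $C_\Phi = \Omega$ and (3) is equivalent to
$$d(y,\mu) = 2\left[y\{\dot\kappa^{-1}(y)-\dot\kappa^{-1}(\mu)\} - \kappa(\dot\kappa^{-1}(y)) + \kappa(\dot\kappa^{-1}(\mu))\right].$$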

Weights and Data Aggregation
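In many applications it is useful to attach a known positive weight to each observation. When this is done, the dispersion parameter is divided by the weight w, and (1) and (4) become, respectively,
$$f(y;\theta,\phi,w) = a(y,\phi/w)\exp\left\{\frac{w}{\phi}\,[\theta y-\kappa(\theta)]\right\} \tag{5}$$
and
$$f(y;\mu,\phi,w) = c(y,\phi/w)\exp\left\{-\frac{w}{2\phi}\,d(y,\mu)\right\}, \tag{6}$$
where the factors $a(y,\phi/w)$ and $c(y,\phi/w)$ do not depend on θ or µ.
There is a useful property of reproductive exponential dispersion families that allows for data aggregation. Jørgensen's notation (from [8]) is very convenient to express this property: given a fixed exponential dispersion family, if Y has mean µ and density given by (6), we say that it is ED(µ, φ/w) distributed. The property is then as follows: if $Y_1, Y_2, \ldots, Y_n$ are independent and $Y_i \sim ED(\mu, \phi/w_i)$, then
$$\bar Y = \frac{1}{w_+}\sum_{i=1}^{n} w_i Y_i \sim ED(\mu, \phi/w_+),\qquad w_+ = \sum_{i=1}^{n} w_i. \tag{7}$$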
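The aggregation property (7) can be illustrated with a small Python sketch. It is an illustration only and is not part of the unifed package: the function name aggregate_by_class and the (class key, response, weight) record layout are assumptions made for this example. Observations that share a class key are replaced by their weighted mean, and the weight of the aggregated observation is the sum of the individual weights.

```python
from collections import defaultdict


def aggregate_by_class(records):
    """Aggregate (class_key, y, w) records into one (y_bar, w_plus) per class.

    y_bar is the weighted mean of the responses in the class and w_plus is
    the sum of their weights.
    """
    weight_sum = defaultdict(float)
    weighted_y_sum = defaultdict(float)
    for class_key, y, w in records:
        if w <= 0:
            raise ValueError("weights must be positive")
        weight_sum[class_key] += w
        weighted_y_sum[class_key] += w * y
    return {
        key: (weighted_y_sum[key] / weight_sum[key], weight_sum[key])
        for key in weight_sum
    }


# Hypothetical records: (covariate class, response in (0, 1), weight).
records = [
    (("F", 1), 0.20, 1.0),
    (("F", 1), 0.60, 1.0),
    (("M", 2), 0.50, 2.0),
    (("M", 2), 0.80, 1.0),
]
aggregated = aggregate_by_class(records)
```

Each value of aggregated is a pair (weighted mean, total weight), which is the form in which an aggregated class would enter a weighted GLM fit.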

GLMs
In a GLM the response variable is assumed to follow an EDF with density
$$f(y;\theta,\phi,w) = a(y,\phi/w)\exp\left\{\frac{w}{\phi}\,[y\theta-\kappa(\theta)]\right\}. \tag{8}$$
Note that φ in (1) corresponds to φ/w in (8), which implies that the mean and variance can be expressed as $\mu = \dot\kappa(\theta)$ and $\sigma^2 = \phi\,\ddot\kappa(\theta)/w$, respectively.
Here w ≥ 0 is known as the weight. In applications w is usually known and φ needs to be estimated. It is further assumed that there is a vector of explanatory variables, also known as covariates, $x = (x_1, \ldots, x_p)^T$, a vector of coefficients $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ and a function g, known as the link function, such that
$$g(\mu) = \beta_0 + x_1\beta_1 + \cdots + x_p\beta_p.$$
It is useful for further developments to express the canonical parameter θ in terms of the coefficients. Since $\mu = \kappa'(\theta) \equiv \dot\kappa(\theta)$, then
$$\theta = \dot\kappa^{-1}\left(g^{-1}(\beta_0 + x_1\beta_1 + \cdots + x_p\beta_p)\right).$$
The population can be divided into different classes according to the values of the explanatory variables. Thus, given a sample, we can group together all the observations that share the same values of the explanatory variables and aggregate them using (7). It is important to mention that with this grouping there is no loss of information for estimating the mean, since $\bar Y$ is a sufficient statistic for θ (but not for φ, thus some information is lost for the estimation of φ).
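Possibly after aggregating, let m be the number of classes and $\theta \in \Theta^m$, where $\Theta^m = \{\theta = (\theta_1, \ldots, \theta_m)^T : \theta_1, \ldots, \theta_m \in \Theta\}$ is the set of all possible values of the vector θ. The density of the sample can be expressed as
$$f(y;\theta,\phi) = A(y,\phi)\exp\left\{\frac{1}{\phi}\left[y^{T}W\theta - \mathbf{1}^{T}W\kappa(\theta)\right]\right\}, \tag{11}$$
where $\kappa(\theta) = (\kappa(\theta_1), \ldots, \kappa(\theta_m))^T$, $W = \mathrm{diag}(w_1, \ldots, w_m)$, with $w_i$ being the sum of all the weights in the i-th class, $\mathbf{1} = (1, \ldots, 1)^T$ and $A(y,\phi) = \prod_{i=1}^{m} a(y_i, \phi/w_i)$.
In order to express θ in terms of β, we define the following maps, which act componentwise on vectors,
$$\dot\kappa^{-1}(\mu) = \left(\dot\kappa^{-1}(\mu_1), \ldots, \dot\kappa^{-1}(\mu_m)\right)^T,\qquad g^{-1}(\eta) = \left(g^{-1}(\eta_1), \ldots, g^{-1}(\eta_m)\right)^T,$$
and the design matrix
$$X = \begin{pmatrix} 1 & x_1^T \\ \vdots & \vdots \\ 1 & x_m^T \end{pmatrix},$$
where $x_i$ is the vector of explanatory variables for the i-th class. Throughout this article we assume that X has full rank and that the model is not saturated (i.e. p + 1 < m). With all these definitions, we have that
$$\theta = \dot\kappa^{-1}\left(g^{-1}(X\beta)\right).$$
It is useful to reparameterize (11) in terms of the mean vector µ instead of θ. Using the mean-value parameterization (this is (4) with φ/w substituted for φ), (11) can be reparameterized as
$$f(y;\mu,\phi) = C(y,\phi)\exp\left\{-\frac{1}{2\phi}D(y,\mu)\right\},$$
where
$$C(y,\phi) = \prod_{i=1}^{m} c(y_i,\phi/w_i),\qquad D(y,\mu) = \sum_{i=1}^{m} w_i\, d(y_i,\mu_i),\qquad \mu\in\Omega^m,$$
and $\Omega^m = \{(\mu_1, \ldots, \mu_m)^T : \mu_1, \ldots, \mu_m \in \Omega\}$. D is called the deviance of the model. We give here some of its properties:
• Given a sample, finding the MLE of θ is equivalent to finding the value of β that minimizes the deviance.
• D can be used to estimate the dispersion parameter (although it is not the only method). The deviance estimator of φ is given by
$$\hat\phi = \frac{D(y,\hat\mu)}{m-p-1},$$
where $\hat\mu$ is the vector of fitted means.
• The asymptotic distribution of D plays an important role in model assessment and variable selection.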
For further details about the use and properties of the deviance see [7].

The Unifed Distribution
The unifed is the Exponential Dispersion Family (EDF) generated by the uniform distribution (see Chapters 2 and 3 of [8] to see how an EDF can be generated from a moment generating function). We created the R package unifed (see [12]) that includes functions to work with this distribution. In this section we make references to some functions in this package and we use this font format for those references.
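To express the density of the unifed we need the density of the sum of n independent uniform(0, 1) random variables. This corresponds to the Irwin-Hall distribution (see [6]) and its density function is
$$h(y;n) = \frac{1}{2(n-1)!}\sum_{k=0}^{n}(-1)^k\binom{n}{k}(y-k)^{n-1}\,\mathrm{sgn}(y-k), \tag{15}$$
where sgn is the sign function defined as
$$\mathrm{sgn}(x) = \begin{cases} -1 & \text{if } x<0,\\ 0 & \text{if } x=0,\\ 1 & \text{if } x>0. \end{cases}$$
The canonical and index spaces of the unifed family are $\Theta = \mathbb{R}$ and $\Phi = \{1, \tfrac12, \tfrac13, \tfrac14, \ldots\}$, and the cumulant generator is
$$\kappa(\theta) = \begin{cases} \log\left(\dfrac{e^{\theta}-1}{\theta}\right) & \text{if } \theta\neq 0,\\ 0 & \text{if } \theta=0. \end{cases} \tag{16}$$
The density of a unifed with canonical parameter θ and dispersion parameter φ is
$$f(x;\theta,\phi) = \frac{1}{\phi}\,h\left(\frac{x}{\phi};\frac{1}{\phi}\right)\exp\left\{\frac{x\theta-\kappa(\theta)}{\phi}\right\}, \tag{17}$$
where h and κ are as in (15) and (16), respectively, and $x\in[0,1]$, $\theta\in\mathbb{R}$, $\phi\in\{1, \tfrac12, \tfrac13, \ldots\}$.
The unifed package does not contain an implementation of (17). This is because we did not find a numerically stable way to compute h. To show this, the package includes the function dirwin.hall that computes h. Table 1 shows the results we get by calling this function with n set to 50 and varying the values of y. The changes of sign indicate that a floating-point overflow is happening.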
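The kind of numerical instability described above can be explored outside the package. The following Python sketch evaluates the Irwin-Hall density h twice: once term by term in double precision and once in exact rational arithmetic. It does not call dirwin.hall; the function names are assumptions made for this illustration. In this sketch any loss of accuracy comes from cancellation between large alternating terms, which may or may not be the mechanism at work inside the package.

```python
import math
from fractions import Fraction


def sgn(x):
    """Sign function: -1, 0 or 1."""
    return (x > 0) - (x < 0)


def irwin_hall_float(y, n):
    """Irwin-Hall density evaluated term by term in floating point."""
    total = 0.0
    for k in range(n + 1):
        term = math.comb(n, k) * float(y - k) ** (n - 1) * sgn(y - k)
        total += (-1) ** k * term
    return total / (2.0 * math.factorial(n - 1))


def irwin_hall_exact(y, n):
    """Same alternating sum in exact rational arithmetic."""
    y = Fraction(y)
    total = Fraction(0)
    for k in range(n + 1):
        term = math.comb(n, k) * (y - k) ** (n - 1) * sgn(y - k)
        total += (-1) ** k * term
    return total / (2 * math.factorial(n - 1))


# Compare both evaluations for n = 50 at a few points of (0, 50).
for y in (Fraction(1, 2), Fraction(5), Fraction(25)):
    print(float(y), irwin_hall_float(y, 50), float(irwin_hall_exact(y, 50)))
```

The exact version is always non-negative, so a negative value in the floating-point column signals that the double-precision sum has lost all significant digits. Exact arithmetic avoids the problem but is too slow to serve as a general-purpose density.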
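Because h cannot be computed reliably, what the package calls the unifed distribution is the one-parameter special case of (17) where φ = 1. This simplifies the density to
$$f(x;\theta) = \exp\{x\theta-\kappa(\theta)\} = \begin{cases} \dfrac{\theta e^{\theta x}}{e^{\theta}-1} & \text{if } \theta\neq 0,\\ 1 & \text{if } \theta=0, \end{cases}\qquad x\in[0,1].$$
The functions dunifed, punifed, qunifed and runifed give the density, distribution, quantile and simulation functions, respectively, of this simplified version. The mean and variance of each element of the family are given by
$$E[X] = \dot\kappa(\theta) = \begin{cases} \dfrac{e^{\theta}}{e^{\theta}-1}-\dfrac{1}{\theta} & \text{if } \theta\neq 0,\\ \dfrac12 & \text{if } \theta=0, \end{cases}$$
and
$$\mathrm{Var}(X) = \ddot\kappa(\theta) = \begin{cases} \dfrac{1}{\theta^{2}}-\dfrac{e^{\theta}}{(e^{\theta}-1)^{2}} & \text{if } \theta\neq 0,\\ \dfrac{1}{12} & \text{if } \theta=0, \end{cases}$$
where $\dot\kappa$ and $\ddot\kappa$ are the first and second derivatives of κ, respectively. We have not been able to find an analytical expression for the inverse function $\dot\kappa^{-1}$.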
Thus, it has not been possible to find analytical expressions for the variance function and the unit deviance of the unifed either. Nevertheless, the unifed package contains the function unifed.kappa.prime.inverse, which uses the Newton-Raphson method to implement the inverse of $\dot\kappa$. This allows us to get a numerical solution for the variance function by using the relation $V(\mu) = \ddot\kappa(\dot\kappa^{-1}(\mu))$. This is implemented in the function unifed.varf. Similarly, since the unifed is a regular EDF (see Chapter 2 of [8]), we can compute the unit deviance by using the relation
$$d(y,\mu) = 2\left[y\{\dot\kappa^{-1}(y)-\dot\kappa^{-1}(\mu)\} - \kappa(\dot\kappa^{-1}(y)) + \kappa(\dot\kappa^{-1}(\mu))\right]. \tag{21}$$
The function unifed.unit.deviance computes the unit deviance using (21).
Figure 1 shows plots of the unifed density for different values of its mean. We can see that, except for µ = 0.5, where it is constant, the density is always strictly monotone. For µ < 0.5 it is strictly decreasing and the mode is at zero. For µ > 0.5 it is strictly increasing and the mode is at one. The R code used for producing this plot can be found at https://gitlab.com/oquijano/unifed/snippets/1786224.
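The numerical approach described above can be sketched in Python as follows. This is not the code of the unifed package: the function names, the starting point θ = 0, the stopping rule and the switch to a Taylor expansion near θ = 0 are all assumptions made for this illustration. The sketch inverts the mean function with Newton-Raphson and then obtains the variance function and the unit deviance of a regular EDF from that inverse.

```python
import math


def kappa(theta):
    """Cumulant generator of the uniform(0, 1) distribution."""
    if theta == 0.0:
        return 0.0
    if theta > 0.0:
        # log((e^theta - 1) / theta), rewritten so e^theta is never formed.
        return theta + math.log(-math.expm1(-theta) / theta)
    return math.log(math.expm1(theta) / theta)


def kappa_prime(theta):
    """First derivative of kappa: the mean for canonical parameter theta."""
    if abs(theta) < 1e-2:
        # Taylor expansion near 0 avoids cancellation between large terms.
        return 0.5 + theta / 12.0 - theta ** 3 / 720.0
    if theta > 0.0:
        return -1.0 / math.expm1(-theta) - 1.0 / theta
    return math.exp(theta) / math.expm1(theta) - 1.0 / theta


def kappa_double_prime(theta):
    """Second derivative of kappa: the variance when phi = 1."""
    if abs(theta) < 1e-2:
        return 1.0 / 12.0 - theta ** 2 / 240.0
    a = abs(theta)  # the second derivative is symmetric in theta
    return 1.0 / theta ** 2 - math.exp(-a) / math.expm1(-a) ** 2


def kappa_prime_inverse(mu, tol=1e-10, max_iter=200):
    """Solve kappa_prime(theta) = mu for theta by Newton-Raphson."""
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must lie in (0, 1)")
    theta = 0.0
    for _ in range(max_iter):
        step = (kappa_prime(theta) - mu) / kappa_double_prime(theta)
        theta -= step
        if abs(step) < tol * (1.0 + abs(theta)):
            return theta
    raise RuntimeError("Newton-Raphson did not converge")


def variance_function(mu):
    """V(mu): second derivative of kappa evaluated at the inverse mean map."""
    return kappa_double_prime(kappa_prime_inverse(mu))


def unit_deviance(y, mu):
    """Unit deviance of a regular EDF, for y and mu in (0, 1)."""
    theta_y = kappa_prime_inverse(y)
    theta_mu = kappa_prime_inverse(mu)
    return 2.0 * (y * (theta_y - theta_mu) - kappa(theta_y) + kappa(theta_mu))
```

For example, variance_function(0.5) returns the variance of the uniform distribution, and unit_deviance(y, mu) vanishes when y equals mu. The sketch only covers y strictly inside (0, 1), since the inverse of the mean function is unbounded at the endpoints.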
It is possible to use the unifed as the response distribution for a GLM. In this case, the dispersion parameter φ must be fixed to 1 and the weight of each class is the number of observations in the class. The unifed R package ([12]) provides the function unifed, which returns a family object that can be used inside the glm function.
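As a complement to the package functions mentioned in this section, the density, distribution function, quantile function and a sampler for the φ = 1 unifed can be sketched in a few lines of Python. These are not the package's dunifed, punifed, qunifed and runifed: the names and signatures below are assumptions made for this illustration, and the closed forms are obtained by integrating and inverting the φ = 1 density θe^{θx}/(e^θ − 1).

```python
import math
import random


def unifed_pdf(x, theta):
    """Density of the phi = 1 unifed at x for canonical parameter theta."""
    if not 0.0 <= x <= 1.0:
        return 0.0
    if theta == 0.0:
        return 1.0  # theta = 0 is the uniform(0, 1) distribution
    # For theta above roughly 700 this overflows and would need rescaling.
    return theta * math.exp(theta * x) / math.expm1(theta)


def unifed_cdf(x, theta):
    """Distribution function: (e^{theta x} - 1) / (e^theta - 1)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    if theta == 0.0:
        return x
    return math.expm1(theta * x) / math.expm1(theta)


def unifed_quantile(p, theta):
    """Inverse of unifed_cdf for p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if theta == 0.0:
        return p
    return math.log1p(p * math.expm1(theta)) / theta


def unifed_sample(theta, size, seed=None):
    """Draw a sample by inverse-transform sampling."""
    rng = random.Random(seed)
    return [unifed_quantile(rng.random(), theta) for _ in range(size)]
```

Inverse-transform sampling is used here only because the quantile function is available in closed form; the text does not say which method the package uses.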

An Illustrative Example
In this section we apply a unifed GLM to a publicly available dataset. The data appears in [3]. Table 2 shows the description of the variables as provided at the website. We are interested in modeling the exposure, which is the proportion of the year during which the insurance policy is in force for a given client. We use gender, agecat, area and veh_age as the explanatory variables.
The R code used to obtain the results that follow can be found at https://gitlab.com/oquijano/unifed/snippets/1786226.
The data was aggregated using (7) and a unifed GLM was fit to it. Table 3 (exported from R using the package xtable [2]) shows the summary provided by the glm function of R. We see that all the variables included have at least one significant class.
A χ² test for goodness of fit is commonly used for GLMs. The null hypothesis is that the data is distributed according to the fitted GLM. Assuming the null hypothesis for this example implies that the residual deviance reported at the bottom of Table 3 follows a χ² distribution with 273 degrees of freedom. The p-value for this example is $P(\chi^2_{273} \geq 297.86) = 0.14$. The caveat with this test is that the χ² distribution of the residual deviance holds only asymptotically, as the smallest weight among all classes goes to infinity (see [7, Section 3.6]). The smallest observed weight here is 4 and it corresponds to the class with gender=F, agecat=6, area=F and veh_age=1. Therefore the χ² test for this example is not reliable.
Figure 2 shows the deviance residuals of this model. It suggests a good fit since they do not show any apparent pattern.
The beta regression ([4]) is a versatile model for applications with a response variable on the unit interval. Moreover, the well documented R package betareg ([1]) makes it a practical tool in many applications.

The Beta Regression
The density of the beta distribution contains a large variety of shapes. In [4] the beta density is reparameterized as
$$f(y;\mu,\phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1-\mu)\phi)}\,y^{\mu\phi-1}(1-y)^{(1-\mu)\phi-1},\qquad 0<y<1,$$
with 0 < µ < 1 and φ > 0, and the distribution is denoted by B(µ, φ). Under this parametrization, if Y ∼ B(µ, φ), the mean and variance are
$$E[Y]=\mu,\qquad \mathrm{Var}(Y)=\frac{\mu(1-\mu)}{1+\phi}.$$
Here φ is called the precision parameter of the distribution. In the beta regression model it is assumed that the response variable is a vector $Y = (Y_1, \ldots, Y_m)$, in which $Y_i \sim B(\mu_i, \phi)$ for i = 1, …, m. The $Y_i$ are assumed to be independent of each other. The explanatory variables are incorporated into the model through the relation
$$g(\mu_i) = x_i^{T}\beta, \tag{23}$$
where β is a vector of parameters and $x_i$ is a vector of regressors. The function g : (0, 1) → R is invertible and is called the link function.
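The mean-precision parametrization can be related to the usual shape parameters of the beta distribution with a short Python sketch. The function names are assumptions made for this illustration and the sketch is independent of the betareg package. It uses the fact that B(µ, φ) is the beta distribution with shape parameters µφ and (1 − µ)φ.

```python
import random
import statistics


def beta_shapes(mu, phi):
    """Map the mean-precision pair (mu, phi) to the shape pair (a, b)."""
    if not 0.0 < mu < 1.0 or phi <= 0.0:
        raise ValueError("need 0 < mu < 1 and phi > 0")
    return mu * phi, (1.0 - mu) * phi


def beta_mean_variance(mu, phi):
    """Mean and variance of B(mu, phi) in the mean-precision form."""
    return mu, mu * (1.0 - mu) / (1.0 + phi)


def sample_beta(mu, phi, size, seed=None):
    """Draw from B(mu, phi) through the standard shape parametrization."""
    rng = random.Random(seed)
    a, b = beta_shapes(mu, phi)
    return [rng.betavariate(a, b) for _ in range(size)]


# Compare the theoretical moments with those of a simulated sample.
draws = sample_beta(0.3, 10.0, 20000, seed=0)
empirical_mean = statistics.fmean(draws)
empirical_variance = statistics.pvariance(draws)
theoretical_mean, theoretical_variance = beta_mean_variance(0.3, 10.0)
```

A larger φ gives a smaller variance for the same mean, which is why φ is read as a precision rather than a dispersion.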
The beta regression model was later generalized in [11] to allow the precision parameter to vary among classes, in a similar way to double generalized linear models (see [9]). More specifically, in this case the response vector $Y = (Y_1, \ldots, Y_m)$ is such that $Y_i \sim B(\mu_i, \phi_i)$, independently, and
$$g_1(\mu_i) = x_i^{T}\beta,\qquad g_2(\phi_i) = z_i^{T}\gamma,$$
where β and γ are regression coefficients, $x_i$ and $z_i$ are vectors of regressors, and $g_1$ and $g_2$ are link functions. These regression models offer great flexibility when the response variable lies in the interval (0, 1), and both are implemented in the R package betareg ([10], [1]).
The beta distribution is not an exponential dispersion family and therefore the beta regression is not a GLM. Nevertheless, the parametrization chosen by the authors of the model, along with (23), gives it a similar look and feel.

On the Difficulties of Data Aggregation for the Beta Regression
Data aggregation gives a practical advantage when working with large datasets. For GLMs this is straightforward due to two properties of $\bar Y$ in (7):
• $\bar Y$ is a sufficient statistic for µ.
• The distribution of $\bar Y$ belongs to the same family as that of the $Y_i$'s in (7).
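We do not know of any statistic with these two properties for the beta distribution. For instance, let $Y_1, \ldots, Y_n$ be an i.i.d. sample from a B(µ, φ) distribution. The joint likelihood function of this sample is then
$$L(\mu,\phi;y) = \prod_{i=1}^{n}\frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1-\mu)\phi)}\,y_i^{\mu\phi-1}(1-y_i)^{(1-\mu)\phi-1},$$
where $y = (y_1, \ldots, y_n)$. This density can be rearranged as follows:
$$L(\mu,\phi;y) = \left[\frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1-\mu)\phi)}\right]^{n}\left(\prod_{i=1}^{n}\frac{y_i}{1-y_i}\right)^{\mu\phi}\prod_{i=1}^{n}\frac{(1-y_i)^{\phi-1}}{y_i}.$$
The factorization theorem (see [5, Chapter 7]) implies that $T = \prod_{i=1}^{n}\frac{y_i}{1-y_i}$ is sufficient for µ. Now, the distribution of T, which is not beta, would be needed to use T for data aggregation. In other words, a regression model whose response distribution is a family that includes the distribution of T for every n would need to be developed.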
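The rearrangement of the likelihood can be checked numerically. The following Python sketch, with function names that are assumptions made for this illustration, computes the log-likelihood of a beta sample once term by term and once through log T, the sum of the log-odds of the observations, and the two values agree up to rounding error.

```python
import math


def log_normalizer(mu, phi):
    """Log of Gamma(phi) / (Gamma(mu phi) Gamma((1 - mu) phi))."""
    return (
        math.lgamma(phi)
        - math.lgamma(mu * phi)
        - math.lgamma((1.0 - mu) * phi)
    )


def beta_loglik_direct(ys, mu, phi):
    """Log-likelihood of an i.i.d. B(mu, phi) sample, summed term by term."""
    return sum(
        log_normalizer(mu, phi)
        + (mu * phi - 1.0) * math.log(y)
        + ((1.0 - mu) * phi - 1.0) * math.log1p(-y)
        for y in ys
    )


def beta_loglik_factored(ys, mu, phi):
    """Same log-likelihood written through log T = sum of log-odds."""
    log_t = sum(math.log(y) - math.log1p(-y) for y in ys)
    # The remainder depends on the data and on phi, but not on mu.
    remainder = sum((phi - 1.0) * math.log1p(-y) - math.log(y) for y in ys)
    return len(ys) * log_normalizer(mu, phi) + mu * phi * log_t + remainder


# Hypothetical sample on (0, 1).
sample = [0.10, 0.40, 0.75, 0.90]
direct_value = beta_loglik_direct(sample, 0.3, 5.0)
factored_value = beta_loglik_factored(sample, 0.3, 5.0)
```

In the factored form µ meets the data only through log T, which is the content of the sufficiency statement above.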

Differences Between the Unifed GLM and the Beta Regression
The unifed density does not have the variety of shapes that the beta density has. To see this, compare the shapes shown in Figure 1 with the shapes for the beta distribution shown in [4]. Thus, the beta regression adapts to more shapes than a unifed GLM, and even more so if regressors are used for the precision parameter.
In those cases where a beta regression and a unifed GLM give a similarly good fit, the parsimony principle suggests choosing the unifed GLM, since it has one fewer parameter: the dispersion parameter is known for the unifed GLM.
From a numerical point of view, the unifed GLM has the advantage that it is possible to use (7) for data reduction. This is a practical advantage when dealing with large datasets, especially if simulations of the response vector need to be performed.