The spherical-Dirichlet distribution

Today, data mining and gene expression analysis are at the forefront of modern data analysis. Here we introduce a novel probability distribution that is applicable in these fields. This paper develops the proposed spherical-Dirichlet distribution, designed to fit vectors located on the positive orthant of the hypersphere, as is often the case for data in these fields, avoiding unnecessary probability mass. Basic properties of the proposed distribution, including normalizing constants and moments, are developed. Relationships with other distributions are also explored. Estimators based on classical inferential statistics, such as method-of-moments and maximum likelihood estimators, are obtained. Two applications are developed: the first uses simulated data, and the second uses a real text mining example. Both examples are fitted using the proposed spherical-Dirichlet distribution, and their results are discussed.


Introduction
In text mining and gene expression analysis, collections of texts are represented in a vector-space model, which implies that texts, once standardized, are coded as vectors on a sphere of higher dimensions, also called a hypersphere (Suvrit 2016). Many researchers currently model these data by means of existing probability density mixtures; however, these approximations waste probability mass over the whole hypersphere when it is actually only needed on the positive orthant. This is mainly because of the non-existence of suitable distributions for that subspace. The proposed distribution fills that void, allowing for efficient modeling of these vectors.

Basic properties
In this section we introduce the proposed spherical-Dirichlet distribution, its moments and basic properties.

Guardiola, Journal of Statistical Distributions and Applications (2020) 7:6

Transforming the Dirichlet distribution from the simplex to the positive orthant of the hypersphere (Fig. 1) by taking the square-root transformation $y_i = x_i^2$, where $\mathbf{y} \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_m)$, yields the density

$$f(\mathbf{x}) = \frac{2^{m-1}\,\Gamma(\alpha_0)}{\prod_{i=1}^{m}\Gamma(\alpha_i)} \left(\prod_{i=1}^{m-1} x_i^{2\alpha_i-1}\right) x_m^{2\alpha_m-2}, \qquad (3)$$

where $\alpha_0 = \sum_{i=1}^{m}\alpha_i$, $x_i > 0$ and $\sum_{i=1}^{m} x_i^2 = 1$. We refer to (3) as the spherical-Dirichlet distribution (SDD) and write $\mathbf{x} \sim \mathrm{SDD}(\alpha_i)$. We introduce the parameters $\alpha_i$ as concentration parameters, in a similar manner to the corresponding parameters of the Dirichlet distribution.
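As a concrete illustration of the square-root construction, the following sketch (Python with NumPy; `sample_sdd` is a hypothetical helper, not code from the paper) draws vectors on the positive orthant of the hypersphere by taking the square root of Dirichlet samples.

```python
import numpy as np

def sample_sdd(alpha, n, rng=None):
    """Draw n vectors from the spherical-Dirichlet distribution SDD(alpha).

    Uses the defining construction: if y ~ Dirichlet(alpha), then
    x = sqrt(y) lies on the positive orthant of the unit hypersphere.
    (Illustrative helper, not from the paper.)
    """
    rng = np.random.default_rng(rng)
    y = rng.dirichlet(alpha, size=n)   # points on the simplex, rows sum to 1
    return np.sqrt(y)                  # square root maps them to the sphere

x = sample_sdd([2.0, 3.0, 4.0], n=1000, rng=0)
print(np.allclose((x ** 2).sum(axis=1), 1.0))  # True: unit Euclidean norm
```

Every sample is non-negative with squared coordinates summing to one, i.e. a point on the positive orthant of the hypersphere.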

Moments
In this section we compute the first- and second-order moments, the mode, the standard deviations, variances and covariances, and the corresponding covariance matrix. First, we compute the expected value of one of the variables; for example, consider the expected value of $x_1$,

$$E[x_1] = \int x_1 f(\mathbf{x})\, d\mathbf{x},$$

where we recognize the expression inside the integral as the kernel of the proposed SDD with a new first parameter $\alpha_1 + \tfrac{1}{2}$. We can then immediately rewrite this expression as

$$E[x_1] = \frac{\Gamma(\alpha_1 + \frac{1}{2})}{\Gamma(\alpha_1)}\,\frac{\Gamma(\alpha_0)}{\Gamma(\alpha_0 + \frac{1}{2})}. \qquad (6)$$

Defining $\mu_i = \Gamma(\alpha_i + \frac{1}{2})/\Gamma(\alpha_i)$ and $\mu_0 = \Gamma(\alpha_0 + \frac{1}{2})/\Gamma(\alpha_0)$, the expected value from (6) can be rewritten as $E[x_i] = \mu_i/\mu_0$. The general solution for the first moment of a vector $\mathbf{x} = (x_1, \ldots, x_m)^T$ with parameter vector $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_m)^T$ can therefore be written as

$$E[\mathbf{x}] = \frac{1}{\mu_0}\,(\mu_1, \ldots, \mu_m)^T.$$

Similarly, we compute the expected value of $x_1^2$: again we recognize the expression inside the integral as the kernel of the proposed SDD, now with a new first parameter $\alpha_1 + 1$, which yields $E[x_1^2] = \alpha_1/\alpha_0$; this result generalizes to $E[x_i^2] = \alpha_i/\alpha_0$. Moreover, the variance of any variable $x_i$ is

$$\mathrm{Var}(x_i) = \frac{\alpha_i}{\alpha_0} - \frac{\mu_i^2}{\mu_0^2}.$$

For the covariance of $x_1, x_2$, after some rearrangement we can identify the kernel of the proposed SDD with first two parameters $\alpha_1 + \frac{1}{2}$ and $\alpha_2 + \frac{1}{2}$, solve the corresponding integral, and obtain $E[x_1 x_2] = \mu_1\mu_2/\alpha_0$. In general, for any pair of variables $(x_i, x_j)$ we can write

$$E[x_i x_j] = \delta_{ij}\,\frac{\alpha_i}{\alpha_0} + (1 - \delta_{ij})\,\frac{\mu_i\mu_j}{\alpha_0},$$

where $\delta_{ij}$ is the Kronecker delta, so the covariance of any pair of variables $(x_i, x_j)$ is

$$\mathrm{Cov}(x_i, x_j) = \delta_{ij}\left(\frac{\alpha_i}{\alpha_0} - \frac{\mu_i^2}{\mu_0^2}\right) + (1 - \delta_{ij})\left(\frac{\mu_i\mu_j}{\alpha_0} - \frac{\mu_i\mu_j}{\mu_0^2}\right),$$

which in matrix notation gives the covariance matrix $\Sigma = [\mathrm{Cov}(x_i, x_j)]$, summarizing our results in a succinct form.
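The closed-form moments above can be checked numerically. The sketch below (Python; `sdd_moments` is an illustrative helper name) assembles the mean vector and covariance matrix from the gamma-function ratios $\mu_i = \Gamma(\alpha_i + \frac{1}{2})/\Gamma(\alpha_i)$ developed in this section.

```python
import math
import numpy as np

def sdd_moments(alpha):
    """Closed-form mean vector and covariance matrix of SDD(alpha)
    (illustrative helper; mu_i = Gamma(alpha_i + 1/2)/Gamma(alpha_i))."""
    alpha = np.asarray(alpha, float)
    a0 = alpha.sum()
    mu = np.array([math.gamma(a + 0.5) / math.gamma(a) for a in alpha])
    mu0 = math.gamma(a0 + 0.5) / math.gamma(a0)
    mean = mu / mu0                    # E[x_i] = mu_i / mu_0
    # second moments: E[x_i^2] = alpha_i/a0, E[x_i x_j] = mu_i mu_j/a0 (i != j)
    second = np.outer(mu, mu) / a0
    np.fill_diagonal(second, alpha / a0)
    cov = second - np.outer(mean, mean)
    return mean, cov

mean, cov = sdd_moments([2.0, 3.0, 4.0])
print(mean)           # componentwise expected values
print(np.diag(cov))   # componentwise variances
```

A quick Monte Carlo check against square-rooted Dirichlet samples reproduces these values closely.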

Mode and relationship with the mean
The mode of the SDD can be determined by finding the values of $x_i$ that maximize the density; alternatively, we can maximize the log of the density, as is customary and usually an easier procedure. First, taking the natural log of the SDD and adding the constraint $\sum_{i=1}^{m} x_i^2 = 1$ through a Lagrange multiplier $\lambda$, we get

$$\ln f = K + \sum_{i=1}^{m-1}(2\alpha_i - 1)\ln x_i + (2\alpha_m - 2)\ln x_m + \lambda\Big(1 - \sum_{i=1}^{m} x_i^2\Big),$$

where $K$ collects the terms that do not depend on $\mathbf{x}$. Taking derivatives with respect to $x_i$ ($i < m$) and setting them to zero, we have $(2\alpha_i - 1)/x_i - 2\lambda x_i = 0$; solving for $x_i^2$ yields $x_i^2 = (2\alpha_i - 1)/(2\lambda)$. Similarly, taking the derivative with respect to $x_m$ and solving for $x_m^2$, we have $x_m^2 = (2\alpha_m - 2)/(2\lambda)$. Substituting these results into the constraint, we can solve for $\lambda$ as $2\lambda = 2\alpha_0 - m - 1$, from which we obtain the mode for $x_i$,

$$x_i = \sqrt{\frac{2\alpha_i - 1}{2\alpha_0 - m - 1}}, \quad i < m, \qquad (30)$$

and for $x_m$,

$$x_m = \sqrt{\frac{2\alpha_m - 2}{2\alpha_0 - m - 1}}. \qquad (31)$$

Considering the special case of a symmetric SDD, we set $\alpha_i = \alpha$ for $i < m$ and $\alpha_m = \alpha + \frac{1}{2}$; both (30) and (31) then yield

$$x_i = \frac{1}{\sqrt{m}}. \qquad (32)$$

The mean for the same symmetric SDD is $E[x_i] = \mu_i/\mu_0$ with $\alpha_0 = m\alpha + \frac{1}{2}$, so the mode does not match the expected value of a symmetric SDD. However, we can still find an asymptotic relationship using the expression developed by Frame (Frame 1949), $\Gamma(x + \frac{1}{2})/\Gamma(x) \approx \sqrt{x - \frac{1}{4}}$. Using this approximation, the mean is approximately $\sqrt{(\alpha - \frac{1}{4})/(m\alpha + \frac{1}{4})}$, which in the limit $\alpha \to \infty$ matches the mode at (32).
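The mode formulas above can be evaluated directly. The sketch below (Python; `sdd_mode` is a hypothetical helper) also confirms the symmetric case, where every coordinate of the mode equals $1/\sqrt{m}$.

```python
import numpy as np

def sdd_mode(alpha):
    """Mode of SDD(alpha) from the Lagrange-multiplier solution
    (illustrative helper; assumes alpha_i > 1/2 for i < m and alpha_m > 1,
    so the mode is interior to the positive orthant)."""
    alpha = np.asarray(alpha, float)
    m = alpha.size
    num = 2.0 * alpha - 1.0
    num[-1] = 2.0 * alpha[-1] - 2.0      # last coordinate has a different exponent
    return np.sqrt(num / (2.0 * alpha.sum() - m - 1.0))

# symmetric case alpha_i = alpha for i < m, alpha_m = alpha + 1/2:
print(sdd_mode([2.0, 2.0, 2.5]))  # each entry equals 1/sqrt(3) ≈ 0.5774
```

The squared coordinates of the mode always sum to one, so the mode itself lies on the positive orthant of the hypersphere.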

Relationships of the SDD with other distributions
In this section we explore the relationships, or lack thereof, between the SDD and other popular distributions, such as the uniform, the von Mises, and its particular case, the Fisher-Bingham distribution. We consider limiting cases for different values of the concentration parameters $\alpha_i$.

Limiting case symmetric distribution for large α
Assuming a symmetric SDD with $\alpha_i = \alpha$ for all $i$, so that $\alpha_0 = m\alpha$, and subject to the restriction $\sum_{i=1}^{m} x_i^2 = 1$, the covariance matrix reduces to

$$\mathrm{Var}(x_i) = \frac{1}{m} - \frac{\mu^2}{\mu_0^2}, \qquad \mathrm{Cov}(x_i, x_j) = \frac{\mu^2}{\alpha_0} - \frac{\mu^2}{\mu_0^2} \quad (i \neq j),$$

where $\mu = \Gamma(\alpha + \frac{1}{2})/\Gamma(\alpha)$ and $\mu_0 = \Gamma(\alpha_0 + \frac{1}{2})/\Gamma(\alpha_0)$. In an attempt to write the SDD as a rotationally symmetric distribution of the type shown by Mardia (Mardia and Jupp 2000), the latter expression can be rewritten in equivalent forms; however, we cannot determine an equivalence to the von Mises or similar rotationally symmetric distributions. Using the expression developed by Frame (Frame 1949), $\Gamma(x + \frac{1}{2})/\Gamma(x) \approx \sqrt{x - \frac{1}{4}}$, we can see that in the limiting case $\alpha \to \infty$, and consequently $\alpha_0 \to \infty$, both $\mu^2/\mu_0^2$ and $\mu^2/\alpha_0$ tend to $1/m$. We conclude that for large values of $\alpha$ the covariance matrix tends to zero; consequently, the SDD tends to concentrate as a single vector with no variation.
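This limiting behaviour is easy to observe numerically. The sketch below (Python; `symmetric_sdd_var` is a hypothetical helper implementing the symmetric-case variance $\mathrm{Var}(x_i) = 1/m - \mu^2/\mu_0^2$ with $\alpha_0 = m\alpha$) evaluates the variance for growing $\alpha$, using log-gamma for numerical stability.

```python
import math

def symmetric_sdd_var(alpha, m):
    """Var(x_i) for a symmetric SDD with alpha_i = alpha for all i
    (illustrative helper): 1/m - (mu/mu0)^2, with alpha_0 = m * alpha.
    Log-gamma keeps the ratio stable for large alpha."""
    a0 = m * alpha
    mean = math.exp(math.lgamma(alpha + 0.5) - math.lgamma(alpha)
                    - math.lgamma(a0 + 0.5) + math.lgamma(a0))
    return 1.0 / m - mean ** 2

for a in (1.0, 10.0, 100.0, 1000.0):
    print(a, symmetric_sdd_var(a, m=3))  # variance shrinks toward zero
```

As $\alpha$ grows the variance decays toward zero, consistent with the conclusion that the distribution concentrates on a single direction.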

Limiting case uniform distribution
We now consider the case where $\alpha_i = \frac{1}{2}$ for $i < m$ and $\alpha_m = 1$. Every exponent of the SDD then vanishes, so the density reduces to a constant, independent of the values of $x_i$, and the SDD becomes the uniform distribution over the positive orthant of the hypersphere.
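This constant-density case can be verified directly from the exponents of the density: with $\alpha_i = \frac{1}{2}$ ($i < m$) and $\alpha_m = 1$, every exponent is zero. A small Python check (with a hypothetical helper `sdd_logpdf_unnorm` for the unnormalized log-density):

```python
import numpy as np

def sdd_logpdf_unnorm(x, alpha):
    """Unnormalized log-density of the SDD:
    sum_{i<m} (2*alpha_i - 1) ln x_i + (2*alpha_m - 2) ln x_m.
    (Illustrative helper; the normalizing constant is omitted.)"""
    x = np.asarray(x, float)
    expo = 2.0 * np.asarray(alpha, float) - 1.0
    expo[-1] -= 1.0                      # last exponent is 2*alpha_m - 2
    return np.sum(expo * np.log(x), axis=-1)

rng = np.random.default_rng(1)
x = np.abs(rng.normal(size=(5, 4)))
x /= np.linalg.norm(x, axis=1, keepdims=True)      # points on the positive orthant
print(sdd_logpdf_unnorm(x, [0.5, 0.5, 0.5, 1.0]))  # all zeros: constant density
```

For any other parameter values the log-density varies from point to point, so the uniform case is special to this parameter choice.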

Similarities and differences of the SDD with the von Mises and Fisher-Bingham distributions
The von Mises distribution is usually considered the analogue of the normal distribution on the circle, as described by Mardia (Mardia 1975). The von Mises distribution, and its particular case for the three-dimensional sphere, the Fisher-Bingham distribution, tend to converge to a bivariate and multivariate normal distribution, respectively, for large values of κ, as shown by Kent (Kent 1982). The proposed SDD does not appear to converge to the von Mises distribution or to a multivariate normal distribution for large values of α i ; rather, it tends to concentrate as a single vector, as established at the end of the section on limiting cases for the SDD.
Moreover, both the von Mises and the Fisher-Bingham distribution converge to the uniform distribution for very small values of κ, in a similar way as the SDD becomes the uniform distribution for the values of the parameters described at the end of the previous subsection.

Inference for the spherical-Dirichlet distribution
We now consider estimation of the parameters of the SDD. Our main interest is to develop suitable procedures to estimate the set of parameters α i , given a sample of random vectors located at the positive orthant of the hypersphere. We first derive estimators for α i using the method of moments (MOM), and next we develop estimators for the same set of parameters using the method of maximum likelihood estimation (MLE).

Method of moments (MOM)
Using a procedure similar to the one developed by Narayanan (Narayanan 1992) to estimate the parameters of the Dirichlet distribution, suppose we have a random sample of $n$ i.i.d. random vectors $X_1, X_2, \ldots, X_n$ on the positive orthant of the hypersphere, i.e. $X_i \in \{(x_1, \ldots, x_m) \mid x_j > 0,\ \sum_{j=1}^{m} x_j^2 = 1\}$. From the moments derived earlier,

$$E[x_j] = \frac{\mu_j}{\mu_0} \quad \text{and} \quad E[x_j^2] = \frac{\alpha_j}{\alpha_0}.$$

We define the corresponding sample moments as

$$\bar{x}_j = \frac{1}{n}\sum_{k=1}^{n} x_{kj} \quad \text{and} \quad \overline{x_j^2} = \frac{1}{n}\sum_{k=1}^{n} x_{kj}^2.$$

We have $m-1$ first-order moment equations and $m-1$ second-order moment equations to solve for the $m$ unknowns $\alpha_j$. To avoid linear dependency, and for the sake of simplicity, we choose one of the first-order moment equations,

$$\bar{x}_1 = \frac{\mu_1}{\mu_0}, \qquad (45)$$

and the remaining $m-1$ second-order moment equations,

$$\overline{x_j^2} = \frac{\alpha_j}{\alpha_0}, \quad j = 1, \ldots, m-1. \qquad (46)$$

There is no closed-form solution for the $\alpha_j$ when solving (45) and (46) simultaneously, so we must solve numerically to obtain the corresponding method-of-moments estimators. Results from MOM can be used as initial values for the MLE, which usually exhibits better statistical properties.
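One possible numerical scheme, shown below as a sketch (not necessarily the paper's exact iteration; `mom_fit_sdd` is a hypothetical helper), uses the second-order equations to fix the proportions $\alpha_j/\alpha_0$ and then solves the single first-order equation for $\alpha_0$ by bisection.

```python
import math
import numpy as np

def mom_fit_sdd(x):
    """Method-of-moments sketch for the SDD (illustrative, assuming the
    moment equations E[x_j^2] = alpha_j/alpha_0 and E[x_1] = mu_1/mu_0):
    the second moments fix the proportions alpha_j/alpha_0, and the
    first-moment equation pins down alpha_0, solved here by bisection."""
    x = np.asarray(x, float)
    p = (x ** 2).mean(axis=0)          # sample E[x_j^2]; sums to 1
    m1 = x[:, 0].mean()                # sample E[x_1]

    def mean_x1(a0):
        # E[x_1] = Gamma(a1+1/2)Gamma(a0) / (Gamma(a1)Gamma(a0+1/2)),
        # with a1 = a0 * p[0]; increasing in a0, via log-gamma for stability
        a1 = a0 * p[0]
        return math.exp(math.lgamma(a1 + 0.5) - math.lgamma(a1)
                        + math.lgamma(a0) - math.lgamma(a0 + 0.5))

    lo, hi = 1e-3, 1e4                 # bracket for alpha_0
    for _ in range(200):               # bisection on the scalar equation
        mid = 0.5 * (lo + hi)
        if mean_x1(mid) < m1:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) * p

rng = np.random.default_rng(0)
x = np.sqrt(rng.dirichlet([2.0, 3.0, 4.0], 20000))  # SDD samples via sqrt of Dirichlet
print(mom_fit_sdd(x))  # roughly recovers [2, 3, 4]
```

In line with the text, these MOM estimates make good starting values for the maximum likelihood iteration.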

Maximum likelihood estimation (MLE)
Suppose that we have a random sample of vectors $X_1, X_2, \ldots, X_n$ on the positive orthant of the hypersphere, from an SDD with pdf defined at (3). The log-likelihood is

$$\ln L(\boldsymbol{\alpha}) = n\ln K(\boldsymbol{\alpha}) + \sum_{k=1}^{n}\left[\sum_{j=1}^{m-1}(2\alpha_j - 1)\ln x_{kj} + (2\alpha_m - 2)\ln x_{km}\right],$$

where $K(\boldsymbol{\alpha})$ groups all the constant terms of the density. The parameters of an SDD can be estimated by maximizing the log-likelihood of the data, in a procedure similar to the one used by Minka for the Dirichlet distribution (Minka 2000). Rewriting the products and sums, the function that needs to be optimized, after removing unnecessary constants, is

$$F(\boldsymbol{\alpha}) = n\left[\ln\Gamma(\alpha_0) - \sum_{j=1}^{m}\ln\Gamma(\alpha_j)\right] + 2\sum_{j=1}^{m}\alpha_j\sum_{k=1}^{n}\ln x_{kj}.$$

The gradient of the objective function can be obtained by differentiating with respect to $\alpha_j$,

$$\frac{\partial F}{\partial \alpha_j} = n\left[\psi(\alpha_0) - \psi(\alpha_j)\right] + 2\sum_{k=1}^{n}\ln x_{kj},$$

where $\psi$ is the digamma function. The optimization is subject to the constraints $\alpha_j > 0$. The SDD is a member of the exponential family, so the log-likelihood is concave in its parameters, and at the maximum the observed sufficient statistic equals the expected sufficient statistic. The expected sufficient statistic is

$$E\big[\ln x_j^2\big] = \psi(\alpha_j) - \psi(\alpha_0),$$

and the observed sufficient statistic is $\frac{1}{n}\sum_{k=1}^{n}\ln x_{kj}^2$, which leads to the following iterative procedure:

$$\psi\big(\alpha_j^{\text{new}}\big) = \psi\big(\alpha_0^{\text{old}}\big) + \frac{1}{n}\sum_{k=1}^{n}\ln x_{kj}^2. \qquad (51)$$

Although the proposed procedure does not guarantee reaching a global maximum in general, updating (51) successively provides reasonable results, and convergence is typically fast.
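A minimal sketch of the fixed-point update (51) follows (Python with SciPy; `mle_fit_sdd` and `inv_digamma` are illustrative helper names, and the inverse-digamma initialization follows Minka's well-known approximation before Newton refinement).

```python
import numpy as np
from scipy.special import digamma, polygamma

def inv_digamma(y, steps=5):
    """Invert the digamma function by Newton's method (Minka-style init)."""
    a = np.where(y >= -2.22, np.exp(y) + 0.5, -1.0 / (y + 0.5772156649))
    for _ in range(steps):
        a = a - (digamma(a) - y) / polygamma(1, a)
    return a

def mle_fit_sdd(x, alpha_init, iters=100):
    """Fixed-point MLE sketch for the SDD: psi(alpha_j) is matched to
    psi(alpha_0) plus the observed sufficient statistic, the sample
    mean of ln x_j^2.  alpha_init can come from the MOM estimates."""
    x = np.asarray(x, float)
    t = (2.0 * np.log(x)).mean(axis=0)      # observed mean of ln x_j^2
    alpha = np.asarray(alpha_init, float)
    for _ in range(iters):
        alpha = inv_digamma(digamma(alpha.sum()) + t)
    return alpha
```

Since $y = x^2$ follows a Dirichlet distribution under the SDD, this iteration coincides with the familiar Dirichlet fixed point on the squared coordinates.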

Applications to data
We now consider estimation of the parameters of the SDD in practice. We first developed an example using simulated data generated from the proposed SDD, with parameters assumed to be unknown for the purpose of the estimation. Next, a second example was developed using text mining, with data obtained from a publicly available data set. Both examples were solved using MOM and MLE, applying the techniques described in the corresponding sections, and the results obtained from the two methods were compared.

Simulation example
Four different simulations were performed, each with 1,000 randomly generated values from an SDD on a three-dimensional hypersphere, with known values of the parameters α 1 , α 2 and α 3 . Graphs of the SDD for the four proposed sets of parameters are shown in Figs. 2 and 3. Inferences to estimate the values of these parameters, assumed to be unknown, were performed using the MOM and MLE procedures developed in the corresponding sections. First, an estimation was performed using MOM by iterating between (45) and (46); the values were updated in each cycle until convergence was achieved within a preset tolerance limit. The estimated parameter values from MOM were then used as the initial values for the iterative MLE process, in which expression (51) was updated successively until the parameter values were stable within a preset tolerance level. Results for both estimation methods, together with the true values of the parameters, are shown in Table 1.
Note the close agreement between the MLE and MOM estimates shown in Table 1.

Text mining example
A text mining example was developed using a publicly available data set assembled by Lang (Lang). Email messages regarding several interest groups are available; the "auto" topic was selected and summarized using standard data mining techniques. A randomly selected sample of 160 documents (emails) was extracted and summarized as vectors on the positive orthant of the hypersphere. Common terms such as "from" or "subject" were excluded, as they did not provide any discriminant power and could potentially bias the analysis. Vocabulary reduction for synonyms and stemming were performed, and the ten most common terms were extracted by obtaining their raw frequencies. The frequencies of these terms were expressed as the components of vectors in a ten-dimensional space. A small fraction of the data set can be seen in Table 2. An appropriate transformation was applied to these vectors to reduce extreme values and eliminate zeros; the transformation used here was x transf = ln(1.10 + x). The transformed vectors were standardized to unit length on the positive orthant of the hypersphere and fitted using the proposed multivariate SDD in ten dimensions. The estimation of the corresponding α i 's was done using both MOM and MLE, and the estimated values are shown in Table 3. The MOM procedure required 271 iterations to fit the SDD within a preset tolerance level. The final MOM estimates were used as the initial values for the MLE procedure, and a new model was fitted using 19 additional iterations. Although the MLE procedure in general does not guarantee finding a global maximum, the proposed method provided reasonable results and convergence was fast.
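The preprocessing step described above can be sketched as follows (Python; `to_sdd_vectors` is a hypothetical helper, and the frequency values in the usage example are made up for illustration, not taken from Table 2).

```python
import numpy as np

def to_sdd_vectors(counts):
    """Map raw term-frequency vectors to the positive orthant of the unit
    hypersphere: the transformation x = ln(1.10 + x) damps extreme values
    and eliminates zeros (ln 1.10 > 0), then each row is scaled to unit
    length.  (Illustrative helper following the transformation in the text.)"""
    z = np.log(1.10 + np.asarray(counts, float))
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# hypothetical raw frequencies for three documents over four terms
docs = to_sdd_vectors([[5, 0, 2, 1], [0, 3, 3, 0], [8, 1, 0, 2]])
print(np.allclose((docs ** 2).sum(axis=1), 1.0))  # True: unit vectors
```

Because ln(1.10) is strictly positive, documents with zero counts for some terms still map to strictly positive coordinates, as required by the SDD.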

Conclusions
The proposed SDD constitutes a superior alternative to competing methods for fitting unit vectors on the positive orthant of the hypersphere: it avoids wasting probability mass, and it avoids distribution mixtures that are not suited to that subspace. Inference results for MOM and MLE were in close agreement for the simulated data, and reasonably close for the real text mining example. The simulated data were randomly generated from the proposed SDD, while the text mining data came from a real problem. The SDD is flexible and shows a rich variety of shapes suitable for fitting a wide range of data, much as the beta distribution does in one dimension. Under an appropriate transformation it can also accommodate zeros in some coordinates of the hyper-vectors. Future research may aim at enhancing the capability of handling zero-valued components, avoiding the need to transform the data.