
# The spherical-Dirichlet distribution

*Journal of Statistical Distributions and Applications*
**volume 7**, Article number: 6 (2020)

## Abstract

Data mining and gene expression analysis are at the forefront of modern data analysis. Here we introduce a novel probability distribution that is applicable in these fields. This paper develops the proposed spherical-Dirichlet distribution, designed to fit vectors located on the positive orthant of the hypersphere, as is often the case for data in these fields, without placing probability mass on regions where no data can occur. Basic properties of the proposed distribution, including normalizing constants and moments, are developed. Relationships with other distributions are also explored. Estimators based on classical inferential statistics, namely method of moments and maximum likelihood estimators, are obtained. Two applications are developed: the first uses simulated data, and the second uses a real text mining example. Both examples are fitted using the proposed spherical-Dirichlet distribution and their results are discussed.

## Introduction

In text mining and gene expression analysis, collections of texts are represented in a vector-space model, which implies that texts, once standardized, are coded as vectors on a sphere of higher dimension, also called a hypersphere (Suvrit 2016). Many researchers currently model these data by means of mixtures of existing probability densities. However, these approximations spread probability mass over the whole hypersphere, when it is actually needed only on the positive orthant. This is mainly because no suitable distribution has been available for that subspace. The proposed distribution fills that void, allowing for efficient modeling of these vectors.

## Basic properties

In this section we introduce the proposed spherical-Dirichlet distribution, its moments and basic properties.

### Probability density function and normalizing constant

The spherical-Dirichlet distribution is obtained by transforming the Dirichlet distribution on the simplex to the corresponding space on the hypersphere. In this section we derive the density and compute the normalizing constant. Let **y** have a Dirichlet distribution on the simplex as described by Olkin and Rubin (Olkin and Rubin 1964).

where

Transforming the Dirichlet distribution from the simplex to the positive orthant of the hypersphere (Fig. 1)

taking the square root transformation

Computing the Jacobian for all independent variables, it follows that

the proposed transformation from (1) results in

where

We refer to (3) as the spherical-Dirichlet distribution (SDD) and write *x*∼*SDD*(*α*_{i}). We call the parameters *α*_{i} the concentration parameters, by analogy with the corresponding parameters of the Dirichlet distribution.
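The steps of this derivation can be sketched as follows. This is a reconstruction under stated assumptions rather than a quotation: it assumes the standard Dirichlet density with \(\alpha _{0}=\sum _{i=1}^{m}\alpha _{i}\), takes the first *m*−1 coordinates as the independent variables, and uses the symbol \(f\) for both densities. The first line plays the role of (1) and the last line the role of (3) in the text.

```latex
\begin{align*}
f(\mathbf{y}) &= \frac{\Gamma(\alpha_{0})}{\prod_{i=1}^{m}\Gamma(\alpha_{i})}
                 \prod_{i=1}^{m} y_{i}^{\alpha_{i}-1},
  \qquad y_{m} = 1-\sum_{i=1}^{m-1} y_{i}, \\
x_{i} &= \sqrt{y_{i}}, \qquad y_{i} = x_{i}^{2}, \qquad i=1,\ldots,m, \\
\left|\frac{\partial(y_{1},\ldots,y_{m-1})}{\partial(x_{1},\ldots,x_{m-1})}\right|
      &= \prod_{i=1}^{m-1} 2x_{i} = 2^{m-1}\prod_{i=1}^{m-1} x_{i}, \\
f(\mathbf{x}) &= \frac{2^{m-1}\,\Gamma(\alpha_{0})}{\prod_{i=1}^{m}\Gamma(\alpha_{i})}
                 \left(\prod_{i=1}^{m-1} x_{i}^{2\alpha_{i}-1}\right) x_{m}^{2\alpha_{m}-2},
  \qquad x_{m} = \sqrt{1-\sum_{i=1}^{m-1} x_{i}^{2}} .
\end{align*}
```

Under this form the exponents are all zero when \(\alpha _{i}=\frac {1}{2}\) for *i*<*m* and *α*_{m}=1, and they are all equal when *α*_{i}=*α* for *i*<*m* and \(\alpha _{m}=\alpha +\frac {1}{2}\), which agrees with the uniform and symmetric cases discussed later in the text.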

### Moments

In this section we compute the first- and second-order moments, mode, standard deviations, variances and covariances, and the corresponding covariance matrix. First, we compute the expected value of one of the variables; for example, let us consider the expected value of *x*_{1}

where we recognize the expression inside the integral as the kernel of the proposed SDD with a new first parameter \(\alpha _{1}+\frac {1}{2}\), so we can immediately rewrite this expression as

we define *μ*_{i} as,

the expected value from (6) can be rewritten as,

The general solution for the first moment of a vector **x**={*x*_{1},…,*x*_{m}}^{T} with a vector of parameters **α**={*α*_{1},…,*α*_{m}}^{T} can be written as

let

then, the expected value for a vector **x** can also be written as

Similarly, we compute the expected value for \(x_{1}^{2}\) as

again, we can recognize the expression inside the integral as the kernel of the proposed SDD with a new first parameter *α*_{1}+1, that yields

this result can be generalized to

Moreover, the variance for any variable *x*_{i} is

and the covariance for *x*_{1},*x*_{2} can be written as

after some arrangements, we can identify the kernel of the proposed SDD with the first two parameters as \(\alpha _{1}+\frac {1}{2}\), and \(\alpha _{2}+\frac {1}{2}\), where we can solve the corresponding integral, and our result takes the form

In general, for any pair of variables (*x*_{i},*x*_{j}) we can write

where *δ*_{ij} is the Kronecker delta. We can also write the covariance for any pair of variables (*x*_{i},*x*_{j}) as

that in matrix notation can also be written as

an equivalent expression is

similarly we let

where

that summarizes our results in a succinct form.
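The moment results of this section can be checked numerically. The sketch below assumes the gamma-function forms implied by the kernel argument in the text: \(E[x_{i}]=\Gamma (\alpha _{0})\Gamma (\alpha _{i}+\frac {1}{2})/(\Gamma (\alpha _{i})\Gamma (\alpha _{0}+\frac {1}{2}))\), \(E[x_{i}^{2}]=\alpha _{i}/\alpha _{0}\), and, for *i*≠*j*, \(E[x_{i}x_{j}]=\Gamma (\alpha _{i}+\frac {1}{2})\Gamma (\alpha _{j}+\frac {1}{2})/(\alpha _{0}\Gamma (\alpha _{i})\Gamma (\alpha _{j}))\). The function names and the parameter values are illustrative, not taken from the article.

```python
import numpy as np
from scipy.special import gammaln


def sdd_mean(alpha):
    """Mean vector: E[x_i] = G(a0) G(a_i + 1/2) / (G(a_i) G(a0 + 1/2))."""
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()
    log_mu = gammaln(a0) + gammaln(alpha + 0.5) - gammaln(alpha) - gammaln(a0 + 0.5)
    return np.exp(log_mu)


def sdd_cov(alpha):
    """Covariance matrix from E[x_i^2] = a_i / a0 and, for i != j,
    E[x_i x_j] = G(a_i + 1/2) G(a_j + 1/2) / (a0 G(a_i) G(a_j))."""
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()
    r = np.exp(gammaln(alpha + 0.5) - gammaln(alpha))  # G(a_i + 1/2) / G(a_i)
    second = np.outer(r, r) / a0          # off-diagonal second moments
    np.fill_diagonal(second, alpha / a0)  # diagonal second moments
    mu = sdd_mean(alpha)
    return second - np.outer(mu, mu)


def sdd_sample(alpha, n, rng):
    """Draw n SDD vectors as element-wise square roots of Dirichlet draws."""
    return np.sqrt(rng.dirichlet(alpha, size=n))


# Monte Carlo comparison with illustrative parameter values.
rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])
x = sdd_sample(alpha, 200_000, rng)
mean_err = np.abs(x.mean(axis=0) - sdd_mean(alpha)).max()
cov_err = np.abs(np.cov(x, rowvar=False) - sdd_cov(alpha)).max()
print(f"max |mean error| = {mean_err:.2e}, max |cov error| = {cov_err:.2e}")
```

Working with `gammaln` rather than the gamma function itself keeps the ratios stable for large concentration parameters.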

### Mode and relationship with the mean

The mode of the SDD can be determined by finding the values of *x*_{i} that maximize the density; equivalently, and as is customary, we can maximize its logarithm, which is usually easier. Taking the natural log of the SDD density and adding the constraint \(\sum _{i=1}^{m}x_{i}^{2}=1\) by means of a Lagrange multiplier *λ*, we get

taking derivatives with respect to *x*_{i} and setting them to zero we have

solving for \(x_{i}^{2}\), it yields

similarly, taking derivatives with respect to *x*_{m}

and solving for *x*_{m}, we have

substituting these results at the constraint, we can solve for *λ* as

where we can obtain the mode for *x*_{i} as

and for *x*_{m}

Considering the special case of a symmetric SDD, we set *α*_{i}=*α* for *i*<*m* and \(\alpha _{m}=\alpha +\frac {1}{2}\); then both (30) and (31) yield

The mean of a symmetric SDD with *α*_{i}=*α* for *i*<*m* and \(\alpha _{m}=\alpha +\frac {1}{2}\) is

where we can see that the mode does not match the expected value for a symmetric SDD. However, we can still find an asymptotic relationship using the approximation developed by Frame (Frame 1949),

using this approximation it yields

where

that in the limit matches the mode at (32).
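A small numerical sketch of this subsection follows. It assumes the density kernel \(\prod _{i<m}x_{i}^{2\alpha _{i}-1}\,x_{m}^{2\alpha _{m}-2}\), for which the Lagrange conditions give \(x_{i}^{2}=(2\alpha _{i}-1)/(2\alpha _{0}-m-1)\) for *i*<*m* and \(x_{m}^{2}=(2\alpha _{m}-2)/(2\alpha _{0}-m-1)\). The function names are hypothetical, and the closed forms are a reconstruction that should be checked against (30) and (31).

```python
import numpy as np
from scipy.special import gammaln


def sdd_mode(alpha):
    """Stationary point of the assumed SDD density on the unit sphere.

    Requires a_i > 1/2 for i < m and a_m > 1 so that every numerator
    is positive and the mode lies in the interior of the orthant.
    """
    alpha = np.asarray(alpha, dtype=float)
    num = 2.0 * alpha - 1.0
    num[-1] = 2.0 * alpha[-1] - 2.0
    if np.any(num <= 0):
        raise ValueError("interior mode needs a_i > 1/2 (i < m) and a_m > 1")
    return np.sqrt(num / num.sum())  # num.sum() equals 2*a0 - m - 1


def sdd_mean(alpha):
    """Mean vector: E[x_i] = G(a0) G(a_i + 1/2) / (G(a_i) G(a0 + 1/2))."""
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()
    return np.exp(gammaln(a0) + gammaln(alpha + 0.5)
                  - gammaln(alpha) - gammaln(a0 + 0.5))


# Symmetric case: a_i = a for i < m and a_m = a + 1/2.
m = 3
for a in (1.0, 10.0, 100.0, 1000.0):
    alpha = np.array([a] * (m - 1) + [a + 0.5])
    print(a, sdd_mode(alpha), sdd_mean(alpha))
```

In this symmetric setting the mode stays at \(1/\sqrt {m}\) in every coordinate, while the mean differs from it for small *α* and approaches it as *α* grows.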

## Relationships of the SDD with other distributions

In this section we explore the relationships, or lack thereof, between the SDD and other popular distributions such as the uniform, the von Mises and its particular case, the Fisher-Bingham distribution. We consider limiting cases for different values of the concentration parameters *α*_{i}.

### Limiting case symmetric distribution for large *α*

Assuming a symmetric SDD with *α*_{i}=*α* for all *i*, we can write

subject to the restrictions

in this case the covariance matrix can be reduced to

where

in an attempt to write the SDD as a rotational distribution of the type shown by Mardia (Mardia and Jupp 2000), the latter expression can be rewritten as

or equivalently

We cannot establish an equivalence with the von Mises or similar rotationally symmetric distributions. However, using the expression developed by Frame (Frame 1949), we can see that in the limiting case *α*→*∞*, and consequently *α*_{0}→*∞*, we have

and

which in the limit yields

We conclude that for large values of *α* the covariance matrix tends to zero; consequently, the SDD tends to concentrate at a single vector with no variation.
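The shrinking covariance can be illustrated with a short computation. Because \(\sum _{i}x_{i}^{2}=1\), the trace of the covariance matrix equals \(1-\sum _{i}\mu _{i}^{2}\); for a symmetric SDD with *α*_{i}=*α* for all *i* this is \(1-m\mu ^{2}\). The sketch assumes the gamma-ratio form of the mean, \(\mu =\Gamma (m\alpha )\Gamma (\alpha +\frac {1}{2})/(\Gamma (\alpha )\Gamma (m\alpha +\frac {1}{2}))\); the function name is hypothetical.

```python
import numpy as np
from scipy.special import gammaln


def total_variance_symmetric(alpha, m):
    """Trace of the covariance matrix of a symmetric SDD (all a_i = alpha)."""
    a0 = m * alpha
    log_mu = (gammaln(a0) + gammaln(alpha + 0.5)
              - gammaln(alpha) - gammaln(a0 + 0.5))
    return 1.0 - m * np.exp(2.0 * log_mu)


for a in (1.0, 10.0, 100.0, 1000.0):
    print(a, total_variance_symmetric(a, m=3))
```

The printed values decrease toward zero as *α* increases, which is the numerical counterpart of the limiting argument above.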

### Limiting case uniform distribution

We now consider the case where \(\alpha _{i}=\frac {1}{2}\) for *i*<*m* and *α*_{m}=1, in which the SDD takes the form

which is a constant independent of the values of *x*_{i}; the SDD then becomes the uniform distribution over the positive orthant of the hypersphere.
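This limiting case can be verified by evaluating the density at arbitrary points. The sketch assumes the density form \(2^{m-1}\Gamma (\alpha _{0})/\prod _{i}\Gamma (\alpha _{i})\cdot \prod _{i<m}x_{i}^{2\alpha _{i}-1}\,x_{m}^{2\alpha _{m}-2}\), expressed with respect to the free coordinates *x*_{1},…,*x*_{m−1}; the function name `sdd_logpdf` is illustrative.

```python
import numpy as np
from scipy.special import gammaln


def sdd_logpdf(x, alpha):
    """Log-density of the assumed SDD form at strictly positive unit vectors x."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    alpha = np.asarray(alpha, dtype=float)
    m = alpha.size
    log_norm = (m - 1) * np.log(2.0) + gammaln(alpha.sum()) - gammaln(alpha).sum()
    expo = 2.0 * alpha - 1.0          # exponents 2*a_i - 1 for i < m
    expo[-1] = 2.0 * alpha[-1] - 2.0  # exponent 2*a_m - 2 for the last coordinate
    return log_norm + np.log(x) @ expo


# Arbitrary points on the positive orthant of the unit sphere in three dimensions.
rng = np.random.default_rng(3)
pts = np.sqrt(rng.dirichlet([1.0, 1.0, 1.0], size=5))
print(np.exp(sdd_logpdf(pts, [0.5, 0.5, 1.0])))
```

With these parameters all exponents vanish, so the same value is returned at every point. For *m*=3 the normalizing constant reduces to 4/*π*, the reciprocal of the area of the quarter unit disc swept by (*x*_{1},*x*_{2}).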

### Similarities and differences of the SDD with the von Mises and Fisher-Bingham distributions

The von Mises distribution is usually considered the analogue of the normal distribution on the circle, as described by Mardia (Mardia 1975). The von Mises distribution and its particular case for the three-dimensional sphere, the Fisher-Bingham distribution, tend to converge to a multivariate and a bivariate normal distribution, respectively, for large values of *κ*, as shown by Kent (Kent 1982).

The proposed SDD does not seem to converge to the von Mises distribution or to a multivariate normal distribution for large values of *α*_{i}; rather, it tends to concentrate at a single vector, as established at the end of the corresponding section on the limiting cases of the SDD.

Moreover, both the von Mises and the Fisher-Bingham distribution converge to the uniform distribution for very small values of *κ*, in a similar way as the SDD becomes the uniform distribution for the values of the parameters described at the end of the previous subsection.

## Inference for the spherical-Dirichlet distribution

We now consider estimation of the parameters of the SDD. Our main interest is to develop suitable procedures to estimate the set of parameters *α*_{i}, given a sample of random vectors located at the positive orthant of the hypersphere. We first derive estimators for *α*_{i} using the method of moments (MOM), and next we develop estimators for the same set of parameters using the method of maximum likelihood estimation (MLE).

### Method of moments (MOM)

Using a procedure similar to the one developed by Narayanan (Narayanan 1992) to estimate the parameters of the Dirichlet distribution, suppose we have an i.i.d. random sample of *n* random vectors *X*_{1},*X*_{2},…,*X*_{n} such that \(X_{i} \in \left \{\mathbf {x} \in \Re ^{m} : x_{j}>0,\ j=1,\ldots ,m;\ \sum _{j=1}^{m}x_{j}^{2}=1\right \}\); then

and

We define the sample moments as

and

We have *m-1* first order moment equations and *m-1* second order moment equations to solve for *m* unknowns *α*_{i}. To avoid linear dependency and for the sake of simplicity we choose one of the first order moments, and *m-1* of the second order moment equations

then, the remaining *m-1* second order moment equations are

There is no closed-form solution for *α*_{i} when solving (45) and (46) simultaneously, so the method of moments estimators must be obtained numerically. The MOM results can be used as initial values for the MLE, which usually exhibits better statistical properties.
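One way to carry out such a numerical solution is sketched below. It is not claimed to reproduce (45) and (46) exactly: it assumes the moment relations \(E[x_{i}^{2}]=\alpha _{i}/\alpha _{0}\) and \(E[x_{k}]=\Gamma (\alpha _{0})\Gamma (\alpha _{k}+\frac {1}{2})/(\Gamma (\alpha _{k})\Gamma (\alpha _{0}+\frac {1}{2}))\), uses the second moments to fix the proportions *α*_{i}/*α*_{0}, and then solves a single first-moment equation for *α*_{0} by root bracketing. The function name `sdd_mom`, the choice of component *k*, and the bracket limits are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import gammaln


def sdd_mom(x, k=0):
    """Moment-matching sketch for the SDD concentration parameters.

    x is an (n, m) array of unit vectors with positive coordinates.
    """
    x = np.asarray(x, dtype=float)
    p = (x ** 2).mean(axis=0)
    p = p / p.sum()       # already sums to 1 for unit vectors; guards rounding
    m1 = x[:, k].mean()   # sample first moment of component k

    def gap(a0):
        # Theoretical E[x_k] for total concentration a0, minus the sample value.
        ak = p[k] * a0
        log_mu = gammaln(a0) + gammaln(ak + 0.5) - gammaln(ak) - gammaln(a0 + 0.5)
        return np.exp(log_mu) - m1

    # E[x_k] moves from p_k (a0 -> 0) to sqrt(p_k) (a0 -> infinity),
    # and the sample mean lies between these two limits.
    a0 = brentq(gap, 1e-6, 1e6)
    return p * a0


rng = np.random.default_rng(1)
true_alpha = np.array([2.0, 3.0, 5.0])  # illustrative values
sample = np.sqrt(rng.dirichlet(true_alpha, size=20_000))
print(sdd_mom(sample))
```

Reducing the system to a one-dimensional root search keeps the sketch short; a solver that uses all first-moment equations jointly would be a natural alternative.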

### Maximum likelihood estimation (MLE)

Suppose that we have a random sample of vectors on the positive orthant of the hypersphere, *X*_{1},*X*_{2},....*X*_{n}, where *X*_{i}∈ℜ^{m} from an SDD with pdf defined at (3). Then, the log-likelihood is

The parameters of an SDD can be estimated by maximizing the log-likelihood function of the data, following a procedure similar to the one used by Minka for the Dirichlet distribution (Minka 2000). We can group all the constant terms as *K*, and we can rewrite all the products and sums as

where the function that needs to be optimized after removing unnecessary constants is

The gradient of the objective function can be obtained by differentiating the log-likelihood ln*F*(*α*) with respect to *α*_{k} as

where \(\Psi =:\frac {d\ln \Gamma (x)}{dx}\) is the digamma function. The optimization is subject to the constraints \(\alpha _{i} \geqq 0\). The SDD is a member of the exponential family, so its log-likelihood is concave in the natural parameters and, at the maximum, the observed sufficient statistic equals the expected sufficient statistic, where the latter is

and the observed sufficient statistic is

that leads to the following iterative procedure

Although the proposed procedure does not guarantee in general reaching a global maximum, updating successively (51) provides reasonable results, and convergence is typically fast.
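A minimal sketch of such an iteration is given below. It assumes that the gradient with respect to *α*_{k} is proportional to \(\Psi (\alpha _{0})-\Psi (\alpha _{k})+\frac {1}{n}\sum _{j}\ln x_{jk}^{2}\), which follows if the SDD is the square-root transform of a Dirichlet vector, so that the update reads \(\Psi (\alpha _{k}^{new})=\Psi (\alpha _{0}^{old})+\frac {1}{n}\sum _{j}\ln x_{jk}^{2}\). The digamma inversion uses Newton's method with the starting values suggested in Minka's report. The starting point here is a simple moment-based guess on the squared coordinates rather than the MOM estimator of the previous section, and all function names are hypothetical.

```python
import numpy as np
from scipy.special import digamma, polygamma


def inv_digamma(y, n_newton=6):
    """Invert the digamma function by Newton's method."""
    y = np.asarray(y, dtype=float)
    x = np.where(y >= -2.22, np.exp(y) + 0.5, -1.0 / (y - digamma(1.0)))
    for _ in range(n_newton):
        x = x - (digamma(x) - y) / polygamma(1, x)
    return x


def sdd_mle(x, tol=1e-8, max_iter=1000):
    """Fixed-point MLE sketch for an (n, m) array of positive unit vectors."""
    x = np.asarray(x, dtype=float)
    y = x ** 2                           # squared coordinates lie on the simplex
    log_y_bar = np.log(y).mean(axis=0)   # observed sufficient statistics
    # Moment-based starting point for the total concentration.
    e1, e2 = y.mean(axis=0), (y ** 2).mean(axis=0)
    a0 = (e1[0] - e2[0]) / (e2[0] - e1[0] ** 2)
    alpha = e1 * a0
    for _ in range(max_iter):
        new = inv_digamma(digamma(alpha.sum()) + log_y_bar)
        if np.max(np.abs(new - alpha)) < tol:
            return new
        alpha = new
    return alpha


rng = np.random.default_rng(2)
true_alpha = np.array([2.0, 3.0, 5.0])  # illustrative values
sample = np.sqrt(rng.dirichlet(true_alpha, size=5_000))
alpha_hat = sdd_mle(sample)
print(alpha_hat)
```

At convergence the expected and observed sufficient statistics agree, which can be checked by evaluating `digamma(alpha_hat.sum()) - digamma(alpha_hat) + np.log(sample ** 2).mean(axis=0)` and confirming that it is close to zero.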

## Applications to data

We now apply the estimation procedures to data. We first develop an example using simulated data generated from the proposed SDD, with parameters treated as unknown for the purpose of estimation. A second example uses text mining data obtained from a publicly available data set. Both examples were solved using MOM and MLE, applying the techniques described in the corresponding sections, and the results obtained from both methods were compared.

### Simulation example

Four different simulations were performed, each with 1,000 randomly generated vectors from an SDD in a three-dimensional hypersphere, with known values of the parameters *α*_{1},*α*_{2} and *α*_{3}. Plots of the SDD for the four sets of parameter values are shown in Figs. 2 and 3. The parameters, treated as unknown, were then estimated using the MOM and MLE procedures developed in the corresponding sections.

First, an estimation was performed using MOM, iterating between (45) and (46). The values were updated in each cycle until convergence was achieved within a pre-set tolerance limit. The estimates obtained using MOM were used as the initial values for the iterative MLE process. For the latter method, expression (51) was updated successively until the values of the parameters were stable within a pre-set tolerance level. The estimates from both methods and the true values of the parameters are shown in Table 1.

Note the close agreement between the MLE and MOM estimates shown in Table 1.

### Text mining example

A text mining example was developed using a publicly available data set assembled by Lang (Lang). The data set contains email messages from several interest groups; the “auto” topic was selected and summarized using standard data mining techniques. A random sample of 160 documents (emails) was extracted and summarized as vectors on the positive orthant of the hypersphere. Common terms such as “from” or “subject” were excluded, as they provide no discriminant power and could potentially bias the analysis. Vocabulary reduction through synonym merging and stemming was performed, and the ten most common terms were extracted by obtaining their raw frequencies. The frequencies of these terms were expressed as the components of vectors in a ten-dimensional space. A small fraction of the data set can be seen in Table 2.

An appropriate transformation was applied to these vectors to reduce extreme values and eliminate zeros. The transformation used here was *x*_{transf}= ln(1.10+*x*). The transformed vectors were standardized to unit length on the positive orthant of the hypersphere and fitted using the proposed multivariate SDD in ten dimensions. The corresponding *α*_{i}’s were estimated using both MOM and MLE, and the estimated values are shown in Table 3.
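The preprocessing described above can be sketched in a few lines. The transformation ln(1.10+*x*) and the unit-length standardization come from the text; the function name and the small count matrix are made up for illustration and are not the article's data.

```python
import numpy as np


def to_positive_orthant(counts, shift=1.10):
    """Map raw term frequencies to unit vectors with strictly positive entries."""
    counts = np.asarray(counts, dtype=float)
    z = np.log(shift + counts)  # ln(1.10 + x) stays positive when the count is 0
    return z / np.linalg.norm(z, axis=1, keepdims=True)


# Hypothetical term-frequency rows (documents by terms), not from the data set.
counts = np.array([[0, 3, 1, 12],
                   [5, 0, 0, 2],
                   [1, 1, 7, 0]])
vectors = to_positive_orthant(counts)
print(np.linalg.norm(vectors, axis=1))
print(vectors.min())
```

Because every transformed entry is strictly positive, the resulting vectors lie in the interior of the positive orthant, where the logarithms needed by the likelihood are finite.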

The MOM procedure needed 271 iterations to fit the SDD within a preset tolerance level. The final MOM estimates were used as the initial values for the MLE procedure, which converged after 19 additional iterations. Although the MLE procedure does not in general guarantee finding a global maximum, the proposed method provided reasonable results and converged quickly.

## Conclusions

The proposed SDD constitutes a superior alternative to other competing methods for fitting unit vectors on the positive orthant of the hypersphere. The SDD avoids wasting probability mass and avoids the use of distribution mixtures that are not suited to the positive orthant of the hypersphere. Inference results for MOM and MLE were in close agreement for the simulated data, and reasonably close for the real text mining example. The simulated data were randomly generated from the proposed SDD, while the text mining data were obtained from a real text mining problem. The SDD is flexible and shows a rich variety of shapes suitable for fitting a wide range of data, in a similar way to the beta distribution in a one-dimensional space. Under an appropriate transformation it can also accommodate zeros in some coordinates of the vectors. Future research may aim to improve the handling of zero-valued components, avoiding the need to transform the data.

## Availability of data and materials

The data for the text mining example were obtained from the publicly available data set assembled by Lang (Lang). The particular sample analysed during the current study is available from the corresponding author on reasonable request.

## References

Frame, J. S.: An approximation to the quotient of gamma functions. Am. Math. Mon. 56(8), 529–535 (1949).

Kent, J. T.: The Fisher-Bingham distribution on the sphere. J. R. Stat. Soc. Ser. B Methodol. 44(1), 71–80 (1982). https://doi.org/10.1111/j.2517-6161.1982.tb01189.x.

Lang, K.: CMU Text Learning Group Data Archives. https://www.cs.cmu.edu/afs/cs/project/theo-20/www/data/news20.html. Accessed 1 Sep 2019.

Mardia, K. V.: Statistics of directional data. J. R. Stat. Soc. Ser. B Methodol. 37(3), 349–393 (1975).

Mardia, K. V., Jupp, P. E.: Directional Statistics, 2nd edn. Wiley series in probability and statistics. Wiley, Chichester (2000).

Minka, T. P.: Estimating a Dirichlet distribution. Technical Report, MIT (2000). https://www.microsoft.com/en-us/research/publication/estimating-dirichlet-distribution/.

Narayanan, A.: A note on parameter estimation in the multivariate beta distribution. Comput. Math. Appl. 24(10), 11–17 (1992). https://doi.org/10.1016/0898-1221(92)90016-B.

Olkin, I., Rubin, H.: Multivariate beta distributions and independence properties of the Wishart distribution. Ann. Math. Stat. 35(1), 261–269 (1964).

Suvrit, S.: Directional statistics in machine learning: a brief review. arXiv e-prints, 1605.00316 (2016). http://arxiv.org/abs/1605.00316.

## Acknowledgements

The author is grateful to the invaluable help of Eduardo García-Portugués from the Department of Statistics, Carlos III University of Madrid (Spain).

## Funding

Travel funding for the completion of this project was received by a Research Enhancement grant from the Texas A&M University-Corpus Christi Division of Research and Innovation.

## Author information

### Contributions

JHG is the sole author of this article. The author(s) read and approved the final manuscript.

## Ethics declarations

### Competing interests

The author declares that he has no competing interests.

## Additional information

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

**Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

## About this article

### Cite this article

Guardiola, J.H. The spherical-Dirichlet distribution. *J Stat Distrib App* **7**, 6 (2020). https://doi.org/10.1186/s40488-020-00106-9

### Keywords

- Dirichlet distribution
- Text mining
- Hypersphere
- Gene expressions
- Positive orthant