The spherical-Dirichlet distribution
Journal of Statistical Distributions and Applications volume 7, Article number: 6 (2020)
Abstract
Today, data mining and gene expression analysis are at the forefront of modern data analysis. Here we introduce a novel probability distribution that is applicable in these fields. This paper develops the proposed spherical-Dirichlet distribution, designed to fit vectors located on the positive orthant of the hypersphere, as is often the case for data in these fields, avoiding unnecessary probability mass. Basic properties of the proposed distribution, including normalizing constants and moments, are developed. Relationships with other distributions are also explored. Estimators based on classical inferential statistics, such as method of moments and maximum likelihood estimators, are obtained. Two applications are developed: the first uses simulated data, and the second uses a real text mining example. Both examples are fitted using the proposed spherical-Dirichlet distribution and their results are discussed.
Introduction
In text mining and gene expression analysis, collections of texts are represented in a vector-space model, which implies that texts, once standardized, are coded as vectors on a sphere of higher dimension, also called a hypersphere (Suvrit 2016). Many researchers currently model these data by means of mixtures of existing probability densities; however, these approximations spread probability mass over the whole hypersphere when it is needed only on the positive orthant. This is mainly because no suitable distribution exists for that subspace. The proposed distribution fills that void, allowing efficient modeling of these vectors.
Basic properties
In this section we introduce the proposed spherical-Dirichlet distribution, its moments and basic properties.
Probability density function and normalizing constant
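The spherical-Dirichlet distribution is obtained by transforming the Dirichlet distribution on the simplex to the corresponding space on the hypersphere. In this section we derive the density and compute the normalizing constant. Let y have a Dirichlet distribution on the simplex as described by Olkin and Rubin (Olkin and Rubin 1964),
\[f(\mathbf{y})=\frac{1}{B(\boldsymbol{\alpha})}\prod_{i=1}^{m}y_{i}^{\alpha_{i}-1},\qquad y_{i}>0,\quad \sum_{i=1}^{m}y_{i}=1, \tag{1}\]
where
\[B(\boldsymbol{\alpha})=\frac{\prod_{i=1}^{m}\Gamma(\alpha_{i})}{\Gamma\left(\sum_{i=1}^{m}\alpha_{i}\right)},\qquad \alpha_{i}>0.\]
Transforming the Dirichlet distribution from the simplex to the positive orthant of the hypersphere (Fig. 1) by taking the square root transformation
\[x_{i}=\sqrt{y_{i}},\qquad i=1,\ldots,m,\]
and computing the Jacobian for the m-1 independent variables, it follows that
\[|J|=\prod_{i=1}^{m-1}\left|\frac{dy_{i}}{dx_{i}}\right|=2^{m-1}\prod_{i=1}^{m-1}x_{i},\]
so the proposed transformation from (1) results in
\[f(\mathbf{x})=\frac{2^{m-1}}{B(\boldsymbol{\alpha})}\prod_{i=1}^{m-1}x_{i}^{2\alpha_{i}-1}\left(1-\sum_{i=1}^{m-1}x_{i}^{2}\right)^{\alpha_{m}-1}, \tag{3}\]
where
\[x_{i}>0,\qquad x_{m}=\sqrt{1-\sum_{i=1}^{m-1}x_{i}^{2}}.\]
We refer to (3) as the spherical-Dirichlet distribution (SDD) and write x∼SDD(αi). We introduce the parameters αi as concentration parameters, in a similar manner to the corresponding parameters of the Dirichlet distribution.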
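The construction above also suggests a direct way to simulate from the SDD: draw a Dirichlet vector and take componentwise square roots. The following is a minimal sketch using only the Python standard library. The function names `sample_sdd` and `sdd_log_density` are our own and do not come from the paper, and the log-density assumes the form obtained from the square root change of variables in the m-1 free coordinates, with exponent \(2\alpha_{i}-1\) on \(x_{i}\) for i<m and \(2\alpha_{m}-2\) on \(x_{m}\).

```python
import math
import random


def sample_sdd(alpha, rng=random):
    """Draw one SDD vector: x_i = sqrt(y_i) with y ~ Dirichlet(alpha)."""
    # A Dirichlet draw is a vector of independent Gamma(alpha_i, 1) variates
    # divided by their sum.
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [math.sqrt(v / total) for v in g]


def sdd_log_density(x, alpha):
    """Log-density in the free coordinates x_1..x_{m-1} (assumed form)."""
    m = len(alpha)
    # log B(alpha) = sum_i log Gamma(alpha_i) - log Gamma(sum_i alpha_i)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    out = (m - 1) * math.log(2.0) - log_beta
    for xi, ai in zip(x[:-1], alpha[:-1]):
        out += (2.0 * ai - 1.0) * math.log(xi)
    out += (2.0 * alpha[-1] - 2.0) * math.log(x[-1])
    return out


rng = random.Random(0)
x = sample_sdd([2.0, 3.0, 4.0], rng)
print(x, sum(v * v for v in x))  # squared components sum to 1 up to rounding
print(sdd_log_density(x, [2.0, 3.0, 4.0]))
```

Sampling through the Dirichlet avoids any rejection step, since every draw already lies on the positive orthant of the unit hypersphere.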
Moments
In this section we compute the first and second order moments, the mode, standard deviations, variances and covariances, and the corresponding covariance matrix. First, we compute the expected value of one of the variables; for example, let us consider the expected value of x1
where we recognize the expression inside the integral as the kernel of the proposed SDD with a new first parameter \(\alpha _{1}+\frac {1}{2}\), so we can immediately rewrite this expression as
we define μi as,
the expected value from (6) can be rewritten as,
The general solution for the first moment of a vector \(\mathbf{x}=\{x_{1},\ldots,x_{m}\}^{T}\) with a vector of parameters \(\boldsymbol{\alpha}=\{\alpha_{1},\ldots,\alpha_{m}\}^{T}\) can be written as
let
then, the expected value for a vector x can also be written as
Similarly, we compute the expected value for \(x_{1}^{2}\) as
again, we can recognize the expression inside the integral as the kernel of the proposed SDD with a new first parameter α1+1, that yields
this result can be generalized to
Moreover, the variance for any variable xi is
and the covariance for x1,x2 can be written as
after some rearrangement, we can identify the kernel of the proposed SDD with the first two parameters replaced by \(\alpha _{1}+\frac {1}{2}\) and \(\alpha _{2}+\frac {1}{2}\), which allows us to solve the corresponding integral, and our result takes the form
In general, for any pair of variables (xi,xj) we can write
where δij is the Kronecker delta. We can also write the covariance for any pair of variables (xi,xj) as
that in matrix notation can also be written as
an equivalent expression is
similarly we let
where
that summarizes our results in a succinct form.
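Because \(x_{i}=\sqrt{y_{i}}\) with y Dirichlet distributed, the moments discussed in this section can be evaluated as ratios of gamma functions. Writing \(\alpha_{0}=\sum_{i}\alpha_{i}\), the standard Dirichlet moment identities give \(E[x_{i}]=\frac{\Gamma(\alpha_{i}+\frac{1}{2})\Gamma(\alpha_{0})}{\Gamma(\alpha_{i})\Gamma(\alpha_{0}+\frac{1}{2})}\), \(E[x_{i}^{2}]=\frac{\alpha_{i}}{\alpha_{0}}\), and, for i≠j, \(E[x_{i}x_{j}]=\frac{\Gamma(\alpha_{i}+\frac{1}{2})\Gamma(\alpha_{j}+\frac{1}{2})}{\Gamma(\alpha_{i})\Gamma(\alpha_{j})\,\alpha_{0}}\). The sketch below evaluates these expressions and compares the mean with a Monte Carlo estimate. The function names are illustrative, and the notation may differ from the shorthand used in the paper's own equations.

```python
import math
import random


def sdd_mean(alpha):
    """E[x_i] = Gamma(a_i + 1/2) Gamma(a_0) / (Gamma(a_i) Gamma(a_0 + 1/2))."""
    a0 = sum(alpha)
    log_c = math.lgamma(a0) - math.lgamma(a0 + 0.5)
    return [math.exp(math.lgamma(a + 0.5) - math.lgamma(a) + log_c)
            for a in alpha]


def sdd_second_moment(alpha):
    """Matrix of E[x_i x_j]; the diagonal entries are alpha_i / alpha_0."""
    a0 = sum(alpha)
    r = [math.exp(math.lgamma(a + 0.5) - math.lgamma(a)) for a in alpha]
    m = len(alpha)
    return [[alpha[i] / a0 if i == j else r[i] * r[j] / a0
             for j in range(m)] for i in range(m)]


def sdd_covariance(alpha):
    """Cov(x_i, x_j) = E[x_i x_j] - E[x_i] E[x_j]."""
    mu = sdd_mean(alpha)
    s = sdd_second_moment(alpha)
    m = len(alpha)
    return [[s[i][j] - mu[i] * mu[j] for j in range(m)] for i in range(m)]


# Monte Carlo comparison for one illustrative parameter vector.
alpha = [2.0, 3.0, 4.0]
rng = random.Random(0)
n = 20000
sums = [0.0] * len(alpha)
for _ in range(n):
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    t = sum(g)
    for i, v in enumerate(g):
        sums[i] += math.sqrt(v / t)
print([s / n for s in sums])  # simulated means
print(sdd_mean(alpha))        # closed-form means
```

Working with `math.lgamma` rather than `math.gamma` keeps the ratios stable for large concentration parameters.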
Mode and relationship with the mean
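The mode of the SDD is found by determining the values of xi that maximize the density; equivalently, and usually more easily, we can maximize the log of the density. First, taking the natural log of the SDD and adding the constraint \(\sum _{i=1}^{m}x_{i}^{2}=1\) by means of a Lagrange multiplier λ, we get
\[L(\mathbf{x},\lambda)=\ln \frac{2^{m-1}}{B(\boldsymbol{\alpha})}+\sum_{i=1}^{m-1}(2\alpha_{i}-1)\ln x_{i}+(2\alpha_{m}-2)\ln x_{m}+\lambda\left(1-\sum_{i=1}^{m}x_{i}^{2}\right);\]
taking derivatives with respect to xi, i<m, and setting them to zero we have
\[\frac{2\alpha_{i}-1}{x_{i}}-2\lambda x_{i}=0;\]
solving for \(x_{i}^{2}\) yields
\[x_{i}^{2}=\frac{2\alpha_{i}-1}{2\lambda};\]
similarly, taking derivatives with respect to xm
\[\frac{2\alpha_{m}-2}{x_{m}}-2\lambda x_{m}=0\]
and solving for xm, we have
\[x_{m}=\sqrt{\frac{\alpha_{m}-1}{\lambda}};\]
substituting these results into the constraint, we can solve for λ as
\[\lambda=\alpha_{0}-\frac{m+1}{2},\qquad \alpha_{0}=\sum_{i=1}^{m}\alpha_{i},\]
from which we obtain the mode for xi as
\[x_{i}=\sqrt{\frac{2\alpha_{i}-1}{2\alpha_{0}-m-1}},\qquad i<m, \tag{30}\]
and for xm
\[x_{m}=\sqrt{\frac{2\alpha_{m}-2}{2\alpha_{0}-m-1}}. \tag{31}\]
These expressions require \(\alpha_{i}>\frac{1}{2}\) for i<m and \(\alpha_{m}>1\). Considering the special case of a symmetric SDD, we set αi=α for i<m and \(\alpha _{m}=\alpha +\frac {1}{2}\); both (30) and (31) yield
\[x_{i}=\frac{1}{\sqrt{m}},\qquad i=1,\ldots,m. \tag{32}\]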
The mean for the same symmetric SDD, with αi=α for i<m and \(\alpha _{m}=\alpha +\frac {1}{2}\), is
where we can see that the mode does not match the expected value for a symmetric SDD; however, we can still find an asymptotic relationship using the approximation to the quotient of gamma functions developed by Frame (Frame 1949),
using this approximation yields
where
which in the limit matches the mode in (32).
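The relationship between the mode and the mean can be checked numerically. The sketch below assumes the density has exponent \(2\alpha_{i}-1\) on \(x_{i}\) for i<m and \(2\alpha_{m}-2\) on \(x_{m}\), so that the Lagrange conditions give \(x_{i}^{2}=\frac{2\alpha_{i}-1}{2\alpha_{0}-m-1}\) for i<m and \(x_{m}^{2}=\frac{2\alpha_{m}-2}{2\alpha_{0}-m-1}\), with \(\alpha_{0}=\sum_{i}\alpha_{i}\). The helper names `sdd_mode` and `sdd_mean` are hypothetical.

```python
import math


def sdd_mode(alpha):
    """Interior mode of the SDD under the assumed density form."""
    m = len(alpha)
    if any(a <= 0.5 for a in alpha[:-1]) or alpha[-1] <= 1.0:
        raise ValueError("needs alpha_i > 1/2 for i < m and alpha_m > 1")
    denom = 2.0 * sum(alpha) - m - 1.0
    squares = [(2.0 * a - 1.0) / denom for a in alpha[:-1]]
    squares.append((2.0 * alpha[-1] - 2.0) / denom)
    return [math.sqrt(v) for v in squares]


def sdd_mean(alpha):
    """E[x_i] as a ratio of gamma functions, evaluated on the log scale."""
    a0 = sum(alpha)
    log_c = math.lgamma(a0) - math.lgamma(a0 + 0.5)
    return [math.exp(math.lgamma(a + 0.5) - math.lgamma(a) + log_c)
            for a in alpha]


# Symmetric case from the text: alpha_i = a for i < m, alpha_m = a + 1/2.
m = 3
for a in (1.0, 10.0, 100.0, 1000.0):
    alpha = [a] * (m - 1) + [a + 0.5]
    print(a, sdd_mode(alpha)[0], sdd_mean(alpha)[0], 1.0 / math.sqrt(m))
```

In this symmetric case the mode stays at \(1/\sqrt{m}\) for every admissible α, while the first component of the mean approaches that value only as α grows.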
Relationships of the SDD with other distributions
In this section we explore the relationships, or lack thereof, between the SDD and other popular distributions such as the uniform, von Mises and Fisher-Bingham distributions. We consider limiting cases for different values of the concentration parameters αi.
Limiting case symmetric distribution for large α
Assuming a symmetric SDD with αi=α for all i, we can write
subject to the restrictions
in this case the covariance matrix can be reduced to
where
in an attempt to write the SDD as a rotational distribution of the type shown by Mardia (Mardia and Jupp 2000), the latter expression can be rewritten as
or equivalently
We could not establish an equivalence to the von Mises or similar rotationally symmetric distributions; however, using the expression developed by Frame (Frame 1949), we can see that in the limiting case α→∞, and consequently α0→∞, we have
and
which in the limit yields
We conclude that for large values of α the covariance matrix tends to zero; consequently, the SDD concentrates at a single vector with no variation.
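This limit is easy to illustrate numerically. The sketch below assumes the Dirichlet-based identity \(Var(x_{i})=\frac{\alpha_{i}}{\alpha_{0}}-E[x_{i}]^{2}\) with \(E[x_{i}]=\frac{\Gamma(\alpha_{i}+\frac{1}{2})\Gamma(\alpha_{0})}{\Gamma(\alpha_{i})\Gamma(\alpha_{0}+\frac{1}{2})}\); the function name `sdd_variance` is our own.

```python
import math


def sdd_variance(alpha):
    """Var(x_i) = alpha_i / alpha_0 - E[x_i]^2 for each coordinate."""
    a0 = sum(alpha)
    out = []
    for a in alpha:
        mean = math.exp(math.lgamma(a + 0.5) - math.lgamma(a)
                        + math.lgamma(a0) - math.lgamma(a0 + 0.5))
        out.append(a / a0 - mean * mean)
    return out


# Symmetric SDD in m = 3 dimensions with increasing concentration.
m = 3
for a in (1.0, 10.0, 100.0, 1000.0):
    print(a, sdd_variance([a] * m)[0])
```

The printed variances shrink roughly in proportion to 1/α, which agrees with the statement that the distribution collapses onto a single direction.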
Limiting case uniform distribution
We now consider the case where \(\alpha _{i}=\frac {1}{2}\) for i<m and αm=1. The SDD takes the form
which is a constant, independent of the values of xi; the SDD then becomes the uniform distribution over the positive orthant of the hypersphere.
Similarities and differences of the SDD with the von Mises and Fisher-Bingham distributions
The von Mises distribution is usually considered the analogue of the normal distribution on the circle, as described by Mardia (Mardia 1975). The von Mises distribution and, on the three-dimensional sphere, the Fisher-Bingham distribution tend to a normal distribution (multivariate and bivariate, respectively) for large values of κ, as shown by Kent (Kent 1982).
The proposed SDD does not seem to converge to the von Mises distribution or to a multivariate normal distribution for large values of αi; rather, it tends to concentrate at a single vector, as established at the end of the section on the limiting case for large α.
Moreover, both the von Mises and the Fisher-Bingham distributions converge to the uniform distribution for very small values of κ, in a similar way to how the SDD becomes the uniform distribution for the parameter values described in the previous subsection.
Inference for the spherical-Dirichlet distribution
We now consider estimation of the parameters of the SDD. Our main interest is to develop suitable procedures to estimate the set of parameters αi, given a sample of random vectors located at the positive orthant of the hypersphere. We first derive estimators for αi using the method of moments (MOM), and next we develop estimators for the same set of parameters using the method of maximum likelihood estimation (MLE).
Method of moments (MOM)
Following a procedure similar to the one developed by Narayanan (Narayanan 1992) to estimate the parameters of the Dirichlet distribution, suppose we have a random sample of n i.i.d. random vectors X1,X2,…,Xn such that \(X_{i}\in \left \{\mathbf{x}\in \Re ^{m} : x_{j}>0,\ j=1,\ldots ,m;\ \sum _{j=1}^{m}x_{j}^{2}=1\right \}\); then
and
We define the sample moments as
and
We have m-1 first order moment equations and m-1 second order moment equations to solve for m unknowns αi. To avoid linear dependency and for the sake of simplicity we choose one of the first order moments, and m-1 of the second order moment equations
then, the remaining m-1 second order moment equations are
There is no closed form solution for αi when solving (45) and (46) simultaneously, so we must solve them numerically to obtain the method of moments estimators of αi. The MOM results can be used as initial values for the MLE, which usually exhibits better statistical properties.
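One way to carry out such a numerical solution is sketched below. It is not necessarily the system (45) and (46) used in the paper: it assumes the second-moment identity \(E[x_{i}^{2}]=\alpha_{i}/\alpha_{0}\) to fix the proportions \(\alpha_{i}/\alpha_{0}\) from the sample, and then solves a single first-moment equation for \(\alpha_{0}=\sum_{i}\alpha_{i}\) by bisection. The function names, the bracketing interval and the simulated sample are illustrative choices.

```python
import math
import random


def first_moment(a1, a0):
    """E[x_1] for an SDD with first parameter a1 and parameter sum a0."""
    return math.exp(math.lgamma(a1 + 0.5) - math.lgamma(a1)
                    + math.lgamma(a0) - math.lgamma(a0 + 0.5))


def sdd_mom(sample, lo=1e-3, hi=1e4, iters=200):
    """Moment-matching sketch: proportions from second moments, scale from E[x_1]."""
    n = len(sample)
    m = len(sample[0])
    # Second sample moments estimate the proportions alpha_i / alpha_0.
    p = [sum(x[i] ** 2 for x in sample) / n for i in range(m)]
    # The first sample moment of coordinate 1 pins down alpha_0.
    m1 = sum(x[0] for x in sample) / n
    # Assumption: at fixed proportions E[x_1] increases with alpha_0,
    # so a bisection on the log scale locates the matching value.
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if first_moment(p[0] * mid, mid) < m1:
            lo = mid
        else:
            hi = mid
    a0 = math.sqrt(lo * hi)
    return [pi * a0 for pi in p]


# Simulated sample with illustrative parameters (treated as unknown).
rng = random.Random(0)
true_alpha = [2.0, 3.0, 4.0]
sample = []
for _ in range(20000):
    g = [rng.gammavariate(a, 1.0) for a in true_alpha]
    t = sum(g)
    sample.append([math.sqrt(v / t) for v in g])
est = sdd_mom(sample)
print(est)  # expected to be close to true_alpha, up to sampling error
```

Reducing the problem to one unknown keeps the root search simple; the resulting estimates can then serve as starting values for the likelihood-based procedure.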
Maximum likelihood estimation (MLE)
Suppose that we have a random sample of vectors on the positive orthant of the hypersphere, X1,X2,....Xn, where Xi∈ℜm from an SDD with pdf defined at (3). Then, the log-likelihood is
The parameters of an SDD can be estimated by maximizing the log-likelihood function of the data, following a procedure similar to the one used by Minka for the Dirichlet distribution (Minka 2000). We can group all the constant terms as K and rewrite the products and sums as
where the function that needs to be optimized after removing unnecessary constants is
The gradient of the objective function can be obtained by differentiating the log-likelihood lnF(α) with respect to αk as
where \(\Psi (x):=\frac {d\ln \Gamma (x)}{dx}\) is the digamma function. The optimization is subject to the constraints \(\alpha _{i}>0\). The SDD is a member of the exponential family, so its log-likelihood is a concave function of the parameters, and at the maximum the observed sufficient statistic is equal to the expected sufficient statistic, where the latter is
and the observed sufficient statistic is
that leads to the following iterative procedure
Although the proposed procedure does not guarantee in general reaching a global maximum, updating successively (51) provides reasonable results, and convergence is typically fast.
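The iterative procedure can be sketched as follows, under the assumption that setting the gradient to zero gives the update \(\Psi(\alpha_{k}^{new})=\Psi(\alpha_{0}^{old})+\frac{2}{n}\sum_{j=1}^{n}\ln x_{jk}\), which is Minka's Dirichlet fixed-point update applied to the squared coordinates. The Python standard library has no digamma function, so the sketch includes small series-based helpers. All function names are illustrative, and the starting value of ones stands in for the MOM estimates.

```python
import math
import random


def digamma(x):
    """psi(x) for x > 0: recurrence up to x >= 6, then an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1 / 12 - f * (1 / 120 - f / 252))


def trigamma(x):
    """psi'(x) for x > 0, using the same strategy as digamma."""
    r = 0.0
    while x < 6.0:
        r += 1.0 / (x * x)
        x += 1.0
    f = 1.0 / (x * x)
    return r + 1.0 / x + 0.5 * f + (f / x) * (1 / 6 - f * (1 / 30 - f / 42))


def inv_digamma(y, steps=5):
    """Solve psi(x) = y by Newton's method (starting point as in Minka 2000)."""
    x = math.exp(y) + 0.5 if y >= -2.22 else -1.0 / (y + 0.5772156649015329)
    for _ in range(steps):
        x -= (digamma(x) - y) / trigamma(x)
    return x


def sdd_mle(sample, tol=1e-9, max_iter=5000):
    """Fixed-point iteration for the SDD parameters (sketch)."""
    n = len(sample)
    m = len(sample[0])
    # Observed sufficient statistics: mean of ln(x_k^2) for each coordinate.
    s = [sum(2.0 * math.log(x[k]) for x in sample) / n for k in range(m)]
    alpha = [1.0] * m
    for _ in range(max_iter):
        psi0 = digamma(sum(alpha))
        new = [inv_digamma(psi0 + sk) for sk in s]
        if max(abs(a - b) for a, b in zip(new, alpha)) < tol:
            return new
        alpha = new
    return alpha


# Simulated sample with illustrative parameters (treated as unknown).
rng = random.Random(0)
true_alpha = [2.0, 3.0, 4.0]
sample = []
for _ in range(2000):
    g = [rng.gammavariate(a, 1.0) for a in true_alpha]
    t = sum(g)
    sample.append([math.sqrt(v / t) for v in g])
fit = sdd_mle(sample)
print(fit)  # expected to be close to true_alpha, up to sampling error
```

If SciPy is available, `scipy.special.digamma` and `scipy.special.polygamma` could replace the hand-written helpers.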
Applications to data
We now apply the estimation procedures to data. We first developed an example using simulated data generated from the proposed SDD, with parameters treated as unknown for the purpose of estimation. Next, a second example was developed using text mining data obtained from a publicly available data set. Both examples were solved using MOM and MLE, as described in the corresponding sections, and the results obtained from both methods were compared.
Simulation example
Four different simulations were performed, each with 1,000 randomly generated values from an SDD in three dimensions with known values of the parameters α1, α2 and α3. Plots of the SDD for the four sets of parameter values are shown in Figs. 2 and 3. The parameters, assumed to be unknown, were then estimated using the MOM and MLE procedures developed in the corresponding sections.
First, an estimation was performed using MOM, iterating between (45) and (46). These values were updated in each cycle until convergence was achieved within a pre-set tolerance limit. The MOM estimates were used as the initial values for the iterative MLE process, in which expression (51) was updated successively until the values of the parameters were stable within a pre-set tolerance level. The estimates from both methods and the true values of the parameters are shown in Table 1.
Note the close agreement between the MLEs and the MOM estimates in Table 1.
Text mining example
A text mining example was developed using a publicly available data set assembled by Lang (Lang). The data set contains email messages from several interest groups; the “auto” topic was selected and summarized using standard data mining techniques. A random sample of 160 documents (emails) was extracted and summarized as vectors on the positive orthant of the hypersphere. Common terms such as “from” or “subject” were excluded, as they did not provide any discriminant power and could potentially bias the analysis. Vocabulary reduction for synonyms and stemming were performed, and the ten most common terms were extracted by obtaining their raw frequencies. The frequencies of these terms were expressed as the components of vectors in a ten-dimensional space. A small fraction of the data set can be seen in Table 2.
An appropriate transformation was applied to these vectors to reduce extreme values and eliminate zeros. The transformation used here was \(x_{transf}=\ln (1.10+x)\). The transformed vectors were standardized to unit length on the positive orthant of the hypersphere and fitted using the proposed multivariate SDD in ten dimensions. The corresponding αi’s were estimated using both MOM and MLE, and the estimated values are shown in Table 3.
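The preprocessing described above can be sketched as follows. The raw frequency vector is hypothetical and only illustrates the steps; the constant 1.10 and the natural logarithm are taken from the text, while the function name `to_positive_orthant` is our own.

```python
import math


def to_positive_orthant(counts, shift=1.10):
    """Apply ln(shift + x) to raw term counts, then scale to unit length."""
    t = [math.log(shift + c) for c in counts]
    norm = math.sqrt(sum(v * v for v in t))
    return [v / norm for v in t]


# Hypothetical raw frequencies of ten terms in one document (zeros allowed).
counts = [3, 0, 1, 0, 7, 2, 0, 0, 1, 4]
x = to_positive_orthant(counts)
print(x)
print(sum(v * v for v in x))  # unit length up to rounding
```

Because ln(1.10) is positive, a zero count is mapped to a small positive coordinate, so every transformed vector lies strictly inside the positive orthant.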
The MOM procedure needed 271 iterations to fit the SDD within a preset tolerance level. The final MOM estimates were used as the initial values for the MLE procedure, and a new model was fitted using 19 additional iterations. Although the MLE procedure does not in general guarantee finding a global maximum, the proposed method provided reasonable results and convergence was fast.
Conclusions
The proposed SDD constitutes a superior alternative to competing methods for fitting unit vectors on the positive orthant of the hypersphere, in that it avoids wasting probability mass or using distribution mixtures that are not suitable for that subspace. Inference results for MOM and MLE were in close agreement for the simulated data, which were randomly generated from the proposed SDD, and reasonably close for the real text mining example. The SDD is flexible and shows a rich variety of shapes suitable for fitting a wide range of data, in a similar way to the beta distribution in a one-dimensional space. Under an appropriate transformation it can also accommodate zeros in some coordinates of the vectors. Future research may aim to enhance the capability of handling zero-valued components, avoiding the need to transform the data.
Availability of data and materials
The data for the text mining example were obtained from the publicly available data set assembled by Lang (Lang). The particular sample analysed during the current study is available from the corresponding author on reasonable request.
References
Frame, J. S.: An approximation to the quotient of gamma functions. Am. Math. Mon. 56(8), 529–535 (1949).
Kent, J. T.: The Fisher-Bingham distribution on the sphere. J. R. Stat. Soc. Ser. B Methodol. 44(1), 71–80 (1982). https://doi.org/10.1111/j.2517-6161.1982.tb01189.x.
Lang, K.: CMU Text Learning Group Data Archives. https://www.cs.cmu.edu/afs/cs/project/theo-20/www/data/news20.html. Accessed 1 Sep 2019.
Mardia, K. V.: Statistics of directional data. J. R. Stat. Soc. Ser. B Methodol. 37(3), 349–393 (1975).
Mardia, K. V., Jupp, P. E.: Directional Statistics, 2nd edn. Wiley series in probability and statistics. Wiley, Chichester (2000).
Minka, T. P.: Estimating a Dirichlet distribution. Technical Report, MIT (2000). https://www.microsoft.com/en-us/research/publication/estimating-dirichlet-distribution/.
Narayanan, A.: A note on parameter estimation in the multivariate beta distribution. Comput. Math. Appl. 24(10), 11–17 (1992). https://doi.org/10.1016/0898-1221(92)90016-B.
Olkin, I., Rubin, H.: Multivariate beta distributions and independence properties of the Wishart distribution. Ann. Math. Stat. 35(1), 261–269 (1964).
Suvrit, S.: Directional statistics in machine learning: a brief review. arXiv e-prints, 1605–00316 (2016). http://arxiv.org/abs/1605.00316.
Acknowledgements
The author is grateful to the invaluable help of Eduardo García-Portugués from the Department of Statistics, Carlos III University of Madrid (Spain).
Funding
Travel funding for the completion of this project was received by a Research Enhancement grant from the Texas A&M University-Corpus Christi Division of Research and Innovation.
Contributions
JHG is the sole author of this article. The author(s) read and approved the final manuscript.
Ethics declarations
Competing interests
The author declares that he has no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Guardiola, J.H. The spherical-Dirichlet distribution. J Stat Distrib App 7, 6 (2020). https://doi.org/10.1186/s40488-020-00106-9
DOI: https://doi.org/10.1186/s40488-020-00106-9