# The spherical-Dirichlet distribution

## Abstract

Data mining and gene expression analysis are at the forefront of modern data analysis, and here we introduce a novel probability distribution applicable in these fields. This paper develops the proposed spherical-Dirichlet distribution, designed to fit vectors located on the positive orthant of the hypersphere, as is often the case for data in these fields, avoiding unnecessary probability mass. Basic properties of the proposed distribution, including normalizing constants and moments, are developed. Relationships with other distributions are also explored. Estimators based on classical inferential statistics, such as the method of moments and maximum likelihood estimators, are obtained. Two applications are developed: the first uses simulated data, and the second uses a real text mining example. Both examples are fitted with the proposed spherical-Dirichlet distribution and their results are discussed.

## Introduction

In text mining and gene expression analysis, collections of texts are represented in a vector-space model: texts, once standardized, are coded as vectors on a sphere of higher dimensions, also called a hypersphere (Suvrit 2016). Many researchers currently model these data with mixtures of existing probability densities; however, these approximations spread probability mass over the whole hypersphere when it is only needed on the positive orthant. This is mainly due to the lack of suitable distributions for that subspace. The proposed distribution fills that void, allowing for an efficient modeling of these vectors.

## Basic properties

In this section we introduce the proposed spherical-Dirichlet distribution, its moments and basic properties.

### Probability density function and normalizing constant

The spherical-Dirichlet distribution is obtained by transforming the Dirichlet distribution on the simplex to the corresponding space on the hypersphere. In this section we derive the density and compute the normalizing constant. Let $\mathbf{y}$ have a Dirichlet distribution on the simplex as described by Olkin and Rubin (Olkin and Rubin 1964).

$$\begin{array}{*{20}l} f_{\text{Dir}}(\mathbf{y};\alpha)&=\frac{\Gamma(\alpha_{0})}{{\prod_{i=1}^{m}\Gamma({\alpha_{i})}}}\prod_{i=1}^{m}{y_{i}}^{\alpha_{i}-1}\\ &=\frac{\Gamma(\alpha_{0})}{{\prod_{i=1}^{m}\Gamma({\alpha_{i})}}}\prod_{i=1}^{m-1}{y_{i}}^{\alpha_{i}-1}\left(1-\sum_{i=1}^{m-1}y_{i}\right)^{(\alpha_{m}-1)} \end{array}$$
(1)

where

$$\begin{array}{*{20}l} \alpha_{i}\in\Re^{+},\;\; \alpha_{0}=:{\sum_{i=1}^{m}{\alpha_{i}}}, \;\; 0\leqq y_{i}\leqq1, \;\; \sum_{i=1}^{m}{y_{i}}=1, \end{array}$$

Transforming the Dirichlet distribution from the simplex to the positive orthant of the hypersphere (Fig. 1), we take the square root transformation

$$\begin{array}{*{20}l} x_{i} = \sqrt{y_{i}}, \;\; y_{i} = {x_{i}}^{2}, \;\; \frac{\partial y_{i}}{\partial x_{i}}=2x_{i}, \text{ for } i=1,\dots,(m-1), \;\;x_{m}=\sqrt{y_{m}}. \end{array}$$
(2)

Computing the Jacobian over the $m-1$ independent variables, where $\frac{\partial y_{i}}{\partial x_{j}}=2x_{i}$ if $i=j$ and $0$ otherwise, it follows that

$$J = \left|\begin{array}{cccc} 2x_{1} & 0 & \dots & 0 \\ 0 & 2x_{2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 2x_{m-1} \end{array}\right| = \prod_{i=1}^{m-1} {2x_{i}}=2^{m-1}\prod_{i=1}^{m-1} {x_{i}}$$

applying the proposed transformation to (1) results in

$$\begin{array}{*{20}l} f_{\text{SDir}}(\mathbf{x};\alpha)&=\frac{2^{m-1}\Gamma(\alpha_{0})}{{\prod_{i=1}^{m}\Gamma({\alpha_{i})}}}\prod_{i=1}^{m-1}{x_{i}}^{2\alpha_{i}-1}\cdot x_{m}^{2\alpha_{m}-2}\\ &=\frac{2^{m-1}\Gamma(\alpha_{0})}{{\prod_{i=1}^{m}\Gamma({\alpha_{i})}}}\prod_{i=1}^{m}{x_{i}}^{2\alpha_{i}-1}\cdot x_{m}^{-1}\\ &=\frac{2^{m-1}\Gamma(\alpha_{0})}{{\prod_{i=1}^{m}\Gamma({\alpha_{i})}}}\prod_{i=1}^{m-1}{x_{i}}^{2\alpha_{i}-1}\left(1-\sum_{i=1}^{m-1}x_{i}^{2}\right)^{(\alpha_{m}-1)} \end{array}$$
(3)

where

$$\begin{array}{*{20}l} \alpha_{0}=:{\sum_{i=1}^{m}{\alpha_{i}}}, \;\; \alpha_{i}\in\Re^{+},\;\; 0\leqq x_{i}\leqq1, \;\; \sum_{i=1}^{m}{x_{i}^{2}}=1. \end{array}$$

We refer to (3) as the spherical-Dirichlet distribution (SDD) and write $\mathbf{x}\sim\text{SDD}(\boldsymbol{\alpha})$. We refer to the parameters $\alpha_{i}$ as concentration parameters, in analogy with the corresponding parameters of the Dirichlet distribution.
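As a concrete check of (3), the density can be evaluated on the log scale; the sketch below is ours, not the paper's (the function name `sdd_logpdf` and the stdlib-only style are our assumptions), and uses log-gamma differences to avoid overflow:

```python
from math import lgamma, log

def sdd_logpdf(x, alpha):
    """Log of the SDD density in eq. (3).

    x     -- vector on the positive orthant of the unit sphere (sum x_i^2 = 1)
    alpha -- concentration parameters alpha_1, ..., alpha_m, all positive
    """
    m = len(alpha)
    a0 = sum(alpha)
    # log normalizing constant: 2^{m-1} Gamma(alpha_0) / prod_i Gamma(alpha_i)
    log_c = (m - 1) * log(2.0) + lgamma(a0) - sum(lgamma(a) for a in alpha)
    # log kernel: prod_i x_i^{2 alpha_i - 1} times x_m^{-1}
    log_k = sum((2.0 * a - 1.0) * log(xi) for xi, a in zip(x, alpha)) - log(x[-1])
    return log_c + log_k
```

For instance, for $m=2$ and $\alpha=(\frac{1}{2},1)$ the density (3) reduces to a constant equal to one, so the log-density vanishes everywhere on the quarter circle.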

### Moments

In this section we compute the first and second order moments, the mode, the variances and covariances, and the corresponding covariance matrix. First, we compute the expected value of one of the variables; for example, consider the expected value of x1

$$\begin{array}{*{20}l} E(x_{1})&=\int\dots\int\frac{2^{m-1}\Gamma(\alpha_{0})}{{\prod_{i=1}^{m}\Gamma({\alpha_{i})}}}x_{1}\left(\prod_{i=1}^{m}{x_{i}}^{2\alpha_{i}-1}\right)\cdot x_{m}^{-1} {\mathrm{d}x_{1}}\dots{\mathrm{d}x_{m}} \end{array}$$
(4)
$$\begin{array}{*{20}l} &=\int\dots\int\frac{2^{m-1}\Gamma(\alpha_{0})}{{\prod_{i=1}^{m}\Gamma({\alpha_{i})}}}{x_{1}}^{2(\alpha_{1}+\frac{1}{2})-1}\left(\prod_{i=2}^{m}{x_{i}}^{2\alpha_{i}-1}\right)\cdot x_{m}^{-1}{\mathrm{d}x_{1}}\dots{\mathrm{d}x_{m}}, \end{array}$$
(5)

where we recognize the expression inside the integral as the kernel of the proposed SDD with a new first parameter $$\alpha _{1}+\frac {1}{2}$$; we can then immediately rewrite this expression as

$$\begin{array}{*{20}l} E(x_{1})&=\frac{2^{m-1}\Gamma(\alpha_{0})}{\prod_{i=1}^{m}\Gamma(\alpha_{i})}\frac{\Gamma\left(\alpha_{1}+\frac{1}{2}\right)\prod_{i=2}^{m}\Gamma(\alpha_{i})}{{2^{m-1}\Gamma\left({\alpha_{0}+\frac{1}{2}}\right)}}\\ &=\frac{\Gamma(\alpha_{0})}{\Gamma(\alpha_{0}+\frac{1}{2})}\frac{\Gamma\left(\alpha_{1}+\frac{1}{2}\right)}{{\Gamma({\alpha_{1}})}}, \end{array}$$
(6)

we define μi as,

$$\begin{array}{*{20}l} \mu_{i}=:\frac{\Gamma\left(\alpha_{i}+\frac{1}{2}\right)}{\Gamma(\alpha_{i})}, \end{array}$$
(7)

the expected value from (6) can be rewritten as,

$$\begin{array}{*{20}l} E(x_{i})=\frac{\mu_{i}}{\mu_{0}}. \end{array}$$
(8)

The general solution for the first moment of a vector $\mathbf{x}=(x_{1},\dots,x_{m})^{T}$ with a vector of parameters $\boldsymbol{\alpha}=(\alpha_{1},\dots,\alpha_{m})^{T}$ can be written as

$$\begin{array}{*{20}l} E(\mathbf{x})=\frac{\Gamma(\alpha_{0})}{\Gamma(\alpha_{0}+\frac{1}{2})}\left(\frac{\Gamma\left({\alpha_{1}+\frac{1}{2}}\right)}{\Gamma(\alpha_{1})},\dots,\frac{\Gamma\left({\alpha_{m}+\frac{1}{2}}\right)}{\Gamma(\alpha_{m})}\right) =\frac{1}{\mu_{0}}\frac{\Gamma\left({\boldsymbol\alpha+\frac{1}{2}}\right)}{\Gamma(\boldsymbol\alpha)}, \end{array}$$
(9)

let

$$\begin{array}{*{20}l} \boldsymbol{\mu}=:\frac{\Gamma\left({\boldsymbol\alpha+\frac{1}{2}}\right)}{\Gamma(\boldsymbol\alpha)},\quad C=:\frac{||{\boldsymbol{\mu}}||}{\mu_{0}},\quad \bar{\boldsymbol{\mu}}=:\frac{\boldsymbol{\mu}}{||{\boldsymbol{\mu}}||},\quad \bar{\boldsymbol{\mu}}\in\Omega_{m-1}, \end{array}$$
(10)

then, the expected value for a vector x can also be written as

$$\begin{array}{*{20}l} E(\mathbf{x})=\frac{\boldsymbol{\mu}}{\mu_{0}}=\frac{||{\boldsymbol{\mu}}||}{\mu_{0}}\cdot\frac{\boldsymbol{\mu}}{||{\boldsymbol{\mu}}||}=C\cdot\bar{\boldsymbol{\mu}}. \end{array}$$
(11)
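Equations (7)-(11) translate directly into code; a minimal sketch (the helper name `sdd_mean` is ours), again working with `lgamma` differences so large $\alpha_{i}$ do not overflow the gamma function:

```python
from math import lgamma, exp

def sdd_mean(alpha):
    """E(x) = mu / mu_0 from eqs. (7)-(8), via log-gamma differences."""
    a0 = sum(alpha)
    mu0 = exp(lgamma(a0 + 0.5) - lgamma(a0))   # mu_0 = Gamma(a0 + 1/2) / Gamma(a0)
    mu = [exp(lgamma(a + 0.5) - lgamma(a)) for a in alpha]
    return [mi / mu0 for mi in mu]
```

Consistent with (11), the mean lies strictly inside the sphere, since $\|E(\mathbf{x})\|=C<1$.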

Similarly, we compute the expected value for $$x_{1}^{2}$$ as

$$\begin{array}{*{20}l} E(x_{1}^{2})&=\int\dots\int\frac{2^{m-1}\Gamma(\alpha_{0})}{{\prod_{i=1}^{m}\Gamma({\alpha_{i})}}}x_{1}^{2}\left(\prod_{i=1}^{m}{x_{i}}^{2\alpha_{i}-1}\right)\cdot x_{m}^{-1}{\mathrm{d}x_{1}}\dots{\mathrm{d}x_{m}} \end{array}$$
(12)
$$\begin{array}{*{20}l} &=\frac{2^{m-1}\Gamma(\alpha_{0})}{{\prod_{i=1}^{m}\Gamma({\alpha_{i})}}}\int\dots\int x_{1}^{{2(\alpha_{1}+1)-1}}\left(\prod_{i=2}^{m}{x_{i}}^{2\alpha_{i}-1}\right)\cdot x_{m}^{-1}{\mathrm{d}x_{1}}\dots{\mathrm{d}x_{m}}, \end{array}$$
(13)

again, we recognize the expression inside the integral as the kernel of the proposed SDD with a new first parameter α1+1, which yields

$$\begin{array}{*{20}l} E(x_{1}^{2})&=\frac{2^{m-1}\Gamma(\alpha_{0})}{{\prod_{i=1}^{m}\Gamma({\alpha_{i})}}}\frac{\Gamma(\alpha_{1}+1)\prod_{i=2}^{m}\Gamma(\alpha_{i})}{{2^{m-1}\Gamma({\alpha_{0}+1})}}\\ &=\frac{\Gamma(\alpha_{0})}{\Gamma(\alpha_{0}+1)}\frac{\Gamma(\alpha_{1}+1)}{\Gamma(\alpha_{1})}=\frac{\alpha_{1}}{\alpha_{0}}, \end{array}$$
(14)

this result can be generalized to

$$\begin{array}{*{20}l} E(x_{i}^{2})=\frac{\alpha_{i}}{\alpha_{0}}. \end{array}$$
(15)

Moreover, the variance for any variable xi is

$$\begin{array}{*{20}l} V(x_{i})=\frac{\alpha_{i}}{\alpha_{0}}-\frac{\mu_{i}^{2}}{\mu_{0}^{2}}, \end{array}$$
(16)

and the covariance for x1,x2 can be written as

$$\begin{array}{*{20}l} E(x_{1}{\cdot}x_{2})=\int\dots\int \frac{2^{m-1}\Gamma(\alpha_{0})}{{\prod_{i=1}^{m}\Gamma({\alpha_{i})}}}{x_{1}}\cdot{x_{2}}\left(\prod_{i=1}^{m}{x_{i}}^{2\alpha_{i}-1}\right)\cdot x_{m}^{-1}{\mathrm{d}x_{1}}\dots{\mathrm{d}x_{m}}, \end{array}$$
(17)

after some rearrangement, we can identify the kernel of the proposed SDD with the first two parameters $$\alpha _{1}+\frac {1}{2}$$ and $$\alpha _{2}+\frac {1}{2}$$; solving the corresponding integral, the result takes the form

$$\begin{array}{*{20}l} E(x_{1}{\cdot}x_{2})=\frac{\Gamma\left(\alpha_{1}+\frac{1}{2}\right)\Gamma\left(\alpha_{2}+\frac{1}{2}\right)}{\alpha_{0}\Gamma(\alpha_{1})\Gamma(\alpha_{2})}=\frac{\mu_{1}\cdot\mu_{2}}{\alpha_{0}}. \end{array}$$
(18)

In general, for any pair of variables (xi,xj) we can write

$$\begin{array}{*{20}l} E(x_{i}{\cdot}x_{j})=\delta_{ij}\cdot\frac{\alpha_{i}}{\alpha_{0}}+ (1 -\delta_{ij})\cdot\frac{\mu_{i}\cdot\mu_{j}}{\alpha_{0}}, \end{array}$$
(19)

where δij is the Kronecker delta. We can also write the covariance for any pair of variables (xi,xj) as

$$\begin{array}{*{20}l} COV(x_{i},x_{j})=\left(\frac{1}{\alpha_{0}}-\frac{1}{\mu_{0}^{2}}\right)\mu_{i}\cdot\mu_{j} \text{ for }{i}\neq{j}. \end{array}$$
(20)

Combining (16) and (20), the covariance for any pair of variables (xi,xj) can be written compactly as

$$\begin{array}{*{20}l} COV(x_{i},x_{j})=\delta_{ij}\cdot\left(\frac{\alpha_{i}}{\alpha_{0}}-\frac{\mu_{i}^{2}}{\mu_{0}^{2}}\right)+(1-\delta_{ij})\cdot\left(\frac{1}{\alpha_{0}}-\frac{1}{\mu_{0}^{2}}\right)\mu_{i}\cdot\mu_{j}, \end{array}$$
(21)

that in matrix notation can also be written as

$$\boldsymbol{\Sigma}= \left[ {\begin{array}{cccc} \frac{\alpha_{1}}{\alpha_{0}}-\frac{\mu_{1}^{2}}{\mu_{0}^{2}} & \left(\frac{1}{\alpha_{0}}-\frac{1}{\mu_{0}^{2}}\right)\mu_{1}\mu_{2} & \dots & \left(\frac{1}{\alpha_{0}}-\frac{1}{\mu_{0}^{2}}\right)\mu_{1}\mu_{m} \\ \left(\frac{1}{\alpha_{0}}-\frac{1}{\mu_{0}^{2}}\right)\mu_{2}\mu_{1} & \frac{\alpha_{2}}{\alpha_{0}}-\frac{\mu_{2}^{2}}{\mu_{0}^{2}} & \dots & \left(\frac{1}{\alpha_{0}}-\frac{1}{\mu_{0}^{2}}\right)\mu_{2}\mu_{m} \\ \vdots & \vdots & \ddots & \vdots \\ \left(\frac{1}{\alpha_{0}}-\frac{1}{\mu_{0}^{2}}\right)\mu_{m}\mu_{1} & \left(\frac{1}{\alpha_{0}}-\frac{1}{\mu_{0}^{2}}\right)\mu_{m}\mu_{2} & \dots & \frac{\alpha_{m}}{\alpha_{0}}-\frac{\mu_{m}^{2}}{\mu_{0}^{2}} \end{array}} \right],$$

an equivalent expression is

$$\boldsymbol{\Sigma}= \frac{1}{\alpha_{0}} \left[ {\begin{array}{cccc} \alpha_{1}-\mu_{1}^{2} & 0 & \dots & 0 \\ 0 & \alpha_{2}-\mu_{2}^{2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \alpha_{m}-\mu_{m}^{2} \\ \end{array}} \right]- \left(\frac{1}{\mu_{0}^{2}}-\frac{1}{\alpha_{0}}\right)\boldsymbol{\mu}\boldsymbol{\mu}^{T},$$

equivalently, in terms of the quantities defined at (10), we have

$$\begin{array}{*{20}l} \boldsymbol{\Sigma}=\frac{1}{\alpha_{0}}diag(\boldsymbol{\alpha})-\frac{C^{2}\mu_{0}^{2}}{\alpha_{0}}diag(\bar{\boldsymbol{\mu}}\bar{\boldsymbol{\mu}}^{T})-C^{2}\left(1-\frac{\mu_{0}^{2}}{\alpha_{0}}\right)\bar{\boldsymbol{\mu}}\bar{\boldsymbol{\mu}}^{T}, \end{array}$$
(22)

where

$$\begin{array}{*{20}l} C=\frac{||{\boldsymbol{\mu}}||}{\mu_{0}},\quad \bar{\boldsymbol{\mu}}=\frac{\boldsymbol{\mu}}{||{\boldsymbol{\mu}}||},\quad \bar{\boldsymbol{\mu}}\in\Omega_{m-1}, \end{array}$$
(23)

that summarizes our results in a succinct form.

### Mode and relationship with the mean

The mode of the SDD can be determined by finding the values of xi that maximize the density; alternatively, we can maximize the log of the density, which is customary and usually easier. First, taking the natural log of the SDD and adding the constraint $$\sum _{i=1}^{m}x_{i}^{2}=1$$ through a Lagrange multiplier, we get

$$\begin{array}{*{20}l} {\ln}f_{\text{SDir}}(\mathbf{x},\alpha)=\ln\left(\frac{2^{m-1}\Gamma(\alpha_{0})}{\prod_{i=1}^{m}\Gamma(\alpha_{i})}\right)+\sum_{i=1}^{m}(2\alpha_{i}-1)\ln{x_{i}}\\-\ln x_{m}-\lambda\left(\sum_{i=1}^{m}x_{i}^{2}-1\right), \end{array}$$
(24)

taking derivatives with respect to xi and setting them to zero we have

$$\begin{array}{*{20}l} \frac{\partial {\ln}f_{\text{SDir}}}{\partial x_{i}}=(2\alpha_{i}-1)\frac{1}{x_{i}}-2{x_{i}}\lambda=0 \text{ for }i< m, \end{array}$$
(25)

solving for $$x_{i}^{2}$$, it yields

$$\begin{array}{*{20}l} x_{i}^{2}=\frac{2\alpha_{i}-1}{2\lambda} \text{ for } i< m, \end{array}$$
(26)

similarly, taking derivatives with respect to xm

$$\begin{array}{*{20}l} \frac{\partial {\ln}f_{\text{SDir}}}{\partial x_{m}}=(2\alpha_{m}-1)\frac{1}{x_{m}}-\frac{1}{x_{m}}-2{x_{m}}\lambda=0 \text{ for }i=m, \end{array}$$
(27)

and solving for xm, we have

$$\begin{array}{*{20}l} x_{m}^{2}=\frac{\alpha_{m}-1}{\lambda} \text{ for}\ i=m, \end{array}$$
(28)

substituting these results at the constraint, we can solve for λ as

$$\begin{array}{*{20}l} \lambda=\frac{1}{2}(2\alpha_{0}-m-1), \end{array}$$
(29)

where we can obtain the mode for xi as

$$\begin{array}{*{20}l} \text{(mode)}x_{i}=\sqrt{\frac{2\alpha_{i}-1}{2\alpha_{0}-m -1}} \text{ for}\ i< m, \end{array}$$
(30)

and for xm

$$\begin{array}{*{20}l} \text{(mode)}x_{m}=\sqrt{\frac{2(\alpha_{m}-1)}{2\alpha_{0}-m -1}} \text{ for}\ i=m. \end{array}$$
(31)
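Equations (30) and (31) can be sketched directly in code; the helper below is ours and assumes the mode exists, i.e. $2\alpha_{i}>1$ for $i<m$ and $\alpha_{m}>1$, so that both square roots have positive arguments:

```python
from math import sqrt

def sdd_mode(alpha):
    """Mode from eqs. (30)-(31); assumes 2*alpha_i > 1 for i < m and alpha_m > 1."""
    m = len(alpha)
    denom = 2.0 * sum(alpha) - m - 1.0          # 2*alpha_0 - m - 1, from eq. (29)
    mode = [sqrt((2.0 * a - 1.0) / denom) for a in alpha[:-1]]
    mode.append(sqrt(2.0 * (alpha[-1] - 1.0) / denom))
    return mode
```

Since $\sum_{i<m}(2\alpha_{i}-1)+2(\alpha_{m}-1)=2\alpha_{0}-m-1$, the mode automatically has unit length, i.e. it lies on the hypersphere.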

Considering the special case of a symmetric SDD, we set αi=α for i<m and $$\alpha _{m}=\alpha +\frac {1}{2}$$; both (30) and (31) then yield

$$\begin{array}{*{20}l} \text{(mode)}x_{i}=\sqrt{\frac{2\alpha-1}{m\cdot (2\alpha-1)}}=\frac{1}{\sqrt{m}}\text{ for }\alpha\neq \frac{1}{2},\ i\leq m, \end{array}$$
(32)

the mean for a symmetric SDD for αi=α for i<m, and $$\alpha _{m}=\alpha +\frac {1}{2}$$, yields

$$\begin{array}{*{20}l} E(x_{i})=\frac{\mu_{i}}{\mu_{0}}=\frac{\Gamma\left(\alpha+\frac{1}{2}\right)}{\Gamma(\alpha)}\cdot\frac{\Gamma(\alpha_{0})}{\Gamma\left(\alpha_{0}+\frac{1}{2}\right)}=\frac{\Gamma\left(\alpha+\frac{1}{2}\right)}{\Gamma(\alpha)}\cdot\frac{\Gamma\left(m\alpha+\frac{1}{2}\right)}{\Gamma(m\alpha+1)}, \end{array}$$
(33)

where we can see that the mode does not match the expected value for a symmetric SDD; however, we can still find an asymptotic relationship using the expression developed by Frame (Frame 1949),

$$\begin{array}{*{20}l} \frac{\Gamma(x+a)}{\Gamma(x)}\sim x^{a} \quad \text{as } x\to\infty, \end{array}$$
(34)

using this approximation, it follows that

$$\begin{array}{*{20}l} {\lim}_{\alpha\to\infty}E(x_{i})=\left(\alpha^{\frac{1}{2}}\right)\cdot \frac{1}{(m\alpha)^{\frac{1}{2}}}=\frac{1}{\sqrt{m}}, \end{array}$$
(35)

where

$$\alpha_{i}=\alpha \text{ for }i< m, \;\; \alpha_{m}=\alpha+\frac{1}{2}, \text{ and }\alpha\neq \frac{1}{2},$$

that in the limit matches the mode at (32).
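The limit (35) can be checked numerically by evaluating the exact symmetric mean (33) and comparing it with the limiting value $1/\sqrt{m}$; the sketch below is ours (function name and the particular values of $\alpha$ and $m$ are our choices):

```python
from math import lgamma, exp, sqrt

def sym_mean(alpha, m):
    """E(x_i) for the symmetric SDD of eq. (33):
    alpha_i = alpha for i < m, alpha_m = alpha + 1/2, so alpha_0 = m*alpha + 1/2."""
    a0 = m * alpha + 0.5
    mu_a = exp(lgamma(alpha + 0.5) - lgamma(alpha))
    mu_0 = exp(lgamma(a0 + 0.5) - lgamma(a0))
    return mu_a / mu_0

# the gap between the mean and the mode 1/sqrt(m) shrinks as alpha
# grows, in line with the limit (35)
gap_small = abs(sym_mean(2.0, 4) - 1 / sqrt(4))
gap_large = abs(sym_mean(200.0, 4) - 1 / sqrt(4))
```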

## Relationships of the SDD with other distributions

In this section we explore the relationships, or lack thereof, between the SDD and other popular distributions, such as the uniform, the von Mises, and its particular case, the Fisher-Bingham distribution. We consider limiting cases for different values of the concentration parameters αi.

### Limiting case symmetric distribution for large α

Assuming a symmetric SDD with αi=α for all i, we can write

$$\begin{array}{*{20}l} f_{\text{SDir}}(\mathbf{x};\alpha)&=2^{m-1}\frac{\Gamma(m\alpha)}{\Gamma(\alpha)^{m}}\prod_{i=1}^{m}{x_{i}}^{2\alpha-1} \cdot x_{m}^{-1}, \end{array}$$
(36)

subject to the restrictions

$$\begin{array}{*{20}l} 0\leqq x_{i}\leqq1, \;\; \sum_{i=1}^{m}{x_{i}^{2}}=1, \;\; \alpha\in\Re^{+}, \end{array}$$

in this case the covariance matrix can be reduced to

$$\begin{array}{*{20}l} \boldsymbol{\Sigma}=\frac{1}{m}\left(1-\frac{\mu_{\alpha}^{2}}{\alpha}\right)\boldsymbol{I}-\left(\frac{\mu_{\alpha}}{\mu_{0}}\right)^{2}\left(1-\frac{\mu_{0}^{2}}{m\alpha}\right)\boldsymbol{1}\boldsymbol{1}^{T}, \end{array}$$
(37)

where

$$\begin{array}{*{20}l} \mu_{\alpha}=\frac{\Gamma\left(\alpha+\frac{1}{2}\right)}{\Gamma(\alpha)},\quad \mu_{0}=\frac{\Gamma\left(\alpha_{0}+\frac{1}{2}\right)}{\Gamma(\alpha_{0})}, \end{array}$$

in an attempt to write the SDD as a rotational distribution of the type shown by Mardia (Mardia and Jupp 2000), the latter expression can be rewritten as

$$\begin{array}{*{20}l} \boldsymbol{\Sigma}=\left(1-\frac{\mu_{\alpha}^{2}}{\alpha}\right)\left(\frac{1}{m}\boldsymbol{I}-\bar{\boldsymbol{\mu}}\bar{\boldsymbol{\mu}}^{T}\right)+\left(1-m\frac{\mu_{\alpha}^{2}}{\mu_{0}^{2}}\right)\bar{\boldsymbol{\mu}}\bar{\boldsymbol{\mu}}^{T}, \end{array}$$
(38)

or equivalently

$$\begin{array}{*{20}l} \boldsymbol{\Sigma}=var(x)m\bar{\boldsymbol{\mu}}\bar{\boldsymbol{\mu}}^{T}+\left(\frac{1-\frac{\mu_{\alpha}^{2}}{\alpha}}{m}\right)\left(\boldsymbol{I}-m\bar{\boldsymbol{\mu}}\bar{\boldsymbol{\mu}}^{T}\right), \end{array}$$
(39)

we cannot determine an equivalence to the von Mises or similar rotationally symmetric distributions; however, using the expression developed by Frame (Frame 1949), we can see that in the limiting case for large α (and consequently large α0) we have

$$\begin{array}{*{20}l} \mu_{\alpha}=\frac{\Gamma\left(\alpha+\frac{1}{2}\right)}{\Gamma(\alpha)}\sim\alpha^{\frac{1}{2}} \quad \text{as } \alpha\to\infty, \end{array}$$

and

$$\begin{array}{*{20}l} \mu_{0}=\frac{\Gamma\left(m\alpha+\frac{1}{2}\right)}{\Gamma(m\alpha)}\sim(m\alpha)^{\frac{1}{2}} \quad \text{as } \alpha\to\infty, \end{array}$$

which in the limit yields

$$\begin{array}{*{20}l} {\lim}_{\alpha\to\infty}\boldsymbol{\Sigma}={\lim}_{\alpha\to\infty}\left(1-\frac{\mu_{\alpha}^{2}}{\alpha}\right)\left(\frac{1}{m}\boldsymbol{I}-\bar{\boldsymbol{\mu}}\bar{\boldsymbol{\mu}}^{T}\right)+\left(1-m\frac{\mu_{\alpha}^{2}}{\mu_{0}^{2}}\right)\bar{\boldsymbol{\mu}}\bar{\boldsymbol{\mu}}^{T}=0, \end{array}$$

we conclude that for large values of α the covariance matrix tends to zero; consequently, the SDD concentrates around a single vector with no variation.
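This conclusion can be checked numerically from the diagonal entries $\frac{1}{m}-\mu_{\alpha}^{2}/\mu_{0}^{2}$ of (37); the sketch below uses our own naming:

```python
from math import lgamma, exp

def sym_variance(alpha, m):
    """Diagonal entry of Sigma in eq. (37) for the symmetric SDD, all alpha_i = alpha,
    i.e. 1/m - (mu_alpha / mu_0)^2 with alpha_0 = m * alpha."""
    mu_a = exp(lgamma(alpha + 0.5) - lgamma(alpha))
    mu_0 = exp(lgamma(m * alpha + 0.5) - lgamma(m * alpha))
    return 1.0 / m - (mu_a / mu_0) ** 2
```

Evaluating this for increasing α shows the variance decreasing monotonically toward zero, consistent with the concentration of the SDD.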

### Limiting case uniform distribution

We now consider the case where $$\alpha _{i}=\frac {1}{2}$$ for i<m and αm=1; the SDD takes the form

$$\begin{array}{*{20}l} f_{\text{SDir}}(\mathbf{x};\alpha)=\frac{2^{m-1}\Gamma\left(\frac{m-1}{2}+1\right)}{\prod_{i=1}^{m-1}\Gamma\left(\frac{1}{2}\right)\Gamma(1)}\prod_{i=1}^{m-1}{x_{i}}^{2 \frac{1}{2}-1}\cdot x_{m}^{2(1)-2}=\frac{2^{m-1}\Gamma\left(\frac{m+1}{2}\right)}{\pi^{\left(\frac{m-1}{2}\right)}}, \end{array}$$
(40)

which is a constant, independent of the values of xi; the SDD then becomes the uniform distribution over the positive orthant of the hypersphere.

### Similarities and differences of the SDD with the von Mises and Fisher-Bingham distributions

The von Mises distribution is usually considered the analogue of the normal distribution on the circle, as described by Mardia (Mardia 1975). The von Mises distribution and its particular case for the three-dimensional sphere, the Fisher-Bingham distribution, tend to converge to a multivariate and bivariate normal distribution, respectively, for large values of κ, as shown by Kent (Kent 1982).

The proposed SDD does not seem to converge to the von Mises distribution or to a multivariate normal distribution for large values of αi, but rather tends to concentrate around a single vector, as established at the end of the section on limiting cases of the SDD.

Moreover, both the von Mises and the Fisher-Bingham distributions converge to the uniform distribution for very small values of κ, much as the SDD becomes the uniform distribution for the parameter values described at the end of the previous subsection.

## Inference for the spherical-Dirichlet distribution

We now consider estimation of the parameters of the SDD. Our main interest is to develop suitable procedures to estimate the set of parameters αi, given a sample of random vectors located at the positive orthant of the hypersphere. We first derive estimators for αi using the method of moments (MOM), and next we develop estimators for the same set of parameters using the method of maximum likelihood estimation (MLE).

### Method of moments (MOM)

Using a procedure similar to the one developed by Narayanan (Narayanan 1992) to estimate the parameters of the Dirichlet distribution, suppose we have a random sample of n i.i.d. random vectors $X_{1},X_{2},\dots,X_{n}$, where $X_{i}=(x_{i1},\dots,x_{im})$ with $x_{ij}>0$ and $\sum_{j=1}^{m}x_{ij}^{2}=1$; then

$$\begin{array}{*{20}l} E(x_{i})=\frac{\Gamma\left(\alpha_{i}+\frac{1}{2}\right)}{\Gamma(\alpha_{i})}\cdot\frac{\Gamma(\alpha_{0})}{\Gamma\left(\alpha_{0}+\frac{1}{2}\right)}=\frac{\mu_{i}}{\mu_{0}} \text{ for all } i, \end{array}$$
(41)

and

$$\begin{array}{*{20}l} E\left(x_{i}^{2}\right)=\frac{\alpha_{i}}{\alpha_{0}} \text{ for all } i. \end{array}$$
(42)

We define the sample moments as

$$\begin{array}{*{20}l} X_{1j}^{'}=\frac{1}{n}\sum_{i=1}^{n}x_{ij} \quad j=1,\dots,m, \end{array}$$
(43)

and

$$\begin{array}{*{20}l} X_{2j}^{'}=\frac{1}{n}\sum_{i=1}^{n}x_{ij}^{2} \quad j=1,\dots,m. \end{array}$$
(44)

We have m-1 first order moment equations and m-1 second order moment equations to solve for m unknowns αi. To avoid linear dependency and for the sake of simplicity we choose one of the first order moments, and m-1 of the second order moment equations

$$\begin{array}{*{20}l} \frac{\Gamma\left(\alpha_{1}+\frac{1}{2}\right)}{\Gamma(\alpha_{1})}\cdot\frac{\Gamma(\alpha_{0})}{\Gamma\left(\alpha_{0}+\frac{1}{2}\right)}=\frac{1}{n}\sum_{i=1}^{n}x_{i1}=X_{11}^{'}, \end{array}$$
(45)

then, the remaining m-1 second order moment equations are

$$\begin{array}{*{20}l} \frac{\alpha_{j}}{\alpha_{0}}=\frac{1}{n}\sum_{i=1}^{n}x_{ij}^{2}=X_{2j}^{'}\text{\quad}j=2,\dots,m. \end{array}$$
(46)

There is no closed form solution for the αi when solving (45) and (46) simultaneously, so the corresponding method of moments estimators for αi must be obtained numerically. Results from MOM can then be used as initial values for the MLE, which usually exhibits better statistical properties.
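One way to organize the numerical solution: (46) expresses each $\alpha_{j}$ as $\alpha_{0}X_{2j}^{'}$, which reduces (45) to a one-dimensional root-finding problem in $\alpha_{0}$. The sketch below (function name, search bounds, and the assumption that the left side of (45) is increasing in $\alpha_{0}$ are ours) uses plain bisection:

```python
from math import lgamma, exp

def mom_estimate(mean_x1, mean_sq, lo=1e-3, hi=1e4, iters=200):
    """Method-of-moments sketch for the SDD.

    mean_x1 -- sample mean of x_1, the right side of (45)
    mean_sq -- sample means of x_j^2 for j = 1..m, the right sides of (46)
    """
    s1 = 1.0 - sum(mean_sq[1:])      # share of alpha_1 implied by sum_j x_j^2 = 1

    def e_x1(a0):
        # E(x_1) from (45) with alpha_1 = s1 * a0 substituted in
        a1 = s1 * a0
        return exp(lgamma(a1 + 0.5) - lgamma(a1) + lgamma(a0) - lgamma(a0 + 0.5))

    # bisection on alpha_0, assuming e_x1 is increasing in a0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if e_x1(mid) < mean_x1:
            lo = mid
        else:
            hi = mid
    a0 = 0.5 * (lo + hi)
    return [s1 * a0] + [s * a0 for s in mean_sq[1:]]
```

Feeding in the exact population moments of an SDD recovers its parameters, which is a convenient sanity check for the implementation.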

### Maximum likelihood estimation (MLE)

Suppose that we have a random sample of vectors on the positive orthant of the hypersphere, $X_{1},X_{2},\dots,X_{n}$, with $X_{i}\in\Re^{m}$, drawn from an SDD with pdf defined at (3). Then, the log-likelihood is

$$\begin{array}{*{20}l} \ln{L}(\boldsymbol\alpha)&=\ln\prod_{i=1}^{n}\frac{2^{m-1}\Gamma\left(\sum_{j=1}^{m}\alpha_{j}\right)}{{\prod_{j=1}^{m}\Gamma({\alpha_{j})}}}\prod_{j=1}^{m}{x_{ij}}^{2\alpha_{j}-1}\cdot x_{im}^{-1}. \end{array}$$
(47)

The parameters of an SDD can be estimated by maximizing the log-likelihood function of the data, following a procedure similar to the one used by Minka for the Dirichlet distribution (Minka 2000). Grouping all the constant terms as K, we can rewrite the products and sums as

$$\begin{array}{*{20}l} \ln{L}(\boldsymbol\alpha)&= K + n\ln\Gamma\left(\sum_{j=1}^{m}\alpha_{j}\right)-n\sum_{j=1}^{m}\ln\Gamma(\alpha_{j})+\sum_{i=1}^{n}\sum_{j=1}^{m}(2\alpha_{j}-1)\ln{x_{ij}}-\sum_{i=1}^{n}\ln x_{im},\\ &=K + n\left(\ln\Gamma\left(\sum_{j=1}^{m}\alpha_{j}\right)\,-\,\sum_{j=1}^{m}\ln\Gamma(\alpha_{j})\,+\,\sum_{j=1}^{m}(2\alpha_{j}\,-\,1)\frac{1}{n}\sum_{i=1}^{n}\ln{x_{ij}}-\frac{1}{n}\sum_{i=1}^{n}\ln x_{im}\right), \end{array}$$

where the function that needs to be optimized after removing unnecessary constants is

$$\begin{array}{*{20}l} F(\boldsymbol{\alpha})=\ln\Gamma\left(\sum_{j=1}^{m}\alpha_{j}\right)-\sum_{j=1}^{m}\ln\Gamma(\alpha_{j})+\sum_{j=1}^{m}(2\alpha_{j}-1)\left(\frac{1}{n}\sum_{i=1}^{n}\ln{x_{ij}}\right)-\frac{1}{n}\sum_{i=1}^{n}\ln x_{im}. \end{array}$$

The gradient of the objective function is obtained by differentiating F(α) with respect to αk as

$$\begin{array}{*{20}l} \nabla(F)_{k}=\frac{\partial F}{\partial \alpha_{k}} =\Psi\left(\sum_{j=1}^{m}\alpha_{j}\right)-\Psi(\alpha_{k})+2\left(\frac{1}{n}\sum_{i=1}^{n}\ln{x_{ik}}\right), \end{array}$$
(48)

where $$\Psi(x) =:\frac {d\ln \Gamma (x)}{dx}$$ is the digamma function. The optimization is subject to the constraints $$\alpha _{i} \geqq 0$$. The SDD is a member of the exponential family, so its log-likelihood is concave, and at the maximum the observed sufficient statistic is equal to the expected sufficient statistic, where the latter is

$$\begin{array}{*{20}l} E\left(\ln x_{k}\right)=\frac{1}{2}\Psi(\alpha_{k})-\frac{1}{2}\Psi\left(\sum_{j=1}^{m}\alpha_{j}\right), \end{array}$$
(49)

and the observed sufficient statistic is

$$\begin{array}{*{20}l} \frac{1}{n}\sum_{i=1}^{n}\ln{x_{ij}}, \end{array}$$
(50)

that leads to the following iterative procedure

$$\begin{array}{*{20}l} \Psi(\alpha_{k}^{new})=\Psi\left(\sum_{j=1}^{m}\alpha_{j}^{old}\right)+2\left(\frac{1}{n}\sum_{i=1}^{n}\ln{x_{ik}}\right). \end{array}$$
(51)

Although the proposed procedure does not, in general, guarantee reaching a global maximum, successively updating (51) provides reasonable results, and convergence is typically fast.
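A sketch of the fixed-point update (51) follows. The digamma function and its inverse are not in the Python standard library, so minimal versions are included (recurrence plus asymptotic series for $\Psi$, bisection for $\Psi^{-1}$); all names, tolerances, and iteration counts are our assumptions:

```python
from math import log

def psi(x):
    """Digamma via psi(x) = psi(x+1) - 1/x and an asymptotic series for x >= 6."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + log(x) - 0.5 / x - inv2 * (1.0/12 - inv2 * (1.0/120 - inv2 / 252))

def psi_inv(y, lo=1e-8, hi=1e8, iters=200):
    """Invert the (increasing) digamma by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if psi(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mle_update(alpha, mean_log_x, iters=100):
    """Iterate eq. (51): psi(alpha_k_new) = psi(sum_j alpha_j_old) + 2 * mean(ln x_k).

    alpha      -- starting values (e.g. the MOM estimates)
    mean_log_x -- per-coordinate sample means of ln x_k, the statistic (50)
    """
    for _ in range(iters):
        t = psi(sum(alpha))
        alpha = [psi_inv(t + 2.0 * g) for g in mean_log_x]
    return alpha
```

As a check, feeding in the exact expected sufficient statistics from (49) for $\boldsymbol{\alpha}=(1,2)$, namely $E(\ln x_{1})=-\frac{3}{4}$ and $E(\ln x_{2})=-\frac{1}{4}$, recovers those parameters.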

## Applications to data

We now apply these estimation procedures to data. We first developed an example using simulated data generated from the proposed SDD, with parameters assumed unknown for the purpose of estimation. Next, a second example was developed using text mining data obtained from a publicly available data set. Both examples were solved using MOM and MLE, applying the techniques described in the corresponding sections, and the results obtained from the two methods were compared.

### Simulation example

Four different simulations were performed, each with 1,000 values randomly generated from an SDD on a three-dimensional hypersphere with known values of the parameters α1, α2 and α3. Graphs of the SDD for the four sets of parameters are shown in Figs. 2 and 3. Inferences to estimate the values of these parameters, assumed to be unknown, were performed using the MOM and MLE procedures developed in the corresponding sections.
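The paper does not spell out the simulation algorithm; one natural way to draw from an SDD, given the construction in (2), is to sample a Dirichlet vector (normalized gamma variates) and take coordinate-wise square roots. The sketch below, including the seed and the Monte Carlo check of (15), is our own and uses only the standard library:

```python
import random
from math import sqrt

def sample_sdd(alpha, rng):
    """One SDD draw: Dirichlet(alpha) via gamma variates, then sqrt per eq. (2)."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [sqrt(v / total) for v in g]

rng = random.Random(7)
alpha = [2.0, 3.0, 5.0]
n = 20000
mean_sq = [0.0, 0.0, 0.0]
for _ in range(n):
    x = sample_sdd(alpha, rng)
    for j in range(3):
        mean_sq[j] += x[j] ** 2 / n
# mean_sq should approach alpha_j / alpha_0 = (0.2, 0.3, 0.5) by eq. (15)
```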

First, an estimation was performed using MOM, iterating between (45) and (46); the values were updated in each cycle until convergence was achieved within a pre-set tolerance. The estimated parameter values from MOM were then used as initial values for the MLE iterations, where expression (51) was updated successively until the parameter values were stable within a pre-set tolerance. Results from both estimation methods, together with the true parameter values, are shown at Table 1.

Note the close agreement between the MLE and MOM estimates shown at Table 1.

### Text mining example

A text mining example was developed using a publicly available data set assembled by Lang (Lang). It contains email messages from several interest groups; the “auto” topic was selected and summarized using standard data mining techniques. A randomly selected sample of 160 documents (emails) was extracted and summarized as vectors on the positive orthant of the hypersphere. Common terms such as “from” or “subject” were excluded, as they do not provide any discriminant power and could potentially bias the analysis. Vocabulary reduction for synonyms and stemming were performed, and the ten most common terms were extracted by obtaining their raw frequencies. The frequencies of these terms were expressed as the components of vectors in a ten-dimensional space. A small fraction of the data set can be seen at Table 2.

An appropriate transformation was applied to these vectors to reduce extreme values and eliminate zeros: xtransf= ln(1.10+x). The vectors were then standardized to unit length on the positive orthant of the hypersphere and fitted using the proposed multivariate SDD in ten dimensions. The corresponding αi’s were estimated using both MOM and MLE, and the estimated values are shown at Table 3.
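The preprocessing described above can be sketched as follows (the function name is ours; 1.10 is the constant given in the text):

```python
from math import log, sqrt

def to_unit_vector(freqs):
    """Apply xtransf = ln(1.10 + x) to raw term frequencies, then scale to unit length."""
    t = [log(1.10 + f) for f in freqs]
    norm = sqrt(sum(v * v for v in t))
    return [v / norm for v in t]
```

Because ln(1.10+x) > 0 for x ≥ 0, zero frequencies map to small positive components, which places each document vector strictly inside the positive orthant of the hypersphere, as the SDD requires.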

The MOM procedure required 271 iterations to fit the SDD within a preset tolerance level. The final MOM estimates were then used as initial values for the MLE procedure, and a new model was fitted in 19 additional iterations. Although the MLE procedure does not, in general, guarantee finding a global maximum, the proposed method provided reasonable results and convergence was fast.

## Conclusions

The proposed SDD constitutes a superior alternative to other competing methods for fitting unit vectors on the positive orthant of the hypersphere. The SDD avoids wasting probability mass or using distribution mixtures that are not suitable for the positive orthant. Inference results for MOM and MLE were in close agreement for the simulated data, and reasonably close for the real text mining example. The simulated data were randomly generated from the proposed SDD, while the text mining data came from a real text mining problem. The SDD is flexible and shows a rich variety of shapes suitable for fitting a wide range of data, much as the beta distribution does in one dimension. Under an appropriate transformation it can also accommodate zeros in some coordinates of the vectors. Future research may aim at enhancing the capability of handling zero-valued components, avoiding the need to transform the data.

## Availability of data and materials

The data for the text mining example were obtained from the publicly available data set assembled by Lang (Lang). The particular sample analysed during the current study is available from the corresponding author on reasonable request.

## References

1. Frame, J. S.: An approximation to the quotient of gamma function. Am. Math. Mon. 56(8), 529–535 (1949).

2. Kent, J. T.: The Fisher-Bingham distribution on the sphere. J. R. Stat. Soc. Ser. B Methodol. 44(1), 71–80 (1982). https://doi.org/10.1111/j.2517-6161.1982.tb01189.x. https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.2517-6161.1982.tb01189.x.

3. Lang, K.: CMU Text Learning Group Data Archives. https://www.cs.cmu.edu/afs/cs/project/theo-20/www/data/news20.html. Accessed 1 Sep 2019.

4. Mardia, K. V.: Statistics of directional data. J. R. Stat. Soc. Ser. B Methodol. 37(3), 349–393 (1975).

5. Mardia, K. V., Jupp, P. E.: Directional Statistics, 2nd edn. Wiley series in probability and statistics. Wiley, Chichester (2000).

6. Minka, T. P.: Estimating a Dirichlet distribution. Technical Report, MIT (2000). https://www.microsoft.com/en-us/research/publication/estimating-dirichlet-distribution/.

7. Narayanan, A.: A note on parameter estimation in the multivariate beta distribution. Comput. Math. Appl. 24(10), 11–17 (1992). https://doi.org/10.1016/0898-1221(92)90016-B.

8. Olkin, I., Rubin, H.: Multivariate beta distributions and independence properties of the Wishart distribution. Ann. Math. Stat. 35(1), 261–269 (1964).

9. Suvrit, S.: Directional statistics in machine learning: a brief review. arXiv e-prints, 1605–00316 (2016). http://arxiv.org/abs/1605.00316.

## Acknowledgements

The author is grateful to the invaluable help of Eduardo García-Portugués from the Department of Statistics, Carlos III University of Madrid (Spain).

## Funding

Travel funding for the completion of this project was received by a Research Enhancement grant from the Texas A&M University-Corpus Christi Division of Research and Innovation.

## Author information

Authors

### Corresponding author

Correspondence to Jose H. Guardiola.

## Ethics declarations

### Competing interests

The author declares that he has no competing interests. 