A new diversity estimator

Zheng, Lukun; Jiang, Jiancheng

doi:10.1186/s40488-017-0063-6

Methodology
Open access
Published: 15 September 2017


Journal of Statistical Distributions and Applications, volume 4, Article number: 12 (2017)


Abstract

The maximum likelihood estimator (MLE) of Gini-Simpson’s diversity index (GS) is widely used but suffers from large bias when the number of species is large or infinite. We propose a new estimator of the GS index and show its unbiasedness. Asymptotic normality of the proposed estimator is established when the number of species in the population is finite and known, finite but unknown, or infinite. Simulations demonstrate advantages of our estimator over the MLE, and a real example on the extinction of dinosaurs endorses the use of our approach. The Mathematics Subject Classification (MSC) code is 60E05, which refers to distributions: general theory.

Introduction

Diversity indices are quantitative measures of both richness (the number of categories) and the evenness of the categories' relative abundances. See Rao (1982), Ludwig and Reynolds (1988), and Patil and Taillie (1979) for further information. It is important to measure the diversity of a population. For example, in ecology, a gradual decline in diversity over time may indicate the gradual extinction of an ecosystem, while a rapid decline may indicate an extinction caused by a sudden impact. On this basis, scientists have argued that the extinction of the dinosaurs was due to a large asteroid impact roughly contemporaneous with the end of the Cretaceous. Gini-Simpson’s index (GS) and Shannon’s entropy are the two best-known diversity measures. They are widely used in fields such as ecology, demography, anthropology, and information theory. See Hurlbert (1971), Peet (1974), Hunter and Gaston (1988), and Rogers and Hsu (2001).

Consider a population with $K$ species in which $p_{i}$ denotes the relative abundance of species $i$ ($i=1,\dots,K$), so that $\sum \limits _{i=1}^{K}p_{i}=1$. Simpson (1949) proposed the index

$$ \lambda=\sum\limits_{i=1}^{K}p_{i}^{2} $$

(1)

to measure the degree of concentration for the population. Gini-Simpson’s index is defined as

$$ GS=\sum\limits_{i=1}^{K} p_{i}(1-p_{i})=1-\lambda. $$

(2)

There are also many other indices in the literature. See Shannon (1948), Good (1953), Renyi (1961), and Hill (1973), among others. In the biodiversity literature, according to Ricotta (2005), there is a “jungle” of biological measures of diversity. For a comprehensive discussion of the relationships among these indices, see Rennolls and Laumonier (2006) and Mao (2007).

Let $\{X_{i}\}_{i=1}^{n}$ be an iid sample from the population $\{p_{k}; k=1,\dots,K\}$, and let $f_{k}$ be the observed frequency of the $k$th category. Let $\hat {p}_{k}=\frac {f_{k}}{n}$ and $\hat {P}=\{\hat {p}_{k}; k=1,\dots, K\}$. The most widely used estimator of GS is the MLE

$$ \widehat{GS}=1-\sum\limits_{k=1}^{K} \hat{p}_{k}^{2}. $$

(3)

When $K$ is finite, the MLE is asymptotically normal if the underlying distribution is inhomogeneous (non-uniform) and, after suitable scaling, asymptotically chi-square distributed if the underlying distribution is homogeneous (uniform). Another closely related estimator is given by

$$ \frac{n}{n-1}\left[1-\sum\limits_{k=1}^{K}\hat{p}_{k}^{2}\right]=\frac{n}{n-1}\left[1-\sum\limits_{k=1}^{K}\left(\frac{f_{k}}{n}\right)^{2}\right]. $$

(4)

Bhargava and Uppuluri (1977) showed that it is unbiased and established its asymptotic distribution.
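To make Eqs. (3) and (4) concrete, the following is a minimal sketch that computes both estimators from a list of observed category labels. The function names and the toy sample are illustrative assumptions, not part of the original paper.

```python
from collections import Counter

def gs_mle(sample):
    """Plug-in (MLE) estimate, Eq. (3): 1 minus the sum of squared sample proportions."""
    n = len(sample)
    counts = Counter(sample)
    return 1.0 - sum((f / n) ** 2 for f in counts.values())

def gs_corrected(sample):
    """Estimator of Eq. (4): the MLE rescaled by n / (n - 1)."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    return n / (n - 1) * gs_mle(sample)

# Hypothetical species labels, for illustration only.
sample = ["a", "a", "b", "c"]
print(gs_mle(sample), gs_corrected(sample))
```

For this toy sample the proportions are 1/2, 1/4, and 1/4, so the MLE is 1 - 3/8 = 0.625 and the rescaled estimator is 5/6.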

Although the MLE is asymptotically efficient when $K$ is small relative to the sample size, it does not work well when $K$ is large or infinite. This is easy to understand: there are only about $n/K$ observations on average for estimating each parameter, so the MLE is inefficient when $n/K$ is small. In fact, $\widehat {GS}$ is inconsistent when $K=\infty$ or when $K=K_{n}$ diverges too fast, and one cannot use modern penalized estimation such as the lasso to estimate $p_{k}$, since there is no sparsity structure here. As will be shown in this paper, the MLE still works for the case $K=\infty$, but only under some restrictions. Most existing methodologies adjust for this problem but result in very complicated forms with less tractable distributional characteristics. Practical techniques include the jackknife and the bootstrap; see Fritsch and Hsu (1999). Zhang and Zhou (2010) studied a group of estimators for $\zeta_{u,v}$. Because of these problems, little is known about the asymptotic distributional characteristics except in a naive approach. This motivates us to propose a new approach to estimating the GS index. Our new estimator is unbiased, asymptotically normal, and efficient in all cases regarding the number of species $K$.

The remainder of the paper is organized as follows. In the “A general birthday problem” section, the birthday problem is generalized to cases with unequal probabilities and infinitely many categories, and the connection between the generalized birthday problem and the GS index is established. In “The estimator” section, an unbiased estimator of the GS index is proposed based on this connection. In the “Asymptotic properties” section, asymptotic normality is derived under all three cases for the number of species in the population. In the “Examples and simulation studies” section, an empirical study of dinosaur extinction data and a simulation study demonstrate the performance of our estimator.

A general birthday problem

The birthday problem is a standard textbook example; see Feller (1971). The problem is to find the probability that, among $n$ students in a class, no two students share the same birthday, under the assumptions that individuals’ birthdays are independent and that, for every individual, all 365 days of the year are equally likely. It has been generalized in many ways under the uniform probability assumption; see Johnson and Kotz (1977) and Fang (1985), among others. Birthday problems with unequal probabilities have also been studied over the years; for more recent work, see Joag-Dev and Proschan (1992) and Wagner (2002), among others.

Similar to the Bernoulli trial, we define a categorical trial $X$ as a random experiment with $K$ possible outcomes (categories) with probability distribution $P=\{p_{k}: k=1,\dots,K\}$, where $K$ is finite (known or unknown) or infinite. We call it “a success of category $i$” if the outcome of a categorical trial belongs to category $i$. Consider an independent sequence of categorical trials $\{X_{i}; i=1,2,\dots\}$ in which the probability of success of each category stays the same from trial to trial. Let $H_{m}$ be the number of distinct categories that appear in the first $m$ trials. We assume $m\geq 2$, since the case $m=1$ is trivial. Calculating the probability distribution of $H_{m}$ is generally referred to as the birthday problem with unequal probabilities; see the references above. Let $Y_{k}$ be the number of successes of the $k$th category in the first $m$ trials, and let $I_{k}=1_{(Y_{k}=0)}$ for $k=1,\dots,K$ be the indicator with $I_{k}=1$ if the $k$th category does not appear in the sample. Then

$$ H_{m}=\sum\limits_{k=1}^{K} \left(1-I_{k}\right). $$

(5)

Theorem 1

For fixed m and finite or infinite value of K, we have

$$\begin{aligned} E(H_{m})&={m\choose 1}\sum\limits_{k=1}^{K} p_{k}- {m\choose 2}\sum\limits_{k=1}^{K} p_{k}^{2}+\cdots +(-1)^{m+1}{m\choose m}\sum\limits_{k=1}^{K} p_{k}^{m},\\ Var(H_{m}) &= \sum\limits_{k=1}^{K} (1-p_{k})^{m}\left[1-(1-p_{k})^{m}\right]+2\sum\limits_{1\leq i<j\leq K}\left[(1-p_{i}-p_{j})^{m}-(1-p_{i})^{m}(1-p_{j})^{m}\right]. \end{aligned} $$

The proof of the theorem is given in the Appendix.

Remark 1

It is easy to see that $Var(H_{m})$ is finite for fixed $m$. In fact, since $1\leq H_{m}\leq m$, we have $Var(H_{m})<m^{2}$.
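The formulas of Theorem 1 can be evaluated directly for a finite abundance vector. The sketch below (function names are illustrative assumptions) computes $E(H_{m})$ in two equivalent ways, as $\sum_{k}[1-(1-p_{k})^{m}]$ and via the alternating binomial expansion, and evaluates the variance formula.

```python
from math import comb

def expected_distinct(p, m):
    """E(H_m) written as the sum over k of 1 - (1 - p_k)^m."""
    return sum(1.0 - (1.0 - pk) ** m for pk in p)

def expected_distinct_series(p, m):
    """E(H_m) via the alternating binomial expansion stated in Theorem 1."""
    return sum((-1) ** (j + 1) * comb(m, j) * sum(pk ** j for pk in p)
               for j in range(1, m + 1))

def var_distinct(p, m):
    """Var(H_m) from Theorem 1, for a finite list of abundances p."""
    diag = sum((1 - pk) ** m * (1 - (1 - pk) ** m) for pk in p)
    cross = sum((1 - p[i] - p[j]) ** m - (1 - p[i]) ** m * (1 - p[j]) ** m
                for i in range(len(p)) for j in range(i + 1, len(p)))
    return diag + 2 * cross

# Hypothetical abundances, for illustration only.
p = [0.5, 0.3, 0.2]
print(expected_distinct(p, 2), expected_distinct_series(p, 2), var_distinct(p, 2))
```

With $m=2$ and these abundances, $\sum p_{k}^{2}=0.38$, so $E(H_{2})=1.62$; since $H_{2}$ only takes the values 1 and 2, its variance is $0.38\times 0.62=0.2356$, which the variance formula reproduces.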

Now we are ready to establish the connection between the generalized birthday problem and the GS index. For a categorical trial with $K$ categories and probability distribution $P=\{p_{k}: k=1,\dots,K\}$, we have a population of $K$ species with relative abundances $\{p_{k}: k=1,\dots,K\}$. A random sample of size $m$ from this population corresponds to the first $m$ trials in the independent sequence of categorical trials $\{X_{i}; i=1,2,\dots\}$. As a result, the first $m$ categorical trials can be equivalently viewed as a random sample of size $m$ from the corresponding population, and consequently $H_{m}$ represents the number of distinct species in a random sample of size $m$ from that population. The following theorem shows that $GS=1-{\sum \nolimits }_{k=1}^{K} p_{k}^{2}$ is the same as $E(H_{2})-1$.

Theorem 2

Consider a population with $K$ species and relative abundances $P=\{p_{k}; k=1,\dots,K\}$. Then,

$$ GS=E(H_{2})-1, $$

(6)

where $H_{2}$ is the number of distinct species in a random sample of size 2.

The theorem follows directly from Theorem 1 with $m=2$ and the definition of GS in Eq. (2). It indicates that GS is an estimable parameter under population $P$.
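As a short worked step (using only Theorem 1 and Eqs. (1) and (2)), setting $m=2$ gives

$$ E(H_{2})={2\choose 1}\sum\limits_{k=1}^{K}p_{k}-{2\choose 2}\sum\limits_{k=1}^{K}p_{k}^{2}=2-\lambda=1+GS. $$

Equivalently, $H_{2}-1$ is the indicator that two independent draws belong to different species, an event with probability $1-\sum \nolimits_{k=1}^{K}p_{k}^{2}$.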

The estimator

Let $\{X_{i}\}_{i=1}^{n}$ be an iid random sample of size $n$ from population $P$ with finite or infinite $K$. For any sub-sample $\{X_{i_{1}},\dots, X_{i_{m}}:\, 1\leq i_{1}<\dots <i_{m}\leq n\}$ of this sample, $H_{m}(X_{i_{1}},\dots, X_{i_{m}})$ is the number of distinct species in the sub-sample. Since this number does not depend on the order of the arguments, $H_{m}(X_{i_{1}},\dots, X_{i_{m}})$ is a symmetric function. Define the following U-statistic

$$ Z_{n,m}={{n}\choose{m}}^{-1}\sum\limits_{c} H_{m}(X_{i_{1}},\dots, X_{i_{m}}), $$

(7)

where ${\sum \nolimits }_{c}$ denotes summation over all ${{n}\choose {m}}$ combinations of $m$ distinct elements $\{i_{1},\dots,i_{m}\}$ from $\{1,2,\dots,n\}$. Then $Z_{n,m}\to E(H_{m})$ almost surely as $n\to\infty$, by the asymptotic theory of U-statistics in DasGupta (2008). This motivates us to estimate GS by

$$ \widehat{GS}_{1}=Z_{n,2}-1. $$

(8)

It is easy to verify that $\widehat {GS}_{1}=Z_{n,2}-1$ is always an unbiased estimator of GS. Indeed, $H_{2}-1$ is an unbiased estimator of GS by Theorem 2, and $Z_{n,2}-1$ is the average of $H_{2}-1$ over all sub-samples of size 2 drawn from the full set of observations.
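A minimal sketch of Eq. (8) follows. It computes $Z_{n,2}-1$ once by enumerating all pairs and once from category counts, using the counting identity that the number of pairs with equal labels is $\sum_{k} f_{k}(f_{k}-1)/2$. The function names and the toy sample are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def gs1_pairs(sample):
    """Z_{n,2} - 1 by direct enumeration: H_2 - 1 is 1 if a pair differs, else 0."""
    pairs = list(combinations(sample, 2))
    return sum(x != y for x, y in pairs) / len(pairs)

def gs1_counts(sample):
    """Same quantity from frequencies: 1 - sum_k f_k (f_k - 1) / (n (n - 1))."""
    n = len(sample)
    counts = Counter(sample)
    return 1.0 - sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))

# Hypothetical species labels, for illustration only.
sample = ["a", "a", "b", "c"]
print(gs1_pairs(sample), gs1_counts(sample))
```

For the toy sample, five of the six pairs contain two different labels, so both functions return 5/6. The count-based form avoids the quadratic cost of enumerating pairs.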

Asymptotic properties

Asymptotic properties for MLE

We first prove the asymptotic normality of $\widehat {GS}$ when $K=\infty$, that is, when there are infinitely many species in the population. Assume the probability distribution is $P=\{p_{i}; i=1,2,\dots\}$ with $p_{i}\geq p_{i+1}$ for all $i$ and $\sum \limits _{i=1}^{\infty } p_{i}=1$. The corresponding Gini-Simpson index is $GS=1-\sum \limits _{i=1}^{\infty } p_{i}^{2}=1-\lambda $. We have the following result.

Theorem 3

Let $P=\{p_{i}; i=1,2,\dots\}$ be the probability distribution of a population with infinitely many species, and write $p_{N,+}=\sum \limits _{i=N}^{\infty } p_{i}$ for its tail mass. Assume that there exists a sequence $\{N_{n}\}_{n=1}^{\infty }$ such that $np_{N_{n}+1,+}\rightarrow 0$. Then

$$\frac{\sqrt{n}\left(\widehat{GS}-GS\right)}{\hat{\sigma}}\overset{d}{\to} N(0,1), $$

where

$$ \hat{\sigma}^{2}=4\left[\sum\limits_{i=1}^{N_{n}}\hat{p}_{i}^{3}-\left(\sum\limits_{i=1}^{N_{n}}\hat{p}_{i}^{2}\right)^{2}\right]. $$

(9)

The proof is given in the Appendix.
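To illustrate the tail condition $np_{N_{n}+1,+}\rightarrow 0$, consider the geometric-type population $p_{k}=e^{-(k-1)/10}-e^{-k/10}$ used later in Example 2. Its tail mass telescopes to $p_{N+1,+}=e^{-N/10}$. The sketch below checks the condition for one possible truncation sequence, $N_{n}=\lceil 20\ln n\rceil$; this choice of $N_{n}$ is an assumption made here for illustration and is not taken from the paper.

```python
from math import ceil, exp, log

def tail(N, scale=10.0):
    """p_{N+1,+} = sum over i > N of p_i = exp(-N / scale) for p_k = e^{-(k-1)/scale} - e^{-k/scale}."""
    return exp(-N / scale)

for n in (10, 100, 1000, 10000):
    N_n = ceil(20 * log(n))          # hypothetical truncation level
    print(n, N_n, n * tail(N_n))     # n * p_{N_n+1,+} is at most 1/n
```

With this choice, $n\,e^{-N_{n}/10}\leq n\,e^{-2\ln n}=1/n$, so the condition holds with $N_{n}$ growing only logarithmically in $n$.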

The following theorem, which covers finite $K$ in both the homogeneous and inhomogeneous cases, is implied by Bhargava and Uppuluri (1977).

Theorem 4

If the underlying population distribution is inhomogeneous, then

$$ \frac{\sqrt{n}\left(\widehat{GS}-GS\right)}{\hat{\sigma}}\overset{d}{\to} N(0,1), $$

(10)

where

$$\hat{\sigma}^{2}=4\left[\sum\limits_{k=1}^{K} \hat{p}_{k}^{3}-\left(\sum\limits_{k=1}^{K} \hat{p}_{k}^{2}\right)^{2}\right]. $$

If the underlying population distribution is homogeneous, we have

$$ nK\left(\widehat{GS}-GS\right)\overset{d}{\to}-\chi_{K-1}^{2}. $$

(11)

Asymptotic properties of $\widehat {GS}_{1}$

The above U-statistic construction paves the way to establishing the asymptotic normality of $Z_{n,2}$. We first recall the general setting. For an iid random sample $\{X_{i}; i=1,\dots,n\}$ from the distribution $P$, let $\theta=\theta(P)$ be an estimable parameter and $h(X_{1},\dots,X_{m})$ a symmetric kernel satisfying $E_{P}\{h(X_{1},\dots,X_{m})\}=\theta(P)$. Let $U_{n}={{n}\choose {m}}^{-1}\sum \limits _{c}h(X_{i_{1}},\dots, X_{i_{m}})$, where ${\sum \nolimits }_{c}$ is the summation over the ${{n}\choose {m}}$ combinations of $m$ distinct elements $\{i_{1},\dots,i_{m}\}$ from $\{1,\dots,n\}$. Let $h_{1}(x_{1})=E_{P}\{h(x_{1},X_{2},\dots,X_{m})\}$ be the conditional expectation of $h$ given $X_{1}=x_{1}$, and $\sigma _{1}^{2}=Var_{P}\{h_{1}(X_{1})\}$. Then we have the following proposition by Hoeffding (1948).

Proposition 1

If $E_{P}(h^{2})<\infty$ and $\sigma _{1}^{2}>0$, then $\sqrt {n}\left (U_{n}-\theta \right) \overset {d}{\to } N\left (0,m^{2}\sigma _{1}^{2}\right)$.

In our case the kernel is $h=H_{2}$ with $1\leq H_{2}\leq 2$, so $E_{P}(h^{2})\leq 4<\infty$. Note that $h_{1}(x_{1})=E_{P}\left (2-I_{X_{2}=x_{1}}\right)=2-p_{x_{1}}$. It follows that

$$ \sigma_{1}^{2}=Var_{P}\left(h_{1}(X_{1})\right)=Var_{P}\left(2-p_{X_{1}}\right)=Var_{P}\left(p_{X_{1}}\right)=\sum\limits_{k=1}^{K}p_{k}^{3}-\left(\sum\limits_{k=1}^{K}p_{k}^{2}\right)^{2}\geq 0. $$

(12)

Equality holds if and only if the probability distribution $\{p_{k}: k=1,\dots,K\}$ is uniform. If $K=\infty$, the inequality is strict, since the distribution can never be uniform. Therefore, we have the following theorem.

Theorem 5

If the distribution $\{p_{k}: k=1,\dots,K\}$ is not uniform, then

$$ \sqrt{n}\left(\widehat{GS}_{1}-GS\right) \overset{d}{\to} N\left(0,4\sigma_{1}^{2}\right). $$

(13)

Remark 2

Non-uniform distributions include two cases: non-uniform finite distributions ($K<\infty$) and infinite distributions ($K=\infty$).

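By Theorem 1, $E(H_{2})=2-\sum \nolimits_{k} p_{k}^{2}$ and $E(H_{3})=3-3\sum \nolimits_{k} p_{k}^{2}+\sum \nolimits_{k} p_{k}^{3}$, so (12) can be rewritten as $\sigma_{1}^{2}=E(H_{3})-\{E(H_{2})\}^{2}+E(H_{2})-1$. Replacing $E(H_{m})$ by $Z_{n,m}$ from (7), it is easy to see that

$$ \hat{\sigma}_{1}^{2}=Z_{n,3}-Z_{n,2}^{2}+Z_{n,2}-1 $$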

(14)

is a consistent estimator of $\sigma _{1}^{2}$. Hence the following corollary is established.

Corollary 1

Under the conditions of Theorem 5, we have

$$ \frac{\sqrt{n}\left(\widehat{GS}_{1}-GS\right)}{2\hat{\sigma}_{1}} \overset{d}{\to} N(0,1). $$

(15)
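Corollary 1 suggests a normal-approximation confidence interval for GS. The sketch below is one possible implementation under stated assumptions: it evaluates $Z_{n,m}$ without enumerating subsets, using the fact that a category with frequency $f_{k}$ is absent from a uniformly chosen $m$-subset with probability ${n-f_{k}\choose m}/{n\choose m}$, and it clips the variance estimate (14) at zero because that estimate can be slightly negative in small samples. The clipping, the function names, and the toy sample are illustrative choices, not part of the paper.

```python
from collections import Counter
from math import comb, sqrt
from statistics import NormalDist

def z_stat(counts, n, m):
    """Z_{n,m}: average number of distinct categories over all m-subsets.

    Category k is absent from a random m-subset with probability
    C(n - f_k, m) / C(n, m), so no explicit enumeration is needed.
    """
    total = comb(n, m)
    return sum(1.0 - comb(n - f, m) / total for f in counts)

def gs1_interval(sample, level=0.95):
    """Point estimate GS_1 and a normal-approximation interval (Corollary 1)."""
    n = len(sample)
    if n < 3:
        raise ValueError("need at least three observations for Z_{n,3}")
    counts = list(Counter(sample).values())
    z2, z3 = z_stat(counts, n, 2), z_stat(counts, n, 3)
    gs1 = z2 - 1.0
    var1 = max(z3 - z2 ** 2 + z2 - 1.0, 0.0)   # Eq. (14), clipped at zero (assumption)
    half = NormalDist().inv_cdf(0.5 + level / 2) * 2.0 * sqrt(var1 / n)
    return gs1, (gs1 - half, gs1 + half)

# Hypothetical sample: eight individuals of one species, two of another.
print(gs1_interval(list("aaaaaaaabb")))
```

This gives a single-parameter interval only; the simultaneous intervals for pairwise contrasts used in Example 1 would need an additional multiplicity adjustment that is not sketched here.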

For homogeneous distributions, we have the following result.

Theorem 6

If the distribution $\{p_{k}: k=1,\dots,K\}$ is homogeneous, then

$$ nK\left(\widehat{GS}_{1}-GS\right)\overset{d}{\to} K-1-\chi_{K-1}^{2}. $$

(16)

The proof is given in the Appendix. Compared with the MLE (Theorem 4), our estimator behaves similarly in the homogeneous case.

Examples and simulation studies

Example 1

(Dinosaur Extinction) The cause of the extinction of the dinosaurs at the end of the Cretaceous period remains a mystery. Among all the theories, it is now widely accepted that it was due to a large asteroid impact at the end of the Cretaceous. Sheehan et al. (1991) argued that diversity remained relatively constant throughout the Cretaceous period. The scientists reasoned that if the disappearance of the dinosaurs had been gradual, one should observe a decline in diversity prior to extinction.

The data were organized by dividing the formation into three equally spaced stratigraphic levels, each of which represented a period of approximately 730,000 years. Fossils were cross-tabulated according to the stratigraphic level and the family to which the dinosaur belonged. The families represented are Ceratopsidae, Hadrosauridae, Hypsilophodontidae, Pachycephalosauridae, Tyrannosauridae, Ornithomimidae, Saurornithoididae, and Dromaeosauridae. The summarized data are shown in Table 1 and are available in Rogers and Hsu (2001).

Table 1 Dinosaur counts by family and stratigraphic level


Let $GS_{L}$, $GS_{M}$, and $GS_{U}$ denote the true values of the GS index at the Lower, Middle, and Upper levels, respectively. It is of interest to ask whether dinosaur diversity changed over time.

To address this question, we present 95% simultaneous confidence intervals for all the pairwise contrasts: $GS_{L}-GS_{M}$, $GS_{L}-GS_{U}$, and $GS_{M}-GS_{U}$.

Using the expressions for $\widehat {GS}_{1}$ and $\hat {\sigma }_{1}^{2}$ from the previous section and the normal approximation in our theorems, we obtain simultaneous confidence intervals for all pairwise contrasts. The results are provided in Table 2.

Table 2 95% simultaneous confidence intervals for all pairwise contrasts


Since all the confidence intervals contain zero, we may infer that all three communities were practically equivalent with respect to the GS index. That is, there is no significant change or decline of the diversity over time. Therefore, our study supports the theory of a sudden extinction of dinosaurs.

Our proposed estimator has advantages over the MLE when the sample size $n$ is not large relative to the number of species $K$, especially when $K=\infty$. In the following we conduct a simulation study for $K=\infty$. We omit simulations for other scenarios to save space.

Example 2

($K=\infty$) Consider the population $\{p_{k}=e^{-(k-1)/10}-e^{-k/10}: k\geq 1\}$. It is easy to calculate the true value of GS for this distribution:

$$GS=1-\sum\limits_{k=1}^{\infty} p_{k}^{2}=0.95004. $$
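The stated value can be checked numerically. Writing $r=e^{-1/10}$, we have $p_{k}=(1-r)r^{k-1}$, so $\sum_{k} p_{k}^{2}=(1-r)^{2}/(1-r^{2})=(1-r)/(1+r)=\tanh(1/20)$. The sketch below compares this closed form with a truncated sum; the truncation point of 2000 terms is an arbitrary illustrative choice.

```python
from math import exp, tanh

def p(k, scale=10.0):
    """p_k = exp(-(k-1)/scale) - exp(-k/scale), for k = 1, 2, ..."""
    return exp(-(k - 1) / scale) - exp(-k / scale)

# Truncated sum; the terms decay geometrically, so 2000 terms are ample.
gs_numeric = 1.0 - sum(p(k) ** 2 for k in range(1, 2001))

# Closed form: sum of p_k^2 equals (1 - r) / (1 + r) = tanh(1/20) with r = exp(-1/10).
gs_closed = 1.0 - tanh(0.05)

print(round(gs_numeric, 5), round(gs_closed, 5))  # 0.95004 0.95004
```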

We generate random samples of size $n=10$, 50, and 100, and calculate the MLE $\widehat {GS}$ and our proposed estimator $\widehat {GS}_{1}$, together with their estimated standard deviations (Eqs. (9’) and (14)). The simulation is based on 500 replications, and the reported results are averages of the corresponding estimates over the replications. Also, since it is known that the population distribution is not uniform, we simply apply $\widehat {GS}_{1}$ for the reason mentioned before. The simulation results are summarized in Table 3.

Table 3 Estimates of GS for Example 2


From Table 3, we see that the deviations of the MLEs from the true value $GS=0.95004$ are much greater than those of our proposed estimates. This is because $\widehat {GS}$ has a large bias and the sample coverage is limited when the sample size is small relative to the number of species. Our proposed estimator overcomes these obstacles since it is an unbiased estimator of GS. The results also show that our proposed estimator has smaller variance.
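A simulation of this kind can be sketched as follows. This is not the authors' code: the inverse-transform sampler, the seed, and the function names are assumptions, and the numbers it prints are not meant to reproduce Table 3. A draw $X=\lfloor -10\ln U\rfloor+1$ with $U$ uniform on $(0,1]$ has $P(X=k)=e^{-(k-1)/10}-e^{-k/10}$, which matches the population of Example 2. The proposed estimator is computed from counts, which is algebraically the same as $Z_{n,2}-1$.

```python
import random
from collections import Counter
from math import floor, log, tanh

def draw(rng):
    """One draw from p_k = exp(-(k-1)/10) - exp(-k/10) by inverse transform."""
    u = 1.0 - rng.random()           # uniform on (0, 1], avoids log(0)
    return floor(-10.0 * log(u)) + 1

def estimates(sample):
    """Return (MLE, GS_1) computed from category counts."""
    n = len(sample)
    counts = list(Counter(sample).values())
    mle = 1.0 - sum((f / n) ** 2 for f in counts)
    gs1 = 1.0 - sum(f * (f - 1) for f in counts) / (n * (n - 1))
    return mle, gs1

def simulate(n, reps=500, seed=0):
    """Average both estimators over `reps` replications of sample size n."""
    rng = random.Random(seed)
    results = [estimates([draw(rng) for _ in range(n)]) for _ in range(reps)]
    mean_mle = sum(r[0] for r in results) / reps
    mean_gs1 = sum(r[1] for r in results) / reps
    return mean_mle, mean_gs1

true_gs = 1.0 - tanh(0.05)           # closed form of the true GS, about 0.95004
for n in (10, 50, 100):
    print(n, simulate(n), true_gs)
```

Because the MLE can never exceed $1-1/n$, its average for $n=10$ must fall at least 0.05 below the true value, whereas the average of $\widehat {GS}_{1}$ should be close to it.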

Discussion

The birthday problem has been studied and extended in different forms and in many different areas. The same is true for diversity measures. This paper establishes a connection between the two topics through $H_{2}$ and the most widely used diversity measure, Gini-Simpson’s index. There are many other related diversity indices in the literature, such as Shannon’s entropy and Renyi’s index. For these indices, corresponding estimators can be found in a similar way through the result in Theorem 1. The advantage of our approach over the MLE is clearest when the sample size is not large relative to the number of species. There are many open problems built on this connection between the birthday problem and diversity measures. For example, further investigation is needed to study the estimation of mutual information in view of the generalized birthday problem. Our approach provides a framework for solving various problems inherited from the diversity measures.

Appendix 1: Proof of Theorem 1

Theorem 1

For fixed m and finite or infinite value of K, we have

$$ E(H_{m})={m\choose 1}\sum\limits_{k=1}^{K} p_{k}- {m\choose 2}\sum\limits_{k=1}^{K} p_{k}^{2}+\cdots +(-1)^{m+1}{m\choose m}\sum\limits_{k=1}^{K} p_{k}^{m}, $$

$$ Var(H_{m}) = \sum\limits_{k=1}^{K} (1-p_{k})^{m}\left[1-(1-p_{k})^{m}\right]+2\sum\limits_{1\leq i<j\leq K}\left[(1-p_{i}-p_{j})^{m}-(1-p_{i})^{m}(1-p_{j})^{m}\right]. $$

Proof

We first consider the following lemma.

Lemma 1

For the class of random variables $\{I_{k}; k=1,\dots,K\}$, we have

$$\begin{array}{*{20}l} E(I_{k}) &=(1-p_{k})^{m}; \end{array} $$

(17)

$$\begin{array}{*{20}l} Var(I_{k}) &=(1-p_{k})^{m}-(1-p_{k})^{2m}, \end{array} $$

(18)

$$\begin{array}{*{20}l} Cov(I_{i},I_{j}) &= (1-p_{i}-p_{j})^{m}-(1-p_{i})^{m}(1-p_{j})^{m},\, \text{for $i\neq j$.} \end{array} $$

(19)

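Lemma 1 can be verified easily: $I_{k}=1$ exactly when none of the $m$ trials falls in category $k$, which has probability $(1-p_{k})^{m}$, and $I_{i}I_{j}=1$ exactly when none of the $m$ trials falls in category $i$ or $j$, which has probability $(1-p_{i}-p_{j})^{m}$.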

When $K$ is finite, the following equations are easily established.

$$\begin{aligned} E(H_{m})&=\sum\limits_{k=1}^{K}\left(1-EI_{k}\right) \\ &= \sum\limits_{k=1}^{K}\left(1-(1-p_{k})^{m}\right) \\ &= \sum\limits_{k=1}^{K}\left({m\choose 1}p_{k}- {m\choose 2}p_{k}^{2}+\cdots +(-1)^{m+1}{m\choose m}p_{k}^{m} \right) \\ &= {m\choose 1}\sum\limits_{k=1}^{K} p_{k}- {m\choose 2}\sum\limits_{k=1}^{K} p_{k}^{2}+\cdots +(-1)^{m+1}{m\choose m}\sum\limits_{k=1}^{K} p_{k}^{m}, \text{ and }\\ Var\left(H_{m}\right) &=Var \left[\sum\limits_{k=1}^{K}\left(1-I_{k}\right)\right] \\ &=\sum\limits_{k=1}^{K} Var\left(1-I_{k} \right)+2\sum\limits_{1\leq i<j\leq K} Cov(1-I_{i},1-I_{j}) \\ &= \sum\limits_{k=1}^{K} Var\left(I_{k} \right)+2\sum\limits_{1\leq i<j\leq K} Cov(I_{i},I_{j}) \\ &= \sum\limits_{k=1}^{K} (1-p_{k})^{m}\left[1-(1-p_{k})^{m}\right]+2\sum\limits_{1\leq i<j\leq K}\left[(1-p_{i}-p_{j})^{m}-(1-p_{i})^{m}(1-p_{j})^{m}\right]. \end{aligned} $$

When $K$ is infinite, the above equations are guaranteed by the dominated convergence theorem, since $H_{m}\leq m$ and $H_{m}^{2}\leq m^{2}$. □

Appendix 2: Proof of Theorem 3

Theorem 3

Let $P=\{p_{i}; i=1,2,\dots\}$ be the probability distribution of a population with infinitely many species, and write $p_{N,+}=\sum \limits _{i=N}^{\infty } p_{i}$ for its tail mass. Assume that there exists a sequence $\{N_{n}\}_{n=1}^{\infty }$ such that $np_{N_{n}+1,+}\rightarrow 0$. Then

$$\frac{\sqrt{n}\left(\widehat{GS}-GS\right)}{\hat{\sigma}}\overset{d}{\to} N(0,1), $$

where

$$ \hat{\sigma}^{2}=4\left[\sum\limits_{i=1}^{N_{n}}\hat{p}_{i}^{3}-\left(\sum\limits_{i=1}^{N_{n}}\hat{p}_{i}^{2}\right)^{2}\right]. $$

(9’)

Proof

Consider a sequence of populations with probability distributions $P_{N}=\{p_{1},p_{2},\dots,p_{N-1},p_{N,+}\}$, where $p_{N,+}=\sum \limits _{i=N}^{\infty } p_{i}$. The corresponding Gini-Simpson index is

$$GS_{N}=1-\left(\sum\limits_{i=1}^{N-1}p_{i}^{2}+p_{N,+}^{2}\right)=1-\lambda_{N}. $$

It is easy to check that

$$\lambda_{N} \rightarrow \lambda $$

as N→∞.

Let $\{X_{i}\}_{i=1}^{n}$ be an iid sample from the population P. The MLE of GS is

$$\widehat{GS}=1-\sum\limits_{i=1}^{\infty} \hat{p}_{i}^{2}. $$

For fixed N, let’s re-label the same sample $\{X_{i}\}_{i=1}^{n}$ to another sample $\{Y_{i}\}_{i=1}^{n}$ as follows:

$$\begin{array}{*{20}l} & Y_{i}=X_{i}\, \text{if}\; X_{i}< N \\ & Y_{i}=N\, \text{if}\; X_{i}\geq N \end{array} $$

Then $\{Y_{i}\}_{i=1}^{n}$ can be regarded as an iid sample from $P_{N}$ with Gini-Simpson index $GS_{N}$.

The MLE of $GS_{N}$ is

$$\widehat{GS}_{N}=1-\left(\sum\limits_{i=1}^{N-1} \hat{p}_{i}^{2}+\hat{p}^{2}_{N,+}\right). $$

It is easy to see that

$$\widehat{GS}_{N}-\widehat{GS}\rightarrow 0 $$

as N→∞. In fact,

$$ \widehat{GS}-\widehat{GS}_{N}=0 $$

(20)

if $X_{i}\leq N$ for all $i=1,2,\dots,n$.

Therefore, since $GS_{N}-GS=-(\lambda_{N}-\lambda)$,

$$\begin{aligned} \sqrt{n}\left(\widehat{GS}-GS\right) &=\sqrt{n}\left(\widehat{GS}-\widehat{GS}_{N}+\widehat{GS}_{N}-GS_{N}+GS_{N}-GS\right) \\ &= \sqrt{n}\left(\widehat{GS}-\widehat{GS}_{N}\right)+\sqrt{n}\left(\widehat{GS}_{N}-GS_{N} \right)-\sqrt{n}(\lambda_{N}-\lambda). \end{aligned} $$

For any positive integer $n$, consider a corresponding integer $N_{n}$. The probability that all the observations in the sample $\{X_{i}\}_{i=1}^{n}$ are less than or equal to $N_{n}$ is

$$\left(1-p_{N_{n}+1,+}\right)^{n}=\left(1-\sum\limits_{i=N_{n}+1}^{\infty} p_{i}\right)^{n}. $$

Therefore, if

$$\left(1-\sum\limits_{i=N_{n}+1}^{\infty} p_{i}\right)^{n} =\left(1-p_{N_{n}+1,+}\right)^{\frac{np_{N_{n}+1,+}}{p_{N_{n}+1,+}} }\approx e^{-np_{N_{n}+1,+}} \rightarrow 1, $$

that is,

$$ np_{N_{n}+1,+}\rightarrow 0 $$

(21)

then all the observations in the sample $\{X_{i}\}_{i=1}^{n}$ fall into the first $N_{n}$ species with probability tending to 1 as $n$ increases. In turn,

$$\widehat{GS}-\widehat{GS}_{N_{n}} $$

equals zero with probability tending to 1, by Eq. (20). Therefore,

$$\sqrt{n}\left(\widehat{GS}-\widehat{GS}_{N_{n}}\right) $$

converges to 0 in probability as $n$ increases.

In addition,

$$\begin{array}{*{20}l} \sqrt{n}\left(\lambda_{N_{n}}-\lambda \right) &=\sqrt{n}\left(\sum\limits_{i=1}^{N_{n}-1}p_{i}^{2}+p_{N_{n},+}^{2}-\sum\limits_{i=1}^{\infty}p_{i}^{2}\right) \\ &=\sqrt{n}\left[\left(\sum\limits_{i=N_{n}}^{\infty} p_{i}\right)^{2}-\sum\limits_{i=N_{n}}^{\infty} p_{i}^{2}\right] \\ &\leq \sqrt{n}\, p_{N_{n},+}^{2} \\ &\leq \sqrt{n}\, p_{N_{n},+}. \end{array} $$

Therefore, if $\sqrt {n}p_{N_{n},+}\rightarrow 0$ which is a weaker condition than (21), we have

$$ \sqrt{n}\left(\lambda_{N_{n}}-\lambda \right)\rightarrow 0. $$

Therefore, by Slutsky’s theorem, the theorem is proved. □

Appendix 3: Proof of Theorem 6

Theorem 6

If the distribution $\{p_{k}: k=1,\dots,K\}$ is homogeneous, then

$$ nK\left(\widehat{GS}_{1}-GS\right)\overset{d}{\to} K-1-\chi_{K-1}^{2}. $$

(16’)

Proof

For an iid random sample $\{X_{i}; i=1,\dots,n\}$ from the distribution $P$, let $\theta=\theta(P)$ be an estimable parameter and $h(X_{1},\dots,X_{m})$ a symmetric kernel satisfying $E_{P}\{h(X_{1},\dots,X_{m})\}=\theta(P)$. Let $U_{n}={{n}\choose {m}}^{-1}\sum \limits _{c}h(X_{i_{1}},\dots, X_{i_{m}})$, where ${\sum \nolimits }_{c}$ is the summation over the ${{n}\choose {m}}$ combinations of $m$ distinct elements $\{i_{1},\dots,i_{m}\}$ from $\{1,\dots,n\}$. Let $h_{1}(x_{1})=E_{P}\{h(x_{1},X_{2},\dots,X_{m})\}$ be the conditional expectation of $h$ given $X_{1}=x_{1}$, and $\zeta_{1}=Var_{P}\{h_{1}(X_{1})\}$. Also let $h_{2}(x_{1},x_{2})=E_{P}\{h(x_{1},x_{2},X_{3},\dots,X_{m})\}$ be the conditional expectation of $h$ given $X_{1}=x_{1}, X_{2}=x_{2}$, and $\zeta_{2}=Var_{P}\{h_{2}(X_{1},X_{2})\}$. Define

$$\tilde{h}_{2}=h_{2}-\theta. $$

Then we have the following lemmas by Hoeffding (1948).

Lemma 2

If $E_{P}(h^{2})<\infty$ and $\zeta_{1}>0$, then $\sqrt {n}\left (U_{n}-\theta \right) \overset {d}{\to } N\left (0,m^{2}\zeta _{1}\right)$.

Lemma 3

If $E_{P}(h^{2})<\infty$ and $\zeta_{1}=0<\zeta_{2}$, then $n\left (U_{n}-\theta \right) \overset {d}{\to } \frac {m(m-1)}{2}Y$, where $Y$ is a random variable of the form

$$Y=\sum\limits_{j}\lambda_{j}\left(\chi^{2}_{1j}-1\right), $$

where $\chi^{2}_{11},\chi^{2}_{12},\dots$ are independent $\chi _{1}^{2}$ variates and the $\lambda_{j}$ are the eigenvalues of the following operator on the function space $L_{2}(R,P)$:

$$ Ag(x)=\int_{-\infty}^{\infty} \tilde{h}_{2}(x,y)g(y)dP(y),\ x\in R, g\in L_{2}. $$

(22)

For our case, we have $\theta=GS+1$ and the kernel function is given by

$$ h(x_{1}, x_{2})=V(x_{1},x_{2})=2-I_{x_{1}=x_{2}}. $$

(23)

That is,

$$GS+1=E_{P}\{h(X_{1},X_{2})\} $$

for given population distribution P.

Under the assumption of a homogeneous population distribution, $\zeta_{1}=0$. Since

$$h(X_{1},X_{2})=2-I_{X_{1}=X_{2}}=\left\{ \begin{aligned} 1 &\,\text{if}\,X_{1}=X_{2} \\ 2 &\,\text{if}\,X_{1}{\neq} X_{2} \end{aligned} \right.$$

we have

$$\begin{array}{*{20}l} \zeta_{2}&=Var\left(h(X_{1},X_{2})\right) \\ &=E\left(h^{2}(X_{1},X_{2})\right)-\left(E\,h(X_{1},X_{2})\right)^{2} \\ &=P(X_{1}=X_{2})+4P(X_{1}\neq X_{2})-\left(P(X_{1}=X_{2})+2P(X_{1}\neq X_{2})\right)^{2} \\ &= P(X_{1}=X_{2})+4P(X_{1}\neq X_{2})-P^{2}(X_{1}=X_{2})-4P(X_{1}=X_{2})P(X_{1}\neq X_{2})-4P^{2}(X_{1}\neq X_{2}) \\ &=P(X_{1}=X_{2})(1-P(X_{1}=X_{2}))+4P(X_{1}\neq X_{2})\left[1-P(X_{1}=X_{2})-P(X_{1}\neq X_{2})\right] \\ &=P(X_{1}=X_{2})P(X_{1}\neq X_{2}) \\ &=\sum\limits_{i=1}^{K} p_{i}^{2}\left(1-\sum\limits_{i=1}^{K} p_{i}^{2}\right) >0. \end{array} $$

Also

$$\theta=GS+1=2-\sum\limits_{i=1}^{K}\frac{1}{K^{2}}=2-\frac{1}{K}. $$

Now we find the eigenvalues of the operator $A$ under the homogeneous distribution. We have $\tilde {h}_{2}(x,y)=2-I_{x=y}-\theta =\frac {1}{K}-I_{x=y}$, and

$$\begin{array}{*{20}l} Ag(x)&=\int_{-\infty}^{\infty} \tilde{h}_{2}(x,y)g(y)dP(y) \\ &= \int_{-\infty}^{\infty} \left(\frac{1}{K}-I_{x=y}\right) g(y)dP(y) \\ &=\frac{1}{K^{2}}\sum\limits_{i=1}^{K}g(i)-\frac{1}{K}g(x) \\ &=\frac{1}{K^{2}} \sum\limits_{i\neq x} g(i)+\left(\frac{1}{K^{2}}-\frac{1}{K}\right) g(x). \end{array} $$

Since $g:\{1,2,\dots,K\}\to R$, it can be viewed as a vector in $R^{K}$, and $A$ is a linear operator on $R^{K}$ with matrix representation

$$A\vec{g}=T\vec{g}, $$

where $T$ is a $K\times K$ matrix with $T(i,i)=\frac {1}{K^{2}}-\frac {1}{K}$ and $T(i,j)=\frac {1}{K^{2}}$ for $i\neq j$. The matrix $T$ has two distinct eigenvalues: $\lambda=0$ with multiplicity one and $\lambda =-\frac {1}{K}$ with multiplicity $K-1$.
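The eigenvalue claim can be checked exactly for a small $K$. The sketch below (an illustration with an arbitrarily chosen $K=5$; function names are assumptions) builds $T$ with rational arithmetic and verifies that the all-ones vector has eigenvalue 0 and that each contrast vector $e_{1}-e_{j}$ has eigenvalue $-1/K$. These $K-1$ contrasts are linearly independent, which accounts for the multiplicity.

```python
from fractions import Fraction

def build_T(K):
    """K x K matrix with 1/K^2 - 1/K on the diagonal and 1/K^2 elsewhere."""
    off = Fraction(1, K * K)
    return [[off - Fraction(1, K) if i == j else off for j in range(K)]
            for i in range(K)]

def matvec(T, v):
    """Matrix-vector product with exact rational arithmetic."""
    return [sum(t * x for t, x in zip(row, v)) for row in T]

K = 5  # arbitrary illustrative size
T = build_T(K)

# The all-ones vector is an eigenvector with eigenvalue 0.
ones = [Fraction(1)] * K
assert matvec(T, ones) == [0] * K

# Each contrast e_1 - e_j (j = 2, ..., K) is an eigenvector with eigenvalue -1/K.
for j in range(1, K):
    v = [Fraction(0)] * K
    v[0], v[j] = Fraction(1), Fraction(-1)
    assert matvec(T, v) == [Fraction(-1, K) * x for x in v]
```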

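Therefore, by Lemma 3 with $m=2$, $n\left(\widehat{GS}_{1}-GS\right)=n\left(U_{n}-\theta\right)\overset{d}{\to}-\frac{1}{K}\sum \limits_{j=1}^{K-1}\left(\chi^{2}_{1j}-1\right)$. Since the sum of $K-1$ independent $\chi_{1}^{2}$ variates is distributed as $\chi_{K-1}^{2}$, multiplying by $K$ gives $nK\left(\widehat{GS}_{1}-GS\right)\overset{d}{\to}K-1-\chi_{K-1}^{2}$, and the theorem is proved. □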

Appendix 4: About the variances of $\widehat {GS}$ and $\widehat {GS}_{1}$

From Eq. (12) and the proof of Theorem 6 in Appendix 3, we have

$$ \zeta_{1}= \sum\limits_{i=1}^{K} p_{i}^{3}-\left(\sum\limits_{i=1}^{K} p_{i}^{2}\right)^{2} $$

and

$$\zeta_{2}=\sum\limits_{i=1}^{K} p_{i}^{2}\left(1-\sum\limits_{i=1}^{K} p_{i}^{2}\right). $$

We use the following lemma by Hoeffding (1948).

Lemma 4

The variance of $U_{n}$ is given by

$$ Var_{F}(U_{n})=\dbinom{n}{m}^{-1} \sum\limits_{c=1}^{m} \dbinom{m}{c} \dbinom{n-m}{m-c} \zeta_{c} $$

(24)

Therefore,

$$\begin{array}{*{20}l} Var\left(\widehat{GS}_{1}\right) &=\dbinom{n}{2}^{-1}\left(2(n-2)\zeta_{1}+\zeta_{2}\right) \\ &=\frac{2}{n(n-1)}\left[2(n-2)\left(\sum\limits_{i=1}^{K} p_{i}^{3}-\left(\sum\limits_{i=1}^{K} p_{i}^{2}\right)^{2}\right)+\sum\limits_{i=1}^{K} p_{i}^{2}-\left(\sum\limits_{i=1}^{K} p_{i}^{2}\right)^{2}\right] \\ &=\frac{2}{n(n-1)}\left[2(n-2)\sum\limits_{i=1}^{K} p_{i}^{3}-(2n-3)\left(\sum\limits_{i=1}^{K} p_{i}^{2}\right)^{2}+\sum\limits_{i=1}^{K} p_{i}^{2}\right] \end{array} $$

From Bhargava and Uppuluri (1977), we have

$$Var\left(\widehat{GS}\right)=\frac{(n-1)^{2}}{n^{2}}\frac{2}{n(n-1)}\left[2(n-2)\sum\limits_{i=1}^{K} p_{i}^{3}-(2n-3)\left(\sum\limits_{i=1}^{K} p_{i}^{2}\right)^{2}+\sum\limits_{i=1}^{K} p_{i}^{2}\right] $$

Therefore, we have the following theorem.

Theorem 7

When K is finite, we have

$$Var\left(\widehat{GS}\right)=\frac{(n-1)^{2}}{n^{2}}Var\left(\widehat{GS}_{1}\right). $$
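For a small finite population, both variance statements can be checked exactly by enumerating every ordered sample. The sketch below uses rational arithmetic; the abundances, the sample size, and the function names are hypothetical choices made for illustration.

```python
from fractions import Fraction
from itertools import product

def exact_variances(p, n):
    """Exact Var(MLE) and Var(GS_1) by enumerating all K^n ordered samples."""
    K = len(p)
    moments = {"mle": [Fraction(0), Fraction(0)], "gs1": [Fraction(0), Fraction(0)]}
    for sample in product(range(K), repeat=n):
        prob = Fraction(1)
        for x in sample:
            prob *= p[x]
        counts = [sample.count(k) for k in range(K)]
        mle = 1 - sum(Fraction(f, n) ** 2 for f in counts)
        gs1 = 1 - Fraction(sum(f * (f - 1) for f in counts), n * (n - 1))
        for name, value in (("mle", mle), ("gs1", gs1)):
            moments[name][0] += prob * value        # first moment
            moments[name][1] += prob * value ** 2   # second moment
    return {name: m2 - m1 ** 2 for name, (m1, m2) in moments.items()}

def var_gs1_formula(p, n):
    """Closed form for Var(GS_1) obtained from Lemma 4 with m = 2."""
    s2 = sum(q ** 2 for q in p)
    s3 = sum(q ** 3 for q in p)
    return Fraction(2, n * (n - 1)) * (2 * (n - 2) * s3 - (2 * n - 3) * s2 ** 2 + s2)

p = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]  # hypothetical abundances
n = 4
v = exact_variances(p, n)
assert v["gs1"] == var_gs1_formula(p, n)
assert v["mle"] == Fraction((n - 1) ** 2, n ** 2) * v["gs1"]
```

Enumeration costs $K^{n}$ evaluations, so this check is only practical for very small $K$ and $n$; it is meant as a sanity check of the formulas, not as a computational method.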

References

Bhargava, N, Uppuluri, VRX: Sampling distributions of Gini’s index of diversity. Appl. Math. Campn. 3, 1–24 (1977).
Article MathSciNet MATH Google Scholar
DasGupta, A: Asymptotic Theory of Statistics and Probability. Springer, New York (2008).
MATH Google Scholar
Fang, KT: Occupancy problems. Encyclopedia of Statistical Sciences (Kotz, S, Johnson, NL, eds.)Wiley, New York (1985).
Google Scholar
Feller, W: An Introduction to Probability Theory and Its Applications, vol. 1. 2nd ed. Wiley, New York (1971).
MATH Google Scholar
Fritsch, KS, Hsu, JC: Multiple comparison of entropies with applications to dinosaur biodiversity. Biometrics. 55, 1300–1305 (1999).
Article MATH Google Scholar
Good, IJ: The population frequencies of species and the estimation of population parameters. Biometrika. 40, 237–264 (1953).
Article MathSciNet MATH Google Scholar
Hill, MO: Diversity and evenness: A unifying notation and its consequences. Ecology. 54, 427–432 (1973).
Article Google Scholar
Hoeffding, W: A class of statistics with asymptotically normal distribution. Ann. Math. Stat. 19, 293–325 (1948).
Article MathSciNet MATH Google Scholar
Hunter, PR, Gaston, MA: Numerical index of the discriminatory ability of typing systems: an application of Simpson’s index of diversity. J. Clin. Microbiol. 26(11), 246–466 (1988).
Google Scholar
Hurlbert, SH: The nonconcept of species diversity: A critique and alternative parameters. Ecology. 52, 577–586 (1971).
Article Google Scholar
Joag-Dev, K, Proschan, F: Birthday problem with unlike probabilities. Am. Mat. Mon. 99, 10–12 (1992).
Article MathSciNet MATH Google Scholar
Johnson, NL, Kotz, S: Urn models and their application. Wiley, New York (1977).
MATH Google Scholar
Ludwig, J, Reynolds, JF: Patterns of the abundance of species: a comparison of two hierarchical models. OIKOS. 53, 235–241 (1988).
Article Google Scholar
Mao, CX: Estimating species accumulation curves and diversity indices. Stat. Sinica. 17, 761–774 (2007).
MathSciNet MATH Google Scholar
Patil, GP, Taillie, C: An overview of diversity. In: Grassle, JF, Tatil, GP, Smith, WK, Taillie, C (eds.)Ecological Diversity in Theory and Practice, pp. 3–27. International Co-operative Publishing House, Fairland (1979).
Google Scholar
Peet, RK: The measurement of species diversity. Ann. Rev. Ecol. System. 5, 285–307 (1974).
Rao, CR: Diversity: Its measurement, decomposition, apportionment and analysis. Sankhya. A44, 1–22 (1982).
Renyi, A: On measures of entropy and information. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. Contributions to the Theory of Statistics. The Regents of the University of California, pp. 547–561 (1961).
Rennolls, K, Laumonier, Y: A New Local Estimator of Regional Species Diversity, in Terms of ‘Shadow Species’, with a Case Study from Sumatra. J. Trop. Ecol. 22, 321–329 (2006).
Ricotta, C: Through the jungle of biological diversity. Acta. Biotheor. 53(1), 29–38 (2005).
Rogers, J, Hsu, J: Multiple comparisons of biodiversity. Biometrical. J. 43, 617–625 (2001).
Shannon, CE: A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423 (1948).
Sheehan, PM, et al.: Sudden extinction of the Dinosaurs: Latest Cretaceous, Upper Great Plains, U.S.A. Science. 254(5033), 835–839 (1991).
Simpson, EH: Measurement of diversity. Nature. 163, 688 (1949).
Wagner, D: A generalized birthday problem. In: Crypto, vol. 2442, pp. 288–303. Springer-Verlag (2002).
Zhang, Z, Zhou, J: Re-parameterization of multinomial distributions and diversity indices. J. Stat. Plan. Infer. 140, 1731–1738 (2010).

Author information

Authors and Affiliations

Department of Mathematics, Tennessee Technological University, 1 William L Jones Dr, Cookeville, 38505, TN, USA
Lukun Zheng
Department of Mathematics and Statistics, UNC Charlotte, 9201 University City Blvd, Charlotte, 28223, NC, USA
Jiancheng Jiang

Contributions

LZ conceived and designed the study, collected, analyzed, and interpreted the data, and drafted the article. JJ critically revised the article and gave final approval of the version to be published. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Jiancheng Jiang.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Zheng, L., Jiang, J. A new diversity estimator. J Stat Distrib App 4, 12 (2017). https://doi.org/10.1186/s40488-017-0063-6

Received: 07 January 2017
Accepted: 06 July 2017
Published: 15 September 2017
DOI: https://doi.org/10.1186/s40488-017-0063-6
