A new diversity estimator
Journal of Statistical Distributions and Applications volume 4, Article number: 12 (2017)
Abstract
The maximum likelihood estimator (MLE) of Gini-Simpson’s diversity index (GS) is widely used but suffers from large bias when the number of species is large or infinite. We propose a new estimator of the GS index and show that it is unbiased. Asymptotic normality of the proposed estimator is established when the number of species in the population is finite and known, finite but unknown, and infinite. Simulations demonstrate advantages of our estimator over the MLE, and a real example on the extinction of dinosaurs supports the use of our approach. The Mathematics Subject Classification (MSC) code is 60E05 (distributions: general theory).
Introduction
Diversity indices are quantitative measures of both richness (the number of categories) and the evenness of the categories' relative abundances. See Rao (1982), Ludwig and Reynolds (1988), and Patil and Taillie (1979) for further information. Measuring the diversity of a population is important in practice. For example, in ecology, a gradual decline in diversity over time may indicate the gradual extinction of an ecosystem, while a rapid decline may indicate an extinction caused by a sudden impact. On this basis, scientists have argued that the extinction of the dinosaurs was due to a large asteroid impact roughly contemporaneous with the end of the Cretaceous. Gini-Simpson’s index (GS) and Shannon’s entropy are the two best-known diversity measures. They are widely used in fields such as ecology, demography, anthropology, and information theory. See Hurlbert (1971), Peet (1974), Hunter and Gaston (1988), and Rogers and Hsu (2001).
Consider a population with K species for which \(p_{i}\) denotes the relative abundance of species i (i=1,…,K) such that \(\sum \limits _{i=1}^{K}p_{i}=1\). Simpson (1949) proposed the index
\[\lambda =\sum \limits _{i=1}^{K}p_{i}^{2} \qquad (1)\]
to measure the degree of concentration for the population. Gini-Simpson’s index is defined as
\[GS=1-\lambda =1-\sum \limits _{i=1}^{K}p_{i}^{2}. \qquad (2)\]
There are also many other indices in the literature. See Shannon (1948), Good (1953), Renyi (1961), and Hill (1973), among others. In the biodiversity literature, according to Ricotta (2005), there is a “jungle” of biological measures of diversity. For a comprehensive discussion of the various relationships among these indices, one may refer to Rennolls and Laumonier (2006) and Mao (2007).
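As a small illustration of the two indices, the sketch below computes Simpson's concentration index \(\sum p_{i}^{2}\) and the Gini-Simpson index \(1-\sum p_{i}^{2}\) for two hypothetical abundance vectors. The function names and the example vectors are assumptions made for illustration; exact rational arithmetic is used only to keep the output free of rounding.

```python
from fractions import Fraction


def simpson_index(p):
    """Simpson's concentration index: sum of squared relative abundances."""
    return sum(x * x for x in p)


def gini_simpson(p):
    """Gini-Simpson index: one minus Simpson's concentration index."""
    return 1 - simpson_index(p)


# Hypothetical abundance vectors (each sums to 1).
uniform = [Fraction(1, 4)] * 4
skewed = [Fraction(7, 10), Fraction(1, 10), Fraction(1, 10), Fraction(1, 10)]

print(gini_simpson(uniform))  # 3/4
print(gini_simpson(skewed))   # 12/25
```

The even community has the larger index, which matches the reading of GS as a measure that rewards evenness as well as richness.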
Let \(\{X_{i}\}_{i=1}^{n}\) be an iid sample from the population \(\{p_{k}; k=1,\dots, K\}\), and \(f_{k}\) the observed frequency of the kth category. Let \(\hat {p}_{k}=\frac {f_{k}}{n}\) and \(\hat {P}=\{\hat {p}_{k}; k=1,\dots, K\}\). The most important estimator of GS is the MLE
\[\widehat {GS}=1-\sum \limits _{k=1}^{K}\hat {p}_{k}^{2}.\]
When K is finite, the MLE is asymptotically normal if the underlying distribution is inhomogeneous and is asymptotically distributed as chi-square if the underlying distribution is homogeneous. Another closely related estimator is given by
Bhargava and Uppuluri (1977) showed that it is unbiased and established its asymptotic distribution.
Although the MLE is asymptotically efficient when K is small relative to the sample size, it does not work well when K is large relative to n or infinite. This is easy to understand: there are only about n/K observations on average for estimating each parameter, so the MLE is inefficient when n/K is small. In fact, \(\widehat {GS}\) is inconsistent when K=∞ or when \(K=K_{n}\) converges to ∞ too fast, and one cannot use modern penalized estimation such as the lasso to estimate \(p_{k}\), since there is no sparsity structure here. As will be shown in this paper, the MLE also works for the case K=∞, but only under some restrictions. Most existing methodologies adjust the MLE to deal with this problem, but the adjustments result in very complicated forms with less tractable distributional characteristics. Practical techniques include the jackknife and the bootstrap; see Fritsch and Hsu (1999). Zhang and Zhou (2010) studied a group of estimators for \(\zeta _{u,v}\). Because of these problems, little is known about the asymptotic distributional characteristics except in a naive approach. This motivates us to propose a new approach to estimating the GS index. Our new estimator is unbiased, asymptotically normal, and efficient in all cases regarding the number of species K.
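The bias of the plug-in estimator can be made concrete with a standard multinomial identity: since \(E(\hat {p}_{k}^{2})=p_{k}^{2}+p_{k}(1-p_{k})/n\), one gets \(E(\widehat {GS})=(1-1/n)\,GS\). The sketch below checks this identity by exact enumeration for a small hypothetical population; the function names and the example probabilities are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def mle_gs(sample):
    """Plug-in (MLE) Gini-Simpson estimate: 1 - sum of squared sample proportions."""
    n = len(sample)
    counts = Counter(sample)
    return 1 - sum(Fraction(f, n) ** 2 for f in counts.values())


def expected_mle(p, n):
    """Exact E[mle_gs] by enumerating every ordered sample of size n (small cases only)."""
    total = Fraction(0)
    for sample in product(range(len(p)), repeat=n):
        prob = Fraction(1)
        for x in sample:
            prob *= p[x]
        total += prob * mle_gs(sample)
    return total


# Hypothetical population with three species.
p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
n = 4
gs = 1 - sum(x * x for x in p)

print(expected_mle(p, n), Fraction(n - 1, n) * gs)  # 11/24 11/24
```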
The remainder of the paper is organized as follows. In the “A general birthday problem” section, the birthday problem is generalized to cases with unequal probabilities and infinitely many categories, and the connection between the generalized birthday problem and the GS index is established. In “The estimator” section, an unbiased estimator of the GS index is proposed based on this connection. In the “Asymptotic properties” section, asymptotic normality is derived under all three cases with respect to the number of species in the population. In the “Examples and simulation studies” section, an empirical study of dinosaur extinction data and a simulation study demonstrate the performance of our estimator.
A general birthday problem
The birthday problem is an important example in standard textbooks such as Feller (1971). The problem is to find the probability that, among n students in a class, no two students share the same birthday, under the assumption that individuals’ birthdays are independent and that, for every individual, all 365 days of the year are equally likely birthdays. It has been generalized in many ways under the uniform probability assumption. See Johnson and Kotz (1977) and Fang (1985), among others. Birthday problems with unequal probabilities have also been studied over the years. For recent work, see Joag-Dev and Proschan (1992) and Wagner (2002), among others.
Similar to the Bernoulli trial, we define a categorical trial X as a random experiment with K possible outcomes (categories) with probability distribution \(P=\{p_{k}: k=1,\dots, K\}\), where K is finite (known or unknown) or infinite. We call it “a success of category i” if the outcome of a categorical trial belongs to category i. Consider an independent sequence of categorical trials \(\{X_{i}; i=1,2,\dots \}\) in which the probability of success of each category stays the same for each trial. Let \(H_{m}\) be the number of distinct categories that appear in the first m trials. We assume m≥2, since the case m=1 is trivial. Calculating the probability distribution of \(H_{m}\) is generally referred to as the birthday problem with unequal probabilities; see the references above. Let \(Y_{k}\) be the number of successes of the kth category in the first m trials and let \(I_{k}=1_{(Y_{k}=0)}\) for k=1,…,K be the indicator function with \(I_{k}=1\) if the kth category does not appear in the sample. Then
\[H_{m}=\sum \limits _{k=1}^{K}(1-I_{k}).\]
Theorem 1
For fixed m and finite or infinite value of K, we have
The proof of the theorem is given in the Appendix.
Remark 1
It is easy to see that \(Var(H_{m})\) is finite for fixed m. In fact, \(Var(H_{m})<m^{2}\).
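One consequence of the indicator construction above is that, by linearity of expectation, category k appears in the first m trials with probability \(1-(1-p_{k})^{m}\), so \(E(H_{m})=\sum _{k}\{1-(1-p_{k})^{m}\}\). The following sketch compares this expression with a brute-force enumeration for a small hypothetical distribution. The function names and the example probabilities are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product


def expected_distinct(p, m):
    """E(H_m) by linearity: category k appears with probability 1 - (1 - p_k)^m."""
    return sum(1 - (1 - pk) ** m for pk in p)


def expected_distinct_bruteforce(p, m):
    """E(H_m) by enumerating all ordered outcomes of m trials (small cases only)."""
    total = Fraction(0)
    for outcome in product(range(len(p)), repeat=m):
        prob = Fraction(1)
        for x in outcome:
            prob *= p[x]
        total += prob * len(set(outcome))  # number of distinct categories observed
    return total


# Hypothetical unequal-probability "birthday" distribution.
p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
for m in (2, 3, 4):
    print(m, expected_distinct(p, m), expected_distinct_bruteforce(p, m))
```

The enumeration grows as \(K^{m}\), so it is only meant as a check on small cases, not as a practical way to evaluate \(E(H_{m})\).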
Now we are ready to establish the connection between the generalized birthday problem and the GS index. For a categorical trial with K categories and probability distribution \(P=\{p_{k}: k=1,\dots, K\}\), we have a population of K species with relative abundances \(\{p_{k}: k=1,\dots, K\}\). A random sample of size m from this population corresponds to the first m trials in the independent sequence of categorical trials \(\{X_{i}; i=1,2,\dots \}\). As a result, the first m categorical trials can be equivalently viewed as a random sample of size m from the corresponding population, and consequently \(H_{m}\) represents the number of distinct species in a random sample of size m from the corresponding population. The following theorem shows that \(GS=1-{\sum \nolimits }_{k=1}^{K} p_{k}^{2}\) is the same as \(E(H_{2})-1\).
Theorem 2
Consider a population with K species and relative abundances \(P=\{p_{k}; k=1,\dots, K\}\). Then,
\[GS=E(H_{2})-1,\]
where \(H_{2}\) is the number of distinct species in a random sample of size 2.
The theorem is a direct result of Theorem 1 with m=2 and the definition of GS (Eq. (2)). It indicates that GS is an estimable parameter under population P.
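For m=2 the identity can also be checked by hand. The short derivation below is a sketch that uses only the definitions above: two independent draws show either one species (when they coincide) or two.

```latex
\begin{aligned}
P(X_{1}=X_{2}) &= \sum_{k=1}^{K} p_{k}^{2}, \\
E(H_{2}) &= 1\cdot P(X_{1}=X_{2}) + 2\cdot P(X_{1}\neq X_{2})
          = 2-\sum_{k=1}^{K} p_{k}^{2}, \\
E(H_{2})-1 &= 1-\sum_{k=1}^{K} p_{k}^{2} = GS .
\end{aligned}
```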
The estimator
Let \(\{X_{i}\}_{i=1}^{n}\) be an iid random sample of size n from population P with finite or infinite value of K. For any sub-sample \(\{X_{i_{1}},\dots, X_{i_{m}}:\, 1\leq i_{1}<\dots <i_{m}\leq n\}\) from this sample,
\(H_{m}(X_{i_{1}},\dots, X_{i_{m}})\) is the number of distinct species in the sub-sample, and it is a symmetric function of its arguments. Define the following U-statistic
\[Z_{n,m}={{n}\choose {m}}^{-1}{\sum \nolimits }_{c}H_{m}(X_{i_{1}},\dots, X_{i_{m}}),\]
where \({\sum \nolimits }_{c}\) denotes the summation over all the \({{n}\choose {m}}\) combinations of m distinct elements \(\{i_{1},\dots, i_{m}\}\) from {1, 2, …, n}. Then \(Z_{n,m}\to E(H_{m})\) almost surely as n→∞ by the asymptotic theory of U-statistics in DasGupta (2008). This motivates us to estimate GS by
\[\widehat {GS}_{1}=Z_{n,2}-1.\]
It is easy to verify that \(\widehat {GS}_{1}=Z_{n,2}-1\) is always an unbiased estimator of GS. In fact, \(H_{2}-1\) is an unbiased estimator of GS by Theorem 2, and \(Z_{n,2}-1\) is the average of \(H_{2}-1\) over all sub-samples of size 2 drawn from the full set of observations.
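The estimator can be computed directly from its definition as an average over pairs. The sketch below does this and also evaluates an equivalent frequency-based expression, \(1-\sum _{k}f_{k}(f_{k}-1)/\{n(n-1)\}\), which follows from counting the pairs of observations that belong to the same species. The function names and the example sample are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def gs1_pairs(sample):
    """Z_{n,2} - 1: average of H_2 - 1 over all unordered pairs of observations."""
    pairs = list(combinations(sample, 2))
    # H_2 - 1 equals 1 when the two observations differ and 0 otherwise.
    distinct = sum(1 for a, b in pairs if a != b)
    return Fraction(distinct, len(pairs))


def gs1_counts(sample):
    """Same quantity from frequencies: 1 - sum f_k (f_k - 1) / (n (n - 1))."""
    n = len(sample)
    counts = Counter(sample)
    return 1 - Fraction(sum(f * (f - 1) for f in counts.values()), n * (n - 1))


# Hypothetical sample of species labels.
sample = ["a", "b", "a", "c", "a", "b"]
print(gs1_pairs(sample), gs1_counts(sample))  # 11/15 11/15
```

The pairwise version costs O(n²) operations, while the frequency version needs a single pass over the sample, so the latter is the natural choice for larger n.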
Asymptotic properties
Asymptotic properties for MLE
We first establish the asymptotic normality of \(\widehat {GS}\) when K=∞, that is, when there are infinitely many species in the population. Assume the probability distribution is \(P=\{p_{i}; i=1,2,\dots \}\) with \(p_{i}\geq p_{i+1}\) for all i and \(\sum \limits _{i=1}^{\infty } p_{i}=1\). The corresponding Gini-Simpson’s index is \(GS=1-\sum \limits _{i=1}^{\infty } p_{i}^{2}=1-\lambda \). We have the following result.
Theorem 3
Let \(P=\{p_{i}; i=1,2,\dots \}\) be the probability distribution of a population with infinitely many species. Assume that there exists a sequence \(\{N_{n}\}_{n=1}^{\infty }\) such that \(np_{N_{n}+1,+}\rightarrow 0\), where \(p_{N,+}=\sum \limits _{i=N}^{\infty } p_{i}\) denotes the tail probability. Then we have the following
where
The proof is given in the Appendix.
The following theorem is implied by Bhargava and Uppuluri (1977) when K is finite, whether the distribution is homogeneous or inhomogeneous.
Theorem 4
If the underlying population distribution is inhomogeneous, then
where
If the underlying population distribution is homogeneous, we have
Asymptotic properties of \(\widehat {GS}_{1}\)
The above U-statistic construction paves the way to establish the asymptotic normality of \(Z_{n,2}\). For an iid random sample \(\{X_{i}; i=1,\dots, n\}\) under the distribution P, θ=θ(P) is an estimable parameter and \(h(X_{1},\dots, X_{m})\) is a symmetric kernel satisfying \(E_{P}\{h(X_{1},\dots, X_{m})\}=\theta (P)\). Let \(U_{n}={{n}\choose {m}}^{-1}\sum \limits _{c}h(X_{i_{1}},\dots, X_{i_{m}})\) where \({\sum \nolimits }_{c}\) is the summation over the \({{n}\choose {m}}\) combinations of m distinct elements \(\{i_{1},\dots, i_{m}\}\) from {1,…,n}. Let \(h_{1}(x_{1})=E_{P}\{h(x_{1},X_{2},\dots, X_{m})\}\) be the conditional expectation of h given \(X_{1}=x_{1}\), and \(\sigma _{1}^{2}=Var_{P}\{h_{1}(X_{1})\}\). Then we have the following proposition by Hoeffding (1948).
Proposition 1
If \(E_{P}(h^{2})<\infty \) and \(\sigma _{1}^{2}>0\), then \(\sqrt {n}\left (U_{n}-\theta \right) \overset {d}{\to } N\left (0,m^{2}\sigma _{1}^{2}\right)\).
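From Remark 1 and Theorem 2, \(E_{P}(h^{2})=Var(H_{2}(X_{1},X_{2}))+\{E(H_{2}(X_{1},X_{2}))\}^{2}\leq 4+(GS+1)^{2}<\infty \). Note that \(h_{1}(x_{1})=E_{P}\left (2-I_{X_{2}=x_{1}}\right)=2-p_{x_{1}}\). It follows that
\[\sigma _{1}^{2}=Var_{P}\left (p_{X_{1}}\right)=\sum \limits _{k=1}^{K}p_{k}^{3}-\left (\sum \limits _{k=1}^{K}p_{k}^{2}\right)^{2}\geq 0.\]
The equality holds if and only if the probability distribution \(\{p_{k}: k=1,\dots, K\}\) is uniform. Of course, if K=∞, the inequality holds strictly, since the distribution can never be uniform. Therefore, we have the following theorem.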
Theorem 5
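If the distribution \(\{p_{k}: k=1,\dots, K\}\) is not uniform, then
\[\sqrt {n}\left (\widehat {GS}_{1}-GS\right) \overset {d}{\to } N\left (0,4\sigma _{1}^{2}\right).\]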
Remark 2
Non-uniform distributions include two cases: non-uniform finite distributions (K<∞) and infinite distributions (K=∞).
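The role of non-uniformity can be seen numerically. With the kernel \(h(x_{1},x_{2})=2-I_{x_{1}=x_{2}}\) one has \(h_{1}(x)=2-p_{x}\), so \(\sigma _{1}^{2}=Var(p_{X_{1}})=\sum p_{k}^{3}-(\sum p_{k}^{2})^{2}\), which vanishes exactly when all \(p_{k}\) are equal. The sketch below evaluates this quantity and the large-sample variance \(4\sigma _{1}^{2}/n\) suggested by Proposition 1 with m=2. The function names and example distributions are assumptions made for illustration.

```python
from fractions import Fraction


def sigma1_sq(p):
    """Var(h_1(X_1)) with h_1(x) = 2 - p_x, i.e. sum p^3 - (sum p^2)^2."""
    return sum(x ** 3 for x in p) - sum(x ** 2 for x in p) ** 2


def asymptotic_variance(p, n):
    """Large-sample variance from Proposition 1 with m = 2: 4 * sigma_1^2 / n."""
    return 4 * sigma1_sq(p) / n


# Hypothetical distributions: one uniform, one not.
uniform = [Fraction(1, 5)] * 5
skewed = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

print(sigma1_sq(uniform))               # 0
print(sigma1_sq(skewed))                # 1/64
print(asymptotic_variance(skewed, 16))  # 1/256
```

In the uniform case the first-order variance is zero, which is why the homogeneous case needs a separate limit theorem with a different normalization.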
By (7), (12), and Theorem 1, it is easy to see that
is a consistent estimator of \(\sigma _{1}^{2}\). Hence the following corollary is established.
Corollary 1
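Under the conditions of Theorem 5, we have
\[\frac {\sqrt {n}\left (\widehat {GS}_{1}-GS\right)}{2\hat {\sigma }_{1}} \overset {d}{\to } N(0,1).\]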
For homogeneous distributions, we have the following result.
Theorem 6
If the distribution \(\{p_{k}: k=1,\dots, K\}\) is homogeneous, then
The proof is given in the Appendix. In the homogeneous case, our estimator performs comparably to the MLE.
Examples and simulation studies
Example 1
(Dinosaur Extinction) The cause of the extinction of dinosaurs at the end of the Cretaceous period remains a mystery. Among all the theories, it is now widely accepted that the extinction was due to a large asteroid impact at the end of the Cretaceous. Sheehan et al. (1991) argued that diversity remained relatively constant throughout the Cretaceous period. The scientists reason that if the disappearance of the dinosaurs was gradual, one should observe a decline in diversity prior to extinction.
The data were organized by dividing the formation into three equally spaced stratigraphic levels, each of which represented a period of approximately 730,000 years. Fossils were cross-tabulated according to the stratigraphic level and the family to which the dinosaur belonged. The families represented are Ceratopsidae, Hadrosauridae, Hypsilophodontidae, Pachycephalosauridae, Tyrannosauridae, Ornithomimidae, Saurornithoididae, and Dromaeosauridae. The summarized data are shown in Table 1, available in Rogers and Hsu (2001).
Denote the true values of the GS index at the Lower, Middle, and Upper levels by \(GS_{L}\), \(GS_{M}\), and \(GS_{U}\), respectively. It is interesting to ask whether the dinosaur diversity changed.
To address this question, we present 95% simultaneous confidence intervals for all the pairwise contrasts: \(GS_{L}-GS_{M}\), \(GS_{L}-GS_{U}\), and \(GS_{M}-GS_{U}\).
Using expressions for \(\widehat {GS}_{1}\) and \(\hat {\sigma }_{1}^{2}\) from the previous section and the normal approximation in our theorems, we obtain simultaneous confidence intervals for all pairwise contrasts. The results are provided in Table 2.
Since all the confidence intervals contain zero, we may infer that all three communities were practically equivalent with respect to the GS index. That is, there is no significant change or decline of the diversity over time. Therefore, our study supports the theory of a sudden extinction of dinosaurs.
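The kind of computation behind such intervals can be sketched as follows. This is not the paper's exact procedure: the sketch assumes independent samples at each level, a plug-in estimate of \(\sigma _{1}^{2}\) based on sample proportions, and a Bonferroni adjustment for the three contrasts. The counts are hypothetical and are not the fossil data; all function names are illustrative.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist


def gs1(counts):
    """Pairwise-average estimator from frequencies: 1 - sum f(f-1) / (n(n-1))."""
    n = sum(counts)
    return 1 - sum(f * (f - 1) for f in counts) / (n * (n - 1))


def var_gs1_plugin(counts):
    """Assumed plug-in estimate of 4 * sigma_1^2 / n using sample proportions."""
    n = sum(counts)
    p_hat = [f / n for f in counts]
    s2 = sum(x ** 3 for x in p_hat) - sum(x ** 2 for x in p_hat) ** 2
    return 4 * s2 / n


def pairwise_intervals(groups, level=0.95):
    """Bonferroni-adjusted normal intervals for all pairwise differences of GS."""
    n_pairs = len(groups) * (len(groups) - 1) // 2
    z = NormalDist().inv_cdf(1 - (1 - level) / (2 * n_pairs))
    out = {}
    for a, b in combinations(groups, 2):
        diff = gs1(groups[a]) - gs1(groups[b])
        half = z * sqrt(var_gs1_plugin(groups[a]) + var_gs1_plugin(groups[b]))
        out[(a, b)] = (diff - half, diff + half)
    return out


# Hypothetical family counts per level, for illustration only.
groups = {
    "Lower": [30, 25, 10, 5],
    "Middle": [28, 27, 9, 6],
    "Upper": [20, 18, 12, 10],
}
for pair, (lo, hi) in pairwise_intervals(groups).items():
    print(pair, round(lo, 3), round(hi, 3))
```

A contrast whose interval contains zero would be read as showing no detectable difference in diversity between the two levels at the chosen simultaneous level.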
Our proposed estimator has advantages over the MLE when the sample size n is not large relative to the number of species K, especially when K=∞. In the following we conduct a simulation study for K=∞. We omit simulations for other scenarios to save space.
Example 2
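(K=∞) Consider the population \(\{p_{k}=e^{-(k-1)/10}-e^{-k/10}: k\geq 1\}\). It is easy to calculate the true value of GS for this distribution:
\[GS=1-\sum \limits _{k=1}^{\infty }p_{k}^{2}=1-\frac {1-e^{-1/10}}{1+e^{-1/10}}\approx 0.95004.\]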
We generate random samples of size n=10, 50, and 100, and calculate the MLE \(\widehat {GS}\) and our proposed estimator \(\widehat {GS}_{1}\), together with their standard deviations (Eqs. (9’) and (14)). The simulation is based on 500 replications, and the results are obtained by averaging the corresponding estimates over the replications. Because the population distribution is known to be non-uniform, only the non-uniform (asymptotically normal) case applies to \(\widehat {GS}_{1}\). The simulation results are summarized in Table 3.
From Table 3, we see that the deviations of the MLEs from the true value GS=0.95004 are much greater than those of our proposed estimates. This is because \(\widehat {GS}\) has a large bias and because the sample coverage is limited when the sample size is small relative to the number of species. Our proposed estimator, in contrast, overcomes these obstacles since it is an unbiased estimator of GS. The table also shows that our proposed estimator has smaller variance.
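A simulation of this kind can be sketched as follows. The sketch is not the authors' code: it draws from the population \(p_{k}=e^{-(k-1)/10}-e^{-k/10}\) by rounding an exponential variate with rate 1/10 up to the next integer, then averages both estimators over 500 replications. The seed, the function names, and the frequency-based form of \(\widehat {GS}_{1}\) are assumptions made for illustration.

```python
import math
import random
from collections import Counter

RATE = 0.1  # p_k = exp(-(k-1)/10) - exp(-k/10), k >= 1


def true_gs():
    """Closed form: sum p_k^2 = (1 - r) / (1 + r) with r = exp(-1/10)."""
    r = math.exp(-RATE)
    return 1 - (1 - r) / (1 + r)


def draw_sample(n, rng):
    """P(k-1 <= T < k) = p_k for T ~ Exp(rate 1/10), so k = floor(T) + 1."""
    return [int(rng.expovariate(RATE)) + 1 for _ in range(n)]


def mle_gs(sample):
    """Plug-in estimator: 1 - sum of squared sample proportions."""
    n = len(sample)
    return 1 - sum((f / n) ** 2 for f in Counter(sample).values())


def gs1(sample):
    """Pairwise-average estimator: 1 - sum f(f-1) / (n(n-1))."""
    n = len(sample)
    return 1 - sum(f * (f - 1) for f in Counter(sample).values()) / (n * (n - 1))


def simulate(n, reps=500, seed=0):
    """Return the average MLE and average GS1 over `reps` simulated samples."""
    rng = random.Random(seed)
    mle_vals, gs1_vals = [], []
    for _ in range(reps):
        s = draw_sample(n, rng)
        mle_vals.append(mle_gs(s))
        gs1_vals.append(gs1(s))
    return sum(mle_vals) / reps, sum(gs1_vals) / reps


print("true GS:", round(true_gs(), 5))
for n in (10, 50, 100):
    print(n, simulate(n))
```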
Discussion
The birthday problem has been studied and extended in different forms and in many different areas. The same is true for diversity measures. The connection between these two topics is established in this paper through \(H_{2}\) and the most widely used Gini-Simpson’s index. There are many other related diversity indices in the literature, such as Shannon’s entropy and Renyi’s index. For these indices, corresponding estimators can be found in a similar way through the result in Theorem 1. The advantage of our approach over the MLE is clear when the sample size is not large relative to the number of species. There are many other open problems built on this connection between the birthday problem and diversity measures. For example, further investigation is needed to study the estimation of mutual information in view of the generalized birthday problem. Our approach provides a framework for solving various problems arising from diversity measures.
Appendix 1: Proof of Theorem 1
Theorem 1
For fixed m and finite or infinite value of K, we have
Proof
We first state the following lemma.
Lemma 1
For the class of random variables \(\{I_{k}; k=1,\dots, K\}\), we have
Lemma 1 can be verified easily.
When K is finite, the following equations are easily established.
When K is infinite, the above equations are guaranteed by the dominated convergence theorem. In fact, we have \(H_{m}\leq m\) and \(H_{m}^{2}\leq m^{2}\).
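The moments of the indicators that this kind of argument relies on can be written out directly from the definition \(I_{k}=1_{(Y_{k}=0)}\). The following is a sketch from the definitions, not necessarily the exact statement of Lemma 1: category k is missed in m independent trials with probability \((1-p_{k})^{m}\), and categories j and k are both missed with probability \((1-p_{j}-p_{k})^{m}\).

```latex
\begin{aligned}
E(I_{k}) &= P(Y_{k}=0) = (1-p_{k})^{m}, \\
E(I_{j}I_{k}) &= P(Y_{j}=0,\ Y_{k}=0) = (1-p_{j}-p_{k})^{m} \quad (j\neq k), \\
Cov(I_{j},I_{k}) &= (1-p_{j}-p_{k})^{m}-(1-p_{j})^{m}(1-p_{k})^{m}\ \leq\ 0 \quad (j\neq k).
\end{aligned}
```

The sign of the covariance follows from \((1-p_{j})(1-p_{k})=1-p_{j}-p_{k}+p_{j}p_{k}\geq 1-p_{j}-p_{k}\geq 0\).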
Appendix 2: Proof of Theorem 3
Theorem 3
Let \(P=\{p_{i}; i=1,2,\dots \}\) be the probability distribution of a population with infinitely many species. Assume that there exists a sequence \(\{N_{n}\}_{n=1}^{\infty }\) such that \(np_{N_{n}+1,+}\rightarrow 0\), where \(p_{N,+}=\sum \limits _{i=N}^{\infty } p_{i}\) denotes the tail probability. Then we have the following
where
Proof
Consider a sequence of populations with probability distributions \(P_{N}=\{p_{1},p_{2},\dots, p_{N-1},p_{N,+}\}\), where \(p_{N,+}=\sum \limits _{i=N}^{\infty } p_{i}\). The corresponding Gini-Simpson’s index is
It is easy to check that
as N→∞.
Let \(\{X_{i}\}_{i=1}^{n}\) be an iid sample from the population P. The MLE of GS is
For fixed N, let’s re-label the same sample \(\{X_{i}\}_{i=1}^{n}\) to another sample \(\{Y_{i}\}_{i=1}^{n}\) as follows:
Then \(\{Y_{i}\}_{i=1}^{n}\) can be regarded as an iid sample from \(P_{N}\) with Gini-Simpson’s index \(GS_{N}\).
The MLE of \(GS_{N}\) is
It is easy to see that
as N→∞. In fact,
if X i ≤N for all i=1,2,…,n.
Therefore,
For any positive integer n, consider a corresponding integer \(N_{n}\). The probability that all the observations in the sample \(\{X_{i}\}_{i=1}^{n}\) are less than or equal to \(N_{n}\) is
\[\left (1-p_{N_{n}+1,+}\right)^{n}.\]
Therefore, if
that is,
then all the observations in the sample \(\{X_{i}\}_{i=1}^{n}\) fall into the first \(N_{n}\) species with probability going to 1 as n increases. In turn,
equals zero with probability going to 1 due to Eq. (20). Therefore,
converges to 0 with probability going to 1 as n increases.
In addition,
Therefore, if \(\sqrt {n}p_{N_{n},+}\rightarrow 0\) which is a weaker condition than (21), we have
Therefore, by Slutsky’s theorem, the theorem is proved. □
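The re-labelling step in the proof can be illustrated numerically. The sketch below assumes the re-labelling merges every species with label at least N into the single category N, which is consistent with the definition of \(P_{N}\); it then compares the MLE on the merged sample with the MLE on the original sample. The function names and the sample are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def relabel(sample, N):
    """Merge every species with label >= N into the single category N (assumed re-labelling)."""
    return [min(x, N) for x in sample]


def mle_gs(sample):
    """Plug-in estimator: 1 - sum of squared sample proportions."""
    n = len(sample)
    return 1 - sum(Fraction(f, n) ** 2 for f in Counter(sample).values())


# Hypothetical species labels; the largest observed label is 7.
sample = [1, 3, 2, 7, 1, 5, 2, 1]
full = mle_gs(sample)
for N in (3, 5, 7, 10):
    print(N, mle_gs(relabel(sample, N)), full)
```

Once N is at least as large as the largest observed label, the two estimates coincide, which is the event whose probability the proof shows tends to 1.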
Appendix 3: Proof of Theorem 6
Theorem 6
If the distribution \(\{p_{k}: k=1,\dots, K\}\) is homogeneous, then
Proof
For an iid random sample \(\{X_{i}; i=1,\dots, n\}\) under the distribution P, θ=θ(P) is an estimable parameter and \(h(X_{1},\dots, X_{m})\) is a symmetric kernel satisfying \(E_{P}\{h(X_{1},\dots, X_{m})\}=\theta (P)\). Let \(U_{n}={{n}\choose {m}}^{-1}\sum \limits _{c}h(X_{i_{1}},\dots, X_{i_{m}})\) where \({\sum \nolimits }_{c}\) is the summation over the \({{n}\choose {m}}\) combinations of m distinct elements \(\{i_{1},\dots, i_{m}\}\) from {1,…,n}. Let \(h_{1}(x_{1})=E_{P}\{h(x_{1},X_{2},\dots, X_{m})\}\) be the conditional expectation of h given \(X_{1}=x_{1}\), and \(\zeta _{1}=Var_{P}\{h_{1}(X_{1})\}\). Also let \(h_{2}(x_{1},x_{2})=E_{P}\{h(x_{1},x_{2},X_{3},\dots, X_{m})\}\) be the conditional expectation of h given \(X_{1}=x_{1}, X_{2}=x_{2}\), and \(\zeta _{2}=Var_{P}\{h_{2}(X_{1},X_{2})\}\). Define
Then we have the following lemmas by Hoeffding (1948).
Lemma 2
If \(E_{P}(h^{2})<\infty \) and \(\zeta _{1}>0\), then \(\sqrt {n}\left (U_{n}-\theta \right) \overset {d}{\to } N\left (0,m^{2}\zeta _{1}\right)\).
Lemma 3
If \(E_{P}(h^{2})<\infty \) and \(\zeta _{1}=0<\zeta _{2}\), then \(n\left (U_{n}-\theta \right) \overset {d}{\to } \frac {m(m-1)}{2}Y\), where Y is a random variable of the form
where \(\chi _{11},\chi _{12},\dots \) are independent \(\chi _{1}^{2}\) variates and the \(\lambda _{j}\) are the eigenvalues of the following operator on the function space \(L_{2}(R,P)\):
For our case, we have θ=GS+1 and the kernel function given as
\[h(x_{1},x_{2})=2-I_{x_{1}=x_{2}}.\]
That is,
for given population distribution P.
Under the assumption of a homogeneous population distribution, \(\zeta _{1}=0\). Since
we have
Also
Now let’s find the eigenvalues of operator A under the homogeneous distribution. We have \(\tilde {h}_{2}(x,y)=2-I_{x=y}-\theta =\frac {1}{K}-I_{x=y}\). And
Since g:{1,2,…,K}→R, it can be viewed as a vector in \(R^{K}\), and A is a linear operator on \(R^{K}\). The matrix representation of A is
where T is a K×K matrix with \(T(i,i)=\frac {1}{K^{2}}-\frac {1}{K}\) and \(T(i,j)=\frac {1}{K^{2}}\) for i≠j. The matrix T has two eigenvalues λ=0 with multiplicity one and \(\lambda =-\frac {1}{K}\) with multiplicity K−1.
Therefore, by Lemma 3 and the properties of independent chi-square distributions, the theorem is proved. □
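The stated spectrum of T can be checked with exact arithmetic. Since \(T=K^{-2}J-K^{-1}I\), where J is the all-ones matrix, the all-ones vector is an eigenvector with eigenvalue 0, and every vector whose entries sum to zero is an eigenvector with eigenvalue \(-1/K\). The sketch below verifies both facts for a chosen K; the function names are illustrative.

```python
from fractions import Fraction


def build_T(K):
    """T(i,i) = 1/K^2 - 1/K and T(i,j) = 1/K^2 for i != j."""
    off = Fraction(1, K * K)
    return [[off - Fraction(1, K) if i == j else off for j in range(K)]
            for i in range(K)]


def matvec(T, v):
    """Matrix-vector product with exact rational arithmetic."""
    return [sum(t * x for t, x in zip(row, v)) for row in T]


def check_eigen(K):
    """Check eigenvalue 0 on the ones vector and -1/K on K-1 contrast vectors."""
    T = build_T(K)
    ones = [Fraction(1)] * K
    zero_ok = matvec(T, ones) == [0] * K
    contrast_ok = True
    for j in range(1, K):
        v = [Fraction(0)] * K
        v[0], v[j] = Fraction(1), Fraction(-1)  # e_0 - e_j, entries sum to zero
        contrast_ok &= matvec(T, v) == [-x / K for x in v]
    return zero_ok, contrast_ok


print(check_eigen(5))  # (True, True)
```

The K−1 contrast vectors are linearly independent, so together with the ones vector they account for all K eigenvalues.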
Appendix 4: About the variances of \(\widehat {GS}\) and \(\widehat {GS}_{1}\)
From the asymptotic results for the homogeneous case, we get that
and
We use the following lemma of Hoeffding (1948).
Lemma 4
The variance of \(U_{n}\) is given by
Therefore,
From Bhargava and Uppuluri (1977), we have
Therefore, we have the following theorem.
Theorem 7
When K is finite, we have
References
Bhargava, N, Uppuluri, VRX: Sampling distributions of Gini’s index of diversity. Appl. Math. Comput. 3, 1–24 (1977).
DasGupta, A: Asymptotic Theory of Statistics and Probability. Springer, New York (2008).
Fang, KT: Occupancy problems. In: Kotz, S, Johnson, NL (eds.) Encyclopedia of Statistical Sciences. Wiley, New York (1985).
Feller, W: An Introduction to Probability Theory and Its Applications, vol. 1. 2nd ed. Wiley, New York (1971).
Fritsch, KS, Hsu, JC: Multiple comparison of entropies with applications to dinosaur biodiversity. Biometrics. 55, 1300–1305 (1999).
Good, IJ: The population frequencies of species and the estimation of population parameters. Biometrika. 40, 237–264 (1953).
Hill, MO: Diversity and evenness: A unifying notation and its consequences. Ecology. 54, 427–432 (1973).
Hoeffding, W: A class of statistics with asymptotically normal distribution. Ann. Math. Stat. 19, 293–325 (1948).
Hunter, PR, Gaston, MA: Numerical index of the discriminatory ability of typing systems: an application of Simpson’s index of diversity. J. Clin. Microbiol. 26(11), 2465–2466 (1988).
Hurlbert, SH: The nonconcept of species diversity: A critique and alternative parameters. Ecology. 52, 577–586 (1971).
Joag-Dev, K, Proschan, F: Birthday problem with unlike probabilities. Am. Math. Mon. 99, 10–12 (1992).
Johnson, NL, Kotz, S: Urn models and their application. Wiley, New York (1977).
Ludwig, J, Reynolds, JF: Patterns of the abundance of species: a comparison of two hierarchical models. OIKOS. 53, 235–241 (1988).
Mao, CX: Estimating species accumulation curves and diversity indices. Stat. Sinica. 17, 761–774 (2007).
Patil, GP, Taillie, C: An overview of diversity. In: Grassle, JF, Patil, GP, Smith, WK, Taillie, C (eds.) Ecological Diversity in Theory and Practice, pp. 3–27. International Co-operative Publishing House, Fairland (1979).
Peet, RK: The measurement of species diversity. Ann. Rev. Ecol. System. 5, 285–307 (1974).
Rao, CR: Diversity: Its measurement, decomposition, apportionment and analysis. Sankhya. A44, 1–22 (1982).
Renyi, A: On measures of entropy and information. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. Contributions to the Theory of Statistics. The Regents of the University of California pp. 547–561 (1961).
Rennolls, K, Laumonier, Y: A New Local Estimator of Regional Species Diversity, in Terms of ’Shadow Species’, with a Case Study from Sumatra. J. Trop. Ecol. 22, 321–329 (2006).
Ricotta, C: Through the jungle of biological diversity. Acta. Biotheor. 53(1), 29–38 (2005).
Rogers, J, Hsu, J: Multiple comparisons of biodiversity. Biometrical. J. 43, 617–625 (2001).
Shannon, CE: A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423 (1948).
Sheehan, PM, et al.: Sudden extinction of the Dinosaurs: Latest Cretaceous, Upper Great Plains, U. S. A. Science 254.5033, 835–839 (1991).
Simpson, EH: Measurement of diversity. Nature. 163, 688 (1949).
Wagner, D: A generalized birthday problem. In Crypto, vol. 2442, pp. 288-303. Springer-Verlag (2002).
Zhang, Z, Zhou, J: Re-parameterization of multinomial distributions and diversity indices. J. Stat. Plan. Infer. 140, 1731–1738 (2010).
Author information
Contributions
LZ contributed in the following ways: Conception and design of study. Data collection, data analysis and interpretation. Drafting the article. JJ contributed to the paper in the following ways: Critical revision of the article. Final approval of the version to be published. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Zheng, L., Jiang, J. A new diversity estimator. J Stat Distrib App 4, 12 (2017). https://doi.org/10.1186/s40488-017-0063-6
DOI: https://doi.org/10.1186/s40488-017-0063-6