A new diversity estimator
- Lukun Zheng^{1} and
- Jiancheng Jiang^{2}
https://doi.org/10.1186/s40488-017-0063-6
© The Author(s) 2017
- Received: 7 January 2017
- Accepted: 6 July 2017
- Published: 15 September 2017
Abstract
The maximum likelihood estimator (MLE) of Gini-Simpson’s diversity index (GS) is widely used but suffers from large bias when the number of species is large or infinite. We propose a new estimator of the GS index and show that it is unbiased. Asymptotic normality of the proposed estimator is established in three cases: when the number of species in the population is finite and known, finite but unknown, and infinite. Simulations demonstrate the advantages of our estimator over the MLE, and a real example concerning the extinction of dinosaurs endorses the use of our approach. The Mathematics Subject Classification (MSC) code is 60E05 (distributions: general theory).
Keywords
- Diversity measure
- Gini-Simpson’s index
- U Statistics
Introduction
Diversity indices are quantitative measures of both richness (the number of categories) and the evenness of the categories’ relative abundances. See Rao (1982), Ludwig and Reynolds (1988), and Patil and Taillie (1979) for further information. Measuring the diversity of a population matters in practice. For example, in ecology, a decline in diversity over time may indicate the gradual extinction of an ecosystem, while a rapid decline may indicate an extinction caused by a sudden impact. On this basis, scientists have argued that the extinction of the dinosaurs was due to a large asteroid impact roughly contemporaneous with the end of the Cretaceous. Gini-Simpson’s index (GS) and Shannon’s entropy are the two best-known diversity measures. They are widely used in modern sciences such as ecology, demography, anthropology, and information theory. See Hurlbert (1971), Peet (1974), Hunter and Gaston (1988), and Rogers and Hsu (2001).
There are also many other indices in the literature. See Shannon (1948), Good (1953), Renyi (1961), and Hill (1973) among others. In the biodiversity literature, according to Ricotta (2005), there is a “jungle” of biological measures of diversity. For a comprehensive discussion of the relationships among these indices, one may refer to Rennolls and Laumonier (2006) and Mao (2007).
Bhargava and Uppuluri (1977) showed that it is unbiased and established its asymptotic distribution.
Although the MLE is asymptotically efficient when K is not large relative to the sample size, it does not work well when K is large or infinite. This is easy to understand: there are only about n/K observations on average for estimating each parameter, and hence the MLE is inefficient when n/K is small. In fact, \(\widehat {GS}\) is inconsistent when K=∞ or when \(K=K_{n}\) converges to ∞ too fast, and furthermore one cannot use modern penalized estimation, for example the lasso, to estimate \(p_{k}\), since there is no sparsity structure here. As will be shown in this paper, the MLE also works when K=∞, but only under some restrictions. Most existing methods make adjustments to deal with this problem, but the resulting estimators have very complicated forms and less tractable distributional characteristics. Practical techniques include the jackknife and the bootstrap; see Fritsch and Hsu (1999). Zhang and Zhou (2010) studied a group of estimators for \(\zeta_{u,v}\). Because of these problems, little is known about the asymptotic distributional characteristics except under a naive approach. This motivates us to propose a new approach to estimating the GS index. Our new estimator is unbiased, asymptotically normal, and efficient in all the cases for the number of species K.
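For concreteness, the MLE discussed above can be sketched as follows, assuming the usual plug-in form \(\widehat {GS}=1-\sum_{k}\hat {p}_{k}^{2}\) with \(\hat {p}_{k}=n_{k}/n\) the sample proportion of species k. The function name `gs_mle` is illustrative and does not come from the paper.

```python
from collections import Counter


def gs_mle(sample):
    """Plug-in estimate 1 - sum(p_hat_k ** 2), with p_hat_k = n_k / n."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)  # species label -> number of observations
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


if __name__ == "__main__":
    # Sample proportions are 1/2, 1/4, 1/4, so the estimate is 1 - 0.375.
    print(gs_mle(["a", "a", "b", "c"]))
```

Under iid sampling the expectation of this plug-in estimate is \((1-1/n)\,GS\), so it underestimates GS by GS/n on average.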
The remainder of the paper is organized as follows. In the “A general birthday problem” section, the birthday problem is generalized to cases with unequal probabilities and infinitely many categories, and the connection between the generalized birthday problem and the GS index is established. In “The estimator” section, based on this connection, an unbiased estimator of the GS index is proposed. In the “Asymptotic properties” section, asymptotic normality is derived under all three cases for the number of species in the population. In the “Examples and simulation studies” section, an empirical study of dinosaur extinction data and a simulation study demonstrate the performance of our estimator.
A general birthday problem
The birthday problem is a classic example in standard textbooks such as Feller (1971). The problem is to find the probability that, among n students in a class, no two students share the same birthday, under the assumption that individuals’ birthdays are independent and that, for every individual, all 365 days of the year are equally likely birthdays. It has been generalized in many ways under the uniform probability assumption. See Johnson and Kotz (1977) and Fang (1985) among others. Birthday problems with unequal probabilities have also been studied over the years. For recent work, see Joag-Dev and Proschan (1992) and Wagner (2002) among others.
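As a small illustration of the classical (uniform) version of the problem, the probability that n birthdays are all distinct is the product \(\prod_{i=0}^{n-1}(365-i)/365\). The sketch below evaluates it; the function name and the default of 365 days are choices made for this example.

```python
def prob_all_distinct(n: int, days: int = 365) -> float:
    """Probability that n independent, uniformly distributed birthdays are all distinct."""
    if n > days:
        return 0.0  # pigeonhole: some day must repeat
    prob = 1.0
    for i in range(n):
        prob *= (days - i) / days  # the (i+1)-th person avoids the i days already taken
    return prob


if __name__ == "__main__":
    for n in (10, 23, 50):
        print(n, round(prob_all_distinct(n), 4))
```

With 23 students the probability of no shared birthday drops just below one half, which is the usual textbook statement of the problem.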
Theorem 1
The proof of the theorem is given in the Appendix.
Remark 1
It is easy to see that \(Var(H_{m})\) is finite for fixed m. In fact, \(Var(H_{m})<m^{2}\).
Now we are ready to establish the connection between the generalized birthday problem and the GS index. For a categorical trial with K categories and probability distribution \(P=\{p_{k}: k=1,\dots,K\}\), we have a population of K species with relative abundances \(\{p_{k}: k=1,\dots,K\}\). A random sample of size m from this population corresponds to the first m trials in the independent sequence of categorical trials \(\{X_{i}; i=1,2,\dots \}\). As a result, the first m categorical trials can be equivalently viewed as a random sample of size m from the corresponding population, and consequently \(H_{m}\) represents the number of distinct species in a random sample of size m from the corresponding population. The following theorem shows that \(GS=1-{\sum \nolimits }_{k=1}^{K} p_{k}^{2}\) is the same as \(E(H_{2})-1\).
Theorem 2
where \(H_{2}\) is the number of distinct species in a random sample of size 2.
The theorem follows directly from Theorem 1 with m=2 and the definition of GS (Eq. (2)). It indicates that GS is an estimable parameter under population P.
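The identity can also be checked directly. The short derivation below is a sketch under the definitions above, writing \(X_{1}, X_{2}\) for two independent draws from P, so that \(H_{2}\) equals 1 when the draws coincide and 2 otherwise:

\[
E(H_{2}) = 1\cdot P(X_{1}=X_{2}) + 2\cdot P(X_{1}\neq X_{2}) = \sum_{k} p_{k}^{2} + 2\Bigl(1-\sum_{k} p_{k}^{2}\Bigr) = 2-\sum_{k} p_{k}^{2}.
\]

Subtracting 1 gives \(E(H_{2})-1 = 1-\sum_{k} p_{k}^{2} = GS\), whether K is finite or infinite.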
The estimator
Let \(\{X_{i}\}_{i=1}^{n}\) be an iid random sample of size n from population P with finite or infinite value of K. For any sub-sample \(\{X_{i_{1}},\dots, X_{i_{m}}:\, 1\leq i_{1}<\dots <i_{m}\leq n\}\) from this sample,
It is easy to verify that \(\widehat {GS}_{1}=Z_{n,2}-1\) is always an unbiased estimator of GS. In fact, \(H_{2}-1\) is an unbiased estimator of GS by Theorem 2, and \(Z_{n,2}-1\) is the average of \(H_{2}-1\) over all size-2 sub-samples of the full set of observations.
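A minimal sketch of how \(\widehat {GS}_{1}\) might be computed, assuming that \(Z_{n,2}\) is the average of \(H_{2}\) over all \({n \choose 2}\) pairs of observations, as the description above suggests. Since \(H_{2}-1\) is 1 for a pair of distinct species and 0 otherwise, the estimate is the fraction of unordered pairs whose members differ, which can be read off the species counts \(n_{k}\) as \(1-\sum_{k} n_{k}(n_{k}-1)/\{n(n-1)\}\). The function names are illustrative and not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def gs1_from_counts(counts):
    """Unbiased GS estimate from species counts: 1 - sum n_k(n_k-1) / (n(n-1))."""
    n = sum(counts)
    if n < 2:
        raise ValueError("need at least two observations")
    same_pairs = sum(c * (c - 1) for c in counts)  # ordered pairs with equal species
    return 1.0 - same_pairs / (n * (n - 1))


def gs1_brute_force(sample):
    """Average of H_2 - 1 over all unordered pairs (O(n^2), for checking only)."""
    pairs = list(combinations(sample, 2))
    return sum(1 for a, b in pairs if a != b) / len(pairs)


if __name__ == "__main__":
    sample = ["a", "a", "b", "c", "a", "b"]
    counts = list(Counter(sample).values())
    # Both routes give the same value for this toy sample.
    print(gs1_from_counts(counts), gs1_brute_force(sample))
```

The count-based form needs only one pass over the data, while the brute-force version mirrors the definition as an average over pairs and is useful as a check.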
Asymptotic properties
Asymptotic properties for MLE
We first establish the asymptotic normality of \(\widehat {GS}\) when K=∞, that is, when there are infinitely many species in the population. Assume the probability distribution is \(P=\{p_{i}; i=1,2,\dots \}\) with \(p_{i}\geq p_{i+1}\) for all i and \(\sum \limits _{i=1}^{\infty } p_{i}=1\). The corresponding Gini-Simpson’s index is \(GS=1-\sum \limits _{i=1}^{\infty } p_{i}^{2}=1-\lambda \). We have the following result.
Theorem 3
The proof is given in the Appendix.
The following theorem is implied by Bhargava and Uppuluri (1977) when K is finite, homogeneous or inhomogeneous.
Theorem 4
Asymptotic properties of \(\widehat {GS}_{1}\)
The above U-statistic construction paves the way to establishing the asymptotic normality of \(Z_{n,2}\). For an iid random sample \(\{X_{i}; i=1,\dots,n\}\) under the distribution P, suppose \(\theta =\theta (P)\) is an estimable parameter and \(h(X_{1},\dots,X_{m})\) is a symmetric kernel satisfying \(E_{P}\{h(X_{1},\dots,X_{m})\}=\theta (P)\). Let \(U_{n}={{n}\choose {m}}^{-1}\sum \limits _{c}h(X_{i_{1}},\dots, X_{i_{m}})\) where \({\sum \nolimits }_{c}\) is the summation over the \({{n}\choose {m}}\) combinations of m distinct elements \(\{i_{1},\dots,i_{m}\}\) from \(\{1,\dots,n\}\). Let \(h_{1}(x_{1})=E_{P}\{h(x_{1},X_{2},\dots,X_{m})\}\) be the conditional expectation of h given \(X_{1}=x_{1}\), and \(\sigma _{1}^{2}=Var_{P}\{h_{1}(X_{1})\}\). Then we have the following proposition by Hoeffding (1948).
Proposition 1
If \(E_{P}(h^{2})<\infty \) and \(\sigma _{1}^{2}>0\), then \(\sqrt {n}\left (U_{n}-\theta \right) \overset {d}{\to } N\left (0,m^{2}\sigma _{1}^{2}\right)\).
The equality holds if and only if the probability distribution \(\{p_{k}: k=1,\dots,K\}\) is uniform. Of course, if K=∞, the inequality holds strictly, since the distribution can never be uniform. Therefore, we have the following theorem.
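As a sketch of where the uniformity condition comes from, assume the kernel is \(h(x_{1},x_{2})=H_{2}\), the number of distinct values among \(x_{1}\) and \(x_{2}\), so that m=2 in Proposition 1. Conditioning on the first draw gives

\[
h_{1}(x) = E_{P}\{h(x,X_{2})\} = 1\cdot p_{x} + 2\,(1-p_{x}) = 2-p_{x},
\]

\[
\sigma _{1}^{2} = Var_{P}\{h_{1}(X_{1})\} = Var_{P}(p_{X_{1}}) = \sum_{k} p_{k}^{3}-\Bigl(\sum_{k} p_{k}^{2}\Bigr)^{2} \geq 0.
\]

The variance of \(p_{X_{1}}\) vanishes exactly when \(p_{X_{1}}\) is constant with probability one, that is, when all species with positive probability are equally likely. Under this reading, Proposition 1 yields the limiting variance \(4\sigma _{1}^{2}\) for \(\sqrt {n}(Z_{n,2}-E(H_{2}))\) whenever the distribution is not uniform.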
Theorem 5
Remark 2
Non-uniform distributions include two cases: non-uniform finite distributions (K<∞) and infinite distributions (K=∞).
is a consistent estimator of \(\sigma _{1}^{2}\). Hence the following corollary is established.
Corollary 1
For homogeneous distributions, we have the following result.
Theorem 6
The proof is given in the Appendix. In the homogeneous situation, our estimator achieves the same effect as the MLE.
Examples and simulation studies
Example 1
(Dinosaur Extinction) The cause of the extinction of dinosaurs at the end of the Cretaceous period remains a mystery. Among all the theories, the most widely accepted is that it was due to a large asteroid impact at the end of the Cretaceous. Sheehan et al. (1991) argued that diversity remained relatively constant throughout the Cretaceous period. Their reasoning is that if the disappearance of the dinosaurs had been gradual, one should observe a decline in diversity prior to extinction.
Dinosaur counts by family and stratigraphic level
Interval | Counts |
---|---|
Upper | (50, 29, 3, 0, 3, 4, 1, 0) |
Middle | (53, 51, 2, 0, 3, 8, 6, 0) |
Lower | (19, 7, 1, 0, 2, 0, 3, 0) |
Let \(GS_{L}\), \(GS_{M}\), and \(GS_{U}\) denote the true values of the GS index at the Lower, Middle, and Upper levels, respectively. It is of interest to ask whether dinosaur diversity changed across these levels.
To address this question, we present 95% simultaneous confidence intervals for all the pairwise contrasts: \(GS_{L}-GS_{M}\), \(GS_{L}-GS_{U}\), and \(GS_{M}-GS_{U}\).
95% simultaneous confidence intervals for all pairwise contrasts
Contrast | Estimate | Std. Error | Critical value \(z_{\alpha /6}\) | Lower bound | Upper bound |
---|---|---|---|---|---|
\(GS_{L}-GS_{M}\) | -0.0474 | 0.0733 | 2.3941 | -0.2229 | 0.1281 |
\(GS_{L}-GS_{U}\) | -0.0525 | 0.0774 | 2.3941 | -0.2378 | 0.1328 |
\(GS_{M}-GS_{U}\) | -0.0050 | 0.0414 | 2.3941 | -0.1041 | 0.0941 |
Since all the confidence intervals contain zero, we may infer that all three communities were practically equivalent with respect to the GS index. That is, there is no significant change or decline of the diversity over time. Therefore, our study supports the theory of a sudden extinction of dinosaurs.
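The bounds in the table of pairwise contrasts can be reproduced, up to rounding, with a short calculation. The sketch below assumes each interval has the form estimate ± \(z_{\alpha /6}\) × standard error with α=0.05, where \(z_{\alpha /6}\) is the upper α/6 quantile of the standard normal distribution (a Bonferroni-type adjustment for three two-sided intervals). The estimates and standard errors are copied from the table; the variable and function names are illustrative.

```python
from statistics import NormalDist

ALPHA = 0.05

# (estimate, standard error) copied from the table of pairwise contrasts.
CONTRASTS = {
    "GS_L - GS_M": (-0.0474, 0.0733),
    "GS_L - GS_U": (-0.0525, 0.0774),
    "GS_M - GS_U": (-0.0050, 0.0414),
}


def simultaneous_intervals(contrasts, alpha=ALPHA):
    """Return (z, intervals) with intervals of the form estimate +/- z * SE."""
    m = len(contrasts)
    # Two-sided intervals split alpha over m contrasts: upper alpha/(2m) quantile.
    z = NormalDist().inv_cdf(1 - alpha / (2 * m))
    intervals = {
        name: (est - z * se, est + z * se) for name, (est, se) in contrasts.items()
    }
    return z, intervals


if __name__ == "__main__":
    z, intervals = simultaneous_intervals(CONTRASTS)
    print(f"critical value: {z:.4f}")
    for name, (lo, hi) in intervals.items():
        print(f"{name}: ({lo:.4f}, {hi:.4f})")
```

Because the critical value depends only on α and the number of contrasts, the same helper applies to any set of estimates and standard errors.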
Our proposed estimator has advantages over the MLE when the sample size n is not large relative to the number of species K, especially when K=∞. In the following we conduct a simulation study for K=∞. We omit simulations for other scenarios to save space.
Example 2
Estimates of GS for Example 2
K=∞ | n=10 | n=50 | n=100 |
---|---|---|---|
\(\widehat {GS}\) | 0.8532 (std: 0.0354) | 0.9317 (std: 0.0107) | 0.9407 (std: 0.0068) |
\(\widehat {GS}_{1}\) | 0.9480 (std: 0.0075) | 0.9507 (std: 0.0061) | 0.9502 (std: 0.0053) |
From Table 3, we see that the deviations of the MLEs from the true value GS=0.95004 are much greater than those of our proposed estimates. This is because \(\widehat {GS}\) has a large bias and the sample coverage is limited when the sample size is small relative to the number of species. Our proposed estimator, in contrast, overcomes such obstacles since it is an unbiased estimator of GS. The results also show that our proposed estimator has smaller variance.
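A comparison of this kind can be sketched with a small Monte Carlo experiment. The code below is only an illustration under stated assumptions: it uses a hypothetical geometric distribution \(p_{k}=q(1-q)^{k-1}\) with q=0.05 (not the distribution of Example 2), for which \(GS=1-q/(2-q)\), and it assumes the plug-in form of the MLE and the pairwise form of \(\widehat {GS}_{1}\). All names are illustrative.

```python
import math
import random
from collections import Counter


def draw_geometric(q, rng):
    """Draw k in {1, 2, ...} with P(k) = q * (1 - q) ** (k - 1) by inversion."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.log(u) / math.log(1.0 - q)) + 1


def gs_mle(counts, n):
    """Plug-in estimate 1 - sum (n_k / n) ** 2."""
    return 1.0 - sum((c / n) ** 2 for c in counts)


def gs1(counts, n):
    """Pairwise estimate 1 - sum n_k (n_k - 1) / (n (n - 1))."""
    return 1.0 - sum(c * (c - 1) for c in counts) / (n * (n - 1))


def simulate(n, q=0.05, reps=2000, seed=0):
    """Return the Monte Carlo means of the two estimators over `reps` samples."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    mle_sum = gs1_sum = 0.0
    for _ in range(reps):
        counts = Counter(draw_geometric(q, rng) for _ in range(n)).values()
        mle_sum += gs_mle(counts, n)
        gs1_sum += gs1(counts, n)
    return mle_sum / reps, gs1_sum / reps


if __name__ == "__main__":
    true_gs = 1.0 - 0.05 / (2.0 - 0.05)
    for n in (10, 50, 100):
        mle_mean, gs1_mean = simulate(n)
        print(n, round(mle_mean, 4), round(gs1_mean, 4), round(true_gs, 4))
```

The sketch reports only the Monte Carlo means, which is enough to see the downward bias of the plug-in estimate at small n.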
Discussion
The birthday problem has been studied and extended in different forms and in many different areas. The same is true for diversity measures. This paper establishes the connection between the two topics through \(H_{2}\) and the widely used Gini-Simpson’s index. There are many other related diversity indices in the literature, such as Shannon’s entropy and Renyi’s index. For these indices, corresponding estimators can be found in a similar way through the result in Theorem 1. The advantage of our approach over the MLE is clearest when the sample size is not large relative to the number of species. Many open problems build on this connection between the birthday problem and diversity measures. For example, further investigation is needed to study the estimation of mutual information in view of the generalized birthday problem. Our approach provides a framework for solving various problems arising from diversity measures.
Appendix 1: Proof of Theorem 1
Theorem 1
Proof
Let’s consider the following lemma first. □
Lemma 1
Lemma 1 can be verified easily.
Appendix 2: Proof of Theorem 3
Theorem 3
where
Proof
Then \(\{Y_{i}\}_{i=1}^{n}\) can be regarded as an iid sample from \(P_{N}\) with Gini-Simpson’s index \(GS_{N}\).
if \(X_{i}\leq N\) for all i=1,2,…,n.
Therefore, by Slutsky’s theorem, the theorem is proved. □
Appendix 3: Proof of Theorem 6
Theorem 6
Proof
Then we have the following lemmas by Hoeffding (1948). □
Lemma 2
If \(E_{P}(h^{2})<\infty \) and \(\zeta _{1}>0\), then \(\sqrt {n}\left (U_{n}-\theta \right) \overset {d}{\to } N\left (0,m^{2}\zeta _{1}\right)\).
Lemma 3
Therefore, by Lemma 3 and the properties of independent chi-square distributions, the theorem is proved.
Appendix 4: About the variances of \(\widehat {GS}\) and \(\widehat {GS}_{1}\)
By the following lemma by Hoeffding (1948):
Lemma 4
Therefore, we have the following theorem.
Theorem 7
Declarations
Authors’ contributions
LZ contributed in the following ways: Conception and design of study. Data collection, data analysis and interpretation. Drafting the article. JJ contributed to the paper in the following ways: Critical revision of the article. Final approval of the version to be published. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
- Bhargava, N, Uppuluri, VRX: Sampling distributions of Gini’s index of diversity. Appl. Math. Campn. 3, 1–24 (1977).
- DasGupta, A: Asymptotic Theory of Statistics and Probability. Springer, New York (2008).
- Fang, KT: Occupancy problems. Encyclopedia of Statistical Sciences (Kotz, S, Johnson, NL, eds.) Wiley, New York (1985).
- Feller, W: An Introduction to Probability Theory and Its Applications, vol. 1. 2nd ed. Wiley, New York (1971).
- Fritsch, KS, Hsu, JC: Multiple comparison of entropies with applications to dinosaur biodiversity. Biometrics. 55, 1300–1305 (1999).
- Good, IJ: The population frequencies of species and the estimation of population parameters. Biometrika. 40, 237–264 (1953).
- Hill, MO: Diversity and evenness: A unifying notation and its consequences. Ecology. 54, 427–432 (1973).
- Hoeffding, W: A class of statistics with asymptotically normal distribution. Ann. Math. Stat. 19, 293–325 (1948).
- Hunter, PR, Gaston, MA: Numerical index of the discriminatory ability of typing systems: an application of Simpson’s index of diversity. J. Clin. Microbiol. 26(11), 246–466 (1988).
- Hurlbert, SH: The nonconcept of species diversity: A critique and alternative parameters. Ecology. 52, 577–586 (1971).
- Joag-Dev, K, Proschan, F: Birthday problem with unlike probabilities. Am. Mat. Mon. 99, 10–12 (1992).
- Johnson, NL, Kotz, S: Urn models and their application. Wiley, New York (1977).
- Ludwig, J, Reynolds, JF: Patterns of the abundance of species: a comparison of two hierarchical models. OIKOS. 53, 235–241 (1988).
- Mao, CX: Estimating species accumulation curves and diversity indices. Stat. Sinica. 17, 761–774 (2007).
- Patil, GP, Taillie, C: An overview of diversity. In: Grassle, JF, Tatil, GP, Smith, WK, Taillie, C (eds.) Ecological Diversity in Theory and Practice, pp. 3–27. International Co-operative Publishing House, Fairland (1979).
- Peet, RK: The measurement of species diversity. Ann. Rev. Ecol. System. 5, 285–307 (1974).
- Rao, CR: Diversity: Its measurement, decomposition, apportionment and analysis. Sankhya. A44, 1–22 (1982).
- Renyi, A: On measures of entropy and information. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. Contributions to the Theory of Statistics. The Regents of the University of California pp. 547–561 (1961).
- Rennolls, K, Laumonier, Y: A New Local Estimator of Regional Species Diversity, in Terms of ’Shadow Species’, with a Case Study from Sumatra. J. Trop. Ecol. 22, 321–329 (2006).
- Ricotta, C: Through the jungle of biological diversity. Acta. Biotheor. 53(1), 29–38 (2005).
- Rogers, J, Hsu, J: Multiple comparisons of biodiversity. Biometrical. J. 43, 617–625 (2001).
- Shannon, CE: A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423 (1948).
- Sheehan, PM, et al.: Sudden extinction of the Dinosaurs: Latest Cretaceous, Upper Great Plains, U. S. A. Science 254.5033, 835–839 (1991).
- Simpson, EH: Measurement of diversity. Nature. 163, 688 (1949).
- Wagner, D: A generalized birthday problem. In Crypto, vol. 2442, pp. 288-303. Springer-Verlag (2002).
- Zhang, Z, Zhou, J: Re-parameterization of multinomial distributions and diversity indices. J. Stat. Plan. Infer. 140, 1731–1738 (2010).