Open Access

Flexible modelling of survival curves for censored data

Journal of Statistical Distributions and Applications 2016, 3:6

DOI: 10.1186/s40488-016-0045-0

Received: 1 July 2015

Accepted: 16 February 2016

Published: 29 February 2016

Abstract

This article outlines flexible strategies to model survival curves for censored data and find parametric confidence intervals using generalised lambda distributions. Owing to the rich shapes of generalised lambda distributions, these distributions are well suited to the problem of estimating survival curves. This article presents three useful techniques for estimating survival curves: matching partial probability weighted moments (PWM), maximum likelihood estimation (MLE) and simulation-refitting (SR) methods. The performance of these techniques is examined using right skewed, left skewed, symmetric bell curved and extreme value simulated data with varying degrees of censoring and sample sizes. Applications of the proposed methods in the context of multi-stage disease modelling and competing risks are also provided. Under controlled simulated experiments, PWM and MLE estimation tend to give more precise estimates of survival curves than the SR method; however, the SR method tends to perform better in practice. The methods proposed in this article are very general and can be used to fit a wide range of empirical survival curves. Compared to the standard Kaplan Meier survival curve, the methods in this article have the added benefits of producing smoother survival curves and more consistent statistical estimates, since all the statistical information of the survival curve can be obtained directly under one parametric model.

Introduction

The parametric modelling of the survival curve has always been a tricky task; it involves identifying a suitable probability density function and its parameters for incomplete data. The problem of finding a statistical distribution for survival curves can be broken down into two parts: 1) the problem of identifying a suitable distribution and 2) the problem of estimation based on the assumed distribution. This article proposes that the problem of distribution identification can be solved by using generalized lambda distributions (GLDs). The second problem of finding suitable GLD estimates in the case of censored data can be solved using the following methods: direct maximum likelihood estimation, matching partial probability weighted moments and the simulating-refitting method. All of these methods are discussed below in the Estimation algorithms for survival data section.

Traditionally, the identification of a suitable probability density function is often difficult or time consuming, since many statistical distributions have a limited range of shapes. Even if a probability density can be found, parameter estimation is an additional hurdle. The usual method of maximum likelihood estimation incorporating censoring may not always work. A common problem with this approach is that the numerical method used in the optimisation process may fail to give a reasonable solution due to the complexity of the likelihood function. This is particularly relevant to multi-stage disease situations or where a mixture of statistical distributions is needed to estimate the overall survival curve.

To date, several solutions have been proposed for this problem. Owing to the significance of the Normal distribution in statistics, several authors have proposed the use of Normal mixtures (Komárek 2009, McLachlan and Peel 2000, Böhning and Seidel 2003) to model a range of empirical data. While Normal mixture models are available, the limited shape of the Normal distribution (it must be symmetric and unimodal) means that a relatively complicated model can be required for relatively simple data. For example, Fig. 1 shows a unimodal, skewed distribution being modelled by a number of distributions, including a mixture of six Normal distributions, which is the optimal model using BIC for EM initialized by hierarchical clustering (Fraley and Raftery 2002, 2007). The true distribution is the FMKL (Freimer et al. 1988) GLD with parameters λ1 = 4.56687718, λ2 = 0.33274810, λ3 = 0.65408979, λ4 = -0.01021826; 5000 observations were generated from this distribution and fitted using the Normal mixture model. The Normal mixture model appears to be overly complex and still fails to capture the true underlying shape of the distribution. Other distributions such as the gamma distribution give a much simpler and more convincing shape, but still overestimate the peak of the distribution. The skewed Normal distribution (Azzalini 1999) is even more inaccurate than the gamma distribution: it misses the mode of the true distribution and does not give the correct shape.
Fig. 1

Normal mixture, Gamma and skewed Normal approximation for a distribution with a relatively simple shape. The real distribution is FMKL GLD(4.56687718, 0.33274810, 0.65408979, -0.01021826)

In practice, owing to the complexity of real life data, it is cumbersome to find a good statistical distribution for empirical data by trial and error. Instead, it is preferable to use a distribution with very flexible shapes. In the current literature, for survival data, perhaps the most well-known distributions are the exponentiated Weibull distribution (Mudholkar and Srivastava 1993, Mudholkar et al. 1995, Singh et al. 2005) and the four parameter Weibull distribution (Wahed et al. 2009, Jeong 2006). Other distributions such as the generalised hyperbolic distribution, g and h distributions and many others could also be suitable candidates. The superiority of GLDs over the exponentiated Weibull distribution, and their comparable performance to the four parameter Weibull distribution in some general settings, is demonstrated using simulation studies in the Simulation studies section. While GLDs may not always outperform specific techniques developed for a special case of survival data, the aim of this article is to present the GLD as a useful general purpose parametric model for survival data. Once a GLD fit to survival data is attained, other summary statistics such as hazards can be easily obtained and will be consistent with the same underlying survival distribution, and the distributional shape of survival times can be easily gleaned by plotting the fitted GLD. In contrast, the Kaplan Meier (KM) method yields only survival probabilities and survival curves; other statistical estimates, such as hazards and the shape of the underlying distribution, need to be extracted using other techniques, potentially leading to a loss in consistency of estimates because different estimation methods are applied to the same data. It is also well known that a parametric method is more powerful in detecting a significant difference than a non-parametric method such as the KM method, so a parametric approach to analysing survival data offers potential cost savings by requiring fewer patients in clinical trials.

In the statistical literature, GLDs are well known for their flexibility of shapes and both RS GLD (Ramberg and Schmeiser 1974, Ramberg et al. 1979) and FMKL GLD (Freimer et al. 1988) have been used to fit a wide range of empirical data (Karian and Dudewicz 2000, Su 2007b, 2007a, 2010a, 2010b). The versatility of the GLD in accurately approximating a range of well-known statistical distributions such as Normal, T, Chi-squared, F and others is well known (Karian and Dudewicz 2000, Su 2005, 2007b, 2010a). The downside of GLDs is that they are defined by their quantile functions (i.e. the GLD probability density function must be calculated numerically) and for RS GLD, there are a number of restrictions on the range of parameters to ensure RS GLD is a proper probability density function. Recent statistical research has overcome some of these difficulties and there are a number of stable methods available to fit GLDs to empirical data: maximum likelihood estimation (Su 2007b, 2007a), quantile matching (Su 2010a, 2010b), the starship method (King and MacGillivray 1999) and L moments matching (Asquith 2007a, Karvanen and Nuutinen 2008). It is also possible to fit mixtures of GLDs to data with unusual shapes using maximum likelihood estimation or quantile matching (Su 2007a, 2010a).

While the above methods are adequate for fitting uncensored survival data, special adjustments and some novelty are needed to ensure the effectiveness of GLDs is maintained for survival data with censoring. The methods proposed in this paper (direct maximum likelihood estimation, matching partial probability weighted moments and the simulating-refitting method) all extend the methods developed in Su (2007b, 2007a, 2010a, 2010b) to facilitate a successful GLD fit to survival data. Although the theoretical justifications of these methods for fitting any statistical distribution to data can be found in the literature (Aldrich 1997, Fisher 1922, Hosking 1990), it is unclear how these methods would perform in the case of GLDs. This article aims to fill these gaps in the literature by describing the mathematical development of these fitting methods for GLDs and illustrating the typical performance of GLDs through extensive simulations (Simulation studies) and some real life data examples. Various discussions on the practical issues of fitting distributions to survival data, such as modelling a multi-stage disease scenario (Application in multi-stage disease modelling) and the difference between controlled simulation performance and real life performance (Application in empirical data modelling), are also presented in this article.

Characterization of survival curve

For survival data with observed failure times t 1, t 2, t 3, … t m , let f(t) be the probability density function and F(t) be the cumulative distribution function of survival time; then the survival curve is given by 1 − F(t). The problem is now to find an estimate of F(t) for given data under censoring. The problem of choosing f(t) and F(t) is solved by using generalized lambda distributions, and the estimation of the parameters of f(t) and F(t) is achieved using one of the three methods described in the Estimation algorithms for survival data section. A brief introduction to generalized lambda distributions is given below.

Generalized lambda distributions

There are two types of generalised lambda distributions. The RS (Ramberg and Schmeiser 1974, Ramberg et al. 1979) generalised lambda distribution with parameters λ 1, λ 2, λ 3, λ 4 is defined by its quantile function (the inverse of the cumulative distribution function), where u is a probability between 0 and 1.
$$ {F}^{-1}(u)={\lambda}_1+\frac{u^{\lambda_3}-{\left(1-u\right)}^{\lambda_4}\ }{\lambda_2},\ 0\le u\le 1 $$
(1)
The probability density function for RS GLD is given in (2), noting that \( {F}^{-1}(u)=t \) and hence \( f(t)=\frac{1}{\frac{d{F}^{-1}(u)}{du}} \):
$$ f(t)=\frac{\lambda_2}{{\lambda}_3{u}^{\lambda_3-1}+{\lambda}_4{\left(1-u\right)}^{\lambda_4-1}} $$
(2)
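Equations (1) and (2) can be checked numerically. The following Python sketch (the language, function names and parameter values are illustrative assumptions and are not taken from any GLD package) evaluates the RS GLD quantile function and the density in terms of u, and compares (2) against a central finite-difference approximation of 1/(dF^{-1}(u)/du).

```python
def rs_gld_quantile(u, lam1, lam2, lam3, lam4):
    """Equation (1): RS GLD quantile function F^{-1}(u), for 0 <= u <= 1."""
    return lam1 + (u ** lam3 - (1.0 - u) ** lam4) / lam2


def rs_gld_density_at_u(u, lam2, lam3, lam4):
    """Equation (2): density f(t) at t = F^{-1}(u), for 0 < u < 1."""
    return lam2 / (lam3 * u ** (lam3 - 1.0) + lam4 * (1.0 - u) ** (lam4 - 1.0))


def finite_difference_density(u, lams, h=1e-6):
    """Approximate 1 / (dF^{-1}(u)/du) with a central difference."""
    slope = (rs_gld_quantile(u + h, *lams) - rs_gld_quantile(u - h, *lams)) / (2.0 * h)
    return 1.0 / slope


# Illustrative parameter values (an assumption made for this sketch).
lams = (0.0, 1.0, 0.5, 1.0)
for u in (0.1, 0.5, 0.9):
    t = rs_gld_quantile(u, *lams)
    exact = rs_gld_density_at_u(u, *lams[1:])
    approx = finite_difference_density(u, lams)
    print(f"u={u:.1f}  t={t:+.4f}  f(t)={exact:.6f}  finite difference={approx:.6f}")
```

The two density columns should agree to several decimal places, which is a quick way to confirm that (2) is the reciprocal of the derivative of (1).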
The RS GLD is only defined if \( \frac{\lambda_2}{\lambda_4{\left(1-u\right)}^{\lambda_4-1}+{\lambda}_3{u}^{\lambda_3-1}}\ge 0 \) for all u. King and MacGillivray (1999) discussed the conditions under which RS GLD is a valid probability density function. Due to these restrictions, another type of generalised lambda distribution, commonly known as the FMKL GLD (Freimer et al. 1988), was introduced. The only restriction for FMKL GLD is λ 2 > 0, since λ 2 appears in the denominator of the quantile function and scales the density. When λ 3 ≠ 0, λ 4 ≠ 0, the FMKL GLD takes the following form:
$$ {F}^{-1}(u)={\lambda}_1+\frac{\frac{u^{\lambda_3}-1}{\lambda_3}-\frac{{\left(1-u\right)}^{\lambda_4}-1}{\lambda_4}\ }{\lambda_2},\ 0\le u\le 1 $$
(3)
For completeness, when either λ 3 or λ 4 or both are equal to zero, the FMKL GLD takes a different limiting form:
$$ {\lambda}_3=0,{\lambda}_4\ne 0 $$
$$ {F}^{-1}(u)={\lambda}_1+\frac{ \log (u)-\frac{{\left(1-u\right)}^{\lambda_4}-1}{\lambda_4}\ }{\lambda_2},\ 0\le u\le 1 $$
(4)
$$ {\lambda}_3\ne 0,{\lambda}_4=0 $$
$$ {F}^{-1}(u)={\lambda}_1+\frac{\frac{u^{\lambda_3}-1}{\lambda_3}- \log \left(1-u\right)\ }{\lambda_2},\ 0\le u\le 1 $$
(5)
$$ {\lambda}_3=0,{\lambda}_4=0 $$
$$ {F}^{-1}(u)={\lambda}_1+\frac{ \log (u)- \log \left(1-u\right)\ }{\lambda_2},\ 0\le u\le 1 $$
(6)
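The four cases (3) to (6) can be collected into a single function, since each bracketed term is replaced by its logarithmic limit when the corresponding shape parameter is zero. The Python sketch below is a minimal illustration under that reading; the function name is hypothetical and the inputs are restricted to 0 < u < 1 so that the logarithms are defined.

```python
import math


def fmkl_gld_quantile(u, lam1, lam2, lam3, lam4):
    """FMKL GLD quantile function, equations (3) to (6), for 0 < u < 1 and lam2 > 0."""
    # (u^lam3 - 1) / lam3 tends to log(u) as lam3 tends to 0.
    left = math.log(u) if lam3 == 0 else (u ** lam3 - 1.0) / lam3
    # ((1 - u)^lam4 - 1) / lam4 tends to log(1 - u) as lam4 tends to 0.
    right = math.log(1.0 - u) if lam4 == 0 else ((1.0 - u) ** lam4 - 1.0) / lam4
    return lam1 + (left - right) / lam2


# The limiting form (6) should be close to the general form (3) with tiny shape parameters.
print(fmkl_gld_quantile(0.3, 0.0, 1.0, 0.0, 0.0))
print(fmkl_gld_quantile(0.3, 0.0, 1.0, 1e-8, 1e-8))
```

With λ 3 = λ 4 = 0 the function reduces to λ 1 + log(u/(1 − u))/λ 2, which is the quantile function of a logistic distribution.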
The probability density function of FMKL GLD when λ 3 ≠ 0, λ 4 ≠ 0 is:
$$ f(t)=\frac{\lambda_2}{{u}^{\lambda_3-1}+{\left(1-u\right)}^{\lambda_4-1}} $$
(7)

The probability density function of FMKL GLD when either λ 3 or λ 4 or both are equal to zero can be easily derived and is not provided here. For FMKL and RS GLD, the numerical solution for u in F − 1(u) = t gives the cumulative distribution function F(t), since u = F(t). This is usually obtained using the Newton–Raphson method (see the GLDEX package (Su 2007) in R). Once the corresponding u is found for a given t, the probability density function f(t) for RS and FMKL GLDs can be evaluated using (2) and (7) respectively.
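To make the inversion step concrete, the following Python sketch solves F^{-1}(u) = t for u and then evaluates (7). It deliberately swaps in bisection for the Newton–Raphson method mentioned above, because bisection needs no derivative and cannot leave the interval (0, 1); it is not the GLDEX implementation, and the function names are hypothetical. The parameters are the FMKL GLD values quoted for Fig. 1.

```python
def fmkl_quantile(u, lam1, lam2, lam3, lam4):
    """Equation (3), valid for lam3 != 0 and lam4 != 0."""
    return lam1 + ((u ** lam3 - 1.0) / lam3 - ((1.0 - u) ** lam4 - 1.0) / lam4) / lam2


def fmkl_cdf(t, lams, tol=1e-12):
    """Return u = F(t) by bisection on the monotone quantile function."""
    lo, hi = 1e-12, 1.0 - 1e-12
    if t <= fmkl_quantile(lo, *lams):
        return 0.0
    if t >= fmkl_quantile(hi, *lams):
        return 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fmkl_quantile(mid, *lams) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def fmkl_pdf(t, lams):
    """Equation (7) evaluated at u = F(t); zero outside the numerical support."""
    u = fmkl_cdf(t, lams)
    if u <= 0.0 or u >= 1.0:
        return 0.0
    _, lam2, lam3, lam4 = lams
    return lam2 / (u ** (lam3 - 1.0) + (1.0 - u) ** (lam4 - 1.0))


lams = (4.56687718, 0.33274810, 0.65408979, -0.01021826)  # FMKL GLD from Fig. 1
for t in (2.0, 4.0, 6.0, 8.0):
    print(f"t={t:.1f}  F(t)={fmkl_cdf(t, lams):.6f}  f(t)={fmkl_pdf(t, lams):.6f}")
```

A round trip such as `fmkl_cdf(fmkl_quantile(0.3, *lams), lams)` should return a value very close to 0.3, which is a simple sanity check on the inversion.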

Estimation algorithms for survival data

Maximising the likelihood for survival data with censored observations (MLE)

The likelihood for censored data is well known in the literature. Take the example of survival data with right censoring at time T, where we observe m random failures out of n subjects and denote the observed failure times as t 1, t 2, t 3, … t m . Let f(t) be the probability density function and F(t) be the cumulative distribution function of survival time; then the likelihood with exact and right censored observations is:
$$ \left({\displaystyle {\prod}_{i=1}^mf\left({t}_i\right)}\right){\left(1-F(T)\right)}^{n-m}. $$
(8)
For left censoring, a failure time is only known to be before a certain time. The likelihood function required is:
$$ \left({\displaystyle {\prod}_{i=1}^mf\left({t}_i\right)}\right){\left(F(T)\right)}^{n-m}. $$
(9)

If there is no censoring, then the usual likelihood function \( \left({\displaystyle {\prod}_{i=1}^mf\left({t}_i\right)}\right) \) is obtained. Other forms of censoring have been discussed elsewhere (Klein and Moeschberger 1997, Lawless 1981, Marubini and Valsecchi 1995, Patti et al. 2007).
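The logarithms of (8) and (9) are straightforward to code for any distribution that supplies f and F. The Python sketch below is a minimal illustration with hypothetical function names; it uses an exponential distribution as a stand-in (an assumption made only because its density and distribution function are available in closed form), whereas a GLD would obtain f and F numerically.

```python
import math


def censored_log_likelihood(failure_times, n, T, pdf, cdf, censoring="right"):
    """Log of (8) for right censoring or of (9) for left censoring.

    failure_times: the m exactly observed times; n: total number of subjects;
    T: censoring threshold shared by the n - m censored subjects.
    """
    m = len(failure_times)
    loglik = sum(math.log(pdf(t)) for t in failure_times)
    if n > m:
        censored_prob = 1.0 - cdf(T) if censoring == "right" else cdf(T)
        loglik += (n - m) * math.log(censored_prob)
    return loglik


# Stand-in distribution for illustration only: exponential with rate 0.5.
rate = 0.5
pdf = lambda t: rate * math.exp(-rate * t)
cdf = lambda t: 1.0 - math.exp(-rate * t)

print(censored_log_likelihood([1.0, 2.0, 3.0], n=5, T=4.0, pdf=pdf, cdf=cdf))
print(censored_log_likelihood([1.0, 2.0, 3.0], n=3, T=4.0, pdf=pdf, cdf=cdf))  # no censoring
```

For the exponential stand-in the right censored value can be verified by hand as m log(rate) − rate Σ t i − (n − m) rate T.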

In the context of GLDs, F(t) and f(t) can be obtained as described in the Generalized lambda distributions section and these are available from a number of statistical packages such as GLDEX (Su 2007) in R. The usual way of fitting survival curves using maximum likelihood estimation for GLD is to maximise the logarithm of the likelihood. The maximisation is usually done using the Nelder-Mead simplex algorithm. This is usually more robust than differentiating the log likelihood, setting the resulting equations to zero and solving them numerically for the parameters. Additionally, the problem of choosing initial values to start the optimisation process can be solved using the method described in Su (2007b, 2007a) or by using the initial values obtained by matching PWMs as detailed below. The initial values search method in Su (2007b, 2007a) uses an extensive randomised search across the parameters using quasi random number generators such as the Sobol or Halton sequence, and this article chooses the randomised set of parameters with the largest likelihood value to initiate the optimisation process.
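The quasi random search for initial values can be sketched as follows. This is a hedged illustration of the general idea only, not the routine in Su (2007b, 2007a): it generates Halton points, rescales them to user-chosen parameter bounds (the bounds and the toy objective are assumptions) and keeps the candidate with the largest objective value. In practice the objective would be the censored log likelihood, and the winner would be passed to an optimiser such as Nelder-Mead.

```python
def halton(index, base):
    """Element number `index` (starting at 1) of the van der Corput sequence in `base`."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result


def halton_candidates(n_points, bounds, bases=(2, 3, 5, 7)):
    """Halton points in the unit hypercube rescaled to (low, high) bounds per parameter."""
    return [
        tuple(low + (high - low) * halton(i, base) for (low, high), base in zip(bounds, bases))
        for i in range(1, n_points + 1)
    ]


def best_start(objective, candidates):
    """Return the candidate with the largest objective value, skipping invalid ones."""
    best, best_value = None, float("-inf")
    for candidate in candidates:
        try:
            value = objective(*candidate)
        except (ValueError, ZeroDivisionError, OverflowError):
            continue  # parameter set outside the valid region
        if value > best_value:
            best, best_value = candidate, value
    return best, best_value


# Toy objective with a known maximum at (1, 2, 0.5, 0.5); a real use would
# substitute the censored log likelihood of the GLD.
toy = lambda a, b, c, d: -((a - 1) ** 2 + (b - 2) ** 2 + (c - 0.5) ** 2 + (d - 0.5) ** 2)
bounds = [(-5.0, 5.0), (0.01, 5.0), (-1.0, 2.0), (-1.0, 2.0)]
start, value = best_start(toy, halton_candidates(2000, bounds))
print(start, value)
```

Using a different prime base for each parameter is what keeps the Halton points spread evenly over the four dimensional search region.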

Matching partial probability weighted moments (PWMs)

As an alternative to maximum likelihood estimation, it is possible to estimate the censored distribution by matching PWMs. The main advantage of using PWMs is that these moments are more robust than conventional moments with respect to sampling variability. Closely related to PWMs are L moments, where sample L moments can be defined indirectly as functions of probability weighted moments. Owing to their robustness, fitting statistical distributions by matching PWMs/L moments to data is usually preferable to matching conventional moments.

Let the order statistics for a complete sample of n observations be defined as X j : n where j = 1,2,3,… n, with X 1 : n  ≤ X 2 : n  ≤ … ≤ X n : n . The PWMs of sample data for right and left censoring (Hosking 1995, Greenwood et al. 1979) are given below.

The r-th PWM for right censored data, denoted as \( {\widehat{b}}_{right}(r) \), with n − m censored values replaced by threshold T is given in (10).
$$ {\widehat{b}}_{right}(r)=\frac{1}{n}\left\{{\displaystyle \sum_{j=1}^m}\frac{\left(j-1\right)\left(j-2\right)\dots \left(j-r\right)}{\left(n-1\right)\left(n-2\right)\dots \left(n-r\right)}{X}_{j:n}+\left({\displaystyle \sum_{j=m+1}^n}\frac{\left(j-1\right)\left(j-2\right)\dots \left(j-r\right)}{\left(n-1\right)\left(n-2\right)\dots \left(n-r\right)}\right)T\right\} $$
(10)
The r-th PWM for left censored data, denoted as \( {\widehat{b}}_{left}(r) \), with n − v censored values replaced by threshold T is given in (11).
$$ {\widehat{b}}_{left}(r)=\frac{1}{n}\left\{{\displaystyle \sum_{j=n-v+1}^n}\frac{\left(j-1\right)\left(j-2\right)\dots \left(j-r\right)}{\left(n-1\right)\left(n-2\right)\dots \left(n-r\right)}{X}_{j:n}+\left({\displaystyle \sum_{j=1}^{n-v}}\frac{\left(j-1\right)\left(j-2\right)\dots \left(j-r\right)}{\left(n-1\right)\left(n-2\right)\dots \left(n-r\right)}\right)T\right\} $$
(11)

When r = 0, the sample PWM for either left or right censoring reduces to a simple average over all observations in the dataset, with each censored observation replaced by the threshold T. In this article we take the first four PWMs with r = 0, 1, 2, 3.
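A direct transcription of (10) and (11) into Python is given below as a minimal sketch; the function names and the small data set are hypothetical. The weight for r = 0 is an empty product equal to 1, which reproduces the simple average described above.

```python
def pwm_weight(j, n, r):
    """((j-1)(j-2)...(j-r)) / ((n-1)(n-2)...(n-r)); equals 1 when r = 0 (requires r < n)."""
    weight = 1.0
    for i in range(1, r + 1):
        weight *= (j - i) / (n - i)
    return weight


def sample_pwm_right(observed, n, T, r):
    """Equation (10): `observed` holds the m uncensored times; n - m values are censored at T."""
    x = sorted(observed)
    m = len(x)
    total = sum(pwm_weight(j, n, r) * x[j - 1] for j in range(1, m + 1))
    total += T * sum(pwm_weight(j, n, r) for j in range(m + 1, n + 1))
    return total / n


def sample_pwm_left(observed, n, T, r):
    """Equation (11): `observed` holds the v uncensored times; n - v values are censored at T."""
    x = sorted(observed)
    v = len(x)
    # The j-th order statistic of the full sample is element j - (n - v) of the observed values.
    total = sum(pwm_weight(j, n, r) * x[j - (n - v) - 1] for j in range(n - v + 1, n + 1))
    total += T * sum(pwm_weight(j, n, r) for j in range(1, n - v + 1))
    return total / n


# Hypothetical data: five subjects, three observed failures, two right censored at T = 4.
print([sample_pwm_right([1.0, 2.0, 3.0], n=5, T=4.0, r=r) for r in range(4)])
# Hypothetical data: five subjects, two left censored at T = 2.
print([sample_pwm_left([3.0, 4.0, 5.0], n=5, T=2.0, r=r) for r in range(4)])
```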

Hosking (1995) shows that for a given probability distribution with cumulative distribution function F(t) = u and quantile function F − 1(u), the theoretical right and left censored PWMs with censoring threshold T and F(T) = k are as follows:
$$ {b}_{right}(r)={\displaystyle {\int}_0^k{u}^r{F}^{-1}(u)du+\frac{1-{k}^{r+1}}{r+1}T} $$
(12)
$$ {b}_{left}(r)={\displaystyle {\int}_k^1{u}^r{F}^{-1}(u)du+\frac{k^{r+1}}{r+1}T} $$
(13)
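The GLD parameters are then estimated by minimising the sum of squared differences between the sample and theoretical PWMs:
$$ {\displaystyle \sum_r}{\left[\widehat{b}(r)-b(r)\right]}^2,\ r=0,1,2,3. $$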
(14)

The PWMs for RS and FMKL GLD for left and right censoring are given in the Appendix. Based on these results, it is now possible to find a set of parameters of GLD that minimize the sum of the squared difference between sample and derived PWMs using (14). The problem of choosing initial values for the optimisation process is again solved using the method described in Su (2007b, 2007a), which involves an extensive randomised search across the parameters using quasi random number generators such as the Sobol or Halton sequence. The initial values used to start the optimisation process will be a randomised set of parameters that best matches the partial probability weighted moments between the sample and the estimated GLD.
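The objective in (14) can be illustrated for right censoring as follows. This Python sketch is an assumption-laden simplification: it evaluates the integral in (12) with the midpoint rule rather than the closed forms given in the Appendix, obtains k = F(T) by bisection, and uses hypothetical function names. It is restricted to the FMKL form (3) with non-zero shape parameters.

```python
def fmkl_quantile(u, lam1, lam2, lam3, lam4):
    """Equation (3), valid for lam3 != 0 and lam4 != 0."""
    return lam1 + ((u ** lam3 - 1.0) / lam3 - ((1.0 - u) ** lam4 - 1.0) / lam4) / lam2


def cdf_by_bisection(t, lams, eps=1e-10):
    """k = F(t), found by bisection on the monotone quantile function."""
    lo, hi = eps, 1.0 - eps
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if fmkl_quantile(mid, *lams) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def theoretical_pwm_right(r, T, lams, steps=2000):
    """Equation (12): integral over (0, k) by the midpoint rule plus the censored term."""
    k = cdf_by_bisection(T, lams)
    h = k / steps
    integral = h * sum(
        ((i + 0.5) * h) ** r * fmkl_quantile((i + 0.5) * h, *lams) for i in range(steps)
    )
    return integral + (1.0 - k ** (r + 1)) / (r + 1) * T


def pwm_objective(lams, sample_pwms, T):
    """Equation (14): squared differences between sample and theoretical PWMs, r = 0..3."""
    return sum((sample_pwms[r] - theoretical_pwm_right(r, T, lams)) ** 2 for r in range(4))


# Check case: FMKL GLD(0.5, 2, 1, 1) has F^{-1}(u) = u, i.e. a uniform distribution on (0, 1).
uniform_lams = (0.5, 2.0, 1.0, 1.0)
print([round(theoretical_pwm_right(r, 0.8, uniform_lams), 6) for r in range(4)])
```

For the uniform check case with T = k = 0.8, (12) evaluates by hand to k^(r+2)/(r+2) + k(1 − k^(r+1))/(r+1), giving 0.48 for r = 0. The midpoint rule is crude near u = 0 when the quantile function is unbounded there, so a production fit would prefer the closed-form expressions.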

Simulating-refitting (SR) method

This method exploits the Kaplan Meier curve or any non-parametric survival curves by simulating survival times from the survival curve. The aim is to create “uncensored” simulated data to allow direct estimation of statistical distributions. The concept behind SR method is illustrated in Fig. 2 and the probabilities of survival times over each time interval are given in Table 1.
Fig. 2

Illustrating SR Method

Table 1

Illustrating SR method

Probability | Interval | Distribution used for simulation
0.25 | 100–200 | Uniform
0.25 | 200–300 | Uniform
0.25 | 300–400 | Uniform
0.25 | 400–500 | Triangular

Figure 2 is an example where 25 % of patients died at each of days 100, 200 and 300, and from day 400 onwards the data are censored. To simulate data from this survival curve, 75 % of the total number of observations will come from uniform distributions over the intervals 100–200 (25 %), 200–300 (25 %) and 300–400 (25 %), with the remaining 25 % from a triangular distribution over 400–500 (Fig. 2). The last time point, 500, is chosen arbitrarily as 400 × (1 + proportion of censored data), i.e. 400 × 1.25. A triangular distribution towards the end of the survival curve is chosen to facilitate an easier fit, as many distributions tend to exhibit a downward tail towards their theoretical minimum or maximum. Once the data are simulated, they can be treated as uncensored and standard distributional fitting methods can be used.
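The simulation step for the example in Table 1 can be sketched in Python as below. The function name is hypothetical, and one detail is an explicit assumption: the triangular distribution is taken to peak at the start of the final interval and decline to zero at its end, which is one reading of the "downward tail" described above.

```python
import random

last_event_time = 400.0
censored_proportion = 0.25
end_time = last_event_time * (1.0 + censored_proportion)  # 400 x 1.25 = 500

# Rows of Table 1: (probability, interval start, interval end, distribution).
TABLE_1 = [
    (0.25, 100.0, 200.0, "uniform"),
    (0.25, 200.0, 300.0, "uniform"),
    (0.25, 300.0, 400.0, "uniform"),
    (0.25, last_event_time, end_time, "triangular"),
]


def simulate_sr(rows, n_sim, rng):
    """Simulate 'uncensored' survival times in proportion to each row's probability."""
    draws = []
    for prob, low, high, kind in rows:
        count = round(prob * n_sim)
        for _ in range(count):
            if kind == "uniform":
                draws.append(rng.uniform(low, high))
            else:
                # Assumption: mode at `low`, so the density falls linearly to zero at `high`.
                draws.append(rng.triangular(low, high, low))
    return draws


rng = random.Random(2016)  # fixed seed so the sketch is reproducible
simulated = simulate_sr(TABLE_1, 1000, rng)
print(len(simulated), min(simulated), max(simulated))
```

The resulting list can then be passed to any uncensored fitting routine, which is the point of the SR strategy.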

The SR strategy effectively transforms survival data with censoring into survival data without any censoring. This aids the parametric modelling of the survival curve by allowing a visual representation of \( f\left(\widehat{t}\right) \). Additionally, it also allows methods such as the starship method, method of moment matching, quantile matching, maximum likelihood estimation and L moments matching to be used to facilitate a potentially better GLD fit. This is another advantage of the SR method, since not all of the available methods for fitting GLD to data can be easily adapted to cope with censored survival data. This article illustrates the use of SR in conjunction with maximum likelihood estimation (ML), L moment matching (LM) and quantile matching (QS) (see Su (2010a) for details) for GLDs in a series of simulation studies under the Simulation studies section.

Confidence intervals for survival curves

Once a parametric model for survival curves is found using any of the above methods, the confidence intervals for survival curves can be evaluated directly without using simulation, owing to the work by Cramer (1963) and Su (2009). Detailed descriptions and performance of this method can be found in Su (2009).

To find a confidence interval for the p-th quantile based on n observations, Cramer (1963) showed that the sample quantile X np has the density g(x) given in (15). Note that a typo in Su (2009) is fixed in (15):
$$ \begin{array}{c}\hfill g(x)=\frac{\Gamma \left(n+1\right)}{\Gamma \left(w+1\right)\Gamma \left(n-w\right)}{\left[F(x)\right]}^w{\left[1-F(x)\right]}^{n-w-1}f(x)\hfill \\ {}\hfill w=np\hfill \\ {}\hfill \Gamma (y)={\displaystyle \underset{0}{\overset{\infty }{\int }}}{u}^{y-1}{e}^{-u}du\hfill \end{array} $$
(15)
To find the 100(1 − α)  % confidence interval analytically, the following equations need to be solved:
$$ {\displaystyle \underset{0}{\overset{Upper\ Limit}{\int }}}g(x)dx=1-\frac{\alpha }{2} $$
(16)
$$ {\displaystyle \underset{0}{\overset{Lower\ Limit}{\int }}}g(x)dx=\frac{\alpha }{2} $$
(17)

Note that \( {\displaystyle {\int}_0^{x_0}g(x)dx={\left.\beta \left(w+1,n-w\right)\right|}_0^{F\left({x}_0\right)}} \) where β is Euler’s incomplete beta function normalized by the complete beta function. This procedure is known as the analytical-maximum likelihood GLD approach in Su (2009).
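Because the integral of g depends on x only through F(x), equations (16) and (17) can be solved on the probability scale and then mapped back through the quantile function. The Python sketch below illustrates this under two stated assumptions: w = np is rounded to an integer, so that the normalized incomplete beta function can be written as a binomial tail sum, and n = 200 is an illustrative sample size. The function names are hypothetical; the RS GLD parameters are those quoted for the cancer data fit.

```python
import math


def order_statistic_cdf(q, n, w):
    """Normalized incomplete beta I_q(w + 1, n - w) for integer w and 0 < q < 1,
    computed as the binomial tail sum over j = w + 1, ..., n (in log space)."""
    total = 0.0
    for j in range(w + 1, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                    + j * math.log(q) + (n - j) * math.log(1.0 - q))
        total += math.exp(log_term)
    return total


def solve_probability(target, n, w, tol=1e-12):
    """Find q with I_q(w + 1, n - w) = target by bisection (the function increases in q)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if order_statistic_cdf(mid, n, w) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def quantile_ci(p, n, quantile_fn, alpha=0.05):
    """Solve (16) and (17) for F(limit), then map through the quantile function."""
    w = round(n * p)
    lower_u = solve_probability(alpha / 2.0, n, w)
    upper_u = solve_probability(1.0 - alpha / 2.0, n, w)
    return quantile_fn(lower_u), quantile_fn(upper_u)


def rs_quantile(u, lam1=47.46579, lam2=7.504002e-04, lam3=7.166621e-03, lam4=0.3432945):
    """RS GLD quantile function (1) with the parameters quoted for the cancer data."""
    return lam1 + (u ** lam3 - (1.0 - u) ** lam4) / lam2


print(quantile_ci(0.5, 200, rs_quantile))  # 95 % interval for the median, n assumed
```

Repeating the calculation over a grid of p values traces out confidence bands for the whole survival curve.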

For illustration purposes, the cancer dataset from the survival library in R is used. The survival curve is estimated by RS GLD (λ 1 = 4.746579e + 01, λ 2 = 7.504002e-04, λ 3 = 7.166621e-03, λ 4 = 3.432945e-01) using SR method. To visually check the validity of the evaluated 95 % confidence intervals (CIs), 1000 simulated survival curves (grey area in Fig. 3) were generated along with the parametric CIs in Fig. 3. This is further validated by the almost indistinguishable result between the estimated CIs from simulated data and the parametrically evaluated CIs (Fig. 3). Extensive simulation studies on the performance of CIs in terms of coverage probability for different sample sizes and quantiles are covered in Su (2009).
Fig. 3

Parametric confidence interval (CI) for survival curve for cancer data from survival library in R

Assessing the goodness of fit

In the absence of full information, the quality of parametric modelling can be visually examined by comparing the fitted parametric model with a non-parametric survival curve such as the Kaplan Meier (KM) survival curve. In the case where the SR method is used, it is possible to compare the final model against the simulated data using QQ plots and more formally through the Kolmogorov-Smirnov test or the Kolmogorov-Smirnov resample test (Su 2007b, 2007a, 2010a, 2010b).

Simulation studies

The use of direct MLE and PWM matching for censored data has been applied using well known distributions such as log Normal and Weibull distributions (Wang et al. 2010a). The theoretical justification for using MLE and PWM to fit any continuous distribution to data can be found elsewhere (Hosking 1990, Fisher 1922, Aldrich 1997). For the SR method, as long as the simulated data is sufficiently close to the true underlying distribution and the GLD can model the simulated uncensored data accurately, it will yield an accurate model.

Unlike many standard statistical distributions, GLDs are characterised by their quantile functions. This means that to use ML under SR or MLE directly, it is necessary to obtain the probability density functions using numerical methods such as the Newton–Raphson procedure. Other methods such as PWM do not require this numerical step in the optimisation algorithm. However, all fitting methods are affected by the sudden shape change problem in GLD parameter estimation. For a relatively small change in one of the four parameters, the GLD can exhibit a dramatic change in shape. For example, the density of RS GLD (0,1,0.5,1) is an increasing function from -1 to 1 but the density of RS GLD (0,1,0.5,0.75) is a parabola shaped function from -1 to 1, even though the change in the fourth parameter is only 0.25. In other situations, RS GLD (0,1,1.5,2) and RS GLD (0,1,1.5,2.5) (with a 0.5 change in the fourth parameter) do exhibit similar shapes and there is a smoother transition in shapes as the fourth parameter changes. This property means the standard theory examining the lower bound for the variability of parameters is not particularly useful for GLDs. Theoretical examination of fitting methods for GLDs is complicated by the numerical computations involved and by potential abrupt changes in the shape of the distribution from small changes in parameters, so perhaps the most elegant strategy at the present time is to use simulations to compare between the methods. Instead of comparing whether the parameters of the fitted GLD are close to those of the true GLD, the emphasis is on whether the fitted GLD quantiles are sufficiently close to the true quantiles from some other known statistical distribution. This is illustrated below.
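The sudden shape change in the RS GLD example above can be reproduced with a few lines of Python. The sketch below (function names are hypothetical) evaluates the density (2) on a grid of u values for RS GLD (0,1,0.5,1) and RS GLD (0,1,0.5,0.75), and reports the end points of the support from (1).

```python
def rs_quantile(u, lam1, lam2, lam3, lam4):
    """Equation (1): RS GLD quantile function."""
    return lam1 + (u ** lam3 - (1.0 - u) ** lam4) / lam2


def rs_density_at_u(u, lam2, lam3, lam4):
    """Equation (2): RS GLD density at t = F^{-1}(u), for 0 < u < 1."""
    return lam2 / (lam3 * u ** (lam3 - 1.0) + lam4 * (1.0 - u) ** (lam4 - 1.0))


grid = [0.001, 0.25, 0.5, 0.75, 0.999]
for lams in [(0.0, 1.0, 0.5, 1.0), (0.0, 1.0, 0.5, 0.75)]:
    support = (rs_quantile(0.0, *lams), rs_quantile(1.0, *lams))
    densities = [round(rs_density_at_u(u, *lams[1:]), 4) for u in grid]
    print("RS GLD", lams, "support", support, "density on grid", densities)
```

For the first parameter set the density rises steadily across the grid, whereas for the second it rises and then falls back towards zero near the upper end of the support, even though both distributions live on the interval from -1 to 1.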

To assess the performance of various estimation methods, survival curves were generated from 2000 observations from symmetric and skewed Normal distributions with parameters: location = 20, scale = 2, shape = -5 or 5 or 0 for left skewed, right skewed and symmetric shape respectively. The motivation for using the skewed Normal rather than the Weibull distribution is to facilitate a better comparison across different scenarios using the same distribution with different parameters. Also, it would be a rather unfair comparison if one were to use extended Weibull distributions (exponentiated Weibull and four parameter Weibull) to fit Weibull distributed data. Instead, the primary focus here is to examine how well the GLDs and extended Weibull distributions fit data from other distributions, since the true distribution is never known in practice. Additionally, the Gumbel distribution (an extreme value distribution) with location parameter 15 and scale parameter 5 is also used in this comparison. These distributions are primarily chosen to examine the behaviour of these fitting algorithms over a range of different shapes.

This entire process is repeated for 200 observations to allow assessment of the effect of sample size on the performance of the proposed fitting methods. To create right censored data, observations greater than a quantile ranging from 0.5 (median) to 0.9 are censored. Similarly, to create left censored data, observations less than a quantile ranging from 0.1 to 0.5 (median) are censored. Five estimation methods, namely ML (maximum likelihood)/LM (L moments)/QS (quantile matching) under SR, MLE and PWM matching, were applied over 100 simulation runs. When fitting RS GLD using MLE with half of the data being censored (i.e., at 0.5), it is sometimes desirable to use the SR-ML method to generate the initial values to start the optimisation process, rather than using a randomised search, as this leads to better performance. This strategy is used in this article.
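The data generation step of the simulation design can be sketched as follows. This Python illustration rests on assumptions that the text does not settle: the censoring threshold is taken to be the empirical quantile of the simulated sample, and skewed Normal values are drawn using the standard representation δ|Z0| + sqrt(1 − δ²) Z1 with δ = shape / sqrt(1 + shape²). The function names are hypothetical.

```python
import math
import random


def skew_normal_sample(n, location, scale, shape, rng):
    """Draw n skewed Normal values via delta * |Z0| + sqrt(1 - delta^2) * Z1."""
    delta = shape / math.sqrt(1.0 + shape ** 2)
    root = math.sqrt(1.0 - delta ** 2)
    return [
        location + scale * (delta * abs(rng.gauss(0.0, 1.0)) + root * rng.gauss(0.0, 1.0))
        for _ in range(n)
    ]


def right_censor(sample, quantile):
    """Observations above the empirical quantile are censored at the threshold T."""
    ordered = sorted(sample)
    m = round(quantile * len(ordered))  # number of uncensored observations
    T = ordered[m - 1]
    return ordered[:m], T, len(ordered)


def left_censor(sample, quantile):
    """Observations below the empirical quantile are censored at the threshold T."""
    ordered = sorted(sample)
    c = round(quantile * len(ordered))  # number of censored observations
    T = ordered[c]
    return ordered[c:], T, len(ordered)


rng = random.Random(1)
sample = skew_normal_sample(2000, location=20.0, scale=2.0, shape=5.0, rng=rng)
observed, T, n = right_censor(sample, 0.9)
print(len(observed), "observed of", n, "with threshold", round(T, 3))
```

The returned triple (observed times, T, n) is exactly what the censored likelihood and the sample PWM formulas require as input.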

To give the reader an idea of the degree of accuracy attained by GLDs in comparison to other distributions, this article also assesses the performance of the exponentiated Weibull distribution (Mudholkar and Srivastava 1993, Mudholkar et al. 1995, Singh et al. 2005) and the four parameter Weibull distribution (Wahed et al. 2009, Jeong 2006) under the same simulation scenarios. Maximum likelihood estimation via the Newton–Raphson algorithm is used to fit both distributions to survival data. The set of initial values used to begin the optimisation process is obtained as follows: 1000 sets of initial values from 0 to 100 are randomly generated using a Sobol sequence generator for the parameters of both distributions. From these 1000 sets of values, the set that maximises the likelihood is used in the optimisation process.

The sample sizes of 200 and 2000 were chosen to reflect that the number of patients in many Phase III and IV trials is in the vicinity of 200, and that some large meta-analyses may combine several studies and reach around 2000 patients. The primary intention of using both sample sizes is to allow comparison of the accuracy of the estimates as sample size increases. The general pattern of improved accuracy and numerical precision for larger sample sizes is seen in Figs. 4, 5, 6 and 7 and this is an expected result.
Fig. 4

Trellis plot showing the performance of log mean relative error among SR-LM (simulating refitting-L moments), SR-ML (simulating refitting-maximum likelihood), SR-QS (simulating refitting-quantile matching), MLE (maximum likelihood direct estimation) and PWM (partial probability weighted moments matching) for RS and FMKL GLD, exponentiated Weibull and four parameter Weibull distributions. The description “SD-RC0.9” means the true distribution was symmetric and observations greater than the 90 % quantile were treated as right censored data. The simulation result shown is for 2000 samples

Fig. 5

Trellis plot showing the performance of log of variance of relative error among SR-LM (simulating refitting-L moments), SR-ML (simulating refitting-maximum likelihood), SR-QS (simulating refitting-quantile matching), MLE (maximum likelihood direct estimation) and PWM (partial probability weighted moments matching) for RS and FMKL GLD, exponentiated Weibull and four parameter Weibull distributions. The description “SD-LC0.1” means the true distribution was symmetric and observations less than the 10 % quantile were treated as left censored data. The simulation result shown is for 2000 samples

Fig. 6

Trellis plot showing the performance of log mean relative error among SR-LM (simulating refitting-L moments), SR-ML (simulating refitting-maximum likelihood), SR-QS (simulating refitting-quantile matching), MLE (maximum likelihood direct estimation) and PWM (partial probability weighted moments matching) for RS and FMKL GLD, exponentiated Weibull and four parameter Weibull distributions. The description “SD-RC0.9” means the true distribution was symmetric and observations greater than the 90 % quantile were treated as right censored data. The simulation result shown is for 200 samples

Fig. 7

Trellis plot showing the performance of log of variance of relative error among SR-LM (simulating refitting-L moments), SR-ML (simulating refitting-maximum likelihood), SR-QS (simulating refitting-quantile matching), MLE (maximum likelihood direct estimation) and PWM (partial probability weighted moments matching) for RS and FMKL GLD, exponentiated Weibull and four parameter Weibull distributions. The description “SD-LC0.1” means the true distribution was symmetric and observations less than the 10 % quantile were treated as left censored data. The simulation result shown is for 200 samples

To compare the performance between methods, the relative error was computed. The relative error is defined as the absolute difference between fitted and true quantile divided by the true quantile. This is computed using 100 equally spaced quantiles from 1 % quantile up to the censored quantile for right censored data. For left censored data, this is computed using 100 equally spaced quantiles from the censored quantile up to the 99 % quantile. The log mean and log variance of the relative error among five estimation methods for different types of censoring and different statistical distributions are shown in Figs. 4, 5, 6 and 7. The log transformation is designed to solve the problem of extreme results, to ensure a fairer and clearer comparison across different methods.
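The error summary described above can be written out as a short Python sketch. The function names are hypothetical, the "fitted" model is an invented stand-in rather than a result from the article, and the use of the sample variance is an assumption since the text does not state which variance estimator was used.

```python
import math
from statistics import NormalDist, mean, variance


def quantile_grid(censor_q, censoring="right", n_points=100):
    """Equally spaced probabilities: 1 % to censor_q (right) or censor_q to 99 % (left)."""
    start, end = (0.01, censor_q) if censoring == "right" else (censor_q, 0.99)
    step = (end - start) / (n_points - 1)
    return [start + i * step for i in range(n_points)]


def relative_errors(fitted_q, true_q, probs):
    """|fitted quantile - true quantile| / true quantile (true quantiles assumed positive)."""
    return [abs(fitted_q(p) - true_q(p)) / true_q(p) for p in probs]


true_q = NormalDist(mu=20.0, sigma=2.0).inv_cdf    # symmetric scenario: location 20, scale 2
fitted_q = NormalDist(mu=20.1, sigma=2.1).inv_cdf  # hypothetical fitted model for illustration
errors = relative_errors(fitted_q, true_q, quantile_grid(0.9, "right"))

print("log mean relative error:       ", math.log(mean(errors)))
print("log variance of relative error:", math.log(variance(errors)))
```

Restricting the grid to the uncensored side of the threshold means a method is only scored on the part of the distribution for which data were actually observed.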

Within Figs. 4, 5, 6 and 7, the emphasis is on the performance of different methods. In Fig. 4, it is clear that the exponentiated Weibull is among the worst performing distributions, except when fitting Gumbel distribution data, and that the four parameter Weibull performs comparably to the GLD but less well for Gumbel data. The precision comparison in Fig. 5 leads to a similar conclusion as Fig. 4, and the general pattern in these two figures is reflected in Figs. 6 and 7 but with added variability.

Within GLD methods, Figs. 4 and 5 show that the most accurate methods appear to be MLE and PWM matching, while QS, ML and LM under the SR method appear to be more variable in a number of cases. This is expected, as ML/LM/QS under SR introduce extra variability through simulation, due to the nature of the SR algorithm in this article. However, this is not always true. The advantage of using SR is seen in Fig. 6 for the Gumbel distribution with 10 % left censored data (GB-LC0.1), comparing FMKL GLD under PWM with FMKL GLD under SR-LM. It is clear that in this example, the direct use of PWM does not result in as good a fit as SR-LM. This is likely due to the difficulty in ascertaining a suitable set of initial values to ensure proper convergence to the best possible GLD fit, an area where SR can provide valuable guidance and input.

The log variance of relative error plots (Figs. 5 and 7) show that the methods provided give quite precise results, with direct MLE tending to outperform PWM. ML/LM/QS under SR all have similar performance and often perform slightly worse than MLE or PWM.

The apparent general pattern of superior performance of MLE or PWM over ML/LM under SR should be interpreted with caution. There are cases, as shown in the example above, where LM under SR can in fact outperform the direct use of PWM. Also, note that the optimality of the simulation results comes from the fact that the true distributions are known and a single GLD is an adequate approximation to the true underlying distribution. In practical situations, the true underlying distribution is unknown and may require a mixture of GLDs. When dealing with a mixture of GLDs, it is harder to fit a distribution to censored data using direct MLE or PWMs, since this requires maximising or minimising a much more complex objective function, which can be difficult in practice. ML/LM/QS under SR, on the other hand, can be adapted more easily to fit a mixture of GLDs, and the success of these methods in fitting mixtures has already been documented elsewhere (Su 2007b, 2010a, 2010b). SR also tends to give a better model for empirical data, as illustrated in the "Application in empirical data modelling" section. The main message is that the theoretical loss of efficiency and accuracy from using SR is likely to be minimal, as is evident in these simulation studies, but, as illustrated below, SR can provide additional information to aid the fitting of a suitable distribution, which is not attainable by using an estimation method such as PWM or MLE directly.

Application in empirical data modelling

While the methods described in this article work well in controlled simulation experiments, it is the successful modelling of survival curves for real life data sets that will be most useful for practitioners. The European Blood Marrow Transplant (EBMT) registry (2205 patients) and Amsterdam Cohort Studies on AIDS infection data (329 participants) from Putter et al. (2007) are used for this illustration.

For the EBMT data, the aim is to model the survival curve for those who experienced platelet recovery and subsequently died or had a relapse. For this dataset, both MLE and SR are used to estimate the survival curve. From Fig. 8 panels (a) and (b), the SR estimation is better than the MLE estimation. Although the SR estimation is based on a mixture of two GLDs and MLE is based on a single GLD, this comparison is not unfair, since fitting a distribution to censored data using MLE provides no information as to the shape of the target distribution. While it may be possible, for example, to use a mixture of two GLDs with MLE to model censored data, this approach will often fail because the likelihood becomes too complex to maximise. In contrast, by using the SR method, users can clearly see that a mixture of GLDs is needed in this case (Fig. 8, panel (c)). Also, it is much easier to fit a mixture of GLDs under the SR scheme since it is only necessary to maximise the usual uncensored likelihood. In practice, the additional information that the SR method provides on the shape of the survival distribution, which assists the identification of an appropriate probability density, is what gives the SR method an edge over the direct use of PWM or MLE on censored data.
Fig. 8

In panel (a), the EBMT Kaplan Meier survival curve is modelled by maximising the censored log likelihood directly using RS or FMKL GLD. In panel (b), the EBMT Kaplan Meier survival curve is modelled by mixtures of GLDs using maximum likelihood estimation (Su 2007a) under SR scheme. Panel (c) shows the estimated EBMT probability density function of mixture of RS GLDs from Panel (b) and only the observed part of the survival curve is displayed in Panel (b). Panel (d) shows the AIDS infection cumulative incidence function (CIF) (Amsterdam Cohort Study) modelled by mixture of RS GLDs (Su 2007a) using the SR method

For the Amsterdam Cohort Studies on AIDS infection, because competing risks are present (see Putter et al. (2007) for more details), the appropriate measure of probability is the cumulative incidence function rather than the naive KM curve. The SR-ML method is again used in this example and the resulting fit is very close to the non-parametric cumulative incidence function, as shown in Fig. 8, panel (d).
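
To make the distinction between the cumulative incidence function and the naive KM curve concrete, the following is a minimal sketch of the standard non-parametric cumulative incidence estimator. It is not taken from the analysis above: the function name, the cause coding (0 for censored, positive integers for event types) and the toy data are assumptions made purely for illustration.

```python
def cumulative_incidence(times, causes, cause):
    """Non-parametric cumulative incidence for one cause under competing risks.

    times  : observed times
    causes : 0 if censored, otherwise a positive integer event type
    cause  : the event type of interest
    Returns a list of (time, cumulative incidence) at each distinct event time.
    """
    data = sorted(zip(times, causes))
    n_at_risk = len(data)
    surv = 1.0  # overall event-free survival just before the current time
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d_any = d_cause = removed = 0
        # Collect everything observed at time t (events and censorings).
        while i < len(data) and data[i][0] == t:
            c = data[i][1]
            d_any += 1 if c != 0 else 0
            d_cause += 1 if c == cause else 0
            removed += 1
            i += 1
        if d_any:
            # Probability of surviving all causes so far, then failing from `cause` now.
            cif += surv * d_cause / n_at_risk
            surv *= 1.0 - d_any / n_at_risk
            curve.append((t, cif))
        n_at_risk -= removed
    return curve


# Toy data: cause 1 is the event of interest, cause 2 is a competing event.
toy_times = [1, 2, 3, 4]
toy_causes = [1, 2, 0, 1]
cif_cause1 = cumulative_incidence(toy_times, toy_causes, cause=1)
cif_cause2 = cumulative_incidence(toy_times, toy_causes, cause=2)
```

In this toy example the cumulative incidence of cause 1 reaches 0.75, whereas one minus a KM curve that treats the competing event as censoring would reach 1, which is why the naive KM curve overstates the probability of the event of interest.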

A further interest with the EBMT data in this article is to illustrate how the GLD fits could be improved and compared against well-known parametric models using mixtures of Normal distributions. We consider three methods in this example: fitting a mixture of 3 RS GLDs (Su 2007a, 2010b) under the SR scheme, fitting mixtures of Normal distributions (Fraley and Raftery 2002, 2007) under the SR scheme, and fitting a mixture of Normal distributions to censored data directly using a Bayesian method (Komárek 2009). A fourth method, kernel density estimation (using the Normal distribution as kernel), is provided as a benchmark as to how well these methods fit the target distribution in Fig. 9.
Fig. 9

Panel (a) compares the use of simulating refitting method (GLD mixture Vs Normal mixture) and the use of Bayesian Normal mixture model from mixAK package. Panel (b) shows the performance of GLD mixture and Normal mixture (SR methods only) relative to KM survival curve and this is compared directly in Panel (c). Panel (c) shows the mixture of GLDs gives a simpler representation of the survival curve compared to using mixture of Normal distributions

The survival curve fit to the EBMT data is first refitted using a mixture of 3 RS GLDs under the SR method. Details on fitting mixtures of GLDs can be obtained from Su (2007a, 2010b). This GLD mixture model (fitted using the quantile matching method (Su 2010b)) now corresponds almost exactly to the survival curve (Fig. 9, panel (b)). Previously, the mixture of 2 RS GLDs in Fig. 8, panel (b) showed a slight departure (but still within the 95 % CI) at the start of the survival curve. Note that the number of GLDs can be selected using AIC. In this case, the AIC is 179658.8 under the mixture of 3 RS GLDs compared to 175950.5 under the mixture of 2 RS GLDs. Since a lower AIC is preferred, the choice based on AIC is the mixture of 2 RS GLDs.
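
The AIC comparison can be written out as a short sketch. The two AIC values are those reported above; the helper names are invented for this example, and the parameter count assumes four parameters per GLD component plus the free mixing proportions, which is an assumption about how the mixtures are parameterised rather than something stated in this section.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 log L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def mixture_param_count(n_components, params_per_component=4):
    """Assumed parameter count: component parameters plus free mixing proportions."""
    return n_components * params_per_component + (n_components - 1)


def select_by_aic(aic_by_model):
    """Return the model label with the smallest AIC."""
    return min(aic_by_model, key=aic_by_model.get)


# AIC values reported in the text, keyed by number of RS GLD components.
reported_aic = {2: 175950.5, 3: 179658.8}
best_n_components = select_by_aic(reported_aic)
```

Here `best_n_components` is 2, matching the preference stated in the text.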

In Fig. 9, the EBMT data were also fitted using Normal mixtures. Firstly, MCMC estimation of Normal mixtures for survival data with censoring is applied directly using 5 Normal distributions via the mixAK package in R (Komárek 2009). Secondly, Normal mixtures are fitted using the SR method via the mclust package (Fraley and Raftery 2002, 2007) in R. This involves finding the optimal Normal mixture using BIC under the EM algorithm initialised by hierarchical clustering, and a mixture of 8 Normal distributions was fitted to the simulated data using the SR method. Figure 9, panel (a) shows that the mixAK Bayesian Normal mixture model failed to estimate the survival distribution accurately for survival times greater than 2500 in comparison to kernel density estimation. As a result, this model is not considered further in Fig. 9, panels (b) and (c). Figure 9 shows that the GLD mixture model not only provides a compelling fit to the KM survival curve (Fig. 9, panel (b)) but also has the advantage of possessing a simpler shape and fewer parameters than the Normal mixture model (Fig. 9, panels (a) and (c)).

Application in multi-stage disease modelling

A common goal in multi-stage disease modelling is to find an overall survival curve, accounting for all the different paths to a final outcome. Lo et al. (2009) discussed how this can be done using saddlepoint approximation in a semi-Markov process setting; however, the problem of choosing a suitable statistical distribution is still unsolved. Their method also hinges upon the successful maximisation of the likelihood for the whole system, which can be difficult to achieve for a complex multi-stage disease scenario.

The method proposed below can be used quite effectively for complex multi-stage disease scenarios. Consider a simple multi-stage disease scenario in Fig. 10.
Fig. 10

A simple multi-stage disease scenario

Figure 10 shows two possible paths leading to death for each patient in a clinical trial. Patients could develop a disease leading to death or go straight to death, and censoring could occur at any time.

The general pattern of any multi-stage disease scenario is that all patients end up at the final stage, which may be death or disease. Some patients may not have yet experienced an event and in that case, a censored event is recorded. Other patients may be censored at the start of the study, before any pathways. However, there are no recurrent events in which patients may go back and forward between different stages indefinitely. Under these conditions, the process of finding the parametric survival curve for the overall system is as follows:
  1. Find a parametric survival curve for each path, using either MLE, PWMs matching or SR technique. Each path is comprised of censored and uncensored data and any data censored at any intermediate stages would be classified as censored data.
  2. Obtain the number of patients for each path; simulate survival times from the fitted parametric model for each path.
  3. Combine all the simulated results in step 2 and the censored times observed at the start of the study into one final survival data. Model the survival times of this final data set parametrically using either PWMs matching, MLE or SR method. This gives the overall parametric survival curve for the system.

The above modelling strategy is simple yet effective. It avoids the complicated likelihood that arises for a complex multi-stage system, as in Lo et al. (2009), and it provides a visual check of the overall goodness of fit between the estimated parametric survival curve and the non-parametric survival curve for the whole system. To illustrate, consider a simulated version of the multi-stage disease scenario in Fig. 10. Let path 1 be the path with an intermediate event before death and path 2 be the path straight to death. One thousand survival times for paths 1 and 2 are generated from Weibull (6,12) and Weibull (3,13) distributions respectively. Survival times greater than 20 are censored at 20 for the whole system. Additionally, for path 1, survival times greater than 14 are censored at 14. Under this simulation scenario, the overall survival curves estimated by RS GLD using MLE and SR methods are shown in Fig. 11. While both methods appear to be effective, a slightly more accurate result is obtained using the direct ML estimation technique in this case.
Fig. 11

Modelling overall survival curve for multi-stage disease scenario using (a) direct maximum likelihood estimation and (b) SR method
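
The simulated scenario can be reproduced in outline as follows. This sketch only generates the censored data and the non-parametric Kaplan Meier curve that serves as the visual benchmark; it does not fit a GLD. Several details are assumptions made for illustration: Weibull (6,12) and Weibull (3,13) are read as (shape, scale) pairs, 1000 times are drawn for each path, the seed is arbitrary, and the function names are invented here.

```python
import random


def simulate_path(n, shape, scale, censor_at, rng):
    """Draw n Weibull survival times and right censor any time above censor_at."""
    times, events = [], []
    for _ in range(n):
        t = rng.weibullvariate(scale, shape)  # stdlib order is (scale, shape)
        if t > censor_at:
            times.append(censor_at)
            events.append(0)  # censored
        else:
            times.append(t)
            events.append(1)  # event observed
    return times, events


def kaplan_meier(times, events):
    """Kaplan Meier estimate as a list of (time, survival) at distinct event times."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Collect everything observed at time t (events and censorings).
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


rng = random.Random(2016)  # arbitrary seed for reproducibility
# Path 1 (via intermediate event): censored at 14; path 2 (straight to death): censored at 20.
t1, e1 = simulate_path(1000, shape=6, scale=12, censor_at=14, rng=rng)
t2, e2 = simulate_path(1000, shape=3, scale=13, censor_at=20, rng=rng)

# Pool both paths to obtain the benchmark curve for the whole system.
overall_km = kaplan_meier(t1 + t2, e1 + e2)
```

A parametric fit obtained by the three-step procedure would then be overlaid on `overall_km` to check the overall goodness of fit.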

Conclusion

This article illustrates the use of generalised lambda distributions in conjunction with MLE, PWMs matching and the SR method to find an approximate probability density function and confidence interval for the survival curve. Where MLE and PWMs matching fail to give a convincing result, the SR method is often a useful alternative. The SR method converts the censored survival data into uncensored data, allowing users to improve the distributional fit using a wider range of fitting methods. The SR method can also provide initial values for a secondary optimisation using MLE or PWM matching, which sometimes provides a better fit to survival curves than is attainable using only one method.

The development of these techniques is promising as it means statisticians are no longer limited to non-parametric techniques when analysing such data. Also, it is now possible to extract all the statistical information, such as the mean, quantiles and variability of survival times, consistently under one parametric model. This opens up the prospect of developing more powerful statistical models and tests for the censored survival data frequently encountered in engineering and medicine. Recent advances in GLD regression (Su 2015) also open the possibility of extending this work to accelerated failure models, which will further enhance the statistician's toolbox in practice.

Footnotes
1

The correct abbreviation should be FKML GLD, but in conformity with the statistical literature, the FMKL GLD terminology is used here.

 

Declarations

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors’ Affiliations

(1)
Covance Pty Ltd

References

  1. Aldrich, J.: R.A. Fisher and the making of maximum likelihood 1912–1922. Stat. Sci. 12(3), 162–176 (1997)
  2. Asquith, W.: L-moments and TL-moments of the generalized lambda distribution. Comput. Stat. Data Anal. 51(9), 4484–4496 (2007)
  3. Azzalini, A.: Statistical applications of the multivariate skew-normal distribution. J. Royal Stat. Soc., Series B 61, 579–602 (1999)
  4. Böhning, D., Seidel, W.: Editorial: recent developments in mixture models. Comput. Stat. Data Anal. 41, 349–357 (2003)
  5. Cramer, H.: Mathematical methods of statistics. Princeton University Press, N.J. (1963)
  6. Fisher, R.A.: On the mathematical foundations of theoretical statistics. Philos. Trans. Royal Soc. London A 222, 309–368 (1922)
  7. Fraley, C., Raftery, A.: Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 97, 611–631 (2002)
  8. Fraley, C., Raftery, A.: Bayesian regularization for normal mixture estimation and model-based clustering. J. Class. 24, 155–181 (2007)
  9. Freimer, M., Kollia, G., Mudholkar, G., Lin, C.: A study of the generalised Tukey lambda family. Commun. Stat.- Theory Methods 17(10), 3547–3567 (1988)
  10. Greenwood, J.A., Landwehr, J.M., Matalas, N.C., Wallis, J.R.: Probability weighted moments: definitions and relation to parameters of several distributions expressible in inverse form. Water Resour. Res. 15(6), 1049–1054 (1979)
  11. Hosking, J.R.M.: L-moments: analysis and estimation of distributions using linear combinations of order statistics. J. Royal Stat. Soc.: Series B (Methodological) 52(1), 105–124 (1990)
  12. Hosking, J.R.M.: The use of L moments in the analysis of censored data. In: Balakrishnan, N. (ed.) Recent advances in life testing and reliability, pp. 546–560. CRC Press, Boca Raton (1995)
  13. Jeong, JH: A new parametric family for modelling cumulative incidence functions: application to breast cancer data. J. Royal Stat. Soc. Series A. 169(2), 289–303 (2006)
  14. Karian, Z., Dudewicz, E.: Fitting statistical distributions: the generalized lambda distribution and generalised bootstrap methods. Chapman and Hall, New York (2000)
  15. Karvanen, J., Nuutinen, A.: Characterizing the generalized lambda distribution by L-moments. Comput. Stat. Data Anal. 52(4), 1971–1983 (2008)
  16. King, R., MacGillivray, H.: A starship estimation method for the generalised lambda distributions. Aust. N. Z. J. Stat. 41(3), 353–374 (1999)
  17. Klein, J.P., Moeschberger, M.L.: Survival analysis. Techniques for censored and truncated data. Springer, New York (1997)
  18. Komárek, A.: A new R package for Bayesian estimation of multivariate normal mixtures allowing for selection of the number of components and interval-censored data. Comput. Stat. Data Anal. 53(12), 3932–3947 (2009)
  19. Lawless, J.F.: Statistical models and methods for lifetime data. Wiley, New York (1981)
  20. Lo, S.N., Heritier, S., Hudson, M.: Saddlepoint approximation for semi-Markov processes with application to a cardiovascular randomized study. Comput. Stat. Data Anal. 53(3), 683–698 (2009)
  21. Marubini, E., Valsecchi, E.G.: Analysing survival data from clinical trials and observational studies. Wiley, New York (1995)
  22. McLachlan, G., Peel, D.: Finite mixture models. Wiley, New York (2000)
  23. Mudholkar, G.S., Srivastava, D.K.: Exponentiated Weibull family for analyzing bathtub failure-rate data. IEEE Trans. Reliab. 42(2), 299–302 (1993)
  24. Mudholkar, G.S., Srivastava, D.K., Freimer, M.: The Exponentiated Weibull Family: a reanalysis of the bus-motor-failure data. Technometrics 37(4), 436–445 (1995)
  25. Patti, S., Biganzoli, E., Boracchi, P.: Review of maximum likelihood functions for right censored data. A new elementary derivation. (2007)
  26. Putter, H., Fiocco, M., Geskus, R.: Tutorial in biostatistics: competing risks and multi-state models. Stat. Med. 26, 2389–2430 (2007)
  27. Ramberg, J., Schmeiser, B.: An approximate method for generating asymmetric random variables. Commun. Assoc. Comput Mach. 17, 78–82 (1974)
  28. Ramberg, J., Tadikamalla, P., Dudewicz, E., Mykytka, E.: A probability distribution and its uses in fitting the data. Technometrics 21, 201–214 (1979)
  29. Singh, U., Gupta, P.K., Upadhyay, S.K.: Estimation of parameters for exponentiated-Weibull family under type-II censoring scheme. Comput. Stat. Data Anal. 48(3), 509–523 (2005)
  30. Su, S.: A discretized approach to flexibly fit generalized lambda distributions to data. J. Mod. Appl. Stat. Methods 4, 408–424 (2005)
  31. Su, S.: Fitting single and mixture of generalized lambda distributions to data via discretized and maximum likelihood methods: GLDEX in R. J. Stat. Softw. 21(9), 1–17 (2007)
  32. Su, S.: Numerical maximum log likelihood estimation for generalized lambda distributions. Comput. Stat. Data Anal. 51(8), 3983–3998 (2007b)
  33. Su, S: GLDEX: Fitting Single and Mixture of Generalized Lambda Distributions (RS and FMKL) Using Discretized and Maximum Likelihood Methods. CRAN. Available at: https://cran.r-project.org/web/packages/GLDEX/index.html (2007)
  34. Su, S.: Confidence intervals for quantiles using generalized lambda distributions. Comput. Stat. Data Anal. 53(9), 3324–3333 (2009)
  35. Su, S.: Chapter 14: fitting GLD to data via quantile matching method. In: Karian, Z., Dudewicz, E. (eds.) Handbook of distribution fitting methods with R, pp. 557–583. CRC Press/Taylor & Francis, Boca Raton (2010)
  36. Su, S.: Chapter 15: fitting GLD to data using the GLDEX 1.0.4 in R. In: Karian, Z., Dudewicz, E. (eds.) Handbook of distribution fitting methods with R, pp. 585–608. CRC Press/Taylor & Francis, Boca Raton (2010)
  37. Su, S.: Flexible parametric quantile regression model. Stat Comput. 25(3), 635–650 (2015)
  38. Wahed, A.S., Luong, T.M., Jeong, J.H.: A new generalization of Weibull distribution with application to a breast cancer data set. Stat. Med. 28(16), 2077–94 (2009)
  39. Wang, D., Hutson, A.D., Miecznikowski, J.C.: L-moment estimation for parametric survival models given censored data. Stat. Methodol. 7(6), 655–667 (2010)

Copyright

© Su. 2016