# Flexible modelling of survival curves for censored data

- Steve Su

**3**:6

**DOI: **10.1186/s40488-016-0045-0

© Su. 2016

**Received: **1 July 2015

**Accepted: **16 February 2016

**Published: **29 February 2016

## Abstract

This article outlines flexible strategies to model survival curves for censored data and find parametric confidence intervals using generalised lambda distributions. Owing to their rich range of shapes, generalised lambda distributions are well suited to the problem of estimating survival curves. This article presents three useful techniques for estimating survival curves: matching partial probability weighted moments (PWM), maximum likelihood estimation (MLE) and the simulation-refitting (SR) method. The performance of these techniques is examined using right skewed, left skewed, symmetric bell curved and extreme value simulated data with varying degrees of censoring and sample sizes. Applications of the proposed methods in the context of multi-stage disease modelling and competing risks are also provided. Under controlled simulated experiments, PWM and MLE tend to give more precise estimates of survival curves than the SR method; however, the SR method tends to perform better in practice. The methods proposed in this article are very general and can be used to fit a wide range of empirical survival curves. Compared to the standard Kaplan Meier survival curve, the methods in this article have the added benefits of producing smoother survival curves and more consistent statistical estimates, since all the statistical information of the survival curve can be obtained directly under one parametric model.

## Introduction

The parametric modelling of the survival curve has always been a tricky task; it involves identifying a suitable probability density function and its parameters for incomplete data. The problem of finding a statistical distribution for survival curves can be broken down into two parts: 1) the problem of identifying a suitable distribution and 2) the problem of estimation based on the assumed distribution. This article proposes that the problem of distribution identification can be solved by using generalized lambda distributions (GLDs). The second problem of finding suitable GLD estimates in the case of censored data can be solved using the following methods: direct maximum likelihood estimation, matching partial probability weighted moments and the simulating-refitting method. All of these methods are discussed below in the Estimation algorithms for survival data section.

Traditionally, the identification of a suitable probability density function is often difficult or time consuming, since many statistical distributions have a limited range of shapes. Even if a probability density can be found, parameter estimation is an additional hurdle. The usual method of maximum likelihood estimation incorporating censoring may not always work. A common problem with this approach is that the numerical method used in the optimisation process may fail to give a reasonable solution due to the complexity of the likelihood function. This is particularly relevant to multi-stage disease situations or where a mixture of statistical distributions is needed to estimate the overall survival curve.

Consider a GLD with \( \lambda_1 = 4.56687718 \), \( \lambda_2 = 0.33274810 \), \( \lambda_3 = 0.65408979 \), \( \lambda_4 = -0.01021826 \): 5000 observations were generated from this distribution and fitted using a Normal mixture model. The Normal mixture model appears to be overly complex and still fails to capture the true underlying shape of the distribution. Other distributions such as the gamma distribution give a much simpler and more convincing shape, but still overestimate the peak of the distribution. The skewed Normal distribution (Azzalini 1999) is even more inaccurate than the gamma distribution: it misses the mode of the true distribution and does not give the correct shape.

In practice, owing to the complexity of real life data, it is cumbersome to find a good statistical distribution for empirical data by trial and error. Instead, it is preferable to use a distribution with very flexible shapes. In the current literature, for survival data, perhaps the most well-known distributions are the exponentiated Weibull distribution (Mudholkar and Srivastava 1993, Mudholkar et al. 1995, Singh et al. 2005) and the four parameter Weibull distribution (Wahed et al. 2009, Jeong 2006). Other distributions such as the generalised hyperbolic distribution, g and h distributions and many others could also be suitable candidates. The superiority of GLDs over the exponentiated Weibull distribution, and their comparable performance to the four parameter Weibull distribution in some general settings, is demonstrated using simulation studies in the Simulation studies section. While GLDs may not always outperform specific techniques developed for a special case of survival data, the aim of this article is to present the GLD as a useful general purpose parametric model for survival data. Once a GLD fit to survival data is attained, other summary statistics such as hazards can be easily obtained and will be consistent with the same underlying survival distribution, and the distributional shape of survival times can be easily gleaned by plotting the fitted GLD. In contrast, one can only obtain survival probabilities and survival curves using the Kaplan Meier (KM) method; other statistical estimates, such as hazards and the shape of the underlying distribution, need to be extracted using other techniques, potentially leading to a loss in consistency of estimates due to the use of different estimation methods on the same data.
It is also well known that a parametric method is more powerful in detecting a significant difference than a non-parametric method such as the KM method, so a parametric approach to analysing survival data offers potential cost savings by using fewer patients in clinical trials.

In the statistical literature, GLDs are well known for their flexibility of shapes and both RS GLD (Ramberg and Schmeiser 1974, Ramberg et al. 1979) and FMKL GLD (Freimer et al. 1988) have been used to fit a wide range of empirical data (Karian and Dudewicz 2000, Su 2007b, 2007a, 2010a, 2010b). The versatility of GLDs in accurately approximating a range of well-known statistical distributions such as Normal, T, Chi-squared, F and others is well documented (Karian and Dudewicz 2000, Su 2005, 2007b, 2010a). The downside of GLDs is that they are defined by their quantile functions (i.e. the GLD probability density function must be calculated numerically) and for RS GLD, there are a number of restrictions on the range of parameters to ensure RS GLD is a proper probability density function. Recent statistical research has overcome some of these difficulties and there are a number of stable methods available to fit GLDs to empirical data: maximum likelihood estimation (Su 2007b, 2007a), quantile matching (Su 2010a, 2010b), the starship method (King and MacGillivray 1999) and L moments matching (Asquith 2007a, Karvanen and Nuutinen 2008). It is also possible to fit mixtures of GLDs to data with unusual shapes using maximum likelihood estimation or quantile matching (Su 2007a, 2010a).

While the above methods are adequate for fitting uncensored survival data, special adjustments and some novelty are needed to ensure the effectiveness of GLDs is maintained for survival data with censoring. The methods proposed in this paper (direct maximum likelihood estimation, matching partial probability weighted moments and the simulating-refitting method) all extend the methods developed in Su (2007b, 2007a, 2010a, 2010b) to facilitate a successful GLD fit to survival data. Although the theoretical justifications of these methods for fitting any statistical distribution to data can be found in the literature (Aldrich 1997, Fisher 1922, Hosking 1990), it is unclear how these methods would perform in the case of GLDs. This article aims to fill these gaps in the literature by describing the mathematical development of these fitting methods for GLDs and illustrating the typical performance of GLDs through extensive simulations (Simulation studies) and some real life data examples. Various discussions on the practical issues of fitting distributions to survival data, such as modelling a multi-stage disease scenario (Application in multi-stage disease modelling) and the difference between controlled simulation performance and real life performance (Application in empirical data modelling), are also presented in this article.

## Characterization of survival curve

For survival data with observed failure times \( t_1, t_2, t_3, \ldots, t_m \), let *f*(*t*) be the probability density function and *F*(*t*) be the cumulative distribution function for survival time, then the survival curve is given by 1 − *F*(*t*). The problem is now to find an estimated *F*(*t*) for given data under censoring. The problem of choosing *f*(*t*) and *F*(*t*) is solved by using generalized lambda distributions and the estimation of the parameters of *f*(*t*) and *F*(*t*) is achieved using one of the three methods described in the Estimation algorithms for survival data section. A brief introduction to generalized lambda distributions is given below.

## Generalized lambda distributions

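The RS GLD with parameters \( \lambda_1, \lambda_2, \lambda_3, \lambda_4 \) is defined by its inverse distribution function (quantile function), where *u* is a probability from 0 to 1:

\[ F^{-1}(u) = \lambda_1 + \frac{u^{\lambda_3} - (1-u)^{\lambda_4}}{\lambda_2} \qquad (1) \]

Since \( F^{-1}(u) = t \), then \( f(t)=\frac{1}{\frac{dF^{-1}(u)}{du}} \):

\[ f(t) = \frac{\lambda_2}{\lambda_3 u^{\lambda_3 - 1} + \lambda_4 (1-u)^{\lambda_4 - 1}} \qquad (2) \]

The FMKL GLD^{1} was introduced later (Freimer et al. 1988); the only restriction for FMKL GLD is \( \lambda_2 \ge 0 \). When \( \lambda_3 \ne 0 \), \( \lambda_4 \ne 0 \), the FMKL GLD takes the following form:

\[ F^{-1}(u) = \lambda_1 + \frac{1}{\lambda_2} \left[ \frac{u^{\lambda_3} - 1}{\lambda_3} - \frac{(1-u)^{\lambda_4} - 1}{\lambda_4} \right] \qquad (3) \]

When \( \lambda_3 \) or \( \lambda_4 \) or both are equal to zero, the FMKL GLD takes a different limiting form:

\[ F^{-1}(u) = \lambda_1 + \frac{1}{\lambda_2} \left[ \ln(u) - \frac{(1-u)^{\lambda_4} - 1}{\lambda_4} \right], \quad \lambda_3 = 0,\ \lambda_4 \ne 0 \qquad (4) \]

\[ F^{-1}(u) = \lambda_1 + \frac{1}{\lambda_2} \left[ \frac{u^{\lambda_3} - 1}{\lambda_3} - \ln(1-u) \right], \quad \lambda_3 \ne 0,\ \lambda_4 = 0 \qquad (5) \]

\[ F^{-1}(u) = \lambda_1 + \frac{1}{\lambda_2} \left[ \ln(u) - \ln(1-u) \right], \quad \lambda_3 = 0,\ \lambda_4 = 0 \qquad (6) \]

The probability density function of FMKL GLD when \( \lambda_3 \ne 0 \), \( \lambda_4 \ne 0 \) is:

\[ f(t) = \frac{\lambda_2}{u^{\lambda_3 - 1} + (1-u)^{\lambda_4 - 1}} \qquad (7) \]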

The probability density function of FMKL GLD when either \( \lambda_3 \) or \( \lambda_4 \) or both are equal to zero can be easily derived and is not provided here. For FMKL and RS GLD, the numerical solution of *u* in \( F^{-1}(u) = t \) gives the cumulative distribution function *F*(*t*), since *u* = *F*(*t*). This is usually obtained using the Newton–Raphson method (see GLDEX package (Su 2007) in R). Once the corresponding *u* is found for a given *t*, the probability density function *f*(*t*) for RS and FMKL GLDs can be derived using (2) and (7) respectively.
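The numerical steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the GLDEX implementation: it assumes the standard RS and FMKL quantile functions (with \( \lambda_3 \ne 0 \), \( \lambda_4 \ne 0 \) for FMKL), it uses bisection as a simple stand-in for Newton–Raphson, and the parameter values and function names are hypothetical.

```python
def rs_quantile(u, lam):
    """RS GLD quantile function F^{-1}(u) for 0 < u < 1."""
    l1, l2, l3, l4 = lam
    return l1 + (u ** l3 - (1.0 - u) ** l4) / l2


def rs_density(u, lam):
    """RS GLD density f(t) at t = F^{-1}(u), i.e. 1 / (dF^{-1}(u)/du)."""
    l1, l2, l3, l4 = lam
    return l2 / (l3 * u ** (l3 - 1.0) + l4 * (1.0 - u) ** (l4 - 1.0))


def fmkl_quantile(u, lam):
    """FMKL GLD quantile function for lambda3 != 0 and lambda4 != 0."""
    l1, l2, l3, l4 = lam
    return l1 + ((u ** l3 - 1.0) / l3 - ((1.0 - u) ** l4 - 1.0) / l4) / l2


def fmkl_density(u, lam):
    """FMKL GLD density f(t) at t = F^{-1}(u)."""
    l1, l2, l3, l4 = lam
    return l2 / (u ** (l3 - 1.0) + (1.0 - u) ** (l4 - 1.0))


def cdf(t, quantile, lam, tol=1e-12):
    """Solve F^{-1}(u) = t for u by bisection (stand-in for Newton-Raphson).

    Only interior points of (0, 1) are evaluated, so parameters that give
    an unbounded support do not cause a division by zero.
    """
    lower, upper = 0.0, 1.0
    while upper - lower > tol:
        mid = 0.5 * (lower + upper)
        if quantile(mid, lam) < t:
            lower = mid
        else:
            upper = mid
    return 0.5 * (lower + upper)


def survival(t, quantile, lam):
    """Survival curve 1 - F(t)."""
    return 1.0 - cdf(t, quantile, lam)


# Hypothetical FMKL parameters, chosen only for illustration.
lam = (0.0, 1.0, 0.5, 0.5)
t = fmkl_quantile(0.3, lam)
u = cdf(t, fmkl_quantile, lam)
print(u, fmkl_density(u, lam), survival(t, fmkl_quantile, lam))
```

Bisection is slower than Newton–Raphson but needs no derivative and cannot leave the interval (0, 1), which makes it a convenient choice for a sketch.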

## Estimation algorithms for survival data

### Maximising the likelihood for survival data with censored observations (MLE)

Consider survival data with right censoring at threshold *T*, where we observe *m* random failures out of *n* subjects and denote observed failure times as \( t_1, t_2, t_3, \ldots, t_m \). Let *f*(*t*) be the probability density function and *F*(*t*) be the cumulative distribution function for survival time, then the likelihood with exact and right censored observations is:

\[ L = \prod_{i=1}^{m} f(t_i) \left[ 1 - F(T) \right]^{n-m} \]

If there is no censoring, then the usual likelihood function \( \left({\displaystyle {\prod}_{i=1}^mf\left({t}_i\right)}\right) \) is obtained. Other forms of censoring have been discussed elsewhere (Klein and Moeschberger 1997, Lawless 1981, Marubini and Valsecchi 1995, Patti et al. 2007).

In the context of GLDs, *F*(*t*) and *f*(*t*) can be obtained as described in Generalized lambda distributions and these are available from a number of statistical packages such as GLDEX (Su 2007) in R. The usual way of fitting survival curves using maximum likelihood estimation for GLDs is to take the logarithm of the likelihood and maximize it. The maximisation is usually done using the Nelder-Mead simplex algorithm. This is usually more robust than differentiating the log likelihood and solving numerically for the parameters that set the resulting equations to zero. Additionally, the problem of choosing initial values to start the optimisation process can be solved using the method described in Su (2007b, 2007a) or by using the initial values obtained by matching PWMs as detailed below. The initial values search method in Su (2007b, 2007a) uses an extensive randomised search across the parameters using quasi random number generators such as the Sobol or Halton sequence, and this article chooses the randomised set of parameters with the largest likelihood value to initiate the optimisation process.
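As a small illustration of maximising a censored log likelihood, the sketch below assumes all censored subjects share a single right censoring threshold *T*, so the log likelihood is the sum of log *f*(*t*) over the failures plus (*n* − *m*) log(1 − *F*(*T*)). To keep the example short and checkable, an exponential distribution stands in for the GLD and a one-parameter golden-section search stands in for Nelder-Mead; the data and function names are hypothetical.

```python
import math


def censored_log_likelihood(failure_times, n_subjects, threshold, pdf, cdf):
    """Log likelihood for m exact failures and n - m values right censored at T."""
    m = len(failure_times)
    log_lik = sum(math.log(pdf(t)) for t in failure_times)
    if n_subjects > m:
        log_lik += (n_subjects - m) * math.log(1.0 - cdf(threshold))
    return log_lik


def golden_section_maximise(func, lower, upper, tol=1e-10):
    """Maximise a unimodal function of one variable on [lower, upper]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lower, upper
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if func(c) > func(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return 0.5 * (a + b)


def fit_exponential_rate(failure_times, n_subjects, threshold):
    """Exponential stand-in: maximise the censored log likelihood over the rate."""
    def objective(rate):
        return censored_log_likelihood(
            failure_times, n_subjects, threshold,
            pdf=lambda t: rate * math.exp(-rate * t),
            cdf=lambda t: 1.0 - math.exp(-rate * t),
        )
    return golden_section_maximise(objective, 1e-6, 1.0)


# Hypothetical data: 4 failures out of 10 subjects, censoring at T = 10.
failures = [2.0, 5.0, 7.0, 9.0]
rate_hat = fit_exponential_rate(failures, n_subjects=10, threshold=10.0)
print(rate_hat)
```

For the exponential case the maximiser has the closed form *m* / (Σ *t* + (*n* − *m*)*T*), which gives a convenient check on the numerical search. For a GLD, `pdf` and `cdf` would be replaced by numerically evaluated GLD functions and the search by a four-parameter optimiser.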

### Matching partial probability weighted moments (PWMs)

As an alternative to maximum likelihood estimation, it is possible to estimate the censored distribution by matching PWMs. The main advantage of using PWMs is that these moments are more robust than conventional moments with respect to sampling variability. Closely related to PWMs are L moments, where sample L moments can be defined indirectly as functions of probability weighted moments. Owing to their robustness, fitting statistical distributions by matching PWMs/L moments to data is usually preferable to matching conventional moments.

Let the order statistics for a complete sample of *n* observations be defined as \( X_{j:n} \) where *j* = 1, 2, 3, …, *n*, and \( X_{1:n} \le X_{2:n} \le \ldots \le X_{n:n} \). The PWMs of sample data for right and left censoring (Hosking 1995, Greenwood et al. 1979) are given below.

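The *r*-th PWM for right censored data, denoted as \( \widehat{b}_{right}(r) \), with *n* − *m* censored values replaced by threshold *T* is given in (10).

\[ \widehat{b}_{right}(r) = \frac{1}{n} \left[ \sum_{j=1}^{m} \frac{(j-1)(j-2)\cdots(j-r)}{(n-1)(n-2)\cdots(n-r)} X_{j:n} + \sum_{j=m+1}^{n} \frac{(j-1)(j-2)\cdots(j-r)}{(n-1)(n-2)\cdots(n-r)} T \right] \qquad (10) \]

The *r*-th PWM for left censoring, denoted as \( \widehat{b}_{left}(r) \), with *n* − *v* censored values replaced by threshold *T* is given in (11).

\[ \widehat{b}_{left}(r) = \frac{1}{n} \left[ \sum_{j=1}^{n-v} \frac{(j-1)(j-2)\cdots(j-r)}{(n-1)(n-2)\cdots(n-r)} T + \sum_{j=n-v+1}^{n} \frac{(j-1)(j-2)\cdots(j-r)}{(n-1)(n-2)\cdots(n-r)} X_{j:n} \right] \qquad (11) \]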

For *r* = 0, the calculation of \( \widehat{b}_{left}(0) \) or \( \widehat{b}_{right}(0) \) reduces to a simple average over all uncensored and censored observations in the dataset. In this article we take the first four PWMs with *r* = 0, 1, 2, 3.
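A minimal sketch of the sample partial PWM calculation is given below. It assumes the usual unbiased order-statistic weights C(*j* − 1, *r*) / C(*n* − 1, *r*) from Hosking (1995), with every censored value replaced by the threshold *T*; the function name and example data are hypothetical.

```python
from math import comb


def partial_pwm(observed, n, threshold, r, censoring="right"):
    """r-th sample partial PWM with censored values replaced by the threshold.

    observed  : the uncensored observations
    n         : total sample size (uncensored + censored), with n > r
    threshold : censoring threshold T
    censoring : "right" puts the threshold values at the top of the ordered
                sample, "left" puts them at the bottom
    """
    n_censored = n - len(observed)
    if censoring == "right":
        completed = sorted(observed) + [threshold] * n_censored
    else:
        completed = [threshold] * n_censored + sorted(observed)
    # Weight of the j-th order statistic (1-indexed): C(j-1, r) / C(n-1, r).
    total = sum(comb(j - 1, r) * x for j, x in enumerate(completed, start=1))
    return total / (n * comb(n - 1, r))


# Hypothetical example: 3 failures out of 5 subjects, right censored at T = 4.
first_four = [partial_pwm([1.0, 2.0, 3.0], 5, 4.0, r) for r in range(4)]
print(first_four)
```

With *r* = 0 every weight equals one, so the function returns the plain average of the completed sample, matching the remark above.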

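For a distribution with cumulative distribution function *F*(*t*) = *u* and quantile function \( F^{-1}(u) \), the theoretical right and left censored PWMs with censoring threshold *T* and *F*(*T*) = *k* are as follows:

\[ \beta_{right}(r) = \int_0^k F^{-1}(u)\, u^r \, du + \frac{1 - k^{r+1}}{r+1}\, T \qquad (12) \]

\[ \beta_{left}(r) = \frac{k^{r+1}}{r+1}\, T + \int_k^1 F^{-1}(u)\, u^r \, du \qquad (13) \]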

The PWMs for RS and FMKL GLD for left and right censoring are given in the Appendix. Based on these results, it is now possible to find a set of parameters of GLD that minimize the sum of the squared difference between sample and derived PWMs using (14). The problem of choosing initial values for the optimisation process is again solved using the method described in Su (2007b, 2007a), which involves an extensive randomised search across the parameters using quasi random number generators such as the Sobol or Halton sequence. The initial values used to start the optimisation process will be a randomised set of parameters that best matches the partial probability weighted moments between the sample and the estimated GLD.
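The randomised search for initial values can be sketched with a Halton sequence as follows. This is only an illustration of the idea: the objective here is a toy sum of squares with a hypothetical target, standing in for the sum of squared differences between sample and theoretical PWMs, and the bounds and number of points are arbitrary.

```python
def halton(index, base):
    """Return element number `index` (index >= 1) of the Halton sequence."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def best_initial_values(objective, bounds, n_points=1000):
    """Evaluate the objective on quasi random points and keep the smallest.

    bounds is a list of (low, high) pairs, one per parameter; a different
    prime base is used for each coordinate (up to six parameters here).
    """
    primes = [2, 3, 5, 7, 11, 13]
    best_point, best_value = None, float("inf")
    for i in range(1, n_points + 1):
        point = [low + (high - low) * halton(i, primes[d])
                 for d, (low, high) in enumerate(bounds)]
        try:
            value = objective(point)
        except (ValueError, ZeroDivisionError, OverflowError):
            continue  # skip parameter sets where the objective is undefined
        if value < best_value:
            best_point, best_value = point, value
    return best_point, best_value


# Toy objective with a hypothetical target, for illustration only.
target = [0.3, 0.7]


def toy_objective(point):
    return sum((p - t) ** 2 for p, t in zip(point, target))


start, value = best_initial_values(toy_objective, [(0.0, 1.0), (0.0, 1.0)])
print(start, value)
```

The returned point would then be passed to a local optimiser. For MLE the same search applies with the negative log likelihood as the objective.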

### Simulating-refitting (SR) method

Illustrating SR method

Probability | Interval | Distribution used for simulation |
---|---|---|
0.25 | 100–200 | Uniform |
0.25 | 200–300 | Uniform |
0.25 | 300–400 | Uniform |
0.25 | 400–500 | Triangular |

Figure 2 is an example where 25 % of patients experienced death at each of day 100, 200 and 300, and from day 400 onwards the data is censored. To simulate data from this survival curve, 75 % of the total number of observations will come from uniform distributions over the intervals 100–200 (25 %), 200–300 (25 %) and 300–400 (25 %), with the remaining 25 % from a triangular distribution over 400–500 (Fig. 2). The last time point 500 is chosen arbitrarily as 400 × (1 + proportion of censored data) or 400 × 1.25. A triangular distribution towards the end of the survival curve is chosen to facilitate an easier fit, as many distributions tend to exhibit a downward tail towards their theoretical minimum or maximum. Once the data is simulated, it can be treated as uncensored and standard distributional fitting methods can be used.
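The simulation step for this example can be sketched as below. The segment probabilities and intervals come from the table above; placing the mode of the triangular distribution at the lower end (400), so that the density tails downward towards 500, is an assumption, as are the function name and the sample size.

```python
import random

# (probability, lower, upper, distribution) for each segment of the curve.
SEGMENTS = [
    (0.25, 100.0, 200.0, "uniform"),
    (0.25, 200.0, 300.0, "uniform"),
    (0.25, 300.0, 400.0, "uniform"),
    (0.25, 400.0, 500.0, "triangular"),  # 500 = 400 * (1 + 0.25)
]


def simulate_sr(segments, n_total, seed=0):
    """Simulate uncensored survival times from a stepwise survival curve."""
    rng = random.Random(seed)
    simulated = []
    for probability, low, high, kind in segments:
        count = round(probability * n_total)
        for _ in range(count):
            if kind == "uniform":
                simulated.append(rng.uniform(low, high))
            else:
                # Assumed shape: mode at the lower end, decreasing towards high.
                simulated.append(rng.triangular(low, high, low))
    return simulated


times = simulate_sr(SEGMENTS, n_total=1000)
print(len(times), min(times), max(times))
```

The simulated times can then be passed to any uncensored fitting method.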

The SR strategy effectively transforms survival data with censoring into survival data without any censoring. This aids the parametric modelling of the survival curve by allowing a visual representation of \( f\left(\widehat{t}\right) \). Additionally, it also allows methods such as the starship method, method of moment matching, quantile matching, maximum likelihood estimation and L moments matching to be used to facilitate a potentially better GLD fit. This is another advantage of the SR method, since not all of the available methods for fitting GLD to data can be easily adapted to cope with censored survival data. This article illustrates the use of SR in conjunction with maximum likelihood estimation (ML), L moment matching (LM) and quantile matching (QS) (see Su (2010a) for details) for GLDs in a series of simulation studies under the Simulation studies section.

## Confidence intervals for survival curves

Once a parametric model for survival curves is found using any of the above methods, the confidence intervals for survival curves can be evaluated directly without using simulation owing to the work by Cramer (1963) and Su (2009). Detailed descriptions and performance of this method can be found in Su (2009).

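For the *P*-th quantile for *n* observations, Cramer (1963) showed that generically, the density associated with *P*(*X* ≤ \( X_{np} \)), denoted *g*(*x*), is as follows. Note in (15), a typo in Su (2009) is fixed here:

\[ g(x) = \frac{n!}{w!\,(n-w-1)!}\, F(x)^{w} \left[ 1 - F(x) \right]^{n-w-1} f(x), \quad w = [np] \qquad (15) \]

To find the 100(1 − *α*) % confidence interval \( (x_L, x_U) \) analytically, the following equations need to be solved:

\[ \int_0^{x_L} g(x)\,dx = \frac{\alpha}{2}, \qquad \int_0^{x_U} g(x)\,dx = 1 - \frac{\alpha}{2} \]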

Note that \( {\displaystyle {\int}_0^{x_0}g(x)dx={\left.\beta \left(w+1,n-w\right)\right|}_0^{F\left({x}_0\right)}} \) where *β* is the Euler’s incomplete beta function normalized by the complete Beta function. This procedure is known as the analytical-maximum likelihood GLD approach in Su (2009).
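The calculation can be sketched in Python under some stated assumptions: *w* is taken as the integer part of *np*, the interval is equal-tailed (α/2 in each tail), and the fitted model is represented by a quantile function. For integer arguments the normalised incomplete beta function equals a binomial tail sum, which avoids any special-function library. The exponential quantile function below is only a stand-in for a fitted GLD, and all names are hypothetical.

```python
import math
from math import comb


def order_statistic_cdf(p, n, w):
    """Normalised incomplete beta I_p(w + 1, n - w) as a binomial tail sum.

    This is the probability that at least w + 1 of n observations fall
    below the point x where F(x) = p.
    """
    return sum(comb(n, j) * p ** j * (1.0 - p) ** (n - j)
               for j in range(w + 1, n + 1))


def solve_probability(target, n, w, tol=1e-12):
    """Find p such that order_statistic_cdf(p, n, w) = target by bisection."""
    lower, upper = 0.0, 1.0
    while upper - lower > tol:
        mid = 0.5 * (lower + upper)
        if order_statistic_cdf(mid, n, w) < target:
            lower = mid
        else:
            upper = mid
    return 0.5 * (lower + upper)


def quantile_confidence_interval(quantile_function, n, p, alpha=0.05):
    """Equal-tailed 100(1 - alpha) % interval for the p-th quantile."""
    w = int(n * p)  # assumption: integer part of n * p
    lower_u = solve_probability(alpha / 2.0, n, w)
    upper_u = solve_probability(1.0 - alpha / 2.0, n, w)
    return quantile_function(lower_u), quantile_function(upper_u)


def exponential_quantile(u, rate=0.1):
    """Stand-in fitted model: exponential quantile function."""
    return -math.log(1.0 - u) / rate


print(quantile_confidence_interval(exponential_quantile, n=200, p=0.5))
```

The binomial sum is adequate for moderate *n*; for very large samples a library implementation of the incomplete beta function would be numerically safer.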

Figure 3 illustrates this for a GLD fitted using the SR method (\( \lambda_1 = 4.746579 \times 10^{1} \), \( \lambda_2 = 7.504002 \times 10^{-4} \), \( \lambda_3 = 7.166621 \times 10^{-3} \), \( \lambda_4 = 3.432945 \times 10^{-1} \)). To visually check the validity of the evaluated 95 % confidence intervals (CIs), 1000 simulated survival curves (grey area in Fig. 3) were generated along with the parametric CIs in Fig. 3. This is further validated by the almost indistinguishable result between the estimated CIs from simulated data and the parametrically evaluated CIs (Fig. 3). Extensive simulation studies on the performance of CIs in terms of coverage probability for different sample sizes and quantiles are covered in Su (2009).

## Assessing the goodness of fit

In the absence of full information, the quality of parametric modelling can be visually examined by comparing the fitted parametric model with a non-parametric survival curve such as the Kaplan Meier (KM) survival curve. In the case where the SR method is used, it is possible to compare the final model against the simulated data using QQ plots and more formally through the Kolmogorov-Smirnov test or the Kolmogorov-Smirnov resample test (Su 2007b, 2007a, 2010a, 2010b).

## Simulation studies

The use of direct MLE and PWM matching for censored data has been applied using well known distributions such as log Normal and Weibull distributions (Wang et al. 2010a). The theoretical justification for using MLE and PWM to fit any continuous distribution to data can be found elsewhere (Hosking 1990, Fisher 1922, Aldrich 1997). For the SR method, as long as the simulated data is sufficiently close to the true underlying distribution and the GLD can model the simulated uncensored data accurately, it will yield an accurate model.

Unlike many standard statistical distributions, GLDs are characterised by their quantile functions. This means that to use ML under SR or MLE directly, it is necessary to obtain the probability density function using numerical methods such as the Newton–Raphson procedure. Other methods such as PWM do not require this numerical step in the optimisation algorithm. However, all fitting methods are affected by the sudden shape change problem in GLD parameter estimation. For a relatively small change in one of the four parameters, a GLD can exhibit a dramatic change in shape. For example, the density of RS GLD (0,1,0.5,1) is an increasing function from -1 to 1 but the density of RS GLD (0,1,0.5,0.75) is a parabola shaped function from -1 to 1, even though the change in the fourth parameter is only 0.25. In other situations, RS GLD (0,1,1.5,2) and RS GLD (0,1,1.5,2.5) (with a 0.5 change in the fourth parameter) do exhibit similar shapes and there is a smoother transition in shapes as the fourth parameter changes. This property means the standard theory examining the lower bound for the variability of parameters is not particularly useful for GLDs. Theoretical examination of fitting methods for GLDs is complicated by the numerical computations involved and by potential abrupt changes in the shape of distributions from small changes in parameters, so perhaps the most elegant strategy at the present time is to use simulations to compare the methods. Instead of comparing whether the parameters of the fitted GLD are close to those of the true GLD, the emphasis is on whether the fitted GLD quantiles are sufficiently close to the true quantiles from some other known statistical distribution. This is illustrated below.
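The sudden shape change in the RS GLD example can be checked numerically. The sketch below assumes the standard RS GLD quantile and density formulas and evaluates the density on a grid of *u* values; since *t* increases with *u*, monotonicity in *u* carries over to *t*. Function names are hypothetical.

```python
def rs_quantile(u, lam):
    """RS GLD quantile function F^{-1}(u)."""
    l1, l2, l3, l4 = lam
    return l1 + (u ** l3 - (1.0 - u) ** l4) / l2


def rs_density(u, lam):
    """RS GLD density at t = F^{-1}(u)."""
    l1, l2, l3, l4 = lam
    return l2 / (l3 * u ** (l3 - 1.0) + l4 * (1.0 - u) ** (l4 - 1.0))


def is_increasing(values):
    """True when every value is strictly larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


grid = [i / 100.0 for i in range(1, 100)]  # u = 0.01, ..., 0.99
lam_a = (0.0, 1.0, 0.5, 1.0)
lam_b = (0.0, 1.0, 0.5, 0.75)

density_a = [rs_density(u, lam_a) for u in grid]
density_b = [rs_density(u, lam_b) for u in grid]
peak_b = density_b.index(max(density_b))

print("support:", rs_quantile(0.0, lam_a), rs_quantile(1.0, lam_a))
print("RS GLD (0,1,0.5,1) increasing:", is_increasing(density_a))
print("RS GLD (0,1,0.5,0.75) peak at u =", grid[peak_b])
```

The first density rises throughout its support, while the second rises to an interior peak and then falls, despite the two parameter sets differing by only 0.25 in the fourth parameter.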

To assess the performance of various estimation methods, survival curves were generated from 2000 observations from symmetric and skewed Normal distributions with parameters: location = 20, scale = 2, shape = -5 or 5 or 0 for left skewed, right skewed and symmetric shape respectively. The motivation for using the skewed Normal rather than the Weibull distribution is to facilitate a better comparison across different scenarios using the same distribution with different parameters. Also, it would be a rather unfair comparison if one were to use extended Weibull distributions (exponentiated Weibull and four parameter Weibull) to fit Weibull distributed data. Instead, the primary focus here is to examine how well the GLDs and extended Weibull distributions fit data from other distributions, since the true distribution is never known in practice. Additionally, the Gumbel distribution (an extreme value distribution) with location parameter 15 and scale parameter 5 is also used in this comparison. These distributions are primarily chosen to examine the behaviour of these fitting algorithms over a range of different shapes.

This entire process is repeated for 200 observations to allow assessment of effect of sample size on the performance of proposed fitting methods. To create right censored data, observations greater than quantile ranging from 0.5 (median) to 0.9 are censored. Similarly, to create left censored data, observations less than quantile ranging from 0.1 to 0.5 (median) are censored. Five estimation methods: ML (maximum likelihood)/LM (L moments)/QS (quantile matching) under SR and MLE and PPWM matching were applied over 100 simulation runs. When fitting RS GLD using MLE with half of the data being censored (i.e., at 0.5), sometimes it is desirable to use the SR-ML method to generate the initial values to start the optimisation process, rather than using randomised search as it would lead to a better performance. This strategy is used in this article.

To give the reader an idea of the degree of accuracy attained by GLDs in comparison to other distributions, this article also assesses the performance of the exponentiated Weibull distribution (Mudholkar and Srivastava 1993, Mudholkar et al. 1995, Singh et al. 2005) and the four parameter Weibull distribution (Wahed et al. 2009, Jeong 2006) under the same simulation scenarios. Maximum likelihood estimation via the Newton–Raphson algorithm is used to fit both distributions to survival data. The set of initial values used to begin the optimisation process is obtained as follows: 1000 sets of initial values from 0 to 100 are randomly generated using a Sobol sequence generator for the parameters of both distributions. From these 1000 sets of values, the set of initial values that maximises the likelihood is used in the optimisation process.

To compare the performance between methods, the relative error was computed. The relative error is defined as the absolute difference between fitted and true quantile divided by the true quantile. This is computed using 100 equally spaced quantiles from 1 % quantile up to the censored quantile for right censored data. For left censored data, this is computed using 100 equally spaced quantiles from the censored quantile up to the 99 % quantile. The log mean and log variance of the relative error among five estimation methods for different types of censoring and different statistical distributions are shown in Figs. 4, 5, 6 and 7. The log transformation is designed to solve the problem of extreme results, to ensure a fairer and clearer comparison across different methods.
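The relative error summary can be sketched as follows. The quantile functions passed in are hypothetical placeholders for the true and fitted distributions, the sample variance (divisor *n* − 1) is an assumption, and true quantiles are assumed positive, as they are for survival times.

```python
import math
from statistics import mean, variance


def relative_error_summary(true_quantile, fitted_quantile,
                           lower=0.01, upper=0.9, n_points=100):
    """Log mean and log variance of relative errors over a quantile grid.

    For right censored data use lower = 0.01 and upper = censored quantile;
    for left censored data use lower = censored quantile and upper = 0.99.
    """
    step = (upper - lower) / (n_points - 1)
    probabilities = [lower + i * step for i in range(n_points)]
    errors = [abs(fitted_quantile(u) - true_quantile(u)) / true_quantile(u)
              for u in probabilities]
    m, v = mean(errors), variance(errors)
    # math.log raises ValueError if the fit is exact (zero mean or variance).
    return {"mean": m, "variance": v,
            "log_mean": math.log(m), "log_variance": math.log(v)}


# Hypothetical quantile functions, for illustration only.
def true_quantile(u):
    return 20.0 + 2.0 * u


def fitted_quantile(u):
    return true_quantile(u) * (1.0 + 0.1 * u)


print(relative_error_summary(true_quantile, fitted_quantile, 0.01, 0.9))
```

In this toy case the relative error at probability *u* is 0.1 *u*, so the mean relative error is 0.1 × (0.01 + 0.9) / 2.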

Within Figs. 4, 5, 6 and 7, the emphasis is on the performance of different methods. In Fig. 4, it is clear that the exponentiated Weibull is among the worst performing distributions except with respect to fitting Gumbel distribution data, and the four parameter Weibull has fairly comparable performance to the GLD but performs less well for Gumbel data. The precision comparison in Fig. 5 leads to a similar conclusion to Fig. 4, and the general pattern in these two figures is reflected in Figs. 6 and 7 but with added variability.

Within GLD methods, Figs. 4 and 5 show that the most accurate methods appear to be MLE and PWM matching, while quantile matching, ML and LM under the SR method appear to be more variable in a number of cases. This is expected as ML/LM/QS under SR introduce extra variability through simulation, due to the nature of the SR algorithm in this article. However, this is not always true. The advantage of using SR is seen in Fig. 6 for the Gumbel distribution with 10 % left censored data (GB-LC0.1) for FMKL GLD PWM and FMKL GLD under SR-LM. It is clear that in this example, the direct use of PWM does not result in as good a fit as SR-LM. This is likely due to the difficulty in ascertaining a suitable set of initial values to ensure proper convergence to the best possible GLD fit, an area where SR can provide valuable guidance and input.

The plots of the log variability of relative error show that the methods provided give quite precise results, with direct MLE tending to outperform PWM. ML/LM/QS under SR all have similar performance and often perform slightly worse than MLE or PWM (Figs. 5 and 7).

The perceived, generic pattern of superior performance of MLE or PWM over ML/LM under SR should be interpreted with caution. There are also cases, as shown in the example above, where LM under SR can in fact outperform direct use of PWM. Also, note that the optimality of the simulation results comes from the fact that the true distributions are known and a single GLD is an adequate approximation to the true underlying distribution. In practical situations, the true underlying distribution is unknown and may require a mixture of GLDs. When dealing with a mixture of GLDs, it is harder to fit a distribution to censored data using direct MLE or PWMs, since this requires maximising or minimising a much more complex objective function, which can be difficult in practice. ML/LM/QS under SR, on the other hand, can be adapted more easily to fit mixtures of GLDs and the success of these methods in fitting mixtures has already been documented elsewhere (Su 2007b, 2010a, 2010b). SR also tends to give a better model for empirical data, as illustrated in Application in empirical data modelling. The main message is that the theoretical loss of efficiency and accuracy using SR is likely to be minimal, as evident in these simulation studies, but as illustrated below, SR can provide additional information to aid the fitting of a suitable distribution which is not attainable by using an estimation method such as PWM or MLE directly.

## Application in empirical data modelling

While the methods described in this article work well in controlled simulation experiments, it is the successful modelling of survival curves for real life data sets that will be most useful for practitioners. The European Blood Marrow Transplant (EBMT) registry (2205 patients) and Amsterdam Cohort Studies on AIDS infection data (329 participants) from Putter et al. (2007) are used for this illustration.

For Amsterdam Cohort Studies on AIDS infection, because there is competing risk (see Putter et al. (2007) for more details), the appropriate measure of probability is to use the cumulative incidence function rather than the naive KM curve. The SR-ML method is again used in this example and the resulting fit is very close to the non-parametric cumulative incidence function as shown in Fig. 8-panel D.

The survival curve for the EBMT data is first refitted using a mixture of 3 RS GLDs under the SR method. Details on fitting mixtures of GLDs can be obtained from Su (2007a, 2010b). This GLD mixture model (fitted using the quantile matching method (Su 2010b)) now corresponds almost exactly to the survival curve (Fig. 9, (B)). Previously, the use of a mixture of 2 RS GLDs in Fig. 8 (B) showed a slight departure (but still within the 95 % CI) at the start of the survival curve. Note the number of GLDs can be selected using AIC. For example, in this case, the AIC under a mixture of 3 RS GLDs is 179658.8 compared to 175950.5 under a mixture of 2 RS GLDs. The preference therefore, based on AIC, is to use a mixture of 2 RS GLDs.

In Fig. 9, the EBMT data was also fitted using Normal mixtures. Firstly, MCMC estimation of Normal mixtures for survival data with censoring is applied directly using 5 Normal distributions via the mixAK package in R (Komárek 2009). Secondly, Normal mixtures are fitted using the SR method via the mclust package (Fraley and Raftery 2002, 2007) in R. This involves finding the optimal Normal mixture using BIC under the EM algorithm initialised by hierarchical clustering, and a mixture of 8 Normal distributions was fitted onto the simulated data using the SR method. Figure 9, panel **a** shows that the mixAK Bayesian Normal mixture model failed to estimate the survival distribution accurately for survival times greater than 2500 in comparison to kernel density estimation. As a result, this model is not considered further in Fig. 9, panels **b** and **c**. Fig. 9 shows that the GLD mixture model not only provides a compelling fit to the KM survival curve (Fig. 9, panel **b**) but also has the advantage of possessing a simpler shape and fewer parameters than the Normal mixture model (Fig. 9, panels **a** and **c**).

## Application in multi-stage disease modelling

A common goal in multi-stage disease modelling is to find an overall survival curve, accounting for all different paths to a final outcome. Lo et al. (2009) discussed how this can be done using saddlepoint approximation in a semi-Markov process setting; however, the problem of choosing a suitable statistical distribution is still unsolved. Their method also hinges upon the successful maximisation of the likelihood for the whole system, which can be difficult to achieve for a complex multi-stage disease scenario.

Figure 10 shows two possible paths leading to death for each patient in a clinical trial. Patients could develop a disease leading to death or go straight to death. There are two pathways leading to death and censoring could occur any time.

1. Find a parametric survival curve for each path, using either MLE, PWMs matching or the SR technique. Each path comprises censored and uncensored data, and any data censored at any intermediate stage would be classified as censored data.
2. Obtain the number of patients for each path; simulate survival times from the fitted parametric model for each path.
3. Combine all the simulated results in step 2 and the censored times observed at the start of the study into one final survival data set. Model the survival times of this final data set parametrically using either PWMs matching, MLE or the SR method. This gives the overall parametric survival curve for the system.
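The three steps can be sketched in code under strong simplifying assumptions. An exponential distribution stands in for the GLD so that the censored fit has a closed form (number of events divided by total follow-up time); the number of simulated patients per path is taken to be the number of patients observed on that path; the censored times from the start of the study enter the final fit as censored observations. The data and function names are hypothetical.

```python
import random


def fit_exponential(times, events):
    """Censored-data MLE of an exponential rate: events / total follow-up."""
    return sum(events) / sum(times)


def overall_rate(paths, initial_censored_times, seed=0):
    """Steps 1 to 3 with an exponential stand-in for the parametric model.

    paths is a list of (times, events) pairs, one per path, where events
    holds 1 for an observed death and 0 for a censored time.
    """
    rng = random.Random(seed)
    combined_times, combined_events = [], []
    for times, events in paths:
        rate = fit_exponential(times, events)         # step 1: fit each path
        for _ in range(len(times)):                   # step 2: simulate
            combined_times.append(rng.expovariate(rate))
            combined_events.append(1)
    combined_times.extend(initial_censored_times)     # step 3: combine
    combined_events.extend([0] * len(initial_censored_times))
    return fit_exponential(combined_times, combined_events)


# Hypothetical data: direct path to death and path via disease.
direct = ([120.0, 300.0, 450.0, 500.0], [1, 1, 0, 1])
via_disease = ([200.0, 260.0, 410.0], [1, 0, 1])
print(overall_rate([direct, via_disease], initial_censored_times=[30.0, 45.0]))
```

With GLDs, `fit_exponential` would be replaced by one of the MLE, PWM matching or SR fits, and `rng.expovariate` by evaluating the fitted GLD quantile function at uniform random numbers.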

## Conclusion

This article illustrates the use of generalised lambda distributions in conjunction with MLE, PWMs matching and the SR method to find an approximate probability density function and confidence interval for the survival curve. In the event that MLE and PWMs matching fail to give a convincing result, the SR method is often a useful alternative. The SR method converts the censored survival data into uncensored data, allowing users to improve the distributional fit using a wider range of fitting methods. The SR method can also provide initial values for a secondary optimisation for MLE and PWM matching, which sometimes provides a better fit to survival curves than is attainable using only one method.

The development of these techniques is promising, as it means statisticians are no longer limited to non-parametric techniques when analysing such data. It is also now possible to extract all the statistical information on survival times, such as the mean, quantiles and variability, consistently under one parametric model. This opens up the prospect of developing more powerful statistical models and tests for the censored survival data frequently used in engineering and medicine. Recent advances in GLD regression (Su 2015) also open the possibility of extending this work to accelerated failure models, which will further enhance the statistician’s toolbox in practice.

The correct abbreviation should be FKML GLD, but in conformity with the statistical literature, the FMKL GLD terminology is used here.

## Declarations

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## References

- Aldrich, J.: R.A. Fisher and the making of maximum likelihood 1912–1922. Stat. Sci. **12**(3), 162–176 (1997)
- Asquith, W.: L-moments and TL-moments of the generalized lambda distribution. Comput. Stat. Data Anal. **51**(9), 4484–4496 (2007)
- Azzalini, A.: Statistical applications of the multivariate skew-normal distribution. J. Royal Stat. Soc., Series B **61**, 579–602 (1999)
- Böhning, D., Seidel, W.: Editorial: recent developments in mixture models. Comput. Stat. Data Anal. **41**, 349–357 (2003)
- Cramer, H.: Mathematical methods of statistics. Princeton University Press, N.J. (1963)
- Fisher, R.A.: On the mathematical foundations of theoretical statistics. Philos. Trans. Royal Soc. London A **222**, 309–368 (1922)
- Fraley, C., Raftery, A.: Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. **97**, 611–631 (2002)
- Fraley, C., Raftery, A.: Bayesian regularization for normal mixture estimation and model-based clustering. J. Class. **24**, 155–181 (2007)
- Freimer, M., Kollia, G., Mudholkar, G., Lin, C.: A study of the generalised Tukey lambda family. Commun. Stat.- Theory Methods **17**(10), 3547–3567 (1988)
- Greenwood, J.A., Landwehr, J.M., Matalas, N.C., Wallis, J.R.: Probability weighted moments: definitions and relation to parameters of several distributions expressible in inverse form. Water Resour. Res. **15**(6), 1049–1054 (1979)
- Hosking, J.R.M.: L-moments: analysis and estimation of distributions using linear combinations of order statistics. J. Royal Stat. Soc.: Series B (Methodological) **52**(1), 105–124 (1990)
- Hosking, J.R.M.: The use of L moments in the analysis of censored data. In: Balakrishnan, N. (ed.) Recent advances in life testing and reliability, pp. 546–560. CRC Press, Boca Raton (1995)
- Jeong, J.H.: A new parametric family for modelling cumulative incidence functions: application to breast cancer data. J. Royal Stat. Soc. Series A **169**(2), 289–303 (2006)
- Karian, Z., Dudewicz, E.: Fitting statistical distributions: the generalized lambda distribution and generalised bootstrap methods. Chapman and Hall, New York (2000)
- Karvanen, J., Nuutinen, A.: Characterizing the generalized lambda distribution by L-moments. Comput. Stat. Data Anal. **52**(4), 1971–1983 (2008)
- King, R., MacGillivray, H.: A starship estimation method for the generalised lambda distributions. Aust. N. Z. J. Stat. **41**(3), 353–374 (1999)
- Klein, J.P., Moeschberger, M.L.: Survival analysis. Techniques for censored and truncated data. Springer, New York (1997)
- Komárek, A.: A new R package for Bayesian estimation of multivariate normal mixtures allowing for selection of the number of components and interval-censored data. Comput. Stat. Data Anal. **53**(12), 3932–3947 (2009)
- Lawless, J.F.: Statistical models and methods for lifetime data. Wiley, New York (1981)
- Lo, S.N., Heritier, S., Hudson, M.: Saddlepoint approximation for semi-Markov processes with application to a cardiovascular randomized study. Comput. Stat. Data Anal. **53**(3), 683–698 (2009)
- Marubini, E., Valsecchi, E.G.: Analysing survival data from clinical trials and observational studies. Wiley, New York (1995)
- McLachlan, G., Peel, D.: Finite mixture models. Wiley, New York (2000)
- Mudholkar, G.S., Srivastava, D.K.: Exponentiated Weibull family for analyzing bathtub failure-rate data. IEEE Trans. Reliab. **42**(2), 299–302 (1993)
- Mudholkar, G.S., Srivastava, D.K., Freimer, M.: The Exponentiated Weibull Family: a reanalysis of the bus-motor-failure data. Technometrics **37**(4), 436–445 (1995)
- Patti, S., Biganzoli, E., Boracchi, P.: Review of maximum likelihood functions for right censored data. A new elementary derivation. (2007)
- Putter, H., Fiocco, M., Geskus, R.: Tutorial in biostatistics: competing risks and multi-state models. Stat. Med. **26**, 2389–2430 (2007)
- Ramberg, J., Schmeiser, B.: An approximate method for generating asymmetric random variables. Commun. Assoc. Comput Mach. **17**, 78–82 (1974)
- Ramberg, J., Tadikamalla, P., Dudewicz, E., Mykytka, E.: A probability distribution and its uses in fitting the data. Technometrics **21**, 201–214 (1979)
- Singh, U., Gupta, P.K., Upadhyay, S.K.: Estimation of parameters for exponentiated-Weibull family under type-II censoring scheme. Comput. Stat. Data Anal. **48**(3), 509–523 (2005)
- Su, S.: A discretized approach to flexibly fit generalized lambda distributions to data. J. Mod. Appl. Stat. Methods **4**, 408–424 (2005)
- Su, S.: Fitting single and mixture of generalized lambda distributions to data via discretized and maximum likelihood methods: GLDEX in R. J. Stat. Softw. **21**(9), 1–17 (2007)
- Su, S.: Numerical maximum log likelihood estimation for generalized lambda distributions. Comput. Stat. Data Anal. **51**(8), 3983–3998 (2007b)
- Su, S.: GLDEX: Fitting Single and Mixture of Generalized Lambda Distributions (RS and FMKL) Using Discretized and Maximum Likelihood Methods. CRAN. Available at: https://cran.r-project.org/web/packages/GLDEX/index.html (2007)
- Su, S.: Confidence intervals for quantiles using generalized lambda distributions. Comput. Stat. Data Anal. **53**(9), 3324–3333 (2009)
- Su, S.: Chapter 14: fitting GLD to data via quantile matching method. In: Karian, Z., Dudewicz, E. (eds.) Handbook of distribution fitting methods with R, pp. 557–583. CRC Press/Taylor & Francis, Boca Raton (2010)
- Su, S.: Chapter 15: fitting GLD to data using the GLDEX 1.0.4 in R. In: Karian, Z., Dudewicz, E. (eds.) Handbook of distribution fitting methods with R, pp. 585–608. CRC Press/Taylor & Francis, Boca Raton (2010)
- Su, S.: Flexible parametric quantile regression model. Stat Comput. **25**(3), 635–650 (2015)
- Wahed, A.S., Luong, T.M., Jeong, J.H.: A new generalization of Weibull distribution with application to a breast cancer data set. Stat. Med. **28**(16), 2077–94 (2009)
- Wang, D., Hutson, A.D., Miecznikowski, J.C.: L-moment estimation for parametric survival models given censored data. Stat. Methodol. **7**(6), 655–667 (2010)