Efficient and adaptive rank-based fits for linear models with skew-normal errors
- Joseph W McKean^{1} and
- John D Kloke^{2}
https://doi.org/10.1186/s40488-014-0018-0
© McKean and Kloke; licensee Springer. 2014
Received: 10 February 2014
Accepted: 21 July 2014
Published: 22 October 2014
Abstract
The rank-based fit of a linear model is based on minimizing a norm. A score function needs to be selected for the fit, and the proper choice leads to asymptotically efficient regression estimators, i.e., fits equivalent to the maximum likelihood estimators (mle). In this paper, we present the family of optimal score functions for the skew-normal family of distributions. We show how easily this rank-based fit is computed using the R package Rfit. We present the results of a small simulation study comparing the rank-based estimators and the mles in terms of efficiency and validity over skew-normal and contaminated normal distributions. We also develop and present empirical results for a Hogg-type adaptive procedure for selecting among a family of these scores based on a robust initial fit.
Keywords
Linear models; Monte Carlo; Nonparametrics; Regression rank scores; Robust; Wilcoxon procedures
Introduction
Rank-based fitting of linear models offers an attractive alternative to least squares (LS) and maximum likelihood (mle) fitting. The geometry of the rank-based fit is similar to that of LS. Simply replace the Euclidean squared norm used in the LS fit with another norm which, unlike LS, results in a robust fit. The accompanying robust analysis is the analogue of the LS ANOVA and ANCOVA. The rank-based analysis offers a complete analysis including robust diagnostics to check quality of fit. This rank-based analysis has recently been extended to mixed and nonlinear models; see Kloke et al. (2009) and Abebe and McKean (2013), respectively. A full development of the rank-based analysis can be found in Chapters 3-5 of the monograph by Hettmansperger and McKean (2011). The rank-based analysis is, generally, highly efficient. It can easily be optimized depending on the information available concerning the distribution of the random errors. For example, if the form of the error distribution is known, then an appropriate rank-based procedure can be selected to attain full efficiency.
In this paper, we discuss rank-based analyses which are appropriate for the skew-normal (SN) family of distributions. This is a rich family of skewed distributions developed by Azzalini (1985). The skewness of a distribution in this family is controlled by a shape parameter α, −∞<α<∞. Distributions are left or right skewed, depending on whether α<0 or α>0, respectively. If α=0 then the distribution is normal. As we discuss in Section 3, all of these distributions are light-tailed. Such families of distributions occur frequently in accelerated failure time (AFT) models. The response of interest in such models is survival time, and its log is frequently modeled in terms of a log linear model. The random errors for these log linear models are, generally, skewed. For instance, the family of distributions for log linear models when survival time follows an F-distribution contains a wide variety of skewed distributions with tail weights that range from moderate to heavy; see McKean and Sievers (1989). Hence, the skew-normal family adds a rich class of light-tailed skewed distributions which also includes the normal distribution.
In Section 2, we outline the rank-based analysis for a general linear model. The computation of these analyses is easily handled by the R package Rfit, developed by Kloke and McKean (2013), which can be freely downloaded from CRAN (http://cran.us.r-project.org/). We discuss this software for an example in this section and we continue such discussion in the remainder of the article. The data for the example and the R code, supplemental to Rfit, used in this article are available to the reader at http://www.stat.wmich.edu/mckean/SN/.
In Section 3, we develop rank-based analyses for the skew-normal family. These analyses are efficient for this family and the appropriate analysis is fully efficient. As we discuss, these analyses are technically robust similar to the optimal rank-based analysis for normally distributed errors. In contrast, in a sensitivity analysis, we show that the maximum likelihood fit (mle) is not robust. In Section 4, we present the results of a Monte Carlo study which verify the robustness and validity of the rank-based analysis over a family of SN distributions and contaminated SN distributions. These studies confirm the nonrobustness of the mle analysis.
The rank-based analysis depends on the shape parameter α. One outcome of the Monte Carlo study is that rank-based analyses based on shape parameters in a neighborhood of the correct α had very similar behavior to that using the correct α. This suggests that a simple Hogg-type adaptive procedure would have excellent properties in this situation. In Section 5, we develop such an adaptive scheme for the family of SN distributions. In a simulation study, we verify the efficiency and validity of this scheme over SN situations and, further, over two contaminated situations.
Notation and rank-based analysis
Let Y denote an n×1 vector of responses and consider the linear model
$$Y={1}_{n}{\beta}_{0}+X\beta +e,\qquad (1)$$
where 1_{n} is a vector of n ones; X is an n×p design matrix which may contain predictors (covariates) as well as indicator (dummy) variables; β_{0} is an intercept parameter; β is a p×1 vector of regression parameters; and e is an n×1 vector of random errors. Because we have an intercept parameter in the model, we assume without loss of generality that the design matrix X is centered (all columns of X have mean 0). For the theory discussed below, assume that the components of e are iid with pdf f(x) and cdf F(x), where F(x) is unknown.
where $\| v \|_{2}^{2}=\sum_{i=1}^{n} v_{i}^{2}$ is the squared Euclidean norm.
These estimators were proposed by Jaeckel (1972) and Jurečková (1971). An associated rank-based analysis, including diagnostic procedures, is discussed in Chapters 3-5 of the monograph by Hettmansperger and McKean (2011). A score function needs to be selected. Often the Wilcoxon (linear) score function is used, $\phi(u)=\sqrt{12}\,[u-(1/2)]$. When Wilcoxon scores are used, we refer to the subsequent fit and analysis as the Wilcoxon analysis. Another frequent choice is the sign score function, φ(u)=sgn[u−(1/2)], which yields the l_{1}-fit. Score functions are discussed in terms of optimality in Section 2.2.
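As a rough illustration of the fit just described, the following Python sketch (not the Rfit implementation) evaluates Jaeckel's dispersion function, $\sum_i a(R(e_i))\,e_i$, and minimizes it for a single slope. The score convention a(i)=φ[i/(n+1)] is the usual one and is assumed here; the function names, the golden-section search, and the toy data are illustrative choices only. Because the scores sum to zero, the dispersion does not depend on the intercept, which is estimated separately.

```python
import math

def wilcoxon_score(u):
    """Wilcoxon (linear) score: sqrt(12) * (u - 1/2)."""
    return math.sqrt(12.0) * (u - 0.5)

def sign_score(u):
    """Sign score: sgn(u - 1/2)."""
    return float((u > 0.5) - (u < 0.5))

def ranks(values):
    """Ranks 1..n; exact ties are broken by position (a simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def dispersion(residuals, score=wilcoxon_score):
    """Jaeckel's dispersion sum_i a(R(e_i)) * e_i with a(i) = score(i / (n + 1))."""
    n = len(residuals)
    return sum(score(r / (n + 1)) * e for r, e in zip(ranks(residuals), residuals))

def fit_slope(x, y, score=wilcoxon_score, tol=1e-8):
    """Slope minimizing the dispersion of y - b*x (golden-section search)."""
    n = len(x)
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(n) for j in range(i + 1, n) if x[j] != x[i]]
    lo, hi = min(slopes), max(slopes)   # minimizer lies within the range of pairwise slopes
    ratio = (math.sqrt(5.0) - 1.0) / 2.0

    def objective(b):
        return dispersion([yi - b * xi for xi, yi in zip(x, y)], score)

    while hi - lo > tol:                # the objective is convex in b
        left = hi - ratio * (hi - lo)
        right = lo + ratio * (hi - lo)
        if objective(left) < objective(right):
            hi = right
        else:
            lo = left
    return 0.5 * (lo + hi)

# Toy data (hypothetical): exact line y = 1 + 2x with one gross outlier in the response.
x = list(range(10))
y = [1.0 + 2.0 * xi for xi in x]
y[9] = 100.0
slope = fit_slope(x, y)
```

In this toy data set the Wilcoxon slope stays at 2 despite the outlier, which illustrates the robustness in the response space that motivates replacing the Euclidean norm.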
2.1 Theory
Based on this influence function it is clear that $\widehat{\beta}_{\phi}$ is robust in Y-space if the score function φ(u) is bounded. Note, though, that $\widehat{\beta}_{\phi}$ is not robust in X-space. A weighted version of the Wilcoxon estimator, called the HBR (high breakdown rank-based) estimator, achieves 50% breakdown in both the X-space and the Y-space; see Chang et al. (1999).
Note that the only difference between the theory for the LS and rank-based estimators is that σ^{2} is replaced by $\tau_{\phi}^{2}$. Hence, the asymptotic relative efficiency (ARE) between the LS and rank-based estimator is $\sigma^{2}/\tau_{\phi}^{2}$. For Wilcoxon scores, assuming that the random errors have a normal distribution, this ARE is the familiar 0.955; that is, for normal errors, using the Wilcoxon analysis instead of the LS analysis results in a loss of efficiency of only about 5%. Koul et al. (1987) developed an estimator of τ_{φ}, $\widehat{\tau}_{\phi}$, which is computed by Rfit.
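The value 0.955 can be checked numerically. The sketch below assumes the standard expression for the Wilcoxon scale parameter, $\tau_W = 1/(\sqrt{12}\int f^{2}(x)\,dx)$, which is a known result but is not displayed in the text above; the helper names and integration limits are illustrative.

```python
import math
from statistics import NormalDist

def integral_f_squared(pdf, lower=-12.0, upper=12.0, steps=4000):
    """Composite Simpson approximation of the integral of pdf(x)**2."""
    h = (upper - lower) / steps
    total = pdf(lower) ** 2 + pdf(upper) ** 2
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(lower + k * h) ** 2
    return total * h / 3.0

def tau_wilcoxon(pdf):
    """Wilcoxon scale parameter: 1 / (sqrt(12) * integral of f^2)."""
    return 1.0 / (math.sqrt(12.0) * integral_f_squared(pdf))

std_normal = NormalDist(0.0, 1.0)
sigma2 = std_normal.variance                 # variance of the error distribution
tau2 = tau_wilcoxon(std_normal.pdf) ** 2
are_wilcoxon_vs_ls = sigma2 / tau2           # equals 3 / pi for normal errors
print(round(are_wilcoxon_vs_ls, 3))          # 0.955
```

For normal errors the ratio reduces to 3/π, which is the source of the familiar figure.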
where t_{α/2,n−p−1} denotes the upper α/2 critical value of a t-distribution with n−p−1 degrees of freedom.
An asymptotic level α test is to reject H_{0} in favor of H_{ A }, if F_{ φ }≥F_{ α }(q,n−p−1), where F_{ α }(q,n−p−1) denotes the upper α-critical value of an F-distribution with q and n−p−1 degrees of freedom.
see page 243 of Hettmansperger and McKean (2011). We refer to R_{2} as a robust coefficient of determination in subsequent examples.
2.2 Optimal scores
where ρ is a correlation coefficient and $I(f)$ is the Fisher information. Therefore, minimizing τ_{φ} is equivalent to maximizing the above identity. By the last equality, this is accomplished by making ρ=1; i.e., by taking φ(u) to be φ_{f}(u). So expression (8) is the score function which optimizes the rank-based analysis. Since $\widehat{\beta}_{\phi}$ is location and scale equivariant, only the form of f(x) is needed. Furthermore, since in this case $\tau_{\phi}=1/\sqrt{I(f)}$, the rank-based estimator $\widehat{\beta}_{\phi}$ is asymptotically fully efficient, i.e., $\widehat{\beta}_{\phi}$ has the same asymptotic distribution as the maximum likelihood estimator (mle).
For example, if the error distribution is normal, then the optimal score function simplifies to φ(u)=Φ^{−1}(u), the normal scores. If the error distribution is logistic, then the linear Wilcoxon scores are obtained, while double exponential (Laplace) distributed errors produce the sign scores.
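These three special cases can be verified numerically. The sketch below assumes the usual form of the optimal score function, $\phi_f(u) = -f'(F^{-1}(u))/f(F^{-1}(u))$, and approximates the derivative of log f by a central difference; all function names are illustrative. Scores are determined only up to scale, so the logistic case returns 2u−1, which is proportional to the standardized Wilcoxon score $\sqrt{12}\,[u-(1/2)]$.

```python
import math
from statistics import NormalDist

def optimal_score(u, logpdf, quantile, h=1e-5):
    """-f'(x)/f(x) at x = F^{-1}(u), via a central difference of log f."""
    x = quantile(u)
    return -(logpdf(x + h) - logpdf(x - h)) / (2.0 * h)

# Standard normal: the optimal score is the normal quantile itself.
def normal_logpdf(x):
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

normal_quantile = NormalDist().inv_cdf

# Standard logistic: the optimal score is 2u - 1 (Wilcoxon up to scale).
def logistic_logpdf(x):
    return -x - 2.0 * math.log1p(math.exp(-x))

def logistic_quantile(u):
    return math.log(u / (1.0 - u))

# Standard Laplace: the optimal score is sgn(u - 1/2), the sign score.
def laplace_logpdf(x):
    return -abs(x) - math.log(2.0)

def laplace_quantile(u):
    return math.log(2.0 * u) if u < 0.5 else -math.log(2.0 * (1.0 - u))

normal_value = optimal_score(0.975, normal_logpdf, normal_quantile)
logistic_value = optimal_score(0.75, logistic_logpdf, logistic_quantile)
laplace_value = optimal_score(0.9, laplace_logpdf, laplace_quantile)
```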
2.3 Computation of the rank-based analysis
The computation of a rank-based analysis can be obtained by using the R package Rfit developed by Kloke and McKean (2012), which can be downloaded at CRAN. Like R, Rfit is freeware and can run on all platforms (Windows, Linux, and Mac). As we discuss in Section 3, it is easy to install new scores in Rfit based on a general score function. For now, we illustrate the computation of Rfit for a Wilcoxon analysis in the following example.
Example 2.1
(Linear Model with Skew-Normal Errors). We use a simulated data set based on the model y=β_{0}+β_{1}x_{1}+β_{2}x_{2}+β_{3}x_{3}+e, where x_{1}=1,⋯,50; x_{2} and x_{3} are variates from a standard normal distribution; and the random errors are generated from a standard skew-normal distribution with shape parameter α=−8, as discussed in Section 3. We set β_{1}=0.01, β_{2}=0.15, and β_{3}=0.0. The sample size is n=50. The data set can be downloaded at the url cited in Section 1. The analysis assumes that the R vectors y, x1, x2, and x3 contain, respectively, the responses and the values of x_{1}, x_{2}, and x_{3}. For this example, R code based on the package Rfit (available at the same url) computes the Wilcoxon fit of the model, prints out the table of coefficients, and saves the Studentized residuals in the vector studw.
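For readers who wish to generate data of this form themselves, the following Python sketch (not the authors' R code, and not the data set of the example) simulates from the stated design. It uses the well-known representation of a skew-normal variate as δ|U_0|+(1−δ^2)^{1/2}U_1 with δ=α/(1+α^2)^{1/2} and U_0, U_1 independent standard normals. The intercept β_{0} is not stated in the text; the value 0 used below, the seed, and the function names are assumptions.

```python
import math
import random

def rskewnormal(alpha, rng):
    """One SN(alpha) variate via delta*|U0| + sqrt(1 - delta^2)*U1."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    u0 = rng.gauss(0.0, 1.0)
    u1 = rng.gauss(0.0, 1.0)
    return delta * abs(u0) + math.sqrt(1.0 - delta * delta) * u1

def simulate_example(n=50, beta0=0.0, beta=(0.01, 0.15, 0.0), alpha=-8.0, seed=1):
    """Simulate (y, x1, x2, x3) under the design of the example; beta0 = 0 is assumed."""
    rng = random.Random(seed)
    x1 = list(range(1, n + 1))
    x2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x3 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    errors = [rskewnormal(alpha, rng) for _ in range(n)]
    y = [beta0 + beta[0] * a + beta[1] * b + beta[2] * c + e
         for a, b, c, e in zip(x1, x2, x3, errors)]
    return y, x1, x2, x3

y, x1, x2, x3 = simulate_example()
```

With α=−8 the generated errors are strongly left skewed, with mean close to −0.79.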
Based on the summary table, the 95% confidence intervals for β_{ j }, j=1,2,3, (10), trap the true parameters. The overall F_{ φ } test that all the regression coefficients are 0 except for the intercept is significant, p=0.0257. The value of the robust coefficient of determination R_{2}, expression (14), is 18%.
As noted in Section 3, the skew-normal distribution chosen to generate the random variates in this example is left skewed. Hence, as expected, the residuals show longer left than right tails. These plots show that scores for left-skewed error distributions are more appropriate for this data than the Wilcoxon scores. There appears to be one large outlier in the left tail, also. ■
Skew-normal error distributions
The pdf of a standard skew-normal distribution is f(x;α)=2ϕ(x)Φ(αx), −∞<x<∞, where the parameter α satisfies −∞<α<∞ and ϕ(x) and Φ(x) are the pdf and cdf of a standard normal distribution, respectively. For this paper, if a random variable X has this pdf, we say that X has a standard skew-normal distribution with parameter α and write X∼SN(α). If α=0, then X has a standard normal distribution. Further, the distribution of X is left skewed if α<0 and right skewed if α>0. This family of distributions was introduced by Azzalini (1985), who discussed many of its properties.
In this paper, we are interested in linear models, (1), where the random errors may have skew-normal distributions. In this case, the random error can be written as e_{i}=bε_{i}, where ε_{i} has a standard skew-normal distribution and b is a scale parameter. The rank-based estimator $\widehat{\beta}_{\phi}$ and corresponding analysis are regression and scale equivariant, so there is no need to estimate the scale parameter b. The only scale parameter requiring estimation for standard errors is τ_{φ}. Likewise, for inference on the vector of parameters β there is no need to estimate the shape parameter α.
For all values of α, this score function is strictly increasing over the interval (0,1); see Azzalini (1985). As expected, for α=0, expression (17) simplifies to the normal scores. Due to the first term on the right side of expression (18), all the score functions in this family are unbounded, indicating that the skew-normal family of distributions is light-tailed. Thus the influence functions of the rank-based estimators based on scores in this family are unbounded in the Y-space and, hence, are not robust. This includes the normal scores, but Huber (1981) pointed out that normal scores are technically robust and, as our simulation studies show, the family of skew-normal scores seems also to be technically robust.
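To make the skew-normal scores concrete, the Python sketch below evaluates them numerically. It applies the optimal-score formula −f′/f to the skew-normal pdf 2ϕ(x)Φ(αx), which gives x−αϕ(αx)/Φ(αx) evaluated at x=F^{−1}(u;α). This form is our own derivation and is offered as a check rather than as a transcription of the displayed expressions. The quantile is obtained by Simpson integration and bisection, a simple stand-in for the `qsn` function of the R package sn; the function names, integration limits, and tolerances are illustrative.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    # erfc keeps accuracy in the lower tail
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def sn_pdf(x, alpha):
    """Standard skew-normal pdf 2 * phi(x) * Phi(alpha * x)."""
    return 2.0 * norm_pdf(x) * norm_cdf(alpha * x)

def sn_cdf(x, alpha, lower=-10.0, steps=2000):
    """Composite Simpson approximation of the skew-normal cdf."""
    if x <= lower:
        return 0.0
    h = (x - lower) / steps
    total = sn_pdf(lower, alpha) + sn_pdf(x, alpha)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * sn_pdf(lower + k * h, alpha)
    return total * h / 3.0

def sn_quantile(u, alpha, tol=1e-8):
    """Invert the cdf by bisection on [-10, 10]."""
    lo, hi = -10.0, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sn_cdf(mid, alpha) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sn_score(u, alpha):
    """Skew-normal score x - alpha * phi(alpha x) / Phi(alpha x), x = F^{-1}(u; alpha)."""
    x = sn_quantile(u, alpha)
    return x - alpha * norm_pdf(alpha * x) / norm_cdf(alpha * x)

# Scores for the left-skewed case alpha = -8 at three illustrative values of u.
scores_left_skew = [sn_score(u, -8.0) for u in (0.25, 0.5, 0.75)]
```

The leading term x is the unbounded one, and for α=0 the second term vanishes, so the function reduces to the normal scores Φ^{−1}(u).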
3.1 Computation of the rank-based analysis using skew-normal scores
The computation of the rank-based analysis can be obtained by using the R package Rfit. It is easy to install the family of skew-normal scores. Briefly, rank-based scores form a class in Rfit consisting of three parts: the score function, its derivative, and a vector of parameters used in the definition of the function. For the skew-normal scores, details are given in the appendix, but for the reader's convenience the necessary R code is contained in the R function skewns, which we have placed at the web site cited in the introduction.
Example 3.1
(Example 2.1, Continued). We now return to Example 2.1 and show the computation of the rank-based analysis of it based on the skew-normal scores with shape parameter α=−8. The first two lines of code define the skew-normal scores as salp and the third line sets the shape parameter. Details of this definition can be found in the appendix.
Note that the skew-normal analysis is much more precise than the Wilcoxon analysis of the last section. The empirical ARE is (τ_{ W }/τ_{α=−8})^{2}=2.78; i.e., for this data set, the skew-normal analysis is 2.8 times more efficient than the Wilcoxon analysis. Note, also, that the robust coefficient of determination, R_{2}, has increased from 18% to 28%.
3.2 Sensitivity analysis
Values of the sensitivity function for the mle at the given values of Δ
Δ | 0 | 20 | 40 | 60 | 80 | 100 | 1000 | 2000 |
---|---|---|---|---|---|---|---|---|
mle | 0.00 | −0.07 | −0.07 | −0.00 | 0.12 | 0.30 | −5.80 | −6.32 |
3.3 Range of practical α parameters
Parameters (mean μ, median $\tilde{\mu}$, variance σ^{2}, and coefficient of skewness ξ) for SN(α) distributions for the given values of α
α | 0.00 | 1.00 | 2.00 | 4.00 | 6.00 | 7.00 | 8.00 | 10.00 | 15.00 | 20.00 | ∞ |
---|---|---|---|---|---|---|---|---|---|---|---|
μ | 0.00 | 0.56 | 0.71 | 0.77 | 0.79 | 0.79 | 0.79 | 0.79 | 0.80 | 0.80 | 0.80 |
$\tilde{\mu}$ | 0.00 | 0.55 | 0.66 | 0.67 | 0.67 | 0.67 | 0.67 | 0.67 | 0.67 | 0.67 | 0.67 |
σ ^{2} | 1.00 | 0.68 | 0.49 | 0.40 | 0.38 | 0.38 | 0.37 | 0.37 | 0.37 | 0.36 | 0.36 |
ξ | 0.00 | 0.14 | 0.45 | 0.78 | 0.89 | 0.92 | 0.93 | 0.96 | 0.98 | 0.98 | 0.99 |
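The mean, variance, and skewness rows of this table follow from closed-form expressions for the skew-normal family. The Python sketch below uses the standard formulas with δ=α/(1+α^2)^{1/2}: mean δ(2/π)^{1/2}, variance 1−2δ^2/π, and skewness ((4−π)/2)·mean^3/variance^{3/2}. The median has no comparably simple closed form and is omitted; the function name is illustrative.

```python
import math

def sn_moments(alpha):
    """Mean, variance, and coefficient of skewness of SN(alpha)."""
    if math.isinf(alpha):
        delta = math.copysign(1.0, alpha)      # half-normal limit
    else:
        delta = alpha / math.sqrt(1.0 + alpha * alpha)
    mean = delta * math.sqrt(2.0 / math.pi)
    variance = 1.0 - 2.0 * delta * delta / math.pi
    skewness = 0.5 * (4.0 - math.pi) * mean ** 3 / variance ** 1.5
    return mean, variance, skewness

for alpha in (0.0, 1.0, 2.0, 4.0, 8.0, 20.0):
    mean, variance, skewness = sn_moments(alpha)
    print(f"alpha={alpha:5.1f}  mean={mean:.2f}  var={variance:.2f}  skew={skewness:.2f}")
```

The values change very little once α exceeds about 8, which is consistent with the table.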
Because we are interested in linear models, there is another practical reason for this range of α values. Note that the support of a skew-normal distribution is (−∞,∞), making it ideal for error distributions for regression models. On the other hand, the support of a half-normal distribution is (0,∞), which is generally the support of a survival distribution. Often, the logs of such variables are modeled as accelerated failure time (AFT) models, as briefly discussed in Section 1.
Monte Carlo study
where W_{ i } has a skew-normal distribution with shape parameter α=5, V_{ i } has a $N({\mu}_{c}=10,{\sigma}_{c}^{2}=36)$ distribution, I_{ε,i} has a binomial (1,ε=0.15) distribution, and W_{ i },V_{ i }, and I_{ε,i} are all independent. Hence, this contaminated distribution is skewed with heavy right tails. The design is slightly unbalanced with n_{1}=45 and n_{2}=55. Without loss of generality β,θ, and β_{0} were set to 0.
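A contaminated error of this kind can be generated as in the Python sketch below. The mixture form e_i=(1−I_{ε,i})W_i+I_{ε,i}V_i is the usual contamination model and is assumed here to match the description; the seed, sample size, and function names are illustrative.

```python
import math
import random

def rskewnormal(alpha, rng):
    """One SN(alpha) variate via delta*|U0| + sqrt(1 - delta^2)*U1."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    return (delta * abs(rng.gauss(0.0, 1.0))
            + math.sqrt(1.0 - delta * delta) * rng.gauss(0.0, 1.0))

def rcontaminated(rng, alpha=5.0, eps=0.15, mu_c=10.0, sigma_c=6.0):
    """SN(alpha) with probability 1 - eps, N(mu_c, sigma_c^2) with probability eps."""
    if rng.random() < eps:
        return rng.gauss(mu_c, sigma_c)      # contaminating component V
    return rskewnormal(alpha, rng)           # skew-normal component W

rng = random.Random(2014)
errors = [rcontaminated(rng) for _ in range(20000)]
sample_mean = sum(errors) / len(errors)
```

Under these settings the theoretical mean is 0.85·0.782+0.15·10, about 2.17, so the contamination pulls the right tail out considerably.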
For the rank-based procedures, we selected the procedure based on the score function φ_{5}(u), (18), which is optimal for a skew-normal distribution with α=5, and then three on each side of the optimal, i.e., procedures based on the score functions φ_{α}(u) with α=2,3,4,6,7, and 8. With the discussion in Section 3.3 in mind, we also selected the rank-based procedure with α=10. The rank-based Wilcoxon, least squares (LS), and mle procedures complete the methods investigated. First, we present the empirical AREs, which for each estimator is the ratio of the empirical mean-square error (MSE) of the mle to the empirical MSE of the estimator; hence, values of this ratio less than 1 are favorable to the mle while values greater than 1 are favorable to the estimator. Secondly, we present the empirical confidence coefficients of confidence intervals with nominal confidence 0.95. For all the procedures, we chose asymptotic confidence intervals of the form $\widehat{\beta}\pm 1.96\,\mathrm{SE}(\widehat{\beta})$. We used a simulation size of 10,000.
Summary of results of the simulation study of the rank-based procedures and the mle procedure for the skew-normal distribution with shape α=5 and for a contaminated skew-normal distribution (SN: skew-normal errors; Cont.: contaminated errors)
Proced. | SN β ARE | SN β Conf. | SN θ ARE | SN θ Conf. | Cont. β ARE | Cont. β Conf. | Cont. θ ARE | Cont. θ Conf. |
---|---|---|---|---|---|---|---|---|
rb α=2 | 1.02 | 0.96 | 1.04 | 0.96 | 6.61 | 0.98 | 10.84 | 0.98 |
rb α=3 | 1.09 | 0.96 | 1.11 | 0.96 | 7.43 | 0.97 | 12.24 | 0.98 |
rb α=4 | 1.13 | 0.96 | 1.15 | 0.96 | 7.79 | 0.97 | 12.91 | 0.98 |
rb α=5 | 1.14 | 0.96 | 1.16 | 0.96 | 7.85 | 0.96 | 13.10 | 0.97 |
rb α=6 | 1.13 | 0.95 | 1.16 | 0.96 | 7.73 | 0.96 | 13.02 | 0.97 |
rb α=7 | 1.11 | 0.95 | 1.14 | 0.95 | 7.49 | 0.95 | 12.72 | 0.97 |
rb α=8 | 1.09 | 0.95 | 1.12 | 0.95 | 7.17 | 0.95 | 12.30 | 0.96 |
rb α=10 | 1.04 | 0.94 | 1.07 | 0.94 | 6.46 | 0.94 | 11.22 | 0.95 |
rb Wil. | 0.78 | 0.95 | 0.79 | 0.95 | 4.70 | 0.96 | 7.56 | 0.97 |
LS | 0.70 | 0.95 | 0.71 | 0.95 | 0.20 | 0.95 | 0.31 | 0.95 |
mle | 1.00 | 0.93 | 1.00 | 0.93 | 1.00 | 0.96 | 1.00 | 0.99 |
For the contaminated error distribution, the rank-based estimators are much more efficient than the mle procedure. Further, the estimator with scores based on α=5 is still empirically the most efficient in the study. It has empirical efficiency of 785% relative to the mle for β and 1310% for θ. Even the Wilcoxon procedure has an empirical efficiency of 756% relative to the mle for θ. On the basis of the empirical confidences, for both parameters, the procedures appear to range from slightly to moderately conservative. Least squares performed extremely poorly in the contaminated part of the study. All the rank-based procedures based on skew-normal scores display technical robustness in this study.
Hogg-Type adaptive procedure
In Section 3, we discussed the rank-based method based on the optimal score function for a specified shape parameter α. Asymptotically, it is as efficient as the mle and, at least for the situations covered in the simulation study, the rank-based estimator appears to be more efficient than the mle for finite samples. In practice, though, the true shape parameter is not known. One could obtain the mle of α and use that score function. The mle, however, is not robust. Reconsidering the empirical study, note from Table 3 that the rank-based estimates close to the optimal rank-based estimator were more efficient than the mle and most had efficiencies that were quite close to that of the optimal. That is, in selecting a score function, perhaps close would suffice. In this section, we consider a Hogg-type adaptive scheme which has this as its goal.
Hogg et al. (1975) proposed an adaptive procedure for tests of the difference in locations for the two-sample problem. The null hypothesis is that the two population distributions are the same. The selection of the test is based on a pair of selector statistics that measure, respectively, skewness and tail weight of the underlying error distribution. These selector statistics are functions of the order statistics of the combined samples. Several distribution-free rank tests of significance level δ comprise the tests. Under the null hypothesis, it follows from the sufficiency and completeness of the combined order statistics and the distribution-freeness of the rank test statistics that the selected test maintains the level δ. See, also, the discussion in Chapter 10 of Hogg et al. (2013).
This is fine for simple location tests where we have distribution-free rank tests, but in our case we are fitting a linear model and, hence, the adaptation must be based on the residuals from an initial fit. Thus the above-mentioned sufficiency result is not true for our fitting case. Shomrani (2003) developed a Hogg-type adaptive scheme for fitting a linear model based on an initial fit. In Shomrani's scheme, the selector statistics are functions of the residuals from the initial fit. While the significance level is no longer maintained, based on the results of a large simulation study, the scheme's empirical levels were generally close to the nominal value. In Chapter 6 of Kloke and McKean (2014), R software is developed for this scheme.
The adaptive scheme of Shomrani (2003) was formed for a wide range of error distributions: from left to right skewed and from light to heavy tailed distributions. We refine this scheme for the skew-normal family of distributions. As discussed above, there are two selector statistics involved. One, Q_{1}, selects based on skewness, while the other, Q_{2}, selects based on tail thickness. In a preliminary study over the skew-normal family, tail thickness did not seem to be a paramount issue, so we focus on Q_{1} alone.
where ${\overline{U}}_{.05}$, ${\overline{M}}_{.5}$, and ${\overline{L}}_{.05}$ are the averages of the largest 5% of the V_{ i }’s, the middle 50% of the V_{ i }’s, and the smallest 5% of the V_{ i }’s, respectively. Large values of Q_{1} indicate that the right tails of the sample are longer than the left tails; i.e., indicating an underlying right skewed distribution. Likewise, small values of Q_{1} indicate left-skewness. Note that as left (right) skew increases, Q_{1} is likely to decrease (increase). The statistic Q_{1} is not robust. One scheme under current investigation is to replace the means by medians. Keep in mind, though, that a robust diagnostic analysis is available for the initial robust fit. Hence in practice, outliers are easily flagged and modifications to the adaptive scheme can be made.
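The selector statistic can be computed as in the following Python sketch, which assumes the usual Hogg form $Q_1 = (\overline{U}_{.05}-\overline{M}_{.5})/(\overline{M}_{.5}-\overline{L}_{.05})$ built from the three averages just described. The rounding used to decide how many observations fall in each 5% tail and in the middle 50% is a simplification of our own, and the function name is illustrative.

```python
import math

def q1_statistic(values):
    """Skewness selector (U_.05 - M_.5) / (M_.5 - L_.05) from tail and middle averages."""
    ordered = sorted(values)
    n = len(ordered)
    k = max(1, round(0.05 * n))            # observations in each 5% tail (simplified)
    quarter = round(0.25 * n)              # trimmed from each end to get the middle 50%
    lower = sum(ordered[:k]) / k
    upper = sum(ordered[-k:]) / k
    middle_values = ordered[quarter:n - quarter]
    middle = sum(middle_values) / len(middle_values)
    return (upper - middle) / (middle - lower)

symmetric_sample = list(range(-50, 51))                    # symmetric about 0
right_skewed_sample = [math.exp(k / 10.0) for k in range(100)]
q1_symmetric = q1_statistic(symmetric_sample)              # equals 1 for a symmetric sample
q1_right = q1_statistic(right_skewed_sample)               # greater than 1
```

A symmetric sample gives a value of 1, and values above (below) 1 point to right (left) skewness.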
5.1 Adaptive scheme for skew normals
Our adaptive scheme consists of the 7 optimal score functions for skew-normal distributions with α=−12,−8,−4,0,4,8, and 12. So there are three scores each for left and right skewed distributions along with the normal scores. The scheme utilizes residuals from an initial Wilcoxon fit. We chose the Wilcoxon because it is robust. Also, it is optimal for a symmetric distribution (logistic) and, hence, less likely to bias selection for left or right skewness.
Simulated sample median of the distribution of Q_{1} drawn from the skew-normal distribution with shape parameter α
α | -10 | -6 | -2 | 2 | 6 | 10 |
---|---|---|---|---|---|---|
Median Q_{1} | 0.44 | 0.49 | 0.73 | 1.37 | 2.05 | 2.26 |
1. Fit using Wilcoxon scores ⇒ obtain residuals $\hat{e}_{W}$.
2. Compute $Q_{1}(\hat{e}_{W})$ and then select φ_{α} using expression (24), based on the estimated medians of Q_{1}.
3. Fit with the selected score φ_{α}.
4. Base inference on the fit of Step 3.
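The selection in Step 2 can be sketched in Python as follows. The sketch assumes that the simulated medians of Q_{1} at α=−10,−6,−2,2,6,10, which lie midway between the seven shape parameters of the scheme, serve as the cutoffs; this is our reading of the text, and the treatment of values exactly equal to a cutoff is an assumption. The names are illustrative.

```python
from bisect import bisect_left

# Shape parameters of the seven score functions in the scheme.
ALPHAS = (-12, -8, -4, 0, 4, 8, 12)
# Assumed cutoffs: simulated medians of Q1 at alpha = -10, -6, -2, 2, 6, 10.
CUTOFFS = (0.44, 0.49, 0.73, 1.37, 2.05, 2.26)

def select_alpha(q1):
    """Return the shape parameter whose cutoff interval contains q1."""
    return ALPHAS[bisect_left(CUTOFFS, q1)]

choices = [select_alpha(q) for q in (0.30, 0.60, 1.00, 2.10, 3.00)]
print(choices)   # [-12, -4, 0, 8, 12]
```

A residual-based value of Q_{1} near 1 therefore leads to the normal scores, while very small or very large values lead to the most extreme left-skewed or right-skewed scores.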
We next try the scheme on the data of Example 2.1.
Example 5.1
For the data of Example 2.1, the adaptive scheme chose the score function with α =−12
α | β_{1} | β_{2} | β_{3} | ${\widehat{\tau}}_{\phi}$ | Robust R_{2} |
---|---|---|---|---|---|
α=−12 (Adaptive Choice) | 0.0035 | 0.2190 | 0.0368 | 0.3675 | 0.2743 |
α=−10 | 0.0030 | 0.2095 | 0.0480 | 0.3699 | 0.2732 |
α=−9 | 0.0037 | 0.1923 | 0.0514 | 0.3531 | 0.2828 |
α=−8 | 0.0042 | 0.1831 | 0.0544 | 0.3484 | 0.2865 |
α=−7 | 0.0043 | 0.1822 | 0.0537 | 0.3533 | 0.2849 |
α=−6 | 0.0044 | 0.1808 | 0.0540 | 0.3490 | 0.2884 |
α=−5 | 0.0044 | 0.1805 | 0.0531 | 0.3580 | 0.2838 |
α=−4 | 0.0043 | 0.1854 | 0.0449 | 0.3886 | 0.2668 |
The values of the regression coefficients are given along with the robust coefficients of determination R_{2} and the estimates of τ_{φ}. The fits are quite similar. Notice that, in terms of precision as measured by $\widehat{\tau}_{\phi}$, the fit with α=−8 is the most precise. ■
5.2 Simulation study
where S_{i}∼SN(α), C_{i}∼N(10,6^{2}), I_{.15} is Bernoulli with probability of success 0.15, and S_{i}, C_{i}, I_{.15} are independent. Thus, Situation II is the same as the second situation of Section 4, i.e., right-skewed contamination. Situation III is the same as Situation II except that C_{i}∼N(0,6^{2}), i.e., symmetric contamination. For each situation, 10,000 simulations were used.
The methods considered are: our adaptive scheme (AdSch), least squares (LS), Wilcoxon (Wil), and maximum likelihood (mle). We also considered the procedure based on the correct α; i.e., the α which is selected for the distribution of the random errors. Note that this is not a statistical method and we label it as Optrv, “rv” for random variable. Even for Situation I, the distribution of its rank-based estimate depends on the multinomial random variable involved in the selection of the simulated distribution. We only include it to serve as a yardstick for the four statistical methods.
Empirical efficiencies and confidence coefficients for Situation I (error distributions are skew-normals)
Parameter, measure | AdSch | Optrv | LS | Wil | mle |
---|---|---|---|---|---|
β, ARE | 1.06 | 1.12 | 0.75 | 0.80 | 1.00 |
β, Conf | 0.95 | 0.95 | 0.95 | 0.95 | 0.93 |
θ, ARE | 1.05 | 1.11 | 0.73 | 0.79 | 1.00 |
θ, Conf | 0.95 | 0.95 | 0.95 | 0.95 | 0.94 |
For Situation I, the random errors have a skew-normal distribution with shape parameter α drawn from the set {−12,−11,…,12}, while the scheme selects scores from the set {−12,−8,−4,0,4,8,12}. These sets are different; hence, it does not make sense to consider when the scheme made the “correct” selection. We did keep track of how often the selection was within two units of the distribution simulated. For the 10,000 simulations of Situation I, the estimate of this proportion is 0.584. Note that for Situations II and III, the random errors have a contaminated skew-normal distribution. In particular, it is not a skew-normal distribution. So for Situations II and III, this proportion is irrelevant.
Empirical efficiencies and confidence coefficients for Situation II (error distributions are skewed contaminated skew-normals)
Parameter, measure | AdSch | Optrv | LS | Wil | mle |
---|---|---|---|---|---|
β, ARE | 3.58 | 1.86 | 0.92 | 7.27 | 1.00 |
β, Conf | 0.97 | 0.97 | 0.95 | 0.95 | 0.94 |
θ, ARE | 3.56 | 1.56 | 0.94 | 7.36 | 1.00 |
θ, Conf | 0.97 | 0.96 | 0.95 | 0.95 | 0.95 |
Empirical efficiencies and confidence coefficients for Situation III (error distributions are symmetrically contaminated skew-normals)
Parameter, measure | AdSch | Optrv | LS | Wil | mle |
---|---|---|---|---|---|
β, ARE | 3.67 | 1.01 | 0.26 | 6.16 | 1.00 |
β, Conf | 0.94 | 0.97 | 0.95 | 0.97 | 0.95 |
θ, ARE | 4.77 | 0.85 | 0.35 | 8.18 | 1.00 |
θ, Conf | 0.94 | 0.97 | 0.95 | 0.97 | 0.98 |
Conclusion
Rank-based analyses of linear models depend on the selection of a score function. In practice, often the Wilcoxon (linear) score function is chosen. These scores require no tuning constants and, further, the Wilcoxon rank-based analysis attains 95.5% efficiency relative to the traditional least squares (LS) analysis when the random error distribution is normal. However, rank-based analyses are easily optimized if there is knowledge of the distribution of the random errors of the linear model. For example, if the random errors are normally distributed, then selecting the normal scores for the rank-based analysis results in 100% efficiency (fully efficient) relative to the LS analysis.
In this paper, we have presented the rank-based analyses based on appropriate score functions for random errors having a distribution from the family of skew-normal distributions. In this case, the score function depends on the shape parameter α, −∞<α<∞. Of course, the rank-based analysis is fully efficient if the correct α is known. The rank-based analysis is a complete analysis, including fitting, inference (rank-based ANOVA), and robust diagnostic procedures. Based on the results of our Monte Carlo study, these rank-based analyses appear to be more efficient than the maximum likelihood (mle) analysis for the skew-normal distributions considered. The most efficient rank-based analysis is based on the optimal score function, but even those rank-based analyses with shape parameters within three units of the correct α were more efficient than the mle in these situations. They were much more efficient than the mle over situations where the error distribution was a contaminated skew-normal distribution. Based on empirical confidence levels, all the methods in the study were valid.
The good efficiency results for the rank-based analyses in a neighborhood of the true α suggest that a Hogg-type adaptive scheme would have high efficiency. In Section 5, we developed such a scheme for the skew-normal family of distributions based on an initial robust Wilcoxon fit. In the Monte Carlo studies we performed, this scheme was more efficient than the mle over the family of skew-normal distributions and was much more efficient than the mle over the contaminated skew-normal situations. Furthermore, for the situations covered, this adaptive scheme appears to be valid.
Kloke and McKean (2012) developed an R package Rfit for these rank-based analyses, which can be freely downloaded at CRAN. The default scores are the Wilcoxon scores, but, as we discuss in Section 3, it is easy to add classes of scores including the optimal scores for skew-normal distributions. The adaptive scheme of Section 5 is also easily coded using Rfit. The necessary code for the scores and the adaptive scheme can be found at the web site cited in Section 1. Hence, computation of these rank-based analyses is not a problem.
The rank-based analyses using skew-normal scores are robust in Y (response) space, but not in X (factor) space. The weighted Wilcoxon fit proposed by Chang et al. (1999) yields a robust rank-based analysis which possesses 50% breakdown in X (factor) space. We are now developing such an analysis for the skew-normal scores; see Abebe et al. (2014) for discussion. This analysis could also be part of an adaptive scheme.
Appendix A: R code for the class of skew-normal scores
To complete the class statement for the skew-normal scores we need only compute the quantiles F^{−1}(u;α). Azzalini (2013) developed the R package sn (available at CRAN) which computes the quantile function F^{−1}(u;α) and, also, the corresponding pdf and cdf. The command qsn(u,shape=alpha) returns F^{−1}(u;α), for 0<u<1. The package sn requires the package mnormt.
The following R code defines the class of skew-normal scores:
The next code segment obtains the data for a plot of the scores with shape parameter α=−7.
Declarations
Acknowledgement
We acknowledge the helpful comments of an associate editor and a referee on the original manuscript.
References
- Abebe A, McKean JW: Weighted Wilcoxon estimators in Nonlinear Regression. Aust. N. Z. J. Stat 2013, 55: 401–420. 10.1111/anzs.12046
- Abebe A, McKean JW, Kloke JD, Bilgic Y: Iterated Reweighted Rank-Based Estimates for GEE Models. Technical Report; 2014.
- Azzalini A: A class of distributions which includes the normal ones. Scand. J. Stat 1985, 12: 171–178.
- Azzalini A: R package sn: The skew-normal and skew-t distributions (version 0.4–18).
- Chang W, McKean J, Naranjo J, Sheather S: High-breakdown rank regression. J. Am. Stat. Assoc 1999, 94: 205–219. 10.1080/01621459.1999.10473836
- Hettmansperger TP, McKean JW: Robust Nonparametric Statistical Methods. Chapman-Hall, Boca Raton, FL; 2011.
- Hogg RV, Fisher DM, Randles RH: A two-sample adaptive distribution-free test. J. Am. Stat. Assoc 1975, 70: 656–661.
- Hogg RV, McKean JW, Craig AT: Introduction to Mathematical Statistics. Pearson, Boston; 2013.
- Huber PJ: Robust Statistics. John Wiley & Son; 1981.
- Jaeckel LA: Estimating regression coefficients by minimizing the dispersion of residuals. Ann. Math. Stat 1972, 43: 1449–1458. 10.1214/aoms/1177692377
- Jurečková J: Nonparametric estimate of regression coefficients. Ann. Math. Stat 1971, 42: 1328–1338. 10.1214/aoms/1177693245
- Kloke JD, McKean JW: Rfit: Rank-based estimation for linear models. R J 2012, 4: 57–64.
- McKean JW: Small sample properties of JR estimators. In JSM Proceedings. American Statistical Association, Alexandria, VA; 2013.
- Kloke JD, McKean JW: Nonparametric statistical methods using R. Chapman-Hall, Boca Raton, FL; 2014.
- Kloke JD, McKean JW, Rashid M: Rank-based estimation and associated inferences for linear models with cluster correlated errors. J. Am. Stat. Assoc 2009, 104: 384–390. 10.1198/jasa.2009.0116
- Koul HL, Sievers GL, McKean JW: An estimator of the scale parameter for the rank analysis of linear models under general score functions. Scand. J. Stat 1987, 14: 131–141.
- McKean J, Sheather S: Diagnostic procedures. Wiley Interdiscip. Rev.: Comput. Stat 2009, 1(2): 221–233. 10.1002/wics.12
- McKean J, Sievers G: Rank scores suitable for analysis of linear models under asymmetric error distributions. Technometrics 1989, 31: 207–218. 10.1080/00401706.1989.10488514
- Pourahmadi M: Construction of skew-normal random variables: Are they linear combinations of normals and half-normals. J. Stat. Theory Appl 2007, 3: 314–328.
- Shomrani A: A comparison of different schemes for selecting and estimating score functions based on residuals. Ph.D. thesis, Western Michigan University, Department of Statistics; 2003.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.