Density deconvolution for generalized skew-symmetric distributions

Potgieter, Cornelis J.

doi:10.1186/s40488-020-00103-y

Research
Open access
Published: 23 July 2020

Cornelis J. Potgieter ORCID: orcid.org/0000-0002-1995-6817^1,2

Journal of Statistical Distributions and Applications volume 7, Article number: 2 (2020)

Abstract

The density deconvolution problem is considered for random variables assumed to belong to the generalized skew-symmetric (GSS) family of distributions. The approach is semiparametric in that the symmetric component of the GSS distribution is assumed known, and the skewing function capturing deviation from the symmetric component is estimated using a deconvolution kernel approach. This requires the specification of a bandwidth parameter. The mean integrated square error (MISE) of the GSS deconvolution estimator is derived, and two bandwidth estimation methods based on approximating the MISE are also proposed. A generalized method of moments approach is also developed for estimation of the underlying GSS location and scale parameters. Simulation study results are presented, including a comparison of the GSS approach with the nonparametric deconvolution estimator. For most simulation settings considered, the GSS estimator is seen to have performance superior to the nonparametric estimator.

Introduction

The density deconvolution problem arises when it is of interest to estimate the probability density function (pdf) f_x(x) of a random variable X using observations contaminated by measurement error. Specifically, the observed sample consists of data W_j=X_j+U_j, j=1,…,n, where the X_j are independent and identically distributed (iid) random variables with pdf f_x(x) and the U_j are iid measurement error variables with pdf f_u(u). This paper presents a semiparametric approach for estimating f_x(x) that assumes X belongs to the class of generalized skew-symmetric (GSS) distributions. The GSS deconvolution model for X specifies a base symmetric distribution, providing the basic structure for the model. Thereafter, kernel methodology is used to estimate a skewing function that captures the deviation from the specified symmetric distribution. This semiparametric GSS approach attempts to capture the best of a parametric and a nonparametric solution and provides a very flexible approach for modeling f_x(x).

The problem of estimating f_x(x) from a contaminated sample W₁,…,W_n was first considered by Carroll and Hall (1988) and Stefanski and Carroll (1990), who proposed a fully nonparametric solution under the assumption of a fully known measurement error distribution f_u(u). Since then, much work on the topic has followed. Fan (1991a) and Fan (1991b) considered the theoretical properties of the density deconvolution estimator, and Fan and Truong (1993) extended the methodology to nonparametric regression. Diggle and Hall (1993) and Neumann and Hössjer (1997) considered the case of the measurement error distribution being unknown, and assumed that an external sample of error data was available to estimate the measurement error distribution. Delaigle et al. (2008) considered how replicate data can be used to estimate the characteristic function of the measurement error. The nonparametric estimator requires the selection of a bandwidth parameter. The two-stage plug-in bandwidth of Delaigle and Gijbels (2002) has become the gold standard in application; Delaigle and Gijbels (2004) provides an overview of several popular bandwidth selection approaches. Delaigle and Hall (2008) considered the use of simulation-extrapolation (SIMEX) for bandwidth selection in a variety of measurement error problems.

Two more recent papers considered the deconvolution problem in new and novel ways. Delaigle and Hall (2014) considered parametrically-assisted nonparametric density deconvolution, while the groundbreaking work of Delaigle and Hall (2016) made use of the empirical phase function to estimate the pdf f_x(x) with the measurement error having unknown distribution and without the need for replicate data. The phase function approach imposes the restrictions that X has no symmetric component and that the characteristic function of U is real-valued and strictly positive.

The GSS family of distributions that is the basis for estimation in this paper dates back to Azzalini (1985), the first publication discussing a so-called skew-normal distribution. There has been a great deal of activity since with the monographs by Genton (2004) and Azzalini (2013) providing a good overview of the existing literature on the topic. Much of the GSS research has been theoretical in nature. While this theoretical work is important for understanding the statistical properties of GSS distributions, the applied value of this family has not often been realized in the literature. Notable exceptions that have used GSS distributions in application include the modeling of pharmacokinetic data, see Chu et al. (2001), the redistribution of soil in tillage, see Van Oost et al. (2003), and the retrospective analysis of case-control studies, see Guolo (2008). All of these authors considered fully parametric models. Arellano-Valle et al. (2005) considered a fully parametric measurement error model assuming both X and U follow skew-normal distributions. Lachos et al. (2010) modeled X using a scale-mixture of skew-normal distributions while assuming U is a mixture of normals. Furthermore, both Kim et al. (2016) and Wang et al. (2017) consider factor analysis models using skew-symmetric distributions. Most recently, Kahrari et al. (2019) developed linear mixed models using a skew-normal-Cauchy distribution and Arellano-Valle et al. (2020) considered the measurement error problem using a two-piece normal distribution to allow for skewness. No other work applying GSS distributions in the measurement error context was found.

The present paper is structured as follows. In the next section, the GSS deconvolution estimator is developed and some of its theoretical properties derived. In the subsequent section, bandwidth estimation methods for the skewing function are considered. Thereafter, a generalized method of moments (GMM) approach for estimating the GSS location and scale parameters is developed. The penultimate section presents simulation results, and the paper concludes with two real-data applications. An Appendix contains both some technical arguments and additional simulation results.

Generalized skew-symmetric deconvolution

Derivation of the GSS estimator

Consider the problem of estimating the probability density function (pdf) f_x(x) associated with random variable X based on a sample contaminated by additive measurement error, W_j=X_j+U_j, j=1,…,n. Here, the X_j are the true measurements of interest, and the W_j and U_j represent, respectively, the contaminated observation and the measurement error. It is assumed that the X_j are iid with pdf f_x(x), the U_j are iid with pdf f_u(u), and X_j and U_j are mutually independent for all j. Furthermore, the U_j are assumed to have a symmetric distribution with mean 0 and variance $\sigma _{u}^{2}$. As is typical in the deconvolution literature, the distribution of U_j is assumed fully known. Auxiliary data, when available, would make it possible to relax this assumption and estimate f_u(u); see for example Delaigle et al. (2008).

The deconvolution estimator developed here assumes that f_x(x) belongs to the GSS class of distributions. That is, X=ξ+ωZ with $\xi \in \mathbb {R} $ and ω>0 denoting location and scale parameters, and with Z having pdf

$$ f_{z}(z) =2f_{0}(z) \pi (z), z\in \mathbb{R} $$

(1)

with f₀(z) a pdf symmetric around 0 and π(z), hereafter referred to as the skewing function, satisfying the constraints 0≤π(z)≤1 and π(−z)=1−π(z) for all z. In fact, any function satisfying these constraints can be paired with any symmetric pdf f₀(z) and will result in (1) being a valid pdf. The corresponding pdf of X is f_x(x)=(2/ω)f₀[(x−ξ)/ω]π[(x−ξ)/ω].
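The validity claim can be checked numerically. The following is a minimal sketch, assuming a standard normal base density and the skewing function Φ(z³−2z) that appears later in the simulation study; the function names (`gss_pdf`, `trapezoid`) are illustrative and not taken from the paper.

```python
import math

def phi(z):
    """Standard normal pdf, used here as the symmetric base density f0."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def skew_fn(z):
    """Example skewing function Phi(z^3 - 2z); its argument is odd in z."""
    return Phi(z ** 3 - 2.0 * z)

def gss_pdf(x, xi=0.0, omega=1.0):
    """f_x(x) = (2/omega) f0(z) pi(z) with z = (x - xi)/omega."""
    z = (x - xi) / omega
    return (2.0 / omega) * phi(z) * skew_fn(z)

def trapezoid(g, lo, hi, m=4000):
    """Composite trapezoid rule on m equal subintervals of [lo, hi]."""
    step = (hi - lo) / m
    total = 0.5 * (g(lo) + g(hi)) + sum(g(lo + i * step) for i in range(1, m))
    return total * step

# The pdf should integrate to 1, and pi(z) + pi(-z) should equal 1.
total_mass = trapezoid(gss_pdf, -10.0, 10.0)
symmetry_gap = max(abs(skew_fn(z) + skew_fn(-z) - 1.0) for z in (0.0, 0.3, 1.0, 2.5))
```

The check works for any skewing function of the form G(w(z)) with G a symmetric cdf and w an odd function, since the two contributions at z and −z always sum to f₀(z).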

The approach considered here is semiparametric in nature. The symmetric pdf f₀(z) is assumed known, but no parametric assumptions are made regarding the skewing function π(z). (In fact, if symmetric component f₀(z) were not assumed known, pdf f_z(z) would not be identifiable; see Appendix A.1 for details). The base density f₀(z) provides the basic structure of the model, and the skewing function π(z) captures the deviation from the base model. Thus, the approach attempts to capture the best of a parametric and a nonparametric solution, and the GSS family provides a very flexible approach for modeling f_z(z).

GSS random variables have an invariance property under even transformations that is central to the development of the deconvolution estimator in the remainder of this section. Let Z be GSS according to (1) and let Z₀ have symmetric pdf f₀(z). For any even function t(z), it holds that $t(Z) \overset {d}{=} t(Z_{0})$ with $\overset {d}{=}$ denoting equality in distribution; see Proposition 1.4 in Azzalini (2013). Thus, the distribution of t(Z) depends only on f₀(z) and not on π(z). Now, let ψ_z(t) denote the characteristic function of Z, and let c₀(t)=Re[ψ_z(t)] and s₀(t)=Im[ψ_z(t)] denote the real and imaginary components of ψ_z(t). The real component can be expressed as c₀(t)=E[cos(tZ)]. By the property of even transformation, it follows that c₀(t)=E[cos(tZ₀)] which is the characteristic function associated with f₀(z).

Now, assume (ξ,ω) are known, and define W^∗=(W−ξ)/ω. Furthermore, observe that W^∗=Z+ω⁻¹U and therefore has characteristic function $\psi _{w^{\ast }}(t) =\psi _{z}(t) \psi _{u}(t/\omega)$ where ψ_u(t) is the real-valued characteristic function of U. It follows that

$$ \text{Re}\left\{ \psi_{w^{\ast }}(t) \right\} =c_{0}(t) \psi_{u}(t/\omega) $$

(2)

and

$$ \text{Im}\left\{ \psi_{w^{\ast }}(t) \right\} =s_{0}(t) \psi_{u}(t/\omega). $$

(3)

The functions c₀(t) and ψ_u(t) in (2) and (3) are known while s₀(t) is unknown. Noting that f_z(z) can be expressed as

$$ f_{z}(z) = f_{0}(z) +\frac{1}{2\pi }\int_{\mathbb{R}}\sin (tz) s_{0}(t) dt, $$

(4)

it follows that an estimator of s₀(t) can be used to construct an estimator of f_z(z). To this end, for random sample W₁,…,W_n, let $W_{j}^{\ast } = (W_{j}-\xi)/\omega $ for j=1,…,n, and define

$$\tilde{s}_{0}(t) =\frac{1}{\psi_{u}(t/\omega) } \frac{1}{n}\sum_{1\leq j\leq n}\sin \left(tW_{j}^{\ast }\right). $$

This empirical estimator, while unbiased for s₀(t), is not suitable for estimating f_z(z) when substituted in (4) as the integral diverges. This is attributable to the tail behavior of $\tilde {s}_{0}(t) $. While s₀(t) converges to 0 as |t|→∞ for any continuous distribution, $\tilde {s}_{0}(t)$ corresponds to an empirical measure and diverges as |t|→∞. This follows upon noting that the bounded periodic function $n^{-1} \sum _{j} \sin (tW_{j}^{\ast })$ is divided by ψ_u(t/ω), with the latter decreasing to 0 as |t| increases.

Next, consider the “smoothed” estimator

$$ \hat{s}_{0}(t) =\frac{\psi_{k}(ht) }{\psi_{u}(t/\omega) }\frac{1}{n}\sum_{1\leq j\leq n}\sin \left(tW_{j}^{\ast }\right) $$

(5)

where ψ_k(t) is a non-negative weight function and h is a bandwidth parameter. This estimator has expectation $\mathrm {E}[\hat {s}_{0}(t)]=\psi _{k}(ht)s_{0}(t)$ and therefore is biased for s₀(t). However, it also has some desirable properties. Firstly, it is an odd function, $\hat {s}_{0}(-t)=-\hat {s}_{0}(t)$ for all $t \in \mathbb R$. Secondly, substitution of (5) into (4) results in the well-defined estimator for f_z(z),

$$ \hat{f}_{z}(z) = f_{0}(z) + \frac{1}{2\pi}\int_{\mathbb R} \sin(tz)\hat{s}_{0}(t)dt, $$

(6)

provided ψ_k(t) is chosen such that |ψ_k(ht)/ψ_u(t/ω)|→0 as |t|→∞. Choosing ψ_k(t) to be 0 outside a bounded interval will trivially satisfy this requirement.
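The contrast between the unsmoothed and smoothed estimators can be sketched in a few lines. The example below assumes Laplace measurement error with variance σ_u², whose characteristic function is 1/(1+σ_u²t²/2), and the weight function ψ_k(t)=(1−t²)³ on [−1,1]; both are illustrative choices made for this sketch, and the sample is synthetic.

```python
import math
import random

def psi_k(t):
    """Assumed weight function: (1 - t^2)^3 on [-1, 1], zero elsewhere."""
    return (1.0 - t * t) ** 3 if abs(t) <= 1.0 else 0.0

def psi_u(t, sigma_u):
    """Characteristic function of a Laplace error with variance sigma_u^2."""
    return 1.0 / (1.0 + 0.5 * sigma_u ** 2 * t * t)

def s0_tilde(t, w_star, omega, sigma_u):
    """Unsmoothed estimator: empirical sine transform divided by psi_u(t/omega)."""
    mean_sin = sum(math.sin(t * ws) for ws in w_star) / len(w_star)
    return mean_sin / psi_u(t / omega, sigma_u)

def s0_hat(t, w_star, h, omega, sigma_u):
    """Smoothed estimator (5): the weight psi_k(ht) removes the tails."""
    return psi_k(h * t) * s0_tilde(t, w_star, omega, sigma_u)

# Synthetic standardized sample (xi = 0, omega = 1): X = |Z0| plus Laplace noise.
rng = random.Random(1)
omega, sigma_u, h = 1.0, 0.5, 0.3
b = sigma_u / math.sqrt(2.0)  # Laplace scale; the variance is 2 b^2
w_star = [abs(rng.gauss(0.0, 1.0)) + rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
          for _ in range(200)]

# The factor 1/psi_u(t/omega) that multiplies the bounded sine average grows
# without bound, which is why the unsmoothed estimator is unusable in (4).
amplification = [1.0 / psi_u(t / omega, sigma_u) for t in (2.0, 10.0, 40.0)]
```

With this compactly supported weight, ŝ₀(t) is exactly zero for |t|>1/h, so the integral in (6) reduces to an integral over a bounded interval.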

Estimator (6) suffers from the same drawback as the usual nonparametric deconvolution estimator in that it may be negative in parts. In practice, the negative parts can be truncated and the resulting function rescaled to integrate to 1. To circumvent this ad-hoc fix, combine Eqs. (1) and (4) to obtain

$$ \pi (z) =\frac{1}{2}+\frac{1}{4\pi f_{0}(z) } \int_{\mathbb{R}}\sin \left(tz\right) s_{0}(t) dt. $$

(7)

Substitution of (5) in (7), along with the identity $\sin (tz) =(e^{itz}-e^{-itz})/(2i)$, gives

$$ \hat{\pi}(z) =\frac{1}{2}+\frac{1}{4f_{0}(z) } \left\{ \tilde{f}_{w^{*}}(z) -\tilde{f}_{w^{*}}(-z) \right\} $$

(8)

where $\tilde {f}_{w^{*}}(z) =(nh\omega)^{-1}\sum K_{h\omega }[ (z-W_{j}^{*})/(h\omega)]$ is the well-studied nonparametric deconvolution density estimator of Carroll and Hall (1988) with deconvolution kernel $K_{h}(y) =(2\pi)^{-1}{\int _{\mathbb R }} e^{-ity}\psi _{k}(t)/\psi _{u}(t/h)dt$. The potential for (6) being negative in parts is reflected in (8) not being range-respecting. Specifically, it is possible to have $\hat {\pi }(z) \not \in \left [ 0,1\right ] $ for a set of z values with nonzero measure. A range-corrected skewing function estimator is $\tilde {\pi }(z) =\max \left [ 0,\min \left \{ 1,\hat {\pi }\left (z\right) \right \} \right ]$. The estimated density function of X based on the range-corrected skewing function is

$$ \tilde{f}\left(x|\xi,\omega \right) =\frac{2}{\omega }f_{0}\left(\frac{ x-\xi }{\omega }\right) \tilde{\pi}\left(\frac{x-\xi }{\omega }\right). $$

(9)

Use of the range-corrected skewing function estimate ensures that (9) is always a valid pdf. There is no need for any additional truncation of negative values and subsequent rescaling as would be the case with direct implementation of (6).
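The construction of the range-corrected estimator can be sketched end to end as follows. This is an illustration under stated assumptions rather than the author's implementation: the base density is standard normal, the error is Laplace, the weight function is ψ_k(t)=(1−t²)³ on [−1,1], and the skewing function is obtained by combining (1) and (4), so that π̂(z)=1/2+I(z)/{2f₀(z)} with I(z) the integral term of (6). The scaling 2/ω follows the pdf of X given after (1). All function names are hypothetical.

```python
import math
import random

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def psi_k(t):
    """Assumed weight function, zero outside [-1, 1]."""
    return (1.0 - t * t) ** 3 if abs(t) <= 1.0 else 0.0

def psi_u(t, sigma_u):
    """Laplace characteristic function with variance sigma_u^2 (assumed error law)."""
    return 1.0 / (1.0 + 0.5 * sigma_u ** 2 * t * t)

def gss_density_estimate(w, xi, omega, h, sigma_u, m=200):
    """Return callables (pi_tilde, f_tilde) for moderate |z| where phi(z) > 0."""
    w_star = [(wj - xi) / omega for wj in w]
    step = (1.0 / h) / m  # psi_k(ht) vanishes for |t| > 1/h
    t_grid = [i * step for i in range(m + 1)]
    s_hat = [psi_k(h * t) / psi_u(t / omega, sigma_u)
             * sum(math.sin(t * ws) for ws in w_star) / len(w_star)
             for t in t_grid]

    def integral_term(z):
        # (1/(2 pi)) * integral of sin(tz) s_hat(t) dt. The integrand is even
        # in t, so integrate over [0, 1/h] by the trapezoid rule and double.
        vals = [math.sin(t * z) * s for t, s in zip(t_grid, s_hat)]
        half = step * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
        return 2.0 * half / (2.0 * math.pi)

    def pi_tilde(z):
        pi_hat = 0.5 + integral_term(z) / (2.0 * phi(z))
        return max(0.0, min(1.0, pi_hat))  # range correction

    def f_tilde(x):
        z = (x - xi) / omega
        return (2.0 / omega) * phi(z) * pi_tilde(z)

    return pi_tilde, f_tilde

# Synthetic sample: X = |Z0| contaminated by Laplace error.
rng = random.Random(2)
sigma_u = 0.4
b = sigma_u / math.sqrt(2.0)
w = [abs(rng.gauss(0.0, 1.0)) + rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
     for _ in range(200)]
pi_tilde, f_tilde = gss_density_estimate(w, xi=0.0, omega=1.0, h=0.35, sigma_u=sigma_u)
grid = [-8.0 + 0.05 * i for i in range(321)]
mass = 0.05 * sum(f_tilde(x) for x in grid)  # Riemann sum; tails are negligible
```

Because the integral term is odd in z, the clipped estimate still satisfies π̃(−z)=1−π̃(z), which is what makes the resulting density integrate to one without rescaling.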

Some properties of the estimator

The range-corrected estimator $\tilde {\pi }(z)$ is asymptotically equivalent to $\hat {\pi }(z)$ in (8) on any closed subset of $\mathbb {R}$. As such, the latter will be used to evaluate the properties of the GSS deconvolution estimator. Firstly, note that using the known expected value of the nonparametric deconvolution estimator $\tilde {f}_{w^{\ast }}(z)$, it follows from (8) that

$$E\left[ \hat{\pi}(z) \right] - \pi (z) = \frac{c_{k}}{4}\frac{ f_{z}^{\prime \prime }(z) -f_{z}^{\prime \prime }(-z) }{f_{0}(z) }\cdot h^{2}+O\left(h^{3}\right) $$

with constant c_k depending only on the kernel function ψ_k(t). Thus, for an appropriately chosen bandwidth h, $\hat {\pi }(z)$ is consistent for π(z), and the density estimator $\tilde {f}(x|\xi,\omega)$ in (9) is also consistent for f_x(x).

The mean integrated square error (MISE), derived in Appendix A.2, is

$$ \text{MISE}(h) = \left(2\pi\right)^{-1}\int_{\mathbb{R}} \left\{\frac{\psi_{k}^{2}(ht)}{n} \left[\frac{1-c_{0}(2t)\psi_{u}(2t/\omega)}{2\psi_{u}^{2}(t/\omega)}-s_{0}^{2}(t)\right]+\left[\psi_{k}(ht)-1\right]^{2}s_{0}^{2}(t)\right\} dt. $$

(10)

When the distribution of Z is symmetric, i.e. π(z)=1/2 for all z so that s₀(t)=0 for all t, and letting MISE_sym denote the MISE calculated under symmetry,

$$\begin{array}{*{20}l} \text{MISE}_{\text{sym}}(h) & = \left(4\pi\right)^{-1} \int_{\mathbb{R}} \frac{\psi_{k}^{2}(ht)}{n} \left[\frac{1-c_{0}(2t)\psi_{u}(2t/\omega)}{\psi_{u}^{2}(t/\omega)}\right] dt \\ & \leq \left(2\pi n\right)^{-1} \int_{\mathbb{R}} \frac{\psi_{k}^{2}(ht)}{\psi_{u}^{2}(t/\omega)}dt. \end{array} $$

Here the inequality follows upon noting that |1−c₀(2t)ψ_u(2t/ω)|≤2 for all t. This upper bound of MISE_sym is proportional to the asymptotic MISE of the nonparametric deconvolution estimator, see equation (2.7) in Stefanski and Carroll (1990). Thus, in the symmetric case, one would expect the GSS deconvolution estimator to perform no worse than the nonparametric deconvolution estimator for a correctly specified symmetric component c₀(t). In fact, since this is an upper bound, large gains in efficiency may be possible. Our simulation results presented in a later section are congruent with this statement.

Bandwidth selection

Implementation of the GSS deconvolution estimator requires a bandwidth parameter h to be specified. Two methods for selecting this bandwidth are developed in this section. The first method uses cross-validation (CV) to approximate the integrated square error (ISE), and the second method approximates the MISE in (10).

A cross-validation bandwidth

For the GSS deconvolution estimator, the density-based ISE is proportional to the ISE for the imaginary component s₀(t) of the characteristic function,

$$ \int_{\mathbb R} \left[\tilde{f}_{z}(z)-f_{z}(z)\right]^{2}dz \propto \int_{\mathbb R} \left[\hat{s}_{0}(t)-s_{0}(t)\right]^{2} dt. $$

(11)

This follows from Parseval’s identity and recalling that the real component c₀(t) is known. Let C(h) denote the expression obtained by expanding the square on the right-hand side of (11) and keeping only terms involving the estimator $\hat {s}_{0}(t)$,

$$ C(h) = \int_{\mathbb R} \hat{s}_{0}^{2}(t)dt - 2 \int_{\mathbb R} \hat{s}_{0}(t) s_{0}(t)dt. $$

(12)

Now, note that the second integral in (12) can be written as

$$ \int_{\mathbb R} \hat{s}_{0}(t) s_{0}(t)dt = \frac{1}{n}\sum_{i=1}^{n} \int_{\mathbb R} \frac{\psi_{k}(ht)\sin(tW_{i}^{\ast})}{\psi_{u}(t/\omega)} s_{0}(t)dt. $$

(13)

Define $\tilde {s}_{(i)}(t)$ to be an estimate of s₀(t) excluding the ith observation,

$$\tilde{s}_{(i)}(t) = \frac{(n-1)^{-1}\sum_{j\neq i} \sin (tW_{j}^{*})}{\psi_{u}(t/\omega)}. $$

This quantity is unbiased for s₀(t) for all i, and $\tilde {s}_{(i)}(t)$ is independent of W_i. The CV score follows by substitution of $\tilde {s}_{(i)}(t)$ in (13) for each i in the summand, giving

$$ \hat{C}(h) = \int_{\mathbb R} \frac{\psi_{k}(ht)}{\psi_{u}^{2}(t/\omega)}\left[\psi_{k}(ht)\left\{\frac{1}{n}\sum_{j=1}^{n}\sin (tW_{j}^{*})\right\}^{2}-\frac{2}{n(n-1)}\sum_{i=1}^{n}\sum_{j\neq i}\sin(tW_{i}^{*})\sin(tW_{j}^{*})\right] dt. $$

(14)

This result is similar to that of Stefanski and Carroll (1990) in the nonparametric setting, but here only requires estimating the imaginary component of the characteristic function. The CV bandwidth is defined to be the value $\tilde {h}$ that minimizes $\hat {C}(h)$.
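A grid-search version of the CV bandwidth can be sketched as follows. The sketch assumes Laplace measurement error and the weight function ψ_k(t)=(1−t²)³ on [−1,1], evaluates (14) by the trapezoid rule, and uses the identity Σ_{i≠j}a_i a_j=(Σa_i)²−Σa_i² to avoid the double sum. The candidate bandwidths and the synthetic sample are arbitrary illustrations.

```python
import math
import random

def psi_k(t):
    """Assumed weight function, zero outside [-1, 1]."""
    return (1.0 - t * t) ** 3 if abs(t) <= 1.0 else 0.0

def psi_u(t, sigma_u):
    """Laplace characteristic function with variance sigma_u^2 (assumed error law)."""
    return 1.0 / (1.0 + 0.5 * sigma_u ** 2 * t * t)

def pair_sum(sines):
    """Sum over i != j of sines[i] * sines[j], computed in O(n)."""
    return sum(sines) ** 2 - sum(s * s for s in sines)

def cv_score(h, w_star, omega, sigma_u, m=200):
    """Trapezoid approximation of the CV score (14) for bandwidth h."""
    n = len(w_star)
    step = (1.0 / h) / m  # the integrand vanishes for |t| > 1/h
    total = 0.0
    for i in range(m + 1):
        t = i * step
        sines = [math.sin(t * ws) for ws in w_star]
        k = psi_k(h * t)
        bracket = k * (sum(sines) / n) ** 2 - 2.0 * pair_sum(sines) / (n * (n - 1))
        value = k / psi_u(t / omega, sigma_u) ** 2 * bracket
        total += value * (0.5 if i in (0, m) else 1.0)
    return 2.0 * step * total  # the integrand is even in t

# Synthetic standardized sample (xi = 0, omega = 1).
rng = random.Random(3)
sigma_u = 0.4
b = sigma_u / math.sqrt(2.0)
w_star = [abs(rng.gauss(0.0, 1.0)) + rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
          for _ in range(100)]
candidates = [0.2, 0.3, 0.4, 0.5, 0.6]
h_cv = min(candidates, key=lambda h: cv_score(h, w_star, 1.0, sigma_u))
```

In practice a finer grid or a one-dimensional optimizer would replace the five candidate values used here.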

An MISE bandwidth

Consider the MISE in (10), and note that the only unknown quantity therein is $s_{0}^{2}(t)$. Furthermore, observe that $\mathrm {E}\left [\sin (tW_{j}^{*})\sin (tW_{k}^{*})\right ]=\psi _{u}^{2}(t/\omega)s_{0}^{2}(t)$ whenever j≠k. Thus, $s_{0}^{2}(t)$ can be estimated by

$$ \hat{s}_{2}(t) = \max \left\{0, \frac{1}{n(n-1)\psi_{u}^{2}(t/\omega)}\sum_{j=1}^{n} \sum_{k \neq j}\sin(tW_{j}^{*})\sin(tW_{k}^{*})\right\}\mathcal{I}(|t|\leq \kappa), $$

(15)

where $\mathcal {I}(\cdot)$ is the indicator function and κ is some positive constant. The constant κ can be thought of as a smoothing parameter which ensures that the estimator $\hat {s}_{2}(t)$ behaves well for large values of |t|. Ideally, κ should be chosen in a data-dependent way and development of this approach is ongoing. However, based on extensive simulation work, it has been found that values κ∈[3,5] work reasonably well for a wide range of underlying GSS distributions considered. Now, taking (10), substituting $\hat {s}_{2}(t)$ for $s_{0}^{2}(t)$, and ignoring components that do not depend on the bandwidth, gives MISE approximation score

$$ \begin{aligned} \hat{M}(h) &= \frac{1}{h} \int_{\mathbb R}\left\{ \frac{\psi_{k}^{2}(t)}{n\psi_{u}^{2}[t/(h\omega)]} \left[\frac{1-\psi_{u}[2t/(h\omega)]c_{0}(2t/h)}{2}\right] \right.\\&+ \left.\left[\frac{n-1}{n}\psi_{k}(t)-2\right]\psi_{k}(t)\hat{s}_{2}(t/h) \right\}dt. \end{aligned} $$

(16)

The MISE-approximation bandwidth is defined to be the value $\tilde {h}$ that minimizes $\hat {M}(h)$.
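The score in (16) can be evaluated numerically along the same lines. The sketch below assumes a standard normal symmetric component, so that c₀(t)=exp(−t²/2), Laplace measurement error, the weight function ψ_k(t)=(1−t²)³ on [−1,1], and κ=4 (a value inside the range reported above). The candidate grid and the synthetic sample are illustrative only.

```python
import math
import random

def psi_k(t):
    """Assumed weight function, zero outside [-1, 1]."""
    return (1.0 - t * t) ** 3 if abs(t) <= 1.0 else 0.0

def psi_u(t, sigma_u):
    """Laplace characteristic function with variance sigma_u^2 (assumed error law)."""
    return 1.0 / (1.0 + 0.5 * sigma_u ** 2 * t * t)

def c0(t):
    """Real part of the characteristic function for a standard normal f0."""
    return math.exp(-0.5 * t * t)

def s2_hat(t, w_star, omega, sigma_u, kappa):
    """Estimator (15) of s0(t)^2: truncated at zero, and zero for |t| > kappa."""
    if abs(t) > kappa:
        return 0.0
    n = len(w_star)
    sines = [math.sin(t * ws) for ws in w_star]
    pairs = sum(sines) ** 2 - sum(s * s for s in sines)  # sum over j != k
    return max(0.0, pairs / (n * (n - 1) * psi_u(t / omega, sigma_u) ** 2))

def mise_score(h, w_star, omega, sigma_u, kappa=4.0, m=200):
    """Trapezoid approximation of the MISE score (16)."""
    n = len(w_star)
    step = 1.0 / m  # psi_k(t) vanishes outside [-1, 1]
    total = 0.0
    for i in range(m + 1):
        t = i * step
        k = psi_k(t)
        pu = psi_u(t / (h * omega), sigma_u)
        var_part = (k * k / (n * pu * pu)
                    * (1.0 - psi_u(2.0 * t / (h * omega), sigma_u) * c0(2.0 * t / h)) / 2.0)
        bias_part = ((n - 1) / n * k - 2.0) * k * s2_hat(t / h, w_star, omega, sigma_u, kappa)
        total += (var_part + bias_part) * (0.5 if i in (0, m) else 1.0)
    return 2.0 * step * total / h  # the integrand is even in t

# Synthetic standardized sample (xi = 0, omega = 1).
rng = random.Random(4)
sigma_u = 0.4
b = sigma_u / math.sqrt(2.0)
w_star = [abs(rng.gauss(0.0, 1.0)) + rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
          for _ in range(100)]
candidates = [0.2, 0.3, 0.4, 0.5, 0.6]
h_mise = min(candidates, key=lambda h: mise_score(h, w_star, 1.0, sigma_u))
```

The first term of the integrand is the variance contribution and grows as h decreases, while the second term is negative wherever ŝ₂ is positive and rewards retaining more of the estimated signal.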

Location and scale estimation

Generalized method of moments

Up to this point, the location and scale parameters ξ and ω have been treated as known quantities. This is unrealistic in practice. Estimation of the GSS parameters for a known symmetric component has been considered in the literature; see Ma et al. (2005), Azzalini et al. (2010), and Potgieter and Genton (2013). However, none of these authors considered the presence of measurement error. Here, a Generalized Method of Moments (GMM) approach accounting for measurement error is developed. Recall that W_j=X_j+U_j=ξ+ωZ_j+U_j, j=1,…,n. Let M≥2 be a positive integer and assume that the Z_j and the U_j have at least 2M finite moments. Let T_k denote the (2k)th centered moment,

$$ T_{k} := T_{k}(\xi,\omega) = n^{-1}\sum_{j=1}^{n}\left(\frac{W_{j}-\xi}{\omega}\right)^{2k}. $$

(17)

This variable has expectation $\mathrm {E}[T_{k}]=\mathrm {E}[(Z+\omega ^{-1}U)^{2k}]$ and admits expansion

$$ \mathrm{E}\left[T_{k}\right] = \sum_{j=0}^{k} {{2k}\choose{2j}} \omega^{-2(k-j)} \mathrm{E}\left[Z^{2j}\right] \mathrm{E} \left[U^{2(k-j)}\right]. $$

(18)

By the GSS property of even transformations, $\mathrm {E} [Z^{2j}]=\mathrm {E} [Z_{0}^{2j}]$ for j=1,…,M with Z₀ a random variable with pdf f₀(z). Furthermore, the evaluation of the moments of U poses no problem as this distribution is assumed known. Thus, E[T_k] can easily be evaluated using (18).

Now, define quadratic form $D(\xi,\omega) = n \mathbf {T}_{M}^{\top } \mathbf {\Sigma }^{-1} \mathbf {T}_{M}$ with T_M denoting the vector T_M=(T₁−E[T₁],…,T_M−E[T_M])^⊤ with covariance matrix Σ. The covariance matrix has entries $\Sigma _{ij}=n^{-1}(\mathrm {E}[T_{i+j}]-\mathrm {E}[T_{i}]\mathrm {E}[T_{j}])$. The GMM estimators are defined to be the minimizers of D(ξ,ω). In evaluating D(ξ,ω), both the expectations E[T_k], k=1,…,M and the covariance matrix Σ are functions of the parameter ω, but not of ξ.
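As an illustration of how D(ξ,ω) might be evaluated, the sketch below assumes a standard normal symmetric component, for which E[Z₀^{2j}]=(2j−1)!!, and normal measurement error with known σ_u. These distributional choices, the small linear solver, and the synthetic data are assumptions of the sketch rather than details taken from the paper.

```python
import math
import random

def odd_double_factorial(k):
    """(2k-1)!!, which equals E[Z0^(2k)] for a standard normal Z0."""
    out = 1
    for i in range(1, 2 * k, 2):
        out *= i
    return out

def expected_T(k, omega, sigma_u):
    """E[T_k] from expansion (18), with U ~ N(0, sigma_u^2)."""
    return sum(math.comb(2 * k, 2 * j) * omega ** (-2 * (k - j))
               * odd_double_factorial(j)
               * odd_double_factorial(k - j) * sigma_u ** (2 * (k - j))
               for j in range(k + 1))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x

def gmm_objective(xi, omega, w, sigma_u, M=2):
    """D(xi, omega) = n T' Sigma^{-1} T with Sigma_ij = (E[T_{i+j}] - E[T_i]E[T_j]) / n."""
    n = len(w)
    w_star = [(wj - xi) / omega for wj in w]
    e = [expected_T(k, omega, sigma_u) for k in range(1, 2 * M + 1)]  # e[k-1] = E[T_k]
    diff = [sum(ws ** (2 * k) for ws in w_star) / n - e[k - 1] for k in range(1, M + 1)]
    cov = [[(e[i + j - 1] - e[i - 1] * e[j - 1]) / n for j in range(1, M + 1)]
           for i in range(1, M + 1)]
    return n * sum(d * s for d, s in zip(diff, solve(cov, diff)))

# Synthetic data with xi = 2, omega = 1.5 and a symmetric (normal) X.
rng = random.Random(5)
sigma_u = 0.5
w = [2.0 + 1.5 * rng.gauss(0.0, 1.0) + rng.gauss(0.0, sigma_u) for _ in range(500)]
d_true = gmm_objective(2.0, 1.5, w, sigma_u)
d_wrong = gmm_objective(0.0, 1.5, w, sigma_u)
```

Note that the covariance entries involve E[T_k] up to k=2M, so this evaluation uses moments of order up to 4M. A numerical optimizer started from several initial values would then be needed to locate the local minima discussed next.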

Selection from multiple GMM solutions

One difficulty encountered with the GMM approach is that the statistic D(ξ,ω) frequently has multiple minima, and the global minimum does not always correspond to the “correct” solution. An equivalent problem also occurs in the non-measurement error setting and is an artifact of the skewing function being unknown; see Section 7.2.2 in Azzalini (2013) for an overview and illustration. Solutions considered there range from selecting the model with the smallest squared integral of the second derivative of the estimated skewing function, to selecting a solution based on matching model-based and empirical skewness coefficients.

Now, assume that D(ξ,ω) has J local minima occurring at $(\hat {\xi }_{j},\hat {\omega }_{j})$, j=1,…,J. Furthermore, let $\tilde {f}_{j}(x|\hat {\xi }_{j},\hat {\omega }_{j})$ denote the GSS density deconvolution estimator in (9) obtained using solution $(\hat {\xi }_{j},\hat {\omega }_{j})$. Thus, J different GSS deconvolution estimators are calculated. Using the jth estimated density, define the kth model-implied moment,

$$ \tilde{\mu}_{j,k} = \int_{\mathbb{R}} x^{k} \tilde{f}_{j}(x|\hat{\xi}_{j},\hat{\omega}_{j}) dx, $$

(19)

and model-implied characteristic function,

$$ \tilde{\phi}_{j}(t) = \int_{\mathbb{R}} \exp (itx) \tilde{f}_{j}(x|\hat{\xi}_{j},\hat{\omega}_{j}) dx. $$

(20)

Based on these quantities, two different selection methods are now proposed. Throughout, it will be assumed that measurement error U has distribution symmetric about 0.

Skewness matching: In model W=X+U, the skewness of X can be estimated by $\hat {\gamma }_{x}=\left [{\hat {\sigma }_{w}^{2}}/{(\hat {\sigma }_{w}^{2}-{\sigma }_{u}^{2})}\right ]^{3/2}\hat {\gamma }_{w}$ where $\hat {\sigma }_{w}^{2}$ and $\hat {\gamma }_{w}$ denote the sample variance and skewness of iid random variables W₁,…,W_n. This holds because the symmetric error U leaves the third central moment unchanged while inflating the variance. Now, for the jth solution pair $(\hat {\xi }_{j},\hat {\omega }_{j})$, the GSS model-implied skewness is given by $\hat {\gamma }_{j}=\left ({\tilde {\mu }_{j,3}-3\tilde {\mu }_{j,2}\tilde {\mu }_{j,1}+2\tilde {\mu }_{j,1}^{3}}\right)/\left ({\tilde {\mu }_{j,2}-\tilde {\mu }_{j,1}^{2}}\right)^{3/2}$ with $\tilde {\mu }_{j,k}$ as defined in (19). The selected solution is the one with implied skewness closest to the empirical skewness. Specifically, letting $d_{j} = |\hat {\gamma }_{x}-\hat {\gamma }_{j}|$, j=1,…,J, the selected solution is $(\hat {\xi }_{j^{\ast }},\hat {\omega }_{j^{\ast }})$ with $j^{\ast }=\arg \min _{1\leq j\leq J}d_{j}$.

Phase function matching: The phase function, a normalized version of the characteristic function, is a recent tool employed in density deconvolution; see Delaigle and Hall (2016) and Nghiem and Potgieter (2018) for further details. Let ρ_x(t) and ρ_w(t) denote the phase functions of X and W=X+U. For U having strictly positive characteristic function, these phase functions are equal, ρ_w(t)=ρ_x(t) for all t. The empirical estimate of the phase function of X is $\hat {\rho }_{x}(t)=\hat {\psi }_{w}(t)/|\hat {\psi }_{w}(t)|$ with $\hat {\psi }_{w}(t)$ the empirical characteristic function of W, and $|z|=(z\bar {z})^{1/2}$ and $\bar {z}$ denoting the complex norm and conjugate of z. For the jth GMM solution $(\hat {\xi }_{j},\hat {\omega }_{j})$, the model-implied phase function is given by $\tilde {\rho }_{j}(t)=\tilde {\phi }_{j}(t)/|\tilde {\phi }_{j}(t)|$ with $\tilde {\phi }_{j}(t)$ as defined in (20). Now, letting w(t) denote a non-negative weight function symmetric around 0, define distance metric $R_{j} = \int _{\mathbb {R}} \vert \hat {\rho }_{x}(t) - \tilde {\rho }_{j}(t) \vert w(t) dt$ for j=1,…,J. The selected solution is $(\hat {\xi }_{j^{\ast }},\hat {\omega }_{j^{\ast }})$ with $j^{\ast }=\arg \min _{1\leq j\leq J}R_{j}$. That is, the selected solution has minimum phase function distance. In this paper, weight function $w(t)=[1-(t/t^{\ast })^{2}]^{3}\ \mathcal {I}(|t| \leq t^{\ast })$ will be used with t^∗ the smallest t>0 such that $|\hat {\psi }_{w}(t)|\leq n^{-1/4}$ as per Delaigle and Hall (2016).
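The two selection criteria can be sketched as follows. The skewness adjustment uses the moment relation γ_x=γ_w(σ_w²/σ_x²)^{3/2} with σ_x²=σ_w²−σ_u², which holds because a symmetric U contributes no third central moment. For the phase function distance, the cutoff t^∗ is located on a grid, and the integrals in (20) and in R_j are approximated by the trapezoid rule. In the illustration, plain normal densities stand in for the candidate estimates, and the data are deterministic normal quantiles; both are stand-ins chosen for this sketch.

```python
import cmath
from statistics import NormalDist

def deconvolved_skewness(var_w, skew_w, var_u):
    """Skewness of X implied by W = X + U with U symmetric about 0."""
    return (var_w / (var_w - var_u)) ** 1.5 * skew_w

def empirical_cf(t, w):
    return sum(cmath.exp(1j * t * wj) for wj in w) / len(w)

def phase(value):
    """Normalize a nonzero complex number to unit modulus."""
    return value / abs(value)

def cutoff(w, step=0.01, t_max=50.0):
    """Smallest grid point t > 0 with |empirical cf| <= n^(-1/4)."""
    threshold = len(w) ** -0.25
    for i in range(1, int(t_max / step) + 1):
        if abs(empirical_cf(i * step, w)) <= threshold:
            return i * step
    return t_max

def model_cf(t, density, lo, hi, m=1000):
    """Model-implied characteristic function (20) by the trapezoid rule."""
    step = (hi - lo) / m
    total = 0.0
    for i in range(m + 1):
        x = lo + i * step
        total += (0.5 if i in (0, m) else 1.0) * cmath.exp(1j * t * x) * density(x)
    return total * step

def phase_distance(w, density, lo, hi, m_t=100):
    """Weighted distance R_j between empirical and model-implied phase functions."""
    t_star = cutoff(w)
    step = 2.0 * t_star / m_t
    total = 0.0
    for i in range(m_t + 1):
        t = -t_star + i * step
        weight = max(0.0, 1.0 - (t / t_star) ** 2) ** 3
        gap = abs(phase(empirical_cf(t, w)) - phase(model_cf(t, density, lo, hi)))
        total += (0.5 if i in (0, m_t) else 1.0) * weight * gap
    return total * step

# Illustrative data: 200 quantiles of N(1, 1), compared with two candidate densities.
n = 200
w = [NormalDist(1.0, 1.0).inv_cdf((i + 0.5) / n) for i in range(n)]
r_good = phase_distance(w, NormalDist(1.0, 1.0).pdf, -9.0, 11.0)
r_bad = phase_distance(w, NormalDist(0.0, 1.0).pdf, -10.0, 10.0)
```

The candidate with the smaller distance would be selected; here the density centered at 1 matches the phase of the data, while the density centered at 0 does not.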

Simulation studies

The performance of the GSS deconvolution estimator was evaluated using extensive simulations. Letting ϕ(z) and Φ(z) denote the standard normal density and distribution functions, data X₁,…,X_n were generated from GSS distributions with symmetric component f₀(z)=ϕ(z) and using three different skewing functions, π₀(z)=1/2, π₁(z)=Φ(9.9625z) and π₂(z)=Φ(z³−2z). The location and scale parameters were taken to be ξ=0 and ω=1. Figure 1 illustrates the three resulting pdfs f_x(x)=(2/ω)ϕ[(x−ξ)/ω]π_k[(x−ξ)/ω], k=0,1,2. Note that the skewing function π₀(z) does not introduce any deviation from symmetry and corresponds to simulating from a normal distribution. Additionally, the skewing function π₁(z) results in a positive skew distribution, while π₂(z) results in a bimodal distribution.

Two measurement error distributions were considered with U₁,…,U_n being either Normal or Laplace with mean 0 and variance chosen to have noise-to-signal ratio $\text {NSR}=\sigma _{u}^{2}/\sigma _{x}^{2}$ either 0.2 or 0.5. Samples W_j=X_j+U_j, j=1,…,n, with n∈{50,100,200,500} were generated from each of the possible simulation configurations described.
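One way to generate data of this kind is sketched below. It relies on the standard stochastic representation of skew-symmetric variables: if Z₀ has pdf f₀ and V is uniform on (0,1), independent of Z₀, then Z=Z₀ when V≤π(Z₀) and Z=−Z₀ otherwise has pdf 2f₀(z)π(z). The variance of X needed to fix the NSR is computed numerically from Var(Z)=1−E[Z]², using E[Z²]=E[Z₀²]=1. Whether the original study generated data this way is not stated, so the function names and the sampling scheme are assumptions of the sketch.

```python
import math
import random

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

SKEWING = {
    "pi0": lambda z: 0.5,
    "pi1": lambda z: Phi(9.9625 * z),
    "pi2": lambda z: Phi(z ** 3 - 2.0 * z),
}

def gss_variance(skew, lim=10.0, m=8000):
    """Var(Z) = 1 - E[Z]^2 for a normal base density (xi = 0, omega = 1)."""
    step = 2.0 * lim / m
    mean = step * sum(z * 2.0 * phi(z) * skew(z)
                      for z in (-lim + i * step for i in range(m + 1)))
    return 1.0 - mean ** 2

def draw_gss(n, skew, rng):
    """Keep Z0 with probability skew(Z0), otherwise flip its sign."""
    out = []
    for _ in range(n):
        z0 = rng.gauss(0.0, 1.0)
        out.append(z0 if rng.random() <= skew(z0) else -z0)
    return out

def draw_error(n, sigma_u, dist, rng):
    """Normal or Laplace errors with mean 0 and variance sigma_u^2."""
    if dist == "normal":
        return [rng.gauss(0.0, sigma_u) for _ in range(n)]
    b = sigma_u / math.sqrt(2.0)  # Laplace scale; the variance is 2 b^2
    return [rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b) for _ in range(n)]

def simulate_w(n, skew_name, nsr, dist, seed=0):
    """Contaminated sample W = X + U with the requested noise-to-signal ratio."""
    rng = random.Random(seed)
    skew = SKEWING[skew_name]
    sigma_u = math.sqrt(nsr * gss_variance(skew))
    x = draw_gss(n, skew, rng)
    u = draw_error(n, sigma_u, dist, rng)
    return [xj + uj for xj, uj in zip(x, u)]

w = simulate_w(500, "pi1", 0.2, "laplace", seed=3)
```

For π₁ the numerically computed variance can be compared with the skew-normal value 1−2δ²/π, where δ=α/(1+α²)^{1/2} and α=9.9625.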

Comparison of oracle estimators

The first simulation study presented compares the proposed GSS estimator to the established nonparametric estimator of Carroll and Hall (1988), and assumes the existence of an oracle that selects the “best” possible bandwidth for each of the estimators. Specifically, for a sample W₁,…,W_n, let $\tilde {f}_{\text {gss}}(x|h)$ and $\tilde {f}_{\text {np}}(x|h)$ denote, respectively, the GSS and nonparametric estimators with bandwidth h. The ISE is defined as

$$\text{ISE}_{\mathrm{m}}(h) = \int_{\mathbb R} \left[\tilde{f}_{\mathrm{m}}(x|h)-f_{x}(x)\right]^{2}dx$$

where m∈{gss,np}. Then, the “best” bandwidth is the value that minimizes the ISE between the estimated and true densities. Furthermore, when GMM results in more than one solution for the GSS location and scale parameters, the oracle also selects the solution that results in the smallest ISE. In practice, no oracle exists to do these selections. Even so, comparing the estimators under such idealized conditions speaks to the best possible performance of these methods.

For each simulation configuration, N=1000 samples were generated. Due to the occasional occurrence of very large outliers in ISE, the median ISE (rather than mean ISE) is reported. The first and third quartiles of ISE are also reported. Results for n∈{200,500} are summarized in Table 1, and for n∈{50,100} are presented in Table 6 in Appendix A.5.

Table 1 Median of 100×ISE, as well as first and third quartiles [ Q₁,Q₃] for the oracle GSS and nonparametric (NP) deconvolution estimators

Inspection of Table 1 shows how well the GSS estimator can perform relative to the nonparametric estimator. In the symmetric case with skewing function π₀(z), the reduction in median ISE is most dramatic and exceeds 50% in all cases. For skewing functions π₁(z) and π₂(z), the reduction in median ISE is also seen to be as large as 40%. There is one instance where the median ISE of the nonparametric estimator is smaller than that of the GSS estimator: skewing function π₂(z) with NSR=0.5, Laplace measurement error, and sample size n=200. (The same holds true for sample sizes n=50 and 100 in Table 6.) However, the equivalent scenario with sample size n=500 has the GSS estimator with smaller median ISE. This possibly indicates the effect of estimating the location and scale parameters in smaller samples when large amounts of heavier-tailed-than-normal measurement error are present. Overall, the GSS deconvolution estimator performs very well. Thus, the additional structure being imposed through the a priori specification of the symmetric pdf f₀(z) can result in a large decrease in ISE.

Bandwidth estimation

The next simulation study investigated the two proposed bandwidth estimation approaches. Specifically, the CV and MISE bandwidths as well as the two-stage plug-in (PI) bandwidth of Delaigle and Gijbels (2002), originally developed for nonparametric deconvolution, were implemented. For each simulated sample, the ISE was calculated. When necessary, GMM solution selection with phase-function matching was used. The nonparametric deconvolution estimator with PI bandwidth was also calculated; corresponding results are included for reference purposes. The median ISE values for the methods are summarized in Tables 2 and 3 for sample sizes n∈{200,500}, and in Tables 7 and 8 in Appendix A.5 for sample sizes n∈{50,100}.

Table 2 Median of 100×ISE for the GSS deconvolution estimators with CV, MISE, and PI bandwidths, and the nonparametric (NP) estimator with PI bandwidth. Sample size n=200

Table 3 Median of 100×ISE for the GSS deconvolution estimators with CV, MISE, and PI bandwidths, and the nonparametric (NP) estimator with PI bandwidth. Sample size n=500

In Tables 2 and 3, it is seen that there is no consistent “best” bandwidth method. For skewing functions π₀ (the symmetric case) and π₂, the PI bandwidth generally has smallest median ISE. In these same scenarios, MISE frequently (but by no means consistently) outperforms CV. For π₁(z) the MISE bandwidth performs best. In all simulation settings, there is a GSS bandwidth method that results in better performance than the nonparametric estimator. These same conclusions broadly hold for sample sizes n∈{50,100} in Appendix A.5.

The results presented above were restricted to phase-function matching for the GMM estimators, as it was found to generally have better performance than skewness matching. For details of the simulation comparing the two GMM matching methods, see Appendix A.3.

GMM estimation

One other simulation study was performed, and considered the choice of M (the number of even moments to use) when evaluating the GMM estimators of (ξ,ω). These simulation results are presented in Appendix A.4. In summary, the larger value M=5 was generally seen to outperform M=2 for π₁(z) and π₂(z). In the symmetric π₀(z) case, M=2 performed slightly better than M=5. In all instances, root mean square error (RMSE) was used as criterion.

Data applications

Coal abrasiveness index data

Data from an industrial application, first considered by Lombard (2005), are analyzed here. The data were obtained by taking batches of coal, splitting them in two, and randomly allocating each of the half-batches to one of two methods used to measure the abrasiveness index (AI) of coal, a measure of the quality of coal. The observed data consist of 98 pairs (w_1i,w_2i) assumed to be from a population with W_1i=X_i+U_1i and W_2i=μ+σ(X_i+U_2i). Here, X_i denotes the true AI of the ith batch, U_1i and U_2i denote measurement error, and constants μ and σ account for the two AI measurement methods being on different scales. Of interest is estimating f_x(x), the true density of AI. However, the data (w_1i,w_2i) first need to be combined in a sensible way.

To this end, let μ_w,k and $\sigma _{w,k}^{2}$ denote the mean and variance of the W_ki, k=1,2, and let μ_x and $\sigma _{x}^{2}$ denote the mean and variance of the X_i. Note that μ_w,1=μ_x, μ_w,2=μ+σμ_x, $\sigma _{w,1}^{2}=\sigma _{x}^{2}+\sigma _{u}^{2}$, and $\sigma _{w,2}^{2}=\sigma ^{2}\left (\sigma _{x}^{2}+\sigma _{u}^{2}\right)$. By replacing the population moments with their sample counterparts, estimators $\hat {\sigma }=s_{w,2}/s_{w,1}=0.679$ and $\hat {\mu }=\bar {w}_{2} - \hat {\sigma }\bar {w}_{1}=59.503$ are obtained. Here, $(\bar {w}_{1},s_{w,1})$ denote the sample mean and standard deviation of the observed w₁-data with similar definitions holding for the w₂-quantities. Now, the paired observations are combined as $w_{i} =0.5w_{1i}+0.5\left (w_{2i}-\hat {\mu }\right)/\hat { \sigma }$. At the population level this corresponds to W_i≈X_i+0.5(U_1i+U_2i):=X_i+ε_i. An estimate of the measurement error variance $\sigma _{\varepsilon }^{2}$ is obtained by calculating $\hat {\sigma }_{u}^{2}=(2n)^{-1}\sum \left [ W_{1i}-\left (W_{2i}-\hat {\mu }\right)/\hat {\sigma } \right ]^{2}=174.6$ and noting that $\hat {\sigma }_{\varepsilon }^{2}=174.6/2=87.3$. This corresponds to the W_i having noise-to-signal ratio NSR=16.35%.
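The combination step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `combine_pairs` is hypothetical, sample standard deviations are computed with the n−1 divisor (the text does not say which divisor was used), and the coal data themselves are proprietary, so no attempt is made to reproduce the reported values.

```python
import numpy as np

def combine_pairs(w1, w2):
    """Combine paired measurements taken on two different scales.

    Assumes W1 = X + U1 and W2 = mu + sigma * (X + U2), as in the text.
    Returns the combined observations and the moment-based estimates.
    """
    w1 = np.asarray(w1, dtype=float)
    w2 = np.asarray(w2, dtype=float)
    n = w1.size

    # Moment estimators of the scale and shift linking the two methods
    # (assumption: sample standard deviations use the n - 1 divisor).
    sigma_hat = w2.std(ddof=1) / w1.std(ddof=1)
    mu_hat = w2.mean() - sigma_hat * w1.mean()

    # Put the second measurement on the scale of the first, then average.
    w2_rescaled = (w2 - mu_hat) / sigma_hat
    w = 0.5 * w1 + 0.5 * w2_rescaled

    # Variance of a single measurement error, then of the averaged error.
    var_u = np.sum((w1 - w2_rescaled) ** 2) / (2 * n)
    var_eps = var_u / 2
    return w, sigma_hat, mu_hat, var_u, var_eps
```

Because the rescaled second measurement has the same sample mean as the first by construction, the combined w_i retain the sample mean of the w₁-data.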

The GSS deconvolution estimator for f_x(x) is now calculated assuming a normal symmetric component, f₀(z)=ϕ(z), along with a Laplace distribution for the measurement error ε. (The equivalent estimator assuming normal measurement error was also calculated and is nearly identical in shape.) GMM with M=5 gives solution pairs $(\hat {\xi } _{1},\hat {\omega }_{1}) =\left (192.88,29.90\right) $ and $(\hat {\xi }_{2},\hat {\omega }_{2}) =\left (230.41,32.43\right) $. For each of these, the corresponding skewing function estimate $\tilde {\pi }_{j}(z)$ and phase function distance R_j was calculated, the latter using weight function w(t)=[1−(t/t^∗)²]³ for t∈[−t^∗,t^∗] and t^∗=0.06. Here, R₁=0.023<0.046=R₂ and therefore solution $(\hat {\xi } _{1},\hat {\omega }_{1})$ with estimated skewing function $\tilde {\pi }_{1}(z)$ was selected. Skewness matching resulted in selection of the same solution. Figure 2 shows a kernel density estimator of f_w(w), the density of the contaminated W_i, as well as the GSS deconvolution estimator of f_x(x) with MISE bandwidth $\tilde {h}=0.102$.

This application illustrates one of the less appealing aspects of the GSS approach sometimes encountered in smaller samples. Note the sharp “edge” in the GSS estimator around x=225. This is an artefact of the hard truncation applied when calculating the range-respecting skewing function estimate $\tilde {\pi }(z)$. The resulting density estimate is not differentiable at this point. This is equivalent to non-differentiable points in the nonparametric deconvolution estimator when it is truncated to be positive.

Systolic blood pressure data

The data here are a subset of n=1615 male participants in the Framingham Heart Study, see for example Carroll et al. (2006) for more detail. The data consist of systolic blood pressure measurements from two patient exams (the second and third exams in the study). At each exam, two replicate measurements were obtained giving data (SBP₂₁,SBP₂₂,SBP₃₁,SBP₃₂). Let P₁=(SBP₂₁+SBP₂₂)/2 and P₂=(SBP₃₁+SBP₃₂)/2 denote the average systolic blood pressure observed at each of the exams, and calculate transformed variables W_j= log(P_j−50), j=1,2, as suggested by Carroll et al. (2006). This is done to adjust for the large skewness present in the data. The measurement W=(W₁+W₂)/2=X+U is a surrogate for the true long-term average systolic blood pressure X (on the transformed logarithmic scale). Using the replicates (W₁,W₂), estimated standard deviations $\hat {\sigma }_{x}=0.1976$ and $\hat {\sigma }_{u}=0.0802$ are obtained.
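One way the replicate-based standard deviations could be computed is sketched below. The text does not spell out the estimator, so this is an assumption: it supposes W_j=X+U_j with the U_j iid and independent of X, so that Var(W₁−W₂)=2Var(U_j) and the averaged error U=(U₁+U₂)/2 has variance Var(W₁−W₂)/4. The function names are hypothetical.

```python
import numpy as np

def transform_exam(sbp_a, sbp_b):
    """Average two replicate readings from one exam and apply log(P - 50)."""
    p = 0.5 * (np.asarray(sbp_a, dtype=float) + np.asarray(sbp_b, dtype=float))
    return np.log(p - 50.0)

def replicate_sd_estimates(w1, w2):
    """Estimate sd(X) and sd(U) for W = (W1 + W2) / 2 = X + U.

    Assumes W_j = X + U_j with iid errors independent of X (an assumption
    made for this sketch, not a statement of the paper's exact estimator).
    """
    w1 = np.asarray(w1, dtype=float)
    w2 = np.asarray(w2, dtype=float)
    w = 0.5 * (w1 + w2)

    # Var(W1 - W2) = 2 Var(U_j) and Var(U) = Var(U_j) / 2.
    var_u = np.var(w1 - w2, ddof=1) / 4.0
    # Var(W) = Var(X) + Var(U); truncate at zero to keep the estimate valid.
    var_x = max(np.var(w, ddof=1) - var_u, 0.0)
    return w, np.sqrt(var_x), np.sqrt(var_u)
```

The returned `w` is the surrogate measurement that would then be passed to the deconvolution estimator.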

The GSS deconvolution estimator assuming a Laplace distribution for the measurement error U and using a normal reference density f₀(z)=ϕ(z) was computed. GMM with M=5 resulted in only one solution, $(\hat {\xi },\hat {\omega })=(4.429,0.210)$, and therefore no selection was needed. Figure 3 displays both the GSS deconvolution estimator and the nonparametric deconvolution estimator, both with PI bandwidths.

The nonparametric deconvolution estimator has previously been applied to the Framingham Heart Study. It is therefore reassuring that the GSS estimator is not dissimilar in appearance.

Conclusion

In this paper, the density deconvolution problem is considered for variables belonging to the family of generalized skew-symmetric (GSS) distributions. Implementation requires both the estimation of location and scale parameters (ξ,ω), and the estimation of a skewing function π(z). Estimation methods are proposed for both of these quantities, and extensive simulation studies are performed. In simulation studies performed, the GSS deconvolution estimator is generally seen to result in large improvements over the nonparametric deconvolution estimator (using median ISE as criterion).

There are still several questions related to GSS deconvolution that can be considered. Firstly, the estimator requires the specification of a known symmetric component f₀(z). While this is done to ensure model identifiability, it would be possible to consider several candidate symmetric densities and choose the “best” among these. The related goodness-of-fit testing problem for a specified symmetric component can also be explored. Secondly, it should be noted that the contaminated W also has a GSS distribution. An alternative modeling approach could therefore estimate the pdf of W directly and then recover the pdf of X. Lastly, it was observed in the simulation study that the nonparametric deconvolution kernel estimator in a few isolated instances had superior performance to the GSS estimator under selection, while the GSS estimator had better performance under oracle conditions for the same simulation configurations. This suggests that further refinement of the bandwidth calculation and solution selection procedure may be possible, and related work is ongoing.

Appendix

A.1 Generalized skew-symmetric representation

Here, it is established that any continuous random variable has a non-unique representation as a GSS distribution. This motivates, in part, the need to assume a parametric form for pdf f₀(z) when doing estimation. Let Y be a continuous random variable with pdf f_y(y) and let ξ be a real number. Furthermore, let B be a Bernoulli(p=0.5) random variable, and define new random variables D_ξ=|Y−ξ| and T=BD_ξ−(1−B)D_ξ. The random variable T is symmetric about 0 and has pdf f_t(t)=(1/2)[f_y(ξ+t)+f_y(ξ−t)]. Next, define

$$\pi_{t}(t)=\frac{1}{2}\frac{f_{y}(\xi+t)}{f_{t}(t)}=\frac{f_{y}(\xi+t)}{f_{y}(\xi+t)+f_{y}(\xi-t)}$$

and note that π_t(t) satisfies 0≤π_t(t)=1−π_t(−t)≤1. By construction, it follows that f_y(y) can be expressed as f_y(y)=2f_t(y−ξ)π_t(y−ξ). Assuming that Y has finite variance, the variance of T is given by $\omega _{\xi }^{2} = \int _{\mathbb {R}}t^{2}f_{t}(t)dt$. Then, letting f_ξ(z)=ω_ξf_t(ω_ξz) and π_ξ(z)=π_t(ω_ξz), so that f_ξ is the unit-variance density of T/ω_ξ, it is possible to write

$$f_{y}(y)=\frac{2}{\omega_{\xi}} f_{\xi}\left(\frac{y-\xi}{\omega_{\xi}}\right)\pi_{\xi}\left(\frac{y-\xi}{\omega_{\xi}}\right).$$

This representation does not depend on a specific value for ξ and, as such, holds for every ξ. However, each value of ξ is associated with a different symmetric component f_ξ(z) and skewing function π_ξ(z). As such, there is a family of distributions f_ξ(z) symmetric about 0 and with unit variance such that the random variable Y can be expressed as a GSS distribution with symmetric component belonging to this family. The work in this paper is motivated by the assumption that it is possible to correctly specify one symmetric distribution in the family f_ξ(z).
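The construction above can be checked numerically. The sketch below is illustrative only: the density `f_y` is a hypothetical two-component normal mixture chosen for this example, and the standardized components are taken to be f_ξ(z)=ω_ξf_t(ω_ξz) and π_ξ(z)=π_t(ω_ξz), the scaling under which f_ξ has unit variance. For two different values of ξ it rebuilds f_y from the GSS representation and reports the largest discrepancy.

```python
import numpy as np

def norm_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def f_y(y):
    # Hypothetical asymmetric density (two-component normal mixture).
    return 0.7 * norm_pdf(y, 0.0, 1.0) + 0.3 * norm_pdf(y, 2.5, 0.6)

def gss_representation(xi, half_width=15.0, num=60001):
    """Return omega_xi and the standardized pair (f_xi, pi_xi) for a given xi."""
    # Symmetrized density of T and the matching skewing function.
    def f_t(t):
        return 0.5 * (f_y(xi + t) + f_y(xi - t))

    def pi_t(t):
        return f_y(xi + t) / (f_y(xi + t) + f_y(xi - t))

    # Standard deviation of T from a Riemann sum on a uniform grid.
    t = np.linspace(-half_width, half_width, num)
    dt = t[1] - t[0]
    omega = np.sqrt(np.sum(t ** 2 * f_t(t)) * dt)

    # Unit-variance symmetric component and rescaled skewing function.
    def f_xi(z):
        return omega * f_t(omega * z)

    def pi_xi(z):
        return pi_t(omega * z)

    return omega, f_xi, pi_xi

def max_representation_error(xi, y):
    """Largest absolute gap between f_y and its GSS reconstruction on y."""
    omega, f_xi, pi_xi = gss_representation(xi)
    z = (y - xi) / omega
    rebuilt = (2.0 / omega) * f_xi(z) * pi_xi(z)
    return np.max(np.abs(rebuilt - f_y(y)))

y_grid = np.linspace(-5.0, 6.0, 221)
for xi in (0.0, 1.0):
    omega, _, _ = gss_representation(xi)
    print(xi, omega, max_representation_error(xi, y_grid))
```

Each ξ yields a different ω_ξ and a different pair (f_ξ, π_ξ), yet both reproduce the same f_y, which is the non-uniqueness the argument describes.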

A.2 MISE derivation

To derive an expression for the mean integrated square error (MISE), consider the estimator $\hat {s}_{0}(t)$ defined in (5). Recall that $\mathrm {E}\left [\hat {s}_{0}(t)\right ] = \psi _{k}(ht)s_{0}(t)$. Additionally, it has covariance structure

$$\begin{aligned} \text{Cov}\left[\hat{s}_{0}(t_{1}),\hat{s}_{0}(t_{2})\right] &= \frac{\psi_{k}(ht_{1})\psi_{k}(ht_{2})}{n} \\ &\times\left[\frac{c_{0}(t_{1}-t_{2})\psi_{u}[(t_{1}-t_{2})/\omega]-c_{0}(t_{1}+t_{2})\psi_{u}[(t_{1}+t_{2})/\omega]}{2\psi_{u}(t_{1}/\omega)\psi_{u}(t_{2}/\omega)}-s_{0}(t_{1})s_{0}(t_{2})\right]. \end{aligned} $$

The integrated squared error (ISE) of the GSS estimator can now be expressed in terms of $\hat {s}_{0}(t)$,

$$\begin{array}{@{}rcl@{}} \text{ISE} &=& \int_{\mathbb{R}} \left[\tilde{f}_{z}(z)-f_{z}(z)\right]^{2}dz \\ &=& \frac{1}{2\pi} \int_{\mathbb{R}} \left|\hat{\psi}_{z}(t)-\psi_{z}(t)\right|^{2}dt \\ &=& \frac{1}{2\pi} \int_{\mathbb{R}} \left[\hat{s}_{0}(t)-s_{0}(t)\right]^{2}dt \end{array} $$

where the second equality is an application of Parseval’s identity, and the third follows upon noting that the estimated characteristic function $\hat {\psi }_{z}(t)$ and true characteristic function ψ_z(t) have common real component c₀(t) which therefore cancels out, leaving only the estimated and true imaginary components. Also note that ISE is a function of the bandwidth h through $\hat {s}_{0}(t)$. Now, MISE=E[ISE] can be evaluated using the expectation and covariance functions associated with $\hat {s}_{0}(t)$, in the latter setting t₁=t₂=t. Eq. 10 follows.
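For completeness, the final step can be sketched as follows. This is a reconstruction from the mean and covariance expressions stated in this appendix rather than a quotation of the main text. It uses the bias-variance decomposition of $\mathrm {E}\{[\hat {s}_{0}(t)-s_{0}(t)]^{2}\}$, assumes that expectation and integration may be exchanged, and uses c₀(0)=1 and ψ_u(0)=1, which hold because c₀(t) is the real part of a characteristic function and ψ_u(t) is itself a characteristic function:

$$\begin{aligned} \text{MISE} &= \frac{1}{2\pi}\int_{\mathbb{R}} \left\{\text{Var}\left[\hat{s}_{0}(t)\right] + \left[\psi_{k}(ht)-1\right]^{2}s_{0}^{2}(t)\right\}dt, \\ \text{Var}\left[\hat{s}_{0}(t)\right] &= \frac{\psi_{k}^{2}(ht)}{n}\left[\frac{1-c_{0}(2t)\psi_{u}(2t/\omega)}{2\psi_{u}^{2}(t/\omega)}-s_{0}^{2}(t)\right]. \end{aligned} $$

The first term inside the integral is the variance contribution, which grows as h decreases, and the second is the squared bias, which vanishes as h→0 since ψ_k(0)=1. The resulting expression should agree with Eq. 10 up to rearrangement.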

A.3 GMM estimators simulation

The performance of GMM estimation of (ξ,ω) was evaluated in a simulation study. Data were simulated as described in the main paper. For each simulated dataset, the estimators minimizing D(ξ,ω) were obtained for both M=2 and M=5 even moments. While the sixth, eighth and tenth sample moments used for the M=5 setting arguably contain additional information, there is a great deal of added variability introduced when estimating these higher order moments. This simulation explored the benefits, if any, of doing so. In simulated samples where multiple solutions $(\hat {\xi }_{j},\hat {\omega }_{j})$, j=1,…,J were obtained, the existence of an oracle able to choose the solution closest to the true value (0,1) (as measured using Euclidean distance) was assumed. A total of N=1000 samples were generated for each simulation configuration. Root mean square error (RMSE) was used as criterion, and the results are shown in Table 4.
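The oracle selection rule and the RMSE criterion described above can be illustrated with a short sketch. The helper names `oracle_pick` and `rmse` are hypothetical, and the GMM step producing the candidate solutions is not shown; the code only covers choosing the candidate closest to the truth in Euclidean distance and summarizing the selected estimates across simulated samples.

```python
import numpy as np

def oracle_pick(solutions, truth=(0.0, 1.0)):
    """Return the candidate (xi, omega) closest to the true value."""
    sols = np.atleast_2d(np.asarray(solutions, dtype=float))
    dist = np.linalg.norm(sols - np.asarray(truth, dtype=float), axis=1)
    return sols[np.argmin(dist)]

def rmse(estimates, truth=(0.0, 1.0)):
    """Componentwise RMSE of selected estimates, one row per simulated sample."""
    est = np.atleast_2d(np.asarray(estimates, dtype=float))
    return np.sqrt(np.mean((est - np.asarray(truth, dtype=float)) ** 2, axis=0))

# Example: two candidate solutions from one hypothetical sample.
print(oracle_pick([(0.5, 1.2), (0.1, 0.9)]))
```

In a full simulation, `oracle_pick` would be applied once per simulated sample and `rmse` to the resulting N selected pairs.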

Table 4 RMSE for GMM estimators, N=Normal, L=Laplace


In the setting with X normal, i.e. using π₀(z), using M=5 moments results in a small increase in RMSE compared to the case M=2. The average increase in RMSE for ξ is 1.2% and for ω is 9.5% across the settings considered. On the other hand, the simulation results for skewing functions π₁(z) and π₂(z) look very different. Here, the RMSE for ξ decreases for both skewing functions, and the RMSE for ω decreases for skewing function π₂(z). Also, the average RMSE of ω for π₂(z) remains unchanged across the simulation settings considered. One possible reason for the increase in RMSE in the symmetric case is that the underlying distribution is normal and therefore higher-order moments do not contain any “extra” information about the distribution. On the other hand, for π₁(z) and π₂(z) there is a substantive departure from normality and the higher-order sample moments, despite their large variability, do contain useful information about the underlying distribution. As the increase in RMSE in the symmetric case is relatively small compared to the decrease in the asymmetric cases, the paper uses the GMM estimators with M=5 in all other simulations.

A.4 Solution selection simulation

The simulation results comparing the performance of the skewness matching and phase function distance solution selection mechanisms follow here. Data were generated as described in the “Simulation studies” section of the main paper. For each simulated sample, all GMM solutions $(\hat {\xi }_{j},\hat {\omega }_{j})$, j=1,…,J were obtained. Solution selection was then implemented for both skewness matching and phase function matching. These techniques require a bandwidth to be selected. The simulation implemented CV, MISE, and PI bandwidth selection. However, the conclusions with regard to selection methods were very similar for these and therefore only MISE bandwidth results are included here. To contextualize these results from selection, results corresponding to an oracle able to choose the solution with smallest ISE are also reported, as well as a blind selection approach randomly selecting one of the GMM solutions.

The simulation results are summarized in Table 5. In this table, the median ISE of skewness matching and phase function distance are given in the columns SKW and PHS. The column MIN contains the median ISE for the oracle selecting the solution with smallest ISE, while RND contains the median ISE of randomly selecting one of the GMM solutions. Finally, the median ISE of the nonparametric deconvolution estimator with PI bandwidth is given in column NP for reference purposes.

Table 5 Median of 100×ISE for GSS estimator with MISE bandwidth


Table 6 Median of 100×ISE, as well as first and third quartiles [ Q₁,Q₃] for the oracle GSS and nonparametric (NP) deconvolution estimators, sample sizes n=50,100


Inspection of Table 5 shows that estimation under both the skewness and phase function matching generally performs better than the fully nonparametric estimator, with the exception being the combination of skewing function π₂(z) and Laplace measurement error. However, as the GSS estimator outperformed the nonparametric estimator under an oracle bandwidth as seen in Table 1 of the main paper, this does suggest that further improvement of the GSS estimator may still be possible by refining parameter estimation and bandwidth selection – this is ongoing work. Further inspection of Table 5 shows that estimation under both the skewness and phase function matching performs better than random selection, with the exception that random selection outperforms the skewness matching for π₁(z) and normal measurement error. While there are a few instances where skewness matching outperformed phase function matching, the latter generally has very good performance and comes close to the best possible performance of the minimum ISE under oracle selection.

A.5 Supplemental simulation results

This subsection contains two sets of supplemental simulation results. The first of these, found in Table 6, pertains to a comparison of oracle estimators for sample sizes n∈{50,100}. The second of these, found in Tables 7 and 8, pertains to comparing bandwidth estimation methods for sample sizes n∈{50,100}. The conclusions that can be drawn from these results are consistent with those discussed in the “Simulation studies” section of the main paper, and are included here for completeness.

Table 7 Median of 100×ISE for the GSS deconvolution estimators with CV, MISE, and PI bandwidths, and the nonparametric (NP) estimator with PI bandwidth. Sample size n=50


Table 8 Median of 100×ISE for the GSS deconvolution estimators with CV, MISE, and PI bandwidths, and the nonparametric (NP) estimator with PI bandwidth. Sample size n=100


Availability of data and materials

The coal abrasiveness index data are discussed in Lombard (2005). These data are proprietary, and cannot be released publicly. The systolic blood pressure data are discussed in Carroll et al. (2006) and constitute a subset of the Framingham Heart Study. The subset of data used in this paper is publicly available in the R package decon.

Abbreviations

CV:: Cross-validation
GMM:: Generalized method of moments
GSS:: Generalized skew-symmetric
iid:: Independent and identically distributed
ISE:: Integrated squared error
MISE:: Mean integrated squared error
pdf:: Probability density function
PI:: Plug-in
RMSE:: Root mean square error

References

Arellano-Valle, R. B., Ozan, S., Bolfarine, H., Lachos, V.: Skew normal measurement error models. J. Multivar. Anal. 96(2), 265–281 (2005).
Arellano-Valle, R. B., Azzalini, A., Ferreira, C. S., Santoro, K.: A two-piece normal measurement error model. Comput. Stat. Data Anal. 144, 106863 (2020).
Azzalini, A.: A class of distributions which includes the normal ones. Scand. J. Stat. 12, 171–178 (1985).
Azzalini, A., Genton, M. G., Scarpa, B.: Invariance-based estimating equations for skew-symmetric distributions. Metron. 68(3), 275–298 (2010).
Azzalini, A.: The Skew-normal and Related Families. Cambridge University Press, New York (2013).
Carroll, R. J., Hall, P.: Optimal rates of convergence for deconvolving a density. J. Am. Stat. Assoc. 83(404), 1184–1186 (1988).
Carroll, R. J., Ruppert, D., Stefanski, L. A., Crainiceanu, C. M.: Measurement Error in Nonlinear Models: a Modern Perspective. CRC Press, Boca Raton (2006).
Chu, K. K., Wang, N., Stanley, S., Cohen, N. D.: Statistical evaluation of the regulatory guidelines for use of furosemide in race horses. Biometrics. 57(1), 294–301 (2001). https://doi.org/10.1111/j.0006-341x.2001.00294.x.
Delaigle, A., Gijbels, I.: Estimation of integrated squared density derivatives from a contaminated sample. J. R. Stat. Soc. Ser. B Stat. Methodol. 64(4), 869–886 (2002).
Delaigle, A., Gijbels, I.: Practical bandwidth selection in deconvolution kernel density estimation. Comput. Stat. Data Anal. 45(2), 249–267 (2004).
Delaigle, A., Hall, P.: Using simex for smoothing-parameter choice in errors-in-variables problems. J. Am. Stat. Assoc. 103(481), 280–287 (2008).
Delaigle, A., Hall, P., Meister, A.: On deconvolution with repeated measurements. Ann. Stat. 36(2), 665–685 (2008). https://doi.org/10.1214/009053607000000884.
Delaigle, A., Hall, P.: Parametrically assisted nonparametric estimation of a density in the deconvolution problem. J. Am. Stat. Assoc. 109(506), 717–729 (2014).
Delaigle, A., Hall, P.: Methodology for non-parametric deconvolution when the error distribution is unknown. J. R. Stat. Soc. Ser. B Stat. Methodol. 78(1), 231–252 (2016).
Diggle, P. J., Hall, P.: A fourier approach to nonparametric deconvolution of a density estimate. J. R. Stat. Soc. Ser. B Methodol. 55(2), 523–531 (1993). https://doi.org/10.1111/j.2517-6161.1993.tb01920.x.
Fan, J.: Asymptotic normality for deconvolution kernel density estimators. Sankhyā: Indian J. Stat. Ser. A. 53(1), 97–110 (1991a).
Fan, J.: On the optimal rates of convergence for nonparametric deconvolution problems. Ann. Stat. 19(3), 1257–1272 (1991b).
Fan, J., Truong, Y. K.: Nonparametric regression with errors in variables. Ann. Stat. 21(4), 1900–1925 (1993). https://doi.org/10.1214/aos/1176349402.
Genton, M. G. (ed.): Skew-elliptical Distributions and Their Applications: a Journey Beyond Normality. CRC Press, Boca Raton (2004).
Guolo, A.: A flexible approach to measurement error correction in case–control studies. Biometrics. 64(4), 1207–1214 (2008).
Kahrari, F., Ferreira, C., Arellano-Valle, R.: Skew-normal-cauchy linear mixed models. Sankhya B. 81(2), 185–202 (2019).
Kim, H. -M., Maadooliat, M., Arellano-Valle, R. B., Genton, M. G.: Skewed factor models using selection mechanisms. J. Multivar. Anal. 145, 162–177 (2016).
Lachos, V., Labra, F., Bolfarine, H., Ghosh, P.: Multivariate measurement error models based on scale mixtures of the skew–normal distribution. Statistics. 44(6), 541–556 (2010).
Lombard, F.: Nonparametric confidence bands for a quantile comparison function. Technometrics. 47(3), 364–371 (2005).
Ma, Y., Genton, M. G., Tsiatis, A. A.: Locally efficient semiparametric estimators for generalized skew-elliptical distributions. J. Am. Stat. Assoc. 100(471), 980–989 (2005).
Neumann, M. H., Hössjer, O.: On the effect of estimating the error density in nonparametric deconvolution. J. Nonparametric Stat. 7(4), 307–330 (1997).
Nghiem, L., Potgieter, C. J.: Density estimation in the presence of heteroscedastic measurement error of unknown type using phase function deconvolution. Stat. Med. 37(25), 3679–3692 (2018).
Potgieter, C. J., Genton, M. G.: Characteristic function-based semiparametric inference for skew-symmetric models. Scand. J. Stat. 40(3), 471–490 (2013).
Stefanski, L. A., Carroll, R. J.: Deconvolving kernel density estimators. Statistics. 21(2), 169–184 (1990).
Van Oost, K., Van Muysen, W., Govers, G., Heckrath, G., Quine, T., Poesen, J.: Simulation of the redistribution of soil by tillage on complex topographies. Eur. J. Soil Sci. 54(1), 63–76 (2003).
Wang, W. -L., Liu, M., Lin, T. -I.: Robust skew-t factor analysis models for handling missing data. Stat. Methods Appl. 26(4), 649–672 (2017).

Acknowledgements

Not applicable.

Funding

The author has no funding sources to declare.

Author information

Authors and Affiliations

Department of Mathematics, Texas Christian University, Fort Worth, TX, USA
Cornelis J. Potgieter
Department of Statistics, University of Johannesburg, Johannesburg, South Africa
Cornelis J. Potgieter

Authors

Cornelis J. Potgieter

Contributions

This is a single-author paper and all research and writing was conducted by the author. The author read and approved the final manuscript.

Corresponding author

Correspondence to Cornelis J. Potgieter.

Ethics declarations

Competing interests

The author declares that he has no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Potgieter, C.J. Density deconvolution for generalized skew-symmetric distributions. J Stat Distrib App 7, 2 (2020). https://doi.org/10.1186/s40488-020-00103-y


Received: 16 September 2019
Accepted: 09 July 2020
Published: 23 July 2020
DOI: https://doi.org/10.1186/s40488-020-00103-y
