Density deconvolution for generalized skew-symmetric distributions
Journal of Statistical Distributions and Applications volume 7, Article number: 2 (2020)
Abstract
The density deconvolution problem is considered for random variables assumed to belong to the generalized skew-symmetric (GSS) family of distributions. The approach is semiparametric in that the symmetric component of the GSS distribution is assumed known, and the skewing function capturing deviation from the symmetric component is estimated using a deconvolution kernel approach. This requires the specification of a bandwidth parameter. The mean integrated square error (MISE) of the GSS deconvolution estimator is derived, and two bandwidth estimation methods based on approximating the MISE are also proposed. A generalized method of moments approach is also developed for estimation of the underlying GSS location and scale parameters. Simulation study results are presented, including a comparison of the GSS approach to the nonparametric deconvolution estimator. For most simulation settings considered, the GSS estimator is seen to have performance superior to the nonparametric estimator.
Introduction
The density deconvolution problem arises when it is of interest to estimate the probability density function (pdf) f_{x}(x) of a random variable X using observations contaminated by measurement error. Specifically, the observed sample consists of data W_{j}=X_{j}+U_{j}, j=1,…,n, where the X_{j} are independent and identically distributed (iid) random variables with pdf f_{x}(x) and the U_{j} are iid measurement error variables with pdf f_{u}(u). This paper presents a semiparametric approach for estimating f_{x}(x) that assumes X belongs to the class of generalized skew-symmetric (GSS) distributions. The GSS deconvolution model for X specifies a base symmetric distribution, providing the basic structure for the model. Thereafter, kernel methodology is used to estimate a skewing function that captures the deviation from the specified symmetric distribution. This semiparametric GSS approach attempts to capture the best of a parametric and a nonparametric solution and provides a very flexible approach for modeling f_{x}(x).
The problem of estimating f_{x}(x) from a contaminated sample W_{1},…,W_{n} was first considered by Carroll and Hall (1988) and Stefanski and Carroll (1990), who proposed a fully nonparametric solution under the assumption of a fully known measurement error distribution f_{u}(u). Since then, much work on the topic has followed. Fan (1991a) and Fan (1991b) considered the theoretical properties of the density deconvolution estimator, and Fan and Truong (1993) extended the methodology to nonparametric regression. Diggle and Hall (1993) and Neumann and Hössjer (1997) considered the case of the measurement error distribution being unknown, and assumed that an external sample of error data was available to estimate the measurement error distribution. Delaigle et al. (2008) considered how replicate data can be used to estimate the characteristic function of the measurement error. The nonparametric estimator requires the selection of a bandwidth parameter. The two-stage plug-in bandwidth of Delaigle and Gijbels (2002) has become the gold standard in application; Delaigle and Gijbels (2004) provides an overview of several popular bandwidth selection approaches. Delaigle and Hall (2008) considered the use of simulation-extrapolation (SIMEX) for bandwidth selection in a variety of measurement error problems.
Two more recent papers considered the deconvolution problem in novel ways. Delaigle and Hall (2014) considered parametrically-assisted nonparametric density deconvolution, while the groundbreaking work of Delaigle and Hall (2016) made use of the empirical phase function to estimate the pdf f_{x}(x) with the measurement error having unknown distribution and without the need for replicate data. The phase function approach imposes the restrictions that X has no symmetric component and that the characteristic function of U is real-valued and strictly positive.
The GSS family of distributions that is the basis for estimation in this paper dates back to Azzalini (1985), the first publication discussing a so-called skew-normal distribution. There has been a great deal of activity since, with the monographs by Genton (2004) and Azzalini (2013) providing a good overview of the existing literature on the topic. Much of the GSS research has been theoretical in nature. While this theoretical work is important for understanding the statistical properties of GSS distributions, the applied value of this family has not often been realized in the literature. Notable exceptions that have used GSS distributions in application include the modeling of pharmacokinetic data, see Chu et al. (2001), the redistribution of soil in tillage, see Van Oost et al. (2003), and the retrospective analysis of case-control studies, see Guolo (2008). All of these authors considered fully parametric models. Arellano-Valle et al. (2005) considered a fully parametric measurement error model assuming both X and U follow skew-normal distributions. Lachos et al. (2010) modeled X using a scale-mixture of skew-normal distributions while assuming U is a mixture of normals. Furthermore, both Kim et al. (2016) and Wang et al. (2017) consider factor analysis models using skew-symmetric distributions. Most recently, Kahrari et al. (2019) developed linear mixed models using a skew-normal-Cauchy distribution and Arellano-Valle et al. (2020) considered the measurement error problem using a two-piece normal distribution to allow for skewness. No other work applying GSS distributions in the measurement error context was found.
The present paper is structured as follows. In the next section, the GSS deconvolution estimator is developed and some of its theoretical properties derived. In the subsequent section, bandwidth estimation methods for the skewing function are considered. Thereafter, a generalized method of moments (GMM) approach for estimating the GSS location and scale parameters is developed. The penultimate section presents simulation results, and the paper concludes with two real-data applications. An Appendix contains both some technical arguments and additional simulation results.
Generalized skew-symmetric deconvolution
Derivation of the GSS estimator
Consider the problem of estimating the probability density function (pdf) f_{x}(x) associated with random variable X based on a sample contaminated by additive measurement error, W_{j}=X_{j}+U_{j}, j=1,…,n. Here, the X_{j} are the true measurements of interest, and the W_{j} and U_{j} represent, respectively, the contaminated observation and the measurement error. It is assumed that the X_{j} are iid with pdf f_{x}(x), the U_{j} are iid with pdf f_{u}(u), and X_{j} and U_{j} are mutually independent for all j. Furthermore, the U_{j} are assumed to have a symmetric distribution with mean 0 and variance \(\sigma _{u}^{2}\). As is typical in the deconvolution literature, the distribution of U_{j} is assumed fully known. Auxiliary data, when available, would make it possible to relax this assumption and estimate f_{u}(u); see for example Delaigle et al. (2008).
The deconvolution estimator developed here assumes that f_{x}(x) belongs to the GSS class of distributions. That is, X=ξ+ωZ with \(\xi \in \mathbb {R} \) and ω>0 denoting location and scale parameters, and with Z having pdf
\[ f_{z}(z) = 2 f_{0}(z) \pi (z), \quad z \in \mathbb {R}, \tag{1} \]
with f_{0}(z) a pdf symmetric around 0 and π(z), hereafter referred to as the skewing function, satisfying the constraint 0≤π(−z)=1−π(z)≤1. In fact, any function satisfying this constraint can be paired with any symmetric pdf f_{0}(z) and will result in (1) being a valid pdf. The corresponding pdf of X is f_{x}(x)=(2/ω)f_{0}[(x−ξ)/ω]π[(x−ξ)/ω].
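As a small numerical illustration of the GSS construction, the sketch below evaluates the density f_{x}(x)=(2/ω)f_{0}[(x−ξ)/ω]π[(x−ξ)/ω] and checks that it integrates to 1. The standard normal base density and the skewing function Φ(z^{3}−2z) are taken from the simulation section of this paper; the function names and the trapezoid rule are illustrative assumptions, not part of the original development.

```python
import math

def phi(z):
    """Standard normal pdf, used here as the symmetric base density f_0."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def skew_pi2(z):
    """Skewing function Phi(z^3 - 2z); it satisfies pi(-z) = 1 - pi(z)."""
    return Phi(z ** 3 - 2.0 * z)

def gss_pdf(x, xi=0.0, omega=1.0, f0=phi, skew=skew_pi2):
    """GSS density (2/omega) * f0(z) * pi(z) with z = (x - xi) / omega."""
    z = (x - xi) / omega
    return (2.0 / omega) * f0(z) * skew(z)

def trapezoid(f, lo, hi, m=4000):
    """Composite trapezoid rule with m subintervals (hypothetical helper)."""
    step = (hi - lo) / m
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, m):
        total += f(lo + i * step)
    return total * step

# The density should integrate to 1 for any valid skewing function.
total_mass = trapezoid(gss_pdf, -10.0, 10.0)
```

Swapping `skew_pi2` for any other function obeying π(−z)=1−π(z) leaves `total_mass` unchanged, which is the sense in which any such skewing function yields a valid pdf.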
The approach considered here is semiparametric in nature. The symmetric pdf f_{0}(z) is assumed known, but no parametric assumptions are made regarding the skewing function π(z). (In fact, if the symmetric component f_{0}(z) were not assumed known, the pdf f_{z}(z) would not be identifiable; see Appendix A.1 for details.) The base density f_{0}(z) provides the basic structure of the model, and the skewing function π(z) captures the deviation from the base model. Thus, the approach attempts to capture the best of a parametric and a nonparametric solution, and the GSS family provides a very flexible approach for modeling f_{z}(z).
GSS random variables have an invariance property under even transformations that is central to the development of the deconvolution estimator in the remainder of this section. Let Z be GSS according to (1) and let Z_{0} have symmetric pdf f_{0}(z). For any even function t(z), it holds that \(t(Z) \overset {d}{=} t(Z_{0})\) with \(\overset {d}{=}\) denoting equality in distribution; see Proposition 1.4 in Azzalini (2013). Thus, the distribution of t(Z) depends only on f_{0}(z) and not on π(z). Now, let ψ_{z}(t) denote the characteristic function of Z, and let c_{0}(t)=Re[ψ_{z}(t)] and s_{0}(t)=Im[ψ_{z}(t)] denote the real and imaginary components of ψ_{z}(t). The real component can be expressed as c_{0}(t)=E[cos(tZ)]. By the property of even transformation, it follows that c_{0}(t)=E[cos(tZ_{0})] which is the characteristic function associated with f_{0}(z).
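The invariance property can be checked numerically. The sketch below computes E[cos(tZ)] by quadrature for a GSS variable with normal base density and compares it with exp(−t^{2}/2), the characteristic function of the standard normal. The skewing function Φ(9.9625z) is borrowed from the simulation section; the helper names and quadrature settings are assumptions made for illustration.

```python
import math

def phi(z):
    """Standard normal pdf (the symmetric component f_0)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def real_cf(t, skew, lo=-10.0, hi=10.0, m=4000):
    """Trapezoid approximation of E[cos(tZ)] for Z with pdf 2*phi(z)*skew(z)."""
    step = (hi - lo) / m
    total = 0.0
    for i in range(m + 1):
        z = lo + i * step
        weight = 0.5 if i in (0, m) else 1.0
        total += weight * math.cos(t * z) * 2.0 * phi(z) * skew(z)
    return total * step

skewed = real_cf(1.5, lambda z: Phi(9.9625 * z))  # strongly skewed case
symmetric = real_cf(1.5, lambda z: 0.5)           # no skewing at all
target = math.exp(-0.5 * 1.5 ** 2)                # normal characteristic function
```

Both `skewed` and `symmetric` agree with `target`, illustrating that the real part c_{0}(t) carries no information about π(z).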
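Now, assume (ξ,ω) are known, and define W^{∗}=(W−ξ)/ω. Furthermore, observe that W^{∗}=Z+ω^{−1}U and therefore has characteristic function \(\psi _{w^{\ast }}(t) =\psi _{z}(t) \psi _{u}(t/\omega)\) where ψ_{u}(t) is the real-valued characteristic function of U. It follows that
\[ \mathrm {Re}[\psi _{w^{\ast }}(t)] = c_{0}(t) \psi _{u}(t/\omega) \tag{2} \]
and
\[ \mathrm {Im}[\psi _{w^{\ast }}(t)] = s_{0}(t) \psi _{u}(t/\omega). \tag{3} \]
The functions c_{0}(t) and ψ_{u}(t) in (2) and (3) are known while s_{0}(t) is unknown. Noting that f_{z}(z) can be expressed as
\[ f_{z}(z) = f_{0}(z) + \frac {1}{2\pi } \int _{\mathbb {R}} \sin (tz) s_{0}(t) dt, \tag{4} \]
it follows that an estimator of s_{0}(t) can be used to construct an estimator of f_{z}(z). To this end, for random sample W_{1},…,W_{n}, let \(W_{j}^{\ast } = (W_{j}-\xi)/\omega \) for j=1,…,n, and define
\[ \tilde {s}_{0}(t) = \frac {1}{n} \sum _{j=1}^{n} \frac {\sin (tW_{j}^{\ast })}{\psi _{u}(t/\omega)}. \]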
This empirical estimator, while unbiased for s_{0}(t), is not suitable for estimating f_{z}(z) when substituted in (4) as the integral diverges. This is attributable to the tail behavior of \(\tilde {s}_{0}(t) \). While s_{0}(t) converges to 0 as t→∞ for any continuous distribution, \(\tilde {s}_{0}(t)\) corresponds to an empirical measure and diverges as t→∞. This follows upon noting that the bounded periodic function \(n^{-1} \sum _{j} \sin (tW_{j}^{\ast })\) is divided by ψ_{u}(t/ω), with the latter decreasing to 0 as t increases.
Next, consider the “smoothed” estimator
\[ \hat {s}_{0}(t) = \frac {\psi _{k}(ht)}{n \psi _{u}(t/\omega)} \sum _{j=1}^{n} \sin (tW_{j}^{\ast }), \tag{5} \]
where ψ_{k}(t) is a nonnegative weight function symmetric around 0 and h is a bandwidth parameter. This estimator has expectation \(\mathrm {E}[\hat {s}_{0}(t)]=\psi _{k}(ht)s_{0}(t)\) and therefore is biased for s_{0}(t). However, it also has some desirable properties. Firstly, it is an odd function, \(\hat {s}_{0}(-t)=-\hat {s}_{0}(t)\) for all \(t \in \mathbb R\). Secondly, substitution of (5) into (4) results in the well-defined estimator for f_{z}(z),
\[ \hat {f}_{z}(z) = f_{0}(z) + \frac {1}{2\pi } \int _{\mathbb {R}} \sin (tz) \hat {s}_{0}(t) dt, \tag{6} \]
provided ψ_{k}(t) is chosen such that ψ_{k}(ht)/ψ_{u}(t/ω)→0 as t→∞. Choosing ψ_{k}(t) to be 0 outside a bounded interval will trivially satisfy this requirement.
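A minimal sketch of the smoothed estimator of s_{0}(t) follows. It assumes Laplace measurement error with known variance, so that ψ_{u}(t)=1/(1+σ_{u}^{2}t^{2}/2), and the compactly supported weight ψ_{k}(t)=(1−t^{2})^{3} on [−1,1]. That particular weight is an illustrative choice made here, not one prescribed by the text, and the function names are hypothetical.

```python
import math

def psi_u_laplace(t, sigma_u):
    """Characteristic function of a Laplace error with mean 0, variance sigma_u^2."""
    return 1.0 / (1.0 + 0.5 * sigma_u ** 2 * t ** 2)

def psi_k(t):
    """Assumed weight function (1 - t^2)^3 on [-1, 1], zero elsewhere."""
    return (1.0 - t * t) ** 3 if abs(t) <= 1.0 else 0.0

def s0_hat(t, w, xi, omega, sigma_u, h):
    """Smoothed estimate of the imaginary part s_0(t) of the cf of Z."""
    w_star = [(wj - xi) / omega for wj in w]  # standardized observations W*
    mean_sin = sum(math.sin(t * ws) for ws in w_star) / len(w_star)
    return psi_k(h * t) * mean_sin / psi_u_laplace(t / omega, sigma_u)

# Toy data (hypothetical values, for illustration only).
w_obs = [0.3, -1.2, 2.1, 0.8, 1.5]
value = s0_hat(0.7, w_obs, xi=0.0, omega=1.0, sigma_u=0.4, h=0.5)
```

Because ψ_{k} vanishes for |ht|>1, the estimate is exactly zero beyond |t|=1/h, which is what keeps the inversion integral finite.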
Estimator (6) suffers from the same drawback as the usual nonparametric deconvolution estimator in that it may be negative in parts. In practice, the negative parts can be truncated and the resulting function rescaled to integrate to 1. To circumvent this ad hoc fix, combine Eqs. (1) and (4) to obtain
\[ \pi (z) = \frac {1}{2} + \frac {1}{4\pi f_{0}(z)} \int _{\mathbb {R}} \sin (tz) s_{0}(t) dt. \tag{7} \]
Substitution of (5) in (7), along with the identity sin(tz)=(e^{itz}−e^{−itz})/(2i), gives
\[ \hat {\pi }(z) = \frac {1}{2} + \frac {\tilde {f}_{w^{*}}(z) - \tilde {f}_{w^{*}}(-z)}{4 f_{0}(z)}, \tag{8} \]
where \(\tilde {f}_{w^{*}}(z) =(nh\omega)^{-1}\sum K_{h\omega }[ (z-W_{j}^{*})/(h\omega)]\) is the well-studied nonparametric deconvolution density estimator of Carroll and Hall (1988) with deconvolution kernel \(K_{h}(y) =(2\pi)^{-1}{\int _{\mathbb R }} e^{-ity}\psi _{k}(t)/\psi _{u}(t/h)dt\). The potential for (6) being negative in parts is reflected in (8) not being range-respecting. Specifically, it is possible to have \(\hat {\pi }(z) \not \in \left [ 0,1\right ] \) for a set of z with nonzero measure. A range-corrected skewing function estimator is \(\tilde {\pi }(z) =\max \left [ 0,\min \left \{ 1,\hat {\pi }\left (z\right) \right \} \right ]\). The estimated density function of X based on the range-corrected skewing function is
\[ \tilde {f}(x|\xi,\omega) = \frac {2}{\omega } f_{0}\left (\frac {x-\xi }{\omega }\right) \tilde {\pi }\left (\frac {x-\xi }{\omega }\right). \tag{9} \]
Use of the range-corrected skewing function estimate ensures that (9) is always a valid pdf. There is no need for any additional truncation of negative values and subsequent rescaling as would be the case with direct implementation of (6).
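The full estimator can be sketched end to end as follows: estimate the skewing function by inverting the smoothed imaginary component, clip it to [0,1], and multiply by the base density. The sketch assumes a standard normal f_{0}, Laplace measurement error with known variance, the weight (1−t^{2})^{3} on [−1,1] and a simple trapezoid rule; all of these, and the function names, are illustrative assumptions rather than the authors' implementation.

```python
import math

def phi(z):
    """Standard normal pdf, the assumed symmetric component f_0."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def psi_u_laplace(t, sigma_u):
    """Characteristic function of Laplace error with variance sigma_u^2."""
    return 1.0 / (1.0 + 0.5 * sigma_u ** 2 * t ** 2)

def psi_k(t):
    """Assumed weight function, zero outside [-1, 1]."""
    return (1.0 - t * t) ** 3 if abs(t) <= 1.0 else 0.0

def pi_tilde(z, w_star, omega, sigma_u, h, m=400):
    """Range-corrected skewing function estimate at z."""
    upper = 1.0 / h          # the smoothed estimate vanishes for |t| > 1/h
    step = upper / m
    n = len(w_star)
    integral = 0.0
    # The integrand sin(tz)*s0_hat(t) is even in t and zero at t=0 and t=1/h,
    # so integrate over (0, 1/h) and double.
    for i in range(1, m):
        t = i * step
        mean_sin = sum(math.sin(t * ws) for ws in w_star) / n
        s_hat = psi_k(h * t) * mean_sin / psi_u_laplace(t / omega, sigma_u)
        integral += math.sin(t * z) * s_hat
    integral *= 2.0 * step
    pi_hat = 0.5 + integral / (4.0 * math.pi * phi(z))
    return max(0.0, min(1.0, pi_hat))  # hard truncation to [0, 1]

def gss_density(x, w, xi, omega, sigma_u, h):
    """GSS deconvolution density estimate at x for known (xi, omega)."""
    w_star = [(wj - xi) / omega for wj in w]
    z = (x - xi) / omega
    return (2.0 / omega) * phi(z) * pi_tilde(z, w_star, omega, sigma_u, h)
```

Since the clipped estimate still satisfies π̃(−z)=1−π̃(z), the resulting density integrates to 1 without any rescaling step.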
Some properties of the estimator
The range-corrected estimator \(\tilde {\pi }(z)\) is asymptotically equivalent to \(\hat {\pi }(z)\) in (8) on any closed subset of \(\mathbb {R}\). As such, the latter will be used to evaluate the properties of the GSS deconvolution estimator. Firstly, note that using the known expected value of the nonparametric deconvolution estimator \(\tilde {f}_{w^{\ast }}(z)\), it follows from (8) that
with constant c_{k} depending only on the kernel function ψ_{k}(t). Thus, for an appropriately chosen bandwidth h, \(\hat {\pi }(z)\) is consistent for π(z), and the density estimator \(\tilde {f}(x|\xi,\omega)\) in (9) is also consistent for f_{x}(x).
The mean integrated square error (MISE), derived in Appendix A.2, is
When the distribution of Z is symmetric, i.e. π(z)=1/2 for all z so that s_{0}(t)=0 for all t, and letting MISE_{sym} denote the MISE calculated under symmetry,
Here the inequality follows upon noting that 1−c_{0}(2t)ψ_{u}(2t/ω)≤2 for all t. This upper bound of MISE_{sym} is proportional to the asymptotic MISE of the nonparametric deconvolution estimator, see equation (2.7) in Stefanski & Carroll (1990). Thus, in the symmetric case, one would expect the GSS deconvolution estimator to perform no worse than the nonparametric deconvolution estimator for a correctly specified symmetric component c_{0}(t). In fact, since this is an upper bound, large gains in efficiency may be possible. Our simulation results presented in a later section are congruent with this statement.
Bandwidth selection
Implementation of the GSS deconvolution estimator requires a bandwidth parameter h to be specified. Two methods for selecting this bandwidth are developed in this section. The first method uses cross-validation (CV) to approximate the integrated square error (ISE), and the second method approximates the MISE in (10).
A cross-validation bandwidth
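For the GSS deconvolution estimator, the density-based ISE is proportional to the ISE for the imaginary component s_{0}(t) of the characteristic function,
\[ \text {ISE} \propto \int _{\mathbb {R}} \left [ \hat {s}_{0}(t) - s_{0}(t) \right ]^{2} dt. \tag{11} \]
This follows from Parseval’s identity and recalling that the real component c_{0}(t) is known. Let C(h) denote the expression obtained by expanding the square on the right-hand side of (11) and keeping only terms involving the estimator \(\hat {s}_{0}(t)\),
\[ C(h) = \int _{\mathbb {R}} \hat {s}_{0}^{2}(t) dt - 2 \int _{\mathbb {R}} \hat {s}_{0}(t) s_{0}(t) dt. \tag{12} \]
Now, note that the second integral in (12) can be written as
\[ \int _{\mathbb {R}} \hat {s}_{0}(t) s_{0}(t) dt = \frac {1}{n} \sum _{i=1}^{n} \int _{\mathbb {R}} \frac {\psi _{k}(ht) \sin (tW_{i}^{*})}{\psi _{u}(t/\omega)} s_{0}(t) dt. \tag{13} \]
Define \(\tilde {s}_{(i)}(t)\) to be an estimate of s_{0}(t) excluding the ith observation,
\[ \tilde {s}_{(i)}(t) = \frac {1}{(n-1) \psi _{u}(t/\omega)} \sum _{j \neq i} \sin (tW_{j}^{*}). \]
This quantity is unbiased for s_{0}(t) for all i, and \(\tilde {s}_{(i)}(t)\) is independent of W_{i}. The CV score follows by substitution of \(\tilde {s}_{(i)}(t)\) in (13) for each i in the summand, giving
\[ \hat {C}(h) = \int _{\mathbb {R}} \hat {s}_{0}^{2}(t) dt - \frac {2}{n(n-1)} \sum _{i=1}^{n} \sum _{j \neq i} \int _{\mathbb {R}} \frac {\psi _{k}(ht) \sin (tW_{i}^{*}) \sin (tW_{j}^{*})}{\psi _{u}^{2}(t/\omega)} dt. \]
This result is similar to that of Stefanski and Carroll (1990) in the nonparametric setting, but here only requires estimating the imaginary component of the characteristic function. The CV bandwidth is defined to be the value \(\tilde {h}\) that minimizes \(\hat {C}(h)\).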
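One way the leave-one-out idea might be coded is sketched below. The score is the integral of the squared smoothed estimate minus twice the leave-one-out cross term, where the cross term averages sin(tW_{i}^{*})sin(tW_{j}^{*}) over all pairs i≠j. This form is derived from the verbal description above and should be read as an assumption; the Laplace error model, the weight (1−t^{2})^{3}, the grid search and all function names are likewise illustrative.

```python
import math

def psi_u_laplace(t, sigma_u):
    """Characteristic function of Laplace error with variance sigma_u^2."""
    return 1.0 / (1.0 + 0.5 * sigma_u ** 2 * t ** 2)

def psi_k(t):
    """Assumed weight function, zero outside [-1, 1]."""
    return (1.0 - t * t) ** 3 if abs(t) <= 1.0 else 0.0

def cv_score(h, w_star, omega, sigma_u, m=400):
    """Leave-one-out CV score for bandwidth h (up to a positive constant)."""
    n = len(w_star)
    step = (1.0 / h) / m
    total = 0.0
    # Integrand is even in t and vanishes at t = 0 and t = 1/h.
    for i in range(1, m):
        t = i * step
        sines = [math.sin(t * ws) for ws in w_star]
        s_sum = sum(sines)
        s_sq = sum(s * s for s in sines)
        k = psi_k(h * t)
        pu = psi_u_laplace(t / omega, sigma_u)
        s_hat = k * (s_sum / n) / pu
        # Average of sin_i * sin_j over pairs i != j, via (sum)^2 - sum of squares.
        cross = (s_sum ** 2 - s_sq) / (n * (n - 1))
        total += s_hat ** 2 - 2.0 * k * cross / pu ** 2
    return 2.0 * step * total

def cv_bandwidth(w_star, omega, sigma_u, grid):
    """Pick the grid value minimizing the CV score (simple grid search)."""
    return min(grid, key=lambda h: cv_score(h, w_star, omega, sigma_u))
```

A call such as `cv_bandwidth(w_star, 1.0, 0.3, [0.2, 0.3, 0.4, 0.5])` returns the grid point with the smallest score; in practice a finer grid or a numerical optimizer could be substituted.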
An MISE bandwidth
Consider the MISE in (10), and note that the only unknown quantity therein is \(s_{0}^{2}(t)\). Furthermore, observe that \(\mathrm {E}\left [\sin (tW_{j}^{*})\sin (tW_{k}^{*})\right ]=\psi _{u}^{2}(t/\omega)s_{0}^{2}(t)\) whenever j≠k. Thus, \(s_{0}^{2}(t)\) can be estimated by
where \(\mathcal {I}(\cdot)\) is the indicator function and κ is some positive constant. The constant κ can be thought of as a smoothing parameter which ensures that the estimator \(\hat {s}_{2}(t)\) behaves well for large values of t. Ideally, κ should be chosen in a data-dependent way and development of this approach is ongoing. However, based on extensive simulation work, it has been found that values κ∈[3,5] work reasonably well for a wide range of underlying GSS distributions considered. Now, taking (10), substituting \(\hat {s}_{2}(t)\) for \(s_{0}^{2}(t)\), and ignoring components that do not depend on the bandwidth, gives MISE approximation score
The MISE-approximation bandwidth is defined to be the value \(\tilde {h}\) that minimizes \(\hat {M}(h)\).
Location and scale estimation
Generalized method of moments
Up to this point, the location and scale parameters ξ and ω have been treated as known quantities. This is unrealistic in practice. Estimation of the GSS parameters for a known symmetric component has been considered in the literature; see Ma et al. (2005), Azzalini et al. (2010), and Potgieter and Genton (2013). However, none of these authors considered the presence of measurement error. Here, a Generalized Method of Moments (GMM) approach accounting for measurement error is developed. Recall that W_{j}=X_{j}+U_{j}=ξ+ωZ_{j}+U_{j}, j=1,…,n. Let M≥2 be a positive integer and assume that the Z_{j} and the U_{j} have at least 2M finite moments. Let T_{k} denote the (2k)th centered and scaled sample moment,
\[ T_{k} = \frac {1}{n} \sum _{j=1}^{n} \left (\frac {W_{j}-\xi }{\omega }\right)^{2k}. \]
This variable has expectation E[T_{k}]=E[(Z+ω^{−1}U)^{2k}] and, because U is symmetric and independent of Z so that all terms involving odd moments of U vanish, admits expansion
\[ \mathrm {E}[T_{k}] = \sum _{j=0}^{k} \binom {2k}{2j} \omega ^{-2(k-j)} \mathrm {E}[Z^{2j}] \mathrm {E}[U^{2(k-j)}]. \tag{18} \]
By the GSS property of even transformations, \(\mathrm {E} [Z^{2j}]=\mathrm {E} [Z_{0}^{2j}]\) for j=1,…,M with Z_{0} a random variable with pdf f_{0}(z). Furthermore, the evaluation of the moments of U poses no problem as this distribution is assumed known. Thus, E[T_{k}] can easily be evaluated using (18).
Now, define quadratic form \(D(\xi,\omega) = n \mathbf {T}_{M}^{\top } \mathbf {\Sigma }^{-1} \mathbf {T}_{M}\) with T_{M} denoting the vector T_{M}=(T_{1}−E[T_{1}],…,T_{M}−E[T_{M}])^{⊤} with covariance matrix Σ. The covariance matrix has entries Σ_{ij}=n^{−1}(E[T_{i+j}]−E[T_{i}]E[T_{j}]). The GMM estimators are defined to be the minimizer of D(ξ,ω). In evaluating D(ξ,ω), both the expectations E[T_{k}], k=1,…,M and the covariance matrix Σ are functions of the parameter ω, but not of ξ.
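The GMM objective can be sketched for the smallest case M=2, where the 2×2 covariance matrix can be inverted by hand. The sketch assumes a standard normal symmetric component, so E[Z_{0}^{2j}]=(2j−1)!!, and Laplace measurement error, so E[U^{2m}]=(2m)!(σ_{u}^{2}/2)^{m}. Constant factors of n are dropped from the covariance since they do not move the minimizer, and the function names are hypothetical.

```python
import math

def normal_even_moment(j):
    """E[Z0^(2j)] for a standard normal: the double factorial (2j - 1)!!."""
    out = 1.0
    for r in range(1, 2 * j, 2):
        out *= r
    return out

def laplace_even_moment(m, sigma_u):
    """E[U^(2m)] for Laplace error with mean 0 and variance sigma_u^2."""
    return math.factorial(2 * m) * (0.5 * sigma_u ** 2) ** m

def expected_T(k, omega, sigma_u):
    """E[(Z + U/omega)^(2k)]; only even-even terms of the expansion survive."""
    return sum(
        math.comb(2 * k, 2 * j)
        * normal_even_moment(j)
        * laplace_even_moment(k - j, sigma_u)
        / omega ** (2 * (k - j))
        for j in range(k + 1)
    )

def gmm_objective(xi, omega, w, sigma_u):
    """Quadratic form D(xi, omega) for M = 2 even moments."""
    n = len(w)
    e = {k: expected_T(k, omega, sigma_u) for k in range(1, 5)}
    # Deviations of the sample moments T_1, T_2 from their expectations.
    t1 = sum(((wj - xi) / omega) ** 2 for wj in w) / n - e[1]
    t2 = sum(((wj - xi) / omega) ** 4 for wj in w) / n - e[2]
    # Covariance entries E[T_{i+j}] - E[T_i]E[T_j], without the 1/n factor.
    s11 = e[2] - e[1] ** 2
    s12 = e[3] - e[1] * e[2]
    s22 = e[4] - e[2] ** 2
    det = s11 * s22 - s12 ** 2
    quad = (s22 * t1 ** 2 - 2.0 * s12 * t1 * t2 + s11 * t2 ** 2) / det
    return n * quad
```

Minimizing `gmm_objective` over (ξ,ω), for example from several starting values, would then give candidate GMM solutions; larger M requires a general matrix inverse, which is omitted here for brevity.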
Selection from multiple GMM solutions
One difficulty encountered with the GMM approach is that the statistic D(ξ,ω) frequently has multiple minima, and the global minimum does not always correspond to the “correct” solution. The equivalent problem also occurs in the setting without measurement error and is an artifact of the skewing function being unknown; see Section 7.2.2 in Azzalini (2013) for an overview and illustration. Solutions considered there range from selecting the model with the smallest squared integral of the second derivative of the estimated skewing function, to selecting a solution based on matching model-based and empirical skewness coefficients.
Now, assume that D(ξ,ω) has J local minima occurring at \((\hat {\xi }_{j},\hat {\omega }_{j})\), j=1,…,J. Furthermore, let \(\tilde {f}_{j}(x|\hat {\xi }_{j},\hat {\omega }_{j})\) denote the GSS density deconvolution estimator in (9) obtained using solution \((\hat {\xi }_{j},\hat {\omega }_{j})\). Thus, J different GSS deconvolution estimators are calculated. Using the jth estimated density, define the kth model-implied moment,
\[ \tilde {\mu }_{j,k} = \int _{\mathbb {R}} x^{k} \tilde {f}_{j}(x|\hat {\xi }_{j},\hat {\omega }_{j}) dx, \tag{19} \]
and model-implied characteristic function,
\[ \tilde {\phi }_{j}(t) = \int _{\mathbb {R}} e^{itx} \tilde {f}_{j}(x|\hat {\xi }_{j},\hat {\omega }_{j}) dx. \tag{20} \]
Based on these quantities, two different selection methods are now proposed. Throughout, it will be assumed that measurement error U has distribution symmetric about 0.
Skewness matching: In model W=X+U, the skewness of X can be estimated by \(\hat {\gamma }_{x}=\left [{\hat {\sigma }_{w}^{2}}/{(\hat {\sigma }_{w}^{2}-{\sigma }_{u}^{2})}\right ]^{3/2}\hat {\gamma }_{w}\) where \(\hat {\sigma }_{w}^{2}\) and \(\hat {\gamma }_{w}\) denote the sample variance and skewness of iid random variables W_{1},…,W_{n}. This holds because, for symmetric U, the variables W and X share the same third central moment, so that only the standardizing variance needs correction. Now, for the jth solution pair \((\hat {\xi }_{j},\hat {\omega }_{j})\), the GSS model-implied skewness is given by \(\hat {\gamma }_{j}=\left ({\tilde {\mu }_{j,3}-3\tilde {\mu }_{j,2}\tilde {\mu }_{j,1}+2\tilde {\mu }_{j,1}^{3}}\right)/\left ({\tilde {\mu }_{j,2}-\tilde {\mu }_{j,1}^{2}}\right)^{3/2}\) with \(\tilde {\mu }_{j,k}\) as defined in (19). The selected solution is the one with implied skewness closest to the empirical skewness. Specifically, letting \(d_{j} = \vert \hat {\gamma }_{x}-\hat {\gamma }_{j}\vert \), j=1,…,J, the selected solution is \((\hat {\xi }_{j^{\ast }},\hat {\omega }_{j^{\ast }})\) with \(j^{\ast }= \arg \min _{1\leq j\leq J} d_{j}\).
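The skewness-matching rule is simple enough to sketch directly. The code below corrects the sample skewness of W for measurement error and picks the candidate whose implied skewness is closest. Using the n−1 divisor for the variance and the n divisor for the third central moment is an assumption made here, as are the function names; the model-implied skewness values are taken as given inputs.

```python
import math

def deconvolved_skewness(w, sigma_u):
    """Estimate the skewness of X from contaminated data W = X + U."""
    n = len(w)
    mean = sum(w) / n
    var_w = sum((v - mean) ** 2 for v in w) / (n - 1)
    m3 = sum((v - mean) ** 3 for v in w) / n      # third central moment of W
    gamma_w = m3 / var_w ** 1.5
    var_x = var_w - sigma_u ** 2                  # implied variance of X
    if var_x <= 0.0:
        raise ValueError("error variance exceeds the sample variance of W")
    # Symmetric U leaves the third central moment unchanged, so only rescale.
    return (var_w / var_x) ** 1.5 * gamma_w

def select_by_skewness(gamma_x, implied):
    """Index of the candidate solution with implied skewness closest to gamma_x."""
    return min(range(len(implied)), key=lambda j: abs(gamma_x - implied[j]))

# Hypothetical example: three candidate solutions with implied skewness values.
chosen = select_by_skewness(0.4, [-0.1, 0.35, 0.9])
```

Here `chosen` is the index of the candidate with implied skewness 0.35, the value nearest to 0.4.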
Phase function matching: The phase function, a normalized version of the characteristic function, is a recent tool employed in density deconvolution; see Delaigle and Hall (2016) and Nghiem and Potgieter (2018) for further details. Let ρ_{x}(t) and ρ_{w}(t) denote the phase functions of X and W=X+U. For U having strictly positive characteristic function, these phase functions are equal, ρ_{w}(t)=ρ_{x}(t) for all t. The empirical estimate of the phase function of X is \(\hat {\rho }_{x}(t)=\hat {\psi }_{w}(t)/\vert \hat {\psi }_{w}(t)\vert \) with \(\hat {\psi }_{w}(t)\) the empirical characteristic function of W, and \(\vert z\vert =(z\bar {z})^{1/2}\) and \(\bar {z}\) denoting the complex norm and conjugate of z. For the jth GMM solution \((\hat {\xi }_{j},\hat {\omega }_{j})\), the model-implied phase function is given by \(\tilde {\rho }_{j}(t)=\tilde {\phi }_{j}(t)/\vert \tilde {\phi }_{j}(t)\vert \) with \(\tilde {\phi }_{j}(t)\) as defined in (20). Now, letting w(t) denote a nonnegative weight function symmetric around 0, define distance metric \(R_{j} = \int _{\mathbb {R}} \vert \hat {\rho }_{x}(t) - \tilde {\rho }_{j}(t) \vert w(t) dt\) for j=1,…,J. The selected solution is \((\hat {\xi }_{j^{\ast }},\hat {\omega }_{j^{\ast }})\) with \(j^{\ast }= \arg \min _{1\leq j\leq J} R_{j}\). That is, the selected solution has minimum phase function distance. In this paper, weight function \(w(t)=[1-(t/t^{\ast })^{2}]^{3}\ \mathcal {I}(\vert t\vert \leq t^{\ast })\) will be used with t^{∗} the smallest t>0 such that \(\vert \hat {\psi }_{w}(t)\vert \leq n^{-1/4}\) as per Delaigle and Hall (2016).
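The ingredients of phase function matching can be sketched as follows: the empirical characteristic function, its normalized phase, the cutoff t^{∗}, the weight function, and the weighted distance to a candidate phase function. Locating t^{∗} by scanning a grid only approximates the "smallest t" in the definition, and the step size, upper limit, quadrature rule and function names are all assumptions made for illustration.

```python
import cmath

def ecf(t, w):
    """Empirical characteristic function of the sample w at t."""
    return sum(cmath.exp(1j * t * wj) for wj in w) / len(w)

def empirical_phase(t, w):
    """Empirical phase function: the ecf divided by its modulus."""
    psi = ecf(t, w)
    return psi / abs(psi)

def find_t_star(w, step=0.01, t_max=50.0):
    """Approximate the smallest t > 0 with |ecf(t)| <= n^(-1/4) on a grid."""
    threshold = len(w) ** (-0.25)
    for i in range(1, int(t_max / step) + 1):
        t = i * step
        if abs(ecf(t, w)) <= threshold:
            return t
    return t_max  # fallback if the threshold is never reached on the grid

def weight(t, t_star):
    """Weight [1 - (t/t*)^2]^3 on [-t*, t*], zero elsewhere."""
    return (1.0 - (t / t_star) ** 2) ** 3 if abs(t) <= t_star else 0.0

def phase_distance(w, model_phase, t_star, m=200):
    """Weighted L1 distance between empirical and model phase functions."""
    step = 2.0 * t_star / m
    total = 0.0
    for i in range(1, m):  # the weight vanishes at both endpoints
        t = -t_star + i * step
        total += abs(empirical_phase(t, w) - model_phase(t)) * weight(t, t_star)
    return total * step
```

In use, `model_phase` would be a callable returning the model-implied phase function of a candidate GMM solution, and the candidate with the smallest `phase_distance` would be retained.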
Simulation studies
The performance of the GSS deconvolution estimator was evaluated using extensive simulations. Letting ϕ(z) and Φ(z) denote the standard normal density and distribution functions, data X_{1},…,X_{n} were generated from GSS distributions with symmetric component f_{0}(z)=ϕ(z) and using three different skewing functions, π_{0}(z)=1/2, π_{1}(z)=Φ(9.9625z) and π_{2}(z)=Φ(z^{3}−2z). The location and scale parameters were taken to be ξ=0 and ω=1. Figure 1 illustrates the three resulting pdfs f_{x}(x)=(2/ω)ϕ[(x−ξ)/ω]π_{k}[(x−ξ)/ω], k=0,1,2. Note that the skewing function π_{0}(z) does not introduce any deviation from symmetry and corresponds to simulating from a normal distribution. Additionally, the skewing function π_{1}(z) results in a positive skew distribution, while π_{2}(z) results in a bimodal distribution.
Two measurement error distributions were considered with U_{1},…,U_{n} being either Normal or Laplace with mean 0 and variance chosen to have noise-to-signal ratio \(\text {NSR}=\sigma _{u}^{2}/\sigma _{x}^{2}\) either 0.2 or 0.5. Samples W_{j}=X_{j}+U_{j}, j=1,…,n, with n∈{50,100,200,500} were generated from each of the possible simulation configurations described.
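For readers wishing to reproduce a configuration of this kind, one possible data-generating sketch is given below. It draws from the GSS distribution with normal base density and skewing function Φ(9.9625z) using the standard sign-flip representation (keep Z_{0} with probability π(Z_{0}), otherwise use −Z_{0}), and adds Laplace noise scaled to a target NSR. Using the sample variance of X as a stand-in for σ_{x}^{2}, the seed, the large sample size and the function names are assumptions of this sketch, not details taken from the study.

```python
import math
import random

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sample_gss(n, skew, rng, xi=0.0, omega=1.0):
    """Draw X = xi + omega * Z with Z having pdf 2 * phi(z) * skew(z)."""
    out = []
    for _ in range(n):
        z0 = rng.gauss(0.0, 1.0)
        z = z0 if rng.random() <= skew(z0) else -z0  # sign-flip step
        out.append(xi + omega * z)
    return out

def sample_laplace(n, sigma_u, rng):
    """Laplace errors with mean 0 and variance sigma_u^2 (scale sigma_u/sqrt(2))."""
    b = sigma_u / math.sqrt(2.0)
    # A difference of two independent unit exponentials is standard Laplace.
    return [b * (rng.expovariate(1.0) - rng.expovariate(1.0)) for _ in range(n)]

def variance(v):
    """Sample variance with divisor n - 1."""
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

rng = random.Random(2020)
n, nsr = 20000, 0.2          # n is large here only to make the check stable
x = sample_gss(n, lambda z: Phi(9.9625 * z), rng)
sigma_u = math.sqrt(nsr * variance(x))
u = sample_laplace(n, sigma_u, rng)
w = [a + b for a, b in zip(x, u)]
```

Replacing `sample_laplace` with draws from `rng.gauss(0.0, sigma_u)` would give the Normal measurement error setting.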
Comparison of oracle estimators
The first simulation study presented compares the proposed GSS estimator to the established nonparametric estimator of Carroll and Hall (1988), and assumes the existence of an oracle that selects the “best” possible bandwidth for each of the estimators. Specifically, for a sample W_{1},…,W_{n}, let \(\tilde {f}_{\text {gss}}(x|h)\) and \(\tilde {f}_{\text {np}}(x|h)\) denote, respectively, the GSS and nonparametric estimators with bandwidth h. The ISE is defined as
\[ \text {ISE}_{m}(h) = \int _{\mathbb {R}} \left [ \tilde {f}_{m}(x|h) - f_{x}(x) \right ]^{2} dx, \]
where m∈{gss,np}. Then, the “best” bandwidth is the value that minimizes the ISE between the estimated and true densities. Furthermore, when GMM results in more than one solution for the GSS location and scale parameters, the oracle also selects the solution that results in the smallest ISE. In practice, no oracle exists to do these selections. Even so, comparing the estimators under such idealized conditions speaks to the best possible performance of these methods.
For each simulation configuration, N=1000 samples were generated. Due to the occasional occurrence of very large outliers in ISE, the median ISE (rather than mean ISE) is reported. The first and third quartiles of ISE are also reported. Results for n∈{200,500} are summarized in Table 1, and for n∈{50,100} are presented in Table 6 in Appendix A.5.
Inspection of Table 1 shows how well the GSS estimator can perform relative to the nonparametric estimator. In the symmetric case with skewing function π_{0}(z), the reduction in median ISE is most dramatic and exceeds 50% in all cases. For skewing functions π_{1}(z) and π_{2}(z), the reduction in median ISE is also seen to be as large as 40%. There is one instance where the median ISE of the nonparametric estimator is smaller than that of the GSS estimator: skewing function π_{2}(z) with NSR=0.5, Laplace measurement error, and sample size n=200. (The same holds true for sample sizes n=50 and 100 in Table 6.) However, the equivalent scenario with sample size n=500 has the GSS estimator with smaller median ISE. This possibly indicates the effect of estimating the location and scale parameters in smaller samples and when large amounts of heavier-tailed-than-normal measurement error are present. Overall, the GSS deconvolution estimator performs very well. Thus, the additional structure being imposed through the a priori specification of the symmetric pdf f_{0}(z) can result in a large decrease in ISE.
Bandwidth estimation
The next simulation study investigated the two proposed bandwidth estimation approaches. Specifically, the CV and MISE bandwidths as well as the two-stage plug-in (PI) bandwidth of Delaigle and Gijbels (2002), originally developed for nonparametric deconvolution, were implemented. For each simulated sample, the ISE was calculated. When necessary, GMM solution selection with phase function matching was used. The nonparametric deconvolution estimator with PI bandwidth was also calculated; corresponding results are included for reference purposes. The median ISE values for the methods are summarized in Tables 2 and 3 for sample sizes n∈{200,500}, and in Tables 7 and 8 in Appendix A.5 for sample sizes n∈{50,100}.
In Tables 2 and 3, it is seen that there is no consistent “best” bandwidth method. For skewing functions π_{0} (the symmetric case) and π_{2}, the PI bandwidth generally has the smallest median ISE. In these same scenarios, MISE frequently (but by no means consistently) outperforms CV. For π_{1}(z) the MISE bandwidth performs best. In all simulation settings, there is a GSS bandwidth method that results in better performance than the nonparametric estimator. These same conclusions broadly hold for sample sizes n∈{50,100} in Appendix A.5.
The results presented above were restricted to phase function matching for the GMM estimators, as it was found to generally have better performance than skewness matching. For details of the simulation comparing the two GMM matching methods, see Appendix A.3.
GMM estimation
One other simulation study was performed to consider the choice of M (the number of even moments to use) when evaluating the GMM estimators of (ξ,ω). These simulation results are presented in Appendix A.4. In summary, the larger value M=5 was generally seen to outperform M=2 for π_{1}(z) and π_{2}(z). In the symmetric π_{0}(z) case, M=2 performed slightly better than M=5. In all instances, root mean square error (RMSE) was used as criterion.
Data applications
Coal abrasiveness index data
Data from an industrial application, first considered by Lombard (2005), are analyzed here. The data were obtained by taking batches of coal, splitting them in two, and randomly allocating each of the half-batches to one of two methods used to measure the abrasiveness index (AI) of coal, a measure of the quality of coal. The observed data consist of 98 pairs (w_{1i},w_{2i}) assumed to be from a population with W_{1i}=X_{i}+U_{1i} and W_{2i}=μ+σ(X_{i}+U_{2i}). Here, X_{i} denotes the true AI of the ith batch, U_{1i} and U_{2i} denote measurement error, and constants μ and σ account for the two AI measurement methods being on different scales. Of interest is estimating f_{x}(x), the true density of AI. However, the data (w_{1i},w_{2i}) first need to be combined in a sensible way.
To this end, let μ_{w,k} and \(\sigma _{w,k}^{2}\) denote the mean and variance of the W_{ki}, k=1,2, and let μ_{x} and \(\sigma _{x}^{2}\) denote the mean and variance of the X_{i}. Note that μ_{w,1}=μ_{x}, μ_{w,2}=μ+σμ_{x}, \(\sigma _{w,1}^{2}=\sigma _{x}^{2}+\sigma _{u}^{2}\), and \(\sigma _{w,2}^{2}=\sigma ^{2}\left (\sigma _{x}^{2}+\sigma _{u}^{2}\right)\). By replacing the population moments with their sample counterparts, estimators \(\hat {\sigma }=s_{w,2}/s_{w,1}=0.679\) and \(\hat {\mu }=\bar {w}_{2} - \hat {\sigma }\bar {w}_{1}=59.503\) are obtained. Here, \((\bar {w}_{1},s_{w,1})\) denote the sample mean and standard deviation of the observed w_{1} data with similar definitions holding for the w_{2} quantities. Now, the paired observations are combined as \(w_{i} =0.5w_{1i}+0.5\left (w_{2i}-\hat {\mu }\right)/\hat { \sigma }\). At the population level this corresponds to W_{i}≈X_{i}+0.5(U_{1i}+U_{2i}):=X_{i}+ε_{i}. An estimate of the measurement error variance \(\sigma _{\varepsilon }^{2}\) is obtained by calculating \(\hat {\sigma }_{u}^{2}=(2n)^{-1}\sum \left [ W_{1i}-\left (W_{2i}-\hat {\mu }\right)/\hat {\sigma } \right ]^{2}=174.6\) and noting that \(\hat {\sigma }_{\varepsilon }^{2}=174.6/2=87.3\). This corresponds to the W_{i} having noise-to-signal ratio NSR=16.35%.
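The combination of paired measurements described above can be sketched in a few lines. The function below rescales the second measurement onto the scale of the first, averages the pair, and estimates the error variances from the within-pair differences. The function name and the toy inputs are hypothetical; the coal data themselves are not reproduced here.

```python
import statistics

def combine_pairs(w1, w2):
    """Combine paired measurements made on different scales.

    Returns the combined values, the scale and shift estimates, and the
    estimated variance of the averaged measurement error.
    """
    sigma_hat = statistics.stdev(w2) / statistics.stdev(w1)
    mu_hat = statistics.mean(w2) - sigma_hat * statistics.mean(w1)
    # Put the second measurement on the scale of the first.
    w2_rescaled = [(b - mu_hat) / sigma_hat for b in w2]
    combined = [0.5 * a + 0.5 * b for a, b in zip(w1, w2_rescaled)]
    n = len(w1)
    # Var(U1 - U2) = 2 * sigma_u^2, hence the divisor 2n.
    var_u = sum((a - b) ** 2 for a, b in zip(w1, w2_rescaled)) / (2 * n)
    var_eps = var_u / 2.0  # variance of 0.5 * (U1 + U2)
    return combined, sigma_hat, mu_hat, var_eps

# Toy check: an exact linear relation between the two methods, no noise.
w1_toy = [180.0, 200.0, 220.0, 250.0]
w2_toy = [10.0 + 0.5 * a for a in w1_toy]
combined, sigma_hat, mu_hat, var_eps = combine_pairs(w1_toy, w2_toy)
```

With noise-free toy inputs the scale and shift are recovered exactly and the estimated error variance is zero; on real paired data the final step mirrors the halving 174.6/2=87.3 reported above.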
The GSS deconvolution estimator for f_{x}(x) is now calculated assuming a normal symmetric component, f_{0}(z)=ϕ(z), along with a Laplace distribution for the measurement error ε. (The equivalent estimator assuming normal measurement error was also calculated and is nearly identical in shape.) GMM with M=5 gives solution pairs \((\hat {\xi } _{1},\hat {\omega }_{1}) =\left (192.88,29.90\right) \) and \((\hat {\xi }_{2},\hat {\omega }_{2}) =\left (230.41,32.43\right) \). For each of these, the corresponding skewing function estimate \(\tilde {\pi }_{j}(z)\) and phase function distance R_{j} were calculated, the latter using weight function w(t)=[1−(t/t^{∗})^{2}]^{3} for t∈[−t^{∗},t^{∗}] and t^{∗}=0.06. Here, R_{1}=0.023<0.046=R_{2} and therefore solution \((\hat {\xi } _{1},\hat {\omega }_{1})\) with estimated skewing function \(\tilde {\pi }_{1}(z)\) was selected. Skewness matching resulted in selection of the same solution. Figure 2 shows a kernel density estimator of f_{w}(w), the density of the contaminated W_{i}, as well as the GSS deconvolution estimator of f_{x}(x) with MISE bandwidth \(\tilde {h}=0.102\).
This application illustrates one of the less appealing aspects of the GSS approach sometimes encountered in smaller samples. Note the sharp “edge” in the GSS estimator around x=225. This is an artefact of the hard truncation applied when calculating the range-respecting skewing function estimate \(\tilde {\pi }(z)\). The resulting density estimate is not differentiable at this point. This is equivalent to nondifferentiable points in the nonparametric deconvolution estimator when it is truncated to be positive.
Systolic blood pressure data
The data here are a subset of n=1615 male participants in the Framingham Heart Study, see for example Carroll et al. (2006) for more detail. The data consist of systolic blood pressure measurements from two patient exams (the second and third exams in the study). At each exam, two replicate measurements were obtained giving data (SBP_{21},SBP_{22},SBP_{31},SBP_{32}). Let P_{1}=(SBP_{21}+SBP_{22})/2 and P_{2}=(SBP_{31}+SBP_{32})/2 denote the average systolic blood pressure observed at each of the exams, and calculate transformed variables W_{j}= log(P_{j}−50), j=1,2, as suggested by Carroll et al. (2006). This is done to adjust for the large skewness present in the data. The measurement W=(W_{1}+W_{2})/2=X+U is a surrogate for the true long-term average systolic blood pressure X (on the transformed logarithmic scale). Using the replicates (W_{1},W_{2}), estimated standard deviations \(\hat {\sigma }_{x}=0.1976\) and \(\hat {\sigma }_{u}=0.0802\) are obtained.
The GSS deconvolution estimator assuming a Laplace distribution for the measurement error U and using a normal reference density f_{0}(z)=ϕ(z) was computed. GMM with M=5 resulted in only one solution, \((\hat {\xi },\hat {\omega })=(4.429,0.210)\), and therefore no selection was needed. Figure 3 displays both the GSS deconvolution estimator and the nonparametric deconvolution estimator, both with PI bandwidths.
The nonparametric deconvolution estimator has previously been applied to the Framingham Heart Study. It is therefore reassuring that the GSS estimator is not dissimilar in appearance.
Conclusion
In this paper, the density deconvolution problem is considered for variables belonging to the family of generalized skew-symmetric (GSS) distributions. Implementation requires both the estimation of location and scale parameters (ξ,ω), and the estimation of a skewing function π(z). Estimation methods are proposed for both of these quantities, and extensive simulation studies are performed. In these studies, the GSS deconvolution estimator is generally seen to result in large improvements over the nonparametric deconvolution estimator (using median ISE as criterion).
There are still several questions related to GSS deconvolution that can be considered. Firstly, the estimator requires the specification of a known symmetric component f_{0}(z). While this is done to ensure model identifiability, it would be possible to consider several candidate symmetric densities and choose the “best” among these. The related goodness-of-fit testing problem for a specified symmetric component can also be explored. Secondly, it should be noted that the contaminated W also has a GSS distribution. An alternative modeling approach could therefore estimate the pdf of W directly and then recover the pdf of X. Lastly, it was observed in the simulation study that the nonparametric deconvolution kernel estimator in a few isolated instances had superior performance to the GSS estimator under selection, while the GSS estimator had better performance under oracle conditions for the same simulation configurations. This suggests that further refinement of the bandwidth calculation and solution selection procedure may be possible, and related work is ongoing.
Appendix
A.1 Generalized skew-symmetric representation
Here, it is established that any continuous random variable has a nonunique representation as a GSS distribution. This motivates, in part, the need to assume a parametric form for pdf f_{0}(z) when doing estimation. Let Y be a continuous random variable with pdf f_{y}(y) and let ξ be a real number. Furthermore, let B be a Bernoulli(p=0.5) random variable, and define new random variables D_{ξ}=Y−ξ and T=BD_{ξ}−(1−B)D_{ξ}. The random variable T is symmetric about 0 and has pdf f_{t}(t)=(1/2)[f_{y}(ξ+t)+f_{y}(ξ−t)]. Next, define
\[ \pi_{t}(t) = \frac{f_{y}(\xi+t)}{f_{y}(\xi+t)+f_{y}(\xi-t)} \quad \text{for } t \text{ with } f_{t}(t)>0, \]
and note that π_{t}(t) satisfies 0≤π_{t}(t)=1−π_{t}(−t)≤1. By construction, it follows that f_{y}(y) can be expressed as f_{y}(y)=2f_{t}(y−ξ)π_{t}(y−ξ). Assuming that Y has finite variance, the variance of T is given by \(\omega _{\xi }^{2} = \int _{\mathbb {R}}t^{2}f_{t}(t)dt\). Then, letting f_{ξ}(z)=ω_{ξ}f_{t}(ω_{ξ}z) and π_{ξ}(z)=π_{t}(ω_{ξ}z), so that f_{ξ}(z) has unit variance, it is possible to write
\[ f_{y}(y) = \frac{2}{\omega_{\xi}}\, f_{\xi}\!\left(\frac{y-\xi}{\omega_{\xi}}\right) \pi_{\xi}\!\left(\frac{y-\xi}{\omega_{\xi}}\right). \]
This representation does not depend on a specific value for ξ and, as such, holds for every ξ. However, each value of ξ is associated with a different symmetric component f_{ξ}(z) and skewing function π_{ξ}(z). As such, there is a family of distributions f_{ξ}(z) symmetric about 0 and with unit variance such that the random variable Y can be expressed as a GSS distribution with symmetric component belonging to this family. The work in this paper is motivated by the assumption that it is possible to correctly specify one symmetric distribution in the family f_{ξ}(z).
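The construction can be checked numerically. The Python sketch below uses an illustrative density, f_{y}(y)=y e^{−y} for y>0, and takes π_{t}(t)=f_{y}(ξ+t)/[f_{y}(ξ+t)+f_{y}(ξ−t)], which is the choice consistent with the identities stated above and is an assumption of this sketch. All function names are hypothetical.

```python
import math


def f_y(y):
    """Example density: y * exp(-y) for y > 0 (any continuous pdf would do)."""
    return y * math.exp(-y) if y > 0 else 0.0


def f_t(t, xi):
    """Symmetrized density f_t(t) = [f_y(xi + t) + f_y(xi - t)] / 2."""
    return 0.5 * (f_y(xi + t) + f_y(xi - t))


def pi_t(t, xi):
    """Skewing function; set to 0.5 where both density terms vanish."""
    denom = f_y(xi + t) + f_y(xi - t)
    return f_y(xi + t) / denom if denom > 0 else 0.5


def gss_density(y, xi):
    """Reconstruct f_y(y) as 2 * f_t(y - xi) * pi_t(y - xi)."""
    return 2.0 * f_t(y - xi, xi) * pi_t(y - xi, xi)


# Largest reconstruction error over several choices of xi and y; it should
# be at floating-point rounding level for every xi.
max_error = max(
    abs(gss_density(y, xi) - f_y(y))
    for xi in (0.5, 1.0, 3.0)
    for y in (0.1, 0.7, 2.5, 6.0)
)
```

The reconstruction holds for each value of ξ tried, which illustrates why the representation is not unique unless the symmetric component is fixed.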
A.2 MISE derivation
To derive an expression for the mean integrated square error (MISE), consider the estimator \(\hat {s}_{0}(t)\) defined in (5). Recall that \(\mathrm {E}\left [\hat {s}_{0}(t)\right ] = \psi _{k}(ht)s_{0}(t)\). Additionally, it has covariance structure
The integrated squared error (ISE) of the GSS estimator can now be expressed in terms of \(\hat {s}_{0}(t)\),
where the first equality is an application of Parseval’s identity, and the second follows upon noting that the estimated characteristic function \(\hat {\psi }_{z}(t)\) and true characteristic function ψ_{z}(t) have common real component c_{0}(t) which therefore cancels out, leaving only the estimated and true imaginary components. Also note that ISE is a function of the bandwidth h through \(\hat {s}_{0}(t)\). Now, MISE=E[ISE] can be evaluated using the expectation and covariance functions associated with \(\hat {s}_{0}(t)\), in the latter setting t_{1}=t_{2}=t. Eq. 10 follows.
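As a hedged sketch of the structure of this argument, suppose the estimated and true characteristic functions on the standardized scale are \(\hat {\psi }_{z}(t)=c_{0}(t)+i\hat {s}_{0}(t)\) and \(\psi _{z}(t)=c_{0}(t)+is_{0}(t)\), and that the kernel characteristic function \(\psi _{k}\) is real-valued. The symbols \(\hat {f}_{z}\) and \(f_{z}\) for the estimated and true densities of Z are notation assumed here, and constant factors relating the scale of Z to that of X are omitted, so the displays illustrate the reasoning and are not a restatement of Eq. 10.

\[
\int_{\mathbb{R}} \left[\hat{f}_{z}(z)-f_{z}(z)\right]^{2} dz
= \frac{1}{2\pi}\int_{\mathbb{R}} \left|\hat{\psi}_{z}(t)-\psi_{z}(t)\right|^{2} dt
= \frac{1}{2\pi}\int_{\mathbb{R}} \left[\hat{s}_{0}(t)-s_{0}(t)\right]^{2} dt ,
\]
\[
\mathrm{E}\left\{\left[\hat{s}_{0}(t)-s_{0}(t)\right]^{2}\right\}
= \mathrm{Var}\left[\hat{s}_{0}(t)\right] + \left[\psi_{k}(ht)-1\right]^{2}s_{0}(t)^{2}.
\]

The second display is the usual decomposition into variance plus squared bias, with the bias term obtained from \(\mathrm {E}\left [\hat {s}_{0}(t)\right ] = \psi _{k}(ht)s_{0}(t)\).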
A.3 GMM estimators simulation
The performance of GMM estimation of (ξ,ω) was evaluated in a simulation study. Data were simulated as described in the main paper. For each simulated dataset, the estimators minimizing D(ξ,ω) were obtained for both M=2 and M=5 even moments. While the sixth, eighth and tenth sample moments used for the M=5 setting arguably contain additional information, there is a great deal of added variability introduced when estimating these higher order moments. This simulation explored the benefits, if any, of doing so. In simulated samples where multiple solutions \((\hat {\xi }_{j},\hat {\omega }_{j})\), j=1,…,J were obtained, the existence of an oracle able to choose the solution closest to the true value (0,1) (as measured using Euclidean distance) was assumed. A total of N=1000 samples were generated for each simulation configuration. Root mean square error (RMSE) was used as criterion, and the results are shown in Table 4.
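The oracle step and the RMSE criterion described above can be sketched as follows. The candidate solutions in `samples` are made-up numbers for illustration and are not output from the paper's GMM procedure, and the function names are hypothetical.

```python
import math


def oracle_pick(solutions, truth=(0.0, 1.0)):
    """Choose the solution closest to the true (xi, omega) in Euclidean distance."""
    return min(solutions, key=lambda s: math.dist(s, truth))


def rmse(estimates, true_value):
    """Root mean square error of a list of scalar estimates."""
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))


# Hypothetical GMM output for three simulated samples; each sample may have
# one or several candidate (xi, omega) solutions.
samples = [
    [(0.10, 0.95), (0.80, 1.40)],
    [(-0.05, 1.10)],
    [(0.60, 0.70), (0.02, 1.02), (-0.90, 1.30)],
]
picked = [oracle_pick(s) for s in samples]
rmse_xi = rmse([p[0] for p in picked], 0.0)
rmse_omega = rmse([p[1] for p in picked], 1.0)
```

In a full study the same two functions would be applied to all N simulated samples for each configuration.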
In the setting with X normal, i.e. using π_{0}(z), using M=5 moments results in a small increase in RMSE compared to the case M=2. The average increase in RMSE for ξ is 1.2% and for ω is 9.5% across the settings considered. On the other hand, the simulation results for skewing functions π_{1}(z) and π_{2}(z) look very different. Here, the RMSE for ξ decreases for both skewing functions, and the RMSE for ω decreases for skewing function π_{2}(z). Also, the average RMSE of ω for π_{2}(z) remains unchanged across the simulation settings considered. One possible reason for the increase in RMSE in the symmetric case is that the underlying distribution is normal and therefore higher-order moments do not contain any “extra” information about the distribution. On the other hand, for π_{1}(z) and π_{2}(z) there is a substantive departure from normality and the higher-order sample moments, despite their large variability, do contain useful information about the underlying distribution. As the increase in RMSE in the symmetric case is relatively small compared to the decrease in the asymmetric cases, the paper uses the GMM estimators with M=5 in all other simulations.
A.4 Solution selection simulation
The simulation results comparing the performance of the skewness matching and phase function distance solution selection mechanisms follow here. Data were generated as described in the “Simulation studies” section of the main paper. For each simulated sample, all GMM solutions \((\hat {\xi }_{j},\hat {\omega }_{j})\), j=1,…,J were obtained. Solution selection was then implemented for both skewness matching and phase function matching. These techniques require a bandwidth to be selected. The simulation implemented CV, MISE, and PI bandwidth selection. However, the conclusions with regard to selection methods were very similar for these and therefore only MISE bandwidth results are included here. To contextualize these results from selection, results corresponding to an oracle able to choose the solution with smallest ISE are also reported, as well as a blind selection approach randomly selecting one of the GMM solutions.
The simulation results are summarized in Table 5. In this table, the median ISE of skewness matching and phase function distance are given in the columns SKW and PHS. The column MIN contains the median ISE for the oracle selecting the solution with smallest ISE, while RND contains the median ISE of randomly selecting one of the GMM solutions. Finally, the median ISE of the nonparametric deconvolution estimator with PI bandwidth is given in column NP for reference purposes.
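The MIN and RND columns can be illustrated with a short Python sketch. ISE is approximated here by the trapezoidal rule on a grid, and the candidate density estimates are plain normal curves standing in for the GSS estimates from each GMM solution. Both choices are assumptions made for illustration, as are the function names.

```python
import math
import random
import statistics


def normal_pdf(x, mu=0.0, sd=1.0):
    """Normal density with mean mu and standard deviation sd."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def ise(f_hat, f_true, grid):
    """Trapezoidal approximation of the integral of (f_hat - f_true)^2."""
    vals = [(f_hat(x) - f_true(x)) ** 2 for x in grid]
    return sum(0.5 * (vals[i] + vals[i + 1]) * (grid[i + 1] - grid[i])
               for i in range(len(grid) - 1))


def summarize(ise_per_sample, rng):
    """Median ISE under oracle (MIN) and random (RND) solution selection."""
    min_col = [min(v) for v in ise_per_sample]
    rnd_col = [rng.choice(v) for v in ise_per_sample]
    return statistics.median(min_col), statistics.median(rnd_col)


grid = [-8.0 + 0.01 * i for i in range(1601)]

# Hypothetical (mean, sd) pairs defining stand-in estimates for three samples.
candidates_per_sample = [
    [(0.1, 1.0), (0.9, 1.2)],
    [(-0.2, 1.1)],
    [(0.5, 0.8), (0.0, 1.05)],
]
ise_per_sample = [
    [ise(lambda x, m=m, s=s: normal_pdf(x, m, s), normal_pdf, grid) for m, s in cands]
    for cands in candidates_per_sample
]
med_min, med_rnd = summarize(ise_per_sample, random.Random(0))
```

Since the oracle ISE is never larger than the randomly selected ISE within a sample, the MIN median cannot exceed the RND median, which makes the two columns natural lower and upper reference points for SKW and PHS.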
Inspection of Table 5 shows that estimation under both skewness matching and phase function matching generally performs better than the fully nonparametric estimator, with the exception being the combination of skewing function π_{2}(z) and Laplace measurement error. However, as the GSS estimator outperformed the nonparametric estimator under an oracle bandwidth as seen in Table 1 of the main paper, this does suggest that further improvement of the GSS estimator may still be possible by refining parameter estimation and bandwidth selection; this is ongoing work. Further inspection of Table 5 shows that estimation under both skewness matching and phase function matching performs better than random selection, with the exception that random selection outperforms skewness matching for π_{1}(z) and normal measurement error. While there are a few instances where skewness matching outperformed phase function matching, the latter generally has very good performance and comes close to the best possible performance of the minimum ISE under oracle selection.
A.5 Supplemental simulation results
This subsection contains two sets of supplemental simulation results. The first of these, found in Table 6, pertains to a comparison of oracle estimators for sample sizes n={50,100}. The second of these, found in Tables 7 and 8, pertains to comparing bandwidth estimation methods for sample sizes n={50,100}. The conclusions that can be drawn from these results are consistent with those discussed in the “Simulation studies” section of the main paper; the results are included here for completeness.
Availability of data and materials
The coal abrasiveness index data are discussed in Lombard (2005). These data are proprietary, and cannot be released publicly. The systolic blood pressure data are discussed in Carroll et al. (2006) and constitute a subset of the Framingham Heart Study. The subset of data used in this paper is publicly available in the R package decon.
Abbreviations
CV: Cross-validation
GMM: Generalized method of moments
GSS: Generalized skew-symmetric
iid: Independent and identically distributed
ISE: Integrated squared error
MISE: Mean integrated squared error
pdf: Probability density function
PI: Plug-in
RMSE: Root mean square error
References
Arellano-Valle, R. B., Ozan, S., Bolfarine, H., Lachos, V.: Skew normal measurement error models. J. Multivar. Anal. 96(2), 265–281 (2005).
Arellano-Valle, R. B., Azzalini, A., Ferreira, C. S., Santoro, K.: A two-piece normal measurement error model. Comput. Stat. Data Anal. 144, 106863 (2020).
Azzalini, A.: A class of distributions which includes the normal ones. Scand. J. Stat. 12, 171–178 (1985).
Azzalini, A., Genton, M. G., Scarpa, B.: Invariance-based estimating equations for skew-symmetric distributions. Metron. 68(3), 275–298 (2010).
Azzalini, A.: The Skew-normal and Related Families. Cambridge University Press, New York (2013).
Carroll, R. J., Hall, P.: Optimal rates of convergence for deconvolving a density. J. Am. Stat. Assoc. 83(404), 1184–1186 (1988).
Carroll, R. J., Ruppert, D., Stefanski, L. A., Crainiceanu, C. M.: Measurement Error in Nonlinear Models: a Modern Perspective. CRC press, Boca Raton (2006).
Chu, K. K., Wang, N., Stanley, S., Cohen, N. D.: Statistical evaluation of the regulatory guidelines for use of furosemide in race horses. Biometrics. 57(1), 294–301 (2001). https://doi.org/10.1111/j.0006-341x.2001.00294.x.
Delaigle, A., Gijbels, I.: Estimation of integrated squared density derivatives from a contaminated sample. J. R. Stat. Soc. Ser. B Stat. Methodol. 64(4), 869–886 (2002).
Delaigle, A., Gijbels, I.: Practical bandwidth selection in deconvolution kernel density estimation. Comput. Stat. Data Anal. 45(2), 249–267 (2004).
Delaigle, A., Hall, P.: Using SIMEX for smoothing-parameter choice in errors-in-variables problems. J. Am. Stat. Assoc. 103(481), 280–287 (2008).
Delaigle, A., Hall, P., Meister, A.: On deconvolution with repeated measurements. Ann. Stat. 36(2), 665–685 (2008). https://doi.org/10.1214/009053607000000884.
Delaigle, A., Hall, P.: Parametrically assisted nonparametric estimation of a density in the deconvolution problem. J. Am. Stat. Assoc. 109(506), 717–729 (2014).
Delaigle, A., Hall, P.: Methodology for nonparametric deconvolution when the error distribution is unknown. J. R. Stat. Soc. Ser. B Stat. Methodol. 78(1), 231–252 (2016).
Diggle, P. J., Hall, P.: A Fourier approach to nonparametric deconvolution of a density estimate. J. R. Stat. Soc. Ser. B Methodol. 55(2), 523–531 (1993). https://doi.org/10.1111/j.2517-6161.1993.tb01920.x.
Fan, J.: Asymptotic normality for deconvolution kernel density estimators. Sankhyā: Indian J. Stat. Ser. A. 53(1), 97–110 (1991a).
Fan, J.: On the optimal rates of convergence for nonparametric deconvolution problems. Ann. Stat. 19(3), 1257–1272 (1991b).
Fan, J., Truong, Y. K.: Nonparametric regression with errors in variables. Ann. Stat. 21(4), 1900–1925 (1993). https://doi.org/10.1214/aos/1176349402.
Genton, M. G. (ed.): Skew-elliptical Distributions and Their Applications: a Journey Beyond Normality. CRC Press, Boca Raton (2004).
Guolo, A.: A flexible approach to measurement error correction in case–control studies. Biometrics. 64(4), 1207–1214 (2008).
Kahrari, F., Ferreira, C., Arellano-Valle, R.: Skew-normal-Cauchy linear mixed models. Sankhya B. 81(2), 185–202 (2019).
Kim, H. M., Maadooliat, M., Arellano-Valle, R. B., Genton, M. G.: Skewed factor models using selection mechanisms. J. Multivar. Anal. 145, 162–177 (2016).
Lachos, V., Labra, F., Bolfarine, H., Ghosh, P.: Multivariate measurement error models based on scale mixtures of the skew–normal distribution. Statistics. 44(6), 541–556 (2010).
Lombard, F.: Nonparametric confidence bands for a quantile comparison function. Technometrics. 47(3), 364–371 (2005).
Ma, Y., Genton, M. G., Tsiatis, A. A.: Locally efficient semiparametric estimators for generalized skew-elliptical distributions. J. Am. Stat. Assoc. 100(471), 980–989 (2005).
Neumann, M. H., Hössjer, O.: On the effect of estimating the error density in nonparametric deconvolution. J. Nonparametric Stat. 7(4), 307–330 (1997).
Nghiem, L., Potgieter, C. J.: Density estimation in the presence of heteroscedastic measurement error of unknown type using phase function deconvolution. Stat. Med. 37(25), 3679–3692 (2018).
Potgieter, C. J., Genton, M. G.: Characteristic function-based semiparametric inference for skew-symmetric models. Scand. J. Stat. 40(3), 471–490 (2013).
Stefanski, L. A., Carroll, R. J.: Deconvolving kernel density estimators. Statistics. 21(2), 169–184 (1990).
Van Oost, K., Van Muysen, W., Govers, G., Heckrath, G., Quine, T., Poesen, J.: Simulation of the redistribution of soil by tillage on complex topographies. Eur. J. Soil Sci. 54(1), 63–76 (2003).
Wang, W. L., Liu, M., Lin, T. I.: Robust skew-t factor analysis models for handling missing data. Stat. Methods Appl. 26(4), 649–672 (2017).
Acknowledgements
Not applicable.
Funding
The author has no funding sources to declare.
Author information
Contributions
This is a single-author paper and all research and writing was conducted by the author. The author read and approved the final manuscript.
Ethics declarations
Competing interests
The author declares that he has no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Potgieter, C.J. Density deconvolution for generalized skew-symmetric distributions. J Stat Distrib App 7, 2 (2020). https://doi.org/10.1186/s40488-020-00103-y
DOI: https://doi.org/10.1186/s40488-020-00103-y