Missing data approaches for probability regression models with missing outcomes with applications
- Li Qi^{1} and
- Yanqing Sun^{1}
https://doi.org/10.1186/s40488-014-0023-3
© Qi and Sun; licensee Springer. 2014
Received: 24 July 2014
Accepted: 25 November 2014
Published: 16 December 2014
Abstract
In this paper, we investigate several well-known approaches for missing data and their relationships for the parametric probability regression model P_{ β }(Y|X) when the outcome of interest Y is subject to missingness. We explore the relationships between the mean score method, the inverse probability weighting (IPW) method and the augmented inverse probability weighted (AIPW) method. The asymptotic distributions of the IPW and AIPW estimators are derived and their efficiencies are compared. Our analysis details how efficiency may be gained from the AIPW estimator over the IPW estimator through estimation of the validation probability and augmentation. We show that the AIPW estimator that is based on augmentation using the full set of observed variables is more efficient than the AIPW estimator that is based on augmentation using a subset of observed variables. The developed approaches are applied to the Poisson regression model with missing outcomes based on auxiliary outcomes and a validated sample for true outcomes. We show that, by stratifying based on a set of discrete variables, the proposed statistical procedure can be formulated to analyze automated records that only contain summarized information at categorical levels. The proposed methods are applied to analyze influenza vaccine efficacy for an influenza vaccine study conducted in Temple-Belton, Texas during the 2000-2001 influenza season.
Mathematics Subject Classification: Primary 62J02; Secondary 62F12
Introduction
Suppose that Y is the outcome of interest and X is a covariate vector. One is often interested in the probability regression model P_{ β }(Y|X) that relates Y to X. In many medical and epidemiological studies, the complete observations on Y may not be available for all study subjects because of time, cost, or ethical concerns. In some situations, an easily measured but less accurate outcome named auxiliary outcome variable, A, is supplemented. The relationship between the true outcome Y and the auxiliary outcome A in the available observations can inform about the missing values of Y. Let V be a subsample of the study subjects, termed the validation sample, for which both true and auxiliary outcomes are available. Thus observations on (X,Y,A) are available for the subjects in V and only (X,A) are observed for those not in V.
It is well known that the complete-case analysis, which uses only subjects who have all variables observed, can be biased and inefficient; cf. Little and Rubin ([2002]). Issues also arise when substituting the auxiliary outcome for the true outcome; see Ellenberg and Hamilton ([1989]), Prentice ([1989]) and Fleming ([1992]). Inverse probability weighting (IPW) is a statistical technique developed for surveys by Horvitz and Thompson ([1952]) to calculate statistics standardized to a population different from that in which the data were collected. This approach has been generalized to many aspects of statistics under various frameworks. In particular, the IPW approach is used to account for missing data by inflating the weight for subjects who are underrepresented due to missingness. The method can potentially reduce the bias of the complete-case estimator when the weighting is correctly specified. However, this approach has been shown to be inefficient in several situations; see Clayton et al. ([1998]) and Scharfstein et al. ([1999]). Robins et al. ([1994]) developed an improved augmented inverse probability weighted (AIPW) complete-case estimation procedure. The method is more efficient and possesses the double robustness property. The multiple imputation described in Rubin ([1987]) has been routinely used to handle missing data. Carpenter et al. ([2006]) compared multiple imputation with IPW and AIPW, and found AIPW to be an attractive alternative in terms of double robustness and efficiency. Using maximum likelihood estimation (MLE) coupled with the EM-algorithm (Dempster et al. [1977]), Pepe et al. ([1994]) proposed the mean score method for the regression model P_{ β }(Y|X) when both X and A are discrete.
In this paper, we investigate several well-known approaches for missing data and their relationships for the parametric probability regression model P_{ β }(Y|X) when the outcome of interest Y is subject to missingness. We explore the relationships between the mean score method, IPW and AIPW. Our analysis details how efficiency is gained from the AIPW estimator over the IPW estimator through estimation of the validation probability and augmentation to the IPW score function. Applying the developed missing data methods, we derive the estimation procedures for the Poisson regression model with missing outcomes based on auxiliary outcomes and a validated sample for true outcomes. Further, we show that the proposed statistical procedures can be formulated to analyze automated records that only contain aggregated information at categorical levels, without using observations at individual levels.
The rest of the paper is organized as follows. Section 2 introduces several missing data approaches for the probability regression model P_{ β }(Y|X), where the outcome Y may be missing. Section 3 explores the relationships among these estimators. The asymptotic distributions of the IPW and AIPW estimators are derived and their efficiencies are compared. Section 3 also investigates the efficiency of two AIPW estimators, one based on augmentation using a subset of the observed variables and the other based on augmentation using the full set of observed variables. The procedures for Poisson regression using automated data with missing outcomes are derived in Section 4. The finite-sample performances of the estimators are studied in simulations in Section 5. The proposed method is then applied to analyze influenza vaccine efficacy for an influenza vaccine study conducted in Temple-Belton, Texas during the 2000-2001 influenza season. The proofs of the main results are given in Appendix A, while the proof of a simplified variance formula in Section 4 is placed in Appendix B.
Missing data approaches
Consider the probability regression model P_{ β }(Y|X), where Y is the outcome of interest and X is a covariate vector. Let A be the auxiliary outcome for Y and V be the validation set such that observations on (X,Y,A) are available for the subjects in V and only (X,A) are observed for those in $\bar{V}$, the complement of V. In practice, the validation sample may be selected based on the characteristics of a subset, Z, of the covariates in X. We write X=(Z,Z^{ c }). For example, Z may include the exposure indicator and other discrete covariates and Z^{ c } may be the exposure time. Let (Z_{ i },X_{ i },Y_{ i },A_{ i }), i=1,…,n, be independent identically distributed (iid) copies of (Z,X,Y,A). Let ξ_{ i }=I(i∈V) be the selection indicator.
Most statistical methods for missing data require some assumptions on the missingness mechanism. The commonly used ones are missing completely at random (MCAR) and missing at random (MAR). MCAR assumes that the probability of missingness in a variable is independent of any characteristics of the subjects. MAR assumes that the probability that a variable is missing depends only on observed variables. In practice, if missingness arises by design, it is often convenient to let the missing probability depend on the categorical variables only. Modeling the missing probability based on the categorical variables also simplifies statistical inference. We introduce the following missing at random assumptions.
MAR I: ξ_{ i } is independent of Y_{ i } conditional on (X_{ i },A_{ i }) and ξ_{ i } is independent of ${Z}_{i}^{c}$ conditional on (Z_{ i },A_{ i }).
MAR II: ξ_{ i } is independent of $({Y}_{i},{Z}_{i}^{c})$ conditional on (Z_{ i },A_{ i }).
Since the conditional density f(y,z^{ c }|ξ,z,a)=f(z^{ c }|ξ,z,a) f(y|z^{ c },ξ,z,a)=f(z^{ c }|z,a)f(y|z^{ c },z,a)=f(y,z^{ c }|z,a), MAR I implies MAR II. It is also easy to show that MAR II implies MAR.
The estimator ${\widehat{\beta}}_{I1}$ obtained by using ${W}_{i}^{I1}$ is an IPW estimator where a subject’s validation probability ${\pi}_{i}^{z}$ depends only on the category defined by (Z_{ i },A_{ i }). Because $E\{(\pi_i^z)^{-1}\xi_i S_\beta(Y_i \mid X_i)\}=E\{S_\beta(Y_i \mid X_i)\}=0$, the estimator ${\widehat{\beta}}_{I1}$ is approximately unbiased. The estimator ${\widehat{\beta}}_{I2}$ obtained by using ${W}_{i}^{I2}$ is also an IPW estimator but with the validation probability π_{ i } depending on the category defined by (Z_{ i },A_{ i }) and the additional covariate ${Z}_{i}^{c}$.
The estimator ${\widehat{\beta}}_{E1}$ obtained by using ${W}_{i}^{E1}$ is the mean score estimator where the scores S_{ β }(Y_{ i }|X_{ i }) for those with missing outcomes are replaced by the estimated conditional expectations given (Z_{ i },A_{ i }). The estimator ${\widehat{\beta}}_{E2}$ obtained by using ${W}_{i}^{E2}$ is the mean score estimator where the scores S_{ β }(Y_{ i }|X_{ i }) for those with missing outcomes are replaced by the estimated conditional expectations given (X_{ i },A_{ i }). The estimator ${\widehat{\beta}}_{E2}$ is the mean score estimator in Pepe et al. ([1994]). The mean score estimator is the MLE obtained with the EM-algorithm (Dempster et al. [1977]) under the assumption that the auxiliary outcome is noninformative in the sense that the probability model P_{ θ }(A|Y,X) is unrelated to β.
The estimator ${\widehat{\beta}}_{A1}$ obtained using ${W}_{i}^{A1}$ is the AIPW estimator augmented with the estimated conditional expectation $\widehat{E}\{S_\beta(Y \mid X_i) \mid Z_i,A_i\}$. The estimator ${\widehat{\beta}}_{A2}$ obtained using ${W}_{i}^{A2}$ is the AIPW estimator augmented with the estimated conditional expectation $\widehat{E}\{S_\beta(Y \mid X_i) \mid X_i,A_i\}$. The estimator ${\widehat{\beta}}_{A3}$ is obtained using ${W}_{i}^{A3}$. The ${W}_{i}^{A3}$ differs from ${W}_{i}^{A2}$ in that the estimated validation probability is ${\widehat{\pi}}_{i}$ instead of ${\widehat{\pi}}_{i}^{z}$.
which entails that the estimator ${\widehat{\beta}}_{A1}$ has the double robustness property in the sense that it is a consistent estimator of β if either ${\widehat{\pi}}_{i}^{z}$ is a consistent estimator of ${\pi}_{i}^{z}$ or $\widehat{E}[S_\beta(Y \mid X_i) \mid Z_i,A_i]$ is a consistent estimator of E[S_{ β }(Y|X_{ i })|Z_{ i },A_{ i }]. Similarly, under MAR I, the estimator ${\widehat{\beta}}_{A2}$ possesses the double robustness property in that ${\widehat{\beta}}_{A2}$ is a consistent estimator of β if either ${\widehat{\pi}}_{i}^{z}$ is a consistent estimator of ${\pi}_{i}^{z}$ or $\widehat{E}[S_\beta(Y \mid X_i) \mid X_i,A_i]$ is a consistent estimator of E[S_{ β }(Y|X_{ i })|X_{ i },A_{ i }]. The estimator ${\widehat{\beta}}_{A3}$ has a double robustness property similar to that of ${\widehat{\beta}}_{A2}$.
Method comparisons and asymptotic results
Under MAR II, (Y_{ i },X_{ i }) is independent of ξ_{ i } conditional on (Z_{ i },A_{ i }), so $\widehat{E}\{S_\beta(Y \mid X_i) \mid Z_i,A_i\}$ is an unbiased estimator of E{S_{ β }(Y|X_{ i })|Z_{ i },A_{ i }}.
Proposition 1.
Suppose that X=(Z,Z^{ c }) and A are discrete and their dimensionality is reasonably small. Under the nonparametric estimators ${\widehat{\pi}}_{i}^{z}={n}^{V}({Z}_{i},{A}_{i})/n({Z}_{i},{A}_{i})$, ${\widehat{\pi}}_{i}={n}^{V}({X}_{i},{A}_{i})/n({X}_{i},{A}_{i})$ and the estimators for the conditional expectation defined in (9) and (10), the estimators ${\widehat{\beta}}_{I1}$, ${\widehat{\beta}}_{E1}$ and ${\widehat{\beta}}_{A1}$ are equivalent, and the estimators ${\widehat{\beta}}_{I2}$, ${\widehat{\beta}}_{E2}$, ${\widehat{\beta}}_{A2}$ and ${\widehat{\beta}}_{A3}$ are equivalent. However, the estimator ${\widehat{\beta}}_{A2}$ is different from ${\widehat{\beta}}_{A1}$ unless ${Z}_{i}^{c}$ is linearly related to Z_{ i } in which case β is not identifiable.
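The nonparametric validation probabilities in Proposition 1 are plain cell frequencies, so they can be illustrated in a few lines. The sketch below is only an illustration under stated assumptions: the function name `cell_validation_probs` and the toy data are hypothetical, each cell label stands for an observed value of (Z_{ i },A_{ i }) or (X_{ i },A_{ i }), and every cell is assumed to be non-empty.

```python
from collections import Counter

def cell_validation_probs(cells, xi):
    """Estimate validation probabilities by cell frequencies n^V(cell) / n(cell).

    cells: list of hashable cell labels (e.g. (z, a) tuples), one per subject
    xi:    list of 0/1 validation indicators, one per subject
    Returns the estimated probability attached to each subject's cell.
    """
    n_all = Counter(cells)                                   # n(cell)
    n_val = Counter(c for c, v in zip(cells, xi) if v == 1)  # n^V(cell)
    return [n_val[c] / n_all[c] for c in cells]

# Hypothetical toy data: four subjects in cell (0, 1), two in cell (1, 2).
cells = [(0, 1), (0, 1), (0, 1), (0, 1), (1, 2), (1, 2)]
xi = [1, 0, 0, 0, 1, 1]
pi_hat = cell_validation_probs(cells, xi)

# Inverse-probability weights of the validated subjects.
weights = [v / p for v, p in zip(xi, pi_hat) if v == 1]
```

With cell-frequency estimates, the weights of the validated subjects in a cell add up to the cell size, so the validated sample is reweighted back to the full sample whenever every cell contains at least one validated subject.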
The results of Proposition 1 are very intriguing since research has shown that the AIPW and the mean score methods are more efficient than the IPW method. It is also intriguing that the AIPW estimators ${\widehat{\beta}}_{A2}$ and ${\widehat{\beta}}_{A3}$ are actually the same estimators, not affected by the validation probability. To further understand these approaches, we investigate the asymptotic properties of these methods where (X,A) are not necessarily discrete. Through the asymptotic analysis, we gain insights about what matters to the efficiency in terms of the selections of the validation sample and the augmentation function.
where a^{⊗2}=a a^{′}.
Theorem 1.
and $Q_i^A=(\xi_i/\pi_i)S_\beta(Y_i \mid X_i)+(1-\xi_i/\pi_i)E_a\{S_\beta(Y \mid X_i) \mid X_i,A_i\}$.
where $O_i=E\{\pi_i^{-2}\xi_i S_\beta(Y_i \mid X_i)(\partial\pi(X_i,A_i,\psi_0)/\partial\psi)^{\prime}\}(I^{\psi})^{-1}S_i^{\psi}$ and B_{ i }=(1−ξ_{ i }/π_{ i })E_{ a }{S_{ β }(Y|X_{ i })|X_{ i },A_{ i }}.
Suppose that the validation probability π_{ i } = P(ξ_{ i } = 1|X_{ i },A_{ i }) depends only on (Z_{ i },A_{ i }). That is, $\pi_i=\pi_i^z=P(\xi_i=1 \mid Z_i,A_i)$. Suppose that $\tilde{\pi}_i$ is the MLE of ${\pi}_{i}^{z}$ under the parametric family π(Z_{ i },A_{ i },ψ). Let ${\widehat{\beta}}_{A1}$ be the estimator obtained by solving (14) where the augmented term, $\tilde{E}\{S_\beta(Y \mid X_i) \mid X_i,A_i\}$, is a consistent parametric/nonparametric estimator of E{S_{ β }(Y|X_{ i })|Z_{ i },A_{ i }}. Let ${\widehat{\beta}}_{A2}$ be the estimator obtained by solving (14) where $\tilde{E}\{S_\beta(Y \mid X_i) \mid X_i,A_i\}$ is a consistent parametric/nonparametric estimator of E{S_{ β }(Y|X_{ i })|X_{ i },A_{ i }}. The following corollary presents the asymptotic results for two AIPW estimators of β: one corresponding to augmentation based on a subset, (Z,A), of the observed variables, and the other corresponding to augmentation based on the full set, (X,A), of the observed variables.
Corollary 1.
where $\Sigma_{A1}(\beta)=E[\{(1-\pi_i^z)/\pi_i^z\}\text{Var}\{S_\beta(Y_i \mid X_i) \mid Z_i,A_i\}]$ and $\Sigma_{A2}(\beta)=E[\{(1-\pi_i^z)/\pi_i^z\}\text{Var}\{S_\beta(Y_i \mid X_i) \mid X_i,A_i\}]$. The asymptotic variance of ${\widehat{\beta}}_{A2}$ is smaller than the asymptotic variance of ${\widehat{\beta}}_{A1}$ if the covariates Z_{ i } are a proper subset of X_{ i }.
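One way to see the ordering of the two variances is the law of total variance. The short derivation below is a sketch rather than the proof given in the paper's appendix; it uses only the facts that (Z_{ i },A_{ i }) is a function of (X_{ i },A_{ i }) and that ${\pi}_{i}^{z}$ depends on (Z_{ i },A_{ i }) alone, and the symbol $\succeq$ (positive semidefinite ordering) is notation introduced here.

```latex
\begin{aligned}
\operatorname{Var}\{S_\beta(Y_i \mid X_i) \mid Z_i, A_i\}
 &= E\big[\operatorname{Var}\{S_\beta(Y_i \mid X_i) \mid X_i, A_i\} \,\big|\, Z_i, A_i\big]
  + \operatorname{Var}\big[E\{S_\beta(Y_i \mid X_i) \mid X_i, A_i\} \,\big|\, Z_i, A_i\big], \\
\Sigma_{A1}(\beta) - \Sigma_{A2}(\beta)
 &= E\Big[\frac{1-\pi_i^z}{\pi_i^z}\,
    \operatorname{Var}\big[E\{S_\beta(Y_i \mid X_i) \mid X_i, A_i\} \,\big|\, Z_i, A_i\big]\Big]
  \succeq 0 .
\end{aligned}
```

The difference vanishes only when E{S_{ β }(Y|X_{ i })|X_{ i },A_{ i }} is already a function of (Z_{ i },A_{ i }), that is, when the additional covariates carry no further information about the conditional mean of the score.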
for (z,a)∈.
where B_{ i } and O_{ i } are defined following (16). The following corollary presents the analysis of the term ${n}^{-1/2}\sum _{i=1}^{n}({B}_{i}+{O}_{i})$ when (Z_{ i },A_{ i }) are discrete to understand how efficiency may be gained from the AIPW estimator over the IPW estimator.
Corollary 2.
By Corollary 2, (19) and (20), ${\widehat{\beta}}_{A}$ is more efficient than ${\widehat{\beta}}_{I}$ unless Var{S_{ β }(Y_{ j }|X_{ j })|Z_{ j }=z,A_{ j }=a}=0 for all (z,a) for which P(Z_{ i }=z,A_{ i }=a)≠0. If X=Z and the validation probability π_{ i }=P(ξ_{ i }=1|X_{ i },A_{ i }) is nonparametrically estimated with the cell frequencies ${\widehat{\psi}}_{z,a}={n}^{V}(z,a)/n(z,a)$, then ${\widehat{\beta}}_{A}$ and ${\widehat{\beta}}_{I}$ are asymptotically equivalent.
Remark Consider the estimators of β obtained based on the estimating equation (1) corresponding to different choices of W_{ i } given in (2) to (8). If (Z,A) are discrete and the validation probability ${\pi}_{i}^{z}=P({\xi}_{i}=1|{Z}_{i},{A}_{i})$ is estimated nonparametrically by the cell frequency, then by Theorem 1 and Corollary 2, ${\widehat{\beta}}_{A1}$ and ${\widehat{\beta}}_{I1}$ have the same asymptotic normal distribution as long as $\widehat{E}[S_\beta(Y \mid X_i) \mid Z_i,A_i]$ is a consistent estimator of E[S_{ β }(Y|X_{ i })|Z_{ i },A_{ i }]. But ${\widehat{\beta}}_{A2}$ is more efficient than ${\widehat{\beta}}_{I1}$ as long as $\widehat{E}[S_\beta(Y \mid X_i) \mid X_i,A_i]$ is a consistent estimator of E[S_{ β }(Y|X_{ i })|X_{ i },A_{ i }] since Var(B_{ i }+O_{ i }) is not zero by (21). These results are not affected by whether E[S_{ β }(Y|X_{ i })|Z_{ i },A_{ i }] and E[S_{ β }(Y|X_{ i })|X_{ i },A_{ i }] are estimated nonparametrically or based on some parametric models. In addition, by Theorem 1 and Corollaries 1 and 2, ${\widehat{\beta}}_{A3}$ and ${\widehat{\beta}}_{I2}$ have the same asymptotic normal distribution as long as $\widehat{E}[S_\beta(Y \mid X_i) \mid X_i,A_i]$ is a consistent estimator of E[S_{ β }(Y|X_{ i })|X_{ i },A_{ i }].
Poisson regression using the automated data with missing outcomes
Many medical and public health data are available only in aggregated format, where the variables of interest are aggregated counts without being available at individual levels. Many existing statistical methods for missing data require observations at individual levels. Applying the missing data methods presented in Section 3, we derive some estimation procedures for the Poisson regression model with missing outcomes based on auxiliary outcomes and a validated sample for true outcomes. Further, we show that, by stratifying based on a set of discrete variables, the proposed statistical procedure can be formulated so that it can be used to analyze automated records which do not contain observations at individual levels, only summarized information at categorical levels.
where γ^{′}=(a_{0},a_{1},γ_{1},⋯,γ_{k−1}).
We apply the missing data methods introduced in Section 3 to model (22). The variables (Z_{ i },T_{ i },Y_{ i },A_{ i }) are observed for the validation sample V and (Z_{ i },T_{ i },A_{ i }) are observed for the nonvalidation sample $\bar{V}$. While the covariate Z can be considered as categorical, it is natural to consider the exposure time T as a continuous variable. We assume that the validation probability depends only on the stratification of (Z,A). That is, the validation sample is a stratified random sample by the categories defined by (Z,A). Of those estimators discussed in Section 2, there are only two different estimators, ${\widehat{\beta}}_{I1}$ and ${\widehat{\beta}}_{A2}$. We show in Section 4.3 that the proposed method can be formulated so that it can be used to analyze the automated records with missing outcomes. First we derive the explicit estimation procedures for ${\widehat{\beta}}_{I1}$ and ${\widehat{\beta}}_{A2}$ and their variance estimators under model (22).
4.1 Inverse probability weighting estimation
We adopt all notations introduced in Section 3. In particular, let ${\pi}_{i}^{z}=P({\xi}_{i}=1|{Z}_{i},{A}_{i})$ and ${\widehat{\pi}}_{i}^{z}={n}^{V}({Z}_{i},{A}_{i})/n({Z}_{i},{A}_{i})$. Let X=(Z,T) and X_{ i }=(Z_{ i },T_{ i }) to be consistent with earlier notations. The score function for subject i under model (22) is $S_\beta(Y_i \mid X_i)=Z_i^{\prime}(Y_i-T_i\exp(\beta^{\prime}Z_i))$. The estimator ${\widehat{\beta}}_{I1}$ is obtained by solving $\sum_{i=1}^{n}(\xi_i/\widehat{\pi}_i^z)S_\beta(Y_i \mid X_i)=0$. By Corollary 1, $\sqrt{n}({\widehat{\beta}}_{I1}-\beta )$ converges in distribution to a normal distribution with mean zero and the variance matrix I^{−1}(β)+I^{−1}(β)Σ_{A 1}(β)I^{−1}(β), where $\Sigma_{A1}(\beta)=E[\{(1-\pi_i^z)/\pi_i^z\}\text{Var}\{S_\beta(Y_i \mid X_i) \mid Z_i,A_i\}]$.
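The IPW estimating equation for the Poisson model can be solved numerically. The following is a minimal sketch, assuming Newton's method started at zero and strictly positive estimated validation probabilities; the function name `ipw_poisson_fit` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def ipw_poisson_fit(Z, T, Y, xi, pi_hat, n_iter=50, tol=1e-10):
    """Solve sum_i (xi_i / pi_i) Z_i (Y_i - T_i exp(beta' Z_i)) = 0 by Newton's method."""
    Z = np.asarray(Z, dtype=float)
    T = np.asarray(T, dtype=float)
    xi = np.asarray(xi, dtype=float)
    # Non-validated subjects get weight zero, so their (missing) outcomes are never used.
    Y = np.where(xi == 1, np.asarray(Y, dtype=float), 0.0)
    w = xi / np.asarray(pi_hat, dtype=float)
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        mu = T * np.exp(Z @ beta)               # expected counts under the current beta
        score = Z.T @ (w * (Y - mu))            # weighted score
        hess = (Z * (w * mu)[:, None]).T @ Z    # negative Jacobian of the weighted score
        step = np.linalg.solve(hess, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical noise-free check: outcomes equal their means, one outcome is missing.
Z = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([1.0, 2.0, 3.0, 4.0])
beta_true = np.array([0.2, -0.5])
Y = T * np.exp(Z @ beta_true)
Y[2] = np.nan                                   # outcome not observed for subject 3
xi = np.array([1, 1, 0, 1])
pi_hat = np.array([0.5, 1.0, 0.5, 1.0])
beta_hat = ipw_poisson_fit(Z, T, Y, xi, pi_hat)
```

Because the toy outcomes are set exactly to their model means, the weighted score is zero at the true parameter and the solver recovers it; with real count data the solution varies with the sample.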
where $\widehat{\rho}(a,z)$ is the estimator of P{A_{ i }=a,Z_{ i }=z}, ${\widehat{\rho}}^{v}(a,z)$ is the estimator of P{i∈V|A_{ i }=a,Z_{ i }=z}, and $\widehat{\text{Var}}\{S_\beta(Y \mid X) \mid A=a,Z=z\}$ is an estimator of Var{S_{ β }(Y_{ i }|X_{ i })|Z_{ i },A_{ i }}, which is derived as follows.
Since A is observed for all subjects, W can be determined if Y is known, and is undetermined otherwise. The IPW estimator, ${\widehat{\gamma}}_{I1}$, of γ can be obtained by solving the equation $\sum_{i=1}^{n}(\xi_i/\widehat{\pi}_i^z)S_\gamma(W_i \mid X_i)=0$, where $S_\gamma(W_i \mid X_i)=Z_i^{\prime}(W_i-T_i\exp(\gamma^{\prime}Z_i))$. The conditional distribution of Y given A=a, T, and Z=z is Binomial (a,p_{ z }), where p_{ z }= exp(β^{′}z)/(exp(β^{′}z)+ exp(γ^{′}z)). Since this conditional distribution does not depend on T, the outcome Y and T are conditionally independent given (A,Z). Therefore, Var{S_{ β }(Y|X)|A,Z}=Z Z^{′}{Var(Y|A,Z)+ exp(2β^{′}Z)Var(T|A,Z)}. The variance Var(Y|A=a,Z=z) can be estimated by $a{\widehat{p}}_{z}(1-{\widehat{p}}_{z})$, where ${\widehat{p}}_{z}=\exp({\widehat{\beta}}^{\prime}z)/\{\exp({\widehat{\beta}}^{\prime}z)+\exp({\widehat{\gamma}}^{\prime}z)\}$, and Var(T|A=a,Z=z)=E(T^{2}|A=a,Z=z)−{E(T|A=a,Z=z)}^{2} can be estimated nonparametrically using the first and the second sample moments conditional on each category with A=a and Z=z.
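The variance formula above reduces to a scalar factor multiplying Z Z^{′} within each cell. As a rough sketch, assuming the binomial form of Y given (A,Z) and a within-cell variance of T that has already been estimated, that factor could be computed as follows; the function name `cell_score_variance_factor` and the inputs are hypothetical.

```python
import math

def cell_score_variance_factor(a, z, beta, gamma, var_t):
    """Scalar v such that Var{S_beta(Y|X) | A=a, Z=z} = z z' * v.

    Uses Var(Y | A=a, Z=z) = a p_z (1 - p_z) with
    p_z = exp(beta'z) / (exp(beta'z) + exp(gamma'z)),
    plus exp(2 beta'z) times the within-cell variance of T (var_t).
    """
    eta_b = sum(b * zj for b, zj in zip(beta, z))
    eta_g = sum(g * zj for g, zj in zip(gamma, z))
    p_z = math.exp(eta_b) / (math.exp(eta_b) + math.exp(eta_g))
    var_y = a * p_z * (1.0 - p_z)
    return var_y + math.exp(2.0 * eta_b) * var_t

# Hypothetical inputs: equal rates for Y and W give p_z = 1/2.
v = cell_score_variance_factor(a=4, z=(1, 0), beta=(0.0, 0.0), gamma=(0.0, 0.0), var_t=2.0)
```

In the toy call, Var(Y|A,Z) = 4 × 0.5 × 0.5 = 1 and the exposure-time term is 1 × 2, so the factor is 3.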
4.2 Augmented inverse probability weighted estimation
where $\widehat{\rho}(a,z)$ is the estimator of P{A_{ i }=a,Z_{ i }=z} and ${\widehat{\rho}}^{v}(a,z)$ is the estimator of P{i∈V|A_{ i }=a,Z_{ i }=z}.
4.3 Estimation using the automated data
This section formulates the missing data estimation procedure for (22) based on the automated (summarized) information at categorical levels defined by relevant covariates of the model. In particular, we show that ${\widehat{\beta}}_{I1}$ and ${\widehat{\beta}}_{A2}$ and their variance estimators can be formulated using the automated data at categorical levels.
In many applications it is convenient to write Z=(1,Z_{(1)},Z_{(2)}) and β=(b_{0},b_{1},θ^{′})^{′}, where Z_{(1)} is the treatment indicator (Z_{(1)}=1 for the exposed group and Z_{(1)}=0 for the unexposed group), Z_{(2)}=(η_{1},⋯,η_{k−1})^{′} contains the other covariates, and θ=(θ_{1},⋯,θ_{k−1})^{′}. For the applications involving the automated data records, we let η_{1},⋯,η_{k−1} be k−1 dummy variables representing k groups. Without loss of generality, we choose the kth group as the reference group: η_{1}=1, η_{2}=0, ⋯, η_{k−1}=0 for group 1, η_{1}=0, η_{2}=1, ⋯, η_{k−1}=0 for group 2, and so on, with η_{1}=0, η_{2}=0, ⋯, η_{k−1}=0 for group k. Thus each value of Z denotes a category which can be represented by (l,m) for l=0,1 and m=1,⋯,k. This correspondence is denoted by Z≃(l,m) for convenience. For l=0,1 and m=1,⋯,k−1, category (l,m) is defined by Z with Z_{(1)}=l, η_{ m }=1 and η_{ j }=0 for j≠m, j=1,…,k−1, and category (l,k) is defined by Z_{(1)}=l and η_{ j }=0 for j=1,…,k−1. Under model (22), the expected number of events for a subject in category (l,m) with the time-exposure interval [0,T] is T exp(b_{ lm }), for l=0,1 and m=1,⋯,k, where b_{1k}=b_{0}+b_{1}, b_{0k}=b_{0}, b_{1m}=b_{0}+b_{1}+θ_{ m } and b_{0m}=b_{0}+θ_{ m } for 1≤m≤k−1. The parameter b_{1} represents the log-relative rate of the exposed versus the unexposed adjusted for other factors.
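The mapping from a category (l,m) to its log rate b_{ lm } can be written compactly. The helper below is a small sketch of that bookkeeping; the name `log_rate` and the numerical values in the usage lines are hypothetical.

```python
def log_rate(l, m, b0, b1, theta):
    """Return b_lm for category (l, m).

    theta holds (theta_1, ..., theta_{k-1}); group k is the reference group,
    so it contributes no group effect.
    """
    k = len(theta) + 1
    if l not in (0, 1) or not 1 <= m <= k:
        raise ValueError("category (l, m) out of range")
    group_effect = theta[m - 1] if m < k else 0.0
    return b0 + b1 * l + group_effect

# Hypothetical parameter values with k = 3 groups.
b0, b1, theta = -0.5, -0.8, [-0.6, 0.3]
b_exposed_ref = log_rate(1, 3, b0, b1, theta)       # b_{1k} = b0 + b1
b_unexposed_g1 = log_rate(0, 1, b0, b1, theta)      # b_{01} = b0 + theta_1
log_relative_rate = log_rate(1, 2, b0, b1, theta) - log_rate(0, 2, b0, b1, theta)
```

Within any group m the difference b_{1m}−b_{0m} equals b_{1}, which is why b_{1} is interpreted as the adjusted log-relative rate.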
The following notations are used to show that the estimators of β and their variance estimators can be calculated using the automated information at the categorical levels. Let V(a,l,m) denote the set of subjects in V with (A=a, Z≃(l,m)), V(l,m) the set of subjects in V with (Z≃(l,m)), n_{ alm } the number of subjects with (A=a, Z≃(l,m)), $n_{alm}^{v}$ the number of subjects in V(a,l,m), $n_{lm}^{v}$ the number of subjects in V(l,m), $\lambda_{alm}=n_{alm}/n_{alm}^{v}$, y_{ alm } the number of events for subjects in V(a,l,m), y_{ lm } the number of events for subjects in V(l,m), t_{ alm } the total exposure time for subjects with (A=a, Z≃(l,m)), t_{2,alm} the total squared exposure time for subjects with (A=a, Z≃(l,m)), t_{ lm } the total exposure time for subjects with Z≃(l,m), and α_{ lm } the number of automated events for subjects with Z≃(l,m).
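To make the cell-level notation concrete, the sketch below collapses individual records into the summaries n_{ alm }, $n_{alm}^{v}$, y_{ alm }, t_{ alm } and t_{2,alm}. It is only an illustration: the record layout (dictionaries with keys `a`, `l`, `m`, `t`, `xi`, `y`) and the function name `summarize_cells` are assumptions, not a format used in the paper.

```python
from collections import defaultdict

def summarize_cells(records):
    """Aggregate individual records into per-cell totals keyed by (a, l, m).

    Each record is a dict with keys a, l, m, t (exposure time),
    xi (1 if validated) and y (true events; ignored when xi == 0).
    """
    cells = defaultdict(lambda: {"n": 0, "n_v": 0, "y": 0, "t": 0.0, "t2": 0.0})
    for r in records:
        c = cells[(r["a"], r["l"], r["m"])]
        c["n"] += 1                 # n_alm
        c["t"] += r["t"]            # t_alm
        c["t2"] += r["t"] ** 2      # t_{2,alm}
        if r["xi"] == 1:
            c["n_v"] += 1           # n^v_alm
            c["y"] += r["y"]        # y_alm
    return dict(cells)

# Hypothetical records: two subjects share cell (a=2, l=1, m=1), one is validated.
records = [
    {"a": 2, "l": 1, "m": 1, "t": 1.5, "xi": 1, "y": 1},
    {"a": 2, "l": 1, "m": 1, "t": 2.0, "xi": 0, "y": None},
    {"a": 0, "l": 0, "m": 2, "t": 4.0, "xi": 1, "y": 0},
]
summary = summarize_cells(records)
cell = summary[(2, 1, 1)]
lam = cell["n"] / cell["n_v"]       # lambda_alm = n_alm / n^v_alm
```

The quantities indexed by (l,m) alone, such as t_{ lm }, follow by summing the corresponding cell totals over a.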
for l=0,1 and m=1,…,k−1. When k>1, the equations have no explicit solutions.
where r=k−1 and $q_{lm}=E(T_i e^{b_{lm}} I\{\text{individual } i \text{ in category } (l,m)\})$. The consistent estimator, $\widehat{I}(\widehat{\beta})$, of I(β) is thus obtained by replacing q_{ lm } with $\exp(\widehat{b}_{lm})t_{lm}/n$.
where $\widehat{\rho}(a,l,m)=n_{alm}/n$, $\widehat{\rho}^{v}(a,l,m)=n_{alm}^{v}/n_{alm}$ and G_{ lm } is the value of $G_i=z_i^{\otimes 2}$ when subject i belongs to the category (l,m). Hence the covariance matrix of ${\widehat{\beta}}_{I1}$ can be estimated by $\widehat{I}^{-1}(\widehat{\beta})+\widehat{I}^{-1}(\widehat{\beta})\widehat{\Sigma}_{A1}(\widehat{\beta})\widehat{I}^{-1}(\widehat{\beta})$ using the automated data.
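The variance expression can be assembled from cell totals alone. The sketch below assumes that the information matrix takes the form I(β)=Σ_{ l,m } q_{ lm }G_{ lm }, which is what the Poisson score Z(Y−T exp(β^{′}Z)) implies, and that an estimate of Σ_{A 1}(β) is already available; the function names and the one-parameter toy inputs are hypothetical.

```python
import numpy as np

def info_from_cells(b_hat, t_tot, designs, n):
    """I_hat = sum over cells (l, m) of exp(b_hat_lm) * t_lm / n * z_lm z_lm'."""
    p = len(next(iter(designs.values())))
    info = np.zeros((p, p))
    for cell, z in designs.items():
        z = np.asarray(z, dtype=float)
        info += np.exp(b_hat[cell]) * t_tot[cell] / n * np.outer(z, z)
    return info

def sandwich_cov(info, sigma):
    """Return I^{-1} + I^{-1} Sigma I^{-1}."""
    inv = np.linalg.inv(info)
    return inv + inv @ sigma @ inv

# Hypothetical intercept-only example with a single cell.
designs = {(0, 1): [1.0]}          # design vector z for the cell
b_hat = {(0, 1): 0.0}              # fitted log rate
t_tot = {(0, 1): 10.0}             # total exposure time t_lm
info = info_from_cells(b_hat, t_tot, designs, n=5)
cov = sandwich_cov(info, sigma=np.array([[1.0]]))
```

Since the limiting result is stated for $\sqrt{n}({\widehat{\beta}}_{I1}-\beta )$, the returned matrix would presumably be divided by n to approximate the covariance of the estimator itself.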
for l=0,1 and m=1,…,k−1.
Hence the covariance matrix of ${\widehat{\beta}}_{A2}$ can be estimated by $\widehat{I}^{-1}(\widehat{\beta})+\widehat{I}^{-1}(\widehat{\beta})\widehat{\Sigma}_{A2}(\widehat{\beta})\widehat{I}^{-1}(\widehat{\beta})$ using the automated data.
which is the weighted sum of the estimated variances for the estimated log relative rate of the exposed versus the unexposed over k groups. The details of the derivation are given in Appendix B.
A simulation study
We conduct a simulation study to examine the finite sample performance of the IPW estimator ${\widehat{\beta}}_{I1}$ and the AIPW estimator ${\widehat{\beta}}_{A2}$. We consider the Poisson regression model (22). The covariates Z_{1} and Z_{2} are generated from Bernoulli distributions with success probabilities 0.4 and 0.5, respectively. The exposure time T is generated from a uniform distribution on [0,10]. Given Z=(Z_{1},Z_{2}) and T, the outcome variable Y follows a Poisson distribution with mean T exp(b_{0}+b_{1}Z_{1}+θ Z_{2}) where b_{0}=−0.5, b_{1}=−0.8 and θ=−0.6, and W follows a Poisson distribution with mean T exp(a_{0}+a_{1}Z_{1}+γ Z_{2}) where a_{0}=−1.3, a_{1}=−1.1, γ=−1. We set A=Y+W.
Four models for the validation sample are considered. Under Model 1, the validation sample is a simple random sample with probability π_{ i }=0.4. Model 2 considers π_{ i }=0.6. In Model 3, the validation probability only depends on A through the logistic regression model logit{π_{ i }(X,A)}=A−0.5 where X=(Z,T). In Model 4, the validation probability depends on A and Z_{1} through the logistic regression model logit{π_{ i }(X,A)}=A−Z_{1}−0.5.
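The data-generating design just described can be sketched as follows. This is an illustrative reconstruction using NumPy's random generator, not the authors' simulation code; the function name `simulate`, the returned dictionary layout and the seed are assumptions.

```python
import numpy as np

def simulate(n, model, rng):
    """Draw one data set of size n under validation Model 1, 2, 3 or 4."""
    z1 = rng.binomial(1, 0.4, size=n)
    z2 = rng.binomial(1, 0.5, size=n)
    t = rng.uniform(0.0, 10.0, size=n)
    y = rng.poisson(t * np.exp(-0.5 - 0.8 * z1 - 0.6 * z2))   # true outcome
    w = rng.poisson(t * np.exp(-1.3 - 1.1 * z1 - 1.0 * z2))
    a = y + w                                                 # auxiliary outcome
    if model == 1:
        pi = np.full(n, 0.4)
    elif model == 2:
        pi = np.full(n, 0.6)
    elif model == 3:
        pi = 1.0 / (1.0 + np.exp(-(a - 0.5)))                 # logit(pi) = A - 0.5
    elif model == 4:
        pi = 1.0 / (1.0 + np.exp(-(a - z1 - 0.5)))            # logit(pi) = A - Z1 - 0.5
    else:
        raise ValueError("model must be 1, 2, 3 or 4")
    xi = rng.binomial(1, pi)                                  # validation indicator
    return {"z1": z1, "z2": z2, "t": t, "y": y, "w": w, "a": a, "pi": pi, "xi": xi}

data = simulate(200, model=3, rng=np.random.default_rng(0))
```

In an analysis, `y` would be treated as observed only where `xi` equals 1; it is kept for all subjects here so that full-data estimates can be computed for comparison.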
Simulation comparison of the IPW estimator ${\widehat{\beta}}_{I1}$ , the AIPW estimator ${\widehat{\beta}}_{A2}$ and the complete-case (CC) estimator ${\widehat{\beta}}_{C}$ under various sample sizes and selection probabilities
n | Method | b_{0} Bias | b_{0} SSE | b_{0} ESE | b_{0} CP | b_{1} Bias | b_{1} SSE | b_{1} ESE | b_{1} CP | θ Bias | θ SSE | θ ESE | θ CP
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Model 1: π_{ i }=.4 |||||||||||||
50 | IPW | -.0415 | .3561 | .1839 | .851 | -.2175 | 1.6737 | .3354 | .864 | -.1610 | 1.2201 | .2962 | .847
50 | AIPW | -.0110 | .2213 | .1664 | .890 | -.0062 | .3099 | .2873 | .943 | -.0186 | .3076 | .2551 | .929
50 | CC | -.0246 | .3398 | .2738 | .938 | -.1515 | 1.6082 | .4730 | .968 | -.1038 | 1.1709 | .4187 | .959
100 | IPW | -.0650 | .1815 | .1404 | .870 | -.0548 | .3120 | .2458 | .891 | -.0249 | .2653 | .2161 | .898
100 | AIPW | -.0094 | .1376 | .1187 | .914 | -.0024 | .2284 | .1988 | .926 | .0027 | .1994 | .1780 | .925
100 | CC | -.0240 | .1728 | .1685 | .948 | -.0086 | .3086 | .2981 | .960 | .0031 | .2556 | .2581 | .946
300 | IPW | -.0368 | .0936 | .0874 | .931 | -.0209 | .1535 | .1460 | .946 | -.0022 | .1419 | .1286 | .929
300 | AIPW | -.0027 | .0732 | .0712 | .946 | -.0028 | .1233 | .1165 | .940 | .0005 | .1130 | .1046 | .938
300 | CC | -.0092 | .0919 | .0935 | .960 | -.0012 | .1627 | .1634 | .952 | .0040 | .1438 | .1432 | .952
500 | IPW | -.0183 | .0698 | .0671 | .938 | -.0172 | .1159 | .1128 | .943 | -.0083 | .1069 | .0993 | .933
500 | AIPW | .0022 | .0566 | .0550 | .936 | -.0022 | .0956 | .0902 | .943 | -.0068 | .0867 | .0811 | .930
500 | CC | .0006 | .0704 | .0716 | .949 | -.0059 | .1268 | .1255 | .949 | -.0046 | .1135 | .1103 | .942
800 | IPW | -.0126 | .0538 | .0527 | .942 | -.0134 | .0862 | .0889 | .950 | -.0029 | .0759 | .0779 | .947
800 | AIPW | .0011 | .0433 | .0435 | .952 | -.0047 | .0720 | .0713 | .956 | -.0020 | .0638 | .0640 | .951
800 | CC | .0002 | .0562 | .0565 | .948 | -.0051 | .0974 | .0990 | .958 | -.0013 | .0844 | .0869 | .958
Model 2: π_{ i }=.6 |||||||||||||
50 | IPW | -.0316 | .2079 | .1714 | .926 | -.0934 | .8426 | .3112 | .944 | -.0563 | .3320 | .2690 | .937
50 | AIPW | -.0072 | .1723 | .1591 | .941 | -.0105 | .2893 | .2772 | .950 | -.0172 | .2653 | .2440 | .948
50 | CC | -.0126 | .1973 | .1949 | .959 | -.0594 | .8369 | .3512 | .967 | -.0278 | .3213 | .3044 | .959
100 | IPW | -.0366 | .1399 | .1259 | .926 | -.0420 | .2363 | .2192 | .944 | -.0100 | .2103 | .1911 | .925
100 | AIPW | -.0121 | .1206 | .1133 | .941 | -.0107 | .2069 | .1921 | .944 | .0078 | .1764 | .1700 | .940
100 | CC | -.0142 | .1370 | .1345 | .947 | -.0216 | .2379 | .2370 | .961 | .0030 | .2103 | .2072 | .949
300 | IPW | -.0138 | .0742 | .0728 | .944 | -.0194 | .1267 | .1250 | .957 | -.0049 | .1064 | .1096 | .964
300 | AIPW | -.0030 | .0650 | .0651 | .948 | -.0044 | .1136 | .1093 | .949 | .0005 | .0960 | .0974 | .956
300 | CC | -.0017 | .0763 | .0759 | .946 | -.0118 | .1345 | .1328 | .951 | -.0035 | .1147 | .1169 | .957
500 | IPW | -.0069 | .0571 | .0555 | .945 | -.0096 | .0946 | .0965 | .947 | -.0094 | .0866 | .0844 | .953
500 | AIPW | .0029 | .0495 | .0496 | .942 | -.0032 | .0856 | .0841 | .947 | -.0076 | .0757 | .0749 | .955
500 | CC | .0013 | .0577 | .0581 | .947 | -.0034 | .1024 | .1019 | .949 | -.0086 | .0906 | .0899 | .954
800 | IPW | -.0072 | .0437 | .0438 | .954 | -.0069 | .0754 | .0763 | .956 | -.0025 | .0692 | .0664 | .947
800 | AIPW | -.0011 | .0401 | .0393 | .951 | -.0019 | .0693 | .0665 | .943 | -.0015 | .0626 | .0590 | .931
800 | CC | -.0012 | .0452 | .0460 | .958 | -.0026 | .0805 | .0806 | .952 | -.0024 | .0723 | .0709 | .952
Full data: π_{ i }=1 |||||||||||||
50 | | -.0079 | .1510 | .1466 | .952 | -.0182 | .2691 | .2618 | .948 | -.0104 | .2264 | .2263 | .957
100 | | -.0079 | .1068 | .1024 | .943 | -.0075 | .1841 | .1798 | .948 | -.0039 | .1560 | .1577 | .949
300 | | -.0019 | .0596 | .0583 | .950 | -.0081 | .1032 | .1023 | .936 | .0001 | .0934 | .0898 | .932
500 | | .0006 | .0452 | .0450 | .951 | -.0041 | .0783 | .0788 | .950 | .0014 | .0656 | .0693 | .960
800 | | -.0004 | .0343 | .0356 | .951 | .0025 | .0612 | .0622 | .938 | -.0006 | .0532 | .0547 | .955
Table 2 Simulation comparison of the IPW estimator ${\widehat{\beta}}_{I1}$, the AIPW estimator ${\widehat{\beta}}_{A2}$ and the complete-case (CC) estimator ${\widehat{\beta}}_{C}$ under various sample sizes and selection probabilities
n | Method | Bias (b_{0}) | SSE (b_{0}) | ESE (b_{0}) | CP (b_{0}) | Bias (b_{1}) | SSE (b_{1}) | ESE (b_{1}) | CP (b_{1}) | Bias (θ) | SSE (θ) | ESE (θ) | CP (θ) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Model 3 | | | | | | | | | | | | | |
50 | IPW | .0081 | .1609 | .1502 | .949 | -.0034 | .5535 | .2790 | .954 | -.0116 | .2592 | .2400 | .963 |
50 | AIPW | -.0070 | .1543 | .1486 | .950 | -.0134 | .2715 | .2690 | .958 | -.0185 | .2364 | .2330 | .955 |
50 | CC | .0230 | .1529 | .1504 | .938 | .0798 | .5441 | .2835 | .940 | .0648 | .2367 | .2414 | .952 |
100 | IPW | -.0052 | .1145 | .1077 | .948 | -.0001 | .2073 | .2014 | .959 | .0030 | .1789 | .1724 | .948 |
100 | AIPW | -.0124 | .1077 | .1041 | .947 | -.0085 | .1869 | .1840 | .957 | .0050 | .1636 | .1606 | .948 |
100 | CC | .0221 | .1074 | .1054 | .939 | .1023 | .1830 | .1937 | .924 | .0828 | .1625 | .1664 | .915 |
300 | IPW | -.0011 | .0617 | .0614 | .951 | -.0044 | .1176 | .1157 | .952 | .0019 | .1009 | .0993 | .953 |
300 | AIPW | -.0023 | .0582 | .0588 | .956 | -.0051 | .1056 | .1036 | .954 | .0022 | .0936 | .0910 | .944 |
300 | CC | .0295 | .0577 | .0596 | .924 | .1051 | .1070 | .1095 | .824 | .0823 | .0925 | .0946 | .861 |
500 | IPW | .0018 | .0451 | .0473 | .958 | -.0037 | .0853 | .0895 | .958 | -.0069 | .0765 | .0767 | .945 |
500 | AIPW | .0009 | .0430 | .0452 | .957 | -.0032 | .0793 | .0798 | .947 | -.0066 | .0689 | .0702 | .951 |
500 | CC | .0317 | .0429 | .0459 | .903 | .1077 | .0788 | .0844 | .763 | .0737 | .0704 | .0730 | .839 |
800 | IPW | -.0006 | .0374 | .0375 | .951 | -.0030 | .0671 | .0708 | .962 | .0004 | .0617 | .0605 | .946 |
800 | AIPW | -.0003 | .0362 | .0358 | .949 | -.0031 | .0623 | .0631 | .954 | -.0012 | .0577 | .0554 | .935 |
800 | CC | .0315 | .0353 | .0364 | .863 | .1065 | .0630 | .0667 | .633 | .0786 | .0568 | .0576 | .721 |
Model 4 | | | | | | | | | | | | | |
50 | IPW | .0053 | .1627 | .1504 | .948 | .0825 | .3531 | .2832 | .913 | -.0057 | .2736 | .2405 | .948 |
50 | AIPW | -.0085 | .1549 | .1489 | .950 | -.0122 | .2746 | .2752 | .966 | -.0138 | .2395 | .2340 | .961 |
50 | CC | .2295 | .2640 | .0855 | .531 | .4513 | .3805 | .1760 | .517 | .2954 | .3285 | .1409 | .536 |
100 | IPW | -.0050 | .1168 | .1085 | .939 | .0481 | .2350 | .2130 | .922 | .0016 | .1884 | .1761 | .940 |
100 | AIPW | -.0130 | .1083 | .1043 | .943 | -.0067 | .1920 | .1885 | .950 | .0066 | .1648 | .1613 | .949 |
100 | CC | .0196 | .1077 | .1063 | .943 | .2010 | .1946 | .2087 | .820 | .0900 | .1645 | .1702 | .910 |
300 | IPW | -.0001 | .0630 | .0624 | .945 | -.0001 | .1323 | .1311 | .955 | -.0011 | .1043 | .1038 | .946 |
300 | AIPW | -.0020 | .0588 | .0588 | .951 | -.0052 | .1095 | .1059 | .952 | .0012 | .0950 | .0913 | .931 |
300 | CC | .0271 | .0582 | .0601 | .930 | .2020 | .1060 | .1173 | .576 | .0894 | .0939 | .0965 | .863 |
500 | IPW | .0012 | .0457 | .0480 | .951 | -.0007 | .0966 | .1010 | .966 | -.0054 | .0801 | .0799 | .948 |
500 | AIPW | .0006 | .0433 | .0453 | .956 | -.0010 | .0821 | .0813 | .941 | -.0059 | .0697 | .0704 | .950 |
500 | CC | .0291 | .0434 | .0463 | .912 | .2047 | .0817 | .0903 | .364 | .0815 | .0711 | .0745 | .820 |
800 | IPW | -.0006 | .0381 | .0380 | .949 | .0004 | .0761 | .0794 | .967 | .0000 | .0636 | .0630 | .947 |
800 | AIPW | -.0002 | .0362 | .0359 | .949 | -.0016 | .0640 | .0641 | .955 | -.0014 | .0574 | .0555 | .942 |
800 | CC | .0288 | .0356 | .0367 | .885 | .2039 | .0644 | .0714 | .166 | .0864 | .0570 | .0588 | .673 |
Table 1 shows that under Model 1 and Model 2, the bias of all estimators is very small at a level comparable with that of the full data estimator. The bias decreases with increased sample size and the increased level of the validation probability. The empirical standard errors are in good agreement with the corresponding estimated standard errors, except for the IPW estimator when n≤100 and π≤0.6. Among them, AIPW has the smallest standard errors for all parameters and sample sizes concerned. The coverage probabilities of the confidence intervals for b_{0}, b_{1} and θ are close to the nominal level 95%. When the sample size and the validation probability are both small, for example, n=50 and π=0.4, the IPW has large bias and is unstable but the AIPW still performs well.
Table 2 gives the results under Model 3 and Model 4. The bias remains small for ${\widehat{\beta}}_{I1}$ and ${\widehat{\beta}}_{A2}$. The empirical standard errors are also close to the corresponding estimated standard errors. The coverage probabilities remain close to the nominal level 95% for all IPW and AIPW estimators. However, the complete-case estimator yields larger bias and incorrect coverage probabilities because the validation probability depends on the auxiliary variable A and/or the covariate Z_{1}, in which case the outcomes are not missing completely at random. The AIPW estimator performs better than the IPW estimator, with smaller standard errors.
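The mechanism behind the complete-case bias can be seen in a much simpler setting than the Poisson models of the simulation study. The following is a minimal sketch, not the simulation design used in the paper: it estimates a plain mean E[Y] for a binary outcome, with a hypothetical auxiliary outcome A and hypothetical validation probabilities (0.8 when A=1, 0.3 when A=0) that are treated as known. All numerical settings here are assumptions chosen for illustration.

```python
import random


def compare_cc_and_ipw(n=100_000, seed=1):
    """Estimate E[Y] when validation depends on the auxiliary outcome A.

    Hypothetical setup (not from the paper): Y is Bernoulli(0.3), A agrees
    with Y 80% of the time, and the validation probability is a known
    function of A only.
    """
    rng = random.Random(seed)
    cc_sum = 0.0   # sum of Y over validated subjects
    cc_n = 0       # number of validated subjects
    ipw_sum = 0.0  # sum of Y / pi over validated subjects
    for _ in range(n):
        y = 1 if rng.random() < 0.3 else 0       # true outcome, E[Y] = 0.3
        a = y if rng.random() < 0.8 else 1 - y   # auxiliary outcome
        pi = 0.8 if a == 1 else 0.3              # validation probability given A
        if rng.random() < pi:                    # subject enters validation sample
            cc_sum += y
            cc_n += 1
            ipw_sum += y / pi
    # Complete-case mean ignores the selection; IPW reweights by 1 / pi.
    return cc_sum / cc_n, ipw_sum / n


cc_estimate, ipw_estimate = compare_cc_and_ipw()
print(f"true mean 0.300, CC {cc_estimate:.3f}, IPW {ipw_estimate:.3f}")
```

Under these assumed settings the complete-case mean converges to 0.21/0.49, roughly 0.43, rather than 0.3, because subjects with A=1 (and hence more often Y=1) are over-represented in the validation sample, while the inverse probability weights undo that over-representation.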
An Application
A community-based, nonrandomized, open-label influenza vaccine (CAIV-T) study was conducted in Temple-Belton, Texas during the 2000-2001 influenza season. A total of 11,606 healthy children aged 18 months to 18 years were involved in this study, and about 20% of them received a single dose of CAIV-T in 2000. The primary clinical outcome was based on a nonspecific case definition called medically attended acute respiratory infection (MAARI), which included all International Classification of Diseases, Ninth Revision, Clinical Modification diagnosis codes (ICD-9 codes 381-383, 460-487) for upper and lower respiratory tract infections, otitis media and sinusitis. MAARI outcomes and demographic data were extracted from the Scott & White Health Plan administrative database. For each visit, one or two ICD-9 diagnosis codes were listed. Visits for which asthma diagnosis codes alone were noted, without another MAARI code, were excluded. More details about this study can be found in Halloran et al. ([2003]).
Table 3 Study data for influenza epidemic season 2000-01, by age and vaccine group (from Halloran et al. [2003])
Age group (years) | Vaccine | No. of children | No. of MAARI cases | No. of MAARI cases cultured | No. of positive cultures |
---|---|---|---|---|---|
1.5-4 | CAIV-T | 537 | 389 | 16 | 0 |
1.5-4 | None | 1844 | 1665 | 86 | 24 |
5-9 | CAIV-T | 807 | 316 | 17 | 2 |
5-9 | None | 2232 | 1156 | 118 | 53 |
10-18 | CAIV-T | 937 | 219 | 19 | 3 |
10-18 | None | 5249 | 1421 | 123 | 56 |
Total | CAIV-T | 2281 | 924 | 52 | 5 |
Total | None | 9325 | 4242 | 327 | 133 |
With the method developed in Section 4 for Poisson regression, we compare the risk of developing MAARI for children who received CAIV-T to the risk for children who had never received CAIV-T using the automated information provided in Table 3. The number of nonspecific MAARI cases extracted using the ICD-9 codes is the auxiliary outcome A, whereas the actual number of influenza cases Y is the outcome of interest. Let Z_{1} be the treatment indicator (1=vaccinated and 0=unvaccinated). Let Z_{2}=(η_{1},η_{2}) be the dummy variables indicating the three age groups, where η_{1}=1 if the age is in the range 1.5–4 and η_{1}=0 otherwise, and η_{2}=1 if the age is in the range 5–9 and η_{2}=0 otherwise. The reference group is the age group 10–18. The exposure time for all children is taken as T=1 year.
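To make the roles of A and Y concrete, the totals in Table 3 allow a crude back-of-the-envelope calculation. The sketch below is not the IPW or AIPW estimator of Section 4 and is not age-adjusted: it simply assumes that the culture-positive fraction among cultured MAARI cases applies to all MAARI cases in the same vaccine group, and the function name `crude_influenza_rate` is introduced here for illustration only.

```python
# Totals from Table 3:
# (children, MAARI cases, MAARI cases cultured, positive cultures)
totals = {
    "CAIV-T": (2281, 924, 52, 5),
    "None": (9325, 4242, 327, 133),
}


def crude_influenza_rate(children, maari, cultured, positive):
    """Scale MAARI cases by the culture-positive fraction, per child-year (T = 1).

    Illustrative assumption: cultured MAARI cases are representative of all
    MAARI cases within the same vaccine group.
    """
    estimated_influenza_cases = maari * positive / cultured
    return estimated_influenza_cases / children


rate_vaccinated = crude_influenza_rate(*totals["CAIV-T"])
rate_unvaccinated = crude_influenza_rate(*totals["None"])
crude_rr = rate_vaccinated / rate_unvaccinated
crude_ve = 1 - crude_rr
print(f"crude RR = {crude_rr:.3f}, crude VE = {crude_ve:.3f}")
```

This pooled calculation ignores the age groups and provides no standard errors, so it serves only as a sanity check on the order of magnitude of the model-based estimates.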
Consider a Poisson regression model with mean T exp(b_{0}+b_{1}Z_{1}+θ_{1}η_{1}+θ_{2}η_{2}). Using the IPW estimator ${\widehat{\beta}}_{I1}$, the estimates (standard errors) are ${\widehat{b}}_{0}=-0.7659$ (${\widehat{\sigma}}_{{b}_{0}}=0.1046$), ${\widehat{b}}_{1}=-1.5830$ (${\widehat{\sigma}}_{{b}_{1}}=0.5017$), ${\widehat{\theta}}_{1}=-0.5572$ (${\widehat{\sigma}}_{{\theta}_{1}}=0.2111$) and ${\widehat{\theta}}_{2}=-0.0199$ (${\widehat{\sigma}}_{{\theta}_{2}}=0.1472$). The age-adjusted relative rate (RR) in the vaccinated group compared with the unvaccinated group equals $\exp\left({\widehat{b}}_{1}\right)=\exp(-1.5830)=0.2054$, which means that the rate of developing MAARI for the vaccinated group is about 20% of that for the unvaccinated group. In terms of the vaccine efficacy, VE=1−RR=0.7946, which represents about an 80% reduction in the risk of developing MAARI for the vaccinated group compared to the unvaccinated group. The 95% confidence interval for RR obtained by using the delta method is (0.0768,0.5490), showing clear evidence that the vaccinated children have a lower risk of influenza than the unvaccinated children. The corresponding 95% confidence interval for VE is (0.4510,0.9232).
Using the AIPW estimator ${\widehat{\beta}}_{A2}$, the estimates (standard errors) are ${\widehat{b}}_{0}=-2.0703$ (${\widehat{\sigma}}_{{b}_{0}}=0.0851$), ${\widehat{b}}_{1}=-1.8072$ (${\widehat{\sigma}}_{{b}_{1}}=0.3786$), ${\widehat{\theta}}_{1}=0.6452$ (${\widehat{\sigma}}_{{\theta}_{1}}=0.1966$) and ${\widehat{\theta}}_{2}=0.6235$ (${\widehat{\sigma}}_{{\theta}_{2}}=0.1265$). The age-adjusted relative rate (RR) is $\exp\left({\widehat{b}}_{1}\right)=\exp(-1.8072)=0.1641$. The estimated VE is 0.8359 and its 95% confidence interval is (0.6553,0.9219). The estimator ${\widehat{\beta}}_{A2}$ yields smaller standard errors and narrower confidence intervals than ${\widehat{\beta}}_{I1}$.
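The reported relative rates, vaccine efficacies and confidence intervals can be checked from the point estimates and standard errors of b_{1} alone. The sketch below assumes the intervals are obtained by forming a Wald interval for b_{1} with critical value 1.96 and exponentiating its endpoints; this assumption reproduces the reported numbers to four decimals, but the helper `rr_and_ve` is a hypothetical name and not part of the authors' software.

```python
import math


def rr_and_ve(b1, se, z=1.96):
    """Relative rate, vaccine efficacy and 95% intervals from a log-rate coefficient.

    Assumes a Wald interval for b1 on the log scale, mapped to the RR scale
    by exponentiation; VE = 1 - RR, so the VE interval has reversed endpoints.
    """
    rr = math.exp(b1)
    rr_ci = (math.exp(b1 - z * se), math.exp(b1 + z * se))
    ve = 1 - rr
    ve_ci = (1 - rr_ci[1], 1 - rr_ci[0])
    return rr, rr_ci, ve, ve_ci


# Estimates of b1 and their standard errors as reported in the text.
ipw = rr_and_ve(-1.5830, 0.5017)
aipw = rr_and_ve(-1.8072, 0.3786)

for label, (rr, rr_ci, ve, ve_ci) in [("IPW", ipw), ("AIPW", aipw)]:
    print(f"{label}: RR={rr:.4f} CI=({rr_ci[0]:.4f}, {rr_ci[1]:.4f}) "
          f"VE={ve:.4f} CI=({ve_ci[0]:.4f}, {ve_ci[1]:.4f})")
```

The narrower AIPW interval follows directly from the smaller standard error of ${\widehat{b}}_{1}$ (0.3786 versus 0.5017).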
These data were analyzed by Halloran et al. ([2003]) and Chu and Halloran ([2004]). Assuming the binary probability model for P_{ β }(Y|X), where X includes the vaccination status and age group indicators, and using the mean score method, Halloran et al. ([2003]) found that the estimated VE based on the nonspecific MAARI cases alone was 0.18 with 95% confidence interval (0.11,0.24). The estimated VE obtained by incorporating the surveillance cultures was 0.79 with 95% confidence interval (0.51,0.91). Halloran et al. also reported a sample-size-weighted VE of 0.77 with 95% confidence interval (0.48,0.90). Chu and Halloran ([2004]) developed a Bayesian method to estimate vaccine efficacy. They reported an estimated VE of 0.74 with 95% confidence interval (0.50,0.88) and an estimated VE by the multiple imputation method of 0.71 with 95% confidence interval (0.42,0.86).
Our estimates of the vaccine efficacy are in line with those from the existing methods. The estimator ${\widehat{\beta}}_{A2}$ yields smaller standard errors, and therefore narrower confidence intervals, than the existing methods of Halloran et al. ([2003]) and Chu and Halloran ([2004]). Compared to binary regression, the Poisson regression model allows multiple recurrent MAARI cases for each child. Although for this particular application the exposure time is fixed at one year, the proposed method is applicable to situations where the length of exposure time differs among children.
Conclusions
In this paper, we investigated the mean score method, the IPW method and the AIPW method for the parametric probability regression model P_{ β }(Y|X) when the outcome of interest Y is subject to missingness. The asymptotic distributions are derived for the IPW estimator and the AIPW estimator. The selection probability often needs to be estimated for the IPW estimator, and both the selection probability and the conditional expectation of the score function need to be estimated for the AIPW estimator. We investigated the properties of the IPW estimator and the AIPW estimator when the selection probability and the conditional expectation are estimated in different ways.
An AIPW estimator is said to be fully augmented if the selection probability and the conditional expectation are estimated using the full set of observed variables; it is partially augmented if the selection probability and the conditional expectation are estimated using a subset of observed variables. Corollary 1 shows that the fully augmented AIPW estimator is more efficient than the partially augmented AIPW estimator. Corollary 2 shows that the AIPW estimator is more efficient than the IPW estimator. However, when the selection probability depends only on a set of discrete random variables, the IPW estimator obtained by estimating the selection probability nonparametrically with the cell frequencies is asymptotically equivalent to the AIPW estimator augmented using the same set of discrete random variables. Proposition 1 shows that the IPW estimator, the AIPW estimator and the mean score estimator are equivalent if the selection probability and the conditional expectation are estimated using the same set of discrete random variables.
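The equivalence stated in Proposition 1 can be illustrated in the simplest possible case. The sketch below is a toy check under stated assumptions, not the general regression setting of the paper: the target is a plain mean E[Y] (estimating function Y − μ), the strata labels and outcome values are hypothetical, the selection probability is estimated by the validated fraction in each stratum, and the conditional expectation is estimated by the validated stratum mean. Only the IPW and mean score estimators are compared.

```python
from collections import defaultdict

# Hypothetical toy data: (stratum, validated?, outcome or None if not validated).
records = [
    ("a", True, 2), ("a", True, 0), ("a", False, None), ("a", False, None),
    ("b", True, 1), ("b", False, None), ("b", True, 3), ("b", True, 2),
]

n_cell = defaultdict(int)       # subjects per stratum
n_val = defaultdict(int)        # validated subjects per stratum
y_val_sum = defaultdict(float)  # sum of validated outcomes per stratum
for stratum, validated, y in records:
    n_cell[stratum] += 1
    if validated:
        n_val[stratum] += 1
        y_val_sum[stratum] += y

n = len(records)

# IPW: weight each validated outcome by 1 / (validated fraction in its stratum).
ipw_mean = sum(
    y / (n_val[s] / n_cell[s]) for s, validated, y in records if validated
) / n

# Mean score: keep validated outcomes, replace missing ones by the stratum mean.
mean_score = sum(
    y if validated else y_val_sum[s] / n_val[s] for s, validated, y in records
) / n

print(ipw_mean, mean_score)
```

Both expressions reduce to the sum over strata of (stratum size) × (validated stratum mean), divided by n, which is why they coincide exactly here rather than only asymptotically.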
Applying the developed missing data methods, we derived the estimation procedures for the Poisson regression model with missing outcomes based on auxiliary outcomes and a validated sample for true outcomes. By assuming that the selection probability depends only on the observed discrete exposure variables, and not on the continuous exposure time, we show that the IPW estimator and the AIPW estimator can be formulated to analyze data when only aggregated/summarized information is available. The simulation study shows that for a moderate sample size and selection probability, the IPW estimator and the AIPW estimator perform better than the complete-case estimator. The AIPW estimator is more efficient and more stable than the IPW estimator. The proposed methods are applied to analyze a data set from an influenza vaccine study conducted in Temple-Belton, Texas during the 2000-2001 influenza season. The data set presented in Table 3 only contains summarized information at categorical levels defined by the three age groups and vaccination status. The actual number of influenza cases (the number of positive cultures) out of the number of MAARI cases cultured, along with the number of MAARI cases, is available for each category. Our analysis using the AIPW approach shows that the age-adjusted relative rate in the vaccinated group compared to the unvaccinated group equals 0.1641, which represents about an 84% reduction in the risk of developing MAARI for the vaccinated group compared to the unvaccinated group.
Appendix A
Proof of Proposition 1.
we have $\sum _{i=1}^{n}{W}_{i}^{A1}=\sum _{i=1}^{n}{W}_{i}^{I1}$. Thus the AIPW estimator ${\widehat{\beta}}_{A1}$, the IPW estimator ${\widehat{\beta}}_{I1}$ and the mean score estimator ${\widehat{\beta}}_{E1}$ are equivalent to each other.
which is not zero unless ${Z}_{i}^{c}$ is linearly related to Z_{ i } and in this case β is not identifiable. Hence the AIPW estimator ${\widehat{\beta}}_{A2}$ is different from the AIPW estimator ${\widehat{\beta}}_{A1}$.
Following the same arguments leading to (A.1), we also have $\sum _{i=1}^{n}{W}_{i}^{E2}=\sum _{i=1}^{n}{W}_{i}^{I2}$. Hence, the estimators ${\widehat{\beta}}_{I2}$, ${\widehat{\beta}}_{E2}$ and ${\widehat{\beta}}_{A2}$ are equivalent. By following the steps in (A.2), we also have $\sum _{i=1}^{n}\left(1-\frac{{\xi}_{i}}{{\widehat{\pi}}_{i}}\right)\widehat{E}\left\{{S}_{\beta}\left(Y|{X}_{i}\right)|{X}_{i},{A}_{i}\right\}=0$. Hence, ${\widehat{\beta}}_{A3}$ is the same as ${\widehat{\beta}}_{I2}$. Therefore, these are essentially two different estimators. □
Proof of Theorem 1.
By the central limit theorem, both ${n}^{1/2}\left({\widehat{\beta}}_{I}-\beta \right)$ and ${n}^{1/2}\left({\widehat{\beta}}_{A}-\beta \right)$ have asymptotically normal distributions with mean zero and covariances equal to ${I}^{-1}\left(\beta \right)\text{Var}\left({Q}_{i}^{I}\right){I}^{-1}\left(\beta \right)$ and ${I}^{-1}\left(\beta \right)\text{Var}\left({Q}_{i}^{A}\right){I}^{-1}\left(\beta \right)$, respectively.
Hence, $\text{Cov}\left({Q}_{i}^{A},{B}_{i}+{O}_{i}\right)=0$. It follows that $\text{Var}\left({Q}_{i}^{I}\right)=\text{Var}\left({Q}_{i}^{A}\right)+\text{Var}\left({B}_{i}+{O}_{i}\right)$. Since ${Q}_{i}^{A}={S}_{\beta}\left({Y}_{i}|{X}_{i}\right)-\left(1-\frac{{\xi}_{i}}{{\pi}_{i}}\right)\left\{{S}_{\beta}\left({Y}_{i}|{X}_{i}\right)-{E}_{i}\right\}$ and the two terms are uncorrelated under MAR I, we have $\text{Var}\left({Q}_{i}^{A}\right)=\text{Var}\left({S}_{\beta}\left({Y}_{i}|{X}_{i}\right)\right)+\text{Var}\left(\left(1-\frac{{\xi}_{i}}{{\pi}_{i}}\right)\left\{{S}_{\beta}\left({Y}_{i}|{X}_{i}\right)-{E}_{i}\right\}\right)$, where the first term equals I(β). This completes the proof of Theorem 1. □
Proof of Corollary 1.
The second term of $\text{Var}\left({Q}_{i}^{A1}\right)$ equals Σ_{A 1}(β) and the second term of $\text{Var}\left({Q}_{i}^{A2}\right)$ equals Σ_{A 2}(β). Then it follows from the main results in Theorem 1 that (17) and (18) hold.