Analysis of case-control data with interacting misclassified covariates
- Grace Y. Yi^{1} and
- Wenqing He^{2}
https://doi.org/10.1186/s40488-017-0069-0
© The Author(s) 2017
Received: 5 April 2017
Accepted: 11 July 2017
Published: 30 October 2017
Abstract
Case-control studies are important and useful methods for studying health outcomes and many methods have been developed for analyzing case-control data. Those methods, however, are vulnerable to mismeasurement of variables; biased results are often produced if such a feature is ignored. In this paper, we develop an inference method for handling case-control data with interacting misclassified covariates. We use the prospective logistic regression model to feature the development of the disease. To characterize the misclassification process, we consider a practical situation where replicated measurements of error-prone covariates are available. Our work is motivated in part by a breast cancer case-control study where two binary covariates are subject to misclassification. Extensions to other settings are outlined.
Introduction
Case-control studies are important and useful methods for studying rare health outcomes, such as rare diseases. They do not require us to follow up a large number of subjects over a long period of time. The primary purpose of a case-control study is to investigate how risk factors are associated with the disease incidence, and the study typically involves the comparison of cases (i.e., diseased individuals) with controls (i.e., disease-free individuals).
Various statistical analysis methods for case-control data have been developed in the literature (e.g., Prentice and Pyke 1979; Breslow and Cain 1988; Breslow and Day 1980; Schlesselman 1982). Those methods are, however, vulnerable to mismeasurement of variables that commonly accompanies case-control studies. It has been well documented that ignoring mismeasurement effects in the analysis often yields seriously biased results (e.g., Gustafson et al. 2001; Yi 2017). For instance, Bross (1954) examined misclassification effects on hypothesis testing with 2×2 tables. He commented that misclassification may present a more serious problem in the case of estimates than in the case of significance tests.
In the literature, many methods have been proposed to address mismeasurement effects (e.g., Armstrong et al. 1989; Carroll et al. 1995; Forbes and Santner 1995; Roeder et al. 1996). In particular, when misclassification is present in discrete covariates or exposure variables, various authors have explored strategies for accommodating its effects. To name a few, Marinos et al. (1995) studied case-control data with non-differential misclassification. Morrissey and Spiegelman (1999) and Lyles (2002) discussed adjustment methods for exposure misclassification in case-control studies where a validation sample is available. Chu et al. (2009) presented a likelihood-based approach for case-control studies with multiple non-gold standard exposure assessments. Under the Bayesian framework, Prescott and Garthwaite (2005) proposed methods for analyzing matched case-control studies in which a binary exposure variable is subject to misclassification. Tang et al. (2013) considered the case where both disease and exposure are subject to misclassification and developed misclassification adjustment methods by utilizing validation data. Mak et al. (2015) studied sensitivity analysis in case-control studies subject to exposure misclassification.
A common feature of those methods is that error-prone variables enter models separately and no interactions among error-contaminated covariates are considered. In case-control studies, however, error-prone covariates may interactively influence the development of diseases. Incorporating such a feature in the analysis is imperative. Zhang et al. (2008) developed an inference method to account for interacting covariates with misclassification. Their method applies only to the case where a validation subsample is available for determining the misclassification probabilities.
In many circumstances, a validation sample cannot be collected. Some variables may be too expensive or time-consuming to measure precisely (e.g., Carroll et al. 1993). Others can never be measured precisely because of their nature. For instance, blood pressure entering a model usually refers to its long-term average, which cannot be observed exactly; likewise, the amount of exposure to a hazard, such as radiation, is difficult to measure accurately. In such situations, a validation sample with precise measurements of the variables is not available. However, surrogate measurements of those variables can sometimes be collected repeatedly. In this paper, we consider such a setting where replicated measurements of error-prone covariates are available and develop inference methods with interacting error-prone covariates taken into account. Our development is cast in the framework of the prospective logistic regression model with misclassified binary covariates.
Our research is partially motivated by the case-control study on breast cancer discussed by Duffy et al. (1989). In this study, 451 breast cancer cases were compared with the same number of controls with respect to the error-prone risk factors: alcohol consumption and smoking, where alcohol consumption is defined as a binary variable by a threshold of 9.3 g ethanol/day, and the smoking variable is dichotomized by comparing the product of cigarettes smoked per day and years of smoking to 300. In addition, an independent study on 100 women was available, where repeated measurements of alcohol consumption and smoking were collected for those women on two occasions. It is interesting to study how smoking and alcohol use may be associated with the risk of developing breast cancer and whether or not those two factors may be interacting in explaining the development of breast cancer. A detailed analysis of such data is presented in Section 5.
The remainder of this paper is organized as follows. Section 2 outlines the notation and model setup. In Section 3 we explore the misclassification effects. In Section 4, we develop inferential procedures to accommodate misclassification effects with the availability of replicated measurements of error-contaminated covariates. Analysis of the motivating data with the proposed method is reported in Section 5, together with simulation studies which demonstrate the performance of our method. The manuscript is concluded with discussion and extensions.
Notation and framework
This measure can be used to indicate the association between the two binary covariates, which is classified by the subpopulations of cases and controls. The measure ψ is defined from the retrospective sampling viewpoint which directly reflects the feature of case-control designs. Equivalently, this measure has an equally interpretive feature in a prospective regression model.
As pointed out by Prentice and Pyke (1979), the baseline parameter β _{0} is not estimable from retrospectively collected data unless the prevalence P(Y=1) is known; the coefficients (β _{ a },β _{ s },β _{ as }), or the odds ratios ψ _{ jk }, however, are estimable from case-control data that are collected retrospectively.
We now elaborate on estimation procedures. For i,j,k=0,1, let N _{ ijk } represent the number of subjects with (Y=i,X _{ a }=j,X _{ s }=k), and let (n _{ i00},n _{ i10},n _{ i01},n _{ i11})^{ T } be a realization of the random vector N _{ i }=(N _{ i00},N _{ i10},N _{ i01},N _{ i11})^{ T }. Let n _{0} and n _{1} be the total number of controls and cases in the study, respectively. With the retrospective sampling scheme for case-control studies, these totals are treated as fixed, and it is often plausible to use multinomial distributions to independently characterize the cell counts for the control and case populations. Namely, N _{0} and N _{1} are assumed to be independent and marginally follow a multinomial distribution with N _{ i }∼Multinomial(n _{ i },p _{ i }), where p _{ i }=(p _{ i00},p _{ i10},p _{ i01},p _{ i11}) and \(\sum _{j,k=0}^{1} p_{ijk}=1\) and \(n_{i}=\sum _{j,k=0}^{1} n_{ijk}\) for i=0,1.
Interacting covariates with misclassification
Observed counts for case-control data
Disease status | \(X^{*}_{s}=0\), \(X^{*}_{a}=0\) | \(X^{*}_{s}=0\), \(X^{*}_{a}=1\) | \(X^{*}_{s}=1\), \(X^{*}_{a}=0\) | \(X^{*}_{s}=1\), \(X^{*}_{a}=1\) | Total |
---|---|---|---|---|---|
Case (Y=1) | \(n^{*}_{100}\) | \(n^{*}_{110}\) | \(n^{*}_{101}\) | \(n^{*}_{111}\) | n _{1} |
Control (Y=0) | \(n^{*}_{000}\) | \(n^{*}_{010}\) | \(n^{*}_{001}\) | \(n^{*}_{011}\) | n _{0} |
where the matrices Π _{ ia } and Π _{ is } are assumed invertible.
To describe the asymptotic variance of \(\widehat {p}_{ijk}\), we apply the delta method to the asymptotic distribution of \(\left (\widehat {p}^{*}_{i00}, \widehat {p}^{*}_{i01},\widehat {p}^{*}_{i10},\widehat {p}^{*}_{i11}\right)^{\mathrm {T}}\) in combination with (8), where the asymptotic distribution of \(\left (\widehat {p}^{*}_{i00}, \widehat {p}^{*}_{i01},\widehat {p}^{*}_{i10},\widehat {p}^{*}_{i11}\right)^{\mathrm {T}}\) is of the same form as (4) except that p _{ ijk } and \(\widehat {p}_{ijk}\) are replaced with \(p_{ijk}^{*}\) and \(\widehat {p}^{*}_{ijk}\), respectively, for i,j,k=0,1.
Inference method with replicates
The foregoing method assumes that the misclassification probabilities are known, and it is useful for conducting sensitivity analyses where one may specify a class of plausible values of the sensitivity and specificity to evaluate the misclassification effects on estimation of quantities such as odds ratios ψ _{ jk } or cell probabilities p _{ ijk }.
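As a minimal sketch of a correction with known misclassification probabilities, suppose that misclassification is nondifferential and acts independently on the two covariates, so that each observed cell probability is a mixture of the true cell probabilities weighted by the classification probabilities. The function names, the dictionary layout keyed by (alcohol, smoking), and the numerical values below are illustrative assumptions and are not taken from the paper.

```python
from itertools import product


def classification_matrix(sens, spec):
    """2x2 matrix M with M[observed][true] = P(X* = observed | X = true)."""
    return [[spec, 1.0 - sens],
            [1.0 - spec, sens]]


def invert_2x2(m):
    """Inverse of a 2x2 matrix; singular when sens + spec = 1."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if abs(det) < 1e-12:
        raise ValueError("classification matrix is singular")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def misclassify(p_true, mat_a, mat_s):
    """Observed cell probabilities p*[(j, k)] implied by true p[(j, k)]."""
    cells = list(product((0, 1), repeat=2))
    return {
        (j_obs, k_obs): sum(mat_a[j_obs][j] * mat_s[k_obs][k] * p_true[(j, k)]
                            for j, k in cells)
        for j_obs, k_obs in cells
    }


def correct(p_obs, sens_a, spec_a, sens_s, spec_s):
    """Recover true cell probabilities from observed ones (matrix method)."""
    inv_a = invert_2x2(classification_matrix(sens_a, spec_a))
    inv_s = invert_2x2(classification_matrix(sens_s, spec_s))
    # Applying the two inverse matrices is the same linear map as misclassify().
    return misclassify(p_obs, inv_a, inv_s)


# Round trip with hypothetical values: distort a known table, then correct it.
p_true = {(0, 0): 0.5, (1, 0): 0.2, (0, 1): 0.2, (1, 1): 0.1}
p_obs = misclassify(p_true,
                    classification_matrix(0.8, 0.9),
                    classification_matrix(0.9, 0.95))
p_back = correct(p_obs, 0.8, 0.9, 0.9, 0.95)
```

For a sensitivity analysis, `correct` would be applied separately to the observed case and control proportions over a range of plausible sensitivities and specificities, discarding any combination that yields a negative corrected probability.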
Two replicates of surrogate covariate measurements
First assessment | Second: \(X_{a2}^{*}=1\) | Second: \(X_{a2}^{*}=0\) | Total | First assessment | Second: \(X_{s2}^{*}=1\) | Second: \(X_{s2}^{*}=0\) | Total |
---|---|---|---|---|---|---|---|
\(X_{a1}^{*}=1\) | \(n^{*}_{a11}\) | \(n^{*}_{a10}\) | \(n^{*}_{a1+}\) | \(X_{s1}^{*}=1\) | \(n^{*}_{s11}\) | \(n^{*}_{s10}\) | \(n^{*}_{s1+}\) |
\(X_{a1}^{*}=0\) | \(n^{*}_{a01}\) | \(n^{*}_{a00}\) | \(n^{*}_{a0+}\) | \(X_{s1}^{*}=0\) | \(n^{*}_{s01}\) | \(n^{*}_{s00}\) | \(n^{*}_{s0+}\) |
Total | | | \(n^{*}_{a}\) | Total | | | \(n^{*}_{s}\) |
To see this, we consider an estimation method of the π _{ aj } and π _{ sj } which separately uses the repeated measurements of X _{ a } and X _{ s }. We describe only estimation of the π _{ aj } here; estimation of the π _{ sj } is similar.
In (9), one equation is determined by the other two, implying that model parameters are unidentifiable unless additional assumptions are imposed. To consider a reduced parameter space, we take the prevalence α _{ a } as given.
where the constant is omitted, \(n_{a}^{*}\) is the number of paired assessments, a _{ a11}, a _{ a10} and a _{ a00} are determined by (9) and constrained by a _{ a11}+2a _{ a10}+a _{ a00}=1.
Estimation of parameters is carried out with the maximization of the likelihood L(π _{ a1},π _{ a0}). The associated variance estimates are obtained from the observed information matrix, i.e., the negative of the second derivative matrix of logL(π _{ a1},π _{ a0}) evaluated at the estimates of parameters. To get rid of the constraints that the probabilities are bounded by 0 and 1, we reparameterize π _{ a0} and π _{ a1} by using the logit transformation when maximizing the likelihood.
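The following sketch illustrates the estimation step under an assumption about the form of (9): that the two replicates are conditionally independent given the true covariate, with sensitivity `sens`, specificity `spec` and a given prevalence `prev`. Under that assumption the model has two free parameters and two free cell probabilities, so when the solution is admissible the likelihood maximizer can be obtained by solving the moment equations directly. This closed-form route is swapped in here for the numerical maximization with a logit reparameterization described above, and it does not produce variance estimates. All function names and the prevalence value in the example are hypothetical.

```python
import math


def replicate_cell_probs(sens, spec, prev):
    """Cell probabilities (a11, a10, a00) for two conditionally independent replicates."""
    a11 = prev * sens ** 2 + (1 - prev) * (1 - spec) ** 2
    a10 = prev * sens * (1 - sens) + (1 - prev) * (1 - spec) * spec
    a00 = prev * (1 - sens) ** 2 + (1 - prev) * spec ** 2
    return a11, a10, a00


def estimate_sens_spec(n11, n10, n01, n00, prev):
    """Solve the two free moment equations for (sens, spec) at a given prevalence.

    Assumes sens + spec > 1, which selects the positive square root below.
    """
    n = n11 + n10 + n01 + n00
    a11_hat = n11 / n
    a10_hat = (n10 + n01) / (2 * n)       # discordant pairs are pooled
    m = a11_hat + a10_hat                 # estimated P(X* = 1)
    excess = a11_hat - m ** 2             # equals prev*(1-prev)*(sens+spec-1)^2
    if excess < 0:
        raise ValueError("replicates are negatively associated; no admissible solution")
    d = math.sqrt(excess / (prev * (1 - prev)))   # estimate of sens + spec - 1
    sens = m + (1 - prev) * d
    spec = 1 - m + prev * d
    if not (0 < sens <= 1 and 0 < spec <= 1):
        raise ValueError("solution outside (0, 1]; check the assumed prevalence")
    return sens, spec


# Alcohol replicates from the breast cancer study (18, 6, 7, 69), combined with
# a purely hypothetical prevalence of 0.25 chosen only for illustration.
sens_a, spec_a = estimate_sens_spec(18, 6, 7, 69, prev=0.25)
```

Because the estimates depend on the assumed prevalence, repeating the calculation over several plausible values of `prev` gives a quick check of how sensitive the results are to that choice.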
for (j,k)≠(0,0). The variance of \(\log \left (\widehat {\psi }_{jk}\right)\) can be obtained by applying the delta method to the variance of \(\left (\widehat {\pi }_{a1}, \widehat {\pi }_{a0}, \widehat {\pi }_{s1},\widehat {\pi }_{s0}, \widehat {p}^{*\mathrm {T}}_{0}, \widehat {p}^{*\mathrm {T}}_{1}\right)^{\mathrm {T}}\), which is a block diagonal matrix whose diagonal blocks are the variances of \(\left (\widehat {\pi }_{a1}, \widehat {\pi }_{a0}\right)^{\mathrm {T}}\), \((\widehat {\pi }_{s1}, \widehat {\pi }_{s0})^{\mathrm {T}}\), \(\widehat {p}^{*}_{0}\) and \(\widehat {p}^{*}_{1}\). The derivatives of \(\log \left (\widehat {\psi }_{jk}\right)=\log \left (\widetilde {p}_{000}\right) +\log (\widetilde {p}_{1jk})-\log (\widetilde {p}_{100})-\log (\widetilde {p}_{0jk})\) with respect to the parameters can be obtained directly from (10).
Finally, as noted by a referee, certain constraints underlie the estimates of log odds ratios (11), which are reflected by the positivity of the probabilities in (10). These constraints essentially require the misclassification probabilities to be suitably bounded above, so that the observed surrogate measurements remain relevant and useful. In other words, misclassification effects can only be addressed when they are not arbitrarily substantial, which makes intuitive sense. For instance, when a misclassification probability, say \(P\left (X_{a}^{*}=0|X_{a}=1\right)\), is bigger than 1/2, the observed measurements \(X_{a}^{*}\) carry little useful information about X _{ a }; using such observations to estimate the model parameter, no matter how an estimation method is developed, is even worse than using artificial data generated from flipping a fair coin.
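To make the constraint concrete, consider the simplest case of a single misclassified binary covariate. The symbols below are generic and introduced only for illustration: \(p\) is the true probability that the covariate equals 1, \(p^{*}\) is the corresponding observed probability, \(\pi_{1}\) is the sensitivity and \(\pi_{0}\) is the specificity. This is a simplified analogue of the constraints implied by (10), not a restatement of them.

\[
p^{*} = \pi_{1}\,p + (1-\pi_{0})(1-p)
\quad\Longrightarrow\quad
p = \frac{p^{*} - (1-\pi_{0})}{\pi_{1} + \pi_{0} - 1}.
\]

\[
0 \le p \le 1 \iff 1-\pi_{0} \le p^{*} \le \pi_{1}
\qquad \text{when } \pi_{1}+\pi_{0} > 1.
\]

Hence, for a fixed observed proportion, the false positive rate \(1-\pi_{0}\) cannot exceed \(p^{*}\) and the false negative rate \(1-\pi_{1}\) cannot exceed \(1-p^{*}\); otherwise the corrected probability falls outside \([0,1]\).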
Numerical analysis
In this section, we analyze the motivating example to illustrate the usage of the proposed method and conduct numerical studies to assess the performance of our method.
5.1 Data analysis
Breast cancer case-control study: main study data
Disease status | \(X^{*}_{s}=0\), \(X^{*}_{a}=0\) | \(X^{*}_{s}=0\), \(X^{*}_{a}=1\) | \(X^{*}_{s}=1\), \(X^{*}_{a}=0\) | \(X^{*}_{s}=1\), \(X^{*}_{a}=1\) | Total |
---|---|---|---|---|---|
Y=1 | 268 | 82 | 61 | 39 | 450 |
Y=0 | 305 | 70 | 56 | 20 | 451 |
Total | 573 | 152 | 117 | 59 | 901 |
Breast Cancer Case-Control Study: Replicates of Surrogate Measurements
First assessment | Second: \(X_{a}^{*}=1\) | Second: \(X_{a}^{*}=0\) | Total | First assessment | Second: \(X_{s}^{*}=1\) | Second: \(X_{s}^{*}=0\) | Total |
---|---|---|---|---|---|---|---|
\(X_{a}^{*}=1\) | 18 | 6 | 24 | \(X_{s}^{*}=1\) | 11 | 2 | 13 |
\(X_{a}^{*}=0\) | 7 | 69 | 76 | \(X_{s}^{*}=0\) | 2 | 84 | 86 |
Total | | | 100 | Total | | | 99 |
For comparison, a referee suggested conducting two further analyses, called Analysis 3 and Analysis 4. In Analysis 3, we take the misclassification probabilities as known and set their values to the estimated specificities and sensitivities obtained from Analysis 1. In Analysis 4, we treat the second sample as if it were a validation sample, taking the measurements from the first assessment as the true values and the measurements from the second assessment as surrogate measurements; sensitivities and specificities are then estimated from the relative frequencies in this artificial validation sample.
Analysis results for the breast cancer case-control study
Analysis | Parameter | EST | SEM | 95% CI |
---|---|---|---|---|
Analysis 1 | β _{ a } | 0.340 | 0.307 | (-0.261, 0.941) |
Analysis 1 | β _{ s } | 0.175 | 0.293 | (-0.400, 0.750) |
Analysis 1 | β _{ as } | 0.372 | 0.520 | (-0.647, 1.391) |
Analysis 2 | β _{ a } | 0.288 | 0.183 | (-0.071, 0.646) |
Analysis 2 | β _{ s } | 0.215 | 0.203 | (-0.183, 0.613) |
Analysis 2 | β _{ as } | 0.295 | 0.379 | (-0.447, 1.037) |
Analysis 3 | β _{ a } | 0.340 | 0.224 | (-0.099, 0.778) |
Analysis 3 | β _{ s } | 0.175 | 0.259 | (-0.334, 0.683) |
Analysis 3 | β _{ as } | 0.372 | 0.497 | (-0.603, 1.346) |
Analysis 4 | β _{ a } | 0.454 | 0.349 | (-0.229, 1.138) |
Analysis 4 | β _{ s } | 0.153 | 0.291 | (-0.418, 0.724) |
Analysis 4 | β _{ as } | 0.410 | 0.686 | (-0.934, 1.754) |
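For reference, estimates that ignore misclassification can be computed directly from the main study counts by treating the surrogate values as if they were the true covariates. The sketch below uses the log odds ratio form log ψ_jk = log p_000 + log p_1jk − log p_100 − log p_0jk together with delta-method standard errors for independent multinomial samples; the variable and function names are illustrative. Rounded to three decimals, the resulting values agree with the EST and SEM entries reported for Analysis 2 in the table above.

```python
import math

# Main-study counts n*[i][(j, k)]: i = disease status, j = alcohol, k = smoking.
counts = {
    1: {(0, 0): 268, (1, 0): 82, (0, 1): 61, (1, 1): 39},   # cases
    0: {(0, 0): 305, (1, 0): 70, (0, 1): 56, (1, 1): 20},   # controls
}


def log_odds_ratio(n, j, k):
    """log psi_jk = log(n_000 * n_1jk / (n_100 * n_0jk)); the group sizes cancel."""
    return (math.log(n[0][(0, 0)]) + math.log(n[1][(j, k)])
            - math.log(n[1][(0, 0)]) - math.log(n[0][(j, k)]))


def se_from_cells(n, cells):
    """Delta-method SE of a log-probability contrast with +/-1 coefficients.

    Valid when the coefficients sum to zero within cases and within controls,
    in which case the variance reduces to the sum of reciprocal cell counts.
    """
    return math.sqrt(sum(1.0 / n[i][jk] for i, jk in cells))


beta_a = log_odds_ratio(counts, 1, 0)
beta_s = log_odds_ratio(counts, 0, 1)
beta_as = log_odds_ratio(counts, 1, 1) - beta_a - beta_s

reference = [(0, (0, 0)), (1, (0, 0))]
se_a = se_from_cells(counts, reference + [(0, (1, 0)), (1, (1, 0))])
se_s = se_from_cells(counts, reference + [(0, (0, 1)), (1, (0, 1))])
se_as = se_from_cells(counts, [(i, jk) for i in (0, 1) for jk in counts[i]])

for name, est, se in (("beta_a", beta_a, se_a),
                      ("beta_s", beta_s, se_s),
                      ("beta_as", beta_as, se_as)):
    print(f"{name}: EST={est:.3f} SEM={se:.3f}")
```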
5.2 Sensitivity analysis
In the previous subsection, the misclassification probabilities are estimated from a small set of replicated surrogate measurements, so their accuracy may be questionable. We now investigate the effect of misclassification of the alcohol and smoking factors on the estimation of the odds ratios when misclassification probabilities are set differently. Three scenarios are considered: misclassification in the alcohol factor only, in the smoking factor only, and in both the alcohol and smoking factors. The sensitivity and specificity are employed to specify the (correct) classification rates; setting these quantities to be 1 corresponds to the case without misclassification. To ensure the nonnegativity of the probabilities p ^{∗}, specification of the sensitivity and specificity is subject to underlying constraints, as discussed in Section 4.
Estimates of the log odds ratios are determined by (11), which includes the terms \(\widetilde {p}_{ijk}\) for i,j,k=0,1, and, by (10), \(\widetilde {p}_{ijk}\) depends on the sensitivity and the specificity in the same manner. One might therefore expect a log odds ratio to change with the sensitivity in the same way as it changes with the specificity. However, this expectation is not necessarily borne out visually, because the magnitude of the change also depends on the estimated probabilities \(\widehat {p}_{ijk}^{*}\) obtained from the observed data. In other words, the visible effects of the sensitivity and the specificity on the log odds ratios can differ noticeably, driven by the actual observed data in Table 1. This is reflected in our sensitivity analyses here.
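The scenario with misclassification in the alcohol factor only can be sketched as follows. The sketch assumes that smoking is recorded without error and that alcohol misclassification is nondifferential, so the correction can be applied within each smoking stratum, separately for cases and controls. The function names and the grid of sensitivity and specificity values are illustrative choices, not those used in the paper.

```python
import math

# Observed main-study counts n*[i][(j, k)]: i = disease, j = alcohol, k = smoking.
counts = {
    1: {(0, 0): 268, (1, 0): 82, (0, 1): 61, (1, 1): 39},   # cases
    0: {(0, 0): 305, (1, 0): 70, (0, 1): 56, (1, 1): 20},   # controls
}


def corrected_probs(n_i, sens, spec):
    """Correct the cell probabilities of one group for misclassified alcohol status."""
    total = sum(n_i.values())
    p = {}
    for k in (0, 1):
        obs0 = n_i[(0, k)] / total
        obs1 = n_i[(1, k)] / total
        stratum = obs0 + obs1             # P(X_s = k | Y = i), assumed error-free
        p[(1, k)] = (obs1 - (1 - spec) * stratum) / (sens + spec - 1)
        p[(0, k)] = stratum - p[(1, k)]
    if min(p.values()) <= 0:
        raise ValueError("sensitivity/specificity incompatible with the data")
    return p


def log_or_alcohol(sens, spec):
    """log psi_10 after correcting cases and controls separately."""
    p0 = corrected_probs(counts[0], sens, spec)
    p1 = corrected_probs(counts[1], sens, spec)
    return (math.log(p0[(0, 0)]) + math.log(p1[(1, 0)])
            - math.log(p1[(0, 0)]) - math.log(p0[(1, 0)]))


# Sweep the sensitivity at a fixed, hypothetical specificity of 0.95.
for sens in (1.0, 0.95, 0.9):
    print(sens, round(log_or_alcohol(sens, spec=0.95), 3))
```

Setting both quantities to 1 reproduces the uncorrected estimate, while values that violate the positivity constraints raise an error instead of returning a meaningless estimate.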
5.3 Simulation study
In this subsection, we conduct simulation studies to assess the performance of the proposed method and to demonstrate the impact of ignoring the misclassification in the analysis.
We consider a setting similar to one of Zhang et al. (2008). Let X _{ a } be generated from a binomial distribution BIN(1,0.5) and let X _{ s } be generated from a binomial distribution BIN(1,0.5). Response Y is generated from model (1) where we set β _{ a }=β _{ s }=β _{ as }= log(2.0), and β _{0}=−3.0. For the sensitivity and specificity, we consider two settings: (I) π _{ a0}=π _{ a1}=0.8,π _{ s0}=π _{ s1}=0.9, and (II) π _{ a0}=π _{ a1}=0.9,π _{ s0}=π _{ s1}=0.95.
First, we generate a large number of individuals, say, 200000 individuals, which are treated as the underlying population. Then we randomly select n _{1} cases and n _{0} controls from this population to form a main study sample. We consider three scenarios with different sizes of cases and controls. In the first scenario, we take n _{1}=n _{0}=1000; in the second scenario, we take n _{1}=n _{0}=500; and in the third scenario, we take n _{1}=200 and n _{0}=600. To generate a second sample of replicates, we randomly select n ^{∗} individuals from the underlying population so that each individual has two repeated surrogate measurements for each of X _{ a } and X _{ s }. We consider two scenarios, called Scenario R _{1} and Scenario R _{2}, where n ^{∗} is set as 100 and 500, respectively.
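The data-generating steps for the main study sample can be sketched as below. The sketch assumes that model (1) is the logistic model logit P(Y=1|X_a,X_s) = β_0 + β_a X_a + β_s X_s + β_as X_a X_s, that the sensitivity is P(X*=1|X=1) and the specificity is P(X*=0|X=0), and that misclassification is nondifferential. The function name, its arguments and the seed are hypothetical, and the second sample of replicates is not generated here.

```python
import math
import random


def simulate_main_study(n_cases, n_controls, sens_a, spec_a, sens_s, spec_s,
                        pop_size=200_000, beta0=-3.0, beta=math.log(2.0), seed=0):
    """Return observed counts n*[y][(xa*, xs*)] for one simulated main study."""
    rng = random.Random(seed)

    def surrogate(x, sens, spec):
        # Keep the true value with probability sens (x = 1) or spec (x = 0).
        keep = sens if x == 1 else spec
        return x if rng.random() < keep else 1 - x

    population = {0: [], 1: []}
    for _ in range(pop_size):
        xa = int(rng.random() < 0.5)
        xs = int(rng.random() < 0.5)
        eta = beta0 + beta * (xa + xs + xa * xs)   # beta_a = beta_s = beta_as
        y = int(rng.random() < 1.0 / (1.0 + math.exp(-eta)))
        population[y].append((surrogate(xa, sens_a, spec_a),
                              surrogate(xs, sens_s, spec_s)))

    counts = {y: {(j, k): 0 for j in (0, 1) for k in (0, 1)} for y in (0, 1)}
    for y, size in ((1, n_cases), (0, n_controls)):
        # Retrospective sampling: fixed numbers of cases and of controls.
        for cell in rng.sample(population[y], size):
            counts[y][cell] += 1
    return counts


# Setting (I) with 1000 cases and 1000 controls.
table = simulate_main_study(1000, 1000,
                            sens_a=0.8, spec_a=0.8, sens_s=0.9, spec_s=0.9)
```

Repeating the call with different seeds and applying the naive and corrected estimators to each simulated table would give Monte Carlo summaries of bias and coverage.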
Simulation results for the main study data with 1000 cases and 1000 controls
Scenario | Setting | Method | Bias (β _{ a }) | SEM (β _{ a }) | CR% (β _{ a }) | Bias (β _{ s }) | SEM (β _{ s }) | CR% (β _{ s }) | Bias (β _{ as }) | SEM (β _{ as }) | CR% (β _{ as }) |
---|---|---|---|---|---|---|---|---|---|---|---|
R _{1} | I | Naive | -0.211 | 0.139 | 68.0 | 0.039 | 0.142 | 93.6 | -0.413 | 0.184 | 43.6 |
R _{1} | I | Proposed | 0.054 | 0.438 | 99.2 | -0.018 | 0.459 | 98.2 | 0.104 | 0.604 | 98.4 |
R _{1} | II | Naive | -0.079 | 0.145 | 92.0 | 0.046 | 0.140 | 94.4 | -0.249 | 0.193 | 75.6 |
R _{1} | II | Proposed | 0.042 | 0.256 | 98.0 | -0.011 | 0.264 | 96.2 | 0.034 | 0.349 | 97.8 |
R _{2} | I | Naive | -0.217 | 0.144 | 67.8 | 0.034 | 0.143 | 94.2 | -0.411 | 0.191 | 39.4 |
R _{2} | I | Proposed | 0.015 | 0.350 | 95.8 | 0.026 | 0.328 | 96.6 | -0.003 | 0.475 | 96.6 |
R _{2} | II | Naive | -0.087 | 0.142 | 93.2 | 0.034 | 0.152 | 95.0 | -0.236 | 0.197 | 77.4 |
R _{2} | II | Proposed | 0.011 | 0.212 | 97.0 | -0.005 | 0.223 | 96.0 | 0.008 | 0.301 | 94.8 |
Simulation results for the main study data with 500 cases and 500 controls
Scenario | Setting | Method | Bias (β _{ a }) | SEM (β _{ a }) | CR% (β _{ a }) | Bias (β _{ s }) | SEM (β _{ s }) | CR% (β _{ s }) | Bias (β _{ as }) | SEM (β _{ as }) | CR% (β _{ as }) |
---|---|---|---|---|---|---|---|---|---|---|---|
R _{1} | I | Naive | -0.218 | 0.205 | 82.4 | 0.017 | 0.208 | 94.8 | -0.401 | 0.272 | 67.2 |
R _{1} | I | Proposed | 0.045 | 0.618 | 99.4 | -0.034 | 0.657 | 98.8 | 0.125 | 0.874 | 98.8 |
R _{1} | II | Naive | -0.093 | 0.214 | 92.4 | 0.046 | 0.199 | 96.0 | -0.248 | 0.270 | 86.0 |
R _{1} | II | Proposed | 0.010 | 0.336 | 96.2 | -0.003 | 0.346 | 97.8 | 0.029 | 0.457 | 97.2 |
R _{2} | I | Naive | -0.210 | 0.204 | 83.2 | 0.024 | 0.204 | 94.0 | -0.399 | 0.274 | 71.0 |
R _{2} | I | Proposed | 0.033 | 0.506 | 97.6 | 0.004 | 0.496 | 96.8 | 0.042 | 0.709 | 95.4 |
R _{2} | II | Naive | -0.090 | 0.204 | 94.2 | 0.029 | 0.201 | 95.6 | -0.224 | 0.256 | 88.8 |
R _{2} | II | Proposed | 0.012 | 0.303 | 97.4 | -0.013 | 0.294 | 97.0 | 0.028 | 0.394 | 97.0 |
Simulation results for the main study data with 200 cases and 600 controls
Scenario | Setting | Method | Bias (β _{ a }) | SEM (β _{ a }) | CR% (β _{ a }) | Bias (β _{ s }) | SEM (β _{ s }) | CR% (β _{ s }) | Bias (β _{ as }) | SEM (β _{ as }) | CR% (β _{ as }) |
---|---|---|---|---|---|---|---|---|---|---|---|
R _{1} | I | Naive | -0.216 | 0.273 | 89.6 | 0.037 | 0.258 | 96.4 | -0.405 | 0.346 | 79.4 |
R _{1} | I | Proposed | 0.069 | 0.801 | 99.6 | 0.017 | 0.757 | 100.0 | 0.093 | 1.044 | 99.0 |
R _{1} | II | Naive | -0.099 | 0.306 | 93.0 | 0.034 | 0.301 | 94.0 | -0.231 | 0.365 | 92.0 |
R _{1} | II | Proposed | 0.026 | 0.517 | 98.2 | -0.022 | 0.562 | 97.2 | 0.053 | 0.671 | 98.2 |
R _{2} | I | Naive | -0.220 | 0.282 | 86.6 | 0.039 | 0.281 | 95.6 | -0.404 | 0.362 | 78.6 |
R _{2} | I | Proposed | 0.056 | 0.760 | 98.4 | 0.037 | 0.791 | 99.2 | 0.022 | 1.047 | 99.0 |
R _{2} | II | Naive | 0.067 | 0.289 | 96.0 | 0.063 | 0.296 | 96.2 | -0.257 | 0.371 | 88.8 |
R _{2} | II | Proposed | 0.052 | 0.452 | 98.4 | 0.043 | 0.461 | 98.2 | -0.021 | 0.597 | 96.8 |
It is clear that ignoring misclassification yields biased estimates of the parameters, and the coverage rates for the 95% confidence intervals deviate considerably from the nominal level. In contrast, the proposed method yields much improved estimation results with much smaller biases. As a trade-off for improved point estimation, variances of the proposed estimators are bigger than those of the naive estimators, which has also been observed in other problems concerning measurement error or misclassification (e.g., Carroll et al. 2006; Yi 2017). However, when point estimation and the associated variability are considered jointly, the proposed method produces much better coverage rates for the 95% confidence intervals. More specifically, within a given scenario R _{1} or R _{2}, the proposed method tends to produce better results for Setting II, where misclassification is less severe, than for Setting I, as expected. Within a given setting I or II, standard errors obtained from Scenario R _{2} are smaller than those obtained from Scenario R _{1}.
Discussion and extensions
Dichotomized covariates are very common in medical studies, and misclassification of these covariates happens frequently in the data collection process. It is important to incorporate such a feature in the data analysis; otherwise, biased results usually arise. In this article, we investigate misclassification effects of error-prone binary covariates on the estimation of risk measures for case-control studies and develop a valid inference method for addressing misclassification effects. Our development is carried out under a practical setting where a validation sample is unavailable but repeated measurements of error-contaminated variables are available. Numerical studies demonstrate satisfactory performance of our method.
Our method is motivated by the breast cancer case-control data discussed by Duffy et al. (1989), which contain two error-prone binary covariates, each with two replicates of surrogate measurements. It is possible to extend our method to more general settings where there are more than two error-prone binary covariates, and/or the number of replicates of surrogate measurements is arbitrary, and/or error-free covariates are also present. Here we outline three extensions.
6.1 Extension 1: replicates are more than 2
If there are m repeated measurements of each covariate in model (1), then the development in Section 4 can be generalized as follows; we discuss only one of the two covariates.
6.2 Extension 2: covariates are more than 2
where the β _{0}, β _{ l } and β _{ jk } are the regression parameters for l=1,…,p and 1≤j<k≤p.
We note that model (12) reflects the main effects as well as all pairwise interactions among the covariates. The three-way or higher order interactions among the covariates are not included, which are virtually assumed to be zero. In problems for which such interactions are of interest, one may modify model (12) by adding those terms with additional parameters introduced. In principle, any order of interactions among the covariates may be included in the model until a saturated model is formed. The interpretation of the associated parameters would be modified accordingly.
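Read together with the parameter list above, this description suggests a model of the following form. The display is a reconstruction from the surrounding text rather than a quotation of (12), and the covariate symbols \(X_{1},\ldots ,X_{p}\) are an assumed notation.

\[
\operatorname{logit} P(Y=1 \mid X_{1},\ldots ,X_{p})
= \beta_{0} + \sum_{l=1}^{p} \beta_{l} X_{l}
+ \sum_{1\le j<k\le p} \beta_{jk} X_{j} X_{k}.
\]

Under this form, a three-way interaction would enter as an additional term \(\beta_{jkl} X_{j} X_{k} X_{l}\) for each triple \(j<k<l\), and the saturated model for \(p\) binary covariates would have \(2^{p}\) regression parameters including the intercept.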
6.3 Extension 3: error-free covariates are also present
To conclude, we comment that the development of Section 4 is based on the assumption of the nondifferential misclassification mechanism. This assumption allows us to estimate the sensitivities and specificities using a separate sample from the main study which has repeated surrogate measurements of covariates only but not measurements of the disease status. Such an assumption, however, may be too restrictive for some applications, especially for retrospective studies. In such instances, conducting sensitivity analyses can be a viable way to allow us not to impose the nondifferential misclassification mechanism but enable us to explore the impact of misclassification on inference results. Finally, our work here focuses on estimation of the model parameters. It is also interesting to develop procedures for hypothesis testing to incorporate misclassification effects along the lines of Bross (1954).
Declarations
Acknowledgements
The authors thank two anonymous referees whose comments improved the presentation of the manuscript. The research was supported by the Natural Sciences and Engineering Research Council of Canada.
Authors' contributions
Both authors share contribution to the manuscript. GY proposed the research idea, and both authors together developed the methodology. WH worked on the computational analysis and GY drafted the manuscript. Both authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
- Armstrong, BG, Whittemore, AS, Howe, GR: Analysis of case-control data with covariate measurement error: Application to diet and colon cancer. Stat. Med. 8, 1151–1163 (1989).
- Barron, BA: The effects of misclassification on the estimation of relative risk. Biometrics. 33, 414–418 (1977).
- Breslow, NE, Cain, KC: Logistic regression for two-stage case-control data. Biometrika. 75, 11–20 (1988).
- Breslow, NE, Day, NE: Statistical Methods in Cancer Research, Volume I - The Analysis of Case-Control Studies. International Agency for Research on Cancer, Lyon (1980).
- Bross, I: Misclassification in 2×2 tables. Biometrics. 10, 478–486 (1954).
- Carroll, RJ, Gail, MH, Lubin, JH: Case-control studies with errors in covariates. J. Am. Stat. Assoc. 88, 185–199 (1993).
- Carroll, RJ, Ruppert, D, Stefanski, LA, Crainiceanu, CM: Measurement Error in Nonlinear Models. 2nd ed. Chapman & Hall/CRC, Boca Raton (2006).
- Carroll, RJ, Wang, S, Wang, CY: Prospective analysis of logistic case-control studies. J. Am. Stat. Assoc. 90, 157–169 (1995).
- Chu, H, Cole, SR, Wei, Y, Ibrahim, JG: Estimation and inference for case-control studies with multiple nongold standard exposure assessments: with an occupational health application. Biostatistics. 10, 591–602 (2009).
- Duffy, SW, Rohan, TE, Day, NE: Misclassification in more than one factor in a case-control study: A combination of Mantel-Haenszel and maximum likelihood approaches. Stat. Med. 8, 1529–1536 (1989).
- Forbes, AB, Santner, TJ: Estimators of odds ratio regression parameters in matched case-control studies with covariate measurement error. J. Am. Stat. Assoc. 90, 1075–1084 (1995).
- Gustafson, P, Le, ND, Saskin, R: Case-control analysis with partial knowledge of exposure misclassification probabilities. Biometrics. 57, 598–609 (2001).
- Lyles, RH: A note on estimating crude odds ratios in case-control studies with differentially misclassified exposure. Biometrics. 58, 1034–1036 (2002).
- Mak, TSH, Best, N, Rushton, L: Robust Bayesian sensitivity analysis for case-control studies with uncertain exposure misclassification probabilities. Int. J. Biostat. 11, 135–149 (2015).
- Marinos, AT, Tzonou, AJ, Karantzas, ME: Experimental quantiles of epidemiological indices in case-control studies with non-differential misclassification. Stat. Med. 14, 1291–1306 (1995).
- Morrissey, M, Spiegelman, D: Matrix methods for estimating odds ratios with misclassified exposure data: Extensions and comparisons. Biometrics. 55, 338–344 (1999).
- Prentice, RL, Pyke, R: Logistic disease incidence models and case-control studies. Biometrika. 66, 403–411 (1979).
- Prescott, GJ, Garthwaite, PH: Bayesian analysis of misclassified binary data from a matched case-control study with a validation substudy. Stat. Med. 24, 379–401 (2005).
- Roeder, K, Carroll, RJ, Lindsay, BG: A semiparametric mixture approach to case-control studies with error in covariables. J. Am. Stat. Assoc. 91, 722–732 (1996).
- Schlesselman, JJ: Case-Control Studies: Design, Conduct, Analysis. Oxford University Press, Oxford (1982).
- Serfling, RJ: Approximation Theorems of Mathematical Statistics. Wiley, New York (1980).
- Tang, L, Lyles, RH, Ye, Y, Lo, Y, King, CC: Extended matrix and inverse matrix methods utilizing internal validation data when both disease and exposure status are misclassified. Epidemiol. Methods. 2, 49–66 (2013).
- Yi, GY: Statistical Analysis with Measurement Error or Misclassification: Strategy, Method and Application. Springer Science+Business Media LLC, New York (2017).
- Zhang, L, Mukherjee, B, Ghosh, M, Gruber, S, Moreno, V: Accounting for error due to misclassification of exposures in case-control studies of gene-environment interaction. Stat. Med. 27, 2756–2783 (2008).