
Analysis of case-control data with interacting misclassified covariates


Case-control studies are important and useful methods for studying health outcomes and many methods have been developed for analyzing case-control data. Those methods, however, are vulnerable to mismeasurement of variables; biased results are often produced if such a feature is ignored. In this paper, we develop an inference method for handling case-control data with interacting misclassified covariates. We use the prospective logistic regression model to feature the development of the disease. To characterize the misclassification process, we consider a practical situation where replicated measurements of error-prone covariates are available. Our work is motivated in part by a breast cancer case-control study where two binary covariates are subject to misclassification. Extensions to other settings are outlined.


Case-control studies are important and useful methods for studying rare health outcomes, such as rare diseases. They do not require us to follow up a large number of subjects over a long period of time. The primary purpose of a case-control study is to investigate how risk factors are associated with the disease incidence, and the study typically involves the comparison of cases (i.e., diseased individuals) with controls (i.e., disease-free individuals).

Various statistical analysis methods for case-control data have been developed in the literature (e.g., Prentice and Pyke 1979; Breslow and Cain 1988; Breslow and Day 1980; Schlesselman 1982). Those methods are, however, vulnerable to mismeasurement of variables that commonly accompanies case-control studies. It has been well documented that ignoring mismeasurement effects in the analysis often yields seriously biased results (e.g., Gustafson et al. 2001; Yi 2017). For instance, Bross (1954) examined misclassification effects on hypothesis testing with 2×2 tables. He commented that misclassification may present a more serious problem in the case of estimates than in the case of significance tests.

In the literature, many methods have been proposed to address mismeasurement effects (e.g., Armstrong et al. 1989; Carroll et al. 1995; Forbes and Santner 1995; Roeder et al. 2005). In particular, with misclassification present in discrete covariates or exposure variables, various authors explored strategies for accommodating misclassification effects. To name a few, Marinos et al. (1995) studied case-control data with non-differential misclassification. Morrissey and Spiegelman (1999) and Lyles (2002) discussed adjustment methods for exposure misclassification in case-control studies where a validation sample is available. Chu et al. (2009) presented a likelihood-based approach for case-control studies with multiple non-gold standard exposure assessments. Under the Bayesian framework, Prescott and Garthwaite (2005) proposed methods for analyzing matched case-control studies in which a binary exposure variable is subject to misclassification. Tang et al. (2013) considered the case where both disease and exposure are subject to misclassification and developed misclassification adjustment methods by utilizing validation data. Mak et al. (2015) studied sensitivity analysis in case-control studies subject to exposure misclassification.

A common feature of those methods is that error-prone variables enter models separately and no interactions among error-contaminated covariates are considered. In case-control studies, however, error-involved covariates may interactively influence the development of diseases. Incorporating such a feature in the analysis is imperative. Zhang et al. (2008) developed an inference method to account for interacting covariates with misclassification. Their method is applicable only for the case where a validation subsample is available for determining the misclassification probabilities.

In many circumstances, a validation sample cannot be collected. Some variables may be too expensive or time consuming to measure precisely (e.g., Carroll et al. 1993). Others can never be measured precisely because of their nature. For instance, blood pressure entering a model usually refers to its long-term average, and there is no way to obtain this value precisely; the amount of exposure to a hazardous condition, such as radiation, is difficult to measure accurately. In such situations, a validation sample with precise measurements of the variables is not available. However, surrogate measurements of those variables can sometimes be collected repeatedly. In this paper, we consider such a setting where replicated measurements of error-prone covariates are available and develop inference methods with interacting error-prone covariates taken into account. Our development is cast in the framework of the prospective logistic regression model with misclassified binary covariates.

Our research is partially motivated by the case-control study on breast cancer discussed by Duffy et al. (1989). In this study, 451 breast cancer cases were compared with the same number of controls with respect to the error-prone risk factors: alcohol consumption and smoking, where alcohol consumption is defined as a binary variable by a threshold of 9.3 g ethanol/day, and the smoking variable is dichotomized by comparing the product of cigarettes smoked per day and years of smoking to 300. In addition, an independent study on 100 women was available, where repeated measurements of alcohol consumption and smoking were collected for those women on two occasions. It is interesting to study how smoking and alcohol use may be associated with the risk of developing breast cancer and whether or not those two factors may be interacting in explaining the development of breast cancer. A detailed analysis of such data is presented in Section 5.

The remainder of this paper is organized as follows. Section 2 outlines the notation and model setup. In Section 3 we explore the misclassification effects. In Section 4, we develop inferential procedures to accommodate misclassification effects with the availability of replicated measurements of error-contaminated covariates. Analysis of the motivating data with the proposed method is reported in Section 5, together with simulation studies which demonstrate the performance of our method. The manuscript is concluded with discussion and extensions.

Notation and framework

Let Y be the binary outcome variable, taking value 1 if a subject is a case and value 0 otherwise. Let \(X_{a}\) and \(X_{s}\) be two binary covariates taking value 0 or 1, such as alcohol and smoking statuses in the motivating example. For i,j,k=0,1, let

$$p_{ijk}=P\left(X_{a}=j,X_{s}=k|Y=i\right) $$

be the conditional probability for the case or control. Let \(\psi_{jk}\) be the odds ratio for cases versus controls with \((X_{a}=j,X_{s}=k)\) compared with the baseline category \((X_{a}=0,X_{s}=0)\):

$$\psi_{jk}=\frac{p_{000}p_{1jk}}{p_{100}p_{0jk}} $$

for (j,k)≠(0,0). Define

$$\psi=\frac{\psi_{11}}{\psi_{01}\psi_{10}}. $$

This measure can be used to indicate the association between the two binary covariates, which is classified by the subpopulations of cases and controls. The measure ψ is defined from the retrospective sampling viewpoint which directly reflects the feature of case-control designs. Equivalently, this measure has an equally interpretive feature in a prospective regression model.

Consider the prospective logistic regression model with an interaction term between \(X_{a}\) and \(X_{s}\):

$$ \log \left\{ \frac{P(Y=1|X_{a}, X_{s})} {P(Y=0|X_{a},X_{s})} \right\} = \beta_{0} + \beta_{a} X_{a} +\beta_{s} X_{s} + \beta_{as} X_{a} X_{s}, \tag{1} $$

where \(\beta_{0}\), \(\beta_{a}\), \(\beta_{s}\) and \(\beta_{as}\) are the regression parameters. These parameters can be expressed in terms of the odds ratios defined for the retrospective sampling framework:

$$ \beta_{a}=\log \psi_{10}, \ \beta_{s}=\log \psi_{01}, \ \text{and} \ \beta_{as}=\log \psi. $$

As pointed out by Prentice and Pyke (1979), the baseline parameter \(\beta_{0}\) is not estimable from retrospectively collected data unless the prevalence P(Y=1) is known; the coefficients \((\beta_{a},\beta_{s},\beta_{as})\), or equivalently the odds ratios \(\psi_{jk}\), are however estimable from case-control data that are collected retrospectively.
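As a brief sketch of why these relations hold under the logistic model, write \(\mu_{jk}=P(Y=1|X_{a}=j,X_{s}=k)\), a shorthand introduced only for this illustration. Bayes' rule then gives

$$ \frac{p_{1jk}}{p_{0jk}} = \frac{\mu_{jk}}{1-\mu_{jk}} \cdot \frac{P(Y=0)}{P(Y=1)}, \qquad \text{so that} \qquad \psi_{jk} = \frac{\mu_{jk}/(1-\mu_{jk})}{\mu_{00}/(1-\mu_{00})} = \exp\left(\beta_{a} j + \beta_{s} k + \beta_{as} jk\right). $$

Taking (j,k) equal to (1,0), (0,1) and (1,1) yields \(\log \psi_{10}=\beta_{a}\), \(\log \psi_{01}=\beta_{s}\) and \(\log \psi_{11}=\beta_{a}+\beta_{s}+\beta_{as}\), hence \(\log \psi=\beta_{as}\). The factor P(Y=0)/P(Y=1) cancels in every \(\psi_{jk}\), which is consistent with the intercept being the only parameter that requires knowledge of the prevalence.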

We now elaborate on estimation procedures. For i,j,k=0,1, let \(N_{ijk}\) represent the number of subjects with \((Y=i,X_{a}=j,X_{s}=k)\), and let \((n_{i00},n_{i10},n_{i01},n_{i11})^{\mathrm{T}}\) be a realization of the random vector \(N_{i}=(N_{i00},N_{i10},N_{i01},N_{i11})^{\mathrm{T}}\). Let \(n_{0}\) and \(n_{1}\) be the total number of controls and cases in the study, respectively. With the retrospective sampling scheme for case-control studies, these totals are treated as fixed, and it is often plausible to use multinomial distributions to independently characterize the cell counts for the control and case populations. Namely, \(N_{0}\) and \(N_{1}\) are assumed to be independent, with \(N_{i} \sim \text{Multinomial}(n_{i},p_{i})\), where \(p_{i}=(p_{i00},p_{i10},p_{i01},p_{i11})\), \(\sum _{j,k=0}^{1} p_{ijk}=1\) and \(n_{i}=\sum _{j,k=0}^{1} n_{ijk}\) for i=0,1.

These distributional assumptions immediately allow us to write out the likelihood function for the cell probabilities \(p_{ijk}\), ignoring the normalizing constant,

$$ L=\prod_{i=0}^{1} \prod_{j=0}^{1} \prod_{k=0}^{1} p_{ijk}^{n_{ijk}}. \tag{3} $$

In combination with the constraint \(\sum _{j,k} p_{ijk}=1\) for a given i, maximizing (3) with respect to the cell probabilities leads to the maximum likelihood estimator for the cell probabilities:

$$\widehat{p}_{ijk}=\frac{n_{ijk}}{n_{i}} \ \ \ \text{for} \ i, j, k=0,1. $$

Then the invariance of maximum likelihood estimators gives us an estimate of ψ jk :

$$\widehat{\psi}_{jk}= \frac{n_{000} n_{1jk}}{n_{100}n_{0jk}}. $$

To calculate the asymptotic variance of the estimator \(\widehat {\psi }_{jk}\) (as \(n_{1}\) and \(n_{0}\) both approach infinity), we equivalently consider the asymptotic variance of \(\log \widehat {\psi }_{jk}\). For i=1,0, the multinomial distribution \(N_{i} \sim \text{Multinomial}(n_{i},p_{i})\) yields the asymptotic distribution of \(\widehat {p}_{i}=\left (\widehat {p}_{i00},\widehat {p}_{i01},\widehat {p}_{i10},\widehat {p}_{i11}\right)^{\mathrm {T}}\) (Serfling 1980, pp.108-109):

$$ \sqrt{n_{i}} \left(\widehat{p}_{i} - p_{i}\right) \stackrel{d}{\rightarrow} N \left(0,\Sigma_{i} \right) \tag{4} $$

as \(n_{i} \rightarrow \infty\), where

$$\Sigma_{i}= \left(\begin{array}{cccc} p_{i00} (1-p_{i00}) & - p_{i00}p_{i01} & - p_{i00}p_{i10} & - p_{i00}p_{i11} \\ - p_{i01} p_{i00} & p_{i01}\left(1-p_{i01}\right) & - p_{i01}p_{i10} & - p_{i01}p_{i11} \\ -p_{i10} p_{i00} & - p_{i10}p_{i01} & p_{i10}(1-p_{i10}) & - p_{i10}p_{i11} \\ - p_{i11}p_{i00} & - p_{i11}p_{i01} & - p_{i11}p_{i10} & p_{i11}(1-p_{i11}) \\ \end{array} \right) $$

with the constraints \(\sum _{j,k} \widehat {p}_{ijk}=1\) and \(\sum _{j,k} p_{ijk}=1\) imposed. The asymptotic variances of the estimators \(\widehat {\psi }_{jk}\) and \(\widehat {\psi }\), or their logarithms, can be obtained using the delta method. Specifically, estimates of the asymptotic variances are

$$\widehat{\text{Avar}} \left(\log \widehat{\psi}_{jk}\right) = \frac{1}{n_{1jk} }+\frac{1}{n_{0jk}} + \frac{1}{n_{100}}+\frac{1}{n_{000}} $$

for (j,k)≠(0,0), and

$$\begin{array}{@{}rcl@{}} \widehat{\text{Avar}} \left(\log \widehat{\psi}\right) = \sum_{i=0}^{1} \sum_{j=0}^{1} \sum_{k=0}^{1} \frac{1}{n_{ijk}}. \end{array} $$
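The estimators and variance formulas of this section can be sketched in a few lines of Python. The function name `log_odds_ratios` and the cell counts below are hypothetical and serve only as an illustration; they are not the breast cancer data.

```python
import math

CELLS = [(0, 0), (1, 0), (0, 1), (1, 1)]

def log_odds_ratios(controls, cases):
    """Return {label: (log odds ratio, standard error)} from cell counts.

    controls, cases: dicts mapping (j, k) -> number of subjects with
    X_a = j and X_s = k among controls (Y = 0) and cases (Y = 1).
    """
    out = {}
    for jk in [(1, 0), (0, 1), (1, 1)]:
        log_psi = math.log(controls[(0, 0)] * cases[jk]
                           / (cases[(0, 0)] * controls[jk]))
        avar = (1 / cases[jk] + 1 / controls[jk]
                + 1 / cases[(0, 0)] + 1 / controls[(0, 0)])
        out[jk] = (log_psi, math.sqrt(avar))
    # Interaction measure: log psi = log psi_11 - log psi_01 - log psi_10.
    log_psi_int = out[(1, 1)][0] - out[(0, 1)][0] - out[(1, 0)][0]
    avar_int = sum(1 / table[c] for table in (controls, cases) for c in CELLS)
    out["interaction"] = (log_psi_int, math.sqrt(avar_int))
    return out

# Hypothetical counts, for illustration only.
controls = {(0, 0): 200, (1, 0): 100, (0, 1): 100, (1, 1): 50}
cases = {(0, 0): 150, (1, 0): 120, (0, 1): 110, (1, 1): 90}
for label, (est, se) in log_odds_ratios(controls, cases).items():
    print(label, round(est, 4), round(se, 4))
```

Because every cell count enters the variance through its reciprocal, a sparse cell dominates the standard error, and a zero cell makes the estimate undefined.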

Interacting covariates with misclassification

In the presence of misclassification of the binary covariates, let \(X_{a}^{*}\) and \(X_{s}^{*}\) be the observed values of X a and X s , respectively. Let

$$\pi_{ia1}=P\left(X_{a}^{*}=1|X_{a}=1, Y=i\right) \ \ \text{and} \ \pi_{ia0}=P\left(X_{a}^{*}=0|X_{a}=0, Y=i\right) $$

be respectively the sensitivity and specificity of X a for the subpopulation with Y=i, and

$$\pi_{is1}=P\left(X_{s}^{*}=1|X_{s}=1, Y=i\right) \ \ \text{and} \ \pi_{is0}=P\left(X_{s}^{*}=0|X_{s}=0, Y=i\right) $$

be respectively the sensitivity and specificity of X s for the subpopulation with Y=i. Define

$$\Pi_{ia}=\left(\begin{array}{cc} \pi_{ia0} & 1-\pi_{ia1}\\ 1-\pi_{ia0} & \pi_{ia1} \end{array} \right) \ \ \text{and} \ \Pi_{is}=\left(\begin{array}{cc} \pi_{is0} & 1-\pi_{is0}\\ 1-\pi_{is1} & \pi_{is1} \end{array} \right). $$

For i,j,k=0,1, let

$$ p^{*}_{ijk}=P\left(X^{*}_{a}=j,X^{*}_{s}=k|Y=i\right) $$

be the probabilities for the observed covariate measurements corresponding to the case or control subpopulation. Write \(p^{*}_{i}=\left (p^{*}_{i00}, p^{*}_{i10}, p^{*}_{i01}, p^{*}_{i11}\right)\) for i=0,1.

We assume that

$$\begin{array}{@{}rcl@{}} &&P\left(X_{a}^{*}=j,X_{s}^{*}=k|X_{a}, X_{s}, Y\right)\\ &=&P\left(X_{a}^{*}=j|X_{a}, X_{s}, Y\right)P\left(X_{s}^{*}=k|X_{a}, X_{s}, Y\right), \end{array} $$
$$P\left(X_{a}^{*}=j|X_{a}, X_{s}, Y\right)=P\left(X_{a}^{*}=j|X_{a}, Y\right), $$


$$P\left(X_{s}^{*}=j|X_{a}, X_{s}, Y\right)=P\left(X_{s}^{*}=j|X_{s}, Y\right). $$

The first assumption says that the observed measurements \(X_{a}^{*}\) and \(X_{s}^{*}\) are conditionally independent, given the true values \(X_{a}\) and \(X_{s}\) and the disease status. The second and third conditions require that the misclassification probability of one variable does not depend on the true value of the other variable, given the true value of the variable itself and the disease status. Under these assumptions, we express the probabilities \(p^{*}_{ijk}\) using the true probabilities \(p_{ijk}\):

$$ \left(\begin{array}{cc} p^{*}_{i00} & p^{*}_{i01}\\ p^{*}_{i10} & p^{*}_{i11} \end{array} \right) = \Pi_{ia} \left(\begin{array}{cc} p_{i00} & p_{i01}\\ p_{i10} & p_{i11} \end{array} \right) \Pi_{is}. \tag{6} $$

The identity (6) allows us to estimate the probability \(p_{ijk}\) using the estimates of \(p^{*}_{ijk}\), which can be obtained from the observed counts (Barron 1977). Let \(n^{*}_{ijk}\) represent the number of cases or controls with the observed measurement \(\left (X^{*}_{a}=j,X^{*}_{s}=k\right)\) for i,j,k=0,1, as displayed in Table 1.

Table 1 Observed counts for case-control data

Using the same reasoning as for (3), we obtain the likelihood based on the observed data

$$ L_{obs}=\prod_{i=0}^{1} \prod_{j=0}^{1} \prod_{k=0}^{1} \left(p_{ijk}^{*}\right)^{n^{*}_{ijk}}. \tag{7} $$

Maximizing the likelihood (7) with respect to the cell probabilities \(p^{*}_{ijk}\), under the constraint \(\sum _{j=0}^{1}\sum _{k=0}^{1} p^{*}_{ijk}=1\) for i=0,1, gives their estimators

$$\widehat{p}^{*}_{ijk}=\frac{n^{*}_{ijk}}{n_{i}} \ \ \ \text{for} \ i, j, k=0, 1. $$

Applying (6), we obtain the estimators for the true cell probabilities \(p_{ijk}\):

$$ \left(\begin{array}{cc} \widehat{p}_{i00} & \widehat{p}_{i01}\\ \widehat{p}_{i10} & \widehat{p}_{i11} \end{array} \right) = \Pi_{ia}^{-1} \left(\begin{array}{cc} \widehat{p}^{*}_{i00} & \widehat{p}^{*}_{i01}\\ \widehat{p}^{*}_{i10} & \widehat{p}^{*}_{i11} \end{array} \right) \Pi_{is}^{-1}, \tag{8} $$

where the matrices \(\Pi_{ia}\) and \(\Pi_{is}\) are assumed invertible, that is, \(\pi_{ia0}+\pi_{ia1}\neq 1\) and \(\pi_{is0}+\pi_{is1}\neq 1\).
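The forward identity (6) and its inversion (8) can be checked numerically with plain 2×2 matrix arithmetic. The sketch below is illustrative only: the helper names and the true cell probabilities are hypothetical, and the sensitivities and specificities are treated as known.

```python
def matmul2(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[r][m] * B[m][c] for m in range(2)) for c in range(2)]
            for r in range(2)]

def inv2(A):
    """Inverse of a 2x2 matrix; raises if the matrix is (nearly) singular."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if abs(det) < 1e-12:
        raise ValueError("misclassification matrix is not invertible")
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

def pi_a(sens, spec):
    # Rows: observed X_a* = 0, 1; columns: true X_a = 0, 1.
    return [[spec, 1 - sens], [1 - spec, sens]]

def pi_s(sens, spec):
    # Rows: true X_s = 0, 1; columns: observed X_s* = 0, 1.
    return [[spec, 1 - spec], [1 - sens, sens]]

def observed_from_true(P, Pa, Ps):
    """Identity (6): P* = Pi_a P Pi_s, with P[j][k] = p_{ijk}."""
    return matmul2(matmul2(Pa, P), Ps)

def true_from_observed(P_star, Pa, Ps):
    """Relation (8): P = Pi_a^{-1} P* Pi_s^{-1}."""
    return matmul2(matmul2(inv2(Pa), P_star), inv2(Ps))

# Hypothetical true cell probabilities and classification rates.
P_true = [[0.40, 0.20], [0.25, 0.15]]
Pa, Ps = pi_a(sens=0.80, spec=0.90), pi_s(sens=0.85, spec=0.95)
P_obs = observed_from_true(P_true, Pa, Ps)
P_back = true_from_observed(P_obs, Pa, Ps)
print(P_obs)
print(P_back)
```

With estimated rather than exact observed proportions, the back-transformed cells are not guaranteed to lie in [0, 1], so a positivity check is advisable in practice.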

To describe the asymptotic variance of \(\widehat {p}_{ijk}\), we apply the delta method to the asymptotic distribution of \(\left (\widehat {p}^{*}_{i00}, \widehat {p}^{*}_{i01},\widehat {p}^{*}_{i10},\widehat {p}^{*}_{i11}\right)^{\mathrm {T}}\) in combination with (8), where the asymptotic distribution of \(\left (\widehat {p}^{*}_{i00}, \widehat {p}^{*}_{i01},\widehat {p}^{*}_{i10},\widehat {p}^{*}_{i11}\right)^{\mathrm {T}}\) is of the same form as (4) except that \(p_{ijk}\) and \(\widehat {p}_{ijk}\) are replaced with \(p_{ijk}^{*}\) and \(\widehat {p}^{*}_{ijk}\), respectively, for i,j,k=0,1.

Inference method with replicates

The foregoing method assumes that the misclassification probabilities are known, and it is useful for conducting sensitivity analyses where one may specify a class of plausible values of the sensitivity and specificity to evaluate the misclassification effects on estimation of quantities such as odds ratios ψ jk or cell probabilities p ijk .

In practice, misclassification probabilities are usually unavailable and must be estimated from additional data sources. Here we consider a situation where an independent sample with two repeated covariate measurements is available. In addition to the main study data displayed in Table 1, a second independent sample is available as shown in Table 2. As there is no information on the disease status for this independent sample, we assume the nondifferential misclassification mechanism in order to estimate the sensitivities and specificities. Namely, we assume that

$$\pi_{iaj}=\pi_{aj} \ \ \text{and} \ \pi_{isj}=\pi_{sj}, $$

where π aj and π sj are constants, i,j=0,1. Although no gold standard measurements of X a and X s via a validation subsample are available for this circumstance, the discrepancy between the two repeated measurements allows us to estimate the misclassification probabilities under certain assumptions.

Table 2 Two replicates of surrogate covariate measurements

To see this, we consider an estimation method of the π aj and π sj which separately uses the repeated measurements of X a and X s . We describe only estimation of the π aj here; estimation of the π sj is similar.

Let \(X_{a1}^{*}\) and \(X_{a2}^{*}\) denote the first and second observed measurements for X a , respectively. Define

$$a_{ajk}=P\left(X_{a1}^{*}=j, X_{a2}^{*}=k\right) \ \ \text{for} \ j,k=0,1 $$

and \(\alpha_{a}=P(X_{a}=1)\). If we assume conditional independence between the first and second observed measurements of \(X_{a}\), that is,

$$\begin{array}{@{}rcl@{}} &&P\left(X_{a1}^{*}=j, X_{a2}^{*}=k|X_{a}=l \right)\\ &=& P\left(X_{a1}^{*}=j|X_{a}=l\right)P\left(X_{a2}^{*}=k|X_{a}=l \right) \ \ \text{for} \ j, k, l=0,1, \end{array} $$

then we obtain \(a_{a10}=a_{a01}\), and

$$\begin{array}{@{}rcl@{}} a_{a11} &=& \pi_{a1}^{2}\alpha_{a}+(1-\pi_{a0})^{2}(1-\alpha_{a}); \\ a_{a10} &=&\pi_{a1}(1-\pi_{a1})\alpha_{a}+\pi_{a0}(1-\pi_{a0})(1-\alpha_{a});\\ a_{a00} &=&(1-\pi_{a1})^{2}\alpha_{a}+\pi_{a0}^{2}(1-\alpha_{a}). \end{array} \tag{9} $$

In (9), one equation is determined by the other two because \(a_{a11}+2a_{a10}+a_{a00}=1\). With three unknowns \((\pi_{a1},\pi_{a0},\alpha_{a})\) and only two free equations, the model parameters are unidentifiable unless additional assumptions are imposed. To consider a reduced parameter space, we take the prevalence \(\alpha_{a}\) as given.
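The lack of identifiability can be illustrated numerically. The sketch below computes the replicate cell probabilities and then, as an assumption of this illustration rather than the likelihood procedure of the paper, solves for the sensitivity and specificity by matching moments for a supplied prevalence, under the extra restriction that sensitivity plus specificity exceeds 1. Two different prevalences lead to different classification rates with identical cell probabilities.

```python
import math

def replicate_cell_probs(sens, spec, prev):
    """Return (a11, a10, a00) for two conditionally independent replicates."""
    a11 = sens ** 2 * prev + (1 - spec) ** 2 * (1 - prev)
    a10 = sens * (1 - sens) * prev + spec * (1 - spec) * (1 - prev)
    a00 = (1 - sens) ** 2 * prev + spec ** 2 * (1 - prev)
    return a11, a10, a00

def solve_given_prevalence(a11, a10, prev):
    """Moment-matching solution for (sens, spec), assuming sens + spec > 1.

    Uses m = P(X* = 1) = a11 + a10 and a11 - m^2 = prev (1 - prev) d^2,
    where d = sens - (1 - spec).
    """
    m = a11 + a10
    d2 = (a11 - m * m) / (prev * (1 - prev))
    if d2 < 0:
        raise ValueError("cell probabilities incompatible with the model")
    d = math.sqrt(d2)
    sens = m + (1 - prev) * d
    spec = 1 - (m - prev * d)
    return sens, spec

# Hypothetical values: the same cells arise from two different prevalences.
cells = replicate_cell_probs(sens=0.8, spec=0.9, prev=0.3)
alt = solve_given_prevalence(cells[0], cells[1], prev=0.4)
print(cells)
print(alt, replicate_cell_probs(alt[0], alt[1], 0.4))
```

This is why the estimates of the classification rates should be read as conditional on the prevalence that is supplied.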

Let \(N^{*}_{ajk}\) be the number of pairs \(\left (X_{a1}^{*}=j,X_{a2}^{*}=k\right)\) for j,k=0,1. Then we have a multinomial distribution

$$ \left(N^{*}_{a11},N^{*}_{a10},N^{*}_{a01},N^{*}_{a00}\right) \sim \text{Multinomial}\left(n^{*}_{a}, a_{a11},a_{a10},a_{a01},a_{a00}\right), $$

resulting in the likelihood,

$$\begin{array}{@{}rcl@{}} L(\pi_{a1},\pi_{a0}) &=& a_{a11}^{n_{a11}^{*}} \cdot a_{a10}^{n_{a10}^{*}+n_{a01}^{*}} \cdot a_{a00}^{n_{a00}^{*}} \end{array} $$

where the constant is omitted, \(n_{a}^{*}\) is the number of paired assessments, and \(a_{a11}\), \(a_{a10}\) and \(a_{a00}\) are determined by (9) and constrained by \(a_{a11}+2a_{a10}+a_{a00}=1\).

Estimation of the parameters is carried out by maximizing the likelihood \(L(\pi_{a1},\pi_{a0})\). The associated variance estimates are obtained from the observed information matrix, i.e., the negative of the second derivative matrix of \(\log L(\pi_{a1},\pi_{a0})\) evaluated at the parameter estimates. To remove the constraints that the probabilities are bounded by 0 and 1, we reparameterize \(\pi_{a0}\) and \(\pi_{a1}\) using the logit transformation when maximizing the likelihood.
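A minimal sketch of this maximization is given below. It replaces a proper optimizer with a crude grid search over sensitivities and specificities between 0.51 and 0.99, which is an assumption of the sketch: the restriction to values above 0.5 excludes the label-switched solution, and the grid makes the logit reparameterization unnecessary. The counts are hypothetical and the prevalence is taken as 0.3.

```python
import math

def cell_probs(sens, spec, prev):
    """Cell probabilities (a11, a10, a00) of two replicate measurements."""
    a11 = sens ** 2 * prev + (1 - spec) ** 2 * (1 - prev)
    a10 = sens * (1 - sens) * prev + spec * (1 - spec) * (1 - prev)
    a00 = (1 - sens) ** 2 * prev + spec ** 2 * (1 - prev)
    return a11, a10, a00

def log_lik(sens, spec, prev, n11, n_disc, n00):
    """Log-likelihood up to a constant; n_disc = n10 + n01 (discordant pairs)."""
    a11, a10, a00 = cell_probs(sens, spec, prev)
    return n11 * math.log(a11) + n_disc * math.log(a10) + n00 * math.log(a00)

def grid_mle(prev, n11, n_disc, n00):
    """Grid maximizer over sens, spec in {0.51, 0.52, ..., 0.99}."""
    grid = [i / 100 for i in range(51, 100)]
    return max(((s, c) for s in grid for c in grid),
               key=lambda sc: log_lik(sc[0], sc[1], prev, n11, n_disc, n00))

# Hypothetical replicate counts from 1000 pairs.
sens_hat, spec_hat = grid_mle(prev=0.3, n11=199, n_disc=222, n00=579)
print(sens_hat, spec_hat)
```

For standard errors one would still need the second derivatives of the log-likelihood at the maximizer, for example by numerical differentiation, which this sketch does not attempt.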

To obtain estimates of regression parameters in model (1), we use the relationship (8) and obtain that

$$ \widehat{p}_{ijk} = \frac{\widetilde{p}_{ijk}}{\left(\widehat{\pi}_{a0}+\widehat{\pi}_{a1}-1\right)(\widehat{\pi}_{s0}+\widehat{\pi}_{s1}-1)} $$

for i,j,k=0,1, where

$$\begin{array}{@{}rcl@{}} \widetilde{p}_{i00} &=&\widehat{\pi}_{a1} \widehat{\pi}_{s1} \widehat{p}^{*}_{i00} - \left(1-\widehat{\pi}_{a1}\right) \widehat{\pi}_{s1} \widehat{p}^{*}_{i10} - \widehat{\pi}_{a1} (1-\widehat{\pi}_{s1}) \widehat{p}^{*}_{i01} +(1-\widehat{\pi}_{a1}) (1-\widehat{\pi}_{s1}) \widehat{p}^{*}_{i11}; \\ \widetilde{p}_{i10} &=&-(1-\widehat{\pi}_{a0}) \widehat{\pi}_{s1} \widehat{p}^{*}_{i00}+\widehat{\pi}_{a0} \widehat{\pi}_{s1} \widehat{p}^{*}_{i10}+ (1-\widehat{\pi}_{a0}) (1-\widehat{\pi}_{s1}) \widehat{p}^{*}_{i01} -\widehat{\pi}_{a0} (1-\widehat{\pi}_{s1}) \widehat{p}^{*}_{i11}; \\ \widetilde{p}_{i01} &=& -\widehat{\pi}_{a1} (1-\widehat{\pi}_{s0}) \widehat{p}^{*}_{i00} + (1-\widehat{\pi}_{a1}) (1-\widehat{\pi}_{s0}) \widehat{p}^{*}_{i10}+ \widehat{\pi}_{a1} \widehat{\pi}_{s0} \widehat{p}^{*}_{i01} -(1-\widehat{\pi}_{a1}) \widehat{\pi}_{s0} \widehat{p}^{*}_{i11}; \\ \widetilde{p}_{i11} &=& (1-\widehat{\pi}_{a0}) (1-\widehat{\pi}_{s0}) \widehat{p}^{*}_{i00} - \widehat{\pi}_{a0} (1-\widehat{\pi}_{s0}) \widehat{p}^{*}_{i10}- (1-\widehat{\pi}_{a0}) \widehat{\pi}_{s0} \widehat{p}^{*}_{i01} +\widehat{\pi}_{a0} \widehat{\pi}_{s0} \widehat{p}^{*}_{i11}. \end{array} $$

Consequently, the log odds ratios are estimated by

$$ \log\left(\widehat{\psi}_{jk}\right)=\log \left(\frac{\widehat{p}_{000} \cdot \widehat{p}_{1jk}}{ \widehat{p}_{100} \cdot \widehat{p}_{0jk}} \right) = \log \left(\frac{\widetilde{p}_{000} \cdot \widetilde{p}_{1jk}}{\widetilde{p}_{100} \cdot \widetilde{p}_{0jk}} \right) \tag{11} $$

for (j,k)≠(0,0). The variance of \(\log \left (\widehat {\psi }_{jk}\right)\) can be obtained by applying the delta method to the variance of \(\left (\hat \pi _{a1}, \widehat {\pi }_{a0}, \widehat {\pi }_{s1},\widehat {\pi }_{s0}, \widehat {p}^{*\mathrm {T}}_{0}, \widehat {p}^{*\mathrm {T}}_{1}\right)^{\mathrm {T}}\) which is a diagonal block matrix with variances of \(\left (\widehat {\pi }_{a1}, \widehat {\pi }_{a0}\right)^{\mathrm {T}}, (\widehat {\pi }_{s1}, \widehat {\pi }_{s0})^{\mathrm {T}}, \widehat {p}^{*}_{0}\) and \(\widehat {p}^{*}_{1}\) being the diagonal blocks. The derivatives of \(\log \left (\widehat {\psi }_{jk}\right)=\log \left (\widetilde {p}_{000}\right) +\log (\widetilde {p}_{1jk})-\log (\widetilde {p}_{100})-\log (\widetilde {p}_{0jk})\) with respect to the parameters can be easily obtained via (10).
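The point estimates can be sketched as follows. The function `p_tilde` transcribes the four numerators term by term, `misclassify` is a hypothetical helper that generates observed-cell probabilities from true ones so that the correction can be checked, and all numerical values are illustrative assumptions. The delta-method variance is not implemented here.

```python
import math

def misclassify(p, sens_a, spec_a, sens_s, spec_s):
    """Observed-cell probabilities p* implied by true probabilities p."""
    # ca[(obs, true)] = P(X_a* = obs | X_a = true); likewise cs for X_s.
    ca = {(0, 0): spec_a, (1, 0): 1 - spec_a, (0, 1): 1 - sens_a, (1, 1): sens_a}
    cs = {(0, 0): spec_s, (1, 0): 1 - spec_s, (0, 1): 1 - sens_s, (1, 1): sens_s}
    return {(j, k): sum(ca[(j, jt)] * cs[(k, kt)] * p[(jt, kt)]
                        for jt in (0, 1) for kt in (0, 1))
            for j in (0, 1) for k in (0, 1)}

def p_tilde(ps, sens_a, spec_a, sens_s, spec_s):
    """Corrected numerators; ps maps (j, k) -> observed proportion."""
    a1, a0, s1, s0 = sens_a, spec_a, sens_s, spec_s
    p00, p10, p01, p11 = ps[(0, 0)], ps[(1, 0)], ps[(0, 1)], ps[(1, 1)]
    return {
        (0, 0): a1*s1*p00 - (1-a1)*s1*p10 - a1*(1-s1)*p01 + (1-a1)*(1-s1)*p11,
        (1, 0): -(1-a0)*s1*p00 + a0*s1*p10 + (1-a0)*(1-s1)*p01 - a0*(1-s1)*p11,
        (0, 1): -a1*(1-s0)*p00 + (1-a1)*(1-s0)*p10 + a1*s0*p01 - (1-a1)*s0*p11,
        (1, 1): (1-a0)*(1-s0)*p00 - a0*(1-s0)*p10 - (1-a0)*s0*p01 + a0*s0*p11,
    }

def corrected_log_or(ps_controls, ps_cases, jk, *rates):
    """Corrected log psi_jk; the common denominator cancels in the ratio."""
    t0, t1 = p_tilde(ps_controls, *rates), p_tilde(ps_cases, *rates)
    if min(list(t0.values()) + list(t1.values())) <= 0:
        raise ValueError("classification rates incompatible with the data")
    return math.log(t0[(0, 0)] * t1[jk] / (t1[(0, 0)] * t0[jk]))

# Hypothetical true cell probabilities for controls and cases.
p0 = {(0, 0): 0.40, (1, 0): 0.20, (0, 1): 0.25, (1, 1): 0.15}
p1 = {(0, 0): 0.25, (1, 0): 0.25, (0, 1): 0.20, (1, 1): 0.30}
rates = (0.80, 0.90, 0.85, 0.95)  # sens_a, spec_a, sens_s, spec_s
ps0, ps1 = misclassify(p0, *rates), misclassify(p1, *rates)
print(corrected_log_or(ps0, ps1, (1, 0), *rates), math.log(2))
```

In this noise-free check the corrected log odds ratio recovers the true value log 2, because the corrected numerators are proportional to the true cell probabilities.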

Finally, as noted by a referee, certain constraints underlie the estimates of log odds ratios (11), which are reflected by the positivity of the probabilities in (10). These constraints essentially require the misclassification probabilities to be suitably bounded above to ensure that the observed surrogate measurements are relevant and useful. In other words, misclassification effects can only be addressed when they are not arbitrarily substantial, and this makes intuitive sense. For instance, when a misclassification probability, say \(P\left (X_{a}^{*}=0|X_{a}=1\right)\), is bigger than 1/2, the observed measurements \(X_{a}^{*}\) carry useless information about \(X_{a}\); using such observations to estimate the model parameter, no matter how an estimation method is developed, is even worse than using artificial data generated from flipping a fair coin.

Numerical analysis

In this section, we analyze the motivating example to illustrate the usage of the proposed method and conduct numerical studies to assess the performance of our method.

5.1 Data analysis

We analyze the case-control data discussed by Duffy et al. (1989) and described in Section 1. For any subject, let X a be a binary variable indicating whether or not the alcohol consumption is more than 9.3 g ethanol/day, and let X s be a binary variable indicating whether or not the lifetime cigarette-years of the subject is more than 300. Table 3 records the data of the main study, where one breast cancer case has missing observations and we ignore this in the analysis. In addition, there was an independent study available on 100 women who were neither cases nor controls. Repeated measurements of X a and X s were collected for those women on two occasions, and the measurements are given in Table 4 where one subject has missing observations of X s .

Table 3 Breast cancer case-control study: main study data
Table 4 Breast cancer case-control study: replicates of surrogate measurements

We analyze the data using the proposed method described in Section 4 and the naive method with misclassification in X a and X s ignored, called Analysis 1 and Analysis 2, respectively. To remove the constraints of the specificities and sensitivities, we consider the reparameterization:

$$ \widetilde{\pi}=\frac{\exp(\delta)}{1+\exp(\delta)}, $$

where \(\widetilde {\pi }\) represents \(\pi_{a1}\), \(\pi_{a0}\), \(\pi_{s1}\) or \(\pi_{s0}\), and δ is the corresponding parameter which takes a value in (−∞,+∞).
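The reparameterization is easy to sketch in code. The helper names below are hypothetical, and the standard-error conversion uses the usual delta-method derivative of the inverse logit, stated here as an illustration rather than as the exact computation behind the reported results.

```python
import math

def expit(delta):
    """Inverse logit, written to avoid overflow for large |delta|."""
    if delta >= 0:
        return 1.0 / (1.0 + math.exp(-delta))
    e = math.exp(delta)
    return e / (1.0 + e)

def logit(p):
    """Map a probability in (0, 1) to the unconstrained scale."""
    return math.log(p / (1.0 - p))

def se_probability_scale(delta_hat, se_delta):
    """Delta-method standard error of expit(delta_hat)."""
    p = expit(delta_hat)
    return p * (1.0 - p) * se_delta

delta = logit(0.8)
print(delta, expit(delta), se_probability_scale(delta, 0.5))
```

Optimizing over δ lets an unconstrained routine be used, with estimates mapped back to the probability scale afterwards.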

For comparison, a referee suggested conducting two further analyses, called Analysis 3 and Analysis 4. In Analysis 3, we take the misclassification probabilities as known and let their values be determined by the estimated specificities and sensitivities obtained from Analysis 1. In Analysis 4, we pretend the second sample is a validation sample in which the measurements from the first assessment are taken as the true values and the measurements from the second assessment are regarded as surrogate measurements; sensitivities and specificities are then estimated from the relative frequencies using this artificial validation sample.
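The relative-frequency estimates used in Analysis 4 amount to the following calculation, sketched here with a hypothetical function name and hypothetical counts (not those of Table 4).

```python
def pseudo_validation_rates(n):
    """Sensitivity and specificity when the first assessment is taken as truth.

    n maps (first, second) -> number of subjects with those two assessments.
    """
    sens = n[(1, 1)] / (n[(1, 1)] + n[(1, 0)])  # P(second = 1 | first = 1)
    spec = n[(0, 0)] / (n[(0, 0)] + n[(0, 1)])  # P(second = 0 | first = 0)
    return sens, spec

# Hypothetical paired counts for one covariate.
counts = {(1, 1): 30, (1, 0): 5, (0, 1): 10, (0, 0): 55}
sens, spec = pseudo_validation_rates(counts)
print(round(sens, 3), round(spec, 3))
```

Since the first assessment is itself error-prone, these quantities measure agreement between replicates rather than agreement with the truth.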

The analysis results are reported in Table 5, where EST, SEM and 95% CI represent estimates, model-based standard errors and 95% confidence intervals for the parameters, respectively. Relative to those produced by the method with misclassification incorporated (Analysis 1), the naive analysis (Analysis 2) yields attenuated point estimates for \(\beta_{a}\) and \(\beta_{as}\), and leads to an inflated estimate of \(\beta_{s}\). Analysis 1 produces larger standard errors than Analysis 2, which is consistent with the typical patterns observed in the analysis with measurement error models in the literature. The comparison between the results of Analyses 1 and 3 confirms the theoretical property that Analysis 3 produces the same point estimates of the response parameters as Analysis 1 does, but it yields smaller variance estimates than those produced from Analysis 1. While it is not possible to directly compare Analysis 4 to Analysis 1 or Analysis 3, the comparison of Analysis 4 to Analysis 2 reveals the same pattern as the comparison between Analyses 1 and 2. All the analyses suggest that neither smoking, alcohol consumption, nor their interaction is statistically significant.

Table 5 Analysis results for the breast cancer case-control study

5.2 Sensitivity analysis

In the previous subsection the misclassification probabilities are estimated based on a small set of replicated surrogate measurements, whose accuracy may be questionable due to the small size of the data. We now investigate the effect of misclassification of the alcohol and smoking factors on the estimation of the odds ratios when misclassification probabilities are set differently. Three scenarios are considered: there is misclassification on the alcohol factor only, on the smoking factor only, and on both the alcohol and smoking factors. The sensitivity and specificity are employed to specify the (correct) classification rates; setting these quantities to be 1 corresponds to the case without misclassification. To ensure the nonnegativity of the probabilities \(p_{ijk}\), specification of the sensitivity and specificity is subject to underlying constraints, as discussed in Section 4.
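A sensitivity analysis of this kind can be sketched as a sweep over candidate classification rates. The example below covers only the first scenario, misclassification in the alcohol factor with the smoking factor taken as error-free, and returns `None` when a candidate pair violates the positivity constraints. The function names and counts are hypothetical.

```python
import math

def corrected_cells(ps, sens_a, spec_a):
    """Unnormalized corrected cells when only X_a is misclassified.

    ps maps (j, k) -> observed proportion; the correction acts on the
    alcohol index j and leaves the smoking index k untouched.
    """
    out = {}
    for k in (0, 1):
        out[(0, k)] = sens_a * ps[(0, k)] - (1 - sens_a) * ps[(1, k)]
        out[(1, k)] = -(1 - spec_a) * ps[(0, k)] + spec_a * ps[(1, k)]
    return out

def log_or(ps0, ps1, jk, sens_a, spec_a):
    """Corrected log psi_jk, or None when a corrected cell is not positive."""
    t0 = corrected_cells(ps0, sens_a, spec_a)
    t1 = corrected_cells(ps1, sens_a, spec_a)
    if min(list(t0.values()) + list(t1.values())) <= 0:
        return None
    return math.log(t0[(0, 0)] * t1[jk] / (t1[(0, 0)] * t0[jk]))

def proportions(counts):
    """Convert cell counts to observed proportions."""
    n = sum(counts.values())
    return {cell: c / n for cell, c in counts.items()}

# Hypothetical observed counts, for illustration only.
controls = {(0, 0): 200, (1, 0): 100, (0, 1): 100, (1, 1): 50}
cases = {(0, 0): 150, (1, 0): 120, (0, 1): 110, (1, 1): 90}
ps0, ps1 = proportions(controls), proportions(cases)
for sens in (0.8, 0.9, 1.0):
    for spec in (0.8, 0.9, 1.0):
        print(sens, spec, log_or(ps0, ps1, (1, 0), sens, spec))
```

Setting both rates to 1 reproduces the naive estimate, which gives a convenient reference point for reading the sweep.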

Estimates of the log odds ratios are determined by (11), which involves the terms \(\widetilde {p}_{ijk}\) for i,j,k=0,1, and by (10) each \(\widetilde {p}_{ijk}\) depends on the sensitivity and the specificity in the same manner. One might therefore expect a log odds ratio to respond to changes in the sensitivity in the same way as to changes in the specificity. This need not be what the plots show, however, because the magnitude of the change also depends on the estimated probabilities \(\widehat {p}_{ijk}^{*}\) obtained from the observed data. In other words, the visible effects of the sensitivity and the specificity on the log odds ratios can differ noticeably, driven by the actual observed data in Table 1. This is reflected in our sensitivity analyses here.

Figure 1 shows the change of the log odds ratios according to the change of the sensitivity and specificity for the alcohol factor while keeping the sensitivity and specificity for the smoking factor at 1. With a given specificity \(\pi_{a0}\) or sensitivity \(\pi_{a1}\), the log odds ratio \(\log \psi_{10}\) tends to decrease as the sensitivity \(\pi_{a1}\) or specificity \(\pi_{a0}\) of the alcohol factor increases, but the rates of change are not the same. On the other hand, the log odds ratio \(\log \psi_{01}\) increases as the sensitivity \(\pi_{a1}\) of the alcohol factor increases when the specificity \(\pi_{a0}\) is kept fixed; \(\log \psi_{01}\) appears less sensitive to the change of the specificity \(\pi_{a0}\) when the sensitivity \(\pi_{a1}\) is given. Regarding the log odds ratio \(\log \psi\), we notice that its value is affected by the change in the sensitivity and specificity of the alcohol factor.

Fig. 1

Sensitivity analyses of the log odds ratios to the changes of π a0 and π a1 where π s0 and π s1 are set as 1. The left plot is for logψ 10, the center plot is for logψ 01, and the right plot is for logψ

Figure 2 presents the changes of the log odds ratios according to the change of the sensitivity and specificity for the smoking factor while keeping the sensitivity and specificity of the alcohol factor at 1. The log odds ratio \(\log \psi_{10}\) appears to increase as the sensitivity \(\pi_{s1}\) of the smoking factor increases with the specificity \(\pi_{s0}\) kept fixed, whereas when the sensitivity \(\pi_{s1}\) of the smoking factor is fixed, \(\log \psi_{10}\) tends to decrease as the specificity \(\pi_{s0}\) of the smoking factor increases. Again, the change in sensitivity and specificity of the smoking factor affects the value of \(\log \psi\).

Fig. 2

Sensitivity analyses of the log odds ratios to the changes of π s0 and π s1 where π a0 and π a1 are set as 1. The left plot is for logψ 10, the center plot is for logψ 01, and the right plot is for logψ

Figure 3 shows how the log odds ratios may change relative to the change of the sensitivity and specificity for both the alcohol and smoking factors. While many configurations could be considered, here we confine our attention to the scenario where the two factors share a common sensitivity and a common specificity. It is evident that the values of both the sensitivity and specificity of the alcohol and smoking factors have an impact on estimation of the log odds ratios, while the magnitudes can differ from case to case.

Fig. 3

Sensitivity analyses of the log odds ratios to the changes of π 0 and π 1, where π s0 and π a0 are set to be identical, and π s1 and π a1 are set to be identical. The left plot is for logψ 10, the center plot is for logψ 01, and the right plot is for logψ

5.3 Simulation study

In this subsection, we conduct simulation studies to assess the performance of the proposed method and to demonstrate the impact of ignoring the misclassification in the analysis.

We consider a setting similar to one of Zhang et al. (2008). Let X a be generated from a binomial distribution BIN(1,0.5) and let X s be generated from a binomial distribution BIN(1,0.5). Response Y is generated from model (1) where we set β a =β s =β as = log(2.0), and β 0=−3.0. For the sensitivity and specificity, we consider two settings: (I) π a0=π a1=0.8,π s0=π s1=0.9, and (II) π a0=π a1=0.9,π s0=π s1=0.95.

First, we generate a large number of individuals, say, 200000 individuals, which are treated as the underlying population. Then we randomly select n 1 cases and n 0 controls from this population to form a main study sample. We consider three scenarios with different sizes of cases and controls. In the first scenario, we take n 1=n 0=1000; in the second scenario, we take n 1=n 0=500; and in the third scenario, we take n 1=200 and n 0=600. To generate a second sample of replicates, we randomly select n individuals from the underlying population so that each individual has two repeated surrogate measurements for each of X a and X s . We consider two scenarios, called Scenario R 1 and Scenario R 2, where n is set as 100 and 500, respectively.
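The data-generating steps described above can be sketched as follows. The function names, the seed and the reduced population size of 20000 (chosen only to keep the example fast) are assumptions of this sketch; the regression coefficients and the classification rates follow Setting I.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def surrogate(x, sens, spec, rng):
    """Return a possibly misclassified copy of the binary value x."""
    correct = rng.random() < (sens if x == 1 else spec)
    return x if correct else 1 - x

def simulate(n_pop, n1, n0, n_rep, beta, rates_a, rates_s, seed=2024):
    """Generate observed main-study counts and a replicate sample.

    rates_a, rates_s: (sensitivity, specificity) for X_a and X_s.
    """
    rng = random.Random(seed)
    b0, ba, bs, bas = beta
    population = []
    for _ in range(n_pop):
        xa, xs = int(rng.random() < 0.5), int(rng.random() < 0.5)
        y = int(rng.random() < expit(b0 + ba * xa + bs * xs + bas * xa * xs))
        population.append((y, xa, xs))
    cases = [r for r in population if r[0] == 1]
    controls = [r for r in population if r[0] == 0]
    main = rng.sample(cases, n1) + rng.sample(controls, n0)
    # Observed counts n*_{ijk}, indexed by (i, j, k).
    counts = {(i, j, k): 0 for i in (0, 1) for j in (0, 1) for k in (0, 1)}
    for y, xa, xs in main:
        obs_a = surrogate(xa, *rates_a, rng)
        obs_s = surrogate(xs, *rates_s, rng)
        counts[(y, obs_a, obs_s)] += 1
    # Independent sample with two surrogate measurements per covariate.
    replicates = []
    for _, xa, xs in rng.sample(population, n_rep):
        rep_a = (surrogate(xa, *rates_a, rng), surrogate(xa, *rates_a, rng))
        rep_s = (surrogate(xs, *rates_s, rng), surrogate(xs, *rates_s, rng))
        replicates.append((rep_a, rep_s))
    return counts, replicates

beta = (-3.0, math.log(2.0), math.log(2.0), math.log(2.0))
counts, replicates = simulate(20000, 500, 500, 100, beta,
                              rates_a=(0.8, 0.8), rates_s=(0.9, 0.9))
print(counts)
```

Repeating this over many seeds and applying both the naive and the corrected estimators to each data set yields the bias, standard error and coverage summaries.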

For each parameter configuration, we simulate 500 data sets and analyze the data using both the proposed method and the naive method, which disregards the misclassification. We report the bias (Bias), the model-based standard error (SEM), and the coverage rate of the 95% confidence intervals (CR%) in Tables 6, 7 and 8, each corresponding to one size scenario.
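As a hedged illustration of how such summaries could be computed for one parameter, the sketch below assumes that Bias is the average estimate minus the true value, that SEM is the average of the model-based standard errors, and that the confidence intervals are Wald-type intervals (estimate ± 1.96 × standard error). The function name `summarize` and the toy inputs are hypothetical.

```python
import statistics

def summarize(estimates, std_errors, true_value, z=1.96):
    """Return (bias, mean model-based SE, coverage in %) over simulated data sets."""
    bias = statistics.fmean(estimates) - true_value
    sem = statistics.fmean(std_errors)
    # A data set "covers" when the true value lies inside estimate +/- z * SE.
    covered = sum(1 for est, se in zip(estimates, std_errors)
                  if abs(est - true_value) <= z * se)
    return bias, sem, 100.0 * covered / len(estimates)

# Toy inputs standing in for the estimates from four simulated data sets.
bias, sem, coverage = summarize([0.6, 0.8, 0.7, 1.2], [0.1, 0.1, 0.1, 0.1], true_value=0.7)
```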

Table 6 Simulation results for the main study data with 1000 cases and 1000 controls
Table 7 Simulation results for the main study data with 500 cases and 500 controls
Table 8 Simulation results for the main study data with 200 cases and 600 controls

It is clear that ignoring misclassification yields biased estimates of the parameters, and the coverage rates of the 95% confidence intervals deviate considerably from the nominal level. In contrast, the proposed method yields much improved estimation results with much smaller biases. As a trade-off for the improved point estimation, the variances of the proposed estimators are larger than those of the naive estimators, which has also been observed in other problems concerning measurement error or misclassification (e.g., Carroll et al. 2006; Yi 2017). However, when point estimation and the associated variability are considered jointly, the proposed method produces much better coverage rates for the 95% confidence intervals. More specifically, for a given scenario R 1 or R 2, the proposed method tends to produce better results for Setting I than for Setting II, as expected. For a given setting I or II, the standard errors obtained from Scenario R 2 are smaller than those obtained from Scenario R 1.

Discussion and extensions

Dichotomized covariates are very common in medical studies, and misclassification of these covariates happens frequently in the data collection process. It is important to account for this feature in the data analysis; otherwise, biased results usually arise. In this article, we investigate the effects of misclassified binary covariates on the estimation of risk measures in case-control studies and develop a valid inference method that addresses these effects. Our development is carried out under a practical setting where a validation sample is unavailable but repeated measurements of the error-contaminated variables are available. Numerical studies demonstrate satisfactory performance of our method.

Our method is motivated by the breast cancer case-control data discussed by Duffy et al. (1989), which contain two error-prone binary covariates, each with two replicates of surrogate measurements. It is possible to extend our method to more general settings where there are more than two error-prone binary covariates, and/or the number of replicates of surrogate measurements is arbitrary, and/or error-free covariates are also present. Here we outline three extensions.

6.1 Extension 1: replicates are more than 2

If there are m repeated measurements of each covariate in model (1), then the development in Section 4 can be generalized as follows; we present the discussion for one of the two covariates.

Let \(X_{aj}^{*}\) denote the jth observed measurement of X a for j=1,…,m where m is an integer greater than 2. Let α a =P(X a =1) be the prevalence which is assumed known. Define

$$a_{a j_{1} \ldots j_{m}}=P\left(X_{a1}^{*}=j_{1}, \ldots, X_{am}^{*}=j_{m}\right) \ \ \ \text{for} \ j_{k}=0, 1 \ \ \text{and} \ k=1, \ldots, m. $$

We assume that, given the true value X a , these m replicates are independently collected, thus yielding

$$\begin{array}{@{}rcl@{}} a_{a j_{1} \ldots j_{m}} &=& P\left(X_{a1}^{*}=j_{1}, \ldots, X_{am}^{*}=j_{m}, X_{a}=1 \right)+ P\left(X_{a1}^{*}=j_{1}, \ldots, X_{am}^{*}=j_{m}, X_{a}=0 \right)\\ &=& \left\{\prod_{k=1}^{m} P\left(X_{ak}^{*}=j_{k}|X_{a}=1 \right)\right\}P\left(X_{a}=1\right) + \left\{\prod_{k=1}^{m} P\left(X_{ak}^{*}=j_{k}| X_{a}=0\right)\right\}P(X_{a}=0)\\ &=& \left\{\prod_{k=1}^{m} \pi_{a1}^{j_{k}}(1-\pi_{a1})^{1-j_{k}}\right\} \alpha_{a} + \left\{\prod_{k=1}^{m} (1-\pi_{a0})^{j_{k}} \pi_{a0}^{1-j_{k}}\right\} (1-\alpha_{a}). \end{array} $$
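
As a worked instance of the last line, take m=3 and the pattern (j 1,j 2,j 3)=(1,1,0). The values π a1=0.8, π a0=0.9 and α a =0.3 used below are hypothetical and serve only to illustrate the arithmetic:

$$a_{a110} = \pi_{a1}^{2}(1-\pi_{a1})\alpha_{a} + (1-\pi_{a0})^{2}\pi_{a0}(1-\alpha_{a}) = (0.64)(0.2)(0.3) + (0.01)(0.9)(0.7) = 0.0384 + 0.0063 = 0.0447. $$

The probability depends on the pattern only through the number of positive replicates, so the patterns (1,0,1) and (0,1,1) have the same probability.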

Suppose that \(n_{a}^{*}\) subjects in total have replicated measurements of covariate X a . Let \(N^{*}_{a j_{1} \ldots j_{m}}\) be the number of subjects with outcome \(\left (X_{a1}^{*}=j_{1}, \ldots, X_{am}^{*}=j_{m}\right)\) for j k =0,1 and k=1,…,m, and let \(N^{*}_{a}=\left (N^{*}_{a j_{1} \ldots j_{m}}: j_{k}=0, 1; k=1, \ldots, m\right)^{\mathrm {T}}\). Then we have a multinomial distribution

$$ N^{*}_{a} \sim \text{Multinomial}\left(n^{*}_{a}, a_{a}\right), $$

where \(a_{a}=\left (a_{a j_{1} \ldots j_{m}}: j_{k}=0, 1; k=1, \ldots, m\right)^{\mathrm {T}}\) with \(\sum _{j_{k}=0, 1; k=1, \ldots, m} a_{a j_{1} \ldots j_{m}} =1\), resulting in the likelihood

$$L(\pi_{a1},\pi_{a0}) = \prod_{j_{k}=0, 1; k=1, \ldots, m} (a_{a j_{1} \ldots j_{m}})^{N^{*}_{a j_{1} \ldots j_{m}} } $$

where the constant is omitted. The parameters π a1 and π a0 are estimated by maximizing the likelihood L(π a1,π a0), and the associated variance estimates are obtained from the inverse of the negative second derivative matrix of logL(π a1,π a0) evaluated at the parameter estimates. The other developments in the preceding sections then carry through.
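The estimation step can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the prevalence is treated as known, the search is restricted to sensitivities and specificities above 0.5 (surrogates better than chance), and a crude grid search stands in for a proper numerical optimizer. The counts are the expected counts at hypothetical true values, so the maximizer should recover those values.

```python
import itertools
import math

def cell_prob(pattern, sens, spec, prev):
    """P(X*_1 = j_1, ..., X*_m = j_m), with replicates independent given X."""
    given_1 = math.prod(sens if j == 1 else 1.0 - sens for j in pattern)
    given_0 = math.prod(1.0 - spec if j == 1 else spec for j in pattern)
    return given_1 * prev + given_0 * (1.0 - prev)

def log_likelihood(counts, sens, spec, prev):
    """Multinomial log-likelihood in (sens, spec), omitting the constant."""
    return sum(n * math.log(cell_prob(pattern, sens, spec, prev))
               for pattern, n in counts.items() if n > 0)

def grid_mle(counts, prev, step=0.01):
    """Grid search over (0.5, 1) x (0.5, 1); a stand-in for a real optimizer."""
    grid = [0.5 + step * i for i in range(1, round(0.5 / step))]
    return max(itertools.product(grid, grid),
               key=lambda sp: log_likelihood(counts, sp[0], sp[1], prev))

# Hypothetical illustration: m = 3 replicates, expected counts at the true values.
m, n_total, true_sens, true_spec, prev = 3, 10_000, 0.8, 0.9, 0.3
patterns = list(itertools.product((0, 1), repeat=m))
counts = {pat: n_total * cell_prob(pat, true_sens, true_spec, prev) for pat in patterns}
sens_hat, spec_hat = grid_mle(counts, prev)
```

In practice one would replace the grid search with a gradient-based optimizer and obtain standard errors from the inverse of the numerically evaluated negative Hessian of the log-likelihood.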

6.2 Extension 2: covariates are more than 2

When the number of binary covariates is greater than 2, model (1) can be generalized. To be specific, let X l denote the lth binary covariate for l=1,…,p where p is an integer greater than 2. Then model (1) can be generalized as

$$ \log \left\{ \frac{P\left(Y=1|X_{1}, \ldots, X_{p}\right)} {P\left(Y=0|X_{1}, \ldots, X_{p}\right)} \right\} = \beta_{0} + \sum_{l=1}^{p} \beta_{l} X_{l} + \sum_{j < k} \beta_{jk} X_{j} X_{k}, $$

where β 0, β l and β jk are the regression parameters for l=1,…,p and 1≤j<k≤p.

These parameters can be interpreted in terms of the odds ratios defined for the retrospective sampling framework. Specifically, let

$$ q_{i}=P\left(X_{1}=0, \ldots, X_{p}=0|Y=i\right) $$

and

$$p_{i (k)}=P\left(X_{1}=0, \ldots, X_{k-1}=0, X_{k}=1, X_{k+1}=0, \ldots, X_{p}=0|Y=i\right) $$

for i=0 or 1 and k=1,…,p. Let ψ k be the odds ratio for cases versus controls with (X 1=0,…,X k−1=0,X k =1,X k+1=0,…,X p =0) compared with the baseline category (X 1=0,…,X p =0):

$$\psi_{k} =\frac{q_{0} p_{1 (k)} }{q_{1}p_{0(k)}} $$

for k=1,…,p.

For i=0 or 1 and 1≤j<k≤p, let

$$p_{i (jk)}=P\left(X_{j}=X_{k}=1; X_{l}=0: l \ne j, l \ne k|Y=i\right). $$

Define

$$\psi_{jk} =\frac{q_{0} p_{1 (jk)} }{q_{1}p_{0(jk)}} \ \ \text{and} \ \ \phi_{jk}=\frac{\psi_{jk}}{\psi_{j}\psi_{k}}. $$

Then

$$\beta_{l}=\log \psi_{l} \ \ \text{and} \ \beta_{jk}=\log \phi_{jk} $$

for l=1,…,p and 1≤j<k≤p. Other development in the preceding sections can then carry through with a more complex exposition.
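The relations between the regression parameters and the retrospective odds ratios can be checked numerically. The sketch below is only an illustration: the parameter values, the covariate distribution, and all function names are hypothetical, and covariates are indexed from 0 in the code rather than from 1. It computes P(X=x|Y=i) by Bayes' rule under the generalized model with p=3 and compares log ψ l and log ϕ jk with β l and β jk.

```python
import itertools
import math

def disease_prob(x, beta0, main, pairwise):
    """P(Y = 1 | X = x) under the logistic model with pairwise interactions."""
    eta = beta0 + sum(b * xl for b, xl in zip(main, x))
    eta += sum(b * x[j] * x[k] for (j, k), b in pairwise.items())
    return 1.0 / (1.0 + math.exp(-eta))

def retrospective_probs(weights, beta0, main, pairwise):
    """Return cond[i][x] = P(X = x | Y = i) via Bayes' rule."""
    joint = {0: {}, 1: {}}
    for x, w in weights.items():
        p1 = disease_prob(x, beta0, main, pairwise)
        joint[1][x] = w * p1
        joint[0][x] = w * (1.0 - p1)
    return {i: {x: v / sum(cells.values()) for x, v in cells.items()}
            for i, cells in joint.items()}

def odds_ratio(cond, x):
    """q_0 * p_1(x) / (q_1 * p_0(x)), with the all-zero pattern as baseline."""
    base = tuple(0 for _ in x)
    return (cond[0][base] * cond[1][x]) / (cond[1][base] * cond[0][x])

def indicator(p, *ones):
    """Covariate pattern of length p with ones at the given positions."""
    return tuple(int(l in ones) for l in range(p))

p = 3
beta0 = -3.0
main = [math.log(2.0), math.log(1.5), math.log(3.0)]          # hypothetical beta_l
pairwise = {(0, 1): math.log(2.0), (0, 2): 0.3, (1, 2): -0.2}  # hypothetical beta_jk
# Arbitrary (unnormalized) covariate distribution over {0, 1}^p.
weights = {x: 1.0 + idx for idx, x in enumerate(itertools.product((0, 1), repeat=p))}
cond = retrospective_probs(weights, beta0, main, pairwise)

log_psi = [math.log(odds_ratio(cond, indicator(p, k))) for k in range(p)]
log_phi = {(j, k): math.log(odds_ratio(cond, indicator(p, j, k))) - log_psi[j] - log_psi[k]
           for (j, k) in pairwise}
```

The agreement does not depend on the covariate distribution chosen, because the normalizing constants of P(X=x|Y=i) cancel in the odds ratios.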

We note that model (12) reflects the main effects as well as all pairwise interactions among the covariates. Three-way and higher order interactions among the covariates are not included; they are implicitly assumed to be zero. In problems for which such interactions are of interest, one may modify model (12) by adding those terms with additional parameters. In principle, interactions of any order among the covariates may be included in the model until a saturated model is formed. The interpretation of the associated parameters would be modified accordingly.

6.3 Extension 3: error-free covariates are also present

Model (1) can be modified to accommodate settings with error-free covariates as well. Let Z denote the vector of error-free risk factors of a disease. The prospective logistic regression model is then written as

$$ \log \left\{ \frac{P(Y=1|X_{a}, X_{s}, Z)} {P(Y=0|X_{a}, X_{s}, Z)} \right\} = \beta_{0} + \beta_{a} X_{a} + \beta_{s} X_{s}+ \beta_{as} X_{a} X_{s} +\beta_{z}^{\mathrm{T}} Z, $$

where β 0,β a ,β s ,β as and β z are the regression parameters. The parameters β 0,β a ,β s , and β as can be interpreted in the same manner as (2) except that the associated conditional probabilities p ijk need to be modified as

$$p_{ijk}=P\left(X_{a}=j, X_{s}=k|Y=i, Z\right) $$

with error-free covariates Z being controlled. Estimation of the model parameters may then be carried out using the likelihood method.

To conclude, we comment that the development of Section 4 is based on the assumption of a nondifferential misclassification mechanism. This assumption allows us to estimate the sensitivities and specificities using a sample, separate from the main study, that has repeated surrogate measurements of the covariates but no measurements of the disease status. Such an assumption, however, may be too restrictive for some applications, especially for retrospective studies. In such instances, sensitivity analyses offer a viable way to explore the impact of misclassification on inference results without imposing the nondifferential misclassification mechanism. Finally, our work here focuses on estimation of the model parameters. It would also be interesting to develop hypothesis testing procedures that incorporate misclassification effects, along the lines of Bross (1954).

References


  1. Armstrong, BG, Whittemore, AS, Howe, GR: Analysis of case-control data with covariate measurement error: Application to diet and colon cancer. Stat. Med. 8, 1151–1163 (1989).

  2. Barron, BA: The effects of misclassification on the estimation of relative risk. Biometrics. 33, 414–418 (1977).

  3. Breslow, NE, Cain, KC: Logistic regression for two-stage case-control data. Biometrika. 75, 11–20 (1988).

  4. Breslow, NE, Day, NE: Statistical Methods in Cancer Research, Volume I - The Analysis of Case-Control Studies. International Agency for Research on Cancer, Lyon (1980).

  5. Bross, I: Misclassification in 2×2 tables. Biometrics. 10, 478–486 (1954).

  6. Carroll, RJ, Gail, MH, Lubin, JH: Case-control studies with errors in covariates. J. Am. Stat. Assoc. 88, 185–199 (1993).

  7. Carroll, RJ, Ruppert, D, Stefanski, LA, Crainiceanu, CM: Measurement Error in Nonlinear Models. 2nd ed. Chapman & Hall/CRC, Boca Raton (2006).

  8. Carroll, RJ, Wang, S, Wang, CY: Prospective analysis of logistic case-control studies. J. Am. Stat. Assoc. 90, 157–169 (1995).

  9. Chu, H, Cole, SR, Wei, Y, Ibrahim, JG: Estimation and inference for case-control studies with multiple nongold standard exposure assessments: with an occupational health application. Biostatistics. 10, 591–602 (2009).

  10. Duffy, SW, Rohan, TE, Day, NE: Misclassification in more than one factor in a case-control study: A combination of Mantel-Haenszel and maximum likelihood approaches. Stat. Med. 8, 1529–1536 (1989).

  11. Forbes, AB, Santner, TJ: Estimators of odds ratio regression parameters in matched case-control studies with covariate measurement error. J. Am. Stat. Assoc. 90, 1075–1084 (1995).

  12. Gustafson, P, Le, ND, Saskin, R: Case-control analysis with partial knowledge of exposure misclassification probabilities. Biometrics. 57, 598–609 (2001).

  13. Lyles, RH: A note on estimating crude odds ratios in case-control studies with differentially misclassified exposure. Biometrics. 58, 1034–1036 (2002).

  14. Mak, TSH, Best, N, Rushton, L: Robust Bayesian sensitivity analysis for case-control studies with uncertain exposure misclassification probabilities. Int. J. Biostat. 11, 135–149 (2015).

  15. Marinos, AT, Tzonou, AJ, Karantzas, ME: Experimental quantiles of epidemiological indices in case-control studies with non-differential misclassification. Stat. Med. 14, 1291–1306 (1995).

  16. Morrissey, M, Spiegelman, D: Matrix methods for estimating odds ratios with misclassified exposure data: Extensions and comparisons. Biometrics. 55, 338–344 (1999).

  17. Prentice, RL, Pyke, R: Logistic disease incidence models and case-control studies. Biometrika. 66, 403–411 (1979).

  18. Prescott, GJ, Garthwaite, PH: Bayesian analysis of misclassified binary data from a matched case-control study with a validation substudy. Stat. Med. 24, 379–401 (2005).

  19. Roeder, K, Carroll, RJ, Lindsay, BG: A semiparametric mixture approach to case-control studies with error in covariables. J. Am. Stat. Assoc. 91, 722–732 (1996).

  20. Schlesselman, JJ: Case-Control Studies: Design, Conduct, Analysis. Oxford University Press, Oxford (1982).

  21. Serfling, RJ: Approximation Theorems of Mathematical Statistics. Wiley, New York (1980).

  22. Tang, L, Lyles, RH, Ye, Y, Lo, Y, King, CC: Extended matrix and inverse matrix methods utilizing internal validation data when both disease and exposure status are misclassified. Epidemiol. Methods. 2, 49–66 (2013).

  23. Yi, GY: Statistical Analysis with Measurement Error or Misclassification: Strategy, Method and Application. Springer Science+Business Media LLC, New York (2017).

  24. Zhang, L, Mukherjee, B, Ghosh, M, Gruber, S, Moreno, V: Accounting for error due to misclassification of exposures in case-control studies of gene-environment interaction. Stat. Med. 27, 2756–2783 (2008).

Acknowledgements

The authors thank two anonymous referees whose comments improved the presentation of the manuscript. The research was supported by the Natural Sciences and Engineering Research Council of Canada.

Authors' contributions

Both authors contributed to the manuscript. GY proposed the research idea, and both authors together developed the methodology. WH worked on the computational analysis and GY drafted the manuscript. Both authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Author information



Corresponding author

Correspondence to Grace Y. Yi.

Ethics declarations

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Yi, G.Y., He, W. Analysis of case-control data with interacting misclassified covariates. J Stat Distrib App 4, 16 (2017).

Keywords


  • Case-control study
  • Interaction term
  • Misclassification
  • Prospective logistic regression
  • Replicated measurements