The Rasch model allows for a conditional likelihood ratio goodness of fit test. The speed of approximation of the test statistic to the limiting distribution as a function of sample size and test length has not been analyzed so far. Three bootstrap simulation methods are analyzed with respect to their performance in providing a proper distribution of the test statistic under the null- and the alternative hypothesis.

Results

We found a stable approximation to the limiting χ^{2}-distribution for sample sizes of at least 500 and 10 items. The three bootstrap algorithms rendered consistent results for the H_{0}-cases but not for the H_{1}-cases.

Conclusion

A sequential probability sampling scheme proves sufficiently apt for generating samples under the alternative hypothesis. This superiority can be justified from a theoretical point of view.

Introduction

The dichotomous logistic model according to Rasch (1960, 1966; henceforth denoted as Rasch Model, RM) allows for assessing its adequacy for describing a given data set by means of a conditional Likelihood Ratio Test (LRT; Andersen 1973). The test statistic is asymptotically χ^{2}-distributed as the sample size n→∞. Hence, small samples may impair inference, i.e. the limiting distribution will not provide sufficiently precise quantiles for a reasoned decision, and we have to switch to the bootstrap (cf. Efron and Tibshirani 1998), which is computationally demanding.

However, no systematic investigation has been undertaken so far to analyze the rate of approximation of the test statistic to the limiting distribution. It is therefore difficult to decide when it is safe to use the χ^{2}-distribution and when a bootstrap is required. This question shall be tackled in a simulation study. Moreover, if we switch to the bootstrap, precision depends on the number of bootstrap samples. A concrete guideline will be given as to how many bootstrap replications are required to fulfill a desired precision criterion.

The following outline shall guide the reader through the details of this study:

Theoretical Background: We start with explaining the fundamentals of the Rasch Model (Section 2.1) and the essential basics of model parameter estimation (Section 2.2) to the extent required to understand the simulation procedures applied in the study. Section 2.3 shows the basics of the LRT, the test statistic on which the study focusses. The task of determining the speed of approximation of the test statistic to its limiting χ^{2}-distribution breaks down into three separate questions, which are formulated in Section 2.4.

Methods: In order to perform the simulation study, bootstrap samples in line with the RM have to be generated. For that purpose, several algorithms are at our disposal, which are introduced in Section 3.2. The study considers the distribution of the test statistic under both the null and the alternative hypothesis. These two scenarios require different simulation strategies, which are explained in Section 3.1. The simulation study covers numerous different scenarios, which may arise in practical application. Section 3.3 lists the simulation parameters considered for that purpose.

Results: The complex details of the study are split into results concerning the H_{0}-case (Section 4.1) and the H_{1}-case (Section 4.2). Finally, Section 4.3 introduces a flexible formula to compute an adequate number of bootstrap samples, if this procedure is required.

Theoretical background

2.1 The Rasch model (RM)

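The RM is a discrete probability model of a Bernoulli variable, X_{vi}∈{0,1}, assuming two real valued parameters θ_{v} (v=1…n) and β_{i} (i=1…k),

$$ P(X_{vi}=1 \mid \theta_{v}, \beta_{i}) = \frac{\exp(\theta_{v}-\beta_{i})}{1+\exp(\theta_{v}-\beta_{i})}. \tag{1} $$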

A typical application of model (1) is psychometrics, with θ_{v} describing respondent v’s ability to solve a task (or item) and β_{i} describing the difficulty of task (or item) i. Both parameters are unbounded in value, i.e. \(\theta _{v}, \beta _{i} \in \mathbb {R}\). By means of the substitutions ξ_{v}= exp(θ_{v}) and ε_{i}= exp(−β_{i}) we obtain the so-called multiplicative notation of the model equation,

$$ P(X_{vi}=1 \mid \xi_{v}, \varepsilon_{i}) = \frac{\xi_{v}\varepsilon_{i}}{1+\xi_{v}\varepsilon_{i}}. \tag{2} $$

Due to the exponentiation, ξ_{v} and ε_{i} take positive values only, and ε_{i} is interpreted as an item easiness parameter. Conditional on both parameter vectors θ=(θ_{1},θ_{2},…,θ_{v},…,θ_{n})^{T} and β=(β_{1},β_{2},…,β_{i},…,β_{k})^{T} the binary responses are assumed to be independent, so that the joint distribution of the responses of all n respondents to all k items is given by the product of (1) (or (2), respectively) over v and i. This assumption is usually termed conditional or local independence.
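The equivalence of the additive and the multiplicative notation can be checked numerically. The following is a minimal sketch in Python; the function names and the parameter values are illustrative assumptions, not part of the study.

```python
import math

def rasch_prob(theta: float, beta: float) -> float:
    """P(X_vi = 1) in the additive (logit) notation of model (1)."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

def rasch_prob_mult(xi: float, eps: float) -> float:
    """The same probability in the multiplicative notation of model (2)."""
    return xi * eps / (1.0 + xi * eps)

# Illustrative values only (assumption, not taken from the study)
theta, beta = 0.5, -0.3
p_add = rasch_prob(theta, beta)
# Substitutions xi = exp(theta) and eps = exp(-beta)
p_mult = rasch_prob_mult(math.exp(theta), math.exp(-beta))
print(p_add, p_mult)  # both notations should agree up to rounding error
```

Under local independence, the probability of a complete response matrix is simply the product of such terms (or of their complements for zero responses) over all respondents and items.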

The RM is a member of the exponential family (cf. Molenaar 1995, p. 41) with the sums \(R_{v} =\sum _{i}X_{\textit {vi}}\) and \(S_{i} =\sum _{v}X_{\textit {vi}}\) being the sufficient statistics for the parameters θ_{v} and β_{i}, respectively. The separability theorem (Fisher 1922) applies (Rasch 1966, p. 95; Rost 2001, p. 28), hence items can be compared independently of the ability parameters occurring in the sample and abilities can be estimated independently of the items used (given the items are in line with the model and the model holds for all respondents).

2.2 Parameter estimation

Several parameter estimation methods have been developed. Most straightforward from maximum likelihood theory is the Unconditional Maximum Likelihood approach (UML; or Joint Maximum Likelihood, JML; cf. Baker and Kim 2004, ch. 5.6). Here we determine estimates for both parameter vectors simultaneously by finding the maximum of the unconditional likelihood function as a function of θ and β. This is achieved by setting the partial derivatives equal to zero and applying the pertinent numeric methods to solve a system of nonlinear equations (cf. ibid., p. 136). However, this approach suffers from the so-called incidental parameter problem as expressed in Neyman and Scott (1948). While the item parameters appear as structural (or fixed) parameters, the person parameters constitute a random draw from the population and are therefore incidental (or nuisance) parameters. The simultaneous appearance of both kinds of parameters may cause inconsistent item parameter estimates. Corrective procedures have been proposed (cf. Molenaar 1995, p. 43), but there was dispute concerning their effect (cf. Baker and Kim 2004, ch. 5.6.2).

Generally, this incidental parameter problem may be overcome by marginalization or conditional inference (cf. Pawitan 2001, p. 274). In the first case (Marginal Maximum Likelihood estimation, MML, cf. Baker and Kim 2004, ch. 6), the incidental parameters θ_{v} are replaced by assuming a proper distribution G(θ) in the population (e.g. the normal), requiring only the hyperparameters τ of G(θ) to be determined (i.e. the mean and the variance of G(θ) in our example). Although this solves the incidental parameter problem, the correct choice of G(·) is decisive for obtaining correct estimates (cf. Molenaar 1995, p. 47).

The second approach is the Conditional Maximum Likelihood estimation method (CML; Andersen 1970), directly involving the parameter separability feature. At its heart, this method overcomes the incidental parameter problem by conditioning on each respondent’s observed value of the nuisance parameter’s sufficient statistic, i.e. the score r_{v}, when estimating the item parameters. The conditional likelihood function L_{C} of the item parameter vector ε=(ε_{1},ε_{2},…,ε_{i},…,ε_{k})^{T} given the vector of scores r=(r_{1},r_{2},…,r_{v},…,r_{n})^{T} can be written as

$$ L_{C}(\boldsymbol{\varepsilon} \mid \boldsymbol{r}) = \frac{\prod_{i=1}^{k} \varepsilon_{i}^{s_{i}}}{\prod_{v=1}^{n} \gamma_{r_{v}}(\boldsymbol{\varepsilon})}. \tag{3} $$

The term γ_{r} denotes the elementary symmetric function of order r, covering a complex combinatorial task (cf. Andersen 1972; Formann 1986; Gustafsson 1980). Expression (3) plays a crucial role in the conditional Likelihood Ratio Test (LRT; see Section 2.3).
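To make the combinatorial task concrete: γ_{r} is the sum, over all subsets of r items, of the product of the respective ε_{i}. A minimal Python sketch follows, assuming the simple summation recursion (one of several algorithms discussed in the cited literature, not necessarily the one used in the study); the function name is hypothetical.

```python
def elementary_symmetric(eps: list[float]) -> list[float]:
    """Return [gamma_0, ..., gamma_k] for the item easiness parameters eps.

    Summation recursion: when item i is added to the item set,
    gamma_r <- gamma_r + eps_i * gamma_{r-1}.
    """
    gamma = [1.0] + [0.0] * len(eps)
    for i, e in enumerate(eps, start=1):
        # Update from high to low order so that gamma[r - 1] still
        # belongs to the item set without item i.
        for r in range(i, 0, -1):
            gamma[r] += e * gamma[r - 1]
    return gamma

# Toy example with three items (values chosen for easy checking by hand)
gamma = elementary_symmetric([1.0, 2.0, 3.0])
print(gamma)  # gamma_1 = 1+2+3, gamma_2 = 1*2+1*3+2*3, gamma_3 = 1*2*3
```

The recursion needs on the order of k^{2} operations, whereas a naive enumeration of all subsets grows exponentially in k.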

In the CML context, the person parameters are determined in a separate step, where the \(\hat {\beta }_{i}\) are assumed to be the true item parameters and the \(\hat {\theta }_{v}\) are obtained using maximum likelihood estimation (cf. Hoijtink and Boomsma 1995). Because the score r_{v} is a sufficient statistic for the person parameter, all respondents yielding the same score will obtain the same ability estimate, which will be termed \(\hat {\theta }_{r}\) or \(\hat {\xi }_{r}\), respectively.
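The score-wise ability estimate can be sketched as follows: with the item parameters held fixed, the ML estimate for score r is the value of θ at which the expected score equals r. The Python sketch below uses a plain Newton-Raphson iteration; the function name, the tolerance, and the item parameter values are assumptions for illustration only.

```python
import math

def theta_for_score(r: int, betas: list[float], tol: float = 1e-10) -> float:
    """ML ability estimate for raw score r (0 < r < k), item parameters fixed."""
    k = len(betas)
    if not 0 < r < k:
        raise ValueError("the ML estimate is not finite for r = 0 or r = k")
    theta = 0.0
    for _ in range(100):
        p = [1.0 / (1.0 + math.exp(-(theta - b))) for b in betas]
        expected = sum(p)                      # expected score at theta
        info = sum(q * (1.0 - q) for q in p)   # test information at theta
        step = (r - expected) / info           # Newton-Raphson update
        theta += step
        if abs(step) < tol:
            break
    return theta

# Hypothetical item parameters, equidistant on [-1, 1]
betas = [-1.0, -0.5, 0.0, 0.5, 1.0]
theta_hat = {r: theta_for_score(r, betas) for r in range(1, len(betas))}
print(theta_hat)
```

Only k−1 distinct estimates arise, one per non-extreme score, which is why the scores r=0 and r=k need separate treatment.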

In terms of the Rasch Model, items being never or always solved (i.e. s_{i}=0 and s_{i}=n) are infinitely difficult or easy, respectively. The same applies to respondents solving either no item or all items (i.e. r_{v}=0 and r_{v}=k). While this is seldom a problem for the items (it is unlikely that in a sample of reasonable size an item is never or always solved), it may constitute a problem for person parameter estimation, especially when the instrument is short (i.e. k is small). However, practitioners demand estimates for all respondents, so we have to make further assumptions in order to obtain parameter estimates for such cases as well. These may be obtained with the Weighted Maximum Likelihood Estimation Method (WLE; Warm 1989). Based on a Bayesian argument, person parameter estimates are decreased in their absolute value, thus attenuating their unbounded growth.

2.3 Assessing model fit

Numerous methods for assessing the adequacy of the RM have been proposed, an overview of which can be found in Glas and Verhelst (1995). The present study focusses on the conditional Likelihood Ratio Test (LRT, Andersen 1973; Kreiner and Christensen 2013), which relies on the CML estimation method.

If the model holds, item parameter estimates do not differ across subsamples but for random variation (invariability assumption, cf. Engelhard Jr. 2013). The LRT allows for an assessment of this assumption by comparing the conditional likelihood of the entire data set according to Eq. (3), henceforth denoted L_{0}, with the product of the conditional likelihoods \(L_{C}^{(j)}\) obtained from subsets j=1…g of the data set,

$$ \mathit{LR} = \frac{L_{0}}{\prod_{j=1}^{g} L_{C}^{(j)}}. \tag{4} $$

The test statistic

$$ T = -2 \ln \mathit{LR} \tag{5} $$

follows asymptotically a central χ^{2}-distribution with

$$ df = (k-1)(g-1), \tag{6} $$

given that the Rasch Model is the true model and the subsample sizes n^{(j)}→∞ (ibid., p. 128). Andersen referred to a split according to the score r_{v}; however, one may apply a criterion of substantive interest, like sex, treatment group, or a random split. In many applications, two groups are formed at the median of the score distribution. Without loss of generality, we will consider this median split in the present study.
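On the log scale, the comparison of the overall conditional likelihood with the product of the subgroup likelihoods reduces to a difference of log-likelihoods. The Python sketch below assumes the usual form of the statistic, minus two times the log of the likelihood ratio; the function names and the log-likelihood values are hypothetical.

```python
def lr_statistic(loglik_total: float, loglik_groups: list[float]) -> float:
    """-2 ln(L_0 / prod_j L_C^(j)), computed from log-likelihoods."""
    return 2.0 * (sum(loglik_groups) - loglik_total)

def lr_df(k: int, g: int) -> int:
    """Degrees of freedom according to Eq. (6)."""
    return (k - 1) * (g - 1)

# Made-up conditional log-likelihoods for a median split (g = 2), k = 5 items
stat = lr_statistic(-100.0, [-48.0, -49.5])
df = lr_df(5, 2)
print(stat, df)
```

Because the subgroup estimates maximize the likelihood within each subgroup, the sum of the subgroup log-likelihoods cannot fall below the overall one, so the statistic is non-negative.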

2.4 Study purpose

The present study targets the following three questions:

Q1. How closely does the sampling distribution of (5) fit the central χ^{2}-distribution for small values of n and k, and in which cases might a bootstrap simulation be preferable due to lack of approximation? This question is analyzed for both the H_{0}-case of model fit (Results, Section 4.1) and for model violations under a given H_{1} (Results, Section 4.2).

Q2. Do three pertinent bootstrap methods differ with respect to their precision in providing appropriate approximations of the type-I-error probability? These results are part of the tables of Sections 4.1 and 4.2.

Q3. If a bootstrap is applied, which number of bootstrap replicates is required to obtain a sufficiently stable estimate of the desired quantile for the H_{0} case (Results, Section 4.3)?

Methods

The three questions shall be tackled by means of a simulation study, determining the sampling distribution of the test statistic (5) for various combinations of n and k. Usually, a simulation study starts with fixing the population parameters of interest and drawing samples from this population. In our case, this would comprise fixing a set of k item parameters and n person parameters (or k−1 person parameters associated with each score r, respectively). However, the CML approach relies on the sufficient statistics of the person parameters. We would, therefore, have to find those r_{v} which are associated with a given set of person parameters and item parameters. This task is difficult to achieve, hence we developed the following procedure:

First, a set of k item parameters β^{∗} and n person parameters θ^{∗} is fixed, representing the population of interest. The item parameters β^{∗} were chosen equidistantly from the interval [−1,1] and person parameters θ^{∗} were randomly sampled from the N(0,1).

Then, an initial sample X_{0} of size n×k in line with the assumptions of the Rasch Model is drawn from this population, yielding the realized values of the initially chosen parameters and the according sufficient statistics. The parameter estimates \(\hat {\boldsymbol {\beta }}^{0}\) and \(\hat {\boldsymbol {\theta }}^{0}\) of this initial sample X_{0} supersede the initially chosen β^{∗} and θ^{∗}. We now dispose of both the parameter values and the accompanying sufficient statistics, which are required for the bootstrap algorithm introduced in Section 3.2.3.

The sample X_{0} serves as the basis for the generation of bootstrap samples providing the distribution of the test statistic (5).
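The first two steps of this procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of the study (which used the program Ganz Rasch): the function name, the seed, and the chosen n and k are hypothetical, and the estimation step is omitted.

```python
import math
import random

def simulate_rasch(thetas, betas, rng):
    """Draw an n x k matrix of 0/1 responses under local independence."""
    data = []
    for theta in thetas:
        row = []
        for beta in betas:
            p = 1.0 / (1.0 + math.exp(-(theta - beta)))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data

rng = random.Random(1)                                 # arbitrary seed
n, k = 100, 5
betas = [-1.0 + 2.0 * i / (k - 1) for i in range(k)]   # equidistant on [-1, 1]
thetas = [rng.gauss(0.0, 1.0) for _ in range(n)]       # abilities from N(0, 1)
x0 = simulate_rasch(thetas, betas, rng)

row_scores = [sum(row) for row in x0]                  # r_v, person marginals
item_scores = [sum(col) for col in zip(*x0)]           # s_i, item marginals
print(row_scores[:10], item_scores)
```

The marginal sums of X_{0} are exactly the sufficient statistics that the CML estimation and the conditional bootstrap algorithm build upon.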

3.1 Sampling under the null and the alternative hypothesis

In order to obtain an initial data set X_{0} providing for the distribution of the test statistic under the null hypothesis of model fit, we take the overall parameter vector β^{0}. This choice assumes no subgroup characteristics to be present.

In contrast, a data set X_{0} providing for the distribution of the test statistic under the alternative hypothesis is attained by separately bootstrapping j=1…g subsamples of size n^{(j)} using the original subsamples’ item parameter estimates \(\hat {\beta }_{i}^{(j)}\). These subsample parameter vectors will necessarily differ, at least by chance, i.e. \(\hat {\boldsymbol {\beta }}^{(1)} \ne \hat {\boldsymbol {\beta }}^{(2)} \ne \ldots \ne \hat {\boldsymbol {\beta }}^{(j)} \ne \ldots \ne \hat {\boldsymbol {\beta }}^{(g)}\). Merging these subsamples to one bootstrap sample of size \(n = \sum _{j} n^{(j)}\) will therefore result in a sample violating the item invariance assumption. Hence, such a dataset constitutes a random draw from a population realizing the alternative hypothesis fixed at a model deviation, which is constituted by the item parameter differences of the \(\hat {\boldsymbol {\beta }}^{(j)}\). If we now apply the LRT in the usual manner, the bootstrap distribution of the test statistic represents its distribution under the alternative.

3.2 Generating bootstrap samples

Several methods for generating bootstrap samples in the context of the RM have been proposed, two of which have gained some popularity. A third method, which to the authors’ knowledge has not been described in the context of the Rasch Model before, is introduced in Section 3.2.3. These methods may cause differing distributions of the LR test statistic for reasons outlined below, which might affect the conclusions in an unpredictable manner (cf. Q2 in Section 2.4). Therefore, all three methods will be applied in parallel in order to evaluate their impact upon the resulting distribution of the LR test statistic.

Note that the Nonparametric (or “Naïve”) Bootstrap (cf. Davison and Hinkley 1997) is not suited for generating bootstrap samples in the CML context. This has theoretical reasons, which are elucidated in the light of the present findings in Section 5.3.

3.2.1 A “Normal” Approach

In the conditional estimation approach, unbiased item parameter estimates will be attained irrespective of the actual ability distribution. Therefore, the first simulation method uses only the item parameter estimates \(\hat {\boldsymbol {\beta }}^{0}\), while the person parameters are sampled from a freely chosen distribution. In our case, this was the N(0,σ^{2}), with randomly chosen but not too extreme values of σ^{2}. For notational ease, the hat will be omitted in the following.

The normal distribution has been chosen because it is arguably a proper candidate for numerous characteristics frequently assessed in those areas of social science where the Rasch Model is typically applied. This approach will be termed normal marginals, although, of course, the row marginal sums r_{1}…r_{n} are discrete by nature; it is the underlying parameters that are sampled from the normal. The method has been described in van den Wollenberg (1982) and has gained some popularity for generating data compliant with the RM, which is why it is considered in the present analysis.

3.2.2 Remaining with the observed

Here we use both the person parameter estimates and the item parameter estimates obtained from X_{0}. The probability of a positive response is determined by using Eq. (1) and the according parameter estimates \(\hat {\theta }_{v}\) and \(\hat {\beta }_{i}\). However, this method raises two issues:

First, the CML method only allows for obtaining the item parameter estimates \(\hat {\beta }_{i}\). For the estimation of the person parameters θ_{v}, the item parameter estimates \(\hat {\beta }_{i}\) are taken as if they were the true parameters β_{i}. Hence, the random error associated with the item parameter estimates remains unconsidered, possibly rendering the person parameter estimates deficient. This could deteriorate the bootstrap procedure to an unknown extent.

Second, the ML estimate for respondents solving no item or all items would tend towards plus or minus infinity. Three ways of handling this situation could be thought of:

(i) The modified estimates (WLE) according to Warm (1989) could be applied instead. But, as has been elucidated in the last paragraph of Section 2.2, this method systematically attenuates the person parameter estimates, making the implications for our bootstrap procedure imponderable.

(ii) The WLE could be inserted only for respondents with r=0 and r=k, and the ML estimates otherwise. This would probably reduce the problem largely, as in most cases only few respondents (compared to the total sample size) will realize such scores. This method is implemented in the WinMIRA software of von Davier (2001), for example.

(iii) One can deliberately use arbitrary values for respondents with r=0 and r=k, for example ±15, so that the resulting score will almost surely be equal to 0 or k. Such a method is applied in the software package M-Plus (Muthén 1998–2004; p. 35).

However, any of these three approaches is heuristic and thus has to be considered unsatisfactory from a statistical point of view.

Because the intention is to maintain the original ability distribution as far as possible, method (iii) was applied in the present study. Nevertheless, this approach will not preserve the individual scores r_{v}. Therefore, this approach will be termed free marginals, because the marginal scores are likely to differ from the original ones.
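A minimal Python sketch of one free marginals bootstrap draw is given below. It assumes that score-wise ML ability estimates are already available and substitutes ±15 for the extreme scores; the function name, the lookup table `theta_by_score`, and all parameter values are hypothetical.

```python
import math
import random

def free_marginals_sample(scores, theta_by_score, betas, rng, extreme=15.0):
    """One bootstrap data set; abilities are taken from score-wise estimates."""
    k = len(betas)
    data = []
    for r in scores:
        if r == 0:
            theta = -extreme            # arbitrary stand-in for minus infinity
        elif r == k:
            theta = extreme             # arbitrary stand-in for plus infinity
        else:
            theta = theta_by_score[r]
        row = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(theta - b))) else 0
               for b in betas]
        data.append(row)
    return data

rng = random.Random(7)                  # arbitrary seed
betas = [-1.0, 0.0, 1.0]                # hypothetical item parameter estimates
theta_by_score = {1: -0.8, 2: 0.8}      # hypothetical score-wise ML estimates
scores = [0, 1, 2, 3]                   # observed scores of four respondents
boot = free_marginals_sample(scores, theta_by_score, betas, rng)
print([sum(row) for row in boot])       # need not reproduce the scores
```

Since every response is drawn independently from its model probability, the row sums of the bootstrap sample are themselves random, which is what the label free marginals refers to.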

3.2.3 The Rasch point of view

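In contrast, a sequential importance sampling procedure following a truly conditional approach will be taken into consideration. It merely regards the conditional pattern probabilities in the way they are used in the CML estimation method. Here, the sufficient statistics r_{v} for the person parameter estimates are conditioned upon, making any distributional assumptions superfluous. The probability of a response vector x_{v}=(x_{v1},…,x_{vk}) conditional on the score equals

$$ P(\boldsymbol{x}_{v} \mid r_{v}) = \frac{\prod_{i=1}^{k} \varepsilon_{i}^{x_{vi}}}{\gamma_{r_{v}}(\boldsymbol{\varepsilon})}. $$

Summing these probabilities over all patterns with x_{v1}=1 yields the conditional probability of a positive response to the first item, P(x_{v1}=1|r_{v}).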

Then, we transform this response probability to a manifest response by comparing it to a random number u drawn from the standard uniform distribution, i.e. u∼U(0,1). If P(x_{v1}=1|r_{v}) exceeds u, bootstrap respondent v’s first manifest response x_{v1} is set to one and otherwise to zero (cf. van den Wollenberg 1982, p. 88). In case the response evaluates to one, this person’s score r_{v} is reduced by one, otherwise not, yielding the modified score after step one, \(r_{v}^{(1)}\). The procedure continues with the second item in the same manner and proceeds until all k items have been processed. As soon as the modified score after i steps, \(r_{v}^{(i)}\), equals zero, the probability of solving any of the remaining items is zero and the corresponding responses are set to zero. If \(r_{v}^{(i)}\) at any step i equals the number of the remaining items, all remaining responses are set to one.

That way, each individual’s original score r_{v} is maintained, which is equivalent to fixing the row marginals of the observed data set X_{0}. Therefore, this procedure will be termed fixed marginals in the following. Nevertheless, the items’ sufficient statistics still vary according to the probability distribution described by the RM, which is the information the LRT relies upon.
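The item-by-item procedure can be sketched in Python as follows. The sketch assumes that the conditional probability of solving item i, given a remaining score r, takes the standard form ε_{i}·γ_{r−1}(remaining items after i)/γ_{r}(remaining items from i on); the function names and parameter values are hypothetical, and the code is an illustration rather than the implementation used in the study.

```python
import random

def suffix_gammas(eps):
    """tables[i][r] = elementary symmetric function of order r of eps[i:]."""
    k = len(eps)
    tables = [None] * (k + 1)
    tables[k] = [1.0]                      # empty item set: gamma_0 = 1
    for i in range(k - 1, -1, -1):
        prev = tables[i + 1]
        cur = prev + [0.0]                 # one more item, one more order
        for r in range(len(cur) - 1, 0, -1):
            cur[r] += eps[i] * prev[r - 1]
        tables[i] = cur
    return tables

def sample_fixed_score(r, eps, tables, rng):
    """Draw one response vector whose sum equals the given score r."""
    k = len(eps)
    x = []
    for i in range(k):
        remaining_items = k - i
        if r == 0:                         # nothing left to solve
            x.append(0)
        elif r == remaining_items:         # everything left must be solved
            x.append(1)
            r -= 1
        else:
            p = eps[i] * tables[i + 1][r - 1] / tables[i][r]
            if rng.random() < p:           # p exceeds the uniform draw u
                x.append(1)
                r -= 1
            else:
                x.append(0)
    return x

rng = random.Random(42)                    # arbitrary seed
eps = [0.5, 1.0, 2.0, 1.5]                 # hypothetical item easiness values
tables = suffix_gammas(eps)
scores = [0, 1, 2, 2, 3, 4]                # hypothetical observed row marginals
sample = [sample_fixed_score(r, eps, tables, rng) for r in scores]
print(sample)
```

Every sampled row reproduces its score by construction, while the column sums remain random; no ability parameter enters the computation at any point.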

3.3 Simulation parameters

The simulation comprised k = 5, 10, and 15 items and n = 100, 250, 500, 750, 1000, 2500, and 5000 observations. To each of the 21 possible combinations arising, which will be denoted designs, the three bootstrap algorithms (normal, free, and fixed) for both the H_{0} and the H_{1} case were applied. According to Eq. (6), the degrees of freedom were 4, 9, and 14, respectively.

One crucial aspect of the present study is to differentiate between inaccuracies due to a lack of approximation of the actual distribution of the test statistic to the limiting distribution (i.e. a truly statistical problem) on the one hand and an inaccurate bootstrap distribution due to an insufficient number of bootstrap samples (i.e. a merely technical problem) on the other hand. Preliminary trial runs suggested that m = 200,000 bootstrap replications seem to suffice for the required distinction. Assuming that this number of bootstrap replicates makes the bootstrap caused (i.e. technical) error negligible, any remaining deviation from the limiting density will be attributable to a true lack of approximation.

In order to determine the minimum number of bootstrap replicates required for a sufficiently good approximation of the bootstrap densities to the true ones under the null hypothesis (i.e. Q3 in Section 2.4), random samples of decreasing size m^{∗}<m have repeatedly been drawn with replacement from the original 200,000 samples of each design. The following values were chosen for m^{∗}: 500, 1000, 1500, 2000, 2500, 5000, 7500, 10,000, 15,000, 20,000, 25,000, 50,000, 100,000, and 150,000. Each draw of size m^{∗} was repeated 1000 times (m^{∗∗}) in order to obtain a distribution of the LR test statistic for each m^{∗}.

The densities obtained by means of the bootstrap will be depicted using kernel density estimators with bandwidth parameters of 0.2 to 0.5. The simulation itself was performed with the program Ganz Rasch (Alexandrowicz 2012), which supports all three simulation techniques introduced in Section 3.2. The simulation results were analyzed with R (R Core Team 2015).

Results

The results of the simulation study regarding questions Q1 and Q2 are presented separately for the H_{0}- and the H_{1}-case (Sections 4.1 and 4.2). Section 4.3 covers the results regarding question Q3, the required number of bootstrap samples.

4.1 Approximation under the null hypothesis

In order to describe the approximation of the bootstrap distributions to the limiting ones, we first compared the first four moments and selected quantiles of the estimated and the theoretical distribution of the test statistic; this step is accompanied by a density plot of the respective distributions. Second, the p-values of the LRT evaluated using both the bootstrap and the limiting distributions were compared with each other.

4.1.1 Descriptive approach

Tables 4, 5 and 6 in the Appendix show the sample statistics of the 21 different designs for the fixed, the free, and the normal marginals case. The simulated values are opposed to the respective values of the χ^{2}-distribution with the according degrees of freedom (first row in each block). Each pair of columns denotes the estimated value of the statistic along with the relative deviation (in %) compared to the exact value of the limiting distribution. Two tendencies are discernible across all designs: The largest deviations can be found for the smallest sample size (n=100) combined with the fewest items (k=5, i.e. df=4). For example, in the fixed marginals case (Table 4), the mean deviates by 10.4 % for 5 items and 100 observations, by 6.0 % for 10/100, and by 4.3 % for 15/100. In comparison, with 5000 observations, the respective deviations were −0.1 % (k=5), 0.1 % (k=10), and <0.1 % (k=15).

Comparing the three simulation algorithms reveals the smallest relative deviation to appear for the bootstrap technique using normal marginals. For example, the relative error regarding the 95 %-quantile (being most important for hypothesis testing) is 2.2 % (df=4), 1.2 % (df=9), and 0.9 % (df=14). The respective figures for the fixed marginals case are 2.6 %, 1.2 %, and 1.0 %, while the normal marginals produces deviations of 0.8 %, 0.5 %, and 0.6 %.

The density plots (Fig. 1) allow for a rough assessment of the overall fit of the bootstrap distributions. The plot is threefold (normal, free, and fixed marginals) with three clusters of densities according to the df. Each line represents a certain sample size (i.e. seven per cluster). As can be seen, the seven lines per method and df cannot be kept apart in any of the plots, therefore no attempt was made to label the lines. Also, three lines indicating the limiting densities with the respective degrees of freedom have been superimposed (red lines). However, they mostly disappear in the three clusters, indicating overall agreement of the bootstrap generated distributions with the limiting distributions.

In the normal marginals case (Fig. 1, left hand plot), we would virtually identify no deviation of the bootstrap densities from the limiting ones. This observation was independent of the degrees of freedom and the sample size. In the other two cases (free and fixed marginals), one line per cluster seems somewhat dislocated, indicating a reduced probability of smaller χ^{2}-values and slightly heavier tails (the latter is hardly discernible). These lines represent the n=100 cases, in which the approximation seems deficient.

For both the free and the fixed marginals method (Fig. 1, middle and right hand plot), the bootstrap densities for df=4 are located beyond the limiting distributions. The densities covering df=9 show the same tendency but to a much slighter extent, and when df=14, this effect vanishes entirely. The free and the fixed marginals method seem not to differ with respect to this issue. Therefore, we can conclude that the approximation improves with the degrees of freedom, which is in line with theory. Roughly ten degrees of freedom seem sufficient for the approximation to be considered satisfactory, given a sample size of at least 250.

4.1.2 Inferential Approach

In order to compare the bootstrap distributions with the according limiting ones, we used the Kolmogorov-Smirnov (K-S) test (cf. Thode 2002, ch. 5.1.1) to test the null hypothesis that the bootstrap distributions follow a χ^{2}-distribution with the respective degrees of freedom (i.e. 4 in the 5-items designs, 9 in the 10-items designs, and 14 in the 15-items designs). The results are given in Table 7 in the Appendix.

Many of the tests yield a significant result using a type-I error risk of 5 %. However, we recognize some non-significant results for (a) larger samples and (b) larger instruments: Non-significant results were obtained for the combinations (k/n) 5/5000 (fix), 10/1000 (free + nv), 10/5000 (nv), 15/750 (nv), 15/2500 (fix + free), and all three combinations 15/5000. The normal marginals appeared slightly better, as 4 out of 21 tests were not significant, opposed to 3 out of 21 for both the fixed and the free marginals cases. This is in line with the findings based on the descriptive statistics above.

Note that the K-S-tests rely on 200,000 bootstrap samples each, hence they are by far overpowered. Assuming that the probability of an error of the second kind almost vanishes with such large samples, a non-significant result corroborates our supposition that the bootstrap distribution in fact realizes the limiting distribution. Therefore, the mere fact that at least some of the tests were not significant is in fact a remarkable result. If we further look at the values of the K-S test statistics D: None of them exceeds 0.07 (which appeared with 5/100, fixed marginals algorithm). The K-S test evaluates the maximum difference of the CDFs of the bootstrap generated distributions and the limiting ones, which diverge at most by 7 % in the cases considered here.

4.1.3 Comparison of the p-values

Two comparisons shall enhance conclusions from a practical point of view: First, we emulate the action a person ignorant of approximation problems might take. That is, we simply use the sought quantile of the limiting χ^{2}-distribution (e.g. 9.49 for α=0.05 and df=4) and decide whether or not to retain the H_{0}. For that purpose, we calculated the p-value the χ^{2}-quantile of the limiting distribution would yield when applied to the according bootstrap distribution (reflecting the proper probability measure). The (relative) difference of the p-value to the nominal α quantifies how misleading such a procedure would be. Table 8 in the Appendix presents this comparison for the three simulation algorithms, considering critical values of α = 0.10, 0.05, and 0.01.
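This comparison can be sketched in Python. The sketch uses the closed-form upper-tail probability of the χ^{2}-distribution, which exists for even degrees of freedom only (so it covers df=4 and df=14, but not df=9), and a small made-up vector in place of the bootstrap statistics; all names and values are assumptions.

```python
import math

def chi2_sf_even(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable, even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs an even, positive df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j                   # (x/2)^j / j!
        total += term
    return math.exp(-half) * total

def empirical_p(critical: float, boot_stats: list[float]) -> float:
    """Share of bootstrap statistics at or above the critical value."""
    return sum(t >= critical for t in boot_stats) / len(boot_stats)

alpha = 0.05
nominal = chi2_sf_even(9.49, 4)            # close to alpha for df = 4
# Made-up stand-in for a vector of bootstrap test statistics
boot_stats = [1.2, 3.4, 5.6, 9.8, 10.5, 2.2, 4.1, 7.7, 0.9, 6.3]
p_boot = empirical_p(9.49, boot_stats)
rel_diff = (p_boot - alpha) / alpha        # relative difference to nominal alpha
print(nominal, p_boot, rel_diff)
```

A positive relative difference means that the limiting quantile rejects more often than the nominal α suggests.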

The p-values were considerably increased for small samples, yielding differences of up to 52 % (α = 0.01, n=100, k=10, nv). Generally, sample sizes 100 and 250 and (in some instances) 500 would lead to substantially more significant results if we decided to use the quantiles of the limiting distribution rather than the bootstrap-based ones. Comparing the three simulation algorithms revealed that the normal marginals procedure performed somewhat better than the free and the fixed marginals algorithm. However, the discrepancies of the three methods are mostly moderate. Further, errors are more pronounced for low values of the type-I-error risk. As soon as samples exceed 500 observations, the deviations become increasingly smaller.

Remember that each bootstrap analysis is based on an actual realization of the test statistic (5), allowing for a second check: Table 1 compares the observed values of these test statistics applied to both the limiting χ^{2}-distribution and to the respective bootstrap generated distribution. As can be seen, most differences seem negligible. This is not surprising, because we now consider values far away from the regions relevant to inferential decisions (i.e. the distributions’ tails). Hence too heavy tails of the bootstrap distributions are compensated by regions of decreased probabilities for lower values of the test statistic.

4.2 Approximation under the alternative hypothesis

The analysis of the three bootstrap methods under the alternative refers to comparing the observed differences between the three methods with the given sample sizes. At this point, we have to keep in mind that each original sample X_{0} constitutes a random realization of a population distribution of its own, hence the bootstrap generated distributions of the test statistic differ across the various designs (i.e. each combination of number of items k and sample size n) and are therefore incomparable. Any attempt to achieve the same subgroup parameter would go beyond the objectives of the present study.

Table 2 shows the descriptive statistics for the three bootstrap methods. We see considerable differences in some cases: For example, in the k=5/n=250 design, the mean of the bootstrap distribution generated with the normal marginals method is more than three times larger than in the free or the fixed marginals case (the latter two being fairly similar). A similar tendency occurs in the k=5/n=750 design, also for the normal marginals method, yet to a weaker extent. In contrast, the fixed and the free marginals method yielded fairly similar distributions for all designs.

In order to rule out technical reasons for the unexpected distributions, simulations of the 5-items designs have been repeated twice. However, the results were virtually identical: highly deviating distributions appeared repeatedly, with no apparent pattern regarding sample size (in the first repetition, the phenomenon occurred with n=500, n=750, and n=2500, and in the second with n=250, n=500, and, to a lesser extent, n=500, n=2500, and n=5000). In no case were such peculiarities observed with either of the other two algorithms, i.e. fixed or free marginals.

4.3 Number of bootstrap samples

When applying the bootstrap, we have to decide on the number m^{∗} of bootstrap samples, required to obtain sufficiently precise results in justifiable time, as bootstrapping may consume a considerable amount of time for large data sets and/or many items. First of all, a means is required to summarize the (loss of) precision when m^{∗} decreases. Because it is a common choice to evaluate the test statistic with respect to the 95 %-quantile of the appropriate limiting distribution, we will concentrate on this measure.

Rather than starting a new simulation, we resorted to the vast amount of data already at hand: The original simulation covered 7 sample sizes times 3 item counts times 3 algorithms, which totals 7×3×3=63 vectors, each containing m=200,000 realizations of the test statistic (5). In order to evaluate the effect of fewer than 200,000 bootstrap samples (m^{∗}<m), we drew random subsamples of 14 different sizes m^{∗} from each of the 63 vectors, each repeated 1000 times. For each of these 63×14×1000 = 882×1000 samples, we determined the empirical 95 %-quantile, yielding 882 vectors containing 1000 estimates of the quantile under consideration, \(\hat {q}_{.95}\). (For notational ease, the index will be omitted.)
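The subsampling step can be sketched as follows. This is a minimal illustration and not the code used in the study: the function name `subsample_quantiles`, drawing without replacement, and the "inclusive" quantile definition are assumptions, and the input vector is a stand-in of χ^{2}(4)-distributed random numbers rather than real values of the test statistic (5).

```python
import random
import statistics

def subsample_quantiles(stats, m_star, reps=1000, seed=None):
    """Draw `reps` random subsamples of size m_star (without replacement)
    from the simulated test statistics and return the empirical
    95 % quantile of each subsample."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        subsample = rng.sample(stats, m_star)
        # quantiles(..., n=20) returns 19 cut points; the last one is the 95 % quantile.
        q95 = statistics.quantiles(subsample, n=20, method="inclusive")[-1]
        estimates.append(q95)
    return estimates

# Stand-in for one of the 63 vectors (assumption): chi-square(4) draws,
# generated as Gamma(shape=2, scale=2), instead of real LRT values.
rng = random.Random(1)
stats = [rng.gammavariate(2.0, 2.0) for _ in range(20000)]

# 200 repetitions instead of 1000 to keep the illustration fast.
q_hat = subsample_quantiles(stats, m_star=500, reps=200, seed=2)
```

Each call yields one vector of quantile estimates for a single combination of design, algorithm, and m^{∗}.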

The minimum and maximum value per vector would express a worst-case appraisal of the error to be found in the simulated data sets. However, in order to prevent single outliers from detracting from a more general perspective, the five most extreme values in each direction were averaged, a procedure which can be considered a stabilized minimum and maximum. The difference of these two figures is divided by the corresponding quantile of the respective limiting distribution, which makes the measure comparable across all designs. It will be termed relative range, rr:

$$ rr = \frac{\frac{1}{5}\sum_{i=996}^{1000} \hat{q}_{(i)} - \frac{1}{5}\sum_{i=1}^{5} \hat{q}_{(i)}}{\chi^{2}_{[.95;\, k-1]}}, \tag{9} $$

with \(\hat {q}_{(i)}\) denoting the sorted values (in ascending order) of the 1000 quantile estimates per combination of n, k, algorithm, and m^{∗}.
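The relative range as described in the text can be computed along the following lines. The function name `relative_range` and the toy input are hypothetical; the sketch only mirrors the verbal definition (average of the five smallest and the five largest estimates, difference divided by the limiting quantile).

```python
def relative_range(quantile_estimates, limiting_quantile, n_extreme=5):
    """Stabilized range of the quantile estimates relative to the
    corresponding quantile of the limiting distribution."""
    ordered = sorted(quantile_estimates)
    if len(ordered) < 2 * n_extreme:
        raise ValueError("need at least 2 * n_extreme estimates")
    # Average the n_extreme smallest and the n_extreme largest values.
    stabilized_min = sum(ordered[:n_extreme]) / n_extreme
    stabilized_max = sum(ordered[-n_extreme:]) / n_extreme
    return (stabilized_max - stabilized_min) / limiting_quantile

# Toy input (assumption): 1000 evenly spaced "estimates" between 13 and 15,
# evaluated against the critical value 14.067 of the chi-square(7) distribution.
toy_estimates = [13.0 + 0.002 * i for i in range(1000)]
rr = relative_range(toy_estimates, limiting_quantile=14.067)
```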

Figure 2 shows the relative range by number of bootstrap replications m^{∗}. Two clear structures are discernible: First, all lines exhibit a (negative) logarithmic shape without exception, with deviations rapidly decreasing as the number of bootstrap samples increases. Second, the larger the number of items k, the faster the deviations decrease with increasing m^{∗}. However, the latter effect is considerably weaker than the former.

Due to the clear shape of the curves, we tried to formulate a general model predicting the required number of bootstrap replicates given a desired precision criterion in terms of rr. In order to apply a linear model, the negative logarithm of the relative range rr and the logarithm of the bootstrap replication number m^{∗} were taken. The algorithm was dummy coded (fr serving as reference category), and the number of items k and the sample size n were entered directly into the model equation

$$ y = \beta_{0} + \beta_{1} \log(m^{*}) + \beta_{2}k + \beta_{3}n + \beta_{4[1]}\,fx + \beta_{4[2]}\,nv, \tag{10} $$

with y=− log(rr) and β_{(·)} denoting the regression coefficients. Their estimates are given in Table 3 along with the respective significance measures.
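How such a log-linear model can be fitted is sketched below with ordinary least squares in plain Python. Everything numeric here is an assumption for illustration: the data are synthetic and noise-free, the "true" coefficients are arbitrary (they are not the Table 3 estimates), the grid of m^{∗} values is hypothetical, and the model is reduced to the intercept, log(m^{∗}), and k for brevity.

```python
import math

def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

# Synthetic illustration only: arbitrary coefficients, NOT those of Table 3.
true_b = (-1.0, 0.5, 0.02)
rows, y = [], []
for m_star in (100, 200, 500, 1000, 5000):   # hypothetical grid of m*
    for k in (5, 10, 15):
        rr = math.exp(-(true_b[0] + true_b[1] * math.log(m_star) + true_b[2] * k))
        rows.append([1.0, math.log(m_star), float(k)])
        y.append(-math.log(rr))              # y = -log(rr) as in model (10)

b_hat = ols(rows, y)  # recovers true_b up to rounding error
```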

The model R^{2} equalled .994, which indicates a good fit of model (10). Aside from the intercept β_{0}, the coefficients for the bootstrap replication number, β_{1}, and the number of items, β_{2}, were significantly different from zero, but not those for the sample size, β_{3}, or the simulation algorithm, β_{4[·]}.
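Since only β_{1} and β_{2} (besides the intercept) differ significantly from zero, a reduced form of model (10) can be solved for m^{∗}. The following lines are merely a sketch of that algebra, assuming natural logarithms and dropping the non-significant terms; the numerical coefficients have to be taken from Table 3:

$$ -\log(rr) = \beta_{0} + \beta_{1}\log(m^{*}) + \beta_{2}k \quad\Longrightarrow\quad m^{*} = \exp\left(\frac{-\log(rr) - \beta_{0} - \beta_{2}k}{\beta_{1}}\right). $$

As rr decreases with growing m^{∗} (Fig. 2), β_{1} is positive, so a stricter precision criterion (smaller rr) requires more replications.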

This affirms the impression already derived from Fig. 2 that m^{∗} and (to a much lesser extent) k suffice for the determination of the precision of the bootstrap analysis. From the coefficients indicated in Table 3, a rule of thumb has been developed to obtain a rough estimation of the required number of bootstrap samples (the coefficients were rounded):

If, for example, one wants to test 8 items with the LRT using two split groups, then the critical value is \(\chi ^{2}_{[.95; 7]}=14.067\). If the 95 % quantile of the bootstrap distribution shall not leave the interval [13, 15] (which corresponds to the probabilities .927 and .964, respectively), the range is two and the relative range is rr=2/14.1=0.142 (note that the deviations do not behave symmetrically, but this seems negligible for obtaining a rough estimate of m^{∗}). Then, the optimal number of bootstrap replicates according to (11) amounts to 1219, hence 1200 bootstrap samples will be a good choice.
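The input quantities of this example can be checked numerically. The sketch below uses only the Python standard library; the helper names `chi2_cdf` and `chi2_quantile` are hypothetical, the CDF is evaluated with the series expansion of the regularized incomplete gamma function, and the quantile is found by bisection. It does not evaluate Eq. (11) itself.

```python
import math

def chi2_cdf(x, df):
    """CDF of the chi-square distribution, computed as the regularized
    lower incomplete gamma function P(df/2, x/2) via its power series."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))

def chi2_quantile(p, df):
    """Quantile of the chi-square distribution by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

df = 8 - 1                             # 8 items, two split groups
critical = chi2_quantile(0.95, df)     # approx. 14.067
p_lower = chi2_cdf(13.0, df)           # probability belonging to the lower bound
p_upper = chi2_cdf(15.0, df)           # probability belonging to the upper bound
rr = (15.0 - 13.0) / critical          # approx. 0.142
```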

Discussion

The present study focuses on practical issues when applying the Likelihood Ratio statistic according to Andersen (1973) for testing the binary Rasch Model. If the model holds, the Likelihood Ratio test statistic approaches the limiting distribution to a sufficient extent even in cases where samples were small or items were few. The most problematic combination of 5 items and 100 observations revealed moderate deviations from the limiting distribution. But even in the most problematic cases the CDFs of the bootstrap and the according limiting distribution differed by no more than 7 %, which seems justifiable to us.

Generally, the test statistic under the null hypothesis approximates the theoretical distribution sufficiently well if samples comprise at least 500 respondents and an instrument with at least ten items is considered. For studies considering smaller samples or fewer items we recommend the computationally more expensive bootstrap method. However, this is hardly a drawback, as bootstrapping small samples takes only a reasonable amount of time. In order to further control the required time, Eq. (11) provides an easily applicable rule of thumb that allows limiting the number of bootstrap samples while still meeting a precision criterion of interest.

5.1 Size Matters

But so far, only the type-I error probability of falsely rejecting the null hypothesis has been taken into consideration. The test will be overpowered if samples are large, hence irrelevant model deviations will become significant although they might be acceptable from a substantive point of view, which, in turn, might give rise to a general scorn of the LRT as such.

However, a significance test has its merits as well, as it allows one to rely on the decision criterion of statistical significance, which is fundamental in scientific reasoning. In order to avoid the propagation of unsubstantiated rules of thumb (cf. Maxwell 2000), a prospective power analysis (in the sense of Cohen 1988) is required. For that purpose, the simulation technique presented here provides a reliable means to obtain the required non-central distributions of the test statistic.
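Once a bootstrap distribution of the test statistic under a fixed alternative is available, the power at a given critical value is simply the proportion of simulated statistics exceeding it. A minimal sketch, with a hypothetical function name and made-up input values:

```python
def bootstrap_power(h1_statistics, critical_value):
    """Estimated power: share of test statistics simulated under the
    alternative that exceed the critical value of the test."""
    exceed = sum(1 for s in h1_statistics if s > critical_value)
    return exceed / len(h1_statistics)

# Made-up values standing in for a bootstrap H1 distribution (assumption).
example_stats = [5.0, 10.0, 15.0, 20.0]
power = bootstrap_power(example_stats, critical_value=14.067)
```

Repeating this for several sample sizes gives the power curve needed for sample size planning.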

5.2 Simulation technique

In the present study, three pertinent bootstrap algorithms, which we termed normal marginals, free marginals, and fixed marginals method, have been compared. While there was hardly any difference in the null hypothesis cases, some striking differences were encountered for the non-central distributions, deserving further inspection: The LRT assumes the rowsums r_{v} to be fixed at their observed values, therefore the fixed marginals bootstrap adopts this assumption. Any deviation from the observed scores inevitably yields a different likelihood and the sampling distribution of the test statistic will change.
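One standard way to generate a response vector with a prescribed row sum under the RM is sequential sampling based on the elementary symmetric functions. The sketch below illustrates this idea only and is not necessarily the implementation used in the study. It assumes item parameters β_{i} in the easiness-reversed ("difficulty") parameterization, i.e. ε_{i} = exp(−β_{i}), and the function names are hypothetical.

```python
import math
import random

def esf(eps):
    """Elementary symmetric functions gamma_0, ..., gamma_k of eps
    (summation algorithm)."""
    gamma = [1.0] + [0.0] * len(eps)
    for i, e in enumerate(eps, start=1):
        for r in range(i, 0, -1):
            gamma[r] += e * gamma[r - 1]
    return gamma

def sample_pattern(beta, score, rng):
    """Draw one response vector with the given raw score from the
    conditional distribution of the Rasch model, item by item."""
    eps = [math.exp(-b) for b in beta]
    x, remaining = [], score
    for i in range(len(eps)):
        rest = esf(eps[i + 1:])  # symmetric functions of the items still to come
        # Weight of all completions with x_i = 1 and with x_i = 0, respectively.
        g_with = eps[i] * rest[remaining - 1] if remaining > 0 else 0.0
        g_without = rest[remaining] if remaining < len(rest) else 0.0
        xi = 1 if rng.random() < g_with / (g_with + g_without) else 0
        x.append(xi)
        remaining -= xi
    return x

rng = random.Random(0)
beta = [-1.0, 0.0, 0.5, 1.0, 2.0]   # hypothetical item parameters
patterns = [sample_pattern(beta, 2, rng) for _ in range(200)]
```

By construction, every generated pattern reproduces the requested score, so the row marginals of a bootstrap sample stay at their observed values.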

Interestingly, the present study revealed that such a change primarily occurs in the non-central case, which can be explained as follows: If we let the rowsums vary freely (as has been done in the normal marginals and the free marginals case), score frequencies change as well. Now, in the central case, the same item parameter estimates \(\hat {\boldsymbol {\beta }}^{(0)}\) are used for all subsamples, which supports the assumptions made in the null hypothesis. But in the non-central case, possibly differing subgroup estimates are used for generating the bootstrap samples. If, say, score group two yields highly deviating estimates \(\hat {\boldsymbol {\beta }}^{(2)}\), but the score two has (by chance) occurred only rarely, the deviation will hardly be reflected in the test statistic. But if the same score group had appeared with a high frequency, the deviating estimates would considerably change the product of the subgroup likelihoods in Eq. (4) and the test statistic would reflect the model violation. Hence, the test statistic (and, in turn, its bootstrap distribution) depends on the relative frequencies of the scores, which explains the observed differences of the three bootstrap methods considered.

Hence, the fixed marginals method has to be considered superior, not only for the theoretical reasons outlined above, but also because the present study revealed the differences to be striking in certain cases. The normal marginals method yielded problematic distributions of the test statistic (5) when simulating the H_{1}-distribution, which may also be explained: If we split along the score r_{v} (using the median, for example), score distributions in the subsamples will inevitably differ, causing the observed differences. Therefore, the normal marginals method is not suitable for that purpose.

5.3 Don’t be naïve!

As has been mentioned above, the naïve bootstrap (i.e. drawing response vectors with replacement from the original sample, cf. Davison and Hinkley 1997) cannot be applied to the present problem: In the specific case of the LRT, this would cause the split group membership to be drawn at random as well, thus changing the subgroup frequencies n_{j} of each bootstrap sample in an entirely unpredictable manner (an analogous argument applies to regression analysis, cf. Enders 2010, p. 150).
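The effect on the subgroup frequencies is easy to demonstrate. In the following sketch (hypothetical function name, made-up group sizes of 60 and 40), persons are resampled with replacement together with their split group label, and the resulting group sizes are counted:

```python
import random
from collections import Counter

def naive_bootstrap_group_sizes(groups, rng):
    """Resample persons with replacement and count how many members
    each split group receives in the bootstrap sample."""
    resampled = rng.choices(groups, k=len(groups))
    return Counter(resampled)

rng = random.Random(42)
groups = ["low"] * 60 + ["high"] * 40      # made-up original split
sizes = [naive_bootstrap_group_sizes(groups, rng) for _ in range(5)]
```

The total sample size is preserved in each replication, whereas the sizes of the two groups are random and generally differ from 60 and 40.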

One might therefore consider drawing the observations separately from the original subsamples. However, this would not yield the desired results: Each group would then consist only of the response patterns of its own (possibly differing) subgroup members, hence we would end up with the distribution of the test statistic under the alternative hypothesis.

5.4 Subgroup frequencies

The similarities of the three algorithms in the H_{0}-case could be traced back to the fact that the original samples X_{0} have been generated with θ∼N(0,1): All bootstrap procedures reproduced the marginals’ distributions exceptionally well, which will not necessarily be the case in a practical application. Hence, we should not use the free or the normal marginals method but the fixed marginals method to perform a power analysis in the sense of Cohen (1988).

However, this algorithm has far-reaching consequences for further research: If one wanted to perform a power analysis of the LRT by means of a simulation study using the fixed marginals method, one would have to consider both the item parameters and the score group frequencies. But unfortunately, we face a technical complication here: In order to vary the sample size seamlessly (which is necessary to obtain the optimal sample size), the relative subgroup frequencies have to be maintained. For twice the original sample (or any other integer multiple), each observation can be drawn twice (or three times, and so on). But for any other sample size, the relative frequencies n_{j}/n would have to be carefully approximated, allowing for a sufficient reproduction of the marginals with increasing n.
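One conceivable way of approximating the relative frequencies for an arbitrary target sample size is largest-remainder rounding. This scheme is an assumption made here for illustration and is not proposed in the present study; the function name and the example frequencies are hypothetical.

```python
def scale_frequencies(counts, new_n):
    """Scale integer group frequencies to a new total while keeping the
    relative frequencies as close as possible (largest-remainder rounding)."""
    total = sum(counts)
    exact = [c * new_n / total for c in counts]
    scaled = [int(e) for e in exact]           # round down first
    leftover = new_n - sum(scaled)
    # Hand out the remaining units to the largest fractional parts.
    order = sorted(range(len(counts)),
                   key=lambda j: exact[j] - scaled[j], reverse=True)
    for j in order[:leftover]:
        scaled[j] += 1
    return scaled

observed = [10, 25, 40, 20, 5]                 # made-up frequencies, n = 100
doubled = scale_frequencies(observed, 200)     # integer multiple: exact doubling
scaled_150 = scale_frequencies(observed, 150)  # non-integer multiple
```

For integer multiples the result coincides with drawing each observation the corresponding number of times; otherwise the frequencies deviate from the exact proportions by less than one observation per group.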

5.5 How many bootstrap samples?

One question that always has to be considered when applying the bootstrap is the number of bootstrap replicates that have to be generated. For this purpose, a very general solution has been found in the present study. Within the parameters considered, the expected maximum deviation from the 95 % quantile of the true distribution can be determined using the number of bootstrap samples and the number of items. Of course, Eq. (11) could be extended to any other measure of interest, like another quantile, for example. The present approach demonstrated the feasibility of a means to generally determine the required number of bootstrap samples.

5.6 Limitations and outlook

One limitation of the present study can be seen in the fact that only the two-group sample split has been considered. However, the procedure presented here allows for a straightforward extension to any number of split groups. Further, this seems to be only a minor obstacle for the practical application of the present results, as available sample sizes seldom allow for splitting into more than two groups.

The LRT can also be applied to polytomous extensions of the RM, like, for example, the Rating Scale Model (Andrich 1978), or the Partial Credit Model (Masters 1982). Power considerations for these models have to be tackled separately, as the number of parameters to be estimated changes with the number of response categories involved. Further, the LRT plays a crucial role in testing the linear logistic extension of the RM (Linear Logistic Test Model, LLTM, Fischer 1973), where the likelihood of the (empirically more restrictive) LLTM is opposed to that of the RM (cf. Alexandrowicz 2011). Again, the methods described here can be adapted accordingly.

The results obtained in the present study have two important implications: First, we are able to obtain the distribution of the test statistic under a fixed alternative by means of the recommended bootstrap method. This allows for determining the required sample size to detect a model violation which is considered relevant from a substantive point of view with fixed risks for the type-I and the type-II errors. And second, the necessary number of bootstrap replications for warranting a desired precision can be obtained. One might object that Eq. (11) only covers up to 15 items and may therefore not be used for larger instruments. But as we have seen, the larger the number of items, the fewer bootstrap samples are necessary given everything else is held constant. Therefore, one is on the safe side using a minimum of 500 bootstrap samples for data sets comprising more items. The same applies to sample sizes beyond those considered here.

One reason impairing the applicability of the LRT has been that the power of the test for a given model deviation could not be determined. As a consequence, no sample planning was possible, leaving the researcher in the dark whether a significant result indicated a model deviation of substantive interest or was merely the consequence of too large a sample. However, this fundamental problem has been overcome by Draxler (2010) for the Wald-test and generalized to the LRT and the Rao-Score-test by Draxler and Alexandrowicz (2015).

The present results allow for determining the optimal sample size required to detect a model deviation considered relevant from a substantive point of view with given risks of an error of the first and the second kind. We believe that the LRT is a valuable tool for testing whether an instrument allows for establishing a measurement and that the present findings will facilitate its reliable utilization.

Conclusion

The test statistic of the conditional Likelihood Ratio Test approximates its limiting distribution very fast. Only the combination of 5 items and 100 respondents revealed slight deviations; increasing either the number of items or the sample size, however, will allow for employing the quantiles of the respective χ^{2}-distribution in the usual manner. Hence, the cLRT may be applied with confidence in many situations. All three bootstrap algorithms perform well under the null hypothesis and provide reliable estimates of the quantiles required for testing the null hypothesis of model fit taking the error of the first kind into consideration. In order to take the error of the second kind into account as well, we have to find the corresponding non-central distribution of the test statistic given a specified model deviation. Here, the three simulation methods differed considerably. We recommend the newly introduced technique, which keeps the row marginals at their observed values, because it has to be considered superior for theoretical reasons. Finally, we provide an easy-to-apply formula for identifying the necessary number of bootstrap samples, which allows limiting the bootstrap-related error to a freely definable degree. Especially in studies involving a large set of items or a large sample, this formula will prove useful for performing the bootstrap in a reasonable amount of time.

References

Alexandrowicz, RW: Statistical and practical significance of the likelihood ratio test of the linear logistic test model versus the Rasch model. Educ. Res. Eval. 17, 335–350 (2011).

Draxler, C, Alexandrowicz, RW: Sample size determination within the scope of conditional maximum likelihood estimation with special focus on testing the Rasch model. Psychometrika. 80, 817–919 (2015).

Formann, AK: A note on the computation of the second order derivatives of the elementary symmetric functions in the Rasch model. Psychometrika. 51, 335–339 (1986).

Glas, CAW, Verhelst, ND: Testing the Rasch Model. In: Fischer, GH, Molenaar, IW (eds.), pp. 69–95. Springer, NY (1995).

Gustafsson, J-E: A solution of the conditional estimation problem for long tests in the Rasch model for dichotomous items. Educ. Psychol. Meas. 40, 377–385 (1980).

Hoijtink, H, Boomsma, A: On Person Parameter Estimation in the Dichotomous Rasch Model. In: Fischer, GH, Molenaar, IW (eds.), pp. 53–68. Springer, NY (1995).

Kreiner, S, Christensen, KB: Overall Tests of the Rasch Model. In: Christensen, KB, Kreiner, S, Mesbah, M (eds.), pp. 105–109. Wiley, Hoboken, NJ (2013).

Masters, GN: A Rasch model for partial credit scoring. Psychometrika. 47, 149–174 (1982).

Rasch, G: An Individualistic Approach to Item Analysis. In: Lazarsfeld, PF, Henry, NW (eds.), pp. 89–107. The M.I.T. Press, Cambridge, MA (1966).

Rost, J: The Growing Family of Rasch Models. In: Boomsma, A, van Duijn, MAJ, Snijders, TAB (eds.), pp. 25–42. Springer, NY (2001).

R Core Team: R: A Language and Environment for Statistical Computing [Computer software]. R Foundation for Statistical Computing, Vienna, Austria (2015). Retrieved from http://www.R-project.org/ (Accessed 16 Jan 2016).

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Alexandrowicz, R.W., Draxler, C. Testing the Rasch model with the conditional likelihood ratio test: sample size requirements and bootstrap algorithms. J Stat Distrib App 3, 2 (2015). https://doi.org/10.1186/s40488-016-0039-y