Item fit statistics for Rasch analysis: can we trust them?

Background: Using person estimates to calculate fit statistics can lead to problems because the person estimates are biased. Conditional response probabilities given the total person score could be used instead. Methods: Data sets are simulated which fit the Rasch model. Type I error rates are calculated and the distributions of the fit statistics are compared with the assumed normal or chi-square distribution. Parametric bootstrap is used to further study the distributions of the fit statistics. Results: Type I error rates for unconditional chi-square statistics are larger than expected even for moderate sample sizes. The conditional chi-square statistics maintain the significance level. Unconditional outfit and infit statistics have asymmetric distributions with means slightly below 1. Conditional outfit and infit statistics have reduced Type I error rates. Conclusions: Conditional residuals should be used. If only unconditional residuals are available, parametric bootstrapping is recommended to calculate valid p-values. Bootstrapping is also necessary for conditional outfit statistics. For conditional infit statistics the adjusted rule-of-thumb critical values look useful.


Introduction
Rasch models are increasingly used for the examination and development of measurement instruments in the health and psychological sciences (Belvedere and de Morton 2010; Bond and Fox 2015). They facilitate the detection of measurement problems like item bias or local dependence that may be overlooked by traditional validation methods such as factor analysis and Cronbach's alpha coefficient. If the data from a questionnaire fit the model expectations, a transformation of the ordinal score into an interval-level variable is available. To achieve all this, rigorous tests are essential, because the Rasch model makes some strong assumptions about the item response process.
To assess whether individual items fit the Rasch model, fit statistics are widely used. Software like Winsteps (Linacre 2019), DIGRAM (Kreiner and Nielsen 2013) or the R package eRm (Mair et al. 2019) calculates infit and outfit mean squares. RUMM2030 (Andrich et al. 2010) uses item and total item-trait interaction chi-square statistics. To detect misfitting items, the fit statistics are either compared to rule-of-thumb critical values or transformed to test statistics which can be compared with the values of the purported distribution. A few questions arise immediately: what are suitable critical values for sound decisions, are the distributional assumptions justified, and what happens if the sample size increases? There are many different guidelines for acceptable ranges for mean squares and different recommendations as to which approach should be chosen when the sample size is large.
The behaviour of chi-square statistics has not been widely tested. Hagell and Westergren (2016) have shown that Type I errors increase for n ≥ 500. They studied situations with 25 items and sample sizes up to 2500, but conducted only one simulation for each situation. Other approaches to deal with sample size issues are drawing smaller random samples from a large sample or applying an algebraic adjustment of the sample size before calculating p-values. These two procedures have been implemented in RUMM2030. Bergh (2015) compared the two approaches with each other. He found that for original sample sizes up to 21 000 and adjustments to sample sizes of 5000 both procedures work equally well. For adjustments to smaller sizes, the algebraic adjustment approach appeared less effective than random samples.
Simulation studies for outfit and infit statistics have shown several weaknesses: means are not equal to the expected value of 1, distributions are asymmetric with extreme values occurring more often above 1, and simple rule-of-thumb critical values for acceptable fit may be inappropriate (Smith 1991; Smith et al. 1998; Wang and Chen 2005). Wolfe (2013) examined the distributions of outfit and infit statistics under a limited number of conditions, and based on these results recommended bootstrapping to get adequate critical values.
There are two main problems when calculating and using fit statistics: the estimation of the residuals and the distribution of the fit statistics. All item fit statistics summarize standardized residuals which are based on estimates of response probabilities. Item and person parameter estimates are usually utilized here. Item estimates are consistent but person estimates are not. The latter are biased and the bias does not disappear with increased sample size. Due to the correspondence between person estimates and scores, estimates of conditional response probabilities given the total person score could be used to virtually eliminate the bias (Kreiner and Christensen 2011).
Because the exact distributions of the fit statistics are unknown for unconditional and conditional estimates, asymptotic distributions are used. It is unclear how reliable these approximations are. In this paper, we compare chi-square as well as outfit and infit statistics based on unconditional and conditional estimation procedures. Furthermore, bootstrap simulations are used to understand the distributions of these statistics.

Background
Müller Journal of Statistical Distributions and Applications (2020) 7:5 Page 3 of 12

The Rasch model for dichotomous items (Rasch 1960) assumes that the response of a person to an item is stochastically independent of all other item responses for the same and other persons, and that the probability of a positive response to an item is equal to

$$P(X_{vi} = 1) = \frac{\exp(\theta_v - \beta_i)}{1 + \exp(\theta_v - \beta_i)} \tag{1}$$

where $\theta_v$ is a parameter characterizing the person $v$ and $\beta_i$ is an item parameter. Two different types of residuals are calculated during tests of fit of items to the Rasch model. Response residuals compare observed and expected responses for every combination of person and item. These residuals are given by

$$Z_{vi} = \frac{X_{vi} - E(X_{vi})}{\sqrt{\mathrm{Var}(X_{vi})}} \tag{2}$$

with $E(X_{vi}) = P(X_{vi} = 1) = p_{vi}$ and $\mathrm{Var}(X_{vi}) = p_{vi}(1 - p_{vi})$. Outfit and infit statistics are calculated as means of the squared residuals:

$$\mathrm{Outfit}_i = \frac{1}{n}\sum_{v=1}^{n} Z_{vi}^2, \qquad \mathrm{Infit}_i = \frac{\sum_{v=1}^{n} w_{vi} Z_{vi}^2}{\sum_{v=1}^{n} w_{vi}} \tag{3}$$

The weights $w_{vi}$ used to calculate the infit statistics are equal to $\mathrm{Var}(X_{vi})$. Two approaches are used to assess item fit: rule-of-thumb critical values for the mean squares, or formal tests in which the mean squares are divided by their standard errors and the resulting test statistics are compared with the normal distribution. The Wilson-Hilferty cube root transformation can be used to improve the approximation of a chi-square variable to the normal distribution (Wilson and Hilferty 1931).
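The outfit and infit mean squares described above (means of squared standardized residuals, with infit weights equal to Var(X_vi)) can be illustrated with a minimal Python sketch. The function names `rasch_prob` and `outfit_infit` are hypothetical and not taken from any of the packages mentioned in this paper, and the person and item parameters are assumed to be given rather than estimated.

```python
import math

def rasch_prob(theta, beta):
    """P(X = 1) under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

def outfit_infit(responses, thetas, beta):
    """Outfit and infit mean squares for a single item.

    responses: 0/1 answers of the n persons to the item
    thetas:    person parameters (assumed known here; in practice estimates)
    beta:      item parameter (assumed known here; in practice an estimate)
    """
    sq_resid = []  # squared standardized residuals
    weights = []   # infit weights, equal to Var(X_vi) = p_vi * (1 - p_vi)
    for x, theta in zip(responses, thetas):
        p = rasch_prob(theta, beta)
        var = p * (1.0 - p)
        sq_resid.append((x - p) ** 2 / var)
        weights.append(var)
    outfit = sum(sq_resid) / len(sq_resid)
    infit = sum(w * z2 for w, z2 in zip(weights, sq_resid)) / sum(weights)
    return outfit, infit
```

Because the infit weights are largest when p_vi is close to 0.5, unexpected responses from persons located far from the item influence the infit less than the outfit.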
Rule-of-thumb lower and upper limits for acceptable mean square fit values have been set by many researchers to 0.7 and 1.3. Linacre (2017) gave detailed guidance on cutoff values, suggesting values between 0.5 and 1.5 as acceptable. Adjustments have been proposed which take into account the sample size n. Smith et al. (1998) recommend critical values equal to 1 ± 6/√n for outfits and 1 ± 2/√n for infits. Unfortunately, simulation studies have shown that appropriate critical values also depend on the number of items and the difficulty of the item considered (Wang and Chen 2005).
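As a small worked illustration of the sample-size adjusted limits of Smith et al. (1998), the following sketch evaluates 1 ± 6/√n and 1 ± 2/√n for a given n. The function name `smith_critical_values` is hypothetical.

```python
import math

def smith_critical_values(n):
    """Sample-size adjusted rule-of-thumb limits for mean squares.

    Returns (lower, upper) limits: 1 +/- 6/sqrt(n) for outfit
    and 1 +/- 2/sqrt(n) for infit.
    """
    half_outfit = 6.0 / math.sqrt(n)
    half_infit = 2.0 / math.sqrt(n)
    return {
        "outfit": (1.0 - half_outfit, 1.0 + half_outfit),
        "infit": (1.0 - half_infit, 1.0 + half_infit),
    }
```

For n = 400 the outfit limits are 0.7 and 1.3, which coincides with the fixed rule-of-thumb, and the infit limits are 0.9 and 1.1. For larger n both intervals become narrower.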
A second type of residuals used are group residuals. They compare the total number of positive responses to an item in a group of persons with the expected number of positive responses in the same group. Item chi-square fit statistics are calculated as the sum of squared group residuals, where persons are grouped into class intervals $g$ depending on their scores:

$$\chi_i^2 = \sum_{g} \frac{\left(\sum_{v \in g} X_{vi} - \sum_{v \in g} E(X_{vi})\right)^2}{\sum_{v \in g} \mathrm{Var}(X_{vi})} \tag{4}$$

The total "item-trait interaction" chi-square test statistic is the sum of the item chi-squares:

$$\chi^2 = \sum_{i=1}^{k} \chi_i^2 \tag{5}$$

It is assumed that the test statistic defined in (4) follows a chi-square distribution with degrees of freedom $df_i$ equal to the number of class intervals minus 1, and that the total "item-trait interaction" follows a chi-square distribution with $k \cdot df_i$ degrees of freedom. This test should show whether the data fit the Rasch model for the classes along the scale.
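The item chi-square statistic can be sketched as follows. This is a minimal illustration that assumes each squared group residual is standardized by the summed variances within the class interval. The function name `item_chi_square` and the way class intervals are passed in (one label per person) are hypothetical choices.

```python
def item_chi_square(responses, probs, groups):
    """Item chi-square statistic from group residuals.

    responses: 0/1 answers of the persons to the item
    probs:     estimated response probabilities p_vi for the same persons
    groups:    class interval label of each person (based on the score)
    Returns the statistic and its assumed degrees of freedom
    (number of class intervals minus 1).
    """
    observed, expected, variance = {}, {}, {}
    for x, p, g in zip(responses, probs, groups):
        observed[g] = observed.get(g, 0.0) + x
        expected[g] = expected.get(g, 0.0) + p
        variance[g] = variance.get(g, 0.0) + p * (1.0 - p)
    chi2 = sum((observed[g] - expected[g]) ** 2 / variance[g] for g in observed)
    df = len(observed) - 1
    return chi2, df
```

The total item-trait interaction statistic would then be the sum of these values over all k items.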
The formulas for the fit statistics involve $E(X_{vi})$ and $\mathrm{Var}(X_{vi})$, which are unknown and have to be estimated. The usual way to estimate $E(X_{vi})$ is to plug in the item and person parameter estimates:

$$\hat{E}(X_{vi}) = \frac{\exp(\hat\theta_v - \hat\beta_i)}{1 + \exp(\hat\theta_v - \hat\beta_i)} \tag{6}$$

We call this the unconditional estimate, leading to unconditional fit statistics.

Kreiner and Christensen (2011) showed that using person parameter estimates for the estimation of response probabilities leads to biased residuals and therefore biased outfit statistics. It is actually not necessary to plug in person estimates; conditional estimates could be used instead. They are given by

$$P(X_{vi} = 1 \mid R_v = r) = \frac{\exp(-\hat\beta_i)\,\gamma_{r-1}(\hat\beta_{(i)})}{\gamma_r(\hat\beta)} \tag{7}$$

where $R_v$ is the total score of person $v$, $\hat\beta$ denotes the vector of item parameters, $\hat\beta_{(i)}$ denotes the vector of item parameters without $\hat\beta_i$ and $\gamma_r(\hat\beta)$ is the elementary symmetric function of order $r$ of the item parameter estimates. The fit statistics based on (7) are called conditional fit statistics. Full details of the calculations can be found in Christensen and Kreiner (2013).
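To make the conditional estimate more concrete, the following sketch computes elementary symmetric functions and the conditional probability of a positive response given the total score r, taken here as exp(-β_i) γ_{r-1}(β_(i)) / γ_r(β). The simple summation recursion and the function names are illustrative assumptions, not the implementation used in DIGRAM or any R package.

```python
import math

def elementary_symmetric(eps):
    """Elementary symmetric functions gamma_0, ..., gamma_k of the values eps.

    Uses the summation recursion: adding one value e updates
    gamma_r to gamma_r + e * gamma_{r-1}.
    """
    gamma = [1.0] + [0.0] * len(eps)
    for j, e in enumerate(eps, start=1):
        # go from high to low order so that each value enters only once
        for r in range(j, 0, -1):
            gamma[r] += e * gamma[r - 1]
    return gamma

def conditional_prob(betas, i, r):
    """P(X_vi = 1 | total score r) for item i (0-based index)."""
    if r == 0:
        return 0.0  # a score of zero implies no positive response
    eps = [math.exp(-b) for b in betas]
    gamma_all = elementary_symmetric(eps)
    gamma_rest = elementary_symmetric(eps[:i] + eps[i + 1:])
    return eps[i] * gamma_rest[r - 1] / gamma_all[r]
```

A useful check is that, for a fixed score r, the conditional probabilities sum to r over all items. No person parameter appears anywhere in the calculation.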
The widely used Rasch software Winsteps and the R package eRm calculate unconditional outfits and infits (3) based on (6). Other R packages such as mirt, LTM and irtoys use the same estimation approach. RUMM2030 estimates unconditional chi-square fit statistics (see (4) and (5)) also relying on (6). Only DIGRAM estimates conditional outfit and infit statistics utilizing (7). Most programs use chi-square or normal distributions for goodness of fit tests. Only LTM and DIGRAM allow simulations to get p-values. In this paper we want to answer the following questions: what are the consequences of using biased estimates and inadequate distributional assumptions and for which sample size do problems become serious?

Methods
Data sets were simulated to fit the Rasch model, with sample sizes between 150 and 10 000, and 10, 15 or 20 items. Person parameters came from a standard normal distribution. Item parameters were either drawn from a normal distribution or fixed equidistantly, ranging from -2 to +2 or from -2.5 to +2.5. Different conditions were studied because the range and the distribution of item parameters could have an effect on the results. Item parameters were estimated with conditional maximum likelihood; person parameters were estimated with weighted maximum likelihood. Unconditional and conditional outfit, infit and chi-square fit statistics were calculated. The distributions of the p-values were then examined and the proportions of p-values below 0.05, the Type I error rates, were compared.
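The data-generating step can be sketched in a few lines. This is only an illustration of one of the conditions described above (standard normal person parameters, equidistant item parameters); the simulations in this paper were done in R, and the function names `equidistant_items` and `simulate_rasch` are hypothetical.

```python
import math
import random

def equidistant_items(k, low=-2.0, high=2.0):
    """k item parameters spaced equally over [low, high]."""
    step = (high - low) / (k - 1)
    return [low + j * step for j in range(k)]

def simulate_rasch(n, betas, seed=None):
    """Simulate an n x k matrix of 0/1 responses that fits the Rasch model.

    Person parameters are drawn from a standard normal distribution.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        theta = rng.gauss(0.0, 1.0)
        row = []
        for beta in betas:
            p = 1.0 / (1.0 + math.exp(-(theta - beta)))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data

# Example condition: n = 150 persons, k = 10 items in [-2, +2]
betas = equidistant_items(10)
data = simulate_rasch(150, betas, seed=1)
```

Because the responses are generated from the model itself, any rejection of an item in such data is a Type I error by construction.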
Parametric bootstrap was used to study the distributions of the fit statistics for various n (sample size) and k (number of items). First, a data set was generated from a Rasch model. Item and person parameters were estimated for this data set and fit statistics and p-values were calculated. Next, bootstrap samples were generated from a Rasch model with parameters equal to the estimates calculated in the first step. Fit statistics for the samples were used to get their empirical distribution. This is called the bootstrap distribution. The critical value from this distribution gives the bootstrap p-value, which was compared with the p-value based on the normal or the chi-square distribution.
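The bootstrap steps above can be sketched for the outfit of a single item. This is a simplified illustration under several assumptions that are not part of the procedure described in this paper: the parameter estimates are passed in as given, they are not re-estimated in each bootstrap sample, the p-value is one-sided (upper tail), and a +1 correction is applied to numerator and denominator. All function names are hypothetical.

```python
import math
import random

def rasch_prob(theta, beta):
    """P(X = 1) under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

def outfit(responses, probs):
    """Mean of the squared standardized residuals for one item."""
    return sum((x - p) ** 2 / (p * (1.0 - p))
               for x, p in zip(responses, probs)) / len(responses)

def bootstrap_p_value(responses, thetas_hat, beta_hat, n_boot=500, seed=None):
    """Upper-tail parametric bootstrap p-value for the outfit of one item.

    Simplification: thetas_hat and beta_hat are treated as fixed and are
    NOT re-estimated for each bootstrap sample.
    """
    rng = random.Random(seed)
    probs = [rasch_prob(t, beta_hat) for t in thetas_hat]
    observed = outfit(responses, probs)
    exceed = 0
    for _ in range(n_boot):
        # draw new responses from the model with the estimated parameters
        sample = [1 if rng.random() < p else 0 for p in probs]
        if outfit(sample, probs) >= observed:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)
```

The bootstrap distribution replaces the assumed normal or chi-square reference distribution, so the resulting p-value does not depend on the asymptotic approximation being accurate.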
All simulations and calculations were done with the R statistical package (R Core Team 2019) and the additional R packages eRm (Mair et al. 2019) and PP (Reif and Steinfeld 2019).

Chi-square fit statistics
We start with simulations with normally distributed person parameters and fixed item parameters, equidistant in the interval [-2, +2]. Sample sizes vary between 150 and 2000, and the number of items is 10, 15 or 20. The number of class intervals is three for n ≤ 250 and seven for larger n. For each situation 1000 simulations are done. As the Rasch model is true for all the data sets, the proportions of p-values below 0.05 should be about 0.05.

Table 1 shows the proportions of p-values below 0.05 for unconditional and conditional individual item and total tests. Let us look first at the unconditional tests. For n = 200 and k = 10, we have a proportion of 0.107 of p-values below 0.05. The proportion even reaches 1 for n ≥ 1500 and k = 10. There are therefore too many significant results regarding the total test for n ≥ 200, especially if there are not many items. For n ≥ 500, the Type I error rate is also increased for single item tests. The proportions vary between 0.047 and 0.119 for n = 500 and k = 10. Hence, the chi-square statistics appear to be okay for some items, but not for all. The conditional total and single item tests maintain the significance level.

Higher proportions can be found for unconditional tests for items located in the center. Figure 1a shows that the distribution of p-values for item 1 looks more or less uniform, whereas the same distribution for item 5 is skewed (Fig. 1b). In the case of conditional tests, the distributions look uniform, as they should (Fig. 1c and d). As the Type I error rates for the unconditional total tests are more affected than the error rates for single item tests, the distributions of p-values for unconditional total tests are even more skewed. Simulations with fixed item parameters equidistant in the interval [-2.5, +2.5] or normally distributed item parameters lead to very similar results.

The distributions of the simulated unconditional test statistics coincide with the proposed chi-square distributions as long as n < 500. For larger n the discrepancy between empirical and assumed distribution becomes larger and larger. Because the empirical distribution is shifted to the right, the Type I errors increase. For n ≥ 1000 this affects items independently of their location. Figure 2 shows the unconditional (a-c) and conditional (d-f) distributions for item 5 with n = 500 and item 1 with n = 1000 and n = 2000, k = 10. Chi-square density curve and histogram agree for the conditional test statistic in all situations.

Table 2 contains unconditional and conditional chi-square fit statistics, and p-values based on the chi-square distribution as well as based on the bootstrapping procedure with n = 1000, k = 10 and fixed item parameters in the interval [-2, +2]. The two unconditional item fit statistics for item 5 and item 6 show misfit, and the total test also rejects the Rasch model if the chi-square distribution is used. The bootstrap p-values are larger and do not indicate any misfit. As for the conditional tests, the chi-square distribution and bootstrap p-values are quite similar.
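When reading proportions of p-values below 0.05 that come from 1000 simulations per situation, it helps to know how much such a proportion varies by chance alone. The following sketch uses the binomial standard error with a normal approximation; the band is an added illustration, not part of the analysis in this paper, and the function names are hypothetical.

```python
import math

def type1_rate(p_values, alpha=0.05):
    """Proportion of p-values below alpha."""
    return sum(1 for p in p_values if p < alpha) / len(p_values)

def monte_carlo_band(alpha=0.05, n_sim=1000, z=1.96):
    """Approximate 95% band for an observed rejection rate
    when the true rate is alpha and n_sim simulations are run."""
    se = math.sqrt(alpha * (1.0 - alpha) / n_sim)
    return alpha - z * se, alpha + z * se

low, high = monte_carlo_band()  # roughly 0.036 to 0.064
```

With 1000 simulations the standard error is about 0.007, so a proportion of 0.047 is compatible with a nominal level of 0.05, whereas proportions of 0.107 or 0.119 are far outside the band.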

Outfit and infit statistics
Simulations are done again with normally distributed person parameters and fixed item parameters, equidistant in the interval [-2, +2]. Sample sizes vary between 150 and 10 000, and the number of items is 10, 15 or 20. For each situation 1000 simulations are done. The Wilson-Hilferty transformation has been used for the unconditional fit statistics, but not for the conditional values because there was no apparent improvement of the approximation to the normal distribution.

Table 3 shows mean values of outfit and infit statistics, and the proportions of p-values below 0.05. The unconditional outfit and infit statistics are biased; their means are smaller than the expected mean of 1. The size of the bias depends on the number of items. Type I error rates are increased for unconditional statistics if n ≥ 500, especially if there are not many items. Mean values are okay for the conditional tests, but the error rates appear to be too small. This is particularly true for items with large or small difficulties (see Fig. 3).

Fig. 2 Distribution of the unconditional and conditional test statistic. The red curve is the proposed chi-square distribution, the red vertical line is the 95% percentile. Values above this critical value will lead to rejecting the null hypothesis.

Table 4 contains 2.5% and 97.5% percentiles for outfit and infit statistics. Ranges get more narrow for unconditional and conditional estimates, but unconditional fit statistics are not symmetric around 1. The critical values 0.5-1.5 proposed by Linacre are only valid for very small sample sizes and few items (n ≤ 150, k ≤ 10). The usual rule-of-thumb of Smith et al. (1998) fits quite well for conditional infits over the range of sample sizes considered.

Next, histograms of the standardized outfit and infit values are compared with the standard normal distribution. The deviation from the expected mean for unconditional outfit values is obvious in Fig. 4 (upper row). The Wilson-Hilferty transformation makes the approximation even worse for larger n. The conditional statistics are unbiased, but there are some large outliers and too many values in the center; the calculated standard errors seem to be too large. Standard errors are also too large for unconditional outfit statistics. This can be seen if the outfits are centered around zero. For items with small or large difficulties the situation is more extreme.
Parametric bootstrap is expected to produce smaller p-values for conditional statistics than the test based on the normal approximation. This is verified in Table 5 for the conditional outfit statistics with n = 2000 and k = 10.

Discussion
Residual-based fit statistics are widely used to assess Rasch model fit. At the same time, there are concerns about the quality of these indicators. Some argue against the use of any residual fit statistic (Karabatsos 2000); others have developed alternatives such as likelihood-based fit statistics or graphical approaches to assess item fit (Orlando and Thissen 2000; Yu 2020). There is no doubt that other approaches can give valuable information about possible problems of items. Nevertheless, this paper is focused on residual-based fit statistics because they are so popular and we would like to help to improve their usage.
Two different estimates of residuals are considered, one based on biased and inconsistent person parameter estimates, the other based on scores. For a sample size of 200 or more, the unconditional total item-trait interaction chi-square test, which uses the person parameter estimates, shows increased Type I error rates. Unconditional single item chi-square statistics become unreliable for n ≥ 500. This is in accordance with the results of Hagell and Westergren (2016), but is now supported by many more simulations. The usually assumed chi-square approximations are inadequate and the parametric bootstrap confirms these results. In the case of unconditional tests, chi-square p-values are much smaller than bootstrap p-values. This means that even for moderate sample sizes, the Rasch model is rejected too often and/or too many items falsely show misfit.
The conditional chi-square tests, which are based on scores, remain valid. The proportion of p-values below 0.05 is close to 0.05, and the chi-square distribution and the bootstrap distribution agree well for conditional tests. As long as there are no conditional fit statistics implemented in RUMM2030, users have to be careful not to misinterpret seemingly significant results unless the sample size is small.

The unconditional outfit and infit statistics have means slightly smaller than the expected value of 1, at least if the number of items is small. Type I error rates are increased for unconditional statistics if n ≥ 500. Therefore, too many items are regarded as misfitting or the Rasch model as a whole is falsely rejected. Other authors have also noticed these problems a long time ago (Smith 1991;Wang and Chen 2005).
Mean values are okay for the conditional tests, but the error rates appear to be too small. The calculated standard errors are too large and therefore the standardized values become too small. The reason is that the squared residuals used in Eq. (3) are not independent as assumed. Correlations tend to be negative, especially for items with large or small difficulties. The resulting true variance is therefore smaller than the estimated variance. The adjusted rule-of-thumb (1 ± 2/√n) appears to be reasonable if applied to conditional infit statistics, whereas the adjusted rule-of-thumb for outfit statistics (1 ± 6/√n) does not seem to be valid. As Winsteps and the R package eRm only calculate unconditional outfit and infit statistics, their results can become unreliable for sample sizes above 250. The R package iarm (Müller 2020) can be used to estimate conditional fit statistics which have correct mean values and to apply bootstrapping for the p-values. DIGRAM estimates conditional fit statistics as well. The user should also apply the recently implemented bootstrapping procedure to get p-values not relying on invalid distributional assumptions. If only infit statistics are relevant, the adjusted rule-of-thumb given by Smith et al. (1998) could be used instead.

Fig. 4 Distribution of the unconditional and conditional standardized outfit statistic for item 5. The red curve is the standard normal distribution, the red vertical lines are the 2.5% and 97.5% percentiles. Values outside these critical values will lead to rejecting the null hypothesis.
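The argument that negatively correlated squared residuals make the true variance smaller than the estimated one can be made explicit with the general variance formula for a mean. The decomposition below is a standard identity, written here for the outfit of item i as an added illustration; it is not a formula taken from the cited literature.

```latex
\mathrm{Var}\!\left(\frac{1}{n}\sum_{v=1}^{n} Z_{vi}^2\right)
  = \frac{1}{n^2}\left[\sum_{v=1}^{n}\mathrm{Var}\!\left(Z_{vi}^2\right)
  + 2\sum_{v<w}\mathrm{Cov}\!\left(Z_{vi}^2, Z_{wi}^2\right)\right]
```

A standard error that assumes independence keeps only the first sum. If the covariance terms are predominantly negative, the true variance is smaller than this value, the standardized statistics are too close to zero, and the resulting tests reject too rarely.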

Conclusions
It is time to update the Rasch software. Large chi-square fit statistics are not just a matter of large sample sizes; problems start with n as small as 200. It is therefore crucial to use conditional estimates. The chi-square distribution can then be used as an approximation for the distribution of the test statistic. As for outfit and infit statistics, standard error calculations are not reliable. Parametric bootstrap should be used to get correct p-values.