Coster et al. tackle a critical issue in patient reported outcomes measurement: differential item functioning (DIF). DIF refers to the possibility that two people with identical health may nevertheless respond differently to questions about their health as a function of another variable (e.g. diagnostic group).2 DIF can obscure differences and similarities, decrease reliability and validity, and render group comparisons invalid.3 Unfortunately, identifying, evaluating, and coping with DIF is not straightforward.2 This is partly because patient reported outcomes measure latent variables, which can only be observed indirectly.2 As a result, DIF techniques, like most statistical techniques, must make assumptions. For example, many DIF detection methods, including the one in Coster et al., assume a normally distributed latent variable.4 Because the latent variable's distribution can only be inferred, failure to account for non-normality can lead to misidentifying DIF.4

Difficulties also arise because patient reported outcomes are used in a variety of ways. For example, the measures evaluated in Coster et al. were generally developed for research.5 Research often addresses cross-group mean differences, so many studies evaluate DIF's impact on means. However, even if DIF does not influence means, it can affect the range and distribution of scores and their relationships to other variables. A further concern is that most DIF methods compare only two groups, which can lead to spurious conclusions because other sources of DIF are excluded.3 Researchers often overlook methods that can examine multiple sources of DIF simultaneously.

Once identified, handling DIF is problematic.2 This occurs partly because the cause of DIF is rarely clear.3 More complex DIF detection methods can partly address this problem.3 But without explicit a priori theories, uncovering potential sources of DIF will require post hoc qualitative work with members of the group of interest.
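The two-group DIF detection problem described above can be illustrated with a minimal sketch. The Python simulation below (with hypothetical item difficulties, sample sizes, and DIF magnitude) generates dichotomous responses for two groups with identical latent trait distributions, plants uniform DIF in one item, and flags it with a Mantel-Haenszel test stratified on an anchor score — one of the classic observed-score DIF methods, not the specific method used by Coster et al.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_responses(n_per_group=2000, dif_shift=1.0):
    """Simulate dichotomous item responses for two groups whose latent
    trait distributions are identical (theta ~ N(0,1)); the studied item
    is `dif_shift` logits harder for group 1 (uniform DIF)."""
    theta = rng.normal(0, 1, 2 * n_per_group)
    group = np.repeat([0, 1], n_per_group)
    # Five DIF-free anchor items with difficulties spread over the trait range
    anchors = np.column_stack([
        rng.random(2 * n_per_group) < 1 / (1 + np.exp(-(theta - b)))
        for b in (-1.0, -0.5, 0.0, 0.5, 1.0)
    ]).astype(int)
    def item(shift_for_group1):
        b = 0.0 + shift_for_group1 * group
        return (rng.random(2 * n_per_group) < 1 / (1 + np.exp(-(theta - b)))).astype(int)
    return group, anchors.sum(axis=1), item(dif_shift), item(0.0)

def mantel_haenszel_chi2(group, strata, response):
    """Continuity-corrected Mantel-Haenszel chi-square for DIF,
    stratifying respondents on their anchor-item score."""
    num, var = 0.0, 0.0
    for s in np.unique(strata):
        m = strata == s
        a = np.sum((group[m] == 0) & (response[m] == 1))  # reference group correct
        r1, r2 = np.sum(group[m] == 0), np.sum(group[m] == 1)
        c1, c2 = np.sum(response[m] == 1), np.sum(response[m] == 0)
        n = r1 + r2
        if n < 2 or min(r1, r2, c1, c2) == 0:
            continue  # stratum carries no information
        num += a - r1 * c1 / n                          # observed minus expected
        var += r1 * r2 * c1 * c2 / (n**2 * (n - 1))     # hypergeometric variance
    chi2 = (abs(num) - 0.5) ** 2 / var
    return chi2, stats.chi2.sf(chi2, df=1)

group, anchor_score, dif_item, clean_item = simulate_responses()
chi2_dif, p_dif = mantel_haenszel_chi2(group, anchor_score, dif_item)
chi2_clean, p_clean = mantel_haenszel_chi2(group, anchor_score, clean_item)
print(f"DIF item:   chi2 = {chi2_dif:6.1f}, p = {p_dif:.3g}")
print(f"clean item: chi2 = {chi2_clean:6.1f}, p = {p_clean:.3g}")
```

Because the two groups have the same trait distribution by construction, any large chi-square here reflects DIF rather than a true group difference — precisely the distinction the test is designed to isolate.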
Yet the very small number of individuals that qualitative research can realistically include limits the generalizability of this type of work. In addition, understanding DIF's source(s) will not necessarily solve the problem, because understanding alone does not eliminate DIF. One could account for DIF with a statistical approach that assigns different item parameters across groups for the same items, but to non-statistical users this may lack transparency and feel as though a score is being adjusted 'simply' because someone belongs to a group. Alternatively, one could pursue group-specific forms; but, despite statistical reassurances, users may have difficulty accepting score comparisons across two groups completing different forms. As another alternative, one might eliminate items with DIF entirely. However, this may cause problems if one drops items with theoretically or clinically important content. Understanding the cause of DIF may provide some guidance as to which of these approaches to take.

Despite these challenges, Coster et al. have taken a critical step forward in evaluating whether DIF on the Patient Reported Outcomes Measurement Information System (PROMIS) pediatric short forms appears to affect scores estimated for children with cerebral palsy (CP). Their work suggests it does. Recognizing that the DIF identification method can influence the findings, future research should build on this important work and use an approach that allows a non-normal latent variable distribution among children with CP. Also, given that the use of patient reported outcomes in clinical practice has increased dramatically,5 future research should address item drift. Item drift refers to DIF as a function of time.2 As individuals experience symptoms, they may begin to answer questions about those symptoms differently, even if the symptoms themselves do not change. Clinicians need to feel confident that measured change reflects real change, not item drift.
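The trade-off raised earlier — ignoring DIF versus scoring with group-specific item parameters — can be made concrete with a small sketch. Assuming a 1PL (Rasch) model and hypothetical item difficulties, the code below scores the same response pattern twice: once with common parameters, and once with a group-specific difficulty for the DIF item.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_theta(responses, difficulties):
    """Maximum-likelihood trait estimate under a 1PL (Rasch) model."""
    def neg_log_lik(theta):
        p = 1 / (1 + np.exp(-(theta - difficulties)))
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    return minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded").x

# Hypothetical difficulties for five items; the last item shows uniform DIF:
# it is one logit harder for members of the focal group.
common_b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
focal_b = common_b + np.array([0, 0, 0, 0, 1.0])

responses = np.array([1, 1, 1, 0, 0])  # one focal-group respondent's pattern

naive = estimate_theta(responses, common_b)    # DIF ignored
adjusted = estimate_theta(responses, focal_b)  # group-specific parameter used
print(f"theta ignoring DIF:  {naive:+.2f}")
print(f"theta adjusting DIF: {adjusted:+.2f}")
```

The two estimates differ even though the responses are identical, which is exactly why non-statistical users may perceive the adjustment as scoring someone differently 'simply' because of group membership.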
In addition, in our work we have become increasingly concerned about DIF across contexts. Children (and parents) may answer questions differently if they answer at home versus at the clinic or in a research setting. Future work should address this concern. Perhaps most importantly, other researchers using these and other patient reported outcomes in this and other populations should evaluate DIF, because it can dramatically influence the veracity of one’s conclusions.
Source: Psychological Test and Assessment Modeling, 58(2), 371-402.
Author: Jones, R. N., Tommet, D., Ramirez, M., Jensen, R. E., & Teresi, J. A. (2016).
http://onlinelibrary.wiley.com/doi/10.1111/dmcn.13165/full