Psychometric Evaluation of a Social-Emotional Screening Questionnaire for 5-Year-Olds in Taiwan
The Ages & Stages Questionnaires: Social-Emotional, Second Edition (ASQ:SE-2) is a caregiver-completed screening tool for children aged 1 to 72 months. The measure has been adapted for use in Taiwan. This study examined the psychometric properties of the Traditional Chinese version of the 60-month interval (ASQ:SE-2-TC) using item response theory (IRT) and gender differences using differential item functioning analyses. A sample of 702 children, aged 54 months 0 days to 72 months 30 days, was collected to proportionally reflect the population distribution across regions in Taiwan. Results indicated that (a) item fit statistics ranged from 0.84 to 1.34 (M = 1.01, SD = 0.13), (b) item difficulty ranged from −0.41 to 3.75 (M = 2.00, SD = 0.74), (c) reliability (EAP/PV) was 0.80, and (d) all items demonstrated negligible differential item functioning by gender. One misfitting item was further examined using item characteristic curves to evaluate its performance across the latent trait continuum of social-emotional competence. Findings provide psychometric support for the use of the ASQ:SE-2-TC with Taiwanese children. The promising evidence suggests the importance of continued validation efforts to further establish its utility in developmental screening.
- Research Article
30
- 10.1086/691525
- Jun 1, 2017
- Journal of the Society for Social Work and Research
Objective: This article introduces the reader to item response theory (IRT), differential item functioning (DIF), and differential test functioning (DTF), and it demonstrates how a DIF and DTF analysis can be conducted using IRT analysis methods. DIF concerns the possibility that items on a scale work differently for different groups, such as females and males. DIF can accumulate across items on a scale to create DTF, which means the overall scale will have different levels of validity for the different groups. IRT is useful for conceptualizing DIF and DTF, and for conducting DIF analyses. Method: A DIF analysis of scores on Hudson's Multi-Problem Screening Inventory depression subscale illustrates the concepts of DIF and DTF and demonstrates an IRT analytical approach to detect possible DIF and DTF. Results: DIF was found in 3 of the items on the Multi-Problem Screening Inventory depression subscale, and evidence of possible DTF also was found. The magnitude of the DIF in one item was very large, but the potential DTF was small in magnitude. Conclusions: IRT is useful for conceptualizing DIF and DTF, and it has advantages for conducting DIF and DTF studies. Disadvantages include the need for large sample sizes. The risk of flawed inferences as a consequence of DIF and/or DTF in social work research is discussed.
- Research Article
4
- 10.5897/err2015.2284
- Jun 10, 2015
- Educational Research and Reviews
The purpose of this study is to carry out differential item functioning (DIF) analysis for content areas of a reading comprehension subtest using four area indices within the Item Response Theory (IRT) framework. The differences in the magnitudes of the area indices were compared based on the subject areas. The DIF analysis was carried out across gender groups only. The item-level data of the English reading comprehension subtest were gathered from the English Language Achievement exam administered at the School of Foreign Languages, Ege University, Turkey, in 2013. A sample of 2,117 examinees (1,011 males and 1,116 females) was randomly selected. For the DIF analysis, (a) an IRT model for the item characteristic curves was specified; (b) model-data fit was investigated for the selected IRT model; (c) item characteristic curves were separately computed for each group on a common scale; and finally, (d) indices indicating the degree of DIF on each item were computed. The results of the study indicated that both unweighted and weighted area indices showed non-uniform DIF in the item characteristic curves of the reading comprehension subtest in most cases. A significant correlation was observed between the unweighted and weighted area indices. Key words: Item characteristic curve, item bias, differential item functioning (DIF).
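Steps (c) and (d) above can be illustrated with a minimal sketch, assuming a two-parameter logistic item characteristic curve and simple midpoint integration over a fixed ability range. The abstract does not state which IRT model or which four area indices were used, so the function names `icc_2pl` and `area_indices`, the integration range, and the parameter values are all hypothetical. Only unweighted signed and unsigned areas are shown.

```python
import math

def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def area_indices(ref, focal, lo=-6.0, hi=6.0, steps=2400):
    """Unweighted signed and unsigned area between two groups' ICCs.

    ref and focal are (a, b) tuples already placed on a common scale.
    The integral is approximated with the midpoint rule on [lo, hi].
    """
    width = (hi - lo) / steps
    signed = 0.0
    unsigned = 0.0
    for i in range(steps):
        theta = lo + (i + 0.5) * width
        diff = icc_2pl(theta, *ref) - icc_2pl(theta, *focal)
        signed += diff * width
        unsigned += abs(diff) * width
    return signed, unsigned

# Uniform DIF: equal slopes, item is harder for the focal group.
uniform = area_indices(ref=(1.2, 0.0), focal=(1.2, 0.5))
# Non-uniform DIF: equal difficulty, different slopes, so the curves cross.
crossing = area_indices(ref=(0.8, 0.0), focal=(1.6, 0.0))
```

When the curves cross, the signed area can cancel to nearly zero while the unsigned area stays clearly positive, which is one reason a study might report both kinds of index.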
- Research Article
10
- 10.1080/15305058.2014.922567
- Oct 2, 2014
- International Journal of Testing
Differential item functioning (DIF) analysis is important in terms of test fairness. While DIF analyses have mainly been conducted with manifest grouping variables, such as gender or race/ethnicity, it has been recently claimed that not only the grouping variables but also contextual variables pertaining to examinees should be considered in DIF analyses. This study adopted propensity scores to incorporate the contextual variables into the gender DIF analysis, using them to control for contextual variables that potentially affect gender DIF. Subsequent DIF analyses with the Mantel-Haenszel (MH) procedure and the Logistic Regression (LR) model were run on the reference (males) and focal (females) groups formed through propensity score matching. The propensity-score-embedded MH and LR models detected fewer gender DIF items than the conventional MH and LR models. The propensity-score-embedded models, as a confirmatory approach in DIF analysis, could contribute to hypothesizing an inference on the potential cause of DIF. Also, salient advantages of propensity-score-embedded DIF analysis models are discussed.
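For readers unfamiliar with the conventional MH procedure mentioned above, the following is a minimal sketch of the Mantel-Haenszel common odds ratio for a single dichotomous item, with examinees stratified on a matching score. It does not include the propensity score matching step that the study adds. The function name `mantel_haenszel` and the record layout are hypothetical, and the conversion to the ETS delta scale (multiplying the log odds ratio by −2.35) is a common convention that the abstract itself does not mention.

```python
import math
from collections import defaultdict

def mantel_haenszel(records):
    """Mantel-Haenszel common odds ratio for one dichotomous item.

    records: iterable of (group, score, correct) where group is "ref" or
    "focal", score is the matching variable, and correct is 0 or 1.
    Returns (alpha_mh, delta_mh). Raises ZeroDivisionError if no stratum
    contains both a reference-incorrect and a focal-correct response.
    """
    # Per stratum: [ref correct, ref incorrect, focal correct, focal incorrect]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in records:
        offset = 0 if group == "ref" else 2
        tables[score][offset + (0 if correct else 1)] += 1

    num = 0.0
    den = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    alpha = num / den
    return alpha, -2.35 * math.log(alpha)

# Toy data: one score stratum in which the item favours the reference group.
toy = ([("ref", 1, 1)] * 40 + [("ref", 1, 0)] * 10
       + [("focal", 1, 1)] * 20 + [("focal", 1, 0)] * 20)
alpha_mh, delta_mh = mantel_haenszel(toy)  # alpha near 4, delta negative
```

An odds ratio above 1 (negative delta) indicates that, among examinees with the same matching score, the reference group is more likely to answer the item correctly.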
- Research Article
9
- 10.1177/02655322241290188
- Nov 28, 2024
- Language Testing
The growing diversity among test takers in second or foreign language (L2) assessments makes the importance of fairness front and center. This systematic review aimed to examine how fairness in L2 assessments was evaluated through differential item functioning (DIF) analysis. A total of 83 articles from 27 journals were included in a systematic review. The findings suggested that classical DIF techniques were dominant in use, particularly Rasch-based methods, the Mantel–Haenszel procedure, item response theory (IRT) approaches, logistic regression, and SIBTEST, but emerging methods such as DIF analysis based on cognitive diagnostic models were also identified. Most DIF studies examined manifest grouping variables such as gender and language background and were based on assessments of receptive language skills such as reading and listening comprehension. DIF analyses were mostly conducted in an exploratory fashion and causes of DIF were often justified on speculative rather than empirical grounds. In addition, the quality of DIF analyses was undermined by suboptimal reporting practices. Our results suggest the need to improve current DIF practices, to consider alternative DIF detection methods aligning with emerging views of measurement bias, and to adequately account for the heterogeneity of L2 test takers. The findings have implications for test design and use, fairness, and validity in L2 assessments.
- Book Chapter
4
- 10.1002/9781118411360.wbcla043
- Nov 11, 2013
Test fairness is a critical issue for any intended test use. When a test is fair, none of the examinee groups should be favored or disadvantaged. Differential item functioning (DIF) or differential testlet functioning (DTF) across subgroups of examinee population may not indicate bias, but should flag up the need for further scrutiny of the item content for any potential item bias. This chapter presents the statistical methods for DIF and DTF analysis as a tool for identifying potentially biased items in language assessment. The focus is on introducing the idea of latent differential functioning analyses in language assessments. The chapter starts with a review of the conventional DIF and DTF analysis methods based on manifest grouping variables. The necessity of latent differential functioning analysis, which borrows strength from both item response theory (IRT) and latent class analysis, is elaborated. An empirical example is used to illustrate different approaches to differential functioning analyses. The chapter ends with a discussion related to challenges and future research for DIF and DTF analyses in language assessments.
- Research Article
38
- 10.1007/s11136-009-9453-7
- Feb 27, 2009
- Quality of Life Research
Differential item functioning (DIF) analyses can be used to explore translation, cultural, gender or other differences in the performance of quality of life (QoL) instruments. These analyses are commonly performed using "baseline" or pretreatment data. We previously reported DIF analyses to examine the pattern of item responses for translations of the European Organisation for Research and Treatment of Cancer (EORTC) QLQ-C30 QoL instrument, using only data collected prior to cancer treatment. We now compare the consistency of these results with similar analyses of on-treatment and off-treatment assessments and explore whether item relationships differ from those at baseline. Logistic regression DIF analyses were used to examine the translation of each item in each multi-item scale at the three time points, after controlling for the overall scale score and other covariates. The consistency of results at the three time points was explored. For most EORTC QLQ-C30 subscales, the DIF results were very consistent across the three time points. Results for the Nausea and Vomiting scale varied the most across assessments. The results indicated that DIF analyses were stable across each time point and that the same DIF effects were usually found regardless of the treatment status of the respondent.
- Research Article
8
- 10.1177/0269216314545802
- Aug 26, 2014
- Palliative Medicine
Background: The Family Satisfaction with End-of-Life Care is an internationally used measure of satisfaction with cancer care. However, the Family Satisfaction with End-of-Life Care has not been studied for equivalence of item endorsement across different socio-demographic groups using differential item functioning. Aims: The aims of this secondary data analysis were (1) to examine potential differential item functioning in the family satisfaction item set with respect to type of caregiver, race, and patient age, gender, and education and (2) to provide parameters and documentation of differential item functioning for an item bank. Design: A mixed qualitative and quantitative analysis was conducted. A priori hypotheses regarding potential group differences in item response were established. Item response theory and Wald tests were used for the analyses of differential item functioning, accompanied by magnitude and impact measures. Results: Very little significant differential item functioning was observed for patient’s age and gender. For race, 13 items showed differential item functioning after multiple comparison adjustment, 10 with non-uniform differential item functioning. No items evidenced differential item functioning of high magnitude, and the impact was negligible. For education, 5 items evidenced uniform differential item functioning after adjustment, none of high magnitude. Differential item functioning impact was trivial. One item evidenced differential item functioning for the caregiver relationship variable. Conclusion: Differential item functioning was observed primarily for race and education. No differential item functioning of high magnitude was observed for any item, and the overall impact of differential item functioning was negligible. 
One item, satisfaction with “the patient’s pain relief,” might be singled out for further study, given that this item was both hypothesized and observed to show differential item functioning for race and education.
- Research Article
- 10.21985/n2dq40
- Apr 10, 2018
Psychometric analyses can help illuminate how people approach clinical tests. To understand how health literacy (i.e., literacy for health information) might lead to biased assessment of emotional distress, we examined the psychometric properties of anxiety and depression questionnaires, using differential item functioning (DIF) analysis. Items were flagged for DIF if item response theory parameters were different across health literacy groups. All items flagged for DIF had lower item-slopes for people with limited health literacy. This suggests that these items were less precise assessments. DIF analyses can identify items that are potentially problematic for people with limited health literacy (e.g., the item is too confusing). Design of questionnaires should incorporate psychometric methods (e.g., DIF analysis) to identify and reduce measurement bias.
- Research Article
54
- 10.1177/014662169401800203
- Jun 1, 1994
- Applied Psychological Measurement
Simulated data were used to investigate the performance of modified versions of the Mantel-Haenszel method of differential item functioning (DIF) analysis in computerized adaptive tests (CATs). Each simulated examinee received 25 items from a 75-item pool. A three-parameter logistic item response theory (IRT) model was assumed, and examinees were matched on expected true scores based on their CAT responses and estimated item parameters. The CAT-based DIF statistics were found to be highly correlated with DIF statistics based on nonadaptive administration of all 75 pool items and with the true magnitudes of DIF in the simulation. Average DIF statistics and average standard errors also were examined for items with various characteristics. Finally, a study was conducted of the accuracy with which the modified Mantel-Haenszel procedure could identify CAT items with substantial DIF using a classification system now implemented by some testing programs. These additional analyses provided further evidence that the CAT-based DIF procedures performed well. More generally, the results supported the use of IRT-based matching variables in DIF analysis. Index terms: adaptive testing, computerized adaptive testing, differential item functioning, item bias, item response theory.
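The matching variable described above, an expected true score computed from an ability estimate and estimated item parameters, can be sketched as follows. This is only an illustration under stated assumptions: the function names `p_3pl` and `expected_true_score`, the scaling constant of 1.7, and the toy item pool are hypothetical and are not taken from the study.

```python
import math

def p_3pl(theta, a, b, c, scale=1.7):
    """Three-parameter logistic probability of a correct response.

    a: discrimination, b: difficulty, c: lower asymptote (guessing).
    scale=1.7 is an assumed choice of the usual normal-ogive constant.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))

def expected_true_score(theta, pool):
    """Sum of 3PL probabilities over every (a, b, c) item in the pool."""
    return sum(p_3pl(theta, a, b, c) for a, b, c in pool)

# Hypothetical three-item pool; a real pool in the study had 75 items.
pool = [(1.0, -0.5, 0.2), (1.3, 0.0, 0.15), (0.8, 1.0, 0.25)]
score = expected_true_score(0.3, pool)
```

Because the sum runs over the whole pool rather than only the items an examinee was administered, examinees who took different adaptive tests are placed on the same matching scale.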
- Research Article
8
- 10.1007/s11336-024-09948-7
- Feb 21, 2024
- Psychometrika
Ensuring fairness in instruments like survey questionnaires or educational tests is crucial. One way to address this is by a Differential Item Functioning (DIF) analysis, which examines if different subgroups respond differently to a particular item, controlling for their overall latent construct level. DIF analysis is typically conducted to assess measurement invariance at the item level. Traditional DIF analysis methods require knowing the comparison groups (reference and focal groups) and anchor items (a subset of DIF-free items). Such prior knowledge may not always be available, and psychometric methods have been proposed for DIF analysis when one piece of information is unknown. More specifically, when the comparison groups are unknown while anchor items are known, latent DIF analysis methods have been proposed that estimate the unknown groups by latent classes. When anchor items are unknown while comparison groups are known, methods have also been proposed, typically under a sparsity assumption – the number of DIF items is not too large. However, DIF analysis when both pieces of information are unknown has not received much attention. This paper proposes a general statistical framework under this setting. In the proposed framework, we model the unknown groups by latent classes and introduce item-specific DIF parameters to capture the DIF effects. Assuming the number of DIF items is relatively small, an $L_1$-regularised estimator is proposed to simultaneously identify the latent classes and the DIF items. A computationally efficient Expectation-Maximisation (EM) algorithm is developed to solve the non-smooth optimisation problem for the regularised estimator. 
The performance of the proposed method is evaluated by simulation studies and an application to item response data from a real-world educational test.
- Research Article
140
- 10.1186/1477-7525-8-81
- Aug 4, 2010
- Health and Quality of Life Outcomes
Background: Differential item functioning (DIF) methods can be used to determine whether different subgroups respond differently to particular items within a health-related quality of life (HRQoL) subscale, after allowing for overall subgroup differences in that scale. This article reviews issues that arise when testing for DIF in HRQoL instruments. We focus on logistic regression methods, which are often used because of their efficiency, simplicity and ease of application. Methods: A review of logistic regression DIF analyses in HRQoL was undertaken. Methodological articles from other fields and using other DIF methods were also included if considered relevant. Results: There are many competing approaches for the conduct of DIF analyses and many criteria for determining what constitutes significant DIF. DIF in short scales, as commonly found in HRQL instruments, may be more difficult to interpret. Qualitative methods may aid interpretation of such DIF analyses. Conclusions: A number of methodological choices must be made when applying logistic regression for DIF analyses, and many of these affect the results. We provide recommendations based on reviewing the current evidence. Although the focus is on logistic regression, many of our results should be applicable to DIF analyses in general. There is a need for more empirical and theoretical work in this area.
- Preprint Article
1
- 10.21203/rs.3.rs-6180414/v1
- Apr 24, 2025
- Research Square
Standardized assessments are widely used to measure student achievement; however, they often fail to account for cultural and religious influences that may affect item functioning. This study investigated the extent to which cultural and religious factors influenced test performance in Ghana and Botswana using Item Response Theory (IRT) and Differential Item Functioning (DIF) analysis. A mixed-methods approach was employed, combining quantitative DIF analysis of standardized test data from 1,200 students (600 from Ghana and 600 from Botswana) with qualitative insights from 30 student interviews and 6 focus group discussions. Findings revealed that 27% of reading comprehension items and 21% of social studies items exhibited significant DIF (p < 0.01), favoring students whose cultural and religious backgrounds aligned with the content of test items. Mathematics items displayed fewer instances of DIF (9%), but word problems that referenced religious or cultural practices led to performance disparities. In particular, test items referencing Christian parables had a large DIF effect (effect size > 0.64), disadvantaging Muslim and secular students, while questions on chieftaincy and traditional leadership exhibited moderate DIF (effect size 0.45–0.55), affecting urban students unfamiliar with these concepts. Qualitative data reinforced these findings, as students expressed that familiarity with religious and cultural references helped them engage with test items more effectively. Some Muslim students found Christian-based passages challenging, while urban students reported difficulty understanding traditional folklore-related questions. Teachers and assessment experts raised concerns that standardized tests often reflect dominant cultural narratives, potentially disadvantaging minority groups. The study highlights the need for more inclusive assessment practices that minimize cultural bias and ensure equitable educational opportunities for all students. 
Key recommendations include implementing DIF analysis in test development, using culturally neutral content, engaging diverse stakeholders in assessment design, adopting alternative assessment methods, and training educators on bias in testing. These findings contribute to the broader discourse on fairness in educational assessment and offer practical strategies for improving testing practices in multi-denominational societies such as Ghana and Botswana.
- Research Article
- 10.22158/wjer.v4n1p62
- Dec 9, 2016
- World Journal of Educational Research
This study looked into differentially functioning items in a Chemistry Achievement Test. It also examined the effect of eliminating differentially functioning items on the content and concurrent validity, and internal consistency reliability of the test. Test scores of two hundred junior high school students matched on school type were subjected to Differential Item Functioning (DIF) analysis. One hundred students came from a public school, while the other 100 were private school examinees. The descriptive-comparative research design utilizing differential item functioning analysis and validity and reliability analysis was employed. The Chi-Square, Distractor Response Analysis, Logistic Regression, and the Mantel-Haenszel Statistic were the methods used in the DIF analysis. A six-point scale ranging from inadequate to adequate was used to assess the content validity of the test. Pearson r was used in the concurrent validity analysis. The KR-20 formula was used for estimating the internal consistency reliability of the test. The findings revealed the presence of differentially functioning items between the public and private school examinees. The DIF methods differed in the number of differentially functioning items identified. However, there was a high degree of correspondence between the Logistic Regression and Mantel-Haenszel Statistic. After the elimination of the differentially functioning items, the content and the concurrent validity, and the internal consistency reliability differed per DIF method used. The content validity of the test differed, ranging from slightly adequate to moderately adequate, depending on the number of items retained. The concurrent validity of the test also differed, but all coefficients were positive and indicated a moderate relationship between the examinees’ test scores and their GPA in Science III. Likewise, the internal consistency reliability of the test differed. 
The more differentially functioning items were eliminated, the lower the content and concurrent validity and the internal consistency reliability of the test became. Elimination of differentially functioning items diminishes content and concurrent validity and internal consistency reliability, but could be used as a basis for enhancing content and concurrent validity as well as internal consistency reliability by replacing the eliminated DIF items.
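The KR-20 coefficient named above can be computed from a matrix of dichotomous item scores. The sketch below is an illustration only: the function name `kr20` and the toy response matrix are hypothetical, and the use of the population variance of total scores (dividing by N rather than N − 1) is an assumption, since the abstract does not say which variance estimator was used.

```python
from statistics import pvariance

def kr20(responses):
    """Kuder-Richardson formula 20 for dichotomous (0/1) item scores.

    responses: list of examinee rows, each a list of 0/1 item scores.
    Uses the population variance of total scores (an assumed convention).
    """
    n_items = len(responses[0])
    n_people = len(responses)
    totals = [sum(row) for row in responses]
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people  # proportion correct
        pq_sum += p * (1.0 - p)
    return (n_items / (n_items - 1)) * (1.0 - pq_sum / pvariance(totals))

# Toy matrix: 4 examinees by 3 items.
toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
reliability = kr20(toy)  # 0.75 for this matrix
```

Recomputing the coefficient after dropping the columns for flagged items is one simple way to see how item elimination changes internal consistency, as the study reports.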
- Research Article
- 10.11591/ijphs.v14i3.25938
- Sep 1, 2025
- International Journal of Public Health Science (IJPHS)
Adolescent dating violence (ADV) is a global public health problem that has a serious impact on adolescents' physical, psychological, and social development. This study aimed to explore gender disparities in Indonesian adolescents' knowledge of dating violence using the Rasch model and differential item functioning (DIF) analysis. A total of 250 junior high school students in Yogyakarta, consisting of 107 males and 143 females, participated. The ADV knowledge measurement instrument consisted of 16 items previously tested for validity and reliability. Results showed that female students had a higher level of knowledge than male students, especially in identifying emotional and physical violence. The DIF analysis revealed that two items showed differences in perception based on gender, with female students focusing more on physical violence. In contrast, male students tended to view physical violence as a more common behaviour. This study highlights the importance of more inclusive and gender-sensitive educational programs to increase adolescents' knowledge of different forms of dating violence. The findings provide important insights for the development of interventions that can help prevent dating violence among adolescents.
- Research Article
5
- 10.21031/epod.1218144
- Mar 25, 2023
- Eğitimde ve Psikolojide Ölçme ve Değerlendirme Dergisi
This study aims to compare the Wald test and likelihood ratio test (LRT) approaches with Classical Test Theory (CTT) and Item Response Theory (IRT) based differential item functioning (DIF) detection methods in the context of cognitive diagnostic models (CDMs), using the TIMSS 2011 dataset as a retrofitting study. CDMs, which have significant potential for detecting DIF and contributing to validity, can give confidence provided that the condition of a strong methodological background is met. Therefore, it is hoped that this study will contribute to the literature to ensure the correct usage of CDMs and evaluate the compatibility of these new approaches with traditional methods. According to the analysis results, thirty-one items showed differences between the cognitive diagnosis assessments and the traditional methods. The item with the largest DIF was found with the Raju Unsigned Area Measures technique in IRT, whereas the item with the lowest DIF was found with the Wald test technique developed for CDMs. In general, the analyses show that methods not based on CDMs detect more items with DIF, whereas the Wald test and LRT methods based on CDMs detect fewer items with DIF. This study conducted DIF analyses to determine the test's psychometric properties within the framework of CDMs rather than the source of the bias. Researchers can take the study one step further and make more specific assessments about the items' bias regarding the test structure, test scope, and subgroups. In addition, DIF analyses in this study were carried out using only the gender variable, and researchers can use different variables to conduct studies specific to their purpose.