Detecting Rater Effects with Small Examinee Sample Sizes: Examining the Impacts of Rating Design and Item Sample Size on Rater Effect Indicators

Abstract

Although research on rater-mediated assessments includes considerations related to identifying rater effects in a variety of performance assessment contexts, researchers have not specifically focused on contexts with relatively small examinee sample sizes (N ≤ 100). Inspired by performance assessment contexts in which small examinee sample sizes may be common, such as childcare evaluation systems, we explored rater effect indicators in these conditions. We used a real data illustration from an assessment of home-based childcare providers to provide context for our study. Then, we used a simulation study to consider the performance of rater effect indicators in a wider range of conditions that reflect important aspects of these assessment systems. Overall, our results support the use of rater effect indicators from the Many-Facet Rasch Model to identify rater severity and rater range restriction effects in conditions with small examinee sample sizes. We discuss implications for research and practice.
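
For orientation, the Many-Facet Rasch Model named in the abstract is conventionally written as a rating-scale model with an added rater facet. The notation below is the standard Linacre-style formulation, shown as background rather than as the paper's exact specification:

```latex
% Conventional Many-Facet Rasch Model (rating-scale form, standard notation):
% the log-odds of examinee n receiving category k rather than k-1 on item i
% from rater j decompose into additive facets.
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right)
  = \theta_n - \delta_i - \lambda_j - \tau_k
% \theta_n  : latent proficiency of examinee n
% \delta_i  : difficulty of item i
% \lambda_j : severity of rater j (severity effects show up as extreme \lambda_j)
% \tau_k    : threshold between categories k-1 and k (range restriction shows up
%             as distorted thresholds and overuse of central categories)
```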

Similar Papers
  • Research Article
  • Cited by 6
  • 10.1016/j.jsp.2021.01.001
Are ratings in the eye of the beholder? A non-technical primer on many facet Rasch measurement to evaluate rater effects on teacher behavior rating scales
  • May 14, 2021
  • Journal of School Psychology
  • Kara M Styck + 4 more

  • Research Article
  • 10.1097/01720610-200804000-00006
Considerations of sample size in medical research
  • Apr 1, 2008
  • Journal of the American Academy of Physician Assistants
  • John W Waterbor + 1 more

  • Research Article
  • 10.1016/j.ajodo.2015.03.015
Inference from a sample mean--Part 1.
  • Jun 1, 2015
  • American Journal of Orthodontics and Dentofacial Orthopedics
  • Nikolaos Pandis

  • Research Article
  • Cited by 1
  • 10.1200/jco.2016.34.7_suppl.294
Assessing the impact of restricted follow-up and small sample sizes on survival estimations in prostate cancer using registry data.
  • Mar 1, 2016
  • Journal of Clinical Oncology
  • Dhvani Shah + 5 more

Background: Economic evaluations in oncology aim to assess the long-term value of new therapies based on clinical trial data that often have restricted follow-up times (< 5 years) and small sample sizes (< 500 patients). This requires extrapolation assumptions about long-term survival that go beyond the observed data. In this analysis, differences between survival extrapolation methods are tested in samples whose sizes and follow-up reflect typical clinical trials, against a background of known survival in prostate cancer from a US-based cancer registry.

Methods: Data from the National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) registry on long-term survival in patients with stage IV prostate cancer were employed. The data set comprised patients diagnosed between 1988 and 2003, with follow-up data available until 2012. Additional survival for those who received surgery (compared to those who did not) was estimated based on extrapolations using standard parametric statistical models (exponential, Weibull, log-logistic, log-normal, gamma) fitted to the observed data. Survival analyses were run for 5 sample size scenarios (n = 27,670, 1000, 500, 200, 50) and 6 follow-up scenarios (follow-up years = 25, 20, 10, 5, 2, 1), yielding 30 combination scenarios. Performance of the methods was tested relative to the maximum-follow-up, maximum-sample-size scenario (i.e., the reference case) from the SEER registry.

Results: Log-logistic and log-normal models were associated with flat tails, which led to inflated survival estimates. For scenarios with smaller sample sizes, gamma models often did not converge. Exponential models were most frequently reported as the best model fit (in approximately 50% of scenarios). Gains in overall survival (OS) were also consistent when exponential models were selected, and closely matched the gain in OS from the reference case.

Conclusions: Since clinical trials in oncology are often associated with small patient sample sizes and restricted follow-up, selecting an exponential model may lead to the most consistent and stable results based on the experiment constructed here. Further research should confirm these results for other types of cancer.
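
For readers who want to see what this kind of comparison looks like in practice, here is a minimal sketch using the open-source lifelines library. The simulated survival times, censoring horizon, and sample size are our own illustrative assumptions, not the study's SEER data or code:

```python
# Minimal sketch: fit several parametric survival models to right-censored
# data with restricted follow-up and compare them by AIC, as in the abstract.
import numpy as np
from lifelines import (ExponentialFitter, WeibullFitter,
                       LogNormalFitter, LogLogisticFitter)

rng = np.random.default_rng(0)

# Illustrative survival times (years), administratively censored at 5 years
# to mimic restricted trial follow-up (values here are assumptions).
n = 200
true_times = rng.exponential(scale=4.0, size=n)
follow_up = 5.0
durations = np.minimum(true_times, follow_up)
events = (true_times <= follow_up).astype(int)   # 1 = event observed

fitters = {
    "exponential":  ExponentialFitter(),
    "weibull":      WeibullFitter(),
    "log-normal":   LogNormalFitter(),
    "log-logistic": LogLogisticFitter(),
}

for name, f in fitters.items():
    f.fit(durations, event_observed=events)
    # Lower AIC = better fit; the fitted model is then used to extrapolate
    # survival beyond the observed follow-up window.
    print(f"{name:12s} AIC={f.AIC_:8.1f}  "
          f"median survival={f.median_survival_time_:5.2f} y")
```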

  • Research Article
  • Cited by 73
  • 10.1111/2041-210x.13270
Overcoming the challenge of small effective sample sizes in home‐range estimation
  • Aug 11, 2019
  • Methods in Ecology and Evolution
  • Christen H Fleming + 3 more

Technological advances have steadily increased the detail of animal tracking datasets, yet fundamental data limitations exist for many species that cause substantial biases in home-range estimation. Specifically, the effective sample size of a range estimate is proportional to the number of observed range crossings, not the number of sampled locations. Currently, the most accurate home-range estimators condition on an autocorrelation model, for which the standard estimation frameworks are based on likelihood functions, even though these methods are known to underestimate variance (and therefore ranging area) when effective sample sizes are small.

Residual maximum likelihood (REML) is a widely used method for reducing bias in maximum-likelihood (ML) variance estimation at small sample sizes. Unfortunately, we find that REML is too unstable for practical application to continuous-time movement models. When the effective sample size N is decreased to N ≤ O(10), which is common in tracking applications, REML undergoes a sudden divergence in variance estimation. To avoid this issue, while retaining REML's first-order bias correction, we derive a family of estimators that leverage REML to make a perturbative correction to ML. We also derive AIC values for REML and our estimators, including cases where model structures differ, which is not generally understood to be possible.

Using both simulated data and GPS data from lowland tapir (Tapirus terrestris), we show how our perturbative estimators are more accurate than traditional ML and REML methods. Specifically, when O(5) home-range crossings are observed, REML is unreliable by orders of magnitude, ML home ranges are ~30% underestimated, and our perturbative estimators yield home ranges that are only ~10% underestimated. A parametric bootstrap can then reduce the ML and perturbative home-range underestimation to ~10% and ~3%, respectively.

Home-range estimation is one of the primary reasons for collecting animal tracking data, and small effective sample sizes are a more common problem than is currently realized. The methods introduced here allow for more accurate movement-model and home-range estimation at small effective sample sizes, and thus fill an important role for animal movement analysis. Given REML's widespread use, our methods may also be useful in other contexts where effective sample sizes are small.
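
The small-sample bias that REML corrects can be seen in the simplest i.i.d. Gaussian case, where its first-order correction reduces to the familiar n-1 denominator. The toy sketch below illustrates only that baseline point, not the paper's perturbative estimators:

```python
# Toy illustration (not the paper's method): ML variance estimates are biased
# low at small sample sizes; the REML-style correction removes the first-order
# bias. For i.i.d. Gaussian data, REML reduces to dividing by n-1 instead of n.
import numpy as np

rng = np.random.default_rng(1)
n, reps, true_var = 5, 100_000, 1.0

x = rng.normal(0.0, np.sqrt(true_var), size=(reps, n))
ml_var = x.var(axis=1, ddof=0)    # ML: divide by n      -> biased low
reml_var = x.var(axis=1, ddof=1)  # REML-style: divide by n-1 -> unbiased

print(f"true variance:            {true_var:.3f}")
print(f"mean ML estimate   (n={n}): {ml_var.mean():.3f}")   # ~0.80
print(f"mean REML estimate (n={n}): {reml_var.mean():.3f}") # ~1.00
```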

  • Book Chapter
  • 10.1007/978-3-319-67988-4_25
Bootstrap Guided Information Criterion for Reliability Analysis Using Small Sample Size Information
  • Dec 6, 2017
  • Eshan Amalnerkar + 2 more

Several methods for reliability analysis have been established and applied in engineering fields, with uncertainty treated as a major contributing factor. Reliability analysis based on small sample sizes can be very beneficial when the rising uncertainty in statistics of interest, such as the mean and standard deviation, is considered. Model selection and evaluation methods such as the Akaike Information Criterion (AIC) have demonstrated efficient performance for reliability analysis. However, information criteria based on maximum likelihood can provide better model selection and evaluation in small-sample scenarios when combined with bootstrapping, a well-known resampling approach for curtailing uncertainty. Our purpose is to utilize the capabilities of bootstrap resampling in information-criterion-based reliability analysis to check for uncertainty arising from statistics of interest in small-sample problems. In this study, therefore, a unique and efficient simulation scheme is proposed that combines the best-model selection frequency derived from the information criterion with reliability analysis. It is also beneficial to compute the spread of reliability values, as opposed to solitary fixed values, for the statistics of interest under a replication-based approach. The proposed simulation scheme is verified using several mathematical examples focused on small and moderate sample sizes, with AIC-based reliability analysis for comparison and Monte Carlo simulation (MCS) for accuracy. The results show that, compared with conventional methods, the proposed scheme reduces the spread, and hence the uncertainty, in reliability analysis based on small sample sizes, whereas reliability analysis based on moderate sample sizes showed no considerable advantage.
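
A minimal sketch of the bootstrap-guided selection-frequency idea, assuming a made-up small sample and a generic candidate set via scipy; the paper's actual scheme couples this frequency with a reliability analysis step not shown here:

```python
# Sketch: bootstrap-resample a small sample, fit candidate distributions,
# and count how often each wins on AIC (the "best-model selection frequency").
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.weibull(1.5, size=20) * 10.0   # illustrative small sample (n = 20)
candidates = {"normal": stats.norm, "lognormal": stats.lognorm,
              "weibull": stats.weibull_min}

wins = {name: 0 for name in candidates}
for _ in range(500):                      # bootstrap replications
    sample = rng.choice(data, size=data.size, replace=True)
    aics = {}
    for name, dist in candidates.items():
        params = dist.fit(sample)         # free loc/scale fit, for simplicity
        loglik = dist.logpdf(sample, *params).sum()
        aics[name] = 2 * len(params) - 2 * loglik
    wins[min(aics, key=aics.get)] += 1    # lowest AIC wins this replicate

# The resulting frequencies indicate which model to carry into the
# downstream reliability analysis.
print(wins)
```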

  • Research Article
  • Cited by 12
  • 10.1016/j.asw.2017.08.004
Assessing C2 writing ability on the Certificate of English Language Proficiency: Rater and examinee age effects
  • Sep 9, 2017
  • Assessing Writing
  • Daniel R Isbell

  • Research Article
  • Cited by 5
  • 10.1037/spq0000518
Using many-facet Rasch measurement and generalizability theory to explore rater effects for Direct Behavior Rating-Multi-Item Scales.
  • Mar 1, 2023
  • School psychology (Washington, D.C.)
  • Christopher J Anthony + 3 more

Although originally conceived of as a marriage of direct behavioral observation and indirect behavior rating scales, recent research has indicated that Direct Behavior Ratings (DBRs) are affected by rater idiosyncrasies (rater effects) similar to other indirect forms of behavioral assessment. Most of this research has been conducted using generalizability theory (GT), yet another approach, many-facet Rasch measurement (MFRM), has recently been utilized to illuminate the previously opaque nature of these rater idiosyncrasies. The purpose of this study was to utilize both approaches (GT and MFRM) to consider rater effects with 126 second- through fifth-grade students who were rated on two DBR-Multi-Item Scales by four raters (22 of these ratings were fully crossed). Results indicated the presence of rater effects and revealed nuances about their nature, including showing differences across construct domains, identifying items that are potentially more susceptible to rater effects than others, and isolating specific raters who appear to have been more susceptible to rater effects than other raters. These findings further indicate the indirect nature of DBRs and offer potential avenues for addressing and ameliorating rater effects in research and practice.

  • Research Article
  • Cited by 2
  • 10.24191/mjoc.v6i1.10540
Investigating the Impact of Multicollinearity on Linear Regression Estimates.
  • Mar 9, 2021
  • Malaysian Journal of Computing
  • Kunle Bayo Adewoye + 3 more

Multicollinearity is a case of multiple regression in which the predictor variables are themselves highly correlated. The aim of the study was to investigate the impact of multicollinearity on linear regression estimates. The study was guided by the following specific objectives: (i) to examine the asymptotic properties of estimators and (ii) to compare lasso, ridge, and elastic net with ordinary least squares. The study employed Monte Carlo simulation to generate sets of highly collinear variables with induced multicollinearity at sample sizes of 25, 50, 100, 150, 200, 250, and 1000, and the data were analyzed with lasso, ridge, elastic net, and ordinary least squares using a statistical package. The findings revealed that the absolute bias of ordinary least squares was consistent at all sample sizes, as shown by past research on multicollinearity, while the lasso-type estimators fluctuated alternately. They also revealed that the mean square error of ridge regression outperformed the other estimators, with minimum variance at small sample sizes, while ordinary least squares was best at large sample sizes. The study concluded that OLS was asymptotically consistent at the sample sizes examined and that ridge regression was efficient at small and moderate sample sizes.
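
A minimal sketch of the comparison the study describes, using scikit-learn with our own illustrative data-generating assumptions (true coefficients, collinearity level, and penalty strengths are all made up for the example):

```python
# Sketch: generate highly collinear predictors at a small sample size and
# compare coefficient error of OLS against ridge, lasso, and elastic net.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

rng = np.random.default_rng(3)
n, beta = 25, np.array([2.0, -1.0, 0.5])   # small n, assumed true coefficients

x1 = rng.normal(size=n)
X = np.column_stack([x1,
                     x1 + rng.normal(scale=0.01, size=n),  # nearly collinear
                     rng.normal(size=n)])
y = X @ beta + rng.normal(scale=1.0, size=n)

models = {"OLS": LinearRegression(), "ridge": Ridge(alpha=1.0),
          "lasso": Lasso(alpha=0.1), "elastic net": ElasticNet(alpha=0.1)}

for name, m in models.items():
    m.fit(X, y)
    mse = np.mean((m.coef_ - beta) ** 2)   # error in the coefficients themselves
    print(f"{name:12s} coefficient MSE = {mse:.3f}")
```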

  • Research Article
  • Cited by 8
  • 10.1002/j.2333-8504.2013.tb02330.x
The Effects of Rater Severity and Rater Distribution on Examinees' Ability Estimation for Constructed-Response Items
  • Dec 1, 2013
  • ETS Research Report Series
  • Zhen Wang + 1 more

The current study used simulated data to investigate the properties of a newly proposed method (Yao's rater model) for modeling rater severity and its distribution under different conditions. Our study examined the effects of rater severity, distributions of rater severity, the difference between item response theory (IRT) models with rater effect and without rater effect, and the difference between the precision of the ability estimates for tests composed of only constructed-response (CR) items and for tests composed of multiple-choice (MC) and CR items combined. Our results indicate that rater severity and its distribution can increase the bias of examinees' ability estimates and lower test reliability. Moreover, using an IRT model with rater effects can substantially increase the precision in the examinees' ability estimates, especially when the test was composed of only CR items. We also compared Yao's rater model with Muraki's (1993) rater effect model in terms of ability estimation accuracy and rater parameter recovery. The estimation results from Yao's rater model using Markov chain Monte Carlo (MCMC) were better than those from Muraki's rater effect model using marginal maximum likelihood.

  • Research Article
  • Cited by 5
  • 10.1177/01466216231174566
Modeling Rating Order Effects Under Item Response Theory Models for Rater-Mediated Assessments.
  • May 13, 2023
  • Applied Psychological Measurement
  • Hung-Yu Huang

Rater effects are commonly observed in rater-mediated assessments. By using item response theory (IRT) modeling, raters can be treated as independent factors that function as instruments for measuring ratees. Most rater effects are static and can be addressed appropriately within an IRT framework, and a few models have been developed for dynamic rater effects. Operational rating projects often require human raters to continuously and repeatedly score ratees over a certain period, imposing a burden on the cognitive processing abilities and attention spans of raters that stems from judgment fatigue and thus affects the rating quality observed during the rating period. As a result, ratees' scores may be influenced by the order in which they are graded by raters in a rating sequence, and the rating order effect should be considered in new IRT models. In this study, two types of many-faceted (MF)-IRT models are developed to account for such dynamic rater effects, which assume that rater severity can drift systematically or stochastically. The results obtained from two simulation studies indicate that the parameters of the newly developed models can be estimated satisfactorily using Bayesian estimation and that disregarding the rating order effect produces biased model structure and ratee proficiency parameter estimations. A creativity assessment is outlined to demonstrate the application of the new models and to investigate the consequences of failing to detect the possible rating order effect in a real rater-mediated evaluation.
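
Schematically, the two dynamic severity mechanisms described above can be written as a rater severity that varies with rating position t. This is our own shorthand for orientation, not necessarily the paper's exact parameterization:

```latex
% Systematic drift: severity changes linearly with rating position t.
\lambda_j(t) = \lambda_j + \beta_j t
% Stochastic drift: severity follows a random walk over the rating sequence.
\lambda_j(t) = \lambda_j(t-1) + \varepsilon_{jt},
  \qquad \varepsilon_{jt} \sim N(0, \sigma_j^2)
```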

  • Research Article
  • Cited by 3
  • 10.1177/0146621618798667
Item Response Theory Modeling for Examinee-selected Items with Rater Effect.
  • Oct 8, 2018
  • Applied Psychological Measurement
  • Chen-Wei Liu + 2 more

Some large-scale testing requires examinees to select and answer a fixed number of items from given items (e.g., select one out of the three items). Usually, they are constructed-response items that are marked by human raters. In this examinee-selected item (ESI) design, some examinees may benefit more than others from choosing easier items to answer, and so the missing data induced by the design become missing not at random (MNAR). Although item response theory (IRT) models have recently been developed to account for MNAR data in the ESI design, they do not consider the rater effect; thus, their utility is seriously restricted. In this study, two methods are developed: the first one is a new IRT model to account for both MNAR data and rater severity simultaneously, and the second one adapts conditional maximum likelihood estimation and pairwise estimation methods to the ESI design with the rater effect. A series of simulations was then conducted to compare their performance with those of conventional IRT models that ignored MNAR data or rater severity. The results indicated a good parameter recovery for the new model. The conditional maximum likelihood estimation and pairwise estimation methods were applicable when the Rasch models fit the data, but the conventional IRT models yielded biased parameter estimates. An empirical example was given to illustrate these new initiatives.

  • Research Article
  • Cited by 19
  • 10.1080/2372966x.2020.1827681
Evaluating the Impact of Rater Effects on Behavior Rating Scale Score Validity and Utility
  • Jan 4, 2021
  • School Psychology Review
  • Christopher J Anthony + 4 more

Behavior rating scales represent one of the most commonly used types of assessments in school psychology. Yet, they suffer from a fundamental limitation: They are an indirect methodology influenced partially by student behavior and partially by rater perspectives. Thus, the current study utilized advanced analytic approaches to evaluate rater effects on the Academic Competence Evaluation Scales–Short Form–Teacher (ACES-SF-T) with a partially crossed sample of 132 fourth- and fifth-grade students rated by seven teachers. Results indicated that rater effects had a minimal impact on the predictive validity of ACES-SF-T scores for state achievement tests, but at the individual level, rater effects could lead to starkly different conclusions about students’ academic, social, and behavioral functioning. Implications for research and practice are discussed.

  • Discussion
  • Cited by 6
  • 10.1111/petr.12484
Longitudinal stability of medication adherence: Trying to decipher an important construct.
  • May 4, 2015
  • Pediatric Transplantation
  • Sarah R Lieber + 1 more

  • Research Article
  • Cited by 23
  • 10.1177/0013164419834613
Exploring the Combined Effects of Rater Misfit and Differential Rater Functioning in Performance Assessments.
  • Apr 2, 2019
  • Educational and Psychological Measurement
  • Stefanie A Wind + 1 more

Rater effects, or raters' tendencies to assign ratings to performances that are different from the ratings that the performances warranted, are well documented in rater-mediated assessments across a variety of disciplines. In many real-data studies of rater effects, researchers have reported that raters exhibit more than one effect, such as a combination of misfit and systematic biases related to student subgroups (i.e., differential rater functioning [DRF]). However, researchers who conduct simulation studies of rater effects usually focus on the effects in isolation. The purpose of this study was to explore the degree to which rater effect indicators are sensitive to rater effects when raters exhibit more than one type of effect, and to explore the degree to which this sensitivity changes under different data collection designs. We used a simulation study to explore combinations of DRF and rater misfit. Overall, our findings suggested that it is possible to use common numeric and graphical indicators of DRF and rater misfit when raters exhibit both these effects, but that these effects may be difficult to distinguish using only numeric indicators. We also observed that combinations of rater effects are easier to identify when complete rating designs are used. We discuss implications of our findings as they relate to research and practice.
