Rater Differences Research Articles

Abstract: This article uses latent structure analysis to model ordered category ratings by multiple experts on the appropriateness of indications for the medical procedure carotid endarterectomy. The statistical method used is a form of located latent class analysis, which combines elements of latent class and latent trait analysis. It assumes that treatment indications fall into distinct latent classes, with each latent class corresponding to a different level of appropriateness. The appropriateness rating of a treatment indication by a rater is assumed to be determined by the latent class membership of the indication, the rating category thresholds of the rater, and random measurement error. The located latent class model has two alternative forms: a normal ogive form, which derives from the assumption of normally distributed measurement error, and a logistic approximation to the normal form. The approach has the following advantages for the analysis of ordered category ratings by multiple experts: (1) it assesses whether different raters base ratings on the same or different criteria; (2) it assesses rater bias, the tendency of some raters to make higher or lower ratings than others; (3) it characterizes rater differences in rating category definitions; (4) it provides theoretically based methods for combining the ratings of different raters; and (5) it provides a description of the distribution of the latent trait. The data examined are appropriateness ratings on 848 indications for carotid endarterectomy made by nine medical experts. The located latent class approach provides unique insights concerning the data. It identifies what appears to be a set of clear nonindications for carotid endarterectomy, but a corresponding set of clear indications is not evident. The results indicate that all raters measured a common latent trait of treatment appropriateness, but that some measured the trait better than others. Rater differences in overall bias and rating category definitions are evident. Two methods are used to combine raters' ratings. One uses ratings to calculate a continuous appropriateness score for each indication. The other uses ratings to assign indications to discrete outcome categories, each corresponding to a specific level of appropriateness. The located latent class approach for ordered category measures has possible applications besides the analysis of expert ratings, such as item analysis. Potential extensions of the model are discussed.
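
The logistic form of the model described in the abstract can be illustrated with a minimal Python sketch. The parameterization here is an assumption, not taken from the article: each latent class has a location on the appropriateness scale, each rater has increasing category thresholds and a discrimination value, and ratings are treated as conditionally independent given the class. All function names and numbers are hypothetical. The last two functions mirror the two ways of combining ratings that the abstract mentions, a continuous score and a discrete class assignment.

```python
import math


def logistic(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def category_probs(class_location, thresholds, discrimination=1.0):
    """P(rating = k), k = 0..len(thresholds), for one rater and one latent class.

    Assumed form: P(rating > threshold j) = logistic(a * (location - t_j)),
    with thresholds given in increasing order.
    """
    exceed = [logistic(discrimination * (class_location - t)) for t in thresholds]
    bounds = [1.0] + exceed + [0.0]
    return [bounds[k] - bounds[k + 1] for k in range(len(thresholds) + 1)]


def class_posterior(ratings, raters, class_locations, class_priors):
    """Posterior class probabilities for one indication given all raters' ratings.

    `raters` is a list of (thresholds, discrimination) pairs, one per rater.
    Ratings are assumed conditionally independent given the latent class.
    """
    joint = []
    for location, prior in zip(class_locations, class_priors):
        likelihood = prior
        for rating, (thresholds, a) in zip(ratings, raters):
            likelihood *= category_probs(location, thresholds, a)[rating]
        joint.append(likelihood)
    total = sum(joint)
    return [j / total for j in joint]


def continuous_score(posterior, class_locations):
    """Continuous appropriateness score: posterior mean of the class location."""
    return sum(p * loc for p, loc in zip(posterior, class_locations))


def modal_class(posterior):
    """Discrete assignment: index of the most probable latent class."""
    return max(range(len(posterior)), key=lambda c: posterior[c])


if __name__ == "__main__":
    # Hypothetical values: three classes, three raters, ratings coded 0..2.
    locations = [-2.0, 0.0, 2.0]
    priors = [0.5, 0.3, 0.2]
    raters = [([-1.0, 1.0], 1.5), ([-0.5, 1.5], 1.0), ([-1.5, 0.5], 0.8)]
    post = class_posterior([2, 1, 2], raters, locations, priors)
    print(post, continuous_score(post, locations), modal_class(post))
```

In this sketch a rater with uniformly lower thresholds gives higher ratings for the same class, which corresponds to rater bias, and a rater with a larger discrimination value measures the latent trait more sharply.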

Summary: The adoption of a particular personnel device is dependent upon how consistently its use will lead to positive results. A selection device, for example, should consistently pick workers with higher production records than those of workers rejected. When repeated tests of the expected relationship yield inconsistent results, the personnel device is frequently rejected without further trial. This study indicates, however, that inconsistencies of relationship between test and ratings used as measures of productivity may be due to inconsistency of the ratings rather than to any deficiency of the personnel device.

In an attempt to evaluate a battery of aptitude tests for hiring purposes, several ratings were obtained from foremen to serve as yardsticks of efficiency. The test initially showed inconsistent relationships with the ratings. Relationships which were consistent and sufficiently high to warrant adoption of the test were obtained only after eliminating from some sets of ratings the influence of a non‐relevant factor: length of service.

Although the foremen had been instructed to rank the workers on two different traits, Personality and Ability, results showed that the two ratings covered essentially the same factors. Apparently, in spite of the logical analysis in the personnel office, the foreman's ratings of personality and his ratings of ability measure much the same thing: job effectiveness. In fact, the tests designed to predict the ability aspect of effectiveness were more closely related to the Personality ratings than to the Ability ratings. This is probably because both ratings reflected what the personnel analyst would define as “ability”, but the Personality ratings were less contaminated by a spurious relationship to mere length of service. If this finding can be generalized, it may serve to account for the many job failures ascribed to personality difficulty.

A further implication is that ratings on Personality may in reality measure an aspect of ability which is relatively independent of length of service, and thus is more predictable by ability tests than are Ability ratings.

One measure of the effectiveness of the rating is the extent to which it can be predicted by the selection tests. Contrary to initial expectation, the Over‐all rating was not uniformly superior to the part ratings on this basis and was definitely inferior to the sum of the part ratings. Thus the foremen, in combining the various factors leading to their appraisal of over‐all effectiveness, did not ascribe the best weights; in fact their judgmental combination was not as good as simple clerical addition without statistical adjustment.

The initial results showed marked inconsistencies from rater to rater in the relation between the initial ratings and the selection tests. This inconsistency, apparently reflecting adversely on the usefulness of tests, turned out to result in part from rater differences in the meanings which each attached to the basis for his rating. Thus, correction for the spurious length‐of‐service bias by use of a second technique for securing ability ratings transformed the apparent inconsistency into consistency. Similar consistent and substantial relations between tests and ratings were obtained by limiting the study to those cases where length of service was more nearly equivalent, by using the Personality rating (less contaminated with seniority), and by statistical correction for seniority.
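
The summary does not say which statistical correction for seniority was applied. One standard option is a first-order partial correlation between test score and rating with length of service held constant, sketched below in Python. This is offered only as an illustration of the idea; the function names and the data in the demo are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y with the linear influence of z removed from both."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


if __name__ == "__main__":
    # Hypothetical data: ratings that are inflated by years of service.
    test_score = [1, 2, 3, 4, 5, 6]
    seniority = [5, 1, 4, 2, 6, 3]
    rating = [t + 2 * s for t, s in zip(test_score, seniority)]
    print("raw test-rating correlation:", pearson(test_score, rating))
    print("with seniority held constant:", partial_corr(test_score, rating, seniority))
```

When ratings partly reflect length of service, the raw test-rating correlation understates how well the test predicts the seniority-free part of the rating, which is the pattern the summary describes.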

Articles published on Rater Differences

The heritability of depressive symptoms: multiple informants and multiple measures.

Competency models: are self‐perceptions accurate enough?

MEASUREMENT ERROR IN RESEARCH ON HUMAN RESOURCES AND FIRM PERFORMANCE: HOW MUCH ERROR IS THERE AND HOW DOES IT INFLUENCE EFFECT SIZE ESTIMATES?

The inconsistency with which raters weight and combine information across targets

Statistical Modeling of Expert Ratings on Medical Treatment Appropriateness

Singers and Stereotypes: The Image of Female Recording Artists

ORGANIZATION AND RATER DIFFERENCES IN PERFORMANCE APPRAISALS

Environmental dispositions and the evaluation of architectural interiors

Predictive utility, sex of rater differences, and interrater reliabilities of the NOSIE-30.

Reliability and validity of proverb interpretation to assess mental status

Rater and patient characteristics associated with rater differences in psychiatric scale ratings

The effect of rater differences on symptom rating scale clusters.

Using Ratings to Validate Personnel Instruments: A Study in Method
