We welcome this critique of simplistic one-dimensional measures of academic performance, in particular the naive use of impact factors and the h-index, and we can only extend sympathy to colleagues who are being judged using some of the techniques described in the paper. In particular we welcome the report's emphasis on the need for careful modeling of citation data rather than relying on simple summary statistics. Our own work on league tables adopts a modeling approach that seeks to understand the factors associated with institutional performance and at the same time to quantify the statistical uncertainty that surrounds institutional rankings or future predictions of performance. In the present commentary we extend this approach to an analysis of the 2008 UK Research Assessment Exercise (RAE) for Universities.
Before we describe our analysis we should comment on an important modeling problem that arises in the analysis of citation data, alluded to but not discussed in detail in the report, nor, as far as we know, elsewhere. A principal difficulty with indices such as the h-index or simple citation counts is that there are inevitable dependencies between individual scientists' values. This is because a citation is to a paper with, in general, several authors, rather than to each specific author. Thus, for example, if two authors nearly always write all their papers together, they will tend to have very similar values. If they belong to the same university department then their scores do not supply independent bits of information in compiling an overall score or rank for that department. Currently this issue is recognized in the RAE, albeit imperfectly, by the requirement that the same paper cannot be entered more than once by different authors for a given university department. In a citation-based system this would also need to be recognized.
In addition, if our two authors were in different, competing departments, we would also need to recognize this, since the dependency would affect the accuracy of any comparisons we make. We also note that this will, to some extent, affect the analyses that we present below, and that it can be expected to make us overestimate the accuracy of our rankings. Unfortunately we have no data that would allow us to estimate, even approximately, how important this is. To deal with this problem satisfactorily would involve a model that incorporated effects for each author and the detailed information about the authorship of each paper that was cited. Goldstein (2003, Chapter 12.5) describes a multilevel multiple membership model that can be used for this purpose, where individual authors become level 2 and papers are level 1 units.
The UK Research Assessment Exercise was published on 18th December 2008, covering the years 2001-2008. 52,409 staff from 159 institutions were grouped into 67 units of assessment (UOA): up to 4 publications for each individual were considered, as well as other activities and markers of esteem. Panels drawn from around 1000 peer reviewers then produced a profile for each group, summarizing in blocks of 5% the proportion of each submission judged by the panels to have met each of the following quality levels: world-leading (4*), internationally excellent (3*), internationally recognized (2*), nationally recognized (1*), and unclassified. This procedure is notable in terms of its use of peer judgment rather than simple metrics, and in allowing a distribution of performance rather than a single measure. All the data are available for downloading (Research Assessment Exercise, 2008).
Figure 1 shows the results relevant for most statisticians: the 30 groups entered under UOA22: Statistics and Operational Research.
These have been ordered into a league table using the average number of stars, which we shall term the mean score; this is the procedure adopted by the media. Also reported is the number of full-time equivalent staff in the submission. Controversy surrounds this number, as it is unknown how selective institutions were in submitting staff.
David Spiegelhalter is Winton Professor of Public Understanding of Risk, Statistical Laboratory, Centre for Mathematical Sciences, Wilberforce Road, Cambridge CB3 0WB, UK. Harvey Goldstein is Professor of Social Statistics, University of Bristol, 35 Berkeley Square, Bristol BS8 1JA, UK.
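As an illustration of the mean score used to order the league table, the following minimal Python sketch collapses a quality profile into the average number of stars. It assumes that "unclassified" work counts as zero stars and that the profile percentages sum to 100; neither assumption is stated explicitly in the text. The function name `mean_score` and the example profile are hypothetical and are not taken from the RAE results.

```python
# Stars attached to each quality level.
# Assumption: "unclassified" contributes 0 stars.
STARS = {"4*": 4, "3*": 3, "2*": 2, "1*": 1, "unclassified": 0}


def mean_score(profile):
    """Return the average number of stars for one submission.

    `profile` maps each quality level to the percentage of the
    submission judged to have met that level.
    """
    total = sum(profile.values())
    if total != 100:
        raise ValueError(f"profile percentages sum to {total}, expected 100")
    # Weighted average of stars, with the percentages as weights.
    return sum(STARS[level] * pct for level, pct in profile.items()) / 100


# Hypothetical profile in blocks of 5% (illustrative values only).
example = {"4*": 20, "3*": 40, "2*": 30, "1*": 10, "unclassified": 0}
print(mean_score(example))  # 2.7
```

Two submissions with quite different profiles can share the same mean score, which is one reason the full distribution carries more information than this single summary.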