One of the first questions we are asked is ‘How many judgements do we need to make?’

Originally we advised multiplying the number of scripts you had by 10. So, if you had 10 scripts we advised 100 judgements. Because every judgement compares two scripts, under this model each script would be judged 20 times.

**We now think that this number is an overestimate and recommend multiplying the number of scripts you have by 5. So if you have 10 scripts we would advise 50 judgements. Under this model each script is judged 10 times.**

Having completed a number of large studies, we are able to estimate the impact of the number of judgements on the accuracy of the results. Taking four large studies, each with between 1,000 and 4,000 scripts and between 10,000 and 40,000 judgements, we can take bootstrap samples (samples drawn with replacement) of different sizes from the judgement data and examine the correlation between the scale created from each sample and the final scale. Two of the studies use English essays; two use open-ended Maths questions.
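
The bootstrap check described above can be sketched roughly as follows. This is a minimal illustration, not the analysis used in the studies: it assumes the scale is a Bradley-Terry model (the post does not say which model was fitted), it runs on synthetic judgements rather than real data, and the names `fit_scale`, `simulate_judgements` and `pearson` are hypothetical helpers invented for the example.

```python
import math
import random


def fit_scale(judgements, n_scripts, iterations=50):
    """Fit Bradley-Terry log-strengths from (winner, loser) pairs.

    Assumption: each script also gets one virtual win and one virtual loss
    against an opponent of strength 1, so scripts with no wins stay finite.
    """
    wins = [1.0] * n_scripts  # the virtual win
    pair_counts = {}
    for winner, loser in judgements:
        wins[winner] += 1
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1
    strength = [1.0] * n_scripts
    for _ in range(iterations):
        # Two virtual comparisons against the strength-1 opponent.
        denom = [2.0 / (s + 1.0) for s in strength]
        for (i, j), n in pair_counts.items():
            shared = n / (strength[i] + strength[j])
            denom[i] += shared
            denom[j] += shared
        strength = [w / d for w, d in zip(wins, denom)]
    return [math.log(s) for s in strength]


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def simulate_judgements(true_quality, n_judgements, rng):
    """Synthetic data only: random pairs, winner drawn from a logistic model."""
    judgements = []
    for _ in range(n_judgements):
        a, b = rng.sample(range(len(true_quality)), 2)
        p_a_wins = 1.0 / (1.0 + math.exp(true_quality[b] - true_quality[a]))
        judgements.append((a, b) if rng.random() < p_a_wins else (b, a))
    return judgements


rng = random.Random(0)
n_scripts = 40
true_quality = [rng.gauss(0, 2) for _ in range(n_scripts)]
judgements = simulate_judgements(true_quality, 10 * n_scripts, rng)
final_scale = fit_scale(judgements, n_scripts)

correlations = {}
for rounds in (1, 2, 4, 8, 16):
    sample_size = rounds * n_scripts // 2  # one round = n_scripts / 2 judgements
    sample = rng.choices(judgements, k=sample_size)  # sampling with replacement
    correlations[rounds] = pearson(fit_scale(sample, n_scripts), final_scale)
    print(f"{rounds:>2} rounds: correlation with final scale = {correlations[rounds]:.3f}")
```

In practice you would repeat the resampling many times at each sample size and average the correlations, rather than drawing a single sample as this sketch does.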

The figure shows the correlation between the final scale and the scale after each round of judgements. A round is defined as the number of judgements it would take to judge each script once. If we have 10 scripts, round one would be 5 judgements, round two would be 10 judgements and so on. It is clear that once 8 rounds have been completed the remaining judgements yield diminishing returns.

Correlation between the final scale and the scale after each round
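
The arithmetic behind judgements, rounds and judgements per script can be checked with a small helper. This is only a sketch; `judgement_plan` and its parameter names are hypothetical and not part of any tool mentioned here.

```python
def judgement_plan(n_scripts, multiplier=5):
    """Return the judgement counts implied by a scripts-times-multiplier rule."""
    total_judgements = n_scripts * multiplier
    # Every judgement compares two scripts, so each script is seen twice as
    # often as the multiplier suggests.
    judgements_per_script = 2 * total_judgements / n_scripts
    # A round is the number of judgements needed to judge each script once.
    judgements_per_round = n_scripts / 2
    rounds = total_judgements / judgements_per_round
    return {
        "total_judgements": total_judgements,
        "judgements_per_script": judgements_per_script,
        "judgements_per_round": judgements_per_round,
        "rounds": rounds,
    }


current_advice = judgement_plan(10, multiplier=5)
original_advice = judgement_plan(10, multiplier=10)
print(current_advice)
print(original_advice)
```

For 10 scripts, the multiplier of 5 gives 50 judgements, 10 judgements per script and 10 rounds; the original multiplier of 10 gives 100 judgements and 20 judgements per script.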

Therefore, unless you want to estimate inter-rater reliability, we now advise estimating the minimum number of judgements you need by multiplying the number of scripts you have by 5. This gives 10 rounds, which still allows a small margin of error over the 8 rounds after which returns diminish.

