Last week we received this message from Tom Bramley at Cambridge Assessment:
I’ve been doing some work investigating the reliability of adaptive comparative judgement, in order to check out a suspicion/concern I’ve had for a while that the reliability figures that come out of it are a bit too good to be true. I ran a few simulations which did seem to show a fairly large amount of bias for ACJ studies with the number of comparisons per script that are usually used. But then I ran a few simulations using completely random data (i.e. every outcome a coin-toss). I got a reliability value above 0.7 in 3 cases and as high as 0.89 in one! I wasn’t expecting anything quite so drastic – it seems to show the reliability statistic is at best misleading for ACJ and at worst worthless.
Tom was using the Swiss-round method suggested by Pollitt (2012) to get the adaptive process going in the early stages, and wanted us to see if proper adaptivity had the same impact. As Ian Jones and I were about to run a CJ simulation workshop at Antwerp University with the D-PAC group, we thought this was an excellent subject for study!
Our results were surprising, and confirmed Bramley’s concerns. We ran a simulation in which 50 scripts received every possible pairing (1,225 judgements), repeated for 10 iterations, under each of the following conditions:
- the random distribution (no control over which pairs are judged)
- the distributed distribution (non-repeating pairs, and an equal number of judgements per script in every round)
- the progressive adaptive algorithm with an acceleration parameter of 2 (quick decay of the random element)
- the progressive adaptive algorithm with an acceleration parameter of 4 (slow decay of the random element)
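For concreteness, the three pairing regimes can be sketched as follows. This is our own minimal sketch in Python, not the workshop code: the function names are invented, and the exact form of the decaying random element in the progressive algorithm is an assumption, chosen only to match the behaviour described above (a larger acceleration parameter keeps the random element alive for longer, i.e. slower decay).

```python
import itertools
import random

def random_pairs(n_scripts, n_pairs, rng=random):
    """Random condition: any pair may be drawn, repeats allowed."""
    return [tuple(rng.sample(range(n_scripts), 2)) for _ in range(n_pairs)]

def distributed_pairs(n_scripts, rng=random):
    """Distributed condition: every pair exactly once, in shuffled order,
    so judgements stay roughly balanced across scripts."""
    pairs = list(itertools.combinations(range(n_scripts), 2))
    rng.shuffle(pairs)
    return pairs

def progressive_pair(estimates, done, progress, accel, rng=random):
    """Progressive adaptive condition: with a probability that decays as the
    study progresses, pick a random unseen pair; otherwise pick the unseen
    pair whose current ability estimates are closest.  The decay form
    (1 - progress) ** (1 / accel) is an assumption: it makes accel=2 decay
    quickly and accel=4 decay slowly, as in the conditions above."""
    unseen = [p for p in itertools.combinations(range(len(estimates)), 2)
              if p not in done]
    if rng.random() < (1.0 - progress) ** (1.0 / accel):
        return rng.choice(unseen)
    return min(unseen, key=lambda p: abs(estimates[p[0]] - estimates[p[1]]))
```

Once the random element has decayed, `progressive_pair` deterministically selects the closest-ability unseen pair, which is exactly the behaviour that matters for the results below.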
The Rasch model was fitted to the decisions after every judgement, and the median reliability was calculated across the 10 iterations. The results are presented here.
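The fit-and-reliability step can be sketched in the same spirit. Below is a minimal, self-contained version, assuming the usual Rasch separation reliability (observed variance of the estimates minus error variance, over observed variance, with error variance taken from the Fisher information). The simple gradient fit, learning rate and iteration count are our simplifications, not D-PAC’s implementation:

```python
import math

def fit_rasch(n_scripts, decisions, lr=0.05, iters=500):
    """Fit script 'abilities' to pairwise decisions (winner, loser) by
    gradient ascent on the Bradley-Terry / Rasch log-likelihood."""
    theta = [0.0] * n_scripts
    for _ in range(iters):
        grad = [0.0] * n_scripts
        for winner, loser in decisions:
            p_win = 1.0 / (1.0 + math.exp(theta[loser] - theta[winner]))
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        theta = [t + lr * g for t, g in zip(theta, grad)]
        mean = sum(theta) / n_scripts
        theta = [t - mean for t in theta]   # identify the scale: mean ability 0
    return theta

def separation_reliability(theta, decisions):
    """Rasch separation reliability: (var(theta) - mean squared error) / var(theta),
    with each script's error variance approximated as 1 / Fisher information."""
    n = len(theta)
    info = [0.0] * n
    for winner, loser in decisions:
        p = 1.0 / (1.0 + math.exp(theta[loser] - theta[winner]))
        info[winner] += p * (1.0 - p)
        info[loser] += p * (1.0 - p)
    mse = sum(1.0 / max(i, 1e-9) for i in info) / n
    var = sum(t * t for t in theta) / n
    return (var - mse) / var if var > 0 else 0.0
```

The danger the simulations expose lives in this formula: adaptive pairing can inflate the spread of the estimates (the numerator’s variance term) even when the decisions themselves carry no signal.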
Under the adaptive algorithm reliability rises as soon as the random element decays, peaking at around 0.7 once half the matrix of judgements is complete. Only when the algorithm is forced to select pairs that are further apart in ability, because all the close pairs have been exhausted, does the reliability start to fall. Neither the random nor the distributed algorithm shows this pattern: reliability remains low throughout.
Seeing this, one of the workshop participants said:
‘What we need is the opposite of adaptivity!’
Tom suggests that alpha is misleading or worthless for studies that involve a significant amount of adaptivity. We agree: without an independent check that something has gone wrong, an adaptive algorithm in your CJ study could leave you with a high reliability coefficient and a completely spurious rank order of scripts. This is a worry!
Again, we would urge great caution with adaptivity: at present it appears to offer a small theoretical advantage at the cost of a serious risk of misinterpretation.
This problem does not arise in normal computerised adaptive testing (CAT) because the items in the bank are calibrated first, and the ‘measurement’ of the test-takers is based on fixed (assumed known) item difficulties. ACJ tries to calibrate and measure at the same time, a juggling act that may end with all the balls in the wrong order!
Many thanks to Tom, for bringing this to our attention, and to the workshop participants, Liesje Coertjens, Sven De Maeyer and San Verhavert for their hard work.