Reliability: Marking vs Judging

I recently blogged about judging rather than marking GCSE English mock exams. Kieron Bailey, Head of Faculty at Parker E-Act Academy, who asked his department to judge the long essay question for the AQA GCSE English Literature Paper 1 mock (30 marks), was kind enough to share his data with me.

In my post encouraging schools to judge, I suggested that the marking reliability of 24-mark questions was likely to be +/- 6 marks, according to the marking reliability research published by Ofqual.

How did the judging of the school compare to what we expect of marking reliability?

The school judged the work of 210 pupils, using 10 judges doing 120 judgements each. In total, therefore, they completed 1,200 judgements, with each script being judged 11 or 12 times. The judgements were done over various sittings, but it would seem that on average each judge took around an hour to complete their 120 judgements.

On completion the Head of Department looked through the results and decided that the best piece of work would be worth 23 marks, and the worst 5 marks. That gives us a range of 18 marks.

The average Standard Error of the True Score was 0.98 logits. The True Score range was -6.54 logits to 5.35 logits, a range of 11.89 logits.

We then transform the true scores to the scaled scores using the following method.

WANTED RANGE = Range of scaled scores wanted = 18

RANGE = Current range of true scores = 11.89

USCALE = WANTED RANGE / RANGE = 18 / 11.89 = 1.51

SCALED SCORE SE = (True Score SE * USCALE) = 0.98 * 1.51 = 1.48

So we can estimate that the judged scores are correct to +/- 1.48 marks.
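The scaling above is simple enough to sketch in code. The figures are those quoted in the post; the variable names are mine:

```python
# Rescale judged true scores (in logits) onto the mark range chosen
# by the Head of Department. Figures are those quoted above.
wanted_range = 18          # best script 23 marks, worst 5 marks
true_score_range = 11.89   # 5.35 - (-6.54) logits
true_score_se = 0.98       # average standard error of the true scores

uscale = wanted_range / true_score_range   # multiplier from logits to marks
scaled_score_se = true_score_se * uscale   # measurement error in marks

print(round(uscale, 2), round(scaled_score_se, 2))  # 1.51 1.48
```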

We would expect traditional marking to be correct to +/- 6 marks.

The school's judging would certainly seem to have produced a more reliable outcome than marking the scripts would have.

Had the school done more judgements per script, the reliability of the judging could have been higher still.

How highly should tests correlate?

I received a question this week from a data manager regarding a creative writing examination that they had administered at the school for their Year 8s and judged using the site. He was interested in the correlations between the results on the creative writing test and the other test data they held for students: a test of general abilities administered at the start of Year 7 (baseline), and Key Stage 2 (KS2) results.

Here are the correlations.

He seemed disappointed that the correlations for the creative writing test were so low. What he hadn't considered was the impact of measurement error in the tests.

If we want to know the relationship between the constructs the tests are attempting to measure, we need to compensate for measurement error. Spearman invented a simple technique called the disattenuated correlation to do just this.

Let’s assume that the reliabilities of these tests are 0.9 for Key Stage 2, 0.9 for the generic baseline test, and 0.75 for the writing test (we don’t all agree on writing so let’s allow a lower reliability!). If we look at the disattenuated correlations we see that the correlations are higher.
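Spearman's correction is a one-liner: divide the observed correlation by the square root of the product of the two reliabilities. A minimal sketch, using an illustrative observed correlation (the post does not quote the raw figures):

```python
from math import sqrt

def disattenuate(r_xy, rel_x, rel_y):
    """Spearman's correction for attenuation: estimate the correlation
    between the underlying constructs from the observed correlation
    and the reliability of each test."""
    return r_xy / sqrt(rel_x * rel_y)

# Illustrative only: an observed correlation of about 0.53 between the
# writing test (reliability 0.75) and KS2 (reliability 0.9)
# disattenuates to roughly 0.64.
print(round(disattenuate(0.526, 0.9, 0.75), 2))  # 0.64
```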

The disattenuated correlation between the creative writing test and KS2 is 0.64. We could interpret this to mean that the constructs are related but not identical. This seems reasonable, as KS2 doesn't test creative writing. Interestingly, the correlation between KS2 and the baseline test is 0.95. The school is learning very little from its baseline test, as it is testing the same constructs as KS2. The best that can be said for the test is that repeated testing reduces the measurement error.

So what do correlations tell us about the validity of tests? Very little. A high correlation suggests that your test is redundant. A low correlation tells you that you are testing something different, but you have no idea what. Paul Newton, the co-author of an excellent text on validity, puts it like this:

“The classic approach to validation involved correlating results from a new test against results from an already established one. In other words, results from the established test provided the ‘criterion’ against which to judge results from the new test. In theory, a high correlation coefficient provides strong evidence that the new test is measuring essentially the same thing as the established test, the criterion measure. In practice, because it is so hard to provide plausible criterion measures, low correlations are hard to interpret, and even high correlations do not necessarily mean that the right thing has been measured.”

Paul prefers a lifecycle approach to validation. If you want to know if a test is useful for your purpose, you need to interrogate:

“the specification of the proficiency which is supposed to be assessed; the process for producing the tasks that are used to assess the proficiency; the process for administering those tasks; the process for evaluating task performances; the process for reporting assessment results; the specification of how those results should be interpreted; and so on.”

So, do take a look at correlations between tests – you may find you are doing some unnecessary testing. If you get a low correlation, don’t be disappointed, you may be learning something new.

An interview with Charlotte Goodchild, Team Leader KS3 English, Wilmslow High School

Why did you decide to use the site?

Following the removal of KS3 levels, our school recognised that this was a golden opportunity to evaluate the way in which we mark work and assess students.  Over the course of the last two years, we have spent considerable time as a school wrestling with how to respond to the changes in the KS3 curriculum and the disappearance of levels. This summer, we are in a position to launch our approach, which makes a clear break between formative and summative assessment and considers the idea of 'Fluency Learning'.

Fluency Learning, and the language we have attached to this, considers how effectively students have learnt and practised the material being taught, and is based on an assumption that, at the end of a sequence of lessons, all of our students are capable of having a ‘complete’ knowledge of the taught subject content. This means we are moving to a model which assesses the quality of a student’s learning. It no longer considers where a student has come from (their prior attainment) and where they’re heading (a GCSE target) as we have concluded that this is a very limiting approach, which means we do not have sufficiently high expectations of all of our students.
As an English team, we were therefore looking for a way in which we could summatively assess our students' knowledge (and application) that was reliable, removed the tendency for biased judgement and was not beholden to vague criteria and rubrics. The site appeared to answer all of these requirements, and had the added advantage of reducing the workload that English teachers often face in exam season!

What were the practicalities of using the site?

The site was easy to use and Chris was extremely helpful in supporting us through the setup process.  I would very much recommend paying for the £60 subscription if you are thinking about trialling the software with a small group.  As we boldly decided to trial this across all our KS3 classes, we ended up judging nearly 1,000 scripts, which is quite a large trial!  Moving forward, I think we will be signing up for the £300 subscription so that we can use the barcoded answer sheets, as this will reduce the admin time spent scanning in the documents.

What advantages and disadvantages did you find?

One of the biggest advantages is that all of our teachers have now seen over 600 pieces of student work across KS3, in the space of about 3 hours.  As a team leader, this is really quite fantastic! We have already been able to identify specific strengths and weaknesses across the cohorts and this has made for some excellent discussions about ‘closing gaps’ and strategies moving forward next year.  Of course, the fact that we are not spending hours marking is an added bonus!  My team has been very positive about the software and, in my opinion, it’s a really good form of CPD as we are all able to see the standard across the board and evaluate the effectiveness of our own teaching/students’ progress in relation to this.

What are you planning to do next?

We are going to sign up for the £300 subscription and use comparative judgement in our two KS3 summative assessment windows.

The site is ideally suited to increasing the efficiency of your large-scale paper-based assessment. It helps you anonymise the scripts, distribute scans to judges and manage the judging process online. With each script seen by multiple judges, you no longer need to worry about standardisation and moderation. Candidates will receive fair results, free from marker bias.

If you have an institutional support account, we even do a lot of the work for you! The process is simple:

  1. You send us a list of candidates
  2. You specify how many pages you would like for your answer sheets and whether you would like them lined or blank
  3. We create a pdf with named and bar coded answer sheets for you to print
  4. When you have finished you send us your scans
  5. We process the scans for you
  6. The processing removes all pupil identifiers for judges and prepares the images for judging
  7. When you have finished judging your results will be automatically matched to your pupil lists

The service is available as part of our institutional support licence. More details on the licence are here:

1,000,000 judgements: what have we learned?

Today we reached a landmark 1 million judgements on the site. So what have we learned along the way?

CJ can reduce workload. Teachers are reporting reduced workload as there is no longer any need for standardisation and moderation.

CJ tests can measure with higher reliability and validity than specifically designed test batteries. Researchers at Loughborough University are revealing the power of simple questions in maths.

CJ can be used to improve the design of assessments. Ofqual solved the problem of exam boards producing qualifications of different difficulty.

CJ can be used to maintain standards in national examinations. In Victoria, researchers are exploring how you can measure changes in performance over time in national examinations.

Without CJ, teachers can be a little over-optimistic! In this experiment, teachers were surprised to find their pupils hadn't made as much progress as they had hoped.

Teacher assessment, given the right support, can be reliable and unbiased. Coursework has had a bad press recently, but this head of department shows how to get a department judging fairly and rigorously.

Standards in Maths haven't fallen as much as we might think!  Researchers at Loughborough University answer an age-old question using CJ, and come up with a surprising result.

You can use CJ to measure how effective your interventions are. Derek, from Raleigh, NC, finds out that the teaching in his department needs improvement! 

CJ allows you to measure what you really want to measure. In which Daisy Christodoulou at Ark puts all their mark schemes on a big bonfire.

You can build a robust assessment model on CJ. Our flagship product designed to help schools measure progress.

CJ stops everyone arguing about standards. Primary school teachers are surprised to find they can judge their entire year group in 7 minutes!

Eleven year olds like to write about dragons! And then good old Josh turns up. We find out more about the creative writing of 11 year olds.

Why we need measurement

This year I've been working with a school who have judged the essay sections from their GCSE English mocks using the site. All pupils took two mocks, one at the end of the autumn term, and another before the end of the spring term. After the spring term they were very excited about the improvement they had seen in the work of pupils, and were predicting a large increase in grades.

I decided to take a closer look for them. I took some scripts from the autumn mock and some scripts from the spring mock and mixed them up in one judging pot. I then got a couple of PhD students to judge them all together. As usual the students were blind to the purpose of the study or when the tests had taken place. Using the results of their judging I was able to anchor the two judging sessions together so they were on the same scale and make direct comparisons.

The results did indeed seem to show a slight improvement. Statistical tests, however, showed that this improvement was not significant, and probably just sampling error. There was certainly no evidence of the miracle the school was claiming!

Of course the school didn’t believe me – they were convinced they could see improvement. I was not surprised. They had been teaching the students all year, and couldn’t believe their efforts may not have translated into results. Where we want to see improvement, we find it. Without objective measurement we can convince ourselves of anything. 

Assigning marks to judged work

A recent blog by Tom Needham described how he used the site to judge GCSE English Controlled Assessment in his school. Once he had finished judging he needed to assign marks to scripts so that they could be moderated by the exam board. How could he do that?

Tom's solution was to include 8 pieces of work that had previously been moderated by AQA. For example, for the essays on Of Mice and Men, each piece of work had a moderated mark, with the marks ranging from 10 to 26. The highest mark possible is 30. He removed all marks and annotations from these scripts and included them in the judging.

Once the judging was completed, the first thing I checked for Tom was that the marks and judging agreed. Marking, as Tom describes it, is ‘fiendishly difficult.’ If the marking is too inaccurate then the marks would be of little use. Luckily there proved to be a strong relationship between the marks and the CJ scores. The correlation was 0.89.

Once a relationship had been established between the moderated marks and the CJ scores, it was easy to apply this relationship to the remaining scripts using linear regression and predict a moderated mark for every script. The cap of 30 on the mark scheme meant that scripts the judging predicted would merit much higher marks could only receive 30.
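That mark-assignment step can be sketched in a few lines. The anchor data here is made up for illustration; Tom's real anchors were the 8 AQA-moderated scripts:

```python
import numpy as np

# Hypothetical anchor scripts: (CJ scaled score, AQA-moderated mark).
anchor_scores = np.array([-2.1, -1.0, -0.3, 0.4, 1.1, 1.8, 2.5, 3.2])
anchor_marks = np.array([10, 13, 15, 18, 20, 22, 24, 26])

# Fit the linear relationship between CJ scores and moderated marks.
slope, intercept = np.polyfit(anchor_scores, anchor_marks, 1)

def predicted_mark(cj_score, max_mark=30):
    """Predict a moderated mark from a CJ score, capped at the maximum."""
    return min(max_mark, round(slope * cj_score + intercept))

# A script judged well above all the anchors is capped at 30 marks.
print(predicted_mark(6.0))  # 30
```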

So, statistically speaking, how successful was the judging? Tom had also asked his teachers to mark the work in isolation. Plotting the original teacher marks against the predicted moderated marks shows a wide scatter: scripts that the teachers, as a group, had judged to be worth 19 marks had original marks ranging from 16 to 30!

So, all in all, a really neat experiment by Tom, who has shown clearly here that teacher assessment, given the right support, can work! If we want to keep assessing important practical skills using teacher assessment we need to stop marking and start judging.