24th August 2020
The last week has been very challenging for students, families and schools throughout the country. How did we get into this mess and what lessons can we learn for the future?
On paper, the algorithm seemed like a good idea. It had two main aims. The first was to prevent grade inflation and thereby ensure that the exam results in 2020 would be broadly equivalent to exam results in previous and future years. In other words, an A grade in 2020 would be broadly the same as an A grade in 2019 or 2021. The second aim was to ensure there was no unconscious bias by teachers in the awarding of grades. The thinking behind this came from a number of studies suggesting that teachers could be biased against disadvantaged students when predicting grades for university entrance.
Unfortunately, the first aim was doomed to failure. Given a student with a 50-50 chance of getting an A or a B, a teacher is almost always going to give the benefit of the doubt. In addition, without exams, no student can have a bad day, muck up their paper and underperform. It was therefore inevitable that the overall grades from centre assessed grades (CAGs) would be higher than historical averages. It was equally inevitable that any measure to pull these back to the historical averages would fall arbitrarily on individual students. The overall numbers might hold steady, but the individual impacts would be unjust.
There were also significant flaws in the second aim. A study in 2016 concluded that high-achieving students in comprehensive schools were not predicted as highly as their high-achieving peers in grammar and independent schools. It further concluded that lower-achieving students were the most over-predicted, and that a high proportion of predicted grades were inaccurate. Applying these conclusions to the situation in 2020 was, however, flawed. The studies looked at predicted grades for university entrance on UCAS forms. There was a significant difference between these and the CAGs, which crucially were to remain confidential precisely to prevent students or parents exerting pressure to increase the grades. It was not comparing like with like.
What is noteworthy is that Ofqual consulted on the algorithm and it had wide support. Some, myself included, always thought it would be flawed, mainly because it would not take into account the year-to-year differences in ability between school cohorts. In June, the Education Select Committee said the same and raised a series of further concerns. No one, though, appreciated quite how bad it would actually be. It may be easy to say with hindsight, but it is puzzling now how anyone ever thought it could work in the first place.
Other European countries took different routes from the UK. In Germany, Spain, Austria, Hungary and Bulgaria, for example, exams went ahead in adapted form and with social distancing measures. In France, ‘local juries’ moderated the grades put forward by schools. In Italy, there were oral exams rather than written assessments. Only in the UK did we go down the doomed algorithm path.
The best option for the UK would have been something similar to the French approach. It is important to trust teachers’ knowledge of their students. However, there always needs to be some moderation process to make sure a grade in one school is broadly equivalent to the same grade in another school; that is what gives grades credibility. External moderation stops anyone from misinterpreting their own data or, in rare cases, playing the system. Such moderation could have been done by the exam board marking teams, by Ofsted inspectors, or by experienced heads and deputies in local ‘juries’. Instead we have ended up in a very poor situation. It is absolutely right that the algorithm was abandoned, but it has left this year’s results unmoderated and, in the eyes of some, without the full credibility the students deserve.
There is a significant body of opinion that believes independent schools should be abolished, and this situation has given them plenty of fuel to support their case. When the original A-level results were issued, it was clear from Ofqual’s own published analysis that independent schools had benefited disproportionately at A*/A. Their share of these grades had increased by 4.7 percentage points against a national average of 2.2, whilst FE colleges had gone up by only 0.3. In addition, it was quickly highlighted that ‘private school subjects’ such as Latin and Greek had seen significant inflation compared to other subjects such as Psychology and Maths. This rapidly developed into a narrative that the algorithm was rigged in favour of private schools and against disadvantaged students. Some went so far as to say that this was deliberate.
For people working in the independent sector it was a very uncomfortable time to look at the media and social media. To be fair, there were some witty memes circulating on the internet; my favourite being (notwithstanding some confusion about the Swedish education system): ‘CDDC: what ABBA would have been called if they went to a state school’.
The facts quoted were not untrue, but they did not tell the whole story. For example, at A*/C, comprehensive schools went up by more than independent schools. A wide range of subjects saw inflation at A*/A similar to Latin and Greek, including Drama, DT, Music, Spanish and PE – subjects which are taught widely across all types of schools. There were bizarre results in independent schools just as there were elsewhere.
The reason for the independent school ‘bonus’ was small classes. As mentioned above, CAGs were always going to be higher than the algorithm grades. In classes of five or fewer, CAGs were left untouched, as it was statistically inappropriate to apply the algorithm to such small samples. In classes of 6 to 15, the CAGs carried reasonable weight alongside historical performance. In classes of over 15, historical performance in that subject at that school was pretty much all that counted. Small classes therefore had far fewer CAGs reduced than big classes. Independent schools have more small classes and so benefited disproportionately.
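To make the class-size mechanism concrete, below is a minimal sketch of the kind of taper being described. The thresholds of 5 and 15 come from the description above; everything else – the linear taper, the idea of blending a per-student CAG ‘score’ with a historical ‘score’, and the function names – is my own illustrative assumption. The real Ofqual model was considerably more complex, working with whole-cohort grade distributions rather than individual scores.

```python
# Illustrative sketch only - not Ofqual's actual model.
# The 5 and 15 thresholds come from the text; the linear taper and the
# per-student blending of CAG vs historical evidence are assumptions.

def cag_weight(class_size: int) -> float:
    """Weight given to the centre assessed grade, by class size."""
    if class_size <= 5:
        return 1.0                       # very small classes: CAGs stand untouched
    if class_size > 15:
        return 0.0                       # large classes: historical data dominates
    return (15 - class_size) / (15 - 5)  # classes of 6-15: CAG weight tapers off


def blended_score(cag_points: float, historical_points: float, class_size: int) -> float:
    """Blend CAG and historical evidence according to class size (illustrative)."""
    w = cag_weight(class_size)
    return w * cag_points + (1 - w) * historical_points


# The same generous CAG survives in a class of 4 but is largely overridden in a class of 20.
print(blended_score(cag_points=52, historical_points=40, class_size=4))   # 52.0
print(blended_score(cag_points=52, historical_points=40, class_size=20))  # 40.0
```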
What shocks me is that this was all set out in detail in Ofqual’s own 320-page report. They commented in depth on the class-size effect and produced tables showing how different kinds of schools would be affected. Their conclusion was that this proved there had been absolutely no bias. They might be skilled at producing the numbers, but they were severely lacking in interpreting them, and in understanding how they would be read by others.
One concern from this mess is that people will conclude that teacher assessment has failed and exams are the only fair system for grading students. This would be a serious error.
At best, exams give a reasonable approximation of the knowledge, skills and understanding of students. Often, however, the marking is deeply flawed. Over the years, how many stories can Heads tell of successful re-marks that have had a material impact on sixth form or university admission and future life plans? Close to home, my own son was originally marked as a 5 in English Language GCSE, later upgraded on appeal to an 8.
In November 2016, Ofqual published Marking Consistency Metrics, which set out the probability of a candidate being awarded the definitive grade in certain GCSE and A-level subject exams. The definitive grade is the grade that would have been awarded if the candidate’s work had been given the mark it received in the exam board’s quality assurance process. This research highlighted that in some subjects grade reliability is quite high (approaching 90%), while in others the probability of agreement with the definitive grade can be as low as 50% (English Literature).
There are a range of reasons for variation in marking, including procedural error (e.g. not marking all the pages of an answer), attentional error (concentration lapses by examiners), inferential uncertainty (insufficient evidence provided by the candidate for the examiner to reach a definitive judgement) and definitional uncertainty (there is a range of legitimate marks allowed by the mark scheme because of a lack of tight definition).
Ofqual also noted that even when marking reliability is strong, there is still the question of how reliable the overall grading of a student will be, since this depends on the individual student’s competence in the particular questions that happen to be asked on any given exam paper. The same student, perfectly marked, would be highly likely to secure different marks on different exam papers, depending on the questions selected by the examiners.
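To see how these two effects combine, here is a small toy simulation. Every number in it – the grade boundaries, the size of the ‘question luck’ term and of the marker noise – is an invented assumption for illustration, not a figure from Ofqual’s research; the point is only that a student with a fixed underlying ability can come away with different grades on different sittings.

```python
import random

# Toy illustration only: the boundaries and noise levels are invented,
# not taken from Ofqual's marking consistency research.

GRADE_BOUNDARIES = [(80, "A"), (70, "B"), (60, "C"), (50, "D"), (0, "U")]

def to_grade(mark: float) -> str:
    """Convert a raw mark into a grade using the toy boundaries above."""
    for boundary, grade in GRADE_BOUNDARIES:
        if mark >= boundary:
            return grade
    return "U"

def sit_paper(true_ability: float) -> str:
    """One sitting: question sampling plus examiner inconsistency."""
    question_luck = random.gauss(0, 5)  # how well the chosen questions suit this student
    marker_noise = random.gauss(0, 4)   # variation between examiners marking the same script
    return to_grade(true_ability + question_luck + marker_noise)

random.seed(1)
# The same student, same underlying ability, across ten sittings.
print([sit_paper(true_ability=72) for _ in range(10)])
```

On a typical run the ten grades span more than one grade, even though nothing about the student has changed.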
Therefore, a little like the algorithm, exams have the appearance of being objective and scientific, but the reality is different.
The issues this year were not to do with teacher assessment per se. The first problem was that everything came out of the blue (no blame attached to anyone here), so teacher assessment could not be prepared for properly. No common methodology, based on well-understood principles and practices and delivered through well-organised training, was given to schools before they generated their grades. The second problem was that there was no standardisation or moderation once the grades had been generated. Strong processes here make teachers very mindful of what they award in the first place, and then ensure that any errors or inconsistencies are picked up and resolved.
It would be absolutely wrong to conclude that the 2020 fiasco should push us back towards exams which have such a high degree of randomness and inconsistency. If we accept the premise for the moment that we need to keep measuring young people quite so much at both 16 and 18 – a premise that can and should be challenged – then thoroughly prepared and properly moderated teacher assessment would be a far better system to deliver fair and just results to the next generation of our students.
Robert Lobatto
22 August 2020