In my last blog I suggested that performance targets may not be the unquestionable evil that some people take them to be. Knowing this to be a potentially controversial position, I invited argument in response; argument engaging with what I said, rather than merely repeating a dislike of targets. As yet there has been none.
What I did get – somewhat to my surprise – was an argument in favour of that other recent demon of performance analysis, league tables. A correspondent, a retired police officer and longtime advocate of league tables, suggested that ‘chief constables seem to have an institutional opposition to having their performance compared’.
I’m not sure I’d put it like that. But I am aware that league table comparisons have fallen out of fashion. Which is a pity, because they can be useful. So I thought I’d throw in my sixpenn’orth on the subject.
Whatever your views on league tables, they are irresistible. Whether it’s a league table of schools, health authorities or police services, we always look to see where we are. Furthermore, if we are near the top of the table we feel pleased, even proud. And if we are near the bottom… then we are reminded of how flawed league tables are.
Of the many police officers I’ve spoken to about league tables over the years, most have disliked them. We should avoid them, say many. Throw them out. Why? Because they don’t compare like with like. Because you can’t compare apples with oranges.
But we never do compare like with like, except in very rare circumstances when we are comparing two identical things. Indeed, if we could compare like with like, comparison itself would be pointless. The Greek philosopher Heraclitus said you can never step into the same river twice, as ‘ever-newer waters flow on those who step into the same rivers’, so we’re not even comparing like with like when we compare ourselves over time.
And if you are in the fruit juice business, needing to make decisions about sourcing raw materials, about relative costs and potential profit margins, you might well end up comparing apples and oranges.
So let’s leave the cliché behind, and address the main question. Whether league tables are good or bad, useful or misleading has a very simple answer. It depends. It depends on what you want to use them for, how you analyse them, and what conclusions you draw from them.
Seriously, I’m not making this up
In the late 1990s the Audit Commission published annual performance league tables for the police services in England and Wales. For each of a series of key performance indicators, the 43 were arranged in order from top to bottom. Each table was then ‘analysed’ by being divided into four equal blocks[1], which were called the four quartiles (wrongly as it happens: quartiles are the points that divide a set of data into four equal blocks, so there are three quartiles). Finally, each block was given a label: those in the top ‘quartile’ were ‘beacons’, the second ‘striving’, the third ‘coasting’ and the fourth – yes, you’ve guessed it – ‘failing’.
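To see quite how mechanical that ‘analysis’ was, here is a minimal sketch in Python. The force names and rates are invented, not the Audit Commission’s figures; the point is simply the procedure: rank, slice into four blocks, attach labels.

```python
# Illustrative only: invented force names and robbery rates, not the
# Audit Commission's data. This reproduces the *form* of the exercise.
import random
import statistics

random.seed(1)
rates = {f"Force {i:02d}": round(random.uniform(2, 20), 1) for i in range(1, 44)}

# Rank from lowest rate (treated as 'best') to highest.
ranked = sorted(rates, key=rates.get)

# Slice the ranked list into four blocks. 43 doesn't divide by four,
# so we get blocks of 11, 11, 11 and 10 (see footnote [1]).
base, extra = divmod(len(ranked), 4)
labels = ["beacon", "striving", "coasting", "failing"]
blocks, start = {}, 0
for i, label in enumerate(labels):
    size = base + (1 if i < extra else 0)
    blocks[label] = ranked[start:start + size]
    start += size

for label, forces in blocks.items():
    print(f"{label:8s} ({len(forces)} forces): {forces[:3]} ...")

# The quartiles proper are three cut points, not four groups:
q1, q2, q3 = statistics.quantiles(rates.values(), n=4)
print("Quartile points:", q1, q2, q3)
```

Nothing in that procedure involves any judgement about why the rates differ, which is exactly the problem.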
One of the league tables was for recorded robbery (per 10,000 population, if I remember correctly). Top of the beacons was Dyfed-Powys; bottom of the failing was Merseyside. So the implication was – for what else is a beacon if not a source of light? – that Merseyside should look to Dyfed-Powys (nice two-day trip to Carmarthen, anyone?) to find out what they were doing to keep robbery so low.
This is so self-evident that it hardly needs saying, but I’ll say it anyway. The low robbery rate in Dyfed-Powys was not necessarily a reflection of better performance. (Sorry Dyfed-Powys, you may well have been the best, but we can’t draw that conclusion from these data.)
The mistake lies not in putting Dyfed-Powys at the top and Merseyside at the bottom, but in applying the labels ‘beacon’ and ‘failing’. Occupying the top position in a league table does not mean you are the best. The most basic rule of analysis is: don’t confuse evidence (‘we’re top’) with interpretation (‘we’re the best’).
The solution, as ever, is to distinguish between apparent performance (the quantitative evidence of performance indicators) and actual performance (a qualitative judgement of how well we’re doing the job).
A league table is merely apparent performance, even when it is divided into four ‘quartiles’ and given labels. What we need to know is which forces are doing genuinely well, and which have problems that need to be addressed. But because circumstances vary so widely between the 43, we may end up finding the true beacons and failures lying side by side in the middle of the league table. We won’t know until we do some real analysis, as opposed to slicing everything into four and applying spurious labels.
The analytic problem can be stated very simply: how can we explain the variation in the league table? (It isn’t quite so simple to solve, of course.) How much of the variation in apparent performance is accounted for by differences in actual performance, as opposed to variations in environment, circumstances, recording practices, or the ever-present caprice of random fluctuation?
Keeping it in the family
Daunted by the sheer magnitude of the differences between, say, Wiltshire and the Met, and driven by an irrational fear of comparing apples with oranges, police services moved away from league tables based on all 43, and towards smaller groupings defined by certain objective similarities. From the 1990s these groupings went through various iterations, called families or most similar force groups, which appeared to reduce the risk of making inappropriate comparisons.
This is fine. On one level it makes sense to compare Merseyside with Greater Manchester rather than with Dyfed-Powys. In statistical terms this reduces the variance considerably by making circumstances more comparable. But it doesn’t solve the problem, because even within such a smaller grouping, we’re still not comparing like with like.
And in any case, reducing our data set from 43 to six wastes information that may be useful in our attempts to understand actual performance. By all means compare yourself with similar forces. But if you don’t also compare yourself with everyone else, you might be missing some interesting and useful conclusions.
A good illustration of this emerged from work I did some years ago on victim satisfaction in the Met, where comparisons were made using a league table of 32 Borough Command Units.
Of the 32 boroughs in London, it would be difficult to find two more different than Newham and Richmond. Newham, the most ethnically diverse district in the country, with high levels of social deprivation, was one of the busiest police command units in London. Richmond is much less diverse (according to the 2011 census, 71.4 per cent of Richmond residents were white British, compared with 16.7 per cent in Newham), with low levels of social deprivation. In area, Richmond is about 60 per cent larger than Newham, yet its population is only around 60 per cent of Newham’s.
Apples and oranges indeed.
In October 2012 the victim satisfaction league table for the 32 Met BCUs showed Richmond at the top and Newham at the bottom. So no surprises there. A bit like comparing Dyfed-Powys and Merseyside on robbery.
Twelve months later, Richmond were still top, but Newham had risen to second – a stunning success, on a robust indicator, achieved with no additional resources and no concomitant decline in other areas of performance.
The point is this. If Newham had been compared only to other demographically similar boroughs (Tower Hamlets, Lambeth or Brent, for example, all of which were near the bottom of the table), they would undoubtedly have come top. The fact that they rose to second out of all 32 is a very compelling finding: it shows what can be achieved if victim care is given the priority it deserves.
So what?
Any statistical comparison has the potential to be both useful and misleading. This is as true for league table comparisons as it is for binary comparisons, or for comparisons against targets. It is just as true for comparisons derived from large random samples, appropriately tested for statistical significance.
This is in the nature of comparison. Whenever we use any information, statistical or otherwise, there is a danger that we will draw the wrong conclusion. So should we stop using information to draw conclusions about the world? Or should we instead approach the act of interpretation in a more rigorous way?
In the face of the impossibility of ever comparing like with like, what shall we do about league tables? We have the choice of going in either of two directions.
The first is to reject them in principle: league tables don’t compare like with like, therefore we shouldn’t use them. ‘We’re not interested in how we compare against others, we’re only interested in improvement.’ Fine words indeed, and I often hear them.
But what if we can learn something by comparing ourselves with others that might help us to improve? And what if we’re doing something really effective that others can learn from? How will we know unless we compare?
The second is to be clear-sighted about the weaknesses of league tables, but to make comparisons anyway, because that way we might learn something useful from them. Of course there are differences between any two police services, or between any two command units. There always will be. But if those differences can be identified, then they can be taken into account to assist our understanding.
The detail of how we do this is beyond the scope of this blog, but it can be done. Statistical modelling is one option. Another is a method I developed some years ago, performance profiling, which goes beyond the simple position in a league table and the crude assumption that ‘top’ means ‘best’, and allows us to draw qualitative conclusions about actual performance.
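For readers who want a feel for what the statistical-modelling route can look like, here is a minimal sketch in Python. The data, the covariates and the model are all invented for illustration; in particular, this is not performance profiling, just the general shape of the exercise: model the indicator on contextual factors, then treat what the model cannot explain as a first, crude signal of actual performance.

```python
# A purely illustrative sketch of the statistical-modelling route, with
# invented data. This is NOT the performance-profiling method referred to
# above; it simply shows the shape of the exercise.
import numpy as np

rng = np.random.default_rng(0)
n_forces = 43

# Invented contextual covariates (stand-ins for deprivation, density, etc.).
deprivation = rng.uniform(0, 10, n_forces)
density = rng.uniform(1, 50, n_forces)

# Invented 'apparent performance': a robbery rate driven mostly by context,
# plus a smaller component standing in for actual performance and noise.
performance = rng.normal(0, 1, n_forces)
robbery_rate = 2 + 0.8 * deprivation + 0.1 * density + performance

# Fit a simple linear model: rate ~ deprivation + density.
X = np.column_stack([np.ones(n_forces), deprivation, density])
coefs, *_ = np.linalg.lstsq(X, robbery_rate, rcond=None)
residuals = robbery_rate - X @ coefs

# R-squared: the share of league-table variation accounted for by context.
r_squared = 1 - residuals.var() / robbery_rate.var()
print(f"Variation explained by context alone: {r_squared:.0%}")

# Ranking by residual asks a better question than ranking by raw rate:
# who is doing better or worse than their circumstances would predict?
adjusted_order = np.argsort(residuals)
```

In this toy version the R-squared also answers, roughly, the question posed earlier: how much of the league-table variation is accounted for by circumstances alone. Whatever is left over is where the interesting questions about actual performance begin.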
Whichever method we use, we must approach the task with discipline and rigour. But is that not also true for all statistical comparisons? Any statistic is merely a starting point, which sooner or later (preferably sooner) should lead to the question ‘why?’. Why is one organisation higher in the table than another? Why has our position in the table fallen from last year?
There is nothing intrinsically wrong with league tables. Just as there is nothing intrinsically wrong with targets, or binary comparisons. The problem is that people misuse them, draw unwarranted conclusions from them. If we can learn to use performance indicators and other information in a more sophisticated way, then league tables will surely take their place as another potentially useful source of learning. To reject them out of hand because they don’t compare like with like is to close the doors on potential learning.
Last week I accidentally cut myself with a kitchen knife. As I staunched the blood with a paper towel, I reflected on what I had done and realised the wound could have been worse. Did I throw away the knife? No, of course not. I resolved to be more careful in future.
The issues addressed in this blog are covered in more detail in Malcolm Hibberd’s Performance Masterclasses (including an explanation of Malcolm’s method for analysing league tables, in Masterclass 3).
Malcolm also provides a range of courses on improving victim care.
[1] Of course, forty-three doesn’t divide into four perfectly, so there were three blocks of eleven and one block of ten.