Big data is bad data
The cost of bad data is the illusion of knowledge. – Stephen Hawking
Schools, as with almost every other organ of state, are increasingly obsessed with big data. There seem to be two main aims: prediction and control. If only we collect and analyse enough data then the secrets of the universe will be unlocked. No child will be left behind and all will have prizes.
Can we learn from the past?
No. Or at least, not in any way that helps. We can see trends, but these are far more likely to be noise than signal. When exam results are rising we take credit and when they fall we wring our hands but Ofqual say that, depending on subjects, GCSE results are likely to vary from anywhere between 9 – 19% annually! What can we helpfully learn from this sort of volatility?
In Antifragile, Nicholas Nassim Taleb says,
Assume… that for what you are observing, at a yearly frequency, the ratio of signal to noise is about one to one (half noise, half signal)—this means that about half the changes are real improvements or degradations, the other half come from randomness. This ratio is what you get from yearly observations. But if you look at the very same data on a daily basis, the composition would change to 95 percent noise, 5 percent signal. And if you observe data on an hourly basis, as people immersed in the news and market price variations do, the split becomes 99.5 percent noise to 0.5 percent signal. (p.126)
The more data you collect and the more you try to analyse it, the less you are likely to perceive. Looking at the past leads us into believing we can control the present.
Can we control the present?
No. Or at least, not with any reliability. Stuff just happens. We think we can see causation, but all we ever see are correlations. Some correlations are implausible so we dismiss them:
(And, just for fun, have a look at these examples of our supposed ability to perceive causality from Belgian philosopher, Albert Michotte.)
Others seem logically connected so we pay them great heed. This leads us to see things that aren’t there, like for instance the ‘fact’ that sex is a significant factor in achievement.
We can check to see whether teachers and students are compliant, we can check to see whether they’re performing as we might want or expect, but we can’t read minds and we can’t measure learning. We can, of course, make inferences, but these are only really valuable if we’re clear on the very real distinction between learning and performance. This muddled thinking leads schools leaders to say things like, “80% of the teaching in our school is good or better.” Is it? How do you know? What they really mean is that 80% of lessons meet with their approval. This is not the same thing. But it makes us believe we can predict the future.
Can we predict the future?
No. Or at least, not with any accuracy. What we can do is make a prediction and then have a stab at working out the probability that our guess is accurate given the data we have. But this isn’t very sexy so we fool ourselves into imagining that our data is precise and accurate, and then trying to work out the probability that our predictions are true. We routinely say things like, “Jamie will get an B grade in maths.” And then attempt to work out the likelihood the prediction is accurate. What we should be doing is saying, “Jamie achieved a B grade in his last maths test,” and then trying to work out the likelihood this result was accurate.
Here’s how the problem looks when applied to the real world:
- I watched Eastenders last night. What’s the probability I watched BBC1?
- I watched BBC1 last night. What’s the probability I watched Eastenders?
See the problem? 1 is meaningless and 2 is strictly limited.
No amount of data can accurately describe the complexity of human systems. Maybe the data we collect at Christmas might help us to predict GCSE results with a reasonable amount of accuracy, but data collected 5 years previously is much less reliable as the margin of error only grows with time: inspection frameworks change; external assessment are changed mid- year; teachers leave, new teachers arrive; students experience spurts and declines. Reality gets in the way of our careful calculations.
Because we’re so poor at dealing with uncertainty, we struggle to accurately forecast the future. All prediction is based on probability; these probabilities are risks which we calculate by analysing historical data. Uncertainty though, is different to risk. A risk describes the possible range of all known outcomes and tries to calculate the likelihood of each occurring. All too often, we don’t even know that an outcome is possible, let alone the probability of it occurring. Data management systems can only tell us what has happened in the past (And we’re not even very good at interpreting that!) If the future is different to the past – as it inevitably will be – any forecast we make will be wrong.
The conclusions we might draw from all this are:
- You don’t know what you don’t know so avoid believing you can explain complex events.
- The more data you collect the less meaningful it will be.
- You cannot predict the future, but you can calculate how good your data is.