Can all of learning be summed up by test scores?

Contrary to popular opinion, I’m not all that bothered about test scores. I mean, obviously I’d far prefer pupils did well rather than poorly on a summative exam, particularly if it is likely to have some bearing on their future life chances – who wouldn’t? – but I’m certainly not interested in raising test scores for the sake of raising test scores.

Which is why I feel taken aback when people say things like this:

The simple answer to this leading question is, no. Like most people involved in education I want students to have the best possible chance of leading happy, productive lives; to go out into the world and flourish. Academic success is absolutely not necessarily a particularly important thing in and of itself. I just happen to believe that doing better in school increases the range of options open to us and is thus more likely to result in what I want for young people.  Children will learn a great many things – positive and negative – that no test will ever measure and for which there will be no certification. Test scores are a very imperfect proxy for establishing whether children have in fact achieved some measure of academic success. And that’s it. They have no inherent value.

That said, test scores are a pretty good proxy for establishing whether an educational intervention is likely to be worthwhile investing in. Most education research uses effect sizes to make it possible for us to compare which interventions are likely to be more profitable than others and these effect sizes are based on scores in tests.* We’re all prone to a wide range of cognitive biases which prevent us from being able to evaluate the effectiveness of a strategy in isolation. We may think something is working well, but our hopes and preferences blind us to reality; if our preferred approach leads to no or negligible impact on test scores then we should start to consider the prospect that we might be wrong. In order to raise ourselves above the anecdotal we design studies to try to establish if what we want to believe is actually real. Failure to do so means we are piddling around with naive, pre-scientific ideas about how the world works and thus we can’t expect anyone to take our claims seriously.
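The effect sizes mentioned above are straightforward to compute from two groups' test scores. Here is a minimal sketch of Cohen's d, one common effect-size measure; the scores are invented purely for illustration:

```python
# Cohen's d: standardised difference between two groups' mean test scores.
# All numbers below are made up for the sake of the example.
from statistics import mean, stdev

def cohens_d(treatment, control):
    """Difference in group means divided by the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = stdev(treatment), stdev(control)
    pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(treatment) - mean(control)) / pooled_sd

# Hypothetical post-test scores for an intervention class and a control class
intervention_scores = [68, 72, 75, 70, 74, 69, 73]
control_scores      = [65, 67, 70, 66, 68, 64, 69]

print(round(cohens_d(intervention_scores, control_scores), 2))
```

Because the measure is standardised, effect sizes from studies that used quite different tests can be placed on roughly the same scale and compared, which is what makes them useful for deciding where to invest.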

But, this doesn’t mean the only tests that can yield valid results are academic tests. Let’s say you want to claim not that your preferred methodology increases academic performance, but that it increases creativity. Or collaboration, or whatever. The first thing you have to do is to clearly define the construct you want to see an increase in. In the case of creativity, this is tricky as not everyone will agree on a definition. Most of the tests cited in support of efforts to raise creativity actually measure something called ‘divergent thinking’. This is normally defined as coming up with as many different solutions to a problem as possible. Here’s what Wikipedia says:

Divergent thinking is a thought process or method used to generate creative ideas by exploring many possible solutions. It is often used in conjunction with its cognitive colleague, convergent thinking, which follows a particular set of logical steps to arrive at one solution, which in some cases is a ‘correct’ solution. By contrast, divergent thinking typically occurs in a spontaneous, free-flowing, ‘non-linear’ manner, such that many ideas are generated in an emergent cognitive fashion. Many possible solutions are explored in a short amount of time, and unexpected connections are drawn. After the process of divergent thinking has been completed, ideas and information are organized and structured using convergent thinking.

This is reasonably clear and you can design tests to measure this construct without too much difficulty – tests such as asking participants to think of as many uses of a paperclip as possible in a limited time. This means we could design a study where we split kids into two or more randomised groups and give them all the paperclip test. Then we would give one group our creativity intervention and the others would either get no intervention or some other strategy designed to increase performance in a test of divergent thinking. Then, all the participants would redo the test – or a variant of it – and we could see if anyone’s test scores increased, determine whether the increase might have occurred by chance, and check whether one group’s increase is greater than the others’.
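The design above can be sketched in a few lines of code. Everything here is hypothetical – invented pupils, invented paperclip-test scores, and an assumed intervention effect – but it shows the shape of the comparison:

```python
# A sketch of the study design: two randomised groups sit a divergent-thinking
# (paperclip) test before and after an intervention, and we compare the
# groups' average score gains. All data below are invented.
import random
from statistics import mean

random.seed(42)
pupils = list(range(20))
random.shuffle(pupils)                      # randomise allocation to groups
intervention_group = pupils[:10]
control_group = pupils[10:]

# Pre-test: number of paperclip uses each pupil lists (invented)
pre = {p: random.randint(5, 15) for p in pupils}

# Post-test: suppose, for illustration, the intervention adds a few extra
# ideas per pupil while the control group improves only slightly
post = {p: pre[p] + random.randint(2, 6) for p in intervention_group}
post.update({p: pre[p] + random.randint(0, 2) for p in control_group})

def mean_gain(group):
    return mean(post[p] - pre[p] for p in group)

print(f"intervention gain: {mean_gain(intervention_group):.1f}")
print(f"control gain:      {mean_gain(control_group):.1f}")
# A real study would then test whether the difference in gains could have
# arisen by chance (e.g. with a t-test or a permutation test).
```

The point of the sketch is the structure – randomisation, a pre-test, an intervention for one group only, and a post-test – not the made-up numbers.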

Although I’d be more than happy to agree that one group was measurably better than the others, I’d still be wary of claiming we’d found a way of increasing creativity because divergent thinking isn’t the same thing. We probably don’t actually want to encourage people to list lots of improbable uses for paperclips; what we really want is for people to have new and useful ideas.

The point is this: if there’s no way for us to measure what you think is important, then we only have your word to go on that what you propose is worthwhile. We know your word isn’t good enough because we know how prone human beings are to making very predictable mistakes. That’s why we have science. If you want to suggest spending curriculum time on increasing students’ situational engagement instead of on more traditional academic pursuits, then the burden of proof is with you. I suggest that you establish how you will measure the benefits you hope to see and then conduct a fair test in which spending time on engagement activities is compared against teaching academic content. If you can show that your approach leads to a measurable improvement in something then I promise to take your claims seriously and consider whether this improvement is more beneficial than helping students get the best exam results possible.

11 Responses to Can all of learning be summed up by test scores?

  1. David F says:

    Hi David, it would be interesting if you could chat with Diane Ravitch, former ed secretary in the US who has written a lot of great stuff about the history of ed reform (she’s on the right side of most things), but is strongly against the use of testing for ed policy. See here for an article about her concerns:

    • David Didau says:

      Thanks – I agree with her on the use of test scores to hold teachers to account. That is certainly poor practice. But this is dodgy thinking:

      “They are heavily affected by demography, so the kids from the most advantaged, high-income homes come out at the top and those from the least advantaged and lowest income are at the bottom,” she said. “So what you’re measuring is family income. The norm on all the standardized tests is they rank kids by family income. That’s simply the fact.”

      Just because scores average in this way doesn’t discount the experience of individuals. To reject test scores on the grounds of demography is to admit defeat and say that children from lower income backgrounds can never compete. This is patently untrue.

      Her criticism of PISA etc. shows a misunderstanding of what it’s for. Everyone knows the rankings are a nonsense, but the in-country comparisons from year to year reveal much about how an education system is performing and whether the trend is positive or negative. For instance, the fact that England’s performance is flat is sometimes taken as a bad thing, but might actually reveal real stability. Also, the correlations between test scores and educational practices are fascinating.

      Here’s my view on the value of tests:

  2. What tests would you or could you devise for measuring the worth of being in a school play, devising a dance, making a video or photographing your family?

    • Rich says:

      Depends on what you intend to get out of those activities.

    • David Didau says:

      I’m not aware of any objective tests of ‘worth’, Michael. Being in some school plays is – at least in my opinion – worth much more than being in other school plays. Devising a crap dance isn’t worth much, nor is taking a few hasty holiday snaps. All these activities would, I think, contain far more ‘worth’ if they sought to measure the specific impact of something.

      • 1. How do we know if it’s a crap dance or not until we’ve devised it?
        2. Are you saying that we only know the ‘worth’ of an arts activity if we have devised some kind of measure of ‘the impact of something’? On audience? On those who create the specific art? Or some other criteria? Or, slightly different, are you saying: the standard of arts activities would rise if we did devise worth-criteria?

  3. Michael Pye says:

    I like the example of a non-academic test (divergent thinking) being used to measure comparative effectiveness. You have shown how the outcome measure is always problematic and vulnerable to assault in any study (test scores or another proxy).

    I wonder how many people stopped reading carefully after that point and just jumped to the erroneous conclusion that all research is pointless.

  4. Mark Featherstone-Witty says:

    Marking is a huge topic and it’s hard to know where to begin. In HE, we are steadily, we hope, reducing the focus on marks. Why? Once you leave for work, you are never marked again. Instead, you have to listen to what people are saying to you. At the end of modules, we separate formative from summative feedback by two weeks. If you read the formative feedback carefully, you can work out what mark you’ll be given. (But, sadly, we are obliged to provide marks.) At LIPA, we teach all the skills needed to put on an event. For our performance disciplines, graduates will not need degrees to work as performers. No-one is buying an album because the singer/songwriter has a degree.

    We’ve started a sixth form, have started a primary school and want to start a high school. Our primary school is awash with data, but mainly for grown-ups.

