Old Hat(tie)? Some things you ought to know about effect sizes

Ever since Hattie published Visible Learning back in 2009 the Effect Size has been king. For those of you who don’t know, an effect size is a mechanism for comparing the relative merits of different interventions. Hattie pointed out that everything that a teacher does will have some effect but that there will also be an opportunity cost: if you’re investing time in one type of intervention you will be neglecting other types of intervention which might have a greater impact. He therefore used effect sizes to try to establish what has the greatest influence on student learning so that we could all concentrate on the stuff which had the greatest impact.

According to the Teachers’ Toolbox site, an effect size of 1.0 is typically associated with:

  • Advancing learners’ achievement by one year, or improving the rate of learning by 50%
  • A correlation between some variable (e.g., amount of homework) and achievement of approximately .50
  • A two grade leap in GCSE, e.g. from a C to an A grade

Hattie decided to set the ‘hinge point’ at 0.4, stating that any intervention that came in below this figure had a negligible effect and was therefore not worth wasting teachers’ precious time on. He then went about aggregating the results of thousands of research studies to work out effect sizes for most of the stuff that gets wheeled out in classrooms.

This is what he found:

Influence Effect Size Source of Influence
Feedback 1.13 Teacher
Student’s prior cognitive ability 1.04 Student
Instructional quality 1.00 Teacher
Direct instruction .82 Teacher
Acceleration .72 Student
Remediation/feedback .65 Teacher
Student’s disposition to learn .61 Student
Class environment .56 Teacher
Challenge of Goals .52 Teacher
Peer tutoring .50 Teacher
Mastery learning .50 Teacher
Homework .43 Teacher
Teacher Style .42 Teacher
Questioning .41 Teacher
Peer effects .38 Peers
Advance organisers .37 Teacher
Simulation & games .34 Teacher
Computer-assisted instruction .31 Teacher
Testing .30 Teacher
Instructional media .30 Teacher
Affective attributes of students .24 Student
Physical attributes of students .21 Student
Programmed instruction .18 Teacher
Audio-visual aids .16 Teacher
Individualisation .14 Teacher
Finances/money .12 School
Behavioural objectives .12 Teacher
Team teaching .06 Teacher
Physical attributes (e.g., class size) -.05 School

So now we know. Giving feedback is ace and adjusting class size is pointless.

Except that there are some naysayers. Maths boff @OllieOrange tells us that there are 3 things we should know about the effect size:

  1. Mathematicians don’t use it
  2. Mathematics textbooks don’t teach it.
  3. Statistical packages don’t calculate it.

And in the comments to this blog, Dylan Wiliam sets out his objections:

Here are some reasons why I find effect sizes are misleading, taken from pages 20 to 22 of:

Wiliam, D. (2010). An integrative summary of the research literature and implications for a new theory of formative assessment. In H. L. Andrade & G. J. Cizek (Eds.), Handbook of formative assessment (pp. 18-40). New York, NY: Taylor & Francis.

The use of standardized effect sizes to compare and synthesize studies is understandable, because few of the studies included in the various reviews published sufficient details to allow more sophisticated forms of synthesis to be undertaken, but relying on standardized effect sizes in educational studies creates substantial difficulties of interpretation, for two reasons.
First, as Black and Wiliam (1998a) noted, effect size is influenced by the range of achievement in the population. An increase of 5 points on a test where the population standard deviation is 10 points would result in an effect size of 0.5 standard deviations. However, the same intervention when administered only to the upper half of the same population, provided that it was equally effective for all students, would result in an effect size of over 0.8 standard deviations, due to the reduced variance of the subsample. An often-observed finding in the literature—that formative assessment interventions are more successful for students with special educational needs (for example in Fuchs & Fuchs, 1986)—is difficult to interpret without some attempt to control for the restriction of range, and may simply be a statistical artefact.
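
The restriction-of-range arithmetic in that example can be checked with a few lines of Python. This is only a sketch, and it assumes that test scores are normally distributed (the passage does not say so): under that assumption the standard deviation of the upper half of a population is the full standard deviation multiplied by sqrt(1 - 2/pi). The function name is made up for illustration.

```python
import math


def sd_of_upper_half(sd: float) -> float:
    """SD of the top half of a normal population whose full SD is `sd`.

    Assumption: scores are normally distributed. For a normal distribution
    cut off at its mean, the variance shrinks by a factor of (1 - 2/pi).
    """
    return sd * math.sqrt(1 - 2 / math.pi)


gain = 5.0            # points gained on the test, from the example above
population_sd = 10.0  # population standard deviation, from the example above

d_full = gain / population_sd
d_upper = gain / sd_of_upper_half(population_sd)

print(f"whole population: d = {d_full:.2f}")   # 0.50
print(f"upper half only:  d = {d_upper:.2f}")  # 0.83
```

The same 5-point gain looks about two thirds larger simply because the comparison group is less spread out, which is the "over 0.8" figure quoted above.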

The second and more important limitation of the meta-analytic reviews is that they fail to take into account the fact that different outcome measures are not equally sensitive to instruction (Popham, 2007). Much of the methodology of meta-analysis used in education and psychology has been borrowed uncritically from the medical and health sciences, where the different studies being combined in meta-analyses either use the same outcome measures (e.g., 1-year survival rates) or outcome measures that are reasonably consistent across different settings (e.g., time to discharge from hospital care). In education, to aggregate outcomes from different studies it is necessary to assume that the outcome measures are equally sensitive to instruction.

It has long been known that teacher-constructed measures have tended to show greater effect sizes for experimental interventions than obtained with standardized tests, and this has sometimes been regarded as evidence of the invalidity of teacher-constructed measures. However, as has become clear in recent years, assessments vary greatly in their sensitivity to instruction—the extent to which they measure the things that educational processes change (Wiliam, 2007b). In particular, the way that standardized tests are constructed reduces their sensitivity to instruction. The reliability of a test can be increased by replacing items that do not discriminate between candidates with items that do, so items that all students answer correctly, or that all students answer incorrectly, are generally omitted. However, such systematic deletion of items can alter the construct being measured by the test, because items related to aspects of learning that are effectively taught by teachers are less likely to be included than items that are taught ineffectively.

For example, an item that is answered incorrectly by all students in the seventh grade and answered correctly by all students in the eighth grade is almost certainly assessing something that is changed by instruction, but is unlikely to be retained in a test for seventh graders (because it is too hard), nor in one for eighth graders (because it is too easy). This is an extreme example, but it does highlight how the sensitivity of a test to the effects of instruction can be significantly affected by the normal processes of test development (Wiliam, 2008).

The effects of sensitivity to instruction are far from negligible. Bloom (1984) famously observed that one-to-one tutorial instruction was more effective than average group-based instruction by two standard deviations. Such a claim is credible in the context of many assessments, but for standardized tests such as those used in the National Assessment of Educational Progress (NAEP), one year’s progress for an average student is equivalent to one-fourth of a standard deviation (NAEP, 2006), so for Bloom’s claim to be true, one year’s individual tuition would produce the same effect as 9 years of average group-based instruction, which seems unlikely. The important point here is that the outcome measures used in different studies are likely to differ significantly in their sensitivity to instruction, and the most significant element in determining an assessment’s sensitivity to instruction appears to be its distance from the curriculum it is intended to assess.
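
The "9 years" figure can be reproduced as follows. This is a rough sketch of one plausible reading, not the author's stated working: it assumes the tutored student makes the ordinary year's progress and also gains the full two standard deviation advantage on top. The variable names are illustrative.

```python
tutoring_advantage_sd = 2.0   # Bloom (1984): advantage of one-to-one tutoring
naep_annual_growth_sd = 0.25  # NAEP: one year's progress for an average student

# Extra years of ordinary progress represented by a 2 SD advantage
extra_years = tutoring_advantage_sd / naep_annual_growth_sd

# Assumption: add the one year of progress the student would have made anyway
total_years = extra_years + 1

print(extra_years, total_years)  # 8.0 9.0
```
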
Ruiz-Primo, Shavelson, Hamilton, and Klein (2002) proposed a five-fold classification for the distance of an assessment from the enactment of curriculum, with examples of each:

  1. Immediate, such as science journals, notebooks, and classroom tests;
  2. Close, or formal embedded assessments (for example, if an immediate assessment asked about number of pendulum swings in 15 seconds, a close assessment would ask about the time taken for 10 swings);
  3. Proximal, including a different assessment of the same concept, requiring some transfer (for example, if an immediate assessment asked students to construct boats out of paper cups, the proximal assessment would ask for an explanation of what makes bottles float or sink);
  4. Distal, for example a large-scale assessment from a state assessment framework, in which the assessment task was sampled from a different domain, such as physical science, and where the problem, procedures, materials and measurement methods differed from those used in the original activities; and
  5. Remote, such as standardized national achievement tests.

As might be expected, Ruiz-Primo et al. (2002) found that the closer the assessment was to the enactment of the curriculum, the greater was the sensitivity of the assessment to the effects of instruction, and that the impact was considerable. For example, one of their interventions showed an average effect size of 0.26 when measured with a proximal assessment, but an effect size of 1.26 when measured with a close assessment.

In none of the meta-analyses discussed above was there any attempt to control for the effects of differences in the sensitivity to instruction of the different outcome measures. By itself, it does not invalidate the claims that formative assessment is likely to be effective in improving student outcomes. Indeed, in all likelihood, attempts to improve the quality of teachers’ formative assessment practices are likely to be considerably more cost-effective than many, if not most, other interventions (Wiliam & Thomson, 2007). However, failure to control for the impact of this factor means that considerable care should be taken in quoting particular effect sizes as being likely to be achieved in practice, and other measures of the impact, such as increases in the rate of learning, may be more appropriate (Wiliam, 2007c). More importantly, attention may need to be shifted away from the size of the effects and toward the role that effective feedback can play in the design of effective learning environments (Wiliam, 2007a). In concluding their review of over 3,000 studies of the effects of feedback interventions in schools, colleges and workplaces, Kluger and DeNisi observed that:

considerations of utility and alternative interventions suggest that even an FI [feedback intervention] with demonstrated positive effects should not be administered wherever possible. Rather additional development of FIT [feedback intervention theory] is needed to establish the circumstance under which positive FI effects on performance are also lasting and efficient and when these effects are transient and have questionable utility. This research must focus on the processes induced by FIs and not on the general question of whether FIs improve performance— look how little progress 90 years of attempts to answer the latter question have yielded. (1996 p. 278)

So there you go. The next time someone tries to trump you in debate by quoting effect sizes you’ll know exactly how to respond.

31 Responses to Old Hat(tie)? Some things you ought to know about effect sizes

  1. CristinaM. says:

    I am glad that you brought Hattie’s research into question because it really deserves more scrutiny.
    From being a reliable aggregation of research analyses it became a Bible that many educators do not question and sometimes do not understand. I am particularly interested in research at the moment (see my latest post) and I will look at Hattie’s work in the future, too. A very interesting conversation that I had with Ron Ritchhart (Senior Research Associate at Harvard School of Education and co-author of Making Thinking Visible) was also connected to Hattie’s effect sizes and how they translate into actual impact in the classroom.
    Thanks for the post – as always, not to be missed.

  2. Michael Dorian says:

    This is a really interesting post and makes some really relevant points which I feel more people need to be aware of. I, like yourself, saw Dylan Wiliam’s tweet a couple of days ago expressing his concern with effect sizes. I have to say however, that Wiliam’s criticism amuses me somewhat given statements made in “Inside the Black Box”. On page 4 the evidence to back up the validity of formative assessment is based around effect sizes, and that these effect sizes lead to improvement in pupil performance of between 1 and 2 GCSE grades. Now I realise Wiliam and Black probably chose the studies they analysed carefully, but nevertheless it still doesn’t get away from the fact that effect sizes are at the centre of their argument. Worse still, the way they link effect sizes to improvements in GCSE performance is at best disappointing and at worst misleading, especially given the number of people who now quote this statistic as a means of defending the panacea that is formative assessment. This is especially the case if, as I suspect (and I’ll be happy to be corrected if wrong), the studies used by Black and Wiliam are based on classroom tests taken shortly after the formative assessment technique has been employed, rather than tests taken at the end of the year. Given this, I think the link to GCSE improvement is significantly oversold based on the difference between short term retention of information and long term retention. Furthermore, given teachers are in the game of long term retention it is really important that the validity of effect sizes is taken with a significant pinch of salt.

  3. Asbro says:

    Interesting that Hattie’s research needs to be brought into question but I would have thought that this sort of research is never going to give a definitive answer and always needs to be taken with a pinch of salt. The measurement will never have all the variables under control and so some sort of statistical analysis needs to be done. Statisticians and researchers need to tell us how reliable their data is and include some sort of error bars in their results. How, I don’t know, but if we are to make progress on this sort of thing and base decisions on these results we need less infighting about how to do it and more definite indicators about which direction to go. This sort of thing is a little destructive since we do need to base our decisions on evidence rather than past practice and culture and so criticisms like this need to also suggest what can be done about such misleading information and show a way forward. Merely criticising is not enough.
    How do we make sense of this quagmire and make the right informed decisions?

  4. […] Ever since Hattie published Visible Learning back in 2009 the Effect Size has been king.  […]

  5. […] also by David Didau in this critique of ‘effect-sizes’ and their statistical validity here), what implications am I left with? […]

  6. Dylan Wiliam says:

    Michael Dorian (see above) is amused that Paul Black and I used effect sizes 15 years ago in our attempt to communicate to teachers and policy-makers the magnitude of the impact that classroom formative assessment might have on student learning (“Inside the black box”). In the academic review of research on which the IBB booklet was based, we explicitly avoided any attempt to summarize the net effect, not least because we were aware that the durations of the experiments that we reviewed were very variable, as were the magnitudes of the effect sizes found, and the measures used differed greatly in their sensitivity to the effects of teaching. To be honest, however, while we realized there were some problems with using effect sizes (one section of the academic paper is entitled “No meta-analysis”), it is only within the last few years that I have become aware of just how many problems there are. Many published studies on feedback, for example, are conducted by psychology professors, on their own students, in experimental sessions that last a single day. The generalizability of such studies to school classrooms is highly questionable.

    Another point that I have only recently understood well is the impact of the limited power of most educational and psychological experiments on meta-analyses. “Statistical power” is usually defined as the chance that a given experiment will produce a statistically significant result. If we know that a particular intervention has a given average impact, in some cases it will have a bigger impact, and in others, a lesser one. The important point is that most academic journals have a bias towards publishing statistically significant findings, so the studies that get published are those that “got lucky” in that they happened to be the ones where the effect of the intervention was larger than average. Some estimates indicate that the average statistical power of experiments in psychology and education is as low as 40% (and one study of those in neuroscience suggested a value of 20%). This means that the studies that actually get published vastly over-estimate the effects of the interventions being researched.
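
A small simulation shows how low power and a preference for significant results can combine to inflate published effect sizes. This is a minimal sketch under stated assumptions, none of which come from the comment itself: a true effect of 0.3, two groups of 65 (chosen so that power comes out near 40%), a normal approximation to the sampling error of d, and a journal that publishes only results with p < .05.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_D = 0.3       # assumed true effect size (illustrative only)
N_PER_GROUP = 65   # assumed group size, giving roughly 40% power
N_STUDIES = 20_000

# Normal approximation to the standard error of d for two equal groups
se = math.sqrt(2 / N_PER_GROUP)

# Each simulated study estimates the true effect with sampling error
estimates = [random.gauss(TRUE_D, se) for _ in range(N_STUDIES)]

# Assumption: only two-sided p < .05 results get "published"
published = [d for d in estimates if abs(d) / se > 1.96]

power = len(published) / N_STUDIES
print(f"share of studies reaching significance: {power:.2f}")
print(f"true effect:                 {TRUE_D:.2f}")
print(f"mean of all estimates:       {statistics.mean(estimates):.2f}")
print(f"mean of 'published' studies: {statistics.mean(published):.2f}")
```

Under these assumptions the average of all studies sits close to the true 0.3, while the average of the "published" subset comes out well above it, at roughly one and a half times the true value.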

    In retrospect, therefore, it may well have been a mistake to use effect sizes in our booklet “Inside the black box” to indicate the sorts of impact that formative assessment might have. It would have been better to talk about extra months of learning which takes into account the fact that the annual growth in achievement, when measured in standard deviations, declines rapidly in primary school (one year’s growth is over 1 standard deviation for five-year-olds, and only around 0.4 for 11-year-olds). That said, in answer to Michael Dorian’s question, in arriving at our subjective estimate of 0.4 to 0.7 standard deviations for the impact of formative assessment, we did rely more on studies that were classroom-based, over extended periods of time, and which used standardized measures of achievement.
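
To see why months of learning depend so heavily on age, here is a rough conversion sketch. It assumes a simple linear conversion with 12 months of learning per year of typical growth, and it uses 1.0 as a stand-in for "over 1" standard deviation; the function name is hypothetical.

```python
def months_of_learning(effect_size: float, annual_growth_sd: float) -> float:
    """Convert an effect size into months of typical progress.

    `annual_growth_sd` is one year's typical growth, in standard deviations,
    for the age group in question. Assumes a simple linear conversion.
    """
    return 12 * effect_size / annual_growth_sd


effect = 0.4  # lower end of the 0.4 to 0.7 estimate quoted above

# Annual growth figures from the comment; 1.0 stands in for "over 1"
for label, growth in [("five-year-olds", 1.0), ("11-year-olds", 0.4)]:
    print(f"{label}: about {months_of_learning(effect, growth):.1f} months")
```

On these figures the same effect size of 0.4 is worth under five months of progress for five-year-olds but a full year for 11-year-olds.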

    I do still think that effect sizes are useful (and are far more useful than just reporting levels of statistical significance). If the effect sizes are based on experiments of similar duration, on similar populations, using outcome measures that are similar in their sensitivity to the effects of teaching, then I think comparisons are reasonable. Otherwise, I think effect sizes are extremely difficult to interpret.

    • Michael Dorian says:

      In reply to your post I first want to clarify that I did not say I was amused by your use of effect sizes 15 years ago. What I said was that I was amused by your recent tweet concerning the validity of effect sizes given the important part it plays in making the argument stated in “Inside the Black Box”. However, having read your post I can now see that your understanding of effect sizes, as you acknowledge in your post, has developed significantly over the last 15 years. Nevertheless, this still doesn’t get away from the fact that comments relating to improvements in GCSE grades or international performance are at best weakly supported by effect sizes. Moreover, the significance of the effect sizes calculated in the studies you looked at also needs to be examined from another perspective.

      It is clear that, all other things being equal, multiple studies returning a similar positive result are far more significant than one study returning a positive result. However, as I have alluded to before, this significance is only useful if the pupils in question are, at some later date from the original test, still able to produce significantly better test results or understanding. This is because, as I hope most would agree, teachers and students are in the learning game. Knowing something today is useful, but knowing it a lifetime later and being able to apply that knowledge in different situations is the true test of learning and far more useful.

      Having said all that, I feel it is vital to make one last point given much of what I have said will come across as criticism. I believe that formative assessment has its place in education and that some measures, especially effective feedback, can support learning. Nevertheless, it is not the only mechanism that works. Teachers, SMT and government need to be aware of this.

  7. OK. There is a problem here. As far as I can see almost everyone struggles with statistics (and I mean everyone, including scientists). It may be that effect sizes are the best tool that we have at the moment, but are they good enough to influence implementable policy?? Dylan Wiliam says in his comment above that the things you are comparing have to be similar for that comparison or synthesis to be meaningful (I think?). How often is that the case? I suspect that it is very difficult to synthesise a large number of studies, and only choose those that are very similar. This is important stuff. If effect sizes are not good enough to make policy on, let’s not use them…..

    (Generally, I have a concern that education is so complex that the context of the intervention is key [and too complex to be removed by statistical analysis]; but I am not a statistician).

  8. neilatkin63 says:

    How similar can classes be? or even the same class at different times of the day? let alone classes in different schools. I worry about how some people use this research data.
    A certain strategy might be perfect for a certain student at a certain time, but be hopeless when they are tired, hungry, hormonal…..
    To me the best teachers tune in to what is in front of them at the time rather than following some standardised single approach statistically significant ‘better’ style that may be completely inappropriate at the time.
    Look at the research as it has some value, know many ways of teaching and use them wisely. We are teaching human beings with all their irrationality and wonderful quirks! not some standardised factory raw materials!

    • Kelly says:

      This is one of the reasons why testing and judging children on their grades/levels is madness to me.

      Unfortunately, I feel, performing replaces learning in most schools at the moment.

      Bit off the point here but it’s what came to mind as I read all this about the unreliability of effect sizes. Grades and levels are just as unreliable!

  9. […] point’ of 0.4, although the average effect size is 0.79 (It’s worth reading one of my previous blog posts for explanation and critique of effect sizes.)So, as Hattie himself acknowledges, some forms of feedback are more effective than others. He […]

  10. […] are more influential than others (see Hattie’s Visible Learning and effect-sizes discussed by David Didau ) but even researchers are cautious to claim that once a strategy is transferred to another […]

  11. […] such a large difference. The results do need to be treated with caution (thanks to @LearningSpy for this) but calculating the effect size was hugely preferable to “I just know it […]

  12. […] Didau also addressed the limitations of effect-sizes and referenced Dylan Wiliam (and some of these “simple artifacts”) as Tom Sherrington […]

  13. […] teachers must do more of it. Certainly Hattie, the Sutton Trust and the EEF bandy about impressive effect sizes, but the evidence of flipping through a pupil’s exercise book suggests that the vast […]

  14. Like all descriptive statistics, effect sizes conceal all sorts of details. The original studies from which the meta studies come also have their own assumptions that we will never know from the meta studies. We know all this. Surely the point is not to spend too much time criticising the data analysis – leave that to the experts. Instead let’s look for what improves learning and see how we can make it work for us as unique individuals in the classroom.

  15. […] are worth keeping in mind. There’s some good exploration of these criticisms in recent blogs by The Learning Spy and OllieOrange2. It seems a sensible suggestion to divide analysis more clearly into age bands. As […]

  16. […] but the paper is here. FIND HIS PAPER (ICING OR OLLIE OR SOMEONE ELSE? ) His lengthy comment in reply to this post is also well worth […]

  17. […] which are probably fairly well-known by now e.g. David Weston’s ResearchEd 2013 talk; Learning Spy’s post which is related to Ollie Orange’s . If you read the introduction to Visible Learning, or […]

  18. smcnaugh says:

    Can someone explain how there are effect sizes reported that exceed 1.0?

  19. […] For John Hattie, the answer has been to compare effect sizes from different kinds of intervention. An effect size is a way of working out a standardised effect across different studies. It is basically the difference between the average of some kind of score from two comparison groups divided by the size of the spread of the data (the standard deviation). The problem is that Hattie perhaps uses this idea too generally. Can we really compare an effect size from a before-versus-after study with one from a control-versus-intervention study? If we narrow the student population e.g. by studying a high ability group then we will narrow the spread of results and so inflate the effect size. If we use standardised tests designed under psychometric principles then we will typically see less difference between groups and therefore reduce the effect size – this is due to the way that standardised tests are constructed. […]
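
The calculation described there can be sketched in a few lines. This is one common version (the difference in means divided by a pooled standard deviation), not necessarily the one used in any particular study, and the scores are made up purely for illustration.

```python
import math
import statistics


def standardised_mean_difference(treatment, control):
    """Difference in group means divided by the pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    var_t = statistics.variance(treatment)  # sample variance
    var_c = statistics.variance(control)
    pooled_sd = math.sqrt(
        ((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2)
    )
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd


# Made-up test scores, purely for illustration
control = [48, 52, 55, 60, 45, 50, 58, 62]
intervention = [53, 57, 60, 65, 50, 55, 63, 67]  # every score shifted up by 5

d = standardised_mean_difference(intervention, control)
print(round(d, 2))
```

Because the result is the gain divided by the spread, anything that narrows the spread of scores raises the number even when the gain in points stays the same.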

  20. Dave says:

    I’m currently in a situation where people are constantly “quoting” Hattie to support the idea that students should never be retained because retention is always bad. And here we have the perennial problem of confusing correlation with causation.

    Ok, students who were retained go on to do poorly in school. Fair enough. There’s a clear correlation between retention and doing poorly later on. There is no evidence, however, to the idea that retention CAUSED a student to do poorly in school later on. The student was already having problems which led to the decision to retain. Retention may not have solved all the original problems, and the student continued to have the same problems and then did poorly in school.

    What we don’t know is what would have happened if the student had NOT been retained. Would the problems have been worse? Would the school have been an accelerated drop-out factory instead of just a regular one? Maybe retention served to mitigate somewhat.

    Ok, retention alone does not solve all problems. I get that. But I don’t see any evidence that retention is always bad, that it’s the worst decision for a student. There are other factors that have a bigger negative “effect.” (though I argue that retention may not have yielded ANY negative effect.)

    Plus, to make one last point, Hattie’s research and “conclusions” are so broad and vague that they’re really not helpful. On retention, there’s no detail. Retained at what grade level? It doesn’t seem fair or helpful to lump a 1st grader and 9th grader together. It only shows that very GENERALLY (not always) those who are retained go on to do poorly. Shocker.

    So, if I may ask a favor of one of you big-name academics :-), please publish an article critiquing Hattie and reminding us all of the DIFFERENCE between correlation and causation. Much appreciated.

  21. There are quite a few peer reviews which question Hattie’s techniques and conclusions – see Prof Ivan Snook, et al, “Invisible Learnings?” and Prof Ivo Arnold.

    Snook’s critique is well worth reading – e.g., “Hattie says that he is not concerned with the quality of the research in the 800 studies but, of course, quality is everything. Any meta-analysis that does not exclude poor or inadequate studies is misleading, and potentially damaging if it leads to ill-advised policy developments. He also needs to be sure that restricting his data base to meta-analyses did not lead to the omission of significant studies of the variables he is interested in.”

    And Prof Arnold –
    “A great asset of Hattie’s book is the reference list, which allows the inquisitive reader to dig a little bit deeper, by moving from the rankings to the underlying meta-studies. I have done this for the top-ranking influence, which is ‘self reported grades’ (d = 1.44). This result is dominated by the Kuncel et al. (2005) meta-analysis (d = 3.1) (Kuncel et al. 2005). This paper is about the validity of ex-post self-reported grades (due to imperfect storage and retrieval from memory or intentional deception), not about students’ expectations or their predictive power of their own study performance, as Hattie claims. The paper thus should not have been included in the analysis. My N = 1 sampling obviously has its limits, but this example does raise questions regarding the remaining average effect sizes.”

  22. […] Didau, D. (2014, 2014-01-24). Old Hat(tie)? Some things you ought to know about effect sizes – David Didau: The Learning Spy. Retrieved from http://www.learningspy.co.uk/myths/things-know-effect-sizes/ […]

  23. […] positive impact on learning. For balance you may wish to also read @LearningSpy’s blog post summarising some objections to the use of effect size and this post questioning the statistics in Hattie’s work) to students or facilitate more […]

  24. Steve H says:

    The “effect size” John Hattie promotes appears to be a pre-post standardised mean difference. The concept of an effect size is premised on having a control group. If there is a controlled study, it’s possible to ascribe an effect to a cause.

    With pre-post assessment, there’s just a change over time. Dividing the mean difference by the standard deviation just gives a standardised mean difference. Call it “effect size” by all means, but it is not possible to attribute change to any specific effect because multiple effects are at play. So what is the point of comparing pre-post standardised mean differences with effect sizes, even if the effect sizes are based on solid studies?

    I agree there are a range of other issues, some of which are canvassed above.

  25. Agreed, Steve. I’m going through the meta-analyses and have been surprised to find most do not use pre-post tests, but rather correlation. Then correlation (r) is morphed into an effect size (d) by the formula 2r/SQRT(1-r^2). Note that a rather weak correlation of r = 0.45 gets converted into a massive effect size of d = 1.00 (in Hattie’s terms, 2.5 years of progress!)
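
The conversion is easy to try for yourself. This sketch simply implements the formula as quoted in the comment, d = 2r/sqrt(1 - r^2); the function name is made up, and whether every meta-analysis in question used exactly this conversion is not something confirmed here.

```python
import math


def r_to_d(r: float) -> float:
    """Convert a correlation r to an effect size d via d = 2r / sqrt(1 - r^2)."""
    return 2 * r / math.sqrt(1 - r ** 2)


for r in (0.1, 0.3, 0.45, 0.5):
    print(f"r = {r:.2f}  ->  d = {r_to_d(r):.2f}")
```

For r = 0.45 this gives a d of about 1.0, and because the denominator shrinks as r grows, d rises faster and faster for stronger correlations.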

    Hattie sneaks all these correlation studies into his synthesis with no explanation, in fact in Chp2 of his book he says he mostly uses the method you describe above – but this is not true. For example, the 2 highest effect sizes ‘self report’ and ‘piagetian programs’ are all correlation studies. So for self report, kids are asked what they will get in a test and this is correlated with what they actually get (although some studies don’t even do this). Understandably the correlation is high, which is then converted to a massive effect size.

    For the controversial effect ‘class size’, many studies simply correlate number of staff at a school (often including non teaching staff) with some achievement measure of the whole school. There are no pre-post tests. Similar for ability grouping.

    I’m trying to find the rationale for converting a correlation into an effect size, but I can’t find anything. If anyone can help that would be great.

    I’ve got more detailed analysis here – http://visablelearning.blogspot.com/p/effect-size.html

  26. […] around the use of effect sizes and meta-analyses of this sort, see here (for some defence) and here, here, here or here for some background to the […]

  27. […] Hattie became the Chair of AITSL, it was clear, even to tertiary statistics students, that serious mathematical errors had been made. There continues to be a steady flow of journal articles contesting Hattie’s […]
