Old Hat(tie)? Some things you ought to know about effect sizes
Ever since Hattie published Visible Learning back in 2009 the Effect Size has been king. For those of you who don’t know, an effect size is a way of comparing the relative merits of different interventions: roughly, the difference an intervention makes, expressed in standard deviations. Hattie pointed out that everything a teacher does will have some effect, but that there will also be an opportunity cost: if you’re investing time in one type of intervention you will be neglecting other types which might have a greater impact. He therefore used effect sizes to try to establish what has the greatest influence on student learning so that we could all concentrate on the stuff which has the greatest impact.
According to the Teachers’ Toolbox site, an effect size of 1.0 is typically associated with:
- Advancing learners’ achievement by one year, or improving the rate of learning by 50%
- A correlation between some variable (e.g., amount of homework) and achievement of approximately .50
- A two-grade leap in GCSE, e.g. from a C to an A grade
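If you want to see where numbers like these come from, here is a minimal sketch of one common effect size, Cohen's d with a pooled standard deviation. Nothing in Hattie or the Teachers’ Toolbox list says this exact formula was used for any given study, so treat the choice as an assumption; the test scores are made up for illustration, and the conversion from a correlation r to d is the standard textbook approximation.

```python
import math
import statistics


def cohens_d(treated, control):
    """Difference in means divided by the pooled sample standard deviation."""
    n1, n2 = len(treated), len(control)
    v1 = statistics.variance(treated)   # sample variance (n - 1 denominator)
    v2 = statistics.variance(control)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd


def d_from_r(r):
    """Standard conversion from a correlation r to an equivalent d."""
    return 2 * r / math.sqrt(1 - r ** 2)


# Hypothetical scores, invented purely for illustration.
treated_scores = [60, 65, 70, 75, 80]
control_scores = [50, 55, 60, 65, 70]

print(round(cohens_d(treated_scores, control_scores), 2))
print(round(d_from_r(0.5), 2))
```

On that conversion a correlation of .50 comes out a little above 1.0, which squares with the "approximately" in the list above.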
Hattie decided to set the ‘hinge point’ at 0.4, stating that any intervention that came in below this figure had a negligible effect and was therefore not worth wasting teachers’ precious time on. He then went about aggregating the results of thousands of research studies to work out effect sizes for most of the stuff that gets wheeled out in classrooms.
This is what he found:
| Influence | Effect Size | Source of Influence |
|---|---|---|
| Student’s prior cognitive ability | 1.04 | Student |
| Student’s disposition to learn | .61 | Student |
| Challenge of Goals | .52 | Teacher |
| Simulation & games | .34 | Teacher |
| Affective attributes of students | .24 | Student |
| Physical attributes of students | .21 | Student |
| Physical attributes (e.g., class size) | -.05 | School |
So now we know. Giving feedback is ace and adjusting class size is pointless.
Except that there are some naysayers. Maths boff @OllieOrange tells us that there are 3 things we should know about the effect size:
- Mathematicians don’t use it.
- Mathematics textbooks don’t teach it.
- Statistical packages don’t calculate it.
And in the comments to this blog, Dylan Wiliam sets out his objections:
Here are some reasons why I find effect sizes are misleading, taken from pages 20 to 22 of:
Wiliam, D. (2010). An integrative summary of the research literature and implications for a new theory of formative assessment. In H. L. Andrade & G. J. Cizek (Eds.), Handbook of formative assessment (pp. 18-40). New York, NY: Taylor & Francis.
The use of standardized effect sizes to compare and synthesize studies is understandable, because few of the studies included in the various reviews published sufficient details to allow more sophisticated forms of synthesis to be undertaken, but relying on standardized effect sizes in educational studies creates substantial difficulties of interpretation, for two reasons.
First, as Black and Wiliam (1998a) noted, effect size is influenced by the range of achievement in the population. An increase of 5 points on a test where the population standard deviation is 10 points would result in an effect size of 0.5 standard deviations. However, the same intervention when administered only to the upper half of the same population, provided that it was equally effective for all students, would result in an effect size of over 0.8 standard deviations, due to the reduced variance of the subsample. An often-observed finding in the literature—that formative assessment interventions are more successful for students with special educational needs (for example in Fuchs & Fuchs, 1986)—is difficult to interpret without some attempt to control for the restriction of range, and may simply be a statistical artefact.
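(A quick aside that is not part of Wiliam's text: the "over 0.8" figure can be checked in a few lines. The sketch below assumes the scores are normally distributed, which the passage does not state, and that "the upper half" means everyone above the mean. Under those assumptions the standard deviation of the upper half is the population standard deviation times the square root of 1 − 2/π, and the same 5-point gain works out at about 0.83. The function names are hypothetical.)

```python
import math
import random
import statistics

POP_SD = 10.0   # population standard deviation from the example
GAIN = 5.0      # points gained by every student


def upper_half_sd(pop_sd):
    """SD of the part of a normal distribution lying above its mean."""
    return pop_sd * math.sqrt(1 - 2 / math.pi)


def simulated_upper_half_sd(pop_sd, n=200_000, seed=0):
    """Monte Carlo check of upper_half_sd using simulated scores."""
    rng = random.Random(seed)
    scores = [rng.gauss(0.0, pop_sd) for _ in range(n)]
    top_half = [s for s in scores if s > 0.0]
    return statistics.pstdev(top_half)


whole_population_effect = GAIN / POP_SD            # 0.5
upper_half_effect = GAIN / upper_half_sd(POP_SD)   # about 0.83

print(whole_population_effect, round(upper_half_effect, 2))
```

The intervention has not changed at all between the two calculations; only the spread of the group it was measured on has.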
The second and more important limitation of the meta-analytic reviews is that they fail to take into account the fact that different outcome measures are not equally sensitive to instruction (Popham, 2007). Much of the methodology of meta-analysis used in education and psychology has been borrowed uncritically from the medical and health sciences, where the different studies being combined in meta-analyses either use the same outcome measures (e.g., 1-year survival rates) or outcome measures that are reasonably consistent across different settings (e.g., time to discharge from hospital care). In education, to aggregate outcomes from different studies it is necessary to assume that the outcome measures are equally sensitive to instruction.
It has long been known that teacher-constructed measures have tended to show greater effect sizes for experimental interventions than obtained with standardized tests, and this has sometimes been regarded as evidence of the invalidity of teacher-constructed measures. However, as has become clear in recent years, assessments vary greatly in their sensitivity to instruction—the extent to which they measure the things that educational processes change (Wiliam, 2007b). In particular, the way that standardized tests are constructed reduces their sensitivity to instruction. The reliability of a test can be increased by replacing items that do not discriminate between candidates with items that do, so items that all students answer correctly, or that all students answer incorrectly, are generally omitted. However, such systematic deletion of items can alter the construct being measured by the test, because items related to aspects of learning that are effectively taught by teachers are less likely to be included than items that are taught ineffectively.
For example, an item that is answered incorrectly by all students in the seventh grade and answered correctly by all students in the eighth grade is almost certainly assessing something that is changed by instruction, but is unlikely to be retained in a test for seventh graders (because it is too hard), nor in one for eighth graders (because it is too easy). This is an extreme example, but it does highlight how the sensitivity of a test to the effects of instruction can be significantly affected by the normal processes of test development (Wiliam, 2008).
The effects of sensitivity to instruction are far from negligible. Bloom (1984) famously observed that one-to-one tutorial instruction was more effective than average group-based instruction by two standard deviations. Such a claim is credible in the context of many assessments, but for standardized tests such as those used in the National Assessment of Educational Progress (NAEP), one year’s progress for an average student is equivalent to one-fourth of a standard deviation (NAEP, 2006), so for Bloom’s claim to be true, one year’s individual tuition would produce the same effect as 9 years of average group-based instruction, which seems unlikely. The important point here is that the outcome measures used in different studies are likely to differ significantly in their sensitivity to instruction, and the most significant element in determining an assessment’s sensitivity to instruction appears to be its distance from the curriculum it is intended to assess.
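(Another aside, again not part of Wiliam's text: here is one reading of how the 9 years comes about. It assumes the tutored student makes the ordinary year's progress plus the whole two-standard-deviation advantage, with each ordinary year worth a quarter of a standard deviation. The passage does not spell the calculation out, so this is an interpretation, and the function name is hypothetical.)

```python
TUTORING_ADVANTAGE_SD = 2.0   # Bloom's claimed advantage of one-to-one tutoring
YEARLY_PROGRESS_SD = 0.25     # one year's progress on NAEP, in standard deviations


def equivalent_years(advantage_sd, yearly_progress_sd, years_of_tuition=1):
    """Years of average group instruction matching the tutored student's gain."""
    extra_years = advantage_sd / yearly_progress_sd
    return years_of_tuition + extra_years


print(equivalent_years(TUTORING_ADVANTAGE_SD, YEARLY_PROGRESS_SD))
```

The two-standard-deviation advantage on its own is worth eight ordinary years; adding the year the student would have progressed anyway gives nine.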
Ruiz-Primo, Shavelson, Hamilton, and Klein (2002) proposed a five-fold classification for the distance of an assessment from the enactment of curriculum, with examples of each:
- Immediate, such as science journals, notebooks, and classroom tests;
- Close, or formal embedded assessments (for example, if an immediate assessment asked about number of pendulum swings in 15 seconds, a close assessment would ask about the time taken for 10 swings);
- Proximal, including a different assessment of the same concept, requiring some transfer (for example, if an immediate assessment asked students to construct boats out of paper cups, the proximal assessment would ask for an explanation of what makes bottles float or sink);
- Distal, for example a large-scale assessment from a state assessment framework, in which the assessment task was sampled from a different domain, such as physical science, and where the problem, procedures, materials and measurement methods differed from those used in the original activities; and
- Remote, such as standardized national achievement tests.
As might be expected, Ruiz-Primo et al. (2002) found that the closer the assessment was to the enactment of the curriculum, the greater was the sensitivity of the assessment to the effects of instruction, and that the impact was considerable. For example, one of their interventions showed an average effect size of 0.26 when measured with a proximal assessment, but an effect size of 1.26 when measured with a close assessment.
In none of the meta-analyses discussed above was there any attempt to control for the effects of differences in the sensitivity to instruction of the different outcome measures. By itself, it does not invalidate the claims that formative assessment is likely to be effective in improving student outcomes. Indeed, in all likelihood, attempts to improve the quality of teachers’ formative assessment practices are likely to be considerably more cost-effective than many, if not most, other interventions (Wiliam & Thomson, 2007). However, failure to control for the impact of this factor means that considerable care should be taken in quoting particular effect sizes as being likely to be achieved in practice, and other measures of the impact, such as increases in the rate of learning, may be more appropriate (Wiliam, 2007c). More importantly, attention may need to be shifted away from the size of the effects and toward the role that effective feedback can play in the design of effective learning environments (Wiliam, 2007a). In concluding their review of over 3,000 studies of the effects of feedback interventions in schools, colleges and workplaces, Kluger and DeNisi observed that:
considerations of utility and alternative interventions suggest that even an FI [feedback intervention] with demonstrated positive effects should not be administered wherever possible. Rather additional development of FIT [feedback intervention theory] is needed to establish the circumstance under which positive FI effects on performance are also lasting and efficient and when these effects are transient and have questionable utility. This research must focus on the processes induced by FIs and not on the general question of whether FIs improve performance— look how little progress 90 years of attempts to answer the latter question have yielded. (1996 p. 278)
So there you go. The next time someone tries to trump you in debate by quoting effect sizes you’ll know exactly how to respond.