
"One Year's Progress" is a clever and popular claim from Hattie.

Hattie's key claims in his book Visible Learning (VL) are:

1. All educational interventions work.

"When teachers claim that they are having a positive effect on achievement or when a policy improves achievement, this is almost always a trivial claim: Virtually everything works. One only needs a pulse and we can improve achievement." (Hattie in VL p. 16).

Note: this is easily challenged; see Marzano (2009), who reports that 20-40% of studies produce negative results, and Kluger & DeNisi (1996), who report that over 38% of feedback studies produce negative results.

2. Since all interventions work, we need to CHANGE the focus to what works best.

"Instead of asking 'What Works?' we should be asking 'What Works BEST?'" (Hattie in VL, p. 18)

3. To find What Works Best we look at meta-analyses, which combine many studies together. Hattie then goes one step further and combines many meta-analyses to develop his unique methodology - the META-meta-analysis.

4. To calculate What Works Best we take the simple statistic, the Effect Size (d), reported in each meta-analysis and then average all of these into one number. This one average Effect Size then represents all of the studies for that intervention. E.g., Hattie combines 23 meta-analyses to get the ONE average effect size of 0.73 for feedback - details here (see the sketch after this list of claims).

5. The 'hinge point' is an Effect Size d = 0.40. Hattie stated, 

We can set benchmarks of what progress looks like (preferably d=0.40 for every student, at least d=0.30, and certainly not less than d=0.20) per implementation or year. (VL, p. 240)

"...the use of the "h-point" (d = 0.40) to demarcate the expected value of any innovations in schools is critical. Rather than using the zero point, which is hardly worthwhile, the standards for minimal success in schools should be more like d = 0.40." (VL, p.249)
"The d = 0.40 is what I referred to in Visible Learning as the hinge-point (or h-point) for identifying what is and what is not effective." (Hattie, 2012, p.3)

6. Hattie then claimed an effect size of 0.40 is equivalent to advancing a child’s achievement by 1 year.

"d = 0.4 is what we can expect as growth per year on average." (Hattie, 2012, p. 14) & (Hattie presentation Melbourne Graduate School, 2011, @21minutes).

How Reliable are these Claims?

These claims are all contingent on the Effect Size statistic (details about how it is calculated here); suffice to say there are huge issues!

The Education Endowment Foundation (EEF) is one of the few organisations that use Hattie's same unique method - the META-meta-analysis.

However, they derive a very different scale, with an effect size of d = 0.1 equivalent to 1 month's progress. So Hattie's 0.4 would be 4 months' progress (not one year).
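The arithmetic behind that comparison is simple. A minimal sketch, assuming the two scales exactly as stated above (EEF: d = 0.1 is about one month's progress; Hattie: d = 0.40 is about one year, i.e. 12 months):

    def months_eef(d):
        # EEF scale: d = 0.1 is roughly one month's progress
        return d / 0.1

    def months_hattie(d):
        # Hattie's scale: d = 0.40 is roughly one year's (12 months') progress
        return d / 0.40 * 12

    d = 0.40
    print(f"d = {d}: EEF scale = {months_eef(d):.0f} months, "
          f"Hattie's scale = {months_hattie(d):.0f} months")
    # Output: d = 0.4: EEF scale = 4 months, Hattie's scale = 12 months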

This difference is very concerning since it breaks one of the foundations of the Scientific Method, Reliability: the degree of consistency of a measure. A test is reliable when it gives the same result repeatedly under the same conditions.


More concerning is Hattie's claim,

"I would go further and claim that those students who do not achieve at least a 0.40 improvement in a year are going backwards..." (VL, p. 250).

In terms of teacher assessment, he takes this one step further by declaring that teachers who don't attain an effect size of 0.40 are "below average" (Hattie, 2010, p. 86).

These claims are a major concern for a number of reasons: the Reliability issues stated above, and also Hattie's financial interest in a teacher-assessment software program called e-asTTle and in performance pay.

However, he did backtrack in his 2012 VL summary publication regarding the meaning of d = 0.40,

"I did not say that we use this hinge point for making decisions, but rather we used it to start discussions" (p. 14).

Yet, currently (Jan 2022), Hattie's commercial website states:

"The average effect size was 0.4, a marker that represented a year’s growth per year of schooling for a student. Anything above 0.4 would have a greater positive effect on student learning."

Yet, also on this website, in his publication "Real Gold vs. Fool's Gold", Hattie appears to backtrack from these claims about an effect size (d) of 0.40,

"But we must not get too oversold on using d = 0.40 in all circumstances. The interpretation can differ in light of how narrow (e.g., vocabulary) or wide (e.g., comprehension) the outcome is, the cost of the intervention (Simpson, 2017) ...When implementing the Visible Learning model, it is worth developing local knowledge about what works best in the context and not overly rely on 0.40. The 0.40 merely is the average of all 1,600 meta-analyses..." (p. 14)

Peer Reviews Question Hattie's claim that d = 0.40 is Equivalent to 1 Year's Progress

Thibault (2017), in "Is John Hattie's Visible Learning so visible?", states (translated to English),

"...regarding the order of magnitude of the effect, Hattie (2009) gives little detail. He explains summarily, an effect of d = 0.4 is roughly equivalent to the progression of a student in one year when an effect of d = 1 is roughly equivalent to the progression of 2 or 3 school years. 
Though this assertion is debatable as mentioned by Proulx (2017), the lack of detail in Hattie also raises questions about these associations between magnitude of effect and duration. 
For example, ... is an effect of d = 0.2 equivalent to the progression of a pupil over a half-year school?
Also, are the effects accounted for in the meta-analyses over the same duration and, where appropriate, how to compare the effect of an intervention of a few weeks or months versus an intervention on a full year? 
These questions, for me, remain vibrant on reading all these averages and effects."

Wecker et al. (2017) also question the notion that an effect of 0.4 corresponds to a year's progress:

"An observed effect size of, for example, 0.3 would obviously hardly correspond to the magnitude of the increase in competence over the course of a school year." (p. 33)

Kraft (2021) in his discussion with Hattie,

"I just think the 0.40 threshold is the wrong one. Of the almost 2,000 effect sizes I analyzed from RCTs examining the effects of educational interventions on standardized student achievement, only 13 percent were 0.40 or larger...

I would argue that the hinge point of 0.40 sets up education leaders to have unrealistic expectations about what is possible and what is meaningful. The scale of data you draw on using thousands of meta-analyses is incredibly impressive. But it also means that this average effect size of 0.40 pools across studies of widely variable quality.

Correlational studies with misleadingly strong associations likely dominate the data... Publication bias, where lots of studies of ineffectual programs are never written or published, further strengthens my suspicion that the 0.40 hinge point is too large."

THE NEED FOR BENCHMARK EFFECT SIZES:

Cohen (1988) hesitantly defined effect sizes as "small, d = 0.2", "medium, d = 0.5", and "large, d = 0.8", stating that "there is a certain risk inherent in offering conventional operational definitions for those terms for use in power analysis in as diverse a field of inquiry as behavioral science" (p. 25).
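Read literally, Cohen's conventions would place Hattie's hinge point between "small" and "medium". A sketch of that literal reading (the thresholds are Cohen's; applying them to education studies is exactly what Lipsey et al. caution against below):

    def cohen_label(d):
        # Literal reading of Cohen's (1988) conventions: 0.2 small, 0.5 medium, 0.8 large
        d = abs(d)
        if d >= 0.8:
            return "large"
        if d >= 0.5:
            return "medium"
        if d >= 0.2:
            return "small"
        return "below Cohen's 'small' threshold"

    print(cohen_label(0.40))  # 'small' (it sits between the small and medium thresholds)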

Bloom et al. (2007) argue that,

"there is no universal guideline or rule of thumb for judging the practical importance or substantive significance of a standardized effect size estimate for an intervention. Instead one must develop empirical benchmarks of comparison that (see table of US benchmarks below) reflect the nature of the intervention being evaluated, its target population, and the outcome measure or measures being used. We apply this approach to the assessment of effect size measures for educational interventions designed to improve student academic achievement." (abstract)

Lipsey et al. (2012) state,

"Cohen’s broad categories of small, medium and large are clearly not tailored to the effects of intervention studies in education, much less any specific domain of education interventions, outcomes, and samples. Using those categories to characterize effect sizes from education studies, therefore, can be quite misleading. It is rather like characterizing a child’s height as small, medium, or large, not by reference to the distribution of values for children of similar age and gender, but by reference to a distribution for all vertebrate mammals." (p. 12)

The United States Department of Education commissioned a more detailed study of effect-size benchmarks for K-12, using national testing across the USA of around 50 million students (table of results below, from Lipsey et al., 2012, p. 28):

[Table: annual growth effect-size benchmarks by grade, from national test data, Lipsey et al. (2012, p. 28)]


Hattie acknowledges these results in his summary VL 2012 (p. 14), but uses them to justify his 'hinge point' d = 0.40 and says, 

"the effects for each year were greater in younger and lower in older grades ... we may need to expect more from younger grades (d > 0.60) than for older grades (d > 0.30)."

So Hattie's minor adjustment misses the HUGE variation from younger to older students. He also does not address the use of much older college-level students and practising professionals, like doctors and nurses, in many of his meta-analyses, e.g., 'self-report grades', 'problem-based learning', 'worked examples', etc.

This HUGE variation is a major moderator in Hattie's method of comparing effect sizes: the difference between two influences could simply be due to the age of the students being measured.

Steiner (2021) also draws on Lipsey et al. (2012) and emphasises the need to know the age of the students in the study,

"...few people I have spoken with outside the research community understand that an effect size, all on its own, doesn’t tell you whether you should be impressed by an educational intervention. You also have to know the intended grade-level. For instance, an intervention with an effect size of 0.2 wouldn’t make a significant difference to students in kindergarten, but it would make a huge difference for 11th graders."
Further, Professor Dylan Wiliam has also identified that meta-analyses need to control for the age of the students and the time period over which the study is conducted. Hattie does NOT do this.

Wiliam goes further and says that if this is not done, the results are 'GARBAGE'.




Hattie used a number of meta-analyses from Prof Robert Slavin, but Slavin has been very critical of Hattie's methods. Slavin (2017) supports Wiliam,

"In studies that do have control groups and in which experimental and control groups were tested on material they were both taught, effect sizes as large as +0.80, or even +0.40, are very unusual."

Hattie finally admits this in his 2015 article, in which he responds to some of the critiques:

"...the time over which any intervention is conducted can matter (we find that calculations over less than 10-12 weeks can be unstable, the time is too short to engender change, and you end up doing too much assessment relative to teaching). These are critical moderators of the overall effect-sizes and any use of hinge=.4 should, of course, take these into account."

Yet Hattie DOES NOT take this into account: there has been no attempt to detail and report the time over which the studies ran, nor the age group of the students in question. Clear examples are the studies Hattie used in the category of "Feedback".

Also, the landmark US study goes on to state:

"The usefulness of these empirical benchmarks depends on the degree to which they are drawn from high-quality studies and the degree to which they summarise effect sizes with regard to similar types of interventions, target populations, and outcome measures."

and also defined the criteria for accepting a research study (i.e., the quality needed):

  • Search for published and unpublished research dated 1995 or later.
  • Specialised groups such as special education students, etc. were not included.
  • Also, to ensure that the effect sizes extracted from these reports were relatively good indications of actual intervention effects, studies were restricted to those using random assignment designs (that is method 1 as explained in effect sizes) with practice-as-usual control groups and attrition rates no higher than 20% (p. 33).


NOTE: using these criteria, few of the 800+ meta-analyses in VL would pass the quality test!
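As a sketch, those inclusion criteria could be applied as a simple filter. The Study record below is a hypothetical structure invented for illustration; the thresholds are the ones listed above (Lipsey et al., 2012, p. 33).

    from dataclasses import dataclass

    @dataclass
    class Study:
        # Hypothetical record structure for illustrating the criteria above
        year: int
        random_assignment: bool         # 'method 1' designs only
        practice_as_usual_control: bool
        attrition_rate: float           # proportion, e.g. 0.15 means 15%
        specialised_population: bool    # e.g. special-education-only samples

    def meets_criteria(s):
        # Apply the inclusion criteria listed above
        return (
            s.year >= 1995
            and s.random_assignment
            and s.practice_as_usual_control
            and s.attrition_rate <= 0.20
            and not s.specialised_population
        )

    print(meets_criteria(Study(2001, True, True, 0.15, False)))  # True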


Is "Years or Months of Progress" a Reliable Measure?

Matthew Baird and John Pane, in Translating Standardized Effects of Education Programs Into More Interpretable Metrics, compared the "years of time" translation with three other reader-friendly measures (benchmarking against similar groups in other studies, percentile growth, and the probability of meeting a certain threshold), rating which were the most and least accurate reflections of effect sizes. They conclude that,

"years of learning performs worst, and percentile gains performs best, making it our recommended choice for more interpretable translations of standardized effects."
