Effect sizes

If you’d like a more in-depth discussion of effect sizes, I also recommend Daniel Lakens’ chapter in his textbook “Improving Your Statistical Inferences.”

An effect size is a quantitative description of the strength of a phenomenon (i.e., the thing being studied). The larger the value, the stronger the phenomenon (e.g., a bigger mean difference or a stronger relationship).

Types of effect sizes

There are two basic families of effect sizes we tend to talk about:

The d family of effect sizes are standardized mean differences. In absolute value, they start at 0 (no mean difference) and have no upper bound, with larger values meaning larger standardized mean differences. Some of the effect sizes in this family:

  • Cohen’s d is perhaps the most popular standardized mean difference effect size. Generally, the equation is the mean difference divided by the pooled standard deviation, but the exact equation differs depending on whether you are using a one-sample, independent-samples, or paired-samples t-test (see the formulas after this list).

  • Hedges’ g is a less biased version of Cohen’s d. Cohen’s d is particularly biased with small sample sizes, so Hedges’ g is generally preferred, but you’ll find that not all statistical programs provide this effect size. Fortunately, it’s not difficult to calculate Hedges’ g from Cohen’s d, so just keep this in mind.
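To make these concrete, here are the standard formulas for the most common case, an independent-samples design (the one-sample and paired versions swap in the appropriate mean difference and standard deviation). Cohen’s d and its pooled standard deviation are:

\[
d = \frac{\bar{X}_1 - \bar{X}_2}{s_{\text{pooled}}}, \qquad s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}
\]

and the common approximation for converting Cohen’s d into Hedges’ g is:

\[
g \approx d \left(1 - \frac{3}{4(n_1 + n_2) - 9}\right)
\]

The correction shrinks d slightly, and the shrinkage fades as the sample sizes grow, which is why d and g are nearly identical in large samples.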

The r family of effect sizes are measures of strength of association. As you’ll read about in the correlation and regression chapters, this family of effect sizes can describe the proportion of variance explained by squaring the correlation (e.g., with a correlation of r = .8, the r-squared is .8² = .8 × .8 = .64, which is 64% of variance explained). Some of the effect sizes in this family:

  • r is a correlation. It’s a standardized measure of the strength of association, where r = -1 or +1 means a perfect relationship and r = 0 means no relationship at all. We typically work with Pearson’s correlation, but we will also learn about Spearman’s correlation (also known as Spearman’s rho).

  • \(\eta^2\) (eta-squared) measures the proportion of variance in the dependent variable associated with the different groups of the independent variable. It is considered a biased estimate, especially when trying to compare values across studies, so there are two alternatives that are generally preferred. We’ll cover the differences among these three in a later chapter (ANOVA); the formulas after this list give a preview.

  • \(\eta^2_p\) (partial eta-squared) is calculated slightly differently and is considered a less biased estimate (again, we’ll learn about this in a later chapter). This can allow for better comparisons of effect sizes across studies, though it’s still not perfect.

  • \(\omega^2\) (omega-squared) uses yet another calculation and is considered the least biased estimate. There are also \(\omega^2_p\) (partial omega-squared) and \(\omega^2_G\) (generalized omega-squared), but as jamovi doesn’t provide them, we won’t go over them in this course.
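As a preview of the ANOVA chapter, here is how these three are usually defined in terms of the sums of squares (SS) and mean squares (MS) from an ANOVA table (in a one-way ANOVA, \(\eta^2\) and \(\eta^2_p\) are actually identical because there is only one effect):

\[
\eta^2 = \frac{SS_{\text{effect}}}{SS_{\text{total}}}, \qquad
\eta^2_p = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}}, \qquad
\omega^2 = \frac{SS_{\text{effect}} - df_{\text{effect}} \times MS_{\text{error}}}{SS_{\text{total}} + MS_{\text{error}}}
\]

Notice how \(\omega^2\) subtracts out the variability you’d expect from error alone, which is where its bias correction comes from.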

What is all this about more or less biased effect sizes? It has to do with how the variability (e.g., the standard deviation or sums of squares) is estimated in these effect sizes, combined with the fact that we’re working with samples while trying to infer the population’s effect size. Cohen’s d and eta-squared tend to slightly overestimate the true population effect, so there are options that correct for this overestimation and lead to less biased estimates.
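To see this overestimation in action, here is a minimal simulation sketch in Python (the true effect of 0.5, group size of 10, and number of replications are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
true_d = 0.5   # population standardized mean difference
n = 10         # per-group sample size (small, so the bias is visible)
reps = 20_000  # number of simulated studies

ds = np.empty(reps)
for i in range(reps):
    a = rng.normal(0.0, 1.0, n)     # control group
    b = rng.normal(true_d, 1.0, n)  # treatment group
    # pooled standard deviation across the two groups
    sp = np.sqrt(((n - 1) * a.var(ddof=1) + (n - 1) * b.var(ddof=1))
                 / (2 * n - 2))
    ds[i] = (b.mean() - a.mean()) / sp

# Hedges' g: apply the small-sample correction to each d
gs = ds * (1 - 3 / (4 * (n + n) - 9))

print(f"true effect:    {true_d}")
print(f"mean Cohen's d: {ds.mean():.3f}")  # noticeably above .5
print(f"mean Hedges' g: {gs.mean():.3f}")  # very close to .5
```

With groups this small, the average Cohen’s d across simulated studies comes out around .52 rather than the true .50, while the average Hedges’ g lands almost exactly on the true value.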

We’ll also learn about phi, Cramer’s V, and beta as other measures of effect size.

If you nerded out over this information and want to learn more, check out this great journal article by Daniel Lakens or his chapter on effect sizes.

Small, medium, and large effect sizes

What is considered a small, medium, and large effect size? Quite frankly, it depends.

You may have seen heuristics online about what counts as small, medium, and large for Cohen’s d (e.g., .2, .5, and .8) and r (e.g., .1, .3, and .5), but these heuristics should not be used without critical thought. In fact, Cohen (who is regularly cited for these heuristics) argued that cut-offs should be determined by looking across studies to find what is considered small, medium, and large in that particular research context.

What makes an effect practically significant?

We’ll get into p-values in a moment; they speak to statistical significance, but they don’t tell us anything about how meaningful the effect is. That’s what an effect size is for. But how do we know whether an effect is meaningful, or practically significant?

Lakens (who also wrote the great journal article on effect sizes above) has a fantastic new preprint out on Sample Size Justification. In it, he provides an overview of six possible ways to determine which effect sizes are interesting:

  1. “Smallest effect size of interest: what is the smallest effect size that is theoretically and practically interesting?
  2. Minimally statistically detectable effect: given the test and sample size, what is the critical effect size that can be statistically significant?
  3. Expected effect size: which effect size is expected based on theoretical predictions or previous research?
  4. Width of confidence interval: which effect sizes are excluded based on the expected width of the confidence interval around the effect size?
  5. Sensitivity power analysis: across a range of possible effect sizes, which effects does a design have sufficient power to detect when performing a hypothesis test?
  6. Distribution of effect sizes in a research area: what is the empirical range of effect sizes in a specific research area, and which effects are a priori unlikely to be observed?” (p. 3)

Basically, what does past research say about the effect size you can expect (#3 and #6)? What is the smallest effect size you care about (#1)? What is the smallest effect size you can reasonably detect given your design (e.g., due to sample size limitations; #2, #4, and #5)? These justifications determine what effect size you are looking for, which in turn matters when determining what sample size you need; that will be discussed in a separate section. The sketch below shows one way to work out #5 for a given design.
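As an illustration of #5, here is a minimal sensitivity power analysis sketch using Python’s statsmodels library (the 50 participants per group, alpha of .05, and 80% power target are assumptions chosen for the example):

```python
# Sensitivity power analysis for an independent-samples t-test.
# Requires: pip install statsmodels
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Given n = 50 per group and alpha = .05, solve for the effect size
# the design can detect with 80% power.
d = analysis.solve_power(effect_size=None, nobs1=50, alpha=0.05,
                         power=0.80, ratio=1.0, alternative='two-sided')

print(f"Smallest effect detectable with 80% power: d = {d:.2f}")  # ~ .57
```

If effects smaller than roughly d = .57 would still be interesting to you (#1), then this design is underpowered for them, which is exactly the kind of reasoning the sample size section will build on.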

As a fun follow-up, and as an example of #6, this study in the field of education collected effect sizes from many education interventions to establish benchmarks for small (< .05), medium (.05 to < .20), and large (≥ .20) effect sizes based on existing data rather than poor-quality heuristics.