Scientists should be able to provide support for the absence of a meaningful effect. Currently, researchers often incorrectly conclude an effect is absent based on a non-significant result. It is statistically impossible to support the hypothesis that a true effect size is exactly zero. What is possible in a Frequentist hypothesis testing framework is to statistically reject effects large enough to be deemed worthwhile. When researchers want to argue for the absence of an effect that is large enough to be worthwhile to examine, they can test for equivalence.

A widely recommended approach within a Frequentist framework is to test for equivalence using the TOST procedure (Schuirmann, 1987), because of its simplicity and widespread use in other scientific disciplines. The goal in the TOST approach is to specify a lower and upper bound, such that results falling within this range are deemed equivalent to the absence of an effect that is worthwhile to examine (e.g., \(\Delta_L\) = -0.3 to \(\Delta_U\) = 0.3, where \(\Delta\) is a difference that can be defined by either standardized differences such as Cohen’s d, or raw differences such as 0.3 scale points on a 5-point scale). In the TOST procedure the null hypothesis is the presence of a true effect of \(\Delta_L\) or \(\Delta_U\), and the alternative hypothesis is an effect that falls within the equivalence bounds, or the absence of an effect that is worthwhile to examine. The observed data is compared against \(\Delta_L\) and \(\Delta_U\) in two one-sided tests. If the p-value for both tests indicates the observed data is surprising, assuming \(\Delta_L\) or \(\Delta_U\) is true, we can follow a Neyman-Pearson approach to statistical inferences and reject effect sizes larger than the equivalence bounds. When making such a statement, we will not be wrong more often, in the long run, than our Type 1 error rate (e.g., 5%).
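To make the logic concrete, the two one-sided tests can be sketched in a few lines of base R under a normal approximation (a minimal illustration only; the TOSTER package used throughout this document implements the appropriate test for each design):

```r
# Minimal sketch of the TOST logic under a normal approximation.
# est = observed effect, se = its standard error,
# low/high = the equivalence bounds (Delta_L and Delta_U).
tost_z <- function(est, se, low, high) {
  p_low  <- 1 - pnorm((est - low) / se)  # H0: true effect <= low
  p_high <- pnorm((est - high) / se)     # H0: true effect >= high
  # Equivalence is declared when BOTH one-sided tests are significant,
  # so the overall TOST p-value is the larger of the two.
  c(p_low = p_low, p_high = p_high, p_tost = max(p_low, p_high))
}
tost_z(est = 0.1, se = 0.1, low = -0.3, high = 0.3)  # p_tost ~ 0.023
```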

In this document I will present a number of practical examples for equivalence tests for independent t-tests, dependent t-tests, one-sample t-tests, correlations, and meta-analyses.

Hyde, Lindberg, Linn, Ellis, and Williams (2008) report that effect sizes for gender differences in mathematics tests across 7 million students in the US represent trivial differences, where a trivial difference is specified as an effect size smaller than d = 0.1. They present a table with Cohen’s d and its standard error (se), which is reproduced below:

Grades | d ± se |
---|---|
Grade 2 | 0.06 ± 0.003 |
Grade 3 | 0.04 ± 0.002 |
Grade 4 | -0.01 ± 0.002 |
Grade 5 | -0.01 ± 0.002 |
Grade 6 | -0.01 ± 0.002 |
Grade 7 | -0.02 ± 0.002 |
Grade 8 | -0.02 ± 0.002 |
Grade 9 | -0.01 ± 0.003 |
Grade 10 | 0.04 ± 0.003 |
Grade 11 | 0.06 ± 0.003 |

```r
library("TOSTER")
TOSTmeta(ES = 0.06, se = 0.003, low_eqbound_d = -0.1, high_eqbound_d = 0.1, alpha = 0.05)
```

```
## Using alpha = 0.05 the meta-analysis was significant, Z = 20, p = 2.753624e-89
##
## Using alpha = 0.05 the equivalence test was significant, Z = -13.33333, p = 7.406413e-41
##
## TOST results:
##   Z-value 1 p-value 1 Z-value 2    p-value 2
## 1  53.33333         0 -13.33333 7.406413e-41
##
## Equivalence bounds (Cohen's d):
##   low bound d high bound d
## 1        -0.1          0.1
##
## TOST confidence interval:
##   Lower Limit 90% CI Upper Limit 90% CI
## 1         0.05506544         0.06493456
```

We see that, indeed, the effect size estimates are measured with such high precision that we can conclude they fall within the equivalence bounds of d = -0.1 and d = 0.1. However, note that all of the effects are also statistically significant: the effects are statistically different from zero, and practically equivalent.
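Rather than calling TOSTmeta once per row, the same normal-approximation Z-tests can be applied to the whole table at once in base R (a sketch using the d and se values reported above):

```r
# Equivalence test for every grade in the table, with bounds of d = -0.1 and 0.1
d  <- c(0.06, 0.04, -0.01, -0.01, -0.01, -0.02, -0.02, -0.01, 0.04, 0.06)
se <- c(0.003, 0.002, 0.002, 0.002, 0.002, 0.002, 0.002, 0.003, 0.003, 0.003)
p_low  <- 1 - pnorm((d + 0.1) / se)  # H0: true effect <= -0.1
p_high <- pnorm((d - 0.1) / se)      # H0: true effect >=  0.1
all(pmax(p_low, p_high) < 0.05)      # TRUE: every grade is statistically equivalent
```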

*Hyde, J. S., Lindberg, S. M., Linn, M. C., Ellis, A. B., & Williams, C. C. (2008). Gender similarities characterize math performance. Science, 321(5888), 494-495.*

Eskine (2013) showed that participants who had been exposed to organic food were substantially harsher in their moral judgments than those exposed to a control condition (d = 0.81, 95% CI: [0.19, 1.45]). A replication by Moery and Calin-Jageman (2016, Study 2) did not observe a significant effect (control: n = 95, M = 5.25, SD = 0.95; organic food: n = 89, M = 5.22, SD = 0.83). Following Simonsohn’s (2015) recommendation, the equivalence bound was set to the effect size the original study had 33% power to detect (with n = 21 in each condition, this means the equivalence bound is d = 0.43, which equals a difference of 0.384 on a 7-point scale given the sample sizes and a pooled standard deviation of 0.894). A TOST equivalence test with alpha = 0.05, assuming equal variances, and equivalence bounds of d = -0.43 and d = 0.43 is significant, t(182) = -2.69, p = 0.004. We can reject effects larger than d = 0.43.

```r
TOSTtwo.raw(m1 = 5.25, m2 = 5.22, sd1 = 0.95, sd2 = 0.83, n1 = 95, n2 = 89,
            low_eqbound = -0.384, high_eqbound = 0.384, alpha = 0.05, var.equal = TRUE)
```

```
## Using alpha = 0.05 Student's t-test was non-significant, t(182) = 0.2274761, p = 0.8203089
##
## Using alpha = 0.05 the equivalence test based on Student's t-test was significant, t(182) = -2.684218, p = 0.003970479
##
## TOST results:
##   t-value 1    p-value 1 t-value 2   p-value 2  df
## 1   3.13917 0.0009885653 -2.684218 0.003970479 182
##
## Equivalence bounds (raw scores):
##   low bound raw high bound raw
## 1        -0.384          0.384
##
## TOST confidence interval:
##   Lower Limit 90% CI raw Upper Limit 90% CI raw
## 1             -0.1880364              0.2480364
```
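The raw equivalence bound of 0.384 used above can be reconstructed from the pooled standard deviation of the two replication conditions (a quick base-R check):

```r
# Pooled SD of the two conditions in Moery & Calin-Jageman (2016, Study 2)
sd_pooled <- sqrt(((95 - 1) * 0.95^2 + (89 - 1) * 0.83^2) / (95 + 89 - 2))
round(sd_pooled, 3)         # 0.894
round(0.43 * sd_pooled, 3)  # 0.384: the raw bound corresponding to d = 0.43
```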

*Moery, E., & Calin-Jageman, R. J. (2016). Direct and Conceptual Replications of Eskine (2013): Organic Food Exposure Has Little to No Effect on Moral Judgments and Prosocial Behavior. Social Psychological and Personality Science, 7(4), 312-319. https://doi.org/10.1177/1948550616639649*

Deary, Thorpe, Wilson, Starr, and Whalley (2003) report the IQ scores of 79,376 children in Scotland (39,343 girls and 40,033 boys). The difference in IQ scores between girls (M = 100.64, SD = 14.1) and boys (M = 100.48, SD = 14.9) was non-significant. With such a huge sample, we can examine whether the data are equivalent using very small equivalence bounds, such as d = -0.05 and d = 0.05. Because sample sizes and variances are unequal, Welch’s t-test is used, and we can conclude IQ scores are indeed equivalent, based on equivalence bounds of -0.05 and 0.05.

```r
TOSTtwo(m1 = 100.64, m2 = 100.48, sd1 = 14.1, sd2 = 14.9, n1 = 39343, n2 = 40033,
        low_eqbound_d = -0.05, high_eqbound_d = 0.05, alpha = 0.05, var.equal = FALSE)
```

```
## Using alpha = 0.05 Welch's t-test was non-significant, t(79260.94) = 1.554136, p = 0.1201559
##
## Using alpha = 0.05 the equivalence test based on Welch's t-test was significant, t(79260.94) = -5.490723, p = 2.007585e-08
##
## TOST results:
##   t-value 1    p-value 1 t-value 2    p-value 2       df
## 1  8.598995 4.092652e-18 -5.490723 2.007585e-08 79260.94
##
## Equivalence bounds (Cohen's d):
##   low bound d high bound d
## 1       -0.05         0.05
##
## Equivalence bounds (raw scores):
##   low bound raw high bound raw
## 1    -0.7252758      0.7252758
##
## TOST confidence interval:
##   Lower Limit 90% CI raw Upper Limit 90% CI raw
## 1           -0.009341433              0.3293414
```
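The t-values in this output can be reconstructed from the summary statistics. In the output above, the standardized bounds of d = ±0.05 are converted to raw scale points by multiplying by the root mean square of the two standard deviations (a base-R sketch reproducing the numbers above):

```r
# Reconstruct the Welch TOST t-values from the summary statistics
m1 <- 100.64; sd1 <- 14.1; n1 <- 39343
m2 <- 100.48; sd2 <- 14.9; n2 <- 40033
se_diff   <- sqrt(sd1^2 / n1 + sd2^2 / n2)
bound_raw <- 0.05 * sqrt((sd1^2 + sd2^2) / 2)  # d = 0.05 in raw IQ points
t_low  <- (m1 - m2 + bound_raw) / se_diff  # test against the lower bound
t_high <- (m1 - m2 - bound_raw) / se_diff  # test against the upper bound
round(c(bound_raw, t_low, t_high), 2)      # 0.73 8.60 -5.49
```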

*Deary, I. J., Thorpe, G., Wilson, V., Starr, J. M., & Whalley, L. J. (2003). Population sex differences in IQ at age 11: The Scottish mental survey 1932. Intelligence, 31(6), 533-542.*

Lakens, Semin, and Foroni (2012) examined whether the color of Chinese ideographs (white vs. black) would bias whether participants judged that the ideograph correctly translated positive, neutral, or negative words. In Study 4 the prediction was that participants would judge negative words to be correctly translated by black ideographs above guessing average, but no effects were predicted for translations in the other five between-subjects conditions of the 2 (ideograph color: white vs. black) × 3 (word meaning: positive, neutral, negative) design. The table below summarizes the data in all six conditions:

Color | Valence | M | % | SD | t | df | p | d |
---|---|---|---|---|---|---|---|---|
White | Positive | 6.05 | 50.42 | 1.50 | 0.15 | 20 | .89 | .03 |
White | Negative | 6.68 | 55.67 | 1.70 | 1.75 | 18 | .10 | .40 |
White | Neutral | 5.95 | 49.58 | 1.40 | 0.87 | 19 | .87 | .04 |
Black | Positive | 6.45 | 53.75 | 1.91 | 1.06 | 19 | .30 | .24 |
Black | Negative | 6.95 | 57.92 | 1.15 | 3.71 | 19 | .00 | .80 |
Black | Neutral | 5.71 | 47.58 | 1.79 | 0.73 | 20 | .47 | .16 |

With 19 to 21 participants in each between-subjects condition, the study had 80% power to detect equivalence with equivalence bounds of d = -0.68 and d = 0.68.

```r
powerTOSTone(alpha = 0.05, statistical_power = 0.8, low_eqbound_d = -0.68, high_eqbound_d = 0.68)
```

```
## The required sample size to achieve 80 % power with equivalence bounds of -0.68 and 0.68 is 19
##
## [1] 19
```

When we perform five equivalence tests using equivalence bounds of d = -0.68 and d = 0.68, and a Holm-Bonferroni sequential procedure to control error rates, we can conclude statistical equivalence for the data in rows 1, 3, and 6, but the conclusion is undetermined for tests 2 and 4, where the data are neither significantly different from guessing average nor statistically equivalent. For example, performing the test for the data in row 6:

```r
TOSTone(m = 5.71, mu = 6, sd = 1.79, n = 20, low_eqbound_d = -0.68, high_eqbound_d = 0.68, alpha = 0.05)
```

```
## Using alpha = 0.05 the NHST one-sample t-test was non-significant, t(19) = -0.724536, p = 0.4775649
##
## Using alpha = 0.05 the equivalence test was significant, t(19) = 2.316516, p = 0.01592727
##
## TOST results:
##   t-value 1  p-value 1 t-value 2    p-value 2 df
## 1  2.316516 0.01592727 -3.765588 0.0006543016 19
##
## Equivalence bounds (Cohen's d):
##   low bound d high bound d
## 1       -0.68         0.68
##
## Equivalence bounds (raw scores):
##   low bound raw high bound raw
## 1       -1.2172         1.2172
##
## TOST confidence interval:
##   Lower Limit 90% CI raw Upper Limit 90% CI raw
## 1             -0.9820961              0.4020961
```
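The Holm-Bonferroni conclusion can be reproduced from the summary statistics in the table with a few lines of base R (a sketch, not the analysis as originally run; n for row 6 is taken as 20, matching the TOSTone call above):

```r
# TOST p-values for the five conditions without a predicted effect
# (table rows 1-4 and 6), tested against guessing average mu = 6 with
# equivalence bounds of d = -0.68 and 0.68, then Holm-Bonferroni corrected.
m  <- c(6.05, 6.68, 5.95, 6.45, 5.71)
s  <- c(1.50, 1.70, 1.40, 1.91, 1.79)
n  <- c(21, 19, 20, 20, 20)
se <- s / sqrt(n)
bound  <- 0.68 * s                             # raw bound per condition
p_low  <- 1 - pt((m - 6 + bound) / se, n - 1)  # H0: effect <= -bound
p_high <- pt((m - 6 - bound) / se, n - 1)      # H0: effect >=  bound
p_tost <- pmax(p_low, p_high)
p.adjust(p_tost, method = "holm") < 0.05       # TRUE FALSE TRUE FALSE TRUE
```

Only rows 1, 3, and 6 remain significant after the Holm-Bonferroni correction; rows 2 and 4 do not.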

It is clear I should have been more tentative in my conclusions in this study. Not only can I not conclude equivalence in some of the conditions, but the equivalence bound I had 80% power to detect is also very large, meaning the possibility remains that there are theoretically interesting but smaller effects.

*Lakens, D., Semin, G. R., & Foroni, F. (2012). But for the bad, there would not be good: Grounding valence in brightness through shared relational structures. Journal of Experimental Psychology: General, 141(3), 584-594. https://doi.org/10.1037/a0026468*

Olson, Fazio, and Hermann (2007) reported correlations between implicit and explicit measures of self-esteem, such as the IAT, Rosenberg’s self-esteem scale, a feeling thermometer, and trait ratings. In Study 1, 71 participants completed the self-esteem measures. Because no equivalence bounds are mentioned, we can examine which equivalence bounds the researchers had 80% power to detect.

```r
powerTOSTr(alpha = 0.05, statistical_power = 0.8, low_eqbound_r = -0.24, high_eqbound_r = 0.24)
```

```
## The required sample size to achieve 80 % power with equivalence bounds of -0.24 and 0.24 is 71 pairs
##
## [1] 71
```

With 71 pairs of observations between measures, the researchers have 80% power for equivalence bounds of r = -0.24 and r = 0.24.

The correlations observed by Olson et al. (2007), Study 1, are presented in the table below (significant correlations are flagged by an asterisk).

Measure | IAT | Rosenberg | Feeling thermometer | Trait ratings |
---|---|---|---|---|
IAT | - | -.12 | -.09 | -.06 |
Rosenberg | | - | .62* | .09 |
Feeling thermometer | | | - | .29* |
Trait ratings | | | | - |

We can test each correlation; for example, the correlation between the IAT and trait ratings of r = -0.06, given 71 participants, and using equivalence bounds of r_U = 0.24 and r_L = -0.24.

```r
TOSTr(n = 71, r = -0.06, low_eqbound_r = -0.24, high_eqbound_r = 0.24, alpha = 0.05)
```

```
## Using alpha = 0.05 the NHST t-test was non-significant, p = 0.6191585
##
## Using alpha = 0.05 the equivalence test was non-significant, p = 0.06386793
##
## TOST results:
##     p-value 1  p-value 2
## 1 0.005971455 0.06386793
##
## Equivalence bounds (r):
##   low bound r high bound r
## 1       -0.24         0.24
##
## TOST confidence interval:
##   Lower Limit 90% CI raw Upper Limit 90% CI raw
## 1             -0.2538652              0.1384997
```
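The p-values in this output can be reproduced with Fisher's z transformation of the correlation, which is how equivalence tests for correlations are typically computed (a base-R sketch matching the output above):

```r
# TOST for a correlation via Fisher's z transformation
# (r = -0.06, n = 71, equivalence bounds of r = -0.24 and 0.24)
r <- -0.06; n <- 71
se <- 1 / sqrt(n - 3)                                # SE of Fisher's z
p_low  <- 1 - pnorm((atanh(r) - atanh(-0.24)) / se)  # H0: rho <= -0.24
p_high <- pnorm((atanh(r) - atanh(0.24)) / se)       # H0: rho >=  0.24
round(c(p_low, p_high), 4)  # 0.0639 0.0060 -> p_tost = 0.0639, not significant
```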

We see that none of the correlations between the implicit and explicit measures can be declared statistically equivalent. Instead, the tests yield undetermined outcomes: the correlations are neither significantly different from zero nor statistically equivalent.

*Olson, M. A., Fazio, R. H., & Hermann, A. D. (2007). Reporting tendencies underlie discrepancies between implicit and explicit measures of self-esteem. Psychological Science, 18(4), 287-291.*