Introduction to Equivalence Testing with TOSTER

Daniel Lakens

2017-11-15

Scientists should be able to provide support for the absence of a meaningful effect. Currently, researchers often incorrectly conclude an effect is absent based on a non-significant result. It is statistically impossible to support the hypothesis that a true effect size is exactly zero. What is possible in a Frequentist hypothesis testing framework is to statistically reject effects large enough to be deemed worthwhile. When researchers want to argue for the absence of an effect that is large enough to be worthwhile to examine, they can test for equivalence.

A widely recommended approach within a Frequentist framework is to test for equivalence using the TOST (two one-sided tests) procedure (Schuirmann, 1987), because of its simplicity and widespread use in other scientific disciplines. The goal in the TOST approach is to specify a lower and upper bound, such that results falling within this range are deemed equivalent to the absence of an effect that is worthwhile to examine (e.g., \(\Delta_L\) = -0.3 to \(\Delta_U\) = 0.3, where \(\Delta\) is a difference that can be defined by either standardized differences such as Cohen’s d, or raw differences such as 0.3 scale points on a 5-point scale). In the TOST procedure the null hypothesis is the presence of a true effect at least as extreme as \(\Delta_L\) or \(\Delta_U\), and the alternative hypothesis is an effect that falls within the equivalence bounds, or the absence of an effect that is worthwhile to examine. The observed data are compared against \(\Delta_L\) and \(\Delta_U\) in two one-sided tests. If the p-values for both tests indicate the observed data are surprising, assuming \(\Delta_L\) or \(\Delta_U\) is true, we can follow a Neyman-Pearson approach to statistical inferences and reject effect sizes larger than the equivalence bounds. When making such a statement, we will not be wrong more often, in the long run, than our Type 1 error rate (e.g., 5%).
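The two one-sided tests described above can be written compactly. This is a sketch in notation introduced here for illustration (\(\theta\) for the true effect, \(\hat{\theta}\) for its estimate, \(SE\) for the standard error of the estimate); these symbols do not appear in the original text.

```latex
\[
H_{01}: \theta \le \Delta_L \quad \text{vs.} \quad H_{11}: \theta > \Delta_L,
\qquad
H_{02}: \theta \ge \Delta_U \quad \text{vs.} \quad H_{12}: \theta < \Delta_U
\]
\[
t_L = \frac{\hat{\theta} - \Delta_L}{SE},
\qquad
t_U = \frac{\hat{\theta} - \Delta_U}{SE},
\qquad
p_{TOST} = \max\bigl( P(T \ge t_L),\; P(T \le t_U) \bigr)
\]
```

Equivalence is concluded when \(p_{TOST} < \alpha\), that is, when both one-sided tests reject. This decision matches checking whether the \(1 - 2\alpha\) confidence interval (a 90% interval for \(\alpha = 0.05\)) lies entirely between \(\Delta_L\) and \(\Delta_U\), which is why the output in the examples reports a 90% interval.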

In this document I will present a number of practical examples for equivalence tests for independent t-tests, dependent t-tests, one-sample t-tests, correlations, and meta-analyses.

 

 

Equivalence test for Meta-analysis

Hyde, Lindberg, Linn, Ellis, and Williams (2008) report that effect sizes for gender differences in mathematics tests across 7 million students in the US represent trivial differences, where a trivial difference is specified as an effect size smaller than d = 0.1. Their table with Cohen’s d and standard errors (se) is reproduced below:

Grade        d       se
Grade 2      0.06    0.003
Grade 3      0.04    0.002
Grade 4     -0.01    0.002
Grade 5     -0.01    0.002
Grade 6     -0.01    0.002
Grade 7     -0.02    0.002
Grade 8     -0.02    0.002
Grade 9     -0.01    0.003
Grade 10     0.04    0.003
Grade 11     0.06    0.003

library("TOSTER")
TOSTmeta(ES = 0.06, se = 0.003, low_eqbound_d=-0.1, high_eqbound_d=0.1, alpha=0.05)
## Using alpha = 0.05 the meta-analysis was significant, Z = 20, p = 5.507248e-89
## 
## 
## Using alpha = 0.05 the equivalence test was significant, Z = -13.33333, p = 7.406413e-41
## 

## TOST results:
##   Z-value 1 p-value 1 Z-value 2    p-value 2
## 1  53.33333         0 -13.33333 7.406413e-41
## 
## Equivalence bounds (Cohen's d):
##   low bound d high bound d
## 1        -0.1          0.1
## 
## TOST confidence interval:
##   Lower Limit 90% CI Upper Limit 90% CI
## 1         0.05506544         0.06493456

We see that indeed, all effect size estimates are measured with such high precision that we can conclude they fall within the equivalence bounds of d = -0.1 and d = 0.1. However, note that all of the effects are also statistically significant, so the effects are statistically different from zero and practically equivalent.
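The claim about all ten grades can be checked with a few lines of base R. This is a minimal sketch that assumes each estimate is approximately normal with the standard error from the table; the variable names are illustrative and it does not call TOSTER.

```r
# Cohen's d and standard errors for Grades 2-11, copied from the table
grade <- 2:11
d  <- c(0.06, 0.04, -0.01, -0.01, -0.01, -0.02, -0.02, -0.01, 0.04, 0.06)
se <- c(0.003, 0.002, 0.002, 0.002, 0.002, 0.002, 0.002, 0.003, 0.003, 0.003)
low <- -0.1; high <- 0.1; alpha <- 0.05

# Ordinary two-sided test against zero
z_nhst <- d / se
p_nhst <- 2 * pnorm(-abs(z_nhst))

# Two one-sided tests against the equivalence bounds
z_low  <- (d - low) / se    # tests H0: effect <= lower bound
z_high <- (d - high) / se   # tests H0: effect >= upper bound
p_tost <- pmax(pnorm(z_low, lower.tail = FALSE), pnorm(z_high))

results <- data.frame(grade, d, se, z_nhst, p_nhst, p_tost,
                      differs_from_zero = p_nhst < alpha,
                      equivalent = p_tost < alpha)
print(results)
```

For Grade 2 this gives the same Z-values as the output above (20, 53.33, and -13.33), and every row is both significantly different from zero and statistically equivalent.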

Hyde, J. S., Lindberg, S. M., Linn, M. C., Ellis, A. B., & Williams, C. C. (2008). Gender similarities characterize math performance. Science, 321(5888), 494-495.

 

 

Independent Groups Equivalence Test

 

Independent Groups Student’s Equivalence Test

Eskine (2013) showed that participants who had been exposed to organic food were substantially harsher in their moral judgments relative to those in the control condition (d = 0.81, 95% CI: [0.19, 1.45]). A replication by Moery & Calin-Jageman (2016, Study 2) did not observe a significant effect (Control: n = 95, M = 5.25, SD = 0.91, Organic Food: n = 89, M = 5.22, SD = 0.83). Following Simonsohn’s (2015) recommendation, the equivalence bound was set to the effect size the original study had 33% power to detect (with n = 21 in each condition, this means the equivalence bound is d = 0.48, which equals a difference of 0.429 on a 7-point scale given the sample sizes and a pooled standard deviation of 0.894). A TOST equivalence test with alpha = 0.05, assuming equal variances, and equivalence bounds of d = -0.48 and d = 0.48 is significant, t(182) = -3.03, p = 0.001. We can reject effects larger than d = 0.48 (or a raw difference of 0.429).

# Alternative call with the bounds given as Cohen's d (commented out):
# TOSTtwo(m1=5.25, m2=5.22, sd1=0.95, sd2=0.83, n1=95, n2=89, low_eqbound_d=-0.48, high_eqbound_d=0.48, alpha=0.05, var.equal=TRUE)
TOSTtwo.raw(m1=5.25, m2=5.22, sd1=0.95, sd2=0.83, n1=95, n2=89, low_eqbound=-0.429, high_eqbound=0.429, alpha=0.05, var.equal=TRUE)
## Using alpha = 0.05 Student's t-test was non-significant, t(182) = 0.2274761, p = 0.8203089
## 
## 
## Using alpha = 0.05 the equivalence test based on Student's t-test was significant, t(182) = -3.025432, p = 0.001421094
## 

## TOST results:
##   t-value 1    p-value 1 t-value 2   p-value 2  df
## 1  3.480384 0.0003133386 -3.025432 0.001421094 182
## 
## Equivalence bounds (raw scores):
##   low bound raw high bound raw
## 1        -0.429          0.429
## 
## TOST confidence interval:
##   Lower Limit 90% CI raw Upper Limit 90% CI raw
## 1             -0.1880364              0.2480364
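The numbers in this output can be reproduced from the summary statistics with base R. The sketch below uses exactly the inputs passed to the call above (including sd1 = 0.95) and the standard pooled-variance formulas; variable names are illustrative.

```r
# Inputs exactly as passed to the TOSTtwo.raw() call above
m1 <- 5.25; sd1 <- 0.95; n1 <- 95   # control
m2 <- 5.22; sd2 <- 0.83; n2 <- 89   # organic food
low_raw <- -0.429; high_raw <- 0.429; alpha <- 0.05

# Pooled standard deviation and standard error of the mean difference
sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
se <- sd_pooled * sqrt(1 / n1 + 1 / n2)
df <- n1 + n2 - 2

# Converting the standardized bound to raw units: d * pooled SD
bound_from_d <- 0.48 * sd_pooled

# Two one-sided t-tests against the raw bounds
t_low  <- (m1 - m2 - low_raw) / se
t_high <- (m1 - m2 - high_raw) / se
p_low  <- pt(t_low, df, lower.tail = FALSE)
p_high <- pt(t_high, df)
p_tost <- max(p_low, p_high)

# 90% confidence interval for the raw mean difference
ci90 <- (m1 - m2) + c(-1, 1) * qt(1 - alpha, df) * se

print(c(sd_pooled = sd_pooled, bound_from_d = bound_from_d,
        t_low = t_low, t_high = t_high, p_tost = p_tost))
print(ci90)
```

The pooled standard deviation comes out at about 0.894, so d = 0.48 corresponds to a raw bound of about 0.429, and the t-values agree with the TOST results printed above.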

Thanks to Brent Donnellan for pointing out that the SD in the control condition in the replication was 0.91, not 0.95 as in an earlier version of this vignette, and that the raw equivalence bound is 0.429, not 0.384 as in an earlier version. I confused the n (95) with the SD, and originally calculated raw bounds for d = 0.43 instead of d = 0.48 by mistake.

Moery, E., & Calin-Jageman, R. J. (2016). Direct and Conceptual Replications of Eskine (2013): Organic Food Exposure Has Little to No Effect on Moral Judgments and Prosocial Behavior. Social Psychological and Personality Science, 7(4), 312-319. https://doi.org/10.1177/1948550616639649

 

Independent Groups Welch’s Equivalence Test

Deary, Thorpe, Wilson, Starr, and Whalley (2003) report the IQ scores of 79,376 children in Scotland (39,343 girls and 40,033 boys). The difference in IQ scores between girls (M = 100.64, SD = 14.1) and boys (M = 100.48, SD = 14.9) was non-significant. With such a huge sample, we can examine whether the data is equivalent using very small equivalence bounds, such as d = -0.05 and d = 0.05. Because sample sizes are unequal, and variances are unequal, Welch’s t-test is used, and we can conclude IQ scores are indeed equivalent, based on equivalence bounds of d = -0.05 and d = 0.05.

TOSTtwo(m1=100.64,m2=100.48,sd1=14.1,sd2=14.9,n1=39343,n2=40033,low_eqbound_d=-0.05, high_eqbound_d=0.05, alpha = 0.05, var.equal=FALSE)
## Using alpha = 0.05 Welch's t-test was non-significant, t(79260.94) = 1.554136, p = 0.1201559
## 
## 
## Using alpha = 0.05 the equivalence test based on Welch's t-test  was significant, t(79260.94) = -5.490723, p = 2.007585e-08
## 

## TOST results:
##   t-value 1    p-value 1 t-value 2    p-value 2       df
## 1  8.598995 4.092652e-18 -5.490723 2.007585e-08 79260.94
## 
## Equivalence bounds (Cohen's d):
##   low bound d high bound d
## 1       -0.05         0.05
## 
## Equivalence bounds (raw scores):
##   low bound raw high bound raw
## 1    -0.7252758      0.7252758
## 
## TOST confidence interval:
##   Lower Limit 90% CI raw Upper Limit 90% CI raw
## 1           -0.009341433              0.3293414
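The Welch version can also be retraced in base R. This sketch assumes the standardized bound is converted to raw units with the square root of the average of the two variances, which reproduces the raw bounds printed above, and uses the Welch-Satterthwaite degrees of freedom; variable names are illustrative.

```r
# Summary statistics as passed to the TOSTtwo() call above
m1 <- 100.64; sd1 <- 14.1; n1 <- 39343   # girls
m2 <- 100.48; sd2 <- 14.9; n2 <- 40033   # boys
bound_d <- 0.05; alpha <- 0.05

# Raw equivalence bound: d times the root of the average variance
sd_avg <- sqrt((sd1^2 + sd2^2) / 2)
bound_raw <- bound_d * sd_avg

# Welch standard error and Welch-Satterthwaite degrees of freedom
v1 <- sd1^2 / n1
v2 <- sd2^2 / n2
se <- sqrt(v1 + v2)
df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two one-sided tests against -bound_raw and +bound_raw
t_low  <- (m1 - m2 + bound_raw) / se
t_high <- (m1 - m2 - bound_raw) / se
p_tost <- max(pt(t_low, df, lower.tail = FALSE), pt(t_high, df))

# 90% confidence interval for the raw mean difference
ci90 <- (m1 - m2) + c(-1, 1) * qt(1 - alpha, df) * se

print(c(bound_raw = bound_raw, df = df, t_low = t_low,
        t_high = t_high, p_tost = p_tost))
print(ci90)
```

Because the 90% interval lies well inside the raw bounds of about -0.725 and 0.725, the equivalence test is significant even though the bounds are very narrow in standardized terms.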

Deary, I. J., Thorpe, G., Wilson, V., Starr, J. M., & Whalley, L. J. (2003). Population sex differences in IQ at age 11: The Scottish mental survey 1932. Intelligence, 31(6), 533-542.

 

 

One-Sample Equivalence Test

 

Lakens, Semin, & Foroni (2012) examined whether the color of Chinese ideographs (white vs. black) would bias whether participants judged that the ideograph correctly translated positive, neutral, or negative words. In Study 4 the prediction was that participants would judge negative words to be correctly translated by black ideographs above guessing average, but no effects were predicted for translations in the other 5 between-subject conditions in the 2 (ideograph color: white vs. black) x 3 (word meaning: positive, neutral, negative) design. The table below is a summary of the data in all 6 conditions:

Color   Valence    M      %       SD     t      df   p     d
White   Positive   6.05   50.42   1.50   0.15   20   .89   .03
White   Negative   6.68   55.67   1.70   1.75   18   .10   .40
White   Neutral    5.95   49.58   1.40   0.87   19   .87   .04
Black   Positive   6.45   53.75   1.91   1.06   19   .30   .24
Black   Negative   6.95   57.92   1.15   3.71   19   .00   .80
Black   Neutral    5.71   47.58   1.79   0.73   20   .47   .16

With 19 to 21 participants in each between-subject condition, the study had 80% power to detect equivalence with equivalence bounds of d = -0.68 and d = 0.68.

powerTOSTone(alpha=0.05, statistical_power=0.8, low_eqbound_d=-0.68, high_eqbound_d=0.68)
## The required sample size to achieve 80 % power with equivalence bounds of -0.68 and 0.68 is 19
## 
## [1] 18.52043
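The printed sample size can be reproduced with a normal-approximation formula for symmetric bounds and a true effect of zero. That the package computes it in exactly this way is an assumption; the sketch below only shows that this formula returns the same number.

```r
alpha   <- 0.05
power   <- 0.80
bound_d <- 0.68   # symmetric bounds, -0.68 and 0.68

# Both one-sided tests must reject, so the Type 2 error rate is split in two
z_alpha <- qnorm(1 - alpha)
z_beta  <- qnorm(1 - (1 - power) / 2)

n_required <- (z_alpha + z_beta)^2 / bound_d^2
print(n_required)            # about 18.52
print(ceiling(n_required))   # rounded up to whole participants
```

Because the required n scales with 1 / bound_d^2, halving the equivalence bound roughly quadruples the number of participants needed.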

When we perform 5 equivalence tests using equivalence bounds of d = -0.68 and d = 0.68, using a Holm-Bonferroni sequential procedure to control error rates, we can conclude statistical equivalence for the data in rows 1, 3, and 6, but the conclusion is undetermined for rows 2 and 4, where the data is neither significantly different from guessing average, nor statistically equivalent. For example, performing the test for the data in row 6:

TOSTone(m=5.71,mu=6,sd=1.79,n=20,low_eqbound_d=-0.68, high_eqbound_d=0.68, alpha=0.05)
## Using alpha = 0.05 the NHST one-sample t-test was non-significant, t(19) = -0.724536, p = 0.4775649
## 
## 
## Using alpha = 0.05 the equivalence test was significant, t(19) = 2.316516, p = 0.01592727
## 

## TOST results:
##   t-value 1  p-value 1 t-value 2    p-value 2 df
## 1  2.316516 0.01592727 -3.765588 0.0006543016 19
## 
## Equivalence bounds (Cohen's d):
##   low bound d high bound d
## 1       -0.68         0.68
## 
## Equivalence bounds (raw scores):
##   low bound raw high bound raw
## 1       -1.2172         1.2172
## 
## TOST confidence interval:
##   Lower Limit 90% CI raw Upper Limit 90% CI raw
## 1             -0.9820961              0.4020961
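The Holm-Bonferroni step across the five conditions without a predicted effect can be sketched in base R. The code assumes n = df + 1 for each row of the table (so n = 21 for row 6, whereas the call above uses n = 20), a guessing average of mu = 6 as in the call above, and takes the larger of the two one-sided p-values as the TOST p-value; the variable names are illustrative.

```r
# Rows 1, 2, 3, 4 and 6 of the table (row 5 is the predicted effect)
table_row <- c(1, 2, 3, 4, 6)
condition <- c("White Positive", "White Negative", "White Neutral",
               "Black Positive", "Black Neutral")
m  <- c(6.05, 6.68, 5.95, 6.45, 5.71)
s  <- c(1.50, 1.70, 1.40, 1.91, 1.79)
n  <- c(20, 18, 19, 19, 20) + 1   # assumption: n = df + 1
mu <- 6
bound_d <- 0.68
alpha <- 0.05

# One-sample TOST per condition, with raw bounds of -/+ bound_d * SD
se     <- s / sqrt(n)
t_low  <- (m - mu + bound_d * s) / se
t_high <- (m - mu - bound_d * s) / se
p_tost <- pmax(pt(t_low, n - 1, lower.tail = FALSE), pt(t_high, n - 1))

# Holm-Bonferroni adjustment over the five equivalence tests
p_holm <- p.adjust(p_tost, method = "holm")

holm_table <- data.frame(table_row, condition, p_tost, p_holm,
                         equivalent = p_holm < alpha)
print(holm_table)
```

Under these assumptions the adjusted p-values fall below 0.05 for rows 1, 3, and 6 only, in line with the conclusion stated above.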

It is clear I should have been more tentative in my conclusions in this study. Not only can I not conclude equivalence in some of the conditions, the equivalence bound I had 80% power to detect is very large, meaning the possibility that there are theoretically interesting but smaller effects remains.

Lakens, D., Semin, G. R., & Foroni, F. (2012). But for the bad, there would not be good: Grounding valence in brightness through shared relational structures. Journal of Experimental Psychology: General, 141(3), 584-594. https://doi.org/10.1037/a0026468

 

 

Equivalence test for correlations

 

Olson, Fazio, and Hermann (2007) reported correlations between implicit and explicit measures of self-esteem, such as the IAT, Rosenberg’s self-esteem scale, a feeling thermometer, and trait ratings. In Study 1, 71 participants completed the self-esteem measures. Because no equivalence bounds are mentioned, we can see which equivalence bounds the researchers would have 80% power to detect.

powerTOSTr(alpha=0.05, statistical_power=0.8, low_eqbound_r=-0.24, high_eqbound_r=0.24)
## The required sample size to achieve 80 % power with equivalence bounds of -0.24 and 0.24 is 71 pairs
## 
## [1] 70.05703

With 71 pairs of observations between measures, the researchers have 80% power for equivalence bounds of r = -0.24 and r = 0.24.
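The value of 70.06 pairs can be reproduced by converting the correlation bound to Cohen's d with d = 2r / sqrt(1 - r^2) and applying a normal-approximation sample size formula. Whether this is how the package arrives at its result is an assumption; the sketch only shows that the calculation yields the same number.

```r
alpha   <- 0.05
power   <- 0.80
bound_r <- 0.24   # symmetric bounds, -0.24 and 0.24

# Convert the correlation bound to a standardized mean difference
bound_d <- 2 * bound_r / sqrt(1 - bound_r^2)

# Normal approximation, with the Type 2 error rate split over both one-sided tests
z_alpha <- qnorm(1 - alpha)
z_beta  <- qnorm(1 - (1 - power) / 2)
n_pairs <- 2 * (z_alpha + z_beta)^2 / bound_d^2

print(n_pairs)            # about 70.06
print(ceiling(n_pairs))   # rounded up to whole pairs
```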

The correlations observed by Olson et al. (2007), Study 1, are presented in the table below (significant correlations are flagged by an asterisk).

Measure               IAT    Rosenberg   Feeling thermometer   Trait ratings
IAT                   -      -.12        -.09                  -.06
Rosenberg                    -           .62*                  .09
Feeling thermometer                      -                     .29*
Trait ratings                                                  -

We can test each correlation, for example the correlation of -0.06 between the IAT and the trait ratings, given 71 participants, and using equivalence bounds of r_L = -0.24 and r_U = 0.24.

TOSTr(n=71, r=-0.06, low_eqbound_r=-0.24, high_eqbound_r=0.24, alpha=0.05)
## Using alpha = 0.05 the NHST t-test was non-significant, p = 0.6191585
## 
## 
## Using alpha = 0.05 the equivalence test was non-significant, p = 0.06386793
## 

## TOST results:
##     p-value 1  p-value 2
## 1 0.005971455 0.06386793
## 
## Equivalence bounds (r):
##   low bound r high bound r
## 1       -0.24         0.24
## 
## TOST confidence interval:
##   Lower Limit 90% CI raw Upper Limit 90% CI raw
## 1             -0.2538652              0.1384997

We see that none of the non-significant correlations can be declared to be statistically equivalent. Instead, these tests yield undetermined outcomes: the correlations are neither significantly different from 0 nor statistically equivalent.
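This can be checked for all four non-significant correlations at once. The sketch below assumes a Fisher z-transformation with standard error 1 / sqrt(n - 3), which reproduces the p-values and confidence interval printed above for r = -0.06; whether the package uses exactly this approach internally is an assumption, and the labels are illustrative.

```r
n <- 71
r <- c(IAT_Rosenberg   = -0.12,
       IAT_thermometer = -0.09,
       IAT_trait       = -0.06,
       Rosenberg_trait =  0.09)
low_r <- -0.24; high_r <- 0.24; alpha <- 0.05

# Fisher z-transform of the correlations and of the bounds
se_z   <- 1 / sqrt(n - 3)
z_low  <- (atanh(r) - atanh(low_r)) / se_z    # tests H0: rho <= lower bound
z_high <- (atanh(r) - atanh(high_r)) / se_z   # tests H0: rho >= upper bound
p_tost <- pmax(pnorm(z_low, lower.tail = FALSE), pnorm(z_high))

# 90% confidence intervals, transformed back to the r scale
ci90 <- tanh(cbind(lower = atanh(r) - qnorm(1 - alpha) * se_z,
                   upper = atanh(r) + qnorm(1 - alpha) * se_z))

print(round(cbind(r, p_tost, ci90), 4))
```

Every TOST p-value is above 0.05, and each 90% interval crosses one of the bounds, so none of these correlations can be declared equivalent with 71 participants and bounds of -0.24 and 0.24.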

Olson, M. A., Fazio, R. H., & Hermann, A. D. (2007). Reporting tendencies underlie discrepancies between implicit and explicit measures of self-esteem. Psychological Science, 18(4), 287-291.