# Selecting Distributions

## Model Selection

It can be difficult to select a ‘best fitting’ distribution when modeling species sensitivity data with low sample sizes. In these situations, model averaging can be used to fit multiple distributions and calculate a weighted-average hazard concentration (HCx, the concentration affecting x% of species) and its confidence limits (Schwarz and Tillmanns 2019). Distributions selected for use in model averaging must be bounded by zero because effect concentrations cannot be negative. Furthermore, the selected distributions should provide a variety of shapes to capture the diversity of shapes in empirical species sensitivity distributions.
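The zero-bound requirement can be checked directly in base R. The sketch below is only an illustration: the parameter values are arbitrary assumptions rather than fitted values, and the normal distribution is included purely as a counter-example of a distribution that places probability mass on negative concentrations.

```r
# Probability that a concentration is <= 0 under each candidate distribution.
# Parameter values are arbitrary and chosen for illustration only.
mass_below_zero <- c(
  normal = pnorm(0, mean = 2, sd = 2),           # unbounded: mass below zero
  lognormal = plnorm(0, meanlog = 2, sdlog = 2), # bounded by zero
  gamma = pgamma(0, shape = 2, scale = 1)        # bounded by zero
)
print(mass_below_zero)
```

Only the distributions with no mass at or below zero are suitable candidates on the concentration scale.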

By default, ssdtools uses the Akaike Information Criterion corrected for small sample size (AICc) as a measure of the relative quality of fit of the different distributions and as the basis for calculating the model-averaged HCx.
However, if two or more similar models fit the data well, the support for that type of shape will be over-inflated (Burnham and Anderson 2002).
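The over-inflation can be illustrated with a small base R sketch of AICc and Akaike weights. The log-likelihoods and the sample size below are hypothetical values invented for illustration, and the helper functions `aicc()` and `aicc_weights()` are not part of ssdtools. The sketch compares the total weight given to the symmetric (on the log scale) shape before and after a near-duplicate log-logistic model is added.

```r
# AICc from a maximised log-likelihood, k parameters and n observations.
aicc <- function(loglik, k, n) {
  -2 * loglik + 2 * k + (2 * k * (k + 1)) / (n - k - 1)
}

# Akaike weights from a named vector of AICc values.
aicc_weights <- function(aicc_values) {
  delta <- aicc_values - min(aicc_values)
  rel_lik <- exp(-0.5 * delta)
  rel_lik / sum(rel_lik)
}

n <- 20 # hypothetical number of species

# Hypothetical maximised log-likelihoods for two-parameter distributions.
loglik_two <- c(lnorm = -100.0, gamma = -100.5)
w_two <- aicc_weights(aicc(loglik_two, k = 2, n = n))

# Add a log-logistic fit that is nearly indistinguishable from the log-normal.
loglik_three <- c(loglik_two, llogis = -100.1)
w_three <- aicc_weights(aicc(loglik_three, k = 2, n = n))

# Total weight carried by the shape that is symmetric on the log scale.
symmetric_two <- w_two[["lnorm"]]
symmetric_three <- w_three[["lnorm"]] + w_three[["llogis"]]

print(round(w_two, 3))
print(round(w_three, 3))
print(round(c(before = symmetric_two, after = symmetric_three), 3))
```

With these made-up numbers, the combined weight of the two similar models exceeds the weight the log-normal had on its own, even though no new shape has been added to the candidate set.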

## Default Distributions

To avoid such model redundancy (Burnham and Anderson 2002), ssdtools and the accompanying Shiny app (Dalgarno 2018) fit three distributions with different shapes by default: the log-normal, the gamma and the two-parameter Burr Type III distributions.

The three distributions are plotted below with a mean of 2 and a standard deviation of 2 on the (natural) log concentration scale, which corresponds to a geometric mean of around 7.4 on the concentration scale.

```r
library(ssdtools)
library(ggplot2)

set.seed(7)
ssd_plot_cdf(ssd_match_moments(meanlog = 2, sdlog = 2))
```
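The value of around 7.4 follows from back-transforming the log-scale mean. As a quick check in base R (a sketch for the log-normal case only, where `exp(meanlog)` is both the geometric mean and the median), the snippet below also computes the 5% hazard concentration for the same parameters:

```r
meanlog <- 2
sdlog <- 2

# Median of the log-normal on the concentration scale.
median_conc <- qlnorm(0.5, meanlog = meanlog, sdlog = sdlog)

# Concentration below which 5% of the distribution lies (HC5).
hc5 <- qlnorm(0.05, meanlog = meanlog, sdlog = sdlog)

print(c(exp_meanlog = exp(meanlog), median = median_conc, hc5 = hc5))
```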

### Log-normal distribution

The log-normal distribution was selected as the starting distribution given that the data are effect concentrations. The log-logistic distribution is often used as a candidate SSD, primarily because of its analytic tractability (Aldenberg and Slob 1993). However, we did not include the log-logistic distribution in the default set of distributions because 1) it has a very similar shape to the log-normal distribution and 2) it is a specific case of the more general Burr family of distributions (Shao 2000).
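The second point can be checked numerically. The sketch below assumes one common parameterisation of the Burr Type III (Dagum) cumulative distribution function, with two shape parameters and one scale parameter; ssdtools may parameterise it differently. The functions `pburr3()` and `pllogis_sketch()` and the concentrations are hypothetical and defined here only for illustration. Setting the second shape parameter to 1 reproduces the log-logistic distribution.

```r
# Burr Type III (Dagum) CDF in one common parameterisation (an assumption here;
# ssdtools may parameterise the distribution differently).
pburr3 <- function(q, shape1, shape2, scale) {
  (1 + (q / scale)^(-shape1))^(-shape2)
}

# Log-logistic CDF written via the logistic CDF on the log scale.
pllogis_sketch <- function(q, shape, scale) {
  plogis(log(q), location = log(scale), scale = 1 / shape)
}

conc <- c(0.1, 1, 5, 10, 50, 100) # hypothetical concentrations
burr <- pburr3(conc, shape1 = 1.5, shape2 = 1, scale = 7.4)
llogis <- pllogis_sketch(conc, shape = 1.5, scale = 7.4)

# TRUE when the two CDFs agree to numerical tolerance.
print(all.equal(burr, llogis))
```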

The log-normal distribution does have a couple of limitations to consider when fitting species sensitivity data. First, on the logarithmic scale, the normal distribution is symmetrical. Second, the tails of the log-normal distribution decay quickly, giving narrow tails that may not adequately fit the data. Additional distributions were selected to compensate for these deficiencies.

### Gamma distribution

The gamma distribution is a two-parameter distribution commonly used to model failure times or times to events. For use in modeling species sensitivity data, the gamma distribution has two key features that provide additional flexibility relative to the log-normal distribution: 1) it is non-symmetrical on the logarithmic scale; and 2) it has wider tails. The Weibull distribution was considered as an alternative but the gamma distribution is generally more flexible.
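The asymmetry on the logarithmic scale can be seen from quantiles alone. The sketch below uses a hypothetical helper, `log_quantile_asymmetry()`, which compares the distance from the median to the 5% quantile with the distance from the median to the 95% quantile, both on the log scale. A ratio of 1 indicates symmetry. The gamma parameters are arbitrary values chosen for illustration.

```r
# Ratio of lower to upper log-scale distances from the median.
# qfun is a quantile function such as qlnorm or qgamma.
log_quantile_asymmetry <- function(qfun, ...) {
  q <- log(qfun(c(0.05, 0.5, 0.95), ...))
  (q[2] - q[1]) / (q[3] - q[2])
}

ratio_lnorm <- log_quantile_asymmetry(qlnorm, meanlog = 2, sdlog = 2)
ratio_gamma <- log_quantile_asymmetry(qgamma, shape = 0.5, scale = 10)

print(c(lnorm = ratio_lnorm, gamma = ratio_gamma))
```

For the log-normal the ratio is 1, while for this gamma distribution the lower tail extends further than the upper tail on the log scale.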

### Burr Type III

Burr (1942) developed a family of 12 distributions with flexible shapes. The original Burr Type III distribution (also known as the inverse Burr distribution or the Dagum distribution) is a three-parameter distribution that can take many shapes as dictated by the data and has a heavy tail, which is useful in modeling extreme concentrations (Domma et al. 2011).

The original Burr III distribution has three parameters (two shape parameters and one scale parameter). However, the estimated parameters are very highly correlated for small samples and the distribution is essentially non-identifiable (i.e. different sets of parameter values give rise to essentially the same fit). For this reason, we selected the two-parameter (one shape and one scale parameter) version of the Burr III distribution, which has better properties in small samples, as a default distribution.

## All Distributions

For completeness, the other distributions provided by ssdtools are plotted below alongside the defaults, again with a mean of 2 and a standard deviation of 2 on the (natural) log concentration scale, which corresponds to a geometric mean of around 7.4 on the concentration scale.

```r
set.seed(7)
ssd_plot_cdf(ssd_match_moments(dists = c(
  "burrIII2", "burrIII3", "gamma",
  "gompertz", "lgumbel", "llogis",
  "lnorm", "weibull"
), meanlog = 2, sdlog = 2)) +
  theme(legend.position = "bottom")
```

## References

Aldenberg, T., and Slob, W. 1993. Confidence Limits for Hazardous Concentrations Based on Logistically Distributed NOEC Toxicity Data. Ecotoxicology and Environmental Safety 25(1): 48–63. http://doi.org/10.1006/eesa.1993.1006.

Burnham, K.P., and Anderson, D.R. 2002. Model Selection and Multimodel Inference. Springer New York, New York, NY. http://doi.org/10.1007/b97636.

Burr, I.W. 1942. Cumulative frequency functions. Annals of Mathematical Statistics 13(2): 215–232.

Dalgarno, D. 2018. ssdtools: A shiny web app to analyse species sensitivity distributions. Prepared by Poisson Consulting for the Ministry of the Environment, British Columbia. https://bcgov-env.shinyapps.io/ssdtools/

Domma, F., Giordano, S., and Zenga, M. 2011. Maximum likelihood estimation in Dagum distribution with censored samples. Journal of Applied Statistics 38(12): 2971–2985. http://doi.org/10.1080/02664763.2011.578613.

Schwarz, C.J., and Tillmanns, A.R. 2019. Improving statistical methods to derive species sensitivity distributions. Water Science Series, WSS2019-07, Province of British Columbia, Victoria.

Shao, Q. 2000. Estimation for hazardous concentrations based on NOEC toxicity data: an alternative approach. : 13.