This vignette will show users how the `corrcoverage`

R package can be used to obtain a new credible set of variants that contains the true causal variant with some specified desired coverage value whilst containing as few variants as possible.

As in the corrected coverage vignette, let’s begin by simulating some GWAS data using the `simGWAS`

package. This package is not avaliable on CRAN so the example data generated by the simGWAS package is contained in this package.

```
set.seed(18)
library(corrcoverage)
# library(simGWAS)
data <- system.file("extdata", "", package="corrcoverage")
# Simulate reference haplotypes
nsnps <- 200
nhaps <- 1000
lag <- 30 # genotypes are correlated between neighbouring variants
maf <- runif(nsnps + lag, 0.05, 0.5) # common SNPs
laghaps <- do.call("cbind", lapply(maf, function(f) rbinom(nhaps, 1, f)))
haps <- laghaps[, 1:nsnps]
for (j in 1:lag) haps <- haps + laghaps[, (1:nsnps) + j]
haps <- round(haps/matrix(apply(haps, 2, max), nhaps, nsnps, byrow = TRUE))
snps <- colnames(haps) <- paste0("s", 1:nsnps)
freq <- as.data.frame(haps + 1)
freq$Probability <- 1/nrow(freq)
sum(freq$Probability)
```

`## [1] 1`

```
MAF <- colMeans(freq[, snps] - 1) # minor allele frequencies
CV <- sample(snps[which(colMeans(haps) > 0.1)], 1)
iCV <- sub("s", "", CV) # index of cv
LD <- cor2(haps) # correlation between SNPs
```

```
OR <- 1.1 # odds ratios
N0 <- 10000 # number of controls
N1 <- 10000 # number of cases
#z0 <- simulated_z_score(N0 = N0, # number of controls
# N1 = N1, # number of cases
# snps = snps, # column names in freq
# W = CV, # causal variants, subset of snps
# gamma.W = log(OR), # log odds ratios
# freq = freq) # reference haplotypes
z0 <- readRDS(paste0(data,"/z0_2.RDS"))
```

To calculate \(V\), the prior variance for the estimated effect size, we use `Var.data.cc`

.

We can then use the `ppfunc`

function to calculate the posterior probabilities of causality for each variant.

We use the `est_mu`

function to obtain an estimate of the true effect at the causal variant.

`## [1] 4.970273`

The `corrected_cov`

function is used to find the corrected coverage of a credible set with specified threshold, say 0.9.

Note that this function is similar to using `corrcov`

as explained in the “Corrected Coverage” vignette; which would require \(Z\)-scores, minor allele frequencies and sample sizes. Here, we have already calculated some of the intermediaries calculated in the `corrcov`

function (muhat, varbeta etc.) so we can use `corrected_cov`

instead.

```
thr = 0.9
corrcov <- corrected_cov(pp0 = postprobs, mu = muhat, V = varbeta,
Sigma = LD, thr = thr, nrep = 1000)
cs <- credset(pp = postprobs, thr = thr)
data.frame(claimed.cov = cs$claimed.cov, corr.cov = corrcov, nvar = cs$nvar)
```

```
## claimed.cov corr.cov nvar
## 1 0.9307395 0.967282 9
```

Using the Bayesian approach for statistical fine-mapping we obtain a 90% credible set consisting of 9 variants. The claimed coverage of this credible set is ~0.93, yet the corrected coverage estimate is ~0.97, suggesting that we can afford to be ‘more confident’ that we have captured the causal variant in our credible set.

Again, for the purpose of this vignette we can investigate how accurate this estimate is by simulating many credible sets from the same system, and finding the proportion of these that contain the true causal variant. Note that nrep is relatively low in this example to keep stored data in the R package within CRAN guidelines. Please increase nrep to increase accuracy of estimated empirical coverage.

```
# z0.tmp <- simulated_z_score(N0 = N0, # number of controls
# N1 = N1, # number of cases
# snps = snps, # column names in freq
# W = CV, # causal variants, subset of snps
# gamma.W = log(OR), # log odds ratios
# freq = freq, # reference haplotypes
# nrep = 200) # increase nrep to increase accuracy of estimate
z0.tmp <- readRDS(paste0(data,"/z0.tmp_2.RDS"))
pps <- ppfunc.mat(zstar = z0.tmp, V = varbeta) # find pps
cs.cov <- apply(pps, 1, function(x) credset(x, CV = iCV, thr = thr)$cov)
true.cov.est <- mean(cs.cov)
data.frame(claimed.cov = cs$claimed.cov, corr.cov = corrcov,
true.cov = true.cov.est, nvar = cs$nvar)
```

```
## claimed.cov corr.cov true.cov nvar
## 1 0.9307395 0.967282 0.97 9
```

We find that our corrected coverage value is very close to the empirical coverage of the credible set.

Our results suggest that we may be able to remove some variants from the credible set, whilst still achieving the desired coverage of 90%.

The `corrected_cs`

function uses GWAS summary statistics and some user-defined parameters to find the smallest credible set such that the coverage estimate is within some accuracy of the desired coverage.

The function requires the following parameters to be specified:

`z`

(vector of marginal \(Z\)-scores)`f`

(vector of minor allele frequencies)`N0`

,`N1`

(number of controls and cases respectively)`Sigma`

(SNP correlation matrix)`desired.cov`

(desired coverage of the credible set)

In addition, there are a number of optional parameters:

`lower`

(the lower value to try for the threshold)`upper`

(the upper value to try for the threshold)`acc`

(the accuracy required for the coverage of the resultant credible set)`max.iter`

(maximum number of iterations)

The function uses the bisection root finding method to converge to the smallest threshold such that the corrected coverage is larger than the desired coverage. The `lower`

and `upper`

parameters define the boundaries of the threshold values for the root finding method, with default values of 0 and 1 respectively. The `acc`

parameter has a default value set to 0.005, meaning that the algorithm will keep attempting to find a corrected credible set (until the number of iterations reaches `max.iter`

) that has coverage within 0.005 of the desired coverage value (e.g. between 0.895 and 0.905 for a 90% credible set).

The function reports the threshold values tested and their corresponding corrected coverage value at each iteration. The maximum number of iterations for the bisecting root finding algorithm is an optional parameter, with default value 20. The functions stops when either the number of iterations reaches the maximum, or the corrected coverage is within some accuracy of the desired coverage.

```
## [1] "thr: 0.75 , cov: 0.918448964618699"
## [1] "thr: 0.625 , cov: 0.870505410370508"
## [1] "thr: 0.6875 , cov: 0.893551309455069"
## [1] "thr: 0.71875 , cov: 0.906172863191769"
## [1] "thr: 0.703125 , cov: 0.899644208037482"
## [1] "thr: 0.7109375 , cov: 0.901677186379133"
```

```
## $credset
## [1] "s54" "s58" "s57" "s67"
##
## $req.thr
## [1] 0.7109375
##
## $corr.cov
## [1] 0.9016772
##
## $size
## [1] 0.7343958
```

The first threshold value tested is the midpoint of the `lower`

(0.5) and `upper`

(1) parameter values, which here is 0.75. The coverage of this credible set is too high (\(>0.9\)) and so a smaller threshold value is tested, the midpoint of the `lower`

parameter value (0.5) and the previously tried threshold value (0.75). This credible set has a coverage that is too low, and so a higher threshold value is tested, and so on.

In this example we see that a much smaller threshold value is required to obtain a credible set with 90% corrected coverage of the causal variant, containing only 4 variants. In the standard Bayesian approach, the threshold value of 90% leads to over-coverage.

Finally, we can compare this coverage estimate of the corrected credible set to an empirical estimate of the true coverage.

```
new.cs.sims <- apply(pps, 1, function(x) credset(x, CV = iCV, thr = res$req.thr)$cov)
true.cov.est2 <- mean(new.cs.sims)
```

Original 90% credible set:

```
df1 <- data.frame(claimed.cov = round(cs$claimed.cov, 3), corr.cov = round(corrcov, 3), true.cov = round(true.cov.est, 3), nvar = cs$nvar)
print(df1, row.names = FALSE)
```

```
## claimed.cov corr.cov true.cov nvar
## 0.931 0.967 0.97 9
```

New 90% credible set:

```
df2 <- data.frame(claimed.cov = round(res$size, 3), corr.cov = round(res$corr.cov, 3), true.cov = round(true.cov.est2, 3), nvar = length(res$credset))
print(df2, row.names = FALSE)
```

```
## claimed.cov corr.cov true.cov nvar
## 0.734 0.902 0.9 4
```

This vignette has shown how the `corrcoverage`

R package can be used to improve the resolution of a credible set from Bayesian genetic fine-mapping, without the use of any additional data.