Computing a PGS using RápidoPGS and GWAS catalog

Guillermo Reales

2020-08-03

Introduction

RápidoPGS is a tool to quickly compute polygenic scores (PGS) from GWAS summary statistics datasets from either case-control (eg. asthma) or quantitative traits (eg. height and BMI).

Input GWAS summary statistics datasets should have a minimum set of columns, with these names, but in any order:

Installation of RápidoPGS

If you don’t already have the dependency GenomicRanges, install it first

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("GenomicRanges")

RápidoPGS is currently available on GitHub, so we can install it using the following code (Note: If you don’t have remotes installed, please install it first:

if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")
remotes::install_github("GRealesM/RapidoPGS")

Setup

Once installed, let’s load it. This will automatically load all required dependencies.

library(RapidoPGS)

Downloading data from GWAS catalog

GWAS catalog is an outstanding GWAS summary statistics data source, as it not only puts together tons of public datasets, but it also uses a semi-automatic pipeline to apply quality control and liftover (a.k.a. harmonise) those datasets, thus helping overcome the problem of having GWAS sumstats in so many different formats and builds.

In this vignette, we’ll use GWAS catalog to obtain an example dataset. For this vignette we’ll use a Breast cancer (BRCA) dataset from Michailidou et al., 2017, which is one that we used in our manuscript. This GWAS was performed on 119078 controls and 137045 cases of Breast cancer.

It’s a big file, and may take a while to download, so here we will show the command to download, but actually cheat and load a subset of data already loaded.

Note that in this particular case we chose a study that contained a single dataset. In other cases there may be more than one. In that case gwascat.download will prompt you with the list of available options, prompting their accession numbers, and asking you to choose a file by its number in the list, if running interactively. If you want to run gwascat.download automatically, and you know the number already, you can supply it using filenum argument, so R won’t bug you.

We use it’s PubMed ID (29059683). We could download only harmonised GWAS catalog columns (thus having a lighter file) but I don’t advise this because sometimes we may need columns from the original file. We then set hm_only = FALSE to get all columns.

ds <- gwascat.download(29059683, hm_only = FALSE)

Now for the cheat: for illustrative purposes, I took a random subset from this file including 100,000 SNPs from this file, which we’ll use in this tutorial. Bear in mind that the final PGS won’t be a valid model since we randomly removed most SNPs. It will serve for our teaching purposes, though! ;)

ds <- michailidou

Applying quality control to the dataset

The first thing to do is to check what our file looks like. RápidoPGS can handle NAs in crucial columns, so don’t worry too much about that (unless you have all NA for crucial columns, of course!).

A note of caution when dealing with GWAS catalog files, though: since they use a semi-automated pipeline, it is possible that even some columns are present, they are empty, so be careful with that!

Also, RápidoPGS requires allele frequencies, so it’s possible that you need to compute it from an external reference panel (eg. 1000 Genomes Phase III). We don’t cover that step in this vignette, but we might write instructions explaining how to do it in the future.

Let’s take a look at the file!

summary(ds)
##    hm_rsid            hm_chrom             hm_pos          hm_other_allele   
##  Length:100000      Length:100000      Min.   :    10519   Length:100000     
##  Class :character   Class :character   1st Qu.: 31928836   Class :character  
##  Mode  :character   Mode  :character   Median : 69324516   Mode  :character  
##                                        Mean   : 78427892                     
##                                        3rd Qu.:114421523                     
##                                        Max.   :248906719                     
##  hm_effect_allele      hm_beta           hm_effect_allele_frequency
##  Length:100000      Min.   :-2.9353000   Min.   :0.0051            
##  Class :character   1st Qu.:-0.0099000   1st Qu.:0.0200            
##  Mode  :character   Median :-0.0003000   Median :0.1033            
##                     Mean   :-0.0008408   Mean   :0.2230            
##                     3rd Qu.: 0.0095000   3rd Qu.:0.3537            
##                     Max.   : 0.6549000   Max.   :0.9949            
##  standard_error      p_value      
##  Min.   :0.0062   Min.   :0.0000  
##  1st Qu.:0.0075   1st Qu.:0.1940  
##  Median :0.0115   Median :0.4492  
##  Mean   :0.0197   Mean   :0.4617  
##  3rd Qu.:0.0259   3rd Qu.:0.7215  
##  Max.   :1.0917   Max.   :0.9993

In this case, we don’t have any missing data, which is fantastic. We’ll now prepare the dataset for RápidoPGS. We need to rename columns to something RápidoPGS can understand, and this is easy to do with data.table.

setnames(ds, old = c("hm_rsid", "hm_chrom", "hm_pos", "hm_other_allele", "hm_effect_allele", "hm_beta", "hm_effect_allele_frequency", "standard_error", "p_value"), new=c("SNPID","CHR", "BP","REF", "ALT", "BETA", "ALT_FREQ", "SE", "P"))

Now another thing to consider: RápidoPGS use only autosomes, so remove X or Y chromosomes at this step.

ds <- ds[CHR != "X",]

Computing PGS using RápidoPGS

Now we have applied QC to our dataset, we can now compute our PGS! We can easily do that applying the next command, but bear in mind that, since we used harmonised columns, the build is hg38, so we must tell RápidoPGS about that.

We should also provide the number of controls (N0) and cases (N1) for a case-control study. Note that if this was a quantitative trait dataset, we should inform total number of individuals only (as N0). Also, if our dataset had columns reporting the number of individuals for each SNP, we can replace N0 (and N1, in case-control datasets) by strings specifying the respective columns (eg. N0 = “Ncontrols”, N1 = “Ncases”). By doing so, RápidoPGS will extract this information directly from the columns.

Let’s get our hands dirty! Let’s compute first a full RápidoPGS model. Since we used harmonised coordinates from GWAS catalog, our file is in build hg38, and we need to specify it to RápidoPGS (default is hg19).

Advanced use note: You may want to filter by and align your dataset to an external reference. You can do that with RápidoPGS, too, by specifying the path of your reference file at reference argument. This reference file should have five columns (CHR, BP, REF, ALT, and SNPID) and be in the same build as our summary statistics dataset. RápidoPGS currently supports hg19 and hg38 builds.

full_PGS <- computePGS(ds, N0= 119078 ,N1=137045, build = "hg38")
## Assigning LD blocks...
## Done!
## Computing PGS for a case-control dataset, with 119078 controls, and 137045 cases.

Well, that was rápido! With increasingly big datasets, it will take a bit longer, of course.

Let’s take a look.

head(full_PGS)
##          SNPID CHR        BP REF ALT    BETA ALT_FREQ     SE       P ld.block
## 1: rs117214991   9 109388299   C   T  0.0929   0.0052 0.0559 0.09647      950
## 2:  rs61113083  11  82946128   G   A -0.0336   0.0623 0.0132 0.01086     1088
## 3: rs118065272  12  71684434   G   A  0.0021   0.0401 0.0219 0.92530     1167
## 4:   rs6067900  20  51693556   T   C -0.0038   0.2610 0.0072 0.59860     1574
## 5:   rs6132000  20  17808883   C   T -0.0119   0.0981 0.0119 0.31710     1557
## 6: rs544195892   3 196463480 TAC   T  0.0221   0.0526 0.0167 0.18470      381
##             ppi        weight
## 1: 2.876406e-12  2.672181e-13
## 2: 1.464198e-04 -4.919705e-06
## 3: 7.146329e-06  1.500729e-08
## 4: 3.237911e-06 -1.230406e-08
## 5: 7.740823e-06 -9.211579e-08
## 6: 1.503044e-05  3.321728e-07

We see three new columns: ld.block corresponds to the ld-detect LD block number assigned to the SNP (see manuscript for more details of where this comes from), ppi is the posterior probability that the SNP is causal, and weight is the weighted effect size (BETA * ppi) - and the column we’re interested in.

Applying different thresholds

Imagine that we want a small portable PGS. We could use a threshold (eg. keep only SNPs with ppi larger than \(1e^{-4}\), a reasonable threshold) or keep the top SNPs with largest weights (in either way).

We can do that using RápidoPGS, let’s see how by using a \(1e^{-4}\) threshold.

For accuracy reasons, we recommend recomputing the PGS on just these SNPs after the threshold was applied, so recalc = TRUE by default.

PGS_1e4 <- computePGS(ds, N0= 119078 ,N1=137045, build = "hg38", filt_threshold = 1e-4)
## Assigning LD blocks...
## Done!
## Computing PGS for a case-control dataset, with 119078 controls, and 137045 cases.
head(PGS_1e4)
##          SNPID CHR        BP REF ALT    BETA ALT_FREQ     SE        P ld.block
## 1:  rs61113083  11  82946128   G   A -0.0336   0.0623 0.0132 0.010860     1088
## 2: rs192985834  11  66280040   A   T -0.1053   0.0080 0.0442 0.017230     1079
## 3:  rs17826142  16  48507898   T   C  0.0503   0.0276 0.0195 0.009998     1391
## 4: rs141121949   4  42132514   G   A -0.0491   0.0399 0.0167 0.003400      413
## 5: rs193198319  14  86445647   A   C -0.0513   0.0319 0.0204 0.011990     1306
## 6:   rs4434842   1 103949496   T   G -0.0374   0.9488 0.0149 0.011880       60
##             ppi        weight
## 1: 0.0001466256 -4.926619e-06
## 2: 0.0002256980 -2.376600e-05
## 3: 0.0002292983  1.153370e-05
## 4: 0.0005089674 -2.499030e-05
## 5: 0.0001828172 -9.378522e-06
## 6: 0.0001481117 -5.539377e-06

You can omit recalculation by setting that argument to FALSE

PGS_1e4_norecalc <- computePGS(ds, N0= 119078 ,N1=137045, build = "hg38", filt_threshold = 1e-4, recalc = FALSE)
## Assigning LD blocks...
## Done!
## Computing PGS for a case-control dataset, with 119078 controls, and 137045 cases.
head(PGS_1e4_norecalc)
##          SNPID CHR        BP REF ALT    BETA ALT_FREQ     SE        P ld.block
## 1:  rs61113083  11  82946128   G   A -0.0336   0.0623 0.0132 0.010860     1088
## 2: rs192985834  11  66280040   A   T -0.1053   0.0080 0.0442 0.017230     1079
## 3:  rs17826142  16  48507898   T   C  0.0503   0.0276 0.0195 0.009998     1391
## 4: rs141121949   4  42132514   G   A -0.0491   0.0399 0.0167 0.003400      413
## 5: rs193198319  14  86445647   A   C -0.0513   0.0319 0.0204 0.011990     1306
## 6:   rs4434842   1 103949496   T   G -0.0374   0.9488 0.0149 0.011880       60
##             ppi        weight
## 1: 0.0001464198 -4.919705e-06
## 2: 0.0002255228 -2.374755e-05
## 3: 0.0002290956  1.152351e-05
## 4: 0.0005085857 -2.497156e-05
## 5: 0.0001827297 -9.374036e-06
## 6: 0.0001479511 -5.533371e-06

And what if we want the top 10 SNPs? Just change the filt_threshold argument. If it’s larger than 1, RápidoPGS will understand that you want a top list, rather than a thresholded result.

PGS_top10 <- computePGS(ds, N0= 119078 ,N1=137045, build = "hg38", filt_threshold = 10)
## Assigning LD blocks...
## Done!
## Computing PGS for a case-control dataset, with 119078 controls, and 137045 cases.
head(PGS_top10)
##         SNPID CHR        BP REF ALT    BETA ALT_FREQ     SE         P ld.block
## 1: rs79054216  11  69371625   C   T  0.2309   0.0145 0.0265 3.041e-18     1080
## 2: rs10788196  10 121658121   T   G -0.1324   0.8495 0.0085 5.935e-55     1036
## 3: rs28475635   4 174921828   G   A -0.1030   0.1177 0.0098 5.256e-26      492
## 4: rs17756147  14  68569410   G   A -0.0973   0.2310 0.0074 2.634e-39     1295
## 5:   rs580057   3  27303612   G   A  0.1038   0.5249 0.0062 9.185e-63      282
## 6: rs12997141   2 217071663   C   T -0.0914   0.2349 0.0073 1.690e-35      247
##    ppi  weight
## 1:   1  0.2309
## 2:   1 -0.1324
## 3:   1 -0.1030
## 4:   1 -0.0973
## 5:   1  0.1038
## 6:   1 -0.0914

So those are examples of basic RápidoPGS usage!

There are more advanced options:

Drop us a line if you encounter problems, and we’ll be happy to help.

Good luck!

Guillermo