In this practical you will perform polygenic prediction using SBLUP and SBayesR, variations of BLUP and BayesR that require only GWAS summary statistics. We will use the same data sets as in the session on individual-level BLUP and BayesR analysis.
The scripts (source code) used for this practical include:
ls /data/module5/scr/sbayesr.R
ls /data/module5/scr/get_pred_r2.R
Load the data in R:
nmarkers <- 10 #number of markers
nrecords <- 325 #number of records
# data for training population
X <- matrix(scan("/data/module5/toy/xmat_trn.inp"), ncol=nmarkers, byrow = TRUE)
y <- matrix(scan("/data/module5/toy/yvec_trn.inp"), byrow = TRUE)
# data for validation population
Xval <- matrix(scan("/data/module5/toy/xmat_val.inp"), ncol=nmarkers, byrow = TRUE)
yval <- matrix(scan("/data/module5/toy/yvec_val.inp"), byrow = TRUE)
First, let us assume the genotype matrix is standardised (scaled) to have mean zero and variance one in each column. Run GWAS using the standardised genotypes:
X_unscaled = X
X_scaled = apply(X, 2, scale)
# run GWAS on the scaled genotypes
fit = apply(X_scaled, 2, function(x){summary(lm(y~x))$coefficients[2,1:2]})
b_scaled = fit[1,]
# LD correlation matrix (correlation is invariant to scaling)
R = cor(X)
Run SBLUP using the marginal SNP effects from GWAS and the LD correlation matrix:
lambda = 10
I = diag(nmarkers)
coeff = R + I*lambda/nrecords
rhs = b_scaled
beta_sblup = solve(coeff, rhs)
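To see why this works: for standardised genotypes, \(X'X \approx nR\) and the vector of marginal effects is \(b \approx X'y/n\), so dividing the BLUP equations by \(n\) gives

\[
(X'X + I\lambda)\,\hat{\beta} = X'y
\quad\Longrightarrow\quad
\left(R + I\,\frac{\lambda}{n}\right)\hat{\beta} = b,
\]

which involves only the GWAS marginal effects and an LD correlation matrix, not the individual-level data.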
Run BLUP using the individual-level data as benchmark:
lambda = 10
I = diag(nmarkers+1)
X = cbind(1, X_scaled)
coeff = crossprod(X) + I*lambda
rhs = crossprod(X, y)
beta_blup = solve(coeff, rhs)[-1]
Question: Are the BLUP and SBLUP solutions the same?
cor(beta_blup, beta_sblup)
## [1] 1
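A correlation of 1 can still hide a difference in scale, so it is worth checking the estimates directly. A minimal sketch, noting that small discrepancies may arise because crossprod of the scaled genotypes equals \((n-1)R\) rather than \(nR\):
# compare the two solutions directly
plot(beta_blup, beta_sblup)
abline(a=0, b=1)
max(abs(beta_blup - beta_sblup))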
In practice, GWAS is often done with unscaled genotypes (i.e., coded as 0, 1, 2). In this case, we need to first scale the marginal effects and then unscale the joint effect estimates to put them back on the per-allele scale.
# run GWAS on the 0/1/2 genotypes
fit = apply(X_unscaled, 2, function(x){summary(lm(y~x))$coefficients[2,1:2]})
b = fit[1,]
se = fit[2,]
# calculate the scale factor for each SNP, i.e., sqrt(heterozygosity/var(y)),
# assuming each SNP effect is vanishingly small
scale = sqrt(1/(nrecords*se^2))
# scale the marginal effects
b_scaled = b*scale
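As a quick check of this approximation, the se-based scale factor should closely match the direct ratio of standard deviations, sd(x)/sd(y). A sketch (exact equality is not expected, because each SNP does explain a little variance):
# compare the se-based scale factor with sd(x)/sd(y)
scale_direct = apply(X_unscaled, 2, sd)/sd(y)
plot(scale, scale_direct)
abline(a=0, b=1)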
Perform SBLUP using the scaled marginal effects:
lambda = 10
I = diag(nmarkers)
coeff = R + I*lambda/nrecords
rhs = b_scaled
beta_sblup_scaled = solve(coeff, rhs)
beta_sblup_unscaled = beta_sblup_scaled/scale
Then we can predict the GEBV of the selection candidates.
# get the predicted genetic value (GEBV)
ghat = Xval %*% beta_sblup_unscaled
# prediction R-square
summary(lm(yval ~ ghat))$r.squared
## [1] 0.4759358
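For a simple linear regression, the R-square equals the squared correlation between the observed and predicted values, so the same number can be obtained with:
cor(yval, ghat)^2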
In practice, we often use a sparse LD correlation matrix in which near-zero LD correlations are set to zero. Let's see how this influences the SBLUP performance.
# arbitrarily set a LD threshold such that R becomes sparse
R_sp = R
R_sp[abs(R_sp) < 0.2] = 0
lambda = 10
I = diag(nmarkers)
coeff = R_sp + I*lambda/nrecords
rhs = b_scaled
beta_sblup_scaled = solve(coeff, rhs)
beta_sblup_unscaled_sp = beta_sblup_scaled/scale
# compare the effect estimates from the dense and sparse LD matrices
plot(beta_sblup_unscaled, beta_sblup_unscaled_sp)
abline(a=0, b=1)
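To quantify the impact of sparsifying the LD matrix on prediction, we can compute the prediction R-square with the sparse-LD estimates (a quick sketch reusing the objects above):
ghat_sp = Xval %*% beta_sblup_unscaled_sp
summary(lm(yval ~ ghat_sp))$r.squared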
The SBayesR method is implemented in sbayesr.R. We also include the BayesR method in bayesr.R. Have a look at the code and see what the differences are.
source("/data/module5/scr/sbayesr.R")Run SBayesR. Note that here we are using the marginal effects from GWAS with unstandardised genotypes as input data, and the scaling factors are computed using standard errors.
sr.res = sbayesr(b, se, nrecords, R)
##
## iter 100, nnz = 8, sigmaSq = 0.888, h2 = 0.954, vare = 1.000
##
## iter 200, nnz = 8, sigmaSq = 1.314, h2 = 0.859, vare = 1.000
##
## iter 300, nnz = 3, sigmaSq = 4.070, h2 = 0.913, vare = 1.000
##
## iter 400, nnz = 4, sigmaSq = 0.492, h2 = 0.938, vare = 1.000
##
## iter 500, nnz = 3, sigmaSq = 3.921, h2 = 0.888, vare = 1.000
##
## iter 600, nnz = 4, sigmaSq = 2.423, h2 = 1.022, vare = 1.000
##
## iter 700, nnz = 9, sigmaSq = 3.864, h2 = 0.925, vare = 1.000
##
## iter 800, nnz = 8, sigmaSq = 1.120, h2 = 1.046, vare = 1.000
##
## iter 900, nnz = 7, sigmaSq = 1.850, h2 = 0.857, vare = 1.000
##
## iter 1000, nnz = 3, sigmaSq = 1.094, h2 = 0.971, vare = 1.000
##
## Posterior mean:
##       Pi1       Pi2       Pi3       Pi4       Nnz   SigmaSq        h2      Vare
## 0.4637807 0.2475237 0.1396782 0.1490173 4.5000000 2.3073678 0.9583929 1.0000000
beta.sr = colMeans(sr.res$beta)
Run BayesR as benchmark:
source("/data/module5/scr/bayesr.R")r.res = bayesr(X_unscaled, y)##
## iter 100, nnz = 5, sigmaSq = 2.204, h2 = 0.439, vare = 3.835, varg = 2.998
##
## iter 200, nnz = 9, sigmaSq = 2.372, h2 = 0.429, vare = 4.095, varg = 3.078
##
## iter 300, nnz = 8, sigmaSq = 3.690, h2 = 0.441, vare = 4.187, varg = 3.300
##
## iter 400, nnz = 10, sigmaSq = 2.293, h2 = 0.505, vare = 3.709, varg = 3.789
##
## iter 500, nnz = 4, sigmaSq = 2.137, h2 = 0.459, vare = 4.279, varg = 3.630
##
## iter 600, nnz = 8, sigmaSq = 3.220, h2 = 0.514, vare = 3.505, varg = 3.713
##
## iter 700, nnz = 4, sigmaSq = 2.091, h2 = 0.509, vare = 3.636, varg = 3.769
##
## iter 800, nnz = 4, sigmaSq = 2.966, h2 = 0.547, vare = 4.081, varg = 4.932
##
## iter 900, nnz = 8, sigmaSq = 2.215, h2 = 0.544, vare = 3.788, varg = 4.527
##
## iter 1000, nnz = 5, sigmaSq = 2.055, h2 = 0.514, vare = 3.689, varg = 3.895
##
## Posterior mean:
##       Pi1       Pi2       Pi3       Pi4       Nnz   SigmaSq        h2      Vare
## 0.2956888 0.2864830 0.2164192 0.2014090 6.8600000 3.9276307 0.5000334 3.9104262
##      Varg
## 3.9242224
beta.r = colMeans(r.res$beta)
Question: Are the BayesR and SBayesR SNP effect estimates the same? What could cause the difference? (Hint: the scaling process assumes a negligible effect size for each individual SNP.)
cor(beta.r, beta.sr)
## [1] 0.9762267
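We can also compare the two methods on prediction accuracy in the validation set. A minimal sketch, assuming (as the direct comparison above suggests) that sbayesr.R returns effects on the same per-allele scale as bayesr.R:
# predict GEBV in the validation set with both sets of estimates
ghat.r = Xval %*% beta.r
ghat.sr = Xval %*% beta.sr
summary(lm(yval ~ ghat.r))$r.squared
summary(lm(yval ~ ghat.sr))$r.squared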
Question: Can you tell which SNPs are likely to be the causal variants based on the posterior inclusion probability (PIP)? PIP is a Bayesian measure of SNP-trait association, calculated as the proportion of MCMC samples in which the SNP is fitted with a non-zero effect.
delta = (sr.res$beta != 0)
pip = colMeans(delta)
plot(pip, type="h", xlab="SNP", ylab="Posterior inclusion probability")
Next, we run SBLUP in a larger simulated data set using GCTA. The command manual can be found at https://yanglab.westlake.edu.cn/software/gcta/#SBLUP.
It requires an LD reference sample with genotypes. Here, the test population (n=500) is used as the LD reference sample.
This step will take about 5 minutes.
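The number given to --cojo-sblup is the shrinkage parameter \(\lambda = m(1/h^2_{SNP} - 1)\), where \(m\) is the number of SNPs and \(h^2_{SNP}\) is the SNP-based heritability. As an illustration with assumed values (not given in this practical), m = 273,604 SNPs and \(h^2_{SNP} = 0.5\) yield the value used below:
# illustrative only: assumed m and h2
m = 273604
h2 = 0.5
m * (1/h2 - 1)  # = 273604, the value passed to --cojo-sblup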
bfile="/data/module5/sim/test"
ma="/data/module5/sim/gwas.ma"
gcta --bfile $bfile \
--cojo-file $ma \
--cojo-sblup 273604 \
--cojo-wind 1000 \
--thread-num 8 \
--out gwas
The SNP results are stored in the file gwas.sblup.cojo. You can use these SBLUP estimates for polygenic prediction; the plink --score command below reads the SNP ID, the effect allele and the SBLUP effect from columns 1, 2 and 4 of this file.
target="/data/module5/sim/target"
sblup="gwas.sblup.cojo"
plink --bfile $target --score $sblup 1 2 4 sum center --out target.sblup
phenFile="/data/module5/sim/target.phen"
covFile="/data/module5/sim/target.cov"
prsFile="target.sblup.profile"
Rscript /data/module5/scr/get_pred_r2.R $phenFile $covFile $prsFile
Question: How does this compare to the BLUP result using individual-level data?
A tutorial for using GCTB to run SBayesR can be found at https://cnsgenomics.com/software/gctb/#Tutorial. GCTB also uses a .ma file to input the GWAS summary data.
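A .ma file is plain text with a header line and one row per SNP; the columns are the SNP ID, the effect allele (A1), the other allele (A2), the A1 frequency, the marginal effect estimate, its standard error, the p-value, and the GWAS sample size. For example (hypothetical rows):
SNP A1 A2 freq b se p N
rs1001 A G 0.8493 0.0024 0.0055 0.6653 350000
rs1002 C G 0.0306 -0.0045 0.0115 0.6953 350000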
Since the computing time and memory consumption of SBayesR are independent of the GWAS sample size and only summary-level data are required, we can run SBayesR with a much larger GWAS data set. Find the GWAS summary data file generated from the simulated data based on the full UK Biobank sample (n=350,000):
ls /data/module5/sim/ukb_matched.ma
DO NOT run the LD matrix computation below in the practical because it can take a while and quite a bit of disk space. Instead, use the matrices that we have generated for you, as shown in the next step.
GCTB provides options to compute different types of LD matrix. In our SBayesR paper (https://doi.org/10.1038/s41467-019-12653-0), we used a shrunk LD matrix, which can be built with the following commands.
# Build shrunk LD matrix for each chromosome
# DO NOT RUN THIS IN CLASS
bfile="/data/module5/sim/gwas"
genmap="/data/module5/sim/interpolated_genetic_map.txt"
# start with an empty list of per-chromosome LD matrix files
> mldm.txt
for i in `seq 1 22`
do
gctb --bfile $bfile \
--chr $i \
--make-shrunk-ldm \
--gen-map $genmap \
--out chr$i
echo "chr$i.ldm.shrunk" >> mldm.txt
done
In practice, if you are using summary statistics from a GWAS of European ancestry, we recommend using the banded matrix or the shrunk sparse matrix, which were computed from 1M HapMap3 SNPs in the UKB and can be downloaded from our website (https://cnsgenomics.com/software/gctb/#Download).
SBayesR can be carried out using the command below. This may take about 5 minutes or more, depending on how busy the cluster is, and about 20GB of RAM. Our cluster may not be able to accommodate this analysis for all students at the same time; you can check the results produced from the same code at /data/module5/sim/sbayesr.parRes and /data/module5/sim/sbayesr.snpRes.
ma="/data/module5/sim/ukb_matched.imputed.ma"
mldm="/data/module5/sim/ldm/mldm.txt"
gctb --sbayes R \
--mldm $mldm \
--gwas-summary $ma \
--gamma 0,0.01,0.1,1 \
--pi 0.994,0.003,0.002,0.001 \
--chain-length 1000 \
--burn-in 200 \
--out sbayesr
The result file sbayesr.parRes shows the posterior estimates of the model parameters, and sbayesr.snpRes shows the estimates of the joint SNP effects.
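To inspect the SNP effect estimates in R, a minimal sketch (assuming the standard GCTB snpRes header; as the plink command below indicates, columns 2, 5 and 8 hold the SNP name, the effect allele and the joint effect estimate):
snpres = read.table("/data/module5/sim/sbayesr.snpRes", header=TRUE)
head(snpres)
# distribution of the joint SNP effect estimates (column 8)
hist(snpres[,8], breaks=50, xlab="Joint SNP effect estimate", main="")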
If interested, you can run SBayesR on chromosome 1 only:
ma="/data/module5/sim/ukb_matched.imputed.ma"
ldm="/data/module5/sim/ldm/chr1.ldm.shrunk"
gctb --sbayes R \
--ldm $ldm \
--gwas-summary $ma \
--gamma 0,0.01,0.1,1 \
--pi 0.994,0.003,0.002,0.001 \
--chain-length 1000 \
--burn-in 200 \
--out sbayesr.chr1
You can run polygenic prediction using the effect estimates from SBayesR. First, copy /data/module5/sim/sbayesr.snpRes, the result from the genome-wide analysis, to your current folder. (You are welcome to try running the genome-wide analysis after class when the server is not busy.) Then, run:
target="/data/module5/sim/target"
snpRes="sbayesr.snpRes"
plink --bfile $target --score $snpRes 2 5 8 header sum center --out target.sbayesr
phenFile="/data/module5/sim/target.phen"
covFile="/data/module5/sim/target.cov"
prsFile="target.sbayesr.profile"
Rscript /data/module5/scr/get_pred_r2.R $phenFile $covFile $prsFile
Question: How many SNPs are fitted with non-zero effects? Is this number consistent with the number of causal variants (100) in the simulation? Is the prediction accuracy better than that of C+PT and SBLUP?
cat sbayesr.parRes
##                  Mean        SD
## NumSnp1 261737.093750  5.634700
## NumSnp2      3.287500  3.892922
## NumSnp3      1.800000  2.619160
## NumSnp4     67.800003  3.779550
## Vg1          0.000000  0.000000
## Vg2          0.000241  0.000383
## Vg3          0.001843  0.003195
## Vg4          0.997916  0.003236
## SigmaSq      2.047411  0.551730
## ResVar      98.869164  0.879301
## GenVar      89.399185  1.834236
## hsq          0.471580  0.009676
NumSnp1-NumSnp4 are the numbers of SNPs in mixture components 1 to 4 (component 1: zero effect; component 2: small effects with variance \(0.01 \times \sigma^2_\beta\); component 3: medium effects with variance \(0.1 \times \sigma^2_\beta\); component 4: large effects with variance \(1 \times \sigma^2_\beta\)). hsq is the SNP-based heritability estimate.
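In other words, with the --gamma and --pi settings used above, SBayesR assumes that each SNP effect follows the four-component mixture prior

\[
\beta_j \sim \pi_1\,\delta_0 + \pi_2\,N(0,\, 0.01\,\sigma^2_\beta) + \pi_3\,N(0,\, 0.1\,\sigma^2_\beta) + \pi_4\,N(0,\, \sigma^2_\beta),
\]

where \(\delta_0\) is a point mass at zero and the mixture proportions \(\pi\) (started at 0.994, 0.003, 0.002, 0.001) are estimated from the data.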
A work by Jian Zeng